Attention Actor-Critic algorithm for Multi-Agent Constrained Co-operative Reinforcement Learning
P. Parnika, Raghuram Bharadwaj Diddigi, Sai Koti Reddy Danda, and Shalabh Bhatnagar
P. Parnika* (Mindtree Ltd., [email protected]), Raghuram Bharadwaj Diddigi* and Shalabh Bhatnagar (Department of Computer Science and Automation, IISc Bangalore, India, {raghub, shalabh}@iisc.ac.in), Sai Koti Reddy Danda* (IBM Research, Bangalore, [email protected])

Abstract
In this work, we consider the problem of computing optimal actions for Reinforcement Learning (RL) agents in a co-operative setting, where the objective is to optimize a common goal. However, in many real-life applications, in addition to optimizing the goal, the agents are required to satisfy certain constraints specified on their actions. Under this setting, the objective of the agents is to not only learn the actions that optimize the common objective but also meet the specified constraints. In recent times, the Actor-Critic algorithm with an attention mechanism has been successfully applied to obtain optimal actions for RL agents in multi-agent environments. In this work, we extend this algorithm to the constrained multi-agent RL setting. The idea here is that optimizing the common goal and satisfying the constraints may require different modes of attention. By incorporating different attention modes, the agents can select useful information required for optimizing the objective and satisfying the constraints separately, thereby yielding better actions. Through experiments on benchmark multi-agent environments, we show the effectiveness of our proposed algorithm.
*Equal contribution by the first three authors. A version of this paper has been accepted for publication as an extended abstract in the Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021).

I. INTRODUCTION
In a multi-agent co-operative RL setting [11], multiple agents work towards a common goal in a common environment. All the agents receive the same cost (or reward) depending on the actions of all the agents, and the objective is to minimize (or maximize) the expected total discounted cost (or reward) [30]. However, in many practical situations, one often encounters constraints that restrict the choice of actions that can be taken by these agents. In the constrained RL setting [3], these constraints can also be specified via certain expected total discounted costs. In such scenarios, the agents have to learn actions that not only minimize the expected total discounted cost but also respect the constraints.

One approach to satisfy the constraints is to construct a modified cost as a linear combination of the original cost and the constraint costs. However, the weights to be associated with the costs are not known upfront and need to be learned in a trial-and-error fashion. This problem becomes compounded when multiple constraints are specified. We alleviate this problem by considering the Lagrangian formulation of the problem and training dual Lagrange parameters that act as weights for the constraint costs.
Table I: Comparison with other works in the literature (References: Features)
[12], [16], [23], [25], [26]: Deep RL algorithms for the multi-agent setting; attention mechanism not considered.
[20], [21], [24]: Deep RL algorithms with attention for the multi-agent setting; constrained setting not considered.
[4], [5], [7]: RL algorithms for the single-agent constrained setting; multi-agent constrained setting not considered.
[1], [22], [32]: Deep RL algorithms for the single-agent constrained setting; multi-agent constrained setting not considered.
[2], [9], [14], [15], [17], [27], [29], [34]: RL algorithms for the multi-agent constrained setting; attention mechanism not considered.
Our work: Deep RL algorithm with an attention mechanism for the multi-agent constrained setting.
Single-agent RL algorithms for the constrained RL setting have been proposed under various cost criteria, such as the average cost in [5], [7] and discounted costs in [1], [4], [22], [32]. Constraints in a multi-agent setting can appear in multiple ways. Under budget constraints [9], every joint policy is associated with a cost, and the objective of the agents is to compute a joint optimal policy that maximizes the value while respecting the budget constraints. Under resource/task constraints [2], [15], [17], the optimal policy is the one that not only maximizes the value but also optimally allocates the resources to the agents. Under safety constraints [27], [29], [34], each policy is associated with a safety value, and the objective of the agents is to compute optimal policies that meet the safety constraints. Finally, in [14], similar to the model we consider in this paper, the constraints are specified as expected discounted costs that are required to be less than a prescribed threshold value.

Figure 1: Evolution of paradigms in the literature.

Actor-Critic algorithms [30] are a popular class of RL algorithms that are used by an agent to obtain an optimal policy. In this paradigm, the ‘Actor’ computes the policy and the ‘Critic’ provides feedback on the policy computed by the ‘Actor’. Based on this feedback, the ‘Actor’ improves the policy, and this process is repeated until an optimal policy is obtained. The Actor-Critic paradigm can be extended to multi-agent settings in three ways [26]. First, all the agents can independently (without co-operation or communication) run the Actor-Critic algorithm. This setting is known as ‘Independent Learners’ [31], and it suffers from the problem of non-stationarity [11]. Another setting, known as ‘Joint Action Learners’, assumes the existence of a central controller that computes the optimal policy of all the agents and communicates the actions to them. This setting suffers from scalability problems, as the state and action spaces of the central controller grow exponentially with the number of agents. Finally, a paradigm that mitigates both the scalability and non-stationarity problems, known as ‘centralized learning and decentralized execution’, has become popular in recent times [12], [16], [23], [25]. The main idea here is to use a centralized critic during training and decentralized actors that learn actions independently. In these algorithms, however, the information of all the agents is given equal importance (or weight) while computing the optimal policy.

The attention mechanism allows an agent to selectively pay attention to those agents whose information is crucial in the computation of its policy. In [20], attention actor-critic algorithms have been proposed that make use of the attention mechanism in learning the ‘critic’. In [24], the attention mechanism has been used to model the policies of teammates. An attention mechanism for learning communication among the agents has been proposed in [21], where the authors propose an attentional communication model, ATOC, that provides an effective mechanism for communication among agents, resulting in better decision making. We illustrate the advantages of the attention mechanism through the following example. Consider a Smart Grid setup [28] where a network of microgrids is employed, whose objective is to provide power to their dedicated customers.
These microgrids are equipped with renewable energy generation sources (such as solar panels and wind turbines) and a limited storage battery to store their power. At every time instant, each microgrid has to make intelligent decisions such as the number of units of power to store in its battery and the number of units of power to buy from (or sell to) other microgrids so as to maximize its profits. For any given microgrid, the state information of its neighboring microgrids is more important than that of microgrids far away from it. Through the attention mechanism, a microgrid can dynamically select these neighboring microgrids, instead of attending to all the microgrids equally. This results in better decision making and hence better profits for the microgrids.

We believe that the attention mechanism is particularly important in the constrained multi-agent setting. We explain its importance through the following two examples.
1) Consider a warehouse where multiple robots are deployed. The objective of the robots is to pick up goods from a target position with a constraint that the expected number of collisions among the robots is less than a predefined threshold. The important information for a robot collecting goods is the position of the goods, whereas the relative distance between the robots is the crucial information for avoiding collisions. Hence, having separate attention mechanisms for learning optimal actions to collect the goods and to avoid collisions is natural in this setting.
2) In the smart grid setting considered above, suppose we impose a constraint that the expected demand-supply deficit should be maintained at a certain level (to ensure the stability of the grid). The relevant information for maximizing the profits and for maintaining stability can be different for a microgrid. The attention mechanism enables the microgrid to attend to the relevant information for these two tasks separately.

Moreover, the attention mechanism for the multi-agent constrained setting finds applications in numerous settings such as self-driving cars [29], to ensure safety constraints, and supply chain optimization [18], to ensure resource constraints. In our work, we propose an attention-based Actor-Critic algorithm for solving the problem of multi-agent constrained Reinforcement Learning (RL). While the attention mechanism and constrained RL settings have been studied extensively in the literature, the use of two separate attention mechanisms for computing the policy and satisfying the constraints has not been considered previously. We believe that this architecture is important because optimizing the common goal and satisfying the constraints require different modes of attention. We show through our analysis of attention weights that using multiple attentive critics can be beneficial and yield much better results on complicated real-world applications. A brief comparison of our work with other works in the literature is provided in Table I, and the evolution diagram of these paradigms/themes is shown in Figure 1. The main contributions of our work are the following:
• We propose an Actor-Critic algorithm that makes use of the attention mechanism for computing the optimal actions of agents in a constrained co-operative multi-agent setting.
• We analyze and discuss the performance of our algorithm on constrained versions of standard multi-agent RL environments.
• We provide a detailed analysis of the attention mechanism learned by the agents in our experiments (Section IV-B3).
The rest of the paper is organized as follows.
In Section II, we describe the multi-agent constrained co-operative setting considered in the paper. In Section III, we propose our multi-agent attention mechanism-based constrained Actor-Critic algorithm. In Sections IV and V, we present the performance of our algorithm on multi-agent environments and discuss the results. Concluding remarks are given in Section VI.

II. MODEL
We now discuss the constrained co-operative multi-agent setting described in [13]. It is described by the tuple ⟨n, S, A, T, k, c_1, ..., c_m, γ⟩. Here, n denotes the number of agents in the environment. S = S_1 × S_2 × ... × S_n is the joint state space and s = (s_1, ..., s_n) ∈ S is the joint state, with s_i ∈ S_i being the state of agent i. Similarly, A = A_1 × ... × A_n denotes the joint action space, where a = (a_1, ..., a_n) ∈ A is the joint action and a_i ∈ A_i is the action of agent i. Each agent only observes its own state and chooses its action based on it. Let T be the probability transition matrix, where T(s' | s, a) denotes the probability of the next state being s' when joint action a is taken in joint state s. The single-stage cost function k(s, a) is the cost incurred when joint action a is taken in state s. Moreover, c_1, ..., c_m denote the single-stage cost functions for the constraints. Note that both the main cost function k and the constraint costs c_1, ..., c_m depend on the joint action of the agents. Finally, γ denotes the discount factor. Let π_i : S_i → Δ(A_i) denote the policy of agent i, where for a given state of agent i, π_i(s_i) is a probability distribution over its actions. We now define the total discounted cost J for a joint policy π = (π_1, ..., π_n) as follows:
$$J(\pi) = E\Big[\sum_{t=0}^{\tau} \gamma^t k(s_t, \pi(s_t))\Big], \qquad (1)$$
where E(·) denotes the expectation over the entire trajectory of states with initial state s_0 ∼ d, d is a probability distribution over states, τ is a finite stopping time and s_t is the joint state at time t. The m constraints on the system are defined as follows:
$$E\Big[\sum_{t=0}^{\tau} \gamma^t c_j(s_t, \pi(s_t))\Big] \leq \alpha_j, \quad \forall j \in \{1, \ldots, m\}, \qquad (2)$$
where α_1, ..., α_m are pre-specified thresholds. The objective of the agents in the multi-agent constrained co-operative RL setting is to compute a joint policy π* = (π*_1, ..., π*_n) that solves
$$\min_{\pi \in \Pi} J(\pi) = E\Big[\sum_{t=0}^{\tau} \gamma^t k(s_t, \pi(s_t))\Big] \quad \text{s.t.} \quad E\Big[\sum_{t=0}^{\tau} \gamma^t c_j(s_t, \pi(s_t))\Big] \leq \alpha_j, \ \forall j \in \{1, \ldots, m\}, \qquad (3)$$
where Π is the set of all joint policies. The constrained problem (3) can be relaxed using the Lagrangian formulation [4], [7] as follows:
$$L(\pi, \lambda) = E\Big[\sum_{t=0}^{\tau} \gamma^t \Big(k(s_t, \pi(s_t)) + \sum_{j=1}^{m} \lambda_j c_j(s_t, \pi(s_t))\Big)\Big] - \sum_{j=1}^{m} \lambda_j \alpha_j, \qquad (4)$$
where λ = (λ_1, ..., λ_m) is the vector of Lagrange parameters associated with the m constraints. From the theory of duality in optimisation (Chapter 5 of [10]), the optimal policy π* and Lagrange parameters λ* satisfy
$$L(\pi^*, \lambda^*) = \max_{\lambda \geq 0} \min_{\pi \in \Pi} L(\pi, \lambda). \qquad (5)$$
The theory of two time-scale stochastic approximation [6] allows us to iteratively learn the Lagrange parameters and the policy. The main idea is to perform gradient descent on the objective (4) in the space of policy parameters on the faster timescale and gradient ascent on (4) in the Lagrange parameters on the slower timescale [8]. The complete details of how we achieve this are described in the next section.
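To make the two-timescale idea concrete, the coupled updates can be sketched as follows. The step-size sequences a_t, b_t, the gradient-estimate notation and the projection operator Γ are illustrative notation on our part; the concrete updates used by our algorithm are given in Section III.
$$\theta_{t+1} = \theta_t - a_t \, \widehat{\nabla}_{\theta} L(\pi_{\theta_t}, \lambda_t), \qquad \lambda_{t+1} = \Gamma\big[\lambda_t + b_t \, \widehat{\nabla}_{\lambda} L(\pi_{\theta_t}, \lambda_t)\big], \qquad \text{with } b_t / a_t \to 0,$$
where Γ[·] projects each Lagrange parameter onto [0, ∞) and, from (4), the j-th component of the ascent direction is the estimated constraint violation E[Σ_t γ^t c_j(s_t, π(s_t))] − α_j.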
III. PROPOSED ALGORITHM

Attention mechanism [24], [33]:
In general, the attention mechanism works as follows. It takes source vectors v = (v_1, ..., v_n) and a target vector T as inputs and outputs a context vector C. The attention mechanism first computes attention weights (w_1, ..., w_n), where w_i, 1 ≤ i ≤ n, represents the importance of v_i. The attention weight w_i is computed from a given function f(T, v) as follows:
$$w_i = \mathrm{softmax}(f(T, v)) = \frac{\exp(f(T, v_i))}{\sum_{j=1}^{n} \exp(f(T, v_j))}. \qquad (6)$$
Finally, the context vector is computed as:
$$C = \sum_{j=1}^{n} w_j v_j. \qquad (7)$$
As can be seen from (6), Σ_{j=1}^{n} w_j = 1; the attention mechanism can therefore be thought of as a computation that adaptively learns a distribution over the input vectors that accurately represents the context of the problem.

We extend the attention mechanism proposed in the context of the multi-agent RL setting [20] to the constrained setting. The details of the proposed attention mechanism are as follows. Each agent i maintains a total of m + 1 critics which use attention. We denote these as the cost critic Q_ψ (associated with the main cost function) and m penalty critics Q_{η_1}, ..., Q_{η_m} (associated with the m constraints). The intuition here is that, by having multiple critics with different attentions, each critic is able to attend specifically to the information that is crucial for its own objective. The way this information is utilized for attention is by encoding the state and state-action information of all agents, where the embedding function is a single-layer perceptron. These encodings are passed to another embedding function, also a single-layer perceptron, to create keys (K), values (V), and selectors/queries (q) [20]. The keys K_j and values V_j represent state-action encodings of all agents j ≠ i, while the query q_i is the state encoding of agent i. The attention weights w_j are then computed as a function of the queries and keys as follows:
$$w_j = \mathrm{softmax}\Big(\frac{q_i K_j^T}{\sqrt{d_k}}\Big), \qquad (8)$$
where d_k is the size of the keys. Finally, the critic Q of agent i (denoted Q_i) is obtained as follows (for notational convenience, we drop the subscripts from the critics of agent i, as all m + 1 critics use a similar architecture):
$$Q_i = f_i(g_i(o_i, a_i), x_i), \qquad (9)$$
where f_i is a two-layer perceptron, g_i is an embedding function for agent i, and x_i = Σ_{j ≠ i} w_j V_j is the contribution of the other agents to agent i.
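For concreteness, a minimal PyTorch-style sketch of one such attention critic is given below; it only illustrates equations (8) and (9). The class and parameter names (AttentionCritic, embed_dim, n_actions), the single linear layers used for the key/value/query encoders and the discrete-action output are illustrative assumptions on our part and do not reproduce the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, embed_dim=64, n_actions=5):
        super().__init__()
        # g_i: embedding of agent i's own (observation, action)
        self.g = nn.Linear(obs_dim + act_dim, embed_dim)
        # Encoders producing keys/values from other agents' (o, a)
        # and the query from agent i's observation
        self.key = nn.Linear(obs_dim + act_dim, embed_dim, bias=False)
        self.value = nn.Linear(obs_dim + act_dim, embed_dim, bias=False)
        self.query = nn.Linear(obs_dim, embed_dim, bias=False)
        # f_i: two-layer perceptron producing the Q-values
        self.f = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, n_actions))

    def forward(self, obs_i, act_i, obs_others, act_others):
        # obs_others, act_others: lists of tensors for the agents j != i
        q_i = self.query(obs_i)                                   # (B, d_k)
        keys = torch.stack([self.key(torch.cat([o, a], -1))
                            for o, a in zip(obs_others, act_others)], 1)
        vals = torch.stack([self.value(torch.cat([o, a], -1))
                            for o, a in zip(obs_others, act_others)], 1)
        d_k = keys.shape[-1]
        # Eq. (8): attention weights from the scaled dot-product of query and keys
        w = F.softmax((keys @ q_i.unsqueeze(-1)).squeeze(-1) / d_k ** 0.5, dim=-1)
        # x_i = sum_j w_j V_j : weighted contribution of the other agents
        x_i = (w.unsqueeze(-1) * vals).sum(dim=1)
        # Eq. (9): Q_i = f_i(g_i(o_i, a_i), x_i)
        own = self.g(torch.cat([obs_i, act_i], -1))
        return self.f(torch.cat([own, x_i], -1))

In MACAAC, each agent i instantiates m + 1 such critics (one Lagrangian cost critic and m penalty critics), each with its own attention weights over the other agents.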
We now discuss our proposed algorithm ‘MACAAC’ (Algorithm 1). We train the algorithm in µ parallel environments to improve the sample efficiency and reduce the variance of the updates. At each time step of an episode, every agent samples an action from its current policy based on its observation o_i and obtains the common single-stage cost k, the single-stage penalties c_1, ..., c_m, and the next state (see Algorithm 1). The Lagrangian cost is calculated as r = k + Σ_{j=1}^{m} λ_j c_j, and this information is then stored in the replay buffer D. The ‘Critic’ and ‘Actor’ parameters are updated after every U steps, as follows. First, a minibatch B is sampled independently from the replay buffer. For each sample from the minibatch B, the critic parameters are updated by performing gradient descent on the MSE loss given by [20]:
$$\sum_{i=1}^{n} E\big[(Q_i(o, a) - y_i)^2\big], \qquad (10)$$
where y_i = r + γ E[Q_i(o*, a*) − α log(π_θ(a*_i | o*_i))], o*, a* are the joint next state and actions of the agents, and α is the temperature coefficient [20] used to control the stochastic nature of the policy. The parameters ψ of the cost critic are updated by performing gradient descent on (10) with r defined as the Lagrangian cost above, and the parameters η_j of penalty critic j are updated by performing gradient descent on (10) with r replaced by c_j (the constraint cost). Moreover, all the agents share the same critic parameters (ψ, η_1, ..., η_m).

The ‘UpdateActors’ step is performed as follows. The policy parameters θ_i of each agent are updated by performing gradient descent using the gradient given by [20]:
$$E\big[\nabla_{\theta_i} \log(\pi_{\theta_i}(a_i \mid o_i))\big(-\alpha \log(\pi_{\theta_i}(a_i \mid o_i)) + Q_i^{\psi}(o, a) - b(o, a_{-i})\big)\big], \qquad (11)$$
where b is a baseline function that is independent of the actions of agent i (a_{-i} denotes the actions of all agents except i). Note that in equations (10) and (11), an entropy term is added that encourages stochastic policies [19].

Finally, the Lagrange parameters λ_j, j = 1, ..., m, are updated by performing gradient ascent on the Lagrangian L (eq. (4)), as shown in the last step of Algorithm 1. An important point to note here is that the critic and actor updates are performed on a faster time-scale compared to the Lagrange parameter updates. As a result, the critic and actor perceive the Lagrange parameters as constants in their updates, thereby ensuring the convergence of the algorithm [8, Chapter 6].
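The updates above can be summarized in a few lines of illustrative code; the function names and signatures below are our own shorthand for eq. (10) and the Lagrange-parameter ascent step, and are not taken from the released code.

import torch.nn.functional as F

def critic_loss(q, next_q, next_log_pi, r, gamma, temp):
    # Eq. (10) for one critic of one agent over a minibatch of transitions.
    # r is the Lagrangian cost for the cost critic, or c_j for penalty critic j.
    target = r + gamma * (next_q - temp * next_log_pi)
    return F.mse_loss(q, target.detach())

def update_lagrange(lmbda, q_penalty, thresholds, beta):
    # Slower-timescale projected gradient ascent on the dual variables:
    # lambda_j <- max(0, lambda_j + beta * (Q_{eta_j} - alpha_j)).
    return [max(0.0, l + beta * (q - a))
            for l, q, a in zip(lmbda, q_penalty, thresholds)]

Because the dual step size beta is chosen much smaller than the actor and critic learning rates, the actors and critics effectively see a fixed λ between dual updates, which is exactly the two-timescale argument above.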
IV. EXPERIMENTS AND RESULTS

In this section, we describe the performance of our proposed algorithm ‘MACAAC’ on two multi-agent environments and analyze the results. The first environment is the constrained version of Cooperative Navigation [23], followed by the constrained version of Cooperative Treasure Collection [20]. The constraint considered in the experiments is on collisions between the agents: the agents incur a penalty whenever there is a collision, and their objective is to make sure that the expected total penalty is less than a prescribed penalty threshold. To avoid confusion, we refer to the main cost that the agents are minimizing as the ‘cost’ and the constrained cost as the ‘penalty’. For comparison purposes, we also implement a constrained version of the MADDPG [23] algorithm, which we refer to as ‘MADDPG-C’. Moreover, to better analyze the results, we also report results on an unconstrained version of Multi-agent Attention Actor-Critic [20], where there is no penalty incurred for collisions among the agents, which we simply refer to as ‘Unconstrained’. Finally, in Section V, we evaluate the performance of ‘MACAAC’ with fixed weights. The neural network architecture and hyper-parameters are kept the same for all three algorithms.

A. Constrained Cooperative Navigation

1) Description of the experiment:
In this experiment, agents and targets are randomly generated in a continuous environment at the beginning of each episode, as shown in Figure 2. The objective of the agents is to navigate towards the targets in a co-operative manner such that all the targets are covered. (The source code of our experiments is available at: https://github.com/parnika31/MACAAC.)
Algorithm 1: Multi-Agent Constrained Attention Actor-Critic (MACAAC)

Notation: E is the maximum number of episodes; L is the length of an episode; U is the number of steps per update; θ_i are the policy parameters of agent i, i = 1, ..., n; UpdateCritic is the subroutine that updates the critic parameters; UpdateActors is the subroutine that updates the policy parameters of all the agents; Q_{η_j} is the Q-value of the constrained cost associated with constraint j, j = 1, ..., m; β_t is the slower-timescale step-size at time step t.

Initialize the Lagrange parameters λ_1, ..., λ_m. Create µ parallel environments. Initialise the replay buffer D and set u ← 0.
for ep = 1, 2, ..., E do
    Obtain initial observations o_i^e for all agents i in each environment e
    for t = 1, 2, ..., L do
        Obtain actions a_i^e ∼ π_{θ_i}(· | o_i^e), for all i = 1, ..., n and e = 1, ..., µ
        Execute the actions and get (o_i^{*,e}, k^e, c_1^e, c_2^e, ..., c_m^e) for all i, e
        Let r^e = k^e + Σ_{j=1}^{m} λ_j c_j^e, for all e
        Store (o_i^e, a_i^e, r^e, c_1^e, c_2^e, ..., c_m^e, o_i^{*,e}) for all i, e in D
        o_i^e ← o_i^{*,e} for all i, e; u ← u + µ
        if (u mod U) < µ then
            Sample a minibatch B from D
            Get next actions a'_1, ..., a'_n
            UpdateCritic(B, a'_1, ..., a'_n)
            UpdateActors(B)
            for j = 1, ..., m do
                λ_j ← max(0, λ_j + β_t (Q_{η_j} − α_j))

Figure 2: Constrained Cooperative Navigation. The large blue balls are ‘agents’, whose objective is to navigate towards the small black balls, the ‘targets’, without collisions.

Each episode runs for a fixed number of time steps, and the single-stage cost at each time step is the sum, over all the targets, of the distance from each target to its nearest agent. Therefore, the agents have to learn to navigate towards the targets in such a way that all target positions are covered. However, we include a single-stage penalty whenever there is a collision between the agents (and no penalty otherwise). The penalty threshold (α) is set to 3 in our experiments, i.e., the expected total penalty over an episode must be less than or equal to 3. In Figure 4, we show the performance of the algorithms during the training phase, and in Table II, we report the performance of the algorithms during testing.
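For illustration only, the single-stage cost and penalty described above can be sketched as follows; the 2-D positions, the NumPy interface and the collision_radius constant are our assumptions and are not part of the benchmark specification.

import numpy as np

def navigation_step_costs(agent_pos, target_pos, collision_radius=0.1):
    # Shared single-stage cost: for every target, the distance to the
    # nearest agent, summed over all targets (smaller means better coverage).
    dists = np.linalg.norm(target_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    cost = dists.min(axis=1).sum()
    # Single-stage penalty: number of colliding agent pairs, where a
    # collision is flagged when two agents come within collision_radius.
    pair = np.linalg.norm(agent_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    n = len(agent_pos)
    collisions = ((pair < collision_radius) & ~np.eye(n, dtype=bool)).sum() // 2
    return cost, collisions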
2) Discussion:
• In Figure 4a, we observe that the total cost approaches convergence for all three algorithms. The ‘Unconstrained’ algorithm achieves the smallest average cost, as there is no penalty for collisions in this case. Therefore, the agents can move freely in the continuous space and navigate quickly towards the targets. This can also be observed in Figure 4b, where we see that the average penalty of the ‘Unconstrained’ algorithm is the highest.
• In Figure 4b, we see that the average penalty comes down as the training progresses for the constrained algorithms (MADDPG-C and our proposed MACAAC), while for the ‘Unconstrained’ algorithm it remains almost constant. This is the effect of the Lagrange parameters that are learnt in the constrained setting.
• From Table II, we can see that both our proposed algorithm ‘MACAAC’ and ‘MADDPG-C’ satisfy the penalty constraint. However, our algorithm ‘MACAAC’ achieves this average penalty at a lower cost than the ‘MADDPG-C’ algorithm.

Figure 3: Constrained Cooperative Treasure Collection. The big blue and brown balls are ‘depositors’. The dark orange and blue colored balls are ‘treasures’, which are re-spawned after every capture. The rest of the balls are ‘collectors’. In this figure, the collector agents have changed color after capturing the treasures and are moving towards the depositors (or banks) of the same color to deposit them without collisions.
B. Constrained Cooperative Treasure Collection

1) Description of the experiment:
In this experiment, we have a total of 8 agents, out of which 6 are ‘collectors’ (Agents 1, ..., 6) and the other two are ‘depositors (or banks)’ (Agents 7 and 8). The role of the collectors is to collect the ‘treasures’ that are randomly generated in the environment and deposit them into the ‘banks’ of the same color as the treasure. New treasures are re-generated once the existing treasures are collected. The role of the ‘depositors’ is to stay close to the collectors carrying their treasures. The length of each episode is 100 time steps, and all agents receive a shared single-stage cost associated with the total distances from their goals. Moreover, a negative cost (a positive reinforcement) is added every time a treasure is collected and deposited; this is why the costs in Table III are negative, as the agents learn to collect and deposit treasures. We consider two penalty constraints in this experiment, for the collectors and the depositors separately, to demonstrate the effect of the attention weights (discussed in Section IV-B3). The penalty threshold for the collectors (α_1) is set to 12, and a separate threshold (α_2) is set for the depositors.

Figure 4: Performance of the algorithms on Constrained Cooperative Navigation during training: (a) expected total cost, (b) expected total penalty. The average total cost and penalty at each episode i are calculated by taking the mean of the total costs and total penalties over 1024 runs, using the policies trained until episode i.

Name of the Algorithm | Average total cost | Average total penalty
MACAAC | 45.79 | 1.87
MADDPG-C | 60.33 | 2.52
Unconstrained | 37.50 | 7.02
MACAAC with Fixed Weights | 38.73 | 1.25

Table II: Performance comparison of the algorithms in the testing phase on Constrained Cooperative Navigation with penalty threshold α = 3. The average total cost and penalty are calculated by taking the mean of the total costs and penalties, respectively, over the test runs, using the policies of the agents obtained at the end of training.
Figure 5: Performance of the algorithms on Constrained Cooperative Treasure Collection during training: (a) expected total cost, (b) expected total penalty of the collectors, (c) expected total penalty of the depositors.
Name of the Algorithm | Average total cost | Average total penalty of collectors | Average total penalty of depositors
MACAAC | -76.21 | 4.70 | 0.15
MADDPG-C | -22.20 | 7.59 | 0.76
Unconstrained | -88.41 | 13.99 | 0.35
MACAAC with Fixed Weights | -54.68 | 2.63 | 0.18

Table III: Performance comparison of the algorithms in the testing phase on Constrained Cooperative Treasure Collection, with the penalty threshold for the collectors (α_1) set to 12 and a separate threshold (α_2) set for the depositors.
2) Discussion:
• As in our previous experiment, we observe from Figure 5a that the average total cost converges for all three algorithms.
• From Table III, we can see that our proposed ‘MACAAC’ satisfies the penalty constraints of both the ‘collectors’ and the ‘depositors’. Moreover, the average cost obtained by ‘MACAAC’ is lower than that of the ‘MADDPG-C’ algorithm. As the agents do not incur a penalty in the ‘Unconstrained’ setting, its average cost is the least among the three algorithms.
3) Discussion of Attention graphs:
We now discuss the attention weights learned by the agents during the training. We present the attention weights learnt by agent 1, which is a ‘collector’, in Figure 6, and by agent 8, which is a ‘depositor’, in Figure 7. Recall that there are six collectors (agents 1-6) and two depositors (agents 7 and 8). In this experiment, there are three critics that use attention. Two penalty critics, which we refer to as the penalty 1 and penalty 2 critics, compute the expected penalty costs of the collectors and the depositors, respectively. The feedback from these critics is used in improving the Lagrange parameters (the λ update in Algorithm 1). Then, there is the main critic, whose feedback is used to improve the policy parameters (the UpdateActors step in Algorithm 1). Note that the penalty critics make use of only the penalty costs, whereas the main critic (the Lagrangian critic) makes use of the Lagrangian cost, which involves both the main cost and the penalty costs.

Figure 6: Attention weights of Agent 1, indicating the attention weights assigned to the other agents by Agent 1: (a) attention weights of the Lagrangian critic, (b) attention weights of the penalty 1 critic.

Figure 7: Attention weights of Agent 8, indicating the attention weights assigned to the other agents by Agent 8: (a) attention weights of the Lagrangian critic, (b) attention weights of the penalty 2 critic.

In Figure 6a, we observe that the Lagrangian critic of agent 1 focuses more on the depositors throughout its training, i.e., higher attention weights are assigned to agents 7 and 8. Agent 1 is required to deposit its collected treasures with the depositors in order to minimize its cost, and hence the Lagrangian critic attends more to the information of the depositors. The attention graph of the penalty 1 critic of agent 1 in Figure 6b is very interesting. At the beginning of training, agent 1, in order to avoid collisions, focuses more on the information of the other collectors and less on the depositors. However, as the training progresses, all the agents learn to move towards the depositors to deposit their treasures. Hence, the information of the depositors becomes very relevant for agent 1 to avoid collisions. This can also be confirmed from Figure 5b, where the collectors' constraint (α_1 = 12) is seen to be satisfied towards the end of training. Therefore, we observe that, towards the end of the training, agent 1 attends to the information of all the other agents equally. In this way, the attention mechanism enables the agents to dynamically select the relevant information during training.

In Figure 7, we report the attention graphs of agent 8, which is a depositor. In Figure 7a, we observe that agent 8 attends more to the information of the other depositor, i.e., agent 7. We have seen earlier that the collectors attend more to the information of the depositors in order to deposit their treasures. Moreover, the Lagrangian cost is a combination of the main cost and the penalty costs. Therefore, agent 8 has to move in directions that do not overlap with agent 7, thereby providing the collectors enough space to safely (avoiding collisions) deposit their treasures. Finally, in Figure 7b, we see that the penalty critic of agent 8 attends to the information of all the agents uniformly throughout its training. Similar to the penalty 1 critic of agent 1, the information of all the agents is equally important for agent 8 to avoid collisions.

In this way, our proposed algorithm provides a framework for multiple agents to learn suitable attentions for various sub-tasks. The advantage of this paradigm can be seen from our results, where our proposed algorithm ‘MACAAC’ performs well while satisfying the specified penalty constraints.
V. EFFECT OF FIXED WEIGHTS FOR CONSTRAINED COSTS

As discussed in the introduction, a constrained problem could be solved by adding the constraint costs to the main cost. However, the weights to be associated with the constraint costs so that the specified constraints are satisfied are not known. Therefore, in our proposed algorithm, the Lagrange parameters, which act as weights for the constraint costs, are iteratively learnt on a slower time-scale. In this section, we investigate the effect of using constant, fixed weights for the constraint costs during training. That is, we construct a cost function as k + Σ_{j=1}^{m} w_j c_j, where k is the main cost function, c_j, 1 ≤ j ≤ m, are the m constraint costs and w_j is the weight associated with constraint j. The weights we use in these experiments are the converged Lagrange parameters from the "MACAAC" algorithm. We call this experiment "MACAAC with Fixed Weights".

In "Constrained Cooperative Navigation", there is one constraint, and the weight w assigned to this constraint is the corresponding converged Lagrange parameter. We observe from Table II that this value of w satisfies the penalty constraint α = 3. Moreover, the average cost obtained is slightly less than that of the standard "MACAAC".

In Table III, we report the results of "MACAAC with Fixed Weights" for the "Constrained Treasure Collection" experiment, where fixed weights w_1 and w_2 (again the converged Lagrange parameters) are assigned to the two penalty constraints. Similar to our earlier experiment, we observe that these weights satisfy the penalty constraints of both the collectors and the depositors. However, the average cost obtained is higher than that of the standard "MACAAC" algorithm, as these fixed weights may be too restrictive in this experiment. In contrast, our proposed algorithm adaptively trains (increases or decreases) the Lagrange parameters during training, leading to a better policy.

From this study, we conclude the following:
1) Our proposed "MACAAC" adaptively computes Lagrange parameters that satisfy the penalty constraints.
2) Our proposed "MACAAC" algorithm computes a near-optimal solution using a two-timescale approach, where the policy is updated on a faster timescale and the Lagrange parameters are updated on a slower timescale.
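To spell out the distinction, the only change in this variant is the single-stage cost handed to the cost critic; the helper below (our own naming, shown only as a sketch) uses fixed weights w_j in place of the learned Lagrange parameters λ_j.

def combined_cost(k, c, weights):
    # k: main single-stage cost; c: list of constraint costs c_1..c_m;
    # weights: fixed w_j in "MACAAC with Fixed Weights", or the current
    # Lagrange parameters lambda_j in the standard MACAAC update.
    return k + sum(w * cj for w, cj in zip(weights, c))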
VI. CONCLUSIONS

We have considered a constrained multi-agent RL setting where the agents need to learn optimal actions that satisfy the constraints specified on their actions. We have proposed an attention mechanism based constrained Actor-Critic algorithm that computes the Lagrange parameters on a slower time-scale and the optimal policy on a faster time-scale. The attention mechanism enables the agents to select the relevant information, during training, for computing the policy and for satisfying the constraints. Through experiments on two benchmark multi-agent settings, we have shown that our proposed algorithm computes a near-optimal solution satisfying the penalty constraints.

VII. ACKNOWLEDGEMENTS
Raghuram Bharadwaj was supported by a fellowship grant from the Centre for Networked Intelligence (a Cisco CSR initiative) of the Indian Institute of Science, Bangalore. This work was supported by the Robert Bosch Centre for Cyber-Physical Systems, Indian Institute of Science, and a grant from the Department of Science and Technology, India. S. Bhatnagar was also supported by the J. C. Bose Fellowship.
REFERENCES

[1] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. arXiv preprint arXiv:1705.10528, 2017.
[2] Pritee Agrawal, Pradeep Varakantham, and William Yeoh. Scalable greedy algorithms for task/resource constrained multi-agent stochastic planning. 2016.
[3] Eitan Altman, Konstantin Avrachenkov, Richard Marquez, and Gregory Miller. Zero-sum constrained stochastic games with independent state processes. Mathematical Methods of Operations Research, 62(3):375–386, 2005.
[4] Shalabh Bhatnagar. An actor–critic algorithm with function approximation for discounted cost constrained Markov decision processes. Systems & Control Letters, 59(12):760–766, 2010.
[5] Shalabh Bhatnagar and K. Lakshmanan. An online actor–critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications, 153(3):688–708, 2012.
[6] Vivek S. Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294, 1997.
[7] Vivek S. Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213, 2005.
[8] Vivek S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Springer, 2009.
[9] Craig Boutilier and Tyler Lu. Budget allocation using weakly coupled, constrained Markov decision processes. 2016.
[10] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[11] Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2), 2008.
[12] Gang Chen. A new framework for multi-agent reinforcement learning – centralized training and exploration with decentralized execution via policy distillation. arXiv preprint arXiv:1910.09152, 2019.
[13] Raghuram Bharadwaj Diddigi, Sai Koti Reddy Danda, Prabuchandran K. J., and Shalabh Bhatnagar. Actor-critic algorithms for constrained multi-agent reinforcement learning. arXiv preprint arXiv:1905.02907, 2019.
[14] Raghuram Bharadwaj Diddigi, D. Reddy, Prabuchandran K. J., and Shalabh Bhatnagar. Actor-critic algorithms for constrained multi-agent reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 1931–1933. International Foundation for Autonomous Agents and Multiagent Systems, 2019.
[15] Dmitri A. Dolgov and Edmund H. Durfee. Resource allocation among agents with MDP-induced preferences. arXiv preprint arXiv:1110.2767, 2011.
[16] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017.
[17] Michael Fowler, Pratap Tokekar, T. Charles Clancy, and Ryan K. Williams. Constrained-action POMDPs for multi-agent intelligent knowledge distribution. Pages 1–8. IEEE, 2018.
[18] Ilaria Giannoccaro and Pierpaolo Pontrandolfo. Inventory management in supply chains: a reinforcement learning approach. International Journal of Production Economics, 78(2):153–161, 2002.
[19] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
[20] Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2961–2970, Long Beach, California, USA, 2019. PMLR.
[21] Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. In Advances in Neural Information Processing Systems, pages 7254–7264, 2018.
[22] Qingkai Liang, Fanyu Que, and Eytan Modiano. Accelerated primal-dual policy optimization for safe reinforcement learning. arXiv preprint arXiv:1802.06480, 2018.
[23] Ryan Lowe, Yi I. Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.
[24] Hangyu Mao, Zhengchao Zhang, Zhen Xiao, and Zhibo Gong. Modelling the dynamic joint policy of teammates with attention multi-agent DDPG. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 1108–1116, 2019.
[25] Thanh Thi Nguyen, Ngoc Duy Nguyen, and Saeid Nahavandi. Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications. IEEE Transactions on Cybernetics, 2020.
[26] Afshin OroojlooyJadid and Davood Hajinezhad. A review of cooperative multi-agent deep reinforcement learning. arXiv preprint arXiv:1908.03963, 2019.
[27] D. Sai Koti Reddy, Amrita Saha, Srikanth G. Tamilselvam, Priyanka Agrawal, and Pankaj Dayama. Risk averse reinforcement learning for mixed multi-agent environments. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 2171–2173, 2019.
[28] Walid Saad, Zhu Han, H. Vincent Poor, and Tamer Basar. Game-theoretic methods for the smart grid: An overview of microgrid systems, demand-side management, and smart grid communications. IEEE Signal Processing Magazine, 29(5):86–105, 2012.
[29] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
[30] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[31] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE, 12(4):e0172395, 2017.
[32] Chen Tessler, Daniel J. Mankowitz, and Shie Mannor. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074, 2018.
[33] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
[34] Ruohan Zhang, Yue Yu, Mahmoud El Chamie, Behçet Açikmese, and Dana H. Ballard. Decision-making policies for heterogeneous autonomous multi-agent systems with safety constraints.