Actor-Critic Algorithms for Constrained Multi-agent Reinforcement Learning
Raghuram Bharadwaj Diddigi*, D. Sai Koti Reddy*, Prabuchandran K.J.* and Shalabh Bhatnagar
Department of Computer Science and Automation, IISc Bangalore, India
IBM Research, Bangalore, India
*Equal Contribution
{raghub, prabuchandra, shalabh}@iisc.ac.in, [email protected]

Abstract
In cooperative stochastic games, multiple agents work towards learning joint optimal actions in an unknown environment to achieve a common goal. In many real-world applications, however, constraints are often imposed on the actions that can be jointly taken by the agents. In such scenarios the agents aim to learn joint actions to achieve a common goal (minimizing a specified cost function) while meeting the given constraints (specified via certain penalty functions). In this paper, we consider the relaxation of the constrained optimization problem by constructing the Lagrangian of the cost and penalty functions. We propose a nested actor-critic solution approach to solve this relaxed problem. In this approach, an actor-critic scheme is employed to improve the policy for a given Lagrange parameter update on a faster timescale, as in the classical actor-critic architecture. A meta actor-critic scheme using these faster timescale policy updates is then employed to improve the Lagrange parameters on the slower timescale. Utilizing the proposed nested actor-critic schemes, we develop three Nested Actor-Critic (N-AC) algorithms. Through experiments on constrained cooperative tasks, we show the effectiveness of the proposed algorithms.
1 Introduction

In the reinforcement learning (RL) paradigm, an agent interacts with its environment by selecting actions in a trial and error manner. The agent incurs a cost for the chosen actions, and the goal of the agent is to learn to choose actions that minimize a long-run cost objective. The evolution of the state of the environment and the cost feedback signal received by the agent are modeled using the standard Markov Decision Process (MDP) [Sutton and Barto, 1998] framework. Utilizing one of the RL methods like Q-learning [Sutton and Barto, 1998], the agent learns to choose optimal state-dependent actions (policy) by suitably balancing exploration of unexplored actions and exploitation of actions that incur low long-run costs. However, in many problems of practical interest, the number of environment states and the set of actions that the agent has to explore for learning the optimal actions are typically large, resulting in the phenomenon known as the 'curse of dimensionality'. In such high-dimensional scenarios, RL methods used in conjunction with deep neural networks as function approximators, known as "critic-only" or "actor-critic" methods, have resulted in successful practical applications [Mnih et al., 2015; Levine et al., 2016].

Many real-world problems nonetheless cannot be considered in the context of single-agent RL, and this has led to the study of the multi-agent RL framework [Busoniu et al., 2008]. It is important to observe that developing learning methods in the multi-agent setting poses a serious challenge compared to the single-agent setting, due to the exponential growth in state and action spaces as the number of agents increases.

Multi-agent reinforcement learning problems have been posed and studied in the mathematical framework of "stochastic games" [Lauer and Riedmiller, 2000]. The stochastic game setting can be categorized as (a) fully cooperative [Littman, 2001; Lauer and Riedmiller, 2000; Foerster et al., 2017a], (b) fully competitive [Littman, 1994] and (c) mixed settings [Lowe et al., 2017; Foerster et al., 2018]. In a fully cooperative game, agents coordinate with other agents either through explicit communication [Das et al., 2017; Foerster et al., 2016; Mordatch and Abbeel, 2017] or through their actions to achieve a common goal. In this paper, we consider the fully cooperative setting, which has gained popularity in recent times [Foerster et al., 2017b; Gupta et al., 2017].

In many real-life multi-agent applications one often encounters constraints specified on the sequence of actions taken by the agents. Under this setting, the combined goal of the agents is to obtain the optimal joint action sequence or policy that minimizes a long-run objective function while meeting the constraints, which are typically specified as long-run penalty/budget functional constraints. It is important to observe that both the objective and the penalty functions depend on the joint policy of the agents. These problems are studied as a "Constrained Markov Decision Process" (C-MDP) [Altman, 1999] in the single-agent RL setting and as a "Constrained Stochastic Game" (C-SG) in the multi-agent RL setting.

In this work, our goal is to develop multi-agent RL algorithms for the setting of constrained cooperative stochastic games. To this end, we utilize the Lagrange formulation and propose novel actor-critic algorithms. Our algorithms, in addition to the classical actor-critic setup, utilize an additional meta actor-critic architecture to enforce constraints on the agents.
The meta actor performs gradient ascent on the Lagrange parameters by obtaining the gradient information from the meta critic. We propose three RL algorithms, namely JAL N-AC, Independent N-AC and Centralized N-AC, by extending three popular algorithms for unconstrained cooperative SGs to constrained fully cooperative SGs. We now summarize our contributions:

• We propose, for the first time, multi-agent actor-critic algorithms for the constrained fully cooperative stochastic game setting.
• Our algorithms do not require model information to be known and utilize non-linear function approximation in the actor as well as the critic for modeling the policy and the value function, respectively.
• We utilize a meta actor-critic architecture in addition to the classical actor-critic setup to satisfy the specified constraints.
• The meta critic utilizes a non-linear function approximator for obtaining the value function of the penalty costs.
• Under this setup, we develop three RL algorithms for the constrained multi-agent setting.
• We provide an empirical evaluation of the performance of our algorithms on certain constrained multi-agent tasks.
2 Related Work

The long-run average cost as the objective function with long-run average cost constraints in the single-agent MDP setting has been considered in [Borkar, 2005], where a two-timescale actor-critic scheme utilizing full state representation without function approximation has been developed for this C-MDP setting. In [Lakshmanan and Bhatnagar, 2012] and [Bhatnagar, 2010; Bhatnagar and Lakshmanan, 2012], a constrained Q-learning algorithm and actor-critic algorithms, respectively, utilizing linear function approximators have been proposed. Recently, deep neural network based value function approximators for C-MDPs under the discounted cost objective setting have been presented in [Achiam et al., 2017], where a constrained policy optimization (CPO) algorithm has been developed for continuous C-MDPs with near constraint satisfaction.
3 Problem Formulation

We first consider the problem of obtaining the joint optimal action sequence in the cooperative multi-agent setting. This problem can be formulated in the framework of stochastic games. A stochastic game is an extension of the single-agent Markov Decision Process to multiple agents. A stochastic game is described by the tuple $(n, S, A_1, \ldots, A_n, T, C, \gamma)$, where $n$ denotes the number of agents participating in the game, $S$ denotes the state space of the game, $A_i,\ i \in \{1, \ldots, n\}$, denotes the action space of agent $i$, $C : S \times A_1 \times \cdots \times A_n \times S \to \mathbb{R}$ denotes the common cost function for the cost incurred by the agents when the joint action profile is $(a_1, a_2, \ldots, a_n),\ a_i \in A_i$, $T : S \times A_1 \times \cdots \times A_n \times S \to [0, 1]$ denotes the probability transition mechanism, where $T(i, a_1, \ldots, a_n, j)$ specifies the probability of transitioning to state $j$ from the current state $i$ under the joint action profile $(a_1, \ldots, a_n)$ of the agents, and $\gamma \in (0, 1]$ is the discount factor.

Let $X_t \in S$ denote the state of the game at time $t$. Assume that the initial state $X_0$ is sampled from an initial distribution $D$. Let $\pi_i : S \times A_i \to [0, 1]$ be the stochastic policy followed by agent $i$. Here $\pi_i(a \mid s)$ specifies the probability of agent $i$ choosing action $a \in A_i$ in state $s$. Given a joint policy of the agents $\pi = (\pi_1, \ldots, \pi_n)$, we define the total discounted cost incurred under the joint policy as

$$J(\pi) = E\Big[\sum_{t=0}^{\tau-1} \gamma^t C(X_t, \pi(X_t), X_{t+1})\Big], \qquad (1)$$

where $E[\cdot]$ denotes the expectation taken over the sequence of states under the joint policy $\pi$, and $\tau$ denotes the number of time steps until the terminal state is reached in the game (a random but finite integer).

The objective of the agents in the cooperative stochastic game is to learn a joint optimal policy $\pi^* = (\pi_1^*, \ldots, \pi_n^*)$ that minimizes (1), i.e.,

$$\pi^* = \arg\min_{\pi} J(\pi). \qquad (2)$$

In the constrained cooperative stochastic game setting, we consider $K$ common total discounted penalty constraints with single-stage cost functions $P_j : S \times A_1 \times \cdots \times A_n \times S \to \mathbb{R},\ j \in \{1, \ldots, K\}$. These constraints are specified as

$$E\Big[\sum_{t=0}^{\tau-1} \gamma^t P_j(X_t, \pi(X_t), X_{t+1})\Big] \le \alpha_j,\quad j \in \{1, \ldots, K\}, \qquad (3)$$

where $\alpha_j \ge 0,\ j \in \{1, \ldots, K\}$, are certain prescribed thresholds. Under this constrained stochastic game setting, the objective of the agents is to learn a joint policy $\pi$ that minimizes (1) under the constraints (3).

In order to solve for $\pi^*$ in (2) subject to the constraints (3), we consider the Lagrangian formulation of the multi-agent constrained setting [Borkar, 2005; Bhatnagar, 2010]. Let $\lambda_j,\ j \in \{1, \ldots, K\}$, denote the Lagrange multipliers for each of these constraints, and let $\lambda = (\lambda_1, \ldots, \lambda_K)$ denote the vector of Lagrange multipliers. We define the Lagrangian cost function as follows:

$$L(\pi, \lambda) = E\Big[\sum_{t=0}^{\tau-1} \gamma^t \Big(C(X_t, \pi(X_t), X_{t+1}) + \sum_{j=1}^{K} \lambda_j P_j(X_t, \pi(X_t), X_{t+1})\Big)\Big] - \sum_{j=1}^{K} \lambda_j \alpha_j. \qquad (4)$$

Let $g : \mathbb{R}^K \to \mathbb{R}$ denote the dual objective of the constrained problem, defined as

$$g(\lambda) = \inf_{\pi} L(\pi, \lambda). \qquad (5)$$

For a given vector of Lagrange multipliers, $g(\lambda)$ can be computed by optimally solving the unconstrained MDP with the modified single-stage cost function $\tilde{C}$ given by

$$\tilde{C}(X_t, \pi(X_t), X_{t+1}) = C(X_t, \pi(X_t), X_{t+1}) + \sum_{j=1}^{K} \lambda_j P_j(X_t, \pi(X_t), X_{t+1}). \qquad (6)$$

Let $q : \mathbb{R}^K \to \Pi$ denote the optimal policy obtained by solving the unconstrained problem with the modified cost function (6), i.e.,

$$q(\lambda) = \arg\min_{\pi \in \Pi} L(\pi, \lambda), \qquad (7)$$

where $\Pi$ denotes the space of all joint randomized policies. After constructing the dual of the constrained problem, the goal then is to maximize $g(\lambda)$ with respect to the Lagrange multipliers $\lambda$. Let $\lambda^*$ denote the optimal Lagrange multipliers obtained by maximizing $g(\lambda)$, i.e.,

$$\lambda^* = \arg\max_{\lambda \ge 0} g(\lambda). \qquad (8)$$

The optimal Lagrange multipliers $\lambda^*$ can be obtained by performing gradient ascent on the function $g(\lambda)$, i.e.,

$$\lambda_{t+1} = \big(\lambda_t + b(t)\nabla g(\lambda_t)\big)^+, \qquad (9)$$

where $b(t),\ t \ge 0$, is a suitably chosen step-size schedule and $(\cdot)^+$ denotes the function $\max(\cdot, 0)$. The gradient of $g(\lambda)$ with respect to the Lagrange multipliers $\lambda_j,\ j \in \{1, \ldots, K\}$, can be obtained using the envelope theorem of mathematical economics as follows (see [Borkar, 2005]):

$$\frac{\partial g(\lambda)}{\partial \lambda_j} = \frac{\partial L(\pi, \lambda)}{\partial \lambda_j}\bigg|_{\pi = q(\lambda)} = E\Big[\sum_{t=0}^{\tau-1} \gamma^t P_j(X_t, q(\lambda)(X_t), X_{t+1})\Big] - \alpha_j. \qquad (10)$$

Equation (10) indicates that the partial derivative with respect to the Lagrange multiplier $\lambda_j$ can be computed by performing a policy evaluation of the policy $q(\lambda_t)$ with respect to the single-stage cost function $P_j(X_t, q(\lambda_t)(X_t), X_{t+1})$. The policy $q(\lambda^*)$ corresponding to $\lambda^*$ at the end of the gradient iterations provides a near-optimal policy satisfying (3), i.e.,

$$\pi^* = q(\lambda^*). \qquad (11)$$

To accomplish the task of computing gradients by policy evaluation (10) and improving the Lagrange multipliers (9), we propose nested actor-critic architectures as illustrated in Figure 1. In this setup, the inner (policy) actor-critic computes the optimal policy that minimizes (4) (utilizing (6)) for a fixed set of Lagrange multipliers $\lambda$. The policy obtained from the inner actor-critic is given as input to the outer (penalty) critic. The penalty critic then computes the gradient of the Lagrangian by evaluating the policy $q(\lambda)$ on all the penalty functions as in (10). The gradient information is then provided to the outer (penalty) actor, which improves the Lagrange multipliers by performing gradient ascent as in (9). Finally, the outer actor provides the improved Lagrange multipliers to the policy actor-critic for obtaining the optimal policy at the improved Lagrange multipliers.

Figure 1: Nested Actor-Critic (N-AC) architecture. The policy critic estimates the value function of the modified cost MDP and the policy actor improves the policy; the penalty critic estimates the value function of the penalty costs for the policy q(λ), and the penalty actor improves the Lagrange multipliers, feeding ∇g(λ) and λ back to the policy actor-critic.
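To make the dual ascent of (8)-(10) concrete, the following is a minimal sketch of the outer loop, assuming two hypothetical helpers that are not part of the paper: solve_modified_mdp (an inner solver that approximately computes q(λ) for the modified cost (6), e.g., an actor-critic run on the faster timescale) and evaluate_penalties (a policy evaluation returning estimates of the K discounted penalty returns appearing in (10)).

```python
import numpy as np

def lagrange_ascent(solve_modified_mdp, evaluate_penalties, alphas,
                    num_outer_iters=100, b=lambda t: 1.0 / (1 + t)):
    """Sketch of the projected dual ascent (9): maximize g(lambda), lambda >= 0.

    solve_modified_mdp(lmbda) -> policy   # approximates q(lmbda), Eq. (7)
    evaluate_penalties(policy) -> array of K discounted penalty returns,
                                  i.e. the expectation term in Eq. (10)
    alphas: array of K thresholds alpha_j from Eq. (3)
    """
    K = len(alphas)
    lmbda = np.zeros(K)                      # start from the unconstrained problem
    for t in range(num_outer_iters):
        policy = solve_modified_mdp(lmbda)   # inner (faster) timescale
        grad = evaluate_penalties(policy) - alphas     # nabla g(lambda), Eq. (10)
        lmbda = np.maximum(0.0, lmbda + b(t) * grad)   # projection, Eq. (9)
    return solve_modified_mdp(lmbda), lmbda
```

In the nested actor-critic algorithms below, these two stages are not run to completion in turn but are interleaved on two timescales.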
4 Proposed Algorithms

In this section, we propose three multi-agent deep reinforcement learning algorithms for the constrained setting. First, we propose the Joint Action Learners (JAL) scheme for the constrained case and refer to this algorithm as "JAL N-AC". This algorithm employs centralized critics and centralized actors. As the number of agents participating in the game increases, learning becomes slow due to action explosion. Next, to mitigate the action explosion of JAL N-AC, we propose an independent learners scheme for the constrained setting and refer to this algorithm as "Independent N-AC". In Independent N-AC, both critics and actors are decentralized. Note that even though this algorithm handles action explosion, decentralizing the critics induces non-stationarity into the learning process of each of the agents. Finally, we also propose an actor-critic algorithm that employs "centralized learning and decentralized execution", where there is a single policy and penalty critic and multiple policy actors; we refer to this algorithm as "Centralized N-AC".
4.1 JAL N-AC

JAL N-AC employs centralized critics and centralized actors for all the agents. The centralized policy actor computes the joint policy of all the agents in the game. Therefore, the action space of the centralized policy actor is the cartesian product of the action spaces of all the agents in the game.

We now describe the JAL N-AC algorithm. Let us denote the current sample of the game at time $t$ by the tuple $(X_t, a_t, X_{t+1}, C_t, \{P_t^1, \ldots, P_t^K\})$, where $X_t$ is the current state, $a_t$ is the joint action taken by the central policy actor, $X_{t+1}$ is the next state, $C_t$ is the single-stage cost obtained from the environment, and $\{P_t^1, \ldots, P_t^K\}$ are the $K$ single-stage penalty costs. Let $\theta_c$, $\theta_\pi$ and $\theta_{p_j},\ j \in \{1, \ldots, K\}$, denote the parameters of the policy critic, policy actor and penalty critics, respectively. The policy critic parameters $\theta_c$ are updated by minimizing the loss function [Mnih et al., 2015]

$$L(\theta_c) = \big(r_t + \gamma V_{\theta_c}(X_{t+1}) - V_{\theta_c}(X_t)\big)^2, \qquad (12)$$

where $r_t$ is the modified single-stage cost given by $r_t = C(X_t, a_t, X_{t+1}) + \sum_{j=1}^{K} \lambda_j P_t^j(X_t, a_t, X_{t+1})$ and $V_{\theta_c}(\cdot)$ denotes the value function approximated by the policy critic. Having found the parameters $\theta_c$ of the policy critic, we utilize them to compute policy gradients for improving the policy parameters $\theta_\pi$ of the actor. There are many ways to estimate the gradient for improving the actor parameters [Sutton and Barto, 1998]. We utilize the popular temporal difference learning (TD(0)) update with a baseline. We update the policy parameters $\theta_\pi$ as follows:

$$\theta_\pi := \theta_\pi - a(t)\big(r_t + \gamma V_{\theta_c}(X_{t+1}) - V_{\theta_c}(X_t)\big)\nabla_{\theta_\pi} \log \pi(a_t \mid X_t), \qquad (13)$$

where $V_{\theta_c}(X_t)$ is the baseline. Note that in (13) the baseline is subtracted from the estimated return to reduce the variance of the gradient estimate.

The penalty critics estimate the penalty value function parameters $\theta_{p_j},\ j \in \{1, \ldots, K\}$. These parameters are computed by minimizing the loss functions $L(\theta_{p_j})$ defined as

$$L(\theta_{p_j}) = \big(P_j(X_t, a_t, X_{t+1}) + \gamma V_{\theta_{p_j}}(X_{t+1}) - V_{\theta_{p_j}}(X_t)\big)^2,$$

where $V_{\theta_{p_j}}(X_t)$ (resp. $V_{\theta_{p_j}}(X_{t+1})$) is the value function associated with penalty constraint $j$ for state $X_t$ (resp. $X_{t+1}$).

Finally, the Lagrange parameters are improved by the penalty actor by performing stochastic gradient ascent as follows (see [Borkar, 2005; Bhatnagar, 2010]):

$$\lambda^j_{t+1} = \max\big(0, \lambda^j_t + b(t)(V_{\theta_{p_j}}(X_t) - \alpha_j)\big), \qquad (14)$$

where $b(t)$ is the step-size parameter and $\lambda^j_t$ is the Lagrange parameter corresponding to penalty function $j$ at time $t$. The max operation ensures that the Lagrange parameters always stay non-negative. The update in (13) is performed on a faster timescale, while the update in (14) is performed on a slower timescale [Borkar, 2005].
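As an illustration, here is a minimal PyTorch-style sketch of one JAL N-AC update implementing (12)-(14). All names are our own: critic and the entries of penalty_critics are assumed to be value networks mapping a state tensor to a single value, actor is assumed to return a torch distribution over joint actions, and the two-timescale requirement is reflected only in choosing the Lagrange step size b_t much smaller than the actor/critic learning rates. This is a sketch under those assumptions, not the authors' implementation.

```python
import torch

def jal_nac_step(actor, critic, penalty_critics, actor_opt, critic_opt,
                 penalty_opts, lmbda, alphas, sample, gamma=0.99, b_t=1e-4):
    """One JAL N-AC update from a transition sample (Eqs. (12)-(14)).

    sample: (x_t, a_t, x_tp1, c_t, penalties), all torch tensors;
    lmbda:  1-D tensor of K Lagrange multipliers, updated in place.
    """
    x_t, a_t, x_tp1, c_t, penalties = sample

    # Modified single-stage cost: r_t = C_t + sum_j lambda_j * P_t^j.
    r_t = c_t + torch.dot(lmbda, penalties)

    # Policy critic: minimize the squared TD error, Eq. (12).
    td = r_t + gamma * critic(x_tp1).detach() - critic(x_t)
    critic_opt.zero_grad()
    td.pow(2).backward()
    critic_opt.step()

    # Policy actor: descend along delta * grad log pi(a_t | x_t), Eq. (13).
    delta = td.detach()
    log_prob = actor(x_t).log_prob(a_t)
    actor_opt.zero_grad()
    (delta * log_prob).backward()   # optimizer.step() performs the descent
    actor_opt.step()

    # Penalty critics: one squared TD loss per constraint j.
    for j, (vc, opt) in enumerate(zip(penalty_critics, penalty_opts)):
        td_j = penalties[j] + gamma * vc(x_tp1).detach() - vc(x_t)
        opt.zero_grad()
        td_j.pow(2).backward()
        opt.step()

    # Penalty actor: slower-timescale projected ascent on lambda, Eq. (14).
    with torch.no_grad():
        v_p = torch.stack([vc(x_t).squeeze() for vc in penalty_critics])
        lmbda.copy_(torch.clamp(lmbda + b_t * (v_p - alphas), min=0.0))
```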
4.2 Independent N-AC

In this algorithm, each agent has its own nested actor-critic architecture, i.e., there are a total of n nested actor-critic architectures. Each agent learns the parameters of its nested actor-critic architecture separately and estimates its individual policy π_i; that is, each agent maintains its own policy actor-critic and penalty actor-critic networks. At every step of training, each agent takes actions based on its current policy, independently of the other agents' policies, and receives the common cost from the environment. Using this cost signal, all the agents independently improve their policy and penalty parameters. Each agent manages its nested actor-critic architecture in the same manner as described for the JAL N-AC algorithm.

Independent N-AC suffers from the problem of non-stationarity, as past learning of an agent may become obsolete when other agents simultaneously explore actions during their training phase. Therefore, the individual policies obtained by the agents may not be optimal. Nonetheless, this is a simple algorithm that avoids action explosion and has been seen to perform well in some scenarios [Tampuu et al., 2017].

4.3 Centralized N-AC

This algorithm imbibes the advantages of the two algorithms described above. During training, learning is centralized here in the sense that the value function is estimated based on the joint actions of all agents, while the policies of all agents are decentralized. There is one centralized critic (policy critic) for estimating the value function of the single-stage cost (i.e., the Lagrangian), another centralized critic (penalty critic) for estimating the value function of the penalty cost (for improving the Lagrange parameters), and n actors estimating the policies of each of the agents. Finally, there is a penalty actor improving the Lagrange multipliers. After learning is complete, the agents execute the learnt policies independently. This idea is well studied in the unconstrained multi-agent case in [Foerster et al., 2017a; Lowe et al., 2017]. The algorithmic description of Centralized N-AC for solving the fully cooperative multi-agent constrained RL problem is provided in Algorithm 1.
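To make the "centralized learning, decentralized execution" split concrete, here is a small sketch of how the critics and actors could be wired. The network architecture is our own illustrative choice (the paper does not prescribe one); the point is the shapes: the shared critics see the joint state, while each actor sees only its agent's local state.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    # A small feed-forward network; the paper does not fix an architecture.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class CentralizedNAC(nn.Module):
    """Centralized N-AC wiring: shared critics over the joint state,
    one decentralized actor per agent over its local state."""

    def __init__(self, n_agents, local_dim, n_actions, n_constraints):
        super().__init__()
        joint_dim = n_agents * local_dim
        self.policy_critic = mlp(joint_dim, 1)             # V_{theta_c}(X_t)
        self.penalty_critics = nn.ModuleList(              # V_{theta_{p_j}}(X_t)
            [mlp(joint_dim, 1) for _ in range(n_constraints)])
        self.actors = nn.ModuleList(                       # pi_i(. | X_t^i)
            [mlp(local_dim, n_actions) for _ in range(n_agents)])

    def act(self, local_states):
        # Decentralized execution: each agent samples from its own policy.
        actions = []
        for actor, x_i in zip(self.actors, local_states):
            dist = torch.distributions.Categorical(logits=actor(x_i))
            actions.append(dist.sample())
        return actions
```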
Remark 1. In our algorithms, the policy actor-critic determines the optimal policy by minimizing the Lagrangian for a given λ. On the other hand, the penalty actor-critic updates the Lagrange multipliers by evaluating the policy on the penalty cost functions. As these two computations have to be carried out ad infinitum, the idea of two-timescale stochastic approximation [Borkar, 1997] is utilized to interleave these two operations and ensure the desired convergence behaviour.

Algorithm 1: Centralized N-AC
  State sample at time t: X_t = (X_t^1, ..., X_t^n), where X_t^i is the state of agent i.
  for agents i = 1, 2, ..., n do
    a_i = sample an action from π_i(· | X_t^i)
  end for
  Obtain the cost, penalties and next state from the environment:
    (C_t, P_t^j, X_{t+1}) ←− get_reward(X_t, a_1, ..., a_n)
  Let r_t = C_t + Σ_{j=1}^{K} λ_t^j P_t^j.
  Train the policy critic parameters θ_c to minimize the loss function (r_t + γV_{θ_c}(X_{t+1}) − V_{θ_c}(X_t))².
  for j = 1, 2, ..., K do
    Train the penalty critic parameters θ_{p_j} to minimize the loss function (P_t^j + γV_{θ_{p_j}}(X_{t+1}) − V_{θ_{p_j}}(X_t))².
  end for
  for agents i = 1, ..., n do
    Improve policy actor i's policy parameters θ_{π_i} by performing gradient descent along the estimated gradient (r_t + γV_{θ_c}(X_{t+1}) − V_{θ_c}(X_t)) ∇_{θ_{π_i}} log π_i(a_i | X_t^i).
  end for
  Finally, update the Lagrange parameters in the penalty actor as λ_{t+1} = max(0, λ_t + b(t)(V_{θ_p}(X_t) − α)), where α = (α_1, ..., α_K) is the vector of prescribed thresholds.

5 Experiments

In this section, we evaluate and analyze our algorithms on three constrained multi-agent tasks. We begin with two simple games, namely the constrained grid world and the constrained coin game, which have discrete state spaces. We then discuss our results on a more complex environment, constrained cooperative navigation, which has a continuous state space.
5.1 Constrained Grid World

In the constrained grid world, the objective of each agent is to learn the shortest path from a given source to the target with at most α overlap in the path with the other agent. For our setting, we consider a grid of size 4 × 4 with two agents. The state of each agent, s_i, i ∈ {1, 2}, is a vector of size 16 with value 1 at the current position of agent i and value 0 at all other positions. The permissible actions for the agents in the grid are moving up, down, left and right, wherever applicable. The game ends when both agents reach the target state or the number of steps in the game exceeds 10. Note that when an agent reaches the target state, it remains in the target state till the end of the episode. In the constrained setting that we consider, a single-stage penalty of +1 is imposed on the agents if they enter the same block of the grid, and we prescribe a penalty threshold of α. For example, if we let α be 0, then we are imposing the constraint that the agents have to reach the target state from every source state in the minimum number of steps without any overlap in their paths.

12 13 14 15
 8  9 10 11
 4  5  6  7
 0  1  2  3

Table 1: Grid World (cells numbered 0-15)
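Below is a minimal sketch of this environment under the description above. Class and method names are our own, and we make one assumption not stated in the paper: the shared target cell is exempt from the overlap penalty, since both agents must eventually occupy it.

```python
import numpy as np

class ConstrainedGridWorld:
    """Sketch of the 4x4 two-agent grid world, cells numbered as in Table 1.

    Each step incurs a common cost of 1 until both agents are absorbed at
    the target; a penalty of +1 is incurred when the agents occupy the same
    non-target cell (target exemption is our assumption).
    """
    # (row delta, col delta); row 3 is the top row (cells 12-15).
    MOVES = {0: (1, 0), 1: (-1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def __init__(self, target=15, max_steps=10):
        self.target, self.max_steps = target, max_steps

    def reset(self):
        self.pos = list(np.random.randint(0, 16, size=2))  # random start cells
        self.steps = 0
        return self.pos.copy()

    def step(self, actions):
        self.steps += 1
        for i, a in enumerate(actions):
            if self.pos[i] == self.target:
                continue                        # absorbed at the target
            r, c = divmod(self.pos[i], 4)
            dr, dc = self.MOVES[a]
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < 4 and 0 <= c2 < 4:     # moves apply 'wherever applicable'
                self.pos[i] = 4 * r2 + c2
        done = (all(p == self.target for p in self.pos)
                or self.steps >= self.max_steps)
        cost = float(any(p != self.target for p in self.pos))
        penalty = float(self.pos[0] == self.pos[1] != self.target)
        return self.pos.copy(), cost, penalty, done
```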
In this experiment, we train all three algorithms over many episodes starting from random start positions, for three increasing values of the threshold α (denoted α₁ < α₂ < α₃ below). We perform several independent runs of the experiment and report the median of the expected penalty obtained across the runs. Note that the expected penalty is computed by averaging the total penalty obtained by following the converged policy over a large number of test episodes. We observe that with all three algorithms, the agents reach the target state from any given start state in at most 10 steps. The performance of our algorithms in meeting the constraints is given in Table 2, and we find that all algorithms nearly meet the penalty constraints.

We now briefly discuss how the agents learn the shortest path to reach the target state while meeting the penalty constraints. For example, the two agents starting from state 0 learn to take disjoint shortest paths to the target state.

Algorithm          Expected Penalty (α₁)   Expected Penalty (α₂)   Expected Penalty (α₃)
JAL N-AC           0.092                   0.250                   0.444
Independent N-AC   0.127                   0.221                   0.346
Centralised N-AC   0.064                   0.217                   0.405
Table 2: Expected penalty obtained by the converged policy for different values of the penalty threshold in the constrained grid world
Algorithm          Expected Total Cost (α₁)   Expected Total Cost (α₂)   Expected Total Cost (α₃)
JAL N-AC           2.4149                     2.1837                     2.1831
Independent N-AC   1.8144                     1.6276                     1.5594
Centralised N-AC   2.1457                     1.7104                     1.5607
Table 3: Expected total cost of the converged policy in the constrained grid world for different thresholds
On the other hand, from the initial state 3, both the agents learn to follow the same shortest path. In the first case (start state 0), the agents take disjoint paths to reach the target state in the least number of steps. In the second case (start state 3), however, if one of the agents takes a detour from the shortest route to the target state, it considerably increases the objective of the game. Hence, the agents learn to take the same shortest route, violating the constraint minimally when required. Note that the average overlap is 1 when we start from state 3; however, as the initial state is chosen with probability 1/16, the average overlap is 0.06.

In Table 3, we present the median of the expected cost obtained by our algorithms for the three distinct values of α. Note that the expected cost in this experiment is the average number of steps taken by the algorithm to reach the target state from random start positions. The expected cost monotonically decreases as α increases. This is the desired behaviour, as the constraint becomes less tight when α increases, and it is seen to be the case with all our algorithms.

5.2 Constrained Coin Game

In the coin game considered in [Foerster et al., 2018], the objective of the agents is to collect the coins that appear at random positions in a given grid. In the constrained version, multiple agents exist and each agent can only collect a specific type of coin. The objective is to maximize the total number of coins collected by the agents. We consider two agents, 'blue' and 'red', in a 3 × 3 grid. A coin can be of one of two colors, blue or red. We impose a penalty on the agents if the color of the agent does not match the color of the coin collected. For example, if the 'blue' agent collects a 'red' coin, a penalty of +1 is incurred by both the agents. The state of the game is a 3 × 3 × 4 matrix that encodes the positions of the agents in the grid as well as the positions of the coins (blue and red) [Foerster et al., 2018]. Note that unlike in the grid world game, both agents have access to the full state information. The actions of the agents, similar to the grid world setting, are moving up, down, left and right, wherever applicable.

Algorithm          Expected Penalty (threshold α)
JAL N-AC           0.110
Independent N-AC   0.211
Centralized N-AC   0.208
Table 4: Performance of the converged policy in meeting the constraints in the constrained coin game
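As an illustration of the state encoding described above, the following is a small sketch (our own construction, not the paper's code) that builds the 3 × 3 × 4 state tensor from the agent and coin positions, with one one-hot channel per entity.

```python
import numpy as np

def coin_game_state(blue_agent, red_agent, blue_coin, red_coin, size=3):
    """Encode positions as a (size, size, 4) tensor: one one-hot channel
    each for the blue agent, red agent, blue coin and red coin.

    Positions are (row, col) tuples; a coin position may be None when that
    coin is currently absent from the grid.
    """
    state = np.zeros((size, size, 4), dtype=np.float32)
    for channel, pos in enumerate([blue_agent, red_agent, blue_coin, red_coin]):
        if pos is not None:
            state[pos[0], pos[1], channel] = 1.0
    return state

# Example: blue agent at (0, 0), red agent at (2, 2), a red coin at (1, 1).
s = coin_game_state((0, 0), (2, 2), None, (1, 1))
assert s.shape == (3, 3, 4) and s.sum() == 3.0
```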
We evaluate the performance of the converged policy across several runs and report the median of the expected penalty in Table 4. We observe that all three algorithms nearly meet the prescribed penalty threshold α.

5.3 Constrained Cooperative Navigation

This is the constrained version of the cooperative navigation game proposed in [Lowe et al., 2017; Mordatch and Abbeel, 2017]. The objective of the agents in this game is to move towards landmarks that are located in a continuous space. Note that this game, despite its similarity to the constrained grid world, has a continuous state space, unlike the finite state space of the grid world. As we have an uncountable number of states in this game, non-linear function approximators for estimating the cost and penalty value functions play a crucial role in obtaining a good policy. The positions of the agents and landmarks change dynamically over different episodes. The agents have to learn a policy that minimizes the number of steps to reach the landmarks, with a constraint on the number of collisions between the agents. For each landmark, the Euclidean distance to the closest agent is calculated, and the sum of these distances is provided as the common cost to the agents. Each agent incurs a penalty of +1 for colliding with another agent.
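To make these two signals concrete, here is a minimal sketch of the single-stage common cost and collision penalty as just described. The function names and the collision radius are our own illustrative choices.

```python
import numpy as np

def navigation_cost(agent_pos, landmark_pos):
    """Common single-stage cost: for each landmark, the Euclidean distance
    to the closest agent, summed over landmarks.

    agent_pos: (n_agents, 2) array; landmark_pos: (n_landmarks, 2) array.
    """
    # Pairwise distances, shape (n_landmarks, n_agents).
    dists = np.linalg.norm(landmark_pos[:, None, :] - agent_pos[None, :, :],
                           axis=-1)
    return dists.min(axis=1).sum()

def collision_penalty(agent_pos, radius=0.1):
    """Each agent incurs +1 for colliding with another agent; here two
    agents 'collide' when closer than `radius` (an assumed threshold)."""
    n = len(agent_pos)
    penalty = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and np.linalg.norm(agent_pos[i] - agent_pos[j]) < radius:
                penalty += 1.0   # counted once per colliding agent
    return penalty
```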
Figure 2: Performance of the algorithms in reducing the objective function as the learning progresses

Figure 3: Performance of the algorithms in meeting the constraint as the learning progresses
Algorithm          Expected Total Cost   Expected Penalty (threshold α)
JAL N-AC           18.5826               0.0442
Independent N-AC   22.9310               0.0338
Centralized N-AC   17.9201               0.1324
Table 5: Performance of the converged policy in constrained cooperative navigation
In this experiment, we have two agents in the game. An agent is said to have reached a landmark when the total Euclidean distance cost falls below a small threshold. The game also ends if the agents do not reach the landmarks within a maximum number of time steps. We set a penalty threshold α and train our algorithms on a single run over a large number of iterations. From Figures 2 and 3, we see that the expected total cost and the expected penalty decrease as learning progresses. Finally, in Table 5, we observe that all our algorithms yield converged policies that nearly satisfy the penalty constraints.

6 Conclusions and Future Work

In this paper, we have considered the problem of finding near-optimal policies satisfying specified constraints in the multi-agent fully cooperative stochastic game setting. Our algorithms utilize nested actor-critic architectures to enable the agents to meet the penalty constraints. Utilizing this architecture, we presented three multi-agent RL methods, namely JAL N-AC, Independent N-AC and Centralized N-AC, each of which utilizes non-linear function approximators for value function estimation. Finally, we empirically showed the performance of our algorithms on three multi-agent tasks.

An interesting future direction would be to extend the proposed actor-critic algorithms to other constrained stochastic games involving, say, fully competitive and mixed (cooperative and competitive) settings. Another line of research would be to develop algorithms for agents with continuous action spaces. Further, we would like to deploy our algorithms in real-world applications such as cooperative surveillance through multiple drones and smart power grid settings with constraints.

References

[Achiam et al., 2017] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. arXiv preprint arXiv:1705.10528, 2017.

[Altman, 1999] Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
[Bhatnagar and Lakshmanan, 2012] Shalabh Bhatnagar and K. Lakshmanan. An online actor-critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications, 153(3):688-708, 2012.

[Bhatnagar, 2010] Shalabh Bhatnagar. An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes. Systems & Control Letters, 59(12):760-766, 2010.

[Borkar, 1997] Vivek S. Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291-294, 1997.

[Borkar, 2005] Vivek S. Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207-213, 2005.

[Busoniu et al., 2008] Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2), 2008.

[Das et al., 2017] Abhishek Das, Satwik Kottur, José M.F. Moura, Stefan Lee, and Dhruv Batra. Learning cooperative visual dialog agents with deep reinforcement learning. arXiv preprint arXiv:1703.06585, 2017.

[Foerster et al., 2016] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137-2145, 2016.

[Foerster et al., 2017a] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017.

[Foerster et al., 2017b] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip H.S. Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. arXiv preprint arXiv:1702.08887, 2017.

[Foerster et al., 2018] Jakob Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 122-130. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

[Gupta et al., 2017] Jayesh K. Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pages 66-83. Springer, 2017.

[Lakshmanan and Bhatnagar, 2012] K. Lakshmanan and Shalabh Bhatnagar. A novel Q-learning algorithm with function approximation for constrained Markov decision processes. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 400-405. IEEE, 2012.

[Lauer and Riedmiller, 2000] Martin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

[Levine et al., 2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334-1373, 2016.

[Littman, 1994] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pages 157-163. Elsevier, 1994.

[Littman, 2001] Michael L. Littman. Value-function reinforcement learning in Markov games. Cognitive Systems Research, 2(1):55-66, 2001.

[Lowe et al., 2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379-6390, 2017.

[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[Mordatch and Abbeel, 2017] Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908, 2017.

[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Introduction to reinforcement learning, volume 135. MIT Press, Cambridge, 1998.

[Tampuu et al., 2017] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE, 12(4):e0172395, 2017.