Contrasting Centralized and Decentralized Critics in Multi-Agent Reinforcement Learning
Xueguang Lyu [email protected]
Yuchen Xiao [email protected]
Brett Daley [email protected]
Christopher Amato [email protected]
ABSTRACT
Centralized Training for Decentralized Execution, where agents are trained offline using centralized information but execute in a decentralized manner online, has gained popularity in the multi-agent reinforcement learning community. In particular, actor-critic methods with a centralized critic and decentralized actors are a common instance of this idea. However, the implications of using a centralized critic in this context are not fully discussed and understood even though it is the standard choice of many algorithms. We therefore formally analyze centralized and decentralized critic approaches, providing a deeper understanding of the implications of critic choice. Because our theory makes unrealistic assumptions, we also empirically compare the centralized and decentralized critic methods over a wide set of environments to validate our theories and to provide practical advice. We show that there exist misconceptions regarding centralized critics in the current literature and show that the centralized critic design is not strictly beneficial, but rather both centralized and decentralized critics have different pros and cons that should be taken into account by algorithm designers.
ACM Reference Format:
Xueguang Lyu, Yuchen Xiao, Brett Daley, and Christopher Amato. 2021. Contrasting Centralized and Decentralized Critics in Multi-Agent Reinforcement Learning. In Proc. of the 20th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS 2021), Online, May 3–7, 2021, IFAAMAS, 21 pages.
Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA.

Centralized Training for Decentralized Execution (CTDE), where agents are trained offline using centralized information but execute in a decentralized manner online, has seen widespread adoption in multi-agent reinforcement learning (MARL) [10, 16, 28]. In particular, actor-critic methods with centralized critics have become popular after being proposed by Foerster et al. [11] and Lowe et al. [21], since the critic can be discarded once the individual actors are trained. Despite the popularity of centralized critics, the choice is not discussed extensively and its implications for learning remain largely unknown.

One reason for this lack of analysis is that recent state-of-the-art works built on top of a centralized critic focus on other issues such as multi-agent credit assignment [11, 44], multi-agent exploration [9], teaching [29] or emergent tool use [2]. However, state-of-the-art methods with a centralized critic do not compare with decentralized critic versions of their methods. Therefore, without precise theory or tests, previous works relied on intuitions and educated guesses. For example, one of the pioneering works on centralized critics, MADDPG [21], mentions that providing more information to critics eases learning and makes learning coordinated behavior easier. Later works echoed similar viewpoints, suspecting that a centralized critic might speed up training [50], or that it reduces variance [7], is more robust [37], improves performance [19] or stabilizes training [20].

In short, previous works generally give the impression that a centralized critic is an obvious choice without compromises under CTDE, and there are mainly two advertised benefits: a) a centralized critic fosters "cooperative behavior", and b) a centralized critic stabilizes (or speeds up) training. This makes intuitive sense because training a global value function on its own (i.e., joint learning [6]) would help with cooperation issues and has much better convergence guarantees due to the stationary learning targets. However, these intuitions have never been formally proven or empirically tested; since most related works focus on additional improvements on top of a centralized critic, the centralized critic is usually seen as part of a basic framework rather than an optional hyperparameter choice. In this paper, we look into these unvalidated claims and point out that common intuitions turn out to be inaccurate.

First, we show theoretically that a centralized critic does not necessarily improve cooperation compared to a set of decentralized critics. We prove that the two types of critics provide the decentralized policies with precisely the same gradients in expectation. We validate this theory on classical cooperation games and more realistic domains and report results supporting our theory.

Second, we show theoretically that the centralized critic results in higher-variance updates of the decentralized actors, assuming converged on-policy value functions. Therefore, we emphasize that stability of value function learning does not directly translate to a reduced variance in policy learning. We also discuss that, in practice, this results in a bias-variance trade-off. We analyze straightforward examples and empirical evaluations, confirming our theory and showing that the centralized critic often makes the policy learning less stable, contrary to the common intuition.

Finally, we test standard implementations of the methods over a wide range of popular domains and discuss our empirical findings, where decentralized critics often outperform a centralized critic. (Methods are also typically implemented with state-based critics [9, 11, 44, 50] instead of the history-based critics we use in this paper, which might be another reason for performance differences. However, we only consider history-based critics to fairly compare the use of centralized and decentralized critics with the same type of information.) We further analyze the results and discuss possible reasons for these performance differences. We therefore demonstrate room for improvement with current methods while laying theoretical groundwork for future work.
Recent deep MARL works often use the CTDE training paradigm. Value function based CTDE approaches [8, 23, 33, 34, 39, 40, 45, 46, 49] focus on how centrally learned value functions can be reasonably decoupled into decentralized ones and have shown promising results. Policy gradient methods under CTDE, on the other hand, have relied heavily on centralized critics. One of the first works utilizing a centralized critic was COMA [11], a framework adopting a centralized critic with a counterfactual baseline. Regarding convergence properties, COMA establishes that the overall effect on the decentralized policy gradient with a centralized critic can be reduced to a single-agent actor-critic approach, which ensures convergence under similar assumptions [18]. In this paper, we take the theory one step further and show convergence properties for centralized and decentralized critics as well as their respective policies, while giving a detailed bias-variance analysis.

Concurrently with COMA, MADDPG [21] proposed to use a dedicated centralized critic for each agent in semi-competitive domains, demonstrating compelling empirical results in continuous action environments. M3DDPG [20] focuses on the competitive case and extends MADDPG to learn robust policies against altering adversarial policies by optimizing a minimax objective. On the cooperative side, SQDDPG [44] borrows the counterfactual baseline idea from COMA and extends MADDPG to achieve credit assignment in fully cooperative domains by reasoning over each agent's marginal contribution. Other researchers also use critic centralization for emergent communication with decentralized execution in TarMAC [7] and ATOC [14]. There are also efforts utilizing an attention mechanism to address scalability problems in MAAC [13]. In addition, LeCTR [29], a teacher-student style transfer learning approach that does not assume expert teachers, also builds on top of centralized critics. Other focuses include multi-agent exploration and credit assignment in LIIR [9], goal-conditioned policies in CM3 [50], and temporally abstracted policies [5]. Based on a centralized critic, extensive tests in a more realistic environment using self-play for hide-and-seek [2] have demonstrated impressive results showing emergent tool use. However, as mentioned before, these works use centralized critics, but none of them specifically investigate the effectiveness of centralized critics, which is the main focus of this paper.
This section introduces the formal problem definition of cooperative MARL with decentralized execution and partial observability. We also introduce the single-agent actor-critic method and its most straightforward multi-agent extensions.
A decentralized partially observable Markov decision process (Dec-POMDP) is an extension of an MDP to decentralized multi-agent settings with partial observability [27]. A Dec-POMDP is formally defined by the tuple $\langle \mathcal{I}, \mathcal{S}, \{\mathcal{A}_i\}, \mathcal{T}, \mathcal{R}, \{\Omega_i\}, \mathcal{O} \rangle$, in which
• $\mathcal{I}$ is the set of agents,
• $\mathcal{S}$ is the set of states including initial state $s_0$,
• $\mathcal{A} = \times_i \mathcal{A}_i$ is the set of joint actions,
• $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is the transition dynamics,
• $\mathcal{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function,
• $\Omega = \times_i \Omega_i$ is the set of observations for each agent,
• $\mathcal{O}: \mathcal{S} \times \mathcal{A} \to \Omega$ gives the observation probabilities.
At each timestep $t$, a joint action $\boldsymbol{a}_t = \langle a_{1,t}, \dots, a_{|\mathcal{I}|,t} \rangle$ is taken, each agent receives its corresponding local observation $\langle o_{1,t}, \dots, o_{|\mathcal{I}|,t} \rangle \sim \mathcal{O}(\boldsymbol{s}_t, \boldsymbol{a}_t)$, and a global reward $r_t = \mathcal{R}(\boldsymbol{s}_t, \boldsymbol{a}_t, \boldsymbol{s}_{t+1})$ is received. The joint expected discounted return is $G_0 = \sum_{t=0}^{T} \gamma^t r_t$, where $\gamma$ is a discount factor. Agent $i$'s action-observation history at timestep $t$ is defined as $\langle o_{i,0}, a_{i,0}, o_{i,1}, \dots, a_{i,t-1}, o_{i,t} \rangle$, and we define the history recursively as $h_{i,t} = \langle h_{i,t-1}, a_{i,t-1}, o_{i,t} \rangle$; likewise, a joint history is $\boldsymbol{h}_t = \langle \boldsymbol{h}_{t-1}, \boldsymbol{a}_{t-1}, \boldsymbol{o}_t \rangle$. To solve a Dec-POMDP is to find a set of policies $\boldsymbol{\pi} = \langle \pi_1, \dots, \pi_{|\mathcal{I}|} \rangle$, where $\pi_i: h_{i,t} \to a_{i,t}$, such that the joint expected discounted return $G_0$ is maximized.
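To make the notation concrete, the following is a minimal Python sketch (our own illustration, not from the paper) of the Dec-POMDP interaction loop: a joint action is taken, each agent receives only its local observation and the shared reward, and each local history grows as $h_{i,t} = \langle h_{i,t-1}, a_{i,t-1}, o_{i,t} \rangle$. The `ToyDecPOMDP` environment is a hypothetical stand-in used purely for illustration.

```python
# Minimal sketch of a Dec-POMDP rollout with per-agent histories (assumption:
# a tiny hand-made two-agent environment, not any benchmark from the paper).
import random

class ToyDecPOMDP:
    """Hypothetical two-agent environment used only for illustration."""
    n_agents = 2
    actions = [0, 1]

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = random.choice([0, 1])
        return [self._observe() for _ in range(self.n_agents)]

    def _observe(self):
        # Noisy local observation of the hidden state.
        return self.state if random.random() < 0.8 else 1 - self.state

    def step(self, joint_action):
        # Shared (global) reward: +1 if both agents match the hidden state.
        reward = 1.0 if all(a == self.state for a in joint_action) else 0.0
        self.state = random.choice([0, 1])                       # transition T
        obs = [self._observe() for _ in range(self.n_agents)]    # observations O
        return obs, reward

def rollout(env, policies, horizon=5, gamma=0.95):
    """Run one episode and return the discounted return G_0."""
    obs = env.reset()
    histories = [(o,) for o in obs]            # h_{i,0} = (o_{i,0},)
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        joint_action = [pi(h) for pi, h in zip(policies, histories)]
        obs, r = env.step(joint_action)
        ret += discount * r
        discount *= gamma
        # h_{i,t} = (h_{i,t-1}, a_{i,t-1}, o_{i,t})
        histories = [h + (a, o) for h, a, o in zip(histories, joint_action, obs)]
    return ret

if __name__ == "__main__":
    env = ToyDecPOMDP()
    uniform = lambda h: random.choice(env.actions)   # decentralized policy pi_i(h_i)
    print(rollout(env, [uniform, uniform]))
```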
Notation. For notational readability, we denote the parameterized decentralized policy $\pi_{\theta_i}$ as $\pi_i$, the on-policy $G$ estimate $Q^{\pi_i}$ as $Q_i$, and $Q^{\boldsymbol{\pi}}$ as $Q$. We denote the objective (discounted expected return) for any agent with a decentralized critic as $J_d$ and the objective with a centralized critic as $J_c$. In addition, a timestep $t$ is implied for $s$, $\boldsymbol{h}$, $h_i$, $\boldsymbol{a}$ and $a_i$.

Actor-critic (AC) [18] is a widely used policy gradient (PG) architecture and is the basis for many single-agent policy gradient approaches. Directly optimizing a policy, PG algorithms perturb the policy parameters $\theta$ in the direction of the gradient of the expected return $\nabla_\theta \mathbb{E}[G^{\pi_\theta}]$, which is conveniently given by the policy gradient theorem [18, 42]:
$$\nabla_\theta \mathbb{E}[G^{\pi_\theta}] = \mathbb{E}_{h,a}\big[\nabla_\theta \log \pi_\theta(a \mid h)\, Q^{\pi_\theta}(h, a)\big] \quad (1)$$
Actor-critic methods [18] directly implement the policy gradient theorem by learning the value function $Q^{\pi_\theta}$, i.e., the critic, commonly through TD learning [41]. The policy gradient for updating the policy (i.e., the actor) then follows the return estimates given by the critic (Equation 1).

We introduce three extensions of single-agent actor-critic methods to multi-agent settings, which are highlighted in Table 1. The first AC multi-agent extension, Joint Actor Critic (JAC) [3, 47], treats the multi-agent environment as a single-agent environment and learns in the joint observation-action space; JAC learns a centralized actor, $\boldsymbol{\pi}(\boldsymbol{a} \mid \boldsymbol{h}; \theta)$, and a centralized value function (critic), $Q^{\boldsymbol{\pi}}(\boldsymbol{h}, \boldsymbol{a}; \phi)$. The policy gradient for JAC follows that of single-agent actor-critic:
$$\nabla J(\theta) = \mathbb{E}_{\boldsymbol{a}, \boldsymbol{h}}\big[\nabla \log \boldsymbol{\pi}(\boldsymbol{a} \mid \boldsymbol{h}; \theta)\, Q^{\boldsymbol{\pi}}(\boldsymbol{h}, \boldsymbol{a}; \phi)\big]. \quad (2)$$
For Proposition 1 and all following results, we employ a fixed-memory history where the history only consists of the past $k$ observations and actions. Formally, a history at timestep $t$ with memory length $k$ is defined as $\boldsymbol{h}_{t,k} = \langle \boldsymbol{h}_{t-1,k-1}, \boldsymbol{a}_{t-1}, \boldsymbol{o}_t \rangle$ when $k > 0$ and $t > 0$, and $\emptyset$ otherwise.

Method          Critic          Actor
JAC [3, 47]     Centralized     Centralized
IAC [11, 43]    Decentralized   Decentralized
IACC [11, 21]   Centralized     Decentralized
Table 1: Our Multi-agent Actor Critic Naming Scheme.
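As an illustration of Equation (1), the following is a minimal sketch (our own toy example, not the paper's implementation) of a tabular actor-critic update with a softmax actor and a TD-trained critic on a hypothetical two-action, one-step problem.

```python
# Minimal tabular actor-critic sketch of Eq. (1) (assumption: a hypothetical
# two-action, one-step bandit; reward distributions are invented for illustration).
import numpy as np

rng = np.random.default_rng(0)
rewards = {0: lambda: rng.normal(1.0, 1.0),
           1: lambda: rng.normal(2.0, 1.0)}

theta = np.zeros(2)        # actor parameters (softmax logits)
Q = np.zeros(2)            # critic estimate of the expected return
alpha_pi, alpha_q = 0.05, 0.1

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

for _ in range(2000):
    p = pi(theta)
    a = rng.choice(2, p=p)
    r = rewards[a]()                        # one-step episode, so G = r
    Q[a] += alpha_q * (r - Q[a])            # critic: TD toward the observed return
    grad_log_pi = -p
    grad_log_pi[a] += 1.0                   # d/dtheta log pi(a) for a softmax policy
    theta += alpha_pi * grad_log_pi * Q[a]  # actor: policy gradient of Eq. (1)

print(pi(theta))   # most probability ends up on the higher-reward action 1
```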
The second AC multi-agent extension, called Independent Actor Critic (IAC) [11, 43], learns a decentralized policy and critic $\langle \pi_i(a_i \mid h_i; \theta_i), Q_i(h_i, a_i; \phi_i) \rangle$ for each of the agents locally. At every timestep $t$, a local experience $\langle h_{i,t}, a_{i,t} \rangle$ is generated for agent $i$. The policy gradient learning for agent $i$ is defined as
$$\nabla_{\theta_i} J_d(\theta_i) = \mathbb{E}_{\boldsymbol{a}, \boldsymbol{h}}\big[\nabla \log \pi_i(a_i \mid h_i; \theta_i)\, Q_i(h_i, a_i; \phi_i)\big]. \quad (3)$$
Finally, we define Independent Actor with Centralized Critic (IACC), a class of centralized critic methods where a joint value function $Q^{\boldsymbol{\pi}}(\boldsymbol{h}, \boldsymbol{a}; \phi)$ is used to update each decentralized policy $\pi_{\theta_i}$ [3, 11]. (For example, COMA [11] is then considered an IACC approach with a variance-reduction baseline.) Naturally, the policy gradient for decentralized policies with a centralized critic is defined as
$$\nabla_{\theta_i} J_c(\theta_i) = \mathbb{E}_{\boldsymbol{a}, \boldsymbol{h}}\big[\nabla \log \pi_i(a_i \mid h_i; \theta_i)\, Q^{\boldsymbol{\pi}}(\boldsymbol{h}, \boldsymbol{a}; \phi)\big]. \quad (4)$$
At any timestep, the joint expected return estimate $Q^{\boldsymbol{\pi}}(\boldsymbol{h}, \boldsymbol{a}; \phi)$ is used to update the decentralized policy $\pi_i(a_i \mid h_i; \theta_i)$. Notice that the centralized critic $Q^{\boldsymbol{\pi}}(\boldsymbol{h}, \boldsymbol{a}; \phi)$ estimates the return based on joint information (all agents' action-histories), which differs from the decentralized case in Eq. 3. In the following section, we shall show that, from the local viewpoint of agent $i$, for each joint action-history $\langle \boldsymbol{h}_t, \boldsymbol{a}_t \rangle$, $Q^{\boldsymbol{\pi}}(\boldsymbol{h}_t, \boldsymbol{a}_t; \phi)$ is a sample from the return distribution given the local action-history, $\Pr(G_{t:T} \mid h_{i,t}, a_{i,t})$, while the decentralized critic $Q_i(h_{i,t}, a_{i,t})$ provides an expectation.

In this section, we prove that policies have the same expected gradient whether using centralized or decentralized critics. We prove that the centralized critic provides unbiased and correct on-policy return estimates, but at the same time makes the agents suffer from the same action shadowing problem [24] seen in decentralized learning. It is reassuring that the centralized critic will not encourage a decentralized policy to pursue a joint policy that is only achievable in a centralized manner, but this also calls into question the benefits of a centralized critic. We provide a theoretical proof of bias equivalence, then analyze a classic example for intuitive understanding.
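Before turning to the analysis, the following sketch (our own illustration, not the paper's code) contrasts the two update rules from Eqs. (3) and (4): in IAC each actor is weighted by its own critic $Q_i$, while in IACC every actor is weighted by the same joint value. The table-based critics are assumed to be given.

```python
# Sketch of the IAC (Eq. 3) vs. IACC (Eq. 4) gradient estimators for softmax
# decentralized actors (assumption: stateless game, lookup-table critics).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def score(logits, a):
    """d/dlogits log pi(a | .) for a softmax policy."""
    g = -softmax(logits)
    g[a] += 1.0
    return g

def iac_grads(logits, joint_action, dec_critics):
    # Eq. (3): agent i is weighted by its own decentralized critic Q_i(a_i).
    return [score(logits[i], a) * dec_critics[i][a]
            for i, a in enumerate(joint_action)]

def iacc_grads(logits, joint_action, cen_critic):
    # Eq. (4): every agent is weighted by the same joint value Q(a_1, a_2).
    q = cen_critic[tuple(joint_action)]
    return [score(logits[i], a) * q for i, a in enumerate(joint_action)]

if __name__ == "__main__":
    logits = [np.zeros(2), np.zeros(2)]             # two agents, two actions each
    dec_critics = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]
    cen_critic = {(0, 0): 2.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
    print(iac_grads(logits, (0, 1), dec_critics))
    print(iacc_grads(logits, (0, 1), cen_critic))
```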
We first show the gradient updates for IAC and IACC are the same in expectation. We assume the existence of a limiting distribution $\Pr(\boldsymbol{h})$ over fixed-length histories, analogous to a steady-state assumption. That is, we treat fixed-memory trajectories as nodes in a Markov chain for which the policies induce a stationary distribution.

Proposition 1. Suppose that agent histories are truncated to an arbitrary but finite length. A stationary distribution on the set of all possible histories of this length is guaranteed to exist under any collection of agent policies.
We provide formal justification for Proposition 1 in Appendix A.1, which is necessary to derive Bellman equations in our following theoretical results. With this, we begin by establishing novel convergence results for the centralized and decentralized critics in Lemmas 1 and 2 below, which culminate in Theorem 1 regarding the expected policy gradient.
Lemma 1. Given the existence of a steady-state history distribution (Proposition 1), training of the centralized critic is characterized by the Bellman operator $B_c$, which admits a unique fixed point $Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j)$, where $Q^{\boldsymbol{\pi}}$ is the true expected return under the joint policy $\boldsymbol{\pi}$.
Lemma 2. Given the existence of a steady-state history distribution (Proposition 1), training of the $i$-th decentralized critic is characterized by a Bellman operator $B_d$, which admits a unique fixed point $\mathbb{E}_{h_j, a_j}\big[Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j)\big]$, where $Q^{\boldsymbol{\pi}}$ is the true expected return under the joint policy $\boldsymbol{\pi}$.
Theorem 1. After convergence of the critics' value functions, the expected policy gradients for the centralized actor and the decentralized actors are equal. That is,
$$\mathbb{E}[\nabla_\theta J_c(\theta)] = \mathbb{E}[\nabla_\theta J_d(\theta)] \quad (5)$$
where $J_c$ and $J_d$ are the respective objective functions for the central and decentralized actors, and the expectation is taken over all joint histories and joint actions. All actors are assumed to have the same policy parameterization.

Proof sketch: We derive Bellman equations for the centralized and decentralized critics and express them as Q-function operators. We show that these operators are contraction mappings and admit fixed points at $Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j)$ and $\mathbb{E}_{h_j, a_j}\big[Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j)\big]$, respectively. These convergence results reveal that the decentralized critic becomes the marginal expectation of the centralized critic after training for an infinite amount of time. Under the total expectation over joint histories and joint actions, these fixed points are identically equal, implying that gradients computed for the centralized and decentralized actors are the same in expectation and therefore unbiased. The full proofs for Lemmas 1 and 2 and Theorem 1 are respectively provided in Appendices A.2, A.3, and A.4. (While it appears that we analyze the case with two agents, $i$ and $j$, the result holds for arbitrarily many agents by letting $j$ represent all agents except $i$.)

Our theory assumes that the critics are trained sufficiently to converge to their true on-policy values. This assumption often appears in the form of infinitesimal step sizes for the actors [4, 11, 18, 38, 51] in convergence arguments for AC, since the critics are on-policy return estimates and the actors need an unbiased and up-to-date critic. Although this assumption is in line with previous theoretical works, it is nevertheless unrealistic; we discuss the practical implications of relaxing this assumption in Section 6.1.

We use a classic matrix game as an example to intuitively highlight that IAC and IACC give the same policy gradient in expectation. The Climb Game [6], whose reward function is shown in Table 2, is a matrix game (a state-less game) in which agents are supposed to cooperatively achieve the highest reward of 11 by taking the joint action $\langle u_1, u_1 \rangle$, facing the risk of being punished by $-30$ or $0$ when agents miscoordinate. It is difficult for independent learners to converge onto the optimal $\langle u_1, u_1 \rangle$ actions due to the low expected return for agent 1 to take $u_1$ when agent 2's policy is not already favoring $u_1$, and vice versa.

This cooperation issue arises when some potentially good action $a$ has low ("shadowed") on-policy values because yielding a high return depends on other agents' cooperating policies, yet frequently taking action $a$ is essential for other agents to learn to adjust their policies accordingly, creating a dilemma where agents are unwilling to frequently take the low-value action and are therefore stuck in a local optimum. In the case of the Climb Game, the value of $u_1$ is often shadowed, because $u_1$ does not produce a satisfactory return unless the other agent also takes $u_1$ frequently enough. This commonly occurring multi-agent local optimum is called a shadowed equilibrium [12, 31], a known difficulty in independent learning which usually requires an additional cooperative mechanism (e.g., some form of centralization) to overcome.

Solving the Climb Game independently with IAC, assume the agents start with uniformly random policies. In expectation, IAC's decentralized critic would estimate $Q_1(u_1)$ at $(1/3) \times 11 + (1/3) \times (-30) + (1/3) \times 0 \approx -6.3$, making $u_3$ (with $Q_1(u_3) \approx 1.7$) a much more attractive action, and the same applies for agent 2. Naturally, both agents will update towards favoring $u_3$; continuing down this path, agents would never favor $u_1$ and never discover the optimal value of $u_1$: $Q^*_i(u_1) = Q(u_1, u_1) = 11$, since there is no environmental stochasticity. However, consider that at timestep $t$ agent 1 takes the optimal action $u_1$; the (centralized) Q-value estimate used in the policy gradient, $\nabla Q(u_1, a_{2,t}) \pi_1(u_1)$, actually depends on what action agent 2 chooses to take according to its policy, $a_{2,t} \sim \pi_2$. Again assuming uniform policies, consider a rollout where action $a_{2,t}$ is sampled from the policy of agent 2; then with sufficient sampling, we expect the mean policy gradient (given by the centralized critic) for updating $\pi_1(u_1)$ to be
$$\begin{aligned} \mathbb{E}_{\pi_2} \nabla J_c(\theta_1 \mid a_{1,t} = u_1) &= \pi_2(u_1) \nabla Q(u_1, u_1) \pi_1(u_1) + \pi_2(u_2) \nabla Q(u_1, u_2) \pi_1(u_1) + \pi_2(u_3) \nabla Q(u_1, u_3) \pi_1(u_1) \\ &= (1/3) \nabla\, 11\, \pi_1(u_1) + (1/3) \nabla (-30)\, \pi_1(u_1) + (1/3) \nabla\, 0\, \pi_1(u_1) \\ &\approx -6.3 \cdot \nabla \pi_1(u_1) \end{aligned} \quad (6)$$
That is, with probabilities $\pi_2(u_1)$, $\pi_2(u_2)$ and $\pi_2(u_3)$, the joint Q-values $11$, $-30$ and $0$ are sampled for generating the policy gradient of action $u_1$ for agent 1, $\pi_1(u_1)$. This implies that the sample-average gradient for $\pi_1(u_1)$ is high only when agent 2 takes action $u_1$ frequently. If agent 2 has a uniform policy, the sample-average gradient $-6.3\, \nabla \pi_1(u_1)$ cannot compete with the gradient for $u_3$ at $1.7\, \nabla \pi_1(u_3)$. Therefore, in the IACC case with a centralized critic, we see the rise of an almost identical action shadowing problem to the one we described for IAC, even though the centralized critic is trained jointly and has the correct estimate of the optimal joint action.
Figure 1: Climb Game empirical results (50 runs per method) showing both decentralized and centralized critic methods succumb to the shadowed equilibrium problem.

                      agent 2
                      u1      u2      u3
agent 1     u1        11     -30       0
            u2       -30       7       6
            u3         0       0       5
Table 2: Return values for Climb Game [6].
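The analysis above can be checked numerically. The sketch below (our own check, not the paper's code) uses the Table 2 payoffs and uniform policies: the decentralized critic equals the marginal of the centralized critic (Lemmas 1 and 2), so the expected gradient coefficients coincide, while the per-sample centralized coefficients vary.

```python
# Numeric sketch of the Climb Game analysis under uniform policies.
import numpy as np

R = np.array([[ 11., -30.,  0.],    # rows: agent 1's actions u1..u3
              [-30.,   7.,  6.],    # cols: agent 2's actions u1..u3
              [  0.,   0.,  5.]])
pi2 = np.ones(3) / 3.0              # agent 2 is uniform

# Decentralized critic of agent 1 (Lemma 2): marginal over agent 2's action.
Q1_dec = R @ pi2
print(Q1_dec)                       # ~[-6.33, -5.67, 1.67], matching the text

# Expected coefficient for grad log pi_1(u) in Eq. (4), with a2 ~ pi2.
Q1_cen_expected = R @ pi2
print(np.allclose(Q1_dec, Q1_cen_expected))   # True: same expected gradient

# But the per-sample coefficient varies under the centralized critic:
print(R[0].std())   # u1 samples 11, -30, 0 instead of the constant -6.33
```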
Empirical evaluation on the Climb Game (shown in Figure 1) conforms to our analysis, showing both methods converge to the suboptimal solution $\langle u_3, u_3 \rangle$. At the same time, unsurprisingly, a centralized controller always finds the optimal solution $\langle u_1, u_1 \rangle$. In general, we observe that the centralized critic has the information of the optimal solution, information that is only obtainable in a centralized fashion and is valuable for agents to break out of their cooperative local optima. However, this information cannot be effectively utilized by the individual actors to form a cooperative policy. Therefore, contrary to the common intuition, in its current form, the centralized critic is unable to foster cooperative behavior more easily than the decentralized critics.

In this section, we first show that with true on-policy value functions, the centralized critic formulation can increase policy gradient variance. More precisely, we prove that the policy gradient variance using a centralized critic is at least as large as the policy gradient variance with decentralized critics. We again assume that the critics have converged under fixed policies, thus ignoring the variance due to value function learning; we discuss the relaxation of this assumption in Section 6. We begin by comparing the policy gradient variance between centralized and decentralized critics:
Theorem 2. Assume that all agents have the same policy parameterization. After convergence of the value functions, the variance of the policy gradient using a centralized critic is at least as large as that of a decentralized critic along every dimension.
Proof sketch: As with our proof of Theorem 1, Lemmas 1 and 2 show that a decentralized critic's value estimate is equal to the marginal expectation of the central critic's Q-function after convergence. This implies that the decentralized critics have already averaged out the randomness caused by the other agents' decisions. Since each agent has the same policy parameterization, their policy gradient covariance matrices are equal up to the scale factors induced by the critics' value estimates. By Jensen's inequality, we show that the additional stochasticity of the central critic can increase (but not decrease) these scale factors compared to a decentralized critic; hence, $\mathrm{Var}(\nabla J_c(\theta)) \geq \mathrm{Var}(\nabla J_d(\theta))$ element-wise.

In the following subsections, we define and analyze this variance increase by examining its two independent sources: the "Multi-Action Variance" (MAV) induced by the other actors' policies, and the "Multi-Observation Variance" (MOV) induced by uncertainty regarding the other agents' histories from the local perspective of a single agent. We introduce these concepts along with concrete examples to illustrate how they affect learning.

We discuss MAV in the fully observable case and will address the partially observable case in MOV. Intuitively, from the local perspective of agent $i$, when taking an action $a_i$ at state $s$, MAV is the variance $\mathrm{Var}[G(s, a_i)]$ in return estimates due to the fact that teammates might take different actions according to their own stochastic policies. With decentralized critics, MAV is averaged into the value function (Lemma 2), which is an expectation incorporating teammate actions $a_j$; thus, at a given timestep $t$, $Q_i(s, a_i)$ has no variance. On the other hand, a centralized critic $Q(s, a_i, a_j)$ distinguishes between all action combinations $\langle a_i, a_j \rangle$ (Lemma 1), but $a_j$ is sampled by agent $j$ during execution, $a_j \sim \pi_j(s)$; therefore, the value of $Q(s, a_i, a_j)$ varies during policy updates depending on $a_j$. We propose a simple domain to clarify this variance's cause and effect and show that a centralized critic transfers MAV directly to policy updates.

The Morning Game, inspired by Peshkin et al. [32] and shown in Table 3, consists of two agents collaborating on making breakfast, in which the most desired combination is $\langle cereal, milk \rangle$. Since there is no environmental stochasticity, a centralized critic can robustly learn all the values correctly after only a few samples. In contrast, the decentralized critics need to average over unobserved teammate actions. Take a closer look at action $cereal$: for a centralized critic, cereal with milk returns 3 and cereal with vodka returns 0; meanwhile, a decentralized critic receives stochastic targets (3 or 0) for taking action $cereal$, and only when agent 2 favors $milk$ does a return of 3 come more often, which then would make $Q_1(cereal)$ a higher estimate. Therefore, the centralized critic has a lower variance (zero in this case), and the decentralized critic has a large variance on the update target.

Often neglected is that using a centralized critic has higher variance when it comes to updating decentralized policies. Suppose agents employ uniform random policies for both IAC and IACC, in which case agent 1's local expected return for $cereal$ would be $Q_1(cereal) = \pi_2(milk) \cdot 3 + \pi_2(vodka) \cdot 0 = 1.5$. Assuming converged value functions, then in IACC, a centralized critic would uniformly give either $Q(cereal, milk)$ or $Q(cereal, vodka)$ for $\nabla \pi_1(cereal)$ updates, and either $Q(pickles, milk)$ or $Q(pickles, vodka)$ for $\nabla \pi_1(pickles)$ updates. With IAC, a decentralized critic always gives $Q_1(cereal) = 1.5$ for $\nabla \pi_1(cereal)$, and $Q_1(pickles) = 0.5$ for $\nabla \pi_1(pickles)$. Obviously, under both methods, $\pi_1$ converges towards $cereal$, but the decentralized critic makes the update direction less variable and much more deterministic in favor of $cereal$.

                      agent 1
                      pickles     cereal
agent 2     vodka                    0
            milk                     3
Table 3: Return values for the proposed Morning Game.

In the partially observable case, another source of variance in the local value, $\mathrm{Var}[G(h_i, a_i)]$, comes from factored observations. More concretely, for an agent in a particular local trajectory $h_i$, other agents' experiences $h_j \in \mathcal{H}_j$ may vary, over which the decentralized agent has to average. A decentralized critic is designed to average over this observation variance and provide a single expected value for each local trajectory, $Q_i(h_i, a_i)$. The centralized critic, on the other hand, is able to distinguish each combination of trajectories $\langle h_i, h_j \rangle$, but when used for a decentralized policy at $h_i$, the teammate history $h_j$ can be considered to be sampled from $\Pr(h_j \mid h_i)$, and we expect the mean estimated return during the update process to be $\mathbb{E}_{h_j} Q(h_i, h_j, \boldsymbol{a})$.

We use a thought experiment as an example. Consider a one-step task where two agents have binary actions and are individually rewarded $r$ if the action matches the other's randomly given binary observation and $-r$ for a mismatch; that is, $R(h_i, h_j, a_i) = r$ when $h_j = a_i$ and $-r$ otherwise. With any policy, assuming converged value functions, a decentralized critic would estimate $Q(h_i, a_i) = 0$ with no variance, while the centralized critic's estimate for each $\langle h_i, h_j, a_i \rangle$ is deterministically $r$ or $-r$; from agent $i$'s local view, it hence provides $r$ with probability 0.5 ($h_j = a_i$ by definition) and $-r$ with probability 0.5 ($h_j \neq a_i$), resulting in a variance of $r^2$. In this example, we see that a centralized critic produces return estimates with more significant variance when agents have varying observations. (We also propose a toy domain called the Guess Game based on this thought experiment, which we elaborate on and show empirical results for in Appendix C.)

In this section, we discuss the anticipated trade-off in practice. We look at the practical aspects of a centralized critic and address the unrealistic true value function assumption. We note that, although both types of critics have the same expected gradient for policy updates, they have different amounts of actual bias in practice. We also discuss how and why the way of handling variance is different for IAC and IACC. We conclude that we do not expect one method to dominate the other in terms of performance in general.
So far, in terms of theory, we have only considered value functions that are assumed to be correct. However, in practice, value functions are difficult to learn. We argue that this is generally more so for decentralized value functions.

In MARL, the on-policy value function is non-stationary since the return distributions heavily depend on the current joint policy. When policies update and change, the value function is partially obsolete and is biased towards the historical policy-induced return distribution. Bootstrapping from outdated values creates bias. Compounding bias can cause learning instability, depending on the learning rate and how drastically the policy changes.

The non-stationarity applies to both types of critics since they are both on-policy estimates. However, centralized value function learning is generally better equipped in the face of non-stationarity because it has no variance in the update targets. Therefore, the bootstrapping may be more stable in the case of a centralized critic. As a result, in a cooperative environment with a moderate number of agents (as discussed later), we expect a centralized critic to learn more stably and be less biased, perhaps counteracting the effect of having larger variance in the policy gradient.
Learning a decentralized policy requires reasoning about local optimality. That is, given local information $h_i$, agent $i$ needs to explicitly or implicitly consider the distribution of global information, i.e., the other agents' experiences and actions $\Pr(a_j, h_j \mid h_i)$, through which agent $i$ needs to take the expectation of global action-history values. Abstractly, the process of learning via sampling over the joint space of $a_j$ and $h_j$ thus generates MAV and MOV for agent $i$'s policy gradient. Interestingly, this process is inevitable but takes place in different forms: during IAC and IACC training, this averaging is done by different entities. In IAC, those expectations are implicitly taken by the decentralized critic and produce a single expected value for the local history; this is precisely why decentralized critic learning has unstable learning targets, as discussed in Section 6.1. On the other hand, in IACC, the expectation takes place directly in the policy learning. Different samples of global value estimates are used in policy updates for a local trajectory, hence the higher policy gradient variance we discussed in Section 5. Thus, for decentralized policy learning purposes, we expect decentralized critics to give estimates with more bias and less variance, and the centralized critic to give estimates with less bias and more variance. Consequently, the trade-off largely depends on the domain; as we shall see in the next section, certain domains favor a more stable policy while others favor a more accurate critic.

Another important consideration is the scale of the task. A centralized critic's feature representation needs to scale linearly (in the best case) or exponentially (in the worst case) with the number of agents in the system. In contrast, a decentralized critic's number of features can remain constant, and in homogeneous-agent systems, decentralized critics can even share parameters. Also, some environments may not require much reasoning about other agents. For example, in environments where agents' decisions rarely depend on other agents' trajectories, the gain of learning value functions jointly is likely to be minimal, and we expect decentralized critics to perform better while having better sample efficiency in those domains. We show this empirically in Section 7.4.

The impact of variance will also change as the number of agents increases. In particular, when learning stochastic policies with a centralized critic in IACC, the maximum potential variance in the policy gradient also scales with the number of agents (see Theorem 2). On the other hand, IAC's decentralized critics potentially have less stable learning targets in critic bootstrapping with increasing numbers of agents, but the policy updates still have low variance. Therefore, scalability may be an issue for both methods, and the actual performance is likely to depend on the domain, function approximation setups, and other factors. However, we expect that IAC should be a better starting point due to more stable policy updates and potentially shared parameters.
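As a rough illustration of the input-size argument (assumed MLP critics with dimensions chosen arbitrarily; not the paper's architecture), the sketch below shows how the centralized critic's input, and hence its parameter count, grows with the number of agents while a decentralized critic's stays fixed.

```python
# Rough parameter-count sketch for centralized vs. decentralized critics
# (assumption: single-hidden-layer MLPs and invented feature sizes).
def mlp_params(in_dim, hidden=64, out_dim=1):
    # weights + biases of a one-hidden-layer MLP
    return in_dim * hidden + hidden + hidden * out_dim + out_dim

obs_dim, act_dim = 20, 5
for n_agents in (2, 4, 8, 16):
    cen_in = n_agents * (obs_dim + act_dim)   # joint history + joint action
    dec_in = obs_dim + act_dim                # local history + local action
    print(n_agents, mlp_params(cen_in), mlp_params(dec_in))
```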
Combining our discussions in Sections 6.1 and 6.2, we conclude that whether to use critic centralization can essentially be considered a bias-variance trade-off decision. More specifically, it is a trade-off between variance in policy updates and bias in the value function: a centralized critic should have lower bias because it will have more stable Q-values that can be updated straightforwardly when policies change, but higher variance because the policy updates need to be averaged over (potentially many) other agents. In other words, the policies trained by centralized critics avoid the more-biased estimates usually produced by decentralized critics, but in return suffer more variance in the training process. The optimal choice is then largely dependent on the environment settings. Regardless, the centralized critic likely faces more severe scalability issues, not only in critic learning but also in policy gradient variance. As a result, we do not expect one method to always dominate the other in terms of performance.
In this section, we present experimental results comparing centralized and decentralized critics. We test on a variety of popular research domains including (but not limited to) classical matrix games, the StarCraft Multi-Agent Challenge [36], the Particle Environments [25], and the MARL Environments Compilation [15]. Our hyperparameter tuning uses grid search. Each figure, if not otherwise specified, shows the aggregation of 20 runs per method.
As we can see in Figures 2a and 2b, the variance per rollout in the policy gradient for both actions is zero for IAC and non-zero for IACC, validating our theoretical variance analysis. Figure 2c shows how the Q-values evolve for the optimal action $cereal$ in both methods. First, observe that both types of critics converge to the correct value 3, which confirms our bias analysis. Second, the two methods' Q-value variance in fact comes from different sources. For the decentralized critic, the variance comes from the critics having different biases across trials. For centralized critics, there is the additional variance that comes from incorporating other agent actions, producing a high value when a teammate chooses $milk$ and a low value when a teammate chooses $vodka$.
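The per-rollout behavior reported in Figure 2 can be reproduced qualitatively with a small simulation (our own sketch under the converged-critic and uniform-teammate assumptions, not the experiment code used for Figure 2):

```python
# Sketch of the per-rollout gradient coefficient for action 'cereal' in the
# Morning Game, assuming converged critics and a uniform teammate policy.
import numpy as np

rng = np.random.default_rng(0)
Q_cen = {('cereal', 'milk'): 3.0, ('cereal', 'vodka'): 0.0}
Q_dec_cereal = 0.5 * 3.0 + 0.5 * 0.0       # = 1.5, the marginal value (Lemma 2)

teammate = rng.choice(['milk', 'vodka'], size=10_000)
iacc_coeff = np.array([Q_cen[('cereal', a2)] for a2 in teammate])
iac_coeff = np.full_like(iacc_coeff, Q_dec_cereal)

print(iacc_coeff.mean(), iac_coeff.mean())   # both ~1.5: same expectation
print(iacc_coeff.var(), iac_coeff.var())     # ~2.25 vs. 0.0: extra MAV for IACC
```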
Figure 2: Gradient updates of the Morning Game (200 independent trials per method). (a) Per-rollout gradient variance for action pickles. (b) Per-rollout gradient variance for action cereal. (c) Q values used for updating $\pi_1(cereal)$; see Figure 6 for a clearer illustration.
Figure 3: Performance comparison in different domains: (a) Dec-Tiger, (b) Cleaner and (c) Move Box. Dec-Tiger and Cleaner highlight instability resulting from high-variance actor updates employing a centralized critic; Move Box shows the centralized critic is not able to bias the actors towards the joint optimum (at 100).
We test on the classic yet difficult Dec-Tiger domain [26], a multi-agent extension of the Tiger domain [17]. To end an episode, each agent has a high-reward action (opening a door with treasure inside) and a high-punishment action (opening a door with a tiger inside). The treasure and tiger are randomly initialized in each episode; hence, a third action (listen) gathers noisy information regarding which of the two doors is the rewarding one. The multi-agent extension of Tiger requires two agents to open the correct door simultaneously in order to gain the maximum return. Conversely, if the bad action is taken simultaneously, the agents take less punishment. Note that any fast-changing decentralized policies are less likely to coordinate the simultaneous actions with high probability, thus lowering return estimates for the critic and hindering joint policy improvement. As expected, we see in Figure 3a that IACC (with a centralized critic and higher policy gradient variance) does not perform as well as IAC. In the end, the IACC agents learn to avoid the high punishment (agents simultaneously opening different doors) and the cost of listening by opening an agreed-upon door on the first timestep. IACC gives up completely on high returns (where both agents listen for some timesteps and open the correct door at the same time, +20) because the unstable policies make coordinating a high return of +20 extremely unlikely.
We observe similar policy degradation in the Cleaner domain [15], a grid-world maze in which agents are rewarded for stepping onto novel locations in the maze. The optimal policy is to have the agents split up and cover as much ground as possible. The maze has two non-colliding paths (refer to Appendix D.3 for a visualization), so that as soon as the agents split up, they can follow a locally greedy policy to get an optimal return. However, with a centralized critic (IACC), both agents start to take the longer path with more locations to "clean." The issue is that when the policy-shifting agents are not completely locally greedy, they cannot "clean" enough ground on their paths. Subsequently, they discover that having both agents go for the longer path (the lower path) yields a better return, converging to a suboptimal solution. Again we see that in IACC with a centralized critic, due to the high variance we discussed in Section 6.2, the safer option is favored, resulting in both agents completely ignoring the other path (performance shown in Figure 3b). Overall, we see that high variance in the policy gradient (in the case of the centralized critic) makes the policy more volatile and can result in poor coordination performance in environments that require coordinated series of actions to discover the optimal solution.
Move Box [15] is another commonly used domain, where grid-world agents are rewarded for pushing a heavy box (requiring both agents) onto either of two destinations (see Appendix D.4 for details). The farther destination gives +100 reward while the nearer destination gives +10. Naturally, the optimal policy is for both agents to go for +100. However, if either of the agents is unwilling to do so, this optimal option is "shadowed" and both agents will have to go for +10. We see in Figure 3c that both methods fall for the shadowed equilibrium, favoring the safe but less rewarding option.

Analogous to the Climb Game, even if the centralized critic is able to learn the optimal values due to its unbiased on-policy nature shown in Section 4.1, the on-policy return of the optimal option is extremely low under an uncooperative teammate policy; thus, the optimal actions are rarely sampled when updating the policies. The same applies to both agents, so the system reaches a suboptimal equilibrium, even though IACC is trained in a centralized manner.
Figure 4: Performance comparison of domains where IAC does better than IACC: Find Treasure, Cross and Antipodal.
Figure 5: Performance comparison on Capture Target under various grid world sizes.
Through tests on Go Together [15], Merge [50], Predator and Prey [21], Capture Target [30, 48], Small Box Pushing and SMAC [36] tasks (see Appendices D, E, F, G, H, I, and J), we observe that the two types of critics perform similarly in all these domains, with IACC (with a centralized critic) being less stable in only a few other domains, shown in Figures 3 and 4. Since the performance of the two critic types is similar in most results, we expect that this is due to the fact that both are unbiased asymptotically (Lemmas 1 and 2 and Theorem 1). We observe that, although decentralized critics might be more biased when considering finite training, this does not affect real-world performance in a significant fashion in these domains.

In the cooperative navigation domains Antipodal, Cross [25, 50], and Find Treasure [15], we observe a more pronounced performance difference among runs (Figure 4). In these cooperative navigation domains (details in Appendices D.2 and E), there are no suboptimal equilibria that trap the agents, and on most of the timesteps the optimal action aligns with the locally greedy action. Those tasks only require agents to coordinate their actions for a few timesteps to avoid collisions. It appears that those tasks are easy to solve, but the observation space is continuous, thus causing large MOV in the gradient updates for IACC. Observe that some IACC runs struggle to reach the optimal solution robustly, while IAC converges robustly, conforming to our scalability discussion regarding large MOV. A centralized critic induces higher variance for policy updates, where the shifting policies can become a drag on the value estimates which, in turn, become a hindrance to improving the policies themselves.

The scalability issue can be better highlighted in environments where we can increase the observation space. For example, in Capture Target [22], where agents are rewarded for simultaneously catching a moving target in a grid world (details in Appendix G), by increasing the grid size from 4 × 4 to 12 × 12 we see a notable comparative drop in overall performance for IACC (Figure 5). Since an increase in observation space leads to an increase in Multi-Observation Variance (MOV) and nothing else, it indicates that here the policies of IACC do not handle MOV as well as the decentralized critics in IAC. The result might imply that, for large environments, decentralized critics scale better in the face of MOV due to the fact that they do not involve MOV in policy learning.
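The MOV effect can also be checked numerically. The sketch below (our own illustration of the binary matching thought experiment from the MOV discussion earlier, not the paper's Guess Game implementation) shows that the centralized values sampled by the actor have mean zero, matching the decentralized estimate, but variance $r^2$.

```python
# Numeric sketch of the MOV thought experiment: reward +r when agent i's
# action equals agent j's hidden binary observation, -r otherwise.
import numpy as np

rng = np.random.default_rng(0)
r = 2.0
h_j = rng.integers(0, 2, size=100_000)   # teammate observations, unseen by agent i
a_i = 0                                  # any fixed local action of agent i

# Decentralized critic (Lemma 2): marginal over h_j -> value 0, with no variance.
targets_cen = np.where(h_j == a_i, r, -r)   # centralized values seen per rollout
print(targets_cen.mean())                   # ~0, matching the decentralized estimate
print(targets_cen.var())                    # ~r**2, the extra MOV in actor updates
```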
In this paper, we present an examination of critic centralization, theoretically and empirically. The core takeaways are: 1) in theory, centralized and decentralized critics are the same in expectation for the purpose of updating decentralized policies; 2) in theory, centralized critics lead to higher variance in policy updates; 3) in practice, there is a bias-variance trade-off due to potentially higher bias with limited samples and less-correct value functions with decentralized critics; 4) in practice, a decentralized critic regularly gives more robust performance, since stable policy gradients appear to be more crucial than stable value functions in our domains.

Although IACC uses a centralized critic that is trained in a centralized manner, the method does not produce policies that exploit centralized knowledge. Therefore, future work on IACC may explore feasible ways of biasing the decentralized policies towards a better joint policy by exploiting the centralized information. Reducing variance in policy updates and other methods that make better use of centralized training are promising future directions.
We thank the reviewers for their helpful feedback. We also thank Andrea Baisero and Linfeng Zhao for helpful comments and discussions. This research is supported in part by the U.S. Office of Naval Research under award number N00014-19-1-2131, Army Research Office award W911NF-20-1-0265, and an Amazon Research Award.
REFERENCES
[1] Christopher Amato, Jilles Steeve Dibangoye, and Shlomo Zilberstein. 2009. Incremental policy generation for finite-horizon DEC-POMDPs. In
Proceedings ofthe Nineteenth International Conference on International Conference on AutomatedPlanning and Scheduling . 2–9.[2] Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, BobMcGrew, and Igor Mordatch. 2020. Emergent tool use from multi-agent au-tocurricula. In
Proceedings of the Eighth International Conference on LearningRepresentations .[3] Guillaume Bono, Jilles Steeve Dibangoye, Laëtitia Matignon, Florian Pereyron,and Olivier Simonin. 2018. Cooperative multi-agent policy gradient. In
JointEuropean Conference on Machine Learning and Knowledge Discovery in Databases .Springer, 459–476.[4] Michael Bowling and Manuela Veloso. 2001. Convergence of gradient dynamicswith a variable learning rate. In
International Conference on Machine Learning .27–34.[5] Jhelum Chakravorty, Patrick Nadeem Ward, Julien Roy, Maxime Chevalier-Boisvert, Sumana Basu, Andrei Lupu, and Doina Precup. 2020. Option-critic incooperative multi-agent systems. In
Proceedings of the 19th International Confer-ence on Autonomous Agents and MultiAgent Systems . 1792–1794.[6] Caroline Claus and Craig Boutilier. 1998. The dynamics of reinforcement learningin cooperative multiagent systems.
AAAI/IAAI.
[7] Abhishek Das, Théophile Gervet, Joshua Romoff, Dhruv Batra, Devi Parikh, Michael Rabbat, and Joelle Pineau. 2019. TarMAC: targeted multi-agent communication. In International Conference on Machine Learning. 1538–1546.
[8] Christian Schroeder de Witt, Bei Peng, Pierre-Alexandre Kamienny, Philip Torr, Wendelin Böhmer, and Shimon Whiteson. 2018. Deep multi-agent reinforcement learning for decentralized continuous cooperative control. arXiv preprint arXiv:2003.06709v3 (2018).
[9] Yali Du, Lei Han, Meng Fang, Tianhong Dai, Ji Liu, and Dacheng Tao. 2019. LIIR: learning individual intrinsic reward in multi-agent reinforcement learning. In
Advances in Neural Information Processing Systems .[10] Jakob Foerster, Ioannis Alexandros Assael, Nando De Freitas, and Shimon White-son. 2016. Learning to communicate with deep multi-agent reinforcement learn-ing. In
Advances in neural information processing systems . 2137–2145.[11] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, andShimon Whiteson. 2018. Counterfactual multi-agent policy gradients. In
Thirty-second AAAI conference on artificial intelligence .[12] Nancy Fulda and Dan Ventura. 2007. Predicting and preventing coordinationproblems in cooperative Q-learning systems.. In
International Joint Conferenceson Artificial Intelligence , Vol. 2007. 780–785.[13] Shariq Iqbal and Fei Sha. 2019. Actor-attention-critic for multi-agent reinforce-ment learning. In
Proceedings of the 36th International Conference on MachineLearning , Vol. 97. PMLR, 2961–2970.[14] Jiechuan Jiang and Zongqing Lu. 2018. Learning attentional communication formulti-agent cooperation. In
Advances in neural information processing systems .7254–7264.[15] Shuo Jiang. 2019. Multi agent reinforcement learning environments compila-tion. https://github.com/Bigpig4396/Multi-Agent-Reinforcement-Learning-Environment[16] Emilio Jorge, Mikael Kågebäck, Fredrik D Johansson, and Emil Gustavsson. 2016.Learning to play guess who? and inventing a grounded language as a consequence. arXiv preprint arXiv:1611.03218 (2016).[17] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. 1998. Plan-ning and acting in partially observable stochastic domains.
Artificial Intelligence.
[18] Vijay R Konda and John N Tsitsiklis. 2000. Actor-critic algorithms. In Advances in neural information processing systems. 1008–1014.
[19] Hyun-Rok Lee and Taesik Lee. 2019. Improved cooperative multi-agent reinforcement learning algorithm augmented by mixing demonstrations from centralized policy. In
Proceedings of the 18th International Conference on Autonomous Agentsand MultiAgent Systems . 1089–1098.[20] Shihui Li, Yi Wu, Xinyue Cui, Honghua Dong, Fei Fang, and Stuart Russell. 2019.Robust multi-agent reinforcement learning via minimax deep deterministic policygradient.
Proceedings of the AAAI Conference on Artificial Intelligence
33 (2019),4213–4220. https://doi.org/10.1609/aaai.v33i01.33014213[21] Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch.2017. Multi-agent actor-critic for mixed cooperative-competitive environments.In
Advances in neural information processing systems . 6379–6390.[22] Xueguang Lyu and Christopher Amato. 2020. Likelihood quantile networksfor coordinating multi-agent reinforcement learning. In
Proceedings of the 19thInternational Conference on Autonomous Agents and MultiAgent Systems . 798–806.[23] Anuj Mahajan, Tabish Rashid, Mikayel Samvelyan, and Shimon Whiteson. 2019.MAVEN: multi-agent variational exploration. In
Proceedings of the Thirty-thirdAnnual Conference on Neural Information Processing Systems .[24] Laetitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat. 2012. Indepen-dent reinforcement learners in cooperative Markov games: a survey regarding coordination problems.
Knowledge Engineering Review
27, 1 (2012), 1–31.[25] Igor Mordatch and Pieter Abbeel. 2017. Emergence of grounded compositionallanguage in multi-agent populations. arXiv preprint arXiv:1703.04908 (2017).[26] Ranjit Nair, Milind Tambe, Makoto Yokoo, David Pynadath, and Stacy Marsella.2003. Taming decentralized POMDPs: towards efficient policy computation formultiagent settings. In
International Joint Conferences on Artificial Intelligence ,Vol. 3. 705–711.[27] Frans A. Oliehoek and Christopher Amato. 2016.
A Concise Introduction toDecentralized POMDPs . Springer.[28] Frans A Oliehoek, Matthijs TJ Spaan, and Nikos Vlassis. 2008. Optimal andapproximate Q-value functions for decentralized POMDPs.
Journal of ArtificialIntelligence Research
32 (2008), 289–353.[29] Shayegan Omidshafiei, Dong-Ki Kim, Miao Liu, Gerald Tesauro, Matthew Riemer,Christopher Amato, Murray Campbell, and Jonathan P How. 2019. Learning toteach in cooperative multiagent reinforcement learning. In
Proceedings of theAAAI Conference on Artificial Intelligence , Vol. 33. 6128–6136.[30] Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, andJohn Vian. 2017. Deep decentralized multi-task multi-Agent reinforcement learn-ing under partial observability. In
International Conference on Machine Learning .2681–2690.[31] Liviu Panait, Karl Tuyls, and Sean Luke. 2008. Theoretical advantages of le-nient learners: An evolutionary game theoretic perspective.
Journal of MachineLearning Research
9, Mar (2008), 423–457.[32] Leonid Peshkin, Kee-Eung Kim, Nicolas Meuleau, and Leslie Pack Kaelbling. 2000.Learning to cooperate via policy search. In
Proceedings of the Sixteenth conferenceon Uncertainty in artificial intelligence . 489–496.[33] Tabish Rashid, Gregory Farquhar, Bei Peng, and Shimon Whiteson. 2018.Weighted QMIX: expanding monotonic value function factorisation. arXivpreprint arXiv:2006.10800v1 (2018).[34] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Far-quhar, Jakob Foerster, and Shimon Whiteson. 2018. QMIX: monotonic valuefunction factorisation for deep multi-agent reinforcement learning. In
Proceedingsof the Thirty-Fifth International Conference on Machine Learning .[35] Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method.
The annals of mathematical statistics (1951), 400–407.[36] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Far-quhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philiph H. S. Torr,Jakob Foerster, and Shimon Whiteson. 2019. The StarCraft multi-agent challenge.
CoRR abs/1902.04043 (2019).[37] David Simões, Nuno Lau, and Luís Paulo Reis. 2020. Multi-agent actor centralized-critic with communication.
Neurocomputing (2020).[38] Satinder P Singh, Michael J Kearns, and Yishay Mansour. 2000. Nash convergenceof gradient dynamics in general-sum games. In
UAI . 541–548.[39] Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi.2019. QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In
Proceedings of the 36th International Conferenceon Machine Learning , Vol. 97. PMLR, 5887–5896.[40] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vini-cius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, KarlTuyls, et al. 2018. Value-decomposition networks for cooperative multi-agentlearning based on team reward. In
Proceedings of the 17th International Conferenceon Autonomous Agents and MultiAgent Systems . 2085–2087.[41] Richard S Sutton. 1985. Temporal credit assignment in reinforcement learning.(1985).[42] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000.Policy gradient methods for reinforcement learning with function approximation.In
Advances in neural information processing systems . 1057–1063.[43] Ming Tan. 1993. Multi-agent reinforcement learning: independent vs. cooperativeagents. In
Proceedings of the tenth international conference on machine learning .330–337.[44] Jianhong Wang, Yuan Zhang, Tae-Kyun Kim, and Yunjie Gu. 2020. ShapleyQ-value: a local reward approach to solve global reward games. In
Proceedings ofthe AAAI Conference on Artificial Intelligence , Vol. 34. 7285–7292.[45] Tonghan Wang, Heng Dong, and Chongjie Zhang Victor Lesser. 2020. ROMA:multi-agent reinforcement learning with emergent roles. In
Proceedings of theThirty-Seventh International Conference on Machine Learning .[46] Tonghan Wang, Jianhao Wang, Chongyi Zheng, and Chongjie Zhang. 2020.Learning nearly decomposable value functions via communication minimization.
International Conference on Learning Representations (2020).[47] Weixun Wang, Jianye Hao, Yixi Wang, and Matthew Taylor. 2019. Achievingcooperation through deep multiagent reinforcement learning in sequential pris-oner’s dilemmas. In
Proceedings of the First International Conference on DistributedArtificial Intelligence . 1–7.[48] Yuchen Xiao, Joshua Hoffman, and Christopher Amato. 2019. Macro-action-based deep multi-agent reinforcement learning. In .[49] Yuchen Xiao, Joshua Hoffman, Tian Xia, and Christopher Amato. 2020. Learningmulti-robot decentralized macro-action-based policies via a centralized Q-net. In roceedings of the International Conference on Robotics and Automation .[50] Jiachen Yang, Alireza Nakhaei, David Isele, Kikuo Fujimura, and HongyuanZha. 2019. CM3: cooperative multi-goal multi-stage multi-agent reinforcementlearning. In
International Conference on Learning Representations .[51] Chongjie Zhang and Victor Lesser. 2010. Multi-agent learning with policy pre-diction. In
Proceedings of the AAAI Conference on Artificial Intelligence , Vol. 24.
A PROOFS

A.1 Proposition 1: Existence of Steady-State History Distribution
We show that under a (limited-memory) history-based policy, there still exists a stationary history distribution. Let $\Pr_0(s)$ denote the probability of $s$ being the initial state of an episode, and $\eta(h, s)$ denote the average number of timesteps spent in state $s$ with a history of $h$ in a single episode, where $\bar{s}$ and $\bar{h}$ denote the state and history from the previous timestep and $h_0$ is the empty history:
$$\eta(h, s) = \Pr_0(s) \cdot \mathbb{I}(h = h_0) + \sum_{\bar{h}, \bar{s}} \eta(\bar{h}, \bar{s}) \sum_{a} \pi(a \mid \bar{h}) \cdot \Pr(h \mid \bar{h}, s) \cdot \Pr(s \mid \bar{s}, a) \quad \text{for all } s, h \in \mathcal{S} \times \mathcal{H}_d.$$
Assuming our environment has a stationary observation model $\Pr(h \mid \bar{h}, s) = \Pr(o \mid s)$ where the resulting Markov chains are irreducible and aperiodic, a stationary transition model $\Pr(s \mid \bar{s}, a)$, and a given fixed policy $\pi(a \mid \bar{h})$, we can solve the system of equations for $\eta(h, s)$ and then normalize to get the stationary distribution of state and history pairs:
$$\Pr(h, s) = \frac{\eta(h, s)}{\sum_{s', h'} \eta(h', s')} \quad \text{for all } s, h \in \mathcal{S} \times \mathcal{H}_d.$$
However, what interests us is $\Pr(h \mid s) = \Pr(h, s) / \Pr(s)$; for that, we can express $\eta(s)$ in a similar fashion:
$$\eta(s) = \Pr_0(s) + \sum_{\bar{h}, \bar{s}} \eta(\bar{h}, \bar{s}) \sum_{a} \pi(a \mid \bar{h}) \Pr(s \mid \bar{s}, a)$$
which can be normalized to obtain $\Pr(s)$. Similarly, one can obtain a stationary distribution $\Pr(h)$ for fixed-length histories.
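A small numeric sketch of this idea (our own toy two-state POMDP, simplified to the non-episodic, ergodic case rather than the episodic construction above): the pairs (fixed-length history, state) form a Markov chain under a fixed history-conditioned policy, and a stationary distribution can be found by power iteration.

```python
# Toy stationary-distribution check over (history, state) pairs
# (assumption: invented two-state POMDP, history = most recent observation).
import numpy as np

nS, nO, nA = 2, 2, 2
T = np.array([[[0.9, 0.1], [0.2, 0.8]],      # T[s, a, s']
              [[0.3, 0.7], [0.6, 0.4]]])
O = np.array([[0.8, 0.2], [0.1, 0.9]])       # O[s', o]
pi = np.array([[0.5, 0.5], [0.3, 0.7]])      # pi[h, a], with h = last observation

# Transition matrix over pairs x = (h, s): the next history h' is the new observation.
P = np.zeros((nO * nS, nO * nS))
for h in range(nO):
    for s in range(nS):
        for a in range(nA):
            for s2 in range(nS):
                for o2 in range(nO):
                    P[h * nS + s, o2 * nS + s2] += pi[h, a] * T[s, a, s2] * O[s2, o2]

d = np.ones(nO * nS) / (nO * nS)
for _ in range(1000):                        # power iteration on the chain
    d = d @ P
print(d, d.sum())                            # stationary Pr(h, s); sums to 1
```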
A.2 Lemma 1: Convergence of Centralized Critic

Proof. We begin by observing the 1-step update rule for the centralized critic:
$$Q(\boldsymbol{h}, \boldsymbol{a}) \leftarrow Q(\boldsymbol{h}, \boldsymbol{a}) + \alpha \left[ R(s, \boldsymbol{a}, s') + \gamma Q(\boldsymbol{h}', \boldsymbol{a}') - Q(\boldsymbol{h}, \boldsymbol{a}) \right] \tag{7}$$
This update occurs with conditional probability
$$\Pr(s', \boldsymbol{h}', \boldsymbol{a}' \mid \boldsymbol{h}, \boldsymbol{a}) = \Pr(s \mid \boldsymbol{h}) \cdot \Pr(o \mid s, \boldsymbol{a}) \cdot \Pr(s' \mid s, \boldsymbol{a}) \cdot \boldsymbol{\pi}(\boldsymbol{a}' \mid \boldsymbol{h}')$$
given that $(\boldsymbol{h}, \boldsymbol{a})$ occurred, where $\Pr(s \mid \boldsymbol{h})$ exists by Proposition 1. Assuming that this $(\boldsymbol{h}, \boldsymbol{a})$-combination is visited infinitely often during an infinite amount of training, and that $\alpha$ is annealed according to the stochastic approximation criteria from [35], then $Q(\boldsymbol{h}, \boldsymbol{a})$ will asymptotically converge to the expectation of the target term $R(s, \boldsymbol{a}, s') + \gamma Q(\boldsymbol{h}', \boldsymbol{a}')$ in (7). We can compute this expected value by summing over all histories and actions weighted by their joint probabilities of occurrence:
$$\begin{aligned}
Q(\boldsymbol{h}, \boldsymbol{a}) \leftarrow{}& \sum_{s, o, s', \boldsymbol{a}'} \Pr(s \mid \boldsymbol{h}) \cdot \Pr(o \mid s, \boldsymbol{a}) \cdot \Pr(s' \mid s, \boldsymbol{a}) \cdot \boldsymbol{\pi}(\boldsymbol{a}' \mid \boldsymbol{h}') \left[ R(s, \boldsymbol{a}, s') + \gamma Q(\boldsymbol{h}', \boldsymbol{a}') \right] \\
={}& \sum_{s, s'} \Pr(s \mid \boldsymbol{h}) \cdot \Pr(s' \mid s, \boldsymbol{a}) \left[ R(s, \boldsymbol{a}, s') + \gamma \sum_{o, \boldsymbol{a}'} \Pr(o \mid s, \boldsymbol{a}) \cdot \boldsymbol{\pi}(\boldsymbol{a}' \mid \boldsymbol{h}') \cdot Q(\boldsymbol{h}', \boldsymbol{a}') \right]
\end{aligned} \tag{8}$$
This represents the Bellman equation for the centralized critic. Let us denote this update rule by the operator $B_c$ such that $Q \leftarrow B_c Q$ is equivalent to (8). We can show that $B_c$ is a contraction mapping. Letting $\lVert Q \rVert \coloneqq \max_{\boldsymbol{h}, \boldsymbol{a}} \lvert Q(\boldsymbol{h}, \boldsymbol{a}) \rvert$, we have, for any two value functions $Q$ and $\tilde{Q}$,
$$\begin{aligned}
\lVert B_c Q - B_c \tilde{Q} \rVert &= \max_{\boldsymbol{h}, \boldsymbol{a}} \left\lvert \sum_{s, s'} \Pr(s \mid \boldsymbol{h}) \cdot \Pr(s' \mid s, \boldsymbol{a}) \left[ \gamma \sum_{o, \boldsymbol{a}'} \Pr(o \mid s, \boldsymbol{a}) \cdot \boldsymbol{\pi}(\boldsymbol{a}' \mid \boldsymbol{h}') \left( Q(\boldsymbol{h}', \boldsymbol{a}') - \tilde{Q}(\boldsymbol{h}', \boldsymbol{a}') \right) \right] \right\rvert \\
&\leq \gamma \max_{\boldsymbol{h}, \boldsymbol{a}} \sum_{s, o, s', \boldsymbol{a}'} \Pr(s \mid \boldsymbol{h}) \cdot \Pr(s' \mid s, \boldsymbol{a}) \cdot \Pr(o \mid s, \boldsymbol{a}) \cdot \boldsymbol{\pi}(\boldsymbol{a}' \mid \boldsymbol{h}') \left\lvert Q(\boldsymbol{h}', \boldsymbol{a}') - \tilde{Q}(\boldsymbol{h}', \boldsymbol{a}') \right\rvert \\
&\leq \gamma \max_{\boldsymbol{h}', \boldsymbol{a}'} \left\lvert Q(\boldsymbol{h}', \boldsymbol{a}') - \tilde{Q}(\boldsymbol{h}', \boldsymbol{a}') \right\rvert = \gamma \lVert Q - \tilde{Q} \rVert
\end{aligned}$$
The first inequality is due to Jensen's inequality, while the second follows because a convex combination cannot be greater than the maximum. Hence, $B_c$ is a contraction mapping whenever $\gamma < 1$.
This immediately implies the existence of a unique fixed point. (If $B_c$ had two fixed points $Q^*$ and $\tilde{Q}^*$, then $\lVert B_c Q^* - B_c \tilde{Q}^* \rVert = \lVert Q^* - \tilde{Q}^* \rVert$, which would contradict our above result for $\gamma < 1$.) We must now identify this fixed point. The on-policy value function $Q^{\boldsymbol{\pi}}(\boldsymbol{h}, \boldsymbol{a})$ is invariant under the expectation, $\mathbb{E}_{\boldsymbol{\pi}}\!\left[ Q^{\boldsymbol{\pi}}(\boldsymbol{h}, \boldsymbol{a}) \right] = Q^{\boldsymbol{\pi}}(\boldsymbol{h}, \boldsymbol{a})$, making it a likely candidate for the fixed point. We can verify this by applying $B_c$ to it:
$$\begin{aligned}
B_c Q^{\boldsymbol{\pi}} &= \sum_{s, s'} \Pr(s \mid \boldsymbol{h}) \cdot \Pr(s' \mid s, \boldsymbol{a}) \left[ R(s, \boldsymbol{a}, s') + \gamma \sum_{o, \boldsymbol{a}'} \Pr(o \mid s, \boldsymbol{a}) \cdot \boldsymbol{\pi}(\boldsymbol{a}' \mid \boldsymbol{h}') \cdot Q^{\boldsymbol{\pi}}(\boldsymbol{h}', \boldsymbol{a}') \right] \\
&= \sum_{s, s'} \Pr(s \mid \boldsymbol{h}) \cdot \Pr(s' \mid s, \boldsymbol{a}) \left[ R(s, \boldsymbol{a}, s') + \gamma \sum_{o} \Pr(o \mid s, \boldsymbol{a}) \cdot V^{\boldsymbol{\pi}}(\boldsymbol{h}') \right] \\
&= \sum_{s, o, s'} \Pr(s \mid \boldsymbol{h}) \cdot \Pr(s' \mid s, \boldsymbol{a}) \cdot \Pr(o \mid s, \boldsymbol{a}) \left[ R(s, \boldsymbol{a}, s') + \gamma V^{\boldsymbol{\pi}}(\boldsymbol{h}') \right] \\
&= Q^{\boldsymbol{\pi}}
\end{aligned}$$
The final step follows from the recursive definition of the Bellman equation. We therefore have $B_c Q^{\boldsymbol{\pi}} = Q^{\boldsymbol{\pi}}$, which makes $Q^{\boldsymbol{\pi}}$ the unique fixed point of $B_c$ and completes the proof. □
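For concreteness, the following is a minimal sketch (not the paper's implementation) of the tabular 1-step update in (7) for a centralized critic keyed on joint histories and joint actions; the data structures, placeholder names, and hyperparameter values are illustrative assumptions.

```python
from collections import defaultdict

# Illustrative sketch of the update in Eq. (7): a single centralized critic indexed by
# (joint history, joint action). `joint_h`, `joint_a`, etc. are placeholders for whatever
# hashable representation (e.g., tuples of observations/actions) an implementation uses.
def centralized_td_update(Q, joint_h, joint_a, reward, joint_h_next, joint_a_next,
                          alpha=0.05, gamma=0.99):
    """Move Q(h, a) toward the on-policy target R + gamma * Q(h', a')."""
    target = reward + gamma * Q[(joint_h_next, joint_a_next)]
    Q[(joint_h, joint_a)] += alpha * (target - Q[(joint_h, joint_a)])

Q_central = defaultdict(float)  # Q[(joint_history, joint_action)] -> value estimate
```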
A.3 Lemma 2: Convergence of Decentralized Critic

Proof. Without loss of generality, we will consider the 2-agent case. The result can easily be generalized to the $n$-agent case by treating all of the agent policies except $\pi_i$ as a joint policy $\boldsymbol{\pi}_j$. We begin by observing the 1-step update rule of the $i$-th decentralized critic:
$$Q(h_i, a_i) \leftarrow Q(h_i, a_i) + \alpha \left[ R(s, \boldsymbol{a}, s') + \gamma Q(h_i', a_i') - Q(h_i, a_i) \right] \tag{9}$$
Letting $\pi \coloneqq \pi_i$, this update occurs with conditional probability
$$\Pr(\boldsymbol{a}, s', h_i', a_i' \mid h_i, a_i) = \Pr(h_j \mid h_i) \cdot \pi_j(a_j \mid h_j) \cdot \Pr(s \mid \boldsymbol{h}) \cdot \Pr(o \mid s, a_i) \cdot \Pr(s' \mid s, \boldsymbol{a}) \cdot \pi(a_i' \mid h_i')$$
given that $(h_i, a_i)$ occurred, where $\Pr(s \mid \boldsymbol{h})$ exists by Proposition 1. Assuming that this $(h_i, a_i)$-combination is visited infinitely often during an infinite amount of training, and that $\alpha$ is annealed according to the stochastic approximation criteria from [35], then $Q(h_i, a_i)$ will asymptotically converge to the expectation of the target term $R(s, \boldsymbol{a}, s') + \gamma Q(h_i', a_i')$ in (9). We can compute this expected value by summing over all histories and actions weighted by their joint probabilities of occurrence:
$$\begin{aligned}
Q(h_i, a_i) \leftarrow{}& \sum_{h_j, a_j, s, o, s', a_i'} \Pr(h_j \mid h_i) \cdot \pi_j(a_j \mid h_j) \cdot \Pr(s \mid \boldsymbol{h}) \cdot \Pr(o \mid s, a_i) \cdot \Pr(s' \mid s, \boldsymbol{a}) \cdot \pi(a_i' \mid h_i') \left[ R(s, \boldsymbol{a}, s') + \gamma Q(h_i', a_i') \right] \\
={}& \sum_{h_j, a_j, s, s'} \Pr(h_j \mid h_i) \cdot \pi_j(a_j \mid h_j) \cdot \Pr(s \mid \boldsymbol{h}) \cdot \Pr(s' \mid s, \boldsymbol{a}) \left[ R(s, \boldsymbol{a}, s') + \gamma \sum_{o, a_i'} \Pr(o \mid s, a_i) \cdot \pi(a_i' \mid h_i') \cdot Q(h_i', a_i') \right]
\end{aligned} \tag{10}$$
This represents the Bellman equation for a decentralized critic. Let us denote this update rule by the operator $B_d$ such that $Q \leftarrow B_d Q$ is equivalent to (10). We can show that $B_d$ is a contraction mapping. Letting $\lVert Q \rVert \coloneqq \max_{h_i, a_i} \lvert Q(h_i, a_i) \rvert$, we have, for any two value functions $Q$ and $\tilde{Q}$,
$$\begin{aligned}
\lVert B_d Q - B_d \tilde{Q} \rVert &= \max_{h_i, a_i} \left\lvert \sum_{h_j, a_j, s, s'} \Pr(h_j \mid h_i) \cdot \pi_j(a_j \mid h_j) \cdot \Pr(s \mid \boldsymbol{h}) \cdot \Pr(s' \mid s, \boldsymbol{a}) \, \gamma \sum_{o, a_i'} \Pr(o \mid s, a_i) \cdot \pi(a_i' \mid h_i') \left( Q(h_i', a_i') - \tilde{Q}(h_i', a_i') \right) \right\rvert \\
&\leq \gamma \max_{h_i, a_i} \sum_{h_j, a_j, s, o, s', a_i'} \Pr(h_j \mid h_i) \cdot \pi_j(a_j \mid h_j) \cdot \Pr(s \mid \boldsymbol{h}) \cdot \Pr(s' \mid s, \boldsymbol{a}) \cdot \Pr(o \mid s, a_i) \cdot \pi(a_i' \mid h_i') \left\lvert Q(h_i', a_i') - \tilde{Q}(h_i', a_i') \right\rvert \\
&\leq \gamma \max_{h_i', a_i'} \left\lvert Q(h_i', a_i') - \tilde{Q}(h_i', a_i') \right\rvert = \gamma \lVert Q - \tilde{Q} \rVert
\end{aligned}$$
The first inequality is due to Jensen's inequality, while the second follows because a convex combination cannot be greater than the maximum. Hence, $B_d$ is a contraction mapping whenever $\gamma < 1$.
This immediately implies the existence of a unique fixed point; we must identify the fixed point to complete our proof. We will test the marginal expectation of the centralized critic, $\mathbb{E}_{h_j, a_j}\!\left[ Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j) \right]$. We can verify that this is the fixed point by applying $B_d$ to it:
$$\begin{aligned}
B_d\, \mathbb{E}_{h_j, a_j}\!\left[ Q^{\boldsymbol{\pi}} \right] ={}& \sum_{h_j, a_j, s, s'} \Pr(h_j \mid h_i) \cdot \pi_j(a_j \mid h_j) \cdot \Pr(s \mid \boldsymbol{h}) \cdot \Pr(s' \mid s, \boldsymbol{a}) \left[ R(s, \boldsymbol{a}, s') + \gamma \sum_{o, a_i'} \Pr(o \mid s, a_i) \cdot \pi(a_i' \mid h_i') \cdot \mathbb{E}_{h_j', a_j'}\!\left[ Q^{\boldsymbol{\pi}}(h_i', h_j', a_i', a_j') \right] \right] \\
={}& \sum_{h_j, a_j, s, s'} \Pr(h_j \mid h_i) \cdot \pi_j(a_j \mid h_j) \cdot \Pr(s \mid \boldsymbol{h}) \cdot \Pr(s' \mid s, \boldsymbol{a}) \left[ R(s, \boldsymbol{a}, s') + \gamma \sum_{o} \Pr(o \mid s, a_i) \cdot \mathbb{E}_{h_j'}\!\left[ V^{\boldsymbol{\pi}}(h_i', h_j') \right] \right] \\
={}& \sum_{h_j, a_j, s, o, s'} \Pr(h_j \mid h_i) \cdot \pi_j(a_j \mid h_j) \cdot \Pr(s \mid \boldsymbol{h}) \cdot \Pr(s' \mid s, \boldsymbol{a}) \cdot \Pr(o \mid s, a_i) \left[ R(s, \boldsymbol{a}, s') + \gamma\, \mathbb{E}_{h_j'}\!\left[ V^{\boldsymbol{\pi}}(h_i', h_j') \right] \right] \\
={}& \mathbb{E}_{h_j, a_j}\!\left[ Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j) \right]
\end{aligned}$$
The final step follows from the recursive definition of the Bellman equation. We therefore have $B_d\!\left( \mathbb{E}_{h_j, a_j}\!\left[ Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j) \right] \right) = \mathbb{E}_{h_j, a_j}\!\left[ Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j) \right]$, which makes $\mathbb{E}_{h_j, a_j}\!\left[ Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j) \right]$ the unique fixed point of $B_d$ and completes the proof. □
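The fixed point identified by Lemma 2 can be checked numerically on a one-step example. The sketch below (ours, not the authors' code) trains a decentralized critic for agent $i$ with the update in (9) (with $\gamma = 0$, since the game lasts one step) and compares it against the analytic marginal expectation of the joint value; the reward table and the other agent's policy are arbitrary placeholders.

```python
import numpy as np

# Illustrative one-step example: a decentralized critic trained with Eq. (9) converges to
# the marginal expectation E_{a_j}[Q^pi(a_i, a_j)], not to the joint value itself.
rng = np.random.default_rng(1)
nA = 3
R = rng.normal(size=(nA, nA))            # R[a_i, a_j]: one-step team reward (placeholder)
pi_j = np.array([0.2, 0.5, 0.3])          # fixed policy of the other agent (placeholder)

# Analytic fixed point from Lemma 2: the marginal expectation over the other agent.
Q_marginal = R @ pi_j                     # shape (nA,): E_{a_j}[R(a_i, a_j)]

# Stochastic-approximation version of Eq. (9) with gamma = 0.
Q_i = np.zeros(nA)
alpha = 0.01
for _ in range(200_000):
    a_i = rng.integers(nA)                # explore agent i's actions uniformly
    a_j = rng.choice(nA, p=pi_j)          # other agent samples from its policy
    Q_i[a_i] += alpha * (R[a_i, a_j] - Q_i[a_i])

print(np.max(np.abs(Q_i - Q_marginal)))   # small: the critic tracks the marginal value
```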
A.4 Theorem 1: Unbiased Policy Gradient

Proof. From Lemmas 1 and 2, we know that $Q(h_i, h_j, a_i, a_j) \to Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j)$ and $Q(h_i, a_i) \to \mathbb{E}_{h_j, a_j}\!\left[ Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j) \right]$ for the centralized and decentralized critics, respectively, after an infinite amount of training. Substituting these into the definitions in (4) and (3), respectively, we can relate the expected policy gradients after the critics have converged. Without loss of generality, we compute these objectives for the $i$-th agent with policy $\pi_i$. Starting with the decentralized critic,
$$\mathbb{E}_{h_i, a_i}\!\left[ J_d(\theta) \right] = \mathbb{E}_{h_i, a_i}\!\left[ \mathbb{E}_{h_j, a_j}\!\left[ Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j) \right] \cdot \nabla_\theta \log \pi_i(a_i \mid h_i) \right] \tag{11}$$
Next, we can do the same for the centralized critic and split the expectation:
$$\begin{aligned}
\mathbb{E}_{h_i, h_j, a_i, a_j}\!\left[ J_c(\theta) \right] &= \mathbb{E}_{h_i, h_j, a_i, a_j}\!\left[ Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j) \cdot \nabla_\theta \log \pi_i(a_i \mid h_i) \right] \tag{12} \\
&= \mathbb{E}_{h_i, a_i}\!\left[ \mathbb{E}_{h_j, a_j}\!\left[ Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j) \cdot \nabla_\theta \log \pi_i(a_i \mid h_i) \right] \right] \\
&= \mathbb{E}_{h_i, a_i}\!\left[ \mathbb{E}_{h_j, a_j}\!\left[ Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j) \right] \cdot \nabla_\theta \log \pi_i(a_i \mid h_i) \right] \\
&= \mathbb{E}_{h_i, a_i}\!\left[ J_d(\theta) \right]
\end{aligned}$$
We therefore have $\mathbb{E}_{h_i, h_j, a_i, a_j}\!\left[ J_c(\theta) \right] = \mathbb{E}_{h_i, a_i}\!\left[ J_d(\theta) \right]$, so both policy gradients are equal in expectation and can be interchanged without introducing bias. □

A.5 Theorem 2: Policy Gradient Variance Inequality
Proof. In Theorem 1, we derived policy gradients for the centralized and decentralized critics in (12) and (11), respectively. We also showed that they are equal in expectation; hence, let $\mu = \mathbb{E}_{h_i, h_j, a_i, a_j}\!\left[ J_c(\theta) \right] = \mathbb{E}_{h_i, a_i}\!\left[ J_d(\theta) \right]$. Additionally, let $A = \left( \nabla_\theta \log \pi_i(a_i \mid h_i) \right) \left( \nabla_\theta \log \pi_i(a_i \mid h_i) \right)^{\mathsf{T}}$. We can analyze the relative magnitudes of the policy gradient variances by subtracting them and comparing to zero:
$$\begin{aligned}
\operatorname{Var}(J_c(\theta)) - \operatorname{Var}(J_d(\theta)) &= \left( \mathbb{E}_{h_i, h_j, a_i, a_j}\!\left[ J_c(\theta) J_c(\theta)^{\mathsf{T}} \right] - \mu \mu^{\mathsf{T}} \right) - \left( \mathbb{E}_{h_i, a_i}\!\left[ J_d(\theta) J_d(\theta)^{\mathsf{T}} \right] - \mu \mu^{\mathsf{T}} \right) \\
&= \mathbb{E}_{h_i, h_j, a_i, a_j}\!\left[ J_c(\theta) J_c(\theta)^{\mathsf{T}} \right] - \mathbb{E}_{h_i, a_i}\!\left[ J_d(\theta) J_d(\theta)^{\mathsf{T}} \right] \\
&= \mathbb{E}_{h_i, h_j, a_i, a_j}\!\left[ Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j)^2 A \right] - \mathbb{E}_{h_i, a_i}\!\left[ \mathbb{E}_{h_j, a_j}\!\left[ Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j) \right]^2 A \right] \\
&= \mathbb{E}_{h_i, a_i}\!\Bigl[ \underbrace{\left( \mathbb{E}_{h_j, a_j}\!\left[ Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j)^2 \right] - \mathbb{E}_{h_j, a_j}\!\left[ Q^{\boldsymbol{\pi}}(h_i, h_j, a_i, a_j) \right]^2 \right)}_{c \,\geq\, 0} A \Bigr]
\end{aligned}$$
Here $c \geq 0$ because it is a (conditional) variance, and the diagonal elements of $A$ are nonnegative because $A$ is an outer product of a vector with itself. We must therefore have that each diagonal element of the centralized critic's policy gradient covariance matrix is at least as large as its corresponding element in the decentralized critic's policy gradient covariance matrix. This completes the proof. □
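Theorems 1 and 2 can be illustrated with a quick Monte Carlo check on a one-step game (our sketch, not from the paper): with converged on-policy critics, the centralized-critic and decentralized-critic gradient estimators for agent $i$ share the same mean, but the centralized one has (weakly) larger per-coordinate variance. All tables and parameters below are arbitrary placeholders.

```python
import numpy as np

# One-step check of Theorems 1 and 2 with a softmax policy for agent i.
rng = np.random.default_rng(2)
nA = 3
R = rng.normal(size=(nA, nA))              # Q^pi for a one-step game: Q(a_i, a_j) = R[a_i, a_j]
theta = rng.normal(size=nA)                # softmax policy parameters for agent i
pi_i = np.exp(theta) / np.exp(theta).sum()
pi_j = np.array([0.2, 0.5, 0.3])           # fixed policy of the other agent
Q_marginal = R @ pi_j                      # decentralized critic's fixed point (Lemma 2)

def grad_log_pi(a_i):
    """Gradient of log softmax w.r.t. theta."""
    return np.eye(nA)[a_i] - pi_i

g_central, g_decentral = [], []
for _ in range(200_000):
    a_i = rng.choice(nA, p=pi_i)
    a_j = rng.choice(nA, p=pi_j)
    g_central.append(R[a_i, a_j] * grad_log_pi(a_i))         # joint-action critic
    g_decentral.append(Q_marginal[a_i] * grad_log_pi(a_i))    # marginalized critic
g_central, g_decentral = np.array(g_central), np.array(g_decentral)

print(g_central.mean(axis=0), g_decentral.mean(axis=0))       # approximately equal (Thm. 1)
print(g_central.var(axis=0) >= g_decentral.var(axis=0))       # True, up to MC noise (Thm. 2)
```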
B MORNING GAME

Figure 6: Morning Game Q-values used for updates of $\pi(a)$. Same as Figure 2c but plotted with individual runs.

C GUESS GAME
We consider a simple one-step environment where two agents try to guess each other's binary observations: $\Omega_i = \{o_{i1}, o_{i2}\}$, with action space $\mathcal{A}_i = \{a_{i1}, a_{i2}, a_{i3}\}$. At every timestep, the state $\langle s_1, s_2 \rangle$ is uniformly sampled ($\mathcal{T} = \langle U(\{s_{11}, s_{12}\}), U(\{s_{21}, s_{22}\}) \rangle$) and is observed directly by the respective agents ($o_i = s_i$). Agents are rewarded (+10) if both of their actions match the other agent's observation (i.e., $a_{in}$ matches $s_{jm}$ with $n = m$), receive 0 for one mismatch, and are penalized (−10) for two mismatches. On the other hand, the joint action $\langle a_3, a_3 \rangle$ gives a deterministic, low reward (+5) no matter the state.

We observe that the agents with decentralized critics robustly converge to the optimal action $a_3$ with a return of 5, whereas the agents with a centralized critic do so with less stability. Note that in the empirical results shown in Figure 7, the policy of JAC is not executable by either IAC or IACC due to their decentralized execution, and an optimal decentralized set of policies only achieves a return of 5. We also observe that although IACC is not biased, its lower performance is caused by the larger variance induced by the joint observations. More specifically, the decentralized critic simply learns that $a_1$ and $a_2$ average to an approximate return of 0 under both observations, whereas the centralized critic explicitly estimates that joint observation-action pairs with two matches give +10, those with one mismatch give 0, and those with two mismatches give −10. Because the observations are uniformly distributed, in the actual roll-outs the probability of getting +10 equals that of getting −10; this does not affect the expected return (and thus the expected gradient), but it does incur much higher variance in this case. The variance is exacerbated if we use larger reward magnitudes for matches and mismatches (Figure 8b) and can be mitigated by using larger batch sizes (Figure 8c). A minimal sketch of the game dynamics is given below.
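The following sketch of the Guess Game dynamics is our illustrative reconstruction of the description above, not the authors' code; in particular, the reward assigned when one agent plays the safe action while the other guesses is an assumption based on the mismatch-counting rule.

```python
import numpy as np

# Minimal sketch of the Guess Game: each agent observes its own binary state and tries to
# guess the other agent's state; the joint "safe" action yields a deterministic +5.
SAFE = 2  # index of the safe action a_3

class GuessGame:
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.state = None

    def reset(self):
        self.state = self.rng.integers(2, size=2)   # <s_1, s_2> sampled uniformly
        return self.state.copy()                     # o_i = s_i (observed directly)

    def step(self, actions):
        """actions: (a_1, a_2), each in {0, 1, SAFE}; returns the one-step team reward."""
        if actions[0] == SAFE and actions[1] == SAFE:
            return 5.0
        # Agent 1 guesses s_2 and agent 2 guesses s_1; SAFE never matches (assumption).
        matches = int(actions[0] == self.state[1]) + int(actions[1] == self.state[0])
        return {2: 10.0, 1: 0.0, 0: -10.0}[matches]

env = GuessGame()
obs = env.reset()
print(obs, env.step((SAFE, SAFE)))                   # the safe joint action always gives +5
```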
Figure 7: Performance comparison in the Guess Game (JAC, IAC, IACC; x-axis: steps, y-axis: returns), showing that the centralized critic cannot bias the actors towards the global optimum in this simplest situation.
Figure 8: Guess Game performance comparison under different settings (IAC vs. IACC; x-axis: steps, y-axis: returns). (a) Same data as shown in Figure 7. (b) Larger reward magnitudes for matches and mismatches. (c) Batch size increased from 64 to 512.
D GRID WORLD ENVIRONMENTS
Grid world environments from [15], composed of a variety of grid-world tasks. Readers are referred to the source code for more details.

D.1 Go Together
Two agents navigate to the same goal for a reward of 10 and are penalized for being too close to or too far from each other. The domain is fully observable: each agent receives both agents' coordinates as observations.
D.2 Find Treasure
The environment consists of two rooms; both agents are initialized in one room and must navigate to the other room to reach a treasure for a positive reward.

D.3 Cleaner
Two agents are initialized in a maze in which every previously untouched ground cell gives a reward of +1; an optimal policy seeks to visit as many cells as possible while avoiding the teammate's path. Results are shown in Figure 9.
Figure 9: Grid-world visualization of the Cleaner domain. Agents are shown in red and uncleaned locations in green. Blue lines show the two general paths taken by an optimal policy.
D.4 Move Box
Two agents collectively carry a heavy box (which requires both agents to carry) to a destination. Two destinations are present; the nearer destination gives +10 and the farther destination gives a larger reward.

E COOPERATIVE NAVIGATION
Agents cooperatively navigate to target locations and are penalized for collisions. The optimal routes (policies) are highly dependent on the other agents' routes (policies). Results are shown in Figure 12.

Source code: https://github.com/Bigpig4396/Multi-Agent-Reinforcement-Learning-Environment

Figure 10: Grid-world visualization of the Move Box domain. Agents are shown in red and blue, the box in green.
Figure 11: Performance comparison (IAC vs. IACC) in various grid-world domains: Go Together, Find Treasure, Cleaner, and Move Box (x-axis: steps, y-axis: returns).
Figure 12: Performance comparison (IAC vs. IACC) in the cooperative navigation domains Cross, Antipodal, and Merge (x-axis: steps, y-axis: returns). In these domains, agents navigate to designated target locations for reward and are penalized for collisions.
F PARTIALLY OBSERVABLE PREDATOR AND PREY
We conducted experiments in a variant of the Predator and Prey domain (originally introduced in MADDPG [21] and widely used by later works), where three slower predators (red circles) must collaboratively capture a faster prey (green circle) in a randomly initialized 2-D plane with two landmarks. The prey's movement is bounded within the plane.

Figure 13: Domain configurations under various observation ranges (obs_range 0.4, 0.6, and 0.8).
Figure 14: Performance comparison (IAC vs. IACC) on Predator-and-Prey under various observation ranges (0.4, 0.6, 0.8, 1.0, and 1.2); x-axis: episodes (k), y-axis: mean test return.
G CAPTURE TARGET
Capture Target is another widely used domain [1, 22, 30, 48], in which two agents (green and blue circles) move in an $m \times m$ toroidal grid world to capture a moving target (red cross) simultaneously. The positions of the agents and the target are randomly initialized at the beginning of each episode. At each timestep, the target keeps moving right without any transition noise, while the agents have five movement options (up, down, left, right, and stay) with a 0.1 probability of accidentally arriving at any one of the adjacent cells. Each agent always observes its own location, but the target's location is blurred (hidden) with probability 0.3. Agents receive a positive terminal reward only upon capturing the target simultaneously.
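A minimal sketch of the Capture Target dynamics described above follows (illustrative only, not the authors' code); the terminal reward value and any tie-breaking details are left out, and the helper names are assumptions.

```python
import numpy as np

# Minimal sketch of Capture Target dynamics: toroidal m x m grid, noisy agent moves,
# deterministic rightward-moving target, and a blurred target observation.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "stay": (0, 0)}
DIRECTIONS = ["up", "down", "left", "right"]

def agent_step(pos, action, m, rng, slip=0.1):
    """Move on an m x m toroidal grid; with prob. `slip`, land in a random adjacent cell."""
    if rng.random() < slip:
        dr, dc = MOVES[DIRECTIONS[rng.integers(4)]]
    else:
        dr, dc = MOVES[action]
    return ((pos[0] + dr) % m, (pos[1] + dc) % m)

def target_step(pos, m):
    """The target always moves right (no transition noise), wrapping around the torus."""
    return (pos[0], (pos[1] + 1) % m)

def observe_target(target_pos, rng, blur=0.3):
    """Each agent always sees its own position; the target's is hidden with prob. `blur`."""
    return None if rng.random() < blur else target_pos
```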
Figure 15: Domain configurations under various grid world sizes.

Figure 16: Performance comparison (IAC vs. IACC) on Capture Target under various grid-world sizes (including 10 × 10 and 12 × 12); x-axis: episodes (k), y-axis: mean test return.
H SMALL BOX PUSHING
The objective of the two agents in this domain is to push two small boxes to the goal (yellow) area at the top of the grid world, with a discount factor $\gamma = 0.98$. When either one of the small boxes is pushed to the goal area, the team receives a +100 reward and the episode terminates. Each agent has four applicable actions: move forward, turn left, turn right, and stay. A small box moves forward one grid cell when an agent faces it and executes the move forward action. The observation captured by each agent consists only of the status of the single front cell, which can be empty, teammate, small box, or boundary. The ideal behavior is for the two agents to push the two small boxes to the goal area concurrently in order to obtain the maximal discounted return.
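A minimal sketch of the local observation function described above (illustrative, not the authors' code); the integer encoding and helper names are assumptions.

```python
# Each agent only observes the status of the single cell directly in front of it.
EMPTY, TEAMMATE, SMALL_BOX, BOUNDARY = 0, 1, 2, 3
HEADINGS = {"north": (-1, 0), "east": (0, 1), "south": (1, 0), "west": (0, -1)}

def front_cell_observation(agent_pos, heading, grid_size, teammate_positions, box_positions):
    """Return the status of the one cell the agent is facing."""
    dr, dc = HEADINGS[heading]
    r, c = agent_pos[0] + dr, agent_pos[1] + dc
    if not (0 <= r < grid_size and 0 <= c < grid_size):
        return BOUNDARY
    if (r, c) in teammate_positions:
        return TEAMMATE
    if (r, c) in box_positions:
        return SMALL_BOX          # a 'move forward' action here pushes the box one cell
    return EMPTY
```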
Figure 17: Domain configurations under various grid world sizes.
Figure 18: Performance comparison (IAC vs. IACC) on Small Box Pushing under various grid-world sizes (including 10 × 10 and 12 × 12); x-axis: episodes (k), y-axis: mean test return.
I SMALL BOX PUSHING (3 AGENTS)
The domain settings are the same as above, except that one more agent is involved, making the task slightly more challenging.
Figure 19: Domain configurations under various grid world sizes.
Figure 20: Performance comparison (IAC vs. IACC) on three-agent Small Box Pushing under various grid-world sizes (including 10 × 10 and 12 × 12); x-axis: episodes (k), y-axis: mean test return.
J SMAC
We also investigated the performance of IAC and IACC on StarCraft II micromanagement tasks from the StarCraft Multi-Agent Challenge (SMAC) [36]. We picked three classical scenarios: 2 Stalkers vs. 1 Spine Crawler (2s_vs_1sc), 3 Marines (3m), and 2 Stalkers and 3 Zealots (2s3z). We used the default configurations recommended in SMAC; readers are referred to the source code for details. We use the simpler SMAC tasks because the harder tasks pose a substantial challenge for both methods in their vanilla formulations.
Figure 21: Performance comparison (IAC vs. IACC) on easy SMAC maps (2s_vs_1sc, 3m, 2s3z); x-axis: episodes (k), y-axis: mean test return.