Multi-Agent Collaboration via Reward Attribution Decomposition
Tianjun Zhang, Huazhe Xu, Xiaolong Wang, Yi Wu, Kurt Keutzer, Joseph E. Gonzalez, Yuandong Tian
Under review as a conference paper at ICLR 2021

University of California, Berkeley · University of California, San Diego · Tsinghua University · Facebook AI Research
{tianjunz,huazhe_xu,keutzer,jegonzal}@berkeley.edu

ABSTRACT
Recent advances in multi-agent reinforcement learning (MARL) have achieved super-human performance in games like Quake 3 and Dota 2. Unfortunately, these techniques require orders-of-magnitude more training rounds than humans and may not generalize to slightly altered environments or new agent configurations (i.e., ad hoc team play). In this work, we propose Collaborative Q-learning (CollaQ), which achieves state-of-the-art performance in the StarCraft multi-agent challenge and supports ad hoc team play. We first formulate multi-agent collaboration as a joint optimization on reward assignment and show that, under certain conditions, each agent has a decentralized Q-function that is approximately optimal and can be decomposed into two terms: the self-term that only relies on the agent's own state, and the interactive term that is related to the states of nearby agents, often observed by the current agent. The two terms are jointly trained using regular DQN, regulated with a Multi-Agent Reward Attribution (MARA) loss that ensures both terms retain their semantics. CollaQ is evaluated on various StarCraft maps, outperforming existing state-of-the-art techniques (i.e., QMIX, QTRAN, and VDN) by improving the win rate by 40% with the same number of environment steps. In the more challenging ad hoc team play setting (i.e., reweight/add/remove units without re-training or finetuning), CollaQ outperforms the previous SoTA by a large margin.

1 INTRODUCTION
In recent years, multi-agent deep reinforcement learning (MARL) has drawn increasing interest from the research community. MARL algorithms have shown super-human level performance in various games like Dota 2 (Berner et al., 2019), Quake 3 Arena (Jaderberg et al., 2019), and StarCraft (Samvelyan et al., 2019). However, the algorithms (Schulman et al., 2017; Mnih et al., 2013) are far less sample efficient than humans. For example, in Hide and Seek (Baker et al., 2019), it takes agents millions of episodes to learn a simple strategy of door blocking, while it only takes humans several rounds to learn this behavior. One of the key reasons for the slow learning is that the number of joint states grows exponentially with the number of agents.

Our code for the StarCraft II multi-agent challenge is publicly available at https://github.com/facebookresearch/CollaQ.

Moreover, many real-world situations require agents to adapt to new configurations of teams. This can be modeled as the ad hoc multi-agent reinforcement learning setting (Stone et al., 2010) (Ad-hoc MARL), in which agents must adapt to different team sizes and configurations at test time. In contrast to the MARL setting, where agents can learn a fixed and team-dependent policy, in the Ad-hoc MARL setting agents must assess and adapt to the capabilities of others to behave optimally. Existing work in ad hoc team play either requires sophisticated online learning at test time (Barrett et al., 2011) or prior knowledge about teammate behaviors (Barrett and Stone, 2015). As a result, these methods do not generalize to complex real-world scenarios. Most existing works either focus on improving generalization towards different opponent strategies (Lanctot et al., 2017; Hu et al., 2020) or on simple ad-hoc settings like a varying number of test-time teammates (Schwab et al., 2018; Long et al., 2020). We consider a more general setting where test-time teammates may have different capabilities. The need to reason about different team configurations in Ad-hoc MARL results in an additional exponential increase (Stone et al., 2010) in representational complexity compared to the MARL setting.

In the situation of collaboration, one way to address the complexity of the ad hoc team play setting is to explicitly model and address how agents collaborate. In this paper, one key observation is that when collaborating with different agents, an agent changes her behavior because she realizes that the team could function better if she focuses on some of the rewards while leaving the other rewards to other teammates. Inspired by this principle, we formulate multi-agent collaboration as a joint optimization over an implicit reward assignment among agents. Because the rewards are assigned differently for different team configurations, the behavior of an agent changes and adaptation follows. While solving this optimization directly requires centralization at test time, we make an interesting theoretical finding that each agent has a decentralized policy that is (1) approximately optimal for the joint optimization, and (2) only depends on the local configuration of other agents. This enables us to learn a direct mapping from the states of nearby agents (the "observation" of agent i) to its Q-function using a deep neural network.
Furthermore, this finding also suggests that the Q-function of agent i should be decomposed into two terms: Q_i^alone, which only depends on agent i's own state s_i, and Q_i^collab, which depends on nearby agents and vanishes if no other agents are nearby. To enforce these semantics, we regularize Q_i^collab(s_i, ·) = 0 in training via a novel Multi-Agent Reward Attribution (MARA) loss. The resulting algorithm, Collaborative Q-learning (CollaQ), achieves a 40% improvement in win rates over state-of-the-art techniques for the StarCraft multi-agent challenge. We show that (1) the MARA loss is critical for strong performance and (2) both Q^alone and Q^collab are interpretable via visualization. Furthermore, CollaQ agents can achieve ad hoc team play without retraining or fine-tuning. We propose three tasks to evaluate ad hoc team play performance: at test time, (a) assign a new VIP unit whose survival matters, (b) swap different units in and out, and (c) add or remove units. Results show that CollaQ outperforms baselines in all these settings.

Related Works.
The most straightforward way to train such a MARL task is to learn each individual agent's value function Q_i independently (IQL) (Tan, 1993). However, the environment becomes non-stationary from the perspective of an individual agent, so this performs poorly in practice. Recent works, e.g., VDN (Sunehag et al., 2017), QMIX (Rashid et al., 2018), and QTRAN (Son et al., 2019), adopt centralized training with decentralized execution to solve this problem. They propose to write the joint value function as Q^\pi(s, a) = \phi(s, Q_1(o_1, a_1), \ldots, Q_K(o_K, a_K)), where the formulation of \phi differs in each method. These methods successfully utilize the centralized-training technique to alleviate the non-stationarity issue. However, none of the above methods generalize well to ad-hoc team play, since the learned Q_i functions highly depend on the existence of other agents.
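To make the factorization Q^\pi(s, a) = \phi(s, Q_1, \ldots, Q_K) concrete, the sketch below shows a minimal VDN-style mixer in PyTorch (our own illustration; the paper does not provide code, and QMIX/QTRAN use richer mixing functions): per-agent utilities are combined by a simple sum, the joint TD error is minimized centrally, and each agent still acts from its own observation at execution time.

```python
import torch
import torch.nn as nn

class AgentQ(nn.Module):
    """Per-agent utility Q_i(o_i, .) computed from the agent's local observation."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, obs):                      # obs: [batch, obs_dim]
        return self.net(obs)                     # [batch, n_actions]

class VDNMixer(nn.Module):
    """phi(s, Q_1, ..., Q_K) = sum_i Q_i -- the simplest choice of the mixing function."""
    def forward(self, chosen_qs):                # chosen_qs: [batch, K]
        return chosen_qs.sum(dim=1, keepdim=True)

# Centralized training: a joint TD target supervises the mixed value,
# but at execution time each agent argmaxes its own Q_i from o_i alone.
K, obs_dim, n_actions = 3, 10, 5
agents = nn.ModuleList([AgentQ(obs_dim, n_actions) for _ in range(K)])
mixer = VDNMixer()
obs = torch.randn(32, K, obs_dim)
actions = torch.randint(n_actions, (32, K))
chosen = torch.stack([agents[i](obs[:, i]).gather(1, actions[:, i:i+1]).squeeze(1)
                      for i in range(K)], dim=1)
q_joint = mixer(chosen)                          # [32, 1], trained against the TD target
```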
2 COLLABORATIVE MULTI-AGENT REWARD ASSIGNMENT

Basic Setting. A multi-agent extension of the Markov Decision Process, called collaborative partially observable Markov Games (Littman, 1994), is defined by a set of states S describing the possible configurations of all K agents, a set of possible actions A_1, \ldots, A_K, and a set of possible observations O_1, \ldots, O_K. At every step, each agent i chooses its action a_i by a stochastic policy \pi_i: O_i \times A_i \to [0, 1]. The joint action a produces the next state by a transition function P: S \times A_1 \times \cdots \times A_K \to S. All agents share the same reward r: S \times A_1 \times \cdots \times A_K \to R and a joint value function Q^\pi = \mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}}[R_t \mid s_t, a_t], where R_t = \sum_{j=0}^{\infty} \gamma^j r_{t+j} is the discounted return.

In Sec. 2.1, we first model multi-agent collaboration as a joint optimization on reward assignment: instead of acting based on the joint state s, each agent i acts independently based on its own state s_i, following its own optimal value V_i, which is a function of the perceived reward assignment r_i. While the optimal perceived reward assignment r^*_i(s) depends on the joint state of all agents and requires centralization, in Sec. 2.2 we prove that there exists an approximately optimal solution \hat{r}_i that only depends on the local observation s^{local}_i of agent i, thus enabling decentralized execution. Lastly, in Sec. 2.3, we distill the theoretical insights into a practical algorithm, CollaQ, by directly learning the compositional mapping s^{local}_i \mapsto \hat{r}_i \mapsto V_i in an end-to-end fashion, while keeping the decomposition structure of self state and local observations.

2.1 BASIC ASSUMPTION
A naive modeling of multi-agent collaboration is to estimate a joint value function V_joint := V_joint(s_1, s_2, \ldots, s_K) and find the best action for agent i to maximize V_joint given the current joint state s = (s_1, s_2, \ldots, s_K). However, this has three fundamental drawbacks: (1) V_joint generally requires an exponential number of samples to learn; (2) in order to evaluate this function, a full observation of the states of all agents is required, which disallows decentralized execution, one key preference of multi-agent RL; and (3) for any environment/team change (e.g., teaming with different agents), V_joint needs to be relearned for all agents, which renders ad hoc team play impossible.

Our CollaQ addresses these three issues with a novel theoretical framework that decouples the interactions between agents. Instead of using V_joint, which bundles all the agent interactions together, we consider the underlying mechanism of how they interact: in a fully collaborative setting, the reason why agent i takes actions towards a state is not only that the state is rewarding to agent i, but also that it is more rewarding to agent i than to other agents in the team, from agent i's point of view. This is the concept of the perceived reward of agent i. Each agent then acts independently following its own value function V_i, which is the optimal solution to the Bellman equation conditioned on the assigned perceived reward, and is a function of it. This naturally leads to collaboration.

We build a mathematical framework to model such behaviors. Specifically, we make the following assumption on the behavior of each agent:

Assumption 1.
Each agent i has a perceived reward assignment r_i \in \mathbb{R}_+^{|S_i||A_i|} that may depend on the joint state s = (s_1, \ldots, s_K). Agent i acts according to its own state s_i and its individual optimal value V_i = V_i(s_i; r_i) (and the associated Q_i(s_i, a_i; r_i)), which is a function of r_i.

Note that the perceived reward assignment r_i \in \mathbb{R}_+^{|S_i||A_i|} is a non-negative vector containing the assignment of a scalar reward at each state-action pair (hence its length is |S_i||A_i|). We may also equivalently write it as a function r_i(x, a): S_i \times A_i \mapsto \mathbb{R}, where x \in S_i and a \in A_i. Here x is a dummy variable that runs through all states of agent i, while s_i refers to its current state.

Given the perceived reward assignments {r_i}, the values and actions of agents become decoupled. Due to the fully collaborative nature, a natural choice of {r_i} is the optimal solution of the following objective J(r_1, r_2, \ldots, r_K). Here r_e is the external reward of the environment, w_i \ge 0 is the preference of agent i, and \odot is the Hadamard (element-wise) product:

J(r_1, \ldots, r_K) := \sum_{i=1}^{K} V_i(s_i; r_i) \quad \text{s.t.} \quad \sum_{i=1}^{K} w_i \odot r_i \le r_e \qquad (1)

Note that the constraint ensures that the objective has a bounded solution. Without this constraint, we could simply take each perceived reward r_i to +\infty, since each value function V_i(s_i; r_i) monotonically increases with respect to r_i. Intuitively, Eqn. 1 means that we "assign" the external reward r_e optimally to the K agents as perceived rewards, so that their overall values are the highest.

In the case of sparse reward, r_e(x, a) = 0 for most state-action pairs (x, a). By Eqn. 1, for all agents i, the perceived reward r_i(x, a) = 0 at those pairs, so we only need to focus on the nonzero entries of each r_i. Define M to be the number of state-action pairs with positive external reward. Discarding zero entries, we can regard each r_i as an M-dimensional vector. Finally, we define the reward matrix R = [r_1, \ldots, r_K] \in \mathbb{R}^{M \times K}.
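As a purely illustrative example of the quantities above, the snippet below builds a toy reward matrix R = [r_1, ..., r_K] over M sparse reward sites and checks the feasibility constraint of Eqn. 1; the numbers and the equal-split heuristic are our own assumptions, not the optimization the paper actually solves.

```python
import numpy as np

M, K = 4, 3                          # M sparse reward sites, K agents
r_e = np.array([10., 4., 6., 2.])    # external reward at each site (hypothetical values)
w = np.ones((M, K))                  # per-agent preferences w_i (here: uniform)

# A naive feasible assignment: split every external reward equally among the agents.
R = np.repeat((r_e / K)[:, None], K, axis=1)     # column R[:, i] is the perceived reward r_i

# Feasibility check for the constraint  sum_i w_i * r_i <= r_e  (element-wise).
assert np.all((w * R).sum(axis=1) <= r_e + 1e-8)

# The objective J would sum each agent's optimal value under its perceived reward;
# CollaQ never solves this program explicitly -- it learns the mapping end-to-end.
print(R)
```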
2.2 LEARN TO PREDICT THE OPTIMAL ASSIGNED REWARD r^*_i(s)

The optimal reward assignment R^* of Eqn. 1, as well as its i-th column r^*_i, is a function of the joint state s = {s_1, s_2, \ldots, s_K}. Once the optimization is done, each agent can take the best action a^*_i = \arg\max_{a_i} Q_i(s_i, a_i; r^*_i(s)) independently from the reconstructed Q-function.

The formulation V_i(s_i; r_i) avoids learning the value function of statistically infeasible joint states V_i(s). Since an agent acts solely based on r_i, ad hoc team play becomes possible if the correct r_i is provided. However, since each V_i is a convex function with respect to r_i, maximizing Eqn. 1 means maximizing a summation of convex functions under linear constraints, which is computationally hard. Furthermore, to obtain actions for each agent, we would need to solve Eqn. 1 at every step, which still requires centralization at test time and prevents decentralized execution.

To overcome the optimization complexity and enable decentralized execution, we consider learning a direct mapping from the joint state s to the optimally assigned reward r^*_i(s). However, since s is a joint state, learning such a mapping can be as hard as modeling V_i(s).

Fortunately, V_i(s_i; r_i(s)) is not an arbitrary function, but the optimal value function that satisfies the Bellman equation. Due to this special structure of V_i, we can find an approximate assignment \hat{r}_i for each agent i such that \hat{r}_i only depends on a local observation s^{local}_i, i.e., the states of nearby agents observed by agent i: \hat{r}_i(s) = \hat{r}_i(s^{local}_i). At the same time, these approximate reward assignments {\hat{r}_i} are approximately optimal for the joint optimization (Eqn. 1), with bounded error:

Theorem 1.
For all i \in \{1, \ldots, K\} and all s_i \in S_i, there exists a reward assignment \hat{r}_i that (1) only depends on s^{local}_i and (2) is the i-th column of a feasible global reward assignment \hat{R} such that

J(\hat{R}) \ge J(R^*) - (\gamma^{C} + \gamma^{D}) R_{\max} M K, \qquad (2)

where C and D are constants related to distances between agents/rewards (details in Appendix).

Since \hat{r}_i only depends on the local observation of agent i (i.e., the agent's own state s_i as well as the states of nearby agents), it enables decentralized execution: for each agent i, the local observation is sufficient for the agent to act near-optimally.

Limitation. One limitation of Theorem 1 is that the optimality gap of \hat{r}_i heavily depends on the size of s^{local}_i. If the local observation of agent i covers more agents, then the gap is smaller but the cost of learning such a mapping is higher, since the mapping has more input states and becomes higher-dimensional. In practice, we found that using the observation o_i of agent i, which covers s^{local}_i, works sufficiently well, as shown in the experiments (Sec. 4).

2.3 COLLABORATIVE Q-LEARNING (COLLAQ)
While Theorem 1 shows the existence of a perceived reward \hat{r}_i = \hat{r}_i(s^{local}_i) with good properties, learning \hat{r}_i(s^{local}_i) is not a trivial task. Learning it in a supervised manner requires (close to) optimal assignments as labels, which in turn requires solving Eqn. 1. Instead, we resort to end-to-end learning of Q_i for each agent i with a proper decomposition structure inspired by the theory above.

To see this, we expand the Q-function of agent i, Q_i = Q_i(s_i, a_i; \hat{r}_i), with respect to its perceived reward. We use a Taylor expansion at the ground-zero reward r^0_i = r^0_i(s_i), which is the perceived reward when only agent i is present in the environment:

Q_i(s_i, a_i; \hat{r}_i) = \underbrace{Q_i(s_i, a_i; r^0_i)}_{Q^{alone}(s_i, a_i)} + \underbrace{\nabla_r Q_i(s_i, a_i; r^0_i) \cdot (\hat{r}_i - r^0_i) + O(\|\hat{r}_i - r^0_i\|^2)}_{Q^{collab}(s^{local}_i, a_i)} \qquad (3)

Here Q_i(s_i, a_i; r^0_i) is the alone policy of agent i. We name it Q^alone since it operates as if other agents did not exist. The second term is called Q^collab, which models the interaction among agents via the perceived reward \hat{r}_i. Both Q^alone and Q^collab are neural networks. Thanks to Theorem 1, we only need to feed the local observation o_i := s^{local}_i of agent i, which contains the observations of W < K local agents (Fig. 1), to obtain an approximately optimal Q_i. The overall Q_i is then computed by a simple addition (here o^{alone}_i := s_i is the individual state of agent i):

Q_i(o_i, a_i) = Q^{alone}_i(o^{alone}_i, a_i) + Q^{collab}_i(o_i, a_i) \qquad (4)

Multi-Agent Reward Attribution (MARA) Loss. With a simple addition, the solution of Q^{alone}_i and Q^{collab}_i might not be unique: indeed, we could add any constant to Q^alone and subtract that constant from Q^collab to yield the same overall Q_i. However, according to Eqn. 3, there is an additional constraint: if o_i = o^{alone}_i then \hat{r}_i = r^0_i and Q^{collab}(o^{alone}_i, a_i) \equiv 0, which eliminates this ambiguity. For this, we add the Multi-Agent Reward Attribution (MARA) loss.

Figure 1: Architecture of the network. We use a standard DRQN architecture for o^{alone}_i with an attention-based model for Q^collab. The attention layers take the encoded inputs from all agents and output an attention embedding.

Overall Training Paradigm. For agent i, we use standard DQN training with the MARA loss. Define y = \mathbb{E}_{s' \sim \varepsilon}[r + \gamma \max_{a'} Q_i(o', a') \mid s, a] to be the target Q-value; the overall training objective is:

L = \mathbb{E}_{s_i, a_i \sim \rho(\cdot)}\Big[\underbrace{\big(y - Q_i(o_i, a_i)\big)^2}_{\text{DQN objective}} + \alpha \underbrace{\big(Q^{collab}_i(o^{alone}_i, a_i)\big)^2}_{\text{MARA objective}}\Big] \qquad (5)

where the hyper-parameter \alpha determines the relative importance of the MARA objective against the DQN objective. We observe that training is much more stable with the MARA loss. We use a soft-constraint version of the MARA loss. To train multiple agents together, we follow QMIX and feed the output of {Q_i} into a top network and train in an end-to-end centralized fashion.
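The sketch below is a minimal PyTorch rendering of Eqn. 4 and Eqn. 5, assuming simple MLP towers and a single transition batch; the module and tensor names are ours, and the paper's actual implementation uses the DRQN/attention architecture of Fig. 1 plus a QMIX-style mixing network on top.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollaQAgent(nn.Module):
    def __init__(self, self_dim, obs_dim, n_actions, hidden=64):
        super().__init__()
        # Q_alone sees only the agent's own state; Q_collab sees the full local observation.
        self.q_alone = nn.Sequential(nn.Linear(self_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_actions))
        self.q_collab = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, n_actions))

    def forward(self, o_alone, o_full):
        # Eqn. 4: Q_i(o_i, a) = Q_alone(o_alone_i, a) + Q_collab(o_i, a)
        return self.q_alone(o_alone) + self.q_collab(o_full)

def collaq_loss(agent, o_alone, o_full, a, td_target, o_alone_padded, alpha=0.1):
    """Eqn. 5: DQN TD error plus the MARA regularizer that pins Q_collab to zero
    when no other agents are present (o_alone_padded = the observation with the
    other-agent slots zeroed out -- an assumption about the input encoding)."""
    q = agent(o_alone, o_full).gather(1, a.unsqueeze(1)).squeeze(1)
    dqn_loss = F.mse_loss(q, td_target)
    mara_loss = agent.q_collab(o_alone_padded).pow(2).mean()
    return dqn_loss + alpha * mara_loss
```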
CollaQ has advantages compared to normal Q-learning. Since Q^{alone}_i only takes o^{alone}_i, whose dimension is independent of the number of agents, this term can be learned exponentially faster than Q^{collab}_i. Thus, agents using CollaQ first learn to solve the problem pretending no other agents are around using Q^{alone}_i, and then learn the interaction with local agents through Q^{collab}_i.

Attention-based Architecture. Fig. 1 illustrates the overall architecture. For agent i, the local observation o_i := s^{local}_i is separated into two parts: the individual state o^{alone}_i := s_i and the states of the other visible agents. Here, o^{alone}_i is sent to the left tower to obtain Q^alone, while o_i is sent to the right tower to obtain Q^collab. We use an attention architecture between o^{alone}_i and the other agents' states in the field of view of agent i. This is because the observation o_i can be spatially large and cover agents whose states do not contribute much to agent i's action, i.e., the effective s^{local}_i is smaller than o_i. Our architecture is similar to EPC (Long et al., 2020) except that we use a transformer architecture (stacking multiple layers of attention modules). As shown in the experiments, this helps improve the performance in various StarCraft settings.
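As a rough sketch of the right tower, the snippet below encodes the self state as the query and the visible agents' states as keys/values in a stacked nn.MultiheadAttention; the layer sizes, number of heads, and two-layer stacking are our assumptions based on the description of Fig. 1, not the exact architecture.

```python
import torch
import torch.nn as nn

class CollabTower(nn.Module):
    """Attention over the W visible agents, queried by agent i's own encoded state."""
    def __init__(self, feat_dim, n_actions, embed=64, heads=4, layers=2):
        super().__init__()
        self.enc_self = nn.Linear(feat_dim, embed)
        self.enc_other = nn.Linear(feat_dim, embed)
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(embed, heads, batch_first=True) for _ in range(layers)])
        self.head = nn.Linear(embed, n_actions)

    def forward(self, self_feat, other_feats):
        # self_feat: [B, feat_dim]; other_feats: [B, W, feat_dim]
        q = self.enc_self(self_feat).unsqueeze(1)        # [B, 1, embed] query
        kv = self.enc_other(other_feats)                 # [B, W, embed] keys/values
        for layer in self.attn:
            q, _ = layer(q, kv, kv)                      # attend over the visible agents
        return self.head(q.squeeze(1))                   # [B, n_actions] -> Q_collab
```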
3 EXPERIMENTS ON RESOURCE COLLECTION

In this section, we demonstrate the effectiveness of CollaQ in a toy gridworld environment where the states are fully observable. We also visualize the trained policies Q_i and Q^{alone}_i.

Figure 2: Results in resource collection. CollaQ (green) produces much higher rewards in both training and ad hoc team play than IQL (orange).
Ad hoc Resource Collection. We demonstrate CollaQ in a toy example where multiple agents collaboratively collect resources from a gridworld to maximize the aggregated team reward. In this setup, the same type of resource can return different rewards depending on the type of agent that collects it. The reward setup is randomly initialized at the beginning of each episode and can be seen by all the agents. The game ends when all the resources are collected. An agent is an expert for a certain resource if it gets the highest reward in the team for collecting it. As a consequence, to maximize the shared team reward, the optimal strategy is to let the expert collect the corresponding resource.

For testing, we devise the following reward setup (see the sketch at the end of this section): we have apple and lemon as our resources and N agents. For picking a lemon, agent 1 receives the highest reward in the team, agent 2 gets the second highest, and so on. For apple, the reward assignment is reversed (agent N gets the highest reward, agent N-1 gets the second highest, ...). This specific reward setup is excluded from the environment setup for training. This is a very hard ad hoc team play task at test time, since the agents need to demonstrate completely different behaviors from training time to achieve a higher team reward. The left plot in Fig. 2 shows the training reward and the right one shows ad hoc team play. We train with 5 agents in this setting. CollaQ outperforms IQL in both training and testing. In this example, random actions work reasonably well, so any improvement over them is substantial.

Figure 3: Visualization of Q^{alone}_i and Q_i in resource collection. The reward setup is shown in the leftmost column. Interesting behaviors emerge: in b), Q^{collab}_i reinforces the behavior of Q^{alone}_i since both agents are the experts for their nearest resources; in a) and c), Q^{collab}_i alters the decision of collecting the lemon for the red agent since it has a lower reward for lemon than the yellow agent, and a similar phenomenon occurs for the yellow agent.

Visualization of Q^{alone}_i and Q_i. In Fig. 3, we visualize the trained Q^{alone}_i and Q_i (the overall policy for agent i) to show how Q^{collab}_i affects the behaviors of each agent. The policies Q^{alone}_i and Q_i learned by CollaQ are both meaningful: Q^{alone}_i is the simple strategy of collecting the nearest resource (the optimal policy when the agent is the only one acting in the environment), and Q_i is the optimal policy described above.

The leftmost column in Fig. 3 shows the reward setup for different agents collecting different resources (e.g., the red agent gets 4 points for collecting a lemon and 10 points for collecting an apple). The red agent specializes in collecting apples and the yellow agent specializes in collecting lemons. In a), Q^{alone}_i directs both agents to collect their nearest resource. However, neither agent is the expert for its nearest resource. Therefore, Q^{collab}_i alters the decision of Q^{alone}_i, directing Q_i towards the resource with the highest return. This behavior is also observed in c) with a different resource placement. b) shows the scenario where both agents are the experts for their nearest resources; Q^{collab}_i reinforces the decision of Q^{alone}_i, making Q_i point to the same resource as Q^{alone}_i.
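As an illustration of the held-out test reward setup above (the magnitudes are hypothetical; the paper only specifies the ordering), the snippet builds a reward table in which expertise for lemon decreases with agent index while expertise for apple increases:

```python
import numpy as np

N = 5  # number of agents (the paper trains with 5 agents in this environment)

# Hypothetical reward magnitudes: only the ordering matters for the ad hoc test.
lemon_rewards = np.arange(N, 0, -1) * 2      # agent 1 (row 0) is the lemon expert
apple_rewards = np.arange(1, N + 1) * 2      # agent N (last row) is the apple expert
reward_table = np.stack([lemon_rewards, apple_rewards], axis=1)   # shape [N, 2]

# Team-optimal behaviour: each resource should be collected by its expert,
# which may require an agent to ignore the resource closest to it.
print(reward_table)
```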
4 EXPERIMENTS ON STARCRAFT MULTI-AGENT CHALLENGE

The StarCraft multi-agent challenge (Samvelyan et al., 2019) is a widely-used benchmark for MARL evaluation. The task in this environment is to manage a team of units (each unit is controlled by an agent) to defeat the team controlled by built-in AIs. While this task has been extensively studied in previous works, the performance of agents trained by the SoTA methods (e.g., QMIX) deteriorates with a slight modification to the environment setup in which the agent IDs are changed. The SoTA methods severely overfit to the precise environment and thus cannot generalize well to ad hoc team play. In contrast, CollaQ shows better performance in the presence of random agent IDs, generalizes significantly better in more diverse test environments (e.g., adding/swapping/removing a unit at test time), and is more robust in ad hoc team play.

4.1 ISSUES IN THE CURRENT BENCHMARK
Figure 4: QMIX overfits to agent IDs. Introducing random agent IDs at test time greatly affects the performance.

Figure 5: Results on standard StarCraft benchmarks with random agent IDs. CollaQ (without and with attention) clearly surpasses the previous SoTAs. The attention-based model further improves the win rates for all maps except 2c_vs_64zg, which only has 2 agents, so attention may not bring enough benefit.

In the default StarCraft multi-agent environment, the ID of each agent never changes. Thus, a trained agent can memorize what to do based on its ID instead of figuring out the role of its units dynamically during play. As illustrated in Fig. 4, if we randomly shuffle the IDs of the agents at test time, the performance of QMIX gets much worse. In some cases, the win rate drops from 95% to 50%, deteriorating by more than 40%. The results show that QMIX relies on this extra information (the order of agents) for generalization. As a consequence, the resulting agents overfit to the exact setting, making them less robust in ad hoc team play. Introducing randomly shuffled agent IDs at training time addresses this issue for QMIX, as illustrated in Fig. 4.
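A minimal sketch of the random-ID trick described above (the exact SMAC observation encoding differs; the one-hot ID field and helper name here are our assumptions):

```python
import numpy as np

def shuffled_id_features(n_agents, rng):
    """Return a random permutation of one-hot agent IDs, resampled every episode,
    so agents cannot memorize a policy keyed to a fixed ID slot."""
    perm = rng.permutation(n_agents)
    return np.eye(n_agents)[perm]          # row i is the ID feature given to agent i

rng = np.random.default_rng(0)
id_feats = shuffled_id_features(5, rng)
# obs_i = np.concatenate([env_obs_i, id_feats[i]])  # appended to each agent's observation
```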
4.2 STARCRAFT MULTI-AGENT CHALLENGE WITH RANDOM AGENT IDS

Since using random IDs facilitates the learning of different roles, we perform an extensive empirical study under this setting. We show that CollaQ outperforms existing approaches on multiple maps in StarCraft. We use hard scenarios (e.g., MMM2), which are largely unsolved by previous methods, as well as maps of medium difficulty. For completeness, we also provide a performance comparison under the regular setting in Appendix D (Fig. 10). As shown in Fig. 5, CollaQ outperforms multiple baselines (QMIX, QTRAN, VDN, and IQL) by a clear margin in terms of win rate in multiple hard scenarios. With the attention model, the performance is even stronger.

Trained CollaQ agents demonstrate interesting behaviors. On MMM2: (1) the Medivac dropship only heals the unit under attack, and (2) damaged units move backward to avoid focused fire from the opponent, while healthy units move forward to draw fire. In comparison, QMIX only learns (1), and it is not obvious that (2) is learned. On another hard map, CollaQ learns to focus fire on one side of the attack to clear one of the corridors. It also demonstrates the behavior of retreating along that corridor while attacking, which agents trained by QMIX do not. See Appendix D for more video snapshots.

4.3 AD HOC TEAM WORK
Now we demonstrate that CollaQ is robust to changes of agent configurations and/or priority at test time, i.e., ad hoc team play, in addition to handling random IDs.
Different VIP agent. In this setting, the team gets an additional reward if the VIP agent is alive after winning the battle. The VIP agent is randomly selected from agents 1 to N-1 during training. At test time, agent N becomes the VIP, which is a new setup not seen in training. Fig. 6 shows the VIP agent survival rate at test time. We can see that CollaQ outperforms QMIX by a clear margin. We also see that CollaQ learns the behavior of protecting the VIP: when the team is about to win, the VIP agent is covered by other agents to avoid being attacked. Such behavior is not clearly shown in QMIX when the same objective is presented.

Swap / Add / Remove different units. We also test ad hoc team play in three harder settings: we swap the agent type, add one agent, or remove one agent at test time. From Fig. 7, we can see that CollaQ generalizes better to these ad hoc test settings. Note that to deal with the changing number of agents at test time, all of the methods (QMIX, QTRAN, VDN, IQL, and CollaQ) are augmented with attention-based neural architectures for a fair comparison. We can also see that CollaQ outperforms QMIX, the second best, on swapping, removing, and adding agents.

Figure 6: Results for StarCraft ad hoc team play using a different VIP agent. At test time, CollaQ has a substantially higher VIP survival rate than QMIX. The attention-based model also boosts the survival rate.

Figure 7: Ad hoc team play with: a) swapping, b) adding, and c) removing a unit at test time. CollaQ outperforms QMIX and other methods substantially in all three settings.

Figure 8: Ablation studies on the mixture of experts and the effect of the MARA loss. CollaQ outperforms QMIX with a mixture of experts by a large margin, and removing the MARA loss significantly degrades the performance.

4.4 ABLATION STUDY
We further verify CollaQ in an ablation study. First, we show that CollaQ outperforms a baseline (SumTwoNets) that simply sums over two networks, each of which takes the agent's full observation as input. SumTwoNets does not distinguish between Q^alone (which only takes s_i as input) and Q^collab (which respects the condition Q^collab(s_i, ·) = 0). Second, we show that the MARA loss is indeed critical for the performance of CollaQ.

We compare our method with SumTwoNets trained with QMIX for each agent. The baseline has a similar parameter count compared to CollaQ. As shown in Fig. 8, compared to SumTwoNets trained with QMIX, CollaQ improves the win rates on hard scenarios. We also study the importance of the MARA loss by removing it from CollaQ. Using the MARA loss boosts the performance considerably on hard scenarios, consistent with the decomposition proposed in Sec. 2.3.

5 RELATED WORK
Multi-agent reinforcement learning (MARL) has been studied since the 1990s (Tan, 1993; Littman, 1994; Bu et al., 2008). Recent progress in deep reinforcement learning has given rise to an increasing effort in designing general-purpose deep MARL algorithms (including COMA (Foerster et al., 2018), MADDPG (Lowe et al., 2017), MAPPO (Berner et al., 2019), PBT (Jaderberg et al., 2019), MAAC (Iqbal and Sha, 2018), etc.) for complex multi-agent games. We utilize the Q-learning framework and consider collaborative tasks in strategic games. Other works focus on different aspects of the collaborative MARL setting, such as learning to communicate (Foerster et al., 2016; Sukhbaatar et al., 2016; Mordatch and Abbeel, 2018), robotic manipulation (Chitnis et al., 2019), traffic control (Vinitsky et al., 2018), and social dilemmas (Leibo et al., 2017).

The problem of ad hoc team play in multi-agent cooperative games was raised in the early 2000s (Bowling and McCracken, 2005; Stone et al., 2010) and is mostly studied in the robotic soccer domain (Hausknecht et al., 2016). Most works (Barrett and Stone, 2015; Barrett et al., 2012; Chakraborty and Stone, 2013; Woodward et al., 2019) either require sophisticated online learning at test time or strong domain knowledge of possible teammates, which poses significant limitations when applied to complex real-world situations. In contrast, our framework achieves zero-shot generalization and requires little change to the overall existing MARL training. There are also works considering a much simplified ad-hoc teamwork setting by tackling a varying number of test-time homogeneous agents (Schwab et al., 2018; Long et al., 2020), while our method can handle more general scenarios.

Previous work on generalization/robustness in MARL typically considers a competitive setting and aims to learn policies that can generalize to different test-time opponents. Popular techniques include meta-learning for adaptation (Al-Shedivat et al., 2017), adversarial training (Li et al., 2019), Bayesian inference (He et al., 2016; Shen and How, 2019; Serrino et al., 2019), symmetry breaking (Hu et al., 2020), learning Nash equilibrium strategies (Lanctot et al., 2017; Brown and Sandholm, 2019), and population-based training (Vinyals et al., 2019; Long et al., 2020; Canaan et al., 2020). Population-based algorithms use ad hoc team play as a training component, and the overall objective is to improve opponent generalization, whereas we consider zero-shot generalization to different teammates at test time. Our work is also related to hierarchical approaches for multi-agent collaborative tasks (Shu and Tian, 2019; Carion et al., 2019; Yang et al., 2020). They train a centralized manager to assign subtasks to individual workers, and the manager can generalize to new workers at test time. However, all these works assume known worker types or policies, which is infeasible for complex tasks. Our method does not make any of these assumptions and can be easily trained in an end-to-end fashion.

Lastly, our mathematical formulation is related to the credit assignment problem in RL (Sutton, 1985; Foerster et al., 2018; Nguyen et al., 2018). However, our approach does not calculate any explicit reward assignment; we distill the theoretical insight and derive a simple yet effective learning objective.
6 CONCLUSION

In this work, we propose CollaQ, which models multi-agent RL as a dynamic reward assignment problem. We show that, under certain conditions, there exist decentralized policies for each agent, and these policies are approximately optimal from the point of view of the team goal. CollaQ then learns these policies with an end-to-end training framework while using the decomposition of the Q-function suggested by the theoretical analysis. CollaQ is tested on the challenging StarCraft Multi-Agent Challenge and surpasses the previous SoTA in terms of win rates on various maps and in several ad hoc team play settings. We believe the idea of multi-agent reward assignment used in CollaQ can be an effective strategy for ad hoc MARL.

ACKNOWLEDGEMENTS
This project occurred under the BAIR Commons at UC Berkeley, and we thank the Commons sponsors for their support. In addition to NSF CISE Expeditions Award CCF-1730628, UC Berkeley research is supported by gifts from Alibaba, Amazon Web Services, Ant Financial, CapitalOne, Ericsson, Facebook, Futurewei, Google, Intel, Microsoft, Nvidia, Scotiabank, Splunk and VMware.

REFERENCES
Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.

Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.

Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson. The starcraft multi-agent challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pages 2186–2188, 2019.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528, 2019.

Peter Stone, Gal A Kaminka, Sarit Kraus, and Jeffrey S Rosenschein. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

Samuel Barrett, Peter Stone, and Sarit Kraus. Empirical evaluation of ad hoc teamwork in the pursuit domain. In AAMAS, pages 567–574, 2011.

Samuel Barrett and Peter Stone. Cooperating with unknown teammates in complex domains: A robot soccer case study of ad hoc teamwork. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 4190–4203, 2017.

Hengyuan Hu, Adam Lerer, Alex Peysakhovich, and Jakob Foerster. "Other-play" for zero-shot coordination. arXiv preprint arXiv:2003.02979, 2020.

Devin Schwab, Yifeng Zhu, and Manuela Veloso. Zero shot transfer learning for robot soccer. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2070–2072, 2018.

Qian Long, Zihan Zhou, Abhibav Gupta, Fei Fang, Yi Wu, and Xiaolong Wang. Evolutionary population curriculum for scaling multi-agent reinforcement learning. arXiv preprint arXiv:2003.10423, 2020.

Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, 1993.

Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.

Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485, 2018.

Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1905.05408, 2019.

Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pages 157–163. Elsevier, 1994.

Lucian Bu, Robert Babu, Bart De Schutter, et al. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.

Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.

Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning. arXiv preprint arXiv:1810.02912, 2018.

Jakob Foerster, Ioannis Alexandros Assael, Nando De Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.

Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.

Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Rohan Chitnis, Shubham Tulsiani, Saurabh Gupta, and Abhinav Gupta. Efficient bimanual manipulation using learned task schemas. arXiv preprint arXiv:1909.13874, 2019.

Eugene Vinitsky, Aboudy Kreidieh, Luc Le Flem, Nishant Kheterpal, Kathy Jang, Cathy Wu, Fangyu Wu, Richard Liaw, Eric Liang, and Alexandre M Bayen. Benchmarks for reinforcement learning in mixed-autonomy traffic. In Conference on Robot Learning, pages 399–409, 2018.

Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464–473, 2017.

Michael Bowling and Peter McCracken. Coordination and adaptation in impromptu teams. In AAAI, volume 5, pages 53–58, 2005.

Matthew Hausknecht, Prannoy Mupparaju, Sandeep Subramanian, Shivaram Kalyanakrishnan, and Peter Stone. Half field offense: An environment for multiagent learning and ad hoc teamwork. In AAMAS Adaptive Learning Agents (ALA) Workshop, 2016.

Samuel Barrett, Peter Stone, Sarit Kraus, and Avi Rosenfeld. Learning teammate models for ad hoc teamwork. In AAMAS Adaptive Learning Agents (ALA) Workshop, pages 57–63, 2012.

Doran Chakraborty and Peter Stone. Cooperating with a markovian ad hoc teammate. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pages 1085–1092, 2013.

Mark Woodward, Chelsea Finn, and Karol Hausman. Learning to interactively learn and assist. arXiv preprint arXiv:1906.10187, 2019.

Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.

Shihui Li, Yi Wu, Xinyue Cui, Honghua Dong, Fei Fang, and Stuart Russell. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4213–4220, 2019.

He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning, pages 1804–1813, 2016.

Macheng Shen and Jonathan P How. Robust opponent modeling via adversarial ensemble reinforcement learning in asymmetric imperfect-information games. arXiv preprint arXiv:1909.08735, 2019.

Jack Serrino, Max Kleiman-Weiner, David C Parkes, and Josh Tenenbaum. Finding friend and foe in multi-agent games. In Advances in Neural Information Processing Systems, pages 1249–1259, 2019.

Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.

Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

Rodrigo Canaan, Xianbo Gao, Julian Togelius, Andy Nealen, and Stefan Menzel. Generating and adapting to diverse ad-hoc cooperation agents in hanabi. arXiv preprint arXiv:2004.13710, 2020.

Tianmin Shu and Yuandong Tian. M^3RL: Mind-aware multi-agent management reinforcement learning. In International Conference on Learning Representations, 2019.

Nicolas Carion, Nicolas Usunier, Gabriel Synnaeve, and Alessandro Lazaric. A structured prediction approach for generalization in cooperative multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 8128–8138, 2019.

Jiachen Yang, Alireza Nakhaei, David Isele, Kikuo Fujimura, and Hongyuan Zha. CM3: Cooperative multi-goal multi-stage multi-agent reinforcement learning. In International Conference on Learning Representations, 2020.

Richard S Sutton. Temporal credit assignment in reinforcement learning. 1985.

Duc Thien Nguyen, Akshat Kumar, and Hoong Chuin Lau. Credit assignment for collective multiagent rl with global rewards. In Advances in Neural Information Processing Systems, pages 8102–8113, 2018.
A COLLABORATIVE Q DETAILS
We derive the gradient and provide the training details for Eq. 5.
Gradient for Training Objective. Taking the derivative with respect to \theta^a_n and \theta^c_n in Eq. 5, we arrive at the following gradients:

\nabla_{\theta^a_n} L_n(\theta^a_n, \theta^c_n) = \mathbb{E}_{s_i, a \sim \rho(\cdot), r_i;\, s' \sim \varepsilon}\Big[\big(r + \gamma \max_{a'} Q_i(s', a', r_i; \theta^a_{n-1}, \theta^c_{n-1}) - Q_i(o_i, a, r_i; \theta^a_n, \theta^c_n)\big)\, \nabla_{\theta^a_n} Q^a_i(s_i, a, r_i; \theta^a_n)\Big] \qquad (6a)

\nabla_{\theta^c_n} L_n(\theta^a_n, \theta^c_n) = \mathbb{E}_{s_i, a \sim \rho(\cdot), r_i;\, s' \sim \varepsilon}\Big[\big(r + \gamma \max_{a'} Q_i(s', a', r_i; \theta^a_{n-1}, \theta^c_{n-1}) - Q_i(o_i, a, r_i; \theta^a_n, \theta^c_n)\big)\, \nabla_{\theta^c_n} Q^c_i(o_i, a, r_i; \theta^c_n) - \alpha\, Q^c_i(s_i, a, r_i; \theta^c_n)\, \nabla_{\theta^c_n} Q^c_i(s_i, a, r_i; \theta^c_n)\Big] \qquad (6b)

Soft CollaQ. In the actual implementation, we use a soft-constraint version of CollaQ: we subtract Q^{collab}(o^{alone}_i, a_i) from Eq. 4. The Q-value decomposition now becomes:

Q_i(o_i, a_i) = Q^{alone}_i(o^{alone}_i, a_i) + Q^{collab}_i(o_i, a_i) - Q^{collab}_i(o^{alone}_i, a_i) \qquad (7)

The optimization objective is kept the same as in Eq. 5. This helps reduce variance in all the settings in resource collection and the StarCraft multi-agent challenge. We sometimes also replace Q^{collab}(o^{alone}_i, a_i) in Eq. 7 by its target-network value to further stabilize training.
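A minimal PyTorch sketch of the soft-constraint variant in Eq. 7 (module and tensor names are ours; o_alone_padded denotes the observation with the other-agent slots removed or zeroed, which is an assumption about the input encoding). The detach on the subtracted term mimics the option of replacing it with a target value.

```python
import torch
import torch.nn as nn

class SoftCollaQAgent(nn.Module):
    """Eq. 7: Q_i = Q_alone(o_alone) + Q_collab(o) - Q_collab(o_alone)."""
    def __init__(self, self_dim, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.q_alone = nn.Sequential(nn.Linear(self_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_actions))
        self.q_collab = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, n_actions))

    def forward(self, o_alone, o_full, o_alone_padded, detach_correction=True):
        correction = self.q_collab(o_alone_padded)
        if detach_correction:
            # Optionally treat the subtracted term as a fixed target to stabilize training.
            correction = correction.detach()
        return self.q_alone(o_alone) + self.q_collab(o_full) - correction
```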
B ENVIRONMENT SETUP AND TRAINING DETAILS
Resource Collection. We set the discount factor to 0.992 and use the RMSprop optimizer with a learning rate of 4e-5. ε-greedy is used for exploration, with ε annealed linearly from 1.0 to 0.01 over 100k steps. We use a batch size of 128 and update the target every 10k steps. For the temperature parameter α, we set it to 1. We run all the experiments 3 times and plot the mean/std in all the figures.

StarCraft Multi-Agent Challenge. We set the discount factor to 0.99 and use the RMSprop optimizer with a learning rate of 5e-4. ε-greedy is used for exploration, with ε annealed linearly from 1.0 to 0.05 over 50k steps. We use a batch size of 32 and update the target every 200 episodes. For the temperature parameter α, we set it to 0.1 for 27m_vs_30m and to 1 for all other maps.

All experiments on StarCraft II use the default reward and observation settings of the SMAC benchmark. For ad hoc team play with a different VIP, an additional reward of 100 is added to the original reward of 200 for winning the game if the VIP agent is alive after the episode.

For swapping agent types, we design the maps 3s1z_vs_16zg, 1s3z_vs_16zg, and 2s2z_vs_16zg (s stands for stalker, z stands for zealot, and zg stands for zergling). We use the first two maps for training and the third for testing. For adding units, we use 27m_vs_30m for training and 28m_vs_30m for testing (m stands for marine). For removing units, we use 29m_vs_30m for training and 28m_vs_30m for testing.

We run all the experiments 4 times and plot the mean/std in all the figures.
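For convenience, the hyper-parameters above can be collected into a single config; the values are taken from the text, while the key names are ours and not from an official config file.

```python
# Hyper-parameters reported in Appendix B (key names are ours, not an official config).
CONFIGS = {
    "resource_collection": {
        "gamma": 0.992, "optimizer": "RMSprop", "lr": 4e-5,
        "epsilon_start": 1.0, "epsilon_end": 0.01, "epsilon_anneal_steps": 100_000,
        "batch_size": 128, "target_update_steps": 10_000, "mara_alpha": 1.0,
        "num_seeds": 3,
    },
    "smac": {
        "gamma": 0.99, "optimizer": "RMSprop", "lr": 5e-4,
        "epsilon_start": 1.0, "epsilon_end": 0.05, "epsilon_anneal_steps": 50_000,
        "batch_size": 32, "target_update_episodes": 200,
        "mara_alpha": {"27m_vs_30m": 0.1, "default": 1.0},
        "num_seeds": 4,
    },
}
```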
C DETAILED RESULTS FOR RESOURCE COLLECTION
We compare CollaQ, QMIX, and CollaQ with an attention-based model in the resource collection setting. As shown in Fig. 9, QMIX does not perform well: it is even worse than random action. Adding the attention-based model introduces a larger variance, so the performance degrades slightly in training but improves in ad hoc team play.

Figure 9: Results for resource collection. Adding the attention-based model to CollaQ introduces a larger variance, so the training performance is slightly worse. QMIX does not perform well in this setting.

Figure 10: Results for the StarCraft Multi-Agent Challenge without random agent IDs. CollaQ outperforms QMIX on all three maps.
D DETAILED RESULTS FOR STARCRAFT MULTI-AGENT CHALLENGE
We provide the win rates for CollaQ and QMIX on three maps without random agent IDs. Fig. 10 shows the results for both methods. We also show the exact win rates for all the maps and settings mentioned in the StarCraft Multi-Agent Challenge. From Tab. 1, we can clearly see that CollaQ improves over the previous SoTA by a large margin.
Table 1: Win rates for the StarCraft Multi-Agent Challenge (columns: IQL, VDN, QTRAN, QMIX, CollaQ, CollaQ with Attn; rows include 5m_vs_6m and MMM2). CollaQ shows superior performance over all baselines.

We also check the margin in winning scenarios, measured as how many units survive after winning the battle. The experiments are repeated over 128 random seeds. CollaQ surpasses QMIX by over 2 surviving units on average (Tab. 2), which is a huge gain.

In a simple ad hoc team play setting, we assign a new VIP agent whose survival matters at test time. Results in Tab. 3 show that at test time, the VIP agent in CollaQ has a substantially higher survival rate than in QMIX.

We also test CollaQ in a harder ad hoc team play setting: swapping/adding/removing agents at test time. Tab. 4 summarizes the results; CollaQ outperforms QMIX by a large margin.

Table 2: Number of surviving units on six StarCraft maps (mean and standard deviation over 128 runs). CollaQ outperforms all baselines significantly by keeping more units alive.

Table 3: VIP agent survival rates for the StarCraft Multi-Agent Challenge (columns: IQL, VDN, QTRAN, QMIX, CollaQ, CollaQ with Attn). CollaQ with attention surpasses QMIX by a large margin.

Table 4: Win rates for the StarCraft Multi-Agent Challenge with swapping/adding/removing agents. CollaQ improves over QMIX substantially. (For the adding and removing settings, IQL, VDN, QTRAN, and QMIX all use attention-based models.)
E VIDEOS AND VISUALIZATIONS OF STARCRAFT MULTI-AGENT CHALLENGE
We extract several video frames from the replays of CollaQ's agents for better visualization. In addition, we provide the full replays of QMIX and CollaQ. CollaQ's agents demonstrate interesting behaviors such as healing the agents under attack, dragging back the unhealthy agents, and protecting the VIP agent (under the ad hoc team play setting with different VIP agents). The visualizations and videos are available at https://sites.google.com/view/multi-agent-collaq-public/home
F PROOF AND LEMMAS
Lemma 1. If a'_1 \ge a_1, then 0 \le \max(a'_1, a_2) - \max(a_1, a_2) \le a'_1 - a_1.

Proof. Note that \max(a_1, a_2) = \frac{a_1 + a_2}{2} + \frac{|a_1 - a_2|}{2}. So we have:

\max(a'_1, a_2) - \max(a_1, a_2) = \frac{a'_1 - a_1}{2} + \frac{|a'_1 - a_2|}{2} - \frac{|a_1 - a_2|}{2} \le \frac{a'_1 - a_1}{2} + \frac{|a_1 - a'_1|}{2} = a'_1 - a_1 \qquad (8)

where the inequality uses |a'_1 - a_2| - |a_1 - a_2| \le |a'_1 - a_1|, and the last equality uses a'_1 \ge a_1. The lower bound 0 follows since \max(\cdot, a_2) is non-decreasing in its first argument.

F.1 LEMMAS
Lemma 2. For a Markov Decision Process with finite horizon H and discount factor \gamma < 1, for all i \in \{1, \ldots, K\}, all r_1, r_2 \in \mathbb{R}^M, and all s_i \in S_i, we have:

|V_i(s_i; r_1) - V_i(s_i; r_2)| \le \sum_{x, a} \gamma^{|s_i - x|} |r_1(x, a) - r_2(x, a)| \qquad (9)

where |s_i - x| is the number of steps needed to move from s_i to x.

Proof. By the definition of the optimal value function V_i for agent i, it satisfies the following Bellman equation:

V_i(x_h; r_i) = \max_{a_h} \big(r_i(x_h, a_h) + \gamma\, \mathbb{E}_{x_{h+1} \mid x_h, a_h}[V_i(x_{h+1})]\big) \qquad (10)

Note that to avoid confusion between the agents' initial states s = \{s_1, \ldots, s_K\} and the reward at a state-action pair (s, a), we use (x, a) instead. For a terminal node x_H, which exists due to the finite-horizon MDP with horizon H, V_i(x_H) = r_i(x_H). The current state s_i is at step 0 (i.e., x_0 = s_i).

We first consider the case where r_1 and r_2 only differ at a single state-action pair (x_h, a_h) for h \le H. Without loss of generality, we set r_1(x_h, a_h) > r_2(x_h, a_h).

By the definition of the finite-horizon MDP, V_i(x_{h'}; r_1) = V_i(x_{h'}; r_2) for h' > h. By the property of the max function (Lemma 1), we have:

0 \le V_i(x_h; r_1) - V_i(x_h; r_2) \le r_1(x_h, a_h) - r_2(x_h, a_h) \qquad (11)

Since p(x_h \mid x_{h-1}, a_{h-1}) \le 1, for any (x_{h-1}, a_{h-1}) at step h-1, we have:

0 \le \gamma \big[\mathbb{E}_{x_h \mid x_{h-1}, a_{h-1}}[V_i(x_h; r_1)] - \mathbb{E}_{x_h \mid x_{h-1}, a_{h-1}}[V_i(x_h; r_2)]\big] \qquad (12)
\le \gamma \big[r_1(x_h, a_h) - r_2(x_h, a_h)\big] \qquad (13)

Applying Lemma 1 and noting that all other rewards do not change, we have:

0 \le V_i(x_{h-1}; r_1) - V_i(x_{h-1}; r_2) \le \gamma \big[r_1(x_h, a_h) - r_2(x_h, a_h)\big] \qquad (14)

Applying this iteratively, we finally have:

0 \le V_i(s_i; r_1) - V_i(s_i; r_2) \le \gamma^h \big[r_1(x_h, a_h) - r_2(x_h, a_h)\big] \qquad (15)

The analogous bound holds when r_1(x_h, a_h) < r_2(x_h, a_h); therefore, we have:

|V_i(s_i; r_1) - V_i(s_i; r_2)| \le \gamma^h |r_1(x_h, a_h) - r_2(x_h, a_h)| \qquad (16)

where h = |x_h - s_i| is the distance between s_i and x_h.

Now consider general r_1 \ne r_2. We can design a path \{r^t\} from r_1 to r_2 so that each step changes one distinct reward entry. Therefore each (x, a) pair is changed at most once and we have:

|V_i(s_i; r_1) - V_i(s_i; r_2)| \le \sum_t |V_i(s_i; r^{t-1}) - V_i(s_i; r^t)| \qquad (17)
\le \sum_{x, a} \gamma^{|x - s_i|} |r_1(x, a) - r_2(x, a)| \qquad (18)

F.2 THM. 1

First we prove the following lemma:

Lemma 3.
For any reward assignment r_i for agent i in the optimization problem (Eqn. 1) and a local reward set M^{local}_i \supseteq \{x : |x - s_i| \le C\}, if we construct \tilde{r}_i as follows:

\tilde{r}_i(x, a) = \begin{cases} r_i(x, a) & x \in M^{local}_i \\ 0 & x \notin M^{local}_i \end{cases} \qquad (19)

then we have:

|V_i(s_i; r_i) - V_i(s_i; \tilde{r}_i)| \le \gamma^{C} R_{\max} M \qquad (20)

where M is the total number of sparse reward sites and R_{\max} is the maximal reward that can be assigned at each reward site x while satisfying the constraint of Eqn. 1.

Figure 11: Different reward assignments: the optimal assignment R^* is a function of all agent states s = \{s_1, s_2, \ldots, s_K\}, while the approximate assignment only depends on the local observation s^{local}_i (panels (a) and (b)).
Proof. By Lemma 2, we know that

|V_i(s_i; r_i) - V_i(s_i; \tilde{r}_i)| \le \sum_{x \notin M^{local}_i} \gamma^{|x - s_i|} |r_i(x, a) - \tilde{r}_i(x, a)| \qquad (21)
\le \gamma^{C} \sum_{x \notin M^{local}_i} |r_i(x, a)| \qquad (22)
\le \gamma^{C} R_{\max} M \qquad (23)

Note that the "sparse reward site" assumption is important here; otherwise there could be exponentially many sites x \notin M^{local}_i and Eqn. 23 would become vacuous.

We then prove the theorem.

Proof.