Learning when to Communicate at Scale in Multiagent Cooperative and Competitive Tasks
Amanpreet Singh* (New York University, Facebook AI Research†) [email protected]
Tushar Jain* (New York University) [email protected]
Sainbayar Sukhbaatar (New York University, Facebook AI Research†) [email protected]

ABSTRACT
Learning when to communicate and doing that effectively is essential in multi-agent tasks. Recent works show that continuous communication allows efficient training with back-propagation in multi-agent scenarios, but have been restricted to fully-cooperative tasks. In this paper, we present the Individualized Controlled Continuous Communication Model (IC3Net), which has better training efficiency than a simple continuous communication model and can be applied to semi-cooperative and competitive settings along with cooperative settings. IC3Net controls continuous communication with a gating mechanism and uses individualized rewards for each agent to gain better performance and scalability while fixing credit assignment issues. Using a variety of tasks, including StarCraft BroodWars™ explore and combat scenarios, we show that our network yields better performance and convergence rates than the baselines as the scale increases. Our results convey that IC3Net agents learn when to communicate based on the scenario and profitability.

1 INTRODUCTION
Communication is an essential element of intelligence: it helps in learning from others' experience, working better in teams and passing down knowledge. In multi-agent settings, communication allows agents to cooperate towards common goals. Particularly in partially observable environments, when agents observe different parts of the environment, they can share information and learnings from their observations through communication.

Recently, there has been a lot of success in the field of reinforcement learning (RL), from playing Atari games (Mnih et al., 2015) to playing Go (Silver et al., 2016), most of which has been limited to the single-agent domain. However, the number of systems and applications involving multiple agents has been growing (Lazaridou et al., 2016; Mordatch & Abbeel, 2017), ranging from a team of robots working in manufacturing plants to a network of self-driving cars. Thus, it is crucial to successfully scale
RL to multi-agent environments in order to build intelligent systems capable of higher productivity. Furthermore, scenarios other than cooperative, namely semi-cooperative (or mixed) and competitive scenarios, have not been studied as extensively for multi-agent systems. Mixed scenarios resemble most real-life scenarios, as humans are cooperative but not fully-cooperative in nature: they work towards their individual goals while cooperating with each other. In competitive scenarios, agents are essentially competing with each other for better rewards. In real life, humans always have the option to communicate but can choose when to actually communicate. For example, in a sports match two teams which can communicate may choose not to communicate at all (to prevent sharing strategies) or to use dishonest signaling (to misdirect opponents) (Lehman et al., 2018) in order to optimize their own reward and handicap opponents, making it important to learn when to communicate.

* Equal contribution. † Current affiliation. This work was completed when the authors were at New York University.

Teaching agents how to communicate makes it unnecessary to hand-code the communication protocol with expert knowledge (Sukhbaatar et al., 2016; Kottur et al., 2017). While the content of communication is important, it is also important to know when to communicate, either to increase scalability and performance or to increase competitive edge. For example, a prey needs to learn when to communicate to avoid communicating its location to predators.

Sukhbaatar et al. (2016) showed that agents communicating through a continuous vector are easier to train and have a higher information throughput than communication based on discrete symbols. Their continuous communication is differentiable, so it can be trained efficiently with back-propagation. However, their model assumes full cooperation between agents and uses average global rewards. This restricts the model from being used in mixed or competitive scenarios, as full cooperation involves sharing hidden states with everyone, exposing everything and leading to poor performance by all agents, as shown by our results. Furthermore, the average global reward for all agents makes the credit assignment problem even harder and difficult to scale, as agents don't know their individual contributions in mixed or competitive scenarios where they want themselves to succeed before others. To solve the above issues, we make the following contributions:

1. We propose the Individualized Controlled Continuous Communication Model (IC3Net), in which each agent is trained with its individualized reward and which can be applied to any scenario, whether cooperative or not.
2. We empirically show that, based on the given scenario and using the gating mechanism, our model can learn when to communicate. The gating mechanism allows agents to block their communication, which is useful in competitive scenarios.
3. We conduct experiments on different scales in three chosen environments, including StarCraft, and show that IC3Net outperforms the baselines with performance gaps that increase with scale. The results show that individual rewards converge faster and better than global rewards.

2 RELATED WORK
The simplest approach in multi-agent reinforcement learning (MARL) settings is to use an independent controller for each agent. This was attempted with Q-learning in Tan (1993). However, in practice it performs poorly (Matignon et al., 2012), which we also show in comparison with our model. The major issue with this approach is that, due to multiple agents, the stationarity of the environment is lost and naïve application of experience replay doesn't work well.

The nature of interaction between agents can be cooperative, competitive, or a mix of both. Most algorithms are designed only for a particular nature of interaction, mainly cooperative settings (Omidshafiei et al., 2017; Lauer & Riedmiller, 2000; Matignon et al., 2007), with strategies which indirectly arrive at cooperation via sharing policy parameters (Gupta et al., 2017). These algorithms are generally not applicable in competitive or mixed settings. See Busoniu et al. (2008) for a survey of MARL in general and Panait & Luke (2005) for a survey of cooperative multi-agent learning.

Our work can be considered an all-scenario extension of Sukhbaatar et al. (2016)'s CommNet for collaboration among multiple agents using continuous communication, which is usable only in cooperative settings, as stated in their work and shown by our experiments. Due to continuous communication, the controller can be learned via backpropagation. However, this model is restricted to fully cooperative tasks, as hidden states are fully communicated to others, which exposes everything about an agent. On the other hand, due to the global reward for all agents, CommNet also suffers from the credit assignment issue.

The Multi-Agent Deep Deterministic Policy Gradient (MADDPG) model presented by Lowe et al. (2017) also tries to achieve similar goals. However, it differs in the way of providing the coordination signal. In their case, there is no direct communication among agents (actors with a different policy per agent); instead a separate centralized critic per agent, which can access the actions of all the agents, provides the signal. Concurrently, a similar model using a centralized critic and decentralized actors with an additional counterfactual reward, COMA by Foerster et al. (2017), was proposed to tackle the challenge of multi-agent credit assignment by letting agents know their individual contributions.

Vertex Attention Interaction Networks (VAIN) (Hoshen, 2017) also models multi-agent communication through the use of Interaction Networks (Battaglia et al., 2016) with an attention mechanism (Bahdanau et al., 2014) for predictive modelling in supervised settings. The work by Foerster et al. (2016b) also learns a communication protocol where agents communicate in a discrete manner through their actions. This contrasts with our model, where multiple continuous communication cycles can be used at each time step to decide the actions of all agents. Furthermore, our approach is amenable to a dynamic number of agents. Peng et al. (2017) also attempt to solve micromanagement tasks in StarCraft using communication. However, they have a non-symmetric addition of agents to the communication channel and are restricted to only cooperative scenarios.

In contrast, a lot of work has focused on understanding agents' communication content, mostly in discrete settings with two agents (Wang et al., 2016; Havrylov & Titov, 2017; Kottur et al., 2017; Lazaridou et al., 2016; Lee et al., 2017). Lazaridou et al.
(2016) showed that, given two neural network agents and a referential game, the agents learn to coordinate. Havrylov & Titov (2017) extended this by grounding the communication protocol in a sequence of symbols, while Kottur et al. (2017) showed that this language can be made more human-like by placing certain restrictions. Lee et al. (2017) demonstrated that agents speaking different languages can learn to translate in referential games.
3 MODEL
Figure 1: An overview of IC3Net. (Left) In-depth view of a single communication step. The LSTM gets the hidden state h^t and cell state s^t (not shown) from the previous time-step. The hidden state h^t is passed to the communication-action module f^g for a binary communication action g^t. Finally, the communication vector c^t is calculated by averaging the hidden states of the other active agents, gated by their communication actions, and is passed through a linear transformation C before being fed to the LSTM along with the observation. (Right) High-level view of IC3Net, which optimizes individual rewards r^t for each agent based on observation o^t.

In this section, we introduce our model, the Individualized Controlled Continuous Communication Model (IC3Net), shown in Figure 1, which works in multi-agent cooperative, competitive and mixed settings where agents learn what to communicate as well as when to communicate.

First, let us describe an independent controller model where each agent is controlled by an individual LSTM. For the j-th agent, its policy takes the form of:

    h_j^{t+1}, s_j^{t+1} = LSTM(e(o_j^t), h_j^t, s_j^t)
    a_j^t = π(h_j^t),

where o_j^t is the observation of the j-th agent at time t, e(·) is an encoder function parameterized by a fully-connected neural network, and π is the agent's action policy. Also, h_j^t and s_j^t are the hidden and cell states of the LSTM. We use the same LSTM model for all agents, sharing their parameters. This way, the model is invariant to permutations of the agents.

IC3Net extends this independent controller model by allowing agents to communicate their internal state, gated by a discrete action. The policy of the j-th agent in IC3Net is given by

    g_j^{t+1} = f^g(h_j^t)
    h_j^{t+1}, s_j^{t+1} = LSTM(e(o_j^t) + c_j^t, h_j^t, s_j^t)
    c_j^{t+1} = (1 / (J - 1)) · C · Σ_{j' ≠ j} h_{j'}^{t+1} ⊙ g_{j'}^{t+1}
    a_j^t = π(h_j^t),

where c_j^t is the communication vector for the j-th agent, C is a linear transformation matrix for transforming the gated average hidden state into a communication tensor, J is the number of alive agents currently present in the system, and f^g(·) is a simple network containing a softmax layer for 2 actions (communicate or not) on top of a linear layer with a non-linearity. The binary action g_j^t specifies whether agent j wants to communicate with others, and acts as a gating function when calculating the communication vector. Note that the gating action for the next time-step is calculated at the current time-step. We train both the action policy π and the gating function f^g with REINFORCE (Williams, 1992).

In Sukhbaatar et al. (2016), the individual networks controlling the agents were interconnected and, as a whole, were considered a single big neural network. This single big network controller approach required the definition of a unified loss function during training, thus making it impossible to train agents with different rewards.

In this work, however, we move away from the single big network controller approach. Instead, we consider multiple big networks with shared parameters, each controlling a single agent separately. Each big network consists of multiple LSTM networks, each processing an observation of a single agent. However, only one of the LSTMs needs to output an action because the big network is only controlling a single agent.
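To make the per-agent controller concrete, the following is a minimal PyTorch sketch of one IC3Net time step for all J agents with shared parameters. It is an illustration only, not the authors' released code: module names, sizes and the exact placement of the non-linearity in f^g are our own assumptions.

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    class IC3NetStep(nn.Module):
        """Sketch of one IC3Net time step, shared across all agents."""
        def __init__(self, obs_size, hidden_size, num_actions):
            super().__init__()
            self.encoder = nn.Linear(obs_size, hidden_size)          # e(.)
            self.lstm = nn.LSTMCell(hidden_size, hidden_size)
            self.comm_proj = nn.Linear(hidden_size, hidden_size)     # C
            self.gate_head = nn.Sequential(                          # f^g: 2-way softmax head
                nn.Linear(hidden_size, hidden_size), nn.Tanh(),
                nn.Linear(hidden_size, 2))
            self.policy_head = nn.Linear(hidden_size, num_actions)   # pi

        def forward(self, obs, h, s, gate):
            # obs: (J, obs_size); h, s: (J, hidden); gate: (J,) binary g^t of each agent.
            J = h.size(0)
            gated = h * gate.unsqueeze(1)                    # zero out silent agents
            others = gated.sum(dim=0, keepdim=True) - gated  # leave-one-out sum over j' != j
            c = self.comm_proj(others / max(J - 1, 1))       # communication vector c^t
            h_next, s_next = self.lstm(self.encoder(obs) + c, (h, s))
            action_logits = self.policy_head(h_next)         # action policy pi
            gate_logits = self.gate_head(h_next)             # gating action for the next step
            next_gate = Categorical(logits=gate_logits).sample().float()
            return action_logits, next_gate, h_next, s_next

In a full implementation, dead agents would additionally be masked out of the average so that J counts only alive agents.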
Although this view has little effect on the implementation (we can still use a single big network in practice), it allows us to train each agent to maximize its individual reward instead of a single global reward. This has two benefits: (i) it allows the model to be applied to both cooperative and competitive scenarios; (ii) it also helps resolve the credit assignment issue faced by many multi-agent algorithms (Sukhbaatar et al., 2016; Foerster et al., 2016a) while improving performance with scalability, and is coherent with the findings in Chang et al. (2003).

4 EXPERIMENTS

We study our network in multi-agent cooperative, mixed and competitive scenarios to understand its workings. We perform experiments to answer the following questions:

1. Can our network learn the gating mechanism to communicate only when needed according to the given scenario? Essentially, is it possible to learn when to communicate?
2. Does our network using individual rewards scale better and faster than the baselines? This would clarify whether or not individual rewards perform better than global rewards in multi-agent communication-based settings.

We first analyze the working of the gating action (g^t). Later, we train our network in three chosen environments with variations in difficulty and coordination to assess scalability and performance.

4.1 ENVIRONMENTS
We consider three environments for our analysis and experiments: (i) a predator-prey environment (PP) where predators with limited vision look for a prey on a square grid; (ii) a traffic junction environment (TJ), similar to Sukhbaatar et al. (2016), where agents with limited vision must learn to communicate in order to avoid collisions; and (iii) StarCraft BroodWars (SC) explore and combat tasks, which test control of multiple agents in various scenarios where an agent needs to understand and decouple observations for multiple opposing units. The code is available at https://github.com/IC3Net/IC3Net.

StarCraft is a trademark or registered trademark of Blizzard Entertainment, Inc., in the U.S. and/or other countries. Nothing in this paper should be construed as approval, endorsement, or sponsorship by Blizzard Entertainment, Inc.
Figure 2: Environment visualizations. (Left) The 10×10 version of the predator-prey task, where predators (red circles) with limited vision of size 1 (blue region) try to catch a randomly initialized fixed prey (green circle). (Center and Right) Easy and medium versions of the traffic junction task, where cars have to cross the whole path while minimizing collisions, using two actions, gas and brake. Agents have zero vision and can only observe their own location. (Right) In the medium version, chances of collision are increased due to more possible routes and an increased number of cars.

4.1.1 PREDATOR PREY
In this task, we have n predators (agents) with limited vision trying to find a stationary prey. Once a predator reaches the prey, it stays there and always gets a positive reward until the end of the episode (when the rest of the predators reach the prey, or the maximum number of steps is reached). In the case of zero vision, agents don't have a direct way of knowing the prey's location unless they jump on it.

We design three cooperation settings (competitive, mixed and cooperative) for this task, with different reward structures, to test our network. See Appendix 6.3 for details on the grid, reward structure, observation and action space. There is no loss or benefit from communicating in the mixed scenario. In the competitive setting, agents get lower rewards if other agents reach the prey, and in the cooperative setting, the reward increases as more agents reach the prey. We compare with baselines using the mixed setting in subsection 4.3.2, while explicitly learning and analyzing the gating action's working in subsection 4.2.

We create three levels for this environment, as mentioned in Appendix 6.3, to compare our network's performance with increasing numbers of agents and grid sizes. The 10×10 grid version with 5 agents is shown in Fig. 2 (left). All agents are randomly placed on the grid at the start of an episode.

4.1.2 TRAFFIC JUNCTION
Following Sukhbaatar et al. (2016), we test our model on the traffic junction task, as it is a good proxy for testing whether communication is working. This task also helps support our claim that IC3Net provides good performance and faster convergence in fully-cooperative scenarios, similar to mixed ones. In the traffic junction, cars enter a junction from all entry points with a probability p_arr. The maximum number of cars at any given time in the junction is limited. Cars can take two actions at each time-step, gas and brake. The task has three difficulty levels (see Fig. 2) which vary in the number of possible routes, entry points and junctions. We make this task harder by always setting vision to zero in all three difficulty levels, to ensure that the task is not solvable without communication. See Appendix 6.4 for details on the reward structure, observation and training.

4.1.3 STARCRAFT: BROODWARS
To fully understand the scalability of our architecture in more realistic and complex scenarios, we test it on StarCraft combat and exploration micro-management tasks in partially observable settings. StarCraft is a challenging environment for RL because it has a large observation-action space, many different unit types and stochasticity. We train our network on Combat and Explore tasks. A task's difficulty can be altered by changing the number of our units, the number of enemy units and the map size.

By default, the game has macro-actions which allow a player to directly target an enemy unit, making the player's unit find the best possible path using the game's in-built path-finding system, move towards the target and attack when it is in range. However, we make the task harder by (i) removing macro-actions, making exploration harder; (ii) limiting vision, making the environment partially observable; and (iii) unlike previous works (Wender & Watson, 2012; Ontanón et al., 2013; Usunier et al., 2016; Peng et al., 2017), initializing enemy and our units at random locations in a fixed-size square on the map, which makes it challenging to find enemy units. Refer to Appendix 6.5.1 for reward, action, observation and task details. We consider two types of tasks in StarCraft:

Explore: In this task, we have n agents trying to explore the map and find an enemy unit. This is a direct scale-up of PP but with more realistic and stochastic situations.

Combat: We test an agent's capability to execute the complex task of combat in StarCraft, which requires coordination between teammates, exploration of a terrain, understanding of enemy units and formulation of complex strategies. We specifically test a team of n agents trying to find and kill a team of m enemies in a partially observable environment similar to the explore task. The agents, with their limited vision, must find the enemy units and kill all of them to score a win. More information on the reward structure, observation and setup can be found in Sections 6.5.1 and 6.5.2.

Figure 3: Learning the gating action. Plots show the gating action g^t for predators and prey, averaged over each epoch, in PP; panel (d) is the SC explore task, and panels (e) and (f) plot g^t vs. timesteps in the cooperative and competitive settings. In the cooperative setting (a, e), agents almost always communicate to increase their own reward. In the (b) mixed setting and (c) competitive setting, predators only communicate when necessary and profitable. As is evident from (f), they stop communicating once they reach the prey. In all cases, the prey almost never communicates with the predators, as it is not profitable for it. Similarly, in the competitive scenario (d) for SC, team agents learn to communicate only when necessary due to the division of reward when near the enemy, while the enemy agent learns not to communicate, as in PP.

4.2 ANALYSIS OF THE GATING MECHANISM
We analyze the working of the gating action (g^t) in IC3Net using the cooperative, competitive and mixed settings in the Predator-Prey (4.1.1) and StarCraft explore (4.1.3) tasks. However, this time the enemy unit (prey) shares parameters with the predators and is trained with them. All of the enemy unit's actions are noop, which makes it stationary. The enemy unit gets a positive reward equivalent to r_time = 0.05 per timestep as long as no predator/medic has captured it; after that it gets a reward of 0. We run this analysis on the 5×5 version of PP, and also on a 50×50 map size for the competitive and cooperative StarCraft explore task, where we found similar results (Fig. 3(d)). We can deduce the following observations:

• As can be observed in Fig. 3a, 3b, 3c and 3d, in all four cases the prey learns not to communicate. If the prey communicated, the predators would reach it faster. Since it gets 0 reward once an agent comes near or on top of it, it doesn't communicate, in order to achieve higher rewards.

• In the cooperative setting (Fig. 3a, 3e), the predators communicate openly, with g close to 1. Even though the prey communicates with the predators at the start, it eventually learns not to communicate, so as not to share its location. As all agents are communicating in this setting, it takes more training time to adjust the prey's weights towards silence. Our preliminary tests suggest that in cooperative settings it is beneficial to fix the gating action to 1.0, as communication is almost always needed, and this helps in faster training by skipping the need to train the gating action.

• In the mixed setting (Fig. 3b), agents don't always communicate, which corresponds to the fact that there is no benefit or loss from communicating in the mixed scenario. The prey is easily able to learn not to communicate, as the weights of the predators are also adjusted towards non-cooperation from the start itself.

• As expected due to competition, predators rarely communicate in the competitive setting (Fig. 3c, 3d). Note that this setting is not fully adversarial, as predators can initially explore faster if they communicate, which can eventually lead to overall higher rewards. This can be observed as the agents only communicate while it's profitable for them, i.e. before reaching the prey (Fig. 3f), since communicating afterwards can impact their future rewards.

The experiments in this section empirically suggest that agents can "learn to communicate when it is profitable", thus allowing the same network to be used in all settings.

Figure 4: Result plots for the PP and TJ tasks. (Left) Average steps taken to complete an episode in the 20×20 grid. (Center) IC3Net converges faster than CommNet as the number of predators (agents) increases in the predator-prey environment. (Right) Success % in the medium TJ task trained with curriculum. Performance and convergence of IC3Net are superior to the baselines.

4.3 SCALABILITY AND GENERALIZATION EXPERIMENTS
In this section, we look at bigger versions of our environments to understand the scalability and generalization aspects of IC3Net.

4.3.1 BASELINES
For training details, refer to Appendix 6.1. We compare IC3Net with the baselines specified below in all scenarios.
Individual Reward Independent Controller (IRIC):
In this controller, the model is applied individually to each agent's observations to produce the action to be taken. Essentially, this can be seen as IC3Net without any communication between agents, but with an individualized reward for each agent. Note that having no communication makes the gating action (g^t) ineffective.

Independent Controller (IC - IC3Net w/o Comm and IR):
Like IRIC, except the agents are trained with a global average reward instead of individual rewards. This will help us understand the credit assignment issue prevalent in CommNet.
CommNet:
Introduced in Sukhbaatar et al. (2016), CommNet allows communication between agents over a channel where an agent is provided with the average of the hidden state representations of the other agents as a communication signal. Like IC3Net, CommNet uses continuous signals to communicate between the agents. Thus, CommNet can be considered as IC3Net without both the gating action (g^t) and individualized rewards.

4.3.2 RESULTS
We discuss the major results of our experiments in this section and analyze particular behaviors/patterns of agents in Section 6.2.
Predator Prey:
Table 1 (left) shows the average number of steps taken by the models to complete an episode, i.e. find the prey, in the mixed setting (we found similar results for the cooperative setting, shown in the appendix).
Predator-Prey Mixed (Avg. Steps)
Model      5x5, n=3    10x10, n=5    20x20, n=10
IRIC       16.5 ±      ±             ±
IC         16.4 ±      ±             ±
CommNet     9.1 ±      ±             ±
IC3Net          ±      ±             ±

Traffic Junction (Success %)
Model      Easy        Medium        Hard
IRIC       29.8 ±      ±             ±
IC         30.2 ±      ±             ±
CommNet    93.0 ±      ±             ±
IC3Net     93.0 ±      ±             ±

Table 1: Predator Prey (Left): Avg. number of steps taken to complete the episode in three different environment sizes in the mixed setting. IC3Net completes the episode faster than the baselines by finding the prey.
Traffic Junction (Right): Success rate on various difficulty levels, with zero vision for all. IC3Net consistently provides better performance than the baselines, especially as the scale increases.

IC3Net reaches the prey faster than the baselines as we increase the number of agents as well as the size of the maze. In the 20×20 version, the gap in average steps is almost 24 steps, which is a substantial improvement over the baselines. Figure 4 (center) shows the scalability graph for IC3Net and CommNet, which supports the claim that with an increasing number of agents, IC3Net converges faster to a better optimum than CommNet. Through these results on the PP task, we can see that, compared to IC3Net, CommNet doesn't work well in mixed scenarios. Finally, Figure 4 (left) shows the training plot for the 20×20 grid with 10 agents trying to find a prey. The plot clearly shows the faster performance improvement of IC3Net, in contrast to CommNet which takes a long time to achieve a minor jump. We also find the same pattern of gating action values as in 4.2.
Traffic Junction:
Table 1 (right) shows the success rate for traffic junction. We fixed the gating action to 1 for TJ, as discussed in 4.2. With zero vision, it is not possible to perform well without communication, as evident from the results of IRIC and IC. Interestingly, IC performs better than IRIC in the hard case; we believe that without communication, the global reward in TJ acts as a better indicator of the overall performance. On the other hand, with communication and better knowledge of others, training with the global reward faces a credit assignment issue, which is alleviated by IC3Net, as evident from its superior performance compared to CommNet. In Sukhbaatar et al. (2016), well-performing agents in the medium and hard versions had vision > 0. With zero vision, IC3Net is superior to CommNet and IRIC with a performance gap greater than 30%. This verifies that the individualized rewards in IC3Net help achieve better or similar performance than CommNet in fully-cooperative tasks with communication, due to better credit assignment.
StarCraft task           IRIC Win%   IRIC Steps   IC Win%   IC Steps   CommNet Win%   CommNet Steps   IC3Net Win%   IC3Net Steps
Explore-10Medic 50×50    35.4 ±      ±            ±         ±          ±              ±               ±             ±
Explore-10Medic 75×75     9.0 ±      ±            ±         ±          ±              ±               ±             ±
Combat-10Mv3Ze 50×50     74.6 ±      ±            ±         ±          ±              ±               ±             ±

Table 2: StarCraft Results: Win ratio and average number of steps taken to complete episodes for the explore and combat tasks. IC3Net beats the baselines by a huge margin in the exploration tasks, while it is as good as CommNet in the 10 Marines vs 3 Zealots combat task.
Figure 5: Average steps taken to complete an episode of the StarCraft Explore-10Medic 50×50 task.
StarCraft:
Table 2 displays the win % and the average number of steps taken to complete an episode in the StarCraft explore and combat tasks. We specifically test on (i) the Explore task: 10 medics finding 1 enemy medic on a 50×50 cell grid; (ii) the same task on a 75×75 cell grid; and (iii) the Combat task: 10 Marines vs 3 Zealots on a 50×50 cell grid. The maximum number of steps in an episode is set to 60. The results on the explore task are similar to Predator-Prey, as IC3Net outperforms the baselines. Moving to a bigger map size, we still see the performance gap, even though performance drops for all the models.

On the combat task, IC3Net performs comparably well to CommNet. A detailed analysis of IC3Net's performance in the StarCraft tasks is provided in Section 6.2.1. To confirm that 10 marines vs 3 zealots is hard to win, we run an experiment on the reverse scenario, where our agents control 3 Zealots initialized separately and the enemies are 10 marines initialized together. We find that both IRIC and IC3Net easily reach a success percentage of 100%. We find that even in this case, IC3Net converges faster than IRIC.
5 CONCLUSIONS AND FUTURE WORK
In this work, we introduced IC3Net, which aims to solve multi-agent tasks in various cooperation settings by learning when to communicate. Its continuous communication enables efficient training by backpropagation, while the discrete gating, trained by reinforcement learning along with individual rewards, allows it to be used in all scenarios and at larger scale.

Through our experiments, we show that IC3Net performs well in cooperative, mixed and competitive settings and learns to communicate only when necessary. Further, we show that agents learn to stop communicating in competitive cases. We show the scalability of our network through further experiments. In the future, we would like to explore the possibility of multi-channel communication, where agents can decide on which channel they want to put their information, similar to communication groups but dynamic. It would also be interesting to provide agents a choice of whether or not to listen to communication from a channel.
Acknowledgements
The authors would like to thank Zeming Lin for his consistent support and suggestions around StarCraft and TorchCraft.

REFERENCES
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pp. 4502–4510, 2016.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 41–48, 2009.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, March 2008.

Yu-Han Chang, Tracey Ho, and Leslie Pack Kaelbling. All learning is local: Multi-agent learning in global reward games. In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS'03, pp. 807–814, 2003.

Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems 29, pp. 2137–2145, 2016a.

Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017.

Jakob N Foerster, Yannis M Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate to solve riddles with deep distributed recurrent q-networks. arXiv preprint arXiv:1602.02672, 2016b.

Jayesh K. Gupta, Maxim Egorov, and Mykel J. Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In Adaptive Learning Agents Workshop, 2017. doi: 10.1007/978-3-319-71682-4_5.

Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: learning to communicate with sequences of symbols. In Advances in Neural Information Processing Systems, pp. 2149–2159, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Yedid Hoshen. VAIN: Attentional multi-agent predictive modeling. In Advances in Neural Information Processing Systems 30, pp. 2701–2711, 2017.

Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. Natural language does not emerge 'naturally' in multi-agent dialog. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2962–2967, 2017.

Martin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 535–542. Morgan Kaufmann, 2000.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182, 2016.

Jason Lee, Kyunghyun Cho, Jason Weston, and Douwe Kiela. Emergent translation in multi-agent communication. arXiv preprint arXiv:1710.06922, 2017.

Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Julie Beaulieu, Peter J Bentley, Samuel Bernard, Guillaume Belson, David M Bryson, Nick Cheney, et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. arXiv preprint arXiv:1803.03453, 2018.

Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems 30, pp. 6379–6390, 2017.

L. Matignon, G. J. Laurent, and N. Le Fort-Piat. Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 64–69, Oct 2007.

Laetitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(1):1–31, 2012.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908, 2017.

Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, and John Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. arXiv preprint arXiv:1703.06182, 2017.

Santiago Ontanón, Gabriel Synnaeve, Alberto Uriarte, Florian Richoux, David Churchill, and Mike Preuss. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games, 5(4):293–311, 2013.

Liviu Panait and Sean Luke. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387–434, 2005.

Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.

Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pp. 2244–2252, 2016.

Ming Tan. Multi-agent reinforcement learning: independent versus cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337. Morgan Kaufmann Publishers Inc., 1993.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

Nicolas Usunier, Gabriel Synnaeve, Zeming Lin, and Soumith Chintala. Episodic exploration for deep deterministic policies: An application to StarCraft micromanagement tasks. arXiv preprint arXiv:1609.02993, 2016.

Sida I Wang, Percy Liang, and Christopher D Manning. Learning language games through interaction. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2368–2378, 2016.

Stefan Wender and Ian Watson. Applying reinforcement learning to small scale combat in the real-time strategy game StarCraft: Broodwar. In Computational Intelligence and Games (CIG), 2012 IEEE Conference on, pp. 402–408. IEEE, 2012.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
6 APPENDIX
6.1 TRAINING DETAILS
We set the hidden layer size to 128 units and use LSTMs (Hochreiter & Schmidhuber, 1997) with recurrence for all of the baselines and IC3Net. We use RMSProp (Tieleman & Hinton, 2012) with the initial learning rate as a tuned hyper-parameter. All of the models use skip connections (He et al., 2016). The training is distributed over 16 cores, and each core runs a mini-batch until the total number of episode steps is 500 or more. We do 10 weight updates per epoch. We run the predator-prey and StarCraft experiments for 1000 epochs and the traffic junction experiments for 2000 epochs, and report the final results. In the mixed case, we report the mean score of all agents, while in the cooperative case we report any agent's score, as they are the same. We implement our model using PyTorch and the environments using Gym (Brockman et al., 2016). We use REINFORCE (Williams, 1992) to train our setup. We conduct 5 runs on each of the tasks to compile our results. The training time for different tasks varies; StarCraft tasks usually take more than a day (depending on the number of agents and enemies), while the predator-prey and traffic junction tasks complete under 12 hours.
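As an illustration of the training loop described above, the sketch below shows a single REINFORCE update on one episode with individual rewards. It is a simplified, single-process version under our own assumptions (plain discounted returns, no baseline or entropy terms), not the exact implementation.

    import torch

    def reinforce_update(optimizer, action_log_probs, gate_log_probs, rewards, gamma=1.0):
        """One REINFORCE update for one episode.
        action_log_probs, gate_log_probs: per-time-step tensors of shape (num_agents,).
        rewards: per-time-step tensors of shape (num_agents,) holding individual rewards."""
        returns, running = [], torch.zeros_like(rewards[0])
        for r in reversed(rewards):                      # individual returns per agent
            running = r + gamma * running
            returns.insert(0, running)
        loss = torch.zeros(())
        for lp_a, lp_g, ret in zip(action_log_probs, gate_log_probs, returns):
            # The action policy pi and the gating policy f^g share the same
            # per-agent return as their learning signal.
            loss = loss - ((lp_a + lp_g) * ret.detach()).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # The paper's optimizer is RMSProp, e.g. (learning rate is a tuned hyper-parameter):
    # optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)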
6.2 RESULTS ANALYSIS
In this section, we analyze and discuss behaviors/patterns in the results of our experiments.

6.2.1 IC3NET IN STARCRAFT-COMBAT TASK
As observed in Table 2, IC3Net performs better than CommNet in the explore task but doesn't outperform it on the combat task. Our experiments and visualizations of the actual strategies suggest that, compared to exploration, combat can be solved far more easily if the units learn to stay together: focused firepower with a higher attack quantity in general yields quite good results in combat. We verify this hypothesis by running a heuristic baseline, "attack closest", in which agents have full vision of the map and have macro-actions available. (Macro-actions correspond to the "right click" feature in StarCraft and Dota, in which a unit can be ordered to attack another unit: the unit follows the shortest path on the map towards the target and starts attacking automatically once in range. This essentially overpowers the "attack closest" baseline, letting it attack anyone under full vision without any exploration.) By attacking the closest available enemy together, the agents are able to kill the zealots with a success ratio of 76.6 ±, calculated over 5 runs, even though they are initialized separately. Also, as described in Section 6.5.2, the global reward for a win in the combat task is relatively huge compared to the individual rewards for killing other units. We believe that coordination to stay together, huge global rewards and focused fire, which is achievable through simple cooperation, add up to CommNet's performance in this task.

Further, in exploration we have seen that agents go in separate directions and have an individual reward/sense of exploration, which usually leads to faster exploration of an unexplored area. Thinking in simple terms, exploration of a house would be faster if different people handle different rooms. Achieving this is hard in CommNet, because global rewards don't tell agents their individual contributions if they had explored separately. Also, in CommNet we have observed that agents follow a pattern where they get together at a point and explore together from that point, which further signals that with CommNet it is easy for agents to get together. (You can observe this pattern in a video generated using a trained CommNet model on PP-Hard, available at this link: red 'X' marks are predators and 'P' is the prey to be found; the agents get together to find the prey, leading to slack eventually.)

6.2.2 VARIANCE IN IC3NET

In Fig. 5, we have observed significant variance in IC3Net results for StarCraft. We performed a lot of experiments on StarCraft and can attribute the significant variance to stochasticity in the environment. There are a huge number of possible states in which agents can end up, due to the millions of possible interactions and their results in StarCraft; we believe it is hard to learn each one of them. This stochastic variance can even be seen in simple heuristic baselines like "attack closest" (6.2.1) and is in fact an indicator of how difficult it is to learn real-world scenarios, which have the same amount of stochasticity. We believe that we don't see similar variance in CommNet and the other baselines because adding the gating action increases the action-state-space combinations, which yields better results while sometimes being more difficult to learn. Further, this variance is only observed in the higher Win % models, which require learning more of the state space.

6.2.3 COMMNET IN STARCRAFT-EXPLORE TASKS
In Table 2, we can observe that CommNet performs worse than IRIC and IC in the StarCraft explore task. In this section, we provide a hypothesis for this result. First, we need to notice that IRIC is also better than IC overall, which points to the fact that individualized rewards are better than global rewards in the case of exploration. This makes sense because if agents cover more area and know how much they covered through their own contribution (individual reward), it should lead to more overall coverage, compared to global rewards, where agents can't figure out their own coverage, only the overall one. Second, in the case of CommNet, it is easy to communicate and get together. We observe this pattern in CommNet, where agents first get together at a point and then start exploring from there, which leads to slow exploration; IC is better in this respect because it is hard to gather at a single point, which inherently leads to faster exploration than CommNet. Third, the reward structure in the mixed scenario doesn't reward searching together, which is not directly visible to CommNet and IC due to global rewards.

6.3 DETAILS OF PREDATOR PREY
In all three settings (cooperative, competitive and mixed), a predator agent gets a constant per-time-step penalty r_explore until it reaches the prey. This makes sure that the agent doesn't slack in finding the prey. In the mixed setting, once an agent reaches the prey, it always gets a positive reward r_prey = 0.05, which doesn't depend on the number of agents on the prey. Similarly, in the cooperative setting, an agent gets a positive reward of r_coop = r_prey * n, and in the competitive setting an agent gets a positive reward of r_comp = r_prey / n after it reaches the prey, where n is the number of agents on the prey. The total reward at time t for an agent i can be written as:

    r_i^PP(t) = δ_i · r_explore + (1 − δ_i) · n_t^λ · r_prey

where δ_i denotes whether agent i has found the prey or not, n_t is the number of agents on the prey at time-step t, and λ is -1, 0 and 1 in the competitive, mixed and cooperative scenarios respectively.

Maximum episode steps are set to 20, 40 and 80 for the 5×5, 10×10 and 20×20 grids respectively. The number of predators is 5, 10 and 20 in the 5×5, 10×10 and 20×20 grids respectively. Each predator can take one of the five basic movement actions, i.e. up, down, left, right or stay. Predator, prey and all locations on the grid are considered unique classes in the vocabulary and are represented as one-hot binary vectors. The observation obs at each point is the sum of all one-hot binary vectors of the location, predators and prey present at that point. With a vision of 1, the observation of each agent has dimension 3 × 3 × |obs|.
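The per-agent reward above can be written directly as a small function. This is only a sketch with our own naming: r_prey = 0.05 comes from this section, while the exploration penalty default is an assumed placeholder, since the exact constant is not reproduced here.

    def predator_reward(found_prey, n_on_prey, scenario_lambda,
                        r_prey=0.05, r_explore=-0.05):
        """Per-time-step reward for one predator in PP.
        found_prey: True once this predator has reached the prey.
        n_on_prey: number of predators currently on the prey (n_t).
        scenario_lambda: -1 (competitive), 0 (mixed) or 1 (cooperative).
        r_explore: assumed placeholder value for the constant exploration penalty."""
        if not found_prey:
            return r_explore                               # still searching
        # n_t ** lambda yields r_prey / n, r_prey, or r_prey * n respectively.
        return (n_on_prey ** scenario_lambda) * r_prey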
6.3.1 EXTRA EXPERIMENTS

Predator-Prey Cooperative (Avg. Rewards)
Model      5x5, n=3    10x10, n=5    20x20, n=10
IRIC       0.48 ±      ±             ±
IC         0.47 ±      ±             ±
CommNet    1.56 ±      ±             ±
IC3Net     1.57 ±      ±             ±

Table 3: Predator-Prey Cooperative: Avg. rewards at three difficulty levels of the predator-prey environment in the cooperative setting. IC3Net consistently performs equivalently to or better than the baselines.

Table 3 shows the results for IC3Net and the baselines in the cooperative scenario for the predator-prey environment. As the cooperative reward function provides more reward after a predator reaches the prey, the comparison is provided in terms of rewards instead of the average number of steps. IC3Net performs better than or equal to CommNet and the other baselines at all three difficulty levels. The performance gap increases as we move towards bigger grids, which shows that IC3Net is more scalable due to its individualized rewards. More importantly, even with the extra gating action to train, IC3Net performs comparably to CommNet, which is designed for cooperative scenarios; this suggests that IC3Net is a suitable choice for all cooperation settings.

To analyze the effect of the gating action on rewards in the mixed scenario, where individualized rewards alone can already help a lot, we test the Predator-Prey mixed cooperation setting on the 20x20 grid with a baseline in which we set the gating action to 1 (global communication) and use individual rewards (IC2Net/CommNet + IR). We find the average max steps to be 50.24 ±, which is lower than IC3Net. This means that (i) individualized rewards help a lot in mixed scenarios by allowing agents to understand their contributions, and (ii) adding the gating action in this case has an overhead, but allows the same model to work in all settings (even competitive) by "learning to communicate", which is closer to real-world humans, with a negligible hit on performance.

6.4 DETAILS OF TRAFFIC JUNCTION
The traffic junction's observation vocabulary has one-hot vectors for all locations in the grid and the car class. Each agent observes its previous action, a route identifier and a vector specifying the sum of one-hot vectors for all classes present at that agent's location. A collision occurs when two cars are on the same location. We set the maximum number of steps to 20, 40 and 60 for the easy, medium and hard difficulties respectively. Similar to Sukhbaatar et al. (2016), we provide a negative reward r_coll = -10 on collision. To cut off traffic jams, we provide a negative reward r_time · τ_i = -0.01 τ_i, where τ_i is the time spent by the agent in the junction at time-step t. The reward for the i-th agent, which has C_i^t collisions at time-step t, can be written as:

    r_i^TJ(t) = r_coll · C_i^t + r_time · τ_i

We utilized curriculum learning (Bengio et al., 2009) to make the training process easier. The arrival probability p_arrive is kept at its start value for an initial number of epochs and is then linearly increased to its end value over the following epochs; after that, training continues for a further number of epochs. The start and end values of p_arrive for the different difficulty levels are indicated in Table 4. The learning rate is fixed throughout. We also implemented three difficulty variations of the game, explained as follows.

The easy version is a junction of two one-way roads on a 7×7 grid. There are two arrival points, each with two possible routes, and the N_total value is 5.
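For concreteness, the collision/time penalties and the linear curriculum on p_arrive described above can be expressed as follows; the schedule arguments (warmup_epochs, ramp_epochs) are placeholders of our own, since the exact epoch counts are not reproduced in this appendix.

    def traffic_junction_reward(num_collisions, time_in_junction,
                                r_coll=-10.0, r_time=-0.01):
        """Per-time-step reward for car i: r_coll * C_i^t + r_time * tau_i."""
        return r_coll * num_collisions + r_time * time_in_junction

    def curriculum_p_arrive(epoch, p_start, p_end, warmup_epochs, ramp_epochs):
        """Hold p_arrive at p_start, then increase it linearly to p_end."""
        if epoch <= warmup_epochs:
            return p_start
        frac = min(1.0, (epoch - warmup_epochs) / float(ramp_epochs))
        return p_start + frac * (p_end - p_start)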
Difficulty   p_arrive (Start / End)   N-total   Arrival Points   Routes per Entry Point   Two-Way   Junctions   Dimension
Easy         0.1 / 0.3                5         2                1                        F         1           7×7
Medium       - / -                    20        4                3                        T         2           -
Hard         - / -                    20        8                7                        T         4           -

Table 4: Traffic Junction: Variations in the different traffic junction difficulty levels. T refers to True, as in the difficulty level has 2-way roads, and F refers to False, as in the difficulty level has 1-way roads.

The medium version consists of two connected junctions of two-way roads, as shown in Figure 2 (right). There are 4 arrival points and 3 different routes for each arrival point, with N_total = 20.

The harder version consists of four connected junctions of two-way roads, as shown in Figure 6. There are 8 arrival points and 7 different routes for each arrival point, with N_total = 20.

Figure 6: Hard difficulty level of the traffic junction task. The level has four connected junctions, eight entry points, and at each entry point there are 7 possible routes, increasing the chances of a collision. We use curriculum learning to successfully train our models on the hard level.

6.4.1 IRIC AND IC PERFORMANCE
In Table 1, we notice that IRIC and IC perform worse in the medium level than in the hard level. Our visualizations suggest that this is due to the high final add-rate in the medium version compared to the hard version. Collisions happen much more often in the medium version, leading to a lower success rate (an episode is considered a failure if a collision happens), compared to hard, where the initial add-rate is low to accommodate curriculum learning for the hard version's big grid size. The final add-rate in the hard level is comparatively low to make sure that it is possible to pass a junction without a collision, as with more entry points it is easy to collide even with a small add-rate.

6.5 STARCRAFT DETAILS

6.5.1 OBSERVATION AND ACTIONS
Explore:
To complete the explore task, agents must be within a particular range of the enemy unit, called the explore vision. Once an agent is within the explore vision of the enemy unit, we noop its further actions. The reward structure is the same as in the PP task, with the only difference being that an agent needs to be within the explore vision range of the enemy unit, instead of on the same location, to get a non-negative reward. We use medic units, which don't attack enemy units. This ensures that we can simulate our explore task without any kind of combat happening and interfering with the goal of the task. The observation for each agent is its own (absolute x, absolute y) and the enemy's (relative x, relative y, visible), where visible, relative x and relative y are 0 when the enemy is not in the explore vision range. Agents have 9 actions to choose from, which include the 8 basic directions and one stay action.

Combat:
The agent observes its own (absolute x, absolute y, health points + shield, weapon cooldown, previous action) and, for each of the enemies, (relative x, relative y, visible, health points + shield, weapon cooldown). relative x and relative y are only observed when the enemy is visible, which is indicated by the visible flag. All of the observations are normalized to be between (0, 1). The agent has to choose from 9 + m actions, which include 9 basic actions and 1 action for attacking each of the m agents. Attack actions only work when the enemy is within the sight range of the agent; otherwise they are a noop. In combat, we don't compare with prior work on StarCraft because our environment setting is much harder, more restrictive, new and different, and thus not directly comparable.

6.5.2 COMBAT REWARD