Scaling Multi-Agent Reinforcement Learning with Selective Parameter Sharing
Filippos Christianos, Georgios Papoudakis, Arrasy Rahman, Stefano V. Albrecht
Abstract
Sharing parameters in multi-agent deep reinforcement learning has played an essential role in allowing algorithms to scale to a large number of agents. Parameter sharing between agents significantly decreases the number of trainable parameters, shortening training times to tractable levels, and has been linked to more efficient learning. However, having all agents share the same parameters can also have a detrimental effect on learning. We demonstrate the impact of parameter sharing methods on training speed and converged returns, establishing that when applied indiscriminately, their effectiveness is highly dependent on the environment. Therefore, we propose a novel method to automatically identify agents which may benefit from sharing parameters by partitioning them based on their abilities and goals. Our approach combines the increased sample efficiency of parameter sharing with the representational capacity of multiple independent networks to reduce training time and increase final returns.
1. Introduction
Multi-agent reinforcement learning (MARL) aims to jointly train multiple agents to solve a given task in a shared environment. Recent work has focused on novel techniques for experience sharing (Christianos et al., 2020), agent modelling (Albrecht & Stone, 2018), and communication between agents (Rangwala & Williams, 2020; Zhang et al., 2020) to address the non-stationarity and multi-agent credit assignment problems (Papoudakis et al., 2019). A problem which has received less attention to date is how to scale MARL algorithms to many agents, with typical numbers in previous works ranging between two and ten agents. One common implementation technique to facilitate training with a larger number of agents is parameter sharing (e.g. Gupta et al., 2017), whereby agents share some or all of their network parameters.

School of Informatics, University of Edinburgh, Edinburgh, United Kingdom. Correspondence to: Filippos Christianos.
2. Background
Markov Games:
A partially observable Markov game (Littman, 1994) is defined by the tuple $(\mathcal{N}, S, \{O^i\}_{i \in \mathcal{N}}, \{A^i\}_{i \in \mathcal{N}}, P, \{R^i\}_{i \in \mathcal{N}})$, with agents $i \in \mathcal{N} = \{1, \ldots, N\}$, state space $S$, and joint action space $A = A^1 \times \ldots \times A^N$. Each agent $i$ only perceives local observations $o^i \in O^i$ which depend on the current state. The function $P : S \times A \mapsto \Delta(S)$ returns a distribution over successor states given a state and a joint action; $R^i : S \times A \times S \mapsto \mathbb{R}$ is the reward function giving agent $i$'s individual reward $r^i$. Each agent $i$ seeks to maximise its discounted returns $G^i = \sum_{t=0}^{T} \gamma^t r^i_t$, with $\gamma$ and $T$ denoting the discount factor and total timesteps of an episode, respectively.

Unlike some recent MARL work (Rashid et al., 2018; Foerster et al., 2018; Christianos et al., 2020), we do not assume identical action spaces, observation spaces, or reward functions between agents.
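To make this interface concrete, the sketch below shows how a partially observable Markov game could be exposed to a learner. This is our illustration, not code from the paper; the names (`MarkovGame`, `reset`, `step`) and type choices are assumptions.

```python
from typing import Dict, List, Tuple

# Illustrative interface for a partially observable Markov game.
# Each of the N agents supplies an action; the environment returns
# per-agent local observations o^i and individual rewards r^i.
Observation = List[float]
Action = int

class MarkovGame:
    def __init__(self, n_agents: int):
        self.n_agents = n_agents  # |N|

    def reset(self) -> Dict[int, Observation]:
        """Sample an initial state and return each agent's local observation."""
        raise NotImplementedError

    def step(self, actions: Dict[int, Action]
             ) -> Tuple[Dict[int, Observation], Dict[int, float], bool]:
        """Apply the joint action: successor state ~ P(s, a); return
        per-agent observations, individual rewards R^i, and a done flag."""
        raise NotImplementedError
```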
Policy Gradient and Actor-Critic:

The goal of reinforcement learning is to find strategies that optimise the returns of the agents. Policy gradient methods, a class of model-free RL algorithms, directly learn and optimise a policy $\pi_\phi$ parameterised by $\phi$. The REINFORCE algorithm (Williams, 1992) follows the gradients of the objective $\nabla_\phi J(\phi) = \mathbb{E}_\pi [G_t \nabla_\phi \ln \pi_\phi(a_t | s_t)]$ to find a policy that maximises the returns. To further reduce the variance of gradient estimates, actor-critic algorithms replace the Monte Carlo returns with a value function $V^\pi(s; \upsilon)$. In a multi-agent, partially observable setting, a simple actor-critic algorithm defines the policy loss function for an agent $i$ as
$$L(\phi^i) = -\log \pi(a^i_t | o^i_t; \phi^i) \left( r^i_t + \gamma V(o^i_{t+1}; \upsilon^i) - V(o^i_t; \upsilon^i) \right)$$
and the respective value loss function as
$$L(\upsilon^i) = \left\| V(o^i_t; \upsilon^i) - y^i \right\|^2 \quad \text{with} \quad y^i = r^i_t + \gamma V(o^i_{t+1}; \upsilon^i)$$
In this paper, for reinforcement learning we use A2C (Mnih et al., 2016), an actor-critic algorithm that additionally uses n-step rewards, environments that run in parallel, and improved exploration with entropy regularisation.
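As an illustration of these losses, the following minimal sketch (ours, in PyTorch; not the authors' implementation) computes the one-step policy and value losses for a single agent; A2C's n-step returns, parallel environments, and entropy bonus are omitted.

```python
import torch
import torch.nn.functional as F

def actor_critic_losses(policy, value_fn, obs, action, reward, next_obs, gamma=0.99):
    """One-step actor-critic losses for a single agent i.

    policy(obs) -> action logits; value_fn(obs) -> scalar V(o; v).
    """
    logits = policy(obs)
    log_prob = torch.log_softmax(logits, dim=-1)[action]

    v = value_fn(obs)
    with torch.no_grad():
        # Bootstrapped target y^i = r_t + gamma * V(o_{t+1})
        target = reward + gamma * value_fn(next_obs)

    advantage = (target - v).detach()    # r_t + gamma*V(o_{t+1}) - V(o_t)
    policy_loss = -log_prob * advantage  # L(phi^i)
    value_loss = F.mse_loss(v, target)   # L(v^i) = ||V(o_t) - y^i||^2
    return policy_loss, value_loss
```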
Variational Autoencoders:
Variational autoencoders (VAEs) are generative models that explicitly learn a density function over some unobserved latent variables $Z$ given an input $x \in X$, where $X$ is a dataset. Given the unknown true posterior $p(z|x)$, VAEs approximate it with a parametric distribution $q_\theta(z|x)$ with parameters $\theta$. Computing the KL-divergence from the parametric distribution to the true posterior results in:
$$D_{KL}(q_\theta(z|x) \,\|\, p(z|x)) = \log p(x) - \mathbb{E}_{z \sim q_\theta(z|x)}[\log p_u(x|z)] + D_{KL}(q_\theta(z|x) \,\|\, p(z))$$
The term $\log p(x)$ is called the log-evidence and is constant. The other two terms form the negative evidence lower bound (ELBO). Minimising the negative ELBO, i.e. maximising the ELBO, is equivalent to minimising the KL-divergence between the parametric and the true posterior.
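For concreteness, here is a minimal sketch of the resulting VAE loss, assuming a diagonal Gaussian posterior, a standard normal prior, and a Gaussian likelihood (so the reconstruction term reduces to a squared error up to constants); it is an illustration, not code from the paper.

```python
import torch

def gaussian_kl(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ) in closed form, summed over dimensions."""
    return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=-1)

def negative_elbo(decoder, x, mu, logvar):
    """Negative ELBO = reconstruction term + KL term (to be minimised)."""
    # Reparameterisation trick: z = mu + sigma * eps, eps ~ N(0, I)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    x_hat = decoder(z)
    # Gaussian log-likelihood up to additive/multiplicative constants
    recon = torch.sum((x_hat - x) ** 2, dim=-1)
    return (recon + gaussian_kl(mu, logvar)).mean()
```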
3. Selective Parameter Sharing
To improve the effectiveness of parameter sharing, and to allow several distinct roles to be learned, we attempt to group agents that should be sharing their parameters during training. In an environment, we assume that the $N$ agents can be partitioned into $K$ sets ($K < N$), but without knowing $K$ nor the partitioning. Given the policies $\{\pi_1, \ldots, \pi_K\}$, each agent in a cluster $k$ uses and updates the shared policy $\pi_k$. As we show in our experiments, such distinct shared policies can often be trained more efficiently while offering enough representational capacity to successfully solve the environment, and may even reach higher overall returns than alternative methods (see Section 4). Figure 1 depicts a top-level diagram of the components in our architecture.

Figure 1: A top-level diagram of the selective parameter sharing architecture. N agents operate in one environment and receive observations and rewards. With SePS we train K < N policies and use a deterministic function µ to select which policy controls which agent.

To assign agents to partitions, we propose the use of a deterministic function $\mu : \mathcal{N} \mapsto \mathcal{K}$ that maps each agent $i$ to a parameterised policy (or partition) $\pi_k$. This partitioning is learned prior to RL training. Therefore, agents that share parameters benefit from shared representations in the latent layers of their neural networks, while not interfering with agents using other parameters.

Recall the transition function $P$ and reward functions $R^i$ that define an environment's dynamics (Section 2). We aim to determine $\mu$ and partition agents such that agents which try to solve similar tasks use shared policies. Therefore, we introduce another concept: a set of functions $\hat{P}^i$ and $\hat{R}^i$ that attempt to approximate $P$ and $R^i$, but from the agents' limited perspective of the world. An agent does not observe the state nor the actions of other agents, and hence we define $\hat{P}^i : O^i \times A^i \mapsto \Delta(O^i)$ and $\hat{R}^i : O^i \times A^i \mapsto \mathbb{R}$, which model the next observation and reward respectively, based only on the observation and action of agent $i$. When learning these functions, our goal is not to ensure their accuracy as approximators of the dynamics, but rather that they identify similar agents to provide a basis for partitioning. We speculate that agents that should be grouped together have similar reward and observation transition functions. Thus, the reasoning behind the following method is our desire to identify agents with identical $\hat{P}^i$ and $\hat{R}^i$, and have them share their network parameters.

We define an encoder $f_e$ and a decoder $f_p$ parameterised by $\theta$ (Fig. 2). The encoder, conditioned solely on the agent id, outputs the parameters that define an $m$-dimensional Gaussian distribution we can sample from. We refer to samples from this latent space as $z$. The decoder, further divided into an observation decoder $f_p^o$ and a reward decoder $f_p^r$, receives the observation, action, and sampled encoding $z$ of agent $i$, and attempts to predict the next observation and reward. In contrast to the classical definition of autoencoders, $o^i_t$ and $a^i_t$ bypass the encoder and are only received by the decoder. Thus, due to the bottleneck, $z$ can only encode information about the agent, such as its reward function $\hat{R}^i$ or observation transition model $\hat{P}^i$.

Figure 2: The encoder-decoder model. The encoder learns to encode the id of an agent in an embedding space while the decoder predicts the reward and next observation; the observation and action circumvent the encoder.
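A minimal sketch of this encoder-decoder follows, assuming one-hot agent ids, fully connected layers, and illustrative layer sizes (the paper does not fix these details here); the class and parameter names are ours.

```python
import torch
import torch.nn as nn

class SePSEncoderDecoder(nn.Module):
    """Sketch of the encoder-decoder of Fig. 2 (layer sizes are our assumptions).

    The encoder sees only the agent id and emits a Gaussian in the m-dim
    latent space; the observation and action bypass it and feed the decoders.
    Actions are assumed to be encoded as vectors of dimension act_dim.
    """
    def __init__(self, n_agents, obs_dim, act_dim, m=5, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_agents, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * m))  # mean, log-variance
        dec_in = m + obs_dim + act_dim
        self.obs_decoder = nn.Sequential(nn.Linear(dec_in, hidden), nn.ReLU(),
                                         nn.Linear(hidden, obs_dim))  # f_p^o
        self.rew_decoder = nn.Sequential(nn.Linear(dec_in, hidden), nn.ReLU(),
                                         nn.Linear(hidden, 1))        # f_p^r

    def forward(self, agent_onehot, obs, act):
        mu, logvar = self.encoder(agent_onehot).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation
        dec_in = torch.cat([z, obs, act], dim=-1)
        return self.obs_decoder(dec_in), self.rew_decoder(dec_in), mu, logvar
```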
To formalise the process, we assume that for each agent its identity $i$ is representative of its observation transition distribution and reward function. Additionally, we assume that both the identity of each agent and its observation transition distribution can be projected into a latent space $Z$ through the posteriors $q(z|i)$ and $p(z|\text{tr} = (o_{t+1}, o_t, r_t, a_t))$. The goal is to find the posterior $q(z|i)$. We assume a variational family of parameterised Gaussian distributions with parameters $\theta$: $q_\theta(z|i) = \mathcal{N}(\mu_\theta, \Sigma_\theta; i)$. To solve this problem we use the variational autoencoding framework (Kingma & Welling, 2014) to optimise the objective $D_{KL}(q_\theta(z|i) \,\|\, p(z|\text{tr}))$. We derive a lower bound on the log-evidence (ELBO) of the transition $\log p(\text{tr})$ as:
$$\log p(\text{tr}) \geq \mathbb{E}_{z \sim q_\theta(z|i)}[\log p_u(\text{tr}|z)] - D_{KL}(q_\theta(z|i) \,\|\, p(z)) \quad (1)$$
The reconstruction term of the ELBO factorises as:
$$\log p_u(\text{tr}|z) = \log \left[ p_u(r_t, o_{t+1} | a_t, o_t, z)\, p(a_t, o_t | z) \right] = \log p_u(r_t | o_{t+1}, a_t, o_t, z) + \log p_u(o_{t+1} | a_t, o_t, z) + c$$
The last term $c = \log p(a_t, o_t | z)$ is discarded, since $a_t$ and $o_t$ do not depend on the latent variable $z$.

For the encoder-decoder model to learn from the experience of all agents, it is trained with samples from all agents and will represent the collection of the agent-centred transition and reward functions $\hat{P}^i$ and $\hat{R}^i$ for all $i \in \mathcal{N}$. Given the inputs of the decoder, the information of the agent id can only pass through the sample $z$.

Optimising the model objective (Eq. (1)) can be done prior to reinforcement learning. We sample actions $a^i \sim A^i$ and store the observed trajectories of all agents in a shared experience replay. We have empirically observed that the data required for this procedure is orders of magnitude less than what is usually required for reinforcement learning, and can even be reused for training the policies, thus not adding to the sample complexity.

The final step of the pre-training procedure is to run a clustering algorithm on the means generated by the encoder $f_e(i)$ for all $i \in \mathcal{N}$, and use the agent indices clustered together to define $\mu$. In the experiments that follow, we use k-means for simplicity. After the partitioning is completed, a static computational graph (for automatic differentiation) can be generated to train the policies with significant speed advantages.
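The pre-training and partitioning procedure could then look like the following sketch, which assumes the `SePSEncoderDecoder` above, a replay iterable of transition batches, and scikit-learn's k-means; it illustrates the steps described in this section and is not the authors' code.

```python
import torch
from sklearn.cluster import KMeans

def partition_agents(model, replay, n_agents, k, epochs=10, kl_weight=1.0):
    """Train the encoder-decoder on all agents' transitions, then cluster
    the per-agent latent means to define mu: agent id -> policy index.
    `replay` is assumed to yield (agent_onehot, obs, act, rew, next_obs) batches.
    """
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for agent_onehot, obs, act, rew, next_obs in replay:
            obs_hat, rew_hat, mu, logvar = model(agent_onehot, obs, act)
            # Reconstruction of next observation and reward (Gaussian likelihoods)
            recon = ((obs_hat - next_obs) ** 2).sum(-1) \
                    + (rew_hat.squeeze(-1) - rew) ** 2
            # KL term of Eq. (1) against a standard normal prior
            kl = 0.5 * (logvar.exp() + mu.pow(2) - 1.0 - logvar).sum(-1)
            loss = (recon + kl_weight * kl).mean()  # negative ELBO
            optim.zero_grad()
            loss.backward()
            optim.step()

    # Cluster the encoder means f_e(i) for all agents i.
    ids = torch.eye(n_agents)
    with torch.no_grad():
        means, _ = model.encoder(ids).chunk(2, dim=-1)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(means.numpy())
    return {i: int(labels[i]) for i in range(n_agents)}  # the mapping mu
```

Running `partition_agents` before RL training yields the mapping µ, which can then be used to instantiate the K shared policies.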
4. Experimental Evaluation
In this section, we evaluate both whether SePS performs as intended by correctly partitioning the agents, and whether this partitioning helps in improving the overall returns, sample complexity, and training time. For RL, we use the A2C (Mnih et al., 2016) algorithm and report the sum of returns of all agents. A search was performed for A2C's hyperparameters across all baselines, while hyperparameters of the clustering portion of SePS were easily found manually and kept identical across all environments (more details in Section 4.8).
Table 1: Brief description of environments, including how agents are distributed to different agent types: colours in BPS and C-RWARE, levels in LBF, or different units in MMM2. Environments marked with † or ‡ have different observation spaces or action spaces respectively, while § marks a cooperative (shared) reward.

We use four multi-agent environments (Fig. 3), which are described below and summarised in Table 1.
Blind-particle spread:
Our motivating toy environment is a custom scenario created with the Multi-agent Particle Environment (MPE; Lowe et al., 2017). Blind-particle spread (BPS, Fig. 3a) consists of landmarks of different colours and numerous agents that have also been assigned colours. The agents are unable to see their own colour (or that of the other agents), but need to move towards the correct landmark. This environment enables us to investigate the effects of parameter sharing by allowing us to control two important variables: i) the number of agents and ii) the number of colours (distinct behaviours that must be learned). In a further, more difficult variation which we name BPS-h, each group of agents also has a different observation space (e.g. the agents could be equipped with different sensors).
Coloured Multi-Robot Warehouse:
The Coloured Multi-Robot Warehouse (C-RWARE, Fig. 3b) is a variation of the RWARE environment (Christianos et al., 2020) in which robots have different functionalities: they are rewarded only for delivering specific shelves (denoted by different colours) and have different action spaces. The agents can rotate or move forward, and pick up or drop a shelf. The observation consists only of a small square centred around the agent. Agents are only rewarded when successfully arriving at the goal with a requested shelf of the correct colour, making the reward sparse. RWARE is known (Christianos et al., 2020; Papoudakis et al., 2020) to be an environment with difficult exploration, and independent learners have been shown to struggle on it.

Level-based Foraging:
Level-based Foraging (LBF, Fig. 3c) (Albrecht & Ramamoorthy, 2013) is a multi-agent environment where agents are placed in a grid and required to forage randomly scattered food. Each agent is assigned a level, and each food item is also assigned a level at the beginning of the episode. The agents can move in the four directions and attempt to load an adjacent food item. For foraging to be successful, the sum of the levels of the agents loading the food must be equal to or greater than the food's level. LBF is partially observable: while the agents can see the positions of agents and food, as well as the food levels, they cannot see any of the agent levels. The reward is proportionate to an agent's contribution when a food item is successfully loaded.
StarCraft Multi-Agent Challenge:
While the StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al., 2019) might not be the archetype for displaying the strengths of selective parameter sharing, it is a widely used setting where multiple agents of distinct types co-exist and must learn together. For instance, the "MMM2" task (Fig. 3d) contains three types of units (marines, marauders, and medivacs) with distinct attributes. One of those unit types, medivacs, is especially different, since it needs to learn how to heal friendly units instead of attacking enemies.
We compare SePS against several other methods of parameter sharing, described below.
No Parameter Sharing (NoPS):
In our NoPS baseline, all agents have their own set of parameters, and there is no overlap of gradients. This approach is common in the literature and is usually encountered when there is no mention of parameter sharing, e.g. MADDPG (Lowe et al., 2017).
Full Parameter Sharing (FuPS):
The second baseline, FuPS, consists of a single set of parameters that is shared between all agents. FuPS is a naive baseline, since it does not allow agents to develop any difference in behaviour.
Full Parameter Sharing with index (FuPS+id):
We also test a variation of the previous method, FuPS+id, where the policy is additionally conditioned on the agent id. While the use of plain FuPS is limited and our expectations for it are not high, FuPS+id is encountered very often in the literature (Gupta et al., 2017; Rashid et al., 2018; Foerster et al., 2018).
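A common way to implement this baseline, and the one we assume in this sketch (the paper does not prescribe an implementation), is to append a one-hot agent id to each observation before it is passed to the shared policy:

```python
import numpy as np

def with_agent_id(obs: np.ndarray, agent_idx: int, n_agents: int) -> np.ndarray:
    """Concatenate a one-hot agent id to the observation so that a single
    shared policy can, in principle, condition its behaviour on the agent."""
    one_hot = np.zeros(n_agents, dtype=obs.dtype)
    one_hot[agent_idx] = 1.0
    return np.concatenate([obs, one_hot])
```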
The dense reward signal in our toy environment, BPS, makes it sufficiently simple for non-parameter-sharing agents to learn how to reach their respective landmarks. However, when parameter sharing is involved, we expect the task to become considerably harder. FuPS+id presumably solves this by allowing each agent to develop a distinct policy based on its agent id. To investigate this, we tested NoPS, FuPS, and FuPS+id in a series of BPS tasks where the agent count remains constant and the number of colours (landmarks) increases. NoPS should have no issue learning immediately, since each agent needs to learn to navigate to a specific landmark. In contrast, sharing parameters (FuPS) cannot work, because the agents lack the information required to determine their colour and move accordingly.

We are, however, very interested in what FuPS+id can learn. The agents have all the information needed to learn how to move to the correct landmark. But, as we hypothesised in earlier sections, the overlap of different policies which must be represented with the same parameters poses a significant bottleneck for learning. Indeed, as Fig. 4 indicates, the performance of FuPS+id deteriorates sharply, even with only three colours.

Figure 3: Visualisation of environments used in experiments: (a) Blind-Particle Spread, (b) Coloured Multi-Robot Warehouse, (c) Level-based Foraging, (d) SMAC.

Figure 4: BPS-h with 10 agents and a variable number of colours. The maximum evaluation returns are recorded for each environment and algorithm (NoPS, FuPS, FuPS+id), and vertical bars indicate standard deviation across seeds.
Next, we investigate how selectively sharing parameters affects reinforcement learning performance. Our hypothesis remains that agents can benefit from sharing parameters if they have been clustered together by SePS. Our results, detailed in Table 2, support this hypothesis. We also present learning curves on a selection of environments in Fig. 5.
BPS:
The BPS tasks (Table 2 and Figs. 5a and 5b) are trivial for independent learners given the dense reward: each agent learns to always move towards a specific colour. However, if all agents share a policy, then the agents (not being able to perceive their own colours) only learn to move to a local minimum. FuPS+id, which supposedly circumvents the problem, still has issues learning this task correctly. Due to the high computational requirements of NoPS, which requires N different sets of parameters, running BPS-h (3) with 200 agents was infeasible.

C-RWARE:
Our results in C-RWARE (Table 2 and Figs. 5c and 5d) are more surprising. NoPS, which was a strong contender in BPS, completely failed to learn in C-RWARE (2) and (3). These tasks are very sparsely rewarded, which seems to make independent learning ineffective. Instead, sharing parameters also combines the received rewards, providing a useful learning direction to the optimiser. Also, similarly to BPS, SePS outperforms the other naive parameter sharing methods.
LBF:
Similarly to the other environments, SePS agents in LBF achieve higher returns with more efficient use of environment samples (Fig. 5e). While FuPS+id comes close to the optimal performance in LBF, it does not achieve the same returns as SePS. NoPS takes considerably more samples to train, but given our BPS results, it is possible it eventually converges to the same returns as SePS.

MMM2:
In one of the hardest SMAC environments, MMM2, the most surprising result was the difference in converged returns between NoPS and the parameter sharing methods. Even fully shared parameters, after making sure the identity of the agents does not leak through the observation, outperform NoPS. Our hypothesis on these results is that i) this task requires agents to act in a very similar way (e.g. targeting the same opponents) and ii) parameter sharing plays a previously underrated role in decomposing (or reasoning over) a shared reward. The rest of the methods behave similarly to the other environments, but with minimal improvement of SePS over FuPS+id.

Figure 5: Learning curves during training for a selection of the environments, comparing NoPS, SePS (ours), FuPS, and FuPS+id: (a) BPS (3), (b) BPS-h (2), (c) C-RWARE (1), (d) C-RWARE (3), (e) LBF, (f) SMAC (MMM2).
Next, an important step in evaluating SePS is to verify that meaningful clusters appear after optimising the objectives discussed in Section 3. Our goal is to compare the clustering of our algorithm with one decided by a human who is given knowledge of the environment. We visualise the embedding for each of the units on a SMAC task (Fig. 6). We can assert that the clustering, which is clearly visible, is precisely what we would expect. Different types of units all have different properties (e.g. movement speed, health, or damage) and thus a distinct interaction with the environment. These differences were picked up by the encoder, which subsequently spread them in the embedding space.

Table 2: Maximum evaluation returns with std across seeds. Highest means (within one std) are shown in bold. (Columns: NoPS, SePS (Ours), FuPS, FuPS+id; rows: BPS (1)-(4), BPS-h (1)-(3), C-RWARE (1)-(3), LBF, MMM2. NoPS on BPS-h (3) is N/A.)

Figure 6: Visualisation of the means of z for each of the 10 agents in SMAC (MMM2 task) in a 3-dimensional space. The colours identify the clusters created by k-means.

A question that might arise is why the agents of the same unit type (and cluster) are spread out along the z axis of Fig. 6. In the SMAC environment, there is another difference between the agents: their starting position. Therefore, the initial observations $o^i$ are sampled from different sets for each of the agents. The encoder picks up on this feature and further spreads the latent encoding. While k-means clustered the agents by unit type, it could be argued that more clusters could have been formed; determining the right granularity goes beyond the scope of our work and is a known open problem of unsupervised clustering (Kaufman & Rousseeuw, 2009). However, this was the exception in our tests: in all other environments, where the starting observations are sampled from the same set, the latent variables of similar agents overlap one another. Nevertheless, in Section 4.6 we explore how the number of clusters can be determined, and how wrong choices can affect learning.

The clustering process across all environments and seeds matched the various types of agents. For instance, in C-RWARE and BPS, each cluster contains only agents of the same colour. Importantly, this information is not included in the observation space and therefore not observed by the agents or even the encoder; it is only inferred after observing the transitions and rewards of each agent.

Several ways to determine the number of clusters in the embedding space exist. Arguably the most straightforward is the use of domain knowledge. Having an estimate of the value of K could possibly translate to knowledge of which agents should be assigned to clusters in the first place. Despite the diminished importance of the pretraining SePS stage in this situation, we believe that understanding the effectiveness of shared parameters between clusters is still of value.

A second approach would consist of treating K as a tunable hyperparameter. In Fig. 7, we present the returns during training on C-RWARE when SePS is forced to create a varied number of clusters. It is clear from the results that overestimating K is of little consequence. However, trying to form fewer clusters than needed lowers the achieved returns, and collapses to FuPS when K = 1.

Figure 7: Mean returns during training for different numbers of clusters (K = 1 to 5, with K = 3 optimal) on C-RWARE (9 agents and 3 colours).

Finally, there is a plethora of well-studied heuristics for separating clusters when the embedding space is known. The elbow method (Thorndike, 1953), the silhouette method (Rousseeuw, 1987), or the Davies-Bouldin index (Davies & Bouldin, 1979) could all be used to determine the number of clusters, since our method tends to produce well separated values. We have implemented and tested the Davies-Bouldin index, and found that, coupled with k-means, it reliably finds the same clusters an expert would in our tested environments (i.e. the second column in Table 1).
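As an illustration of this heuristic, the sketch below (assuming scikit-learn; not the authors' code) sweeps over candidate values of K and keeps the one minimising the Davies-Bouldin index of the k-means clustering:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def select_num_clusters(latent_means: np.ndarray, k_max: int) -> int:
    """Pick K by minimising the Davies-Bouldin index over k-means clusterings.
    `latent_means` has one row per agent (the encoder means f_e(i)).
    The index is defined for K >= 2, so the sweep starts there."""
    best_k, best_score = 2, np.inf
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(latent_means)
        score = davies_bouldin_score(latent_means, labels)  # lower is better
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```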
In the previous sections, we demonstrated the effectiveness of SePS in terms of learning, showing that it achieves the highest returns among the baselines. However, we have not yet addressed how SePS benefits MARL when scaled to many agents. To examine this, Fig. 8 presents the median time per timestep during training. It is clear that while SePS adds computational complexity over the fully shared networks, it scales significantly better than NoPS. In the BPS environments with 30 agents, SePS requires almost half the training time of NoPS due to the substantially fewer trainable parameters. In BPS-h (3), training with NoPS was infeasible, since it requires 200 sets of parameters (50 times more than SePS).

Figure 8: Median running time of a timestep during training over all the environments and methods.
In our experiments, we used the Adam optimiser (tuning the learning rate, optimiser epsilon, and entropy coefficient, with a finer search done in the case of SMAC), and value, critic, and encoder-decoder networks with two layers of 64 units. Eight environments were sampled concurrently and 5-step returns were computed. For the pretraining, m was set to 5, the reconstruction loss was scaled by a constant factor, and we used a batch size of 128. Figure 8 was generated on an Epyc 7702 running Python 3, with environments sampled in parallel threads.
5. Related Work
Centralised Training with Decentralised Execution (CTDE):
A paradigm popular in cooperative MARL, CTDE assumes that during training all agents can access data from all other agents. After training is completed, the agents stop having access to external data and can only observe their own perspective of the environment. CTDE algorithms such as MADDPG (Lowe et al., 2017), QMIX (Rashid et al., 2018), and SEAC (Christianos et al., 2020) all benefit from the centralised training stage and have repeatedly been shown to outperform non-CTDE baselines. SePS also adheres to the CTDE paradigm, and assumes that during training all information is shared.
Parameter Sharing:
Sharing parameters between agents has a long history in MARL. Tan (1993) investigates sharing policies between cooperative agents in non-deep RL settings. More recently, algorithms such as COMA (Foerster et al., 2018), QMIX (Rashid et al., 2018), or Mean Field RL (Yang et al., 2018) share the parameters of neural networks similarly to our FuPS and FuPS+id baselines. ROMA (Wang et al., 2020) learns dynamic roles to share experience between agents that perform similar tasks. With SePS we perform this operation statically in order to maximise computational efficiency, but we arrive at similar partitionings of agents in heterogeneous SMAC tasks (Fig. 6). The novelty of SePS does not come from sharing parameters, which is a well established method in MARL, but from creating neural network architectures in advance, allowing more efficient and effective sharing.
Sharing Experience:
SEAC (Christianos et al., 2020) shares experience between agents while maintaining separate policy and value networks. While SEAC achieves state-of-the-art performance, not only does it require one network per agent (i.e. NoPS), it also stacks the experience of the agents, leading to increased batch sizes. With SePS we forfeit the exploration benefits of SEAC but arrive at a method that may scale to hundreds of agents.
Scaling MARL to more Agents:
Mean Field RL (Yang et al., 2018) tackles MARL with numerous agents by approximating interactions between a single agent and the average effect of the population. While it is shown that convergence is improved, Mean Field RL shares parameters in a fashion similar to FuPS. Our method operates as a pre-training step and attempts to find a network architecture configuration that improves learning. SePS can be combined with MARL algorithms (centralised critic, value decomposition, or others) since it improves a different part of the RL procedure.
6. Conclusion
This paper explores existing methods for parameter sharing in MARL, identifying situations where they are ineffective. Our experiments suggest that sharing parameters indiscriminately between agents can make learning harder, since agents interfere with the learning of others (Section 4.3). Therefore, we proposed a method for selective parameter sharing that identifies groups of agents which may benefit from sharing parameters. SePS is shown to successfully recognise heterogeneous agents and assign them to different parameter sets, allowing MARL training to scale to hundreds of agents even when they are not homogeneous. Limitations of this work that could be studied in the future include our encoder learning differences in agents that only manifest after some complicated behaviour is already learned, or ensuring that shared parameter methods generalise against unseen opponents.
References
Albrecht, S. V. and Ramamoorthy, S. A Game-Theoretic Model and Best-Response Learning Method for Ad Hoc Coordination in Multiagent Systems. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS '13, pp. 1155–1156, Richland, SC, May 2013.

Albrecht, S. V. and Stone, P. Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems. Artificial Intelligence, 258:66–95, 2018. doi: 10.1016/j.artint.2018.01.002.

Christianos, F., Schäfer, L., and Albrecht, S. Shared Experience Actor-Critic for Multi-Agent Reinforcement Learning. In Advances in Neural Information Processing Systems, volume 33, 2020.

Davies, D. L. and Bouldin, D. W. A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):224–227, April 1979. doi: 10.1109/TPAMI.1979.4766909.

Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual Multi-Agent Policy Gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, April 2018.

Gupta, J. K., Egorov, M., and Kochenderfer, M. Cooperative Multi-agent Control Using Deep Reinforcement Learning. In Sukthankar, G. and Rodriguez-Aguilar, J. A. (eds.), Autonomous Agents and Multiagent Systems, Lecture Notes in Computer Science, pp. 66–83, Cham, 2017. Springer International Publishing. doi: 10.1007/978-3-319-71682-4_5.

Kaufman, L. and Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis, volume 344. John Wiley & Sons, 2009.

Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat], May 2014.

Littman, M. L. Markov Games as a Framework for Multi-Agent Reinforcement Learning. In Proceedings of the Eleventh International Conference on Machine Learning, 1994.

Lowe, R., Wu, Y., Tamar, A., Harb, J., Pieter Abbeel, O., and Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Advances in Neural Information Processing Systems, volume 30, pp. 6379–6390, 2017.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In International Conference on Machine Learning, pp. 1928–1937. PMLR, June 2016.

Papoudakis, G., Christianos, F., Rahman, A., and Albrecht, S. V. Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning. arXiv:1906.04737 [cs, stat], June 2019.

Papoudakis, G., Christianos, F., Schäfer, L., and Albrecht, S. V. Comparative Evaluation of Multi-Agent Deep Reinforcement Learning Algorithms. arXiv:2006.07869 [cs, stat], June 2020.

Rangwala, M. and Williams, R. Learning Multi-Agent Communication through Structured Attentive Reasoning. In Advances in Neural Information Processing Systems, volume 33, 2020.

Rashid, T., Samvelyan, M., Schroeder, C., Farquhar, G., Foerster, J., and Whiteson, S. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In International Conference on Machine Learning, pp. 4295–4304. PMLR, July 2018.

Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987. doi: 10.1016/0377-0427(87)90125-7.

Samvelyan, M., Rashid, T., De Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G. J., Hung, C. M., Torr, P. H. S., Foerster, J., and Whiteson, S. The StarCraft Multi-Agent Challenge. In International Joint Conference on Autonomous Agents and Multi-Agent Systems, pp. 2186–2188, 2019.

Tan, M. Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents. In International Conference on Machine Learning, 1993.

Thorndike, R. L. Who belongs in the family? Psychometrika, 18(4):267–276, December 1953. doi: 10.1007/BF02289263.

Wang, T., Dong, H., Lesser, V., and Zhang, C. ROMA: Multi-Agent Reinforcement Learning with Emergent Roles. In International Conference on Machine Learning, pp. 9876–9886. PMLR, November 2020.

Williams, R. J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3–4):229–256, May 1992. doi: 10.1007/BF00992696.

Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., and Wang, J. Mean Field Multi-Agent Reinforcement Learning. In International Conference on Machine Learning, pp. 5571–5580. PMLR, July 2018.

Zhang, S. Q., Zhang, Q., and Lin, J. Succinct and Robust Multi-Agent Communication With Temporal Message Control. In Advances in Neural Information Processing Systems, volume 33, 2020.