HAMMER: Multi-Level Coordination of Reinforcement Learning Agents via Learned Messaging
Nikunj Gupta, G Srinivasaraghavan, Swarup Kumar Mohalik, Matthew E. Taylor
Nikunj Gupta
International Institute of Information Technology, Bangalore, [email protected]
G. Srinivasaraghavan
International Institute of Information Technology, Bangalore, [email protected]
Swarup Kumar Mohalik
Ericsson Research, Bangalore, [email protected]
Matthew E. Taylor
University of Alberta, Edmonton, [email protected]
ABSTRACT
Cooperative multi-agent reinforcement learning (MARL) has achieved significant results, most notably by leveraging the representation-learning abilities of deep neural networks. However, large centralized approaches quickly become infeasible as the number of agents scales, and fully decentralized approaches can miss important opportunities for information sharing and coordination. Furthermore, not all agents are equal — in some cases, individual agents may not even have the ability to send communication to other agents or explicitly model other agents. This paper considers the case where there is a single, powerful, central agent that can observe the entire observation space, and there are multiple, low-powered, local agents that can only receive local observations and cannot communicate with each other. The central agent's job is to learn what message to send to different local agents, based on the global observations, not by centrally solving the entire problem and sending action commands, but by determining what additional information an individual agent should receive so that it can make a better decision. After explaining our MARL algorithm, hammer, and where it would be most applicable, we implement it in the cooperative navigation and multi-agent walker domains. Empirical results show that 1) learned communication does indeed improve system performance, 2) results generalize to multiple numbers of agents, and 3) results generalize to different reward structures.
KEYWORDS
Multi-agent Reinforcement Learning, Learning to Communicate, Heterogeneous Agent Learning
1 INTRODUCTION

The field of multi-agent reinforcement learning (MARL) combines ideas from single-agent reinforcement learning (SARL), game theory, and multi-agent systems. Cooperative MARL calls for simultaneous learning and interaction of multiple agents in the same environment to achieve shared goals. Applications like distributed logistics [58], package delivery [40], and disaster rescue [33] are some domains that can naturally be modeled using this framework. However, even cooperative MARL suffers from several complications inherent to multi-agent systems, including non-stationarity [3], a potential need for coordination [3], the curse of dimensionality [42], and global exploration [27].

Multi-agent reasoning has been extensively studied in MARL [27, 32] in both centralized and decentralized settings. While very small systems could be completely centralized, decentralized implementation quickly becomes necessary as the number of agents increases. Decentralization may also become necessary in practice to cope with the exponential growth in the joint observation and action spaces; however, it often suffers from synchronization issues [27] and complex teammate modeling [1]. Moreover, independent learners may have to optimize their own, or the global, reward from only their local, private observations [50]. In contrast, centralized approaches can leverage global information and also mitigate non-stationarity through full awareness of all teammates.

Further, communication has been shown to be important, especially in tasks requiring coordination. For instance, agents tend to locate each other or the landmarks more easily using shared information in navigation tasks [11], and communication can influence the final outcomes in group strategy coordination [13, 57]. There have been significant achievements using explicit communication in video games like StarCraft II [34] as well as in mobile robotic teams [26], smart-grid control [35], and autonomous vehicles [4]. Communication can take the form of sharing experiences among the agents [59], sharing low-level information like gradient updates via communication channels [7], or directly advising appropriate actions using a pretrained agent (teacher) [52] or even learning teachers [31, 43].

Inspired by the advantages of centralized learning and of communication for synchronization, we propose multi-level coordination among intelligent agents via messages learned by a separate agent to ease the localized learning of task-related policies. A single central agent is introduced and is designed to learn high-level messages based on complete knowledge of all the local agents in the environment. These messages are communicated to the independent learners, who are free to use or discard them while learning their local policies to achieve a set of shared goals. By introducing centralization in this manner, the supplemental agent can play the role of a facilitator of learning for the rest of the team. Moreover, the independent learners need not be as powerful as would be required if they had to train to communicate or model other agents alongside learning task-specific policies.

A hierarchical approach to MARL is not new — we will contrast with other existing methods in Section 2. However, the main insight of our algorithm is to learn to communicate relevant pieces of information from a global perspective to help agents with limited capabilities improve their performance.
Potential applications include autonomous warehouse management [6] and traffic light control [24], where there can be a centralized monitor. After we introduce our algorithm, hammer, we will show results in two very different simulated environments to showcase its generalization. OpenAI's multi-agent cooperative navigation lets agents learn in a continuous state space with discrete actions and global team rewards. In contrast, Stanford's multi-agent walker environment has a continuous action space, and agents can receive only local rewards.

The main contributions of this paper are to explain a novel and important setting that combines agents with different abilities and knowledge (Section 3), introduce the hammer algorithm that addresses this setting (Section 4), and then empirically demonstrate that hammer can make significant improvements (Section 6) in two multi-agent domains.

2 BACKGROUND AND RELATED WORK

This section provides a summary of background concepts necessary to understand the paper and a selection of related work.
Single-agent reinforcement learning (SARL) can be formalized in terms of Markov Decision Processes (MDPs). An MDP is defined as a tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is the set of states, $A$ is the set of available actions for the agent, $P: S \times A \times S \to [0, 1]$ is the state transition function, $R: S \times A \to \Re$ is the reward function, and $\gamma \in (0, 1]$ is a discount factor. Actions, selected by a policy $\pi: S \times A \to [0, 1]$, are taken and the agent tries to maximize the return, which is the expectation over the sum of discounted future rewards:
$$\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right],$$
where $t$ is the time step.

To maximize this expectation, one class of RL algorithms aims to learn a Q-value (state-action) function; the Bellman optimality equation for it can be defined as
$$Q^{*}(s \in S, a \in A) = \sum_{s' \in S} P(s, a, s') \left[ R(s, a, s') + \gamma \max_{a' \in A} Q^{*}(s', a') \right].$$
This update provably converges to the optimal function $Q^{*}$ under certain conditions. It has also become popular to use non-linear function approximators to scale to huge state spaces — as in DQN [28].

In another popular choice for solving RL tasks — policy gradients — the parameters $\theta$ of the policy are directly updated to maximize an objective $J(\theta)$ by moving in the direction of $\nabla J(\theta)$. However, policy gradient methods can exhibit high-variance gradient estimates, are sensitive to the selection of step size, progress slowly, and sometimes encounter catastrophic drops in performance. The situation becomes worse in multi-agent systems where, often, rewards for each agent are altered by the interactions of other agents in the environment. In our work, we focus on Proximal Policy Optimization (PPO) [39], which reduces these challenges while also being relatively easy to implement. Its objective function, well-suited for updates using stochastic gradient descent, can be defined as follows:
$$L^{CLIP}(\theta) = \mathbb{E}_{t}\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon)\, A_t\right)\right],$$
where $r_t(\theta)$ is the ratio of action probabilities under the new and old policies, $A_t$ is the estimated advantage at time $t$, and $\epsilon$ is a hyperparameter.
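As a concrete illustration of this objective, here is a minimal NumPy sketch of the clipped surrogate loss; the array names and the toy batch below are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate objective L^CLIP averaged over a batch.

    new_log_probs, old_log_probs, advantages: 1-D arrays of equal length.
    eps: the PPO clip hyperparameter (epsilon).
    """
    ratios = np.exp(new_log_probs - old_log_probs)            # r_t(theta)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))            # E_t[min(...)]

# Toy usage with random numbers standing in for real rollout statistics.
rng = np.random.default_rng(0)
logp_new = rng.normal(size=64)
logp_old = logp_new + rng.normal(scale=0.1, size=64)
adv = rng.normal(size=64)
print(ppo_clip_objective(logp_new, logp_old, adv))
```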
SARL can be generalized to competitive, cooperative, or mixed multi-agent settings. We focus on the fully cooperative multi-agent setting, which can be described as a generalization of MDPs to stochastic games (SGs). An SG for $n$ local agents, and an additional centralized agent in our case, can be defined as the tuple $\langle S, U, O_1, \ldots, O_n, A_1, \ldots, A_n, P, R, \gamma \rangle$, where $S$ is the set of all configurations of environment states for all agents, $U$ is the set of all actions for the central agent, $O_1, \ldots, O_n$ represent the observations of each local agent, $A_1, \ldots, A_n$ correspond to the sets of actions available to each local agent, and $P$ is the state transition function. $\gamma$ is the discount factor. If all the agents have the same common goal, i.e., they aim to maximize the same expected return, the SG becomes fully cooperative.

State transitions in the multi-agent case result from the joint action of all the agents, $U \times A_1 \times \cdots \times A_n$. The policies join together to form a joint policy $h: S \times U \times A$. Two reward structures are possible for the agents. First, they could share a common team reward signal, $R: S \times A \to \Re$, defined as a function of the state $s \in S$ and the agents' joint action $A: A_1 \times \ldots \times A_n$. In the case of such shared rewards, agents aim to directly maximize the returns for the team. Second, each agent might receive its own reward $R_i: O_i \times A_i \to \Re$. Such localized rewards mean that each agent maximizes its own total expected discounted return $r = \sum_{t=0}^{\infty} \gamma^{t} r_t$.

Recent works in both SARL and MARL have employed deep learning methods, especially to tackle the high dimensionality of the observation and action spaces [9, 23, 28, 34, 49]. Several works in the past have taken advantage of hierarchical approaches to MARL. The MAXQ algorithm was designed to provide a hierarchical break-down of the reinforcement learning problem by decomposing the value function for the main problem into a set of value functions for the sub-problems [5]. Tang et al. [51] used temporal abstraction to let agents learn high-level coordination and independent skills at different temporal scales together. Kumar et al. [16] presented another framework benefiting from temporal abstractions to achieve coordination among agents with reduced communication complexity. These works show positive results from combining centralized and decentralized approaches in different manners and are therefore closely related to our work. Vezhnevets et al. [55] introduce Feudal networks in hierarchical reinforcement learning and employ a Manager-Worker framework. However, there are some key differences. In their case, the manager directly interacts with the environment to receive the team reward and accordingly distributes it among the workers (analogous to setting their goals), unlike in our work, where the central agent interacts indirectly and receives the same reward as is earned by the local agents. In our work, the central agent is only allowed to influence the actions of the independent learners rather than set their goals explicitly. Also, they partly pretrain their workers before introducing the manager into the scene. In contrast, our results show that when the central agent and the independent learners learn simultaneously, they achieve better performance.

Some works have developed architectures that use centralized learning but ensure decentralized execution. COMA [9] used a centralized critic to estimate the Q-function along with a counterfactual advantage for the decentralized actors in MARL. The VDN [47] architecture trained individual agents by learning to decompose the team value function into agent-wise value functions in a centralized manner. QMIX [36] employs a mixing network to factor the joint action-value into a monotonic non-linear combination of individual value functions for each agent. Another work, MADDPG [25], extends deep deterministic policy gradients (DDPG) [23] to MARL. They learn a centralized critic for each agent and continuous policies for the actors, and also allow explicit communication among agents. Even though the works mentioned here operate in a similar setting and allow for using extra, globally accessible information as in our work, they mainly aim at decentralized execution, which applies to domains where a global view is unavailable. In contrast, we target domains where a global view is accessible to a single agent, even during execution.

There has been considerable progress in learning by communication in cooperative settings involving partially observable environments. Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL), proposed in [7], use neural networks to output communication messages in addition to the agent's Q-values. RIAL used shared parameters to learn a single policy, whereas DIAL used gradient sharing during learning and communication actions during execution. Both methods use discrete communication channels.
On the other hand, CommNet [46] used continuous vectors, enabled multiple communication cycles per time step, and allowed agents to freely enter and exit the environment. Lazaridou et al. [19] and Mordatch et al. [30] trained agents to develop an emergent language for communication. Furthermore, standard deep learning techniques like dropout [45] have inspired works like [15], where the messages of other agents are dropped out during learning so that the system works well even when only limited communication is feasible. In all these works, however, the goal is to learn inter-agent communication alongside local policies, which suffers from the bottleneck of simultaneously achieving effective communication and global collaboration [41]. They also face difficulty in extracting essential and high-quality information for exchange among agents [41]. Further, unlike hammer, these works expect more sophisticated agents to be available in the environment, in terms of communication capabilities or the ability to run complex algorithms to model other agents — which might not always be feasible.

One popular form of independent learning is via emergent behaviours [21, 22, 49], wherein each agent learns its own private policy and treats all other agents as part of the environment. This methodology disregards the underlying assumptions of single-agent reinforcement learning, particularly the Markov property. Although this may achieve good results [27], it may also fail due to non-stationarity [18, 54]. Self-play can be a useful concept in such cases [2, 44, 53], but it is still susceptible to failure by forgetting past knowledge [17, 20, 37]. Gupta et al. [12] extend three SARL algorithms, Deep Q Networks (DQN) [29], Deep Deterministic Policy Gradient (DDPG) [23], and Trust Region Policy Optimization (TRPO) [38], to cooperative multi-agent settings.
3 SETTING

This section details a novel setting in which multiple agents — the central agent and the independent learners — with different capabilities and knowledge are combined (see Figure 1). The hammer algorithm is designed for this type of cooperative multi-agent environment.

Consider a warehouse setting where many small, simple robots fetch and stock items on shelves, as well as bring them to packing stations. If the local agents could communicate among themselves, they could run a distributed reasoning algorithm, but this would require more sophisticated robots and algorithms. Or, if the observations and actions could be centralized, one very powerful agent could determine the joint action of all the agents, but this would not scale well and would require a very powerful agent. Section 6 will make this more concrete with two multi-agent tasks. Now, assume an additional central agent in the team, which can have a global perspective, unlike the local agents, who have only local observations. Further, the central agent is more powerful — not only does it have access to more information, but it can also communicate messages to all local agents. The local agents can only blindly transmit their observations and actions to the central agent and receive messages in return. In this way, the local agents can simply rely on the communicated messages to get briefed about the other agents in the environment and make informed decisions (actions) accordingly. Consequently, the central agent must learn to encapsulate the information available in its, possibly large, inputs into smaller vectors (messages) to facilitate the learning of the independent learners.

Figure 1: Our cooperative MARL setting: a single global agent sends messages to help multiple independent local agents act in an environment.

Having described an overview of the setting, we can now take a closer look at the inputs, outputs, and roles of the agents in the heterogeneous system. The centralized agent receives a global observation $s \in S$ on every time step and outputs a unique message (equivalently, executes an action) $u_i \in U$ to each of the local agents, where $i$ is the agent identifier. Its global observation $s$ is the union of all the local observations $o_i \in O_i$ and the actions of the independent learners $a_i \in A_i$ — these can either be obtained from the environment or transmitted to it by the local agents at every time step. $u_i$ encodes a message vector that a local agent can use to make better decisions. Local agents receive a partial observation $o_i$ and a private message $u_i$ from the central agent. Based on $o_i$ and $u_i$, at each time step, all $n$ local agents choose their actions simultaneously, forming a joint action ($A_1 \times \ldots \times A_n$) and causing a change in the environment state according to the state transition function $P$.

Upon changing the dynamics of the environment, a reward $r \in R$ — which could be team-based or localized — is sent back to the local agents, using which they must learn how to act. If the messages from the central agent were not useful, local agents could learn to ignore such messages. Every time the central agent communicates a message $u_i$ to a local learner, it receives the same reward as is obtained by that local agent on performing an action $a_i$ in the environment. In other words, the central agent does not directly interact with the environment to get feedback; instead, it learns to output useful messages by looking at how the independent agents performed in the environment using the messages it communicated to them.
In domains with localized rewards for agents, the central agent gets a tangible reward for its messages, whereas in cases with team rewards it must learn from comparatively abstract feedback. In Section 6 we show that hammer generalizes to both reward structures.
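To make this reward flow concrete, the Python sketch below shows how the feedback for each message could be assigned under the two structures; the function name and the list-based bookkeeping are illustrative assumptions, not the paper's implementation.

```python
def central_agent_rewards(local_rewards, team_reward=None):
    """Reward credited to the central agent for each message u_i it sent.

    With localized rewards, the message to agent i receives agent i's own
    reward r_i; with a shared team reward, every message receives the same,
    more abstract team signal instead.
    """
    if team_reward is not None:
        return [team_reward for _ in local_rewards]   # one entry per message sent
    return list(local_rewards)                        # r_i credited to message u_i

# Localized rewards (e.g., multi-agent walker) vs. a shared team reward
# (e.g., cooperative navigation); numbers are placeholders.
print(central_agent_rewards([1.0, -100.0, 1.0]))
print(central_agent_rewards([0.0, 0.0, 0.0], team_reward=-5.0))
```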
4 THE HAMMER ALGORITHM

This section introduces hammer, the Heterogeneous Agents Mastering Messaging to Enhance Reinforcement learning algorithm, designed for the cooperative MARL setting discussed above. There are multiple local and independent agents in an environment, each given tasks. Depending on the domain, they may take discrete or continuous actions to interact with the environment. As stated before, we introduce a single central agent into the environment. This agent is relatively powerful and is capable of 1) obtaining a global view of all the other agents present in the environment and 2) transmitting messages to them. It learns a separate policy and aims to support the local team by communicating short messages to them. It is designed to work with either a global or a local reward structure. Hence, local agents' private observations are augmented with the messages sent to them by the central agent, and they can choose to use or discard these messages while learning their policies.

As described by Algorithm 1, in every iteration the centralized agent receives the union of the private observations of all local agents along with their previous actions (line 3). It encodes its input and outputs an individual message vector for each agent (line 7).
Algorithm 1: hammer
1:  Initialize actor-critic networks for the central agent (CA) and for the independent agents (IA, shared parameters), and two experience replay memory buffers (B and B')
2:  for episode e = 1 to TOTAL_EPISODES do
3:      Fetch combined initial random observations s = [o_1, ..., o_n] from the environment (o_i is agent i's local observation)
4:      Input: Concat(s, a'_1, ..., a'_n) → CA (a'_i is the previous action of agent i, or 0 on the first time step)
5:      for time step t = 1 to TOTAL_STEPS do
6:          for each agent n_i do
7:              Output: message vector u_i ← CA, for agent n_i
8:              Input: Concat(o_i ∈ s, u_i) → IA
9:              Output: local action a_i ← IA
10:         end
11:         Perform the sampled actions in the environment and get the next set of observations s' and rewards r_i for each agent
12:         Add experiences to B and B' for CA and IA, respectively
13:         if update interval reached then
14:             Sample random minibatches b ∈ B and b' ∈ B'
15:             Train CA on b and IA on b' using stochastic policy gradients
16:         end
17:     end
18: end

These messages from the central agent are sent to the independent learners, augmenting their private partial observations obtained from the environment (line 8). The learners then output actions (line 9) affecting the environment (line 11). The reward for the joint action performed in the environment is returned (line 11) and is utilized as feedback by all the agents to adjust their parameters and learn private policies (lines 14–15).

The independent learners follow the previously proposed idea of a single network with shared parameters, allowing a single policy to be learned from the experiences of all the agents simultaneously [8, 12]. Parameter sharing has been successfully applied in several other multi-agent deep reinforcement learning works [10, 34, 36, 46, 48]. Learning this way is centralized; however, execution can be decentralized by making a copy of the policy and using it individually in each of the agents. Furthermore, in cooperative multi-agent settings, concurrent learning often does not scale with the number of agents and can make the environment dynamics non-stationary, hence difficult to learn [12]. In this implementation of hammer, a single policy is learned for the local agents via the centralized learning, decentralized execution setting.

Note that while we focus on parameter sharing for the independent agents and on PPO for learning the policies, hammer can be implemented with other MADRL algorithms. Moreover, even though we test with a single central agent in this paper, it is entirely possible that multiple central agents could even better assist independent learners.
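The control flow of Algorithm 1 can be sketched in a few lines of Python. The sketch below uses placeholder policies and a dummy environment in place of the PPO actor-critic networks and the real tasks; all dimensions, function names, and the (omitted) update rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, OBS_DIM, MSG_DIM, N_ACTIONS = 3, 6, 4, 5
TOTAL_EPISODES, TOTAL_STEPS, UPDATE_INTERVAL = 2, 25, 50

def central_policy(global_input):
    """Stand-in for the central agent's actor: one message per local agent,
    squashed to [-1, 1] as a tanh output head would do."""
    return np.tanh(rng.normal(size=(N_AGENTS, MSG_DIM)))

def local_policy(obs_and_msg):
    """Stand-in for the shared local actor: sample a discrete action."""
    return int(rng.integers(N_ACTIONS))

def env_reset():
    return rng.normal(size=(N_AGENTS, OBS_DIM))

def env_step(actions):
    """Dummy transition: next local observations and per-agent rewards."""
    return rng.normal(size=(N_AGENTS, OBS_DIM)), rng.normal(size=N_AGENTS)

buffer_central, buffer_local = [], []            # replay buffers B and B'

for episode in range(TOTAL_EPISODES):
    obs = env_reset()                            # Alg. 1, line 3
    prev_actions = np.zeros(N_AGENTS)            # 0 on the first time step (line 4)
    for t in range(TOTAL_STEPS):
        global_input = np.concatenate([obs.ravel(), prev_actions])
        messages = central_policy(global_input)                      # line 7
        actions = [local_policy(np.concatenate([obs[i], messages[i]]))
                   for i in range(N_AGENTS)]                         # lines 8-9
        next_obs, rewards = env_step(actions)                        # line 11
        for i in range(N_AGENTS):                                    # line 12
            buffer_central.append((global_input, messages[i], rewards[i]))
            buffer_local.append((obs[i], messages[i], actions[i], rewards[i]))
        if len(buffer_local) >= UPDATE_INTERVAL:                     # lines 13-15
            # PPO updates of both actor-critic pairs would happen here,
            # on minibatches sampled from B and B'; omitted in this sketch.
            buffer_central.clear()
            buffer_local.clear()
        obs, prev_actions = next_obs, np.asarray(actions, dtype=float)
```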
5 EXPERIMENTAL ENVIRONMENTS

This section details the two multi-agent environments that we use to evaluate hammer. In addition to releasing our code after acceptance, we fully detail our approach so that results are replicable.

Cooperative navigation is one of the Multi-Agent Particle Environments [25]. It is a two-dimensional cooperative multi-agent task with a continuous observation space and a discrete action space, consisting of $n$ movable agents and $n$ fixed landmarks. Figure 2 shows the case for $n = 3$. Agents occupy physical space (i.e., they are not point masses), perform physical actions in the environment, and have the same action and observation spaces. The agents must learn to cover all the landmarks while avoiding collisions, without explicit communication.

Figure 2: The cooperative navigation environment is composed of blue agents and black (stationary) landmarks the agents must cover, while avoiding collisions.

The global reward signal, seen by all agents, is based on the proximity of any agent to each landmark, and the agents are penalized for colliding with each other. The team reward can be defined by the equation
$$R = -\Big[ \sum_{n=1,\, l=1}^{N,\, L} \min\big(dist(a_n, l)\big) \Big] - c,$$
where $N$ is the number of agents and $L$ is the number of landmarks in the environment. The function $dist(\cdot)$ calculates the distance in terms of the agent's and landmark's $(x_i, y_i)$ positions in the environment. $c$ is the number of collisions among the agents; a penalty of 1 is subtracted each time two agents collide. The action space is discrete, corresponding to moving in the four cardinal directions or remaining motionless. Each agent has a fixed reference frame, and its observation can include the relative positions of other agents and landmarks if they are within the frame. Note that the local observations do not convey the velocity (movement direction) of other agents. Consistent with past work, the initial positions of all the agents and the landmark locations are randomized at the start of each episode, and the episode ends after 25 time steps.

To test hammer, we modify this task so that a centralized agent receives the union of local agents' observations and has access to all their actions at every time step. To test hammer's scalability, we also change the number of agents and landmarks in the environment. We are most interested in this environment because of the motivating robotic warehouse example.
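For reference, the following Python sketch implements one common reading of the team reward above, in which each landmark is credited with its nearest agent (as in the standard MPE cooperative-navigation task); the collision radius is an assumed constant not specified in the text.

```python
import numpy as np

def team_reward(agent_pos, landmark_pos, collision_radius=0.15):
    """Team reward: penalize each landmark's distance to its closest agent,
    and subtract 1 for every colliding agent pair.

    agent_pos:    (N, 2) array of agent (x, y) positions.
    landmark_pos: (L, 2) array of landmark (x, y) positions.
    """
    # dist[n, l] = Euclidean distance between agent n and landmark l.
    dist = np.linalg.norm(agent_pos[:, None, :] - landmark_pos[None, :, :], axis=-1)
    coverage_penalty = dist.min(axis=0).sum()       # nearest agent per landmark

    # c = number of colliding agent pairs (assumed collision radius).
    pair_dist = np.linalg.norm(agent_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    n = len(agent_pos)
    collisions = ((pair_dist < collision_radius) & ~np.eye(n, dtype=bool)).sum() // 2

    return -coverage_penalty - collisions

print(team_reward(np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.0]]),
                  np.array([[0.0, 0.1], [1.0, 0.9], [0.4, 0.1]])))
```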
Multi-agent walker is a more complex continuous-control benchmark locomotion task [12]. A package is placed on top of $n$ pairs of robot legs, which can be controlled. The agents must learn to move the package as far as possible to the right, towards a destination, without dropping it. The package is large enough (it stretches across all of the walkers) that the agents must cooperate. The environment demands high inter-agent coordination to successfully navigate a complex terrain while keeping the package balanced. This environment supports both team and individual rewards, but we focus on the latter case to be less similar to the cooperative navigation task. Each walker is given a reward of -100 if it falls, and all walkers receive a reward of -100 if the package falls. Throughout the episode, each walker is given an additional reward of +1 for every step taken. However, there is no supplemental reward for reaching the destination. By default, the episode ends if any walker falls, the package falls, 500 time steps elapse, or the package successfully reaches the destination. Each agent receives 32 real-valued numbers, representing noisy Lidar measurements of the terrain and displacement information related to the neighbouring agents. The action space is a 4-dimensional continuous vector, representing the torques of the two joints in each of the walker's legs. Figure 3 illustrates the environment.

Figure 3: In the multi-agent walker environment, multiple robots work together to act in continuous action spaces to transport a package over varying terrain to a destination without falling.

In both domains, two networks were used — one for the central agent and one for the independent agents. The parameters were shared among the independent agents. PPO was used to train all the networks with policy gradients. For the actor and critic networks of both the central and individual agents, two fully-connected multi-layer perceptron layers were used to process the input and to produce the output from the hidden state. The central agent's actor network outputs a vector of floating-point values scaled to $[-1, 1]$ and is consistent across domains. A tanh activation function was used everywhere, with the exception of the local agents' outputs. Local agents in cooperative navigation used a softmax over the output, corresponding to the probability of executing each of the five discrete actions. Local agents in the multi-agent walker task used 4 output nodes, each of which modeled a multivariate Gaussian distribution over torques.
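A minimal PyTorch sketch of the actor heads described above follows. The message lengths (4 and 8), the 18-dimensional global observation, and the 32-dimensional walker observation come from the paper; the hidden width, the per-agent navigation observation size, and the encoding of actions in the central agent's input are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sizes (see lead-in): only MSG_NAV, MSG_WALKER, the 18-dim global
# observation, and the 32-dim walker observation are taken from the paper.
HIDDEN, MSG_NAV, MSG_WALKER = 64, 4, 8
GLOBAL_IN, NAV_OBS, WALKER_OBS = 18 + 3, 6, 32
NAV_ACTIONS, WALKER_TORQUES = 5, 4

def two_layer_mlp(in_dim, out_dim):
    """Two fully connected layers with a tanh hidden activation, as described above."""
    return nn.Sequential(nn.Linear(in_dim, HIDDEN), nn.Tanh(),
                         nn.Linear(HIDDEN, out_dim))

# Central agent's actor: a message vector squashed to [-1, 1] by tanh.
central_actor = nn.Sequential(two_layer_mlp(GLOBAL_IN, MSG_NAV), nn.Tanh())

# Local actor for cooperative navigation: softmax over the 5 discrete actions.
nav_actor = nn.Sequential(two_layer_mlp(NAV_OBS + MSG_NAV, NAV_ACTIONS),
                          nn.Softmax(dim=-1))

# Local actor for multi-agent walker: 4 outputs parameterizing the means of a
# Gaussian over joint torques (a learned log-std would accompany them).
walker_actor = two_layer_mlp(WALKER_OBS + MSG_WALKER, WALKER_TORQUES)

print(central_actor(torch.zeros(1, GLOBAL_IN)).shape,
      nav_actor(torch.zeros(1, NAV_OBS + MSG_NAV)).shape,
      walker_actor(torch.zeros(1, WALKER_OBS + MSG_WALKER)).shape)
```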
In the interest of experiment time, we limit trials to 30,000 episodes in cooperative navigation and 10,000 in multi-agent walker. We set $\gamma = 0.95$ in all our experiments. We swept over six learning rates for PPO and used the best-performing values, one for the centralized agent and one for the independent agents, consistent across both domains. Training batch sizes were tuned to 2000 and 4000, respectively, for the central and independent agents in both domains. Five independent trials with different random seeds were run for both environments to establish statistical significance. The clip parameter for PPO was set to 0.2. Empirically, we found that message lengths of 4 and 8 performed well in the navigation and multi-agent walker tasks, respectively.

6 RESULTS

This section describes the experiments conducted to test hammer's ability to encapsulate and communicate learned messages, speed up independent learning, and scale, on the two environments — cooperative navigation and multi-agent walker — described in the previous section. All curves are averaged over five independent trials (except for the ablative study comparing the performance of hammer with different message lengths, where three independent trials were conducted).
We investigated the learning of the independent learners in the cooperative navigation environment in detail, under several different conditions.
Figure 4: hammer agents outperform independent PPO learners in cooperative navigation. Using a similar framework as hammer, but providing the local agents with random messages, causes degraded performance (as expected). hammer also significantly outperforms a centrally learned policy for the independent agents.
Learning curves used for evaluation plot the average reward per local agent as a function of episodes. To smooth the curves for readability, a rolling mean with a window size of 500 episodes was used in each case.

First, we let the local agents learn independently, without any aid from other sources. The corresponding curve (red), shown in Figure 4, acts as our baseline for evaluating the learning curves obtained when the learners are equipped with additional messages. As described earlier, the experimental setup is consistent with earlier work [12].

Second, we used a central agent to learn messages, as described by hammer, and communicated them to the local learners to see whether learning improved. hammer performed significantly better than the previous case over a training period of 30,000 episodes, as can be seen in Figure 4. From this experiment, the results show that: (1) hammer agents were able to learn much faster than independent local agents, and (2) this success implies that the central agent was able to learn to produce useful, smaller messages. (Recall that the total global observation vector has 18 real-valued numbers, and our message uses only 4.)

To help evaluate the communication quality, random messages of the same length were generated and communicated to the independent agents, to test whether the central agent was indeed learning relevant or useful messages. As expected, random messages induce a larger observation space with meaningless values and degrade the performance of the independent learners (see Figure 4). This also supports the claim that the central agent is learning much better messages to communicate (rather than sending random values), as it outperforms the independent learning of agents provided with random messages.

To confirm that the central agent is not simply forwarding a compressed version of the global information vector to all the agents, we let the independent agents learn in a fully centralized manner, using a joint observation space. The performance drops drastically in this case (Figure 4). This suggests that the central agent in hammer is learning to encapsulate only partial but relevant information as messages and communicates them to facilitate the learning of the local agents. Complete information would have been too overwhelming for the localized agents to learn how to act, and hence would have slowed their progress, as is clear from the curve (green) shown in Figure 4. It has been shown earlier that this approach does not perform well in domains like pursuit, water world, and multi-agent walker [12]. We confirm the same in the cooperative navigation environment for N=3 agents as well.

To test scalability, we also tried N=5. The average reward per agent obtained within the training period remains lower than in the case of N=3. However, hammer still outperforms independent learning (see Figure 5). This suggests that hammer scales to a larger number of agents.

Figure 5: Performance of 5 local agents in the cooperative navigation task; the y-axis shows the average reward earned per agent. hammer outperforms independent learners with a larger number of agents too, suggesting that hammer is scalable.
Empirically, we found that when increasing the number of agents from three to five, performance improved when the length of the central agent's messages was increased from four to eight. We speculate that with more information made accessible to the central agent, hammer required a broader bandwidth for encapsulating it.

The final ablation study compared the performance of hammer with different message lengths in cooperative navigation. Message lengths that were too small (likely unable to include enough information to be useful) or too large (which likely increased the agent's state space too much) performed worse (see Figure 6). However, it is interesting to note that changing the message length did not hurt the performance of the independent learners in the way that feeding random messages did (Figure 4). We speculate that the central agent attempted to extract and encapsulate relevant messages out of the global knowledge but, in the cases with less capacity, was only partially successful. If the messages had been close to random values, they would have harmed the learning of the independent agents (Figure 4). Larger messages would have included extra information along with the relevant pieces but, being large, would have masked the private observations of the local agents, hence slowing their progress.

Figure 6: hammer-n corresponds to using a message length of n for hammer. The plots show that the length of the learned message influences the performance of hammer on cooperative navigation, but that even poorly-tuned message lengths still allow hammer to perform at least as well as independent PPO agents. (This graph is smoothed using a larger moving-average window — 2000 episodes — for readability.)

In summary, hammer successfully summarizes relevant global knowledge into small messages to the individual learners and outperforms the other cases: (1) unaided independent learning, (2) independent agents supplied with random messages, and (3) fully centralized training of independent agents. Moreover, hammer scales to a larger number of agents in the environment.
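As a concrete note on the random-message ablation referenced in the summary above, the sketch below shows the kind of drop-in substitution such a baseline uses; the uniform sampling range is an assumption (chosen to match the tanh-scaled learned messages), since the text does not specify the distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_messages(n_agents, msg_dim):
    """Ablation baseline: replace the central agent's learned messages with
    noise of the same length, so the local agents' input size is unchanged
    but the extra values carry no information (the degraded curve in Fig. 4).
    Uniform noise in [-1, 1] is an assumption matching the tanh output range."""
    return rng.uniform(-1.0, 1.0, size=(n_agents, msg_dim))

# Drop-in replacement for the learned messages in the training-loop sketch.
messages = random_messages(n_agents=3, msg_dim=4)
print(messages.shape)
```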
This section shows that hammer also works in a multi-agent task with continuous control and individual rewards. Figure 7 shows that hammer performs substantially better than unaided independent local agents. As in the cooperative navigation task, results here are averaged over five independent trials, with an additional 500-episode moving window to increase readability. Additional results in Figure 8 confirm that hammer outperforms independent learning in the case of 5 walkers. In the case of N=3 walkers, we used a message length of 8 for the central agent, whereas an increased length of 10 showed better performance in the N=5 walker scenario. We did not include a fully centralized comparison, as Gupta et al. [12] already show that independent learners outperform a centralized policy in the multi-agent walker domain.

Figure 7: hammer significantly improves performance over independent local agents in the multi-agent walker task.

Figure 8: hammer outperforms independent local agents when increasing from 3 to 5 local agents in multi-agent walker.

hammer performing better in a domain like multi-agent walker confirms that the approach generalizes to continuous action spaces and different reward structures. Positive results when increasing the number of agents further support the claim that hammer is scalable.

Results comparing the average performance and standard errors of hammer and independent learning (IL) are shown in Table 1. The results for the two domains in this section show that (1) heterogeneous agents successfully learn messaging and enable multi-level coordination among the independent agents to enhance reinforcement learning using the hammer approach, (2) hammer scales to a larger number of agents, (3) the approach works well in both discrete and continuous action spaces, and (4) the approach performs well with both individual rewards and global team rewards.
Table 1: Average performance and standard errors in different settings show that hammer does outperform independent learning (IL) agents, but that in some cases the differences are unlikely to be statistically significant.
Environment              Agents    hammer        IL
Cooperative Navigation   N=3       -61.52 ± –    – ± –
Cooperative Navigation   N=5       – ± –         – ± –
Multi-Agent Walker       N=3       – ± –         – ± –
Multi-Agent Walker       N=5       – ± –         – ± –

7 CONCLUSION

This paper presented hammer, an approach for achieving multi-level coordination among heterogeneous agents by learning useful messages from a global view of the environment or the multi-agent system. Using hammer, we addressed challenges like non-stationarity and inefficient coordination among agents in MARL by introducing a powerful, all-seeing central agent in addition to the independent learners in the environment. Our results in two domains, cooperative navigation and multi-agent walker, showed that hammer generalizes to discrete and continuous action spaces with both global team rewards and localized personal rewards. hammer also proved to be scalable by outperforming unaided independent learners in cases with more agents in both domains. We believe that the key reasons for the success of hammer were two-fold. First, we leveraged additional global information, such as global states, actions, and rewards in the system. Second, centralizing observations has its own benefits, including helping to bypass problems like non-stationarity in multi-agent systems and avoiding getting stuck in local optima [14, 56]. Several related works (discussed earlier) demand extremely powerful agents in the environment — ones that can transmit to other agents and/or are capable of modeling other agents in the system. This might not always be feasible, motivating hammer. Only one central agent needs to be powerful, while the independent learners can be simple agents interacting with the environment based on their private local observations augmented with learned messages. Warehouse management and traffic light management are good settings that could fit these assumptions well.
FUTURE WORK
There are several directions for future work. First, both domains used in this work involve low-dimensional observation spaces and seem to offer substantial global coverage when the local observations are combined. Evaluating hammer in multi-agent settings with tighter coupling and more complex interactions among agents, such as autonomous driving in SMARTS [60] or heterogeneous multi-agent battles in StarCraft [34], could help better establish the significance of the method. Second, more complex hierarchies could be used, such as making several central agents available in the system. Third, in this work we performed an initial analysis of the message vectors communicated by hammer, but additional work remains to better understand if and how hammer tailors messages for the local agents using its global observation. It would also be interesting to further study how RL learns to encode information in messages and to try to understand what the encoding means. Fourth, the maximum number of agents tested in this work is 5; even though both of the chosen domains are complex multi-agent test-beds, we believe it would be useful to conduct further experiments to justify hammer's scalability. Fifth, even though we emphasize why empirical comparisons with techniques like COMA, VDN, QMIX, and MADDPG are not directly applicable, their goals and those of hammer are closely related, so future work could include them as empirical baselines. Sixth, in our setting communication is free — future work could consider the case where it is costly and attempt to trade off the number of messages sent against the learning speed of hammer. Seventh, we explicitly did not allow inter-agent communication among the local learners. If local agents had the ability to communicate small amounts of data to each other, would a centralized agent still be able to provide a significant improvement?
ACKNOWLEDGMENTS
This work commenced at the Ericsson Research Lab, Bangalore, and most of the follow-up work was done at the International Institute of Information Technology, Bangalore.∗ Part of this work has taken place in the Intelligent Robot Learning (IRL) Lab at the University of Alberta, which is supported in part by research grants from the Alberta Machine Intelligence Institute (Amii), CIFAR, and NSERC. We also would like to thank Shahil Mawjee and the anonymous reviewers for comments and suggestions on earlier versions of this paper.

∗ As part of Nikunj Gupta's Master's Thesis, "Fully Cooperative Multi-Agent Reinforcement Learning".
REFERENCES
[1] Stefano V. Albrecht and Peter Stone. 2018. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence.
[2] Science.
[3] IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38, 2 (2008), 156–172.
[4] Yongcan Cao, Wenwu Yu, Wei Ren, and Guanrong Chen. 2012. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics 9, 1 (2012), 427–438.
[5] Thomas G. Dietterich. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13 (2000), 227–303.
[6] John J. Enright and Peter R. Wurman. 2011. Optimization and coordinated autonomy in mobile fulfillment systems. In Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence. Citeseer.
[7] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems. 2137–2145.
[8] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to Communicate with Deep Multi-Agent Reinforcement Learning. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 2137–2145.
[9] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. 2017. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926 (2017).
[10] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip H. S. Torr, Pushmeet Kohli, and Shimon Whiteson. 2017. Stabilising experience replay for deep multi-agent reinforcement learning. arXiv preprint arXiv:1702.08887 (2017).
[11] Dieter Fox, Wolfram Burgard, Hannes Kruppa, and Sebastian Thrun. 2000. A probabilistic approach to collaborative multi-robot localization. Autonomous Robots 8, 3 (2000), 325–344.
[12] Jayesh K. Gupta, Maxim Egorov, and Mykel Kochenderfer. 2017. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems. Springer, 66–83.
[13] Takayuki Ito, Minjie Zhang, Valentin Robu, Shaheen Fatima, Tokuro Matsuo, and Hirofumi Yamaki. 2010. Innovations in Agent-Based Complex Automated Negotiations. Vol. 319. Springer.
[14] Michael Johanson, Kevin Waugh, Michael Bowling, and Martin Zinkevich. 2011. Accelerating best response calculation in large extensive games. In IJCAI, Vol. 11. 258–265.
[15] Woojun Kim, Myungsik Cho, and Youngchul Sung. 2019. Message-dropout: An efficient training method for multi-agent deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6079–6086.
[16] Saurabh Kumar, Pararth Shah, Dilek Hakkani-Tur, and Larry Heck. 2017. Federated control with hierarchical multi-agent deep reinforcement learning. arXiv preprint arXiv:1712.08266 (2017).
[17] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. 2017. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems. 4190–4203.
[18] Guillaume J. Laurent, Laëtitia Matignon, Le Fort-Piat, et al. 2011. The world of independent learners is not Markovian. International Journal of Knowledge-Based and Intelligent Engineering Systems 15, 1 (2011), 55–64.
[19] Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2016. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182 (2016).
[20] Joel Z. Leibo, Edward Hughes, Marc Lanctot, and Thore Graepel. 2019. Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research. arXiv preprint arXiv:1903.00742 (2019).
[21] Joel Z. Leibo, Julien Perolat, Edward Hughes, Steven Wheelwright, Adam H. Marblestone, Edgar Duéñez-Guzmán, Peter Sunehag, Iain Dunning, and Thore Graepel. 2018. Malthusian reinforcement learning. arXiv preprint arXiv:1812.07019 (2018).
[22] Joel Z. Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. 2017. Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:1702.03037 (2017).
[23] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
[24] Mengqi Liu, Jiachuan Deng, Ming Xu, Xianbo Zhang, and Wei Wang. 2017. Cooperative deep reinforcement learning for traffic signal control. In The 7th International Workshop on Urban Computing (UrbComp 2018).
[25] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. In Neural Information Processing Systems (NIPS).
[26] Laëtitia Matignon, Laurent Jeanpierre, and Abdel-Illah Mouaddib. 2012. Coordinated multi-robot exploration under communication constraints using decentralized Markov decision processes. In AAAI 2012. 2017–2023.
[27] Laetitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. 2012. Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems. (2012).
[28] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[29] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature.
[30] Igor Mordatch and Pieter Abbeel. 2018. Emergence of grounded compositional language in multi-agent populations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[31] Shayegan Omidshafiei, Dong-Ki Kim, Miao Liu, Gerald Tesauro, Matthew Riemer, Christopher Amato, Murray Campbell, and Jonathan P. How. 2019. Learning to teach in cooperative multiagent reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6128–6136.
[32] Liviu Panait and Sean Luke. 2005. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems 11, 3 (2005), 387–434.
[33] James Parker, Ernesto Nunes, Julio Godoy, and Maria Gini. 2016. Exploiting spatial locality and heterogeneity of agents for search and rescue teamwork. Journal of Field Robotics 33, 7 (2016), 877–900.
[34] Peng Peng, Ying Wen, Yaodong Yang, Quan Yuan, Zhenkun Tang, Haitao Long, and Jun Wang. 2017. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069 (2017).
[35] Manisa Pipattanasomporn, Hassan Feroze, and Saifur Rahman. 2009. Multi-agent systems in a distributed smart grid: Design and implementation. IEEE, 1–8.
[36] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. 2018. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485 (2018).
[37] Spyridon Samothrakis, Simon Lucas, Thomas Philip Runarsson, and David Robles. 2012. Coevolving game-playing agents: Measuring performance and intransitivities. IEEE Transactions on Evolutionary Computation 17, 2 (2012), 213–226.
[38] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In International Conference on Machine Learning. 1889–1897.
[39] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
[40] Sven Seuken and Shlomo Zilberstein. 2012. Improved memory-bounded dynamic programming for decentralized POMDPs. arXiv preprint arXiv:1206.5295 (2012).
[41] Junjie Sheng, Xiangfeng Wang, Bo Jin, Junchi Yan, Wenhao Li, Tsung-Hui Chang, Jun Wang, and Hongyuan Zha. 2020. Learning structured communication for multi-agent reinforcement learning. arXiv preprint arXiv:2002.04235 (2020).
[42] Yoav Shoham, Rob Powers, and Trond Grenager. 2007. If multi-agent learning is the answer, what is the question? Artificial Intelligence.
[43] Autonomous Agents and Multi-Agent Systems (Jan 2020).
[44] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature.
[45] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[46] Sainbayar Sukhbaatar, Rob Fergus, et al. 2016. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems. 2244–2252.
[47] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, et al. 2017. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296 (2017).
[48] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinícius Flores Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, et al. 2018. Value-decomposition networks for cooperative multi-agent learning based on team reward. In AAMAS. 2085–2087.
[49] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. 2017. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE 12, 4 (2017), e0172395.
[50] Ming Tan. 1993. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning. 330–337.
[51] Hongyao Tang, Jianye Hao, Tangjie Lv, Yingfeng Chen, Zongzhang Zhang, Hangtian Jia, Chunxu Ren, Yan Zheng, Changjie Fan, and Li Wang. 2018. Hierarchical deep multiagent reinforcement learning. arXiv preprint arXiv:1809.09332 (2018).
[52] Matthew E. Taylor, Nicholas Carboni, Anestis Fachantidis, Ioannis Vlahavas, and Lisa Torrey. 2014. Reinforcement learning agents providing advice in complex video games. Connection Science 26, 1 (2014), 45–63. https://doi.org/10.1080/09540091.2014.885279
[53] Gerald Tesauro. 1995. Temporal difference learning and TD-Gammon. Commun. ACM 38, 3 (1995), 58–68.
[54] Karl Tuyls and Gerhard Weiss. 2012. Multiagent learning: Basics, challenges, and prospects. AI Magazine 33, 3 (2012), 41–41.
[55] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. 2017. FeUdal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161 (2017).
[56] Shimon Whiteson, Brian Tanner, Matthew E. Taylor, and Peter Stone. 2011. Protecting against evaluation overfitting in empirical reinforcement learning. IEEE, 120–127.
[57] Michael Wunder, Michael Littman, and Matthew Stone. 2009. Communication, credibility and negotiation using a cognitive hierarchy model. In AAMAS Workshop, Vol. 19. Citeseer, 73–80.
[58] Wang Ying and Sang Dayong. 2005. Multi-agent framework for third party logistics in E-commerce. Expert Systems with Applications 29, 2 (2005), 431–436.
[59] Mohamed Salah Zaïem and Etienne Bennequin. 2019. Learning to communicate in multi-agent reinforcement learning: A review. arXiv preprint arXiv:1911.05438 (2019).
[60] Ming Zhou, Jun Luo, Julian Villela, Yaodong Yang, David Rusu, Jiayu Miao, Weinan Zhang, Montgomery Alban, Iman Fadakar, Zheng Chen, et al. 2020. SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving. arXiv preprint arXiv:2010.09776 (2020).