Learning Complex Multi-Agent Policies in Presence of an Adversary
Siddharth Ghiya and Katia Sycara

Abstract — In recent years, there has been some outstanding work on applying deep reinforcement learning to multi-agent settings. Often in such multi-agent scenarios, adversaries can be present. We address the requirements of such a setting by implementing a graph-based multi-agent deep reinforcement learning algorithm. In this work, we consider the scenario of multi-agent deception, in which multiple agents need to learn to cooperate and communicate to deceive an adversary. We employ a two-stage learning process to get the cooperating agents to learn such deceptive behaviors. Our experiments show that our approach allows us to employ curriculum learning to increase the number of cooperating agents in the environment and enables a team of agents to learn complex behaviors to successfully deceive an adversary.
Keywords: Multi-agent system, Graph neural network, Reinforcement learning
I. INTRODUCTION

Real world environments often consist of multiple agents which need to either collaborate or compete with each other to perform their tasks successfully. For example, an environment of self driving cars is a multi-agent environment in which multiple autonomous cars need to collaborate and communicate with each other for effective decision making. Multi-agent reinforcement learning (MARL) can be used to train a team of agents in such scenarios by maximising a reward function. However, most of the existing frameworks in this domain focus only on training collaborative policies for multiple homogeneous agents. Our work is specifically focused on the task of learning transferable collaborative policies in the presence of an adversary in the environment. For our experiments, we consider the standard multi-agent task of deception, in which multiple agents need to collaborate with each other in order to deceive an adversary. In such a scenario, designing heuristic guided behaviors for a team of agents is not a trivial task. We adopt the framework of centralised training and decentralised execution, in which only a single neural network is trained using PPO [17]. During testing, however, every agent acts independently in the environment, and the agents need to communicate with each other in order to reach consensus. Through this work, we make the following contributions:

• We modify the graph neural network based multi-agent reinforcement learning framework proposed by [1] to incorporate an opponent observation module. The proposed framework enables a group of collaborative agents to come up with behaviours to successfully deceive an adversary present in the environment.

• We propose a unique two stage training procedure using curriculum learning to enable a group of agents to learn complex cooperative policies in the presence of an adversary in the environment.

*This work was supported by DARPA OFFSET award HR00111820029. Siddharth Ghiya is with the Robotics Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15232, USA ([email protected]). Katia Sycara is with the Robotics Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15232, USA ([email protected]).

II. RELATED WORK

There has been a recent surge in the application of deep learning to reinforcement learning. Researchers have come up with various off-policy [13][10] and on-policy [16][17][12] algorithms and have demonstrated superhuman performance. Multi-agent reinforcement learning is one of the more widely studied topics in the field of reinforcement learning. Independent Q-Learning [20] represents some of the earliest work in this field. In Independent Q-Learning, each agent is trained under the assumption that the other agents are part of the environment. Naturally, this fails as we increase the number of agents, due to the resulting non-stationarity of the environment.

More recently, there has been work in MARL which can be classified under the centralised training and decentralised execution paradigm. [11] proposed to modify the critic to evaluate the value of the next state conditioned on the actions of all the agents in the environment. They reasoned that training such a critic helped to counter the non-stationarity introduced by multiple actors in the environment. [7] pointed out the problem of credit assignment in multi-agent reinforcement learning problems and proposed a method to assign credit to the action taken by an agent in such a setting.
[19] proposed Value Decomposition Networks (VDN) to decompose the team value function into agent-specific value functions. [15] further built on the idea of VDN and proposed QMIX with an additional mixing network. The weights of the mixing network are produced by another set of hyper-networks. They argue that their method represents a richer class of action-value functions. In most of these works, a centralised critic is maintained during the training process which accounts for the non-stationarity in the environment, and hence they fall under the paradigm of centralised training with decentralised execution. In all of these works, no communication is assumed between the agents.

Researchers have also proposed communication between agents in multi-agent systems in order to encourage cooperation. [5] proposed Differentiable Inter-Agent Learning (DIAL) and Reinforced Inter-Agent Learning (RIAL) as communication modules that enable agents to communicate and successfully collaborate with each other. [14] showed the emergence of grounded language communication between agents. [18] proposed CommNet, which used continuous communication for cooperative tasks.

Fig. 1. Each agent receives $V_j$ and $Q_j$ from the other agents in the environment. It then uses its own key $K_i$ to produce the attention it needs to pay to the message it received from agent $j$. It then aggregates over the messages it received from the other agents in the environment. This aggregated embedding is used to produce the action the agent takes in the current time step. This is similar to the architecture used by [1].

Some existing methods do not make explicit assumptions about the types of the other agents in the environment. In this context, there has been work where an agent actively tries to modify the learning behaviour of the opposing agent [6] or employs recursive reasoning to reason about the behaviour of other agents in the environment [21]. [2] used meta-learned policies to enable agents to adapt to different kinds of other agents in the environment. They were able to demonstrate such adaptive behaviours in both cooperative and competitive settings. [8] proposed learning policy representations of other agents in the environment in order to make more informed decisions. [9] developed Deep Reinforcement Opponent Networks (DRON) to model the opposing agents in the environment. [3] provide a detailed survey of the literature on opponent modelling in multi-agent systems.

[4] and [1] proposed communication between the agents using a dot-product attention mechanism. In this work, we build on the framework proposed by [1] and modify it so that a team of agents can collaborate and come up with strategies to deceive an adversary present in the environment.

III. METHOD

In [1], the authors proposed modelling the environment as a graph of entities. In such a graph, nodes represent entities, and the interaction between two entities is represented as an edge. An example is the environment of a self driving vehicle.

Fig. 2. Inter-agent communication in scenarios where only homogeneous agents learn to collaborate with each other. The opponent encoder module is disabled here.
In such a graph, cars, buildings and pedestrians would be modelled as nodes, and their interactions, such as communication between self driving vehicles, would be modelled as edges.

Following a similar approach, in this work we also model the agents and landmarks in the environment as nodes in a graph $G = (V, E)$, where $V$ represents the vertices and $E$ represents the edges. The agents learn to communicate important information about their surroundings to other agents. Since there are two teams in the environment, only agents in the same team are allowed to communicate with each other. All of the agents can observe the static entities in the environment (landmarks). Every agent has four modules: an Agent State Encoder, an Environment State Encoder, an Opponent State Encoder and an Inter-Agent Communication Module. Each of these modules is explained in detail below.

A. Agent State Encoder
Each agent $i \in V$ can observe its own state, which includes its own position and velocity. The agent forms its own state encoding $U_i = f_a(X_i)$ using a learnable differentiable encoder $f_a$.

B. Environment State Encoder
Each agent is also able to observe all the entities in the environment and uses a graph neural network to produce a fixed-size embedding $E_i$. Note that this embedding is independent of the number of entities present in the environment. First, an agent uses an entity encoder function $f_e$ to produce an entity embedding $e_{li} = f_e(X_{li})$, where $X_{li}$ is the position of entity $l \in V$ with respect to agent $i$. The agent then produces a fixed-size embedding $E_i$ by applying a dot-product attention mechanism over the entity embeddings. $E_i$ can be intuitively understood as a representation of the environment of an agent. An important characteristic of this technique is that the final embedding $E_i$ is invariant to the number of entities present in the environment. It is also invariant to the order in which the agent observes the entities.
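To make this step concrete, the following is a minimal PyTorch sketch of an entity encoder followed by dot-product attention pooling. The class name, the use of a single learned query vector, the tanh nonlinearity and the linear form of $f_e$ are our own illustrative assumptions; the paper only specifies that an encoder $f_e$ is followed by dot-product attention pooling into a fixed-size $E_i$.

```python
import torch
import torch.nn as nn

class EntityAttentionPool(nn.Module):
    """Entity encoder f_e followed by dot-product attention pooling.
    The pooled output has a fixed size regardless of how many entities
    are observed, and is invariant to their ordering."""

    def __init__(self, obs_dim: int, embed_dim: int):
        super().__init__()
        self.entity_encoder = nn.Linear(obs_dim, embed_dim)  # f_e (assumed linear)
        self.query = nn.Parameter(torch.randn(embed_dim))    # learned query (assumption)
        self.scale = embed_dim ** 0.5

    def forward(self, entity_obs: torch.Tensor) -> torch.Tensor:
        # entity_obs: (num_entities, obs_dim), relative positions X_li.
        e = torch.tanh(self.entity_encoder(entity_obs))           # e_li, shape (L, D)
        attn = torch.softmax(e @ self.query / self.scale, dim=0)  # weights over entities
        return attn @ e                                           # fixed-size embedding E_i
```

The same pooling pattern is reused for the opponent encoder described next.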
Fig. 3. Inter-agent communication in scenarios with different teams of agents. The opponent encoder module is enabled here.

C. Opponent State Encoder

There can be scenarios where we have different teams of agents in the environment. In such scenarios, we also have an opponent encoder module which similarly takes in the state information of the opponent agents and produces an embedding reflecting the opponent information. Every agent uses an opponent encoder function $f_o$ to produce an opponent embedding $e_{oi} = f_o(X_{oi})$, where $X_{oi}$ is the position of opponent $o \in V$ with respect to agent $i$. After producing an opponent encoding for all the opponents in the environment, a fixed-size embedding $O_i$ is computed by applying a dot-product attention mechanism over the opponent embeddings. Similar to the environment embedding $E_i$, $O_i$ is invariant to the number of opponents in the environment and to the order in which the agent observes them.

We can have scenarios where only homogeneous agents operate in the environment, or scenarios with different teams of agents. In the homogeneous case, every agent is allowed to communicate with every other agent in the environment (Figure 2). On the contrary, if there are different teams of agents in the environment, then agents are only allowed to communicate with other agents in the same team (Figure 3).

D. Inter Agent Communication Module
After computing its state encoding $U_i$, its environment encoding $E_i$ and its opponent encoding $O_i$, an agent uses them to produce a message. It also uses them to choose how much attention it wants to pay to the messages it receives from other agents. First, the agent concatenates $U_i$, $E_i$ and $O_i$ to produce $h_i$. Here, $h_i$ represents an agent's understanding of its own environment.

Fig. 4. Inter-agent communication in scenarios with a heuristic adversary.

Using $h_i$, each agent produces a key $K_i = W_K h_i$, a value $V_i = W_V h_i$ and a query $Q_i = W_Q h_i$. It then sends the computed $V_i$ and $Q_i$ to all the other agents in the environment. It also receives $V_j$ and $Q_j$ for every other agent $j \in V \setminus \{i\}$. It then uses dot-product attention to calculate the attention it needs to pay to the message of agent $j$ at every time step. This method of calculating attention and aggregating the information received from other agents is described in more detail in Figure 1.
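A minimal sketch of one such communication round is given below, assuming the teammates' hidden vectors are stacked into one tensor. The exclusion of self-messages follows $j \in V \setminus \{i\}$ above; the residual update of $h_i$ is our own assumption, since the text only states that the embedding is updated after message passing.

```python
import torch
import torch.nn as nn

class InterAgentComm(nn.Module):
    """One round of dot-product attention message passing between
    teammates: agent i scores each received (Q_j, V_j) pair with its
    own key K_i, then aggregates the values. Weights are shared by
    all agents on a team, as in Section III-D."""

    def __init__(self, hid_dim: int):
        super().__init__()
        self.W_K = nn.Linear(hid_dim, hid_dim, bias=False)
        self.W_V = nn.Linear(hid_dim, hid_dim, bias=False)
        self.W_Q = nn.Linear(hid_dim, hid_dim, bias=False)
        self.scale = hid_dim ** 0.5

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (N, hid_dim), one row per teammate; assumes N >= 2.
        K, V, Q = self.W_K(h), self.W_V(h), self.W_Q(h)
        scores = K @ Q.t() / self.scale                       # scores[i, j] = K_i . Q_j
        mask = torch.eye(h.size(0), dtype=torch.bool)
        scores = scores.masked_fill(mask, float('-inf'))      # drop self-messages
        attn = torch.softmax(scores, dim=-1)                  # attention over teammates j
        m = attn @ V                                          # aggregated messages
        return h + m                                          # updated h_i (residual assumed)
```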
After message passing, every agent updates its embedding $h_i$. Each agent then passes its hidden embedding $h_i$ through another neural network which produces a distribution over the actions that the agent can take. At each time step, all the agents calculate their own actions in a decentralised manner and are given a single reward for their collective set of actions. This reward is then used to train the agents using PPO [17]. One important implementation detail is that we assume that agents in the same team share all the learnable parameters of the agent state encoder network, environment state encoder network, opponent state encoder network, inter-agent communication module and final policy network. This means that our work falls under the paradigm of centralised training and decentralised execution.

IV. EXPERIMENTS

In this work, we focus on the standard swarm robotics task of deception. The deception environment has been implemented in the Multi-Agent Particle Environment [11], where the agents move around in 2D space following a double integrator dynamics model. Every agent produces an action in the form of an acceleration along the X or Y direction; a sketch of these dynamics follows.
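The snippet below illustrates one step of such double-integrator dynamics. The time step, damping and speed limit are illustrative guesses in the style of the Multi-Agent Particle Environment, not constants taken from this paper.

```python
import numpy as np

def double_integrator_step(pos, vel, accel, dt=0.1, damping=0.25, max_speed=1.0):
    """One simulation step: the policy's action is an acceleration in
    the XY plane, integrated into velocity and then into position.
    dt, damping and max_speed are illustrative values."""
    vel = (1.0 - damping) * vel + accel * dt   # integrate acceleration
    speed = np.linalg.norm(vel)
    if speed > max_speed:                      # cap velocity (MPE-style assumption)
        vel = vel / speed * max_speed
    pos = pos + vel * dt                       # integrate velocity
    return pos, vel
```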
A. Environment Description: Deception

In the task of deception, we have a team of N agents trying to protect a high value target among N targets. The agents need to learn to spread out to cover the targets in order to confuse an observing adversary as to which of the targets is the most important. There are two teams of agents in the environment:

• Good Agents: They can observe all the landmarks in the environment and know which landmark is the target landmark. In our work, the good agents collectively learn to collaborate with each other and to come up with strategies to deceive the observing adversary.

• Adversary Agent: In this work, we have only presented results with one adversary agent in the environment. We have tried experiments with both a learning and a heuristic adversary. Essentially, an adversary needs to infer the target location from the motion of the good agents. If learning, the adversary can see all the landmarks but does not know which landmark is the target landmark. If heuristic, the adversary moves towards the landmark with the closest good agent (a sketch of this heuristic follows the landmark descriptions below). In this work, results have been presented with only a heuristic adversary.

Fig. 5. This figure provides a visual representation of the deception environment. In this image, the green colored dot represents the target landmark, the black colored dots represent the other landmarks, the blue colored dots represent the good agents, and the red colored dot represents the adversary agent. If we have a heuristic adversary, it tries to move towards the landmark with the closest good agent, i.e. the landmark corresponding to $\min(d_1, d_2, d_3)$. Notice that the good agents know which landmark is green, while the adversary agent needs to infer this information by observing the good agents.

Apart from the agents in the environment, we also have landmarks as entities in the environment. The kinds of landmarks in the environment are as follows:

• Target Landmark: Target landmarks are the high value landmarks in the environment. The good agents in the deception environment can observe all the landmarks and know which landmark is the target landmark. Adversary agents do not know which landmarks are the target landmarks. The adversary agent needs to infer the target landmark by observing the good agents, while the good agents need to come up with strategies to deceive the adversary.

• Non Target Landmark: Non target landmarks are the other landmarks in the environment. Both the good agents and the adversary agent can observe all the landmarks in the environment.
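The heuristic adversary's rule can be stated compactly: it heads for the landmark whose nearest good agent is closest. A minimal sketch, with hypothetical array inputs:

```python
import numpy as np

def heuristic_adversary_goal(landmarks, good_agents):
    """Returns the landmark the heuristic adversary moves towards: the
    one whose distance to its nearest good agent is smallest, i.e. the
    landmark achieving min(d_1, ..., d_N) in Figure 5.

    landmarks:   (N, 2) array of landmark positions.
    good_agents: (M, 2) array of good agent positions.
    """
    # d[l] = distance from landmark l to the good agent closest to it.
    d = np.linalg.norm(landmarks[:, None, :] - good_agents[None, :, :], axis=-1).min(axis=1)
    return landmarks[int(np.argmin(d))]
```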
B. Reward Description
In this section, we describe the reward formulations used in our experiments. Reward formulation in reinforcement learning should be done according to the behaviour which we want the learned agents to perform. Below, we describe the different reward configurations and how we used them to obtain the desired behaviour. The rewards we use are based on distance and are
continuous. Since the reward configuration is continuous and given at every time step, learning different behaviours becomes easier for the agents. We employed curriculum learning to make the agents learn complex deceptive behaviours, which is explained in more detail in the next subsection.

TABLE I
WEIGHTS USED TO TRAIN THE GOOD AGENTS

INDEX   COVERAGE WEIGHT   DECEPTION WEIGHT
1       0.9               0.1
2       0.8               0.2
3       0.7               0.3
4       0.6               0.4

We describe the different reward configurations in more detail below; a code sketch of both rewards follows the list:

• Coverage Reward: This reward is used by the good agents to learn how to collaboratively cover all the landmarks in the environment. The agents are rewarded on the basis of the bipartite graph matching distance between the set of agents and the set of landmarks. In other words, agents are given a higher reward if they successfully cover the landmarks with a 1:1 matching and a lower reward if they are unable to do so.

• Deception Reward: This is the reward which is used to actually learn deceptive behaviours. The reward given to the good agents is different from the reward given to the adversary. The good agents are rewarded on the basis of how close the adversary is to the target landmark. They are given a higher reward if the adversary is far away from the target and a lower reward if it comes near the target landmark. The adversary's reward configuration is the exact opposite of that of the good agents. We give a higher reward to the adversary if it gets nearer to the target landmark and a lower reward if it is unable to do so. Please note that this reward is only given to the adversary if it is learning. If we use a heuristic adversary, this reward configuration is used only for the good agents and the adversary is not given any reward.
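The following sketch shows plausible shapes for these two rewards. The 1:1 matching is computed with the Hungarian algorithm via SciPy; the exact scaling and any constant offsets used in the actual experiments are not specified in the paper, so treat these as illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def coverage_reward(agent_pos, landmark_pos):
    """Negative cost of the optimal 1:1 (bipartite) matching between
    agents and landmarks: higher when every landmark is covered."""
    cost = np.linalg.norm(agent_pos[:, None, :] - landmark_pos[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)   # Hungarian matching
    return -cost[rows, cols].sum()

def deception_reward(adversary_pos, target_pos):
    """Higher when the adversary is far from the target landmark; a
    learning adversary would receive the negation of this quantity."""
    return float(np.linalg.norm(adversary_pos - target_pos))
```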
C. Curriculum Learning
We performed some preliminary experiments using only the deception reward described in the previous section. However, we noticed that the agents did not converge to any reasonable behaviour. We therefore propose a two step training process. We first train the good agents using the coverage reward. After they learn to cover all the landmarks with a high success rate, we gradually introduce a weighted deception reward. We observe in our experiments that the good agent assigned to the target landmark learns to stay a little away from the landmark, instead of covering it, in order to deceive the adversary. We performed a sensitivity analysis by varying the weights of deception and coverage, and the results are provided in the next section. A sketch of the resulting two-stage reward follows.
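Putting the two stages together, the training reward could be scheduled as below, reusing coverage_reward and deception_reward from the previous sketch. The weights come from Table I and sum to one; the exact criterion for switching stages (phrased here as an externally supplied stage flag) is described only qualitatively in the text.

```python
def team_reward(agent_pos, landmark_pos, adversary_pos, target_pos,
                stage, w_cov=0.9, w_dec=0.1):
    """Two-stage curriculum reward for the good agents.

    Stage 1: coverage only, so the team first learns 1:1 landmark coverage.
    Stage 2: weighted sum of coverage and deception (weights from Table I).
    """
    r_cov = coverage_reward(agent_pos, landmark_pos)
    if stage == 1:
        return r_cov
    return w_cov * r_cov + w_dec * deception_reward(adversary_pos, target_pos)
```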
TABLE II
SENSITIVITY ANALYSIS FOR 2 GOOD AGENTS VS ADVERSARY AGENT

                     GOOD AGENTS                        ADVERSARY AGENT
REWARD WEIGHT        BIPARTITE DIST.  DIST. THRESHOLD   TARGET DIST.  DIST. THRESHOLD  TARGET SELECT
0.9 cov, 0.1 dec     0.12 ± ...       ...               ...           ...              ...
0.8 cov, 0.2 dec     ...              ...               ...           ...              ...
0.7 cov, 0.3 dec     ...              ...               ...           ...              ...
0.6 cov, 0.4 dec     ...              ...               ...           ...              ...

TABLE III
SENSITIVITY ANALYSIS FOR 3 GOOD AGENTS VS ADVERSARY AGENT

                     GOOD AGENTS                        ADVERSARY AGENT
REWARD WEIGHT        BIPARTITE DIST.  DIST. THRESHOLD   TARGET DIST.  DIST. THRESHOLD  TARGET SELECT
0.9 cov, 0.1 dec     0.11 ± ...       ...               ...           ...              ...
0.8 cov, 0.2 dec     ...              ...               ...           ...              ...
0.7 cov, 0.3 dec     ...              ...               ...           ...              ...
0.6 cov, 0.4 dec     ...              ...               ...           ...              ...

After training the good agents with the coverage reward, we train the agents with a weighted sum of the deception and coverage rewards. We ran experiments with the weights presented in Table I.

V. RESULTS AND DISCUSSION

We performed experiments using the methods described above and evaluated their performance with 2 good agents versus a heuristic adversary and with 3 good agents versus a heuristic adversary. This was done through a sensitivity analysis over different weights of the deception and coverage rewards.
A. Evaluation Metrics
We evaluated performance using five different metrics: bipartite distance and distance threshold for the good agents, and distance from the target landmark, distance threshold and target selected for the adversary agent. These metrics are explained in more detail below; a code sketch of the per-episode computation follows the list.

• Good Agents: The following metrics capture the behaviour displayed by the good agents:

1) Bipartite Distance: We calculate the mean bipartite distance between the good agents and the landmarks during an episode. We then calculate the mean of these distances over 30 episodes. A lower mean bipartite distance means that the good agents are covering the targets properly.

2) Distance Threshold: We calculate the number of steps during an episode for which the bipartite distance of the good agents is within a threshold of 0.1. We then calculate the mean of these step counts over 30 episodes.

• Adversary Agent: The following metrics capture the behaviour displayed by the adversary agent:

1) Distance from Target: We calculate the mean of the distance between the adversary and the target landmark during an episode. We then calculate the mean of these distances over 30 episodes.

2) Distance Threshold: We calculate the number of steps during an episode for which the distance of the adversary from the target is within a threshold distance of 0.1. We then calculate the mean of these step counts over 30 episodes.
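A sketch of the per-episode metric computation is shown below. The inputs are assumed to be per-step distance traces recorded during a rollout; the numbers reported in Tables II and III would then be means of these quantities over 30 such episodes.

```python
import numpy as np

def episode_metrics(bipartite_dists, adv_target_dists, threshold=0.1):
    """Computes the Section V-A metrics for a single episode from
    per-step traces of the two distances (assumed inputs)."""
    bd = np.asarray(bipartite_dists)   # good agents' bipartite distance per step
    ad = np.asarray(adv_target_dists)  # adversary-to-target distance per step
    return {
        "bipartite_dist": bd.mean(),
        "good_dist_threshold": int((bd < threshold).sum()),   # steps within 0.1
        "target_dist": ad.mean(),
        "adv_dist_threshold": int((ad < threshold).sum()),    # steps within 0.1
    }
```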
B. Results
We performed a sensitivity analysis on 2 good agents versus a heuristic adversary and on 3 good agents versus a heuristic adversary. The corresponding results are shown in Table II and Table III. We can notice that, as we increase the deception reward weight, in both cases the number of times the target is selected by the adversary decreases. We can also observe that the number of time steps for which the adversary agent comes near the target landmark decreases with the increase in the deception reward weight. All of this suggests that as we increase the weight of the deception reward, the success rate of the good agents in deceiving the adversary agent also increases. We also tried increasing the deception reward weight to 0.5, but in that case the good agents failed to converge to a reasonable behaviour.

One more interesting observation from the results is that the bipartite distance between the good agents and the landmarks also increases as we increase the deception weight. This makes sense because, as the deception weight increases, the good agent assigned to the target landmark in every episode learns to stay at some distance away from the target landmark. This result can also be noticed visually.
C. Future Work
In this work, we proposed augmenting the multi-agent reinforcement learning framework proposed by [1] with an opponent encoder module for learning multi-agent policies in the presence of an adversary in the environment. We performed a two step training process with different reward configurations to train a team of agents to deceive an adversary in the environment.

Only preliminary results for different reward configurations against a heuristic adversary have been presented in this work. A good future direction would be to investigate other reward configurations and to perform experiments with a learning adversary. The proposed framework can also be extended to learn multi-agent policies against a team of heuristic or learning adversaries.

ACKNOWLEDGMENT

I would like to thank my wonderful set of collaborators Akshay Sharma, Dana Hughes, Fan Jia, Keitaro Nishimura, Sumit Kumar, Swaminathan Gurumurthy and Yunfei Shi for supporting me and for taking time out of their busy schedules for discussions about my work. This work has been funded in part by DARPA OFFSET award HR00111820029.
REFERENCES

[1] A. Agarwal, S. Kumar, and K. Sycara, “Learning transferable cooperative behavior in multi-agent teams,” 2019.
[2] M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel, “Continuous adaptation via meta-learning in nonstationary and competitive environments,” 2017.
[3] S. V. Albrecht and P. Stone, “Autonomous agents modelling other agents: A comprehensive survey and open problems,” Artificial Intelligence, vol. 258, pp. 66–95, May 2018. [Online]. Available: http://dx.doi.org/10.1016/j.artint.2018.01.002
[4] A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau, “TarMAC: Targeted multi-agent communication,” 2018.
[5] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate with deep multi-agent reinforcement learning,” 2016.
[6] J. N. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch, “Learning with opponent-learning awareness,” 2017.
[7] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” CoRR, vol. abs/1705.08926, 2017. [Online]. Available: http://arxiv.org/abs/1705.08926
[8] A. Grover, M. Al-Shedivat, J. K. Gupta, Y. Burda, and H. Edwards, “Learning policy representations in multiagent systems,” 2018.
[9] H. He, J. Boyd-Graber, K. Kwok, and H. Daumé III, “Opponent modeling in deep reinforcement learning,” 2016.
[10] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” 2015.
[11] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” CoRR, vol. abs/1706.02275, 2017. [Online]. Available: http://arxiv.org/abs/1706.02275
[12] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” 2016.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” 2013.
[14] I. Mordatch and P. Abbeel, “Emergence of grounded compositional language in multi-agent populations,” 2017.
[15] T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. N. Foerster, and S. Whiteson, “QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning,” CoRR, vol. abs/1803.11485, 2018. [Online]. Available: http://arxiv.org/abs/1803.11485
[16] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” 2015.
[17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017.
[18] S. Sukhbaatar, A. Szlam, and R. Fergus, “Learning multiagent communication with backpropagation,” 2016.
[19] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. F. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and T. Graepel, “Value-decomposition networks for cooperative multi-agent learning,” CoRR, vol. abs/1706.05296, 2017. [Online]. Available: http://arxiv.org/abs/1706.05296
[20] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, “Multiagent cooperation and competition with deep reinforcement learning,” CoRR, vol. abs/1511.08779, 2015. [Online]. Available: http://arxiv.org/abs/1511.08779
[21] Y. Wen, Y. Yang, R. Luo, J. Wang, and W. Pan, “Probabilistic recursive reasoning for multi-agent reinforcement learning,” in International Conference on Learning Representations (ICLR), 2019.