A perspective on multi-agent communication for information fusion
Homagni Saha
Department of Mechanical Engineering, Iowa State University, Ames, IA 50011
[email protected]

Vijay Venkataraman
Honeywell Aerospace, Plymouth, MN 55441

Alberto Speranzon
Honeywell Aerospace, Plymouth, MN 55441

Soumik Sarkar
Department of Mechanical Engineering, Iowa State University, Ames, IA 50011
[email protected]
Abstract
Collaborative decision making in multi-agent systems typically requires a predefined communication protocol among agents. Usually, agent-level observations are locally processed and information is exchanged using the predefined protocol, enabling the team to perform more efficiently than each agent operating in isolation. In this work, we consider the situation where agents with complementary sensing modalities must cooperate to achieve a common goal/task by learning an efficient communication protocol. We frame the problem within an actor-critic scheme, where the agents learn optimal policies in a centralized fashion while taking actions in a distributed manner. We provide an interpretation of the emergent communication between the agents. We observe that the information exchanged is not just an encoding of the raw sensor data but is, rather, a specific set of directive actions that depend on the overall task. Simulation results demonstrate the interpretability of the learnt communication in a variety of tasks.
In this paper, we analyze communication protocols learnt by a team of agents equipped with complementary sensor modalities and tasked with a common goal. We call this "task-based multi-modal decision making", wherein agents learn to map their sensor measurements, and the information communicated by other agents, directly into actions based on the common goal. In this setting, each agent has access only to its own sensor data but needs to rely on communication with the other agents to obtain task-relevant information from those agents' sensor modalities. We present a way to interpret the emergent communication by visualizing this mapping into the agent's action space. We find that the communication that is learnt, within a reinforcement learning paradigm, is not only emergent, but is task dependent and adaptive to the size of the communication channel.

In relation to the existing literature on learning for multi-agent systems, our work borrows from the general framework of "Markov games" proposed in Lowe et al. [2017], Mordatch and Abbeel [2018], Foerster et al. [2016] and references therein. In particular, we build on Lowe et al. [2017] for our learning problem. Related to the emergent communication aspect, central to this work, we consider ideas from the literature on "multi-agent referential games" (Golland et al. [2010], Andreas and Klein [2016], Evtimova et al. [2017], Lazaridou et al. [2018]), where a (sender) agent communicates highly structured information (images and text) to a (receiver) agent, which has to interpret what the other agent saw. However, here we are interested in the evolution of communication for
unstructured data under joint interactions using an actor-critic algorithm. Emergent communication was also studied in Kottur et al. [2017] and Cao et al. [2018]; however, we focus our analysis of emergent communication on the action space. Specifically, we project the learnt communication onto the action space and visually analyze the results. This enables us to more clearly interpret the learnt communication. It is shown that powerful joint representations of the world can be encoded through task-dependent communication, which is easy to interpret under complementary sensing modality constraints.

For our experimental study, we consider a two-dimensional world with two agents and $L$ landmarks. Each landmark has a color {red, green, blue} and a shape {triangle, circle, square} property. Our agents have complementary sensing modalities: one of the agents, denoted the color-agent, can only observe the color of the landmarks, and the other, denoted the shape-agent, can only observe the shape of the landmarks. We assume that both agents can measure their (relative) distance from all landmarks but cannot measure their distance from each other. At every discrete time step the agents take both a physical movement action (a unit movement in one of the four directions or standing still) and a communicative action, namely broadcasting a $k$-bit message. The communication message sent by one agent is received by the other in the next time step.

Each agent's observation is a vector $o_i = [x_{i1}, y_{i1}, m_{i1}, \ldots, x_{iL}, y_{iL}, m_{iL} \mid c_{i1}, \ldots, c_{ik} \mid g_{i1}, \ldots, g_{iL}]$, where $x_{ij}$ and $y_{ij}$ denote the horizontal and vertical distances, respectively, between the $i$th agent and the $j$th landmark; $m_{ij}$ denotes a one-hot encoding of the $j$th landmark's property as sensed by the $i$th agent. For example, $m_{ij} \in \{[1,0,0], [0,1,0], [0,0,1]\}$ denotes the encoding of the $j$th landmark's color for the color-agent, or shape for the shape-agent. The vector $[c_{i1}, \ldots, c_{ik}]$ denotes the $k$-bit word received by the $i$th agent. Finally, $[g_{i1}, \ldots, g_{iL}]$ denotes a one-hot encoding of the target landmark properties provided to agent $i$. We base our study on the three collaborative tasks described next.
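Before turning to the tasks, here is a minimal sketch that assembles the observation layout above for one agent; the helper name `build_observation` and the NumPy conventions are our own illustrative choices, not the authors' released code.

```python
import numpy as np

def build_observation(rel_pos, landmark_props, received_word, goal_enc):
    """Assemble o_i = [x_1, y_1, m_1, ..., x_L, y_L, m_L | c_1..c_k | g_1..g_L].

    rel_pos:        (L, 2) array of (x, y) distances to each landmark
    landmark_props: (L, 3) one-hot property (color or shape) per landmark
    received_word:  (k,)  bits received from the other agent last step
    goal_enc:       one-hot encoding of the target landmark property
    """
    per_landmark = [np.concatenate([rel_pos[j], landmark_props[j]])
                    for j in range(len(rel_pos))]
    return np.concatenate(per_landmark + [received_word, goal_enc])

# Example: 3 landmarks and a 3-bit channel. The agent senses one
# one-hot property per landmark and heard the word [1, 0, 0].
rel_pos = np.array([[0.5, -1.0], [-0.2, 0.3], [1.1, 0.7]])
props = np.eye(3)  # one-hot property per landmark
obs = build_observation(rel_pos, props, np.array([1, 0, 0]),
                        np.array([0, 0, 1]))
print(obs.shape)  # (21,) = 3 * (2 + 3) + 3 + 3
```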
Task 1: Cross-modal information exchange: In this task, the map contains three landmarks. No two landmarks have the same shape or color. During each episode, agents and landmarks are placed randomly in the map. One of the landmarks is designated as the "target", and the goal, for both agents, is to reach the designated target landmark. This target landmark's property is indicated to the agents using $[g_{i1}, \ldots, g_{iL}]$ as described previously. Consider, as an example, the case where the target landmark is a blue circle: if we were to provide the color-agent with the color property, it could trivially navigate to the target, given that it has full knowledge of where the differently colored landmarks are with respect to itself. To avoid this, and to encourage communication, we pass the encoding corresponding to the shape of the target landmark to the color-agent (circle in this example), and vice versa for the shape-agent, as sketched below. This creates a situation where the agents need to exchange information in order to successfully navigate to the right landmark.
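A minimal sketch of this cross-modal goal assignment; the function name and one-hot conventions are illustrative assumptions rather than the paper's implementation.

```python
def assign_cross_modal_goals(target_color, target_shape):
    """Task 1 goal swap: each agent receives the target property it
    cannot sense, so it must rely on the other agent's communication.

    target_color, target_shape: one-hot encodings of the target
    landmark's color and shape.
    """
    color_agent_goal = target_shape  # color-agent is told the shape
    shape_agent_goal = target_color  # shape-agent is told the color
    return color_agent_goal, shape_agent_goal

# Target is a blue circle: blue = [0, 0, 1], circle = [0, 1, 0].
cg, sg = assign_cross_modal_goals([0, 0, 1], [0, 1, 0])
print(cg, sg)  # [0, 1, 0] [0, 0, 1]
```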
Task 2: Multi-target consensus: For this task, the map contains six landmarks, each with shape and color properties, and no two landmarks have the same set of properties. However, there can be two landmarks with the same color or shape; the target landmark is unique only when both properties are considered. The goal is for both agents to move to the target landmark. The agent observation and action spaces are similar to the previous scenario, but here the property of the target landmark is specified in the agent's own modality. Specifying the encoding for circle, as the target, to the shape-agent does not trivially solve the problem, as there can be two circles, and coordination with the color-agent is necessary to figure out which circle is blue and then move towards it. In this task, the agents need to reach a consensus on which is the target landmark by learning to reason over their observation spaces.
Task 3: Collaborative localization:
Here the setup is similar to the information exchange task. However, no target landmark is specified, as the goal is for the agents to meet each other in the shortest possible time. A constant negative reward $R_t$ is supplied to the agents at each time step to encourage meeting up quickly. The agents must learn to estimate their relative position with respect to each other and then take actions to move closer.

Summarizing, all the above tasks share the following key challenging characteristics. Agents have (i) different sensing modalities; (ii) no knowledge about the other agent's sensor or position; (iii) no common world coordinate frame in their state space; and (iv) a finite communication bandwidth.

Reward structure and learning framework:
We primarily used three types of rewards: $R_d = \sum_{i=1}^{n} \sqrt{x_{iT}^2 + y_{iT}^2}$, where $i$ is the agent number, $x_{iT}$ and $y_{iT}$ are the horizontal and vertical distances of the $i$th agent from the target landmark, and $n$ is the number of agents; $R_d$ enters the reward as a penalty. We define an instantaneous reward $R_i$, which equals a large number $H$ if at least one agent is touching the target at the current time step, and 0 otherwise. For the collaborative localization task, $R_t$ is a constant penalty per time step and $R_d$ is the inter-agent distance.

Table 1: Performance metrics for the three test tasks with variations in reward structure and communication channels.

    Task                     k   Reward       M1 (%)   M2 (%)
    Information exchange     2   R_d          99       80
    Information exchange     3   R_d          100      89.5
    Information exchange     4   R_d          100      98.9
    Multi-target consensus   4   R_d          --       --
    Multi-target consensus   4   R_d + R_i    --       --
    Multi-target consensus   4   R_d + R_t    --       --

For all tasks, our reinforcement learning (RL) framework is based on the MADDPG algorithm (Lowe et al. [2017]), which relies on centralized training and decentralized execution, making it suitable for multi-agent problems. The core of MADDPG is an actor-critic scheme (Grondman et al. [2012]) that maintains a critic for each agent; the critics have access to the actions (movement and communication) and rewards of all the agents. This helps with the problem of non-stationarity in multi-agent environments. In all experiments, we parameterize both the actors and the critics with three-layered fully connected networks with ReLU activations. It must be noted that, although the agent state space allows for real numbers in the communication stream, the use of the Gumbel-Softmax estimator (Jang et al. [2016]) transforms these into discrete-valued messages. While $2^k$ word variations are possible, we observe that the agents limit their vocabulary to $k+1$ words. For 3 channels of communication, the word vocabulary was limited to $w_1 = [0,0,0]$, $w_2 = [1,0,0]$, $w_3 = [0,1,0]$, $w_4 = [0,0,1]$. Details of the learning framework, training hyperparameters, and reward curves are provided in Appendix A. In the following, we evaluate agent performance using simple metrics and then provide interpretations of the communication that emerged between the agents.
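As a concrete illustration of the discretization step, here is a minimal sketch of straight-through Gumbel-Softmax sampling. The per-bit binary parameterization is one plausible reading of the paper's $k$-bit channel (it admits the $2^k$ possible words mentioned above); the temperature value is an assumption, and the paper does not specify this detail.

```python
import torch
import torch.nn.functional as F

def sample_message_bits(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Sample a k-bit message with one straight-through Gumbel-Softmax per bit.

    `logits` has shape (k, 2): unnormalized scores for each bit being
    0 or 1. With hard=True the forward pass yields exact 0/1 bits,
    while gradients flow through the underlying soft samples, so the
    actor's communication head can be trained end to end.
    """
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)  # (k, 2)
    return one_hot[:, 1]  # keep the "bit is 1" column -> k-bit word

# Example: a k = 3 bit channel; a possible sampled word is [1., 0., 0.].
logits = torch.randn(3, 2, requires_grad=True)
print(sample_message_bits(logits))
```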
Performance metrics:
We use two metrics to evaluate the performance of the agents in achieving the common goal. Let $m_1$ denote the number of episodes in which at least one of the agents reaches the target landmark, $m_2$ the number of episodes in which both agents reach the target landmark, and $N$ the total number of test episodes. We use $N = 1000$ and let $M_1 = m_1/N$ and $M_2 = m_2/N$. For the collaborative localization task, $m_1$ denotes the number of times the agents meet each other, and $m_2$ is the starting distance between the agents divided by the total distance travelled by both agents in the episode. As mentioned, $k$ is the number of communication bits available to the agents. Table 1 shows these metrics for the three different tasks.
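A minimal sketch of the metric computation for the target-reaching tasks; the per-episode record format is our own assumption.

```python
def success_metrics(episodes):
    """Compute M1 = m1/N and M2 = m2/N over N test episodes.

    Each episode is recorded as a pair of booleans:
    (at least one agent reached the target, both agents reached it).
    """
    N = len(episodes)
    m1 = sum(1 for any_reached, _ in episodes if any_reached)
    m2 = sum(1 for _, both_reached in episodes if both_reached)
    return m1 / N, m2 / N

# Example with N = 4 test episodes.
print(success_metrics([(True, True), (True, False),
                       (True, True), (False, False)]))  # (0.75, 0.5)
```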
Emerged communication: Table 1 shows that the agents are able to successfully complete the information exchange and collaborative localization tasks almost every time. In the more complex multi-target consensus task, the agents achieve a success rate that is better than random chance. In order to better understand how the communication aids the agents, we devised a way to visualize this relation as follows. At every time step, the $i$th agent decides its actions based on its observation $o_i$, comprising the relative positions of the landmarks, the word received from the other agent, and the target landmark (if applicable). For a given test case, the target landmark is fixed. Then, for every possible word that can be received, we can place the agent in a fixed position of the environment and query the learnt policy to find in which direction the agent would move. We can then repeat this for all possible agent positions and color-code the preferential direction of motion at each location, obtaining a picture such as the one shown in Figure 1 (a code sketch of this sweep is given after Figure 2). As expected, we observe random motion at the beginning of training. Over time, the color-agent learns to move to the blue triangle if the word uttered by the shape-agent is $[1,0,0,0]$. Similarly, the shape-agent learns to move towards the blue circle for the same utterance by the color-agent. Visualizations for other word utterances are given in the Appendix; see Figure 6. Note that both agents focus on either a blue or a circular object for all word utterances, as the given target in this example is a blue circle. It is remarkable that the agents are able to solve a complex map alignment and reasoning problem by directing each other's actions through communication. Examples of the final learnt policies for the information exchange and collaborative localization tasks are shown in Figure 2, left and right respectively. In the information exchange task, we observe that each unique word uttered by the color-agent causes the shape-agent to move close to a specific shape of landmark, irrespective of its current position in the map (e.g., $[1,0,0,0]$ causes movement towards the circle). In the collaborative localization task, each unique word uttered by the color-agent causes the shape-agent to move towards a focus point in the map, and vice versa. During an episode, both agents continuously change their utterances in order to force their partner to travel towards them and meet in the shortest time possible.

Figure 1: Multi-target consensus task. Evolution of policies, for the word utterance $[1,0,0,0]$, for both the color-agent (top row) and the shape-agent (bottom row). Policy relation to color: green, go down; turquoise, go left; blue, go up; yellow, go right; grey, no movement. The meeting point of the different colors (vertex) is the equilibrium point at which the agent may come to rest.
Figure 2: Left: information exchange task. From left to right, the color-agent utters the words $[1,0,0,0]$, $[0,1,0,0]$, $[0,0,1,0]$ to the shape-agent. Top and bottom rows represent different landmark configurations. Right: collaborative localization task. From left to right, policies for the words $[1,0,0]$, $[0,1,0]$, $[0,0,1]$ are visualized for both the color-agent (top row) and the shape-agent (bottom row).
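A minimal sketch of the policy-projection procedure behind Figures 1 and 2: hold the received word fixed, sweep the agent over a grid of positions, and color-code the argmax movement action at each cell. Here `policy` is a stand-in callable for a trained actor's movement head, and the observation omits the property and goal encodings for brevity.

```python
import numpy as np

# Color legend, following Figure 1.
ACTION_COLORS = {0: "grey",       # no movement
                 1: "blue",       # go up
                 2: "green",      # go down
                 3: "turquoise",  # go left
                 4: "yellow"}     # go right

def policy_map(policy, landmarks, word, grid=50, lo=-1.0, hi=1.0):
    """Color-code the preferred movement direction at every grid cell.

    For each candidate agent position we rebuild the observation
    (relative landmark positions plus the fixed received word) and
    query the learnt policy for its argmax movement action.
    """
    img = np.empty((grid, grid), dtype=object)
    coords = np.linspace(lo, hi, grid)
    for r, y in enumerate(coords):
        for c, x in enumerate(coords):
            rel = landmarks - np.array([x, y])  # relative distances
            obs = np.concatenate([rel.ravel(), word])
            img[r, c] = ACTION_COLORS[int(policy(obs))]
    return img

# Usage with a dummy policy that always prefers "go up".
img = policy_map(lambda obs: 1,
                 landmarks=np.array([[0.3, 0.3], [-0.5, 0.2]]),
                 word=np.array([1, 0, 0]))
print(img[0, 0])  # blue
```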
Effect of communication channels: Changing the number of channels affects the learning capacity. In the information exchange task, when 3 channels are provided, the agents associate their target landmarks with the "topmost", "leftmost", or "bottommost" landmark in the map, depending on the received communication. This is a clever reference, as the "topmost" (and similar) landmark can easily be the same for the two agents, since their reference axes are only translated from each other. However, it leads to failure when the target landmark is located in the middle or is the "rightmost". When using 4 channels, the agents can directly associate their targets with the property communicated to them by the other agent. In Figure 2 (left), the shape-agent interprets $[1,0,0,0]$ from the color-agent as a signal to go to a circle, $[0,1,0,0]$ to a triangle, and $[0,0,1,0]$ to a square. This improves performance greatly.

We studied the application of multi-agent reinforcement learning to task-driven multimodal decision making. We analyzed the emergence of interpretable communication between agents and found that adaptive and non-trivial communication protocols can be learned based on the number of available communication channels and the imposed reward structures. The size of the communication channel can be crucial in deciding the amount of information that is required to reconcile the various modalities with each other, and the reward structure affects the nature of the learned communication. Visualizing the emergent policies in the agents' action spaces confirms that powerful joint representations of the world can be encoded through communication.
Acknowledgments

We are thankful to Shashank Shivkumar at Honeywell Aerospace for discussions related to this paper, ranging from the initial idea for the research through algorithm implementation and testing.

References
Jacob Andreas and Dan Klein. Reasoning about pragmatics with neural listeners and speakers. arXiv preprint arXiv:1604.00562, 2016.

Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z. Leibo, Karl Tuyls, and Stephen Clark. Emergent communication through negotiation. arXiv preprint arXiv:1804.03980, 2018.

Katrina Evtimova, Andrew Drozdov, Douwe Kiela, and Kyunghyun Cho. Emergent communication in a multi-modal, multi-step referential game. arXiv preprint arXiv:1705.10369, 2017.

Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.

Dave Golland, Percy Liang, and Dan Klein. A game-theoretic approach to generating spatial descriptions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 410–419. Association for Computational Linguistics, 2010.

Ivo Grondman, Lucian Busoniu, Gabriel A. D. Lopes, and Robert Babuska. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(6):1291–1307, 2012.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.

Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra. Natural language does not emerge 'naturally' in multi-agent dialog. arXiv preprint arXiv:1706.08502, 2017.

Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. Emergence of linguistic communication from referential games with symbolic and pixel input. arXiv preprint arXiv:1804.03984, 2018.

Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.

Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Appendix: A perspective on multi-agent communication for information fusion
A: Framework and training
An overview of the learning framework used for the three different tasks is shown in Figure 3. In the information exchange and collaborative localization tasks, we train with a maximum episode length of 60 steps. In the multi-target consensus task, agents are trained for 120,000 episodes with a maximum episode length of 80 steps. We use the Adam optimizer with a learning rate of 0.01, a discount factor of 0.001 for the critics, and a batch size of 1024. In general, convergence was observed within 5000 episodes.

Figure 3: Learning framework used in this work.

Figure 4 shows the training progress over the number of episodes for the three different tasks. In the information exchange and collaborative localization tasks, the majority of the policy improvement takes place in the initial episodes, and each agent maintains the word-action associations it learns over the following episodes. In the more complex multi-target consensus task, the agents take the longest to learn meaningful policies that maximize reward. The reward curve has a sudden peak near 5000 episodes; however, the policies still gradually keep improving until about 80,000 episodes.
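For reference, a minimal sketch of how these hyperparameters could be instantiated in PyTorch; the hidden width, input/output sizes, and the `MLP` class are illustrative assumptions, since the paper only specifies three fully connected layers with ReLU activations.

```python
import torch
import torch.nn as nn

# Hyperparameters reported above; network sizes are assumptions.
LR, BATCH_SIZE = 0.01, 1024
GAMMA_CRITIC = 0.001  # discount factor for the critics, as reported
MAX_EP_LEN = {"info_exchange": 60, "localization": 60, "consensus": 80}

class MLP(nn.Module):
    """Three-layer fully connected network with ReLU activations,
    used to parameterize both actors and critics."""
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# One actor per agent (5 movement logits + k message-bit logits) and
# one centralized critic per agent over all agents' obs and actions.
actor = MLP(in_dim=21, out_dim=5 + 3)
critic = MLP(in_dim=2 * (21 + 5 + 3), out_dim=1)
actor_opt = torch.optim.Adam(actor.parameters(), lr=LR)
critic_opt = torch.optim.Adam(critic.parameters(), lr=LR)
```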
B: Effect of reward structure on communication
For the complex multi-target consensus task, we found that using only a continuous average distance penalty ($R_d$) at each time step of the episode is detrimental to learning. As training progresses, the agents learn to ignore communication and stay in the place where they started: the whole policy map changes to grey (no movement) in the final epochs, as shown in Figure 5. To encourage more exploration, we introduced an instantaneous touching reward $R_i$ in addition to the constant average distance penalty, and we observed improved performance (see the sketch below). We also experimented with using $R_i$ alone, and the policies learned were suboptimal. It thus appears that both reward types are necessary for learning in complex multimodal scenarios.

Figure 4: Plots of average total cumulative reward vs. training episode number. Left: multi-target consensus task. Middle: information exchange task. Right: collaborative localization task.

Figure 5: Evolution of policy in multi-target consensus when only the average distance penalty is used. Two scenarios are shown. Identical plots are obtained for the shape-agent.
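A minimal sketch of the combined reward $R_d + R_i$ discussed above; the magnitude of $H$, the touch threshold, and the sign convention (distances negated so that $R_d$ acts as a penalty) are assumptions consistent with the text.

```python
H = 100.0  # large instantaneous bonus; the paper does not report its value

def step_reward(agent_target_dists, touch_eps=0.05):
    """Per-step reward R_d + R_i for the multi-target consensus task.

    R_d: continuous distance penalty -- the (negated) sum of each
         agent's Euclidean distance to the target landmark.
    R_i: instantaneous bonus of H if at least one agent is touching
         the target this step, 0 otherwise.
    """
    r_d = -sum(agent_target_dists)
    r_i = H if min(agent_target_dists) < touch_eps else 0.0
    return r_d + r_i

# Agents at distances 0.4 and 0.03: the second is touching the target,
# so the step earns the bonus on top of the distance penalty.
print(step_reward([0.4, 0.03]))  # -0.43 + 100.0 = 99.57
```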
C: Number of words learned

While a $k$-bit communication channel is used, we experimentally verified that all $2^k$ words are valid. However, the agents mainly chose to use only $k+1$ words while navigating. An example policy evolution corresponding to all the used words in multi-target consensus is shown in Figure 6. To understand what the unused words meant, we visualized the policy corresponding to two such words, as shown in Figure 7. We find that the (combination) word utterance $[1,1,0]$ produces a new focus/equilibrium point, different from those of the individual words $[1,0,0]$ and $[0,1,0]$. While it could be useful to use another focus point to guide the movement of the other agent in the collaborative localization environment, the agents prefer to utter different simple words sequentially, rather than a more complex word once.

Figure 6: Evolution of policy in multi-target consensus for the shape-agent (top) and the color-agent (bottom) for different words uttered (left to right).
Figure 7: Policy visualizations corresponding to the simple words $[1,0,0]$, $[0,1,0]$, $[0,0,1]$ and complex words (e.g., $[1,1,0]$) in the collaborative localization task.