Hybrid Information-driven Multi-agent Reinforcement Learning
William A. Dawson, Ruben Glatt, Edward Rusu, Braden C. Soper, Ryan A. Goldhahn
Lawrence Livermore National Laboratory, 7000 East Ave, Livermore, California 94550
{dawson29, glatt1, rusu1, soper3, goldhahn1}@llnl.gov

Abstract
Information theoretic sensor management approaches are an ideal solution to state estimation problems when considering the optimal control of multi-agent systems; however, they are too computationally intensive for large state spaces, especially when considering the limited computational resources typical of large-scale distributed multi-agent systems. Reinforcement learning (RL) is a promising alternative which can find approximate solutions to distributed optimal control problems that take into account the resource constraints inherent in many systems of distributed agents. However, the RL training can be prohibitively inefficient, especially in low-information environments where agents receive little to no feedback in large portions of the state space. We propose a hybrid information-driven multi-agent reinforcement learning (MARL) approach that utilizes information theoretic models as heuristics to help the agents navigate large sparse state spaces, coupled with information-based rewards in an RL framework to learn higher-level policies. This paper presents our ongoing work towards this objective. Our preliminary findings show that such an approach can result in a system of agents that are approximately three orders of magnitude more efficient at exploring a sparse state space than naive baseline metrics. While the work is still in its early stages, it provides a promising direction for future research.
Introduction
The continuing technical advancement of sensors, together with the reduction in their size and production costs, allows for increasingly complex sensor applications driven by machine learning and artificial intelligence algorithms. An observable trend is the integration of heterogeneous, autonomous sensors into complex, intelligent sensor networks that can be controlled in a centralized (Chen et al. 2019) or decentralized manner (Wang et al. 2020). However, these advancements also come with a number of open challenges associated with the design and control of such networks. Many of those challenges are related to scaling the algorithmic complexity as the number of agents increases. In particular, many exact centralized methods for learning optimal control policies become prohibitively expensive as the number of sensors grows, suffering from the so-called "curse of dimensionality" (Boutilier 1996). Such computational challenges are compounded when sensors and actors are combined in intelligent, autonomous agents that can work together to solve highly complex tasks (Baker et al. 2019). Moreover, when considering large-scale decentralized networks composed of low-to-mid cost sensors, resource constraints must be taken into account to ensure the sensor network accomplishes its objective given a limited resource budget (e.g., communication bandwidth, battery life, CPU/GPU cycles, reach, etc.).

One basic way to describe such systems, in which decision-making agents interact to solve a task, is the extension of Markov Decision Processes (MDP) (Puterman 2014) to the multi-agent case, a Markov game or simply Multi-agent MDP (MMDP) (Littman 1994). An MMDP is defined by a set of states $S$, a set of actions $A_i$ for each agent $i = 1, \ldots, n$ in the environment, a transition function $T(s' \mid s, u)$ defining the probability of observing a follow-up state $s'$ after applying the joint action of all agents $u$ in state $s$, and a reward function $R(s, u)$ providing a reward for applying the joint action $u$ in state $s$. Here, we consider stochastic games, a further variant in which the reward function $R_i(s, u)$ provides individual rewards for each agent instead of a joint reward (Bowling and Veloso 2000). Many possible ways to solve this kind of decision-making process are described in the fields of Reinforcement Learning (RL) (Sutton and Barto 2018) and Multi-agent Reinforcement Learning (MARL) (Busoniu, Babuska, and De Schutter 2008). In particular, single-agent learning has received much attention, and the literature provides solutions on a wide range of challenging tasks such as Atari game playing (Mnih et al. 2015) or electric vehicle charging (Pettit et al. 2019). In general, RL has been shown to find good solutions for tasks involving distributed systems, but scalability remains an issue due to the high sample inefficiency and long training time required in complex systems, as evidenced, for example, by training over 900 agents each for 44 days to find a good solution in the real-time strategy game StarCraft II (Vinyals et al. 2019).

Most works consider agents that try to learn very complex problems without making many assumptions or using prior knowledge, for example known physical processes in an environment. While minimizing assumptions and prior knowledge may lead to more robust policies, it nevertheless makes the learning process extremely difficult in complex problems.
In this work, we investigate agents that use model heuristics and approximations when available and reasonable, so that the RL agent can focus on learning the aspects of the problem for which we do not have reasonable models. Specifically, we consider agents trying to predict the source location of a chemical plume in a partially observable environment. The agents make plume concentration measurements and then decide how to explore the state space based on information theoretic metrics. Our experiments show that by combining these approaches, along with communication between the agents, we can achieve a speedup in learning while reducing computational requirements. This report presents the general idea of a hybrid information-based RL approach as well as some preliminary experiments and results from our ongoing work. We also discuss the important challenges that remain with such an approach.

The Plume State Estimation Problem
This work focuses on a multi-agent chemical plume source localization problem, where multiple agents are allowed to move around a two-dimensional space taking concentration measurements in concert to estimate the location of the chemical plume's source (see Schmidt et al. 2019, for more details). Our initial attempts at a pure RL solution to this problem appeared to be computationally intractable as the state space was relatively large, facing the typical challenges of multi-agent learning (e.g., Vinyals et al. 2019) plus the additional challenge that the vast majority of the agents' initial measurements were null (i.e., the plume occupies a very small region of the environment). This motivated us to consider a new hybrid information-driven multi-agent RL approach.
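To make the setup concrete, the sketch below draws noisy concentration measurements from a static two-dimensional Gaussian plume. The plume shape, domain size, source position, and noise level are illustrative assumptions for this sketch only; the plume model used in our experiments follows Schmidt et al. (2019).

```python
import numpy as np

def plume_concentration(x, y, x_s, y_s, amplitude=1.0, width=5.0):
    """Assumed static Gaussian plume centered at the source (x_s, y_s)."""
    r2 = (x - x_s) ** 2 + (y - y_s) ** 2
    return amplitude * np.exp(-r2 / (2.0 * width ** 2))

def measure(x, y, x_s, y_s, rng, sigma=0.05):
    """Noisy concentration reading m_t ~ N(f(x_t, y_t, x_s, y_s), sigma^2)."""
    return rng.normal(plume_concentration(x, y, x_s, y_s), sigma)

# Five agents at random positions each take one measurement; because the plume
# occupies a small region of the domain, most readings are indistinguishable from noise.
rng = np.random.default_rng(0)
agent_xy = rng.uniform(-50.0, 50.0, size=(5, 2))
readings = [measure(x, y, 20.0, 10.0, rng) for x, y in agent_xy]
print(np.round(readings, 3))
```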
Information-based Motion and Exploration
Arguably the most fundamental solution to state estimation problems is the information theoretic sensor management approach (Hero, Kreucher, and Blatt 2008). Following (Hero et al. 2007), we present the relevant details of Information-Optimal Policy Search, which will become the core of our hybrid RL approach.

The basic principle is that at time $t$ we want to explore a set of possible locations at which we will make future measurements. Let $m_t$ be the observation made at time $t$ and let $(x_t, y_t)$ be the location where the measurement was taken at time $t$. We denote the true fixed source location by $(x_s, y_s)$. The observations are assumed to be conditionally Gaussian given the location of the observation, the true source location, and a parametric plume model given by a deterministic function $f$, giving us $m_t \mid x_t, y_t, x_s, y_s \sim \mathcal{N}(f(x_t, y_t, x_s, y_s), \sigma^2)$. Details on the function $f$ can be found in (Schmidt et al. 2019). We assume each agent places a prior distribution $\pi(x_s, y_s)$ over the source location parameters. We assume a discretized space of possible measurement and source locations with $x_t \in A$, $y_t \in B$, $x_s \in I$, and $y_s \in J$ (discrete space is not a requirement of the method in general, although it facilitates an FFT approximation).

The objective is to choose the next measurement location $(x_{t+1}, y_{t+1})$ that maximizes the expected information gain (IG) relative to the cost $C(x_{t+1}, y_{t+1})$ of making that measurement. The IG is defined as the $\alpha$-divergence (Rényi 1961) between a prior distribution on the source location and the posterior distribution of the source location given a new measurement $m_{t+1}$ taken at location $(x_{t+1}, y_{t+1})$. In the limit that $\alpha$ approaches 1 this reduces to the Kullback-Leibler divergence $D_{\mathrm{KL}}$.

Letting $d_t = (x_t, y_t, m_t)$ be the data collected at time $t$, and $d_{1:t} = (d_1, \ldots, d_t)$ be the data collected up until time $t$, the information gain is defined as
$$\mathrm{IG}(m_{t+1} \mid x_{t+1}, y_{t+1}) \equiv D_{\mathrm{KL}}\big(p(x_s, y_s \mid d_{1:t+1}) \,\|\, q(x_s, y_s)\big) = \sum_{x_s} \sum_{y_s} p(x_s, y_s \mid d_{1:t+1}) \log \frac{p(x_s, y_s \mid d_{1:t+1})}{q(x_s, y_s)}.$$
The term $q(x_s, y_s)$ in the above equation could take on various probability distributions. However, in the multi-agent RL case it is important that metrics be consistent across agents and epochs. Thus, we set this as the initial prior $\pi(x_s, y_s)$.

Because we do not know the true source location $(x_s, y_s)$, we must marginalize over the posterior predictive distribution of the observation $m_{t+1}$ in order to determine the expected IG. Letting
$$p(m_{t+1} \mid x_{t+1}, y_{t+1}, d_{1:t}) = \sum_{x_s} \sum_{y_s} p(m_{t+1} \mid x_{t+1}, y_{t+1}, x_s, y_s)\, p(x_s, y_s \mid d_{1:t})$$
be the posterior predictive distribution of the measurement $m_{t+1}$ taken at location $(x_{t+1}, y_{t+1})$ given previously observed data $d_{1:t}$, we have the expected IG as
$$\mathbb{E}_{m_{t+1}}\big[\mathrm{IG}(m_{t+1} \mid x_{t+1}, y_{t+1})\big] = \int_{m_{t+1}} \mathrm{IG}(m_{t+1} \mid x_{t+1}, y_{t+1})\, p(m_{t+1} \mid x_{t+1}, y_{t+1}, d_{1:t})\, dm_{t+1}.$$
The key implication of the previous equation is that it results in another set of sums outside of the IG sums, since we are implicitly determining the IG for the expected measurement at every possible combination of source and measurement locations,
$$\mathbb{E}\big[m_{t+1}(x_{t+1}, y_{t+1}, x_s, y_s)\big] = \sum_{x_s} \sum_{y_s} f(x_{t+1}, y_{t+1}, x_s, y_s)\, p(x_s, y_s \mid d_{1:t}).$$
Note that these expected IG must be computed for each possible measurement location $(x_{t+1}, y_{t+1}) \in A \times B$.
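The sums above translate directly into a brute-force computation. The sketch below evaluates the expected IG at a single candidate measurement location on a small discretized grid, assuming the Gaussian plume form from the earlier sketch; the grid resolution, noise level, and the quadrature over $m_{t+1}$ are illustrative simplifications rather than the configuration used in our experiments.

```python
import numpy as np
from scipy.stats import norm

# --- illustrative problem setup (all sizes and values are assumptions) ---
SIGMA = 0.05                                    # measurement noise std
xs_grid = np.linspace(-50.0, 50.0, 21)          # candidate source x locations (I)
ys_grid = np.linspace(-50.0, 50.0, 21)          # candidate source y locations (J)
XS, YS = np.meshgrid(xs_grid, ys_grid, indexing="ij")

def f(x, y, xs, ys, width=5.0):
    """Assumed Gaussian plume mean concentration."""
    return np.exp(-((x - xs) ** 2 + (y - ys) ** 2) / (2.0 * width ** 2))

def expected_info_gain(x_next, y_next, posterior, prior, m_grid):
    """E_{m_{t+1}}[ D_KL( p(x_s, y_s | d_{1:t+1}) || prior ) ] at one candidate location."""
    mean_m = f(x_next, y_next, XS, YS)                   # f for every source hypothesis
    eig = 0.0
    for k in range(len(m_grid) - 1):
        m, dm = m_grid[k], m_grid[k + 1] - m_grid[k]
        lik = norm.pdf(m, loc=mean_m, scale=SIGMA)       # p(m | x, y, x_s, y_s)
        pred = np.sum(lik * posterior) * dm              # posterior predictive mass
        if pred <= 0.0:
            continue
        post_new = lik * posterior
        post_new /= post_new.sum()                       # p(x_s, y_s | d_{1:t+1})
        kl = np.sum(post_new * np.log(post_new / prior + 1e-300))
        eig += kl * pred
    return eig

prior = np.full(XS.shape, 1.0 / XS.size)        # uninformative prior pi(x_s, y_s)
posterior = prior.copy()                        # posterior after zero measurements
m_grid = np.linspace(-0.3, 1.3, 40)             # quadrature grid over m_{t+1}
print(expected_info_gain(0.0, 0.0, posterior, prior, m_grid))
```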
The objective can then be written in the following way:
$$\max_{x_{t+1} \in A,\; y_{t+1} \in B} \frac{\mathbb{E}_{m_{t+1}}\big[\mathrm{IG}(m_{t+1} \mid x_{t+1}, y_{t+1})\big]}{C(x_{t+1}, y_{t+1})}. \quad (1)$$

The full information theoretic sensor management solution to this simple plume problem requires $A \times B \times I \times J$ total calculations per time step per agent. Clearly, as the state estimation space increases in size or dimensionality, the number of agents increases, or a non-myopic optimization is attempted, the problem becomes computationally intractable. Kreucher et al. (2008) suggest particle filtering as a possible solution, but in the case of a large state space with an uninformative prior, the number of particles necessary is initially approximately $A \times B \times I \times J$. Thompson sampling is another alternative that can reduce the dimensionality of the problem to $A \times B \times I \times J$ and has proven convergence properties (Verstraeten et al. 2020). We can further reduce the computational complexity to $I \times J \times \log(A \times B)$ by adopting a Signal to Noise Ratio (SNR) expectation maximization approximation (Fisher 1935) to IG expectation maximization and leveraging the fact that the mathematical form of such an approximation facilitates fast Fourier transform convolution. Note that this last approximation breaks down when the signal is not weak. However, if this is the case then the problem is much simpler.

For this work we adopt the SNR expectation maximization approximation, as it is most conducive to embedding in RL algorithms and on-agent computation. To demonstrate the information-based motion heuristic, we randomly populated a space with five agents and a single plume as in Schmidt et al. (2019); the region of the plume where the agents can make a concentration measurement above the noise level occupies only a small fraction of the environment. The maximum achievable IG for this configuration is approximately 17 bits, but the agents only achieve IG of roughly 15 bits. This is because the SNR expectation maximization approximation of IG expectation maximization breaks down in the high SNR regime. Still, the plume source 95% credible interval has been localized to just a few discrete locations. As can be seen from the lower panels of Fig. 1, the information-based motion heuristic is more than three orders of magnitude more efficient than baseline random heuristics. Perhaps just as important is that this approach, based on Eq. 1, enables the agents to naturally transition from exploration to exploitation.
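A sketch of the resulting greedy selection rule of Eq. 1 is given below. It assumes the expected_info_gain helper from the previous sketch, and the simple measurement-plus-travel cost model is an illustrative assumption, not the cost function used in our experiments.

```python
import numpy as np

def travel_cost(x_from, y_from, x_to, y_to, base_cost=1.0):
    """Assumed cost model: a fixed per-measurement cost plus distance traveled."""
    return base_cost + np.hypot(x_to - x_from, y_to - y_from)

def next_measurement_location(x_now, y_now, posterior, prior, m_grid, xA, yB):
    """Greedy Eq. 1: choose the candidate location maximizing expected IG per unit cost.

    Relies on expected_info_gain(...) from the previous sketch.
    """
    best_score, best_xy = -np.inf, (x_now, y_now)
    for x_next in xA:
        for y_next in yB:
            gain = expected_info_gain(x_next, y_next, posterior, prior, m_grid)
            score = gain / travel_cost(x_now, y_now, x_next, y_next)
            if score > best_score:
                best_score, best_xy = score, (x_next, y_next)
    return best_xy

# Coarse candidate measurement grid A x B (sizes chosen only for illustration);
# each agent would call next_measurement_location(...) once per decision step.
xA = np.linspace(-50.0, 50.0, 11)
yB = np.linspace(-50.0, 50.0, 11)
```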
Hybrid Information-driven Multi-agent Reinforcement Learning
Figure 1: Results of a simplified five-agent plume localization problem relying exclusively on an information-based motion heuristic. Top panel: the two-dimensional plume environment. The history of the agent measurement locations is shown with the small circles, colored according to agent ID. The contour background shows the normalized posterior probability distribution of the plume source location as estimated by the agents at time step 299, which has been concentrated about the true plume source location (black X). The thin black contours to the right of the plume source denote the region where the agents could make an above-noise concentration measurement. Second panel: a history of the concentration measurements made by the agents, which for the most part are scattered about zero with variance due to user-defined noise (dashed black lines denote σ). Bottom two panels: the history of the information gain ($D_{\mathrm{KL}}$) for the simulation (thick blue line). For comparison purposes we show histories for two separate simulations using just the cost weight as a motion heuristic (orange) and completely random motion (green). Fluctuations are a byproduct of the noisy measurements. The information-based motion heuristic is more than three orders of magnitude more efficient than the baseline random methods in the long run.

Figure 2: Gaussian plume model with plume source (small full blue dot) and concentration indication (orange) together with three agents with action trails (red, light blue, green) interacting in the environment.

As demonstrated above, information theory based sensor optimization is an intuitive, effective, and relatively computationally efficient multi-agent motion heuristic. However, once we increase the complexity of the problem by no longer allowing the agents to communicate, infer their state, and make measurements for free, but instead factor the costs of these choices into the optimization problem, the optimization cost quickly becomes prohibitive, and applying a fully information theory based sensor optimization approach becomes computationally intractable. RL holds the potential to learn a good policy for this scenario, but as we previously noted, the training was prohibitively computationally expensive for the full partially observable state space estimation with a noninformative prior. Thus we propose an approach leveraging models where they exist as heuristics and only using RL for the high-level decision making problem. In our case, this means using RL to learn policies for decisions about what higher-level action the agent should take and then leveraging the information-based heuristic to determine the agent's motion. Since both the heuristic and the RL reward are based on the same IG/cost function, this allows for consistent behavior and enables the possibility of easily transferring the agents' decision-making abilities to different tasks, which has been shown to be difficult in traditional approaches (Glatt, Da Silva, and Costa 2016).

For our hybrid RL experiments, our environment considers a single plume source location that emits a static Gaussian plume as in the previous section (see Fig. 2) and provides the physics of the world so that agents can move, interact with other objects (when present), and communicate with each other. The plume concentration is normalized in the possible concentration range to keep all state variables in a similar range and to avoid issues when learning.

Action space:
The action space for the agents is a set of five discrete actions: do nothing, move, take a measurement, update, and communicate.
Communicate collects the last measurements from the other agents and directly updates the posterior; measuring means taking a new concentration measurement at the current location and saving it in the concentration buffer; and update uses at most the last four measurements taken since the last update to update the posterior and resets the concentration buffer. The move action then uses the agent's current belief (i.e., posterior probability distribution) about the location of the source to change the acceleration of the agent towards the estimated source location.
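To make the interface concrete, a minimal sketch of this discrete action set and its dispatch is given below; the enumeration order, the agent attributes, and the helper methods (concentration_buffer, update_posterior, accelerate_toward, etc.) are hypothetical names for illustration, not our actual implementation.

```python
from enum import IntEnum

class HighLevelAction(IntEnum):
    """The five discrete high-level actions available to each agent."""
    DO_NOTHING = 0
    MOVE = 1          # accelerate toward the current source-location estimate
    MEASURE = 2       # take a concentration measurement at the current position
    UPDATE = 3        # fold buffered measurements (at most four) into the posterior
    COMMUNICATE = 4   # pull the other agents' latest measurements into the posterior

def apply_action(agent, action, others):
    """Illustrative dispatch of a high-level action (the agent API is assumed)."""
    if action == HighLevelAction.MEASURE:
        agent.concentration_buffer.append(agent.measure())
    elif action == HighLevelAction.UPDATE:
        agent.update_posterior(agent.concentration_buffer[-4:])
        agent.concentration_buffer.clear()
    elif action == HighLevelAction.COMMUNICATE:
        agent.update_posterior([a.last_measurement for a in others])
    elif action == HighLevelAction.MOVE:
        agent.accelerate_toward(agent.source_estimate())
    # DO_NOTHING falls through: no state change, no resource cost
```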
Observation space:
The observation space is a flattened vector containing the current agent position (x, y), agent velocity (x, y), wind velocity (x, y), the last concentration measurement, the current source location estimate (x, y), the information gain since episode start, and some internal state indicators: Boolean values indicating whether the agent has moved since taking the last measurement and whether the agent has performed the same action more than 4 times, plus a one-hot vector indicating the last action taken (5 values).
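A sketch of how such a flattened observation could be assembled is shown below; the field ordering and the agent attribute names are hypothetical illustrations of the layout described above.

```python
import numpy as np

N_ACTIONS = 5

def build_observation(agent, last_action):
    """Flattened per-agent observation vector (layout is an illustrative assumption)."""
    one_hot = np.zeros(N_ACTIONS)
    one_hot[last_action] = 1.0
    return np.concatenate([
        agent.position,                              # (x, y)
        agent.velocity,                              # (x, y)
        agent.wind_velocity,                         # (x, y)
        [agent.last_measurement],                    # last concentration measurement
        agent.source_estimate,                       # current source location estimate (x, y)
        [agent.info_gain_since_start],               # information gain since episode start
        [float(agent.moved_since_last_measurement),  # internal state indicators
         float(agent.repeated_action_count > 4)],
        one_hot,                                     # one-hot of the last action taken
    ]).astype(np.float32)                            # 2+2+2+1+2+1+2+5 = 17 values
```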
Reward structure:
Our reward structure is driven by the internal assumptions of the agent, using the distance between the actual source location and the agent's current best estimate of the source location as well as the information gain achieved so far. However, we also integrate part of the reward based on logical assumptions that we make about favorable behavior, with respect to the cost of operation (to reduce energy consumption) and low-profile operation (to protect against detection). In these initial settings, at the beginning of training the information-based part is very dominant; but over time, when the estimation gets better and the achieved information gain increases, this additional action-based reward becomes more dominant. The goal is to facilitate learning of desired behavior under logical assumptions without giving too much guidance; for example, when an agent has high confidence about the source location estimate, it is best to do nothing to save resources and avoid detection through movement or communication.
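The sketch below illustrates one way such a blended reward could be structured, shifting weight from the information-driven term to action costs as the achieved information gain grows; the specific weights, per-action costs, and the 17-bit normalization are illustrative assumptions, not the exact reward used in our experiments.

```python
import numpy as np

# Assumed per-action costs encouraging low-energy, low-profile behavior (illustrative values).
ACTION_COST = {"do_nothing": 0.0, "update": 0.05, "measure": 0.1, "move": 0.2, "communicate": 0.3}

def reward(estimate_error, info_gain, action, max_info_gain=17.0, max_error=100.0):
    """Blend an information-driven term with action costs as the estimate becomes confident.

    estimate_error: distance between the true source and the agent's current estimate.
    info_gain: information gain achieved so far in the episode (bits).
    """
    progress = np.clip(info_gain / max_info_gain, 0.0, 1.0)        # ~0 early, ~1 when confident
    info_term = info_gain / max_info_gain - estimate_error / max_error
    cost_term = -ACTION_COST[action]
    return (1.0 - progress) * info_term + progress * cost_term
```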
Results:
In the experiments, we used a small variant of the Deep Q-Network (Mnih et al. 2015) to learn the high-level actions of each agent, while we use the information-based estimation model for the location estimation and actual movement decision. Our first experiments show that agents that communicate have a clear head start when it comes to estimating the source location properly and then remain ahead even when individual agents have already converged to a behavior (Fig. 3). This is based on the fact that even bad measurements provide information about possible source locations, and we observed that agents quickly learn to use the communication action early to get higher rewards from the start.

Figure 3: Reward over steps for agents that learn individually (red) and for agents that communicate with each other (blue).
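For reference, a small Q-network of the kind we mean, together with the standard one-step DQN temporal-difference loss, is sketched below in PyTorch; the layer widths, the observation dimensionality, and the loss form are illustrative assumptions rather than our exact training configuration.

```python
import torch
import torch.nn as nn

OBS_DIM, N_ACTIONS = 17, 5   # matches the observation sketch above (assumed sizes)

class SmallDQN(nn.Module):
    """A small Q-network mapping the flattened observation to Q-values per high-level action."""
    def __init__(self, obs_dim=OBS_DIM, n_actions=N_ACTIONS, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

def td_loss(q_net, target_net, batch, gamma=0.99):
    """Standard one-step DQN temporal-difference loss over a replay batch."""
    obs, action, rew, next_obs, done = batch
    q = q_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_obs).max(dim=1).values
        target = rew + gamma * (1.0 - done) * q_next
    return nn.functional.mse_loss(q, target)
```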
Discussion

Other approaches that use communication to speed up learning are often built on a student-teacher relationship where the teacher is usually an expert that advises the student with perfect information. Agents that advise each other with imperfect information are considered in (Da Silva, Glatt, and Costa 2017), where the authors show that confidence-based communication can be a helpful tool to speed up learning. While that work investigates ad hoc communication with individual agents learning in the same environment, it only considers a pure RL approach without integrating additional intelligence.

An idea that could integrate well with our approach to improve resource conservation is (Da Silva et al. 2020), where agents ask for advice only when their epistemic uncertainty is high for a certain state. An important aspect of that work is that the proposed method considers that the advice is limited and might be sub-optimal.
Conclusion & Future Work
In this short position paper we have shared our initial considerations with respect to hybrid information-driven multi-agent reinforcement learning. State space estimation based on information metrics is a powerful tool that we can leverage to solve tasks where we use RL to learn a decision-making policy that indicates what to do, while basic knowledge about physical processes guides us on how to do it.

In our experiments, agents cooperate only by sharing knowledge through communication, which is triggered by the communication action. In future versions, we intend to investigate other means of training the agents, like Multi-Agent Deep Deterministic Policy Gradient (MADDPG) (Lowe et al. 2017) or Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL) (Foerster et al. 2016). On the information side, we are currently working on integrating decentralized Markov chain Monte Carlo (MCMC) methods for full posterior inference for non-discretized, non-conjugate models. With respect to scalability, we are working towards high-fidelity, discrete-event simulations to model wireless communication protocols and a larger number of agents.
Acknowledgments
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 and by Lawrence Livermore National Security, LLC, through the support of LDRD 20-SI-005. LLNL-CONF-816423.
References
Baker, B.; Kanitscheider, I.; Markov, T.; Wu, Y.; Powell, G.; McGrew, B.; and Mordatch, I. 2019. Emergent Tool Use From Multi-Agent Autocurricula. In International Conference on Learning Representations.

Boutilier, C. 1996. Planning, learning and coordination in multiagent decision processes. In TARK, volume 96, 195–210. Citeseer.

Bowling, M.; and Veloso, M. 2000. An analysis of stochastic game theory for multiagent reinforcement learning. Technical report, Carnegie Mellon University School of Computer Science.

Busoniu, L.; Babuska, R.; and De Schutter, B. 2008. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

Chen, et al. 2019. IEEE Internet of Things Journal.

Da Silva, F. L.; Glatt, R.; and Costa, A. H. R. 2017. Simultaneously Learning and Advising in Multiagent Reinforcement Learning. In Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, 1100–1108.

Da Silva, F. L.; Hernandez-Leal, P.; Kartal, B.; and Taylor, M. E. 2020. Uncertainty-Aware Action Advising for Deep Reinforcement Learning Agents. In AAAI, 5792–5799.

Fisher, R. 1935. The Design of Experiments. Oliver and Boyd. URL https://books.google.com/books?id=-EsNAQAAIAAJ.

Foerster, J.; Assael, I. A.; De Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 2137–2145.

Glatt, R.; Da Silva, F. L.; and Costa, A. H. R. 2016. Towards knowledge transfer in deep reinforcement learning. In Brazilian Conference on Intelligent Systems (BRACIS), 91–96. IEEE.

Hero, A. O.; Castañón, D.; Cochran, D.; and Kastella, K. 2007. Foundations and Applications of Sensor Management. Springer Science & Business Media.

Hero, A. O.; Kreucher, C. M.; and Blatt, D. 2008. Information theoretic approaches to sensor management. In Foundations and Applications of Sensor Management, 33–57. Springer.

Kreucher, C. M.; Morelande, M.; Kastella, K.; and Hero, A. O. 2008. Joint multi-target particle filtering. In Foundations and Applications of Sensor Management, 59–93. Springer.

Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, 157–163. Elsevier.

Lowe, R.; Wu, Y. I.; Tamar, A.; Harb, J.; Abbeel, O. P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, 6379–6390.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540): 529–533.

Pettit, et al. 2019. arXiv preprint arXiv:1912.03408.

Puterman, M. L. 2014. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Rényi, A. 1961. On Measures of Entropy and Information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, 547–561. Berkeley, Calif.: University of California Press. URL https://projecteuclid.org/euclid.bsmsp/1200512181.

Schmidt, K.; Smith, R. C.; Hite, J.; Mattingly, J.; Azmy, Y.; Rajan, D.; and Goldhahn, R. 2019. Sequential optimal positioning of mobile sensors using mutual information. Statistical Analysis and Data Mining: The ASA Data Science Journal.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.

Verstraeten, T.; Bargiacchi, E.; Libin, P. J.; Helsen, J.; Roijers, D. M.; and Nowé, A. 2020. Multi-agent Thompson sampling for bandit applications with sparse neighbourhood structures. Scientific Reports.

Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; et al. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature.