Policy-focused Agent-based Modeling using RL Behavioral Models
Osonde A. Osoba, Raffaele Vardavas, Justin Grana, Rushil Zutshi, Amber Jaycocks
A Preprint

June 11, 2020

Abstract
Agent-based Models (ABMs) are valuable tools for policy analysis. ABMs help analysts explore the emergent consequences of policy interventions in multi-agent decision-making settings. But the validity of inferences drawn from ABM explorations depends on the quality of the ABM agents' behavioral models. Standard specifications of agent behavioral models rely either on heuristic decision-making rules or on regressions trained on past data. Both prior specification modes have limitations. This paper examines the value of reinforcement learning (RL) models as adaptive, high-performing, and behaviorally-valid models of agent decision-making in ABMs. We test the hypothesis that RL agents are effective as utility-maximizing agents in policy ABMs. We also address the problem of adapting RL algorithms to handle multi-agency in games by adapting and extending methods from recent literature. We evaluate the performance of such RL-based ABM agents via experiments on two policy-relevant ABMs: a minority game ABM, and an ABM of influenza transmission. We run some analytic experiments on our AI-equipped ABMs, e.g. explorations of the effects of behavioral heterogeneity in a population and the emergence of synchronization in a population. The experiments show that RL behavioral models are effective at producing reward-seeking or reward-maximizing behaviors in ABM agents. Furthermore, RL behavioral models can learn to outperform the default adaptive behavioral models in the two ABMs examined. These results suggest that the RL formalism can be an efficient default abstraction for behavioral models in ABMs. The core of the argument is that the RL formalism cleanly and efficiently represents reward-seeking behavior in intelligent agents. Furthermore, the RL formalism allows modelers to specify behavioral models in a way that is agnostic to the agents' detailed internal structure of decision-making, which is often unidentifiable.
Agent-based models (ABMs) are useful for analyzing the dynamics of complex social systems [1, 2]. In its most abstracted form, an ABM consists of virtual agents interacting with each other in a virtual environment. ABMs are natural tools for policy analysis. A good ABM can enable policy analysts and decision-makers to explore the potential macro-effects of diverse policy interventions on a population of real-world agents.

Modelers will typically equip each agent in an ABM with individual behavioral models or rules that determine how the agent behaves in response to its local virtual environment. These behavioral models are intended to be plausible approximations of real-world individual decision-making. Simple and intuitive behavioral rules can result in a wide range of emergent macroscale phenomena that are not intuitively predictable from the behavior of each agent [3]. Some simple behavioral models may produce observed behaviors that do not fit the standard rational actor model or other assumptions of utility-maximizing decision-making (and the multi-agent corollary, the Nash equilibrium).

This work explores a solution to the problem of specifying agent behavioral models that implement the agent's identified utility using the reinforcement learning (RL) formalism. The use of the RL formalism may also increase the range of policy research questions ABMs can be used to address. We suggest that the RL formalism may be an efficient default abstraction for specifying behavioral models in ABMs.
The analytic promise of agent-based modeling is achievable only if the ABMs are valid representations of real-world behaviors. The validity of an ABM depends strongly on the plausibility of the agents' defined behavioral models. This is especially true when ABMs are applied to inform real-world policy decision-making. Small changes to behavioral models can sometimes lead to macro-predictions that are qualitatively very different. And it is not always clear which behavioral model should apply, since the behavioral models are not necessarily validated with data. This is especially poignant when a particular rule set leads to some agents behaving obviously sub-optimally and this sub-optimal behavior drives observed macro outcomes.

There are a number of traditional approaches that aim to assure the validity of ABM behavioral models. The most direct is to base behavioral models on data-driven regression models that relate state or environmental factors to the agents' decision outcomes or profiles. This approach is prominent in microsimulation modeling. There are assumptions embedded in this practice that may undermine the validity of the approach, e.g. stationarity (i.e. the patterns of behaviors in observed training data will persist) and causality (i.e. that the observational data samples sufficiently capture the causal structure of agent decision-making).
Cognitive architecture modeling offers another approach to developing valid behavioral models. Here, the modeler specifies a psychologically plausible model of agent decision-making for the ABM (sometimes informed by survey data). Each agent then simulates a private version of the identified cognitive architecture when making decisions in the ABM environment. Cognitive architectures have a long history in the field of artificial intelligence. Examples include the ACT-R and Soar architectures [4, 5]. This approach to decision modeling aims to be more causal and mechanistic than the data-driven microsimulation approach. But, arguably, this detailed lower-level modeling is disconnected from the level of inquiry the ABM is designed to support. The policy analyst often cares more about the representativeness of observed agent behaviors than about the specific mental calculi (often not even uniquely identifiable) that produce observed decisions.

Modelers need an efficient approach for specifying decision or behavioral models that is: representative of real-world causal decision-making; agnostic to the exact internal structure of agents' decision-making; and flexible or adaptable to changes in agents' local context over time.
We explore the use of RL policies as behavioral models for ABMs. In this approach, we specify or identify: what an ABM agent can observe, what they can do, and what they value in an environment. Given these pieces of information, we specify a generic-but-flexible policy model that the agents can tune using powerful learning algorithms applied to experience data gleaned from interaction with the ABM environment. Each agent tunes its decision policy to optimize how much reward it can expect to extract from its interactions with the environment. In essence, we implement utility-maximizing virtual agents in our ABMs, with the key generalization that agents can have diverse specifications for their private utilities. And we represent the agents' decision functions with differentiable models (neural networks) for ease of implementation and optimization. This approach abstracts away the agents' internal structure of decision-making. Each agent can continue to learn and adapt effectively even when the virtual environment is operating in a completely new regime. And, if we know and can represent the utility functions of real-world agents, then our policies can be representative of real-world decision-making. This approach could be preferable to the default assumption of ad-hoc rule-based agents in ABMs.

We developed proof-of-concept AI-based behavioral models in multiagent decision-making contexts. We focused on applying these models to policy-relevant ABMs. Applicable ABMs include ABMs of stock market, vaccination, taxation, and health insurance market behaviors. This line of inquiry is motivated by the observation that some policy-based ABMs are similar to, or simpler in game-theoretic structure than, video games. We target our efforts to explore the use of RL-based behavioral models for multiagent experiments with ABMs. The outcomes of our efforts are a set of ported ABM environments and templates for RL-based behavioral models that enable policy researchers to:

• Equip the agents in ABMs with flexible-capacity adaptive intelligence;
• Explore the effects of interaction between diverse sets of strategic behaviors (e.g. bounded rationality, collusion) in ABM populations; and

• Possibly learn behavioral policies that better explain empirically-observed behaviors (similar to Epstein's discussion of inverse explanations) [2].

We show the value of our efforts by running experiments to examine the effects of RL behavioral models in two policy-relevant ABMs: the minority game and the flu vaccination game. The basic research questions we aim to answer in these experiments include:

• Examining the effects of heterogeneous decision models (RL vs. default heuristics) in an ABM;

• Examining the effects of differing levels of agents' intelligence and/or memory on decision-making in an ABM; and

• Examining the capacity for populations of intelligent agents to learn coordination or synchronization under conditions of limited or no direct communication.

The rest of this discussion has the following structure. The next section (2) outlines key design and modeling decisions we made in the process of developing our AI-equipped ABMs, both in the single-agent and the multiagent setups. Sections 3 and 4 present the experiments on the two ABMs.

We developed a collection of policy-relevant multiagent environments and implementations of single and multiagent RL algorithms. The goal was to connect both lines of development to produce policy-focused ABMs that feature RL-based decision-making agents. The rest of this section explores key details in these efforts. We identified some new and pre-existing policy-focused ABMs that are relevant for our purposes. These include a flu vaccination ABM [6, 7, 8] and a tax-evasion ABM [9]. They are listed roughly in increasing order of model complexity. We decided to start with a simpler ABM that captures key policy-relevant behavior: the Minority Game ABM [10, 11], which is based on the El Farol Bar problem [12]. It represents a simplified, abstracted version of the stochastic game at the heart of the flu and tax evasion models. Section 3 discusses this ABM in more detail.

We encapsulate each ABM using standardized gym environment interfaces. This abstraction effectively separates the mechanics of the ABM from the behavioral models of the agents. Each environment wrapper implements (using Figure 1 as background context): (s_{t+1}, r_t, done) = env.step(a_t) and s_0 = env.reset(). Given a goal within the virtual environment, we specify a reward function such that achieving the environment goal maximizes the agent's reward. We equip an agent with a neural-network-based model and statistical learning algorithms to enable the agent to adapt its behavioral policy to maximize the reward function [13]. In this setup, it is also possible to model complex or strategic goals like maximizing the volatility of a market or a stock price, or optimizing the explanatory or predictive power of agents' policies for observed real policies [2]. Figure 1 depicts key components and interactions in the environment and agent modules. The agents implement: a behavioral policy model a_t = agent.act(s_t), an experience buffer, and play/work/train functions.

The basic template for the agent's decision model, the embodied architecture, is modular, based on a basic actor-critic architecture [15]. We have flexibility on what kinds of learning algorithms we can apply to this basic architecture. And we can always augment the architecture with auxiliary intelligence artifacts (e.g.
see the central Q-function in the MAC algorithm, Algorithm 1 in Appendix B). We default to using policy-gradient adaptation algorithms on actor-critic-style models for our agent behavioral models. This approach is more flexible for representing mixed-strategy policies. And that flexibility in policy structure will be crucial in the multiagent RL (MARL) context.

The name "embodied architecture" is a nod towards the concept of embodied cognition, wherein an agent's cognition is strongly influenced or determined by how the agent senses, reasons about, and acts in its surrounding environment [14].
Figure 1: Schematic of interaction components between an agent and the ABM environment (arrows extending out of boxes denote exposed signals).
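To make this interface concrete, the following is a minimal Python sketch of the gym-style separation between environment mechanics and agent behavior. The class names, the toy dynamics, and the random placeholder policy are hypothetical illustrations, not the environments used in this work; only the step/reset/act signatures mirror the interface described above.

```python
import numpy as np

class MinimalABMEnv:
    """Toy gym-style ABM wrapper: the state is the last three winning outcomes."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.state = None

    def reset(self):
        self.state = self.rng.integers(0, 2, size=3)    # s_0
        return self.state

    def step(self, action):
        target = int(self.state.sum() < 2)              # stand-in for the ABM's aggregation rule
        reward = 1.0 if action == target else 0.0       # r_t
        self.state = np.append(self.state[1:], target)  # s_{t+1}
        done = False
        return self.state, reward, done

class RandomAgent:
    """Placeholder behavioral policy exposing a_t = agent.act(s_t) and an experience buffer."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.experience = []

    def act(self, state):
        return int(self.rng.integers(0, 2))

env, agent = MinimalABMEnv(), RandomAgent()
s = env.reset()
for t in range(10):                                     # the play/work loop
    a = agent.act(s)
    s_next, r, done = env.step(a)
    agent.experience.append((s.copy(), a, r, s_next.copy()))
    s = s_next
```

The experience tuples collected in the loop are the raw material that the train functions consume when tuning the agent's policy.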
The multi-agent RL (MARL) context has more complexity than the single-agent control problem described so far. We need to make adaptations to the learning models and algorithms to account for multi-agency [16, 17, 18]. MARL algorithms are often generalizations of single-agent RL algorithms [18, 16]. But this generalization strategy is inadequate because the convergence of the single-agent control algorithms described assumes that the agent is interacting with a somewhat stationary environment. Multi-agent contexts will tend to violate stationarity [19]. Furthermore, the optimal strategy for one member of a population depends on the current strategies of the rest of the population. We may need to replace the concept of an independent optimal strategy with the concept of an evolutionarily stable strategy (ESS) [20, 21] instead.

The violation of basic assumptions (stationarity, the Markov frame, independence of agent rewards, the existence of stable unconditional policy targets) can be severe enough to render traditionally powerful learning algorithms unstable in multiagent environments [19, 18]. We implement a Multi-agent Actor-Critic (MAC) algorithm based on [19] to address the non-stationarity issue in our models. The MAC algorithm furnishes each agent with access to a central global critic, but only during training. Our agents revert to independent adaptation after deployment. Examples may help illustrate some of the key MARL-specific concerns in more concrete terms. For example, competing firms must consider the rigidities in the pricing of other firms and how adaptive other firms are to changes in the market. As another example, computer network defense teams must consider the sophistication and tools available to their adversaries, especially knowing that the adversaries are actively trying to evade detection.

We model the multiagent environment in a couple of ways. The first is the simplest setup: the independent population model [22]. Each RL-driven agent in the population has a standalone embodied model with which it interacts with the shared environment. In this model, there is no internal communication or direct private transfer of information among agents. Inter-agent communication is still possible via stigmergy [23] if the observed state variables are sufficiently detailed. This setup is good for modeling fully competitive games with no explicit collusion allowed. But this independent setup is often imperfect, both as a model of real-world behaviors and for robust multiagent learning dynamics [22]. The second setup is to enable some amount of communication and coordination across the agent population. Communication can be in the form of sharing state information, sharing private decision propensities, or network weight-sharing.
We deploy RL models in the classic ABM known as the minority game. The minority game is a classic environment for studying adaptive and learning agents. The game generalizes the El Farol Bar problem. Specifically, N agents must decide whether to go to the El Farol bar or to stay home. If more than half of the agents go to the bar, then the bar is too crowded, and the agent would be better off staying home. On the other hand, if fewer than half of the agents go to the bar, then the best action is to go to the bar because it is lively and not too crowded. Thus, the minority game is named as such because each agent prefers to be in the minority group.

The minority game has been extensively studied, and prior work provides us with a benchmark for how well the RL agents are performing. This is important because it is often difficult to know when an RL algorithm converges to a "good" local optimum. Prior results in the minority game will serve as a guidepost. Our minority game implementation
parallels the literature [24], except that we enable the use of an RL-based behavioral model for one or more agents in the population of N players. In each time step, agents make a binary decision, which we represent by choosing either 0 or 1. At the end of the time step, the agents observe the minority and majority group. That is, the agents observe whether there were more agents that chose 0 or more agents that chose 1. Crucially, the agents do not observe how many other agents chose each option; they only observe which choice was selected by fewer than half of the agents, which we call the minority group. Each agent is endowed with a memory parameter, m, that determines how many past realizations of the game the agent considers when making its next decision. For example, if m=3, the agent bases its binary decision at time t on its observations at times t − 1, t − 2, and t − 3. A strategy is a mapping from length-m histories to {0, 1}. That is, a strategy specifies an action for all possible length-m memories. For example, if m=2, a strategy would specify an action for each of the possible memories in {(0,0), (0,1), (1,0), (1,1)}. At the start of the game, each agent draws k random strategies with replacement. Each agent is seeded with an initial strategy. Also, a random initial history is drawn uniformly at random. At each time step, an agent selects the action specified by the strategy with the most accumulated points. A strategy accumulates a point in each time step if the action specified by the strategy would have put the agent in the minority group.

In the model, the RL agents do not choose their strategy like the basic agents. Instead, the RL-based agents choose their actions based on inference via tuned neural networks. To control for the RL agent's memory, we change the number of inputs into the neural network. For example, if we want to model the case where the RL agent chooses its action conditional on the previous 3 minority groups, then the input to the deep neural network is a three-dimensional binary vector. Later, we discuss the possibility of modeling an agent's memory with recurrent neural networks. We train the RL agents in response to their experience in the game using a variety of policy gradient algorithms (REINFORCE, REINFORCE with baseline, and Actor-Critic).
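To illustrate the default (non-RL) behavioral model just described, here is a minimal sketch of a strategy-table agent and one round of the game. The class and helper names are hypothetical, and the parameter values simply mirror the text.

```python
import numpy as np

rng = np.random.default_rng(0)

class BasicAgent:
    def __init__(self, m=2, k=2):
        self.m = m
        # Each strategy maps each of the 2**m possible histories to an action in {0, 1}.
        self.strategies = rng.integers(0, 2, size=(k, 2 ** m))
        self.scores = np.zeros(k)

    def _index(self, history):
        # Encode the last m winning groups (0/1 values) as an integer index.
        return int("".join(map(str, history)), 2)

    def act(self, history):
        best = int(np.argmax(self.scores))          # play the highest-scoring strategy
        return int(self.strategies[best, self._index(history)])

    def update(self, history, minority_group):
        # Every strategy that would have put the agent in the minority earns a point.
        self.scores += (self.strategies[:, self._index(history)] == minority_group)

# One round of the game with N basic agents sharing a history of length m.
N, m = 301, 2
agents = [BasicAgent(m=m, k=2) for _ in range(N)]
history = list(rng.integers(0, 2, size=m))          # random initial history
actions = [ag.act(history) for ag in agents]
minority = int(sum(actions) > N / 2) ^ 1            # the choice picked by fewer than half wins
for ag in agents:
    ag.update(history, minority)
history = history[1:] + [minority]
```

In the experiments below, one or more of these strategy-table agents are replaced by agents whose actions come from a tuned neural network instead.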
We train a reinforcement learner to play against N − 1 basic agents. In the experiments, N=301, and each basic agent has 2 strategies with a memory of 2. The number of agents in each group is periodic and follows a perfectly predictable pattern. This implies that the winning group is also periodic and follows a fixed repeating pattern. Therefore, with sufficient memory, a reinforcement learner should be able to perfectly predict the minority group simply by observing the history of winners. To test this, we train a policy gradient neural network with 4 hidden layers and 20 nodes per layer in the minority game environment. To aid in computation, the RL agent's look-ahead is limited to 5 time steps. The input to the neural network is the last 3 winning groups, which is one greater than the basic agents' memory. Figure 2 shows that the RL agent does indeed learn the optimal strategy and correctly guesses the minority group each time.
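The single RL agent's behavioral model can be sketched roughly as follows (PyTorch, for illustration). The layer sizes follow the text; the optimizer, learning rate, and the use of a mean-return baseline are assumptions for the sketch rather than the exact experimental settings.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps the last 3 winning groups (0/1 each) to a distribution over the two choices."""

    def __init__(self, memory=3, hidden=20, layers=4):
        super().__init__()
        dims = [memory] + [hidden] * layers
        blocks = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.ReLU()]
        blocks += [nn.Linear(hidden, 2)]
        self.net = nn.Sequential(*blocks)

    def forward(self, history):
        return torch.distributions.Categorical(logits=self.net(history))

policy = PolicyNet()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(histories, actions, rewards):
    """One REINFORCE step on a batch of (history, action, reward) tuples from the game."""
    dist = policy(torch.as_tensor(histories, dtype=torch.float32))
    log_probs = dist.log_prob(torch.as_tensor(actions))
    returns = torch.as_tensor(rewards, dtype=torch.float32)
    loss = -(log_probs * (returns - returns.mean())).mean()  # mean return as a crude baseline
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The same network template is reused in the later experiments; only the number of inputs changes when the agent's memory is varied.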
Figure 2: Time series of attendees for 500 time steps, 301 agents each with 2 strategies and a memory of 2.
A crucial feature of this experiment is that for each of the 10 trials, the RL agent is trained against a different set of agents, but the set of agents stays constant within each of the 10 trials. In other words, in the first trial, we sample a set of 300 basic agents and train a neural network against those 300 agents using 400 training steps (recall there is randomness in the agents when drawing their strategies). In the second trial, we sample a different set of 300 agents and retrain. Therefore, Figure 2 shows that an RL agent plays optimally against the agents it trained against. However, a key question regarding generalization is then: how well does an RL learner perform against agents it did not train against? Figure 3 shows that the RL learner's behavior is not always optimal against a different randomly drawn population. This is especially surprising since every randomly drawn population with 2 strategies and a 2-time-step memory yields periodic and predictable patterns. However, the key point is that the precise pattern depends on the labeling of the groups, and the RL agent is sensitive to such labeling. In other words, the RL algorithm is not robust to relabeling.
Figure 3: A trained RL learner playing against 100 different randomly drawn populations
While an agent's performance trained on one population does not generalize, the next obvious question is: can an agent with enough memory to play perfectly against any given population be trained to play perfectly against all populations? To answer this, we once again train an RL agent with the same parameters. However, in this experiment we resample basic agents for each training iteration. In other words, the RL learner plays against a different population for every training iteration.
Figure 4: Two different rounds of training with a neural network with 3 time steps of memory. In each round, there were 400 training steps with an episode length of 500 time steps. The black line is a rolling mean with a window length of 20.
Figure 4 gives two examples of how the RL agent learns when it faces a different population of basic agents for each training iteration. The key insight is that the RL agent is not able to correctly choose the minority group every time, even though for a given population of basic agents, the sequence of minority groups is deterministic. This is because 3 observations in the past are not sufficient to determine which population the learning agent is playing against. Therefore, since the RL agent cannot condition its action on the full periodic sequence but can only use the last three observations, it cannot always correctly choose the minority group.
Figure 5: Training with an RL agent with 5 time steps of memory. In each round, there were 400 training steps with an episode length of 500 time steps. The low spikes are the cases where the initial distribution of agents yields ties.

Table 1: Improvements from training three minority game players (out of a population of ten players) using RL-based behavioral models. The remaining 7 players act according to the default behavioral model for the minority game.
Stated succinctly, having enough memory to play perfectly against any given draw of the initial distribution is not sufficient for an RL agent to play perfectly against all draws of the initial distribution. Since three units of memory are insufficient for the RL learner to predict the minority group in all cases, we extend the memory parameter. It can be shown that the RL learner cannot learn to play perfectly against all distributions of agents until it has a memory of length 5. Figure 5 shows that this is indeed the case. The troughs in the figure represent cases where the initial distribution of agents yields ties in the minority group, which changes the dynamics of the system. However, the probability of this occurring goes to zero as the number of agents gets larger and thus can safely be ignored.

We also explore some of the dynamics of the game when more than one agent in the population uses RL-based behavioral models. All agents interact in a shared minority game. And all agents make decisions on the basis of observable information (in this case, a fully shared history of recent winning choices). Agents using the default
behavioral model continue to make decisions based on their adaptive strategy books. RL agents make decisions based on the outcome of their stochastic policy neural networks. The reward signal in our setup is simply a binary signal (0/1) corresponding to individual loss or win. We assume no explicit collusion or direct communication within the RL agent sub-population during the test phase of the experiments. Preliminary results (Table 1) show that the multiagent RL sub-population is able to adapt to improve both its aggregate and individual rewards in the minority game. Figure 3.6 shows learning improvements in reaped rewards for an RL subpopulation of 3 agents playing in an overall population of size N=10. The game is structured with standard memory of length m=3 and 4 strategies per agent in the default behavior model. The second panel of the figure compares the agents' distribution of rewards pre- and post-RL training (601 training epochs using the MAC algorithm; see Algorithm 1 in Appendix B). The primary takeaway from our experiments is that the RL training algorithms are able to learn reward-seeking behavior. However, experiments comparing the performance of the multiagent RL behavioral model to the default behavioral model yield equivocal results.

In experiment 2, we use and modify an existing ABM that models the complex dynamical interplay between vaccination behavior, influenza epidemiology, and influenza prevention policies over a synthetic population that is representative of the population of Portland, OR. Our experiment replaces individual agents in the model with agents that use a different behavioral algorithm based on standard RL rules. This ABM builds on previous work [8, 6, 7]. These prior works contain high-level descriptions of the flu vaccine ABM we adapt here. Further details of the complete flu transmission ABM are provided in Appendix A.

Central to the ABM's dynamics is the assumption that personal and social-network experiences from past influenza seasons affect current decisions to get vaccinated, and thus influence the course of an epidemic. The model proceeds iteratively from one flu season to the next. Agents in the model represent individuals who determine whether or not they get vaccinated for the present season. Their collective decisions drive influenza epidemiology that, in turn, affects future behavior and decisions. The ABM environment is described by: (A) the behavioral model; (B) the social and contact network structures; and (C) the influenza transmission dynamics. The model simulates within-year influenza transmission dynamics on the network by means of a Susceptible-Infected-Recovered (SIR) model. The full model also considers additional complications and confounding factors, such as whether the agents are infected with other non-influenza influenza-like illnesses (niILI). The ABM's underlying network structure was informed by generating a reduced but statistically representative network with roughly ten thousand individuals taken from open datasets from the Network Dynamics and Simulation Science Laboratory (NDSSL). This dataset represents a synthetic population of the city of Portland, OR and its surroundings in an instance of a time-varying social contact network for a normal day, derived from daily activities.
The behavioral model specifies the way that agents belonging to the different outcome groups put different emphasis/weights on their personal experience and on local and global feedback information, and how they evaluate their choice. Agents are assumed to remember experiences from past years, and these determine how they change their vaccination behavior. Therefore, agents update their propensity to get vaccinated for the following season based on a weighted sum of their most recent evaluation and evaluations made in past years (discounted over time).

The ABM's default behavioral model was informed by different data sources. For example, to inform the behavioral model and quantify its relevant parameters, we use internet surveys on a random subsample of a longitudinal panel survey of a nationally-representative cross-section of the US. Specifically, the surveys were used to quantify how individuals' behavior towards vaccination changes based on key flu-related experiences and beliefs.

The behavioral model assumes that at the beginning of year n, an adult individual i will get vaccinated with probability w_n^{(i)} given by the convex-combination expression

w_n^{(i)} = β_a υ_n^{(i)} + β_b φ_n^{(i)} + β_c ψ^{(i)},   (1)

where the β values are convex coefficients that weight a number of factors that influence each agent's vaccination probability. Vaccination probability w_n^{(i)} is modeled as a Bernoulli mixture model with contributions from the following three factors:
• υ_n^{(i)}: the agent's intrinsic adaptation and learning. This includes signal contributions from its social network. This factor represents a weighted sum of past experiences with flu infections and vaccination. The weighting scheme lends more importance to the more recent experiences.

• φ_n^{(i)}: the influence of healthcare workers' (HCWs') recommendations on the agent's probability of getting vaccinated.

• ψ^{(i)}: a set of socio-economic, biological, and attitudinal factors (e.g., cost, medical predisposition, and convenience of obtaining vaccination) that are assumed to be stationary and relevant to vaccination decisions.

The full ABM considers all three factors. However, for the purpose of this model we decided to use a simplified version of the behavioral model, which considers a homogeneous population that makes its vaccination decisions exclusively based on the present and past evaluations of vaccination decisions and outcomes. As such, in this simplified model we only consider the agent's intrinsic adaptation and learning factor, and set the value β_a to 1. Hence, agents are not influenced by recommendations made by their physician. Moreover, the simplified model that we are considering here also removes other relevant details.

For all RL experiments on this model, the agents gain rewards by not being infected during a flu season. They observe their prior-year flu outcomes and healthcare worker recommendations. They act by either choosing vaccination or not.
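As a small, hedged illustration of Eq. (1) and the simplified setting (β_a = 1), the sketch below treats ψ and β_c as scalars and draws a single Bernoulli vaccination decision. The function names and parameter values are placeholders, not calibrated estimates from the surveys.

```python
import numpy as np

rng = np.random.default_rng(0)

def vaccination_probability(upsilon_n, phi_n=0.0, psi=0.0, betas=(1.0, 0.0, 0.0)):
    """Eq. (1): w_n = beta_a * upsilon_n + beta_b * phi_n + beta_c * psi.

    The default weights (1, 0, 0) reproduce the simplified model used here,
    where only intrinsic adaptation and learning matters; psi and beta_c are
    treated as scalars for illustration.
    """
    beta_a, beta_b, beta_c = betas
    assert abs(sum(betas) - 1.0) < 1e-9          # convex combination of the three factors
    return beta_a * upsilon_n + beta_b * phi_n + beta_c * psi

def decide_vaccination(upsilon_n, phi_n=0.0, psi=0.0, betas=(1.0, 0.0, 0.0)):
    """Bernoulli draw: the agent vaccinates this season with probability w_n."""
    return bool(rng.random() < vaccination_probability(upsilon_n, phi_n, psi, betas))

# Example: an agent whose intrinsic propensity for this season is 0.6.
print(decide_vaccination(upsilon_n=0.6))
```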
Figure 6: Comparing the performance of an RL agent in the Flu ABM environment. The RL model (a) is able to improve its outcomes after multiple training epochs (bottom panel), and (b) outperforms the default behavioral model on average (n=100 seasons) by a couple of percentage points (top panel).

Before running our experiments in the ABM of influenza vaccination, we ran the simulation model with no RL agents until the dynamics of the aggregated macro quantities reached a stationary state. In this state our model has reached a dynamic equilibrium of detailed balance, analogous to kinetic theory. At this point we sample one agent (or N agents) and change their behavioral model to that specified by the RL algorithm. Figure 6 shows the performance of the RL model after training (using a basic actor-critic learning model). The reward signal in this game is the number of seasons the agent goes through without contracting the flu. The basic finding from this experiment is that single-agent reward-driven learning is effective in this Flu ABM.
We run a set of experiments to examine how much influence an agent's degree centrality has on the effectiveness of its learning. We draw two separate subpopulations at random, each of size n = 40. The first sub-population is in the lower quartile of network degree centrality in the overall synthetic population. The second sub-population is in the top quartile
of network degree centrality. Both sub-populations are trained with the same algorithm (MAC) starting from identical policy templates. Then we examine how much the ensembles improved after training. Table 2 shows their improvements. The more basic result from this experiment is that multiagent reward-driven learning is effective in this Flu ABM. The results also suggest preliminary evidence in support of the hypothesis that low-degree-centrality agents are more effective at learning to avoid the flu compared to high-degree-centrality agents. This aligns with our intuitions, since higher-degree nodes have more contacts and thus more paths for infection. But the experiment needs more data to test the hypothesis further.

We also explored the question of synchronization in this experiment. We are concerned with the emergence of synchronized behavior in ensembles of non-communicating learning agents. We examine the evidence for this by calculating an average correlation matrix for the observed actions from an ensemble of trained agents. Any synchronization (positive or negative) between agents would be exposed by high-magnitude correlation matrix entries or bands. No synchronization was evident in this experiment (and others), since the observed correlations remain small in magnitude.

Our most basic finding is that single agents, playing in either of the test policy ABMs, exhibit reward-seeking behavior and can even learn optimal policies. There are caveats, including issues with generalization and policy model memory or capacity (which may be addressable using recurrent networks). The multiagent experiments on both ABMs again support our motivating hypothesis that RL behavioral models are able to reproduce adaptive reward-seeking behavior even in a complex ABM like the flu model. And the RL subpopulations can significantly outperform subpopulations that deployed the default behavioral model. One observation was that naive MARL algorithms are not as effective at learning good behaviors. This is in line with the current MARL literature. We adapt the MAC algorithm to address pathologies specific to multiagent games.

The RL-based behavioral models can also compare favorably with default behavioral models in ABMs. The experimental results on this point are not as unequivocal. The effect sizes in these experiments are smaller. This is likely because the outcome of such comparisons is rather path-dependent and sensitive to initial conditions. These results suggest that the RL formalism can be an efficient default abstraction for behavioral models in ABMs. The RL formalism cleanly and efficiently represents reward-seeking behavior in intelligent agents. Furthermore, the RL formalism allows modelers to specify behavioral models in a way that is agnostic to the agent's detailed internal structure of decision-making, which is often unidentifiable. Future work would explore more complex experiments on these ABMs; explore new ABMs; and develop relevant algorithms for behavioral adaptation in these ABMs (e.g. neuro-evolutionary learning algorithms [25]).
Table 2: Outcome improvements in flu avoidance post-RL-training for agent ensembles in low and high quartiles of degree centrality.
Appendix A: The Influenza Transmission Environment
In experiment 2, we modified and used an existing ABM that models the complex dynamical interplay between vaccination behavior, influenza epidemiology, and influenza prevention policies over a synthetic population that is representative of the population of Portland, OR. Our experiment replaces individual agents in the model with agents that use a different behavioral algorithm based on standard RL rules. This ABM builds on previous work [6, 7]. It is also being extended in ongoing NIH grant research.
Figure 7: Architecture of Influenza Vaccination Model.
Figure 7 gives a high-level view of the ABM. Central to the ABM is the assumption that personal and social-network experiences from past influenza seasons affect current decisions to get vaccinated, and thus influence the course of an epidemic. In more detail, the model proceeds iteratively from one flu season to the next. Agents in the model represent individuals who determine whether or not they get vaccinated for the present season. Depending on their choice, the efficacy of the vaccine, and the disease transmission dynamics, by the end of the season agents have either been infected or not. The model also considers additional complications and confounding factors, such as whether the agents are infected with other non-influenza influenza-like illnesses (niILI), leading them to believe that they might have caught influenza. The agent population can be classified into four outcome groups (according to whether or not they were vaccinated and whether or not they were infected).

Agents in different groups evaluate the outcomes of their choice in different ways, using their personal experience with the flu and the vaccine (i.e., whether they got vaccinated and whether they were infected, including close loved ones) and using both local and global feedback information. This information includes the proportion of individuals in their social network (i.e., local) and in the population (i.e., global) that they perceive as vaccinated and infected. Our behavioral model specifies the way that agents belonging to the different outcome groups put different emphasis/weights on their personal experience and local and global feedback information and how they evaluate their choice. Agents are assumed to remember experiences from past years, and these determine how they change their vaccination behavior. Therefore, agents update their propensity to get vaccinated for the following season based on a weighted sum of their most recent evaluation and evaluations made in past years (discounted over time).

The ABM's default behavioral model was informed by different data sources. First, to inform the behavioral model and quantify its relevant parameters, we use internet surveys on a random subsample of the RAND Corporation's American Life Panel (ALP), a longitudinal panel study of a nationally-representative cross-section of the US. Specifically, the surveys were used to quantify how individuals' behavior towards vaccination changes based on:

1. personal past experiences with catching the flu and getting vaccinated;

2. local effects due to observed experiences in the social network; and

3. global effects based on perceived prevalence of the flu and vaccination in the population.

The ABM depends on an underlying network structure used to model both how influenza transmission occurs as well as how vaccination behavior spreads. This network structure was informed by generating a reduced but statistically representative network with roughly ten thousand individuals taken from open datasets from the Network Dynamics and Simulation Science Laboratory (NDSSL). This dataset represents a synthetic population of the city of
Portland, OR and its surroundings in an instance of a time-varying social contact network for a normal day, derived from the daily activities.
Describing the Influenza ABM Environment
The ABM considers a population consisting of N individuals on a social network, each of whom every year considers protecting themselves and their dependent family members from seasonal influenza. In particular, they make decisions as to whether or not to get vaccinated against seasonal influenza. Their collective decisions drive influenza epidemiology that, in turn, affects future behavior and decisions. The population is stratified according to other relevant demographic and socio-economic characteristics, and individuals are connected by an overlaying social network structure. The ABM environment is described by: (A) the behavioral model; (B) the social and contact network structures; and (C) the influenza transmission dynamics. The model proceeds iteratively as follows. At the beginning of each influenza season, every adult individual decides whether or not to get vaccinated against the flu, depending on both their recent experiences with vaccination and those shared by their alters on the social network. The model then simulates influenza transmission dynamics on the network by means of a Susceptible-Infected-Recovered (SIR) model. As defined by the transmission model, an epidemic occurs every season depending on the achieved vaccination coverage (i.e., the proportion of individuals vaccinated) and whether infections span throughout (i.e., percolate) the network. At the end of the influenza season, individuals evaluate their new experiences and accordingly modify their vaccination probabilities for the subsequent year.

Figure 8 shows a schematic view of all the components in the default behavioral model. It assumes that at the beginning of year n, an adult individual i will get vaccinated with probability w_n^{(i)} given by the convex-combination expression

w_n^{(i)} = β_a υ_n^{(i)} + β_b φ_n^{(i)} + β_c ψ^{(i)},

where the β values are convex coefficients that weight a number of factors that influence each agent's vaccination probability. Vaccination probability w_n^{(i)} is modeled as a Bernoulli mixture model with contributions from the following three factors:

1. υ_n^{(i)}: the agent's intrinsic adaptation and learning. This includes signal contributions from its social network. This factor represents a weighted sum of past experiences with flu infections and vaccination. The weighting scheme lends more importance to the more recent experiences.

2. φ_n^{(i)}: the influence of healthcare workers' (HCWs') recommendations on the agent's probability of getting vaccinated.

3. ψ^{(i)}: a set of socio-economic, biological, and attitudinal factors (e.g., cost, medical predisposition, and convenience of obtaining vaccination) that are assumed to be stationary and relevant to vaccination decisions.

The β coefficients are assumed to vary across socio-demographic population strata and measure the relative importance of each of these behavioral processes. Notice that β_c is an array of regression coefficients. Individuals that are more likely to visit an HCW prior to or during the flu season will likely be more influenced by their HCW and thus have a higher value of β_b. Instead, individuals belonging to households with low income may be more influenced by the cost of obtaining vaccination than by HCWs' recommendations and their past experiences, and thus have a higher cost-related β_c. In the model, φ_n^{(i)} is determined by the influence of an HCW k in recommending that his patient i get vaccinated. HCWs follow guidelines in recommending vaccination.
However, their effort in recommending vaccination is determined by their own perceptions that, in part, change with time. The model assumes that the perceptions of HCWs are formed from the population-level cumulative incidence, the overall vaccination coverage, and the observation of the number of patients they visit every year with influenza.

The full ABM considers all three factors. However, for the purpose of this project we decided to use a simplified version of the behavioral model, which considers a homogeneous population that makes its vaccination decisions exclusively based on the present and past evaluations of vaccination decisions and outcomes. As such, in this simplified model we only consider the agent's intrinsic adaptation and learning factor, and set the value β_a to 1. Hence, agents are not influenced by recommendations made by their physician. Moreover, the simplified model that we are considering here also removes other details that are important in the real system. For example, as mentioned previously, the full model allows agents to be infected with a niILI, and they could think that they caught the flu when instead they caught a niILI (or vice-versa). These details are not considered in the simplified model.
Figure 8: Behavioral Model for Agents in Influenza ABM.
To model the agent's intrinsic adaptation and learning, at the end of season n − 1, each individual evaluates the choice of either being vaccinated or not. Their evaluation is quantified by a variable Δ_{n−1}^{(i)} ranging in [0, 1]. Its value is large when the individual perceives the benefit of having been vaccinated. In the full ABM, this depends on three evaluation criteria:

1. PE: personal and household evaluations;

2. SN: perceived local decisions and outcomes in their social network (SN); and

3. MsM: perceived global decisions and outcomes as reported by mainstream media (MsM).

The importance placed on each of these levels of evaluation depends on three weights ω_PE, ω_SN, and ω_MsM, which add to 1. In the simplified model, we remove the MsM evaluation criterion. Moreover, the PE evaluation criterion is simplified to depend only on personal decisions and outcomes, and not on household outcomes. Hence, the agent's intrinsic adaptation and learning only depends on the PE and the SN evaluation criteria.

An agent's PE generates a value Δ_PE^{(i)} ranging in [0, 1] which depends on four possible decision-outcome combinations, resulting from (i) whether or not they were vaccinated, and (ii) whether or not they were infected by the end of the season. For example, if agent i vaccinated but still caught the flu, his/her value for Δ_PE^{(i)} would be low and close to 0. This means that in the absence of any information other than his/her most recent decision and outcome with influenza and its vaccine, his/her propensity to vaccinate again is low. On the other hand, if s/he did not vaccinate and caught the flu, then Δ_PE^{(i)} would be very high and usually set equal to 1. This means that in the absence of any information, his/her propensity to vaccinate in the following season is very high. Similarly, an agent's SN evaluation generates a value Δ_SN^{(i)} ranging in [0, 1] which depends on the perceived proportion of alters in the social network that experienced each of the four possible decision-outcome combinations. For example, if a large majority of alters in agent i's social network did not vaccinate and caught the flu, the value of Δ_SN^{(i)} would be high and close to 1.

In our simplified model, the value of Δ_{n−1}^{(i)} is obtained by the weighted sum of Δ_PE^{(i)} and Δ_SN^{(i)}:

Δ_{n−1}^{(i)} = ω_PE Δ_PE^{(i)} + ω_SN Δ_SN^{(i)}.
Therefore, even if agent i did not vaccinate and did not catch the flu, leading to a low Δ_PE^{(i)} value, if his/her Δ_SN^{(i)} is high, s/he may decide to vaccinate in the following flu season. The interpretation here is that in the absence of any information from other past influenza seasons, the agent will judge how lucky they were not to have caught the flu in the present season by his/her perception of how many alters in his/her social network did not vaccinate and caught the flu. The actual values for Δ_PE^{(i)} and Δ_SN^{(i)} which result from the various decision-outcome combinations of the agent and his/her alters can be set by tunable parameter values.

The value υ_n^{(i)} describing an agent's intrinsic adaptation and learning depends on both the most recent flu season and its resulting value for Δ_{n−1}^{(i)}, as well as past flu seasons resulting in values for Δ_m^{(i)} for m < n − 1. Each individual remembers and weights previous outcomes and their evaluations with respect to the present outcome. The weighted sum of past years' evaluations defines individual i's pro-vaccination experience V_n^{(i)} for the present season and is given by

V_n^{(i)} = s V_{n−1}^{(i)} + Δ_{n−1}^{(i)}.

The parameter s ranges in [0, 1] and discounts the previous years' outcomes with respect to the outcome of the present year. When s is equal to 0, individuals completely ignore the outcomes of previous years. When s is equal to 1, individuals give equal weight to the outcomes from previous years as to the present outcome. For example, when s is equal to 1, and if Δ_{n−1}^{(i)} only took the discrete values 0 or 1, the value V_n^{(i)} would be an integer representing a tally of the number of years that vaccination benefitted or would have benefitted agent i. When s < 1, and for continuous values of Δ_{n−1}^{(i)} in [0, 1], V_n^{(i)} represents an exponentially weighted moving average of the evaluations Δ_n^{(i)}. Finally, the probability that agent i will use to decide whether or not to vaccinate in the next flu season, based only on the adaptive learning process, is found by normalizing V_n^{(i)} by its maximum possible value and is given by υ_n^{(i)} = V_n^{(i)} / N(s), where N(s) = (1 − s^n) / (1 − s).
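A minimal sketch of the adaptive-learning update just described, with illustrative variable names and placeholder weights: the season evaluation Δ blends the PE and SN terms, accumulates into V with discount s, and is normalized into the propensity υ.

```python
def evaluate_season(delta_pe, delta_sn, w_pe=0.7, w_sn=0.3):
    """Season evaluation: Delta_{n-1} = w_PE * Delta_PE + w_SN * Delta_SN (weights sum to 1)."""
    return w_pe * delta_pe + w_sn * delta_sn

def update_propensity(V_prev, delta_prev, n, s=0.8):
    """V_n = s * V_{n-1} + Delta_{n-1}; upsilon_n = V_n / N(s) with N(s) = (1 - s**n) / (1 - s)."""
    V_n = s * V_prev + delta_prev
    norm = (1.0 - s ** n) / (1.0 - s) if s < 1.0 else float(n)
    return V_n, V_n / norm

# Example: an unvaccinated agent who escaped infection (low Delta_PE) but whose alters
# mostly skipped vaccination and caught the flu (high Delta_SN).
V = 0.0
for season in range(1, 6):
    delta = evaluate_season(delta_pe=0.1, delta_sn=0.9)
    V, upsilon = update_propensity(V, delta, n=season)
    # upsilon is the probability of vaccinating at the start of the next season.
```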
ABM Influenza Transmission Model

Influenza transmission is based on an SIR model over the social network structure that considers both node and edge percolation. Individuals, represented as nodes on the network, begin in one of three states: susceptible, infected, or recovered. At the start of the season, some individuals decide to seek vaccination. This is determined according to the probabilities w_n^{(i)} described previously. Based on the vaccine efficacy, we deactivate edges from the network that connect to nodes representing immune individuals who chose to get vaccinated. The degree of protection conferred by an influenza vaccine is assumed to be 50% to 70% and can vary from season to season. The lower range in efficacy is applied to the elderly in our population. Figure 9 provides an illustration where blue nodes are immune due to vaccination.

Figure 9: SIR percolation model on the network.
During an influenza epidemic, not all individuals in the network that remain susceptible to infection will become infected. Depending on the influenza transmissibility, some individuals will come in contact with infectious individuals but still avoid infection. Therefore, in the model influenza transmission is allowed to propagate only through the active
edges of the network (shown in bold). An edge that connects two susceptible individuals i and j stays active with probability T_ij = 1 − exp(−λ_ij / γ), which depends on:

1. the influenza transmissibility rate (λ_ij), and

2. the duration for which either individual is infectious and continues interacting socially in the network (γ^{-1}).

The infectiousness duration depends on the probability that either individual is symptomatic and stays home in the event of being infected (i.e., social distancing). Influenza transmissibility, infectiousness duration, probability of being asymptomatic, and social distancing depend on the socio-demographic attributes of i and j and have been parameterized from the literature. The basic reproductive number resulting from this model is R_0 = ⟨⟨k⟩_m⟩ ⟨T_ij⟩, where ⟨T_ij⟩ is the average transmission probability in our resulting network.
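The edge-activation step can be sketched as follows. The function name and rate values are hypothetical placeholders; the sketch only illustrates that each susceptible-susceptible edge stays active with probability T_ij = 1 − exp(−λ_ij/γ), and that transmission is then simulated only along the active edges.

```python
import numpy as np

rng = np.random.default_rng(0)

def active_edges(edges, lam, gamma):
    """Keep each susceptible-susceptible edge (i, j) with probability T_ij = 1 - exp(-lam_ij / gamma)."""
    kept = []
    for (i, j) in edges:
        T_ij = 1.0 - np.exp(-lam[(i, j)] / gamma)
        if rng.random() < T_ij:
            kept.append((i, j))
    return kept

# Toy example: three contacts with placeholder transmissibility rates and mean infectious period 1/gamma.
edges = [(0, 1), (1, 2), (0, 2)]
lam = {(0, 1): 0.4, (1, 2): 0.1, (0, 2): 0.8}
print(active_edges(edges, lam, gamma=0.5))
```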
Appendix B: Statement of Modified MAC RL Algorithm

Here we outline the learning algorithm used in the ABM experiments. We take inspiration from [19] and implement an adapted RL algorithm that aims to address the complexities specific to multi-agent learning: the Multi-agent Actor-Critic (MAC) algorithm (see below). It is based on the same principle underlying the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm benchmarked in [19]. It aims to enable limited-time information-sharing via a globally-accessible full-observation oracle or central Q-critic during the training phase. The central Q-critic is more informative about the value of action-state pairs. The use of this augmented oracle leads to more stable learning. Algorithm 1 also distinguishes between information-sharing during the test vs. training phases. The individual agents' access to the central Q-critic is limited to just the training phase. For continued adaptation after model deployment, we can still update each agent's model using value estimates from its local critic instead of the central Q-critic. We will use the following notation as a compact way to denote a population variable: d⃗_t = {d_{i,t}}_{i=1}^N.
Algorithm 1: Variant on the Multiagent Actor-Critic (MAC) Algorithm

Input: {s_{i,t}, a_{i,t}, r_{i,t}, s_{i,t+1}}_{t=0}^{T}, episode experience tuples for N agents
repeat
  Initialize s_{i,0} for all i ∈ {1, ..., N}.
  for each i ∈ {1, ..., N} do
    for each t ∈ {0, ..., T − 1} do
      Compute the i-th advantage: Adv_{i,t} = r_{i,t} + γ Q_i(a⃗_{t+1}, s⃗_{t+1}; Φ_t) − Q_i(a⃗_t, s⃗_t; Φ_t)
      Apply the i-th actor's policy-gradient ascent update: Θ_{t+1} = Θ_t + α_π Adv_{i,t} ∇_Θ log π_i(a | s_t, Θ_t)
      if Training then
        Apply the central-Q gradient descent update: Φ_{t+1} = Φ_t − α_Q ∇_Φ Adv_{i,t}
      end if
      if Deployed then
        Apply the i-th critic's gradient ascent update using the local advantage:
          Adv(s_{i,t}) = r_{i,t} + γ v(s_{i,t+1}, ω_t) − v(s_{i,t}, ω_t)
          ω_{t+1} = ω_t + α_v Adv(s_{i,t}) ∇_ω v(s_{i,t}, ω_t)
      end if
    end for
  end for
until simulation ends
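A hedged Python sketch of how the updates in Algorithm 1 might be organized (PyTorch). The network sizes, optimizers, and the squared-advantage critic losses are assumptions made for illustration, not a reproduction of the experimental code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, STATE_DIM, N_ACTIONS, GAMMA = 3, 4, 2, 0.95

# One actor and one local critic per agent; one central Q-critic over joint states and actions.
actors = [nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS)) for _ in range(N)]
local_v = [nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, 1)) for _ in range(N)]
central_q = nn.Sequential(nn.Linear(N * (STATE_DIM + N_ACTIONS), 64), nn.ReLU(), nn.Linear(64, N))

opt_actor = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in actors]
opt_local = [torch.optim.Adam(v.parameters(), lr=1e-3) for v in local_v]
opt_central = torch.optim.Adam(central_q.parameters(), lr=1e-3)

def joint_input(states, actions):
    """Concatenate all agents' states and one-hot actions for the central Q-critic."""
    return torch.cat([states.flatten(), F.one_hot(actions, N_ACTIONS).float().flatten()])

def mac_step(i, states, actions, rewards, next_states, next_actions, training=True):
    """One update for agent i from a single joint transition (training vs. deployed phase)."""
    if training:
        q_now = central_q(joint_input(states, actions))[i]
        q_next = central_q(joint_input(next_states, next_actions))[i]
        adv = rewards[i] + GAMMA * q_next.detach() - q_now
        opt_central.zero_grad()
        (adv ** 2).backward()                      # central-Q update on the TD error (assumed loss)
        opt_central.step()
    else:
        v_now = local_v[i](states[i])
        v_next = local_v[i](next_states[i])
        adv = rewards[i] + GAMMA * v_next.detach() - v_now
        opt_local[i].zero_grad()
        (adv ** 2).sum().backward()                # local critic update after deployment
        opt_local[i].step()
    logits = actors[i](states[i])
    log_prob = torch.distributions.Categorical(logits=logits).log_prob(actions[i])
    actor_loss = -(adv.detach() * log_prob).sum()  # policy-gradient ascent for agent i
    opt_actor[i].zero_grad()
    actor_loss.backward()
    opt_actor[i].step()

# Example joint transition with placeholder data: states are (N, STATE_DIM) floats, actions are (N,) ints.
s, s2 = torch.randn(N, STATE_DIM), torch.randn(N, STATE_DIM)
a, a2 = torch.randint(0, N_ACTIONS, (N,)), torch.randint(0, N_ACTIONS, (N,))
r = torch.rand(N)
for i in range(N):
    mac_step(i, s, a, r, s2, a2, training=True)
```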
References

[1] Lawrence Blume. Agent-based models for policy analysis. In Assessing the Use of Agent-Based Models for Tobacco Regulation. National Academies Press (US), 2015.

[2] Joshua M Epstein. Generative Social Science: Studies in Agent-Based Computational Modeling. Princeton University Press, 2006.

[3] Thomas C Schelling. Models of segregation. The American Economic Review, 59(2):488–493, 1969.

[4] Todd R Johnson. Control in ACT-R and Soar. In Proceedings of the Nineteenth Annual Conference of the Cognitive Science Society, pages 343–348, 1997.

[5] John R Anderson. ACT: A simple theory of complex cognition. American Psychologist, 51(4):355, 1996.

[6] Sarah A Nowak, Luke Joseph Matthews, and Andrew M Parker. A general agent-based model of social learning. Rand Health Quarterly, 7(1), 2017.

[7] Raffaele Vardavas, Romulus Breban, and Sally Blower. Can influenza epidemics be prevented by voluntary vaccination? PLoS Computational Biology, 3(5), 2007.

[8] Raffaele Vardavas and Christopher Steven Marcum. Modeling influenza vaccination behavior via inductive reasoning games. In Modeling the Interplay Between Human Behavior and the Spread of Infectious Diseases, pages 203–227. Springer, 2013.

[9] Raffaele Vardavas, Pavan Katkar, Andrew M Parker, Gursel Rafig oglu Aliyev, Marlon Graf, and Krishna B Kumar. RAND's interdisciplinary behavioral and social science agent-based model of income tax evasion: Technical report. 2019.

[10] Damien Challet, Matteo Marsili, Yi-Cheng Zhang, et al. Minority Games: Interacting Agents in Financial Markets. OUP Catalogue, 2013.

[11] Damien Challet, Alessandro Chessa, Matteo Marsili, and Yi-Cheng Zhang. From minority games to real markets. 2001.

[12] W Brian Arthur. Complexity in economic theory: Inductive reasoning and bounded rationality. The American Economic Review, 84(2):406–411, 1994.

[13] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[14] G Lakoff. Explaining embodied cognition results. Topics in Cognitive Science, 4(4):773–785, 2012.

[15] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

[16] Peter Stone and Manuela Veloso. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3):345–383, 2000.

[17] Karl Tuyls and Gerhard Weiss. Multiagent learning: Basics, challenges, and prospects. AI Magazine, 33(3):41–41, 2012.

[18] Lucian Buşoniu, Robert Babuška, and Bart De Schutter. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications-1, pages 183–221. Springer, 2010.

[19] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.

[20] J Maynard Smith and George R Price. The logic of animal conflict. Nature, 246(5427):15–18, 1973.

[21] Robert Axelrod and William Donald Hamilton. The evolution of cooperation. Science, 211(4489):1390–1396, 1981.

[22] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 4190–4203, 2017.

[23] Guy Theraulaz and Eric Bonabeau. A brief history of stigmergy. Artificial Life, 5(2):97–116, 1999.

[24] Radu Manuca, Yi Li, Rick Riolo, and Robert Savit. The structure of adaptive competition in minority games. Physica A: Statistical Mechanics and its Applications, 282(3-4):559–608, 2000.

[25] Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.