Policy-focused Agent-based Modeling using RL Behavioral Models
Osonde A. Osoba, Raffaele Vardavas, Justin Grana, Rushil Zutshi, Amber Jaycocks
A Preprint

June 11, 2020

Abstract
Agent-based Models (ABMs) are valuable tools for policy analysis. ABMs help analysts explore the emergent consequences of policy interventions in multi-agent decision-making settings. But the validity of inferences drawn from ABM explorations depends on the quality of the ABM agents' behavioral models. Standard specifications of agent behavioral models rely either on heuristic decision-making rules or on regressions trained on past data. Both prior specification modes have limitations. This paper examines the value of reinforcement learning (RL) models as adaptive, high-performing, and behaviorally-valid models of agent decision-making in ABMs. We test the hypothesis that RL agents are effective as utility-maximizing agents in policy ABMs. We also address the problem of adapting RL algorithms to handle multi-agency in games by adapting and extending methods from recent literature. We evaluate the performance of such RL-based ABM agents via experiments on two policy-relevant ABMs: a minority game ABM, and an ABM of influenza transmission. We run some analytic experiments on our AI-equipped ABMs, e.g. explorations of the effects of behavioral heterogeneity in a population and the emergence of synchronization in a population. The experiments show that RL behavioral models are effective at producing reward-seeking or reward-maximizing behaviors in ABM agents. Furthermore, RL behavioral models can learn to outperform the default adaptive behavioral models in the two ABMs examined. These results suggest that the RL formalism can be an efficient default abstraction for behavioral models in ABMs. The core of the argument is that the RL formalism cleanly and efficiently represents reward-seeking behavior in intelligent agents. Furthermore, the RL formalism allows modelers to specify behavioral models in a way that is agnostic to the agents' detailed internal structure of decision-making, which is often unidentifiable.
Agent-based models (ABMs) are useful for analyzing the dynamics of complex social systems [1, 2]. In its most abstracted form, an ABM consists of virtual agents interacting with each other in a virtual environment. ABMs are natural tools for policy analysis. A good ABM can enable policy analysts and decision-makers to explore the potential macro-effects of diverse policy interventions on a population of real-world agents.

Modelers will typically equip each agent in an ABM with individual behavioral models or rules that determine how the agent behaves in response to its local virtual environment. These behavioral models are intended to be plausible approximations of real-world individual decision-making. Simple and intuitive behavioral rules can result in a wide range of emergent macroscale phenomena that are not intuitively predictable from the behavior of each agent [3]. Some simple behavioral models may produce observed behaviors that do not fit the standard rational actor model or other assumptions of utility-maximizing decision-making (and the multi-agent corollary, the Nash equilibrium).

This work explores a solution to the problem of specifying agent behavioral models that implement the agent's identified utility using the reinforcement learning (RL) formalism. The use of the RL formalism may also increase the range of policy research questions ABMs can be used to address. We suggest that the RL formalism may be an efficient default abstraction for specifying behavioral models in ABMs.
The analytic promise of agent-based modeling is achievable only if the ABMs are valid representations of real-world behaviors. The validity of an ABM depends strongly on the plausibility of the agents' defined behavioral models. This is especially true when ABMs are applied to inform real-world policy decision-making. Small changes to behavioral models can sometimes lead to macro-predictions that are qualitatively very different. And it is not always clear which behavioral model should apply, since the behavioral models are not necessarily validated with data. This is especially poignant when a particular rule set leads to some agents behaving obviously sub-optimally and this sub-optimal behavior drives observed macro outcomes.

There are a number of traditional approaches that aim to assure the validity of ABM behavioral models. The most direct is to base behavioral models on data-driven regression models that relate state or environmental factors to the agents' decision outcomes or profiles. This approach is prominent in microsimulation modeling. There are assumptions embedded in this practice that may undermine the validity of the approach, e.g. stationarity (i.e. the patterns of behaviors in observed training data will persist) and causality (i.e. that the observational data samples sufficiently capture the causal structure of agent decision-making).
Cognitive architecture modeling offers another approach to developing valid behavioral models. Here, the modeler specifies a psychologically plausible model of agent decision-making for the ABM (sometimes informed by survey data). Each agent then simulates a private version of the identified cognitive architecture when making decisions in the ABM environment. Cognitive architectures have a long history in the field of artificial intelligence. Examples include the ACT-R and Soar architectures [4, 5]. This approach to decision modeling aims to be more causal and mechanistic than the data-driven microsimulation approach. But, arguably, this detailed lower-level modeling is disconnected from the level of inquiry the ABM is designed to support. The policy analyst often cares more about the representativeness of observed agent behaviors than about the specific mental calculi (often not even uniquely identifiable) that produce observed decisions.

Modelers need an efficient approach for specifying decision or behavioral models that is: representative of real-world causal decision-making; agnostic to the exact internal structure of agents' decision-making; and flexible or adaptable to changes in agents' local context over time.
We explore the use of RL policies as behavioral models for ABMs. In this approach, we specify or identify: what an ABM agent can observe, what they can do, and what they value in an environment. Given these pieces of information, we specify a generic-but-flexible policy model that the agents can tune using powerful learning algorithms applied to experience data gleaned from interaction with the ABM environment. Each agent tunes its decision policy to optimize how much reward it can expect to extract from its interactions with the environment. In essence, we implement utility-maximizing virtual agents in our ABMs, with the key generalization that agents can have diverse specifications for their private utilities. And we represent the agents' decision functions with differentiable models (neural networks) for ease of implementation and optimization. This approach abstracts away the agents' internal structure of decision-making. Each agent can continue to learn and adapt effectively even when the virtual environment is operating in a completely new regime. And, if we know and can represent the utility functions of real-world agents, then our policies can be representative of real-world decision-making. This approach could be preferable to the default assumption of ad-hoc rule-based agents in ABMs.

We developed proof-of-concept AI-based behavioral models in multiagent decision-making contexts. We focused on applying these models to policy-relevant ABMs. Applicable ABMs include ABMs of stock market, vaccination, taxation, and health insurance market behaviors. This line of inquiry is motivated by the observation that some policy-based ABMs are similar to, or simpler in game-theoretic structure than, video games. We target our efforts to explore the use of RL-based behavioral models for multiagent experiments with ABMs. The outcomes of our efforts are a set of ported ABM environments and templates for RL-based behavioral models that enable policy researchers to:

• Equip the agents in ABMs with flexible-capacity adaptive intelligence;
• Explore the effects of interaction between diverse sets of strategic behaviors (e.g. bounded rationality, collusion) in ABM populations; and

• Possibly learn behavioral policies that better explain empirically-observed behaviors (similar to Epstein's discussion of inverse explanations) [2].

We show the value of our efforts by running experiments to examine the effects of RL behavioral models in two policy-relevant ABMs: the minority game and the flu vaccination game. The basic research questions we aim to answer in these experiments include:

• Examining the effects of heterogeneous decision models (RL vs. default heuristics) in an ABM;

• Examining the effects of differing levels of agents' intelligence and/or memory on decision-making in an ABM; and

• Examining the capacity for populations of intelligent agents to learn coordination or synchronization under conditions of limited or no direct communication.

The rest of this discussion has the following structure. The next section (2) outlines key design and modeling decisions we made in the process of developing our AI-equipped ABMs, both in the single-agent and the multiagent setups. Sections 3 and 4 present the experiments on the two ABMs.

We developed a collection of policy-relevant multiagent environments and implementations of single and multiagent RL algorithms. The goal was to connect both lines of development to produce policy-focused ABMs that feature RL-based decision-making agents. The rest of this section explores key details in these efforts. We identified some new and pre-existing policy-focused ABMs that are relevant for our purposes. These include a flu vaccination ABM [6, 7, 8] and a tax-evasion ABM [9]. They are listed roughly in increasing order of model complexity. We decided to start with a simpler ABM that captures key policy-relevant behavior: the Minority Game ABM [10, 11], which is based on the El Farol Bar problem [12]. It represents a simplified, abstracted version of the stochastic game at the heart of the flu and tax evasion models. Section 3 discusses this ABM in more detail.

We encapsulate each ABM using standardized gym environment interfaces. This abstraction effectively separates the mechanics of the ABM from the behavioral models of the agents. Each environment wrapper implements (using Figure 1 as background context): (s_{t+1}, r_t, done) = env.step(a_t) and s_0 = env.reset(). Given a goal within the virtual environment, we specify a reward function such that achieving the environment goal maximizes the agent's reward. We equip an agent with a neural-network-based model and statistical learning algorithms to enable the agent to adapt its behavioral policy to maximize the reward function [13]. In this setup, it is also possible to model complex or strategic goals like maximizing the volatility of a market or a stock price, or optimizing the explanatory or predictive power of agents' policies for observed real policies [2]. Figure 1 depicts key components and interactions in the environment and agent modules. The agents implement: a behavioral policy model a_t = agent.act(s_t), an experience buffer, and play/work/train functions.

The basic template for the agent's decision model, the embodied architecture, is modular, based on a basic actor-critic architecture [15]. We have flexibility on what kinds of learning algorithms we can apply to this basic architecture. And we can always augment the architecture with auxiliary intelligence artifacts (e.g.
see the central Q-function in the MAC algorithm, Algorithm 1 in Appendix B). We default to using policy-gradient adaptation algorithms on actor-critic-style models for our agent behavioral models. This approach is more flexible for representing mixed-strategy policies. And that flexibility in policy structure will be crucial in the multiagent RL (MARL) context.

The name "embodied architecture" is a nod towards the concept of embodied cognition, wherein an agent's cognition is strongly influenced or determined by how the agent senses, reasons about, and acts in its surrounding environment [14].
Figure 1: Schematic of interaction components between an agent and the ABM environment (arrows extending out of boxes denote exposed signals).
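To make this interface concrete, the following is a minimal Python sketch of the gym-style separation between environment mechanics and agent behavior. The class names, the toy dynamics, and the random placeholder policy are hypothetical illustrations, not the environments used in this work; only the step/reset/act signatures mirror the interface described above.

```python
import numpy as np

class MinimalABMEnv:
    """Toy gym-style ABM wrapper: the state is the last three winning outcomes."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.state = None

    def reset(self):
        self.state = self.rng.integers(0, 2, size=3)    # s_0
        return self.state

    def step(self, action):
        target = int(self.state.sum() < 2)              # stand-in for the ABM's aggregation rule
        reward = 1.0 if action == target else 0.0       # r_t
        self.state = np.append(self.state[1:], target)  # s_{t+1}
        done = False
        return self.state, reward, done

class RandomAgent:
    """Placeholder behavioral policy exposing a_t = agent.act(s_t) and an experience buffer."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.experience = []

    def act(self, state):
        return int(self.rng.integers(0, 2))

env, agent = MinimalABMEnv(), RandomAgent()
s = env.reset()
for t in range(10):                                     # the play/work loop
    a = agent.act(s)
    s_next, r, done = env.step(a)
    agent.experience.append((s.copy(), a, r, s_next.copy()))
    s = s_next
```

The experience tuples collected in the loop are the raw material that the train functions consume when tuning the agent's policy.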
The multi-agent RL (MARL) context has more complexity than the single-agent control problem described so far. We need to make adaptations to the learning models and algorithms to account for multi-agency [16, 17, 18]. MARL algorithms are often generalizations of single-agent RL algorithms [18, 16]. But this generalization strategy is inadequate because the convergence of the single-agent control algorithms described assumes that the agent is interacting with a somewhat stationary environment. Multi-agent contexts will tend to violate stationarity [19]. Furthermore, the optimal strategy for one member of a population depends on the current strategies of the rest of the population. We may need to replace the concept of an independent optimal strategy with the concept of an evolutionarily stable strategy (ESS) [20, 21] instead.

The violation of basic assumptions (stationarity, the Markov frame, independence of agent rewards, the existence of stable unconditional policy targets) can be severe enough to render traditionally powerful learning algorithms unstable in multiagent environments [19, 18]. We implement a Multi-agent Actor-Critic (MAC) algorithm based on [19] to address the non-stationarity issue in our models. The MAC algorithm furnishes each agent with access to a central global critic, but only during training. Our agents revert to independent adaptation after deployment. Examples may help illustrate some of the key MARL-specific concerns in more concrete terms. For example, competing firms must consider the rigidities in the pricing of other firms and how adaptive other firms are to changes in the market. As another example, computer network defense teams must consider the sophistication and tools available to their adversaries, especially knowing that the adversaries are actively trying to evade detection.

We model the multiagent environment in a couple of ways. The first is the simplest setup: the independent population model [22]. Each RL-driven agent in the population has a standalone embodied model with which it interacts with the shared environment. In this model, there is no internal communication or direct private transfer of information among agents. Inter-agent communication is still possible via stigmergy [23] if the observed state variables are sufficiently detailed. This setup is good for modeling fully competitive games with no explicit collusion allowed. But this independent setup is often imperfect, both as a model of real-world behaviors and for robust multiagent learning dynamics [22]. The second setup is to enable some amount of communication and coordination across the agent population. Communication can be in the form of sharing state information, sharing private decision propensities, or network weight-sharing.
We deploy RL models in the classic ABM known as the minority game. The minority game is a classic environment for studying adaptive and learning agents. The game generalizes the El Farol Bar problem. Specifically, N agents must decide whether to go to the El Farol bar or to stay home. If more than half of the agents go to the bar, then the bar is too crowded, and the agent would be better off staying home. On the other hand, if fewer than half of the agents go to the bar, then the best action is to go to the bar because it is lively and not too crowded. Thus, the minority game is named as such because each agent prefers to be in the minority group.

The minority game has been extensively studied, and prior work provides us with a benchmark for how well the RL agents are performing. This is important because it is often difficult to know when an RL algorithm converges to a "good" local optimum. Prior results in the minority game will serve as a guidepost. Our minority game implementation
parallels the literature [24], except that we enable the use of an RL-based behavioral model for one or more agents in the population of N players. In each time step, agents make a binary decision, which we represent by choosing either 0 or 1. At the end of the time step, the agents observe the minority and majority group. That is, the agents observe whether there were more agents that chose 0 or more agents that chose 1. Crucially, the agents do not observe how many other agents chose each option; they only observe which choice was selected by fewer than half of the agents, which we call the minority group. Each agent is endowed with a memory parameter, m, that determines how many past realizations of the game the agent considers when making its next decision. For example, if m=3, the agent bases its binary decision at time t on its observations at times t − 1, t − 2, and t − 3. A strategy is a mapping from length-m histories to {0, 1}. That is, a strategy specifies an action for all possible length-m memories. For example, if m=2, a strategy would specify an action for each of the possible memories in {(0,0), (0,1), (1,0), (1,1)}. At the start of the game, each agent draws k random strategies with replacement. Each agent is seeded with an initial strategy. Also, a random initial history is drawn uniformly at random. At each time step, an agent selects the action specified by the strategy with the most accumulated points. A strategy accumulates a point in each time step if the action specified by the strategy would have put the agent in the minority group.

In the model, the RL agents do not choose their strategy like the basic agents. Instead, the RL-based agents choose their actions based on inference via tuned neural networks. To control for the RL agent's memory, we change the number of inputs into the neural network. For example, if we want to model the case where the RL agent chooses its action conditional on the previous 3 minority groups, then the input to the deep neural network is a three-dimensional binary vector. Later, we discuss the possibility of modeling an agent's memory with recurrent neural networks. We train the RL agents in response to their experience in the game using a variety of policy gradient algorithms (REINFORCE, REINFORCE with baseline, and Actor-Critic).
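To illustrate the default (non-RL) behavioral model just described, here is a minimal sketch of a strategy-table agent and one round of the game. The class and helper names are hypothetical, and the parameter values simply mirror the text.

```python
import numpy as np

rng = np.random.default_rng(0)

class BasicAgent:
    def __init__(self, m=2, k=2):
        self.m = m
        # Each strategy maps each of the 2**m possible histories to an action in {0, 1}.
        self.strategies = rng.integers(0, 2, size=(k, 2 ** m))
        self.scores = np.zeros(k)

    def _index(self, history):
        # Encode the last m winning groups (0/1 values) as an integer index.
        return int("".join(map(str, history)), 2)

    def act(self, history):
        best = int(np.argmax(self.scores))          # play the highest-scoring strategy
        return int(self.strategies[best, self._index(history)])

    def update(self, history, minority_group):
        # Every strategy that would have put the agent in the minority earns a point.
        self.scores += (self.strategies[:, self._index(history)] == minority_group)

# One round of the game with N basic agents sharing a history of length m.
N, m = 301, 2
agents = [BasicAgent(m=m, k=2) for _ in range(N)]
history = list(rng.integers(0, 2, size=m))          # random initial history
actions = [ag.act(history) for ag in agents]
minority = int(sum(actions) > N / 2) ^ 1            # the choice picked by fewer than half wins
for ag in agents:
    ag.update(history, minority)
history = history[1:] + [minority]
```

In the experiments below, one or more of these strategy-table agents are replaced by agents whose actions come from a tuned neural network instead.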
We train a reinforcement learner to play against N − 1 basic agents. In the experiments, N=301, and each basic agent has 2 strategies with a memory of 2. The number of agents in each group is periodic and follows a perfectly predictable pattern. This implies that the winning group is also periodic and follows a fixed repeating pattern. Therefore, with sufficient memory, a reinforcement learner should be able to perfectly predict the minority group simply by observing the history of winners. To test this, we train a policy gradient neural network with 4 hidden layers and 20 nodes per layer in the minority game environment. To aid in computation, the RL agent's look-ahead is limited to 5 time steps. The input to the neural network is the last 3 winning groups, which is one greater than the basic agents' memory. Figure 2 shows that the RL agent does indeed learn the optimal strategy and correctly guesses the minority group each time.
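The single RL agent's behavioral model can be sketched roughly as follows (PyTorch, for illustration). The layer sizes follow the text; the optimizer, learning rate, and the use of a mean-return baseline are assumptions for the sketch rather than the exact experimental settings.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps the last 3 winning groups (0/1 each) to a distribution over the two choices."""

    def __init__(self, memory=3, hidden=20, layers=4):
        super().__init__()
        dims = [memory] + [hidden] * layers
        blocks = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.ReLU()]
        blocks += [nn.Linear(hidden, 2)]
        self.net = nn.Sequential(*blocks)

    def forward(self, history):
        return torch.distributions.Categorical(logits=self.net(history))

policy = PolicyNet()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(histories, actions, rewards):
    """One REINFORCE step on a batch of (history, action, reward) tuples from the game."""
    dist = policy(torch.as_tensor(histories, dtype=torch.float32))
    log_probs = dist.log_prob(torch.as_tensor(actions))
    returns = torch.as_tensor(rewards, dtype=torch.float32)
    loss = -(log_probs * (returns - returns.mean())).mean()  # mean return as a crude baseline
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The same network template is reused in the later experiments; only the number of inputs changes when the agent's memory is varied.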
Figure 2: Time series of attendees for 500 time steps, 301 agents each with 2 strategies and a memory of 2.
A crucial feature of this experiment is that for each of the 10 trials, the RL agent is trained against a different set of agents, but the set of agents stays constant within each of the 10 trials. In other words, in the first trial, we sample a set of 300 basic agents and train a neural network against those 300 agents using 400 training steps (recall there is randomness in the agents when drawing their strategies). In the second trial, we sample a different set of 300 agents and retrain. Therefore, Figure 2 shows that an RL agent plays optimally against the agents it trained against. However, a key question regarding generalization is then: how well does an RL learner perform against agents it did not train against? Figure 3 shows that the RL learner's behavior is not always optimal against a different randomly drawn population. This is especially surprising since every randomly drawn population with 2 strategies and a 2-time-step memory yields periodic and predictable patterns. However, the key point is that the precise pattern depends on the labeling of the groups, and the RL agent is sensitive to such labeling. In other words, the RL algorithm is not robust to relabeling.
Figure 3: A trained RL learner playing against 100 different randomly drawn populations
While an agent's performance trained on one population does not generalize, the next obvious question is: can an agent with enough memory to play perfectly against any given population be trained to play perfectly against all populations? To answer this, we once again train an RL agent with the same parameters. However, in this experiment we resample basic agents for each training iteration. In other words, the RL learner plays against a different population for every training iteration.
Figure 4: Two different rounds of training with a neural network with 3 time steps of memory. In each round, there were 400 training steps with an episode length of 500 time steps. The black line is a rolling mean with a window length of 20.
Figure 4 gives two examples of how the RL agent learns when it faces a different population of basic agents for each training iteration. The key insight is that the RL agent is not able to correctly choose the minority group every time, even though for a given population of basic agents, the sequence of minority groups is deterministic. This is because 3 observations in the past are not sufficient to determine which population the learning agent is playing against. Therefore, since the RL agent cannot condition its action on the full periodic sequence but can only use the last three observations, it cannot always correctly choose the minority group.
Figure 5: Training with an RL agent with 5 time steps of memory. In each round, there were 400 training steps with an episode length of 500 time steps. The low spikes are the cases where the initial distribution of agents yields ties.

Table 1: Improvements from training three minority game players (out of a population of ten players) using RL-based behavioral models. The remaining 7 players act according to the default behavioral model for the minority game.
Stated succinctly, having enough memory to play perfectly against any given draw of the initial distribution is not sufficient for an RL agent to play perfectly against all draws of the initial distribution. Since three units of memory are insufficient for the RL learner to predict the minority group in all cases, we extend the memory parameter. It can be shown that the RL learner cannot learn to play perfectly against all distributions of agents until it has a memory of length 5. Figure 5 shows that this is indeed the case. The troughs in the figure represent cases where the initial distribution of agents yields ties in the minority group, which changes the dynamics of the system. However, the probability of this occurring goes to zero as the number of agents gets larger and thus can safely be ignored.

We also explore some of the dynamics of the game when more than one agent in the population uses RL-based behavioral models. All agents interact in a shared minority game. And all agents make decisions on the basis of observable information (in this case, a fully shared history of recent winning choices). Agents using the default
behavioral model continue to make decisions based on their adaptive strategy books. RL agents make decisions based on the outcome of their stochastic policy neural networks. The reward signal in our setup is simply a binary signal (0/1) corresponding to individual loss or win. We assume no explicit collusion or direct communication within the RL agent sub-population during the test phase of the experiments. Preliminary results (Table 1) show that the multiagent RL sub-population is able to adapt to improve both its aggregate and individual rewards in the minority game. Figure 3.6 shows learning improvements in reaped rewards for an RL subpopulation of 3 agents playing in an overall population of size N=10. The game is structured with standard memory of length m=3 and 4 strategies per agent in the default behavior model. The second panel of the figure compares the agents' distribution of rewards pre- and post-RL training (601 training epochs using the MAC algorithm; see Algorithm 1 in Appendix B). The primary takeaway from our experiments is that the RL training algorithms are able to learn reward-seeking behavior. However, experiments comparing the performance of the multiagent RL behavioral model to the default behavioral model yield equivocal results.

In experiment 2, we use and modify an existing ABM that models the complex dynamical interplay between vaccination behavior, influenza epidemiology, and influenza prevention policies over a synthetic population that is representative of the population of Portland, OR. Our experiment replaces individual agents in the model with agents that use a different behavioral algorithm based on standard RL rules. This ABM builds on previous work [8, 6, 7]. These prior works contain high-level descriptions of the flu vaccine ABM we adapt here. Further details of the complete flu transmission ABM are provided in Appendix A.

Central to the ABM's dynamics is the assumption that personal and social-network experiences from past influenza seasons affect current decisions to get vaccinated, and thus influence the course of an epidemic. The model proceeds iteratively from one flu season to the next. Agents in the model represent individuals who determine whether or not they get vaccinated for the present season. Their collective decisions drive influenza epidemiology that, in turn, affects future behavior and decisions. The ABM environment is described by: (A) the behavioral model; (B) the social and contact network structures; and (C) the influenza transmission dynamics. The model simulates within-year influenza transmission dynamics on the network by means of a Susceptible-Infected-Recovered (SIR) model. The full model also considers additional complications and confounding factors, such as whether the agents are infected with other non-influenza influenza-like illnesses (niILI). The ABM's underlying network structure was informed by generating a reduced but statistically representative network with roughly ten thousand individuals taken from open datasets from the Network Dynamics and Simulation Science Laboratory (NDSSL). This dataset represents a synthetic population of the city of Portland, OR and its surroundings in an instance of a time-varying social contact network for a normal day, derived from daily activities.
The behavioral model specifies the way that agents belonging to the different outcome groups put different emphasis/weights on their personal experience and on local and global feedback information, and how they evaluate their choice. Agents are assumed to remember experiences from past years, and these determine how they change their vaccination behavior. Therefore, agents update their propensity to get vaccinated for the following season based on a weighted sum of their most recent evaluation and evaluations made in past years (discounted over time).

The ABM's default behavioral model was informed by different data sources. For example, to inform the behavioral model and quantify its relevant parameters, we use internet surveys on a random subsample of a longitudinal panel survey of a nationally-representative cross-section of the US. Specifically, the surveys were used to quantify how individuals' behavior towards vaccination changes based on key flu-related experiences and beliefs.

The behavioral model assumes that at the beginning of year n, an adult individual i will get vaccinated with probability w_n^{(i)} given by the convex-combination expression

w_n^{(i)} = β_a υ_n^{(i)} + β_b φ_n^{(i)} + β_c ψ^{(i)},   (1)

where the β values are convex coefficients that weight a number of factors that influence each agent's vaccination probability. Vaccination probability w_n^{(i)} is modeled as a Bernoulli mixture model with contributions from the following three factors:
• υ_n^{(i)}: the agent's intrinsic adaptation and learning. This includes signal contributions from its social network. This factor represents a weighted sum of past experiences with flu infections and vaccination. The weighting scheme lends more importance to the more recent experiences.

• φ_n^{(i)}: the influence of healthcare workers' (HCWs') recommendations on the agent's probability of getting vaccinated.

• ψ^{(i)}: a set of socio-economic, biological, and attitudinal factors (e.g., cost, medical predisposition, and convenience of obtaining vaccination) that are assumed to be stationary and relevant to vaccination decisions.

The full ABM considers all three factors. However, for the purpose of this model we decided to use a simplified version of the behavioral model, which considers a homogeneous population that makes its vaccination decisions exclusively based on the present and past evaluations of vaccination decisions and outcomes. As such, in this simplified model we only consider the agent's intrinsic adaptation and learning factor, and set the value β_a to 1. Hence, agents are not influenced by recommendations made by their physician. Moreover, the simplified model that we are considering here also removes other relevant details.

For all RL experiments on this model, the agents gain rewards by not being infected during a flu season. They observe their prior-year flu outcomes and healthcare worker recommendations. They act by either choosing vaccination or not.
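As a small, hedged illustration of Eq. (1) and the simplified setting (β_a = 1), the sketch below treats ψ and β_c as scalars and draws a single Bernoulli vaccination decision. The function names and parameter values are placeholders, not calibrated estimates from the surveys.

```python
import numpy as np

rng = np.random.default_rng(0)

def vaccination_probability(upsilon_n, phi_n=0.0, psi=0.0, betas=(1.0, 0.0, 0.0)):
    """Eq. (1): w_n = beta_a * upsilon_n + beta_b * phi_n + beta_c * psi.

    The default weights (1, 0, 0) reproduce the simplified model used here,
    where only intrinsic adaptation and learning matters; psi and beta_c are
    treated as scalars for illustration.
    """
    beta_a, beta_b, beta_c = betas
    assert abs(sum(betas) - 1.0) < 1e-9          # convex combination of the three factors
    return beta_a * upsilon_n + beta_b * phi_n + beta_c * psi

def decide_vaccination(upsilon_n, phi_n=0.0, psi=0.0, betas=(1.0, 0.0, 0.0)):
    """Bernoulli draw: the agent vaccinates this season with probability w_n."""
    return bool(rng.random() < vaccination_probability(upsilon_n, phi_n, psi, betas))

# Example: an agent whose intrinsic propensity for this season is 0.6.
print(decide_vaccination(upsilon_n=0.6))
```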
Figure 6: Comparing the performance of an RL agent in the Flu ABM environment. The RL model (a) is able to improve its outcomes after multiple training epochs (bottom panel), and (b) outperforms the default behavioral model on average (n=100 seasons) by a couple of percentage points (top panel).

Before running our experiments in the ABM of influenza vaccination, we ran the simulation model with no RL agents until the dynamics of the aggregated macro quantities reached a stationary state. In this state our model has reached a dynamic equilibrium of detailed balance, analogous to kinetic theory. At this point we sample one agent (or N agents) and change their behavioral model to that specified by the RL algorithm. Figure 6 shows the performance of the RL model after training (using a basic actor-critic learning model). The reward signal in this game is the number of seasons the agent goes through without contracting the flu. The basic finding from this experiment is that single-agent reward-driven learning is effective in this Flu ABM.
We run a set of experiments to examine how much influence an agent's degree centrality has on the effectiveness of its learning. We draw two separate subpopulations at random, each of size n = 40. The first sub-population is in the lower quartile of network degree centrality in the overall synthetic population. The second sub-population is in the top quartile
of network degree centrality. Both sub-populations are trained with the same algorithm (MAC) starting from identical policy templates. Then we examine how much the ensembles improved after training. Table 2 shows their improvements. The more basic result from this experiment is that multiagent reward-driven learning is effective in this Flu ABM. The results also suggest preliminary evidence in support of the hypothesis that low-degree-centrality agents are more effective at learning to avoid the flu compared to high-degree-centrality agents. This aligns with our intuitions, since higher-degree nodes have more contacts and thus more paths for infection. But the experiment needs more data to test the hypothesis further.

We also explored the question of synchronization in this experiment. We are concerned with the emergence of synchronized behavior in ensembles of non-communicating learning agents. We examine the evidence for this by calculating an average correlation matrix for the observed actions from an ensemble of trained agents. Any synchronization (positive or negative) between agents would be exposed by high-magnitude correlation matrix entries or bands. No synchronization was evident in this experiment (and others), since the observed correlations remain small in magnitude.

Our most basic finding is that single agents, playing in either of the test policy ABMs, exhibit reward-seeking behavior and can even learn optimal policies. There are caveats, including issues with generalization and policy model memory or capacity (which may be addressable using recurrent networks). The multiagent experiments on both ABMs again support our motivating hypothesis that RL behavioral models are able to reproduce adaptive reward-seeking behavior even in a complex ABM like the flu model. And the RL subpopulations can significantly outperform subpopulations that deployed the default behavioral model. One observation was that naive MARL algorithms are not as effective at learning good behaviors. This is in line with the current MARL literature. We adapt the MAC algorithm to address pathologies specific to multiagent games.

The RL-based behavioral models can also compare favorably with default behavioral models in ABMs. The experimental results on this point are not as unequivocal. The effect sizes in these experiments are smaller. This is likely because the outcome of such comparisons is rather path-dependent and sensitive to initial conditions. These results suggest that the RL formalism can be an efficient default abstraction for behavioral models in ABMs. The RL formalism cleanly and efficiently represents reward-seeking behavior in intelligent agents. Furthermore, the RL formalism allows modelers to specify behavioral models in a way that is agnostic to the agent's detailed internal structure of decision-making, which is often unidentifiable. Future work would explore more complex experiments on these ABMs; explore new ABMs; and develop relevant algorithms for behavioral adaptation in these ABMs (e.g. neuro-evolutionary learning algorithms [25]).
Table 2: Outcome improvements in flu avoidance post-RL-training for agent ensembles in low and high quartiles of degree centrality.
Appendix A: The Influenza Transmission Environment
In experiment 2, we modified and used an existing ABM that models the complex dynamical interplay between vaccination behavior, influenza epidemiology, and influenza prevention policies over a synthetic population that is representative of the population of Portland, OR. Our experiment replaces individual agents in the model with agents that use a different behavioral algorithm based on standard RL rules. This ABM builds on previous work [6, 7]. It is also being extended in ongoing NIH grant research.
Figure 7: Architecture of Influenza Vaccination Model.
Figure 7 gives a high-level view of the ABM. Central to the ABM is the assumption that personal and social-network experiences from past influenza seasons affect current decisions to get vaccinated, and thus influence the course of an epidemic. In more detail, the model proceeds iteratively from one flu season to the next. Agents in the model represent individuals who determine whether or not they get vaccinated for the present season. Depending on their choice, the efficacy of the vaccine, and the disease transmission dynamics, by the end of the season agents have either been infected or not. The model also considers additional complications and confounding factors, such as whether the agents are infected with other non-influenza influenza-like illnesses (niILI), leading them to believe that they might have caught influenza. The agent population can be classified into four outcome groups (according to whether or not they were vaccinated and whether or not they were infected).

Agents in different groups evaluate the outcomes of their choice in different ways, using their personal experience with the flu and the vaccine (i.e., whether they got vaccinated and whether they were infected, including close loved ones) and using both local and global feedback information. This information includes the proportion of individuals in their social network (i.e., local) and in the population (i.e., global) that they perceive as vaccinated and infected. Our behavioral model specifies the way that agents belonging to the different outcome groups put different emphasis/weights on their personal experience and local and global feedback information and how they evaluate their choice. Agents are assumed to remember experiences from past years, and these determine how they change their vaccination behavior. Therefore, agents update their propensity to get vaccinated for the following season based on a weighted sum of their most recent evaluation and evaluations made in past years (discounted over time).

The ABM's default behavioral model was informed by different data sources. First, to inform the behavioral model and quantify its relevant parameters, we use internet surveys on a random subsample of the RAND Corporation's American Life Panel (ALP), a longitudinal panel study of a nationally-representative cross-section of the US. Specifically, the surveys were used to quantify how individuals' behavior towards vaccination changes based on:

1. personal past experiences with catching the flu and getting vaccinated;

2. local effects due to observed experiences in the social network; and

3. global effects based on perceived prevalence of the flu and vaccination in the population.

The ABM depends on an underlying network structure used to model both how influenza transmission occurs as well as how vaccination behavior spreads. This network structure was informed by generating a reduced but statistically representative network with roughly ten thousand individuals taken from open datasets from the Network Dynamics and Simulation Science Laboratory (NDSSL). This dataset represents a synthetic population of the city of
Portland, OR and its surroundings in an instance of a time-varying social contact network for a normal day, derived from the daily activities.
Describing the Influenza ABM Environment
The ABM considers a population consisting of N individuals on a social network, each of whom every year considers protecting themselves and their dependent family members from seasonal influenza. In particular, they make decisions as to whether or not to get vaccinated against seasonal influenza. Their collective decisions drive influenza epidemiology that, in turn, affects future behavior and decisions. The population is stratified according to other relevant demographic and socio-economic characteristics, and individuals are connected by an overlaying social network structure. The ABM environment is described by: (A) the behavioral model; (B) the social and contact network structures; and (C) the influenza transmission dynamics. The model proceeds iteratively as follows. At the beginning of each influenza season, every adult individual decides whether or not to get vaccinated against the flu, depending on both their recent experiences with vaccination and those shared by their alters on the social network. The model then simulates influenza transmission dynamics on the network by means of a Susceptible-Infected-Recovered (SIR) model. As defined by the transmission model, an epidemic occurs every season depending on the achieved vaccination coverage (i.e., the proportion of individuals vaccinated) and whether infections span throughout (i.e., percolate) the network. At the end of the influenza season, individuals evaluate their new experiences and accordingly modify their vaccination probabilities for the subsequent year.

Figure 8 shows a schematic view of all the components in the default behavioral model. It assumes that at the beginning of year n, an adult individual i will get vaccinated with probability w_n^{(i)} given by the convex-combination expression

w_n^{(i)} = β_a υ_n^{(i)} + β_b φ_n^{(i)} + β_c ψ^{(i)},

where the β values are convex coefficients that weight a number of factors that influence each agent's vaccination probability. Vaccination probability w_n^{(i)} is modeled as a Bernoulli mixture model with contributions from the following three factors:

1. υ_n^{(i)}: the agent's intrinsic adaptation and learning. This includes signal contributions from its social network. This factor represents a weighted sum of past experiences with flu infections and vaccination. The weighting scheme lends more importance to the more recent experiences.

2. φ_n^{(i)}: the influence of healthcare workers' (HCWs') recommendations on the agent's probability of getting vaccinated.

3. ψ^{(i)}: a set of socio-economic, biological, and attitudinal factors (e.g., cost, medical predisposition, and convenience of obtaining vaccination) that are assumed to be stationary and relevant to vaccination decisions.

The β coefficients are assumed to vary across socio-demographic population strata and measure the relative importance of each of these behavioral processes. Notice that β_c is an array of regression coefficients. Individuals that are more likely to visit an HCW prior to or during the flu season will likely be more influenced by their HCW and thus have a higher value of β_b. Instead, individuals belonging to households with low income may be more influenced by the cost of obtaining vaccination than by HCWs' recommendations and their past experiences, and thus have a higher cost-related β_c. In the model, φ_n^{(i)} is determined by the influence of an HCW k in recommending that his patient i get vaccinated. HCWs follow guidelines in recommending vaccination.
However, their effort in recommending vaccination is determined by their own perceptions that, in part, change with time. The model assumes that the perceptions of HCWs are formed from the population-level cumulative incidence, the overall vaccination coverage, and the observation of the number of patients they visit every year with influenza.

The full ABM considers all three factors. However, for the purpose of this project we decided to use a simplified version of the behavioral model, which considers a homogeneous population that makes its vaccination decisions exclusively based on the present and past evaluations of vaccination decisions and outcomes. As such, in this simplified model we only consider the agent's intrinsic adaptation and learning factor, and set the value β_a to 1. Hence, agents are not influenced by recommendations made by their physician. Moreover, the simplified model that we are considering here also removes other details that are important in the real system. For example, as mentioned previously, the full model allows agents to be infected with a niILI, and they could think that they caught the flu when instead they caught a niILI (or vice-versa). These details are not considered in the simplified model.
Figure 8: Behavioral Model for Agents in Influenza ABM.
To model the agent's intrinsic adaptation and learning, at the end of season n − 1, each individual evaluates the choice of either being vaccinated or not. Their evaluation is quantified by a variable Δ_{n−1}^{(i)} ranging in [0, 1]. Its value is large when the individual perceives the benefit of having been vaccinated. In the full ABM, this depends on three evaluation criteria:

1. PE: personal and household evaluations;

2. SN: perceived local decisions and outcomes in their social network (SN); and

3. MsM: perceived global decisions and outcomes as reported by mainstream media (MsM).

The importance placed on each of these levels of evaluation depends on three weights ω_PE, ω_SN, and ω_MsM, which add to 1. In the simplified model, we remove the MsM evaluation criterion. Moreover, the PE evaluation criterion is simplified to depend only on personal decisions and outcomes, and not on household outcomes. Hence, the agent's intrinsic adaptation and learning only depends on the PE and the SN evaluation criteria.

An agent's PE generates a value Δ_PE^{(i)} ranging in [0, 1] which depends on four possible decision-outcome combinations, resulting from (i) whether or not they were vaccinated, and (ii) whether or not they were infected by the end of the season. For example, if agent i vaccinated but still caught the flu, his/her value for Δ_PE^{(i)} would be low and close to 0. This means that in the absence of any information other than his/her most recent decision and outcome with influenza and its vaccine, his/her propensity to vaccinate again is low. On the other hand, if s/he did not vaccinate and caught the flu, then Δ_PE^{(i)} would be very high and usually set equal to 1. This means that in the absence of any information, his/her propensity to vaccinate in the following season is very high. Similarly, an agent's SN evaluation generates a value Δ_SN^{(i)} ranging in [0, 1] which depends on the perceived proportion of alters in the social network that experienced each of the four possible decision-outcome combinations. For example, if a large majority of alters in agent i's social network did not vaccinate and caught the flu, the value of Δ_SN^{(i)} would be high and close to 1.

In our simplified model, the value of Δ_{n−1}^{(i)} is obtained by the weighted sum of Δ_PE^{(i)} and Δ_SN^{(i)}:

Δ_{n−1}^{(i)} = ω_PE Δ_PE^{(i)} + ω_SN Δ_SN^{(i)}.
Therefore, even if agent i did not vaccinate and did not catch the flu, leading to a low Δ_PE^{(i)} value, if his/her Δ_SN^{(i)} is high, s/he may decide to vaccinate in the following flu season. The interpretation here is that in the absence of any information from other past influenza seasons, the agent will judge how lucky they were not to have caught the flu in the present season by his/her perception of how many alters in his/her social network did not vaccinate and caught the flu. The actual values for Δ_PE^{(i)} and Δ_SN^{(i)} which result from the various decision-outcome combinations of the agent and his/her alters can be set by tunable parameter values.

The value υ_n^{(i)} describing an agent's intrinsic adaptation and learning depends on both the most recent flu season and its resulting value for Δ_{n−1}^{(i)}, as well as past flu seasons resulting in values for Δ_m^{(i)} for m < n − 1. Each individual remembers and weights previous outcomes and their evaluations with respect to the present outcome. The weighted sum of past years' evaluations defines individual i's pro-vaccination experience V_n^{(i)} for the present season and is given by

V_n^{(i)} = s V_{n−1}^{(i)} + Δ_{n−1}^{(i)}.

The parameter s ranges in [0, 1] and discounts the previous years' outcomes with respect to the outcome of the present year. When s is equal to 0, individuals completely ignore the outcomes of previous years. When s is equal to 1, individuals give equal weight to the outcomes from previous years as to the present outcome. For example, when s is equal to 1, and if Δ_{n−1}^{(i)} only took the discrete values 0 or 1, the value V_n^{(i)} would be an integer representing a tally of the number of years that vaccination benefitted or would have benefitted agent i. When s < 1, and for continuous values of Δ_{n−1}^{(i)} in [0, 1], V_n^{(i)} represents an exponentially weighted moving average of the evaluations Δ_n^{(i)}. Finally, the probability that agent i will use to decide whether or not to vaccinate in the next flu season, based only on the adaptive learning process, is found by normalizing V_n^{(i)} by its maximum possible value and is given by υ_n^{(i)} = V_n^{(i)} / N(s), where N(s) = (1 − s^n) / (1 − s).
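A minimal sketch of the adaptive-learning update just described, with illustrative variable names and placeholder weights: the season evaluation Δ blends the PE and SN terms, accumulates into V with discount s, and is normalized into the propensity υ.

```python
def evaluate_season(delta_pe, delta_sn, w_pe=0.7, w_sn=0.3):
    """Season evaluation: Delta_{n-1} = w_PE * Delta_PE + w_SN * Delta_SN (weights sum to 1)."""
    return w_pe * delta_pe + w_sn * delta_sn

def update_propensity(V_prev, delta_prev, n, s=0.8):
    """V_n = s * V_{n-1} + Delta_{n-1}; upsilon_n = V_n / N(s) with N(s) = (1 - s**n) / (1 - s)."""
    V_n = s * V_prev + delta_prev
    norm = (1.0 - s ** n) / (1.0 - s) if s < 1.0 else float(n)
    return V_n, V_n / norm

# Example: an unvaccinated agent who escaped infection (low Delta_PE) but whose alters
# mostly skipped vaccination and caught the flu (high Delta_SN).
V = 0.0
for season in range(1, 6):
    delta = evaluate_season(delta_pe=0.1, delta_sn=0.9)
    V, upsilon = update_propensity(V, delta, n=season)
    # upsilon is the probability of vaccinating at the start of the next season.
```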
ABM Influenza Transmission Model

Influenza transmission is based on an SIR model over the social network structure that considers both node and edge percolation. Individuals, represented as nodes on the network, begin in one of three states: susceptible, infected, or recovered. At the start of the season, some individuals decide to seek vaccination. This is determined according to the probabilities w_n^{(i)} described previously. Based on the vaccine efficacy, we deactivate edges from the network that connect to nodes representing immune individuals who chose to get vaccinated. The degree of protection conferred by an influenza vaccine is assumed to be 50% to 70% and can vary from season to season. The lower range in efficacy is applied to the elderly in our population. Figure 9 provides an illustration where blue nodes are immune due to vaccination.

Figure 9: SIR percolation model on the network.
During an influenza epidemic, not all individuals in the network that remain susceptible to infection will become infected. Depending on the influenza transmissibility, some individuals will come in contact with infectious individuals but still avoid infection. Therefore, in the model influenza transmission is allowed to propagate only through the active
edges of the network (shown in bold). An edge that connects two susceptible individuals i and j stays active with probability T_ij = 1 − exp(−λ_ij / γ), which depends on:

1. the influenza transmissibility rate (λ_ij), and

2. the duration for which either individual is infectious and continues interacting socially in the network (γ^{-1}).

The infectiousness duration depends on the probability that either individual is symptomatic and stays home in the event of being infected (i.e., social distancing). Influenza transmissibility, infectiousness duration, probability of being asymptomatic, and social distancing depend on the socio-demographic attributes of i and j and have been parameterized from the literature. The basic reproductive number resulting from this model is R_0 = ⟨⟨k⟩_m⟩ ⟨T_ij⟩, where ⟨T_ij⟩ is the average transmission probability in our resulting network.
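The edge-activation step can be sketched as follows. The function name and rate values are hypothetical placeholders; the sketch only illustrates that each susceptible-susceptible edge stays active with probability T_ij = 1 − exp(−λ_ij/γ), and that transmission is then simulated only along the active edges.

```python
import numpy as np

rng = np.random.default_rng(0)

def active_edges(edges, lam, gamma):
    """Keep each susceptible-susceptible edge (i, j) with probability T_ij = 1 - exp(-lam_ij / gamma)."""
    kept = []
    for (i, j) in edges:
        T_ij = 1.0 - np.exp(-lam[(i, j)] / gamma)
        if rng.random() < T_ij:
            kept.append((i, j))
    return kept

# Toy example: three contacts with placeholder transmissibility rates and mean infectious period 1/gamma.
edges = [(0, 1), (1, 2), (0, 2)]
lam = {(0, 1): 0.4, (1, 2): 0.1, (0, 2): 0.8}
print(active_edges(edges, lam, gamma=0.5))
```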
Appendix B: Statement of Modified MAC RL Algorithm

Here we outline the learning algorithm used in the ABM experiments. We take inspiration from [19] and implement an adapted RL algorithm that aims to address the complexities specific to multi-agent learning: the Multi-agent Actor-Critic (MAC) algorithm (see below). It is based on the same principle underlying the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm benchmarked in [19]. It aims to enable limited-time information-sharing via a globally-accessible full-observation oracle or central Q-critic during the training phase. The central Q-critic is more informative about the value of action-state pairs. The use of this augmented oracle leads to more stable learning. Algorithm 1 also distinguishes between information-sharing during the test vs. training phases. The individual agents' access to the central Q-critic is limited to just the training phase. For continued adaptation after model deployment, we can still update each agent's model using value estimates from its local critic instead of the central Q-critic. We will use the following notation as a compact way to denote a population variable: d⃗_t = {d_{i,t}}_{i=1}^N.
Algorithm 1: Variant on the Multiagent Actor-Critic (MAC) Algorithm

Input: {s_{i,t}, a_{i,t}, r_{i,t}, s_{i,t+1}}_{t=0}^{T}, episode experience tuples for N agents
repeat
  Initialize s_{i,0} for all i ∈ {1, ..., N}.
  for each i ∈ {1, ..., N} do
    for each t ∈ {0, ..., T − 1} do
      Compute the i-th advantage: Adv_{i,t} = r_{i,t} + γ Q_i(a⃗_{t+1}, s⃗_{t+1}; Φ_t) − Q_i(a⃗_t, s⃗_t; Φ_t)
      Apply the i-th actor's policy-gradient ascent update: Θ_{t+1} = Θ_t + α_π Adv_{i,t} ∇_Θ log π_i(a | s_t, Θ_t)
      if Training then
        Apply the central-Q gradient descent update: Φ_{t+1} = Φ_t − α_Q ∇_Φ Adv_{i,t}
      end if
      if Deployed then
        Apply the i-th critic's gradient ascent update using the local advantage:
          Adv(s_{i,t}) = r_{i,t} + γ v(s_{i,t+1}, ω_t) − v(s_{i,t}, ω_t)
          ω_{t+1} = ω_t + α_v Adv(s_{i,t}) ∇_ω v(s_{i,t}, ω_t)
      end if
    end for
  end for
until simulation ends
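A hedged Python sketch of how the updates in Algorithm 1 might be organized (PyTorch). The network sizes, optimizers, and the squared-advantage critic losses are assumptions made for illustration, not a reproduction of the experimental code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, STATE_DIM, N_ACTIONS, GAMMA = 3, 4, 2, 0.95

# One actor and one local critic per agent; one central Q-critic over joint states and actions.
actors = [nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS)) for _ in range(N)]
local_v = [nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, 1)) for _ in range(N)]
central_q = nn.Sequential(nn.Linear(N * (STATE_DIM + N_ACTIONS), 64), nn.ReLU(), nn.Linear(64, N))

opt_actor = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in actors]
opt_local = [torch.optim.Adam(v.parameters(), lr=1e-3) for v in local_v]
opt_central = torch.optim.Adam(central_q.parameters(), lr=1e-3)

def joint_input(states, actions):
    """Concatenate all agents' states and one-hot actions for the central Q-critic."""
    return torch.cat([states.flatten(), F.one_hot(actions, N_ACTIONS).float().flatten()])

def mac_step(i, states, actions, rewards, next_states, next_actions, training=True):
    """One update for agent i from a single joint transition (training vs. deployed phase)."""
    if training:
        q_now = central_q(joint_input(states, actions))[i]
        q_next = central_q(joint_input(next_states, next_actions))[i]
        adv = rewards[i] + GAMMA * q_next.detach() - q_now
        opt_central.zero_grad()
        (adv ** 2).backward()                      # central-Q update on the TD error (assumed loss)
        opt_central.step()
    else:
        v_now = local_v[i](states[i])
        v_next = local_v[i](next_states[i])
        adv = rewards[i] + GAMMA * v_next.detach() - v_now
        opt_local[i].zero_grad()
        (adv ** 2).sum().backward()                # local critic update after deployment
        opt_local[i].step()
    logits = actors[i](states[i])
    log_prob = torch.distributions.Categorical(logits=logits).log_prob(actions[i])
    actor_loss = -(adv.detach() * log_prob).sum()  # policy-gradient ascent for agent i
    opt_actor[i].zero_grad()
    actor_loss.backward()
    opt_actor[i].step()

# Example joint transition with placeholder data: states are (N, STATE_DIM) floats, actions are (N,) ints.
s, s2 = torch.randn(N, STATE_DIM), torch.randn(N, STATE_DIM)
a, a2 = torch.randint(0, N_ACTIONS, (N,)), torch.randint(0, N_ACTIONS, (N,))
r = torch.rand(N)
for i in range(N):
    mac_step(i, s, a, r, s2, a2, training=True)
```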
References

[1] Lawrence Blume. Agent-based models for policy analysis. In Assessing the Use of Agent-Based Models for Tobacco Regulation. National Academies Press (US), 2015.

[2] Joshua M Epstein. Generative Social Science: Studies in Agent-Based Computational Modeling. Princeton University Press, 2006.

[3] Thomas C Schelling. Models of segregation. The American Economic Review, 59(2):488–493, 1969.

[4] Todd R Johnson. Control in ACT-R and Soar. In Proceedings of the Nineteenth Annual Conference of the Cognitive Science Society, pages 343–348, 1997.

[5] John R Anderson. ACT: A simple theory of complex cognition. American Psychologist, 51(4):355, 1996.

[6] Sarah A Nowak, Luke Joseph Matthews, and Andrew M Parker. A general agent-based model of social learning. Rand Health Quarterly, 7(1), 2017.

[7] Raffaele Vardavas, Romulus Breban, and Sally Blower. Can influenza epidemics be prevented by voluntary vaccination? PLoS Computational Biology, 3(5), 2007.

[8] Raffaele Vardavas and Christopher Steven Marcum. Modeling influenza vaccination behavior via inductive reasoning games. In Modeling the Interplay Between Human Behavior and the Spread of Infectious Diseases, pages 203–227. Springer, 2013.

[9] Raffaele Vardavas, Pavan Katkar, Andrew M Parker, Gursel Rafig oglu Aliyev, Marlon Graf, and Krishna B Kumar. RAND's interdisciplinary behavioral and social science agent-based model of income tax evasion: Technical report. 2019.

[10] Damien Challet, Matteo Marsili, Yi-Cheng Zhang, et al. Minority Games: Interacting Agents in Financial Markets. OUP Catalogue, 2013.

[11] Damien Challet, Alessandro Chessa, Matteo Marsili, and Yi-Cheng Zhang. From minority games to real markets. 2001.

[12] W Brian Arthur. Complexity in economic theory: Inductive reasoning and bounded rationality. The American Economic Review, 84(2):406–411, 1994.

[13] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[14] G Lakoff. Explaining embodied cognition results. Topics in Cognitive Science, 4(4):773–785, 2012.

[15] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

[16] Peter Stone and Manuela Veloso. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3):345–383, 2000.

[17] Karl Tuyls and Gerhard Weiss. Multiagent learning: Basics, challenges, and prospects. AI Magazine, 33(3):41–41, 2012.

[18] Lucian Buşoniu, Robert Babuška, and Bart De Schutter. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications-1, pages 183–221. Springer, 2010.

[19] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.

[20] J Maynard Smith and George R Price. The logic of animal conflict. Nature, 246(5427):15–18, 1973.

[21] Robert Axelrod and William Donald Hamilton. The evolution of cooperation. Science, 211(4489):1390–1396, 1981.

[22] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 4190–4203, 2017.

[23] Guy Theraulaz and Eric Bonabeau. A brief history of stigmergy. Artificial Life, 5(2):97–116, 1999.

[24] Radu Manuca, Yi Li, Rick Riolo, and Robert Savit. The structure of adaptive competition in minority games. Physica A: Statistical Mechanics and its Applications, 282(3-4):559–608, 2000.

[25] Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.