A Microscopic Epidemic Model and Pandemic Prediction Using Multi-Agent Reinforcement Learning
Changliu Liu. C. Liu is with Carnegie Mellon University, Pittsburgh, PA. Email: [email protected]
This paper introduces a microscopic approach to model epidemics, which can explicitly consider the consequences of individual decisions on the spread of the disease. We first formulate a microscopic multi-agent epidemic model where every agent can choose its activity level, which affects the spread of the disease. Then, by minimizing the agents' cost functions, we solve for the optimal decisions of individual agents in the framework of game theory and multi-agent reinforcement learning. Given the optimal decisions of all agents, we can make predictions about the spread of the disease. We show that there are negative externalities in the sense that infected agents do not have enough incentives to protect others, which then necessitates external interventions to regulate agents' behaviors. In the discussion section, future directions are pointed out to make the model more realistic.

This paper is adapted from the lecture notes of the course Adaptive Control and Reinforcement Learning (Spring) at CMU. The course slides are available at https://piazza.com/class_profile/get_resource/k54pll5057h79m/k8qaky3o8x65bm . The code is available at https://github.com/intelligent-control-lab/Microscopic_Epidemic_Model .

Introduction
With the COVID-19 pandemic soaring across the world, a reliable model is needed to describe the observed spread of the disease, make predictions about the future, and guide public policy design to control the spread.

Existing Epidemic Models
There are many existing macroscopic epidemic models (Daryl J. Daley and Joe Gani, Epidemic Modelling: An Introduction, Cambridge University Press). For example, the SI model describes the growth of the infection rate as the product of the current infection rate and the current susceptible rate. The SIR model further incorporates the effect of recovery into the model, i.e., the infected population turns into immune population after a certain period of time. The SIRS model considers the case that immunity is not lifelong and that the immune population can become susceptible again. In addition to these models, the SEIR model incorporates the incubation period into the analysis (Ping Yan and Shengqiang Liu, "SEIR epidemic model with delay," The ANZIAM Journal); the incubation period refers to the duration before symptoms show up. The most important factor in all those models is R_0, the basic reproduction number, which tells how fast the disease can spread.

Limitations of Existing Models
Although these models are useful in predicting the spread of epidemics, they lack the granularity needed for analyzing individual behaviors during an epidemic and for understanding the relationship between individual decisions and the spread of the disease (see also Christopher L. Barrett, Keith Bisset, Jonathan Leidig, Achla Marathe, and Madhav Marathe, "Estimating the impact of public and private strategies for controlling an epidemic: A multi-agent approach," Twenty-First IAAI Conference). For example, many countries have now announced "lock-down", "shelter-in-place", "stay-at-home", or similar orders. However, their effects are very different across different countries, or even across different counties in the same country. One factor that can possibly explain these differences is cultural difference: in different cultures, individuals make different choices. For instance, in the west, people exhibit greater inertia to give up their working and life routines, so they do not follow the orders as seriously, while in the east, people tend to obey the rules better. These different individual choices can result in significantly different outcomes in disease propagation that cannot be captured by a macroscopic model.
A Microscopic Epidemic Model
In this paper, we develop a microscopic epidemic model by explicitly considering individual decisions and the interactions among different individuals in the population, in the framework of multi-agent systems. The aforementioned cultural difference can be understood as a difference in agents' cost functions, which then affects their behaviors when they are trying to minimize their cost functions. The details of the microscopic epidemic model are explained in the next section, followed by an analysis of the dynamics of the multi-agent system and the prediction of system trajectories using multi-agent reinforcement learning. The model is still in its preliminary form. In the discussion section, future directions are pointed out to make the model more realistic.
Microscopic Epidemic Model
Suppose there are M agents in the environment. Initially, m agents are infected. Agents are indexed from 1 to M. Every agent has its own state and control input. The model is in discrete time, and the time interval is set to be one day. The evolution of the infection rate over consecutive days depends on the agents' actions. The questions of interest are: How many agents will eventually be infected? How fast will they be infected? How can we slow down the growth of the infection rate?

Agent Model
We consider two state values for an agent: for agent i, x_i = 0 means that agent i is healthy (susceptible) and x_i = 1 means that agent i is infected. Agent i decides its level of activities u_i ∈ [0, 1]. The level of activities for agent i can be understood as the expected percentage of other agents in the system that agent i wants to meet. For example, u_i = 1/M means that agent i expects to meet one other agent. The actual number of agents that agent i meets depends not only on agent i's activity level, but also on other agents' activity levels. For example, if all other agents choose activity level 0, then agent i will not be able to meet any other agent no matter what u_i it chooses. Mathematically, the chance for agent i and agent j to meet each other depends on the minimum of the activity levels of these two agents, i.e., min{u_i, u_j}. In the extreme cases, if agent i decides to meet everyone in the system by choosing u_i = 1, then the chance for agent j to meet with agent i is u_j; if agent i decides to meet no one by choosing u_i = 0, then the chance for agent j to meet with agent i is 0.

Before we derive the system dynamic model, the assumptions are listed below. These assumptions can all be relaxed in future work; they are introduced mainly for simplicity of the discussion.

1. In the agent model, we only consider two states: healthy (susceptible) and infected. All healthy agents are susceptible to the disease. There is no recovery and no death for infected agents. There is no incubation period for infected agents, i.e., once infected, an agent can start to infect other healthy agents. To relax this assumption, we may introduce more states for every agent.

2. The interactions among agents are assumed to be uniform, although this is not true in the real world. In the real world, given a fixed activity level, agents are more likely to meet with close family, friends, and colleagues than with strangers on the street. To incorporate this non-uniformity into the model, we would redefine the chance for agent i and agent j to meet each other as β_{i,j} min{u_i, u_j}, where β_{i,j} ∈ [0, 1] is a coefficient that encodes the proximity between agent i and agent j and affects the chance for them to meet. For simplicity, we assume that the interaction patterns are uniform in this paper.

3. Meeting with an infected agent results in immediate infection. To relax this assumption, we may introduce an infection probability to describe how likely it is for a healthy agent to be infected if it meets with an infected agent.

System Dynamic Model
On day k, denote agent i's state and control as x_{i,k} ∈ X and u_{i,k} ∈ U. By definition, the agent state space is X = {0, 1} and the agent control space is U = [0, 1]. The system state space is denoted X^M := X × ··· × X. The system control space is denoted U^M := U × ··· × U. Define m_k = ∑_i x_{i,k} as the number of infected agents at time k. The set of infected agents is denoted

I_k := { i : x_{i,k} = 1 }.

The state transition probability for the multi-agent system is a mapping

T : X^M × U^M × X^M ↦ [0, 1].

According to the assumptions, an infected agent will always remain infected. Hence the state transition probability for an infected agent i does not depend on other agents' states or on any control. However, the state transition probability for a healthy agent i depends on others. The chance for a healthy agent i to not meet an infected agent j ∈ I_k is 1 − min{u_i, u_j}. A healthy agent stays healthy if and only if it does not meet any infected agent, the probability of which is Π_{j∈I_k}(1 − min{u_i, u_j}). Then the probability for a healthy agent to be infected is 1 − Π_{j∈I_k}(1 − min{u_i, u_j}). From this expression, we can infer that the chance for a healthy agent i to stay healthy is higher if
• agent i limits its own activity by choosing a smaller u_i;
• the number of infected agents is smaller;
• the infected agents in I_k limit their activities.
The state transition probability for an agent i is summarized in the table below.

Table: The state transition probability from x_{i,k} to x_{i,k+1} for an agent.

                   x_{i,k} = 0 (healthy)                   x_{i,k} = 1 (infected)
x_{i,k+1} = 0      Π_{j∈I_k}(1 − min{u_i, u_j})            0
x_{i,k+1} = 1      1 − Π_{j∈I_k}(1 − min{u_i, u_j})        1
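To make the transition rule concrete, the following minimal Python sketch samples one day of the dynamics directly from the table above. The function and variable names are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

def simulate_day(x, u, rng):
    """One-day state transition for the microscopic epidemic model.

    x : (M,) array of states, 0 = healthy, 1 = infected
    u : (M,) array of activity levels in [0, 1]
    Returns the states on the next day.
    """
    x_next = x.copy()
    infected = np.where(x == 1)[0]
    for i in np.where(x == 0)[0]:
        # Probability of staying healthy: product over infected agents j
        # of (1 - min{u_i, u_j}).
        p_stay = np.prod(1.0 - np.minimum(u[i], u[infected])) if len(infected) else 1.0
        if rng.random() > p_stay:
            x_next[i] = 1  # met at least one infected agent -> infected
    return x_next

# Example: 4 agents, agent 0 infected, activity levels as in the example below
rng = np.random.default_rng(0)
x = np.array([1, 0, 0, 0])
u = np.array([0.1, 0.2, 0.3, 0.4])
print(simulate_day(x, u, rng))
```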
Example

Consider the four-agent system shown in the figure below. Only agent 1 is infected. The infected agent chooses activity level u_1 = 0.1, and the healthy agents 2, 3, and 4 choose activity levels 0.2, 0.3, and 0.4. The chance p_{i,j} for agents i and j to meet each other is p_{i,j} = p_{j,i} = min{u_i, u_j}; in particular, p_{1,2} = p_{1,3} = p_{1,4} = 0.1. The chance for agents 2, 3, and 4 to stay healthy is therefore 1 − 0.1 = 0.9, although they have different activity levels.
Figure: Example of a four-agent system. Agent 1 is infected (shaded); the other agents are healthy. The numbers on the links denote the probabilities for agents to meet with each other, which depend on the chosen activity levels of the agents (0.1 for the infected agent, 0.2, 0.3, and 0.4 for the healthy agents).
Case Study

Before we start to derive the optimal strategies for individual agents and analyze the closed-loop multi-agent system, we first characterize the (open-loop) multi-agent system dynamics by Monte Carlo simulation according to the state transition probability in the table above. Suppose we have M = 1000 agents and consider two activity levels: a normal activity level u and a reduced activity level u*. The two activity levels are assigned to different agents following different strategies as described below. In particular, we consider the "no intervention" case where all agents continue to follow the normal activity level, the "immediate isolation" case where the activity levels of infected agents immediately drop to the reduced level, the "delayed isolation" case where the activity levels of infected agents drop to the reduced level after several days, and the "lockdown" case where the activity levels of all agents drop to the reduced level immediately.
Figure: Illustration of the result of one Monte Carlo simulation when all agents have the activity level u = 1/M. The horizontal axis corresponds to day k. The vertical axis corresponds to agent ID i. The color in the graph represents the value of x_{i,k}, blue for 0 (healthy) and yellow for 1 (infected).

For each case, we simulate a number of system trajectories and compute the average, maximum, and minimum m_k (number of infected agents) versus k over all trajectories. A system trajectory in the "no intervention" case is illustrated in the figure above, where u = 1/M for all agents. The m_k trajectories under the different cases are shown in the next figure, where the solid curves illustrate the average m_k and the shaded area corresponds to the range from the minimum to the maximum m_k. The results are explained below.

• Case 1: no intervention. All agents keep the normal activity level u. The scenarios for u = 1/M and u = 2/M are illustrated in the figure. As expected, a higher activity level for all agents leads to faster infection. The trajectory of m_k has an S shape: its growth rate is relatively slow when either the infected population is small or the healthy population is small, and is maximized when 50% of the agents are infected. It will be shown in the following discussion that (empirical) macroscopic models also generate S-curves.

• Case 2: immediate isolation of infected agents. The activity levels of infected agents immediately drop to u*, while the others remain at u. The scenario for u = 1/M and u* = 0.1/M is illustrated in the figure. Immediate isolation significantly slows down the growth of the infection rate. As expected, it has the best performance in terms of flattening the curve, the same as the lockdown case. The trajectory also has an S shape.

• Case 3: delayed isolation of infected agents. The activity levels of infected agents drop to u* after T days, while the others remain at u. In the simulation, u = 1/M and u* = 0.1/M. The scenarios for T = 1 and T = 2 are illustrated in the figure. As expected, the longer the delay, the faster the infection rate grows, though the growth of the infection rate is still slower than in the "no intervention" case. Moreover, the peak growth rate (when 50% of the agents are infected) is higher when the delay is longer.

• Case 4: lockdown. The activity levels of all agents drop to u*. The scenario for u* = 0.1/M is illustrated in the figure. As expected, it has the best performance in terms of flattening the curve, the same as the immediate isolation case. (In the case that the infected population can be asymptomatic or have a long incubation period before showing any symptom, as observed for COVID-19, immediate identification and isolation of infected persons is not achievable. Then lockdown is the best way to control the spread of the disease in our model.)

Since the epidemic model is monotone, every agent will eventually be infected as long as the probability of meeting infected agents does not drop to zero. Moreover, we have not discussed decision making by individual agents yet; the activity levels are just predefined in the simulation.
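A rough Python sketch of this Monte Carlo study is given below; it reuses the one-day transition sketch above and hard-codes the four intervention strategies. The parameter values mirror the ones quoted in the text (M = 1000, u = 1/M, u* = 0.1/M), but the trajectory count, episode length, and other details are assumptions.

```python
import numpy as np

def run_case(M=1000, days=120, u_normal=1e-3, u_reduced=1e-4,
             strategy="no_intervention", delay=0, n_runs=20, seed=0):
    """Monte Carlo estimate of m_k under one intervention strategy."""
    rng = np.random.default_rng(seed)
    curves = np.zeros((n_runs, days), dtype=int)
    for r in range(n_runs):
        x = np.zeros(M, dtype=int)
        x[0] = 1                          # one initially infected agent
        infected_since = np.full(M, -1)
        infected_since[0] = 0
        for k in range(days):
            u = np.full(M, u_reduced if strategy == "lockdown" else u_normal)
            if strategy in ("immediate_isolation", "delayed_isolation"):
                isolated = (x == 1) & (k - infected_since >= delay)
                u[isolated] = u_reduced
            x_new = simulate_day(x, u, rng)   # from the earlier sketch
            infected_since[(x_new == 1) & (x == 0)] = k
            x = x_new
            curves[r, k] = x.sum()
    return curves.mean(0), curves.min(0), curves.max(0)

mean_mk, min_mk, max_mk = run_case(strategy="immediate_isolation")
```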
Figure: The growth of the infection rate under different activity levels of agents (no intervention with u = 0.001 and u = 0.002; immediate isolation, 1-day delayed isolation, and 2-day delayed isolation with u = 0.001 and u* = 0.0001; lockdown with u* = 0.0001). For every scenario, the result is extracted from the Monte Carlo simulations. The horizontal axis corresponds to day k. The vertical axis corresponds to the number of infected agents m_k. The solid curves are the average m_k and the shaded area corresponds to the range from the minimum to the maximum m_k.

Remark
The model we introduced is microscopic, in the sense that interactions among individual agents are considered. The simulated open-loop trajectories are indeed similar to those from a macroscopic model. Since only susceptible and infected populations are considered in the proposed microscopic model, we compare it with the macroscopic Susceptible-Infected (SI) model. Define the state s ∈ [0, 1] as the fraction of infected population. The growth of the infected population is proportional to the susceptible population and the infected population. Suppose the infection coefficient is β; the system dynamics in the SI model follow

ṡ = β s (1 − s).

We simulate the system trajectory under different infection coefficients β ranging from 0.2 to 1.0, as shown in the figure below. The trajectories also have S shapes, similar to the ones in the microscopic model. However, since this macroscopic SI model is deterministic, there is no "uncertainty" range as in the microscopic model. The infection coefficient β depends on the agents' choices of activity levels, but there is no explicit relationship yet. It is better to directly use the microscopic model to analyze the consequences of individual agents' choices.

Figure: The system trajectories in the macroscopic SI model for β from 0.2 to 1.0. The horizontal axis corresponds to days. The vertical axis corresponds to the infection rate s.
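For reference, a short sketch of this macroscopic SI comparison, integrating ṡ = βs(1 − s) with a simple Euler scheme, is given below; the step size and initial infected fraction are assumptions.

```python
import numpy as np

def si_trajectory(beta, s0=1e-3, days=120, dt=1.0):
    """Euler integration of the SI dynamics ds/dt = beta * s * (1 - s)."""
    s = np.empty(days)
    s[0] = s0
    for k in range(1, days):
        s[k] = s[k - 1] + dt * beta * s[k - 1] * (1.0 - s[k - 1])
    return s

# S-shaped curves for a range of infection coefficients
curves = {round(beta, 1): si_trajectory(beta) for beta in np.arange(0.2, 1.01, 0.1)}
```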
Distributed Optimal Control

This section tries to answer the following question: in the microscopic multi-agent epidemic model, what is the best control strategy for individual agents? To answer that, we need to first specify the knowledge and observation models as well as the cost (reward) functions for individual agents. Then we will derive the optimal choices of agents in a distributed manner. The resulting system dynamics correspond to a Nash Equilibrium of the system.
Knowledge and Observation Model
A knowledge and observation model for agent i includes two aspects: what does agent i know about itself, and what does agent i know about others? The knowledge about any agent j includes the dynamic function of agent j and the cost function of agent j. The observation corresponds to run-time measurements, i.e., the observation of any agent j includes the run-time state x_{j,k} and the run-time control u_{j,k}. In the following discussion, we make the following assumptions regarding the knowledge and observation model:

• An agent knows its own dynamics and cost function.
• All agents are homogeneous in the sense that they share the same dynamics and cost functions, and agents know that all agents are homogeneous; hence they know others' dynamics and cost functions. (Not knowing other agents' dynamics or cost functions would result in information asymmetry, which creates difficulty in the analysis. This assumption can be relaxed in the future.)
• At time k, agents can measure x_{j,k} for all j, but they cannot measure u_{j,k} until time k + 1. Hence, the agents are playing a simultaneous game: they need to infer others' decisions when making their own decisions at any time k.

Cost Function
We consider two conflicting interests for every agent: (i) limit the activity level to minimize the chance of getting infected; and (ii) maintain a certain activity level for living. The identification of these two conflicting interests is purely empirical; to build realistic cost functions, we need to either study real-world data or conduct human subject experiments. We define the run-time cost for agent i at time k as

l_{i,k} = x_{i,k+1} + α_i p(u_{i,k}),

where x_{i,k+1} corresponds to the first interest, p(u_{i,k}) corresponds to the second interest, and α_i > 0 weighs the two interests. The function p(u) is assumed to be smooth. It can be a decreasing function on [0, 1], meaning that the higher the activity level, the better; it can also be a convex parabolic function on [0, 1] with the minimum attained at some u*, meaning that the activity level should be maintained around u*. Due to our homogeneity assumption on agents, they should have identical preferences, i.e., α_i = α for all i.

Agent i chooses its action at time k by minimizing the expected cumulative cost in the future:

u_{i,k} = arg min E[ ∑_{t=k}^{∞} γ^{t−k} l_{i,t} ],

where γ ∈ [0, 1] is a discount factor. This formulation corresponds to a repeated game as opposed to a single stage game: repeated games capture the idea that an agent has to take into account the impact of its current action on the future actions of others (this impact is called the agent's reputation), and the interaction is more complex in a repeated game than in a single stage game. The objective function depends on all agents' current and future actions, and it is difficult to directly obtain an analytical solution. Later we will use multi-agent reinforcement learning to obtain a numerical solution. In this section, to simplify the problem, we consider a single stage game where the agents have zero discount of the future, i.e., γ = 0. Hence the objective function is reduced to

u_{i,k} = arg min E[ l_{i,k} ],

which only depends on the current actions of agents. According to the state transition probability in the table above, the expected cost is

E[l_{i,k}] = 1 − Π_{j∈I_k}(1 − min{u_{i,k}, u_{j,k}}) + α_i p(u_{i,k})   if x_{i,k} = 0,
E[l_{i,k}] = 1 + α_i p(u_{i,k})                                         if x_{i,k} = 1.

Nash Equilibrium

According to the expected cost above, the expected cost for an infected agent only depends on its own action. Hence the optimal choice for an infected agent is u_{i,k} = ū := arg min_u p(u). Then the optimal choice for a healthy agent satisfies

u_{i,k} = arg min_u [ 1 − Π_{j∈I_k}(1 − min{u, ū}) + α_i p(u) ]
        = arg min_u [ 1 − (1 − min{u, ū})^{m_k} + α_i p(u) ].

Note that the term 1 − (1 − min{u, ū})^{m_k} is positive, increasing for u ∈ [0, ū], and constant for u ∈ [ū, 1]. Hence the optimal solution should be no larger than ū = arg min_u p(u). (If u ≥ ū, the objective becomes arg min_u [1 − (1 − ū)^{m_k} + α_i p(u)], whose optimal solution is u = ū with cost 1 − (1 − ū)^{m_k} + α_i p(ū). If u ≤ ū, the objective becomes arg min_u J(u) with J(u) = 1 − (1 − u)^{m_k} + α_i p(u). Since ∂J/∂u evaluated at ū is positive, the optimal solution satisfies u < ū with cost J(u) < J(ū), and J(ū) equals the smallest cost attainable in the case u ≥ ū. Hence the optimal solution satisfies u < ū.) Therefore the objective for a healthy agent can be simplified to 1 − (1 − u)^{m_k} + α_i p(u). In summary, the optimal actions of both the infected and the healthy agents in the Nash Equilibrium can be compactly written as

u_{i,k} = arg min_u { 1 − (1 − u)^{m_k} (1 − x_{i,k}) + α_i p(u) },   ∀i.
Example

Consider the previous example with four agents. Define

p(u) = exp(−u),

which is a monotonically decreasing function.

Figure: Illustration of the curve p(u); the horizontal axis is the activity level u and the vertical axis is the cost.

Since m_k = 1, the optimal actions in the Nash Equilibrium for this specific problem satisfy

u_{i,k} = arg min_u { u + x_{i,k} − u x_{i,k} + α_i exp(−u) },   ∀i.

Solving this, for the infected agent, u_{1,k} = 1. For healthy agents, the choice also depends on α_i, as illustrated in the figure below.

Figure: Illustration of the objective function under different conditions (α ∈ {0.5, 1, 2, 3, 5}, for healthy and for infected agents).

We have assumed that α_i = α, identical for all agents. We further assume that α is small enough (α < 1 for this choice of p) so that the optimal action for healthy agents is u_{i,k} = 0. The optimal actions and the corresponding costs for all agents are listed in the table below. In the Nash Equilibrium, no agent will meet any other agent, since all agents except agent 1 reduce their activity levels to zero. The actual cost (received at the next time step) equals the expected cost (computed at the current time step).

Table: Agent decisions and associated costs in the Nash Equilibrium in the four-agent example.

Agent ID    State x_{i,k}    Optimal u_{i,k}    Optimal E[l_{i,k}]       Actual l_{i,k}
1           1                1                  1 + α exp(−1)            1 + α exp(−1)
2, 3, 4     0                0                  α                        α
Total                                           1 + 3α + α exp(−1)       1 + 3α + α exp(−1)

However, let us consider another situation where the infected agent chooses activity level 0 and all the healthy agents choose activity level 1. The resulting costs are summarized in the table below. Obviously, the overall cost is reduced in the new situation. However, this better situation cannot be attained spontaneously by the agents, due to externality of the system, which will be explained below.

Table: Agent decisions and associated costs in a situation better than the Nash Equilibrium in the four-agent example.

Agent ID    State x_{i,k}    u_{i,k}    E[l_{i,k}]               Actual l_{i,k}
1           1                0          1 + α                    1 + α
2, 3, 4     0                1          α exp(−1)                α exp(−1)
Total                                   1 + α + 3α exp(−1)       1 + α + 3α exp(−1)
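The single-stage best responses and the two cost tables above can be checked numerically. The sketch below assumes the decreasing activity-cost term p(u) = exp(−u) from the example and simply grids over u; it is an illustration, not the paper's code.

```python
import numpy as np

ALPHA = 0.5                       # assumed weight; any alpha < 1 gives the same equilibrium
U_GRID = np.linspace(0.0, 1.0, 1001)

def p(u):
    return np.exp(-u)             # decreasing: higher activity, lower living cost

def expected_cost(u, x, m_k, alpha=ALPHA):
    """Single-stage expected cost 1 - (1 - u)^m_k * (1 - x) + alpha * p(u)."""
    return 1.0 - (1.0 - u) ** m_k * (1.0 - x) + alpha * p(u)

def best_response(x, m_k):
    costs = expected_cost(U_GRID, x, m_k)
    return U_GRID[np.argmin(costs)]

# Four-agent example: agent 1 infected (x=1), agents 2-4 healthy (x=0), m_k = 1
u_infected = best_response(x=1, m_k=1)   # -> 1.0 (keeps full activity)
u_healthy = best_response(x=0, m_k=1)    # -> 0.0 (stays home)

nash_total = expected_cost(u_infected, 1, 1) + 3 * expected_cost(u_healthy, 0, 1)
# "Better" situation: infected agent stays home, so healthy agents face no risk
better_total = expected_cost(0.0, 1, 1) + 3 * (ALPHA * p(1.0))
print(nash_total, better_total)          # better_total < nash_total: loss of social welfare
```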
Dealing with Externality

For a multi-agent system, define the system cost as the summation of the individual costs:

L_k := ∑_i l_{i,k}.

The system cost in the Nash Equilibrium is denoted L*_k, which corresponds to the evaluation of L_k under the agent actions derived above. On the other hand, the optimal system cost is defined as

L^o_k := min_{u_{i,k}, ∀i} L_k.

This optimization problem is solved in a centralized manner, which is different from how the Nash Equilibrium is obtained. To obtain the Nash Equilibrium, all agents solve their own optimization problems independently; although their objective functions depend on other agents' actions, they do not jointly make the decisions, but only "infer" what others will do. By definition, L^o_k ≤ L*_k. In the example above, L*_k = 1 + 3α + α exp(−1) and L^o_k = 1 + α + 3α exp(−1). The difference L*_k − L^o_k is called the loss of social welfare. In the epidemic model, the loss of social welfare is due to the fact that bad consequences (i.e., infecting others) are not penalized in the cost functions of the infected agents. Those unpenalized consequences are called externality. There can be both positive and negative externality. Under positive externality, agents lack motivation to do things that are good for the society; under negative externality, agents lack motivation to prevent things that are bad for the society. In the epidemic model, there is negative externality with infected agents.

To improve social welfare, we need to "internalize" the externality, i.e., add a penalty for spreading the disease. Now let us redefine agent i's run-time cost as

l̃_{i,k} = x_{i,k+1} + α_i p(u_{i,k}) + x_{i,k} q(u_{i,k}),

where q(·) is a monotonically increasing function. The last term x_{i,k} q(u_{i,k}) does not affect healthy agents since their x_{i,k} = 0, but it adds a penalty for infected agents if they choose a large activity level. One candidate function for q(u) is 1 − (1 − u)^{m_k}. In the real world, such "cost shaping" using q can be achieved through social norms or government regulation. The expected cost becomes

E[l̃_{i,k}] = 1 − Π_{j∈I_k}(1 − min{u_{i,k}, u_{j,k}}) + α_i p(u_{i,k})   if x_{i,k} = 0,
E[l̃_{i,k}] = 1 + α_i p(u_{i,k}) + q(u_{i,k})                             if x_{i,k} = 1.

Suppose the function q is well tuned such that arg min_u [α_i p(u) + q(u)] = 0. Then, although the expected costs for infected agents are still independent of others, their decisions are considerate to healthy agents. When the infected agents choose u = 0, the expected cost for healthy agents becomes α_i p(u_{i,k}), meaning that they do not need to worry about getting infected. Let us now compute the resulting Nash Equilibrium under the shaped costs using the previous example.
Example

In the four-agent example, set q(u) = u. Then arg min_u [α p(u) + u] = 0, hence agent 1 will choose u_{1,k} = 0. Agents i = 2, 3, 4 will choose u_{i,k} = arg min_u p(u) = 1. The resulting costs are summarized in the table below. With the shaped costs, the system enters a better Nash Equilibrium which indeed aligns with the system optimum L^o_k. A few remarks:

• Cost shaping did not increase the overall cost for the multi-agent system.
• The system optimum remains the same before and after cost shaping.
• Cost shaping helped the agents arrive at the system optimum without centralized optimization.

Table: Agent decisions and associated costs in the Nash Equilibrium with shaped cost functions in the four-agent example.

Agent ID    State x_{i,k}    Optimal u_{i,k}    Optimal E[l̃_{i,k}]      Actual l̃_{i,k}
1           1                0                  1 + α                    1 + α
2, 3, 4     0                1                  α exp(−1)                α exp(−1)
Total                                           1 + α + 3α exp(−1)       1 + α + 3α exp(−1)
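Cost shaping only changes the per-agent objective, so the same grid search as before can be reused. The sketch below (again assuming p(u) = exp(−u) and the shaping term q(u) = u) shows the infected agent's best response flipping from u = 1 to u = 0.

```python
import numpy as np

U = np.linspace(0.0, 1.0, 1001)
alpha = 0.5                                   # assumed weight < 1

p = lambda u: np.exp(-u)                      # living-cost term (decreasing)
q = lambda u: u                               # shaping penalty for infected agents

# Infected agent: original cost 1 + alpha*p(u) vs shaped cost 1 + alpha*p(u) + q(u)
u_infected_before = U[np.argmin(1 + alpha * p(U))]          # -> 1.0
u_infected_after = U[np.argmin(1 + alpha * p(U) + q(U))]    # -> 0.0

# Healthy agents: once the infected agent stays home, only the living cost remains
u_healthy_after = U[np.argmin(alpha * p(U))]                 # -> 1.0
print(u_infected_before, u_infected_after, u_healthy_after)
```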
Multi-Agent Reinforcement Learning

We have shown how to compute the Nash Equilibrium of the multi-agent epidemic model in a single stage. However, it is analytically intractable to compute the Nash Equilibrium when we consider repeated games. The complexity grows further when the number of agents increases and when there is information asymmetry. Nonetheless, we can apply multi-agent reinforcement learning (Lucian Buşoniu, Robert Babuška, and Bart De Schutter, "Multi-agent reinforcement learning: An overview," in Innovations in Multi-Agent Systems and Applications, Springer) to numerically compute the Nash Equilibrium. Then the evolution of the pandemic can be predicted by simulating the system under the Nash Equilibrium.

Q Learning
As evident from the single-stage solution, the optimal action for agent i at time k is a function of x_{i,k} and m_k. Hence we can define a Q function (action value function) for agent i as

Q_i : x_{i,k} × m_k × u_{i,k} ↦ R.

According to the assumptions made in the observation model, all agents can observe m_k at time k. For a single stage game, we have derived that Q_i(x, m, u) = 1 − (1 − u)^m (1 − x) + α_i p(u). For repeated games, we can learn the Q function using temporal difference learning. At every time k, agent i chooses its action as

u_{i,k} = arg min_u Q_i(x_{i,k}, m_k, u).

After taking the action u_{i,k}, agent i observes x_{i,k+1} and m_{k+1} and receives the cost l_{i,k} at time k + 1. Then agent i updates its Q function:

Q_i(x_{i,k}, m_k, u_{i,k}) ← Q_i(x_{i,k}, m_k, u_{i,k}) + η δ_{i,k},
δ_{i,k} = l_{i,k} + γ min_u Q_i(x_{i,k+1}, m_{k+1}, u) − Q_i(x_{i,k}, m_k, u_{i,k}),

where η is the learning gain and δ_{i,k} is the temporal difference error. All agents can run the above algorithm to learn their Q functions during the interaction with others. However, the algorithm introduced above has several problems.

• Exploration and limited rationality. There is no exploration in the greedy action choice above. Indeed, Q-learning is usually applied together with ε-greedy, where with probability 1 − ε the action u_{i,k} is chosen to be the optimal action, and with probability ε the action is chosen randomly with a uniform distribution over the action space. The ε-greedy approach is introduced mainly from an algorithmic perspective to improve convergence of the learning process. When applied to the epidemic model, it has a unique societal implication: when agents randomly choose their behaviors, it represents the fact that agents have only limited rationality. Hence, in the learning process, we apply ε-greedy as a way to incorporate exploration for faster convergence as well as to take into account the limited rationality of agents.

• Data efficiency and parameter sharing. Keeping separate Q functions for individual agents is not data efficient: an agent may not be able to collect enough samples to properly learn the desired Q function. Due to the homogeneity assumptions we made earlier about agents' cost functions, it is more data efficient to share the Q function among all agents (Jayesh K. Gupta, Maxim Egorov, and Mykel Kochenderfer, "Cooperative multi-agent control using deep reinforcement learning," International Conference on Autonomous Agents and Multiagent Systems, Springer). Its societal implication is that agents are sharing information and knowledge with each other. Hence, we apply parameter sharing as a way to improve data efficiency as well as to consider information sharing among agents during the learning process. In a more complex situation where agents are not homogeneous, it is desirable to share parameters within smaller groups of agents, instead of sharing parameters with all agents.

With the above modifications, the multi-agent Q learning algorithm (cf. Junling Hu and Michael P. Wellman, "Nash Q-learning for general-sum stochastic games," Journal of Machine Learning Research) is summarized below.

• For every time step k, agents choose their actions as

u_{i,k} = arg min_u Q(x_{i,k}, m_k, u)   with probability 1 − ε,
u_{i,k} = a random action                with probability ε,   ∀i.

• At the next time step k + 1, agents observe the new states x_{i,k+1} and receive the costs l_{i,k} for all i. Then the Q function is updated:

Q(x_{i,k}, m_k, u_{i,k}) ← Q(x_{i,k}, m_k, u_{i,k}) + η δ_{i,k},   ∀i,
δ_{i,k} = l_{i,k} + γ min_u Q(x_{i,k+1}, m_{k+1}, u) − Q(x_{i,k}, m_k, u_{i,k}).
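A compact sketch of this shared-parameter, ε-greedy tabular Q-learning loop is given below. It follows the update rule above; the discretization (three activity levels) matches the example that follows, while the episode length, discount, learning gain, and exploration schedule are assumptions.

```python
import numpy as np

M, N_ACTIONS, DAYS, EPISODES = 50, 3, 100, 200
ACTIONS = np.array([0.0, 1.0 / M, 10.0 / M])       # low, medium, high activity
GAMMA, ETA, ALPHA = 0.95, 0.1, 1.0                  # assumed hyper-parameters

Q = np.zeros((2, M + 1, N_ACTIONS))                 # shared Q over (state, m_k, action)
rng = np.random.default_rng(0)

def cost(x_next, u):
    return x_next + ALPHA * np.exp(-u)              # run-time cost l_{i,k} (assumed p)

for ep in range(EPISODES):
    eps = 1.0 - ep / EPISODES                       # decaying exploration rate
    x = np.zeros(M, dtype=int); x[0] = 1
    for k in range(DAYS):
        m = x.sum()
        greedy = np.argmin(Q[x, m], axis=1)         # per-agent greedy action index
        explore = rng.random(M) < eps
        a = np.where(explore, rng.integers(N_ACTIONS, size=M), greedy)
        u = ACTIONS[a]
        x_next = simulate_day(x, u, rng)            # one-day transition from the earlier sketch
        m_next = x_next.sum()
        # Shared-parameter temporal difference update for every agent
        delta = cost(x_next, u) + GAMMA * Q[x_next, m_next].min(axis=1) - Q[x, m, a]
        for i in range(M):                          # several agents may share a Q entry
            Q[x[i], m, a[i]] += ETA * delta[i]
        x = x_next
```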
Example

In this example, we consider M = 50 agents in the system. Only one agent is infected in the beginning. The run-time cost is the same as in the example in the distributed optimal control section, i.e., l_{i,k} = x_{i,k+1} + α exp(−u_{i,k}), where α is chosen to be 1. For simplicity, the action space is discretized to {0, 1/M, 10/M}, referred to as low, medium, and high. Hence the Q function can be stored as a 2 × M × 3 table. The learning gain is η = 1. The exploration rate is set to decay across episodes, i.e., ε ∝ (1 − E/max E), where E denotes the current episode and the maximum number of episodes is max E = 200. We record the learned Q values after the final episode as well as the system trajectories for episodes 10, 20, ..., 200, blue for earlier episodes and red for later episodes. The results are shown in the Q-learning results figure, panels (a), (b), and (c).

• Case 1: discount γ = 0 with run-time cost l_{i,k}. With γ = 0, this case reduces to a single stage game as discussed in the distributed optimal control section, and the result should align with the analytical Nash Equilibrium. As shown in the left plot of panel (a), the optimal action for a healthy agent is always low (solid green), while the optimal action for an infected agent is always high (dashed magenta). The Q values for infected agents do not depend on m_k. The Q values for healthy agents increase as m_k increases if the activity level is not zero, due to the fact that, for a fixed activity level, the chance of getting infected is higher when there are more infected agents in the system. All these results align with our previous theoretical analysis. Moreover, as shown in the right plot of panel (a), the agents are learning to flatten the curve across episodes.

• Case 2: positive discount γ with run-time cost l_{i,k}. Since the agents are now computing cumulative costs, the corresponding Q values are higher than those in case 1. However, the optimal actions remain the same: low (solid green) for healthy agents, high (dashed magenta) for infected agents, as shown in the left plot of panel (b). The trends of the Q curves also remain the same: the Q values do not depend on m_k for infected agents and for healthy agents whose activity levels are zero. However, as shown in the right plot of panel (b), the agents learned to flatten the curve faster than in case 1, mainly because healthy agents are more cautious (they converge faster to low activity levels) when they start to consider cumulative costs.

• Case 3: positive discount γ with the shaped run-time cost l̃_{i,k}. The shaped cost changes the optimal actions for all agents as well as the resulting Q values. As shown in the left plot of panel (c), the optimal action for an infected agent is low (dashed green), while that for a healthy agent is high (solid magenta) when m_k is small and low (solid green) when m_k is big. Note that when m_k is high, the healthy agents still prefer a low activity level, even though the optimal actions for infected agents are low. That is because, due to the randomization introduced by ε-greedy, there is still a chance for infected agents to take medium or high activity levels. When m_k is high, the healthy agents would rather limit their own activity levels to avoid the risk of meeting infected agents that are taking random actions. This result captures the fact that agents understand that others may have limited rationality, and they prefer more conservative behaviors. We observe the same trends for the Q curves as in the previous two cases: the Q values do not depend on m_k for infected agents and for healthy agents whose activity levels are zero. In terms of absolute values, the Q values for infected agents are higher than those in case 2 due to the additional cost q(u) in l̃_{i,k}. The Q values for healthy agents are smaller than those in case 2 for medium and high activity levels, since the chance of getting infected is smaller as infected agents now prefer low activity levels; the Q values remain the same for healthy agents with zero activity levels. With the shaped costs, the agents learned to flatten the curve even faster than in case 2, as shown in the right plot of panel (c), since the shaped cost encourages infected agents to lower their activity levels.

Discussion and Future Work
Agents vs humans
The epidemic model can be used to analyze real-world societal problems. Nonetheless, it is important to understand the differences between agents and humans. We can directly design and shape the cost function for agents, but not for humans. For agents, their behavior is predictable once we fully specify the problem (i.e., cost, dynamics, measurement, etc.); hence we can optimize the design (i.e., the cost function) to get a desired system trajectory. For humans, their behavior is not fully predictable due to limited rationality; we need to constantly modify the knowledge and observation model as well as the cost function to match true human behavior.
Future work
The proposed model is in its preliminary form. Many future directions can be pursued.

• Relaxation of assumptions. We may add more agent states to consider recovery, incubation periods, and death. We may consider the fact that the interaction patterns among agents are not uniform. We may consider a wide variety of agents who are not homogeneous; for example, health providers and equipment suppliers are key parts in fighting the disease, and they should receive lower cost (higher reward) for maintaining or even expanding their activity levels than ordinary people, while their services can lead to a higher recovery rate. In addition, we may relax the assumptions on agents' knowledge and observation models to consider information asymmetry as well as partial observation; for example, agents cannot get an immediate measurement of whether they are infected or of how many agents are infected in the system.

• Realistic cost functions for agents. The cost functions for agents are currently hand-tuned. We may learn those cost functions from data through inverse reinforcement learning. Those cost functions can vary for agents from different countries, different age groups, and different occupations. Moreover, the cost functions carry important cultural, demographic, economic, and political information. A realistic cost function can help us understand why we observe significantly different outcomes of the pandemic around the world, as well as enable more realistic predictions of the future by fully considering those cultural, demographic, economic, and political factors.

• Incorporation of public policies. For now, the only external intervention we introduced is cost shaping. We may consider a wider range of public policies that can change the closed-loop system dynamics, for example, shut-down of transportation, isolation of infected agents, contact tracing, antibody testing, etc.

• Transient vs steady-state system behaviors. We have focused on the steady-state system behavior in the Nash Equilibrium. However, as agents live in a highly dynamic world, it is not guaranteed that a Nash Equilibrium can always be attained. While agents are learning to deal with unforeseen situations, there are many interesting transient dynamics, some of which are captured in the Q-learning results figure, i.e., agents may learn to flatten the curve at different rates. Methods to understand and predict transient dynamics may be developed in the future.

• Validation against real-world historical data. To use the proposed model for prediction in the real world, we need to validate its fidelity against historical data. The validation can be performed on the m_k trajectories, i.e., for the same initial condition, the predicted m_k trajectories should align with the ground-truth m_k trajectories.

Figure: Results of the multi-agent Q learning under the microscopic epidemic model, for (a) case 1 with discount γ = 0, (b) case 2 with a positive discount γ, and (c) case 3 with a positive discount γ and shaped costs. The left plots show the learned Q values after the final episode; the horizontal axis corresponds to m_k and the vertical axis to the Q values. Solid curves are for healthy agents (x_{i,k} = 0), dashed curves for infected agents (x_{i,k} = 1); curves are shown for the low (u_{i,k} = 0), medium (u_{i,k} = 1/M), and high (u_{i,k} = 10/M) activity levels, with green denoting low and magenta denoting high. The right plots illustrate the system trajectories for episodes 10, 20, ..., 200, blue for earlier episodes and red for later episodes. In the last episode of each case, where there is no exploration (ε = 0), the system trajectories are horizontal with m_k ≡ 1.

Conclusion
This paper introduced a microscopic multi-agent epidemic model, which explicitly considered the consequences of individuals' decisions on the spread of the disease. In the model, every agent can choose its activity level to minimize its cost function, which consists of two conflicting components: staying healthy by limiting activities and maintaining high activity levels for living. We solved for the optimal decisions of individual agents in the framework of game theory and multi-agent reinforcement learning. Given the optimal decisions of all agents, we can make predictions about the spread of the disease. The system had negative externality in the sense that infected agents did not have enough incentives to protect others, which then required external interventions such as cost shaping. Future directions were pointed out to make the model more realistic.
References
Christopher L. Barrett, Keith Bisset, Jonathan Leidig, Achla Marathe, and Madhav Marathe. Estimating the impact of public and private strategies for controlling an epidemic: A multi-agent approach. In Twenty-First IAAI Conference.

Lucian Buşoniu, Robert Babuška, and Bart De Schutter. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications. Springer.

Daryl J. Daley and Joe Gani. Epidemic Modelling: An Introduction. Cambridge University Press.

Jayesh K. Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems. Springer.

Junling Hu and Michael P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research.

Ping Yan and Shengqiang Liu. SEIR epidemic model with delay. The ANZIAM Journal, 2006.