Reinforced Contact Tracing and Epidemic Intervention

Abstract

The recent outbreak of COVID-19 poses a serious threat to people's lives. Epidemic control strategies have also damaged the economy by restricting people's daily commutes. In this paper, we develop an Individual-based Reinforcement Learning Epidemic Control Agent (IDRLECA) to search for smart epidemic control strategies that simultaneously minimize infections and the cost of mobility intervention. IDRLECA first uses an infection probability model to calculate the current infection probability of each individual. Then, the infection probabilities, together with individuals' health status and movement information, are fed to a novel GNN to estimate the spread of the virus through human contacts. The estimated risks are used to support an RL agent in selecting individual-level epidemic-control actions. The training of IDRLECA is guided by a specially designed reward function that considers both the cost of mobility intervention and the effectiveness of epidemic control. Moreover, we design a constraint on control-action selection that reduces the difficulty of action selection and further improves exploration efficiency. Extensive experimental results demonstrate that IDRLECA can suppress infections at a very low level and retain more than 95% of human mobility.


IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Reinforced Contact Tracing and Epidemic Intervention

Tao Feng, Sirui Song, Tong Xia, Yong Li


Index Terms: COVID-19, RL, GNN

1 INTRODUCTION

The recent outbreak of COVID-19 has caused thousands of infections and deaths. As with most epidemics that spread via human contact [1], controlling the spread of the COVID-19 virus requires cutting off human contacts. Governments have adopted different epidemic-control strategies, such as travel-restriction orders, individual quarantine policies, and city lockdowns [2]. However, restricting people's daily mobility and gatherings inevitably harms economic growth, and the current epidemic control strategies for COVID-19 have damaged the economy [3], [4].

To control the epidemic both efficiently and effectively, researchers have proposed smart, computational Epidemic-Prevention-and-Control (EPC) strategies at both the group level and the individual level. Group-level EPC strategies [5], [6] aim to select customized epidemic-control actions for each population group. These works are mainly based on the SIR model [7], which characterizes the development trend of the epidemic from a group-level view. However, group-level EPC strategies ignore the unique situation of each individual, which can easily cause unnecessary mobility intervention costs or secondary transmission. By contrast, individual-based EPC strategies exploit individual information to estimate the infection risk of each individual and then select a customized epidemic-control action for each individual [8]. However, current individual-based EPC strategies [9], [10], [11], [12], [13] lack a module to estimate the spread of the virus through complex contacts between individuals. To achieve an efficient and effective EPC result, we aim in this paper to make maximal use of the available information and design an individual-based EPC strategy that minimizes both the number of infections and the social cost of epidemic control.

• T. Feng, S. Song, T. Xia and Y. Li are with Beijing National Research Center for Information Science and Technology (BNRist), Department of Electronic Engineering, Tsinghua University, Beijing 100084, China. Email: [email protected].

The main challenges of our research are three-fold.

First, primitive individual information can hardly reflect an individual's infection risk. For example, an asymptomatic patient with a very high infection risk is usually hard to detect through symptoms alone. Moreover, the large population and its complex information form a vast state space for control, making it very hard to extract effective information to support the selection of control actions.

Second, the large and complex action space exacerbates the difficulty of control-action selection. If there are $M$ people and $d$ kinds of control actions, the joint action space has size $d^M$, which grows exponentially with the population. Third, searching for a policy that achieves the dual objective of minimizing both infections and the social cost of implementing the strategy is hard, because the two optimization goals influence each other. For example, better control of the epidemic requires greater control efforts, which naturally increases mobility intervention costs.

To solve the above challenges, we propose an Individual-based Reinforcement Learning Epidemic Control Agent (IDRLECA) that combines a Graph Neural Network (GNN) with a Reinforcement Learning (RL) approach. Specifically, to deal with the vast-state-space challenge, we design an infection probability model to calculate the current infection probability of each individual, and the result is added to the individual's state as auxiliary information. To better extract the individual features hidden in daily commutes, we design a novel GNN that takes individuals' states and visiting histories as input and estimates their infection risks. As for the large-action-space challenge, we design and impose a constraint on control-action selection by requiring that individuals with a larger calculated infection probability receive more stringent control actions.

In response to the dual-objective optimization challenge, we carefully design a reward function that considers both the social cost of EPC and the effectiveness of infection suppression. More importantly, the reward function is able to guide training efficiently.

We build a simulation environment based on the PAPW Challenge and experimentally compare the performance of expert EPC strategies, winners of the PAPW Challenge, and our proposed IDRLECA. Extensive results show that IDRLECA achieves the best performance for both infection suppression and mobility retention in all three compared scenarios.

In summary, this paper makes the following contributions:

• We propose IDRLECA to minimize the number of infections and the social cost of EPC. IDRLECA achieves the best performance in both infection suppression and mobility retention compared with expert baselines and PAPW winners.
• We propose a method to address the vast-state-space problem in individual-based EPC. Our method includes an infection-probability model and a novel GNN.
• We design and impose a constraint on control-action selection to improve the exploration efficiency of IDRLECA in the extremely large action space.

The remainder of this paper is organized as follows. We first introduce our problem formulation in Section 2 and our method in Section 3. The experiment results are presented in Section 4. We review related works in Section 5 and conclude our paper in Section 6.

2 PROBLEM FORMULATION AND CHALLENGES

In this section, we formulate the problem of individual-based EPC and discuss the challenges in finding an effective EPC policy.

We consider a within-city epidemic control scenario. The city is assumed to be composed of $N$ areas and to have a population of $M$. Each individual's health status is one of Susceptible, Asymptomatic, Symptomatic, and Recovered. Asymptomatic and Symptomatic individuals are both infected. Each individual commutes between different areas according to predefined commute rules. When people stay in the same area, they have a probability of contacting each other, upon which a Susceptible individual may become Asymptomatic. An Asymptomatic individual becomes Symptomatic after a predefined incubation time. Symptomatic individuals are sent to the hospital and transition to Recovered after a predefined treatment time. Note that policymakers cannot distinguish between Susceptible and Asymptomatic individuals. The goal of individual-based EPC is to select a control action for each individual in the Susceptible and Asymptomatic groups so as to minimize the number of infected people and the cost of intervention measures. Specifically, we define four kinds of control actions: No Intervention, Confine (no contact with people living outside his/her residential area), Quarantine (no stranger contact), and Isolate (no contact). The above modeling of epidemic transmission and individual-based EPC actions is shown in Figure 1.

1. PAPW 2020: https://prescriptive-analytics.github.io/

Fig. 1. Epidemic Spread and Intervention (CDC: Center for Disease Control and Prevention).

Once the number of infected people exceeds a threshold, the medical system will be overwhelmed, leading to a rapid increase in medical costs. On the other hand, when the mobility intervention exceeds a certain threshold, the economic system will be paralyzed, also leading to a sharp increase in social cost. We therefore design the metric Score to evaluate the total cost of an EPC policy, accounting for both reducing infections and maintaining mobility. A smaller Score value indicates a better EPC result. Score is defined as follows:

$$Q = \lambda_h * N_h + \lambda_i * N_i + \lambda_q * N_q + \lambda_c * N_c,$$

$$Score = \exp\left\{\frac{I}{\theta_I}\right\} + \exp\left\{\frac{Q}{\theta_Q}\right\},$$

where $I$ denotes the total number of infected people over all simulation days, $Q$ denotes the aggregate of mobility interventions, and $N_h$, $N_i$, $N_q$ and $N_c$ denote the accumulated numbers of hospitalized, isolated, quarantined, and confined people over all simulation days. $\theta_I$ and $\theta_Q$ are the soft thresholds for the medical system's capacity and the economic system's endurance, and $\lambda_h$, $\lambda_i$, $\lambda_q$ and $\lambda_c$ are scale factors.

In this paper, we aim to find an EPC policy that gives daily control actions for all individuals to minimize Score. Finding an effective EPC policy is challenging in three aspects:

The invisibility of asymptomatic patients and people's complex contacts make the state space vast, so it is difficult to extract effective features for control-action selection. To tackle this challenge, we propose two solutions. First, we design an infection probability model to calculate the current infection probability of each individual; the probability is added to the state of each individual as auxiliary information. Second, IDRLECA employs a novel GNN that takes all individuals' states and the area-visit history as input and estimates infection risks through the contacts between individuals. The estimated infection risks, which measure an individual's ability to potentially infect others, are further used as action thresholds to support action selection.

Individual-level epidemic control aims to select a control action for each individual, which yields an extremely large action space and, in turn, low exploration efficiency for reinforcement learning. To address this challenge, we design an infection probability model to calculate the current infection probability of each person and use IDRLECA to output different action-control thresholds for each individual. The estimated risks are further used to support the RL agent's selection of control actions.

Our goal is to minimize the social cost Score, which contains two optimization objectives over the entire epidemic control process. To solve this problem, we propose a specially designed instant reward, which considers the number of new infections on two consecutive days and the mobility intervention cost on the current day.
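
As a rough illustration of how the two terms of Score compete, the metric defined in this section can be sketched in a few lines of Python. The thresholds used in the example call are the values reported later in the experiments ($\theta_I = 500$, $\theta_Q = 10000$); the function names and the scale factors passed to `mobility_cost` are hypothetical placeholders chosen only for illustration.

```python
import math

def mobility_cost(n_h, n_i, n_q, n_c, lam_h, lam_i, lam_q, lam_c):
    """Aggregate mobility intervention Q: weighted sum of the accumulated
    numbers of hospitalized, isolated, quarantined and confined people."""
    return lam_h * n_h + lam_i * n_i + lam_q * n_q + lam_c * n_c

def score(total_infected, q, theta_i, theta_q):
    """Score = exp(I / theta_I) + exp(Q / theta_Q); smaller is better."""
    return math.exp(total_infected / theta_i) + math.exp(q / theta_q)

# Hypothetical counts and scale factors (not taken from the paper).
q = mobility_cost(n_h=40, n_i=200, n_q=300, n_c=500,
                  lam_h=1.0, lam_i=0.5, lam_q=0.5, lam_c=0.5)
print(score(total_infected=120, q=q, theta_i=500, theta_q=10000))
```

Because both terms are exponential in a ratio to a soft threshold, a policy that pushes either $I$ or $Q$ far past its threshold is penalized much more heavily than one that keeps both moderately low.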

3 METHODOLOGY

To tackle the above challenges, we develop an Individual-based Reinforcement Learning Epidemic Control Agent (IDRLECA) that employs a novel GNN and an RL approach to search for smart control policies. An overview of IDRLECA is shown in Figure 2. At each time step, IDRLECA collects the health status, intervention state and area-visit history of each individual and gives each a customized intervention action. In the rest of this section, we provide the details of IDRLECA.

Fig. 2. The detailed structure of the proposed IDRLECA.

The difficulty of epidemic prevention and control lies in how to find asymptomatic infections and how to take timely and effective measures. To help the later stages of IDRLECA make efficient use of the available information, we design an infection probability model to estimate the probability that an individual is infected. We define the probability of being infected and the probability of being healthy for the $i$-th person as $p^{infe}_i$ and $p^{hel}_i$, respectively. The infection probabilities of contact with strangers and with acquaintances are calculated by the simulation environment and denoted as $p_s$ and $p_c$, respectively. The infection probability model works as follows:

Step 1: Trace back all individuals' area-visit history over the past $T$ time steps.

Step 2: For individual $i$, $i = 1, 2, ..., M$, define his/her probability of being healthy at time step $t$ as $p^{hel}_{i,t}$. $p^{hel}_{i,0}$ is initialized to 1 if individual $i$ is not infected. We update $p^{hel}_{i,t}$ with the following equation:

$$p^{hel}_{i,t} = p^{hel}_{i,t-1} * \left(1 - p_s \frac{N^{infe}_{t-1}}{N^{area}_{t-1}}\right), \quad t = 1, 2, ..., T, \qquad (1)$$

where $N^{infe}_{t-1}$ and $N^{area}_{t-1}$ refer to the number of discovered infections among the visitors and the total number of visitors to the same area as individual $i$, respectively.

Step 3: Update $p^{hel}_{i,T}$ for acquaintances' contacts:

$$\hat{p}^{hel}_{i,T} = p^{hel}_{i,T} * (1 - p_c). \qquad (2)$$

Step 4: Acquire the infection probability:

$$p^{infe}_i = 1 - \hat{p}^{hel}_{i,T}. \qquad (3)$$

After the above steps, we obtain the estimated probability of an individual being infected. We use it as auxiliary information and add it to each individual's state. The estimated probabilities are also used as prior knowledge for the agent's selection of control actions. Note that this estimation is not 100% accurate, since it simplifies the process of contact and the spread of the virus between people. In the later part, we combine it with a GNN to address this problem.

We propose IDRLECA to search for smart strategies that minimize the spread of the epidemic and the cost of intervention at the same time. We treat all individuals in the area as one agent. Therefore, the state and actions of IDRLECA cover all people. We use one day as the decision time interval. In the following, we introduce our design of state, action and reward:

• State: The state of IDRLECA is the integration of each individual's information, which is obtained at the start of each day. For each individual, the state includes the infection state, the intervention state, and the probability of infection calculated by Equations (1)-(3).
• Action: The action of the agent at each step is to determine the intervention measure for each individual. The available actions are no intervention, confine, quarantine, and isolate. To ensure the flexibility of the policy, we set the implementation time of each action to one day.
• Reward: The goal of our method is to minimize the total number of infected people and the total intervention cost at the same time. Considering our dual-objective optimization, we set the reward $r$ as follows,

$$r = -\exp\left\{\frac{\Delta I}{\theta_I}\right\} - \exp\left\{\frac{\Delta Q}{\theta_Q}\right\}, \qquad (4)$$

where $\Delta I$ and $\Delta Q$ denote the daily increment in the number of infected people between consecutive days and the cost of mobility intervention on the day, respectively.
• Learning Algorithm: IDRLECA employs a Proximal Policy Optimization (PPO) [14] agent to find the optimal strategy that minimizes the number of infected people and the cost of prevention at the same time. The PPO agent adopts the actor-critic framework. The critic network is used to estimate the long-term reward of the action, and the actor network is used to find the optimal action policy to achieve dual-objective optimization. We also add an entropy bonus to ensure sufficient exploration during RL training [14].
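
The four steps of the infection probability model can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation. It reads Equation (1) with the ratio of discovered infections to visitors multiplying $p_s$, initializes the healthy probability to 0 for a known infection (the text only specifies the value 1 for non-infected individuals), and skips time steps where the visited area is recorded as empty. The function name and the input format are hypothetical.

```python
def infection_probability(visits, p_s, p_c, known_infected=False):
    """Estimate one individual's infection probability, Eqs. (1)-(3).

    visits: list of (n_infe, n_area) pairs, one per past time step, giving
            the number of discovered infections and the total number of
            visitors in the area this individual visited at that step.
    p_s:    infection probability of contact with strangers.
    p_c:    infection probability of contact with acquaintances.
    """
    # Step 2 initialization: 1 if not infected (0 otherwise is an assumption).
    p_hel = 0.0 if known_infected else 1.0
    for n_infe, n_area in visits:
        if n_area > 0:  # assumption: an empty area leaves p_hel unchanged
            p_hel *= 1.0 - p_s * n_infe / n_area   # Eq. (1)
    p_hel *= 1.0 - p_c                             # Eq. (2)
    return 1.0 - p_hel                             # Eq. (3)

# One visit to an area where 1 of 2 visitors was a discovered infection.
print(infection_probability([(1, 2)], p_s=0.5, p_c=0.2))
```

The estimate only multiplies per-step survival probabilities, so it ignores who actually met whom; this is the simplification that the GNN is later meant to compensate for.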

Since asymptomatic patients are indistinguishable, it is hard to trace all the contacts and infections caused by them. Moreover, heavy modern traffic and complex social network structure make it more challenging to estimate the infection risk of each individual. To deal with this challenge, we propose a novel GNN, namely Individual Contact GNN, to estimate the infection risk of each individual. Individual Contact GNN is used to build both the actor network and the critic network in IDRLECA. The GNN regards individuals and city areas as two kinds of nodes. This enables us to model individual-individual contacts through individual-area-individual contacts, which avoids the extremely large individual-individual contact matrix (size $M * M$).

Specifically, Individual Contact GNN is designed on the basis of GraphSage [15]. The state input to the GNN consists of the health status, intervention state, and infection probability of all individuals, and the edge-information inputs are the area-visit histories at different time steps. We use $f^k_{area}$ and $f^k_{ind}$ to denote the area-node features and individual-node features output by the $k$-th GNN layer, respectively. The detailed layer calculation of Individual Contact GNN is as follows:

$$f^{k-1}_c = \mathrm{softmax}(f^{k-1}_v), \qquad (5)$$

$$f^{k-1}_{area} = \sigma\left(W^{k-1} (f^{k-1}_c)^T f^{k-1}_{ind} + B^{k-1}\right), \qquad (6)$$

$$f^k_{ind} = \sigma\left(W^k f^{k-1}_c f^{k-1}_{area} + B^k\right), \qquad (7)$$

where $f^{k-1}_v$ denotes the area-visit history at the $(k-1)$-th time step, and $W^{k-1}$, $B^{k-1}$, $W^k$, $B^k$ denote trainable parameters.

In the above equations, Equation (5) uses the area-visit history as edge weights; Equation (6) aggregates the weighted visitors' characteristics to calculate the area-node feature; Equation (7) aggregates the features of the areas that an individual has visited to calculate the individual-node feature.

As discussed before, the EPC problem has an extremely large action space, which challenges policy search. To address this issue, we incorporate prior knowledge into the control-selection step.
Specifically, we let the actor network of IDRLECA first output four values $<p_{i,1}, p_{i,2}, p_{i,3}, p_{i,4}>$ for individual $i$, $i = 1, 2, ..., M$. Then, we transform the four values into three thresholds:

$$P_{i,1} = \frac{e^{-p_{i,1}}}{e^{-p_{i,1}} + e^{-p_{i,2}} + e^{-p_{i,3}} + e^{-p_{i,4}}}, \qquad (8)$$

$$P_{i,2} = \frac{e^{-p_{i,1}} + e^{-p_{i,2}}}{e^{-p_{i,1}} + e^{-p_{i,2}} + e^{-p_{i,3}} + e^{-p_{i,4}}}, \qquad (9)$$

$$P_{i,3} = \frac{e^{-p_{i,1}} + e^{-p_{i,2}} + e^{-p_{i,3}}}{e^{-p_{i,1}} + e^{-p_{i,2}} + e^{-p_{i,3}} + e^{-p_{i,4}}}. \qquad (10)$$

Through the above equations, we can ensure $0 \le P_{i,1} \le P_{i,2} \le P_{i,3} \le 1$. Thus, $P_{i,1}$, $P_{i,2}$, $P_{i,3}$ can be used as different infection risk levels, which consider both the risk of the individual being infected and the individual's ability to potentially infect others. It is natural and reasonable to expect that an individual with a higher infection risk should receive a more stringent control action. The infection risk levels are further used as the thresholds for the infection probabilities estimated in Section 3.1, which imposes the constraint that individuals with a higher infection probability have a higher infection risk and receive more stringent control actions. In this way, individuals with a high probability of infection are not identified as low risk, thus reducing unnecessary strategy exploration.

By comparing the pre-calculated infection probability $p^{infe}_i$ with $<P_{i,1}, P_{i,2}, P_{i,3}>$, we define the action-selection rule in Table 1. It can be seen from Table 1 that as the infection probability goes from low to high, the corresponding intervention actions become more and more stringent. There are different thresholds for different individuals, which fully takes into account the differences in individual states.

TABLE 1
Action-Selection Rule.

Infection probability                      Intervention action
$0 \le p^{infe}_i \le P_{i,1}$             No intervention
$P_{i,1} \le p^{infe}_i \le P_{i,2}$       Confine
$P_{i,2} \le p^{infe}_i \le P_{i,3}$       Quarantine
$P_{i,3} \le p^{infe}_i \le 1$             Isolate
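
The path from area-visit history to a per-individual action can be sketched with NumPy as below, covering one GNN layer as in Equations (5)-(7), the threshold transform of Equations (8)-(10), and the rule of Table 1. This is a sketch under several assumptions that the text does not fix: the softmax in Equation (5) is taken over areas for each individual, $\sigma$ is taken to be the logistic sigmoid, the trainable weights act on the feature dimension, a single layer feeds the thresholds directly, and a probability that sits exactly on a threshold falls to the less stringent action. All function names are hypothetical.

```python
import numpy as np

ACTIONS = ("No intervention", "Confine", "Quarantine", "Isolate")

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contact_gnn_layer(f_v, f_ind, w_area, b_area, w_ind, b_ind):
    """One Individual Contact GNN layer.

    f_v:   (M, N) area-visit history, used as edge weights.
    f_ind: (M, D) individual-node features.
    Returns area features (N, H) and new individual features (M, K).
    """
    f_c = softmax(f_v, axis=1)                          # Eq. (5)
    f_area = sigmoid(f_c.T @ f_ind @ w_area + b_area)   # Eq. (6)
    f_ind_next = sigmoid(f_c @ f_area @ w_ind + b_ind)  # Eq. (7)
    return f_area, f_ind_next

def risk_thresholds(p):
    """Map actor outputs p of shape (M, 4) to thresholds (M, 3), Eqs. (8)-(10)."""
    weights = softmax(-p, axis=1)             # e^{-p_ij} / sum_j e^{-p_ij}
    return np.cumsum(weights, axis=1)[:, :3]  # P_1 <= P_2 <= P_3

def select_actions(p_infe, thresholds):
    """Table 1: the action index is the number of thresholds exceeded."""
    level = (p_infe[:, None] > thresholds).sum(axis=1)
    return [ACTIONS[k] for k in level]

# Tiny example with random inputs: M = 5 individuals, N = 3 areas.
rng = np.random.default_rng(0)
f_v, f_ind = rng.random((5, 3)), rng.random((5, 4))
_, actor_out = contact_gnn_layer(f_v, f_ind, rng.normal(size=(4, 6)),
                                 np.zeros(6), rng.normal(size=(6, 4)), np.zeros(4))
print(select_actions(rng.random(5), risk_thresholds(actor_out)))
```

Because the thresholds are cumulative sums of a softmax, they are ordered by construction, so the agent only has to learn where to place the cut points rather than explore arbitrary action assignments.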

Similar to DURLECA, where RL is used for epidemic control [6], it is possible to encounter extreme states or actions during RL exploration in IDRLECA's training. This may severely impact exploration efficiency and result in local optima. Inspired by DURLECA, we adopt a rule to avoid these extreme experiences:

• The infection-increase threshold $I_t$: During the agent's exploration process, if the number of new infections on a certain day exceeds $I_t$, the current episode is stopped and a large penalty is added to the reward of the agent.
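
The instant reward of Equation (4) and the infection-increase rule above can be combined in a short per-step routine. The sketch below is illustrative only: the function names, the penalty magnitude, and the value of the threshold $I_t$ are hypothetical, and reusing the Score thresholds ($\theta_I = 500$, $\theta_Q = 10000$) for the reward in the example call is an assumption.

```python
import math

def instant_reward(delta_i, delta_q, theta_i, theta_q):
    """Eq. (4): r = -exp(dI / theta_I) - exp(dQ / theta_Q)."""
    return -math.exp(delta_i / theta_i) - math.exp(delta_q / theta_q)

def step_reward(new_infections, delta_i, delta_q, theta_i, theta_q,
                infection_limit, penalty):
    """Return (reward, done) for one simulated day.

    If the day's new infections exceed infection_limit (I_t), the episode
    is stopped and a large penalty is subtracted from the reward.
    """
    r = instant_reward(delta_i, delta_q, theta_i, theta_q)
    if new_infections > infection_limit:
        return r - penalty, True
    return r, False

# Hypothetical day: 12 new infections against a limit of 10.
print(step_reward(12, delta_i=4, delta_q=150.0, theta_i=500, theta_q=10000,
                  infection_limit=10, penalty=100.0))
```

Stopping the episode early keeps the replay of clearly failed trajectories short, so more of the training budget is spent near policies that keep the epidemic under control.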

4 EXPERIMENTS

In this section, we conduct extensive experiments on four scenarios to answer the following research questions:

• RQ1: Can IDRLECA minimize the number of infections and the cost of interventions?
• RQ2: Can IDRLECA be adapted to different scenarios?
• RQ3: How does IDRLECA compare to expert policies and PAPW winners?

In the following, we introduce more details about our experiment design.

We build a simulation environment mainly based on the PAPW Challenge. The simulated disease has an $R_0$ ranging from 2 to 2.5, which is similar to COVID-19. The total simulation time is 60 days. Every individual has a predefined commute pattern. To simulate a more practical EPC scenario, we add a new rule to the original simulator: all symptomatic patients are sent to the hospital.

We define $t_{start}$ as the number of days between discovering the first patient and starting epidemic intervention.

• Scenario-Default: $N = 11$, $M = 10000$, $t_{start} = 1$. This scenario verifies the EPC performance of IDRLECA in an ordinary epidemic scenario.
• Scenario-Larger: $N = 98$, $M = 10000$, $t_{start} = 1$. This scenario verifies whether IDRLECA is suitable for scenarios with greater individual mobility.
• Scenario-Changeable: $N = 11$, $M = 10000$, $t_{start} = 1$. Compared with Scenario-Default, people's commute patterns are more changeable in this scenario. This scenario verifies whether IDRLECA is applicable when there are greater differences in individuals' characteristics.
• Scenario-Late: $N = 11$, $M = 10000$, $t_{start} = 5$. Compared with Scenario-Default, this scenario starts intervention 5 days after discovering the first patient. This scenario verifies the EPC performance of IDRLECA with a late intervention.

We use the following metrics:

• $I$: The total number of infected people over all simulation days. It measures the effectiveness of EPC strategies in suppressing infections.
• $Q$: The aggregated mobility interventions defined in Section 2. To have a fair comparison with PAPW winners, we set λ h = 1 , λ i = 0 . , λ q = 0 . and λ c = 0 . , which are the same as the settings in the PAPW Challenge.
• Score: The social cost of an epidemic control policy, defined in Section 2. We set $\theta_I = 500$ and $\theta_Q = 10000$, which are the same as the settings in the PAPW Challenge.

We set up 4 expert baselines to simulate EPC strategies in the real world:

• No Intervention: No intervention at all.
• Lockdown [2]: Lock down the city for 60 consecutive days.
• Expert (0 . ) and Expert (0 . ): Baselines based on the infection probability model. They isolate individuals whose infection probability is higher than a given threshold.

We compare IDRLECA with two baselines commonly used in epidemic research:

• Degree-Sample [13]: If the number of an individual's acquaintances $n$ is more than 4, isolate the individual with a probability ( n − )/n.
• Degree-Order [12]: Count the number of contacts of each individual in the past 5 days and select the top-ranked individuals for isolation.

We compare IDRLECA with PAPW winners:

• GBM [9]: a baseline for epidemic intervention that predicts individual health states and strikes a balance between precision and recall.
• EITL [10]: a heuristic baseline that adjusts the epidemic strategy through a heuristic algorithm, based on evaluating the effectiveness of intervention actions, understanding the resulting patterns, and interpreting causality.
• HRLI [11]: a state-of-the-art RL baseline combining individual prevention with regional control.
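
To make the rule-based baselines concrete, the Expert and Degree-Order policies can be sketched as simple functions that return an isolate/no-isolate decision per individual. This is a hedged illustration: the function names are hypothetical, and the `threshold` and `top_fraction` arguments are placeholders whose values are chosen here only for the example.

```python
def expert_policy(infection_probs, threshold):
    """Expert baseline: isolate every individual whose estimated
    infection probability is higher than the given threshold."""
    return [p > threshold for p in infection_probs]

def degree_order_policy(contact_counts, top_fraction):
    """Degree-Order baseline: rank individuals by their number of recent
    contacts and isolate the top fraction of that ranking."""
    k = int(len(contact_counts) * top_fraction)
    ranked = sorted(range(len(contact_counts)),
                    key=lambda i: contact_counts[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(contact_counts))]

# Hypothetical inputs for four individuals.
print(expert_policy([0.1, 0.6, 0.3, 0.8], threshold=0.5))
print(degree_order_policy([5, 1, 9, 3], top_fraction=0.5))
```

Both baselines apply one fixed rule to everyone, in contrast to IDRLECA, which outputs individual-specific thresholds and chooses among four intervention levels.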

We compare IDRLECA with all the baselines when $t_{start} = 1$ day. Table 2 shows our main results. IDRLECA is better than all baselines in the three scenarios in terms of Score. For instance, compared with the best baseline, HRLI, in Scenario-Default, our method reduces both the number of infected persons and the cost of mobility intervention.

Among the four expert baselines, No Intervention aggravates the spread of the infectious disease and eventually leads to the paralysis of the medical system. The other three expert baselines can limit the spread of the epidemic to some extent. However, these strategies pay huge mobility intervention costs in order to reduce the number of infections, thereby greatly increasing the total social cost. For example, Lockdown is a common real-world response to an epidemic; it achieves the best performance in minimizing infections at the expense of the maximum mobility intervention cost. Compared with the four expert baselines, IDRLECA can minimize infections and retain a large amount of mobility at the same time.

TABLE 2
Performance comparison in three scenarios when $t_{start} = 1$ day.
Columns: Method; I, Q, Score for each of Scenario-Default, Scenario-Larger, Scenario-Changeable.
No Intervention (Scenario-Default): I = 8289, Q = 123153.00.

TABLE 3
Performance comparison in Scenario-Late when $t_{start} = 5$ days.
Columns: Method; I, Q, Score for Scenario-Late.
No Intervention: I = 8040, Q = 119175.00.

Compared with GBM, EITL and the two baselines commonly used in epidemic research, IDRLECA performs better mainly because it considers more individual characteristics and the long-term impact of current actions when making decisions.

Compared with HRLI, we find that IDRLECA outperforms it in all metrics, which may be because the GNN in our method models the contact between individuals and estimates individual infection risks through contact. IDRLECA can find hidden infections through the GNN and is thus able to stop the spread of the epidemic quickly at minimal mobility intervention cost, which will be verified in the case study.

From the results of Scenario-Larger and Scenario-Changeable, we find that IDRLECA still achieves the minimum number of infections and the minimum mobility intervention cost in more changeable and flexible scenarios.

Late intervention in an epidemic is very common in the real world. An effective control strategy should be able to stop the spread of the epidemic in time with the least cost of mobility intervention even when intervention starts late. We perform our experiment in Scenario-Late, and the results are shown in Table 3. The results show that our method achieves the best Score among all baselines in the case of late intervention.

In Figure 3, we compare the number of infections and the cost of mobility intervention between IDRLECA and the best baseline method, HRLI, over 60 days in Scenario-Late with $t_{start} = 5$ days. Our method not only stops the spread of the epidemic faster, but also reduces the cost of intervention during the peak period of the epidemic.

To verify the effectiveness of our method for individual epidemic prevention and control, we randomly select 100 individuals in Scenario-Default and draw a heat map of the change in their infection probability over 60 days in Figure 4. The infection probability of the 100 people reaches its peak at about day 15, but under the influence of the intervention measures taken by IDRLECA, it is soon reduced to 0 around the 40th day.

In order to verify the effectiveness of our method in individual prevention and control, we conduct two case studies.

Evaluating individual intervention: To verify the specific effects of our method on individual intervention, we plot the infection probability of one person and the changes in his/her prevention and control measures over the 60 days of Scenario-Default in Figure 5. Our method adapts its control actions closely to changes in the infection probability and can effectively reduce the risk of infection.

Finding hidden infections: To verify whether our method can discover hidden infections, we examine the actions that IDRLECA outputs for the individuals with ID 927 and 959 on the 20th day of Scenario-Default: quarantine and confine, respectively. However, the infection probabilities of these two individuals are 0.004 and 0.34, respectively, whose numerical order is exactly opposite to the stringency of the control actions. We further found that the first person had more contacts and acquaintances in the past five days than the second person. The discrepancy arises because the infection probability calculated in Section 3.1 only considers the impact of the currently discovered infections and simplifies the spread of the epidemic through individual contacts. Our GNN models the contact between individuals and can estimate both an individual's potential risk of infection and his/her ability to potentially infect others. Therefore, although the first person has a relatively low probability of infection, our model takes into account the infection risk, which measures the harm and risk of secondary transmission from potentially infected individuals, so stricter measures are taken for the first person. Two days later, the first person was detected as infected during the intervention period, which supports our findings.

Fig. 3. Change over time of Q (the aggregated mobility interventions defined in Section 2) and H (the number of healthy people).

Fig. 4. Change in infection probability of 100 individuals within 60 days (Scenario-Default).

Fig. 5. The relationship between infection probability and intervention action types.

To evaluate the effectiveness of our proposed Individual Contact GNN and RL exploration strategy (avoiding extreme experiences), we conduct an ablation study in this section. We select three baselines and perform experiments in two scenarios. No Intervene is the blank-control baseline. RL-NoGraph and RL-NoEP denote IDRLECA with the GNN removed and with the RL exploration strategy (avoiding extreme experiences) removed, respectively. The results in Table 4 show that removing the GNN structure makes it difficult for RL to find hidden infections, which increases the number of infections and the cost of prevention and control. Removing the exploration strategy (avoiding extreme experiences) makes it hard for RL to further reduce the number of infections and the cost, so that it falls into a local optimum. Compared with RL-NoGraph and RL-NoEP, IDRLECA can better find hidden infections with the help of the GNN and explores reasonably and effectively under the exploration strategy, so that it learns better results.

TABLE 4
Ablation study.
Columns: Method; I, Q, Score for Scenario-Default and for Scenario-Default ($t_{start} = 5$ days).
No Intervene (Scenario-Default): I = 8289, Q = 123153.00.

5 RELATED WORKS

The Individual-based Infectious Diseases Model (IBIDM) is an epidemiological model that has emerged in recent years [16]. Compared with traditional infectious disease models, IBIDM can reflect the heterogeneity of individuals and individual-level behavior dynamics, and thus reflects the spread of the epidemic more precisely. IBIDM models each individual as a unit and measures the contact relationships between individuals through a social contact network. The Los Alamos Laboratory in the United States has developed an individual-based infectious disease simulation tool, called the EpiSimS system, which can effectively simulate the spread, prevention and control of an epidemic based on individual characteristics [17]. Later, some researchers proposed Epifast to simulate the spread of Ebola in West Africa, which has higher prediction accuracy and simulation precision than traditional methods [18]. There is also much research on epidemic control based on these epidemic simulations. [19] studies the trade-off between the spread of COVID-19 and economic impact and proposes mechanisms based on group scheduling to strike a balance between epidemic control and economic development. [12], [13], [20] regard individuals as nodes of a graph and the connections between individuals as edges, and find the individuals who need to be isolated through the graph. [21] introduces mean-field models and complex networks to address individual-level prevention and control of the epidemic.


However, existing studies struggle to extract the status of individuals effectively, and their strategies often cannot cope with varied scenarios and conditions. They often attend only to the current prevention and control effects and ignore the long-term impact of current decisions. Therefore, we develop IDRLECA, which considers not only how to track and control infectious and asymptomatic individuals based on individuals' infection status, but also how to achieve better epidemic prevention while minimizing economic losses in the long term.

Graph Neural Networks (GNN) are mainly used for node prediction, link prediction, and graph prediction tasks. Node prediction refers to predicting the type of a given node [22], [23]. Link prediction means predicting the connection status of two given nodes [24], [25]. Graph prediction aggregates all node features in the graph into a graph feature, and then classifies the type of the graph based on it [26], [27]. There are several commonly used GNN methods. GCN uses the adjacency matrix of nodes as input to learn the relationships between nodes [22]. GAT introduces an attention mechanism on top of GNN [28]. GraphSage learns node relationships by aggregating information from neighbor nodes [15].

However, current GNN methods lack a framework to model the spread of an epidemic between individuals over a dynamic graph. Therefore, we propose a novel GNN structure to characterize epidemic spreading between individuals, whose nodes and edges represent the state features of individuals and the contacts between individuals, respectively.
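To make the idea of aggregating information from neighbor nodes concrete, the following is a minimal sketch of one round of mean aggregation on a small contact graph. It is not the GNN proposed in this paper and it omits the learned weights and nonlinearity of GCN or GraphSage; the function name `mean_aggregate`, the three-person graph, and the single "risk" feature are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def mean_aggregate(features: Dict[int, List[float]],
                   neighbors: Dict[int, List[int]]) -> Dict[int, List[float]]:
    """Replace each node's feature vector by the mean over itself and its neighbors."""
    out: Dict[int, List[float]] = {}
    for node, feat in features.items():
        # The node's own vector plus the vectors of all its neighbors.
        group = [feat] + [features[n] for n in neighbors.get(node, [])]
        out[node] = [sum(vec[d] for vec in group) / len(group)
                     for d in range(len(feat))]
    return out


# Hypothetical contact graph: person 0 met person 1, person 1 met person 2.
features = {0: [1.0], 1: [0.0], 2: [0.0]}      # one illustrative risk feature per person
neighbors = {0: [1], 1: [0, 2], 2: [1]}
smoothed = mean_aggregate(features, neighbors)
```

After one round, the risk signal of person 0 has reached the direct contact (person 1) but not person 2; a second round would propagate it one hop further, which is the intuition behind stacking aggregation layers.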

PPO is a recent type of policy gradient algorithm that has been applied in many areas. [14] proposes that PPO strikes a balance between implementation simplicity, sample complexity, and difficulty of tuning, and achieves good results in many games. [29], [30] show that PPO can perform well on problems with large-scale, complex state and action spaces.

Since PPO has good stability and adaptability and can achieve good results in problems with large-scale state and action spaces, we choose PPO as the RL algorithm for our problem.
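As a reminder of what makes PPO stable, the sketch below computes the clipped surrogate objective from [14]: the mean over samples of min(r * A, clip(r, 1 - epsilon, 1 + epsilon) * A), where r is the probability ratio between the new and old policies and A is the advantage estimate. The text above does not state which PPO variant or clipping range IDRLECA uses, so the function name `ppo_clip_objective`, the value epsilon = 0.2, and the sample numbers are assumptions for illustration only.

```python
from typing import Sequence


def ppo_clip_objective(ratios: Sequence[float],
                       advantages: Sequence[float],
                       epsilon: float = 0.2) -> float:
    """Mean clipped surrogate objective over a batch (to be maximized)."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        # Keep the probability ratio inside [1 - epsilon, 1 + epsilon].
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


# Illustrative batch: the first two samples are affected by clipping.
value = ppo_clip_objective(ratios=[1.5, 0.5, 1.0], advantages=[1.0, -1.0, 2.0])
```

Taking the minimum removes the incentive to move the policy far from the old one in a single update, which is the source of the stability mentioned above.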

CONCLUSION

In this paper, we propose IDRLECA, which employs a novel GNN and RL approach to minimize infections as well as the mobility intervention cost in EPC. The proposed GNN can estimate the spread of the virus through contacts between individuals. The training of IDRLECA is guided by a specially designed reward. We design and impose a constraint on control-action selection that eases its difficulty and further improves exploration efficiency. Extensive experiments are conducted on different scenarios to show the effectiveness of our proposed method.

APPENDIX

To help reproduce the results, here we present the details of the simulator and experiment settings. The simulator contains a human mobility model and a disease transmission model. The simulator uses these two models to simulate individuals' movements and the spread of the epidemic among individuals. The two models are briefly introduced below:

Human Mobility Model:

The human mobility model simulates individual mobility in a city of N areas with M people. Each area is assumed to belong to one of three categories: working, residential, and commercial. An individual is associated with two fixed areas: a residential area and a working area. We assume that an individual has different modes of mobility on weekdays and weekends. On weekdays, an individual moves from his/her residential area to his/her working area. After work, he/she may visit a nearby commercial area and then returns to his/her residential area. On weekends, an individual visits a random commercial area and then returns to his/her residential area.

Disease Transmission Model:

The disease can be transmitted from an infected individual through acquaintance contacts and stranger contacts. Contacts happen among people within the same area. The infection probabilities of contact with acquaintances and strangers are p_c and p_s, respectively. The disease transmission is simulated every hour.

We set the infection probability of contact with strangers to p_s = 0. and the infection probability of contact with acquaintances to p_c = 0. . The estimated R is 2-2.5. For the extreme-experience policy, we set Q_t = 250. For Score, we set λ_h = 1, λ_i = 0. , λ_q = 0. and λ_c = 0. , which are the same as the settings in the PAPW Challenge. For the reward and Score, we set θ_I = 500 and θ_Q = 10000.

In the training process, the beginning state of an episode is random every time. We train IDRLECA for 200,000 steps, using the Adam optimizer with a learning rate of 0.0001. During testing, the initial setting is fixed for both IDRLECA and the baseline methods. To account for the randomness of the simulator, we compare the average results of all methods over three random seeds.

Privacy issues:

Each area's visit history, the total number of people visiting a particular area, and person-to-person relationships [31] can be obtained from individuals' trajectories. In a practical system, user anonymization can be used to reduce the risk of privacy leakage for individual trajectories and health history data.
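Returning to the simulator described earlier in this appendix, the following is a minimal sketch of how a human mobility model and an hourly disease transmission model of this kind could fit together. It is not the PAPW simulator. The names `Person`, `location`, and `transmission_step`, the hour boundaries of the daily schedule, the rule that every pair of people in the same area is in contact, and the probabilities passed as `p_c` and `p_s` are all assumptions made for illustration and are not the values used in the experiments.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Person:
    pid: int
    home: int                      # residential area id
    work: int                      # working area id
    acquaintances: Set[int] = field(default_factory=set)
    infected: bool = False


def location(person: Person, hour: int, weekday: bool, shop: Optional[int]) -> int:
    """Area occupied at `hour`; the hour boundaries are illustrative assumptions."""
    if weekday:
        if 9 <= hour < 17:
            return person.work
        if 17 <= hour < 19 and shop is not None:   # optional visit after work
            return shop
        return person.home
    if 10 <= hour < 16 and shop is not None:       # weekend visit to a commercial area
        return shop
    return person.home


def transmission_step(people: List[Person], where: Dict[int, int],
                      p_c: float, p_s: float, rng: random.Random) -> Set[int]:
    """One hourly step; returns the ids of newly infected people."""
    by_area: Dict[int, List[Person]] = {}
    for p in people:
        by_area.setdefault(where[p.pid], []).append(p)
    newly: Set[int] = set()
    for group in by_area.values():
        for src in (p for p in group if p.infected):
            for dst in group:
                if dst.infected:
                    continue
                # Assumption: everyone sharing an area is in contact this hour.
                prob = p_c if dst.pid in src.acquaintances else p_s
                if rng.random() < prob:
                    newly.add(dst.pid)
    # Apply infections after the sweep so new cases do not spread within the same hour.
    for p in people:
        if p.pid in newly:
            p.infected = True
    return newly


# Simulate one weekday for four people; all numbers below are placeholders.
rng = random.Random(0)
people = [Person(0, home=0, work=2, acquaintances={1}, infected=True),
          Person(1, home=0, work=2, acquaintances={0}),
          Person(2, home=1, work=2),
          Person(3, home=1, work=3)]
commercial = [4, 5]
shops = {p.pid: (rng.choice(commercial) if rng.random() < 0.5 else None)
         for p in people}
for hour in range(24):
    where = {p.pid: location(p, hour, True, shops[p.pid]) for p in people}
    transmission_step(people, where, p_c=0.05, p_s=0.01, rng=rng)
infected_count = sum(p.infected for p in people)
```

Keeping the mobility rule (`location`) separate from the infection rule (`transmission_step`) mirrors the two-model structure of the simulator and makes it straightforward to insert an intervention, for example by overriding the returned area for a confined individual.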

Implementation issues:

Our system runs on a central server instead of individuals' smartphones, and usually the policymaker has the ability to collect the data needed in our model. Besides, some recent techniques like the Apple and Google APIs can be used to collect data without using private information. In the practical implementation of our method, distributed servers and federated learning can be used to protect privacy. The city will be divided into small areas. In each area we have a distributed server that receives encrypted data from smartphones and conducts federated learning with the central server. After training, each distributed server pulls the model from the central server and sends reminders to users' smartphones.

4. PAPW 2020: https://prescriptive-analytics.github.io/. Simulator: https://hzw77-demo.readthedocs.io/en/round2/.

ACKNOWLEDGMENTS

This work was supported in part by The National Key Research and Development Program of China under grant 2018YFB1800804, the National Natural Science Foundation of China under U1936217, 61971267, 61972223, 61941117, 61861136003, Beijing Natural Science Foundation under L182038, Beijing National Research Center for Information Science and Technology under 20031887521, and research fund of Tsinghua University - Tencent Joint Laboratory for Internet Innovation Technology.

REFERENCES

[1] D. Balcan, B. Gonçalves, H. Hu, J. J. Ramasco, V. Colizza, and A. Vespignani, “Modeling the spatial spread of infectious diseases: The global epidemic and mobility computational model,” Journal of Computational Science, vol. 1, no. 3, pp. 132–145, 2010.
[2] T. Hale, A. Petherick, T. Phillips, and S. Webster, “Variation in government responses to COVID-19,” Blavatnik School of Government Working Paper, vol. 31, 2020.
[3] G. Bonaccorsi, F. Pierri, M. Cinelli, A. Flori, A. Galeazzi, F. Porcelli, A. L. Schmidt, C. M. Valensise, A. Scala, W. Quattrociocchi, and F. Pammolli, “Economic and social consequences of human mobility restrictions under COVID-19,” Proceedings of the National Academy of Sciences
[4] et al., “Understanding coronanomics: The economic implications of the coronavirus (COVID-19) pandemic,” SSRN Electronic Journal, https://doi.org/10/ggq92n, 2020.
[5] Z. Yang, Z. Zeng, K. Wang, S.-S. Wong, W. Liang, M. Zanin, P. Liu, X. Cao, Z. Gao, Z. Mai et al., “Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions,” Journal of Thoracic Disease, vol. 12, no. 3, p. 165, 2020.
[6] S. Song, Z. Zong, Y. Li, X. Liu, and Y. Yu, “Reinforced epidemic control: Saving both lives and economy,” 2020.
[7] W. O. Kermack and A. G. McKendrick, “A contribution to the mathematical theory of epidemics,” Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, vol. 115, no. 772, pp. 700–721, 1927.
[8] L. E. Rocha and N. Masuda, “Individual-based approach to epidemic processes on arbitrary dynamic contact networks,” Scientific Reports, vol. 6, p. 31456, 2016.
[9] S. G. Rizzo, “Balancing precision and recall for cost-effective epidemic containment,” [EB/OL], 2020, https://prescriptive-analytics.github.io/file/3-strizzo.pdf.
[10] J.-S. Kim, H. Jin, and A. Züfle, “Expert-in-the-loop prescriptive analytics using mobility intervention for epidemics,” 2020.
[11] Y. Dong, C. Yu, and L. Xia, “Hierarchical reinforcement learning for epidemics intervention,” 2020.
[12] S. Eubank, H. Guclu, V. A. Kumar, M. V. Marathe, A. Srinivasan, Z. Toroczkai, and N. Wang, “Modelling disease outbreaks in realistic urban social networks,” Nature, vol. 429, no. 6988, pp. 180–184, 2004.
[13] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature, vol. 393, no. 6684, pp. 440–442, 1998.

5. https://covid19.apple.com/contacttracing

[14] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[15] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[16] G. J. Milne, J. K. Kelso, H. A. Kelly, S. T. Huband, and J. McVernon, “A small community model for the transmission of infectious diseases: comparison of school closure as an intervention in individual-based models of an influenza pandemic,” PLoS ONE, vol. 3, no. 12, p. e4005, 2008.
[17] S. M. Mniszewski, S. Y. Del Valle, P. D. Stroud, J. M. Riese, and S. J. Sydoriak, “EpiSimS simulation of a multi-component strategy for pandemic influenza,” in Proceedings of the 2008 Spring Simulation Multiconference, 2008, pp. 556–563.
[18] K. R. Bisset, J. Chen, X. Feng, V. A. Kumar, and M. V. Marathe, “EpiFast: a fast algorithm for large scale realistic epidemic simulations on distributed memory systems,” in Proceedings of the 23rd International Conference on Supercomputing, 2009, pp. 430–439.
[19] J. Augustine, K. Hourani, A. R. Molla, G. Pandurangan, and A. Pasic, “Economy versus disease spread: Reopening mechanisms for COVID 19,” arXiv preprint arXiv:2009.08872, 2020.
[20] P. S. Park, J. E. Blumenstock, and M. W. Macy, “The strength of long-range ties in population-scale social networks,” Science, vol. 362, no. 6421, pp. 1410–1413, 2018.
[21] Q. Wu and T. Hadzibeganovic, “An individual-based modeling framework for infectious disease spreading in clustered complex networks,” Applied Mathematical Modelling, 2020.
[22] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[23] Z. Liu, C. Chen, X. Yang, J. Zhou, X. Li, and L. Song, “Heterogeneous graph neural networks for malicious account detection,” in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 2077–2085.
[24] M. Zhang and Y. Chen, “Link prediction based on graph neural networks,” in Advances in Neural Information Processing Systems, 2018, pp. 5165–5175.
[25] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, “Graph convolutional neural networks for web-scale recommender systems,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 974–983.
[26] D. Bacciu, F. Errica, and A. Micheli, “Contextual graph Markov model: A deep and generative approach to graph processing,” arXiv preprint arXiv:1805.10636, 2018.
[27] X. Geng, Y. Li, L. Wang, L. Zhang, Q. Yang, J. Ye, and Y. Liu, “Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3656–3663.
[28] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
[29] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse et al., “Dota 2 with large scale deep reinforcement learning,” arXiv preprint arXiv:1912.06680, 2019.
[30] D. Ye, Z. Liu, M. Sun, B. Shi, P. Zhao, H. Wu, H. Yu, S. Yang, X. Wu, Q. Guo et al., “Mastering complex control in MOBA games with deep reinforcement learning,” in AAAI, 2020, pp. 6672–6679.
[31] K. Xu, K. Zou, Y. Huang, X. Yu, and X. Zhang, “Mining community and inferring friendship in mobile social networks,”