[PDF] COVID-19 Risk Estimation using a Time-varying SIR-model

Abstract

Policy-makers require data-driven tools to assess the spread of COVID-19 and inform the public of their risk of infection on an ongoing basis. We propose a rigorous hybrid model-and-data-driven approach to risk scoring based on a time-varying SIR epidemic model that ultimately yields a simplified color-coded risk level for each community. The risk score Γ t that we propose is proportional to the probability of someone currently healthy getting infected in the next 24 hours. We show how this risk score can be estimated using another useful metric of infection spread, R t , the time-varying average reproduction number which indicates the average number of individuals an infected person would infect in turn. The proposed approach also allows for quantification of uncertainty in the estimates of R t and Γ t in the form of confidence intervals. Code and data from our effort have been open-sourced and are being applied to assess and communicate the risk of infection in the City and County of Los Angeles.

Full PDF

CCOVID-19 Risk Estimation using a Time-varyingSIR-model

Mehrdad Kiamari, GowriRamachandran, Quynh Nguyen

Viterbi School of Engineering,University of Southern California,Los Angeles, USA { kiamari, gsramach,quynhngu } @usc.edu Eva Pereira,Jeanne Holm

Ofﬁce of the Mayor,City of Los Angeles,Los Angeles, USA { eva.pereira, jeanne.holm } @lacity.org BhaskarKrishnamachari

Viterbi School of Engineering, University of Southern California,Los Angeles, USA { bkrishna } @usc.edu Abstract

Policy-makers require data-driven tools to assess the spread of COVID-19 and inform the public of their risk ofinfection on an ongoing basis. We propose a rigorous hybrid model-and-data-driven approach to risk scoring basedon a time-varying SIR epidemic model that ultimately yields a simpliﬁed color-coded risk level for each community.The risk score Γ t that we propose is proportional to the probability of someone currently healthy getting infectedin the next 24 hours. We show how this risk score can be estimated using another useful metric of infection spread, R t , the time-varying average reproduction number which indicates the average number of individuals an infectedperson would infect in turn. The proposed approach also allows for quantiﬁcation of uncertainty in the estimatesof R t and Γ t in the form of conﬁdence intervals. Code and data from our effort have been open-sourced and arebeing applied to assess and communicate the risk of infection in the City and County of Los Angeles. Index Terms

Risk Modelling, COVID-19, SIR model

I. I

NTRODUCTION

The ongoing COVID-19 epidemic has forced governments and public authorities to employ stringentmeasures [1], [2], including closing business and implementing stay-at-home orders, to contain the spread.When making such decisions, policymakers require tools to understand in “real-time” how the virus isspreading in the community, as well as tools to help communicate the level of risk to citizens so that theycan be encouraged to take appropriate measures and take the public health directives seriously.One metric that has been found to be useful for authorities to assess the level of containment over timeis the effective reproduction number [3]. The effective reproduction number, R t , indicates on average howmany currently susceptible persons can be infected by a currently infected individual. The epidemic growsif this measure is above one. It is desirable to keep this value as far below one as possible over time inorder to contain and eventually, hopefully, eliminate the virus from the community.While R t is meaningful to understand the rate at which the epidemic is spreading and has been proposedpreviously (for example, see https://rt.live/ ), what has been missing in the public discourse is a risk metricthat is more suitable for communication to a wider public. One key requirement for such a metric is thatit be something that a citizen could relate to on an individual basis. Another requirement is that it needsto be easy to communicate to a wide audience. We address both these requirements in this work andmake the following contributions.First, we obtain the daily effective reproduction number R t of a time-varying SIR model as well as thecorresponding conﬁdence Interval. The conﬁdence interval reﬂects uncertainty in both the parameter ofthe underlying model and uncertainty in the data itself. Further, we present the mathematical derivationof the distribution of R t . a r X i v : . [ phy s i c s . s o c - ph ] A ug ig. 1: Overview of Our System.Second, we propose a novel risk score Γ t for a community that is proportional to the probabilitythat an individual will get infected in the next 24 hours. We show that the risk score can be calculatedgiven estimates of four quantities: a) an estimate of I rep,new ( t ) , the most recently reported count of newconﬁrmed infectious cases, b) an estimate of R t as discussed above, c) an estimate of K , the ratio oftrue infectious cases to the number of conﬁrmed cases, and d) an estimate of S ( t ) , the current numberof susceptible individuals in the community. To make the score more meaningful, we normalize theprobability of infection by multiplying it by 10,000. Then, a risk score of x is an indication that there is,on average, a chance of x in 10,000 of an individual in the community becoming infected in the next 24hours.Third, we propose to convert the numerical risk score, which has an intuitive meaning as indicated above,to a color-coded risk level based on suitably chosen thresholds. We propose the use of four color-levelsto indicate the corresponding risk level from low to high: green, yellow, orange, and red.Fourth, we have implemented software to estimate the risk level for any community and released itas open-source. The code requires only time-series data on conﬁrmed new cases, the population of thecommunity, and an estimate for the ratio of true to conﬁrmed (detected) COVID-19 positive cases. Thissoftware is being used at USC to process the daily data of communities within Los Angeles County toestimate and generate maps of risk levels by community. The block diagram in ﬁgure 1 illustrates keyelements of our system design. Our data parser is able to get the raw data from online data sources, cleanthem up and store them in machine-friendly (csv and json) formats. Our code for infection risk calculationuses this data in conjunction with a time-varying SIR-based Bayesian mathematical model to obtain riskestimates and prediction for different communities. The results are provided in CSV format and can beused to generate a heatmap-type visualization as well.The risk scoring model we describe in this work is now being used by the City of Los Angeles, whichin turn is working with the County of Los Angeles and other partners to develop a publicly accessibletool that can be used by individuals and communities to grow awareness and mitigate risk of infection.We believe that our risk estimation approach will be similarly of value to other communities around theworld. II. R ELATED W ORK

As noted above, the calculation of the risk score requires an estimate of R t . We show how this canbe estimated using a time-varying SIR model, a generalization of the well-known SIR compartmentalmodel [4], [5] which consists of three states, namely the susceptible state, the infected state, and therecovered state. While traditionally this model is assumed to have a interaction rate / infection rateparameter that is constant, one recent work has used a time-varying SIR model to recover the time-varyingeffective reproduction number [6]. Going beyond that work, we also show how to derive a conﬁdenceinterval for R t in this work. Further, the authors of [6] make strong assumptions on the number ofusceptible individuals by approximating it as a constant factor of the entire population. This assumptionmay not be accurate when the number of infected individuals are high compared to the total populationof a community; we therefore take a more general approach.Another recent work by Systrom [7] has presented a Bayesian prediction approach to obtain conﬁdenceintervals for R t . However, Systrom’s work builds on [8], where the deﬁnition of infection rate R t is notbased on a time-varying contact rate of the SIR model. Instead, their approach estimates infection rateprobabilistically based on the number of new cases alone.We are not aware of prior work that has proposed deﬁning risk for COVID-19 or other epidemics interms of an individual’s probability of infection, which we argue is more meaningful for communicatingrisk to the public. III. M ETHODOLOGY

Compartmental mathematical models for epidemic spreads including the well-known SIR model havebeen used since the work of Kermack and McKendrick in 1927 [4]. In the SIR model, each member ofa given population is in one of three states at any time: susceptible, infectious, recovered. Any individualthat is susceptible could become infected with some probability when they come into contact with aninfected individual. Any individual that is infectious eventually recovers (in the context of COVID-19when applying the SIR model, note that the category of recovered individuals will also include removedindividuals due to deaths, which could be modeled as a constant fraction of all individuals in this category).In the classical SIR model, the number of susceptible individuals that become infected depends on the rateat which infected and susceptible individuals encounter each other and this rate is assumed to be constant.A well-known parameter in the classical SIR model is called R0, the effective reproductive number,which measures the average number of infections caused by infectious individuals at the beginning of theepidemic.

A. Time-Varying SIR model and R t In our work, we have extended the SIR model to a time-varying model, in which the rate of encountersand infection probability between individuals in the population is assumed to be time-varying. This betterreﬂects the reality of our present epidemic where interventions such as stay-at-home have been put inplace and relaxed and various times and compliance with recommendations such as wearing masks andmaintaining physical density has also been time-varying. Based on this model, we are able to deﬁne andderive a new approach to calculating a time-varying version of the effective reproductive number, whichwe refer to as R t .A particularly innovative aspect of our model is that it is a Bayesian model that allows the incorporationof various sources of uncertainty in the model, including uncertainty in the actual numbers of infectedindividuals (due to not every infected individual having been tested, as studies [2] have shown), uncertaintyin recovery times, and uncertainty in the choice of parameters for de-noising the empirical data. This allowsus to generate not only an estimate of R t , but also quantify conﬁdence in the estimate from a rigorousstatistical perspective.In this section, we elaborate upon the SIR model in detail. The SIR model is one of the simplest andthe most well-known epidemic model [4], [5] where each person belongs to one of the following threestates: the susceptible state, the infected state, and the recovered state. Regarding the susceptible state,individuals have not had the virus yet. However, they may get infected in case of being exposed to aninfected individual. As far as the infected state is concerned, a susceptible person has the virus afterbeing exposed to infected individuals. Finally, a person enters the recovered state in case of either theindividual gets healed or dies. One important point about this model is that a recovered person will notbe a susceptible one anymore.he SIR model follows the following differential equations: dS ( t ) dt = − β S ( t ) I ( t ) NdI ( t ) dt = β S ( t ) I ( t ) N − σI ( t ) dR ( t ) dt = σI ( t ) (1) where S ( t ) , I ( t ) , and R ( t ) respectively represent the number of susceptible, infected, and recovered peoplein a population size of N at time t . Regarding the parameter σ , it is the recovery rate after being infectedand is equal to D I where D I represents the average infectious days. Parameter β is known as the effectivecontact rate, i.e. the average number of contacts an individual have with others is β .In analyzing whether any pandemic is contained, it is very crucial to obtain parameter β . We next showthat how we can derive β from the aforementioned differential equations. Obtaining β t and R t for the SIR Model: In the SIR model, we can express the number ofsusceptible individuals in terms of population size and the number of infected persons as S ( t ) ≈ N − I ( t ) .By replacing S ( t ) with N − I ( t ) in the second differential equation of (1), we would have dI ( t ) dt = β ( N − I ( t )) I ( t ) N − σI ( t ) . (2) We can rewrite (2) as follows: dI ( t )( β − σ ) I ( t ) − βN I ( t ) = dt. (3) By taking deﬁnite integral from time t to t and assuming β to be constant in this time interval, wewould have (cid:90) t t dI ( t )( β − σ ) I ( t ) − βN I ( t ) = (cid:90) t t dt (4) which leads to β − σ ( log I ( t ) β − σ − βN I ( t ) − log I ( t ) β − σ − βN I ( t ) ) = t − t (5) One can easily check (5) has a unique solution for β due to the fact that term β − σ and log term havemonotonic behaviors.An epidemic happens in case of increase in the number of infected individuals, i.e. dI ( t ) dt > , orconsequently β ( N − I ( t )) I ( t ) N − σI ( t ) > . (6) In the early stage of an epidemic, almost everyone are susceptible except very few cases. Therefore, N − I ( t ) ≈ N and as a result, condition (6) would turn into βσ > .The variable R (cid:44) βσ is deﬁned as the effective reproduction number . It is a useful metric to determineepidemic growth. In case of having R > , the epidemic is growing exponentially while R < indicatesthe epidemic is contained and will decline and die out eventually.For discrete-time cases such as daily reporting on number of infected cases, the time-variant effectivecontact rate β t , which represents the contact rate for time slot t can be derived by solving the followingequation: β t − σ ( log I ( t + 1) β t − σ − β t N I ( t + 1) − log I ( t ) β t − σ − β t N I ( t ) ) = 1 ∀ t. (7) Therefore, the time-variant effective reproduction number would be deﬁned as R t (cid:44) β t σ . Since it is difﬁcultto write a closed form solution for β t in (7), we take a simpler approximation to β t by considering thefollowing which is based on (2) β t ≈ σI ( t ) + ( I ( t + 1) − I ( t ))(1 − I ( t ) N ) I ( t ) . (8) Then, we estimate R t as β t σ . ) Obtaining the Conﬁdence Interval for R t : Since there is uncertainty about parameter D I (orequivalently σ ) and the number of infected cases I ( t ) , we now provide the derivation of conﬁdence intervalfor parameter R t . Regarding modeling the ambiguity in the number of the infected cases, we present theuncertainty about the actual number of infected cases as a factor of reported ones, i.e. I rep ( t ) (cid:44) K I ( t ) ,and K is a constant greater than 1. The main intuition behind this factor is due to taking into accountthe following two phenomena, namely lack of sufﬁcient number of tests (specially in the beginning ofthe pandemic) and asymptomatic cases (mild infections which might not even be noticed).To derive the conﬁdence interval, we need to ﬁrst ﬁnd the marginal distribution of R t . By considering f D ( d ) and f K ( k ) as the probability distribution function (pdf) for parameters D I and K , respectively, thejoint pdf of these parameters would be f D,K ( d, k ) = f D ( d ) f K ( k ) (9) due to the independence of D I and K . We can derive the probability distribution function of R t byperforming the following transformation on parameters D I and K and introducing auxiliary variable Z : Z (cid:44) K , R t = 11 − KI rep ( t ) N (1 + D I I rep ( t + 1) − I rep ( t ) I rep ( t ) ) . (10) Since the transformation of ( Z, R t ) to ( D I , K ) is one-to-one, we have K = Z , D I = R t (1 − Za t ) − b t , (11) where a t (cid:44) I rep ( t ) N and b t (cid:44) I rep ( t +1) − I rep ( t ) I rep ( t ) , the joint pdf of Z and R t would be f Z,R t ( z, r ) = | J | f D,K ( d, k ) with J (cid:44) (cid:20) ∂d∂z ∂d∂r∂k∂z ∂k∂r (cid:21) . (12) By substituting the corresponding values of parameters and the Jacobin, we have: f Z,R t ( z, r ) = | − za t b t | f D ( r (1 − za t ) − b t ) f K ( z ) . (13) The marginal pdf of R t can be obtained by taking integral of (13) over parameter z , i.e. f R t ( r ) = (cid:90) f Z,R t ( z, r ) dz = (cid:90) | − za t b t | f D ( r (1 − za t ) − b t ) f K ( z ) dz. (14) Remark 1 : One reasonable assumption regarding the pdf of parameters D I and K is that both of themhave Gaussian distributions. By considering D I ∼ N ( µ D , σ D ) and K ∼ N ( µ K , σ K ) , the pdf of R t canbe simpliﬁed as f R t ( r ) = (cid:90) at −∞ ( β + β z ) C (cid:112) πσ c φ µ c ,σ c ( z ) dz + (cid:90) ∞ at ( − β − β z ) C (cid:112) πσ c φ µ c ,σ c ( z ) dz, (15)where φ µ c ,σ c ( . ) indicates the pdf of a normal distribution with mean µ c and variance σ c while β (cid:44) b t , β (cid:44) − a t b t ,α (cid:44) ( r − b t − µ D ) σ D + µ K σ K , α (cid:44) ( − ra t b t )( r − b t − µ D ) σ D − µ K σ K , α (cid:44) ( ra t b t ) σ D + 12 σ K ,µ c (cid:44) − α α , σ c (cid:44) α , C (cid:44) e − ( α − α α ) πσ D σ K . (16)y taking integral through using change of parameters, (15) can be rewritten as follows f R t ( r ) = − Cβ σ c e − ( 1 at − µc )22 σ c + C (cid:112) πσ c ( β µ c + β )Φ µ c ,σ c ( 1 a t ) + C (cid:112) πσ c ( − β µ c − β )(1 − Φ µ c ,σ c ( 1 a t )) (17)where Φ µ c ,σ c ( . ) represents the cumulative distribution function (cdf) of a normal distribution with mean µ c and variance σ c .The conﬁdence interval would belong to ( ¯ R t − δ, ¯ R t + δ ) where ¯ R t (cid:44) E [ R t ] = (cid:82) rf R t ( r ) dr and δ canbe derived by satisfying P ( | R t − ¯ R t |≤ δ ) = (cid:82) ¯ R t + δ ¯ R t − δ f R t ( x ) dx = 1 − (cid:15) for some small (cid:15) > . Estimating the Risk Score:

We propose a novel risk score metric for a given community that isproportional to the probability of someone in that community becoming infected in the next time period(typically, 24 hours). The risk score can be derived as the average number of people in that communitythat are likely to get infected in the next 24 hours by the currently infectious people divided by the currentnumber of susceptible individuals. We further normalize this probability by multiplying by 10,000, so thata score of 1 implies a 1 in 10,000 chance of getting infected, a score of 2 implies a 2 in 10,000 chanceof getting infected, and so on. Mathematically, the risk score is deﬁned as follows: Γ t = I ( t ) · R t D I · S ( t ) · ≈ K · I rep,new ( t ) · R t N · , (18)where I rep,new ( t ) indicates the most recently reported count of new conﬁrmed infectious cases, K refersto the ratio of true cases to reported cases, R t is the time-varying reproduction number, and N is the totalpopulation size of the community. The approximation follows from the fact that I rep,new ( t ) is approximatelyequal to I ( t ) D I · K and S ( t ) the number of susceptible people in the community is approximately equal to N inthe early stages of the epidemic. Conﬁdence intervals for the risk score Γ t could be obtained numericallyusing a similar process as described for R t accounting also for uncertainty in K . Note that since K maynot be known for a given community, it may be helpful to use the following normalized form of the riskscore: Γ t K , which is still proportional to the probability of infection for an individual. Color-coded Risk Levels:

To further simplify the presentation of the risk score to a wider audience,we propose to classify the risk levels into four color-coded levels: (Green, Yellow, Orange, Red). Therisk level is determined by evaluating the normalized risk score ( Γ K ) with respect to three pre-speciﬁedthreshold levels θ , θ , θ , such that when Γ K < θ the risk level is green, when θ ≤ Γ K < θ the risk levelis yellow, when θ ≤ Γ K < θ the risk level is orange, and when Γ K ≥ θ the risk level is red.Fig. 2: Plots a) and b) respectively represent the estimated effective reproduction number R t and the riskscore Γ t over time for the entire county of LA considering E [ D I ] = 7 . , V ar [ D I ] = 9 , E [ K ] = 3 , and V ar [ K ] = 0 . . The gray area represents the 95% conﬁdence interval in the estimates.ig. 3: Estimate of risk score Γ t over time for four representative communities in LA County: BoyleHeights, Glendale, East LA, and Norwalk. Regarding the settings, we considered the following E [ D I ] =7 . , V ar [ D I ] = 9 , E [ K ] = 3 , and V ar [ K ] = 0 . . Our approach also yields uncertainty in the estimate,as shown in the form of conﬁdence intervals (in gray).IV. I MPLEMENTATION AND E VALUATION IN L OS A NGELES C OUNTY

The software for data collection, infection rate estimation and prediction has already been implementedand made available as open-source software (at the following repository: https://github.com/ANRGUSC/covid19 risk estimation). The software is written in Python using standard data processing libraries suchas NumPy and SciPy.

A. Data Sources • The CoVID-19 case information was collected through LA County’s daily press releases (Accessiblethrough the following website: http://publichealth.lacounty.gov/media/Coronavirus/). • Recovery information provided by the World Health Organization. • The population data from LA County Census is available online (from lacounty.gov/government/geography-statistics/cities-and-communities/).

B. Real-world Usage

The City of Los Angeles is currently using the risk model described in this work that has been developedby researchers at USC, to help assess location-based risk for COVID-19 infection. The City is workingwith the County and other partners to develop a tool that is publicly accessible and can be used byindividuals and communities to mitigate risk of infection. The goal is to change behaviors to reduce riskof infection and promote a greater understanding of factors that increase COVID risk. A color-codedCOVID-19 threat level tool that can be used by citizens has also been unveiled by the Mayor of the Cityof LA, online at https://corona-virus.la/covid-19-threat-level.ig. 4: Maps showing the estimated risk score for different LA County Communities on different datessince mid-March 2020. Top row: 3/23/20, 4/10/20, 4/27/20; Bottom row: 5/15/20, 6/18/20, 7/1/20V. E

VALUATION R ESULTS

We present below plots from our analysis of LA County community case data using the estimationapproach described in this work. Figure 2 shows plots of the estimated expected reproductive number R t and the estimated risk score for the entire LA county. These plots are based on a 14-day moving averageapplied on the daily number of conﬁrmed cases. In accordance with LA county daily press releases, thereis a sharp jump in both R t and risk score around the beginning of July. Note that the reason the riskscore during the beginning of July is higher than the risk score during the last week of March, despitehaving the same R t , is due to the fact that there are signiﬁcantly more conﬁrmed cases in July comparedto March. Figure 3 shows the risk score estimates over time for four representative communities withinthe LA County. Figure 4 shows the color-coded risk levels for communities in LA County for select datesover the past 3 months. VI. C ONCLUSION

We have proposed a new risk metric Γ t that can be used by individuals in any community to assesstheir probability of getting infected by COVID-19. The metric builds on the estimation of R t , the averagereproductive number, which is obtained from a time-varying extension of the classical SIR model. Weshow how to evaluate the uncertainty in both metrics as well. In future work, we plan to generalize theapproach to the SEIR model, which also models an additional incubation period. We have released codeto implement an estimation of the risk score that can be used for any community worldwide as long astime-series data for conﬁrmed new cases and the population are known. We have also proposed the useof simple color-coded risk levels to inform and guide the public, as has been adopted in the City of LosAngeles. EFERENCES [1] A. Tob´ıas, “Evaluation of the lockdowns for the sars-cov-2 epidemic in italy and spain after one month follow up,”

Science of The TotalEnvironment , p. 138539, 2020.[2] K. Kupferschmidt, “The lockdowns workedbut what comes next?” 2020.[3] Y. Liu, A. A. Gayle, A. Wilder-Smith, and J. Rockl¨ov, “The reproductive number of covid-19 is higher compared to sars coronavirus,”

Journal of travel medicine , 2020.[4] F. Brauer, “The Kermack–McKendrick epidemic model revisited,”

Mathematical biosciences , vol. 198, no. 2, pp. 119–131, 2005.[5] M. Newman, “Networks: An introduction,” in

Oxford University Press , 2010.[6] Y. Chen, P.-E. Lu, and C.-S. Chang, “A time-dependent sir model for covid-19,”

ArXiv , vol. abs/2003.00122, 2020.[7] K. Systrom, “The metric we need to manage covid-19,” 2020. [Online]. Available: http://systrom.com/blog/the-metric-we-need-to-manage-covid-19/[8] L. Bettencourt and R. Ribeiro, “Real time bayesian estimation of the epidemic potential of emerging infectious diseases,”