The Limits to Learning a Diffusion Model
Jackie Baek, Vivek F. Farias, Andreea Georgescu, Retsef Levi, Tianyi Peng, Deeksha Sinha, Joshua Wilde, Andrew Zheng
The Limits to Learning an SIR Process: Granular Forecasting for Covid-19
Abstract
A multitude of forecasting efforts have arisen to support management of the ongoing COVID-19 epidemic. These efforts typically rely on a variant of the SIR process [1] and have illustrated that building effective forecasts for an epidemic in its early stages is challenging. This is perhaps surprising since these models rely on a small number of parameters and typically provide an excellent retrospective fit to the evolution of a disease. So motivated, we provide an analysis of the limits to estimating an SIR process. We show that no unbiased estimator can hope to learn this process until observing enough of the epidemic so that one is approximately two-thirds of the way to reaching the peak for new infections. Our analysis provides insight into a regularization strategy that permits effective learning across simultaneously and asynchronously evolving epidemics. This strategy has been used to produce accurate, granular predictions for the COVID-19 epidemic and has found large-scale practical application in a large US state.
The so-called Susceptible-Infected-Recovered (SIR) model, proposed nearly a century ago [2], remains a cornerstone for the forecasting of epidemics. In numerous retrospective studies the model has been found to fit the trajectory of epidemics well, while simultaneously providing a meaningful level of interpretability. As such, the plurality of models used for forecasting efforts related to the COVID-19 epidemic are either SIR models or close cousins thereof. Surprisingly, the experience with these forecasts has illustrated that predicting the cumulative number of cases (or peak number of cases) in an epidemic, early in its course, is a challenging task.

Ultimately, this paper is motivated by producing high quality forecasts for the progression of epidemics such as COVID-19. However, we begin with a fundamental question that, despite the SIR model's prevalence, has apparently not been asked:
What are the limits of learning in epidemic models such as the SIR model?
Surprisingly, we find that it is fundamentally difficult to predict the cumulative number of infections (or the 'peak' of an infection) until quite late in an epidemic. In particular, we show that no estimator can hope to learn these quantities until observing enough of the epidemic so that one is approximately two-thirds of the way to reaching the peak for new infections. As far as we can tell, this result is the first of its kind for epidemic modeling. We find that this hardness is driven by uncertainty in the initial prevalence of the infection in the population, which is not observable with incomplete testing and/or asymptomatic patients. On the other hand, we show that certain other important parameters in the SIR model – including the so-called reproduction number and infectious period – are actually easy to learn.
Preprint. Under review.

Our analysis on the limits of learning above suggests a specific regularization approach that dramatically impacts forecast performance when learning across regions. In greater detail, we consider a differentiable model, where the true infection curve is latent and follows SIR dynamics. The observations are the results of limited testing, which censors the true infection curve, and deaths. Our learning rate analysis suggests that by suitably 'matching' regions early in the epidemic to regions that are further along, we can build useful estimators of initial disease prevalence in the former regions. This matching is effectively enabled through a regularization strategy in our approach where certain region-specific random effect terms are regularized to zero.

We demonstrate that our approach allows us to learn granular epidemic models (essentially at the level of individual counties) that are substantially more accurate than a highly referenced benchmark [3]. As such, our model has found practical application in a large US state. Among other uses, the forecasts inform decisions related to providing surge capacities in hospitals across the state; such decisions necessitate a granular forecast.

Related Literature:
Extant epidemiological models are typically compartmental models, of which the SIR model [2] is perhaps the best known. The plurality of COVID-19 modeling efforts are founded on SIR-type models, e.g. [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. Some of these efforts consider generalizations to the SIR model that add additional states or 'compartments'. For instance, the (retrospectively fit) model studied in [6] considers an eight state model (as opposed to three for the vanilla SIR). Not surprisingly, learning gets harder as the number of states increases [15].

Whereas the identifiability of the SIR model (in a deterministic sense) is well understood [16], this is not the case for the natural stochastic variant of this model. Specifically, calibrating a vanilla SIR model to data requires learning the so-called infectious period and basic reproduction rate. Both these parameters are relatively easy to calibrate with limited data; this is supported both by the present paper, and is also commonly observed empirically; see for instance [15]. In addition to these parameters, however, one needs to measure both the initial number of infected individuals and the size of the susceptible population. Estimating the number of infected individuals poses a challenge in the presence of limited testing and asymptomatic carriers. Indeed, epidemiological models for COVID-19 typically assume that measured infections are some fraction of true infections; e.g. [4, 6]. This challenge is closely related to that of measuring the true fraction of cases that lead to fatalities (or the so-called Infection Fatality Rate) [17]. Our main theorem shows that having to learn the true initial prevalence of the infection presents a fundamental difficulty to learning SIR models with limited data. Our theoretical results are complemented by empirical findings in the literature, see [18, 19].
Our result is the first to characterize the limits to learning an SIR process [1] and relies on a non-trivial application of the Cramer-Rao bound. Finally, on the empirical front, we compare the quality of our predictions to publicly available predictions for the IHME model [3]: this highly publicized benchmark model has a large number of forecast vintages concurrently available, which makes possible an analysis of relative prediction accuracy as a function of available data.
We begin by describing the deterministic SIR model. Let s(t), i(t) and r(t) be the size of the susceptible, infected and recovered populations respectively, as observed at time t. The SIR model is defined by the following system of ODEs, specified by the tuple of parameters (α, β, γ):

ds/dt = −(β/(αN)) s i,   di/dt = (β/(αN)) s i − γ i,   dr/dt = γ i.   (1)

The rate of recovery is specified by γ > 0; 1/γ is frequently referred to as the infectious period. β > 0 quantifies the rate of transmission; β/γ ≜ R_0 is also referred to as the basic reproduction number. N is the total population (assumed known). The typical exposition of this model sets α = 1, corresponding to the setting where the infection is entirely observed. On the other hand, the quantity α ∈ (0, 1] corresponds to the fraction of true infections we actually observe. The role of α is made clear by the following Proposition:

Proposition 2.1.
Let {(s′(t), i′(t), r′(t)) : t ≥ 0} be a solution to (1) for parameters α = 1, β = β′, γ = γ′ and initial conditions i(0) = i′(0), s(0) = s′(0). Then, for any α′ > 0, {(α′s′(t), α′i′(t), α′r′(t)) : t ≥ 0} is a solution to (1) for parameters α = α′, β = β′, γ = γ′ and i(0) = α′i′(0), s(0) = α′s′(0).

The SIR model described above is identifiable if i(t) is observable, and N is known. Specifically:

Proposition 2.2.
Fix N, and let i(t) be observed over some open set in R_+. Then the parameters (α, β, γ) are identifiable.

A self-contained proof follows immediately from the identity theorem; a more involved argument based on identification results for non-linear systems can be found in [16].
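The scaling property in Proposition 2.1 is easy to verify numerically. The following sketch (ours, not part of the paper; all parameter values are arbitrary choices for illustration) integrates (1) with a forward-Euler scheme and checks that an α = 1 solution, scaled by α′, reproduces the α = α′ solution with correspondingly scaled initial conditions:

```python
# Forward-Euler integration of the deterministic SIR system (1),
# with r(0) = 0. Parameter values are arbitrary, for illustration only.

def simulate_sir(alpha, beta, gamma, N, s0, i0, T, dt=0.01):
    """Integrate ds/dt = -(beta/(alpha*N)) s i, di/dt = (beta/(alpha*N)) s i - gamma i,
    dr/dt = gamma i; returns the list of (s, i, r) states."""
    s, i, r = s0, i0, 0.0
    path = [(s, i, r)]
    for _ in range(int(T / dt)):
        new_inf = beta * s * i / (alpha * N) * dt
        new_rec = gamma * i * dt
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        path.append((s, i, r))
    return path

beta, gamma, N = 0.3, 0.1, 1_000_000
alpha_prime = 0.2

# Trajectory with alpha = 1.
base = simulate_sir(1.0, beta, gamma, N, s0=N - 100, i0=100, T=50)
# Trajectory with alpha = alpha' and initial conditions scaled by alpha'.
scaled = simulate_sir(alpha_prime, beta, gamma, N,
                      s0=alpha_prime * (N - 100), i0=alpha_prime * 100, T=50)

# Proposition 2.1: the second trajectory equals alpha' times the first.
err = max(abs(alpha_prime * b - s)
          for (b, _, _), (s, _, _) in zip(base, scaled))
print(f"max |alpha' * s'(t) - s(t)| = {err:.2e}")
```

The same identity holds for the i and r components; the Euler increments scale exactly, so the discrepancy is pure floating-point rounding.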
Stochastic SIR Process:
Of course, any real world model must incorporate noise, and we describe next a natural continuous-time Markov chain variant of the deterministic model described above, proposed by [1]. Specifically, the stochastic SIR process, {(S(t), I(t), R(t)) : t ≥ 0}, is a multivariate counting process, with RCLL paths, determined by the parameters (α, β, γ). The jumps in this process occur at rate λ(t) (to be defined) and correspond either to a new observed infection (where I(t) increments by one, and S(t) decrements by one) or to a new observed recovery (where I(t) decrements by one, and R(t) increments by one). Let C(t) = I(t) + R(t) denote the cumulative number of infections observed up to time t. Denote by t_k the time of the k-th jump, and let T_k be the time between the (k−1)-st and k-th jumps. Finally, let I_k ≜ I(t_k), and similarly define R_k, S_k and C_k. The SIR process is then completely specified by:

C_k − C_{k−1} ∼ Bern{ βS_{k−1} / (βS_{k−1} + αNγ) },   (2)
T_k ∼ Exp{ (βS_{k−1}/(αN) + γ) I_{k−1} }.   (3)

It is well known that solutions to the deterministic SIR model (1) provide a good approximation to sample paths of the SIR process (described by (2), (3)) in the so-called fluid regime; see [20].

The next section analyzes the rate at which one may hope to learn the unknown parameters (α, β, γ) as a function of k; our key result will illustrate that in large systems, α is substantially harder to learn than β or γ. In fact, an approximation suggests that the time taken to learn this parameter accurately will be approximately two-thirds of the time required to hit the peak for infections. In Section 4 we will consider a differentiable model inspired by the SIR process and show that a regularization strategy motivated by our learning analysis yields material performance gains.
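The dynamics (2)–(3) are straightforward to simulate. The sketch below (our illustration, with arbitrary parameter values) does so, and also records two times that preview the analysis to come: the time at which (αN)^{2/3} cumulative infections have been observed, and the time at which the expected infection increment first turns negative (i.e., βS(t)/(αN) falls below γ):

```python
import random

def simulate_sir_jumps(alpha, beta, gamma, N, I0, seed=0):
    """Simulate the stochastic SIR jump process of (2)-(3).
    Returns (time C(t) first reaches (alpha*N)**(2/3),
             time beta*S(t)/(alpha*N) first falls below gamma).
    R is implicit: R = C - I."""
    rng = random.Random(seed)
    aN = alpha * N
    S, I, C, t = aN - I0, I0, I0, 0.0
    t_learn = t_peak = None
    while I > 0:
        rate = (beta * S / aN + gamma) * I            # total jump rate, cf. (3)
        t += rng.expovariate(rate)
        if rng.random() < beta * S / (beta * S + aN * gamma):  # infection, cf. (2)
            S -= 1; I += 1; C += 1
        else:                                          # recovery
            I -= 1
        if t_learn is None and C >= aN ** (2 / 3):
            t_learn = t
        if t_peak is None and beta * S / aN < gamma:
            t_peak = t
        if t_learn is not None and t_peak is not None:
            break
    return t_learn, t_peak

t_learn, t_peak = simulate_sir_jumps(alpha=0.1, beta=0.3, gamma=0.1,
                                     N=1_000_000, I0=20)
print(f"t_learn / t_peak = {t_learn / t_peak:.2f}")  # compare with the 2/3 of Section 3
```

For moderate system sizes the ratio is somewhat below 2/3; the two-thirds figure is the asymptotic lower bound as αN grows large.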
This section characterizes the rate at which one may hope to learn the parameters of a stochastic SIR process, simply from observing the process.
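To build intuition for why α is hard to pin down early on, it helps to compute the Fisher information carried by a single jump. In the likelihood of (2), α enters both through the αNγ term and through S_{k−1} = αN − C_{k−1} (since the cumulative count C_{k−1} is what is observed). The sketch below (our illustration, with arbitrary parameter values; not from the paper) evaluates the resulting per-jump information, both via a finite-difference derivative and via a closed form obtained by a short calculus exercise:

```python
# Fisher information about alpha carried by one infection/recovery indicator (2),
# with S_{k-1} = alpha*N - C_{k-1} substituted in. Illustrative values only.

beta, gamma, N, alpha = 0.3, 0.1, 10_000_000, 0.1

def p_infection(a, C):
    """P(k-th jump is an infection) as a function of alpha, cf. (2), S = a*N - C."""
    S = a * N - C
    return beta * S / (beta * S + a * N * gamma)

def fisher_per_jump(a, C, h=1e-4):
    """(dp/da)^2 / (p(1-p)) with dp/da estimated by a central difference."""
    p = p_infection(a, C)
    dp = (p_infection(a + h, C) - p_infection(a - h, C)) / (2 * h)
    return dp * dp / (p * (1 - p))

def fisher_exact(a, C):
    """Same quantity with dp/da = N*beta*gamma*C / (beta*(a*N - C) + gamma*a*N)^2."""
    p = p_infection(a, C)
    dp = N * beta * gamma * C / (beta * (a * N - C) + gamma * a * N) ** 2
    return dp * dp / (p * (1 - p))

print(fisher_exact(alpha, 10), fisher_exact(alpha, 1000))
```

While C_{k−1} is small relative to αN, the per-jump information grows roughly like C_{k−1}^2, so the earliest observations carry almost no information about α; this is the mechanism behind the results of this section.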
Observations:
Define the stopping time τ = inf{k : I_k = 0}; clearly τ is bounded. For clarity, when k > τ, we define C_k = C_{k−1}, I_k = 0, and T_k = ∞. We define the k-th information set O_k = (I_0, R_0, T_1, C_1, . . . , T_k, C_k) for all k ≥ 1. Note that I_k and R_k are deterministic given C_k, I_0, and R_0.

We next characterize a sequence of systems of increasing size. Specifically, consider a sequence of SIR processes, indexed by n, such that α_n N_n → ∞ while β_n = β, γ_n = γ. Moreover, let m_n = o(α_n N_n) and β > γ. Finally, assume that I_{n,0}, R_{n,0} ≤ m_n. Our main theorem characterizes the Fisher information of O_{m_n} relative to α_n as n grows large.

Theorem 3.1.
Assume β > γ are known and I_{n,0} ≥ D, where D is a constant that depends only on β and γ. Then, the Fisher information of O_{m_n} relative to α_n is

J_{O_{m_n}}(α_n) = Θ( m_n^3 / (α_n^4 N_n^2) ).   (4)

Components of the proof of this result are provided in Section 3.1. Now since α_n ∈ (0, 1], and since all points in this interval are regular in the terminology of [21], we have by the constrained Cramer-Rao bound (Lemma 4 in [21]):

Corollary 3.2 (Constrained Cramer-Rao Bound). Assume β > γ are known and I_{n,0} ≥ D, where D depends only on β and γ. Let α̂_n be any unbiased estimator of α_n based on the observations O_{m_n}. Then,

var(α̂_n) = Ω( α_n^4 N_n^2 / m_n^3 ).   (5)

Interpretation of Corollary 3.2:
Now since we know (from the fluid model (1)) that cumulative and peak infections scale with α_n N_n, it is reasonable to require that var(α̂_n) scale like o(α_n^2). For this, Corollary 3.2 shows that we require m_n = ω((α_n N_n)^{2/3}) observations. How long might this take relative to the point at which, say, the epidemic itself is past its peak? One way of characterizing such a benchmark is comparing the time it takes to get to (α_n N_n)^{2/3} observations (call this time T_{1,n}) with the time taken for the instantaneous reproductive number for the epidemic, R_t ≜ (β/γ) S_n(t)/(α_n N_n), to fall below 1 (the expected increment in the infection process is negative at times when R_t < 1); call this time T_{2,n}. More precisely, T_{1,n} = inf{ t : C_n(t) ≥ (α_n N_n)^{2/3} } and T_{2,n} = inf{ t : βS_n(t)/(α_n N_n) < γ }. Unfortunately, characterizing either of these (random) times exactly in the stochastic SIR process appears to be a difficult task and so we consider analyzing these times in the deterministic model, where we denote T^d_{1,n} = inf{ t : c_n(t) ≥ (α_n N_n)^{2/3} } for the process defined by (1) and similarly T^d_{2,n} = inf{ t : βs_n(t)/(α_n N_n) < γ }. We have:

Proposition 3.3. If c_n(0) = O(log(α_n N_n)), then lim_{n→∞} T^d_{1,n} / T^d_{2,n} ≥ 2/3.

In summary, this suggests that the sampling requirements made precise by Corollary 3.2 can only be met at such time where we are close to reaching the peak infection rate. We next turn our attention to learning β and γ:

Theorem 3.4.
Assume β ∈ [0, β_max], γ ∈ [0, γ_max]. Let C_{n,0}, m_n, α_n, N_n satisfy m_n(m_n + C_{n,0}) ≤ α_n N_n, √(log m_n / m_n) ≤ 1, and (β/(β+γ)) · (α_n N_n − m_n − C_{n,0})/(α_n N_n) > ½(β/(β+γ) + ½). Then, there exist estimators β̂_{m_n} and γ̂_{m_n}, both functions of O_{m_n}, such that

E[(β̂_{m_n} − β)^2] ≤ M_β (log m_n)/m_n + B_β e^{−B I_{n,0}},   (6)
E[(γ̂_{m_n} − γ)^2] ≤ M_γ (log m_n)/m_n + B_γ e^{−B I_{n,0}},   (7)

where B, B_β, B_γ > 0 depend only on β and γ, and M_β, M_γ > 0 are absolute constants.

In stark contrast with Corollary 3.2, Theorem 3.4 shows that the variance in estimating β and γ grows small as m_n and I_{n,0} grow, but is independent of α_n and N_n. Consequently, to achieve any desired level of accuracy, we simply need the number of events m_n and the initial number of infections I_{n,0} to exceed a constant that is independent of the size of the system N_n.

Define p ≜ ½(β/(β+γ) + ½) > ½, and let P_n = α_n N_n. We fix n large enough so that m_n + C_0 ≤ P_n and β(P_n − m_n − C_0)/(β(P_n − m_n − C_0) + P_n γ) > p (this is possible since β/(β+γ) > p and m_n = o(P_n)). We remove the subscript n henceforth.

Define the indicator variable E_k = 1{τ > k} of the event that the SIR process has not terminated after k jumps. The following lemma states that both E_k and I_k can be determined from C_k, I_0, and R_0, which will allow us to decouple variables in O_m in the analysis of the Fisher information. The result follows from the definitions of τ, E_k, and C_k; the details can be found in the Appendix.

Lemma 3.5.
Define r_k ≜ (I_0 + k + 2R_0)/2 for all k ≥ 0. For all k, E_k = 1{C_k > r_k}. Moreover, when E_k = 1, I_k = 2C_k − k − I_0 − 2R_0 > 0.

The next lemma writes an exact expression for J_{O_m}(α).

Lemma 3.6. The Fisher information of the observations O_m with respect to the parameter α is

J_{O_m}(α) = Σ_{k=1}^{m} Pr(E_{k−1} = 1) · E[ N^2 C_{k−1}^2 / ( P(P − C_{k−1})((P − C_{k−1}) + (γ/β)P)^2 ) | E_{k−1} = 1 ].   (8)
We first define conditional Fisher information and state some known properties.
Definition 3.7.
Suppose
X, Y are random variables defined on the same probability space whose distributions depend on a parameter θ. Let g_{X|Y}(x, y, θ) = ( (∂/∂θ) log f_{X|Y;θ}(x|y) )^2 be the square of the score of the conditional distribution of X given Y = y with parameter θ, evaluated at x. Then, the conditional Fisher information is defined as J_{X|Y}(θ) = E_{X,Y}[ g_{X|Y}(X, Y, θ) ].

Property 3.8. J_{X_1,...,X_n}(θ) = J_{X_1}(θ) + Σ_{i=2}^{n} J_{X_i|X_1,...,X_{i−1}}(θ).

Property 3.9. If X is independent of Z conditioned on Y, then J_{X|Y,Z}(θ) = J_{X|Y}(θ).

Property 3.10. If X is deterministic given Y = y, then g_{X|Y}(X, y, θ) = 0.

Property 3.11. If θ(η) is a continuously differentiable function of η, then J_X(η) = J_X(θ(η)) (dθ/dη)^2.

Since I_0 and R_0 are known and not random, the Fisher information of O_m is equal to the Fisher information of (T_1, C_1, T_2, C_2, . . . , T_m, C_m). Then, Property 3.8 implies

J_{O_m}(α) = J_{T_1}(α) + J_{C_1|T_1}(α) + J_{T_2|T_1,C_1}(α) + J_{C_2|T_1,C_1,T_2}(α) + · · · + J_{C_m|T_1,C_1,...,T_m}(α).

Note that for any k, C_k and T_k only depend on C_{k−1}. Indeed, since C_{k−1} determines E_{k−1}, if E_{k−1} = 0 (the stopping time has passed), then C_k = C_{k−1} and T_k = ∞. When E_{k−1} = 1, the distributions of C_k and T_k are given in (2)-(3). Since β, γ, I_0, R_0 are known, S_{k−1} = P − C_{k−1}, and I_{k−1} can be determined from C_{k−1} (Lemma 3.5), the distributions of C_k and T_k are determined by C_{k−1}. Therefore, we use Property 3.9 to simplify the above equation to

J_{O_m}(α) = Σ_{k=1}^{m} ( J_{C_k|C_{k−1}}(α) + J_{T_k|C_{k−1}}(α) ),

where we used J_{T_1}(α) = J_{T_1|C_0}(α), J_{C_1}(α) = J_{C_1|C_0}(α). Moreover, when E_{k−1} = 0, C_k and T_k are deterministic conditioned on C_{k−1}, which implies the score in this case is 0 (Property 3.10). Therefore, we can condition on E_{k−1} = 1 to write

J_{O_m}(α) = Σ_{k=1}^{m} E[ g_{C_k|C_{k−1}}(C_k, C_{k−1}, α) + g_{T_k|C_{k−1}}(T_k, C_{k−1}, α) | E_{k−1} = 1 ] Pr(E_{k−1} = 1).
The last step is to evaluate g_{C_k|C_{k−1}}(C_k, C_{k−1}, α) and g_{T_k|C_{k−1}}(T_k, C_{k−1}, α). When E_{k−1} = 1, the distributions of C_k and T_k conditioned on C_{k−1} have a simple form, provided in (2)-(3). Property 3.11 allows for straightforward calculations, resulting in (8). See the Appendix for details of this last step.

Proof of Theorem 3.1.
We show both upper and lower bounds for J_{O_m}(α), starting from Equation (8). For the upper bound, we have that C_k ≤ k + I_0 + R_0 by definition. Since I_0, R_0 ≤ m by assumption, C_k ≤ 3m. Moreover, by assumption, C_k ≤ m + C_0 ≤ P/2. Plugging these into (8) results in

J_{O_m}(α) ≤ Σ_{k=1}^{m} Pr(E_{k−1} = 1) · N^2 (3m)^2 / ( P(P/2)((P/2) + (γ/β)P)^2 ) ≤ H_1 m^3/(α^4 N^2),

for a constant H_1. For the lower bound, we first prove a lower bound for Pr(E_m = 1).

Lemma 3.12.
There exists a constant D that only depends on β and γ such that if β(P − m − C_0)/(β(P − m − C_0) + Pγ) > p and I_0 ≥ D, then Pr(E_m = 1) ≥ 1/2.

This result relies on an interesting stochastic dominance argument and can be found in the Appendix. Then, similarly to the upper bound, J_{O_m}(α) ≥ H_2 m^3/(α^4 N^2) follows from using Pr(E_m = 1) ≥ 1/2 and the fact that C_k ≥ (k + I_0 + 2R_0)/2 when E_k = 1 (Lemma 3.5). Combining the upper and lower bounds finishes the proof.

Forecasting in the real world
This section describes a practical SIR model that can incorporate a number of real-world features and datasets. We then consider an approximation to the MLE for this model. Finally, we propose an approach to overcome the difficulty in learning α. We present comprehensive experimental results at a 'regional' granularity (essentially close to county level) that illustrate strong relative merits.
Recall the stochastic SIR process, {(S(t), I(t), R(t)) : t ≥ 0}, a multivariate counting process determined by parameters (α, β, γ). We now allow β to be time-varying, yielding a counting process with jumps C_k − C_{k−1} ∼ Bern{ β_k S_{k−1} / (β_k S_{k−1} + αNγ) } and rate λ(t) = (β(t)S(t)/(αN) + γ) I(t). We obtain discrete-time SIR processes, {(S_i[t], I_i[t], R_i[t]) : t ∈ N} for regions i ∈ I, by considering the Euler approximation to this counting process (e.g. [22]). Specifically, let ΔI[t] = I[t] − I[t−1], and define ΔS[t] and ΔR[t] analogously. The discrete-time approximation to the SIR process is then given by:

ΔS_i[t+1] = −β_i[t] (S_i[t]/(α_i N_i)) I_i[t] + ν^S_{i,t}
ΔI_i[t+1] = β_i[t] (S_i[t]/(α_i N_i)) I_i[t] − γ I_i[t] + ν^I_{i,t}
ΔR_i[t+1] = γ I_i[t] + ν^R_{i,t}

where {ν^S_{i,t}}, {ν^I_{i,t}}, {ν^R_{i,t}} are appropriately defined martingale difference sequences.

Model Parameters and Covariates
For each region i, the observability parameter α_i and reproduction factor {β_i[t] : t ∈ N} must be learned from data, but we assume the total population N_i and recovery rate γ are known (a typical assumption; for COVID-19, γ is taken to be the reciprocal of an assumed infectious period). Demographic and mobility factors influence the reproduction rate of the disease. To model these effects, we estimate β_i[t] as a mixed effects model incorporating these covariates: β_i[t] = exp(X_i[t]^T θ) + ε_i, where X_i[t] = (Z_i, M_i[t]) ∈ R^{d_1+d_2} is a set of observed covariates for region i, partitioned into static Z_i ∈ R^{d_1} and time-varying M_i[t] ∈ R^{d_2}. We estimate the parameters ε_i, representing stationary per-region random effects, and θ ∈ R^{d_1+d_2}, a vector of fixed effects shared across regions.

In the real world, the SIR model is a latent process – we never directly observe any of the state variables S_i[t], I_i[t], R_i[t]. Instead, we observe C_i[t] = I_i[t] + R_i[t] = α_i N_i − S_i[t]. Note that other processes (deaths, hospitalizations, etc.) that depend on the latent state may also be observable, and can easily be incorporated into this learning scheme. The MLE problem for parameters (θ, α, ε) is simply max_{(θ,α,ε)} Σ_{i,t} log P(C_i[t] | θ, α, ε).

This is a difficult non-linear filtering problem (and an interesting direction for research). We therefore consider an approximation: Denote by {(s_i[t], i_i[t], r_i[t]) : t ∈ N} the deterministic process obtained by ignoring the martingale difference terms in the definition of the discrete-time SIR process. We consider the approximation C_i[t] = α_i N_i − S_i[t] ∼ (α_i N_i − s_i[t]) ω_i[t], where ω_i[t] is log-normally distributed with mean 1 and variance exp(σ^2) − 1.
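To make the pieces concrete, the sketch below (an illustration with made-up parameters, for a single region with a constant β; not the paper's multi-region mixed-effects implementation) runs the deterministic recursion forward and evaluates a least-squares-in-logs fit of an observed case curve:

```python
import math

def fluid_sir(alpha, beta, gamma, N, i0, T):
    """Deterministic discrete-time SIR (martingale difference terms dropped).
    Returns the model's observed cumulative-case curve alpha*N - s[t]."""
    s, i = alpha * N - i0, i0
    curve = []
    for _ in range(T):
        new_inf = beta * s / (alpha * N) * i
        s, i = s - new_inf, i + new_inf - gamma * i
        curve.append(alpha * N - s)
    return curve

def log_ls_loss(observed, alpha, beta, gamma, N, i0):
    """Sum of squared log-differences between observed and model case curves."""
    model = fluid_sir(alpha, beta, gamma, N, i0, len(observed))
    return sum((math.log(c) - math.log(m)) ** 2 for c, m in zip(observed, model))

# Synthetic check: data generated at alpha = 0.2 is fit better by alpha = 0.2
# than by a mis-specified alpha (all parameter values are illustrative).
truth = fluid_sir(0.2, 0.25, 0.1, 1_000_000, i0=50, T=60)
good = log_ls_loss(truth, 0.2, 0.25, 0.1, 1_000_000, 50)
bad = log_ls_loss(truth, 0.5, 0.25, 0.1, 1_000_000, 50)
print(good, bad)  # the well-specified alpha attains the smaller loss
```

Note that the two α values produce nearly identical curves early on and only separate as the smaller-α curve saturates, which is exactly the identification difficulty analyzed in Section 3.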
Under this approximation, the MLE problem is transformed to

min_{(θ,α,ε)} Σ_{i,t} ( log C_i[t] − log(α_i N_i − s_i[t]) )^2   (9)

which constitutes a differentiable model. We solve (9) (or a weighted version) using Adam [23].

Working around the limits to learning
Theorem 3.4 asserts that β is easy to learn via MLE, suggesting that the parameters θ, {ε_i} underlying our model of β can be estimated as well. However, Corollary 3.2 illustrates that we cannot estimate {α_i} in reasonable time or accuracy from infections alone. This necessitates augmenting the estimation problem (9) with auxiliary information. We identify a set P[t] ⊆ I of regions that have already experienced enough infections to reliably estimate α_i for i ∈ P[t] via MLE. At a high level, our strategy will be to identify the set P[t], estimate α_i for i ∈ P[t], then extrapolate these estimates to obtain α_i for i ∉ P[t]. We define the set

P[t] = { i ∈ I : C_i[t] − C_i[t−1] ≤ (1 − γ_1) max_{τ≤t} (C_i[τ] − C_i[τ−1]) }   (10)

where γ_1 is a hyperparameter. In the fluid model, this identifies regions whose rate of new infections has fallen below its peak. We expect α_i to vary across regions due to different demographics (that impact asymptomatic rates) and testing policies. Given a set of parameters {α̃_i : i ∈ P[t]} estimated via MLE, a natural approach to estimate α_i for regions i ∉ P[t] is by matching to regions in P[t] with similar covariates. It is also worth noting that since, in large systems in the fluid regime, we may show that C_i(t)/N_i → α_i as t → ∞, we also enforce α_i ≥ C_i[t]/N_i.

(Adam was run for 20k iterations, with learning rate tuned over a coarse grid. A weighted version of the loss function in (9), with the weight for the (i,t)-th observation set to C_i[t], worked well.)

Overall Learning Algorithm ('Two-Stage')
So motivated, given data up to time T, we now define our overall learning algorithm, 'Two-Stage', as follows:

α_i = exp(φ^T Z_i + δ_i), if i ∈ P[T];   α_i = max( exp(φ^T Z_i), γ_2 C_i[T]/N_i ), otherwise,

where γ_2 is a hyper-parameter and δ_i are region-specific random effect terms. We then estimate the model parameters (θ, φ, δ, ε) by minimizing the following specialization of (9):

min_{(θ,φ,δ,ε)} Σ_{i,t} ( log C_i[t] − log(α_i(φ, δ_i) N_i − s_i[t]) )^2   (11)

The hyper-parameters γ_1, γ_2 and the learning rates are tuned via cross-validation.

We use our estimates to forecast cumulative infections in the US at the level of sub-state 'regions', corresponding broadly to public health service areas. The median state has seven regions. The static covariates Z_i for each region i comprise 60 demographic features from multiple data providers, ranging from socio-economic to health indicators. The time-varying covariates M_i[t] correspond to mobility data inferred from aggregated cell phone data. All datasets are widely available and are in active use in other models. The Appendix provides a detailed description of these covariates.

This section focuses on two goals: First, we perform an ablation study focused on the impact of assumed structure on the parameter α_i. This allows us to understand whether and why optimizing (11) provides improved performance. Second, and perhaps more importantly, our overall goal was simply to produce accurate forecasts; here we compare the performance of our approach to published forecast performance for the most broadly cited forecaster for COVID-19, [3].

The impact of learning α: We compare four approaches to learning α: Default: Ignoring the modeling of α altogether by setting α_i = 1 ∀ i ∈ I (the 'default' SIR choice); MLE: Learn α_i for each region via the MLE approximation (9); Two-Stage: The two-stage approach specified by (11); and
Idealized: Using a value of α computed in-sample (i.e. by looking into the future) via (9).

Figure 1 shows weighted mean absolute percentage error (WMAPE) over regions, with weights proportional to infections on May 21, 2020, for two metrics relevant to decision making: cumulative infections by May 21, 2020 and maximum daily infections, for regions that have peaked by May 21, 2020. Model vintages vary along the x-axis, so that moving from left to right, models are trained on data approaching the target date of May 21.

Default fails, with a WMAPE over 70% even on May 21 (an in-sample fit). At the other extreme,
Idealized exhibits consistently low error even for early model vintages. This bears out the prediction of Theorem 3.4: given α, β is easy to learn even early in the infection with few samples. MLE performs poorly until close to the target date of May 21, at which point sufficient data is available to learn α. This empirically illustrates the difficulty of learning α, as described in Corollary 3.2. (We do not show the Default model in Figure 1 as its WMAPEs are substantially larger than the scale shown.)
Figure 1: Prediction errors by model vintage, for regions that have peaked by May 21, 2020. Colors denote different approaches to learning α_i.
Figure 2: α estimated on May 21 vs. April 21. Size indicates cumulative cases on May 21. MLE underestimates α in non-peaked regions (i.e. with limited data). Two-Stage does not.

Turning to our proposed approach, we see that Two-Stage significantly outperforms
MLE far away from the test date. Close to the test date the two approaches are comparable. For maximum daily infections – perhaps the most critical metric for capacity planning –
MLE drastically underperforms
Two-Stage far from the test date. Our approach to learning from peaked regions significantly mitigates the difficulty of learning α. As further evidence, Figure 2 shows how well α from the April 21 vintage predicts α in the final May 21 vintage, for regions that have peaked by May 21. As Proposition 3.3 predicts, the April 21 α are close to the May 21 α values for peaked regions, in both MLE and
Two-Stage models. For non-peaked regions, the April 21
MLE α estimates vary much more than the Two-Stage estimates, and tend to underestimate the true α value.

Overall model performance
Finally, we compare our analyzed models to the widely used IHME model [3]. Figure 3 compares state-level WMAPE for the MLE, Two-Stage and IHME models, for vintages stretching back 28 days. The IHME model is, in effect, an SI model with carefully tuned parameters. We report published IHME forecasts; 10 vintages of that model were reported between April 21 and May 21.
Two-Stage dominates IHME across all model vintages. (IHME only provides state-level predictions, and only offers death predictions for these vintages; we therefore show WMAPE on deaths for IHME and WMAPE on infections for MLE and Two-Stage.)
Figure 3: WMAPE for predicting state-level cumulative cases on May 21, 2020, comparing MLE and the Two-Stage approach against IHME.
Broader Impact
In contrast with the plurality of available forecast models (which make predictions at the level of a state), the models we construct here are at a much finer level of granularity (essentially, the county) and provide accurate predictions early in an epidemic. As a result, our models can guide the deployment of resources in states with heterogeneity in resource availability and prevalence. The model was deployed in such an operational fashion in a large US state, where it was used to proactively place hospital resources (mobile surge capacity) in areas where we anticipated large peaks in infections. Many of these predictions proved accurate and timely in hindsight. This would not be possible with a state-level forecast (let alone a forecast with limited predictive power). While not every state was able to make such data-driven decisions in resource deployment, we believe that in the event of a second outbreak, the approach we develop here can serve this need broadly.
Acknowledgements
The authors express their gratitude to Danial Mirza, Suzana Iacob, El Ghali Zerhouni, Neil Sanjay Pendse, Shen Chen, Celia Escribe and Jonathan Amar for their excellent support in data collection and organization.
References
[1] M. S. Bartlett. Some evolutionary stochastic processes. Journal of the Royal Statistical Society, Series B (Methodological), 11(2):211–229, 1949.
[2] William Ogilvy Kermack and Anderson G. McKendrick. A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society of London, Series A, Containing Papers of a Mathematical and Physical Character, 115(772):700–721, 1927.
[3] IHME, 2020.
[4] Giuseppe C. Calafiore, Carlo Novara, and Corrado Possieri. A modified SIR model for the COVID-19 contagion in Italy. arXiv preprint arXiv:2003.14391, 2020.
[5] Giuseppe Gaeta. A simple SIR model with a large set of asymptomatic infectives. arXiv preprint arXiv:2003.08720, 2020.
[6] Giulia Giordano, Franco Blanchini, Raffaele Bruno, Patrizio Colaneri, Alessandro Di Filippo, Angela Di Matteo, and Marta Colaneri. Modelling the COVID-19 epidemic and implementation of population-wide interventions in Italy. Nature Medicine, pages 1–6, 2020.
[7] F. A. Binti Hamzah, C. Lau, H. Nazri, D. V. Ligot, G. Lee, C. L. Tan, et al. CoronaTracker: world-wide COVID-19 outbreak data analysis and prediction. Bull World Health Organ. E-pub, 19, 2020.
[8] Adam J. Kucharski, Timothy W. Russell, Charlie Diamond, Yang Liu, John Edmunds, Sebastian Funk, Rosalind M. Eggo, Fiona Sun, Mark Jit, James D. Munday, et al. Early dynamics of transmission and control of COVID-19: a mathematical modelling study. The Lancet Infectious Diseases, 2020.
[9] Joseph T. Wu, Kathy Leung, and Gabriel M. Leung. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. The Lancet, 395(10225):689–697, 2020.
[10] Cleo Anastassopoulou, Lucia Russo, Athanasios Tsakris, and Constantinos Siettos. Data-based analysis, modelling and forecasting of the COVID-19 outbreak. PLoS ONE, 15(3):e0230405, 2020.
[11] Kathakali Biswas, Abdul Khaleque, and Parongama Sen. COVID-19 spread: Reproduction of data and prediction using a SIR model on Euclidean network. arXiv preprint arXiv:2003.07063, 2020.
[12] Maria Chikina and Wesley Pegden. Modeling strict age-targeted mitigation strategies for COVID-19. arXiv preprint arXiv:2004.04144, 2020.
[13] Clément Massonnaud, Jonathan Roux, and Pascal Crépey. COVID-19: Forecasting short term hospital needs in France. medRxiv, 2020.
[14] Rahul Goel and Rajesh Sharma. Mobility based SIR model for pandemics–with case study of COVID-19. arXiv preprint arXiv:2004.13015, 2020.
[15] Kimberlyn Roosa and Gerardo Chowell. Assessing parameter identifiability in compartmental dynamic models using a computational approach: application to infectious disease transmission models. Theoretical Biology and Medical Modelling, 16(1):1, 2019.
[16] Neil D. Evans, Lisa J. White, Michael J. Chapman, Keith R. Godfrey, and Michael J. Chappell. The structural identifiability of the susceptible infected recovered model with seasonal forcing. Mathematical Biosciences, 194(2):175–197, 2005.
[17] Anirban Basu. Estimating the infection fatality rate among symptomatic COVID-19 cases in the United States: Study estimates the COVID-19 infection fatality rate at the US county level. Health Affairs, pages 10–1377, 2020.
[18] Gerardo Chowell. Fitting dynamic models to epidemic outbreaks with quantified uncertainty: a primer for parameter uncertainty, identifiability, and forecasts. Infectious Disease Modelling, 2(3):379–398, 2017.
[19] Alex Capaldi, Samuel Behrend, Benjamin Berman, Jason Smith, Justin Wright, and Alun L. Lloyd. Parameter estimation and uncertainty quantification for an epidemic model. Mathematical Biosciences and Engineering, page 553, 2012.
[20] R. W. R. Darling and James R. Norris. Differential equation approximations for Markov chains. Probability Surveys, 5:37–79, 2008.
[21] John D. Gorman and Alfred O. Hero. Lower bounds for parametric estimation with constraints. IEEE Transactions on Information Theory, 36(6):1285–1301, 1990.
[22] Jean Jacod, Thomas G. Kurtz, Sylvie Méléard, and Philip Protter. The approximate Euler method for Lévy driven stochastic differential equations. In Annales de l'IHP Probabilités et statistiques, volume 41, pages 523–558, 2005.
[23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[24] Svante Janson. Tail bounds for sums of geometric and exponential variables. Statistics & Probability Letters, 135:1–6, 2018.
[25] Joel C. Miller. Mathematical models of SIR disease spread with combined non-sexual and sexual transmission routes. Infectious Disease Modelling, 2(1):35–55, 2017.
[26] Joel C. Miller. A note on the derivation of epidemic final sizes. Bulletin of Mathematical Biology, 74(9):2125–2141, 2012.
[27] Ensheng Dong, Hongru Du, and Lauren Gardner. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases, 20(5):533–534, 2020.
[28] SafeGraph Social Distancing Metrics, 2020. https://docs.safegraph.com/docs/social-distancing-metrics.
[29] University of Michigan Health and Retirement Study, 2020. https://hrs.isr.umich.edu/data-products.
[30] Claritas, 2020.
[31] CDC Interactive Atlas of Heart Disease and Stroke, 2020.
[32] CDC Social Vulnerability Index, 2020. https://svi.cdc.gov/.
Appendices
Appendices A and B contain the proofs of Theorems 3.1 and 3.4, respectively. Appendix C contains the proofs of Propositions 2.1, 2.2, and 3.3, each in its own subsection. Appendix D contains a result justifying the use of (10) as a sufficient condition for reliably estimating α. Appendix E contains a description of the covariates used in the practical SIR model.
A Proof of Theorem 3.1
We finish the sections of the proof that were not included in the main paper. This includes the proofs of Lemma 3.5 and Lemma 3.12, and the calculations for Lemma 3.6. We define
λ(α, k−1, C_{k−1}) = ( β(αN − C_{k−1})/(αN) + γ ) I_{k−1} and η(α, C_{k−1}) = β(αN − C_{k−1}) / ( β(αN − C_{k−1}) + αNγ ).
Thus, for k ≤ τ, λ(α, k−1, C_{k−1}) is the rate of the k-th inter-arrival time and η(α, C_{k−1}) is the probability that the arrival in the k-th instance is a new infection rather than a recovery.
A.1 Proof of Lemma 3.5
Proof.
Suppose k < τ, i.e. E_k = 1. Then k is equal to the total number of jumps that have occurred so far (the number of movements from S to I plus the number of movements from I to R). The number of individuals that have moved from S to I is C_k − I_0 − R_0, and the number of movements from I to R is C_k − I_k − R_0. Therefore, k = 2C_k − I_k − I_0 − 2R_0. Since I_k > 0, C_k > r_k.
Suppose k ≥ τ, i.e. E_k = 0. Then k is greater than or equal to the total number of jumps, which is still equal to 2C_k − I_k − I_0 − 2R_0 = 2C_k − I_0 − 2R_0 (note I_k = 0 for k ≥ τ). Hence C_k ≤ r_k in this case.
A.2 Proof of Lemma 3.12
Proof.
Let X_k ~ Bern(p) i.i.d. for k = 1, 2, .... Let {A_k : k ≥ 0} be a stochastic process defined by:
A_k = C_0 if k = 0; A_k = C_0 + X_1 + ... + X_k if A_i > r_i for all i < k; A_k = A_{k−1} otherwise. (12)
Let τ_A = min{k : A_k ≤ r_k} be the "stopping time" of this process.
Claim A.1. Pr(τ ≤ m) ≤ Pr(τ_A ≤ m).
The proof of this claim involves showing that the process {A_k} is stochastically less than {C_k}; the proof can be found in Section A.2.1. We now upper bound Pr(τ_A ≤ m). τ_A ≤ m if and only if A_k ≤ r_k for some k ≤ m. Before this happens, A_k = C_0 + X_1 + ... + X_k. Therefore, if τ_A ≤ m, it must be that C_0 + X_1 + ... + X_k ≤ (k + I_0 + 2R_0)/2 for some k ≤ m.
Pr(τ_A ≤ m) ≤ Σ_{k=1}^m Pr( C_0 + X_1 + ... + X_k ≤ (k + I_0 + 2R_0)/2 ) (13)
= Σ_{k=1}^m Pr( X_1 + ... + X_k < pk( 1 − (1 − 1/(2p)) − I_0/(2pk) ) ). (14)
Since E[X_1 + ... + X_k] = pk, using the Chernoff bound (multiplicative form: Pr( Σ_{i=1}^k X_i ≤ (1 − δ)μ ) ≤ exp(−δ²μ/2)) gives
Pr(τ_A ≤ m) ≤ Σ_{k=1}^m exp( −(pk/2)( (1 − 1/(2p)) + I_0/(2pk) )² ) (15)
= Σ_{k=1}^m exp( −(pk/2)(1 − 1/(2p))² − (I_0/2)(1 − 1/(2p)) − I_0²/(8pk) ) (16)
≤ Σ_{k=1}^m exp( −(pk/2)(1 − 1/(2p))² − (I_0/2)(1 − 1/(2p)) ) (17)
≤ exp( −(1/2)(1 − 1/(2p)) I_0 ) Σ_{k=1}^∞ exp( −(pk/2)(1 − 1/(2p))² ) (18)
≤ B_1 exp(−B_2 I_0), (19)
for constants B_1 = Σ_{k=1}^∞ exp( −(pk/2)(1 − 1/(2p))² ) and B_2 = (1/2)(1 − 1/(2p)) > 0. (B_1 is a finite constant since the sum is a geometric series with ratio smaller than 1, since p > 1/2.) Let D be the solution to B_1 exp(−B_2 D) = 1/2. Then, if I_0 ≥ D, Pr(E_m) = 1 − Pr(τ ≤ m) ≥ 1 − Pr(τ_A ≤ m) ≥ 1/2.
A.2.1 Proof of Claim A.1.
Definition A.2.
For scalar random variables X, Y, we say that X is stochastically less than Y (written X ≤_st Y) if for all t ∈ R,
Pr(X > t) ≤ Pr(Y > t). (20)
For random vectors X, Y ∈ R^n, we say that X ≤_st Y if for all increasing functions φ : R^n → R,
φ(X_1, ..., X_n) ≤_st φ(Y_1, ..., Y_n). (21)
We make use of the following known result for establishing stochastic order for stochastic processes.
Theorem A.3 (Veinott 1965). Suppose X_1, ..., X_n, Y_1, ..., Y_n are random variables such that X_1 ≤_st Y_1 and, for any x_1 ≤ y_1, ..., x_{k−1} ≤ y_{k−1},
(X_k | X_1 = x_1, ..., X_{k−1} = x_{k−1}) ≤_st (Y_k | Y_1 = y_1, ..., Y_{k−1} = y_{k−1}) (22)
for every 2 ≤ k ≤ n. Then (X_1, ..., X_n) ≤_st (Y_1, ..., Y_n).
Proof of Claim A.1. Because of the condition β(P − m − C_0)/(β(P − m − C_0) + Pγ) > p, for k ≤ m and k ≤ τ we have C_k − C_{k−1} ~ Bern(q) for some q > p. First, we show (A_0, A_1, ..., A_m) ≤_st (C_0, C_1, ..., C_m) using Theorem A.3. A_0 ≤_st C_0 since A_0 = C_0. We condition on A_{k−1} = x and C_{k−1} = y for x ≤ y, and we must show A_k ≤_st C_k. (We do not need to condition on all past variables since both processes are Markov.) If x ≤ r_{k−1}, then A_k = A_{k−1} = x ≤ y = C_{k−1} ≤ C_k. Otherwise, the process A_k has not stopped, and neither has C_k since y ≥ x. Then A_k ~ x + Bern(p) and C_k ~ y + Bern(q) for some q ≥ p. Clearly, A_k ≤_st C_k in this case. We apply Theorem A.3, which implies (A_0, ..., A_m) ≤_st (C_0, ..., C_m).
Define the function u : R^{m+1} → {0, 1}, u(x_0, x_1, ..., x_m) = 1{∪_{k=1}^m {x_k ≤ r_k}}. Then u(A_0, A_1, ..., A_m) = 1 if and only if τ_A ≤ m, and u(C_0, C_1, ..., C_m) = 1 if and only if τ ≤ m. u is a decreasing function. Therefore, u(A_0, A_1, ..., A_m) ≥_st u(C_0, C_1, ..., C_m). Then
Pr(τ ≤ m) = Pr( u(C_0, C_1, ..., C_m) ≥ 1 ) ≤ Pr( u(A_0, A_1, ..., A_m) ≥ 1 ) = Pr(τ_A ≤ m),
as desired.
A.3 Calculations for Lemma 3.6
Derivation of E_{C_k}[ g_{C_k|C_{k−1}}(C_k, C_{k−1}, α) | E_{k−1} = 1 ]. When E_{k−1} = 1, we have C_k ~ C_{k−1} + Bern(η(α, C_{k−1})). Therefore,
E_{C_k}[ g_{C_k|C_{k−1}}(C_k, C_{k−1}, α) | E_{k−1} = 1 ] = J_{C_k ~ Bern(η(α, C_{k−1}))}(α).
We reparameterize to write the Fisher information as:
E_{C_k}[ g_{C_k|C_{k−1}}(C_k, C_{k−1}, α) | E_{k−1} = 1 ] = J_{C_k ~ Bern(η)}(η) ( ∂η(α, C_{k−1})/∂α )² (23)
= ( 1/(η(1 − η)) ) ( ∂η(α, C_{k−1})/∂α )². (24)
Use η(α, C_{k−1}) = β(αN − C_{k−1})/( β(αN − C_{k−1}) + αNγ ) to derive
∂η(α, C_{k−1})/∂α = ( βN( β(αN − C_{k−1}) + γαN ) − βN(αN − C_{k−1})(β + γ) ) / ( β(αN − C_{k−1}) + γαN )² (25)
= βNγC_{k−1} / ( β(αN − C_{k−1}) + γαN )². (26)
Also, η(1 − η) = β(αN − C_{k−1}) αNγ / ( β(αN − C_{k−1}) + αNγ )².
Substituting,
E_{C_k}[ g_{C_k|C_{k−1}}(C_k, C_{k−1}, α) | E_{k−1} = 1 ] = ( ( β(αN − C_{k−1}) + αNγ )² / ( β(αN − C_{k−1}) αNγ ) ) ( βNγC_{k−1} / ( β(αN − C_{k−1}) + γαN )² )² (27)
= βNγ C_{k−1}² / ( α (αN − C_{k−1}) ( β(αN − C_{k−1}) + αNγ )² ) (28)
= βN²γ C_{k−1}² / ( P (P − C_{k−1}) ( β(P − C_{k−1}) + Pγ )² ). (29)
Derivation of E_{T_k}[ g_{T_k|C_{k−1}}(T_k, C_{k−1}, α) | E_{k−1} = 1 ]. Similarly, conditioned on E_{k−1} = 1, T_k ~ Exp( λ(α, k−1, C_{k−1}) ). Therefore, E_{T_k}[ g_{T_k|C_{k−1}}(T_k, C_{k−1}, α) ] = J_{T_k ~ Exp(λ(α, k−1, C_{k−1}))}(α). We reparameterize to write
E_{T_k}[ g_{T_k|C_{k−1}}(T_k, C_{k−1}, α) ] = J_{T_k ~ Exp(λ)}(λ) ( ∂λ(α, k−1, C_{k−1})/∂α )² (30)
= ( 1/λ² ) ( ∂λ(α, k−1, C_{k−1})/∂α )². (31)
Use λ(α, k−1, C_{k−1}) = ( β(αN − C_{k−1})/(αN) + γ )( 2C_{k−1} − (k−1) − I_0 − 2R_0 ) to derive
∂λ(α, k−1, C_{k−1})/∂α = βC_{k−1}( 2C_{k−1} − (k−1) − I_0 − 2R_0 ) / (α²N) (32)
1/λ(α, k−1, C_{k−1}) = αN / ( ( β(αN − C_{k−1}) + γαN )( 2C_{k−1} − (k−1) − I_0 − 2R_0 ) ). (33)
Substituting,
E_{T_k}[ g_{T_k|C_{k−1}}(T_k, C_{k−1}, α) ] = ( βNC_{k−1} / ( αN( β(αN − C_{k−1}) + γαN ) ) )² (34)
= ( βNC_{k−1} / ( P( β(P − C_{k−1}) + γP ) ) )². (35)
Derivation of J_{O_m}(α). Using the expressions derived above for E_{C_k}[ g_{C_k|C_{k−1}}(C_k, C_{k−1}, α) | E_{k−1} = 1 ] and E_{T_k}[ g_{T_k|C_{k−1}}(T_k, C_{k−1}, α) ], we get
E_{C_k}[ g_{C_k|C_{k−1}}(C_k, C_{k−1}, α) | E_{k−1} = 1 ] + E_{T_k}[ g_{T_k|C_{k−1}}(T_k, C_{k−1}, α) ]
= βN²γ C_{k−1}² / ( P (P − C_{k−1}) ( β(P − C_{k−1}) + Pγ )² ) + ( βNC_{k−1} / ( P( β(P − C_{k−1}) + γP ) ) )²
= N² C_{k−1}² / ( P² (P − C_{k−1}) ( (P − C_{k−1}) + (γ/β)P ) ).
Thus,
J_{O_m}(α) = Σ_{k=1}^m E[ g_{C_k|C_{k−1}}(C_k, C_{k−1}, α) + g_{T_k|C_{k−1}}(T_k, C_{k−1}, α) | E_{k−1} = 1 ] Pr(E_{k−1} = 1)
= Σ_{k=1}^m E[ N² C_{k−1}² / ( P² (P − C_{k−1}) ( (P − C_{k−1}) + (γ/β)P ) ) | E_{k−1} = 1 ] Pr(E_{k−1} = 1).
B Proof of Theorem 3.4
Fix an instance in which the assumptions of the theorem statement hold. We remove the subscript n, and let P = αN. Let p ≜ (1/2)( β/(β+γ) + 1/2 ) > 1/2. Let Â = (C_m − C_0)/m be an estimator for β/(β+γ), and let B̂ = S̃_m/m be an estimator for 1/(β+γ), where S̃_m = Σ_{k=1}^{min(m,τ)} I_{k−1}T_k. Let β̂ = Â/B̂ and γ̂ = 1/B̂ − β̂.
This first lemma follows from (19) of the proof of Lemma 3.12.
Lemma B.1. If (β/(β+γ)) (P − m − C_0)/P > p, then Pr(τ < m) ≤ B_1 e^{−B_2 I_0}, where B_1, B_2 > 0 are constants that depend only on β and γ.
The next two lemmas give high probability confidence bounds for the estimators Â and B̂.
Lemma B.2.
For any m, I_0 where (β/(β+γ)) (P − m − C_0)/P > 1/2, for any δ > 0,
Pr( (C_m − C_0)/m ∉ [ (β/(β+γ))(1 − δ)(P − m − C_0)/P, (β/(β+γ))(1 + δ) ], τ ≥ m ) ≤ 2e^{−mδ²/(4 + 2δ)}. (36)
Lemma B.3.
Let S̃_m = Σ_{k=1}^{min(m,τ)} I_{k−1}T_k. Then
Pr( S̃_m/m ∉ [ (1 − δ)/(β+γ), ((1 + δ)/(β+γ)) P/(P − m − C_0) ], τ ≥ m ) ≤ 2e^{−m((P − m − C_0)/P)(δ − ln(1+δ))}. (37)
The next proposition combines the two estimators from the above lemmas into the estimators β̂ and γ̂.
Proposition B.4.
Assume β > γ > 0. Let I_0 ≤ m < P be such that (β/(β+γ)) (P − m − C_0)/P > p, and let z = (P − m − C_0)/P. Then, for any 0 < δ < 1/2, with probability at least 1 − 2e^{−mz(δ − ln(1+δ))} − 2e^{−mδ²/(4+2δ)} − 2B_1 e^{−B_2 I_0},
β̂ ∈ [ β(1 − δ)z²/(1 + δ), β(1 + δ)/(1 − δ) ] (38)
γ̂ ∈ [ γz/(1 + δ) + β( (1 − δ)z − (1 + δ)² )/( (1 + δ)(1 − δ) ), γ/(1 − δ) + β( (1 + δ) − (1 − δ)²z² )/( (1 − δ)(1 + δ) ) ], (39)
where B_1, B_2 > 0 are constants that depend on β and γ.
We first show Theorem 3.4 using these results. We then prove Lemma B.2, Lemma B.3, and Proposition B.4 in Appendix B.2.
B.1 Proof of Theorem 3.4
Proof.
Let δ = √(5 log m / m). First, we claim that the probability in Proposition B.4 is greater than 1 − 4/m − 2B_1e^{−B_2I_0}. Note that ln(1 + δ) ≤ δ − δ²/2 + δ³/3, implying δ − ln(1 + δ) ≥ (δ²/2)(1 − δ). Since δ ≤ 1/2 and z ≥ 1 − 1/m,
e^{−mz(δ − ln(1+δ))} ≤ e^{−mzδ²(1−δ)/2} ≤ e^{−mδ²/5} = 1/m. (40)
Using δ ≤ 1/2 again,
e^{−mδ²/(4+2δ)} ≤ e^{−mδ²/5} = 1/m. (41)
Hence, the bound in Proposition B.4 holds with probability greater than 1 − 4/m − 2B_1e^{−B_2I_0}.
Since we assume m(m + C_0) ≤ P and z = 1 − (m + C_0)/P,
1 − z ≤ 1/m. (42)
Let R be the event that the confidence bounds (38)-(39) hold. Note that 1/(1 − δ) ≤ 1 + 2δ and (1 − δ)/(1 + δ) ≥ 1 − 2δ for δ < 1/2. Then
E[(β̂ − β)² | R] ≤ β²( (1 + δ)/(1 − δ) − (1 − δ)z²/(1 + δ) )² (43)
≤ β²( (1 − z²) + 6δ )² (44)
≤ β²( 2/m + 6√(5 log m / m) )² (45)
≤ β² M_1 log m / m (46)
for an absolute constant M_1 > 0. The second-to-last step uses (42) and z ≤ 1.
Similarly,
E[(γ̂ − γ)² | R] ≤ ( γ( 1/(1 − δ) − z/(1 + δ) ) + β( ( (1 + δ) − (1 − δ)²z² )/( (1 − δ)(1 + δ) ) − ( (1 − δ)z − (1 + δ)² )/( (1 + δ)(1 − δ) ) ) )². (47)
Using the facts above,
1/(1 − δ) − z/(1 + δ) ≤ (1 − z) + 3δ ≤ 1/m + 3δ. (48)
( (1 + δ) − (1 − δ)²z² )/( (1 − δ)(1 + δ) ) − ( (1 − δ)z − (1 + δ)² )/( (1 + δ)(1 − δ) )
= ( (1 + δ) + (1 + δ)² − (1 − δ)²z² − (1 − δ)z ) / (1 − δ²) (49)
≤ (4/3)( (1 + δ) + (1 + δ)² − (1 − δ)²z² − (1 − δ)z ) (50)
≤ (4/3)( (1 − z²) + 3δ + (1 − z) + 4δ ) (51)
≤ (4/3)( 3(1 − z) + 7δ ) (52)
≤ 4(1 − z) + 10δ (53)
≤ 4/m + 10δ. (54)
Substituting back into (47) results in
E[(γ̂ − γ)² | R] ≤ ( γ( 1/m + 3δ ) + β( 4/m + 10δ ) )² (55)
≤ M_2 β² log m / m, (56)
for an absolute constant M_2 > 0, since β > γ.
Therefore,
E[(β̂ − β)²] ≤ E[(β̂ − β)² | R] + β² Pr(R̄) (57)
≤ M_1 β² log m / m + β²( 4/m + 2B_1e^{−B_2I_0} ) (58)
≤ M β² log m / m + 2B_1β²e^{−B_2I_0}, (59)
for an absolute constant M > 0. Similarly,
E[(γ̂ − γ)²] ≤ M_2 β² log m / m + γ²( 4/m + 2B_1e^{−B_2I_0} ) (60)
≤ M β² log m / m + 2B_1γ²e^{−B_2I_0}. (61)
B.2 Proofs of Intermediate Results
B.2.1 Proof of Lemma B.2.
Proof.
Fix m, let z := (P − m − C_0)/P and p = (β/(β+γ))z. Then p > 1/2. Define three stochastic processes {A_k : k ≥ 0}, {B_k : k ≥ 0}, {C̃_k : k ≥ 0}:
A_k = C_0 if k = 0; A_k = A_{k−1} + Bern(p) otherwise. (62)
B_k = C_0 if k = 0; B_k = B_{k−1} + Bern(p/z) otherwise. (63)
C̃_k = C_0 if k = 0; C̃_k = C̃_{k−1} + Bern( β(P − C̃_{k−1}) / ( β(P − C̃_{k−1}) + Pγ ) ) otherwise. (64)
Note that C̃_k is a modified version of C_k in which C̃_k still evolves after the stopping time.
Claim B.5. A_m is stochastically less than C̃_m (A_m ≤_st C̃_m), and C̃_m is stochastically less than B_m (C̃_m ≤_st B_m); that is, for any ℓ ∈ R,
Pr(B_m ≤ ℓ) ≤ Pr(C̃_m ≤ ℓ) ≤ Pr(A_m ≤ ℓ). (65)
This claim follows from Theorem A.3, using an argument similar to the proof of Claim A.1.
Let A_k = C_0 + X_1 + X_2 + ... + X_k, where the X_i ~ Bern(p) are independent. We provide the left tail bound for C_m. Note that when τ ≥ m, C_m has the same distribution as C̃_m. Hence,
Pr( C_m ≤ mp(1 − δ) + C_0, τ ≥ m ) = Pr( C̃_m ≤ mp(1 − δ) + C_0, τ ≥ m ) (66)
≤ Pr( C̃_m ≤ mp(1 − δ) + C_0 ) (67)
≤ Pr( A_m ≤ mp(1 − δ) + C_0 ). (68)
Using the Chernoff bound gives
Pr( A_m ≤ mp(1 − δ) + C_0 ) = Pr( C_0 + X_1 + ... + X_m ≤ pm(1 − δ) + C_0 ) (69)
= Pr( X_1 + ... + X_m ≤ mp(1 − δ) ) (70)
≤ exp( −mpδ²/2 ). (71)
Therefore,
Pr( (C_m − C_0)/m ≤ (p/z)(1 − δ)z, τ ≥ m ) = Pr( C_m ≤ mp(1 − δ) + C_0, τ ≥ m ) (72)
≤ exp( −mpδ²/2 ) ≤ exp( −mδ²/4 ). (73)
Let B_k = C_0 + Y_1 + ... + Y_k, where the Y_i ~ Bern(p/z) are independent. Similarly, for the upper tail bound, we have
Pr( (C_m − C_0)/m ≥ (p/z)(1 + δ), τ ≥ m ) = Pr( C_m ≥ m(p/z)(1 + δ) + C_0, τ ≥ m ) (74)
≤ Pr( B_m ≥ m(p/z)(1 + δ) + C_0 ) (75)
≤ Pr( C_0 + Y_1 + ... + Y_m ≥ m(p/z)(1 + δ) + C_0 ) (76)
≤ exp( −m(p/z)δ²/(2 + δ) ) ≤ exp( −mδ²/(4 + 2δ) ), (77)
due to the multiplicative Chernoff bound Pr( Z ≥ E[Z](1 + δ) ) ≤ e^{−E[Z]δ²/(2+δ)}, where Z is a sum of i.i.d. Bernoulli random variables.
Combining the upper and lower tail bounds, and noting that p/z = β/(β+γ), we conclude that for any δ > 0,
Pr( (C_m − C_0)/m ∉ [ (β/(β+γ))(1 − δ)z, (β/(β+γ))(1 + δ) ], τ ≥ m ) ≤ 2e^{−mδ²/(4 + 2δ)}. (78)
B.2.2 Proof of Lemma B.3.
Proof.
Conditioned on (I_0, C_0, I_1, C_1, ..., I_{m−1}, C_{m−1}) with τ ≥ m, the variables
I_{k−1}T_k ~ Exp( β(P − C_{k−1})/P + γ )
are independent exponential random variables.
Theorem 5.1 in [24] gives a tail bound for the sum of independent exponential random variables: let X = Σ_{i=1}^n X_i with X_i ~ Exp(a_i) independent; then for δ > 0,
Pr( X ≥ (1 + δ)μ ) ≤ (1/(1 + δ)) e^{−a*μ(δ − ln(1+δ))} ≤ e^{−a*μ(δ − ln(1+δ))} (79)
Pr( X ≤ (1 − δ)μ ) ≤ e^{−a*μ(δ − ln(1+δ))} (80)
where μ = E[X] and a* = min_{1 ≤ i ≤ n} a_i. Let S̃_m | C⃗, I⃗ denote S̃_m conditioned on (I_0, C_0, I_1, C_1, ..., I_{m−1}, C_{m−1}) with τ ≥ m. Let
μ = E[ S̃_m | C⃗, I⃗ ] = Σ_{k=1}^m 1/( β(P − C_{k−1})/P + γ ), a* = min_{1 ≤ k ≤ m} β(P − C_{k−1})/P + γ.
It is easy to verify the following facts:
μ a* ≥ Σ_{k=1}^m a*/(β+γ) ≥ m (P − m − C_0)/P (81)
1/(β+γ) ≤ μ/m ≤ (1/(β+γ)) P/(P − m − C_0). (82)
Combining these with Eqs. (79) and (80), we have
Pr( (S̃_m | C⃗, I⃗)/m ∉ [ (1 − δ)/(β+γ), ((1 + δ)/(β+γ)) P/(P − m − C_0) ] ) ≤ Pr( (S̃_m | C⃗, I⃗)/m ∉ [ μ(1 − δ)/m, μ(1 + δ)/m ] ) (83)
≤ 2e^{−m((P − m − C_0)/P)(δ − ln(1+δ))}. (84)
Therefore,
Pr( S̃_m/m ∉ I, τ ≥ m ) = ∫_{C⃗, I⃗ | τ ≥ m} Pr( S̃_m/m ∉ I | C⃗, I⃗, τ ≥ m ) f(C⃗, I⃗ | τ ≥ m) Pr(τ ≥ m) (85)
≤ 2e^{−m((P − m − C_0)/P)(δ − ln(1+δ))} Pr(τ ≥ m) (86)
≤ 2e^{−m((P − m − C_0)/P)(δ − ln(1+δ))}. (87)
B.2.3 Proof of Proposition B.4.
Proof.
Let β̂ = (C_m − C_0)/S̃_m and z = (P − C_0 − m)/P. Suppose x ∈ (β/(β+γ)) [ (1 − δ)z, 1 + δ ] and y ∈ (1/(β+γ)) [ 1 − δ, (1 + δ)/z ]. Then
x/y ∈ [ β(1 − δ)z²/(1 + δ), β(1 + δ)/(1 − δ) ]. (88)
Similarly, let γ̂ = m/S̃_m − β̂. Suppose a ∈ (β+γ) [ z/(1 + δ), 1/(1 − δ) ] and b ∈ β [ (1 − δ)z²/(1 + δ), (1 + δ)/(1 − δ) ]. Then
a − b ∈ [ γz/(1 + δ) + β( (1 − δ)z − (1 + δ)² )/( (1 + δ)(1 − δ) ), γ/(1 − δ) + β( (1 + δ) − (1 − δ)²z² )/( (1 − δ)(1 + δ) ) ]. (89)
Then, for any sets U_1, U_2,
Pr( β̂ ∈ U_1, γ̂ ∈ U_2 ) ≥ 1 − Pr(β̂ ∉ U_1) − Pr(γ̂ ∉ U_2) (90)
≥ 1 − Pr(β̂ ∉ U_1, τ ≥ m) − Pr(γ̂ ∉ U_2, τ ≥ m) − 2 Pr(τ < m) (91)
≥ 1 − 2e^{−mz(δ − ln(1+δ))} − 2e^{−mδ²/(4+2δ)} − 2B_1e^{−B_2I_0}, (92)
where the last step uses Lemma B.1, Lemma B.2, and Lemma B.3, with the intervals (88) and (89) for U_1 and U_2 respectively.
C Proofs of Propositions
C.1 Proof of Proposition 2.1
Proof.
As in [25, 26], the solution {(s′(t), i′(t), r′(t)) : t ≥ 0} with α = 1 can be written:
s′(t) = s′(0) e^{−ξ′(t)} (93)
i′(t) = N − s′(t) − r′(t) (94)
r′(t) = r′(0) + (γ′N/β′) ξ′(t) (95)
ξ′(t) = (β′/N) ∫_0^t i′(t*) dt* (96)
Making the appropriate substitutions yields the following equivalent system:
i′(t) = N − s′(0) exp(−ξ′(t)) − r′(0) − (γ′N/β′) ξ′(t) (97)
ξ′(t) = (β′/N) ∫_0^t i′(t*) dt*. (98)
Therefore, it remains to show that for α′ > 0, {(s(t), i(t), r(t)) : t ≥ 0} ≜ {(α′s′(t), α′i′(t), α′r′(t)) : t ≥ 0} is a solution of (97) and (98). Starting with (97),
i′(t) = N − s′(0) exp(−ξ′(t)) − r′(0) − (γ′N/β′) ξ′(t) (99)
α′i′(t) = α′( N − s′(0) exp(−ξ′(t)) − r′(0) − (γ′N/β′) ξ′(t) ) (100)
= α′N − α′s′(0) exp(−ξ(t)) − α′r′(0) − (γ′α′N/β′) ξ(t), (101)
where ξ(t) = ξ′(t) = (β′/(Nα′)) ∫_0^t α′i′(t*) dt*. Noting that ξ′(t) = ξ(t) and substituting i(t) = α′i′(t) yields the equations below, clearly showing that {(s(t), i(t), r(t)) : t ≥ 0} satisfies (97) and (98):
i(t) = α′N − s(0) exp(−ξ(t)) − r(0) − (γ′α′N/β′) ξ(t) (102)
ξ(t) = (β′/(α′N)) ∫_0^t i(t*) dt*. (103)
C.2 Proof of Proposition 2.2
Proof.
Consider initial conditions (s(0), i(0), 0). As in [25, 26], the analytical solution is given by
s(t) = s(0) e^{−ξ(t)}, i(t) = αN − s(t) − r(t), r(t) = (γαN/β) ξ(t), ξ(t) = (β/(αN)) ∫_0^t i(t′) dt′.
Consider two SIR models with parameters (α, β, γ) and (α′, β′, γ′), and initial conditions (s_0, i_0, 0) and (s′_0, i′_0, 0) respectively. We claim that if the infection trajectories i(t) and i′(t) are identical on an open set [0, T), then the parameters and initial conditions are identical as well.
Assume i(t) = i′(t) for all t ∈ [0, T); then, given the exact solution above, it follows that
αN − s_0 e^{−(β/(αN))x} − γx = α′N − s′_0 e^{−(β′/(α′N))x} − γ′x, for all x ∈ [ 0, ∫_0^T i(t) dt ]. (104)
As functions of x, both the RHS and LHS of the equality above are holomorphic; hence, using the identity theorem, we conclude that αN = α′N, s_0 = s′_0, β/(αN) = β′/(α′N), and γ = γ′, which completes the proof.
C.3 Proof of Proposition 3.3
Proof.
The crux of the problem is summarised in two smaller results, bounding T_{d,n} and T_{d2,n} respectively. To ease our exposition, we let P_n = α_nN_n and drop the index n in these helper results.
Proposition C.1. There exists a constant ν_1 that depends only on γ, β such that
T_d ≥ (1/(β − γ)) ( (2/3) log( ν_1 P / c(0)^{3/2} ) + log( 1 − c(0)/P^{2/3} ) ).
Proposition C.2. Let ρ_1 = 1 − 1/log log P. There exists a constant ν_2 that depends only on γ, β, and a constant C = O(1), such that
T_{d2} ≤ (1/(βρ_1 − γ)) log( ν_2 P / i(0) ) + C/(1 − ρ_1).
The argument follows directly by taking the limit of the bounds we provide in Propositions C.1-C.2. Specifically, using that the constants ν_{1,n}, ν_{2,n} do not change with n, we arrive at
lim_{n→∞} T_{d2,n}/T_{d,n} ≤ lim_{n→∞} [ (1/(βρ_1 − γ)) log( ν_2 P_n / i_n(0) ) + C_n/(1 − ρ_1) ] / [ (1/(β − γ)) ( (2/3) log( ν_1 P_n / c_n(0)^{3/2} ) + log( 1 − c_n(0)/P_n^{2/3} ) ) ]
= lim_{n→∞} ( (β − γ)/(βρ_1 − γ) ) · ( log P_n + log ν_2 − log i_n(0) ) / ( (2/3) log P_n + (2/3) log ν_1 − log c_n(0) + log( 1 − c_n(0)/P_n^{2/3} ) )
+ lim_{n→∞} ( (β − γ) C_n log log P_n ) / ( (2/3) log P_n + (2/3) log ν_1 − log c_n(0) + log( 1 − c_n(0)/P_n^{2/3} ) ).
Since c_n(0) = O(log(P_n)) by assumption (and i_n(0) ≤ c_n(0)), and C_n = O(1) by Proposition C.2, the limits of the two summands above are 3/2 and 0 respectively, which concludes the proof.
C.3.1 Proof of Proposition C.1.
Proof of Proposition C.1.
Define ĩ(t) such that ĩ(0) = i(0) and dĩ/dt = (β − γ)ĩ, implying
ĩ(t) = i(0) exp{(β − γ)t}. (105)
Since dĩ/dt ≥ di/dt for all t, ĩ(t) ≥ i(t) for all t. Then, for all t,
ds/dt = −β(s/P)i ≥ −βi ≥ −βĩ. (106)
Hence we can write
s(t) ≥ s(0) + ∫_0^t −βĩ(t′) dt′ (107)
= s(0) − βi(0) ∫_0^t exp{(β − γ)t′} dt′ (108)
= s(0) − ( βi(0)/(β − γ) )( exp{(β − γ)t} − 1 ). (109)
Since s(0) − s(T_d) = P^{2/3} − c(0), setting t = T_d and solving for T_d in the inequality above results in
T_d ≥ (1/(β − γ)) log( ( (β − γ)/(βi(0)) )( P^{2/3} − c(0) ) ) (111)
≥ (1/(β − γ)) log( ( (β − γ)/(βc(0)) )( P^{2/3} − c(0) ) ) (112)
= (1/(β − γ)) ( log( ( (β − γ)/(βc(0)) ) P^{2/3} ) + log( 1 − c(0)/P^{2/3} ) ) (113)
= (1/(β − γ)) ( (2/3) log( ν_1 P / c(0)^{3/2} ) + log( 1 − c(0)/P^{2/3} ) ) (114)
for ν_1 = ( (β − γ)/β )^{3/2}, as desired.
C.3.2 Proof of Proposition C.2.
For ρ ∈ [γ/β, 1], let t_ρ be the time t when s(t)/P = ρ; ρ represents the fraction of the total population that is susceptible. Since ρ ≥ γ/β, i is increasing over the time period of interest.
Let β > γ and P be fixed. Let ρ_1 = 1 − 1/log log P and ρ_2 = γ/β. We assume P is large enough that ρ_1 > ρ_2, hence t_{ρ_1} < t_{ρ_2}. Note T_{d2} = t_{ρ_2}.
Lemma C.3.
For any ρ ∈ [γ/β, 1], i(t_ρ) ≥ P(1 − ρ)(βρ − γ)/(βρ) − c(0)/2.
Proof of Lemma C.3. Fix ρ. At time t_ρ, the total number of people infected is c(t_ρ) = i(t_ρ) + r(t_ρ) = P(1 − ρ), by definition. At any time t ≤ t_ρ, the rate of increase of i is at least (β(s(t)/P) − γ)/(β(s(t)/P)) ≥ (βρ − γ)/(βρ) of the rate of increase of c. Therefore, i(t_ρ) − i(0) ≥ ( (βρ − γ)/(βρ) )( c(t_ρ) − c(0) ), and so i(t_ρ) ≥ ( (βρ − γ)/(βρ) ) P(1 − ρ) − ( (βρ − γ)/(βρ) ) c(0) + i(0). Using the fact that i(0) ≥ c(0)/2 and rearranging terms gives the desired result.
Lemma C.4.
For t ∈ [t_{ρ_1}, t_{ρ_2}], where ρ_1 > ρ_2 for ρ_1, ρ_2 ∈ [γ/β, 1], t_{ρ_2} − t_{ρ_1} ≤ P(ρ_1 − ρ_2)/( βρ_2 i(t_{ρ_1}) ).
Proof of Lemma C.4. The difference in s between t_{ρ_1} and t_{ρ_2} is s(t_{ρ_1}) − s(t_{ρ_2}) = P(ρ_1 − ρ_2). As a consequence of the mean value theorem, ( s(t_{ρ_1}) − s(t_{ρ_2}) )/( t_{ρ_2} − t_{ρ_1} ) ≥ min_{t ∈ [t_{ρ_1}, t_{ρ_2}]} { −ds/dt }. Using these two expressions,
P(ρ_1 − ρ_2)/( t_{ρ_2} − t_{ρ_1} ) ≥ min{ −ds/dt } = min{ β(s(t)/P) i(t) : t ∈ [t_{ρ_1}, t_{ρ_2}] } ≥ βρ_2 i(t_{ρ_1}). (115)
The desired expression follows from rearranging terms.
Lemma C.5.
For any ρ ≥ max{γ/β, 1/2}, t_ρ ≤ (1/(βρ − γ)) log( ν_2 P / i(0) ), for ν_2 = 2(β − γ)/β. The proof of this lemma follows the exact same procedure as the proof of Proposition C.1.
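Before giving the proof, the exponential comparison idea behind Lemmas C.3-C.5 can be sanity-checked numerically. The sketch below (all parameter values are illustrative choices of ours, not taken from the paper) integrates the deterministic SIR system with a forward-Euler scheme and verifies that the infection curve i(t) stays above the exponential lower bound i(0)e^{(βρ−γ)t} for as long as s(t)/P ≥ ρ.

```python
# Numerical sanity check of the exponential lower bound on i(t).
# Assumptions: illustrative parameters; beta > gamma; rho >= max(gamma/beta, 1/2).
import math

beta, gamma = 0.5, 0.2      # transmission and recovery rates (beta > gamma)
P = 1_000_000.0             # effective population size, P = alpha * N
i0 = 10.0
s, i = P - i0, i0
rho = 0.9                   # susceptible-fraction threshold

dt = 0.001                  # Euler step size
t = 0.0
ok = True
while s / P >= rho:
    # tilde-i(t) = i(0) * exp((beta*rho - gamma) * t) should lower-bound i(t)
    lower = i0 * math.exp((beta * rho - gamma) * t)
    ok = ok and (i >= lower - 1e-9)
    # forward-Euler step of ds/dt = -beta*s*i/P, di/dt = beta*s*i/P - gamma*i
    ds = -beta * s * i / P * dt
    di = (beta * s * i / P - gamma * i) * dt
    s, i, t = s + ds, i + di, t + dt

print(ok)  # expect True: i(t) dominates the exponential lower bound
```

The same scheme, run with the rate β − γ instead of βρ − γ, checks the upper bound used in the proof of Proposition C.1.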
Proof of Lemma C.5.
We proceed in the same way as the proof of Proposition C.1, except that in this case we lower bound s(0) − s(t). We achieve this by defining ĩ to grow more slowly than i, so that it serves as a lower bound. Define ĩ(t) such that ĩ(0) = i(0) and dĩ/dt = (βρ − γ)ĩ, implying
ĩ(t) = i(0) exp{(βρ − γ)t}. (116)
Since dĩ/dt ≤ di/dt when t < t_ρ, we have ĩ(t) ≤ i(t) for all t < t_ρ. In addition, when t < t_ρ, s/P ≥ ρ. Then, for t < t_ρ,
ds/dt = −β(s/P)i ≤ −βρĩ. (117)
Hence we can write
s(t) ≤ s(0) + ∫_0^t −βρĩ(t′) dt′ (118)
= s(0) − βρ i(0) ∫_0^t exp{(βρ − γ)t′} dt′ (119)
= s(0) − ( βρ i(0)/(βρ − γ) )( exp{(βρ − γ)t} − 1 ). (120)
Since s(t_ρ) = ρP, we have ρP ≤ s(0) − ( βρ i(0)/(βρ − γ) )( exp{(βρ − γ)t_ρ} − 1 ). Solving for t_ρ results in
t_ρ ≤ (1/(βρ − γ)) log( ( (βρ − γ)/(βρ i(0)) )( s(0) − ρP ) + 1 ) ≤ (1/(βρ − γ)) log( ν_2 P / i(0) ) (122)
where ν_2 = 2(β − γ)/β, using the fact that ρ ≥ 1/2.
Proof of Proposition C.2.
Using the results from Lemmas C.3-C.5,
t_{ρ_2} = t_{ρ_1} + ( t_{ρ_2} − t_{ρ_1} )
≤ (1/(βρ_1 − γ)) log( ν_2 P / i(0) ) + P(ρ_1 − ρ_2)/( βρ_2 i(t_{ρ_1}) )
≤ (1/(βρ_1 − γ)) log( ν_2 P / i(0) ) + P(ρ_1 − ρ_2)/( βρ_2 ( P(1 − ρ_1)(βρ_1 − γ)/(βρ_1) − c(0)/2 ) )
= (1/(βρ_1 − γ)) log( ν_2 P / i(0) ) + C/(1 − ρ_1),
where
C = ρ_1(ρ_1 − ρ_2) / ( ρ_2(βρ_1 − γ) − βρ_1ρ_2 c(0)/( 2P(1 − ρ_1) ) ). (123)
Note that, as required in the statement, C = O(1). Indeed, with ρ_1 = 1 − 1/log log P, we have c(0)/( P(1 − ρ_1) ) = c(0) log log P / P → 0 as P grows large (recall that c(0) = O(log P)), and so C tends to (1 − ρ_2)/( ρ_2(β − γ) ).
D Sufficient Condition for P[t]
In the practical model, P[t] represents the regions that have enough observations to reliably estimate α at time t. In this section, we justify why the definition (10) is a sufficient condition. Let T_{d2,n} = inf{ t : d²s/dt² > 0 } be the time at which the rate of new infections is highest in the deterministic SIR model.
Proposition D.1. As n → ∞, T_{d,n} ≤ T_{d2,n}.
Proof.
d²s/dt² = −( β/(α_nN_n) )( (ds/dt) i + (di/dt) s ) (124)
= −( β/(α_nN_n) )( −( βs/(α_nN_n) ) i² + ( ( βs/(α_nN_n) ) − γ ) is ) (125)
= −( β/(α_nN_n) ) is ( ( β/(α_nN_n) )( s − i ) − γ ) (126)
= ( β²/(α_nN_n)² ) is ( i − s + (γ/β) α_nN_n ). (127)
From (127), we see that d²s/dt² > 0 if and only if
s < (γ/β) α_nN_n + i. (128)
At t ≤ T_{d,n}, c(t) ≤ (α_nN_n)^{2/3}, which implies i(t) ≤ (α_nN_n)^{2/3} and s(t) ≥ α_nN_n − (α_nN_n)^{2/3}. We assume n is large enough that 2(α_nN_n)^{−1/3} < 1 − γ/β. Then (128) cannot hold:
2(α_nN_n)^{−1/3} < 1 − γ/β (129)
γ/β + (α_nN_n)^{−1/3} < 1 − (α_nN_n)^{−1/3} (130)
(γ/β) α_nN_n + i(t) ≤ (γ/β) α_nN_n + (α_nN_n)^{2/3} < α_nN_n − (α_nN_n)^{2/3} ≤ s(t). (131)
Therefore, T_{d,n} ≤ T_{d2,n}.
E Covariate Details for Practical SIR Model
For observed COVID-19 cases, we use publicly available case data from the ongoing COVID-19 epidemic provided by [27]. X_i[t] consists of static demographic covariates and time-varying mobility features that affect the disease transmission rate.
The dynamic covariates proxy mobility by estimating the daily fraction of people staying at home relative to a region-specific benchmark of activity in early March, before social distancing measures were put in place. We also include a regional binary indicator of the days when the fraction of people staying home exceeds the benchmark by 0.2 or more. These data are provided by SafeGraph, a data company that aggregates anonymized location data from numerous applications in order to provide insights about physical places. To enhance privacy, SafeGraph excludes census block group information if fewer than five devices visited an establishment in a month from a given census block group. Documentation can be found at [28].
The static covariates capture standard demographic features of a region that influence variation in infection rates. These features fall into several categories:
• Fraction of individuals that live in close proximity to, or provide personal care to, relatives in other generations. These covariates are reported by age group by state from survey responses conducted by [29].
• Family size from U.S. Census data, aggregated and cleaned by [30].
• Fraction of the population living in group quarters, including colleges, group homes, military quarters, and nursing homes (U.S. Census via [30]).
• Population-weighted urban status (U.S. Census via [30]).
• Prevalence of comorbidities, such as cardiovascular disease and hypertension ([31]).
• Measures of social vulnerability and poverty (U.S. Census via [30]; [32]).
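The two dynamic mobility covariates described above can be sketched as follows. This is a minimal illustration under assumptions of ours: the `region`, `date`, and `frac_home` column names and the toy data are hypothetical (not SafeGraph's schema), "relative to the benchmark" is encoded as a simple difference, and the early-March benchmark window is our choice; only the 0.2 threshold is stated in the text.

```python
# Hypothetical sketch of the dynamic mobility covariates (schema and data are ours).
import pandas as pd

df = pd.DataFrame({
    "region": ["A"] * 4,
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-04-01", "2020-04-02"]),
    "frac_home": [0.30, 0.32, 0.55, 0.28],  # daily fraction of devices staying home
})

# Region-specific benchmark: average staying-home fraction in early March,
# before social distancing measures were put in place (window is an assumption).
benchmark = (
    df[df["date"] < "2020-03-08"].groupby("region")["frac_home"].mean().rename("benchmark")
)
df = df.join(benchmark, on="region")

# Covariate 1: staying-home fraction relative to the region's benchmark.
df["rel_home"] = df["frac_home"] - df["benchmark"]
# Covariate 2: binary indicator of days exceeding the benchmark by 0.2 or more.
df["distancing_flag"] = (df["rel_home"] >= 0.2).astype(int)

print(df[["date", "rel_home", "distancing_flag"]])  # distancing_flag: 0, 0, 1, 0
```

A ratio against the benchmark would be an equally plausible encoding; the difference form is used here because the 0.2 threshold in the text reads most naturally as an additive excess.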