A Bayesian spatio-temporal nowcasting model for public health decision-making and surveillance
David Kline, Ayaz Hyder, Enhao Liu, Michael Rayo, Samuel Malloy, Elisabeth Root
AA Bayesian spatio-temporal nowcasting model forpublic health decision-making and surveillance
David Kline, Ayaz Hyder, Enhao Liu,Michael Rayo, Samuel Malloy, Elisabeth Root
Abstract
As COVID-19 spread through the United States in 2020, states began to set up alert sys-tems to inform policy decisions and serve as risk communication tools for the general public.Many of these systems, like in Ohio, included indicators based on an assessment of trends inreported cases. However, when cases are indexed by date of disease onset, reporting delayscomplicate the interpretation of trends. Despite a foundation of statistical literature to addressthis problem, these methods have not been widely applied in practice. In this paper, we developa Bayesian spatio-temporal nowcasting model for assessing trends in county-level COVID-19cases in Ohio. We compare the performance of our model to the current approach used in Ohioand the approach that was recommended by the Centers for Disease Control and Prevention.We demonstrate gains in performance while still retaining interpretability using our model. Inaddition, we are able to fully account for uncertainty in both the time series of cases and in thereporting process. While we cannot eliminate all of the uncertainty in public health surveil-lance and subsequent decision-making, we must use approaches that embrace these challengesand deliver more accurate and honest assessments to policymakers.
Keywords:
Bayesian hierarchical modeling, COVID-19, reporting lag, spatial analysis,surveillance
Abbreviations:
COVID-19 - Coronavirus Disease 2019, OPHAS - Ohio Public HealthAlert System a r X i v : . [ s t a t . A P ] F e b he first cases of SARS-CoV-2 in the United States were reported in early March [1], thoughrecent phylogenetic evidence suggests the first introductions occurred in January 2020 [2, 3, 4].As COVID-19 spread throughout the country, states began to set up risk alert systems to supportdata-driven decision-making, improve government accountability, and communicate health risksto the public [5]. The goal of such systems is to provide clear and consistent messaging around thecurrent state of the COVID-19 pandemic and help people adopt protective behaviors while policy-makers implement appropriate structural changes to mitigate spread. Risk or public health alertsystems typically develop a series of indicators which use various sources of surveillance data [5].In some states, these systems were linked to specific policy actions [6] while in others they servemore as a risk communication tool to inform local health departments and the general public [7].In most systems, several key indicators are tied to the reporting of confirmed COVID-19 cases andtheir onset date of illness (i.e., the date an individual first began to have symptoms) [8]. However,chronic delays in outbreak investigation and case reporting have led to challenges in estimatingcase-based indicators and communicating the situation in a location in near real-time.Issues related to reporting lag or reporting delay are not a new challenge in public healthsurveillance [9, 10, 11]. It is quite common for reporting in infectious disease and vital statis-tics systems to not occur instantaneously with the onset or occurrence of the event of interest. Forinfectious diseases, this delay can be due to: 1) a prolonged interval between the time an individualrecognizes symptoms and is able to seek care and receive confirmatory testing, 2) administrativebacklogs and delays in the acquisition, processing, and ultimate reporting of information, and 3)the length of time necessary to conduct a full case investigation. However, particularly when facinga fast-moving epidemic, important decisions need to be made in real-time despite the fact that themost recent information is likely incomplete. This added uncertainty can reduce the confidenceof both policymakers and the public in the public health decision-making process. Methodologyis needed to help provide a clearer picture to decision-makers in the face of the uncertainty fromdelays in reporting.To address this issue and build on the foundational methodology [9, 10, 11], a relatively recent2iterature around “nowcasting" has emerged for delayed reporting. In contrast to forecasting whichfocuses on estimating what could happen in the future, nowcasting focuses on estimating what hasalready happened but has not yet been reported. Nowcasting leverages historical patterns in re-porting and trajectories of the disease outcome to estimate current counts given partially reportedvalues. To enhance model flexibility and interpretability, recent work [12, 13, 14] has extendedprior work for nowcasting time series [15] and aberration detection [16] within a Bayesian frame-work. This work has been applied to estimate COVID-19 deaths in regions of the United Kingdom[17] and to incorporate spatial dependence [18, 19]. In addition, simulation modeling approacheshave also been used for nowcasting [20, 21, 22]. In contrast to much of the current epidemiologi-cal work that relies on the specification of splines to capture trends, Bayesian structural time seriescan be specified as hierarchical autoregressive processes [23]. Given the link between autoregres-sive processes and infectious disease dynamics, we propose a spatial extension of the Bayesianstructural time series model to nowcast county-level counts of confirmed COVID-19 cases in Ohiowhile accounting for reporting delay.Despite prior and current literature on methods for accounting for reporting delay, these meth-ods have not been fully embraced in practice. The purpose of this paper is to highlight the crit-ical need to account for reporting lag and other potential daily reporting patterns when assessingwhether case rates are increasing. This is important because an increase in case rates is an indicatorin many states’ alert systems, including the Ohio Public Health Alert System (OPHAS) [7], andcan also serve as an early warning signal of disease spread. We apply our method to OPHAS Indi-cator 2 which measures "an increasing trend of at least 5 consecutive days in overall cases by onsetdate over the last 3 weeks" [7]. Ohio adopted a 21 day "look-back" period in an attempt to man-ually curtail the effect of reporting delays. We develop an extension of a Bayesian structural timeseries model that incorporates spatial dependence across counties and flexibly captures temporaldynamics with an autoregressive structure. We use case data from earlier in the pandemic that isnow fully reported so the true trends can be determined for each county in Ohio. We then compareindicators based on the method currently used in Ohio, the method suggested by the Centers for3isease Control and Prevention (CDC) [8], and our Bayesian approach. METHODS
Data
We used data on confirmed cases of COVID-19 in the state of Ohio which are captured by the OhioDisease Reporting System and reported publicly [24]. In Ohio, case investigation is done by local,typically county, health departments and entered into the state system. Confirmed cases are definedas individuals who have a positive result on a laboratory molecular amplification test [1] or otherapproved testing methods. For each individual case, the system records the county of residenceand the onset date of illness as determined by case investigators. If onset date is unknown, thesystem records the earliest date associated with the record. Onset date currently provides the indexdate for all reporting and analysis at the state-level in Ohio. The reporting date is defined as thefirst date at which a case appears in the system and is often several days or possibly weeks after theonset date. Thus, when examining case counts by onset date, counts for the most recent days areincomplete because of the delay between onset date and reporting date. The reporting delay canalso be impacted by system strains due to case volume and daily variation in reporting that differby local health department.To explore the impact of reporting patterns on the calculation and subsequent interpretation ofpublic health alert indicators, we retrospectively consider four points in time during the pandemic:June 15, 2020, July 15, 2020, August 15, 2020, and September 15, 2020. At the time of theanalysis, at least one full month had passed since September 15, and we assume that case reportingwas complete through this date. For each date, we examine cases reported by that date and computeindicators related to the trends in case counts. Since the data are completely reported, we cancompare the estimates from the indicators to the true trend observed in the onset cases at that pointin time. This will allow us to examine the performance of each proposed approach for determiningif a county is experiencing an increasing trend of cases.4 olling Average Approach
We refer to the current approach for determining if case rates are increasing used by the OhioDepartment of Health as the rolling average approach [7]. This approach computes a 7 day rollingaverage of case counts, indexed by onset date, for each of the last 21 days. The alert indicatorfor an increasing trend in cases is flagged if there are 5 consecutive days of increasing averages atany point in the 21 day window. That is, the indicator flags if for 5 consecutive days the averageis greater than the average the day before. This approach crudely accounts for daily reportingvariation by averaging across 7 days but makes no attempt to account for reporting lag or any othersources of variation.
Spline Approach
A slightly more sophisticated but still simple approach was recommended by the CDC for detectingrebounds [8] and will be referred to as the spline approach. This approach is similar to the rollingaverage approach described above but fits a spline to the time series of rolling averages. Forconsistency, we used 7 day rolling averages over a 21 day period to align with the temporal windowof interest for the alert system. We fit a cubic spline [8] to each series with 4 knots. By using aspline, we are able to smooth daily and other systematic variation in reporting patterns. Alignedwith the CDC [8], we determine if there is an increasing trend by looking at the fitted values fromthe spline and determining if there are any 5 consecutive day periods where the fit for each day isgreater than the previous day. Like the rolling average approach, uncertainty is not incorporatedinto the decision-making process. Splines were estimated using the mgcv package in R [25].
Model-based Approach
In contrast to the simpler approaches, we explicitly model both the process for new onset cases andthe reporting delay process. We extend the general framework outlined by previous work [12, 18]by using an autoregressive spatial Bayesian structural time series, rather than a spline based model.5hile the spline based model is flexible, it relies on reasonably specifying knots and is not idealfor estimating beyond the range of the observed data. In addition, it can be more challenging toincorporate hierarchical structure when temporal trends may be quite different across locations,which has been the case for COVID-19. Instead, an autoregressive structure retains the ability toflexibly capture spatio-temporal trends while also linking more closely to the dynamics of infec-tious disease [26]. It also allows for added flexibility in specifying a spatially varying reportingdelay process.We follow the general set up outlined in previous work [12, 16]. In Ohio, COVID-19 cases arereported daily so we use a daily time scale. To reduce computation time, we will take a movingwindow approach [14] that considers the past 90 days ( T = 90 ). From April through September2020, 94% of cases were reported within 2 weeks of onset and 98% of cases were reported within30 days. To be conservative, we set a maximum reporting delay time of 30 days following onset( D = 30 ). Outcome Model.
Let Y it be the count of reported cases in county i = 1 , . . . , N with onset date t = 1 , . . . , T . Note that Y it is assumed to be the true total count, which is assumed to be partiallyobserved for time t such that t + D > T . We assume Y it ∼ Poisson ( λ it ) log ( λ it ) = O i + α it + X t η i where O i is an offset of the log population of county i , α it is the latent state of the process, X t isa design vector indicating the day of the week, and η i is the day of the week effect. Note that X t is parameterized using sum to 0 effect coding so α it reflects the average of the process across daysof the week. By using this structure for the model, we are able to remove daily reporting variationfrom the latent state, α it , through X t η i .After removing the daily “seasonal" variation, we focus on the model for the latent state orstructural part of the model. We use a semi-local linear trend model [27] to allow for some degree6f longer term structure while still facilitating a very flexible model. That is for t > , α it = α i ( t − + δ i ( t − + (cid:15) αit where (cid:15) αit iid ∼ N (0 , τ α ) and the initial value at t = 1 is α i ∼ N (0 , . Then for the model fortrend, we let δ i = δ + d i + (cid:15) δi δ it = δ + d i + ρ δ ( δ i ( t − − δ − d i ) + (cid:15) δit where δ is a common statewide trend, d i is a county-specific spatial trend, and ρ δ is an autoregres-sive term. Let (cid:15) δit iid ∼ N (0 , τ δ ) . A benefit to this parameterization is it allows us to separate changesthat are due to white noise ( (cid:15) αit ) from those that are due to more consistent temporal trends ( δ it ). Byusing a stationary model for δ , we are able to provide some structure around a longer term trendwhile retaining flexibility for local deviations in space and time.To account for spatial correlation, we assume the trends in neighboring counties are correlatedand specify an intrinsic conditional autoregressive model. That is, d i | d − i ∼ N (cid:32) w i + (cid:88) j w ij d j , τ d w i + (cid:33) where d − i is the set of counties excluding county i , w ij is an indicator of whether counties i and j are adjacent, w i + = (cid:80) j (cid:54) = i w ij , and τ d is a variance. To ensure a valid process model, we enforcea sum to 0 constraint on the d i [28]. We chose to incorporate spatial dependence in the trend toreflect a belief that cases in a county are likely to change in a similar fashion as cases in neighboringcounties. This choice explicitly aligns with our general surveillance and risk evaluation strategyfor counties where we have implicitly considered trends in neighboring counties when making ourassessments. Another added benefit is that this helps to stabilize estimates for counties with smallpopulations by borrowing strength from neighboring counties.7e also assume county-specific effects of the day of the week. We assume that while variabilityexists between counties, the daily patterns are similar across the state. We assume the followinghierarchical model η i iid ∼ N ( η , τ η I ) where η i is a vector of state average day of the week effects, τ η is a variance, and I is a × identity matrix. This allows each county to have its own daily pattern while borrowing strengthacross all counties in the state as warranted. Reporting Model.
Since we know that Y it is observed with reporting lag, we must specify amodel for the delay. Let Z itd be the count of cases observed in county i with onset date t that areobserved d = 0 , . . . , D days after t . Note that Z itd corresponds to when cases are reported d daysafter onset date t and so is unobserved when t + d > T . We assume Z it | p it , Y it ∼ Multinomial ( p it , Y it ) p it ∼ GD ( α it , β it ) where Z it = ( Z it , . . . , Z itD ) , p it is the vector of proportions of the total Y it reported on eachof the D days, and GD is the generalized Dirichlet distribution. We use a generalized Dirichletdistribution to properly account for potential overdispersion of the p it [12]. This leads to thefollowing conditional distribution: Z itd | Z it ( − d ) , Y it ∼ Beta-Binomial (cid:32) α itd , β itd , Y it − (cid:88) j
One major advantage of a model-based approach is the flexibility to address more complex ques-tions of interest. However, the goal of this paper is to assess the method used to calculate theOPHAS indicator for when cases are increasing in a county. To most closely align with the ques-tion as currently posed by the state of Ohio, we define a true increase in cases as when the numberof cases in the most recent 7 day period is greater than the number of cases two weeks prior.10able 1: Estimated sensitivity and specificity of the rolling average indicator, the spline indicator,and the model-based indicator at 3 different posterior probability cut-points across the 4 datesexamined. Method Sensitivity SpecificityRolling Average 0.20 0.96Spline 0.87 0.48Model-based: >0.9 0.07 1.00Model-based: >0.7 0.46 0.93Model-based: >0.5 0.83 0.60This corresponds to comparing the first week with the last week in the most recent 21 day period.While there are other potential ways to define a true increase, this most closely reflects the currentdefinition used by the state of Ohio.
RESULTS
The results from applying each of the three methods for calculating increasing case rates are shownin Figure 1. There are several general observations that can be made across the four time points.The rolling average indicator generally does a poor job at accurately capturing counties where thecases have increased, and in most counties, there were true increases that went undetected. Thespline indicator tends to make errors in the other direction by incorrectly flagging counties that didnot meet the definition of a true increase. For the model-based approach, we generate a posteriorprobability of an increasing trend and highlight counties in yellow with a probability greater than0.7 and in red those with a probability greater than 0.9.In addition to visually examining the results, we calculated sensitivity and specificity for eachapproach in Table 1. The rolling average approach currently in use has a very low sensitivity of0.20 and so is not successfully identifying counties with increasing trends. The spline approachhas a much higher sensitivity of 0.87 but at the cost of a specificity of 0.48. Three cut pointsare shown for the model-based posterior probabilities. As expected, the higher thresholds exhibitexcellent specificity but lower sensitivity since it reflects stronger evidence of an increase. Using acut-point of 0.5, which reflects that the trend is more likely increasing than decreasing, we estimate11 a) June 15, 2020(b) July 15, 2020(c) August 15, 2020(d) September 15, 2020
Figure 1: Comparison between the rolling average indicator, the spline indicator, the true observedindicator of an increase, and the model-based posterior probability across 4 time points during thepandemic. For the model-based probabilities, counties outlined in red have a probability greaterthan 0.9 and outlined in yellow have a probability greater than 0.7.12 sensitivity of 0.83 and a specificity of 0.6, which seems to most reasonably balance false positivesand false negatives among the approaches considered.The model-based approach also provides a rich set of additional results that can provide usefulinsights. Typically, the main goal of these models is to nowcast case counts. In Figure 2, we shownowcast estimates with their 90% credible interval in black and the true counts in red for an urbanand rural county. Averaging across the 4 time points, the 90% credible interval coverage was 0.96over the 30 day period with incomplete reporting. The coverage was 0.92 in the most recent 7 dayswhich have the most incomplete reporting. Thus, our model performs as expected for nowcastingcases. In Figure 2, we also show time series plots of the latent state, which removes the dailyseasonality, and the trend. The trend can also be viewed as the derivative of the latent state curveso when it is greater than 0, it indicates increasing case counts.
DISCUSSION
We applied three approaches for assessing increasing trends in cases to completely observed dataat four time points during the COVID-19 pandemic. When assessments are linked to onset date,case reporting is subject to reporting lag or delay. We illustrate that the simple approach currentlyused in OPHAS does not perform well as it fails to account for lag and other variation in reporting.The spline approach outlined by the CDC is more sensitive as it smooths over daily reportingvariation but also fails to account for lag. In contrast, the model-based approach accommodateslag, daily variation, and spatio-temporal dependence. The model-based approach can also directlysummarize observed evidence of increasing trends and the associated uncertainty through posteriordistributions. This results in a better trade-off between sensitivity and specificity and can allow forprioritization of areas where the evidence of an increase is strongest.We note several key advantages to the model-based approach. First, the Bayesian approachallows us to use calculated posterior summaries to directly communicate uncertainty. Public healthofficials are constantly considering trade offs between different policy options - e.g., stay-at-home13 a) Franklin County - urban(b) Harrison County - rural
Figure 2: Case nowcast projections and time series model components for an urban and ruralcounty on September 15, 2020. The left panel shows the posterior mean number of cases in bold,90% credible interval in black, and the true number of cases in red. The green vertical line indicatesthe divide between complete and incomplete reporting. The center panel shows the posterior meanand 90% credible interval of the latent state which is the mean process on the log scale with dailyvariation removed. The right panel shows the posterior mean and 90% credible interval for thedaily change with a red reference line at 0. 14rders vs. economic impacts. Specific policy responses may only be warranted when evidence foran increase in COVID-19 cases is very strong and models indicate a very high level of certainty.Since the posterior probability reflects the probability of an increasing trend given the observeddata, this quantity can be used to directly address the policy question of interest and providesan indication of how strong the evidence is in each county. Unlike the spline or 7-day rollingaverage approaches that return a binary decision, the ability of the model-based approach to conveyadditional meaning through continuous estimates is a clear advantage that can improve decision-making [30, 31]. Second, by accounting for reporting delays and fully exploiting partially reportedcounts, the Bayesian approach can be more responsive to changing trends and provide earlierwarning of changes in trends. Finally, the output from the Bayesian models (shown in Figure2) provide important additional information that can be used by surveillance teams to understandtrends over time. These results do require a team of epidemiologists to review the data, but stillprovide more information than the spline or 7-day moving average methods.When responding to a pandemic, it is important that the public health and policy responseis guided by the best available information. Often even the best information can be incompleteand uncertain. However, statistical models have been developed to overcome these issues and aidin characterizing and quantifying uncertainty. These models are not as simple as the approachcurrently used in Ohio, and this is one limitation of this method. Risk alert systems should betransparent and easy to understand. Complex modeling approaches are difficult to explain to thegeneral public and can lead to mistrust in the data and, by extension, the system as a whole. How-ever, with proper preparation, the model output can be summarized to simply communicate thecore messages, while leaving much of the complexity and technical details to the experts imple-menting the model. Additionally, the Bayesian models provide a wide range of information thatcan be used internally by epidemiologists and other public health data scientists to directly addressimportant policy questions. Given the clear improvements our Bayesian models offer, it is impera-tive that we take advantage of these methodological advances to better serve the public and informthe distribution of limited resources. 15n conclusion, we have illustrated shortcomings in using simple approaches for public healthdecision-making. We have also illustrated how more sophisticated statistical models can accountfor the real-world complexities associated with surveillance data. Despite the added complexity,the output from these models can be summarized in a relatively simple and concise form thatstill appropriately reflects uncertainty. While we cannot eliminate all of the uncertainty in publichealth surveillance and decision-making, we must use approaches that embrace these challengesand deliver more accurate and honest assessments to policymakers.
References [1] Centers for Disease Control and Prevention . Coronavirus Disease 2019 (COVID-19)2020 Interim Case Definition, Approved August 5, 2020 2020. .[2] Wu S.L., Mertens A.N., Crider Y.S., et al. Substantial underestimation of SARS-CoV-2 in-fection in the United States
Nature Communications.
Science.
Cell. https://preventepidemics.org/wp-content/uploads/2020/05/STAYING-ALERT-Navigating-COVID-19-Risk-Toward-a-New-Normal_final.pdf .[6] Utah Department of Health . Phased Guidelines for the General Public and Busi-nesses to Maximize Public Health and Economic Reactivation 2020. https: /coronavirus-download.utah.gov/Health/Phased%20Health%20Guidelines%20V4.0.1.pdf .[7] Ohio Department of Health . Summary of Alert Indicators 2020. https://coronavirus.ohio.gov/static/OPHASM/Summary-Alert-Indicators.pdf .[8] Centers for Disease Control and Prevention . CDC Activities and Initiatives Support-ing the COVID-19 Response and the President’s Plan for Opening America Up Again2020. .[9] Brookmeyer R., Damiano A.. Statistical methods for short-term projections of AIDS inci-dence Statistics in Medicine.
Journal of the American Statistical Association.
Canadian Journal of Statistics.
Biometrics.
Epidemiology.
PLOS Computa-tional Biology.
Biometrics.
Biometrical Journal. medRxiv. arXiv.
International Journal of Health Geographics.
Nature Communications.
PLOS Computational Biology.
The Lancet.
Int. J. Math.Model. Numer. Optimisation. https://coronavirus.ohio.gov/wps/portal/gov/covid-19/dashboards/overview .1825] Wood S.N.
Generalized Additive Models: An Introduction with R . Chapman andHall/CRC2 ed. 2017.[26] Viboud Cécile, Bjørnstad Ottar N., Smith David L., Simonsen Lone, Miller Mark A., GrenfellBryan T.. Synchrony, Waves, and Spatial Hierarchies in the Spread of Influenza
Science.
Ann. Appl. Stat..
Hierarchical modeling and analysisfor spatial data . Boca Raton, Fla.: Chapman & Hall/CRC 2004.[29] de Valpine P., Turek D., Paciorek C.J., Anderson-Bergman C., Temple Lang D., Bodik R..Programming with models: writing statistical algorithms for general model structures withNIMBLE
Journal of Computational and Graphical Statistics.
Human Factors: the Journal of the Human Factors and Ergonomics Society.