Gaussian Process Nowcasting: Application to COVID-19 Mortality Reporting
Iwona Hawryluk, Henrique Hoeltgebaum, Swapnil Mishra, Xenia Miscouridou, Ricardo P Schnekenberg, Charles Whittaker, Michaela Vollmer, Seth Flaxman, Samir Bhatt, Thomas A Mellan
Department of Infectious Disease Epidemiology, School of Public Health, Imperial College London, UK
Department of Mathematics, Imperial College London, UK
Nuffield Department of Clinical Neuroscience, University of Oxford, UK
* Contributed equally
Abstract
Updating observations of a signal due to delays in the measurement process is a common problem in signal processing, with prominent examples in a wide range of fields. An important instance of this problem is the nowcasting of COVID-19 mortality: given a stream of reported counts of daily deaths, can we correct for the delays in reporting to paint an accurate picture of the present, with uncertainty? Without this correction, raw data will often mislead by suggesting an improving situation. We present a flexible approach using a latent Gaussian process that is capable of describing the changing auto-correlation structure present in the reporting time-delay surface.[a] This approach also yields robust estimates of uncertainty for the nowcasted numbers of deaths. We test assumptions in model specification, such as the choice of kernel or hyperpriors, and evaluate model performance on a challenging real dataset from Brazil. Our experiments show that Gaussian process nowcasting performs favourably against both comparable methods and a small sample of expert human predictions. Our approach has substantial practical utility in disease modelling: by applying it to COVID-19 mortality data from Brazil, where reporting delays are large, we can make informative predictions of important epidemiological quantities such as the current effective reproduction number.

[a] Code is available at https://github.com/ihawryluk/GP_nowcasting
* [email protected]
† [email protected]
‡ [email protected]

1 INTRODUCTION

In many real-world settings, current observations from a noisy signal can be systematically biased, with these biases only being corrected after subsequent updates create more complete data. Often, these updates occur much later due to data processing or reporting delays. Not accounting for these delays would result in biased predictions, while waiting for updates would result in a lack of timely estimates.
The need for timely estimates to predict the present is colloquially known as nowcasting, and despite its importance to a wide range of fields such as actuarial science, economics, and epidemiology [Kaminsky, 1987, Lawless, 1994, Bastos et al., 2019, McGough et al., 2020], relatively little literature focuses exclusively on the problem. Nowcasting, as defined by Banbura et al. [2010] at the European Central Bank, is the process of predicting the present, the very recent past, and the very near future using time series data known to be incomplete. An example from economics is using monthly data to nowcast the current state of important indicators for an economy such as GDP or income. More broadly, nowcasting is relevant not only for scenarios where the data are incomplete, but also where the data comprise a biased subsample that will be updated retrospectively, following lengthy delays.

In epidemiology, nowcasting is required due to delays in reporting arising from limitations in testing capacity, data curation, and the requirement for pseudonymisation of patient data [Bastos et al., 2019]. These delays are further compounded by the noise inherent in such data due to limited sampling (typically only a subset of the population is sampled). Throughout this paper, we specifically focus on delays in the reporting of deaths. An individual dies of a disease on a given day, but the delay between this event and the death being reported (and appearing in the dataset) can be substantial, for the reasons noted above. These reporting delays mask the true current state of the epidemic, and have material consequences for our understanding of both the present and future evolution of the epidemic. For example, estimation of key epidemiological quantities such as the effective reproduction number (R_t) would be systematically biased. Contemporary, real-time and unbiased estimates are necessary for effective public health planning and policy.

Figure 1: A) Reported daily hospital deaths are censored at recent times due to reporting delays. This can be seen by comparing the raw data with a ground truth from two months in the future, when the records have been backdated. B-C) The effective reproduction number R_t for SARS-CoV-2 infections in Brazil from 30-Jun-2020 to 23-Nov-2020, estimated using deaths from the raw reported data released on 23-Nov-2020, and using a backdated ground truth based on data released on 08-Feb-2021. D) R_t estimates based on nowcasted mortality data. Whereas the raw data result in misleading estimates of R_t, with the estimated R_t < 1, applying nowcasting to the death counts yields a picture of the epidemic closer to the truth.

In this paper we propose a nowcasting framework based on latent Gaussian processes (GPs). This methodology is used to address the specific problem of delayed reporting of the true incidence of deaths due to COVID-19 in Brazil.

Previous methods for nowcasting exist in several different contexts. Bańbura and Modugno [2014] propose a maximum likelihood approach with a dynamic factor model to predict GDP. Shi et al. [2015] use a deep learning approach based on LSTMs to nowcast rainfall intensity. Codeco et al. [2018] provide a framework to gather epidemiological information and correct for delays in reporting in Brazilian data. Bastos et al. [2019] present a Bayesian hierarchical model for nowcasting applied to data on dengue fever and severe acute respiratory infection cases. McGough et al. [2020] propose a Bayesian nowcasting approach that produces accurate estimates capturing the time evolution of the epidemic curve. Specifically for COVID-19, Bayesian nowcasting approaches have been used to correct for reporting delays in Bavaria by Günther et al. [2020]. The challenges in estimating reporting delays are further discussed in Seaman and De Angelis [2020]. Finally, the problem and background context of reporting delays in Brazil, with corrected data, are further explained in Bastos et al. [2020] and Villela [2020].

Our methods build upon and generalise the NobBS (Nowcasting by Bayesian Smoothing) method originally proposed by McGough et al. [2020]. NobBS is a Bayesian method that produces smooth and accurate nowcasted estimates for multiple diseases.
NobBS allows for both uncertainty in the delay distribution and the evolution of the epidemic curve. While an effective method, NobBS has several limitations, such as an inability to pick up fast-occurring changes in the delay distribution, which we overcome in this paper. The extensions we present result in comparable performance for COVID-19 mortality surveillance in Brazil, but provide a better fit to the dynamic delay distribution.
The problem tackled in this paper is conceptually illustrated in Figure 1. The black points are the data available to us at a given time, and the red points the ground truth that only becomes available much later. The discrepancy between the presently available data and the underlying ground truth grows markedly as we approach the present, a distinguishing characteristic of reporting delays. Alongside this, Figure 1 also shows three estimates of the effective reproduction number R_t (defined as the average number of infections an infected individual will go on to cause), obtained using a Bayesian hierarchical renewal-type model [Flaxman et al., 2020, Mellan et al., 2020, Mishra et al., 2020]. Understanding this epidemiological quantity is vital: R_t > 1 indicates a growing epidemic, while R_t < 1 indicates a declining one. Figure 1B shows estimates of R_t derived from the raw data, while Figures 1C and 1D show estimates of R_t derived from the ground truth data and our nowcasting approach, respectively. These plots show that not correcting for delays can lead to a fundamentally different picture of the current epidemic state. Delays in death reporting lead to an underestimation of the true number of deaths in the observed data; the result is a suggestion of a declining epidemic, despite the fact that the epidemic is actually growing.

In this paper we focus on Brazilian death data from the publicly available hospitalisation database, which includes deaths with both confirmed and suspected COVID-19 diagnostic status [Ministério da Saúde, 2020]. Our central premise is that using these daily death data alone results in policy decisions being made based on false statistics and trends [Villela, 2020]. To facilitate well-informed policy making based on unreliable data streams, we propose and implement a nowcasting method using latent Gaussian processes. These GPs are capable of capturing the complex correlation structure in delayed data and present an effective means to correct for the reporting delays.
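To give a concrete sense of the renewal-type estimate of R_t discussed here, a crude point-estimate version of the renewal equation can be sketched as follows. This is only an illustrative simplification: the cited models are hierarchical Bayesian renewal models, and the incidence series and one-step serial interval below are hypothetical.

```python
import numpy as np

def renewal_rt(incidence, serial_interval):
    """Crude R_t from the renewal equation:
    I_t ~= R_t * sum_s I_{t-s} w_s, so R_t ~= I_t / infection pressure.
    Illustrative only, not the hierarchical models cited in the text."""
    incidence = np.asarray(incidence, dtype=float)
    w = np.asarray(serial_interval, dtype=float)
    w = w / w.sum()  # normalise the serial-interval distribution
    rt = np.full(len(incidence), np.nan)
    for t in range(1, len(incidence)):
        # infection pressure at time t from past incidence, most recent first
        past = incidence[max(0, t - len(w)):t][::-1]
        pressure = np.sum(past * w[:len(past)])
        if pressure > 0:
            rt[t] = incidence[t] / pressure
    return rt

# Hypothetical example: exponentially growing incidence with a
# one-step serial interval gives R_t equal to the growth factor.
growth = 1.2
cases = 100 * growth ** np.arange(10)
rt = renewal_rt(cases, serial_interval=[1.0])
```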
We use this corrected death data to calculate the effective reproduction number R_t using the raw retrospectively observed data, the nowcasted data, and the ground truth updated dataset (Figure 1).

Our contributions are the following:

• We provide a new, flexible and accurate way to correct for delays in reporting. Our framework solves the nowcasting problem using latent GPs, and provides realistic estimates of the deaths today given incomplete data. Our approach closely predicts the non-observed/missing values and simultaneously learns the underlying (latent) data-generating mechanisms of the delays.

• We compare our approach to an established alternative method (NobBS) and, in a novel contribution, also provide a comparison to a small panel of human experts in infectious disease epidemiology. Domain knowledge is of primary importance for such applications, and is frequently the primary approach taken to interpret data. In generating estimates that improve over both existing computational methods and human experts, we demonstrate the utility of our approach.

• An important contribution of this work is the results and estimates provided. Implementing our approach enables the generation of more accurate estimates of the reproduction number and, in turn, a better understanding of the evolution of the COVID-19 epidemic in Brazil. Our framework is implemented in the easy-to-use probabilistic program PyStan, and therefore facilitates use in low- and middle-income settings with limited technical expertise.

The structure of the paper is as follows: in section 3 we briefly introduce Gaussian processes and describe the latent GP nowcasting models with several variants. In section 4 we describe the data and perform retrospective tests to evaluate the accuracy of the new models and compare them with a sample of human expert predictions. Finally, we discuss the advantages and limitations of the GP nowcasting framework in section 5.
Let n_t denote the response variable of interest that needs to be nowcasted at time t. In this paper n_t represents the reported COVID-19 mortality. The mortality observations, in general, consist of measurements from an online data source, subject to distributed observation delays. The central task of nowcasting approaches is to identify a regular time-delay structure, and to use this to estimate n_t at a time when it has only been partially observed. The bias that nowcasting identifies and corrects in this scenario is the additive decomposition of the observable over the reporting delay d. That is, the true signal at a given time t is the sum over all the delayed partial observations for that time:

n_t = \sum_d n_{t,d}. (1)

The intuition behind this formulation is that the 'true' deaths that occurred at time t are distributed over various delays d due to the delays in reporting them.

A visual example of partial observation at recent times is the right-censored epidemiological data shown in Figure 2A. For all data releases, we observe precipitous declines in contemporary data, which are then subsequently revised upwards as the data become more complete. In the COVID-19 context this occurs due to time lags in registering and reporting death certificates [Villela, 2020]. Figure 2B shows that most deaths are reported to near completeness after around 5 weeks, and 90% are reported within 10 weeks. The splitting of the data by delay index, to form the 2D array n_{t,d} in time and delay, called a reporting triangle, is shown in Figure 2C. The figure also highlights that both the temporal and the delay dimensions have structured correlations.
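The reporting-triangle decomposition of Eqn. 1, and the censoring pattern it induces, can be illustrated with a toy example. The counts below are hypothetical; the point is only the indexing of n_{t,d} and which entries are observable at the current time.

```python
import numpy as np

# Hypothetical reporting triangle n[t, d]: rows are event weeks,
# columns are reporting delays in weeks (Eqn. 1: n_t = sum_d n_{t,d}).
T, D = 5, 3
rng = np.random.default_rng(0)
n = rng.poisson(lam=50, size=(T, D)).astype(float)

# At "now" (time T-1), only entries with delay d <= (T-1) - t have been
# reported; the lower-right triangle of the array is still missing.
observed = n.copy()
for t in range(T):
    for d in range(D):
        if d > (T - 1) - t:
            observed[t, d] = np.nan  # not yet reported

# The true weekly total is the row sum over all delays; a nowcast must
# fill in the NaNs before this sum is meaningful at recent times t.
true_totals = n.sum(axis=1)
known_part = np.nansum(observed, axis=1)
```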
The lower-triangle part of this 2D array is missing, since at any time T only the numbers of deaths reported with delay d ≤ T − t are known for each epidemiological week t.

The representation of the data by time and delay, rather than time and reporting date, induces a regular structure, one that is auto-correlated and approximately monotonically decreasing in delay. This relatively simple structure makes the problem amenable to statistical modelling. The lower triangle of the n_{t,d} matrix can be predicted with the model, and therefore an estimate of the true signal is available for any time up to the current time via Eqn. 1, by summing over the delays. This is a common theme from which variations of nowcasting branch out.

Figure 2: A) Daily COVID-19 deaths in Brazil as reported in releases of data between July-2020 and Feb-2021. Each line represents a single release. B) Total number of deaths reported per reporting delay in weeks; over 90% of deaths are reported with delay ≤ 10 weeks. C) The reporting triangle n_{t,d} by epidemiological week and reporting delay.

To model the entries n_{t,d} of the reporting triangle, we can use a Poisson or a negative-binomial likelihood for overdispersed data:

n_{t,d} ∼ NB(λ_{t,d}, r). (2)

In the negative-binomial case, the dispersion parameter r is a hyperparameter that can be learnt or given an informative prior based on the problem.
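The mean-and-dispersion parameterisation of the negative binomial in Eqn. 2 can be sketched via the standard Gamma-Poisson mixture. This is a generic illustration of the likelihood family, not the paper's Stan implementation, and the values of λ and r below are hypothetical.

```python
import numpy as np

def sample_negative_binomial(mean, dispersion, size, rng):
    """Draw NB(mean, r) counts via the Gamma-Poisson mixture:
    rate ~ Gamma(shape=r, scale=mean/r), count ~ Poisson(rate).
    Variance is mean + mean**2 / r, so small r means more overdispersion."""
    rate = rng.gamma(shape=dispersion, scale=mean / dispersion, size=size)
    return rng.poisson(rate)

rng = np.random.default_rng(1)
lam, r = 40.0, 5.0
draws = sample_negative_binomial(lam, r, size=200_000, rng=rng)
empirical_mean = draws.mean()
empirical_var = draws.var()
# Theoretical variance under this mean/dispersion parameterisation:
theory_var = lam + lam**2 / r
```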
The latter approach is common among established Bayesian nowcasting methods [Bastos et al., 2019, Günther et al., 2020, McGough et al., 2020]. The mean of the negative binomial, λ_{t,d}, is often modelled as a random walk [Bastos et al., 2019] or as an auto-regressive process [McGough et al., 2020] along the time dimension, jointly independent with a learnt vector of delays. This approach has been successful for dengue and influenza surveillance [Codeco et al., 2018, Bastos et al., 2019], but has limitations in terms of the generality of the time-delay covariance structure, which can become apparent in more dynamic nowcasting scenarios, such as an evolving epidemic with changing delay distributions. Such issues can be minimised by tuning the window over which the static delay vector is estimated, or by manually adding cross-term covariates. Here we employ Gaussian processes as a generic, flexible alternative to model arbitrarily structured λ_{t,d}. The details are set out in the following section.

The introductory model we consider consists of a latent GP with a 1D kernel. In general terms, GPs are a class of Bayesian non-parametric models that define a prior over functions. They are a powerful tool in machine learning for learning complex functions, with applications in both regression and classification problems [Rasmussen and Williams, 2006, Wilson and Adams, 2013]. In recent years GPs have gained popularity in statistics and machine learning due to their flexibility and excellent performance on many spatial and spatiotemporal problems [Wilson and Adams, 2013, Flaxman et al., 2015]. The covariance function, or kernel, together with the mean function completely defines a GP. The mean function is the base function around which all realisations of the GP are distributed. The covariance kernel is a crucial component of the Gaussian process, as it describes the covariance of the Gaussian process random variables, i.e. how similar two points are. Therefore, the kernel defines the shape of the distribution and which types of functions are more probable.

One of the most popular choices of covariance kernel, and the one we use to introduce the model, is the squared exponential kernel, k_SE, with entries defined by a covariance function k_SE(·,·) such that

k_SE(t_i, t_j) = α² exp( − ||t_i − t_j||² / ρ² ). (3)

The parameter α defines the kernel's variance scale, and ρ is a lengthscale parameter that specifies how nearsighted the correlation between pairs of time points (t_i) is. The kernel results in a prior over a set of functions describing λ_{t,d}, the mean of the statistical model in Eqn. 2. This is modelled as a zero-mean latent Gaussian process in log space,

log(λ_{t,d}) ∼ GP(0, k_SE). (4)

Due to weak identifiability [Rasmussen and Williams, 2006], a strategy to identify the hyperparameters ρ and α is to fix the lengthscale ρ to the maximum delay time considered in the nowcasting problem, and learn the scale parameter α. Markov chain Monte Carlo (MCMC) is used to generate posterior summaries for arbitrary (non-normal) latent Gaussian processes.

3.3 GENERALISED MODEL

3.3.1 Additive Kernel Model

The basic model introduced above can be extended to provide a more expressive description of the data. The purpose of this is to be able to describe the complex structure in n_{t,d}. Using the compositional kernel approach [Duvenaud et al., 2013, Wilson and Adams, 2013, Wilson et al., 2016], we can create a new additive kernel over multiple lengthscales, indexed by s,

k_add = \sum_s k_s,    log(λ_{t,d}) ∼ GP(0, k_add). (5)

The lengthscale hyperparameters ρ_s are fixed or given strongly informative priors, while each α_s is learnt.
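The squared-exponential latent GP of Eqns. 3-4 can be sketched numerically: build the Gram matrix, take a Cholesky factor, and draw one realisation of log λ. This is a hypothetical illustration (note the common 2ρ² convention in the exponent, which may differ slightly from the paper's parameterisation), not the paper's Stan implementation.

```python
import numpy as np

def squared_exponential(x, alpha, rho):
    """SE Gram matrix: k(x_i, x_j) = alpha^2 exp(-|x_i - x_j|^2 / (2 rho^2)).
    alpha sets the variance scale, rho the correlation lengthscale."""
    diff = x[:, None] - x[None, :]
    return alpha**2 * np.exp(-diff**2 / (2 * rho**2))

rng = np.random.default_rng(2)
t = np.arange(20, dtype=float)        # weekly time index
K = squared_exponential(t, alpha=0.5, rho=10.0)
jitter = 1e-6 * np.eye(len(t))        # numerical stabiliser for Cholesky
L = np.linalg.cholesky(K + jitter)

# One draw of log(lambda_t) ~ GP(0, k_SE); exponentiating gives a
# positive mean for the negative-binomial observation model.
log_lam = L @ rng.standard_normal(len(t))
lam = np.exp(log_lam)
```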
In the simplest case we consider a kernel with two lengthscale contributions, capturing short- and long-range correlation structure:

k_add(t_i, t_j) = k_long(t_i, t_j) + k_short(t_i, t_j) + σ² δ_ij, (6)

plus a regularising term with a Kronecker delta function ensuring σ² Gaussian noise is only added when i = j. The choice of kernel confers bias that can result in better generalisation. The logic of this kernel is to split the covariance into two components: (a) a smooth long-range component, used to extrapolate the trend into the unknown part of the reporting triangle, where large distances from the observed points exist; and (b) a component describing variation in n_{t,d} over shorter lengthscales. Additionally, the separation of kernels provides a generic method to describe more complex data-generating processes; for example, the long-range kernel can be squared exponential, while the short-range one can be a less smooth type with a different power spectrum, such as a Matérn (1/2). This can be used to create a general statistical model for all of n_{t,d}. Furthermore, in this regard the δ contribution provides a source of regularisation, which may be useful if there is reason to believe the n_{t,d} values are subject to variation beyond the scope of the basic nowcasting framework. For example, if a death can switch category from a COVID-19-suspected death to a cause other than COVID-19 in later data releases, this could result in a negative n_{t,d} count, which can be modelled as an error to be regularised.

A further modification that can be applied if the time-delay surface n_{t,d} has a complex structure is to split the data into two components and model each with separate kernels. For example, if delays of 0 or 1 weeks account for a large fraction of total counts, they can be considered separately from delays > 1. This approach is considered later in section 4.2. A more generic formulation, however, is to consider a 2D kernel to fully account for the time-delay correlation structure, which is introduced below.
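The additive construction of Eqn. 6 (smooth SE trend plus rough Matérn-1/2 variation plus white noise), and the Kronecker combination of per-dimension Gram matrices used for the separable time-delay kernel, can be sketched as follows. All hyperparameter values are hypothetical.

```python
import numpy as np

def se_kernel(x, alpha, rho):
    # Squared-exponential kernel: smooth, long-range trend component.
    d = np.abs(x[:, None] - x[None, :])
    return alpha**2 * np.exp(-d**2 / (2 * rho**2))

def matern12_kernel(x, alpha, rho):
    # Matern-1/2 (exponential) kernel: rougher sample paths than SE.
    d = np.abs(x[:, None] - x[None, :])
    return alpha**2 * np.exp(-d / rho)

t = np.arange(15, dtype=float)
# Additive kernel in the spirit of Eqn. 6: long-range SE + short-range
# Matern-1/2 + white-noise regulariser (the Kronecker-delta term).
sigma = 0.1
K_add = (se_kernel(t, alpha=1.0, rho=8.0)
         + matern12_kernel(t, alpha=0.3, rho=1.0)
         + sigma**2 * np.eye(len(t)))

# A separable time-delay kernel combines per-dimension Gram matrices
# with a Kronecker product: K_{t,d} = K_t (x) K_d.
d_idx = np.arange(4, dtype=float)
K_t = se_kernel(t, alpha=1.0, rho=8.0)
K_d = matern12_kernel(d_idx, alpha=1.0, rho=2.0)
K_2d = np.kron(K_t, K_d)
```

The white-noise term also guarantees a strictly positive-definite matrix, which stabilises Cholesky factorisation in practice.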
3.3.2 2D Kernel Model

As a further expansion of the approach described above, we introduce a separable two-dimensional kernel over time and delay, k((t_i, t_j), (d_i, d_j)) = k_t(t_i, t_j) k_d(d_i, d_j). Separable kernels can be efficiently implemented using Kronecker product algebra, as described in Flaxman et al. [2015]. Specifically, individual Gram matrices for time and delay are combined using the Kronecker product such that

K_{t,d} = K_t ⊗ K_d. (7)

As before, the kernel can be given an additive structure over multiple lengthscales. For example,

k_long(t, d) = k_t^long k_d^long,   k_short(t, d) = k_t^short k_d^short,
log(λ_{t,d}) ∼ GP(0, k_long(t, d) + k_short(t, d)). (8)

This approach captures the relationship between t and d. In both the 1D and 2D kernel approaches it is possible to perform partial pooling of the model's parameters by combining two or more spatial locations with similar features, for example neighbouring states, if limited data are available. In practice, however, we found limited gain in doing so, as our approach works well with relatively few observations.

4.1 DATA

The number of deaths per date has been extracted from the Brazilian Ministry of Health's Sistema de Informação de Vigilância Epidemiológica da Gripe (SIVEP-Gripe) database [Ministério da Saúde, 2020]. SIVEP-Gripe is a large publicly available database providing anonymised patient-level records of all individuals who died or were hospitalised with suspected or confirmed COVID-19 [Bastos et al., 2020, de Souza et al., 2020, Niquini et al., 2020]. New data have been released regularly online, on a weekly basis during the second half of 2020 considered here. In this study, we extracted all SIVEP-Gripe data releases from 7-Jul-2020 to 8-Feb-2021. We consider all cases of suspected or confirmed COVID-19 (classes 4 and 5).

There are a number of potential sources of error in the reported SIVEP data.
One is under-ascertainment: systematic biases which are beyond the scope of correction by this nowcasting methodology. Another source of error is delayed classification. After the initial input of a patient's data into the database (usually at the time of hospitalisation), the entry might later be updated with clinical and laboratory data, including confirmatory COVID-19 testing. Further updates will include the outcome and its date (i.e. date of death or date of hospital discharge), and cases receive a final classification. Cases can be classified as COVID-19 (classi_fin=5), other causes (classi_fin=1-3) or unknown (classi_fin=4). Despite being described as a "final classification", reclassification does occur, and it is especially common for unknown cases to be reclassified as COVID-19 once results from confirmatory tests are reported to the health authorities. On the other hand, some deaths attributed to suspected SARS-CoV-2 infection are later 'removed' from the SIVEP database, due to duplicate filtering or because they are eventually attributed to other diseases. This can cause the number of deaths on certain days to decrease in consecutive data releases, as shown in Figure S1 in the Supplement.

The number of deaths per day as reported by each release is presented in Figure 2, together with a reporting triangle showing the distribution of the reporting delays across time. According to the SIVEP-Gripe dataset, over 90% of all deaths have been reported with a delay of less than 10 weeks (Figure 2B). We therefore choose the maximum reporting delay for our data to be D = 10, and sum up all deaths reported with a delay longer than 10 weeks. Finally, to create the reporting triangle appropriate for our model, we aggregate the data into weeks.
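The construction just described, binning deaths by week and delay while collapsing delays longer than D = 10 weeks into the final delay bin, can be sketched as follows. The (death week, report week) records are hypothetical stand-ins for what would be derived by comparing consecutive data releases.

```python
import numpy as np

# Hypothetical records: (death_week, report_week) pairs. Delays longer
# than D weeks are collapsed into the final delay bin, as in the text.
D = 10
records = [(0, 0), (0, 1), (0, 14), (1, 2), (1, 1), (2, 2), (2, 13)]
T = 1 + max(death_week for death_week, _ in records)

triangle = np.zeros((T, D + 1), dtype=int)
for death_week, report_week in records:
    delay = report_week - death_week
    triangle[death_week, min(delay, D)] += 1  # cap long delays at D

# Row sums recover the (eventually complete) weekly death totals.
weekly_totals = triangle.sum(axis=1)
```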
We fit and present 9 models. For 1D kernel GPs, we consider a single SE kernel (1D SE), and additive long- and short-range component kernels (1D SE+SE and 1D SE+Mat). For the additive long- and short-range component kernels we also consider splitting the data across delays greater than and less than one week (1D SE+SE data-split and 1D SE+Mat data-split). Finally, we consider a 2D kernel GP model with additive long- and short-range components (2D). The NobBS model of McGough et al. [2020] is fitted and presented as a reference for the current state of the art. All models are fitted to the SIVEP-Gripe weekly COVID-19 deaths reported in Brazil, available up to 8-Feb-2021.

Posterior samples of the parameters in the models were generated using Hamiltonian Monte Carlo with Stan [Hoffman and Gelman, 2014, Carpenter et al., 2017], via the PyStan interface (version 2.19.0.0). For each fit we used 4 chains and 1000 iterations, with 400 iterations dedicated to warm-up. The convergence of each model fit was evaluated by ensuring that R̂ < 1.01 for each parameter. Traceplots and other MCMC diagnostic measures were also investigated (Tables S3-S4 and Figures S13-S14).

Each of the models, characterised by the likelihood given in Eqn. 2 and a latent GP part modelling λ_{t,d} (section 3.2), is trained by supplying the reporting triangle n_{t,d} filled with the data available up to the point of the nowcast. Each of the parameters governing the model, such as the overdispersion r or the lengthscales and variances of the GPs, is learnt during the model fit. The best performing hyperparameters of the prior distributions were selected conditioned on the observed results. All of these parameters and their prior densities are given in the Supplementary Material, Table S1. The training of the model and nowcasting, through sampling each element of the n_{t,d} matrix, are done simultaneously. Specifically, at each iteration parameter values are sampled and immediately used to sample from the negative binomial distribution to obtain all elements of the n_{t,d} matrix.

Other nowcasting methods, including NobBS, focus primarily on estimating only the "missing" part of the n_{t,d} array and comparing the total numbers n_t, that is, the sums of each row of the array. Here, we aim to obtain a statistical model explaining all elements of the n_{t,d} matrix. The reason for this is twofold: firstly, having a model that describes the whole n_{t,d} surface well increases the reliability of the model, which is vital in any healthcare setting. Secondly, the SIVEP-Gripe database contains hard-to-identify errors, discussed in section 4.1, so it is preferable to treat the reported data with additional statistical uncertainty. The fit of the 2D GP and NobBS models to the n_{t,d} matrix is presented in Figure 3, and shows that the GP-based nowcasting method fits the time-delay structure much more closely than NobBS.
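The R̂ < 1.01 convergence criterion used for the fits above can be illustrated with a simplified split-R̂ computation (Stan's actual diagnostic additionally rank-normalises the draws; the chains below are synthetic):

```python
import numpy as np

def split_rhat(chains):
    """Simplified split-R-hat: split each chain in half and compare
    between-half and within-half variances. Values near 1 indicate
    mixing; the text uses the common threshold R-hat < 1.01."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1] // 2
    halves = np.vstack([chains[:, :n], chains[:, n:2 * n]])  # m chains -> 2m halves
    m, n = halves.shape
    chain_means = halves.mean(axis=1)
    within = halves.var(axis=1, ddof=1).mean()   # W: within-half variance
    between = n * chain_means.var(ddof=1)        # B: between-half variance
    var_plus = (n - 1) / n * within + between / n
    return np.sqrt(var_plus / within)

rng = np.random.default_rng(3)
mixed = rng.standard_normal((4, 1000))               # 4 well-mixed chains
stuck = mixed + np.array([[0.0], [0.0], [0.0], [5.0]])  # one chain far away
rhat_good = split_rhat(mixed)
rhat_bad = split_rhat(stuck)
```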
Figure 3: Reported and nowcasted numbers of deaths with reporting delay 0 to 5 weeks, generated by the 2D GP and NobBS models. These plots show the columns of the reporting triangle n_{t,d}. The reported data are shown with solid lines, and the 50% CrI for the nowcasts with the ribbons.

4.3 RETROSPECTIVE TESTING