Estimating undocumented Covid-19 infections in Cuba by means of a hybrid mechanistic-statistical approach
EEstimating undocumented Covid-19 infections in Cuba by means of a hybridmechanistic-statistical approach
Gabriel Gil and Alejandro Lage-Castellanos
1, 2 Instituto de Cibern´etica, Matem´atica y F´ısica (ICIMAF), La Habana, Cuba, [email protected] Faculta de F´ısica, Universidad de La Habana, La Habana, Cuba, lage@fisica.uh.cu
We adapt the hybrid mechanistic-statistical approach of Ref. [1] to estimate the total numberof undocumented Covid-19 infections in Cuba. This scheme is based on the maximum likelihoodestimation of a SIR-like model parameters for the infected population, assuming that the detectionprocess matches a Bernoulli trial. Our estimations show that (a) 60% of the infections were undoc-umented, (b) the real epidemics behind the data peaked ten days before the reports suggested, and(c) the reproduction number swiftly vanishes after 80 epidemic days.
INTRODUCTION
Covid-19 crisis has put forward the need for simpleand yet realistic epidemic modeling. The goal is to pro-vide the authorities and the public with accurate predic-tions to devise and schedule containment and organiza-tion policies before an outbreak peaks. To this effect, thedaily number of new infections and a minimal descriptionof the epidemic peak (the so-called acme), in terms of thedate and number of active infections, are among the cru-cial data. Another quantity of concern, especially rele-vant to grasp the full extent of an epidemics –as well as toassess detection strategies– is the total number of activeinfections per day. Due to the fact that there is always afraction of infections that is not detected, a fortiori whenthe pathogen may be carried asymptomatically (e.g., thecase of SARS-CoV-2), the full infected population canonly be inferred.In particular, estimations of total Covid-19 infections(including undocumented trasmission events) are alreadyavailable in the literature for France [1] and China [2], butare still unknown to many other countries. To the bestof our knowledge, there are no current estimates of totalCovid-19 infections in Cuba. Therefore, this paper aimsat contributing such an important piece of informationto the modeling of the pandemics in Cuba.Historically, epidemic modeling is dominated by mech-anistic approaches, like SIS, SIR and SEIR (where lettersin the acronyms stands for susceptible, exposed, infectedand recovered compartments of the population) [3]. Theadvantage of a mechanistic model is that it sets up theepidemic evolution from reasonable time-dependent rulesfor trasmission and recovery (the mechanics behind theepidemics, so to say). Often, such a mechanical descrip-tion applies to the full infected population. A straight-forward (brute-force) fitting of mechanistic models (e.g.,the typical SIR) to just the reported data may not bethe correct strategy to follow, since, on one hand, detec-tion draws from statistical sampling of the full infectedpopulation, and, on the other, there are detected caseswith unknown source of contagion (i.e., unable to stemfrom the interaction between the infected and susceptiblecompartments, as modeled).There have been some efforts directed at reconciling a mechanistic approach with the intrinsic statistical na-ture of the reported data [4, 5]. In particular, we areinspired by the hybrid mechanistic-statistical (HMS) ap-proach laid down in Ref. [1], which was succesful at mod-eling Covid-19 in France.The HMS scheme by Roques et al. applies Bayesian in-ference to estimate SIR parameters (and, hence, the totalinfected population), assuming that the detection processaccomodates to a Bernoulli trial [1]. Since we are inter-ested in limited outbreaks, where government measuresare effective at containing the disease, we can choose asimplification of the general SIR model that considers theinfected cases negligible with repect to the full popula-tion. Moreover, in the same spirit of Cabo-Cabo (fullymechanistic) modeling of Covid-19 in Cuba [6], we sim-ulate the effect of the state interventions by means of aheuristic time-dependence of the infection rate, droppingdown the day the most strigent measures against spread-ing were implemented. We also correct the statisticalpart of the HMS as formulated in Ref. [1], by consideringthat only the still undocumented portion of the infectedcases are sampled for test.The outline of the paper is the following. Sec. I summa-rizes the mechanistic and the statistical side of the hybridapproach, emphasizing our ammendments to the formu-lation in Ref. [1]. Sec. II describe the validation schemeand computations. Sec. III tackle two epidemic scenarios:a synthetic outbreak and the case of Covid-19 in Cuba.In Sec. IV, we provide some concluding remarks. In theAppendix, we comment on how the synthetic epidemics(used for validation purposes) was generated.
I. METHODSMechanics: simplified SIR model with a heuristictime-dependent infection rate
We start from the most customary Susceptible + In-fected + Recovered (SIR) model, introduced by Kermackand McKendrick [7], i.e., dS ( t ) dt = − α I ( t ) S ( t ) N , (1) a r X i v : . [ q - b i o . P E ] A ug dI ( t ) dt = α I ( t ) S ( t ) N − βI ( t ) , (2) dR ( t ) dt = βI ( t ) . (3) S ( t ), I ( t ) and R ( t ) are the susceptible, infected and re-covered time-dependent populations, which sum up tothe size of the full population N (i.e., S ( t )+ I ( t )+ R ( t ) = N ), whereas α and β are the infection and recovery rates.Time t is given in days, hereafter. The cumulative num-ber of infections within a timespan reads T ( t ) = I ( t ) + R ( t ) . (4)At difference with Ref. [1], we take a reasonable sim-plification of the usual SIR model valid for a more or lesscontained outburst or an early stage of the epidemics [6].In such a case, T ( t ) (cid:28) N and S ( t ) ≈ N . Therefore, theinfected population reads I ( t ) = I exp { ( R − β ( t − t ) } , (5)where R = α/β is the basic reproduction number, and I ( t ) = I is the initial number of infections. FromEq. (5), we note that an exponential increase or decreaseof infectious events takes place depending on whether R is greater or lesser than the unity.Now, let the infection rate be time-dependent. In sucha case, we get I ( t ) = I exp (cid:26)(cid:90) tt α ( t (cid:48) ) dt (cid:48) − β ( t − t ) (cid:27) , (6)instead of Eq. (5). We assume a heuristic shape for sucha time-dependence, for example, a step function takinga constant value at the beginning ( α > β , to allow foran outbreak) and dropping down to a lower value ( α ∞ <α ) at some point in time, t . For a country sufferingan epidemics, we set such an inflection point to the dayborders were closed or a stringency index jumps abruptly[8]. In particular, we choose a Fermi-Dirac distributionas a smooth version of the step function, i.e., α ( t ) = ∆ α e ( t − t ) /τ + α ∞ , (7)where α ≡ α ( t ) = ∆ α/ (1 + e ( t − t ) /τ ) + α ∞ , α ≡ α ( t ) = ∆ α + α ∞ , α ∞ ≡ lim t →∞ α ( t ) and τ is a smoothparameter modulating how fast the infection rate dropsdown from α to α ∞ . From Eqs. (6) and (7), we get I ( t ) = I exp { ( α − β )( t − t ) } × (cid:18) (1 + exp { ( t − t ) /τ } )(1 + exp { ( t − t ) /τ } ) (cid:19) ∆ ατ , (8)Models relying on heuristic time-dependent infectionrates that swiftly vanish after lockdown and stringentmeasures from the government have been recently ex-plored in Ref. [6]. Statistics: testing as a binomial process andmaximum likelihood inference of SIR parameters
Let us assume that the daily detection of new infectedcases δ t is a Bernoulli process, as in Ref. [1]. Hence, therandom variable δ t distributes binomially, δ t ∼ Bin( n t , p t ) , (9)where n t is the number of trials (i.e., the daily size of thetest sample) and p t , the probability of success in each in-dependent trial (for us, the probability of finding infectedcases in the population). Subindex t is kept throughoutfor time series data.In such a Bernoulli process, test outcomes are indepen-dent from each other. For that to happen, the sample fortest would have to be selected at random from the fullpopulation. However, more effective and realistic testingstrategies often departs from random sampling by focus-ing on trasmission chains and/or risk groups. To simulatehigher prevalence in test samples as compared to randomones, we assume that susceptible population will alwaysbe under-sampled, i.e., we bias the probability of findingpositive results. In Ref. [1], p t = I ( t ) I ( t ) + κS ( t ) (10)where κ ∈ (0 , δ t . To that aim, p t does not builds up fromthe total but from the currently infected cases that areyet undocumented, i.e., the difference between the fullinfected population and the active (positive to the test)cases just the day before the current tests, A t − . Again,we take S ( t ) ≈ N . Therefore, Eq. (10) is replaced by p t = I ( t ) − A t − I ( t ) − A t − + κN . (11)The likelihood of detecting daily reported cases δ t given an infected population evolving a l`a SIR (seeEq. (8)) and assuming testing outcomes can be accom-modated into a Bernoulli process is [1] L ( I , ∆ α, α ∞ , τ, κ ) = Π t f t = t (cid:18) n t δ t (cid:19) p δ t t (1 − p t ) n t − δ t . (12)Here, all the parameters are encoded into p t , in the caseof SIR parameters ( I , ∆ α , α ∞ and τ ), through the in-fected population I ( t ) (see Eqs. (8) and (11)). Suppos-ing parameters distribute uniformly within the searcheddomain, the maximization of the likelihood L lead us tothe SIR infected evolution that best reproduce the re-ported data on daily new cases (in the sense of havingthe largest probability of being the specific dynamics be-hind the data). The biasing parameter κ is also inferredby maximizing the likelihood. II. VALIDATION AND COMPUTATIONALDETAILS
As anticipated, we want to use our model to estimatethe number of total infected population over time fromthe reported data. Since the motivation for developingsuch a model is precisely the fact that there are undocu-mented infections, we do not have an immediate way ofvalidating our results. Roques et al. [1] carried out anindirect validation, by comparing actual data on infectedcases with the expected number of the corresponding,binomially distributed, random variable (actually, thecumulative over time of such quantities was employed).Here, we propose as well a direct validation scheme bymeans of a network epidemic model (NEM). In a nutshell,such a NEM builds up from an unconstrained spread anda detection/quarantine process based on symptomaticcases and traceable trasmission chains they point at (seethe Appendix for details). Although, the NEM deservesattention in its own right, its full presentation and analy-sis is outside the scope of this article and will be publishedelsewhere. In this paper, it only provides a syntheticepidemic scenario where the total infected population isknown by construction, therefore, setting a benchmarkfor our hybrid model estimation (introduced in earliersections).The HMS model is implemented in a Wolfram Mathe-matica 11.3 notebook (available upon request to the au-thors). The maximum likelihood estimation (MLE) iscarried out through a Nelder-Mead global optimizationmethod with a maximum of 10 iterations. We imposethe natural constraints on the parameters (i.e., ∆ α ≥ α ∞ ≥ τ ≥ I > κ ∈ (0 , α ≡ ∆ α/ (1 + exp ( t − t /τ )) + α ∞ >
1, to allow foran initial outbreak. As a matter of fact, we do not max-imize the likelihood L itself (see Eq. (12)) but ln ( L ),which is a smoother function of the parameters. Weset N ≈ × , approximately the Cuban popula-tion. To model the outbreak of Covid-19 in Cuba, weset 1 /β = 20 days [9] and t to the 13th epidemic day,i.e., the day borders and schools were closed and strin-gency index took the largest leap (25 out of a maximumof 100 units) [10]. Covid-19 data for Cuba is taken from[10, 11]. For the MLE in Cuba, we use only the first 80days, which include an almost complete epidemic peak.Our SIR cannot capture the two-peak profile actuallyseen in the number of active cases reported for Covid-19in Cuba (see Fig. 2), so far as 110 epidemic days. Thenumber of daily PCR tests for Covid-19 as reported bythe Cuban Ministry of Health is a tricky figure in manysenses: for example, it is not clear whether it includes or not (a) retests of already detected cases and (b) testsfor which the result might be pending [12]. Not havinga more intuitive way of estimating the daily number oftests, we choose a constant value for the number of dailytests and set it to the geometric mean of the reporteddaily Covid-19 tests data in Cuba in a period of 110 epi-demic days (i.e., n t ≈ t = 43 days)and choose a recovery time of 1 /β = 7 days. The numberof tests is the same as for the Covid-19-in-Cuba scenario. III. RESULTSA synthetic epidemic scenario
Figure 1 shows our results on HMS estimations of theextent of a simulated epidemic outburst, where the totalnumber of active cases is known by construction. Re-markably, such estimations are in a good agreement withthe actual benchmark data during the early stages of theoutbreak. The quality of our estimation after the epi-demics has peaked is poor, but note that they are atleast able to spot accurately the date of the acme. To-gether with our HMS estimations, we plot a non-linearregression model as applied to the total number of activecases. Both the HMS and the regression models have thesame functional dependence on time, given by Eq. (8),and target the same quantity (i.e., the total number ofactive cases). The practical difference between the twois in the input data: whereas HMS builds up from dataon the active and newly detected cases per day, the otherdirectly fits the usually unknown total number of activecases. We do not expect that HMS approach achievethe same degree of success of a fit, since the underlyingmethod does not try to make I ( t ) conform to the totalnumber of active cases, e.g., by minimizing their relativedifferences. At odds with a regression, the HMS maxi-mizes the chances of obtaining newly detected cases perday out of a statistical sampling of a proxy for yet undoc-umented infections (i.e., I ( t ) − A t − ). The advantage ofHMS is that it is useful in all realistic situations in whichthe total number of active cases is simply not availabledue to undocumented infections. Covid-19 in Cuba
As seen in the last section, HMS works well at theepidemic outburst, providing reasonable estimates of thetotal number of cases and the date of the full epidemiccurve (which need not to coincide with the peak of the re-ported data). Thus validated, we proceed to apply HMSto the recent Covid-19 epidemics in Cuba. In Fig. 2, weshow a comparison between active cases as obtained fromofficial reports and our estimates (which includes undoc-umented infections). The full epidemic curve peaks ten N u m b e r o f a c t i v e c a s e s Time (days)Simulated epidemics reportedtotal fi t to totalestimated total FIG. 1: (color online) Total number of active cases for a sim-ulated epidemics (full circles) and our HMS estimation of thesame quantity (solid red line). We also show a fit of Eq. (8)to the total number of active cases (dashed blue line). As areference, we plot the reported number of active cases (emptycircles), which is the input of our HMS estimation. days before the reported one. Day by day, reported in-fections ranges from 4 to 49% of the total, with a mean of28%, achieved at the full acme. Remarkably, the cumu-lative number of reported infections within the timespanof 80 days was about 40% of the total (see Eq. (4)), and
T /N = 4 . × − . These are fingerprints of good man-agement of Covid-19 medical crisis in Cuba (cf. Ref. [2]’sestimate of 14% of documented infections in China be-fore the travel restrictions on 23 January 2020). More-over, notice that the fraction of documented infectionsincrease over time, from an initial value of 11% to theaforementioned 40%, indicating a refinement in Cubandetection process during the epidemics. N u m b e r o f a c t i v e c a s e s Time (days)Covid-19 in Cuba reportedestimated total
FIG. 2: (color online) Number of active cases of Covid-19in Cuba (black dots), together with our HMS estimation ofthe total number of active cases (solid red line), includingundocumented infections.
In the case of Covid-19 in Cuba, we can only appeal toan indirect validation of our HMS estimates. Fig. 2 showsthe cumulative number of reported infections, (cid:80) tt (cid:48) = t δ t (cid:48) ,along with its expected value within our Bernoulli pro-cess, (cid:80) tt (cid:48) = t n t (cid:48) p t (cid:48) . Good agreement is generally obtained(mean and standard deviation of around 9 and 28 cases,respectively), and nearly zero relative difference at the end of the interval (80 days). C u m u l a t i v e nu m b e r o f r e p o r t e d c a s e s Time (days)Covid-19 in Cubaactualexpected
FIG. 3: (color online) Cumulative number of reported casesof Covid-19 in Cuba (black dots) along with the expectednumber of the same quantity by means of HMS (solid redline).TABLE I: Summary of the parameters obtained for each mod-eling presented in this paper. R (0)0 = α /β is the reproductionnumber the first epidemic day, R ( ∞ )0 = α ∞ /β is its asymptoticvalue and (cid:101) κ = κ × .Epidemics I R (0)0 R ( ∞ )0 τ (days) (cid:101) κ NEM simulated 7.81 1.92 0.69 0.86 3.9Covid-19 in Cuba 32.47 6.39 0.02 11.17 4.0
Table I summarizes the parameters we get for both, thesimulated epidemics and Covid-19 in Cuba. In the lattercase, we emphasize the jump in reproduction number theday borders and schools were closed, from 6.39 at thebeginning to nearly zero the 80th epidemic day (cf. [6]).
IV. CONCLUDING REMARKS
We adapted the hybrid mechanistic-statistical methoddeveloped by Roques et al. [1], already succesful at mod-eling Covid-19 in France, to be able to make reasonableestimations of total Covid-19 infections in Cuba. Ourtheoretical contribution is two-fold. On one hand, wechose a heuristic modification of the classical SIR modelthat assumes limited outbreaks together with an infec-tion rate changing abruptly when stringent measures takeplace (see also Ref. [6]). On the other hand, we cor-rected the probability entering the binomial distributionof newly detected cases, in order to account for the factthat the daily reports do not include retest outcomes ofalready detected cases. Both ammendments turn out tobe essential when modeling Covid-19 in Cuba.Furthermore, we provided a testing ground for the hy-brid mechanistic-statistical estimations: the case of a net-work epidemic simulation where the total number of ac-tive cases is known by construction. The hybrid model isvalidated against such a benchmark, at least in the earlystages of the outburst before the epidemics peaks.Applying the hybrid model to Covid-19 in Cuba allowsus to estimate the total number of active cases, includ-ing undocumented infections. The resulting number ofundocumented Covid-19 infections in Cuba reaches 60%,which is considerably less than the estimate for China(86%) before the travel restrictions were implemented [2],therefore, indicating a good management of the medicalcrisis in Cuba.
ACKNOWLEDGEMENTS
G.G. acknowledges support from the National Pro-gramme for Basic Sciences in Cuba, the Cuban Ministryof Science, Technology and Environment (CITMA) andthe Abdus Salam International Centre for TheoreticalPhysics (ICTP) through the grant NT09-OEA. A.L.C.acknowledges PHC Carlos J. Finlay funds, from the Em-bassy of France in Cuba, for supporting the exchangewith french researchers. We were stimulated by the meet-ings for the mathematical modeling of Covid-19 crisis inCuba, organized by the Faculty of Mathematics from theUniversity of Havana, in close connection with the CubanMinistry of Health (MINSAP). In particular, we thankN. Cabo-Bizet and A. Cabo Montes de Oca for sharingand discussing with us an early preprint with their find-ings, finally included in this RCN special number alongwith the our paper. G.G. is also grateful to M. Mon-tero and D. S. Fontanella (ICIMAF, Havana, Cuba), foruseful discussions and comments.
Appendix: Network epidemic model with quarantine
We consider an stochastic branching process in whichnodes are infected people, and connections represent thetransmission of the disease. As we are interested in thecase of controlled or small size epidemics, we will disre-gard the total size of the population that will be effec-tively considered as infinite.In our simulation, we will sequentially grow an epi-demic tree in which nodes are in any of the followingstates: E : exposed to the virus, meaning the person has thevirus but is not capable of transmitting it, I s : infectious and symptomatic, meaning the person iscapable of transmitting the virus and is also show-ing symptoms of the disease, I a : infectious and asymptomatic, when the person donot show symptoms but still can transmit the virus, R : when the person is no longer transmitting the virus(either because it recovered or died).On top of these states, nodes can either be quarantinedor not. The infectious process is controlled by a set of con-stants: R : is the expected number of new infections causedby a single infected individual. This means that inaverage, every infected person will generate R newnodes in the tree; α : is the fraction of infected people that will developsymptoms; β : is the fraction of contacts that are traceable, mean-ing that if one node is detected to be infected, thenit can point to the neighbors (parent or childrenin the tree) that are connected through traceablecontacts; r E → I : is the rate at which exposed nodes turn into infec-tious; r S → R : is the rate at which infectious and symptomaticnodes recover; r A → R : is the rate at which infectious and asymptomaticnodes recover; c S : is the rate at which a symptomatic infectious nodegenerates new contacts each day. In order to keepthe meaning of R , we shall have c S = R × r S → R ; c A : is the rate at which a symptomatic infectious nodegenerates new contacts each day. For the same pre-vious reason, c A = R × r A → R ; r S → Q : is the rate at which symptomatic people are de-tected by the quarantine process and moved toquarantine; r Q → R : is the rate at which people are released from thequarantine, either because they died or recovered. Algorithm 1
Stochastic daily SEIRQ cascade process. procedure One-day-update ( E, I s , I a , R, Q list of nodesin each state) for n ∈ Q do if n is new in quarantine then Add-contacts-to-Q( n ) (cid:46) Contact tracing Move-Q-to-R-with-prob( r Q → R ,n) for n ∈ E do Move-E-to-I-with-prob( r E → I ,n) for n ∈ I s do Generate-offspring-with-rate( c s ,n) Move-S-to-R-with-prob( r S → R ,n) Move-S-to-Q-with-prob( r S → Q ,n) for n ∈ I a do Generate-offspring-with-rate( c a ,n) Move-S-to-R-with-prob( r A → R ,n) The whole simulation is schematized in algorithm 1.Functions Move-A-to-B-with-prob will remove the givennode from list A and put it on list B, with a given prob-ability. The function Generate-offspring-with-rate willadd new nodes as children of the given node, some ofwhich will be traceable some who won’t, and some ofwhich will be symptomatic and some who won’t. Allthis new nodes are added also to the Exposed list. Thefunction Add-contacts-to-Q will follow all the traceablecontacts of the node n and put them in quarantine (re-moving them from the lists they were).A simulation like this can mimic most of the indicatorsthat are being reported by the Cuban Ministry of Health in its daily briefings. The reports correspond to the char-acteristics of the nodes that are quarantined each day:whether they come from known contacts, whether theyhave symptoms or not. A quantity that is not directlyincluded in this simulation is the amount of declared con-tacts that will be negative to the virus test, since we onlydeal with positive cases. However, this number is not re-ported either by the health authorities, and it is naturalto assume that the number is a Poisson random variablewith not particular other implications in the process. [1] L. Roques, E. K. Klein, J. Papa¨ıx, A. Sar, andS. Soubeyrand. Using early data to estimate the actualinfection fatality ratio from COVID-19 in France. Biol-ogy , 9:97, 2020.[2] R. Li, S. Pei, B. Chen, Y. Song, T. Zhang, W. Yang,and J. Shaman. Substantial undocumented infectionfacilitates the rapid dissemination of novel coronavirus(SARS-CoV2).
Science , 368:489–493, 2020.[3] S. Mandal, R. R. Sarkar, and S. Sinha. Mathematicalmodels of malaria - a review.
Malaria Journal , 10:202,2011.[4] C. Bret´o, D. He, E. L. Ionides, and A. King. Time seriesanalysis via mechanistic models.
The Annals of AppliedStatistics , 3(1):319–348, 2009.[5] M. Li, J. Dushoff, and B.M. Bolker. Fitting mechanisticepidemic models to data: A comparison of simple Markovchain Monte Carlo approaches.
Statistical Methods in Medical Research , 27(7):1956–1967, 2018.[6] N. Cabo-Bizet and A. Cabo Montes de Oca. Mode-los SIR modificados para la evoluci´on del COVID19. arXiv:2004.11352v1 , 2020.[7] W.O. Kermack and A.G. McKendrick. A Contributionto the Mathematical Theory of Epidemics.
Proceedingsof the Royal Society A , 1927.[8] .[9] F. Zhou, T. Yu, R. Du, G. Fan, Y. Liu, Z. Liu, J. Xiang,Y. Wang, B. Song, and X. Gu et al. Clinical course andrisk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study.
TheLancet , 395:10229, 2020.[10] https://covid19cubadata.github.io .[11] https://github.com/pomber/covid19 .[12] https://ourworldindata.org/coronavirus-testinghttps://ourworldindata.org/coronavirus-testing