[PDF] Monitoring and Forecasting COVID-19: Statistical Heuristic Regression, Susceptible-Infected-Removed model and, Spatial Stochastics

Abstract

The COVID-19 pandemic has had worldwide devastating effects on human lives, highlighting the need for tools to predict its development. Dynamics of such public-health threats can often be efficiently analysed through simple models that help to make quantitative timely policy decisions. We benchmark a minimal version of a Susceptible-Infected-Removed model for infectious diseases (SIR) coupled with a simple least-squares Statistical Heuristic Regression (SHR) based on a lognormal distribution. We derived the three free parameters for both models in several cases and tested them against the amount of data needed to bring accuracy in predictions. The SHR model is approximately +/- 2% accurate about 20 days past the second inflexion point in the daily curve of cases, while the SIR model reaches a similar accuracy a fortnight before. All the analyzed cases assert the utility of SHR and SIR approximants as a useful tool to forecast the evolution of the disease. Finally, we have studied simulated stochastic individual-based SIR dynamics, which yields a detailed spatial and temporal view of the disease that cannot be given by SIR or SHR methods.

Full PDF

AAbstract

The COVID-19 pandemic has had worldwide devastating eﬀects on human lives,highlighting the need for tools to predict its development. Dynamics of such public-health threats can often be eﬃciently analysed through simple models that help tomake quantitative timely policy decisions. We benchmark a minimal version of aSusceptible-Infected-Removed model for infectious diseases (SIR) coupled with a simpleleast-squares Statistical Heuristic Regression (SHR) based on a lognormal distribution.We derived the three free parameters for both models in several cases and tested themagainst the amount of data needed to bring accuracy in predictions. The SHR modelis ≈ ±

2% accurate about 20 days past the second inﬂexion point in the daily curveof cases, while the SIR model reaches a similar accuracy a fortnight before. All theanalyzed cases assert the utility of SHR and SIR approximants as a useful tool toforecast the evolution of the disease. Finally, we have studied simulated stochasticindividual-based SIR dynamics, which yields a detailed spatial and temporal view ofthe disease that cannot be given by SIR or SHR methods. a r X i v : . [ q - b i o . P E ] J a n onitoring and Forecasting COVID-19:Statistical Heuristic Regression,Susceptible-Infected-Removed model and,Spatial Stochastics Pedro L. de Andres , Lucia de Andres-Bragado , and Linard D. Hoessly ICMM, Consejo Superior de Investigaciones Cientiﬁcas, E-28049 Madrid, Spain Department of Biology, University of Fribourg, CH-1700 Fribourg, Switzerland Department of Mathematics, University of Copenhagen, DK-2100 Copenhagen,DenmarkJanuary 28, 2021

The consequences of a pandemic like COVID-19 caused by the virus SARS-CoV-2 cannotbe overstated [Nature, 2020]. Accurate mathematical tools allowing to monitor and fore-cast the evolution of the contagious disease are useful to guide social, economic and publichealth decisions made by governments. Nevertheless, despite the availability of powerfulmathematical models [Anderson et al., 2020], initial forecasting by some organizations un-derestimated the evolution of the epidemics, hampering the immediate taking of necessaryactions [Economist, 2020, Herzberge and Hecketsweller, 2020].This study aims to take advantage of available worldwide data on COVID-19 [Roser et al., 2020,Dong et al., 2020] to benchmark and assign error bars to minimal models, like the susceptible-infected-recovered (SIR) [Kermack and McKendrick, 1927, Weiss, 2013], a straightforwardleast-squares best-ﬁt (LS) Statistical Heuristic Regression (SHR) based on a lognormal dis-tribution [Lam, 1988], or basic Monte-Carlo simulation [Girona, 2020, Gang, 2020]. Thesemodels are gauged against two variables measured daily: (i) the number of deaths, and (ii)the number of new infections. Such indicators both possess advantages and disadvantages.Deaths are usually counted using a consistent methodology and, undeniably, it is an ob-servable proportional to the spread of the disease, but the tally of deaths carry a delay However, some countries have varied these criteria during the course of the epidemic. Moreover, diﬀerentcountries can use diﬀerent rules to deﬁne these variables. Those inconsistencies can be compensated and

2f about one month on the actual dynamics of the disease. On the other hand, the num-ber of infections is timely, but incorporates more uncertainties since it depends on detailsnot related to the disease, e.g. on the number of tests performed. We show the simulta-neous monitoring of both observables supplemented with relatively simple mathematicalapproaches can be used to follow and forecast the evolution of the disease with enoughaccuracy to help decision-making processes and we discuss the associated error bars.

Epidemics can eﬃciently be modelled as a geometric process related to independent randomevents [Lam, 1988]. This method yields a regression curve that describes the temporalvariation of a contagious disease for the number of deaths, infections or some other relevantobservable variable. Such a statistical heuristic approach results in a lognormal function, c c M ,µ,σ ( t ) = c M e − (ln ( t − t − µ )22 σ √ πσ ( t − t ) t − t > u =ln ( t ), is normally distributed around its mean value µ with a dispersion σ [Johnson et al., 1994].The beginning of the propagation is determined by t and the value of the single maximum c M = c ( t M ) happens at t M = t + µ − σ .Starting from the model d c ( t )d t = α ( t ) c ( t ) and imposing general requirements on α ( t )(which follow from the observed behaviour of the number of daily cases) also leads tothe same expression [Wenbin et al., 2013]. Using entropy-related arguments, these authorshave estimated that σ ≈ .

4, which compares well with the averaged values for ten diﬀerentwestern countries for deaths and infected, 0 . ± . . ± . C c M ,µ,σ ( t ) = (cid:90) t c ( u ) d u = c M (cid:18) − ln ( t − t ) − µ √ σ (cid:19) t − t > C ( t ) and c ( t ) carry the same information about the set of threeparameters, F = { c M , µ, σ } , since c ( t ) is simply the temporal derivative of C ( t ). However,in practical terms, c ( t ) is heavily aﬀected by noise in the collection of the data series { c i } ,and least-squares ﬁts to the functions C ( t ) and c ( t ) are expected to determine slightly do not signiﬁcantly aﬀect the conclusions of our analysis. However, a comparison between countries is onlyan approximated one. F . Therefore, we chose to report values related to C ( t ), which are lessaﬀected by noise. Still, we notice that the information contained in c ( t ) is equally valuableand sometimes simpler to obtain, in particular, the position and value of its inﬂexion pointsand single maximum.Next, we aim to prove that the ansatz in Equations 1 and 2 reproduces the behaviour ofCOVID-19 in a few diﬀerent western countries. Even if SHR merely amounts to a preciseﬁt of the data, we observe that it carries signiﬁcant advantages over the mere manipulationof the data series, { c i } , as: (i) it can be extrapolated to the near future (extrapolationsshould be treated with great care, but an informed extrapolation about the behavior in thefuture is always better than a wild guess) and, (ii) it reduces long lists of numbers to ananalytical expression which only depends on three parameters. Such an analytical functioncan be then easily manipulated to get integrals, derivatives of any order, or to search forextrema/inﬂexion points, etc. Spain is a country where the disease was particularly virulent in its ﬁrst wave, spreadingwith remarkable strength. The SHR model agrees well with the data for both deaths andinfections (Fig. 1 and Tables 1 and 2). Together, these two variables provide a better ideaof the epidemic’s course by identifying two critical items: the impact on the population viainfections and, the impact on the health system via deaths. Three simple features deﬁningthe epidemics that will be rationalized later in the context of the SIR model are: (i) theexponential behavior near the origin, (ii) the position and value of the single maximum inthe daily number of cases and, (iii) an asymmetric decay towards the future w.r.t the past.The ratio between total infections and deaths has evolved from about 1% in March to amaximum of 12% in August, but it has signiﬁcantly decayed for the second wave to about4% at the end of September (inset in left-hand side in Fig. 1).Three regions are identiﬁed in the plots, both for deaths and infections. The ﬁrst region(I, wave 1) ﬁnishes approximately on the 1st of May, 2020 ( d = 152) and it clearly showsthe three aforementioned features marking its association with an infectious disease. Thesecond region (II, inset in right-hand side in Fig. 1), goes approximately between the 150thand 200th day, and its hallmark is to sustain a fairly constant level of daily infections, c ( t ) = < c i > , which reﬂects in a linear increase of the accumulated number of cases, C ( t ).Region II approximately terminates near the end of the general lockdown in Spain, on the21st of June ( d = 173).Neither the SHR nor the SIR models can account for a sustained period of constantinfections, although they can accommodate this regime via the slowly decaying queue ofthe distribution where derivative of the function is very low. In contrast, such behaviorcan be naturally described via Monte-Carlo simulation. Likewise, while MC can describeseveral waves by producing more than one local maxima due to spatial inhomogeneousdynamics, SHR and SIR can only describe such a scenario via a linear combination of4ndividual waves, each one governed with its own parameters.Finally, in a third region (III, wave 2) the collective transmission displays again a similarbehavior to the region I, marking the evolution of an out-of-control disease. The superpo-sition of these multiple regimes, plus other waves if needed, describes the overall functionwell. We notice that the accuracy in the ﬁt for any wave is not expected to be reasonablystable until at least the corresponding maximum is well developed (Section 2.1.6). Howeversuch incertitude, the model predicts that in Spain the number of infections due to the sec-ond wave should be reaching its maximum in December 2020, at most in January 2021. Inaddition, the model predicts that the strength of the second wave is approximately weakerthan the ﬁrst one by a factor two, as measured by the number of accumulated certiﬁeddeaths from SARS-Cov-2. Although these predictions may be aﬀected by large error barssince the maximum in the second wave is not yet well developped, those values oﬀer soundguidance about the course of the disease. We have used this model to extrapolate theshape of the curve by a fortnight after the last day of the corresponding available data; the resemblance to the ulterior course of the disease will be seen in the next weeks.The accompanying number of registered infections yield a picture of the likely evolutionof deaths in the following days, even if the variation in the absolute numbers from the ﬁrstto the second wave is dominated by the change in the number of tests performed. Giventhe large dispersion of raw data due to diﬃculties to collect them it is clear the necessityto perform moving averages and the advantages of working with least-square approximantsthat can be extrapolated a few days ahead, a statement that is true for the behaviourof other countries. While deaths only show two waves so far, infections identify at leastfour local maxima that can be correlated with diﬀerent events, like the end of the summervacations or the occurrence of several bank holidays in Spain where the population hasbeen moving and mixing in great numbers. Compared with other countries with large populations, Germany has managed the pan-demics quite well, as it is observed by comparing the number of cases per inhabitant.Moreover, its evolution has been recorded with consistency both for deaths and infections.Therefore, it is an appropriate benchmark for any model.Similarly to Spain, the SHR model can be used to accurately represent the diseaseevolution using only three parameters per wave (Fig. 2). Curiously enough, best-ﬁt valuesfor µ and σ are quite similar to Spain (Tables 1 and 2), indicating that, independentlyof the absolute strength, there are common underlying features in both cases. Therefore,it is interesting to explore the ability of a single normalized averaged curve to represent Assuming ﬁxed conditions, in particular disregarding the possible eﬀect of Christmas festivities. On the 4th of November, 1385 previously unaccounted deaths were added in Spain, producing a dis-continuity in the curve. To make possible a prediction based on an extrapolation, we have not includedthose extra deaths. C ( ∞ ) ∝ c M as the single only freeparameter. Such a curve is represented in Figs. 1 and 2 by the green dashed line having µ = 3 .

53 and σ = 0 .

56 (Section 2.1.5), and it is clear that despite having such a limitedfreedom for ﬁtting (since it only depends on one parameter), it provides a very reasonableapproximation to the data. In contrast to Spain, the ratio between total infections anddeaths in Germany evolved from about 1% in April to 4% in September, which is aboutthree times lower than for Spain (inset in Fig. 2).

We also prove the capabilities and versatility of the SHR ansatz to reproduce the observeddata by applying the same methodology to a pool of western countries: Great Britain(GBR), Italy (ITA), United States (USA), France (FRA), Switzerland (CHE), Denmark(DNK), Austria (AUT) and Finland (FIN), cf. Figs. 3 and 4. In general, the agreement isquite good, both for deaths and infections. Among other advantages, this procedure allowsa quick and simple monitoring of the evolution of the disease in the diﬀerent countries. Inparticular, it is a useful tool to identify and forecast the appearance of a second wave. Atthe moment of writing, only the USA has fully developed the maximum associated with thesecond wave and, from the combined behaviour of deaths and infections, it could be arguedthat the country is clearly heading towards a third wave. Since this is the only case so far,it is not possible to characterize well such a second wave by a proper average of diﬀerentcountries, although it seems fair to say that it is represented by a wider distribution ofdaily deaths (e.g. the second component represented by the dotted red curve correspondsto having µ = 5 . σ = 1 .

2) and a lower value at the peak by about a factor 2.

Prominent places where the infection spreads quickly are densely populated regions, whichconstitute the core of the propagation of the disease. Therefore, it is interesting to comparethe distribution of cases in those regions. We have juxtaposed the performance of NewYork City (9.1 M-people, NYC) and the Community of Madrid (6.7 M-people, CAM) inthe ﬁrst wave (Fig. 5). To highlight the similarities rather than the diﬀerences they are su-perimposed in such a way that the position (day) of the maximum coincides. Furthermore,CAM has been scaled by the ratio of respective populations, which makes the value at themaximum very similar for both regions. Despite all the diﬀerences between these regions,it is clear that a typical pattern emerges, which leads us to investigate the advantages ofworking with averages.

Normalizing and superimposing the curves for COVID-19 deceases on diﬀerent countriessuch that the maximum in c ( t ) is in a common position, ( t M ) allows us to focus on similar-6ties. Despite slight diﬀerences, nine out of the ten arbitrarily-chosen western countries areall well represented by a normalized average function, < c ( t ) > , (Fig 6). USA shows as anoutlier; a warning about the quite diﬀerent boundary conditions from the other Europeancountries. Since second waves are not fully developed (except in USA) it is not possibleyet to ascertain if such a universal average could represent faithfully second waves, evenif maybe with diﬀerent eﬀective values of µ and σ owing to the diﬀerent boundary con-ditions that may apply. We have not tried the same procedure with the infected becauseof the greater temporal and spatial variability of procedures used to deﬁne that variable.However, results for deaths in the ﬁrst waves make us believe that such a representativeaverage could also be applied to a properly deﬁned observable for infections. ExcludingUSA, the maximum average error made by substituting the actual data by the averagefunction ( (cid:15) = c ( t ) − < c ( t ) > ) is ≈ .

03 in units of c M , which happens near the inﬂec-tions points where the function c ( t ) has decayed to ≈ . ≈ ± c M . Such parameter c M can be easily obtained from a single point: the maximum value inthe daily distribution of cases for each wave, which we derive from a moving average of afew days (seven days makes appropriate averages that account for regular weekly routinesand removes most of the noise for all the cases we have analyzed). To be able to conﬁdently use a least-squares statistical regression to a given data set { C i } ( i = 1 , n ) the main question is how many data points, n , are needed to yield areasonable estimation of the evolution of the epidemics based solely on the extrapolationof the ﬁtted functions. Such question is relevant considering how unreliable extrapolationsusually are [Press et al., 2007]. Indeed, any simple algorithm to forecast the evolution ofan epidemics can only be valuable if reasonable error bars can be assigned to predictions.A simple target to quantify the error is to study the behavior of the expected totalnumber of cases, C ( ∞ ), as a function of n . Fig. 7 shows the variation of the predictedasymptotic value as a function of the available amount of data after the second inﬂexionpoint. In most cases, a fractional accuracy of ±

15% is achieved a fortnight after the secondinﬂexion point, which is further decreased to ±

5% in another fortnight.

So far, we have shown that SHR qualiﬁes as a quick and straightforward way to describe theevolution of an infectious disease. If adequately used, i.e. attached with appropriate errorbars, it can be extrapolated to make predictions in the near future, since the functionalforms associated with Equations 1 and 2 adapt so well to the observed data.However, a better understanding of the dynamics of the epidemics can be obtained7rom a set of diﬀerential equations which describe its time evolution. The simplest modelfor the evolution of a contagious disease is to postulate that the rate of new infectionsis proportional to the number of infected people itself, d I ( t )d t = I ( t ) τ , which results in anunbound exponential growth, I ( t ) ∝ e tτ , and makes a characteristic mark for the onset ofa pandemic.Such a simple model does not take into account how the rate of infections decreases asthe number of infections approaches the total population. Therefore, a reﬁned version is todivide a given population of size N into three classes ( S , I , R ): (i) susceptible entities whocan catch the disease, S ( t ), (ii) infected ones who have the disease and transmit it, I ( t ), and(iii) removed ones who have been isolated, died, or recovered and become immune, and aretherefore not able to propagate the disease, R ( t ). In this model, individuals pass from thesusceptible class S to the infective class I and ﬁnally to the removed class R with rates deter-mined by a set of ordinary diﬀerential equations(ODEs) [Kermack and McKendrick, 1927,Anderson, 1991, Hethcote, 2000, Weiss, 2013]. The ODEs derive from the interactions ofthe entities in the diﬀerent classes, which can be represented as S + I → I , I → R , where we assume generalized mass-action kinetics [M¨uller and Regensburger, 2012] (withslightly diﬀerent scaling with respect to N ). First, it is assumed that the number ofsusceptible individuals decreases at a rate proportional to the density of infected, i ( t ) = I ( t ) N times the number of susceptible individuals, S ( t ),d S ( t )d t = − ( S ( t )) n τ i ( t ) (3)where τ is an adjustable parameter that represents a typical time to transmit the disease,and n is a parameter that inﬂuences the ability of the disease to infect susceptible individ-uals in a non-linear way (e.g. it might represent the eﬀect of the viral load). Its main eﬀectis to alter the temporal scale of the epidemics, which in some circumstances facilitates theﬁtting of the model to real data. The standard SIR model is recovered with n = 1.Removed entities originate from infected; therefore, its variation is assumed to be pro-portional to the number of infected, d R d t = ( I ( t )) τ (4)where τ is an adjustable parameter that represents a typical time to recover from the dis-ease. This equation merely helps to count the total number of removed from the beginningof the infection up to a given day t , R ( t ) = (cid:90) t ( I ( u )) τ d u (5)8astly, the infected vary according to the inﬂow of susceptible individuals who becomeinfected minus the outﬂow of infected that have been removed,d I d t = ( S ( t )) n τ i ( t ) − I ( t ) τ (6)The derivative d I d t moves from positive to negative depending on the balance between bothterms in the equation and it determines a single peak in I ( t ) (for n = 1, I (cid:48) ( t ) = 0 for S ( t ) N = τ τ ).The task at hand for a given population of N elements is to determine the parameters, τ , τ and n , that best reproduce the behavior of the epidemics by solving the coupledsystem of diﬀerential equations 3 and 6, subject to some initial conditions, e.g. S (0) = N − I (0) = 1. Good agreement with data can be used to lend an interpretative valueto τ and τ (unlike parameters µ and σ which only have a statistical meaning). The ratio (cid:60) = τ τ is called the eﬀective reproductive number; values (cid:60) >> R ( ∞ ) = S (0).First, we focus on the task of simulating a population where s (0) = S (0) N = r ( ∞ ) = R ( ∞ ) N = 1. For this particular case, (cid:60) >> t ∗ M and i ∗ M .2. τ is the main parameter that determines the position of the peak in i ( t ). We estimatea value for τ that brings the maximum in i ( t ) near t ∗ M .3. We get an approximate value for τ from the expression i ∗ M = 1 − (cid:60) (1 + ln (cid:60) ) [Weiss, 2013].4. The value i ∗ M yields N in the particular case of R ( ∞ ) = N = c M . We adjust thevalue of N to agree with i ∗ M .5. We minimize the root-mean square deviation, χN = (cid:113) n (cid:80) ni =1 ( C i − R ( t i )) , betweenthe number of accumulated cases predicted by the model, R ( t ), and the recordeddata, C i , to ﬁnd optimal values for τ , and τ . Similarly to the SHR analysis we have presented above, we illustrate the performance ofthe SIR model by ﬁrst looking at the distribution of deaths and infections in Spain (Fig. 8).The lower left panel shows how numerical solutions to SIR equations match very well thetemporal behavior of the epidemics under the condition s (0) = r ( ∞ ) = 1 for optimizedvalues of τ and τ (Table 3). Dispersion of data in the daily reported cases is usuallysmaller before the peak is reached (the quasi-exponential region) and ﬂuctuations grow in9mportance after the maximum is reached– which is a general observation holding for mostof the countries we have studied. We assign it to the balance between diﬀerent currents transferring individuals between the three classes, the phenomenon responsible for theappearance of a single maximum in daily cases for a given wave in the pandemics.The proposed procedure works for the deaths subset as follows. First, the curve of dailycases is followed up to the appearance of its maximum, which to circumvent the noise isidentiﬁed from a smoothed curve obtained by a ﬁve-day moving average, I ∗ M ( t ∗ M = 96) =17 . I M ( t M = 94) = 20 . N = 383, which is oﬀ the ﬁnal mark by about 40%.Once the maximum is identiﬁed, the quasi-exponential behavior near the origin is usedto estimate an initial value for τ (supplementary material, Eq. S6). For Spain, the ﬁrstcase happens at t = 65, and the ﬁrst inﬂexion point is at t = 82. Therefore, the ﬁrst10 points (about halfway to t ) are used to get an exponential ﬁt to the accumulatednumber of cases that yields τ ≈ . ± .

3. Such a value, combined with an initial guess (cid:60) = 10 produces a maximum in the curve of daily deaths at t M = 103. Accordingly, τ is decreased until we locate the maximum closer to the right position. For τ = 2 . t M = 95 and I M = 11 . . . and start an eﬃcient local Levenberg-Marquardt minimization ofthe root-mean-squared deviation between the actual data and the computed values. Thisis done to simultaneously optimize N , τ and τ (Fig. 8, left-lower panel). Taking intoaccount that only data up to six days past the maximum have been used, it is remarkablethat this self-consistent procedure reduces the fractional error between the prediction ofthe SIR model and the data from 40% to ± χN = 0 . t = 32 vs t = 65), but need more time to attain its maximum value ( t = 54 vs t = 25 after the ﬁrstcase).A prominent feature is the existence of the second wave of infections separated from theﬁrst one by a region of sustained constant number of cases, as we have discussed in Fig. 1. Toﬁt the data, we superpose the two waves, each with its own deﬁning parameters. However,the constant region between waves cannot be easily accommodated in these models and it isa clear indication of a diﬀerent stage in the epidemics with low but sustained transmissionof the disease at a pace similar to the one at which individuals are removed (while in theSIR model usually it is assumed that τ > τ ). We shall come back to this point in the10ontext of Monte-Carlo simulation. Finally, we notice that this second wave of infectionshas ﬁnally overlapped with a third one, as it is noticeable in Fig. 1. We have applied the same procedure to Germany, a country which had in the ﬁrst waveabout four times less casualties per million inhabitant than Spain. The left panel of Fig. 9shows the ﬁnal iteration for the daily and accumulated number of deaths, which againpredicts the total number due to the ﬁrst wave with accuracy ≈ ±

3% of the ﬁnal truevalue, even if we have only used data up to six days past the maximum. The RMSDbetween the accumulated data and the predicted function is χN = 0 . t = 28) w.r.t deaths ( t = 70). Furthermore, maximumvalues are attained after a longer amount of time ( t = 63 for infections ( t = 63) than fordeaths ( t = 30), counted after the ﬁrst case, following a similar procedure to the one forSpain. Finally, similar results have been observed in four more countries: France, Italy, GreatBritain and Switzerland (Fig. 10 and Table 3).

To gain further insight into the spatio-temporal evolution of COVID-19, we consider nexta stochastic discrete-time individual-based model in which the spread propagates on atwo-dimensional N × N lattice, where each node represents an individual. The dynamicsare Markovian, and we will use Monte Carlo (MC) to sample from its distributions intime, which is a technique known to handle well diﬃcult collective eﬀects in many-bodysystems, like e.g. the magnetic phase transition in the 2D Ising model[Peliti, 2011]. The N individuals can be in any of the three states of the SIR model, making transitionsbetween them with two probabilities: (i) for someone susceptible to be infected S → I , p i and, (ii) for someone infected to recover and be removed I → R , p r . At each time-step,individuals make transitions between classes according to the corresponding probabilities.We consider various scenarios of uniform and spatially dependent Markov dynamics.First, we start with a single isolated case of infection per 10 individuals, and we use p i ( t ) = i ( t ) τ for S → I in close analogy to the SIR model, while we assign a constantvalue p r = τ to the second transition probability, I → R . Comparing MC simulationsfor N = 100 with p i ( t ) = i ( t ) and p r = to the deterministic SIR with τ = 2 . τ = 9 yields an excellent agreement between both approaches for same initial values,which conﬁrms the adequacy of Monte-Carlo techniques (left panel in Fig. 11, where bothresults cannot be distinguished). By way of example, we modify the model to increasethe probability of infection of individuals in next-neighbors contact with members alreadyinfected to p i = i ( t ). As expected, infections grow faster near the onset, the dailymaximum happens earlier and results in a larger and narrower peak (while keeping theﬁnal total number the same, Fig. 11 blue dotted line compared to thick dotted line).On the other hand, a scenario where the infection probability is kept constant ( p i = 0 . p r = 0 .

05) results in a wider and smaller maximum (the infection and recover constantprobabilities have been adjusted to yield the peak near the same MC steps on the previouscases, Fig. 11 red dotted line compared to thick dotted line). For these conditions, atypical temporal evolution of individuals (pixels) is shown in Fig. 12. A weak tendencyto clustering is observed, although the system is seen to reach a quasi-homogeneous statefast.Unlike SIR, this model can sustain in a natural way a constant background of infectionsif at some point in the epidemics p i becomes very similar to p r , establishing a transientregime which we categorize as qualitatively diﬀerent from the region where the daily distri-bution derived from SHR or SIR is simply too low. This is a feature that can be observedin real data (Fig. 1 inset in the right-hand side).Finally, we checked how statistical properties of the model perform and scale underdiﬀerent lattice sizes and parameters via simulation. The distributions over time for N =100 and N = 1000 are virtually indistinguishable as long as the initial infectious individualsare kept in the same ratio. In order to further visualize the stochasticity under the chosenscale, we show in the right panel of Fig. 11 ten randomly chosen realizations out of onehundred runs with random initial positions of infectious in the lattice. As the starting daywhere the infection expands is random, we have rigidly displaced the time of the samplessuch that they peak on the same day. Then, the ten diﬀerent realizations and their averagedvalue lie nicely on the same curve and the standard deviation displayed in the inset is seento be acceptably small. Conclusions

We have analysed and compared three mathematical approaches of increasing complexity toinvestigate the dynamics of COVID-19. We have proved that a least-squares SHR-modelbased on the lognormal distribution is well suited to describe the epidemic’s evolutionusing only two free parameters per infection wave. Confronted against real data up to thesecond inﬂexion point, the values determined for these parameters are well converged andstable, guaranteeing fractional error bars of ± ables Table 1:

Parameters for SHR model (conﬁrmed deaths, ﬁrst wave).

P: country’spopulation (millions). µ and σ : parameters in the lognormal distribution. C ( ∞ ): asymp-totic value for accumulated cases (per million person). R and r : R-squared correlationfactors for C ( t ) and c ( t ), respectively. t M and t : maximum and second inﬂection point(origin is the 1st of January, 2020).COUNTRY P µ σ C ( ∞ ) R d t M t G. BRITAIN 66.65 3 . ± .

01 0 . ± .

02 688 ± . ± .

06 0 . ± .

03 606 ± . ± .

02 0 . ± .

01 581 ± . ± .

01 1 . ± .

04 456 ± . ± .

03 0 . ± .

02 451 ± . ± .

03 0 . ± .

02 199 ± . ± .

02 0 . ± .

01 110 ± . . ± .

02 0 . ± .

01 106 ± . . ± .

04 0 . ± .

03 80 . ± . . ± .

14 0 . ± .

03 59 . ± . Parameters for SHR model (conﬁrmed infections, ﬁrst wave).

P: coun-try’s population (millions). µ and σ : parameters in the lognormal distribution. C ( ∞ ):asymptotic value for accumulated cases (per million person). R and r : R-squared cor-relation factors for C ( t ) and c ( t ), respectively. t M and t : maximum and second inﬂectionpoint (origin is the 1st of January, 2020).COUNTRY P µ σ C ( ∞ ) R d t M t G. BRITAIN 66.65 4 . ± .

01 0 . ± .

02 4511 ±

11 0.9999 31/01 78 99SPAIN 46.94 3 . ± .

06 0 . ± .

03 5082 ±

14 0.9996 1/02 56 67ITALY 60.36 3 . ± .

02 0 . ± .

01 4083 ± . ± .

03 0 . ± .

02 2743 ±

12 0.9998 25/01 68 80USA 327.20 4 . ± .

04 0 . ± .

03 13300 ±

400 0.9391 21/01 49 107SWITZERLAND 10.23 3 . ± .

03 0 . ± .

02 4028 ± . ± .

02 0 . ± .

01 2038 ±

10 0.9999 27/01 63 23DENMARK 5.81 4 . ± .

12 0 . ± .

04 2056 ±

30 0.9994 27/02 42 61AUSTRIA 8.86 2 . ± .

05 0 . ± .

03 2334 ± . ± .

06 0 . ± .

02 1341 ±

12 0.9998 30/01 74 9614able 3:

Parameters for SIR model (ﬁrst wave). N (number of individuals), τ and τ (given in days). Upper: deaths per million people. Lower: infections per million people.COUNTRY N τ τ SPAIN 622 1.81 18.23G. BRITAIN 621 2.57 22.47ITALY 586 2.47 25.59FRANCE 454 3.68 19.41SWITZERLAND 200 2.97 16.70GERMANY 112 2.48 23.15COUNTRY

N τ τ SPAIN 5082 3.11 18.05GERMANY 2111 3.49 12.5015 eferences [Anderson, 1991] Anderson, R. M. (1991). Discussion: The kermack-mckendrick epidemicthreshold theorem.

Bulletin of Mathematical Biology , 53(1):3 – 32.[Anderson et al., 2020] Anderson, R. M., Heesterbeek, H., Klinkenberg, D., andHollingsworth, T. D. (2020). How will country-based mitigation measures inﬂuencethe course of the COVID-19 epidemic?

The Lancet , 395:931–934.[Chan et al., 2006] Chan, J. S. K., Yu, P. L. H., Lam, Y., and Ho, A. P. K. (2006). Mod-elling sars data using threshold geometric process.

Statist. Med. , 25:1826–1839.[Dong et al., 2020] Dong, E., Du, H., and Gardner, L. (2020). An interactive web-baseddashboard to track COVID-19 in real time.

Lancet Inf. Dis. , 20:533–534.[Economist, 2020] Economist (2020). Forecasting covid-19 – a terrible toll.

The Economist ,page 77. 23rd May.[Gang, 2020] Gang, X. (2020). A novel monte carlo simulation procedure for modellingcovid-19 spread over time.

Scientiﬁc Reports , 10:13120.[Girona, 2020] Girona, T. (2020). Conﬁnement time required to avoid a quick rebound ofcovid-19: Predictions from a monte carlo stochastic model.

Front. Phys. , pages 1–7.[Herzberge and Hecketsweller, 2020] Herzberge, N. and Hecketsweller, C. (2020). Lesmod`eles d´eboussoles pour pr´edire l’´evolution de l’´epid´emie due au coronavirus.

LeMonde .[Hethcote, 2000] Hethcote, H. W. (2000). The mathematics of infectious diseases.

SIAMReview , 42:599–653.[Johnson et al., 1994] Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994).

ContinousUnivariate Distributions . Wiley and Sons, N.Y.[Kermack and McKendrick, 1927] Kermack, W. O. and McKendrick, A. (1927). A contri-bution to the mathematical theory of epidemics.

Proc. Royal Soc. of London , 115:700–721.[Lam, 1988] Lam, Y. (1988). Geometric process and replacement problem.

Acta Mathe-matica Applied Sinica , 4:366–377.[M¨uller and Regensburger, 2012] M¨uller, S. and Regensburger, G. (2012). Generalizedmass action systems: Complex balancing equilibria and sign vectors of the stoichiometricand kinetic-order subspaces.

SIAM Journal on Applied Mathematics , 72(6):1926–1947.16Nature, 2020] Nature (2020). Nature wades through the literature on the new coronavirus– and summarizes key papers as they appear.

Nature .[Peliti, 2011] Peliti, L. (2011).

Statistical Mechanics in a Nutshell . Princeton UniversityPress.[Press et al., 2007] Press, W., Flannery, B., Teukolsky, S., and Vetterling, W. (2007).

Nu-merical Recipes . Cambridge University Press, 3 edition.[Roser et al., 2020] Roser, M., Ritchie, H., Ortiz-Ospina, E., and Hasell,J. (2020). Coronavirus pandemic (covid-19).

Our World in Data .https://ourworldindata.org/coronavirus.[Weiss, 2013] Weiss, H. (2013). The SIR model and the foundations of public health.

MatMat , 3:1–17.[Wenbin et al., 2013] Wenbin, W., Ziniu, W., Chunfeng, W., and Hu, R. (2013). Mod-elling the spreading rate of controlled communicable epidemics through an entropy-basedthermodynamic model.

Science in China Series G Physics Mechanics and Astronomy ,56:2143–2150.

Conﬂict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial orﬁnancial relationships that could be construed as a potential conﬂict of interest.

Author Contributions

All the authors contributed equally to the research and the manuscript.

Funding

L. Hoessly is supported by the Swiss National Science Foundations Early Postdoc.Mobilitygrant (P2FRP2 188023). This work has been ﬁnanced by the Spanish MINECO (MAT2017-85089-C2-1-R) and the European Research Council under contract (ERC-2013-SYG-610256NANOCOSMOS). Computing resources have been provided by CTI-CSIC. Open access ispartly funded by CSIC. The funders had no role in study design, data collection and anal-ysis, decision to publish, or preparation of the manuscript. The authors have declared thatno competing interests exist. 17 cknowledgments

The authors are grateful to Profs. F. Flores and R. Ramirez for useful suggestions andcomments on this manuscript. 18igure 1:

SHR/Spain.

Left/Right panels: deaths/infections related to COVID-19. Data(circles) are taken from [Roser et al., 2020, Dong et al., 2020]. Dashed curves ﬁt the datausing Eqs 1 and 2. Blue: total accumulated cases per million inhabitant. Red: daily casesper one hundred million inhabitants (the factor ×

100 is introduced only for the sake ofbetter visibility on the scale of total cases). The black thin line is a 7-day moving average ofdata. The green dashed line is the averaged representative curve discussed in section 2.1.5.Red and blue thin dotted lines give the contributions of individual waves. The inset (left)gives the percentage between deaths and infections from March to September. The inset(right) enlarges region II where a nearly constant number of infections takes place (red:least-squares ﬁt to data and constant mean value. Black: 7-day moving average). Changesin data collection methodology took place on April 19th, April 25th and November 4th,producing discontinuities on the data.Figure 2:

SHR/Germany.

Symbols as in Fig. 1.19igure 3:

SHR/Other countries (I).

Symbols as in Fig. 1. Changes in methodologytook place in United Kingdom (GRB) on May 20th and July 3rd, and in France (FRA) onMay 28th. 20igure 4:

SHR/Other countries (II).

Symbols as in Fig. 1.21igure 5:

Comparison of the evolution of number of deaths in NYC (9.1 Mpeople, orange) and in the Community of Madrid (6.7 M people, red) duringthe ﬁrst wave.

The data and SHR ﬁts for both locations were juxtaposed matching theday with the maximum number of deaths, aiming to highlight the similarities. The valuesof the CAM were also scaled to the ratio of population between the two regions ( . . ) toenable a better comparison. 22igure 6: Averaged proﬁle.

Daily c ( t ) (left panel) and accumulated C ( t ) = r ( t ) cases(right panel) for ten diﬀerent countries, normalized to its maximum value and displacedrigidly in time so C (cid:48)(cid:48) ( t ) = c (cid:48) ( t ) = 0 the same day. Color codes are: (1) Spain (blue),(2) Germany (red), (3) France (green), (4) USA (orange), (5) Italy (magenta), (6) GreatBritain (cyan), (7) Switzerland (purple), (8) Denmark (brown), (9) Austria (darker blue),(10) Finland (darker red). The black thick dashed line gives the average over the tencountries, with µ = 3 . σ = 0 .

56. 23igure 7:

Accuracy of SHR best ﬁts.

Starting at the second inﬂection point, t ,fractional error in the evolution of the predicted accumulated number of cases C ( t ) for: (1)GBR (green), (1) ESP (blue), (2) ITA (red), (3) GBR (green), (4) FRA (orange), (5) USA(cyan), (6) CHE (darker green), (7) DNK (darker blue), (8) DEU (darker red), (9) AUT(darker orange), (10) FIN (darker cyan). The region ±

5% is delimited by black dashedlines. A common normalizacion has been used by making C ( t + 40) = 1 for all cases.24igure 8: SIR/Spain.

Left upper panel: exponential ﬁt near the onset. Right upperpanel: initial iteration for deaths (see text). Left lower panel: ﬁnal iterations for deaths.Right lower panel: ﬁnal iterations for infections. Blue: accumulated cases, R ( t ) (per millionpeople). Red: daily cases, I ( t ) ( ×

100 to increase visibility in the same scale as R ). Blackis a 7-day moving average of data to help the eye.25igure 9: SIR/Germany.

Left/Right panels: Deaths/Infections. Blue: total cases, r ( t ).Red: daily cases, i ( t ) ( × SIR (deaths)/FRA, ITA, GBR and CHE.

Respectively left to right andtop to bottom. Symbols and lines as in Fig. 8.26igure 11:

Monte-Carlo simulations . Left panel: MC (continous) vs SIR (dashed). N = 10000. Red/Magenta: infected. Blue/Cyan: removed. Parameters used in MC are: p i = 0 . ∗ i ( t ), p r = 0 .

1. Parameters used in SIR are: τ = 2, τ = 8. Middle panel: variousMC scenarios for infected (dotted) and removed (dashed). Thick black: probabilities asin left panel. Blue: nearest-next neighbors increased probability of infection increases to p i = 0 . i ( t ). Red: both probabilities for infection and removal are kept constant values, p i = 0 . p r = 0 .

05. Right panel: average and standard deviation (inset) of 10 randomrealizations for p i = 0 . ∗ i ( t ) (double for nearest-neighbors infections) and, p r = 0 . MC/Spatial.

Typical evolution of individuals (pixels) with MC steps (10 stepsbetween frames). Green, red and blue correspond to susceptible, infected and recovered.Other parameters are: N = 100, p i = 0 . p i = 0 . p r = 0 ..