A self-supervised neural-analytic method to predict the evolution of COVID-19 in Romania
Radu D. Stochiţoiu, Marian Petrica, Traian Rebedea, Ionel Popescu, Marius Leordeanu
Radu D. Stochiţoiu
Faculty of Automatic Control and Computers, University Politehnica of Bucharest
Traian Rebedea
Faculty of Automatic Control and Computers, University Politehnica of Bucharest
Ionel Popescu
Institute of Mathematics of the Romanian Academy, University of Bucharest
Marius Leordeanu
Institute of Mathematics of the Romanian Academy, University Politehnica of Bucharest
Abstract
Analysing and understanding the transmission and evolution of the COVID-19 pandemic is essential for designing the best social and medical policies, foreseeing their outcomes and dealing with the subsequent socio-economic effects. We address this important problem from a computational and machine learning perspective. We are, to the best of our knowledge, the first to tackle the task for the case of Romania. More specifically, we want to statistically estimate all the relevant parameters of the spread of the new coronavirus, such as the reproduction number, fatality rate or length of the infectiousness period, based on Romanian patients, as well as be able to predict future outcomes. This endeavor is important, since it is well known that these factors vary across the globe and might depend on many causes, including social, medical, age and genetic factors. At the core of our computational approach lies the recently published, state-of-the-art work of Chowdhury et al. [2020], which proposes an improved version of SEIR, the classic, established model for infectious diseases. We want to infer all the parameters of the model, which govern the evolution of the pandemic in Romania, based on the only reliable, true measurement, which is the number of deaths. The true number of infected people is in reality impossible to know precisely. Once the model parameters are estimated, we are able to predict all the other relevant measures, such as the number of exposed and infectious people and many other factors, as shown in this paper. To this end, we propose a self-supervised approach to train a deep convolutional network to guess the correct set of Modified-SEIR model parameters, given the observed number of daily fatalities. Then, starting from these initial parameters, we refine the solution with a stochastic coordinate descent approach. We compare our deep learning optimization scheme with the classic grid search approach and show great improvement in both computational time and prediction accuracy. We find an optimistic result in the case fatality rate for Romania, which may be around 0.3%, and we also demonstrate that our model is able to correctly predict the number of daily fatalities for up to three weeks in the future (the latest available data at the moment of writing), while staying around the intervals defined by the recent machine learning approach (Gu [2020]) currently used in the United States of America and the predictions from IHME (IHME COVID-19 health service utilization forecasting team [2020]).
Introduction
While cases of the new coronavirus, COVID-19, appeared in China as early as December 2019 and January 2020 (Organization [2020]), the first documented cases of COVID-19 in Romania emerged in the middle of March 2020. Early understanding of the dynamics of an infectious disease is fundamental to being able to act on time and take the best safety measures for the population. Existing powerful mathematical models based on differential equations are able to predict reasonably well the evolution of the different curves (e.g. number of infected and hospitalized people, fatalities), given the correct set of model parameters. However, the inverse learning problem of finding the best parameters given the observed data is not an easy task, especially when the problem is not convex and several distinct sets of parameters constitute relatively good local optima. The existence of multiple solutions is an important aspect, as we want to be able to predict future outcomes as well as learn fundamental parameters, such as the reproduction number and the fatality rate.
We take a dual neural-analytic approach, which combines the power of analytical solutions to model and predict data with a relatively small set of meaningful parameters with the power of deep neural networks to learn the inverse problem, that of estimating the correct parameters from the observed data. We propose an effective self-supervised training and prediction scheme, in which the two pathways, one classic and analytical (using differential equations) and the other based on machine learning (using a deep convolutional network), can feed each other in tandem. During the self-supervised training phase, we sample random parameters of an improved, state-of-the-art modified SEIR model (Chowdhury et al. [2020], Goh [2020]) (which we will refer to as Modified-SEIR) within medically acceptable intervals, and use the model to generate fatality curves for a given period. 
Then we train the neural net to predict the known generator set of parameters, given the curves generated by the analytical model. At test time, the network is used to rapidly estimate the correct set of parameters from the real, observed curve of daily fatalities. Then, the set of parameters is further optimized by stochastic coordinate descent to minimize the L2-norm between the curve predicted by the analytical Modified-SEIR model (Chowdhury et al. [2020]) and the real curve of daily fatalities. Note that we estimate the correct set of model parameters solely from the number of fatalities, since, as mentioned above, that number is the only one that could be measured correctly. The true number of infected people is impossible to know, mainly because of the limitations in testing and the relatively large portion of the population that is asymptomatic.
One of the first measures taken by the authorities in Romania was to impose a very strict social distancing plan. This restriction has a major impact on the basic reproduction number of the SEIR model, usually reducing it by a percentage that lies between 40% and 80% (Kelso et al. [2009], Read et al. [2020]). As the social distancing norms were relaxed on May 15, 2020, we consider two simulated scenarios, so that we can analyse and compare the impact of heavy vs. moderate isolation restrictions. We now know, based on Qian et al. [2020], that there are Gaussian simulations that could help us make a more educated guess for a date when to lift the lockdown completely. Considering that Romania is among the most religious of the 34 countries in Europe, ranked 1st in Europe by the combined index (Center [2018] and Jonathan et al. [2018]), we are also aware that Easter (April 19, 2020), when people tend to gather in larger groups, may have influenced the dynamics of COVID-19.
The Romanian authorities have not yet presented any official analysis or predictions for the evolution of the pandemic. 
However, Romania urgently needs a rigorous data-driven approach, based on actual cases and statistics, with thorough computational models, for dealing with such pandemics. Note that using and learning from data collected from a specific country is important, as it is well known that a model that best applies to one part of the world might not be optimal for another. From this perspective, our work brings genuine value, being the first to propose a successful computational approach for estimating the evolution of COVID-19 in Romania.
The two main contributions that we make in this paper are:
1. We present the first computational approach to predict the evolution of COVID-19 in Romania, based on publicly available data. For prediction we implement a recently proposed mathematical model based on SEIR (Chowdhury et al. [2020], Goh [2020]) (referred to as Modified-SEIR) for estimating the evolution of infectious diseases. We optimize the model to fit the data provided by the Romanian health authorities for the COVID-19 pandemic, using a novel deep learning approach (see the second contribution), trained in a self-supervised fashion. We estimate that on August 1, 2020, there will be 1640 total deaths by keeping the heavy social containment and up to 1861 deaths by slightly relaxing it. Our predictions are right around the bounds predicted by IHME (1614 deaths by August 1, as of June 12, 2020 - IHME COVID-19 health service utilization forecasting team [2020]) and the ML-based approach presented in Section 2 (1776 deaths by August 1, as of June 12, 2020 - Gu [2020]).
2. We introduce a novel self-supervised deep learning approach for fast optimization and learning of the parameters of the analytical Modified-SEIR model. More specifically, the convolutional network is trained on many outputs of the Modified-SEIR model, generated by random sets of parameters within medically acceptable ranges, to predict precisely the same generator parameter set in each case. After being trained on hundreds of thousands of such synthetic cases, the neural network is able to take us directly into the neighborhood of the best fitting parameters when presented with the real data curve. Then, a coordinate descent refinement procedure is applied at the end to obtain the final solution. Our experiments show that the proposed deep learning approach to optimization greatly improves speed and accuracy over the baseline (grid search with coordinate descent) optimization approach.
Shortly after the first cases of COVID-19 appeared in the world, an important research movement began, which aims at finding bounds for the characteristics of the infectiousness of this new coronavirus. In Kucharski et al. [2020] the authors show that the basic reproduction number is significantly influenced by travel restrictions, ranging from . to . . In their procedures they used an estimate of the incubation period equal to . days, but which can be as low as days, according to a study conducted in Wuhan (Li et al. [2020]). This study also found that the value of . is a good approximation for the basic reproduction number, which is also similar to the findings in our research reported here.
In Wu et al. [2020] the authors use a SEIR (Susceptible-Exposed-Infectious-Recovered) model to estimate the basic reproduction number at . on January 25 in Wuhan. At the same time, it reveals a worldwide incubation time that ranges from . to . . We see in this article the positive correlation between the basic reproduction number and the probability of creating an exponential epidemic starting with a single infected person.
The official report of the World Health Organization (WHO [2020]) introduces several observed parameters, including a basic reproduction number in the range of − . , an incubation period with an average in the range of − , a minimum hospitalization rate around . , represented by critically ill patients, and a maximum of ≃ , to which are added those . in severe condition, a period from incubation to death in a wide range of − weeks and a recovery time for mild cases topped by days and for severe cases up to days of hospitalization.
In several studies dealing with the estimation of the incubation period of viruses such as 2019-nCoV, SARS or MERS (Backer et al. [2020], Lau et al. [2010], Virlogeux et al. [2016]) we notice values in the range of . − . .
Another important studied factor is how strict the rules of social distancing must be in order to stop the number of cases from increasing. In Read et al. [2020] it is estimated that a reduction in the basic reproduction number by − is needed to stop the increase in the number of infected people, considering the basic reproduction number equal to . .
At the core of our prediction model is the Modified-SEIR (Chowdhury et al. [2020]), with a freely available toolbox, the Epidemic Calculator (Goh [2020]), which we use as a stepping stone in our own implementation and validation of our model and experiments.
There is thorough research focused on hospital needs (IHME COVID-19 health service utilization forecasting team [2020]) that predicts a total number of deaths equal to 1625 caused by a first wave of the infection. An ML-based project (Gu [2020]) assuming a mid-May reopening predicted 1776 deaths in Romania on Aug 1, 2020, as of June 12, 2020. While Artificial Intelligence (AI) solutions are sometimes biased and considered inadequate, we know that their true power comes from grasping the inner complexity of a problem by learning from experience (real data), which a simpler heuristic method cannot do.
Figure 1: Modified SEIR model: the "Removed" case is split into: 1) Recovered from mild symptoms; 2) Recovered from severe symptoms; 3) Deceased. The diagram follows the differential equations of the Modified-SEIR model presented in Table 1.
For solving a very difficult prediction problem, such as the ones revolving around COVID-19, Mihaela van der Schaar [2020] proposes AI-powered ways to manage limited healthcare resources, to develop personalized patient management and treatment plans, to inform policies, to enable effective collaboration and to better understand and account for uncertainty.
A limited number of published medical studies Popescu et al. 
[2020], Gherghel and Bulai [2020] present and discuss the epidemiology, clinical preparedness and medical challenges of the COVID-19 pandemic in Romania. However, ours, as mentioned previously, is the first to consider COVID-19 and its evolution from a computational perspective and offer an efficient model for learning and prediction. As shown next, our model accurately fits the real data, predicts future events (not seen during training) and estimates important core characteristics of the virus specific to Romania, such as the reproduction number, the length of infectiousness, time to recovery and fatality rate. Note that the related work presented above provides specific ranges, established in the medical literature, within which we optimize the parameters of the Modified-SEIR.
In order to predict the evolution and understand key factors of COVID-19 we use the recent mathematical model (Chowdhury et al. [2020]) based on the classic SEIR (Susceptible, Exposed, Infectious, Removed; Hethcote [2000]), which is a widely accepted standard for modeling the evolution of infectious diseases. The Modified-SEIR model follows the usual steps in which an infectious disease evolves. The evolution, along with key elements and measures, is fully described by a set of differential equations (Table 1) that we present in this section. We also offer a visual representation of the model in Figure 1.
We further divide the Removed section into three categories: recovered from mild symptoms ($R_M$), recovered from severe symptoms ($R_V$) and deceased ($R_F$). There is also an extra layer of differential equations in the middle which helps us better shape and understand the dynamics of the disease. Following the Modified-SEIR model (Chowdhury et al. 
[2020]) and the Epidemic Calculator (Goh [2020]), we assume that all fatalities come from hospitals, and that all severe cases are admitted to hospitals immediately after the infectious period ends.
The advantage of having an explicit analytic model versus a pure deep learning approach is that each parameter has a clear meaning that is easy to interpret. In this case, the meaning of each parameter used is described and summarized in Table 2. The ranges in which we search for the optimal parameters are fixed according to recently published medical research and their confidence intervals, as presented in Section 2. Note that we use constant values for some parameters, such as the total size of the population or the time from severe symptoms to hospitalization (in accordance with Goh [2020]).
Table 1: Modified SEIR: System of differential equations, which describe the evolution of key characteristics and measures of pandemics over time. In our work we consider the number of fatalities ($F$, $R_F$) as the only real variable that could be measured correctly and we optimize the model according to it.

$$
\begin{aligned}
\frac{dS}{dt} &= -\beta I S \\
\frac{dE}{dt} &= \beta I S - \sigma E \\
\frac{dI}{dt} &= \sigma E - \gamma I \\
\frac{dM}{dt} &= P_M \gamma I - \frac{1}{T_M} M \\
\frac{dV}{dt} &= P_V \gamma I - \frac{1}{T_H} V \\
\frac{dH}{dt} &= \frac{1}{T_H} V - \frac{1}{T_V} H \\
\frac{dF}{dt} &= P_F \gamma I - \frac{1}{T_F} F \\
\frac{dR_M}{dt} &= \frac{1}{T_M} M \\
\frac{dR_V}{dt} &= \frac{1}{T_V} H \\
\frac{dR_F}{dt} &= \frac{1}{T_F} F
\end{aligned}
$$

$$
\beta = \begin{cases} \dfrac{R_0}{T_{inf}}, & \text{before } T \\ (1 - P_T)\dfrac{R_0}{T_{inf}}, & \text{otherwise} \end{cases}
\qquad \sigma = \frac{1}{T_{inc}}, \qquad \gamma = \frac{1}{T_{inf}}, \qquad P_M = 1 - P_V - P_F
$$

In order to solve the time-based differential equations and produce the different evolution curves we use a fourth order Runge-Kutta integrator (Tan Delin [2012]).
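As a rough illustration of how the system in Table 1 can be integrated, the following Python sketch applies a fixed-step fourth order Runge-Kutta scheme. It is not the authors' implementation: the names `Params`, `derivatives`, `rk4_step` and `simulate` are hypothetical, the default parameter values are illustrative placeholders rather than fitted values, the initial exposed population is assumed to be zero, and the state is assumed to be expressed as population fractions (so that the term βIS needs no division by N), with counts recovered by multiplying by N at the end.

```python
from dataclasses import dataclass


@dataclass
class Params:
    # Illustrative placeholder values, NOT the fitted values from the paper.
    r0: float = 2.6      # basic reproduction number
    t_inc: float = 5.0   # incubation period (days)
    t_inf: float = 3.0   # infectiousness period (days)
    t_m: float = 10.0    # recovery time, mild cases (days)
    t_v: float = 20.0    # recovery time, severe cases (days)
    t_h: float = 5.0     # severe symptoms onset to hospitalization (days)
    t_f: float = 20.0    # end of infectiousness to death (days)
    p_v: float = 0.10    # severe symptoms rate
    p_f: float = 0.005   # case fatality rate
    t_int: float = 21.0  # intervention day T
    p_t: float = 0.6     # transmission decrease after intervention
    i0: float = 2000.0   # initial infections
    n: float = 20175912.0  # population size


def derivatives(t, y, p):
    """Right-hand side of Table 1; y holds population fractions (assumption)."""
    s, e, i, m, v, h, f, r_m, r_v, r_f = y
    beta = p.r0 / p.t_inf
    if t >= p.t_int:
        beta *= 1.0 - p.p_t
    sigma, gamma = 1.0 / p.t_inc, 1.0 / p.t_inf
    p_m = 1.0 - p.p_v - p.p_f
    return [
        -beta * i * s,
        beta * i * s - sigma * e,
        sigma * e - gamma * i,
        p_m * gamma * i - m / p.t_m,
        p.p_v * gamma * i - v / p.t_h,
        v / p.t_h - h / p.t_v,
        p.p_f * gamma * i - f / p.t_f,
        m / p.t_m,
        h / p.t_v,
        f / p.t_f,
    ]


def rk4_step(t, y, dt, p):
    """One classic fourth order Runge-Kutta step."""
    k1 = derivatives(t, y, p)
    k2 = derivatives(t + dt / 2, [a + dt / 2 * b for a, b in zip(y, k1)], p)
    k3 = derivatives(t + dt / 2, [a + dt / 2 * b for a, b in zip(y, k2)], p)
    k4 = derivatives(t + dt, [a + dt * b for a, b in zip(y, k3)], p)
    return [a + dt / 6 * (b + 2 * c + 2 * d + e)
            for a, b, c, d, e in zip(y, k1, k2, k3, k4)]


def simulate(p, days, steps_per_day=10):
    """Return (daily deaths as counts, final state as fractions)."""
    y = [1.0 - p.i0 / p.n, 0.0, p.i0 / p.n] + [0.0] * 7
    dt = 1.0 / steps_per_day
    cumulative = [0.0]
    for day in range(days):
        for k in range(steps_per_day):
            y = rk4_step((day * steps_per_day + k) * dt, y, dt, p)
        cumulative.append(y[9] * p.n)  # R_F scaled back to people
    daily = [b - a for a, b in zip(cumulative, cumulative[1:])]
    return daily, y


if __name__ == "__main__":
    daily_deaths, _ = simulate(Params(), 60)
    print([round(d, 2) for d in daily_deaths[:10]])
```

The daily curve is taken as the day-to-day increment of the cumulative $R_F$ compartment, which is the quantity the fitting procedure compares against reported daily deaths.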
As the number of new fatalities per day is easily known and is not influenced by the number of actual tests run in any location, we consider it as ground truth in our experiments. Even though this number mixes patients with comorbidities (whose health is also influenced by other conditions) with those without comorbidities, the actual measured number of infected people who are dying is ultimately the only estimation that can be considered certain in the current research. We use the data uploaded daily by Johns Hopkins CSSE.
Our purpose is to find a set of parameters for which the curve modelled by the Modified-SEIR best approximates the real curve of the daily number of fatalities. We denote by $D_{real}$ the vector (curve) of the reported daily number of deaths in a specific time interval and by $D_\theta$ the vector (curve) generated by the model with parameters $\theta$ for the same time interval. Thus, the cost function that we minimize to find the best fitting parameters is the square root of the sum of squared errors between the real and the predicted curve, as shown in Equation (1). We denote the optimal set of parameters by $\theta^*$, as defined in (2). The valid search (optimization) ranges for each parameter are presented in Table 2, as mentioned previously, and constitute relatively large search regions, as unions over the ranges published in recent medical literature.
We test and compare two main ways of finding the optimal parameters. One is a baseline, which starts with a classic grid search procedure followed by a stochastic coordinate descent refinement. The second optimization method, which is our main technical novelty, makes an initial guess of the parameters using the self-supervised trained convolutional network, followed by the same refinement step using stochastic coordinate descent. Each optimization module (grid search, neural network and coordinate descent) is described in the next sections. 
$$J(\theta) = \sqrt{\sum_i \left( D_{real}(i) - D_\theta(i) \right)^2} \tag{1}$$

$$\theta^* = \arg\min_\theta J(\theta) \tag{2}$$

Table 2: Modified SEIR parameters. "Deduced" means that the values are deduced from the system of differential equations found in Table 1.

Name | Description | Initial value | Range
$S$ | Susceptible population | $N - I_0$ | Deduced
$E$ | Exposed population | ? | Deduced
$I$ | Infectious population | $I_0$ | Deduced
$M$ | Recovering at home with mild symptoms | 0 | Deduced
$V$ | Recovering at home with severe symptoms | 0 | Deduced
$H$ | Recovering in hospital with severe symptoms | 0 | Deduced
$F$ | Dying | 0 | Deduced
$R_M$ | Recovered from mild symptoms | 0 | Deduced
$R_V$ | Recovered from severe symptoms | 0 | Deduced
$R_F$ | Dead (Fatal) | 0 | Deduced
$P_M$ | Mild symptoms rate | | Deduced
$P_V$ | Severe symptoms rate | | [0.? − ?.?]
$P_F$ | Case fatality rate | | [0.? − ?]
$T_{inc}$ | Length of incubation period (days) | | [2 − ?]
$T_{inf}$ | Length of infectiousness period (days) | | [3 − ?]
$T_M$ | Recovery time for mild cases (days) | | [4 − ?]
$T_V$ | Recovery time for severe cases (days) | | [7 − ?]
$T_H$ | Time from severe symptoms onset to hospitalization (days) | | [5]
$T_F$ | Time from end of infectiousness to death (days) | | [14 − ?]
$R_0$ | Basic reproduction number | | [1.? − ?.?]
$T$ | Intervention time to reduce $R_0$ (days) | | [20 − ?]
$P_T$ | Percentage to decrease transmission by after intervention | | [40% − ?]
$\beta$ | Transmission rate | | Deduced
$\sigma$ | Rate of getting infectious from being exposed | | Deduced
$\gamma$ | Recovery rate | | Deduced
$I_0$ | Number of initial infections | | [1500 − ?]
$N$ | Total size of population | | [20175912]
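The cost in Equation (1) and the minimization in Equation (2) can be sketched in a few lines of Python. The function names `cost` and `best_parameters`, as well as the idea of choosing among a finite list of candidate parameter sets, are assumptions made only for illustration; `model` stands for any callable that maps a parameter set to a predicted daily-deaths curve.

```python
import math


def cost(d_real, d_theta):
    """Equation (1): square root of the sum of squared errors between two curves."""
    if len(d_real) != len(d_theta):
        raise ValueError("curves must cover the same time interval")
    return math.sqrt(sum((r - m) ** 2 for r, m in zip(d_real, d_theta)))


def best_parameters(candidates, d_real, model):
    """Equation (2), restricted to a finite list of candidate parameter sets."""
    return min(candidates, key=lambda theta: cost(d_real, model(theta)))
```

For example, `cost([3.0, 4.0], [0.0, 0.0])` evaluates the distance between a two-day observed curve and an all-zero prediction.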
There are 11 parameters that we optimize over: $I_0$, $R_0$, $T_{inc}$, $T_{inf}$, $P_F$, $T_F$, $T_M$, $T_V$, $P_V$, $P_T$, $T$. Therefore a full and very fine grid search is computationally infeasible. However, we can divide each range into 2 to 4 smaller ranges and look for a decent approximation to start with. The grid search module is followed by the coordinate descent refinement procedure (Section 4.4).
Because of the long computational time required by grid search optimization, we propose a deep learning approach, using a convolutional neural network trained in a self-supervised manner, as discussed previously, that is able to bring the solution into the neighborhood of the optimum very fast. The neural network optimization module is also followed by the same final coordinate descent procedure. In our experiments, the results obtained with the neural net optimization are vastly superior in both speed and accuracy to the grid search approach.
Below we present the exact steps taken for the self-supervised scheme in which the neural network learns to guess the right set of parameters, given a curve of daily deaths for a given period of time.
1. Create a dataset
• Take 100000 random samples from a uniform distribution of the 11 parameters in the ranges indicated in Table 2;
• Generate a curve of daily fatalities using the Modified-SEIR model (Table 1) for each set of model parameters picked at the previous step. Even though the estimated total number of deaths is cumulative (the $R_F$ variable), we take the daily deaths number as daily increments in the total number of fatalities.
• For each curve we randomly select a fixed number $L$ of consecutive days of daily deaths ($L$ is defined by the number of days in certain time ranges in our experiments, such that for [March 22, May 3], $L = 43$). This vector of $L$ consecutive numbers, representing the fatalities for the corresponding $L$ days, modeled by Modified-SEIR, along with the corresponding set of parameters, will constitute the 100K training pairs used in training the neural net optimizer presented below. Note that the start of the $L$-day sequence is chosen randomly (so it could be at the beginning or towards the end of the pandemic). We try to mirror the real case, when we really do not know which day should be considered the first of the pandemic.
2. Multi-head neural network optimizer, trained in a self-supervised way, to predict the Modified-SEIR model parameters, given the $L$-element curve of daily fatalities (generated by precisely the same set of parameters that should be predicted by the network).
• Hidden layers
(a) Conv1D (512, 5, 'relu')
(b) MaxPooling1D
(c) Conv1D (128, 5, 'relu')
(d) MaxPooling1D
(e) Conv1D (32, 5, 'relu')
(f) MaxPooling1D
(g) Flatten
(h) Dense (512)
(i) Dense (256)
(j) Dense (128)
• One output for each parameter
• Loss: Mean Squared Error
• Optimizer: Adam (Kingma and Ba [2014])
We compare the grid search approach with the neural network predictions. 
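The dataset creation in step 1 can be sketched as follows. This is only a minimal illustration under stated assumptions: `toy_daily_deaths` is a hypothetical bell-shaped stand-in for the Modified-SEIR simulator, `toy_ranges` are made-up ranges for its three made-up parameters (not the ranges of Table 2), and `make_training_pairs` is a hypothetical helper name. Only the structure (uniform sampling, curve generation, random $L$-day window) follows the steps listed above.

```python
import math
import random


def toy_daily_deaths(theta, days):
    """Hypothetical stand-in for the Modified-SEIR simulator: a bell-shaped curve."""
    return [theta["peak"] * math.exp(-((t - theta["center"]) / theta["width"]) ** 2)
            for t in range(days)]


def make_training_pairs(model, ranges, n_samples, total_days, window, rng):
    """Build (L-day curve window, generator parameters) pairs for self-supervision."""
    pairs = []
    for _ in range(n_samples):
        # Uniform sample of every parameter inside its allowed range.
        theta = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        curve = model(theta, total_days)
        # Random start day: the window may fall early or late in the epidemic.
        start = rng.randrange(total_days - window + 1)
        pairs.append((curve[start:start + window], theta))
    return pairs


# Made-up ranges for the toy model only.
toy_ranges = {"peak": (10.0, 60.0), "center": (30.0, 120.0), "width": (10.0, 40.0)}

if __name__ == "__main__":
    pairs = make_training_pairs(toy_daily_deaths, toy_ranges, 1000, 200, 43,
                                random.Random(0))
    print(len(pairs), len(pairs[0][0]))
```

Each pair holds the network input (the windowed curve) and its regression target (the parameters that generated it), so no manual labels are needed.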
The advantage of the latter approach is that it offers almost instantaneous predictions. In Figure 2 we show the percentage of tasks (problems that are randomly generated by the model using random initialization) where the samples (sets of parameters) found by grid search produce superior curves (closer to the truth) than the ones predicted by the neural network. The percentage is a function of time, since for an infinite amount of time we expect grid search, with a sufficiently fine grid, to beat the neural network. The plot shows how vastly superior in terms of speed the neural network is. Almost 7 hours of running for grid search, on an Intel© Core™ i9-9980HK CPU @ 2.40GHz x 8, are not enough to surpass the neural net predictions.
We can expect that neither of the two approaches (grid search and neural net optimization) will directly produce an optimal solution, even though we do expect them to output a set of parameters that is close to a local optimum. In order to refine our results we further apply an iterative stochastic coordinate descent approach similar to Wright [2015]. Starting from the best predictions of a given first-stage module (neural net or grid search), we take random subsets of two parameters at a time, divide their search ranges into 20-40 parts around the current best solution and replace the values of the chosen parameters with the ones that minimize the cost function. We iterate the procedure until we reach a convergence of − absolute error.
We choose the subsets at random because we assume that some parameters influence the cost more than others. Thus, by choosing random subsets of parameters to optimize over, we avoid the risk of spending valuable time optimizing over subsets of parameters that do not bring much value.
Figure 2: Comparison between the grid search and the neural network solutions. We present how often grid search produces better solutions than the neural net. We clearly see that 7 hours of computation are far from sufficient for grid search to beat the neural net. Note that in the plot, the final refinement procedure is not used by either of the two approaches.
We consider three particular data sets on which we optimize and search for the best parameters of our model:
• Daily fatalities from March 22 to May 3 (2020);
• Daily fatalities from March 22 to May 14 (2020);
• Daily fatalities from March 22 to May 21 (2020).
We know that on May 15, 2020 the Romanian authorities changed the policy from state of emergency to state of alert. 
Because the time from incubation to death is greater than one week, we assume the reported daily fatalities from May 15, 2020 to May 21, 2020 are not influenced by the changed policy. This enables us to search for optimal parameters for the interval March 22, 2020 to May 21, 2020, as it can be modelled by Modified-SEIR.
For every data set, we present in Table 3 the optimal model parameters found with neural net optimization followed by the refinement step, as defined in Section 4. As stated previously, they minimize the L2 distance between the real and the predicted curves of fatalities from March 22 to May 3, May 14, and May 21, respectively (all in 2020). For each set of optimal parameters we compute the error of prediction for the following dates: May 3, May 15, May 21, June 3, June 8, June 9, June 10, June 11 (all in 2020).
Since the prediction errors shrink as the data set grows, the results indicate that our approach is data-driven and provides better solutions for bigger data sets. In other words, every new day of observations is important in finding the real parameters that shape the evolution of the COVID-19 infectious disease.
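To make the refinement step behind these results more concrete, the stochastic coordinate descent described earlier (random pairs of parameters, each searched on a grid of 20-40 parts around the current best solution) might be sketched as below. Several details are assumptions introduced for this illustration and are not taken from the paper: the function name `refine`, the halving of each search window after every visit, the stopping rule based on a cost threshold `tol`, and the toy quadratic cost used in the usage example.

```python
import math
import random


def refine(cost, theta, ranges, rng, parts=20, shrink=0.5, tol=1e-6, max_iters=500):
    """Stochastic coordinate descent over random pairs of parameters (sketch)."""
    theta = dict(theta)
    best = cost(theta)
    # Half-width of the search window per parameter; starts at the full range.
    half = {k: hi - lo for k, (lo, hi) in ranges.items()}
    names = sorted(theta)
    for _ in range(max_iters):
        if best <= tol:  # assumed stopping rule
            break
        a, b = rng.sample(names, 2)  # random subset of two parameters
        grids = []
        for k in (a, b):
            lo, hi = ranges[k]
            start = max(lo, theta[k] - half[k])
            stop = min(hi, theta[k] + half[k])
            values = [start + (stop - start) * j / parts for j in range(parts + 1)]
            values.append(theta[k])  # keep the current value as a candidate
            grids.append(values)
            half[k] *= shrink  # assumed: narrow the window after each visit
        for va in grids[0]:
            for vb in grids[1]:
                trial = {**theta, a: va, b: vb}
                c = cost(trial)
                if c < best:  # accept only strict improvements
                    best, theta = c, trial
    return theta, best


if __name__ == "__main__":
    # Toy problem: recover made-up target values inside made-up ranges.
    target = {"r0": 2.63, "p_f": 0.39, "t_f": 16.2}
    ranges = {"r0": (1.0, 4.0), "p_f": (0.0, 1.0), "t_f": (10.0, 20.0)}
    toy_cost = lambda t: math.sqrt(sum((t[k] - target[k]) ** 2 for k in target))
    start = {k: (lo + hi) / 2 for k, (lo, hi) in ranges.items()}
    print(refine(toy_cost, start, ranges, random.Random(0)))
```

In the actual pipeline the `cost` callable would wrap the Modified-SEIR simulation and Equation (1), and the starting point would be the network's prediction or the grid search result.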
It is worth mentioning that in general the inverse problem of finding the correct model parameters from the observation of partial data (in our case we observe only the values of one output variable, the number of daily fatalities) is ill-posed, since the problem is not necessarily convex and many different sets of parameters can produce similar output. However, this is a common case in machine learning, in which several valid solutions exist. The AI system usually learns to predict the most probable output (in this case, the set of parameters) given the observed data, based on its training experience. One good example is the case of vision, in which many different 3D worlds with different semantic interpretations could produce the same 2D image. Nevertheless, the visual system learns to pick the most likely interpretation (given its prior experience) out of infinitely many. In our specific case, we expect that the synthetic generation of many curves from different sets of parameters will help the neural network implicitly learn the priors in the data model, such that the network will learn to output, from the many different solutions, one that is most likely to have produced the given curve. We can easily imagine that there are certain distinct neighborhoods of parameters that generate similar curves and that a set of parameters coming from a larger neighborhood is more likely than one coming from a smaller one. In other words, sets of model parameters for which the output curve is more stable (low curve gradient w.r.t. parameters) are probably more likely than sets of parameters for which the curve changes rapidly in their immediate neighborhood. Such subtle priors in the space of parameters should be implicitly learned during the self-supervised training, if sufficiently many pairs of model generated curves (input) - parameters (output) are presented to the network.

Table 3: Best parameters found by our neural net optimization followed by the final coordinate descent refinement.

Name | Description | May 3 | May 14 | May 21
$I_0$ | Initial infectious population | ? | ? | ?
$R_0$ | Basic reproduction number | ?.63 | 2.63 | 2
$T_{inc}$ | Length of incubation period (days) | ? | ? | ?
$T_{inf}$ | Length of infectiousness period (days) | ? | ? | ?
$P_F$ | Case fatality rate | ?.39% | 0.68% | 0.?%
$T_F$ | Time from end of infectiousness to death (days) | 14 | 16.? | ?
$T_M$ | Recovery time for mild cases (days) | ?.28 | 4 | ?
$T_V$ | Recovery time for severe cases (days) | ? | ? | ?
$P_V$ | Severe symptoms rate | 5% | 15.6% | 10%
$P_T$ | Decrease in transmission after intervention | 60% | 62% | 56%
$T$ | Intervention time to reduce $R_0$ (days) | 21 | 21 | 21
Err. May 3 | Prediction absolute error on May 3, 2020 | ?.04% | 3.92% | 3.?%
Err. May 15 | Prediction absolute error on May 15, 2020 | ?.21% | 3.55% | 0.?%
Err. May 21 | Prediction absolute error on May 21, 2020 | ?.95% | 8.22% | 1.?%
Err. Jun 3 | Prediction absolute error on June 3, 2020 | ?.92% | 20.14% | 6.?%
Err. Jun 8 | Prediction absolute error on June 8, 2020 | ?.22% | 24.94% | 8.?%
Err. Jun 9 | Prediction absolute error on June 9, 2020 | ?.83% | 25.26% | 8.?%
Err. Jun 10 | Prediction absolute error on June 10, 2020 | ?.31% | 26.40% | 8.?%
Err. Jun 11 | Prediction absolute error on June 11, 2020 | ?.48% | 27.25% | 8.?%

It is clear by now that searching for the best parameters is a complex task, especially if the cost function is not convex, thus admitting multiple local minima. Besides the intuitive discussion, we also experimentally analyze this issue by comparing different 4D plots where the 3D space is defined by three different parameters and the fourth dimension, the final cost function, is defined by color. In Figure 3 one can see that there are regions where there is a linear dependency between two parameters preserving the minimum cost and that there are multiple intervals with local minima for sets of three parameters. The latter finding is especially interesting as it tells us that our prediction might not be the correct one even if it has the lowest cost. 
Or, in other words, different distinctive sets of parametersmay generate the same final cost. Thus, learning from larger sets of reported data, over longer periodsof time, may shift the balance towards a different set of optimal parameters which can immediatelymodify the foreseen dynamics of the coronavirus infectiousness and fatality. This situation, which iscommonly encountered in many AI tasks, reveal the inner ambiguity and difficulty of the problemtackled. What is however, important and relveant here, is that we are able to learn sets of parameterswhich are plausible, often matching many independent findings in the literature, which predict withsurprising accuracy the evolution of COVID-19 in Romania.Starting from the knowledge that a particular parameter set might not be necessarily the correct oneeven if it has the lowest cost on the limited training data, we compare the predictions of two differentsets of parameters that have similar L2 costs (very close to the lowest cost found for the interval fromMarch 22, 2020, to May 3, 2020). In Figure 4 we present the evolution curves corresponding to thesetwo sets of parameters. Both of them seem to fit the given data equally well.9igure 3: Plots of the cost function, represented by color, computed for joint distributions ofthree parameters at a time. The cost function goes from black (optimal/smallest) to white (leastoptimal/greatest). For better visualization we upper bound the cost at 200.10igure 4: Daily deaths fitted curves for two sets of parameters with similar cost. The red dotsrepresent the officially reported daily number of deaths in Romania. The blue and green lines withdots show the fitting of the Modified-SEIR models through the real data.Figure 5: Daily deaths fitted curves extrapolation for two sets of parameters with similar costs (onseen data). The red dots represent the officially reported daily number of deaths in Romania. 
The blue and green lines show the extrapolation of the Modified-SEIR models through the real data. Note how different the future predictions are between the two sets of parameters.

Surprisingly, when we extrapolate the curves into the future (by running the models forward with the corresponding parameters), we notice substantially different evolution curves. In Figure 5 we notice how one approximation goes down quickly, while the other continues to increase for another three months.

This subtle change in the way the two fitted curves diverge at the end of the data set has drastic consequences for, among other characteristics such as the number of active infections, the total number of fatalities. Figure 6 shows us that while the fitted curves for the observed interval (March 22, 2020, to May 3, 2020) are similar, one set of parameters predicts a total of 2290 deaths, while the other predicts a total of 10225. We compare their predictions in Table 4.

As discussed in the previous section, the approach based on a neural network optimization followed by a stochastic alternative coordinate descent seems to overfit the data set. In order to help our model learn parameters that generalize better in the future, we add a relative future cost based on the real numbers to the cost presented in Section 4; the new formula is given in Equation 3.

Figure 6: Cumulative extrapolation of the total number of fatalities for two sets of parameters with similar cost. The red dots represent the officially reported total number of deaths in Romania. The blue and green lines show the cumulative extrapolation of the Modified-SEIR models through the real data.

Table 4: Two sets of parameters for the Modified-SEIR model with similar costs but very different evolutions.

Name           Description                                            Set 1   Set 2
I              Initial infectious population
R              Basic reproduction number                              . 63 2 .
T_inc          Length of incubation period (days)
T_inf          Length of infectiousness period (days)                 . .
P_F            Case fatality rate                                     . 39% 0 .
T_F            Time from end of infectiousness to death (days)        14 25 .
T_M            Recovery time for mild cases (days)
T_V            Recovery time for severe cases (days)
P_V            Severe symptoms rate                                   5% 10%
P_T            Decrease in transmission after intervention            60% 59 .
T              Intervention time to reduce R (days)                   21 21
J(θ)           L2 cost of the fitting                                 . 285 33 .
Err. May 3     Prediction absolute error on May 3, 2020               . 04% 1 .
Err. May 15    Prediction absolute error on May 15, 2020              . 21% 0 .
Err. May 21    Prediction absolute error on May 21, 2020              . 95% 5 .
Err. Jun 3     Prediction absolute error on June 3, 2020              . 92% 10 .
Err. Jun 8     Prediction absolute error on June 8, 2020              . 22% 13 .
Err. Jun 9     Prediction absolute error on June 9, 2020              . 83% 13 .
Err. Jun 10    Prediction absolute error on June 10, 2020             . 31% 13 .
Err. Jun 11    Prediction absolute error on June 11, 2020             . 48% 14 .
Σ Fatalities   Prediction of total number of fatalities               10225 2065

$$J(\theta) = \sqrt{\sum_i \left( D_{real}(i) - D_{\theta}(i) \right)^2} + \lambda_{May} + \lambda_{June} + \lambda_{June} \tag{3}$$

$$\lambda_{date} = 100 \cdot \left| \frac{R_{date} - P_{date}}{R_{date}} \right| \tag{4}$$

$$R_{date} \equiv \text{Reported number of fatalities on date} \tag{5}$$

$$P_{date} \equiv \text{Predicted number of fatalities on date} \tag{6}$$

Table 5: Best parameters found by our neural net optimization + refinement, in the case of heavy social distancing assumption through the real data.

Name    Description                                                  Value
I       Initial infectious population                                1725
R       Basic reproduction number                                    .
T_inc   Length of incubation period (days)
T_inf   Length of infectiousness period (days)                       .
P_F     Case fatality rate                                           .
T_F     Time from end of infectiousness to death (days)              .
T_M     Recovery time for mild cases (days)
T_V     Recovery time for severe cases (days)
P_V     Severe symptoms rate
P_T     Percentage to decrease transmission by after intervention
T       Intervention time to reduce R (days)

We present in Table 5 the best model parameters found with neural net optimization followed by the refinement step. As stated previously, they minimize the L2 distance between the real and predicted curves of fatalities, from March 22, 2020, to May 21, 2020, while trying not to overfit the data set by looking ahead until the last available reported date. Several interesting observations are worth making: the estimated basic reproduction number was found to be 2.21, which is very close to the value estimated in the literature. Since it is above 1, it implies an exponential growth in the number of infected people unless reduced by the social distancing measures. The imposed measures of containment indeed reduced the reproduction number by 60% (from 2.21 to 0.884), so the curve starts decreasing towards zero. The continuous decrease definitely helps our medical staff to manage patients better and more safely, which may also explain why the fatality rate found is so small, only 0.245%.
Note that this value is significantly lower than other numbers reported in the literature so far, which is very good news, but it is highly dependent on the number of tested people and on the social dynamics of the observed population.

Based on this set of parameters, we analyse two cases that influence the daily deaths curve:
1. Heavy social distancing, meaning that the enforced norms will not be relaxed until the end of the pandemic period;
2. Moderate social distancing, meaning that on May 15, 2020, the social interaction increased by 10%.
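The two scenarios amount to simple arithmetic on the reproduction number. The sketch below is only an illustration: it assumes the values quoted in the text (a basic reproduction number of 2.21 and a 60% decrease in transmission after intervention) and treats the 10% increase of the moderate scenario as multiplicative on the already reduced value. The function and variable names are hypothetical and are not taken from the authors' implementation.

```python
def effective_reproduction_number(r0, reduction, relaxation=0.0):
    """Reproduction number after an intervention (illustrative helper).

    r0         -- basic reproduction number before any intervention
    reduction  -- fractional decrease in transmission (0.60 means 60%)
    relaxation -- fractional increase applied to the already reduced value
    """
    return r0 * (1.0 - reduction) * (1.0 + relaxation)


R0 = 2.21   # basic reproduction number quoted in the text
P_T = 0.60  # decrease in transmission after intervention quoted in the text

# Scenario 1: heavy social distancing is kept until the end of the pandemic.
r_heavy = effective_reproduction_number(R0, P_T)

# Scenario 2: social interaction increases by 10% on May 15, 2020 (assumed multiplicative).
r_moderate = effective_reproduction_number(R0, P_T, relaxation=0.10)

print(f"heavy social distancing:    R = {r_heavy:.3f}")
print(f"moderate social distancing: R = {r_moderate:.3f}")
```

Under these assumptions both values stay below 1 (0.884 and roughly 0.972), so neither scenario produces renewed exponential growth; the moderate scenario only slows the decay.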
Here, we assume that people will adopt a careful behavior and the heavy social distancing rules will apply long after they were proposed (R becomes 40% of its initial value). In Figure 7 we plot our best fit for the daily deaths using the parameters from Table 5. One can notice a change in the convexity of the prediction (blue) curve when the heavy social distancing norms were adopted (on intervention day).

Using the RK4 integrator, as presented in Section 3, we extrapolate around 200 days from the first day of reported data. In this way, we predict when there are going to be fewer than 5, 3 or 1 deaths per day, as can be seen in Figure 8. A key insight is that the maximum number of daily deaths, the peak of our extrapolation, was already passed on April 18, 2020 (right before the Orthodox Easter on April 19-21), meaning that the curve is going to stay under 24 deaths per day.

A subject of interest is the total number of fatalities the virus is going to cause. Using the daily numbers extrapolation, we create a cumulative curve which tells us the total number of fatalities by a certain date. Thus, we predict a total of around 1730 using data from March 22, 2020, to May 21, 2020, as presented in Figure 9.

There is the possibility of optimizing the parameters for fitting the cumulative number of fatalities instead of their daily number. The reason we opted for the daily number is three-fold. First, the amount of information that each specific day brings to the cumulative function becomes smaller and smaller each day, converging to 0 at infinity. The cumulative number thus increases with each day and the meaningful information for a given day becomes much smaller than the total cumulated number. It is thus expected that learning could suffer from numerical issues. Second, the number of daily fatalities contains a certain amount of noise that can help us generalize better for our predictions. And third, the curve of daily numbers shows better and more clearly when the pandemic peaks and when it is expected to diminish to a non-threatening state.

Figure 7: Daily deaths fitted curves. The red dots represent the officially reported daily number of deaths in Romania. The blue line with dots shows our best fit of the Modified-SEIR model through the real data.

Figure 8: Extrapolation of daily number of fatalities. The red plot represents the officially reported daily number of deaths in Romania. Our extrapolation of daily deaths is the blue line and the green lines represent the days when we estimate fewer than 5, 3 or 1 daily fatalities.

We know that the real number of actively infectious people is hard to obtain, as the number of tested people at a time is just a fraction of the whole population. However, once the right parameters are learned, the model can estimate the total number of infectious people and predict that we need to wait until August 25, 2020, to have fewer than 1000 such individuals. The results are shown in Figure 10.

While we do not know for sure how the social dynamics influence the evolution of the coronavirus, we conduct an experiment assuming that on May 15, 2020, when the social distancing norms became less constraining, the already reduced basic reproduction number (by the heavy social distancing norms) increased by 10%, while the rest of the parameters remained the same.

Figure 9: Extrapolation of cumulative fatalities. The red plot represents the officially reported cumulative number of deaths in Romania. Our extrapolation of the cumulative number is the blue line.

Figure 10: Total number of infectious people extrapolation. We predict that we will have fewer than 10000 infections on June 2, 2020, fewer than 5000 infections on June 8, 2020, and fewer than 1000 infections on August 25, 2020.

In Figure 11 we show that, in our prediction, the period until there will be fewer than 5 daily deaths is prolonged until the end of August, 2020.
We do not see a second peak because the basic reproduction number does not rise above 1 again, but the pandemic is predicted not to end until the end of 2020. The total number of deaths jumps to 2361, as shown in Figure 12. Compared to Gu [2020] we seem a bit pessimistic, as the number of deaths predicted by us by Aug 1, 2020, is 1861, while their machine learning approach predicts 1776, as of June 12, 2020. Still, our estimate lies inside their confidence bounds: [1586 - 2187]. Please note that their approach is also data dependent and can offer significantly different predictions when using more data.

In Figure 13 one can see the effect of a slight change in the prevention norms, as the total number of actively infected people takes much more time to go down.

Figure 11: Daily fatalities extrapolation with increased mobility (moderate social distancing) from May 15, 2020. The red plot represents the officially reported daily number of deaths in Romania. Our extrapolation is the blue line and the green lines represent the days when we estimate fewer than 5, 3 or 1 daily fatalities.

Figure 12: Cumulative deaths extrapolation with increased mobility from May 15, 2020. The red plot represents the officially reported cumulative number of deaths in Romania. Our extrapolation of the cumulative number of deaths is the blue line.

Ultimately, such modeling is important not only to fit the available observed data and estimate various model parameters, such as the fatality rate, but also to predict future events. The ability to predict future outcomes is its real value, in order to prepare the best courses of action in advance. Acting on time, especially in the face of a pandemic, is vital. Thus, in order to test the validity of our model (as is usually done in machine learning) we compare the future predictions made on May 21, 2020 (based on observed data until that date), with the latest information available in the meantime, until June 11, 2020.
Our model is surprisingly accurate under the heavy social distancing assumptions. The predicted total number of fatalities is close to the reported values: 1296 reported deaths on June 3, 2020, versus 1294 (predicted by the heavy social distancing model) and 1312 (predicted by the moderate social distancing model), and 1369 reported fatalities on June 11, 2020, versus 1375 (predicted by the heavy social distancing model) and 1411 (predicted by the moderate social distancing model).

Regarding the number of infected people, we already know that the number of reported active cases is lower than the real one, so, as mentioned previously, we can use our model to estimate the actual number of active infections. The comparison of our predictions against later data is shown in Table 6.

Now that we have two models, one following heavy social distancing norms and the other following moderate social distancing norms, that have prediction errors < 4% so far, we can use them as bounds for our future predictions. We care about how the coronavirus will evolve over the next several months, so we summarize our findings in Table 7.

Figure 13: Total number of infectious people extrapolation with less social distancing from May 15, 2020. We predict that we will have less than only in 2021.

Table 6: Predictions verification. HSD means heavy social distancing and MSD means moderate social distancing.

Criterion                                 Reported   HSD prediction   MSD prediction
Total number of deaths on May 3, 2020     790        807 (2.15%)      807 (2.15%)
Total number of deaths on May 15, 2020    1070       1031 (3.64%)     1031 (3.64%)
Total number of deaths on May 21, 2020    1156       1125 (2.68%)     1128 (2.42%)
Total number of deaths on Jun 3, 2020     1296       1294 (0.15%)     1312 (1.23%)
Total number of deaths on Jun 11, 2020    1369       1375 (0.44%)     1411 (3.07%)
Total number of deaths on Jun 19, 2020    1484       1442 (2.83%)     1501 (1.15%)
Active infections on May 3, 2020          7504       22066            22066
Active infections on May 15, 2020         5997       16093            17673
Active infections on May 23, 2020         5494       13000            16278
Active infections on Jun 3, 2020          4573       9666             14478
Active infections on Jun 11, 2020         4530       7778             13259
Active infections on Jun 19, 2020         5361       6253             12117
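The percentages in Table 6 can be reproduced with the relative error of Equation 4. The snippet below is a minimal sketch, assuming the error is measured relative to the reported value; the death counts are copied from Table 6, while the function and variable names are hypothetical.

```python
def relative_error_percent(reported, predicted):
    """Relative prediction error in percent: 100 * |R - P| / R."""
    return 100.0 * abs(reported - predicted) / reported


# (date, reported deaths, HSD prediction, MSD prediction), copied from Table 6.
deaths = [
    ("May 3", 790, 807, 807),
    ("May 15", 1070, 1031, 1031),
    ("May 21", 1156, 1125, 1128),
    ("Jun 3", 1296, 1294, 1312),
    ("Jun 11", 1369, 1375, 1411),
    ("Jun 19", 1484, 1442, 1501),
]

for date, reported, hsd, msd in deaths:
    hsd_err = relative_error_percent(reported, hsd)
    msd_err = relative_error_percent(reported, msd)
    print(f"{date:>6}: HSD {hsd_err:.2f}%   MSD {msd_err:.2f}%")

# Largest error over both models and all dates.
worst = max(
    relative_error_percent(reported, predicted)
    for _, reported, hsd, msd in deaths
    for predicted in (hsd, msd)
)
print(f"largest error: {worst:.2f}%")
```

The largest error over these dates is the May 15 entry (about 3.64%), which is consistent with the "< 4%" statement in the text.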
In this paper we propose the first computational model to predict the evolution of COVID-19 in Romania and estimate key factors of the pandemic such as the fatality rate, incubation period, infectiousness period and reproduction number, based on the state-of-the-art Modified-SEIR model of Chowdhury et al. [2020]. Our technical novelty consists in the way we optimize the parameters of the model, through a self-supervised deep learning approach, in which a convolutional neural network learns from synthetic data, produced by the analytical Modified-SEIR model for random sets of parameters, to predict the correct parameter set, which is known, since it is the one used to generate the synthetic data. Our results show that our novel self-supervised approach is effective, learning a set of parameters which are not only able to fit the observed data but also to accurately predict the future, for the three-week period tested (which is a relatively long period in the case of a rapidly evolving pandemic).

At the conclusion of our study, we highlight some important findings: the total number of fatalities under the heavy social distancing norms (1730), the total number of deaths following a small relaxation of the prevention norms on May 15, 2020 (2361), and the fact that we already passed the peak of the daily number of deaths on April 18, 2020 (one day before the Orthodox Easter). Our predictions are right inside and around the bounds predicted by IHME (1614 deaths by August 1, as of June 12, 2020; IHME COVID-19 health service utilization forecasting team [2020]) and the ML-based approach presented in Section 2 (1776 deaths by August 1, as of June 12, 2020; Gu [2020]). This, and the fact that the parameters we found are close to the ones presented in the latest literature (for example, an optimal basic reproduction number of 2.21), only strengthens the idea that our novel approach can be useful in a fast-paced pandemic, maybe not only for the case of Romania. A notable finding is that the case fatality rate in all the local optima sets seems to be significantly less than 1%, mostly around 0.245% and 0.3%, and this is an optimistic surprise when compared to the estimates in the literature.

We know that access to more data could change the optimal parameters, but, based on the results presented in this paper, we are confident that in the case of Romania, the case fatality rate is significantly smaller ( ≈ . % ) than the worldwide average.

Considering that our model, trained on data collected until May 21, 2020, accurately describes the future evolution (future unseen data) of the number of fatalities until June 11, 2020, we conclude that both the model and the parameters found provide answers that are very close to the true ones. The results strongly indicate that we should seriously consider data-driven computational approaches, in combination with machine learning, in the analysis and decision making process, with respect to fundamental aspects of our lives (as is the case with the COVID-19 pandemic), for the future and greater good of society.

Table 7: Predictions. HSD means heavy social distancing and MSD means moderate social distancing.

Criterion                                      HSD prediction   MSD prediction
Total number of deaths on July 1, 2020         1521             1620
Total number of deaths on August 1, 2020       1640             1861
Total number of deaths on September 1, 2020    1692             2027
Total number of deaths on October 1, 2020      1713             2137
Total number of deaths on November 1, 2020     1723             2213
Total number of deaths on December 1, 2020     1727             2262
Total number of deaths                         1730             2361
Less than 5 deaths per day                     Jul 6, 2020      Aug 23, 2020
Less than 3 deaths per day                     Jul 26, 2020     Oct 2, 2020
Less than 1 deaths per day                     Sep 4, 2020      Dec 24, 2020
Active infections on July 1, 2020              4499             10548
Active infections on August 1, 2020            1912             7252
Active infections on September 1, 2020         808              4900
Active infections on October 1, 2020           351              3315
Active infections on November 1, 2020          148              2196
Active infections on December 1, 2020          64               1467
Less than 10000 active infections              Jun 2, 2020      Jul 6, 2020
Less than 5000 active infections               Jun 28, 2020     Aug 31, 2020
Less than 1000 active infections               Aug 25, 2020     Dec 30, 2020

References
Jantien A Backer, Don Klinkenberg, and Jacco Wallinga. Incubation period of 2019 novel coronavirus (2019-nCoV) infections among travellers from Wuhan, China, 20-28 January 2020. Euro surveillance: bulletin Europeen sur les maladies transmissibles = European communicable disease bulletin, 25(5), February 2020. ISSN 1025-496X. doi: 10.2807/1560-7917.ES.2020.25.5.2000062. URL https://europepmc.org/articles/PMC7014672.

Pew Research Center. Eastern and Western Europeans differ on importance of religion, views of minorities, and key social issues. Pew Research Center, 10 2018.

Rajiv Chowdhury, Kevin Heng, Md Shajedur Rahman Shawon, Gabriel Goh, Daisy Okonofua, Carolina Ochoa-Rosales, Valentina Gonzalez-Jaramillo, Abbas Bhuiya, Daniel Reidpath, Shamini Prathapan, et al. Dynamic interventions to control COVID-19 pandemic: a multivariate prediction modelling study comparing 16 worldwide countries. European journal of epidemiology, pages 1–11, 2020.

Iulian Gherghel and Mihai Bulai. Is Romania ready to face the novel coronavirus (COVID-19) outbreak? The role of incoming travelers and that of Romanian diaspora. Travel Medicine and Infectious Disease, 2020.

Gabriel Goh. Epidemic calculator, 2020. URL https://gabgoh.github.io/COVID/index.html.

Youyang Gu. COVID-19 projections using machine learning, 2020. URL https://covid19-projections.com/.

Herbert W. Hethcote. The mathematics of infectious diseases. SIAM Review, 42(4):599–653, 2000. doi: 10.1137/S0036144500371907. URL https://doi.org/10.1137/S0036144500371907.

IHME COVID-19 health service utilization forecasting team and Christopher JL Murray. Forecasting the impact of the first wave of the COVID-19 pandemic on hospital demand and deaths for the USA and European Economic Area countries. medRxiv, 2020. doi: 10.1101/2020.04.21.20074732.

Jonathan Evans and Chris Baronavski. How do European countries differ in religious commitment? Pew Research Center, 12 2018.

Joel Kelso, George Milne, and Heath Kelly. Simulation suggests that rapid activation of social distancing can arrest epidemic development due to a novel strain of influenza. BMC public health, 9:117, 05 2009. doi: 10.1186/1471-2458-9-117.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014.

Adam J Kucharski, Timothy W Russell, Charlie Diamond, Yang Liu, John Edmunds, Sebastian Funk, and Rosalind M Eggo. Early dynamics of transmission and control of COVID-19: a mathematical modelling study. medRxiv, 2020. doi: 10.1101/2020.01.31.20019901.

Eric Lau, C Hsiung, Benjamin Cowling, Chang-Hsun Chen, Lai-Ming Ho, Thomas Tsang, Chiu-Wen Chang, Christl Donnelly, and Gabriel Leung. A comparative epidemiologic analysis of SARS in Hong Kong, Beijing and Taiwan. BMC infectious diseases, 10:50, 03 2010. doi: 10.1186/1471-2334-10-50.

Qun Li, Xuhua Guan, Peng Wu, Xiaoye Wang, Lei Zhou, Yeqing Tong, Ruiqi Ren, Kathy S.M. Leung, Eric H.Y. Lau, Jessica Y. Wong, Xuesen Xing, Nijuan Xiang, Yang Wu, Chao Li, Qi Chen, Dan Li, Tian Liu, Jing Zhao, Man Liu, Wenxiao Tu, Chuding Chen, Lianmei Jin, Rui Yang, Qi Wang, Suhua Zhou, Rui Wang, Hui Liu, Yinbo Luo, Yuan Liu, Ge Shao, Huan Li, Zhongfa Tao, Yang Yang, Zhiqiang Deng, Boxi Liu, Zhitao Ma, Yanping Zhang, Guoqing Shi, Tommy T.Y. Lam, Joseph T. Wu, George F. Gao, Benjamin J. Cowling, Bo Yang, Gabriel M. Leung, and Zijian Feng. Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia. New England Journal of Medicine, 382(13):1199–1207, 2020. doi: 10.1056/NEJMoa2001316. URL https://doi.org/10.1056/NEJMoa2001316. PMID: 31995857.

Mihaela van der Schaar and Ahmed Alaa. How artificial intelligence and machine learning can help healthcare systems respond to COVID-19. 3 2020.

World Health Organization. WHO timeline - COVID-19, 2020. Accessed: 2020-06-03.

Corneliu Petru Popescu, Alexandru Marin, Violeta Melinte, George Sebastian Gherlan, Filofteia Cojanu Banicioiu, Adelina Dogaru, Sebastian Smadu, Ana Maria Veja, Elena Nedu, Delia Stanciu, et al. COVID-19 in a tertiary hospital from Romania: Epidemiology, preparedness and clinical challenges. Travel Medicine and Infectious Disease, 2020.

Zhaozhi Qian, Ahmed Alaa, and Mihaela Schaar. When to lift the lockdown? Global COVID-19 scenario planning and policy effects using compartmental Gaussian processes. 05 2020.

Jonathan M Read, Jessica RE Bridgen, Derek AT Cummings, Antonia Ho, and Chris P Jewell. Novel coronavirus 2019-nCoV: early estimation of epidemiological parameters and epidemic predictions. medRxiv, 2020. doi: 10.1101/2020.01.23.20018549.

Tan Delin and Chen Zheng. On a general formula of fourth order Runge-Kutta. Journal of Mathematical Science & Mathematics Education, 2012. URL http://w.msme.us/2012-2-1.pdf.

Victor Virlogeux, Vicky Fang, Minah Park, Jianhong Wu, and Benjamin Cowling. Comparison of incubation period distribution of human infections with MERS-CoV in South Korea and Saudi Arabia. Scientific Reports, 6, 10 2016. doi: 10.1038/srep35839.

World Health Organization WHO. Report of the WHO-China joint mission on coronavirus disease 2019 (COVID-19), 2020.

Stephen J. Wright. Coordinate descent algorithms. Mathematical Programming, 2015. doi: 10.1007/s10107-015-0892-3. URL https://arxiv.org/abs/1502.04759.

Joseph T Wu, Kathy Leung, and Gabriel M Leung. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. The Lancet, 395(10225):689–697, 2020. ISSN 0140-6736. doi: https://doi.org/10.1016/S0140-6736(20)30260-9. URL