Forecasting Covid-19 dynamics in Brazil: a data driven approach
Igor G. Pereira, Joris M. Guerin, Andouglas G. Silva Junior, Cosimo Distante, Gabriel S. Garcia, Luiz M. G. Gonçalves
Igor G. Pereira (a), Joris M. Guerin (a), Andouglas G. Silva Júnior (a,b), Cosimo Distante (c), Gabriel S. Garcia (d), Luiz M. G. Gonçalves (a,*)
(a) Federal University of Rio Grande do Norte, Brazil
(b) Federal Institute of Rio Grande do Norte, Brazil
(c) Institute of Applied Sciences and Intelligent Systems, Italy
(d) University of Brasília, Brazil
Abstract
This paper has a twofold contribution. The first is a data driven approach for predicting the Covid-19 pandemic dynamics, based on data from more advanced countries. The second is to report and discuss the results obtained with this approach for Brazilian states, as of May 4th, 2020. We start by presenting preliminary results obtained by training an LSTM-SAE network, which are somewhat disappointing. Then, our main approach consists in an initial clustering of the world regions for which data is available and where the pandemic is at an advanced stage, based on a set of manually engineered features representing a country's response to the early spread of the pandemic. A Modified Auto-Encoder (MAE) network is then trained from these clusters and learns to predict future data for Brazilian states. These predictions are used to estimate important statistics about the disease, such as peaks. Finally, curve fitting is carried out on the predictions in order to find the distribution that best fits the outputs of the MAE, and to refine the estimates of the peaks of the pandemic. Results indicate that the pandemic is still growing in Brazil, with most states' peaks of infection estimated between the 25th of April and the 19th of May 2020. Predicted numbers reach a total of 240 thousand infected Brazilians, distributed among the different states, with São Paulo leading with almost 65 thousand estimated confirmed cases. The estimated end of the pandemic (with 97% of cases reaching an outcome) starts as of May 28th for some states and extends through August 14th, 2020.
Keywords: Time Series Prediction, Covid-19 Pandemic, Modified Auto-Encoder
This work is supported by CAPES under Grant 001 and by CNPq.
* Corresponding author. Email address: [email protected] (Luiz M. G. Gonçalves)
Preprint submitted to ArXiv [q-bio.PE], May 20, 2020
1. Introduction
2. Materials and Methods
This work develops a method to predict the transmission dynamics of viral epidemics by analyzing contamination data from the perspective of artificial intelligence. Deep Learning techniques are studied and implemented, aiming to learn the dynamics of the pandemic using data from other locations (countries). This approach is then applied to the specific case of Brazil. We start by describing traditional approaches to set a baseline for comparison, and then detail the different components of the data driven method retained.
The spread and contamination of the Covid-19 virus is not entirely random and follows certain patterns. These dynamics can vary across different regions as they depend on parameters such as pollution, demographic density and average age of the population, among others. Given the actions taken to fight the virus, in both the social and economic spheres, there is a need for more realistic epidemiological data. Indeed, the use of local models, taking into account the reality of each region, state or municipality, can allow the authorities to take coherent decisions. Therefore, it is assumed that the spread of the virus follows some statistical model, whose parameters can be tuned to represent different situations.

Approaches to model the behavior of infectious diseases, such as SEIR, have been applied to the COVID-19 epidemic [5, 14]. In these approaches, the phase transitions of the disease are modeled as instantaneous rates in differential equations, or as transition probabilities in discrete-time difference or matrix equations. These models provide accurate estimates of the position of the equilibrium points, where the rate at which individuals enter each stage equals the rate at which they exit. However, they do not accurately capture the distribution of the time an individual spends at each stage; therefore, they do not accurately capture the transitory dynamics of epidemics. In fact, the SEIR model has been tested in Italy [6] to model the dynamics of the COVID-19 epidemic. It has been shown to underestimate peak infection rates (by a factor of three using published parameter estimates based on the progress of the epidemic in Wuhan) and to substantially overestimate the persistence of the epidemic after the peak has passed [5].

Other approaches such as SIR [15], SEIRD [6], and SEITR [16] are also helpful for understanding the Covid-19 dynamics. Nonetheless, the lack of ground truth data prevents us from determining which of these models is the most precise. Although they represent the Covid-19 dynamics to some extent, these traditional models (SIR, SEIR, SEITR, SEIRD) must be improved before they can be applied with higher precision to the study of the new virus, as the recent works cited above have shown them to present some issues. In this work, besides discussing the main advances of the contributions in this direction, these traditional models are compared to ours, which is a data driven approach.

Some preliminary studies on the above methods have been conducted to better understand the Covid-19 dynamics. We verified that this virus cannot be modeled perfectly with any specific traditional model because several factors influence its dissemination speed. Its behavior is difficult to model mainly because of the non-linearity of the infection data, caused by under-notification, and because of the lack of effective and constant countermeasures, which change all the time as the infection spreads. For these reasons, it seems appealing to apply AI-based methods. As a first test, we start by implementing an LSTM, one of the default neural network models for analyzing time series data, in the next section.
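For reference, the compartmental baseline discussed above can be illustrated with a minimal SEIR sketch. This is not the implementation used in the cited works: the forward Euler integration, the function name `simulate_seir` and all parameter values below are assumptions chosen only to show how the transitions between stages are expressed as instantaneous rates.

```python
def simulate_seir(beta, sigma, gamma, population, infected0, days, dt=0.1):
    """Integrate the SEIR equations with forward Euler steps.

    beta  : transmission rate per day (illustrative value, not fitted)
    sigma : rate of leaving the exposed stage (1 / incubation period)
    gamma : rate of leaving the infectious stage (1 / infectious period)
    Returns one (S, E, I, R) tuple per simulated day, day 0 included.
    """
    s, e, i, r = population - infected0, 0.0, float(infected0), 0.0
    steps_per_day = round(1 / dt)
    history = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt
            new_infectious = sigma * e * dt
            new_removed = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
        history.append((s, e, i, r))
    return history


# Hypothetical scenario: one million inhabitants, ten initial infections.
history = simulate_seir(beta=0.5, sigma=1 / 5, gamma=1 / 7,
                        population=1_000_000, infected0=10, days=300)
# Day with the largest number of simultaneously infectious individuals.
peak_day = max(range(len(history)), key=lambda d: history[d][2])
```

Because every transition is a constant rate, the time spent in each stage is implicitly exponentially distributed, which is the limitation pointed out above.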
Several neural network models can be used to solve problems of time series estimation. Recurrent neural networks (RNN) are a family of architectures containing recurrent feedback connections, which define an internal state, or short-term memory. This memory makes them suitable for modeling sequential or time series data [17]. To this end, a standard RNN keeps a vector of activation parameters at each time step, which works especially well when the input data contain short-term dependencies. However, when trained with gradient descent algorithms, learning the long-term dependencies encoded in the data becomes difficult due to the vanishing gradient problem. This is solved using a specialized neuron for long-term memory that keeps a constant backward flow of the error signal, allowing it to learn long-term dependencies. This approach was presented by Hochreiter [18] and is known as LSTM (Long Short Term Memory).

In this way, an LSTM network is a kind of RNN architecture that has a recursive branch for modeling time series and solves the vanishing gradient problem. To do so, it uses a memory cell that is able to represent long-term dependencies in the time series, composed of four neural units: input, output, forgetting and the self-recurring neuron (Figure 1a). These units are responsible for controlling the interactions between different memory units. Specifically, the input unit controls whether the input data can modify the state of the memory cell or not.
On the other hand, the output unit controls whether or not it can change the state of other memory cells.

Mathematically, considering the gates ($f_t$, $i_t$, $o_t$ and $\tau_t$) shown in Figure 1a, we have:

$$f_t = \sigma(X_t U_f + S_{t-1} W_f + b_f) \tag{1}$$
$$i_t = \sigma(X_t U_i + S_{t-1} W_i + b_i) \tag{2}$$
$$o_t = \sigma(X_t U_o + S_{t-1} W_o + b_o) \tag{3}$$
$$\tau_t = \tanh(X_t U_c + S_{t-1} W_c + b_c) \tag{4}$$
$$C_t = C_{t-1} \otimes f_t \oplus i_t \otimes C'_t \tag{5}$$
$$S_t = o_t \otimes \tanh(C_t) \tag{6}$$

where $U$, $W$ and $b$ are respectively the input weights, recurrent weights and biases; $X$ is the input; $S$ is the hidden output; $C$ is the cell state; $C'_t$ is the candidate cell state, i.e. the output $\tau_t$ of Eq. (4); and $t$ is the time step.

According to Sagheer [17], despite the advantages of the LSTM architecture, its performance for time series problems is not always satisfactory. A shallow LSTM architecture may not represent the complex features of sequential data efficiently, especially when it is used to learn from long-range time series with high non-linearity, which is the case for Covid-19 data. In order to overcome this problem, other RNN architectures based on LSTM have been created. We tested two approaches proposed by Sagheer: DLSTM [19] and LSTM-SAE [17].

Figure 1: LSTM, DLSTM and LSTM-SAE blocks (panels a to c).
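As an illustration, one time step of the LSTM cell in Equations (1) to (6) can be sketched in plain Python. The weights here are random and untrained, the sizes (one input feature, four hidden units) are arbitrary, and the helper names `affine`, `lstm_step` and `random_params` are hypothetical; the candidate state $C'_t$ of Eq. (5) is taken to be $\tau_t$ from Eq. (4).

```python
import math
import random


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def affine(x, s_prev, U, W, b):
    """Compute x U + s_prev W + b for row vectors x and s_prev."""
    return [
        sum(x[k] * U[k][j] for k in range(len(x)))
        + sum(s_prev[k] * W[k][j] for k in range(len(s_prev)))
        + b[j]
        for j in range(len(b))
    ]


def lstm_step(x, s_prev, c_prev, params):
    """One LSTM time step; params maps each gate name to its (U, W, b)."""
    f = sigmoid(affine(x, s_prev, *params["f"]))                      # Eq. (1)
    i = sigmoid(affine(x, s_prev, *params["i"]))                      # Eq. (2)
    o = sigmoid(affine(x, s_prev, *params["o"]))                      # Eq. (3)
    tau = [math.tanh(v) for v in affine(x, s_prev, *params["c"])]     # Eq. (4)
    # Eq. (5): element-wise products and sum.
    c = [cp * fj + ij * tj for cp, fj, ij, tj in zip(c_prev, f, i, tau)]
    # Eq. (6): hidden output.
    s = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return s, c


def random_params(n_in, n_hidden, rng):
    """Random (untrained) weights, for illustration only."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {gate: (matrix(n_in, n_hidden), matrix(n_hidden, n_hidden), [0.0] * n_hidden)
            for gate in "fioc"}


rng = random.Random(0)
params = random_params(n_in=1, n_hidden=4, rng=rng)
s, c = [0.0] * 4, [0.0] * 4
for value in [0.1, 0.3, 0.6, 1.0]:   # a short, made-up normalized series
    s, c = lstm_step([value], s, c, params)
```

The cell state `c` is only rescaled by the forget gate and incremented by the gated candidate, which is what lets the error signal flow across many time steps.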
Table 1: Training parameters and metrics using Covid-19 data from China provinces (daily number of cases and cumulative number of cases). For each model (LSTM, DLSTM and LSTM-SAE), the table lists the parameters (Hidden Layers, Epochs, Epochs AE, Dropout, Units, Sequence Length) and the metrics (MAPE, Correlation).

The LSTM-SAE and DLSTM blocks are shown in Figures 1b and 1c, respectively. Both blocks are composed of stacked LSTM layers, which increase the depth of the network. In addition, the LSTM-SAE configuration uses an auto-encoder to initialize the weights of each LSTM layer. In our application, we used only one hidden layer for this setup, but it is possible to use more layers and more auto-encoders, as shown in the original paper. In order to select the best architecture for the Covid-19 problem, we trained three models: one LSTM, one DLSTM and one LSTM-SAE. These models were trained using data from all China provinces except Hubei, which was used for testing. We evaluated which model generalized best to the available dataset using the MAPE metric. Finally, we used a dynamic prediction, where the model is updated for each new predicted value. This method improves the forecast due to the incorporation of data from other countries or regions. The training parameters and result metrics are shown in Table 1.

Figures 2a and 2b show the results of the three trained models for Hubei (province of China). As shown in Table 1, the best model was LSTM-SAE, which was therefore chosen as the model to forecast other regions or countries.

Figure 2: Comparison of LSTM, DLSTM and LSTM-SAE on Covid-19 cumulative (2a) and daily (2b) number of cases, data from Hubei, province of China. Each panel plots the LSTM, DLSTM, LSTM-SAE and Real curves against the date (February to May 2020).

On the one hand, despite the devastating effects of the pandemic, three months of data is a relatively short period of time for training complex time series prediction models without overfitting, which has been reported as one of the main problems for training LSTMs (see Section 4). On the other hand, this is the first large scale global pandemic that our generation has had to face, and there are not yet standardized guidelines for countries on how to react to such an event. For this reason, responses to the pandemic have varied widely across regions and countries worldwide, thus creating a huge variability in the available data. Hence, we propose to conduct a preliminary study which consists in grouping countries and regions with similar early responses. In this way, smaller specialized networks can be trained for each cluster, and we hope that, by learning on more consistent data, our models can generalize better without overfitting to the training data. We also found a better model (MAE), which is used instead of the LSTM on the data resulting from the clustering approach described next.
The objective of this paper is to train a predictive model for Brazil, as well as distinct models for each of the groups of Brazilian states. Hence, the proposed clustering pipeline considers both entire countries and smaller regions as entries. The input data used in this preliminary study are all countries available in the JHU dataset [20], Chinese and Canadian provinces, American, Australian and Brazilian states [21], as well as Italian regions [22].

The approach used for identifying which countries present similar early responses to Covid-19 is inspired by the literature in this area [23]. First, we define the outbreak date of a country to be the day on which it registered 5 confirmed cases per million inhabitants. Normalizing by the population of the region helps to characterize the true response of a country and avoids giving more weight to highly populated countries. Figure 3 shows the number of accumulated deaths per million inhabitants for the different Brazilian states on the first of May, 2020.

Figure 3: Number of deaths per million inhabitants in the different Brazilian states on 1st of May, 2020.

We start with the preprocessing scheme applied to this dataset. A 7-day arithmetic moving average is first applied to each time series of the dataset. This is done to deal with the seasonality observed in the data, i.e. higher variability during the weekends. After filtering, a feature representation containing three characteristics is computed for each time series. These features are:

• Early Mortality: Weekly number of deaths 14 days after the outbreak, divided by the number of confirmed cases in the week of the outbreak. A two-week period was used because it is the time required to know the outcome of a contamination.

• Days until 10x: The number of days it takes to multiply the confirmed cases by 10, from the day of the outbreak.

• Early Acceleration: If we denote $\Delta_{W_0 W_1}$ as the percentage increase of confirmed cases from the week of the outbreak to the week after, and $\Delta_{W_1 W_2}$ as the percentage increase from the 1st to the 2nd week after the outbreak, then the early acceleration is defined by:

$$\text{earlyAccel} = \frac{\Delta_{W_1 W_2}}{\Delta_{W_0 W_1}}. \tag{7}$$

The values of these features for the different Brazilian states are shown in Figure 4.

Then, the clustering pipeline is applied to this feature representation to group the different countries/regions together. To do that, a Uniform Manifold Approximation and Projection (UMAP) embedding [24] is applied to generate a two-dimensional, clustering friendly feature space. UMAP is an unsupervised embedding method that tends to preserve the global distances present in the initial dataset. This lower dimensional feature space not only facilitates the visualization and interpretation of the data but also tends to improve clustering results for algorithms where the number of clusters is unspecified. In practice, UMAP is used with n_neighbors = 15 and min_dist = 0. However, UMAP only produces a new embedded space and does not directly generate the cluster assignments, which are needed to select the countries for training our neural network models. To solve this issue, we use the scikit-learn [25] implementation of Affinity Propagation [26] with a damping factor of 0.8, applied to the UMAP embedded space. The results from this preliminary clustering procedure are presented in Section 3.3.

After this step, our clustered data series are ready for the MAE training and prediction procedures, depending on the phase. In practice, to forecast contamination data of a given Brazilian state, we use the time series data of the countries/regions that belong to the same cluster and are at a more advanced stage of the pandemic. In this section, the clustering approach adopted to characterize the responses of the different countries is explained, the details of the [...]

Figure 4: Values of the three features used for characterizing the early response to Covid-19 for the Brazilian states: (a) Early Mortality, (b) Time to 10x, (c) Early Acceleration.
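The three early-response features can be sketched as follows. This is only one possible reading of the definitions above: it assumes a trailing moving average, an outbreak day defined on cumulative confirmed cases, weeks counted as 7-day blocks starting on the outbreak day, and the early acceleration taken as the second growth rate divided by the first. The function names and the synthetic data are hypothetical.

```python
def moving_average(series, window=7):
    """Trailing arithmetic moving average; early days use the samples available."""
    smoothed = []
    for t in range(len(series)):
        chunk = series[max(0, t - window + 1): t + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def outbreak_day(daily_cases, population, per_million=5.0):
    """First day on which cumulative confirmed cases reach `per_million` per million."""
    total = 0.0
    for day, new_cases in enumerate(daily_cases):
        total += new_cases
        if total / population * 1e6 >= per_million:
            return day
    return None


def early_response_features(daily_cases, daily_deaths, population):
    cases = moving_average(daily_cases)
    deaths = moving_average(daily_deaths)
    d0 = outbreak_day(cases, population)
    if d0 is None or d0 + 21 > len(cases):
        return None  # region not advanced enough to compute the features

    # Weekly confirmed cases for the outbreak week (w0) and the two following weeks.
    w0, w1, w2 = (sum(cases[d0 + 7 * k: d0 + 7 * (k + 1)]) for k in range(3))
    if w0 <= 0 or w1 <= 0:
        return None  # avoids the numerical instabilities of near-empty weeks
    deaths_after_14_days = sum(deaths[d0 + 14: d0 + 21])

    # Days needed for cumulative confirmed cases to reach 10x the outbreak-day level.
    cumulative, running = [], 0.0
    for value in cases:
        running += value
        cumulative.append(running)
    target = 10 * cumulative[d0]
    days_until_10x = next(
        (t - d0 for t in range(d0, len(cases)) if cumulative[t] >= target), None)

    delta_01 = (w1 - w0) / w0   # growth from outbreak week to the week after
    delta_12 = (w2 - w1) / w1   # growth from the 1st to the 2nd week after
    return {
        "early_mortality": deaths_after_14_days / w0,
        "days_until_10x": days_until_10x,
        "early_acceleration": delta_12 / delta_01 if delta_01 else None,
    }


# Synthetic region: daily cases grow by 20% per day, deaths are 5% of cases.
cases = [1.2 ** t for t in range(60)]
deaths = [0.05 * c for c in cases]
features = early_response_features(cases, deaths, population=10_000_000)
```

For purely exponential growth the two weekly growth rates coincide, so the early acceleration is 1; values above or below 1 would indicate an accelerating or decelerating early spread. The resulting feature vectors would then be passed to the UMAP embedding and Affinity Propagation steps described above.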
In order to model the transmission dynamics of the SARS-CoV-2 virus in Brazil, we propose to use a set of Modified Auto-Encoders (MAE) to forecast time series data regarding the number of daily confirmed cases of Covid-19. An auto-encoder is a specific neural network architecture that is trained to copy its input to its output [27]. In this way, the auto-encoder generates a hidden representation that describes useful properties of the input data.

The network architecture can be divided in two parts: an encoder function $h = f(x)$, which maps the input data $x$ to the hidden representation $h$, and a decoder function $\hat{x} = g(h)$, which attempts to reconstruct the input $x$ from the hidden representation. With the use of the stochastic gradient descent strategy to train neural network architectures, the auto-encoder mapping functions can be generalized to stochastic mappings such as $p_{encoder}(h \mid x)$ and $p_{decoder}(\hat{x} \mid h)$.

The hidden representation, also called latent space, generated by the mapping $p_{encoder}(h \mid x)$ contains a stochastic representation of the probability distribution of the input data and can be used for dimensionality reduction [27], feature learning [27], and also in generative models when combined with latent variable models [28].

Auto-encoders can also learn useful properties from time series if a sequence is applied to their inputs. Such properties may be used to forecast the next samples of the given input sequence. We therefore propose to modify the traditional auto-encoder architecture by adding an extra output derived from the latent space. While the traditional output of the auto-encoder is trained to approximate the input values, the extra output is trained to approximate the next sample of the sequence given to the input of the auto-encoder.

Consider a sequence $X = x_1, x_2, \ldots, x_n$. The latent space vector $H$ is obtained with the mapping $p_{encoder}(H \mid X)$ and the traditional output of the auto-encoder is obtained with the mapping $p_{decoder}(\hat{X} \mid H)$. The extra output added to the auto-encoder model tries to approximate $x_{n+1}$ with the mapping $p_{predictor}(x_{n+1} \mid H)$.

In order to increase the latent space dimension without increasing the input sequence, we apply 3 auto-encoders in parallel and aggregate their latent spaces before computing the predictor output. This Modified Auto-Encoder (MAE) architecture is depicted in Figure 5.

Figure 5: Modified Auto-Encoder architecture
The predictor output, the input samples and the decoder output have 1, 8 and 8 units, respectively. Each encoder, latent space and decoder has 32, 4, and 32 units, respectively. The outputs of the decoders are averaged to create the total decoder output. The latent spaces of the auto-encoders are concatenated prior to the final computation of the predictor output. We train the modified architecture with the mean squared error loss function and the Adam optimizer.
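The layer sizes listed above can be checked with a small forward-pass sketch. It is not the trained model: the weights are random, the ReLU activation on the 32-unit layers and the equal weighting of the two loss terms are assumptions, and the names `init_mae`, `mae_forward` and `mae_loss` are hypothetical. Training with Adam is not shown.

```python
import random


def dense(x, weights, bias, relu=False):
    """Fully connected layer; `weights` holds one coefficient list per output unit."""
    out = [sum(xi * w for xi, w in zip(x, row)) + b for row, b in zip(weights, bias)]
    return [max(0.0, v) for v in out] if relu else out


def init_layer(n_in, n_out, rng):
    return [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)], [0.0] * n_out


def init_mae(rng, n_input=8, n_hidden=32, n_latent=4, n_branches=3):
    """Three parallel auto-encoders (8 -> 32 -> 4 -> 32 -> 8) plus one predictor unit."""
    branches = [{
        "enc_hidden": init_layer(n_input, n_hidden, rng),
        "enc_latent": init_layer(n_hidden, n_latent, rng),
        "dec_hidden": init_layer(n_latent, n_hidden, rng),
        "dec_output": init_layer(n_hidden, n_input, rng),
    } for _ in range(n_branches)]
    predictor = init_layer(n_latent * n_branches, 1, rng)
    return branches, predictor


def mae_forward(x, branches, predictor):
    latents, reconstructions = [], []
    for branch in branches:
        h = dense(dense(x, *branch["enc_hidden"], relu=True), *branch["enc_latent"])
        x_hat = dense(dense(h, *branch["dec_hidden"], relu=True), *branch["dec_output"])
        latents.extend(h)               # concatenate the latent spaces
        reconstructions.append(x_hat)
    # Average the decoder outputs to obtain the total decoder output.
    x_hat_mean = [sum(values) / len(values) for values in zip(*reconstructions)]
    next_value = dense(latents, *predictor)[0]
    return x_hat_mean, next_value


def mae_loss(x, target, x_hat, next_value):
    """Mean squared error on both outputs (equal weighting is an assumption)."""
    reconstruction = sum((u - v) ** 2 for u, v in zip(x_hat, x)) / len(x)
    return reconstruction + (next_value - target) ** 2


rng = random.Random(0)
branches, predictor = init_mae(rng)
window = [k / 8 for k in range(1, 9)]          # made-up normalized 8-day segment
x_hat, next_value = mae_forward(window, branches, predictor)
loss = mae_loss(window, 1.1, x_hat, next_value)
```

Concatenating three 4-unit latent vectors gives the predictor 12 inputs while each auto-encoder still sees the same 8-day window.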
Let us consider the epidemic curve as a time series that models the advance of an epidemic by measuring the number of new confirmed cases of Covid-19 on a daily basis. We first apply a moving average filter of size 3 to deal with the variability of the data related to the number of tests available and delays in reporting, among other problems.

We compute the input examples by dividing the whole epidemic curve into overlapping segments of 8 days, shifted one day from each other, with the 9th day being the value to be forecast by the MAE. Each example is normalized by dividing its values by its maximum value. A set of 10 examples is taken from the most advanced places in each cluster to form a batch of examples.

In order to identify the most advanced places in the epidemic timeline, we compute the difference between the number of cases on the day of the peak and the last number of cases reported. If this difference is positive, the number of daily cases has started to decrease, meaning that the place has passed the peak number of cases and is more advanced in the epidemic timeline.

In order to forecast the Brazilian epidemic curve, we start by applying the same moving average filter of size 7 to the epidemic curve of the Brazilian states, as detailed in Section 2.3. Then we perform the forecasting on these clustered time series data in two phases.

The first phase uses existing data to feed the network, and the forecast value is one step ahead of the current example. In the second phase, referred to as multi-step ahead, we use the predicted value of the i-th step to forecast the value of the (i+1)-th step. In this way, existing data is not required for the second phase of forecasting, which allows us to forecast the epidemic behaviour several days ahead and to identify the probable date of the peak number of daily cases, which might indicate a drop in the number of occurrences.
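The windowing, the peak test and the multi-step-ahead loop described above can be sketched as follows. The `predict_next` argument is a hypothetical stand-in for the trained MAE, and normalizing each example by the maximum of its 8 input days (rather than including the 9th day) is an assumption.

```python
def make_examples(curve, window=8):
    """Overlapping 8-day segments shifted by one day; the 9th day is the target."""
    examples = []
    for start in range(len(curve) - window):
        segment = curve[start:start + window]
        scale = max(segment)
        if scale == 0:
            continue  # an all-zero window cannot be normalized
        target = curve[start + window]
        examples.append(([v / scale for v in segment], target / scale))
    return examples


def passed_peak(curve):
    """True when the last reported value is below the peak value."""
    return max(curve) - curve[-1] > 0


def forecast_multi_step(curve, predict_next, horizon, window=8):
    """Feed each prediction back as input to forecast `horizon` days ahead."""
    history = list(curve)
    for _ in range(horizon):
        segment = history[-window:]
        scale = max(segment) or 1.0
        normalized = [v / scale for v in segment]
        history.append(predict_next(normalized) * scale)
    return history[len(curve):]


# Made-up rising curve and a toy predictor (10% daily decay) used only for the demo.
curve = [float(v) for v in range(1, 13)]
examples = make_examples(curve)
forecast = forecast_multi_step(curve, lambda seg: 0.9 * seg[-1], horizon=5)
```

With a trained model in place of the toy predictor, the date of the maximum of `curve + forecast` gives the estimated peak, and `passed_peak` applied to a country's reported curve selects the more advanced places of a cluster.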
Notice that this peak, or the end of the pandemic, might be subject to some displacement due to problems in the data, so a final step needs to be applied in order to verify the peaks for all states. This is done by fitting a distribution curve to the output data, as described next.

2.5. Final approximation for the Covid-19 curves

Although we discuss below the impossibility of finding a curve that mathematically represents the Covid-19 dissemination, the main reason for trying to approximate this curve is that it provides useful information, such as verifying the peak and estimating the end of the pandemic. Moreover, it can generate more realistic numbers of cases, to some degree of precision. Determining the end of the pandemic and the peak are two such uses, since epidemics are assumed to obey certain statistical rules [7], to some degree of precision. In this work we verify the peaks after approximating the final predicted curve using a statistical procedure.

Regarding the modeling of Covid-19 using statistical distributions, it has been argued that this is a somewhat difficult task. In fact, the Covid-19 curve cannot be considered a Gaussian probability distribution [29]. The argument is that the shape of a normal distribution is a histogram, that is, a transformation of probability density against the values of a single variable, while the Covid-19 contagion curve is a transformation of the values of one variable (confirmed cases) according to a second variable (time). So the curve is not a distribution in the sense usually meant in probability and statistics. Nonetheless, one can visually notice that the curve of daily confirmed cases × time (day) looks like a distorted Gaussian, and it can actually be approximated by some distributions such as the normal (rarely), Pearson, logistic, log-normal, and gamma, among others.

For the sake of confirming or refining the estimated peak, we thus apply a statistical fitting procedure to the time series data output by the MAE models.
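To make the idea of approximating the forecast curve concrete, the sketch below fits a Gaussian-shaped curve to a series of daily cases by moment matching and measures the fit with the sum of squared errors. This is a deliberately simplified stand-in: the fitting procedure and the candidate distributions actually compared (Pearson, logistic, log-normal, gamma) are not reproduced here, and the function names and synthetic data are hypothetical.

```python
import math


def fit_gaussian_by_moments(daily_cases):
    """Return (total, mu, sigma) of a bell curve matching the series' first two moments."""
    total = sum(daily_cases)
    mu = sum(t * y for t, y in enumerate(daily_cases)) / total
    variance = sum((t - mu) ** 2 * y for t, y in enumerate(daily_cases)) / total
    return total, mu, math.sqrt(variance)


def gaussian_curve(t, total, mu, sigma):
    """Expected daily cases on day t for a Gaussian-shaped epidemic curve."""
    return total / (sigma * math.sqrt(2 * math.pi)) * math.exp(-0.5 * ((t - mu) / sigma) ** 2)


def sum_squared_error(daily_cases, params):
    """Goodness of fit; the same score could be used to compare other candidate curves."""
    return sum((y - gaussian_curve(t, *params)) ** 2 for t, y in enumerate(daily_cases))


# Synthetic forecast: 1000 total cases, peak on day 30, spread of 5 days.
series = [gaussian_curve(t, 1000.0, 30.0, 5.0) for t in range(61)]
total, mu, sigma = fit_gaussian_by_moments(series)
error = sum_squared_error(series, (total, mu, sigma))
estimated_peak_day = round(mu)
```

On real forecasts the curve is skewed, so a symmetric bell curve would typically leave a larger error than an asymmetric candidate, which is consistent with the remark that the normal distribution rarely fits.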
3. Results
In the following we present the experimental results for validating the methods introduced in this work. We start by describing some results found in the literature for traditional approaches, followed by the LSTM results, since these approaches are compared and discussed later. Then the results of the clustering procedure are shown in order to validate and clarify the approach used. Finally, we present the results obtained with the Modified Auto-Encoder model to forecast the epidemic curve of Covid-19 in Brazil, as well as the approximated distribution curve confirming the peaks obtained by the fitting procedure above, for all Brazilian states. The numbers for Brazil, which are of direct interest to the population, are presented for some of the states and are available at . We stress once more that these numbers are predictions, so they may differ from the real numbers as the pandemic dynamics evolve.
Results with SIR and SEIR can be found in several sources, including applications running on the Internet [30]. Before discussing these results, we note that, to date, we could not find accurate results on Covid-19 long-term dynamics prediction for these methods. However, any approximation is useful at this time, as long-term forecasts may help managers to discuss different types of confinement policies [31] and to estimate the optimal date to end the confinement policy. For example, the preliminary results reported by Bastos and Cajueiro on April 1st [32], using SIR and SIAS, are far from accurate. Indeed, they suggest that 30 million people will be infected in Brazil on May 11th (predicted peak of the pandemic) in the scenario with the fewest infections. Later, in their revised work evolving to SID and SIASD approaches [31], the results are better, although still not precise.

Regarding the Northeast (our main region of interest), a web site for monitoring COVID maintained by UFRN [30] is used by the government. It is based on a modified SEIR model accounting for social distancing rules [33]. According to this model, the epidemic started on March 1st and the symptomatic cases are predicted to end on July 1st. The peak of symptomatic people is predicted for May 17th, with 20 million. Looking at the web page in more detail (on May 4th), a PHP application indicates that the Brazilian state of RN could have 2,039 confirmed cases on April 30th 2020, following the current scenario of social distancing, as shown in Figure 6, printed from the website. The actual outcome was 1,297 confirmed cases for that day, which is not so bad (less than 50% error; the difference of 742 cases is about 36% of the predicted value). For Brazil, their prediction was about 742 thousand confirmed cases by the same day. The actual number was 87,187 confirmed cases (according to the Health Ministry of Brazil, MS), so the prediction was roughly 8.5 times the actual number, a much greater disparity. This picture is shown in Figure 7, also available from their website (http://astro.dfte.ufrn.br/html/Cliente/COVID19.php). Again, these predictions, although somewhat exaggerated, are useful for the authorities to take decisions, since most of the time they report the worst cases of the pandemic. As will be shown, our results tend to be a little more modest than the ones reported on this site, and we sincerely hope that our predictions are still exaggerated.
Figure 6: Projections for Rio Grande do Norte state (in the northeast of Brazil) [33]. Figure printed out from the web application running at http://astro.dfte.ufrn.br/html/Cliente/COVID19.php. Accessed on May 4th.

Figure 7: Projections for Brazil with adapted SEIR model [33], extracted from http://astro.dfte.ufrn.br/html/Cliente/COVID19.php. Accessed on May 4th.

In our initial studies towards data driven approaches, we tested the possibility of using an LSTM-type RNN for determining Covid-19 dynamics for our region of most interest (the Brazilian state of RN), but it did not work as expected due to several factors, mainly the under-notification in the data made available by the governments of Brazil and its federated states. Hence, more work is necessary for improving and testing this model, and adjusting it to predict the dynamics of the pandemic, including its various parameters. The problems with this architecture applied to COVID-19 data are further discussed in Section 4. Basically, the main drawback of the model is its inability to reset at a certain time and bring the values back to zero (or close to it). Nevertheless, we used the LSTM-SAE to forecast three different places, at different phases of the disease: Italy (Figure 8), whose contamination curve is starting to decrease; Brazil (Figure 9), which is approaching the peak; and the state of Rio Grande do Norte (Figure 10), which is about to reach the peak.

Figure 8: Predictions and forecasting for Italy on Covid-19 cumulative (8a) and daily (8b) number of cases. Curves: Prediction|Forecast and Real, plotted against the date.

Figure 9: Predictions and forecasting for Brazil on Covid-19 cumulative (9a) and daily (9b) number of cases. Curves: Prediction|Forecast and Real, plotted against the date.

Figure 10: Predictions and forecasting for RN on Covid-19 cumulative (10a) and daily (10b) number of cases. Curves: Prediction|Forecast and Real, plotted against the date.

Although the models do not respond perfectly, some important LSTM features can be seen on the charts. One is that the LSTM-SAE model could stabilize over time. The daily results show that the other LSTM models cannot return to zero and keep oscillating around some positive value. Because of this, when the values are accumulated, the total always increases. This issue is more apparent when the model is used for countries or regions whose cases have not stabilized. These limitations can be associated with the non-linearity of the data, among other issues. Another point is that, as presented in previous work [17], the LSTM-SAE addresses the randomization in the LSTM block: the encoder-decoder model, trained first, feeds the hidden layer with initialization weights. This may be why the LSTM-SAE architecture presents the best results. More complete results using LSTM can be found at .

Before showing MAE results, this subsection presents and discusses the results of the preliminary clustering needed to improve the performance of the MAE, which was presented in Section 2.3. To evaluate the obtained clusters qualitatively, let us use the 2D UMAP embedding shown in Figure 11. Seven clusters were formed and, overall, they seem rather compact and distinct from each other. Although there is a slight overlap between some pairs of clusters, this plot suggests that there are well defined groups among the different countries/regions, probably reflecting the types of actions taken by the governments to react to the early signs of the pandemic. Notice that we assume that countries from the same cluster should follow similar contamination curves.

In order to visualize this preliminary classification and get some insight for Brazilian states, a map of Brazil representing the clusters is shown in Figure 12. In that map, the states and countries that presented a similar reaction to the outbreak of Covid-19 share the same color. The countries represented by hatches in the maps were either not sufficiently advanced at the time of the study or their time series produced numerical instabilities during feature computation.
The states from the USA, Australia, Italy, China and Canada appear in the different clusters but are not represented in the maps, to improve the readability of the paper. The full results of the cluster assignment used in the training process can be found at .

Figure 11: 2D UMAP embedding of the different countries and states studied. The colors represent the different clusters generated using Affinity Propagation.

Finally, the values of the features of the different groups are presented in the form of violin plots in Figure 13. We can see, for example, that cluster 0 gathers the countries with higher Days until 10x, meaning that these countries/regions managed to contain the contagion early. In turn, Brazil belongs to cluster 1, which contains countries with high early acceleration and above average mortality rate.

We conclude this section by underlining the fact that the colors representing the clusters in Figures 11, 12 and 13 match, meaning that a country in yellow on the map belongs to the yellow cluster on the UMAP plot and its statistics can be seen in yellow on the violin plot. In addition, we believe that all major centers of Covid-19 are represented in these maps, which provides sufficient material to train our models.

Figure 12: Cluster assignment of the different Brazilian states and world countries: (a) Brazilian states, (b) World.

Figure 13: Violin plots representing the values taken by the different features for each group obtained after UMAP + Affinity Propagation clustering.

This section presents the results obtained by applying the MAE architecture to forecast the Covid-19 epidemic curves of the Brazilian states of each cluster. For each cluster, a MAE model was trained with the 10 most advanced countries of the cluster, using the data available up to the day of this study, and the epidemic curves for the Brazilian states of the cluster were forecast. We note, however, that the Brazilian states were found only in clusters 0, 1, 2 and 3.

Here, we depict one state for each cluster. For an interactive visualization of all Brazilian states you may refer to .

In Figure 14, the daily and cumulative epidemic curves for the state of Sergipe are displayed. The peak for Sergipe is predicted to happen on May 9, and the total number of cases should reach up to 2546 by mid-July.

Figure 14: Daily (a) and cumulative (b) cases for the state of Sergipe, from cluster 0.

Figure 15 depicts the epidemic curves for the state of São Paulo. In this case, we predict the peak number of cases for May 8, and that the state would reach a total of 64,984 cases by mid-July.

The epidemic curves for the state of Rio Grande do Norte are depicted in Figure 16. The peak in daily cases is predicted to happen on May 15, and the total should reach up to 6025 at the end of August.

For the last cluster, we depict the epidemic curves of the state of Santa Catarina. The peak is predicted to happen on May 16, and the total should reach 15329 cases at the end of September.

From the epidemic curves illustrated above, we verify that the behavior of each state is associated with the cluster it belongs to. States from cluster 0 generally present a steep peak but a very low number of daily cases, indicating that the epidemic is starting and evolving fast but will not present an elevated number of daily cases.

Figure 15: Daily (a) and cumulative (b) cases for the state of São Paulo, from cluster 1.
Cluster 1 presents a different behaviour. Generally, states from cluster 1 present a steep peak with a high number of daily cases, meaning that the transmission dynamics are much faster than in cluster 0. Meanwhile, the predictions show that states from cluster 1 are close to reaching their peak number of daily cases, after which the daily cases should decay very fast.

States from cluster 2 present slower transmission dynamics than states from cluster 1 and, according to the expected peak dates, have not yet reached their peak number of daily cases.

States from cluster 3 present the slowest transmission dynamics and tend to have their number of daily cases decaying slowly.

Figure 16: (a) Daily and (b) cumulative cases for Rio Grande do Norte state, from cluster 2.

We also indicate, in Table 2, the date of the peak occurrence of cases, the date on which 97% of the total number of cases will be reached, the total number of cases, and a peak date obtained by fitting a probability distribution to the predicted curves, as well as the curve used in the fitting process. Examples of these curves are shown in Figures 18 and 19, for the states of Rio de Janeiro and São Paulo, respectively, corroborating the peaks shown in Table 2. The curves for the other states can be found at .
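The quantities reported in Table 2 (peak date, total number of cases, and the date on which 97% of the total is reached) can be read directly from a forecast daily curve. The sketch below shows one way to do so; the function name `summarize_forecast`, the start date and the forecast values are hypothetical and serve only as an illustration, not as output of the MAE model.

```python
from datetime import date, timedelta
from itertools import accumulate
from typing import Dict, Sequence


def summarize_forecast(daily: Sequence[float], first_day: date,
                       share: float = 0.97) -> Dict[str, object]:
    """Peak date, total cases, and first date on which `share` of the total is reached."""
    cumulative = list(accumulate(daily))
    total = cumulative[-1]
    # Index of the largest daily count (first one in case of ties).
    peak_index = max(range(len(daily)), key=lambda i: daily[i])
    # First day on which the running total reaches the requested share.
    share_index = next(i for i, c in enumerate(cumulative) if c >= share * total)
    return {
        "peak_date": first_day + timedelta(days=peak_index),
        "total": total,
        "share_date": first_day + timedelta(days=share_index),
    }


# Hypothetical daily forecast (illustrative values, not model output).
forecast = [10, 20, 40, 60, 40, 20, 6, 3, 1]
summary = summarize_forecast(forecast, first_day=date(2020, 5, 1))
print(summary)
```

Note that the total is taken here as the sum over the forecast horizon, so it is only meaningful if the horizon extends well past the peak.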
4. Discussion
As already explained, the LSTM-based approaches did not work well for modeling the Covid-19 dynamics. The LSTM-SAE was also tried and performed a little better. There are some explanations for this lower performance. The first issue is data non-linearity caused by under-sampling (every other day and then every day, for example) and under-notification (reported numbers below the real values), which have also been problematic for countries other than Brazil. Several countries are under-testing their population, so the number of reported cases is below reality. Even for countries that are doing massive testing, there are often delays between the real occurrences and their notification. Another potential source of error is the randomization of the weights, which can be addressed with LSTM-SAE [17]; however, the first issue (the non-linearity problem) remains. Still, instability was observed during training. Several attempts were needed to obtain a more stable model, by manually tuning a fixed initialization seed.

Figure 17: (a) Daily and (b) cumulative cases for Santa Catarina state, from cluster 3.

Neural networks are known to be good function approximators, and at first look, the functions they are approximating are likely to be nonlinear. In particular, an LSTM creates an embedding that transforms the function into a linear one for the final prediction. However, this is not related to the fact that the input is nonlinear, which is the case for the data distribution of Covid-19. Actually, we conjecture that the input data can be considered quasi-linear (somewhere between nonlinear and linear) and that it obeys a certain pattern, otherwise no model could approximate it. The limited latent space is also a problem, even more so for long sequences, as is the case here. Although the LSTM models long-term memories well, it fails to regularize for other sequences with different properties [27]. That is to say, if a certain situation (lock-down or distancing) is maintained, it could perform better. Besides, the problem of learning long-term dependencies remains one of the main challenges in deep learning [27]. A last problem with LSTM is that the time series has to be stationary with a stable mean, an assumption that does not hold for the data analyzed in this paper.

Figure 18: Curve fitting for Rio de Janeiro state (the lognormal model was the best fit), with the peak indicated on May 5th, 2020.

Figure 19: Curve fitting for São Paulo state (the logistic model was the best fit), with the peak indicated on May 6th, 2020.

Figure 20: Curve fitting for Rio Grande do Norte state (the lognormal model was the best fit), with the peak indicated on May 13th, 2020.

Table 2: Peak dates for each state as predicted by the MAE model and by the best-fitting probability distribution. We also indicate the total number of cases expected from the MAE prediction and the day on which 97% of that total will be reached.

State  Peak (MAE)  Peak (curve fit)  Best curve  Total cases  97% of total
TO     2020-05-10  2020-05-10        Pearson     846          2020-06-13
SE     2020-05-09  2020-05-10        Pearson     2546         2020-06-13
MG     2020-05-04  2020-04-30        Logistic    2992         2020-06-03
MS     2020-04-25  2020-04-24        Pearson     327          2020-05-28
PA     2020-05-09  2020-05-10        Lognormal   10332        2020-06-11
AP     2020-05-12  2020-05-12        Logistic    5172         2020-06-15
MA     2020-05-07  2020-05-07        Lognormal   9684         2020-06-10
CE     2020-04-30  2020-04-28        Pearson     11556        2020-05-29
PE     2020-05-04  2020-05-05        Lognormal   18210        2020-06-08
RJ     2020-05-05  2020-05-05        Lognormal   21587        2020-06-07
SP     2020-05-08  2020-05-06        Logistic    64984        2020-06-07
RN     2020-05-15  2020-05-13        Lognormal   6025         2020-07-06
DF     2020-05-16  2020-05-17        Logistic    6347         2020-07-06
RO     2020-05-12  2020-05-14        Pearson     3061         2020-08-10
PI     2020-05-16  2020-05-19        Pearson     4974         2020-08-13
PB     2020-05-16  2020-05-21        Pearson     8765         2020-08-14
AL     2020-05-11  2020-05-19        Pearson     8119         2020-08-11
BA     2020-05-07  2020-05-08        Pearson     8945         2020-08-04
ES     2020-05-16  2020-05-17        Pearson     18271        2020-08-12
PR     2020-05-10  2020-05-07        Lognormal   4038         2020-08-04
SC     2020-05-16  2020-05-20        Pearson     15329        2020-08-13
RS     2020-05-08  2020-05-08        Gamma       4269         2020-08-03
MT     2020-04-30  2020-04-30        Pearson     701          2020-07-30
GO     2020-05-03  2020-05-05        Pearson     2245         2020-08-03

An issue that drew our attention is that, for states approaching the peak, the final curve fitting performed better, with a curve visually closer to the data, as in the cases reported in Figures 18 and 19. As Figure 20 shows, for example, this may indicate that the actual values close to the peak are lower than those predicted by the MAE approach. This can be confirmed once the peak is reached. If this is the case, our method can be adjusted to account for this property, which is our first idea for future work.

The clustering approach proposed in this paper uses a feature representation focusing on the early response of the countries. This was based on the assumption that the first weeks of the spread of the disease are crucial in determining its dynamics in a given region. However, in future work, it might be interesting to refine the groups based on the most recent data, in order to obtain even more accurate predictions. For example, if a state is at 6 weeks after the outbreak, we could compute the features for weeks 4 to 6 after the outbreak.
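The curve-fitting refinement discussed earlier in this section can be illustrated with a minimal sketch that compares two candidate shapes on a daily curve and keeps the one with the lowest squared error. This is not the fitting routine used for Table 2: the grid search, the parameter grids, the helper names and the synthetic input are assumptions chosen to keep the example self-contained, and only the logistic and lognormal shapes are included (the Pearson and Gamma candidates are omitted for brevity).

```python
import math
from typing import Callable, Dict, Sequence, Tuple


def logistic_pdf(t: float, mu: float, s: float) -> float:
    """Symmetric bell curve: derivative of the logistic growth curve."""
    z = math.exp(-(t - mu) / s)
    return z / (s * (1.0 + z) ** 2)


def lognormal_pdf(t: float, m: float, sigma: float) -> float:
    """Right-skewed lognormal density, defined as zero for t <= 0."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - m) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))


def fit_scaled(pdf: Callable[..., float],
               grid: Sequence[Tuple[float, float]],
               daily: Sequence[float]) -> Dict[str, object]:
    """Least-squares fit of K * pdf(t) to daily counts by grid search.

    For fixed shape parameters, the best scale K has a closed form.
    """
    best: Dict[str, object] = {"sse": math.inf}
    for params in grid:
        shape = [pdf(t, *params) for t in range(len(daily))]
        denom = sum(f * f for f in shape)
        if denom == 0.0:
            continue
        k = sum(y * f for y, f in zip(daily, shape)) / denom
        sse = sum((y - k * f) ** 2 for y, f in zip(daily, shape))
        if sse < best["sse"]:
            best = {"sse": sse, "params": params, "scale": k,
                    "curve": [k * f for f in shape]}
    return best


# Hypothetical parameter grids (assumed ranges, coarse on purpose).
logistic_grid = [(mu, s) for mu in range(10, 61) for s in range(2, 11)]
lognormal_grid = [(m / 10, sg / 10) for m in range(20, 46) for sg in range(1, 11)]

# Synthetic "predicted" daily curve: 1000 total cases, peak on day 30.
daily = [1000 * logistic_pdf(t, 30, 5) for t in range(90)]

candidates = {
    "logistic": fit_scaled(logistic_pdf, logistic_grid, daily),
    "lognormal": fit_scaled(lognormal_pdf, lognormal_grid, daily),
}
best_name = min(candidates, key=lambda name: candidates[name]["sse"])
fitted = candidates[best_name]["curve"]
peak_day = max(range(len(fitted)), key=lambda i: fitted[i])
print(best_name, peak_day)
```

A continuous optimizer would normally replace the grid search; the grid is used here only so that the sketch depends on nothing beyond the standard library.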
5. Conclusions
The main problem addressed in this paper is obtaining a more realistic model of the Covid-19 dynamics by using cases that have already occurred in other locations or countries with a similar distribution. Although our study focuses on the Brazilian reality, the proposed approach can technically be applied elsewhere. To determine these similar distributions, a clustering was first applied to the countries/regions (both those used for training and those to be predicted). This clustering is a key part of the process and will be improved in future work in order to represent the characteristics of the time series more closely.

To this end, we have proposed an alternative way of modeling Covid-19 dynamics, using a data-driven approach based on MAE. According to our results, this approach performed better than traditional and LSTM approaches. We proposed an initial clustering of the training data based on Early Mortality, Days until 10x, and Early Acceleration, using data from regions where the pandemic is at an advanced stage. Then, we used the deep learning MAE approach to train a neural network guided by this clustering. This approach worked better, as verified at the end by fitting approximating curves to the dynamics of each Brazilian state in order to confirm the peaks.

Based on the results discussed above, we could verify the applicability of data-driven approaches to modeling Covid-19 dynamics. With this approach, which deals with regional aspects through the features used to characterize the pandemic, city managers can get more precise information and better insight to plan their actions. Complementary material for this work can be found at , where the next step is to implement this approach so that it runs and updates automatically, using the most recent available data.

References

[1] P. Byass, Eco-epidemiological assessment of the Covid-19 epidemic in China, January-February 2020, medRxiv. doi:10.1101/2020.03.29.20046565.
[2] F. Hamzah, A. Binti, C. Lau, H. Nazri, D. V. Ligot, G. Lee, C. L. Tan, Coronatracker: Worldwide Covid-19 outbreak data analysis and prediction, Bull World Health Organ. 1 (2020) 32. doi:10.2471/BLT.20.255695.
[3] D. Fanelli, F. Piazza, Analysis and forecast of Covid-19 spreading in China, Italy and France, Chaos, Solitons & Fractals 134 (2020) 109761. doi:10.1016/j.chaos.2020.109761.
[4] G. F. Webb, P. Magal, Z. Liu, O. Seydi, A model to predict Covid-19 epidemics with applications to South Korea, Italy, and Spain, medRxiv. doi:10.1101/2020.04.07.20056945.
[5] A. Grant, Dynamics of Covid-19 epidemics: SEIR models underestimate peak infection rates and overestimate epidemic duration, medRxiv. doi:10.1101/2020.04.02.20050674.
[6] [...] doi:10.1101/2020.04.03.20049734.
[7] G. K. Baerwolff, A contribution to the mathematical modeling of the corona/Covid-19 pandemic, medRxiv. doi:10.1101/2020.04.01.20050229.
[8] N. Periwal, S. Sarma, P. Arora, V. Sood, In-silico analysis of SARS-CoV-2 genomes: Insights from SARS encoded non-coding RNAs, bioRxiv. doi:10.1101/2020.03.31.018499.
[9] C. Distante, P. Piscitelli, A. Miani, Covid-19 outbreak progression in Italian regions: Approaching the peak by the end of March in northern Italy and first week of April in southern Italy, International Journal of Environmental Research and Public Health 17 (9). doi:10.3390/ijerph17093025.
[10] L. Wang, J. Li, S. Guo, N. Xie, L. Yao, Y. Cao, S. W. Day, S. C. Howard, J. C. Graff, T. Gu, J. Ji, W. Gu, D. Sun, Real-time estimation and prediction of mortality caused by Covid-19 with patient information based algorithm, Science of The Total Environment 727 (2020) 138394. doi:10.1016/j.scitotenv.2020.138394.
[11] M. te Vrugt, J. Bickmann, R. Wittkowski, Effects of social distancing and isolation on epidemic spreading: a dynamical density functional theory model (2020). arXiv:2003.13967.
[12] [...] doi:10.20944/preprints202004.0311.v1.
[13] C. Distante, I. Gadelha Pereira, L. M. Garcia Goncalves, P. Piscitelli, A. Miani, Forecasting Covid-19 outbreak progression in Italian regions: A model based on neural network training from Chinese data, medRxiv. doi:10.1101/2020.04.09.20059055.
[14] Z. Yang, Z. Zeng, K. Wang, S.-S. Wong, W. Liang, M. Zanin, P. Liu, X. Cao, Z. Gao, Z. Mai, J. Liang, X. Liu, S. Li, Y. Li, F. Ye, W. Guan, Y. Yang, F. Li, S. Luo, Y. Xie, B. Liu, Z. Wang, S. Zhang, Y. Wang, N. Zhong, J. He, Modified SEIR and AI prediction of the epidemics trend of Covid-19 in China under public health interventions, Journal of Thoracic Disease 12 (3). URL http://jtd.amegroups.com/article/view/36385
[15] W. C. Roda, M. B. Varughese, D. Han, M. Y. Li, Why is it difficult to accurately predict the Covid-19 epidemic?, Infectious Disease Modelling 5 (2020) 271–281. doi:10.1016/j.idm.2020.03.001.
[16] O. M. Otunuga, M. O. Ogunsolu, Qualitative analysis of a stochastic SEITR epidemic model with multiple stages of infection and treatment, Infectious Disease Modelling 5 (2020) 61–90. doi:10.1016/j.idm.2019.12.003.
[17] A. Sagheer, M. Kotb, Unsupervised pre-training of a deep LSTM-based stacked autoencoder for multivariate time series forecasting problems, Sci Rep 9 (2019) 1938. doi:10.1038/s41598-019-55320-6.
[18] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.
[19] A. Sagheer, M. Kotb, Time series forecasting of petroleum production using deep LSTM recurrent networks, Neurocomputing 323 (2019) 203–213. doi:10.1016/j.neucom.2018.09.082.
[20] E. Dong, H. Du, L. Gardner, An interactive web-based dashboard to track Covid-19 in real time, The Lancet Infectious Diseases.
[21] Coronavírus Brasil. URL https://covid.saude.gov.br/
[22] Covid-19 (May 2020). URL https://github.com/pcm-dpc/COVID-19
[23] M. Ploner, Towards data science: which countries react similar to Covid 19, machine learning provides the answer. URL https://towardsdatascience.com/
[24] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[26] B. J. Frey, D. Dueck, Clustering by passing messages between data points, Science 315 (5814) (2007) 972–976.
[27] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[28] D. P. Kingma, M. Welling, An introduction to variational autoencoders, Foundations and Trends in Machine Learning 12 (4) (2019) 307–392. doi:10.1561/2200000056.
[29] S. (https://stats.stackexchange.com/users/124694/samos), Is the Covid-19 pandemic curve a Gaussian curve?, Cross Validated (version: 2020-03-22). URL https://stats.stackexchange.com/q/455202
[30] W. Lyra, J. D. do Nascimento Junior, J. Belkhiria, P. P. M. C. Leandro de Almeida, I. de Andrade, Projections for the state of Rio Grande do Norte: Population, demand for hospitalization and progression of cases (in Portuguese), Web Covid-19 resource page of Department for Theoric and Experimental Physics - UFRN, accessed on May 04th, 2020. URL http://astro.dfte.ufrn.br/html/Cliente/COVID19bra.php
[31] S. B. Bastos, D. O. Cajueiro, Modeling and forecasting the early evolution of the Covid-19 pandemic in Brazil (second version, April 10th 2020) (2020). arXiv:2003.14288.
[32] S. B. Bastos, D. O. Cajueiro, Modeling and forecasting the early evolution of the Covid-19 pandemic in Brazil (first version, April 1st 2020) (2020). arXiv:2003.14288.
[33] W. Lyra, J. D. do Nascimento, J. Belkhiria, L. de Almeida, P. P. Chrispim, I. de Andrade, Covid-19 pandemics modeling with SEIR(+CAQH), social distancing, and age stratification. The effect of vertical confinement and release in Brazil, medRxiv. doi:10.1101/2020.04.09.20060053.