Forecasting Brazilian and American COVID-19 cases based on artificial intelligence coupled with climatic exogenous variables
Ramon Gomes da Silva, Matheus Henrique Dal Molin Ribeiro, Viviana Cocco Mariani, Leandro dos Santos Coelho
Ramon Gomes da Silva a,∗, Matheus Henrique Dal Molin Ribeiro a,b, Viviana Cocco Mariani c,d, Leandro dos Santos Coelho a,d
a Industrial & Systems Engineering Graduate Program (PPGEPS), Pontifical Catholic University of Parana (PUCPR). 1155, Rua Imaculada Conceicao, Curitiba, PR, Brazil. 80215-901
b Department of Mathematics, Federal Technological University of Parana (UTFPR). Via do Conhecimento, KM 01 - Fraron, Pato Branco, PR, Brazil. 85503-390
c Mechanical Engineering Graduate Program (PPGEM), Pontifical Catholic University of Parana (PUCPR). 1155, Rua Imaculada Conceicao, Curitiba, PR, Brazil. 80215-901
d Department of Electrical Engineering, Federal University of Parana (UFPR). 100, Avenida Coronel Francisco Heraclito dos Santos, Curitiba, PR, Brazil. 81530-000
Abstract
The novel coronavirus disease (COVID-19) is a public health problem: according to the World Health Organization, up to June 10th, 2020, more than 7.1 million people had been infected and more than 400 thousand had died worldwide. In the current scenario, Brazil and the United States of America present a high daily incidence of new cases and deaths. Therefore, it is important to forecast the number of new cases in a time window of one week, since this can help the public health system develop strategic planning to deal with COVID-19. Forecasting models based on artificial intelligence (AI) have the potential to deal with the difficult dynamic behavior of time series such as those of COVID-19. In this paper, Bayesian regression neural network, cubist regression, k-nearest neighbors, quantile random forest, and support vector regression are used stand-alone and coupled with the recent pre-processing technique variational mode decomposition (VMD), which is employed to decompose the time series into several intrinsic mode functions. All AI techniques are evaluated in the task of forecasting the cumulative COVID-19 cases one, three, and six days ahead in five Brazilian and five American states with a high number of cases up to April 28th, 2020. Previous cumulative COVID-19 cases and exogenous variables such as daily temperature and precipitation were employed as inputs for all forecasting models. The models' effectiveness is evaluated based on performance criteria. In general, the hybridization with VMD outperformed single forecasting models regarding accuracy; specifically, when the horizon is six days ahead, the hybrid VMD–single models achieved better accuracy in 70% of the cases. Regarding the exogenous variables, the importance ranking as predictor variables is, from highest to lowest, past cases, temperature, and precipitation. Therefore, due to the efficiency of the evaluated models in forecasting cumulative COVID-19 cases up to six days ahead, the adopted models can be recommended as promising forecasting models and can be used to assist in the development of public policies to mitigate the effects of the COVID-19 outbreak.
Keywords: Artificial intelligence, COVID-19, Exogenous variables, Forecasting, Variational mode decomposition, Machine learning
∗ Corresponding author
Email address: [email protected] (Ramon Gomes da Silva)
Preprint submitted to Chaos, Solitons & Fractals, July 22, 2020
1. Introduction
The new coronavirus disease (COVID-19) is a viral infectious disease induced by severe acute respiratory syndrome coronavirus 2 (SARS-CoV2). According to the World Health Organization (WHO), most infected people will experience mild to moderate respiratory illness and recover without requiring special treatment [1]. However, several studies are being developed, and preliminary results indicate that people with underlying medical problems such as cardiovascular disease, diabetes, chronic respiratory disease, obesity, and cancer are more likely to develop serious injuries [2, 3, 4, 5, 6, 7]. Also, COVID-19 can cause extensive and multiple lung injuries [8], thus compromising the respiratory system of patients. In this context, the demand for devices that assist in the performance of breathing-related movements has increased.
Due to the serious damage caused by COVID-19, according to the WHO, up to June 10th, 2020, more than 7.1 million people had been infected and more than 400 thousand people worldwide had died with the coronavirus. Indeed, considering the current scenario of health systems worldwide, overcrowding could be observed in some countries, such as Italy, Spain, and perhaps Brazil. In the Brazilian context, it is believed that an average of 3388 municipalities could have a significant deficit in hospital beds. The deficit is projected to occur especially in the Brazilian North and Northeast regions, which means exceeding health care capacity due to COVID-19 [9].
Considering the importance of knowing the difficult epidemiological scenario for COVID-19 on a short-term horizon in order to mitigate the effects of this pandemic, the development of efficient and effective forecasting models has a positive impact by producing reasonably accurate forecasts of the immediate future. 
Also, these models allow health managers to develop strategic planning and perform decision-making as assertively as possible. For this purpose, epidemiological models can be used, as they have been widely adopted [10, 11]. Alternatively, linear forecasting models [12, 13, 14], artificial intelligence (AI) approaches [15, 16], as well as hybrid forecasting models [17, 18] have proved to be effective tools to forecast COVID-19 cases. The advantages of AI approaches for time series forecasting lie in their flexibility in dealing with different kinds of response variables, as well as in their ability to learn the dynamic behavior and complexity of the data and to accommodate non-linearities, such as those observed in epidemiological data [19]. Besides, hybrid methodologies allow us to combine several techniques, such as pre-processing methods and single forecasting models.
By coupling several methods, it is possible to use the specialty of each one to deal with different characteristics and therefore build an effective model. In the context of pre-processing techniques, especially signal decomposition methods, the variational mode decomposition (VMD) [20] is an effective approach to decompose a signal into an ensemble of band-limited modes with specific bandwidth in the spectral domain. It has been applied in several fields [21, 22, 23], since it can deal with the nonlinearities and non-stationarity inherent to time series. Considering the intrinsic mode functions (IMFs) obtained through VMD, it is hard to choose AI models to train on and forecast the VMD components. Therefore, based on this understanding, some models are coupled with VMD and are described in the following.
Due to the necessity of understanding the COVID-19 outbreak and the associated factors, or exogenous variables, some studies are being conducted considering the social environment, climatic variables, pollution, and population density [24, 25, 26, 27, 28]. In this direction, from a general perspective, Sobral et al. [29] investigated the effects of climatic variables on the spread of COVID-19 for 166 countries. The authors argued that increasing temperature reduced the COVID-19 cases, and that precipitation also has a positive correlation with SARS-CoV2 cases. Subsequently, for Brazil, Auler et al. [30] evaluated how meteorological conditions such as temperature, humidity, and rainfall can affect the spread of COVID-19 in five Brazilian cities. The authors concluded that higher mean temperatures and average relative humidity might support COVID-19 transmission. Considering weather aspects in the United States of America (USA), especially for New York state, Bashir et al. [31] inferred that average and minimum temperature and air quality are significantly associated with the COVID-19 pandemic. All previously mentioned studies tried to relate climatic variables to COVID-19, but those papers did not incorporate them in time series models to forecast COVID-19 cases. However, we think that incorporating exogenous climatic variables in forecasting models can help to understand the data dynamics, and perhaps more efficient forecasting models could be obtained [32].
In this respect, for forecasting cumulative cases of COVID-19, the objective of this paper is to explore and compare the predictive capacity of Bayesian regression neural network (BRNN), cubist regression (CUBIST), k-nearest neighbors (KNN), quantile random forest (QRF), and support vector regression (SVR) when used stand-alone, and of a hybrid framework composed of VMD coupled with the previously mentioned models. The datasets used in this study are the numbers of cumulative cases of COVID-19 from five Brazilian states (Amazonas - AM, Ceara - CE, Pernambuco - PE, Rio de Janeiro - RJ, and Sao Paulo - SP), the first from the north region, the second and third from the northeast region, and the other two from the southeast region. 
Five American states were also considered (California - CA, Illinois - IL, Massachusetts - MA, New Jersey - NJ, and New York - NY). These states were chosen because they had the largest number of new cases of COVID-19 up to 28 April 2020.
In the forecasting task, horizons of one, three, and six days ahead of cumulative COVID-19 cases are adopted to evaluate the forecasting efficiency of the different models. Additionally, previous COVID-19 cases and exogenous variables such as daily temperature (maximum and minimum) and precipitation are employed as inputs for each evaluated model. The out-of-sample forecasting accuracy of each model is compared by performance metrics such as the improvement percentage index (IP), symmetric mean absolute percentage error (sMAPE), and relative root mean squared error (RRMSE). Also, the importance of each input variable is presented for each country.
Forecasting models are impacted by the small dataset effect, which makes the prediction of COVID-19 cases a challenging task. The choice of the forecasting and pre-processing approaches is due to the fact that, even though non-linear and AI models need large datasets to properly learn the data pattern, the use of exogenous variables (climate variables) and past values of the response variable helps to overcome this drawback.
VMD decomposes a time series into its intrinsic mode functions adaptively and non-recursively, obtaining a set of sub-series with different features from low frequency to high frequency. The adoption of VMD modes in conjunction with nonlinear machine learning prediction models is a powerful framework to approach small datasets in forecasting tasks. 
In addition, BRNN and SVR approaches are capable of handling small samples, which makes them attractive for this study.
The contributions of this paper can be summarized as follows:
• The first contribution is the proposal of two frameworks, non-decomposed and decomposed models, applied to the task of forecasting the cumulative cases of COVID-19 in five Brazilian and five American states. It is expected that the evaluated models can be used as accurate approaches to support decision-making on structuring the health system to avoid overcrowding in hospitals and prevent new deaths.
• The second contribution is the use of a set of AI models based on machine learning approaches with distinct learning structures, together with the recent and effective pre-processing VMD, to forecast the Brazilian and American COVID-19 cumulative cases. The forecasting models BRNN, CUBIST, KNN, QRF, and SVR and the pre-processing VMD method were chosen because they have been successful in several fields of regression and time series forecasting [33, 34, 35, 36].
• Also, this paper evaluates AI models in a multi-day-ahead forecasting strategy coupled with climatic exogenous inputs. The range of the forecasting time horizon allows us to verify the effectiveness of the predicting models in different scenarios, associated with inputs such as previous COVID-19 cumulative cases, temperature, and precipitation, allowing the models to achieve high forecasting accuracy. Finally, their results can help in planning actions to improve the health system to contain COVID-19 deaths.
The remainder of this paper is organized as follows: Section 2.1 presents a brief description of the dataset adopted in this paper. The forecasting models applied in this study are described in Section 2.2. Section 3 details the procedures applied in the research methodology. The results obtained and the related discussion about the models' forecasting performance are presented in Section 4. 
Finally, Section 5 concludes this study with considerations and some directions for future research proposals.
2. Material and Methods
This section presents a description of the material analyzed (Section 2.1), as well as a description of the models applied in this paper (Section 2.2).
The collected dataset refers to the COVID-19 cumulative cases that occurred in five states of Brazil and five states of the USA until April 28th, 2020. For the Brazilian context, the dataset was collected from an API (Application Programming Interface) [37] that retrieves the daily information about COVID-19 cases from all 27 Brazilian State Health Offices, assembles it, and makes it publicly available. For the USA context, the dataset was collected from the "COVID-19 Data Repository" on GitHub provided by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University [38]. The cumulative confirmed cases and deaths of each state, and the dates of the first and last reports, are shown in Table 1.
Table 1: Summary of COVID-19 cases by country and state
Country  State  Observed days  First reported  Last reported  Cumulative cases  Cumulative deaths
Brazil   AM     47             13/03/2020      28/04/2020     4337              351
Brazil   CE     44             16/03/2020      28/04/2020     6985              403
Brazil   PE     48             12/03/2020      28/04/2020     5724              508
Brazil   RJ     55             05/03/2020      28/04/2020     8504              738
Brazil   SP     64             25/02/2020      28/04/2020     24041             2049
USA      CA     94             26/01/2020      28/04/2020     46164             1864
USA      IL     96             24/01/2020      28/04/2020     48102             2125
USA      MA     87             01/02/2020      27/04/2020     56462             3003
USA      NJ     55             05/03/2020      28/04/2020     113856            6442
USA      NY     58             02/03/2020      28/04/2020     295106            22912
The climatic exogenous variables were retrieved from the "Instituto Nacional de Meteorologia" (INMET) [39] for Brazil, while the USA climate data were taken from the daily global historical climatology network, retrieved from the National Centers for Environmental Information (NCEI) of the National Oceanic and Atmospheric Administration [40] by using the rnoaa package [41]. For each state, considering the daily available information, minimum and maximum temperature (°C) and precipitation (mm) were selected as climatic exogenous inputs to each forecasting model applied in this study. The measurement period varies by state because the date of the first recorded case of the disease differs from state to state. The summary of the climatic variables used is given in Table 2.
The heat-map of the cumulative confirmed cases in each of the five analyzed states of Brazil and the USA is presented in Figure 1. The figure shows that the states with the highest number of COVID-19 cumulative cases are SP and NY, in Brazil and the USA respectively, the states with the highest demographic index in both countries.
Table 2: Descriptive measures for climatic variables by country and state
Country  State  Variable                  Minimum  Median  Mean   Maximum
Brazil   AM     Minimum temperature (°C)  24.76    26.20   26.36  28.28
Brazil   AM     Maximum temperature (°C)  25.29    27.05   27.24  29.55
Brazil   AM     Precipitation (mm)        0.00     0.11    0.33   2.40
Brazil   CE     Minimum temperature (°C)  25.14    26.62   26.60  27.90
Brazil   CE     Maximum temperature (°C)  25.91    27.73   27.68  28.99
Brazil   CE     Precipitation (mm)        0.00     0.12    0.25   1.31
Brazil   PE     Minimum temperature (°C)  23.36    25.18   25.05  26.74
Brazil   PE     Maximum temperature (°C)  24.27    26.30   26.10  27.96
Brazil   PE     Precipitation (mm)        0.00     0.14    0.23   1.33
Brazil   RJ     Minimum temperature (°C)  19.07    21.23   21.57  25.33
Brazil   RJ     Maximum temperature (°C)  19.69    22.16   22.56  26.49
Brazil   RJ     Precipitation (mm)        0.00     0.03    0.13   1.32
Brazil   SP     Minimum temperature (°C)  17.60    19.99   20.09  23.40
Brazil   SP     Maximum temperature (°C)  18.76    21.11   21.37  25.03
Brazil   SP     Precipitation (mm)        0.00     0.00    0.12   1.19
USA      CA     Minimum temperature (°C)  -1.76    4.90    4.91   11.92
USA      CA     Maximum temperature (°C)  10.60    18.33   18.54  28.59
USA      CA     Precipitation (mm)        0.01     4.66    20.98  162.67
USA      IL     Minimum temperature (°C)  -19.42   -0.42   -0.75  14.63
USA      IL     Maximum temperature (°C)  -6.39    8.40    8.99   26.06
USA      IL     Precipitation (mm)        0.00     4.29    23.30  196.47
USA      MA     Minimum temperature (°C)  -14.75   -0.86   -1.76  5.39
USA      MA     Maximum temperature (°C)  -2.77    8.32    8.16   18.70
USA      MA     Precipitation (mm)        0.00     6.84    34.66  320.86
USA      NJ     Minimum temperature (°C)  -11.80   1.60    0.96   7.27
USA      NJ     Maximum temperature (°C)  0.54     10.78   11.22  22.02
USA      NJ     Precipitation (mm)        0.00     5.44    31.53  274.04
USA      NY     Minimum temperature (°C)  -20.48   -2.61   -3.75  4.60
USA      NY     Maximum temperature (°C)  -7.98    5.62    6.23   17.88
USA      NY     Precipitation (mm)        0.00     9.94    27.61  167.94
Figure 1: Heatmap of the cumulative confirmed cases for five states from Brazil and USA. Panels: (a) Brazil (AM, CE, PE, RJ, SP); (b) USA (CA, IL, MA, NJ, NY).
This section presents a summary of each model employed in the data analysis.
• BRNN is a kind of feedforward neural network, a two-layer network composed of one input and one hidden layer, which uses Bayesian methods, such as empirical Bayes, for parameter estimation to avoid overfitting [42]. In the BRNN formulation, the variances are regularization parameters, through which the trade-off between goodness-of-fit and smoothing can be controlled. Also, in this approach the method of [43] is used to assign the initial weights of the neural network, and the Gauss-Newton training algorithm is used to perform the optimization. For the datasets evaluated in this paper, BRNN is attractive since it can deal with small samples and has a low computational cost.
• CUBIST is a rule-based algorithm used to build forecasting models (in the time series field) based on the analysis of input data [44]. It estimates the target values by establishing regression models with one or more rules (a committee or ensemble of rules) based on the input set. Each rule combines a set of conditions with a linear function (in general, a linear regression). When all conditions defined in the learning process are satisfied, this approach can execute multiple rules at once and find different linear functions suitable to forecast COVID-19 cases. However, if the standard deviation reduction value is smaller than or equal to the expected error for a sub-tree, some leaves are pruned to avoid overfitting [15].
• KNN is an instance-based learner designed to solve classification and regression problems [45]. In the time series context, KNN searches for the k most similar past values in the input set (past COVID-19 values and climatic variables); these k values are called the nearest neighbors. To find the nearest values, a similarity measure is adopted. The k nearest neighbors are those for which the dissimilarity (distance) between past cases and new cases is the smallest. Once the set of k nearest neighbors is defined, the forecast of new COVID-19 cases is obtained through the average of the past similar values. In contrast to the simplicity of this supervised learner, its computational cost may be a disadvantage [32].
• QRF is an extension of the random forests (RF) ensemble model [46]. It provides information about the full conditional distribution of the response variable, not only about the conditional mean. In this approach, conditional quantiles are used to enhance the RF performance, which makes this a consistent approach [47]. The main assumption of QRF is that weighted observations can be used for estimating the conditional mean [48]. Additionally, while the RF approach keeps only the average of the COVID-19 cases in each leaf, QRF keeps all COVID-19 cases contained in the leaves.
• SVR is a type of support vector machine that consists in determining support vectors close to a hyperplane, which maximizes the margin between two-point classes obtained from the difference between the target value and a threshold. To deal with non-linear problems, SVR takes into account kernel functions, which calculate the similarity between two observations through the inner product. In this paper, the linear kernel is adopted. The main advantages of SVR lie in its capacity to capture predictor non-linearity and use it to improve the forecasts. Also, it is advantageous for forecasting COVID-19 cumulative cases since the samples are small [49, 15].
• VMD is a pre-processing technique in the field of decomposition approaches, which decomposes a time series into a finite and predefined number k of intrinsic mode functions (IMFs), or mode functions. In general, VMD reproduces the decomposed signal with different sparsity properties [20]. There are three main concepts related to VMD: Wiener filtering, the Hilbert transform and analytic signal, and frequency mixing and heterodyne demodulation. The sparsity prior of each mode is chosen as its bandwidth in the spectral domain and can be assessed by the following scheme for each mode: (i) compute the associated analytic signal using the Hilbert transform to obtain a unilateral frequency spectrum; (ii) shift the frequency spectrum of the mode to baseband by mixing with an exponential tuned to the respective estimated center frequency; and (iii) estimate the bandwidth through the Gaussian smoothness of the demodulated signal [21].
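As an illustration of the KNN item above, the averaging of past similar values can be sketched as follows. This is a minimal sketch and not the implementation used in the study (whose analyses were run in R): the function name `knn_forecast`, the choice of Euclidean distance as the dissimilarity measure, and the toy numbers are assumptions made here for illustration only.

```python
from math import sqrt


def knn_forecast(train_inputs, train_targets, query, k=3):
    """Average the targets of the k training rows closest to `query`.

    train_inputs  : list of input vectors (e.g. lagged cases and climate values)
    train_targets : list of the values that followed each input vector
    query         : input vector for which a forecast is wanted
    """
    if not 1 <= k <= len(train_inputs):
        raise ValueError("k must be between 1 and the number of training rows")

    def distance(a, b):
        # Euclidean distance, assumed here as the dissimilarity measure.
        return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # Rank training rows from most to least similar to the query.
    ranked = sorted(range(len(train_inputs)),
                    key=lambda i: distance(train_inputs[i], query))
    nearest = ranked[:k]
    # The forecast is the plain average of the neighbours' targets.
    return sum(train_targets[i] for i in nearest) / k


if __name__ == "__main__":
    # Hypothetical toy data: two lagged values per row, one target per row.
    inputs = [[1, 1], [2, 2], [10, 10]]
    targets = [10, 20, 100]
    print(knn_forecast(inputs, targets, [1.5, 1.5], k=2))
```

Because the forecast is an average of previously seen targets, it can never exceed the largest value in the training set, which may be one reason such a learner struggles with a steadily growing cumulative series. In practice the inputs would usually be rescaled first so that no single variable dominates the distance.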
3. Proposed forecasting framework
This section describes the main steps in the data analysis adopted by BRNN, CUBIST,KNN, QRF, SVR, and VMD based models.
Step 1: First, the output variable of each dataset is decomposed into five IMFs by performing VMD. A lag of 2, chosen by grid search, is applied to the IMFs, creating four inputs from the lags, and is applied to the exogenous inputs as well. Further, the new data are split into training and test sets. The test set consists of the last six observations, and the training set is defined by the remaining samples. In the training stage, leave-one-out cross-validation with time slices was adopted, as developed by [32].
Step 2: Each IMF is trained with each model described in Section 2.2 using the time-slice validation approach. Next, the IMF predictions are reconstructed by simple summation; in other words, all IMFs are trained with the same model and their predictions are summed. Then, five prediction outputs are generated, named VMD–BRNN, VMD–CUBIST, VMD–KNN, VMD–QRF, and VMD–SVR.
Step 3: A recursive strategy is employed to develop multi-days-ahead COVID-19 cases forecasting [15]. In this strategy, one model is fitted for one-day-ahead forecasting, and the forecasting result is then used as an input for the same model to forecast the next step, continuing until the desired forecasting horizon. In this study, the aim is to obtain the cases up to the next H days, specifically up to 1 (ODA, one-day-ahead), 3 (TDA, three-days-ahead), and 6 days ahead (SDA, six-days-ahead). The following structures are considered,

$$
\hat{y}(t+h) =
\begin{cases}
\hat{f}\{y(t+h-1),\, y(t+h-2),\, X(t+h-n_x)\} & \text{if } h = 1,\\
\hat{f}\{\hat{y}(t+h-1),\, \hat{y}(t+h-2),\, X(t+h-n_x)\} & \text{if } h = 3,\\
\hat{f}\{\hat{y}(t+h-1),\, \hat{y}(t+h-2),\, X(t+h-n_x)\} & \text{if } h = 6,
\end{cases}
\tag{1}
$$

where $\hat{f}$ is a function that maps the cumulative COVID-19 cases, $\hat{y}(t+h)$ is the forecast of cumulative cases in horizon $h$ = 1, 3, and 6, $y(t+h-1)$ and $y(t+h-2)$ are the previous observed cumulative cases, $\hat{y}(t+h-1)$ and $\hat{y}(t+h-2)$ are the predicted cumulative cases, and $X(t+h-n_x)$ is the exogenous inputs vector at the maximum lag of inputs ($n_x = 1$ if $h = 1$, $n_x = 3$ if $h = 3$, and $n_x = 6$ if $h = 6$).
The analyses are developed using R software [50]. All hyperparameters employed in this study are presented in Tables B.1 and B.2 in Appendix B.
Step 4: To evaluate the effectiveness of the adopted models, the IP (2), sMAPE (3), and RRMSE (4) performance criteria are computed from the out-of-sample forecasts (test set) as

$$
\mathrm{IP} = 100 \times \frac{M_c - M_b}{M_c}, \tag{2}
$$

$$
\mathrm{sMAPE} = \frac{2}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{|y_i| + |\hat{y}_i|}, \tag{3}
$$

$$
\mathrm{RRMSE} = \frac{\sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}}{\dfrac{1}{n} \sum_{i=1}^{n} y_i}, \tag{4}
$$

where $n$ is the number of observations, and $y_i$ and $\hat{y}_i$ are the $i$-th observed and predicted values, respectively. Also, $M_c$ and $M_b$ represent the performance measure of the compared and best models, respectively.
Figure 2 presents the proposed forecasting framework.
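The recursive strategy of Step 3 and the criteria of Eqs. (2)-(4) can be sketched in a few lines. This is a hedged illustration rather than the authors' R code: the function names, the stand-in `trend_model`, and the toy numbers are hypothetical, and lagging the exogenous inputs by the full horizon (n_x equal to the horizon) is an assumed reading of Eq. (1).

```python
from math import sqrt


def recursive_forecast(model, y_hist, x_hist, horizon):
    """Forecast `horizon` steps ahead, feeding each forecast back as a lag input.

    model   : callable (y_lag1, y_lag2, x) -> forecast of the next value;
              a hypothetical stand-in for any fitted one-day-ahead regressor
    y_hist  : observed cumulative cases up to day t (at least two values)
    x_hist  : exogenous input vectors aligned with y_hist
    horizon : number of days ahead (1, 3, or 6 in the text)
    """
    if len(y_hist) < 2 or len(x_hist) != len(y_hist) or not 1 <= horizon <= len(y_hist):
        raise ValueError("need aligned histories and 1 <= horizon <= len(y_hist)")
    series = list(y_hist)   # observed values, later extended with forecasts
    t = len(y_hist) - 1     # index of the last observed day
    for h in range(1, horizon + 1):
        # Assumed reading of Eq. (1): exogenous inputs are lagged by the
        # horizon (n_x = horizon), so only already observed values are used.
        x = x_hist[t + h - horizon]
        series.append(model(series[-1], series[-2], x))
    return series[len(y_hist):]


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, Eq. (3), as a fraction."""
    n = len(y_true)
    return (2.0 / n) * sum(abs(y - p) / (abs(y) + abs(p))
                           for y, p in zip(y_true, y_pred))


def rrmse(y_true, y_pred):
    """Relative root mean squared error, Eq. (4): RMSE over the observed mean."""
    n = len(y_true)
    rmse = sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)
    return rmse / (sum(y_true) / n)


def improvement_percentage(m_compared, m_best):
    """Improvement percentage index, Eq. (2)."""
    return 100.0 * (m_compared - m_best) / m_compared


if __name__ == "__main__":
    # Hypothetical stand-in model: linear extrapolation that ignores x.
    trend_model = lambda y1, y2, x: y1 + (y1 - y2)
    forecasts = recursive_forecast(trend_model, [100, 120, 140], [[25.0, 0.0]] * 3, 3)
    print(forecasts, smape([165, 190, 210], forecasts), rrmse([165, 190, 210], forecasts))
```

Written this way, Eqs. (3) and (4) return fractions; multiplying by 100 gives percentages. Since each step consumes earlier forecasts, any error made at the first step propagates to the later ones.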
4. Results
This section describes the results of the developed experiments in forecasting out-of-sample (test set). First, Section 4.1 compares the results of evaluated models over tendatasets and three forecasting horizons adopted. In Tables A.1 and A.2 in Appendix A, thebest results regarding accuracy are presented in bold. Additionally, Figures 3 and 4 illustratethe relation between observed and predicted values achieved by models with the best set ofperformance measures depicted in Tables A.1 and A.2, as well as box-plots for out-of-sampleerrors, are illustrated in Figure 5. Also, Figure 6 illustrates the variable importance of eachinput (both lags and exogenous inputs) used in the models’ predictions.
In this section, the main results achieved by the best model regarding the sMAPE and RRMSE criteria are presented for short-term multi-days-ahead forecasting of cumulative cases of COVID-19 in five Brazilian and five American states.
Firstly, considering the Brazilian context, the main results are highlighted as follows.
• AM: In this state, VMD–BRNN could be considered to forecast COVID-19 cases, since the model outperformed all the single and VMD models in both performance criteria in all forecasting horizons. The improvement in sMAPE achieved by VMD–BRNN ranges between 39.47% - 96.06%, 55.97% - 94.88%, and 67.41% - 94.25%, for the ODA, TDA, and SDA horizons, respectively. Regarding RRMSE, the improvement ranges between 9.86% - 94.81%, 33.44% - 93.29%, and 56.66% - 93.89%, respectively.
• CE, RJ, and SP: For these states, in all forecasting horizons, the VMD–CUBIST approach achieved better accuracy than the other models for both sMAPE and RRMSE criteria in the multi-days-ahead forecasting of the confirmed number of COVID-19 cases. The improvement in sMAPE ranges between 8.67% - 96.57%, 12.15% - 97.78%, and 59.37% - 97.09%, respectively, in the ODA, TDA, and SDA forecasting horizons. Moreover, the improvement in RRMSE ranges between 12.41% - 97.32%, 2.61% - 98.29%, and 49.99% - 97.95%, respectively.
• PE: In this state, CUBIST and SVR present better performance in forecasting COVID-19 cases. For ODA and TDA, CUBIST outperforms the other models, while for SDA SVR achieves better accuracy regarding sMAPE and RRMSE than the others. The improvement in sMAPE for ODA and TDA achieved by CUBIST ranges between 6.81% - 97.93% and 24.94% - 98.23%, respectively. For SDA, SVR outperforms the other models, and this criterion is reduced in the range of 49.36% - 98.27%. Moreover, the same behavior is observed for the improvement in the RRMSE criterion.
Figure 2: Proposed forecasting framework. Flowchart labels: raw data of Brazil / USA states (COVID-19 cases, climatic variables); general approaches (BRNN, CUBIST, KNN, QRF, SVR); hybrid model with a decomposition phase (VMD into five IMFs) and a training and integration phase (IMF predictions summed into VMD–BRNN, VMD–CUBIST, VMD–KNN, VMD–QRF, VMD–SVR); performance metrics (IP, sMAPE, RRMSE).
Remark:
In this experiment, regarding the Brazilian states, 150 scenarios (5 datasets, 3 forecasting horizons, and 10 models) were evaluated for the task of forecasting cumulative COVID-19 cases. Overall, the best models for each state obtained sMAPE values between 1.14% - 3.05%, 1.06% - 2.79%, and 1.05% - 3.03% for ODA, TDA, and SDA forecasting, respectively. In the Brazilian context, the ranking of the models over all scenarios is VMD–CUBIST, VMD–BRNN, SVR, CUBIST, VMD–SVR, BRNN, VMD–QRF, QRF, VMD–KNN, and KNN. From a broader perspective, the efficiency of the VMD models is due to the capability of the approach to deal with the non-linearity and non-stationarity of the data. Moreover, the efficiency of CUBIST is due mainly to its ensemble learning of rules, in which the approach takes advantage of each rule based on the input set. On the other hand, the difficulty of the KNN model in forecasting cumulative COVID-19 cases could be attributed to the fact that this approach requires more observations to effectively learn the data pattern, since the forecast is obtained by an average of past similar values.
Next, considering the USA context, the main results are highlighted as follows.
• CA: In CA state, BRNN outperformed the other models in all forecasting horizons for both sMAPE and RRMSE criteria. The improvement in sMAPE ranges between 29.98% - 97.86%, 4.64% - 97.71%, and 48.56% - 97.99%, for ODA, TDA, and SDA, respectively. Regarding RRMSE, the improvement ranges between 24.00% - 97.67%, 6.57% - 97.78%, and 48.62% - 98.11%, respectively.
• IL, MA, and NJ: For both performance criteria, CUBIST outperformed the other models in ODA for the IL and NJ states, and in TDA for IL. BRNN presented better accuracy than the other models for MA state in ODA and TDA. Moreover, VMD–CUBIST outperformed the other models in SDA for these three states. The improvement in sMAPE ranges between 6.63% - 98.76%, 31.89% - 98.09%, and 3.76% - 97.98%, respectively, in the ODA, TDA, and SDA forecasting horizons. Moreover, regarding RRMSE, the improvement ranges between 7.54% - 98.48%, 0.83% - 98.25%, and 3.25% - 98.11%, respectively.
• NY: For NY state, in both performance criteria, VMD–CUBIST presented better accuracy than the other models in ODA forecasting, while SVR outperformed the other models in TDA and SDA forecasting. Regarding sMAPE, the improvement ranges between 17.86% - 95.44%, 16.12% - 95.69%, and 42.39% - 92.71%, for ODA, SDA, and TDA, respectively. For RRMSE, the improvement ranges between 25.78% - 96.09%, 7.78% - 95.43%, and 43.76% - 93.45%, respectively.
Remark:
In this experiment, regarding the American states, 150 scenarios (5 datasets, 3 forecasting horizons, and 10 models) were evaluated for the task of forecasting cumulative COVID-19 cases. Overall, the best models for each state obtained sMAPE values between 0.54% - 1.90%, 0.55% - 1.59%, and 0.62% - 3.08% for ODA, TDA, and SDA forecasting, respectively. In the American context, the ranking of the models over all scenarios is VMD–CUBIST, BRNN, CUBIST, SVR, VMD–BRNN, VMD–SVR, VMD–QRF, QRF, KNN, and VMD–KNN. The same behavior observed in the Brazilian cases is observed in the American ones, in which VMD–CUBIST had the best average performance overall compared to the other models.
According to the information depicted in Figures 3 and 4, it is possible to identify that the behavior of the data is learned by the evaluated models, which can forecast cases compatible with the observed values. In most states, the good performance presented in the training stage persists in the test phase. In Figures 3a, 3c, 4a, and 4e the models presented some difficulties in capturing the behavior of the data in the training stage; however, in the test phase the models performed accurately, presenting low errors.
Furthermore, Figure 5 presents the box-plots of test set forecasting errors in the SDA horizon for each model and each state. Due to the recursive strategy adopted, the SDA horizon was chosen for the analysis, since the errors tend to grow as the forecast horizon increases. The box diagram depicts the variation of absolute errors for each model, which reflects the stability of each model. In this context, the dots outside the boxes are considered outlier errors.
Analyzing the box-plots, models with lower variation in the errors are indicated by smaller boxes. 
Figure 5 corroborates the results presented in Tables A.1 and A.2.Models with lower errors achieve better stability, which means that the most appropriatemodel for each state can maintain a learning pattern, obtaining homogeneous forecastingerrors.The variable importance is an overall quantification of the relationship between the pre-dictor variables (inputs) and the predicted value. Finally, Figure 6 is presented the variableimportance of each input used to fit and train the models. As expected, the lag inputspresent high importance due to their high correlation to the output. However, it is impor-tant to notice that climate data indeed presented some influence in predicting COVID-19cumulative cases, especially in the Brazilian context, that the variance of the Temperaturedata reaches up to 50% of importance. In other words, the climatic exogenous inputs are insome level relevant to the prediction of cumulative cases of COVID-19 in both Brazil’s andUSA’s context for the five evaluated states.
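The recursive strategy mentioned above, in which errors tend to grow with the horizon, can be illustrated with a short sketch. It assumes a one-step-ahead model that takes the most recent lags and one vector of exogenous inputs per step; `recursive_forecast`, `toy_model`, and all numbers below are hypothetical stand-ins and do not reproduce the models or data of this paper.

```python
from typing import Callable, List, Sequence


def recursive_forecast(
    history: Sequence[float],
    exogenous: Sequence[Sequence[float]],
    one_step_model: Callable[[List[float], Sequence[float]], float],
    n_lags: int,
    horizon: int,
) -> List[float]:
    """Forecast `horizon` steps ahead, feeding each prediction back as a lag input."""
    window = list(history[-n_lags:])
    forecasts = []
    for step in range(horizon):
        prediction = one_step_model(window, exogenous[step])
        forecasts.append(prediction)
        # Drop the oldest lag and append the new prediction (not an observation).
        window = window[1:] + [prediction]
    return forecasts


def toy_model(lags: List[float], exog: Sequence[float]) -> float:
    """Hypothetical stand-in: extrapolates the last daily increase, ignoring exog."""
    return lags[-1] + (lags[-1] - lags[-2])


# Illustrative cumulative counts (not data from the paper).
history = [100.0, 120.0, 140.0]
exog = [(25.0, 0.0)] * 6  # placeholder (temperature, precipitation) pairs
observed = [161.0, 183.0, 206.0, 230.0, 255.0, 281.0]

forecasts = recursive_forecast(history, exog, toy_model, n_lags=2, horizon=6)
abs_errors = [abs(o - f) for o, f in zip(observed, forecasts)]
print(abs_errors)
```

In this toy case the absolute error increases at every step, because each forecast is built on earlier forecasts rather than on observations. The spread of such absolute errors per model is the quantity a box-plot like Figure 5 summarizes.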
5. Conclusion and Future Research
In this paper, machine learning approaches named BRNN, CUBIST, KNN, QRF, and SVR, as well as the VMD approach, were employed in the task of forecasting one, three, and six-days-ahead the COVID-19 cumulative confirmed cases in five Brazilian states and five American states with a high daily incidence. The COVID-19 cumulative confirmed cases for the AM, CE, PE, RJ, and SP states, as well as CA, IL, MA, NJ, and NY, were used. The IP, sMAPE, and RRMSE criteria were adopted to evaluate the performance of the compared approaches. The stability of out-of-sample errors was evaluated through box-plots. Further, the variable importance of the lag and climatic exogenous inputs was analyzed.

With respect to the obtained results, it is possible to infer that CUBIST coupled with VMD is a suitable tool to forecast COVID-19 cases for most of the adopted states, since this approach was able to learn the non-linearities inherent to the evaluated epidemiological time series. Also, BRNN and SVR models deserve attention for this task as well. Therefore, the ranking of models in all scenarios for Brazilian states is VMD–CUBIST, VMD–BRNN, SVR, CUBIST, VMD–SVR, BRNN, VMD–QRF, QRF, VMD–KNN, and KNN, and for USA states is VMD–CUBIST, BRNN, CUBIST, SVR, VMD–BRNN, VMD–SVR, VMD–QRF, QRF, KNN, and VMD–KNN. Also, for COVID-19 forecasts six-days-ahead, hybrid models are more suitable tools than non-decomposed models. Further, it was observed that climatic variables, such as temperature and precipitation, indeed help increase the accuracy when predicting COVID-19 cases, where in some cases climate inputs reached up to 50% of importance in the forecasting model.

For future works, it is intended to adopt (i) deep learning approaches, (ii) different decomposition approaches, (iii) multi-objective optimization to tune hyperparameters of forecasting models, and (iv) more climatic data and demographic features.

Figure 3: Prediction versus observed COVID-19 cases for Brazilian States. Axes: day versus cumulative confirmed cases, over the training and test periods. Models shown for the ODA, TDA, and SDA horizons: (a) AM: VMD–BRNN, VMD–BRNN, VMD–BRNN; (b) CE: VMD–CUBIST, VMD–CUBIST, VMD–CUBIST; (c) PE: CUBIST, CUBIST, SVR; (d) RJ: VMD–CUBIST, VMD–CUBIST, VMD–CUBIST; (e) SP: VMD–CUBIST, VMD–CUBIST, VMD–CUBIST.

Figure 4: Prediction versus observed COVID-19 cases for American States. Axes: day versus cumulative confirmed cases, over the training and test periods. Models shown for the ODA, TDA, and SDA horizons: (a) CA: BRNN, BRNN, BRNN; (b) IL: CUBIST, CUBIST, VMD–CUBIST; (c) MA: BRNN, BRNN, VMD–CUBIST; (d) NJ: CUBIST, BRNN, VMD–CUBIST; (e) NY: VMD–CUBIST, SVR, SVR.

Figure 5: Box-plot for absolute error according to model and state for COVID-19 forecasting for SDA. Panels: CA, IL, MA, NJ, NY, AM, CE, PE, RJ, and SP. Axes: model (BRNN, CUBIST, KNN, QRF, SVR, VMD–BRNN, VMD–CUBIST, VMD–KNN, VMD–SVR, VMD–QRF) versus absolute error, grouped by country (Brazil, USA).

Figure 6: Variable importance for Brazil and USA. Axes: variable versus variable importance, grouped by country (Brazil, USA).
CRediT Author Statement

Ramon Gomes da Silva: Conceptualization, Methodology, Formal analysis, Validation, Writing - Original Draft, Writing - Review & Editing.
Matheus Henrique Dal Molin Ribeiro: Conceptualization, Methodology, Formal analysis, Validation, Writing - Original Draft, Writing - Review & Editing.
Viviana Cocco Mariani: Conceptualization, Writing - Review & Editing.
Leandro dos Santos Coelho: Conceptualization, Writing - Review & Editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors would like to thank the National Council of Scientific and Technologic Development of Brazil – CNPq (Grants number: 307958/2019-1-PQ, 307966/2019-4-PQ, 404659/2016-0-Univ, 405101/2016-3-Univ), PRONEX 'Fundação Araucária' 042/2018, and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001 for financial support of this work. Furthermore, the authors wish to thank the Editor and anonymous reviewers for their constructive comments and recommendations, which have significantly improved the presentation of this paper.
Appendix A. Performance Measures
Tables A.1 and A.2 present the performance measures for each model in each state and forecasting horizon.
Table A.1: Performance measures for each evaluated model for Brazilian states
Country  State  Forecasting Horizon  Criteria  BRNN    CUBIST  KNN     QRF     SVR    VMD–BRNN  VMD–CUBIST  VMD–KNN  VMD–QRF  VMD–SVR
BRA      AM     ODA                  sMAPE     11.59%  5.34%   50.92%  45.99%  5.40%

Table A.2: Performance measures for each evaluated model for American states

Country  State  Forecasting Horizon  Criteria  BRNN    CUBIST  KNN     QRF     SVR    VMD–BRNN  VMD–CUBIST  VMD–KNN  VMD–QRF  VMD–SVR
USA      CA     ODA                  sMAPE

Appendix B. Hyperparameters

Tables B.1 and B.2 present the hyperparameters obtained by grid-search for the models employed in this paper.
Table B.1: Hyperparameters selected by grid-search for each evaluated model for Brazilian states
Country  State  Component  BRNN  CUBIST  KNN  QRF  SVR

Table B.2: Hyperparameters selected by grid-search for each evaluated model for American states

Country  State  Component  BRNN  CUBIST  KNN  QRF  SVR
5 1 5 5 5 1
Non-decomposed 5 20 0 5 4 1