Backtesting the predictability of COVID-19

Abstract

The advent of the COVID-19 pandemic has instigated unprecedented changes in many countries around the globe, putting a significant burden on the health sectors, affecting the macro economic conditions, and altering social interactions amongst the population. In response, the academic community has produced multiple forecasting models, approaches and algorithms to best predict the different indicators of COVID-19, such as the number of confirmed infected cases. Yet, researchers had little to no historical information about the pandemic at their disposal in order to inform their forecasting methods. Our work studies the predictive performance of models at various stages of the pandemic to better understand their fundamental uncertainty and the impact of data availability on such forecasts. We use historical data of COVID-19 infections from 253 regions from the period of 22nd January 2020 until 22nd June 2020 to predict, through a rolling window backtesting framework, the cumulative number of infected cases for the next 7 and 28 days. We implement three simple models to track the root mean squared logarithmic error in this 6-month span, a baseline model that always predicts the last known value of the cumulative confirmed cases, a power growth model and an epidemiological model called SEIRD. Prediction errors are substantially higher in early stages of the pandemic, resulting from limited data. Throughout the course of the pandemic, errors regress slowly, but steadily. The more confirmed cases a country exhibits at any point in time, the lower the error in forecasting future confirmed cases. We emphasize the significance of having a rigorous backtesting framework to accurately assess the predictive power of such models at any point in time during the outbreak which in turn can be used to assign the right level of certainty to these forecasts and facilitate better planning.

Backtesting the predictability of COVID-19

Dmitry Gordeev ∗, H2O.ai
Philipp Singer ∗, H2O.ai
Marios Michailidis ∗, H2O.ai
Mathias Müller, H2O.ai
SriSatish Ambati, H2O.ai

Abstract

The advent of the COVID-19 pandemic has instigated unprecedented changes in many countries around the globe, putting a significant burden on the health sectors, affecting the macro economic conditions, and altering social interactions amongst the population through a number of mitigation measures and governmental instructions. In response, the academic community has produced multiple forecasting models, approaches and algorithms to best predict the different indicators of COVID-19, such as the number of confirmed infected cases, the number of deceased and economic indicators. Specifically at the beginning of the pandemic, researchers had little to no historical information about the pandemic at their disposal in order to inform their forecasting methods. Our work studies the predictive performance of models at various stages of the pandemic to better understand their fundamental uncertainty and the impact of data availability on such forecasts.

We use historical data of infected, deceased and recovered cases of COVID-19 from 253 regions from the period of 22nd January 2020 until 22nd June 2020 to predict, through a rolling window backtesting framework, the cumulative number of infected cases for the next 7 and 28 days. We implement three simple models to track the root mean squared logarithmic error in this 6-month span: a baseline model that always predicts the last known value of the cumulative confirmed cases, a power growth model and an epidemiological model called SEIRD. Within our presented backtesting framework, each model is re-fitted daily and its best parameters are obtained with information known as of that point in time (in terms of historical confirmed, deceased and recovered cases) and predictions are extended for the 7-days and 28-days forecast horizons.

We demonstrate that prediction errors are substantially higher in early stages of the pandemic, resulting from limited data. Throughout the course of the pandemic, errors decrease slowly, but steadily. The more confirmed cases a country exhibits at any point in time, the lower the error in forecasting future confirmed cases. Our work emphasizes the significance of having a rigorous backtesting framework to accurately assess the predictive power of such models at any point in time during the outbreak, which in turn can be used to assign the right level of certainty to these forecasts and facilitate better planning.

∗ All authors contributed equally.

arXiv preprint [physics.soc-ph].

Introduction

In the event of a pandemic outbreak, stakeholders such as politicians, pharmaceutical companies or hospitals attempt to forecast the spread of the pandemic to make informed decisions about actions and policies such as lock-downs, supply chain optimization, or, in the worst case, even crucial decisions about intensive care units. However, every pandemic is unique in itself and COVID-19 reached a magnitude and severity that has not been observed over the last decades [17, 21, 52]. As a result, little historical information about similar pandemics was at our disposal at the beginning of the outbreak in order to make good estimates about the future development of the disease. This information bottleneck leads to uncertainty in forecasting methods and can be crucial in the efforts to develop new medicine, vaccines, public guidelines and other important aspects to guarantee public health and safety.

Background.

Substantial research has been published over the course of the pandemic, as evident from the COVID-19 Open Research Dataset (CORD-19) [1] containing close to 30,000 COVID-19 related research papers; the dataset has been extended to cover publications from similar corona-viruses and fostered NLP-related research on the corpus [4, 28]. In Figure 1, we depict the number of research publications containing the term "Covid" in the title and having a publication date as well as the weekly number of new cases retrieved from data made available by the Johns Hopkins University (see Section 2). Similar to the rapid spread of the pandemic, we observe an accelerating number of publications indicating the strong efforts of the research community to study the pandemic across disciplines.

Many different types of models have been proposed to model and forecast the number of infections within and across countries. A prominent and frequently applied type is the classical epidemiological framework [32] modeling susceptible, exposed, infected, and recovered agents (SEIR) that has also found its application in several COVID-19 forecasting approaches [13, 14, 26, 27, 35, 36, 48]. A second category comprises autoregressive moving average models that attempt to extrapolate future data by means of aggregating recent data. These types of models have had many successful implementations in time series forecasting (e.g. financial methods [16]) and have recently also been applied to predict COVID-19 numbers [15, 42, 45]. Third, several curve fitting and statistical models have been proposed to be well-tailored for COVID-19 forecasting, including power-law models [56], simple linear or polynomial models [41, 54], logistic models [49], mixed-effects models [11, 19], and many others. Finally, many approaches in the realm of machine learning have been developed [5], including e.g. Facebook's prophet algorithm [40], gradient boosted trees [47], or neural networks [55]. This list only covers a small fraction of published models; an exemplary overview of others is also given in [3, 20, 33].

Kaggle, a large competitive data science platform with around five million users [25], conducted a series of five competitions [6, 7, 8, 9, 10] allowing data scientists to develop and submit their COVID-19 forecasting models to predict confirmed cases and fatalities across ∼ regions (including mostly country-level and in certain cases province-level or state-level predictions) for no less than 30 days into the future. The models were always developed on historical data and then evaluated live over a period of four succeeding weeks or more. Across all competitions, different types of models have performed well including the above-mentioned machine learning models (boosting trees, neural networks) as well as a diverse set of curve fitting, statistical, and autoregressive models. The series of competitions captures the state of development of these kinds of models during a pandemic quite well, with the models being initially quite simple and uninformed [6, 7], and developing to more robust models and ensembles over time [8, 9, 10]. While many strong solutions have been developed, it has also been shown that a lot of subjective adjustments [22, 37] can make a model shine or fail and that it is explicitly complex to forecast rapidly changing patterns.

Objectives.

As summarized, a plethora of research has been conducted in order to forecast the COVID-19 pandemic. However, given the rapidly changing environments, data irregularities as well as the inherent difficulty of predicting these numbers, this type of research has also been criticized due to the sensitivity of the topic at hand and the potentially huge implications of poorly performing models [29, 30]. Wynants et al. [53] conducted a review of 66 published models with focus on predicting different aspects of COVID-19 or similar diseases including models for forecasting hospital admissions due to pneumonia, diagnostic models for detecting COVID-19 as well as prognostic models for assessing mortality risk, length of stay in the hospital and exacerbation of the disease. Their review rated the aforementioned models as being at high or unclear risk of bias due to improper testing frameworks, with non-representative selection of control patients. They also highlighted the lack of clarity in the reporting of the findings and that these models have a high risk of overfitting. They concluded that a reporting guideline needs to be adhered to by all works predicting COVID-19 or similar diseases to avoid unreliable predictions as the latter "could cause more harm than benefit in guiding clinical decisions".

Footnote: All publications from the WHO database do not have a date and are excluded from the visualization in Figure 1.
Footnote: In each of the five competitions, at least one author of this paper finished in the top 5.

Figure 1: Research publications and COVID-19 confirmed cases. This figure shows the number of published articles derived from the CORD-19 dataset as well as the worldwide number of new cases on a weekly aggregation level.

We still strongly believe in the fundamental value of these types of models, specifically for application in potential second waves or other future pandemics. In order to be able to utilize these types of models, they have to be properly evaluated and made transparent [18].

Nonetheless, most of these models have been developed during the outbreak of the pandemic, and thus could only be evaluated on historical data up to that point. While some countries still see rising numbers in COVID-19 infections as of this writing (e.g., Brazil, India or the US), most countries are well past the peak and see rapid flattening of the curves. However, a few potential instances of second waves may already be happening [50]. Consequently, we are now in the unique position to backtest and investigate predictive performance of COVID-19 forecasting models across countries at different points in time. This not only allows us to study the fundamental prediction difficulty of infection curves, but also to measure predictive performance at various stages of the pandemic.

Contributions and findings.

To study these and similar questions, we make the following contributions: (i) We apply two simple, yet well-known and robust, short-term forecasting methods along with a baseline to predict confirmed COVID-19 cases. (ii) We introduce a thorough backtesting framework that allows us to provide accurate assessments about a model's prediction performance. (iii) We utilize this framework to empirically study the general predictability of COVID-19 across various stages of the outbreak.

Our work highlights the importance of proper testing, tracking and quantifying of the prediction error through time as well as through different levels of accumulated infected cases. We observe that the prediction error is substantially higher in the early stages of the pandemic, when the number of confirmed cases is still low and the trends are still undeveloped. Then follows a period of approximately 15 days (past the early days of March) where the error drops significantly, by a factor of about 3.5. From that point on it declines steadily to lower levels as more data becomes available.

This paper is organized as follows. Section 2 describes the source of the data used as well as the methods by which the latter was transformed and processed to underpin the experiments. A power growth model and a version of the SEIR model called SEIRD are optimized and applied via multiple moving windows in a backtesting setting across all countries to predict the confirmed infected cases of COVID-19 and track the prediction error over time. Section 3 describes the methodology supporting these models in terms of their parameters, optimization routines and loss functions minimized within the context of the backtesting framework. Section 4 highlights the conducted experiments and core findings. Ultimately, the conclusions of the experiments are drawn in Section 5.

Footnote (time of writing): Beginning of July 2020.

Figure 2: Data preparation. This figure depicts our data cleaning routine, where a linear interpolation is applied to replace two consecutive points of the same value.
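The cleaning rule illustrated in Figure 2 can be sketched in a few lines of Python: when a cumulative value repeats the previous day's value and the following day is higher, the repeated value is replaced by the average of its two neighbours. This is a minimal sketch rather than the released implementation. The function name `smooth_plateaus` and the `min_jump` threshold are illustrative assumptions, since the text does not quantify what counts as an increase at a high pace.

```python
def smooth_plateaus(cumulative, min_jump=1):
    """Replace a repeated cumulative value by the mean of its neighbours.

    `min_jump` is a hypothetical threshold for the rise after the plateau.
    """
    cleaned = list(cumulative)
    for k in range(1, len(cumulative) - 1):
        repeated = cumulative[k] == cumulative[k - 1]
        jump = cumulative[k + 1] - cumulative[k]
        if repeated and jump >= min_jump:
            # Linear interpolation between the previous and the next observation.
            cleaned[k] = (cumulative[k - 1] + cumulative[k + 1]) / 2
    return cleaned


reported = [100, 100, 140, 150]
print(smooth_plateaus(reported))  # [100, 120.0, 140, 150]
```

Only the later of the two equal values is changed, so a strictly increasing series passes through untouched.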

The primary source of data is the data repository of the Johns Hopkins University Center for Systems Science and Engineering [2]. It contains daily updates about confirmed, deceased, and recovered cases at country level. Due to given irregularities in the way different countries report daily COVID-19 statistics, we employ a basic data cleaning routine. As we are working on a cumulative level of confirmed cases, we aim at guaranteeing the monotonicity requirement. The corrective measure to ensure monotonicity is applied when the value between two consecutive dates remains the same and then increases at a high pace. In this case, the latter of the two is replaced with a linear interpolation of the neighbor values (i.e., the average between the previous value and the next value). The reasoning behind this measure is that often the cases remain the same due to irregularities and delays of the reporting system [31, 43]. An example of applying this transformation is shown in Figure 2: assuming the confirmed cumulative count is originally as depicted in the solid blue line, we transform it to the dashed blue line. Even though the overall expansion of a curve may not be linear with respect to time, finding a better method to correct the same-value irregularity is an exhaustive task and out of scope for this analysis.

Overall, our dataset contains 253 regions, with daily statistics ranging from the 22nd January 2020 to the 22nd June 2020. We observe around 9.1 million confirmed cases and , fatalities; see also Figure 1 for a visualization of the development over time.

This section details the elements utilized to implement the experiments of tracking and understanding the error in predicting the COVID-19 confirmed infected cases over time across the globe at the country level. We start by elaborating our core backtesting methodology in Section 3.1 which we utilize to study the RMSLE loss function (see Section 3.2) over time. Within the scope of backtesting, we employ three models described in Section 3.3: a simple baseline model, a power growth model, and an extension of the well-known SEIR model, called SEIRD.

3.1 Backtesting

Backtesting, or the process of evaluating a model or an algorithm over different past periods of time, is commonly associated with trading strategies, banking, or risk prediction [51]. Backtesting in predictive modeling can be an important tool in finding the optimal parameters for the used models as well as for measuring the volatility of predictions through time. Assessment of the forecast accuracy can be dramatically biased if done on the same data used for model fitting [23]. A technique of setting up a single hold-out sample can serve as a way to derive more accurate forecast error estimations; however, it does not provide the information about how the model accuracy improves over time, as more information becomes available. Moreover, an assessment based on a single sample is not robust given the limited size of the data available to fit the models.

In the context of the COVID-19 pandemic, backtesting the models aimed to predict different aspects of the disease (in confirmed, deceased, or recovered cases) can facilitate understanding of the sensitivity of these models in producing accurate and robust results given varying sizes of training history. Such an approach can enable defining the time (or the amount of training history) required to produce results of certain accuracy levels, quite often essential in order to use them efficiently in decision systems.

Using the EDI model, an exponentially decreasing intensity growth model, Moriconi [39] used backtesting on China's daily confirmed COVID-19 cases, starting from 13th February 2020, and observed substantial overestimation for (approximately) the first week of predictions before the model started being significantly more accurate. Volatility in predictions (via varying levels of over- and underestimation) was also observed in the work of Lesage [34] where the Hawkes process [24] was utilized to predict via backtesting the number of confirmed cases in both France and China in different periods of February and March.

Rouabah et al. [44] used the SEIQRDP model (a variant of the SEIRD model that also incorporates quarantined (Q) individuals to be considered as active cases as well as the protected population (P) for cases that strictly follow the standard advised protection measures) in order to forecast the elements of that model for the next six months past the last training day of 24th May 2020 for Algeria. Their work emphasizes the threat of creating unstable models due to overfitting and underfitting, plus they point out that overfitting is a major issue in epidemic dynamical models due to the noise embedded in the data. The SEIQRDP model's parameters were optimized using a genetic algorithm, enhanced by published information for these parameters. To find the optimum number of iterations for the genetic algorithm to obtain the best parameters, a time-based cross validation procedure was applied in different countries so that the first n days for a given country's infected numbers are used to fit the algorithm and the last v are used for validation. This process was tested on the countries of Italy, Spain, Germany and South Korea before applying it to Algeria. The ratio of v/n can be adjusted based on the number of parameters that need optimization. In this case the ratio was about / . The study highlights that there is an inverse relationship between the training sample's size and the number of iterations required in the genetic algorithm. As more data becomes available for a given country, the optimum number of iterations decreases. Therefore re-evaluating the optimization at different points in time is important for obtaining the most accurate results.

With the application of backtesting, we can not only derive the accuracy of predictions made in the past, but also show how the accuracy changes during the pandemic. The main idea behind backtesting is to make the predictions by the model at a fixed time point $t$ in the past and estimate the error at the time point $t + H$, where $H$ is a predefined prediction horizon. In this paper, we focus on two values, $H = 7$ days for short-term predictions and $H = 28$ days for a longer forecast.

Denote by $X$ a $p \times N$ matrix, where $p$ is the number of regions and $N$ is the number of days with available numbers of confirmed cases. Let us denote by $X^{(t_1)(t_2)} = X_{i,j}$, where $i = 1, ..., p$ and $j = t_1, ..., t_2$, the matrix of observations available between days $t_1$ and $t_2$, and by $X^{(t)} = X_{i,t}$, where $i = 1, ..., p$, the values at the day $t$. The backtesting implies fitting a model $f: \mathbb{R}^{p \times t} \to \mathbb{R}^{p \times (t+H)}$ for each day $t = 1, ..., N - H$:

$$\hat{\theta}_t = \arg\min_{\theta} L_{fit}\left(f(X^{(1)(t)}; \theta)^{(1)(t)}, X^{(1)(t)}\right) \tag{1}$$

where $L_{fit}$ is the loss function. We later assess the error of the forecast with horizon $H$ as

$$ERR_H(t) = L_{eval}\left(f(X^{(1)(t)}; \hat{\theta}_t)^{(t+H)}, X^{(t+H)}\right) \tag{2}$$

where $L_{eval}$ is the evaluation metric. The experiment results in two ($H = 7$ and $H = 28$) time series of errors per model $f$, showing the values of forecast error, the fluctuations of the error and its dynamics during the pandemic.

3.2 Loss function

The choice of the loss function was driven by the fact that most models predicting the development of the number of confirmed infection cases assume exponential growth. In such a case, metrics like RMSE (root mean squared error) and MAE (mean absolute error), based on the absolute difference between the number of predicted and realized cases, tend to significantly penalize any exponential over-estimations. Therefore, root mean squared logarithmic error (RMSLE) was chosen as the evaluation metric $L_{eval}$:

$$RMSLE(X, Y) = \sqrt{\frac{1}{N \cdot p} \sum_{i,j} \left(\ln(1 + X_{i,j}) - \ln(1 + Y_{i,j})\right)^2} \tag{3}$$

where $X$ and $Y$ are matrices of the same size $N \times p$.

Also, consistent with [6, 7, 8, 9], the loss function is applied to the cumulative number of cases.
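To make Equations 1 to 3 concrete, the following sketch evaluates RMSLE and runs a rolling backtest over a toy dataset. It is an illustration under simplifying assumptions, not the released framework: the names `rmsle`, `last_value_model` and `backtest` are hypothetical, and the last-value forecaster stands in for any model $f$. Because that forecaster has no parameters, the fitting step of Equation 1 is skipped and only the evaluation of Equation 2 is shown.

```python
import math


def rmsle(predicted, observed):
    """Root mean squared logarithmic error of two equally long sequences."""
    squared = [(math.log1p(p) - math.log1p(o)) ** 2
               for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def last_value_model(history, horizon):
    """Parameter-free forecaster: repeat the last known cumulative count."""
    return [history[-1]] * horizon


def backtest(series_by_region, model, horizon, first_day=1):
    """Return {t: error at day t + horizon} for forecasts made on day t.

    Day t is 1-based, so series[:t] is everything known on day t and
    series[t + horizon - 1] is the value observed on day t + horizon.
    """
    n_days = len(next(iter(series_by_region.values())))
    errors = {}
    for t in range(first_day, n_days - horizon + 1):
        predicted, observed = [], []
        for series in series_by_region.values():
            forecast = model(series[:t], horizon)  # uses past data only
            predicted.append(forecast[-1])
            observed.append(series[t + horizon - 1])
        errors[t] = rmsle(predicted, observed)
    return errors


toy = {"A": [1, 2, 3, 4, 5], "B": [10, 10, 10, 10, 10]}
errors = backtest(toy, last_value_model, horizon=2)
print(sorted(errors))  # [1, 2, 3]
```

Each forecast sees only `series[:t]`, which is what keeps the error estimate free of look-ahead; a model with parameters would be re-fitted on that same slice before predicting.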
Next, we specify three models that we employ utilizing our backtesting framework to study the RMSLE error over time. (1) For reference, we utilize a simple parameter-free baseline model that predicts future confirmed cases by using the latest known data point, (2) we introduce a power-growth model employing constant growth that decays over time, and (3) we utilize a variation of the well-known epidemiological SEIR model called SEIRD.

Baseline model.

As a reference point for the evaluation metric across the pandemic development, a simple parameter-free baseline model was applied. Denoting by $C_t$ the number of cumulative confirmed cases in a region at point in time $t$, the baseline predictions are

$$C_{t+i} = C_t, \quad i = 1, ..., N \tag{4}$$

The baseline model is not intended to produce reasonable forecasts, but rather to indicate how difficult it is to make accurate predictions at each point in time $t$.

Power-growth model.

This model is motivated by different types of COVID-19 forecasting models such as statistical power law models with exponential growth [38], or autoregressive moving average models as described in Section 1. We have utilized this model successfully across all five Kaggle competitions [6, 7, 8, 9, 10] and the version presented in this paper is the final adaptation of it. The main idea of the model is to forecast COVID-19 cases by employing a constant growth rate that is derived from previous observations. This growth rate is decaying over time and the decay can accelerate. In detail, we can define the power-growth model as follows:

$$C_{t+i} = C_t + gr \cdot \max\left(0, \left(1 + gr_d \cdot (1 + gr_{da})^{i}\right)^{\log(t+i)}\right) \tag{5}$$

In Equation 5 we want to predict the cumulative number of cases $C$ at time $t + i$, $i = 1, ..., H$, using the number of cases at time $t$; $gr$ refers to the growth rate, $gr_d$ to the decay of the growth rate, and $gr_{da}$ to the acceleration of the decay. The growth rate is calculated for each region separately by taking an exponential weighted average of the observed daily growth rate over a certain number of past days ($n_{days}$). If a region does not exceed a minimum number of cases ($min\_cases$), a default growth rate ($gr_{def}$) is employed. The growth rate decay $gr_d$ as well as its acceleration $gr_{da}$ are constant across all regions. All parameters except the growth rate are thus hyperparameters that are optimized based on a global metric across regions. The modified power-growth model fitting was performed the following way:

$$\hat{\theta}_t = \arg\min_{\theta} L_{fit}\left(f(X^{(1)(t-21)}; \theta)^{(t-20)(t)}, X^{(t-20)(t)}\right) \tag{6}$$

meaning that the most recent 21 days of data were used to optimize the hyperparameters. The loss function is the same as the evaluation metric: $L_{fit} = L_{eval} = RMSLE$.

SEIRD model.

The SEIR model belongs to a family of epidemiological models (see also Section 1) that map the spread of an epidemic through the sequential interaction of 4 groups or states (represented as ordinary differential equations): the susceptible (or number of individuals that can contract the disease), exposed, infected and removed.

Our implementation uses a variation of the SEIR model called SEIRD [12]. In this application, the removed category is further divided into recovered and deceased. The equations that map the rate of change with respect to the main states are displayed below:

$$\frac{\partial S}{\partial t} = -\beta I \frac{S}{N} \tag{7}$$

where $\frac{\partial S}{\partial t}$ represents the change applied to the susceptible population $S$ at time $t$, $\beta$ is the infection rate (or how many people an infected individual infects), $I$ the infected population at time $t$ and $N$ the total population.

$$\frac{\partial E}{\partial t} = \beta I \frac{S}{N} - \delta E \tag{8}$$

where $\frac{\partial E}{\partial t}$ represents the change applied to the exposed population $E$ at time $t$, and $\delta$ is a parameter that controls the rate by which the exposed population transitions to the infected state; it can be interpreted as 1 divided by the incubation period (or in other words, the period that an individual is infected but asymptomatic and unable to spread the disease to others).

$$\frac{\partial I}{\partial t} = \delta E - (1 - \alpha) \gamma I - \alpha \rho I \tag{9}$$

Similarly, to compute the change applied to the infected group at time $t$, a parameter $\gamma$ is used to represent the recovery rate, or how quickly individuals move to the recovered state. Equivalently, $\rho$ controls how quickly individuals move to the deceased state. The parameter $\alpha$ represents the fatality rate or the proportion of the infected population that will transition to the deceased state. The term $(1 - \alpha)$ represents the proportion of the infected population that will transition to the recovered state.

$$\frac{\partial R}{\partial t} = (1 - \alpha) \gamma I \tag{10}$$

which expresses the change applied to the recovered group $R$ at time $t$. Finally,

$$\frac{\partial D}{\partial t} = \alpha \rho I \tag{11}$$

expresses the change applied to the deceased group $D$ at time $t$.

For each country, and given a set of bounds for the model's parameters ($N, \beta, \delta, \gamma, \alpha, \rho$), a stochastic, population-based optimisation algorithm with differential evolution [46] is applied to find the optimum values for these parameters in order to minimize the RMSE across all the infected, recovered and deceased groups up to a selected point in time. The bounds-based optimization algorithm was preferred over others (like gradient-based), because the bounds were selected based on the latest known information regarding infection rate, incubation period, and fatality rate, and it provided a fairly narrow constrained environment for the algorithm to converge more quickly.

$$rmse(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{12}$$

where $y$ is the observed value for either one of infected, recovered or deceased and $\hat{y}$ the corresponding predicted value for these groups. Then the overall metric to optimize can be defined as:

$$\hat{N}, \hat{\beta}, \hat{\delta}, \hat{\gamma}, \hat{\alpha}, \hat{\rho} = \arg\min_{(N, \beta, \delta, \gamma, \alpha, \rho)} M(I, \hat{I}, R, \hat{R}, D, \hat{D}) = \frac{rmse(I, \hat{I}) + rmse(R, \hat{R}) + rmse(D, \hat{D})}{3} \tag{13}$$

where $M$ is the objective to minimize and denotes the average of $rmse(I, \hat{I})$, $rmse(R, \hat{R})$, $rmse(D, \hat{D})$, which are the respective root mean squared errors for infected, recovered and deceased. Once the optimum parameters $\hat{N}, \hat{\beta}, \hat{\delta}, \hat{\gamma}, \hat{\alpha}, \hat{\rho}$ have been obtained, the curves for infected, recovered and deceased are extrapolated in time to match the forecasting period. Since the predicted values are based on the fit with the known values, it is possible that the cumulative predicted numbers are lower than the last known value. In that case, the differences between the predicted points are computed and added to the last known value to form the new cumulative predictions.

Footnote: Experiments indicated better results and faster convergence with RMSE instead of RMSLE for SEIRD; we still report RMSLE from here onwards for fair comparison.

(a) 7 days forecast horizon (b) 28 days forecast horizon

Figure 3:

Backtesting error over time.

This plot shows the errors for each model versus time (baseline, power-growth, SEIRD) for the two forecast horizons of 7 days in (a) and 28 days in (b), limiting the analysis to regions having at least 100 confirmed cases. The x-axis depicts the forecast date (i.e., the point in time where the respective model has been fitted on past data only) and the left y-axis its respective prediction error based on the given forecast horizon. For instance, in (a), the green point at May 10 refers to the RMSLE error seven days in the future for a SEIRD model fitted on all historic data up to this point in time. The grey bars highlight the global number of new confirmed cases (right y-axis) on each day for visual reference. Note that the last data points plotted for (a) and (b) are 7 and 28 days prior to the last evaluation date of 22nd June 2020 as such were the periods required to calculate the evaluation error.
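Returning to the SEIRD model, Equations 7 to 11 can be stepped forward numerically once parameter values are fixed. The sketch below is a minimal illustration that uses explicit Euler steps and hand-picked parameter values; both are assumptions made for brevity. The function names are hypothetical, and the parameter search by differential evolution described in the text is omitted.

```python
def seird_derivatives(state, beta, delta, gamma, alpha, rho, population):
    """Right-hand sides of Equations 7 to 11 for state (S, E, I, R, D)."""
    s, e, i, r, d = state
    new_exposed = beta * i * s / population
    return (
        -new_exposed,                                           # dS/dt
        new_exposed - delta * e,                                # dE/dt
        delta * e - (1 - alpha) * gamma * i - alpha * rho * i,  # dI/dt
        (1 - alpha) * gamma * i,                                # dR/dt
        alpha * rho * i,                                        # dD/dt
    )


def simulate_seird(state, days, params, steps_per_day=10):
    """Integrate with explicit Euler steps; returns one state per day."""
    dt = 1.0 / steps_per_day
    current = tuple(float(x) for x in state)
    trajectory = [current]
    for _ in range(days):
        for _ in range(steps_per_day):
            rates = seird_derivatives(current, **params)
            current = tuple(x + dt * dx for x, dx in zip(current, rates))
        trajectory.append(current)
    return trajectory


# Illustrative values only; they are not fitted to any data.
params = dict(beta=0.5, delta=1 / 5, gamma=1 / 10, alpha=0.02, rho=1 / 14,
              population=1_000_000)
trajectory = simulate_seird((999_990, 0, 10, 0, 0), days=30, params=params)
print(len(trajectory))  # 31
```

The five derivatives sum to zero, so the total of the compartments stays at the initial population, which is a quick sanity check on any implementation of these equations.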

Experiments

Our experiments are based on the data spanning the period from the 22nd January ( d = 1 ) until the 22nd June 2020( d = 153 ), counting N = 153 days of data points from each of p = 253 regions. In order to provide at least a monthof data for training the models, backtesting results are reported from d = 31 onwards. Two prediction horizons werechosen for the experiments: H = 7 and H = 28 . Many regions report new cases with weekly cycle, where lower casesare reported during the weekend, therefore, horizons over full weeks are suggested to avoid instability. We make thebacktesting framework as well as further code to run the experiments available online .The ﬁrst experiment aggregates the ERR H ( t ) by the date t in order to show how forecasting error develops over thecourse of the pandemic. We show respective results for both forecasting horizons in Figure 3. A ﬁrst observation isthat it is easier to capture short-term trends, compared to long-term trends, as evident from smaller absolute predictionerrors across all models for the 7-days forecasts in Figure 3a compared to the 28-days forecasts in Figure 3b. Both thepower-growth and SEIRD model perform better than the simple baseline for most parts of the curve, which is why wefocus on them next.We observe a steady trend of prediction error decreasing together with error of a baseline model. We see that in thebeginning of the outbreak in early March, both models depict high errors in predicting conﬁrmed infected cases whichare near the levels of 1.3 and 5 respectively for the 7-days and 28-days forecast horizons. Over time, the models’ errorsmove down elastically and reach 0.4 and 1.5 respectively for the two forecast horizons. In other words, the error getsreduced about three times by middle-to-end of March, which is roughly 15 extra days of observed data. 
From that pointon the errors regress more inelastically through time and gradually reach 0.1-0.2 and 0.5 for the two forecast periods inJune.Given the decreasing error over time, we are now interested in studying the effect of historical data volume on predictionerrors. To that end, our second experiment visualized in Figure 4, contrasts how the error depends on the accumulatednumber of conﬁrmed cases. The evaluation metric was aggregated by C t —number of conﬁrmed cases at the datewhen the forecast was made. We focus on a forecasting horizon of 28 days, but can observe similar trends for the7-days forecast horizon. We can clearly see, that the error decreases with more training data available. SEIRD is evenperforming worse than the baseline with very limited number of recorded cases. As soon as a region reaches conﬁrmed cases, we observe that the forecast accuracy is securely below the baseline and monotonically decreasing.The power-growth model shows constant improvement of the error with increasing C t . In this paper, we studied the predictive performance of COVID-19 forecasting models throughout the course of thepandemic. To that end, we examined the error (through RMSLE) for predicting COVID-19 conﬁrmed cases acrossmultiple countries around the globe through time and volume starting from 22nd January until 22nd June 2020. Theerror was investigated via applying three models, a simple baseline model, a power growth model, and the well-knownepidemiological SEIRD model, under a rigorous back testing framework that required reﬁtting the models’ parametersevery day on historical data up to this point and making predictions, covering the whole six-month period. We used7-days and 28-days forecast horizons for our measurements.Our work highlights the importance of applying a rigorous backtesting framework to predicting the different stagesof COVID-19. It is clearly demonstrated (and expected) that different time and volume can result in different errorlevels . 
In the early days of the outbreak, when the volume of observed cases is still low, the error is larger (with higher volatility) than in later stages, when the curves have more developed shapes. Accurately characterizing the error level can facilitate better use of such epidemic models when they are integrated into decision systems, as it helps the decision maker decide how much confidence to place in such models at different stages of the epidemic. It is imperative to understand whether the error level of an epidemic model is low enough at any given point in time to provide a useful or exploitable prediction, as the cost of the models’ errors may amount to more than financial losses.
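
The daily refit-and-predict loop summarized above can be illustrated with a minimal sketch. It is not the published framework: the function names, the synthetic data, and the choice to score only the final day of each horizon are assumptions made for illustration, and RMSLE is taken in its standard form over all regions. Only the last-value baseline is implemented here; a power-growth or SEIRD fit would be passed in the same way through `forecast_fn`.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root mean squared logarithmic error between two non-negative arrays."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2)))

def last_value_forecast(history, horizon):
    """Baseline: repeat the last known cumulative count for every future day."""
    return np.full(horizon, history[-1], dtype=float)

def backtest(cases, forecast_fn, horizon, first_day=31):
    """Rolling-origin backtest over a (regions, days) array of cumulative cases.

    For each 1-based forecast day d, the model only sees days 1..d and is
    scored on day d + horizon (assumption: only the last horizon day is scored).
    Returns a dict {d: RMSLE across regions}.
    """
    n_regions, n_days = cases.shape
    errors = {}
    for d in range(first_day, n_days - horizon + 1):
        # Refit/predict per region using only the history up to day d.
        preds = np.array(
            [forecast_fn(cases[r, :d], horizon)[-1] for r in range(n_regions)]
        )
        actual = cases[:, d + horizon - 1]  # day d + horizon, 0-based index
        errors[d] = rmsle(preds, actual)
    return errors

# Hypothetical synthetic data: 3 regions, 60 days of quadratic cumulative growth.
days = np.arange(1, 61)
synthetic = np.array([(r + 1) * days ** 2 for r in range(3)], dtype=float)

errors_7 = backtest(synthetic, last_value_forecast, horizon=7)
errors_28 = backtest(synthetic, last_value_forecast, horizon=28)
```

On such monotonically growing synthetic curves, the baseline error is larger for the longer horizon and shrinks as more days are observed, which mirrors the qualitative pattern discussed in the text without reproducing any of its numbers.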

Acknowledgements.

We want to thank Dr. Christof Henkel (kaggle.com/christofhenkel) for collaboration on Kaggle developing our proposed power-growth model.

https://github.com/h2oai/covid19-backtesting-publication

Figure 4: Backtesting error by number of cases. This plot shows how the error evolves with the number of historical confirmed cases; it shows the errors grouped by the number of confirmed cases in a country or region. The x-axis depicts buckets of the number of cases at the point in time when the model was fitted. The left y-axis depicts the average error 28 days in the future, across all countries falling into the respective bucket of cases at any point in time. The grey bars (right y-axis) show how many observations (region-day pairs) are accumulated in each bucket. For example, the 256-512 bucket captures all observations where the historical number of confirmed cases falls into this range on the day the models were fitted. The y-axis then depicts the average error across all of these region-day pairs (approximately 1,700 observations for this bucket). A single country can fall into the same bucket multiple times.
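
The bucketing described in this caption can be sketched as follows. This is an illustrative reading, not the authors' code: the helper names are hypothetical, buckets are assumed to be half-open power-of-two ranges (so 256-512 covers 256 up to and including 511), zero-case observations get their own bucket, and the per-bucket error is assumed to be a plain mean over region-day pairs.

```python
from collections import defaultdict

def bucket_label(confirmed_cases):
    """Label of the power-of-two bucket [2^k, 2^(k+1)) containing the count."""
    count = int(confirmed_cases)
    if count < 1:
        return "0"  # assumption: observations with no cases get their own bucket
    lower = 1 << (count.bit_length() - 1)  # largest power of two <= count
    return f"{lower}-{2 * lower}"

def aggregate_by_bucket(observations):
    """Group (cases_at_fit_time, error) region-day pairs by case bucket.

    Returns {bucket_label: (mean_error, number_of_observations)}, i.e. the
    quantities shown on the left and right y-axes respectively.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cases, error in observations:
        label = bucket_label(cases)
        sums[label] += error
        counts[label] += 1
    return {label: (sums[label] / counts[label], counts[label]) for label in sums}

# Hypothetical region-day pairs: (confirmed cases when fitted, 28-day error).
example = [(256, 1.0), (300, 2.0), (511, 3.0), (512, 4.0), (0, 5.0)]
summary = aggregate_by_bucket(example)
```

Because the same region contributes one pair per forecast day, it can appear several times in one bucket, as the caption notes.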
