Learning as We Go: An Examination of the Statistical Accuracy of COVID-19 Daily Death Count Predictions
Roman Marchant a,b, Noelle I. Samia d, Ori Rosen e, Martin A. Tanner d, and Sally Cripps a,b,c,*

a ARC Centre for Data Analytics for Resources and Environments, Australia
b Centre for Translational Data Science, The University of Sydney, Australia
c School of Mathematics and Statistics, The University of Sydney, Australia
d Department of Statistics, Northwestern University, USA
e Department of Mathematical Sciences, University of Texas at El Paso, USA
* Corresponding author: [email protected], +61 425-276-967

May 26, 2020
Abstract

OBJECTIVE:
This paper provides a formal evaluation of the predictive performance of a model (and updates) developed by the Institute for Health Metrics and Evaluation (IHME) for predicting daily deaths attributed to COVID-19 for the United States.
STUDY DESIGN:
To assess the accuracy of the IHME models, we examine both forecast accuracy and the predictive performance of the 95% prediction intervals (PI).
RESULTS:
The initial model underestimates the uncertainty surrounding the number of daily deaths. Specifically, the true number of next-day deaths fell outside the IHME prediction intervals as much as 76% of the time, in comparison to the expected value of 5%. Regarding the updated models, our analyses indicate that the April models show little, if any, improvement in the accuracy of the point estimate predictions. Moreover, while we observe a larger percentage of states having actual values lying inside the 95% PI's, this observation may be attributed to the widening of the PI's. A major revised model in early May did result in a decrease in the estimated model uncertainty, albeit at the expense of poorer coverage probability.
CONCLUSION:
Our analysis calls into question the usefulness of the predictions to drive policy making and resource allocation.
Keywords:
COVID-19; Forecast Accuracy; Uncertainty Quantification; Decision Making under Uncertainty; Public Health Resource Allocation; Model Calibration

Highlights
1. Regarding the initial IHME model, between 51% and 76% of states in the USA have actual daily death counts which lie outside the 95% prediction interval.
2. The updated IHME models do not show any improvement in the accuracy of point estimate predictions.
3. The rather large level of predictive uncertainty implied by the models over the period 4/4 – 4/29 casts doubt on their usefulness to drive the development of health, social, and economic policies.
4. A major revised model in early May did result in a decrease in the estimated model uncertainty, at the expense of poorer coverage probability, again calling into question the model's reliability as a predictive tool.
5. The discrepancy between the predicted deaths and the actual deaths in the USA has serious implications for the USA government's future planning and provision of ventilators, PPE, and the staffing of medical professionals equipped to respond to this pandemic.
A recent model developed at the Institute for Health Metrics and Evaluation (IHME) provides forecasts for ventilator use and hospital beds required for the care of COVID-19 patients on a state-by-state basis throughout the United States over the period March 2020 through August 2020 [8]. See the related website https://covid19.healthdata.org/projections for interactive data visualizations. In addition, a manuscript and that website provide projections of deaths per day and total deaths throughout this period for the entire US, as well as for the District of Columbia. The IHME research has received extensive attention in social media, as well as in the mass media [2, 3]. Moreover, this work has influenced policy makers at the highest levels of the United States government, having been mentioned at White House press conferences, including March 31, 2020 [2].

Our goal in this paper is to provide a framework for formally evaluating the predictive validity of the IHME forecasts for COVID-19 outcomes, as data become sequentially available. We treat the IHME model (and its various updates) as a "black box" and examine the projected numbers of deaths per day in light of the ground truth to help quantify the predictive accuracy of the model. We do not provide a critique of the assumptions made by the IHME model, nor do we suggest any possible modifications to the IHME approach. Moreover, our analysis should not be misconstrued as an investigation of mitigation measures such as social distancing. We do, however, strongly believe that it is critical to formally document the operating characteristics of the IHME model, both to meet the needs of social and health planners and to provide a baseline of comparison for future models.
Our report examines the quality of the IHME deaths-per-day predictions for the initial model over the period March 29–April 2, 2020, for a series of irregularly updated IHME models over the period April 4, 2020–April 29, 2020, as well as for a major revision of the model in early May. For these analyses we use the actual deaths attributed to COVID-19 as our ground truth, our source being the number of deaths reported by Johns Hopkins University [1].

Each day the IHME model computes a daily prediction and a 95% posterior interval (PI) for COVID-19 deaths, four months into the future, for each state. For example, on March 29 there is a prediction and corresponding PI for March 30 and March 31, while on March 30 there is a prediction and corresponding PI for March 31. We call the prediction for a day made on the previous day a "1-step-ahead" prediction. Similarly, a prediction for a day made two days in advance is referred to as a "2-step-ahead" prediction, while a prediction for a day made k days in advance is called a "k-step-ahead" prediction.

To investigate the accuracy of the point predictions, we computed the logit of the absolute value of the percentage error (APE) [5], denoted by LAPE, where the logit of |x| is defined to be (1 + e^{-|x|})^{-1}. Note that under this logit transformation, LAPE is a normed metric which approaches one for very large percentage discrepancies (especially when the observed count is close to or equal to zero), while equaling 0.5 for those instances with perfectly accurate predictions. Working on the logit scale avoids the need for ad-hoc rules for discarding outliers. When both observed and predicted death counts are equal to zero, the LAPE is equal to 0.5. Boxplots of LAPE values are compared using the Friedman nonparametric test [6], which accounts for possible correlation within states over time.
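To make these evaluation quantities concrete, the following minimal sketch (in Python, with hypothetical inputs) computes the LAPE and the coverage summary, i.e. the percentage of locations whose actual death count falls inside (below, above) the 95% PI, that are reported in the remainder of the paper. The function names, the treatment of APE as a proportion, and the example numbers are our own illustrative choices, not the IHME or authors' actual code.

```python
import numpy as np

def lape(actual, predicted):
    """Logit-scale absolute percentage error: (1 + exp(-|APE|))^{-1}.
    Equals 0.5 for a perfect prediction (including when both counts are zero)
    and approaches 1 for very large discrepancies. APE is taken here as a
    proportion, |actual - predicted| / actual; this scale is an assumption."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ape = np.zeros_like(actual)                      # APE = 0 when both counts are 0
    nonzero = actual != 0
    ape[nonzero] = np.abs((actual[nonzero] - predicted[nonzero]) / actual[nonzero])
    ape[(actual == 0) & (predicted != 0)] = np.inf   # observed 0, predicted > 0 -> LAPE = 1
    return 1.0 / (1.0 + np.exp(-ape))

def coverage_summary(actual, lower, upper):
    """Return (inside, below, above): percentages of locations whose actual
    count lies inside, below, or above the 95% prediction interval."""
    actual, lower, upper = (np.asarray(a, dtype=float) for a in (actual, lower, upper))
    below = 100.0 * np.mean(actual < lower)
    above = 100.0 * np.mean(actual > upper)
    return 100.0 - below - above, below, above

# Hypothetical 1-step-ahead values for a handful of locations.
actual    = np.array([12, 0, 30, 5, 110])
predicted = np.array([ 8, 3, 45, 5,  60])
lower     = np.array([ 2, 0, 35, 1,  70])
upper     = np.array([10, 6, 60, 12, 130])
print(np.round(lape(actual, predicted), 3))   # 0.5 = perfect, values near 1 = poor
print(coverage_summary(actual, lower, upper)) # (inside %, below %, above %)
```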
Figure 1 graphically represents the discrepancy between the actual number of deaths and the 95% PIs for deaths, by state, for the dates March 30 through May 2. The color in these figures shows whether the actual death counts for a state were less than the lower limit of the 95% PI (blue), within the 95% PI (white), or above the upper limit of the 95% PI (red). The depth of the red/blue color denotes the number of actual deaths above/below the PI. A deep red signifies that the number of deaths in that state was substantially above the upper limit of the 95% PI, while a light red indicates that the number of deaths was marginally above the 95% PI upper limit. Similarly, a deep blue signifies that the number of deaths was substantially below the lower limit of the 95% PI, while a light blue indicates that the number of deaths was marginally below the 95% PI lower limit. Note that the days in the figure are not consecutive because these were the only days for which 1-step-ahead predictions were made available by IHME.

An examination of Figure 1 (a) shows that for March 30 only about 27% of states had an actual number of deaths lying in the 95% PI for the 1-step-ahead forecast. The corresponding percentages for March 31, April 1 and April 2 are 33%, 24% and 49%, respectively (see Figures 1 (b) – (d)). Therefore the percentage of states with actual number of deaths lying outside this interval is 73%, 67%, 76% and 51% for March 30, March 31, April 1 and April 2, respectively. We note that we would expect only 5% of observed death counts to lie outside the 95% PI.

For a given day the initial model is also biased, although the direction of the bias is not constant across days. For the 1-step-ahead prediction for March 30, 47% of all locations were over-predicted, that is, 47% of all locations had a death count which was below the 95% PI lower limit, while 26% were under-predicted. For March 31 the reverse was true: only 16% of locations had actual death counts below the 95% PI lower limit, while 51% had actual death counts above the 95% PI upper limit. This can be clearly seen from Figures 1 (a) and 1 (b), which are predominantly blue and red, respectively. See also the first four lines of Table 1.

The first four lines of Table 1 also suggest that the accuracy of predictions does not improve as the forecast horizon decreases, as one would expect. For March 31 and April 1 the forecast accuracy, as measured by the percentage of states whose actual death count lies within the 95% PI, decreases as the forecast horizon decreases. For March 31, the 2-step-ahead prediction is better than the 1-step-ahead prediction, while for April 1, the 3-step is better than the 2-step, which in turn is better than the 1-step. However, April 2 shows that accuracy slightly improves between the 3-step and the 2-step.

To investigate the relationship between the 2-step-ahead and the 1-step-ahead prediction errors by state, Figure 2 shows the March 31 1-step-ahead prediction errors, for predictions made on March 30, on the y-axis, versus the March 31 2-step-ahead prediction errors, for predictions made on March 29, on the x-axis. The colors in the graph correspond to different subsets of the data: red corresponds to those locations where the actual number of deaths was above the 1-step-ahead 95% PI upper limit, blue corresponds to those locations where the actual number of deaths was below the 1-step-ahead 95% PI lower limit, while gray corresponds to those locations where the actual number of deaths was within the 1-step-ahead 95% PI. This graph shows a very strong linear association between the prediction errors for the red locations (R = 96%, n = 25). This suggests that the additional information contained in the March 30 data did little to improve the prediction for those locations where the actual death count was much higher than the predicted number of deaths. The number of observations in the other two subsets of data was insufficient to draw any firm conclusions.

Per the IHME website, the IHME model underwent a series of updates beginning in early April, followed by a "major update" in early May. In this subsection we examine the performance of these later versions of the model. Our analysis focuses on two aspects of the IHME model predictions: first, the accuracy of the point estimates used for forecasting and, second, the estimated uncertainties surrounding those forecasts.
Figure 1: Discrepancy between actual death counts and one-step-ahead PIs for specific dates (panels (a) – (k): March 30, March 31, April 1, April 2, April 4, April 8, April 13, April 17, April 28, April 29, and May 2). The color shows whether the actual death counts were less than the lower limit of the 95% PI (blue), within the 95% PI (white), or above the upper limit of the 95% PI (red). The depth of the red/blue color denotes how many actual deaths were above/below the 95% PI. Initial model: March 30 – April 2; revised models: April 4 – April 29; major model update: May 2.

Forecast date | 1-step | 2-step | 3-step | 4-step
March 30 | 27 (47, 26) | | |
March 31 | 33 (16, 51) | 45 (12, 43) | |
April 01 | 24 (37, 39) | 37 (29, 33) | 43 (31, 25) |
April 02 | 49 (29, 22) | 50 (25, 25) | 44 (25, 31) | 46 (25, 29)
April 03 | | 31 (45, 24) | 31 (45, 24) | 36 (39, 25)
April 04 | 74 (24, 2) | | 41 (41, 18) | 39 (43, 18)
April 05 | | 84 (12, 4) | | 33 (51, 16)
April 06 | | | 86 (10, 4) |
April 07 | | | | 92 (4, 4)
April 08 | 82 (14, 4) | | |
April 09 | | 84 (10, 6) | |
April 10 | | | 84 (12, 4) |
April 11 | | | | 94 (6, 0)
April 13 | 84 (14, 2) | | |
April 14 | | 96 (4, 0) | |
April 15 | | | 98 (2, 0) |
April 16 | | | | 94 (2, 4)
April 17 | 100 (0, 0) | | |
April 18 | | 96 (4, 0) | |
April 19 | | | 98 (2, 0) |
April 20 | | | | 94 (4, 2)
April 28 | 94 (6, 0) | | |
April 29 | 94 (0, 6) | 90 (4, 6) | |
April 30 | | 86 (2, 12) | 86 (2, 12) |
May 01 | | | 86 (4, 10) | 86 (4, 10)
May 02 | 45 (47, 8) | | | 88 (6, 6)
May 03 | | 33 (59, 8) | |
May 04 | | | 23 (71, 6) |
May 05 | | | | 55 (25, 20)

Table 1: Percentage of locations with actual death counts inside the 95% PI, by the number of forecast periods ahead. The values in parentheses indicate the percentage of locations that were (below, above) the limits of the 95% PI. Blank cells indicate that no model release produced a forecast at that horizon for that date. Initial model: March 30 – April 2; revised models: April 4 – April 29; major model update: May 2.

Figure 2: Actual minus 1-step-ahead prediction values for March 31 (y-axis) vs. actual minus 2-step-ahead prediction values for March 31 (x-axis). The colors in the graph correspond to different subsets of the data: red corresponds to those locations where the actual number of deaths was above the 1-step-ahead 95% PI upper limit, blue corresponds to those locations where the actual number of deaths was below the 1-step-ahead 95% PI lower limit, while gray corresponds to those locations where the actual number of deaths was within the 1-step-ahead 95% PI.

We note that there are two ways in which the accuracy of the model, as measured by the percentage of states with death counts which fall within the 95% PI, can improve. Either the estimated uncertainty increases and therefore the prediction intervals become much wider, or the estimated expected value improves. The latter is preferable but much harder to achieve in practice. The former can potentially lead to prediction intervals that are too wide to be useful to drive the development of health, social, and economic policies.

A major concern with the initial model had to do with the fact that the PI's had poor coverage; namely, as low as 24% of the 95% PI's contained the true value. We now turn to the evaluation of the uncertainty estimates produced by the updated models from 4/4 – 5/2. An examination of Figures 1 (e) – (j), corresponding to the revised models of 4/4 – 4/29, illustrates that many more states now have actual death counts which lie within the 1-step-ahead 95% PI, as estimated by these revised models, than as estimated by the initial model. (Though in this regard, it is noted that any inside percentage presented in Figures 1 (e) – (j), as well as lines 6 – 26 of Table 1, below 88% is statistically significantly different from 0.95 at the 5% level, according to a one-tailed binomial test.)
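The one-tailed binomial test mentioned in the parenthetical remark above can be carried out as in the following sketch; the number of locations and the observed coverage are hypothetical values chosen only for illustration.

```python
from scipy.stats import binomtest

n_locations = 50   # hypothetical number of locations reporting on a given day
n_inside = 44      # hypothetical: 88% of actual counts fell inside the 95% PI

# One-tailed test of H0: coverage probability = 0.95 against H1: coverage < 0.95.
result = binomtest(n_inside, n_locations, p=0.95, alternative="less")
print(round(result.pvalue, 3))   # below 0.05 indicates significantly poor coverage
```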
In this way, we see that the percentage coverage improved substantially for the April models, and lines 6 – 26 of Table 1 confirm this. Interestingly, on April 29, the model grossly underestimated the death counts in the neighboring states of Ohio and Pennsylvania; see Figure 1 (j).

Regarding the major model update of 5/2, we see in Figure 1 (k) a serious deterioration in the empirical coverage probability, as was noted with the initial models. For this 5/2 model, we see from lines 26 – 29 of Table 1 that the model systematically overestimates the daily death total.

To explore this change in the uncertainty estimates of the predictions from the initial model to the revised models to the major model update, we computed the range of the 95% PI at the date of the forecast peak of daily deaths for each state, divided by the predicted value of the number of daily deaths at that peak (analogous to a coefficient of variation). In particular, division by the expected value of daily deaths at the peak takes into account the fact that those states with higher predicted peak daily deaths will have a larger 95% PI than those states with a lower expected peak in daily deaths.

Figure 3: The range of the 95% PI at the maximum predicted number of deaths, divided by the maximum predicted number of deaths, across states. Each observation represents a state and boxplots are calculated across model release dates. Initial model: March 30 – April 2; revised models: April 4 – April 29; major model update: May 2.

Figure 3 presents boxplots of this quantity for all states for both the initial and updated models. As can be seen from this figure, the normalized range of the PI's expands dramatically with the revised models (4/4–4/29), with p < 0.001 according to the Friedman nonparametric test [6].
Apparently, the major model revision of 5/2 resulted in a reduced estimate of variation in the predicted death count.
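For reference, a minimal sketch of how the quantity plotted in Figure 3 and the accompanying Friedman comparison can be computed. The arrays below are simulated stand-ins for the state-by-release-date values; the number of locations and the simulated interval widths are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
n_states, n_releases = 51, 11   # eleven release dates as in the text; 51 locations is assumed

# Hypothetical point predictions and PI limits, each evaluated at the state's
# predicted peak in daily deaths, for every model release date.
predicted = rng.uniform(20, 400, size=(n_states, n_releases))
lower = predicted * rng.uniform(0.4, 0.9, size=predicted.shape)
upper = predicted * rng.uniform(1.1, 3.0, size=predicted.shape)

# Coefficient-of-variation-like quantity: PI range at the peak, normalized by
# the predicted peak itself; one boxplot per release date in Figure 3.
normalized_range = (upper - lower) / predicted

# Friedman test across release dates, with states as repeated-measures blocks.
stat, pvalue = friedmanchisquare(*(normalized_range[:, j] for j in range(n_releases)))
print(stat, pvalue)
```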
Figure 4 is a heat map of the difference between the actual daily death count and the 1-step-ahead predicted daily death count produced by the initial, revised, and major updated models for each state, expressed as a percentage of the actual daily death count, for the days between March 30 and May 2. For future reference, we denote this percentage error as PE. (Again, note that the days in the figure are not consecutive because these were the only days for which 1-step-ahead predictions were made available by IHME.) This graph reproduces Figure 1 with two changes. First, instead of analyzing the discrepancy between actual daily deaths and the predicted daily deaths, we now analyze the discrepancy as a percent of the actual daily death count. This is done so that the discrepancy between observed and predicted counts is normalized across different states and on different days. If the actual value and the predicted value are both zero, we have set the percentage error to zero. If the actual value for a state was zero but the predicted value was not, we have labeled that state "NA" and shaded it gray. The second alteration is that the white color coding of states for which the actual death count was within the 95% posterior interval is now omitted, so that Figure 4 is now a heat map of the percentage discrepancy.

An examination of Figure 4 reveals several features. First, the initial model produced predictions that were biased toward under-prediction. The median 1-step PE was greater than or equal to zero for the first four days. This can be seen by the predominance of red in Figures 4 (b) and (c), particularly for March 31 and April 1. The revised models over the next two weeks (particularly April 4 and April 13) had median PE below zero, indicating over-prediction, as can be seen by the predominance of blue in Figures 4 (e) and (g). Beginning on April 17, the median PE was positive for the remainder of the month, indicative of under-prediction, as noted by the predominance of red in Figures 4 (h) – (j). Following this sustained period of under-prediction, that is, more people died than predicted, the model underwent a major revision.
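As an aside, the PE convention just described, including the handling of zero counts, can be written compactly as in the following sketch (the values are hypothetical; under this sign convention a positive PE means the actual count exceeded the prediction, i.e. under-prediction).

```python
import numpy as np

def percentage_error(actual, predicted):
    """PE = 100 * (actual - predicted) / actual; set to 0 when both counts are
    zero, and to NaN ("NA" in Figure 4) when the actual count is zero but the
    predicted count is not."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    pe = np.full(actual.shape, np.nan)
    pe[(actual == 0) & (predicted == 0)] = 0.0
    nonzero = actual != 0
    pe[nonzero] = 100.0 * (actual[nonzero] - predicted[nonzero]) / actual[nonzero]
    return pe

print(percentage_error([12, 0, 0, 30], [8, 0, 3, 45]))
# approximately [ 33.3   0.   nan -50. ]
```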
Figure 4: Heat maps of the percentage error (PE) between the actual daily death count and the 1-step-ahead predicted daily death count produced by the model for each state, expressed as a percentage of the actual daily death count, for the days between March 30 and May 2 (panels (a) – (k) as in Figure 1). The colors in this figure are consistent with Figure 1: blue indicates that the actual death counts were less than the predicted 1-step-ahead death count and red indicates that the actual death counts were above the predicted 1-step-ahead death count. If the actual value for a state was zero but the predicted value was not, we have labeled that state "NA" and shaded it gray. Initial model: March 30 – April 2; revised models: April 4 – April 29; major model update: May 2.

Figure 5: The logit of the absolute percentage error (LAPE) in multiple-step-ahead predictions for the model's revision dates. The LAPE values for dates from April 4 onwards had k-step-ahead predictions (corresponding to the particular row in the figure) made by the updated models, while those prior to this date had k-step-ahead predictions made by the initial model. Initial model: March 30 – April 2; revised models: April 4 – April 29; major model update: May 2.

Figure 5 presents boxplots of the LAPE values for the dates March 30 to May 2, corresponding to predictions made with the initial model, the updated models, and the major model update of May 2, where each row corresponds to 1-step through 4-step ahead predictions. An examination of the first row of this figure (corresponding to 1-step-ahead predictions) suggests that the predictive performance may have deteriorated somewhat with the updated models, as some boxplots to the right of the initial models seem shifted toward 1. More formally, the Friedman nonparametric test [6], which accounts for possible correlation within states over time, revealed a statistically significant difference across the eleven time points. In addition, the remaining rows of the figure suggest that the predictions deteriorate as the number of steps ahead, i.e. k, decreases. Interestingly, the major model revision of May 2 seems to follow a similar trajectory.

Our results suggest that the initial IHME model substantially underestimated the uncertainty associated with COVID-19 death count predictions. We would expect to see approximately 5% of the observed numbers of deaths fall outside the 95% prediction intervals. In reality, we found the observed percentage of death counts lying outside the 95% PI to be in the range 51%–76%, which is more than an order of magnitude above the expected percentage. Moreover, we would expect to see 2.5% of the observed death counts fall above and 2.5% fall below the PI. In practice, the observed percentages were asymmetric, with the direction of the bias fluctuating across days.

In addition, the performance accuracy of the initial model does not improve as the forecast horizon decreases. In fact, Table 1 indicates that the reverse is generally true. Interestingly, the model's prediction for the state of New York is consistently accurate, while the model's prediction for the neighboring state of New Jersey, which is part of the New York metropolitan area, is not consistently accurate.

Our comparison of forecasts made by the initial model versus forecasts made by the updated models indicates that the later models do not show any improvement in the accuracy of point predictions. In fact, there is some evidence that this accuracy has actually decreased. Moreover, when considering the updated models of early to mid April, while we observe a larger percentage of states having actual values lying inside the 95% PI, Figure 3 suggests this observation may be attributed to the widening of the PI's.
The width of these intervals does call into question the usefulness of the predictions to drive policy making and resource allocation. A major model revision in early May resulted in a decrease in the estimated model uncertainty, at the expense of poorer coverage probability. This observed vacillation between narrow PI's (low empirical coverage), wide PI's (high empirical coverage), and narrow PI's (low empirical coverage) reinforces the concern raised by Etzioni [4]: "that the IHME model keeps changing is evidence of its lack of reliability as a predictive tool". In this regard, see Jewell et al. [7] for general comments as to why the IHME model may suffer from the shortcomings formally documented in the present paper.

In the major update of May, the data reported by IHME were pre-processed to "smooth" over fluctuations in the reporting of daily deaths. In particular, the processing involved the following steps: the cumulative number of deaths reported in a state each day was replaced with a three-day moving geometric mean, a smoothing step that was applied repeatedly, and the daily deaths were then obtained by differencing this processed data. This idea of pre-processing the data and then fitting a model to the processed data, rather than to the observed death counts, raises several concerns. First, the procedure results in replacing each death count by a weighted average of the adjacent death counts, where the weighting extends up to plus/minus 10 days. Is it reasonable to assume that the difference between the actual data and the processed data is due to reporting errors in all states and across all days of the week? Second, there are many methods to pre-process or "smooth" data. Why this method? Why a window of 3 days? Why 10 repetitions? Most importantly, how sensitive are the inferences if another method, another window size, or another number of repetitions were used? If there is seasonality across the days of the week, why not model this aspect of the data directly, rather than smooth it out? Third, the prediction intervals of the "smoothed" data are akin to confidence intervals for a mean estimate. Indeed, the authors make the point that they are interested in the general trend rather than in the variability of the individual daily death counts. However, the correct measure of risk for a decision maker regarding resource allocation at the local level is the local level of risk, not the risk associated with an average or trend smoothed over potentially twenty days. These concerns (and others) require a comprehensive examination of pre-processing methodology and are beyond the scope of this paper.
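For concreteness, the following is a minimal sketch of the pre-processing described above as we read it: the reported cumulative death counts are passed repeatedly through a three-day moving geometric mean, and daily deaths are then obtained by differencing. The window size, the ten passes, and the treatment of zeros and endpoints here are illustrative assumptions, not the actual IHME implementation.

```python
import numpy as np

def geometric_moving_mean(x, window=3):
    """Centered moving geometric mean; endpoints are left unchanged."""
    x = np.asarray(x, dtype=float)
    out = x.copy()
    half = window // 2
    for i in range(half, len(x) - half):
        # Shift by 1 before taking logs so that zero counts are handled, then shift back.
        out[i] = np.exp(np.mean(np.log(x[i - half:i + half + 1] + 1.0))) - 1.0
    return out

def smoothed_daily_deaths(cumulative_deaths, passes=10):
    """Repeatedly smooth the cumulative series (each pass widens the effective
    window, so 10 passes weight roughly +/- 10 days), then difference it."""
    smoothed = np.asarray(cumulative_deaths, dtype=float)
    for _ in range(passes):
        smoothed = geometric_moving_mean(smoothed)
    return np.diff(smoothed)

# Hypothetical cumulative death counts for one state over two weeks.
cumulative = [0, 1, 3, 3, 8, 15, 15, 24, 40, 41, 60, 85, 86, 120]
print(np.round(smoothed_daily_deaths(cumulative), 1))
```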
The accurate quantification of uncertainty in real time is critical for optimal decision making. And while, as noted by Jewell et al. [7], "an appearance of certainty is seductive when the world is desperate to know what lies ahead", it is perhaps the most pressing issue in policy making that decision makers need accurate assessments of the risks inherent in their decisions. All predictions that are used to inform policy should be accompanied by estimates of uncertainty, and we strongly believe that these estimates should be formally validated against actual data as the data become available, especially in the case of a novel disease that has affected millions of lives around our entire planet.

Competing interests: The authors have no competing interests to declare.
Authors’ contributions:
RM, NIS, OR, MAT and SC contributed to the design of the study, the data analysis and interpretation, and the writing of the paper.

Funding:
None for the project.

We thank the authors of the IHME model for making their predictions and data publicly available. We agree with the statement on their website: "Having more timely, high-quality data is vital for all modeling endeavors, but its importance is dramatically higher when trying to quantify in real time how a new disease can affect lives."
Without access to the data and predictions this analysis would not have been possible. We also thank Noam B. Tanner for bringing the Brooks reference to our attention.
References

[1] Johns Hopkins University Coronavirus Resource Center, March 2020. Retrieved from https://coronavirus.jhu.edu, last accessed April 8, 2020.
[2] White House press briefing on U.S. coronavirus response, March 2020. Last accessed April 8, 2020.
[3] A. Azad. Model cited by White House says 82,000 people could die from coronavirus by August, even with social distancing, 2020. Last accessed April 8, 2020.
[4] S. Begley. Influential Covid-19 model uses flawed methods and shouldn't guide U.S. policies, critics say, 2020. Last accessed May 8, 2020.
[5] C. Brooks. Introductory Econometrics for Finance, 4th edition. Cambridge University Press, 2019.
[6] M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32:675–701, 1937.
[7] N. P. Jewell, J. A. Lewnard, and B. L. Jewell. Caution warranted: Using the Institute for Health Metrics and Evaluation model for predicting the course of the COVID-19 pandemic. Ann Intern Med, 2020. https://doi.org/10.7326/M20-1565. Epub ahead of print 14 April 2020.
[8] C. J. L. Murray. Forecasting COVID-19 impact on hospital bed-days, ICU-days, ventilator-days and deaths by US state in the next 4 months. medRxiv, 2020.