Estimating the Number of Infected Cases in COVID-19 Pandemic
arXiv [q-bio.PE], May 2020
Donghui Yan†, Ying Xu‡, Pei Wang¶
† Department of Mathematics and Program in Data Science, University of Massachusetts, Dartmouth, MA 02747
‡ Indigo Agriculture Inc, Boston, MA
¶ Icahn School of Medicine at Mount Sinai, NYC, NY
May 28, 2020
Abstract
The COVID-19 pandemic has caused major disturbance to human life. An important reason behind the widespread social anxiety is the huge uncertainty about the pandemic. One major uncertainty is how many, or what percentage of, people have been infected. There are published and frequently updated data on various statistics of the pandemic at the local, country, or global level. However, for various reasons, many cases are not included in those reported numbers. We propose a structured approach for estimating the number of unreported cases, in which we distinguish cases that arrive late in the reported numbers from those that had mild or no symptoms and thus were not captured by any medical system at all. We use post-report data to estimate the former and population matching to estimate the latter. We estimate that the reported number of infected cases in the US should be corrected by multiplying by a factor of 220.54% as of Apr 20, 2020. The infection ratio out of the US population is estimated to be 0.53%, implying a case mortality rate of 2.85%, which is close to the 3.4% suggested by the WHO.
With its quick spread at the global scale, the COVID-19 pandemic has become one of the most tragic disasters in human history, with 2.74 million confirmed cases worldwide and a death toll of 192k as of April 24, 2020. These numbers are still rising in multiple countries. The riskiest aspects of the coronavirus are its long incubation period and the existence of a large number of asymptomatic cases. These cause a substantial proportion of infected cases to go untracked by medical systems. For better policy making and disease control, as well as to reduce the widespread speculation among the public about the extent of the disease spread, it is of significant interest to estimate the missing counts. Specifically, as the pandemic gradually comes under control, the world is considering the resumption of normal business. This requires a prudent assessment of the potential risk. Inevitably, such an assessment involves estimating the number of asymptomatic cases while such cases are still active. However, estimating the number of those undocumented cases is very challenging, precisely because of the long incubation period and the asymptomatic cases. In this work, we present a structured approach for such an estimation task and give an approximate estimate at the US national and state level.

The remainder of this paper is organized as follows. In Section 2, we describe our approach. This is followed by a presentation of results in Section 3. Finally, we conclude in Section 4.
Statistically, the estimation of the number of unreported cases is related to the problem of inference with missing data [5] or censored data [2]. However, certain characteristics of the coronavirus epidemiology allow us to take a different approach. We adopt a structured approach, inspired by the diagnostic analysis of remote sensing studies [8], where the errors in land use classification were decomposed according to their sources. Our approach is illustrated in Figure 1.

The missing counts in the reported numbers come from two sources. One is those cases for which, at the report date, the symptoms were not severe enough for the affected individuals to get tested; however, they would eventually visit some medical facility and be tested for potential coronavirus infection. We call such cases the type I cases, and the waiting period before the onset is termed the incubation (or dormant) period. These are illustrated as the filled blue bars in Figure 1. At the time of report, all such cases are still in dormant status and are thus missing from the reported number.

Figure 1: Time train of the progression of infection with COVID-19. Blue or non-filled bars indicate type I or II cases; $D_1, \ldots, D_7$ are the numbers of reported cases during the 1st, 2nd, ..., and 7th day past the original report date.

The second source of unreported cases is those who were infected but either were not aware of it or had symptoms too light to visit a medical facility, and who later recovered without any particular medical treatment. We call such cases the type II cases. The type II cases never show up in any reported numbers, leaving little clue for estimation. But we cannot overlook such cases, because their number could be large and such individuals form an important source of infection. For the rest of this section, we describe our method for estimating each of the two types of cases.
The estimation of the number of type I cases is facilitated by a crucial observation. Though not included in the reported number while in the dormant period, such cases would eventually be reported when the symptoms become so severe that the individuals have to seek medical treatment. By that time, the cases previously missed at the original report date (which was a few days earlier) would be counted towards infected cases at some later report date (though one would not know which particular report date). Such numbers should have been included at the original report date but surface only several days later; for this reason we call them delayed counts. If there is a way to estimate such delayed counts or their total, then one can estimate the number of type I cases for the original report date.

It is instructive to consider a simple ideal case where all infected cases have a dormant period of 7 days. In this ideal case, the numbers $D_1, D_2, \ldots, D_6$ in Figure 1 are exactly the numbers of cases that were at their 6th, 5th, ..., 1st day of infection, respectively, when counted at the original report date. Under the assumption of a 7-day incubation period, these are all the type I cases missed at the original report date but reflected perfectly later in the number of newly reported cases during the 6-day time window following the original report date. So the total number of type I cases at the original report date can be calculated as their sum, $\hat{D}_{\mathrm{type}} = \sum_{i=1}^{6} D_i$.

The reality is, however, more complicated. First, the length of the dormant period varies across individual cases. Also, during the post-report time window, newly infected cases may arise and be reported. Thus the number of newly reported cases on any particular day within this time window might be mixed, in the sense that it would include both cases infected before the report date (but in the dormant period at that time) and cases infected after it.
The former will not pose a problem, as such cases would be counted towards $\hat{D}_{\mathrm{type}}$ anyway, though cases infected on the same day may now contribute to different $D_i$'s. The latter is undesirable but could be corrected, to a certain extent, by the truncating effect when we only sum up the counts in the post-report time window up to a length of $T$ days. That is, cases with a dormant period extending more than $T$ days post-report will be truncated and not included in $\hat{D}_{\mathrm{type}}$, with the total count of such truncated cases being 'cancelled out' by the newly infected cases within the post-report time window of a properly chosen length $T$. This leads to an estimate for the number of type I cases as

\[ \hat{D}_{\mathrm{type}} = \sum_{i=1}^{T} D_i, \]

where $D_i$ is now the number of cases reported on the $i$th day after the original report date. We can let $T$ take a value around or slightly larger than the mean of the incubation period. In the appendix, we give a justification of why our estimate, $\hat{D}_{\mathrm{type}}$, is a reasonable one.

If we keep track of the delayed estimate $\hat{D}_{\mathrm{type}}$ through time, we get a time series which, upon smoothing, could be used to estimate the current missing type I counts. For such an estimation to be feasible, we have two requirements. One is that the daily reported counts do not change too abruptly through time. Thus, our approach would not work well while the infection trend is still rising very rapidly. During such a period, the safest strategy might be to strictly enforce social distancing. But once the overall situation is gradually brought under control, our estimation would apply. The other is knowledge of the duration of the incubation period. According to many studies [3, 1, 4], the incubation period has a median of around 4-7 days.
While further studies or data analysis are required to confirm this, we take $T = 7$ in our estimation. Additionally, it should be noted that our estimation is valid only if testing for coronavirus is sufficiently carried out for the population of interest; insufficient testing would lead to an underestimate of the number of type I cases.
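The delayed-count estimate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `delayed_count` and `type1_ratio` are hypothetical, the daily counts are made up, and the ratio is taken against the cumulative count reported up to the report date, which is one plausible reading of "ratio w.r.t. the reported number of cases".

```python
from typing import Sequence


def delayed_count(daily_new: Sequence[float], report_day: int, T: int = 7) -> float:
    """Sum the counts reported on the T days after report_day: D_1 + ... + D_T."""
    window = daily_new[report_day + 1 : report_day + 1 + T]
    if len(window) < T:
        raise ValueError("need T full days of post-report data")
    return float(sum(window))


def type1_ratio(daily_new: Sequence[float], report_day: int, T: int = 7) -> float:
    """Estimated type I count relative to the cumulative count reported by report_day."""
    reported = sum(daily_new[: report_day + 1])
    if reported <= 0:
        raise ValueError("no reported cases up to report_day")
    return delayed_count(daily_new, report_day, T) / reported


# Made-up daily new-case counts, for illustration only (not real data).
daily_new = [10, 12, 15, 20, 22, 25, 24, 26, 27, 25, 28, 30, 29, 31]
report_day = 5  # index of the original report date within daily_new

estimate = delayed_count(daily_new, report_day, T=7)
ratio = type1_ratio(daily_new, report_day, T=7)
print(f"estimated type I cases: {estimate:.0f}, ratio to reported: {ratio:.2%}")
```

Because the estimate for a given report date needs $T$ further days of data, the most recent usable report date always lags the latest data by $T$ days; the smoothing step mentioned in the text is what would carry the series forward to the present.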
[Figure 2 panels: estimated type I ratio vs. days since Mar 8, 2020; left panel series CT6, CT7, MA6, MA7; right panel series US6, US7]
Figure 2:
Ratio of unreported cases w.r.t. reported cases for CT, MA and the US, where '6' and '7' in the legend are the values of $T$ used.

In Figure 2, we plot the ratio of estimated type I cases w.r.t. the reported number of cases for Connecticut (CT) and Massachusetts (MA) since Mar 8, 2020. These two states were chosen as they are similar in many aspects, so we expect their ratios of type I cases to reported cases to be similar. In Figure 2, there is an initial difference in the ratios of type I cases in these two states, which we attribute to the late response and the small number of cases tested in CT. Later, these two states exhibit a strikingly similar trend, as expected. We also explore the effect of using different values of $T$, namely 6 and 7. Again, initially the resulting estimates are fairly different, which reflects the rapid spread of coronavirus and the rapid rise of infected cases. Gradually, the difference between the resulting estimates diminishes, which implies that the choice of $T = 7$ leads to a fairly stable estimate at late stages of the disease spread. A similar observation can be made for the estimation of type I cases in the US, shown in the right panel of Figure 2.

The estimation of the number of type II cases is extremely challenging, as there is barely anything observable. Our main strategy is based on the matching of population statistics: using what we see well to infer what is missing. When we group reported infection cases, we notice a significant discrepancy in the count statistics by age group between the reported cases and the US population.

Age groups        0-19   20-44  45-54  55-64  65-74  75-84  85+
US population     25.06  33.27  12.73  12.92   9.32   4.70  2.00
Reported cases     5.00  29.00  18.00  18.00  17.00   9.00  6.00
Corrected cases   46.47  61.70  23.61  23.96  17.00   9.00  6.00

Table 1: Percentages by age group in the US population (2020) and in the reported infection cases. Note that the numbers in the bottom row are not normalized to sum to 100%.

We expect that, while people in most age groups in the population have a similar chance of being infected, type II cases occur much more often in age groups 20-64 but rarely for people of age 65+. This is because people over 65 typically have a relatively weaker immune system along with some pre-existing medical conditions. Once they are infected with coronavirus, even a slight symptom would prompt them to seek medical treatment. As a result, such cases are very likely to be discovered. Thus reported counts for such age groups would be more accurate and can serve as a reference to correct counts for other age groups. In contrast, cases in the 20-64 age groups are easily overlooked or go unnoticed unless their condition deteriorates. The reported counts for these age groups thus require a correction (termed age correction). The age group of 85+ is more vulnerable to infection, as its members typically live in senior centers or extended-care nursing facilities which, as a matter of fact, have a very high risk of infection. The case statistics for this age group would be very thorough, but many in this age group get infected simply because they share a very confined living space with many other equally vulnerable seniors, and the infection of any one person (including staff) in a senior center will quickly spread to the rest (
to a certain extent, one may think of this as a party of many people during COVID-19). So statistics for this age group would not be a reliable reference for population matching, since people in other age groups have a very different mobility pattern (infants interact with the world through their parents and thus have a chance of infection not so different from that of the general population).

The main assumption we use for population matching is that all people with an age in the range 0-84 have a similar chance of being infected. As a result, the counts in different age groups would be proportional to their respective percentages in the population; we call such an approach population matching. Let $r_{\mathrm{pop}}$ and $r_{\mathrm{case}}$ be the proportions of the reference group in the population and in the reported cases, and let $x_{\mathrm{pop}}$ and $x_{\mathrm{corrected}}$ be the respective proportions for the target group. Then

\[ r_{\mathrm{case}} : r_{\mathrm{pop}} = x_{\mathrm{corrected}} : x_{\mathrm{pop}}, \]

and the corrected percentage in the infected cases for the target group can be calculated accordingly. As argued above, the case statistics for age groups 65-74 and 75-84 are reliable, but those for ages 0-64 are incomplete with substantial missing data, so we use the reliable portion of the data to infer or correct statistics for the incomplete part. A simple calculation reveals that the age groups 65-74 and 75-84, according to Table 1, have a similar ratio of case percentage to population percentage, i.e., $9.00 : 4.70 \approx 17.00 : 9.32$. Thus, we can pool counts from these two groups and obtain

\[ r_{\mathrm{case}} : r_{\mathrm{pop}} = (9.00 + 17.00) / (4.70 + 9.32) \approx 1.85. \]
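The population-matching correction can be reproduced from the Table 1 percentages with a short Python sketch. The function name `population_match` is hypothetical, and the labels of the four youngest brackets (0-19, 20-44, 45-54, 55-64) are an assumption, since the text only names the 65-74, 75-84 and 85+ groups explicitly. The 85+ group is left uncorrected, following the argument that it is neither a reliable reference nor a target of the correction.

```python
from typing import Dict, Sequence, Tuple

# Percentages from Table 1; the four youngest bracket labels are assumed.
population = {"0-19": 25.06, "20-44": 33.27, "45-54": 12.73, "55-64": 12.92,
              "65-74": 9.32, "75-84": 4.70, "85+": 2.00}
reported = {"0-19": 5.00, "20-44": 29.00, "45-54": 18.00, "55-64": 18.00,
            "65-74": 17.00, "75-84": 9.00, "85+": 6.00}


def population_match(
    population: Dict[str, float],
    reported: Dict[str, float],
    reference: Sequence[str],
    keep: Sequence[str] = (),
) -> Tuple[float, Dict[str, float]]:
    """Scale target groups so that case share : population share matches the reference."""
    r_case = sum(reported[g] for g in reference)
    r_pop = sum(population[g] for g in reference)
    factor = r_case / r_pop  # pooled r_case : r_pop
    corrected = {}
    for group, pop_share in population.items():
        if group in reference or group in keep:
            corrected[group] = reported[group]  # trusted or excluded: leave as reported
        else:
            corrected[group] = factor * pop_share  # x_corrected = factor * x_pop
    return factor, corrected


factor, corrected = population_match(
    population, reported, reference=("65-74", "75-84"), keep=("85+",)
)
print(f"pooled ratio: {factor:.4f}")
for group, value in corrected.items():
    print(f"{group:>6}: {value:6.2f}")
```

The corrected values for the four youngest brackets round to the "Corrected cases" row of Table 1, and the row sums to more than 100, in line with the table note that it is not normalized.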
Figure 3:
Ratio of type I cases w.r.t. reported cases for each of the 50 states in the US on Apr 20, 2020.

Results
We apply our approach to each of the 50 states and to the US as a whole. The data are available from Wikipedia [7]. Due to the large variation in population across states, we calculate the ratio of missing cases to the number of reported cases for individual states. The ratio for type I cases is shown in Figure 3. Due to the lack of reported case data by age group for individual states, we use the overall estimate, which is 87.94% according to the discussion in Section 2.2, as the ratio of type II cases for all states. The overall ratio of type I cases for the US is estimated to be 32.60%. Combined with the 87.94% ratio for type II cases, this gives an estimated ratio of missing cases to the reported number of 120.54%. In other words, the reported number should be multiplied by a factor of 220.54% to reflect the true number of infected cases.

With the unreported numbers estimated, we can estimate the infection ratio, defined as the ratio of the number of infected cases to the population. The overall infection ratio of the US is estimated to be 0.53%, or 1.75 million cases, as of Apr 20, 2020. Using the associated death toll of about 50k, the case mortality rate is calculated as 2.85%, which is close to the WHO's suggested estimate of 3.4% [6]. The infection ratios for individual states are visualized as a heatmap in Figure 4. The heavily hit states are NY, NJ, CT, RI, MA, and LA, with infection ratios estimated at 2.61%, 2.11%, 1.22%, 1.15%, 1.31% and 1.04%, respectively, as of Apr 20, 2020. The trends of the infection ratio and the number of cases over time for these states are shown in Figure 5. It can be seen that, except for LA, the infection ratios for the other five states are still rapidly increasing. NJ shows a growth pattern similar to that of NY, while the three New England states, CT, RI and MA, are similar to one another.
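The headline figures above follow from simple arithmetic, sketched here in Python. The helper names are hypothetical, and the inputs are the rounded values quoted in the text; the implied reported count is back-derived from those values and is not a number stated in the paper.

```python
def correction_factor(type1_ratio: float, type2_ratio: float) -> float:
    """Multiplier applied to the reported count: 1 + type I ratio + type II ratio."""
    return 1.0 + type1_ratio + type2_ratio


def case_mortality_rate(deaths: float, infected: float) -> float:
    """Deaths divided by the estimated number of infected cases."""
    return deaths / infected


# Rounded values quoted in the text (US, as of Apr 20, 2020).
TYPE1_RATIO = 0.3260
TYPE2_RATIO = 0.8794
ESTIMATED_INFECTED = 1.75e6
DEATHS = 50_000

factor = correction_factor(TYPE1_RATIO, TYPE2_RATIO)
mortality = case_mortality_rate(DEATHS, ESTIMATED_INFECTED)
# Back-derived, not quoted in the text: reported count implied by the estimate.
implied_reported = ESTIMATED_INFECTED / factor

print(f"correction factor: {factor:.2%}")
print(f"case mortality rate: {mortality:.2%}")
print(f"implied reported cases: {implied_reported:,.0f}")
```

With these rounded inputs the mortality rate comes out at roughly 2.86%, consistent with the 2.85% quoted given the rounding of the death toll and the case estimate.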
[Figure 4 image: US map titled "Percent of population infected by state", with a color scale from 0.0 to 3.0]
Figure 4:
Heatmap of infection ratio for individual states as of Apr 20, 2020.
[Figure 5 panels: left, estimated infection ratio (%) vs. days since Mar 9, 2020; right, estimated number of cases (k) vs. days since Mar 9, 2020; series NY, NJ, CT, MA, RI, LA]
Figure 5:
Trend of estimated infection ratio and infected cases for NY, NJ, CT,MA, RI, and LA since Mar 9, 2020.
We have proposed a structured approach for estimating the number of infected cases not included in the reported number at a given date. We distinguish two types of 'missing' cases: cases that were infected but are still in the dormant period, and asymptomatic cases that later recover without medical treatment. The numbers of these two types of cases are estimated by accumulating reported counts within a properly chosen post-report time window and by population matching, respectively. The reported number of infected cases in the US, as of Apr 20, 2020, should be corrected by multiplying by a factor of 220.54%. The overall infection ratio of the US is estimated to be 0.53%, with a case mortality rate of 2.85%, which is close to the 3.4% suggested by the WHO.

Our estimation can potentially be used for risk assessment. The infant age group may be worth further consideration: people in this group are at much lower risk than other age groups, as they interact with the rest of the world through their parents, so the number of cases for this group may need to be adjusted accordingly to reflect the true risk.
Appendix
In this appendix, we give a justification of our estimation algorithm. We show that the error between our estimate, $\hat{D}_{\mathrm{type}}$, of the number of type I cases and its actual value $D_{\mathrm{type}}$ is small in expectation under reasonable assumptions about the distribution of the incubation period.

Denote by the random variable $X$ the length of the incubation period, and for simplicity we further assume that $X \geq 0$. Let $N_{-1}, N_{-2}, \ldots$ denote the numbers of newly infected cases on the days before the report date (for which we use $N_0$). Here we limit ourselves to type I cases, as we can conveniently assume that type II cases have an infinite incubation period. Then the expected number of cases that are discovered during the time window of $T$ days following the original report date is calculated as

\[ D_a = \sum_{i=0}^{\infty} E N_{-i} \cdot P(i + 1 \leq X < i + T). \tag{1} \]

For simplicity, we assume that all the $E N_{-i}$'s take a constant value $N$. We feel that this should not be too unrealistic, as we would expect the number of newly infected cases per day not to vary too much when the pandemic reaches a stable stage (those in the very distant past would be small, but they carry a very small fraction of the total number and so can be ignored). Also, we abuse the notation a bit by using the $D_{\cdot}$'s to also indicate the expected values of the associated random variables; the exact meaning will be determined by the context. Then Equation (1) can be rewritten as

\[ D_a = (T - 1) \cdot N - N \sum_{i=1}^{T-1} (T - i) \cdot P(i - 1 \leq X < i). \]

Under the same assumption, the number of new cases generated during the post-report time window of length $T$ days is

\begin{align*}
D_{\mathrm{new}} &= \sum_{i=1}^{T} E N_i \cdot P(X < T + 1 - i) \\
&= N \sum_{i=1}^{T} (T + 1 - i) \cdot P(i - 1 \leq X < i) \\
&= N \sum_{i=1}^{T-1} (T - i) \cdot P(i - 1 \leq X < i) + N \cdot P(X < T).
\end{align*}

Thus the total number of reported cases during the $T$-day post-report time window is calculated as

\[ D_a + D_{\mathrm{new}} = (T - 1) \cdot N + N \cdot P(X \leq T) = T N - N \cdot P(X > T), \]

where $P(X < T)$ is identified with $P(X \leq T)$, which is exact when $X$ has a continuous distribution.
Assuming that the random variable $X$ has a finite mean, Markov's inequality gives $P(X > T) \leq E X / T \triangleq \mu / T$, implying that the estimated number of type I cases satisfies

\[ T N \geq \hat{D}_{\mathrm{type}} = D_a + D_{\mathrm{new}} \geq T N - N \cdot \mu / T. \]

The actual number of cases that have accumulated but not been discovered before the report date consists of missing cases during the previous $T$ days and those even earlier cases, and has expected value

\[ D_{\mathrm{type}} = \sum_{i=0}^{\infty} E N_{-i} \cdot P(X \geq i + 1) = N \sum_{i=1}^{\infty} P(X \geq i) = N \cdot E X \triangleq \mu N. \tag{2} \]

Equation (2) indicates that the mean number of type I cases equals the product of the mean daily infected cases of type I and the mean length of the incubation period, which is consistent with the ideal case discussed in Section 2.1. Let $T = (1 + \epsilon)\mu$; then we have the following error bound for the estimated number of cases of type I:

\[ |\hat{D}_{\mathrm{type}} - D_{\mathrm{type}}| \leq \mu N \cdot \max(\epsilon, |\epsilon - 1/T|). \]

It follows that the relative error of the estimate satisfies

\[ \frac{|\hat{D}_{\mathrm{type}} - D_{\mathrm{type}}|}{D_{\mathrm{type}}} \leq \max(\epsilon, |\epsilon - 1/T|) = \max\bigl(\epsilon, |\epsilon - 1/((1 + \epsilon)\mu)|\bigr). \]

For a given $\mu$, one can pick $\epsilon$ to optimize the above bound. For example, when $\mu = 7$, one can take $\epsilon = 0.07$ to achieve a relative error bound of about 7%.
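The relative error bound can be evaluated numerically to check the example above. The sketch below is illustrative only: the function names and the grid search over $\epsilon$ are assumptions made here, not part of the paper's method.

```python
def relative_error_bound(eps: float, mu: float) -> float:
    """Bound max(eps, |eps - 1/T|) with T = (1 + eps) * mu."""
    T = (1.0 + eps) * mu
    return max(eps, abs(eps - 1.0 / T))


def best_eps(mu: float, eps_max: float = 1.0, grid_size: int = 2001) -> float:
    """Grid search for the eps in [0, eps_max] that minimizes the bound."""
    grid = [eps_max * k / (grid_size - 1) for k in range(grid_size)]
    return min(grid, key=lambda eps: relative_error_bound(eps, mu))


mu = 7.0
bound_at_007 = relative_error_bound(0.07, mu)
eps_star = best_eps(mu)
bound_star = relative_error_bound(eps_star, mu)

print(f"bound at eps = 0.07: {bound_at_007:.4f}")
print(f"grid-optimal eps: {eps_star:.4f}, bound: {bound_star:.4f}")
```

For $\mu = 7$ the grid minimum sits just below $\epsilon = 0.07$, so the choice quoted in the text is close to the best this bound allows.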
References

[1] J. A. Backer, D. Klinkenberg, and J. Wallinga. Incubation period of 2019 novel coronavirus (2019-nCoV) infections among travellers from Wuhan, China, 20-28 January 2020. Eurosurveillance (doi: 10.2807/1560-7917.ES.2020.25.5.2000062), 25(5), 2020.
[2] E. L. Kaplan and P. Meier. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282):457–481, 1958.
[3] S. A. Lauer, K. Grantz, Q. Bi, F. K. Jones, Q. Zheng, H. Meredith, A. S. Azman, N. G. Reich, and J. Lessler. The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: Estimation and application. Annals of Internal Medicine (doi: 10.7326/M20-0504), 2020.
[4] N. M. Linton, T. Kobayashi, Y. Yang, K. Hayashi, A. R. Akhmetzhanov, S. Jung, B. Yuan, R. Kinoshita, and H. Nishiura. Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: A statistical analysis of publicly available case data. Journal of Clinical Medicine, 9(2):538, 2020.
[5] R. J. Little and D. Rubin. Statistical Analysis with Missing Data. Wiley, 2002.
[6] World Health Organization. Coronavirus (COVID-19) Mortality Rate.
[7] Wikipedia. COVID-19 pandemic in the United States, 2020. https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States.
[8] D. Yan, C. Li, N. Cong, L. Yu, and P. Gong. A structured approach to the analysis of remote sensing images.