Anti-clustering in the national SARS-CoV-2 daily infection counts
arXiv [q-bio.PE], Jul
Boudewijn F. Roukema
Abstract
The noise in daily infection counts of an epidemic should be super-Poissonian due to intrinsic epidemiological and administrative clustering. Here, we use this clustering to classify the official national SARS-CoV-2 daily infection counts and check for infection counts that are unusual by being anti-clustered. We adopt a one-parameter model of $\phi'_i$ infections per cluster, dividing any daily count $n_i$ into $n_i/\phi'_i$ 'clusters', for 'country' $i$. We assume that $n_i/\phi'_i$ on a given day $j$ is drawn from a Poisson distribution whose mean is robustly estimated from the 4 neighbouring days, and calculate the inferred Poisson probability $P'_{ij}$ of the observation. The $P'_{ij}$ values should be uniformly distributed. We find the value $\phi_i$ that minimises the Kolmogorov–Smirnov distance from a uniform distribution. We investigate the $(\phi_i, N_i)$ distribution, for total infection count $N_i$. We consider consecutive count sequences above a threshold of 50 daily infections. We find that most of the daily infection count sequences are inconsistent with a Poissonian model. All are consistent with the $\phi_i$ model. Clustering increases with total infection count for the full sequences: $\phi_i \sim \sqrt{N_i}$. The 28-, 14- and 7-day least noisy sequences for several countries are best modelled as sub-Poissonian, suggesting a distinct epidemiological family. The 28-day sequences of DZ, BY, TR, AE have strongly sub-Poissonian preferred models, with $\phi_i < 0.5$; and FI, SA, RU, AL, IR have $\phi_i <$ .
Keywords
COVID-19 · Epidemic curve · Poisson point process
Introduction
The daily counts of new, laboratory-confirmed infections with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) constitute one of the key statistics followed by citizens and health agencies around the world in the ongoing 2019–2020 coronavirus disease 2019 (COVID-19) pandemic [9, 19]. Can these counts be classified in a way that makes as few epidemiological assumptions as possible, as motivation for deeper analysis to either validate or invalidate the counts? While full epidemiological modelling and prediction is a vital component of COVID-19 research [e.g. 6, 15, 23, 13, 3], it cannot be used accurately to study the pandemic as a whole – a global phenomenon by definition – if the data at the global level are themselves inaccurate. Knowledge of the global state of the current pandemic is weakened if any of the national-level SARS-CoV-2 infection data have been artificially interfered with by the health agencies providing those data or by other actors involved in the chain of data lineage [34]. Since personal medical data are private information, only a limited number of individuals at health agencies are expected to be able to check the validity of these counts based on original records. Nevertheless, artificial interventions in the counts could potentially reveal themselves in statistical properties of the counts. Unusual statistical properties in a wide variety of quantitative data sometimes appear, for example, as anomalies related to Benford's law [24, 25], as in the 2009 first round of the Iranian presidential election [29, 30, 20].

Author affiliations: Institute of Astronomy, Faculty of Physics, Astronomy and Informatics, Nicolaus Copernicus University, Grudziadzka 5, 87-100 Toruń, Poland; Univ Lyon, Ens de Lyon, Univ Lyon1, CNRS, Centre de Recherche Astrophysique de Lyon UMR5574, F–69007, Lyon, France. Email: boud @ astro.umk.pl

Here, we check the compatibility of noise in the official national SARS-CoV-2 daily infection counts, $N_i(t)$, for country $i$ on date $t$, with expectations based on the Poisson distribution [27]. It is unlikely that any real count data will quite match the theoretical Poisson distribution, both due to the complexity of the logical tree of time-dependent intrinsic epidemiological infection and due to administrative effects in the SARS-CoV-2 testing procedures and in the sub-national and national level procedures for collecting and validating data to produce a national health agency's official report. In particular, clusters of infections on a scale of $\phi'_i$ infections per cluster, either intrinsic or in the testing and administrative pipeline, would tend to cause relative noise to increase from a fraction of $1/\sqrt{N_i}$ for pure Poisson noise up to $\sqrt{\phi'_i/N_i}$, greater by a factor of $\sqrt{\phi'_i}$. This overdispersion has been found, for example, for COVID-19 death rate counts in the United States [15].

Note on terminology: No position is taken in this paper regarding jurisdiction over territories; the term 'country' is intended here as a neutral term without supporting or opposing the formal notion of state. Apart from minor changes for technical reasons, the 'countries' are defined by the data sources.

In contrast, it is difficult to see how anti-Poissonian smoothing effects could occur, unless they were imposed administratively. For example, an administrative office might impose (or have imposed on it by political authorities) a constraint to validate a fixed or slowly and smoothly varying number of SARS-CoV-2 test result files per day, independently of the number received or queued; this would constitute an example of an artificial intervention in the counts that would weaken the epidemiological usefulness of the data.

A one-parameter model to allow for the clustering is proposed in this paper, and used to classify the counts. We allow the parameter to take on an effective anti-clustering value, in order to allow the data to freely determine its optimal value. For more in-depth models of clustering, called "burstiness" in stochastic models of discrete event counts, power-law models have also been proposed [5, 8].

The method is presented in § Method. The section § SARS-CoV-2 infection data describes the choice of data set and the definition, for any given country, of a consecutive time sequence that has high enough daily infection counts for Poisson distribution analysis to be reasonable. The method of analysis is given in § Analysis. Results are presented in § Results. Qualitative discussion of the results is given in § Discussion and conclusions are summarised in § Conclusion. This work is intended to be fully reproducible by independent researchers using the Maneage framework: see commit 252cf1c of the git repository https://codeberg.org/boud/subpoisson and the archive zenodo.3951152.
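The increase in relative noise from $1/\sqrt{N_i}$ for pure Poisson counts to $\sqrt{\phi'_i/N_i}$ for clustered counts, as described in this section, can be illustrated numerically. The following Python sketch is not part of the reproducibility package of this paper. It assumes a toy model in which a day's count is exactly $\phi'$ times a Poisson-distributed number of clusters; the function names, the mean count of 200 and the number of simulated days are arbitrary illustrative choices.

```python
import random
import statistics


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate by counting rate-lam arrivals in unit time."""
    count = 0
    t = rng.expovariate(lam)
    while t <= 1.0:
        count += 1
        t += rng.expovariate(lam)
    return count


def relative_noise(mean_count, phi, n_days=3000, seed=1):
    """Empirical (standard deviation / mean) of simulated clustered daily counts.

    Toy assumption: each day has a Poisson number of clusters with mean
    mean_count / phi, and every cluster contributes exactly phi infections.
    """
    rng = random.Random(seed)
    lam = mean_count / phi
    counts = [phi * poisson_draw(rng, lam) for _ in range(n_days)]
    return statistics.pstdev(counts) / statistics.fmean(counts)


if __name__ == "__main__":
    mean_count = 200.0
    for phi in (1.0, 4.0, 16.0):
        predicted = (phi / mean_count) ** 0.5
        simulated = relative_noise(mean_count, phi)
        print(f"phi'={phi:5.1f}  simulated={simulated:.3f}  sqrt(phi'/N)={predicted:.3f}")
```

In this toy model the simulated relative noise should track $\sqrt{\phi'/N}$ to within sampling error, so quadrupling the cluster size roughly doubles the relative noise at fixed mean count.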
Method
SARS-CoV-2 infection data
Two obvious choices of a data set for national daily SARS-CoV-2 counts would be those provided by the World Health Organization (WHO) or those curated by the Wikipedia WikiProject COVID-19 Case Count Task Force in medical cases chart templates (hereafter, C19CCTF). While WHO has published a wide variety of documents related to the COVID-19 pandemic, it does not appear to have published details of how national reports are communicated to it and collated. Given that most government agencies and systems of government procedures tend to lack transparency, despite significant moves towards forms of open government [e.g. 37] in many countries, data lineage tracing from national governments to WHO is likely to be difficult in many cases. In contrast, the curation of official government SARS-CoV-2 daily counts by the Wikipedia WikiProject COVID-19 Case Count Task Force follows a well-established technology of tracking data lineage.

Data sources: WHO, https://covid19.who.int/WHO-COVID-19-global-data.csv; (archive). C19CCTF, https://en.wikipedia.org/w/index.php?title=Wikipedia:WikiProject_COVID-19/Case_Count_Task_Force&oldid=967874960

Unfortunately, it is clear that in the WHO data, there are several cases where two days' worth of detected infections appear to be listed by WHO as a sequence of two days $j$ and $j+1$ on which all the infections are allocated to the second of the two days, with zero infections on the first of the pair. There are also some sequences in the WHO data where the day listed with zero infections is separated by several days from a nearby day with double the usual number of infections. This is very likely an effect of difficulties in correctly managing world time zones, or time zone and sleep schedule effects, in any of several levels of the chains of communication between health agencies and WHO. In other words, there are several cases where a temporary sharp jump or drop in the counts appears in the data but is most likely a timing artefact. Whatever the reason for the effect, it will tend to confuse the epidemiological question of interest here: the aim is to globally characterise the noise and to highlight countries where unusual smoothing may have taken place.

We quantify this jump/drop problem as follows. We consider a pair of days $j$, $j+1$ for a given country to be a jump if the absolute difference in counts, $|n_i(j+1) - n_i(j)|$, is greater than the mean, $(n_i(j+1) + n_i(j))/2$. We evaluate the number of jumps $N_{\mathrm{jump}}$ for both the WHO data and the C19CCTF medical cases chart data, starting, for any given country, from the first day with at least 50 infections. Figure 1 shows $N_{\mathrm{jump}}$ for the 130 countries in common to the two data sets; there are 216 countries in the WHO data set and 132 in the C19CCTF data.

[Figure 1: horizontal axis $N_{\mathrm{jump}}$ (WHO); vertical axis $N_{\mathrm{jump}}$ (WP med. cases charts).]
Fig. 1 Number $N_{\mathrm{jump}}$ of sudden jumps or drops in counts on adjacent days in WHO and Wikipedia WikiProject COVID-19 Case Count Task Force medical cases chart national daily SARS-CoV-2 infection counts for countries present in both data sets. A line illustrates equal quality of the two data sets. The C19CCTF version of the data is clearly less affected by sudden jumps than the WHO data. Plain text table: zenodo.3951152/WHO vs WP jumps.dat.
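The jump criterion above is simple enough to state as code. The following Python sketch is an illustration only, not the script from the reproducibility package; the function name `count_jumps` and the example sequence are hypothetical. It counts adjacent-day pairs whose absolute difference exceeds their mean, starting from the first day with at least 50 infections.

```python
def count_jumps(counts, threshold=50):
    """Count adjacent-day pairs (j, j+1) with |n(j+1) - n(j)| > (n(j+1) + n(j)) / 2.

    Counting starts at the first day whose count reaches the threshold.
    """
    start = next((k for k, n in enumerate(counts) if n >= threshold), None)
    if start is None:
        return 0
    seq = counts[start:]
    return sum(1 for a, b in zip(seq, seq[1:]) if abs(b - a) > (a + b) / 2)


# Hypothetical sequence with two "zero day followed by double day" artefacts.
example = [10, 60, 0, 130, 70, 75, 0, 150, 80]
print(count_jumps(example))
```

Each zero-count day followed by a doubled count contributes two jumps under this definition: one for the drop and one for the rebound.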
It is clear that most countries have fewer jumps or drops in the Wikipedia data set than in the WHO data set. Thus, at least for the purposes of understanding intrinsic and administrative clustering, the C19CCTF medical cases chart data appear to be the better curated version of the national daily SARS-CoV-2 infection counts as reported by official agencies. The detailed download and extraction script of national daily SARS-CoV-2 infection data from these templates and the resulting data file zenodo.3951152/WP C19CCTF SARSCoV2.dat are available in the reproducibility package associated with this paper (§ Code availability). Dates without data are omitted; this should have an insignificant effect on the analysis if these are due to low infection counts.

The full set of data includes many days, especially for countries or territories (as defined by the data source) of low populations, with low values, including zero and one. The standard deviation of a Poisson distribution of expectation value $N$ is $\sqrt{N}$ [27], giving a fractional error of $1/\sqrt{N}$. Even taking into account clustering or anti-clustering of data, inclusion of these periods of close to zero infection counts would contribute noise that would overwhelm the signal from the periods of higher infection rates for the same or other countries. In the time sequences of SARS-CoV-2 infection counts, chaos in the administrative reactions to the initial stages of the pandemic will tend to create extra noise, so it is reasonable to choose a moderately high threshold at which the start and end of a consecutive sequence of days should be defined for analysis. Here, we set the threshold for a sequence to start at a minimum of 50 infections in a single day. The sequence is continued for at least 7 days (if available in the data), and stops when the counts drop below the same threshold for 2 consecutive days. The cutoff criterion of 2 consecutive days avoids letting the analysable sequence be too sensitive to individual days of low fluctuations. If the resulting sequence includes fewer than 7 days, the sequence is rejected as having insufficient signal to be analysed.

Analysis

Poissonian and $\phi'_i$ models: full sequences

We first consider the full count sequence $\{n_i(j),\ 1 \le j \le T_i\}$ for each country $i$, with $T_i$ valid days of analysis as defined in § SARS-CoV-2 infection data. Our one-parameter model assumes that the counts are predominantly grouped in clusters, each with $\phi'_i$ infections per cluster. Thus, the daily count $n_i(j)$ is assumed to consist of $n_i(j)/\phi'_i$ infection events. We assume that $n_i(j)/\phi'_i$ on a given day is drawn from a Poisson distribution of mean $\widehat{\mu}_i(j)/\phi'_i$. We set $\widehat{\mu}_i(j)$ to the median of the 4 neighbouring days, excluding day $j$ and centred on it. For the initial sequence of 2 days, $\widehat{\mu}_i(j)$ is set to $\widehat{\mu}_i(3)$, and $\widehat{\mu}_i(j)$ for the final 2 days is set to $\widehat{\mu}_i(T_i - 2)$. By setting $\widehat{\mu}_i$ as a median of a small number of neighbouring days, our model is almost identical to the data itself and statistically robust, with only mild dependence on the choices of parameters. This definition of a model is more likely to bias the resulting analysis towards underestimating the noise on scales of several days rather than overestimating it; this method will not detect oscillations on the time scale of a few days to a fortnight that are related to the SARS-CoV-2 incubation time [10].

For any given value $\phi'_i$, we calculate the cumulative probability $P'_{ij}$ that $n_i(j)/\phi'_i$ is drawn from a Poisson distribution of mean $\widehat{\mu}_i(j)/\phi'_i$. For country $i$, the values $P'_{ij}$ should be drawn from a uniform distribution if the model is a fair approximation. In particular, for $\phi'_i$ set to unity, $P'_{ij}$ should be drawn from a uniform distribution if the intrinsic data distribution is Poissonian.
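The calculation of $P'_{ij}$ from the median model, and the Kolmogorov–Smirnov distance of the resulting values from a uniform distribution, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the code of the reproducibility package: all function names are hypothetical, non-integer values of $n_i(j)/\phi'_i$ are handled by flooring (the paper does not state how they are treated), the model means are assumed to be strictly positive, and the logarithmic grid of trial $\phi'$ values (range and resolution) is an arbitrary choice.

```python
import math
import statistics


def poisson_cdf(x, mu):
    """P(K <= floor(x)) for K ~ Poisson(mu), with terms evaluated in log space."""
    if x < 0:
        return 0.0
    log_mu = math.log(mu)
    total = sum(math.exp(k * log_mu - mu - math.lgamma(k + 1))
                for k in range(int(math.floor(x)) + 1))
    return min(1.0, total)


def median_model(counts):
    """mu_hat(j): median of the 2 days before and the 2 days after day j.

    The first and last 2 days reuse the nearest interior value, as in the text.
    """
    counts = list(counts)
    last = len(counts) - 3          # last index with 2 neighbours on each side
    inner = {j: statistics.median(counts[j - 2:j] + counts[j + 1:j + 3])
             for j in range(2, last + 1)}
    return [inner[min(max(j, 2), last)] for j in range(len(counts))]


def cumulative_probs(counts, phi):
    """P'_ij for one country: Poisson CDF of n/phi given mean mu_hat/phi."""
    return [poisson_cdf(n / phi, mu / phi)
            for n, mu in zip(counts, median_model(counts))]


def ks_distance_uniform(probs):
    """One-sample Kolmogorov-Smirnov distance from the uniform distribution on [0, 1]."""
    p = sorted(probs)
    m = len(p)
    return max(max((k + 1) / m - v, v - k / m) for k, v in enumerate(p))


def best_phi(counts, log10_min=-2.0, log10_max=3.0, steps=51):
    """Trial phi' on a log-spaced grid (an assumed grid) minimising the KS distance."""
    grid = [10.0 ** (log10_min + k * (log10_max - log10_min) / (steps - 1))
            for k in range(steps)]
    return min(grid, key=lambda phi: ks_distance_uniform(cumulative_probs(counts, phi)))


# Hypothetical, artificially smooth 14-day sequence.
smooth = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 100, 101, 99, 100]
print(ks_distance_uniform(cumulative_probs(smooth, 1.0)))  # distance for phi' = 1
print(best_phi(smooth))                                     # preferred phi' on the grid
```

For a sequence as smooth as this hypothetical example, the $P'_{ij}$ values for $\phi' = 1$ bunch near 0.5, and the preferred grid value falls below unity, which is the sub-Poissonian signature discussed in this paper.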
Individual values of $P'_{ij}$ (close to zero or one) could, in principle, be used to identify individual days that are unusual, but here we do not consider these further.

We allow a wide logarithmic range in values of $\phi'_i$, allowing the unrealistic domain of $\phi'_i < 1$, and find the value $\phi_i$ that minimises the Kolmogorov–Smirnov (KS) distance [16, 32] from a uniform distribution, i.e. that maximises the KS probability that the data are consistent with a uniform distribution, when varying $\phi'_i$. The one-sample KS test is a non-parametric test that compares a data sample with a chosen theoretical probability distribution, yielding the probability that the sample is drawn randomly from the theoretical distribution. We label the corresponding KS probability as $P^{\mathrm{KS}}_i$. We write $P^{\mathrm{Poiss}}_i := P^{\mathrm{KS}}_i(\phi'_i = 1)$ to check if any country's daily infection rate sequence is consistent with Poissonian, although this is likely to be rare, as stated above: super-Poissonian behaviour seems reasonable. Of particular interest are countries with low values of $\phi_i$. Allowing for a possibly fractal or other power-law nature of the clustering of SARS-CoV-2 infection counts, we consider the possibility that the optimal values $\phi_i$ may be dependent on the total infection count $N_i$. We investigate the $(\phi_i, N_i)$ distribution and see whether a scaling type relation exists, allowing for a corrected statistic $\psi_i$ to be defined in order to highlight the noise structure of the counts independent of the overall scale $N_i$ of the counts.

Standard errors in $\phi_i$ for a given country $i$ are estimated once $\phi_i$ has been obtained by assuming that $\widehat{\mu}_i(j)$ and $\phi_i$ are correct and generating 30 Poisson random simulations of the full sequence for that country. Since the scales of interest vary logarithmically, the standard deviation of the best estimates of $\log \phi_i$ for these numerical simulations is used as an estimate of $\sigma(\log \phi_i)$, the logarithmic standard error in $\phi_i$.

Subsequences

Since artificial interference in daily SARS-CoV-2 infection counts for a given country might be restricted to shorter periods than the full data sequence, we also analyse 28-, 14- and 7-day subsequences. These analyses are performed using the same methods as above (§ Poissonian and $\phi'_i$ models: full sequences), except that the 28-, 14- or 7-day subsequence that minimises $\phi_i$ is found. The search over all possible subsequences would require calculation of a Šidák–Bonferroni correction factor [1] to judge how anomalous they are. The KS probabilities that we calculate need to be interpreted keeping this in mind. Since the subsequences for a given country overlap, they are clearly not independent of one another. Instead, the a posteriori interpretation of the results of the subsequence searches found here should at best be considered indicative of periods that should be considered interesting for further verification.

Results
Data
The 132 countries and territories in the C19CCTF counts data have 19 negative values out of the total of 16367 values. These can reasonably be interpreted as corrections for earlier overcounts, and we reset these values to zero with a negligible reduction in the amount of data. Consecutive day sequences satisfying the criteria listed in § SARS-CoV-2 infection data were found for 68 countries.

[Figure 2: horizontal axis $N_i$; vertical axis $P$.]
Fig. 2 Probability of the noise in the country-level daily SARS-CoV-2 counts being consistent with a Poisson point process, $P^{\mathrm{Poiss}}_i$, shown as red circles; and probability $P^{\mathrm{KS}}_i(\phi_i)$ for the $\phi_i$ clustering model proposed here (§ Poissonian and $\phi'_i$ models: full sequences), shown as green X symbols, versus $N_i$, the total number of officially recorded infections for that country. The horizontal axis is logarithmic. As discussed in the text (§ Full infection count sequences), the Poisson point process is unrealistic for most of these data, while the $\phi_i$ clustering model is consistent with the data for all countries. Plain text table: zenodo.3951152/phi N full.dat.

[Figure 3: horizontal axis $N_i$; vertical axis $\phi_i$.]
Fig. 3 Noisiness in daily SARS-CoV-2 counts, showing the clustering parameter $\phi_i$ (§ Poissonian and $\phi'_i$ models: full sequences) that best models the noise, versus the total number of counts for that country $N_i$. The error bars show standard errors derived from numerical (bootstrap) simulations based on the model. The axes are logarithmic, as indicated. Values of the clustering parameter $\phi_i$ below unity indicate sub-Poissonian behaviour – the counts in these cases are less noisy than expected for Poisson statistics. A robust (Theil–Sen [33, 31]) linear fit of $\log \phi_i$ against $\log N_i$ is shown as a thick green line (§ Full infection count sequences). Plain text table: zenodo.3951152/phi N full.dat.

[Figure 4: horizontal axis $N_i$; vertical axis $\psi_i$.]
Fig. 4 Normalised noisiness $\psi_i$ (Eq. (1)) for daily SARS-CoV-2 counts versus total counts $N_i$. The error bars are as in Fig. 3, assuming no additional error source contributed by $N_i$. The axes are logarithmic. A few low $\psi_i$ values appear to be outliers of the $\psi_i$ distribution.

Table 1 Clustering parameters for the countries with the 10 lowest $\phi_i$ and 10 lowest $\psi_i$ values (least noise); extended version of table: zenodo.3951152/phi N full.dat.

Country  $N_i$     $P^{\mathrm{Poiss}}_i$  $P^{\mathrm{KS}}_i$  $\phi_i$  $\psi_i$
(10 lowest $\phi_i$)
DZ       23691     0.30   0.65   0.89    0.005
FI       7347      0.35   0.98   1.72    0.020
BY       66348     0.09   0.89   2.11    0.008
AL       3906      0.23   0.83   2.57    0.041
HR       4422      0.27   0.93   3.24    0.048
AE       57193     0.00   0.70   3.35    0.014
NZ       1557      0.45   0.94   4.32    0.109
AU       12450     0.11   0.93   5.07    0.045
TH       3255      0.29   0.99   5.37    0.094
DK       13466     0.00   0.98   5.56    0.047
(10 lowest $\psi_i$)
DZ       23691     0.30   0.65   0.89    0.005
BY       66348     0.09   0.89   2.11    0.008
RU       783328    0.00   0.92   10.35   0.011
AE       57193     0.00   0.70   3.35    0.014
SA       255825    0.00   0.85   9.02    0.017
FI       7347      0.35   0.98   1.72    0.020
IR       276202    0.00   0.77   12.73   0.024
TR       220572    0.00   0.44   12.30   0.026
IN       1155191   0.00   0.80   33.88   0.031
AL       3906      0.23   0.83   2.57    0.041
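The sequence selection criteria of § SARS-CoV-2 infection data, together with the resetting of negative values to zero, can be sketched as follows. This Python sketch is illustrative and is not taken from the reproducibility package. The function name `select_sequence` is hypothetical, and two details are assumptions where the text leaves room for interpretation: the stopping rule is only applied after the first 7 days, and the analysed sequence ends just before the first of the 2 consecutive below-threshold days.

```python
def select_sequence(counts, threshold=50, min_days=7, stop_run=2):
    """Return the analysable consecutive sequence, or [] if it is too short.

    Assumptions (not stated explicitly in the text): the stop rule is only
    checked after min_days, and the returned sequence excludes the run of
    below-threshold days that triggers the stop.
    """
    counts = [max(n, 0) for n in counts]   # negative corrections are reset to zero
    start = next((k for k, n in enumerate(counts) if n >= threshold), None)
    if start is None:
        return []
    end = len(counts)
    for k in range(start + min_days, len(counts) - stop_run + 1):
        if all(n < threshold for n in counts[k:k + stop_run]):
            end = k
            break
    seq = counts[start:end]
    return seq if len(seq) >= min_days else []


# Hypothetical counts: one isolated low day (40) does not end the sequence,
# but two consecutive low days (30, 20) do.
example = [5, 60, 70, 40, 80, 90, 100, 55, 60, 30, 20, 70]
print(select_sequence(example))
```

Requiring 2 consecutive low days before stopping is what keeps the isolated low day inside the selected sequence in this example.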
Clustering of SARS-CoV-2 counts

Full infection count sequences
Figure 2 shows, unsurprisingly, that only a small handful of the countries' daily SARS-CoV-2 count sequences have noise whose statistical distribution is consistent with the Poisson distribution, in the sense modelled here: $P^{\mathrm{Poiss}}_i$ (red circles) is close to zero in most cases. In contrast, the introduction of the $\phi'_i$ parameter, optimised to $\phi_i$ for country $i$, provides a sufficient fit in all cases; none of the probabilities ($P^{\mathrm{KS}}_i(\phi_i)$, green X symbols) in Fig. 2 is low enough to be considered a significant rejection.

The consistency of the $\phi_i$ model with the data justifies continuing to Figure 3, which clearly shows a scaling relation: countries with greater overall numbers $N_i$ of infections also tend to have greater noise in the daily counts $n_i(j)$. A Theil–Sen linear fit [33, 31] to the relation between $\log \phi_i$ and $\log N_i$ has a zeropoint of − . ± . 32 and a slope of 0 . ± . 08, where the standard errors (68% confidence intervals if the distribution is Gaussian) are conservatively generated for both slope and zeropoint by 100 bootstraps. By using a robust estimator, the low $\phi_i$ cases, which appear to be outliers, have little influence on the fit. The fit is shown as a thick green line in Fig. 3.

This $\phi_i$–$N_i$ relation is consistent with $\phi_i \propto \sqrt{N_i}$. To adjust the $\phi_i$ clustering value to take into account the dependence on $N_i$, and given that the slope is consistent with this simple relation, we propose the empirical definition of a normalised clustering parameter

$$ \psi_i := \phi_i / \sqrt{N_i} , \tag{1} $$

so that $\psi_i$ should, by construction, be approximately constant. While the estimated slope of the relation could be used rather than this half-integer power relation, the fixed relation in Eq. (1) offers the benefit of simplicity.

This relation should not be confused with the usual Poisson error. By the divisibility of the Poisson distribution, the relation $\phi_i \propto \sqrt{N_i}$ found here can be used to show that

$$ \sigma[\widehat{\mu}_i(j)/\phi_i] \sim \sqrt{\widehat{\mu}_i(j)/\phi_i} \;\Rightarrow\; \sigma[\widehat{\mu}_i(j)] \sim \phi_i \sqrt{\widehat{\mu}_i(j)/\phi_i} \propto N_i^{1/4}\, \widehat{\mu}_i(j)^{1/2} , \tag{2} $$

where $\sigma[x]$ is the standard deviation of random variable $x$. If we accept $\widehat{\mu}_i(j)$ as a fair model for $n_i(j)$ and that $n_i(j)$ is proportional to $N_i$, then we obtain

$$ \sigma[n_i(j)] \propto n_i^{3/4} . \tag{3} $$

Figure 4 shows visually that $\psi_i$ appears to be scale-independent, in the sense that the dependence on $N_i$ has been cancelled, by construction. The countries with the 10 lowest values of $\psi_i$ are those with ISO 3166-1 alpha-2 codes DZ, BY, RU, AE, SA, FI, IR, TR, IN, AL. Detailed SARS-CoV-2 daily count noise characteristics for the countries with lowest $\phi_i$ and $\psi_i$ are listed in Table 1, including the Kolmogorov–Smirnov probability that the data are drawn from a Poisson distribution, $P^{\mathrm{Poiss}}_i$, the probability of the optimal $\phi_i$ model, $P^{\mathrm{KS}}_i$, and $\phi_i$ and $\psi_i$.

The approximate proportionality of $\phi_i$ to $\sqrt{N_i}$ for the full sequences is strong and helps separate low-noise SARS-CoV-2 count countries from those following the main trend. However, the results for subsequences shown below in § Subsequences of infection counts suggest that this $N_i$ dependence may be an effect of the typically longer durations of the pandemic in countries where the overall count is higher.

[Figure 5: horizontal axis $N_i$; vertical axis $\phi_i$.]
Fig. 5 Clustering parameter $\phi^{28}_i$ for the 28-day sequence with lowest $\phi^{28}_i$, as in Fig. 3. The vertical axis range is expanded from that in Fig. 3, to accommodate lower values. A robust (Theil–Sen [33, 31]) linear fit of $\log \phi^{28}_i$ against $\log N_i$ is shown as a thick green line (§ Full infection count sequences). Plain text table: zenodo.3951152/phi N 28days.dat.

Table 2 Least noisy 28-day sequences – clustering parameters for the countries with the 10 lowest $\phi^{28}_i$ values; extended table: zenodo.3951152/phi N 28days.dat.

country  $N_i$     $\langle n_i \rangle$  $P^{\mathrm{Poiss}}_i$  $P^{\mathrm{KS}}_i$  $\phi^{28}_i$  starting date
DZ       23691     154.1    0.10   0.80   0.17   2020-05-13
BY       66348     921.9    0.14   0.92   0.21   2020-05-08
TR       220572    1131.2   0.08   0.86   0.21   2020-06-23
AE       57193     512.8    0.08   0.23   0.23   2020-04-14
FI       7347      83.4     0.99   0.99   0.92   2020-04-15
SA       255825    1182.2   0.47   0.55   1.11   2020-04-12
RU       783328    6946.0   0.82   0.95   1.36   2020-06-17
AL       3906      74.6     0.23   0.83   2.57   2020-06-21
IR       276202    1863.3   0.20   0.98   2.85   2020-03-30
HR       4422      60.2     0.27   0.93   3.24   2020-03-28

Subsequences of infection counts
Figures 5–7 show the equivalent of Fig. 3 for sequences of lengths 28, 14 and 7 days, respectively. The Theil–Sen robust fits to the logarithmic $(\phi^{28}_i, N_i)$; $(\phi^{14}_i, N_i)$; and $(\phi^{7}_i, N_i)$ relations are zeropoints and slopes of 0 . ± . 33 and 0 . ± . 07; 0 . ± . 59 and 0 . ± . 13; and 0 . ± . − . ± . 13, respectively. There is clearly no significant dependence of $\phi^{d}_i$ on $N_i$ for any of these fixed length subsequences, in contrast to the case of the $\phi_i$ dependence on $N_i$ for the full count sequences. Thus, the empirical motivation for using $\psi_i$ (Eq. (1)) to discriminate between the countries' full sequences of SARS-CoV-2 data is not justified for the subsequences. Tables 2–4 show the countries with the least noisy sequences as determined by $\phi^{28}_i$, $\phi^{14}_i$ and $\phi^{7}_i$, respectively.

Tables 2 and 3 show that the lists of countries with the strongest anti-clustering are similar. Thus, Fig. 8 shows the SARS-CoV-2 counts curves for countries with the lowest $\phi^{28}_i$, and Fig. 9 the curves for those with the lowest $\phi^{7}_i$. Both figures exclude countries with total counts $N_i \le$ ... $\phi^{28}_i$ and $\phi^{7}_i$ distributions have their curves shown in Fig. 10 for comparison. It is visually clear in the figure that the counts are dispersed widely beyond the Poissonian band, and that the $\phi^{28}_i$ and $\phi^{7}_i$ models are reasonable as a model for representing about 68% of the counts within one standard deviation of the model values.

[Figure 6: horizontal axis $N_i$; vertical axis $\phi_i$.]
Fig. 6 Clustering parameter $\phi^{14}_i$ for the 14-day sequence with lowest $\phi^{14}_i$, as in Fig. 5. Plain text table: zenodo.3951152/phi N 14days.dat.

Table 3 Least noisy 14-day sequences – clustering parameters for the countries with the 10 lowest $\phi^{14}_i$ values; extended version of table: zenodo.3951152/phi N 14days.dat.

country  $N_i$     $\langle n_i \rangle$  $P^{\mathrm{Poiss}}_i$  $P^{\mathrm{KS}}_i$  $\phi^{14}_i$  starting date
AE       57193     521.2    0.11   0.58   0.09   2020-04-19
DZ       23691     144.1    0.11   0.49   0.09   2020-05-23
BY       66348     945.6    0.22   1.00   0.13   2020-05-12
TR       220572    991.6    0.12   0.97   0.13   2020-07-06
SA       255825    1227.5   0.38   0.98   0.30   2020-04-19
KE       12750     126.2    0.22   0.66   0.47   2020-06-03
FI       7347      95.1     0.64   0.98   0.65   2020-04-16
RU       783328    6522.9   0.37   0.42   0.72   2020-07-04
IN       1155191   9409.7   0.62   0.68   0.82   2020-05-30
AL       3906      70.8     0.59   0.92   0.87   2020-06-24

[Figure 7: horizontal axis $N_i$; vertical axis $\phi_i$.]
Fig. 7 Clustering parameter $\phi^{7}_i$ for the 7-day sequence with lowest $\phi^{7}_i$, as in Fig. 5. There is clearly a wider overall scatter and bigger error bars compared to Figs 5 and 6; a low $\phi^{7}_i$ is a weaker indicator than $\phi^{28}_i$ or $\phi^{14}_i$. Plain text table: zenodo.3951152/phi N 07days.dat.

Table 4 Least noisy 7-day sequences – clustering parameters for the countries with the 10 lowest $\phi^{7}_i$ values; extended table: zenodo.3951152/phi N 07days.dat.

country  $N_i$     $\langle n_i \rangle$  $P^{\mathrm{Poiss}}_i$  $P^{\mathrm{KS}}_i$  $\phi^{7}_i$  starting date
AE       57193     544.9     0.24   1.00   0.05   2020-04-27
BY       66348     947.9     0.61   0.97   0.05   2020-05-13
IN       1155191   10109.3   0.34   0.62   0.05   2020-06-06
DZ       23691     188.6     0.20   1.00   0.06   2020-05-20
FI       7347      94.9      0.43   0.56   0.08   2020-04-20
TR       220572    1022.4    0.43   0.97   0.10   2020-07-07
PL       40782     297.7     0.32   0.98   0.16   2020-06-20
PA       54426     171.1     0.89   0.98   0.17   2020-05-09
HN       33835     160.7     0.94   1.00   0.18   2020-06-01
DK       13466     71.1      0.48   0.97   0.28   2020-05-11

Discussion
Figures 3 and 4 clearly show that some groups of countries are unusual in terms of the characteristics of their location in the $(N_i, \psi_i)$ plane.

High total infection count

Brazil (BR) and the United States (US) are separated from the majority of other countries by their high total infection count. They have correspondingly higher clustering values $\phi_i$, although their normalised clustering values $\psi_i$ are in the range of about 0 . < $\psi_i$ < ... $\phi_i$ values greater than 300 are purely an effect of intrinsic infection events – 'superspreader' events in crowded places or nursing homes. While individual big clusters may occur given the high overall scale of infections, it seems more likely that this is administrative clustering. Both countries are federations, and have numerous geographic administrative subdivisions with a diversity of political and administrative methods. A plausible explanation for the dominant effect yielding $\phi_i > 300$ in these two countries is that on any individual day, the arrival and full processing of reports depends on a number of sub-national administrative regions, each reporting a few hundred new infections.

For example, if there are 10 reporting regions, each typically reporting 300 infections, then typically (on about 68% of days) there will be about 7 to 13 reports per day. This would give a range varying from about 2100 to 3900 cases per day, rather than 2945 to 3055, which would be the case for unclustered, Poissonian counts (since $\sqrt{3000} \approx 55$).

Low normalised clustering $\psi_i$

In Fig. 4, there appears to be a group of eight countries that are also separated from the main group of countries, but by having low normalised noise $\psi_i$ rather than just having a high total count $N_i$.

Low $\psi_i$, low $N_i$, high $P^{\mathrm{Poiss}}_i$

Classifying the countries by $\psi_i$ alone (Table 1) would add Finland (FI) to this group, but in Fig. 4, Finland appears better grouped with the main body of countries in the $(\psi_i, N_i)$ plane. This could be interpreted as Eq. (1) providing insufficient correction for the $\phi_i$–$N_i$ relation. Alternatively, looking at Finland's entry in Table 2 for 28-day sequences, we see that Finland is among the three with the lowest total (or mean) daily infection counts in the table, and has the highest consistency with a Poisson distribution ($P^{\mathrm{Poiss}}_i$). Having a low total infection count, it seems credible that Finland lacks the intrinsic, testing and administrative clustering of countries with higher infection counts.

[Figure 8: six panels of daily counts $n_i(j)$ with model and Poisson band: DZ 2020-05-13, BY 2020-05-08, TR 2020-06-23, AE 2020-04-14, SA 2020-04-12 (with $\phi_i$ band), RU 2020-06-17 (with $\phi_i$ band).]
Fig. 8 Least noisy 28-day official SARS-CoV-2 national daily counts for countries with total counts $N_i >$ ... $\widehat{\mu}_i(j)$ model (median of the 4 neighbouring days) and 68% error band for the Poisson point process. The ranges in daily counts (vertical axis) are chosen automatically and in most cases do not start at zero. About nine (32%) of the points should be outside of the shaded band unless the counts have an anti-clustering effect that weakens Poisson noise. A faint shaded band shows the $\phi_i$ model for the one country here with $\phi_i$ (slightly) greater than one (RU), but is almost indistinguishable from the Poissonian band. The dates indicate the start date of each sequence.

[Figure 9: six panels of daily counts $n_i(j)$ with model and Poisson band: AE 2020-04-27, IN 2020-06-06, BY 2020-05-13, DZ 2020-05-20, TR 2020-07-07, PL 2020-06-20.]
Fig. 9 Least noisy 7-day daily counts for countries with total counts $N_i >$ ...

[Figure 10: four panels of daily counts $n_i(j)$ with model, $\phi_i$ band and Poisson band: BE 2020-06-02, BD 2020-04-16, PE 2020-07-12, AU 2020-04-05.]
Fig. 10 Typical (median) 28-day (above) and 7-day (below) daily counts, as in Figs 8 and 9. The dark shaded band again shows a Poissonian noise model, which underestimates the noise. A faint shaded band shows the $\phi_i$ models for these countries' SARS-CoV-2 daily counts, and should contain about 68% of the infection count points.

Low $\psi_i$, high $N_i$

India (IN) and Russia (RU) have total infection counts nearly as high (logarithmically) as Brazil and the US, but have managed to keep their daily infection rates much less noisy – by about a factor of 10 to 100 – than would be expected from the general pattern displayed in the diagram. Despite having of the order of a million total official SARS-CoV-2 infections each, these two countries have, as of the download date of the data, 21 July 2020, avoided having the clustering effects present in Brazil and the US.

The most divergent case in the high-$N_i$ part of this group (see Fig. 4 and Table 1) is Russia, which has only a very modest value of $\phi_i$ = 10 . × ± . for its total infection count of over a million. This would require that both intrinsic clustering of infection events and administrative procedures work much more smoothly in Russia than in the United States, Brazil and, to a lesser degree, India. Tables 2 and 3 and Fig. 8 show that the Russian official SARS-CoV-2 counts indeed show very little noise compared to more typical cases (Fig. 10). At the intrinsic epidemiological level, this means that if the Russian counts are to be considered accurate, then very few clusters – in nursing homes, religious gatherings, bars, restaurants, schools, shops – can have occurred.
Moreover, laboratory testing and transmission of data through the administrative chain from local levels to the national (federal) health agency must have occurred without the clustering effects present in the United States and Brazil and in countries with more typical clustering values $\phi_i$ characterising their daily infection counts. International media interest in Russian COVID-19 data has mostly focussed on controversy related to COVID-19 death counts [7], with apparently no attention given so far to the modestly super-Poissonian nature of the daily counts, in contrast to the strongly super-Poissonian counts of other countries with high total infection counts.

India's overall position in the ($\psi_i$, $N_i$) plane (Fig. 4 and Table 1) is less extreme than that of Russia, with an unnormalised clustering parameter $\phi_i$ = 34 × ± . . However, Table 3 shows that despite its large overall infection count, India achieved a 14-day sequence with a preferred $\phi_i$ value close to unity. Moreover, it has a very low-ranked $\phi_i$ value, as given in Table 4 and illustrated in Fig. 9. Five values appear almost exactly on the model curve rather than scattering above and below. Moreover, the value is just below 10,000. Epidemiologically, it is not credible that 10,000 officially reported cases per day should be an attractor resulting from the pattern of infections and system of reporting. Given that 10,000 is a round number in the decimal system, a reasonable speculation would be that the daily counts for India were artificially held at just below 10,000 for several days. The crossing of the 10,000 psychological threshold of daily infections was noted in the media [28], but the lack of noise in the counts during the week preceding the crossing of the threshold appears to have gone unnoticed.
After crossing the 10,000 threshold, the daily infections in India continued increasing, as can be seen in the full counts (zenodo.3951152/WP C19CCTF SARSCoV2.dat).

Low $\psi_i$, low $\phi_i$, medium $N_i$: Among the group of eight low $\psi_i$ countries, Table 1 shows that only one country has its full data set (as defined here) best modelled by the ordinary Poisson point process. Algeria (DZ) appears to have completely avoided clustering effects, with $\phi_i$ close to unity. Figure 8 shows the least noisy 28-day sequence for Algeria. Only one day of recorded SARS-CoV-2 infections appears to have diverged beyond the Poissonian 68% band, rather than about nine (32% of 28 days), the expected number for a Poissonian distribution. Most of the points appear to stick very closely to the model. It is difficult to imagine a natural process for obtaining this sub-Poissonian noise (as preferred by the $\phi_i$ model), especially in the context where most countries have super-Poissonian daily counts. In a frequentist interpretation, the least noisy Algerian 28-day count sequence would be considered only mildly, not significantly, unusual, since it is consistent with a Poisson distribution, with only a weak rejection (Tables 2–4). However, since the sequence is a member of the general class of countries' SARS-CoV-2 daily infection count curves, use of the $\phi_i$ model would appear to be justified. It is in this sense that the sequence can be considered sub-Poissonian. Moreover, a full Bayesian analysis would need to consider independent credibility criteria.

In line with the counts for India that appeared to be smooth just below a round-number boundary of 10,000 infections per day, the least noisy 7-day sequence for Algeria, shown in Fig. 9, might appear to have been affected by a similar psychological boundary of 200 infections per day.
Medical specialists interviewed by the media interpreted the 200 daily infections period as representing stability and resulting from partial lockdown measures, without providing an explanation for why Poisson noise was nearly absent [14]. While lockdown measures should reduce intrinsic epidemiological clustering down towards the Poissonian level, it is difficult to see how they could reduce testing and administrative pipeline clustering. A coincidence that occurred during this least-noisy 7-day period, on 24 May 2020, was that a full COVID-19 lockdown was implemented in Algeria [21].

The Belarus (BY) case is present in all four tables (Tables 1–4). The least noisy Belarusian counts curve appears in Fig. 8. As with the other panels in the daily counts figures, the vertical axis is set by the data instead of starting at zero, in order to best display the information on the noise in the counts. With the vertical axis starting at zero, the Belarus daily counts would look nearly flat in this figure. They appear to be bounded above by the round number of 1000 SARS-CoV-2 infections per day, which, again, appears to be a psychologically preferred barrier. Media have expressed scepticism of Belarusian COVID-19 related data [17, 2].

One remaining case of a coincidence is that the lowest noise 7-day sequence listed for Poland (PL, Table 4) is for the 7-day period starting 20 June 2020, with $\phi_i$ = 0 . × ± . . This is a factor of about 100 (or at least 10 at about 95% confidence) below Poland's clustering value for the full sequence of its SARS-CoV-2 daily infection counts, $\phi_i$ = 13 × ± . , which Fig. 3 shows is typical for a country with an intermediate total infection count. On 28 June 2020, there was a de facto (of disputed constitutional validity [36, 18]) first-round presidential election in Poland. Figure 9 shows that the counts for Poland during the final pre-first-round-election week did not scatter widely throughout the Poissonian band.
A decimal-system round number also appears in this figure: the daily infection rate is slightly above about 300 infections per day and drops to slightly below that. For an unknown reason that does not previously appear to have been studied, the intrinsic clustering of SARS-CoV-2 infections in Poland together with testing and administrative clustering of the confirmed cases appears to have temporarily disappeared just prior to the election date, yielding what is best modelled as sub-Poissonian counts.
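
The figure of about nine days quoted above for the least noisy Algerian 28-day sequence can be checked with simple binomial arithmetic. The following Python sketch assumes, for illustration only, that each day independently has a 32% chance of falling outside a 68% band. The function name `expected_outside_band` is hypothetical and is not part of the paper's reproducibility package.

```python
import math


def expected_outside_band(n_days, band_fraction=0.68):
    """Mean and standard deviation of the number of days outside the band.

    Assumption: each day falls outside the band independently with
    probability 1 - band_fraction, so the number of such days is binomial.
    """
    p_out = 1.0 - band_fraction
    mean = n_days * p_out
    sd = math.sqrt(n_days * p_out * (1.0 - p_out))
    return mean, sd


mean_28, sd_28 = expected_outside_band(28)
print(f"28-day sequence: {mean_28:.1f} +/- {sd_28:.1f} days expected outside the band")
# prints: 28-day sequence: 9.0 +/- 2.5 days expected outside the band
```

This toy calculation ignores both the selection of the least noisy sequence among many candidates and the fact that the model curve is itself estimated from the data, so it should not be read as a significance test.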
Conclusion
Given the overdispersed, one-parameter Poissonian $\phi_i$ model proposed, the noise characteristics of the daily SARS-CoV-2 infection data suggest that most of the countries' data form a single family in the ($\phi_i$, $N_i$) plane. The clustering, whether epidemiological in origin or caused by testing or administrative pipelines, tends to be greater for greater numbers of total infections. Several countries appear, however, to show unusually anti-clustered (low-noise) daily infection counts.

Since these daily infection count data are of high epidemiological interest, the statistical characteristics presented here and the general method could be used as the basis for further investigation into the data of countries showing exceptional characteristics. The relations between the most anti-clustered counts and the psychologically significant decimal-system round numbers (IN: 10,000 daily, DZ: 200 daily, BY: 1000 daily, PL: 300 daily), and in relation to a de facto national presidential election, raise the question of whether or not these are just coincidences.

It should be straightforward for any reader to extend the analysis in this paper by first checking its reproducibility with the free-licensed source package provided using the Maneage framework [4], and then extending, updating or modifying it in other appropriate ways; see § Code availability below. Reuse of the data should be straightforward using the files archived at zenodo.3951152.
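
For readers who wish to experiment with the $\phi_i$ model before running the full reproducibility package, the following Python sketch illustrates the idea summarised in the abstract: rescale the counts by a trial $\phi$, compare each day with a Poisson expectation from the four neighbouring days, and choose the $\phi$ that brings the resulting probabilities closest to uniform in the Kolmogorov–Smirnov sense. It is a minimal sketch and not the code used for the paper. The function names, the use of the median of the four neighbours as the robust estimate, the use of the cumulative Poisson probability, and the logarithmic grid search are all assumptions made here for illustration.

```python
import numpy as np
from scipy import stats


def poisson_probabilities(counts, phi):
    """Cumulative Poisson probabilities of the rescaled counts n/phi.

    Assumption: the expected value for day j is the median of the counts
    on days j-2, j-1, j+1 and j+2 (one possible robust estimate).
    """
    counts = np.asarray(counts, dtype=float)
    probs = []
    for j in range(2, len(counts) - 2):
        neighbours = counts[[j - 2, j - 1, j + 1, j + 2]]
        mu = np.median(neighbours) / phi   # expected number of clusters
        k = np.floor(counts[j] / phi)      # observed number of clusters
        probs.append(stats.poisson.cdf(k, mu))
    return np.array(probs)


def ks_distance(counts, phi):
    """Kolmogorov-Smirnov distance of the probabilities from uniform."""
    probs = poisson_probabilities(counts, phi)
    return stats.kstest(probs, "uniform").statistic


def best_phi(counts, grid=None):
    """Grid search (an assumption here) for the phi minimising the KS distance."""
    if grid is None:
        grid = np.logspace(-1, 3, 81)
    distances = [ks_distance(counts, phi) for phi in grid]
    return float(grid[int(np.argmin(distances))])


# Synthetic example: 200 days with 20 infections per Poissonian cluster.
rng = np.random.default_rng(0)
synthetic = 20 * rng.poisson(100, size=200)
print(best_phi(synthetic))
```

In this sketch a fitted value well above unity indicates clustered (super-Poissonian) counts, and a value below unity indicates anti-clustered (sub-Poissonian) counts. Because the neighbour-based expectation is itself noisy, the fitted value on synthetic data need not equal the input cluster size exactly.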
Acknowledgements
Thank you to Marius Peper and an anonymous colleague for several useful comments and to the Maneage developers for the Maneage framework in general and for several specific comments on this work. This project has been supported by the Poznań Supercomputing and Networking Center (PSNC) computational grant 314. This research was partly done using the following free-licensed software packages: Bzip2 1.0.6, cURL 7.65.3, Dash 0.5.10.2, Discoteq flock 0.2.3, File 5.36, FreeType 2.9, Git 2.26.2, GNU AWK 5.0.1, GNU Bash 5.0.11, GNU Binutils 2.32, GNU Compiler Collection (GCC) 9.2.0, GNU Coreutils 8.31, GNU Diffutils 3.7, GNU Findutils 4.7.0, GNU gettext 0.20.2, GNU Grep 3.3, GNU Gzip 1.10, GNU Integer Set Library 0.18, GNU libiconv 1.16, GNU Libtool 2.4.6, GNU libunistring 0.9.10, GNU M4 1.4.18-patched, GNU Make 4.3, GNU Multiple Precision Arithmetic Library 6.1.2, GNU Multiple Precision Complex library, GNU Multiple Precision Floating-Point Reliably 4.0.2, GNU NCURSES 6.1, GNU Patch 2.7.6, GNU Readline 8.0, GNU Sed 4.7, GNU Tar 1.32, GNU Texinfo 6.6, GNU Wget 1.20.3, GNU Which 2.21, GPL Ghostscript 9.50, ImageMagick 7.0.8-67, Libbsd 0.9.1, Libffi 3.2.1, Libjpeg v9b, Libpng 1.6.37, Libtiff 4.0.10, Libxml2 2.9.9, Lzip 1.22-rc2, Metastore (forked) 1.1.2-23-fa9170b, OpenBLAS 0.3.10, OpenSSL 1.1.1a, PatchELF 0.10, Perl 5.30.0, pkg-config 0.29.2, Python 3.7.4, Unzip 6.0, XZ Utils 5.2.4, Zip 3.0 and Zlib 1.2.11. Python packages used include: Cycler 0.10.0, Kiwisolver 1.0.1, Matplotlib 3.1.1 [11], Numpy 1.17.2 [35], PyParsing 2.3.1, python-dateutil 2.8.0, Scipy 1.3.1 [26, 22], Setuptools 41.6.0, Setuptools-scm 3.3.3 and Six 1.12.0.
LaTeX packages for creating the pdf version of the paper included: biber 2.14, biblatex 3.14, bitset 1.3, breakurl 1.40, caption 55507 (revision), courier 35058 (revision), csquotes 5.2j, datetime 2.60, ec 1.0, etoolbox 2.5h, fancyhdr 3.10, fmtcount 3.07, fontaxes 1.0d, footmisc 5.5b, fp 2.1d, kastrup 15878 (revision), latexpand 1.6, letltxmacro 1.6, logreq 1.0, mweights 53520 (revision), newtx 1.627, pdfescape 1.15, pdftexcmds 0.32, pgf 3.1.5b, pgfplots 1.17, preprint 2011, setspace 6.7a, tex 3.14159265, texgyre 2.501, times 35058 (revision), titlesec 2.13, trimspaces 1.1, txfonts 15878 (revision), ulem 53365 (revision), varwidth 0.92, xcolor 2.12 and xkeyval 2.7a.
Authors’ contributions
The design, execution and writing up of this paper were carried out by the author alone. This research was partly done using the reproducible paper template subpoisson-252cf1c.
Funding
No funding has been received for this project.
Data availability
As described above in § SARS-CoV-2 infection data, the source of curated SARS-CoV-2 infection count data used for the main analysis in this paper is the C19CCTF data, downloaded using the script download-wikipedia-SARS-CoV-2-charts.sh and stored in the file Wikipedia SARSCoV2 charts.dat in the reproducibility package available at zenodo.3951152. The data file is archived at zenodo.3951152/WP C19CCTF SARSCoV2.dat. The WHO data that was compared with the C19CCTF data via a jump analysis (Fig. 1) was downloaded from https://covid19.who.int/WHO-COVID-19-global-data.csv and was archived on 15 July 2020.
Code availability
In addition to the SARS-CoV-2 infection count data for this paper, the full calculations, production of figures, tables and values quoted in the text of the pdf version of the paper are intended to be fully reproducible on any POSIX-compatible system using free-licensed software, which, by definition, the user may modify, redistribute and redistribute in modified form. The reproducibility framework is technically a git branch of the Maneage package [4] (https://maneage.org), earlier used to produce reproducible papers such as Infante-Sainz et al. [12]. The git repository commit ID of this version of this paper is subpoisson-252cf1c. The primary git repository is https://codeberg.org/boud/subpoisson. Bug reports
Conflict of interest
The author of this paper has participated as a volunteer in the Wikipedia WikiProject COVID-19 Case Count Task Force in helping to collate a small fraction of COVID-19 pandemic related data. He is aware of no other conflicts of interest or competing interests.
References
∼herve/Abdi-Bonferroni2007-pretty.pdf
2. AFN, Nexta channel accuses the Ministry of Health of the Republic of Belarus of publishing censored data on coronavirus (in Russian