[PDF] A systems biology approach to COVID-19 progression in a population

Abstract

A number of models in mathematical epidemiology have been developed to account for control measures such as vaccination or quarantine. However, COVID-19 has brought unprecedented social distancing measures, with a challenge on how to include these in a manner that can explain the data but avoid overfitting in parameter inference. We here develop a simple time-dependent model, where social distancing effects are introduced analogous to coarse-grained models of gene expression control in systems biology. We apply our approach to understand drastic differences in COVID-19 infection and fatality counts, observed between Hubei (Wuhan) and other Mainland China provinces. We find that these unintuitive data may be explained through an interplay of differences in transmissibility, effective protection, and detection efficiencies between Hubei and other provinces. More generally, our results demonstrate that regional differences may drastically shape infection outbursts. The obtained results demonstrate the applicability of our developed method to extract key infection parameters directly from publically available data so that it can be globally applied to outbreaks of COVID-19 in a number of countries. Overall, we show that applications of uncommon strategies, such as methods and approaches from molecular systems biology research to mathematical epidemiology, may significantly advance our understanding of COVID-19 and other infectious diseases.

Full PDF

COVID-19

PUZZLE IN C HINA : A SERENDIPITOUS INTERPLAY BETWEEN TRANSMISSIBILITY AND SOCIAL DISTANCING MEASURES

Magdalena Djordjevic, , Marko Djordjevic,

Igor Salom, Andjela Rodic, Dusan Zigic, Ognjen Milicevic, Bojana Ilic Institute of Physics Belgrade, University of Belgrade, Serbia. Computational Systems Biology Group, Faculty of Biology, University of Belgrade, Serbia. Department for Medical Statistics and Informatics, Faculty of Medicine, University of Belgrade, Serbia. * Correspondence: [email protected], [email protected] † Equally contributed to this work.

Abstract:

A number of models in mathematical epidemiology have been developed to account for control measures such as vaccination or quarantine. However, COVID-19 has brought unprecedented social distancing measures, with a challenge on how to include these in a manner that can explain the data, but avoid overfitting in parameter inference. We here develop a simple non-homologous model, where social distancing effects are introduced analogous to course grained models of gene expression control in systems biology. We apply our approach to understand drastic differences in COVID-19 infection and fatality counts (observed between Hubei (Wuhan) and other Mainland China provinces), which we explain through interplay of transmissibility and effective protection differences. That is, we obtain a negative feedback (also commonly encountered in systems biology) between transmissibility and effects of social distancing, with Hubei being an outlier with both large transmissibility and lower social distancing effects. While Case Fatality Rate is significantly larger for Hubei, we find that Infection Fatality Rates (IFRs) are much more uniform/consistent across all the provinces. Even for Wuhan, our inferred Infection Attack Rate is much below the herd immunity threshold, raising a major reemission risk. The obtained results demonstrate applicability of our developed method to extract key infection parameters directly from the data (publicly available for a number of countries), so that it can be applied globally to potential future outbreaks of COVID-19, or other infectious diseases. ntroduction

On January 31 st , The World Health Organization (WHO) classified COVID-19 outbreak as the Public Health Emergency of International concern [1], due to the epidemics in the People’s Republic of China (PRC) that originated in Wuhan (Hubei province) [2] . WHO classified the infection spread as a pandemic [3] on March 11 th , which followed large secondary outbreaks in Europe, while later US became a new hotspot [4]. The imposed restrictions in different countries ranged from mild to severe, with a significant impact on people’s lives, and likely notable future economic consequences [3,5]. After unprecedented containment and infection suppression measures, PRC announced the end of the epidemics on March 29 th [6]. Progression of the infection in PRC seems highly intriguing, as Hubei (with only 4% of PRC population) shows an order of magnitude larger number of confirmed infection cases (Fig. 1A), and two order of magnitude higher fatalities (Fig. 1B) compared to the total sum in all other Mainland China provinces. The epidemics was unfolding well before the Wuhan closure (with the reported symptom onset of the first patient on December 1 st th (the Chinese Lunar New Year) [7]. As a rough baseline, a modeling study [8] based on the Wuhan infection dynamics, estimated as many as 150,000 confirmed cases per day in Chongqing alone (due to its large population coupled with intense travel volume with Wuhan) – instead, the actual (reported) peak number for all Mainland China provinces outside Hubei was just 831.

Fig. 1. Infection and fatality counts for Hubei vs. all other provinces.

Number of ( A ) Confirmed infections, ( B ) Fatality cases. Zero on the horizontal axis corresponds to the time from which the data are taken (January 23 rd ), which also coincides with the Wuhan closure. Red circles correspond to the observed Hubei counts. Blue squares correspond to the sum of the number of counts for all other provinces. The figure illustrates a puzzling difference in the number of counts between Hubei alone, and the sum of all other mainland China provinces. agnitude of the subsequent outbursts in Europe and USA is comparably even more surprising. For example, Ireland, with the population size of 4.9 million (two orders of magnitude smaller than PRC population), as of June 15 th has almost two times more confirmed infection cases than all Mainland China except Hubei (~25.000 compared to ~13.000) and more than order of magnitude larger number of fatalities (~1.700 compared to 116). Such extreme disproportions have created a public media controversy [9,10], where e.g., “The New York Times” reported that the Central Intelligence Agency has been warning the USA administration that China is underreporting their coronavirus cases [11]. A political controversy also unfolded, where similar concerns have been publicly expressed by the highest ranking USA, France and UK officials [12], and subsequently strongly denied by PRC [13]. The controversy above underlines a challenge for computational modeling. The drastic differences may be a consequence of an interplay between the virus transmissibility (influenced by environmental and demographic factors) and the effectiveness of the protection measures. Both can significantly change between different provinces (more generally different countries/regions), and the model has to infer this from available data (commonly the number of infection counts, publicly available for a large number of countries/regions). While the measures such as vaccination or quarantine are well established in mathematical epidemiology[14-16], COVID-19 brought a challenge to account for previously unprecendated social distancing measures, taken by majority of countries, but which have been particularly drastic in China. Social distancing has been (up to now) accounted for by the direct changes in the transmissability term [17,18], which, however, corresponds to introducing a phenomenological dependence in otherwise mechanistic model (see also the section below). That is, to be included consistently, the social distancing should move individuals from one compartment to the other, as also standardly implemented for vaccination or quarantine. Moreover, while the control measures are commonly accounted for by homologous models [15] (with time independent terms) social distancing should depend on time, as it is introduced at a certain point in epidemics (in a more complicated case, it may also progress gradually). To account for a time dependent control measures we will here use analogies with modeling intracellular processes from systems biology. Both systems biology and mathematical epidemiology use compartmental models based on the mass action law (or sometimes more complex kinetics, like Michaelis-Menten law, taking saturations into account). Moreover, in systems biology, one often has to account for both growth (analogous to disease transmission) and strong regulation (analogous to control measures)[19]. To address the balance between model complexity and applicability to real ata, coarse-grained models of gene regulation (where this regulation is also often time-dependent, e.g. turned on by a regulatory signal), have been used with success [20-23]. For example, we previously used such models to explain experimental measurements (including those at single-cell level) [24-26], infer system design principles [27], and analyze system dynamical features [28]. We will here apply a similar approach, where a minimal mechanistic model will be constructed in terms of ability to explain the data with smallest number of parameters, so that relevant infection progression properties can be inferred without overfitting. In this study, we will therefore develop a method to analyze the puzzling differences in dynamics trajectories in Mainland China provinces, which will also be more generally applicable in terms of understanding regional differences in outburst dynamics. The unintuitive data from PRC may put strong constraints on the underlying infection progression parameters and allow understanding: 1. What interplay between the transmissibility and the suppression measure effects is responsible for the large difference in the count numbers between Hubei and the rest of PRC? 2.

What is the Infected Fatality Rate (IFR, the number of fatalities per total number of infected cases) in PRC? Case Fatality Rate (CFR, the number of fatalities per confirmed / detected cases) can be obtained directly from the data, but is highly sensitive to the testing coverage. IFR is a more fundamental mortality parameter, as it does not depend on the testing coverage, but is however much harder to determine, due to the unknown number of infected cases. 3. What is the infection attack rate (AR), i.e. a fraction of population that got infected (and presumably became resistant afterwards), in Hubei and other provinces? Estimating AR allows understanding the risks that the epidemic will reinitiate. As this study is completed, PRC announced that they will assess AR in Wuhan through serology tests on almost entire Wuhan population [29], which will allow a direct comparison with our predictions. Addressing these questions allows understanding of both different response policies, and inherent risks posed by the pandemics, and will enable future cross-country comparisons. The developed methodology will allow to readily analyze future outbreaks of COVID-19 and other infectious diseases, as it depends on inference from straightforward and publically available data.

Model and parameter inference

We developed a model that takes into account all main qualitative features of the infection progression under suppression measures. The model however remains simple enough to estimate the key dynamics parameters, individually for each of the provinces. We opt for a deterministic model, as it allows a robust and computationally less demanding parameter inference [30]. The case ounts are also well in a commonly considered deterministic range (>10), while the transmission process is not strongly non-linear (see below) – for a linear system, mean of the stochastic trajectories corresponds to the deterministic solution [31]. Classical SIR or SEIR models (or their modifications) were up to now used to predict or explain different features of COVID-19 infection dynamics [17,32-35]. We here utilize SEIR framework, due to COVID-19 latency period, which (due to a latency delay) shows significantly different dynamics compared to the infected population. We thus modify SEIR model, to take into account social distancing (lockdowns, school and university closures, infection induced behavioral changes), consistent with drastic non-pharmaceutical interventions applied in PRC. Quarantine by detecting (and subsequently containing) some fraction of the infected is also explicitly taken into account, which accounts for (presumably) large number of the infected that are not detected [5]:   / / dS dt I S N t S        (1) / / dE dt I S N E       (2) / dI dt E I I           (3) / dD dt I h D m D         (4) / dH dt h D   (5) / dF dt m D   (6) S, E, I, D, H, F, N are, respectively: susceptible, exposed, infected, active detected, healed (recovered), fatalities and the total population number. The parameter notation is the following: β - the infection rate in a fully susceptible population; α ( t ) - the time dependent protection rate, i.e., the rate with which the population moves from susceptible to protected category, quantifying the impact of the social protection measures. σ - inverse of the latency period; γ - inverse of the infectious period; δ - inverse of the period of the infected detection/diagnosis; ε - the detection efficiency quantifying the fraction of diagnosed infected population; h - the recovery rate; m - the mortality rate. In Eq. (1), the susceptible pool is depleted due to both the infection process, and the social distancing measures - moving the susceptible to the protected compartment. Note that we here introduce social distancing consistent with the SEIR compartmental mechanistic structure, so that susceptible move to the protected category with rate α ( t ) . The nonlinearity appears only through the “infection” ( β·S·I )/ N term. In Eq. (2), the latent population increases due to the infection events, and decreases due to a delayed transition to the infected with rate σ . In Eq. (3), the infected population is increased due to this transition, removed with rate γ , or detected (and isolated/quarantined) with rate ε·δ . Eq. 4) describes the dynamics of active detected cases, which increase due to detection of the infected, or decrease due to the recovery or fatalities. Eqs. (5) and (6) follow directly from Eq. (4). Note that the dynamics of the protected dP/dt = α ( t ) S , and the resistant dR/dt = γI compartments is not explicitly included in Eqs. (1) - (6), as it does not impact the observable variables ( D , H , F ). The corresponding conservation is S+E+I+D+H+F+P+R=N.

The main difficulty in parameter inference is that healing and mortality rates ( h and m ) are time-dependent, i.e. they, respectively, significantly increase and decrease with time – this can be directly obtained from Eqs. (5) and (6). Consequently, instead of the active detected cases D (which depend on time-dependent h and m rates), we propose as the main observable the total number of detected (diagnosed) cases D t =D+H+F , i.e. the sum of active cases, recovered cases and fatalities: t dD dt I     , (7) where this equation substitutes Eqs. (4) - (6), conveniently removing the rates h and m from our analysis. The protection rate α ( t ) is taken as 0 before t (the onset of social measures), and constant value (α) afterwards. The social distancing is therefore introduced analogously to regulation by protein or transcript decay in systems biology, with onset at t . Instead of the step function used here, a more gradual switch is often introduced through a Hill function, but this would introduce an additional parameter (Hill constant), which is here not necessary due to a very rapid introduction of social distancing measures (i.e. a large value of Hill constant is assumed, where in this limit Hill function becomes unit step function). Introduction of α( t ) in this form also allows to more easily conceptualize parameter inference. Before t , social distancing does not take effect, and the measures introduced at t will take effect on D t only ~10 days later, as this is about the time between infection and detection/diagnosis [2]. Consequently, for the first t +10 days. D t curve reflects transmission without suppression measures, characterized by R (see below); social distancing takes effect afterwards, and α can be inferred from the subsequent suppressed growth of D t . Consequently, by combining Eq. (1) before t (when α ( t )=0) with Eq. (3), we obtain the basic reproduction rate in the absence of applied social distancing measures ( α ), R = β/(γ+ε·δ). Notation R corresponds to a definition of R in a completely susceptible population in the absence of social distancing [36,37]. R , which corresponds to the definition of R that does not require the absence of social distancing (only non-resistant population) can be obtained from     exp R S t N t dt       [18]. From this, R = β/(γ+ε·δ+  ) [32] can be recovered, under the assumption of sufficiently strong α and small t (immediately effective measures), so that from Eq. 1) S ( t ) ~ exp (  αt); R is consequently always smaller than R . We use R ,free in the further analysis, in line with a goal of our work, to separate the transmissibility influenced by biological, climate and sociodemographic qualities, from social distancing interventions. In addition to R free , α and t , two initial conditions ( I and E ), and the detection efficiency ε , are unknown and may differ between the provinces. σ and γ characterize the fundamental infection processes, not expected to change from one province to another, and have consistent values between independent studies ( γ = 0.4 days -1 , σ = 0.2 days -1 ) [38,39]. How different properties of D t curve determine these unknown parameters? Early in the infection, almost the entire population is susceptible ( S ≈ N ), so Eqs. (2) and (3) become linear, and decoupled from the rest of the system. This sets the ratio of I to E , through the eigenvector components with the dominant (positive) Jacobian eigenvalue, where this eigenvalue, corresponding to the initial slope of log ( D t ) curve, sets the value of β. From Eq. (7) one can see that the product of I and ε · δ is set by dD t / dt at the initial time ( t =0). Later dynamics of the D t curve is determined by the combination t α = t + 1/ α (which we denote as protection time), setting the time at which ~1/2 of the population moves to the protected category. We also numerically checked this, i.e. t can be lowered at the expense of increasing 1/ α , without affecting the fit quality. For the remaining independent parameter ( I ), D t curve saturates when the infected are depleted, so the saturation time depends on the scale (height) of the infected curve, which sets I . Therefore, as the number of characteristic dynamics features is at least equal to the number of fit parameters, over-fitting is not expected. For few provinces, we however observed that I can be decreased compared to the best fit value, without noticeably affecting the fit quality. For these provinces, we chose the lowest I value that still leads to a comparably good fit. This allows to obtain the most conservative (i.e., as high as consistent with the data) IFR estimate, as the reported number of fatalities for provinces other than Hubei are surprisingly low. Model, parameter inference, and uncertainties, are estimated separately for each province. However, within a given province, demographic, special, or population activity (network effects) heterogeneities [14,40], or seasonality effects [41], are not taken into account. These are potentially important, particularly for projections (longer term predictions of infection dynamics under different scenarios), and can be readily included in our model. Such extensions would however complicate parameter inference, due to increase in parameter number, as this may either lead to overfitting, or require special/additional data that may be available only for a limited number of countries/regions (which would limit generality of our proposed method). More complex model structure may also obscure a straightforward relationship between the model parameters and distinct dynamical features of the confirmed case count curve analyzed above. In this work we therefore employ the odel structure and parameter inference introduced above on widely available case count data, as a proof of the principle for generality of our proposed approach, while the inclusion of additional effects is left for future work. We further denote our model as εα- SEIR, where εα denotes that it takes into account not only the basic (unsuppressed) infection dynamics, but also the effective social distancing (through α ) and joint quarantine/detection processes (through ε ). εα- SEIR was numerically solved by Runge-Kutta method [42] for each parameter combinations. Parameter values were inferred by exhaustive search over a wide parameter range, to avoid reaching local minimum of the objective function ( R ). To infer the unknown parameters, we fit (by minimizing R ) εα- SEIR to the observed total number of detected D t for each province. As an alternative to exhaustive search, some of many optimization techniques used in epidemics modelling, such as Markov chain Monte Carlo (MCMC) approach, can be used instead [16,41] – exhaustive search is however straightforward, guarantees that global minimum were estimated through Monte-Carlo simulations [43], individually for each province with assumption that count numbers follow Poisson distribution. Monte-Carlo simulations were found as the most reliable estimate of the fit parameter uncertainties for non-linear fit [44], where in our case a closed form expression is moreover not available. Monte-Carlo simulations also serve as an independent check for overfitting, as in that case datapoint perturbations would lead to large parameter uncertainties. P values for extracted parameter differences between provinces are estimated by t -test. Results

We used εα-

SEIR with the parameter inference described above, to analyze all Mainland China provinces, with the exception of Tibet, where only one COVID-19 case was reported. Parameters were estimated separately for each of the 30 provinces. In Fig. 2A and B, we show that our model can robustly explain the observed D t , in the cases of large outburst (Hubei on Fig. 2A), as well as for all other provinces, where D t is in the range from intermediate (e.g., Guangdong) to low (e.g., Inner Mongolia). Provinces in Fig. 2B were selected to cover the entire range of observed D t (from lower to higher counts), while comparably good fits were obtained for other provinces, which were all included in the further analysis. Our method is also robust with respect to data perturbations (which might be frequent), e.g., in the case of Hubei (Wuhan), a large number of counts was added on Feb. 12 th , based on clinical diagnosis (CT scan) [2], which is apparent as a discontinuity in observed D t in Fig. 2A. The model however interpolates this discontinuity, finding a reasonable description of the overall data. e back propagated the dynamics inferred for Hubei, to estimate that January 5 th (±4 days) was the onset of the infection exponential growth in the population (not to be confused with the appearance of first infections, which likely happened in December [2]). This agrees well with [2] (cf. Fig. 3A), which tracked cases according to their symptom onset (shifted for ~12 days with respect to detection/diagnosis, cf. Fig. 3B), and coincides with WHO reports on social media that there is a cluster of pneumonia cases – with no deaths – in Wuhan [45]. Since our analysis does not directly use any information prior to January 23 rd , this agreement provides confidence in our I (see above) estimate. Fig. 2. Model predictions: comparison with data and key parameter estimates.

Predictions (compared to data) of confirmed infection counts and for ( A ) Hubei, ( B ) other Mainland China provinces. The observed counts are shown by dots, and our model predictions by dashed curves. Names of the provinces are indicated in the legend, with provinces selected to cover the full range of the observed total detected counts. The distribution with respect to provinces of ( C ) the basic reproduction number, in the absence of social distances, R free , ( D ) the protection time t α . ( E ) Case Fatality Rate. ( F ) Infected Fatality Rate. The values for Hubei are indicated by the red bars. Key parameters inferred from our analysis are summarized in Fig. 2C-F, with individual results and errors for all the provinces shown in the Supplement table S1. Fig. 2C shows the distribution of R ,free . Note that R ,free might depend on demographic (population density, etc.) and climate factors (temperature, humidity…), which are not controllable, but is unrelated to the applied social distancing measures (see above). It is known that R value can strongly depend on the model, e.g. the number of introduced compartments[16]; accordingly, a wide range of R values were reported for China in the literature [46,47]. Consequently, a clear advantage of our study is that parameters for all China provinces were determined from the same model and data set, which allows direct omparisons. Our obtained average R free for provinces outside of Hubei is 5.3±0.3, in a reasonable agreement with a recent estimate (≈5.7) [47]. Furthermore, we observe that R free for Hubei is a far outlier with the value of 8.2±0.4, which is notably larger than for other provinces with p~10 -11 . This then strongly suggests that demographic and climate factors that determine R free , played a decisive role in a large outburst in Hubei vs. other provinces, which we further address below. Distribution of protection time t α for the provinces is shown in Fig. 2D, with the value for Hubei indicated in red. The mean for the other provinces is 6.6±0.2 days. That is, we observe that the suppression measures were efficiently implemented, with ~½ of the population moving to the protected category within a week from Wuhan closure. The protection time for Hubei of 8.3±0.2 days was longer, which is statistically significant at p~10 -11 level. The estimated less efficient protection in the case of Hubei may also be an important contributing factor in the surprising difference in Hubei vs. other provinces, which we further investigate below. CFR distribution, based on the fatality numbers reported for Hubei and other provinces is shown in Fig. 2E. These numbers are not based on the model predictions, i.e. can be straightforwardly obtained by dividing the total number of fatalities with the total number of detected cases. CFR for other provinces with a mean of 1.2±0.4%, is significantly smaller compared to CFR for Hubei, which was 4.6% before the correction on April 17th, and 6.5% after the correction (with 1290 fatalities added to Wuhan). This large difference in CFR between Hubei and other provinces further accentuates the differences noted in Fig. 1. IFR distribution, which provides a much less biased measure of the infection mortality, is shown in Fig. 2F. In distinction to CFR, estimated IFR shows a much smaller difference between Hubei (0.15%±0.09%) and other provinces (0.056±0.007%). Therefore, while Hubei is a clear outlier with respect to CFR, we observe similar IFR values for all PRC provinces, where few provinces have even higher IFR than Hubei. The ratio of IFR to CFR equals the fraction of all infected that got detected ( detection coverage ). We estimate that the mean detection coverage for all provinces except Hubei is higher than detection coverage for Hubei (4.5±0.9% vs. 2±2%). This difference is responsible for decrease by a factor of two from CFR to IFR for Hubei, compared to the other provinces, and consequently for more uniform mortality estimates at IFR level. Xinjiang has the highest IFR of 0.25±0.09%, so that Hubei is not an outlier anymore. In Fig. 3A, two key infection progression parameters are plotted against each other: protection time t α vs. transmissibility R free . Unexpectedly, there is a high negative correlation, with Pearson correlation coefficient R =  -5 , where these two are a priori unrelated (see above). Even for R , e , stronger social distancing measures would decrease t (see above), leading to a tendency to be positively correlated with t α , oppositely from the strong negative correlation in Fig. 3A. Therefore, higher transmissibility is genuinely related with a shorter protection time (larger effect of the suppression measures). Intuitively, this could be understood as a negative feedback loop, commonly observed in systems biology [22,48], where larger R free leads to steeper initial growth in the infected numbers, which may elicit stronger measures. Fig. 3. Interplay of transmissibility and effective social distancing. ( A ) The correlation plot of t α vs. R free for all provinces, where the point corresponding to Hubei is marked in red. ( B ) The effect (on the Hubei dynamics of infected and latent cases) of reducing R free and t α to the mean values of other PRC provinces. Both the unperturbed Hubei dynamics and the sum of infected and latent cases for all other provinces are included as a references. The two main properties of the Hubei outburst are therefore higher R free and t α compared to other provinces. In Fig. 3B, we investigate how these two properties separately affect the Wuhan outburst for latent and infected cases, where unperturbed Hubei dynamics is shown by the red full curve. We first reduce only R free from the Hubei value, to the mean value for all other provinces (the dash-dotted green curve). We see that this reduction substantially lowers the peak of the curve, though it still remains wide. Next, instead of decreasing R ,free , we decrease the protection t α to the mean value for all other provinces (dashed orange curve). While reducing t α also significantly lowers the peak of the curve, its main effect is in narrowing the curve, i.e. reducing the outburst time. Finally, when R .free and t α are jointly reduced, we obtain the (dotted purple) curve that is both significantly lower and narrower than the original Hubei progression. This curve comes quite close to the curve that presents the sum of all other provinces (full blue curve) - the dotted curve remains somewhat above this sum, mainly because the initial number of latent and infected cases is somewhat higher for Hubei compared to the sum of all other provinces. This synergy between the transmissibility and the suppression measures, will be further discussed below. iscussion We here formulated a new method for analyzing infection transmission dynamics, under strong suppression measures. In addition to its main contribution of taking into account social distancing in a manner suitable for parameter inference, the model also takes into account quarantine and diagnosis of only a fraction of the infected pool, and goes around the problem of time dependent recovery and mortality rates. The non-homologous introduction of the social distancing allows to clearly separate the infection transmission in the absence of social distancing, for the subsequent suppressed growth. The model is introduced with the minimal number of parameters in the case of China, but can be straightforwardly generalized to the case of more gradually introducing the social distancing measures. We applied our method to the infection progression in PRC, where it allowed the comprehensive parameter inference for all the provinces, consistently from the same model. Specifically, we analyzed the puzzling difference in the number of counts observed between Hubei and other Mainland China provinces, which has in part created first public media, and then high-level political, controversy (see Introduction). A naive explanation of the puzzle might be that closing Wuhan prevented the influx of the infected to other Chinese provinces. It was however previously shown that, at the start of the travel ban, most Chinese cities had already received many infected travelers [49,50], which is also consistent with the fact that the initial number of detected (and our estimates of the infected) cases, were similar in Wuhan compared to the sum in all other provinces. We here obtained that Hubei was an outlier with respect to two key infection progression parameters: having significantly larger R ,free , and a longer time needed to move a sizable fraction of the population from susceptible to protected category. It is interesting that the initial origin of the epidemic (Hubei) has also a significantly larger R ,free compared to all other provinces, as in principle these two do not have to be related. While stricter measures were formally introduced in Hubei, the initial phase of the outburst put a large strain on the system, arguably leading to less effective measures compared to other provinces [13]. The fact that the initial epidemic in Hubei was not followed by similar outbursts in the rest of PRC may be understood as a serendipitous interplay of the two factors noted above. While both smaller R free and lower half-protection time (more efficient measures) significantly suppress the infection curve, their effect is also qualitatively different. While lowering R free more significantly suppresses the peak, decreasing the half-protection time significantly reduces the outburst duration. Consequently, the synergy of these two effects appears to lead to drastically suppressed infection dynamics in other Mainland China provinces compared to Hubei. The number of detected diagnosed) cases in entire PRC is therefore, though unintuitive, well consistent with the model, and is explainable by a seemingly reasonable combination of circumstances. On April 17 th Wuhan fatality count was increased for 50%, which is taken into account by our analysis, but which makes the difference in fatalities between Hubei and other provinces even more drastic. As, at least, a partial explanation for the fatality rates, we here find a significant change in relative differences between CFR and IFR. While CFR is more than five times larger for Hubei, the numbers appear much more self-consistent at the level of IFR, with Hubei comparable to other PRC provinces. We show that this is due to smaller testing coverage in Hubei, which appears to be consistent with known testing capacities (10000 tests per day in Hubei, vs. 40000 tests per day in all other provinces with significantly smaller total infection cases) [51]. Ethnicity might also contribute to IFRs, as we inferred the highest IFR (surpassing that of Hubei, even with the most recent revision) for Xinjiang province. Uyghur majority in Xinjiang have been genetically linked with mixed European/West Asian and East Asian ancestry [52], which in that respect might better reflect apparently larger mortality in European and US hotspots; these observations are in line with our large scale analysis of current COVID hotspots (in preparation). Our analysis also indicates a very low overall infection attack rate (AR). That is, we obtained that a fraction of PRC population that got infected (and possibly resistant), is just 5±3% for Hubei, and about two orders of magnitude lower for other provinces. The obtained low AR (i.e. the associated low fraction of presumably resistant population), is well below both classical [14] and recently recalculated (by taking into account age and activity structured population) heard immunity level [40]. This may be concerning in a longer term, since what really quelled the infection was rapidly moving the population from susceptible to protected category through social distancing measures. These measures forced the infection curve to an essentially unstable fixed point, as stochastic simulations showed that as few as 5 individuals are sufficient for the infection onset [53].

Summary and outlook

In this study, we extracted key infection parameters from the data that are readily available and publicly accessible (both for PRC and a number of other countries), so that, in a nutshell, our approach is of a wide applicability. To our best knowledge, such parameters (necessary to assess any future COVID-19 risks), were not extracted by other computational approaches. Overall, our study infers that COVID-19 in PRC has been a highly contagious disease, with transmissibility significantly larger than one for all the provinces. Moreover, COVID-19 spreads in an almost completely naive population, even after the first infection wave (AR is much below the heard immunity level, even for Hubei). We showed that unintuitive dissimilarity in the infection rogression for Hubei vs. other Mainland China provinces is consistent with our model, which demonstrates that regional differences may drastically shape the infection outbursts. This also shows that comparisons in terms of confirmed case or fatality counts (even when normalized for population size), between COVID-19 and other infectious diseases, or between different regions for COVID-19, are not feasible, and that parameter inference from quantitative models (individually for different affected regions) is necessary. While impact of intervention measures is clear (a significant decrease in AR, including corresponding decrease in fatalities), the exact demographic and climate factors that determine severity of outbursts are likely complex, and remain as a significant open research problem.

Acknowledgmentѕ:

MagD, IS, BI and DZ thank European Research Council for their support. MarD acknowledges exchange of ideas through Shanghai-Islamabad-Belgrade joint innovation center on Antibacterial Resistance (Grant No. 19430750600), “One Belt And One Road” International Cooperation Projects.

Funding:

Serbian Ministry of Education, Science and Technological Development.

Autor Contributions:

Conceptualization: MarD and MagD; Formal Analysis and Investigation: all authors; Visualization: all authors; Writing – original draft: MarD and MagD; Writing – review and editing: all authors.

Conflict of interest:

The authors declare no clonflict of interest.