Exploring the space-time pattern of log-transformed infectious count of COVID-19: a clustering-segmented autoregressive sigmoid model
EExploring the space-time pattern of log-transformedinfectious count of COVID-19: a clustering-segmentedautoregressive sigmoid model
Xiaoping Shi ∗ Meiqian Chen and Yucheng Dong † Abstract
At the end of April 20, 2020, there were only a few new COVID-19 cases remain-ing in China, whereas the rest of the world had shown increases in the number of newcases. It is of extreme importance to develop an efficient statistical model of COVID-19 spread, which could help in the global fight against the virus. We propose aclustering-segmented autoregressive sigmoid (CSAS) model to explore the space-timepattern of the log-transformed infectious count. Four key characteristics are includedin this CSAS model, including unknown clusters, change points, stretched S-curves,and autoregressive terms, in order to understand how this outbreak is spreading intime and in space, to understand how the spread is affected by epidemic controlstrategies, and to apply the model to updated data from an extended period of time.We propose a nonparametric graph-based clustering method for discovering dissim-ilarity of the curve time series in space, which is justified with theoretical supportto demonstrate how the model works under mild and easily verified conditions. Wepropose a very strict purity score that penalizes overestimation of clusters. Simula-tions show that our nonparametric graph-based clustering method is faster and moreaccurate than the parametric clustering method regardless of the size of data sets.We provide a Bayesian information criterion (BIC) to identify multiple change pointsand calculate a confidence interval for a mean response. By applying the CSAS modelto the collected data, we can explain the differences between prevention and controlpolicies in China and selected countries. ∗ Department of Mathematics and Statistics, Thompson Rivers University, email: [email protected] † Center for Network Big Data and Decision-Making, Business School, Sichuan University, email: [email protected] a r X i v : . [ s t a t . M E ] F e b Introduction
During the COVID-19 outbreak, multiple complex factors resulted in the space-timepattern of spread. Fig. 1 shows the log-transformed infectious counts in each region inChina, and in 33 selected countries at the end of April 20, 2020.From Fig. 1, we can see two main characteristics of the spread: (i) the spread of COVID-19 has a space-time characteristic determined by different intervention policies, incompleteinformation, geographical locations, transport, climate, and so on; (ii) along the time, thelog-transformed infectious counts presented different sigmoid (stretched S-shaped) curves.This phenomenon often happens in the life cycles of plants, animals, and viruses, whichcan rise and fall periodically. In each cycle, the sigmoid curve experiences three phases:slow rising, sharp rising, and slow falling.Modeling the spread of COVID-19 in many regions over a long period of time is provento be challenging. That is because many regions may not share the same spread patternand different regions may exhibit various intervention policies that may cause instabilityin the models. The model for each region may have a large degree of noise, but a commoncluster of all regions could have less noise by the law of large numbers. Thus, clustering isof importance to increase model fit. We may have to cluster all regions, even if the numberof clusters is unknown. In addition, we should allow the model to incorporate unknownchange points to further enhance the fitting performance. Ignoring the existence of changepoints may lead to poor model fitting and misleading model interpretation (Shi, Wang,Wei & Wu, 2016). Furthermore, it is often necessary to apply the model to updated datafrom an extended period of time. In the extended period, old clusters need to be updatedand new change points may occur. Models with incorporated clusters and change pointsshould be flexible and adaptive to the new data. In the next step, we shall consider the2onlinear characteristics of the models.Logarithmic transformation is often used for transforming count data, which includeszero values (Jin et al. , 2020) and grows exponentially over time. The simplest for-mula for exponential growth of a function y at the growth rate r , as time t goes on, is y ( t ) = y (0)(1 + r ) t , which satisfies the linear differential equation dy ( t ) dt = log(1 + r ) y ( t ). Anonlinear variation of this differential equation may lead y ( t ) to a sigmoid function. Forexample, the solution of a nonlinear differential equation dy ( t ) dt = log(1 + r ) y ( t ) − y ( t ) isthe logistic function (Murray, p.308 , 1989; Liu & Stechlinski, p.84 , 2017). The expo-nential growth model has shown numerous applications in the modeling and controlling ofcomplex systems. For example, the number of cells in a culture will increase exponentiallyuntil an essential nutrient is exhausted. A virus, for example SARS or COVID-19, has beenfound to spread exponentially (Katul et al. , 2020). The speed of spread slows down whenan artificial immunization becomes available or intervention policies take effect. Otherapplications of the exponential growth model can be found in Physics (e.g., radioactivedecay), Economics (e.g., a country’s gross domestic product), Finance (e.g., investments),Computer science (e.g., computing growth and internet phenomena), and so on.When systems have short-term memories and become more complex, it is extremelydifficult to find a differential equation to describe the growth curve. In contrast, we mayadd some autoregressive terms in a regression function to adapt to the complex system.Kowsar et al. (2017) shows that an autoregressive logistic model was more accurate thana logistic model when it comes to predicting the behaviors of complex biological systems.The reason is that the added autoregressive terms, which behave like short-term memory,can make an appropriate adjustment to better fit the complex system. In the same spirit,we propose the clustering-segmented autoregressive sigmoid (CSAS) model with four keycharacteristics including unknown clusters, change points, stretched S-shaped curves, and3utoregressive terms. With the help of the CSAS model, we expect to understand howan outbreak is spreading in time and in space, to understand how the spread is affectedby epidemic control strategies, and to apply the model to updated data from an extendedperiod of time.To identify this CSAS model, we first identify unknown clusters. There are many popu-lar methods, such as K-means (Wang & Hartiganm , 1979) (implemented in the R function kmeans ), Expectation-Maximization clustering for Gaussian Mixture Models (GMM-EM)(Akaho , 1995) (implemented in the R package mclust ), Density-Based Spatial Clusteringof Applications with Noise (DBSCAN) (Ester et al., 1996) (implemented in the R function fpc::dbscan ), and Hierarchical clustering (Murtagh & Legendre , 2014) (implemented in the R function hclust ). Except for GMM-EM, which can be considered to be parametric, allother methods need to predetermine the number of clusters or distance related parameters.To compare the dissimilarity of the curve time series, we need a nonparametric method thatdoes not require predetermined parameters. Then, we can separate different regions fromChina and the selected 33 countries into clusters that share common patterns, segment thecurve time series, and provide accurate fittings.Our contributions include the following: (1) we propose the CSAS model to help un-derstand how an outbreak is spreading in time and in space, to understand how the spreadis affected by epidemic control strategies, and to apply the model to updated data from anextended period of time; (2) we provide a nonparametric graph-based clustering methodwith theoretical support, which furthermore proposes a very strict purity score that penal-izes the overestimation of clusters. Simulations show that our method is fast and efficientfor different sizes of data sets; (3) we give practical methods for segmentation and providea confidence interval estimation for mean response; (4) we analyze the COVID-19 data inregions in China and selected countries, and explain the differences among the epidemic4revention and control policies. We assume the clustering-segmented autoregressive sigmoid (CSAS) model: Z i,t = M i (cid:88) m =1 (cid:26) β ( m )1 ,i + β ( m )2 ,i Φ( β ( m )3 ,i + β ( m )4 ,i t )+ p (cid:88) q =1 β ( m ) q +4 ,i Z i,t − q + ε ( m ) i,t (cid:27) I ( τ ( m − i < t ≤ τ ( m ) i ) , (1)where Z i,t = log(1 + Y i,t ); Y i,t is the number of confirmed cases for the i th (1 ≤ i ≤ N )cluster and time t ∈ [1 , T ]; i = δ ( j ) for j th region with 1 ≤ j ≤ K ; Z i, − q = 0 for q = 1 , . . . , p ; I ( A ) is an indicator function taking 1 if A is true, 0 otherwise; τ (0) i =0, τ ( M i ) i = T , τ ( m ) i for M i > ≤ m ≤ M i − i th cluster; Φ( x ) = √ π (cid:82) x −∞ e − u / du is a cumulative distribution function (CDF) ofthe standard normal distribution representing the sigmoid curve; β ( m )1 ,i ’s and β ( m )2 ,i ’s arestretch location and scalar parameters, respectively; β ( m )3 ,i ’s and β ( m )4 ,i ’s are linear regressioncoefficients within the sigmoid curves; β ( m ) q +4 ,i ’s are autoregressive regression coefficients; ε ( m ) i,t ’s are independent random errors with a mean of zero and constant variance of ( σ ( m ) i ) .The CSAS model has four key characteristics: (1) it is implemented with unknown N different clusters among K regions. Due to the epidemic mechanism, human mobility andcontrol strategy, the spread of epidemics displays a spatial propagation. We will proposea nonparametric method to cluster the regional data by applying the characteristic ofsigmoid curve. This method does not introduce any factors and hence can be considerednonparametric; (2) the multiple S-shaped curves are described by change points. The5hange points τ ( m ) i for 1 ≤ m ≤ M i − i ).This is because different intervention policy releases such as lockdown, maintaining socialdistance, cancelling large events, closing schools, and so on, result in different segmentedsigmoid curves among unknown clusters; (3) the regression function is mainly determinedby the stretched S-curve β ( m )1 ,i + β ( m )2 ,i Φ( β ( m )3 ,i + β ( m )4 ,i t ) and it allows a slight adjustmentthrough the autoregressive terms, (cid:80) pq =1 β ( m ) q +4 ,i Z i,t − q which can be considered as a short-term memory for the response variable; and (4) after specifying both clusters and changepoints, we use the corresponding data sets to answer three questions. How do we estimatethe regression coefficients? Are those coefficients significantly different from zero? How dowe give a confidence interval for the mean response?We give the following five remarks for the logarithmic transformation, the CDF functionΦ( x ), and the random error in the CSAS model. Remark 1. log(1 + x ) transformation is often used for transforming count data thatinclude zero values (Jin et al. , 2020). When Y i,t is much smaller or larger than 1 inmagnitude, log(1 + Y i,t ) ≈ Y i,t or log(1 + Y i,t ) ≈ log Y i,t can be used. This transformationlog(1 + Y i,t ) of Y i,t , which may grow exponentially over time, has two patterns, slow risesand slow falls, and hence can often be modeled by a stretched S-curve. Remark 2.
The nonlinear function Φ( x ) is used to describe the stretched S-shapedcurve. Other similar functions may be considered. For example, if we apply the approxi-mation of Φ( x ) by (Tocher , 1963), Φ( x ) ≈ e − √ /πx for all x , then we have β ( m )1 ,i + β ( m )2 ,i Φ( β ( m )3 ,i + β ( m )4 ,i t ) ≈ β ( m )1 ,i + β ( m )2 ,i e − √ /π ( β ( m )3 ,i + β ( m )4 ,i t ) , which is an extended logistic function of time t and is commonly used in logistic regression. Remark 3.
Mathematical modelling may provide an understanding of spread mecha-6isms. The original mathematical model was proposed and solved by Daniel Bernoulli in1760; see (Dietza & Heesterbeek , 2002). Recent developments and applications are mainlyfocused on the susceptible-infectious-recovered (SIR) model and its variants. The logisticfunction derived from a nonlinear differential equation may explain why we should applythe sigmoid curve to model the spread of disease (Murray, p.308 , 1989; Liu & Stechlinski,p.84 , 2017; Katul et al. , 2020).
Remark 4.
The model for each region may have a large degree of noise, but a commoncluster of all regions could have less noise because of the law of large numbers. So, weshould consider an individual cluster in the CSAS model. From Fig. 7 C (Cluster 3),it can be seen that the noise is significantly smaller than that of Fig. 7 A (ProvinceNM) or B (Province TJ). In addition, we should allow the model to incorporate unknownchange points to further enhance the fitting performance. Fig. 8 F suggests that theresiduals from the CSAS model without change points exhibit a clear trend. In contrast,the variance of noises in each segment should be constant; see Fig. 8 A-C. Models withincorporated clusters and change points should be flexible and adaptive to the new data;see the continued good performance of the CSAS model in the extended two-month datain the “Discussion and Conclusions” section.
Remark 5.
With both autoregressive terms and CDF function in the CSAS model,the variance of random errors can be considered to be constant across segments. In Fig. 8A, B and C, the model residuals are well-behaved across segments. The residuals in Fig.8 D (the autoregressive terms are removed) and Fig. 8 E (the CDF function is removedas shown in the Long-Short-Term-Memory model in (Yang et al. , 2020)) suggest a timetrend. This finding agrees with the fact that an autoregressive logistic model was moreaccurate than a logistic model as shown in (Kowsar et al. , 2017).7 .1 Clustering
To find clusters of all K regions, we consider the T dimensional series { Z j , ≤ j ≤ K } ,where its t th component is Z j,t for 1 ≤ t ≤ T , and define the Euclidean distance between Z j and Z j as follows: d ( Z j , Z j ) = (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 ( Z j ,t − Z j ,t ) . (2)We construct an approximate shortest Hamiltonian path (SHP) based on a heuristicKruska algorithm (HKA), which was proposed by (Biswas et al. , 2014) for a two-sampletest. This was successfully applied into change point detection in (Shi, Wu and Rao, 2017,2018). The HKA first sorts all edges in order of increasing distance defined in (2). Firstand foremost, the edge with a minimum distance must be selected. Then subsequent edgesare chosen one-by-one from the remaining list of sorted edges according to the requirementof a path. If this current edge does not form a cycle with the previously selected edges, andevery vertex connected by this current edge, or previously selected edges, has a degree notgreater than 2, then this current edge must be selected. The HKA terminates when K − K − P = ( j , . . . , j K ) . The next step is to find clusters based on P . We define the edge set of P as E ( P ), and consider a subset of E ( P ): E ∗ ( P , θ ) = { ( j s , j s +1 ) for s = 1 , . . . , K − j s , j s +1 ) ∈ E ( P ) and d ( Z j s , Z j s +1 ) ≤ θ (cid:9) . We create a graph from the edge set E ∗ ( P , θ ) and define the connected componentsof this graph as a set of clusters A = {A (cid:96) , ≤ (cid:96) ≤ L } . We note that the R function8 omponents in the R package igraph (Csardi & Nepusz , 2006) can calculate the connectedcomponents given the edge set.Suppose that there is a set of classes C = {C i , ≤ i ≤ N } , where C i = { j | δ ( j ) = i } . Weneed to measure how close the set of clusters A is to the predetermined set of classes C .Purity (Manning, Raghavan and Sch¨utze , 2008) is a measure of this extent defined as: S ( A , C ) = 1 K L (cid:88) (cid:96) =1 max ≤ i ≤ N |A (cid:96) ∩ C i | . (4)In most cases, a bad clustering has a purity value close to 0 and a perfect clustering hasa purity of 1. However, this measure may not give a realistic evaluation for overestimatedclusters. For example, a purity score of 1 could happen by putting A (cid:96) = (cid:96) , L = K and N = 1. In this case, one whole class is mis-clustered to K separate clusters with a purityscore of 1.We propose a very strict purity score to penalize overestimated clusters: S ∗ ( A , C ) = 1 K L (cid:88) (cid:96) =1 max ≤ i ≤ N |A (cid:96) ∩ C i | − | L − N | max( L, N ) . (5)Users may add additional weight on the second penalty term according to different require-ments. Based on this very strict purity evaluation, a very bad clustering would have apurity value close to -1, and a perfect clustering will still have a purity of 1. If A (cid:96) = (cid:96) , L = K and N = 1, then S ∗ ( A , C ) = 1 /L , decreasing as L increases. Overestimated clustersmay have a very strict purity score close to 0. A natural question comes: does our cluster-ing have a very strict purity score of 1? To answer this question, we make the followingassumptions. Assumption 1.
Let ε i,t be Z i,t − E ( Z i,t ). Assume that ε i,t is independent and identically9istributed (i.i.d.) satisfying E ( ε i,t ) < ∞ for all 1 ≤ j ≤ K and 1 ≤ t ≤ T . Assumption 2.
There exists a η ( T ), satisfying that η ( T ) > E ( ε , ), K << { η ( T ) − E ( ε , ) } T and min j (cid:54) = j ,δ ( j ) (cid:54) = δ ( j ) d ( E ( Z j ) , E ( Z j )) > η ( T ) . In Assumption 1, if ε i,t is dependent, then we require the upper bound of E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t =1 ( ε j ,t − ε j ,t ) − E ( ε j ,t ) − E ( ε j ,t ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) << T η ( T ) /K. In Assumption 2, we require K to be quite small compared to T . Note that d ( E ( Z j ) , E ( Z j ))is easy to evaluate because d ( E ( Z j ) , E ( Z j )) = (cid:113) T (cid:80) Tt =1 { E ( Z j ,t ) − E ( Z j ,t ) } . We havethe following Theorem 1.
Theorem 1.
Suppose Assumptions 1-2 hold. Choose θ = η ( T ) as in (3). As T → ∞ ,we have P { S ∗ ( A , C ) = 1 } → . Proof of Theorem 1. We first prove that P { max j (cid:54) = j ,δ ( j )= δ ( j ) (cid:113) T (cid:80) Tt =1 ( Z j ,t − Z j ,t ) >η ( T ) } → . Because δ ( j ) = δ ( j ), Z j ,t − Z j ,t = ε j ,t − ε j ,t . Then we have P max j (cid:54) = j ,δ ( j )= δ ( j ) (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 ( Z j ,t − Z j ,t ) > η ( T ) ≤ (cid:88) j (cid:54) = j P (cid:40) T (cid:88) t =1 ( ε j ,t − ε j ,t ) > η ( T ) T (cid:41) = (cid:88) j (cid:54) = j P (cid:34) T (cid:88) t =1 ( ε j ,t − ε j ,t ) − E { ( ε j ,t − ε j ,t ) } > { η ( T ) − E ( ε , ) } T (cid:35) ≤ (cid:88) j (cid:54) = j P (cid:34) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t =1 ( ε j ,t − ε j ,t ) − E { ( ε j ,t − ε j ,t ) } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) { η ( T ) − E ( ε , ) } T (cid:35) ≤ cK { η ( T ) − E ( ε , ) } T , (6)where c is a constant not related to either K or T . By Assumption 2, this upper boundconverges to zero. Next, we prove that P { min j (cid:54) = j ,δ ( j ) (cid:54) = δ ( j ) (cid:113) T (cid:80) Tt =1 ( Z j ,t − Z j ,t ) ≤ η ( T ) } → . By the Minkowski inequality and Assumption 2, P min j (cid:54) = j ,δ ( j ) (cid:54) = δ ( j ) (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 ( Z j ,t − Z j ,t ) ≤ η ( T ) ≤ P (cid:34) min j (cid:54) = j ,δ ( j ) (cid:54) = δ ( j ) (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 { E ( Z j ,t ) − E ( Z j ,t ) } − (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 { ε j ,t ) − ε j ,t ) } ≤ η ( T ) (cid:35) ≤ P (cid:34) min j (cid:54) = j ,δ ( j ) (cid:54) = δ ( j ) η ( T ) − (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 { ε j ,t ) − ε j ,t ) } ≤ η ( T ) (cid:35) = P (cid:34) max j (cid:54) = j ,δ ( j ) (cid:54) = δ ( j ) (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 { ε j ,t ) − ε j ,t ) } ≥ η ( T ) (cid:35) , which converges to zero by (6).By the HKA, for any A (cid:96) , there exists C i such that C i = A (cid:96) in probability, which impliesthat max ≤ i ≤ N |A (cid:96) ∩ C i | = |A (cid:96) | and L = N hold in probability. So, P ( S ∗ ( A , C ) = 1)converges to 1 as T → ∞ . The proof of Theorem 1 is finished.To apply Theorem 1, we need to set the right value for θ . In real problems, θ could11e unknown. We shall propose a data driven method to select the threshold value of θ . Anaive choice of θ based on outlier detection isˆ θ = median s =1 ,...,K − ( x s )+ 2 . . × median s =1 ,...,K − | x s − median s =1 ,...,K − ( x s ) | ) , (7)where x s = d ( Z j s , Z j s +1 ) for s = 1 , . . . , K −
1, median s =1 ,...,K − ( x s ) and 1 . × median s =1 ,...,K − | x s − median s =1 ,...,K − ( x s ) | are robust estimates of mean and standard deviation of { x s , s =1 , . . . , K − } , respectively, and 2.5 is the cutoff value. It works well for relatively small N to K . The large values in the series of { x s , s = 1 , . . . , K − } would not affect the thresholdvalue ˆ θ and hence they could be successfully removed. However, if the distribution of x s ’s,with the exception of outliers, is a mixture of two or more probability distributions whichcommonly occurs in multiple clusters, then ˆ θ may not be consistent to θ . Therefore, wepropose Algorithm 1 based on Bayesian information criterion (BIC) to choose θ . Denote a set of change points as C i = { τ (1) i , · · · , τ ( M i − i } , where i = 1 , . . . , N and M i − M i is unknown in practice, we would need toestimate the change points. Consider the segment [ t − , t + ] and define two residual sums ofsquares S i, ( t − , t + ) = min β t + (cid:88) t = t − { Z i,t − f ( t ; β ) } , (8) S i, ( t − , t , t + ) = min β t (cid:88) t = t − { Z i,t − f ( t ; β ) }
12 min β t + (cid:88) t = t +1 { Z i,t − f ( t ; β ) } , (9)where 1 ≤ t − < t < t + ≤ T and f ( t ; β ) = β + β Φ( β + β t ) + β Z i,t − + β Z i,t − .Here, we consider two autoregressive terms. Then, the estimated change point is denotedas ˆ t i ( t − , t + ): ˆ t i,t − ,t + = arg min t − +∆ /
6) is the α/ t + − t − − β ) can be estimated by the nls function in R . Here, we consider α = 0 . We consider three classes N = 3. Let C , C , and C be randomly seperated classes with ∪ i =1 C i = { , . . . , K } . Denote the number of elements in i th class as n i with (cid:80) i =1 n i = K .We produce Z i,t from the following model Z i,t = M i (cid:88) m =1 (cid:110) β ( m )1 ,i + β ( m )2 ,i Φ( β ( m )3 ,i + β ( m )4 ,i t ) (cid:111) I ( τ ( m − i < t ≤ τ ( m ) i )+ ε t , (12)where i = 1 , , t = 1 , . . . , T and ε t ’s are independent Normal errors with mean zero andvariance σ .For the first class, let M = 1, β (1)1 , = 0, β (1)2 , = 10, β (1)3 , = −
4, and β (1)4 , = − .
05. Forthe second class, let M = 2, β (1) (cid:96), = 0 for (cid:96) = 1 , . . . , τ (1)2 = T / β (2)1 , = 0, β (2)2 , = 20,14 (2)3 , = −
3, and β (2)4 , = 0 .
03. For the third class, let M = 2, β (1) (cid:96), = 0 for (cid:96) = 1 , . . . , τ (1)3 = 2 T / β (2)1 , = 0, β (2)2 , = 5, β (2)3 , = −
2, and β (2)4 , = 0 . Z i,t for i = 1 , , T = 150. Next, we compare ourgraph-based clustering method as shown in Algorithm 1 with the model-based clusteringmethod (Fratey & Raftery , 2002) implemented in the R function Mclust (Fraley et al. ,2020). Fig. 3 shows the averaged strict purity score as in (5) for estimated clusters basedon these two methods for σ = 0 . , . , . . . ,
1, different n i ’s and 100 replications.It can be seen from Fig. 3 that our method is very accurate for different σ and sizes ofclasses because its very strict purity scores are close to 1. This agrees with the conclusion inTheorem 1. In contrast, the performance of the model-based clustering method is affectedby large σ and large sizes of classes. Specially, when n = 20, n = 100 and n = 200,the computation is significantly slower compared with our method. The comparison ofcomputing time is not presented here. We continue to use the data set of log-transformed infection counts from December 1,2019 to April 20, 2020 from Chinese provinces/regions and the 33 countries, and present theclustering and S-shaped fitting with change points. Here, two autoregressive components( p = 2) in (1) are suggested. Based on the graph-based clustering Algorithm, the clusters of COVID-19 in China andthe rest of the world are presented in Fig. 4, where the optimal path is presented as a cyclewith vertexes representing clusters in different colors and overextended curves. This way of15resentation is to transmit three aspects of information: (i) this analysis is for virus data,therefore, we should use the cycle and the sharp nodes to describe the structure of the virus;(ii) the optimal graph is a path connecting all nodes where nodes can be provinces/regionsin China or countries in the world; and (iii) readers can quickly find the different clustersand where to separate them from the path.From Fig. 4, we observe the following.(1) As shown in COVID-19 cases in China, the 34 provinces/regions are clustered into7 categories. Specifically, Hubei (HB), Xizang (XZ), Qinghai (QH), Macao (MO), HongKong (HK), and Taiwan (TW) are individually clustered into separate categories, and theremaining provinces/regions are all clustered into one category. This clustering result canbe explained by the differences in epidemic control strategies among the provinces/regions:HB is the center of the COVID-19 breakout, with a large number of infection cases; un-derpopulated XZ and QH are both located on the Qinghai Tibet Plateau, with only a fewinfection cases; MO, HK, and TW are of self-governance: meaning their epidemic controlstrategies are different from all other regions in China. The model-based clustering method(Fratey & Raftery , 2002) suggests both HK and TW are to be in one cluster, which maynot be correct.(2) As shown in COVID-19 cases in the world, the 33 selected countries are clusteredinto 8 categories. Specifically, China (CN), Korea (KR), Japan (JP), Spain (ES), andTurkey (TR) are individually clustered into separate categories; Italy (IT) and Iran (IR)are clustered into one category; the United States of America (US), Germany (DE), France(FR), the United Kingdom of Great Britain (UK), Northern Ireland (GB), and Canada(CA) are clustered into one category; and the remaining countries are all clustered intoone category. This clustering result is partly based on the timing of COVID-19 outbreaksin those countries. For example, the first large-scale outbreak was in CN, followed by KR16nd JP. After that, infections in IR and IT experienced rapid growth, followed by theoutbreaks in European countries and the US. Finally, the epidemic spread worldwide. Inaddition, the clustering is also based on the epidemic control strategies in each country.For example, in KR and JP, even while the epidemic broke out around the same time, thetwo countries had taken different strategies: JP adopted a “defensive strategy”’ to ensurethe health care system operated normally as usual, while KR used an “aggressive attackstrategy” to comprehensively detect infections.
Based on the BIC-based ICSS Algorithm, we segment the curve time series and presentthe segmented fittings and confidence interval estimation for the log-transformed infectioncounts Z i,t (1 ≤ i ≤ N ) of each cluster in China and the rest of the world; see Fig. 5 and 6,respectively.We can obtain that all sigmoid curves share the form of multiple stages and multiplechange points, with the exception of Cluster 7 (XZ) in China, with only one infection; thecalculated change points of each cluster can still be explained by the differences in epidemiccontrol strategies. See the details below.(1) As shown in Fig. 5 A and Fig. 6 A, the sigmoid curves and change points arealmost the same because HB province was the center of the COVID-19 outbreak in CN.In Fig. 6 A in CN, the first segment (19/12/01 to 19/12/13) was the germination periodof the outbreak. In the second segment (19/12/13 to 20/01/16), COVID-19 seemed tohave been controlled in CN. However, because many COVID-19 cases had not been founddue to varied epidemic control strategies in the previous two stages, COVID-19 broke outin the third segment (20/01/16 to 20/01/26) and fourth segment (20/01/26 to 20/02/11)in CN. This coincided with Chunyun (the annual massive movement of people during17hinese Lunar New Year), which particularly accelerated the outbreak. Finally, in the lasttwo segments (20/02/11 to 20/02/27 and 20/02/27 to 20/04/20), COVID-19 was controlledand stabilized once the CN government implemented very strict epidemic control strategies,such as traffic control and home quarantine.(2) As shown in Fig. 5 C, the sigmoid curves in HK and TW seem similar becausethey were both strongly affected by COVID-19 cases from mainland China. However, wefind that the change points of COVID-19 in TW are about a week delayed compared tothose in HK after COVID-19 started to break out in both regions. This is because TWresponded in a timely manner to the COVID-19 outbreak and controlled it more quicklyand effectively than HK, while the implementation of epidemic control strategies in HKlagged behind.(3) As shown in Fig. 6, the number of new cases in China had tentatively stabilizedsince the last change point, 20/02/27, which was delayed by about one week in otherclusters. In Fig. 6 A and B, the infections in CN and KR are mostly stable, but theepidemic situations in other countries have not been controlled effectively. Take the fifthcluster (Fig. 6 C) as an illustration, considering that this cluster had the fastest growth.The four segments can be explained as follows: (i) the infections in the first segmentwere mainly from oversea imports; (ii) in the second segment, COVID-19 seemed to havebeen controlled; (iii) COVID-19 broke out because of many unfound COVID-19 cases inprevious segments; and (iv) in the last segment, COVID-19 began to come under controlas governments declared states of emergency and started implementing strict measures tocontrol the spread of the virus.(4) As shown in Fig. 6 B, confidence intervals for KR tended to be quite narrow inwidth when the number of new cases had tentatively stabilized, resulting in more preciseestimates of mean response, whereas confidence intervals for JP tended to be wide since JP18ad adopted a “defensive strategy”. In most of cases, confidence intervals produced preciseresults. A clustering-segmented autoregressive sigmoid model is developed to explore the space-time pattern of the log-transformed infectious count by the end of April 20, 2020. Itperformed well when it was applied to COVID-19 cases in both China and the 33 countries,and thus provides an efficient statistical model of COVID-19 spread to help fight against thevirus. Currently, the infections in China are mostly stable, and the graph-based clusteringalgorithm is robust to the clusters from the 34 provinces/regions in China. When COVID-19began to come under control, the clustering of the disease globally will become increasinglystable.In fact, the CSAS model can adapt to an extended period of time when clusters havebeen updated and new change points have been identified. To do so, we use the last changepoints in time, 20/03/07 obtained from Fig. 5 or 2020/03/08 obtained from Fig. 6, as thestart of the extended period at two-month intervals, from 20/03/07 to 20/05/07 or 20/03/08to 20/05/08. In Fig 9, we show segmentations and fittings for log-transformed infectioncounts of each cluster in both China and the 33 countries during this extended period. Wecan see that the fittings continue to work well. We provide an R package, GraphCpClust ,which can be accessed from https://github.com/Meiqian-Chen/GraphCpClust. From this R package, users can obtain the same results presented in this paper and can model datafor another extended period of time. In addition, the data and code for another two papers(Shi, Wu and Rao, 2017, 2018) are included in this R package.Regarding the dataset used in this article, Wuhan-2019-nCoV, we make the follow-19ng additional remarks: 1. Back in early March 2020, there were very few datasets onCOVID-19, and especially few datasets containing timely epidemic data from each Chineseprovince. This dataset, Wuhan-2019-nCoV, collects national outbreak reports from WHO,as well as daily outbreak reports from provincial health and family planning commissionsin China; 2. The Wuhan-2019-nCoV dataset is very timely updated and has been includedin the “Open Source Wuhan” data resource. Therefore, we believe that the data quality ofthe Wuhan-2019-nCoV dataset is trustworthy. There are now more and more COVID-19data resources available, such as WHO data (https://covid19.who.int/) and Our World inData (https://ourworldindata.org/covid-data-switch-jhu). For these two datasets, we findthat our model still works very well. Please see this webpage, http://graph-clustering-system.com/, for the three data analyses described above. References
Akaho, S. (1995). Mixture model for image understanding and the EM algorithm, https://staff.aist.go.jp/s.akaho/papers/ETL-TR-95-13E.pdf . Bai, J. S. & Perron, P. (2003). Computation and analysis of multiple structural changemodels.
J APPL ECONOM , 1-22.
Biswas, M. , Mukhopadhyay, M. & Ghosh, A. K. (2014). A distribution-free two-sample run test applicable to high-dimensional data.
Biometrika , 913-926.
Csardi, G. & Nepusz, T. (2006). The igraph software package for complex networkresearch.
InterJournal , Complex Systems 1695. 2006. http://igraph.org. Accessed April1, 2020. 20 ietza, K. , Heesterbeek, J. A. P. (2002). Daniel Bernoulli’s epidemiological modelrevisited.
MATH BIOSCI , 1-21
Ester, M. , Kriegel, H. P. , Sander, J. & Xu, X.W. (1996). A Density-Based Algo-rithm for Discovering Clusters in Large Spatial Databases with Noise.
KDD-96 Proceed-ings , 226-231.
Fraley, C. & Raftery, A.E. (2020) mclust: Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation. R package version 2.2-5 .
Avail-able at https://cran.r-project.org/web/packages/fpc/index.html . Fratey, C. & Raftery, A.E. (1993). Model-based clustering, discriminant analysis anddensity estimation.
J AM STAT ASSOC , 611-631.
Incl´an, C. & Tiao, G.C. (1994). Use of cumulative sums of squares for retrospectivedetection of changes of variance. Publications of the American Statistical Association.
JAM STAT ASSOC , , 913-923. Jin, B. S. , Wu, Y. H. , Rao, C. R. & Hou, L. (2020). Estimation and model selectionin general spatial dynamic panel data models.
Proc Natl Acad Sci , 5235-5241.
Katul, G. G. , Mrad, A. , Bonetti, S. , Manoli, G. , &
Parolari, A. J. (2020).Global convergence of COVID-19 basic reproduction number and estimation from early-time SIR dynamics. medRXiv . Kowsar, R. , Keshtegar., B. , Marey, M. A. , &
Miyamoto, A. (2017). An autore-gressive logistic model to predict the reciprocal effects of oviductal fluid components onin vitro spermophagy by neutrophils in cattle.
SCI REP-UK , 4482.21 iu, X.Z. , &
Stechlinski, P. (2017). Infectious Disease Modeling.
A Hybrid SystemApproch , Springer, Heidelberg.
Manning, C.D. , Raghavan, P. & Sch¨utze, H. (2008). Introduction to InformationRetrieval, Cambridge University Press, Cambridge, England.
Murray, J. D. (1989) Mathematical Biology. Springer, Heidelberg.
Murtagh, F. & Legendre, P. (2014). Ward’s hierarchical agglomerative clusteringmethod: which algorithms implement ward’s criterion?
J CLASSIF , 274-295.
R Core Team
Shi, X.P. , Wang, X.S. , Wei, D.W. & Wu, Y.H. (2016). A sequential multiple change-point detection procedure via VIF regression.
Comput Stat , 671-91. Shi, X.P. , Wu, Y.H. & Rao, C.R. (2017). Consistent and powerful graph-based change-point test for high-dimensional data.
Proc Natl Acad Sci , 3873-8.
Shi, X.P. , Wu, Y.H. & Rao, C.R. (2018). Consistent and powerful non-Euclideangraph-based change-point test with applications to segmenting random interfered videodata.
Proc Natl Acad Sci , 5914-5919.
Tocher, K. D. (1963). The Art of Simulation.
English University Press , London.
Wang, J. A. & Hartiganm, A. (1979). Algorithm as 136: a k-means clustering algorithm.
J ROY STAT SOC A STA , 100-108.22 ang, et al. (2020). Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions.
J Thorac Dis , 165-174.23 B Log ( + i n f e c t i on ) HBTWHKMOQHXZGDBJSHZJAHJX SDSCHAHNCQJSGXHIYNHLHEFJ SNJLGZNXTJSXLNNMGSXJ
COVID−19 in China
CN US l og ( + i n f e c t i on ) CNJPKRUSFRCADEINPHBEES GBITRUSEEGIRATCHBRDZDK MKRONLECIEDOIDPTPLPETR
COVID−19 in World − tClass 1 Class 2 Class 3 Figure 2: Plots of S-curves for three classes.24 lgorithm 1:
Graph-based clustering Algorithm
Result:
Output the optimal clusters A . Notations: x ( θ ) = { x s | x s ≤ θ, s = 1 , . . . , K − } where x s is defined in (7); θ ( s ) is the s ’th largest element in set { x s , s = 1 , . . . , K − } ; ˆ σ ( θ ) is the sample variance of x ( θ ); BIC( θ, A ) = ( K −
1) log(ˆ σ ( θ )) + 2 L ( A ) log( K − L ( A ) is the number ofclusters in A ; Initialize : Let i = 1, L = 1, and A = {A } where A = { , . . . K } ; for s = 2; s < K − s = s + 1 do Let θ be θ ( s ) and calculate the clusters based on E ∗ ( P , θ ) in (3) denoted as A temp ; if BIC( θ, A temp ) < BIC( θ ( s − , A ) then A = A temp ; else break; end end − . − . . . . n =
1, n =
10, n = σ S * − . − . . . . n =
5, n =
20, n = σ S * − . − . . . . n =
20, n = = σ S * Graph−based clustering Model−based clustering
Figure 3: Comparisons of graph-based clustering method and model-based clusteringmethod. 25 lgorithm 2:
BIC-based ICSS Algorithm
Result:
Output the estimated change points ˆ C i . Notations: ˆ C i, ( s ) and | ˆ C i | are the s th smallest element and number of elements ofset ˆ C i , respectively; Initialization:
Let t − = 1, t + = T , and ˆ C i = { , T } ; while t + − t − > ∆ do t first ← t + ; t last ← t − ; while BIC i, ( t − , t first ) ≥ BIC i, ( t − , t first ) do t first ← ˆ t i,t − ,t first ; end while BIC i, ( t last , t + ) ≥ BIC i, ( t last , t + ) do t last ← ˆ t i,t last ,t + ; end if t first = t last then ˆ C i ← ˆ C i ∪ { t first } ; break; else ˆ C i ← ˆ C i ∪ { t first , t last } ; t − ← t first ; t + ← t last ; end end for s = 2; j < | ˆ C i | ; s = s + 1 do if BIC i, ( ˆ C i, ( s − + 1 , ˆ C i, ( s +1) ) ≤ BIC i, ( ˆ C i, ( s − + 1 , ˆ C i, ( s +1) ) then ˆ C i ← ˆ C i \ { ˆ C i, ( s ) } ; end end ˆ C i ← ˆ C i \ { , T } ; 26 OVID−19 in China
HBGDZJHAHNAHJXSDJSSCCQBJSHHLHEFJSNGXYNHILNSX TJ GZ GS NM JL XJ NX TW HK MO QH XZ
COVID−19 in world
CNIRITKRJPCAGBFRDEUSESCHNLBEATSEDKPTBRIEROID PL PE EC RU IN PH EG DZ MK DO TR
Figure 4: Plots of clusters in China and in 33 selected countries based on the log-transformed infection counts. l og ( + i n f e c t i on ) Cluster 1(HB) A l og ( + i n f e c t i on ) Cluster 2(GD,ZJ,HA,HN,AH,JX,SD,JS,SC,CQ,BJ,SH,HL,,HE,FJ,SN,GX,YN,HI,LN,SX,TJ,GZ,GS,NM,JL,XJ,NX) B l og ( + i n f e c t i on ) Cluster 3Cluster 4
Cluster 3(TW) and Cluster 4(HK) C l og ( + i n f e c t i on ) Cluster 5Cluster 6Cluster 7
Cluster 5(MO), Cluster 6(QH) and Cluster 7(XZ) D Figure 5: Plots of segmentations and fittings of provinces/regions in China based on thelog-transformed infection counts. 27 l og ( + i n f e c t i on ) Cluster 1Cluster 2
Cluster 1(CN) and Cluster 2(IR,IT) A l og ( + i n f e c t i on ) Cluster 3Cluster 4
Cluster 3(KR) and Cluster 4(JP) B l og ( + i n f e c t i on ) Cluster 5Cluster 6
Cluster 5(US,DE,FR,GB,CA) and Cluster 6(ES) C l og ( + i n f e c t i on ) Cluster 7Cluster 8
Cluster 7(CH,NL,BE,AT,SE,DK,PT,BR,IE,RO,ID,PL,PE,,EC,RU,IN,PH,EG,DZ,MK,DO) and Cluster 8(TR) D Figure 6: Plots of segmentations and fittings of 33 selected countries in the world based onthe log-transformed infection counts. − . . . . A − . . . . B − . − . . . C Figure 7: Plots of residuals of the last segment from the CSAS model for two separateprovinces in China, NM (A) and TJ (B), and their common cluster 2 (C). − . − . . A − . . . . B − . − . . . C − . − . . . . D . . . . . E − . − . . . . F Figure 8: Plots of residuals for HB in China for the fourth segment (A), the fifth segment(B), and the last segment (C) from the CSAS model, for the last segment from the CSASmodel without autoregressive terms (D), for the whole period from the autoregressive modelof order 2 after taking the first difference (E), and for the whole period from the CSASmodel without change points (F). 28 l og ( + i n f e c t i on ) Cluster 1Cluster 2
Cluster 1 (HB) and Cluster 2 (GD,HA,ZJ,HN,AH,JX,SD,JS,CQ,SC,BJ,,SH,HL,HK,TW,FJ,HE,SN,GX,NM,SX,TJ,YN,HI,GZ,GS,LN,JL,XJ,NX) A l og ( + i n f e c t i on ) Cluster 3Cluster 4
Cluster 3 (QH,MO) and Cluster 4 (XZ) B l og ( + i n f e c t i on ) Cluster 1Cluster 2
Cluster 1 (CN,IR,IT,ES,DE,FR,GB,BE,NL,CH,AT,SE,JP,DK,PT,CA,BR,,RU,PE,EC,IN,IE,RO,PL,ID,PH,EG,DZ,MK,DO) and Cluster 2 (KR) C l og ( + i n f e c t i on ) Cluster 3Cluster 4