[PDF] Exploring the space-time pattern of log-transformed infectious count of COVID-19: a clustering-segmented autoregressive sigmoid model

Abstract

At the end of April 20, 2020, there were only a few new COVID-19 cases remaining in China, whereas the rest of the world had shown increases in the number of new cases. It is of extreme importance to develop an efficient statistical model of COVID-19 spread, which could help in the global fight against the virus. We propose a clustering-segmented autoregressive sigmoid (CSAS) model to explore the space-time pattern of the log-transformed infectious count. Four key characteristics are included in this CSAS model, including unknown clusters, change points, stretched S-curves, and autoregressive terms, in order to understand how this outbreak is spreading in time and in space, to understand how the spread is affected by epidemic control strategies, and to apply the model to updated data from an extended period of time. We propose a nonparametric graph-based clustering method for discovering dissimilarity of the curve time series in space, which is justified with theoretical support to demonstrate how the model works under mild and easily verified conditions. We propose a very strict purity score that penalizes overestimation of clusters. Simulations show that our nonparametric graph-based clustering method is faster and more accurate than the parametric clustering method regardless of the size of data sets. We provide a Bayesian information criterion (BIC) to identify multiple change points and calculate a confidence interval for a mean response. By applying the CSAS model to the collected data, we can explain the differences between prevention and control policies in China and selected countries.

Full PDF

EExploring the space-time pattern of log-transformedinfectious count of COVID-19: a clustering-segmentedautoregressive sigmoid model

Xiaoping Shi ∗ Meiqian Chen and Yucheng Dong † Abstract

At the end of April 20, 2020, there were only a few new COVID-19 cases remain-ing in China, whereas the rest of the world had shown increases in the number of newcases. It is of extreme importance to develop an eﬃcient statistical model of COVID-19 spread, which could help in the global ﬁght against the virus. We propose aclustering-segmented autoregressive sigmoid (CSAS) model to explore the space-timepattern of the log-transformed infectious count. Four key characteristics are includedin this CSAS model, including unknown clusters, change points, stretched S-curves,and autoregressive terms, in order to understand how this outbreak is spreading intime and in space, to understand how the spread is aﬀected by epidemic controlstrategies, and to apply the model to updated data from an extended period of time.We propose a nonparametric graph-based clustering method for discovering dissim-ilarity of the curve time series in space, which is justiﬁed with theoretical supportto demonstrate how the model works under mild and easily veriﬁed conditions. Wepropose a very strict purity score that penalizes overestimation of clusters. Simula-tions show that our nonparametric graph-based clustering method is faster and moreaccurate than the parametric clustering method regardless of the size of data sets.We provide a Bayesian information criterion (BIC) to identify multiple change pointsand calculate a conﬁdence interval for a mean response. By applying the CSAS modelto the collected data, we can explain the diﬀerences between prevention and controlpolicies in China and selected countries. ∗ Department of Mathematics and Statistics, Thompson Rivers University, email: [email protected] † Center for Network Big Data and Decision-Making, Business School, Sichuan University, email: [email protected] a r X i v : . [ s t a t . M E ] F e b Introduction

During the COVID-19 outbreak, multiple complex factors resulted in the space-timepattern of spread. Fig. 1 shows the log-transformed infectious counts in each region inChina, and in 33 selected countries at the end of April 20, 2020.From Fig. 1, we can see two main characteristics of the spread: (i) the spread of COVID-19 has a space-time characteristic determined by diﬀerent intervention policies, incompleteinformation, geographical locations, transport, climate, and so on; (ii) along the time, thelog-transformed infectious counts presented diﬀerent sigmoid (stretched S-shaped) curves.This phenomenon often happens in the life cycles of plants, animals, and viruses, whichcan rise and fall periodically. In each cycle, the sigmoid curve experiences three phases:slow rising, sharp rising, and slow falling.Modeling the spread of COVID-19 in many regions over a long period of time is provento be challenging. That is because many regions may not share the same spread patternand diﬀerent regions may exhibit various intervention policies that may cause instabilityin the models. The model for each region may have a large degree of noise, but a commoncluster of all regions could have less noise by the law of large numbers. Thus, clustering isof importance to increase model ﬁt. We may have to cluster all regions, even if the numberof clusters is unknown. In addition, we should allow the model to incorporate unknownchange points to further enhance the ﬁtting performance. Ignoring the existence of changepoints may lead to poor model ﬁtting and misleading model interpretation (Shi, Wang,Wei & Wu, 2016). Furthermore, it is often necessary to apply the model to updated datafrom an extended period of time. In the extended period, old clusters need to be updatedand new change points may occur. Models with incorporated clusters and change pointsshould be ﬂexible and adaptive to the new data. In the next step, we shall consider the2onlinear characteristics of the models.Logarithmic transformation is often used for transforming count data, which includeszero values (Jin et al. , 2020) and grows exponentially over time. The simplest for-mula for exponential growth of a function y at the growth rate r , as time t goes on, is y ( t ) = y (0)(1 + r ) t , which satisﬁes the linear diﬀerential equation dy ( t ) dt = log(1 + r ) y ( t ). Anonlinear variation of this diﬀerential equation may lead y ( t ) to a sigmoid function. Forexample, the solution of a nonlinear diﬀerential equation dy ( t ) dt = log(1 + r ) y ( t ) − y ( t ) isthe logistic function (Murray, p.308 , 1989; Liu & Stechlinski, p.84 , 2017). The expo-nential growth model has shown numerous applications in the modeling and controlling ofcomplex systems. For example, the number of cells in a culture will increase exponentiallyuntil an essential nutrient is exhausted. A virus, for example SARS or COVID-19, has beenfound to spread exponentially (Katul et al. , 2020). The speed of spread slows down whenan artiﬁcial immunization becomes available or intervention policies take eﬀect. Otherapplications of the exponential growth model can be found in Physics (e.g., radioactivedecay), Economics (e.g., a country’s gross domestic product), Finance (e.g., investments),Computer science (e.g., computing growth and internet phenomena), and so on.When systems have short-term memories and become more complex, it is extremelydiﬃcult to ﬁnd a diﬀerential equation to describe the growth curve. In contrast, we mayadd some autoregressive terms in a regression function to adapt to the complex system.Kowsar et al. (2017) shows that an autoregressive logistic model was more accurate thana logistic model when it comes to predicting the behaviors of complex biological systems.The reason is that the added autoregressive terms, which behave like short-term memory,can make an appropriate adjustment to better ﬁt the complex system. In the same spirit,we propose the clustering-segmented autoregressive sigmoid (CSAS) model with four keycharacteristics including unknown clusters, change points, stretched S-shaped curves, and3utoregressive terms. With the help of the CSAS model, we expect to understand howan outbreak is spreading in time and in space, to understand how the spread is aﬀectedby epidemic control strategies, and to apply the model to updated data from an extendedperiod of time.To identify this CSAS model, we ﬁrst identify unknown clusters. There are many popu-lar methods, such as K-means (Wang & Hartiganm , 1979) (implemented in the R function kmeans ), Expectation-Maximization clustering for Gaussian Mixture Models (GMM-EM)(Akaho , 1995) (implemented in the R package mclust ), Density-Based Spatial Clusteringof Applications with Noise (DBSCAN) (Ester et al., 1996) (implemented in the R function fpc::dbscan ), and Hierarchical clustering (Murtagh & Legendre , 2014) (implemented in the R function hclust ). Except for GMM-EM, which can be considered to be parametric, allother methods need to predetermine the number of clusters or distance related parameters.To compare the dissimilarity of the curve time series, we need a nonparametric method thatdoes not require predetermined parameters. Then, we can separate diﬀerent regions fromChina and the selected 33 countries into clusters that share common patterns, segment thecurve time series, and provide accurate ﬁttings.Our contributions include the following: (1) we propose the CSAS model to help un-derstand how an outbreak is spreading in time and in space, to understand how the spreadis aﬀected by epidemic control strategies, and to apply the model to updated data from anextended period of time; (2) we provide a nonparametric graph-based clustering methodwith theoretical support, which furthermore proposes a very strict purity score that penal-izes the overestimation of clusters. Simulations show that our method is fast and eﬃcientfor diﬀerent sizes of data sets; (3) we give practical methods for segmentation and providea conﬁdence interval estimation for mean response; (4) we analyze the COVID-19 data inregions in China and selected countries, and explain the diﬀerences among the epidemic4revention and control policies. We assume the clustering-segmented autoregressive sigmoid (CSAS) model: Z i,t = M i (cid:88) m =1 (cid:26) β ( m )1 ,i + β ( m )2 ,i Φ( β ( m )3 ,i + β ( m )4 ,i t )+ p (cid:88) q =1 β ( m ) q +4 ,i Z i,t − q + ε ( m ) i,t (cid:27) I ( τ ( m − i < t ≤ τ ( m ) i ) , (1)where Z i,t = log(1 + Y i,t ); Y i,t is the number of conﬁrmed cases for the i th (1 ≤ i ≤ N )cluster and time t ∈ [1 , T ]; i = δ ( j ) for j th region with 1 ≤ j ≤ K ; Z i, − q = 0 for q = 1 , . . . , p ; I ( A ) is an indicator function taking 1 if A is true, 0 otherwise; τ (0) i =0, τ ( M i ) i = T , τ ( m ) i for M i > ≤ m ≤ M i − i th cluster; Φ( x ) = √ π (cid:82) x −∞ e − u / du is a cumulative distribution function (CDF) ofthe standard normal distribution representing the sigmoid curve; β ( m )1 ,i ’s and β ( m )2 ,i ’s arestretch location and scalar parameters, respectively; β ( m )3 ,i ’s and β ( m )4 ,i ’s are linear regressioncoeﬃcients within the sigmoid curves; β ( m ) q +4 ,i ’s are autoregressive regression coeﬃcients; ε ( m ) i,t ’s are independent random errors with a mean of zero and constant variance of ( σ ( m ) i ) .The CSAS model has four key characteristics: (1) it is implemented with unknown N diﬀerent clusters among K regions. Due to the epidemic mechanism, human mobility andcontrol strategy, the spread of epidemics displays a spatial propagation. We will proposea nonparametric method to cluster the regional data by applying the characteristic ofsigmoid curve. This method does not introduce any factors and hence can be considerednonparametric; (2) the multiple S-shaped curves are described by change points. The5hange points τ ( m ) i for 1 ≤ m ≤ M i − i ).This is because diﬀerent intervention policy releases such as lockdown, maintaining socialdistance, cancelling large events, closing schools, and so on, result in diﬀerent segmentedsigmoid curves among unknown clusters; (3) the regression function is mainly determinedby the stretched S-curve β ( m )1 ,i + β ( m )2 ,i Φ( β ( m )3 ,i + β ( m )4 ,i t ) and it allows a slight adjustmentthrough the autoregressive terms, (cid:80) pq =1 β ( m ) q +4 ,i Z i,t − q which can be considered as a short-term memory for the response variable; and (4) after specifying both clusters and changepoints, we use the corresponding data sets to answer three questions. How do we estimatethe regression coeﬃcients? Are those coeﬃcients signiﬁcantly diﬀerent from zero? How dowe give a conﬁdence interval for the mean response?We give the following ﬁve remarks for the logarithmic transformation, the CDF functionΦ( x ), and the random error in the CSAS model. Remark 1. log(1 + x ) transformation is often used for transforming count data thatinclude zero values (Jin et al. , 2020). When Y i,t is much smaller or larger than 1 inmagnitude, log(1 + Y i,t ) ≈ Y i,t or log(1 + Y i,t ) ≈ log Y i,t can be used. This transformationlog(1 + Y i,t ) of Y i,t , which may grow exponentially over time, has two patterns, slow risesand slow falls, and hence can often be modeled by a stretched S-curve. Remark 2.

The nonlinear function Φ( x ) is used to describe the stretched S-shapedcurve. Other similar functions may be considered. For example, if we apply the approxi-mation of Φ( x ) by (Tocher , 1963), Φ( x ) ≈ e − √ /πx for all x , then we have β ( m )1 ,i + β ( m )2 ,i Φ( β ( m )3 ,i + β ( m )4 ,i t ) ≈ β ( m )1 ,i + β ( m )2 ,i e − √ /π ( β ( m )3 ,i + β ( m )4 ,i t ) , which is an extended logistic function of time t and is commonly used in logistic regression. Remark 3.

Mathematical modelling may provide an understanding of spread mecha-6isms. The original mathematical model was proposed and solved by Daniel Bernoulli in1760; see (Dietza & Heesterbeek , 2002). Recent developments and applications are mainlyfocused on the susceptible-infectious-recovered (SIR) model and its variants. The logisticfunction derived from a nonlinear diﬀerential equation may explain why we should applythe sigmoid curve to model the spread of disease (Murray, p.308 , 1989; Liu & Stechlinski,p.84 , 2017; Katul et al. , 2020).

Remark 4.

The model for each region may have a large degree of noise, but a commoncluster of all regions could have less noise because of the law of large numbers. So, weshould consider an individual cluster in the CSAS model. From Fig. 7 C (Cluster 3),it can be seen that the noise is signiﬁcantly smaller than that of Fig. 7 A (ProvinceNM) or B (Province TJ). In addition, we should allow the model to incorporate unknownchange points to further enhance the ﬁtting performance. Fig. 8 F suggests that theresiduals from the CSAS model without change points exhibit a clear trend. In contrast,the variance of noises in each segment should be constant; see Fig. 8 A-C. Models withincorporated clusters and change points should be ﬂexible and adaptive to the new data;see the continued good performance of the CSAS model in the extended two-month datain the “Discussion and Conclusions” section.

Remark 5.

With both autoregressive terms and CDF function in the CSAS model,the variance of random errors can be considered to be constant across segments. In Fig. 8A, B and C, the model residuals are well-behaved across segments. The residuals in Fig.8 D (the autoregressive terms are removed) and Fig. 8 E (the CDF function is removedas shown in the Long-Short-Term-Memory model in (Yang et al. , 2020)) suggest a timetrend. This ﬁnding agrees with the fact that an autoregressive logistic model was moreaccurate than a logistic model as shown in (Kowsar et al. , 2017).7 .1 Clustering

To ﬁnd clusters of all K regions, we consider the T dimensional series { Z j , ≤ j ≤ K } ,where its t th component is Z j,t for 1 ≤ t ≤ T , and deﬁne the Euclidean distance between Z j and Z j as follows: d ( Z j , Z j ) = (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 ( Z j ,t − Z j ,t ) . (2)We construct an approximate shortest Hamiltonian path (SHP) based on a heuristicKruska algorithm (HKA), which was proposed by (Biswas et al. , 2014) for a two-sampletest. This was successfully applied into change point detection in (Shi, Wu and Rao, 2017,2018). The HKA ﬁrst sorts all edges in order of increasing distance deﬁned in (2). Firstand foremost, the edge with a minimum distance must be selected. Then subsequent edgesare chosen one-by-one from the remaining list of sorted edges according to the requirementof a path. If this current edge does not form a cycle with the previously selected edges, andevery vertex connected by this current edge, or previously selected edges, has a degree notgreater than 2, then this current edge must be selected. The HKA terminates when K − K − P = ( j , . . . , j K ) . The next step is to ﬁnd clusters based on P . We deﬁne the edge set of P as E ( P ), and consider a subset of E ( P ): E ∗ ( P , θ ) = { ( j s , j s +1 ) for s = 1 , . . . , K − j s , j s +1 ) ∈ E ( P ) and d ( Z j s , Z j s +1 ) ≤ θ (cid:9) . We create a graph from the edge set E ∗ ( P , θ ) and deﬁne the connected componentsof this graph as a set of clusters A = {A (cid:96) , ≤ (cid:96) ≤ L } . We note that the R function8 omponents in the R package igraph (Csardi & Nepusz , 2006) can calculate the connectedcomponents given the edge set.Suppose that there is a set of classes C = {C i , ≤ i ≤ N } , where C i = { j | δ ( j ) = i } . Weneed to measure how close the set of clusters A is to the predetermined set of classes C .Purity (Manning, Raghavan and Sch¨utze , 2008) is a measure of this extent deﬁned as: S ( A , C ) = 1 K L (cid:88) (cid:96) =1 max ≤ i ≤ N |A (cid:96) ∩ C i | . (4)In most cases, a bad clustering has a purity value close to 0 and a perfect clustering hasa purity of 1. However, this measure may not give a realistic evaluation for overestimatedclusters. For example, a purity score of 1 could happen by putting A (cid:96) = (cid:96) , L = K and N = 1. In this case, one whole class is mis-clustered to K separate clusters with a purityscore of 1.We propose a very strict purity score to penalize overestimated clusters: S ∗ ( A , C ) = 1 K L (cid:88) (cid:96) =1 max ≤ i ≤ N |A (cid:96) ∩ C i | − | L − N | max( L, N ) . (5)Users may add additional weight on the second penalty term according to diﬀerent require-ments. Based on this very strict purity evaluation, a very bad clustering would have apurity value close to -1, and a perfect clustering will still have a purity of 1. If A (cid:96) = (cid:96) , L = K and N = 1, then S ∗ ( A , C ) = 1 /L , decreasing as L increases. Overestimated clustersmay have a very strict purity score close to 0. A natural question comes: does our cluster-ing have a very strict purity score of 1? To answer this question, we make the followingassumptions. Assumption 1.

Let ε i,t be Z i,t − E ( Z i,t ). Assume that ε i,t is independent and identically9istributed (i.i.d.) satisfying E ( ε i,t ) < ∞ for all 1 ≤ j ≤ K and 1 ≤ t ≤ T . Assumption 2.

There exists a η ( T ), satisfying that η ( T ) > E ( ε , ), K << { η ( T ) − E ( ε , ) } T and min j (cid:54) = j ,δ ( j ) (cid:54) = δ ( j ) d ( E ( Z j ) , E ( Z j )) > η ( T ) . In Assumption 1, if ε i,t is dependent, then we require the upper bound of E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t =1 ( ε j ,t − ε j ,t ) − E ( ε j ,t ) − E ( ε j ,t ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) << T η ( T ) /K. In Assumption 2, we require K to be quite small compared to T . Note that d ( E ( Z j ) , E ( Z j ))is easy to evaluate because d ( E ( Z j ) , E ( Z j )) = (cid:113) T (cid:80) Tt =1 { E ( Z j ,t ) − E ( Z j ,t ) } . We havethe following Theorem 1.

Theorem 1.

Suppose Assumptions 1-2 hold. Choose θ = η ( T ) as in (3). As T → ∞ ,we have P { S ∗ ( A , C ) = 1 } → . Proof of Theorem 1. We ﬁrst prove that P { max j (cid:54) = j ,δ ( j )= δ ( j ) (cid:113) T (cid:80) Tt =1 ( Z j ,t − Z j ,t ) >η ( T ) } → . Because δ ( j ) = δ ( j ), Z j ,t − Z j ,t = ε j ,t − ε j ,t . Then we have P  max j (cid:54) = j ,δ ( j )= δ ( j ) (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 ( Z j ,t − Z j ,t ) > η ( T )  ≤ (cid:88) j (cid:54) = j P (cid:40) T (cid:88) t =1 ( ε j ,t − ε j ,t ) > η ( T ) T (cid:41) = (cid:88) j (cid:54) = j P (cid:34) T (cid:88) t =1 ( ε j ,t − ε j ,t ) − E { ( ε j ,t − ε j ,t ) } > { η ( T ) − E ( ε , ) } T (cid:35) ≤ (cid:88) j (cid:54) = j P (cid:34) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t =1 ( ε j ,t − ε j ,t ) − E { ( ε j ,t − ε j ,t ) } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) { η ( T ) − E ( ε , ) } T (cid:35) ≤ cK { η ( T ) − E ( ε , ) } T , (6)where c is a constant not related to either K or T . By Assumption 2, this upper boundconverges to zero. Next, we prove that P { min j (cid:54) = j ,δ ( j ) (cid:54) = δ ( j ) (cid:113) T (cid:80) Tt =1 ( Z j ,t − Z j ,t ) ≤ η ( T ) } → . By the Minkowski inequality and Assumption 2, P  min j (cid:54) = j ,δ ( j ) (cid:54) = δ ( j ) (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 ( Z j ,t − Z j ,t ) ≤ η ( T )  ≤ P (cid:34) min j (cid:54) = j ,δ ( j ) (cid:54) = δ ( j ) (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 { E ( Z j ,t ) − E ( Z j ,t ) } − (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 { ε j ,t ) − ε j ,t ) } ≤ η ( T ) (cid:35) ≤ P (cid:34) min j (cid:54) = j ,δ ( j ) (cid:54) = δ ( j ) η ( T ) − (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 { ε j ,t ) − ε j ,t ) } ≤ η ( T ) (cid:35) = P (cid:34) max j (cid:54) = j ,δ ( j ) (cid:54) = δ ( j ) (cid:118)(cid:117)(cid:117)(cid:116) T T (cid:88) t =1 { ε j ,t ) − ε j ,t ) } ≥ η ( T ) (cid:35) , which converges to zero by (6).By the HKA, for any A (cid:96) , there exists C i such that C i = A (cid:96) in probability, which impliesthat max ≤ i ≤ N |A (cid:96) ∩ C i | = |A (cid:96) | and L = N hold in probability. So, P ( S ∗ ( A , C ) = 1)converges to 1 as T → ∞ . The proof of Theorem 1 is ﬁnished.To apply Theorem 1, we need to set the right value for θ . In real problems, θ could11e unknown. We shall propose a data driven method to select the threshold value of θ . Anaive choice of θ based on outlier detection isˆ θ = median s =1 ,...,K − ( x s )+ 2 . . × median s =1 ,...,K − | x s − median s =1 ,...,K − ( x s ) | ) , (7)where x s = d ( Z j s , Z j s +1 ) for s = 1 , . . . , K −

1, median s =1 ,...,K − ( x s ) and 1 . × median s =1 ,...,K − | x s − median s =1 ,...,K − ( x s ) | are robust estimates of mean and standard deviation of { x s , s =1 , . . . , K − } , respectively, and 2.5 is the cutoﬀ value. It works well for relatively small N to K . The large values in the series of { x s , s = 1 , . . . , K − } would not aﬀect the thresholdvalue ˆ θ and hence they could be successfully removed. However, if the distribution of x s ’s,with the exception of outliers, is a mixture of two or more probability distributions whichcommonly occurs in multiple clusters, then ˆ θ may not be consistent to θ . Therefore, wepropose Algorithm 1 based on Bayesian information criterion (BIC) to choose θ . Denote a set of change points as C i = { τ (1) i , · · · , τ ( M i − i } , where i = 1 , . . . , N and M i − M i is unknown in practice, we would need toestimate the change points. Consider the segment [ t − , t + ] and deﬁne two residual sums ofsquares S i, ( t − , t + ) = min β t + (cid:88) t = t − { Z i,t − f ( t ; β ) } , (8) S i, ( t − , t , t + ) = min β t (cid:88) t = t − { Z i,t − f ( t ; β ) }

12 min β t + (cid:88) t = t +1 { Z i,t − f ( t ; β ) } , (9)where 1 ≤ t − < t < t + ≤ T and f ( t ; β ) = β + β Φ( β + β t ) + β Z i,t − + β Z i,t − .Here, we consider two autoregressive terms. Then, the estimated change point is denotedas ˆ t i ( t − , t + ): ˆ t i,t − ,t + = arg min t − +∆ / ∆.In light of Bai & Perron (2003), we apply the BIC method for model comparison.Deﬁne BIC i,ν ( t − , t + ) = ( t + − t − + 1) log { ˆ σ i,ν } + 6( ν + 1) log( t + − t − ) (11)where ν = 0 or 1, 6( ν + 1) is the number of parameters and ˆ σ i, = ( t + − t − + 1) − S i, ( t − , t + )and ˆ σ i, = ( t + − t − + 1) − S i, ( t − , ˆ t i,t − ,t + , t + ). Combined with the Iterated Cumulative Sumsof Squares Algorithm (ICSS) (Incl´an & Tiao , 1994), we propose Algorithm 2 to estimatemultiple change points.In Algorithm 2, there are two main steps that include ﬁnding candidate change pointsand reﬁning them. We set the minimum distance between two adjacent change points, ∆,to be 10 for real data analysis. First, we use the well-known nls function in the R package stats (R Core Team , 2020)to ﬁnd the minimum value as shown in (8) and give t tests on regression coeﬃcients, whereinitial values of parameters are given by grid search. Second, we give a conﬁdence intervalof regression function, denoted as g i,t ( β ) = E ( Z i,t | Z i,t − , Z i,t − ) for t ∈ [ t − , t + ], by the delta13ethod as follows. By ﬁrst-order Taylor expansion at the solution ˆ β , we have g i,t ( β ) ≈ g i,t ( ˆ β ) + ∇ g i,t ( ˆ β )( β − ˆ β ) . The approximate (1 − α )100% conﬁdence interval for g i,t ( β ) is g i,t ( ˆ β ) ± t ∗ α/ ( t + − t − − (cid:113) ∇ g i,t ( ˆ β ) (cid:62) Var( ˆ β ) ∇ g i,t ( ˆ β ) , where t ∗ α/ ( t + − t − −

6) is the α/ t + − t − − β ) can be estimated by the nls function in R . Here, we consider α = 0 . We consider three classes N = 3. Let C , C , and C be randomly seperated classes with ∪ i =1 C i = { , . . . , K } . Denote the number of elements in i th class as n i with (cid:80) i =1 n i = K .We produce Z i,t from the following model Z i,t = M i (cid:88) m =1 (cid:110) β ( m )1 ,i + β ( m )2 ,i Φ( β ( m )3 ,i + β ( m )4 ,i t ) (cid:111) I ( τ ( m − i < t ≤ τ ( m ) i )+ ε t , (12)where i = 1 , , t = 1 , . . . , T and ε t ’s are independent Normal errors with mean zero andvariance σ .For the ﬁrst class, let M = 1, β (1)1 , = 0, β (1)2 , = 10, β (1)3 , = −

4, and β (1)4 , = − .

05. Forthe second class, let M = 2, β (1) (cid:96), = 0 for (cid:96) = 1 , . . . , τ (1)2 = T / β (2)1 , = 0, β (2)2 , = 20,14 (2)3 , = −

3, and β (2)4 , = 0 .

03. For the third class, let M = 2, β (1) (cid:96), = 0 for (cid:96) = 1 , . . . , τ (1)3 = 2 T / β (2)1 , = 0, β (2)2 , = 5, β (2)3 , = −

2, and β (2)4 , = 0 . Z i,t for i = 1 , , T = 150. Next, we compare ourgraph-based clustering method as shown in Algorithm 1 with the model-based clusteringmethod (Fratey & Raftery , 2002) implemented in the R function Mclust (Fraley et al. ,2020). Fig. 3 shows the averaged strict purity score as in (5) for estimated clusters basedon these two methods for σ = 0 . , . , . . . ,

1, diﬀerent n i ’s and 100 replications.It can be seen from Fig. 3 that our method is very accurate for diﬀerent σ and sizes ofclasses because its very strict purity scores are close to 1. This agrees with the conclusion inTheorem 1. In contrast, the performance of the model-based clustering method is aﬀectedby large σ and large sizes of classes. Specially, when n = 20, n = 100 and n = 200,the computation is signiﬁcantly slower compared with our method. The comparison ofcomputing time is not presented here. We continue to use the data set of log-transformed infection counts from December 1,2019 to April 20, 2020 from Chinese provinces/regions and the 33 countries, and present theclustering and S-shaped ﬁtting with change points. Here, two autoregressive components( p = 2) in (1) are suggested. Based on the graph-based clustering Algorithm, the clusters of COVID-19 in China andthe rest of the world are presented in Fig. 4, where the optimal path is presented as a cyclewith vertexes representing clusters in diﬀerent colors and overextended curves. This way of15resentation is to transmit three aspects of information: (i) this analysis is for virus data,therefore, we should use the cycle and the sharp nodes to describe the structure of the virus;(ii) the optimal graph is a path connecting all nodes where nodes can be provinces/regionsin China or countries in the world; and (iii) readers can quickly ﬁnd the diﬀerent clustersand where to separate them from the path.From Fig. 4, we observe the following.(1) As shown in COVID-19 cases in China, the 34 provinces/regions are clustered into7 categories. Speciﬁcally, Hubei (HB), Xizang (XZ), Qinghai (QH), Macao (MO), HongKong (HK), and Taiwan (TW) are individually clustered into separate categories, and theremaining provinces/regions are all clustered into one category. This clustering result canbe explained by the diﬀerences in epidemic control strategies among the provinces/regions:HB is the center of the COVID-19 breakout, with a large number of infection cases; un-derpopulated XZ and QH are both located on the Qinghai Tibet Plateau, with only a fewinfection cases; MO, HK, and TW are of self-governance: meaning their epidemic controlstrategies are diﬀerent from all other regions in China. The model-based clustering method(Fratey & Raftery , 2002) suggests both HK and TW are to be in one cluster, which maynot be correct.(2) As shown in COVID-19 cases in the world, the 33 selected countries are clusteredinto 8 categories. Speciﬁcally, China (CN), Korea (KR), Japan (JP), Spain (ES), andTurkey (TR) are individually clustered into separate categories; Italy (IT) and Iran (IR)are clustered into one category; the United States of America (US), Germany (DE), France(FR), the United Kingdom of Great Britain (UK), Northern Ireland (GB), and Canada(CA) are clustered into one category; and the remaining countries are all clustered intoone category. This clustering result is partly based on the timing of COVID-19 outbreaksin those countries. For example, the ﬁrst large-scale outbreak was in CN, followed by KR16nd JP. After that, infections in IR and IT experienced rapid growth, followed by theoutbreaks in European countries and the US. Finally, the epidemic spread worldwide. Inaddition, the clustering is also based on the epidemic control strategies in each country.For example, in KR and JP, even while the epidemic broke out around the same time, thetwo countries had taken diﬀerent strategies: JP adopted a “defensive strategy”’ to ensurethe health care system operated normally as usual, while KR used an “aggressive attackstrategy” to comprehensively detect infections.

Based on the BIC-based ICSS Algorithm, we segment the curve time series and presentthe segmented ﬁttings and conﬁdence interval estimation for the log-transformed infectioncounts Z i,t (1 ≤ i ≤ N ) of each cluster in China and the rest of the world; see Fig. 5 and 6,respectively.We can obtain that all sigmoid curves share the form of multiple stages and multiplechange points, with the exception of Cluster 7 (XZ) in China, with only one infection; thecalculated change points of each cluster can still be explained by the diﬀerences in epidemiccontrol strategies. See the details below.(1) As shown in Fig. 5 A and Fig. 6 A, the sigmoid curves and change points arealmost the same because HB province was the center of the COVID-19 outbreak in CN.In Fig. 6 A in CN, the ﬁrst segment (19/12/01 to 19/12/13) was the germination periodof the outbreak. In the second segment (19/12/13 to 20/01/16), COVID-19 seemed tohave been controlled in CN. However, because many COVID-19 cases had not been founddue to varied epidemic control strategies in the previous two stages, COVID-19 broke outin the third segment (20/01/16 to 20/01/26) and fourth segment (20/01/26 to 20/02/11)in CN. This coincided with Chunyun (the annual massive movement of people during17hinese Lunar New Year), which particularly accelerated the outbreak. Finally, in the lasttwo segments (20/02/11 to 20/02/27 and 20/02/27 to 20/04/20), COVID-19 was controlledand stabilized once the CN government implemented very strict epidemic control strategies,such as traﬃc control and home quarantine.(2) As shown in Fig. 5 C, the sigmoid curves in HK and TW seem similar becausethey were both strongly aﬀected by COVID-19 cases from mainland China. However, weﬁnd that the change points of COVID-19 in TW are about a week delayed compared tothose in HK after COVID-19 started to break out in both regions. This is because TWresponded in a timely manner to the COVID-19 outbreak and controlled it more quicklyand eﬀectively than HK, while the implementation of epidemic control strategies in HKlagged behind.(3) As shown in Fig. 6, the number of new cases in China had tentatively stabilizedsince the last change point, 20/02/27, which was delayed by about one week in otherclusters. In Fig. 6 A and B, the infections in CN and KR are mostly stable, but theepidemic situations in other countries have not been controlled eﬀectively. Take the ﬁfthcluster (Fig. 6 C) as an illustration, considering that this cluster had the fastest growth.The four segments can be explained as follows: (i) the infections in the ﬁrst segmentwere mainly from oversea imports; (ii) in the second segment, COVID-19 seemed to havebeen controlled; (iii) COVID-19 broke out because of many unfound COVID-19 cases inprevious segments; and (iv) in the last segment, COVID-19 began to come under controlas governments declared states of emergency and started implementing strict measures tocontrol the spread of the virus.(4) As shown in Fig. 6 B, conﬁdence intervals for KR tended to be quite narrow inwidth when the number of new cases had tentatively stabilized, resulting in more preciseestimates of mean response, whereas conﬁdence intervals for JP tended to be wide since JP18ad adopted a “defensive strategy”. In most of cases, conﬁdence intervals produced preciseresults. A clustering-segmented autoregressive sigmoid model is developed to explore the space-time pattern of the log-transformed infectious count by the end of April 20, 2020. Itperformed well when it was applied to COVID-19 cases in both China and the 33 countries,and thus provides an eﬃcient statistical model of COVID-19 spread to help ﬁght against thevirus. Currently, the infections in China are mostly stable, and the graph-based clusteringalgorithm is robust to the clusters from the 34 provinces/regions in China. When COVID-19began to come under control, the clustering of the disease globally will become increasinglystable.In fact, the CSAS model can adapt to an extended period of time when clusters havebeen updated and new change points have been identiﬁed. To do so, we use the last changepoints in time, 20/03/07 obtained from Fig. 5 or 2020/03/08 obtained from Fig. 6, as thestart of the extended period at two-month intervals, from 20/03/07 to 20/05/07 or 20/03/08to 20/05/08. In Fig 9, we show segmentations and ﬁttings for log-transformed infectioncounts of each cluster in both China and the 33 countries during this extended period. Wecan see that the ﬁttings continue to work well. We provide an R package, GraphCpClust ,which can be accessed from https://github.com/Meiqian-Chen/GraphCpClust. From this R package, users can obtain the same results presented in this paper and can model datafor another extended period of time. In addition, the data and code for another two papers(Shi, Wu and Rao, 2017, 2018) are included in this R package.Regarding the dataset used in this article, Wuhan-2019-nCoV, we make the follow-19ng additional remarks: 1. Back in early March 2020, there were very few datasets onCOVID-19, and especially few datasets containing timely epidemic data from each Chineseprovince. This dataset, Wuhan-2019-nCoV, collects national outbreak reports from WHO,as well as daily outbreak reports from provincial health and family planning commissionsin China; 2. The Wuhan-2019-nCoV dataset is very timely updated and has been includedin the “Open Source Wuhan” data resource. Therefore, we believe that the data quality ofthe Wuhan-2019-nCoV dataset is trustworthy. There are now more and more COVID-19data resources available, such as WHO data (https://covid19.who.int/) and Our World inData (https://ourworldindata.org/covid-data-switch-jhu). For these two datasets, we ﬁndthat our model still works very well. Please see this webpage, http://graph-clustering-system.com/, for the three data analyses described above. References

Akaho, S. (1995). Mixture model for image understanding and the EM algorithm, https://staﬀ.aist.go.jp/s.akaho/papers/ETL-TR-95-13E.pdf . Bai, J. S. & Perron, P. (2003). Computation and analysis of multiple structural changemodels.

J APPL ECONOM , 1-22.

Biswas, M. , Mukhopadhyay, M. & Ghosh, A. K. (2014). A distribution-free two-sample run test applicable to high-dimensional data.

Biometrika , 913-926.

Csardi, G. & Nepusz, T. (2006). The igraph software package for complex networkresearch.

InterJournal , Complex Systems 1695. 2006. http://igraph.org. Accessed April1, 2020. 20 ietza, K. , Heesterbeek, J. A. P. (2002). Daniel Bernoulli’s epidemiological modelrevisited.

MATH BIOSCI , 1-21

Ester, M. , Kriegel, H. P. , Sander, J. & Xu, X.W. (1996). A Density-Based Algo-rithm for Discovering Clusters in Large Spatial Databases with Noise.

KDD-96 Proceed-ings , 226-231.

Fraley, C. & Raftery, A.E. (2020) mclust: Gaussian Mixture Modelling for Model-Based Clustering, Classiﬁcation, and Density Estimation. R package version 2.2-5 .

Avail-able at https://cran.r-project.org/web/packages/fpc/index.html . Fratey, C. & Raftery, A.E. (1993). Model-based clustering, discriminant analysis anddensity estimation.

J AM STAT ASSOC , 611-631.

Incl´an, C. & Tiao, G.C. (1994). Use of cumulative sums of squares for retrospectivedetection of changes of variance. Publications of the American Statistical Association.

JAM STAT ASSOC , , 913-923. Jin, B. S. , Wu, Y. H. , Rao, C. R. & Hou, L. (2020). Estimation and model selectionin general spatial dynamic panel data models.

Proc Natl Acad Sci , 5235-5241.

Katul, G. G. , Mrad, A. , Bonetti, S. , Manoli, G. , &

Parolari, A. J. (2020).Global convergence of COVID-19 basic reproduction number and estimation from early-time SIR dynamics. medRXiv . Kowsar, R. , Keshtegar., B. , Marey, M. A. , &

Miyamoto, A. (2017). An autore-gressive logistic model to predict the reciprocal eﬀects of oviductal ﬂuid components onin vitro spermophagy by neutrophils in cattle.

SCI REP-UK , 4482.21 iu, X.Z. , &

Stechlinski, P. (2017). Infectious Disease Modeling.

A Hybrid SystemApproch , Springer, Heidelberg.

Manning, C.D. , Raghavan, P. & Sch¨utze, H. (2008). Introduction to InformationRetrieval, Cambridge University Press, Cambridge, England.

Murray, J. D. (1989) Mathematical Biology. Springer, Heidelberg.

Murtagh, F. & Legendre, P. (2014). Ward’s hierarchical agglomerative clusteringmethod: which algorithms implement ward’s criterion?

J CLASSIF , 274-295.

R Core Team

Shi, X.P. , Wang, X.S. , Wei, D.W. & Wu, Y.H. (2016). A sequential multiple change-point detection procedure via VIF regression.

Comput Stat , 671-91. Shi, X.P. , Wu, Y.H. & Rao, C.R. (2017). Consistent and powerful graph-based change-point test for high-dimensional data.

Proc Natl Acad Sci , 3873-8.

Shi, X.P. , Wu, Y.H. & Rao, C.R. (2018). Consistent and powerful non-Euclideangraph-based change-point test with applications to segmenting random interfered videodata.

Proc Natl Acad Sci , 5914-5919.

Tocher, K. D. (1963). The Art of Simulation.

English University Press , London.

Wang, J. A. & Hartiganm, A. (1979). Algorithm as 136: a k-means clustering algorithm.

J ROY STAT SOC A STA , 100-108.22 ang, et al. (2020). Modiﬁed SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions.

J Thorac Dis , 165-174.23 B Log ( + i n f e c t i on ) HBTWHKMOQHXZGDBJSHZJAHJX SDSCHAHNCQJSGXHIYNHLHEFJ SNJLGZNXTJSXLNNMGSXJ

COVID−19 in China

CN US l og ( + i n f e c t i on ) CNJPKRUSFRCADEINPHBEES GBITRUSEEGIRATCHBRDZDK MKRONLECIEDOIDPTPLPETR

COVID−19 in World − tClass 1 Class 2 Class 3 Figure 2: Plots of S-curves for three classes.24 lgorithm 1:

Graph-based clustering Algorithm

Result:

Output the optimal clusters A . Notations: x ( θ ) = { x s | x s ≤ θ, s = 1 , . . . , K − } where x s is deﬁned in (7); θ ( s ) is the s ’th largest element in set { x s , s = 1 , . . . , K − } ; ˆ σ ( θ ) is the sample variance of x ( θ ); BIC( θ, A ) = ( K −

1) log(ˆ σ ( θ )) + 2 L ( A ) log( K − L ( A ) is the number ofclusters in A ; Initialize : Let i = 1, L = 1, and A = {A } where A = { , . . . K } ; for s = 2; s < K − s = s + 1 do Let θ be θ ( s ) and calculate the clusters based on E ∗ ( P , θ ) in (3) denoted as A temp ; if BIC( θ, A temp ) < BIC( θ ( s − , A ) then A = A temp ; else break; end end − . − . . . . n =

1, n =

10, n = σ S * − . − . . . . n =

5, n =

20, n = σ S * − . − . . . . n =

20, n = = σ S * Graph−based clustering Model−based clustering

Figure 3: Comparisons of graph-based clustering method and model-based clusteringmethod. 25 lgorithm 2:

BIC-based ICSS Algorithm

Result:

Output the estimated change points ˆ C i . Notations: ˆ C i, ( s ) and | ˆ C i | are the s th smallest element and number of elements ofset ˆ C i , respectively; Initialization:

Let t − = 1, t + = T , and ˆ C i = { , T } ; while t + − t − > ∆ do t ﬁrst ← t + ; t last ← t − ; while BIC i, ( t − , t ﬁrst ) ≥ BIC i, ( t − , t ﬁrst ) do t ﬁrst ← ˆ t i,t − ,t ﬁrst ; end while BIC i, ( t last , t + ) ≥ BIC i, ( t last , t + ) do t last ← ˆ t i,t last ,t + ; end if t ﬁrst = t last then ˆ C i ← ˆ C i ∪ { t ﬁrst } ; break; else ˆ C i ← ˆ C i ∪ { t ﬁrst , t last } ; t − ← t ﬁrst ; t + ← t last ; end end for s = 2; j < | ˆ C i | ; s = s + 1 do if BIC i, ( ˆ C i, ( s − + 1 , ˆ C i, ( s +1) ) ≤ BIC i, ( ˆ C i, ( s − + 1 , ˆ C i, ( s +1) ) then ˆ C i ← ˆ C i \ { ˆ C i, ( s ) } ; end end ˆ C i ← ˆ C i \ { , T } ; 26 OVID−19 in China

HBGDZJHAHNAHJXSDJSSCCQBJSHHLHEFJSNGXYNHILNSX TJ GZ GS NM JL XJ NX TW HK MO QH XZ

COVID−19 in world

CNIRITKRJPCAGBFRDEUSESCHNLBEATSEDKPTBRIEROID PL PE EC RU IN PH EG DZ MK DO TR

Figure 4: Plots of clusters in China and in 33 selected countries based on the log-transformed infection counts. l og ( + i n f e c t i on ) Cluster 1(HB) A l og ( + i n f e c t i on ) Cluster 2(GD,ZJ,HA,HN,AH,JX,SD,JS,SC,CQ,BJ,SH,HL,,HE,FJ,SN,GX,YN,HI,LN,SX,TJ,GZ,GS,NM,JL,XJ,NX) B l og ( + i n f e c t i on ) Cluster 3Cluster 4

Cluster 3(TW) and Cluster 4(HK) C l og ( + i n f e c t i on ) Cluster 5Cluster 6Cluster 7

Cluster 5(MO), Cluster 6(QH) and Cluster 7(XZ) D Figure 5: Plots of segmentations and ﬁttings of provinces/regions in China based on thelog-transformed infection counts. 27 l og ( + i n f e c t i on ) Cluster 1Cluster 2

Cluster 1(CN) and Cluster 2(IR,IT) A l og ( + i n f e c t i on ) Cluster 3Cluster 4

Cluster 3(KR) and Cluster 4(JP) B l og ( + i n f e c t i on ) Cluster 5Cluster 6

Cluster 5(US,DE,FR,GB,CA) and Cluster 6(ES) C l og ( + i n f e c t i on ) Cluster 7Cluster 8

Cluster 7(CH,NL,BE,AT,SE,DK,PT,BR,IE,RO,ID,PL,PE,,EC,RU,IN,PH,EG,DZ,MK,DO) and Cluster 8(TR) D Figure 6: Plots of segmentations and ﬁttings of 33 selected countries in the world based onthe log-transformed infection counts. − . . . . A − . . . . B − . − . . . C Figure 7: Plots of residuals of the last segment from the CSAS model for two separateprovinces in China, NM (A) and TJ (B), and their common cluster 2 (C). − . − . . A − . . . . B − . − . . . C − . − . . . . D . . . . . E − . − . . . . F Figure 8: Plots of residuals for HB in China for the fourth segment (A), the ﬁfth segment(B), and the last segment (C) from the CSAS model, for the last segment from the CSASmodel without autoregressive terms (D), for the whole period from the autoregressive modelof order 2 after taking the ﬁrst diﬀerence (E), and for the whole period from the CSASmodel without change points (F). 28 l og ( + i n f e c t i on ) Cluster 1Cluster 2

Cluster 1 (HB) and Cluster 2 (GD,HA,ZJ,HN,AH,JX,SD,JS,CQ,SC,BJ,,SH,HL,HK,TW,FJ,HE,SN,GX,NM,SX,TJ,YN,HI,GZ,GS,LN,JL,XJ,NX) A l og ( + i n f e c t i on ) Cluster 3Cluster 4

Cluster 3 (QH,MO) and Cluster 4 (XZ) B l og ( + i n f e c t i on ) Cluster 1Cluster 2

Cluster 1 (CN,IR,IT,ES,DE,FR,GB,BE,NL,CH,AT,SE,JP,DK,PT,CA,BR,,RU,PE,EC,IN,IE,RO,PL,ID,PH,EG,DZ,MK,DO) and Cluster 2 (KR) C l og ( + i n f e c t i on ) Cluster 3Cluster 4