Generalized k-Means in GLMs with Applications to the Outbreak of COVID-19 in the United States
Tonglin Zhang∗ and Ge Lin†

August 11, 2020
Abstract
Generalized k-means can be incorporated with any similarity or dissimilarity measure for clustering. By choosing the dissimilarity measure as the well-known likelihood ratio or F-statistic, this work proposes a method based on generalized k-means to group statistical models. Given the number of clusters k, the method is established under hypothesis tests between statistical models. If k is unknown, then the method can be combined with GIC to automatically select the best k for clustering. The article investigates both AIC and BIC as special cases. Theoretical and simulation results show that the number of clusters can be identified by BIC but not AIC. The resulting method for GLMs is used to group the state-level time series patterns for the outbreak of COVID-19 in the United States. A further study shows that the statistical models between the clusters are significantly different from each other, which confirms the result given by the proposed method based on generalized k-means.

Key Words:
Clustering; COVID-19; Dissimilarity measure; Generalized k-means; Generalized information criterion; Generalized linear models (GLMs).

1 Introduction

Generalized k-means, including both k-means and k-medians as special cases, can be incorporated with any similarity or dissimilarity measure to group objects (or observations). The similarity or dissimilarity measure can be very general. In this work, we specify the dissimilarity measure as the well-known likelihood ratio or F-statistic, such that our method can be used to group statistical models. In particular, if each object contains a vector for the response and a design matrix for the explanatory variables, then statistical models can be used to describe the relationship between the response and the explanatory variables within the object. A clustering problem arises if we want to group the statistical models between objects. This problem can be solved by generalized k-means. The current research proposes the method and uses it to group the state-level time series patterns for the outbreak of COVID-19 in the United States.

∗ Department of Statistics, Purdue University; 250 North University Street, West Lafayette, IN 47907-2066; email: [email protected]
† Department of Environmental and Occupational Health, University of Nevada Las Vegas, Las Vegas, NV 89154; email: [email protected]

Clustering is one of the most popular unsupervised statistical learning methods for unknown structures. Clustering methods are often carried out by a similarity or dissimilarity measure between objects such that they can be grouped into a few clusters. The purpose of clustering is to make objects within clusters mostly homogeneous and objects between clusters mostly heterogeneous. In the literature, one of the most well-known clustering methods is the k-means. It assigns each object to the cluster with the nearest mean. Based on a given k, the k-means provides k clusters according to k centers.
The k centers are solved by minimizing the sum-of-squares (SSQ) criterion, which is derived based on the Euclidean distance between objects in the data. The SSQ criterion in the k-means can be replaced by any similarity or dissimilarity measure, leading to the generalized k-means [1, 27]. Because the choice of the similarity or dissimilarity measure is flexible, generalized k-means can be extended to any divergence measure for clustering.

Many clustering methods have been proposed in the literature. Examples include hierarchical clustering [34], fuzzy clustering [29], density-based clustering [18], model-based clustering, and partitioning clustering. Model-based clustering is usually carried out by EM algorithms or Bayesian methods under the framework of mixture models [11, 20]. Partitioning clustering can be interpreted by the centroidal Voronoi tessellation method in mathematics [6]. It can be further specified to k-means [10, 14, 21, 22], k-medians [2], and k-modes [12], where k-means is the most popular. To implement those, one needs to express observations of the data in a metric space, such that a distance measure can be defined. Several approaches have been developed to specify the distance measure. A review of these can be found in [17], p. 670.

Challenges appear when we want to group the time series patterns for the outbreak of COVID-19 in the United States. Suppose that the time series patterns from individual states have been analyzed by statistical models. Then, we need to study the relationship between these models. It is inappropriate to directly compare their coefficients because of disparity. To overcome the difficulty, we recommend using likelihood ratio statistics or F-statistics derived based on hypothesis tests for the relationship between these models, leading to the generalized k-means in GLMs. Our method aims to make models within clusters mostly homogeneous and models between clusters mostly heterogeneous.
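The generic alternation behind generalized k-means can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's Algorithm 1: here clusters are summarized by a center and the dissimilarity is a pluggable function, whereas the proposed method assigns objects by test p-values.

```python
import numpy as np

def generalized_kmeans(points, k, dissimilarity, n_iter=20, seed=0):
    """Alternate between (i) assigning each object to the cluster with the
    smallest dissimilarity and (ii) re-estimating the cluster centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # assignment step under the chosen dissimilarity measure
        labels = np.array([
            min(range(k), key=lambda s: dissimilarity(z, centers[s]))
            for z in points
        ])
        # update step: here the center is the cluster mean; for k-medians
        # the coordinate-wise median would be used instead
        for s in range(k):
            if np.any(labels == s):
                centers[s] = points[labels == s].mean(axis=0)
    return labels, centers

# the squared Euclidean distance recovers the classical k-means (SSQ criterion),
# while the unsquared norm corresponds to k-medians
d_kmeans = lambda z, c: float(np.sum((z - c) ** 2))
d_kmedians = lambda z, c: float(np.sqrt(np.sum((z - c) ** 2)))
```

Swapping `dissimilarity` for a model-based statistic is exactly the flexibility the generalized k-means exploits.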
Because of differences in population sizes, transportation patterns, and social activities, the numbers of daily new cases among the fifty states and Washington DC cannot be identical or even similar, implying that it is inappropriate to compare all of the coefficients. Instead, we compare some of the coefficients. This is called the unsaturated clustering problem. In particular, we partition the coefficient vector into two sub-vectors. The first sub-vector does not contain any time information, so we only need to study the second sub-vector, and we implement the generalized k-means on the second sub-vector only. This problem can be partially reflected by Figure 1.
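The idea of clustering on a sub-vector only can be made concrete with simulated regression lines (all numbers below are invented for illustration): six lines with six different intercepts but only two distinct slopes are grouped into two clusters by their slopes alone.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 30)

# six lines: intercepts all differ, but slopes fall into two groups
slopes = [0.5, 0.5, 0.5, -0.5, -0.5, -0.5]
intercepts = [0.0, 2.0, 4.0, 1.0, 3.0, 5.0]
objects = [a + b * t + rng.normal(0, 0.1, t.size)
           for a, b in zip(intercepts, slopes)]

def fit_line(y):
    """Per-object OLS fit; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(t), t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# clustering is restricted to the slope; intercepts stay free within clusters
fitted_slopes = np.array([fit_line(y)[1] for y in objects])
labels = (fitted_slopes > 0).astype(int)
```

Here the one-dimensional slope split is trivial; the paper's method replaces it with hypothesis tests so that it works for general GLM coefficients.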
Figure 1: Generalized k-means clustering for six regression lines.

Suppose that six regression lines are compared, as in Figure 1. Note that the intercepts do not contain any time information. We allow them to vary within clusters. We restrict the generalized k-means to the slopes only, leading to two clusters.

We investigate our method with either a known or an unknown k. In the case when k is unknown, we propose GIC to select the best k. We specify it to BIC and AIC. We find that BIC is more reliable than AIC in selecting the number of clusters. Therefore, we recommend using the BIC selector. We compare our method with a previous method based on the EM algorithm [26]. Our simulation results show that the number of clusters can be identified by our proposed method but not by the previous method: the previous EM algorithm cannot identify the number of clusters if the true number of clusters is greater than two. We implement our method to the state-level time series data for the outbreak of COVID-19 in the United States. We find six clusters.

The article is organized as follows. In Section 2, we propose our method. In Section 3, we study theoretical properties of our method. In Section 4, we evaluate our method with a comparison to a previous method by simulation studies. In Section 5, we implement our method to the state-level COVID-19 data in the United States. In Section 6, we provide a discussion.

2 Methodology

We propose our method under a given k in Section 2.1. It can be combined with GIC [33] to select the best k when k is unknown. This is introduced in Section 2.2. In Section 2.3, we specify our method to regression models for normal data and loglinear models for Poisson data. The two models will be used in our simulation studies in Section 4. The loglinear model for Poisson data can be extended to a model with overdispersion for quasi-Poisson data.
It will be used in clustering for the state-level COVID-19 data in the United States.

2.1 Generalized k-Means in GLMs

Clustering is the problem of partitioning a set of N objects, denoted by S = {z_1, ..., z_N}, into several non-empty subsets or clusters, such that the objects within clusters are mostly homogeneous and the objects between clusters are mostly heterogeneous. In the case when the objects can be expressed by points in a Euclidean space, the k-means partitions S into k distinct clusters denoted by C = {C_1, ..., C_k}, with C given by

C = argmin_C Σ_{s=1}^k Σ_{i∈C_s} ‖z_i − c_s‖², (1)

where c_s is the center point of C_s. The right-hand side of (1) is called the SSQ criterion in the k-means. The generalized k-means is obtained if the SSQ criterion is extended to an arbitrary dissimilarity measure. In particular, let d(z, C) be a selected dissimilarity measure, with z representing an object and C representing a cluster. The generalized k-means solves C by

C = argmin_C Σ_{s=1}^k Σ_{i∈C_s} d(z_i, C_s). (2)

When the z_i are points in a Euclidean space, the generalized k-means reduces to the k-means if one chooses d(z_i, C_s) = ‖z_i − c_s‖². It reduces to the k-medians if one chooses d(z_i, C_s) = ‖z_i − c_s‖. Because d(z, C) is flexible in (2), the generalized k-means can be combined with any statistical models. Suppose that z_i is composed of a vector of responses and a design matrix of explanatory variables. Then, we can use a statistical model to describe the relationship between the response and the explanatory variables. The response and explanatory variables can be general, implying that the generalized k-means can be implemented on various kinds of data, such as text, DNA strains, or images. Here, we restrict our attention to GLMs for continuous or count responses. Our task is to group the GLMs for the objects into a number of clusters.

Suppose that z_i contains a response vector y_i = (y_{i1}, ..., y_{in_i})⊤ and a design matrix X_i = (x_{i1}, ..., x_{in_i})⊤, such that the sample size of the data is n = Σ_{i=1}^N n_i. Assume that y_{i1}, ..., y_{in_i} are independently obtained from an exponential family distribution with the probability mass function (PMF) or the probability density function (PDF)

f(y_ij) = exp{[y_ij θ_ij − b(θ_ij)]/a(φ) + c(y_ij, φ)}, (3)

where θ_ij is a canonical parameter representing the location and φ is a dispersion parameter representing the scale. The linear component η_ij is related to the explanatory variables by η_ij = x_ij⊤ β_i. The link function g(·) connects μ_ij = E(y_ij) = b′(θ_ij) and η_ij through

η_ij = g(μ_ij) = g[b′(θ_ij)] = x_ij⊤ β_i (4)

for all i ∈ {1, ..., N} and j ∈ {1, ..., n_i}, where θ_ij = h(x_ij⊤ β_i) with h(·) the inverse function obtained from (4). The variance of the response is V(y_ij) = a(φ)v(μ_ij), where v(μ) = b″{h[g(μ)]} is the variance function of the model. If the canonical link is used, then (4) becomes η_ij = θ_ij = g(μ_ij) = x_ij⊤ β_i, implying that h(·) is the identity function.

The MLEs of β_i, denoted by β̂_i, can only be solved numerically if the distribution is not normal. A popular and well-known algorithm is the iteratively reweighted least squares (IRWLS) [13]. The IRWLS is equivalent to the Fisher scoring algorithm. It is identical to the Newton-Raphson algorithm under the canonical link. After β̂_i is derived, a straightforward method is to estimate φ by moment estimation [24] as

a(φ̂) = (1/df) Σ_{i=1}^N Σ_{j=1}^{n_i} (y_ij − μ̂_ij)²/b″[h(x_ij⊤ β̂_i)], (5)

where μ̂_ij = b′[h(x_ij⊤ β̂_i)] and df is the residual degrees of freedom in the derivation of the β̂_i. If φ is not present in (3), then (5) is not needed. This occurs in Bernoulli, binomial, and Poisson models. The IRWLS is the standard algorithm for fitting GLMs. It is adopted by many software packages, such as R, SAS, and
Python.

Our interest is to determine whether the β_i can be grouped into a few clusters, such that β_i = β_{i′} if objects i and i′ are in the same cluster and β_i ≠ β_{i′} otherwise. The regression version of this problem has been previously investigated in gene expressions by an EM algorithm [26], which asks whether the entire coefficient vectors can be partitioned into a few clusters. In our method, we allow a few components of β_i to be different within clusters, such that we only need to partition the objects based on the remaining components.

Suppose that (4) is expressed as

η_ij = x_{1ij}⊤ β_{1i} + x_{2ij}⊤ β_{2i}, (6)

where x_ij = (x_{1ij}⊤, x_{2ij}⊤)⊤ and β_i = (β_{1i}⊤, β_{2i}⊤)⊤. We want to know whether the β_{2i} can be grouped into a few clusters. That is, we want β_{2i} = β_{2i′} if objects i and i′ are in the same cluster and β_{2i} ≠ β_{2i′} otherwise. Based on a given C, the clustering model in our method is

g(μ_ij) = x_{1ij}⊤ β_{1i} + x_{2ij}⊤ β_{2s} (7)

for z_i ∈ C_s. We call (7) the unsaturated clustering problem. If β_{1i} is absent, it becomes the saturated clustering problem, the problem studied by [26]. As the choice of x_{1ij} and x_{2ij} is flexible in (7), our method can be used to group GLMs based on any sub-vectors of β_i. We use either a likelihood ratio statistic or an F-statistic to construct d(z_i, C_s) in (2).

Our method starts with a set of cluster candidates C, which can be any partition of S. For any given z_i ∈ S and C_s ∈ C, if z_i ∉ C_s, we directly use z_i and C_s in the derivation of d(z_i, C_s); otherwise, we remove z_i from C_s in the derivation, so that the resulting C_s does not contain z_i. We obtain the likelihood ratio statistic or the F-statistic for

H_0: β_{2i} = β_{2s} ↔ H_1: β_{2i} ≠ β_{2s}. (8)

The likelihood ratio statistic is used if φ is absent in (3), as in Poisson and binomial models. The F-statistic is used if φ is present, as in normal models.
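A likelihood-ratio dissimilarity of this kind can be sketched for the saturated Poisson case (β_{1i} absent): fit the object and the cluster separately and pooled, and compare loglikelihoods. The IRWLS loop and the data below are a simplified illustration under assumed designs, not the paper's full implementation.

```python
import numpy as np
from scipy.stats import chi2

def fit_poisson(X, y, n_iter=30):
    """IRWLS for a Poisson loglinear model (canonical log link)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        z = X @ beta + (y - mu) / mu          # working response
        W = mu                                # working weights
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

def loglik(X, y, beta):
    # Poisson loglikelihood with constants in y dropped
    return float(np.sum(y * (X @ beta) - np.exp(X @ beta)))

def lr_dissimilarity(Xi, yi, Xs, ys, q):
    """Likelihood ratio statistic (and p-value) for H0: object i shares
    the coefficient vector of cluster s; q restricted coefficients."""
    ll_sep = (loglik(Xi, yi, fit_poisson(Xi, yi))
              + loglik(Xs, ys, fit_poisson(Xs, ys)))
    X = np.vstack([Xi, Xs])
    y = np.concatenate([yi, ys])
    ll_pool = loglik(X, y, fit_poisson(X, y))
    lam = 2.0 * (ll_sep - ll_pool)
    return lam, float(chi2.sf(lam, q))
```

Objects generated from the same model should yield a moderate statistic (asymptotically χ² with q degrees of freedom), while objects from different models should yield a large statistic and a tiny p-value.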
For each z_i ∈ S and C_s ∈ C, we calculate the p-value of the likelihood ratio statistic or the F-statistic. We assign z_i to the cluster candidate with the largest p-value. Then, we obtain the updated set of cluster candidates, which is used in the next iteration. To ensure that C_s is non-empty, we do not move the object with the largest p-value in C_s to any other cluster candidate. We have the following algorithm.

Algorithm 1: Generalized k-means in GLMs
Input: S = {z_1, ..., z_N} with z_i = {y_i, X_i}.
Output: C = {C_1, ..., C_k} and the value of the likelihood ratio or the F-statistic based on the resulting C.
Step 1 (Initialization): find distinct z_{i_1}, ..., z_{i_k} such that they are most dissimilar, and use them to generate the initial C.
Step 2: for each C_s, compute the p-value of z_i under (8) for every z_i ∈ C_s.
Step 3: keep the object with the largest p-value in C_s.
Step 4: for every other z_i, compute its p-values under (8) for every C_s ∈ C.
Step 5: assign z_i to the cluster candidate with the largest p-value.
Repeat Steps 2 to 5 until the clusters no longer change, then output.

Algorithm 1 has two major stages. The second stage is given by Steps 2 to 5, which is common in many k-means algorithms. Therefore, we only discuss the first stage, which is given by Step 1. The goal of the first stage is to find the best initial C. We do not use the usual approach adopted by many k-means algorithms, which select the initial set of cluster candidates randomly. Instead, we want to make the initial C as heterogeneous as possible. At the beginning, we randomly choose the first z_i from S. We denote it as z_{i_1}. It is the seed for C_1. We calculate the p-value of the likelihood ratio statistic or the F-statistic for

H_0: β_{2i} = β_{2i_1} ↔ H_1: β_{2i} ≠ β_{2i_1} (9)

for every i ≠ i_1. The object with the lowest p-value is chosen as the seed for C_2. It is denoted by z_{i_2}. Then, we incorporate the minimax principle to select the remaining seeds. Suppose that z_{i_1} and z_{i_2} are selected.
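The minimax seed selection can be sketched abstractly, with `pvalue(i, j)` standing for any pairwise test between the models of objects i and j (a sketch under that assumption; the paper's tests are the likelihood ratio and F-statistics):

```python
def minimax_seeds(n_objects, k, pvalue):
    """Pick k mutually heterogeneous seeds: each candidate's score is its
    best match (largest p-value) against the current seeds, and the next
    seed is the candidate whose best match is still the worst."""
    seeds = [0]  # first seed: arbitrary (the paper draws it at random)
    while len(seeds) < k:
        rest = [i for i in range(n_objects) if i not in seeds]
        next_seed = min(rest, key=lambda i: max(pvalue(i, j) for j in seeds))
        seeds.append(next_seed)
    return seeds
```

With a toy p-value function that is large within groups and small between groups, the selector picks one seed per group.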
For each i ≠ i_1, i_2, we calculate the p-values of the likelihood ratio statistic or the F-statistic based on the two testing problems

H_0: β_{2i} = β_{2j} ↔ H_1: β_{2i} ≠ β_{2j}, (10)

where the two testing problems are derived by taking j = i_1 and j = i_2, respectively. We obtain two p-values. We assign the maximum of the two p-values as the p-value of z_i. The object with the lowest p-value is chosen as the seed for C_3. It is denoted by z_{i_3}. Using this idea, we can obtain all the seeds z_{i_1}, ..., z_{i_k}. Then, we reconsider the testing problem given by (10). For a given z_i with i
∉ {i_1, ..., i_k}, we calculate the p-values of the likelihood ratio statistic or the F-statistic for all j ∈ {i_1, ..., i_k}. We assign z_i to C_s if the p-value is maximized at j = i_s. After doing this for all the remaining objects, we obtain the initial C. Combining it with the second stage, we obtain k non-empty clusters with the corresponding value and p-value of the likelihood ratio statistic or the F-statistic.

2.2 Selection of the Number of Clusters

The generalized k-means introduced in Section 2.1 cannot be used if k is unknown. As the likelihood function is provided in Algorithm 1, we can use it to define a penalized likelihood function, which is used to determine the number of clusters if k is unknown. The penalized likelihood approach has been widely applied in variable selection problems. Here, we adopt the well-known GIC approach [33] to construct our objective function. The best k is obtained by optimizing the objective function.

Let ℓ(ω_C) be the loglikelihood of (7), where ω_C represents all of the parameters involved in the model. If the dispersion parameter is not present, then ω is composed of the β_{1i} and β_{2s} for i ∈ {1, ..., N} and s ∈ {1, ..., k} only. It is enough for us to use ℓ(ω_C) to define the objective function in GIC. If the dispersion parameter is present, then we need to address the impact of the estimation of a(φ), because variance can be seriously underestimated in the penalized likelihood approach under the high-dimensional setting [7]. We first introduce our approach based on (3) without a(φ), and then propose a modification for when it is present.

Assume that a(φ) does not appear in (3). The GIC for (7) is defined as GIC_κ(C) = −2ℓ(ω̂_C) + κ df_C, where ω̂_C is the MLE of ω, df_C is the model degrees of freedom under C, and κ is a positive number that controls the properties of GIC. Let q_1 be the dimension of β_{1i} and q_2 be the dimension of β_{2i}. Then, df_C = Nq_1 + kq_2. Note that Nq_1 does not vary with k.
We define the objective function in our GIC as

GIC_κ(C) = −2ℓ(ω̂_C) + κkq_2. (11)

The best k is solved by

k̂_κ = argmin_k {GIC_κ(Ĉ_k)}, (12)

where Ĉ_k is the best grouping based on the current k. The GIC given by (11) includes AIC if we choose κ = 2 and BIC if we choose κ = log n. If they are adopted, then the solutions given by (12) are denoted by k̂_AIC and k̂_BIC, respectively.

We need to estimate the dispersion parameter if it is present. Because the estimator based on the current k can be seriously biased, we recommend using k + 1 as the number of clusters in the computation of the estimate of a(φ). In particular, we calculate the best C based on the current k in the generalized k-means. We use it to compute β̂_{1i} and β̂_{2s} for all i ∈ {1, ..., N} and s ∈ {1, ..., k}. Then, we calculate the best C by setting the number of clusters equal to k + 1, and we use (5) to estimate a(φ). This is analogous to the full model versus reduced model approach in linear regression, where the variance parameter is always estimated under the full model. We treat the model with k + 1 clusters in (7) as the full model and the model with k clusters as the reduced model. We estimate a(φ) based on the full model but not the reduced model. After a(φ̂) is derived, we put it into (11) in the computation of GIC. We then use (12) to calculate the best k with the dispersion parameter. This is important in our method for regression models.

2.3 Regression Models and Loglinear Models

We specify our method to regression models for normal data and loglinear models for Poisson data. Ordinary regression models are commonly used if the interest is to discover the relationship between a single continuous response variable and a number of explanatory variables. Multivariate regression models are commonly used if at least two continuous response variables are involved. Loglinear models for Poisson data and logistic linear models for binomial data are commonly used if the response is a count.
They have been extended to quasi-Poisson or quasi-binomial models to incorporate overdispersion.

In ordinary regression models, (6) becomes

y_i = X_{1i}β_{1i} + X_{2i}β_{2i} + ǫ_i, (13)

where X_{1i} = (x_{1i1}, ..., x_{1in_i})⊤, X_{2i} = (x_{2i1}, ..., x_{2in_i})⊤, and ǫ_i ∼ N(0, σ²I_{n_i}). With a given C, the model in our generalized k-means becomes

y_i = X_{1i}β_{1i} + X_{2i}β_{2s} + ǫ_i (14)

for z_i ∈ C_s. We treat (14) as a special case of (13). The second stage in Algorithm 1 is common. We only focus on the first stage.

We select the seed z_{i_1} for C_1 randomly. Suppose that z_{i_1}, ..., z_{i_k̃} have been selected as the seeds for C_1, ..., C_k̃, for any k̃ < k, respectively. To determine the seed for C_{k̃+1}, we calculate the dissimilarity measure between z_s and z_i for pairs (s, i) with s ∈ S̃_k̃ = {i_1, ..., i_k̃} and i ∉ S̃_k̃ based on

y_v = X_{1v}(β_{1s} + δ_v ξ_1) + X_{2v}(β_{2s} + δ_v ξ_2) + ǫ_v, (15)

where v = s or v = i, δ_v is the dummy variable defined as δ_v = 0 if v = s and δ_v = 1 if v = i, and ǫ_v ∼ N(0, σ²I_{n_v}) is the error vector. We calculate the F-statistic for

H_0: ξ_2 = 0 ↔ H_1: ξ_2 ≠ 0. (16)

Let p_si be the p-value of the F-statistic. We define the p-value of the dissimilarity between z_i and S̃_k̃ as

p_i = max_{s∈S̃_k̃} p_si. (17)

We choose z_i as the seed of C_{k̃+1} if it has the lowest p_i value among all objects not in S̃_k̃. Therefore, z_{i_{k̃+1}} is given by the minimax principle as

i_{k̃+1} = argmin_i p_i = argmin_i max_s p_si. (18)

After we obtain S̃_k, the set of all of the seeds for C, we calculate the p-value of the F-statistic for (16) for every s ∈ S̃_k and i ∉ S̃_k. We assign z_i to C_s if p_si is maximized at s. In the end, we obtain the initial C.
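An F-test in the spirit of (16) can be sketched for two objects sharing one restricted slope (q_2 = 1) while each keeps its own intercept. The data and variable names below are hypothetical; the reduced model pools the slope, the full model lets it shift through the object dummy.

```python
import numpy as np
from scipy.stats import f as f_dist

def sse_fit(X, y):
    """Least-squares fit; returns the residual sum of squares and #columns."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2)), X.shape[1]

def f_test_common_slope(t1, y1, t2, y2):
    """F-test of H0: the two objects share the slope while each keeps its
    own intercept; the numerator df equals the number of restricted
    coefficients (q2 = 1 here)."""
    y = np.concatenate([y1, y2])
    t = np.concatenate([t1, t2])
    d = np.concatenate([np.zeros_like(t1), np.ones_like(t2)])  # object dummy
    ones = np.ones_like(t)
    X_reduced = np.column_stack([ones, d, t])       # common slope
    X_full = np.column_stack([ones, d, t, d * t])   # slope allowed to shift
    sse_r, _ = sse_fit(X_reduced, y)
    sse_f, p_full = sse_fit(X_full, y)
    df_f = y.size - p_full
    F = (sse_r - sse_f) / (sse_f / df_f)
    return F, float(f_dist.sf(F, 1, df_f))
```

Objects with the same slope but different intercepts should give a moderate p-value, and objects with different slopes a tiny one, which is exactly what feeds p_si in (17).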
By iterating the second stage in Algorithm 1, we obtain Ĉ_k based on a given k. Because of the presence of σ² = a(φ), we follow the GIC in variable selection for regression models [33] and propose our GIC in the generalized k-means for regression models as

GIC_κ(C) = SSE/σ̂² + κkq_2, (19)

where SSE is the sum of squares of errors given by (14). To implement (19), we need to estimate σ². If the current k is used, then the estimate of σ² is the SSE divided by the residual degrees of freedom, and the first term on the right-hand side of (19) is always equal to n − Nq_1 − kq_2, implying that we cannot use this approach to select the best k. To overcome the difficulty, we recommend using k + 1 clusters in (14) in estimating σ². We denote the estimate by σ̂²_{k+1}. Therefore, our GIC is

GIC_κ(C) = SSE_k/σ̂²_{k+1} + κkq_2, (20)

where SSE_k is the SSE with k clusters in (14). This is appropriate. If the number of true clusters is less than or equal to k, then increasing the number of clusters by 1 would not significantly change the estimate of σ², implying that the second term dominates the right-hand side of (20). Otherwise, the estimate of σ² would be significantly reduced, implying that the first term dominates the right-hand side of (20). Therefore, the objective function in our GIC provides a nice trade-off between the SSE and the penalty function.

For Poisson data, there is V(y_ij) = E(y_ij) = μ_ij, implying that a(φ) = 1. Under the framework of loglinear models, (6) becomes

log(μ_ij) = x_{1ij}⊤ β_{1i} + x_{2ij}⊤ β_{2i}. (21)

With a given C, it reduces to

log(μ_ij) = x_{1ij}⊤ β_{1i} + x_{2ij}⊤ β_{2s} (22)

for z_i ∈ C_s. Analogous to the regression models, after selecting z_{i_1} randomly, we investigate

log(μ_vj) = x_{1vj}⊤(β_{1s} + δ_v ξ_1) + x_{2vj}⊤(β_{2s} + δ_v ξ_2) (23)

with v = s or v = i. We measure the dissimilarity between z_s and z_i by the likelihood ratio statistic for (16).
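The ingredients of the deviance-based GIC for count data, the residual deviance G_k and the Pearson dispersion estimate φ̂, are standard quantities and can be sketched directly (a sketch; in the paper φ̂ is computed under k + 1 clusters):

```python
import numpy as np

def poisson_deviance(y, mu):
    # residual deviance; the y*log(y/mu) term is 0 when y == 0
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu), 0.0)
    return float(2.0 * np.sum(term - (y - mu)))

def pearson_dispersion(y, mu, df_resid):
    # moment estimate of phi: Pearson goodness-of-fit over residual df
    return float(np.sum((y - mu) ** 2 / mu) / df_resid)

def quasi_poisson_gic(G_k, phi_hat_kplus1, k, q2, kappa):
    # deviance-based GIC in the style of (24)
    return G_k / phi_hat_kplus1 + kappa * k * q2
```

For ordinary Poisson data φ = 1 and the deviance alone plays the role of −2ℓ(ω̂_{Ĉ_k}); for quasi-Poisson data the deviance is scaled by the dispersion estimate before the penalty is added.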
We derive the initial C by the same idea that we have displayed for regression models. With the second stage in Algorithm 1, we obtain Ĉ_k based on a given k. To determine the best k, we choose −2ℓ(ω̂_{Ĉ_k}) as the residual deviance of (22). As the dispersion parameter is not present, the implementation of GIC is straightforward.

For quasi-Poisson data, there is V(y_ij) = φE(y_ij) = φμ_ij, implying that a(φ) = φ. We can still use (21), (22), and (23) to find the best C. To determine the best k, we estimate φ by (5), which is the Pearson goodness-of-fit statistic given by (22) divided by its residual degrees of freedom. For the same reason as before, we choose the number of clusters equal to k + 1 in (22) in estimating φ. We denote the estimate by φ̂_{k+1}, leading to

GIC_κ(C) = G_k/φ̂_{k+1} + κkq_2, (24)

where G_k is the residual deviance (i.e., deviance goodness-of-fit) with k clusters in (22).

3 Theoretical Properties

We evaluate the asymptotic properties of our method under n = Σ_{i=1}^N n_i → ∞. This is achieved by letting n_min = min_i(n_i) → ∞. To simplify our notation, we assume that the n_i are all equal to n_0 and the |C_s| are all equal to c, such that N = kc and n = kcn_0 in our data. The case with distinct n_i and |C_s| can be proven under their minimums tending to infinity with bounded ratios between the minimums and the maximums.

The asymptotic properties are evaluated under n_0 → ∞ with k, c → ∞. For any i ≠ i′, let Λ_ii′ be the likelihood ratio statistic for

H_0: β_{2i} = β_{2i′} ↔ H_1: β_{2i} ≠ β_{2i′}. (25)

As n_0 → ∞, Λ_ii′ is asymptotically χ²_{q_2} distributed if z_i and z_i′ are in the same cluster, or goes to ∞ with rate n_0 otherwise. Because (25) is applied to all pairs (i, i′) in S, the multiple testing problem must be addressed. This can be solved by the method of higher criticism [5].

Lemma 1. Assume that (y_ij, x_ij⊤)⊤ for j ∈ {1, ..., n_0} are iid copies from (7) for any given i ∈ S. If z_i and z_i′ are in the same cluster, then Λ_ii′ →_L χ²_{q_2}.
If z_i and z_i′ are in different clusters, then there exists a positive constant A = A(β_i, β_i′, φ), such that the limiting distribution of n_0^{-1/2}(Λ_ii′ − n_0A) is non-degenerate as n_0 → ∞.

Proof.
The conclusion can be proven by the standard approach to the asymptotic properties of maximum likelihood and M-estimation. Please refer to Chapter 22 in [9] and Chapter 5 in [30]. ♦

Theorem 1.
If the assumption of Lemma 1 holds and N = o(e^{n_0^α}) for some α ∈ (0, 1) when n_0 → ∞, then Ĉ_k →_P C.

Proof.
Note that the likelihood ratio test based on Λ_ii′ is applied to distinct i, i′ ∈ C. We need to evaluate the impact of the multiple testing problem. We examine the distribution of max_{i≠i′} Λ_ii′ based on Lemma 1. According to [5], it is asymptotically bounded by a constant times 2 log N if z_i and z_i′ are in the same cluster, or increases to ∞ with rate n_0 if z_i and z_i′ are in different clusters. Thus, with probability 1, the increasing rate of Λ_ii′ with z_i and z_i′ in different clusters is faster than that of Λ_ii′ with z_i and z_i′ in the same cluster, implying the conclusion. ♦

Theorem 2.
Assume that a(φ) is not present in (3) or a(φ) is consistently estimated by a(φ̂) used in the construction of GIC, and that the assumption of Theorem 1 holds. If κ − log c → ∞ as n_0 → ∞, then k̂_κ →_P k and Ĉ_{k̂_κ} →_P C.

Proof.
If k̂_κ < k, then we can find at least one pair z_i and z_i′ such that they are not in the same cluster but are grouped into the same cluster. By Lemma 1, the first term on the right-hand side of (11) goes to ∞ with rate n_0. It is faster than the rate of GIC under k̂_κ = k, implying that P(k̂_κ < k) → 0 as n_0 → ∞. Therefore, we only need to study the case when k̂_κ ≥ k. Note that the loglikelihood function of (7) based on a given C is equal to the sum of the loglikelihood functions obtained from each C_s ∈ C. By Theorem 1, we can restrict our attention to the case when all objects in C_s are in the same true cluster. By [5], with probability 1, the loglikelihood function of (7) in C_s is not higher than that under the true cluster plus 2 log c. By the property of the χ²-approximation of the likelihood ratio statistic under the true C, we have that with probability 1 the first term on the right-hand side of (11) is not higher than n_0N − (Nq_1 + kq_2) + 2kq_2 log c. Combining this with the second term, we conclude that k̂_κ →_P k. We obtain the first conclusion. Then, we draw the second conclusion Ĉ_{k̂_κ} →_P C by Theorem 1. ♦

Theorem 1 implies that both c and k can increase exponentially fast relative to n_0 if k is known, but the rate is significantly reduced if k is unknown. If c → ∞, then we cannot choose κ = 2 in our method, implying that k̂_AIC is not consistent, but we can still show that k̂_BIC is consistent.
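The dichotomy in Lemma 1 can be seen numerically in a toy normal-means example, a deliberate simplification of (25) with q_2 = 1 and known unit variance rather than the paper's GLM setting: within a cluster the statistic stays of order one, across clusters it grows linearly in n_0.

```python
import numpy as np

rng = np.random.default_rng(42)

def lr_stat(y1, y2):
    # LR statistic for H0: equal means, two unit-variance normal samples
    n0 = y1.size
    return n0 / 2.0 * (y1.mean() - y2.mean()) ** 2

# same cluster: the statistic is chi-square with 1 df (mean 1)
same = [lr_stat(rng.normal(0, 1, 500), rng.normal(0, 1, 500))
        for _ in range(500)]

# different clusters (mean shift 1): the statistic grows at rate n0
diff = {n0: lr_stat(rng.normal(0, 1, n0), rng.normal(1, 1, n0))
        for n0 in (100, 1000)}
```

The 2 log N bound for the maximum over many same-cluster pairs versus the linear-in-n_0 growth across clusters is what makes the separation in Theorem 1 possible.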
Corollary 1
Suppose that all of the assumptions of Theorem 2 are satisfied. If k → ∞ or k is constant, and c/n → 0 when n_0 → ∞, then k̂_BIC →_P k.

Proof.
Note that the increasing rate of log n cannot be lower than the increasing rate of log c. We draw the conclusion by Theorem 2. ♦

4 Simulation
We carried out simulations to evaluate our methods. For an estimated cluster assignment Ĉ and the true clustering assignment C, we define the clustering error (CE) of Ĉ as

CE_Ĉ = (N choose 2)^{-1} #{(i, i′): δ̂_ii′ ≠ δ_ii′, 1 ≤ i < i′ ≤ N}, (26)

where δ̂_ii′ = 1 if z_i and z_i′ belong to the same cluster in Ĉ and δ̂_ii′ = 0 otherwise, and similarly for δ_ii′ in C. For estimated clustering assignments Ĉ_1, ..., Ĉ_R obtained from R simulation replications, we calculate the percentage of clustering object errors (OE) by

OE = (100/R) Σ_{j=1}^R CE_{Ĉ_j}. (27)

This is a commonly used criterion in the clustering literature [31]. We also study the percentage of numbers of clusters identified correctly (IC) as

IC = (100/R) Σ_{j=1}^R I(k̂_j = k), (28)

where k̂_1, ..., k̂_R are the numbers of clusters obtained from the R simulation replications and k is the true number of clusters. We compare clustering methods based on CE and IC.

4.1 Regression Models

We established regression models with k = 2, 3 and c = 10,
20 objects. Each object contained n_0 = 50,
100 observations. We generated explanatory variables x_{1ij} from U[18, ·] and x_{2ij} from N(0,
9) independently. For each selected k, c, and n_0, we generated the normal response from

y_ij = β_{1i} + x_{1ij}β_{2s1} + x_{2ij}β_{2s2} + ǫ_ij, (29)

for j = 1, ..., n_0 and i = 1, ..., N, where ǫ_ij ∼ iid N(0, σ²) with two choices of σ. We did not vary β_{1i} within clusters. We made β_i = β_{i′} if z_i and z_{i′} were in the same cluster in (29). If k = 2, we chose β_{1i} = 1 and (β_{2s1}, β_{2s2}) = (−0.06, −0.01) if z_i was in the first cluster, or (0.06, 0.01) if z_i was in the second cluster. If k = 3, we added one more cluster with (β_{2s1}, β_{2s2}) = (−0.02, 0.01) if z_i was in the third cluster. Then, we could generate data from (29) with either 2 or 3 clusters.

Table 1 displays the simulation results for the percentage of numbers of clusters identified correctly with respect to the previous EM algorithm and our AIC and BIC selectors. In all of the simulations that we ran, we found that the number of clusters reported by the EM algorithm was either 1 or 2, implying that it could not find the correct number of clusters if k >
2. The true k could be detected by our BIC but not by our AIC.

Table 2 displays the simulation results for the percentage of clustering object errors based on the previous EM algorithm and our BIC selector. We did not include AIC in the table because it could not detect the correct k. Our results show that BIC in the generalized k-means was always better than the previous EM algorithm. Our BIC was able to find the true number of clusters with lower clustering object errors. The previous EM algorithm cannot be used to study the unsaturated clustering problem. This is an advantage of our generalized k-means.

Table 1: Percentage of numbers of clusters identified correctly (IC) in regression models based on 1000 simulation replications with data generated from (29). Rows are indexed by σ, c, and k; columns report EM, AIC, and BIC under n_0 = 50 and n_0 = 100.

Table 2: Percentage of clustering object errors (OE) in regression models based on 1000 simulation replications with data generated from (29). Rows are indexed by σ and c; columns report EM and BIC under k = 2 and k = 3 for n_0 = 50 and n_0 = 100.

Table 3: Percentage of numbers of clusters identified correctly (IC) in loglinear models based on 1000 simulation replications with data generated from (30). Rows are indexed by τ and c; columns report AIC and BIC under k = 2 and k = 3 for n_0 = 50 and n_0 = 100.

4.2 Loglinear Models

Similar to the regression models, we also chose k = 2, 3 and c = 10,
20 objects. Each object contained n = 50 or n = 100 observations. We generated explanatory variables x_ij1 and x_ij2 from N(0, 4) independently. For each selected k, c, and n, we independently generated the response y_ij from P(λ_ij) with

log λ_ij = β_i0 + x_ij1 β_s1 + x_ij2 β_s2   (30)

for j = 1, ..., n and i = 1, ..., N. We generated β_i0 independently from N(10, τ²), and chose (β_s1, β_s2) = (1, 1) in the first cluster and (β_s1, β_s2) = (−1, −1) in the second cluster. This was used if k = 2. If k = 3, we chose (β_s1, β_s2) = (1, −1) in the third cluster. We evaluated our method based on AIC and BIC for the unsaturated clustering problem, where we varied β_i0 within clusters.

Table 3 displays the simulation results for the percentage of clustering number errors. We also found that the true k could be identified by our BIC but not by our AIC. Table 4 displays the results for the percentage of clustering object errors based on BIC. It shows that the percentage of clustering object errors was still low, indicating that BIC can be used to find the correct number of clusters with a low error rate. Therefore, we recommend using BIC in our generalized k-means if the number of clusters is unknown.

Table 4: Percentage of clustering object errors (OE) in loglinear models based on 1000 simulation replications with data generated from (30). [Columns: τ and c; entries under n = 50 and n = 100 for the selected k. Numeric entries are not recoverable from the source.]

We implemented our method on the state-level COVID-19 data in the United States. It is known that the worldwide outbreak of COVID-19 occurred in March 2020 and more than 200 countries have been affected. The situation in the United States is the most serious in the world. By July 31, the United States had over 4 million confirmed cases.

We considered two models to describe the outbreak. The first is the exponential model given as

log λ_j = μ + β(t_j − t_0),   (31)

where t_0 is the starting date, t_j is the current date, λ_j = E(y_j), and y_j is the number of daily new cases observed on the current date. The second is the Gamma model given as

log λ_j = μ + α log(t_j − t_0) + β(t_j − t_0).   (32)

It assumes that the expected number of daily new cases is proportional to the density of a Gamma distribution. If the second term is absent, then the Gamma model becomes the exponential model, implying that (31) is a special case of (32).
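The Gamma-type trend model (32) is a loglinear model and can be fit by iteratively reweighted least squares. The following numpy-only sketch simulates counts from (32) and recovers the coefficients, the deviance R², the quasi-Poisson dispersion, and the peak t_max = t_0 − α/β; all parameter values here are illustrative assumptions, not the paper's estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate daily counts with a Gamma-type trend (illustrative parameters):
# log E(y_t) = mu + alpha*log(t) + beta*t, which peaks at t = -alpha/beta = 40
t = np.arange(1.0, 201.0)                     # days since the starting date t0
mu0, alpha0, beta0 = 1.0, 2.0, -0.05
y = rng.poisson(np.exp(mu0 + alpha0 * np.log(t) + beta0 * t))

# Poisson regression with the canonical log link via Fisher scoring (IRLS),
# started from a rough least-squares fit on log(y + 1)
X = np.column_stack([np.ones_like(t), np.log(t), t])
beta = np.linalg.lstsq(X, np.log(y + 1.0), rcond=None)[0]
for _ in range(50):
    lam = np.exp(X @ beta)
    step = np.linalg.solve(X.T @ (X * lam[:, None]), X.T @ (y - lam))
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

def poisson_deviance(y, mu):
    # 2 * sum(y*log(y/mu) - (y - mu)), with y*log(y/mu) = 0 when y = 0
    term = np.where(y > 0, y * np.log(np.where(y > 0, y / mu, 1.0)), 0.0)
    return 2.0 * np.sum(term - (y - mu))

lam = np.exp(X @ beta)
dev = poisson_deviance(y, lam)                        # residual deviance
dev0 = poisson_deviance(y, np.full_like(lam, y.mean()))  # null deviance
r2 = 1.0 - dev / dev0                  # deviance R^2, as defined in the text
phi = np.sum((y - lam) ** 2 / lam) / (len(y) - 3)  # quasi-Poisson dispersion
peak = -beta[1] / beta[2]              # estimated days from t0 to the peak
```

A negative fitted β with positive α indicates a curve that rises, peaks, and then declines, which is the "under control" pattern discussed below.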
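Given fitted cluster curves of the form (32), the assignment step of generalized k-means sends each series to the cluster whose curve attains the smallest Poisson deviance. A minimal sketch on two synthetic series follows; the cluster parameters and series are hypothetical, chosen only to illustrate the step.

```python
import numpy as np

def poisson_deviance(y, mu):
    # 2 * sum(y*log(y/mu) - (y - mu)), with y*log(y/mu) = 0 when y = 0
    y = np.asarray(y, dtype=float)
    term = np.where(y > 0, y * np.log(np.where(y > 0, y / mu, 1.0)), 0.0)
    return 2.0 * np.sum(term - (y - mu))

def gamma_curve(t, mu0, alpha, beta):
    # expected daily new cases under the Gamma-type trend (32)
    return np.exp(mu0 + alpha * np.log(t) + beta * t)

t = np.arange(1.0, 121.0)  # days since the starting date t0
# hypothetical cluster parameters (mu, alpha, beta): early peak vs late peak
centers = [(-2.0, 3.0, -0.08), (-10.0, 5.0, -0.05)]

rng = np.random.default_rng(3)
series = [rng.poisson(gamma_curve(t, *centers[0])),
          rng.poisson(gamma_curve(t, *centers[1]))]

# assignment step: each series joins the cluster with the smallest deviance
labels = [int(np.argmin([poisson_deviance(y, gamma_curve(t, *c))
                         for c in centers]))
          for y in series]
```

In the full procedure this step alternates with refitting the cluster curves on their pooled member series until the labels stop changing.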
Figure 2: Daily new cases of COVID-19 in 48 states in the mainland United States. [One panel per state, February to August 2020.]

Table 5: Fitting results of the exponential and the Gamma models for the outbreak of COVID-19 in eleven selected countries between January 11 and May 31, 2020. Cells marked "–" are not recoverable from the source.

              Exponential            Gamma
Country       μ      β      R²      μ     α      β       R²     Peak
China         7.–   −0.032   –      –     –    −0.290   0.813   02/07
USA           7.23   0.025   –      –     –    −0.195   0.939   04/26
Canada        4.19   0.026   –      –     –    −0.215   0.920   04/26
Russia        3.67   0.044   –      –     –    −0.308   0.993   05/13
Spain         6.59   0.013   –      –     –    −0.283   0.899   04/08
UK            5.53   0.023   –      –     –    −0.248   0.857   04/22
Italy         6.66   0.010   –      –     –    −0.238   0.945   04/03
France        6.26   0.012   –      –     –    −0.353   0.694   04/07
Germany       6.33   0.011   –      –     –    −0.317   0.862   04/05
Switzerland   4.90   0.006   –      –     –    −0.463   0.853   03/30
Sweden        3.28   0.026   –      –    13.–  −0.123   0.876   04/30

Figure 3: Six clusters identified by BIC in generalized k-means for the period from February 24 to May 31 (left) and the period from February 24 to July 31 (right), respectively.

Suppose that α > 0. If β > 0, then the third term dominates the variation of the right-hand side of (32): the expected value of the response goes to infinity as time goes to infinity, leading to an exponentially increasing trend. If β < 0, then the peak of the model is t_max = t_0 − α/β. An increasing trend is expected if t < t_max and a decreasing trend is expected otherwise. Therefore, we can use the sign of β to determine whether the outbreak is under control or out of control.

We chose t_0 as January 11 in both (31) and (32). We assumed that y_j followed the quasi-Poisson model, such that we could fit the two models as traditional loglinear models with the dispersion parameter a(φ) = φ estimated by (5). We assessed the two models by their R² values, where the R² value of a GLM was defined as one minus the residual deviance divided by the null deviance. We verified (31) and (32) by applying them to eleven countries in the world (Table 5), where the peak was estimated by t̂_max = t_0 − α̂/β̂, with α̂ and β̂ the MLEs of α and β in the model. We found that the results given by the Gamma model were significantly better than those given by the exponential model.

We established our generalized k-means clustering under (32) to group the 50 states and Washington, DC. The model was

log λ_ij = μ_i + α_s log(t_j − t_0) + β_s(t_j − t_0),   (33)

where λ_ij = E(y_ij), y_ij was the number of daily new cases from the ith state on the jth date, and α_s and β_s were the coefficients given by the sth cluster.

Table 6: Parameter estimates in the six clusters, with a selected state (State) for each cluster, based on the Gamma model for the outbreak of COVID-19 in the United States, where the standard errors are given inside the parentheses and × means out of control. [Estimates of α_s and β_s with standard errors and estimated peaks for the two study periods; numeric entries are not recoverable from the source.]

We looked at the state-level data and found that many of the daily new cases were zeros in January
and February. This was because the United States only had 6 confirmed cases in total until February 24. We decided to exclude data before February 24 from the analysis. We applied (33) to the data between February 24 and May 31 and the data between February 24 and July 31, respectively. We compared their differences to evaluate the impact of the two issues that we mentioned at the beginning of this section. We obtained six clusters based on the BIC approach (Figure 3).

To verify our clustering result, we examined three models. The first was the main effect model: it had only one cluster in (33). The second was the resulting (33) with k clusters. The third was the interaction effect model: it assumed that each state formed its own cluster in (33). We calculated the differences in residual deviance between the first and second models and between the first and third models, respectively, and obtained the partial R² value as the ratio of the two differences. The partial R² value represents the proportion of the residual deviance reduction achieved by the model with k clusters. With k = 6, we computed the partial R² value and report the estimates of α_s and β_s for (33) in Table 6. We found that the situation in the entire United States was under control before May 31, as the signs of the β̂_s were all negative. The situations in the states contained in the first, the fourth, and the sixth clusters then became worse, but the situations in the states contained in the second, the third, and the fifth clusters were still under control. The change was probably caused by the fact that many people did not keep social distancing or did not stay at home in June and July in the United States.

We have proposed a new clustering method under the framework of generalized k-means to group statistical models. The method can automatically select the number of clusters when it is combined with the GIC approach. We study BIC and AIC, which are two popular special cases of GIC.
Our theoretical and simulation results show that the correct number of clusters can be identified by BIC but not by AIC. Therefore, we recommend using BIC to find the number of clusters if k is unknown. We implemented our method to partition loglinear models for the state-level COVID-19 data in the United States and identified six clusters. An important advantage is that our method can be used to study the unsaturated clustering problem, which differs from the saturated clustering problem studied by traditional k-means or k-medians. As the choice of the dissimilarity measure is flexible, our method can be extended to many scenarios beyond GLMs. This is left to future research.
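Putting the pieces together, the procedure alternates pooled model fitting within clusters and dissimilarity-based reassignment, and BIC then selects k. The sketch below is a simplified version in the spirit of the method, for normal linear models with a residual-sum-of-squares dissimilarity (a special case of the likelihood ratio when the variance is common); the coefficients, object sizes, and the exact BIC form used here are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_object(beta, n=50, sigma=0.5):
    # one object: a design matrix (intercept + 2 covariates) and a response
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    return X, X @ beta + sigma * rng.normal(size=n)

# two well-separated clusters of regression models (hypothetical coefficients)
objs = ([make_object(np.array([1.0, -1.0, -0.5])) for _ in range(10)]
        + [make_object(np.array([1.0, 1.0, 0.5])) for _ in range(10)])
M = len(objs)

def pooled_fit(members):
    # cluster "center": the least-squares fit pooled over member objects
    X = np.vstack([objs[i][0] for i in members])
    y = np.concatenate([objs[i][1] for i in members])
    return np.linalg.lstsq(X, y, rcond=None)[0]

def rss(i, beta):
    X, y = objs[i]
    return float(np.sum((y - X @ beta) ** 2))

def fit_gkmeans(k, iters=30, restarts=5):
    best = None
    for _ in range(restarts):
        labels = rng.integers(k, size=M)
        for _ in range(iters):
            betas = [pooled_fit(np.flatnonzero(labels == s))
                     if np.any(labels == s)
                     else pooled_fit([int(rng.integers(M))])
                     for s in range(k)]
            new = np.array([min(range(k), key=lambda s: rss(i, betas[s]))
                            for i in range(M)])
            if np.array_equal(new, labels):
                break
            labels = new
        betas = [pooled_fit(np.flatnonzero(labels == s))
                 if np.any(labels == s) else betas[s] for s in range(k)]
        total = sum(rss(i, betas[labels[i]]) for i in range(M))
        if best is None or total < best[0]:
            best = (total, labels.copy(), betas)
    return best

def bic(total_rss, k, p=3):
    # Gaussian log-likelihood at the MLE plus a log(N) penalty per parameter
    N = sum(len(o[1]) for o in objs)
    return N * np.log(total_rss / N) + k * p * np.log(N)

results = {k: fit_gkmeans(k) for k in range(1, 5)}
best_k = min(results, key=lambda k: bic(results[k][0], k))
```

With clusters this well separated, BIC recovers k = 2, whereas replacing the log(N) penalty with AIC's constant 2 per parameter tends to overselect k, consistent with the simulation findings reported above.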