K-Means and Gaussian Mixture Modeling with a Separation Constraint
He Jiang and Ery Arias-Castro
University of California, San Diego, USA
Abstract
We consider the problem of clustering with K-means and Gaussian mixture models with a constraint on the separation between the centers, in the context of real-valued data. We first propose a dynamic programming approach to solving the K-means problem with a separation constraint on the centers, building on (Wang and Song, 2011). In the context of fitting a Gaussian mixture model, we then propose an EM algorithm that incorporates such a constraint. A separation constraint can help regularize the output of a clustering algorithm, and we provide both simulated and real data examples to illustrate this point.

1 Introduction

Cluster analysis is broadly seen as one of the most important tasks in (unsupervised) data analysis (Jain et al., 1999; Kaufman and Rousseeuw, 2009). In some situations, the analyst may have prior knowledge or relevant information regarding the clusters, and incorporating this information, which may be done via constraints, is thought to improve the accuracy of the clustering output (Basu et al., 2008). In the present paper, we address situations where the analyst has prior knowledge on the separation between the cluster centers. We focus on methods that rely on centers to define clusters: the K-means criterion and Gaussian mixture models (GMM).

K-means

A number of constrained variants of K-means have been proposed in the literature, where a significant amount of attention has been placed on the cluster sizes, such as imposing a lower bound on the minimum cluster size or aiming to balance the cluster sizes. For instance, Bennett et al. (2000) addressed the problem of small or even empty clusters by enforcing a size constraint in the cluster assignment step of Lloyd's algorithm; Nielsen and Nock (2014) designed a dynamic K-means algorithm in 1D aimed at optimizing the Bregman divergence that can incorporate a constraint on the minimum cluster size; and Banerjee and Ghosh (2006) proposed an approach to K-means that is able to handle cluster size balance constraints. Constrained K-means has also been widely considered in other areas (Basu et al., 2008). For example, Basu et al. (2004) and Wagstaff et al. (2001) considered forcing some user-specified pairs of observations to belong to the same cluster while forcing others to belong to different clusters. An overview of instance-level constrained clustering, including K-means approaches and related methods, can be found in (Davidson and Basu, 2007). Davidson and Ravi (2005) also incorporated a minimum separation constraint between clusters, mandating that any two points assigned to different clusters need to be separated by at least δ, a parameter specified by the analyst. Szkaliczki (2016) considered points arriving sequentially and constraints where clusters have to be of the form $\{x_i, \ldots, x_{i+j}\}$. A recent survey of constrained clustering, including K-means related methods and modern developments, can be found in Gançarski et al. (2020).

We propose a variant of K-means where a user-specified separation between the cluster centers is enforced, which, to the best of our knowledge, is novel. We place ourselves in dimension one, where we are able to solve the resulting constrained optimization problem exactly by building on the dynamic programming approach of Wang and Song (2011).
Gaussian mixture models

Model-based clustering is an important approach to clustering (Fraley and Raftery, 2002; McLachlan and Peel, 2004), and among parametric finite mixture models, Gaussian mixture models (GMM) are by far the most popular, with their parameters frequently estimated by the EM algorithm or variants (Dempster et al., 1977; McLachlan and Krishnan, 2007).

EM-type algorithms forcing various constraints on the parameters of the mixture model have been considered extensively in the literature. Kim and Taylor (1995), in a more general setting that includes GMM, considered enforcing some equality constraints on a subset of the parameters and provided an EM algorithm based on a projected Newton-Raphson step in the maximization stage. Along the same lines, but addressing both equality and inequality constraints, Jamshidian (2004) used a projected gradient step instead, in conjunction with an active set method in order to handle the inequality constraints. See also (Takai, 2012). In the context of linear models in regression, EM-type algorithms were also proposed to handle constraints on the parameters, in particular linear equalities and inequalities (Davis et al., 2012; Shi et al., 2005; Shin et al., 2001; Tan et al., 2007; Tian et al., 2008; Zheng et al., 2012, 2005).

In the context of GMM proper, constraints have long been considered as a way to reduce the number of parameters, for example, by making all the variances or covariance matrices equal. Beyond that, constraints on the variances or eigenvalues of the covariance matrices have been considered extensively, mostly for the purpose of bounding the log-likelihood (Day, 1969). In particular, Hathaway (1985) proposed a lower bound on the ratio of any two component standard deviations, which enabled the author to prove the strong consistency of the resulting maximum likelihood estimator. This was followed by (Hathaway, 1986), where a constrained EM algorithm was proposed for maximizing the likelihood with this lower bound constraint on the ratio of any two standard deviations, together with another lower bound constraint on the weights. Generalizing Hathaway's results to higher dimensions, Ingrassia (2004); Ingrassia and Rocci (2007, 2011); Rocci et al. (2018) examined EM algorithms enforcing constraints on the eigenvalues of the covariance matrices.

In addition to constraints on the component variances or eigenvalues, estimation of the component means under constraints has also been considered in the literature, and EM-type algorithms have been proposed for that purpose. For example, Pan and Shen (2007) and Wang and Zhu (2008) considered penalizing the $L_1$ and $L_\infty$ norms of the mean vectors, respectively, to enforce sparsity and/or regularize the model in high dimensions, while Chauveau and Hunter (2013) and Qiao and Li (2015) proposed EM algorithms for linear constraints on the mean parameters.

We consider here, in dimension 1, arbitrary constraints on the separation between adjacent means. We propose an EM algorithm that enforces such constraints, where the M step amounts to solving a quadratic program.

The main focus is on real-valued data: we do not know how to enforce a separation constraint on the centers in higher dimensions, at least not in such a principled fashion. The data points are assumed ordered without loss of generality, denoted $x_1 \le \cdots \le x_N$ and gathered in a vector $x = (x_1, \ldots, x_N) \in \mathbb{R}^N$. Our goal is to group these data points into K clusters, where K is specified by the user.
We do not address the perennial issue of choosing K.

The organization of the paper is as follows. In Section 2, we introduce our dynamic programming algorithm for exactly solving K-means in dimension one with a constraint on the minimum separation between the centers. In Section 3, we introduce our EM algorithm for (approximately) fitting a Gaussian mixture model by maximum likelihood under the same constraint. It is a true EM algorithm in that the likelihood increases with each iteration. In Section 4, we describe some numerical experiments on both simulated data and real data to illustrate the use and accuracy of our methods.

2 K-Means

Given data points on the real line, $x_1, \ldots, x_N$, and a desired number of clusters, K, basic K-means is defined as the following optimization problem (MacQueen, 1967):

$$\text{minimize } \sum_{i=1}^{N} \min_{k=1,\ldots,K} (x_i - \mu_k)^2, \quad \text{over } \mu_1 < \cdots < \mu_K.$$

This problem is difficult in general. For one thing, a brute-force approach by grid search would be in dimension K and quickly infeasible or too costly. The problem is instead most often approached via iterative methods such as Lloyd's algorithm (Lloyd, 1982), which consists in alternating between, for all k, defining Cluster k as the set of points nearest to $\mu_k$, and then recomputing $\mu_k$ as the barycenter or average of the points forming Cluster k. In dimension 1, however, the K-means problem can be solved exactly by dynamic programming, as was established and carried out by Wang and Song (2011).

We are interested in a variant of the K-means problem where the following separation constraint on the centers is enforced:

Constraint 1. $\mu_{k+1} - \mu_k \ge \delta$ for all $k = 1, \ldots, K-1$.

In other words, we consider the following optimization problem:

$$\text{minimize } \sum_{i=1}^{N} \min_{k=1,\ldots,K} (x_i - \mu_k)^2, \quad \text{over } \mu_1 < \cdots < \mu_K \text{ satisfying } \mu_{k+1} - \mu_k \ge \delta \text{ for all } k = 1, \ldots, K-1.$$

The amount of separation is determined by the user-specified parameter $\delta \ge 0$.
(Of course, if $\delta = 0$, this reduces to the basic K-means problem.) Inspired by the work of Wang and Song (2011), we propose a dynamic programming algorithm to solve this constrained variant of K-means. The remainder of this section is dedicated to describing this algorithm, which is encapsulated in Table 1 and Table 2.

Given the sorted data $x = (x_1, \ldots, x_N)$, for $r_1 = 1, \ldots, N$ and $r_2 = r_1, \ldots, N$, we first define the within current cluster center (C) and the within current cluster sum of squares (W):

$$C[r_1, r_2] = \frac{1}{r_2 - r_1 + 1} \sum_{i=r_1}^{r_2} x_i, \qquad (1)$$

$$W[r_1, r_2] = \sum_{i=r_1}^{r_2} (x_i - C[r_1, r_2])^2. \qquad (2)$$

Each element can be efficiently computed using a simple recursion. For $i = 1, \ldots, N$ and $m = 1, \ldots, K$, define $D[i, m]$ as recording the minimum error sum of squares when grouping the first i data points into m clusters. The main idea of Wang and Song (2011) is to update D dynamically as follows:

$$D[i, m] = \min_{j = m, \ldots, i} D[j-1, m-1] + W[j, i]. \qquad (3)$$

This is carried out for $i = 1, \ldots, N$ and $m = 1, \ldots, K$. (The initialization is $D[0, m] = 0$ for all m and $D[i, 1] = W[1, i]$ for all i.)

The method of (Wang and Song, 2011) was designed for solving the basic K-means problem and needs to be modified, in a substantial way, to solve the separation-constrained K-means problem.
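To make the unconstrained recursion (3) concrete before describing those modifications, here is a minimal R sketch. It is our own illustration (not the authors' released code nor the Ckmeans.1d.dp implementation); the function name kmeans_dp_1d is ours, and W is computed directly from (1)-(2) rather than by the running recursion mentioned above.

```r
# Minimal sketch of the unconstrained 1D K-means dynamic program of Wang and Song (2011).
# x: sorted data vector; K: number of clusters.
# Returns the matrix D, where D[i, m] is the minimum error sum of squares
# for grouping x[1..i] into m clusters; D[N, K] is the optimal K-means objective.
kmeans_dp_1d <- function(x, K) {
  N <- length(x)
  # W[r1, r2]: within-cluster sum of squares of x[r1..r2], as in (1)-(2)
  # (computed naively here; a running-mean recursion makes this O(N^2) overall)
  W <- matrix(0, N, N)
  for (r1 in 1:N) {
    for (r2 in r1:N) {
      seg <- x[r1:r2]
      W[r1, r2] <- sum((seg - mean(seg))^2)
    }
  }
  D <- matrix(Inf, N, K)
  D[, 1] <- W[1, ]  # one cluster: all of x[1..i] together
  if (K >= 2) {
    for (m in 2:K) {
      for (i in m:N) {
        # recursion (3): the last cluster is x[j..i], the rest form m-1 clusters
        D[i, m] <- min(sapply(m:i, function(j) D[j - 1, m - 1] + W[j, i]))
      }
    }
  }
  D
}
```

Keeping track of the minimizing index j at each step (as the separation-constrained algorithm below does through the matrix I) allows the optimal clustering itself to be recovered by backtracking.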
For illustrative purposes, consider clustering eight sorted data points $x_1 \le \cdots \le x_8$ into K = 5 clusters with separation parameter δ = 1.75. Note that without the separation constraint the clustering assignment is 12233455, but the centers of Cluster 3 and Cluster 4 are only separated by 1.5, which is less than δ. With the separation constraint, the clustering assignment is 12334455. The same example is used to highlight two issues that need to be addressed in order to incorporate the separation constraint:

• The m-th cluster, consisting of $x_j, \ldots, x_i$, and the previous cluster, i.e., the (m − 1)-th cluster in the optimal clustering of the first j − 1 data points into m − 1 clusters, may not be separated by δ in terms of their means. In our example, when grouping $\{x_1, \ldots, x_5\}$ into 3 clusters, the optimal solution is 12233, but when we try to group $\{x_1, \ldots, x_6\}$ into 4 clusters under the separation constraint by appending $\{x_6\}$ as a new cluster, this is impossible, as $C[6, 6] - C[4, 5] < \delta$. (We are not allowing the clusters to be empty.)

• Consider the stage where we have grouped $x_1, \ldots, x_i$ into m clusters, and the last cluster is $x_j, \ldots, x_i$. Then it is not necessarily the case that the first m − 1 clusters form the optimal grouping of $x_1, \ldots, x_{j-1}$ into m − 1 clusters, and so the recursion (3) is not necessarily valid. In our example, the optimal clustering for grouping $\{x_1, \ldots, x_4\}$ into 3 clusters is 1223; however, when grouping $\{x_1, \ldots, x_5\}$ into 4 clusters under the separation constraint, the clustering assignment becomes 12334, so that the starting point of Cluster 3 has changed from $x_4$ to $x_3$.

In brief, (3) cannot be used as the updating rule. Although the first issue is easy to fix by checking that the separation constraint is satisfied, the second issue is more difficult to deal with and requires us to introduce additional matrices/tensors as described below.

We now describe our algorithm. First initialize:

$$\beta = W[1, N] + 1, \quad D = \beta \cdot \mathbf{1}_{N \times K}, \quad I = \mathbf{0}_{N \times K}, \quad B = \mathbf{0}_{N \times K}, \quad U = \beta \cdot \mathbf{1}_{N \times N \times K}, \qquad (4)$$

where β is chosen larger than the maximum possible error sum of squares. Define:

$$D[i, 1] = \sum_{l=1}^{i} (x_l - C[1, i])^2, \quad I[i, 1] = 1, \quad B[i, 1] = 1, \quad U[i, r, 1] = 0, \quad \text{for } i = 1, \ldots, N;\ r = i, \ldots, N. \qquad (5)$$

The D matrix is used to store the optimal error sum of squares under the separation constraint; specifically, D[i, m] records the minimal error sum of squares when putting the first i values into m clusters while satisfying Constraint 1, if possible (recall that we do not allow clusters to be empty). The corresponding entry in the I matrix, I[i, m], is used to record the index of the first data point in the m-th cluster in the optimal clustering of the first i data points into m clusters (the clustering corresponding to D[i, m]). B[i, m], on the other hand, is used to store the smallest possible index of the leftmost data point in the m-th cluster over all groupings of the first i data points into m clusters satisfying the separation constraint. Lastly, U[q, r, m] records the optimal error sum of squares for grouping the first q − 1 data points into m − 1 clusters in such a way that the (m − 1)-th cluster and the m-th cluster, which consists of the data points $x_q, \ldots, x_r$, are separated by δ in means.

In our updates of the above matrices/tensors, we enforce the separation constraint. Suppose that D[·, m−1], I[·, m−1], B[·, m−1] as well as U[·, ·, m−1] have been updated, so we are at stage m with the immediate task of updating D[·, m], I[·, m], B[·, m] as well as U[·, ·, m]. Now consider grouping the first i data points into m clusters.
If B[i − 1, m − 1] = 0, do nothing (the values stay at the initial ones). Otherwise, for each $j = m, \ldots, i$, we find the first value $T_j \in \{I[j-1, m-1], \ldots, B[j-1, m-1]\}$ that satisfies $C[j, i] - C[T_j, j-1] \ge \delta$. Note that $T_j$ also depends on i and m, but that is left implicit. The reason for defining $T_j$ in this way is that, starting from B[j − 1, m − 1], the closer we are to I[j − 1, m − 1], the smaller the error sum of squares. Then the main update that guarantees optimality is:

$$D[i, m] = \min_{j = m, \ldots, i} U[T_j, j-1, m-1] + W[T_j, j-1] + W[j, i], \qquad (6)$$

and

$$U[j, i, m] = U[T_j, j-1, m-1] + W[T_j, j-1]. \qquad (7)$$

Then I[i, m] is updated as the smallest minimizing index in (6), and B[i, m] is updated as the smallest j that satisfies $C[j, i] - C[T_j, j-1] \ge \delta$.
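As an illustration of the search for $T_j$, here is a small R helper (our own sketch, with hypothetical argument names; it assumes the matrices C, I and B have already been filled as above). It scans candidate start indices of the (m − 1)-th cluster from the optimal start I[j − 1, m − 1] toward the smallest possible start B[j − 1, m − 1] and returns the first one meeting the separation requirement; in the full algorithm, updates (6) and (7) are then applied with this value.

```r
# Find T_j: the first start index t of the (m-1)-th cluster, scanning from
# I[j-1, m-1] down to B[j-1, m-1], whose cluster mean C[t, j-1] is at least
# delta below the mean C[j, i] of the candidate m-th cluster x[j..i].
# Returns NA if no such start exists (the separation constraint cannot be met).
find_Tj <- function(C, I, B, j, i, m, delta) {
  for (t in I[j - 1, m - 1]:B[j - 1, m - 1]) {  # counts down, since B <= I
    if (C[j, i] - C[t, j - 1] >= delta) return(t)
  }
  NA
}
```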
Computational complexity

While the algorithm of Wang and Song (2011) has time complexity $O(N^2 K)$, our method has computational complexity $O(N^3 K)$. While this is true in the worst case, in practice the computational cost appears much lower, as moving from I[j − 1, m − 1] to B[j − 1, m − 1] and stopping at the first value where the constraint is satisfied considers significantly fewer options than N. The upper bound on the space complexity of this algorithm is $O(N^2 K)$, compared to $O(NK)$ for the algorithm of Wang and Song (2011). This is due to the maintenance of the tensor U, but again, this is in the worst case, as in practice U is very sparse (most of its coefficients are equal to the initial value β) and the space complexity is far below the upper bound. We are now ready to use Table 2 to find the optimal error sum of squares under Constraint 1 and to find the corresponding optimal clustering using Table 1.

Table 1: Separation-Constrained K-Means
inputs: C, I, B (all three obtained from Table 2), number of clusters K, separation δ
initialize b as a vector of length K, compute b[K] = I[N, K], and set l = b[K] − 1
if K = 1 then return b
for j = I[l, K − 1], ..., B[l, K − 1] do
    if C[l + 1, N] − C[j, l] ≥ δ then b[K − 1] = j, l = j − 1, break
if K = 2 then return b
for k = (K − 2), ..., 1 do
    for j = I[l, k], ..., B[l, k] do
        if C[l + 1, b[k + 2] − 1] − C[j, l] ≥ δ then b[k] = j, l = j − 1, break
return b

3 Gaussian Mixture Models

A Gaussian mixture model is of the form

$$\sum_{k=1}^{K} \pi_k f(x; \mu_k, v_k), \qquad (8)$$

where $f(\cdot; \mu, v)$ is the density of the normal distribution with mean μ and variance v, meaning

$$f(x; \mu, v) = \frac{1}{\sqrt{2\pi v}} \exp\left(-\frac{(x - \mu)^2}{2v}\right), \qquad (9)$$

and $\pi_1, \ldots, \pi_K$ are the mixture weights satisfying

$$\min_k \pi_k > 0, \qquad \sum_k \pi_k = 1. \qquad (10)$$

For convenience, define $\theta = (\pi, \mu, v)$ where $\pi = (\pi_1, \ldots, \pi_K) \in \mathbb{R}^K$, $\mu = (\mu_1, \ldots, \mu_K) \in \mathbb{R}^K$, and $v = (v_1, \ldots, v_K) \in \mathbb{R}^K$. We then let $g_\theta$ denote the mixture (8). The working assumption is that the data points, $x_1, \ldots, x_N$, are the realization of a sample of size N drawn iid from a mixture distribution of the form (8), where a draw from that distribution amounts to drawing a cluster index in $\{1, \ldots, K\}$ according to the probability distribution π and then generating a real-valued observation according to the corresponding normal density, $f(\cdot; \mu_k, v_k)$ if k is the drawn index value.

Given the mixture model (8), a point x is assigned to the most likely cluster, meaning, to

$$\arg\max_k \pi_k f(x; \mu_k, v_k). \qquad (11)$$

Thus clustering using a Gaussian mixture model boils down to estimating the parameters of the model.
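As a small illustration of the assignment rule (11), the following R sketch (ours; the function name assign_gmm is hypothetical) assigns each point to the component with the largest weighted density, given estimated weights, means and variances.

```r
# Assignment rule (11): each point goes to the component k maximizing
# pk[k] * f(x; mu[k], v[k]), where f is the normal density with mean mu[k]
# and variance v[k]. pk, mu, v are vectors of length K; returns labels in 1..K.
# Example: assign_gmm(c(-0.2, 3.1), pk = c(0.5, 0.5), mu = c(0, 3), v = c(1, 1))
assign_gmm <- function(x, pk, mu, v) {
  sapply(x, function(xi) which.max(pk * dnorm(xi, mean = mu, sd = sqrt(v))))
}
```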
Table 2: Routine computing the matrices C, W, D, I, B
inputs: data $x_1 \le \cdots \le x_N$, number of clusters K, separation δ
compute C and W as in (1) and (2)
define β and initialize D, I, B, U as in (4) and (5)
if K = 1 then
    go to the return step
else if K = 2 then
    for i = 2, ..., N do
        p = β · 1_{i−1}
        for j = 2, ..., i do
            if C[j, i] − C[1, j − 1] ≥ δ then
                p[j − 1] = W[j, i] + W[1, j − 1]
                U[j, i, 2] = W[1, j − 1]
        D[i, 2] = min(p)
        if D[i, 2] = β then I[i, 2] = 0, B[i, 2] = 0
        else
            I[i, 2] = min(arg min(p)) + 1
            B[i, 2] = min{a : p[a] < β} + 1
else
    do everything in the K = 2 case above, then
    for m = 3, ..., K do
        for i = m, ..., N do
            if D[i − 1, m − 1] < β then
                p = β · 1_{i−m+1}
                for j = m, ..., i do
                    if B[j − 1, m − 1] ≠ 0 then
                        if C[j, i] − C[I[j − 1, m − 1], j − 1] ≥ δ then
                            p[j − m + 1] = D[j − 1, m − 1] + W[j, i]
                            U[j, i, m] = D[j − 1, m − 1]
                        else
                            for t = I[j − 1, m − 1], ..., B[j − 1, m − 1] do
                                if C[j, i] − C[t, j − 1] ≥ δ then
                                    p[j − m + 1] = min(p[j − m + 1], U[t, j − 1, m − 1] + W[t, j − 1] + W[j, i])
                                    U[j, i, m] = U[t, j − 1, m − 1] + W[t, j − 1]
                                    break
                D[i, m] = min(p)
                if D[i, m] = β then I[i, m] = 0, B[i, m] = 0
                else
                    I[i, m] = min(arg min(p)) + m − 1
                    B[i, m] = min{a : p[a] < β} + m − 1
return D, I, B, C, W, β, U

Maximum likelihood estimation is a natural candidate, but difficult to implement, even in dimension 1, as maximizing the likelihood is not a convex problem and a grid search is too costly given the number of parameters (3K − 1). (Interestingly, maximum likelihood is a competitive method even though the likelihood is not bounded. To make sense of what the EM algorithm does, we believe that a useful perspective is to see it as attempting to maximize the likelihood starting from a reasonable initialization, understanding that the E and M steps prevent the algorithm from arriving at pathological values of the parameters.) The log-likelihood is

$$L(\theta, x) = \sum_{i=1}^{N} \log g_\theta(x_i). \qquad (12)$$

The EM algorithm is the standard approach to (approximately) maximizing it. The main idea is based on introducing cluster assignment variables, $z_1, \ldots, z_N$, where $z_i = k$ if $x_i$ was drawn from the k-th component of the mixture. These assignment variables are not observed (they are said to be latent), but are useful nonetheless for thinking about the problem. To that end, let $g_\theta(x, z)$ denote the joint density of $(X_i, Z_i)$. Note that $Z_i$ has distribution π and $X_i \mid Z_i = k$ has the normal distribution with mean $\mu_k$ and variance $v_k$. The algorithm starts with a value of the parameter vector, $\theta^0$, and then alternates between an E step and an M step, until some convergence criterion is satisfied. Suppose we are at the s-th iterate. The E step consists in computing the expectation of the log-likelihood for an arbitrary value of θ with respect to the mixture distribution given by the current parameter estimates:

$$Q(\theta, \theta^s) := \sum_{i=1}^{N} E_{\theta^s}\left[\log g_\theta(X_i, Z_i) \mid X_i = x_i\right]. \qquad (13)$$

It turns out that

$$Q(\theta, \theta^s) = \sum_{i=1}^{N} \sum_{k=1}^{K} w^{s+1}_{i,k} \log\left(\pi_k f(x_i; \mu_k, v_k)\right), \qquad (14)$$

where

$$w^{s+1}_{i,k} := \frac{\pi^s_k f(x_i; \mu^s_k, v^s_k)}{\sum_{l=1}^{K} \pi^s_l f(x_i; \mu^s_l, v^s_l)}. \qquad (15)$$

The M step consists of maximizing the resulting function with respect to θ to yield the updated values of the parameters:

$$\theta^{s+1} = \arg\max_\theta Q(\theta, \theta^s). \qquad (16)$$

This amounts to the following:

$$\pi^{s+1}_k = \frac{1}{N} \sum_{i=1}^{N} w^{s+1}_{i,k}, \qquad (17)$$

$$\mu^{s+1}_k = \frac{\sum_{i=1}^{N} w^{s+1}_{i,k} x_i}{\sum_{i=1}^{N} w^{s+1}_{i,k}}, \qquad (18)$$

$$v^{s+1}_k = \frac{\sum_{i=1}^{N} w^{s+1}_{i,k} (x_i - \mu^{s+1}_k)^2}{\sum_{i=1}^{N} w^{s+1}_{i,k}}. \qquad (19)$$

We are interested in a variant of the EM algorithm for maximizing the likelihood under the following separation constraint on the centers:

Constraint 2. $\delta_{k,1} \le \mu_{k+1} - \mu_k \le \delta_{k,2}$ for all $k = 1, \ldots, K-1$.

The separation parameters, $\delta_{k,1}$ and $\delta_{k,2}$, are set by the user. We note that this reduces to Constraint 1 when we set $\delta_{1,1} = \cdots = \delta_{K-1,1} = \delta$ and $\delta_{1,2} = \cdots = \delta_{K-1,2} = \infty$ (in practice, a very large number). We propose an EM algorithm that incorporates Constraint 2. It is a true EM algorithm in that the likelihood increases with each iteration.
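For reference, here is a minimal R sketch of one iteration of the standard (unconstrained) EM updates (15) and (17)-(19); the function name em_step_gmm1d and the list-based parameterization (components pk, mu, v) are ours. The constrained variant described next changes only the update of the means.

```r
# One iteration of the standard EM updates for a 1D Gaussian mixture:
# the E step (15) computes the responsibilities w[i, k]; the M step applies
# the closed-form updates (17)-(19) for the weights, means and variances.
# x: data vector; theta: list with pk (weights), mu (means), v (variances).
em_step_gmm1d <- function(x, theta) {
  K <- length(theta$mu)
  # E step: N x K matrix of responsibilities, each row normalized to sum to 1
  dens <- sapply(1:K, function(k) theta$pk[k] * dnorm(x, theta$mu[k], sqrt(theta$v[k])))
  w <- dens / rowSums(dens)
  # M step
  pk <- colMeans(w)                                    # update (17)
  mu <- colSums(w * x) / colSums(w)                    # update (18)
  v  <- colSums(w * outer(x, mu, "-")^2) / colSums(w)  # update (19)
  list(pk = pk, mu = mu, v = v)
}
```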
The remainder of this section is devoted to introducing the algorithm, which is given compactly in Table 3.

In the algorithm, the E step is identical to that in the regular EM algorithm. The M step is also the same except that the maximization is done under Constraint 2. The constraint is only on the centers, and to isolate it, we adopt an ECM approach (Meng and Rubin, 1993), where we first maximize over the weights, which amounts to the same update (17); then maximize over the centers subject to Constraint 2 (see details below); and finally maximize over the variances, which amounts to the same update (19).

Proposition 1.
The separation-constrained EM algorithm does not decrease the log-likelihood of the model, i.e., $L(\theta^{s+1}, x) \ge L(\theta^s, x)$ for all s.

A proof of this result, although standard, is given in the appendix.

The maximization over the centers amounts to solving the following optimization problem:

$$\text{minimize } \sum_{i=1}^{N} \sum_{k=1}^{K} \frac{w^{s+1}_{i,k}}{v^s_k} (x_i - \mu_k)^2 \text{ over } \mu_1, \ldots, \mu_K \text{ such that } \delta_{k,1} \le \mu_{k+1} - \mu_k \le \delta_{k,2} \text{ for all } k = 1, \ldots, K-1. \qquad (20)$$

Define

$$g_{i,k} = w^{s+1}_{i,k} / v^s_k, \quad G_i = \mathrm{diag}(g_{i,1}, \ldots, g_{i,K}), \quad G = \sum_{i=1}^{N} G_i. \qquad (21)$$

Also define

$$A = \begin{pmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \\ 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & -1 \end{pmatrix}, \qquad b = \begin{pmatrix} \delta_{1,1} \\ \vdots \\ \delta_{K-1,1} \\ -\delta_{1,2} \\ \vdots \\ -\delta_{K-1,2} \end{pmatrix}. \qquad (22)$$

With this notation, the optimization problem (20) can be equivalently expressed as follows:

$$\text{minimize } \mu^\top G \mu - 2 \sum_{i=1}^{N} x_i \mathbf{1}^\top G_i \mu \quad \text{subject to } A\mu \ge b, \qquad (23)$$

which is a quadratic program. In practice, we solve it using the dual method of Goldfarb and Idnani (1982, 1983) implemented in the quadprog package in R (https://cran.r-project.org/web/packages/quadprog/).
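As a sketch of this constrained center update, the following R code solves a problem of the form (23) with quadprog::solve.QP; the helper name update_means_constrained and its arguments are our own. solve.QP minimizes (1/2) μᵀ Dmat μ − dvecᵀ μ subject to Amatᵀ μ ≥ bvec, so we pass Dmat = 2G, dvec = 2 Σ_i x_i G_i 1, and the transpose of the constraint matrix A from (22).

```r
library(quadprog)

# Constrained center update: minimize sum_{i,k} g[i,k] * (x[i] - mu[k])^2
# subject to delta1[k] <= mu[k+1] - mu[k] <= delta2[k], as in (20)-(23).
# g: N x K matrix with entries w[i,k] / v[k] (assumed to have positive column sums);
# delta1, delta2: vectors of length K-1. Assumes K >= 2.
update_means_constrained <- function(x, g, delta1, delta2) {
  K <- ncol(g)
  G <- diag(colSums(g))               # G = sum_i G_i (diagonal, positive definite)
  cvec <- colSums(g * x)              # c_k = sum_i x_i * g[i, k]
  # Constraint matrix A of (22): rows mu_{k+1} - mu_k >= delta1_k,
  # followed by rows mu_k - mu_{k+1} >= -delta2_k
  A1 <- cbind(-diag(K - 1), 0) + cbind(0, diag(K - 1))
  A <- rbind(A1, -A1)
  bvec <- c(delta1, -delta2)
  sol <- solve.QP(Dmat = 2 * G, dvec = 2 * cvec, Amat = t(A), bvec = bvec)
  sol$solution                        # updated centers mu_1, ..., mu_K
}
```

In an EM iteration, g would be built from the responsibilities (15) divided column-wise by the current variances; setting delta2 to a very large value imposes only the lower bound, as in Constraint 1.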
Table 3: The separation-constrained EM algorithm
inputs: data $x_1, \ldots, x_N$, number of clusters K, tolerance value γ
initialization: initialize $(\pi^0, \mu^0, v^0)$ using the constrained 1D optimal K-means algorithm of Section 2 if it is given the same lower bound constraint; otherwise initialize using the 1D optimal K-means of (Wang and Song, 2011), with $\mu^0_1 < \mu^0_2 < \cdots < \mu^0_K$ and
    $\mu^0_k = \frac{1}{|J_k|} \sum_{i \in J_k} x_i, \quad \pi^0_k = \frac{|J_k|}{N}, \quad v^0_k = \frac{1}{|J_k|} \sum_{i \in J_k} (x_i - \mu^0_k)^2,$  (24)
    where $J_k$ denotes the set of indices clustered into the k-th cluster
E step: compute $w^{s+1}_{i,k}$ for $i = 1, \ldots, N$ and $k = 1, \ldots, K$:
    $w^{s+1}_{i,k} = \frac{\pi^s_k \, \mathcal{N}(x_i \mid \mu^s_k, v^s_k)}{\sum_{j=1}^{K} \pi^s_j \, \mathcal{N}(x_i \mid \mu^s_j, v^s_j)}$  (25)
M step: update $\pi_k$, $k = 1, \ldots, K$:
    $\pi^{s+1}_k = \frac{1}{N} \sum_{i=1}^{N} w^{s+1}_{i,k}$  (26)
    update $\mu_k$, $k = 1, \ldots, K$, by the solution to the minimization problem (23)
    update $v_k$, $k = 1, \ldots, K$:
    $v^{s+1}_k = \frac{\sum_{i=1}^{N} w^{s+1}_{i,k} (x_i - \mu^{s+1}_k)^2}{\sum_{i=1}^{N} w^{s+1}_{i,k}}$  (27)
convergence check: if $\max\{\max_k |\pi^{s+1}_k - \pi^s_k|, \max_k |\mu^{s+1}_k - \mu^s_k|, \max_k |v^{s+1}_k - v^s_k|\} \le \gamma$, stop and return $\theta^{s+1} = (\pi^{s+1}, \mu^{s+1}, v^{s+1})$; otherwise set $s = s + 1$ and repeat from the E step

4 Numerical Experiments

This section provides some simulated and real data examples illustrating the application of our algorithms. The constrained algorithms are shown to outperform their unconstrained counterparts both in parameter estimation and in producing a clustering that is more similar to the actual clusters, provided, of course, that the separation constraint reflects the reality of the situation. Our code is publicly available at https://github.com/h8jiang/Center-Separation-Constrained-K-Means-and-EM-Algorithm/.

Simulated data

We start with experiments performed on simulated data using our constrained algorithms. We first look at the performance of our constrained K-means algorithm relative to the regular K-means as implemented by Wang and Song (2011). We then compare the performance of our constrained EM algorithm for fitting Gaussian mixture models with that of the regular EM algorithm (Dempster et al., 1977). Finally, we examine the impact of the imposed constraints, including when they are not congruent with the actual situation.

All the simulated data sets were generated from one of the following Gaussian mixture models:

• Model A: a mixture of two normal components (with means separated by 2)
• Model B: a mixture of three normal components
• Model C: a mixture of five normal components
• Model D: a mixture of five normal components

Table 4: Errors and Rand Indices of K-Means and Optimal Constrained K-Means

Experiment    Model    Criteria      K-means        Constrained K-means
Experiment 1  Model D  center error  1.092 (0.276)  0.374 (0.161)
                       size error    165.6 (22.9)   119.9 (25.4)
                       Rand index    0.786 (0.015)  0.807 (0.014)
Experiment 2  Model B  center error  1.339 (0.393)  0.561 (0.190)
                       size error    143.4 (45.9)   58.1 (18.8)
                       Rand index    0.834 (0.016)  0.858 (0.014)

We first apply constrained and unconstrained K-means to data generated from Model D (Experiment 1) and to data generated from Model B (Experiment 2). The comparison criteria are the total error on the cluster center values and the total difference in the cluster sizes,

$$\sum_{k=1}^{K} |\hat\mu_k - \mu_k| \quad \text{and} \quad \sum_{k=1}^{K} |\hat n_k - n_k|, \qquad (28)$$

and also the Rand index.
The sample size N, the separation parameter δ, and the number of repeats R were held fixed in both experiments. We report the average results over the R experiments and their standard deviations in Table 4. It is quite clear that in these experiments constrained K-means improves significantly over unconstrained K-means in terms of both center estimation and clustering accuracy.

Experiments 3, 4 and 5 show three examples where applying our constrained EM algorithm with a valid separation constraint results in a large improvement in parameter estimation over the regular EM algorithm. Both algorithms are identically initialized using the constrained K-means algorithm of Section 2. We apply the two methods to data generated from Model B in Experiment 3, from Model A in Experiment 4, and from Model C in Experiment 5. The comparison criteria are the average error on the center values and the average error on all parameters,

$$\frac{1}{K}\sum_{k=1}^{K} |\hat\mu_k - \mu_k| \quad \text{and} \quad \frac{1}{K}\sum_{k=1}^{K} \left(|\hat\mu_k - \mu_k| + |\hat\pi_k - \pi_k| + |\hat v_k - v_k|\right), \qquad (29)$$

and the Rand index.
Table 5: Errors and Rand Indices of the EM Algorithm and the Constrained EM Algorithm

Experiment    Model    Criteria              Regular EM     Constrained EM
Experiment 3  Model B  center error          0.339 (0.205)  0.058 (0.021)
                       all parameters error  0.976 (0.296)  0.454 (0.197)
                       Rand index            0.893 (0.023)  0.906 (0.013)
Experiment 4  Model A  center error          0.252 (0.155)  0.172 (0.120)
                       all parameters error  0.568 (0.320)  0.409 (0.242)
                       Rand index            0.715 (0.047)  0.726 (0.040)
Experiment 5  Model C  center error          0.448 (0.231)  0.276 (0.213)
                       all parameters error  0.994 (0.415)  0.764 (0.367)
                       Rand index            0.810 (0.028)  0.820 (0.030)

The sample size N, the lower separation bounds $\delta_{k,1}$, and the number of repeats R were again held fixed across these experiments. We report the average results over the R experiments and their standard deviations in Table 5. It is clear that in these three experiments the constrained EM algorithm improves over the regular EM algorithm in terms of both parameter estimation and clustering accuracy.

In the last experiments, we look at the impact of the separation constraint on the result of the constrained algorithms (Experiment 6), and also explore how imposing two-sided constraints differs from imposing one-sided constraints (Experiment 7). We focus on the simplest model, Model A, which has only two components. In Experiment 6, we gradually increase δ on a grid of values ranging from 0.85 to 2.4. We generate R = 100 data sets from Model A, then, for each δ, we fit the model and record all parameters as in Experiments 3-5. We present the results in three plots with 90% confidence bands in Figure 1. We can see that the curves representing the regular and one-sided constrained EM tend to overlap on the very left. The one-sided constrained EM starts to become more accurate in parameter estimation than the regular EM when δ is as low as 0.9, and the improvement increases as we move closer to the actual separation value of 2. Once past 2, its performance starts to deteriorate compared to the regular EM. On the other hand, when the two-sided constraint indeed contains the true separation, 2, the two-sided constrained EM outperforms the one-sided constrained and regular EM. However, quite intuitively, when the two-sided constraint does not include the actual separation, the performance of the two-sided constrained EM is significantly worse than that of the two other algorithms before reaching the actual separation value, and the same as the one-sided constrained EM after passing the actual separation value.

Real data

In this section we apply our constrained EM algorithm to the 2010 National Youth Physical Activity and Nutrition Study (NYPANS) dataset provided in (Centers for Disease Control and Prevention, 2010). The dataset consists of measurements on high school students in the United States in 2010. Since the height of students in each grade follows an approximate Gaussian distribution, and there are 4 grades, the heights of all the students together can be modeled as a GMM with K = 4.
We focus on estimating the parameters of this GMM.

As prior knowledge, we use information from the data table of stature-for-age charts provided by the National Center for Health Statistics in (Centers for Disease Control and Prevention, 2001).

Figure 1: Results of Experiments 6 and 7: (a) error in the estimation of the centers as a function of the separation δ; (b) error in the estimation of all the parameters as a function of the separation δ; (c) Rand index as a function of the separation δ. Black curves represent unconstrained EM; red curves represent constrained EM with a lower bound on the separation, δ; blue curves represent two-sided constrained EM, with the lower bound on the separation being δ and the upper bound being δ plus a fixed offset.
Model   Criteria                          Regular EM  Constrained EM
NYPANS  center error (times 100)          6.7         0.5
        all parameters error (times 100)  15.6        3.2
        Rand index                        0.566       0.596

The two sources are both gathered in the United States, with a time difference of 9 years. From the mean age of the students in the NYPANS study in each of the 4 grades, we found the most relevant median height (which, for a Gaussian distribution, coincides with the mean height) in these charts, and subtracted the adjacent values. To account for possible slight changes over the years, we add and subtract a small offset, respectively, to obtain the upper bounds $\delta_{1,2}, \delta_{2,2}, \delta_{3,2}$ and the lower bounds $\delta_{1,1}, \delta_{2,1}, \delta_{3,1}$ used in Constraint 2.

Acknowledgments
This work was partially supported by a grant from the US National Science Foundation.
References
Banerjee, A. and J. Ghosh (2006). Scalable clustering algorithms with balancing constraints. Data Mining and Knowledge Discovery 13(3), 365-395.
Basu, S., A. Banerjee, and R. J. Mooney (2004). Active semi-supervision for pairwise constrained clustering. In SIAM International Conference on Data Mining, pp. 333-344.
Basu, S., I. Davidson, and K. Wagstaff (2008). Constrained Clustering: Advances in Algorithms, Theory, and Applications. CRC Press.
Bennett, K., P. Bradley, and A. Demiriz (2000). Constrained K-means clustering. Technical Report MSR-TR-2000-65, Microsoft Research.
Centers for Disease Control and Prevention (2001). Data table of stature-for-age charts.
Centers for Disease Control and Prevention (2010). National youth physical activity and nutrition study (NYPANS).
Chauveau, D. and D. Hunter (2013). ECM and MM algorithms for Normal mixtures with constrained parameters. https://hal.archives-ouvertes.fr/hal-00625285.
Davidson, I. and S. Basu (2007). A survey of clustering with instance level constraints. ACM Transactions on Knowledge Discovery from Data 1(1-41), 2-42.
Davidson, I. and S. Ravi (2005). Clustering with constraints: feasibility issues and the K-means algorithm. In SIAM International Conference on Data Mining, pp. 138-149.
Davis, K. A., C. G. Park, and S. K. Sinha (2012). Testing for generalized linear mixed models with cluster correlated data under linear inequality constraints. Canadian Journal of Statistics 40(2), 243-258.
Day, N. E. (1969). Estimating the components of a mixture of Normal distributions. Biometrika 56(3), 463-474.
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 1-22.
Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97(458), 611-631.
Gançarski, P., B. Crémilleux, G. Forestier, and T. Lampert (2020). Constrained clustering: current and new trends. In A Guided Tour of Artificial Intelligence Research, pp. 447-484. Springer.
Goldfarb, D. and A. Idnani (1982). Dual and primal-dual methods for solving strictly convex quadratic programs. In Numerical Analysis, pp. 226-239. Springer.
Goldfarb, D. and A. Idnani (1983). A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming 27(1), 1-33.
Hathaway, R. J. (1985). A constrained formulation of maximum-likelihood estimation for Normal mixture distributions. The Annals of Statistics 13(2), 795-800.
Hathaway, R. J. (1986). A constrained EM algorithm for univariate Normal mixtures. Journal of Statistical Computation and Simulation 23(3), 211-230.
Ingrassia, S. (2004). A likelihood-based constrained algorithm for multivariate Normal mixture models. Statistical Methods and Applications 13(2), 151-166.
Ingrassia, S. and R. Rocci (2007). Constrained monotone EM algorithms for finite mixture of multivariate Gaussians. Computational Statistics & Data Analysis 51(11), 5339-5351.
Ingrassia, S. and R. Rocci (2011). Degeneracy of the EM algorithm for the MLE of multivariate Gaussian mixtures and dynamic constraints. Computational Statistics & Data Analysis 55(4), 1715-1725.
Jain, A. K., M. N. Murty, and P. J. Flynn (1999). Data clustering: a review. ACM Computing Surveys (CSUR) 31(3), 264-323.
Jamshidian, M. (2004). On algorithms for restricted maximum likelihood estimation. Computational Statistics & Data Analysis 45(2), 137-157.
Kaufman, L. and P. J. Rousseeuw (2009). Finding Groups in Data: An Introduction to Cluster Analysis, Volume 344. John Wiley & Sons.
Kim, D. K. and J. M. Taylor (1995). The restricted EM algorithm for maximum likelihood estimation under linear restrictions on the parameters. Journal of the American Statistical Association 90(430), 708-716.
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129-137.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 281-297. Oakland, CA, USA.
McLachlan, G. J. and T. Krishnan (2007). The EM Algorithm and Extensions, Volume 382. John Wiley & Sons.
McLachlan, G. J. and D. Peel (2004). Finite Mixture Models. John Wiley & Sons.
Meng, X.-L. and D. B. Rubin (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80(2), 267-278.
Nielsen, F. and R. Nock (2014). Optimal interval clustering: Application to Bregman clustering and statistical mixture learning. IEEE Signal Processing Letters 21(10), 1289-1292.
Pan, W. and X. Shen (2007). Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research 8, 1145-1164.
Qiao, M. and J. Li (2015). Gaussian mixture models with component means constrained in pre-selected subspaces. arXiv preprint arXiv:1508.06388.
Rocci, R., S. A. Gattone, and R. Di Mari (2018). A data driven equivariant approach to constrained Gaussian mixture modeling. Advances in Data Analysis and Classification 12(2), 235-260.
Shi, N.-Z., S.-R. Zheng, and J. Guo (2005). The restricted EM algorithm under inequality restrictions on the parameters. Journal of Multivariate Analysis 92(1), 53-76.
Shin, D. W., C. G. Park, and T. Park (2001). Testing for one-sided group effects in repeated measures study. Computational Statistics & Data Analysis 37(2), 233-247.
Szkaliczki, T. (2016). clustering.sc.dp: Optimal clustering with sequential constraint by using dynamic programming. R Journal 8(1), 318-327.
Takai, K. (2012). Constrained EM algorithm with projection method. Computational Statistics 27(4), 701-714.
Tan, M., G.-L. Tian, H.-B. Fang, and K. W. Ng (2007). A fast EM algorithm for quadratic optimization subject to convex constraints. Statistica Sinica 17(3), 945-964.
Tian, G.-L., K. W. Ng, and M. Tan (2008). EM-type algorithms for computing restricted MLEs in multivariate Normal distributions and multivariate t-distributions. Computational Statistics & Data Analysis 52(10), 4768-4778.
Wagstaff, K., C. Cardie, S. Rogers, and S. Schrödl (2001). Constrained K-means clustering with background knowledge. In International Conference on Machine Learning, Volume 1, pp. 577-584.
Wang, H. and M. Song (2011). Ckmeans.1d.dp: Optimal K-means clustering in one dimension by dynamic programming. The R Journal 3(2), 29.
Wang, S. and J. Zhu (2008). Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics 64(2), 440-448.
Zheng, S., J. Guo, N.-Z. Shi, and G.-L. Tian (2012). Likelihood-based approaches for multivariate linear models under inequality constraints for incomplete data. Journal of Statistical Planning and Inference 142(11), 2926-2942.
Zheng, S., N. Shi, and J. Guo (2005). The restricted EM algorithm under linear inequalities in a linear model with missing data. Science in China Series A: Mathematics 48(6), 819-828.
A Proof of Proposition 1
Using Jensen's inequality, we derive:

$$L(\theta, x) = \sum_{i=1}^{N} \log\left(\sum_{k=1}^{K} \pi_k f(x_i; \mu_k, v_k)\right) \qquad (30)$$

$$= \sum_{i=1}^{N} \log\left(\sum_{k=1}^{K} w^{s+1}_{i,k} \frac{\pi_k f(x_i; \mu_k, v_k)}{w^{s+1}_{i,k}}\right) \qquad (31)$$

$$\ge \sum_{i=1}^{N} \sum_{k=1}^{K} w^{s+1}_{i,k} \log\left(\frac{\pi_k f(x_i; \mu_k, v_k)}{w^{s+1}_{i,k}}\right) \qquad (32)$$

$$= Q(\theta, \theta^s) - \sum_{i=1}^{N} \sum_{k=1}^{K} w^{s+1}_{i,k} \log\left(w^{s+1}_{i,k}\right). \qquad (33)$$

Plugging in $\theta^{s+1}$ for θ, we obtain:

$$L(\theta^{s+1}, x) \ge Q(\theta^{s+1}, \theta^s) - \sum_{i=1}^{N} \sum_{k=1}^{K} w^{s+1}_{i,k} \log\left(w^{s+1}_{i,k}\right), \qquad (34)$$

while by definition of the weights in (15), we have:

$$L(\theta^s, x) = Q(\theta^s, \theta^s) - \sum_{i=1}^{N} \sum_{k=1}^{K} w^{s+1}_{i,k} \log\left(w^{s+1}_{i,k}\right). \qquad (35)$$

Thus to prove that $L(\theta^{s+1}, x) \ge L(\theta^s, x)$, it suffices to prove that $Q(\theta^{s+1}, \theta^s) \ge Q(\theta^s, \theta^s)$. As we maximize $Q(\theta = (\pi, \mu, v), \theta^s)$ in every step of the ECM algorithm with the other parameters held fixed (and since the previous iterate is always feasible for each sub-maximization, including the constrained maximization over the centers, so that each step can only improve Q), we have:

$$Q(\theta^{s+1}, \theta^s) = Q((\pi^{s+1}, \mu^{s+1}, v^{s+1}), \theta^s) \qquad (36)$$

$$\ge Q((\pi^{s+1}, \mu^{s+1}, v^s), \theta^s) \qquad (37)$$

$$\ge Q((\pi^{s+1}, \mu^s, v^s), \theta^s) \qquad (38)$$

$$\ge Q((\pi^s, \mu^s, v^s), \theta^s) \qquad (39)$$

$$= Q(\theta^s, \theta^s).$$