K-Means and Gaussian Mixture Modeling with a Separation Constraint
He Jiang and Ery Arias-Castro
University of California, San Diego, USA
Abstract
We consider the problem of clustering with K-means and Gaussian mixture models with a constraint on the separation between the centers, in the context of real-valued data. We first propose a dynamic programming approach to solving the K-means problem with a separation constraint on the centers, building on (Wang and Song, 2011). In the context of fitting a Gaussian mixture model, we then propose an EM algorithm that incorporates such a constraint. A separation constraint can help regularize the output of a clustering algorithm, and we provide both simulated and real data examples to illustrate this point.

1 Introduction

Cluster analysis is broadly seen as one of the most important tasks in (unsupervised) data analysis (Jain et al., 1999; Kaufman and Rousseeuw, 2009). In some situations, the analyst may have prior knowledge or relevant information regarding the clusters, and incorporating this information, which may be done via constraints, is thought to improve the accuracy of the clustering output (Basu et al., 2008). In the present paper, we address situations where the analyst has prior knowledge on the separation between the cluster centers. We focus on methods that rely on centers to define clusters: the K-means criterion and Gaussian mixture models (GMM).

K-means

A number of constrained variants of K-means have been proposed in the literature, where a significant amount of attention has been placed on the cluster sizes, such as imposing a lower bound on the minimum cluster size or aiming to balance the cluster sizes. For instance, Bennett et al. (2000) addressed the problem of small or even empty clusters by enforcing a size constraint in the cluster assignment step of Lloyd's algorithm; Nielsen and Nock (2014) designed a dynamic K-means algorithm in 1D aimed at optimizing the Bregman divergence that can incorporate a constraint on the minimum cluster size; and Banerjee and Ghosh (2006) proposed an approach to K-means that is able to handle cluster size balance constraints. Constrained K-means has also been widely considered in other areas (Basu et al., 2008). For example, Basu et al. (2004) and Wagstaff et al. (2001) considered forcing some user-specified pairs of observations to belong to the same cluster while forcing others to belong to different clusters. An overview of instance-level constrained clustering, including K-means approaches and related methods, can be found in (Davidson and Basu, 2007). Davidson and Ravi (2005) also incorporated a minimum separation constraint between clusters, mandating that any two points assigned to different clusters need to be separated by at least δ, a parameter specified by the analyst. Szkaliczki (2016) considered points arriving sequentially and constraints where clusters have to be of the form $\{x_i, \ldots, x_{i+j}\}$. A recent survey of constrained clustering, including K-means related methods and modern developments, can be found in Gançarski et al. (2020).

We propose a variant of K-means where a user-specified separation between the cluster centers is enforced, which, to the best of our knowledge, is novel. We place ourselves in dimension one, where we are able to solve the resulting constrained optimization problem exactly by building on the dynamic programming approach of Wang and Song (2011).
Gaussian mixture models

Model-based clustering is an important approach to clustering (Fraley and Raftery, 2002; McLachlan and Peel, 2004), and among parametric finite mixture models, Gaussian mixture models (GMM) are by far the most popular, with their parameters frequently estimated by the EM algorithm or variants (Dempster et al., 1977; McLachlan and Krishnan, 2007).

EM-type algorithms forcing various constraints on the parameters of the mixture model have been considered extensively in the literature. Kim and Taylor (1995), in a more general setting that includes GMM, considered enforcing some equality constraints on a subset of the parameters and provided an EM algorithm based on a projected Newton-Raphson step in the maximization stage. Along the same lines, but addressing both equality and inequality constraints, Jamshidian (2004) used a projected gradient step instead, in conjunction with an active set method in order to handle the inequality constraints. See also (Takai, 2012). In the context of linear models in regression, EM-type algorithms were also proposed to handle constraints on the parameters, in particular linear equalities and inequalities (Davis et al., 2012; Shi et al., 2005; Shin et al., 2001; Tan et al., 2007; Tian et al., 2008; Zheng et al., 2012, 2005).

In the context of GMM proper, constraints have long been considered as a way to reduce the number of parameters, for example, by making all the variances or covariance matrices equal. Beyond that, constraints on the variances or eigenvalues of the covariance matrices have been considered extensively, mostly for the purpose of bounding the log-likelihood (Day, 1969). In particular, Hathaway (1985) proposed a lower bound on the ratio of any two component standard deviations, which enabled the author to prove the strong consistency of the resulting maximum likelihood estimator. This was followed by (Hathaway, 1986), where a constrained EM algorithm was proposed for maximizing the likelihood with this lower bound constraint on the ratio of any two standard deviations, together with another lower bound constraint on the weights. Generalizing Hathaway's results to higher dimensions, Ingrassia (2004); Ingrassia and Rocci (2007, 2011); Rocci et al. (2018) examined EM algorithms enforcing constraints on the eigenvalues of the covariance matrices.

In addition to constraints on the component variances or eigenvalues, estimation of the component means under constraints has also been considered in the literature, and EM-type algorithms have been proposed for that purpose. For example, Pan and Shen (2007) and Wang and Zhu (2008) considered penalizing the $L_1$ and $L_\infty$ norms of the mean vectors, respectively, to enforce sparsity and/or regularize the model in high dimensions, while Chauveau and Hunter (2013) and Qiao and Li (2015) proposed EM algorithms for linear constraints on the mean parameters.

We consider here, in dimension 1, arbitrary constraints on the separation between adjacent means. We propose an EM algorithm that enforces such constraints, where the M step amounts to solving a quadratic program.

The main focus is on real-valued data: we do not know how to enforce a separation constraint on the centers in higher dimensions, at least not in such a principled fashion. The data points are assumed ordered without loss of generality, denoted $x_1 \le \cdots \le x_N$ and gathered in a vector $x = (x_1, \ldots, x_N) \in \mathbb{R}^N$. Our goal is to group these data points into K clusters, where K is specified by the user.
We do not address the perennial issue of choosing K.

The organization of the paper is as follows. In Section 2, we introduce our dynamic programming algorithm for exactly solving K-means in dimension one with a constraint on the minimum separation between the centers. In Section 3, we introduce our EM algorithm for (approximately) fitting a Gaussian mixture model by maximum likelihood under the same constraint. It is a true EM algorithm in that the likelihood increases with each iteration. In Section 4, we describe some numerical experiments on both simulated data and real data to illustrate the use and accuracy of our methods.

2 K-Means

Given data points on the real line, $x_1, \ldots, x_N$, and a desired number of clusters, K, basic K-means is defined as the following optimization problem (MacQueen, 1967):

$$\text{minimize } \sum_{i=1}^{N} \min_{k=1,\ldots,K} (x_i - \mu_k)^2, \quad \text{over } \mu_1 < \cdots < \mu_K.$$

This problem is difficult in general. For one thing, a brute-force approach by grid search would be in dimension K and quickly infeasible or too costly. The problem is instead most often approached via iterative methods such as Lloyd's algorithm (Lloyd, 1982), which consists in alternating between, for all k, defining Cluster k as the set of points nearest to $\mu_k$, and then recomputing $\mu_k$ as the barycenter or average of the points forming Cluster k. In dimension 1, however, the K-means problem can be solved exactly by dynamic programming, as was established and carried out by Wang and Song (2011).

We are interested in a variant of the K-means problem where the following separation constraint on the centers is enforced:

Constraint 1. $\mu_{k+1} - \mu_k \ge \delta$ for all $k = 1, \ldots, K-1$.

In other words, we consider the following optimization problem:

$$\text{minimize } \sum_{i=1}^{N} \min_{k=1,\ldots,K} (x_i - \mu_k)^2, \quad \text{over } \mu_1 < \cdots < \mu_K \text{ satisfying } \mu_{k+1} - \mu_k \ge \delta \text{ for all } k = 1, \ldots, K-1.$$

The amount of separation is determined by the user-specified parameter $\delta \ge 0$.
(Of course, if $\delta = 0$, this reduces to the basic K-means problem.) Inspired by the work of Wang and Song (2011), we propose a dynamic programming algorithm to solve this constrained variant of K-means. The remainder of this section is dedicated to describing this algorithm, which is encapsulated in Table 1 and Table 2.

Given the sorted data $x = (x_1, \ldots, x_N)$, for $r_1 = 1, \ldots, N$ and $r_2 = r_1, \ldots, N$, we first define the within current cluster center (C) and the within current cluster sum of squares (W):

$$C[r_1, r_2] = \frac{1}{r_2 - r_1 + 1} \sum_{i=r_1}^{r_2} x_i, \qquad (1)$$

$$W[r_1, r_2] = \sum_{i=r_1}^{r_2} (x_i - C[r_1, r_2])^2. \qquad (2)$$

Each element can be efficiently computed using a simple recursion. For $i = 1, \ldots, N$ and $m = 1, \ldots, K$, define $D[i, m]$ as recording the minimum error sum of squares when grouping the first i data points into m clusters. The main idea of Wang and Song (2011) is to update D dynamically as follows:

$$D[i, m] = \min_{j = m, \ldots, i} D[j-1, m-1] + W[j, i]. \qquad (3)$$

This is carried out for $i = 1, \ldots, N$ and $m = 1, \ldots, K$. (The initialization is $D[0, m] = 0$ for all m and $D[i, 1] = W[1, i]$ for all i.)

The method of (Wang and Song, 2011) was designed for solving the basic K-means problem and needs to be modified, in a substantial way, to solve the separation-constrained K-means problem.
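To make the unconstrained recursion (3) concrete before describing those modifications, here is a minimal R sketch. It is our own illustration (not the authors' released code nor the Ckmeans.1d.dp implementation); the function name kmeans_dp_1d is ours, and W is computed directly from (1)-(2) rather than by the running recursion mentioned above.

```r
# Minimal sketch of the unconstrained 1D K-means dynamic program of Wang and Song (2011).
# x: sorted data vector; K: number of clusters.
# Returns the matrix D, where D[i, m] is the minimum error sum of squares
# for grouping x[1..i] into m clusters; D[N, K] is the optimal K-means objective.
kmeans_dp_1d <- function(x, K) {
  N <- length(x)
  # W[r1, r2]: within-cluster sum of squares of x[r1..r2], as in (1)-(2)
  # (computed naively here; a running-mean recursion makes this O(N^2) overall)
  W <- matrix(0, N, N)
  for (r1 in 1:N) {
    for (r2 in r1:N) {
      seg <- x[r1:r2]
      W[r1, r2] <- sum((seg - mean(seg))^2)
    }
  }
  D <- matrix(Inf, N, K)
  D[, 1] <- W[1, ]  # one cluster: all of x[1..i] together
  if (K >= 2) {
    for (m in 2:K) {
      for (i in m:N) {
        # recursion (3): the last cluster is x[j..i], the rest form m-1 clusters
        D[i, m] <- min(sapply(m:i, function(j) D[j - 1, m - 1] + W[j, i]))
      }
    }
  }
  D
}
```

Keeping track of the minimizing index j at each step (as the separation-constrained algorithm below does through the matrix I) allows the optimal clustering itself to be recovered by backtracking.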
For illustrative purposes, consider clustering eight sorted data points $x_1 \le \cdots \le x_8$ into K = 5 clusters with separation parameter δ = 1.75. Note that without the separation constraint the clustering assignment is 12233455, but the centers of Cluster 3 and Cluster 4 are only separated by 1.5, which is less than δ. With the separation constraint, the clustering assignment is 12334455. The same example is used to highlight two issues that need to be addressed in order to incorporate the separation constraint:

• The m-th cluster, consisting of $x_j, \ldots, x_i$, and the previous cluster, i.e., the (m − 1)-th cluster in the optimal clustering of the first j − 1 data points into m − 1 clusters, may not be separated by δ in terms of their means. In our example, when grouping $\{x_1, \ldots, x_5\}$ into 3 clusters, the optimal solution is 12233, but when we try to group $\{x_1, \ldots, x_6\}$ into 4 clusters under the separation constraint by appending $\{x_6\}$ as a new cluster, this is impossible, as $C[6, 6] - C[4, 5] < \delta$. (We are not allowing the clusters to be empty.)

• Consider the stage where we have grouped $x_1, \ldots, x_i$ into m clusters, and the last cluster is $x_j, \ldots, x_i$. Then it is not necessarily the case that the first m − 1 clusters form the optimal grouping of $x_1, \ldots, x_{j-1}$ into m − 1 clusters, and so the recursion (3) is not necessarily valid. In our example, the optimal clustering for grouping $\{x_1, \ldots, x_4\}$ into 3 clusters is 1223; however, when grouping $\{x_1, \ldots, x_5\}$ into 4 clusters under the separation constraint, the clustering assignment becomes 12334, so that the starting point of Cluster 3 has changed from $x_4$ to $x_3$.

In brief, (3) cannot be used as the updating rule. Although the first issue is easy to fix by checking that the separation constraint is satisfied, the second issue is more difficult to deal with and requires us to introduce additional matrices/tensors as described below.

We now describe our algorithm. First initialize:

$$\beta = W[1, N] + 1, \quad D = \beta \cdot \mathbf{1}_{N \times K}, \quad I = \mathbf{0}_{N \times K}, \quad B = \mathbf{0}_{N \times K}, \quad U = \beta \cdot \mathbf{1}_{N \times N \times K}, \qquad (4)$$

where β is chosen larger than the maximum possible error sum of squares. Define:

$$D[i, 1] = \sum_{l=1}^{i} (x_l - C[1, i])^2, \quad I[i, 1] = 1, \quad B[i, 1] = 1, \quad U[i, r, 1] = 0, \quad \text{for } i = 1, \ldots, N;\ r = i, \ldots, N. \qquad (5)$$

The D matrix is used to store the optimal error sum of squares under the separation constraint; specifically, D[i, m] records the minimal error sum of squares when putting the first i values into m clusters while satisfying Constraint 1, if possible (recall that we do not allow clusters to be empty). The corresponding entry in the I matrix, I[i, m], is used to record the index of the first data point in the m-th cluster in the optimal clustering of the first i data points into m clusters (the clustering corresponding to D[i, m]). B[i, m], on the other hand, is used to store the smallest possible index of the leftmost data point in the m-th cluster over all groupings of the first i data points into m clusters satisfying the separation constraint. Lastly, U[q, r, m] records the optimal error sum of squares for grouping the first q − 1 data points into m − 1 clusters in such a way that the (m − 1)-th cluster and the m-th cluster, which consists of the data points $x_q, \ldots, x_r$, are separated by δ in means.

In our updates of the above matrices/tensors, we enforce the separation constraint. Suppose that D[·, m−1], I[·, m−1], B[·, m−1] as well as U[·, ·, m−1] have been updated, so we are at stage m with the immediate task of updating D[·, m], I[·, m], B[·, m] as well as U[·, ·, m]. Now consider grouping the first i data points into m clusters.
If B[i − 1, m − 1] = 0, do nothing (the values stay at the initial ones). Otherwise, for each $j = m, \ldots, i$, we find the first value $T_j \in \{I[j-1, m-1], \ldots, B[j-1, m-1]\}$ that satisfies $C[j, i] - C[T_j, j-1] \ge \delta$. Note that $T_j$ also depends on i and m, but that is left implicit. The reason for defining $T_j$ in this way is that, starting from B[j − 1, m − 1], the closer we are to I[j − 1, m − 1], the smaller the error sum of squares. Then the main update that guarantees optimality is:

$$D[i, m] = \min_{j = m, \ldots, i} U[T_j, j-1, m-1] + W[T_j, j-1] + W[j, i], \qquad (6)$$

and

$$U[j, i, m] = U[T_j, j-1, m-1] + W[T_j, j-1]. \qquad (7)$$

Then I[i, m] is updated as the smallest minimizing index in (6), and B[i, m] is updated as the smallest j that satisfies $C[j, i] - C[T_j, j-1] \ge \delta$.
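As an illustration of the search for $T_j$, here is a small R helper (our own sketch, with hypothetical argument names; it assumes the matrices C, I and B have already been filled as above). It scans candidate start indices of the (m − 1)-th cluster from the optimal start I[j − 1, m − 1] toward the smallest possible start B[j − 1, m − 1] and returns the first one meeting the separation requirement; in the full algorithm, updates (6) and (7) are then applied with this value.

```r
# Find T_j: the first start index t of the (m-1)-th cluster, scanning from
# I[j-1, m-1] down to B[j-1, m-1], whose cluster mean C[t, j-1] is at least
# delta below the mean C[j, i] of the candidate m-th cluster x[j..i].
# Returns NA if no such start exists (the separation constraint cannot be met).
find_Tj <- function(C, I, B, j, i, m, delta) {
  for (t in I[j - 1, m - 1]:B[j - 1, m - 1]) {  # counts down, since B <= I
    if (C[j, i] - C[t, j - 1] >= delta) return(t)
  }
  NA
}
```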
Computational complexity

While the algorithm of Wang and Song (2011) has time complexity $O(N^2 K)$, our method has computational complexity $O(N^3 K)$. While this is true in the worst case, in practice the computational cost appears much lower, as moving from I[j − 1, m − 1] to B[j − 1, m − 1] and stopping at the first value where the constraint is satisfied considers significantly fewer options than N. The upper bound on the space complexity of this algorithm is $O(N^2 K)$, compared to $O(NK)$ for the algorithm of Wang and Song (2011). This is due to the maintenance of the tensor U, but again, this is in the worst case, as in practice U is very sparse (most of its coefficients are equal to the initial value β) and the space complexity is far below the upper bound. We are now ready to use Table 2 to find the optimal error sum of squares under Constraint 1 and to find the corresponding optimal clustering using Table 1.

Table 1: Separation-Constrained K-Means
inputs: C, I, B (all three obtained from Table 2), number of clusters K, separation δ
initialize b as a vector of length K, compute b[K] = I[N, K], and set l = b[K] − 1
if K = 1 then return b
for j = I[l, K − 1], ..., B[l, K − 1] do
    if C[l + 1, N] − C[j, l] ≥ δ then b[K − 1] = j, l = j − 1, break
if K = 2 then return b
for k = (K − 2), ..., 1 do
    for j = I[l, k], ..., B[l, k] do
        if C[l + 1, b[k + 2] − 1] − C[j, l] ≥ δ then b[k] = j, l = j − 1, break
return b

3 Gaussian Mixture Models

A Gaussian mixture model is of the form

$$\sum_{k=1}^{K} \pi_k f(x; \mu_k, v_k), \qquad (8)$$

where $f(\cdot; \mu, v)$ is the density of the normal distribution with mean μ and variance v, meaning

$$f(x; \mu, v) = \frac{1}{\sqrt{2\pi v}} \exp\left(-\frac{(x - \mu)^2}{2v}\right), \qquad (9)$$

and $\pi_1, \ldots, \pi_K$ are the mixture weights satisfying

$$\min_k \pi_k > 0, \qquad \sum_k \pi_k = 1. \qquad (10)$$

For convenience, define $\theta = (\pi, \mu, v)$ where $\pi = (\pi_1, \ldots, \pi_K) \in \mathbb{R}^K$, $\mu = (\mu_1, \ldots, \mu_K) \in \mathbb{R}^K$, and $v = (v_1, \ldots, v_K) \in \mathbb{R}^K$. We then let $g_\theta$ denote the mixture (8). The working assumption is that the data points, $x_1, \ldots, x_N$, are the realization of a sample of size N drawn iid from a mixture distribution of the form (8), where a draw from that distribution amounts to drawing a cluster index in $\{1, \ldots, K\}$ according to the probability distribution π and then generating a real-valued observation according to the corresponding normal density, $f(\cdot; \mu_k, v_k)$ if k is the drawn index value.

Given the mixture model (8), a point x is assigned to the most likely cluster, meaning, to

$$\arg\max_k \pi_k f(x; \mu_k, v_k). \qquad (11)$$

Thus clustering using a Gaussian mixture model boils down to estimating the parameters of the model.
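As a small illustration of the assignment rule (11), the following R sketch (ours; the function name assign_gmm is hypothetical) assigns each point to the component with the largest weighted density, given estimated weights, means and variances.

```r
# Assignment rule (11): each point goes to the component k maximizing
# pk[k] * f(x; mu[k], v[k]), where f is the normal density with mean mu[k]
# and variance v[k]. pk, mu, v are vectors of length K; returns labels in 1..K.
# Example: assign_gmm(c(-0.2, 3.1), pk = c(0.5, 0.5), mu = c(0, 3), v = c(1, 1))
assign_gmm <- function(x, pk, mu, v) {
  sapply(x, function(xi) which.max(pk * dnorm(xi, mean = mu, sd = sqrt(v))))
}
```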
Table 2: Routine computing the matrices C, W, D, I, B
inputs: data $x_1 \le \cdots \le x_N$, number of clusters K, separation δ
compute C and W as in (1) and (2)
define β and initialize D, I, B, U as in (4) and (5)
if K = 1 then
    go to the return step
else if K = 2 then
    for i = 2, ..., N do
        p = β · 1_{i−1}
        for j = 2, ..., i do
            if C[j, i] − C[1, j − 1] ≥ δ then
                p[j − 1] = W[j, i] + W[1, j − 1]
                U[j, i, 2] = W[1, j − 1]
        D[i, 2] = min(p)
        if D[i, 2] = β then I[i, 2] = 0, B[i, 2] = 0
        else
            I[i, 2] = min(arg min(p)) + 1
            B[i, 2] = min{a : p[a] < β} + 1
else
    do everything in the K = 2 case above, then
    for m = 3, ..., K do
        for i = m, ..., N do
            if D[i − 1, m − 1] < β then
                p = β · 1_{i−m+1}
                for j = m, ..., i do
                    if B[j − 1, m − 1] ≠ 0 then
                        if C[j, i] − C[I[j − 1, m − 1], j − 1] ≥ δ then
                            p[j − m + 1] = D[j − 1, m − 1] + W[j, i]
                            U[j, i, m] = D[j − 1, m − 1]
                        else
                            for t = I[j − 1, m − 1], ..., B[j − 1, m − 1] do
                                if C[j, i] − C[t, j − 1] ≥ δ then
                                    p[j − m + 1] = min(p[j − m + 1], U[t, j − 1, m − 1] + W[t, j − 1] + W[j, i])
                                    U[j, i, m] = U[t, j − 1, m − 1] + W[t, j − 1]
                                    break
                D[i, m] = min(p)
                if D[i, m] = β then I[i, m] = 0, B[i, m] = 0
                else
                    I[i, m] = min(arg min(p)) + m − 1
                    B[i, m] = min{a : p[a] < β} + m − 1
return D, I, B, C, W, β, U

Maximum likelihood estimation is a natural candidate, but difficult to implement, even in dimension 1, as maximizing the likelihood is not a convex problem and a grid search is too costly given the number of parameters (3K − 1). (Interestingly, maximum likelihood is a competitive method even though the likelihood is not bounded. To make sense of what the EM algorithm does, we believe that a useful perspective is to see it as attempting to maximize the likelihood starting from a reasonable initialization, understanding that the E and M steps prevent the algorithm from arriving at pathological values of the parameters.) The log-likelihood is

$$L(\theta, x) = \sum_{i=1}^{N} \log g_\theta(x_i). \qquad (12)$$

The EM algorithm is the standard approach to (approximately) maximizing it. The main idea is based on introducing cluster assignment variables, $z_1, \ldots, z_N$, where $z_i = k$ if $x_i$ was drawn from the k-th component of the mixture. These assignment variables are not observed (they are said to be latent), but are useful nonetheless for thinking about the problem. To that end, let $g_\theta(x, z)$ denote the joint density of $(X_i, Z_i)$. Note that $Z_i$ has distribution π and $X_i \mid Z_i = k$ has the normal distribution with mean $\mu_k$ and variance $v_k$. The algorithm starts with a value of the parameter vector, $\theta^0$, and then alternates between an E step and an M step, until some convergence criterion is satisfied. Suppose we are at the s-th iterate. The E step consists in computing the expectation of the log-likelihood for an arbitrary value of θ with respect to the mixture distribution given by the current parameter estimates:

$$Q(\theta, \theta^s) := \sum_{i=1}^{N} E_{\theta^s}\left[\log g_\theta(X_i, Z_i) \mid X_i = x_i\right]. \qquad (13)$$

It turns out that

$$Q(\theta, \theta^s) = \sum_{i=1}^{N} \sum_{k=1}^{K} w^{s+1}_{i,k} \log\left(\pi_k f(x_i; \mu_k, v_k)\right), \qquad (14)$$

where

$$w^{s+1}_{i,k} := \frac{\pi^s_k f(x_i; \mu^s_k, v^s_k)}{\sum_{l=1}^{K} \pi^s_l f(x_i; \mu^s_l, v^s_l)}. \qquad (15)$$

The M step consists of maximizing the resulting function with respect to θ to yield the updated values of the parameters:

$$\theta^{s+1} = \arg\max_\theta Q(\theta, \theta^s). \qquad (16)$$

This amounts to the following:

$$\pi^{s+1}_k = \frac{1}{N} \sum_{i=1}^{N} w^{s+1}_{i,k}, \qquad (17)$$

$$\mu^{s+1}_k = \frac{\sum_{i=1}^{N} w^{s+1}_{i,k} x_i}{\sum_{i=1}^{N} w^{s+1}_{i,k}}, \qquad (18)$$

$$v^{s+1}_k = \frac{\sum_{i=1}^{N} w^{s+1}_{i,k} (x_i - \mu^{s+1}_k)^2}{\sum_{i=1}^{N} w^{s+1}_{i,k}}. \qquad (19)$$

We are interested in a variant of the EM algorithm for maximizing the likelihood under the following separation constraint on the centers:

Constraint 2. $\delta_{k,1} \le \mu_{k+1} - \mu_k \le \delta_{k,2}$ for all $k = 1, \ldots, K-1$.

The separation parameters, $\delta_{k,1}$ and $\delta_{k,2}$, are set by the user. We note that this reduces to Constraint 1 when we set $\delta_{1,1} = \cdots = \delta_{K-1,1} = \delta$ and $\delta_{1,2} = \cdots = \delta_{K-1,2} = \infty$ (in practice, a very large number). We propose an EM algorithm that incorporates Constraint 2. It is a true EM algorithm in that the likelihood increases with each iteration.
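For reference, here is a minimal R sketch of one iteration of the standard (unconstrained) EM updates (15) and (17)-(19); the function name em_step_gmm1d and the list-based parameterization (components pk, mu, v) are ours. The constrained variant described next changes only the update of the means.

```r
# One iteration of the standard EM updates for a 1D Gaussian mixture:
# the E step (15) computes the responsibilities w[i, k]; the M step applies
# the closed-form updates (17)-(19) for the weights, means and variances.
# x: data vector; theta: list with pk (weights), mu (means), v (variances).
em_step_gmm1d <- function(x, theta) {
  K <- length(theta$mu)
  # E step: N x K matrix of responsibilities, each row normalized to sum to 1
  dens <- sapply(1:K, function(k) theta$pk[k] * dnorm(x, theta$mu[k], sqrt(theta$v[k])))
  w <- dens / rowSums(dens)
  # M step
  pk <- colMeans(w)                                    # update (17)
  mu <- colSums(w * x) / colSums(w)                    # update (18)
  v  <- colSums(w * outer(x, mu, "-")^2) / colSums(w)  # update (19)
  list(pk = pk, mu = mu, v = v)
}
```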
The remainder of this section is devoted to introducing the algorithm, which is given compactly in Table 3.

In the algorithm, the E step is identical to that in the regular EM algorithm. The M step is also the same except that the maximization is done under Constraint 2. The constraint is only on the centers, and to isolate it, we adopt an ECM approach (Meng and Rubin, 1993), where we first maximize over the weights, which amounts to the same update (17); then maximize over the centers subject to Constraint 2 (see details below); and finally maximize over the variances, which amounts to the same update (19).

Proposition 1.
The separation-constrained EM algorithm does not decrease the log-likelihood of the model, i.e., $L(\theta^{s+1}, x) \ge L(\theta^s, x)$ for all s.

A proof of this result, although standard, is given in the appendix.

The maximization over the centers amounts to solving the following optimization problem:

$$\text{minimize } \sum_{i=1}^{N} \sum_{k=1}^{K} \frac{w^{s+1}_{i,k}}{v^s_k} (x_i - \mu_k)^2 \text{ over } \mu_1, \ldots, \mu_K \text{ such that } \delta_{k,1} \le \mu_{k+1} - \mu_k \le \delta_{k,2} \text{ for all } k = 1, \ldots, K-1. \qquad (20)$$

Define

$$g_{i,k} = w^{s+1}_{i,k} / v^s_k, \quad G_i = \mathrm{diag}(g_{i,1}, \ldots, g_{i,K}), \quad G = \sum_{i=1}^{N} G_i. \qquad (21)$$

Also define

$$A = \begin{pmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \\ 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & -1 \end{pmatrix}, \qquad b = \begin{pmatrix} \delta_{1,1} \\ \vdots \\ \delta_{K-1,1} \\ -\delta_{1,2} \\ \vdots \\ -\delta_{K-1,2} \end{pmatrix}. \qquad (22)$$

With this notation, the optimization problem (20) can be equivalently expressed as follows:

$$\text{minimize } \mu^\top G \mu - 2 \sum_{i=1}^{N} x_i \mathbf{1}^\top G_i \mu \quad \text{subject to } A\mu \ge b, \qquad (23)$$

which is a quadratic program. In practice, we solve it using the dual method of Goldfarb and Idnani (1982, 1983) implemented in the quadprog package in R (https://cran.r-project.org/web/packages/quadprog/).
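As a sketch of this constrained center update, the following R code solves a problem of the form (23) with quadprog::solve.QP; the helper name update_means_constrained and its arguments are our own. solve.QP minimizes (1/2) μᵀ Dmat μ − dvecᵀ μ subject to Amatᵀ μ ≥ bvec, so we pass Dmat = 2G, dvec = 2 Σ_i x_i G_i 1, and the transpose of the constraint matrix A from (22).

```r
library(quadprog)

# Constrained center update: minimize sum_{i,k} g[i,k] * (x[i] - mu[k])^2
# subject to delta1[k] <= mu[k+1] - mu[k] <= delta2[k], as in (20)-(23).
# g: N x K matrix with entries w[i,k] / v[k] (assumed to have positive column sums);
# delta1, delta2: vectors of length K-1. Assumes K >= 2.
update_means_constrained <- function(x, g, delta1, delta2) {
  K <- ncol(g)
  G <- diag(colSums(g))               # G = sum_i G_i (diagonal, positive definite)
  cvec <- colSums(g * x)              # c_k = sum_i x_i * g[i, k]
  # Constraint matrix A of (22): rows mu_{k+1} - mu_k >= delta1_k,
  # followed by rows mu_k - mu_{k+1} >= -delta2_k
  A1 <- cbind(-diag(K - 1), 0) + cbind(0, diag(K - 1))
  A <- rbind(A1, -A1)
  bvec <- c(delta1, -delta2)
  sol <- solve.QP(Dmat = 2 * G, dvec = 2 * cvec, Amat = t(A), bvec = bvec)
  sol$solution                        # updated centers mu_1, ..., mu_K
}
```

In an EM iteration, g would be built from the responsibilities (15) divided column-wise by the current variances; setting delta2 to a very large value imposes only the lower bound, as in Constraint 1.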
Table 3: The separation-constrained EM algorithm
inputs: data $x_1, \ldots, x_N$, number of clusters K, tolerance value γ
initialization: initialize $(\pi^0, \mu^0, v^0)$ using the constrained 1D optimal K-means algorithm of Section 2 if it is given the same lower bound constraint; otherwise initialize using the 1D optimal K-means of (Wang and Song, 2011), with $\mu^0_1 < \mu^0_2 < \cdots < \mu^0_K$ and
    $\mu^0_k = \frac{1}{|J_k|} \sum_{i \in J_k} x_i, \quad \pi^0_k = \frac{|J_k|}{N}, \quad v^0_k = \frac{1}{|J_k|} \sum_{i \in J_k} (x_i - \mu^0_k)^2,$  (24)
    where $J_k$ denotes the set of indices clustered into the k-th cluster
E step: compute $w^{s+1}_{i,k}$ for $i = 1, \ldots, N$ and $k = 1, \ldots, K$:
    $w^{s+1}_{i,k} = \frac{\pi^s_k \, \mathcal{N}(x_i \mid \mu^s_k, v^s_k)}{\sum_{j=1}^{K} \pi^s_j \, \mathcal{N}(x_i \mid \mu^s_j, v^s_j)}$  (25)
M step: update $\pi_k$, $k = 1, \ldots, K$:
    $\pi^{s+1}_k = \frac{1}{N} \sum_{i=1}^{N} w^{s+1}_{i,k}$  (26)
    update $\mu_k$, $k = 1, \ldots, K$, by the solution to the minimization problem (23)
    update $v_k$, $k = 1, \ldots, K$:
    $v^{s+1}_k = \frac{\sum_{i=1}^{N} w^{s+1}_{i,k} (x_i - \mu^{s+1}_k)^2}{\sum_{i=1}^{N} w^{s+1}_{i,k}}$  (27)
convergence check: if $\max\{\max_k |\pi^{s+1}_k - \pi^s_k|, \max_k |\mu^{s+1}_k - \mu^s_k|, \max_k |v^{s+1}_k - v^s_k|\} \le \gamma$, stop and return $\theta^{s+1} = (\pi^{s+1}, \mu^{s+1}, v^{s+1})$; otherwise set $s = s + 1$ and repeat from the E step

4 Numerical Experiments

This section provides some simulated and real data examples illustrating the application of our algorithms. The constrained algorithms are shown to outperform their unconstrained counterparts both in parameter estimation and in producing a clustering that is more similar to the actual clusters, provided, of course, that the separation constraint reflects the reality of the situation. Our code is publicly available at https://github.com/h8jiang/Center-Separation-Constrained-K-Means-and-EM-Algorithm/.

Simulated data

We start with experiments performed on simulated data using our constrained algorithms. We first look at the performance of our constrained K-means algorithm relative to the regular K-means as implemented by Wang and Song (2011). We then compare the performance of our constrained EM algorithm for fitting Gaussian mixture models with that of the regular EM algorithm (Dempster et al., 1977). Finally, we examine the impact of the imposed constraints, including when they are not congruent with the actual situation.

All the simulated data sets were generated from one of the following Gaussian mixture models:

• Model A: a mixture of two normal components (with means separated by 2)
• Model B: a mixture of three normal components
• Model C: a mixture of five normal components
• Model D: a mixture of five normal components

Table 4: Errors and Rand Indices of K-Means and Optimal Constrained K-Means

Experiment    Model    Criteria      K-means        Constrained K-means
Experiment 1  Model D  center error  1.092 (0.276)  0.374 (0.161)
                       size error    165.6 (22.9)   119.9 (25.4)
                       Rand index    0.786 (0.015)  0.807 (0.014)
Experiment 2  Model B  center error  1.339 (0.393)  0.561 (0.190)
                       size error    143.4 (45.9)   58.1 (18.8)
                       Rand index    0.834 (0.016)  0.858 (0.014)

We first apply constrained and unconstrained K-means to data generated from Model D (Experiment 1) and to data generated from Model B (Experiment 2). The comparison criteria are the total error on the cluster center values and the total difference in the cluster sizes,

$$\sum_{k=1}^{K} |\hat\mu_k - \mu_k| \quad \text{and} \quad \sum_{k=1}^{K} |\hat n_k - n_k|, \qquad (28)$$

and also the Rand index.
The sample size N, the separation parameter δ, and the number of repeats R were held fixed in both experiments. We report the average results over the R experiments and their standard deviations in Table 4. It is quite clear that in these experiments constrained K-means improves significantly over unconstrained K-means in terms of both center estimation and clustering accuracy.

Experiments 3, 4 and 5 show three examples where applying our constrained EM algorithm with a valid separation constraint results in a large improvement in parameter estimation over the regular EM algorithm. Both algorithms are identically initialized using the constrained K-means algorithm of Section 2. We apply the two methods to data generated from Model B in Experiment 3, from Model A in Experiment 4, and from Model C in Experiment 5. The comparison criteria are the average error on the center values and the average error on all parameters,

$$\frac{1}{K}\sum_{k=1}^{K} |\hat\mu_k - \mu_k| \quad \text{and} \quad \frac{1}{K}\sum_{k=1}^{K} \left(|\hat\mu_k - \mu_k| + |\hat\pi_k - \pi_k| + |\hat v_k - v_k|\right), \qquad (29)$$

and the Rand index.
Table 5: Errors and Rand Indices of the EM Algorithm and the Constrained EM Algorithm

Experiment    Model    Criteria              Regular EM     Constrained EM
Experiment 3  Model B  center error          0.339 (0.205)  0.058 (0.021)
                       all parameters error  0.976 (0.296)  0.454 (0.197)
                       Rand index            0.893 (0.023)  0.906 (0.013)
Experiment 4  Model A  center error          0.252 (0.155)  0.172 (0.120)
                       all parameters error  0.568 (0.320)  0.409 (0.242)
                       Rand index            0.715 (0.047)  0.726 (0.040)
Experiment 5  Model C  center error          0.448 (0.231)  0.276 (0.213)
                       all parameters error  0.994 (0.415)  0.764 (0.367)
                       Rand index            0.810 (0.028)  0.820 (0.030)

The sample size N, the lower separation bounds $\delta_{k,1}$, and the number of repeats R were again held fixed across these experiments. We report the average results over the R experiments and their standard deviations in Table 5. It is clear that in these three experiments the constrained EM algorithm improves over the regular EM algorithm in terms of both parameter estimation and clustering accuracy.

In the last experiments, we look at the impact of the separation constraint on the result of the constrained algorithms (Experiment 6), and also explore how imposing two-sided constraints differs from imposing one-sided constraints (Experiment 7). We focus on the simplest model, Model A, which has only two components. In Experiment 6, we gradually increase δ on a grid of values ranging from 0.85 to 2.4. We generate R = 100 data sets from Model A, then, for each δ, we fit the model and record all parameters as in Experiments 3-5. We present the results in three plots with 90% confidence bands in Figure 1. We can see that the curves representing the regular and one-sided constrained EM tend to overlap on the very left. The one-sided constrained EM starts to become more accurate in parameter estimation than the regular EM when δ is as low as 0.9, and the improvement increases as we move closer to the actual separation value of 2. Once past 2, its performance starts to deteriorate compared to the regular EM. On the other hand, when the two-sided constraint indeed contains the true separation, 2, the two-sided constrained EM outperforms the one-sided constrained and regular EM. However, quite intuitively, when the two-sided constraint does not include the actual separation, the performance of the two-sided constrained EM is significantly worse than that of the two other algorithms before reaching the actual separation value, and the same as the one-sided constrained EM after passing the actual separation value.

Real data

In this section we apply our constrained EM algorithm to the 2010 National Youth Physical Activity and Nutrition Study (NYPANS) dataset provided in (Centers for Disease Control and Prevention, 2010). The dataset consists of measurements on high school students in the United States in 2010. Since the height of students in each grade follows an approximate Gaussian distribution, and there are 4 grades, the heights of all the students together can be modeled as a GMM with K = 4.
We focus on estimating the parameters of this GMM.

As prior knowledge, we use information from the data table of stature-for-age charts provided by the National Center for Health Statistics in (Centers for Disease Control and Prevention, 2001).

Figure 1: Results of Experiments 6 and 7: (a) error in the estimation of the centers as a function of the separation δ; (b) error in the estimation of all the parameters as a function of the separation δ; (c) Rand index as a function of the separation δ. Black curves represent unconstrained EM; red curves represent constrained EM with a lower bound on the separation, δ; blue curves represent two-sided constrained EM, with the lower bound on the separation being δ and the upper bound being δ plus a fixed offset.
Model   Criteria                          Regular EM  Constrained EM
NYPANS  center error (times 100)          6.7         0.5
        all parameters error (times 100)  15.6        3.2
        Rand index                        0.566       0.596

The two sources are both gathered in the United States, with a time difference of 9 years. From the mean age of the students in the NYPANS study in each of the 4 grades, we found the most relevant median height (which, for a Gaussian distribution, coincides with the mean height) in these charts, and subtracted the adjacent values. To account for possible slight changes over the years, we add and subtract a small offset, respectively, to obtain the upper bounds $\delta_{1,2}, \delta_{2,2}, \delta_{3,2}$ and the lower bounds $\delta_{1,1}, \delta_{2,1}, \delta_{3,1}$ used in Constraint 2.

Acknowledgments
This work was partially supported by a grant from the US National Science Foundation.
References
Banerjee, A. and J. Ghosh (2006). Scalable clustering algorithms with balancing constraints. Data Mining and Knowledge Discovery 13(3), 365-395.
Basu, S., A. Banerjee, and R. J. Mooney (2004). Active semi-supervision for pairwise constrained clustering. In SIAM International Conference on Data Mining, pp. 333-344.
Basu, S., I. Davidson, and K. Wagstaff (2008). Constrained Clustering: Advances in Algorithms, Theory, and Applications. CRC Press.
Bennett, K., P. Bradley, and A. Demiriz (2000). Constrained K-means clustering. Technical Report MSR-TR-2000-65, Microsoft Research.
Centers for Disease Control and Prevention (2001). Data table of stature-for-age charts.
Centers for Disease Control and Prevention (2010). National youth physical activity and nutrition study (NYPANS).
Chauveau, D. and D. Hunter (2013). ECM and MM algorithms for Normal mixtures with constrained parameters. https://hal.archives-ouvertes.fr/hal-00625285.
Davidson, I. and S. Basu (2007). A survey of clustering with instance level constraints. ACM Transactions on Knowledge Discovery from Data 1(1-41), 2-42.
Davidson, I. and S. Ravi (2005). Clustering with constraints: feasibility issues and the K-means algorithm. In SIAM International Conference on Data Mining, pp. 138-149.
Davis, K. A., C. G. Park, and S. K. Sinha (2012). Testing for generalized linear mixed models with cluster correlated data under linear inequality constraints. Canadian Journal of Statistics 40(2), 243-258.
Day, N. E. (1969). Estimating the components of a mixture of Normal distributions. Biometrika 56(3), 463-474.
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 1-22.
Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97(458), 611-631.
Gançarski, P., B. Crémilleux, G. Forestier, and T. Lampert (2020). Constrained clustering: current and new trends. In A Guided Tour of Artificial Intelligence Research, pp. 447-484. Springer.
Goldfarb, D. and A. Idnani (1982). Dual and primal-dual methods for solving strictly convex quadratic programs. In Numerical Analysis, pp. 226-239. Springer.
Goldfarb, D. and A. Idnani (1983). A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming 27(1), 1-33.
Hathaway, R. J. (1985). A constrained formulation of maximum-likelihood estimation for Normal mixture distributions. The Annals of Statistics 13(2), 795-800.
Hathaway, R. J. (1986). A constrained EM algorithm for univariate Normal mixtures. Journal of Statistical Computation and Simulation 23(3), 211-230.
Ingrassia, S. (2004). A likelihood-based constrained algorithm for multivariate Normal mixture models. Statistical Methods and Applications 13(2), 151-166.
Ingrassia, S. and R. Rocci (2007). Constrained monotone EM algorithms for finite mixture of multivariate Gaussians. Computational Statistics & Data Analysis 51(11), 5339-5351.
Ingrassia, S. and R. Rocci (2011). Degeneracy of the EM algorithm for the MLE of multivariate Gaussian mixtures and dynamic constraints. Computational Statistics & Data Analysis 55(4), 1715-1725.
Jain, A. K., M. N. Murty, and P. J. Flynn (1999). Data clustering: a review. ACM Computing Surveys (CSUR) 31(3), 264-323.
Jamshidian, M. (2004). On algorithms for restricted maximum likelihood estimation. Computational Statistics & Data Analysis 45(2), 137-157.
Kaufman, L. and P. J. Rousseeuw (2009). Finding Groups in Data: An Introduction to Cluster Analysis, Volume 344. John Wiley & Sons.
Kim, D. K. and J. M. Taylor (1995). The restricted EM algorithm for maximum likelihood estimation under linear restrictions on the parameters. Journal of the American Statistical Association 90(430), 708-716.
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129-137.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 281-297. Oakland, CA, USA.
McLachlan, G. J. and T. Krishnan (2007). The EM Algorithm and Extensions, Volume 382. John Wiley & Sons.
McLachlan, G. J. and D. Peel (2004). Finite Mixture Models. John Wiley & Sons.
Meng, X.-L. and D. B. Rubin (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80(2), 267-278.
Nielsen, F. and R. Nock (2014). Optimal interval clustering: Application to Bregman clustering and statistical mixture learning. IEEE Signal Processing Letters 21(10), 1289-1292.
Pan, W. and X. Shen (2007). Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research 8, 1145-1164.
Qiao, M. and J. Li (2015). Gaussian mixture models with component means constrained in pre-selected subspaces. arXiv preprint arXiv:1508.06388.
Rocci, R., S. A. Gattone, and R. Di Mari (2018). A data driven equivariant approach to constrained Gaussian mixture modeling. Advances in Data Analysis and Classification 12(2), 235-260.
Shi, N.-Z., S.-R. Zheng, and J. Guo (2005). The restricted EM algorithm under inequality restrictions on the parameters. Journal of Multivariate Analysis 92(1), 53-76.
Shin, D. W., C. G. Park, and T. Park (2001). Testing for one-sided group effects in repeated measures study. Computational Statistics & Data Analysis 37(2), 233-247.
Szkaliczki, T. (2016). clustering.sc.dp: Optimal clustering with sequential constraint by using dynamic programming. R Journal 8(1), 318-327.
Takai, K. (2012). Constrained EM algorithm with projection method. Computational Statistics 27(4), 701-714.
Tan, M., G.-L. Tian, H.-B. Fang, and K. W. Ng (2007). A fast EM algorithm for quadratic optimization subject to convex constraints. Statistica Sinica 17(3), 945-964.
Tian, G.-L., K. W. Ng, and M. Tan (2008). EM-type algorithms for computing restricted MLEs in multivariate Normal distributions and multivariate t-distributions. Computational Statistics & Data Analysis 52(10), 4768-4778.
Wagstaff, K., C. Cardie, S. Rogers, and S. Schrödl (2001). Constrained K-means clustering with background knowledge. In International Conference on Machine Learning, Volume 1, pp. 577-584.
Wang, H. and M. Song (2011). Ckmeans.1d.dp: Optimal K-means clustering in one dimension by dynamic programming. The R Journal 3(2), 29.
Wang, S. and J. Zhu (2008). Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics 64(2), 440-448.
Zheng, S., J. Guo, N.-Z. Shi, and G.-L. Tian (2012). Likelihood-based approaches for multivariate linear models under inequality constraints for incomplete data. Journal of Statistical Planning and Inference 142(11), 2926-2942.
Zheng, S., N. Shi, and J. Guo (2005). The restricted EM algorithm under linear inequalities in a linear model with missing data. Science in China Series A: Mathematics 48(6), 819-828.
A Proof of Proposition 1
Using Jensen's inequality, we derive:

$$L(\theta, x) = \sum_{i=1}^{N} \log\left(\sum_{k=1}^{K} \pi_k f(x_i; \mu_k, v_k)\right) \qquad (30)$$

$$= \sum_{i=1}^{N} \log\left(\sum_{k=1}^{K} w^{s+1}_{i,k} \frac{\pi_k f(x_i; \mu_k, v_k)}{w^{s+1}_{i,k}}\right) \qquad (31)$$

$$\ge \sum_{i=1}^{N} \sum_{k=1}^{K} w^{s+1}_{i,k} \log\left(\frac{\pi_k f(x_i; \mu_k, v_k)}{w^{s+1}_{i,k}}\right) \qquad (32)$$

$$= Q(\theta, \theta^s) - \sum_{i=1}^{N} \sum_{k=1}^{K} w^{s+1}_{i,k} \log\left(w^{s+1}_{i,k}\right). \qquad (33)$$

Plugging in $\theta^{s+1}$ for θ, we obtain:

$$L(\theta^{s+1}, x) \ge Q(\theta^{s+1}, \theta^s) - \sum_{i=1}^{N} \sum_{k=1}^{K} w^{s+1}_{i,k} \log\left(w^{s+1}_{i,k}\right), \qquad (34)$$

while by definition of the weights in (15), we have:

$$L(\theta^s, x) = Q(\theta^s, \theta^s) - \sum_{i=1}^{N} \sum_{k=1}^{K} w^{s+1}_{i,k} \log\left(w^{s+1}_{i,k}\right). \qquad (35)$$

Thus to prove that $L(\theta^{s+1}, x) \ge L(\theta^s, x)$, it suffices to prove that $Q(\theta^{s+1}, \theta^s) \ge Q(\theta^s, \theta^s)$. As we maximize $Q(\theta = (\pi, \mu, v), \theta^s)$ in every step of the ECM algorithm with the other parameters held fixed (and since the previous iterate is always feasible for each sub-maximization, including the constrained maximization over the centers, so that each step can only improve Q), we have:

$$Q(\theta^{s+1}, \theta^s) = Q((\pi^{s+1}, \mu^{s+1}, v^{s+1}), \theta^s) \qquad (36)$$

$$\ge Q((\pi^{s+1}, \mu^{s+1}, v^s), \theta^s) \qquad (37)$$

$$\ge Q((\pi^{s+1}, \mu^s, v^s), \theta^s) \qquad (38)$$

$$\ge Q((\pi^s, \mu^s, v^s), \theta^s) \qquad (39)$$

$$= Q(\theta^s, \theta^s).$$