k-means as a variational EM approximation of Gaussian mixture models
Pattern Recognition Letters, accepted version. For the published version, see: https://doi.org/10.1016/j.patrec.2019.04.001

Jörg Lücke**, Dennis Forster
Machine Learning Lab, University of Oldenburg, Ammerländer Heerstr. 114-118, 26129 Oldenburg, Germany

** Corresponding author e-mail: [email protected] (Jörg Lücke)
ABSTRACT

We show that k-means (Lloyd's algorithm) is obtained as a special case when truncated variational EM approximations are applied to Gaussian Mixture Models (GMMs) with isotropic Gaussians. In contrast to the standard way to relate k-means and GMMs, the provided derivation shows that it is not required to consider Gaussians with small variances or the limit case of zero variances. There are a number of consequences that directly follow from our approach: (A) k-means can be shown to increase a free energy associated with truncated distributions, and this free energy can directly be reformulated in terms of the k-means objective; (B) k-means generalizations can directly be derived by considering the 2nd closest, 3rd closest, etc. cluster in addition to just the closest one; and (C) the embedding of k-means into a free energy framework allows for theoretical interpretations of other k-means generalizations in the literature. In general, truncated variational EM provides a natural and rigorous quantitative link between k-means-like clustering and GMM clustering algorithms, which may be very relevant for future theoretical and empirical studies.
1. Introduction
Clustering is the task of associating a set of N data points with a set of C clusters (typically with C ≪ N), where such an association is defined by a high similarity of points within one cluster compared to the similarity of any two points of different clusters. Different criteria for data point similarity and different algorithmic properties have led to the development of a large variety of clustering algorithms in the course of more than half a century. Two of the presumably most influential classes of algorithms are k-means-like algorithms (Lloyd, 1982; Jain, 2010, and many more) and Gaussian Mixture Models (GMMs).

k-means. The k-means algorithm and its many variants (e.g. Steinley, 2006) have been used since the 1950s and are often considered the most popular clustering algorithms (Berkhin, 2006). If we denote by y^(1:N) = y^(1), ..., y^(N) the data points (with y^(n) ∈ R^D) and by μ_1:C = μ_1, ..., μ_C the cluster centers (with μ_c ∈ R^D), then the most common form of k-means is given by Alg. 1, with ‖·‖ as Euclidean metric. After initialization of μ_1:C, Alg. 1 decreases the k-means objective given by

\[ J\big(s_{1:C}^{(1:N)}, \vec{\mu}_{1:C}\big) = \sum_{n=1}^{N} \sum_{c=1}^{C} s_c^{(n)} \, \big\| \vec{y}^{(n)} - \vec{\mu}_c \big\|^2 . \qquad (1) \]

The updates of s_c^(n) and μ_c in Alg. 1 are usually derived from (1). Because of its few elementary algorithmic steps, k-means is easy to implement, and it has been observed to work very well in practice (e.g. Duda et al., 2001).

Algorithm 1: k-means.
  repeat
    for c = 1, ..., C and n = 1, ..., N do
      s_c^(n) = 1 if ‖y^(n) − μ_c‖ < ‖y^(n) − μ_c'‖ for all c' ≠ c, and 0 otherwise;
    for c = 1, ..., C do
      μ_c = Σ_{n=1}^{N} s_c^(n) y^(n) / Σ_{n=1}^{N} s_c^(n);
  until μ_1:C have converged;
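For concreteness, the following is a minimal NumPy sketch of Alg. 1; the function name, array shapes, and the fixed iteration cap are our own choices for illustration and not part of the original algorithm description.

    import numpy as np

    def kmeans(Y, mu, n_iter=100):
        # Minimal sketch of Alg. 1 (Lloyd's algorithm).
        # Y: (N, D) data points; mu: (C, D) initial cluster centers.
        mu = np.asarray(mu, dtype=float).copy()
        for _ in range(n_iter):
            # assignment step: s[n] = index of the closest cluster center
            dists = np.linalg.norm(Y[:, None, :] - mu[None, :, :], axis=2)   # (N, C)
            s = dists.argmin(axis=1)                                          # (N,)
            # update step: each center becomes the mean of its assigned points
            new_mu = np.array([Y[s == c].mean(axis=0) if np.any(s == c) else mu[c]
                               for c in range(mu.shape[0])])
            if np.allclose(new_mu, mu):
                break
            mu = new_mu
        return mu, s

Here s holds, for each data point, the index of its assigned cluster, which corresponds to the binary indicators s_c^(n) of Alg. 1.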
GMM. GMM-based clustering algorithms (e.g. McLachlan and Basford, 1988) are derived from a probabilistic data model p(y | Θ). While general GMMs allow for different mixing proportions and multivariate Gaussian distributions, we will for the purposes of this study consider equal mixing proportions and equally sized, isotropic Gaussians:

\[ p(c \,|\, \Theta) = \frac{1}{C}, \qquad p(\vec{y} \,|\, c, \Theta) = (2\pi\sigma^2)^{-\frac{D}{2}} \exp\!\Big( -\frac{1}{2\sigma^2} \big\| \vec{y} - \vec{\mu}_c \big\|^2 \Big), \qquad (2) \]

i.e., we will use a 'flat' prior p(c | Θ) and an equal and isotropic variance σ² for all clusters. The most standard way to update the GMM model parameters Θ = (μ_1:C, σ²) is derived using expectation maximization (EM; Dempster et al., 1977), which results for GMM (2) in Alg. 2 (Barber, 2012, & refs. therein).
Algorithm 2: EM for GMM.
  repeat
    for c = 1, ..., C and n = 1, ..., N do
      r_c^(n) = exp(−‖y^(n) − μ_c‖² / (2σ²)) / Σ_{c'=1}^{C} exp(−‖y^(n) − μ_c'‖² / (2σ²));
    for c = 1, ..., C do
      μ_c = Σ_{n=1}^{N} r_c^(n) y^(n) / Σ_{n=1}^{N} r_c^(n);
    σ² = (1/(DN)) Σ_{n=1}^{N} Σ_{c=1}^{C} r_c^(n) ‖y^(n) − μ_c‖²;
  until parameters Θ have converged;

After initialization of Θ = (μ_1:C, σ²), the algorithm maximizes the data log-likelihood given by:

\[ \mathcal{L}(\Theta) = \frac{1}{N} \sum_{n=1}^{N} \log\Big( \sum_{c=1}^{C} \frac{1}{C}\, \mathcal{N}\big( \vec{y}^{(n)}; \vec{\mu}_c, \sigma^2 \mathbf{1} \big) \Big), \qquad (3) \]

with N(y^(n); μ_c, σ²1) as given in (2). Note that (3) is normalized by the number of data points for this study. As is customary for GMMs, we refer to the posteriors p(c | y^(n), Θ) as responsibilities (abbreviated by r_c^(n)). Computing the r_c^(n) in Alg. 2 is referred to as E-step, while the updates of the parameters μ_1:C and σ² in Alg. 2 are referred to as M-step.
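Analogously, a minimal NumPy sketch of Alg. 2 could look as follows; the flat prior 1/C cancels in the responsibilities, and a fixed iteration cap replaces the convergence test for brevity (both are our simplifications).

    import numpy as np

    def em_isotropic_gmm(Y, mu, sigma2, n_iter=100):
        # Sketch of Alg. 2: EM for the GMM of Eq. (2)
        # (flat prior, shared isotropic variance sigma2).
        N, D = Y.shape
        mu = np.asarray(mu, dtype=float).copy()
        for _ in range(n_iter):
            # E-step: responsibilities r[n, c]; the Gaussian normalizer and the
            # prior 1/C are equal for all c and cancel in the ratio
            d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)          # (N, C)
            logits = -d2 / (2.0 * sigma2)
            logits -= logits.max(axis=1, keepdims=True)                        # stability
            r = np.exp(logits)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: weighted means and shared isotropic variance
            mu = (r.T @ Y) / r.sum(axis=0)[:, None]
            d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            sigma2 = (r * d2).sum() / (D * N)
        return mu, sigma2, r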
Related Work and Our Contribution. The popularity of k-means and GMM algorithms has resulted in many theoretical as well as empirical studies of their functional and theoretical properties. Considerable progress using novel versions could be made, and much insight could be gained for k-means (Har-Peled and Sadri, 2005; Arthur and Vassilvitskii, 2006; Arthur et al., 2009; Bachem et al., 2016) and GMMs (Chaudhuri et al., 2009; Kalai et al., 2010; Moitra and Valiant, 2010; Belkin and Sinha, 2010; Xu et al., 2016) relatively recently. Because of their similarity, k-means and GMMs have long been formally related to each other. It is thus well known (see, e.g., MacKay, 2003; Barber, 2012, & refs. therein) that k-means (Alg. 1) can be obtained as a limit case of EM for GMM (2). This limit is given by considering increasingly small variances, i.e., σ² → 0. The responsibilities r_c^(n) in Alg. 2 then become equal to one for the closest cluster and zero otherwise, and the k-means algorithm (Alg. 1) is recovered. Furthermore, approaches which modify EM algorithms by introducing additional 'hard' assignment steps of data points to clusters have been used to relate k-means and GMM clustering. Given a generative model, such approaches are often referred to as 'hard' EM (e.g. Segal et al., 2002; Van den Oord and Schrauwen, 2014), as 'classification EM' (CEM; e.g. Celeux and Govaert, 1992) for GMMs, or as 'Viterbi training' for HMMs (e.g. Allahverdyan and Galstyan, 2011). For data distributions with negligible cluster overlap (a setting which is closely related to the limit σ² → 0), such 'hard' assignments closely approximate the exact posteriors. Different k-means generalizations have therefore been suggested, e.g., with the aim to enhance k-means convergence (Har-Peled and Sadri, 2005) or to relax its 'hard' cluster assignment (e.g. Bezdek, 1981; Celeux and Govaert, 1992; MacKay, 2003; Miyamoto et al., 2008). As for clustering in general, k-means also remained of interest in the probabilistic Machine Learning community, and notably in the field of non-parametric approaches. Welling and Kurihara (2006) suggested 'Bayesian k-means', for instance, and used variational Bayesian approximations in order to obtain k-means-like run time behavior for model selection. Later on, Kulis and Jordan (2012) also used a Bayesian treatment, and combined it with the relation of k-means to GMMs obtained in the limit σ² → 0.
In this way they derived new 'hard assignment' algorithms based on a Gibbs sampler used within a non-parametric approach (also compare Broderick et al., 2013).

In this work, we derive the k-means algorithm from a novel class of variational EM algorithms applied to GMMs. Most notably, k-means is obtained cleanly and rigorously without any assumptions on σ². Variational EM seeks to optimize a lower bound (the free energy; Neal and Hinton, 1998) of the data log-likelihood by making use of variational distributions that approximate full posterior probabilities. The free energy is also frequently referred to as the evidence lower bound (ELBO; e.g. Hoffman et al., 2013). For our study, we apply truncated posteriors (Lücke and Eggert, 2010) as variational distributions in their fully variational formulation (Lücke, 2018). After having shown that k-means is a variational approximation, k-means and its generalizations can be quantitatively related to GMMs without taking the limit to zero cluster variances and without assuming σ² to be small compared to cluster-to-cluster distances. Furthermore, the observation that k-means is a variational optimization implies that it optimizes a lower bound of a GMM log-likelihood. Hence, we can derive lower free energy bounds for k-means and its generalizations that quantify the link between the k-means and the GMM objective. As such we provide a closer theoretical link between these two central classes of clustering methods than has previously been established.

Truncated approaches have been applied to mixture models before. Work by Dai and Lücke (2014) used truncated approximations for a position invariant mixture model, and Forster et al. (2018) for a hierarchical Poisson mixture. Work by Shelton et al. (2014) was the first to apply truncated EM to standard GMMs, followed by Hughes and Sudderth (2016), who additionally used a constrained likelihood optimization to find cluster centers for truncated posteriors. None of these contributions has derived k-means as a variational EM algorithm for GMMs, nor did any contribution provide quantitative free energy results or the links to generalizations of k-means derived in this study.
2. Truncated variational EM and GMMs
The basic idea of truncated EM is the use of truncated approximations of exact posterior distributions (e.g. Lücke and Eggert, 2010; Sheikh et al., 2014). In the notation as used for GMMs above, the truncated approximation takes the form:

\[ r_c^{(n)} \;\approx\; q_c^{(n)} = \frac{ p(c, \vec{y}^{(n)} \,|\, \Theta) }{ \sum_{c' \in \mathcal{K}^{(n)}} p(c', \vec{y}^{(n)} \,|\, \Theta) } \, \delta(c \in \mathcal{K}^{(n)}), \qquad (4) \]

where K^(n) is a set of cluster indices (containing different clusters c associated with data point y^(n)). Suppl. A and Fig. S1 provide an example. The set of all K^(n) we denote by K, i.e., K = (K^(1:N)). As is customary for truncated distributions (Lücke and Eggert, 2010; Dai and Lücke, 2014; Shelton et al., 2014; Hughes and Sudderth, 2016), we take the sizes of all K^(n) to be equal, |K^(n)| = C', with 1 ≤ C' ≤ C. The truncated approximation (4) is a good approximation if K^(n) contains all those clusters with significant posterior mass p(c | y^(n), Θ) (i.e., significant non-zero responsibilities r_c^(n)). Truncated approaches can represent very accurate approximations for many data sets, as typically most responsibilities are negligible.

In order to derive a learning algorithm for GMMs based on truncated distributions, we have to answer the question how the parameters K^(n) and Θ are to be updated. For our purposes we will here make use of a recent study which addressed this question for general models (with discrete latents) by embedding truncated distributions into a fully variational optimization framework (Lücke, 2018). More specifically, we use the result of Lücke (2018) that the free energy, as a lower bound of the data likelihood, is monotonically increased if: (A) the parameters Θ are updated using standard M-steps, with exact posteriors replaced by truncated posteriors; and (B) the sets K^(n) are found using a simplified expression for the free energy.

For GMMs, this means that we can use the standard M-steps of Alg. 2 and replace r_c^(n) with the truncated approximations q_c^(n) in (4). For the GMM (2), the truncated responsibilities and M-steps are thus:

\[ q_c^{(n)} = \frac{ \exp\!\big( -\frac{1}{2\sigma^2} \| \vec{y}^{(n)} - \vec{\mu}_c \|^2 \big) }{ \sum_{c' \in \mathcal{K}^{(n)}} \exp\!\big( -\frac{1}{2\sigma^2} \| \vec{y}^{(n)} - \vec{\mu}_{c'} \|^2 \big) } \, \delta(c \in \mathcal{K}^{(n)}), \qquad (5) \]

\[ \vec{\mu}_c^{\,\mathrm{new}} = \frac{ \sum_{n=1}^{N} q_c^{(n)} \vec{y}^{(n)} }{ \sum_{n=1}^{N} q_c^{(n)} }, \qquad \sigma^2_{\mathrm{new}} = \frac{1}{DN} \sum_{n=1}^{N} \sum_{c=1}^{C} q_c^{(n)} \big\| \vec{y}^{(n)} - \vec{\mu}_c^{\,\mathrm{new}} \big\|^2 . \qquad (6) \]

The parameters K^(n) of the truncated distributions q_c^(n) have to be found in the variational E-step. In order to do so, we use the simplified free energy derived in (Lücke, 2018, Prop. 3), which takes for our GMM (2) the following form:

\[ \mathcal{F}(\mathcal{K}, \Theta) = \frac{1}{N} \sum_{n=1}^{N} \log\Big( \sum_{c \in \mathcal{K}^{(n)}} p(c, \vec{y}^{(n)} \,|\, \Theta) \Big) = \frac{1}{N} \sum_{n=1}^{N} \log\Big( \sum_{c \in \mathcal{K}^{(n)}} \frac{1}{C}\, \mathcal{N}\big( \vec{y}^{(n)}; \vec{\mu}_c, \sigma^2 \mathbf{1} \big) \Big). \qquad (7) \]

The truncated variational E-step (TV-E-step) first optimizes F(K, Θ) w.r.t. K, and the obtained truncated responsibilities q_c^(n) are then used in the M-step (6) to optimize F(K, Θ) w.r.t. Θ. The form of the free energy (7) and the result that it is monotonically increased by iterating TV-E-step and M-step are the crucial theoretical results by Lücke (2018) that are used in this study.
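To illustrate Eqs. (5)-(7), the following sketch computes the truncated responsibilities and the free energy once the sets K^(n) are chosen (here as the C' closest clusters, anticipating Corollary 1 below); the M-step (6) is then simply the M-step of Alg. 2 with r replaced by q. Function and variable names are our own.

    import numpy as np

    def truncated_responsibilities_and_F(Y, mu, sigma2, Cprime):
        # Truncated E-step quantities for GMM (2).
        N, D = Y.shape
        C = mu.shape[0]
        d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)        # (N, C)
        K = np.argsort(d2, axis=1)[:, :Cprime]                           # K^(n): C' closest clusters
        rows = np.arange(N)[:, None]
        # truncated responsibilities q (Eq. 5); zero outside K^(n)
        logits = -d2[rows, K] / (2.0 * sigma2)
        logits -= logits.max(axis=1, keepdims=True)
        qK = np.exp(logits)
        qK /= qK.sum(axis=1, keepdims=True)
        q = np.zeros((N, C))
        q[rows, K] = qK
        # free energy (Eq. 7): average log of the truncated sum of joints
        log_joint = (-np.log(C) - 0.5 * D * np.log(2 * np.pi * sigma2)
                     - d2 / (2.0 * sigma2))                              # log p(c, y^(n) | Theta)
        m = log_joint[rows, K]
        mmax = m.max(axis=1, keepdims=True)
        F = np.mean(np.log(np.exp(m - mmax).sum(axis=1)) + mmax[:, 0])   # logsumexp per n
        return q, K, F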
Neither of these two results is straightforward: (A) truncated distributions themselves depend on the model parameters Θ, and (B) it requires a number of derivations exploiting specific properties of truncated distributions to obtain the concise form used for expression (7).

The TV-E-step now requires finding sets K^(n) which increase F(K, Θ). The free energy (7) is computationally tractable, so a new K could in principle be found by directly comparing F(K_new, Θ) of a new K_new with F(K_old, Θ) of the current K_old. We can slightly reformulate the problem by considering a specific data point n and a cluster c̃ ∈ K^(n), for which we ask when any other replacing cluster c ∉ K^(n) would increase the free energy F(K, Θ). By virtue of the properties of GMM (2) and due to the specific structure of the free energy (summation and concavity of the logarithm in Eqn. 7), we can then show:

Proposition 1
Consider the GMM (2) and the free energy (7) for n = 1, ..., N data points y^(n) ∈ R^D. Furthermore, consider for a fixed n the replacement of a cluster c̃ ∈ K^(n) by a cluster c ∉ K^(n). Then the free energy F(K, Θ) increases if and only if

\[ \big\| \vec{y}^{(n)} - \vec{\mu}_c \big\| < \big\| \vec{y}^{(n)} - \vec{\mu}_{\tilde{c}} \big\| . \qquad (8) \]

Proof
First observe that the free energy is increased if p(c, y^(n) | Θ) > p(c̃, y^(n) | Θ) because of the summation over c in (7) and because of the concavity of the logarithm. Analogously, the free energy stays constant or decreases for p(c, y^(n) | Θ) ≤ p(c̃, y^(n) | Θ). If we use the GMM (2), we obtain for the joint:

\[ p(c, \vec{y} \,|\, \Theta) = \frac{1}{C} (2\pi\sigma^2)^{-\frac{D}{2}} \exp\!\Big( -\frac{1}{2\sigma^2} \big\| \vec{y} - \vec{\mu}_c \big\|^2 \Big). \qquad (9) \]

The first two factors are independent of the data point and cluster. The criterion for an increase of the free energy can therefore be reformulated as follows:

p(c, y | Θ) > p(c̃, y | Θ)
  ⇔ exp(−‖y − μ_c‖² / (2σ²)) > exp(−‖y − μ_c̃‖² / (2σ²))
  ⇔ ‖y − μ_c‖ < ‖y − μ_c̃‖.  □

Prop. 1 means that we have to replace clusters in K^(n) that are relatively distant from y^(n) by those closer to y^(n) in order to increase the free energy F(K, Θ). Any such procedure, together with M-step (6), gives rise to a variational EM algorithm that monotonically increases the lower bound (7) of likelihood (3). For an arbitrary generative model, the degree to which F(K, Θ) is increased or how long one should seek new clusters in the E-step is a design choice of the algorithm. In the case of GMMs (and other mixture models) we can exhaustively enumerate all clusters such that F(K, Θ) can be fully maximized.

Corollary 1
Same prerequisites as for Prop. 1. The free energy F(K, Θ) is maximized w.r.t. K (with fixed Θ) if and only if for all n the set K^(n) contains the C' clusters closest to data point y^(n).

Proof
We assume that there are no equal distances among all pairs of data points and cluster centers. If K^(n) contains the C' closest clusters, it holds that: for all c ∈ K^(n) and all c̃ ∉ K^(n), ‖y^(n) − μ_c‖ < ‖y^(n) − μ_c̃‖. If we now consider an arbitrary n and replace an arbitrary c ∈ K^(n) by an arbitrary c_new ∉ K^(n), then ‖y^(n) − μ_c_new‖ > ‖y^(n) − μ_c‖ such that, by virtue of Prop. 1, F(K, Θ) decreases. As any such replacement (any change of K) results in a decrease of the free energy, F(K, Θ) is maximized if K contains the C' closest clusters.  □

We can now formulate a truncated variational EM (TV-EM) algorithm for GMM (2), here referred to as k-means-C' (Alg. 3).
Algorithm 3: The k-means-C' algorithm.
  set |K^(n)| = C' for all n and initialize μ_1:C and σ²;
  repeat
    for n = 1, ..., N do
      define K^(n) such that for all c ∈ K^(n) and all c̃ ∉ K^(n): ‖y^(n) − μ_c‖ < ‖y^(n) − μ_c̃‖;
    compute q_c^(n) for all c and n using (5);
    update μ_1:C and σ² using (6);
  until μ_1:C and σ² have converged;
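Under the same naming assumptions as the sketches above, a compact NumPy version of Alg. 3 could look as follows; the guard for clusters that receive no responsibility mass is our addition and not part of the algorithm as stated.

    import numpy as np

    def kmeans_Cprime(Y, mu, sigma2, Cprime, n_iter=100):
        # Sketch of Alg. 3 (k-means-C'); C' = C recovers EM for GMM (2),
        # C' = 1 recovers k-means plus a decoupled sigma2 update (see Prop. 2).
        N, D = Y.shape
        C = mu.shape[0]
        mu = np.asarray(mu, dtype=float).copy()
        for _ in range(n_iter):
            # TV-E-step: K^(n) = indices of the C' closest clusters
            d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)     # (N, C)
            K = np.argsort(d2, axis=1)[:, :Cprime]
            rows = np.arange(N)[:, None]
            logits = -d2[rows, K] / (2.0 * sigma2)
            logits -= logits.max(axis=1, keepdims=True)
            qK = np.exp(logits)
            qK /= qK.sum(axis=1, keepdims=True)
            q = np.zeros((N, C))
            q[rows, K] = qK                                              # Eq. (5)
            # M-step (Eq. 6); centers without responsibility mass are kept
            w = q.sum(axis=0)
            nz = w > 0
            mu_new = mu.copy()
            mu_new[nz] = (q.T @ Y)[nz] / w[nz][:, None]
            mu = mu_new
            d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            sigma2 = (q * d2).sum() / (D * N)
        return mu, sigma2, q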
3. k-means and truncated variational EM for GMMs

TV-EM for GMMs (Alg. 3) increases the similarity between k-means and standard EM for GMMs in two ways: (A) it relates Euclidean distances to a variational free energy and thus to the GMM likelihood; and (B) it introduces 'hard' zeros in the updates of model parameters (some or many q_c^(n) are zero). Crucial remaining differences are, however, (A) the weighted updates of the cluster centers in Eqn. 6 compared to the k-means update, and (B) the update of the cluster variance σ² in Eqn. 6 along with the cluster centers in Alg. 3, which does not have a correspondence in k-means. Considering the first difference, the obvious next step is to consider a boundary case of Alg. 3 by demanding that the sets K^(n) shall contain just one element, i.e., we set C' = 1. All derivations above apply for all 1 ≤ C' ≤ C, and while standard EM for the GMM (2) is recovered for C' = C, we find that for C' = 1 the k-means algorithm (Alg. 1) is recovered.

Proposition 2
Consider the TV-EM algorithm (Alg. 3) for the GMM (2) with σ² > 0. If we set C' = 1, then the TV-EM updates of the cluster centers μ_c (6) become independent of the variance σ² and are given by the standard k-means algorithm in Alg. 1.

Proof
If we choose |K^(n)| = C' = 1, then each K^(n) computed in the TV-E-step of Alg. 3 contains, according to Corollary 1, the index of the cluster center closest to y^(n) as its only element. If we denote this center by c_o^(n), we get K^(n) = {c_o^(n)} and obtain for the truncated responsibilities q_c^(n) in (5):

\[ q_c^{(n)} = \frac{ \exp\!\big( -\frac{1}{2\sigma^2} \| \vec{y}^{(n)} - \vec{\mu}_c \|^2 \big)\, \delta(c = c_o^{(n)}) }{ \sum_{c' \in \{ c_o^{(n)} \}} \exp\!\big( -\frac{1}{2\sigma^2} \| \vec{y}^{(n)} - \vec{\mu}_{c'} \|^2 \big) } = \begin{cases} 1 & \text{if } c = c_o^{(n)} \\ 0 & \text{otherwise} \end{cases}, \qquad (10) \]

which is equal to s_c^(n) in Alg. 1. By using q_c^(n) = s_c^(n) for the M-step, we consequently obtain:

\[ \vec{\mu}_c^{\,\mathrm{new}} = \frac{ \sum_{n=1}^{N} s_c^{(n)} \vec{y}^{(n)} }{ \sum_{n=1}^{N} s_c^{(n)} }, \qquad \sigma^2_{\mathrm{new}} = \frac{1}{DN} \sum_{n=1}^{N} \sum_{c=1}^{C} s_c^{(n)} \big\| \vec{y}^{(n)} - \vec{\mu}_c^{\,\mathrm{new}} \big\|^2 . \qquad (11) \]

Now observe that the computation of q_c^(n) = s_c^(n) and the updates of the μ_c do not involve the parameter σ². The cluster centers μ_c can thus be optimized without requiring knowledge about the cluster variance σ², i.e., the μ_c optimization becomes independent of σ². As the TV-EM updates for q_c^(n) and μ_c are identical to the updates of s_c^(n) and μ_c in Alg. 1, the optimization procedure for the μ_c is given by the standard k-means algorithm.  □

A direct consequence of Prop. 2 is that standard k-means provably monotonically increases the truncated free energy (7) with C' = 1.
Notably, only for this choice of C' do the updates of cluster means and variance decouple. We can, of course, add the variance update to standard k-means, but this does not affect the μ_c updates. With or without σ² updates the free energy monotonically increases. If our goal is the maximization of the free energy objective, the σ² updates should be included, however. Due to the independence of the μ_c optimization from σ², it would be sufficient to update σ² once and only after k-means has optimized the cluster centers.

Prop. 2 shows that k-means is obtained from a variational free energy objective. This free energy is in turn closely related to the likelihood objective of GMMs (3). By analyzing the free energy for C' = 1, we obtain:

Proposition 3
Consider a set of N data points y^(1:N) ∈ R^D and the k-means algorithm (Alg. 1), where s^(1:N)_1:C and μ_1:C denote, respectively, the cluster assignments and cluster centers computed in one iteration. Furthermore, let σ² denote the variance computed with s^(1:N)_1:C and μ_1:C as in Eqn. 6:

\[ \sigma^2 = \sigma^2\big(s_{1:C}^{(1:N)}, \vec{\mu}_{1:C}\big) = \frac{1}{DN} \sum_{n=1}^{N} \sum_{c=1}^{C} s_c^{(n)} \big\| \vec{y}^{(n)} - \vec{\mu}_c \big\|^2 . \qquad (12) \]

It then follows that each k-means iteration monotonically increases the free energy F(s^(1:N)_1:C, μ_1:C) given by:

\[ \mathcal{F}\big(s_{1:C}^{(1:N)}, \vec{\mu}_{1:C}\big) = -\log(C) - \tfrac{D}{2} \log\big(2\pi e \sigma^2\big), \qquad (13) \]

where e is Euler's number. The free energy (13) is a lower bound of the GMM log-likelihood (3). The difference between log-likelihood (3) and free energy (13) is given by:

\[ \mathcal{D}_{\mathrm{KL}}\big(s_{1:C}^{(1:N)}, \vec{\mu}_{1:C}\big) = \frac{D}{2} + \frac{1}{N} \sum_{n=1}^{N} \log\Big( \sum_{c=1}^{C} \exp\!\Big( -\frac{ \| \vec{y}^{(n)} - \vec{\mu}_c \|^2 }{ 2\sigma^2 } \Big) \Big). \qquad (14) \]

If σ² ≪ ‖y^(n) − μ_c‖² for all n and c where s_c^(n) = 0, i.e., if clusters are well separable, then the bound becomes tight.

Proof
In the k-means case (|K^(n)| = C' = 1) each K^(n) contains only one cluster, which is given by the cluster assignments s_c^(n) as: K^(n) = {c | s_c^(n) = 1}. If we abbreviate this cluster for n by c_o^(n), it follows for the free energy (7) after one k-means iteration:

\[
\mathcal{F}(\mathcal{K}, \Theta)
 = \frac{1}{N} \sum_{n} \log\Big( \sum_{c \in \{c_o^{(n)}\}} \tfrac{1}{C}\, \mathcal{N}\big(\vec{y}^{(n)}; \vec{\mu}_c, \sigma^2 \mathbf{1}\big) \Big)
 = \frac{1}{N} \sum_{n} \log\Big( \tfrac{1}{C}\, \mathcal{N}\big(\vec{y}^{(n)}; \vec{\mu}_{c_o^{(n)}}, \sigma^2 \mathbf{1}\big) \Big) \qquad (15)
\]
\[
 = -\log(C) - \tfrac{D}{2} \log\big(2\pi\sigma^2\big) - \frac{1}{2\sigma^2 N} \sum_{n=1}^{N} \sum_{c=1}^{C} s_c^{(n)} \big\| \vec{y}^{(n)} - \vec{\mu}_c \big\|^2 ,
\]

where we inserted the Gaussian density and then used f(c_o^(n)) = Σ_c s_c^(n) f(c). σ² and μ_c are the parameters obtained after a single k-means iteration. Following (6) we can therefore insert the expression (1/(DN)) Σ_n Σ_c s_c^(n) ‖y^(n) − μ_c‖² for σ², noting that the μ_c are the same as in (15). The last term of (15) then simplifies to −D/2. If we now rewrite this as −(D/2) log(e) and combine it with the second summand, we obtain (13).

The difference (14) between log-likelihood and free energy can be derived from the KL-divergence D_KL(q^(1:N)_1:C, r^(1:N)_1:C). Using results of (Lücke, 2018), the KL-divergence for a truncated distribution is given by: D_KL(q^(1:N)_1:C, r^(1:N)_1:C) = −(1/N) Σ_n log(Σ_{c∈K^(n)} r_c^(n)). Inserting r_c^(n) (Alg. 2) for the GMM (2), we obtain:

\[
\mathcal{D}_{\mathrm{KL}}\big(q_{1:C}^{(1:N)}, r_{1:C}^{(1:N)}\big)
 = -\frac{1}{N} \sum_{n} \log\Bigg( \frac{ \sum_{c \in \mathcal{K}^{(n)}} \exp\!\big( -\frac{1}{2\sigma^2} \| \vec{y}^{(n)} - \vec{\mu}_c \|^2 \big) }{ \sum_{c'} \exp\!\big( -\frac{1}{2\sigma^2} \| \vec{y}^{(n)} - \vec{\mu}_{c'} \|^2 \big) } \Bigg)
\]
\[
 = \frac{1}{2\sigma^2 N} \sum_{n,c} s_c^{(n)} \big\| \vec{y}^{(n)} - \vec{\mu}_c \big\|^2
   + \frac{1}{N} \sum_{n} \log\Big( \sum_{c} \exp\!\big( -\tfrac{1}{2\sigma^2} \| \vec{y}^{(n)} - \vec{\mu}_c \|^2 \big) \Big)
 = \frac{D}{2} + \frac{1}{N} \sum_{n} \log\Big( \sum_{c} \exp\!\Big( -\frac{ \| \vec{y}^{(n)} - \vec{\mu}_c \|^2 }{ 2\sigma^2 } \Big) \Big), \qquad (16)
\]

using again expression (12) for σ². If σ² ≪ ‖y^(n) − μ_c‖² for all n, c with s_c^(n) = 0, then the last term of (16) is dominated by those n, c with s_c^(n) = 1, such that D_KL(q^(1:N)_1:C, r^(1:N)_1:C) → 0.  □

Prop. 3 makes explicit the difference to the GMM likelihood objective if k-means is used for parameter optimization (we elaborate in Suppl. B). Furthermore, by using Prop. 3, we can now directly link the GMM likelihood to the k-means objective.

Corollary 2

If s^(1:N)_1:C and μ_1:C are updated by k-means (Alg. 1), then it holds for the GMM likelihood (3) after each iteration that

\[ \mathcal{L}(\Theta) \;\geq\; -\log(C) - \frac{D}{2} \log\Big( \frac{2\pi e}{DN}\, J\big(s_{1:C}^{(1:N)}, \vec{\mu}_{1:C}\big) \Big), \qquad (17) \]

where J(s^(1:N)_1:C, μ_1:C) is the k-means objective (1). The lower free energy bound (right-hand side of Eqn. 17) is strictly monotonically increased.

Proof

If s^(1:N)_1:C are the cluster assignments of the first for-loop in Alg. 1, and μ_1:C the centers of the second for-loop, then σ² in Prop. 3 can directly be replaced by (DN)^{-1} J(s^(1:N)_1:C, μ_1:C). The free energy is thus a function of the k-means objective. As k-means has been shown to strictly monotonically decrease the objective J(s^(1:N)_1:C, μ_1:C) (compare, e.g., Anderberg, 1973; Inaba et al., 2000), the lower free energy bound (17) is strictly monotonically increased by k-means.  □
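The bound of Corollary 2 is straightforward to evaluate. The following sketch computes the right-hand side of (17) from the k-means objective and, for comparison, the exact log-likelihood (3); here s is assumed to be the vector of assigned cluster indices (as returned by the k-means sketch above), and all names are ours.

    import numpy as np

    def kmeans_free_energy(Y, mu, s):
        # Lower bound (13)/(17) computed from the k-means objective J (Eq. 1).
        N, D = Y.shape
        C = mu.shape[0]
        J = ((Y - mu[s]) ** 2).sum()                      # quantization error
        sigma2 = J / (D * N)                              # Eq. (12)
        return -np.log(C) - 0.5 * D * np.log(2 * np.pi * np.e * sigma2)

    def gmm_loglik(Y, mu, sigma2):
        # Exact per-data-point GMM log-likelihood, Eq. (3).
        N, D = Y.shape
        C = mu.shape[0]
        d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        logp = -np.log(C) - 0.5 * D * np.log(2 * np.pi * sigma2) - d2 / (2.0 * sigma2)
        m = logp.max(axis=1, keepdims=True)
        return np.mean(np.log(np.exp(logp - m).sum(axis=1)) + m[:, 0])

By Prop. 3, gmm_loglik(Y, mu, sigma2) with sigma2 set to J/(D*N) is never smaller than kmeans_free_energy(Y, mu, s), and the gap between the two is the KL-divergence (14).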
4. Applications of Theoretical Results
The principled link between k-means and variational GMMs can be used for a number of theoretical applications and interpretations of previous algorithms, including soft-k-means, lazy-k-means, fuzzy k-means, and previous GMM variants with 'hard' posterior zeros. For such comparisons, let us first generalize Prop. 3 to k-means-C' with C' > 1.

Proposition 4
Same prerequisites as for Prop. 3. If μ_1:C and σ² are updated using k-means-C' (Alg. 3), then a lower free energy bound of the log-likelihood (3) is monotonically increased. After convergence the bound is given by:

\[ \mathcal{F}\big(q_{1:C}^{(1:N)}, \vec{\mu}_{1:C}\big) = -\log(C) - \frac{D}{2} \log\big(2\pi e \sigma^2\big) - \frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} q_c^{(n)} \log\big( q_c^{(n)} \big). \qquad (18) \]

Proof
For GMM (2) the entropy of the noise distribution, H(p(y | c, Θ)) = H(N(y; μ_c, σ²1)), is equal for all c. The GMM therefore has an entropy limit (Lücke and Henniges, 2012) given by: Q(Θ) = −H(p(c | Θ)) − H(p(y | c, Θ)) = −log(C) − (D/2) log(2π e σ²), which is derived simply by inserting (2) into Q(Θ). If we (following Lücke and Henniges, 2012) reformulate the free energy (7) such that it is expressed in terms of this entropy limit, we obtain: F(K, Θ) = Q(Θ) + (D/2)(1 − σ²_new/σ²) + (1/N) Σ_n H(q^(n)), where σ²_new is the variance after the M-step of k-means-C'. At convergence, the ratio σ²_new/σ² converges to one and we obtain (18).  □

Already by considering (7), we can conclude that for the same Θ it holds that F(K̃, Θ) ≤ F(K, Θ) if K̃ ⊆ K. Prop. 4 now shows that the free energy difference is (after convergence) solely given by the entropy of the truncated distributions. For C' = 1 this entropy vanishes, for C' = C the entropy is maximal, and (18) can be used to estimate the likelihood during learning.

Alg. 3 (k-means-C'), for which Prop. 4 applies, can be compared to soft-k-means (MacKay, 2003), which was suggested as a 'non-hard' k-means generalization. Soft-k-means and k-means-C' share an additional parameter for the data variance. For k-means-C' this is the variance σ² itself; for soft-k-means this parameter is the 'stiffness' parameter β, which also closely links to σ² of GMM (2) (essentially, β acts as an inverse variance). However, k-means-C' makes k-means 'softer' by allowing for more than one non-zero value for the cluster assignments. This is different from soft-k-means, which maintains non-zero values for all cluster assignments. Related to this, problems with sensitivity to stiffness values and sensitivity to initial conditions compared to standard k-means (Barbakh and Fyfe, 2008) may be related to Prop. 1 and Prop. 2, which imply that for any approach with C' > 1, updates of σ² should (in contrast to soft-k-means) not be neglected.
k-means-C' is itself closely related to the GMM algorithms of Shelton et al. (2014) and Hughes and Sudderth (2016). But while Shelton et al. (2014) and Hughes and Sudderth (2016) focus on EM acceleration, no proofs that their algorithms monotonically increase free energies are given (we elaborate in Suppl. C).
Fig. 1: Panels A-C show results on a BIRCH data set with grid-positioned clusters,
panels D-F on a BIRCH data set with randomly positioned clusters, and G and H on the KDD data set. The first column (plots A and D) shows log-likelihoods and free energies per data point for three individual runs of Alg. 1 (k-means). Red: all of the 25 clusters are found correctly; blue: all but one; green: all but two clusters. Additionally, the grey plot shows the mean of 100 independent runs. Plots B and E each show the mean log-likelihood (solid line) and the log-likelihood of the run with the highest final value (striped) based on 100 runs of Alg. 3 (k-means-C') for different C'. Plots C and F show the same for the purity, where we show the run with the highest sum of purity and NMI (additional plots for the NMI are given in Fig. S2). For comparison, we show results for DBSCAN, with free parameters optimized for highest combined purity and NMI. For BIRCH with random clusters, DBSCAN reaches a purity of 0.5, which is hence not visible in plot F. Detailed results, including comparisons with lazy-k-means, are given in Suppl. E. Plot G shows the mean log-likelihood and free energy (shaded with their respective SEMs) of k-means-C' (Alg. 3) with different C' based on 10 individual runs each. Plot H visualizes the same runs but plots the quantization error instead. Visualizations of some ground truth cluster centers (blue circles) and found cluster centers of the best runs (red crosses) are shown in I (k-means, BIRCH 5x5 grid), J (k-means, BIRCH random) and K (k-means-C', BIRCH random).

In general, other selection criteria than (8) could be derived for other mixture models. Vice versa, other versions of k-means based on criteria other than (8) can also be interpreted as variational EM. An example is lazy-k-means, a relatively recent k-means generalization used to study convergence properties (Har-Peled and Sadri, 2005). Lazy-k-means only reassigns a data point n from a cluster c̃ to a new cluster c if

\[ (1 + \epsilon)\, \big\| \vec{y}^{(n)} - \vec{\mu}_c \big\| < \big\| \vec{y}^{(n)} - \vec{\mu}_{\tilde{c}} \big\| , \qquad (19) \]

where ε ≥ 0, and standard k-means is recovered for ε = 0. Updates of K according to (19) also increase the free energy (7). Based on our variational interpretation, lazy-k-means corresponds to a partial TV-E-step. In analogy to Prop. 1, we can show that (7) is monotonically increased, but it is not necessarily maximized, i.e., Corollary 1 does not apply. However, the essential observation of decoupled μ and σ² updates only depends on C' being equal to one. Prop. 2 thus generalizes to the lazy-k-means case, and the same applies for Corollary 2 (see Suppl. D for the proofs). For lazy-k-means, polynomial running time bounds could be derived (Har-Peled and Sadri, 2005). By virtue of Corollary 2, this means that the corresponding log-likelihood bound can be optimized in polynomial time. More generally, Corollary 2 (as well as the other results) can serve to transfer many of the diverse run-time complexity results for k-means and k-means-like algorithms to results for GMM bounds. Likelihood bounds are, on the other hand, of interest for theoretical studies of GMM optimization (e.g. Kalai et al., 2010; Moitra and Valiant, 2010; Xu et al., 2016). The here established link can thus serve to transfer results from k-means-like approaches (e.g. Arthur et al., 2009) to GMM clustering.
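A minimal sketch of the lazy-k-means reassignment rule (19): a data point leaves its current cluster only if some other center is closer by a factor of (1 + ε). The vectorized form and the names are our own choices.

    import numpy as np

    def lazy_reassign(Y, mu, s, eps):
        # One lazy-k-means reassignment sweep according to Eq. (19);
        # eps = 0 recovers the standard k-means assignment step.
        dists = np.linalg.norm(Y[:, None, :] - mu[None, :, :], axis=2)   # (N, C)
        closest = dists.argmin(axis=1)
        current = dists[np.arange(len(Y)), s]
        move = (1.0 + eps) * dists[np.arange(len(Y)), closest] < current
        s = s.copy()
        s[move] = closest[move]
        return s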
In this study, we have focused on k-means and its relation to GMMs with isotropic Gaussians and equal mixing proportions (Eqn. 2). The analytical tools applied here could be used similarly for general GMM densities. Also in the general case it would be possible to define algorithms that only consider the C' most relevant clusters for updates. However, the criterion to assign clusters to data points would diverge considerably from the closest-cluster selection used by k-means. As a consequence, even when choosing C' = 1, a general GMM density would not result in a decoupling of the μ_c updates from the updates of the other model parameters. We elaborate in Suppl. C.

Finally, a very popular k-means version is fuzzy k-means (e.g., Bezdek, 1981, for references), which takes the form of a generalization of the k-means objective (1) by using non-binary s_c^(n) in the place of the k-means assignments. Fuzzy k-means algorithms then update weighted cluster assignments and cluster centers in order to minimize such objectives. Prop. 4 serves best to highlight the differences between standard fuzzy k-means and k-means-C', because it shows that the average entropy of the cluster assignments emerges in the context of GMMs as a term in addition to a softened objective. Standard algorithms for fuzzy k-means (e.g. Bezdek, 1981; Yang, 1993) are different as they usually generalize the k-means objective without an additional entropy term. Notably, newer versions of fuzzy k-means have been suggested to improve on earlier versions by introducing additional regularization terms. One of these regularizations takes the form of the entropy of cluster assignments (compare Miyamoto et al., 2008, Sec. 2). Considering Eqn. 18 of Prop. 4, we could now relate the regularization constant of entropy-regularized fuzzy k-means to the GMM log-likelihood optimization, or introduce novel versions of fuzzy k-means with many weights set to 'hard' zeros. Other, e.g., quadratic regularizations (see Miyamoto et al., 2008) are, on the other hand, not as closely related to the GMM objective but may correspond to other data statistics.
5. Numerical Verification
Before we conclude, we briefly numerically verify the main theoretical results of this work. We use a BIRCH data set with C = 25 clusters on a 5 x 5 grid and N = 100 data points per cluster (same data set for all runs), as partly shown in Fig. 1I. Fig. 1A shows different runs of standard k-means and the time course of the free energy and likelihood computed using (13) and (3), respectively. The shown exemplary runs converge to different optima. The run with the highest final free energy recovers all cluster centers and results in a log-likelihood larger than the log-likelihood of the generating (ground-truth) parameters. We verified that this (small) overfitting effect decreases with increasing N. The bound for the best run is relatively tight, which is consistent with (14) of Prop. 3 for small σ². The gap is larger for local optima, which must have a larger σ² according to (12) and consequently a higher entropy of the q_c^(n) for choices C' > 1 up to C' = C. The gap also increases for clusters with larger overlap in Fig. 1D/J, where we use the same setting as for Fig. 1A but with randomly (uniformly) distributed cluster centers (see Fig. 1J and 1K). Note that we use the seeding of k-means++ (Arthur and Vassilvitskii, 2006) for Fig. 1. The initial values of L(Θ) are thus already relatively high (see L_init).

Fig. 1E shows different runs of k-means-C' for the data as used for Fig. 1J. Using k-means-C' with different numbers of winning clusters C' can prevent shifted cluster centers caused by unsymmetrical cluster overlaps (compare Fig. 1J and 1K). Final likelihoods of the best runs with C' > 1 are higher than those of standard k-means. Fig. 1E,J,K can also serve as numerical verification of the differences between free energies for different C'. Suppl. E elaborates on this. Figs. 1C/F give additional results on the purity. Here we also compare to the popular DBSCAN method (Ester et al., 1996). While for the well separated grid data set the purity is comparably high, for the random set with larger overlaps the purity for DBSCAN is, at around 0.5, no longer comparable. More detailed results are given in Tab. S1, where we also show results for lazy-k-means. Finally, Figs. 1G and H verify our results using real and large-scale data. The KDD-Cup 2004 Protein Homology Task (KDD; Caruana et al., 2004) comprises 145,751 samples of 74-dimensional data points. We observe tighter bounds between log-likelihood (3) and free energy (7) for better solutions with increasing C'. Already for small C' > 1 the D_KL-gap decreases significantly relative to k-means, and it vanishes nearly completely for larger C'.
6. Conclusion
We have established a novel and, we believe, very natural link between k-means and EM for GMMs by showing that k-means is a special case of truncated variational EM approximations for GMMs. The link can serve to transfer theoretical research between k-means-like and GMM clustering approaches (Sec. 4 treated some examples). Of the many studies which consider k-means and data samples of GMMs (e.g. Chaudhuri et al., 2009, & refs. therein), there is none that provides the close theoretical links and free energy results provided here (also see Suppl. B). Earlier work by Pollard (1982) is maybe one of the most relevant studies, as it proves a theorem which relates the convergence points of k-means to an underlying distribution. In the sense of a central limit theorem, this distribution is given by a GMM with clusters of specific covariance. Cluster overlap in the samples influences the cluster shapes via non-zero off-diagonal elements. The question of Pollard (1982) is thus how to fit a GMM (in a central limit theorem sense) to correspond to k-means convergence points. Prop. 3 may be related to the theorem of Pollard (1982), but a closer inspection would require a more extensive analysis.

Other than the above discussed theoretical link of k-means to GMM clustering, our investigations may also be useful for the analysis and improvement of further aspects of k-means-like and GMM clustering. GMMs are used to address a wide range of tasks. Two examples may be image denoising (e.g. Zoran and Weiss, 2011) and video tracking (e.g. Jepson et al., 2003; Lan et al., 2015, 2018). Training k-means may, however, often be more efficient, which can be of importance, e.g., when a lot of data has to be processed in short times. By assigning a probabilistic interpretation to k-means, it may offer itself as a faster alternative to GMMs in such settings. Similarly, k-means-C' could be used as a compromise between GMMs and efficient k-means versions. A further aspect our results can be related to is the estimation of cluster numbers. The standard k-means algorithm (Alg. 1), standard EM for GMMs (Alg. 2) as well as k-means-C' (Alg. 3) require the number of clusters as input. A large number of studies have addressed this disadvantage of the standard approaches. Model selection and fully Bayesian approaches (Fraley and Raftery, 1998; Rasmussen, 2000; Neal, 2000) are common methods to estimate the cluster numbers of GMMs from data. For k-means, well-known contributions are the X-means algorithm (Pelleg et al., 2000), the G-means algorithm (Hamerly and Elkan, 2004) as well as approaches based on clustering stability (see von Luxburg, 2010, & refs. therein). All the approaches for k-means use standard k-means iterations or full k-means runs as part of the complete algorithm, e.g., as subroutines in split-and-merge approaches (Ueda et al., 2000, & refs. therein). There are different options how the results of this study can be combined with these previous studies. For X-means-like approaches, our results (e.g., Eqn. 14) could be used to quantify how well the BIC selection criterion used by X-means can be expected to work. If for a given data set k-means is not well approximating GMM solutions (e.g., for larger cluster overlaps), k-means-C' iterations would offer themselves as alternative iterations within an X-means setting.
Less directly, k-means-C' algorithms could (A) be used in conjunction with statistical tests for Gaussianity of projected data as in G-means, or (B) be used (like k-means) to define stability scores for stability-based selections of cluster numbers. Also in these two cases, improvements can be expected especially when cluster overlaps are large. Finally, k-means and k-means-C' could be combined with general Bayesian model selection (Schwarz, 1978) as their free energies (Eqns. 17 and 18, respectively) provide likelihood approximations.

More generally, k-means is usually not directly integrated into probabilistic frameworks, as the limit to zero cluster variance remained the most well-known relation between k-means and GMMs. From the probabilistic point of view, this limit is unsatisfactory, however, as the likelihood of data points under a GMM with σ² → 0 degenerates. The relations established here (a k-means/GMM relation with finite variances σ² > 0) are novel compared to standard variational approaches, which assume a-posteriori independence (e.g. Saul et al., 1996; Jaakkola, 2000). Truncated EM approaches (Lücke and Eggert, 2010; Sheikh et al., 2014; Lücke, 2018) aim at scalable and accurate approximations without assuming a-posteriori independence; a goal they share with many later approaches (e.g. Mnih and Gregor, 2014; Rezende and Mohamed, 2015; Salimans et al., 2015; Kucukelbir et al., 2016). Truncated EM is a natural variational approximation for k-means-like algorithms, and is here not only related but becomes, indeed, identical to standard k-means.

Acknowledgements
We acknowledge funding by the DFG projects SFB 1330 (B2) and EXC 2177.

References
Allahverdyan, A., Galstyan, A., 2011. Comparative analysis of Viterbi training and maximum likelihood estimation for HMMs, in: NIPS, pp. 1674–1682.
Anderberg, M.R., 1973. Cluster Analysis for Applications. Academic Press.
Arthur, D., Manthey, B., Röglin, H., 2009. k-means has polynomial smoothed complexity, in: IEEE Symp. Foundations of Comp. Sci., pp. 405–414.
Arthur, D., Vassilvitskii, S., 2006. How slow is the k-means method?, in: Comp. Geo., pp. 144–153.
Bachem, O., Lucic, M., Hassani, H., Krause, A., 2016. Fast and provably good seedings for k-means, in: NIPS, pp. 55–63.
Barbakh, W., Fyfe, C., 2008. Online clustering algorithms. International Journal of Neural Systems 18, 185–194.
Barber, D., 2012. Bayesian reasoning and machine learning. Cam. Univ. Press.
Belkin, M., Sinha, K., 2010. Polynomial learning of distribution families, in: Symp. Comp. Sci., pp. 103–112.
Berkhin, P., 2006. A survey of clustering data mining techniques, in: Grouping multidimensional data. Springer, pp. 25–71.
Bezdek, J.C., 1981. Pattern recognition with fuzzy objective function algorithms. Springer.
Broderick, T., Kulis, B., Jordan, M., 2013. MAD-Bayes: MAP-based asymptotic derivations from Bayes, in: ICML, pp. 226–234.
Caruana, R., Joachims, T., Backstrom, L., 2004. KDD-Cup 2004: results and analysis. ACM SIGKDD Explorations Newsletter 6, 95–108.
Celeux, G., Govaert, G., 1992. A classification EM algorithm for clustering and two stochastic versions. Comp. Statistics & Data Analysis 14, 315–332.
Chaudhuri, K., Dasgupta, S., Vattani, A., 2009. Learning mixtures of gaussians using the k-means algorithm. arXiv preprint arXiv:0912.0086.
Dai, Z., Lücke, J., 2014. Autonomous document cleaning – A generative approach to reconstruct strongly corrupted scanned texts. IEEE Trans. on Pattern Analysis and Machine Intelligence 36, 1950–1962.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B 39, 1–38.
Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification. Wiley-Interscience (2nd Edition).
Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, pp. 226–231.
Forster, D., Sheikh, A.S., Lücke, J., 2018. Neural simpletrons: Learning in the limit of few labels with directed generative networks. Neural Computation, 2113–2174.
Fraley, C., Raftery, A.E., 1998. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 41, 578–588.
Hamerly, G., Elkan, C., 2004. Learning the k in k-means, in: Proc. NIPS, pp. 281–288.
Har-Peled, S., Sadri, B., 2005. How fast is the k-means method? Algorithmica 41, 185–202.
Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.W., 2013. Stochastic variational inference. JMLR 14, 1303–1347.
Hughes, M.C., Sudderth, E.B., 2016. Fast learning of clusters and topics via sparse posteriors. Preprint arXiv:1609.07521.
Inaba, M., Katoh, N., Imai, H., 2000. Variance-based k-clustering algorithms by Voronoi diagrams and randomization. Trans. Inf. Sys. 83, 1199–1206.
Jaakkola, T., 2000. Tutorial on variational approximation methods, in: Opper, M., Saad, D. (Eds.), Advanced Mean Field Methods. MIT Press.
Jain, A.K., 2010. Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31, 651–666.
Jepson, A.D., Fleet, D.J., El-Maraghi, T.F., 2003. Robust online appearance models for visual tracking. IEEE Trans. on Pattern Analysis and Machine Intelligence 25, 1296–1311.
Jordan, M.I., Ghahramani, Z., Saul, L.K., 1997. Hidden markov decision trees, in: NIPS, pp. 501–507.
Kalai, A.T., Moitra, A., Valiant, G., 2010. Efficiently learning mixtures of two gaussians, in: Proc. ACM Symp. Theo. Comp., ACM, pp. 553–562.
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., Blei, D.M., 2016. Automatic differentiation variational inference. CoRR.
Pelleg, D., Moore, A.W., 2000. X-means: Extending k-means with efficient estimation of the number of clusters, in: Proc. ICML, pp. 727–734.
Pollard, D., 1982. A central limit theorem for k-means clustering. The Annals of Probability 10, 919–926.
Rasmussen, C.E., 2000. The infinite gaussian mixture model, in: Proc. NIPS, pp. 554–560.
Rezende, D.J., Mohamed, S., 2015. Variational inference with normalizing flows. ICML.
Salimans, T., Kingma, D., Welling, M., 2015. Markov chain monte carlo and variational inference: Bridging the gap. ICML.
Saul, L.K., Jaakkola, T., Jordan, M.I., 1996. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research 4, 61–76.
Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6, 461–464.
Segal, E., Battle, A., Koller, D., 2002. Decomposing gene expression into cellular processes, in: Pacific Symposium on Biocomputing, pp. 89–100.
Sheikh, A.S., Shelton, J.A., Lücke, J., 2014. A truncated EM approach for spike-and-slab sparse coding. JMLR 15, 2653–2687.
Shelton, J.A., Gasthaus, J., Dai, Z., Lücke, J., Gretton, A., 2014. GP-select: Accelerating EM using adaptive subspace preselection. arXiv:1412.3411, now published in Neural Computation 29(8):2177–2202, 2017.
Steinley, D., 2006. K-means clustering: A half-century synthesis. Brit. J. Math. and Stat. Psych. 59, 1–34.
Ueda, N., Nakano, R., Ghahramani, Z., Hinton, G.E., 2000. Split and merge EM algorithm for improving gaussian mixture density estimates. J. of VLSI Sig. Proc. Systems for Signal, Image and Video Tech. 26, 133–140.
Welling, M., Kurihara, K., 2006. Bayesian k-means as a maximization-expectation algorithm, in: Proc. SIAM Conf. Data Mining, pp. 474–478.
Xu, J., Hsu, D.J., Maleki, A., 2016. Global analysis of expectation maximization for mixtures of two gaussians, in: NIPS, pp. 2676–2684.
Yang, M.S., 1993. A survey of fuzzy clustering. Mathematical and Computer Modelling 18, 1–16.
Zoran, D., Weiss, Y., 2011. From learning models of natural image patches to whole image restoration, in: Proc. ICCV, IEEE, pp. 479–486.

Supplement
A. Illustration of truncated posterior approximations
Fig. S1 illustrates truncated distributions for an example with two-dimensional data points (D = 2), C = 8 clusters and C' = 3. The truncated distribution approximates the exact posterior for each n well. For basically all data points (grey dots), truncated distributions with C' = 3 are sufficiently exact; and for most data points even C' = 2 or C' = 1, i.e., the k-means case, will sufficiently well model the posterior, because for most data points in this example the posterior is dominated by the value of the closest cluster. Also see Figs. S2 and S3 for numerical verifications.

B. k-Means and hard cluster assignments for GMMs

Here we provide more details on how k-means or the k-means objective has previously been related to maximum likelihood optimization of GMMs.

Classification expectation maximization.
The log-likelihood objective of GMMs (3) and the quantization error (1) optimized by k-means are non-trivially related. This is also the case for GMMs with isotropic and equal Gaussian variances and equal mixing proportions as considered here (Eq. 2). For the purposes of our study we emphasize this point, as earlier contributions reported results for clustering criteria from which one may incorrectly infer a trivial relation between (1) and (3). One example of such previous work (see Celeux and Govaert, 1992, and references therein) does, for instance, consider a classification expectation maximization (CEM) algorithm for clustering. The paper defines a classification maximum likelihood (CML) objective which is (in the notation of this paper) given by:

\[ \mathcal{L}^{\mathrm{CML}}(\Theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} s_c^{(n)} \log\Big( \frac{1}{C}\, \mathcal{N}\big( \vec{y}^{(n)}; \vec{\mu}_c, \sigma^2 \mathbf{1} \big) \Big), \qquad (B.1) \]

where the s_c^(n) are the binary weights of Alg. 1. In the paper (Celeux and Govaert, 1992) it is then shown that the problem of maximizing the CML objective (B.1) is equivalent to the problem of minimizing the quantization error (1). Although (B.1) is also referred to as a maximum likelihood (ML) objective (see Celeux and Govaert, 1992, and references therein), note the difference between this classification maximum likelihood (CML) objective (B.1) and the standard ML objective for GMMs in (3). Essentially, the sum over clusters in (3) and the logarithm cannot be trivially commuted to obtain (B.1). Eq. (14) can be regarded as a quantification of the difference between (B.1) and (3) in terms of the ratio between data-to-cluster-center distances and σ² (compare the initial discussion of Celeux and Govaert, 1992). Eq. (14) is ultimately a consequence of applying Jensen's inequality to commute the logarithm and the sum over clusters, which gives rise to a lower free energy bound. Only if cluster centers are far apart compared to σ² will the sum over c, for each data point n, be dominated by the terms of within-cluster distances. This is the case of well-separable clusters, i.e., if 'hard' data partitions represent a good approximation of 'soft' a-posteriori assignments. In that case the KL-divergence becomes zero. Also see Supplement E below for numerical experiments showing differences between the k-means and log-likelihood objectives.
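To make this equivalence explicit, one can insert the isotropic Gaussian of Eq. (2) into (B.1); the following short expansion is a sketch in our notation, using that each s^(n) sums to one over c:

\[
\mathcal{L}^{\mathrm{CML}}(\Theta)
 = \frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} s_c^{(n)}
   \Big( -\log(C) - \tfrac{D}{2}\log\big(2\pi\sigma^2\big) - \tfrac{1}{2\sigma^2} \big\|\vec{y}^{(n)} - \vec{\mu}_c\big\|^2 \Big)
 = -\log(C) - \tfrac{D}{2}\log\big(2\pi\sigma^2\big) - \frac{1}{2\sigma^2 N}\, J\big(s_{1:C}^{(1:N)}, \vec{\mu}_{1:C}\big).
\]

For any fixed σ², maximizing L^CML over assignments and centers is therefore the same as minimizing the quantization error J of Eq. (1), whereas the standard likelihood (3) contains the logarithm of a sum over clusters that does not reduce to J.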
Hard cluster assignments and variational distributions. As discussed in the main text, the by far most common approach to relate k-means and Gaussian mixture models is to take the limit to zero cluster variances, σ² → 0. This relation is very commonly used in text books as well as in the research literature itself. Alternatively, and related to this study, k-means is for didactic purposes also sometimes casually related to GMM optimization using variational EM. Such a relation is usually confined to derivations that make the relation of k-means to GMM data models plausible. For instance, if the free energy w.r.t. a variational distribution is in its standard form given by

\[ \mathcal{F}(q, \Theta) = \frac{1}{N} \sum_{n=1}^{N} \Big( \sum_{c=1}^{C} q^{(n)}(c) \log\big( p(c, \vec{y}^{(n)} \,|\, \Theta) \big) - \sum_{c=1}^{C} q^{(n)}(c) \log\big( q^{(n)}(c) \big) \Big), \qquad (B.2) \]

then one can informally define q^(n)(c) to be equal to one if and only if c corresponds to the maximal value of the posterior p(c | y^(n), Θ). For GMM (2) the q^(n)(c) are then given by:

\[ q^{(n)}(c) = \begin{cases} 1 & \text{if } \forall c' \neq c: \| \vec{y}^{(n)} - \vec{\mu}_c \| < \| \vec{y}^{(n)} - \vec{\mu}_{c'} \| \\ 0 & \text{otherwise} \end{cases}, \qquad (B.3) \]

and the free energy reduces to

\[ \mathcal{F}(q, \Theta) = \frac{1}{N} \sum_{n=1}^{N} \log\big( p(c_o^{(n)}, \vec{y}^{(n)} \,|\, \Theta) \big), \qquad (B.4) \]

where c_o^(n) denotes the cluster closest to data point y^(n). k-means is then often taken as optimizing this objective.

In order to make any mathematically rigorous statements, the argumentation above lacks, at closer inspection, a solid theoretical foundation in two important aspects: (A) Derivations of the free energy using variational distributions all assume q^(n)(c) to be strictly positive (q^(n)(c) > 0 for all n and c), which is violated for q^(n)(c) defined as in Eq. (B.3). (B) Relating k-means to a free energy objective such as (B.4) implicitly assumes the variational distributions q^(n)(c) to be independent of the model parameters Θ (i.e., independent of μ_1:C and σ² in our case). Considering Eq. (B.3), this independence is also not given (in contrast, e.g., to mean field distributions). The model parameters can also not simply be assumed to be constant, as is the case for full posteriors in standard EM: the proof verifying that values for the model parameters can be held fixed is given for full posteriors only (see, e.g., Lemma 1 of Neal and Hinton, 1998), but it does not necessarily apply for general variational distributions q^(n)(c) defined using model parameters Θ.

The results applied here (Lücke, 2018) do address both these aspects: variational distributions with 'hard' zeros are treated (Point A), and variational distributions that can depend on the model parameters are explicitly considered (Point B). Addressing either of these two points is non-trivial (see Propositions 1 and 2 in Lücke (2018) for Point A; and, e.g., Propositions 3-5 in Lücke (2018) for Point B).
Fig. S1: Illustration of truncated distributions for a GMM (2) in D = 2 dimensions with C = 8 clusters. The main figure shows the cluster centers μ_c and a data point y^(n) which lies (for illustrative purposes) to some extent in between some clusters. The full posterior (the responsibilities) r_c^(n) = p(c | y^(n), Θ) over the C = 8 clusters is compared to the truncated posterior q_c^(n) with |K^(n)| = C' = 3, here K^(n) = {2, 3, 6}, which keeps the C' highest posterior values, sets all others to zero, and renormalizes the distribution to sum to one. The three closest clusters, which correspond to the three highest posterior values, are connected with black lines in the main figure.

However, if treated rigorously, results for a large class of distributions (which includes the distributions of Eq. B.3) can be derived, and the derived results apply for any generative model with discrete latents. Notably, truncated variational distributions with non-zero entropy are included, as well as distributions of the form (B.3) in which q^(n)(c) = 1 for a cluster that is not necessarily the closest one (compare lazy-k-means, see Proposition 4). In this paper we make use of results of Lücke (2018) by applying them to GMMs given by Eq. (2) (e.g., through Propositions 1 and 4, which in turn use the simplified free energy (7) of Lücke (2018)). The difficulties of cleanly and rigorously treating distributions such as (B.3) may explain why (to the knowledge of the authors) any relation of k-means and variational approaches is rather informally discussed (compare, e.g., Jordan et al. (1997), who, e.g., relate Viterbi training to variational EM). If the relation of k-means to GMMs is made more explicit, the literature, including popular text books (e.g. MacKay, 2003; Barber, 2012), drops back to the zero variance limit to derive k-means.

C. Generalization of criterion (8) for general GMMs
C. Generalization of criterion (8) for general GMMs

Consider a general standard GMM of the form:

p(c \,|\, \Theta) = \pi_c  with  \sum_{c=1}^{C} \pi_c = 1,   (C.1)

p(\vec{y} \,|\, c, \Theta) = |2\pi\Sigma_c|^{-1/2} \exp\big(-\tfrac{1}{2} \|\vec{y} - \vec{\mu}_c\|^2_{\Sigma_c}\big),   (C.2)

\|\vec{y} - \vec{\mu}_c\|^2_{\Sigma_c} = (\vec{y} - \vec{\mu}_c)^T \Sigma_c^{-1} (\vec{y} - \vec{\mu}_c),   (C.3)

where \pi_c \geq 0, \Sigma_c is a positive definite covariance matrix for each c, and |\cdot| denotes the determinant. We denote by \Theta = (\pi_{1:C}, \vec{\mu}_{1:C}, \Sigma_{1:C}) the set of all parameters.

For the GMM (C.1) to (C.3), a corresponding variational free energy is, because of Eq. (7) (first line), increased if and only if:

p(c, \vec{y} \,|\, \Theta) > p(\tilde{c}, \vec{y} \,|\, \Theta)
\Leftrightarrow \pi_c |2\pi\Sigma_c|^{-1/2} \exp\big(-\tfrac{1}{2}\|\vec{y} - \vec{\mu}_c\|^2_{\Sigma_c}\big) > \pi_{\tilde{c}} |2\pi\Sigma_{\tilde{c}}|^{-1/2} \exp\big(-\tfrac{1}{2}\|\vec{y} - \vec{\mu}_{\tilde{c}}\|^2_{\Sigma_{\tilde{c}}}\big)
\Leftrightarrow \log(\pi_c) - \tfrac{1}{2}\log(|2\pi\Sigma_c|) - \tfrac{1}{2}\|\vec{y} - \vec{\mu}_c\|^2_{\Sigma_c} > \log(\pi_{\tilde{c}}) - \tfrac{1}{2}\log(|2\pi\Sigma_{\tilde{c}}|) - \tfrac{1}{2}\|\vec{y} - \vec{\mu}_{\tilde{c}}\|^2_{\Sigma_{\tilde{c}}}
\Leftrightarrow \|\vec{y} - \vec{\mu}_c\|^2_{\Sigma_c} + \log(|2\pi\Sigma_c|) - 2\log(\pi_c) < \|\vec{y} - \vec{\mu}_{\tilde{c}}\|^2_{\Sigma_{\tilde{c}}} + \log(|2\pi\Sigma_{\tilde{c}}|) - 2\log(\pi_{\tilde{c}}).   (C.4)

In comparison, Shelton et al. (2014) use an estimated E-step, which consequently does not guarantee a monotonic increase of a free energy. Hughes and Sudderth (2016) do use a constrained likelihood optimization to find the best C' clusters per data point (related to Corollary 1), but a complete proof for general GMMs would require Proposition 5 of Lücke (2018), which warrants that M-steps can be derived while the parameters \Theta of the variational distributions q^{(n)}_c remain fixed.

Considering (C.4), note that the criterion to select clusters now depends on all model parameters (in contrast to the criterion of Eq. 8). If algorithms for parameter updates are defined based on (C.4), all current parameter values have to be considered in the E-steps which compute generalizations of the responsibilities q^{(n)}_c (compare Eq. 5). Notably, even if these responsibilities q^{(n)}_c become binary for the choice C' = 1, the selection of the non-zero values of q^{(n)}_c would still require the other parameter values. There would consequently not be a k-means-like decoupling from the other parameter updates, unlike for the GMM defined by Eq. (2).
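To make the dependence on the full parameter set explicit, the selection rule (C.4) can be evaluated as a per-cluster score, and the C' clusters with the smallest score are selected. The following sketch is illustrative only; it is not taken from the original algorithms, and the function names are ours.

# Sketch: cluster selection according to criterion (C.4) for a general GMM.
# Clusters with a smaller score are preferred; the C' best clusters form K^(n).
import numpy as np

def c4_scores(y, pis, mus, Sigmas):
    # y: (D,) data point; pis: (C,) mixing weights; mus: (C, D); Sigmas: (C, D, D).
    scores = np.empty(len(pis))
    for c, (pi_c, mu_c, Sigma_c) in enumerate(zip(pis, mus, Sigmas)):
        diff = y - mu_c
        maha = diff @ np.linalg.solve(Sigma_c, diff)          # ||y - mu_c||^2_{Sigma_c}
        logdet = np.linalg.slogdet(2.0 * np.pi * Sigma_c)[1]  # log |2 pi Sigma_c|
        scores[c] = maha + logdet - 2.0 * np.log(pi_c)        # left-hand side of (C.4)
    return scores

def select_clusters(y, pis, mus, Sigmas, C_prime):
    return np.argsort(c4_scores(y, pis, mus, Sigmas))[:C_prime]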
D. Generalizations for lazy-k-means

If we change the cluster selection criterion (8) to the criterion for lazy-k-means (19), then it follows from Proposition 1 that each cluster assignment in lazy-k-means increases the free energy (7). As the M-steps (equal to the k-means M-steps) then increase the free energy w.r.t. \Theta, it follows that lazy-k-means monotonically increases the same free energy objective. Corollary 1 does not apply, but we can generalize Proposition 2.
Table S1: Log-likelihood per data point L, quantization error φ, purity and normalized mutual information (NMI) for k-means, k-means-C' with C' = 2, and lazy-k-means on the one hand, and DBSCAN on the other. Given are the means over 100 independent runs as well as the values of the best single run. The mean and the best are identical for DBSCAN given the same, optimized free parameters. The best values per column are written in bold.

Algorithm     | BIRCH 5×5 (grid): L mean / best | BIRCH 5×5 (random positions): L mean / best
k-means       | -6.127 / -6.016                 | -5.842 / -5.789
k-means-C'    | -6.117 / -6.016                 | -5.828 / -5.771
lazy-k-means  | -6.117 / -6.016                 | -5.850 / -5.803
DBSCAN        | –  / –                          | –  / –

[The φ, purity and NMI columns of Table S1 are not reproduced here.]

*: best values for purity and NMI are for all algorithms given as the values of the run with the highest sum of purity and NMI, to omit solutions highly overfitted to one of the two criteria.
Fig. S2: The four columns show, from left to right, the log-likelihood L, the quantization error φ, the purity, and the normalized mutual information (NMI) of the k-means-C' algorithms on BIRCH data sets as in Fig. 1. The plots show the mean over 100 independent runs (solid line). For the log-likelihood and the quantization error, the single run with the best respective final value is shown as a striped line. For the purity and NMI plots, the single run with the highest sum of purity and NMI is shown as a striped line. Such a selection criterion omits runs that are highly overfitted to either purity or NMI.

Proposition (Generalization of Proposition 2 for lazy-k-means). Consider the TV-EM algorithm (Alg. 3) but with criterion (19) instead of criterion (8). If we set C' = 1, then the TV-EM updates of the cluster centers \vec{\mu}_c (6) become independent of the variance σ² and are given by the lazy-k-means algorithm.

Proof.
The proof is analogous to the one of Proposition 2, with the only difference that c_o^{(n)} is now a cluster of \vec{y}^{(n)} for which the following applies: \forall \tilde{c} \neq c_o^{(n)}: \|\vec{y}^{(n)} - \vec{\mu}_{c_o^{(n)}}\| < (1 + \epsilon) \|\vec{y}^{(n)} - \vec{\mu}_{\tilde{c}}\|. The cluster assignments thus become those of lazy-k-means, while the parameter updates remain those of standard k-means (i.e., the same as those used for lazy-k-means). □

As Propositions 1 and 2 can be generalized, the fact that lazy-k-means optimizes the same free energy as k-means also implies that Corollary 2 can be used to relate lazy-k-means to the GMM objective (3).
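To make the difference to Alg. 1 explicit, one lazy-k-means iteration can be sketched as follows. This is an illustration only, based on one natural reading of the ε-relaxed criterion above (a data point keeps its current cluster whenever that cluster still satisfies the criterion); the original definition of lazy-k-means may differ in such details, and the function name is ours.

# Sketch: one lazy-k-means iteration. A point keeps its cluster c if the distance
# to mu_c is within a factor (1 + eps) of the distance to every other cluster;
# otherwise it is reassigned to the closest cluster. The M-step is the standard
# k-means mean update.
import numpy as np

def lazy_kmeans_step(Y, mu, s, eps):
    # Y: (N, D) data; mu: (C, D) centers; s: (N,) current assignments; eps >= 0.
    dists = np.linalg.norm(Y[:, None, :] - mu[None, :, :], axis=2)  # (N, C)
    for n in range(Y.shape[0]):
        c = s[n]
        if not np.all(dists[n, c] < (1.0 + eps) * np.delete(dists[n], c)):
            s[n] = np.argmin(dists[n])  # fall back to the closest cluster
    for c in range(mu.shape[0]):        # M-step: standard k-means mean update
        if np.any(s == c):
            mu[c] = Y[s == c].mean(axis=0)
    return mu, s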
E. More details on the numerical experiments

Fig. S2 shows additional results of k-means-C' on the BIRCH data sets, namely the log-likelihood, quantization error, purity and NMI (where the likelihood and purity values are the same as those in Fig. 1, but are shown here again for easier comparison). Tab. S1 gives a numerical comparison of these results to the DBSCAN and lazy-k-means algorithms. The results on the quantization error compared to the likelihoods in Fig. S2 and Tab. S1 highlight the fact that optimization of the k-means criterion (i.e., of the quantization error) does generally not directly coincide with optimization of free energies by k-means-C' with C' > 1 (or with optimization of the likelihood in the case C' = C). For the NMI and purity scores, we find that on the BIRCH set with random clusters (and therefore larger overlaps) k-means is prone to trade off NMI with decreasing purity scores (which results in a lower than average NMI score for the shown best run). The k-means-C' algorithm is, on the other hand, already less prone to this trade-off for C' = 2 (compare Tab. S1 and Fig. S3).

Fig. S3: A shows experiments of Alg. 1 (k-means) on a BIRCH data set with grid-positioned clusters, as visualized in E. Shown are the log-likelihood and the free energy per iteration for three individual runs (red, blue and green, corresponding to solutions in which all clusters, all but one, or all but two clusters are recovered) and the mean of 100 independent runs (gray); L_GT and L_init mark the log-likelihoods of the ground-truth and of the initial parameters. The individual runs show convergence to different optima. B shows the same experiments as A on uniformly randomly positioned clusters, as visualized in F. In A and B, the D_KL gap at convergence is visualized as colored vertical lines next to the plot and can in A be clearly observed to increase for less optimal solutions. In B, due to higher cluster overlaps, the D_KL gap is overall larger compared to A. C shows the mean log-likelihood (solid line) and the log-likelihood of the best run (striped) of 100 runs of Alg. 3 (k-means-C') for different C' (C' = 1, 2, 5 and 25, where C' = 1 corresponds to k-means and C' = 25 = C to the GMM). For C' ≥ 2, the best solutions are close to identical for the different settings, although some settings tend to find these best solutions more frequently. D shows the mean log-likelihood (solid line) and free energy (dashed) on the KDD data set over 10 runs (C' = 1, 2, 10 and 200, where C' = 200 = C corresponds to the GMM), shaded with their respective SEMs. Visualizations of some ground-truth cluster centers (blue circles) and of the found cluster centers of the best runs (red crosses) on BIRCH data sets are shown in E, F (for k-means) and G (for k-means-C' with C' = 2). Comparing F and G shows the difference between using C' = 1 (k-means) and C' > 1. Especially for regions with higher cluster overlap, k-means tends to push close-by clusters away from each other, due to the hard assignment of data points to only a single cluster. This effect can be observed on the groups of two and three clusters in the upper half as well as on the group of clusters in the bottom left corner. For C' = 2, this effect is already greatly reduced.
Fig. S4: This example illustrates that the free energy and the GMM likelihood objective are not trivially related. We generate data from seven overlapping, equal and isotropic Gaussians arranged as above (black circles), drawing 100 000 data samples per Gaussian. The contour lines show the underlying probability density from which the data points are drawn. We compare k-means in (a) with EM for isotropic (non-truncated) GMMs in (b). For both, we use the ground-truth generating cluster centers (and variances for the GMM) as initialization. If we now run k-means, we observe that while the free energy increases, the log-likelihood decreases. For this example, the final cluster centers (red crosses) obtained by k-means differ very significantly from the ground truth. But also, e.g., for just two overlapping Gaussians, k-means results in final cluster centers significantly different from the ground truth, as can be observed in Fig. S3 (F/G). The higher the cluster overlap, the more pronounced this effect becomes. EM for the GMM does, on the other hand (as expected), result in final cluster centers (red crosses) very similar to the ground truth (note the different scales of the plots; the higher initial likelihood value for the GMM compared to k-means is not due to different initial cluster centers, but due to a different σ² value for k-means as a result of applying Eq. (12) with k-means activations). Our example also provides a counterexample to the k-means objective (1) and the likelihood objective for GMMs (3) giving rise to the same optimization problem: here, the quantization error decreases, but the GMM likelihood gets worse. The optimization of equally sized, isotropic GMMs (2) and of the k-means objective are sometimes regarded as equivalent; Feldman et al. (2011), for instance, write "[...] their result requires that the Gaussians are identical spheres, in which case the maximum likelihood problem is identical to the k-means problem". Also, results of Pollard (1982), who is often cited for showing that k-means is a GMM maximum likelihood estimator, seem to be misinterpreted sometimes. k-means becomes an increasingly good maximum likelihood estimator if we additionally demand increasingly separable clusters. Increased separability is in turn closely related to the limit σ → 0, in which the k-means and GMM objectives become increasingly similar.
Feldman, D., Faulkner, M., Krause, A., 2011. Scalable training of mixture models via coresets. NIPS, 2142–2150.
Pollard, D., 1982. A central limit theorem for k-means clustering. The Annals of Probability, 919–926.

Fig. S3 shows enlarged versions of plots A, D, E, G of Fig. 1, with more details in the caption. In addition to these results, we also verified that the free energies (7), (13) and the right-hand side of (17) are numerically equal for k-means. For k-means-C' we verified that the free energies (7) and (18) are equal at convergence.

Note that Fig. S3 can also be interpreted as numerically verifying that the free energies and the likelihood objective of the GMM (2) are not trivially related. This includes the free energy (13), which is optimized by k-means and which corresponds to C' = 1.
Comparison of the cluster means shown in Figs. S3(F) and (G) already shows the difference between results for C' = 1 (k-means) and for C' = 2. Finally, the numerical experiment of Fig. S4 is deliberately chosen to highlight the difference between the k-means objective (1) and the GMM log-likelihood (3): by applying k-means, the k-means free energy increases (the quantization error gets smaller), but the log-likelihood gets worse. The cluster centers recovered by k-means and by EM for the GMM (2) are, as a result, very different.

For the formulas of purity and NMI, see Manning et al. (2008), chapter "Evaluation of clustering", https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
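For completeness, both measures are easily computed from the true and the found cluster labels. The following sketch follows the definitions referenced above; it assumes scikit-learn is available for the NMI computation, and the helper names are ours.

# Sketch: purity and normalized mutual information (NMI) of a clustering.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    # Each found cluster is credited with its most frequent ground-truth class.
    M = contingency_matrix(labels_true, labels_pred)  # (n_classes, n_clusters)
    return M.max(axis=0).sum() / M.sum()

def nmi(labels_true, labels_pred):
    return normalized_mutual_info_score(labels_true, labels_pred)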