[PDF] t - k -means: A Robust and Stable k -means Variant

Abstract

k -means algorithm is one of the most classical clustering methods, which has been widely and successfully used in signal processing. However, due to the thin-tailed property of the Gaussian distribution, k -means algorithm suffers from relatively poor performance on the dataset containing heavy-tailed data or outliers. Besides, standard k -means algorithm also has relatively weak stability, i.e. its results have a large variance, which reduces its credibility. In this paper, we propose a robust and stable k -means variant, dubbed the t - k -means, as well as its fast version to alleviate those problems. Theoretically, we derive the t - k -means and analyze its robustness and stability from the aspect of the loss function and the expression of the clustering center, respectively. Extensive experiments are also conducted, which verify the effectiveness and efficiency of the proposed method. The code for reproducing main results is available at \url{this https URL}.

Full PDF

aa r X i v : . [ c s . L G ] J a n t - k -means: A k -means Variant with Robustness and Stability Yang Zhang †∗ , Qingtao Tang †∗ , Yiming Li †∗ , Weipeng Huang ‡ , Shutao Xia †† Tsinghua University, China ‡ University College Dublin, [email protected]

Abstract

Lloyd’s k -means algorithm is one of the most clas-sical clustering method, which is widely used indata mining or as a data pre-processing procedure.However, due to the thin-tailed property of theGaussian distribution, k -means suffers from rela-tively poor performance on the heavy-tailed dataor outliers. In addition, k -means have a relativelyweak stability, i.e. its result has a large variance,which reduces the credibility of the model. In thispaper, we propose a robust and stable k -means vari-ant, the t - k -means, as well as its fast version insolving the ﬂat clustering problem. Theoretically,we detail the derivations of t - k -means and analyzeits robustness and stability from the aspect of lossfunction, inﬂuence function and the expression ofclustering center. A large number of experimentsare conducted, which empirically demonstrates thatour method has empirical soundness while preserv-ing running efﬁciency. Lloyd’s algorithm [Lloyd, 1982] is one of themost classical methods in solving the clusteringproblem, and is widely used today in data mining[Yu et al. , 2009; Tsironis et al. , 2013], pattern recog-nition [Coates et al. , 2011; Coelho and Murphy, 2009], etc. , or as a data pre-processing procedurein more complex algorithms [Gopalan, 2016;Zhang et al. , 2017].It’s known that k -means is a special case of Gaussianmixture model (GMM) [Mclachlan and Basford, 1988] witheach components sharing the same mixing coefﬁcient and co-variance matrix [Bishop, 2006]. However, due to the thin-tailed property of the Gaussian distribution, k -means (alsoGMM) may perform poorly on the data which contain agroup or groups of observations with heavy tails or outliers[Peel and Mclachlan, 2000]. Consequently, t mixture model(TMM) [Liu and Rubin, 1995] has been introduced to gain ∗ equal contribution. It is very common to call Lloyd’s algorithm the “standard k -means algorithm” (called “ k -means” for short). robustness in the clustering task, since its base ( t distribu-tion) is a heavy-tailed generalization of Gaussian distribution.However, due to the tremendous demand in the necessary pa-rameter estimation (such as covariance matrices), TMM isunstable with the arbitrary initialization and requires over-whelming time cost. The facts greatly prevent it to be a pop-ular clustering method. In addition, since the update of clus-tering center is based only on the information of the samplein its cluster, k -means have a relatively large variance, whichreduces the credibility of the model.In this paper, to obtain robust and stable clustering methodwhile preserving running efﬁciency, we propose t - k -means.It is not only as extensible and fast as k -means but also robustto heavy-tailed data and more stable than classical k -meansmethod. Through this paper, we elaborate on the derivationsof t - k -means, prove the robustness and stability, and also il-lustrate an extensive empirical study.In summary, our three major contributions are as follows. • We derive t - k -means clustering method from TMM,which is a robust and stable generalization of k -means. • We theoretically prove the proposed method more robustand stable than k -means, from the views of loss function,inﬂuence function and clustering center expression. • Empirically, a large number of experiments demonstratethat our method has empirical soundness while preserv-ing running efﬁciency. k -means Variants k -medoids [Kaufman and Rousseeuw, 1987] chooses sam-ples as cluster centroid and works with a generalization ofthe Manhattan Norm to deﬁne distance between samples in-stead of L2-Norm. k -medians [Arora et al. , 1998] calculat-ing the median for each cluster to determine its centroid, in-stead of the means, as a result of using L1-loss. k -meanswith Mahalanobis distance metric [Mao and Jain, 1996] hasbeen used to detect hyperellipsoidal clusters, but at the ex-pense of higher computational cost. A variant of k -meansusing the Itakura–Saito distance [Linde et al. , 1980] has beenused for vector quantization in speech processing. Banerjee[Banerjee et al. , 2005] exploits the family of Bregman dis-tances for k -means [Jain, 2010].n addition, a preprocessing procedure, k -means++, forchoosing the initial values for k -means to avoid the occa-sional poor k -means results due to the arbitrarily terrible ini-tialization is proposed in [Arthur and Vassilvitskii, 2007]. Itcan also be perfectly integrated into our proposed t - k -meansmethod. k -means and GMM For better explaining TMM and t - k -means, we start by re-viewing the most well-known technique GMM.Given the dataset D = { xxx n | n = 1 , , . . . , N } , where xxx n ∈ R p denotes a p -dim sample, Gaussian mixture model(GMM) is a linear superposition of K -component Gaussiandistribution [Mclachlan and Basford, 1988], i.e. , N ( xxx | πππ, µµµ, ΣΣΣ) = K X k =1 π k N k ( xxx | µµµ k , ΣΣΣ k ) , (1)where π k ∈ R ( P Kk =1 π k = 1) , µµµ k ∈ R p and ΣΣΣ k ∈ Π( p ) are the mixing coefﬁcient, the mean vector and the covari-ance matrix of the k -th component, respectively, and πππ = { π k | k = 1 , . . . , K } , µµµ = { µµµ k | k = 1 , . . . , K } , ΣΣΣ = { ΣΣΣ k | k =1 , . . . , K } .Usually, the EM algorithm can be used to estimate the pa-rameters. More speciﬁcally, the complete-data of sample xxx n in the EM algorithm is given by ( xxx ⊤ n , zzz ⊤ n ) ⊤ , where the latentvariable z nk = ( zzz n ) k ∈ { , } denotes whether xxx n belongsto the k -th component. As we have known, in M-step, the pa-rameters πππ, µµµ, ΣΣΣ of GMM is updated by following objective min πππ,µµµ,

ΣΣΣ − ln N Y n =1 K Y k =1 [ π k N k ( xxx n | µµµ k , ΣΣΣ k )] r nk , (2)where r nk is the expectation of z nk . Let III denote a p -dimidentity matrix and α be a known parameter. Assuming thatall the components share one single mixing coefﬁcient andcovariance matrix, we will have π i = π j = K , ΣΣΣ i = ΣΣΣ j = αIII, i, j = 1 , . . . , K . As a result, Eq. (2) becomes min µµµ − N X n =1 K X k =1 r nk ln (cid:20) (2 πα ) − exp (cid:18) − α ( xxx n − µµµ k ) ⊤ ( xxx n − µµµ k ) (cid:17)i ⇔ min µµµ N X n =1 K X k =1 r nk ( xxx n − µµµ k ) ⊤ ( xxx n − µµµ k ) . (3)Eq. (3) is identical to the loss function of k -means. Clearly, k -means can be regarded as a special case of GMM with dif-ferent components sharing the same mixture coefﬁcient andcovariance matrix [Mitchell and others, 1997]. t - k -means Similar to k -means and GMM, t - k -means is also a specialcase of TMM under the condition that π i = K , ΣΣΣ i = αIII, i =1 , . . . , K given parameter α . To reduce the parameters fur-ther, following Liu and Rubin, et al. [Liu and Rubin, 1995],we also assume that ν i = ν, i = 1 , , . . . , K . Those con-ditions are used to regulate the model complexity, so that t - k -means can have robustness while preserving running efﬁ-ciency. Similar to GMM, the t mixture model (TMM) is a linear su-perposition of K -component t distribution, i.e. , t ( xxx | ΨΨΨ) = K X k =1 π k t k ( xxx | ν k , µµµ k , ΣΣΣ k ) , where ΨΨΨ = { πππ, ννν, µµµ, ΣΣΣ } and ννν = { ν k | k = 1 , . . . , K } .Following the deﬁnition of the complete-data vector zzz inGMM, we write it for TMM by xxx c = ( xxx ⊤ , . . . , xxx ⊤ N , zzz ⊤ , . . . , zzz ⊤ N , u , . . . , u N ) ⊤ , where z , . . . , z N is deﬁned in section 2.2 and u , . . . , u N arethe additional missing data [Liu and Rubin, 1995], such that xxx n | u n , z nk = 1 ∼ N ( µµµ k , αIIIu n ) ,u n | z nk = 1 ∼ gamma( 12 ν, ν ) . (4)Thus, the complete-data log likelihood function can bewritten as ln L c (ΨΨΨ | xxx, uuu, zzz ) = ln L G ( ν | uuu, zzz ) + ln L N ( µµµ, α | xxx, uuu, zzz ) , ln L G ( ν | uuu, zzz ) = K X k =1 N X n =1 z nk (cid:26) − ln Γ (cid:18) ν (cid:19) + 12 ν ln (cid:18) ν (cid:19) + 12 ν (ln u n − u n ) − ln u n (cid:27) , ln L N ( µµµ, α | xxx, uuu, zzz ) = K X k =1 N X n =1 z nk (cid:26) − p ln(2 π ) −

12 ln αu n − u n α ( xxx n − µµµ k ) ⊤ ( xxx n − µµµ k ) (cid:27) . In this section, we detail the derivations about how to use EMalgorithm to iterative optimizes log likelihood.In the EM algorithm, the objective function in a new iter-ation is the current conditional expectation of the complete-data log likelihood function, i.e. , Q (ΨΨΨ ⋆ | ΨΨΨ) = E (ln L c (ΨΨΨ | xxx, uuu, zzz ))= Q ( ν ⋆ | ΨΨΨ) + Q ( µµµ ⋆ , α ⋆ | ΨΨΨ) , (5)where Q ( ν ⋆ | ΨΨΨ) = E (ln L G ( ν | uuu, zzz )) ,Q ( µµµ ⋆ , α ⋆ | ΨΨΨ) = E (ln L N ( µµµ, α | xxx, uuu, zzz )) . The parameters with superscript “ ⋆ ” will be estimated in thenew iteration. E-stepEstimate E ( z nk | xxx n ) The posterior estimation of latentvariable z nk is E ( z nk | xxx n ) = t k ( xxx n | ν, µµµ k , αIII ) P Kj =1 t j ( xxx n | ν, µµµ j , αIII ) = τ nk . (6) stimate E ( u n | xxx n , zzz n ) Since xxx n | u n , z nk = 1 ∼N ( µµµ k , αIIIu n ) , from the property of Gaussian distribution, weknow u n ( xxx n − µµµ k ) ⊤ ( xxx n − µµµ k ) /α follows χ p distribution, i.e. , gamma( p/ , / . Treating xxx n as data, from the property ofgamma distribution, it is not hard to prove that the likelihoodof u n is L ( u n | xxx n ) ∝ gamma (cid:18) p , ( xxx n − µµµ k ) ⊤ ( xxx n − µµµ k )2 α (cid:19) . (7)According to Eq. (4) and Eq. (7), the posterior distribution of u n given xxx n , z nk = 1 is u n | xxx n , z nk = 1 ∼ gamma (cid:18) ν + p ,ν + α ( xxx n − µµµ k ) ⊤ ( xxx n − µµµ k )2 ! . (8)Based on Eq. (8), we have E ( u n | xxx n , zzz n ) = ν + pν + α ( xxx n − µµµ k ) ⊤ ( xxx n − µµµ k ) = u nk . (9) Estimate E (ln u n | xxx n , zzz n ) To estimate E (ln u n | xxx n , zzz n ) ,we need to make use of the following lemma from[Liu and Rubin, 1995]. Lemma 1.

If a random variable R ∼ gamma( a, b ) , then E (ln R ) = φ ( a ) − ln b , where φ ( a ) = { ∂ Γ( a ) /∂a } .Applying Lemma 1 to Eq. (8), it is obvious that we obtain E (ln u n | xxx n , zzz n ) = ln u nk + φ (cid:18) ν + p (cid:19) − ln (cid:18) ν + p (cid:19) . M-step

Given the result of E-step, we can decompose Q (ΨΨΨ ⋆ | ΨΨΨ) to Q (ΨΨΨ ⋆ | ΨΨΨ) = Q ( ν ⋆ | ΨΨΨ) + Q ( µµµ ⋆ , α ⋆ | ΨΨΨ) , where Q ( ν ⋆ | ΨΨΨ) = K X k =1 N X n =1 τ nk (cid:26) − ln Γ (cid:18) ν ⋆ (cid:19) + 12 ν ⋆ ln (cid:18) ν ⋆ (cid:19) + 12 ν ⋆ " N X n =1 (ln u nk − u nk )+ φ (cid:18) ν + p (cid:19) − ln (cid:18) ν + p (cid:19)(cid:21)(cid:27) , (10) Q ( µµµ ⋆ , α ⋆ | ΨΨΨ) = K X k =1 N X n =1 τ nk (cid:26) − p ln(2 π ) −

12 ln αu nk − u nk α ( xxx n − µµµ ⋆k ) ⊤ ( xxx n − µµµ ⋆k ) o . (11) Estimate µµµ ⋆k µµµ ⋆k is obtained by solving the equation ∂Q ( µµµ ⋆ , α ⋆ | ΨΨΨ) ∂µµµ ⋆k = 0 = ⇒ µµµ ⋆k = P Nn =1 τ nk u nk xxx n P Nn =1 τ nk u nk . (12) Estimate α ⋆ With the same technique for estimating µµµ ⋆k ,we solve ∂Q ( µµµ ⋆ , α ⋆ | ΨΨΨ) ∂α ⋆ = 0 , and obtain α ⋆ = P Kk =1 P Nn =1 τ nk u nk ( xxx n − µµµ ⋆k ) ⊤ ( xxx n − µµµ ⋆k ) p P Kk =1 P Nn =1 τ nk . (13) Estimate ν ⋆ The estimation of ν ⋆ is the solution of theequation − φ (cid:18) ν ⋆ (cid:19) + ln (cid:18) ν ⋆ (cid:19) + 1 + 1 K K X k =1 P Nn =1 τ nkN X n =1 τ nk (ln u nk − u nk ) + φ (cid:18) ν + p (cid:19) − ln (cid:18) ν + p (cid:19) = 0 . (14)We apply the following lemma in the workdone by Abramowitz and Milton, et al. [Abramowitz and Stegun, 1964], to solve Eq. (14). Lemma 2. φ ( s ) ≈ ln s − P ∞ i =1 B i is i , where B i is theBernoulli numbers of the second kind and B = .From Lemma 2, we have − φ (cid:18) ν ⋆ (cid:19) + ln (cid:18) ν ⋆ (cid:19) ≈ ν ⋆ + ǫ ( ν ⋆ ) , where ǫ ( ν ⋆ ) = P ∞ i =2 B i i ( ν ⋆ ) i is the error term. Denoting theconstant term in Eq. (14) as η , it is not hard to show ν ⋆ + ǫ ( ν ⋆ ) + η ≈ ⇒ ν ⋆ ≈ − η − ǫ ( ν ⋆ ) . (15) * ( * ) Figure 1: Graph of ǫ ( ν ⋆ ) We illustrate the functional plot of ǫ ( ν ⋆ ) in Figure 1, look-ing at which, when ν ≥ , ǫ ( ν ⋆ ) approximates to . There-fore, we can update ν ⋆ using − η . A Fast Version of t - k -means In TMM, if ν is unknown, the EM algorithm convergesslowly [Liu and Rubin, 1995]. Therefore, following JarnoVanhatalo and Pasi Jyl¨anki, et al. [Vanhatalo et al. , 2009],we ﬁx ν as a constant. For further reducing the dimension-ality of parameters, we apply α → referring to Bishop[Bishop, 2006]. With ﬁxed ν and α → , we have a fastversion of t - k -means, and we coin it fast t - k -means. distance between x n and n l o ss v a l ue L1-loss (k-medians)L2-loss (k-means)log L2-loss (t-k-means)

Figure 2: Graph of the loss functions of k -means, k -medians and t - k -means ( αν = 1 ). t - k -means and k -means If ν → ∞ , then TMM degenerates to GMM. According toSection 2.2, k -means is a special case of GMM with all com-ponents sharing the same mixing coefﬁcient and covariancematrix, meanwhile t - k -means is a special case of TMM withthe same condition. Therefore, it is not hard to obtain that t - k -means is a robust generalization of k -means, i.e. , t - k -means → k -means when ν → ∞ . In this section, we will prove that t - k -means is more robustthan k -means from the perspective of loss function and inﬂu-ence function [Koh and Liang, 2017], and explain why t - k -means is more stable than k -means. Loss Function Perspective

The log likelihood of t - k -means is given by ln L (ΨΨΨ | xxx ) = ln N Y n =1 K Y k =1 [ t k ( xxx n | ν, µµµ k , αIII )] z nk . (16)Given Eq. (16),we can rewrite the loss function of t - k -meansas J t - k -means = − N X n =1 K X k =1 τ nk ln Γ (cid:0) ν + p (cid:1) Γ (cid:0) ν (cid:1) ν p π p α · (cid:20) να ( xxx n − µµµ k ) ⊤ ( xxx n − µµµ k ) (cid:21) − ν + p . Focusing on the term related to data xxx , we have J t - k -means ( xxx, µµµ ) ∝ N X n =1 K X k =1 τ nk ln (cid:18) να ( xxx n − µµµ k ) ⊤ ( xxx n − µµµ k ) (cid:19) (17)Considering Eq. (17) and Eq. (3), we learn that J t - k -means is a log L2-loss function of xxx n while the loss function of k -means Eq. (3) is a L2-loss norm. Besides, from the work in[Arora et al. , 1998], it is known that the loss of k -medians is a L1-loss norm. On the other hand, an outlier is often distantfrom the component mean µµµ k . Thus, we plot the relationshipbetween the loss values and the data-to-centre distance in Fig-ure 2. The ﬁgure illustrates that log L2-loss is the least sensi-tive to the distance between xxx n and µµµ n . That is, in this regard, t - k -means is more robust than k -means and k -medians as itsobjective function is far less sensitive to the outliers than theother two. Inﬂuence Function Perspective

The inﬂuence function, a measure of the inﬂuence from up-weighting a training sample xxx i on the estimation of modelparameters [Koh and Liang, 2017], is adopted to compare therobustness of t - k -means and k -means in this section. Theinﬂuence of upweighting the training sample xxx i on the pa-rameter ΨΨΨ is given by I up,params ( xxx i ) def = − HHH − ∇ ΨΨΨ L ( xxx i , ΨΨΨ) , From Eq. (3), we can obtain the inﬂuence function of k -means for parameter µµµ ⋆k , i.e. , I up,params, k -means ( xxx i ) = r ik ( xxx i − µµµ ⋆k ) . Now we consider the inﬂuence function of t - k -means: I up,params, t - k -means ( xxx i ) = τ ik u ik N P Nn =1 u nk τ nk ( xxx i − µµµ ⋆k ) . It is clear that the difference between the inﬂuence of k -means and that of t - k -means lies on the coefﬁcient. We de-note these coefﬁcients as follows: C up,params, k -means = r ik , (18) C up,params, t - k -means = u ik τ ik N P Nn =1 u nk τ nk . (19)Let us denote ( xxx − yyy ) ⊤ ( xxx − yyy ) = dis ( xxx, yyy ) , from Eq.(19), Eq. (9) and Eq. (6), it is not hard to prove that u nk τ nk and C up,params, t - k -means are the strictly decreasing func-tions of dis ( xxx, µµµ k ) . Since the outliers are farther from thecomponent mean µµµ k than clean samples (assume the outliersare in the k -th component), C up,params, t - k -means of outliers aresmaller than that of clean samples.Assuming that a sample xxx i is an outlier and lies in the k -th component, we know that C up,params, k -means = 1 . In contrast, since the outlier xxx i is fartherfrom the component mean µµµ k than clean samples and u ik τ ik is a strictly decreasing function of the distance between xxx i and µµµ k , u ik τ ik is smaller than u nk τ nk where x n is a cleansample, i.e. , C up,params, t - k -means = u ik τ ik N P Nn =1 u nk τ nk < , whichimplies I up,params, t - k -means ( xxx i ) < I up,params, k -means ( xxx i ) . There-fore, t - k -means is more robust to outliers than k -means fromthe view of inﬂuence function. The randomness of the k -means and t - k -means methods ismainly involved in the selection of the initial clustering cen-ter. Once the initial clustering center is given, the clusteringresults of the two methods are also ﬁxed. In this paper, we adopt the deﬁnition of outliers in[Tukey, 1977], i.e. , [ Q − Q − Q ) , Q + 2( Q − Q )] , if Q and Q are the lower and upper quartiles respectively. n k -means method, the update of clustering center is basedonly on the information of the sample in its cluster. However,according to equation (9) and (14), during the iteration, theupdate of the clustering center in t - k -means is determinedby the information of all samples. In other words, no mat-ter which sample is used as the initial clustering center, thefurther update of cluster centers still depend on all samples.This use of such global information signiﬁcantly reduces theinﬂuence of the randomized clustering center on t - k -means,therefore it enjoys stronger stability. The information of the datasets is shown in Table 2. Thesynthetic datasets are from [Pasi Franti, 2015] and thereal-world datasets are from UCI datasets [Lichman, 2013].In the experiments, the hyper-parameter K is given bythe selected datasets and the hyper-parameter ν in fast t - k -means is set as . The baselines include k -means[Lloyd, 1982] , k -means++ [Arthur and Vassilvitskii, 2007], k -medoids [Kaufman and Rousseeuw, 1987], k -medians[Arora et al. , 1998], GMM [Mclachlan and Basford, 1988]and TMM [Liu and Rubin, 1995].In order to evaluate the performance of the model, Ad-justed Rand Index (ARI) [Hubert and Arabie, 1985] is em-ployed for data with label, clustering mean squared er-ror (MSE) [Tan and others, 2006], and W/B (W: within-cluster sum of squares; B: between-cluster sum of squares)[Kriegel et al. , 2017] is used for unlabelled data. Besides, theexperiment is repeated 100 times to reduce the effect of ran-domness. Among all methods, the one with the best perfor-mance is indicated in boldface and the value with underlinedenotes the second best. In addition, to make it fair, all ofthe evaluated methods are implemented with MATLAB and conducted on Intel(R) Core(TM) i7-7500U CPU running at2.7GHz with 16 GB of RAM. Table 2: Dataset description.D

ATA SET I NSTANCES F EATURES C LUSTERS

A1 3000 2 20A2 5250 2 35A3 7500 2 50S1 5000 2 15S2 5000 2 15S3 5000 2 15S4 5000 2 15U

NBALANCE

DIM

32 1024 32 16

DIM

64 1024 64 16I

RIS

150 4 3B

EZDEKIRIS

150 4 3S

EED

210 7 3W

INE

178 14 3

In this part, we conduct the experiments on the syntheticdatasets, including S1 S4, A1 A3, Unbalance, dim32 anddim64 [Pasi Franti, 2015].As illustrated in Table 1, GMM and TMM have relativelypoor performance, since the mixture models demand heavyparameter esimation and are sensitive to the parameter ini-tialization. With randomly initiated parameters, the proposed t - k -means and fast t - k -means outperform all k -means classmethods, GMM and TMM in all datasets. Besides, a newmethod, the fast t - k -means++, obtained when fast t - k -meansis initialized with k -means++ instead of random initializa-tion reaches the best performance in all 10 synthetic datasets. Table 1: ARI of the clustering results on synthetic datasets.A1 A2 A3 S1 S2 k -means 0.804 ± ± ± ± ± k -means++ 0.856 ± ± ± ± ± k -medoids 0.775 ± ± ± ± ± k -medians 0.760 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± t - k -means 0.851 ± ± ± ± ± t - k -means 0.922 ± ± ± ± ± t - k -means++ ± ± ± ± ± S3 S4 Unbalance dim32 dim64 k -means 0.639 ± ± ± ± ± k -means++ 0.671 ± ± ± ± ± k -medoids 0.649 ± ± ± ± ± k -medians 0.650 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± t - k -means 0.699 ± ± ± ± ± t - k -means 0.718 ± ± ± ± ± t - k -means++ ± ± ± ± ± able 3: MSE and W/B of the clustering results on real datasets.metrics methods Bezdekiris Iris Seed WineMSE k -means 0.201 ± ± ± ± k -means++ 0.198 ± ± ± ± k -medoids 0.205 ± ± ± ± k -medians 0.215 ± ± ± ± ± ± ± ± ± ± ± ± t - k -means ± ± ± ± t - k -means 0.187 ± ± ± ± fast t - k -means++ 0.187 ± ± ± ± W/B k -means 0.225 ± ± ± ± k -means++ 0.222 ± ± ± ± k -medoids 0.242 ± ± ± ± k -medians 0.254 ± ± ± ± ± ± ± ± ± ± ± ± t - k -means ± ± ± ± t - k -means 0.216 ± ± ± ± t - k -means++ 0.216 ± ± ± ± In addition, the t - k -means class method has a smaller stan-dard deviation than the k -means class method on all data sets,which empirically demonstrates the stability of t - k -means. The methods are also evaluated on real-world datasets withlabels, including Iris and Bezdekiris. The experiment leadto the same conclusion that the family of t - k -means achievethe best performance on all datasets and with best stability.However, the sample sizes of real-world datasets are so smallthat the gap between t - k -means and other methods cannot beopened. Table 4: ARI of the clustering results on real datasets.methods Iris Bezdekirisk-means 0.665 ± ± ± ± ± ± ± ± ± ± ± ± ± ± fast t-k-means 0.697 ± ± ± ± We evaluate our methods on real-world datasets withoutlabels (Bezdekiris, Iris, Seeds and Wine) in this section. ForIris and Bezdekiris, the labels are ignored.For the real-world data, with regard to two measures, thebest performer is the family of t - k -means except the W/B forSeed and Wine. Even when our methods could not performthe best (in regard to the certain measure), they are very close to the best performers. In addition, within all measure-datapairs, there is always at least one member in the t - k -meansfamily that performs the best (most probably) or the secondbest. The stability of t - k -means is also veriﬁed here again. As shown in Table 5, t - k -means reduces the total runtime sig-niﬁcantly compared with TMM. Notably, the speed of fast t - k -means and fast t - k -means(++) is on the same order ofmagnitude as the speed of k -means. Table 5: Time cost on Iris datasetMethods Iteration Total Time (sec)k-means 9.58 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± This paper depicts a novel TMM-based k -means variant, t - k -means, and its fast version, in order to improve the robustnessand stability of the conventional k -means method. We presentthe full mathematical derivations for t - k -means, and compareits robustness and stability with k -means with respect to theloss function, inﬂuence function and clustering center expres-sion. Additionally, a large number of experiments empiricallydemonstrate that our method has empirical soundness whilepreserving running efﬁciency. eferences [Abramowitz and Stegun, 1964] Milton Abramowitz andIrene A Stegun. Handbook of mathematical func-tions: with formulas, graphs, and mathematical tables ,volume 55. Courier Corporation, 1964.[Arora et al. , 1998] Sanjeev Arora, Prabhakar Raghavan,and Satish Rao. Approximation schemes for euclideank-medians and related problems. In

Proceedings of thethirtieth annual ACM symposium on Theory of computing ,pages 106–113. ACM, 1998.[Arthur and Vassilvitskii, 2007] David Arthur and SergeiVassilvitskii. k-means++:the advantages of careful seed-ing. In

Eighteenth Acm-Siam Symposium on DiscreteAlgorithms, New Orleans, Louisiana , pages 1027–1035,2007.[Banerjee et al. , 2005] Arindam Banerjee, Srujana Merugu,Inderjit S Dhillon, and Joydeep Ghosh. Clustering withbregman divergences.

Journal of machine learning re-search , 6(Oct):1705–1749, 2005.[Bishop, 2006] Christopher M. Bishop.

Pattern Recognitionand Machine Learning . Springer-Verlag New York, Inc.,2006.[Coates et al. , 2011] Adam Coates, Blake Carpenter, CarlCase, Sanjeev Satheesh, Bipin Suresh, Tao Wang, David JWu, and Andrew Y Ng. Text detection and characterrecognition in scene images with unsupervised featurelearning. In

ICDAR , pages 440–445. IEEE, 2011.[Coelho and Murphy, 2009] Luıs Pedro Coelho and Robert FMurphy. Unsupervised unmixing of subcellular locationpatterns. In

Proceedings of ICML-UAI-COLT 2009 Work-shop on Automated Interpretation and Modeling of CellImages , 2009.[Gopalan, 2016] Raghuraman Gopalan. Bridging heteroge-neous domains with parallel transport for vision and mul-timedia applications. In

UAI , 2016.[Hubert and Arabie, 1985] Lawrence Hubert and PhippsArabie. Comparing partitions.

Journal of Classiﬁcation ,2(1):193–218, 1985.[Jain, 2010] Anil K Jain. Data clustering: 50 years beyond k-means.

Pattern recognition letters , 31(8):651–666, 2010.[Kaufman and Rousseeuw, 1987] Leonard Kaufman and Pe-ter Rousseeuw.

Clustering by means of medoids . North-Holland, 1987.[Koh and Liang, 2017] Pang Wei Koh and Percy Liang. Un-derstanding black-box predictions via inﬂuence functions. arXiv preprint arXiv:1703.04730 , 2017.[Kriegel et al. , 2017] Hans-Peter Kriegel, Erich Schubert,and Arthur Zimek. The art of runtime evaluation: Are wecomparing algorithms or implementations?

Knowledgeand Information Systems , 52(2):341–378, Aug 2017.[Lichman, 2013] M. Lichman. UCI machine learning repos-itory, 2013. [Linde et al. , 1980] Yoseph Linde, Andres Buzo, and RobertGray. An algorithm for vector quantizer design.

IEEETransactions on communications , 28(1):84–95, 1980.[Liu and Rubin, 1995] Chuanhai Liu and Donald B Rubin.Ml estimation of the t distribution using em and its exten-sions, ecm and ecme.

Statistica Sinica , pages 19–39, 1995.[Lloyd, 1982] Stuart Lloyd. Least squares quantizationin pcm.

IEEE transactions on information theory ,28(2):129–137, 1982.[Mao and Jain, 1996] Jianchang Mao and Anil K Jain. Aself-organizing network for hyperellipsoidal clustering(hec).

Ieee transactions on neural networks , 7(1):16–29,1996.[Mclachlan and Basford, 1988] Geoffrey J Mclachlan andKaye E Basford. Mixture models. inference and applica-tions to clustering.

Applied Statistics , 38(2), 1988.[Mitchell and others, 1997] Tom M Mitchell et al. Machinelearning. wcb, 1997.[Pasi Franti, 2015] et al Pasi Franti. Clustering basic bench-mark, 2015.[Peel and Mclachlan, 2000] David Peel and Geoffrey JMclachlan. Robust mixture modeling using the t distri-bution.

Statistics and Computing , 10(4):339–348, 2000.[Tan and others, 2006] Pang-Ning Tan et al.

Introduction todata mining . Pearson Education India, 2006.[Tsironis et al. , 2013] Serafeim Tsironis, Mauro Sozio,Michalis Vazirgiannis, and LE Poltechnique. Accuratespectral clustering for community detection in mapreduce.In

NeurIPS Workshops . Citeseer, 2013.[Tukey, 1977] John W Tukey.

Exploratory data analysis ,volume 2. Reading, Mass., 1977.[Vanhatalo et al. , 2009] Jarno Vanhatalo, Pasi Jyl¨anki, andAki Vehtari. Gaussian process regression with student-tlikelihood. In Y. Bengio, D. Schuurmans, J. D. Lafferty,C. K. I. Williams, and A. Culotta, editors,

NeurIPS , pages1910–1918. Curran Associates, Inc., 2009.[Yu et al. , 2009] Shi Yu, BD Moor, and Yves Moreau. Clus-tering by heterogeneous data fusion: framework and ap-plications. In

NeurIPS workshop , 2009.[Zhang et al. , 2017] Cheng Zhang, Hedvig Kjellstr¨om, andStephan Mandt. Determinantal point processes for mini-batch diversiﬁcation. In