Too Much Information Kills Information: A Clustering Perspective
Yicheng Xu, Vincent Chau, Chenchen Wu, Yong Zhang, Vassilis Zissimopoulos, Yifei Zou
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, P.R. China. {yc.xu, vincentchau, zhangyong}@siat.ac.cn
Tianjin University of Technology, P.R. China. wu chenchen [email protected]
National and Kapodistrian University of Athens, Greece. [email protected]
The University of Hong Kong, P.R. China. [email protected]
Abstract—Clustering is one of the most fundamental tools in artificial intelligence, particularly in pattern recognition and learning theory. In this paper, we propose a simple but novel approach for variance-based k-clustering tasks, which include the widely known k-means clustering. The proposed approach picks a sampling subset from the given dataset and makes decisions based on the data information in this subset only. Under certain assumptions, the resulting clustering is provably a good estimate of the optimum of the variance-based objective with high probability. Extensive experiments on synthetic and real-world datasets show that, to obtain results competitive with the k-means method (Lloyd 1982) and the k-means++ method (Arthur and Vassilvitskii 2007), we only need 7% of the information in the dataset. If we have up to 15% of the information, then our algorithm outperforms both the k-means method and the k-means++ method in at least 80% of the clustering tasks, in terms of clustering quality. An extended algorithm based on the same idea additionally guarantees a balanced k-clustering result.

INTRODUCTION
Cluster analysis is a subarea of machine learning that studies methods for the unsupervised discovery of homogeneous subsets of data instances in heterogeneous datasets. Methods of cluster analysis have been successfully applied in a wide spectrum of areas including image processing, information retrieval, text mining and cybersecurity. Cluster analysis has a rich history in disciplines such as biology, psychology, archaeology, psychiatry, geology and geography, and there is an increasing interest in clustering methods in very active fields like natural language processing, recommender systems, and image and video processing. The importance and interdisciplinary nature of clustering is evident through its vast literature.

The goal of variance-based k-clustering is to find a k-sized partition of a given dataset that minimizes the sum of the within-cluster variances. The well-known k-means is a variance-based clustering which defines the within-cluster variance as the sum of squared distances from each data point to the mean of the cluster it belongs to. The folklore k-means method [15], also known as Lloyd's algorithm, is still one of the top ten popular data mining algorithms and is implemented as a standard clustering method in most machine learning libraries, according to [16]. To overcome its high sensitivity to proper initialization, [2] propose the k-means++ method, which augments the k-means method with a careful randomized seeding preprocessing. The k-means++ method is proved to be $O(\log k)$-competitive with the optimal clustering, and the analysis is tight. Even though it is easy to implement, k-means++ has to make a full pass through the dataset for every single pick of the seeds, which leads to a high complexity. [5] drastically reduce the number of passes needed to obtain, in parallel, a good initialization. The proposed k-means‖ obtains a nearly optimal solution after a logarithmic number of passes, and in practice a constant number of passes suffices. Following this path, there are several speed-ups and hybrid methods. For example, [4] replace the seeding method in k-means++ with a substantially faster approximation based on Markov chain Monte Carlo sampling. The proposed method retains the full theoretical guarantees of k-means++ while its computational complexity is only sublinear in the number of data points. A simple combination of k-means++ with a local search strategy achieves a constant approximation guarantee in expectation and is more competitive in practice [11]. Furthermore, the number of local search steps is dramatically reduced from $O(k \log\log k)$ to $\epsilon k$ while maintaining the constant performance guarantee [6].

A balanced clustering result is often required in a variety of applications. However, many existing clustering algorithms have good clustering performance yet fail to produce balanced clusters. Balanced clustering, which places size constraints on the resulting clusters, is at least APX-hard in general under the assumption P ≠ NP [3]. It attracts research interest from both approximation and heuristic perspectives.
Heuristically, [14] apply the method of augmented Lagrange multipliers to minimize a least-squares linear regression in order to regularize the clustering model. The proposed approach not only produces good clustering performance but also guarantees a balanced clustering result. To achieve more accurate clustering for large-scale datasets, the exclusive lasso on k-means and min-cut are leveraged to regulate the balance degree of the clustering results. By optimizing objective functions built atop the exclusive lasso, one can make the clustering result as balanced as possible [12]. Recently, [13] introduce a balance regularization term in the k-means objective; by replacing the assignment step of the k-means method with a simplex algorithm, they give a fast algorithm for soft-balanced clustering, and the hard-balanced requirement can be satisfied by enlarging the multiplier in the regularization term. There are also algorithmic results for balanced k-clustering tasks with proven performance guarantees. The first constant-factor approximation algorithm for variance-based hard-balanced clustering is a $(69 + \epsilon)$-approximation in FPT time [17]. The approximation ratio was subsequently improved by [8] and then by [7], with the same asymptotic running time.

Our contributions
In this paper, we propose a simple but novel algorithm based on random sampling that computes provably good k-clustering results for variance-based clustering tasks. An extended version based on the same idea is valid for balanced k-clustering tasks with hard size constraints. We make cross comparisons between the proposed Random Sampling method, the k-means method and the k-means++ method on both synthetic and real-world datasets. The numerical results show that our method is competitive with the k-means method and the k-means++ method with a sampling size of only 7% of the dataset. When the sampling size reaches 15% or higher, the Random Sampling method outperforms both the k-means method and the k-means++ method in at least 80% of the rounds of the clustering tasks.

The remainder of the paper is organized as follows. In the Warm-up section, we provide some preliminaries towards a better understanding of the proposed algorithm. In the Random Sampling section, we present the main algorithm and its analysis. After that, we report the performance of the proposed algorithm on different datasets in the Numerical Results section. We then extend the proposed algorithm to deal with balanced clustering tasks in the Extension section. In the last section, we discuss the advantages as well as the disadvantages of the proposed algorithm, and some promising areas where our algorithm has the potential to outperform existing clustering methods.

WARM-UP

Variance-Based k-Clustering

Roughly speaking, clustering tasks seek an organization of a collection of patterns into clusters based on similarity, such that patterns within a cluster are very similar while patterns from different clusters are highly dissimilar. One way to measure the similarity is the so-called variance-based objective function, which leverages the squared distances between patterns and the centroid of the cluster they belong to.

A well-known variance-based clustering task is the k-means clustering, a method of vector quantization that originally comes from signal processing and aims to partition n real vectors (quantified from colors) into k clusters so as to minimize the within-cluster variances. What makes the k-means clustering different from other variance-based k-clusterings is the way it measures similarity: k-means defines the similarity between vectors as the squared Euclidean distance between them. For simplicity, we mainly take k-means as an example in the later discussion, but most of the results carry over to general variance-based k-clustering tasks.

The k-means clustering can be formally described as follows. Given are a dataset $X = \{x_1, x_2, \dots, x_n\}$ and an integer $k$, where each data point in $X$ is a $d$-dimensional real vector. The objective is to partition $X$ into $k$ ($\le n$) disjoint subsets so as to minimize the total within-cluster sum of squared distances (or variances). For a fixed finite data set $A \subseteq \mathbb{R}^d$, the centroid (also known as the means) of $A$ is denoted by $c(A) := \sum_{x \in A} x / |A|$. Therefore, the objective of the k-means clustering is to find a partition $\{X_1, X_2, \dots, X_k\}$ of $X$ such that the following is minimized:
$$\sum_{i=1}^{k} \sum_{x \in X_i} \|x - c(X_i)\|^2,$$
where $\|a - b\|$ denotes the Euclidean distance between vectors $a$ and $b$.
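To make the objective concrete, here is a minimal NumPy sketch (the function name and the label-vector encoding are our own illustrative choices, not from the paper) that evaluates the variance-based cost of a partition of $X$ encoded as one integer label per data point:

```python
import numpy as np

def variance_cost(X, labels, k):
    """Variance-based k-clustering objective: the sum over clusters of
    squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for i in range(k):
        cluster = X[labels == i]
        if len(cluster) == 0:
            continue                      # an empty cluster contributes nothing
        centroid = cluster.mean(axis=0)   # c(X_i) = sum_{x in X_i} x / |X_i|
        total += ((cluster - centroid) ** 2).sum()
    return total
```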
Also, we will extend our result to a general scenario of balanced clustering, where capacity constraints must be satisfied. For balanced k-clustering, the only difference is an additional global constraint on the sizes of the clusters. Both lower-bound and upper-bound constraints are considered in this paper. Based on the above, the balanced k-means can be described as finding a partition $\{X_i\}_{1 \le i \le k}$ of $X$ so as to minimize the aforementioned k-means objective subject to
$$l \le |X_i| \le u, \quad \text{for all } 1 \le i \le k.$$
Obviously, by taking appropriate values for $l$ and $u$, we recover the k-means clustering. Thus, it is at least as difficult to obtain an optimal balanced k-means clustering.

Voronoi Diagram and Centroid Lemma
Solving the optimal k-means clustering for an arbitrary dataset is NP-hard. However, Lloyd proposes a fast local-search-based heuristic for k-means clustering, also known as the k-means method. A survey of data mining techniques states that it is by far the most popular clustering algorithm used in scientific and industrial applications. The k-means method is carried out through iterative Voronoi Diagram construction, combined with centroid adjustment according to the Centroid Lemma.

A Voronoi Diagram is a partition of a space into regions close to each of a given set of centers. Formally, given centers $C = \{c_1, c_2, \dots, c_k\}$ in $\mathbb{R}^d$, the Voronoi Diagram w.r.t. (with respect to) $C$ consists of the following Voronoi cells, defined for $i = 1, 2, \dots, k$ as
$$\mathrm{Cell}(i) = \{x \in \mathbb{R}^d : d(x, c_i) \le d(x, c_j) \text{ for all } j \ne i\}.$$
See Figure 1 for examples of Voronoi Diagrams in the plane. Obviously, any Voronoi Diagram $\Pi$ of $\mathbb{R}^d$ gives a feasible partition for any set $X \subseteq \mathbb{R}^d$ (ties broken arbitrarily), which is called the Voronoi Partition of $X$ w.r.t. $\Pi$. More precisely, the Voronoi Partition of $X$ is given by $\{X_i\}_{1 \le i \le k}$, where $X_i = X \cap \mathrm{Cell}(i)$.

[Fig. 1. Examples of Voronoi diagrams in the plane]

On the other hand, given $X \subseteq \mathbb{R}^d$, it holds for any $v \in \mathbb{R}^d$ that
$$\sum_{x \in X} \|x - v\|^2 = \sum_{x \in X} \|x - c(X)\|^2 + |X| \cdot \|c(X) - v\|^2,$$
which is the so-called Centroid Lemma. An example of an application of the Centroid Lemma can be found in [10]. Note that the Centroid Lemma implies that the centroid/means of a cluster is the minimizer of the within-cluster variance.
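The Centroid Lemma is easy to verify numerically; the following short sketch (the data and names are ours, for illustration) checks the identity on random points:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # a finite point set in R^3
v = rng.normal(size=3)          # an arbitrary vector v
c = X.mean(axis=0)              # the centroid c(X)

lhs = ((X - v) ** 2).sum()
rhs = ((X - c) ** 2).sum() + len(X) * ((c - v) ** 2).sum()
assert np.isclose(lhs, rhs)     # Centroid Lemma: both sides agree
```

Since the correction term $|X| \cdot \|c(X) - v\|^2$ is strictly positive for any $v \ne c(X)$, the centroid is indeed the unique minimizer of the within-cluster variance.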
RANDOM SAMPLING

Given a dataset $X$, we say $S \subseteq X$ is a random sampling of $X$ if $S$ is obtained by several independent draws from $X$ uniformly at random. We show that the objective value of a variance-based k-clustering of $X$ can be reasonably estimated using $S$. Before that, we recall two basic facts about expectation and variance from probability theory. Given independent random variables $V_1$ and $V_2$, we have the following.

Fact 1: $E(aV_1 + bV_2) = aE(V_1) + bE(V_2)$.

Fact 2: $\mathrm{var}(aV_1 + bV_2) = a^2\,\mathrm{var}(V_1) + b^2\,\mathrm{var}(V_2)$.

Throughout, for a finite set $X$ we write $\mathrm{var}(X) := \frac{1}{|X|}\sum_{x \in X} \|x - c(X)\|^2$, the variance of a single uniform draw from $X$. Suppose $S$ is an $m$-draw random sampling of $X$. Then $c(S)$ is an unbiased estimator of $c(X)$, and the squared Euclidean distance between them can be estimated by the following lemma.

Lemma 1: $E(c(S)) = c(X)$ and $E(\|c(S) - c(X)\|^2) = \frac{1}{m}\mathrm{var}(X)$.

Proof:
Assume w.l.o.g. that $S = \{V_1, V_2, \dots, V_m\}$ and recall that the $V_i$ are independent random variables. Based on Fact 1, it holds that
$$E(c(S)) = E\Big(\frac{1}{m}\sum_{i=1}^{m} V_i\Big) = \frac{1}{m}\sum_{i=1}^{m} E(V_i) = \frac{1}{m}\sum_{i=1}^{m} c(X) = c(X).$$
Then
$$E(\|c(S) - c(X)\|^2) = E(\|c(S) - E(c(S))\|^2) = \mathrm{var}(c(S)) = \mathrm{var}\Big(\frac{1}{m}\sum_{i=1}^{m} V_i\Big) = \frac{1}{m^2}\sum_{i=1}^{m} \mathrm{var}(V_i) = \frac{1}{m}\mathrm{var}(X),$$
where the second-to-last equality is derived from Fact 2. ∎

Based on the above, we conclude that $c(S)$ is indeed a good estimate of $c(X)$. A natural idea follows: $\sum_{x \in X} \|x - c(S)\|^2$ is probably a good estimate of $\sum_{x \in X} \|x - c(X)\|^2$, as stated in the following lemma.
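Lemma 1 can be checked empirically. The following sketch (the sample sizes and names are our own choices) estimates $E(\|c(S) - c(X)\|^2)$ by Monte Carlo and compares it against $\mathrm{var}(X)/m$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
c_X = X.mean(axis=0)
var_X = ((X - c_X) ** 2).sum() / len(X)      # var(X): mean squared distance to c(X)

m, trials = 20, 20000
errs = np.empty(trials)
for t in range(trials):
    S = X[rng.integers(0, len(X), size=m)]   # m independent uniform draws from X
    errs[t] = ((S.mean(axis=0) - c_X) ** 2).sum()

print(errs.mean(), var_X / m)                # the two values nearly agree
```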
Lemma 2: With probability at least $1 - \delta$,
$$\sum_{x \in X} \|x - c(S)\|^2 \le \Big(1 + \frac{1}{m\delta}\Big) \sum_{x \in X} \|x - c(X)\|^2.$$

Proof:
From Lemma 1, $E(\|c(S) - c(X)\|^2) = \frac{1}{m}\mathrm{var}(X) = \frac{1}{m|X|}\sum_{x \in X}\|x - c(X)\|^2$, so by the Markov inequality, with probability at least $1 - \delta$,
$$\|c(S) - c(X)\|^2 \le \frac{1}{m\delta|X|} \sum_{x \in X} \|x - c(X)\|^2.$$
Recalling the Centroid Lemma, we immediately have, with probability at least $1 - \delta$, that
$$\sum_{x \in X} \|x - c(S)\|^2 = \sum_{x \in X} \|x - c(X)\|^2 + |X| \cdot \|c(S) - c(X)\|^2 \le \Big(1 + \frac{1}{m\delta}\Big) \sum_{x \in X} \|x - c(X)\|^2,$$
completing the proof. ∎

Consider the following randomized algorithm for the k-clustering task based on the random sampling idea, which we simply call Random Sampling. Given the sampling set $S$, we construct every k-clustering of $S$ by brute-force search. Note that there are $O(m^{dk})$ many relevant possibilities due to [9], but we are allowed to do this because $S$ is much smaller than $X$. For each k-clustering of $S$, we divide the space $\mathbb{R}^d$ into $k$ Voronoi cells according to the centroids of the $k$ clusters of $S$. Subsequently, we obtain a feasible k-clustering of $X$ simply by grouping the data points in the same Voronoi cell together. Then we choose the best one among these candidate results. The Random Sampling algorithm is provided as Algorithm 1.

Algorithm 1: Random Sampling for k-clustering tasks
Input: Dataset $X$, integer $k$;
Output: k-clustering of $X$.
1. Sample a subset $S$ by $m$ ($\ge k$) independent draws from $X$ uniformly at random;
2. for every k-clustering $\{S_i\}_{1 \le i \le k}$ of $S$ do
3.   Compute the centroid set $C = \{c(S_i)\}_{1 \le i \le k}$;
4.   Obtain $\{X_i\}_{1 \le i \le k}$, the Voronoi Partition of $X$ w.r.t. the Voronoi Diagram generated by $C$;
5.   Compute the value $\sum_{i=1}^{k} \sum_{x \in X_i} \|x - c(X_i)\|^2$;
6. return the $\{X_i\}_{1 \le i \le k}$ with the minimum value.
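The following Python sketch follows Algorithm 1 under one simplification that we flag explicitly: instead of enumerating the $O(m^{dk})$ Voronoi partitions of $S$ as in [9], it brute-forces all $k^m$ label assignments of the sample, which is only feasible for very small $m$ but keeps the sketch short:

```python
import itertools
import numpy as np

def random_sampling_clustering(X, k, m, seed=0):
    """Sketch of Algorithm 1: sample m points, try every k-clustering of the
    sample, Voronoi-partition X by the sample centroids, keep the best."""
    rng = np.random.default_rng(seed)
    S = X[rng.integers(0, len(X), size=m)]        # step 1: m uniform draws from X
    best_cost, best_labels = np.inf, None
    for assign in itertools.product(range(k), repeat=m):  # step 2 (simplified)
        assign = np.asarray(assign)
        if len(np.unique(assign)) < k:
            continue                              # need k non-empty sample clusters
        C = np.stack([S[assign == i].mean(axis=0) for i in range(k)])  # step 3
        # step 4: Voronoi partition of X w.r.t. the centroids C
        labels = ((X[:, None, :] - C[None]) ** 2).sum(axis=2).argmin(axis=1)
        # step 5: variance-based cost of the induced partition of X
        cost = sum(((X[labels == i] - X[labels == i].mean(axis=0)) ** 2).sum()
                   for i in range(k) if (labels == i).any())
        if cost < best_cost:                      # step 6: keep the minimum
            best_cost, best_labels = cost, labels
    return best_labels, best_cost
```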
Next, we estimate the value of each of the $k$ clusters of $X$. Let $\{X'_i\}_{1 \le i \le k}$ be the output of the Random Sampling algorithm, from which we obtain the corresponding k-clustering $\{S'_i\}_{1 \le i \le k}$ of the random sampling subset $S'$: since the centroid of each cluster in $\{X'_i\}_{1 \le i \le k}$ defines a Voronoi cell of the space, we partition $S'$ into a k-clustering accordingly. Assume w.l.o.g. that $|S'_i| \le |S'_{i+1}|$ for $i = 1, 2, \dots, k-1$. Suppose $\{X^*_i\}_{1 \le i \le k}$ is the optimal solution, ordered such that $|X^*_i| \le |X^*_{i+1}|$ for $i = 1, 2, \dots, k-1$. Since $S'$ is obtained by $m$ independent draws from $X$, the size of each cluster in $\{S'_i\}_{1 \le i \le k}$ is determined by independent Bernoulli trials and depends on the distribution of $|X^*_i|$ over all $i$. Thus it must be that $E(|S'_i|) = \frac{m}{n} E(|X^*_i|)$. We denote the distribution function of $|X^*_i|$ by $p(i) := \frac{|X^*_i|}{n}$ for all $i \in \{1, \dots, k\}$. We call $X$ a $\mu$-balanced instance ($0 \le \mu \le 1$) if there exists an optimal k-clustering for $X$ such that all clusters have size at least $\mu|X|$; equivalently, since the clusters are ordered by size, $X$ is $\mu$-balanced if $p(1) \ge \mu$. Recall that $X^*_1$ is the smallest cluster in $\{X^*_i\}_{1 \le i \le k}$. We obtain the following lemma.

Lemma 3: If $X$ is a $(\ln m/m)$-balanced instance, then for any small positive constant $\eta$, it holds with probability at least $1 - m^{-\eta^2/2}$ that $|S'_i| \ge (1 - \eta)\, m\, p(i)$ for all $i = 1, \dots, k$.

Proof:
It is obvious that $E(|S'_i|) = \frac{m}{n} E(|X^*_i|) = m\, p(i)$. We start the proof with $S'_1$, the smallest cluster in expectation. Consider $m$ rounds of the following Bernoulli trial:
$$\begin{cases} 1, & \text{with probability } p(1); \\ 0, & \text{with probability } 1 - p(1). \end{cases}$$
Let $B_1, B_2, \dots, B_m$ be the independent random variables of the $m$ trials and let $B = \sum_{i=1}^{m} B_i$. Obviously $E(B) = m\, p(1)$, and from the Chernoff bound we have
$$\Pr[B < (1 - \eta)\, m\, p(1)] < e^{-m p(1) \eta^2/2} \le e^{-(\ln m)\,\eta^2/2} = m^{-\eta^2/2}.$$
Thus, with probability at least $1 - m^{-\eta^2/2}$, it follows that $|S'_1| \ge (1 - \eta)\, m\, p(1)$. The same argument applies for $i = 2, \dots, k$, since $p(i) \ge \ln m/m$ holds for all $i$. This completes the proof. ∎

By combining Lemmas 2 and 3, we conclude the following estimate for the Random Sampling algorithm.

Theorem 1:
For any $(\ln m/m)$-balanced instance of a k-clustering task, Algorithm 1 returns a feasible solution that, with probability at least $1 - \delta - m^{-\eta^2/2}$, is within a factor of $1 + \frac{1}{(1-\eta)\delta \ln m}$ of the optimum.

Proof:
Considering the objective value of the output of Algorithm 1 and using the Centroid Lemma, we have
$$\sum_{i=1}^{k} \sum_{x \in X'_i} \|x - c(X'_i)\|^2 \le \sum_{i=1}^{k} \sum_{x \in X'_i} \|x - c(S'_i)\|^2.$$
From line 4 of Algorithm 1, we know that the partition $\{X'_i\}_{1 \le i \le k}$ is obtained from the Voronoi Diagram generated by $\{c(S'_i)\}_{1 \le i \le k}$. That is to say, for any $x \in X'_i$ and an arbitrary $j \ne i$, it must be the case that $\|x - c(S'_i)\| \le \|x - c(S'_j)\|$. Summing over all $x$, we obtain
$$\sum_{i=1}^{k} \sum_{x \in X'_i} \|x - c(S'_i)\|^2 \le \sum_{i=1}^{k} \sum_{x \in X^*_i} \|x - c(S'_i)\|^2.$$
The right-hand side corresponds to the assignment in which each $x$ is assigned to $c(S'_i)$ whenever $x \in X^*_i$ for some $i$: the cost of each $x \in X^*_i \cap X'_i$ is unchanged, while the cost of each $x \in X^*_i \cap X'_j$ for $j \ne i$ can only increase. Applying Lemma 2 to every cluster in $\{X^*_i\}_{1 \le i \le k}$, with probability at least $1 - \delta$ it holds that
$$\sum_{i=1}^{k} \sum_{x \in X^*_i} \|x - c(S'_i)\|^2 \le \sum_{i=1}^{k} \sum_{x \in X^*_i} \Big(1 + \frac{1}{\delta |S'_i|}\Big) \|x - c(X^*_i)\|^2.$$
Combining with Lemma 3, we obtain, with probability at least $(1 - \delta)(1 - m^{-\eta^2/2}) \approx 1 - \delta - m^{-\eta^2/2}$, that
$$\sum_{i=1}^{k} \sum_{x \in X^*_i} \|x - c(S'_i)\|^2 \le \sum_{i=1}^{k} \sum_{x \in X^*_i} \Big(1 + \frac{1}{(1-\eta)\delta\, m\, p(i)}\Big) \|x - c(X^*_i)\|^2 \le \Big(1 + \frac{1}{(1-\eta)\delta \ln m}\Big) \sum_{i=1}^{k} \sum_{x \in X^*_i} \|x - c(X^*_i)\|^2,$$
where the last inequality follows from the assumption that $X$ is a $(\ln m/m)$-balanced instance. This completes the proof. ∎

NUMERICAL RESULTS
In this section, we evaluate the performance of the proposed RS (short for the Random Sampling algorithm) mainly through cross comparisons with the widely known KM (the k-means method) and KM++ (the k-means++ method) on the same datasets. The environment for the experiments is an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz with 64GB memory. We construct extensive numerical experiments to analyze the impact of different parameter settings on the proposed algorithm. Since all algorithms are randomized, we run RS, KM and KM++ on 100 instances per setting and report the number of instances in which each algorithm hits the minimum objective value. We design the following experiments for different purposes.

1) Effect of n: We generate 100 instances for each $n \in \{100, 200, \dots, 1000\}$ with a standard normal distribution, after which we run RS, KM and KM++ simultaneously on the same instance and record which of the three algorithms hits the minimum objective value. We fix $m = n/10$ and $k = 3$ throughout these experiments; see Figure 2 for the numerical results.
[Fig. 2. Effect of the size of the dataset n: number of instances (out of 100) in which KM, KM++ and RS hit the minimum, for n from 100 to 1000]

RS does not perform as well as KM or KM++ at the beginning, because the sampling set is too small to represent the entire dataset. Taking $n = 100$ as an example, a 10-sized sampling set is probably not a good estimate for the original 100-sized dataset. However, when $n$ increases to 700, a 70-sized sampling set seems good enough for RS to be competitive with KM and KM++. As $n$ rises, RS performs increasingly better and tends to outperform both KM and KM++. Note that, fixing $n = 100$ for example, the total number of instances in which any of the three algorithms hits the minimum exceeds 100. This is because, for smaller instances, it is more likely that more than one algorithm hits the minimum, in which case we count each of them once in Figure 2.

2) Effect of k: We generate 100 instances with a standard normal distribution, after which we run RS, KM and KM++ simultaneously on the same instance for k-clustering tasks with several values of $k$, and record which of the three algorithms hits the minimum objective value. We fix $n = 100$ and $m = 50$ throughout these experiments; see Figure 3 for the numerical results. As shown, RS reaches its best performance in 2-clustering and its worst performance in 5-clustering. Overall, it is competitive with KM and KM++ under these settings.

[Fig. 3. Effect of the number of clusters k: number of instances in which KM, KM++ and RS hit the minimum]

3) Effect of m: We evaluate the performance of our algorithm on a real-world dataset. The Cloud dataset consists of 1024 points and represents the 1st cloud cover database available from the UC Irvine Machine Learning Repository. We run KM and KM++ on the Cloud dataset, along with RS for each sampling size $m \in \{25, 50, 75, \dots, 200\}$. Since there is only one instance here, we run 100 rounds of each algorithm per setting and report the number of rounds hitting the minimum objective value. Note that $n = 1024$, and we fix $k = 3$ throughout these experiments; see Figure 4 for the numerical results.
[Fig. 4. Effect of the size of the sampling set m: number of rounds (out of 100) in which KM, KM++ and RS hit the minimum on the Cloud dataset]

As predicted, RS performs increasingly better as the sampling size gets larger. But it is quite surprising that, at $m = 75$ (only about 7% of the Cloud dataset), RS already performs as well as KM++. When $m$ is larger than 100 (about 10% of the Cloud dataset), RS outperforms both KM and KM++. When $m$ reaches 150 (about 15% of the Cloud dataset) or higher, RS wins in at least 80% of the rounds of the clustering tasks.

EXTENSION TO BALANCED k-CLUSTERING
An additional important feature of the proposed Random Sampling algorithm is its extension to balanced variance-based k-clustering tasks, which the k-means method and the k-means++ method cannot handle. Both upper-bound and lower-bound constraints are considered, which means a feasible balanced k-clustering has a global lower bound $l$ and upper bound $u$ on the cluster sizes. We assume w.l.o.g. that $l$ and $u$ are positive integers. The main idea is a minimum-cost flow subroutine embedded into the Random Sampling algorithm.

To start, we introduce the well-known minimum-cost flow problem. Given a directed graph $G = (V, E)$, every edge $e \in E$ has a weight $c(e)$ representing its cost of sending a unit of flow. Also, every $e \in E$ is equipped with a bandwidth constraint: only flows of value at most $\mathrm{upper}(e)$ and at least $\mathrm{lower}(e)$ can pass through edge $e$, where $\mathrm{upper}(e)$ and $\mathrm{lower}(e)$ denote the upper and lower bounds on the bandwidth of $e$, respectively. Every node $v \in V$ has a demand $d(v)$, defined as the total outflow minus the total inflow. Thus a negative demand represents a need for flow and a positive one represents a supply.

A flow in $G$ is defined as a function $f(\cdot): E \to \mathbb{R}_+$. A feasible flow carrying $f$ amount of flow through the graph requires a source $s$ and a sink $t$ with $d(s) = f$ and $d(t) = -f$. Every node $v \in V \setminus \{s, t\}$ must have $d(v) = 0$, which means it is either an intermediate node or an idle node. The cost of a flow $f$ is defined as $c(f) = \sum_{e \in E} f(e) \cdot c(e)$. The minimum-cost flow problem is the optimization problem of finding a cheapest way (i.e., with minimum cost) of sending a certain amount of flow through the graph $G$.

To deal with the capacity constraints, we propose a Random Sampling based randomized algorithm with an embedded minimum-cost flow subroutine. Obviously, the Voronoi Diagram generated by the centroids of a k-clustering of the sampling set $S$ does not guarantee a feasible Voronoi Partition of $X$ satisfying the capacity constraints. Assume that we are given a k-clustering of $S$ and we look for a feasible balanced k-clustering of $X$.

Consider the following instance of the minimum-cost flow problem. Let $V$ be $X \cup C \cup \{s, t\}$, where $C$ consists of the centroids $\{c(S_i)\}_{1 \le i \le k}$ obtained from the given k-clustering of $S$, and $s$ and $t$ are dummy source and sink nodes, respectively. Let $E$ be $E_1 \cup E_2 \cup E_3$, where $E_1$ contains the directed edges $(s, i)$ from $s$ to each $i \in X$, $E_2$ contains the edges $(i, j)$ from each $i \in X$ to each $j \in C$, and $E_3$ contains the edges $(j, t)$ from each $j \in C$ to $t$. Every edge in $E_1 \cup E_2$ has bandwidth interval $[0, 1]$, while every edge in $E_3$ has $[l, u]$. Edges in $E_1 \cup E_3$ are unweighted, and edge $(i, j) \in E_2$ has weight $\|i - j\|^2$ for each $i \in X$ and $j \in C$. See Figure 5 for a description.

[Fig. 5. A minimum-cost flow instance]

As shown in the figure, the bandwidth intervals and the weights/costs are labeled on the edges. All edges are oriented from the source towards the sink, so we simply omit the direction labels. Inside the shadowed box is a complete bipartite graph, also known as a biclique, consisting of the vertices $X \cup C$ and the edges $E_2$.
Consider a flow $f$ carrying $n$ ($n = |X|$) units of flow from the source to the sink in $G$, and suppose that the function $f: E \to \mathbb{R}_+$ represents such a flow. Recall that $d(v) = \sum_{e \in \delta^+(v)} f(e) - \sum_{e \in \delta^-(v)} f(e)$, where $\delta^+(v)$ denotes the edges leading away from node $v$ and $\delta^-(v)$ denotes the edges leading into $v$. Then the following must hold.

• Flow conservation:
$$d(v) = \begin{cases} n, & v = s; \\ -n, & v = t; \\ 0, & \forall v \ne s, t. \end{cases}$$

• Bandwidth constraints:
$$0 \le f(e) \le 1, \ \forall e \in E_1; \qquad 0 \le f(e) \le 1, \ \forall e \in E_2; \qquad l \le f(e) \le u, \ \forall e \in E_3.$$

The minimum-cost flow problem then aims to find a function $f: E \to \mathbb{R}_+$ satisfying both the flow conservation and the bandwidth constraints so as to minimize its cost, i.e., $\sum_{e \in E} f(e) \cdot c(e)$. An important property of the minimum-cost flow problem is that basic feasible solutions are integer-valued if the capacity constraints and the quantity of flow produced at each node are integer-valued, as captured by the following lemma.

Lemma 4 ([1]): If the objective value of the minimum-cost flow problem is bounded from below on the feasible region, then the problem has an optimal solution; and if the capacity constraints and the quantity of flow are all integral, then the problem has at least one integral optimal solution.

The integral solution can be computed efficiently by Cycle Canceling algorithms, Successive Shortest Path algorithms, Out-of-Kilter algorithms or Linear Programming based algorithms. These algorithms can be found in many textbooks; see for example [1]. We take any one of these algorithms as the MCF (Minimum-Cost Flow) subroutine in our algorithm.
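For concreteness, here is one possible realization of the MCF subroutine in Python using networkx (our choice of library; the paper does not prescribe one). Lower bounds are removed by the standard transformation: the $l$ units forced through each centroid-to-sink edge are subtracted from its capacity and accounted for in the node demands. Since the network simplex method expects integral data, the squared-distance weights are scaled and rounded:

```python
import networkx as nx
import numpy as np

def mcf_assignment(X, C, l, u):
    """Assign each point in X to a centroid in C, minimizing total squared
    distance, with every centroid receiving between l and u points.
    Feasibility requires k*l <= n <= k*u."""
    n, k = len(X), len(C)
    G = nx.DiGraph()
    # networkx convention: negative demand = supply, positive demand = need.
    G.add_node("s", demand=-n)
    G.add_node("t", demand=n - k * l)  # k*l units are forced through the E3 edges
    SCALE = 10 ** 6                    # integral weights for the network simplex
    for i in range(n):
        G.add_edge("s", ("x", i), capacity=1, weight=0)           # E1: [0, 1]
        for j in range(k):
            w = int(SCALE * ((X[i] - C[j]) ** 2).sum())
            G.add_edge(("x", i), ("c", j), capacity=1, weight=w)  # E2: [0, 1]
    for j in range(k):
        # E3 edge (c_j, t) originally has interval [l, u]; after shifting the
        # lower bound away, c_j must absorb l units of net inflow itself.
        G.add_node(("c", j), demand=l)
        G.add_edge(("c", j), "t", capacity=u - l, weight=0)       # E3: [0, u - l]
    flow = nx.min_cost_flow(G)         # integral by Lemma 4
    return [next(j for (_, j), f in flow[("x", i)].items() if f == 1)
            for i in range(n)]
```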
We show the following theorem.

Theorem 2:
The integral optimal solution to the above minimum-cost flow instance provides an optimal assignment from $X$ to $C$ for balanced clustering tasks.

Proof: We only need to prove that any feasible assignment from the dataset $X$ to the given centroid set $C$ can be represented by a feasible integral flow in the aforementioned minimum-cost flow instance, and vice versa.

Let $\sigma: X \to C$ be a feasible assignment from $X$ to $C$. Consider the following flow $f: E \to \mathbb{R}_+$:
$$f(e) = \begin{cases} 1, & \forall e \in E_1, \\ 1, & \forall e \in E_2 \text{ with } \sigma(e_o) = e_d, \\ 0, & \forall e \in E_2 \text{ with } \sigma(e_o) \ne e_d, \\ \sum_{e': e'_d = e_o} f(e'), & \forall e \in E_3, \end{cases}$$
where we denote the origin and destination of edge $e$ by $e_o$ and $e_d$, respectively. Note that the quantity of $f$ is $n$. Obviously, $f$ satisfies the flow conservation, and every edge in $E_1$ and $E_2$ obeys the bandwidth constraints. For $e \in E_3$, from the construction we have $f(e) = \sum_{e': e'_d = e_o} f(e') = |\sigma^{-1}(e_o)|$. Since $\sigma$ is feasible, it must hold for every $j \in C$ that $l \le |\sigma^{-1}(j)| \le u$, which implies the feasibility of the bandwidth constraints for $E_3$.

On the other hand, given an integral feasible flow $f$, the corresponding assignment must be feasible, i.e., it satisfies the size constraints. Note that a feasible flow with quantity $n$ in the above instance must have $f(e) = 1$ for every $e \in E_1$. Consider the following assignment $\sigma$: for any $i \in X$ and $j \in C$, $\sigma(i) = j$ if and only if the edge with $e_o = i$ and $e_d = j$ has $f(e) = 1$. The defined assignment must be feasible because $|\sigma^{-1}(j)| = \sum_{e \in \delta^-(j)} f(e) = \sum_{e \in \delta^+(j)} f(e)$ holds for any $j \in C$, and from the feasibility of the flow $f$ we know that $l \le \sum_{e \in \delta^+(j)} f(e) \le u$.

Finally, the cost of a feasible assignment and the cost of its corresponding flow are exactly the same, because
$$\sum_{e \in E} c(e) f(e) = \sum_{e \in E_2} c(e) f(e) = \sum_{e \in E_2: f(e) = 1} \|e_o - e_d\|^2 = \sum_{i \in X} \sum_{j = \sigma(i)} \|i - j\|^2 = \sum_{x \in X} \|x - \sigma(x)\|^2 = \sum_{i=1}^{k} \sum_{x \in X_i} \|x - \sigma(x)\|^2,$$
where the first equality is derived from the construction and the last equality holds for any feasible partition of $X$, which we assume w.l.o.g. to be $\{X_i\}_{1 \le i \le k}$. This implies the theorem. ∎

Based on the above, we conclude that an MCF subroutine embedded in the Random Sampling algorithm guarantees a valid solution for the balanced k-clustering problem. The pseudocode is provided as Algorithm 2.

Algorithm 2: Random Sampling for balanced k-clustering tasks
Input: Dataset $X$, integer $k$;
Output: k-clustering of $X$.
1. Sample a subset $S$ by $m$ ($\ge k$) independent draws from $X$ uniformly at random;
2. for every k-clustering $\{S_i\}_{1 \le i \le k}$ of $S$ do
3.   Compute the centroid set $C = \{c(S_i)\}_{1 \le i \le k}$;
4.   Obtain $\{X_i\}_{1 \le i \le k}$ by the MCF subroutine;
5.   Compute the value $\sum_{i=1}^{k} \sum_{x \in X_i} \|x - c(X_i)\|^2$;
6. return the $\{X_i\}_{1 \le i \le k}$ with the minimum value.
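A quick sanity check of the subroutine (the data, the bounds, and the mcf_assignment function from the sketch above are our illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
C = rng.normal(size=(3, 2))               # centroids of one sampled k-clustering
labels = mcf_assignment(X, C, l=15, u=25)
sizes = np.bincount(labels, minlength=3)
assert all(15 <= s <= 25 for s in sizes)  # every cluster size lies within [l, u]
```

Embedding this call as step 4 of Algorithm 2, in place of the Voronoi partition of Algorithm 1, yields the balanced variant.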
DISCUSSION

We are incredibly well informed, yet we know incredibly little; this is what is happening in clustering tasks.
Our work implies that we do not need so much information about a dataset when clustering it. From the experiments, roughly speaking, to obtain a clustering result competitive with the k-means method and the k-means++ method, we only need about 7% of the information in the dataset. For the remaining 93% of the data, we immediately make decisions with only $O(k)$ additional computations each. Note that the resources consumed by the algorithm are dominated by the brute-force search over the k-clusterings of the sampling set. If we have up to 15% of the information in the dataset, then with high probability our algorithm outperforms both the k-means method and the k-means++ method in terms of the quality of clustering. The above statements hold only when 1) the dataset is independent and identically distributed; 2) the sampling set is picked uniformly at random from the original dataset; and 3) most importantly, the dataset is large enough (experimentally, 500 data points or more suffice). As a cost, the proposed algorithm has a high complexity with respect to $k$, but it is fortunately not sensitive to the size of the dataset or the size of the sampling set.

We believe that the Random Sampling idea, as well as the framework of the analysis, has the potential to deal with incomplete datasets and online clustering tasks.

REFERENCES
[1] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network Flows: Theory, Algorithms and Applications. Prentice Hall, 1993.
[2] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035, 2007.
[3] Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of Euclidean k-means. In International Symposium on Computational Geometry (SoCG), pages 754–767, 2015.
[4] Olivier Bachem, Mario Lucic, S. Hamed Hassani, and Andreas Krause. Approximate k-means++ in sublinear time. In AAAI Conference on Artificial Intelligence (AAAI), pages 1459–1467, 2016.
[5] Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable k-means++. In Very Large Data Bases (VLDB), pages 622–633, 2012.
[6] Davin Choo, Christoph Grunau, Julian Portmann, and Václav Rozhoň. k-means++: few more steps yield constant approximation. arXiv preprint arXiv:2002.07784, 2020.
[7] Vincent Cohen-Addad. Approximation schemes for capacitated clustering in doubling metrics. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2241–2259, 2020.
[8] Vincent Cohen-Addad and Jason Li. On the fixed-parameter tractability of capacitated clustering. In International Colloquium on Automata, Languages, and Programming (ICALP), pages 1–14, 2019.
[9] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In International Symposium on Computational Geometry (SoCG), pages 332–339, 1994.
[10] Kamal Jain and Vijay V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. Journal of the ACM, 48(2):274–296, 2001.
[11] Silvio Lattanzi and Christian Sohler. A better k-means++ algorithm via local search. In International Conference on Machine Learning (ICML), pages 3662–3671, 2019.
[12] Zhihui Li, Feiping Nie, Xiaojun Chang, Zhigang Ma, and Yi Yang. Balanced clustering via exclusive lasso: A pragmatic approach. In AAAI Conference on Artificial Intelligence (AAAI), pages 3596–3603, 2018.
[13] Weibo Lin, Zhu He, and Mingyu Xiao. Balanced clustering: A uniform model and fast algorithm. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2987–2993, 2019.
[14] Hanyang Liu, Junwei Han, Feiping Nie, and Xuelong Li. Balanced clustering with least square regression. In AAAI Conference on Artificial Intelligence (AAAI), pages 2231–2237, 2017.
[15] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[16] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, S. Yu Philip, et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008.
[17] Yicheng Xu, Rolf H. Möhring, Dachuan Xu, Yong Zhang, and Yifei Zou. A constant FPT approximation algorithm for hard-capacitated k-means.