Ball k-means
Shuyin Xia, Daowan Peng, Deyu Meng, Changqing Zhang, Guoyin Wang, Zizhong Chen, Wei Wei
Abstract—This paper presents a novel accelerated exact k-means algorithm called the Ball k-means algorithm, which uses a ball to describe a cluster, focusing on reducing the point-centroid distance computation. Ball k-means can exactly find the neighbor clusters of each cluster, so distances are computed only between a point and the centroids of its neighbor clusters instead of all centroids. Moreover, each cluster can be divided into a stable area and an active area, and the latter can be further divided into annulus areas. The points in the stable area keep their assigned cluster in the current iteration, while the points in the annulus areas are adjusted only within a few neighbor clusters in the current iteration. In addition, the proposed Ball k-means maintains no upper or lower bounds. Furthermore, reducing the centroid-centroid distance computation between iterations makes it efficient for large-k clustering. Its fast speed, absence of extra parameters, and simple design make Ball k-means an all-around replacement for the naive k-means algorithm.

Index Terms—Ball k-means, Stable Area, Active Area, Neighbor Cluster.

1 BALL K-MEANS CLUSTERING
In this section, the main idea of the Ball k-means algorithm is presented by introducing the ball cluster concept, neighbor cluster searching, ball cluster division, and the mechanism for reducing the centroid-centroid distance computation between iterations.
1.1 Ball Cluster

A ball structure is characterized by a radius and a centroid. Therefore, to describe a cluster and analyze the proposed method, we propose the ball cluster concept, in which a ball is used to describe a cluster.
Definition 1.
Given a cluster C, C is called a ball cluster, defined by its centroid c and radius r as follows:

c = (1/|N|) Σ_{i=1}^{|N|} x_i,   r = max_i ||x_i − c||,   (1)

where x_i denotes a point assigned to C, and |N| denotes the number of samples in C.

• Shuyin Xia, Daowan Peng, and Guoyin Wang are with the Chongqing Key Laboratory of Computational Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China. E-mail: [email protected], [email protected], [email protected]
• Deyu Meng is with the National Engineering Laboratory for Algorithm and Analysis Technology on Big Data, Xi'an Jiaotong University, Xi'an 710049, China. E-mail: [email protected]
• Changqing Zhang is with the College of Intelligence and Computing, Tianjin University, Tianjin 300072, China. E-mail: [email protected]
• Zizhong Chen is with the Department of Computer Science and Engineering, University of California, Riverside, 900 University Avenue, Riverside, CA 92521, USA. E-mail: [email protected]
• Wei Wei is with the School of Computer Science and Engineering, Xi'an University of Technology, Xi'an 710048, China. E-mail: [email protected]
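As a concrete illustration, the centroid and radius of Equation (1) can be computed in a few lines of NumPy. This is a sketch of ours; `ball_cluster` is an illustrative name, not an identifier from the paper's implementation.

```python
import numpy as np

def ball_cluster(points):
    """Compute the centroid c and radius r of a ball cluster (Definition 1).

    points: (n, d) array of the points currently assigned to the cluster.
    Returns (c, r) with c the mean of the points and r the maximum
    point-centroid distance.
    """
    points = np.asarray(points, dtype=float)
    c = points.mean(axis=0)                       # c = (1/|N|) * sum_i x_i
    r = np.linalg.norm(points - c, axis=1).max()  # r = max_i ||x_i - c||
    return c, r
```

Every point of the cluster lies inside the ball of radius r around c by construction, which is what allows the geometric reasoning in the following sections.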
1.2 Neighbor Ball Cluster Searching

To skip the distance calculation between a point and a centroid that is very far from that point, we introduce a method that finds the neighbor clusters of each cluster. Thus, the distance computation is limited to the points and their neighbor clusters. The neighbor cluster is defined in Definition 2.
Definition 2.
Given two ball clusters C_i and C_j whose centroids are denoted as c_i and c_j, and with r_i representing the radius of C_i, if the following inequality is satisfied:

r_i > (1/2) ||c_i − c_j||,   (2)

then C_j is a neighbor cluster of C_i.

Equation (2) indicates that the neighbor relationship is not symmetric. Specifically, for two ball clusters C_i and C_j, referring to Fig. 1, their neighbor relationship can be one of the following three types:

(1) C_i and C_j are neighbor clusters of each other, such as clusters C and C_1 presented in Fig. 1. As C and C_1 are neighbor clusters, some points in C (C_1) may be adjusted into C_1 (C) in the current iteration.

(2) C_i is a neighbor cluster of C_j, but C_j is not a neighbor cluster of C_i, such as C and C_2 presented in Fig. 1; namely, C_2 is a neighbor cluster of C, but C is not a neighbor cluster of C_2. Therefore, some points in C may be adjusted into C_2, but no points in C_2 can be adjusted into C.

(3) C_i and C_j are not neighbor clusters, such as C and C_3 presented in Fig. 1. Therefore, points in C (C_3) cannot be adjusted into C_3 (C) in the current iteration.

Theorem 1.
Given two clusters C_i and C_j with centroids c_i and c_j, respectively, and a queried ball cluster C with centroid c and radius r: if C_i is a neighbor cluster of C (i.e., r > (1/2) ||c − c_i||) while C_j is not a neighbor cluster of C (i.e., r ≤ (1/2) ||c − c_j||), then some points in C may be adjusted into C_i, while no points in C can be adjusted into C_j.

Fig. 1. The schematic diagram of the neighbor relationship of the queried ball cluster C. The red dashed line represents the perpendicular bisector of the centroids of two ball clusters. The yellow triangle and green line represent the centroid and radius of a cluster, respectively.

For a queried cluster C, its neighbor ball clusters can be exactly found by Definition 2, so the distance computation from points in C to the centroids of the other clusters is limited to the neighbor clusters of C, resulting in a significant decrease in the amount of distance computation. In [1], a similar method for finding neighbor clusters was proposed. For two clusters C_i and C_j, ||c_i − c_j|| represents the distance between the centroids of C_i and C_j; if m(c_i) + s(c_i) ≥ (1/2) ||c_i − c_j||, then C_j is a neighbor of C_i, where m(c_i) represents the radius of C_i and s(c_i) represents half the distance between the centroid of C_i and its closest other centroid. In contrast, in this paper, C_j is a neighbor of C_i if r_i > (1/2) ||c_i − c_j||, where r_i represents the radius of C_i. Therefore, in comparison with Definition 2, there is one additional term in [1]. Consequently, the condition in [1] is looser than that in Definition 2. In other words, Definition 2 finds fewer but more exact neighbor clusters than the method in [1]. In Section 1.3, it is shown that the ball cluster division can further decrease the amount of distance computation.
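The neighbor test is a single comparison per centroid pair. The sketch below writes the half-distance criterion out explicitly; `is_neighbor` and `neighbor_centroids` are illustrative names of ours.

```python
import numpy as np

def is_neighbor(c_i, r_i, c_j):
    """Neighbor criterion: C_j is a neighbor cluster of C_i iff
    r_i > (1/2) * ||c_i - c_j||, i.e. the bisecting hyperplane between the
    two centroids cuts into the ball of C_i."""
    return r_i > 0.5 * np.linalg.norm(np.asarray(c_i, dtype=float)
                                      - np.asarray(c_j, dtype=float))

def neighbor_centroids(i, centroids, radii):
    """Indices of the neighbor clusters of cluster i (the set {NC_i})."""
    return [j for j in range(len(centroids))
            if j != i and is_neighbor(centroids[i], radii[i], centroids[j])]
```

Note that the test depends on the radius of C_i but not of C_j, which is exactly the asymmetry described by the three cases above.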
1.3 Ball Cluster Division

A queried ball cluster can be divided into two parts, the stable area and the active area, which are defined in Definition 3. The points in the stable area stay in their assigned cluster, as stated in Theorem 2. The active area can be further divided into annulus areas, as given in Definition 4. Points in each annulus area need to compute distances only to some of the neighbor clusters, as given in Theorem 3.
The definitions of the stable area and the active area are as follows:
Definition 3.
Given a queried ball cluster C_i, let {NC_i} denote the centroid set of the neighbor clusters of C_i. If {NC_i} ≠ ∅, then the sphere area whose centroid is c_i and whose radius is

(1/2) min_{c_j ∈ {NC_i}} ||c_i − c_j||

is defined as the stable area of C_i, and the rest of the ball cluster is defined as the active area of C_i.

Theorem 2.
Given a cluster C_i, the points in the stable area of C_i cannot be adjusted into any neighbor cluster in the current iteration.

In a special case, when a ball cluster has no neighbor clusters, the stable area is equal to the whole ball cluster. A description similar to the stable area is provided only in [2], but it relies on an upper bound that is larger than the true distance when checking the filtering condition. In contrast, Definition 3 provides an exact definition that relies on no bounds.

In this section, we show that the active area of a queried cluster can be divided into annulus areas that are generated by the neighbor clusters.
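A minimal sketch of Definition 3 and the Theorem 2 filter, assuming NumPy and helper names of our own choosing:

```python
import numpy as np

def stable_area_radius(c_i, neighbor_cs):
    """Radius of the stable area of C_i (Definition 3): half the distance
    from c_i to its closest neighbor-cluster centroid. If the cluster has no
    neighbors, the whole ball is stable (modeled here as radius = +inf)."""
    if len(neighbor_cs) == 0:
        return np.inf
    d = np.linalg.norm(np.asarray(neighbor_cs, dtype=float)
                       - np.asarray(c_i, dtype=float), axis=1)
    return 0.5 * d.min()

def split_stable_active(points, c_i, neighbor_cs):
    """Partition the points of C_i into the stable area (kept as-is in the
    current iteration, Theorem 2) and the active area (reassignment candidates)."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points - np.asarray(c_i, dtype=float), axis=1)
    mask = dist <= stable_area_radius(c_i, neighbor_cs)
    return points[mask], points[~mask]
```

A point closer to c_i than half the distance to every neighbor centroid lies on C_i's side of every bisecting hyperplane, so it cannot be captured by any neighbor; only the active points need further examination.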
Definition 4.
Annulus area. Given a queried ball cluster C with centroid c and radius r, suppose |{NC}| = k′, where {NC} represents the set of the neighbor clusters' centroids of C. Let c_i and c_{i+1} represent the centroids of the i-th and (i+1)-th closest neighbor clusters of C, respectively (i < k′). For x ∈ C, the i-th annulus area, denoted ℜ^i_C, is:

ℜ^i_C = { x : (1/2) ||c − c_i|| < ||x − c|| < (1/2) ||c − c_{i+1}|| },   i = 1, ..., k′ − 1,
ℜ^i_C = { x : (1/2) ||c − c_i|| < ||x − c|| < r },   i = k′.

Theorem 3.
Given a queried cluster C with centroid c, and supposing |{NC}| = k′, the points in its i-th annulus area can be adjusted only within its first i closest neighbor clusters and C itself (i ≤ k′).

1.4 Reducing Centroid-Centroid Distance Computation Between Iterations

As presented in Section 1.2, finding the neighbor clusters of each ball cluster requires calculating all the centroid-centroid distances, which costs O(k^2) per iteration; for large-k clustering, this is a non-negligible cost. In this paper, the purpose of calculating centroid-centroid distances is to find the neighbor clusters of the next iteration. If a non-neighbor relationship in the next iteration can be identified in advance from the relationship of the ball clusters in the current iteration, then the direct centroid-centroid distance calculation can be avoided. In this work, we develop a method that finds such non-neighbor relationships in advance to avoid unnecessary centroid-centroid distance calculations. The specific process is formulated as follows. Let c_i^(t) represent the centroid of cluster C_i in the t-th iteration, let δ(c_i^(t)) = ||c_i^(t) − c_i^(t−1)|| represent the shift of the centroid of C_i between the (t−1)-th and t-th iterations, and let dist(c_i^(t), c_j^(t)) represent the distance between c_i and c_j in the t-th iteration.

Theorem 4.
Given clusters C_i and C_j, suppose that dist(c_i^(t−1), c_j^(t−1)) ≥ 2 r_i^(t) + δ(c_i^(t)) + δ(c_j^(t)). Then C_j cannot be a neighbor cluster of C_i in the current iteration, and the centroid-centroid distance computation between them can be skipped.

Proof.
With the shift of the cluster centroids due to the centroid update, it holds that:
Fig. 2. The schematic diagram of avoiding the direct centroid-centroid distance calculation. The red dashed line passes through the midpoint of dist(c_i^(t), c_j^(t)). C_j is not a neighbor cluster of C_i in the t-th iteration.

dist(c_i^(t), c_j^(t)) ≥ dist(c_i^(t−1), c_j^(t−1)) − δ(c_i^(t)) − δ(c_j^(t)).

Combining this with the supposition that dist(c_i^(t−1), c_j^(t−1)) ≥ 2 r_i^(t) + δ(c_i^(t)) + δ(c_j^(t)):

⇒ dist(c_i^(t), c_j^(t)) ≥ 2 r_i^(t) + δ(c_i^(t)) + δ(c_j^(t)) − δ(c_i^(t)) − δ(c_j^(t)) = 2 r_i^(t),
⇒ dist(c_i^(t), c_j^(t)) ≥ 2 r_i^(t).

As given in Definition 2 and Theorem 1, when r_i ≤ (1/2) dist(c_i, c_j), C_j is not a neighbor of C_i. So, as shown in Fig. 2, when dist(c_i^(t−1), c_j^(t−1)) ≥ 2 r_i^(t) + δ(c_i^(t)) + δ(c_j^(t)), it holds that dist(c_i^(t), c_j^(t)) ≥ 2 r_i^(t) (i.e., C_j cannot be a neighbor cluster of C_i in the current iteration). Thus, the computation of the distance between c_i and c_j can be avoided. □

According to the characteristics of the k-means algorithm, as the number of iterations increases, more and more ball clusters tend to become stable, i.e., the points in them are unchanged. In Ball k-means, a stable ball cluster can be simply described as one into which no points move and out of which no points move in the current iteration. Based on this characteristic of the k-means algorithm itself, we propose a method to find those stable ball clusters. In this method, a flag corresponding to each ball cluster is used to judge whether the ball cluster is stable. For a queried ball cluster, if no points in the queried ball cluster move into its nearest cluster, and no points in other ball clusters move into the queried ball cluster in the current iteration, then its flag is marked TRUE.
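The skip test of Theorem 4 uses only quantities that are already available from the previous iteration; a sketch with illustrative names of our own:

```python
import numpy as np

def centroid_shift(c_prev, c_curr):
    """delta(c_i^(t)) = ||c_i^(t) - c_i^(t-1)||, the centroid shift between
    two consecutive iterations."""
    return float(np.linalg.norm(np.asarray(c_curr, dtype=float)
                                - np.asarray(c_prev, dtype=float)))

def can_skip_distance(prev_dist_ij, r_i, delta_i, delta_j):
    """Theorem 4: if dist(c_i^(t-1), c_j^(t-1)) >= 2*r_i^(t) + delta_i + delta_j,
    then C_j cannot be a neighbor of C_i in iteration t, so
    dist(c_i^(t), c_j^(t)) never needs to be computed."""
    return prev_dist_ij >= 2.0 * r_i + delta_i + delta_j
```

As centroid shifts shrink over the iterations, the test fires for more and more pairs, so the number of centroid-centroid distances actually computed drops well below all pairs.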
Theorem 5.
Ball k-means is implemented on a given dataset D. For a queried ball cluster C, if the points in C are not changed, C is called a stable ball cluster. In one iteration, if all the neighbor ball clusters of C are stable, C will not participate in the distance calculations in the next iteration.

Proof.
The proof is straightforward. For a queried ball cluster C, if all the neighbor ball clusters of C are stable, then the division into the stable area and annulus areas is the same as in the previous iteration, so the assignment step of the queried ball cluster can be avoided. □
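Putting the pieces together, one assignment pass over all ball clusters can be sketched as follows. This is an illustrative simplification of ours (it recomputes all centroid-centroid distances rather than applying the Theorem 4 and Theorem 5 skips), with function names of our own choosing:

```python
import numpy as np

def assign_step(points, labels, centroids, radii):
    """One Ball k-means assignment pass (sketch).

    For each cluster: find its neighbor clusters (Definition 2), leave the
    stable area untouched (Theorem 2), and compare each active point only
    against the neighbors whose half-distance bound it exceeds
    (Definition 4 / Theorem 3).
    """
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(labels)
    new_labels = labels.copy()
    k = len(centroids)
    for i in range(k):
        members = np.where(labels == i)[0]
        if members.size == 0:
            continue
        c, r = centroids[i], radii[i]
        cd = np.linalg.norm(centroids - c, axis=1)  # centroid-centroid distances
        nbrs = [j for j in range(k) if j != i and r > 0.5 * cd[j]]  # Definition 2
        nbrs.sort(key=lambda j: cd[j])              # closest neighbors first
        for p in members:
            d_own = np.linalg.norm(points[p] - c)
            best, best_d = i, d_own
            for j in nbrs:
                # Inside the half-distance to this neighbor: the point is on
                # C's side of this and every remaining bisector (Theorem 3).
                if d_own <= 0.5 * cd[j]:
                    break
                d = np.linalg.norm(points[p] - centroids[j])
                if d < best_d:
                    best, best_d = j, d
            new_labels[p] = best
    return new_labels
```

Points in the stable area hit the `break` on the very first neighbor, so they incur no point-centroid distance computations at all; points in the i-th annulus are compared against exactly the first i closest neighbors.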
During the iterations of the k-means algorithm, more and more ball clusters become stable, and the data points in those stable ball clusters do not participate in any distance calculations. Therefore, the per-iteration time complexity of Ball k-means becomes sublinear, and Ball k-means runs faster and faster per iteration.

REFERENCES
[1] Petr Ryšavý and Greg Hamerly. Geometric methods to accelerate k-means algorithms. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 324–332. SIAM, 2016.
[2] Charles Elkan. Using the triangle inequality to accelerate k-means. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), 2003.