Clustering via Boundary Erosion
Cheng-Hao Deng, Computer Science Department, Xiamen University, [email protected]
Wan-Lei Zhao∗, Computer Science Department, Xiamen University, [email protected]

∗ Corresponding author. Fujian Key Laboratory of Sensing and Computing for Smart City, Xiamen University, Fujian, China.
Abstract
Clustering analysis identifies groups of samples based on either their mutual closeness or their homogeneity. In order to detect clusters of arbitrary shapes, a novel and generic solution based on boundary erosion is proposed. The clusters are assumed to be separated by relatively sparse regions. Samples are eroded sequentially according to their dynamic boundary densities. The erosion starts from low-density regions and invades inwards until all the samples are eroded out. In this manner, the boundaries between different clusters become more and more apparent, which offers a natural and powerful way to separate clusters whose boundaries are hard to draw at once. The order in which samples are eroded yields sequential boundary levels, from which clusters of arbitrary shapes are automatically reconstructed. As demonstrated across various clustering tasks, the proposed method outperforms most of the state-of-the-art algorithms, and its performance is nearly perfect in some scenarios.
1. Introduction
Clustering problems arise from a variety of applications, such as document/web page categorization [45], pattern recognition, biomedical analysis [41], data compression via vector quantization [36] and nearest neighbor search [19, 27]. In general, clustering analysis plays an indispensable role in understanding various phenomena across different contexts. Given a set of samples S in d-dimensional space R^d, the task of clustering is to partition the samples into subsets (called clusters) such that samples in the same cluster are more homogeneous or closer to each other than those from different clusters.

Traditionally, this issue has been modeled as a distortion minimization problem in k-means [26]. The clustering procedure is organized into two steps. Firstly, samples are assigned to their closest centers. Secondly, the center of each cluster is updated with the samples assigned to it. These two steps are repeated until the structure of the clusters does not change in two consecutive iterations. This algorithm is simple and efficient, but it is unable to discover clusters that are not of spherical shape.

Aiming to identify clusters of arbitrary shapes, a series of algorithms have been proposed in the last two decades. Most of these algorithms [11, 6, 33, 24, 2, 32, 31] are conceived from the perspective of density distribution. Intuitively, samples within each cluster are concentrated, and clusters are separated by sparse regions. Among these algorithms, samples are either iteratively assigned to [11, 2, 33] or shifted towards [6] the density peaks, and the clusters are thereby forged. However, the heuristic rules [11] or kernels [6] employed in these algorithms are unable to deal with the various density distributions encountered in practice. For this reason, the performance of these algorithms turns out to be unstable across scenarios.

Apart from the density based approaches, graph based algorithms are also able to discover clusters of arbitrary shapes. Representative methods are Chameleon [22] and order-constrained transitive distance clustering [44]. In both of them, the connectivity between samples is carefully considered. According to the strategies presented in these papers, samples that are far away from each other are still clustered together as long as they are reachable from each other via a chain of closely connected bridging samples. Unfortunately, both of them require a matrix keeping the pairwise distances between samples, which makes them unscalable to large-scale clustering tasks.

As a consequence, despite the numerous efforts taken in the last several decades, two major goals of clustering analysis, namely the ability to identify clusters of arbitrary shapes and the scalability towards large-scale and high-dimensional data, are hardly achieved by one algorithm. In this paper, a simple but effective density based solution is proposed. The basic idea is inspired by the phenomenon of land erosion by water. The boundaries between clusters are drawn gradually by a boundary erosion process, without any heuristic rules or kernels. This is particularly powerful when the boundaries between clusters are obscure at first sight. In addition, the bit-by-bit boundary erosion produces a sequential order following which the potential clusters can be reconstructed with the guidance of an r-NN graph.
The boundary erosion is feasible in any metric space as long as the density of a data sample can be estimated. Furthermore, we also demonstrate that this algorithm achieves satisfactory performance and high efficiency on a large-scale and high-dimensional clustering task with the support of efficient k-NN graph construction [8].
2. Related Work
Since the proposal of k-means, a variety of clustering algorithms have been proposed in the past three decades, which are in general categorized into several groups [43], namely agglomerative [23], divisive, partitioning [26], density based [11, 32, 2, 33, 15, 6], graph based [22] and neural network based algorithms. In the literature, ensembles of several existing algorithms have also been used to boost performance [47, 46]. For comprehensive surveys, readers are referred to [43, 17]. In this section, our focus is on several typical algorithms that are able to identify clusters of arbitrary shapes, namely the density based and graph based algorithms.

Although the clustering problem has been modeled from different perspectives, people basically agree that clusters are composed of samples that are relatively concentrated and that clusters are separated in-between by relatively sparse regions. This perception is made without any specification of the distance measure on the input data. Density based algorithms are in general designed in line with this perception. Although different in details, the density based algorithms aim to discover groups of samples that are continuously connected.

In general, two steps are involved in the density based clustering process. Firstly, the local density surrounding each sample is estimated. Given sample x_i and radius r, the density of sample x_i is defined as the number of samples y_j that fall into x_i's neighborhood of range r, as shown in Eqn. 1 (see also the sketch at the end of this section):

$$\rho(x_i) = \sum_j \sigma_j, \quad \text{where } \sigma_j = \begin{cases} 1, & d(x_i, y_j) \le r \\ 0, & \text{otherwise.} \end{cases} \quad (1)$$

Function d(·,·) in Eqn. 1 returns the distance between x_i and y_j. In the second step, the clusters are forged in basically two different manners. For instance, in DBSCAN [11], a cluster is formed by expanding it from "core points" (points holding high density) to points with low density, while in mean-shift clustering [6], samples are shifted iteratively from regions of low density towards the density peaks. In DBSCAN, the expansion process can be very sensitive to the parameters; for instance, two heterogeneous clusters may be falsely merged into one as the parameters change slightly. In mean-shift, the shifting process can easily get stuck in a local optimum if there is no obvious density peak. In the approach of clustering based on density peaks (clusterDP) [33], samples are directly assigned to the closest density peak, where each density peak is recognized as a cluster center. However, it faces a similar problem as mean-shift, since it is hard to identify the cluster center when there is no obvious density peak. Another pitfall of this approach is that the number of peaks to be selected as cluster centers has to be set manually.

Recently, several clustering methods have been proposed to deal with high-dimensional data, which are hardly separable in the original space. These methods fall typically into two categories. One is built upon generative models [42, 21]. Another is sparse spectral clustering (SSC) [10, 20], which first projects the high-dimensional data into a sparse and low-dimensional representation and then clusters the projected data via spectral clustering. The method in [7] in general follows a similar framework to fulfill clustering on high-dimensional data.
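For reference, the following is a brute-force Python sketch of the density estimate in Eqn. 1. It is our own illustration, not code from any of the cited works; the function name, the NumPy dependency and the choice of Euclidean distance are assumptions of the sketch.

import numpy as np

def density(S, r):
    """Eqn. 1 by brute force: rho(x_i) = |{ y_j : d(x_i, y_j) <= r }|.

    S is an (n, d) array of samples. Euclidean distance is used here for
    concreteness, although any metric d(., .) would do. The sample itself
    is excluded from its own count.
    """
    # Pairwise distance matrix, O(n^2 d) time and O(n^2) space.
    dists = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    # Count r-neighbors per row, then remove the self-match at distance 0.
    return (dists <= r).sum(axis=1) - 1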
3. Clustering via Boundary Erosion
For all the clustering algorithms discussed, the key step is to partition the data samples into groups. However, this is challenging, particularly when the boundaries between different clusters are not obvious. In this paper, a boundary erosion procedure is proposed, which addresses such ambiguity. The idea is inspired by the natural phenomenon of land erosion by water. An illustration is given in Fig. 1. As shown in the figure, the erosion gradually makes the boundaries between clusters explicit, as water erodes the land bit by bit. More importantly, a sequential order that indicates how the samples assemble one after another into one cluster is established based on the order in which the samples are eroded. With this sequential order, called sequential boundary levels in this paper, the latent clusters can be easily reconstructed.

Notice that this idea is essentially different from the watershed transform [34] in the sense that the "water level" in our case does not rise up to bury the land. Instead, it only erodes the land. The land on the outer part is eroded earlier than the inner land instead of being buried at the same time, even if both lie at the same altitude.
To facilitate the boundary erosion, the density of a sample is estimated in a quite different manner from conventional algorithms.
[Figure 1: (a) gradual erosion (step 1); (b) gradual erosion (step 2); (c) gradual erosion (step 3); (d) trend of erosion]
Figure 1. An illustration of boundary erosion. The erosion starts from the bottomlands and invades inwards. As more and more land (from (a) to (c)) is eroded out, the boundaries between lands (clusters) become more and more apparent. The erosion continues until all the land is eroded out.

Namely, the density of each sample is dynamically estimated by gradually eroding its neighbors out. In order to do that, a dynamic array Q is maintained, in which samples are kept along with their dynamic boundary density ρ*. The dynamic boundary density is given in Eqn. 2:

$$\rho^*(x_i) = \sum_j \sigma_j, \quad \text{where } \sigma_j = \begin{cases} 1, & d(x_i, y_j) \le r \text{ and } y_j \in Q \\ 0, & \text{otherwise.} \end{cases} \quad (2)$$

As shown in Eqn. 2, the major difference from Eqn. 1 is that samples outside of Q are not counted during density estimation. At the beginning, all the samples are put into the dynamic array Q. For this reason, ρ* of each sample is initially the same as ρ given in Eqn. 1.

The boundary erosion starts by deleting the sample with the lowest density in Q, which corresponds to the boundary region we are most certain about. Each time, the sample holding the lowest density is removed from Q. Due to the removal of sample x_i, the dynamic boundary densities ρ* of its neighbors are influenced according to Eqn. 2; therefore, the densities ρ* of x_i's neighbors in Q are recalculated and updated. Thereafter, the next sample holding the lowest dynamic boundary density ρ* is identified and removed from Q. This process continues until Q is empty. At each removal, a sequential boundary level is assigned to the samples being removed. Samples that are removed at the same moment are assigned the same level; it is possible that several samples holding the same density value are removed at once. This erosion process is summarized in Alg. 1.
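To make the counting in Eqn. 2 concrete, here is a minimal sketch. It is our own illustration; the neighbors/alive structures are assumptions of the sketch, not notation from the paper.

def dynamic_density(i, neighbors, alive):
    """Eqn. 2: count the r-neighbors of x_i that are still in the array Q.

    neighbors[i] holds the indices of the samples within range r of x_i;
    alive[j] is True while x_j has not yet been eroded out of Q.
    """
    return sum(1 for j in neighbors[i] if alive[j])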
Algorithm 1: Produce boundary levels via erosion

Data: Data sample matrix S_{n×d}
Result: Boundary levels Ω, r-NN graph G
1:  Compute the r-NN graph G for S_{n×d};
2:  Sort each G[i] in ascending order;
3:  Calculate ρ*(x_i) (Eqn. 2) based on G;
4:  Push all samples x_i into Q;
5:  Sort Q in ascending order of ρ*;
6:  Ω ← ∅, l ← 0;
7:  while Q ≠ ∅ do
8:      Pop the x_i with the lowest ρ* from Q;
9:      Ω ← Ω ∪ {⟨x_i, l⟩};
10:     for each x_j that has x_i in its neighborhood do
11:         Recalculate ρ*(x_j);
12:         Update Q with ρ*(x_j);
13:     end
14:     l ← l + 1;
15: end

This erosion process invades inwards from the boundaries as more and more samples are eroded out. It is imaginable that samples initially not located on the cluster border are gradually exposed to the boundary erosion. The erosion continues until all the samples have been deleted from Q. In the erosion process, a sample living inside automatically ruptures as the start of a new boundary when the current lowest dynamic boundary density equals the density of this sample. An illustration of the erosion process is given by movie S1 in the supplementary materials.

In the above process, samples are removed from Q sequentially according to their dynamic boundary density, ranging from low to high. Based on the order of being removed from Q, a sample is assigned a boundary level l, which reflects both the original density ρ (Eqn. 1) and the innerness of the sample as a cluster member. It is easy to see that a sample lying on the outside holds a lower boundary level than one lying inside, even if they share the same ρ. This is the essential difference between our approach and the watershed transform [34].

[Figure 2: (a) density ρ; (b) boundary levels; (c) 3D view of (a); (d) 3D view of (b)]

Figure 2. The comparison between conventional density estimation and the sequential boundary levels produced by boundary erosion. The darker the red color, the higher the value for both (best viewed in color).

Fig. 2(a) and Fig. 2(b) show the density estimation from Eqn. 1 and the boundary levels produced by the erosion process respectively. Accordingly, the 3D views of Fig. 2(a) and Fig. 2(b) are shown in Fig. 2(c) and Fig. 2(d). As shown in the figure, the density estimated by Eqn. 1 is full of potholes. This is not surprising, since it is not necessarily true that the density ρ increases smoothly from the border to the center. The cluster expansion undertaken afterwards is easily trapped in the potholes distributed along the density slope, which is a common issue latent in the traditional approaches. This issue is avoided by the dynamic density estimation, which considers both the density and the innerness of a sample. A clear contrast is seen in the 3D views (Fig. 2(c) and Fig. 2(d)): the boundary levels produced by Alg. 1 turn out to be smooth within each emerging cluster.

In Alg. 1, the first step calculates the r-NN graph G, which keeps the nearest neighbors of each sample within range r. r is the only parameter, and it sets the scale of the neighborhood of each sample. Entry G[i] keeps the list of nearest neighbors of sample x_i within its neighborhood of range r. The nearest neighbor list of each entry is sorted in ascending order according to the distances to sample x_i, which facilitates the subsequent boundary erosion and labeling steps (Alg. 2). The time complexity of building the nearest neighbor lists for all the samples is quadratic in the scale of the input, which is on the same level as DBSCAN [11, 15] and the algorithm in [33].
The complexity of computing an approximate r-NN graph can be decreased to O(d · n^{1.14}) [8], which will be discussed in detail in a later section.

In order to support the fast updating of ρ* for the samples x_j affected by the removal of x_i (Alg. 1, Lines 10-13), a reverse nearest neighbor graph G* [8] is also maintained, in which G*[i] keeps the samples x_j whose nearest neighbor lists x_i appears in. Essentially, the reverse nearest neighbor graph G* is nothing more than a simple reorganization of G.
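Putting Eqn. 2, the reverse graph and Alg. 1 together, the following Python sketch renders the erosion process. It is our own rendering under stated assumptions, not the authors' released code: a lazy-deletion binary heap stands in for the sorted dynamic array Q, the update of Lines 10-13 is realized as a decrement (equivalent to recomputing Eqn. 2 after a removal), and samples sharing the current minimum density are eroded together with the same level, as described above.

import heapq

def boundary_erosion(neighbors):
    """Alg. 1 sketch: erode samples in order of dynamic boundary density.

    neighbors[i] is the r-NN list of sample i (self excluded), i.e. the
    graph G. Returns levels[i], the boundary level at which x_i is eroded.
    """
    n = len(neighbors)
    # Reverse r-NN graph G*: reverse[j] lists the samples that keep j in
    # their neighbor lists, so updates after a removal are cheap.
    reverse = [[] for _ in range(n)]
    for i, lst in enumerate(neighbors):
        for j in lst:
            reverse[j].append(i)
    # Initially the dynamic density rho* equals the plain density of Eqn. 1.
    rho = [len(lst) for lst in neighbors]
    alive = [True] * n
    heap = [(rho[i], i) for i in range(n)]
    heapq.heapify(heap)            # stands in for the sorted dynamic array Q
    levels = [0] * n
    level = 0
    while heap:
        d, i = heapq.heappop(heap)
        if not alive[i] or d != rho[i]:
            continue               # stale entry left behind by lazy deletion
        # Erode every sample that currently shares the minimum density;
        # they all receive the same boundary level.
        batch = [i]
        while heap and heap[0][0] == d:
            d2, j = heapq.heappop(heap)
            if alive[j] and d2 == rho[j]:
                batch.append(j)
        for j in batch:
            alive[j] = False
            levels[j] = level
        # Eqn. 2 update: each surviving neighbor loses one density count.
        for j in batch:
            for k in reverse[j]:
                if alive[k]:
                    rho[k] -= 1
                    heapq.heappush(heap, (rho[k], k))
        level += 1
    return levels

The heap keeps each update at O(log n), which is consistent with the complexity analysis given in Section 4.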
Discussion. The advantages of boundary erosion are severalfold. Firstly, the erosion always takes place on the region of lowest density. It therefore guarantees that the boundaries between clusters are drawn along the most likely regions; in this sense, the global optimality of this process is reached, although it follows a greedy strategy. Secondly, the bit-by-bit erosion allows the boundaries between clusters to be drawn gradually instead of at once, which is appropriate when the boundaries are not clear at the beginning. More importantly, the gradual erosion produces an ordered sequence that sorts samples from boundary to center, the reverse of which regularizes a roadmap for cluster expansion. In the whole process, no kernels or heuristic rules are introduced, which avoids any unnecessary assumptions on the data distribution or the metric space.
Once the sequence of boundary levels is produced, the clustering process becomes natural and can be conveniently undertaken. It is basically a process of cluster expansion that starts from the peaks of the boundary levels (given in Alg. 2). The propagation starts from the sample with the highest boundary level. Sample x_i is assigned a new cluster label if none of its neighbors in G[i] are labeled. Otherwise, the sample is assigned the same cluster label as its closest neighbor that was labeled in the previous rounds. Likewise, the unlabeled samples are sequentially visited following the boundary levels from high to low. The process continues until all the samples are assigned a label. In this process, the expansion of one cluster stops automatically when it reaches the cluster boundary, where samples from the other cluster hold higher boundary levels. An illustration of this propagation procedure is given by movie S2 in the supplementary materials.

Algorithm 2: Label propagation based on the r-NN graph

Data: Boundary levels Ω, r-NN graph G
Result: Cluster labels L_{n×1}
1:  Sort Ω by l in descending order;
2:  C ← 1, L ← 0;
3:  while Ω ≠ ∅ do
4:      Pop ⟨x_i, l⟩ from Ω;
5:      for each x_j in G[i] do
6:          if L[j] > 0 then
7:              L[i] ← L[j]; break;
8:          end
9:      end
10:     if L[i] == 0 then
11:         L[i] ← C; C ← C + 1;
12:     end
13: end

As a summary, the proposed clustering process consists of three steps. Firstly, given the radius of the neighborhood r, a nearest neighbor graph is built, in which a list of neighbors falling within range r is kept for each sample. With the support of the nearest neighbor graph, the sequential boundary levels are produced by the boundary erosion process in the second step. Finally, clusters are produced by propagating cluster labels sequentially from samples holding high boundary levels to those holding lower ones.

Similar to DBSCAN [11], mean-shift [6] and clusterDP [33], our algorithm is able to identify clusters of arbitrary shapes as well as the outliers. However, the proposed approach is more attractive from several points of view. On one hand, unlike DBSCAN, no heuristic rules are introduced, which makes the clustering insensitive to extra parameter settings. On the other hand, unlike mean-shift or clusterDP [33], no kernel is adopted in the density estimation, which makes it feasible for various types of metric spaces. Moreover, unlike DBSCAN or clusterDP [33], no cluster centers or cluster peaks are explicitly defined or specified. Instead, similar to affinity propagation [13], the cluster peaks and the clusters emerge gradually. Furthermore, the algorithm places no specification on the distance measure. As a consequence, unlike k-means [26], mean-shift [6] or the recent OCTD [44], it is feasible for various metric spaces as long as the density of the samples can be estimated.

The boundary erosion shares a similar motivation with "border-peeling" [3]; however, they are essentially different in three major aspects. Firstly, no kernel is introduced in our approach. Secondly, all samples are eroded out by the erosion process, whereas core points are reserved for cluster expansion in border-peeling. Finally, border-peeling relies on DBSCAN to reconstruct the clusters, while clustering in boundary erosion is undertaken via label propagation with the guidance of an r-NN graph.

In the above label propagation process, the same r-NN graph is used as in the boundary erosion process (Alg. 1). Alternatively, it is feasible to use a different r-NN graph in the label propagation. In some cases, the density of a sample is very low, and such samples are usually recognized as outliers by Alg. 1. However, in certain scenarios we may expect such outliers to be assigned to the clusters closest to them. To achieve that, the r-NN graph supplied to the above expansion procedure is revised: G[i] is augmented to the top-k nearest neighbors when the size of its nearest neighbor list is less than k, where k is another given parameter. In the experiment section, we show that this augmented propagation strategy is meaningful in certain circumstances.

According to our observation, boundary erosion fails only when samples from different clusters are mixed with each other. In this case, the assumption of this algorithm that clusters are separated by sparse regions actually breaks. However, it is possible to address this issue with recent subspace embedding [20]. The input high-dimensional data are first projected to a lower-dimensional and separable space by DSC-Net-L2 [20].
Boundary erosion is then applied on the projected data, as will be illustrated in the experiment section.
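Alg. 2 admits an equally short rendering. The sketch below is again our own illustration: it assumes each neighbor list G[i] is sorted by ascending distance, so the first labeled neighbor found is the closest one, and uses label 0 to mark "not yet labeled", mirroring the pseudocode reconstruction above.

def propagate_labels(levels, neighbors):
    """Alg. 2 sketch: propagate cluster labels from high boundary level to low.

    levels[i] comes from boundary_erosion; neighbors[i] is the r-NN list of
    sample i, sorted by ascending distance. Returns labels in {1, 2, ...}.
    """
    n = len(levels)
    labels = [0] * n
    next_cluster = 1
    # Visit samples by descending boundary level (cluster peaks first).
    for i in sorted(range(n), key=lambda s: -levels[s]):
        for j in neighbors[i]:
            if labels[j] > 0:
                labels[i] = labels[j]   # inherit from the closest labeled neighbor
                break
        if labels[i] == 0:
            labels[i] = next_cluster    # no labeled neighbor: a new cluster peak
            next_cluster += 1
    return labels

Under these assumptions, propagate_labels(boundary_erosion(G), G) reproduces the three-step pipeline of this section on a precomputed r-NN graph G; the augmented propagation described above would simply pass a graph whose short neighbor lists are topped up to the k nearest neighbors.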
4. Clustering at Large Scale
As presented in Alg. 1, the r-NN graph is required as the prerequisite of the boundary erosion process. The time complexity of calculating the r-NN graph can be as high as O(d · n^2). Moreover, in the worst case, the space complexity of keeping the r-NN graph is close to O(n^2), since one cannot assume in advance how many neighbors are located within range r. As a consequence, this algorithm becomes computationally inefficient when both n and d are large. To address this issue, an approximate solution is presented in this section.

r-NN Graph. As shown above, it is computationally expensive to calculate an exact r-NN graph, particularly in high-dimensional and large-scale cases. Many attempts have been made to seek approximate solutions to this issue. Thanks to the progress made in recent years, with the NN-Descent algorithm presented in [8] it is possible to construct a k-NN graph of high accuracy under an empirical complexity of O(d · n^{1.14}). More attractively, the algorithm places no specification on the distance measure, which is precisely in line with our clustering algorithm.

In our practice for large-scale clustering tasks, the first step of Alg. 1 (i.e., Line 1) is modified: NN-Descent [8] is called to produce an approximate k-NN graph, and the k-NN list of each sample is further pruned according to the given parameter r, which results in an approximate r-NN graph (see the sketch at the end of this section). The rest of the clustering process remains unaltered. In the experiment section, the results on large-scale image clustering are illustrated.

As presented in the previous sections, clustering via boundary erosion basically consists of three steps. In the first step, an r-NN graph is built. The time complexity of building an exact r-NN graph is O(d · n^2), which is feasible for low-dimensional and small-scale cases, while for high-dimensional and large-scale tasks NN-Descent [8] is adopted for the approximate r-NN graph construction, whose complexity is around O(d · n^{1.14}) [8]. In the second step, the boundary erosion process operates on the dynamic array Q. Each time, at least one sample with the lowest boundary density is removed from the array, and on average the $\bar{k}$ samples in the array that are influenced by the removal are updated, where $\bar{k}$ is the average number of samples that fall into a neighborhood of range r. For efficiency, the dynamic array can be implemented with a heap. The removal repeats at most n times. As a result, the time complexity of this step is $\sum_{i=0}^{n} \log(n + i \cdot \bar{k})$, which is on the O(n · log(n)) level.
In the label propagation step, it is clear that the complexity is only O(n). Overall, the complexity of the clustering algorithm is O(d · n^2 + n · log(n)) if one expects the exact solution, which is suitable for small-scale tasks. For large-scale cases, the complexity is only O(d · n^{1.14} + n · log(n)) with the support of NN-Descent k-NN graph construction, which is even more efficient than conventional k-means.
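The pruning that turns an approximate k-NN graph into an approximate r-NN graph is a one-liner. The sketch below is our own illustration; the (index, distance) list format is an assumption of the sketch, not a format mandated by [8].

def prune_to_rnn(knn_graph, r):
    """Prune an approximate k-NN graph into an approximate r-NN graph.

    knn_graph[i] is a list of (neighbor_index, distance) pairs for sample i,
    sorted by ascending distance (e.g. as produced by an NN-Descent run).
    Only neighbors within range r are kept; Alg. 1 and Alg. 2 then run on
    the pruned graph unchanged.
    """
    return [[j for j, dist in lst if dist <= r] for lst in knn_graph]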
5. Experiments
In the following, the performance of the proposed boundary erosion (BE) is studied in comparison to several state-of-the-art approaches on various evaluation benchmarks and tasks, such as synthetic data of different distributions, face image grouping, and clustering on biological as well as large-scale image data. For the large-scale image clustering part, our algorithm is implemented in C++, compiled with GCC 5.4, and run with a single thread on a PC.
The first experiment is conducted on six synthetic datasets, all of which are 2D spatial data points. The data points are drawn from probability distributions with non-spherical shapes. These datasets have been widely adopted to test the robustness of clustering algorithms. The results produced by our algorithm are presented in Fig. 3, together with the valid range of parameter r that reproduces the same results on each dataset. As shown in the figure, the proposed algorithm is able to identify all the clusters as well as the outliers in each case. In particular, satisfactory results are observed on the challenging datasets S3, Path and Jain, on which existing approaches hardly produce decent results.

Fig. 4 shows the results from the augmented propagation strategy. As shown in the figure, the results for datasets a, d and f are the same as in the previous experiment, while for datasets b, c and e, where outliers are present, the augmented propagation assigns the outliers to the clusters closest to them. This is meaningful when one prefers to produce clusters without isolated outliers. The results in Fig. 4 are also shown quantitatively in Tab. 1 in terms of clustering accuracy [44], compared to k-means (KMS), spectral clustering (SC) and order-constrained transitive distance clustering (OCTD). Perfect results are achieved on most of the datasets. Compared to the results from the most representative methods (in [44]), BE achieves the best performance in all cases.

In the following experiments, the results are produced with the r-NN graph without augmentation unless specified otherwise.
[Figure 3: panels (a) to (f)]
Figure 3. Clustering results on six synthetic datasets (best viewed in color). The datasets in Fig. 3(a) to (f) are namely Aggregation (AGG) [16], S3 [12], flame [14], sparil [4], Jain [18] and path-based (Path) [5] respectively. The valid range of r that allows reproducing the same results as shown is given for each of a to f.

Table 1. Clustering accuracy (%) on synthetic datasets. For BE, the augmented r-NN graph is adopted in label propagation, which allows isolated samples to be assigned to the closest cluster.

Method    AGG      S3       FLAME    SPARIL   PATH     JAIN
KMS
OCTD      99.87    U.A.
BE

Next, our algorithm is tested on the Brown dataset [41], which consists of biological data. In this dataset, an affinity matrix that keeps the pairwise distances between blasted protein sequences belonging to groups of families is supplied. In this case, algorithms such as k-means are not feasible, since they only work in l2 space. In this study, BE is compared to DIANA [23], AGNES [23], Hierarchical Clustering (HC) [38], Transitivity Clustering (TC) [40], clusterDP [33], clusterONE [30], Markov Clustering (MC) [9], k-Medoids (PAM) [23], Affinity Propagation (AP) [13], DBSCAN [11] and Spectral Clustering (SC) [35]. Twenty-eight clusters are produced by our algorithm when r = 60, and the resulting F-score is nearly perfect.
[Figure 4: panels (a) to (f)]
Figure 4. Clustering results on six synthetic datasets (best viewed in color) by boundary erosion with augmented propagation. The valid range of r that allows reproducing the same results as shown is given for each of a to f.

Table 2. Comparisons to state-of-the-art algorithms on the Brown dataset. The results of the state-of-the-art algorithms are cited from [41].

Method            F-score   Parameter settings
DIANA             0.991     metric = l, k = 26
AGNES             0.987     Complete-link, metric = l, k = 25
HC                0.987     Complete-link, k = 25
TC                0.986     T = 48.868
clusterDP         0.975     k = 25, dc = 258.645
clusterONE        0.946     s = 1, d = 0.0
MC                0.923     I = 2.196
k-Medoids (PAM)             k = 37
AP                0.910     dampfact = 0.845, preference = 80.827, maxits = 5000, convits = 500
DBSCAN            0.680     eps = 323.306, MinPts = 1
SC                0.656     k = 11
BE                          r = 60

This is also the best performance ever reported according to [41], as shown in Table 2. Affinity propagation [13] only achieves an F-score considerably worse than that of our algorithm, and our algorithm also outperforms clusterDP [33].
Figure 5. Clustering results on the groups of the Olivetti Face Database (best viewed in color). Faces with the same cover color are clustered into the same cluster. The cluster centers are labeled with a white circle.

Our algorithm is also tested on two face datasets, namely the Olivetti Face Database (ORL) [37] and Extended YaleB (EYaleB) [25], and on two visual object image datasets, namely COIL20 [29] and COIL100 [28]. On these four datasets, clustering algorithms are expected to identify images that come from the same object groups. Since the images are not directly separable by their pixel intensities (i.e., RGB), the images are projected to a low-dimensional feature space by DSC-Net-L2 [20], and our algorithm (BE) is adopted in the final clustering stage. In the experiments, DSC-Net-L2 in combination with BE (denoted as DSC+BE) is compared to sparse spectral clustering (SSC) [10], DSC-Net-L2 in combination with spectral clustering (denoted as DSC+SC) and the standard configuration of DSC-Net-L2 based clustering [20], in which a discriminative variant of spectral clustering is integrated. The clustering error rates [20] are shown in Table 3. As shown in the table, DSC+BE outperforms or is very close to the best results ever reported on these datasets.

Overall, superior performance is achieved by BE in all experiments and on different categories of data, which is essentially attributed to its extraordinary capability of identifying clusters in arbitrary shapes and to the genericness of its model.

Table 3. Clustering error rates (%) on four image datasets. The radius r of BE is set individually for each of the four datasets.
Datasets   SSC      DSC      DSC+SC   DSC+BE
ORL        32.50    14.00    15.16
EYaleB     27.51
COIL20
COIL100    45.00
Figure 6. Sample image groups that have been successfully identified by our algorithm on the YFCC subset.
In this section, the effectiveness of the proposed clustering algorithm is verified on an image clustering/linking task. A subset of YFCC100M [39], with images on the million scale, is adopted for evaluation. The images are represented with deep features from HybridNet [1], reduced by PCA. In the clustering, NN-Descent is called to build the approximate r-NN graph for the YFCC subset. In the experiments, top-k and r are fixed for the whole subset, and the augmented propagation is adopted in the cluster expansion stage, which avoids isolating similar images that are under severe transformations.

The r-NN graph construction and the production of the clusters finish in a matter of minutes; in contrast, the same task would take hours for k-means. Most of the clusters produced by our algorithm are meaningful, and some clusters contain a large number of images. Since no ground truth is available, only three sample groups are shown in Fig. 6. As shown in the figure, the algorithm performs reasonably well even with the support of an approximate r-NN graph. According to our observation, the small clusters are comprised of near-duplicate images, which is highly helpful for large-scale image linking tasks.
6. Conclusion
Boundary erosion is a process of unveiling the natural structures of potential clusters. The erosion starts from the boundaries of the clusters and invades inwards until it reaches all the density peaks. It therefore produces a sequential order following which the clusters are naturally reconstructed. In the whole process, only one parameter is involved, namely the radius of the neighborhood r. The density peaks, the corresponding clusters and the cluster boundaries emerge automatically. The effectiveness of the algorithm has been verified on various clustering tasks and at different scales. Due to its simplicity, genericness and speed efficiency, as well as its superior performance across various datasets, this algorithm will find its value in various science and engineering tasks.
7. Acknowledgments
This work is supported by the National Natural Science Foundation of China under grant 61572408.
References

[1] G. Amato, F. Falchi, C. Gennaro, and F. Rabitti. YFCC100M hybridnet fc6 deep features for content-based image retrieval. In The 2016 ACM Workshop on Multimedia COMMONS, pages 11–18, 2016.
[2] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In ACM SIGMOD Record, pages 49–60. ACM, 1999.
[3] N. Bar, H. Averbuch-Elor, and D. Cohen-Or. Border-peeling clustering. CoRR, abs/1612.04869, 2016.
[4] H. Chang and D.-Y. Yeung. Robust path-based spectral clustering. Pattern Recognition, 41(1):191–203, January 2008.
[5] H. Chang and D.-Y. Yeung. Robust path-based spectral clustering. Pattern Recognition, 41(1):191–203, 2008.
[6] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, August 1995.
[7] K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang. Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In IEEE International Conference on Computer Vision, pages 5747–5756. IEEE, 2017.
[8] W. Dong, C. Moses, and K. Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In International Conference on World Wide Web, pages 577–586, March 2011.
[9] S. Dongen. A cluster algorithm for graphs. Technical report, CWI (Centre for Mathematics and Computer Science), Amsterdam, The Netherlands, 2000.
[10] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 226–231, 1996.
[12] P. Fränti and O. Virmajoki. Iterative shrinking method for clustering problems. Pattern Recognition, 39(5):761–775, May 2006.
[13] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315:972–976, February 2007.
[14] L. Fu and E. Medico. FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinformatics, 8(3), January 2007.
[15] J. Gan and Y. Tao. DBSCAN revisited: Mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 519–530, 2015.
[16] A. Gionis, H. Mannila, and P. Tsaparas. Clustering aggregation. ACM Transactions on Knowledge Discovery from Data, 1(1), March 2007.
[17] R. Greenlaw and S. Kantabutra. Survey of clustering: Algorithms and applications. International Journal of Information Retrieval and Resources, 3(2):1–29, April 2013.
[18] A. K. Jain and M. H. Law. Data clustering: A user's dilemma. PReMI, 3776:1–10, 2005.
[19] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, January 2011.
[20] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid. Deep subspace clustering networks. In Advances in Neural Information Processing Systems, pages 23–32, 2017.
[21] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. In International Joint Conference on Artificial Intelligence, 2017.
[22] G. Karypis, E.-H. S. Han, and V. Kumar. Chameleon: Hierarchical clustering using dynamic modeling. Computer, 32(8):68–75, August 1999.
[23] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[24] H.-P. Kriegel, P. Kröger, J. Sander, and A. Zimek. Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(3):231–240, 2011.
[25] K.-C. Lee, J. Ho, and D. J. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):684–698, 2005.
[26] S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137, March 1982.
[27] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36:2227–2240, 2014.
[28] S. Nene, S. Nayar, and H. Murase. Columbia object image library (COIL-100). Technical Report CUCS-006-96, 1996.
[29] S. A. Nene, S. K. Nayar, H. Murase, et al. Columbia object image library (COIL-20). Technical Report CUCS-005-96, 1996.
[30] T. Nepusz, H. Yu, and A. Paccanaro. Detecting overlapping protein complexes in protein-protein interaction networks. Nature Methods, 9(5):471–472, 2012.
[31] C. Otto, D. Wang, and A. Jain. Clustering millions of faces by identity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[32] T. Pei, A. Jasra, D. J. Hand, A.-X. Zhu, and C. Zhou. DECODE: A new method for discovering clusters of different densities in spatial data. Data Mining and Knowledge Discovery, 18(3):337–369, 2009.
[33] A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, June 2014.
[34] J. B. Roerdink and A. Meijster. The watershed transform: Definitions, algorithms and parallelization strategies. Fundamenta Informaticae, 41(2):187–228, April 2000.
[35] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, August 2000.
[36] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In IEEE International Conference on Computer Vision, pages 1470–1477, October 2003.
[37] F. S. Samaria and A. C. Harter. Parameterisation of a stochastic model for human face identification. In IEEE Workshop on Applications of Computer Vision, pages 138–142, 1994.
[38] R Core Team. R: A language and environment for statistical computing. Technical report, R Foundation for Statistical Computing, 2012.
[39] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, February 2016.
[40] T. Wittkop, D. Emig, S. Lange, S. Rahmann, M. Albrecht, J. H. Morris, S. Böcker, J. Stoye, and J. Baumbach. Partitioning biological data with transitivity clustering. Nature Methods, 7(6):419–420, 2010.
[41] C. Wiwie, J. Baumbach, and R. Röttger. Comparing the performance of biomedical clustering methods. Nature Methods, 12(11):1033–1038, 2015.
[42] J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pages 478–487, 2016.
[43] R. Xu and D. I. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, May 2005.
[44] Z. Yu, W. Liu, W. Liu, Y. Yang, M. Li, and B. V. K. V. Kumar. On order-constrained transitive distance clustering. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 2293–2299. AAAI Press, 2016.
[45] Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55:311–331, 2004.
[46] L. Zheng, T. Li, and C. Ding. A framework for hierarchical ensemble clustering. ACM Transactions on Knowledge Discovery from Data, 9(2):9:1–9:23, September 2014.
[47] Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, 1st edition, 2012.