Selecting the Number of Clusters K with a Stability Trade-off: an Internal Validation Criterion
Alex Mourer, Florent Forest, Mustapha Lebbah, Hanane Azzag, Jérôme Lacaille
Alex Mourer∗ — SAMM, Université Paris 1 Panthéon Sorbonne; Safran Aircraft Engines — [email protected]
Florent Forest∗ — LIPN, Université Sorbonne Paris Nord; Safran Aircraft Engines — [email protected]
Mustapha Lebbah, Hanane Azzag
LIPN, Université Sorbonne Paris Nord lebbah,[email protected]
Jérôme Lacaille
Safran Aircraft Engines [email protected]
Abstract
Model selection is a major challenge in non-parametric clustering. There is no universally admitted way to evaluate clustering results, for the obvious reason that there is no ground truth against which results could be tested, as in supervised learning. The difficulty of finding a universal evaluation criterion is a direct consequence of the fundamentally ill-defined objective of clustering. In this perspective, clustering stability has emerged as a natural and model-agnostic principle: an algorithm should find stable structures in the data. If data sets are repeatedly sampled from the same underlying distribution, an algorithm should find similar partitions. However, it turns out that stability alone is not a well-suited tool to determine the number of clusters. For instance, it is unable to detect if the number of clusters is too small. We propose a new principle for clustering validation: a good clustering should be stable, and within each cluster, there should exist no stable partition. This principle leads to a novel internal clustering validity criterion based on between-cluster and within-cluster stability, overcoming limitations of previous stability-based methods. We empirically show the superior ability of additive noise to discover structures, compared with sampling-based perturbation. We demonstrate the effectiveness of our method for selecting the number of clusters through a large number of experiments and compare it with existing evaluation methods.
Clustering is a widely used unsupervised learning technique which aims at discovering structure in unlabeled data. It can be defined as the "partitioning of data into groups (a.k.a. clusters) so that similar (or close w.r.t. the underlying distance function) elements share the same cluster and the members of each cluster are all similar (or, equivalently, dissimilar elements are separated into different clusters)" [1]. This goal is contradictory because of the non-transitivity of the notion of similarity: if A is similar to B, and B is similar to C, A is not necessarily similar to C. Since clustering is an ill-posed problem, it cannot be properly solved using this definition, and clustering algorithms often optimize only one of its aspects. For instance, K-means [2] only guarantees that dissimilar objects are separated, by minimizing within-cluster distances. On the other hand, single linkage clustering only guarantees that similar objects will end up in the same cluster. As a consequence, model selection is a major challenge in non-parametric clustering [1].

∗ Both authors contributed equally. Preprint. Under review.

In the sample-based clustering framework we adopt in this work, model selection means assessing whether the partitions found by an algorithm correspond to meaningful structures of the underlying distribution, and not just artifacts of the algorithm or of the sampling process [3, 4]. Practitioners need to evaluate clustering results in order to select the best parameters for an algorithm (e.g. the number of clusters K) or choose between different algorithms. Plenty of evaluation methods exist in the literature, but they usually incorporate strong assumptions on the geometry of clusters (e.g. compact, spherical clusters) or on the underlying distribution, which are specific to the algorithm or to an application. There is a need for a general, model-agnostic evaluation method. Clustering stability has emerged as a principle stating that "to be meaningful, a clustering must be both good and the only good clustering of the data, up to small perturbations. Such a clustering is called stable. Data that contains a stable clustering is said to be clusterable" [5]. Hence, a clustering algorithm should discover stable structures in the data. In statistical learning terms, if data sets are repeatedly sampled from the same underlying distribution, an algorithm should find similar partitions. As we do not have access to the data-generating distribution in model-free clustering, perturbed data sets are obtained either by sampling or by injecting noise into the original data. Stability seems to be an elegant principle, but there are still severe limitations making it difficult to use in practice. For instance, stability does not necessarily depend on clustering outcomes but can be solely related to properties of the data such as symmetries [6]. As outlined in [7], there exist various protocols to compute stability scores. Unfortunately, a thorough study that compares and evaluates them in practice does not exist.

Contributions
We propose a method for quantitatively and visually assessing the presence of structure in clustered data. The main contributions of our work can be stated as follows:
• To our knowledge, this is the first large-scale empirical study on clustering stability analysis.
• A novel definition of clustering is proposed, based on between-cluster and within-cluster stability, along with a concrete approach to implement it for a large class of algorithms.
• Based on this definition, we introduce the stability difference criterion, Stadion, an internal clustering validation index coming with an interpretable visualization tool, stability paths.
• In our experiments, we assess the ability of Stadion to both discover structure and select the right number of clusters on a huge collection of data sets and compare it with widely used internal indices in the largest benchmark ever conducted on clustering stability analysis.
• We show that in this context, only additive noise perturbation is reliable, and a methodology to determine the amount of perturbation is proposed.
• A study on the influence of parameters, such as the number of perturbed samples, is conducted and we provide guidelines to users.
Internal clustering indices measure the quality of a clustering when ground-truth labels are unavailable, which is mostly the case in unsupervised data exploration. The majority of internal criteria rely on a combination of between-cluster and within-cluster distances. Between-cluster distance measures how distinct clusters are dissimilar or far apart, while within-cluster distance measures how elements belonging to the same cluster are similar, i.e. the coherence of the cluster. Unfortunately, this incorporates a prior on the geometry of clusters [8, 9, 10, 11, 12, 13, 14].

Stability analysis for clustering validation is a long-established technique. It can be traced back as far as 1973 [15], took off with influential works [16, 17] and from there has drawn increasing attention, culminating with [6, 18, 19] and [7]. These works concluded that stability is not a well-suited tool for model selection in clustering [4]. In the general case, stability can only detect if the number of clusters is too large for the K-means algorithm (see Figure 1). A partition with too few clusters is indeed stable, except for perfectly symmetric distributions. More accurately, these works proved that the asymptotic stability of risk-minimizing clustering algorithms, as sample size grows to infinity, only depends on whether the objective function has one or several global minima.

Despite significant theoretical efforts, few empirical studies have been conducted, and each focuses on specific aspects of clustering stability. For example, [20, 21, 16, 17] investigated perturbation by random subsampling of the original data set without replacement. Stability in a model-based framework was studied in [22]. Perturbation by random projections [23] and random noise [24, 25, 26] were also considered. Overall, stability demonstrated its effectiveness, but no clear comparison between methods was done. As mentioned in [7, 27], a thorough study comparing all the different protocols in practice does not exist, and a more objective evaluation of these results is warranted.

Clustering stability is analyzed in a standard statistical learning setting. A data set X = {x_1, ..., x_N} consists of N independent and identically distributed (i.i.d.) samples, drawn from a data-generating distribution P on an underlying space 𝒳. Formally, a clustering algorithm A takes as input the data set X and some parameter K, and outputs a clustering C_K = {C_1, ..., C_K} of X into K disjoint sets. Thus, a clustering can be represented by a function X → {1, ..., K} assigning a label to every point of the input data set. Some algorithms can be extended to construct a partition of the entire underlying space. This partition is represented by an extension operator, a function 𝒳 → {1, ..., K} (e.g. for center-based algorithms, we compute the distance to the nearest center).

Let X and X' be two different data sets drawn from the same distribution, and denote by C_K and C'_K their respective clusterings. Let s be a similarity measure such that s(C_K, C'_K) measures the agreement between the two clusterings. Possible choices for this measure are detailed below. Then, for a given sample size N, the stability of a clustering algorithm A is defined as the expected similarity between two clusterings C_K, C'_K on different data sets X and X', sampled from the same distribution P:

Stab(A, K) := E_{X, X' \sim P^N} [ s(C_K, C'_K) ].   (1)

The expectation is taken with respect to the i.i.d. sampling of the sets from P.
This quantity is unavailable in practice, as we have a finite number of samples, so it needs to be estimated empirically. Various methods have been devised to estimate stability using perturbed versions of X. The first methods used in the literature are based on resampling the original data set, with or without replacement (splitting in half [15], subsampling [16], bootstrapping [28], jackknife [29], etc.). Another method consists in adding random noise, either to the original data points [25] or to their pairwise distances (additive or multiplicative) [30, 31, 32, 33, 34]. For high-dimensional data, other alternatives are random projections or randomly adding or deleting variables [15]. Once the perturbed data sets are generated, there are several ways to compare the resulting clusterings. With noise-based methods, it is possible to compare the clustering of the original data set (reference clustering) with the clusterings obtained on perturbed data sets, or to compare only clusterings obtained on the latter. With sampling-based methods, we can compare overlapping subsamples on data points where both clusterings are defined [28], or compare clusterings of disjoint subsamples (using for instance an extension operator or a supervised classifier to transfer labels from one sample to another [17]). Finally, to compute a similarity score between two partitions, common choices are the adjusted Rand index (ARI) [28, 35], Fowlkes-Mallows (FM), Jaccard [16], Normalized Mutual Information (NMI) [36], Minimal Matching Distance [17], or Variation of Information [37].

Before discussing in detail the mechanisms of stability, we introduce a trivial example to illustrate its main issue: it cannot detect, in general, whenever K is too small. Consider the example presented in Figure 1 with three clusters, two of them closer to each other than to the third one. On any sample from such a distribution, as soon as we have a reasonable amount of data, K-means with K = 2 always constructs the solution separating the left cluster from the two right clusters. Consequently, it is stable despite K = 2 being the wrong number of clusters. This situation was pointed out in [6].

(a) K = 2 (stable) (b) K = 3 (stable) (c) K = 4 (unstable, jittering)
Figure 1: Example data set with three clusters. Random letter shapes have been chosen to avoid any artificial symmetries that could arise with spherical Gaussians. The labels correspond to the K-means clustering result for K = 2, 3 and 4. K-means is stable even if the number of clusters is too small.

Discussion. Stability is determined by the number of data points changing clusters. In the case of algorithms that minimize an objective function (e.g. center-based or spectral clustering), two different sources of instability have been identified [7]. First, jittering is caused by data points changing sides at cluster boundaries after perturbation. Therefore, strong jitter is produced when a cluster boundary cuts through high-density regions. Second, jumping refers to the algorithm ending up in different local minima. The most important cause of jumping is initialization (see Figure 6 in Appendix A for an example). Another cause is the existence of several global minima of the objective function on the underlying distribution. This happens only if there are perfect symmetries in the distribution (see Figure 5 in Appendix A), which is extremely unlikely for real-world data sets. However, practitioners mainly use algorithms with consistent initialization strategies.
For instance, with K-means we keep the best trial over a large number of runs and use the K-means++ seeding heuristic [38]. This initialization tends to make K-means deterministic and its effectiveness has been proven in practice. Thus, it is different from the initialization proposed in [7, 39] which allows jumping to occur whenever K > K⋆, where K⋆ is the true number of clusters. Throughout this work, we consider a setting with a large enough sample size, without perfect symmetries and with effective initialization. Thus, we do not consider jumping as the main source of instability even when K > K⋆, and rather believe that jittering plays a major role. It captures useful information about a clustering, i.e. densities at boundaries, and also seems fundamental and related to supervised learning. As a consequence, we need a perturbation process that produces jittering. Unfortunately, as soon as N is reasonably large, resampling methods become trivially stable whenever there is a single global minimum [6, 7]. See Appendix A for an example where sampling methods fail. We summarize important results in the diagram of Figure 2.
Legend of Figure 2, "Justified by or related to": (a) Theorem 10 (Stability theorem) [1]; Lemma 1 (Stability and global optima of the objective function) [3]. (b) Theorem 15 (Instability from symmetry) [1]; Lemma 1 (Stability and global optima of the objective function) [3]. (c) Theorem 4 (High instability implies cut in high density region) [2]; Conclusion 3 (Instable clusterings) [3]. (d) Conjecture 4 (Stable clusterings) [3]; Conclusion 5 (Stability of idealized K-means detects whether K is too large) [3]. (e) Conjecture 8 (Stability of the actual K-means algorithm) [3]. Diagram references: [1] Ben-David and von Luxburg (2006); [2] Ben-David and von Luxburg (2008); [3] von Luxburg (2010). Notations: SYM: symmetries in the data distribution; INIT: effective initialization scheme; PERT: perturbation process.
Figure 2: Diagram explaining the various sources of instability in different settings for K-means with large sample size, assuming K ≪ N and that the underlying distribution has K⋆ well-separated clusters that can be represented by K-means. We consider no symmetries, effective initialization and noise-based perturbation; thus instability (due to jittering) arises when K is too large, and sometimes when K is too small, whenever cluster boundaries are in high-density regions.

To conclude, for K-means in our setting, the perturbation process causes jittering and more rarely jumping (in experiments, we seldom observed jumping when K is too large), enabling stability to indicate whenever K is too large. On the other hand, stability cannot in general detect when K is too small. Despite a lack of theoretical guarantees, these concepts should apply to other algorithms. In order to overcome this limitation of stability, we introduce a novel concept of within-cluster stability.

A clustering algorithm applied with the same parameters to perturbed versions of a data set should find the same structure and obtain similar results. The stability principle described by (1) relies on between-cluster boundaries and we thus call it between-cluster stability. Therefore, it cannot detect structure within clusters. In Figure 1, K = 2 is stable, whereas one cluster contains two sub-clusters. This sub-structure cannot be detected by between-cluster stability alone. Obviously, this implies that stability is unable to decide whether a data set is clusterable or not (i.e. when K⋆ = 1), which is a severe limitation. For this very reason, we introduce a second principle of within-cluster stability: clusters should not be composed of several sub-clusters. This implies the absence of stable structures inside any cluster. In other words, any partition of a cluster should be unstable. The combination of these two principles leads to a new definition of a clustering:

Definition 4.1
Clustering: A clustering is a partitioning of data into groups (a.k.a. clusters) so that the partition is stable, and within each cluster, there exists no stable partition.
Then, a clustering should have a high between-cluster stability and a low within-cluster stability. Despite their apparent simplicity, implementing these principles is a difficult task. As seen in the last section, between-cluster stability can be estimated in many different ways; however, not all of them are effective. On the other hand, within-cluster stability is a challenging quantity to define and estimate. We propose a method to estimate both quantities, and then detail and discuss our choices.
Let {X_1, ..., X_D} be D perturbed versions of the data set, obtained by adding random noise to the original data set X. Between-cluster stability of algorithm A with parameter K estimates the expectation (1) by the empirical mean of the similarities s between the reference clustering C_K = A(X, K) and the clusterings of the perturbed data sets:

Stab_B(A, X, C_K, K) := (1/D) \sum_{d=1}^{D} s(C_K, A(X_d, K)).   (2)

Since s is a similarity measure, this quantity needs to be maximized (and conversely with a dissimilarity measure). In order to define within-cluster stability, we need to assess the presence of stable structures inside each cluster. To this aim, we propose to cluster again the data within each cluster of C_K. Formally, let Ω ⊂ N* be a set of numbers of clusters. The k-th cluster in the reference clustering is noted C_k, its number of elements N_k, and Q^{(k)}_{K'} = A(C_k, K') denotes a partition of C_k into K' clusters. Within-cluster stability of algorithm A is defined as

Stab_W(A, X, C_K, K, Ω) := \sum_{k=1}^{K} \left( (1/|Ω|) \sum_{K' \in Ω} Stab_B(A, C_k, Q^{(k)}_{K'}, K') \right) × (N_k / N).   (3)

As a good clustering is unstable within each cluster, this quantity needs to be minimized. Hence, we propose to build a new validity index combining between-cluster and within-cluster stability. A natural choice is the difference between both quantities. We call this index
Stadion, standing for stability difference criterion. For the sake of brevity, A, K and X are omitted in the notations:

Stadion(C_K, Ω) := Stab_B(C_K) − Stab_W(C_K, Ω).   (4)

Since we use an effective initialization scheme, the same partition C_K is used in both terms of (4). Thus, Stadion evaluates the stability of an algorithm w.r.t. a reference partition.
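To make equations (2)-(4) concrete, the sketch below computes the criterion for one value of K and one noise level ε, using K-means, uniform ε-AP and the ARI. It is a minimal illustration under our own naming and defaults (in particular, Ω = {2, ..., 5} is an arbitrary example value here), not the implementation used in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def perturb(X, eps, rng):
    # epsilon-Additive Perturbation (eps-AP) with uniform noise of amplitude eps;
    # the data are assumed to be scaled to zero mean and unit variance beforehand.
    return X + rng.uniform(-eps, eps, size=X.shape)


def stab_between(X, labels_ref, K, eps, D=10, seed=0):
    # Eq. (2): mean ARI between the reference partition of X and the partitions
    # obtained by re-clustering D perturbed copies of X.
    rng = np.random.default_rng(seed)
    sims = [
        adjusted_rand_score(
            labels_ref,
            KMeans(n_clusters=K, n_init=10).fit_predict(perturb(X, eps, rng)),
        )
        for _ in range(D)
    ]
    return float(np.mean(sims))


def stab_within(X, labels_ref, eps, omega=(2, 3, 4, 5), D=10, seed=0):
    # Eq. (3): re-cluster each cluster C_k into K' parts for every K' in Omega,
    # average the resulting between-cluster stabilities, and weight by N_k / N.
    total = 0.0
    for k in np.unique(labels_ref):
        C_k = X[labels_ref == k]
        stabs = []
        for K_prime in omega:
            if len(C_k) <= K_prime:
                continue  # a cluster cannot be split into more parts than it has points
            q_k = KMeans(n_clusters=K_prime, n_init=10).fit_predict(C_k)
            stabs.append(stab_between(C_k, q_k, K_prime, eps, D, seed))
        if stabs:
            total += np.mean(stabs) * len(C_k) / len(X)
    return total


def stadion(X, K, eps, omega=(2, 3, 4, 5), D=10, seed=0):
    # Eq. (4): stability difference criterion for one K and one noise level eps.
    labels_ref = KMeans(n_clusters=K, n_init=10).fit_predict(X)
    stab_b = stab_between(X, labels_ref, K, eps, D, seed)
    stab_w = stab_within(X, labels_ref, eps, omega, D, seed)
    return stab_b - stab_w
```

Higher values indicate a partition that is stable as a whole while containing no stable sub-structure.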
How to perturb data? In our realistic setting (see Figure 2), neither jumping nor jittering will occur if the data are perturbed by sampling processes, as soon as there is enough data. We show on a simple example that sampling-based methods such as [16, 17] cannot work in the general case (see Appendix A). Therefore, only noise-based perturbation is considered here. Among these, we adopt the ε-Additive Perturbation (ε-AP) with Gaussian or uniform noise for this work, assuming variables are scaled to zero mean and unit variance. The number of perturbed samples D can be kept very low and still gives reliable estimates. An analysis of the influence of D showed that even very small numbers (D = 1) lead to great performance (see Appendix B).

How to choose ε? A central trade-off has to be taken into account when perturbing the data set. If the ε-AP is too strong, we might alter the very structure of the data. If, on the contrary, the ε-AP is too small, the clustering algorithm will always obtain identical results, inevitably leading to stability. Although setting this value is not crucial according to [25], compared with choosing a subsample size, we still believe it is somewhat arbitrary and in a way implicitly defines what a clustering is. As in Example 9 in Appendix A, if ε is too large, the two closest clusters will be merged under our stability principle. Hence, in a way, ε-AP defines a threshold distance below which two data points are similar and should belong to the same cluster. We propose to circumvent this issue by not choosing a single value for the level of noise ε, but a grid of possible values. By gradually increasing ε from 0 to a value ε_max, we obtain what we call a stability path, i.e. the evolution of stability as a function of ε. This method has one crucial advantage: it allows comparing partitions for different values of ε without the necessity of choosing one. However, it comes with two drawbacks: setting both the fineness and the maximum value of the grid. In our experiments, the fineness does not play a major role in the results. A straightforward method to fix a maximum value ε_max, beyond which comparisons are not meaningful anymore, is as follows. The perturbation corresponding to ε_max is meant to destroy the cluster structure of the original data. This corresponds to the value where the data are no longer clusterable, i.e. K = 1 becomes the best solution w.r.t. Stadion. A first guess at ε_max = √p (where p is the data dimension) works well in practice. We found that visualizing the stability paths (see Figure 3) is appealing and greatly helps interpreting the structures found by an algorithm, hence improving the usefulness of results.
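Continuing the sketch above, stability paths and their aggregation (Stadion-max or Stadion-mean, discussed further below) could be written as follows; the ε grid and the range of K are illustrative, and in practice the grid should extend to ε_max, i.e. the level where K = 1 becomes the best solution.

```python
import numpy as np


def stadion_paths(X, K_range=range(1, 7), eps_grid=np.linspace(0.0, 1.6, 9), **kwargs):
    # One Stadion path per K: the criterion evaluated along the grid of noise levels.
    # `stadion` is the function defined in the previous sketch.
    return {K: np.array([stadion(X, K, eps, **kwargs) for eps in eps_grid]) for K in K_range}


def select_k(paths, how="mean"):
    # Aggregate each path over eps (Stadion-max or Stadion-mean) and return the K
    # with the highest aggregated score, together with all scores.
    agg = np.max if how == "max" else np.mean
    scores = {K: float(agg(path)) for K, path in paths.items()}
    return max(scores, key=scores.get), scores
```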
Which data to compare? A strong ε-AP can alter the data: it can destroy structure, and close-by clusters can merge faster than others. Therefore, pairwise comparison between perturbed data sets may become unreliable, and we consider only comparisons between the original and the perturbed data sets. As stated in (2), we compute similarities between the reference and perturbed partitions.
How to compare partitions? The similarity measure s chosen to compare two partitions is the ARI. A total of 16 different similarity and distance measures (such as the NMI or FM) are compared in Appendix B, and the ARI achieved the best results. Its value lies in [0, 1], thus Stadion takes values in the [−1, 1] range, with 1 corresponding to the best clustering and −1 to the worst.
How to aggregate the Stadion path? To compute a scalar validity index for model selection, the Stadion path must be aggregated over the noise strength ε, from 0 to ε_max (the point where the solution for K = 1 has the highest Stadion among all solutions). Two aggregation strategies, the maximum (Stadion-max) and the mean (Stadion-mean), are evaluated in our experiments.

The within-cluster stability is governed by the parameter Ω, which detects stable structures inside the clusters of C_K. As these are unknown, averaging over several different values in Ω gives a better estimate. In the absence of sub-clusters, all partitions will be unstable because cluster boundaries will be placed in high-density regions. For the opposite reason, in the presence of sub-clusters, at least some partitions will result in higher stability, thus increasing the within-cluster stability. The analysis of influence conducted in Appendix B showed that Ω has low impact on Stadion results and can be set easily.

An important assumption behind our implementation of within-cluster stability is that, for non-clusterable structures (w.r.t. an algorithm), the algorithm must place cluster boundaries in high-density regions to produce instability through jittering. This encompasses a wide range of algorithms such as center-based, spectral or Ward linkage clustering which, for the sake of saving cost, would cut through dense clouds of points. If this requirement is not fulfilled, it is unclear whether this method will work. For instance, single linkage cannot be evaluated this way, since it may build two-cluster partitions of sizes 1 and N − 1, where the boundary lies at the frontier of the cluster.

Finally, the motivation for using the same algorithm to cluster again each cluster is that an algorithm should evaluate itself. For instance, one could use a clustering algorithm to estimate within-cluster stability different from the one used to compute stability between clusters, or one could train a supervised classifier on the cluster labels and then assess its stability [21, 17, 40]. However, it is not obvious what kind of bias would be introduced with this approach.

We begin this section by illustrating our method with K-means and uniform ε-AP on the example data set discussed previously (see Figure 1). Figure 3 displays between-cluster stability, within-cluster stability and Stadion as a function of the noise strength ε.
Figure 3: Between-cluster stability paths (top left), within-cluster stability paths (top right), Stadion paths (bottom left) and stability trade-off curve (bottom right) for K-means on the data set of Figure 1, for K ∈ {1, ..., 6}. ε is the amplitude of the uniform noise perturbation. The best solution K = 3 is selected either by taking the maximum or by averaging Stadion over ε until ε_max. The trade-off plot represents the averaged Stadion, between- and within-cluster stability as a function of K.

For reasonable amounts of noise, the solutions K = 1, K = 2 and K = 3 are all perfectly stable, showing the insufficiency of between-cluster stability alone to indicate whenever K is too small. The solutions for K ≥ 4 cut through the clusters and are thus unstable due to jittering. However, the solutions for K = 1 and K = 2 both have high within-cluster stability, caused by the presence of sub-clusters, which is not the case for K ≥ 3. By computing a difference, our criterion Stadion combines this information and is able to indicate the correct number of clusters (K = 3) by selecting the Stadion path with the highest maximum or mean value. Through its formulation, Stadion acts as a stability trade-off. The stability paths also give additional insights about the data structure. For example, we can read from the between-cluster stability path how the clusters successively merge together as ε increases. Supplementary examples are provided in Appendix A. Finally, the last graph (called the stability trade-off plot) represents Stadion-mean for different values of the parameter K.

Selecting K in K-means, GMM and Ward clustering

In this benchmark experiment, three algorithms are considered: K-means [2], Gaussian Mixture Models (GMM) [41] and Ward hierarchical clustering [42]. For K-means, two versions of Stadion are evaluated: the first one using the stability computation described in section 4 (referred to as the standard version), and the second one using the extension operator (referred to as the extended version). As seen in section 2, an extension operator extends a clustering to new data points. K-means extends naturally by computing the Euclidean distance to the centers. Hence, instead of re-running K-means for each perturbation of the data, we directly predict the cluster assignments of the perturbed data points. This approximation is sensible since we consider jittering as the main source of instability, and it spares computation time (see the complexity study in Appendix C). GMM allows a similar extension, by assigning points to the cluster with the highest posterior probability. It is the only version we consider here, due to the high computational cost of GMM which makes the standard version prohibitive. Although first experiments looked promising, the same limitation was encountered for spectral clustering [43], which unfortunately has no straightforward extension operator.
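As an illustration of the extended version for K-means, the sketch below replaces the re-clustering of each perturbed copy by the extension operator, i.e. assignment to the nearest center of the reference partition; again, the function name and defaults are ours, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def stab_between_extended(X, K, eps, D=10, seed=0):
    # Extended between-cluster stability for K-means: fit once on the original data,
    # then use the extension operator (nearest center) to label perturbed copies
    # instead of re-running the algorithm on each of them.
    rng = np.random.default_rng(seed)
    km_ref = KMeans(n_clusters=K, n_init=10).fit(X)
    sims = []
    for _ in range(D):
        X_d = X + rng.uniform(-eps, eps, size=X.shape)
        labels_d = km_ref.predict(X_d)  # extension operator: nearest reference center
        sims.append(adjusted_rand_score(km_ref.labels_, labels_d))
    return float(np.mean(sims))
```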
We evaluate clustering validation methods on a large collection of 73 artificial benchmark data sets, most of them extensively used in the literature. Data sets were selected so that the algorithms can achieve good partitions w.r.t. the true clusters. We also ensured different difficulty levels for model selection, obtained by varying the number of clusters, sizes, variances, shapes and the presence of noise and close-by or overlapping clusters. In addition, we present results on 7 real data sets, labeled into K⋆ ground-truth classes. It was surprising to discover how difficult it is to find real-world data that are clusterable into K⋆ clusters without preprocessing. First, the original features seldom have a cluster structure. Second, it may happen that the labels do not represent a natural partitioning of the data, unlike for artificial data sets. Thus, it was necessary to preprocess most real data sets. For instance, Crabs was preprocessed by a PCA [44] keeping only components two and three, as described in [45]. High-dimensional data sets such as images (MNIST, USPS) were reduced beforehand, using an autoencoder network in order to extract clusterable features, followed by UMAP [46] to obtain a two-dimensional representation, as introduced in [47]. This way, we ensure the labels truly represent clusters. Identical preprocessing settings were used across every method and data set, without any further tuning. An exhaustive description of the experimental setting, including data sets and preprocessing steps, is provided in Appendix F.

Table 1 summarizes results for all three algorithms. We compare Stadion to the partitions obtained with the true number of clusters K⋆, best-performing internal clustering indices (see [14, 48] for reviews), the Gap statistic [13] (K-means only), BIC (GMM only) and stability methods [16, 17]. To ensure a fair comparison, all internal indices were computed on the same partition, which was also the reference partition in Stadion. We report the number of data sets where each method found K⋆, which we refer to as the number of wins. However, only checking whether K⋆ is selected is not always related to the goodness of the partition w.r.t. the ground truth. Results strongly depend on the performance of the algorithms, which do not necessarily succeed in finding a good partition into K⋆ clusters. Thus, a more adequate performance measure is the similarity between the selected partition and the ground truth. As a performance measure, the ARI is a standard choice [49] when clusters are mostly balanced. Let us note Y_{K⋆} = {Y_1, ..., Y_{K⋆}} the ground-truth partition. The performance of each validation method is assessed by computing ARI(Y_{K⋆}, C_{K̂}), where K̂ is the estimated number of clusters. In order to compare methods on multiple data sets, we compute the average ranks in terms of ARI, denoted R_ARI. Since data sets have different difficulties, their results are not comparable. Thus, average ARI or numbers of wins are meaningless [50]. Nonetheless, wins are reported for reference. The benchmark uses uniform ε-AP, D = 10, Ω = { , . . . , }, s = ARI, and evaluates solutions for K ∈ {1, . . . , K_max}, where K_max is K⋆ + 20 rounded down to the nearest ten.

Table 1: Benchmark results on 80 artificial and real data sets for K-means, Ward and GMM. Average rank of the ARI with the ground-truth classes (R_ARI) and number of times K⋆ was selected (wins). Each cell reports R_ARI / wins; "-" indicates a configuration that was not evaluated.

Method                  | Artificial: K-means | Artificial: Ward | Artificial: GMM | Real: K-means | Real: Ward | Real: GMM
K⋆                      |                     |                  |                 |               |            |
Stadion-mean            | 6.12 / 51           | 5.80 / 49        | -               | 6.57 / 4      | 7.64 / 3   | -
Stadion-max (extended)  | 6.13                |                  |                 |               |            |
Stadion-mean (extended) | 6.42 / 48           | -                | 6.79 / 43       | 6.29 / 3      | -          | 5.50 / 3
BIC                     | -                   | -                | 6.45 / 48       | -             | -          | 7.29 / 2
Wemmert-Gancarski [14]  | 6.62 / 53           | 5.40             |                 |               |            |

Stadion-max achieves the best results overall. On K-means, it is even ranked higher than K⋆ in terms of ARI.
The second-best performing index is Wemmert-Gancarski. It was shown in [33] that agglomerative clustering is not robust to noise, which explains the inferior Stadion results with Ward. Moreover, results are slightly biased in favor of the indices that are only valid for K ≥ 2, unlike Stadion, which will select K = 1 on non-clusterable distributions, as shown in Appendix A. Full result tables and statistical tests are provided in Appendix E along with a more thorough analysis. In particular, the ranking is unchanged when using other external performance measures such as AMI or NMI instead of the ARI. Beyond selecting K, Stadion may also be used to select the kernel parameter in spectral clustering, the radius in density-based clustering, or to select between different algorithms.

References
[1] Shai Ben-David. Clustering - what both theoreticians and practitioners are doing wrong. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[2] James MacQueen et al. Some methods for classification and analysis of multivariate observations. 1967.
[3] Stephen P. Smith and Richard Dubes. Stability of a hierarchical clustering. Pattern Recognition, 12(3):177–187, 1980.
[4] Ohad Shamir and Naftali Tishby. Cluster stability for finite samples. In Advances in Neural Information Processing Systems, pages 1–8, 2007.
[5] Marina Meila. How to tell when a clustering is (approximately) correct using convex relaxations. In Advances in Neural Information Processing Systems, pages 7407–7418, 2018.
[6] Shai Ben-David, Ulrike Von Luxburg, and Dávid Pál. A sober look at clustering stability. In International Conference on Computational Learning Theory, pages 5–19. Springer, 2006.
[7] Ulrike Von Luxburg. Clustering stability: An overview. Foundations and Trends in Machine Learning, 2(3):235–274, 2010.
[8] J. C. Dunn. Well-separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 1(4):95–104, 1974.
[9] T. Caliński and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics, 3(1):1–27, 1974.
[10] David L. Davies and Donald W. Bouldin. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):224–227, 1979.
[11] Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(C):53–65, 1987.
[12] Siddheswar Ray and R. H. Turi. Determination of number of clusters in k-means clustering and application in colour image segmentation. Proceedings of the 4th International Conference on Advances in Pattern Recognition and Digital Techniques, pages 137–143, 1999.
[13] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society Series B, 63:411–423, 2001.
[14] Bernard Desgraupes. ClusterCrit: Clustering indices. CRAN Package, (April):1–10, 2013.
[15] J. S. Strauss, J. J. Bartko, and W. T. Carpenter. The use of clustering techniques for the classification of psychiatric patients. British Journal of Psychiatry, 122(570):531–540, 1973.
[16] Asa Ben-Hur, Andre Elisseeff, and Isabelle Guyon. A stability based method for discovering structure in clustered data. Pacific Symposium on Biocomputing, 17:6–17, 2002.
[17] Tilman Lange, Volker Roth, Mikio L. Braun, and Joachim M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299–1323, 2004.
[18] Shai Ben-David, Dávid Pál, and Hans Ulrich Simon. Stability of k-means clustering. In International Conference on Computational Learning Theory, pages 20–34. Springer, 2007.
[19] Shai Ben-David and Ulrike Von Luxburg. Relating clustering stability to properties of cluster boundaries. Pages 379–390, 2008.
[20] Erel Levine and Eytan Domany. Resampling method for unsupervised estimation of cluster validity. Neural Computation, 13(11):2573–2593, 2001.
[21] Sandrine Dudoit and Jane Fridlyand. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 3(7):1–21, 2002.
[22] M. Kathleen Kerr and Gary A. Churchill. Experimental design for gene expression microarrays. Biostatistics, 2(2):183–201, 2001.
[23] Mark Smolkin and Debashis Ghosh. Cluster stability scores for microarray data in cancer studies. BMC Bioinformatics, 4(1):36, 2003.
[24] Jane Fridlyand and Sandrine Dudoit. Applications of resampling methods to estimate the number of clusters and to improve the accuracy of a clustering method. Technical report, 2001.
[25] Ulrich Möller and Dörte Radke. A cluster validity approach based on nearest-neighbor resampling. Proceedings - International Conference on Pattern Recognition, 1:892–895, 2006.
[26] Ulrich Möller and Dörte Radke. Performance of data resampling methods for robust class discovery based on clustering. Intelligent Data Analysis, 10(2):139–162, 2006.
[27] Shalev Ben-David and Lev Reyzin. Data stability in clustering: A closer look. Theoretical Computer Science, 558:51–61, 2014.
[28] M. Falasconi, A. Gutierrez, M. Pardo, G. Sberveglieri, and S. Marco. A stability based validity method for fuzzy clustering. Pattern Recognition, 43(4):1292–1305, 2010.
[29] K. Y. Yeung, D. R. Haynor, and W. L. Ruzzo. Validating clustering for gene expression data. Bioinformatics, 17(4):309–318, 2001.
[30] Yonatan Bilu and Nathan Linial. Are stable instances easy? Combinatorics, Probability and Computing, 21(5):643–660, 2012.
[31] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. Information Processing Letters, 112(1-2):49–54, 2012.
[32] Aravindan Vijayaraghavan, Abhratanu Dutta, and Alex Wang. Clustering stable instances of Euclidean k-means. In Advances in Neural Information Processing Systems, pages 6500–6509, 2017.
[33] Maria-Florina Balcan and Yingyu Liang. Clustering under perturbation resilience. SIAM Journal on Computing, 45(1):102–155, 2016.
[34] Maria-Florina Balcan, Nika Haghtalab, and Colin White. k-center clustering under perturbation resilience. ACM Transactions on Algorithms, 16(2):1–39, 2020.
[35] Qinpei Zhao, Mantao Xu, and Pasi Fränti. Extending external validity measures for determining the number of clusters. International Conference on Intelligent Systems Design and Applications, ISDA, pages 931–936, 2011.
[36] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, 11:2837–2854, 2010.
[37] Marina Meila. Comparing clusterings by the variation of information. In COLT, 2003.
[38] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
[39] Sébastien Bubeck, Marina Meila, and Ulrike Von Luxburg. How the initialization affects the stability of the k-means algorithm. ESAIM - Probability and Statistics, 16:436–452, 2012.
[40] Robert Tibshirani and Guenther Walther. Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3):511–528, 2005.
[41] Jeffrey D. Banfield and Adrian E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803–821, 1993.
[42] Joe H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.
[43] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[44] Karl Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
[45] Charles Bouveyron and Camille Brunet-Saumard. Model-based clustering of high-dimensional data: A review. Computational Statistics & Data Analysis, 71:52–78, 2014.
[46] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
[47] Ryan McConville, Raul Santos-Rodriguez, Robert J. Piechocki, and Ian Craddock. N2D: (Not Too) Deep clustering via clustering the local manifold of an autoencoded embedding. 2019.
[48] Joonas Hämäläinen, Susanne Jauhiainen, and Tommi Kärkkäinen. Comparison of internal clustering validation indices for prototype-based clustering. Algorithms, 10(3), 2017.
[49] Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. Adjusting for chance clustering comparison measures. arXiv preprint arXiv:1512.01286, 2015.
[50] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.
[51] Xuanli Lisa Xie and Gerardo Beni. A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(8):841–847, 1991.
A Additional experiments and examples
A.1 Finding K = 1: the case of non-clusterable data

Is a data set clusterable? Between-cluster stability is unable to answer this question, as the solution with a single cluster (K = 1) is trivially stable. Some stability methods are not even defined for K = 1 because of normalization [17]. Moreover, many internal indices use between-cluster distance and are not defined for a single cluster either. We verified empirically that our criterion consistently outputs K = 1 when the algorithm does not find any cluster structure.

Table 2 contains results for non-clusterable artificial data sets. Stadion outputs K = 1 in all cases. An example of Stadion path and trade-off curve for the golfball data set is provided in Figure 4 (results are similar for other data sets).

Table 2: Number of clusters found by Stadion on non-clusterable artificial data sets.
Dataset            | N    | dimension | K selected by Stadion (max/mean)
Uniform cube (2d)  | 1000 | 2         | 1
Uniform cube (10d) | 1000 | 10        | 1
Gaussian (2d)      | 1000 | 2         | 1
Gaussian (10d)     | 1000 | 10        | 1
Golfball [52]      | 4002 | 3         | 1
Figure 4: Stadion path (left) and stability trade-off plot (right) on the golfball data set with K-means. K = 1 is clearly the best solution found by Stadion-max/mean (uniform noise, Ω = { , . . . , }).

A.2 Examples of jumping between local minima
As explained in section 3, two sources of instability are jumping and jittering. We have already stated that our method leverages jittering of cluster boundaries in high-density regions due to perturbation. Jumping, on the other hand, happens when the algorithm finds very different solutions on different samples; in the case of objective-minimizing algorithms, it ends up in different local minima. Two main effects lead to jumping: first, symmetries in the data distribution, and second, initialization. Finally, subtle geometrical properties of the distribution might also cause jumping [7].

Figure 5: Example of K-means jumping between three global minima for K = 2 on a symmetric distribution with three Gaussians, despite effective initialization (K-means++ and best of 10 runs). Under slight perturbation (here uniform ε-AP, but resampling gives identical results), the algorithm jumps between grouping two random clusters together.

An example of jumping of K-means due to symmetries is shown in Figure 5: clearly, there are several global minima, and even if the algorithm is deterministic, slight perturbations of the distribution (noise or sampling) make the algorithm jump between solutions. The second cause of jumping is initialization. As illustrated by Figure 6 for K-means, if a single random initialization is used, depending on the initial position of the centers, four different configurations occur randomly, even without any perturbation of the data.

Figure 6: Example of K-means jumping between four local minima for K = 4, when a single random initialization is used. Depending on the initial center configuration, the algorithm jumps between splitting a random cluster in two (a, b, c), or even the left cluster into three sub-clusters (d).

We place ourselves in a realistic setting without perfect symmetries and with an effective algorithm initialization strategy, thus jumping is not the main source of instability.

A.3 Failure of sampling-based stability methods
In this section, we show on a trivial example why stability methods based on sampling are not reliable to detect the presence of structure in the data. Four methods are compared:
1. Stadion based on ε-Additive Perturbation.
2. Stadion based on bootstrapping.
3. The model explorer algorithm [16], based on subsampling.
4. The model order selection method [17], based on splitting the data in two halves and transferring labels from one half onto the other using a supervised nearest-neighbor classifier.

We demonstrate that only the first method is successful on a simple example consisting of a mixture of two correlated Gaussians, represented in Figure 7. Data are scaled to zero mean and unit variance, as for every other data set. K-means is used to cluster the data. As illustrated in the plot, K-means with K = 2 separates the two Gaussians almost perfectly. All other solutions split the two Gaussians into several sub-clusters of equal sizes, with cluster boundaries lying in the regions of highest density, as can be seen from the example for K = 4 (where the boundaries are in the middle of the Gaussians). Thus, in addition to being the best solution, K = 2 is the only acceptable one. However, sampling-based methods fail in assessing its stability, since they estimate K = 4 as the most stable solution. This result can be explained because the data set is not symmetric and for each K there is one global minimum, so no jumping occurs, even with a poor initialization scheme. Thus the only possible source of instability stems from jittering. As expected in theory, our experiments showed that the different sampling processes did not succeed in creating jittering. Conversely, ε-AP did produce jittering, where a small amount of noise produced very different partitions.

(a) (b) K = 2 (c) K = 4
Figure 7: Example data set of two correlated Gaussians, scaled to zero mean and unit variance. With the K-means algorithm, all sampling-based methods select K = 4 or K = 6, whereas with ε-Additive Perturbation, K = 2 is the only stable solution.

In detail, the model order selection method [17] selects K = 4, followed by K = 6. The model explorer [16] finds K = 6 as the best solution, followed by K = 4. These results are consistent across initialization schemes (random, K-means++, best of several runs). Hence, random initialization will not help create instability by jumping. Furthermore, our stability criterion Stadion was able to find K = 2 among the set of tested values { , . . . , } (here with uniform noise and Ω = { , . . . , }). This is not only due to adding the within-cluster stability. As evidence, we replaced ε-AP by a bootstrap perturbation: Stadion with bootstrapping also fails, selecting K = 1 as the best solution followed by K = 4, and this for all initialization schemes.
Figure 8: Between-cluster stability, within-cluster stability and Stadion paths (uniform noise, Ω = { , . . . , }) on the example of two correlated Gaussians where all sampling-based methods fail to select K = 2. Stadion clearly finds K = 2 by taking the max or mean of the path curve.

A.4 Example of Stadion behavior with K-means

This example illustrates the behavior of our stability criterion Stadion and how to interpret the stability paths, using the data set 2d-4c shown in Figure 9. It consists of four clusters with different variances and sizes, where two clusters are closer to each other while the other clusters are at a greater distance. At first glance, this example looks trivial, but the majority of internal indices fail. For instance, the Dunn and Silhouette indices both select K = 3.

Figure 9: The example data set 2d-4c consists of four clusters of different variance and size.

The stability paths are presented in Figure 10, where we observe that Stadion is able to detect the structure of the data and selects K = 4. The only difference between the solutions with K = 4 and K = 5 is that the largest cluster (in green) is split, thus leading to a much lower between-cluster stability but the same within-cluster stability. Solutions K = 2 and K = 3 group clusters together without any splitting. Therefore, those solutions have a high between-cluster stability and also a high within-cluster stability. Altogether, on the Stadion path (Figure 10), the path corresponding to K = 4 is similar to K = 5, whereas K = 2 and K = 3 have an equivalent behavior. This is due to the structure of the data, and especially because the two rightmost clusters are close to each other. The moment when the path of solution K = 3 becomes the best solution is the moment when these two clusters merge because of a high ε-AP, and this is also the moment where K = 1 prevails.
Figure 10: Between-cluster stability, within-cluster stability and Stadion paths (uniform noise, Ω = { , . . . , }) on the 2d-4c data set. Stadion selects K = 4, followed by K = 3.

Finally, Stadion paths (with the stability and instability paths) give useful additional information on a clustering and on the structure of the data. When K > K⋆, the paths are similar to the path of K⋆ but on a smaller scale, as they have the same within-cluster stability but a lower between-cluster stability. On the other hand, when K < K⋆, the paths are shifted towards the right, and may become superior for larger ε values.

A.5 Whenever K⋆ is not the best partition

Sometimes, the best solution is not the partition obtained with the true number of clusters K⋆, because the algorithm is unable to recover the ground-truth partition. This is the case for the 4clusters_corner data set, depicted in Figure 11. While obviously the best solution is to separate the four clusters, it is not achievable by K-means: with K⋆ = 4, it will cut through the large cluster instead of separating the two small green clusters, for the sake of saving the cost induced by the variance and the size of this cluster. Among the proposed solutions, the highest ARI (w.r.t. the ground-truth partition) is obtained with K = 3 (ARI = 0.92), followed by K = 2 (0.74), K = 5 (0.65) and lastly K⋆ = 4 (0.58).

(a) K = 2 (ARI = 0.74), most internal indices; (b) K = 3 (ARI = 0.92), Stadion; (c) K⋆ = 4 (ARI = 0.58), Ben-Hur [16], Lange [17]; (d) K = 5 (ARI = 0.65).
Figure 11: Partitions found by K-means on the 4clusters_corner data set for K ∈ { , . . . , }.

All internal indices, except the Gap, select K = 2. Stability methods based on sampling (Ben-Hur [16], Lange [17]) selected the ground-truth K⋆ = 4, earning them a "win", although it is the worst partition among the four. We explain this by the fact that these methods do not leverage jittering inside the large cluster. Finally, Stadion always selects the solution K = 3 having the highest ARI. Moreover, the criterion outputs solutions in the same order as the ARI. This example clearly exhibits the stability trade-off occurring in Stadion: it tries to preserve a high between-cluster stability while keeping within-cluster stability as low as possible (see Table 3). Stadion paths in Figure 12 also show how the three smaller clusters merge as the noise level increases.

Table 3: Stability trade-off leveraged by Stadion on the 4clusters_corner data set (columns: K, ARI, StabB, StabW, Stadion).
Figure 12: Stadion paths on the 4clusters_corner data set. K = 3 is selected although K⋆ = 4.

A.6 Example of K approaching N

This paragraph describes the behavior of Stadion when the number of clusters K being evaluated becomes as large as the number of samples N. Even if this is beyond the common setting in clustering, the criterion is still valid. Figure 13 displays the stability trade-off for K-means on an example with three Gaussians, using the ARI as the similarity measure. As K approaches N:
• Between-cluster stability decreases towards 0, except for K = N where it jumps back to 1, because all partitions with one sample per cluster are perfectly similar according to the ARI.
• Within-cluster stability increases towards 1, as clusters with few samples become trivially stable.
• Stadion still indicates the correct solution K = 3, while decreasing towards −1, only jumping back to 0 when K = N.

Note that the borderline case K = N depends on the similarity used; for instance, with Fowlkes-Mallows the between-cluster stability does not jump back to 1, staying at 0.
Figure 13: Stability trade-off plot for K-means on three Gaussians with N = 50, for K ∈ {1, . . . , 50} (uniform noise, Ω = { , . . . , }). Stadion is still valid when the tested K becomes large.

In addition, with the extended version, the perturbed partition will not have one sample per cluster, thus it will also stay at 0. Nevertheless, for all similarity measures, Stadion's behavior is consistent and valid even for large values of K.

B Hyperparameter study
The stability difference criterion (Stadion) introduced in this work is governed by several hyperparameters:
• D: the number of perturbed samples used in the stability computations (2) and (3).
• noise: the type of noise for the ε-Additive Perturbation. We experimented with uniform and Gaussian noise.
• Ω: the set of algorithm parameters K′ used in the within-cluster stability computation (3).
• s: the similarity measure used in the stability computation; it is a special hyperparameter and is treated specifically in the last paragraph of this section.

The goal of this section is to study their importance and impact on the performance of Stadion for clustering model selection, using the three studied algorithms (K-means, Ward linkage and GMM). Only the extended versions of Stadion for K-means and GMM are included, for the sake of saving computational cost.

B.1 Importance study with fANOVA
Ideally, practitioners would like to know how hyperparameters affect performance in general, not just in the context of a single fixed instantiation of the remaining hyperparameters, but across all their instantiations. The fANOVA (functional ANalysis Of VAriance) framework for assessing hyperparameter importance introduced in [53] is based on efficient marginalization over dimensions using regression trees. The importance of each hyperparameter is obtained by training a Random Forest model of 100 regression trees to predict the performance of Stadion in terms of ARI given the set of hyperparameters. Then, the variance of the performance due to a given hyperparameter is decomposed by marginalizing out the effects of all other parameters. It also allows assessing interaction effects. Hence, the fANOVA framework provides insights on the overall importance of hyperparameters and their interactions.

The maximum amount of noise ε_max and the fineness of the grid are not included in the study, because they are data-dependent and one can easily check whether the values are appropriate by looking at the paths. We study the following discrete hyperparameter space (a simplified importance analysis is sketched after this list):
• D ∈ { , . . . , }
• noise is uniform or Gaussian
• Ω ∈ { , , , , { , . . . , }, { , . . . , }, { , . . . , }, { , . . . }}
• s = ARI
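The following sketch illustrates the general idea with a simpler stand-in rather than fANOVA itself: a Random Forest is fit to predict the ARI from hyperparameter settings, and permutation importance serves as a rough proxy for the marginal variance decomposition performed by fANOVA. The table of runs and its column names are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Hypothetical table: one row per (hyperparameter configuration, data set) run,
# with the ARI achieved by the partition selected by Stadion.
rng = np.random.default_rng(0)
runs = pd.DataFrame({
    "D": rng.integers(1, 11, size=200),
    "noise": rng.integers(0, 2, size=200),          # 0 = uniform, 1 = Gaussian
    "omega_max": rng.choice([2, 5, 10, 20], size=200),
    "ari": rng.random(200),
})

X, y = runs[["D", "noise", "omega_max"]], runs["ari"]
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Permutation importance: drop in predictive performance when one hyperparameter
# is shuffled, used here as a crude proxy for its share of the performance variance.
imp = permutation_importance(forest, X, y, n_repeats=20, random_state=0)
for name, mean_imp in zip(X.columns, imp.importances_mean):
    print(f"{name}: {mean_imp:.3f}")
```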
Figure 14: Box plots of the fANOVA importance of parameters and their interactions, e.g. (D, noise), for Stadion-max (left) and Stadion-mean (right) across 73 artificial data sets, for three algorithms.

Before going any further, we would like to add a clarification on the causes of jumping. In Section 3, we stated that the two causes of jumping are symmetries in the data and initialization. Indeed, they are the only ones possible in our setting. But one aspect has not been addressed: the case where K becomes large w.r.t. N. In this case, the effective initialization strategy no longer prevents jumping. Furthermore, if clusters are very small, then small perturbations can drastically change the solution, unlike when K << N with N sufficiently large. The latter can undeniably cause jumping. This brings us to the conclusion that Stadion, in its current formulation, should not be used for applications where K is large w.r.t. N, for instance deduplication [54], and that further investigations are needed in this context.

Figure 14 shows the contributions of hyperparameters and their interactions to the variance of the ARI performance, across 73 artificial data sets. Ω is by far the most important parameter, for two reasons. First, the size of the data N needed to obtain good estimations is relative to the number of clusters K. More precisely, in our setting we only consider K << N. Whenever K is large w.r.t. N, jumping due to "large K" can occur, which sometimes happens in within-cluster stability, where small clusters might be split into up to 20 sub-clusters. This implies that even in the presence of sub-clusters, high values of K′ in Ω will create instability and thus lead to low within-cluster stability. More precisely, if K ≥ K⋆, then in general Ω will not affect within-cluster stability because it is already low. But whenever K ≤ K⋆, within-cluster stability is more impacted by large values of K′ in Ω, and the within-stability paths for these specific values of K shift down, leading to higher Stadion paths. Second, ARI has decreasing performance for large numbers of small clusters [49]. The second most important parameter is the interaction (D, Ω), for the same reason: large numbers of clusters make estimating the within-cluster stability more difficult, and thus a higher number of perturbations D is needed to obtain a good approximation.

B.2 Influence of D

The D hyperparameter defines the number of perturbed samples used in the stability computations (2) and (3). In our benchmark, we used D = 10. Surprisingly, a number of samples as low as D = 1 already gives a good estimate of the expectation, and the performance only slightly increases with larger values of D. We perform an experiment by varying D from 1 to 10, keeping other hyperparameters fixed (uniform noise, Ω fixed as in the benchmark), for the three algorithms and both Stadion path aggregation strategies (max and mean), and measure performance in terms of ARI over the 73 artificial benchmark data sets. Results in Figure 15 show that low D values have a higher variance and slightly lower performance.
Figure 15: Box plots of the ARI of partitions selected by Stadion-max (left) and Stadion-mean (right) across 73 data sets, for three algorithms and different values of D, the number of samples in the stability computation.

To further quantify the influence of this parameter, we followed the recommendation in [50] and used the Friedman test [55] for comparisons on multiple data sets, in order to test against the null hypothesis H0 stating that all parameters have equivalent performance. After rejecting H0, we performed the pairwise post-hoc analysis recommended by [56], where the average rank comparison (e.g. Nemenyi test) is replaced by a Wilcoxon signed-rank test [57] at α = 5% with a Holm-Bonferroni correction procedure to control the family-wise error rate (FWER) [58, 59]. To visualize post-hoc test results, we use the critical difference (CD) diagram [50], where a thick horizontal line shows groups (cliques) of classifiers that are not significantly different in terms of performance. In all but one case, the Friedman test could not reject the null hypothesis. Only for the GMM algorithm and max aggregation was the null hypothesis rejected, leading to the critical difference diagrams in Figure 16.

Figure 16: Critical difference diagrams after the Wilcoxon-Holms test (α = 5%) on GMM performance, for Stadion-max with uniform (left) and Gaussian (right) noise, for different values of D, the number of perturbations in the stability computation.

The number of samples D has a negligible impact on the performance of our method. We assume this is due to the fact that, in our setting, instability is caused by jittering at cluster boundaries, which does not vary much from one perturbation to another with reasonable amounts of data. On the contrary, sampling-based stability methods that rely on jumping require a much higher number of samples (for instance, [16] use 100 samples and [17] use 20 samples). As a conclusion, we recommend using several perturbed samples, but if computation time is costly, D = 1 can be used safely to cut down complexity.
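The statistical protocol above (a Friedman test over all settings, followed by pairwise Wilcoxon signed-rank tests with Holm correction) can be sketched as follows, assuming SciPy and statsmodels are available; the score matrix and setting names are synthetic and illustrative, not the paper's results.

```python
# Sketch of the comparison protocol: a global Friedman test, then pairwise
# Wilcoxon signed-rank tests with Holm correction to control the FWER.
# scores[i, j] is the ARI of setting j on data set i (synthetic here).
from itertools import combinations
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_datasets, settings = 73, ["D=1", "D=2", "D=5", "D=10"]
scores = np.clip(rng.normal(0.8, 0.1, (n_datasets, len(settings)))
                 + np.linspace(0.0, 0.02, len(settings)), 0, 1)

# Global test: are all settings equivalent?
stat, p_friedman = friedmanchisquare(*scores.T)
print(f"Friedman p-value: {p_friedman:.3f}")

if p_friedman < 0.05:
    # Pairwise post-hoc Wilcoxon tests, Holm-corrected.
    pairs = list(combinations(range(len(settings)), 2))
    pvals = [wilcoxon(scores[:, i], scores[:, j]).pvalue for i, j in pairs]
    rejected, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
    for (i, j), p, rej in zip(pairs, p_adj, rejected):
        print(settings[i], "vs", settings[j], f"p={p:.3f}",
              "different" if rej else "tied")
```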
B.3 Influence of noise type

We experiment with two types of ε-additive noise perturbation: uniform noise and Gaussian noise. As previously, we report the distributions of performance in terms of ARI across the 73 artificial data sets for both noise types in Figure 17 (with D = 10 and Ω fixed as in the benchmark).
Figure 17: Box plots of the ARI of partitions selected by Stadion-max (left) and Stadion-mean (right) across 73 artificial data sets, for three algorithms, using uniform or Gaussian noise perturbation.

To assess the difference between both noise types, we perform the Wilcoxon signed-rank test on the performance results (at confidence level α = 5%). For every algorithm and Stadion path aggregation, the test did not reject the null hypothesis. Thus, either uniform or Gaussian noise can be used.

B.4 Influence of Ω

The Ω hyperparameter is a set defining the numbers of clusters used to cluster again each cluster of the original partition. We perform an experiment by varying Ω, keeping other hyperparameters fixed (uniform noise, D = 10), for both Stadion path aggregation strategies (max and mean), and measure performance in terms of ARI over the 73 artificial benchmark data sets. Results in Figure 18 demonstrate that Ω does not have, in most cases, a big impact on the performance of Stadion. This does not contradict the results of the fANOVA: Ω has the largest variance contribution to the performance, but this variance remains small, and overall Stadion is robust for reasonable choices of the parameter. Ward linkage is the most influenced by the choice of Ω; we suspect the main reason is that agglomerative clustering algorithms are not robust to noise [33]. Critical difference diagrams after the Wilcoxon-Holms test on performance are given in Figures 19, 20 and 21. None of them showed significant differences, indicating that there is not enough data to conclude. However, small values in Ω seem to perform better, which confirms the previous claim that large values of K′ in Ω negatively impact performance. In particular, the range of Ω used in our benchmark performs well across all algorithms.

Figure 18: Box plots of the ARI of partitions selected by Stadion-max (left) and Stadion-mean (right) across 73 artificial data sets, for three algorithms and different sets Ω (numbers of clusters used in within-cluster stability).

Figure 19: Critical difference diagrams after the Wilcoxon-Holms test on K-means performance, for different values of Ω, the set of parameters used in within-cluster stability computation. (a) Stadion-max / uniform noise, (b) Stadion-max / Gaussian noise (could not reject H0), (c) Stadion-mean / uniform noise, (d) Stadion-mean / Gaussian noise.

Figure 20: Critical difference diagrams after the Wilcoxon-Holms test on Ward performance, for different values of Ω, the set of parameters used in within-cluster stability computation. (a) Stadion-max / uniform noise (could not reject H0), (b) Stadion-max / Gaussian noise, (c) Stadion-mean / uniform noise, (d) Stadion-mean / Gaussian noise.

B.5 Similarity measure analysis
An extensive study was conducted to compare similarity measures (or distances) between partitions, noted s in the stability computations (2, 3). All definitions and formulae are available in Appendix D. The first five are pair-counting measures and were compared in a previous study [60]:
• RI: Rand Index [61]
• ARI1: Hubert and Arabie's Adjusted Rand Index [62]
• ARI2: Morey and Agresti's Adjusted Rand Index [63]
• FM: Fowlkes and Mallows index [64]
• JACC: Jaccard index
Throughout the paper, ARI referred to ARI1, and the two terms are now used interchangeably. The following information-theoretic measures [36] are also compared in this work:
• MI: Mutual Information
• AMI: Adjusted Mutual Information
• VI: Variation of Information
• NVI: Normalized Variation of Information
• ID: Information Distance
• NID: Normalized Information Distance
• NMI1: Normalized Mutual Information, with max normalization
• NMI2: Normalized Mutual Information, with min normalization
• NMI3: Normalized Mutual Information, with geometric mean normalization
• NMI4: Normalized Mutual Information, with arithmetic mean normalization
• NMI5: Normalized Mutual Information, with joint entropy normalization
Table 4 compares measures by counting the number of data sets where Stadion selected the true number of clusters. Our results confirm that adjusted measures are generally preferable [36]. However, for particular applications, for instance large numbers of clusters or small numbers of observations, other measures might be better suited. On average, the best-performing measure is ARI1, but the average number of wins is not sufficient to conclude.

Figure 21: Critical difference diagrams after the Wilcoxon-Holms test on GMM performance, for different values of Ω, the set of parameters used in within-cluster stability computation. (a) Stadion-max / uniform noise, (b) Stadion-max / Gaussian noise. With Stadion-mean, the Friedman test could not reject H0.

Table 4: Comparison of similarity measures s used in the Stadion computation. Number of correct numbers of clusters for each algorithm and aggregation (with uniform noise and D = 10).

measure   Stadion-max: K-means  Ward  GMM   Stadion-mean: K-means  Ward  GMM   average wins
ARI1      –    –    –     48   49   43     51.0
ARI2      –    –    –     48   49   43     51.0
AMI       54   52   55    48   49   45     50.5
NID       54   52   55    48   49   45     50.5
NMI       54   52   55    48   49   45     50.5
NMI       54   51   55    48   49   46     50.5
NMI       54   51   –     48   46   38     48.8
NMI       53   52   55    –    47   34     48.5
NMI       53   48   –     –    –    41     48.5
NVI       53   47   –     –    –    41     48.2
FM        47   –    51    48   41   45     48.0
JACC      45   55   50    38   44   45     46.2
ID        32   55   39    47   35   –      –
VI        –    –    34    47   34   45     41.3
RI        23   21   46    43   35   31     33.2
MI        8    11   18    17   15   13     13.7
In order to assess which measures are significantly different, we perform a statistical test. However, we cannot use a signed-rank test as previously, because we can no longer use the ARI score as an external performance measure. Our experiments have shown that the choice of the performance metric introduces a bias, favoring the similarity or distance measure used inside Stadion: for instance, using ARI as the performance measure leads to higher performance for s = ARI, and equivalently for the other measures. Thus, the only way to compare a partition with the ground truth is whether it has found the correct number of clusters K⋆ or not. Under this limitation, the only test at our disposal is the sign test, which compares the number of successes/losses/ties for each pair of methods, where a success indicates that one method selected K⋆ and the other did not. The sign test uses a binomial test, assuming that if two methods are equivalent, they should each succeed on approximately half of the data sets. The results of the sign test are represented in Figure 22. As before, we control the FWER at α = 5% using the Holm-Bonferroni procedure for multiple comparisons.

Figure 22: Matrix of p-values after a pairwise sign test comparing different similarity measures and distances between clusterings used in Stadion (rows and columns: ARI1, ARI2, FM, AMI, NMI1 to NMI5, NID, NVI, JACC, ID, VI, MI, RI), here with K-means (uniform noise, D = 10). The null hypothesis H0, that two measures are equivalent, is tested at α = 5% confidence, using Holm-Bonferroni correction to control the FWER.

The matrix of p-values exhibits a block structure: on one hand, the majority of measures perform well with Stadion; on the other hand, MI and RI perform poorly (because they scale with K). In addition, ARI1 and ARI2 are significantly superior to JACC, ID and VI. However, due to the high number of ties, the low power of the sign test and the insufficiency of data, we cannot reach any further conclusions. This structure remains across all three tested algorithms and Stadion path aggregations. As a conclusion, we recommend using ARI with the Stadion criterion, but several similarity measures between partitions are well-suited to measure stability.
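As a rough illustration of the pairwise sign test described above (not the authors' code; SciPy and statsmodels are assumed available, and the success indicators are synthetic), the sketch below counts, for each pair of measures, the data sets on which exactly one of the two recovers K⋆, runs a binomial test with success probability 0.5, and Holm-corrects the p-values.

```python
# Sketch of the pairwise sign test between measures: successes are data
# sets where one measure selects the true K and the other does not.
from itertools import combinations
import numpy as np
from scipy.stats import binomtest
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# correct[m, i] = True if measure m selected the true K on data set i
# (synthetic success indicators, for illustration only).
measures = ["ARI", "AMI", "RI"]
correct = rng.random((3, 73)) < np.array([[0.74], [0.72], [0.30]])

pairs, pvals = list(combinations(range(3), 2)), []
for i, j in pairs:
    wins_i = int(np.sum(correct[i] & ~correct[j]))    # i succeeds, j fails
    wins_j = int(np.sum(correct[j] & ~correct[i]))    # j succeeds, i fails
    n = wins_i + wins_j                               # ties are discarded
    p = binomtest(wins_i, n=n, p=0.5).pvalue if n > 0 else 1.0
    pvals.append(p)

rejected, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for (i, j), p, rej in zip(pairs, p_adj, rejected):
    print(measures[i], "vs", measures[j], f"p={p:.3f}",
          "different" if rej else "equivalent")
```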
C Pseudo-code and complexity

C.1 Pseudo-code

Input: algorithm A; data set X; reference clustering C_K; parameter K; number of perturbations D; similarity measure s; noise amplitude ε
Output: between-cluster stability Stab_B(A, X, C_K, K, ε)
bstab ← 0
for d = 1 . . . D do
    Generate random noise ϵ ∼ U(−ε, +ε) or N(0, εI)
    X_d ← X + ϵ
    bstab ← bstab + s(C_K, A(X_d, K))
end
Return bstab / D
Algorithm 1:
Between-cluster stability procedure.
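A minimal Python sketch of this between-cluster stability procedure (assuming scikit-learn; the function and variable names are ours, not the authors' implementation) could look as follows, here with K-means as the algorithm A and ARI as the similarity s.

```python
# Sketch of Algorithm 1: average similarity between the reference
# partition and partitions obtained on additively perturbed copies of X.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def between_cluster_stability(X, ref_labels, K, eps, D=10, noise="uniform", seed=0):
    rng = np.random.default_rng(seed)
    bstab = 0.0
    for _ in range(D):
        if noise == "uniform":
            pert = rng.uniform(-eps, eps, size=X.shape)
        else:  # Gaussian noise with standard deviation eps
            pert = rng.normal(0.0, eps, size=X.shape)
        labels_d = KMeans(n_clusters=K, n_init=10,
                          random_state=seed).fit_predict(X + pert)
        bstab += adjusted_rand_score(ref_labels, labels_d)
    return bstab / D

# Usage: reference clustering on the original data, then its stability.
X = np.random.default_rng(0).normal(size=(300, 2))
ref = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(between_cluster_stability(X, ref, K=3, eps=0.5))
```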
Input: algorithm A; data set X; set of parameters Ω; reference clustering C_K; number of perturbations D; similarity measure s; noise amplitude ε
Output: within-cluster stability Stab_W(A, X, C_K, K, Ω, ε)
wstab ← 0
N ← |X|
for k = 1 . . . K do
    C_k ← k-th cluster of X in the reference clustering C_K
    N_k ← |C_k|
    bstab ← 0
    for K′ in Ω do
        Q_K′^(k) ← A(C_k, K′)
        bstab ← bstab + Stab_B(A, C_k, Q_K′^(k), K′, ε)
    end
    bstab ← bstab / |Ω|
    wstab ← wstab + bstab × N_k / N
end
Return wstab
Algorithm 2:
Within-cluster stability procedure.
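Continuing the sketch above (same assumptions, reusing the between_cluster_stability helper from the previous block; not the reference implementation), within-cluster stability re-clusters each cluster of the reference partition for every K′ in Ω and weights the resulting between-cluster stabilities by cluster size.

```python
# Sketch of Algorithm 2: within-cluster stability as the size-weighted
# average, over clusters, of the between-cluster stability of the
# sub-partitions obtained for each K' in Omega.
import numpy as np
from sklearn.cluster import KMeans

def within_cluster_stability(X, ref_labels, eps, omega=(2, 3, 4, 5),
                             D=10, noise="uniform", seed=0):
    N = len(X)
    wstab = 0.0
    for k in np.unique(ref_labels):
        X_k = X[ref_labels == k]                     # points of cluster k
        bstab = 0.0
        for K_prime in omega:
            if len(X_k) <= K_prime:                  # guard for tiny clusters
                continue
            sub_ref = KMeans(n_clusters=K_prime, n_init=10,
                             random_state=seed).fit_predict(X_k)
            bstab += between_cluster_stability(X_k, sub_ref, K_prime, eps,
                                               D=D, noise=noise, seed=seed)
        wstab += (bstab / len(omega)) * len(X_k) / N
    return wstab
```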
C.2 Complexity study
Let A(K, N) be the time complexity of the algorithm with parameter K and a data set of size N, assuming the data dimension is fixed. In addition, let S(K, N) be the complexity of the similarity measure s, D the number of perturbations and M the length of the stability path.

Between-cluster stability
The complexity for a given parameter K (assuming the complexity of the perturbation is negligible) is O((A(K, N) + S(K, N)) D M).

Within-cluster stability
For a given parameter K and a set of parameters Ω = {2, . . . , K′}, the number of operations is Σ_{k=1}^{K} Σ_{k′=2}^{K′} (A(k′, N_k) + S(k′, N_k)) D M, which can be bounded by O(K K′ (A(K′, N) + S(K′, N)) D M).

In the case of K-means, we have A(K, N) = O(K N T I), where T is the number of iterations until convergence of the algorithm and I the number of runs. ARI is linear: S(K, N) = O(N). Overall, we obtain a complexity for Stadion with K-means and ARI equal to O(K K′ N T I D M). The influence studies showed that Ω can be set to a small range and that D can be kept very low, so the complexity in K′ and D is manageable. Thus, the complexity of Stadion is mainly driven by O(K N T I M).

Input: algorithm A; data set X; maximum number of clusters K_max; number of perturbations D; similarity measure s; grid of noise amplitudes {ε_i}_{1≤i≤M} with ε_1 = 0 and ε_M = ε_max
Output: selected number of clusters K̂
for K = 1 . . . K_max do
    C_K ← A(X, K)
    for i = 1 . . . M do
        bstab_i ← Stab_B(A, X, C_K, K, ε_i)
        wstab_i ← Stab_W(A, X, C_K, K, Ω, ε_i)
        stadion_i ← bstab_i − wstab_i
    end
end
K̂ ← argmax_K max_i stadion_i  or  argmax_K mean_i stadion_i
Return K̂
Algorithm 3: Complete procedure for selecting the number of clusters K̂ using Stadion paths, with max (Stadion-max) or mean (Stadion-mean) aggregation.
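Putting the pieces together (again a sketch built on the two helpers above, with illustrative parameter choices rather than the authors' settings), the complete selection procedure scans K, computes the Stadion path over a grid of noise amplitudes, aggregates it, and keeps the best K.

```python
# Sketch of Algorithm 3: compute the Stadion path over a grid of noise
# amplitudes for each K, aggregate it (mean here), and select the best K.
import numpy as np
from sklearn.cluster import KMeans

def select_k_stadion(X, k_max=10, eps_grid=np.linspace(0.0, 2.0, 11),
                     omega=(2, 3, 4, 5), D=5, seed=0):
    scores = {}
    for K in range(1, k_max + 1):
        ref = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
        path = []
        for eps in eps_grid:
            bstab = between_cluster_stability(X, ref, K, eps, D=D, seed=seed)
            wstab = within_cluster_stability(X, ref, eps, omega=omega, D=D, seed=seed)
            path.append(bstab - wstab)
        scores[K] = np.mean(path)    # Stadion-mean; use max(path) for Stadion-max
    return max(scores, key=scores.get), scores

# Example on three well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in [(0, 0), (4, 0), (0, 4)]])
best_k, _ = select_k_stadion(X)
print(best_k)   # expected to be 3 on this toy example
```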
The extended version avoids running the algorithm again for each perturbation, getting rid of the T and I factors: for K-means, we only have to assign each perturbed sample to the closest center, which is O(KN). Thus, we have an overall complexity of O(K K′ N D M). In comparison, internal indices relying on between-cluster and within-cluster distances have a complexity of O(N²), or O(KN) when a centroid distance is used. Thus, the cost of having to run the algorithm several times may be smaller than that of a quadratic index, if N is large and the algorithm is linear.

D Definition of similarity measures between partitions
This section provides formulae for every clustering measure used in this work to compute a similarityor distance between two partitions. These can be broadly divided into two families: pair counting-based measures (e.g. Rand index) and information theoretic measures (e.g. MI, VI, ID) [36].
D.1 Contingency matrix
The measures for comparing two clusterings C_K and C′_{K′} can be obtained from the contingency table of the partitions. The contingency table is a K × K′ matrix whose kk′-th element N_{kk′} is the number of points belonging to cluster C_k in clustering C_K and to cluster C′_{k′} in clustering C′_{K′}, i.e. N_{kk′} = |C_k ∩ C′_{k′}|. Any pair of samples falls under one of four cases:
1. N11: the number of pairs that are in the same cluster under both C_K and C′_{K′}
2. N00: the number of pairs in different clusters under both C_K and C′_{K′}
3. N10: the number of pairs in the same cluster under C_K but not under C′_{K′}
4. N01: the number of pairs in the same cluster under C′_{K′} but not under C_K

D.2 Rand Index
The Rand index [61] is defined as

RI = (N11 + N00) / (N11 + N00 + N10 + N01) = (N11 + N00) / C(N, 2),   (5)

where C(N, 2) = N(N − 1)/2 is the total number of pairs. The Rand index is a value between 0 and 1, a value of 0 indicating that the clusterings do not agree on any pair of points, and a value of 1 indicating agreement on every pair of points.

D.3 Adjusted Rand Index
The Adjusted Rand index is a version of the RI that is corrected for chance using the expected index value. It can yield negative values if the index is lower than the expected value. In terms of the pair counts, the expression of the ARI is:

ARI = 2 (N11 N00 − N10 N01) / [ (N11 + N10)(N10 + N00) + (N11 + N01)(N01 + N00) ].   (6)

D.4 Fowlkes-Mallows

FM = N11 / sqrt( (N11 + N10)(N11 + N01) ).   (7)

D.5 Jaccard
JACC = N11 / (N11 + N10 + N01).   (8)
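The pair counts and the measures of Sections D.2 to D.5 can be computed directly from the contingency matrix; the following sketch (our own helper names, using scikit-learn only to build the contingency table) illustrates formulas (5) to (8).

```python
# Pair counts N11, N00, N10, N01 from the contingency matrix, and the
# pair-counting measures (5)-(8) derived from them.
import numpy as np
from scipy.special import comb
from sklearn.metrics.cluster import contingency_matrix

def pair_counts(labels_a, labels_b):
    c = contingency_matrix(labels_a, labels_b)         # K x K' matrix N_kk'
    n = c.sum()
    same_both = comb(c, 2).sum()                       # pairs together in both
    same_a = comb(c.sum(axis=1), 2).sum()              # pairs together in A
    same_b = comb(c.sum(axis=0), 2).sum()              # pairs together in B
    n11 = same_both
    n10 = same_a - same_both                           # together in A only
    n01 = same_b - same_both                           # together in B only
    n00 = comb(n, 2) - n11 - n10 - n01                 # separated in both
    return n11, n00, n10, n01

labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [0, 0, 1, 1, 1, 2]
n11, n00, n10, n01 = pair_counts(labels_a, labels_b)

ri = (n11 + n00) / (n11 + n00 + n10 + n01)                               # (5)
ari = 2 * (n11 * n00 - n10 * n01) / (
    (n11 + n10) * (n10 + n00) + (n11 + n01) * (n01 + n00))               # (6)
fm = n11 / np.sqrt((n11 + n10) * (n11 + n01))                            # (7)
jacc = n11 / (n11 + n10 + n01)                                           # (8)
print(ri, ari, fm, jacc)
```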
D.6 Mutual Information and variants

Let P(k, k′) denote the probability that a sample belongs to cluster C_k in clustering C_K and to cluster C′_{k′} in C′_{K′}, namely the joint distribution of the random variables associated with the two clusterings:

P(k, k′) = |C_k ∩ C′_{k′}| / N.   (9)

We define the mutual information I(C_K, C′_{K′}) between the clusterings C_K and C′_{K′} as the mutual information between the associated random variables:

I(C_K, C′_{K′}) = Σ_k Σ_{k′} P(C_k ∩ C′_{k′}) log [ P(C_k ∩ C′_{k′}) / (P(C_k) P(C′_{k′})) ].   (10)

Mutual information measures the information that C_K and C′_{K′} share: it tells how much knowing one of these clusterings reduces our uncertainty about the other. Let H(C_K) denote the entropy of partition C_K:

H(C_K) = − Σ_k P(C_k) log P(C_k) = − Σ_k (|C_k| / N) log (|C_k| / N).   (11)

Entropy is always non-negative; it takes value 0 only when there is no uncertainty, namely when there is a single cluster. We can now define the different variants of the Normalized Mutual Information:

NMI1(C_K, C′_{K′}) = I(C_K, C′_{K′}) / max(H(C_K), H(C′_{K′})),   (12)
NMI2(C_K, C′_{K′}) = I(C_K, C′_{K′}) / min(H(C_K), H(C′_{K′})),   (13)
NMI3(C_K, C′_{K′}) = I(C_K, C′_{K′}) / sqrt(H(C_K) H(C′_{K′})),   (14)
NMI4(C_K, C′_{K′}) = 2 I(C_K, C′_{K′}) / (H(C_K) + H(C′_{K′})),   (15)
NMI5(C_K, C′_{K′}) = I(C_K, C′_{K′}) / H(C_K, C′_{K′}).   (16)

Finally, the AMI is defined as

AMI(C_K, C′_{K′}) = [ I(C_K, C′_{K′}) − E[I(C_K, C′_{K′})] ] / [ max(H(C_K), H(C′_{K′})) − E[I(C_K, C′_{K′})] ],   (17)

where E[I(C_K, C′_{K′})] is the expected mutual information between two clusterings C_K and C′_{K′}, as defined in [36].

D.7 Variation of Information
The variation of information is defined as

VI(C_K, C′_{K′}) = H(C_K, C′_{K′}) − I(C_K, C′_{K′}),   (18)

and its normalized version as

NVI(C_K, C′_{K′}) = 1 − I(C_K, C′_{K′}) / H(C_K, C′_{K′}).   (19)

Unlike all previously introduced measures, the VI measures dissimilarity instead of similarity.

D.8 Information Distance
The expression of the Information Distance is:

ID(C_K, C′_{K′}) = max(H(C_K), H(C′_{K′})) − I(C_K, C′_{K′}).   (20)

The normalized variant, NID, is both a distance and a normalized measure:

NID(C_K, C′_{K′}) = 1 − I(C_K, C′_{K′}) / max(H(C_K), H(C′_{K′})).   (21)
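The information-theoretic quantities of Sections D.6 to D.8 can likewise be computed from the joint distribution P(k, k′); the sketch below (our own helper, natural logarithm assumed) implements the entropies, the mutual information, and a few of the derived measures.

```python
# Entropy, mutual information, and the derived measures VI, NVI, ID, NID,
# computed from the joint distribution P(k, k') of two labelings.
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def information_measures(labels_a, labels_b):
    p_joint = contingency_matrix(labels_a, labels_b) / len(labels_a)  # P(k, k')
    p_a, p_b = p_joint.sum(axis=1), p_joint.sum(axis=0)               # marginals

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    h_a, h_b, h_joint = entropy(p_a), entropy(p_b), entropy(p_joint.ravel())
    mi = h_a + h_b - h_joint                                          # (10)
    return {
        "MI": mi,
        "VI": h_joint - mi,                                           # (18)
        "NVI": 1 - mi / h_joint,                                      # (19)
        "ID": max(h_a, h_b) - mi,                                     # (20)
        "NID": 1 - mi / max(h_a, h_b),                                # (21)
        "NMI4": 2 * mi / (h_a + h_b),                                 # (15)
    }

print(information_measures([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2]))
```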
E Benchmark results

E.1 Results analysis
A large benchmark on 73 artificial and 7 real data sets compares the solutions selected by Stadion against the true number of clusters K⋆, a selection of internal indices (including the Gap statistic for K-means and the BIC for GMM) and two previous stability methods. As summarized in Table 1, Stadion achieves the best results overall, and other internal indices such as Wemmert-Gancarski and Silhouette also perform well. In particular, with GMM, it outperforms the BIC on both the artificial and the real benchmark, although BIC is a standard choice and a strong baseline in model-based clustering. Note that Stadion and BIC were evaluated for K ≥ 1, while all other methods used K ≥ 2. This slightly favors the latter methods, because many data sets have K⋆ = 2; thus, on data sets where the algorithm fails to recover structure, Stadion will select K = 1 while most indices output K = 2 as a default. In order to assess the statistical significance of those results and determine which methods really are different, we adopt the same methodology as previously and carry out a Friedman test followed by a Wilcoxon-Holms post-hoc analysis on the ARI performances of each method, across our benchmark of 73 artificial data sets and for the three algorithms considered in this work.

Figure 23: Critical difference diagram after the Wilcoxon-Holms test on ARI performance across 73 artificial data sets, comparing several clustering validation methods for the K-means algorithm.

Figure 24: Critical difference diagram after the Wilcoxon-Holms test on ARI performance across 73 artificial data sets, comparing several clustering validation methods for the Ward algorithm.
Figure 25: Critical difference diagram after the Wilcoxon-Holms test on ARI performance across 73 artificial data sets, comparing several clustering validation methods for the GMM algorithm.

As shown in the CD diagrams in Figures 23, 24 and 25, Stadion seems to outperform other indices. However, the signed-rank Wilcoxon-Holms test on ARI performance did not establish a significant difference between the group of best-performing methods. The main reason is that ARI performance is evaluated on partitions that are fixed for any given K. In contrast to supervised learning (using accuracy), here the methods are often attributed the same ARI scores, since they often find the same number of clusters. In other words, methods mostly succeed and fail on the same data sets. This implies a large number of ties in terms of ARI score. Under these conditions, the experimental data is not sufficient to reach any conclusion regarding the statistical superiority of our method.

Table 5: Performance rankings of clustering validation methods, evaluated with 16 external indices w.r.t. the ground-truth partitions, over 73 data sets for K-means. Rankings are mostly unchanged.
Stadion / Stadion (ext)
ARI K⋆ max mean max (ext) mean (ext) WG Sil Lange DB RT CH Dunn XB Gap Ben-Hur
RI 6.43 6.15 6.26

Table 6: Performance rankings of clustering validation methods, evaluated with 16 external indices w.r.t. the ground-truth partitions, over 73 data sets for Ward. Rankings are mostly unchanged.
ARI
K⋆
Stadion-max Stadion-mean WG Sil Lange DB RT CH Dunn XB Ben-Hur
RI 4.92

That said, some conclusions can still be drawn. Stadion (mean/max, extended or standard) performs significantly better than the Ben-Hur, Gap, Xie-Beni, Dunn, Calinski-Harabasz, Ray-Turi and Davies-Bouldin indices for the K-means algorithm. Similarly for GMM, Stadion-max performs significantly better than the same group of indices. Finally, for Ward, Stadion-max performs significantly better than Ben-Hur, Xie-Beni, Dunn, Calinski-Harabasz and Ray-Turi. Overall, Stadion-mean had slightly inferior performance to Stadion-max, but the difference was not significant. Note that performance is evaluated using the external index ARI (w.r.t. the ground-truth partition), while Stadion also uses s = ARI as its similarity measure to estimate stability. It would be reasonable to expect this situation to introduce some kind of bias. However, the results are not biased in favor of Stadion, as shown by Tables 5, 6 and 7: the ranking almost never changed when using different external indices to evaluate the performance of Stadion while keeping s = ARI.

Table 7: Performance rankings of clustering validation methods, evaluated with 16 external indices w.r.t. the ground-truth partitions, over 73 data sets for GMM. Rankings are mostly unchanged.

ARI
K⋆
Stadion-max Stadion-mean BIC WG Sil Lange DB RT CH Dunn XB Ben-Hur
RI 5.04

Table 8: Performance rankings of clustering validation methods, evaluated with ARI w.r.t. the ground-truth partitions, over 73 data sets for the three algorithms. Five poorly-performing indices that were omitted from this work for the sake of brevity are added here.
Method K -means Ward GMM K (cid:63) -Stadion-mean 6.72 6.73 -Stadion-max (ext) 6.67 - Stadion-mean (ext) 7.05 - 7.88BIC - - 7.69Wemmert-Gancarski 7.42 6.16 6.55Silhouette 8.44 7.54 8.26Lange 9.08 7.51 8.31Davies-Bouldin 9.23 7.31 8.54Ray-Turi 9.29 8.06 9.15Calinski-Harabasz 9.68 8.20 8.55SDbw [14] 11.14 9.87 10.56Dunn 11.76 9.32 9.40Xie-Beni 12.05 9.32 9.77Gap statistic 12.12 - -Ben-Hur 12.86 9.53 10.86C-index [14] 12.94 10.48 10.71Banfield-Raftery [14] 16.42 13.64 14.62SD [14] 16.67 14.05 14.23Scott-Symons [14] 16.75 13.83 13.77 .2 Complete results on real-world and artificial data sets Table 9: Results for K -means on real data sets (number of clusters selected by each method). Stadion Stadiondataset K (cid:63) max mean max (ext) mean (ext) WG Sil Lange DB RT CH Dunn XB Gap Ben-Hurcrabs 4 4 4 5 5 4 4 4 4 4 28 30 24 2 5faithful 2 2 2 2 2 2 2 2 2 2 2 26 28 2 2iris 3 2 2 2 2 2 2 2 30 30 2 2 30 6 2MFDS_UMAP 10 10 10 10 10 10 10 8 10 10 30 8 8 3 5MNIST_UMAP 10 10 8 8 8 10 7 7 7 7 30 7 7 3 14USPS_UMAP 10 8 5 8 8 5 5 4 5 5 29 5 5 13 8wine_UMAP 3 3 3 3 3 3 3 3 3 3 30 3 3 3 4Number of wins 7 Table 10: Results for K -means on real data sets (ARI of the partition selected by each method). Stadion Stadiondataset K (cid:63) ARI
K(cid:63) max mean max (ext) mean (ext) WG Sil Lange DB RT CH Dunn XB Gap Ben-Hurcrabs 4 0.72 0.72 0.72 0.70 0.70 0.72 0.72 0.72 0.72 0.72 0.19 0.18 0.22 0.40 0.70faithful 2 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.07 0.07 0.99 0.99iris 3 0.62 0.57 0.57 0.57 0.57 0.57 0.57 0.57 0.13 0.13 0.57 0.57 0.13 0.35 0.57MFDS_UMAP 10 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.77 0.94 0.94 0.51 0.77 0.77 0.34 0.54MNIST_UMAP 10 0.92 0.92 0.77 0.77 0.77 0.92 0.65 0.65 0.65 0.65 0.46 0.65 0.65 0.30 0.81USPS_UMAP 10 0.69 0.75 0.50 0.75 0.75 0.50 0.50 0.44 0.50 0.50 0.39 0.50 0.50 0.66 0.75wine_UMAP 3 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.11 0.80 0.80 0.80 0.69Average ARI 0.81
Table 11: Results for Ward on real data sets (number of clusters selected by each method). dataset K (cid:63) Stadion-max Stadion-mean WG Sil Lange DB RT CH Dunn XB Ben-Hurcrabs 4 4 1 5 4 4 28 5 30 29 30 6faithful 2 2 2 2 2 2 2 2 2 2 2 2iris 3 2 2 2 2 2 2 2 2 2 2 2MFDS_UMAP 10 10 10 10 10 10 10 10 30 8 8 5MNIST_UMAP 10 7 8 10 7 7 7 7 11 7 7 2USPS_UMAP 10 5 5 5 5 5 5 5 30 5 5 8wine_UMAP 3 3 3 3 3 3 3 3 30 3 3 4Number of wins 7 Table 12: Results for Ward on real data sets (ARI of partition selected by each method). dataset K (cid:63) ARI
K⋆
Stadion-max Stadion-mean WG Sil Lange DB RT CH Dunn XB Ben-Hurcrabs 4 0.66 0.66 0.00 0.64 0.66 0.66 0.18 0.64 0.17 0.17 0.17 0.60faithful 2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00iris 3 0.63 0.54 0.54 0.54 0.54 0.54 0.54 0.54 0.54 0.54 0.54 0.54MFDS_UMAP 10 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.52 0.77 0.77 0.54MNIST_UMAP 10 0.93 0.65 0.77 0.93 0.65 0.65 0.65 0.65 0.89 0.65 0.65 0.19USPS_UMAP 10 0.69 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.39 0.50 0.50 0.73wine_UMAP 3 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.11 0.80 0.80 0.69Average ARI 0.81 0.73 0.65 dataset K (cid:63) Stadion-max (ext) Stadion-mean (ext) BIC WG Sil Lange DB RT CH Dunn XB Ben-Hurcrabs 4 4 9 4 4 4 4 24 4 19 21 21 4faithful 2 2 2 2 2 2 2 2 2 2 2 2 2iris 3 2 2 2 2 2 2 30 30 2 2 30 2MFDS_UMAP 10 10 10 18 10 10 9 10 10 10 8 8 2MNIST_UMAP 10 10 9 19 9 7 7 7 7 11 7 7 4USPS_UMAP 10 9 9 19 5 5 5 5 5 17 2 2 2wine_UMAP 3 3 3 4 3 3 3 3 3 3 3 3 4Number of wins 7 Table 14: Results for GMM on real data sets (ARI of the partion selected by each method). dataset K (cid:63) ARI
K⋆
Stadion-max (ext) Stadion-mean (ext) BIC WG Sil Lange DB RT CH Dunn XB Ben-Hurcrabs 4 0.79 0.79 0.55 0.79 0.79 0.79 0.79 0.30 0.79 0.33 0.32 0.32 0.79faithful 2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00iris 3 0.90 0.57 0.57 0.57 0.57 0.57 0.57 0.15 0.15 0.57 0.57 0.15 0.57MFDS_UMAP 10 0.94 0.94 0.94 0.78 0.94 0.94 0.85 0.94 0.94 0.94 0.77 0.77 0.14MNIST_UMAP 10 0.93 0.93 0.84 0.67 0.84 0.65 0.65 0.65 0.65 0.89 0.65 0.65 0.40USPS_UMAP 10 0.70 0.76 0.76 0.56 0.50 0.50 0.50 0.50 0.50 0.63 0.19 0.19 0.19wine_UMAP 3 0.80 0.80 0.80 0.68 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.80 0.68Average ARI 0.87 K -means on artificial data sets (number of clusters selected by each method). Stadion Stadiondataset K (cid:63) max mean max (ext) mean (ext) WG Sil Lange DB RT CH Dunn XB Gap Ben-Hur2d-10c 9 9 8 9 9 9 9 3 9 9 20 5 5 9 32d-3c-no123 3 2 2 2 2 2 2 2 2 2 9 2 2 3 32d-4c 4 3 4 4 3 4 3 2 4 4 20 3 3 9 72d-4c-no4 4 7 5 14 12 5 5 5 5 7 8 2 2 8 82d-4c-no9 4 5 4 6 6 4 2 2 5 3 15 20 20 4 83clusters_elephant 3 2 3 3 3 2 2 3 2 2 2 20 20 2 34clusters_corner 4 3 3 3 3 2 2 4 2 2 2 2 2 3 44clusters_twins 4 6 6 6 6 2 2 2 2 2 2 2 2 3 35clusters_stars 5 5 5 5 5 3 3 3 3 3 5 20 20 3 6A1 20 20 20 20 20 20 17 2 16 17 20 17 18 2 22A2 35 37 38 38 36 34 34 7 34 32 36 37 37 14 2curves1 2 2 2 2 2 2 2 2 2 2 20 2 2 2 2D31 31 34 32 31 31 31 31 6 31 31 31 32 32 3 7diamond9 9 9 9 9 9 9 9 9 9 9 9 9 9 3 5dim032 16 17 17 16 16 16 16 16 16 16 16 16 16 17 14dim064 16 16 16 16 16 16 16 16 16 16 16 16 16 17 14dim1024 16 16 16 16 16 16 16 16 20 20 16 16 20 19 2dim128 16 16 16 16 16 16 16 16 20 20 16 16 20 17 14dim256 16 16 16 16 16 16 16 16 16 16 16 16 16 19 3dim512 16 16 16 16 16 16 16 16 20 20 16 16 20 20 2DS-577 3 3 3 3 3 3 3 3 3 3 3 3 3 3 5DS-850 5 5 5 6 6 5 5 4 4 4 16 20 20 5 5ds4c2sc8 8 7 7 7 7 7 6 7 7 6 16 20 20 2 7elliptical_10_2 10 10 10 10 10 10 10 10 9 10 10 10 16 2 3elly-2d10c13s 10 5 15 10 14 5 5 5 10 5 16 20 20 2 7engytime 2 2 4 2 4 3 3 2 3 3 3 20 20 3 6exemples1_3g 3 3 3 3 3 3 3 3 3 3 3 2 2 3 7exemples10_WellS_3g 3 3 3 3 3 3 3 3 3 3 3 3 3 3 6exemples2_5g 5 5 5 5 5 5 5 2 5 5 5 2 3 5 5exemples3_Uvar_4g 4 6 6 9 9 6 5 6 5 9 6 7 7 3 8exemples4_overlap_3g 3 3 3 3 5 3 3 3 3 3 5 14 14 3 4exemples5_overlap2_3g 3 2 9 4 12 4 4 2 9 4 9 20 17 2 7exemples6_quicunx_4g 4 4 4 4 4 4 4 4 4 4 4 4 4 4 7exemples7_elbow_3g 3 3 3 3 3 3 3 3 3 3 3 2 3 3 7exemples8_Overlap_Uvar_5g 6 5 5 5 5 4 3 3 4 3 5 19 19 3 7exemples9_YoD_6g 6 5 5 5 5 5 5 2 5 5 6 2 5 5 6fourty 40 39 39 40 39 39 39 40 34 24 40 23 23 2 4g2-16 2 2 2 2 2 2 2 2 20 20 2 2 20 2 2g2-2 2 2 4 2 16 2 2 2 19 9 2 18 18 2 4g2-64 2 2 2 2 2 20 4 2 20 6 5 18 20 2 2hepta 7 7 7 7 7 7 7 7 7 7 7 7 7 2 8long1 2 2 2 2 2 2 4 2 4 4 12 2 2 4 4long2 2 2 2 2 2 2 4 2 4 4 12 2 2 10 8long3 2 2 2 2 2 2 2 2 3 4 10 2 2 3 3longsquare 6 8 6 10 5 2 2 3 5 2 20 2 20 3 3R15 15 15 15 15 15 15 8 15 15 15 15 8 8 2 8s-set1 15 15 15 15 15 15 15 15 15 15 15 15 15 3 6s-set2 15 15 15 15 15 15 15 2 15 15 15 15 15 3 5s-set3 15 15 15 15 15 15 15 3 15 4 15 9 17 2 10s-set4 15 15 15 15 15 15 14 2 14 11 15 20 20 3 8sizes1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4sizes2 4 4 4 4 4 4 4 4 4 4 4 3 3 6 6sizes3 4 4 4 4 4 4 4 4 4 4 6 4 4 6 4sizes4 4 4 4 4 4 4 4 4 4 4 6 2 2 2 4sizes5 4 4 4 4 4 4 4 4 4 4 6 3 3 2 6spherical_4_3 4 4 4 4 4 4 4 2 4 4 4 4 4 4 2spherical_5_2 5 5 5 5 6 5 5 2 5 4 5 20 20 2 5spherical_6_2 6 6 6 6 5 6 4 4 6 4 6 4 4 6 4square1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4square2 4 4 4 4 4 4 4 4 4 4 4 20 20 4 4square3 4 4 4 4 4 4 4 4 4 4 4 20 20 4 5square4 4 4 4 4 4 4 4 4 4 4 4 20 20 4 4square5 4 4 4 4 4 4 4 4 4 4 4 20 18 3 4st900 9 
9 9 11 9 9 9 9 9 9 9 16 20 2 9tetra 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4triangle1 4 5 5 7 6 4 4 4 4 4 6 4 4 6 7triangle2 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5twenty 20 21 21 20 20 20 20 20 20 20 20 20 20 2 2twodiamonds 2 2 2 2 2 4 4 2 4 4 20 20 20 4 4wingnut 2 1 1 2 6 2 2 2 20 4 20 20 20 2 8xclara 3 3 3 3 3 3 3 2 3 3 3 2 2 3 7zelnik2 3 8 5 11 9 15 3 2 5 12 20 20 20 3 3zelnik4 5 8 8 10 9 8 4 4 4 13 20 20 20 8 3Number of wins 73 50 51
48 53 46 45 40 37 41 26 22 26 20 K -means on artificial data sets (ARI of the partition selected by each method). Stadion Stadiondataset K (cid:63) ARI
K(cid:63) max mean max (ext) mean (ext) WG Sil Lange DB RT CH Dunn XB Gap Ben-Hur2d-10c 9 1.00 1.00 0.99 1.00 1.00 1.00 1.00 0.48 1.00 1.00 0.65 0.69 0.69 1.00 0.482d-3c-no123 3 0.67 0.76 0.76 0.76 0.76 0.76 0.76 0.76 0.76 0.76 0.34 0.76 0.76 0.67 0.672d-4c 4 1.00 0.83 1.00 1.00 0.83 1.00 0.83 0.64 1.00 1.00 0.24 0.83 0.83 0.44 0.552d-4c-no4 4 0.74 0.64 0.71 0.48 0.50 0.71 0.71 0.71 0.71 0.64 0.58 0.73 0.73 0.58 0.582d-4c-no9 4 0.87 0.82 0.87 0.76 0.76 0.87 0.48 0.48 0.83 0.69 0.34 0.27 0.27 0.87 0.553clusters_elephant 3 0.74 0.61 0.74 0.74 0.74 0.61 0.61 0.74 0.61 0.61 0.61 0.17 0.17 0.61 0.744clusters_corner 4 0.58 0.92 0.92 0.92 0.92 0.74 0.74 0.44 0.74 0.74 0.74 0.74 0.74 0.28 0.444clusters_twins 4 0.66 0.65 0.65 0.65 0.65 0.28 0.28 0.74 0.28 0.28 0.28 0.28 0.28 0.92 0.925clusters_stars 5 0.71 0.71 0.71 0.71 0.71 0.53 0.53 0.53 0.53 0.53 0.71 0.28 0.28 0.53 0.58A1 20 0.94 0.94 0.94 0.94 0.94 0.94 0.82 0.09 0.76 0.82 0.94 0.82 0.85 0.09 0.89A2 35 0.90 0.96 0.95 0.95 0.93 0.91 0.91 0.28 0.91 0.88 0.93 0.92 0.92 0.50 0.06curves1 2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.10 1.00 1.00 1.00 1.00D31 31 0.92 0.93 0.95 0.92 0.92 0.92 0.92 0.25 0.92 0.92 0.92 0.90 0.90 0.11 0.31diamond9 9 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.32 0.53dim032 16 1.00 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.88dim064 16 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.88dim1024 16 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.11dim128 16 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.99 1.00 1.00 0.99 1.00 0.88dim256 16 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.18dim512 16 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.99 1.00 1.00 0.99 0.99 0.12DS-577 3 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.69DS-850 5 0.88 0.87 0.87 0.83 0.83 0.88 0.88 0.76 0.76 0.76 0.46 0.37 0.37 0.88 0.87ds4c2sc8 8 0.65 0.71 0.71 0.71 0.71 0.71 0.61 0.71 0.71 0.61 0.57 0.48 0.48 0.15 0.71elliptical_10_2 10 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.88 0.98 0.98 0.98 0.80 0.19 0.28elly-2d10c13s 10 0.34 0.34 0.36 0.34 0.36 0.34 0.34 0.34 0.34 0.34 0.34 0.29 0.29 0.10 0.34engytime 2 0.81 0.81 0.50 0.81 0.50 0.56 0.56 0.81 0.56 0.56 0.56 0.10 0.10 0.56 0.31exemples1_3g 3 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.57 0.57 0.96 0.50exemples10_WellS_3g 3 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.58exemples2_5g 5 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.37 1.00 1.00 1.00 0.37 0.48 1.00 1.00exemples3_Uvar_4g 4 0.53 0.78 0.78 0.79 0.79 0.78 0.79 0.78 0.79 0.79 0.78 0.81 0.81 0.49 0.80exemples4_overlap_3g 3 0.59 0.59 0.59 0.59 0.28 0.59 0.59 0.58 0.59 0.59 0.28 0.10 0.10 0.59 0.35exemples5_overlap2_3g 3 0.25 0.44 0.12 0.35 0.10 0.35 0.35 0.44 0.12 0.35 0.12 0.06 0.07 0.44 0.16exemples6_quicunx_4g 4 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.63exemples7_elbow_3g 3 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.51 0.97 0.97 0.56exemples8_Overlap_Uvar_5g 6 0.76 0.83 0.83 0.83 0.83 0.61 0.56 0.71 0.61 0.56 0.83 0.30 0.30 0.71 0.51exemples9_YoD_6g 6 0.76 0.83 0.83 0.83 0.83 0.83 0.83 0.45 0.83 0.83 0.76 0.45 0.83 0.83 0.76fourty 40 0.85 0.94 0.94 0.85 0.94 0.94 0.94 1.00 0.82 0.69 0.85 0.67 0.67 0.05 0.13g2-16 2 0.91 0.91 0.91 0.91 0.91 0.91 0.91 0.91 0.09 0.09 0.91 0.91 0.09 0.91 0.91g2-2 2 0.25 0.25 0.11 0.25 0.03 0.25 0.25 0.25 0.03 0.06 0.25 0.03 0.03 0.25 0.11g2-64 2 1.00 1.00 1.00 1.00 1.00 0.10 0.50 
1.00 0.10 0.33 0.42 0.12 0.10 1.00 1.00hepta 7 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.27 0.96long1 2 1.00 1.00 1.00 1.00 1.00 1.00 0.50 1.00 0.50 0.50 0.20 1.00 1.00 0.50 0.50long2 2 1.00 1.00 1.00 1.00 1.00 1.00 0.49 1.00 0.49 0.49 0.19 1.00 1.00 0.22 0.28long3 2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.42 0.39 0.14 1.00 1.00 0.42 0.42longsquare 6 0.80 0.85 0.80 0.84 0.81 0.27 0.27 0.57 0.81 0.27 0.45 0.27 0.45 0.57 0.57R15 15 0.99 0.99 0.99 0.99 0.99 0.99 0.26 0.99 0.99 0.99 0.99 0.26 0.26 0.12 0.26s-set1 15 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.23 0.48s-set2 15 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.13 0.94 0.94 0.94 0.94 0.94 0.23 0.39s-set3 15 0.73 0.73 0.73 0.73 0.73 0.73 0.73 0.21 0.73 0.29 0.73 0.53 0.70 0.11 0.55s-set4 15 0.63 0.63 0.63 0.63 0.63 0.63 0.62 0.11 0.62 0.53 0.63 0.59 0.59 0.20 0.44sizes1 4 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.96 0.96sizes2 4 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.88 0.88 0.48 0.48sizes3 4 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.39 0.94 0.94 0.39 0.94sizes4 4 0.93 0.93 0.93 0.93 0.93 0.93 0.93 0.93 0.93 0.93 0.33 0.78 0.78 0.78 0.93sizes5 4 0.93 0.93 0.93 0.93 0.93 0.93 0.93 0.93 0.93 0.93 0.29 0.91 0.91 0.83 0.29spherical_4_3 4 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.50 1.00 1.00 1.00 1.00 1.00 1.00 0.50spherical_5_2 5 0.86 0.86 0.86 0.86 0.87 0.86 0.86 0.32 0.86 0.68 0.86 0.33 0.33 0.32 0.88spherical_6_2 6 1.00 1.00 1.00 1.00 0.82 1.00 0.68 0.68 1.00 0.68 1.00 0.68 0.68 1.00 0.68square1 4 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95square2 4 0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.27 0.27 0.92 0.92square3 4 0.86 0.86 0.86 0.86 0.86 0.86 0.86 0.86 0.86 0.86 0.86 0.25 0.25 0.86 0.74square4 4 0.79 0.79 0.79 0.79 0.79 0.79 0.79 0.79 0.79 0.79 0.79 0.23 0.23 0.79 0.79square5 4 0.68 0.68 0.68 0.68 0.68 0.68 0.68 0.68 0.68 0.68 0.68 0.19 0.21 0.48 0.68st900 9 0.83 0.83 0.83 0.75 0.83 0.83 0.83 0.83 0.83 0.83 0.83 0.57 0.51 0.17 0.83tetra 4 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00triangle1 4 0.98 0.91 0.91 0.88 0.89 0.98 0.98 0.98 0.98 0.98 0.89 0.98 0.98 0.89 0.87triangle2 4 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.79 0.79 0.79 0.79 0.79twenty 20 0.93 0.99 0.99 0.93 0.93 0.93 0.93 1.00 0.93 0.93 0.93 0.93 0.93 0.10 0.10twodiamonds 2 1.00 1.00 1.00 1.00 1.00 0.50 0.50 1.00 0.50 0.50 0.10 0.10 0.10 0.50 0.50wingnut 2 0.67 0.00 0.00 0.67 0.34 0.67 0.67 0.67 0.11 0.51 0.11 0.11 0.11 0.67 0.23xclara 3 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.56 0.99 0.99 0.99 0.56 0.56 0.99 0.49zelnik2 3 0.47 0.63 0.54 0.66 0.64 0.54 0.47 0.42 0.54 0.66 0.69 0.69 0.69 0.47 0.47zelnik4 5 0.68 0.74 0.74 0.76 0.76 0.75 0.65 0.65 0.65 0.77 0.67 0.67 0.67 0.74 0.43Average ARI 0.85 - dataset K (cid:63) Stadion-max Stadion-mean WG Sil Lange DB RT CH Dunn XB Ben-Hur2d-10c 9 9 5 9 9 2 9 9 17 5 5 112d-3c-no123 3 2 2 2 2 2 2 2 2 2 20 22d-4c 4 4 3 4 3 2 4 4 20 3 3 52d-4c-no4 4 4 4 4 4 4 5 7 8 3 4 42d-4c-no9 4 4 4 6 2 2 6 2 16 2 2 63clusters_elephant 3 3 3 2 2 2 2 2 2 20 20 34clusters_corner 4 3 3 2 2 4 2 2 2 2 3 24clusters_twins 4 3 3 2 2 3 2 2 2 2 3 35clusters_stars 5 1 1 4 3 5 4 3 5 4 5 3A1 20 20 20 20 11 2 19 10 21 2 2 2A2 35 35 35 35 32 35 35 32 35 35 40 8curves1 2 2 2 2 2 2 2 2 20 2 2 2D31 31 31 31 30 28 31 28 31 31 31 31 35diamond9 9 9 9 9 9 9 9 9 9 9 9 2dim032 16 16 16 16 16 16 16 16 16 16 16 5dim064 16 16 16 16 16 16 16 16 16 16 16 2dim1024 16 16 16 16 16 16 20 20 16 16 20 2dim128 16 
16 16 16 16 16 16 16 16 16 16 13dim256 16 16 16 16 16 16 16 16 16 16 16 2dim512 16 16 16 16 16 16 20 20 16 16 20 2DS-577 3 3 3 3 3 3 3 3 3 3 3 3DS-850 5 5 5 5 5 2 4 4 8 4 5 7ds4c2sc8 8 1 1 6 6 7 6 5 17 20 20 3elliptical_10_2 10 10 10 10 9 10 10 10 10 2 5 11elly-2d10c13s 10 8 15 6 4 7 15 13 19 20 20 3engytime 2 1 1 3 3 2 4 3 3 15 20 2exemples1_3g 3 3 3 3 3 3 3 3 3 2 2 3exemples10_WellS_3g 3 3 3 3 3 3 3 3 3 3 3 3exemples2_5g 5 5 5 5 5 2 5 5 5 2 3 6exemples3_Uvar_4g 4 4 4 6 5 2 5 9 6 9 4 4exemples4_overlap_3g 3 2 7 3 3 3 3 3 8 18 20 3exemples5_overlap2_3g 3 2 2 2 2 2 2 7 8 2 2 2exemples6_quicunx_4g 4 4 4 4 4 2 4 4 4 4 4 4exemples7_elbow_3g 3 3 3 3 3 3 3 3 3 3 3 5exemples8_Overlap_Uvar_5g 6 5 5 4 4 4 4 3 4 3 4 4exemples9_YoD_6g 6 5 6 5 4 2 4 5 5 3 4 6fourty 40 42 42 40 40 40 40 40 40 40 40 4g2-16 2 2 2 2 2 2 20 20 2 18 20 2g2-2 2 1 1 2 2 2 16 7 2 18 20 2g2-64 2 2 2 3 2 2 20 20 3 20 20 2hepta 7 7 7 7 7 7 7 7 7 7 7 8long1 2 2 2 2 3 2 4 6 11 2 2 2long2 2 2 2 2 2 2 4 4 20 2 2 2long3 2 2 2 2 2 2 3 4 10 2 2 2longsquare 6 5 5 2 2 2 5 2 6 2 2 3R15 15 15 15 15 8 15 15 15 15 8 8 8s-set1 15 15 15 15 15 15 15 15 15 15 15 12s-set2 15 15 15 15 15 15 15 15 15 15 15 17s-set3 15 15 15 15 11 15 11 4 15 20 20 2s-set4 15 20 11 15 11 15 13 5 15 20 20 2sizes1 4 4 4 4 4 4 4 4 4 4 4 4sizes2 4 4 4 4 4 4 4 4 4 4 4 4sizes3 4 4 4 4 4 4 4 4 5 4 4 4sizes4 4 4 3 4 4 4 4 4 6 4 4 4sizes5 4 2 3 4 4 4 4 4 6 4 4 4spherical_4_3 4 4 4 4 4 4 4 4 4 4 4 5spherical_5_2 5 6 5 5 4 5 5 5 20 20 20 5spherical_6_2 6 6 6 6 4 4 6 4 6 4 4 4square1 4 4 4 4 4 4 4 4 4 4 4 4square2 4 4 4 4 4 4 4 4 4 4 5 4square3 4 4 4 4 4 4 4 4 4 20 20 4square4 4 4 5 4 4 4 4 4 4 20 20 3square5 4 5 5 4 4 4 4 4 4 20 20 2st900 9 9 13 10 9 9 9 9 10 9 9 3tetra 4 4 4 4 4 4 4 4 4 20 20 4triangle1 4 4 4 4 4 4 4 4 6 3 4 8triangle2 4 4 4 4 4 2 4 4 5 4 4 4twenty 20 20 20 20 20 20 20 20 20 20 20 27twodiamonds 2 2 7 2 4 4 4 4 20 20 20 2wingnut 2 2 10 2 2 2 20 4 20 2 2 2xclara 3 3 3 3 3 2 3 3 3 2 2 3zelnik2 3 11 5 20 5 3 6 20 20 20 20 5zelnik4 5 5 5 19 4 4 4 19 20 19 19 4Number of wins 73
45 51 41 40 39 33 34 31 dataset K (cid:63) ARI
K⋆
Stadion-max Stadion-mean WG Sil Lange DB RT CH Dunn XB Ben-Hur2d-10c 9 1.00 1.00 0.69 1.00 1.00 0.25 1.00 1.00 0.68 0.69 0.69 0.892d-3c-no123 3 0.74 0.76 0.76 0.76 0.76 0.76 0.76 0.76 0.76 0.76 0.15 0.762d-4c 4 1.00 1.00 0.83 1.00 0.83 0.64 1.00 1.00 0.22 0.83 0.83 0.842d-4c-no4 4 1.00 1.00 1.00 1.00 1.00 1.00 0.73 0.59 0.59 0.85 1.00 1.002d-4c-no9 4 0.96 0.96 0.96 0.79 0.50 0.50 0.79 0.50 0.31 0.50 0.50 0.813clusters_elephant 3 0.75 0.75 0.75 0.62 0.62 0.61 0.62 0.62 0.62 0.14 0.14 0.764clusters_corner 4 0.99 0.94 0.94 0.75 0.75 0.44 0.75 0.75 0.75 0.75 0.94 0.174clusters_twins 4 0.64 0.64 0.64 0.34 0.34 0.92 0.34 0.34 0.34 0.34 0.64 0.925clusters_stars 5 0.69 0.00 0.00 0.62 0.56 0.65 0.62 0.56 0.69 0.62 0.69 0.49A1 20 0.96 0.96 0.96 0.96 0.61 0.09 0.92 0.56 0.95 0.09 0.09 0.09A2 35 0.95 0.95 0.95 0.95 0.88 0.93 0.95 0.88 0.95 0.95 0.92 0.32curves1 2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.10 1.00 1.00 1.00D31 31 0.93 0.93 0.93 0.90 0.86 0.92 0.86 0.93 0.93 0.93 0.93 0.90diamond9 9 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.18dim032 16 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.34dim064 16 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.08dim1024 16 1.00 1.00 1.00 1.00 1.00 1.00 0.95 0.95 1.00 1.00 0.95 0.12dim128 16 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.83dim256 16 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.12dim512 16 1.00 1.00 1.00 1.00 1.00 1.00 0.97 0.97 1.00 1.00 0.97 0.12DS-577 3 0.97 0.97 0.97 0.97 0.97 0.99 0.97 0.97 0.97 0.97 0.97 0.99DS-850 5 0.99 0.99 0.99 0.99 0.99 0.40 0.81 0.81 0.76 0.81 0.99 0.82ds4c2sc8 8 0.69 0.00 0.00 0.58 0.58 0.71 0.58 0.53 0.58 0.50 0.50 0.27elliptical_10_2 10 1.00 1.00 1.00 1.00 0.90 0.97 1.00 1.00 1.00 0.18 0.56 0.94elly-2d10c13s 10 0.34 0.39 0.35 0.42 0.27 0.43 0.35 0.36 0.31 0.29 0.29 0.30engytime 2 0.82 0.00 0.00 0.67 0.67 0.81 0.57 0.67 0.67 0.12 0.10 0.81exemples1_3g 3 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.57 0.57 0.94exemples10_WellS_3g 3 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00exemples2_5g 5 1.00 1.00 1.00 1.00 1.00 0.37 1.00 1.00 1.00 0.37 0.48 0.94exemples3_Uvar_4g 4 0.92 0.92 0.92 0.82 0.87 0.46 0.87 0.65 0.82 0.65 0.92 0.56exemples4_overlap_3g 3 0.75 0.59 0.22 0.75 0.75 0.41 0.75 0.75 0.17 0.07 0.07 0.41exemples5_overlap2_3g 3 0.39 0.79 0.79 0.79 0.79 0.79 0.79 0.16 0.13 0.79 0.79 0.79exemples6_quicunx_4g 4 0.90 0.90 0.90 0.90 0.90 0.29 0.90 0.90 0.90 0.90 0.90 0.91exemples7_elbow_3g 3 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.80exemples8_Overlap_Uvar_5g 6 0.90 0.84 0.84 0.70 0.70 0.80 0.70 0.59 0.70 0.59 0.70 0.80exemples9_YoD_6g 6 0.90 0.84 0.90 0.84 0.70 0.45 0.70 0.84 0.84 0.59 0.70 0.90fourty 40 1.00 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.13g2-16 2 0.85 0.85 0.85 0.85 0.85 0.82 0.09 0.09 0.85 0.10 0.09 0.82g2-2 2 0.18 0.00 0.00 0.18 0.18 0.21 0.04 0.06 0.18 0.03 0.03 0.21g2-64 2 1.00 1.00 1.00 0.80 1.00 1.00 0.12 0.12 0.80 0.12 0.12 1.00hepta 7 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96long1 2 1.00 1.00 1.00 1.00 0.75 1.00 0.55 0.37 0.22 1.00 1.00 1.00long2 2 1.00 1.00 1.00 1.00 1.00 1.00 0.56 0.56 0.11 1.00 1.00 1.00long3 2 1.00 1.00 1.00 1.00 1.00 1.00 0.47 0.31 0.12 1.00 1.00 1.00longsquare 6 0.79 0.81 0.81 0.27 0.27 0.27 0.81 0.27 0.79 0.27 0.27 0.56R15 15 0.98 0.98 0.98 0.98 0.26 0.98 0.98 0.98 0.98 0.26 0.26 0.26s-set1 15 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.81s-set2 15 0.91 0.91 0.91 0.91 0.91 0.92 0.91 0.91 0.91 0.91 0.91 0.90s-set3 15 0.68 0.68 0.68 0.68 0.60 
0.69 0.60 0.30 0.68 0.65 0.65 0.11s-set4 15 0.58 0.58 0.50 0.58 0.50 0.59 0.54 0.29 0.58 0.58 0.58 0.09sizes1 4 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95sizes2 4 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97sizes3 4 0.96 0.96 0.96 0.96 0.96 0.93 0.96 0.96 0.52 0.96 0.96 0.93sizes4 4 0.97 0.97 0.93 0.97 0.97 0.97 0.97 0.97 0.34 0.97 0.97 0.97sizes5 4 0.96 0.89 0.93 0.96 0.96 0.97 0.96 0.96 0.33 0.96 0.96 0.97spherical_4_3 4 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.91spherical_5_2 5 0.88 0.82 0.88 0.88 0.71 0.88 0.88 0.88 0.33 0.33 0.33 0.88spherical_6_2 6 1.00 1.00 1.00 1.00 0.68 0.68 1.00 0.68 1.00 0.68 0.68 0.68square1 4 0.91 0.91 0.91 0.91 0.91 0.93 0.91 0.91 0.91 0.91 0.91 0.93square2 4 0.86 0.86 0.86 0.86 0.86 0.88 0.86 0.86 0.86 0.86 0.80 0.88square3 4 0.84 0.84 0.84 0.84 0.84 0.83 0.84 0.84 0.84 0.25 0.25 0.83square4 4 0.73 0.73 0.66 0.73 0.73 0.70 0.73 0.73 0.73 0.24 0.24 0.55square5 4 0.61 0.62 0.62 0.61 0.61 0.62 0.61 0.61 0.61 0.19 0.19 0.35st900 9 0.74 0.74 0.64 0.75 0.74 0.73 0.74 0.74 0.75 0.74 0.74 0.32tetra 4 1.00 1.00 1.00 1.00 1.00 0.97 1.00 1.00 1.00 0.28 0.28 0.97triangle1 4 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.89 0.71 1.00 0.87triangle2 4 0.98 0.98 0.98 0.98 0.98 0.45 0.98 0.98 0.79 0.98 0.98 0.97twenty 20 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.90twodiamonds 2 0.99 0.99 0.30 0.99 0.55 0.54 0.55 0.55 0.11 0.11 0.11 1.00wingnut 2 1.00 1.00 0.23 1.00 1.00 1.00 0.11 0.53 0.11 1.00 1.00 1.00xclara 3 0.99 0.99 0.99 0.99 0.99 0.56 0.99 0.99 0.99 0.56 0.56 0.99zelnik2 3 0.58 0.71 0.63 0.55 0.63 0.49 0.74 0.55 0.55 0.55 0.55 0.51zelnik4 5 0.69 0.69 0.69 0.66 0.65 0.65 0.65 0.66 0.66 0.66 0.66 0.65Average ARI 0.89 - dataset K (cid:63) Stadion-max (ext) Stadion-mean (ext) BIC WG Sil Lange DB RT CH Dunn XB Ben-Hur2d-10c 9 8 8 10 9 9 2 9 9 9 8 9 102d-3c-no123 3 3 9 3 3 3 2 5 3 10 3 3 32d-4c 4 10 10 4 4 3 2 4 4 4 3 3 42d-4c-no4 4 4 4 4 4 4 4 4 6 6 4 4 42d-4c-no9 4 3 3 4 2 2 2 4 2 15 3 3 43clusters_elephant 3 2 3 4 2 2 2 10 2 2 16 16 44clusters_corner 4 3 3 5 2 3 2 3 2 2 2 2 44clusters_twins 4 16 19 7 5 3 3 3 3 20 2 17 45clusters_stars 5 15 15 7 3 3 3 3 3 3 17 18 7A1 20 20 18 20 21 18 2 23 18 21 25 18 4A2 35 35 33 35 36 29 35 36 17 40 36 36 6curves1 2 2 2 9 2 2 2 2 2 10 2 2 4D31 31 33 24 31 32 27 31 27 32 32 32 32 36diamond9 9 9 9 9 9 9 9 9 9 9 9 9 4dim032 16 16 16 9 16 16 16 16 16 16 16 16 13dim064 16 16 16 4 16 16 16 20 20 16 16 20 14dim1024 16 16 15 2 16 16 16 20 20 16 16 20 2dim128 16 16 16 2 16 16 16 20 20 16 16 20 13dim256 16 16 15 2 16 16 16 20 20 16 16 20 7dim512 16 16 15 2 16 16 16 20 20 16 16 20 2DS-577 3 3 3 3 3 3 3 3 3 3 3 3 4DS-850 5 5 9 5 5 5 5 4 3 16 2 5 5ds4c2sc8 8 12 12 4 7 2 3 7 13 14 18 18 3elliptical_10_2 10 10 12 10 10 9 10 10 10 10 5 5 5elly-2d10c13s 10 10 10 7 2 2 5 2 2 17 11 11 3engytime 2 2 2 2 2 4 2 4 4 5 19 2 5exemples1_3g 3 3 3 3 3 3 2 3 3 3 2 2 6exemples10_WellS_3g 3 3 3 3 3 3 3 3 3 3 3 3 4exemples2_5g 5 5 6 5 5 5 2 5 6 6 2 3 5exemples3_Uvar_4g 4 8 8 4 5 4 3 7 5 5 5 5 7exemples4_overlap_3g 3 3 3 3 3 3 3 3 3 6 3 3 4exemples5_overlap2_3g 3 2 18 4 2 2 2 2 18 10 14 14 5exemples6_quicunx_4g 4 4 4 4 4 4 3 4 4 4 4 4 4exemples7_elbow_3g 3 3 3 5 3 3 2 3 3 3 2 3 6exemples8_Overlap_Uvar_5g 6 6 6 6 5 4 4 4 5 5 6 6 6exemples9_YoD_6g 6 6 6 6 5 4 5 4 5 5 6 6 8fourty 40 42 26 40 25 25 40 25 20 34 2 2 4g2-16 2 2 2 2 2 2 2 20 20 2 2 20 2g2-2 2 2 9 2 2 2 2 19 6 2 16 16 2g2-64 2 2 2 1 2 2 2 20 20 2 2 20 2hepta 7 7 7 7 7 7 7 7 7 7 7 7 8long1 2 2 2 2 2 8 2 8 8 20 2 2 2long2 2 2 2 2 2 8 2 8 8 12 2 2 4long3 2 
2 2 2 2 2 2 2 2 17 2 2 3longsquare 6 7 5 7 2 2 2 5 2 6 2 2 5R15 15 15 15 15 15 8 15 15 15 15 8 8 8s-set1 15 15 15 17 15 15 15 20 20 15 15 20 5s-set2 15 15 15 19 16 10 15 12 12 16 12 15 2s-set3 15 15 15 15 13 18 2 11 8 13 15 13 2s-set4 15 15 15 19 15 2 3 2 7 15 12 12 4sizes1 4 4 4 4 4 4 4 4 4 4 4 5 4sizes2 4 4 4 4 4 4 3 4 4 4 4 4 4sizes3 4 4 4 4 4 4 4 4 4 6 4 4 4sizes4 4 3 3 4 3 3 4 6 3 5 3 3 4sizes5 4 4 4 4 4 4 4 4 6 7 3 3 5spherical_4_3 4 4 4 4 4 4 4 4 4 4 4 4 5spherical_5_2 5 5 7 4 5 5 4 5 5 5 9 17 5spherical_6_2 6 6 6 6 6 4 4 6 4 6 4 4 7square1 4 4 5 4 4 4 4 4 4 4 4 4 4square2 4 4 4 4 4 4 4 5 4 4 4 6 4square3 4 4 4 4 4 4 4 4 4 4 19 19 4square4 4 4 4 4 4 4 4 4 4 4 20 20 4square5 4 4 5 4 4 4 4 4 4 4 13 13 4st900 9 9 9 6 9 9 9 9 9 9 12 12 2tetra 4 4 4 4 4 4 4 4 4 4 4 4 4triangle1 4 9 9 4 4 4 4 4 4 9 4 4 2triangle2 4 4 4 4 4 4 4 4 4 5 4 4 5twenty 20 20 20 20 20 20 20 20 20 20 20 20 2twodiamonds 2 2 2 7 2 2 2 11 2 17 10 10 2wingnut 2 2 7 6 2 2 2 18 16 16 16 16 2xclara 3 3 3 3 3 3 3 6 3 3 2 2 5zelnik2 3 20 20 3 18 11 3 18 13 19 18 18 3zelnik4 5 17 18 5 17 4 5 20 20 20 16 17 3Number of wins 73
43 48 52 45 48 34 33 37 34 28 28 dataset K (cid:63) ARI
K⋆
Stadion-max (ext) Stadion-mean (ext) BIC WG Sil Lange DB RT CH Dunn XB Ben-Hur2d-10c 9 1.00 0.99 0.99 0.95 1.00 1.00 0.25 1.00 1.00 1.00 0.99 1.00 0.952d-3c-no123 3 0.98 0.98 0.44 0.98 0.98 0.98 0.76 0.63 0.98 0.33 0.98 0.98 0.982d-4c 4 1.00 0.74 0.74 1.00 1.00 0.83 0.64 1.00 1.00 1.00 0.83 0.83 1.002d-4c-no4 4 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.71 0.71 1.00 1.00 1.002d-4c-no9 4 1.00 0.85 0.85 1.00 0.49 0.49 0.50 1.00 0.49 0.49 0.85 0.85 1.003clusters_elephant 3 0.69 0.62 0.69 0.60 0.62 0.62 0.61 0.64 0.62 0.62 0.57 0.57 0.604clusters_corner 4 0.62 0.75 0.75 0.43 0.66 0.75 0.26 0.75 0.66 0.66 0.66 0.66 0.444clusters_twins 4 0.61 0.47 0.48 0.56 0.60 0.60 0.93 0.60 0.60 0.47 0.28 0.47 0.995clusters_stars 5 0.69 0.60 0.60 0.60 0.55 0.55 0.54 0.55 0.55 0.55 0.53 0.53 0.60A1 20 0.94 0.94 0.78 0.94 0.99 0.86 0.30 1.00 0.86 0.99 0.99 0.86 0.20A2 35 0.93 0.93 0.87 0.93 0.92 0.76 0.93 0.92 0.55 0.96 0.92 0.92 0.35curves1 2 1.00 1.00 1.00 0.22 1.00 1.00 1.00 1.00 1.00 0.21 1.00 1.00 0.50D31 31 0.86 0.89 0.73 0.91 0.89 0.81 0.86 0.81 0.89 0.89 0.89 0.89 0.91diamond9 9 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.45dim032 16 1.00 1.00 1.00 0.60 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.83dim064 16 1.00 1.00 1.00 0.32 1.00 1.00 1.00 0.94 0.94 1.00 1.00 0.94 0.88dim1024 16 1.00 1.00 0.94 0.12 1.00 1.00 1.00 0.97 0.97 1.00 1.00 0.97 0.12dim128 16 1.00 1.00 1.00 0.12 1.00 1.00 1.00 0.94 0.94 1.00 1.00 0.94 0.83dim256 16 1.00 1.00 0.94 0.12 1.00 1.00 1.00 0.94 0.94 1.00 1.00 0.94 0.48dim512 16 1.00 1.00 0.94 0.12 1.00 1.00 1.00 0.97 0.97 1.00 1.00 0.97 0.12DS-577 3 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.86DS-850 5 0.98 0.98 0.76 0.98 0.98 0.98 0.98 0.68 0.54 0.50 0.19 0.98 0.98ds4c2sc8 8 0.80 0.79 0.79 0.44 0.69 0.08 0.37 0.69 0.63 0.65 0.55 0.55 0.37elliptical_10_2 10 0.98 0.98 0.93 0.99 0.98 0.90 0.99 0.98 0.98 0.98 0.56 0.56 0.56elly-2d10c13s 10 0.47 0.47 0.47 0.45 0.08 0.08 0.33 0.08 0.08 0.40 0.47 0.47 0.16engytime 2 0.86 0.86 0.86 0.87 0.86 0.62 0.87 0.62 0.62 0.41 0.18 0.86 0.39exemples1_3g 3 0.96 0.96 0.96 0.97 0.96 0.96 0.57 0.96 0.96 0.96 0.57 0.57 0.57exemples10_WellS_3g 3 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.88exemples2_5g 5 1.00 1.00 1.00 1.00 1.00 1.00 0.37 1.00 1.00 1.00 0.37 0.48 1.00exemples3_Uvar_4g 4 0.96 0.85 0.85 0.95 0.88 0.96 0.66 0.90 0.88 0.88 0.88 0.88 0.85exemples4_overlap_3g 3 0.82 0.82 0.82 0.83 0.82 0.82 0.83 0.82 0.82 0.26 0.82 0.82 0.42exemples5_overlap2_3g 3 0.36 0.79 0.14 0.51 0.79 0.79 0.79 0.79 0.14 0.11 0.17 0.17 0.92exemples6_quicunx_4g 4 0.92 0.92 0.92 0.92 0.92 0.92 0.64 0.92 0.92 0.92 0.92 0.92 0.67exemples7_elbow_3g 3 1.00 1.00 1.00 0.76 1.00 1.00 0.53 1.00 1.00 1.00 0.53 1.00 0.62exemples8_Overlap_Uvar_5g 6 0.93 0.93 0.93 0.62 0.85 0.61 0.86 0.61 0.85 0.85 0.93 0.93 0.85exemples9_YoD_6g 6 0.93 0.93 0.93 0.93 0.85 0.61 0.85 0.61 0.85 0.85 0.93 0.93 0.13fourty 40 0.68 0.68 0.68 1.00 0.67 0.67 1.00 0.67 0.61 0.68 0.04 0.04 0.91g2-16 2 0.90 0.90 0.90 0.90 0.90 0.90 0.90 0.13 0.13 0.90 0.90 0.13 0.90g2-2 2 0.22 0.22 0.06 0.22 0.22 0.22 0.22 0.03 0.08 0.22 0.04 0.04 0.22g2-64 2 1.00 1.00 1.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00hepta 7 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96long1 2 1.00 1.00 1.00 1.00 1.00 0.35 1.00 0.35 0.35 0.14 1.00 1.00 1.00long2 2 1.00 1.00 1.00 1.00 1.00 0.35 1.00 0.35 0.35 0.25 1.00 1.00 0.50long3 2 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.10 1.00 1.00 0.42longsquare 6 0.80 0.99 0.82 0.99 0.27 0.27 0.27 0.82 0.27 0.80 0.27 0.27 
0.82R15 15 0.99 0.99 0.99 0.99 0.99 0.26 0.99 0.99 0.99 0.99 0.26 0.26 0.26s-set1 15 0.99 0.99 0.99 0.97 0.99 0.99 0.99 0.93 0.93 0.99 0.99 0.93 0.40s-set2 15 0.85 0.85 0.85 0.88 0.91 0.56 0.85 0.77 0.77 0.91 0.77 0.85 0.13s-set3 15 0.61 0.61 0.61 0.73 0.64 0.55 0.11 0.55 0.48 0.64 0.61 0.64 0.11s-set4 15 0.55 0.55 0.55 0.62 0.55 0.01 0.19 0.01 0.31 0.55 0.48 0.48 0.22sizes1 4 0.96 0.96 0.96 0.95 0.96 0.96 0.95 0.96 0.96 0.96 0.96 0.96 0.95sizes2 4 0.97 0.97 0.97 0.97 0.97 0.97 0.89 0.97 0.97 0.97 0.97 0.97 0.97sizes3 4 0.98 0.98 0.98 0.97 0.98 0.98 0.97 0.98 0.98 0.42 0.98 0.98 0.97sizes4 4 0.47 0.95 0.95 0.98 0.95 0.95 0.98 0.75 0.95 0.55 0.95 0.95 0.98sizes5 4 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.97 0.43 0.95 0.95 0.45spherical_4_3 4 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.92spherical_5_2 5 0.89 0.89 0.80 0.69 0.89 0.89 0.69 0.89 0.89 0.89 0.62 0.39 0.93spherical_6_2 6 1.00 1.00 1.00 1.00 1.00 0.68 0.68 1.00 0.68 1.00 0.68 0.68 0.95square1 4 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.95square2 4 0.93 0.93 0.93 0.92 0.93 0.93 0.92 0.93 0.93 0.93 0.93 0.92 0.92square3 4 0.85 0.85 0.85 0.86 0.85 0.85 0.86 0.85 0.85 0.85 0.36 0.36 0.86square4 4 0.78 0.78 0.78 0.79 0.78 0.78 0.79 0.78 0.78 0.78 0.31 0.31 0.79square5 4 0.68 0.68 0.61 0.67 0.68 0.68 0.67 0.68 0.68 0.68 0.30 0.30 0.67st900 9 0.84 0.84 0.84 0.55 0.84 0.84 0.80 0.84 0.84 0.84 0.74 0.74 0.15tetra 4 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00triangle1 4 1.00 0.87 0.87 1.00 1.00 1.00 1.00 1.00 1.00 0.87 1.00 1.00 0.33triangle2 4 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.86 0.99 0.99 0.79twenty 20 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.10twodiamonds 2 1.00 1.00 1.00 0.29 1.00 1.00 1.00 0.20 1.00 0.12 0.22 0.22 1.00wingnut 2 0.86 0.86 0.25 0.28 0.86 0.86 0.86 0.15 0.15 0.15 0.15 0.15 0.86xclara 3 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.56 0.56 0.71zelnik2 3 1.00 0.72 0.72 1.00 0.72 0.75 1.00 0.72 0.73 0.72 0.72 0.72 1.00zelnik4 5 0.99 0.84 0.84 0.99 0.84 0.78 0.99 0.84 0.84 0.84 0.84 0.84 0.51Average ARI 0.89 - 0.00 Experimental setting
In this section, we detail the initialization used in the algorithms, explain the preprocessing steps applied to the real data sets and, finally, provide the complete list of data sets (with references).
F.1 Algorithm initialization

• K-means: in both Stadion versions (standard and extended), an effective initialization is achieved using K-means++ [38] and keeping the best of 35 runs (w.r.t. the cost function).
• Ward linkage: agglomerative clustering is deterministic, so no initialization strategy is needed.
• Gaussian Mixture Model: two initializations were considered. The first uses K-means to initialize the EM algorithm. The second uses the approach discussed in [66]: the data are projected through a suitable transformation which enhances the separation of clusters (among the investigated transformations, the scaled SVD transformation performed best in their experiments, so we used it); model-based agglomerative hierarchical clustering is then applied at the initialization step and, once the hierarchy is obtained, the EM algorithm is run on the original data. In practice, there was no noticeable difference between the two initializations. A minimal sketch of these settings is given below.
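As a rough illustration of these settings, the sketch below instantiates the three algorithms with scikit-learn; the library choice, the parameter names and the K-means-based EM initialization are our own assumptions for illustration (the implementation used in the paper may differ), and the hierarchical scaled-SVD initialization of [66] is not reproduced here.

from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

def fit_all(X, k):
    # K-means: K-means++ seeding, keeping the best of 35 runs w.r.t. the cost function (inertia).
    km = KMeans(n_clusters=k, init="k-means++", n_init=35).fit(X)
    # Ward linkage: deterministic agglomerative clustering, no initialization needed.
    ward = AgglomerativeClustering(n_clusters=k, linkage="ward").fit(X)
    # Gaussian Mixture Model: EM initialized from a K-means partition
    # (the first of the two options discussed above).
    gmm = GaussianMixture(n_components=k, covariance_type="full", init_params="kmeans").fit(X)
    return km.labels_, ward.labels_, gmm.predict(X)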
F.2 Preprocessing

Every data set was scaled to zero mean and unit variance on each dimension, which is essential both for the algorithms to work and for the additive noise perturbation. For the real data sets, additional preprocessing was necessary to ensure the class labels truly represent the cluster structure.
Crabs
The Crabs data set was decomposed using a PCA, keeping principal components two and three, as described in [45]. Standard scaling was applied both before and after the PCA, as in the sketch below.
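A minimal sketch of this step, assuming scikit-learn and a matrix X holding the raw Crabs measurements; the component indexing follows the description above (principal components two and three):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def preprocess_crabs(X):
    X_scaled = StandardScaler().fit_transform(X)        # zero mean, unit variance
    pcs = PCA(n_components=3).fit_transform(X_scaled)   # first three principal components
    X_2d = pcs[:, 1:3]                                  # keep components two and three
    return StandardScaler().fit_transform(X_2d)         # re-scale after the projection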
Old Faithful & Iris
No preprocessing apart from scaling.
MFDS, MNIST, USPS
On these high-dimensional data sets (respectively 649, 784 and 256 dimensions), we extracted a two-dimensional representation following a two-step process, as introduced in [47]. First, a fully-connected symmetric autoencoder with [500, ...] units is trained to compress the data into a 10-dimensional latent feature space. Then, the UMAP [46] dimensionality reduction algorithm is applied on the latent features to obtain 2D representations. Finally, standard scaling is applied.
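The sketch below illustrates this two-step embedding with Keras and umap-learn; only the 500-unit first layer and the 10-dimensional latent space are taken from the description above, while the remaining layer sizes, the optimizer and the training length are placeholder assumptions, not the configuration actually used.

import umap
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

def embed_2d(X, latent_dim=10, epochs=50):
    p = X.shape[1]
    # Fully-connected, roughly symmetric encoder/decoder; sizes other than 500 and
    # latent_dim are placeholders.
    encoder = keras.Sequential([
        keras.Input(shape=(p,)),
        keras.layers.Dense(500, activation="relu"),
        keras.layers.Dense(latent_dim),
    ])
    decoder = keras.Sequential([
        keras.Input(shape=(latent_dim,)),
        keras.layers.Dense(500, activation="relu"),
        keras.layers.Dense(p),
    ])
    autoencoder = keras.Sequential([encoder, decoder])
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(X, X, epochs=epochs, batch_size=256, verbose=0)
    latent = encoder.predict(X, verbose=0)                   # 10-dimensional latent features
    X_2d = umap.UMAP(n_components=2).fit_transform(latent)   # UMAP down to two dimensions
    return StandardScaler().fit_transform(X_2d)              # final standard scaling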
Wine

This data set was reduced to two dimensions with UMAP, and then scaled. All UMAP-reduced data sets are displayed in Figure 26.
F.3 List of data sets
A complete list of the 80 artificial and real data sets used in this paper is provided below, indicating the number of samples (N), the dimension (p), the ground-truth number of clusters (K⋆) and a reference. The artificial data sets can easily be downloaded from the archives [67] and [68]. The data sets without a reference are original and were created for this work, in order to provide more challenging model selection tasks. They will be made available in the companion repository of this paper.
Table 21: List of the real-world data sets.
Dataset N p K⋆ reference
crabs 200 2 4 [69]
faithful 272 2 2 [70]
iris 150 4 3 [71]
MFDS_UMAP 2000 2 10 [72]
MNIST_UMAP 10000 2 10 [73]
USPS_UMAP 2007 2 10 [74]
wine_UMAP 178 2 3 [75]

Figure 26: Data sets MFDS, MNIST, USPS and Wine after UMAP dimensionality reduction. (a) MFDS_UMAP (b) MNIST_UMAP (c) USPS_UMAP (d) wine_UMAP.

Table 22: List of the artificial data sets.
Dataset N p K⋆ reference
2d-10c 2990 2 9 [76]
2d-3c-no123 715 2 3 [76]
2d-4c 1261 2 4 [76]
2d-4c-no4 863 2 4 [76]
2d-4c-no9 876 2 4 [76]
3clusters_elephant 700 2 3
4clusters_corner 1575 2 4
4clusters_twins 612 2 4
5clusters_stars 1050 2 5
A1 3000 2 20 [77]
A2 5250 2 35 [77]
curves1 1000 2 2 [67]
D31 3100 2 31 [78]
diamond9 3000 2 9 [79]
dim032 1024 32 16 [77]
dim064 1024 64 16 [77]
dim1024 1024 1024 16 [77]
dim128 1024 128 16 [77]
dim256 1024 256 16 [77]
dim512 1024 512 16 [77]
DS-577 577 2 3 [80]
DS-850 850 2 5 [80]
ds4c2sc8 485 2 8 [81]
elliptical_10_2 500 2 10 [82]
elly-2d10c13s 2796 2 10 [76]
engytime 4096 2 2 FCPS [52]
exemples1_3g 525 2 3
exemples10_WellS_3g 975 2 3
exemples2_5g 1375 2 5
exemples3_Uvar_4g 1000 2 4
exemples4_overlap_3g 1050 2 3
exemples5_overlap2_3g 1550 2 3
exemples6_quicunx_4g 2250 2 4
exemples7_elbow_3g 788 2 3
exemples8_Overlap_Uvar_5g 2208 2 6
exemples9_YoD_6g 2208 2 6
fourty 1000 2 40 [67]
g2-16 2048 16 2 [77]
g2-2 2048 2 2 [77]
g2-64 2048 64 2 [77]
hepta 212 3 7 FCPS [52]
long1 1000 2 2 [83]
long2 1000 2 2 [83]
long3 1000 2 2 [83]
longsquare 900 2 6 [83]
R15 600 2 15 [78]
s-set1 5000 2 15 [77]
s-set2 5000 2 15 [77]
s-set3 5000 2 15 [77]
s-set4 5000 2 15 [77]
sizes1 1000 2 4 [83]
sizes2 1000 2 4 [83]
sizes3 1000 2 4 [83]
sizes4 1000 2 4 [83]
sizes5 1000 2 4 [83]
spherical_4_3 400 3 4 [82]
spherical_5_2 250 2 5 [82]
spherical_6_2 300 2 6 [82]
square1 1000 2 4 [83]
square2 1000 2 4 [83]
square3 1000 2 4 [83]
square4 1000 2 4 [83]
square5 1000 2 4 [83]
st900 900 2 9 [82]
tetra 400 3 4 FCPS [52]
triangle1 1000 2 4 [83]
triangle2 1000 2 4 [83]
twenty 1000 2 20 [67]
twodiamonds 800 2 2 FCPS [52]
wingnut 1016 2 2 FCPS [52]
xclara 3000 2 3 [84]
zelnik2 303 2 3 [85]
zelnik4 622 2 5 [85]

References

[52] Alfred Ultsch. Clustering with SOM: U*C. In Workshop on Self Organizing Feature Maps, 2005.
[53] Holger Hoos and Kevin Leyton-Brown. An efficient approach for assessing hyperparameter importance. In International Conference on Machine Learning, pages 754–762, 2014.
[54] Shrinu Kushagra, Shai Ben-David, and Ihab Ilyas. Semi-supervised clustering for de-duplication. (D):1–18, 2018.
[55] Milton Friedman. A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11(1):86–92, 1940.
[56] Alessio Benavoli, Giorgio Corani, and Francesca Mangili. Should we really use post-hoc tests based on mean-ranks? The Journal of Machine Learning Research, 17(1):152–161, 2016.
[57] Frank Wilcoxon. Probability tables for individual comparisons by ranking methods. Biometrics, 3(3):119–122, 1947.
[58] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pages 65–70, 1979.
[59] Salvador Garcia and Francisco Herrera. An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. Journal of Machine Learning Research, 9(Dec):2677–2694, 2008.
[60] Glenn W Milligan and Martha C Cooper. A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research, 21(4):441–458, 1986.
[61] William M Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
[62] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
[63] Leslie C. Morey and Alan Agresti. The measurement of classification agreement: an adjustment of the Rand statistic for chance agreement. Educational and Psychological Measurement, 44:33–37, 1984.
[64] Edward B Fowlkes and Colin L Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569, 1983.
[65] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, 11:2837–2854, 2010.
[66] Luca Scrucca and Adrian E Raftery. Improved initialisation of model-based clustering using Gaussian hierarchical partitions. Advances in Data Analysis and Classification, 9(4):447–460, 2015.
[67] Tomas Barton. Clustering benchmarks. https://github.com/deric/clustering-benchmark/.
[68] M. Gagolewski. Clustering benchmark data.
[69] NA Campbell and RJ Mahon. A multivariate study of variation in two species of rock crab of the genus Leptograpsus. Australian Journal of Zoology, 22(3):417–425, 1974.
[70] Adelchi Azzalini and Adrian W Bowman. A look at some data on the Old Faithful geyser. Journal of the Royal Statistical Society: Series C (Applied Statistics), 39(3):357–365, 1990.
[71] Ronald A Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.
[72] M. van Breukelen, R.P.W. Duin, D.M.J. Tax and J.E. den Hartog. Handwritten digit recognition by combined classifiers. Kybernetika, 1998.
[73] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.
[74] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1994.
[75] B Vandeginste. Parvus: an extendable package of programs for data exploration, classification and correlation, M. Forina, R. Leardi, C. Armanino and S. Lanteri, Elsevier, Amsterdam, 1988, price: US 645, ISBN 0-444-43012-1. Journal of Chemometrics, 4(2):191–193, 1990.
[76] Julia Handl. Cluster generators. https://personalpages.manchester.ac.uk/staff/Julia.Handl/generators.html.
[77] P. Fränti and O. Virmajoki. Iterative shrinking method for clustering problems. Pattern Recognition, 2006.
[78] C.J. Veenman, M.J.T. Reinders and E. Backer. A maximum variance cluster algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
[79] S. Salvador and P. Chan. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In ICTAI.
[80] M. C. Su, C. H. Chou and C. C. Hsieh. Fuzzy C-means algorithm with a point symmetry distance. International Journal of Fuzzy Systems, 2005.
[81] Katti Faceli, Marcilio C. P. de Souto, Tiemi C. Sakata and Andre C. P. L. F. de Carvalho. Partitions selection strategy for set of clustering solutions. Neurocomputing, 2010.
[82] S. Bandyopadhyay and S. K. Pal. Classification and Learning Using Genetic Algorithms: Applications in Bioinformatics and Web Intelligence. Springer, Heidelberg, 2007.
[83] Julia Handl and Joshua Knowles. Multiobjective clustering with automatic determination of the number of clusters. 2004.
[84] Vincent Arel-Bundock. xclara dataset. http://vincentarelbundock.github.io/Rdatasets/datasets.html.
[85] Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering. In