[PDF] An empirical comparison and characterisation of nine popular clustering methods

Abstract

Nine popular clustering methods are applied to 42 real data sets. The aim is to give a detailed characterisation of the methods by means of several cluster validation indexes that measure various individual aspects of the resulting clusters such as small within-cluster distances, separation of clusters, closeness to a Gaussian distribution etc. as introduced in Hennig (2019). 30 of the data sets come with a "true" clustering. On these data sets the similarity of the clusterings from the nine methods to the "true" clusterings is explored. Furthermore, a mixed effects regression relates the observable individual aspects of the clusters to the similarity with the "true" clusterings, which in real clustering problems is unobservable. The study gives new insight not only into the ability of the methods to discover "true" clusterings, but also into properties of clusterings that can be expected from the methods, which is crucial for the choice of a method in a real situation without a given "true" clustering.

Full PDF

AAn empirical comparison and characterisation ofnine popular clustering methods

Christian Hennig,Dipartimento di Scienze Statistiche “Paolo Fortunati”Universita di BolognaBologna, Via delle belle Arti, 41, 40126, [email protected] 9, 2021

Abstract

Nine popular clustering methods are applied to 42 real data sets.The aim is to give a detailed characterisation of the methods by meansof several cluster validation indexes that measure various individualaspects of the resulting clusters such as small within-cluster distances,separation of clusters, closeness to a Gaussian distribution etc. asintroduced in Hennig (2019). 30 of the data sets come with a “true”clustering. On these data sets the similarity of the clusterings fromthe nine methods to the “true” clusterings is explored. Furthermore,a mixed eﬀects regression relates the observable individual aspects ofthe clusters to the similarity with the “true” clusterings, which in realclustering problems is unobservable. The study gives new insight notonly into the ability of the methods to discover “true” clusterings,but also into properties of clusterings that can be expected from themethods, which is crucial for the choice of a method in a real situationwithout a given “true” clustering.

Keywords:

Cluster benchmarking internal cluster validation ex-ternal cluster validation mixed eﬀects modelMSC2010 classiﬁcation: 62H30

This work compares cluster analysis methods empirically on 42 real datasets. 30 of these data sets come with a given “true” classiﬁcation. The prin-cipal aim is to explore how diﬀerent clustering methods produce solutionswith diﬀerent data analytic characteristics, which can help a user choos-ing an appropriate method for the research question of interest. This doesnot require the knowledge of a “true” clustering. The performance of themethods regarding recovery of the “truth” is reported, but is not the mainfocus. 1 a r X i v : . [ s t a t . M E ] F e b luster analysis plays a central role in modern data analysis and is ap-plied in almost every ﬁeld where data arise, be it ﬁnance, marketing, genet-ics, medicine, psychology, archaeology, social and political science, chem-istry, engineering, or machine learning. Cluster analysis can have well-deﬁned research aims such as species delimitation in biology, or be applied ina rather exploratory manner to learn about potentially informative structurein a data set, for example when clustering the districts of a city. New clusteranalysis methods are regularly developed, often for new data formats, butalso to ﬁx apparent defects of already existing methods. One reason for thisis that cluster analysis is diﬃcult, and all methods, or at least those withwhich enough experience has been collected, are known to “fail” in certain,even fairly regular and non-pathological, situations, where “failing” is oftentaken to mean that a certain pre-speciﬁed “true” clustering in data is notrecovered.A key problem with clustering is that there is no unique and generallyaccepted deﬁnition of what constitutes a cluster. This is not an accident,but rather part of the nature of the clustering problem. In real applicationsthere can be diﬀerent requirements for a “good” clustering, and diﬀerentclusterings can qualify as “true” on the same data set. For example, crabscan be classiﬁed according to species, or as male or female; paintings can beclassiﬁed according to style of the painter or according to the motif; a dataset of customers of a company may not show any clusters that are clearlyseparated from each other, but may be very heterogeneous, and the companymay be interested in having homogeneous subgroups of customers in orderto better target their campaigns, but the data set may allow for diﬀerentgroupings of similar quality; in many situations with given “true” classes,such as companies that go bankrupt in a given period vs. those that donot, there is no guarantee that these “true” classes correspond to patternsin the data that can be found at all. One could even argue that in a dataset that comes with a supposedly “true” grouping a clustering that does not coincide with that grouping is of more scientiﬁc interest than reproducingwhat is already known.Rather than being generally “better” or “worse”, diﬀerent cluster analy-sis methods can be seen as each coming with their own implicit deﬁnition ofwhat a cluster is, and when cluster analysis is to be applied, the researchershave to decide which cluster concepts are appropriate for the application athand. Cluster analysis can have various aims, and these aims can be in con-ﬂict with each other. For example, clusters that are well separated by cleardensity gaps may involve quite large within-cluster distances, which may betolerable in some applications but unacceptable in others. Clusters that canbe well represented by cluster centroids may be diﬀerent from those thatcorrespond to separable Gaussian distributions with potentially diﬀerentcovariance matrices, which in some applications are interpreted as meaning-fully diﬀerent data subsets. See Ackerman et al. (2010); von Luxburg et al.22012); Hennig (2015b,a) for the underlying “philosophy” of clustering.The starting point of this work is the collection of cluster validation in-dexes presented in Hennig (2019). These are indexes deﬁned in order toprovide a multivariate characterisation of a clustering, individually measur-ing aspects such as between-cluster separation, within-cluster homogeneity,or representation of the overall dissimilarity structure by the clustering.They are applied here in order to give general information about how thecharacteristics of clusterings depend on the clustering method.Many cluster validation indexes have been proposed in the literature,often in order to pick an optimal clustering in a given situation, e.g., bycomparing diﬀerent numbers of clusters, see Halkidi et al. (2015) for anoverview. Most of them (such as the Average Silhouette Width, Kaufmanand Rousseeuw (1990)) attempt to assess the quality of a clustering over-all by deﬁning a compromise of various aspects, particularly within-clusterhomogeneity and between-cluster separation. Following Hennig (2019) andAkhanli and Hennig (2020), the present work deviates from this approachby keeping diﬀerent aspects separate in order to inform the user in a moredetailed way what a given clustering achieves.A number of benchmark studies for cluster analysis have already beenpublished. Most of them focus on evaluating the quality of clusterings bycomparing them to given “true” clusterings. This has been done for arti-ﬁcially generated data (e.g., Milligan (1980); Brusco and Steinley (2007);Steinley and Brusco (2011); Saracli et al. (2013); Rodriguez et al. (2019);see Milligan (1996) for an overview of earlier work), for real data, mostlyfocusing on speciﬁc application areas or types of data (e.g., de Souto et al.(2008); Kou et al. (2014); Boulesteix and Hatz (2017); Liu et al. (2019)), ora mixed collection of real and artiﬁcial data, sometimes generating artiﬁcialdata from models closely derived from a real application (e.g., Meila andHeckerman (2001); Maulik and Bandyopadhyay (2002); Dimitriadou et al.(2004); Arbelaitz et al. (2013); Javed et al. (2020)). An exception is Jainet al. (2004), where diﬀerent clustering methods were mapped according tothe similarity of their clusterings on various data sets (something similar isdone here, see Section 3.1). Anderlucci and Hennig (2014) contrasted recov-ery of a “true” classiﬁcation in artiﬁcial data sets with the requirement ofhaving homogeneous clusters.All of these studies attempt to provide a neutral comparison of clusteringmethods, which is to be distinguished from the large number of studies, usingreal and artiﬁcial data, that have been carried out by method developers inorder to demonstrate that their newly proposed method compares favourablywith existing methods. Due to selection eﬀects, the results of such work,although of some value in their own right, cannot be taken as objectiveindicators of the quality of methods (Boulesteix et al. (2013); Hennig (2018)).The study presented here is meant to be neutral; I have not been involvedin the development of any of the compared methods, and have no speciﬁc3nterest to portray any of them as particularly good or bad. Note that “true”neutrality can never be secured and is probably never given; for example, Ihave been active promoting my own “philosophy” of clustering (e.g., Hennig(2015a)) and may be suspected to favour results that are in line with the ideathat the success of clustering methods strongly depends on the application;however n No selections have been made depending on results (Boulesteix(2015)); the 42 data sets from which results are reported are all that wereinvolved.Section 2 explains the design of the study, i.e., the clustering methods,the data sets, and the validation indexes. Section 3 presents the results,starting with the characterisation of the methods in terms of the internalindexes, then results regarding the recovery of the “true” clusters, and ul-timately connecting “true” cluster recovery with the characteristics of theclustering solutions using a mixed eﬀects regression model. A discussionconcludes the paper. For the study design, recommendations for benchmark studies as given,e.g., in Boulesteix (2015); Van Mechelen et al. (2018) have been taken intoaccount. One important issue is a deﬁnition of the scope of the study. Thereis an enormous amount of clustering methods, and clustering is applied todata of very diﬀerent formats. It is not even remotely possible to covereverything that could potentially be of interest. Therefore the present studyconstrains its scope in the following way: • Only clustering methods for 2 ≤ p -dimensional Euclidean data thatcan be treated as continuous are used. Methods that work with dis-similarities are run using the Euclidean distance. • Accordingly, data sets contain numerical variables only. Some datasets include discrete variables, which are treated as admissible for thestudy if they carry numerical information and take at least three dif-ferent values (variables taking a small number of values, particularlythree or four, are very rare in the study). • The number of clusters is always treated as ﬁxed. Only methods thatallow to ﬁx the number of clusters are used; methods to estimatethe number of clusters are not involved. For data sets with a given“true” clustering, the corresponding number of clusters was taken. Fordata sets without such information, a number of clusters was chosensubjectively considering data visualisation and, where possible, subjectmatter information. 4

The included clustering methods were required to have an R-implementationthat can be used in a default way without additional tuning in orderto allow for a comparison that is not inﬂuenced by diﬀerent tuningﬂexibilities. • No statistical structure (such as time series or regression clustering)is taken into account, and neither is any automatic dimension reduc-tion involved as part of any method. All data is treated as plain p -dimensional Euclidean. • Methods are only admissible for the study if they always produce crisppartitions. Every observation always is classiﬁed (also in the given“true” clusterings) to belong to one and only one cluster.

The involved clustering methods are all well established and widely used,as far as my knowledge goes. They represent the major classes of clusteringmethods listed in Hennig and Meila (2015) with the exception of density-based clustering, which was excluded because standard density-based meth-ods such as DBSCAN (Ester et al. (1996)) do not accept the number ofclusters as input and often do not produce partitions. Another popularmethod that was not involved was Ward’s method, as this is based on thesame objective function as K -means and can be seen as just another tech-nique to optimise this function locally (Everitt et al. (2011)). On the otherhand, including mixtures of t- and skew t-distributions means that mixturemodel-based clustering is strongly represented. The motivation for this isthat the other included methods are not meant to ﬁt distributional shapesincluding outliers and skewness, which may be widespread in practice; alter-natives would be methods that have the ability to not include observationsclassiﬁed as “outliers” in any cluster, but this is beyond the scope of thepresent study. Here are the included methods. K-means as implemented in the R-function kmeans using the algorithm byHartigan and Wong (1979).

Partitioning Around Medoids (clara) (Kaufman and Rousseeuw (1990))as implemented in the R-function claraCBI (therefore abbreviated“clara” in the results) in R-package fpc (Hennig (2020)), which callsfunction pam in R-package cluster (Maechler et al. (2019)) using (un-squared) Euclidean distances.

Gaussian mixture model (mclust) ﬁtted by Maximum Likelihood us-ing the EM-algorithm, where the best of various covariance matrixmodels is chosen by the Bayesian Information Criterion (BIC) (Fraley5nd Raftery (2002)) as implemented in the R-function mclustBIC inR-package mclust (Scrucca et al. (2016)).

Mixture of skew t-distributions (emskewt) ﬁtted by Maximum Like-lihood using the EM-algorithm (Lee and McLachlan (2013)), includingfully ﬂexible estimation of the degrees of freedom and the shape matrix,as implemented in the function

EmSkew with parameter distr="mst" in the R-package

EMMIXskew (Wang et al. (2018)).

Mixture of t-distributions (teigen) ﬁtted by Maximum Likelihood us-ing the EM-algorithm (McLachlan and Peel (2000)), where the bestof various covariance matrix models is chosen by the BIC (Andrewsand McNicholas (2012)) as implemented in the R-function teigen inR-package teigen (Andrews et al. (2018)).

Single linkage hierarchical clustering as implemented in the R-function hclust and the dendrogram cut at the required number of clusters toproduce a partition, as is done also for the other hierarchical methods.See Everitt et al. (2011) for an explanation and historical referencesfor all involved hierarchical methods.

Average linkage hierarchical clustering as implemented in the R-function hclust . Complete linkage hierarchical clustering as implemented in the R-function hclust . Spectral clustering (Ng et al. (2001)) as implemented in the R-function specc in R-package kernlab (Karatzoglou et al. (2004)).The functions were mostly run using the default settings. In some cases,e.g., hclust , parameters had to be provided in order to determine whichexact method was used. Some amendments were required. In particular,all methods were run in such a way that they would always deliver a validpartition as a result. See Appendix A1 for more computational detail.

The data sets used in this study are a convenience sample, collected frommostly well known benchmark data sets in widespread use together withsome data sets that I have come across in my work. 21 data sets are fromthe UCI repository (Dua and Graﬀ (2017)), further ones are from Kaggle, , example data sets of R-packages, open data accompanyingbooks and research papers, and some were collected by myself or providedto me by collaborators and advisory clients with permission to use them.Details about the data sets are given in Appendix A2.6able 1: Numbers of observations for the 42 data sets.Observations Number of data sets n ≤

100 5100 < n ≤

200 6200 < n ≤

300 8300 < n ≤

500 5500 ≤ n < ≤ n < n > n than the number of variables p within clusters, alldata sets have p substantially smaller than n . The calibration of validationindexes requires repeated computations based on n × n distance matrices(see Section 2.3), for this reason the biggest data set has n = 4601, andgenerally data sets with n < p is 72. p = 1 is excluded, as it could not be handled by two methods. The maxi-mum number of “true” clusters K is 100. The aim was to achieve a fairlyeven representation of p and K up to 10 and a number of instances for thesevalues larger than 10, although there are apparently far more data sets inbenchmark use with k = 2 than with larger K . Data sets without missingvalues were preferred, but some data sets with a very small number of miss-ing values were admitted. In these cases mean imputation was used. Tables1, 2, and 3 show the distributions of n , p , and K , respectively, over the datasets.The variables were scaled to mean 0 and variance 1 before clustering,except for data sets in which the variables have compatible units of mea-surement and there seems to be a subject matter justiﬁcation to make theirimpact for clustering proportional to the standard deviation. See AppendixA2 for details on the preprocessing for some data sets.An issue with the “representativity” of these data sets for real clus-7able 2: Numbers of variables for the 42 data setsVariables Number of data sets p = 2 2 p = 4 5 p = 5 56 ≤ p ≤ ≤ p ≤

11 1112 ≤ p ≤

20 621 ≤ p ≤

50 4 p >

50 3Table 3: Numbers of clusters for the 30 data sets with given “true” cluster-ings, and for the 12 data sets without “true” clusterings, as chosen by theauthor.Number of clusters With “true” clustering Without “true” clustering k = 2 8 1 k = 3 3 3 k = 4 3 1 k = 5 2 66 ≤ k ≤ ≤ k ≤

11 6 0 k >

11 3 0tering problems is that the availability of “true” clusterings constitutes adiﬀerence to the real unsupervised problems to which clustering is usuallyapplied. This is an issue with almost all collections of data sets for bench-marking clustering algorithms. In particular, several such data sets havebeen constructed in order to have all clusters represented by the same num-ber of observations. This is the case for eight of the 30 data sets with “true”clusterings used here (seven of these have exactly equal cluster sizes). Thisis not possible for unsupervised problems in practice. Such data sets willfavour methods that tend to produce clusters of about equal sizes.

Internal validation indexes are used here with the aim of measuring variousaspects of a clustering that can be seen as desirable, depending on the speciﬁcapplication. It is then investigated to what extent the diﬀerent clusteringmethods work well according to these aspects. Hennig (2015a) lists anddiscusses a number of aspects that can be relevant. Hennig (2019) andAkhanli and Hennig (2020) formalised many of these aspects, partly usingalready existing indexes, partly introducing new ones. Here the indexes8sed in the present study are listed. For more background and discussion,including possible alternatives, see Hennig (2019) and Akhanli and Hennig(2020). The indexes attempt to formalise clustering aspects in a directintuitive manner, without making reference to speciﬁc models (unless it is ofinterest whether data look like generated by a particular probability model,see below). The indexes as deﬁned here do not allow comparison betweenor aggregation over diﬀerent data sets. In order to do this, they need to becalibrated, which is treated in Section 2.4.The data set is denoted as D = { x , . . . , x n } . Here the observations x , . . . , x n are assumed to be ∈ R p , and d ( x, y ) is the Euclidean distancebetween x and y , although the indexes can be applied to more general typesof data and distances. A clustering is a set C = { C , . . . , C K } with C j ⊆D , j = 1 , . . . , K . For j = 1 , . . . , K , n j = | C j | is the number of objects in C j .Assume C to be a partition, e.g., j (cid:54) = k ⇒ C j ∩ C k = ∅ and (cid:83) Kj =1 C j = D .Let γ : { , . . . , n } (cid:55)→ { , . . . , K } be the assignment function, i.e., γ ( i ) = j ⇔ x i ∈ C j . Average within-cluster distances (avewithin; aw; Akhanli and Hennig(2020)). This index measures homogeneity in the sense of small dis-tances within clusters. Smaller values are better. I avewithin ( C ) = 1 n K (cid:88) k =1 n k − (cid:88) x i (cid:54) = x j ∈ C k d ( x i , x j ) . Representation of cluster members by centroids.

In some applicationscluster centroids are used in order to represent the clustered objects,and an important aim is that this representation is good for all clus-ter members. This is directly formalised by the objective functions of K -means (sum of squared distances from the cluster mean) and Par-titioning Around Medoids (sum of distances from the cluster medoid).Both of these criteria have been used as internal validation indexes inthe present study, however results are not presented, because over allresults both of these turn out to have a correlation of larger than 0.95with I avewithin , so I avewithin can be taken to measure this clusteringaspect as well. Maximum diameter (maxdiameter; md). In some applications there maybe a stricter requirement that large distances within clusters cannotbe tolerated, rather than having only the distance average small. Thiscan be formalised by I maxdiameter ( C ) = max C ∈C ; x i ,x j ∈ C d ( x i , x j ) . Smaller values are better. 9 idest within-cluster gap (widestgap; wg; Hennig (2019)). Another in-terpretation of cluster homogeneity is that there should not be diﬀerentparts of the same cluster that are separated from each other. This canbe formalised by I widestgap ( C ) = max C ∈C ,D,E : C = D ∪ E min x ∈ D,y ∈ E d ( x, y ) . Smaller values are better.

Separation index (sindex; si; Hennig (2019)). This index measures whetherclusters are separated in the sense that the closest distances betweenclusters are large. For every object x i ∈ C k , i = 1 , . . . , n , k ∈ , . . . , K ,let d k : i = min x j / ∈ C k d ( x i , x j ). Let d k :(1) ≤ . . . ≤ d k :( n k ) be the values of d k : i for x i ∈ C k ordered from the smallest to the largest, and let [ pn k ]be the largest integer ≤ pn k . p is a parameter tuning what propor-tion of observations counts as “close to the border” of a cluster withanother. Here, p = 0 .

1. Then, I sindex ( C ; p ) = 1 (cid:80) Kk =1 [ pn k ] K (cid:88) k =1 [ pn k ] (cid:88) i =1 d k :( i ) . Larger values are better.Analogously to the maximum diameter, the minimum separation, i.e.,the minimum distance between any two clusters may also be of inter-est. In the present study, this has a correlation of 0.93 with I sindex ,and results for the minimum separation are omitted for reasons ofredundancy. Pearson-version of Hubert’s

Γ (pearsongamma; pg; Hubert and Schultz(1976)). This index measures to what extent the clustering corre-sponds or represents the distance structure in the data. the vectorof pairwise dissimilarities Let d = vec ([ d ( x i , x j )] i

Density mode index (dmode; dm). An intuitive idea of a cluster is thatit is associated with a density mode, and that the density goes downtoward the cluster border. This is formalised by the “dmode” index.10t is based on a simple kernel density estimator h that assigns a densityvalue h ( x ) to every observation. Let q d,p be the p -quantile of the vectorof dissimilarities d , e.g., for p = 0 .

1, the 10% smallest dissimilaritiesare ≤ q d, , . Deﬁne the kernel and density as κ ( d ) = (cid:18) − q d,p d (cid:19) d ≤ q d,p ) , h ( x ) = n (cid:88) i =1 κ ( d ( x, x i )) . The following algorithm constructs a sequence of neighbouring obser-vations from the mode in such a way that the density should alwaysgo down, and penalties are incurred if the density goes up. It alsoconstructs a set T that collects information about high dissimilaritiesbetween high density observations used below. I densdec collects thepenalties. Initialisation I d = 0, T = ∅ . For j = 1 , . . . , K : Step 1 S j = { x } , where x = arg max y ∈ C j h ( y ). Step 2

Let R j = C j \ S j . If R j = ∅ : j = j + 1, if j ≤ K go to Step1, if j + K = 1 then go to Step 5. Otherwise: Step 3

Find ( x, y ) = arg min ( z ,z ): z ∈ R j ,z ∈ S j d ( z , z ). S j = S j ∪ { x } , T = T ∪ { max z ∈ R j h ( z ) d ( x, y ) } . Step 4 If h ( x ) > h ( y ) : I d = I d + ( h ( x ) − h ( y )) , back to Step 2. Step 5 I densdec ( C ) = (cid:113) I d n . It is possible that there is a large gap between two observations withhigh density, which does not incur penalties in I densdec if there are nolow-density observations in between. This can be picked up by I highdgap ( C ) = max T. These two indexes, which are both better for smaller values, were de-ﬁned in Hennig (2019), but they can be seen as contributing to themeasurement of the same aspect, with I highdgap just adding informa-tion missed by I densdec . An aggregate version, which is used here, canbe deﬁned as I dmode ( C ) = 0 . I ∗ densdec ( C ) + 0 . I ∗ highdgap ( C ) , where I ∗ densdec and I ∗ highdgap are suitably calibrated versions of I densdec , I highdgap , respectively, see Section 2.4. The weights 0.75 and 0.25 inthe deﬁnition of I dmode can be interpreted as the relative impact of thetwo sub-indexes. 11 luster boundaries cutting through density valleys (denscut; dc; Hen-nig (2019)). A complementary aspect of the idea that clusters are as-sociated with high density regions is that cluster boundaries should runthrough density valleys rather than density mountains. The “denscut”-index penalises a high contribution of points from diﬀerent clusters tothe density values in a cluster (measured by h o below).For x i , i = 1 , . . . , n : h o ( x i ) = n (cid:88) k =1 κ ( d ( x i , x k ))1( γ ( k ) (cid:54) = γ ( i )) . A penalty is incurred if for observations with a large density h ( x ) thereis a large contribution h o ( x ) to that density from other clusters: I denscut ( C ) = 1 n K (cid:88) j =1 (cid:88) x ∈ C j h ( x ) h o ( x ) . Smaller values are better.

Entropy (en; Shannon (1948)). Although not normally listed as primaryaim of clustering, in many applications very small clusters are not veryuseful, and cluster sizes should optimally be close to uniform. This ismeasured by the well known entropy: I entropy ( C ) = − K (cid:88) k =1 n k n log( n k n ) . Large values are good.

Gaussianity of clusters (kdnorm; nor; Coretto and Hennig (2016)). Dueto the Central Limit Theorem and a widespread belief that the Gaus-sian distribution approximates many real random processes, it maybe of interest in its own right to have clusters that are approximatelyGaussian. The index I kdnorm is deﬁned, following Coretto and Hennig(2016), as the Kolmogorov distance between the empirical distribu-tion of within-cluster Mahalanobis distances to the cluster means, anda χ p -distribution, which is the distribution of Mahalanobis distancesin perfectly Gaussian clusters. Coeﬃcient of variation of distances to within-cluster neighbours (cvnnd;cvn; Hennig (2019)). Another within-cluster distributional shape ofpotential interest is uniformity, where clusters are characterised bya uniform within-cluster density level. This can be characterised bythe coeﬃcient of variation (CV) of the dissimilarities to the k th near-est within-cluster neighbour d kw ( x ) ( k = 2 is used here). Deﬁne for12 = 1 , . . . , k , assuming n j > k : m ( C j ; k ) = 1 n j (cid:88) x ∈ C j d kw ( x ) , CV( C j ) = (cid:113) n j − (cid:80) x ∈ C j ( d kw ( x ) − m ( C j ; k )) m ( C j ; k ) . Using this, I cvdens ( C ) = (cid:80) Kj =1 n j CV( C j )1( n j > k ) (cid:80) Kj =1 n j n j > k ) . Smaller values are better.

Average Silhouette Width (asw; Kaufman and Rousseeuw (1990)). Thisis a popular internal validation index that deviates somewhat from the“philosophy” behind the collection of indexes presented here, becauseit attempts to balance two aspects of cluster quality, namely homo-geneity and separation. It has been included in the study anyway,because it also uses an intuitive direct formalisation of clustering char-acteristics of interest. For i = 1 , . . . , n , deﬁne the “silhouette width” s i = b i − a i max { a i , b i } ∈ [ − , , where a i = 1 n l i − (cid:88) x j ∈ C li d ( x i , x j ) , b i = min h (cid:54) = l i n h (cid:88) x j ∈ C h d ( x i , x j ) . The Average Silhouette Width is then deﬁned as I asw ( C ) = 1 n n (cid:88) i =1 s i . For aggregating the indexes introduced in Section 2.3 over diﬀerent data setsand to compare the performance of a clustering method over the indexes inorder to characterise it, it is necessary to calibrate the values of the indexes,so that they become comparable. This is done as in Hennig (2019); Akhanliand Hennig (2020). The idea is to generate a large number m of “randomclusterings” C R , . . . , C Rm on the data. Denote the clusterings of the q = 9methods from Section 2.1 by C , . . . , C q . For a given data set D and index I , ﬁrst change I to − I in case that smaller values are better according to13he original deﬁnition of I , so that for all calibrated indexes larger valuesare better. Then use these clusterings to standardise I : m ( I, D ) = 1 m + q (cid:32) m (cid:88) i =1 I ( C Ri ) + q (cid:88) i =1 I ( C i ) (cid:33) ,s ( I, D ) = 1 m + q − (cid:32) m (cid:88) i =1 [ I ( C Ri ) − m ( I, D )] + q (cid:88) i =1 [ I ( C i ) − m ( I, D )] (cid:33) ,I ∗ ( C i ) = I ( C i ) − m ( I, D ) s ( I, D ) , i = 1 , . . . , q.I ∗ is therefore scaled so that its values can be interpreted as expressingthe quality (larger is better) compared to what the collection of clusterings C R , . . . , C Rm , C , . . . , C q achieves on the same data set. The approach de-pends on the deﬁnition of the random clusterings. These should generateenough random variation in order to work as a tool for calibration, but theyalso need to be reasonable as clusterings, because if all random clusteringsare several standard deviations away from the “proper” clusterings, the ex-act distance may not be very meaningful. They also need to be fast togenerate, as many of them will be required in order to calibrate index valuesof every single data set.Four diﬀerent algorithms are used for generating the random clusterings,for detains see Akhanli and Hennig (2020). For clusterings with K clusters,these are: Random K -centroids: Draw K observations from D . Assign every ob-servation to the nearest centroid. Random nearest neighbour:

Draw K observations as starting points forthe K clusters. At every stage, of the observations that are not yetclustered, assign the observation x to the cluster of its nearest alreadyclustered neighbour, where x is the observation that has the smallestdistance to this neighbour. Random farthest neighbour:

As random nearest neighbour, but x is theobservation that has the smallest distance to the minimum farthestcluster member. Random average distances:

As random nearest neighbour, but x is theobservation that has the smallest average distance to the closest clus-ter.Experience shows that these methods generate a range of clusterings thathave suﬃcient variation in characteristics and are mostly reasonably closeto the proper clustering methods (as can be seen in Akhanli and Hennig(2020) as well as from the results of the present study). Here, 50 random14lusterings from each algorithm are generated, i.e., m = 200. All results inSection 3 are given in terms of calibrated indexes I ∗ . “Truth” recovery is measured by external validation indexes that quantifythe similarity between two clusterings on the same data, here the “true” oneand a clustering generated by one of the clustering methods.The probably most popular external validation index is the AdjustedRand Index (ARI; Hubert and Arabie (1985)). This index is based on therelative number of pairs of points that are in the same cluster in both clus-terings or in diﬀerent clusters in both clusterings, adjusted for the numberof clusters and the cluster sizes in such a way that its expected value underrandom cluster labels with the same number and sizes of clusters is 0. Themaximum value is 1 for perfect agreement. Values can be negative, but al-ready a value of 0 can be interpreted as indicating that the two clusteringshave nothing to do with each other.In some work, the ARI has been criticised, often in the framework ofan axiomatic approach where it can be shown that it violates some axiomstaken to be desirable, e.g., Meila (2007); Amigo et al. (2009). Alternativeindexes have been proposed that fulﬁll the presented axioms. Meila (2007)introduced the Variation of Information (VI), which is a proper metric be-tween partitions. This means that, as opposed to the ARI, smaller valuesare better. In Section 3, the negative VI is considered so that for all consid-ered indexes larger values are better. The VI is deﬁned by comparing theentropies of the two clusterings with the so-called “mutual information”,which is based on the entropy of the intersections between two clusters fromthe two diﬀerent clusterings. If the two clusterings are the same, the en-tropy of the intersections between clusters is the same as the entropy of theoriginal clusterings, meaning that the VI is zero, its minimum value.Amigo et al. (2009) show their axioms for an index called BCubed ﬁrstproposed in Bagga and Baldwin (1998). This index is based on observation-wise concepts of “precision” and “recall”, i.e., what percentage of observa-tions in the same cluster are from the same “true” class, and what percentageof observations in a diﬀerent cluster is “truly” diﬀerent. It takes values be-tween 0 and 1, 1 corresponding to a perfect agreement. See Meila (2015) forfurther discussion and some more alternatives. Three issues are addressed: • How can the clusters produced by the methods be characterised interms of the external validation indexes?15 − − Average within−cluster distance

Method A v e r age w i t h i n − c l u s t e r d i s t an c e c a li b r a t ed kmeans clara mclust emskewt teigen single average complete spectral − − Maximum cluster diameter

Method M a x i m u m c l u s t e r d i a m e t e r c a li b r a t ed kmeans clara mclust emskewt teigen single average complete spectral Figure 1: Calibrated values of I ∗ avewithin and I ∗ maxdiameter . Values belongingto the same data set are connected by lines. The thick red line gives theaverage values. • How do the methods perform regarding the recovery of the “true”clusterings? • Can the recovery of the “true” clusterings be related to the internalvalidation indexes?

The methods can be characterised by the distribution of values of the cal-ibrated internal validation indexes, highlighting the dominating features ofthe clusterings that they produce. In order to do this, parallel coordinateplots will be used that show the full results including how results belongingto the same data set depend on each other.I decided against running null hypothesis tests due to issues of multipletesting and model assumptions; the plots allow a good assessment of to whatextent diﬀerences between methods are meaningful, dominated by randomvariation, or borderline. Although the values of the calibrated indexes canbe compared over indexes as relative to the ensemble of clusterings from themethods and random, what is shown are images that compare the diﬀerentclustering methods for each index, as the comparison of the clustering meth-ods gives information additional to the performance relative to the randomclusterings.

Average within-cluster distances (left side of Figure 1): The two centroid-based methods K -means and clara achieve the best results. The Gaus-16 − Widest within−cluster gap

Method W i de s t w i t h i n − c l u s t e r gap c a li b r a t ed kmeans clara mclust emskewt teigen single average complete spectral Separation index

Method S epa r a t i on i nde x c a li b r a t ed kmeans clara mclust emskewt teigen single average complete spectral Figure 2: Calibrated values of I ∗ widestgap and I ∗ sindex . Values belonging to thesame data set are connected by lines. The thick red line gives the averagevalues.sian and t -mixture are about at the same level as spectral clustering;complete linkage and the mixture of skew t -distributions are worse.Average linkage is behind these, and single linkage is the worst bysome distance.Results regarding representation of the data by centroids are not shownand look largely the same. The only additional distinctive feature isthat K -means is better than clara looking at squared Euclidean dis-tances to the centroid, whereas clara is better for unsquared distances.This was to be expected, as it corresponds to what K -means, clara,respectively, attempt to optimise. Maximum diameter (right side of Figure 1): Unsurprisingly, completelinkage is best; at each step it merges clusters so that the maximumdiameter is the smallest possible, although it is not optimal for ev-ery single data set (the hierarchical scheme will not normally producea global optimum). Average linkage is second best, followed by K -means, clara, and single linkage, which somewhat surprisingly avoidslarge distances within clusters more than spectral clustering and thethree mixture models. Another potential surprise is that the Gaussianmixture does not do better than the t -mixture in this respect; a ﬂexiblecovariance matrix can occasionally allow for very large within-clusterdistances. Widest within-cluster gap (left side of Figure 2): The three linkagemethods are best at avoiding large within-cluster gaps, with single17

Pearson Gamma

Method P ea r s on G a mm a c a li b r a t ed kmeans clara mclust emskewt teigen single average complete spectral − − − − Mode index

Method M ode i nde x c a li b r a t ed kmeans clara mclust emskewt teigen single average complete spectral Figure 3: Calibrated values of I ∗ P earson Γ and I ∗ dmode . Values belonging to thesame data set are connected by lines. The thick red line gives the averagevalues.linkage in the ﬁrst place, which will not join sets between which thereis a large gap. The two centroid-based methods follow, however diﬀer-ences between them, the three mixture models, and spectral clusteringlook small compared to the variance, and dominated by outliers. Theskew t -mixture produces very large within-cluster gaps for a numberof data sets. With strong skewness there can be large distances in atail of a cluster. Separation index (right side of Figure 2): Single linkage achieves the bestresults here. Its clustering process keep separated subsets in distinctclusters (often one-point clusters with strongly separated outliers).The two other linkage methods follow. Complete linkage is sometimesportrayed as totally prioritising within-cluster homogeneity over sep-aration, but in fact regarding separation it does better than spectralclustering, which is still a bit better than the centroid-based and themixture models, between which diﬀerences look insigniﬁcant.

Pearson-

Γ (left side of Figure 3): The average results for the methodsregarding the representation of the distance structure by the clusteringvary relatively little compared to the variation over data sets. Averagelinkage is overall best, and the skew t -mixture worst, even if the latterhas good results in some data sets. Single linkage does occasionallyvery well, but also worse than the others for a number of data sets. Density mode index (right side of Figure 3): Results here are dominatedby variation between data sets as well. Interestingly, the methods18

Boundaries through density valleys

Method B ounda r i e s t h r ough den s i t y v a ll e ys c a li b r a t ed kmeans clara mclust emskewt teigen single average complete spectral − − − − Entropy

Method E n t r op y c a li b r a t ed kmeans clara mclust emskewt teigen single average complete spectral Figure 4: Calibrated values of I ∗ denscut and I ∗ entropy . Values belonging to thesame data set are connected by lines. The thick red line gives the averagevalues.based on mixtures of unimodal distributions do not do best here, butrather clara and spectral clustering. Once more the mixture of skew t -distributions does worst, with outliers in both directions. Density cutting (left side of Figure 4): Due to its focus on cluster separa-tion, single linkage is best at avoiding cutting through density moun-tains. The skew t - and t -mixture have the strongest tendency to putcluster boundaries in high density areas, but diﬀerences between meth-ods are not large. Entropy (right side of Figure 4): clara yields the highest average entropyfollowed by K -means, but diﬀerences between these and the threemixture models do not seem signiﬁcant. This runs counter to the idea,sometimes found in the literature, that K -means favours similar clustersizes more than mixtures, or even implicitly assumes them. The otherfour methods have a clear tendency to produce less balanced clusters,particularly single linkage, but also average and complete linkage, andto some lesser extent spectral clustering. Gaussianity (left side of Figure 5): Although the Gaussian mixture pro-duces on average the most Gaussian-looking clusters, as was to beexpected, the diﬀerences between all nine methods look largely in-signiﬁcant. The Gaussian mixture has positive and negative outliers,the skew t -mixture only negative ones. CV of distances to within-cluster neighbours (right side of Figure 5):Despite one lower outlier, the Gaussian mixture tends to produce the19 − Gaussianity

Method G au ss i an i t y c a li b r a t ed kmeans clara mclust emskewt teigen single average complete spectral − CV of distance to 2nd nearest within−cluster neighbour

Method C V o f d i s t an c e t o nea r e s t w i t h i n − c l u s t e r ne i ghbou r c a li b r a t ed kmeans clara mclust emskewt teigen single average complete spectral Figure 5: Calibrated values of I ∗ kdnorm and I ∗ cvdens . Values belonging to thesame data set are connected by lines. The thick red line gives the averagevalues.largest cvnnd, i.e., the lowest within-cluster CVs. It probably helpsthat large variance clusters can bring together observations that havelarge distances between each other and to the rest. clara and the t -mixture produce the lowest cvnnd values. Diﬀerences between theother methods are rather small. Average silhouette width (left side of Figure 6): Average linkage is amethod that explicitly balances separation and homogeneity, and con-sequently it achieves the best ASW values. K -means achieves highervalues than complete linkage, but the remaining methods do worsethan the linkage methods. ASW had been originally proposed for usewith clara (Kaufman and Rousseeuw (1990)), but clara does not pro-duce particularly high ASW values, if better than the mixture modelsand spectral clustering.These results characterise the clustering methods as follows: kmeans clearly favours within-cluster homogeneity over separation. It doesnot favour entropy as strongly as some literature suggests; in thisrespect it is in line with clara and the mixture models, ahead of theremaining methods. It should be noted that entropy is treated here asa potentially beneﬁcial feature of a clustering, whereas some literaturemakes it seem like a defect of kmeans that such solutions are favoured(as far as this in fact happens). clara has largely similar characteristics to kmeans. It is slightly worse re-garding the representation of the distance structure and the ASW.20 − Average Silhouette Width

Method A v e r age S il houe tt e W i d t h c a li b r a t ed kmeans clara mclust emskewt teigen single average complete spectral Figure 6: Calibrated values of the ASW. Values belonging to the same dataset are connected by lines. The thick red line gives the average values.It is slightly better regarding clusters with density decrease from themode. This may have to do with the fact that the density goes downfaster from the mode for the multivariate Laplace distribution (wherethe log-likelihood sums up unsquared distances) than for the Gaussiandistribution (which corresponds to squared distances). mclust produces clusters with the highest Gaussianity, but only by a ratherinsigniﬁcant distance. It is best regarding uniformity as measured bycvnnd. The reason for this is probably its ability to build clusterswith large within-cluster variation collecting observations that havelarge distances to all or most other points, whereas other methodseither need to isolate such observations in one-point clusters, or in-tegrate them in clusters with denser cores. Mixtures of t - and skew t -distributions could in principle also produce large variance clusters,but the shapes of t - and skew t -distributions allow to integrate outlyingobservations more easily with denser regions.mclust often tolerates large within-cluster distances, whereas its clus-ters are not on average better separated than those from K -means.On the other hand, its cluster sizes are not signiﬁcantly less well bal-anced. Its ability to produce clusters with strongly diﬀerent within-cluster variance makes it less suitable regarding Pearson-Γ and theASW, which treat distances in the same way in all clusters. emskewt looks bad on almost all internal indexes. It is not particularly badregarding recovery of the “true” clusters though, see Section 3.2. Thismeans that the current collection of internal indexes does not capture21avourable characteristics of skewly distributed clusters appropriately;it also means that emskewt is not an appropriate method for ﬁndingclusters with the characteristics that are formalised by the internalindexes. teigen has a proﬁle that is by and large very similar to the one of mclust,apart from being slightly better regarding the maximum diameter, andslightly worse regarding Gaussianity and uniformity. single linkage has a very distinct proﬁle. It is best regarding separation,avoiding wide within-cluster gaps, and cluster boundaries through den-sity valleys, and worst by some distance regarding within-cluster ho-mogeneity and entropy. average linkage has similar strengths and weaknesses as single linkage,but not as extreme. It is the best method regarding Pearson-Γ and theASW, both of which balance homogeneity and separation and measuretherefore how much the clustering is in line with the distance structure. complete linkage is best regarding the maximum diameter. In most otherrespects it stands between single and average linkage on one side andthe centroid- and mixture-based methods on the other side. spectral is another method that provides a compromise between the ratherseparation-oriented single and average linkage on one side and therather homogeneity-oriented centroid- and mixture-based methods. Itsmaximum cluster diameter is rather high on average. Its mode indexvalue is good if not clearly diﬀerent from the one of clara. Its mid-range entropy value may look attractive in applications in which aconsiderable degree of imbalance in the cluster sizes may seem realisticbut the tendency of the linkage methods to produce one-point clustersshould be avoided.The multivariate characterisation of the clustering methods also allows tomap them, using a principal components analysis (PCA). Results of this areshown in Figure 7. On the left side, PCs are shown using every index valuefor every data set as a separate value, i.e., 42*11 variables. The ﬁrst twoPCs carry 30.9% and 16.6% of the variance, respectively. On the right side,the PCA is performed on 11 variables that give average index values overall data sets. While this reduces information, it allows to show the indexesas axes in a biplot. The ﬁrst two PCs here carry 50.0% and 19.7% of thevariance, respectively. After rotation, the maps are fairly similar. Using themore detailed data set information, spectral seems much closer to kmeansand clara than to mclust and teigen, but the apparent similarity to thelatter ones using average index values is an eﬀect of dimension reduction;22

20 0 20 40 − − PC1 P C kmeansmclust singleaveragecompleteclaraspectralemskewtteigen −0.5 0.0 0.5 − . − . . . . PC1 P C kmeansmclust singleaveragecompleteclaraspectralemskewtteigen−4 −2 0 2 4 6 − − − avewithin cvnndmaxdiameter widestgap sindexaswdenscutpearsongammaentropy kdnormdmode Figure 7: Clustering methods mapped on ﬁrst two principal componentsfrom using all data sets separately (left side), and from using mean valuesover the data sets (right side).involving information from the third PC (not shown), the similarity struc-ture is more similar to that of the plot using all 42*11 variables. The biploton the right side shows the opposite tendencies of separation on one handand entropy and average within distances on the other hand when char-acterising the methods, with indexes such as maximum diameter, densitymode, Pearson-Γ, and the ASW opening another dimension, rather corre-sponding to kmeans, average, and complete linkage. Qualitative conclusionsfrom these maps agree roughly with those in Jain et al. (2004), where moreclustering algorithms, but fewer data sets, were involved.The study data allow to also investigate the values of the internal in-dexes computed for the “true” clusterings. These are shown in Figure 8.Only the entropy and Gaussianity are clearly above the mean zero of therandom clustering ensemble (which includes the solutions from the properclustering methods as a small minority), and also above the mean for theclustering methods. The clustering methods are on average all above zero,which should be expected, because these are meant to be desirable featuresof a good clustering, and as such should be better for the proper clusteringmethods than for the random ones. The methods achieve the highest aver-age for the ASW, which makes sense as this attempts to measure generalclustering quality. The fact that index values are mostly below zero for the“true” clusterings can be interpreted in such a way that many given “true”clusterings are data analytically wanting. The high values for entropy areprobably artiﬁcial, due to a biased choice of data sets. The high values forGaussianity, however, could suggest that there is a tendency in some realclusters, i.e., homogeneous subpopulations, to approximate the Gaussian23 n nor cvn wg aw dm md si pg asw dc − Index C a li b r a t ed v a l ue Figure 8: The boxplots show the distributions of the internal indexes com-puted on the “true” clusterings. The red line shows the average index valuesproduced by the clustering methods.distribution. A possible explanation is that in a crisp clustering of a dataset produced by a clustering method, tails of a within-cluster distributiontend to be cut oﬀ in the direction of other clusters, whereas “true” clusterstend to have some proper overlap (clearly separated clusters are in my ex-perience indeed rare in real data), which is in line with the low values of theseparation and denscut (cluster boundaries running through density valleys)index. This probably also aﬀects the ASW and Pearson-Γ.

The quality of the recovery of the “true” clusterings is measured by theARI, BCubed, and the VI. Figure 9 shows the ARI-values achieved by thediﬀerent clustering methods. On average, there is a clear advantage of thecentroid- and mixture-based methods compared with the linkage methods(single linkage is clearly the worst), and spectral clustering is in between.Every method achieves good results on some data sets, but the linkage meth-ods produce an ARI around zero on many data sets. Diﬀerences betweenkmeans, clara, mclust, emskewt, and teigen do not seem signiﬁcant but areclearly dominated by variation. On some data sets all methods producevery low values, and no method achieves an ARI larger than 0.5 on morethan half of the data sets. The mean ARI is 0.28, the mean ARI of the bestclusterings for every data set is 0.46. Interpreting these numbers, it has tobe kept in mind that the given “true” clustering does not necessarily qualifyas the best clustering of the data from a data analytic point of view; someof these are neither homogeneous nor separated. Furthermore there may be24 . . . . . . ARI

Method A R I kmeans clara mclust emskewt teigen single average complete spectral Figure 9: Adjusted Rand Index values by method. Values belonging to thesame data set are connected by lines. The thick red line gives the averagevalues.meaningful clusters in the data that diﬀer from those declared as “truth”. Abetter recovery does not necessarily mean that a method delivers the mostuseful clustering that can be found. On the other hand, some given “true”clusterings correspond to clearly visible patterns in the data, and at leastsome methods manage to ﬁnd them. Overall, the variation is quite high.The picture changes strongly looking at the results regarding BCubedand particularly VI, see Figure 10. BCubed still shows single linkage asthe weakest method, but otherwise diﬀerences look hardly signiﬁcant, andaccording to the VI, the average quality of the methods is almost uniform.Further exploratory analysis (not shown) reveals that better values of theexternal indexes are systematically associated with lower data dimension p and lower sample size n , the latter probably because of confounding withthe correlated dimension. There was no clear interaction with the methods,and no clear pattern regarding the number of clusters k .Table 4 shows how often the diﬀerent methods come out as the best ac-cording to the indexes. This portrays mclust as very successful at recoveringthe “truth”. Spectral clustering is hardly ever on top, but it has values veryclose to the best for a number of data sets. Given that emskewt looks sobad regarding the internal indexes in Section 3.1, its performance regardingthe external indexes looks surprisingly good. The most striking diﬀerencebetween the indexes is that single linkage is not the best method for a singledata set with respect to the the ARI, but it is the best for 11 data sets withrespect to the VI. This is explored in the following.Figure 11 shows how the three indexes are related to each other over allnine clustering methods applied to the 30 data sets with “true” clusterings.25 . . . . . BCUBED

Method B CU BE D kmeans clara mclust emskewt teigen single average complete spectral − − − − −VI Method − V I kmeans clara mclust emskewt teigen single average complete spectral Figure 10: BCubed and negative Variation of Information values by method.Values belonging to the same data set are connected by lines. The thick redline gives the average values. Clustering methodsIndex kmeans clara mclust mskewt teigen single average complete spectral

ARI 3 4 8 5 5 0 3 1 1BCubed 2 2 7 5 3 4 4 2 1VI 2 1 6 3 3 11 2 1 1Table 4: Number of times that a method comes out best according to thethree external indexes. 26 ri . . . . . bcubed . . . . . . vi Figure 11: Pairs plot of ARI, BCubed, and VIVI and BCubed have a correlation ρ of -0.94, but the ARI is correlatedsubstantially weaker to both, ρ = 0 .

75 with BCubed and ρ = − .

57 withVI. BCubed can therefore be seen as a compromise between the two. Inorder to explore what causes the diﬀerences between ARI and VI, in Figure11 it can be seen that the major issue is that the VI can produce fairly goodvalues close to zero for some situations in which the ARI is around zero,indicating unrelated clusterings, or only slightly better. Generally thesesituations tend to occur where one clustering is very imbalanced, mostlywith one or more one-point clusters, whereas the other one (more oftenthe “true” one) is not. The VI involves cluster-wise percentages of pointsoccurring together in the same cluster in the other clustering, and thereforeassesses one-point clusters favourably, whereas the random labels modelbehind ARI indicates that what happens with the object in a one-pointcluster in another (potentially “true”) clustering is random and thereforenot meaningful as long as it appears in a substantially bigger cluster there.For example, consider the data set “22 - Wholesale” (see Appendix A2).According to the VI, the single linkage clustering is optimal (VI= 0 . It is of interest whether the internal index values, which are observable ina real situation, can explain to some extent the performance regarding the“true” cluster recovery. A tool to assess this is a linear regression with anexternal index as response, and the internal indexes as explanatory variables.There is dependence between the diﬀerent clusterings on the same data set,and this can be appropriately handled using a random data set eﬀect.An important issue is that the internal indexes are correlated, which canmake the interpretation of the regression results diﬃcult. Figure 12 showsthe correlation structure among the internal indexes, ARI and -VI (BCubedis not taken into account in this section due to the high correlation withVI). The order of indexes in Figure 12 was determined by a hierarchicalclustering using correlation dissimilarity, however -VI and ARI were put ontop due to their diﬀerent role in the regression, and the ASW was put atthe bottom. The ASW is not involved in the regression, as it is deﬁnedin order to compromise between homogeneity and separation, which them-selves are represented by other internal indexes. It is involved in Figure 12because its correlation to the other indexes may be of interest anyway. Onething that can be seen is that it is fairly strongly correlated to a number ofother indexes, particularly maximum diameter, Pearson-Γ, and the separa-tion index, but rather weakly to the average within-cluster distances meantto formalise homogeneity.Considerable correlation occurs between the average within-cluster dis-tances and the entropy. Both of these are the internal indexes with thehighest correlation to the ARI. This is a problem for interpretation becausethis means that entropy and homogeneity are confounded when explainingrecovery success. Furthermore, both, entropy in particular, are stronglynegatively correlated with separation, which may explain the negative cor-relation between separation and the ARI. There is no further high ( > . s w d m m d pg d c s i w g cv n no r en a w a r i − v i aswdmmdpgdcsiwgcvnnorenawari−vi Figure 12: Correlation matrix of internal and external validation indexes29able 6: Mixed-eﬀects regression results regressing ARI, -VI, respectively,on the internal indexes excluding the ASW.Response ARI -VIIndexes Coeﬃcient t p

Coeﬃcient t p

Intercept .324 6.91 .000 -1.54 -10.11 .000avewithin -.019 -1.34 .181 0.03 0.88 .377maxdiameter -.025 -4.03 .000 0.01 0.64 .520widestgap .014 2.00 .047 -0.00 -0.21 .814sindex -.010 -1.65 .101 0.05 3.84 .000pearsongamma .020 2.43 .016 -0.04 -1.86 .064dmode .009 0.89 .374 0.05 1.92 .056denscut .000 0.03 .978 -0.05 -1.80 .074entropy .088 4.69 .000 0.00 0.01 .990kdnorm .024 3.51 .001 0.02 1.44 .151cvnnd -.006 -0.86 .388 -0.01 -0.48 .633random eﬀ. (data set) .000 .000the -VI is more positively connected to separation. There are a numberof further correlations among the internal indexes; separation, the densitymode and cut indexes, Pearson-Γ, the maximum cluster diameter, and theabsence of large within-cluster gaps are all positively connected. The Gaus-sianity index and the nearest neighbours CV are correlated 0.24 to eachother; all their other correlations are lower.Table 6 gives the results of two regression analyses, with ARI and -VIas responses, with a random data set eﬀect. This has been obtained by theR-package lme , Pinheiro and Bates (2000). p -values are interpreted in anexploratory manner, as they are not precise. However, the null hypothesesof zero eﬀect of a variable given all other variables are in the model are ofinterest here.The ARI regression has maximum diameter, entropy, and Gaussianity ashighly signiﬁcant eﬀects; Pearson-Γ is clearly signiﬁcant at 5%-level. widest-gap is borderline signiﬁcant, which is potentially not meaningful given thenumber of tests.The interpretation of entropy (which has the clearly largest t -value) isproblematic for two reasons. Firstly, due to correlation, its coeﬃcient maypartly carry information due to avewithin. Secondly, eight data sets haveartiﬁcially balanced classes, which may favour entropy among good clus-terings. The regression was re-run excluding those data sets (not shown),yielding by and large the same signiﬁcances including entropy, but its t -valuefell to 2.75. Even in this scenario it cannot be excluded that the “sample”of data sets with known “true” clusters favours entropy artiﬁcially. Gaus-sianity seems to be a valuable predictor for recovery of “true” classes. The30aximum diameter has a negative coeﬃcient, meaning that on average andcontrolled for all other indexes, a larger (therefore worse) maximum clusterdiameter went with a better “truth” recovery regarding the ARI. It is how-ever clearly correlated with Pearson-Γ and widestgap, which have positiveeﬀects.Despite a positive relationship between ARI and -VI, the results of theVI-regression are very diﬀerent, mainly because -VI can achieve high valuesfor clusterings with very low entropy even if the “true” clustering is balanced.This means that there is no bias in favour of entropy by the data set sample;rather the VI seems biased against entropy by deﬁnition, see above. Theonly clearly signiﬁcant index for -VI is the separation index, with a positivecoeﬃcient, which was not signiﬁcant in the ARI-regression.Plots of the ﬁtted values of both regressions against their response vari-able (not shown) look satisfactorily linear. In principle, the regressions couldbe used to predict the ARI or VI for data sets with unknown “truth” fromthe observable internal indexes, but this will not work very well, due thestrong data set eﬀect.Overall these results do not allow clear cut conclusions, due to correla-tion, issues with the representativity of the data sets, and the very diﬀerentpatterns observed for ARI and VI. The character of the “true” clusteringsmay just be so diverse that no general statement about which clusteringcharacteristics allow for good recovery can be made. Preferring the ARI asexternal index, the only safely interpretable signiﬁcance seems to be the oneof Gaussianity, due to its low correlation with other indexes. Separationseems to help in terms of the VI, but this includes favouring clusterings thatseparate outliers as one-point clusters, arguably an issue with the VI. The aim of this study is to characterise the clustering methods in termsof the internal indexes, to learn about the recovery of “true” clusterings,both regarding the methods, and regarding characteristics that could beconnected to recovery.Regarding the characterisation of the clustering methods, the right sideof Figure 7 is probably most expressive, locating the clustering methods rela-tive to the internal indexes. Some indexes do not separate the methods verystrongly. Single linkage stands out as being quite diﬀerent to most othermethods in many respects. On the other hand, the centroid-based methods,the mixture-based methods and spectral clustering have much in common;one surprising result is that K -means does not favour balanced cluster sizesparticularly strongly, compared to the mixture-based methods. Another re-sult is that single and complete linkage are not opposite extremes, but ratherthat on most characteristics of single linkage, complete linkage is closer to31ingle linkage, with average linkage in between, than the centroid- based andmixture-based methods. Gaussian mixture-based clustering stands out moreby its good value regarding uniformity (cvnnd) than regarding Gaussianityof the clusters.Regarding the recovery of “true” clusterings, there is large variationbetween the data sets. According to the ARI and BCubed, the Gaussianmixture is the best for the largest number of data sets. Single Linkage doesbadly regarding the ARI. Diﬀerences between the other methods are notthat pronounced, and all of them did best in some data sets. This includesthe skew t -mixture, which does not look good according to the internalindexes but better regarding the external indexes. There is currently noindex, at least in the collection used here, that formalises in which sensesuch a mixture can yield a good clustering. This is a topic for furtherwork. According to the VI (and to some extent BCubed), single linkagedoes much better, but this rather indicates a problem with the indexes thana good performance of single linkage.Explaining the “true” cluster recovery by the internal indexes does notdeliver very clear results, except that Gaussianity seems to help, which issometimes achieved by the Gaussian mixture, but only insigniﬁcantly moreoften than by some other methods. A critical interpretation could be thatquality according to the internal indexes does not really measure what isimportant for recovery. On the other hand one could argue that this showsthe heterogeneity of “true” clusterings, and that there is no “one ﬁts it allapproach”, neither for clustering, nor for measuring clustering quality. Thegiven “true” clusterings are of such a nature that their recovery cannot bereliably predicted from observable cluster characteristics.Some problems were exposed with the non-representativity of the datasets, with “true” clusterings, and with the VI (and somewhat less extremethe BCubed) index. These problems are not exclusive to the present study,and it can be hoped that these issues are on the radar whenever such bench-mark studies are run. These problems aﬀect analyses involving the “trueclusterings” in particular. There is no reason to believe that the resultsregarding the internal validation indexes are biased for these reasons. Appendix

A1: Computational details

The following amendments were made to the clustering functions listed inSection 2.1:

Mixture models:

Crisp partitions have always been enforced by assign-ing observations to the cluster with maximum posterior probability ofmembership. 32 means:

This was run with parameter runs=10 , governing the number ofrandom initialisations. The default value is runs=1 , which yields veryunstable results. emskewt

The function

EmSkew would occasionally produce errors or invalidresults. It is run inside a wrapper function that enforces a solution inthe following way: For each covariance matrix model , starting from(1) the fully ﬂexible model, 5 attempts (diﬀerent random initialisa-tions) are made to ﬁnd a solution. If all attempts for a model fail,a less ﬂexible model is tried out, in the order (2) diagonal covariancematrices, (3) ﬂexible but equal covariance matrices, (4) equal diagonalcovariance matrices, (5) equal spherical covariance matrices, until avalid solution is found. If none of these is successful, the same routineis carried out with a mixture of skew normal distributions, and if thisdoes not yield a valid solution either, mclustBIC is called with defaultsettings. teigen The function teigen would occasionally produce errors or invalidresults. It is run inside a wrapper function that enforces a solution inthe following way: If no valid solution is found, the wrapper-functionfor

EmSkew as explained above is called, but with dist="mvt" , ﬁttinga multivariate t-distribution. specc

The function specc would occasionally produce errors or invalid re-sults. It is run inside a wrapper function. 10 attempts (diﬀerentrandom initialisations) are made to ﬁnd a solution. If they all fail,all observations are assigned to cluster 1. While this approach mayseem unfair for spectral clustering in comparison to

EmSkew , which ul-timately calls mclust and can as such still produce a reasonable clus-tering, the motivation is that a Gaussian mixture model can be seenas a constrained version of a mixture of skew t-distributions, whereasspectral clustering has no straightforward constrained version that canguarantee a valid solution.In principle there can be situations in which also mclustBIC fails to deliver avalid solution, however such a situation did not occur in the study. Exhaust-ing all attempts, both specc and

EmSkew failed twice before resorting to aone-cluster solution or mclustBIC , respectively, and teigen failed 5 times;in all of these cases

EmSkew with distr="mvt" delivered a valid solution.

A2: More details on data sets

Tables 7 and 8 give a list of the data sets used in the study. The shape of a skew t-distribution is deﬁned by the covariance matrix of an involvedGaussian mixture, see Lee and McLachlan (2013), although this is not the covariancematrix of the resulting skew t-distribution.

Number Name n p K “Truth” given Source Reference1 Crabs 200 5 4 Yes R-MASS Campbell and Mahon (1974)Morphological measurements of crabs, two species, two sexes2 Dortmund 170 5 5 No See reference Sommerer and Weihs (2005)Various characteristics of the districts of the city of Dortmund3 Iris 150 4 3 Yes R-datasets Anderson (1935)Measurements on 50 ﬂowers from each of 3 species of iris4 Vowels 990 10 11 Yes See reference Hastie et al. (2001)Recognition of British English vowels5 Bats 2677 72 8 Yes V. Zamora-Gutierrez Zamora-Gutierrez et al. (2016)Acoustic identiﬁcation of Mexican bat species6 USArrests 50 4 2 No R-datasets McNeil (1977)Arrests per 100,000 residents for various crimes in US states 19737 OliveOil 572 8 9 Yes R-pdfcluster Forina et al. (1983)Chemical decomposition of Italian olive oils from 9 regions8 OldFaithful 299 2 3 No R-MASS Azzalini and Bowman (1990)Duration and waiting times for eruptions of Old Faithful geyser9 Tetragonula 236 4 9 Yes R-prabclus Franck et al. (2004)Genetic information on 9 species of tetragonula bees10 Thyroid 215 6 3 Yes R-mclust Coomans et al. (1983)Results of ﬁve laboratory tests diagnosing thyroid gland patients11 Spam 4601 57 2 Yes R-kernlab Hastie et al. (2001)Email spam classiﬁcation from word and character frequencies12 Wisconsin 569 30 2 Yes UCI Street et al. (1993)Diagnosis of breast cancer, measurements of features of image13 Yeast 1484 8 10 Yes UCI Horton and Nakai (1996)Discriminative features for protein Localization Sites in cells14 Vehicle 846 18 4 Yes R-mlbench (i)Recognising vehicle type from silhouettes15 Letters 2000 16 26 Yes R-mlbench Frey and Slate (1991)Recognising handwritten letters from pixel displays16 Bundestag 299 5 5 No R-ﬂexclust (ii)German Bundestag election results 2009 of 5 major parties by constituency17 Finance 889 4 2 Yes R-Rmixmod du Jardin and S´everin (2010)Predicting ﬁrm bankruptcy from four ﬁnancial ratios18 BankNotes 200 6 2 Yes R-mclust Flury and Riedwyl (1988)Identifying counterfeit Swiss bank notes from measurements19 StoneFlakes 79 8 3 No Thomas Weber Weber (2009)Measurements on prehistoric stone tools20 Leaf 340 14 30 Yes UCI Silva et al. (2013)Shape and consistency measurements on leafs from 30 plant species21 London 32 9 4 No See reference (iii)Relative numbers of various crimes in the boroughs of London 2014

Number Name n p K “Truth” given Source Reference22 Wholesale 440 7 2 Yes

Abreu (2011)Spending on various product categories by clients of wholesale distributor23 Heart 200 13 5 Yes

Detrano et al. (1989)Diagnosing diﬀerent stages of heart disease by diagnostic measurements24 MachineKnow 403 5 5 No

Kahraman et al. (2013)Students’ knowledge status about the subject of Electrical DC Machines25 PlantLeaves 1599 64 100 Yes

Yan et al. (2013)Plant species classiﬁcation by texture detected from leaf images26 RNAYan 90 2 7 Yes Bioconductor Yan et al. (2013)RNA sequencing data distinguishing cell types in human embryonic development27 RNAKolo 704 5 3 Yes Bioconductor Kolodziejczyk et al. (2015)RNA sequencing data on mouse embryonic stem cell growth28 Cardiotocography 2126 23 10 Yes

Ayres-de Campos et al. (2000)Classiﬁcation of cardiotocograms into pattern classes29 Stars 240 4 6 Yes Kaggle (i)Predict star types from features of stars30 Kidney 203 11 2 Yes R-teigen Dua and Graﬀ (2017)Presence or absence of chronic kidney disease from diagnostic features31 BreastTissue 106 9 4 Yes

Jossinet (1996)Classes of breast carcinoma diagnosed by impedance measurements32 FOREX 1832 10 2 Yes (ii)Historical price data EUR/JPY for predicting direction next day33 SteelPlates 1941 24 7 Yes

Buscema (1998)Classiﬁcation of steel plates faults from various measurements34 BostonHouse 506 13 5 No

Harrison and Rubinfeld (1978)Multivariate characterisation of diﬀerent areas of Boston35 Ionosphere 351 32 2 Yes

Sigillito et al. (1989)Radar data to distinguish free electron patterns from noise in ionosphere36 Glass 214 9 6 Yes R-MASS Venables and Ripley (2002)Identify type of glass from chemical analysis37 CustomerSat 1811 10 5 Yes R-bayesm Rossi et al. (2005)Responses to a satisfaction survey for a product38 Avalanches 394 9 5 No Margherita Maggioni Maggioni (2004)Avalanche frequencies by size and other factors for mapping release areas39 Decathlon 2580 10 6 No R-GDAdata (iii)Points per event of decathlon athletes40 Alcohol 125 10 5 Yes UCI Adak et al. (2020)Five types of alcohol classiﬁed by QCM sensor data41 Augsburg 95 11 3 No See reference Theus and Urbanek (2009)Tax data for districts of the city of Augsburg before and after Thirty Years War42 ImageSeg 2310 16 7 Yes

Dua and Graﬀ (2017)3 ×

35 zip ﬁle with all data sets in the form in which they were analysed inthe present study (i.e., after all pre-processing) is planned to be provided asonline supplement of the published version of this article. Where a “true”clustering is given this is in the ﬁrst variable. All variables were scaled tozero mean and unit variance before clustering, except where stated in thefollowing list, which gives information about data pre-processing where thiswas applied.

The original data set has 203 variables, many of which arenot of much substantial interest, with several linear dependencies. Theversion used here is described in Coretto and Hennig (2016).

The original data set is split into test and training data for su-pervised classiﬁcation. Both are used together here.

The used data set is a preliminary version of what is analysed inZamora-Gutierrez et al. (2016) that was provided to me for testingpurposes by Veronica Zamora-Gutierrez. A small number of missingvalues were imputed by mean imputation.

The original data set contains classiﬁcation by 9 regions, whichare subclasses of 3 macro areas. The regions were used as “true”clustering.

This data set originally contains categorical genetic infor-mation. The version used here was generated by running a four-dimensional multidimensional scaling on genetic distances as proposedby Hausdorf and Hennig (2010). The resulting data were not scaledbefore clustering; the original scales represent the original distances.

15 Letters

This data set has originally 20,000 observations, which is toobig for handling a full distance matrix. Only the ﬁrst 2,000 have beenused.

16 Bundestag

The data set was not scaled before clustering. The unscaledversion represents comparable voter percentages.

19 StoneFlakes

A small number of missing values were imputed by meanimputation.

21 London

The data were retrieved from the website maps.met.police.uk/tables.htm in December 2015. The website has been reorganised in the meantimeand the original data are probably no longer available there, howevermore recent data of the same kind is available. Only the major crimecategories were used, divided by the total number of oﬀences; the totalnumber of oﬀences was used as a variable divided by the number ofinhabitants. After constructing these features, variables were scaled36the number of serious crimes is very low, and not scaling the relativenumbers would have strongly reduced their inﬂuence on clustering).

24 MachineKnow

A classiﬁcation variable is provided, but this was notused as “true” clustering here, because according to the documentationthis was constructed from the data by a machine learning algorithm,and does not qualify as “ground truth”.

26 RNAYan

This is originally a data set with p (cid:29) n . Unscaled principalcomponents were used as explained in Batool and Hennig (2020) inline with some literature cited there.

27 RNAKolo

This is originally a data set with p (cid:29) n . Unscaled principalcomponents were used as explained in Batool and Hennig (2020) inline with some literature cited there.

28 Cardiotocography

A variable with less than four distinct values hasbeen removed.

33 SteelPlates

Three variables with less than four distinct values havebeen removed.

34 BostonHouse

This was originally a regression problem. The originalresponse variable “housing price” is used together with the other vari-ables for clustering here. A binary variable has been removed.

35 Ionosphere

Two binary variables have been removed.

38 Avalanches

On top of the ﬁrst six variables, which give geographicalinformation, the original data has frequencies for avalanches of tendiﬀerent sizes, categorised by what percentage of the release areasis covered. This information has been reduced to the three variables“number of avalanches”, “mean coverage” and “variance of coverages”.

39 Decathlon

Only data from the year 2000 onward are used, in orderto generate a data set of manageable size. Variables were not scaled,because the decathlon points system is meant to make the originalpoints values comparable.

41 Augsburg

For four count variables the meaning of missing values was“not existing”, and these were set to zero. Some other missing valueswere imputed by mean imputation.

42 ImageSeg

Three variables with less than ﬁve distinct values have beenremoved. 37 eferences

Abreu, N. (2011). Analise do perﬁl do cliente recheio e desenvolvimento deum sistema promocional. Master’s thesis, Lisbon: ISCTE-IUL.Ackerman, M., S. Ben-David, and D. Loker (2010). Towards property-basedclassiﬁcation of clustering paradigms. In

Advances in Neural InformationProcessing Systems (NIPS) , pp. 10–18.Adak, M. F., P. Lieberzeit, P. Jarujamrus, and N. Yumusak (2020). Classi-ﬁcation of alcohols obtained by qcm sensors with diﬀerent characteristicsusing abc based neural network.

Engineering Science and Technology, anInternational Journal 23 , 463–469.Akhanli, S. E. and C. Hennig (2020). Comparing clusterings and numbers ofclusters by aggregation of calibrated clustering validity indexes.

Statisticsand Computing 30 (5), 1523–1544.Amigo, E., J. Gonzalo, J. Artiles, and F. Verdejo (2009). A comparison ofextrinsic clustering evaluation metrics based on formal constraints.

Infor-mation retrieval 12 , 461–486.Anderlucci, L. and C. Hennig (2014). Clustering of categorical data: a com-parison of a model-based and a distance-based approach.

Communicationsin Statistics - Theory and Methods 43 , 704–721.Anderson, E. (1935). The irises of the Gaspe Peninsula.

Bulletin of theAmerican Iris Society 59 , 2–5.Andrews, J. L. and P. D. McNicholas (2012). Model-based clustering,classiﬁcation, and discriminant analysis via mixtures of multivariate t-distributions.

Statistics and Computing 22 (5), 1021–1029.Andrews, J. L., J. R. Wickins, N. M. Boers, and P. D. McNicholas (2018).teigen: An R package for model-based clustering and classiﬁcation via themultivariate t distribution. Journal of Statistical Software 83 (7), 1–32.Arbelaitz, O., I. Gurrutxaga, J. Muguerza, J. M. P´erez, and I. Perona (2013).An extensive comparative study of cluster validity indices.

Pattern Recog-nition 46 (1), 243–256.Ayres-de Campos, D., J. Bernardes, A. Garrido, J. Marques-de Sa, andL. Pereira-Leite (2000). Sisporto 2.0: a program for automated analysisof cardiotocograms.

Journal of maternal-fetal medicine 9 , 311–318.Azzalini, A. and A. W. Bowman (1990). A look at some data on the oldfaithful geyser.

Applied Statistics 39 , 357–365.38agga, A. and B. Baldwin (1998). Entity-basedcross-document coreferencingusing the vector space model. In

Proceedings of the 36th Annual Meeting ofthe Association for Computational Linguistics and the 17th InternationalConference on Computational Linguistics (COLING-ACL 98) , pp. 79–85.ACL, Stroudsburg PE.Batool, F. and C. Hennig (2020). Clustering with the average silhouettewidth. arXiv:1910.11339 [stat], Computational Statistics and Data Anal-ysis (accepted for publication) .Boulesteix, A.-L. (2015). Ten simple rules for reducing overoptimistic re-porting in methodological computational research.

Plos ComputationalBiology 11 , e1004191.Boulesteix, A.-L. and M. Hatz (2017). Benchmarking for clustering methodsbased on real data: A statistical view. In

Data Science: Innovative De-velopments in Data Analysis and Clustering , pp. 73–82. Springer, Berlin.Boulesteix, A.-L., S. Lauer, and M. J. A. Eugster (2013). A plea for neutralcomparison studies in computational sciences.

PlosOne 8 , e61562.Brusco, M. J. and D. Steinley (2007). A comparison of heuristic proce-dures for minimum within-cluster sums of squares partitioning.

Psychome-trika 72 , 583–600.Buscema, M. (1998). Metanet: The theory of independent judges.

SubstanceUse & Misuse 33 , 439–461.Campbell, N. A. and R. J. Mahon (1974). A multivariate study of variationin two species of rock crab of genus leptograpsus.

Australian Journal ofZoology 22 , 417–425.Coomans, D., M. Broeckaert, M. Jonckheer, and D. L. Massart (1983).Comparison of multivariate discriminant techniques for clinical data -application to the thyroid functional state.

Methods of Information inMedicine 22 , 93–101.Coretto, P. and C. Hennig (2016). Robust improper maximum likelihood:tuning, computation, and a comparison with other methods for robustGaussian clustering.

Journal of the American Statistical Association 111 ,1648–1659.de Souto, M. C., I. G. Costa, D. S. de Araujo, T. B. Ludermir, and A. Schliep(2008). Clustering cancer gene expression data: a comparative study.

BMC Bioinformatics 9 , 497.Detrano, R., A. Janosi, W. Steinbrunn, M. Pﬁsterer, J. Schmid, S. Sandhu,K. Guppy, S. Lee, and V. Froelicher (1989). International application of39 new probability algorithm for the diagnosis of coronary artery disease.

American Journal of Cardiology 64 , 304–310.Dimitriadou, E., M. Barth, C. Windischberger, K. Hornik, and E. Moser(2004). A quantitative comparison of functional mri cluster analysis.

Ar-tiﬁcial Intelligence in Medicine 31 , 57–71.du Jardin, P. and E. S´everin (2010). Dynamic analysis of the businessfailure process: A study of bankruptcy trajectories. In

Proceedings of the6th Portuguese Finance Network Conference, Ponta Delgada, Azores, 1July 2010 .Dua, D. and C. Graﬀ (2017). UCI machine learning repository.Ester, M., H.-P. Kriegel, J. Sander, and X. Xu (1996). A density-basedalgorithm for discovering clusters in large spatial databases with noise. InE. Simoudis, J. Han, and U. M. Fayyad (Eds.),

KDD 96: Proceedings ofthe Second International Conference on Knowledge Discovery and DataMining , pp. 226–231. AAAI Press, Menlo Park CA.Everitt, B. S., S. Landau, M. Leese, and D. Stahl (2011).

Cluster Analysis,5th ed.

Wiley, New York.Flury, B. and H. Riedwyl (1988).

Multivariate Statistics: A practical ap-proach . Chapman & Hall, London.Forina, M., C. Armanino, S. Lanteri, and E. Tiscornia (1983). Classiﬁca-tion of olive oils from their fatty acid composition. In H. Martens andH. Russwurm (Eds.),

Food Research and Data Analysis , pp. 189–214. Ap-plied Science Publ., Barking.Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminantanalysis and density estimation.

Journal of the American Statistical As-sociation 97 , 611–631.Franck, P., E. Cameron, G. Good, J.-Y. Rasplus, and B. P. Oldroyd (2004).Nest architecture and genetic diﬀerentiation in a species complex of aus-tralian stingless bees.

Molecular Ecology 13 , 2317–2331.Frey, P. W. and D. J. Slate (1991). Letter recognition using holland-styleadaptive classiﬁers.

Machine Learning 6 , 161–182.Halkidi, M., M. Vazirgiannis, and C. Hennig (2015). Method-independentindices for cluster validation and estimating the number of clusters. InC. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.),

Handbook ofCluster Analysis , pp. 595–618. CRC Press.40arrison, D. and D. L. Rubinfeld (1978). Hedonic prices and the demandfor clean air.

Journal of Environmental Economics & Management 5 ,81–102.Hartigan, J. A. and M. A. Wong (1979). Algorithm as 136: A k-meansclustering algorithm.

Applied Statistics 28 , 100–108.Hastie, T., R. Tibshirani, and J. H. Friedman (2001).

The Elements ofStatistical Learning . Springer, New York.Hausdorf, B. and C. Hennig (2010). Species Delimitation Using Dominantand Codominant Multilocus Markers.

Systematic Biology 59 (5), 491–503.Hennig, C. (2015a). Clustering strategy and method selection. In C. Hennig,M. Meila, F. Murtagh, and R. Rocci (Eds.),

Handbook of Cluster Analysis ,pp. 703–730. CRC Press.Hennig, C. (2015b). What are the true clusters?

Pattern RecognitionLetters 64 , 53–62.Hennig, C. (2018). Some thoughts on simulation studies to compare clus-tering methods.

Archives of Data Science, Series A (Online First) 5 (1),1–21.Hennig, C. (2019). Cluster validation by measurement of clustering charac-teristics relevant to the user. In C. H. Skiadas and J. R. Bozeman (Eds.),

Data Analysis and Applications 1: Clustering and Regression, Modeling -Estimating, Forecasting and Data Mining , pp. 1–24. ISTE Ltd., London.Hennig, C. (2020). fpc: Flexible Procedures for Clustering . R package version2.2-8.Hennig, C. and M. Meila (2015). Cluster analysis: An overview. In C. Hen-nig, M. Meila, F. Murtagh, and R. Rocci (Eds.),

Handbook of ClusterAnalysis , pp. 1–19. CRC Press.Horton, P. and K. Nakai (1996). A probablistic classiﬁcation system for pre-dicting the cellular localization sites of proteins. In D. J. States, P. Agar-wal, T. Gaasterland, L. Hunter, and R. F. Smith (Eds.),

Proceedings ofthe Fourth International Conference for Intelligent Systems for MolecularBiology , pp. 109–115. AAAI Press, Menlo Park CA.Hubert, L. and P. Arabie (1985). Comparing partitions.

Journal of Classi-ﬁcation 2 (2), 193–218.Hubert, L. J. and J. Schultz (1976). Quadratic assignment as a generaldata analysis strategy.

British Journal of Mathematical and StatisticalPsychology 29 , 190–241. 41ain, A. K., A. Topchy, M. H. C. Law, and J. M. Buhmann (2004). Land-scape of clustering algorithms. In

Proceedings of the 17th InternationalConference on Pattern Recognition (ICPR04) , Vol. 1 , pp. 260–263. IEEEComputer Society Washington, DC.Javed, A., B. S. Lee, and D. M. Rizzo (2020). A benchmark study on timeseries clustering.

Machine Learning with Applications 1 , 100001.Jossinet, J. (1996). Variability of impedivity in normal and pathologicalbreast tissue.

Medical and biological engineering and computing 34 , 346–350.Kahraman, H. T., S. Sagiroglu, and I. Colak (2013). Developing intuitiveknowledge classiﬁer and modeling of users’ domain dependent data in web.

Knowledge Based Systems 37 , 283–295.Karatzoglou, A., A. Smola, K. Hornik, and A. Zeileis (2004). kernlab – anS4 package for kernel methods in R.

Journal of Statistical Software 11 (9),1–20.Kaufman, L. and P. J. Rousseeuw (1990).

Finding groups in data: an in-troduction to cluster analysis , Volume 344. Wiley, New York.Kolodziejczyk, A. A., J. K. Kim, J. C. Tsang, T. Ilicic, J. Henriksson, K. N.Natarajan, A. C. Tuck, X. Gao, M. B¨uhler, P. Liu, J. C. Marioni, andS. A. Teichmann (2015). Single cell rna-sequencing of pluripotent statesunlocks modular transcriptional variation.

Cell stem cell 17 (4), 471–485.Kou, G., Y. Peng, and G. Wang (2014). Evaluation of clustering algorithmsfor ﬁnancial risk analysis using mcdm methods.

Information Sciences 275 ,1–12.Lee, S. X. and G. J. McLachlan (2013). On mixtures of skew normal andskew t-distributions.

Advances in Data Analysis and Classiﬁcation 7 ,241–266.Liu, X., W. Song, B. Y. Wong, T. Zhang, S. Yu, G. N. Lin, and X. Di(2019). A comparison framework and guideline of clustering methods formass cytometry data.

Genome Biology 20 , 297.Maechler, M., P. Rousseeuw, A. Struyf, M. Hubert, and K. Hornik (2019). cluster: Cluster Analysis Basics and Extensions . R package version 2.1.0.Maggioni, M. (2004).

Avalanche release areas and their inﬂuence on uncer-tainty in avalanche hazard mapping . Ph. D. thesis, Universit¨at Z¨urich.Maulik, U. and S. Bandyopadhyay (2002). Performance evaluation of someclustering algorithms and validity indices.

IEEE Transactions on PatternAnalysis and Machine Intelligence 24 (12), 1650–1654.42cLachlan, G. J. and D. Peel (2000).

Finite Mixture Models . Wiley, NewYork.McNeil, D. R. (1977).

Interactive Data Analysis . Wiley, New York.Meila, M. (2007). Comparing clusterings—an information based distance.

Journal of Multivariate Analysis 98 (5), 873 – 895.Meila, M. (2015). Criteria for comparing clusterings. In C. Hennig, M. Meila,F. Murtagh, and R. Rocci (Eds.),

Handbook of Cluster Analysis , pp. 619–635. CRC Press.Meila, M. and D. Heckerman (2001). An experimental comparison of model-based clustering methods.

Machine Learning 42 , 9–29.Milligan, G. W. (1980). An examination of the eﬀect of six types of errorperturbation on ﬁfteen clustering algorithms.

Psychometrika 45 , 325–342.Milligan, G. W. (1996). Clustering validation: results and implications forapplied analyses. In P. Arabie, L. J. Hubert, and G. D. Soete (Eds.),

Clustering and Classiﬁcation , pp. 341–375. World Scientiﬁc, Singapore.Ng, A. Y., M. I. Jordan, and Y. Weiss (2001). On spectral clustering: Anal-ysis and an algorithm. In T. Dietterich, S. Becker, and Z. Ghahramani(Eds.),

Advances in Neural Information Processing Systems 14 (NIPS2001) , pp. 1–8. NIPS.Pinheiro, J. C. and D. M. Bates (2000).

Mixed-Eﬀects Models in S andS-PLUS . Springer, New York.Rodriguez, M. Z., C. H. Comin, D. Casanova, O. M. Bruno, D. R. Amancio,L. Costa, and F. A. Rodrigues (2019). Clustering algorithms: A compar-ative approach.

PloS one 14 , e0210236.Rossi, P. E., G. M. Allenby, and R. McCulloch (2005).

Bayesian Statisticsand Marketing . Wiley, New York.Saracli, S., N. Dogan, and I. Dogan (2013). Comparison of hierarchicalcluster analysis methods by cophenetic correlation.

Journal of Inequalitiesand Applications (electronic publication) 203 .Scrucca, L., M. Fop, T. B. Murphy, and A. E. Raftery (2016). mclust5: clustering, classiﬁcation and density estimation using Gaussian ﬁnitemixture models.

The R Journal 8 (1), 289–317.Shannon, C. E. (1948). A mathematical theory of communication.

The BellSystem Technical Journal 27 (3), 379–423.43igillito, V. G., S. P. Wing, L. V. Hutton, and K. B. Baker (1989). Classiﬁ-cation of radar returns from the ionosphere using neural networks.

JohnsHopkins APL Technical Digest 10 , 262–266.Silva, P. F. B., A. R. S. Mar¸cal, and R. M. A. da Silva (2013). Evaluationof features for leaf discrimination. In M. Kamel and A. Campilho (Eds.),

Image Analysis and Recognition , Berlin, Heidelberg, pp. 197–204. SpringerBerlin Heidelberg.Sommerer, E.-O. and C. Weihs (2005). Introduction to the contest “socialmilieus in dortmund”. In

Classiﬁcation - the Ubiquitious Challenge , pp.667–673. Springer, Berlin.Steinley, D. and M. J. Brusco (2011). Evaluating the performance of model-based clustering: Recommendations and cautions.

Psychological Meth-ods 16 , 63–79.Street, W. N., W. H. Wolberg, and O. L. Mangasarian (1993). Nuclearfeature extraction for breast tumor diagnosis. In

IS&T/SPIE 1993 In-ternational Symposium on Electronic Imaging: Science and Technology,Volume 1905, San Jose, CA , pp. 861–870.Theus, M. and S. Urbanek (2009).

Interactive Graphics for Data Analysis .CRC/Chapman & Hall, Boca Raton FL.Van Mechelen, I., A.-L. Boulesteix, R. Dangl, N. Dean, I. Guyon, C. Hennig,F. Leisch, and D. Steinley (2018, October). Benchmarking in clusteranalysis: A white paper. arXiv:1809.10496 [stat] .Venables, W. N. and B. D. Ripley (2002).

Modern Applied Statistics with S .Springer, New York.von Luxburg, U., R. Williamson, and I. Guyon (2012). Clustering: Scienceor art?

JMLR Workshop and Conference Proceedings 27 , 65–79.Wang, K., A. Ng, and G. McLachlan. (2018).

EMMIXskew: The EM Algo-rithm and Skew Mixture Distribution . R package version 1.0.3.Weber, T. (2009). The lower/middle palaeolithic transition - is there alower/middle palaeolithic transition?

Preistoria Alpina 44 , 1–6.Yan, L., M. Yang, H. Guo, L. Yang, J. Wu, R. Li, P. Liu, Y. Lian, X. Zheng,J. Yan, J. Huang, M. Li, X. Wu, L. Wen, K. Lao, R. Li, J. Qiao, andF. Tang (2013). Single-cell rna-seq proﬁling of human preimplantationembryos and embryonic stem cells.

Nature structural & molecular biol-ogy 20 , 1131–1139. 44amora-Gutierrez, V., C. Lopez-Gonzalez, M. C. MacSwiney Gonzalez,B. Fenton, G. Jones, E. K. V. Kalko, S. J. Puechmaille, V. Stathopoulos,and K. E. Jones (2016). Acoustic identiﬁcation of mexican bats based ontaxonomic and ecological constraints on call design.