HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis
Cameron Shand, Richard Allmendinger, Julia Handl, Andrew Webb, John Keane
IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION
Abstract—Comprehensive benchmarking of clustering algorithms is rendered difficult by two key factors: (i) the elusiveness of a unique mathematical definition of this unsupervised learning approach and (ii) dependencies between the generating models or clustering criteria adopted by some clustering algorithms and indices for internal cluster validation. Consequently, there is no consensus regarding the best practice for rigorous benchmarking, and whether this is possible at all outside the context of a given application. Here, we argue that synthetic datasets must continue to play an important role in the evaluation of clustering algorithms, but that this necessitates constructing benchmarks that appropriately cover the diverse set of properties that impact clustering algorithm performance. Through our framework, HAWKS, we demonstrate the important role evolutionary algorithms play to support flexible generation of such benchmarks, allowing simple modification and extension. We illustrate two possible uses of our framework: (i) the evolution of benchmark data consistent with a set of hand-derived properties and (ii) the generation of datasets that tease out performance differences between a given pair of algorithms. Our work has implications for the design of clustering benchmarks that sufficiently challenge a broad range of algorithms, and for furthering insight into the strengths and weaknesses of specific approaches.
Index Terms—Clustering, evolutionary computation, synthetic data, benchmarking, data generator.
C. Shand is with University College London, London WC1E 6BT, U.K. (e-mail: [email protected]). R. Allmendinger, J. Handl and J. Keane are with the University of Manchester, Manchester M15 6PB, U.K. (e-mail: [email protected], [email protected], [email protected]). A. Webb is with vTime, Liverpool L8 5RN, U.K. (e-mail: [email protected]).

I. INTRODUCTION

Cluster analysis is an unsupervised learning approach with the high-level aim of identifying groups (clusters) of objects that are more similar to each other than to the objects in other groups. It is a fundamental approach for knowledge discovery, with a broad range of applications from bioinformatics [1]–[3], to cybersecurity [4], to medicine [5]. Due to the unsupervised nature of clustering, the process of evaluating the quality of a partition (i.e. a given set of clusters) is not a straightforward task. Attempts to formally capture the qualities intuitively associated with pronounced cluster structure (such as compactness of individual clusters and separation between clusters) have led to the mathematical definition of a range of internal validation indices, which can be both complementary and conflicting [6]–[8]. Arbelaitz et al. [9] studied 30 such internal indices and concluded that the utility of such measures varied depending on the datasets considered, highlighting the limited scope of each individual index.
External cluster validation indices are thought to address this limitation and to provide a more objective assessment of clustering performance [1]. However, they require knowledge of the ground truth for a given dataset, i.e. information about the correct cluster membership of each data point — information that is difficult to come by in realistic unsupervised learning applications. For this reason, synthetic benchmark datasets (i.e. datasets with a known generating model) play an important role in the evaluation of clustering performance. A key advantage of such data is that both the ground truth and any assumptions implicit to the generating process are accessible. This allows for both an objective assessment of clustering performance and informed reasoning about the key drivers behind the observed performance.

In principle, direct control over the generating model then allows for the provision of datasets with specific and varied properties. This facilitates the testing of performance with regards to these known characteristics; the benefit is concrete: practically translatable insight regarding the strengths and weaknesses of particular algorithms [10]. However, existing generators for synthetic clustering benchmarks have not been designed with this level of flexibility in mind — instead, they typically use a set of fixed (manually tuned) parameter bounds within their generating model [11], limiting the complexity and diversity of datasets that can be obtained and failing to fully match the range of challenges posed by real-world datasets.

Our framework, HAWKS, is designed to address this limitation through the integration of an evolutionary algorithm (EA) into the generating process. EAs lend themselves as a mechanism to directly control and adapt key aspects of a partition's generating model — here, we demonstrate that this facilitates the design of more powerful benchmarks, exhibiting a diverse range of properties.
Furthermore, the innate modularity of EAs provides flexibility in choosing all key model components, including the representation of individual clusters and the set of objectives and constraints constituting the partition-level generating model, such as constraints on inter-cluster relationships. This flexibility is key to a broader utility of the framework, and our experiments illustrate HAWKS' potential in evolving benchmark sets either to meet a predefined set of criteria, or to directly maximize performance differences between pairs of algorithms.

In summary, the main contributions of this paper are as follows:

1) We propose an evolutionary framework for the generation of clustering benchmarks, HAWKS, that allows for a flexible parameterization of its generating model.
2) We describe a set of measurable properties (problem features) that quantify the difficulty of a cluster structure from a range of perspectives. This set of problem features is utilized to define an instance space [12], enabling visual examination of the correlations between these properties and algorithm performance. For this purpose, we represent each benchmark set by its associated problem features, embed this representation into two dimensions, and use colour coding to highlight the top-performing algorithm for each dataset.
3) We provide an indicative example of the use of HAWKS to generate datasets across the instance space by varying a subset of parameters. In this first optimization mode, individual problem features can be deployed as objectives and/or constraints. By varying the relative importance of each feature, a diverse collection of benchmarks can be obtained.
4) We present a second optimization mode for HAWKS, which aims to generate benchmarks that elicit performance differences between pairs of clustering algorithms.
Analysis of the problem features and cluster structures associated with the resulting datasets allows for the identification of the relative strengths and weaknesses of each algorithm.

The remainder of this paper proceeds as follows. Section II further motivates the need for synthetic cluster generators, and positions our work relative to existing generators and the literature. Section III describes HAWKS, discussing the importance of each component in the framework as a whole, and the indicative design choices we have made to illustrate its use. Section IV describes our experimental setup, including the set of problem features we deploy to measure the complexity of a dataset, and details of existing benchmark sets we compare against. The experimental results are presented in Section V, demonstrating HAWKS' ability to evolve diverse benchmark data, and instances tailored to challenge specific algorithms. Finally, we conclude and discuss future research directions in Section VI.

II. BACKGROUND
This section reviews the relevant background to our work. We start by positioning our work relative to similar problems in the literature, most prominently the relevance of benchmarks, and related work on the algorithm selection problem. We then review the issues of cluster analysis, and discuss the implications for the development and use of synthetic benchmarks. Finally, we provide an overview of existing benchmark generators for clustering, and analyze their strengths and limitations.
A. The general role of benchmarks
Empirical comparison between techniques is a cornerstone of the scientific method. At a community level, methods developed by independent researchers need to be compared in order to gain insights into the applicability, generalizability, and efficacy of their developments. As direct comparison is only possible on the same problem instances, subsequent research is highly likely to adopt instances used by other researchers. The importance of reproducibility further reinforces the need for a common, accessible set of data that can be utilized to facilitate comparisons across independent studies. This feedback loop results in "standard" benchmarks becoming virtually required to include in experimentation [10]. This requirement is supported explicitly through the creation of benchmark suites — collections of problems collated and/or created for the purpose of widespread comparison [13], [14].

The issue with this feedback loop is that the community as a whole risks tuning both hyperparameters and algorithmic development to these specific problems [10], [15]. If these popular problems represent a broad spectrum of real-world challenges, then this is not a negative; analyzing whether these problems adequately cover the space of encounterable problems, however, is difficult, if at all possible, to do in its entirety [13].

To combat this challenge, Hooker [10] argues for "controlled experimentation", i.e. comparing algorithmic performance specifically on a problem characteristic that the research in question is addressing, as opposed to the "competitive testing" that is encouraged when the same subset of datasets is re-used time and again. This argument is consistent with the implications of the No-Free-Lunch (NFL) theorem for learning, which supports the intuition that no single algorithm is expected to be superior across all problems [16], and that binary statements about algorithm superiority (i.e. "algorithm A is better than algorithm B") are only possible within particular problem classes.
Controlled experimentation is inherently simpler with synthetic benchmarks, as we have control over the generating mechanism and thus (to varying extents) the properties of the instances. The use of real-world instances for this purpose is possible only if there is an appropriate measure of the problem characteristic and a controlled way to vary it.

B. The algorithm selection problem
In the above, we have introduced the notion of problem characteristics, and the need for these to be varied in benchmark data. This is to ensure appropriate coverage across the space of possible instances, and the ability to appropriately differentiate between the challenges these instances may pose for different algorithms.

Rice [17] formalized these characteristics as "problem features" in the context of the algorithm selection problem (ASP), where the goal is to predict which algorithm from a portfolio is best-suited to a given problem instance based on its problem features. This is premised on the existence of an identifiable relationship between the problem features and problem difficulty for a given algorithm, typically requiring these features to be specific to the problem class. A series of papers by Smith-Miles further extended this framework [12], [18], [19], using an instance space to visualize the interaction between problem features and algorithmic performance [12] and applying this approach to combinatorial optimization [20], [21] and supervised learning [22].
The problem of selecting an appropriate algorithm for a given task is closely related to the "algorithm configuration problem" (i.e. hyperparameter optimization), which is important in both the metaheuristic and machine learning communities [23]–[25]. In this context, the identification of problem features has featured prominently in the form of exploratory landscape analysis, where quantification of different aspects of the fitness landscape informs not only algorithm configuration, but a more general understanding of the suitability of algorithm components to different problem characteristics [14], [26]–[28].
C. Relevance to unsupervised learning
Unsupervised learning aims to identify the natural structure of a dataset in the absence of information about the ground truth. This pattern recognition task is subjective, even for humans [29], [30], explaining the existence of a diverse range of clustering algorithms that make a broad variety of mathematical assumptions about desirable cluster properties [6], [8].

The dilemma stemming from the NFL theorem is thus exacerbated in cluster analysis, with differing clustering algorithms useful for different clustering problems [6], [31], due to unavoidable differences in the underlying formulation [32]. As the inductive bias of each clustering algorithm differs, this fundamentally governs its capabilities: as an example, the popular K-Means assumes hyper-spherical, compact clusters, and this strictly limits its ability to deal with data that violates this assumption.

In consequence, having a representative and diverse collection of benchmark datasets is particularly crucial in cluster analysis, in order to fairly test each algorithm's strengths and weaknesses, and to avoid any biases towards those methods whose inductive bias may be consistent or correlated with the assumptions of a limited benchmark set. However, compared to other optimization problems, evaluation of this is complicated by the multifarious nature of cluster quality measures itself [6], [8]. Without explicit effort to first identify and quantify aspects of problem structure or difficulty that matter in cluster analysis, it is challenging to assess and ensure appropriate diversity of problem instances and, thus, a comprehensive benchmark suite. Any efforts to improve existing benchmarks must therefore involve such identification and quantification as an integral step.

Previous works directly tackling the ASP for clustering (in the context of "meta-learning") have used generic statistical properties of the data (such as the mean of the moments across the data).
An alternative is the use of measures directly characterizing a given group structure [33], [34]. The latter approach is likely to be more powerful in identifying drivers of algorithm performance, but its limitation lies in the difficulty of transferring insights on feature–performance correlations to data with an unknown group structure.
D. Existing clustering benchmarks and generators
In the following, we consider existing clustering benchmarks and the extent to which these meet the criteria of representativeness and diversity, as discussed above.

The datasets from the UCI Machine Learning Repository [35] comprise a diverse mix of real-world datasets and are one of the most common benchmarks for the evaluation of machine learning algorithms. Recent analysis demonstrates, however, that even with real-world data from a range of application areas, diversity in complexity is not guaranteed. Macia et al. [13] analyzed the datasets of the UCI Machine Learning Repository, finding a surprising similarity in complexity across them. They discovered a lack of diversity in both the statistical properties of the datasets and the relative classification performance of different algorithms and parameter settings.

At the other end of the spectrum, two-dimensional toy datasets are a common approach to the evaluation of clustering methods [36], [37]. Typically, these datasets have been handcrafted to illustrate simple capabilities or properties of clustering algorithms. They remain popular due to their simplicity and because they enable intuitive visualization of results, rendering them highly effective at clearly exhibiting a particular property and consequent algorithm behaviour (see e.g. the "moons" dataset). Despite these advantages, the scenarios that toy datasets depict are often too contrived.

Although synthetic data is commonly generated ad hoc for individual work, some generators have been explicitly designed to have broader utility, with more complex properties than toy data. The generator proposed by Qiu and Joe [38] (abbreviated to QJ) uses a geometric framework for cluster placement.
The measure of separation used (proposed in [39]) quantifies the spatial separation between clusters. By adjusting the minimum amount of separation allowed, the user can generate datasets that have different amounts of separation between clusters, providing some loose control over the resulting difficulty of the datasets. To do this, the covariance matrices are iteratively scaled until this minimum separation is achieved. Although useful and geometrically interpretable, this provides a single perspective of cluster structure. Their generator can, however, embed additional complexity through the addition of noise [38]. Despite having several useful parameters to customize the difficulty of the generated problems, the generator (available as an R package) is not easily extended to incorporate other measures of cluster structure.

Handl and Knowles [11] (abbreviated to HK) created two generators (named "gaussian" and "ellipsoidal") that have been used extensively for the creation of synthetic clusters. At their core, these generators aim to generate clusters that are as compact as possible with either no ("gaussian") or minimal ("ellipsoidal") overlap. The "gaussian" generator uses a trial-and-error scheme where Gaussian clusters are randomly generated, and rejected if they overlap with existing clusters. The "ellipsoidal" generator was designed specifically for generating elongated clusters in higher dimensions, and uses an EA to optimize cluster location such that overall variance in the dataset is reduced while penalizing overlap.
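The trial-and-error scheme of the HK "gaussian" generator can be sketched as follows. This is a simplified illustration, not the original implementation: overlap is approximated here by a minimum distance between cluster means, and the separation heuristic and parameter names are our own assumptions.

```python
import numpy as np

def trial_and_error_gaussians(n_clusters, dim, n_points_per_cluster,
                              box=10.0, var=0.5, max_trials=1000, seed=0):
    """Rejection-based Gaussian generator sketch: propose random spherical
    Gaussian clusters and reject any candidate whose mean is too close to
    an existing cluster (a crude stand-in for an overlap test)."""
    rng = np.random.default_rng(seed)
    means = []
    min_sep = 4.0 * np.sqrt(var)  # illustrative separation threshold (assumption)
    for _ in range(max_trials):
        if len(means) == n_clusters:
            break
        candidate = rng.uniform(0.0, box, size=dim)
        # reject candidates that would overlap an already-placed cluster
        if all(np.linalg.norm(candidate - m) >= min_sep for m in means):
            means.append(candidate)
    data, labels = [], []
    for k, mu in enumerate(means):
        pts = rng.normal(loc=mu, scale=np.sqrt(var),
                         size=(n_points_per_cluster, dim))
        data.append(pts)
        labels.extend([k] * n_points_per_cluster)
    return np.vstack(data), np.array(labels)

X, y = trial_and_error_gaussians(n_clusters=3, dim=2, n_points_per_cluster=50)
```

The rigidity criticized in the text is visible even in this sketch: the separation rule is hard-coded, and incorporating a different notion of cluster structure would require rewriting the rejection test.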
Despite the popularity of these generators, their design is rigid with many hand-tuned parameters, and without consideration or easy scope for extension to consider additional or alternative aspects of cluster structure.

In this paper, we therefore propose a more general evolutionary framework for generating new synthetic benchmarks for cluster analysis. Our work tackles two crucial steps for the construction of diverse datasets: (i) the explicit identification of problem characteristics that are relevant to differentiating clustering algorithm performance and (ii) the design of an optimizer that is sufficiently general to evolve benchmarks for a range of criteria. Specifically, we experiment with two different types of criteria: approximating a desired target value of problem characteristics, and directly maximizing performance differences between algorithms. The latter approach helps highlight the strengths and weaknesses of the contestant methods, providing direct insight into their individual inductive bias. To evaluate the performance of our approach, we compare datasets obtained from our framework against multiple generators/dataset collections in terms of both the spread of performance across clustering algorithms, and the variance across our proposed set of problem features. This work greatly extends an initial proof-of-concept outlined in [40], generalizing the framework and introducing the ability to generate datasets directly for particular clustering algorithms.

(An example "moons" dataset: https://rdrr.io/cran/clusterSim/man/shapes.two.moon.html; the QJ generator is available as an R package at https://cran.r-project.org/web/packages/clusterGeneration/index.html)

III. HAWKS

We now proceed to introduce the details of our framework, HAWKS, for the generation of diverse, complex datasets. HAWKS uses an EA to evolve a population of datasets, where the objective function and constraints work together to vary properties of the resulting datasets that affect clustering algorithm performance.
For each component of HAWKS, we discuss its general role in dataset generation, and motivate the specific design choice made for the experiments presented in this paper.
A. Representing a dataset
In HAWKS, a dataset is represented by a set of clusters, which are themselves defined by the parameters of a given multivariate distribution. With this level of abstraction, we avoid handling of individual data points, and provide an intuitive manipulation of clusters through adjustment of distribution parameters. In principle, this approach allows for the inclusion of a variety of cluster generating models, given suitable initialization and variation operators for each distribution have been defined (see Sections III-B and III-E).

For the experiments reported in this paper, each cluster is represented by a multivariate Gaussian distribution, due to their relative simplicity and prominence in a machine learning context. For a dataset in D dimensions, each cluster is therefore encoded by a (µ, Σ) pair, which we will refer to as a single gene. Here, µ is the D-dimensional mean vector and Σ is the symmetric D × D covariance matrix. A dataset with K clusters is therefore represented as a genotype composed of K genes. In the current implementation, K is set a priori as an input parameter.

B. Initializing a population of datasets
The first step of HAWKS is to create an initial population of datasets. The sizes of the clusters can be controlled in two ways: (i) all clusters can be of equal size or (ii) each cluster is of random size (with an optional minimum size to avoid clusters small enough to be considered outliers). In both cases, the user predefines the total number of data points N. The cluster sizes remain fixed across all individuals for the remainder of the evolution, to avoid interference with the fitness (described in Section III-C) and focus the search on the distribution parameters only.

All other aspects of the cluster-level initialization are specific to the type of distribution used; here, we describe our approach for Gaussian clusters. The initial means are sampled from a D-dimensional uniform distribution, i.e. for the i-th cluster $\mu^{(i)} \sim U(0, \beta_\mu)^D$, where $\beta_\mu$ is the upper threshold that controls the initial sampling space for the means. In HAWKS, the covariance matrix of the i-th cluster is defined as $\Sigma^{(i)} = R^{(i)} S^{(i)} \tilde{\Sigma}^{(i)} (R^{(i)})^\top$, where $R^{(i)}$ and $S^{(i)}$ are the i-th rotation and scaling matrices, respectively, and $\tilde{\Sigma}^{(i)}$ is the i-th axis-aligned covariance matrix (i.e. a diagonal matrix that consists of only the variances). These variances are sampled similarly to the means, i.e. $\tilde{\Sigma} = \mathrm{diag}(U(0, \beta_\sigma)^D)$, where $\beta_\sigma$ is the upper threshold that controls the initial sampling space for the variances. The scaling matrix is set to the identity matrix at this stage.
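The cluster-level initialization above, including the Haar-distributed rotation described next, can be sketched as follows. This is a minimal illustration, not the HAWKS implementation: the use of SciPy's `special_ortho_group` for the Haar-random rotation, and the parameter names `beta_mean`/`beta_var`, are our own assumptions.

```python
import numpy as np
from scipy.stats import special_ortho_group

def init_gene(dim, beta_mean=10.0, beta_var=5.0, seed=0):
    """One (mu, Sigma) gene: mu ~ U(0, beta_mean)^D, axis-aligned
    variances ~ U(0, beta_var), scaling S = I at initialization, and
    Sigma = R S Sigma_tilde R^T with a Haar-random rotation R."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0.0, beta_mean, size=dim)
    sigma_tilde = np.diag(rng.uniform(0.0, beta_var, size=dim))
    S = np.eye(dim)                                      # scaling matrix (identity at init)
    R = special_ortho_group.rvs(dim, random_state=seed)  # Haar-distributed rotation
    Sigma = R @ S @ sigma_tilde @ R.T                    # rotated, valid covariance
    return mu, Sigma

mu, Sigma = init_gene(dim=3)
# sample points from the resulting cluster model
points = np.random.default_rng(0).multivariate_normal(mu, Sigma, size=100)
```

Because Sigma is built as a rotated diagonal matrix with non-negative entries, it is symmetric positive semi-definite by construction, so sampling from the multivariate Gaussian is always valid.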
The covariances are then obtained via rotation using a random rotation matrix R, drawn from the Haar distribution [41] to generate a valid covariance matrix. This method permits generation of clusters with a variety of shapes and orientations, thus ensuring the initial population has a diverse set of individuals. This approach, while more complex than constructing a covariance matrix with random values, allows both more intuitive parameterization and the ability to separately modify individual components of the covariance matrix, which we later exploit for mutation (Section III-E).

C. Computing the fitness of a dataset
The quantification of fitness plays a key role in focusing the search on those datasets that are deemed to present interesting clustering benchmarks. It is thus one of the most vital design decisions in a given generator, but is complicated by our limited understanding of the desirable properties of clustering benchmarks, as discussed in Section II-C. Therefore, our framework is designed to allow for the interchangeable use of different fitness functions, providing scope for a variety of choices to be trialled.

In this paper, we introduce and experiment with two different modes of optimization which support the generation of datasets for distinct goals. The first, named Index mode, optimizes the datasets towards a user-defined target value for a given cluster validity index. This provides control of the broad difficulty of the datasets, where the nature of the difficulty is governed by the properties of the index used. The second, named Versus mode, directly optimizes datasets such that they maximize the performance difference between two clustering algorithms. By generating a range of datasets that are simple and difficult for specific algorithms we can "stress-test" them, revealing their relative strengths and weaknesses. We describe these approaches in more detail in the following sections.

(Owing to its non-triviality, the method to generate random cluster sizes with a minimum size that overall sum to N is described in the supplementary material.)
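As a concrete illustration of the Versus fitness just described, the following sketch scores two algorithms against the known ground truth and returns their difference. The choice of algorithms and the use of scikit-learn here are illustrative assumptions, not the paper's experimental setup; the scoring function is the Adjusted Rand Index, discussed further below.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

def versus_fitness(X, y_true, winner, loser):
    """Versus-mode fitness sketch: phi(A_w) - phi(A_l), where phi scores a
    partition against the ground truth (here via the Adjusted Rand Index).
    Maximizing this favours datasets easy for `winner`, hard for `loser`."""
    return (adjusted_rand_score(y_true, winner.fit_predict(X))
            - adjusted_rand_score(y_true, loser.fit_predict(X)))

# Illustrative dataset and algorithm pair (not from the paper)
X, y = make_blobs(n_samples=200, centers=4, random_state=0)
diff = versus_fitness(
    X, y,
    winner=KMeans(n_clusters=4, n_init=10, random_state=0),
    loser=AgglomerativeClustering(n_clusters=4, linkage="single"),
)
```

In the evolutionary loop, `diff` would be the quantity maximized over the population of datasets; the dataset `X` above merely stands in for a decoded individual.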
1) Index Mode:
Cluster validity indices capture the amount of structure of a partition, in terms of the underlying data distribution. Given knowledge of the generating model (i.e. the true partition), they can thus act as a proxy for the ease of recognizing this structure in a given dataset.

As previously discussed, there are many different cluster validity indices, each with a slightly different perspective of how cluster structure should be quantified. Previous work, including [9], could not conclude superiority of any single validity index. Thus, to obtain a comprehensive understanding of the difficulty of a given dataset, many cluster validity indices could potentially be used in combination [42]. In the context of a fitness function, this could be done in the form of an aggregation of indices or through formulation as a many-objective problem. Here, we opt for a middle ground, limiting ourselves to the definition of fitness through a single cluster validity index, but allowing for the incorporation of additional considerations (which could include validity indices) through the use of constraints (discussed in Section III-D).

For the experiments reported in this paper, we use the silhouette width [43] as a representative example of a validity index. It is an established, widely-used method and has two particular characteristics that make it a promising choice for use here: (i) it is a very rich measure that provides information at multiple levels of resolution (the individual data point, the cluster, and the partition level); and (ii) as it is bounded in the range [−1, 1], it can be easily compared across individuals and even runs (when a similar dimensionality is used), and it is readily interpretable to the user.

The silhouette width is a combination internal validity index, as it measures a ratio of intra-cluster compactness and inter-cluster separation [1]. For a single data point, x_i, it is defined as:

$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}} \tag{1}$$

Here, a(x_i) represents the cluster compactness (with respect to x_i) and is the average distance from x_i to all other data points in its cluster. The separation between clusters is represented by b(x_i); for data point x_i this is defined as the minimum of the average distances to all data points in every other cluster. The silhouette width is calculated for all N data points in dataset X, and an average is taken to obtain the overall silhouette width:

$$s_{\mathrm{all}} = \frac{1}{N}\sum_{i=1}^{N} s(x_i) \tag{2}$$

A value of 1 represents very compact and well-separated clusters, whereas a negative silhouette width value indicates that points in different clusters are not well-separated (and that their cluster membership should be changed).

Independent of the validity index chosen, direct maximization or minimization of such an index would always lead to the evolution of datasets that are trivially separable or fully unstructured, respectively. HAWKS therefore requires input in the form of a desired target value, allowing direct modulation of the desired level of structure. In other words, and using the example of the silhouette width, a target value (denoted s_t) is specified by the user and datasets are then optimized to meet this target value. This is achieved by minimizing the absolute difference between s_t and s_all, defined as:

$$\min f(\mu^{(1)}, \Sigma^{(1)}, \ldots, \mu^{(K)}, \Sigma^{(K)}) \equiv \min\, |s_t - s_{\mathrm{all}}| \tag{3}$$
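The Index-mode fitness of Eqs. (1)–(3) can be sketched directly with scikit-learn's silhouette implementation. This is a minimal illustration (the dataset is a stand-in for a decoded HAWKS individual, and the target value is arbitrary):

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def index_fitness(X, labels, s_target):
    """Index-mode fitness sketch: |s_t - s_all| (Eq. (3)), where s_all is
    the overall silhouette width of the true partition; lower is better."""
    return abs(s_target - silhouette_score(X, labels))

# Illustrative individual: 300 points in 3 Gaussian clusters
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
fitness = index_fitness(X, y, s_target=0.9)
```

Note that the labels passed in come from the generating model, not from a clustering algorithm: the silhouette width here measures how recognizable the planted structure is, and the EA drives it toward the user's target s_t.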
2) Versus Mode:
The Index mode provides us with the ability to generate benchmarks that meet specific thresholds of user-defined validity criteria. However, shared assumptions between validity indices and clustering algorithms themselves can make it difficult to generate datasets with properties that specifically challenge a given algorithm. This is particularly the case when working with clustering techniques whose inductive biases are poorly understood, e.g. self-organizing approaches or those that utilize deep learning [44].

To perform "controlled experimentation", a more direct link between dataset generation and algorithm performance may be needed [10]. Our Versus mode tries to address this by directly optimizing the performance difference between two algorithms, thereby exploiting the strengths of one algorithm, the weaknesses of the other, or a combination of the two. We re-formulate the fitness function such that we maximize the difference between the scores of the 'winning' algorithm (A_w) and the 'losing' algorithm (A_l), using a scoring function φ:

$$\max f(\mu^{(1)}, \Sigma^{(1)}, \ldots, \mu^{(K)}, \Sigma^{(K)}) \equiv \max\left(\phi(A_w) - \phi(A_l)\right) \tag{4}$$

While any scoring function can be used, we have access to the generating model of the clusters and therefore the ground truth. The quality of each partition can therefore be objectively assessed using an external validity index such as the Adjusted Rand Index (ARI) [45]. The ARI measures the co-occurrences of cluster assignments between two partitions, which in our case corresponds to the output of a given clustering algorithm, and the ground truth. The upper bound of 1 represents identical assignment, whereas 0 indicates random assignment. As the ARI considers only the assignment of points, it can be used with any clustering algorithm and has no preference of structure.

D. Augmenting cluster properties using constraints
As previously discussed, it is difficult to define a singlefitness measure that represents all properties that may bedesirable in a benchmark dataset. Therefore, our frameworkuses constraints to allow for the integration of additionalconsiderations, with two main aims: (i) to avoid the generationof trivial datasets that are e.g. simple for any algorithm, or toonoisy to be clustered, and (ii) to introduce additional propertiesthat balance limitations of individual fitness functions orfurther enhance diversity in the datasets obtained. To controloptimization of the fitness and constraints, we use stochastic
EEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION 6 ranking [46] to balance the satisfaction of all constraints (thisis discussed further in Section III-F).In the following, we discuss two constraints introducedto meet these aims by directly accounting for local overlapbetween clusters and for cluster shape (as measured by clusterelongation — for a different generating model, other choicesmay be plausible). These choices of constraints have to beseen in the context of the fitness functions adopted for ourexperiments: when using the silhouette width in
Index mode,the fitness provides a powerful overall perspective of separa-tion between clusters. However, the reliance on averaging canlose fine-grained information about local variability, e.g. lackof inter-cluster separation for small clusters. Furthermore, thesilhouette width does not directly capture information aboutcluster shape, while differences in cluster shapes are knownto drive some performance differences between algorithms.Similarly, for the
Versus mode, a performance difference between the algorithms is sought with no explicit concern for the underlying cluster properties. While this exploratory approach is desired, there are situations where this can lead to cluster structures that are not meaningful (e.g. clusters that are completely overlapping) but where one algorithm is still perceived to be favoured due to arbitrary artefacts.
1) Overlap:
Real-world data typically does not contain cleanly separated clusters, whether due to noise or reflective of the underlying relationships between variables. Clustering algorithms differ significantly in their ability to handle some degree of cluster overlap, and control over this aspect of our benchmark is therefore important. At the extreme ends, a benchmark set with very well-separated clusters may be too simple to detect any performance differences between algorithms; equally, a dataset where clusters fully overlap will be of no use.

The definition of overlap is not purely objective, and can be implemented in different ways. Fränti and Sieranoja [36] defined a data point as overlapping if it is closer to a centroid of a different cluster than to the centroid of its own cluster. This definition, however, does not extend to highly eccentric clusters. We use a definition similar to [11], where a data point is considered as overlapping if its nearest neighbour belongs to a different cluster. In addition to avoiding the compactness bias introduced by using centroids, this definition has the specific benefit of countering an inherent limitation of the silhouette width, where a high average silhouette width can be driven by large clusters that are very well-separated, failing to sufficiently reflect the presence of very small but highly overlapping clusters.

We calculate the overlap as the percentage of data points whose nearest neighbour is in a different cluster:

overlap = 1 − (1/N) Σ_{x ∈ X} 1_{C_k}(n_x),   (5)

where C_k is the cluster that data point x belongs to, n_x is the nearest neighbour of x, and 1_{C_k}(·) is the indicator function that is 1 if n_x ∈ C_k and 0 if n_x ∉ C_k.
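As a concrete illustration, the overlap of Eq. (5) reduces to a single nearest-neighbour query. The sketch below uses scikit-learn (the function name and use of scikit-learn are our own, not part of HAWKS):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def overlap(X, labels):
    """Fraction of points whose nearest neighbour lies in a different cluster (Eq. 5)."""
    # Query 2 neighbours: the first is the point itself, the second its nearest neighbour.
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    neighbour_labels = labels[idx[:, 1]]
    return float(np.mean(neighbour_labels != labels))
```

For two well-separated clusters this returns 0, and it approaches 1 as the clusters interleave completely.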
2) Elongation:
The elongation (or, more specifically, the eccentricity) of clusters can pose problems for compactness-based algorithms such as K-Means, and thus presents an additional aspect that we are keen to control. Although our initialization parameters can adjust initial cluster eccentricity, by using an explicit constraint we can encourage (or penalize) this further during the evolution.

Fig. 1. Illustrations of the HAWKS genetic operators. Uniform crossover between two individuals is shown in (a), where the means and covariances are swapped independently. In (b), a single cluster is mutated where both the location (mean) and shape (covariance) have been randomly perturbed (the original cluster is shown again, faded, to illustrate the relative differences).

Our definition of this constraint is specific to the generating model used, and different definitions are possible. As previously mentioned, each full covariance matrix (Σ) in our cluster representation is separated into the axis-aligned variances (Σ̃) and a rotation matrix. As the variances on the diagonal of Σ̃ are the eigenvalues of the full covariance matrix, i.e. Σ̃ = diag(λ_1, . . . , λ_D), the ratio of the maximum and minimum of these eigenvalues gives us a measure of the cluster eccentricity. As even a single eccentric cluster can pose challenges for compactness-based algorithms, we take the minimum of these ratios across all K clusters, i.e.:

λ_ratio = min_{k ∈ {1,...,K}} |λ_max(Σ^(k))| / |λ_min(Σ^(k))|,   (6)

where λ_max(Σ^(k)) and λ_min(Σ^(k)) are the maximum and minimum eigenvalues of Σ^(k), respectively.

E. Perturbing a dataset
As is typical in an EA, the variation operators in HAWKS provide a further opportunity to integrate domain knowledge and focus the search. Specifically, a close alignment between the generating model used (i.e. the representation of an individual cluster and partition) and the operators is crucial to optimization performance and the types of datasets that can be obtained. In the following, we describe the variation operators designed for the multivariate Gaussian distributions utilized as the generating model in our experiments.
1) Crossover:
For recombination between datasets, HAWKS uses a high-level uniform crossover scheme where the two components defining each cluster distribution (i.e. µ and Σ) can be swapped separately between individuals. This allows both the location (mean) and shape (covariance) of clusters to be swapped between individuals independently, as illustrated in Fig. 1a.
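A minimal sketch of this operator, assuming each individual is represented as a list of (mean, covariance) pairs; the representation and the 50% swap probability here are illustrative assumptions, not the exact HAWKS implementation:

```python
import random

def uniform_crossover(parent1, parent2, swap_prob=0.5):
    """Uniform crossover over a dataset genotype: a list of (mean, cov) cluster pairs.
    Means and covariances are swapped independently between the two parents."""
    child1, child2 = [], []
    for (m1, c1), (m2, c2) in zip(parent1, parent2):
        if random.random() < swap_prob:   # swap this cluster's means
            m1, m2 = m2, m1
        if random.random() < swap_prob:   # swap this cluster's covariances
            c1, c2 = c2, c1
        child1.append((m1, c1))
        child2.append((m2, c2))
    return child1, child2
```

Because means and covariances are swapped independently, an offspring can inherit a cluster's location from one parent and its shape from the other.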
2) Mutation:
Meaningful mutations of a cluster require careful design of a geometrically meaningful operator appropriate for the distribution being used. In our Gaussian case, we use a separate operator for the mean and covariance terms.

A key issue identified in [40] was the increasing amount of drift of the mean operator in higher dimensions. The original operator shifts the mean to a random nearby point; at higher dimensionality there are an increasing number of directions that point away from other clusters, and a random walk is thereby likely to increase the silhouette width at most steps. In the supplementary material (Section ??), we test multiple new mutation operators (drawing inspiration from operators used in particle swarm optimization and differential evolution) to address this issue by directly considering the location of other clusters.

The resulting operator, selected and used throughout this paper, utilizes concepts from particle swarm optimization [47], where a combination of other centroids and a global representative is used to embed direction either towards or away from existing clusters, directly affecting the fitness. The new mean, µ′^(i), is obtained as follows:

µ′^(i) = µ^(i) + [w_1(µ^(n) − µ^(i)) + w_2(µ̄ − µ^(i))]   if p ≤ 0.5,
µ′^(i) = µ^(i) − [w_1(µ^(n) − µ^(i)) + w_2(µ̄ − µ^(i))]   if p > 0.5,   (7)

where µ^(i) is the current cluster mean, µ^(n) is the mean of another randomly selected cluster, µ̄ is the global mean across all data points, w_1 and w_2 are random weighting coefficients in the range [0, 1], and p is a random coin-flip to decide whether to move away from or towards this weighted combination of an existing cluster mean and the global mean.

The covariance governs the shape of the cluster, affecting both the overlap between clusters and the capabilities of compactness-based algorithms (such as K-Means). To mutate the covariance, we rotate the cluster and scale its eigenvalues, effectively changing the shape of the cluster.
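The mean mutation of Eq. (7) translates directly into code; the sketch below is illustrative (the random generator, its seed, and the function name are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def mutate_mean(mean, other_mean, global_mean):
    """PSO-inspired mean mutation (Eq. 7): step towards or away from a weighted
    combination of another cluster's mean and the global mean of the data."""
    w1, w2 = rng.uniform(0, 1, size=2)            # random weighting coefficients in [0, 1]
    step = w1 * (other_mean - mean) + w2 * (global_mean - mean)
    sign = 1.0 if rng.uniform() <= 0.5 else -1.0  # coin-flip: towards (+) or away (-)
    return mean + sign * step
```

Embedding the positions of other clusters in the step direction is what counters the high-dimensional drift described above.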
Similar to the initialization, the rotation matrix is drawn from the Haar distribution [41], though here it is raised to a fractional power to avoid complete reorientation of the cluster. The scaling matrix, S, is drawn from a Dirichlet distribution in order to ensure that the resulting determinant is unchanged, i.e. det(Σ̃) = det(S · Σ̃). Using an analogy, this has the same effect as rotating a balloon and applying pressure to the principal semi-axes, thereby changing the shape while maintaining the volume. The combined effect of both mutation operators is illustrated in Fig. 1b.

F. Selecting a dataset
As previously discussed, we use stochastic ranking [46] to balance the satisfaction of the objective and constraint(s), which may be complementary or impossible to fully satisfy simultaneously. By adjusting the probability of comparing two individuals on their fitness (denoted P_f; this is described in further detail in Section ??), we can effectively weigh the satisfaction of the objective or the constraints, providing us with another way of controlling the properties of the resulting datasets. For example, if using only the overlap constraint in Index mode, setting P_f to a higher value will add selection pressure towards datasets with a silhouette width closer to s_t, potentially at the cost of a higher degree of overlap. Thus, unlike the traditional use of stochastic ranking, which uses a narrow range of values for P_f [46], [48] to avoid too much weighting towards the infeasible or feasible regions, we can use the entire range (as datasets that heavily violate the constraints may still be useful).

Similar to the original work [46], for environmental selection we use stochastic ranking to select the top |P| individuals from the sorted pool of parents and offspring. For parental selection, we use standard binary tournament where the rank is used to determine the winner (in lieu of using only the fitness) to ensure a continued selection pressure towards individuals that best satisfy the fitness and constraints as weighted by P_f.

IV. EXPERIMENTAL SETUP
In order to assess the capabilities of HAWKS, we have three primary aims in our experiments: (i) compare the diversity of performance of multiple, distinct clustering algorithms on datasets from HAWKS against that seen for other popular generators or dataset collections; (ii) compare the diversity across a distinct set of problem features, to see if datasets from HAWKS cover a wider space of properties than other datasets; and (iii) gain insights into the algorithms themselves, utilizing the
Versus mode of HAWKS to directly challenge clustering algorithms.

The remainder of this section describes our proposed problem features (Section IV-A), the generators and dataset collections we compare against (Section IV-B), and the relevant HAWKS parameters used in each experiment (Section IV-C).
A. Problem features
In order to measure dataset diversity (with respect to their inherent properties), we need to define a set of problem features describing properties relevant to algorithm performance. This is central to the algorithm selection problem (ASP), as these features are used to predict which algorithm is best for a given problem instance. These problem features are also vital for the creation of an instance space that visualizes the datasets, allowing identification of areas in the space where particular algorithms are most suitable.

In previously discussed work on the ASP for clustering, the problem features (also referred to as "meta-features") used are generally statistical or information-theoretic, and not specific to clustering [33], [34]. The problem features previously used in [40] were also too simplistic, and fully coincided with parameters available in HAWKS, reducing the utility of the instance space.

As there are many properties that influence performance for a given clustering algorithm, and for most algorithms this aspect is not fully understood, a fully complementary set of problem features is arguably impossible [6], [9]. Nonetheless, going beyond simple statistical measures of the data by
incorporating measures specific to clustering (especially those that make use of the generating model) should improve the discriminative power of the problem features. The remainder of this section describes our proposed set of problem features for this task.
1) Average eccentricity:
A high amount of cluster eccentricity can cause problems for compactness-based algorithms. To measure this, we use a method similar to its calculation as a constraint in HAWKS, with some key differences. For both the synthetic and real-world datasets used in this paper, we have the true labels. Using these, the eccentricity is calculated as before (the ratio of the maximum to minimum eigenvalues for each cluster), except that the average (rather than minimum) value is used as the problem feature. The eigenvalues are obtained via singular value decomposition, but (in contrast to the constraint) a subset of the eigenvalues is used. This reduces the sensitivity to outliers and subspace clusters (which result in zero, or arbitrarily close to zero, eigenvalues). For further details, see Section ??.
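A simplified sketch of this feature is given below. Note that it uses the full set of eigenvalues, whereas the paper uses a subset to reduce sensitivity to outliers; the function name is ours:

```python
import numpy as np

def average_eccentricity(X, labels):
    """Mean over clusters of the ratio of largest to smallest covariance eigenvalue."""
    ratios = []
    for k in np.unique(labels):
        points = X[labels == k]
        cov = np.cov(points, rowvar=False)
        # For a PSD covariance matrix, the singular values equal the eigenvalues.
        eigvals = np.linalg.svd(cov, compute_uv=False)
        ratios.append(eigvals.max() / eigvals.min())
    return float(np.mean(ratios))
```

A spherical cluster contributes a ratio of 1, while an elongated cluster contributes a large ratio, pulling the average up.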
2) Connectivity:
The connectivity, defined by Handl and Knowles [49] (a modified version of the measure proposed in [50]), measures the extent to which neighbouring data points are assigned to the same cluster. We normalize this measure by the number of data points to obtain a value that is comparable across datasets. This provides a more nuanced picture than the overlap constraint used by HAWKS, which looks at the single nearest neighbour rather than the set of L nearest neighbours, where L is a parameter of the connectivity. Here, we use L = 10 in line with previous work [49].
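A sketch of the normalized connectivity, assuming the 1/j penalty form of [49] (a point is penalized 1/j when its j-th nearest neighbour lies in a different cluster); the normalization by N is as described above, and the names are ours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def connectivity(X, labels, L=10):
    """Normalized connectivity: penalize each point 1/j when its j-th nearest
    neighbour (j = 1..L) belongs to a different cluster, then divide by N."""
    n = len(X)
    nn = NearestNeighbors(n_neighbors=L + 1).fit(X)
    _, idx = nn.kneighbors(X)          # column 0 is the point itself
    total = 0.0
    for i in range(n):
        for j in range(1, L + 1):
            if labels[idx[i, j]] != labels[i]:
                total += 1.0 / j
    return total / n
```

Well-separated clusters larger than L points yield a connectivity of 0; mislabelled or overlapping points raise it.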
3) Dimensionality:
As different measures of distance or similarity are affected differently by high dimensionality, this feature simply describes the number of dimensions in the data. It is our only feature that is ground-truth agnostic, and is shared with previous work on the ASP for clustering [33], [34].
4) Entropy of cluster sizes:
The density or relative sizes of the clusters can affect the performance of different algorithms. For example, the ability of K-Means to discover clusters is diminished when a small subset of clusters contains the majority of data points. We calculate the entropy of cluster sizes as follows:

H(C) = − Σ_{k=1}^{K} (|C_k| / N) log_K(|C_k| / N),   (8)

where C is the set of K clusters, and |C_k| is the cardinality of cluster C_k, which is normalized by the size of the dataset (N). We use log_K to compare across datasets such that, for any K, a perfectly equal distribution of cluster sizes results in H(C) = 1.0, and H(C) → 0 when one cluster has N − K + 1 data points.
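Eq. (8) translates directly into code. The minimal sketch below assumes at least two clusters, since a logarithm with base K = 1 is undefined:

```python
import math
from collections import Counter

def cluster_size_entropy(labels):
    """Entropy of cluster sizes (Eq. 8), using log base K so that equal-sized
    clusters give 1.0 regardless of the number of clusters."""
    sizes = Counter(labels)
    n = len(labels)
    k = len(sizes)
    return -sum((s / n) * math.log(s / n, k) for s in sizes.values())
```

Three equal-sized clusters give a value of (approximately) 1.0, while one dominant cluster drives the value towards 0.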
5) Number of clusters:
The number of clusters can inherently affect algorithms where initialization of cluster location is key to convergence, particularly if clusters are well-separated.

As discussed earlier, use of the ground truth in these measures comes at the cost of limiting applicability to datasets where such knowledge is unavailable; this is a major issue for ASP applications but is not our intended focus here.
6) Silhouette width (average):
The average silhouette width measures how similar data points are to their own clusters, and how well-separated these are from the other clusters, indicating the potential difficulty for clustering algorithms in discovering these clusters.
7) Silhouette width (standard deviation):
Averaging the silhouette width can obscure the presence of a small number of very well-separated clusters, and of ill-defined overlapping clusters. Thus, the standard deviation of the silhouette width (calculated across all data points) indicates whether all points are similarly separated. Higher values indicate the presence of overlapping clusters (from a different perspective than the connectivity measure), which clustering algorithms have different capabilities in handling.
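Both silhouette-based features can be obtained from the per-point silhouette values. A sketch using scikit-learn on synthetic data (the dataset parameters here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Per-point silhouette values give both problem features at once.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)
sil = silhouette_samples(X, y)
avg_silhouette = float(np.mean(sil))  # feature 6: average silhouette width
std_silhouette = float(np.std(sil))   # feature 7: its standard deviation
```

A high average with a low standard deviation suggests uniformly well-separated clusters, whereas a high standard deviation flags a mix of well-separated and overlapping regions.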
B. Other generators and datasets
To provide context for the diversity in performance and problem features of the datasets produced by HAWKS, we compare against multiple generators and collections of popular datasets that have previously been used to evaluate clustering algorithms. We briefly describe these below; more details about the parameters used for these datasets, and how they compare (in terms of their size, dimensionality, etc.), can be found in the supplementary material (Table S-??) and their respective papers.
1) HK:
This is a collection of 350 datasets generated using the "ellipsoidal" generator proposed in [11], and used as a benchmark in [51]. For each of 35 unique combinations of the number of clusters and dimensionality, 10 datasets are generated.
2) QJ:
We use the set of 243 datasets generated using the parameters proposed by Qiu and Joe [38]. The authors calculated three target separation values using their measure of separation: "close structure", "separated", and "well-separated". Further complexity is added to these datasets by specifying varying proportions of noisy variables, alongside small variation in the number of dimensions.
3) SIPU:
Fränti and Sieranoja [36] introduced a benchmark consisting of multiple sets of clustering datasets, of which we use the ‘S-sets’, ‘A-sets’, and ‘G2 sets’. These sets were originally intended to stress-test compactness-based algorithms, on which K-Means exhibited a wide range of performance. The ‘S-sets’ are all 2D data where N = 5000 and K = 15, but with different degrees of overlap between the clusters (determined by the aforementioned closest-centroid method). The ‘A-sets’ are also 2D data with varying numbers of (equally-sized) clusters. The ‘G2 sets’ consist of two Gaussians with varying degrees of overlap, constant size (N = 2048), and varying dimensionality. The variation in overlap and the high number of dimensions in this benchmark present a variety of challenges for clustering algorithms.
4) UCI:
The UCI Machine Learning Repository [35] is a popular source of datasets used for machine learning. As noted in [8], [36], the class labels do not necessarily translate to meaningful cluster labels, and thus these datasets may not be the most suitable for clustering. Nonetheless, they have seen extensive use in the clustering literature, and we include them
for completeness. Specifically, we use the subset of 20 datasets used by Arbelaitz et al. [9].
5) UKC:
These 8 real-world datasets, curated in [51], are the (anonymized) locations of different crimes. Alongside the
UCI datasets, these will help provide insight into whether there are significant differences between the real-world and synthetic data (in terms of their problem features or clustering algorithm performance).
C. HAWKS configurations
In this section we highlight the key parameters for HAWKS, separating those that are common across all experiments from those adjusted for different modes. In particular, while the core EA parameters (population size, mutation probability, etc.) remain the same across modes, in our experiment for the
Index mode we vary the objective function target and constraint parameters to generate different datasets. In contrast, the constraint parameters remain fixed for the
Versus mode experiment, as the variation comes from selecting different pairs of clustering algorithms. The full configurations for both sets of experiments can be found on GitHub.
1) Common parameters:
In both
Index and
Versus mode experiments, the following core EA parameters are used: G_max = 100 generations, a population of |P| = 10 individuals, a fixed crossover probability p_c, and a mutation probability of p_m = 1/K for the mean and covariance mutation operators. The low population size was previously found to be sufficient for both diversity and convergence [40].
2) Solution selection:
At the end of a run, a single dataset needs to be selected. Here, we simply select the individual with the highest fitness. Different choices are possible; e.g., due to our use of stochastic ranking, the individual with the highest (sorted) rank could be selected instead (though this had little effect on the experiments in this paper).
3) Analyzing benchmark dataset diversity (Index mode):
To test the ability of HAWKS to produce a variety of datasets, we vary several parameters to encourage coverage of our feature space:
• A poor and a high target silhouette width (s_t).
• Two upper thresholds for the overlap constraint (overlap ≤ {0, 0.1}), to penalize any overlap, or to penalize only if more than 10% of the data points overlap.
• Two lower thresholds for the eccentricity constraint (λ_ratio), to allow for any amount of eccentricity, or to encourage all clusters to have some eccentricity.
• Two levels of dimensionality (D), dataset size (N), and number of clusters (K).

Of note is the use of P_f = 0.5, which weights the fitness and constraints equally. By varying s_t, we directly attempt to generate datasets that either have poor cluster separation or are well-separated. Different values of the overlap and eccentricity constraints help further modulate the level of separation and the minimum cluster eccentricity. HAWKS is run 7 times for each of the 64 unique combinations of parameters listed above, resulting in 448 datasets.

https://github.com/sea-shunned/hawks configs

As we aim to measure diversity in clustering performance, we need an objective way to measure this. We therefore use the ARI (see Section III-C2) to compare the ground truth (which we know for all datasets used here) with the assignments from each algorithm. For the clustering algorithms themselves, we select four well-established algorithms with distinct properties and inductive biases, allowing us to assess the diversity of challenges that the datasets pose. These are:
• Average-linkage: a hierarchical clustering method that uses the average distance between all points in a cluster when deciding which are the closest (and thus should be merged).
• Gaussian mixture models (GMM): probabilistic models which try to represent subpopulations (of the data) through a number of Gaussian distributions.
To obtain a crisp clustering, each point is assigned to the cluster with the highest probability.
• K-Means++: proposed by Arthur and Vassilvitskii [52] to improve K-Means; its initialization scheme probabilistically selects cluster centres such that points further away from existing cluster centres are more likely to be selected.
• Single-linkage: in contrast to average-linkage, single-linkage considers the distance between two clusters to be the shortest distance from any member of one cluster to any member of the other.

To assess the maximum potential of each algorithm, we provide the true number of clusters for each dataset. Owing to the propensity of the linkage algorithms (particularly single-linkage) to be side-tracked by singleton clusters, we also run average- and single-linkage with double the true number of clusters [53].
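A sketch of this evaluation protocol, using the scikit-learn implementations of the four algorithms; the synthetic data and parameters below are illustrative stand-ins for the benchmark datasets:

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Placeholder dataset with a known ground truth.
X, truth = make_blobs(n_samples=500, centers=4, random_state=0)
K = 4  # the true number of clusters is provided to each algorithm

predictions = {
    "average-linkage": AgglomerativeClustering(n_clusters=K, linkage="average").fit_predict(X),
    "GMM": GaussianMixture(n_components=K, random_state=0).fit_predict(X),
    "K-Means++": KMeans(n_clusters=K, init="k-means++", n_init=10, random_state=0).fit_predict(X),
    "single-linkage": AgglomerativeClustering(n_clusters=K, linkage="single").fit_predict(X),
}
# ARI against the ground truth, as used throughout the experiments.
scores = {name: adjusted_rand_score(truth, pred) for name, pred in predictions.items()}
```

The same score difference between two algorithms is what the Versus fitness of Eq. (4) maximizes.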
4) Challenging specific algorithms (Versus mode):
To ascertain the capabilities of HAWKS’
Versus mode, we run each of the four clustering algorithms against the others in a head-to-head, with each in turn taking the role of ‘winner’ and ‘loser’ (referring to Eq. 4, we use the format A_w vs. A_l). Some pairings of algorithms (e.g. K-Means++ vs. GMM) are likely to be more competitive, due to shared capabilities and inductive biases. We expect this to translate into a reduction of the performance differences that can be observed (and thus a weaker fitness gradient). While the constraints are still important in such a scenario, we wish to avoid situations where the optimization is mainly driven by reducing the constraint penalty. Rather than simply removing the constraints, we therefore increase P_f to emphasize the production of a difference in algorithmic performance. This provides a useful lever for weighting the maximization of the performance difference against producing datasets that do not sacrifice cluster structure (e.g. through a high overlap). As such, we use a raised value of P_f in all head-to-head runs. Further adjustment of this parameter is encouraged for complex algorithm pairings.

Finally, all experiments presented for the Versus mode are run using K = 5 clusters and N = 2000 data points. The dimensionality is consistently set to D = 2, ensuring a straightforward visualization of the datasets and enabling us to observe the properties of these datasets without concerns about information loss through projection.

V. EXPERIMENTAL RESULTS
In this section, we explore HAWKS' ability to produce datasets that exhibit a broad range of properties and pose challenges to different clustering algorithms. Separate results for the
Index and
Versus modes highlight the general flexibility of our framework.
A. Analyzing benchmark dataset diversity (Index mode)
This section presents the instance space constructed from our set of 7 problem features applied to the 1,176 datasets from 6 popular collections/generators and our generator HAWKS. We assess diversity by considering (i) the variation observed across problem features and (ii) the variation observed in the performance of 4 clustering algorithms.

First, we use the instance space to understand how the datasets are spread across the space, and whether there are distinct patterns either in the sources of the datasets or in the performance of the clustering algorithms. The two principal components of the instance space shown in Fig. 2 do not account for all of the variance, and so there is some information loss in the projection. Nevertheless, it is evident that the components capture a sufficient proportion of the variance to highlight some key differences between datasets.

The instance space in Fig. 2a uses colour and marker coding to indicate which collection each dataset comes from. Visualizations of how each problem feature varies across the space can be found in Fig. S-??. The distinct spread of HAWKS datasets across the central part of the space is encouraging and highlights an appropriate level of diversity in terms of the problem features. Furthermore, the interpretation of the principal components (in terms of the underlying problem features) provides clear guidance on additional experiments that could be conducted to expand coverage in various directions. In contrast, the QJ datasets expand across a narrow band, indicating a lack of variance across the problem features. The SIPU datasets show a strong banding of instances, indicating that there is little variance among the datasets from each configuration (the separated datasets at the bottom of the space are the ‘G2 sets’, which have a much higher dimensionality than other datasets). The
UCI datasets are spread across the upper-half of the space, though this is unfortunately due to a lack ofstructure (which we later explore when looking at clusteringalgorithm performance), as their higher connectivity indicatesthat the labels do not line up with a spatial perspective ofclustering (shown in Fig. S- ?? ). The UKC datasets do not seemto represent anything extraordinary with regards to the problemfeatures we use here, leading to the conclusion that either thesynthetic datasets used here are not too dissimilar to real-worlddata or our set of problem features does not capture someaspect of complexity that they uniquely exhibit. Notably, the HK datasets are distinct from every other collection, indicatingthat they have unique characteristics. As shown in Fig. S- ?? ,the main difference is the average eccentricity of the clusters,which is higher than in any other collection. As these datasets PC1 P C SourceHAWKSHKQJSIPUUCIUKC (a) Dataset source
Fig. 2. Two instance spaces, with the source of the dataset (a) or the clustering algorithm with the highest ARI (b) highlighted. The 7 problem features (connectivity, dimensionality, average eccentricity, entropy, number of clusters, silhouette width (average), and silhouette width (standard deviation)) are projected down to 2D using PCA.

originate from the "ellipsoidal" generator, which was designed exclusively to create eccentric clusters in higher dimensions, this is unsurprising. Of interest, however, is whether this leads to a distinct difference in the relative performance of clustering algorithms on that benchmark suite.

To consider this aspect, Fig. 2b shows the instance space again, but with the clustering algorithm achieving the highest ARI for a given dataset highlighted. The ‘tied’ category is used when at least two algorithms were able to achieve the same ARI, which in every case was 1 and thus the dataset was trivial to cluster. This visualization permits easy identification of potential footprints, though the full information is tabulated in Table I, for completeness. It is evident that there are some areas of the space that correlate with a high performance of a particular algorithm; this happens for the HK benchmark suite in particular, where the highly ellipsoidal clusters consistently favour GMM or average-linkage with the higher setting of K.

In the more central part of the instance space, however, such footprints are unclear, indicating a complex performance landscape. This could be a consequence of the projection
TABLE I. Number of times (as a percentage of the total number of datasets for each source) each algorithm achieved the highest ARI for a given dataset. The best performing algorithm on each data source is highlighted in bold. (Columns: Average-Linkage, Average-Linkage (2K), GMM, K-Means++, Single-Linkage, Single-Linkage (2K), Tied; rows: HAWKS, HK [11], QJ [38], SIPU [36], UCI [35], UKC [51].)
Fig. 3. Clustering performance for each algorithm for each set of datasets.

(which may lose too much information to fully distinguish the datasets), or may point to the role of other features that have not yet been captured here but distinctly impact performance. Notably, variation in the identity of the best performing algorithm is the most pronounced for the HAWKS benchmark (see Table I), which is the only collection of datasets where every algorithm was best for a particular dataset. As this was achieved through just a few parameter settings, this is encouraging for the potential of HAWKS to generate diverse datasets.

To further examine the performance variation of the clustering algorithms across these datasets, Fig. 3 shows boxplots for the aggregated performance of each algorithm on each group of datasets. Here, larger boxplots indicate a greater variety of performance for that clustering algorithm, which is preferred. For HAWKS, the boxplots indicate that its datasets elicit a broad range of performance across all algorithms, though the high median ARIs for all algorithms indicate that in general the datasets were not that difficult. As we discourage overlap, and half of the datasets were optimized to have a high silhouette width, this is somewhat expected and could be addressed by revised parameter choices.

The near-perfect average performance of all algorithms but single-linkage on the SIPU datasets indicates their relative simplicity. Similarly, the poor average performance across the
UCI datasets is consistent with the low connectivity and silhouette width values previously observed in the instance space, and points to weakly defined structure of the ground truth. The HK datasets show a reasonable diversity of performance, and the significantly lower mean ARI indicates that these are much harder datasets.

Fig. 4. Critical difference (CD) diagrams, showing the mean rank (in terms of ARI) for the HAWKS (a) and HK (b) datasets, for each algorithm. Algorithms connected by solid lines are not significantly different according to a two-tailed Nemenyi test. CD diagrams for the other dataset sources can be found in Fig. S-??.

This may be in part due to the much higher eccentricity of these clusters, but also potentially due to a greater variance in the silhouette width (Fig. S-??), indicating that many data points on the edges of clusters are closer to the points in other clusters than to points in their own. The high performance of all algorithms on the UKC datasets indicates that the clusters in this real-world dataset are generally well-defined. This is consistent with the instance space, which provided no evidence of unusual complexity.

To include considerations of significance in our analysis, we follow the approach outlined by Demšar [54] to compare multiple methods across multiple datasets. In brief, a Friedman test is used to rank each competing algorithm for each dataset, where the null hypothesis is that all algorithms have equal ranks. As rejection indicates that at least one algorithm is significantly different, the (two-tailed) Nemenyi test [55] is used as a post-hoc test to ascertain which algorithm differs, by calculating the critical difference (CD): the minimum amount by which two average ranks must differ to be significantly (p < 0.05) different. We illustrate the results using CD diagrams, which show the average rank of each algorithm across all datasets, with solid lines connecting algorithms whose difference in rank is less than the CD.
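The ranking step behind these CD diagrams can be sketched as follows. The ARI matrix here is a random placeholder, and the Nemenyi post-hoc itself is not part of SciPy (third-party packages such as scikit-posthocs provide it):

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical ARI matrix: rows = datasets, columns = algorithms.
rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, size=(20, 4))

# Friedman test: null hypothesis is that all algorithms have equal ranks.
stat, p_value = friedmanchisquare(*(scores[:, j] for j in range(scores.shape[1])))

# Average rank per algorithm (rank 1 = highest ARI), as plotted on a CD diagram.
avg_ranks = np.vstack([rankdata(-row) for row in scores]).mean(axis=0)
```

If the Friedman null hypothesis is rejected, pairs of algorithms whose average ranks differ by more than the critical difference are deemed significantly different.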
Well-ordered rankings of algorithms indicate a lack of variance in performance (as an algorithm is consistently bad or good), whereas if all algorithms had an average rank of 3.5 (and thus clustered in the middle of the CD diagram), this would show that the datasets have an equal spread of difficulty for these clustering algorithms.

We show the CD diagrams for the HAWKS and HK generators in Fig. 4 (as these two generators showed the greatest diversity in the boxplots); the remaining CD diagrams can be found in Section ?? of the supplementary material. As evident from the CD diagram, the ranks of the algorithms are more similar for HAWKS, whereas there is a clearer superiority of a subset of algorithms on the HK datasets. The best-performing algorithm for HAWKS was average-linkage, which highlights the variety of cluster structures that can be generated (as the representation uses purely Gaussian clusters, one might have expected GMM to perform best on average). The eccentricity of the generated clusters is reflected in the higher average rank of GMM over K-Means++. Furthermore, the poor performance of single-linkage (with both settings of K) indicates that there is (on average) insufficient cluster separation to avoid the 'chaining' effect [53], where at least one pair of points in different clusters is closer than a pair of points within a cluster, thereby forming a 'bridge' between clusters.

As shown in Table I, all of the generators struggle to produce datasets on which single-linkage is uniquely better, suggesting a possibly lower utility of this algorithm in general. To investigate this further, and to create a fully comprehensive benchmark, it would be of interest to more directly generate datasets that favour single-linkage. In contrast to existing generators, the Versus mode in HAWKS facilitates this direct generation, and the results from this mode are discussed in the next section.
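The ARI [45] is the external performance measure used throughout this comparison. A minimal, dependency-light sketch of the Hubert–Arabie adjusted Rand index (illustrative only, not the experimental code) computes it from the pair-counting contingency table:

```python
import numpy as np

def comb2(x):
    """Number of unordered pairs, C(x, 2), applied elementwise."""
    return x * (x - 1) / 2.0

def adjusted_rand_index(labels_true, labels_pred):
    """Hubert-Arabie ARI: chance-corrected agreement between two partitions."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    classes, true_idx = np.unique(labels_true, return_inverse=True)
    clusters, pred_idx = np.unique(labels_pred, return_inverse=True)
    # Contingency table: counts of points shared by each (class, cluster) pair
    table = np.zeros((classes.size, clusters.size))
    np.add.at(table, (true_idx, pred_idx), 1)
    sum_comb = comb2(table).sum()
    sum_a = comb2(table.sum(axis=1)).sum()
    sum_b = comb2(table.sum(axis=0)).sum()
    n = labels_true.size
    expected = sum_a * sum_b / comb2(n)   # expected index under chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:             # degenerate partitions
        return 1.0
    return (sum_comb - expected) / (max_index - expected)
```

Note that the ARI is invariant to label permutations and is at most 1 (identical partitions), with values near 0 for chance-level agreement.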
B. Generating datasets that challenge specific algorithms (Versus mode)
In this section, we present the results of running HAWKS in Versus mode, such that we evolve datasets to directly maximize the performance difference between pairs of algorithms. As seen in Section V-A, most current generators have some bias towards a particular algorithm, and no generator (except HAWKS, though not consistently so) was able to produce datasets that were uniquely suited to single-linkage.

First, we need to look at the broad capability of the
Versus mode in producing a performance differential in the various head-to-heads. In Fig. 5a, we can see a grid of plots showing the performance (ARI) of every algorithm against every other. Each line (in the off-diagonal plots) represents the best dataset from a single run, showing the ARI for the 'winning' (left) and 'losing' (right) algorithms. Here, the angle of the lines indicates the magnitude of the performance difference, and the spread shows the consistency of HAWKS across runs. The plots on the diagonal aggregate the performance for that particular algorithm; e.g., the bottom-right plot indicates that HAWKS was able to produce datasets that single-linkage performed both very well and very poorly on, dependent on whether it was designated as the winner or loser of the head-to-head.

There are some clear differences between certain pairings of algorithms. As hinted in Section V-A, HAWKS is consistently able to generate a maximal (i.e. φ(A_w) − φ(A_l) ≈ 1) performance difference when single-linkage is A_l, but the performance difference appears minimal for single-linkage vs. average-linkage, indicating that any weaknesses specific to average-linkage are not weaknesses that single-linkage can exploit. The average ARI for average-linkage, GMM, and K-Means++ when set as the 'loser' (A_l) indicates a difficulty in consistently generating datasets that these algorithms perform very poorly on. This is not unexpected, given the generating model used. As long as clusters are not fully overlapping (which we discourage with our overlap constraint), we expect these algorithms to identify some elements of the clusters, making an ARI close to 0 unlikely.

To evaluate the influence of initialization on these results, in Fig. 5b the stochastic algorithms (GMM and K-Means++) are re-run with a different initialization.
While K-Means++ was largely unchanged, there was an average ARI increase (as A_l) of 0.13 for GMM, highlighting that in some cases the lower performance was due purely to a poor initialization. The average ARI did, however, slightly decrease (as A_w), highlighting the potential disadvantage of these stochastic algorithms over the linkage-based algorithms, as they converge to local minima. This suggests another potential use case for HAWKS. When given an infinite budget of initializations and picking the best, we expect these algorithms to perform better. The robustness of the initialization method can be assessed, however, by investigating how often the algorithm is still able to achieve good performance with a limited budget.

In order to investigate the ability of the Versus mode to grant algorithmic insight, we need to inspect the structures HAWKS discovered for different algorithm combinations. For this, it is important to identify where HAWKS struggles to generate datasets that favour one algorithm over another. We can then try to establish whether this is due to the superiority of one algorithm over another, or the inability of HAWKS to generate structures with properties that would differentiate them. For brevity, the following sections discuss some interesting examples for a subset of the scenarios (shown in Fig. 6). Further examples of the algorithm pairings can be found in the supplementary material (Section ??).
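The Versus-mode objective described above can be sketched as a thin wrapper (a simplified illustration with our own names; the full framework additionally applies its constraints during evolution):

```python
def versus_fitness(data, y_true, algo_winner, algo_loser, score):
    """Versus-mode fitness of a candidate dataset: phi(A_w) - phi(A_l),
    where phi is an external measure such as the ARI.

    algo_winner / algo_loser: callables mapping data -> predicted labels.
    score: callable mapping (true labels, predicted labels) -> float.
    """
    return score(y_true, algo_winner(data)) - score(y_true, algo_loser(data))
```

Maximizing this quantity drives evolution towards datasets that A_w solves and A_l fails; values near 1 correspond to the φ(A_w) − φ(A_l) ≈ 1 cases observed when single-linkage is A_l.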
1) GMM vs. single-linkage:
Owing to the Gaussian representation that HAWKS uses, and the known issues of single-linkage, this scenario is expected to provide a large performance difference. Fig. 6a shows that HAWKS achieves this performance difference by exploiting the aforementioned 'chaining' effect of single-linkage [53]. Discovering this requires iterative movement of the clusters in order for the data points to be close enough to induce this effect, supporting the utility of our mutation operators' nuanced ability to adjust the location of clusters.
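The chaining effect can be reproduced on a hand-constructed one-dimensional example (illustrative only, not a HAWKS-evolved dataset): a dense 'bridge' of points makes the two-cluster single-linkage cut split off a single peripheral point instead of separating the two true clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Two compact 1-D clusters; cluster A has one slightly peripheral member
cluster_a = [-1.2, 0.0, 0.1, 0.2, 0.3]
cluster_b = [3.3, 3.4, 3.5, 3.6]
bridge = [0.8, 1.3, 1.8, 2.3, 2.8]  # dense chain of points connecting A and B

def single_link_cut(points, k):
    """Single-linkage clustering cut into k clusters."""
    X = np.asarray(points, dtype=float).reshape(-1, 1)
    return fcluster(linkage(pdist(X), method="single"), t=k, criterion="maxclust")

# Without the bridge, the largest single-link gap lies between A and B,
# so the two-cluster cut recovers the blobs
labels = single_link_cut(cluster_a + cluster_b, k=2)

# With the bridge, every A-to-B gap (0.5) is smaller than the 1.2 gap to the
# peripheral point -1.2, so the cut isolates that single point instead
labels_bridged = single_link_cut(cluster_a + bridge + cluster_b, k=2)
```

Since single-linkage merges through the smallest inter-cluster distance, the bridge 'chains' the two blobs together, and the final (largest) merge is relegated to an unhelpful within-cluster gap.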
2) Single-linkage vs. GMM:
The reverse scenario should be much harder for HAWKS, as GMM is largely insensitive to eccentricity and naturally fits our cluster representation. As exemplified in Fig. 6b (and observed in the other datasets produced), HAWKS tends to place large clusters far away from several smaller compact clusters in order to increase the chance of a poor initialization from GMM, exploiting the stochasticity of this method.
3) GMM vs. K-Means++: In Fig. 6c we can see that a key exploit, as discovered by HAWKS, is the inability of K-Means++ to handle eccentric clusters. The example highlights significant differences between the use of mixture models and an algorithm relying on assignment to the closest centroid. A clear performance differential is found on this dataset, despite basic similarities in the inductive biases of the two algorithms. (Each fitness evaluation uses a different initialization for GMM and K-Means++; otherwise, HAWKS tacitly exploits this knowledge by moving the clusters into the static initial centroid locations. This maximizes the performance difference, but represents an artefact of the setup rather than a generic property.)

Fig. 5. Performance (ARI) for each algorithm as the winner (row) and loser (column). Each line represents the best individual from a single run, connecting the ARI obtained for A_w and A_l. On the diagonal, the average (and standard deviation) performance is aggregated for that individual as the winner and loser, indicating the overall capability of HAWKS to produce datasets that are simple and difficult for that algorithm. The stochastic algorithms (GMM and K-Means++) are re-run with a different initialization (as indicated by the solid lines, with the original results as dashed lines) in (b) to further measure robustness.
4) K-Means++ vs. GMM: As both algorithms are well-suited to compact clusters, eccentricity is not a characteristic that can be used to differentiate performance in this scenario. Here, HAWKS exploits GMM's previously mentioned weakness (of sub-dividing a single large cluster) during its initialization stage. K-Means++ is less sensitive to this problem due to its improved initialization routine. This shows the utility of HAWKS in identifying the relative strengths and weaknesses of specific algorithms, which could aid algorithmic development (e.g. when empirically comparing initialization schemes).

The datasets in the Versus mode are optimized towards a performance differential between the algorithms, rather than towards a cluster structure specified by the cluster validity index (as done in the
Index mode). It is therefore of interest to investigate how these datasets compare in terms of problem feature diversity. For this, we add the 30 datasets from each of the 12 head-to-heads to the instance space created in Section V-A (using the same principal components). In Fig. 7, we have highlighted the datasets from the
Index and
Versus modes (with the other datasets shown in grey for reference). Clearly, we are able to generate datasets that are notably different in terms of their problem features (and thus properties). In particular, there are many datasets in the region where previously only HK produced datasets, further highlighting the flexibility of our generating mechanism in covering additional regions as the objective function, clustering algorithms, and other parameters are varied.

VI. CONCLUSION
Clustering is a vital tool for pattern discovery, but it is often unclear which clustering algorithm is the most appropriate for a given dataset. An optimal choice requires an accurate understanding of the data properties as well as of the strengths and weaknesses of candidate algorithms. Both types of information are difficult to come by in typical real-world settings.

Synthetic benchmark datasets play an important role in improving our understanding of the former, i.e. in examining the specific strengths and capabilities of a given clustering method. Their specific advantage is the availability of a known generating model, which allows researchers to relate aspects of the true cluster structure to algorithm performance. Unfortunately, available synthetic benchmarks for cluster analysis cover a limited variety of structural aspects, and no existing generators have been designed with wider flexibility in mind.

Our framework, HAWKS, employs the power of an EA to better meet the challenges highlighted above, and in particular to generate more diverse collections of benchmarks. When compared to existing clustering benchmarks, HAWKS is found to generate datasets exhibiting more feature diversity and eliciting more variation in algorithm performance. Future work could improve the ability of the instance space to distinguish algorithmic footprints by further enriching the set of problem features and improving the projection methodology.

Finally, HAWKS can be modified to directly generate datasets that are either simple or difficult for a given algorithm, facilitating a deeper understanding of existing algorithms and
Fig. 6. Examples of datasets for the listed head-to-heads. Each figure shows the ground truth (left column), and the cluster assignment for each algorithm with the associated ARI (middle and right columns). (a) GMM vs. single-linkage: GMM ARI 1.000, single-linkage ARI 0.003. (b) Single-linkage vs. GMM: single-linkage ARI 0.929, GMM ARI 0.630. (c) GMM vs. K-Means++: GMM ARI 0.962, K-Means++ ARI 0.316. (d) K-Means++ vs. GMM: K-Means++ ARI 0.996, GMM ARI 0.593.
Fig. 7. Instance space created in Section V-A, with the datasets produced by the two modes of HAWKS highlighted (the 'Other' points are the remaining datasets).

potentially informing algorithm development. This provides a new avenue for "controlled experimentation" with clustering algorithms, which does not rely on the use of over-simplified toy datasets. Investigations into the use of this
Versus mode in conjunction with more complex clustering algorithms could further test HAWKS' capabilities, and may provide novel insights into the inductive biases and performance of these algorithms.

REFERENCES

[1] J. Handl, J. D. Knowles, and D. B. Kell, "Computational cluster validation in post-genomic data analysis," Bioinformatics, vol. 21, no. 15, pp. 3201–3212, 2005.
[2] U. Maulik, S. Bandyopadhyay, and A. Mukhopadhyay, Multiobjective Genetic Algorithms for Clustering – Applications in Data Mining and Bioinformatics. Springer, 2011.
[3] R. Xu and D. C. Wunsch, "Clustering algorithms in biomedical research: A review," IEEE Reviews in Biomedical Engineering, vol. 3, pp. 120–154, 2010.
[4] M. Ahmed, A. N. Mahmood, and J. Hu, "A survey of network anomaly detection techniques," Journal of Network and Computer Applications, vol. 60, pp. 19–31, 2016.
[5] Z. Ma, J. M. R. Tavares, R. N. Jorge, and T. Mascarenhas, "A review of algorithms for medical image segmentation and their applications to the female pelvic cavity," Computer Methods in Biomechanics and Biomedical Engineering, vol. 13, no. 2, pp. 235–246, 2010.
[6] C. Hennig, "What are the true clusters?" Pattern Recognition Letters, vol. 64, pp. 53–62, 2015.
[7] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.
[8] U. von Luxburg, R. C. Williamson, and I. Guyon, "Clustering: Science or art?" in Unsupervised and Transfer Learning – Workshop held at ICML 2011, ser. JMLR Proceedings, vol. 27. JMLR.org, 2012, pp. 65–80. [Online]. Available: http://proceedings.mlr.press/v27/luxburg12a.html
[9] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Pérez, and I. Perona, "An extensive comparative study of cluster validity indices," Pattern Recognition, vol. 46, no. 1, pp. 243–256, 2013.
[10] J. N. Hooker, "Testing heuristics: We have it all wrong," Journal of Heuristics, vol. 1, no. 1, pp. 33–42, 1995.
[11] J. Handl and J. D. Knowles, "Improvements to the scalability of multiobjective clustering," in Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2005. IEEE, 2005, pp. 2372–2379.
[12] K. Smith-Miles and T. T. Tan, "Measuring algorithm footprints in instance space," in Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2012. IEEE, 2012, pp. 1–8.
[13] N. Macià and E. Bernadó-Mansilla, "Towards UCI+: A mindful repository design," Information Sciences, vol. 261, pp. 237–262, 2014.
[14] O. Mersmann, M. Preuss, and H. Trautmann, "Benchmarking evolutionary algorithms: Towards exploratory landscape analysis," in International Conference on Parallel Problem Solving from Nature 2010. Springer, 2010, pp. 73–82.
[15] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
[16] D. H. Wolpert, "The lack of a priori distinctions between learning algorithms," Neural Computation, vol. 8, no. 7, pp. 1341–1390, 1996.
[17] J. R. Rice, "The algorithm selection problem," 1976, vol. 15, pp. 65–118.
[18] K. Smith-Miles, "Cross-disciplinary perspectives on meta-learning for algorithm selection," ACM Computing Surveys, vol. 41, no. 1, pp. 6:1–6:25, 2008.
[19] K. Smith-Miles, D. Baatar, B. Wreford, and R. Lewis, "Towards objective measures of algorithm performance across instance space," Computers & Operations Research, vol. 45, pp. 12–24, 2014.
[20] K. Smith-Miles and L. Lopes, "Measuring instance difficulty for combinatorial optimization problems," Computers & Operations Research, vol. 39, no. 5, pp. 875–889, 2012.
[21] K. Smith-Miles and S. Bowly, "Generating new test instances by evolving in instance space," Computers & Operations Research, vol. 63, pp. 102–113, 2015.
[22] M. A. Muñoz, L. Villanova, D. Baatar, and K. Smith-Miles, "Instance spaces for machine learning classification," Machine Learning, vol. 107, no. 1, pp. 109–147, 2018.
[23] M. Birattari, Tuning Metaheuristics – A Machine Learning Perspective, ser. Studies in Computational Intelligence. Springer, 2009, vol. 197.
[24] F. Hutter, L. Kotthoff, and J. Vanschoren, Eds., Automated Machine Learning – Methods, Systems, Challenges, ser. The Springer Series on Challenges in Machine Learning. Springer, 2019.
[25] M. López-Ibáñez, J. Dubois-Lacoste, L. P. Cáceres, M. Birattari, and T. Stützle, "The irace package: Iterated racing for automatic algorithm configuration," Operations Research Perspectives, vol. 3, pp. 43–58, 2016.
[26] P. Kerschke and M. Preuss, "Exploratory landscape analysis: Advanced tutorial at GECCO 2017," in Genetic and Evolutionary Computation Conference, 2017, Companion Material Proceedings, P. A. N. Bosman, Ed. ACM, 2017, pp. 762–781.
[27] L. Kotthoff, P. Kerschke, H. H. Hoos, and H. Trautmann, "Improving the state of the art in inexact TSP solving using per-instance algorithm selection," in Learning and Intelligent Optimization – 9th International Conference, LION 2015, Revised Selected Papers, ser. Lecture Notes in Computer Science, vol. 8994. Springer, 2015, pp. 202–217.
[28] O. Mersmann, B. Bischl, H. Trautmann, M. Preuss, C. Weihs, and G. Rudolph, "Exploratory landscape analysis," in Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2011, N. Krasnogor and P. L. Lanzi, Eds. ACM, 2011, pp. 829–836.
[29] G. C. Bowker and S. L. Star, Sorting Things Out: Classification and its Consequences. MIT Press, 2000.
[30] B. S. Everitt, S. Landau, M. Leese, and D. Stahl, Cluster Analysis, 5th ed. Wiley Publishing, 2011.
[31] S. Ben-David, "Clustering — what both theoreticians and practitioners are doing wrong," in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[32] Advances in Neural Information Processing Systems, 2003, pp. 463–470.
[33] D. G. Ferrari and L. N. de Castro, "Clustering algorithm selection by meta-learning systems: A new distance-based problem characterization and ranking combination methods," Information Sciences, vol. 301, pp. 181–194, 2015.
[34] R. G. F. Soares, T. B. Ludermir, and F. de A. T. de Carvalho, "An analysis of meta-learning techniques for ranking clustering algorithms applied to artificial data," in Artificial Neural Networks – ICANN 2009, ser. Lecture Notes in Computer Science, vol. 5768. Springer, 2009, pp. 131–140.
[35] D. Dheeru and E. Karra Taniskidou, "UCI machine learning repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml
[36] P. Fränti and S. Sieranoja, "K-means properties on six clustering benchmark datasets," Applied Intelligence, vol. 48, no. 12, pp. 4743–4759, 2018.
[37] J. Handl and J. Knowles, "Feature subset selection in unsupervised learning via multiobjective optimization," International Journal of Computational Intelligence Research, vol. 2, no. 3, pp. 217–238, 2006.
[38] W. Qiu and H. Joe, "Generation of random clusters with specified degree of separation," Journal of Classification, vol. 23, no. 2, pp. 315–334, 2006.
[39] ——, "Separation index and partial membership for clustering," Computational Statistics & Data Analysis, vol. 50, no. 3, pp. 585–603, 2006.
[40] C. Shand, R. Allmendinger, J. Handl, A. M. Webb, and J. Keane, "Evolving controllably difficult datasets for clustering," in Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2019. ACM, 2019, pp. 463–471.
[41] G. W. Stewart, "The efficient generation of random orthogonal matrices with an application to condition estimators," SIAM Journal on Numerical Analysis.
[42] arXiv preprint arXiv:2002.01822, 2020.
[43] P. J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[44] J. Xie, R. Girshick, and A. Farhadi, "Unsupervised deep embedding for clustering analysis," in Proceedings of the 33rd International Conference on Machine Learning, ser. ICML'16. JMLR.org, 2016, pp. 478–487.
[45] L. Hubert and P. Arabie, "Comparing partitions," Journal of Classification, vol. 2, no. 1, pp. 193–218, 1985.
[46] T. P. Runarsson and X. Yao, "Stochastic ranking for constrained evolutionary optimization," IEEE Transactions on Evolutionary Computation, vol. 4, no. 3, pp. 284–294, 2000.
[47] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Proceedings of the International Conference on Neural Networks (ICNN'95). IEEE, 1995, pp. 1942–1948.
[48] B. Li, K. Tang, J. Li, and X. Yao, "Stochastic ranking algorithm for many-objective optimization based on multiple indicators," IEEE Transactions on Evolutionary Computation, vol. 20, no. 6, pp. 924–938, 2016.
[49] J. Handl and J. D. Knowles, "An evolutionary approach to multiobjective clustering," IEEE Transactions on Evolutionary Computation, vol. 11, no. 1, pp. 56–76, 2007.
[50] C. H. Q. Ding and X. He, "K-nearest-neighbor consistency in data clustering: Incorporating local information into global optimization," in Proceedings of the 2004 ACM Symposium on Applied Computing (SAC), H. Haddad, A. Omicini, R. L. Wainwright, and L. M. Liebrock, Eds. ACM, 2004, pp. 584–589.
[51] M. Garza-Fabre, J. Handl, and J. D. Knowles, "An improved and more scalable evolutionary approach to multiobjective clustering," IEEE Transactions on Evolutionary Computation, vol. 22, no. 4, pp. 515–535, 2017.
[52] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007. SIAM, 2007, pp. 1027–1035. [Online]. Available: http://dl.acm.org/citation.cfm?id=1283383.1283494
[53] L. Hubert, "Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures," Journal of the American Statistical Association, vol. 69, no. 347, pp. 698–704, 1974.
[54] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006. [Online]. Available: http://jmlr.org/papers/v7/demsar06a.html
[55] P. Nemenyi, "Distribution-free multiple comparisons," Ph.D. dissertation, Princeton University, 1963.
HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis
– SUPPLEMENTARY MATERIAL –
Cameron Shand, Richard Allmendinger, Julia Handl, Andrew Webb, and John Keane
S-I. HAWKS DETAILS
A. Generating random cluster sizes to a pre-defined limit
To generate randomly sized clusters that sum to a given value with a minimum size, the following approach is used:

$$\mathbf{w} = [w_1, \ldots, w_K], \quad \mathbf{w} \sim \mathrm{Dir}(\boldsymbol{\alpha}), \quad \boldsymbol{\alpha} = (1, 1, \ldots, 1),$$

where $K$ is the number of clusters. For a given cluster $i$, the size is:

$$|C_i| = C_{\min} + w_i \left( N - K \cdot C_{\min} \right),$$

where $C_{\min}$ is the minimum size a cluster can be. This ensures a uniform distribution of cluster sizes that sum to $N$, which is not guaranteed when simply sampling random numbers and scaling them to sum to $N$ (as this no longer guarantees a uniform distribution).

B. Scaling the covariance matrix
We provide further details about the covariance mutation operator described in Section ??. To mutate the covariance matrix in HAWKS, two separate matrices are generated for the perturbation. The rotation matrix randomly rotates the cluster, whereas the scaling matrix modifies the magnitude of the principal semi-axes, modifying the area of the space that is covered by this distribution.

To avoid converging towards either spherical or highly eccentric clusters upon repeated application of the mutation operator, we ensure that the determinant of the covariance matrix remains the same after applying the scaling matrix. For this, let $x_i$ be the elements of a vector sampled from a Dirichlet distribution. Then

$$\sum_{i=1}^{D} x_i = 1 \;\Rightarrow\; \sum_{i=1}^{D} \left( x_i - \frac{1}{D} \right) = 0 \;\Rightarrow\; \exp\left( \sum_{i=1}^{D} \left( x_i - \frac{1}{D} \right) \right) = 1 \;\Rightarrow\; \prod_{i=1}^{D} \exp\left( x_i - \frac{1}{D} \right) = 1.$$

The determinant of a diagonal matrix is the product of the values on its diagonal; thus the scaling matrix, with diagonal entries $\exp(x_i - \frac{1}{D})$, has determinant 1.

C. Average eccentricity problem feature
In order to reduce the sensitivity to outliers and subspace clusters when calculating the average eccentricity, we use the subset of the eigenvalues (obtained via singular value decomposition) that accounts for 95% of the total sum of the eigenvalues. This is functionally identical to using the principal components (obtained via principal component analysis) that account for 95% of the variance.

S-II. MUTATING CLUSTERS IN HIGHER DIMENSIONS
Shand et al. [1] noted that the original mutation operator for the cluster means (µ) becomes less useful as the dimensionality increases due to the stochastic movement of the mean, as there are an increasing number of directions in which to move away from other clusters. This results in a bias towards increasing the silhouette width as the clusters drift apart. In this section, we propose and compare several operators that explicitly incorporate directionality into the random movement to avoid this bias.

Incorporating explicit directionality into the operator can directly guide whether clusters move away from or towards existing clusters, increasing and decreasing the silhouette width respectively. For this, the operator needs to incorporate the position of at least one other cluster in the random perturbation.

The previous [1] mutation operator was defined as:

$$\mu'^{(i)} \sim \mathcal{N}\left( \mu^{(i)}, s \right), \tag{S-1}$$

where $\mu'^{(i)}$ is the new mean, $\mu^{(i)}$ is the current mean, and $s$ is the width (variance) of the multivariate normal distribution. In other words, the new mean is sampled from a normal distribution centred on the current mean. Next, we define each of the proposed operators.

A. Mutation Operator Descriptions

1) "Rails" operator:
As a simple baseline to examine the utility of including directionality, this operator randomly selects another cluster, and the current cluster moves either towards or away from that cluster with a random weighting ($0 \leq w \leq 1$). To select a random cluster that is different from the current one, let $n$ be a random integer from the set $\{1, \ldots, K\} \setminus \{i\}$. The mean of the current cluster, $\mu^{(i)}$, can then be mutated as follows:

$$\mu'^{(i)} = \begin{cases} \mu^{(i)} + \left[ w \left( \mu^{(n)} - \mu^{(i)} \right) \right] & \text{if } p \leq 0.5 \\ \mu^{(i)} - \left[ w \left( \mu^{(n)} - \mu^{(i)} \right) \right] & \text{if } p > 0.5, \end{cases} \tag{S-2}$$

where $\mu^{(n)}$ is the mean of the random cluster $n$, $w \sim U(0, 1)$ is a random weight, $\mu^{(n)} - \mu^{(i)}$ is the difference between the cluster means, and $p \sim U(0, 1)$ is a random uniform probability that determines whether we mutate towards or away from the other cluster.
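A brief numpy sketch of this operator (our own illustration of Eq. S-2, not the HAWKS source). The mutated mean always lies on the line through the current mean and the randomly chosen other mean, which is the "rails" the name refers to:

```python
import numpy as np

def rails_mutation(means, i, rng=None):
    """'Rails' mutation (Eq. S-2): move cluster i towards or away from a
    randomly chosen other cluster by a random fraction of their separation."""
    rng = np.random.default_rng(rng)
    # Pick a random cluster n != i
    n = rng.choice([k for k in range(len(means)) if k != i])
    w = rng.uniform()                       # random weight, 0 <= w < 1
    step = w * (means[n] - means[i])
    sign = 1.0 if rng.uniform() <= 0.5 else -1.0  # coin-flip for direction
    return means[i] + sign * step
```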
2) PSO-inspired mutation with random directionality:
For a mutation that covers more of the space, rather than only the vectors between centroids, we take inspiration from another EA paradigm: particle swarm optimization (PSO). In the original PSO update, each particle's position is adjusted by incorporating its current position, the best position that particle has ever had, and the best position ever found by any particle. The latter two positions are weighted by random coefficients in order to create a random movement of the particles; for further details, see [2], [3]. We use this as inspiration by viewing the clusters as particles, and mutate the location of a cluster based on the position of another cluster and a global representative. As our fitness is derived from a combination of all particles, the original notions of personal and global best are not applicable here, but the concept of independently weighting both a single point and an aggregated one is.

By updating the mean using a randomly weighted combination of an existing cluster and the global mean (µ̄) of all clusters, we can create a random movement of the cluster that still takes the positions of the existing clusters into account. By incorporating the global mean, we can avoid generating well-separated groups of clusters that can deceive the silhouette width. To ensure that the location of the global mean is calculated in a way meaningful to the fitness, we calculate the mean across all data points (not across cluster means), as it is the former that is used in the calculation of the silhouette width (and thus the fitness). As a result, relative cluster sizes are incorporated into the mutation. The mean of the current cluster, $\mu^{(i)}$, is thus mutated using the following operator (referred to as PSO-random):

$$\mu'^{(i)} = \begin{cases} \mu^{(i)} + \left[ w_1 \left( \mu^{(n)} - \mu^{(i)} \right) + w_2 \left( \bar{\mu} - \mu^{(i)} \right) \right] & \text{if } p \leq 0.5 \\ \mu^{(i)} - \left[ w_1 \left( \mu^{(n)} - \mu^{(i)} \right) + w_2 \left( \bar{\mu} - \mu^{(i)} \right) \right] & \text{if } p > 0.5. \end{cases} \tag{S-3}$$
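A sketch of this operator (our illustration of Eq. S-3; names are ours). The global mean is passed in precomputed over all data points, as the text specifies:

```python
import numpy as np

def pso_random_mutation(means, i, global_mean, rng=None):
    """PSO-random mutation (Eq. S-3): perturb cluster i's mean towards or
    away from a random other cluster and the global mean of all data points,
    with independent random weights and a coin-flip for the direction."""
    rng = np.random.default_rng(rng)
    n = rng.choice([k for k in range(len(means)) if k != i])
    w1, w2 = rng.uniform(), rng.uniform()   # independent random weights
    step = w1 * (means[n] - means[i]) + w2 * (global_mean - means[i])
    sign = 1.0 if rng.uniform() <= 0.5 else -1.0
    return means[i] + sign * step
```

The PSO-informed variant below differs only in how the sign is chosen, replacing the coin-flip with the sign of s_all − s_t.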
3) PSO-inspired mutation with informed directionality:
When a coin-flip determines whether the cluster should move towards or away from existing clusters, obvious failure cases arise: the existing clusters may be very far apart when a closer structure is desired, and an unfavourable coin-flip moves the cluster further away, wasting a perturbation and thus an evaluation. By instead using the sign of the difference between the current silhouette width of the individual and the target ($s_{\text{all}} - s_t$), we can move the cluster centre in the direction that has the best chance of improving the fitness. The mean of the current cluster, $\mu^{(i)}$, can be mutated as follows:

$$\mu'^{(i)} = \begin{cases} \mu^{(i)} + \left[ w_1 \left( \mu^{(n)} - \mu^{(i)} \right) + w_2 \left( \bar{\mu} - \mu^{(i)} \right) \right] & \text{if } s_{\text{all}} > s_t \\ \mu^{(i)} - \left[ w_1 \left( \mu^{(n)} - \mu^{(i)} \right) + w_2 \left( \bar{\mu} - \mu^{(i)} \right) \right] & \text{if } s_{\text{all}} \leq s_t. \end{cases} \tag{S-4}$$

The addition of this information improves convergence in situations where $s_{\text{all}} - s_t$ is large, as the fitness is improved more quickly, but it adds a bias towards the fitness (as opposed to the constraints), which is typically controlled through the stochastic ranking ($P_f$). This operator is referred to as PSO-informed.
4) DE-inspired mutation:
Taking inspiration from yet another EA paradigm, differential evolution (DE), we view the existing clusters as individual vectors that can help in the creation of a "donor vector" (the new mean) from a "target vector" (the current mean). The classical DE mutation operator combines three existing individuals to create a new individual [4]–[6]. We adapt this to generate a new cluster mean from existing ones as follows:

$$\mu'^{(i)} = \mu^{(i)} + F \left( \mu^{(r_1)} - \mu^{(r_2)} \right), \tag{S-5}$$

where $i$, $r_1$, and $r_2$ are distinct cluster indices, and $F$ is a constant factor in the range $[0, 2]$. Unlike the previous operators, the randomness occurs only in terms of which other means are selected, such that the movement vector for the current mean is a fixed multiple ($F$) of the vector between the randomly selected means. As a result, the number of possible locations that a cluster can mutate to is finite and related to the number of clusters. This lack of flexibility may cost either convergence speed when larger jumps are needed, or nuance when small movements of the cluster are needed in the final generations (which is especially useful for adjusting the overlap between clusters).

B. Experimental Setup
To investigate the convergence properties of these operators, and whether they show a bias towards separating or bringing together clusters, we set up the following experiments. We set a low (0.2) and a high (0.9) s_t to optimize towards poorly- and well-separated clusters respectively, as well as initializing the clusters either together or far apart. These four scenarios test the general ability of the operators to move cluster centroids, and are tested using D ∈ { , } dimensions to check robustness at higher dimensionality. For each scenario, HAWKS is run 30 times to assess robustness over different initializations, and for 100 generations to assess the stability of the operators. For the DE-inspired operator, we use F = 1 to strike a balance between convergence speed and potential oscillation when clusters are close together.

C. Results
Fig. S-1 shows convergence plots for the four differentscenarios in D , showing the average silhouette width (andnot the fitness, to make it clear when s all is above or belowthe target) across the generations. The target, s t , is shown bythe dashed horizontal line.In Fig. S-1a, all mutation operators are quickly able to movethe clusters apart to increase the silhouette width, but we see adifference in the stability after this as some operators (predom-inantly DE and PSO-random) further increase the silhouettewidth above s t , highlighting a mechanistic bias. Fig. S-1bhas the same low s t , but the initial clusters are further apart(requiring the optimization to bring clusters together). The DE-inspired operator is unable to decrease the silhouette width,and the optimization is driven by minimizing the overlap EEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION 3
constraint (Fig. S-2b). While the PSO-random operator does not converge, the quick decrease in the silhouette width for the PSO-informed operator shows that this is not a mechanistic issue, but one of directionality. Mutually exclusive satisfaction of the fitness (s_t) and the overlap constraint prevents any operator from fully reaching s_t = 0.2. The slow silhouette width decrease with the original operator highlights the issue with a static step size. In Fig. S-1c–d, where the target silhouette width is high (s_t = 0.9), there is no significant difference between the operators beyond the speed of convergence, independent of the initialization used. None of the operators drift once converged, as this target also satisfies the overlap constraint. In Fig. S-2a, we can see the minimization of the overlap constraint, which is naturally greater for the operators that increased the silhouette width. This indicates a difference between the operators in the relative ease of satisfying s_t and minimizing the overlap, particularly with the lack of pressure (afforded by P_f = 0.5) towards either one.

Fig. S-1. Convergence plots showing the average (across all 30 runs) silhouette width across the generations when generating 2-dimensional datasets for our four scenarios: (a) clusters initialized overlapping, s_t = 0.2; (b) clusters initialized apart, s_t = 0.2; (c) clusters initialized overlapping, s_t = 0.9; (d) clusters initialized apart, s_t = 0.9. The dashed line shows the target silhouette width (s_t).
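The pressure between fitness and constraint governed by P_f comes from stochastic ranking, which HAWKS uses for constraint handling. The sketch below is a generic Runarsson–Yao-style implementation with illustrative names of our own choosing, not HAWKS' actual code; both the fitness (e.g. |s_all − s_t|) and the constraint violation (the overlap) are assumed to be minimized:

```python
import random

def stochastic_rank(pop, fitness, violation, p_f=0.5, rng=random):
    """Stochastic ranking: bubble-sort-like sweeps that compare adjacent
    individuals by fitness when both are feasible (zero violation) or with
    probability p_f, and by constraint violation otherwise."""
    idx = list(range(len(pop)))
    for _ in range(len(pop)):  # at most N sweeps
        swapped = False
        for i in range(len(pop) - 1):
            a, b = idx[i], idx[i + 1]
            both_feasible = violation[a] == 0 and violation[b] == 0
            if both_feasible or rng.random() < p_f:
                if fitness[a] > fitness[b]:  # minimizing fitness
                    idx[i], idx[i + 1] = b, a
                    swapped = True
            elif violation[a] > violation[b]:  # minimizing violation
                idx[i], idx[i + 1] = b, a
                swapped = True
        if not swapped:
            break
    return [pop[i] for i in idx]
```

With p_f = 0.5, an infeasible pair is equally likely to be compared by fitness or by violation, giving the "lack of pressure towards either one" described above.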
With such a low s_t, there is a strong inverse correlation between the overlap and the fitness. The PSO-informed operator has an explicit drive to avoid a drift of the silhouette width away from the target, but the vast difference in drift for the DE-inspired operator illustrates a clear behavioural difference. In Fig. S-2b, when the clusters are initialized apart (and thus the task is to bring them closer together), we can see that the DE-inspired operator does consistently minimize the overlap (as it erroneously increases the silhouette width as the clusters drift apart). The other operators also decrease the overlap, apart from PSO-informed, where the embedded preference towards the fitness disturbs the minimization of the overlap.

This preference is further illustrated in Fig. S-3, where the silhouette width and overlap of the best individuals from each run are plotted for the low target silhouette width (s_t = 0.2). The PSO-informed operator creates far less diversity in terms of the overlap, thus generating less diverse datasets. To further investigate these operators, we now look at the same scenarios in 50 dimensions.

Fig. S-4 shows convergence plots for the four different scenarios in 50 dimensions, which is where issues with the original operator were first identified. In Fig. S-4a, when the scenario is to keep clusters close together, the drift of the DE-inspired operator is more pronounced relative to the other operators, which remain mostly stable. The bias of the DE-inspired operator towards moving clusters further apart is further shown in Fig. S-4b, where the original operator
Fig. S-2. Convergence plot showing the average (across all 30 runs) of the overlap constraint across the generations when generating 2-dimensional datasets: (a) clusters initialized overlapping, s_t = 0.2; (b) clusters initialized apart, s_t = 0.2.
is also incapable of bringing clusters closer together. As the DE-inspired operator uses the direction and magnitude of the vector between two random clusters, it does not necessarily move the cluster being mutated in the direction of other clusters, an issue which is more prevalent in higher dimensions. As expected, the PSO-informed operator converges the fastest, whereas the "rails" and PSO-random operators are increasingly slow to decrease the silhouette width (largely due to trying to minimize the overlap).

The limited ability of our original operator to move clusters away from each other in higher dimensions is highlighted in Fig. S-4c, which clearly contrasts with the proposed operators, all of which converge significantly faster. Once again, utilizing the individual's current silhouette width allows the PSO-informed operator to consistently move the clusters apart, converging rapidly. The speed of convergence for the DE-inspired operator is likely due to its step size being based on the distance between two random clusters, thus enabling it to rapidly move clusters apart. In Fig. S-4d, when the clusters are initialized further apart, similar behaviour to the low s_t scenario is seen.

Fig. S-3. Scatter plot of the silhouette width and overlap constraint for each of the best individuals from every run for the low silhouette width target, shown for the PSO-random and PSO-informed operators and both initializations (overlapping and apart).
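The DE-inspired update (S-5), whose fixed step size drives the behaviour discussed above, can be sketched as follows. This is a minimal illustration with names of our own choosing, not HAWKS' implementation:

```python
import numpy as np

def de_mutate_mean(means, i, F=1.0, rng=None):
    """Sketch of the DE-inspired mutation (S-5): move cluster i's mean by
    F times the vector between two other, randomly chosen cluster means.

    Note the finite set of outcomes: with K clusters there are only
    (K-1)(K-2) possible destinations for a given mean, and the step size
    is entirely determined by the distance between the chosen pair."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick two distinct indices r1, r2, both different from i
    candidates = [k for k in range(len(means)) if k != i]
    r1, r2 = rng.choice(candidates, size=2, replace=False)
    return means[i] + F * (means[r1] - means[r2])
```

For example, with three cluster means there are only two possible destinations for a mutated mean (the two orderings of the remaining pair), which illustrates the lack of flexibility noted in Section S-II.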
The initial silhouette width is very close to the target, yet our original and the DE-inspired operators are still unable to significantly improve the fitness, as they begin above the target. It is likely that in this scenario the step size is too large for the DE-inspired operator to be useful, and too small and undirected for our original operator.

As such, in our experiments we elect to use the PSO-random operator, as it provides a more nuanced mechanism for moving cluster means, but avoids introducing additional bias which would otherwise need to be controlled by stochastic ranking. For scenarios where the fitness is prioritized, PSO-informed may be the more useful operator, but for general use the PSO-random operator is preferred.

S-III. ANALYZING BENCHMARK DATASET DIVERSITY (INDEX MODE)

In Section ??, we compare several collections of datasets to a set of datasets produced by HAWKS. Here, we provide some further information on the different properties of these datasets, and further supporting results.

A. Dataset collection details
Table S-I shows some basic parameters of the different collections of datasets. Where possible, values are condensed into a range (e.g. "6–11"), but are otherwise listed. Owing to the different nature of these datasets (some real-world, some from a generator), some values are specific to a single dataset, while others represent multiple instantiations from a generator.
B. Problem feature values across the instance space
Fig. ?? showed the instance space created for these datasets using our set of problem features. Fig. S-5 shows this instance space, but with the best ARI achieved by any algorithm and the values for each of the problem features highlighted. As we can see, for each problem feature there is a clear gradient across the space, highlighting the utility of using PCA to construct the instance space in terms of understanding how the problem features vary across it. Interestingly, there is also a gradient for the best ARI achieved, for which lower values are generally associated with a lower silhouette width, a higher standard deviation of the silhouette width, or a higher average cluster eccentricity. Overall, this highlights that using problem features specific to clustering helps to create
Fig. S-4. Convergence plots showing the average (across all 30 runs) silhouette width across the generations when generating 50-dimensional datasets for our four scenarios: (a) clusters initialized overlapping, s_t = 0.2; (b) clusters initialized apart, s_t = 0.2; (c) clusters initialized overlapping, s_t = 0.9; (d) clusters initialized apart, s_t = 0.9.

TABLE S-I
THE NUMBER OF DATA POINTS (N), CLUSTERS (K), DIMENSIONS (D), AND DATASETS FOR EACH COLLECTION OF DATASETS.

| Source    | N | K                          | D                                                        | Datasets |
| HK [7]    | ± | 10, 20, 40, 60, 80, 100, 120 | 20, 50, 100, 150, 200                                  | 350      |
| QJ [8]    | ± | 3, 6, 9                    | 5–24                                                     | 243      |
| SIPU [9]  | ± | 2, 15, 20, 35, 50          | …                                                        | …        |
| UCI [10]  | ± | 2, 3, 4, 6, 7, 8, 10, 11, 15 | 3, 4, 6–11, 13, 18, 19, 22, 30, 34, 44, 60, 90, 166    | 20       |
| UKC [11]  | ± | 10, 11, 12                 | 2                                                        | 8        |

The average (and standard deviation) are given, except for HAWKS, for which the values are targeted.

an instance space that correlates with algorithmic performance. As shown in Fig. ??, however, through either the projection method or an incomplete set of problem features (as discussed in Section ??, it is difficult to get a complete set), there is not a complete separation of regions where a single algorithm is superior.

Table S-II shows the mean and standard deviation of each of the problem features for every collection of datasets studied in this paper, clarifying the observations made for the instance space visualizations (Fig. S-5). The UCI datasets show a much higher connectivity, indicating that the nearest neighbours of data points are typically in a different cluster, and thus that these datasets have poor cluster structure.

We can compare the mean values of the problem features for the Index and Versus modes to see if the different optimization settings had a particular bias towards, e.g., a low connectivity. The main differences between these datasets, according to the problem features, are that the Versus datasets have a slightly higher average connectivity, much lower eccentricity, and higher variance in the silhouette width. The lower eccentricity can be explained through both the initialization used and the low dimensionality. As shown in our examples where highly eccentric clusters were seen, this did not diminish HAWKS' capability of evolving such clusters when needed.
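The construction of such an instance space can be sketched as follows: compute a feature vector per dataset, then project the standardized vectors onto the first two principal components. This is a minimal illustration using a subset of plausible problem features on toy datasets; the exact feature set and dataset collections used in the paper differ:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler

def problem_features(X, labels):
    """A small, illustrative subset of problem features: average silhouette
    width, its standard deviation, number of clusters, and dimensionality."""
    s = silhouette_samples(X, labels)
    return [s.mean(), s.std(), len(np.unique(labels)), X.shape[1]]

# Build a toy collection of datasets, then project their feature vectors
# onto the first two principal components to form an "instance space".
feats = []
for k in (2, 5, 10):
    X, y = make_blobs(n_samples=200, centers=k, n_features=4, random_state=k)
    feats.append(problem_features(X, y))
coords = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(feats))
```

Each dataset then corresponds to one 2-D point (`coords`), which can be coloured by a feature value or by the best ARI to produce plots such as Fig. S-5.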
C. Comparing performance diversity via critical difference diagrams
In the main paper, we showed the critical difference (CD) diagrams for datasets from the HAWKS and HK generators.
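The mean ranks underlying such CD diagrams can be computed as follows. This is a generic sketch: the `ari` matrix is hypothetical and stands in for the real per-dataset ARI scores, and SciPy's Friedman test serves as the omnibus test (the Nemenyi post-hoc test itself is not in SciPy, but is available in packages such as scikit-posthocs):

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical ARI scores: rows = datasets, columns = algorithms
ari = np.array([
    [0.90, 0.85, 0.40],
    [0.80, 0.95, 0.30],
    [0.70, 0.75, 0.20],
    [0.95, 0.60, 0.50],
])

# Rank algorithms within each dataset (rank 1 = best ARI, hence the negation)
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, ari)
mean_ranks = ranks.mean(axis=0)  # the values plotted on a CD diagram

# Friedman test: do the algorithms' rank distributions differ significantly?
stat, p_value = friedmanchisquare(*ari.T)
```

The `mean_ranks` vector gives each algorithm's position on the CD axis; algorithms whose mean ranks differ by less than the critical distance are joined by a solid line in Fig. S-6.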
Fig. S-5. The instance space created in Section ??, showing how the values of each of the 7 problem features and the best ARI found by any algorithm vary across the space: (a) ARI; (b) connectivity; (c) number of dimensions; (d) average eccentricity; (e) entropy of cluster sizes; (f) number of clusters; (g) silhouette width (average); (h) silhouette width (standard deviation).
TABLE S-II
THE MEAN AND STANDARD DEVIATION (SD) OF THE PROBLEM FEATURE VALUES FOR EVERY COLLECTION OF DATASETS, INCLUDING BOTH THE INDEX AND VERSUS MODES OF HAWKS. EACH CELL GIVES MEAN / SD.

| Collection       | Connectivity  | Number of Dimensions | Average Eccentricity | Cluster Size Entropy | Number of Clusters | Silh. width (average) | Silh. width (SD) |
| HAWKS (Index)†   | 0.114 / 0.198 | 26.000 / 24.027      | 5.692 / 2.568        | 0.884 / 0.126        | 17.500 / 12.514    | 0.675 / 0.225         | 0.226 / 0.…      |
| HAWKS (Versus)‡  | 0.144 / 0.175 | 2.000 / 0.000        | 2.551 / 1.309        | 0.825 / 0.074        | 5.000 / 0.000      | 0.712 / 0.188         | 0.300 / 0.…      |
| HK [7]           | 0.008 / 0.008 | 104.000 / 65.393     | 145.202 / 79.847     | 0.984 / 0.021        | 61.429 / 38.012    | 0.474 / 0.043         | 0.437 / 0.…      |
| QJ [8]           | 0.207 / 0.247 | 13.000 / 6.926       | 3.615 / 0.867        | 0.966 / 0.023        | 6.000 / 2.455      | 0.380 / 0.154         | 0.145 / 0.…      |
| SIPU [9]         | 0.091 / 0.209 | 191.346 / 308.120    | 3.411 / 3.008        | 1.000 / 0.000        | 3.411 / 6.280      | 0.634 / 0.225         | 0.091 / 0.…      |
| UCI [10]         | 0.789 / 0.409 | 28.350 / 39.125      | 20.716 / 20.706      | 0.897 / 0.116        | 4.900 / 3.712      | 0.084 / 0.279         | 0.375 / 0.…      |
| UKC [11]         | 0.001 / 0.001 | 2.000 / 0.000        | 1.938 / 0.177        | 0.997 / 0.002        | 10.875 / 0.641     | 0.780 / 0.062         | 0.232 / 0.…      |

† The 448 datasets generated in Section ??. ‡ The 360 datasets generated in Section ??.

For completeness, here in Fig. S-6 we show the CD diagrams for the other dataset collections. For the QJ datasets (Fig. S-6a), the compactness-based algorithms (GMM and K-Means++) are clearly the best-performing algorithms, with GMM nearly superior across all datasets, highlighting a lack of diversity. Similarly, for the SIPU datasets (Fig. S-6b), GMM and K-Means++ are more equally the best performing; the lower connectivity (i.e. lack of overlap between clusters) compared to the QJ datasets does not distinguish performance further between these two algorithms. The UCI datasets (Fig. S-6c) show little significant difference between the clustering algorithms which, when combined with the low performance shown in Fig. ??, indicates that there is a high variance of performance, but that this diversity is due to low cluster structure (rather than varied cluster structures). Finally, the UKC datasets (Fig. S-6d) are too few in number to identify significant differences, though the higher rank of the compactness-based algorithms fits with the visualization of the clusters [11].

S-IV. GENERATING DATASETS THAT CHALLENGE SPECIFIC ALGORITHMS (VERSUS MODE)

Section ?? presented results for the reformulation of the fitness function to evolve datasets that directly maximize the performance difference between two algorithms. Here, we provide further visual examples of the different head-to-head scenarios to illustrate the properties which HAWKS found to differentiate the algorithms, and further details for the head-to-head between average- and single-linkage.

A. Additional scenarios

1) Single-linkage vs. Average-linkage:
Fig. ?? showed that there was a high spread in the ARI for both algorithms, and high variance in the maximum difference found between runs, highlighting the difficulty of this scenario. For some datasets, performance is correlated between the two algorithms (both high or both low), indicating that the algorithms do share some similarities. Fig. S-7a shows an example whose structure is common among the datasets that did find a performance difference. The clusters are in general well-separated; by placing some clusters much further away and co-locating a few smaller clusters, the averaging criterion (used by average-linkage to determine where to split next) assigns the clusters that are closer together the same cluster label. The mixed cluster eccentricities that HAWKS is able to generate help facilitate the discovery of this exploit. As the clusters need to be sufficiently far away before a change in the fitness can be found, there may be insufficient pressure for this exploit to be reliably found, explaining the high variation in performance for this scenario.
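The difference between the two linkage criteria that these head-to-heads exploit can be reproduced in a small, hand-constructed sketch: two elongated clusters joined by a small single-link gap, plus a looser, well-separated cluster. Single-linkage chains the two close clusters into one, whereas average-linkage recovers the ground truth. The coordinates below are illustrative choices of ours, not datasets evolved by HAWKS:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Clusters A and B are elongated chains separated by a gap of 0.6, smaller
# than cluster C's internal merge heights; the *average* A-B distance
# (about 1.1) is larger, so only single linkage merges A and B.
xs = np.concatenate([
    np.linspace(0.0, 0.5, 6),       # cluster A: 6 points, step 0.1
    np.linspace(1.1, 1.6, 6),       # cluster B: 6 points, gap of 0.6 to A
    np.array([10.0, 10.7, 11.3]),   # cluster C: 3 looser, far-away points
])
X = np.column_stack([xs, np.zeros_like(xs)])
truth = np.array([0] * 6 + [1] * 6 + [2] * 3)

single = AgglomerativeClustering(n_clusters=3, linkage="single").fit_predict(X)
average = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

print(adjusted_rand_score(truth, single))   # low: A and B chained together
print(adjusted_rand_score(truth, average))  # recovers the ground truth
```

Asked for three clusters, single-linkage spends its cuts inside the loose cluster C (whose gaps exceed the A–B gap) and lumps the majority of points into one cluster, mirroring the exploit shown in Fig. S-7b.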
2) Average-linkage vs. Single-linkage:
As average-linkage uses the average of the distances between groups, rather than the smallest distance, it is not susceptible to the chaining effect [12] that single-linkage is. As a result, Fig. S-7b shows that the exploit used in this head-to-head is simply to place clusters close together, such that single-linkage determines that the majority of the points are in a single cluster.

REFERENCES

[1] C. Shand, R. Allmendinger, J. Handl, A. M. Webb, and J. Keane, "Evolving controllably difficult datasets for clustering," in Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2019. ACM, 2019, pp. 463–471.
[2] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Proceedings of the International Conference on Neural Networks (ICNN'95). IEEE, 1995, pp. 1942–1948.
[3] J. Kennedy, "Particle swarm optimization," pp. 760–766, 2010.
[4] R. Storn and K. Price, "Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces," Journal of Global Optimization, vol. 11, no. 4, pp. 341–359, 1997.
[5] K. Fleetwood, "An introduction to differential evolution," in Proceedings of Mathematics and Statistics of Complex Systems (MASCOS) One Day Symposium, 26th November, Brisbane, Australia, 2004, pp. 785–791.
[6] M. L. Ortiz and N. Xiong, "Investigation of mutation strategies in differential evolution for solving global optimization problems," in Artificial Intelligence and Soft Computing – 13th International Conference, ICAISC 2014, ser. Lecture Notes in Computer Science, L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. A. Zadeh, and J. M. Zurada, Eds., vol. 8467. Springer, 2014, pp. 372–383.
[7] J. Handl and J. D. Knowles, "Improvements to the scalability of multiobjective clustering," in Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2005, 2–4 September 2005, Edinburgh, UK. IEEE, 2005, pp. 2372–2379.
[8] W. Qiu and H. Joe, "Generation of random clusters with specified degree of separation," Journal of Classification, vol. 23, no. 2, pp. 315–334, 2006.
Fig. S-6. Critical difference (CD) diagrams, showing the mean rank (in terms of ARI) of each algorithm for the remaining dataset collections: (a) QJ; (b) SIPU; (c) UCI; (d) UKC. Algorithms connected by solid lines are not significantly different according to a two-tailed Nemenyi test.
Fig. S-7. Examples of datasets for the listed head-to-heads. Each figure shows the ground truth and the cluster assignment for each algorithm, with the associated ARI: (a) single-linkage vs. average-linkage (single-linkage ARI: 1.000; average-linkage ARI: 0.692); (b) average-linkage vs. single-linkage (average-linkage ARI: 0.994; single-linkage ARI: 0.001); (c) GMM vs. K-Means++ (GMM ARI: 0.989; K-Means++ ARI: 0.407).
[9] P. Fränti and S. Sieranoja, "K-means properties on six clustering benchmark datasets," Applied Intelligence, vol. 48, no. 12, pp. 4743–4759, 2018.
[10] D. Dheeru and E. Karra Taniskidou, "UCI machine learning repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml
[11] M. Garza-Fabre, J. Handl, and J. D. Knowles, "An improved and more scalable evolutionary approach to multiobjective clustering," IEEE Transactions on Evolutionary Computation, vol. 22, no. 4, pp. 515–535, 2017.
[12] L. Hubert, "Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures,"