HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis
Cameron Shand, Richard Allmendinger, Julia Handl, Andrew Webb, John Keane
IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION
Abstract—Comprehensive benchmarking of clustering algorithms is rendered difficult by two key factors: (i) the elusiveness of a unique mathematical definition of this unsupervised learning approach and (ii) dependencies between the generating models or clustering criteria adopted by some clustering algorithms and indices for internal cluster validation. Consequently, there is no consensus regarding the best practice for rigorous benchmarking, and whether this is possible at all outside the context of a given application. Here, we argue that synthetic datasets must continue to play an important role in the evaluation of clustering algorithms, but that this necessitates constructing benchmarks that appropriately cover the diverse set of properties that impact clustering algorithm performance. Through our framework, HAWKS, we demonstrate the important role evolutionary algorithms play to support flexible generation of such benchmarks, allowing simple modification and extension. We illustrate two possible uses of our framework: (i) the evolution of benchmark data consistent with a set of hand-derived properties and (ii) the generation of datasets that tease out performance differences between a given pair of algorithms. Our work has implications for the design of clustering benchmarks that sufficiently challenge a broad range of algorithms, and for furthering insight into the strengths and weaknesses of specific approaches.
Index Terms—Clustering, evolutionary computation, synthetic data, benchmarking, data generator.
C. Shand is with University College London, London WC1E 6BT, U.K. (e-mail: [email protected]). R. Allmendinger, J. Handl and J. Keane are with the University of Manchester, Manchester M15 6PB, U.K. (e-mail: [email protected], [email protected], [email protected]). A. Webb is with vTime, Liverpool L8 5RN, U.K. (e-mail: [email protected]).

I. INTRODUCTION

Cluster analysis is an unsupervised learning approach with the high-level aim of identifying groups (clusters) of objects that are more similar to each other than to the objects in other groups. It is a fundamental approach for knowledge discovery, with a broad range of applications from bioinformatics [1]–[3], to cybersecurity [4], to medicine [5]. Due to the unsupervised nature of clustering, the process of evaluating the quality of a partition (i.e. a given set of clusters) is not a straightforward task. Attempts to formally capture the qualities intuitively associated with pronounced cluster structure (such as compactness of individual clusters and separation between clusters) have led to the mathematical definition of a range of internal validation indices, which can be both complementary and conflicting [6]–[8]. Arbelaitz et al. [9] studied 30 such internal indices and concluded that the utility of such measures varied depending on the datasets considered, highlighting the limited scope of each individual index.
External cluster validation indices are thought to address this limitation and to provide a more objective assessment of clustering performance [1]. However, they require knowledge of the ground truth for a given dataset, i.e. information about the correct cluster membership of each data point — information that is difficult to come by in realistic unsupervised learning applications. For this reason, synthetic benchmark datasets (i.e. datasets with a known generating model) play an important role in the evaluation of clustering performance. A key advantage of such data is that both the ground truth and any assumptions implicit to the generating process are accessible. This allows for both an objective assessment of clustering performance and informed reasoning about the key drivers behind the observed performance.

In principle, direct control over the generating model then allows for the provision of datasets with specific and varied properties. This facilitates the testing of performance with regards to these known characteristics; the benefit is concrete: practically translatable insight regarding the strengths and weaknesses of particular algorithms [10]. However, existing generators for synthetic clustering benchmarks have not been designed with this level of flexibility in mind — instead, they typically use a set of fixed (manually tuned) parameter bounds within their generating model [11], limiting the complexity and diversity of datasets that can be obtained and failing to fully match the range of challenges posed by real-world datasets.

Our framework, HAWKS, is designed to address this limitation through the integration of an evolutionary algorithm (EA) into the generating process. EAs lend themselves as a mechanism to directly control and adapt key aspects of a partition's generating model — here, we demonstrate that this facilitates the design of more powerful benchmarks, exhibiting a diverse range of properties.
Furthermore, the innate modularity of EAs provides flexibility in choosing all key model components, including the representation of individual clusters and the set of objectives and constraints constituting the partition-level generating model, such as constraints on inter-cluster relationships. This flexibility is key to a broader utility of the framework, and our experiments illustrate HAWKS' potential in evolving benchmark sets either to meet a predefined set of criteria, or to directly maximize performance differences between pairs of algorithms.

In summary, the main contributions of this paper are as follows:

1) We propose an evolutionary framework for the generation of clustering benchmarks, HAWKS, that allows for a flexible parameterization of its generating model.
2) We describe a set of measurable properties (problem features) that quantify the difficulty of a cluster structure from a range of perspectives. This set of problem features is utilized to define an instance space [12], enabling visual examination of the correlations between these properties and algorithm performance. For this purpose, we represent each benchmark set by its associated problem features, embed this representation into two dimensions, and use colour coding to highlight the top-performing algorithm for each dataset.
3) We provide an indicative example of the use of HAWKS to generate datasets across the instance space by varying a subset of parameters. In this first optimization mode, individual problem features can be deployed as objectives and/or constraints. By varying the relative importance of each feature, a diverse collection of benchmarks can be obtained.
4) We present a second optimization mode for HAWKS, which aims to generate benchmarks that elicit performance differences between pairs of clustering algorithms.
Analysis of the problem features and cluster structures associated with the resulting datasets allows for the identification of the relative strengths and weaknesses of each algorithm.

The remainder of this paper proceeds as follows. Section II further motivates the need for synthetic cluster generators, and positions our work relative to existing generators and the literature. Section III describes HAWKS, discussing the importance of each component in the framework as a whole, and the indicative design choices we have made to illustrate its use. Section IV describes our experimental setup, including the set of problem features we deploy to measure the complexity of a dataset, and details of existing benchmark sets we compare against. The experimental results are presented in Section V, demonstrating HAWKS' ability to evolve diverse benchmark data, and instances tailored to challenge specific algorithms. Finally, we conclude and discuss future research directions in Section VI.

II. BACKGROUND
This section reviews the relevant background to our work. We start by positioning our work relative to similar problems in the literature, most prominently the relevance of benchmarks, and related work on the algorithm selection problem. We then review the issues of cluster analysis, and discuss the implications for the development and use of synthetic benchmarks. Finally, we provide an overview of existing benchmark generators for clustering, and analyze their strengths and limitations.
A. The general role of benchmarks
Empirical comparison between techniques is a cornerstone of the scientific method. At a community level, methods developed by independent researchers need to be compared in order to gain insights into the applicability, generalizability, and efficacy of their developments. As direct comparison is only possible on the same problem instances, subsequent research is highly likely to adopt instances used by other researchers. The importance of reproducibility further reinforces the need for a common, accessible set of data that can be utilized to facilitate comparisons across independent studies. This feedback loop results in "standard" benchmarks becoming virtually required to include in experimentation [10]. This requirement is supported explicitly through the creation of benchmark suites — collections of problems collated and/or created for the purpose of widespread comparison [13], [14].

The issue with this feedback loop is that the community as a whole risks tuning both hyperparameters and algorithmic development to these specific problems [10], [15]. If these popular problems represent a broad spectrum of real-world challenges, then this is not a negative; analyzing whether these problems adequately cover the space of encounterable problems, however, is difficult, if at all possible, to do in its entirety [13].

To combat this challenge, Hooker [10] argues for "controlled experimentation", i.e. comparing algorithmic performance specifically on a problem characteristic that the research in question is addressing, as opposed to the "competitive testing" that is encouraged when the same subset of datasets is re-used time and again. This argument is consistent with the implications of the No-Free-Lunch (NFL) theorem for learning, which supports the intuition that no single algorithm is expected to be superior across all problems [16], and that binary statements about algorithm superiority (i.e. "algorithm A is better than algorithm B") are only possible within particular problem classes.
Controlled experimentation is inherently simpler with synthetic benchmarks, as we have control over the generating mechanism and thus (to varying extents) the properties of the instances. The use of real-world instances for this purpose is possible only if there is an appropriate measure of the problem characteristic and a controlled way to vary it.

B. The algorithm selection problem
In the above, we have introduced the notion of problem characteristics, and the need for these to be varied in benchmark data. This is to ensure appropriate coverage across the space of possible instances, and the ability to appropriately differentiate between the challenges these instances may pose for different algorithms.

Rice [17] formalized these characteristics as "problem features" in the context of the algorithm selection problem (ASP), where the goal is to predict which algorithm from a portfolio is best-suited to a given problem instance based on its problem features. This is premised on the existence of an identifiable relationship between the problem features and problem difficulty for a given algorithm, typically requiring these features to be specific to the problem class. A series of papers by Smith-Miles further extended this framework [12], [18], [19], using an instance space to visualize the interaction between problem features and algorithmic performance [12] and applying this approach to combinatorial optimization [20], [21] and supervised learning [22].
The problem of selecting an appropriate algorithm for a given task is closely related to the "algorithm configuration problem" (i.e. hyperparameter optimization), which is important in both the metaheuristic and machine learning communities [23]–[25]. In this context, the identification of problem features has featured prominently in the form of exploratory landscape analysis, where quantification of different aspects of the fitness landscape informs not only algorithm configuration, but a more general understanding of the suitability of algorithm components to different problem characteristics [14], [26]–[28].
C. Relevance to unsupervised learning
Unsupervised learning aims to identify the natural structure of a dataset in the absence of information about the ground truth. This pattern recognition task is subjective, even for humans [29], [30], explaining the existence of a diverse range of clustering algorithms that make a broad variety of mathematical assumptions about desirable cluster properties [6], [8].

The dilemma stemming from the NFL theorem is thus exacerbated in cluster analysis, with differing clustering algorithms useful for different clustering problems [6], [31], due to unavoidable differences in the underlying formulation [32]. As the inductive bias of each clustering algorithm differs, this fundamentally governs its capabilities: as an example, the popular K-Means assumes hyper-spherical, compact clusters, and this strictly limits its ability to deal with data that violates this assumption.

In consequence, having a representative and diverse collection of benchmark datasets is particularly crucial in cluster analysis, in order to fairly test each algorithm's strengths and weaknesses, and to avoid any biases towards those methods whose inductive bias may be consistent or correlated with the assumptions of a limited benchmark set. However, compared to other optimization problems, evaluation of this is complicated by the multifarious nature of cluster quality measures itself [6], [8]. Without explicit effort to first identify and quantify aspects of problem structure or difficulty that matter in cluster analysis, it is challenging to assess and ensure appropriate diversity of problem instances and, thus, a comprehensive benchmark suite. Any efforts to improve existing benchmarks must therefore involve such identification and quantification as an integral step.

Previous works directly tackling the ASP for clustering (in the context of "meta-learning") have used generic statistical properties of the data (such as the mean of the moments across the data).
An alternative is the use of measures directly characterizing a given group structure [33], [34]. The latter approach is likely to be more powerful in identifying drivers of algorithm performance, but its limitation lies in the difficulty of transferring insights on feature–performance correlations to data with an unknown group structure.
D. Existing clustering benchmarks and generators
In the following, we consider existing clustering benchmarks and the extent to which these meet the criteria of representativeness and diversity, as discussed above.

The datasets from the UCI Machine Learning Repository [35] comprise a diverse mix of real-world datasets and are one of the most common benchmarks for the evaluation of machine learning algorithms. Recent analysis demonstrates, however, that even with real-world data from a range of application areas, diversity in complexity is not guaranteed. Macia et al. [13] analyzed the datasets of the UCI Machine Learning Repository, finding a surprising similarity in complexity across them. They discovered a lack of diversity in both the statistical properties of the datasets and the relative classification performance of different algorithms and parameter settings.

At the other end of the spectrum, two-dimensional toy datasets are a common approach to the evaluation of clustering methods [36], [37]. Typically, these datasets have been handcrafted to illustrate simple capabilities or properties of clustering algorithms. They remain popular due to their simplicity and because they enable intuitive visualization of results, rendering them highly effective at clearly exhibiting a particular property and consequent algorithm behaviour (see e.g. the "moons" dataset). Despite these advantages, the scenarios that toy datasets depict are often too contrived.

Although synthetic data is commonly generated ad hoc for individual work, some generators have been explicitly designed to have broader utility, with more complex properties than toy data. The generator proposed by Qiu and Joe [38] (abbreviated to QJ) uses a geometric framework for cluster placement.
The measure of separation used (proposed in [39]) quantifies the spatial separation between clusters. By adjusting the minimum amount of separation allowed, the user can generate datasets that have different amounts of separation between clusters, providing some loose control over the resulting difficulty of the datasets. To do this, the covariance matrices are iteratively scaled until this minimum separation is achieved. Although useful and geometrically interpretable, this provides a single perspective of cluster structure. Their generator can, however, embed additional complexity through the addition of noise [38]. Despite having several useful parameters to customize the difficulty of the generated problems, the generator (available as an R package) is not easily extended to incorporate other measures of cluster structure.

Handl and Knowles [11] (abbreviated to HK) created two generators (named "gaussian" and "ellipsoidal") that have been used extensively for the creation of synthetic clusters. At their core, these generators aim to generate clusters that are as compact as possible with either no ("gaussian") or minimal ("ellipsoidal") overlap. The "gaussian" generator uses a trial-and-error scheme where Gaussian clusters are randomly generated, and rejected if they overlap with existing clusters. The "ellipsoidal" generator was designed specifically for generating elongated clusters in higher dimensions, and uses an EA to optimize cluster location such that overall variance in the dataset is reduced while penalizing overlap.
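The trial-and-error scheme of the HK "gaussian" generator can be sketched as follows. This is a simplified illustration, not the original implementation: overlap is approximated here by a minimum distance between cluster means, and the separation heuristic and parameter names are our own assumptions.

```python
import numpy as np

def trial_and_error_gaussians(n_clusters, dim, n_points_per_cluster,
                              box=10.0, var=0.5, max_trials=1000, seed=0):
    """Rejection-based Gaussian generator sketch: propose random spherical
    Gaussian clusters and reject any candidate whose mean is too close to
    an existing cluster (a crude stand-in for an overlap test)."""
    rng = np.random.default_rng(seed)
    means = []
    min_sep = 4.0 * np.sqrt(var)  # illustrative separation threshold (assumption)
    for _ in range(max_trials):
        if len(means) == n_clusters:
            break
        candidate = rng.uniform(0.0, box, size=dim)
        # reject candidates that would overlap an already-placed cluster
        if all(np.linalg.norm(candidate - m) >= min_sep for m in means):
            means.append(candidate)
    data, labels = [], []
    for k, mu in enumerate(means):
        pts = rng.normal(loc=mu, scale=np.sqrt(var),
                         size=(n_points_per_cluster, dim))
        data.append(pts)
        labels.extend([k] * n_points_per_cluster)
    return np.vstack(data), np.array(labels)

X, y = trial_and_error_gaussians(n_clusters=3, dim=2, n_points_per_cluster=50)
```

The rigidity criticized in the text is visible even in this sketch: the separation rule is hard-coded, and incorporating a different notion of cluster structure would require rewriting the rejection test.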
Despite the popularity of these generators, their design is rigid with many hand-tuned parameters, and without consideration or easy scope for extension to consider additional or alternative aspects of cluster structure.

In this paper, we therefore propose a more general evolutionary framework for generating new synthetic benchmarks for cluster analysis. Our work tackles two crucial steps for the construction of diverse datasets: (i) the explicit identification of problem characteristics that are relevant to differentiating clustering algorithm performance and (ii) the design of an optimizer that is sufficiently general to evolve benchmarks for a range of criteria. Specifically, we experiment with two different types of criteria: approximating a desired target value of problem characteristics, and directly maximizing performance differences between algorithms. The latter approach helps highlight the strengths and weaknesses of the contestant methods, providing direct insight into their individual inductive bias. To evaluate the performance of our approach, we compare datasets obtained from our framework against multiple generators/dataset collections in terms of both the spread of performance across clustering algorithms, and the variance across our proposed set of problem features. This work greatly extends an initial proof-of-concept outlined in [40], generalizing the framework and introducing the ability to generate datasets directly for particular clustering algorithms.

(An example "moons" dataset: https://rdrr.io/cran/clusterSim/man/shapes.two.moon.html; the QJ generator is available as an R package at https://cran.r-project.org/web/packages/clusterGeneration/index.html)

III. HAWKS

We now proceed to introduce the details of our framework, HAWKS, for the generation of diverse, complex datasets. HAWKS uses an EA to evolve a population of datasets, where the objective function and constraints work together to vary properties of the resulting datasets that affect clustering algorithm performance.
For each component of HAWKS, we discuss its general role in dataset generation, and motivate the specific design choice made for the experiments presented in this paper.
A. Representing a dataset
In HAWKS, a dataset is represented by a set of clusters, which are themselves defined by the parameters of a given multivariate distribution. With this level of abstraction, we avoid handling of individual data points, and provide an intuitive manipulation of clusters through adjustment of distribution parameters. In principle, this approach allows for the inclusion of a variety of cluster generating models, given suitable initialization and variation operators for each distribution have been defined (see Sections III-B and III-E).

For the experiments reported in this paper, each cluster is represented by a multivariate Gaussian distribution, due to their relative simplicity and prominence in a machine learning context. For a dataset in D dimensions, each cluster is therefore encoded by a (µ, Σ) pair, which we will refer to as a single gene. Here, µ is the D-dimensional mean vector and Σ is the symmetric D × D covariance matrix. A dataset with K clusters is therefore represented as a genotype composed of K genes. In the current implementation, K is set a priori as an input parameter.

B. Initializing a population of datasets
The first step of HAWKS is to create an initial population of datasets. The sizes of the clusters can be controlled in two ways: (i) all clusters can be of equal size or (ii) each cluster is of random size (with an optional minimum size to avoid clusters small enough to be considered outliers). In both cases, the user predefines the total number of data points N. The cluster sizes remain fixed across all individuals for the remainder of the evolution, to avoid interference with the fitness (described in Section III-C) and focus the search on the distribution parameters only.

All other aspects of the cluster-level initialization are specific to the type of distribution used; here, we describe our approach for Gaussian clusters. The initial means are sampled from a D-dimensional uniform distribution, i.e. for the i-th cluster $\mu^{(i)} \sim U(0, \beta_\mu)^D$, where $\beta_\mu$ is the upper threshold that controls the initial sampling space for the means. In HAWKS, the covariance matrix of the i-th cluster is defined as $\Sigma^{(i)} = R^{(i)} S^{(i)} \tilde{\Sigma}^{(i)} (R^{(i)})^\top$, where $R^{(i)}$ and $S^{(i)}$ are the i-th rotation and scaling matrices, respectively, and $\tilde{\Sigma}^{(i)}$ is the i-th axis-aligned covariance matrix (i.e. a diagonal matrix that consists of only the variances). These variances are sampled similarly to the means, i.e. $\tilde{\Sigma} = \mathrm{diag}(U(0, \beta_\sigma)^D)$, where $\beta_\sigma$ is the upper threshold that controls the initial sampling space for the variances. The scaling matrix is set to the identity matrix at this stage.
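The cluster-level initialization above, including the Haar-distributed rotation described next, can be sketched as follows. This is a minimal illustration, not the HAWKS implementation: the use of SciPy's `special_ortho_group` for the Haar-random rotation, and the parameter names `beta_mean`/`beta_var`, are our own assumptions.

```python
import numpy as np
from scipy.stats import special_ortho_group

def init_gene(dim, beta_mean=10.0, beta_var=5.0, seed=0):
    """One (mu, Sigma) gene: mu ~ U(0, beta_mean)^D, axis-aligned
    variances ~ U(0, beta_var), scaling S = I at initialization, and
    Sigma = R S Sigma_tilde R^T with a Haar-random rotation R."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0.0, beta_mean, size=dim)
    sigma_tilde = np.diag(rng.uniform(0.0, beta_var, size=dim))
    S = np.eye(dim)                                      # scaling matrix (identity at init)
    R = special_ortho_group.rvs(dim, random_state=seed)  # Haar-distributed rotation
    Sigma = R @ S @ sigma_tilde @ R.T                    # rotated, valid covariance
    return mu, Sigma

mu, Sigma = init_gene(dim=3)
# sample points from the resulting cluster model
points = np.random.default_rng(0).multivariate_normal(mu, Sigma, size=100)
```

Because Sigma is built as a rotated diagonal matrix with non-negative entries, it is symmetric positive semi-definite by construction, so sampling from the multivariate Gaussian is always valid.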
The covariances are then obtained via rotation using a random rotation matrix R, drawn from the Haar distribution [41] to generate a valid covariance matrix. This method permits generation of clusters with a variety of shapes and orientations, thus ensuring the initial population has a diverse set of individuals. This approach, while more complex than constructing a covariance matrix with random values, allows both more intuitive parameterization and the ability to separately modify individual components of the covariance matrix, which we later exploit for mutation (Section III-E).

C. Computing the fitness of a dataset
The quantification of fitness plays a key role in focusing the search on those datasets that are deemed to present interesting clustering benchmarks. It is thus one of the most vital design decisions in a given generator, but is complicated by our limited understanding of the desirable properties of clustering benchmarks, as discussed in Section II-C. Therefore, our framework is designed to allow for the interchangeable use of different fitness functions, providing scope for a variety of choices to be trialled.

In this paper, we introduce and experiment with two different modes of optimization which support the generation of datasets for distinct goals. The first, named Index mode, optimizes the datasets towards a user-defined target value for a given cluster validity index. This provides control of the broad difficulty of the datasets, where the nature of the difficulty is governed by the properties of the index used. The second, named Versus mode, directly optimizes datasets such that they maximize the performance difference between two clustering algorithms. By generating a range of datasets that are simple and difficult for specific algorithms we can "stress-test" them, revealing their relative strengths and weaknesses. We describe these approaches in more detail in the following sections.

(Owing to its non-triviality, the method to generate random cluster sizes with a minimum size that overall sum to N is described in the supplementary material.)
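As a concrete illustration of the Versus fitness just described, the following sketch scores two algorithms against the known ground truth and returns their difference. The choice of algorithms and the use of scikit-learn here are illustrative assumptions, not the paper's experimental setup; the scoring function is the Adjusted Rand Index, discussed further below.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

def versus_fitness(X, y_true, winner, loser):
    """Versus-mode fitness sketch: phi(A_w) - phi(A_l), where phi scores a
    partition against the ground truth (here via the Adjusted Rand Index).
    Maximizing this favours datasets easy for `winner`, hard for `loser`."""
    return (adjusted_rand_score(y_true, winner.fit_predict(X))
            - adjusted_rand_score(y_true, loser.fit_predict(X)))

# Illustrative dataset and algorithm pair (not from the paper)
X, y = make_blobs(n_samples=200, centers=4, random_state=0)
diff = versus_fitness(
    X, y,
    winner=KMeans(n_clusters=4, n_init=10, random_state=0),
    loser=AgglomerativeClustering(n_clusters=4, linkage="single"),
)
```

In the evolutionary loop, `diff` would be the quantity maximized over the population of datasets; the dataset `X` above merely stands in for a decoded individual.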
1) Index Mode:
Cluster validity indices capture the amount of structure of a partition, in terms of the underlying data distribution. Given knowledge of the generating model (i.e. the true partition), they can thus act as a proxy for the ease of recognizing this structure in a given dataset.

As previously discussed, there are many different cluster validity indices, each with a slightly different perspective of how cluster structure should be quantified. Previous work, including [9], could not conclude superiority of any single validity index. Thus, to obtain a comprehensive understanding of the difficulty of a given dataset, many cluster validity indices could potentially be used in combination [42]. In the context of a fitness function, this could be done in the form of an aggregation of indices or through formulation as a many-objective problem. Here, we opt for a middle ground, limiting ourselves to the definition of fitness through a single cluster validity index, but allowing for the incorporation of additional considerations (which could include validity indices) through the use of constraints (discussed in Section III-D).

For the experiments reported in this paper, we use the silhouette width [43] as a representative example of a validity index. It is an established, widely-used method and has two particular characteristics that make it a promising choice for use here: (i) it is a very rich measure that provides information at multiple levels of resolution (the individual data point, the cluster, and the partition level); and (ii) as it is bounded in the range [−1, 1], it can be easily compared across individuals and even runs (when a similar dimensionality is used), and it is readily interpretable to the user.

The silhouette width is a combination internal validity index, as it measures a ratio of intra-cluster compactness and inter-cluster separation [1]. For a single data point, x_i, it is defined as:

$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}} \tag{1}$$

Here, a(x_i) represents the cluster compactness (with respect to x_i) and is the average distance from x_i to all other data points in its cluster. The separation between clusters is represented by b(x_i); for data point x_i this is defined as the minimum of the average distances to all data points in every other cluster. The silhouette width is calculated for all N data points in dataset X, and an average is taken to obtain the overall silhouette width:

$$s_{\mathrm{all}} = \frac{1}{N}\sum_{i=1}^{N} s(x_i) \tag{2}$$

A value of 1 represents very compact and well-separated clusters, whereas a negative silhouette width value indicates that points in different clusters are not well-separated (and that their cluster membership should be changed).

Independent of the validity index chosen, direct maximization or minimization of such an index would always lead to the evolution of datasets that are trivially separable or fully unstructured, respectively. HAWKS therefore requires input in the form of a desired target value, allowing direct modulation of the desired level of structure. In other words, and using the example of the silhouette width, a target value (denoted s_t) is specified by the user and datasets are then optimized to meet this target value. This is achieved by minimizing the absolute difference between s_t and s_all, defined as:

$$\min f(\mu^{(1)}, \Sigma^{(1)}, \ldots, \mu^{(K)}, \Sigma^{(K)}) \equiv \min\, |s_t - s_{\mathrm{all}}| \tag{3}$$
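The Index-mode fitness of Eqs. (1)–(3) can be sketched directly with scikit-learn's silhouette implementation. This is a minimal illustration (the dataset is a stand-in for a decoded HAWKS individual, and the target value is arbitrary):

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def index_fitness(X, labels, s_target):
    """Index-mode fitness sketch: |s_t - s_all| (Eq. (3)), where s_all is
    the overall silhouette width of the true partition; lower is better."""
    return abs(s_target - silhouette_score(X, labels))

# Illustrative individual: 300 points in 3 Gaussian clusters
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
fitness = index_fitness(X, y, s_target=0.9)
```

Note that the labels passed in come from the generating model, not from a clustering algorithm: the silhouette width here measures how recognizable the planted structure is, and the EA drives it toward the user's target s_t.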
2) Versus Mode:
The Index mode provides us with the ability to generate benchmarks that meet specific thresholds of user-defined validity criteria. However, shared assumptions between validity indices and clustering algorithms themselves can make it difficult to generate datasets with properties that specifically challenge a given algorithm. This is particularly the case when working with clustering techniques whose inductive biases are poorly understood, e.g. self-organizing approaches or those that utilize deep learning [44].

To perform "controlled experimentation", a more direct link between dataset generation and algorithm performance may be needed [10]. Our Versus mode tries to address this by directly optimizing the performance difference between two algorithms, thereby exploiting the strengths of one algorithm, the weaknesses of the other, or a combination of the two. We re-formulate the fitness function such that we maximize the difference between the scores of the 'winning' algorithm (A_w) and the 'losing' algorithm (A_l), using a scoring function φ:

$$\max f(\mu^{(1)}, \Sigma^{(1)}, \ldots, \mu^{(K)}, \Sigma^{(K)}) \equiv \max\left(\phi(A_w) - \phi(A_l)\right) \tag{4}$$

While any scoring function can be used, we have access to the generating model of the clusters and therefore the ground truth. The quality of each partition can therefore be objectively assessed using an external validity index such as the Adjusted Rand Index (ARI) [45]. The ARI measures the co-occurrences of cluster assignments between two partitions, which in our case corresponds to the output of a given clustering algorithm, and the ground truth. The upper bound of 1 represents identical assignment, whereas 0 indicates random assignment. As the ARI considers only the assignment of points, it can be used with any clustering algorithm and has no preference of structure.

D. Augmenting cluster properties using constraints
As previously discussed, it is difficult to define a singlefitness measure that represents all properties that may bedesirable in a benchmark dataset. Therefore, our frameworkuses constraints to allow for the integration of additionalconsiderations, with two main aims: (i) to avoid the generationof trivial datasets that are e.g. simple for any algorithm, or toonoisy to be clustered, and (ii) to introduce additional propertiesthat balance limitations of individual fitness functions orfurther enhance diversity in the datasets obtained. To controloptimization of the fitness and constraints, we use stochastic
EEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION 6 ranking [46] to balance the satisfaction of all constraints (thisis discussed further in Section III-F).In the following, we discuss two constraints introducedto meet these aims by directly accounting for local overlapbetween clusters and for cluster shape (as measured by clusterelongation — for a different generating model, other choicesmay be plausible). These choices of constraints have to beseen in the context of the fitness functions adopted for ourexperiments: when using the silhouette width in
Index mode,the fitness provides a powerful overall perspective of separa-tion between clusters. However, the reliance on averaging canlose fine-grained information about local variability, e.g. lackof inter-cluster separation for small clusters. Furthermore, thesilhouette width does not directly capture information aboutcluster shape, while differences in cluster shapes are knownto drive some performance differences between algorithms.Similarly, for the
Versus mode, a performance difference between the algorithms is sought with no explicit concern for the underlying cluster properties. While this exploratory approach is desired, there are situations where this can lead to cluster structures that are not meaningful (e.g. clusters that are completely overlapping) but where one algorithm is still perceived to be favoured due to arbitrary artefacts.
1) Overlap:
Real-world data typically does not contain cleanly separated clusters, whether due to noise or reflective of the underlying relationships between variables. Clustering algorithms differ significantly in their ability to handle some degree of cluster overlap, and control over this aspect of our benchmark is therefore important. At the extreme ends, a benchmark set with very well-separated clusters may be too simple to detect any performance differences between algorithms; equally, a dataset where clusters fully overlap will be of no use.

The definition of overlap is not purely objective, and can be implemented in different ways. Fränti and Sieranoja [36] defined a data point as overlapping if it is closer to a centroid of a different cluster than to the centroid of its own cluster. This definition, however, does not extend to highly eccentric clusters. We use a definition similar to [11], where a data point is considered as overlapping if its nearest neighbour belongs to a different cluster. In addition to avoiding the compactness bias introduced by using centroids, this definition has the specific benefit of countering an inherent limitation of the silhouette width, where a high average silhouette width can be driven by large clusters that are very well-separated, failing to sufficiently reflect the presence of very small but highly overlapping clusters.

We calculate the overlap as the percentage of data points whose nearest neighbour is in a different cluster:

overlap = 1 − (1/N) Σ_{x ∈ X} 1_{C_k}(n_x),   (5)

where C_k is the cluster that data point x belongs to, n_x is the nearest neighbour of x, and 1_{C_k}(·) is the indicator function that is 1 if n_x ∈ C_k and 0 if n_x ∉ C_k.
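As a concrete illustration, the overlap of Eq. (5) reduces to a single nearest-neighbour query. The sketch below uses scikit-learn (the function name and use of scikit-learn are our own, not part of HAWKS):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def overlap(X, labels):
    """Fraction of points whose nearest neighbour lies in a different cluster (Eq. 5)."""
    # Query 2 neighbours: the first is the point itself, the second its nearest neighbour.
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    neighbour_labels = labels[idx[:, 1]]
    return float(np.mean(neighbour_labels != labels))
```

For two well-separated clusters this returns 0, and it approaches 1 as the clusters interleave completely.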
2) Elongation:
The elongation (or, more specifically, the eccentricity) of clusters can pose problems for compactness-based algorithms such as K-Means, and thus presents an additional aspect that we are keen to control. Although our initialization parameters can adjust initial cluster eccentricity, by using an explicit constraint we can encourage (or penalize) this further during the evolution.

Fig. 1. Illustrations of the HAWKS genetic operators. Uniform crossover between two individuals is shown in (a), where the means and covariances are swapped independently. In (b), a single cluster is mutated where both the location (mean) and shape (covariance) have been randomly perturbed (the original cluster is shown again, faded, to illustrate the relative differences).

Our definition of this constraint is specific to the generating model used, and different definitions are possible. As previously mentioned, each full covariance matrix (Σ) in our cluster representation is separated into the axis-aligned variances (Σ̃) and a rotation matrix. As the variances on the diagonal of Σ̃ are the eigenvalues of the full covariance matrix, i.e. Σ̃ = diag(λ_1, . . . , λ_D), the ratio of the maximum and minimum of these eigenvalues gives us a measure of the cluster eccentricity. As even a single eccentric cluster can pose challenges for compactness-based algorithms, we take the minimum of these ratios across all K clusters, i.e.:

λ_ratio = min_{k ∈ {1,...,K}} |λ_max(Σ^(k))| / |λ_min(Σ^(k))|,   (6)

where λ_max(Σ^(k)) and λ_min(Σ^(k)) are the maximum and minimum eigenvalues of Σ^(k), respectively.

E. Perturbing a dataset
As is typical in an EA, the variation operators in HAWKS provide a further opportunity to integrate domain knowledge and focus the search. Specifically, a close alignment between the generating model used (i.e. the representation of an individual cluster and partition) and the operators is crucial to optimization performance and the types of datasets that can be obtained. In the following, we describe the variation operators designed for the multivariate Gaussian distributions utilized as the generating model in our experiments.
1) Crossover:
For recombination between datasets, HAWKS uses a high-level uniform crossover scheme where the two components defining each cluster distribution (i.e. µ and Σ) can be swapped separately between individuals. This allows both the location (mean) and shape (covariance) of clusters to be swapped between individuals independently, as illustrated in Fig. 1a.
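A minimal sketch of this operator, assuming each individual is represented as a list of (mean, covariance) pairs; the representation and the 50% swap probability here are illustrative assumptions, not the exact HAWKS implementation:

```python
import random

def uniform_crossover(parent1, parent2, swap_prob=0.5):
    """Uniform crossover over a dataset genotype: a list of (mean, cov) cluster pairs.
    Means and covariances are swapped independently between the two parents."""
    child1, child2 = [], []
    for (m1, c1), (m2, c2) in zip(parent1, parent2):
        if random.random() < swap_prob:   # swap this cluster's means
            m1, m2 = m2, m1
        if random.random() < swap_prob:   # swap this cluster's covariances
            c1, c2 = c2, c1
        child1.append((m1, c1))
        child2.append((m2, c2))
    return child1, child2
```

Because means and covariances are swapped independently, an offspring can inherit a cluster's location from one parent and its shape from the other.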
2) Mutation:
Meaningful mutations of a cluster require careful design of a geometrically meaningful operator appropriate for the distribution being used. In our Gaussian case, we use a separate operator for the mean and covariance terms.

A key issue identified in [40] was the increasing amount of drift of the mean operator in higher dimensions. The original operator shifts the mean to a random nearby point; at higher dimensionality there are an increasing number of directions that point away from other clusters, and a random walk is thereby likely to increase the silhouette width at most steps. In the supplementary material (Section ??), we test multiple new mutation operators (drawing inspiration from operators used in particle swarm optimization and differential evolution) to address this issue by directly considering the location of other clusters.

The resulting operator, selected and used throughout this paper, utilizes concepts from particle swarm optimization [47], where a combination of other centroids and a global representative is used to embed direction either towards or away from existing clusters, directly affecting the fitness. The new mean, µ′^(i), is obtained as follows:

µ′^(i) = µ^(i) + [w_1(µ^(n) − µ^(i)) + w_2(µ̄ − µ^(i))]   if p ≤ 0.5,
µ′^(i) = µ^(i) − [w_1(µ^(n) − µ^(i)) + w_2(µ̄ − µ^(i))]   if p > 0.5,   (7)

where µ^(i) is the current cluster mean, µ^(n) is the mean of another randomly selected cluster, µ̄ is the global mean across all data points, w_1 and w_2 are random weighting coefficients in the range [0, 1], and p is a random coin-flip to decide whether to move away from or towards this weighted combination of an existing cluster mean and the global mean.

The covariance governs the shape of the cluster, affecting both the overlap between clusters and the capabilities of compactness-based algorithms (such as K-Means). To mutate the covariance, we rotate the cluster and scale its eigenvalues, effectively changing the shape of the cluster.
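The mean mutation of Eq. (7) translates directly into code; the sketch below is illustrative (the random generator, its seed, and the function name are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def mutate_mean(mean, other_mean, global_mean):
    """PSO-inspired mean mutation (Eq. 7): step towards or away from a weighted
    combination of another cluster's mean and the global mean of the data."""
    w1, w2 = rng.uniform(0, 1, size=2)            # random weighting coefficients in [0, 1]
    step = w1 * (other_mean - mean) + w2 * (global_mean - mean)
    sign = 1.0 if rng.uniform() <= 0.5 else -1.0  # coin-flip: towards (+) or away (-)
    return mean + sign * step
```

Embedding the positions of other clusters in the step direction is what counters the high-dimensional drift described above.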
Similar to the initialization, the rotation matrix is drawn from the Haar distribution [41], though here it is raised to a fractional power to avoid complete reorientation of the cluster. The scaling matrix, S, is drawn from a Dirichlet distribution in order to ensure that the resulting determinant is unchanged, i.e. det(Σ̃) = det(S · Σ̃). Using an analogy, this has the same effect as rotating a balloon and applying pressure to the principal semi-axes, thereby changing the shape while maintaining the volume. The combined effect of both mutation operators is illustrated in Fig. 1b.

F. Selecting a dataset
As previously discussed, we use stochastic ranking [46] to balance the satisfaction of the objective and constraint(s), which may be complementary or impossible to fully satisfy simultaneously. By adjusting the probability of comparing two individuals on their fitness (denoted P_f; this is described in further detail in Section ??), we can effectively weigh the satisfaction of the objective or the constraints, providing us with another way of controlling the properties of the resulting datasets. For example, if using only the overlap constraint in Index mode, setting P_f to a higher value will add selection pressure towards datasets with a silhouette width closer to s_t, potentially at the cost of a higher degree of overlap. Thus, unlike the traditional use of stochastic ranking, which uses a narrow range of values for P_f [46], [48] to avoid too much weighting towards the infeasible or feasible regions, we can use the entire range (as datasets that heavily violate the constraints may still be useful).

Similar to the original work [46], for environmental selection we use stochastic ranking to select the top |P| individuals from the sorted pool of parents and offspring. For parental selection, we use standard binary tournament where the rank is used to determine the winner (in lieu of using only the fitness) to ensure a continued selection pressure towards individuals that best satisfy the fitness and constraints as weighted by P_f.

IV. EXPERIMENTAL SETUP
In order to assess the capabilities of HAWKS, we have three primary aims in our experiments: (i) compare the diversity of performance of multiple, distinct clustering algorithms on datasets from HAWKS against that seen for other popular generators or dataset collections; (ii) compare the diversity across a distinct set of problem features, to see if datasets from HAWKS cover a wider space of properties than other datasets; and (iii) gain insights into the algorithms themselves, utilizing the
Versus mode of HAWKS to directly challenge clustering algorithms.

The remainder of this section describes our proposed problem features (Section IV-A), the generators and dataset collections we compare against (Section IV-B), and the relevant HAWKS parameters used in each experiment (Section IV-C).
A. Problem features
In order to measure dataset diversity (with respect to their inherent properties), we need to define a set of problem features describing properties relevant to algorithm performance. This is central to the algorithm selection problem (ASP), as these features are used to predict which algorithm is best for a given problem instance. These problem features are also vital for the creation of an instance space that visualizes the datasets, allowing identification of areas in the space where particular algorithms are most suitable.

In previously discussed work on the ASP for clustering, the problem features (also referred to as "meta-features") used are generally statistical or information-theoretic, and not specific to clustering [33], [34]. The problem features previously used in [40] were also too simplistic, and fully coincided with parameters available in HAWKS, reducing the utility of the instance space.

As there are many properties that influence performance for a given clustering algorithm, and for most algorithms this aspect is not fully understood, a fully complementary set of problem features is arguably impossible [6], [9]. Nonetheless, going beyond simple statistical measures of the data by
incorporating measures specific to clustering (especially those that make use of the generating model) should improve the discriminative power of the problem features. The remainder of this section describes our proposed set of problem features for this task.
1) Average eccentricity:
A high amount of cluster eccentricity can cause problems for compactness-based algorithms. To measure this, we use a method similar to its calculation as a constraint in HAWKS, with some key differences. For both the synthetic and real-world datasets used in this paper, we have the true labels. Using these, the eccentricity is calculated as before (the ratio of the maximum to minimum eigenvalues for each cluster), except that the average (rather than minimum) value is used as the problem feature. The eigenvalues are obtained via singular value decomposition, but (in contrast to the constraint) a subset of the eigenvalues is used. This reduces the sensitivity to outliers and subspace clusters (which result in zero, or arbitrarily close to zero, eigenvalues). For further details, see Section ??.
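A simplified sketch of this feature is given below. Note that it uses the full set of eigenvalues, whereas the paper uses a subset to reduce sensitivity to outliers; the function name is ours:

```python
import numpy as np

def average_eccentricity(X, labels):
    """Mean over clusters of the ratio of largest to smallest covariance eigenvalue."""
    ratios = []
    for k in np.unique(labels):
        points = X[labels == k]
        cov = np.cov(points, rowvar=False)
        # For a PSD covariance matrix, the singular values equal the eigenvalues.
        eigvals = np.linalg.svd(cov, compute_uv=False)
        ratios.append(eigvals.max() / eigvals.min())
    return float(np.mean(ratios))
```

A spherical cluster contributes a ratio of 1, while an elongated cluster contributes a large ratio, pulling the average up.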
2) Connectivity:
The connectivity, defined by Handl and Knowles [49] (a modified version of the measure proposed in [50]), measures the extent to which neighbouring data points are assigned to the same cluster. We normalize this measure by the number of data points to obtain a value that is comparable across datasets. This provides a more nuanced picture than the overlap constraint used by HAWKS, which looks at the single nearest neighbour rather than the set of L nearest neighbours, where L is a parameter of the connectivity. Here, we use L = 10 in line with previous work [49].
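A sketch of the normalized connectivity, assuming the 1/j penalty form of [49] (a point is penalized 1/j when its j-th nearest neighbour lies in a different cluster); the normalization by N is as described above, and the names are ours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def connectivity(X, labels, L=10):
    """Normalized connectivity: penalize each point 1/j when its j-th nearest
    neighbour (j = 1..L) belongs to a different cluster, then divide by N."""
    n = len(X)
    nn = NearestNeighbors(n_neighbors=L + 1).fit(X)
    _, idx = nn.kneighbors(X)          # column 0 is the point itself
    total = 0.0
    for i in range(n):
        for j in range(1, L + 1):
            if labels[idx[i, j]] != labels[i]:
                total += 1.0 / j
    return total / n
```

Well-separated clusters larger than L points yield a connectivity of 0; mislabelled or overlapping points raise it.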
3) Dimensionality:
As different measures of distance or similarity are affected differently by high dimensionality, this feature simply describes the number of dimensions in the data. It is our only feature that is ground-truth agnostic, and is shared with previous work on the ASP for clustering [33], [34].
4) Entropy of cluster sizes:
The density or relative sizes of the clusters can affect the performance of different algorithms. For example, the ability of K-Means to discover clusters is diminished when a small subset of clusters contains the majority of data points. We calculate the entropy of cluster sizes as follows:

H(C) = − Σ_{k=1}^{K} (|C_k| / N) log_K(|C_k| / N),   (8)

where C is the set of K clusters, and |C_k| is the cardinality of cluster C_k, which is normalized by the size of the dataset (N). We use log_K to compare across datasets such that, for any K, a perfectly equal distribution of cluster sizes results in H(C) = 1.0, and H(C) → 0 when one cluster has N − K + 1 data points.
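Eq. (8) translates directly into code. The minimal sketch below assumes at least two clusters, since a logarithm with base K = 1 is undefined:

```python
import math
from collections import Counter

def cluster_size_entropy(labels):
    """Entropy of cluster sizes (Eq. 8), using log base K so that equal-sized
    clusters give 1.0 regardless of the number of clusters."""
    sizes = Counter(labels)
    n = len(labels)
    k = len(sizes)
    return -sum((s / n) * math.log(s / n, k) for s in sizes.values())
```

Three equal-sized clusters give a value of (approximately) 1.0, while one dominant cluster drives the value towards 0.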
5) Number of clusters:
The number of clusters can inherently affect algorithms where initialization of cluster location is key to convergence, particularly if clusters are well-separated.

As discussed earlier, use of the ground truth in these measures comes at the cost of limiting applicability to datasets where such knowledge is unavailable; this is a major issue for ASP applications but is not our intended focus here.
6) Silhouette width (average):
The average silhouette width measures how similar data points are to their own clusters, and how well-separated these are from the other clusters, indicating the potential difficulty for clustering algorithms in discovering these clusters.
7) Silhouette width (standard deviation):
Averaging the silhouette width can obscure the presence of a small number of very well-separated clusters, and of ill-defined overlapping clusters. Thus, the standard deviation of the silhouette width (calculated across all data points) indicates whether all points are similarly separated. Higher values indicate the presence of overlapping clusters (from a different perspective than the connectivity measure), which clustering algorithms have different capabilities in handling.
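Both silhouette-based features can be obtained from the per-point silhouette values. A sketch using scikit-learn on synthetic data (the dataset parameters here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Per-point silhouette values give both problem features at once.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)
sil = silhouette_samples(X, y)
avg_silhouette = float(np.mean(sil))  # feature 6: average silhouette width
std_silhouette = float(np.std(sil))   # feature 7: its standard deviation
```

A high average with a low standard deviation suggests uniformly well-separated clusters, whereas a high standard deviation flags a mix of well-separated and overlapping regions.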
B. Other generators and datasets
To provide context for the diversity in performance and problem features of the datasets produced by HAWKS, we compare against multiple generators and collections of popular datasets that have previously been used to evaluate clustering algorithms. We briefly describe these below; more details about the parameters used for these datasets, and how they compare (in terms of their size, dimensionality, etc.), can be found in the supplementary material (Table S-??) and their respective papers.
1) HK:
This is a collection of 350 datasets generated using the "ellipsoidal" generator proposed in [11], and used as a benchmark in [51]. For each of 35 unique combinations of the number of clusters and dimensionality, 10 datasets are generated.
2) QJ:
We use the set of 243 datasets generated using the parameters proposed by Qiu and Joe [38]. The authors calculated three target separation values using their measure of separation: "close structure", "separated", and "well-separated". Further complexity is added to these datasets by specifying varying proportions of noisy variables, alongside small variation in the number of dimensions.
3) SIPU:
Fränti and Sieranoja [36] introduced a benchmark consisting of multiple sets of clustering datasets, of which we use the ‘S-sets’, ‘A-sets’, and ‘G2 sets’. These sets were originally intended to stress-test compactness-based algorithms, on which K-Means exhibited a wide range of performance. The ‘S-sets’ are all 2D data where N = 5000 and K = 15, but with different degrees of overlap between the clusters (determined by the aforementioned closest-centroid method). The ‘A-sets’ are also 2D data with varying numbers of (equally-sized) clusters. The ‘G2 sets’ consist of two Gaussians with varying degrees of overlap, constant size (N = 2048), and varying dimensionality. The variation in overlap and the high number of dimensions in this benchmark present a variety of challenges for clustering algorithms.
4) UCI:
The UCI Machine Learning Repository [35] is a popular source of datasets used for machine learning. As noted in [8], [36], the class labels do not necessarily translate to meaningful cluster labels, and thus these datasets may not be the most suitable for clustering. Nonetheless, they have seen extensive use in the clustering literature, and we include them
for completeness. Specifically, we use the subset of 20 datasets used by Arbelaitz et al. [9].
5) UKC:
These 8 real-world datasets, curated in [51], are the (anonymized) locations of different crimes. Alongside the
UCI datasets, these will help provide insight into whether there are significant differences between the real-world and synthetic data (in terms of their problem features or clustering algorithm performance).
C. HAWKS configurations
In this section we highlight the key parameters for HAWKS, separating those that are common across all experiments from those adjusted for different modes. In particular, while the core EA parameters (population size, mutation probability, etc.) remain the same across modes, in our experiment for the
Index mode we vary the objective function target and constraint parameters to generate different datasets. In contrast, the constraint parameters remain fixed for the
Versus mode experiment, as the variation comes from selecting different pairs of clustering algorithms. The full configurations for both sets of experiments can be found on GitHub.
1) Common parameters:
In both
Index and
Versus mode experiments, the following core EA parameters are used: G_max = 100 generations, a population of |P| = 10 individuals, a fixed crossover probability p_c, and a mutation probability of p_m = 1/K for the mean and covariance mutation operators. The low population size was previously found to be sufficient for both diversity and convergence [40].
2) Solution selection:
At the end of a run, a single dataset needs to be selected. Here, we simply select the individual with the highest fitness. Different choices are possible; e.g., due to our use of stochastic ranking, the individual with the highest (sorted) rank could be selected instead (though this had little effect on the experiments in this paper).
3) Analyzing benchmark dataset diversity (Index mode):
To test the ability of HAWKS to produce a variety of datasets, we vary several parameters to encourage coverage of our feature space:
• A poor and a high target silhouette width (s_t).
• Two upper thresholds for the overlap constraint (overlap ≤ {0, 0.1}), to penalize any overlap, or to penalize only if more than 10% of the data points overlap.
• Two lower thresholds for the eccentricity constraint (λ_ratio), to allow for any amount of eccentricity, or to encourage all clusters to have some eccentricity.
• Two levels of dimensionality (D), dataset size (N), and number of clusters (K).

Of note is the use of P_f = 0.5, which weights the fitness and constraints equally. By varying s_t, we directly attempt to generate datasets that either have poor cluster separation or are well-separated. Different values of the overlap and eccentricity constraints help further modulate the level of separation and the minimum cluster eccentricity. HAWKS is run 7 times for each of the 64 unique combinations of parameters listed above, resulting in 448 datasets.

https://github.com/sea-shunned/hawks configs

As we aim to measure diversity in clustering performance, we need an objective way to measure this. We therefore use the ARI (see Section III-C2) to compare the ground truth (which we know for all datasets used here) with the assignments from each algorithm. For the clustering algorithms themselves, we select four well-established algorithms with distinct properties and inductive biases, allowing us to assess the diversity of challenges that the datasets pose. These are:
• Average-linkage: a hierarchical clustering method that uses the average distance between all points in a cluster when deciding which are the closest (and thus should be merged).
• Gaussian mixture models (GMM): probabilistic models which try to represent subpopulations (of the data) through a number of Gaussian distributions.
To obtain a crisp clustering, each point is assigned to the cluster with the highest probability.
• K-Means++: proposed by Arthur and Vassilvitskii [52] to improve K-Means; its initialization scheme probabilistically selects cluster centres such that points further away from existing cluster centres are more likely to be selected.
• Single-linkage: in contrast to average-linkage, single-linkage considers the distance between two clusters to be the shortest distance from any member of one cluster to any member of the other.

To assess the maximum potential of each algorithm, we provide the true number of clusters for each dataset. Owing to the propensity of the linkage algorithms (particularly single-linkage) to be side-tracked by singleton clusters, we also run average- and single-linkage with double the true number of clusters [53].
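A sketch of this evaluation protocol, using the scikit-learn implementations of the four algorithms; the synthetic data and parameters below are illustrative stand-ins for the benchmark datasets:

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Placeholder dataset with a known ground truth.
X, truth = make_blobs(n_samples=500, centers=4, random_state=0)
K = 4  # the true number of clusters is provided to each algorithm

predictions = {
    "average-linkage": AgglomerativeClustering(n_clusters=K, linkage="average").fit_predict(X),
    "GMM": GaussianMixture(n_components=K, random_state=0).fit_predict(X),
    "K-Means++": KMeans(n_clusters=K, init="k-means++", n_init=10, random_state=0).fit_predict(X),
    "single-linkage": AgglomerativeClustering(n_clusters=K, linkage="single").fit_predict(X),
}
# ARI against the ground truth, as used throughout the experiments.
scores = {name: adjusted_rand_score(truth, pred) for name, pred in predictions.items()}
```

The same score difference between two algorithms is what the Versus fitness of Eq. (4) maximizes.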
4) Challenging specific algorithms (Versus mode):
To ascertain the capabilities of HAWKS’
Versus mode, we run each of the four clustering algorithms against the others in a head-to-head, with each in turn taking the role of ‘winner’ and ‘loser’ (referring to Eq. 4, we use the format A_w vs. A_l). Some pairings of algorithms (e.g. K-Means++ vs. GMM) are likely to be more competitive, due to shared capabilities and inductive biases. We expect this to translate into a reduction of the performance differences that can be observed (and thus a weaker fitness gradient). While the constraints are still important in such a scenario, we wish to avoid situations where the optimization is mainly driven by reducing the constraint penalty. Rather than simply removing the constraints, we therefore increase P_f to emphasize the production of a difference in algorithmic performance. This provides a useful lever for weighting the maximization of the performance difference against producing datasets that do not sacrifice cluster structure (e.g. through a high overlap). As such, we use a raised value of P_f in all head-to-head runs. Further adjustment of this parameter is encouraged for complex algorithm pairings.

Finally, all experiments presented for the Versus mode are run using K = 5 clusters and N = 2000 data points. The dimensionality is consistently set to D = 2, ensuring a straightforward visualization of the datasets and enabling us to observe the properties of these datasets without concerns about information loss through projection.

V. EXPERIMENTAL RESULTS
In this section, we explore HAWKS' ability to produce datasets that exhibit a broad range of properties and pose challenges to different clustering algorithms. Separate results for the
Index and
Versus modes highlight the general flexibility of our framework.
A. Analyzing benchmark dataset diversity (Index mode)
This section presents the instance space constructed from our set of 7 problem features applied to the 1,176 datasets from 6 popular collections/generators and our generator HAWKS. We assess diversity by considering (i) the variation observed across problem features and (ii) the variation observed in the performance of 4 clustering algorithms.

First, we use the instance space to understand how the datasets are spread across the space, and whether there are distinct patterns either in the sources of the datasets or in the performance of the clustering algorithms. The two principal components of the instance space shown in Fig. 2 do not account for all of the variance, and so there is some information loss in the projection. Nevertheless, it is evident that the components capture a sufficient proportion of the variance to highlight some key differences between datasets.

The instance space in Fig. 2a uses colour and marker coding to indicate which collection each dataset comes from. Visualizations of how each problem feature varies across the space can be found in Fig. S-??. The distinct spread of HAWKS datasets across the central part of the space is encouraging and highlights an appropriate level of diversity in terms of the problem features. Furthermore, the interpretation of the principal components (in terms of the underlying problem features) provides clear guidance on additional experiments that could be conducted to expand coverage in various directions. In contrast, the QJ datasets expand across a narrow band, indicating a lack of variance across the problem features. The SIPU datasets show a strong banding of instances, indicating that there is little variance among the datasets from each configuration (the separated datasets at the bottom of the space are the ‘G2 sets’, which have a much higher dimensionality than other datasets). The
UCI datasets are spread across the upper-half of the space, though this is unfortunately due to a lack ofstructure (which we later explore when looking at clusteringalgorithm performance), as their higher connectivity indicatesthat the labels do not line up with a spatial perspective ofclustering (shown in Fig. S- ?? ). The UKC datasets do not seemto represent anything extraordinary with regards to the problemfeatures we use here, leading to the conclusion that either thesynthetic datasets used here are not too dissimilar to real-worlddata or our set of problem features does not capture someaspect of complexity that they uniquely exhibit. Notably, the HK datasets are distinct from every other collection, indicatingthat they have unique characteristics. As shown in Fig. S- ?? ,the main difference is the average eccentricity of the clusters,which is higher than in any other collection. As these datasets PC1 P C SourceHAWKSHKQJSIPUUCIUKC (a) Dataset source
Fig. 2. Two instance spaces, with the source of the dataset (a) or the clustering algorithm with the highest ARI (b) highlighted. The 7 problem features (connectivity, dimensionality, average eccentricity, entropy, number of clusters, silhouette width (average), and silhouette width (standard deviation)) are projected down to 2D using PCA.

originate from the "ellipsoidal" generator, which was designed exclusively to create eccentric clusters in higher dimensions, this is unsurprising. Of interest, however, is whether this leads to a distinct difference in the relative performance of clustering algorithms on that benchmark suite.

To consider this aspect, Fig. 2b shows the instance space again, but with the clustering algorithm achieving the highest ARI for a given dataset highlighted. The ‘tied’ category is used when at least two algorithms were able to achieve the same ARI, which in every case was 1 and thus the dataset was trivial to cluster. This visualization permits easy identification of potential footprints, though the full information is tabulated in Table I, for completeness. It is evident that there are some areas of the space that correlate with a high performance of a particular algorithm; this happens for the HK benchmark suite in particular, where the highly ellipsoidal clusters consistently favour GMM or average-linkage with the higher setting of K.

In the more central part of the instance space, however, such footprints are unclear, indicating a complex performance landscape. This could be a consequence of the projection
TABLE I. Number of times (as a percentage of the total number of datasets for each source) each algorithm achieved the highest ARI for a given dataset. The best performing algorithm on each data source is highlighted in bold. (Columns: Average-Linkage, Average-Linkage (2K), GMM, K-Means++, Single-Linkage, Single-Linkage (2K), Tied; rows: HAWKS, HK [11], QJ [38], SIPU [36], UCI [35], UKC [51].)
Fig. 3. Clustering performance for each algorithm for each set of datasets.

(which may lose too much information to fully distinguish the datasets), or may point to the role of other features that have not yet been captured here but distinctly impact performance. Notably, variation in the identity of the best performing algorithm is the most pronounced for the HAWKS benchmark (see Table I), which is the only collection of datasets where every algorithm was best for a particular dataset. As this was achieved through just a few parameter settings, this is encouraging for the potential of HAWKS to generate diverse datasets.

To further examine the performance variation of the clustering algorithms across these datasets, Fig. 3 shows boxplots for the aggregated performance of each algorithm on each group of datasets. Here, larger boxplots indicate a greater variety of performance for that clustering algorithm, which is preferred. For HAWKS, the boxplots indicate that its datasets elicit a broad range of performance across all algorithms, though the high median ARIs for all algorithms indicate that in general the datasets were not that difficult. As we discourage overlap, and half of the datasets were optimized to have a high silhouette width, this is somewhat expected and could be addressed by revised parameter choices.

The near-perfect average performance of all algorithms but single-linkage on the SIPU datasets indicates their relative simplicity. Similarly, the poor average performance across the
UCI datasets is consistent with the low connectivity and silhouette width values previously observed in the instance space, and points to weakly defined structure of the ground truth. The HK datasets show a reasonable diversity of performance, and the significantly lower mean ARI indicates that these are much harder datasets.

Fig. 4. Critical difference (CD) diagrams, showing the mean rank (in terms of ARI) for the HAWKS (a) and HK (b) datasets, for each algorithm. Algorithms connected by solid lines are not significantly different according to a two-tailed Nemenyi test. CD diagrams for the other dataset sources can be found in Fig. S-??.

This may be in part due to the much higher eccentricity of these clusters, but also potentially due to a greater variance in the silhouette width (Fig. S-??), indicating that many data points on the edges of clusters are closer to the points in other clusters than to points in their own. The high performance of all algorithms on the UKC datasets indicates that the clusters in this real-world dataset are generally well-defined. This is consistent with the instance space, which provided no evidence of unusual complexity.

To include considerations of significance in our analysis, we follow the approach outlined by Demšar [54] to compare multiple methods across multiple datasets. In brief, a Friedman test is used to rank each competing algorithm for each dataset, where the null hypothesis is that all algorithms have equal ranks. As rejection indicates that at least one algorithm is significantly different, the (two-tailed) Nemenyi test [55] is used as a post-hoc test to ascertain which algorithm differs, by calculating the critical difference (CD): the minimum amount by which two average ranks must differ to be significantly (p < 0.05) different. We illustrate the results using CD diagrams, which show the average rank of each algorithm across all datasets, with solid lines connecting algorithms whose difference in rank is less than the CD.
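The ranking step behind these CD diagrams can be sketched as follows. The ARI matrix here is a random placeholder, and the Nemenyi post-hoc itself is not part of SciPy (third-party packages such as scikit-posthocs provide it):

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical ARI matrix: rows = datasets, columns = algorithms.
rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, size=(20, 4))

# Friedman test: null hypothesis is that all algorithms have equal ranks.
stat, p_value = friedmanchisquare(*(scores[:, j] for j in range(scores.shape[1])))

# Average rank per algorithm (rank 1 = highest ARI), as plotted on a CD diagram.
avg_ranks = np.vstack([rankdata(-row) for row in scores]).mean(axis=0)
```

If the Friedman null hypothesis is rejected, pairs of algorithms whose average ranks differ by more than the critical difference are deemed significantly different.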
Well-ordered rankings of algorithms indicate a lack of variance in performance (as an algorithm is consistently bad or good), whereas if all algorithms had an average rank of 3.5 (and thus clustered in the middle of the CD diagram), this would show that the datasets have an equal spread of difficulty for these clustering algorithms.

We show the CD diagrams for the HAWKS and HK generators in Fig. 4 (as these two generators showed the greatest diversity in the boxplots); the remaining CD diagrams can be found in Section ?? of the supplementary material. As evident from the CD diagram, the ranks of the algorithms are more similar for HAWKS, whereas there is a clearer superiority of a subset of algorithms on the HK datasets. The best-performing algorithm for HAWKS was average-linkage, which highlights the variety of cluster structures that can be generated (as the representation uses purely Gaussian clusters, one might have expected GMM to perform best on average). The eccentricity of the generated clusters is reflected in the higher average rank of GMM over K-Means++. Furthermore, the poor performance of single-linkage (with both settings of K) indicates that there is (on average) insufficient cluster separation to avoid the 'chaining' effect [53], where at least one pair of points in different clusters is closer than a pair of points within a cluster, thereby forming a 'bridge' between clusters.

As shown in Table I, all of the generators struggle to produce datasets on which single-linkage is uniquely better, suggesting a possibly lower utility of this algorithm in general. To investigate this further, and to create a fully comprehensive benchmark, it would be of interest to more directly generate datasets that favour single-linkage. In contrast to existing generators, the Versus mode in HAWKS facilitates this direct generation, and the results from this mode are discussed in the next section.
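The ARI [45] is the external performance measure used throughout this comparison. A minimal, dependency-light sketch of the Hubert–Arabie adjusted Rand index (illustrative only, not the experimental code) computes it from the pair-counting contingency table:

```python
import numpy as np

def comb2(x):
    """Number of unordered pairs, C(x, 2), applied elementwise."""
    return x * (x - 1) / 2.0

def adjusted_rand_index(labels_true, labels_pred):
    """Hubert-Arabie ARI: chance-corrected agreement between two partitions."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    classes, true_idx = np.unique(labels_true, return_inverse=True)
    clusters, pred_idx = np.unique(labels_pred, return_inverse=True)
    # Contingency table: counts of points shared by each (class, cluster) pair
    table = np.zeros((classes.size, clusters.size))
    np.add.at(table, (true_idx, pred_idx), 1)
    sum_comb = comb2(table).sum()
    sum_a = comb2(table.sum(axis=1)).sum()
    sum_b = comb2(table.sum(axis=0)).sum()
    n = labels_true.size
    expected = sum_a * sum_b / comb2(n)   # expected index under chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:             # degenerate partitions
        return 1.0
    return (sum_comb - expected) / (max_index - expected)
```

Note that the ARI is invariant to label permutations and is at most 1 (identical partitions), with values near 0 for chance-level agreement.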
B. Generating datasets that challenge specific algorithms (Versus mode)
In this section, we present the results of running HAWKS in Versus mode, such that we evolve datasets to directly maximize the performance difference between pairs of algorithms. As seen in Section V-A, most current generators have some bias towards a particular algorithm, and no generator (except HAWKS, though not consistently so) was able to produce datasets that were uniquely suited to single-linkage.

First, we need to look at the broad capability of the
Versus mode in producing a performance differential in the various head-to-heads. In Fig. 5a, we can see a grid of plots showing the performance (ARI) of every algorithm against every other. Each line (in the off-diagonal plots) represents the best dataset from a single run, showing the ARI for the 'winning' (left) and 'losing' (right) algorithms. Here, the angle of the lines indicates the magnitude of the performance difference, and the spread shows the consistency of HAWKS across runs. The plots on the diagonal aggregate the performance for that particular algorithm; e.g., the bottom-right plot indicates that HAWKS was able to produce datasets that single-linkage performed both very well and very poorly on, dependent on whether it was designated as the winner or loser of the head-to-head.

There are some clear differences between certain pairings of algorithms. As hinted in Section V-A, HAWKS is consistently able to generate a maximal (i.e. φ(A_w) − φ(A_l) ≈ 1) performance difference when single-linkage is A_l, but the performance difference appears minimal for single-linkage vs. average-linkage, indicating that any weaknesses specific to average-linkage are not weaknesses that single-linkage can exploit. The average ARI for average-linkage, GMM, and K-Means++ when set as the 'loser' (A_l) indicates a difficulty in consistently generating datasets that these algorithms perform very poorly on. This is not unexpected, given the generating model used. As long as clusters are not fully overlapping (which we discourage with our overlap constraint), we expect these algorithms to identify some elements of the clusters, making an ARI close to 0 unlikely.

To evaluate the influence of initialization on these results, in Fig. 5b the stochastic algorithms (GMM and K-Means++) are re-run with a different initialization.
While K-Means++ was largely unchanged, there was an average ARI increase (as A_l) of 0.13 for GMM, highlighting that in some cases the lower performance was due purely to a poor initialization. The average ARI did, however, slightly decrease (as A_w), highlighting the potential disadvantage of these stochastic algorithms over the linkage-based algorithms, as they converge to local minima. This suggests another potential use case for HAWKS. When given an infinite budget of initializations and picking the best, we expect these algorithms to perform better. The robustness of the initialization method can be assessed, however, by investigating how often the algorithm is still able to achieve good performance with a limited budget.

In order to investigate the ability of the Versus mode to grant algorithmic insight, we need to inspect the structures HAWKS discovered for different algorithm combinations. For this, it is important to identify where HAWKS struggles to generate datasets that favour one algorithm over another. We can then try to establish whether this is due to the superiority of one algorithm over another, or the inability of HAWKS to generate structures with properties that would differentiate them. For brevity, the following sections discuss some interesting examples for a subset of the scenarios (shown in Fig. 6). Further examples of the algorithm pairings can be found in the supplementary material (Section ??).
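The Versus-mode objective described above can be sketched as a thin wrapper (a simplified illustration with our own names; the full framework additionally applies its constraints during evolution):

```python
def versus_fitness(data, y_true, algo_winner, algo_loser, score):
    """Versus-mode fitness of a candidate dataset: phi(A_w) - phi(A_l),
    where phi is an external measure such as the ARI.

    algo_winner / algo_loser: callables mapping data -> predicted labels.
    score: callable mapping (true labels, predicted labels) -> float.
    """
    return score(y_true, algo_winner(data)) - score(y_true, algo_loser(data))
```

Maximizing this quantity drives evolution towards datasets that A_w solves and A_l fails; values near 1 correspond to the φ(A_w) − φ(A_l) ≈ 1 cases observed when single-linkage is A_l.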
1) GMM vs. single-linkage:
Owing to the Gaussian representation that HAWKS uses, and the known issues of single-linkage, this scenario is expected to provide a large performance difference. Fig. 6a shows that HAWKS achieves this performance difference by exploiting the aforementioned 'chaining' effect of single-linkage [53]. Discovering this requires iterative movement of the clusters in order for the data points to be close enough to induce this effect, supporting the utility of our mutation operators' nuanced ability to adjust the location of clusters.
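The chaining effect can be reproduced on a hand-constructed one-dimensional example (illustrative only, not a HAWKS-evolved dataset): a dense 'bridge' of points makes the two-cluster single-linkage cut split off a single peripheral point instead of separating the two true clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Two compact 1-D clusters; cluster A has one slightly peripheral member
cluster_a = [-1.2, 0.0, 0.1, 0.2, 0.3]
cluster_b = [3.3, 3.4, 3.5, 3.6]
bridge = [0.8, 1.3, 1.8, 2.3, 2.8]  # dense chain of points connecting A and B

def single_link_cut(points, k):
    """Single-linkage clustering cut into k clusters."""
    X = np.asarray(points, dtype=float).reshape(-1, 1)
    return fcluster(linkage(pdist(X), method="single"), t=k, criterion="maxclust")

# Without the bridge, the largest single-link gap lies between A and B,
# so the two-cluster cut recovers the blobs
labels = single_link_cut(cluster_a + cluster_b, k=2)

# With the bridge, every A-to-B gap (0.5) is smaller than the 1.2 gap to the
# peripheral point -1.2, so the cut isolates that single point instead
labels_bridged = single_link_cut(cluster_a + bridge + cluster_b, k=2)
```

Since single-linkage merges through the smallest inter-cluster distance, the bridge 'chains' the two blobs together, and the final (largest) merge is relegated to an unhelpful within-cluster gap.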
2) Single-linkage vs. GMM:
The reverse scenario should be much harder for HAWKS, as GMM is largely insensitive to eccentricity and naturally fits our cluster representation. As exemplified in Fig. 6b (and observed in the other datasets produced), HAWKS tends to place large clusters far away from several smaller compact clusters in order to increase the chance of a poor initialization from GMM, exploiting the stochasticity of this method.
3) GMM vs. K-Means++: In Fig. 6c we can see that a key exploit, as discovered by HAWKS, is the inability of K-Means++ to handle eccentric clusters. The example highlights significant differences between the use of mixture models and an algorithm relying on assignment to the closest centroid. A clear performance differential is found on this dataset, despite basic similarities in the inductive biases of the two algorithms. (Each fitness evaluation uses a different initialization for GMM and K-Means++; otherwise, HAWKS tacitly exploits this knowledge by moving the clusters into the static initial centroid locations. This maximizes the performance difference, but represents an artefact of the setup rather than a generic property.)

Fig. 5. Performance (ARI) for each algorithm as the winner (row) and loser (column). Each line represents the best individual from a single run, connecting the ARI obtained for A_w and A_l. On the diagonal, the average (and standard deviation) performance is aggregated for that individual as the winner and loser, indicating the overall capability of HAWKS to produce datasets that are simple and difficult for that algorithm. The stochastic algorithms (GMM and K-Means++) are re-run with a different initialization (as indicated by the solid lines, with the original results as dashed lines) in (b) to further measure robustness.
4) K-Means++ vs. GMM: As both algorithms are well-suited to compact clusters, eccentricity is not a characteristic that can be used to differentiate performance in this scenario. Here, HAWKS exploits GMM's previously mentioned weakness (of sub-dividing a single large cluster) during its initialization stage. K-Means++ is less sensitive to this problem due to its improved initialization routine. This shows the utility of HAWKS in identifying the relative strengths and weaknesses of specific algorithms, which could aid algorithmic development (e.g. when empirically comparing initialization schemes).

The datasets in the Versus mode are optimized towards a performance differential between the algorithms, rather than towards a cluster structure specified by the cluster validity index (as done in the
Index mode). It is therefore of interest to investigate how these datasets compare in terms of problem feature diversity. For this, we add the 30 datasets from each of the 12 head-to-heads to the instance space created in Section V-A (using the same principal components). In Fig. 7, we have highlighted the datasets from the
Index and
Versus modes (with the other datasets shown in grey for reference). Clearly, we are able to generate datasets that are notably different in terms of their problem features (and thus properties). In particular, there are many datasets in the region where previously only HK produced datasets, further highlighting the flexibility of our generating mechanism in covering additional regions as the objective function, clustering algorithms, and other parameters are varied.

VI. CONCLUSION
Clustering is a vital tool for pattern discovery, but it is often unclear which clustering algorithm is the most appropriate for a given dataset. An optimal choice requires an accurate understanding of the data properties as well as of the strengths and weaknesses of candidate algorithms. Both types of information are difficult to come by in typical real-world settings.

Synthetic benchmark datasets play an important role in improving our understanding of the former, i.e. in examining the specific strengths and capabilities of a given clustering method. Their specific advantage is the availability of a known generating model, which allows researchers to relate aspects of the true cluster structure to algorithm performance. Unfortunately, available synthetic benchmarks for cluster analysis cover a limited variety of structural aspects, and no existing generators have been designed with wider flexibility in mind.

Our framework, HAWKS, employs the power of an EA to better meet the challenges highlighted above, and in particular to generate more diverse collections of benchmarks. When compared to existing clustering benchmarks, HAWKS is found to generate datasets exhibiting more feature diversity and eliciting more variation in algorithm performance. Future work could improve the ability of the instance space to distinguish algorithmic footprints by further enriching the set of problem features and improving the projection methodology.

Finally, HAWKS can be modified to directly generate datasets that are either simple or difficult for a given algorithm, facilitating a deeper understanding of existing algorithms and
Fig. 6. Examples of datasets for the listed head-to-heads. Each figure shows the ground truth (left column), and the cluster assignment for each algorithm with the associated ARI (middle and right columns). (a) GMM vs. single-linkage: GMM ARI 1.000, single-linkage ARI 0.003. (b) Single-linkage vs. GMM: single-linkage ARI 0.929, GMM ARI 0.630. (c) GMM vs. K-Means++: GMM ARI 0.962, K-Means++ ARI 0.316. (d) K-Means++ vs. GMM: K-Means++ ARI 0.996, GMM ARI 0.593.
Fig. 7. Instance space created in Section V-A, with the datasets produced by the two modes of HAWKS highlighted (the 'Other' points are the remaining datasets).

potentially informing algorithm development. This provides a new avenue for "controlled experimentation" with clustering algorithms, which does not rely on the use of over-simplified toy datasets. Investigations into the use of this
Versus mode in conjunction with more complex clustering algorithms could further test HAWKS' capabilities, and may provide novel insights into the inductive biases and performance of these algorithms.

REFERENCES

[1] J. Handl, J. D. Knowles, and D. B. Kell, "Computational cluster validation in post-genomic data analysis," Bioinformatics, vol. 21, no. 15, pp. 3201–3212, 2005.
[2] U. Maulik, S. Bandyopadhyay, and A. Mukhopadhyay, Multiobjective Genetic Algorithms for Clustering – Applications in Data Mining and Bioinformatics. Springer, 2011.
[3] R. Xu and D. C. Wunsch, "Clustering algorithms in biomedical research: A review," IEEE Reviews in Biomedical Engineering, vol. 3, pp. 120–154, 2010.
[4] M. Ahmed, A. N. Mahmood, and J. Hu, "A survey of network anomaly detection techniques," Journal of Network and Computer Applications, vol. 60, pp. 19–31, 2016.
[5] Z. Ma, J. M. R. Tavares, R. N. Jorge, and T. Mascarenhas, "A review of algorithms for medical image segmentation and their applications to the female pelvic cavity," Computer Methods in Biomechanics and Biomedical Engineering, vol. 13, no. 2, pp. 235–246, 2010.
[6] C. Hennig, "What are the true clusters?" Pattern Recognition Letters, vol. 64, pp. 53–62, 2015.
[7] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.
[8] U. von Luxburg, R. C. Williamson, and I. Guyon, "Clustering: Science or art?" in Unsupervised and Transfer Learning – Workshop held at ICML 2011, ser. JMLR Proceedings, vol. 27. JMLR.org, 2012, pp. 65–80. [Online]. Available: http://proceedings.mlr.press/v27/luxburg12a.html
[9] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Pérez, and I. Perona, "An extensive comparative study of cluster validity indices," Pattern Recognition, vol. 46, no. 1, pp. 243–256, 2013.
[10] J. N. Hooker, "Testing heuristics: We have it all wrong," Journal of Heuristics, vol. 1, no. 1, pp. 33–42, 1995.
[11] J. Handl and J. D. Knowles, "Improvements to the scalability of multiobjective clustering," in Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2005. IEEE, 2005, pp. 2372–2379.
[12] K. Smith-Miles and T. T. Tan, "Measuring algorithm footprints in instance space," in Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2012. IEEE, 2012, pp. 1–8.
[13] N. Macià and E. Bernadó-Mansilla, "Towards UCI+: A mindful repository design," Information Sciences, vol. 261, pp. 237–262, 2014.
[14] O. Mersmann, M. Preuss, and H. Trautmann, "Benchmarking evolutionary algorithms: Towards exploratory landscape analysis," in International Conference on Parallel Problem Solving from Nature 2010. Springer, 2010, pp. 73–82.
[15] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
[16] D. H. Wolpert, "The lack of a priori distinctions between learning algorithms," Neural Computation, vol. 8, no. 7, pp. 1341–1390, 1996.
[17] J. R. Rice, "The algorithm selection problem," 1976, vol. 15, pp. 65–118.
[18] K. Smith-Miles, "Cross-disciplinary perspectives on meta-learning for algorithm selection," ACM Computing Surveys, vol. 41, no. 1, pp. 6:1–6:25, 2008.
[19] K. Smith-Miles, D. Baatar, B. Wreford, and R. Lewis, "Towards objective measures of algorithm performance across instance space," Computers & Operations Research, vol. 45, pp. 12–24, 2014.
[20] K. Smith-Miles and L. Lopes, "Measuring instance difficulty for combinatorial optimization problems," Computers & Operations Research, vol. 39, no. 5, pp. 875–889, 2012.
[21] K. Smith-Miles and S. Bowly, "Generating new test instances by evolving in instance space," Computers & Operations Research, vol. 63, pp. 102–113, 2015.
[22] M. A. Muñoz, L. Villanova, D. Baatar, and K. Smith-Miles, "Instance spaces for machine learning classification," Machine Learning, vol. 107, no. 1, pp. 109–147, 2018.
[23] M. Birattari, Tuning Metaheuristics – A Machine Learning Perspective, ser. Studies in Computational Intelligence. Springer, 2009, vol. 197.
[24] F. Hutter, L. Kotthoff, and J. Vanschoren, Eds., Automated Machine Learning – Methods, Systems, Challenges, ser. The Springer Series on Challenges in Machine Learning. Springer, 2019.
[25] M. López-Ibáñez, J. Dubois-Lacoste, L. P. Cáceres, M. Birattari, and T. Stützle, "The irace package: Iterated racing for automatic algorithm configuration," Operations Research Perspectives, vol. 3, pp. 43–58, 2016.
[26] P. Kerschke and M. Preuss, "Exploratory landscape analysis: Advanced tutorial at GECCO 2017," in Genetic and Evolutionary Computation Conference, 2017, Companion Material Proceedings, P. A. N. Bosman, Ed. ACM, 2017, pp. 762–781.
[27] L. Kotthoff, P. Kerschke, H. H. Hoos, and H. Trautmann, "Improving the state of the art in inexact TSP solving using per-instance algorithm selection," in Learning and Intelligent Optimization – 9th International Conference, LION 2015, Revised Selected Papers, ser. Lecture Notes in Computer Science, vol. 8994. Springer, 2015, pp. 202–217.
[28] O. Mersmann, B. Bischl, H. Trautmann, M. Preuss, C. Weihs, and G. Rudolph, "Exploratory landscape analysis," in Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2011, N. Krasnogor and P. L. Lanzi, Eds. ACM, 2011, pp. 829–836.
[29] G. C. Bowker and S. L. Star, Sorting Things Out: Classification and its Consequences. MIT Press, 2000.
[30] B. S. Everitt, S. Landau, M. Leese, and D. Stahl, Cluster Analysis, 5th ed. Wiley Publishing, 2011.
[31] S. Ben-David, "Clustering — what both theoreticians and practitioners are doing wrong," in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[32] Advances in Neural Information Processing Systems, 2003, pp. 463–470.
[33] D. G. Ferrari and L. N. de Castro, "Clustering algorithm selection by meta-learning systems: A new distance-based problem characterization and ranking combination methods," Information Sciences, vol. 301, pp. 181–194, 2015.
[34] R. G. F. Soares, T. B. Ludermir, and F. de A. T. de Carvalho, "An analysis of meta-learning techniques for ranking clustering algorithms applied to artificial data," in Artificial Neural Networks – ICANN 2009, ser. Lecture Notes in Computer Science, vol. 5768. Springer, 2009, pp. 131–140.
[35] D. Dheeru and E. Karra Taniskidou, "UCI machine learning repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml
[36] P. Fränti and S. Sieranoja, "K-means properties on six clustering benchmark datasets," Applied Intelligence, vol. 48, no. 12, pp. 4743–4759, 2018.
[37] J. Handl and J. Knowles, "Feature subset selection in unsupervised learning via multiobjective optimization," International Journal of Computational Intelligence Research, vol. 2, no. 3, pp. 217–238, 2006.
[38] W. Qiu and H. Joe, "Generation of random clusters with specified degree of separation," Journal of Classification, vol. 23, no. 2, pp. 315–334, 2006.
[39] ——, "Separation index and partial membership for clustering," Computational Statistics & Data Analysis, vol. 50, no. 3, pp. 585–603, 2006.
[40] C. Shand, R. Allmendinger, J. Handl, A. M. Webb, and J. Keane, "Evolving controllably difficult datasets for clustering," in Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2019. ACM, 2019, pp. 463–471.
[41] G. W. Stewart, "The efficient generation of random orthogonal matrices with an application to condition estimators," SIAM Journal on Numerical Analysis.
[42] arXiv preprint arXiv:2002.01822, 2020.
[43] P. J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[44] J. Xie, R. Girshick, and A. Farhadi, "Unsupervised deep embedding for clustering analysis," in Proceedings of the 33rd International Conference on Machine Learning, ser. ICML'16. JMLR.org, 2016, pp. 478–487.
[45] L. Hubert and P. Arabie, "Comparing partitions," Journal of Classification, vol. 2, no. 1, pp. 193–218, 1985.
[46] T. P. Runarsson and X. Yao, "Stochastic ranking for constrained evolutionary optimization," IEEE Transactions on Evolutionary Computation, vol. 4, no. 3, pp. 284–294, 2000.
[47] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Proceedings of the International Conference on Neural Networks (ICNN'95). IEEE, 1995, pp. 1942–1948.
[48] B. Li, K. Tang, J. Li, and X. Yao, "Stochastic ranking algorithm for many-objective optimization based on multiple indicators," IEEE Transactions on Evolutionary Computation, vol. 20, no. 6, pp. 924–938, 2016.
[49] J. Handl and J. D. Knowles, "An evolutionary approach to multiobjective clustering," IEEE Transactions on Evolutionary Computation, vol. 11, no. 1, pp. 56–76, 2007.
[50] C. H. Q. Ding and X. He, "K-nearest-neighbor consistency in data clustering: Incorporating local information into global optimization," in Proceedings of the 2004 ACM Symposium on Applied Computing (SAC), H. Haddad, A. Omicini, R. L. Wainwright, and L. M. Liebrock, Eds. ACM, 2004, pp. 584–589.
[51] M. Garza-Fabre, J. Handl, and J. D. Knowles, "An improved and more scalable evolutionary approach to multiobjective clustering," IEEE Transactions on Evolutionary Computation, vol. 22, no. 4, pp. 515–535, 2017.
[52] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007. SIAM, 2007, pp. 1027–1035. [Online]. Available: http://dl.acm.org/citation.cfm?id=1283383.1283494
[53] L. Hubert, "Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures," Journal of the American Statistical Association, vol. 69, no. 347, pp. 698–704, 1974.
[54] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006. [Online]. Available: http://jmlr.org/papers/v7/demsar06a.html
[55] P. Nemenyi, "Distribution-free multiple comparisons," Ph.D. dissertation, Princeton University, 1963.
HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis
– SUPPLEMENTARY MATERIAL –
Cameron Shand, Richard Allmendinger, Julia Handl, Andrew Webb, and John Keane
S-I. HAWKS DETAILS
A. Generating random cluster sizes to a pre-defined limit
To generate randomly sized clusters that sum to a given value with a minimum size, the following approach is used:

$$\mathbf{w} = [w_1, \ldots, w_K], \quad \mathbf{w} \sim \mathrm{Dir}(\boldsymbol{\alpha}), \quad \boldsymbol{\alpha} = (1, 1, \ldots, 1),$$

where $K$ is the number of clusters. For a given cluster $i$, the size is:

$$|C_i| = C_{\min} + w_i \left( N - K \cdot C_{\min} \right),$$

where $C_{\min}$ is the minimum size a cluster can be. This ensures a uniform distribution of cluster sizes that sum to $N$, which is not guaranteed when simply sampling random numbers and scaling them to sum to $N$ (as this no longer guarantees a uniform distribution).

B. Scaling the covariance matrix
We provide further details about the covariance mutation operator described in Section ??. To mutate the covariance matrix in HAWKS, two separate matrices are generated for the perturbation. The rotation matrix randomly rotates the cluster, whereas the scaling matrix modifies the magnitude of the principal semi-axes, modifying the area of the space that is covered by this distribution.

To avoid converging towards either spherical or highly eccentric clusters upon repeated application of the mutation operator, we ensure that the determinant of the covariance matrix remains the same after applying the scaling matrix. For this, let $x_i$ be the elements of a vector sampled from a Dirichlet distribution. Then

$$\sum_{i=1}^{D} x_i = 1 \;\Rightarrow\; \sum_{i=1}^{D} \left( x_i - \frac{1}{D} \right) = 0 \;\Rightarrow\; \exp\left( \sum_{i=1}^{D} \left( x_i - \frac{1}{D} \right) \right) = 1 \;\Rightarrow\; \prod_{i=1}^{D} \exp\left( x_i - \frac{1}{D} \right) = 1.$$

The determinant of a diagonal matrix is the product of the values on its diagonal; thus the scaling matrix, with diagonal entries $\exp(x_i - \frac{1}{D})$, has determinant 1.

C. Average eccentricity problem feature
In order to reduce the sensitivity to outliers and subspace clusters when calculating the average eccentricity, we use the subset of the eigenvalues (obtained via singular value decomposition) that accounts for 95% of the total sum of the eigenvalues. This is functionally identical to using the principal components (obtained via principal component analysis) that account for 95% of the variance.

S-II. MUTATING CLUSTERS IN HIGHER DIMENSIONS
Shand et al. [1] noted that the original mutation operator for the cluster means (µ) becomes less useful as the dimensionality increases due to the stochastic movement of the mean, as there are an increasing number of directions in which to move away from other clusters. This results in a bias towards increasing the silhouette width as the clusters drift apart. In this section, we propose and compare several operators that explicitly incorporate directionality into the random movement to avoid this bias.

Incorporating explicit directionality into the operator can directly guide whether clusters move away from or towards existing clusters, increasing and decreasing the silhouette width respectively. For this, the operator needs to incorporate the position of at least one other cluster in the random perturbation.

The previous [1] mutation operator was defined as:

$$\mu'^{(i)} \sim \mathcal{N}\left( \mu^{(i)}, s \right), \tag{S-1}$$

where $\mu'^{(i)}$ is the new mean, $\mu^{(i)}$ is the current mean, and $s$ is the width (variance) of the multivariate normal distribution. In other words, the new mean is sampled from a normal distribution centred on the current mean. Next, we define each of the proposed operators.

A. Mutation Operator Descriptions

1) "Rails" operator:
As a simple baseline to examine the utility of including directionality, this operator randomly selects another cluster, and the current cluster moves either towards or away from that cluster with a random weighting ($0 \leq w \leq 1$). To select a random cluster that is different from the current one, let $n$ be a random integer from the set $\{1, \ldots, K\} \setminus \{i\}$. The mean of the current cluster, $\mu^{(i)}$, can then be mutated as follows:

$$\mu'^{(i)} = \begin{cases} \mu^{(i)} + \left[ w \left( \mu^{(n)} - \mu^{(i)} \right) \right] & \text{if } p \leq 0.5 \\ \mu^{(i)} - \left[ w \left( \mu^{(n)} - \mu^{(i)} \right) \right] & \text{if } p > 0.5, \end{cases} \tag{S-2}$$

where $\mu^{(n)}$ is the mean of the random cluster $n$, $w \sim U(0, 1)$ is a random weight, $\mu^{(n)} - \mu^{(i)}$ is the difference between the cluster means, and $p \sim U(0, 1)$ is a random uniform probability that determines whether we mutate towards or away from the other cluster.
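A brief numpy sketch of this operator (our own illustration of Eq. S-2, not the HAWKS source). The mutated mean always lies on the line through the current mean and the randomly chosen other mean, which is the "rails" the name refers to:

```python
import numpy as np

def rails_mutation(means, i, rng=None):
    """'Rails' mutation (Eq. S-2): move cluster i towards or away from a
    randomly chosen other cluster by a random fraction of their separation."""
    rng = np.random.default_rng(rng)
    # Pick a random cluster n != i
    n = rng.choice([k for k in range(len(means)) if k != i])
    w = rng.uniform()                       # random weight, 0 <= w < 1
    step = w * (means[n] - means[i])
    sign = 1.0 if rng.uniform() <= 0.5 else -1.0  # coin-flip for direction
    return means[i] + sign * step
```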
2) PSO-inspired mutation with random directionality:
For a mutation that covers more of the space, rather than only the vectors between centroids, we take inspiration from another EA paradigm: particle swarm optimization (PSO). In the original PSO update, each particle's position is adjusted by incorporating its current position, the best position that particle has ever had, and the best position ever found by any particle. The latter two positions are weighted by random coefficients in order to create a random movement of the particles; for further details, see [2], [3]. We use this as inspiration by viewing the clusters as particles, and mutate the location of a cluster based on the position of another cluster and a global representative. As our fitness is derived from a combination of all particles, the original notions of personal and global best are not applicable here, but the concept of independently weighting both a single point and an aggregated one is.

By updating the mean using a randomly weighted combination of an existing cluster and the global mean (µ̄) of all clusters, we can create a random movement of the cluster that still takes the positions of the existing clusters into account. By incorporating the global mean, we can avoid generating well-separated groups of clusters that can deceive the silhouette width. To ensure that the location of the global mean is calculated in a way meaningful to the fitness, we calculate the mean across all data points (not across cluster means), as it is the former that is used in the calculation of the silhouette width (and thus the fitness). As a result, relative cluster sizes are incorporated into the mutation. The mean of the current cluster, $\mu^{(i)}$, is thus mutated using the following operator (referred to as PSO-random):

$$\mu'^{(i)} = \begin{cases} \mu^{(i)} + \left[ w_1 \left( \mu^{(n)} - \mu^{(i)} \right) + w_2 \left( \bar{\mu} - \mu^{(i)} \right) \right] & \text{if } p \leq 0.5 \\ \mu^{(i)} - \left[ w_1 \left( \mu^{(n)} - \mu^{(i)} \right) + w_2 \left( \bar{\mu} - \mu^{(i)} \right) \right] & \text{if } p > 0.5. \end{cases} \tag{S-3}$$
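A sketch of this operator (our illustration of Eq. S-3; names are ours). The global mean is passed in precomputed over all data points, as the text specifies:

```python
import numpy as np

def pso_random_mutation(means, i, global_mean, rng=None):
    """PSO-random mutation (Eq. S-3): perturb cluster i's mean towards or
    away from a random other cluster and the global mean of all data points,
    with independent random weights and a coin-flip for the direction."""
    rng = np.random.default_rng(rng)
    n = rng.choice([k for k in range(len(means)) if k != i])
    w1, w2 = rng.uniform(), rng.uniform()   # independent random weights
    step = w1 * (means[n] - means[i]) + w2 * (global_mean - means[i])
    sign = 1.0 if rng.uniform() <= 0.5 else -1.0
    return means[i] + sign * step
```

The PSO-informed variant below differs only in how the sign is chosen, replacing the coin-flip with the sign of s_all − s_t.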
3) PSO-inspired mutation with informed directionality:
When a coin-flip determines whether the cluster should move towards or away from existing clusters, obvious failure cases arise: the existing clusters may be very far apart when a closer structure is desired, and an unfavourable coin-flip moves the cluster further away, wasting a perturbation and thus an evaluation. By instead using the sign of the difference between the current silhouette width of the individual and the target ($s_{\text{all}} - s_t$), we can move the cluster centre in the direction that has the best chance of improving the fitness. The mean of the current cluster, $\mu^{(i)}$, can be mutated as follows:

$$\mu'^{(i)} = \begin{cases} \mu^{(i)} + \left[ w_1 \left( \mu^{(n)} - \mu^{(i)} \right) + w_2 \left( \bar{\mu} - \mu^{(i)} \right) \right] & \text{if } s_{\text{all}} > s_t \\ \mu^{(i)} - \left[ w_1 \left( \mu^{(n)} - \mu^{(i)} \right) + w_2 \left( \bar{\mu} - \mu^{(i)} \right) \right] & \text{if } s_{\text{all}} \leq s_t. \end{cases} \tag{S-4}$$

The addition of this information improves convergence in situations where $s_{\text{all}} - s_t$ is large, as the fitness is improved more quickly, but it adds a bias towards the fitness (as opposed to the constraints), which is typically controlled through the stochastic ranking ($P_f$). This operator is referred to as PSO-informed.
4) DE-inspired mutation:
Taking inspiration from yet another EA paradigm, differential evolution (DE), we view the existing clusters as individual vectors that can help in the creation of a "donor vector" (the new mean) from a "target vector" (the current mean). The classical DE mutation operator combines three existing individuals to create a new individual [4]–[6]. We adapt this to generate a new cluster mean from existing ones as follows:

$$\mu'^{(i)} = \mu^{(i)} + F \left( \mu^{(r_1)} - \mu^{(r_2)} \right), \tag{S-5}$$

where $i$, $r_1$, and $r_2$ are distinct cluster indices, and $F$ is a constant factor in the range $[0, 2]$. Unlike the previous operators, the randomness occurs only in terms of which other means are selected, such that the movement vector for the current mean is a fixed multiple ($F$) of the vector between the randomly selected means. As a result, the number of possible locations that a cluster can mutate to is finite and related to the number of clusters. This lack of flexibility may cost either convergence speed when larger jumps are needed, or nuance when small movements of the cluster are needed in the final generations (which is especially useful for adjusting the overlap between clusters).

B. Experimental Setup
To investigate the convergence properties of these operators, and whether they show a bias towards separating or bringing together clusters, we set up the following experiments. We set a low (0.2) and a high (0.9) s_t to optimize towards poorly- and well-separated clusters respectively, as well as initializing the clusters either together or far apart. These four scenarios test the general ability of the operators to move cluster centroids, and are tested using D ∈ { , } dimensions to check robustness at higher dimensionality. For each scenario, HAWKS is run 30 times to assess robustness over different initializations, and for 100 generations to assess the stability of the operators. For the DE-inspired operator, we use F = 1 to strike a balance between convergence speed and potential oscillation when clusters are close together.

C. Results
Fig. S-1 shows convergence plots for the four differentscenarios in D , showing the average silhouette width (andnot the fitness, to make it clear when s all is above or belowthe target) across the generations. The target, s t , is shown bythe dashed horizontal line.In Fig. S-1a, all mutation operators are quickly able to movethe clusters apart to increase the silhouette width, but we see adifference in the stability after this as some operators (predom-inantly DE and PSO-random) further increase the silhouettewidth above s t , highlighting a mechanistic bias. Fig. S-1bhas the same low s t , but the initial clusters are further apart(requiring the optimization to bring clusters together). The DE-inspired operator is unable to decrease the silhouette width,and the optimization is driven by minimizing the overlap EEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION 3
constraint (Fig. S-2b). While the PSO-random operator does not converge, the quick decrease in the silhouette width for the PSO-informed operator shows that this is not a mechanistic issue, but one of directionality. Mutually exclusive satisfaction of the fitness (s_t) and the overlap constraint prevents any operator from fully reaching s_t = 0.2. The slow silhouette width decrease with the original operator highlights the issue with a static step size. In Fig. S-1c–d, where the target silhouette width is high (s_t = 0.9), there is no significant difference between the operators beyond the speed of convergence, independent of the initialization used. None of the operators drift once converged, as this target also satisfies the overlap constraint. In Fig. S-2a, we can see the minimization of the overlap constraint, which is naturally greater for the operators that increased the silhouette width. This indicates a difference between the operators in the relative ease of satisfying s_t and minimizing the overlap, particularly with the lack of pressure (afforded by P_f = 0.5) towards either one.

Fig. S-1. Convergence plots showing the average (across all 30 runs) silhouette width across the generations when generating 2-dimensional datasets for our four scenarios: (a) clusters initialized overlapping, s_t = 0.2; (b) clusters initialized apart, s_t = 0.2; (c) clusters initialized overlapping, s_t = 0.9; (d) clusters initialized apart, s_t = 0.9. The dashed line shows the target silhouette width (s_t).
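The pressure between fitness and constraint governed by P_f comes from stochastic ranking, which HAWKS uses for constraint handling. The sketch below is a generic Runarsson–Yao-style implementation with illustrative names of our own choosing, not HAWKS' actual code; both the fitness (e.g. |s_all − s_t|) and the constraint violation (the overlap) are assumed to be minimized:

```python
import random

def stochastic_rank(pop, fitness, violation, p_f=0.5, rng=random):
    """Stochastic ranking: bubble-sort-like sweeps that compare adjacent
    individuals by fitness when both are feasible (zero violation) or with
    probability p_f, and by constraint violation otherwise."""
    idx = list(range(len(pop)))
    for _ in range(len(pop)):  # at most N sweeps
        swapped = False
        for i in range(len(pop) - 1):
            a, b = idx[i], idx[i + 1]
            both_feasible = violation[a] == 0 and violation[b] == 0
            if both_feasible or rng.random() < p_f:
                if fitness[a] > fitness[b]:  # minimizing fitness
                    idx[i], idx[i + 1] = b, a
                    swapped = True
            elif violation[a] > violation[b]:  # minimizing violation
                idx[i], idx[i + 1] = b, a
                swapped = True
        if not swapped:
            break
    return [pop[i] for i in idx]
```

With p_f = 0.5, an infeasible pair is equally likely to be compared by fitness or by violation, giving the "lack of pressure towards either one" described above.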
With such a low s_t, there is a strong inverse correlation between the overlap and the fitness. The PSO-informed operator has an explicit drive to avoid a drift of the silhouette width away from the target, but the vast difference in drift for the DE-inspired operator illustrates a clear behavioural difference. In Fig. S-2b, when the clusters are initialized apart (and thus the task is to bring them closer together), we can see that the DE-inspired operator does consistently minimize the overlap (as it erroneously increases the silhouette width as the clusters drift apart). The other operators also decrease the overlap, apart from PSO-informed, where the embedded preference towards the fitness disturbs the minimization of the overlap.

This preference is further illustrated in Fig. S-3, where the silhouette width and overlap of the best individuals from each run are plotted for the low target silhouette width (s_t = 0.2). The PSO-informed operator creates far less diversity in terms of the overlap, thus generating less diverse datasets. To further investigate these operators, we now look at the same scenarios in 50 dimensions.

Fig. S-4 shows convergence plots for the four different scenarios in 50 dimensions, which is where issues with the original operator were first identified. In Fig. S-4a, when the scenario is to keep clusters close together, the drift of the DE-inspired operator is more pronounced relative to the other operators, which remain mostly stable. The bias of the DE-inspired operator towards moving clusters further apart is further shown in Fig. S-4b, where the original operator
Fig. S-2. Convergence plot showing the average (across all 30 runs) of the overlap constraint across the generations when generating 2-dimensional datasets: (a) clusters initialized overlapping, s_t = 0.2; (b) clusters initialized apart, s_t = 0.2.
is also incapable of bringing clusters closer together. As the DE-inspired operator uses the direction and magnitude of the vector between two random clusters, it does not necessarily move the cluster being mutated in the direction of other clusters, an issue which is more prevalent in higher dimensions. As expected, the PSO-informed operator converges the fastest, whereas the "rails" and PSO-random operators are increasingly slow to decrease the silhouette width (largely due to trying to minimize the overlap).

The limited ability of our original operator to move clusters away from each other in higher dimensions is highlighted in Fig. S-4c, which clearly contrasts with the proposed operators, all of which converge significantly faster. Once again, utilizing the individual's current silhouette width allows the PSO-informed operator to consistently move the clusters apart, converging rapidly. The speed of convergence for the DE-inspired operator is likely due to its step size being based on the distance between two random clusters, thus enabling it to rapidly move clusters apart. In Fig. S-4d, when the clusters are initialized further apart, similar behaviour to the low s_t scenario is seen.

Fig. S-3. Scatter plot of the silhouette width and overlap constraint for each of the best individuals from every run for the low silhouette width target, shown for the PSO-random and PSO-informed operators and both initializations (overlapping and apart).
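The DE-inspired update (S-5), whose fixed step size drives the behaviour discussed above, can be sketched as follows. This is a minimal illustration with names of our own choosing, not HAWKS' implementation:

```python
import numpy as np

def de_mutate_mean(means, i, F=1.0, rng=None):
    """Sketch of the DE-inspired mutation (S-5): move cluster i's mean by
    F times the vector between two other, randomly chosen cluster means.

    Note the finite set of outcomes: with K clusters there are only
    (K-1)(K-2) possible destinations for a given mean, and the step size
    is entirely determined by the distance between the chosen pair."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick two distinct indices r1, r2, both different from i
    candidates = [k for k in range(len(means)) if k != i]
    r1, r2 = rng.choice(candidates, size=2, replace=False)
    return means[i] + F * (means[r1] - means[r2])
```

For example, with three cluster means there are only two possible destinations for a mutated mean (the two orderings of the remaining pair), which illustrates the lack of flexibility noted in Section S-II.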
The initial silhouette width is very close to the target, yet our original and the DE-inspired operators are still unable to significantly improve the fitness, as they begin above the target. It is likely that in this scenario the step size is too large for the DE-inspired operator to be useful, and too small and undirected for our original operator.

As such, in our experiments we elect to use the PSO-random operator, as it provides a more nuanced mechanism for moving cluster means, but avoids introducing additional bias which would otherwise need to be controlled by stochastic ranking. For scenarios where the fitness is prioritized, PSO-informed may be the more useful operator, but for general use the PSO-random operator is preferred.

S-III. ANALYZING BENCHMARK DATASET DIVERSITY (INDEX MODE)

In Section ??, we compare several collections of datasets to a set of datasets produced by HAWKS. Here, we provide some further information on the different properties of these datasets, and further supporting results.

A. Dataset collection details
Table S-I shows some basic parameters of the different collections of datasets. Where possible, values are condensed into a range (e.g. "6–11"), but are otherwise listed. Owing to the different nature of these datasets (some real-world, some from a generator), some values are specific to a single dataset, while others represent multiple instantiations from a generator.
B. Problem feature values across the instance space
Fig. ?? showed the instance space created for these datasets using our set of problem features. Fig. S-5 shows this instance space, but with the best ARI achieved by any algorithm and the values for each of the problem features highlighted. As we can see, for each problem feature there is a clear gradient across the space, highlighting the utility of using PCA to construct the instance space in terms of understanding how the problem features vary across it. Interestingly, there is also a gradient for the best ARI achieved, for which lower values are generally associated with a lower silhouette width, a higher standard deviation of the silhouette width, or a higher average cluster eccentricity. Overall, this highlights that using problem features specific to clustering helps to create
Fig. S-4. Convergence plots showing the average (across all 30 runs) silhouette width across the generations when generating 50-dimensional datasets for our four scenarios: (a) clusters initialized overlapping, s_t = 0.2; (b) clusters initialized apart, s_t = 0.2; (c) clusters initialized overlapping, s_t = 0.9; (d) clusters initialized apart, s_t = 0.9.

TABLE S-I
THE NUMBER OF DATA POINTS (N), CLUSTERS (K), DIMENSIONS (D), AND DATASETS FOR EACH COLLECTION OF DATASETS.

| Source    | N | K                          | D                                                        | Datasets |
| HK [7]    | ± | 10, 20, 40, 60, 80, 100, 120 | 20, 50, 100, 150, 200                                  | 350      |
| QJ [8]    | ± | 3, 6, 9                    | 5–24                                                     | 243      |
| SIPU [9]  | ± | 2, 15, 20, 35, 50          | …                                                        | …        |
| UCI [10]  | ± | 2, 3, 4, 6, 7, 8, 10, 11, 15 | 3, 4, 6–11, 13, 18, 19, 22, 30, 34, 44, 60, 90, 166    | 20       |
| UKC [11]  | ± | 10, 11, 12                 | 2                                                        | 8        |

The average (and standard deviation) are given, except for HAWKS, for which the values are targeted.

an instance space that correlates with algorithmic performance. As shown in Fig. ??, however, through either the projection method or an incomplete set of problem features (as discussed in Section ??, it is difficult to get a complete set), there is not a complete separation of regions where a single algorithm is superior.

Table S-II shows the mean and standard deviation of each of the problem features for every collection of datasets studied in this paper, clarifying the observations made for the instance space visualizations (Fig. S-5). The UCI datasets show a much higher connectivity, indicating that the nearest neighbours of data points are typically in a different cluster, and thus that these datasets have poor cluster structure.

We can compare the mean values of the problem features for the Index and Versus modes to see if the different optimization settings had a particular bias towards, e.g., a low connectivity. The main differences between these datasets, according to the problem features, are that the Versus datasets have a slightly higher average connectivity, much lower eccentricity, and higher variance in the silhouette width. The lower eccentricity can be explained through both the initialization used and the low dimensionality. As shown in our examples where highly eccentric clusters were seen, this did not diminish HAWKS' capability of evolving such clusters when needed.
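The construction of such an instance space can be sketched as follows: compute a feature vector per dataset, then project the standardized vectors onto the first two principal components. This is a minimal illustration using a subset of plausible problem features on toy datasets; the exact feature set and dataset collections used in the paper differ:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler

def problem_features(X, labels):
    """A small, illustrative subset of problem features: average silhouette
    width, its standard deviation, number of clusters, and dimensionality."""
    s = silhouette_samples(X, labels)
    return [s.mean(), s.std(), len(np.unique(labels)), X.shape[1]]

# Build a toy collection of datasets, then project their feature vectors
# onto the first two principal components to form an "instance space".
feats = []
for k in (2, 5, 10):
    X, y = make_blobs(n_samples=200, centers=k, n_features=4, random_state=k)
    feats.append(problem_features(X, y))
coords = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(feats))
```

Each dataset then corresponds to one 2-D point (`coords`), which can be coloured by a feature value or by the best ARI to produce plots such as Fig. S-5.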
C. Comparing performance diversity via critical difference diagrams
In the main paper, we showed the critical difference (CD) diagrams for datasets from the HAWKS and HK generators.
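The mean ranks underlying such CD diagrams can be computed as follows. This is a generic sketch: the `ari` matrix is hypothetical and stands in for the real per-dataset ARI scores, and SciPy's Friedman test serves as the omnibus test (the Nemenyi post-hoc test itself is not in SciPy, but is available in packages such as scikit-posthocs):

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical ARI scores: rows = datasets, columns = algorithms
ari = np.array([
    [0.90, 0.85, 0.40],
    [0.80, 0.95, 0.30],
    [0.70, 0.75, 0.20],
    [0.95, 0.60, 0.50],
])

# Rank algorithms within each dataset (rank 1 = best ARI, hence the negation)
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, ari)
mean_ranks = ranks.mean(axis=0)  # the values plotted on a CD diagram

# Friedman test: do the algorithms' rank distributions differ significantly?
stat, p_value = friedmanchisquare(*ari.T)
```

The `mean_ranks` vector gives each algorithm's position on the CD axis; algorithms whose mean ranks differ by less than the critical distance are joined by a solid line in Fig. S-6.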
Fig. S-5. The instance space created in Section ??, showing how the values of each of the 7 problem features and the best ARI found by any algorithm vary across the space: (a) ARI; (b) connectivity; (c) number of dimensions; (d) average eccentricity; (e) entropy of cluster sizes; (f) number of clusters; (g) silhouette width (average); (h) silhouette width (standard deviation).
TABLE S-II
THE MEAN AND STANDARD DEVIATION (SD) OF THE PROBLEM FEATURE VALUES FOR EVERY COLLECTION OF DATASETS, INCLUDING BOTH THE INDEX AND VERSUS MODES OF HAWKS. EACH CELL GIVES MEAN / SD.

| Collection       | Connectivity  | Number of Dimensions | Average Eccentricity | Cluster Size Entropy | Number of Clusters | Silh. width (average) | Silh. width (SD) |
| HAWKS (Index)†   | 0.114 / 0.198 | 26.000 / 24.027      | 5.692 / 2.568        | 0.884 / 0.126        | 17.500 / 12.514    | 0.675 / 0.225         | 0.226 / 0.…      |
| HAWKS (Versus)‡  | 0.144 / 0.175 | 2.000 / 0.000        | 2.551 / 1.309        | 0.825 / 0.074        | 5.000 / 0.000      | 0.712 / 0.188         | 0.300 / 0.…      |
| HK [7]           | 0.008 / 0.008 | 104.000 / 65.393     | 145.202 / 79.847     | 0.984 / 0.021        | 61.429 / 38.012    | 0.474 / 0.043         | 0.437 / 0.…      |
| QJ [8]           | 0.207 / 0.247 | 13.000 / 6.926       | 3.615 / 0.867        | 0.966 / 0.023        | 6.000 / 2.455      | 0.380 / 0.154         | 0.145 / 0.…      |
| SIPU [9]         | 0.091 / 0.209 | 191.346 / 308.120    | 3.411 / 3.008        | 1.000 / 0.000        | 3.411 / 6.280      | 0.634 / 0.225         | 0.091 / 0.…      |
| UCI [10]         | 0.789 / 0.409 | 28.350 / 39.125      | 20.716 / 20.706      | 0.897 / 0.116        | 4.900 / 3.712      | 0.084 / 0.279         | 0.375 / 0.…      |
| UKC [11]         | 0.001 / 0.001 | 2.000 / 0.000        | 1.938 / 0.177        | 0.997 / 0.002        | 10.875 / 0.641     | 0.780 / 0.062         | 0.232 / 0.…      |

† The 448 datasets generated in Section ??. ‡ The 360 datasets generated in Section ??.

For completeness, here in Fig. S-6 we show the CD diagrams for the other dataset collections. For the QJ datasets (Fig. S-6a), the compactness-based algorithms (GMM and K-Means++) are clearly the best-performing algorithms, with GMM nearly superior across all datasets, highlighting a lack of diversity. Similarly, for the SIPU datasets (Fig. S-6b), GMM and K-Means++ are more equally the best performing; the lower connectivity (i.e. lack of overlap between clusters) compared to the QJ datasets does not distinguish performance further between these two algorithms. The UCI datasets (Fig. S-6c) show little significant difference between the clustering algorithms which, when combined with the low performance shown in Fig. ??, indicates that there is a high variance of performance, but that this diversity is due to low cluster structure (rather than varied cluster structures). Finally, the UKC datasets (Fig. S-6d) are too few in number to identify significant differences, though the higher rank of the compactness-based algorithms fits with the visualization of the clusters [11].

S-IV. GENERATING DATASETS THAT CHALLENGE SPECIFIC ALGORITHMS (VERSUS MODE)

Section ?? presented results for the reformulation of the fitness function to evolve datasets that directly maximize the performance difference between two algorithms. Here, we provide further visual examples of the different head-to-head scenarios to illustrate the properties which HAWKS found to differentiate the algorithms, and further details for the head-to-head between average- and single-linkage.

A. Additional scenarios

1) Single-linkage vs. Average-linkage:
Fig. ?? showed that there was a high spread in the ARI for both algorithms, and high variance in the maximum difference found between runs, highlighting the difficulty of this scenario. For some datasets, performance is correlated between the two algorithms (both high or both low), indicating that the algorithms do share some similarities. Fig. S-7a shows an example whose structure is common among the datasets that did find a performance difference. The clusters are in general well-separated; by placing some clusters much further away and co-locating a few smaller clusters, the averaging criterion (used by average-linkage to determine where to split next) assigns the clusters that are closer together the same cluster label. The mixed cluster eccentricities that HAWKS is able to generate help facilitate the discovery of this exploit. As the clusters need to be sufficiently far away before a change in the fitness can be found, there may be insufficient pressure for this exploit to be reliably found, explaining the high variation in performance for this scenario.
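The difference between the two linkage criteria that these head-to-heads exploit can be reproduced in a small, hand-constructed sketch: two elongated clusters joined by a small single-link gap, plus a looser, well-separated cluster. Single-linkage chains the two close clusters into one, whereas average-linkage recovers the ground truth. The coordinates below are illustrative choices of ours, not datasets evolved by HAWKS:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Clusters A and B are elongated chains separated by a gap of 0.6, smaller
# than cluster C's internal merge heights; the *average* A-B distance
# (about 1.1) is larger, so only single linkage merges A and B.
xs = np.concatenate([
    np.linspace(0.0, 0.5, 6),       # cluster A: 6 points, step 0.1
    np.linspace(1.1, 1.6, 6),       # cluster B: 6 points, gap of 0.6 to A
    np.array([10.0, 10.7, 11.3]),   # cluster C: 3 looser, far-away points
])
X = np.column_stack([xs, np.zeros_like(xs)])
truth = np.array([0] * 6 + [1] * 6 + [2] * 3)

single = AgglomerativeClustering(n_clusters=3, linkage="single").fit_predict(X)
average = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

print(adjusted_rand_score(truth, single))   # low: A and B chained together
print(adjusted_rand_score(truth, average))  # recovers the ground truth
```

Asked for three clusters, single-linkage spends its cuts inside the loose cluster C (whose gaps exceed the A–B gap) and lumps the majority of points into one cluster, mirroring the exploit shown in Fig. S-7b.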
2) Average-linkage vs. Single-linkage:
As average-linkage uses the average of the distances between groups, rather than the smallest distance, it is not susceptible to the chaining effect [12] that single-linkage is. As a result, Fig. S-7b shows that the exploit used in this head-to-head is simply to place clusters close together, such that single-linkage determines that the majority of the points are in a single cluster.

REFERENCES

[1] C. Shand, R. Allmendinger, J. Handl, A. M. Webb, and J. Keane, "Evolving controllably difficult datasets for clustering," in Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2019. ACM, 2019, pp. 463–471.
[2] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Proceedings of the International Conference on Neural Networks (ICNN'95). IEEE, 1995, pp. 1942–1948.
[3] J. Kennedy, "Particle swarm optimization," pp. 760–766, 2010.
[4] R. Storn and K. Price, "Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces," Journal of Global Optimization, vol. 11, no. 4, pp. 341–359, 1997.
[5] K. Fleetwood, "An introduction to differential evolution," in Proceedings of Mathematics and Statistics of Complex Systems (MASCOS) One Day Symposium, 26th November, Brisbane, Australia, 2004, pp. 785–791.
[6] M. L. Ortiz and N. Xiong, "Investigation of mutation strategies in differential evolution for solving global optimization problems," in Artificial Intelligence and Soft Computing – 13th International Conference, ICAISC 2014, ser. Lecture Notes in Computer Science, L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. A. Zadeh, and J. M. Zurada, Eds., vol. 8467. Springer, 2014, pp. 372–383.
[7] J. Handl and J. D. Knowles, "Improvements to the scalability of multiobjective clustering," in Proceedings of the IEEE Congress on Evolutionary Computation, CEC 2005, 2–4 September 2005, Edinburgh, UK. IEEE, 2005, pp. 2372–2379.
[8] W. Qiu and H. Joe, "Generation of random clusters with specified degree of separation," Journal of Classification, vol. 23, no. 2, pp. 315–334, 2006.
Fig. S-6. Critical difference (CD) diagrams, showing the mean rank (in terms of ARI) of each algorithm for the remaining dataset collections: (a) QJ; (b) SIPU; (c) UCI; (d) UKC. Algorithms connected by solid lines are not significantly different according to a two-tailed Nemenyi test.
Fig. S-7. Examples of datasets for the listed head-to-heads. Each figure shows the ground truth and the cluster assignment for each algorithm, with the associated ARI: (a) single-linkage vs. average-linkage (single-linkage ARI: 1.000; average-linkage ARI: 0.692); (b) average-linkage vs. single-linkage (average-linkage ARI: 0.994; single-linkage ARI: 0.001); (c) GMM vs. K-Means++ (GMM ARI: 0.989; K-Means++ ARI: 0.407).
[9] P. Fränti and S. Sieranoja, "K-means properties on six clustering benchmark datasets," Applied Intelligence, vol. 48, no. 12, pp. 4743–4759, 2018.
[10] D. Dheeru and E. Karra Taniskidou, "UCI machine learning repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml
[11] M. Garza-Fabre, J. Handl, and J. D. Knowles, "An improved and more scalable evolutionary approach to multiobjective clustering," IEEE Transactions on Evolutionary Computation, vol. 22, no. 4, pp. 515–535, 2017.
[12] L. Hubert, "Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures,"