Towards Explainable Exploratory Landscape Analysis: Extreme Feature Selection for Classifying BBOB Functions
Quentin Renau, Johann Dreo, Carola Doerr, and Benjamin Doerr
Thales Research & Technology, Palaiseau, France
École Polytechnique, Institut Polytechnique de Paris, CNRS, LIX, France
Sorbonne Université, CNRS, LIP6, Paris, France
Abstract.
Facilitated by the recent advances of Machine Learning (ML), the automated design of optimization heuristics is currently shaking up evolutionary computation (EC). Where the design of hand-picked guidelines for choosing a most suitable heuristic has long dominated research activities in the field, automatically trained heuristics are now seen to outperform human-derived choices even for well-researched optimization tasks. ML-based EC is therefore not any more a futuristic vision, but has become an integral part of our community.

A key criticism that ML-based heuristics are often faced with is their potential lack of explainability, which may hinder future developments. This applies in particular to supervised learning techniques which extrapolate algorithms' performance based on exploratory landscape analysis (ELA). In such applications, it is not uncommon to use dozens of problem features to build the models underlying the specific algorithm selection or configuration task. Our goal in this work is to analyze whether this many features are indeed needed. Using the classification of the BBOB test functions as testbed, we show that a surprisingly small number of features – often less than four – can suffice to achieve a 98% accuracy. Interestingly, the number of features required to meet this threshold is found to decrease with the problem dimension. We show that the classification accuracy transfers to settings in which several instances are involved in training and testing. In the leave-one-instance-out setting, however, classification accuracy drops significantly, and the transformation-invariance of the features becomes a decisive success factor.
Keywords:
Exploratory Landscape Analysis · Feature Selection · Black-Box Optimization.
Evolutionary algorithms and other iterative optimization heuristics (IOHs) are classically introduced as frameworks within which a user can gather some modules to instantiate an algorithm. For instance, the design of an evolutionary algorithm requires choosing the population size, the variation and selection operators in use, the encoding structure, fitness function penalization weights, etc.
This highly flexible design of IOHs allows for efficient abstractions but comes with the burden of having to solve an additional (meta-)optimization problem. Automated design of heuristics aims at solving this problem by providing data-driven recommendations on which IOH shall be employed for a given optimization problem and how it shall be configured. Automated IOH design has proven its promise in numerous applications, see [7,9,20,3,12] for examples and further references.

A common critique of machine-trained automated algorithm design is its potential lack of explainability. That is, the general fear is that by relying on automated design approaches, we may be losing intuition for why certain recommendations are made – a key driver for the development of new optimization approaches. This fear is not without reason: the vast majority of automated algorithm design studies fall short in this explainability aspect.
Our Contribution.
Our work aims at providing paths to narrowing this important gap, by studying which information the trained models actually need to achieve convincing performance. As testbed we chose the automated classification of optimization problems through exploratory landscape analysis (ELA). We show that very small feature sets can suffice to reliably discriminate between various optimization problems, and that these sets are robust with respect to the classifiers and function instances.

Apart from the explainability aspect, our findings have important consequences also for the efficiency of automated algorithm design: smaller feature sets are faster to compute and they can drastically reduce the time spent in the training phase. Another advantage of feature selection is that the classification or regression accuracy can increase.

Background and Motivation.
ELA was introduced in [17] with the objective to gain insights about the properties of an unknown optimization problem. Instead of relying on expert knowledge, the keystone of ELA is a set of computer-generated features that are based on sampling the decision space. With the purpose of enhancing the effectiveness of this approach, several additional features have been introduced since. A good selection of these features is automatically computed by the R package flacco [14], see Sec. 2 for more details.

We chose classification as task because it offers a very clean setting in which the results are easily interpretable. Classification has a straightforward performance measure, the classification accuracy, i.e., the fraction of items that are classified correctly. Additionally, the classification accuracy is a good way of estimating the expressiveness of ELA feature sets, i.e., their ability to discriminate between different problems [26]. A proper classification furthermore plays an important role also in many other ML tasks, including the selection and configuration of algorithms, so that a good classification accuracy can be expected to provide good results also for these tasks.
Related Work.
Given the mentioned speed-up and the better performance that one can expect from smaller feature sets, feature selection is not new, but rather standard in automated algorithm design. However, most related works still use a relatively large number of features, hindering the explainability of the trained models. Among the ELA-based applications in EC, the following ones have used the smallest feature portfolios.

Muñoz and Smith-Miles [19] compute the co-linearity between landscape features with the idea that if two features are strongly co-linear, they carry the same type of information about the landscape. Applying this procedure, nine features were kept for further analysis: the adjusted coefficient of determination of a linear regression model including interactions [17], the adjusted coefficient of determination of a quadratic regression model [17], the ratio between the minimum and maximum absolute values of the quadratic term coefficients in the quadratic model, the significance of D-th and first order [29], the skewness, kurtosis and entropy of the fitness function distribution [17], and the maximum information content [22].

Another method to perform feature selection is the use of search algorithms. In their work, Kerschke and Trautmann [12] compare four different algorithms, a greedy forward-backward selection, a greedy backward-forward selection, a (10 + 5)-GA and a (10 + 50)-GA. The smallest feature sets considered in their algorithm selection setting have a size of eight features: three features from the y-distribution feature set [17] (skewness, kurtosis, and number of peaks), one level set feature [17] (the ratio of mean misclassification errors when using a linear (LDA) and mixed discriminant analysis (MDA)), two information content features [22] (the maximum information content and the settling sensitivity), one cell mapping feature [13] (the standard deviation of the distances between each cell's center and worst observation), and one of the basic features (the best fitness value within the sample). This is still considerably larger than the sets we will identify as promising in our work.

Saini et al. [28] and Lacroix and McCall [15] also use reduced feature sets, but do not expand on how these have been derived.

Availability of Our Data.
All our project data is available at [27].
Our primary objective is to analyze the number of features that are needed to correctly classify the 24 BBOB functions from the COCO benchmark environment, and their robustness across several dimensions and sample sizes. We describe in this section the benchmark set, the experimental procedure, and the classification scheme.
The 24 BBOB Benchmark Problems.
A standard benchmark environment for numerical black-box optimization is the COCO (COmparing Continuous Optimizers) platform [6]. From this environment, we consider the BBOB suite, a set of 24 noiseless problems. For each BBOB problem, several instances are available, which are obtained from a "base" function via translation, rotation, and/or scaling transformations [6]. Each problem instance is a real-valued function f: [-5, 5]^d → R. The problems scale to arbitrary dimensions d. In our experiments, we consider six different dimensions, d ∈ {5, 10, 15, 20, 25, 30}, and we focus on the first five instances of each problem (first instance only in Sec. 3). In abuse of notation, we shall often identify the functions by their IDs 1, ..., 24.
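For readers who wish to reproduce this setup, the BBOB problems can be instantiated through COCO's Python interface. The following is only a minimal sketch; the cocoex module and in particular the suite-option strings used here are assumptions about the reader's local COCO installation and are not part of our own pipeline, which only requires the raw samples fed to flacco.

```python
import numpy as np
import cocoex  # COCO's Python experimentation module (assumed to be installed)

# Instantiate the noiseless BBOB problems in dimension 5, first instance only.
# The suite-option string is an assumption; consult the COCO documentation.
suite = cocoex.Suite("bbob", "", "dimensions: 5 instance_indices: 1")

for problem in suite:
    # Evaluate one uniformly random point in the BBOB domain [-5, 5]^d.
    x = np.random.uniform(-5.0, 5.0, problem.dimension)
    print(problem.id, problem(x))
```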
Computation of Feature Values via flacco.
For the feature value approximation, we sample for each of the 24 functions f a number n of points x^(1), ..., x^(n) ∈ [-5, 5]^d, and we evaluate their function values f(x^(1)), ..., f(x^(n)). The set of pairs {(x^(i), f(x^(i))) | i = 1, ..., n} is then fed to the flacco package [14], which returns a vector of features. The flacco package covers a total number of 343 features [9], which are grouped into 17 feature sets. However, some of these features are often omitted in practice because they require adaptive sampling [2,12,18,24], while other features have previously been dismissed as non-informative for the BBOB functions [13,26]. After removing these sets from our test bed, we are left with six feature sets: dispersion (disp [16]), information content (ic [22]), nearest better clustering (nbc [10]), meta model (ela_meta [17]), y-distribution (ela_distr [17]), and principal component analysis (pca [14]). But even if this selection reduces the number of features to 46, a full enumeration of all subsets of all sizes c ≤ 46 would still be computationally infeasible (since we need to train and test a classification model for each such set). We therefore need to reduce the set of eligible features further. To this end, we build on the work presented in [26], in which we studied the expressiveness of these 46 features. Based on this work we select four features. We add to this selection another six features, one per each of the feature sets mentioned above (to ensure a broad diversity of features), again giving preference to the most expressive ones and to features invariant to BBOB transformations [30]. This leaves us with the following ten features. We indicate in this list by ✓ and - whether or not a feature is considered invariant under transformation according to [30] (first entry) and according to our data (second entry), respectively. Note here that the setting used in [30] is slightly different from the instances used in BBOB, mostly due to different ways to handle boundary constraints. The assessment can therefore differ.

1. disp.ratio_mean_02 [✓, ✓] (disp) computes the ratio of the pairwise distances of the points having the best 2% fitness values with the pairwise distances of all points in the design.
2. ela_distr.skewness [✓, ✓] (skew) computes the skewness coefficient of the distribution of the fitness values. This coefficient is a measure of the asymmetry of a distribution around its mean.
3. ela_meta.lin_simple.adj_r2 [✓, ✓] (lr2), which computes the adjusted correlation coefficient R² of a linear model fitted to the data.
4. ela_meta.lin_simple.intercept [✓, -] (int), the intercept coefficient of the linear model.
5. ela_meta.lin_simple.coef.max [-, -] (max), the largest coefficient of the linear model that is not the intercept coefficient.
6. ela_meta.quad_simple.adj_r2 [✓, ✓] (qr2), the adjusted correlation coefficient R² of a quadratic model fitted to the data.
7. ic.eps.ratio [-, ✓] (ε_ratio), the half partial information sensitivity.
8. ic.eps.s [-, ✓] (ε_s), the settling sensitivity.
9. nbc.nb_fitness.cor [✓, ✓] (nbc), the correlation between the fitness values of the search points and their indegree in the nearest-better point graph.
10. pca.expl_var_PC1.cov_init [✓, ✓] (pca), which measures the importance of the first principal component of a Principal Component Analysis (PCA) over the sample points in the whole search space.

Normalization of Feature Values.
The value of each feature is normalized between 0 and 1, where 0 (resp. 1) corresponds to the smallest (resp. largest) value encountered in the approximated feature values. This normalization is performed independently for each dimension, each sample size, and each classifier used in this paper.
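A minimal sketch of this min-max normalization, assuming the approximated feature values for one (dimension, sample size) setting are stored in a NumPy array of shape (number of designs, number of features); the array and function names are ours and not part of the original pipeline.

```python
import numpy as np

def minmax_normalize(feature_values: np.ndarray) -> np.ndarray:
    """Scale each feature (column) to [0, 1] using the observed min and max."""
    lo = feature_values.min(axis=0)
    hi = feature_values.max(axis=0)
    # Guard against constant features to avoid division by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (feature_values - lo) / span

# Example: 100 feature vectors with 10 features each, one per Sobol' design.
values = np.random.rand(100, 10) * 7.0 - 2.0
normalized = minmax_normalize(values)
assert normalized.min() >= 0.0 and normalized.max() <= 1.0
```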
Sampling Strategy.
Based on an extension of the preliminary experiments reported in [25], we use a quasi-random distribution to sample the points x^(1), ..., x^(n) from which the feature values are computed. More precisely, we use Sobol' sequences [32], which we obtain from the Python package sobol_seq (version 0.1.2), with randomly chosen initial seeds. We sample a total number of 100 independent Sobol' designs, which leaves us with 100 feature value vectors for each function. Fig. 1 provides an impression of the distribution of these feature values. Plotted are approximated values for the lr2 feature. The comparison shows that the dispersion slightly decreases with the dimension, which is quite surprising in light of the lower density of the points in higher dimensions. We also see that the median values are not stable across dimensions. Some functions (F5 of course, which is correctly identified as a linear function, but also F16, F19, and F20, for example) show a high concentration of feature value approximations, whereas other functions show much larger dispersion within one dimension (e.g., F12, F15, F17, F18) or between different dimensions (F2, F11, F24).
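As an illustration of this sampling step, the sketch below draws a scrambled Sobol' design in [-5, 5]^d. It uses SciPy's quasi-Monte Carlo module as a stand-in for the sobol_seq package named above, so it reproduces the kind of design we use, not the exact point sets.

```python
import numpy as np
from scipy.stats import qmc

def sobol_design(dim: int, n_points: int, seed: int) -> np.ndarray:
    """Return n_points quasi-random points in [-5, 5]^dim from a Sobol' sequence."""
    sampler = qmc.Sobol(d=dim, scramble=True, seed=seed)
    unit_points = sampler.random(n_points)     # points in [0, 1)^dim
    return -5.0 + 10.0 * unit_points           # rescale to the BBOB domain

# Example: one of the 100 independent designs for d = 5 with n = 250 * d samples.
design = sobol_design(dim=5, n_points=250 * 5, seed=42)
print(design.shape)  # (1250, 5)
```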
Sample Size.
To study the effect of the sample size on the number of features needed to correctly classify the 24 BBOB functions, we conduct experiments for seven different values of n, namely n ∈ {30d, 50d, 100d, 250d, 650d, 800d, 1000d}. We note here that a linear scaling of the sample size is by far the most common choice, see, for example, [3,11,12].

Feature Selection.
We apply a wrapper method, i.e., we actually train a classifier for every considered subset of features. For a given sample size and a given dimension, we train and test all (10 choose c) possible subsets of size c, starting with c = 1. If none of these size-c subsets achieves our target accuracy, we move on to the size-(c+1) subsets. As soon as a sufficiently qualified subset has been identified, we continue to evaluate all size-c subsets, but stop the selection process thereafter. This full enumeration of all possible feature combinations for a given size c allows us to investigate the robustness of the feature selection. Ideally, we would like to see that the feature sets achieving our 98% accuracy threshold (this will be introduced below) are stable across the different sample sizes. Robustness with respect to the dimension is much less of a concern to us, since the problem dimension is typically known and can be used for choosing the feature ensemble that shall be applied to characterize the problem. A minimal sketch of this enumeration loop is given below.
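The sketch assumes a helper meets_target(subset) that trains and tests a classifier on the given feature subset and reports whether the accuracy threshold is met in all validation runs; this helper and the feature names are ours, and one possible implementation of the helper is sketched in the validation paragraph below.

```python
from itertools import combinations

FEATURES = ["disp", "skew", "lr2", "int", "max",
            "qr2", "eps_ratio", "eps_s", "nbc", "pca"]

def smallest_successful_subsets(meets_target):
    """Enumerate feature subsets by increasing size; return all successful
    subsets of the smallest size for which at least one subset succeeds."""
    for size in range(1, len(FEATURES) + 1):
        successful = [subset for subset in combinations(FEATURES, size)
                      if meets_target(subset)]
        if successful:  # stop only after fully evaluating this subset size
            return size, successful
    return None, []
```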
Fig. 1: Distribution of the feature values for the lr2 feature for different dimensions. Each feature value is computed from 250 × d samples and each boxplot represents results of 100 independent feature computations.

Validation Procedure and Target Classification Accuracy.
In our experiments, we use 80 randomly chosen feature vectors (per function) to train a classification model, and we use the remaining 24 × 20 = 480 feature vectors for testing. For each of these 480 test cases we store the true function ID (i.e., the ID of the function that the feature value originates from) and we store the ID of the function that the classifier matches the feature vector to. From this data we compute the overall classification accuracy. We repeat this procedure of splitting the set of all feature vectors into 80 training and 20 test instances 20 times; i.e., we repeat 20 times a random sub-sampling validation. We require that the overall classification accuracy for each of the 20 validations is at least 98%. That is, a feature set is eligible if, in each of the 20 random sub-sampling validation runs, it misclassifies at most 10 out of the 480 tested feature vectors. Feature combinations achieving a smaller classification accuracy in one of the validation runs are immediately discarded.
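One possible implementation of this repeated random sub-sampling check is sketched below, assuming a feature matrix of shape (24 functions × 100 designs, number of selected features) with matching function labels and any scikit-learn-style classifier; all names are ours.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def meets_target(make_classifier, X, y, threshold=0.98, repetitions=20):
    """Random sub-sampling validation: an 80/20 split per repetition,
    requiring the accuracy threshold to be met in every repetition."""
    for rep in range(repetitions):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=rep)
        clf = make_classifier()
        clf.fit(X_train, y_train)
        if accuracy_score(y_test, clf.predict(X_test)) < threshold:
            return False  # discard this feature combination immediately
    return True
```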
Classification Model.
In the main part of this work, we use a Majority Judgment classifier [1]. A cross-validation with decision trees and KNN classifiers will be presented in Sec. 4.

The Majority Judgment classifier works as follows. Let Φ = {ϕ_1, ..., ϕ_k} be the set of features for which we want to know whether it achieves our 98% target precision requirement. We consider one of the independent sub-sampling validation runs. That is, for each function we randomly select 80 out of the 100 feature vectors. Denoting by ϕ_{i,j,r} the r-th estimated value of feature ϕ_i for the j-th BBOB function, the set {(ϕ_{i,j,r}, j) | i = 1, ..., k, j = 1, ..., 24, r = 1, ..., 80} describes the full set of training data. From this data we compute for each of the 24 functions j = 1, ..., 24 and for each feature ϕ_i ∈ Φ the median value M(i, j) := M({ϕ_{i,j,r} | r = 1, ..., 80}). This gives us a set of 24k values M(i, j) and concludes the training step.

In the testing step we apply an approval voting mechanism [4] to each of the 480 test instances. Approval voting mechanisms are single-winner systems where the winner is the most-approved candidate among the voters. From this class of approval voting mechanisms we choose Majority Judgment [1], a voting technique which ensures that the winner among three or more candidates has received an absolute majority of the scores given by the voters.

To apply Majority Judgment to our classification task, we do the following. We recall that the task of the classifier is to output, for a given feature vector ζ = (ζ_1, ..., ζ_k), the ID of the function that it believes this feature vector to belong to. To this end, it first computes for each of the k features i and for all 24 functions j the absolute distances d_{i,j} := |ζ_i − M(i, j)|. Tab. 1 presents an example of what the distances may look like. We then compute for each function the median of these distances, by setting D_j(ζ) := M({d_{i,j} | i = 1, ..., k}). The cells with these median values are highlighted with a blue background in Tab. 1, and the values D_j(ζ) are reported in the last line. The classifier outputs as predicted function ID the value j for which the distance D_j(ζ) is minimized. This cell is highlighted in yellow background color.

Table 1: Example for the Majority Judgment classification scheme with three features. The values in the table are the distances of the measured feature values ζ_i to the median feature values M(i, j) of the training set. The median distances D_j are reported in the last line. The ID of the function minimizing this median distance D_j is the output of the Majority Judgment classifier.
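A compact sketch of this training and prediction scheme, assuming the training feature vectors are given together with their function IDs; the class and variable names are ours.

```python
import numpy as np

class MajorityJudgmentClassifier:
    """Store per-function, per-feature medians at training time; at prediction
    time, assign the function whose median per-feature distance is smallest."""

    def fit(self, X_train, y_train):
        # X_train: array of shape (n_samples, k features); y_train: function IDs.
        self.function_ids_ = np.unique(y_train)
        self.medians_ = np.array([np.median(X_train[y_train == fid], axis=0)
                                  for fid in self.function_ids_])  # shape (24, k)
        return self

    def predict(self, X_test):
        predictions = []
        for zeta in X_test:
            # d_{i,j} = |zeta_i - M(i, j)| for every feature i and function j,
            # then D_j = median_i d_{i,j}; predict the argmin over j.
            distances = np.abs(self.medians_ - zeta)   # shape (24, k)
            D = np.median(distances, axis=1)           # shape (24,)
            predictions.append(self.function_ids_[np.argmin(D)])
        return np.array(predictions)
```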
Computation Time.
To give an impression of the computational resources required for our experiments, we report that the computation of the 100 5-dimensional feature vectors requires around 6 CPU hours, whereas the computation of the 25-dimensional feature vectors takes about 1221 CPU hours. Training and testing the classifier takes between 1 second and 3 hours, depending on the setting. In total, we have invested around 432 CPU days for computing the data presented in this work.
                       Sample size
dimension   30d   50d   100d   250d   650d   800d   1000d
    5        -     -     -      4      4      -      2
   10        -     -     -      4      1      2      1
   15        -     -     6      4      2      2      2
   20        -     -     6      2      1      1      2
   25        1     1     1      1      1      1      1
   30        -     6     2      1      1      2      2

Table 2: Feature combination size achieving 98% classification accuracy in all 20 runs. Entries marked "-" indicate that no feature combination met the threshold.
The portfolios of features for which we obtained the desired 98% classification accuracy for each of the 20 random sub-sampling validation runs are presented in Tab. 3. For convenience, their sizes are summarized in Tab. 2.

Our first, and most important, finding is that we can actually classify the BBOB functions with very few features. However, we also see that the existence of such portfolios requires a sufficient sample size. For d ∈ {5, 10, 15, 20}, none of the 2^10 possible portfolios based on size-30d and size-50d feature approximations could achieve the 98% accuracy threshold.

We also see that, as expected, the size of the minimal portfolio achieving the target precision decreases with increasing sample size. A few exceptions to this rule exist:
– No combination in d = 5 with n = 800d samples achieved the target precision.
– In d = 10 we see that a single feature, the intercept feature int, suffices to classify with 98% accuracy when the sample size is 650d or 1000d. For 800d, however, this feature does not achieve the threshold. A detailed analysis of the classification accuracy achieved with this feature will be given in Fig. 2.
– In d = 15, the ε_ratio information content feature classifies properly when the sample size equals n = 800d, but for n = 1000d, one additional feature is needed to pass the 98% accuracy threshold.
– In d = 20 a single feature suffices for n = 650d and n = 800d, but for n = 1000d an additional feature is needed to achieve the target accuracy.

Overall, we see that for ten settings a single feature suffices for proper classification. An additional seven cases can be solved by a combination of two features. It seems counter-intuitive that in almost all cases the size of the smallest admissible portfolio decreases with increasing dimension. However, as already discussed in the context of Fig. 1, the dispersion of some feature values decreases with increasing dimension – an effect that is interesting in its own right. Without going into much detail here, we note that this effect is further intensified when using a properly scaled sampling size that maintains the same sampling density across dimensions.
Table 3: Feature combinations achieving the 98% classification accuracy threshold in all 20 runs. For each (d, n) setting, the table marks which of the ten features int, lr2, qr2, max, ε_s, ε_ratio, disp, skew, pca, and nbc belong to a successful combination; features with the same symbol (X, O, H, V) belong to the same combination. Results are grouped by dimension d and by the sample size n used to approximate the feature values. Blank rows are for (d, n) settings for which all 2^10 feature sets failed. M = missing data (due to coronavirus measures in France, we have lost access to cluster and data).
Fig. 2: Distributions of intercept feature accuracy by dimension and sample size
Robustness of the feature combinations with respect to dimension and sample size.
Looking at the robustness of the selected combinations over the dimensions and the sample sizes, we observe the following.

One feature, the intercept feature int, is involved in 15 out of the 28 (d, n) pairs for which a successful feature portfolio could be found. This feature, in contrast, is rarely present in other combinations of size |c| > 1. To shed more light on its expressive power, we present in Fig. 2 the distributions of the classification accuracy for the various (d, n) combinations. Aggregated over all dimensions and all sample sizes, the median accuracy of the int feature is 96%. Even if the feature does not always reach our threshold of 98%, it is worth noting that its performance is almost always above 90%. Therefore, this feature is very expressive, and this across all tested dimensions and sample sizes. Another interesting observation from Fig. 2 is that the classification accuracy is not monotonic in the dimension. In all but one case (n = 30d), the d = 15 results are worse than those for the other dimensions. As already seen in Tab. 3, for n = 250 × d we always have very good classification accuracy.

The most frequent feature is ε_ratio, which is present in almost all combinations of size |c| ≥ 2. We count 21 successful combinations of size |c| ≥ 2, and ε_ratio appears in 20 of these combinations, regardless of the dimension and the sample size. In total, it appears in successful portfolios for 17 out of the 28 (d, n) combinations for which a successful subset had been found. The ε_ratio feature is thus very useful for our classification task.

The skewness feature skew, in contrast, does not appear in any of the portfolios of the smallest size.

Classification Accuracy When Using All flacco Features.
We compare the results presented above with the classification accuracy achieved by the Majority Judgment voting scheme using the whole set of 46 features described in Sec. 2. We perform the same sub-sampling validation as above. Interestingly, none of the tests performed on the pairs (d, n) with the sample sizes n listed above and d ∈ {5, 10, 15, 20, 25, 30} met our required target precision of 98% for each of the 20 runs. We can thus conclude that, in addition to the gain in explainability, the selection of features for supervised-ELA approaches provides better performance, and – as we shall discuss below – also comes at a much smaller computational cost.

Having identified feature portfolios that reliably classify the BBOB functions with at least 98% accuracy when using Majority Judgment (MJ), we now investigate how robust this accuracy is with respect to the choice of the classifier. To this end, we apply the same classification routine as above, but now using decision trees (DT) and K Nearest Neighbors (KNN) as classifiers. We use off-the-shelf implementations from the scikit-learn Python package [23] (version 0.21.3). Since our goal is to investigate robustness, we do not perform any hyper-parameter tuning for these two classifiers. For the KNN classifier we use K = 5. For all classifications with a reduced portfolio of features, if multiple combinations are available, only the one marked with X in Tab. 3 is used.

Both KNN and decision trees perform as well as our classifier when trained and tested with the small portfolios from Tab. 3, i.e., they both reach at least 98% classification accuracy in every run, except for the decision trees trained with only one feature, for which the accuracy drops to around 62% in every run. Fig. 3 summarizes the classification accuracy of the three classifiers for the case that features are based on n = 250d samples, for the portfolios described in Tab. 3. Performance is indeed very robust with respect to the classification mechanism.
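The two off-the-shelf classifiers can be instantiated as follows; this is a minimal sketch using scikit-learn defaults, mirroring the no-tuning setup described above (the factory and variable names are ours).

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def make_decision_tree():
    # Default hyper-parameters, no tuning; fixed seed for reproducibility.
    return DecisionTreeClassifier(random_state=0)

def make_knn():
    # K = 5 nearest neighbors, default distance metric.
    return KNeighborsClassifier(n_neighbors=5)

# Both factories can be plugged into the meets_target validation sketch above,
# e.g. meets_target(make_knn, X_selected_features, function_ids).
```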
Running Time.
While training and testing completed in around 4 seconds for the DT and for the MJ voting scheme, the KNN classifier needed around 12 seconds to complete the 20 sub-sampling validation runs.
Gain over Full Feature Set.
We now study how much we gain in terms of computation time when we compute, train, and test the three classifiers (MJ, DT, and KNN) on the selected feature sets only. To quantify this gain, we train all three classifiers with the full set of 46 features mentioned in Sec. 2. We first observe that the decision tree classifier has the best performance among the three classifiers in terms of accuracy. It achieves at least 99% classification accuracy. For KNN, in contrast, performance drops below our 98% threshold precision on several runs, resulting in a median classification accuracy (over all tests) of around 97%. The results for KNN align, as already briefly touched upon in Sec. 3, with those obtained using MJ, where none of the tests produced 20 runs in which the threshold was reached.

In terms of computation time, we observe significant differences between the small feature portfolios and the full flacco set. As already commented in Sec. 2, the computation of the feature values can be very time-consuming. Reducing the number of features therefore reduces the running time of the feature extraction. However, the savings are even bigger when comparing the cost of training (and testing) the classifiers. For decision trees, the execution of the whole classification pipeline takes 3000 times longer than with the small portfolios – around 3 CPU hours instead of a few seconds. For KNN, the total cost is comparable, also around 3 CPU hours for training and testing the classifiers for the 20 sub-sampling validation runs. For the MJ classifier, the overall running time is only around 35 CPU minutes – which is still way above the time needed for the small portfolios.

Thus, overall, the reduced portfolios resulted not only in much faster computation times, but also achieved better classification accuracy.

Fig. 3: Classification accuracy for the feature portfolios from Tab. 3 for budget 250d. Results are sorted by dimension and classifier and are for 20 random sub-sampling validation runs. Training and testing is done on the first instance of each function only. The X corresponds to settings that did not achieve the 98% threshold.

The discussion above focused on classifying the first instance of the BBOB functions, and we now investigate how robust the selection is with respect to different instances of the same problems. Concretely, we investigate the classification accuracy when applying the same random sub-sampling validation routine as above to the set of features computed for the first five instances of the BBOB functions. In this experiment, we keep 80% of feature values for each instance for training the classifier, and we test on the remaining ones. In a second step we then test transferability, by performing a leave-one-instance-out (LOIO) cross-validation.
In this setting, the classifiers are trained on four instances of each function and tested on the remaining one. We use the portfolios marked by an X in Tab. 3, and compare to the classification accuracy when using all ten features. In the following, MJ voting is excluded since, by design, it is not suited to work with multiple distributions coming from different instances. Hence, only DT and KNN classifiers will be used in this section.
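A possible sketch of the LOIO protocol, assuming feature vectors labeled with both the function ID and the instance number; the array names and the scikit-learn-style classifier factory are ours.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def loio_accuracies(make_classifier, X, function_ids, instance_ids):
    """Leave-one-instance-out: train on four instances, test on the held-out one.
    Returns one accuracy value per held-out instance."""
    accuracies = {}
    for held_out in np.unique(instance_ids):
        train = instance_ids != held_out
        test = instance_ids == held_out
        clf = make_classifier()
        clf.fit(X[train], function_ids[train])
        accuracies[int(held_out)] = accuracy_score(
            function_ids[test], clf.predict(X[test]))
    return accuracies
```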
Fig. 4: Classification accuracy of DT and KNN classifiers when applied to the first five instances of the 24 BBOB functions. Feature values are computed from 250d samples, for the portfolios marked by an X in Tab. 3. Cases with poor performance are marked by a red X.

Fig. 4 aggregates the results obtained for the first classification task, where we take feature values from each of the first five instances. As in Fig. 3, DT performs badly in d = 25 and d = 30, where classification is only based on the intercept feature. For these cases, the median accuracy is 45% and 62%, respectively. Since the intercept feature is not invariant to fitness function transformations, the worsened performance is no surprise. In contrast, the median classification accuracy is above 98% for all portfolios with at least two features. We also note that KNN in dimension d = 25 does not reach our 98% threshold, but still achieves good performance with an average 97% accuracy.

Fig. 5 presents the classification accuracy achieved by KNN and DT in the LOIO setting. Fig. 5a is for the features lr2, qr2, ε_ratio, and nbc computed from 650d samples in d = 5, and Fig. 5b is for the two features qr2 and ε_ratio computed from 250d samples in d = 20. For comparison, we also plot the classification accuracy achieved when using all ten features listed in Sec. 2. For most settings, the accuracy obtained with the set of ten features is better than that achieved for the smaller portfolios. For the 650d setting, this is the case for all instances. For the 250d setting, DT performs better with the smaller portfolio when instance 1 or instance 3 is left out. The performance loss when using the reduced feature set is particularly drastic for KNN when instance 1 is left out (both cases), when instance 2 is left out (650d case), and when instance 4 is left out (250d case). Interestingly, for DT in the 650d setting, the largest performance losses occur when leaving instance 2 or 5 out. The average loss in classification accuracy is 5% and 4% for KNN in the 650d and the 250d case, respectively. For DT, the average loss in the 650d case is 10% and the average gain in the 250d case is 2%.

Fig. 5: Classification accuracy of KNN and DT in the leave-one-instance-out setting. (a) 650d samples, d = 5. (b) 250d samples, d = 20. The subscripts 2, 4, and 10 refer to the size of the feature portfolio.

We conclude that the feature selection is robust when studying different instances, except for those portfolios which consist only of a single feature. For the (arguably more interesting) LOIO setting, however, classification accuracy drops, but non-homogeneously for the different instances. We recommend using the larger feature portfolio in this case.

Our ambition to build small feature sets is driven by the desire to obtain models that are (at least to some degree) human-interpretable. While our study certainly has several limitations, as only one test bed is considered, it nevertheless shows that the number of features needed to successfully classify the BBOB functions is surprisingly low. Our main direction for future work is an application of the small feature sets to automated algorithm design tasks. [8] shows promising performance of the selected feature portfolio presented in Sec. 2 for automated performance regression and per-instance algorithm selection, results that we wish to detail further based on the results presented in Sec. 3. Our next important goal will then be to uncover how the performance of a given solver depends on the selected features, by taking a closer look at the trained regression models. With small feature sets, there is reasonable hope that we can identify meaningful correlations.

We are targeting, in the mid-term perspective, classifiers and automated algorithm design techniques that work well on highly constrained problems and which can cope with discontinuities. Extending the results of this work to such problems forms another important next step.

Other interesting directions for future work include the investigation of new features recently proposed in the literature (such as, for example, the SOO-based features [5]). We also plan a closer inspection of the classification results presented above, particularly with respect to the mis-classifications, i.e., functions that are wrongly classified more often than others (a preliminary investigation showed that these mis-classification rates depend on the dimension; in dimension d = 10, for example, function 17 is confused with function 21 in 30% of the tests even when a sample size of n = 10,000 is used). Such data can be used, in particular, for training set selection, but also for the generation of new problem instances for which the algorithms show some behavior not observable on other instances of the same collection [31,21].
Acknowledgments.
We thank Cédric Buron, Claire Laudy, and Bruno Marcon for providing the implementation of the Majority Judgment classifier.
References
1. Balinski, M., Laraki, R.: Judge: Don't vote! Operations Research (3), 483–511 (2014)
2. Belkhir, N., Dréo, J., Savéant, P., Schoenauer, M.: Surrogate assisted feature computation for continuous problems. In: LION. pp. 17–31. Springer (2016)
3. Belkhir, N., Dréo, J., Savéant, P., Schoenauer, M.: Per instance algorithm configuration of CMA-ES with limited budget. In: GECCO. pp. 681–688. ACM (2017)
4. Brams, S., Fishburn, P.: Approval voting, 2nd edition. Springer (2007)
5. Derbel, B., Liefooghe, A., Vérel, S., Aguirre, H., Tanaka, K.: New features for continuous exploratory landscape analysis based on the SOO tree. In: FOGA. pp. 72–86. ACM (2019)
6. Hansen, N., Auger, A., Ros, R., Mersmann, O., Tušar, T., Brockhoff, D.: COCO: a platform for comparing continuous optimizers in a black-box setting. Optimization Methods and Software, pp. 1–31 (2020)
7. Hutter, F., Kotthoff, L., Vanschoren, J. (eds.): Automated Machine Learning - Methods, Systems, Challenges. Springer (2019)
8. Jankovic, A., Doerr, C.: Landscape-aware fixed-budget performance regression and algorithm selection for modular CMA-ES variants. In: GECCO. pp. 841–849. ACM (2020)
9. Kerschke, P., Hoos, H., Neumann, F., Trautmann, H.: Automated Algorithm Selection: Survey and Perspectives. Evolutionary Computation (1), 3–45 (2019)
10. Kerschke, P., Preuss, M., Wessing, S., Trautmann, H.: Detecting Funnel Structures by Means of Exploratory Landscape Analysis. In: GECCO. pp. 265–272. ACM (2015)
11. Kerschke, P., Preuss, M., Wessing, S., Trautmann, H.: Low-Budget Exploratory Landscape Analysis on Multiple Peaks Models. In: GECCO. pp. 229–236. ACM (2016)
12. Kerschke, P., Trautmann, H.: Automated algorithm selection on continuous black-box problems by combining exploratory landscape analysis and machine learning. Evolutionary Computation (1), 99–127 (2019)
13. Kerschke, P., Preuss, M., Hernández Castellanos, C., Schütze, O., Sun, J.Q., Grimme, C., Rudolph, G., Bischl, B., Trautmann, H.: Cell mapping techniques for exploratory landscape analysis. Advances in Intelligent Systems and Computing, 115–131 (2014)
14. Kerschke, P., Trautmann, H.: Comprehensive feature-based landscape analysis of continuous and constrained optimization problems using the R-package flacco. In: Applications in Statistical Computing: From Music Data Analysis to Industrial Quality Improvement, pp. 93–123. Springer (2019)
15. Lacroix, B., McCall, J.A.W.: Limitations of benchmark sets and landscape features for algorithm selection and performance prediction. In: GECCO (Companion). pp. 261–262. ACM (2019)
16. Lunacek, M., Whitley, D.: The dispersion metric and the CMA evolution strategy. In: GECCO. p. 477. ACM (2006)
17. Mersmann, O., Bischl, B., Trautmann, H., Preuss, M., Weihs, C., Rudolph, G.: Exploratory Landscape Analysis. In: GECCO. pp. 829–836. ACM (2011)
18. Morgan, R., Gallagher, M.: Sampling Techniques and Distance Metrics in High Dimensional Continuous Landscape Analysis: Limitations and Improvements. IEEE Transactions on Evolutionary Computation (3), 456–461 (2014)
19. Muñoz, M.A., Smith-Miles, K.: Effects of function translation and dimensionality reduction on landscape analysis. In: IEEE CEC. pp. 1336–1342. IEEE (2015)
20. Muñoz, M.A., Sun, Y., Kirley, M., Halgamuge, S.K.: Algorithm selection for black-box continuous optimization problems: A survey on methods and challenges. Inf. Sci., 224–245 (2015)
21. Muñoz, M.A., Villanova, L., Baatar, D., Smith-Miles, K.: Instance spaces for machine learning classification. Machine Learning (1), 109–147 (2018)
22. Muñoz, M., Kirley, M., Halgamuge, S.: Exploratory Landscape Analysis of Continuous Space Optimization Problems Using Information Content. IEEE Transactions on Evolutionary Computation (1), 74–87 (2015)
23. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. JMLR, 2825–2830 (2011)
24. Pitra, Z., Repický, J., Holeňa, M.: Landscape analysis of Gaussian process surrogates for the covariance matrix adaptation evolution strategy. In: GECCO. pp. 691–699 (2019)
25. Renau, Q., Doerr, C., Dréo, J., Doerr, B.: Exploratory landscape analysis is strongly sensitive to the sampling strategy. In: PPSN. LNCS, vol. 12270, pp. 139–153. Springer (2020)
26. Renau, Q., Dreo, J., Doerr, C., Doerr, B.: Expressiveness and robustness of landscape features. In: GECCO (Companion). pp. 2048–2051. ACM (2019)
27. Renau, Q., Dreo, J., Doerr, C., Doerr, B.: Exploratory Landscape Analysis Feature Values for the 24 Noiseless BBOB Functions (2021). https://doi.org/10.5281/zenodo.4449934
28. Saini, B., López-Ibáñez, M., Miettinen, K.: Automatic surrogate modelling technique selection based on features of optimization problems. In: GECCO (Companion). pp. 1765–1772 (2019)
29. Seo, D., Moon, B.R.: An information-theoretic analysis on the interactions of variables in combinatorial optimization problems. Evol. Comput. (2), 169–198 (2007)
30. Škvorc, U., Eftimov, T., Korošec, P.: Understanding the problem space in single-objective numerical optimization using exploratory landscape analysis. Appl. Soft Comput., 106138 (2020)
31. Smith-Miles, K., Bowly, S.: Generating new test instances by evolving in instance space. Computers & OR, 102–113 (2015)
32. Sobol', I.: On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics 7 (1967)