Towards Explainable Exploratory Landscape Analysis: Extreme Feature Selection for Classifying BBOB Functions
Quentin Renau, Johann Dreo, Carola Doerr, and Benjamin Doerr
Thales Research & Technology, Palaiseau, France
École Polytechnique, Institut Polytechnique de Paris, CNRS, LIX, France
Sorbonne Université, CNRS, LIP6, Paris, France
Abstract.
Facilitated by the recent advances of Machine Learning (ML), the automated design of optimization heuristics is currently shaking up evolutionary computation (EC). Where the design of hand-picked guidelines for choosing a most suitable heuristic has long dominated research activities in the field, automatically trained heuristics are now seen to outperform human-derived choices even for well-researched optimization tasks. ML-based EC is therefore not any more a futuristic vision, but has become an integral part of our community.

A key criticism that ML-based heuristics are often faced with is their potential lack of explainability, which may hinder future developments. This applies in particular to supervised learning techniques which extrapolate algorithms' performance based on exploratory landscape analysis (ELA). In such applications, it is not uncommon to use dozens of problem features to build the models underlying the specific algorithm selection or configuration task. Our goal in this work is to analyze whether this many features are indeed needed. Using the classification of the BBOB test functions as testbed, we show that a surprisingly small number of features – often less than four – can suffice to achieve a 98% accuracy. Interestingly, the number of features required to meet this threshold is found to decrease with the problem dimension. We show that the classification accuracy transfers to settings in which several instances are involved in training and testing. In the leave-one-instance-out setting, however, classification accuracy drops significantly, and the transformation-invariance of the features becomes a decisive success factor.
Keywords:
Exploratory Landscape Analysis · Feature Selection · Black-Box Optimization.
Evolutionary algorithms and other iterative optimization heuristics (IOHs) are classically introduced as frameworks within which a user can gather some modules to instantiate an algorithm. For instance, the design of an evolutionary algorithm requires choosing the population size, the variation and selection operators in use, the encoding structure, fitness function penalization weights, etc.
This highly flexible design of IOHs allows for efficient abstractions but comes with the burden of having to solve an additional (meta-)optimization problem. Automated design of heuristics aims at solving this problem by providing data-driven recommendations on which IOH shall be employed for a given optimization problem and how it shall be configured. Automated IOH design has proven its promise in numerous applications, see [7,9,20,3,12] for examples and further references.

A common critique of machine-trained automated algorithm design is its potential lack of explainability. That is, the general fear is that by relying on automated design approaches, we may be losing intuition for why certain recommendations are made – a key driver for the development of new optimization approaches. This fear is not without reason: the vast majority of automated algorithm design studies fall short in this explainability aspect.
Our Contribution.
Our work aims at providing paths to narrowing this important gap, by studying which information the trained models actually need to achieve convincing performance. As testbed we chose the automated classification of optimization problems through exploratory landscape analysis (ELA). We show that very small feature sets can suffice to reliably discriminate between various optimization problems, and that these sets are robust with respect to the classifiers and function instances.

Apart from the explainability aspect, our findings have important consequences also for the efficiency of automated algorithm design: smaller feature sets are faster to compute and they can drastically reduce the time spent in the training phase. Another advantage of feature selection is that the classification or regression accuracy can increase.

Background and Motivation.
ELA was introduced in [17] with the objective to gain insights about the properties of an unknown optimization problem. Instead of relying on expert knowledge, the keystone of ELA is a set of computer-generated features that are based on sampling the decision space. With the purpose of enhancing the effectiveness of this approach, several additional features have been introduced since. A good selection of these features is automatically computed by the R package flacco [14], see Sec. 2 for more details.

We chose classification as task because it offers a very clean setting in which the results are easily interpretable. Classification has a straightforward performance measure, the classification accuracy, i.e., the fraction of items that are classified correctly. Additionally, the classification accuracy is a good way of estimating the expressiveness of ELA feature sets, i.e., their ability to discriminate between different problems [26]. A proper classification furthermore plays an important role also in many other ML tasks, including the selection and configuration of algorithms, so that a good classification accuracy can be expected to provide good results also for these tasks.
Related Work.
Given the mentioned speed-up and the better performance that one can expect from smaller feature sets, feature selection is not new, but rather standard in automated algorithm design. However, most related works still use a relatively large number of features, hindering the explainability of the trained models. Among the ELA-based applications in EC, the following ones have used the smallest feature portfolios.

Muñoz and Smith-Miles [19] compute the co-linearity between landscape features with the idea that if two features are strongly co-linear, they carry the same type of information about the landscape. Applying this procedure, nine features were kept for further analysis: the adjusted coefficient of determination of a linear regression model including interactions [17], the adjusted coefficient of determination of a quadratic regression model [17], the ratio between the minimum and maximum absolute values of the quadratic term coefficients in the quadratic model, the significance of D-th and first order [29], the skewness, kurtosis and entropy of the fitness function distribution [17], and the maximum information content [22].

Another method to perform feature selection is the use of search algorithms. In their work, Kerschke and Trautmann [12] compare four different algorithms, a greedy forward-backward selection, a greedy backward-forward selection, a (10 + 5)-GA and a (10 + 50)-GA. The smallest feature sets considered in their algorithm selection setting have a size of eight features: three features from the y-distribution feature set [17] (skewness, kurtosis, and number of peaks), one level set feature [17] (the ratio of mean misclassification errors when using a linear (LDA) and mixed discriminant analysis (MDA)), two information content features [22] (the maximum information content and the settling sensitivity), one cell mapping feature [13] (the standard deviation of the distances between each cell's center and worst observation), and one of the basic features (the best fitness value within the sample). This is still considerably larger than the sets we will identify as promising in our work.

Saini et al. [28] and Lacroix and McCall [15] also use reduced feature sets, but do not expand on how these have been derived.

Availability of Our Data.
All our project data is available at [27].
Our primary objective is to analyze the number of features that are needed to correctly classify the 24 BBOB functions from the COCO benchmark environment, and their robustness across several dimensions and sample sizes. We describe in this section the benchmark set, the experimental procedure, and the classification scheme.
The 24 BBOB Benchmark Problems.
A standard benchmark environment for numerical black-box optimization is the COCO (COmparing Continuous Optimizers) platform [6]. From this environment, we consider the BBOB suite, a set of 24 noiseless problems. For each BBOB problem, several instances are available, which are obtained from a "base" function via translation, rotation, and/or scaling transformations [6]. Each problem instance is a real-valued function f: [-5, 5]^d → R. The problems scale to arbitrary dimensions d. In our experiments, we consider six different dimensions, d ∈ {5, 10, 15, 20, 25, 30}, and we focus on the first five instances of each problem (first instance only in Sec. 3). In abuse of notation, we shall often identify the functions by their IDs 1, ..., 24.
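For readers who wish to reproduce this setup, the BBOB problems can be instantiated through COCO's Python interface. The following is only a minimal sketch; the cocoex module and in particular the suite-option strings used here are assumptions about the reader's local COCO installation and are not part of our own pipeline, which only requires the raw samples fed to flacco.

```python
import numpy as np
import cocoex  # COCO's Python experimentation module (assumed to be installed)

# Instantiate the noiseless BBOB problems in dimension 5, first instance only.
# The suite-option string is an assumption; consult the COCO documentation.
suite = cocoex.Suite("bbob", "", "dimensions: 5 instance_indices: 1")

for problem in suite:
    # Evaluate one uniformly random point in the BBOB domain [-5, 5]^d.
    x = np.random.uniform(-5.0, 5.0, problem.dimension)
    print(problem.id, problem(x))
```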
Computation of Feature Values via flacco.
For the feature value approximation, we sample for each of the 24 functions f a number n of points x^(1), ..., x^(n) ∈ [-5, 5]^d, and we evaluate their function values f(x^(1)), ..., f(x^(n)). The set of pairs {(x^(i), f(x^(i))) | i = 1, ..., n} is then fed to the flacco package [14], which returns a vector of features. The flacco package covers a total number of 343 features [9], which are grouped into 17 feature sets. However, some of these features are often omitted in practice because they require adaptive sampling [2,12,18,24], while other features have previously been dismissed as non-informative for the BBOB functions [13,26]. After removing these sets from our test bed, we are left with six feature sets: dispersion (disp [16]), information content (ic [22]), nearest better clustering (nbc [10]), meta model (ela_meta [17]), y-distribution (ela_distr [17]), and principal component analysis (pca [14]). But even if this selection reduces the number of features to 46, a full enumeration of all subsets of all sizes c ≤ 46 would still be computationally infeasible (since we need to train and test a classification model for each such set). We therefore need to reduce the set of eligible features further. To this end, we build on the work presented in [26], in which we studied the expressiveness of these 46 features. Based on this work we select four features. We add to this selection another six features, one per each of the feature sets mentioned above (to ensure a broad diversity of features), again giving preference to the most expressive ones and to features invariant to BBOB transformations [30]. This leaves us with the following ten features. We indicate in this list by ✓ and - whether or not a feature is considered invariant under transformation according to [30] (first entry) and according to our data (second entry), respectively. Note here that the setting used in [30] is slightly different from the instances used in BBOB, mostly due to different ways to handle boundary constraints. The assessment can therefore differ.

1. disp.ratio_mean_02 [✓, ✓] (disp) computes the ratio of the pairwise distances of the points having the best 2% fitness values with the pairwise distances of all points in the design.
2. ela_distr.skewness [✓, ✓] (skew) computes the skewness coefficient of the distribution of the fitness values. This coefficient is a measure of the asymmetry of a distribution around its mean.
3. ela_meta.lin_simple.adj_r2 [✓, ✓] (lr2), which computes the adjusted correlation coefficient R² of a linear model fitted to the data.
4. ela_meta.lin_simple.intercept [✓, -] (int), the intercept coefficient of the linear model.
5. ela_meta.lin_simple.coef.max [-, -] (max), the largest coefficient of the linear model that is not the intercept coefficient.
6. ela_meta.quad_simple.adj_r2 [✓, ✓] (qr2), the adjusted correlation coefficient R² of a quadratic model fitted to the data.
7. ic.eps.ratio [-, ✓] (ε_ratio), the half partial information sensitivity.
8. ic.eps.s [-, ✓] (ε_s), the settling sensitivity.
9. nbc.nb_fitness.cor [✓, ✓] (nbc), the correlation between the fitness values of the search points and their indegree in the nearest-better point graph.
10. pca.expl_var_PC1.cov_init [✓, ✓] (pca), which measures the importance of the first principal component of a Principal Component Analysis (PCA) over the sample points in the whole search space.

Normalization of Feature Values.
The value of each feature is normalized between 0 and 1, where 0 (resp. 1) corresponds to the smallest (resp. largest) value encountered in the approximated feature values. This normalization is performed independently for each dimension, each sample size, and each classifier used in this paper.
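A minimal sketch of this min-max normalization, assuming the approximated feature values for one (dimension, sample size) setting are stored in a NumPy array of shape (number of designs, number of features); the array and function names are ours and not part of the original pipeline.

```python
import numpy as np

def minmax_normalize(feature_values: np.ndarray) -> np.ndarray:
    """Scale each feature (column) to [0, 1] using the observed min and max."""
    lo = feature_values.min(axis=0)
    hi = feature_values.max(axis=0)
    # Guard against constant features to avoid division by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (feature_values - lo) / span

# Example: 100 feature vectors with 10 features each, one per Sobol' design.
values = np.random.rand(100, 10) * 7.0 - 2.0
normalized = minmax_normalize(values)
assert normalized.min() >= 0.0 and normalized.max() <= 1.0
```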
Sampling Strategy.
Based on an extension of the preliminary experiments reported in [25], we use a quasi-random distribution to sample the points x^(1), ..., x^(n) from which the feature values are computed. More precisely, we use Sobol' sequences [32], which we obtain from the Python package sobol_seq (version 0.1.2), with randomly chosen initial seeds. We sample a total number of 100 independent Sobol' designs, which leaves us with 100 feature value vectors for each function. Fig. 1 provides an impression of the distribution of these feature values. Plotted are approximated values for the lr2 feature. The comparison shows that the dispersion slightly decreases with the dimension, which is quite surprising in light of the lower density of the points in higher dimensions. We also see that the median values are not stable across dimensions. Some functions (F5 of course, which is correctly identified as a linear function, but also F16, F19, and F20, for example) show a high concentration of feature value approximations, whereas other functions show much larger dispersion within one dimension (e.g., F12, F15, F17, F18) or between different dimensions (F2, F11, F24).
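As an illustration of this sampling step, the sketch below draws a scrambled Sobol' design in [-5, 5]^d. It uses SciPy's quasi-Monte Carlo module as a stand-in for the sobol_seq package named above, so it reproduces the kind of design we use, not the exact point sets.

```python
import numpy as np
from scipy.stats import qmc

def sobol_design(dim: int, n_points: int, seed: int) -> np.ndarray:
    """Return n_points quasi-random points in [-5, 5]^dim from a Sobol' sequence."""
    sampler = qmc.Sobol(d=dim, scramble=True, seed=seed)
    unit_points = sampler.random(n_points)     # points in [0, 1)^dim
    return -5.0 + 10.0 * unit_points           # rescale to the BBOB domain

# Example: one of the 100 independent designs for d = 5 with n = 250 * d samples.
design = sobol_design(dim=5, n_points=250 * 5, seed=42)
print(design.shape)  # (1250, 5)
```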
Sample Size.
To study the effect of the sample size on the number of features needed to correctly classify the 24 BBOB functions, we conduct experiments for seven different values of n, namely n ∈ {30d, 50d, 100d, 250d, 650d, 800d, 1000d}. We note here that a linear scaling of the sample size is by far the most common choice, see, for example, [3,11,12].

Feature Selection.
We apply a wrapper method, i.e., we actually train a classifier for every considered subset of features. For a given sample size and a given dimension, we train and test all (10 choose c) possible subsets of size c, starting with c = 1. If none of these size-c subsets achieves our target accuracy, we move on to the size-(c+1) subsets. As soon as a sufficiently qualified subset has been identified, we continue to evaluate all size-c subsets, but stop the selection process thereafter. This full enumeration of all possible feature combinations for a given size c allows us to investigate the robustness of the feature selection. Ideally, we would like to see that the feature sets achieving our 98% accuracy threshold (this will be introduced below) are stable across the different sample sizes. Robustness with respect to the dimension is much less of a concern to us, since the problem dimension is typically known and can be used for choosing the feature ensemble that shall be applied to characterize the problem. A minimal sketch of this enumeration loop is given below.
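The sketch assumes a helper meets_target(subset) that trains and tests a classifier on the given feature subset and reports whether the accuracy threshold is met in all validation runs; this helper and the feature names are ours, and one possible implementation of the helper is sketched in the validation paragraph below.

```python
from itertools import combinations

FEATURES = ["disp", "skew", "lr2", "int", "max",
            "qr2", "eps_ratio", "eps_s", "nbc", "pca"]

def smallest_successful_subsets(meets_target):
    """Enumerate feature subsets by increasing size; return all successful
    subsets of the smallest size for which at least one subset succeeds."""
    for size in range(1, len(FEATURES) + 1):
        successful = [subset for subset in combinations(FEATURES, size)
                      if meets_target(subset)]
        if successful:  # stop only after fully evaluating this subset size
            return size, successful
    return None, []
```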
Fig. 1: Distribution of the feature values for the lr2 feature for different dimensions. Each feature value is computed from 250 × d samples and each boxplot represents results of 100 independent feature computations.

Validation Procedure and Target Classification Accuracy.
In our experiments, we use 80 randomly chosen feature vectors (per function) to train a classification model, and we use the remaining 24 × 20 = 480 feature vectors for testing. For each of these 480 test cases we store the true function ID (i.e., the ID of the function that the feature value originates from) and we store the ID of the function that the classifier matches the feature vector to. From this data we compute the overall classification accuracy. We repeat this procedure of splitting the set of all feature vectors into 80 training and 20 test instances 20 times; i.e., we repeat 20 times a random sub-sampling validation. We require that the overall classification accuracy for each of the 20 validations is at least 98%. That is, a feature set is eligible if, in each of the 20 random sub-sampling validation runs, it misclassifies at most 10 out of the 480 tested feature vectors. Feature combinations achieving a smaller classification accuracy in one of the validation runs are immediately discarded.
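One possible implementation of this repeated random sub-sampling check is sketched below, assuming a feature matrix of shape (24 functions × 100 designs, number of selected features) with matching function labels and any scikit-learn-style classifier; all names are ours.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def meets_target(make_classifier, X, y, threshold=0.98, repetitions=20):
    """Random sub-sampling validation: an 80/20 split per repetition,
    requiring the accuracy threshold to be met in every repetition."""
    for rep in range(repetitions):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=rep)
        clf = make_classifier()
        clf.fit(X_train, y_train)
        if accuracy_score(y_test, clf.predict(X_test)) < threshold:
            return False  # discard this feature combination immediately
    return True
```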
Classification Model.
In the main part of this work, we use a Majority Judgment classifier [1]. A cross-validation with decision trees and KNN classifiers will be presented in Sec. 4.

The Majority Judgment classifier works as follows. Let Φ = {ϕ_1, ..., ϕ_k} be the set of features for which we want to know whether it achieves our 98% target precision requirement. We consider one of the independent sub-sampling validation runs. That is, for each function we randomly select 80 out of the 100 feature vectors. Denoting by ϕ_{i,j,r} the r-th estimated value of feature ϕ_i for the j-th BBOB function, the set {(ϕ_{i,j,r}, j) | i = 1, ..., k, j = 1, ..., 24, r = 1, ..., 80} describes the full set of training data. From this data we compute for each of the 24 functions j = 1, ..., 24 and for each feature ϕ_i ∈ Φ the median value M(i, j) := M({ϕ_{i,j,r} | r = 1, ..., 80}). This gives us a set of 24k values M(i, j) and concludes the training step.

In the testing step we apply an approval voting mechanism [4] to each of the 480 test instances. Approval voting mechanisms are single-winner systems where the winner is the most-approved candidate among the voters. From this class of approval voting mechanisms we choose Majority Judgment [1], a voting technique which ensures that the winner among three or more candidates has received an absolute majority of the scores given by the voters.

To apply Majority Judgment to our classification task, we do the following. We recall that the task of the classifier is to output, for a given feature vector ζ = (ζ_1, ..., ζ_k), the ID of the function that it believes this feature vector to belong to. To this end, it first computes for each of the k features i and for all 24 functions j the absolute distances d_{i,j} := |ζ_i − M(i, j)|. Tab. 1 presents an example of what the distances may look like. We then compute for each function the median of these distances, by setting D_j(ζ) := M({d_{i,j} | i = 1, ..., k}). The cells with these median values are highlighted with a blue background in Tab. 1, and the values D_j(ζ) are reported in the last line. The classifier outputs as predicted function ID the value j for which the distance D_j(ζ) is minimized. This cell is highlighted in yellow background color.

Table 1: Example for the Majority Judgment classification scheme with three features. The values in the table are the distances of the measured feature values ζ_i to the median feature values M(i, j) of the training set. The median distances D_j are reported in the last line. The ID of the function minimizing this median distance D_j is the output of the Majority Judgment classifier.
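A compact sketch of this training and prediction scheme, assuming the training feature vectors are given together with their function IDs; the class and variable names are ours.

```python
import numpy as np

class MajorityJudgmentClassifier:
    """Store per-function, per-feature medians at training time; at prediction
    time, assign the function whose median per-feature distance is smallest."""

    def fit(self, X_train, y_train):
        # X_train: array of shape (n_samples, k features); y_train: function IDs.
        self.function_ids_ = np.unique(y_train)
        self.medians_ = np.array([np.median(X_train[y_train == fid], axis=0)
                                  for fid in self.function_ids_])  # shape (24, k)
        return self

    def predict(self, X_test):
        predictions = []
        for zeta in X_test:
            # d_{i,j} = |zeta_i - M(i, j)| for every feature i and function j,
            # then D_j = median_i d_{i,j}; predict the argmin over j.
            distances = np.abs(self.medians_ - zeta)   # shape (24, k)
            D = np.median(distances, axis=1)           # shape (24,)
            predictions.append(self.function_ids_[np.argmin(D)])
        return np.array(predictions)
```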
Computation Time.
To give an impression of the computational resources required for our experiments, we report that the computation of the 100 5-dimensional feature vectors requires around 6 CPU hours, whereas the computation of the 25-dimensional feature vectors takes about 1221 CPU hours. Training and testing the classifier takes between 1 second and 3 hours, depending on the setting. In total, we have invested around 432 CPU days for computing the data presented in this work.
                       Sample size
dimension   30d   50d   100d   250d   650d   800d   1000d
    5        -     -     -      4      4      -      2
   10        -     -     -      4      1      2      1
   15        -     -     6      4      2      2      2
   20        -     -     6      2      1      1      2
   25        1     1     1      1      1      1      1
   30        -     6     2      1      1      2      2

Table 2: Feature combination size achieving 98% classification accuracy in all 20 runs. Entries marked "-" indicate that no feature combination met the threshold.
The portfolios of features for which we obtained the desired 98% classification accuracy for each of the 20 random sub-sampling validation runs are presented in Tab. 3. For convenience, their sizes are summarized in Tab. 2.

Our first, and most important, finding is that we can actually classify the BBOB functions with very few features. However, we also see that the existence of such portfolios requires a sufficient sample size. For d ∈ {5, 10, 15, 20}, none of the 2^10 possible portfolios based on size-30d and size-50d feature approximations could achieve the 98% accuracy threshold.

We also see that, as expected, the size of the minimal portfolio achieving the target precision decreases with increasing sample size. A few exceptions to this rule exist:
– No combination in d = 5 with n = 800d samples achieved the target precision.
– In d = 10 we see that a single feature, the intercept feature int, suffices to classify with 98% accuracy when the sample size is 650d or 1000d. For 800d, however, this feature does not achieve the threshold. A detailed analysis of the classification accuracy achieved with this feature will be given in Fig. 2.
– In d = 15, the ε_ratio information content feature classifies properly when the sample size equals n = 800d, but for n = 1000d, one additional feature is needed to pass the 98% accuracy threshold.
– In d = 20 a single feature suffices for n = 650d and n = 800d, but for n = 1000d an additional feature is needed to achieve the target accuracy.

Overall, we see that for ten settings a single feature suffices for proper classification. An additional seven cases can be solved by a combination of two features. It seems counter-intuitive that in almost all cases the size of the smallest admissible portfolio decreases with increasing dimension. However, as already discussed in the context of Fig. 1, the dispersion of some feature values decreases with increasing dimension – an effect that is interesting in its own right. Without going into much detail here, we note that this effect is further intensified when using a properly scaled sampling size that maintains the same sampling density across dimensions.
Table 3: Feature combinations achieving the 98% classification accuracy threshold in all 20 runs. For each (d, n) setting, the table marks which of the ten features int, lr2, qr2, max, ε_s, ε_ratio, disp, skew, pca, and nbc belong to a successful combination; features with the same symbol (X, O, H, V) belong to the same combination. Results are grouped by dimension d and by the sample size n used to approximate the feature values. Blank rows are for (d, n) settings for which all 2^10 feature sets failed. M = missing data (due to coronavirus measures in France, we have lost access to cluster and data).
Fig. 2: Distributions of intercept feature accuracy by dimension and sample size
Robustness of the feature combinations with respect to dimension and sample size.
Looking at the robustness of the selected combinations over the dimensions and the sample sizes, we observe the following.

One feature, the intercept feature int, is involved in 15 out of the 28 (d, n) pairs for which a successful feature portfolio could be found. This feature, in contrast, is rarely present in other combinations of size |c| > 1. To shed more light on its expressive power, we present in Fig. 2 the distributions of the classification accuracy for the various (d, n) combinations. Aggregated over all dimensions and all sample sizes, the median accuracy of the int feature is 96%. Even if the feature does not always reach our threshold of 98%, it is worth noting that its performance is almost always above 90%. Therefore, this feature is very expressive, and this across all tested dimensions and sample sizes. Another interesting observation from Fig. 2 is that the classification accuracy is not monotonic in the dimension. In all but one case (n = 30d), the d = 15 results are worse than those for the other dimensions. As already seen in Tab. 3, for n = 250 × d we always have very good classification accuracy.

The most frequent feature is ε_ratio, which is present in almost all combinations of size |c| ≥ 2. We count 21 successful combinations of size |c| ≥ 2, and ε_ratio appears in 20 of these combinations, regardless of the dimension and the sample size. In total, it appears in successful portfolios for 17 out of the 28 (d, n) combinations for which a successful subset had been found. The ε_ratio feature is thus very useful for our classification task.

The skewness feature skew, in contrast, does not appear in any of the portfolios of the smallest size.

Classification Accuracy When Using All flacco Features.
We compare the results presented above with the classification accuracy achieved by the Majority Judgment voting scheme using the whole set of 46 features described in Sec. 2. We perform the same sub-sampling validation as above. Interestingly, none of the tests performed on the pairs (d, n) with the sample sizes n listed above and d ∈ {5, 10, 15, 20, 25, 30} met our required target precision of 98% for each of the 20 runs. We can thus conclude that, in addition to the gain in explainability, the selection of features for supervised-ELA approaches provides better performance, and – as we shall discuss below – also comes at a much smaller computational cost.

Having identified feature portfolios that reliably classify the BBOB functions with at least 98% accuracy when using Majority Judgment (MJ), we now investigate how robust this accuracy is with respect to the choice of the classifier. To this end, we apply the same classification routine as above, but now using decision trees (DT) and K Nearest Neighbors (KNN) as classifiers. We use off-the-shelf implementations from the scikit-learn Python package [23] (version 0.21.3). Since our goal is to investigate robustness, we do not perform any hyper-parameter tuning for these two classifiers. For the KNN classifier we use K = 5. For all classifications with a reduced portfolio of features, if multiple combinations are available, only the one marked with X in Tab. 3 is used.

Both KNN and decision trees perform as well as our classifier when trained and tested with the small portfolios from Tab. 3, i.e., they both reach at least 98% classification accuracy in every run, except for the decision trees trained with only one feature, for which the accuracy drops to around 62% in every run. Fig. 3 summarizes the classification accuracy of the three classifiers for the case that features are based on n = 250d samples, for the portfolios described in Tab. 3. Performance is indeed very robust with respect to the classification mechanism.
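The two off-the-shelf classifiers can be instantiated as follows; this is a minimal sketch using scikit-learn defaults, mirroring the no-tuning setup described above (the factory and variable names are ours).

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def make_decision_tree():
    # Default hyper-parameters, no tuning; fixed seed for reproducibility.
    return DecisionTreeClassifier(random_state=0)

def make_knn():
    # K = 5 nearest neighbors, default distance metric.
    return KNeighborsClassifier(n_neighbors=5)

# Both factories can be plugged into the meets_target validation sketch above,
# e.g. meets_target(make_knn, X_selected_features, function_ids).
```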
Running Time.
While training and testing completed in around 4 seconds for the DT and for the MJ voting scheme, the KNN classifier needed around 12 seconds to complete the 20 sub-sampling validation runs.
Gain over Full Feature Set.
We now study how much we gain in terms of computation time when we compute, train, and test the three classifiers (MJ, DT, and KNN) on the selected feature sets only. To quantify this gain, we train all three classifiers with the full set of 46 features mentioned in Sec. 2. We first observe that the decision tree classifier has the best performance among the three classifiers in terms of accuracy. It achieves at least 99% classification accuracy. For KNN, in contrast, performance drops below our 98% threshold precision on several runs, resulting in a median classification accuracy (over all tests) of around 97%. The results for KNN align, as already briefly touched upon in Sec. 3, with those obtained using MJ, where none of the tests produced 20 runs in which the threshold was reached.

In terms of computation time, we observe significant differences between the small feature portfolios and the full flacco set. As already commented in Sec. 2, the computation of the feature values can be very time-consuming. Reducing the number of features therefore reduces the running time of the feature extraction. However, the savings are even bigger when comparing the cost of training (and testing) the classifiers. For decision trees, the execution of the whole classification pipeline takes 3000 times longer than with the small portfolios – around 3 CPU hours instead of a few seconds. For KNN, the total cost is comparable, also around 3 CPU hours for training and testing the classifiers for the 20 sub-sampling validation runs. For the MJ classifier, the overall running time is only around 35 CPU minutes – which is still way above the time needed for the small portfolios.

Thus, overall, the reduced portfolios resulted not only in much faster computation times, but also achieved better classification accuracy.

Fig. 3: Classification accuracy for the feature portfolios from Tab. 3 for budget 250d. Results are sorted by dimension and classifier and are for 20 random sub-sampling validation runs. Training and testing is done on the first instance of each function only. The X corresponds to settings that did not achieve the 98% threshold.

The discussion above focused on classifying the first instance of the BBOB functions, and we now investigate how robust the selection is with respect to different instances of the same problems. Concretely, we investigate the classification accuracy when applying the same random sub-sampling validation routine as above to the set of features computed for the first five instances of the BBOB functions. In this experiment, we keep 80% of feature values for each instance for training the classifier, and we test on the remaining ones. In a second step we then test transferability, by performing a leave-one-instance-out (LOIO) cross-validation.
In this setting, the classifiers are trained on four instances of each function and tested on the remaining one. We use the portfolios marked by an X in Tab. 3, and compare to the classification accuracy when using all ten features. In the following, MJ voting is excluded since, by design, it is not suited to work with multiple distributions coming from different instances. Hence, only DT and KNN classifiers will be used in this section.
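A possible sketch of the LOIO protocol, assuming feature vectors labeled with both the function ID and the instance number; the array names and the scikit-learn-style classifier factory are ours.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def loio_accuracies(make_classifier, X, function_ids, instance_ids):
    """Leave-one-instance-out: train on four instances, test on the held-out one.
    Returns one accuracy value per held-out instance."""
    accuracies = {}
    for held_out in np.unique(instance_ids):
        train = instance_ids != held_out
        test = instance_ids == held_out
        clf = make_classifier()
        clf.fit(X[train], function_ids[train])
        accuracies[int(held_out)] = accuracy_score(
            function_ids[test], clf.predict(X[test]))
    return accuracies
```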
Fig. 4: Classification accuracy of DT and KNN classifiers when applied to the first five instances of the 24 BBOB functions. Feature values are computed from 250d samples, for the portfolios marked by an X in Tab. 3. Cases with poor performance are marked by a red X.

Fig. 4 aggregates the results obtained for the first classification task, where we take feature values from each of the first five instances. As in Fig. 3, DT performs badly in d = 25 and d = 30, where classification is only based on the intercept feature. For these cases, the median accuracy is 45% and 62%, respectively. Since the intercept feature is not invariant to fitness function transformations, the worsened performance is no surprise. In contrast, the median classification accuracy is above 98% for all portfolios with at least two features. We also note that KNN in dimension d = 25 does not reach our 98% threshold, but still achieves good performance with an average 97% accuracy.

Fig. 5 presents the classification accuracy achieved by KNN and DT in the LOIO setting. Fig. 5a is for the features lr2, qr2, ε_ratio, and nbc computed from 650d samples in d = 5, and Fig. 5b is for the two features qr2 and ε_ratio computed from 250d samples in d = 20. For comparison, we also plot the classification accuracy achieved when using all ten features listed in Sec. 2. For most settings, the accuracy obtained with the set of ten features is better than that achieved for the smaller portfolios. For the 650d setting, this is the case for all instances. For the 250d setting, DT performs better with the smaller portfolio when instance 1 or instance 3 is left out. The performance loss when using the reduced feature set is particularly drastic for KNN when instance 1 is left out (both cases), when instance 2 is left out (650d case), and when instance 4 is left out (250d case). Interestingly, for DT in the 650d setting, the largest performance losses occur when leaving instance 2 or 5 out. The average loss in classification accuracy is 5% and 4% for KNN in the 650d and the 250d case, respectively. For DT, the average loss in the 650d case is 10% and the average gain in the 250d case is 2%.

Fig. 5: Classification accuracy of KNN and DT in the leave-one-instance-out setting. (a) 650d samples, d = 5. (b) 250d samples, d = 20. The subscripts 2, 4, and 10 refer to the size of the feature portfolio.

We conclude that the feature selection is robust when studying different instances, except for those portfolios which consist only of a single feature. For the (arguably more interesting) LOIO setting, however, classification accuracy drops, but non-homogeneously for the different instances. We recommend using the larger feature portfolio in this case.

Our ambition to build small feature sets is driven by the desire to obtain models that are (at least to some degree) human-interpretable. While our study certainly has several limitations, as only one test bed is considered, it nevertheless shows that the number of features needed to successfully classify the BBOB functions is surprisingly low. Our main direction for future work is an application of the small feature sets to automated algorithm design tasks. [8] shows promising performance of the selected feature portfolio presented in Sec. 2 for automated performance regression and per-instance algorithm selection, results that we wish to detail further based on the results presented in Sec. 3. Our next important goal will then be to uncover how the performance of a given solver depends on the selected features, by taking a closer look at the trained regression models. With small feature sets, there is reasonable hope that we can identify meaningful correlations.

We are targeting, in the mid-term perspective, classifiers and automated algorithm design techniques that work well on highly constrained problems and which can cope with discontinuities. Extending the results of this work to such problems forms another important next step.

Other interesting directions for future work include the investigation of new features recently proposed in the literature (such as, for example, the SOO-based features [5]). We also plan a closer inspection of the classification results presented above, particularly with respect to the mis-classifications, i.e., functions that are wrongly classified more often than others (a preliminary investigation showed that these mis-classification rates depend on the dimension; in dimension d = 10, for example, function 17 is confused with function 21 in 30% of the tests even when a sample size of n = 10,000 is used). Such data can be used, in particular, for training set selection, but also for the generation of new problem instances for which the algorithms show some behavior not observable on other instances of the same collection [31,21].
Acknowledgments.
We thank Cédric Buron, Claire Laudy, and Bruno Marcon for providing the implementation of the Majority Judgment classifier.
References
1. Balinski, M., Laraki, R.: Judge: Don't vote! Operations Research (3), 483–511 (2014)
2. Belkhir, N., Dréo, J., Savéant, P., Schoenauer, M.: Surrogate assisted feature computation for continuous problems. In: LION. pp. 17–31. Springer (2016)
3. Belkhir, N., Dréo, J., Savéant, P., Schoenauer, M.: Per instance algorithm configuration of CMA-ES with limited budget. In: GECCO. pp. 681–688. ACM (2017)
4. Brams, S., Fishburn, P.: Approval voting, 2nd edition. Springer (2007)
5. Derbel, B., Liefooghe, A., Vérel, S., Aguirre, H., Tanaka, K.: New features for continuous exploratory landscape analysis based on the SOO tree. In: FOGA. pp. 72–86. ACM (2019)
6. Hansen, N., Auger, A., Ros, R., Mersmann, O., Tušar, T., Brockhoff, D.: COCO: a platform for comparing continuous optimizers in a black-box setting. Optimization Methods and Software, pp. 1–31 (2020)
7. Hutter, F., Kotthoff, L., Vanschoren, J. (eds.): Automated Machine Learning - Methods, Systems, Challenges. Springer (2019)
8. Jankovic, A., Doerr, C.: Landscape-aware fixed-budget performance regression and algorithm selection for modular CMA-ES variants. In: GECCO. pp. 841–849. ACM (2020)
9. Kerschke, P., Hoos, H., Neumann, F., Trautmann, H.: Automated Algorithm Selection: Survey and Perspectives. Evolutionary Computation (1), 3–45 (2019)
10. Kerschke, P., Preuss, M., Wessing, S., Trautmann, H.: Detecting Funnel Structures by Means of Exploratory Landscape Analysis. In: GECCO. pp. 265–272. ACM (2015)
11. Kerschke, P., Preuss, M., Wessing, S., Trautmann, H.: Low-Budget Exploratory Landscape Analysis on Multiple Peaks Models. In: GECCO. pp. 229–236. ACM (2016)
12. Kerschke, P., Trautmann, H.: Automated algorithm selection on continuous black-box problems by combining exploratory landscape analysis and machine learning. Evolutionary Computation (1), 99–127 (2019)
13. Kerschke, P., Preuss, M., Hernández Castellanos, C., Schütze, O., Sun, J.Q., Grimme, C., Rudolph, G., Bischl, B., Trautmann, H.: Cell mapping techniques for exploratory landscape analysis. Advances in Intelligent Systems and Computing, 115–131 (2014)
14. Kerschke, P., Trautmann, H.: Comprehensive feature-based landscape analysis of continuous and constrained optimization problems using the R-package flacco. In: Applications in Statistical Computing: From Music Data Analysis to Industrial Quality Improvement, pp. 93–123. Springer (2019)
15. Lacroix, B., McCall, J.A.W.: Limitations of benchmark sets and landscape features for algorithm selection and performance prediction. In: GECCO (Companion). pp. 261–262. ACM (2019)
16. Lunacek, M., Whitley, D.: The dispersion metric and the CMA evolution strategy. In: GECCO. p. 477. ACM (2006)
17. Mersmann, O., Bischl, B., Trautmann, H., Preuss, M., Weihs, C., Rudolph, G.: Exploratory Landscape Analysis. In: GECCO. pp. 829–836. ACM (2011)
18. Morgan, R., Gallagher, M.: Sampling Techniques and Distance Metrics in High Dimensional Continuous Landscape Analysis: Limitations and Improvements. IEEE Transactions on Evolutionary Computation (3), 456–461 (2014)
19. Muñoz, M.A., Smith-Miles, K.: Effects of function translation and dimensionality reduction on landscape analysis. In: IEEE CEC. pp. 1336–1342. IEEE (2015)
20. Muñoz, M.A., Sun, Y., Kirley, M., Halgamuge, S.K.: Algorithm selection for black-box continuous optimization problems: A survey on methods and challenges. Inf. Sci., 224–245 (2015)
21. Muñoz, M.A., Villanova, L., Baatar, D., Smith-Miles, K.: Instance spaces for machine learning classification. Machine Learning (1), 109–147 (2018)
22. Muñoz, M., Kirley, M., Halgamuge, S.: Exploratory Landscape Analysis of Continuous Space Optimization Problems Using Information Content. IEEE Transactions on Evolutionary Computation (1), 74–87 (2015)
23. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. JMLR, 2825–2830 (2011)
24. Pitra, Z., Repický, J., Holeňa, M.: Landscape analysis of Gaussian process surrogates for the covariance matrix adaptation evolution strategy. In: GECCO. pp. 691–699 (2019)
25. Renau, Q., Doerr, C., Dréo, J., Doerr, B.: Exploratory landscape analysis is strongly sensitive to the sampling strategy. In: PPSN. LNCS, vol. 12270, pp. 139–153. Springer (2020)
26. Renau, Q., Dreo, J., Doerr, C., Doerr, B.: Expressiveness and robustness of landscape features. In: GECCO (Companion). pp. 2048–2051. ACM (2019)
27. Renau, Q., Dreo, J., Doerr, C., Doerr, B.: Exploratory Landscape Analysis Feature Values for the 24 Noiseless BBOB Functions (2021). https://doi.org/10.5281/zenodo.4449934
28. Saini, B., López-Ibáñez, M., Miettinen, K.: Automatic surrogate modelling technique selection based on features of optimization problems. In: GECCO (Companion). pp. 1765–1772 (2019)
29. Seo, D., Moon, B.R.: An information-theoretic analysis on the interactions of variables in combinatorial optimization problems. Evol. Comput. (2), 169–198 (2007)
30. Škvorc, U., Eftimov, T., Korošec, P.: Understanding the problem space in single-objective numerical optimization using exploratory landscape analysis. Appl. Soft Comput., 106138 (2020)
31. Smith-Miles, K., Bowly, S.: Generating new test instances by evolving in instance space. Computers & OR, 102–113 (2015)
32. Sobol', I.: On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics 7 (1967)