Towards Feature-Based Performance Regression Using Trajectory Data
Anja Jankovic, Tome Eftimov, and Carola Doerr
Sorbonne Université, CNRS, LIP6, Paris, France
Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
Abstract.
Black-box optimization is a very active area of research, with many new algorithms being developed every year. This variety is needed, on the one hand, since different algorithms are most suitable for different types of optimization problems. But the variety also poses a meta-problem: which algorithm to choose for a given problem at hand? Past research has shown that per-instance algorithm selection based on exploratory landscape analysis (ELA) can be an efficient means to tackle this meta-problem. Existing approaches, however, require the approximation of problem features based on a significant number of samples, which are typically selected through uniform sampling or Latin Hypercube Designs. The evaluation of these points is costly, and the benefit of an ELA-based algorithm selection over a default algorithm must therefore be significant in order to pay off. One could hope to bypass the evaluations for the feature approximations by using the samples that a default algorithm would anyway perform, i.e., by using the points of the default algorithm's trajectory. We analyze in this paper how well such an approach can work. Concretely, we test how accurately trajectory-based ELA approaches can predict the final solution quality of the CMA-ES after a fixed budget of function evaluations. We observe that the loss of trajectory-based predictions can be surprisingly small compared to the classical global sampling approach, if the remaining budget for which solution quality shall be predicted is not too large. Feature selection, in contrast, did not show any advantage in our experiments and rather led to worsened prediction accuracy. The inclusion of state variables of CMA-ES only has a moderate effect on the prediction accuracy.
Keywords:
Exploratory Landscape Analysis · Automated Algorithm Selection · Black-Box Optimization · Performance Regression · Feature Selection
In many real-world optimization challenges, we encounter optimization problems which are too complex to be explicitly modeled via mathematical functions, but which nonetheless need to be assessed and solved, more often than not requiring significant computational resources to do so. Explicit problem modeling is also an issue when the relationship between decision variables and solution quality cannot be established other than by simulations or experiments. A standard example for the latter is the design of (deep) neural networks.
Black-box optimization algorithms (BBOA) are algorithms designed to solve problems of the two types above. BBOA are usually iterative procedures, which actively steer the search by using information obtained from previous iterations, with the goal to eventually converge towards an estimated optimal solution. In each generation, a number of solution candidates are generated and undergo evaluation.
Classically, BBOA were manually designed, based on users' experience. A plethora of algorithmic components exist from which users can choose to build their own algorithms, and the number of these components is growing every year. Even though the basic underlying principles of these components can be considered similar in nature, their performances on different problem instances can greatly vary. An important and challenging task is thus to select the most appropriate and efficient algorithm when presented with a new, unknown problem instance. This research problem, formalized as the algorithm selection problem (ASP) [Ric76], is one of the core questions that evolutionary computation aims to answer. The algorithm selection problem is classically tackled by relying on expert knowledge of both the problem instance and the algorithm's strengths and weaknesses. In recent years, however, due to the significant progress in the machine learning (ML) field, there has been a shift towards an automated selection [KKB+18], which characterizes problem instances through numerical features. In the terminology used in evolutionary computation (EC), features are hence aimed at describing the fitness landscape of a problem instance.
Fitness landscape analysis has a long tradition in EC. For practical use in black-box optimization, however, the fitness landscape properties can only be described via an informed guessing strategy. Concretely, we can only approximate the fitness landscapes, through the samples that we have evaluated and to which a solution quality has been assigned.
Research addressing efficient ways to characterize problem instances via feature approximations is subsumed under the umbrella term exploratory landscape analysis (ELA) [MBT+11]. ELA-based algorithm selection typically proceeds in three steps. In the first step, the features are computed from a set of evaluated samples, e.g., using the R package flacco [KT19b]. In the second step the model for the classification or regression task is built and an algorithm and/or its configuration is suggested. In the third step, this algorithm is then run on the problem instance under consideration. Clearly, the effort for steps 1 and 2 cannot be neglected, and can have a decisive influence on the usefulness of a per-instance algorithm selection/configuration approach, as its effort needs to pay off compared to the performance of a default solver. Even when neglecting the computational overhead of this approach and focusing on function evaluations only as performance measure (as is commonly done in evolutionary computation [HAR+20]), the feature approximation alone typically requires from a few multiples of d [BDSS16] up to 50d samples [KPWT16], where d denotes the dimension of the problem. This is hence a considerable investment.
Of course, one could use these samples to warm-start the optimization heuristics, e.g., by initiating them in good regions and/or by calibrating their search behavior based on the information obtained from the samples used to compute the features.
A charming, yet straightforward alternative would be to integrate the first step of the ELA-based approach described above into the optimization routine, by computing the features based on the search points that a default algorithm would anyway perform. That is, one would use the search trajectory of such a default algorithm to predict and then to select and/or to configure a solver on the fly, once or even several times during the optimization process. Similar to parameter control [KHE15,AM16,DD20], such a dynamic selection would not only allow to identify an efficient algorithm for the given problem instance, but could also benefit from tracking the best choice while the optimization process (and the best response to its needs) evolves.
Such a dynamic algorithm selection can therefore be seen as an ELA-based variant of hyper-heuristics [BGH+13].

Our Results.
With the long-term goal to obtain well-performing dynamic ELA-based algorithm selection and configuration techniques, we analyze in this work a first, rather cautious task: ELA-based performance prediction using the trajectory samples of the algorithm under investigation. More precisely, we consider the Covariance Matrix Adaptation Evolution Strategy (CMA-ES [HO01]), and we aim at predicting its solution quality (measured as target precision, i.e., the difference to an optimal solution in quality space) after a fixed budget of function evaluations. Concretely, we use the first 250 samples evaluated by the CMA-ES and we aim at predicting its performance after an additional 250 evaluations, doing so for 20 independent CMA-ES runs. The performance regression is done via a random forest model which takes as input the features computed from the trajectory data and which outputs an estimate for the final solution quality.
We then take into account that problem characteristics cannot only be described via classic ELA features, but that internal states of the search heuristics can also be used to derive information about the problem instance at hand. Such approaches have in the past been used, for example, for local surrogate-modelling [PRH19]. We analyze the accuracy gains when using the same state information as in [PRH19], that is, the values of the CMA-ES internal variables that mainly carry information about the current probability distribution from which the CMA-ES samples candidates for the new generation. In our experiments, the advantage of using this state information over using ELA features only, however, is only marginal. Concretely, the average difference between true and predicted solution quality decreases from 14.4 to 12.1 when adding the state variables as features (where the average error reported here is taken over all 24 benchmark problems from the BBOB suite of the COCO platform [HAR+20]).

Supervised ML for Performance Regression
The Experimental Setup.
When it comes to landscape-aware performance prediction, supervised machine learning techniques such as regression and classification have been studied in a variety of settings. Regression models, unlike classification ones, have the advantage of keeping track of the magnitude of differences between performances of different algorithms, as they measure concrete values for performances of all algorithms from the portfolio.
Among the different supervised learning regressors in the literature, such as support vector machines, Gaussian processes or ridge regression to name a few, it has been empirically shown that random forests outperform other models in terms of prediction accuracy [HXHL14]. A random forest is an ensemble-based meta-estimator that works by fitting multiple decision trees on subsamples of the original data set, then uses averaging as a way to control overfitting. In our experimental setup, we used an off-the-shelf random forest regressor from the Python scikit-learn package [PVG+11]. The algorithm whose performance we predict is the CMA-ES from the Python pycma package [HAB19], which uses a fixed population size and no restarts during the optimization process. As our benchmark, we used the first five instances of all 24 noiseless BBOB functions of the
COCO platform [HAR+20]; we fix the problem dimension to d = 5 here. For our first experiments, we perform 20 independent runs of the CMA-ES on these 120 problem instances, while keeping track of the search trajectories and the internal state variables of the algorithm itself. Throughout this work, we fix a budget of 500 function evaluations, after which we stop the optimization and record the target precision of the best found solution within the budget. In order to predict those recorded target precisions after 500 function evaluations, we compute the trajectory-based landscape features using the first 250 sampled points and their evaluations from the beginning of each trajectory, and couple them with the values of the internal CMA-ES state variables extracted at the 250th function evaluation.
Figure 1 summarizes the target precision achieved by the CMA-ES in each of the 20 runs. We see that the results are more or less homogeneous across different runs and across different instances of the same problem. However, we also observe several outliers, e.g., for function 7 (outliers for all instances), function 10 (instance 4), and function 12 (instance 1). It is important to keep in mind that the randomness of these performances is entirely caused by the randomness of the algorithm itself – the problem instance does not change between different runs.
For landscape feature computation, we use the R package flacco [KT19b]. Following suggestions made in [KT19a,BDSS17], we restrict ourselves to those feature sets that do not require additional function evaluations for computing the features. Namely, in this work we use two original ELA feature sets (y-Distribution and Meta-Model), as well as the
Dispersion, Nearest-Better Clustering, and
Information Content feature sets. This gives us a total of 38 landscape features per problem instance. In addition, we follow up on an idea previously used in [PRH19] and consider a set of internal CMA-ES state variables as features:
– Step-size: its value indicates how large the approximated region is from which the CMA-ES samples new candidate solutions.
– Mahalanobis mean distance: represents the measure of suitability of the current sampled population for model training from the point of view of the current state of the CMA-ES algorithm.
– C evolution path length: indicates the similarity of landscapes among previous generations.
– σ evolution path ratio: provides information about the changes in the probability distribution used to sample new candidates.
Fig. 1.
Target precision achieved by the CMA-ES with a budget of 500 function evaluations, for each of the first five instances of all 24 BBOB functions. Differently colored and shaped points represent 20 independent CMA-ES runs.
– CMA similarity likelihood: the log-likelihood of the set of candidate solutions with respect to the CMA-ES distribution; it may also represent a measure of the suitability of the set for training.
As suggested in [JD20], and using the elements described above, we establish two separate regression approaches. One model is trained to predict the actual, true value of the target precision data (we refer to it as the unscaled model in the remainder of the paper), while the other predicts the logarithm of the target precision data (the log-model). It is important to note that the target precision measure intuitively carries the information about the order of magnitude of the actual distance to the optimum, i.e., the distance level to the optimum, which is effectively computed as the log-target precision. For instance, if an algorithm reaches a target precision of 10^-2 for one problem instance and 10^-6 for another, it means that the algorithm found a solution which is 4 distance levels closer to the optimum in the latter scenario. Moreover, to reduce variability, we estimate both models' prediction accuracy through performing a 5-fold leave-one-instance-out cross-validation, making sure to train on 4 out of 5 instances per BBOB function, test on the remaining instance, and combine the results over the rounds.

Results.
Adopting our two regression models, we trained them separately in the following three scenarios: using as predictor variables the landscape features only, using the internal CMA-ES state variables only, and using the combination of the two. We trained the random forests 3 independent times and took the median of the 3 runs to ensure the robustness of the results.
Figure 2 highlights the absolute prediction errors per BBOB function using the two regression models, the unscaled and the log-model, when trained with 3 different feature sets: using only the trajectory landscape data, only the CMA-ES state variable data, and the combination of the two. For the majority of the functions, using the combination of the trajectory data and the state variable data seems to help in improving the performance prediction accuracy, compared to the scenarios which use only one of those two feature sets. We also confirm that the log-model is indeed better at predicting fine-grained target precision (e.g., in the case of F1 (sphere function) or F6 (linear slope function), we know that those functions do not require many function evaluations to converge to the global optimum, and their recorded target precision values are already quite small as they are very near the optimal solution). On the other hand, the unscaled model performs better where the target precision values are higher (e.g., for functions such as F3 and F15 (two
Fig. 2.
Absolute prediction errors for both regression models aggregated per BBOB function in 3 different scenarios depending on the feature set used. The SV column stands for the CMA-ES state variables, the ELA one for the landscape features, and the third one is the combination of both.
versions of the Rastrigin function), and also F24 (Lunacek bi-Rastrigin), which are all highly multimodal; the number of function evaluations in our budget was not nearly enough to allow for finding a true optimum). We also notice that using only the state variables for the unscaled model does not suffice for an accurate prediction in most cases. The reverse situation is nevertheless also possible: we see that for F12, using only the state variables yields the best accuracy in the unscaled model. Furthermore, there are also exceptions where using only the landscape data results in a higher accuracy than using the combined features (e.g., F11 for both models, F5 for the unscaled model).
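To make the pipeline concrete, the following is a minimal, self-contained sketch of the regression setup. It is not the paper's actual setup: a toy (μ,λ)-ES on the sphere function stands in for pycma's CMA-ES, and hand-rolled y-distribution and dispersion statistics stand in for the flacco feature sets; only the overall shape follows the text (first 250 trajectory samples → features → random forest → final target precision after 500 evaluations).

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def sphere(x):
    return float(np.sum(x ** 2))

def toy_es_trajectory(f, dim=5, budget=500, popsize=10, sigma=1.0):
    """Simplified (mu, lambda)-ES used as a stand-in for CMA-ES:
    records every evaluated point along the search trajectory."""
    mean = rng.uniform(-5, 5, dim)
    X, y = [], []
    while len(y) < budget:
        pop = mean + sigma * rng.standard_normal((popsize, dim))
        fits = np.array([f(x) for x in pop])
        X.extend(pop)
        y.extend(fits)
        elite = pop[np.argsort(fits)[: popsize // 2]]
        mean = elite.mean(axis=0)
        sigma *= 0.95  # crude step-size decay
    return np.array(X)[:budget], np.array(y)[:budget]

def trajectory_features(X, y):
    """Ad-hoc stand-ins for ELA features computed from trajectory samples:
    y-distribution moments plus a simple dispersion-style ratio."""
    best = X[np.argsort(y)[: len(y) // 10]]
    spread = lambda P: np.linalg.norm(P - P.mean(0), axis=1).mean()
    disp = spread(best) / (spread(X) + 1e-12)
    return [np.mean(y), np.std(y), skew(y), kurtosis(y), disp]

# One row per run: features from the first 250 samples,
# target = best precision reached within the full budget of 500.
rows, targets = [], []
for run in range(20):
    X, y = toy_es_trajectory(sphere)
    rows.append(trajectory_features(X[:250], y[:250]))
    targets.append(np.min(y))  # sphere optimum is 0, so f-value = precision

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(rows, targets)
```

In the paper's actual experiments the trajectory comes from pycma and the 38 features from flacco; the random forest usage (an off-the-shelf scikit-learn regressor) is the part taken directly from the text.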
We then proceeded to compare the differences in prediction accuracy from the sets described in Section 2 with the prediction accuracy using the global feature data, both alone and combined with the same CMA-ES state variable data as above. To be able to perform a fair comparison, for the trajectory data we selected from the 20 executed CMA-ES runs those runs with the median target precision value per problem instance and their corresponding features, and re-trained the unscaled and the log-model. Global features-wise, both models were also trained using features computed from 2000 and 250 globally uniformly sampled points (the median value of 50 independent feature computations) for each function and instance.
Figure 3 shows the absolute errors in prediction when the trajectory-based approach is compared with the results using the global features. The highest accuracy is reported in cases when only the global landscape features were used, across almost all problems, with the 2000-sample features yielding the best results. Here, we do not observe a huge improvement when combining the global landscape features with the state variable data. It seems that the number of samples used to compute the features can be crucial in reducing the
Fig. 3.
Absolute prediction errors for both regression models for the median trajectory-based prediction (the first 3 columns of each block) and the median global feature prediction (the middle two columns of each block represent the errors when using the 2000-sample features, and the last two columns correspond to using the 250-sample features).
errors in prediction, as global sampling could be linked to a potentially higher discriminative power of the features thus computed. Again, for certain functions such as F2 and F10 (both of which are different variants of the ellipsoidal function), we observe an overall low accuracy.
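The dependence of feature values on the sampling design can be illustrated with a small, hypothetical experiment. This is a sketch only: a toy dispersion-style feature (not flacco's exact definition) is computed once from 2000 uniform samples and once from 250 samples concentrated around a single search region, mimicking a trajectory; the two values generally differ, which is the effect discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 5

def rastrigin(X):
    """Vectorized Rastrigin, a standard multimodal test function."""
    return 10 * dim + np.sum(X ** 2 - 10 * np.cos(2 * np.pi * X), axis=1)

def dispersion(X, y, q=0.1):
    """Toy dispersion feature: spread of the best q% of points
    relative to the spread of the full sample."""
    best = X[np.argsort(y)[: max(2, int(q * len(y)))]]
    spread = lambda P: np.linalg.norm(P - P.mean(0), axis=1).mean()
    return spread(best) / (spread(X) + 1e-12)

# Global design: 2000 uniform samples over [-5, 5]^d.
Xg = rng.uniform(-5, 5, (2000, dim))
feat_global = dispersion(Xg, rastrigin(Xg))

# Trajectory-like design: 250 samples clustered around one search path.
Xt = rng.standard_normal((250, dim)) * 0.5 + rng.uniform(-5, 5, dim)
feat_traj = dispersion(Xt, rastrigin(Xt))

print(feat_global, feat_traj)
```

The same underlying function yields different feature values under the two designs, which is consistent with the sensitivity of ELA features to the sampling strategy reported in [RDDD20].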
Feature Selection.
To provide a sensitivity analysis based on the features used for the performance regression, we performed feature selection in the scenario of transfer learning, i.e., between different supervised tasks, where the features selected for a problem classification task are evaluated on the performance regression task. To do this, we have explored four state-of-the-art feature selection techniques:
Boruta [KJR10] is a feature selection and ranking algorithm based on random forests, which only selects features that are statistically significant.
Recursive feature elimination (rfe) [GFBG06] learns a model assessing different sets of features by recursively eliminating features per loop until a good model is learnt. It requires an ML algorithm for evaluation, and here we use a random forest.
Stepwise forward and backward selection (swfb) [DK92] tries to fit the best regression model by iteratively selecting and removing features. In our experiments, we used it in both directions simultaneously.
Correlation analysis with different threshold values (cor) [BCHC09] is based on a correlation analysis done only using the features (i.e., excluding the target). The result is a feature set where highly correlated features are omitted. In our case, we tested three different correlation thresholds: 0.50, 0.75, and 0.90. Note that while the first three feature selection methods require a supervised ML task, the last one is completely unsupervised and does not depend on the target.
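The unsupervised cor variant can be sketched in a few lines. This is an illustrative implementation of the general idea (greedily dropping features whose absolute Pearson correlation with an already kept feature exceeds the threshold), not the exact procedure of [BCHC09]; the data is synthetic.

```python
import numpy as np

def correlation_filter(F, threshold=0.9):
    """Unsupervised correlation-based selection: keep a feature only if its
    absolute Pearson correlation with every already kept feature is at most
    the threshold. F has shape (n_samples, n_features); returns kept indices."""
    corr = np.abs(np.corrcoef(F, rowvar=False))
    keep = []
    for j in range(F.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(2)
a = rng.standard_normal(100)
F = np.column_stack([
    a,
    a + 0.01 * rng.standard_normal(100),  # near-duplicate of column 0
    rng.standard_normal(100),             # independent column
])
print(correlation_filter(F, threshold=0.9))  # the near-duplicate is dropped
```

Note that the selection depends only on the feature matrix, never on the regression target, which is what makes this variant applicable without any supervised task.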
Table 1.
Number of ELA and state variable features in each selected feature portfolio. Details are available in Table 3.
                        SV     ELA    ELA+SV  GLOB2k  GLOB2k+SV  GLOB250  GLOB250+SV  boruta  cor0.5  cor0.75  cor0.9  rfe    swfb
Best threshold τ        1.336  3.99   4.742   14.497  9.46       0.694    2.605       3.63    1.813   4.901    1.717   7.388  20
Overall RMSE, combined  15.05  10.41  10.25   7.74    7.92       9.48     10.05       10.43   13.66   11.67    11.86   9.77   15.73
Overall RMSE, unscaled  15.08  11.18  10.88   9.21    9.30       9.58     10.19       11.16   14.11   11.80    12.00   10.93  17.03
Overall RMSE, log       15.63  13.05  13.21   11.46   11.88      12.05    12.87       13.14   14.65   13.89    14.29   12.61  15.73
Table 2.
RMSE values of the combined selector in three scenarios: when the prediction is based on the search trajectory landscape features and state variables (first 3 columns), on global features (next 4 columns), and finally on selected feature portfolios (last 6 columns).
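The combined selector evaluated in Table 2 follows a simple rule: use the log-model's prediction when it predicts a precision below the portfolio-specific threshold τ, and the unscaled model's prediction otherwise. A minimal sketch, with hypothetical model outputs and the assumption that the log-model works on the log10 scale:

```python
import numpy as np

def combined_prediction(pred_unscaled, pred_log, tau):
    """Combined selector: take the log-model prediction when the precision it
    predicts is below tau, otherwise take the unscaled-model prediction.
    pred_log is assumed to be on the log10 scale."""
    precision_from_log = 10.0 ** np.asarray(pred_log)
    return np.where(precision_from_log < tau, precision_from_log, pred_unscaled)

def rmse(y_true, y_pred):
    """Root Mean Squared Error, the accuracy measure used in Table 2."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Hypothetical predictions for four problem instances:
pu = np.array([12.0, 0.8, 25.0, 0.05])    # unscaled-model predictions
pl = np.log10([8.0, 0.01, 30.0, 0.03])    # log-model predictions (log10 scale)
y = np.array([10.0, 0.02, 27.0, 0.04])    # true target precisions

combined = combined_prediction(pu, pl, tau=4.0)
print(combined, rmse(y, combined))
```

With τ = 4.0, the small predicted precisions (0.01 and 0.03) come from the log-model, while the large ones fall back to the unscaled model, mirroring the observation that the log-model is better at fine-grained precisions and the unscaled model at coarse ones.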
Our experimental design uses stratified 5-fold cross-validation. For a fair feature selection, we applied the aforementioned methods on each training fold separately, then selected the intersection of the features returned by the training folds. These features are further evaluated in the performance regression task. Table 1 summarizes how many features were selected per portfolio, from the whole set of 38 ELA landscape features and 5 CMA-ES state variable features.
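The per-fold selection-and-intersection protocol can be sketched as follows. This is a simplified illustration: plain KFold stands in for the stratified variant, and a hypothetical correlate-with-target selector stands in for Boruta, rfe, swfb, or cor.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_per_fold_intersection(F, y, select_fn, n_splits=5, seed=0):
    """Run a feature-selection routine on each training fold separately and
    keep only the intersection of the selected column indices.
    select_fn(F_train, y_train) must return an iterable of column indices."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    selected = None
    for train_idx, _ in kf.split(F):
        s = set(int(j) for j in select_fn(F[train_idx], y[train_idx]))
        selected = s if selected is None else selected & s
    return sorted(selected)

# Toy selector (hypothetical): keep features whose absolute correlation
# with the target exceeds 0.2.
def toy_selector(F, y):
    c = np.abs([np.corrcoef(F[:, j], y)[0, 1] for j in range(F.shape[1])])
    return np.nonzero(c > 0.2)[0]

rng = np.random.default_rng(3)
F = rng.standard_normal((200, 6))
y = 2 * F[:, 0] - F[:, 3] + 0.1 * rng.standard_normal(200)
print(select_per_fold_intersection(F, y, toy_selector))
```

Taking the intersection rather than the union is the conservative choice: a feature survives only if every training fold agrees it is useful, which reduces the risk of fold-specific selection artifacts.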
Combined Selector Model and Sensitivity Analysis.
As common in ML, we measure the regression accuracy in terms of
Root Mean Squared Error (RMSE). Table 2 summarizes the RMSE values for the different feature portfolios when using (1) the unscaled model, (2) the log-model, and (3) a combination of the unscaled and the log-model (see the last three rows of Table 2). The threshold τ at which the predictive model changes is optimized for each feature portfolio individually; the obtained thresholds are summarized at the top of Table 2. That is, we select the prediction of the log-model when the predicted precision (according to the log-model) is smaller than the threshold value τ, and we use the prediction of the unscaled model otherwise. Note that the optimal threshold value τ varies significantly between the different feature portfolios.

Fig. 4. Absolute prediction errors of the combined models using portfolio-specific optimal thresholds τ.

When comparing all the different portfolios (initial trajectory-based, global, and selected trajectory-based ones), the good performance of the global feature sets is not surprising. Differences from the initial trajectory-based predictions are marginal for sets such as boruta, cor0.75 and cor0.9, whereas swfb and cor0.5 perform consistently worse than ELA+SV. Using the rfe set, on the other hand, led to better results than using the original feature set. SV alone does not achieve good accuracy, but its contribution to the ELA-only feature portfolio is around 3% at the best threshold for the combined model, which is τ = 4.

Conclusions and Future Work

We analyzed in this paper the accuracy of predicting the CMA-ES solution quality after a given budget based on features computed from the samples on the CMA-ES search trajectory, using two complementary regression models, the unscaled and the log-model. Adding information obtained from the CMA-ES internal state variables does not improve the prediction accuracy drastically compared to using the trajectory-based data only. Those results were then contrasted with the regression using global features, where the latter, especially those computed using a higher number of samples, yielded a consistently better accuracy. Next, we tested whether we would achieve further gains in accuracy through feature selection. Although the overall results are comparable to the ones from the initial trajectory-based portfolios, several selected feature sets resulted in worse accuracy than in the initial approach. We ultimately pointed out the advantages of using our combined selector model over relying separately on predictions of the standalone unscaled or log-model across all different feature portfolios in all 3 scenarios.
In terms of future work, we plan on continuing this research by considering the following questions and tasks:
(0) Performance prediction of other solvers: How accurately can we use trajectory-based features of one algorithm to predict the performance of another algorithm?
In this work, we have only tried to predict performance for the same algorithm from whose trajectory the feature values have been computed. A next step would be to test if models for configuring the same algorithm can be trained. When this is successful, transfer learning from one algorithm to another can be considered.
(1) How can we more efficiently capture the temporal component, i.e., the information which sample was evaluated when during the search? Using such longitudinal data, both in terms of extracted feature values and in terms of state variable evolution, could possibly be done using recurrent neural networks [ZFW+19].
(2) Combining global and trajectory-based sampling:
In our work, we only considered the case in which either global sampling or trajectory-based sampling is used. The accuracy of the models based on global sampling was better than that of the trajectory-based features. Even if we keep in mind that this comparison was unfair in that we provided the global feature values "for free", the results nevertheless suggest that a combination of global and trajectory-based feature computations could be worthwhile to investigate. How we can optimally balance the budget between global sampling, trajectory-based sampling, and the remaining optimization budget is a challenging question in this context.
(3) Warm-starting the CMA-ES: Starting the optimization process with a covariance matrix and other parameters that are extrapolated from the (uniformly or otherwise) distributed global samples might significantly improve the overall accuracy, as the CMA-ES will have a better overview of the whole problem instance "from the get-go". A similar approach has been suggested in [MRT15] when switching from a Bayesian optimization algorithm to the CMA-ES.
(4)
Feature selection and ranking:
While we used transfer learning for feature selection between two different supervised ML tasks, feature selection within the same supervised task has not been considered in this paper and remains to be explored. We also plan on making better use of the variable importance estimations provided by feature ranking algorithms, such as those based on ensembles of predictive clustering trees [PKD20] and those based on ReliefF and RReliefF [RŠK03].
(5)
Feature design:
The work [DLV+19] suggests several algorithm-specific features for the SOO tree algorithm [Mun11]. Such specific features can much more explicitly capture the characteristics of the algorithm-problem instance interaction. It could be worthwhile to study whether, possibly in addition to the longitudinal data mentioned in (1), such specific features can be identified for other common solvers, such as the CMA-ES.
(6)
Feature portfolio:
We note that our work above is based on the features available in the flacco package [KT19b]. Since the design of flacco, however, several new feature sets have been suggested. Another straightforward way to extend our analyses would be the inclusion of these feature sets, with the hope to improve the overall regression accuracy. In this respect, we find in particular the Search Trajectory Networks suggested in [OMB20] worth investigating.
(7)
Representation learning of landscapes:
The feature data will be additionally explored by applying representation learning methods that automatically learn new data representations by reducing the dimension of the data, automatically detecting correlations, and removing bias and redundancies present in the feature data. The work presented in [EPR+20] showed that linear matrix factorization representations of the ELA feature values detect significantly better correlations between different problem instances.
(8)
Hyperparameter tuning of regression models:
Last, but not least, we are planning to explore an algorithm portfolio consisting of different regression methods in order to find the most suitable one, together with finding its best hyperparameters for achieving better performance. In this study, we have used random forests for regression without tuning their parameters, since we have been interested in the contribution of the different feature portfolios.
Acknowledgments.
This research benefited from the support of the Paris Ile-de-France region and of a public grant as part of the Investissement d'avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH. This work was also supported by projects from the Slovenian Research Agency: research core funding No. P2-0098 and project No. Z2-1867. We also acknowledge support by COST Action CA15140 "Improving Applicability of Nature-Inspired Optimisation by Joining Theory and Practice (ImAppNIO)".
References
AM16. Aldeida Aleti and Irene Moser, A systematic literature review of adaptive parameter control methods for evolutionary algorithms, ACM Comput. Surv. (2016), 56:1–56:35.
BCHC09. Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen, Pearson correlation coefficient, Noise Reduction in Speech Processing, Springer, 2009, pp. 1–4.
BDSS16. N. Belkhir, J. Dréo, P. Savéant, and M. Schoenauer, Surrogate assisted feature computation for continuous problems, LION, Springer, 2016, pp. 17–31.
BDSS17. N. Belkhir, J. Dréo, P. Savéant, and M. Schoenauer, Per instance algorithm configuration of CMA-ES with limited budget, GECCO, ACM, 2017, pp. 681–688.
BGH+13. Edmund K. Burke, Michel Gendreau, Matthew R. Hyde, Graham Kendall, Gabriela Ochoa, Ender Özcan, and Rong Qu, Hyper-heuristics: a survey of the state of the art, Journal of the Operational Research Society (2013), 1695–1724.
DD20. Benjamin Doerr and Carola Doerr, Theory of parameter control mechanisms for discrete black-box optimization: Provable performance gains through dynamic parameter choices, Theory of Evolutionary Computation: Recent Developments in Discrete Optimization, Springer, 2020, pp. 271–321.
DK92. Shelley Derksen and Harvey J. Keselman, Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables, British Journal of Mathematical and Statistical Psychology (1992), no. 2, 265–282.
DLV+19. B. Derbel, A. Liefooghe, S. Vérel, H. Aguirre, and K. Tanaka, New features for continuous exploratory landscape analysis based on the SOO tree, FOGA, ACM, 2019, pp. 72–86.
EPR+20. Tome Eftimov, Gorjan Popovski, Quentin Renau, Peter Korosec, and Carola Doerr, Linear matrix factorization embeddings for single-objective optimization landscapes, SSCI, IEEE, 2020, pp. 775–782.
GFBG06. Pablo M. Granitto, Cesare Furlanello, Franco Biasioli, and Flavia Gasperi, Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products, Chemometrics and Intelligent Laboratory Systems (2006), no. 2, 83–90.
HAB19. Nikolaus Hansen, Youhei Akimoto, and Petr Baudis, CMA-ES/pycma on GitHub, https://github.com/CMA-ES/pycma, 2019.
HAR+20. Nikolaus Hansen, Anne Auger, Raymond Ros, Olaf Mersmann, Tea Tušar, and Dimo Brockhoff, COCO: a platform for comparing continuous optimizers in a black-box setting, Optimization Methods and Software (2020), 1–31.
HO01. Nikolaus Hansen and Andreas Ostermeier, Completely derandomized self-adaptation in evolution strategies, Evolutionary Computation (2001), no. 2, 159–195.
HXHL14. Frank Hutter, Lin Xu, Holger H. Hoos, and Kevin Leyton-Brown, Algorithm runtime prediction: Methods & evaluation, Artif. Intell. (2014), 79–111.
JD19. Anja Jankovic and Carola Doerr, Adaptive landscape analysis, GECCO Companion, ACM, 2019, pp. 2032–2035.
JD20. Anja Jankovic and Carola Doerr, Landscape-aware fixed-budget performance regression and algorithm selection for modular CMA-ES variants, GECCO, ACM, 2020, pp. 841–849.
KHE15. Giorgos Karafotias, Mark Hoogendoorn, and A.E. Eiben, Parameter control in evolutionary algorithms: Trends and challenges, IEEE Trans. Evol. Comput. (2015), 167–187.
KJR10. Miron B. Kursa, Aleksander Jankowski, and Witold R. Rudnicki, Boruta – a system for feature selection, Fundamenta Informaticae (2010), no. 4, 271–285.
KKB+18. P. Kerschke, L. Kotthoff, J. Bossek, H.H. Hoos, and H. Trautmann, Leveraging TSP solver complementarity through machine learning, Evolutionary Computation (2018), no. 4.
KPWT16. P. Kerschke, M. Preuss, S. Wessing, and H. Trautmann, Low-budget exploratory landscape analysis on multiple peaks models, GECCO, ACM, 2016, pp. 229–236.
KT19a. P. Kerschke and H. Trautmann, Automated algorithm selection on continuous black-box problems by combining exploratory landscape analysis and machine learning, Evolutionary Computation (2019), no. 1, 99–127.
KT19b. Pascal Kerschke and Heike Trautmann, Comprehensive feature-based landscape analysis of continuous and constrained optimization problems using the R-package flacco, Applications in Statistical Computing – From Music Data Analysis to Industrial Quality Improvement (Nadja Bauer, Katja Ickstadt, Karsten Lübke, Gero Szepannek, Heike Trautmann, and Maurizio Vichi, eds.), Springer, 2019, pp. 93–123.
Mal18. Katherine Mary Malan, Landscape-aware constraint handling applied to differential evolution, TPNC, LNCS, vol. 11324, Springer, 2018, pp. 176–187.
MBT+11. O. Mersmann, B. Bischl, H. Trautmann, M. Preuss, C. Weihs, and G. Rudolph, Exploratory landscape analysis, GECCO, ACM, 2011, pp. 829–836.
MRT15. Hossein Mohammadi, Rodolphe Le Riche, and Eric Touboul, Making EGO and CMA-ES complementary for global optimization, LION, Springer, 2015, pp. 287–292.
MSKH15. Mario A. Muñoz, Yuan Sun, Michael Kirley, and Saman K. Halgamuge, Algorithm selection for black-box continuous optimization problems: A survey on methods and challenges, Inf. Sci. (2015), 224–245.
Mun11. Rémi Munos, Optimistic optimization of a deterministic function without the knowledge of its smoothness, Advances in Neural Information Processing Systems, 2011, pp. 783–791.
OMB20. Gabriela Ochoa, Katherine Mary Malan, and Christian Blum, Search trajectory networks of population-based algorithms in continuous spaces, EvoApplications, Springer, 2020, pp. 70–85.
PKD20. Matej Petković, Dragi Kocev, and Sašo Džeroski, Feature ranking for multi-target regression, Machine Learning (2020), no. 6, 1179–1204.
PRH19. Zbyněk Pitra, Jakub Repický, and Martin Holeňa, Landscape analysis of Gaussian process surrogates for the covariance matrix adaptation evolution strategy, GECCO, ACM, 2019, pp. 691–699.
PVG+11. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, Scikit-learn: Machine learning in Python, JMLR (2011), 2825–2830.
RDDD20. Quentin Renau, Carola Doerr, Johann Dréo, and Benjamin Doerr, Exploratory landscape analysis is strongly sensitive to the sampling strategy, PPSN, Springer, 2020, pp. 139–153.
Ric76. John R. Rice, The algorithm selection problem, Advances in Computers, vol. 15, Elsevier, 1976, pp. 65–118.
RŠK03. Marko Robnik-Šikonja and Igor Kononenko, Theoretical and empirical analysis of ReliefF and RReliefF, Machine Learning (2003), no. 1-2, 23–69.
SEK20. U. Škvorc, T. Eftimov, and P. Korošec, Understanding the problem space in single-objective numerical optimization using exploratory landscape analysis, Appl. Soft Comput. (2020), 106138.
ZFW+19. Juan Zhao, QiPing Feng, Patrick Wu, Roxana A. Lupu, Russell A. Wilke, Quinn S. Wells, Joshua C. Denny, and Wei-Qi Wei, Learning from longitudinal data in electronic health record and genetic data to improve cardiovascular event prediction, Scientific Reports (2019), no. 1, 1–10.