Neural Model-based Optimization with Right-Censored Observations
Katharina Eggensperger, Kai Haase, Philipp Müller, Marius Lindauer, Frank Hutter
University of Freiburg, Germany; Leibniz University Hannover, Germany; Bosch Center for Artificial Intelligence
Abstract
In many fields of study, we only observe lower bounds on the true response value of some experiments. When fitting a regression model to predict the distribution of the outcomes, we cannot simply drop these right-censored observations, but need to properly model them. In this work, we focus on the concept of censored data in the light of model-based optimization, where prematurely terminating evaluations (and thus generating right-censored data) is a key factor for efficiency, e.g., when searching for an algorithm configuration that minimizes the runtime of the algorithm at hand. Neural networks (NNs) have been demonstrated to work well at the core of model-based optimization procedures, and here we extend them to handle these censored observations. We propose (i) a loss function based on the Tobit model to incorporate censored samples into training and (ii) the use of an ensemble of networks to model the posterior distribution. To nevertheless be efficient in terms of optimization overhead, we propose to use Thompson sampling such that we only need to train a single NN in each iteration. Our experiments show that our trained regression models achieve a better predictive quality than several baselines and that our approach achieves new state-of-the-art performance for model-based optimization on two optimization problems: minimizing the solution time of a SAT solver and the time-to-accuracy of neural networks.
Introduction
When studying the outcome of an experiment we might only observe a lower or an upper bound on its true value – a censored observation. Such censored data is present in many applications, in particular if individual observations are costly in terms of time and resources. For example, when studying the impact of a fertilizer, the quantity of a toxin can drop below the detection level of the measurement device; when studying the expected lifetime of hard drives, the study might be stopped before all hard drives have exceeded their lifetime; or when studying the damage of insect pests to growing crops, some plants might still be healthy at harvesting time. To analyze such time-to-event data, it is crucial to handle censored data correctly to avoid over- or underestimation of the quantity of interest.

In this work, we study the concept of censored data in the light of model-based optimization, where we are interested in minimizing an objective function describing the time required to achieve a desired outcome. We focus on optimization procedures that actively terminate evaluations to speed up the optimization by spending less time on poorly performing evaluations, and thus generate right-censored data. This censoring mechanism enables efficient tuning of costly objectives and can speed up optimization by orders of magnitude (Hutter 2017; Kleinberg, Leyton-Brown, and Lucier 2017; Kleinberg et al. 2019; Weisz, György, and Szepesvári 2019b,a). Censoring strategies thus substantially contribute to the state of the art for automated algorithm configuration, whose potential has been demonstrated in many domains of AI, with reported speedup factors of up to 118× in AI planning (Vallati et al. 2013) and 28× for time-tabling (Chiarandini, Fawcett, and Hoos 2008), as well as large speedups in satisfiability solving (Hutter et al. 2017), reinforcement learning (Falkner, Klein, and Hutter 2018), mixed integer programming (Hutter, Hoos, and Leyton-Brown 2010), and answer set solving (Gebser et al. 2011). However, the efficiency of model-based optimizers stands and falls with the quality of the empirical performance model at their core.

Bayesian optimization (BO) (Shahriari et al. 2016), as one of the most studied model-based approaches, commonly uses Gaussian processes (GPs) (Mockus 1994; Brochu, Cora, and de Freitas 2010; Snoek, Larochelle, and Adams 2012) or random forests (RFs) (Feurer et al. 2015; Cáceres, Bischl, and Stützle 2017; Candelieri, Perego, and Archetti 2018), but recently methods based on neural networks (NNs) have been shown to perform superior on some applications (Schilling et al. 2015; Snoek et al. 2015; Springenberg et al. 2016; Perrone et al. 2018; White, Neiswanger, and Savani 2019). In this work, we extend this NN-based Bayesian optimization to work in the presence of partially censored data. Specifically, our contributions are:

1. We propose to use the Tobit loss for training NNs to properly handle noisy and censored observations.
2. We study the impact of censoring on the training of NNs using this loss function.
3. We show that ensembles of NNs, trained on censored data, in combination with Thompson sampling yield efficient optimization procedures.
4. We demonstrate the benefit of our model on two runtime minimization problems, outperforming the previous state of the art.

Figure 1: True and observed normal distributions with and without censored observations.

Formal Problem Setting
First, we formally describe the problem setting we address in this work by discussing the challenges of regression under censored observations and showing how these relate to model-based optimization.
Regression under Censored Data
We focus here on a specific type of data where the variable of interest is the time until an event occurs, e.g., until an algorithm has solved a task at hand.¹ If this event does not occur within the time allocated for an evaluation, we obtain a right-censored observation, i.e., a lower bound on its actual runtime, but not the actual value. There exist two other types of censored values which we do not consider here, since they do not occur during the type of optimization we are interested in: left-censored values, for which the start time is unknown but the event has happened (e.g., we do not know when the algorithm started), and interval-censored values, for which we know that the event has happened within some time interval (e.g., we know that the algorithm has solved the problem between two known time points).

Our goal is to build a regression model trained on data D = ⟨(x_i, y_i, I_i)⟩_{i=1}^n, where x ∈ R^d is the d-dimensional input, y ∈ R is the observed value, and I ∈ {0, 1} is a binary variable indicating whether the value of this observation has been censored or not; if I_i = 1, the i-th observation has been censored and y_i is only a lower bound of the true value. Furthermore, we assume that the true (non-censored) value y_i is generated by a stochastic process, as is the case when, e.g., observing the runtime of a non-deterministic algorithm. We will explain in the next section how such data is generated. In Figure 1, we visualize how the observed distribution changes when some of the samples are censored. The upper plot shows the true distribution which generated the observations, and the middle (lower) plot shows how censoring at fixed (randomly chosen) censoring thresholds skews the empirically observed distribution. Fitting a maximum-likelihood model on such data could lead to substantial underestimations of the target value's true mean.

¹ We use the term time for simplicity, but we note that this notation also generalizes to other metrics, e.g., CPU cycles, gradient descent steps, training epochs, and number of memory accesses.
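To make this underestimation effect concrete, the following minimal sketch (with illustrative numbers not taken from the paper) shows how naively averaging right-censored runtimes biases the estimated mean downwards:

```python
import numpy as np

rng = np.random.default_rng(0)

# True runtimes: normally distributed around 10s (illustrative numbers).
true_runtimes = rng.normal(loc=10.0, scale=2.0, size=10_000)

# Right-censor every run at a fixed cutoff of 11s: we observe
# min(runtime, cutoff) plus a flag saying whether censoring occurred.
cutoff = 11.0
observed = np.minimum(true_runtimes, cutoff)
censored = true_runtimes > cutoff  # I_i = 1 for censored observations

# A naive maximum-likelihood fit that ignores censoring underestimates
# the true mean, as illustrated in Figure 1.
print(f"true mean:         {true_runtimes.mean():.2f}")
print(f"naive mean:        {observed.mean():.2f}")  # biased downwards
print(f"fraction censored: {censored.mean():.2%}")
```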
Figure 2: Model-based algorithm configuration using racing and censoring strategies. A probabilistic model M is fit on observations D = ⟨(λ_i, c_i, I_i)⟩_{i=1}^n; a configuration λ selected based on M is raced against the incumbent λ_inc by running the target algorithm with adaptive capping (cutoff κ), producing cost observations c.
Sequential Model-based Optimization Using Censoring Strategies
We aim to find a solution λ* minimizing a black-box function describing the stochastic cost metric c : Λ → R₊ of evaluating λ ∈ Λ (e.g., the time to obtain a solution):

λ* ∈ arg min_{λ ∈ Λ} E[c(λ)].   (1)

Sequential model-based optimization, especially Bayesian optimization (BO) (Brochu, Cora, and de Freitas 2010; Shahriari et al. 2016), has been demonstrated to work well for such problems. This optimization procedure typically iterates over three phases: (i) fitting a probabilistic model on all observations, (ii) choosing the next configuration to evaluate based on the model and an acquisition function, e.g., expected improvement (Jones, Schonlau, and Welch 1998), and (iii) evaluating this configuration. In this work, we focus on how to conduct step (i) in the presence of censored observations.

As discussed before, we consider optimization procedures that actively cap the evaluation of poorly performing configurations. A well-known example of such procedures is algorithm configuration (AC) (Hutter et al. 2009) based on BO methods using a censoring strategy. In Figure 2, we provide an overview of this iterative procedure. AC in principle performs the same steps as BO, but instead of evaluating a selected configuration in isolation, it is raced against the best configuration seen so far and might be prematurely terminated if it performs worse. This racing procedure accommodates noise across repeated evaluations of the same configuration using different seeds or solving different problem instances. Thus, besides choosing a configuration λ to evaluate next, AC methods in addition also define a cutoff time κ at which each evaluation will be terminated. The cutoff κ can either be a globally set cutoff κ_max or be adapted for each new run, such that new configurations use at most as much time as the best configuration seen so far (Hutter et al. 2009). For a configuration λ_i at the i-th observation, we then observe

c_i = min(κ_i, c(λ_i))   (2)

with c(λ_i) being a sample from the cost metric (which we only fully observe if c(λ_i) ≤ κ_i). These observations form the training data D = ⟨(λ_i, c_i, I_i)⟩_{i=1}^n for the model (i.e., ⟨(x_i, y_i, I_i)⟩_{i=1}^n as discussed before) to be used in BO. AC has been used with random forests (RFs) to guide the search procedure, and the resulting model-based methods define one direction of state-of-the-art AC methods (Hutter, Hoos, and Leyton-Brown 2011b; Ansótegui et al. 2015; Cáceres, Bischl, and Stützle 2017).
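The observation mechanism of Eq. (2) can be sketched as follows; the cost function used here is a hypothetical stand-in for the target algorithm, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def run_with_cutoff(cost_sampler, lam, kappa):
    """Evaluate configuration lam with cutoff kappa, following Eq. (2):
    we observe c_i = min(kappa, c(lam)) together with the censoring flag I_i."""
    c = cost_sampler(lam)              # one noisy draw of the cost metric
    return min(kappa, c), c > kappa    # (observed value c_i, I_i)

# Hypothetical stochastic cost: the expected runtime grows with |lam|.
def cost_sampler(lam):
    return abs(lam) + rng.exponential(scale=1.0)

# Adaptive capping: a challenger may use at most the incumbent's cost.
incumbent_cost = 2.5
history = [(lam, *run_with_cutoff(cost_sampler, lam, kappa=incumbent_cost))
           for lam in rng.uniform(-5.0, 5.0, size=20)]
# history now holds (lambda_i, c_i, I_i) triples, some right-censored.
```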
Related Work on Modeling under Censored Data

Handling censored data has a long history in the field of survival analysis (Kleinbaum and Klein 2012; Haider et al. 2020), which studies statistical procedures to describe survival or hazard functions. Commonly used methods are the Kaplan-Meier estimator (Kaplan and Meier 1958), the Cox proportional hazards model (Cox 1972), the accelerated failure time model (Kalbfleisch and Prentice 2002), and also extensions to regression models, such as NNs (Katzman et al. 2018; Lee et al. 2018) and RFs (Ishwaran et al. 2008). However, here we are interested neither in the impact of different risk factors nor in the risk of the occurrence of the event at a specific time step (the so-called hazard function), but in the mean survival time given a configuration. Additionally, for methods relying on a non-parametric survival function, such as the Kaplan-Meier estimator, the mean survival time is not well defined if the largest observation is censored, which is the case for model-based optimization.

Survival analysis methods in the context of modeling algorithm performance have been studied by Gagliolo (2010) in order to construct online portfolios of algorithms to solve a sequence of problem instances. However, much more closely related to our work is the method introduced by Schmee and Hahn (1979) for regression with censored data. This method iteratively increases the values of censored data points using a two-step procedure: (1) fit a model to the current data and (2) update the censored values according to the mean of the predicted normal distribution truncated at the observed censored value (to ensure that the updated value can never become smaller than the observed value). Steps (1) and (2) are repeated until the method converges or a manually defined number of iterations has been performed; a sketch of this procedure is given below. This procedure has been adapted to work with random forests and has been demonstrated to work well as a model for model-based optimization to minimize the runtime of stochastic algorithms (Hutter, Hoos, and Leyton-Brown 2011a). As we show in our experiments, this method performs similarly to our approach, but because of its iterative nature, it is slower than ours by at least a factor of k if using k iterations.

Other related work includes regression models for handling censored data in a specific context different from ours, e.g., a weighted loss function for fitting a linear regression for high-dimensional biomedical data (Li, Vinzamuri, and Reddy 2016) or a two-step procedure for Gaussian process regression to build a model that resembles the output of a (physical) experiment where observations might be censored due to unknown resource constraints (Chen et al. 2019).
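The S&H procedure sketched above can be written as follows; the `fit_model` interface (a regressor returning predictive means and standard deviations) is an assumed interface for illustration, not the implementation used in the cited work:

```python
import numpy as np
from scipy.stats import norm

def schmee_hahn_targets(X, y_obs, censored, fit_model, n_iter=5):
    """Sketch of the Schmee & Hahn (1979) iterative imputation."""
    y = y_obs.astype(float).copy()
    for _ in range(n_iter):
        model = fit_model(X, y)                 # step (1): fit on current targets
        mu, sigma = model.predict(X[censored])  # predictive mean and std
        # Step (2): replace each censored target by the mean of the predicted
        # normal truncated at the observed lower bound,
        # E[Y | Y > y_obs] = mu + sigma * phi(a) / (1 - Phi(a)).
        alpha = (y_obs[censored] - mu) / sigma
        y[censored] = mu + sigma * norm.pdf(alpha) / (1.0 - norm.cdf(alpha) + 1e-12)
    return y  # imputed targets used to train the final model
```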
Training a Neural Network on Censored Observations

The model at the heart of state-of-the-art model-based optimization methods has to handle various challenges, such as highly varying noise and, most importantly for us, censored data. Neural networks (NNs) provide a flexible learning framework based on (stochastic) gradient descent, allowing the use of any differentiable loss function. We will make use of this flexibility to incorporate censored data during training.

First, we want to model the value of interest as a random variable with a normal distribution, since, e.g., data fed into the model describing the cost of a stochastic algorithm can contain aleatoric noise. We model this parametric distribution by adding a second output neuron to the last layer modeling the variance (Nix and Weigend 1994), i.e., our NN predicts a mean µ_i and a variance depending on the input λ_i. To train our network on data D = ⟨(λ_i, c_i, I_i)⟩_{i=1}^n, we use the negative log-likelihood (NLL), customary in deep learning:

−log L((µ̂_i, σ̂_i)_{i=1}^n | D) = −Σ_{i=1}^n log(φ(Z_i)/σ̂_i),  Z_i = (c_i − µ̂_i)/σ̂_i,   (3)

with µ̂_i and σ̂_i being the predicted mean and standard deviation of the objective value for a configuration λ_i and φ being the standard normal probability density function.

Using this loss also for censored observations yields suboptimal estimators underestimating the true distribution (see Plot (a) in Figure 3). A simple solution would be to employ the iterative model-agnostic procedure proposed by Schmee and Hahn (1979) (S&H) to restore censored values before training the final model. However, this multiplies the training time depending on an additional hyperparameter, the number of iterations (for k iterations we would need to train k models). Furthermore, the treatment of the censored information would be decoupled from the training procedure, potentially misleading the estimator (see Plot (b) in Figure 3).

Instead, our approach relies on a solution handling censored data directly within the loss function. In the sum of likelihoods from Eq. (3) we correct the terms for censored observations. For right-censored data, all we know is a lower bound on the actual time of occurrence, and this quantity can be described by 1 − Φ(Z_i), with Φ being the standard normal cumulative distribution function (Gagliolo and Schmidhuber 2006; Klein and Moeschberger 2006), also known as the Tobit likelihood function (Tobin 1958). These two terms complement each other: uncensored observations contribute the probability of observing the observed value, and censored observations contribute the probability that the true value lies above the observed value. This yields the following loss function we use to train our network on data D = ⟨(λ_i, c_i, I_i)⟩_{i=1}^n:
−log L((µ̂_i, σ̂_i)_{i=1}^n | D) = −Σ_{i=1}^n log((φ(Z_i)/σ̂_i)^{1−I_i} · (1 − Φ(Z_i))^{I_i}),  I_i = 0 if c(λ_i) ≤ κ_i, and I_i = 1 otherwise.   (4)

Using this loss function allows us to directly handle the information obtained from censored observations during training, without any additional overhead as in S&H, and can yield predictions close to the actual values (see Plot (c) in Figure 3), predicting values for censored points to be higher than the observed censored value (orange markers).

Figure 3: Comparing the predictive quality of NNs ignoring censoring information (a), using S&H (b), using the Tobit loss (c), and ensembles of NNs using the Tobit loss (d, e). We show results for random censoring in (a, b, c, d) and fixed censoring in (e).
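A minimal PyTorch sketch of this loss is given below; details such as the numerical stabilization constant are our assumptions and not taken from the paper's implementation:

```python
import torch

def tobit_nll(mu, sigma, target, censored):
    """Negative log-likelihood corresponding to Eq. (4).
    `censored` is a 0/1 float tensor holding the indicators I_i."""
    std_normal = torch.distributions.Normal(0.0, 1.0)
    z = (target - mu) / sigma
    # Uncensored terms: log density of the observed value.
    log_pdf = std_normal.log_prob(z) - torch.log(sigma)
    # Censored terms: log probability that the true value exceeds the bound.
    log_surv = torch.log(1.0 - std_normal.cdf(z) + 1e-12)  # eps for stability
    return -((1.0 - censored) * log_pdf + censored * log_surv).mean()
```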
Integration into Model-Based Optimization
Next, we describe how we make use of our NN model within model-based optimization and how we use model uncertainty to select the next configuration to evaluate.

First, since the runtime distributions of stochastic algorithms are known to exhibit heavy tails for each configuration (Gomes and Selman 1997; Eggensperger, Lindauer, and Hutter 2018) and are constrained to be larger than zero, we model the log of the observed runtime. Thus, we can model the distribution of its log-function values as a normal distribution. Such a parametric distribution assumption allows us to efficiently model the distribution of the quantity of interest (Pinitilie 2006).

Second, we note that the model uncertainty is different from the aleatoric noise modeled by the second output neuron (which is not relevant for modeling the mean performance of a configuration). A full Bayesian treatment of all weights in the NN would provide model uncertainty (Neal 1996); however, this would come with a large computational overhead and additional hyperparameters to tune. Recently, Lakshminarayanan, Pritzel, and Blundell (2017) showed that an ensemble of NNs with random initializations also yields robust and meaningful predictive uncertainty estimates. We use this approach to model the posterior distribution of the objective function, with the modification that we do not use the estimated aleatoric noise, since the goal in BO is to compute a predictive distribution over the true function value at a point, not over the noisy observations at that point. Hence, we compute the mean µ_λ and variance σ²_λ for a configuration λ based on an ensemble of size M as follows:

µ_λ = (1/M) Σ_{m=1}^M µ̂^{(m)}_λ,  σ²_λ = (1/M) Σ_{m=1}^M (µ̂^{(m)}_λ − µ_λ)².   (5)

Plot (d) in Figure 3 shows the predictive mean and variance of an NN ensemble trained on noisy data with randomly censored observations. The uncertainty favorably increases for unseen locations, and the NNs were able to prune out the noise.

However, the most natural idea of simply using these ensembles comes with the risk of over-exploring poor regions: the mixed loss function we use to train our network only constrains the network to predict above the censoring threshold, and the models can, in principle, predict arbitrarily high values, resulting in a potentially high empirical variance (see the left part of Plot (e) in Figure 3). This could make poorly performing areas, where we have only observed censored data, appear promising for acquisition functions using the predictive uncertainty, despite the fact that none of the NNs predict the area to be promising. For this reason, in practice we use Thompson sampling (Thompson 1933), i.e., we draw a sample function from the posterior distribution of the surrogate model and evaluate the configuration at its optimum. Thompson sampling based on ensembles has been shown to work well for online decision problems and reinforcement learning (Osband et al. 2016; Lu and Roy 2017). However, in contrast to this work, we do not maintain a set of ensemble members and draw from it to select the next configuration to evaluate, but train a network from scratch in each iteration. This renders the requirement for reasonable uncertainty estimates in poor areas with only censored samples unnecessary and comes with the benefit of significantly reducing the computational requirements, since we only need to train a single NN instead of the whole ensemble.
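The following sketch illustrates the ensemble aggregation of Eq. (5) and the Thompson-sampling selection step with a single freshly trained network; `train_network` and `predict_mean` are assumed interfaces, not the paper's code:

```python
import numpy as np

def ensemble_mean_var(member_means):
    """Eq. (5): aggregate the mean predictions of the M ensemble members;
    the predicted aleatoric variances are deliberately not used."""
    member_means = np.asarray(member_means)          # shape (M, n_points)
    mu = member_means.mean(axis=0)
    var = ((member_means - mu) ** 2).mean(axis=0)    # empirical variance of the means
    return mu, var

def thompson_step(train_network, data, candidates):
    """Thompson sampling as used here: one freshly initialized network,
    trained with the Tobit loss, acts as a single posterior sample."""
    net = train_network(data)
    preds = np.array([net.predict_mean(lam) for lam in candidates])
    return candidates[int(np.argmin(preds))]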
Studying the Impact of Censored Observations
Now, we turn to the empirical evaluation of our model.
Implementation Details.
Following Snoek et al. (2015), we use a fully connected feedforward NN with 3 hidden layers and tanh activations. To train our network, we use SGD with momentum and a cyclic learning rate schedule with a single cycle that first increases and then decreases the learning rate (Smith 2018). As regularization, we apply weight decay and clip the gradients. To stabilize training, we use a softplus function for our second output neuron to ensure a positive value of the predicted standard deviation (Lakshminarayanan, Pritzel, and Blundell 2017) and initialize the bias of the second output neuron with the data deviation. As pre-processing, we normalize the training targets to have zero mean and unit variance and the input values to be in [0, 1]. Our implementation uses PyTorch (Paszke et al. 2019) and is integrated in the sequential model-based optimization framework SMAC (Hutter, Hoos, and Leyton-Brown 2011b) using the Python3 re-implementation (Lindauer et al. 2017). We will make it publicly available upon acceptance. All experiments were run on a compute cluster equipped with 2.80GHz Intel(R) Xeon(R) Gold CPUs.
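A sketch of such a network is given below; the hidden-layer width of 50 is an assumption, as the exact value was not recoverable from the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanVarianceNet(nn.Module):
    """3 tanh hidden layers with two output heads for mean and std."""
    def __init__(self, in_dim, hidden=50):  # hidden width assumed
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu_head = nn.Linear(hidden, 1)
        self.sigma_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.body(x)
        mu = self.mu_head(h)
        # Softplus keeps the predicted standard deviation positive.
        sigma = F.softplus(self.sigma_head(h)) + 1e-6
        return mu.squeeze(-1), sigma.squeeze(-1)
```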
Baselines.

First, we study how well NNs trained with the Tobit loss function (T) perform on censored data. We compare our loss function to three alternative baselines using only the NLL as a loss function: ignoring the censoring information, i.e., the ensemble trains on the same data but does not know which of the values are censored (I); dropping censored data, i.e., training the ensemble only on non-censored values (D); and using 5 iterations of the iterative procedure proposed by Schmee and Hahn (1979) (S&H). For all methods, we trained ensembles and trained each network for 10 000 epochs.
Problem Setup.
To study our NNs under controlled conditions, we consider synthetic global optimization problems to obtain training data: Branin (2D), Camelback (2D), and versions of the Hartmann function (3D and 6D). For each function f, we sample a number of locations proportional to the input dimensionality D. We generated copies of our training data and added normally distributed noise with µ = 0 and σ proportional to the function's range max(f) − min(f) to obtain noisy training data. Then, we use a threshold γ, e.g., the 10th percentile of all observations, and censor each observation at a location x_i for which f(x_i) ≥ γ with a probability that increases with the function value; a sketch of this scheme is given below. This mimics the data obtained during optimization, where censoring caps poorly performing runs. To study the impact of censored samples, we consider different thresholds (the 10th, 20th, 40th, and 80th percentiles), covering aggressive censoring and almost no censoring.
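The censoring scheme can be sketched as follows; the linear ramp for the censoring probability is an assumption, since the exact schedule was not recoverable from the text:

```python
import numpy as np

def censor_dataset(y_noisy, percentile, rng):
    """Censor noisy observations above the percentile threshold gamma with
    a probability growing with the value (linear ramp assumed); censored
    entries record only the lower bound gamma."""
    y = y_noisy.copy()
    gamma = np.percentile(y, percentile)       # censoring threshold
    worst = y.max()
    p_censor = np.clip((y - gamma) / max(worst - gamma, 1e-12), 0.0, 1.0)
    censored = rng.random(y.shape) < p_censor  # only values above gamma can be hit
    y[censored] = gamma                        # observe the lower bound only
    return y, censored

rng = np.random.default_rng(0)
y_noisy = rng.normal(10.0, 2.0, size=500)      # stand-in for noisy f-values
y_obs, I = censor_dataset(y_noisy, percentile=10, rng=rng)
```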
Results.

Table 1: RMSE values for ensembles of NNs ignoring censoring information (I), dropping censored observations (D), using 5 iterations of S&H (S&H), and using the Tobit loss (T), on Branin, Camelback, Hartmann3, and Hartmann6, under varying levels of randomized censoring with thresholds at the 10th, 20th, 40th, and 80th percentiles (a higher threshold indicates less censoring, see text).

In Table 1 we report the average root-mean-squared error (RMSE), computed via cross-validation, of the ensemble mean prediction w.r.t. the true function values. Looking at the overall results, T achieved the best results, followed, not surprisingly, by S&H, since both methods incorporate censoring information. However, we note that S&H multiplies the training time by its number of iterations compared to using the Tobit loss. The alternative, I, performs worse, since the model learns from biased data. In contrast, D can yield relatively good results in some cases if the uncensored samples provide enough information. Also, naturally, all methods achieved the best results on the datasets with the smallest fraction of censored observations.

Furthermore, we take a closer look at using S&H for two exemplary scenarios in Figure 4 and show how the RMSE changes with each iteration. The largest improvement happens in the beginning, and results might improve or converge with further iterations. However, it is unclear beforehand how many iterations are required to achieve the best results, making S&H much harder to use in practice.

Figure 4: Mean and variance of the RMSE across cross-validation folds of an ensemble built by using iterations of the S&H algorithm and using the Tobit loss. Left: Hartmann6 with aggressive censoring. Right: Hartmann6 with mild censoring.

Model-based Optimization
Next, we study two real-world runtime minimization problems, which we first briefly describe.
Tuning the runtime of a local-search SAT solver.
We use an existing benchmark for consistency with prior work and tune the four continuous hyperparameters of Saps (Hutter, Tompkins, and Hoos 2002) on SAT instances, as introduced by Hutter et al. (2010). The instances are quasigroup completion problems (QCP) (Gomes and Selman 1997) of increasing difficulty, denoting the median (QCP med) and two higher quantiles (QCP q) of hardness for Saps w.r.t. a large distribution of instances. Instead of runtime in seconds, we optimize the number of steps as a more robust measurement (Eggensperger et al. 2018). We ran each optimization method for a fixed time budget and set the overall cutoff to 10 000 steps.

Figure 5: Results for I, D, S&H and T on actual runtime data (QCP q and adult). The upper plots show RMSE and CC when training on increasing fractions of observed data during optimization and testing on unseen data from the same distribution. The lower plots show predicted values versus true observations on a log scale for training on 100% of the data obtained on QCP q.
Tuning the time-to-accuracy of NNs.
In the spirit of a recent effort to establish time-to-accuracy as a metric to evaluate deep learning architectures (Coleman et al. 2019; Mattson et al. 2017) w.r.t. both speed and performance, we constructed a new benchmark. We defined a configuration space consisting of commonly tuned hyperparameters of neural networks (see appendix (Table 2)) and drew random configurations to evaluate their validation performance on datasets obtained from OpenML (Vanschoren et al. 2014) using the Python API (Feurer et al. 2019). We selected datasets for which the default configuration achieved the target accuracy within a lower and an upper bound on runtime. For each dataset, we considered a fixed quantile of these runs as the target value to achieve for our network. As a performance measure, we use the training time in seconds to reach this preset accuracy. Furthermore, we considered runs for which the training diverged as timeouts with the given cutoff time, allowed each optimization method to run for a fixed wall-clock budget, and set an overall cutoff per run.

Evaluating the quality of our NNs on runtime data
Before turning to the optimization problem, we first study the performance of our model on data obtained from actual optimization runs. For this, we train our model post hoc on run histories ⟨(λ_i, c_i, I_i)⟩_{i=1}^n obtained from running random search with racing, containing all observations evaluated during optimization. We consider one task from each of our optimization problems: tuning Saps on QCP q and tuning time-to-accuracy on the adult dataset. We collected multiple run histories for QCP q and for adult and used a subset of each as a hold-out test set. For each configuration in the test set, we collected repeated observations to obtain an empirical estimate of its actual performance.
Again, we compare I, D, S&H, and T. We study the RMSE and the Spearman rank correlation coefficient (CC) of the predictions of individual networks (we provide more metrics in the appendix (Figures 1 and 2)), each trained for 40 000 epochs, and of the estimated mean for each configuration. To account for noise in our training procedure, we conducted several repetitions, each training on all but the test run histories.

We report the mean and standard deviation across these repetitions in Figure 5 and additionally report the fraction of censored values present during training. The upper row shows how RMSE and CC develop with more training data, by training on increasing fractions of the run histories (i.e., studying the model's performance at different time steps of the optimization run). We observe that for both scenarios the Tobit loss clearly outperforms the alternative ways of handling censored information. Furthermore, for both scenarios the RMSE values for T stay the same (QCP q) or decrease (adult) with more data, while the values for the other methods strongly increase (QCP q) or stagnate (adult), highlighting the benefit obtained from using our loss function. The CC values indicate how well a method preserves the ranking between configurations, and again T performs best, followed by S&H. Both D and I perform worse due to underestimating the true values. In the second row, we plot actual predictions against the empirical mean², showing the difference in the quality of the predictions. While T preserves the ranking between configurations best, D and I tend to predict the training data mean, making them bad choices for the optimization procedure. S&H performs slightly better but does not restore censored values with the same quality as T.

² To obtain results in feasible time, we applied a global cutoff, yielding some globally censored values. To obtain a ground-truth value for each configuration to compare our predictions against, instead of using the empirical mean, which is biased due to the global cutoff, we use the mean of a normal distribution fitted by maximizing the Tobit likelihood on the log values. We note that this only makes a difference for configurations where we observed both uncensored and timed-out runs. We replaced all values higher than the globally set cutoff by the cutoff before computing any metrics.

Table 2: Results for minimizing the number of steps for Saps (upper) and the time-to-accuracy for NNs (lower). For each optimization method, we report the median and the lower and upper quartiles across repetitions; the final configurations were evaluated repeatedly with different seeds. We underline the best found value and boldface values that are not statistically different according to a random permutation test with 100 000 permutations. In the lower part of the table, we report the average rank and the average normalized score for each optimization method. Columns: Set, Default, Rand, RF w\ S&H, NN w\ TS & Tobit; rows: QCP med, QCP q, QCP q, adult (default 8.69), airlines (38.99), bank (12.17), connect-4 (47.41), credit-g (39.71), jannis (16.61), numerai (15.36), vehicle (7.5), plus average rank and average score.
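The footnote's ground-truth construction, fitting a normal distribution by maximizing the Tobit likelihood on log values, can be sketched as follows (an illustrative implementation, not the paper's code):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_mle_mean(log_values, censored):
    """Fit a normal distribution to (partially censored) log runtimes by
    maximizing the Tobit likelihood and return its mean."""
    def nll(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        z = (log_values - mu) / sigma
        ll_obs = norm.logpdf(z) - np.log(sigma)  # uncensored terms
        ll_cen = norm.logsf(z)                   # censored terms: log(1 - Phi)
        return -np.sum(np.where(censored, ll_cen, ll_obs))

    x0 = np.array([log_values.mean(), np.log(log_values.std() + 1e-6)])
    res = minimize(nll, x0, method="Nelder-Mead")
    return res.x[0]  # mean of the fitted normal on the log scale
```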
NNs for Algorithm Configuration
Now we turn to the results of using our model on the minimization tasks. As a competitive baseline, we use the state-of-the-art sequential model-based optimization framework SMAC (Hutter, Hoos, and Leyton-Brown 2011b), which uses RFs handling censored data with S&H (Hutter, Hoos, and Leyton-Brown 2011a), and which we replace with our NNs using Thompson sampling (NN w\ TS & Tobit). Additionally, we ran random search (Rand) as a baseline to quantify the contribution of a model to the optimization procedure. We ran the experiments as discussed above and report the median performance of all methods in Table 2. Additionally, we report the average rank and the average normalized score across all scenarios (see appendix (Table 3) for more details).

In general, all methods found configurations that substantially improve over the default configuration, with the largest speedups for Saps on QCP q and for time-to-accuracy on credit-g. Furthermore, all model-based methods perform better than Rand on all benchmarks, and almost always statistically significantly so. Looking at the Saps results and comparing the RF-based method to our method, we find that NN w\ TS & Tobit performed better on all scenarios, indicating that the networks with Thompson sampling indeed work well for such low-dimensional and noisy benchmarks. On the higher-dimensional tasks minimizing time-to-accuracy, the RF model is more competitive, but NN w\ TS & Tobit still performs significantly better on most of the tasks and better or competitively on the remaining datasets. Also, looking at the aggregated rank and score, NN w\ TS & Tobit obtained the lowest rank and the highest score, showing that NNs are a promising alternative.
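For illustration, one iteration of the resulting optimizer can be sketched as follows; all interfaces (`train_network`, `race`, `predict_mean`, `sample_candidates`) are placeholders for the SMAC machinery described above:

```python
def bo_iteration(history, sample_candidates, train_network, race):
    """history: list of (lambda_i, c_i, I_i) tuples collected so far."""
    net = train_network(history)        # single NN trained with the Tobit loss
    candidates = sample_candidates()    # e.g., random configurations
    # Thompson sampling: the freshly trained network acts as one posterior
    # sample; pick the candidate minimizing its predicted mean cost.
    challenger = min(candidates, key=net.predict_mean)
    # Race the challenger against the incumbent with adaptive capping; this
    # may produce right-censored observations, which are appended as-is.
    history.extend(race(challenger))
    return history
```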
Conclusion and Future Work
Better methods for efficiently handling censored data directly lead to improvements for model-based optimization in a wide variety of domains, e.g., minimizing the time-to-accuracy of machine learning algorithms and the solution time of combinatorial optimization. Recently, the potential of NNs to rival and outperform GP- and RF-based Bayesian optimization has been demonstrated, but they had not yet been extended for optimization procedures relying on censoring strategies as one of their primary factors of efficiency. In this work, we propose a theoretically motivated loss function (Tobin 1958) which directly incorporates censored data and is thus the missing piece for NN-driven optimization procedures in these domains. We empirically showed how well our loss function works for censored data and evaluated our solution on real optimization benchmarks, outperforming the previous state of the art.

Building on these promising results, for future work it would be interesting to investigate how to extend our model for training on non-Gaussian distributions and in the joint space of configurations and tasks (Eggensperger, Lindauer, and Hutter 2018). Furthermore, another interesting direction would be to study whether the model capacity of the NNs can be adapted over time during the optimization (Franke et al. 2019). Finally, we believe the overhead of training the model can be further reduced by studying how to reuse trained NNs during optimization (Huang et al. 2017).

Acknowledgements
This work has partly been supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant no. 716721. Robert Bosch GmbH is acknowledged for financial support.
References
Ansótegui, C.; Malitsky, Y.; Sellmann, M.; and Tierney, K. 2015. Model-Based Genetic Algorithms for Algorithm Configuration. In Proc. of IJCAI'15, 733–739.

Brochu, E.; Cora, V.; and de Freitas, N. 2010. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. arXiv:1012.2599v1 [cs.LG].

Cáceres, L. P.; Bischl, B.; and Stützle, T. 2017. Evaluating random forest models for irace. In Proc. of GECCO, 1146–1153. ACM.

Candelieri, A.; Perego, R.; and Archetti, F. 2018. Bayesian optimization of pump operations in water distribution systems. JGO. arXiv preprint arXiv:1910.05452.

Chiarandini, M.; Fawcett, C.; and Hoos, H. 2008. A Modular Multiphase Heuristic Solver for Post Enrolment Course Timetabling. In Proc. of PATAT'08.

Coleman, C.; Kang, D.; Narayanan, D.; Nardi, L.; Zhao, T.; Zhang, J.; Bailis, P.; Olukotun, K.; Ré, C.; and Zaharia, M. 2019. Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark. Operating Systems Review.

Cox, D. 1972. Regression Models and Life-Tables. Journal of the Royal Statistical Society 34: 187–220.

Eggensperger, K.; Lindauer, M.; Hoos, H. H.; Hutter, F.; and Leyton-Brown, K. 2018. Efficient benchmarking of algorithm configurators via model-based surrogates. Machine Learning.

Eggensperger, K.; Lindauer, M.; and Hutter, F. 2018. Neural Networks for Predicting Algorithm Runtime Distributions. In Proc. of IJCAI'18, 1442–1448.

Falkner, S.; Klein, A.; and Hutter, F. 2018. BOHB: Robust and Efficient Hyperparameter Optimization at Scale. In Proc. of ICML'18, 1437–1446.

Feurer, M.; Klein, A.; Eggensperger, K.; Springenberg, J.; Blum, M.; and Hutter, F. 2015. Efficient and Robust Automated Machine Learning. In Proc. of NeurIPS'15, 2962–2970.

Feurer, M.; van Rijn, J.; Kadra, A.; Gijsbers, P.; Mallik, N.; Ravi, S.; Müller, A.; Vanschoren, J.; and Hutter, F. 2019. OpenML-Python: an extensible Python API for OpenML. arXiv:1911.02490 [cs.LG].

Franke, J.; Köhler, G.; Awad, N.; and Hutter, F. 2019. Neural Architecture Evolution in Deep Reinforcement Learning for Continuous Control. In MetaLearn'19.

Gagliolo, M. 2010. Online Dynamic Algorithm Portfolios. Ph.D. thesis, Università della Svizzera Italiana.

Gagliolo, M.; and Schmidhuber, J. 2006. Impact of censored sampling on the performance of restart strategies. In Proc. of CP'06, 167–181.

Gebser, M.; Kaminski, R.; Kaufmann, B.; Schaub, T.; Schneider, M.; and Ziller, S. 2011. A Portfolio Solver for Answer Set Programming: Preliminary Report. In Proc. of LPNMR'11, 352–357.

Gomes, C.; and Selman, B. 1997. Problem structure in the presence of perturbations. In Proc. of AAAI'97, 221–226.

Haider, H.; Hoehn, B.; Davis, S.; and Greiner, R. 2020. Effective Ways to Build and Evaluate Individual Survival Distributions. JMLR.

Huang, G.; Li, Y.; Pleiss, G.; Liu, Z.; Hopcroft, J.; and Weinberger, K. 2017. Snapshot Ensembles: Train 1, Get M for Free. In Proc. of ICLR'17.

Hutter, F. 2017. Towards true end-to-end learning & optimization. Invited talk held at the European Conference on Machine Learning & Principles and Practices of Knowledge Discovery in Databases (ECML/PKDD'17).

Hutter, F.; Hoos, H.; and Leyton-Brown, K. 2010. Automated Configuration of Mixed Integer Programming Solvers. In Proc. of CPAIOR'10, 186–202.

Hutter, F.; Hoos, H.; and Leyton-Brown, K. 2011a. Bayesian Optimization With Censored Response Data. In NeurIPS Workshop on Bayesian Optimization, Sequential Experimental Design, and Bandits (BayesOpt'11).

Hutter, F.; Hoos, H.; and Leyton-Brown, K. 2011b. Sequential Model-Based Optimization for General Algorithm Configuration. In Proc. of LION'11, 507–523.

Hutter, F.; Hoos, H.; Leyton-Brown, K.; and Murphy, K. 2010. Time-Bounded Sequential Parameter Optimization. In Proc. of LION'10, 281–298.

Hutter, F.; Hoos, H.; Leyton-Brown, K.; and Stützle, T. 2009. ParamILS: An Automatic Algorithm Configuration Framework. JAIR 36: 267–306.

Hutter, F.; Lindauer, M.; Balint, A.; Bayless, S.; Hoos, H.; and Leyton-Brown, K. 2017. The Configurable SAT Solver Challenge (CSSC). AIJ.

Hutter, F.; Tompkins, D.; and Hoos, H. 2002. Scaling and Probabilistic Smoothing: Efficient Dynamic Local Search for SAT. In Proc. of CP'02, 233–248.

Ishwaran, H.; Kogalur, U.; Blackstone, E.; and Lauer, M. 2008. Random survival forests. The Annals of Applied Statistics.

Jones, D.; Schonlau, M.; and Welch, W. 1998. Efficient Global Optimization of Expensive Black-Box Functions. JGO 13: 455–492.

Kalbfleisch, J.; and Prentice, R. 2002. The statistical analysis of failure time data, volume 360. John Wiley & Sons.

Kaplan, E.; and Meier, P. 1958. Nonparametric Estimation from Incomplete Observations. Journal of the American Statistical Association 53: 457–481.

Katzman, J.; Shaham, U.; Cloninger, A.; Bates, J.; Jiang, T.; and Kluger, Y. 2018. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology.

Klein, J.; and Moeschberger, M. 2006. Survival analysis: techniques for censored and truncated data. Springer-Verlag.

Kleinbaum, D.; and Klein, M. 2012. Survival analysis. Springer-Verlag.

Kleinberg, R.; Leyton-Brown, K.; and Lucier, B. 2017. Efficiency Through Procrastination: Approximately Optimal Algorithm Configuration with Runtime Guarantees. In Proc. of IJCAI'17, 2023–2031.

Kleinberg, R.; Leyton-Brown, K.; Lucier, B.; and Graham, D. 2019. Procrastinating with Confidence: Near-Optimal, Anytime, Adaptive Algorithm Configuration. In Proc. of NeurIPS'19, 8881–8891.

Lakshminarayanan, B.; Pritzel, A.; and Blundell, C. 2017. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Proc. of NeurIPS'17, 6402–6413.

Lee, C.; Zame, W.; Yoon, J.; and van der Schaar, M. 2018. DeepHit: A Deep Learning Approach to Survival Analysis with Competing Risks. In Proc. of AAAI'18, 2314–2321.

Li, Y.; Vinzamuri, B.; and Reddy, C. 2016. Regularized Weighted Linear Regression for High-dimensional Censored Data. In Proc. of SDM'16, 45–53.

Lindauer, M.; Eggensperger, K.; Feurer, M.; Falkner, S.; Biedenkapp, A.; and Hutter, F. 2017. SMACv3: Algorithm configuration in Python. github.com/automl/SMAC3.

Lu, X.; and Roy, B. V. 2017. Ensemble sampling. In Proc. of NeurIPS'17, 3258–3266.

Mattson, P.; Cheng, C.; Diamos, G.; Coleman, C.; Micikevicius, P.; Patterson, D.; Tang, H.; Wei, G.; Bailis, P.; Bittorf, V.; Brooks, D.; Chen, D.; Dutta, D.; Gupta, U.; Hazelwood, K.; Hock, A.; Huang, X.; Kang, D.; Kanter, D.; Kumar, N.; Liao, J.; Narayanan, D.; Oguntebi, T.; Pekhimenko, G.; Pentecost, L.; Janapa, R.; Robie, T.; John, T. S.; Wu, C.; Xu, L.; Young, C.; and Zaharia, M. 2017. MLPerf Training Benchmark. In Proc. of MLSys'20, 336–349.

Mockus, J. 1994. Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization.

Neal, R. 1996. Bayesian Learning for Neural Networks. Lecture Notes in Computer Science. Springer-Verlag.

Nix, D.; and Weigend, A. 1994. Estimating the mean and variance of the target probability distribution. In Proc. of ICNN'94, 55–60.

Osband, I.; Blundell, C.; Pritzel, A.; and Roy, B. V. 2016. Deep exploration via bootstrapped DQN. In Proc. of NeurIPS'16, 4026–4034.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proc. of NeurIPS'19, 8024–8035.

Perrone, V.; Jenatton, R.; Seeger, M.; and Archambeau, C. 2018. Scalable hyperparameter transfer learning. In Proc. of NeurIPS'18, 12751–12761.

Pinitilie, M. 2006. Competing Risks: A Practical Perspective. John Wiley & Sons.

Schilling, N.; Wistuba, M.; Drumond, L.; and Schmidt-Thieme, L. 2015. Hyperparameter optimization with factorized multilayer perceptrons. In Proc. of ECML/PKDD'15, 87–103.

Schmee, J.; and Hahn, G. 1979. A simple method for regression analysis with censored data. Technometrics 21: 417–432.

Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R.; and de Freitas, N. 2016. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE.

Smith, L. 2018. A disciplined approach to neural network hyper-parameters: Part 1. arXiv:1803.09820 [cs.LG].

Snoek, J.; Larochelle, H.; and Adams, R. 2012. Practical Bayesian Optimization of Machine Learning Algorithms. In Proc. of NeurIPS'12, 2960–2968.

Snoek, J.; Rippel, O.; Swersky, K.; Kiros, R.; Satish, N.; Sundaram, N.; Patwary, M.; Prabhat; and Adams, R. 2015. Scalable Bayesian Optimization Using Deep Neural Networks. In Proc. of ICML'15, 2171–2180.

Springenberg, J.; Klein, A.; Falkner, S.; and Hutter, F. 2016. Bayesian Optimization with Robust Bayesian Neural Networks. In Proc. of NeurIPS'16.

Thompson, W. 1933. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika.

Tobin, J. 1958. Estimation of Relationships for Limited Dependent Variables. Econometrica.

Vallati, M.; Fawcett, C.; Gerevini, A.; Hoos, H.; and Saetti, A. 2013. Automatic Generation of Efficient Domain-Optimized Planners from Generic Parametrized Planners. In Proc. of SOCS'13.

Vanschoren, J.; van Rijn, J.; Bischl, B.; and Torgo, L. 2014. OpenML: Networked Science in Machine Learning. SIGKDD Explor. Newsl.

Weisz, G.; György, A.; and Szepesvári, C. 2019a. CapsAndRuns: An Improved Method for Approximately Optimal Algorithm Configuration. In Proc. of ICML'19, 6707–6715.

Weisz, G.; György, A.; and Szepesvári, C. 2019b. LeapsAndBounds: A Method for Approximately Optimal Algorithm Configuration. In Proc. of ICML'18, 5257–5265.

White, C.; Neiswanger, W.; and Savani, Y. 2019. Neural Architecture Search via Bayesian Optimization with a Neural Network Prior. In MetaLearn'19.
MetaLearn’19 . ppendix: Neural Model-based Optimization with Right-Censored Observations Katharina Eggensperger, Kai Haase, Philipp M ¨uller, Marius Lindauer, Frank Hutter University of Freiburg, Germany Leibniz University Hannover, Germany Bosch Center for Artificial [email protected], [email protected], [email protected],[email protected], [email protected]
More details for Studying the Impact of Censored Observations
Here, we provide additional results for evaluating the predictive quality of our NNs on actual data obtained during optimization. Based on the same experiments as in the main paper, we plot the L2 loss between the predicted median and the empirical median (as a more robust metric when observing censored data) and the RMSE when only considering configurations for which the mean performance is better than the global cutoff value (to provide an alternative measurement that takes only the actually observed performance range into account).
Figure 1: Results for I, D, S&H and T on actual runtime data (QCP q and adult). The upper plots show the L2 loss w.r.t. the empirical and predicted median, and the lower plots show the RMSE without timeouts, when training on increasing fractions of observed data during optimization and testing on unseen data from the same distribution.
More details for Model-based Optimization
Here, we provide the configuration spaces used for tuning the number of steps for Saps (Table 1) and for tuning the time-to-accuracy of neural networks (Table 2). Furthermore, analogously to Table 2 in the main paper, we provide in Table 3, for each optimization problem, the median score, which describes the percentage of the best found configuration per run and the overall best found function value, normalized between the best and worst found configuration per scenario.

Figure 2: Results for I, D, S&H and T. We show predicted values versus true observations on a log scale for training on 100% of the data obtained on the time-to-accuracy benchmark adult.

Table 1: 4-dimensional search space for tuning Saps (defaults and log flags were not recoverable).

name   range       default  type   log
alpha  [1, 1.4]    –        float  –
rho    [0, 1]      –        float  –
ps     [0, 0.2]    –        float  –
wp     [0, 0.06]   –        float  –

Table 2: 7-dimensional search space for tuning neural networks (several entries were not recoverable; the names of the two unnamed integer hyperparameters are assumed to be the number of layers and units per layer).

name             range      default  type   log
batch size       [16, –]    –        int    –
learning rate    [1e–, –]   1e–      float  –
momentum         [–, 0.99]  –        float  –
weight decay     [1e–, –]   1e–      float  –
number of layers [1, 5]     2        int    –
units per layer  [64, –]    –        int    –
dropout          [–, –]     –        float  –

Table 3: Results for minimizing the number of steps for Saps (upper) and the time-to-accuracy for NNs (lower); see Table 2 in the main paper. We report the averaged score, normalized between the best and worst final configuration found by an optimization procedure. For each optimization method, we report the median and the lower and upper quartiles across repetitions. For Saps, we evaluated the final configuration 1 000 times, each run with a different seed, while for the NNs, we evaluated each configuration 100 times, each run with a different seed. We underline the best found value and boldface values that are not statistically different according to a random permutation test with 100 000 permutations.
Columns: Set, Rand, RF w\ S&H, NN w\ TS & Tobit; rows: QCP med, QCP q, QCP q, adult, airlines, bank, connect-4, credit-g, jannis, numerai, vehicle, plus average rank and average score per method.