MISO-wiLDCosts: Multi Information Source Optimization with Location Dependent Costs
Antonio Candelieri, Francesco Archetti
University of Milano-Bicocca
Abstract
This paper addresses black-box optimization over multiple information sources whose fidelity and query cost both change over the search space, that is, they are location dependent. The approach uses: (i) an Augmented Gaussian Process, recently proposed in multi-information source optimization as a single model of the objective function over the search space and the sources, and (ii) a Gaussian Process to model the location-dependent cost of each source. The former is used within a Confidence Bound based acquisition function to select the next source and location to query, while the latter penalizes the value of the acquisition depending on the expected query cost of any source-location pair. The proposed approach is evaluated on a set of Hyperparameters Optimization tasks, consisting of two Machine Learning classifiers and three datasets of different sizes.

1 Introduction

Bayesian Optimization (BO) (Shahriari et al., 2015; Frazier, 2018; Archetti and Candelieri, 2019) is a sample efficient strategy for the global optimization of black-box, expensive and multi-extremal functions. The reference global optimization problem is:

argmin_{x ∈ Ω ⊂ ℝ^d} f(x)    (1)

where Ω is the d-dimensional search space.

BO consists of two key components: a probabilistic surrogate model of f(x), fitted on the function evaluations performed so far, and an acquisition function providing the next location to query, while balancing between exploitation and exploration, depending on the predictions, and associated uncertainty, of the probabilistic surrogate model. Updating the model conditioned on the observations and selecting the next location to query are sequentially iterated until some termination criterion is met, usually a maximum number of function evaluations (a minimal code sketch of this loop is given at the end of this section).

Thanks to its sample efficiency, compared to competing methods, BO is currently the standard technique for Automated Machine Learning (AutoML) (Hutter et al., 2019; Candelieri and Archetti, 2019) and is also successfully applied in Neural Architecture Search (NAS) (Elsken et al., 2019). Other relevant applications of BO concern simulation-optimization (Sha et al., 2020) and the control of complex systems (Candelieri et al., 2018). However, the basic BO algorithm does not directly incorporate and use any information about query cost, which is instead crucial in many real-life applications. An example comes precisely from AutoML and NAS: (Strubell et al., 2020) provides an estimation of the financial and environmental costs of optimizing the hyperparameters of deep neural networks in the domain of Natural Language Processing. The astonishing amount of energy required can generate an amount of CO2 emissions around five times that generated by a car during its entire lifetime.

Therefore, recent research studies have been proposing innovative BO-based approaches designed for more challenging settings, where information about query cost is explicitly used. Research can basically be split into two branches: (i) multi-information source optimization (MISO), characterized by the availability of cheap approximations (i.e., sources) of the more expensive f(x), and (ii) cost-aware (Bayesian) optimization, assuming a cost to query f(x) which changes over the search space.
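To make these two components concrete, the following is a minimal, self-contained sketch of the generic BO loop just described. It is illustrative Python only; the names and the simple LCB-over-random-candidates acquisition are ours, not a specific implementation from the literature:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def lcb(gp, X_cand, beta=4.0):
    # lower confidence bound: optimistic estimate of f for minimization
    mu, sd = gp.predict(X_cand, return_std=True)
    return mu - np.sqrt(beta) * sd

def bo_minimize(f, bounds, n_init=5, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    X = rng.uniform(lo, hi, size=(n_init, len(lo)))      # initial design
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        # probabilistic surrogate model, refitted on all evaluations so far
        gp = GaussianProcessRegressor(kernel=Matern(nu=1.5), normalize_y=True,
                                      alpha=1e-6).fit(X, y)
        cand = rng.uniform(lo, hi, size=(2048, len(lo)))  # random candidates in Omega
        x_next = cand[np.argmin(lcb(gp, cand))]           # acquisition: minimize LCB
        X, y = np.vstack([X, x_next]), np.append(y, f(x_next))
    best = int(np.argmin(y))
    return X[best], y[best]

# usage: minimize a toy quadratic over Omega = [-5, 5]^2
x_best, y_best = bo_minimize(lambda x: float(np.sum(x**2)),
                             np.array([[-5.0, 5.0], [-5.0, 5.0]]))
```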
2 Related Work

As far as MISO is concerned, problem (1) has to be solved by efficiently using S information sources, f_1(x), ..., f_S(x), providing approximations, with different accuracy, of f(x), each one at its own query cost c_s, with s = 1, ..., S. Thus accuracy is location dependent, but cost is not. Efficiently solving (1) means that the query cost cumulated along the optimization process must be kept as low as possible. Sources can always be sorted by decreasing query cost, so that f_1(x) is the most expensive one, usually with f_1(x) = f(x) in the case that f(x) can be directly queried.

When information sources come with explicit information about their approximation quality (aka fidelity), MISO specializes into multi-fidelity optimization, first proposed in (Kennedy and O'Hagan, 2000) and more recently addressed in (Peherstorfer et al., 2017; Sen et al., 2018; Marques et al., 2018; Chaudhuri et al., 2019; Kandasamy et al., 2019; Song et al., 2019). Most multi-fidelity approaches exploit hierarchical relations among sources, based on their fidelities, even if some drawbacks were already highlighted in (March and Willcox, 2012). More precisely, a hierarchical organization requires the assumption that information sources are unbiased, meaning that noise must be independent across sources (Lam et al., 2015; Poloczek et al., 2017). Moreover, querying a source with a certain fidelity at a given location x implies that no further knowledge can be obtained by querying any other source of lower fidelity, at any location. Consequently, most multi-fidelity approaches cannot be applied in the case that sources cannot be hierarchically organized.

In (Lam et al., 2015) the first approach for non-hierarchical information sources was proposed, addressing location-dependent fidelities of the sources and defining the more general MISO setting. More recently, (Poloczek et al., 2017; Ghoreishi and Allaire, 2019) have provided improvements over (Lam et al., 2015). All these methods are based on the idea of using a separate model for each information source (i.e., a Gaussian Process - GP) and then fusing their predictions and related uncertainties through the method proposed in (Winkler, 1981), which became the standard practice for the fusion of normally distributed data. In (Candelieri et al., 2020) a different procedure has recently been proposed, where, instead of fusing GPs, sparsification is used to create the so-called Augmented Gaussian Process (AGP). We use this model – summarized in Sec. 3.2 – and extend MISO-AGP to deal with location-dependent costs.

In AutoML, MISO and multi-fidelity were first considered in (Swersky et al., 2013), proposing an approach able to use small datasets to quickly optimize the hyperparameters of a Machine Learning (ML) algorithm on a large dataset. More recently, (Klein et al., 2017) proposed FABOLAS (FAst Bayesian Optimization on LArge dataSets), a hyperparameters optimization (HPO) tool that simultaneously optimizes the hyperparameter values of an ML algorithm and the size of the dataset portion to consider. Also in (Candelieri et al., 2020) an HPO task is considered, with two information sources related to a large dataset and a small portion of it, respectively. All these approaches proved more cost-efficient than using BO for HPO performed on the large dataset only, without significant degradation of the final ML model's accuracy.
Although the quoted MISO and multi-fidelity approaches generalize from sample efficiency to cost efficiency – thus, they are "aware of costs" – they assume that the query cost of any source is constant over the search space. However, this is not true in practice, even using a single source, as empirically demonstrated in (Lee et al., 2020) through the distribution of the time required to evaluate a set of 5000 randomly selected hyperparameter configurations, for five common ML algorithms. This is also confirmed in this paper (Sect. 4.1), relative to 1000 hyperparameter configurations for two ML algorithms on three datasets.

The seminal work aimed at making BO cost-aware is (Snoek et al., 2012), which proposes to penalize the acquisition function, specifically Expected Improvement (EI), by the location-dependent cost c(x), leading to the EI-per-unit of cost: EI_pu(x) = EI(x)/c(x). However, EI_pu(x) is basically driven by the query cost, biased towards cheap locations and, therefore, it performs well only when optima are relatively cheap. To overcome this undesired behaviour, (Lee et al., 2020) proposes CArBO (Cost Apportioned BO), consisting of two consecutive stages: (i) a cost-effective selection of initial locations to query (i.e., cost-effective initial design) and (ii) a cost-cooling strategy where the penalty associated to the cost in EI_pu(x) is modulated according to the query cost incurred so far. More precisely:

EI-cool(x) = EI(x) / c(x)^α

with α the cooling factor defined as α = (τ − τ_n)/(τ − τ_init), where τ is the overall "budget" (i.e., maximum cumulated query cost), τ_n the cost cumulated up to the current iteration n, and τ_init the cost of the initial design. Finally, in CArBO, c(x) is modelled through a warped GP.

Another recent paper (Paria et al., 2020) has proposed an approach whose cost-aware acquisition function is based on Information Directed Sampling (IDS) (Russo and Van Roy, 2014), a principled mechanism to balance regret and information gain. The proposed cost-aware acquisition function, namely CostIDS, balances cost along with regret and information gain. However, an additional constraint is introduced in optimizing CostIDS, to avoid extremely cheap points being chosen repeatedly without any significant increase in information.
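The cost-aware acquisitions recalled above are compact enough to state in code. Below is a small sketch of EI for minimization together with the EI_pu and EI-cool penalizations; it follows the formulas exactly as quoted here, and the numbers in the usage lines are made up for illustration:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sd, y_best):
    # standard EI for minimization, from the GP posterior (mu, sd) at a location x
    z = (y_best - mu) / sd
    return (y_best - mu) * norm.cdf(z) + sd * norm.pdf(z)

def ei_per_unit_cost(ei, cost):
    # (Snoek et al., 2012): biased towards cheap locations
    return ei / cost

def ei_cool(ei, cost, tau, tau_n, tau_init):
    # CArBO's cost-cooling (Lee et al., 2020): alpha decays from 1 to 0 as the
    # budget tau is consumed, so the cost penalty fades along the run
    alpha = (tau - tau_n) / (tau - tau_init)
    return ei / cost**alpha

ei = expected_improvement(mu=0.10, sd=0.05, y_best=0.12)
print(ei_cool(ei, cost=20.0, tau=3600.0, tau_n=400.0, tau_init=100.0))   # early: ~EI_pu
print(ei_cool(ei, cost=20.0, tau=3600.0, tau_n=3500.0, tau_init=100.0))  # late: ~plain EI
```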
To the authors' knowledge, this is the first paper addressing a multi-information source setting in which both the fidelity and the query cost of the sources are black-box and location-dependent. Thus, we named our approach MISO-wiLDCosts: MISO with Location-Dependent Costs. Its goal is to solve (1) while keeping the cumulated query cost as small as possible, given a set of S information sources, {f_s(x)}_{s=1:S}, whose approximation quality and query costs, c_s(x), are black-box and location dependent. We provide an empirical evaluation of MISO-wiLDCosts on an AutoML task: the HPO of two ML classifiers on three datasets of different sizes.

3 MISO-wiLDCosts

3.1 Gaussian Process Regression

GP modelling (Williams and Rasmussen, 2006; Gramacy, 2020) is a non-parametric kernel-based learning method for probabilistic regression and classification. A GP regression model is a random function f : Ω → ℝ with output drawn from a multivariate normal distribution, formally f(x) ∼ N(μ(x), σ²(x)). In BO, GP regression is usually adopted as the probabilistic surrogate model, fitted on the n function evaluations performed so far. Let X_{1:n} = {x^(1), ..., x^(n)} and y = {y^(1), ..., y^(n)} denote, respectively, the n locations queried and the associated observed values; then fitting the GP means computing the posterior mean and variance of the multivariate normal distribution as follows:

μ(x) = k(x, X_{1:n}) [K + λ²I]⁻¹ y
σ²(x) = k(x, x) − k(x, X_{1:n}) [K + λ²I]⁻¹ k(X_{1:n}, x)    (2)

where λ² is the variance of a zero-mean Gaussian noise in the case of noisy observations (i.e., y^(i) = f(x^(i)) + ε, with ε ∼ N(0, λ²)), and K ∈ ℝ^{n×n}, with K_ij = k(x_i, x_j), where k is a kernel function modelling the covariance of the GP. Finally, k(x, X_{1:n}) is the vector whose i-th component is given by k(x, x_i) (for completeness, k(X_{1:n}, x) = k(x, X_{1:n})ᵀ).

The most widely adopted kernels are the Squared Exponential (aka Gaussian), Matérn, Power Exponential and Exponential (aka Laplacian). Each kernel implies a different prior on the structural properties of the sample paths of the latent function under the GP, such as differentiability. Moreover, each kernel has its own hyperparameters to adjust the GP's posterior depending on the observations; the GP's hyperparameters are usually tuned via Maximum Log-likelihood Estimation (MLE) or Maximum A Posteriori estimation (MAP). In this paper, the Matérn 3/2 kernel is used:

k_{3/2}(x, x') = σ_k² (1 + √3·r/ℓ) exp(−√3·r/ℓ)    (3)

with r = ||x − x'||. The kernel's hyperparameters σ_k² and ℓ, namely the kernel amplitude and the characteristic length scale, are estimated via MLE.

Let y_s denote a function evaluation performed on source s, where y_s = f_s(x) + ε_s and ε_s is a zero-mean Gaussian noise associated to that source, formally ε_s ∼ N(0, λ_s²). Let z_s denote the (black-box) query cost paid to observe y_s, that is z_s = c_s(x) + δ_s, with δ_s ∼ N(0, ζ_s²). It is important to remark that, by definition of query cost, y_s and z_s are not decoupled, in the sense that y_s cannot be observed without paying z_s, and vice-versa. Let D_s = {(x^(i), y_s^(i), z_s^(i))}_{i=1,...,n_s} denote the dataset collecting all the relevant information along the optimization process, with n_s the number of function evaluations performed on source s.
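As a concrete illustration of (2) and (3), the following minimal sketch computes the GP posterior with a Matérn 3/2 kernel. It is illustrative Python (the paper's implementation is in R), with kernel hyperparameters fixed for brevity rather than estimated via MLE:

```python
import numpy as np

def matern32(A, B, amplitude=1.0, lengthscale=1.0):
    # Matern 3/2 kernel, eq. (3): sigma_k^2 (1 + sqrt(3) r / l) exp(-sqrt(3) r / l)
    r = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    s = np.sqrt(3.0) * r / lengthscale
    return amplitude * (1.0 + s) * np.exp(-s)

def gp_posterior(x, X, y, lam=1e-3, **kern):
    # posterior mean and variance of eq. (2), with noise variance lam^2
    K = matern32(X, X, **kern) + lam**2 * np.eye(len(X))
    k_star = matern32(x[None, :], X, **kern)              # k(x, X_{1:n})
    mu = k_star @ np.linalg.solve(K, y)                   # k [K + lam^2 I]^(-1) y
    var = matern32(x[None, :], x[None, :], **kern) \
          - k_star @ np.linalg.solve(K, k_star.T)
    return mu.item(), var.item()

# usage on a 1-d toy problem
X = np.linspace(0.0, 1.0, 8)[:, None]
y = np.sin(6.0 * X[:, 0])
print(gp_posterior(np.array([0.5]), X, y, lengthscale=0.2))
```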
The following two projections of D_s are considered:

F_s = {(x^(i), y_s^(i))}_{i=1,...,n_s}    (4)

namely the function evaluations dataset, storing the locations queried and the function values observed, and

C_s = {(x^(i), z_s^(i))}_{i=1,...,n_s}    (5)

namely the query costs dataset, storing the locations queried and the query costs paid.

The two sets, F_s and C_s, are used to fit two GPs, namely F_s(x) and C_s(x), according to (2), modelling:

f_s(x) ∼ N(μ_s(x), σ_s²(x))    (6)

and

c_s(x) ∼ N(p_s(x), q_s²(x))    (7)

with s = 1, ..., S, and where p_s(x) and q_s(x) are again a mean and a standard deviation; different symbols are used to distinguish them from μ_s(x) and σ_s(x).
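In code, fitting the per-source models (6) and (7) from the two projections (4) and (5) can be sketched as follows. This is our own illustrative scikit-learn version, with an assumed data layout D[s] = (X_s, y_s, z_s); the paper's implementation is in R:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

def fit_source_models(D):
    # D[s] = (X_s, y_s, z_s): locations, function values and query costs of
    # source s, i.e. the two projections F_s (4) and C_s (5) of D_s
    kernel = ConstantKernel() * Matern(nu=1.5)   # amplitude * Matern 3/2, eq. (3)
    F, C = [], []
    for X_s, y_s, z_s in D:
        F.append(GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                          alpha=1e-6).fit(X_s, y_s))  # f_s(x), eq. (6)
        C.append(GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                          alpha=1e-6).fit(X_s, z_s))  # c_s(x), eq. (7)
    return F, C
```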
3.2 The Augmented Gaussian Process

An Augmented Gaussian Process (AGP) (Candelieri et al., 2020) is aimed at generating a single model over all the f_s(x), based on the simplified model discrepancy measure, which quantifies the difference between the GPs approximating two different sources:

η(F_s, F_s̄, x) = |μ_s(x) − μ_s̄(x)|    (8)

with s, s̄ = 1, ..., S and s ≠ s̄.

In MISO-wiLDCosts, we assume that f_1(x) is the preferred source because, for instance, it coincides with f(x) or it is supposed to provide the most accurate approximation of it. Fitting the AGP requires augmenting the set of function evaluations performed on the preferred source (i.e., the set F_1) with function evaluations performed on the other sources whose function value is sufficiently close to the prediction provided by F_1(x). More formally, the set of augmenting locations, namely F̄, is given by:

F̄ = {(x, y_s) ∈ F_s, s = 2, ..., S : η(F_s, F_1, x) < m·σ_1(x)}    (9)

where η(F_s, F_1, x) is computed according to (8) and m is a technical parameter, usually set to m = 1, meaning that only evaluations falling within μ_1(x) ± σ_1(x) are included in F̄.

The final set of inducing locations used to fit the AGP is given by F̂ = F_1 ∪ F̄, and the resulting AGP is denoted by F̂(x), such that:

f(x) ∼ N(μ̂(x), σ̂²(x))    (10)

conditioned to F̂ according to (2).

3.3 The Acquisition Function

To select the next source and location to query, we started from the acquisition function proposed in (Candelieri et al., 2020) on top of the AGP model. However, we have significantly modified it in order to consider the location-dependent query cost of each source. The resulting acquisition function is based on the well-known Lower Confidence Bound (LCB), whose convergence proof, under an appropriate scheduling of its technical parameter β_t, is given in (Srinivas et al., 2012). The acquisition function proposed in this paper is defined as follows:

(s', x') = argmax_{s=1,...,S; x∈Ω⊂ℝ^d} [ŷ⁺ − (μ̂(x) − √β_t·σ̂(x))] / [Δc_s(x)·η(F̂, F_s, x)]    (11)

where ŷ⁺ is the lowest function value within the inducing locations set F̂, that is ŷ⁺ = min_{(x,y)∈F̂} {y}, and Δc_s(x) is an estimate of the query cost for source s.

Therefore, the numerator of (11) is the most optimistic improvement with respect to ŷ⁺, depending on the AGP's LCB, while the denominator consists of two source-and-location-dependent penalization terms: Δc_s(x) and the model discrepancy between the AGP F̂(x) and the GP F_s(x) modelling source s.

The location-dependent query cost is modelled by the GP C_s(x), and a risk-averse attitude is adopted, so Δc_s(x) is given by the upper confidence bound of C_s(x), that is the most pessimistic estimate of the cost of querying f_s(x). Formally, Δc_s(x) = max{ε, p_s(x) + q_s(x)}, with ε a small positive threshold dealing with possibly negative values of the upper confidence bound due to the GP approximation.

Finally, as reported in (Candelieri et al., 2020), solving (11) could lead to choosing a pair (s', x') whose location x' is very close to a previously evaluated location on source s', leading to ill-conditioning of the matrix [K + λ²I] in (2) and, consequently, to the impossibility of updating μ_s(x) and σ_s(x).
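A sketch of the AGP construction (8)-(10), of the acquisition (11) and of the duplicate-location correction (12) introduced next is given below. It reuses the per-source models of the previous sketch; the random candidate set, the small eps guard on the discrepancy term and the default delta are our assumptions, not details taken from the paper:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

def fit_agp(F_models, data, m=1.0):
    # eqs. (8)-(10): augment F_1 with evaluations from sources s >= 2 whose
    # discrepancy |mu_s(x) - mu_1(x)| is below m * sigma_1(x), then fit the AGP
    X_hat, y_hat = [data[0][0]], [data[0][1]]          # F_1 is always included
    for s in range(1, len(data)):
        X_s, y_s = data[s]
        mu1, sd1 = F_models[0].predict(X_s, return_std=True)
        keep = np.abs(F_models[s].predict(X_s) - mu1) < m * sd1
        X_hat.append(X_s[keep]); y_hat.append(y_s[keep])
    X_hat, y_hat = np.vstack(X_hat), np.concatenate(y_hat)
    agp = GaussianProcessRegressor(kernel=ConstantKernel() * Matern(nu=1.5),
                                   normalize_y=True, alpha=1e-6).fit(X_hat, y_hat)
    return agp, X_hat, y_hat

def acquisition(agp, F_models, C_models, cand, s, y_best, beta=4.0, eps=1e-6):
    # eq. (11): optimistic improvement over y_hat+ from the AGP's LCB, divided
    # by the cost UCB of source s and the AGP-vs-source discrepancy (eq. 8)
    mu_hat, sd_hat = agp.predict(cand, return_std=True)
    improvement = y_best - (mu_hat - np.sqrt(beta) * sd_hat)
    p, q = C_models[s].predict(cand, return_std=True)
    cost = np.maximum(eps, p + q)                      # Delta c_s(x), risk-averse
    eta = np.abs(F_models[s].predict(cand) - mu_hat) + eps
    return improvement / (cost * eta)

def apply_correction(s_next, x_next, data, agp, cand, delta=1e-3):
    # eq. (12): if x_next nearly duplicates a location already evaluated on the
    # chosen source, query the preferred source at the max-variance candidate
    if np.min(np.linalg.norm(data[s_next][0] - x_next, axis=1)) < delta:
        _, sd_hat = agp.predict(cand, return_std=True)
        return 0, cand[int(np.argmax(sd_hat))]
    return s_next, x_next
```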
The correction proposed to avoid this undesired behaviour is the following: if ∃(x, y) ∈ F_{s'} such that ||x − x'|| < δ, then set s' ← 1 and choose x' as:

x' ← argmax_{x∈Ω} σ̂²(x)    (12)

The idea is to choose an alternative x' by "investing" the available budget on exploration, in order to improve the AGP at the next iteration by querying the most expensive source. Indeed, as stated in (Srinivas et al., 2012), selecting the location associated to the highest prediction uncertainty is a good strategy for function learning (aka function approximation), whose goal is to efficiently explore the search space so as to obtain an accurate approximation of f(x) within a limited number of queries.

4 Experiments

4.1 Hyperparameter Optimization Tasks

To evaluate MISO-wiLDCosts, we considered a core application: the HPO of an ML algorithm. This task has been widely addressed via MISO/multi-fidelity optimization, such as in (Poloczek et al., 2017; Ghoreishi and Allaire, 2019; Candelieri et al., 2020; Swersky et al., 2013; Klein et al., 2017), as well as via cost-aware optimization (Snoek et al., 2012; Paria et al., 2020; Lee et al., 2020). Three binary classification datasets have been considered, differing in size and number of features:

• SPLICE(3) – related to primate splice-junction gene sequences with associated imperfect domain theory, and used in (Lee et al., 2020). This dataset consists of 3175 instances, 60 (numeric) features plus the (binary) class label.

• SVMGUIDE(1) – related to an astroparticle application from Jan Conrad of Uppsala University, Sweden (Chih-Wei et al., 2003). This dataset consists of 7089 instances, 4 (numeric) features plus the (binary) class label.

• MAGIC GAMMA TELESCOPE – generated through a Monte Carlo simulation software, namely Corsika, described in (Heck et al., 1998), and available at https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope. The dataset has been used in (Candelieri et al., 2020) and consists of 19020 instances, 10 (numeric) features plus the (binary) class label.

All the features of the three datasets have been preliminarily scaled in [0, 1]. Two ML classifiers are considered:

• a Support Vector Machine classifier (C-SVC) with Radial Basis Function (RBF) kernel (Scholkopf and Smola, 2001). The hyperparameters to optimize are the SVC's regularization term C and the RBF kernel's γ, both varying within a range of powers of 10 (where the RBF kernel is k(a, a') = exp(−γ·||a − a'||²), with a ≠ a' two instances of the dataset);

• a Random Forest (RF) classifier (Goel and Abhilasha, 2017). The hyperparameters to optimize are the number of decision trees in the forest, n_trees, and the number of features subsampled to generate every tree, m_try, bounded between two fractions of m_feat, the number of features in the original dataset, rounded to the closest integer.

The goal is to identify, for each dataset and each ML classifier, the hyperparameter values minimizing the misclassification error (mce), computed:

• via 10-fold cross validation (mce-FCV) for the C-SVC,

• and Out-Of-Bag (mce-OOB) for the RF classifier,

while keeping the cumulated query cost as small as possible.

Figure 1 and Figure 2 show, respectively for the C-SVC and the RF classifier, the query cost and the misclassification error of 1000 hyperparameter configurations, randomly sampled via LHS, for each of the three datasets.
The blue line can be assimilated to a Pareto frontier, minimizing both mce and query cost, and it is used to make more evident the relation between the optimal mce and its cost.

First, comparing the two figures, the C-SVC's and RF's hyperparameter configurations exhibit completely different mce values and query costs. This is mainly due to both the high computational cost of training an SVM classifier (cubic in the number of instances) and the differences in the computation of mce (10-FCV for C-SVC versus OOB for RF). Moreover, the C-SVC's mce values vary within a larger range than the RF's ones. Therefore, the two classification algorithms can be considered two different representative cases. With respect to C-SVC (Figure 1), the sampled hyperparameter configurations show that (i) the optimal mce should be "cheap" – especially for the SVMGUIDE(1) dataset – but (ii) it could be difficult to reach, according to the small number of configurations around its minimum observed value, for all three datasets. On the contrary, the minimum mce observed for the RF classifier is not associated to the cheapest hyperparameter configurations in the case of the SPLICE(3) and SVMGUIDE(1) datasets (Figure 2). Therefore, a lower mce can be achieved by more expensive RF classifiers on these two datasets (even if it would not be significantly different from the average).

4.2 Information Sources

In MISO and multi-fidelity optimization for HPO, information sources are typically associated to small portions of the large original dataset. A similar idea is followed in this paper: each of the three datasets was divided into 10 stratified subsets, then re-aggregated to generate the following five sources:

• f_1(x) = f(x): mce-FCV and mce-OOB on the entire dataset, respectively for C-SVC and RF;

• f_2(x): mce-FCV and mce-OOB on 40% of the original dataset (merging the first 4 subsets);

• f_3(x): mce-FCV and mce-OOB on 30% of the original dataset (merging subsets 5 to 7);

• f_4(x): mce-FCV and mce-OOB on 20% of the original dataset (merging subsets 8 and 9);

• f_5(x): mce-FCV and mce-OOB on 10% of the original dataset (just subset 10).
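The five sources above can be derived from a single dataset with stratified splits; a minimal sketch (scikit-learn; any splitting detail beyond the stated percentages is our assumption) could be:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_sources(X, y, seed=0):
    # split the dataset into 10 stratified subsets, then re-aggregate them into
    # the portions behind f_1..f_5: 100%, 40%, 30%, 20% and 10% of the data
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    folds = [test for _, test in skf.split(X, y)]       # 10 stratified index sets
    groups = [range(10), range(0, 4), range(4, 7), range(7, 9), range(9, 10)]
    sources = []
    for g in groups:
        idx = np.concatenate([folds[i] for i in g])
        sources.append((X[idx], y[idx]))                # data used by each source
    return sources
```

Each source then evaluates a hyperparameter configuration by training the classifier on its own portion and returning mce-FCV (C-SVC) or mce-OOB (RF), together with the measured query cost.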
Figure 1: Query cost and mce-FCV of 1000 randomly sampled C-SVC hyperparameter configurations on three datasets: SPLICE(3) (left), SVMGUIDE(1) (middle) and MAGIC (right). The blue line can be assimilated to a Pareto frontier minimizing both mce and cost. In this case, the minimum mce should also be cheap.

Figure 2: Query cost and mce-OOB of 1000 randomly sampled RF hyperparameter configurations on three datasets: SPLICE(3) (left), SVMGUIDE(1) (middle) and MAGIC (right). The blue line can be assimilated to a Pareto frontier minimizing both mce and cost. For the SPLICE(3) and SVMGUIDE(1) datasets, the minimum mce is not associated to the cheapest RF hyperparameter configurations.

According to this experimental setup, f_2(x) to f_5(x) rely on subsets that do not overlap with each other. Although this could imply a hierarchical organization of the sources, we assume not to know anything about the composition and nature of the sources: they are completely black-box for the purposes of this study.

4.3 Experimental Setting

A set of 5 initial locations (i.e., hyperparameter values), sampled via Latin Hypercube Sampling (LHS) on each source, is used to initialize MISO-wiLDCosts. Ten independent runs are performed, for each dataset and each classification algorithm, to mitigate the effect of the random initialization.

At a generic iteration of MISO-wiLDCosts, all the GP models – F_s(x), C_s(x) and F̂(x) – are updated, conditioned to the function values and costs observed so far. Then, the next pair (s', x') to query is selected according to (11) and, if needed, corrected via (12). Model updating and evaluation of the selected source-location pair are iterated until fifty function evaluations have been performed (a code sketch of one such iteration is given at the end of this subsection).

As the final solution of each run, MISO-wiLDCosts returns the hyperparameter values x⁺ associated to the ŷ⁺ obtained at the end of the process, that is x⁺ : (x⁺, ŷ⁺) ∈ F̂. This requires computing, one last time, the set of the AGP's inducing locations F̂. Finally, since x⁺ could be a location queried on a source different from f_1(x), a further evaluation would be performed to replace ŷ⁺ with f_1(x⁺). Although this situation has been considered in implementing MISO-wiLDCosts, it never occurred in the experiments performed.

Since this is, to the authors' knowledge, the first paper addressing MISO and cost-aware optimization simultaneously, a comparison with other approaches is not straightforward. A reasonable choice is to compare MISO-wiLDCosts against a cost-aware BO approach performed on f_1(x) only. In this case, the state of the art is represented by CArBO (Lee et al., 2020), whose cost-cooling strategy has been adopted as a baseline. It is important to remark that CArBO also implements a procedure to obtain a cost-effective initial design, which could also be included in MISO-wiLDCosts; in this paper, just the cost-cooling of CArBO has been considered, while the initial designs are those sampled via LHS in the MISO-wiLDCosts experiments. Cost-cooling requires defining the overall budget in terms of maximum cumulated query cost. We defined it to approximately cover the fifty function evaluations performed by MISO-wiLDCosts, leading to:

• HPO of C-SVC: 1 hour for SPLICE(3) and SVMGUIDE(1), and 3 hours for MAGIC;
• HPO of RF: 15 minutes for SPLICE(3), 10 minutes for SVMGUIDE(1), and 1 hour and a half for MAGIC.

For a fair comparison, the CArBO cost-cooling optimization process was also stopped at fifty evaluations, even if some residual budget was still available.
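Putting the pieces together, a single iteration of the loop described in this subsection could look as follows (a sketch reusing the acquisition and apply_correction helpers from the Sec. 3.3 sketch; scoring a shared random candidate set on every source is our simplification of the inner optimization of (11), which the paper does not detail):

```python
import numpy as np

def select_next(agp, F_models, C_models, data, bounds, y_best, rng):
    # score random candidates on every source with eq. (11), pick the best
    # source-location pair, then apply the correction of eq. (12)
    cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(2048, bounds.shape[0]))
    scores = [acquisition(agp, F_models, C_models, cand, s, y_best)
              for s in range(len(F_models))]
    s_next = int(np.argmax([sc.max() for sc in scores]))
    x_next = cand[int(np.argmax(scores[s_next]))]
    return apply_correction(s_next, x_next, data, agp, cand)
```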
MISO-wiLDCosts is developed in R. Experiments were run on a Microsoft Azure virtual machine, H8 (High Performance Computing family) Standard, with 8 vCPUs, 56 GB of memory, and Ubuntu 16.04.6 LTS.
5 Results

The main results of our experiments are summarized in Table 1, where the suffixes "wild" and "cooling" are used to distinguish between MISO-wiLDCosts and CArBO cost-cooling, respectively. Unsurprisingly, the cumulated query cost is significantly lower for MISO-wiLDCosts, according to Wilcoxon's non-parametric test for paired samples. With respect to the misclassification error:

• with respect to C-SVC, mce-wild is higher than mce-cooling for SPLICE(3) and SVMGUIDE(1) (p-value = 0.01);

• mce-wild is equal to mce-cooling in the case of HPO of C-SVC on the MAGIC dataset;

• as far as HPO of RF is concerned, mce-wild and mce-cooling are basically the same, but MISO-wiLDCosts used less than 40% of the cumulated cost required by CArBO cost-cooling, and even less than 20% for the SPLICE(3) and SVMGUIDE(1) datasets.

These differences are more evident in Table 2, where "delta mce" is the difference between mce-wild and mce-cooling, and "%cost" is given by 100 × cost-wild / cost-cooling. However, it is important to remark that our results cannot be considered a full comparative analysis with CArBO, because our implementation does not include the cost-effective initial design proposed in CArBO.

Finally, just for illustrative purposes, Figure 3 shows the query cost incurred at each iteration of MISO-wiLDCosts and CArBO cost-cooling, separately (solid vs dashed lines): lines and shaded areas are, respectively, the mean and standard deviation computed over the 10 independent runs. For reasons of space and ease of viewing, the figure refers to the HPO of RF, but the behaviour is analogous for the HPO of C-SVC. It is important to remark that just five initial hyperparameter configurations are used to initialize CArBO cost-cooling, while five for each source (i.e., twenty-five overall) are used to initialize MISO-wiLDCosts.

After the initial design, MISO-wiLDCosts mainly used "cheap" sources in the case of the SPLICE(3) and SVMGUIDE(1) datasets, while it used all the sources, including f_1(x), in the case of the MAGIC dataset. This means that, in the case of SPLICE(3) and SVMGUIDE(1): (i) the queried sources were not so discrepant from f_1(x) and thus contributed in providing inducing locations for the AGP and/or (ii) early convergence towards a specific source-location pair (s', x') did not occur, so that correction (12) was not (frequently) applied.
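For reference, the per-run aggregation behind Tables 1 and 2 can be sketched as follows (assuming one final mce and one cumulated cost per run; the exact significance thresholds reported by the authors are not re-derived here):

```python
import numpy as np
from scipy.stats import wilcoxon

def summarize(mce_wild, mce_cool, cost_wild, cost_cool):
    # "delta mce" and "%cost" as defined above, plus Wilcoxon's non-parametric
    # test for paired samples on the cumulated query costs of the paired runs
    delta_mce = np.asarray(mce_wild) - np.asarray(mce_cool)
    pct_cost = 100.0 * np.asarray(cost_wild) / np.asarray(cost_cool)
    _, p_value = wilcoxon(cost_wild, cost_cool)
    return ((delta_mce.mean(), delta_mce.std()),
            (pct_cost.mean(), pct_cost.std()), p_value)
```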
Table 1: HPO results on three classification datasets and two ML classification algorithms: mce and cumulated query cost over the 10 runs of MISO-wiLDCosts and CArBO cost-cooling.
Classifier | Dataset | mce-wild | mce-cooling | cost-wild [secs] | cost-cooling [secs]
C-SVC | SPLICE(3) | 0.··· ± 0.062 | 0.··· ± 0.002 | 744.··· ± ···.760 | 3646.··· ± ···
C-SVC | SVMGUIDE(1) | 0.··· ± 0.014 | 0.··· ± 0.000 | 572.··· ± ···.167 | 3263.··· ± ···
C-SVC | MAGIC | 0.··· ± 0.000 | 0.··· ± 0.000 | 11050.··· ± ···.669 | 11575.··· ± ···
RF | SPLICE(3) | 0.··· ± 0.003 | 0.··· ± 0.000 | 78.··· ± ···.375 | 456.··· ± ···
RF | SVMGUIDE(1) | 0.··· ± 0.003 | 0.··· ± 0.000 | 25.··· ± ···.490 | 133.··· ± ···
RF | MAGIC | 0.··· ± 0.000 | 0.··· ± 0.000 | 492.··· ± ···.704 | 1360.··· ± ···

Table 2: Differences between MISO-wiLDCosts and CArBO cost-cooling: "delta mce" is mce-wild − mce-cooling and "%cost" is 100 × cost-wild / cost-cooling.

Classifier | Dataset | delta mce | %cost
C-SVC | SPLICE(3) | 0.··· ± 0.063 | 20.··· ± ···
C-SVC | SVMGUIDE(1) | 0.··· ± 0.014 | 17.··· ± ···
C-SVC | MAGIC | 0.··· ± 0.000 | 95.··· ± ···
RF | SPLICE(3) | 0.··· ± 0.003 | 17.··· ± ···
RF | SVMGUIDE(1) | −0.··· ± 0.003 | 19.··· ± ···
RF | MAGIC | 0.··· ± 0.000 | 38.··· ± ···

(In both tables, digits marked "·" are not recoverable from the source file.)

6 Conclusions

This paper shows that unifying MISO and cost-aware BO within a single framework can be accomplished while obtaining good numerical performance. MISO-wiLDCosts is the first MISO approach that is also aware of location-dependent costs within each source. Its practical value emerges from the HPO experiments: MISO-wiLDCosts yields the same classification error at a significantly lower cumulated query cost.
Acknowledgements
We gratefully acknowledge the DEMS Data Science Lab, Department of Economics, Management and Statistics (DEMS), University of Milano-Bicocca, for supporting this work by providing computational resources.
References
F. Archetti, and A. Candelieri (2019). Bayesian Optimization and Data Science. Springer International Publishing.

A. Candelieri, R. Perego, and F. Archetti (2018). Bayesian optimization of pump operations in water distribution systems. Journal of Global Optimization, 71(1), 213–235.

A. Candelieri, and F. Archetti (2019). Global optimization in machine learning: the design of a predictive analytics application. Soft Computing, 23(9), 2969–2977.

A. Candelieri, R. Perego, and F. Archetti (2020). Green machine learning via augmented Gaussian processes and multi-information source optimization. arXiv preprint arXiv:2006.14233.

A. Chaudhuri, A.N. Marques, R. Lam, and K.E. Willcox (2019). Reusing information for multifidelity active learning in reliability-based design optimization. AIAA Scitech 2019 Forum.

H. Chih-Wei, C. Chih-Chung, and L. Chih-Jen (2003). A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University.

T. Elsken, J.H. Metzen, and F. Hutter (2019). Neural architecture search. In Automated Machine Learning. Springer.

P.I. Frazier (2018). Bayesian optimization. In Recent Advances in Optimization and Modeling of Contemporary Problems, INFORMS, 255–278.

R.B. Gramacy (2020). Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences. CRC Press.

S.F. Ghoreishi, and D. Allaire (2019). Multi-information source constrained Bayesian optimization. Structural and Multidisciplinary Optimization, 59(3), 977–991.

E. Goel, and E. Abhilasha (2017). Random forest: A review. International Journal of Advanced Research in Computer Science and Software Engineering.

D. Heck, G. Schatz, J. Knapp, T. Thouw, and J. Capdevielle (1998). CORSIKA: A Monte Carlo code to simulate extensive air showers. Technical report.

F. Hutter, L. Kotthoff, and J. Vanschoren (2019). Automated Machine Learning: Methods, Systems, Challenges. Springer Nature.

K. Kandasamy, G. Dasarathy, J. Oliva, J. Schneider, and B. Poczos (2019). Multi-fidelity Gaussian process bandit optimisation. Journal of Artificial Intelligence Research, 66, 151–196.

M.C. Kennedy, and A. O'Hagan (2000). Predicting the output from a complex computer code when fast approximations are available. Biometrika, 87(1), 1–13.

A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter (2017). Fast Bayesian optimization of machine learning hyperparameters on large datasets. Artificial Intelligence and Statistics, 528–536.

R. Lam, D.L. Allaire, and K.E. Willcox (2015). Multifidelity optimization using statistical surrogate modeling for non-hierarchical information sources. 56th AIAA/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference.

E.H. Lee, V. Perrone, C. Archambeau, and M. Seeger (2020). Cost-aware Bayesian optimization.

A. March, and K. Willcox (2012). Provably convergent multifidelity optimization algorithm not requiring high-fidelity derivatives. AIAA Journal, 50(5), 1079–1089.

A. Marques, R. Lam, and K. Willcox (2018). Contour location via entropy reduction leveraging multiple information sources. Advances in Neural Information Processing Systems.

B. Paria, et al. (2020). Cost-aware Bayesian optimization via information directed sampling. ICML 2020 Workshop on Real World Experiment Design and Active Learning.

B. Peherstorfer, B. Kramer, and K. Willcox (2017). Combining multiple surrogate models to accelerate failure probability estimation with expensive high-fidelity models. Journal of Computational Physics, 341, 61–75.

M. Poloczek, J. Wang, and P. Frazier (2017). Multi-information source optimization. Advances in Neural Information Processing Systems, 4288–4298.

D. Russo, and B. Van Roy (2014). Learning to optimize via information-directed sampling. Advances in Neural Information Processing Systems, 1583–1591.

B. Scholkopf, and A.J. Smola (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

R. Sen, K. Kandasamy, and S. Shakkottai (2018). Multi-fidelity black-box optimization with hierarchical partitions. International Conference on Machine Learning.

D. Sha, et al. (2020). Applying Bayesian optimization for calibration of transportation simulation models. Transportation Research Record.

B. Shahriari, K. Swersky, Z. Wang, R.P. Adams, and N. De Freitas (2015). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), 148–175.

J. Snoek, H. Larochelle, and R.P. Adams (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 2951–2959.

J. Song, Y. Chen, and Y. Yue (2019). A general framework for multi-fidelity Bayesian optimization with Gaussian processes. Artificial Intelligence and Statistics, 3158–3167.

N. Srinivas, A. Krause, S.M. Kakade, and M.W. Seeger (2012). Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5), 3250–3265.

E. Strubell, A. Ganesh, and A. McCallum (2020). Energy and policy considerations for modern deep learning research. AAAI, 13693–13696.

K. Swersky, J. Snoek, and R.P. Adams (2013). Multi-task Bayesian optimization. Advances in Neural Information Processing Systems, 2004–2012.

C.K. Williams, and C.E. Rasmussen (2006). Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.

R.L. Winkler (1981). Combining probability distributions from dependent information sources. Management Science, 27(4), 479–488.