MISO-wiLDCosts: Multi Information Source Optimization with Location Dependent Costs
Antonio Candelieri, Francesco Archetti
University of Milano-Bicocca
Abstract
This paper addresses black-box optimization over multiple information sources whose fidelity and query cost both change over the search space, that is, they are location dependent. The approach uses: (i) an Augmented Gaussian Process, recently proposed in multi-information source optimization as a single model of the objective function over the search space and the sources, and (ii) a Gaussian Process to model the location-dependent cost of each source. The former is used within a Confidence Bound based acquisition function to select the next source and location to query, while the latter penalizes the value of the acquisition depending on the expected query cost of any source-location pair. The proposed approach is evaluated on a set of Hyperparameters Optimization tasks, consisting of two Machine Learning classifiers and three datasets of different sizes.

1 Introduction

Bayesian Optimization (BO) (Shahriari et al., 2015; Frazier, 2018; Archetti and Candelieri, 2019) is a sample efficient strategy for the global optimization of black-box, expensive and multi-extremal functions. The reference global optimization problem is:

argmin_{x ∈ Ω ⊂ ℝ^d} f(x)    (1)

where Ω is the d-dimensional search space.

BO consists of two key components: a probabilistic surrogate model of f(x), fitted on the function evaluations performed so far, and an acquisition function providing the next location to query, while balancing between exploitation and exploration, depending on the predictions, and associated uncertainty, of the probabilistic surrogate model. Updating the model conditioned on the observations and selecting the next location to query are sequentially iterated until some termination criterion is met, usually a maximum number of function evaluations (a minimal code sketch of this loop is given at the end of this section).

Thanks to its sample efficiency, compared to competing methods, BO is currently the standard technique for Automated Machine Learning (AutoML) (Hutter et al., 2019; Candelieri and Archetti, 2019) and is also successfully applied in Neural Architecture Search (NAS) (Elsken et al., 2019). Other relevant applications of BO concern simulation-optimization (Sha et al., 2020) and the control of complex systems (Candelieri et al., 2018). However, the basic BO algorithm does not directly incorporate and use any information about query cost, which is instead crucial in many real-life applications. An example comes precisely from AutoML and NAS: (Strubell et al., 2020) provides an estimation of the financial and environmental costs of optimizing the hyperparameters of deep neural networks in the domain of Natural Language Processing. The astonishing amount of energy required can generate an amount of CO2 emissions around five times that generated by a car during its entire lifetime.

Therefore, recent research studies have been proposing innovative BO-based approaches designed for more challenging settings, where information about query cost is explicitly used. Research can basically be split into two branches: (i) multi-information source optimization (MISO), characterized by the availability of cheap approximations (i.e., sources) of the more expensive f(x), and (ii) cost-aware (Bayesian) optimization, assuming a cost to query f(x) which changes over the search space.
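To make these two components concrete, the following is a minimal, self-contained sketch of the generic BO loop just described. It is illustrative Python only; the names and the simple LCB-over-random-candidates acquisition are ours, not a specific implementation from the literature:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def lcb(gp, X_cand, beta=4.0):
    # lower confidence bound: optimistic estimate of f for minimization
    mu, sd = gp.predict(X_cand, return_std=True)
    return mu - np.sqrt(beta) * sd

def bo_minimize(f, bounds, n_init=5, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    X = rng.uniform(lo, hi, size=(n_init, len(lo)))      # initial design
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        # probabilistic surrogate model, refitted on all evaluations so far
        gp = GaussianProcessRegressor(kernel=Matern(nu=1.5), normalize_y=True,
                                      alpha=1e-6).fit(X, y)
        cand = rng.uniform(lo, hi, size=(2048, len(lo)))  # random candidates in Omega
        x_next = cand[np.argmin(lcb(gp, cand))]           # acquisition: minimize LCB
        X, y = np.vstack([X, x_next]), np.append(y, f(x_next))
    best = int(np.argmin(y))
    return X[best], y[best]

# usage: minimize a toy quadratic over Omega = [-5, 5]^2
x_best, y_best = bo_minimize(lambda x: float(np.sum(x**2)),
                             np.array([[-5.0, 5.0], [-5.0, 5.0]]))
```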
2 Related Work

As far as MISO is concerned, problem (1) has to be solved by efficiently using S information sources, f_1(x), ..., f_S(x), providing approximations, with different accuracy, of f(x), each one at its own query cost c_s, with s = 1, ..., S. Thus accuracy is location dependent, but cost is not. Efficiently solving (1) means that the query cost cumulated along the optimization process must be kept as low as possible. Sources can always be sorted by decreasing query cost, so that f_1(x) is the most expensive one, usually with f_1(x) = f(x) in the case that f(x) can be directly queried.

When information sources come with explicit information about their approximation quality (aka fidelity), MISO specializes into multi-fidelity optimization, first proposed in (Kennedy and O'Hagan, 2000) and more recently addressed in (Peherstorfer et al., 2017; Sen et al., 2018; Marques et al., 2018; Chaudhuri et al., 2019; Kandasamy et al., 2019; Song et al., 2019). Most multi-fidelity approaches exploit hierarchical relations among sources, based on their fidelities, even if some drawbacks were already highlighted in (March and Willcox, 2012). More precisely, a hierarchical organization requires the assumption that information sources are unbiased, meaning that noise must be independent across sources (Lam et al., 2015; Poloczek et al., 2017). Moreover, querying a source with a certain fidelity at a given location x implies that no further knowledge can be obtained by querying any other source of lower fidelity, at any location. Consequently, most multi-fidelity approaches cannot be applied in the case that sources cannot be hierarchically organized.

In (Lam et al., 2015) the first approach for non-hierarchical information sources was proposed, addressing location-dependent fidelities of the sources and defining the more general MISO setting. More recently, (Poloczek et al., 2017; Ghoreishi and Allaire, 2019) have provided improvements over (Lam et al., 2015). All these methods are based on the idea of using a separate model for each information source (i.e., a Gaussian Process - GP) and then fusing their predictions and related uncertainties through the method proposed in (Winkler, 1981), which became the standard practice for the fusion of normally distributed data. In (Candelieri et al., 2020) a different procedure has recently been proposed, where, instead of fusing GPs, sparsification is used to create the so-called Augmented Gaussian Process (AGP). We use this model – summarized in Sec. 3.2 – and extend MISO-AGP to deal with location-dependent costs.

In AutoML, MISO and multi-fidelity were first considered in (Swersky et al., 2013), proposing an approach able to use small datasets to quickly optimize the hyperparameters of a Machine Learning (ML) algorithm on a large dataset. More recently, (Klein et al., 2017) proposed FABOLAS (FAst Bayesian Optimization on LArge dataSets), a hyperparameters optimization (HPO) tool that simultaneously optimizes the hyperparameter values of an ML algorithm and the size of the dataset portion to consider. Also in (Candelieri et al., 2020) an HPO task is considered, with two information sources related to a large dataset and a small portion of it, respectively. All these approaches proved more cost-efficient than using BO for HPO performed on the large dataset only, without significant degradation of the final ML model's accuracy.
Although the quoted MISO and multi-fidelity approaches generalize from sample efficiency to cost efficiency – thus, they are "aware of costs" – they assume that the query cost of any source is constant over the search space. However, this is not true in practice, even using a single source, as empirically demonstrated in (Lee et al., 2020) through the distribution of the time required to evaluate a set of 5000 randomly selected hyperparameter configurations, for five common ML algorithms. This is also confirmed in this paper (Sect. 4.1), relative to 1000 hyperparameter configurations for two ML algorithms on three datasets.

The seminal work aimed at making BO cost-aware is (Snoek et al., 2012), which proposes to penalize the acquisition function, specifically Expected Improvement (EI), by the location-dependent cost c(x), leading to the EI-per-unit of cost: EI_pu(x) = EI(x)/c(x). However, EI_pu(x) is basically driven by the query cost, biased towards cheap locations and, therefore, it performs well only when optima are relatively cheap. To overcome this undesired behaviour, (Lee et al., 2020) proposes CArBO (Cost Apportioned BO), consisting of two consecutive stages: (i) a cost-effective selection of initial locations to query (i.e., cost-effective initial design) and (ii) a cost-cooling strategy where the penalty associated to the cost in EI_pu(x) is modulated according to the query cost incurred so far. More precisely:

EI-cool(x) = EI(x) / c(x)^α

with α the cooling factor defined as α = (τ − τ_n)/(τ − τ_init), where τ is the overall "budget" (i.e., maximum cumulated query cost), τ_n the cost cumulated up to the current iteration n, and τ_init the cost of the initial design. Finally, in CArBO, c(x) is modelled through a warped GP.

Another recent paper (Paria et al., 2020) has proposed an approach whose cost-aware acquisition function is based on Information Directed Sampling (IDS) (Russo and Van Roy, 2014), a principled mechanism to balance regret and information gain. The proposed cost-aware acquisition function, namely CostIDS, balances cost along with regret and information gain. However, an additional constraint is introduced in optimizing CostIDS, to avoid extremely cheap points being chosen repeatedly without any significant increase in information.
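The cost-aware acquisitions recalled above are compact enough to state in code. Below is a small sketch of EI for minimization together with the EI_pu and EI-cool penalizations; it follows the formulas exactly as quoted here, and the numbers in the usage lines are made up for illustration:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sd, y_best):
    # standard EI for minimization, from the GP posterior (mu, sd) at a location x
    z = (y_best - mu) / sd
    return (y_best - mu) * norm.cdf(z) + sd * norm.pdf(z)

def ei_per_unit_cost(ei, cost):
    # (Snoek et al., 2012): biased towards cheap locations
    return ei / cost

def ei_cool(ei, cost, tau, tau_n, tau_init):
    # CArBO's cost-cooling (Lee et al., 2020): alpha decays from 1 to 0 as the
    # budget tau is consumed, so the cost penalty fades along the run
    alpha = (tau - tau_n) / (tau - tau_init)
    return ei / cost**alpha

ei = expected_improvement(mu=0.10, sd=0.05, y_best=0.12)
print(ei_cool(ei, cost=20.0, tau=3600.0, tau_n=400.0, tau_init=100.0))   # early: ~EI_pu
print(ei_cool(ei, cost=20.0, tau=3600.0, tau_n=3500.0, tau_init=100.0))  # late: ~plain EI
```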
To the authors' knowledge, this is the first paper addressing a multi-information source setting in which both the fidelity and the query cost of the sources are black-box and location-dependent. Thus, we named our approach MISO-wiLDCosts: MISO with Location-Dependent Costs. Its goal is to solve (1) while keeping the cumulated query cost as small as possible, given a set of S information sources, {f_s(x)}_{s=1:S}, whose approximation quality and query costs, c_s(x), are black-box and location dependent. We provide an empirical evaluation of MISO-wiLDCosts on an AutoML task: the HPO of two ML classifiers on three datasets of different sizes.

3 MISO-wiLDCosts

3.1 Gaussian Process Regression

GP modelling (Williams and Rasmussen, 2006; Gramacy, 2020) is a non-parametric kernel-based learning method for probabilistic regression and classification. A GP regression model is a random function f : Ω → ℝ with output drawn from a multivariate normal distribution, formally f(x) ∼ N(μ(x), σ²(x)). In BO, GP regression is usually adopted as the probabilistic surrogate model, fitted on the n function evaluations performed so far. Let X_{1:n} = {x^(1), ..., x^(n)} and y = {y^(1), ..., y^(n)} denote, respectively, the n locations queried and the associated observed values; then fitting the GP means computing the posterior mean and variance of the multivariate normal distribution as follows:

μ(x) = k(x, X_{1:n}) [K + λ²I]⁻¹ y
σ²(x) = k(x, x) − k(x, X_{1:n}) [K + λ²I]⁻¹ k(X_{1:n}, x)    (2)

where λ² is the variance of a zero-mean Gaussian noise in the case of noisy observations (i.e., y^(i) = f(x^(i)) + ε, with ε ∼ N(0, λ²)), and K ∈ ℝ^{n×n}, with K_ij = k(x_i, x_j), where k is a kernel function modelling the covariance of the GP. Finally, k(x, X_{1:n}) is the vector whose i-th component is given by k(x, x_i) (for completeness, k(X_{1:n}, x) = k(x, X_{1:n})ᵀ).

The most widely adopted kernels are the Squared Exponential (aka Gaussian), Matérn, Power Exponential and Exponential (aka Laplacian). Each kernel implies a different prior on the structural properties of the sample paths of the latent function under the GP, such as differentiability. Moreover, each kernel has its own hyperparameters to adjust the GP's posterior depending on the observations; the GP's hyperparameters are usually tuned via Maximum Log-likelihood Estimation (MLE) or Maximum A Posteriori estimation (MAP). In this paper, the Matérn 3/2 kernel is used:

k_{3/2}(x, x') = σ_k² (1 + √3·r/ℓ) exp(−√3·r/ℓ)    (3)

with r = ||x − x'||. The kernel's hyperparameters σ_k² and ℓ, namely the kernel amplitude and the characteristic length scale, are estimated via MLE.

Let y_s denote a function evaluation performed on source s, where y_s = f_s(x) + ε_s and ε_s is a zero-mean Gaussian noise associated to that source, formally ε_s ∼ N(0, λ_s²). Let z_s denote the (black-box) query cost paid to observe y_s, that is z_s = c_s(x) + δ_s, with δ_s ∼ N(0, ζ_s²). It is important to remark that, by definition of query cost, y_s and z_s are not decoupled, in the sense that y_s cannot be observed without paying z_s, and vice-versa. Let D_s = {(x^(i), y_s^(i), z_s^(i))}_{i=1,...,n_s} denote the dataset collecting all the relevant information along the optimization process, with n_s the number of function evaluations performed on source s.
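As a concrete illustration of (2) and (3), the following minimal sketch computes the GP posterior with a Matérn 3/2 kernel. It is illustrative Python (the paper's implementation is in R), with kernel hyperparameters fixed for brevity rather than estimated via MLE:

```python
import numpy as np

def matern32(A, B, amplitude=1.0, lengthscale=1.0):
    # Matern 3/2 kernel, eq. (3): sigma_k^2 (1 + sqrt(3) r / l) exp(-sqrt(3) r / l)
    r = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    s = np.sqrt(3.0) * r / lengthscale
    return amplitude * (1.0 + s) * np.exp(-s)

def gp_posterior(x, X, y, lam=1e-3, **kern):
    # posterior mean and variance of eq. (2), with noise variance lam^2
    K = matern32(X, X, **kern) + lam**2 * np.eye(len(X))
    k_star = matern32(x[None, :], X, **kern)              # k(x, X_{1:n})
    mu = k_star @ np.linalg.solve(K, y)                   # k [K + lam^2 I]^(-1) y
    var = matern32(x[None, :], x[None, :], **kern) \
          - k_star @ np.linalg.solve(K, k_star.T)
    return mu.item(), var.item()

# usage on a 1-d toy problem
X = np.linspace(0.0, 1.0, 8)[:, None]
y = np.sin(6.0 * X[:, 0])
print(gp_posterior(np.array([0.5]), X, y, lengthscale=0.2))
```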
The following two projections of D_s are considered:

F_s = {(x^(i), y_s^(i))}_{i=1,...,n_s}    (4)

namely the function evaluations dataset, storing the locations queried and the function values observed, and

C_s = {(x^(i), z_s^(i))}_{i=1,...,n_s}    (5)

namely the query costs dataset, storing the locations queried and the query costs paid.

The two sets, F_s and C_s, are used to fit two GPs, namely F_s(x) and C_s(x), according to (2), modelling:

f_s(x) ∼ N(μ_s(x), σ_s²(x))    (6)

and

c_s(x) ∼ N(p_s(x), q_s²(x))    (7)

with s = 1, ..., S, and where p_s(x) and q_s(x) are again a mean and a standard deviation; different symbols are used to distinguish them from μ_s(x) and σ_s(x).
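In code, fitting the per-source models (6) and (7) from the two projections (4) and (5) can be sketched as follows. This is our own illustrative scikit-learn version, with an assumed data layout D[s] = (X_s, y_s, z_s); the paper's implementation is in R:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

def fit_source_models(D):
    # D[s] = (X_s, y_s, z_s): locations, function values and query costs of
    # source s, i.e. the two projections F_s (4) and C_s (5) of D_s
    kernel = ConstantKernel() * Matern(nu=1.5)   # amplitude * Matern 3/2, eq. (3)
    F, C = [], []
    for X_s, y_s, z_s in D:
        F.append(GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                          alpha=1e-6).fit(X_s, y_s))  # f_s(x), eq. (6)
        C.append(GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                          alpha=1e-6).fit(X_s, z_s))  # c_s(x), eq. (7)
    return F, C
```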
3.2 The Augmented Gaussian Process

An Augmented Gaussian Process (AGP) (Candelieri et al., 2020) is aimed at generating a single model over all the f_s(x), based on the simplified model discrepancy measure, which quantifies the difference between the GPs approximating two different sources:

η(F_s, F_s̄, x) = |μ_s(x) − μ_s̄(x)|    (8)

with s, s̄ = 1, ..., S and s ≠ s̄.

In MISO-wiLDCosts, we assume that f_1(x) is the preferred source because, for instance, it coincides with f(x) or it is supposed to provide the most accurate approximation of it. Fitting the AGP requires augmenting the set of function evaluations performed on the preferred source (i.e., the set F_1) with function evaluations performed on the other sources whose function value is sufficiently close to the prediction provided by F_1(x). More formally, the set of augmenting locations, namely F̄, is given by:

F̄ = {(x, y_s) ∈ F_s, s = 2, ..., S : η(F_s, F_1, x) < m·σ_1(x)}    (9)

where η(F_s, F_1, x) is computed according to (8) and m is a technical parameter, usually set to m = 1, meaning that only evaluations falling within μ_1(x) ± σ_1(x) are included in F̄.

The final set of inducing locations used to fit the AGP is given by F̂ = F_1 ∪ F̄, and the resulting AGP is denoted by F̂(x), such that:

f(x) ∼ N(μ̂(x), σ̂²(x))    (10)

conditioned to F̂ according to (2).

3.3 The Acquisition Function

To select the next source and location to query, we started from the acquisition function proposed in (Candelieri et al., 2020) on top of the AGP model. However, we have significantly modified it in order to consider the location-dependent query cost of each source. The resulting acquisition function is based on the well-known Lower Confidence Bound (LCB), whose convergence proof, under an appropriate scheduling of its technical parameter β_t, is given in (Srinivas et al., 2012). The acquisition function proposed in this paper is defined as follows:

(s', x') = argmax_{s=1,...,S; x∈Ω⊂ℝ^d} [ŷ⁺ − (μ̂(x) − √β_t·σ̂(x))] / [Δc_s(x)·η(F̂, F_s, x)]    (11)

where ŷ⁺ is the lowest function value within the inducing locations set F̂, that is ŷ⁺ = min_{(x,y)∈F̂} {y}, and Δc_s(x) is an estimate of the query cost for source s.

Therefore, the numerator of (11) is the most optimistic improvement with respect to ŷ⁺, depending on the AGP's LCB, while the denominator consists of two source-and-location-dependent penalization terms: Δc_s(x) and the model discrepancy between the AGP F̂(x) and the GP F_s(x) modelling source s.

The location-dependent query cost is modelled by the GP C_s(x), and a risk-averse attitude is adopted, so Δc_s(x) is given by the upper confidence bound of C_s(x), that is the most pessimistic estimate of the cost of querying f_s(x). Formally, Δc_s(x) = max{ε, p_s(x) + q_s(x)}, with ε a small positive threshold dealing with possibly negative values of the upper confidence bound due to the GP approximation.

Finally, as reported in (Candelieri et al., 2020), solving (11) could lead to choosing a pair (s', x') whose location x' is very close to a previously evaluated location on source s', leading to ill-conditioning of the matrix [K + λ²I] in (2) and, consequently, to the impossibility of updating μ_s(x) and σ_s(x).
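A sketch of the AGP construction (8)-(10), of the acquisition (11) and of the duplicate-location correction (12) introduced next is given below. It reuses the per-source models of the previous sketch; the random candidate set, the small eps guard on the discrepancy term and the default delta are our assumptions, not details taken from the paper:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

def fit_agp(F_models, data, m=1.0):
    # eqs. (8)-(10): augment F_1 with evaluations from sources s >= 2 whose
    # discrepancy |mu_s(x) - mu_1(x)| is below m * sigma_1(x), then fit the AGP
    X_hat, y_hat = [data[0][0]], [data[0][1]]          # F_1 is always included
    for s in range(1, len(data)):
        X_s, y_s = data[s]
        mu1, sd1 = F_models[0].predict(X_s, return_std=True)
        keep = np.abs(F_models[s].predict(X_s) - mu1) < m * sd1
        X_hat.append(X_s[keep]); y_hat.append(y_s[keep])
    X_hat, y_hat = np.vstack(X_hat), np.concatenate(y_hat)
    agp = GaussianProcessRegressor(kernel=ConstantKernel() * Matern(nu=1.5),
                                   normalize_y=True, alpha=1e-6).fit(X_hat, y_hat)
    return agp, X_hat, y_hat

def acquisition(agp, F_models, C_models, cand, s, y_best, beta=4.0, eps=1e-6):
    # eq. (11): optimistic improvement over y_hat+ from the AGP's LCB, divided
    # by the cost UCB of source s and the AGP-vs-source discrepancy (eq. 8)
    mu_hat, sd_hat = agp.predict(cand, return_std=True)
    improvement = y_best - (mu_hat - np.sqrt(beta) * sd_hat)
    p, q = C_models[s].predict(cand, return_std=True)
    cost = np.maximum(eps, p + q)                      # Delta c_s(x), risk-averse
    eta = np.abs(F_models[s].predict(cand) - mu_hat) + eps
    return improvement / (cost * eta)

def apply_correction(s_next, x_next, data, agp, cand, delta=1e-3):
    # eq. (12): if x_next nearly duplicates a location already evaluated on the
    # chosen source, query the preferred source at the max-variance candidate
    if np.min(np.linalg.norm(data[s_next][0] - x_next, axis=1)) < delta:
        _, sd_hat = agp.predict(cand, return_std=True)
        return 0, cand[int(np.argmax(sd_hat))]
    return s_next, x_next
```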
The correction proposed to avoid this undesired behaviour is the following: if ∃(x, y) ∈ F_{s'} such that ||x − x'|| < δ, then set s' ← 1 and choose x' as:

x' ← argmax_{x∈Ω} σ̂²(x)    (12)

The idea is to choose an alternative x' by "investing" the available budget on exploration, in order to improve the AGP at the next iteration by querying the most expensive source. Indeed, as stated in (Srinivas et al., 2012), selecting the location associated to the highest prediction uncertainty is a good strategy for function learning (aka function approximation), whose goal is to efficiently explore the search space so as to obtain an accurate approximation of f(x) within a limited number of queries.

4 Experiments

4.1 Hyperparameter Optimization Tasks

To evaluate MISO-wiLDCosts, we considered a core application: the HPO of an ML algorithm. This task has been widely addressed via MISO/multi-fidelity optimization, such as in (Poloczek et al., 2017; Ghoreishi and Allaire, 2019; Candelieri et al., 2020; Swersky et al., 2013; Klein et al., 2017), as well as via cost-aware optimization (Snoek et al., 2012; Paria et al., 2020; Lee et al., 2020). Three binary classification datasets have been considered, differing in size and number of features:

• SPLICE(3) – related to primate splice-junction gene sequences with associated imperfect domain theory, and used in (Lee et al., 2020). This dataset consists of 3175 instances, 60 (numeric) features plus the (binary) class label.

• SVMGUIDE(1) – related to an astroparticle application from Jan Conrad of Uppsala University, Sweden (Chih-Wei et al., 2003). This dataset consists of 7089 instances, 4 (numeric) features plus the (binary) class label.

• MAGIC GAMMA TELESCOPE – generated through a Monte Carlo simulation software, namely Corsika, described in (Heck et al., 1998), and available at https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope. The dataset has been used in (Candelieri et al., 2020) and consists of 19020 instances, 10 (numeric) features plus the (binary) class label.

All the features of the three datasets have been preliminarily scaled in [0, 1]. Two ML classifiers are considered:

• a Support Vector Machine classifier (C-SVC) with Radial Basis Function (RBF) kernel (Scholkopf and Smola, 2001). The hyperparameters to optimize are the SVC's regularization term C and the RBF kernel's γ, both varying within a range of powers of 10 (where the RBF kernel is k(a, a') = exp(−γ·||a − a'||²), with a ≠ a' two instances of the dataset);

• a Random Forest (RF) classifier (Goel and Abhilasha, 2017). The hyperparameters to optimize are the number of decision trees in the forest, n_trees, and the number of features subsampled to generate every tree, m_try, bounded between two fractions of m_feat, the number of features in the original dataset, rounded to the closest integer.

The goal is to identify, for each dataset and each ML classifier, the hyperparameter values minimizing the misclassification error (mce), computed:

• via 10-fold cross validation (mce-FCV) for the C-SVC,

• and Out-Of-Bag (mce-OOB) for the RF classifier,

while keeping the cumulated query cost as small as possible.

Figure 1 and Figure 2 show, respectively for the C-SVC and the RF classifier, the query cost and the misclassification error of 1000 hyperparameter configurations, randomly sampled via LHS, for each of the three datasets.
The blue line can be assimilated to a Pareto frontier, minimizing both mce and query cost, and it is used to make more evident the relation between the optimal mce and its cost.

First, comparing the two figures, the C-SVC's and RF's hyperparameter configurations exhibit completely different mce values and query costs. This is mainly due to both the high computational cost of training an SVM classifier (cubic in the number of instances) and the differences in the computation of mce (10-FCV for C-SVC versus OOB for RF). Moreover, the C-SVC's mce values vary within a larger range than the RF's ones. Therefore, the two classification algorithms can be considered two different representative cases. With respect to C-SVC (Figure 1), the sampled hyperparameter configurations show that (i) the optimal mce should be "cheap" – especially for the SVMGUIDE(1) dataset – but (ii) it could be difficult to reach, according to the small number of configurations around its minimum observed value, for all three datasets. On the contrary, the minimum mce observed for the RF classifier is not associated to the cheapest hyperparameter configurations in the case of the SPLICE(3) and SVMGUIDE(1) datasets (Figure 2). Therefore, a lower mce can be achieved by more expensive RF classifiers on these two datasets (even if it would not be significantly different from the average).

4.2 Information Sources

In MISO and multi-fidelity optimization for HPO, information sources are typically associated to small portions of the large original dataset. A similar idea is followed in this paper: each of the three datasets was divided into 10 stratified subsets, then re-aggregated to generate the following five sources:

• f_1(x) = f(x): mce-FCV and mce-OOB on the entire dataset, respectively for C-SVC and RF;

• f_2(x): mce-FCV and mce-OOB on 40% of the original dataset (merging the first 4 subsets);

• f_3(x): mce-FCV and mce-OOB on 30% of the original dataset (merging subsets 5 to 7);

• f_4(x): mce-FCV and mce-OOB on 20% of the original dataset (merging subsets 8 and 9);

• f_5(x): mce-FCV and mce-OOB on 10% of the original dataset (just subset 10).
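The five sources above can be derived from a single dataset with stratified splits; a minimal sketch (scikit-learn; any splitting detail beyond the stated percentages is our assumption) could be:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_sources(X, y, seed=0):
    # split the dataset into 10 stratified subsets, then re-aggregate them into
    # the portions behind f_1..f_5: 100%, 40%, 30%, 20% and 10% of the data
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    folds = [test for _, test in skf.split(X, y)]       # 10 stratified index sets
    groups = [range(10), range(0, 4), range(4, 7), range(7, 9), range(9, 10)]
    sources = []
    for g in groups:
        idx = np.concatenate([folds[i] for i in g])
        sources.append((X[idx], y[idx]))                # data used by each source
    return sources
```

Each source then evaluates a hyperparameter configuration by training the classifier on its own portion and returning mce-FCV (C-SVC) or mce-OOB (RF), together with the measured query cost.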
Figure 1: Query cost and mce-FCV of 1000 randomly sampled C-SVC hyperparameter configurations on three datasets: SPLICE(3) (left), SVMGUIDE(1) (middle) and MAGIC (right). The blue line can be assimilated to a Pareto frontier minimizing both mce and cost. In this case, the minimum mce should also be cheap.

Figure 2: Query cost and mce-OOB of 1000 randomly sampled RF hyperparameter configurations on three datasets: SPLICE(3) (left), SVMGUIDE(1) (middle) and MAGIC (right). The blue line can be assimilated to a Pareto frontier minimizing both mce and cost. For the SPLICE(3) and SVMGUIDE(1) datasets, the minimum mce is not associated to the cheapest RF hyperparameter configurations.

According to this experimental setup, f_2(x) to f_5(x) rely on subsets that do not overlap with each other. Although this could imply a hierarchical organization of the sources, we assume not to know anything about the composition and nature of the sources: they are completely black-box for the purposes of this study.

4.3 Experimental Setting

A set of 5 initial locations (i.e., hyperparameter values), sampled via Latin Hypercube Sampling (LHS) on each source, is used to initialize MISO-wiLDCosts. Ten independent runs are performed, for each dataset and each classification algorithm, to mitigate the effect of the random initialization.

At a generic iteration of MISO-wiLDCosts, all the GP models – F_s(x), C_s(x) and F̂(x) – are updated, conditioned to the function values and costs observed so far. Then, the next pair (s', x') to query is selected according to (11) and, if needed, corrected via (12). Model updating and evaluation of the selected source-location pair are iterated until fifty function evaluations have been performed (a code sketch of one such iteration is given at the end of this subsection).

As the final solution of each run, MISO-wiLDCosts returns the hyperparameter values x⁺ associated to the ŷ⁺ obtained at the end of the process, that is x⁺ : (x⁺, ŷ⁺) ∈ F̂. This requires computing, one last time, the set of the AGP's inducing locations F̂. Finally, since x⁺ could be a location queried on a source different from f_1(x), a further evaluation would be performed to replace ŷ⁺ with f_1(x⁺). Although this situation has been considered in implementing MISO-wiLDCosts, it never occurred in the experiments performed.

Since this is, to the authors' knowledge, the first paper addressing MISO and cost-aware optimization simultaneously, a comparison with other approaches is not straightforward. A reasonable choice is to compare MISO-wiLDCosts against a cost-aware BO approach performed on f_1(x) only. In this case, the state of the art is represented by CArBO (Lee et al., 2020), whose cost-cooling strategy has been adopted as a baseline. It is important to remark that CArBO also implements a procedure to obtain a cost-effective initial design, which could also be included in MISO-wiLDCosts; in this paper, just the cost-cooling of CArBO has been considered, while the initial designs are those sampled via LHS in the MISO-wiLDCosts experiments. Cost-cooling requires defining the overall budget in terms of maximum cumulated query cost. We defined it to approximately cover the fifty function evaluations performed by MISO-wiLDCosts, leading to:

• HPO of C-SVC: 1 hour for SPLICE(3) and SVMGUIDE(1), and 3 hours for MAGIC;
• HPO of RF: 15 minutes for SPLICE(3), 10 minutes for SVMGUIDE(1), and 1 hour and a half for MAGIC.

For a fair comparison, the CArBO cost-cooling optimization process was also stopped at fifty evaluations, even if some residual budget was still available.
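Putting the pieces together, a single iteration of the loop described in this subsection could look as follows (a sketch reusing the acquisition and apply_correction helpers from the Sec. 3.3 sketch; scoring a shared random candidate set on every source is our simplification of the inner optimization of (11), which the paper does not detail):

```python
import numpy as np

def select_next(agp, F_models, C_models, data, bounds, y_best, rng):
    # score random candidates on every source with eq. (11), pick the best
    # source-location pair, then apply the correction of eq. (12)
    cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(2048, bounds.shape[0]))
    scores = [acquisition(agp, F_models, C_models, cand, s, y_best)
              for s in range(len(F_models))]
    s_next = int(np.argmax([sc.max() for sc in scores]))
    x_next = cand[int(np.argmax(scores[s_next]))]
    return apply_correction(s_next, x_next, data, agp, cand)
```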
MISO-wiLDCosts is developed in R. Experiments were run on a Microsoft Azure virtual machine, H8 (High Performance Computing family) Standard, with 8 vCPUs, 56 GB of memory, and Ubuntu 16.04.6 LTS.
5 Results

The main results of our experiments are summarized in Table 1, where the suffixes "wild" and "cooling" are used to distinguish between MISO-wiLDCosts and CArBO cost-cooling, respectively. Unsurprisingly, the cumulated query cost is significantly lower for MISO-wiLDCosts, according to Wilcoxon's non-parametric test for paired samples. With respect to the misclassification error:

• with respect to C-SVC, mce-wild is higher than mce-cooling for SPLICE(3) and SVMGUIDE(1) (p-value = 0.01);

• mce-wild is equal to mce-cooling in the case of HPO of C-SVC on the MAGIC dataset;

• as far as HPO of RF is concerned, mce-wild and mce-cooling are basically the same, but MISO-wiLDCosts used less than 40% of the cumulated cost required by CArBO cost-cooling, and even less than 20% for the SPLICE(3) and SVMGUIDE(1) datasets.

These differences are more evident in Table 2, where "delta mce" is the difference between mce-wild and mce-cooling, and "%cost" is given by 100 × cost-wild / cost-cooling. However, it is important to remark that our results cannot be considered a full comparative analysis with CArBO, because our implementation does not include the cost-effective initial design proposed in CArBO.

Finally, just for illustrative purposes, Figure 3 shows the query cost incurred at each iteration of MISO-wiLDCosts and CArBO cost-cooling, separately (solid vs dashed lines): lines and shaded areas are, respectively, the mean and standard deviation computed over the 10 independent runs. For reasons of space and ease of viewing, the figure refers to the HPO of RF, but the behaviour is analogous for the HPO of C-SVC. It is important to remark that just five initial hyperparameter configurations are used to initialize CArBO cost-cooling, while five for each source (i.e., twenty-five overall) are used to initialize MISO-wiLDCosts.

After the initial design, MISO-wiLDCosts mainly used "cheap" sources in the case of the SPLICE(3) and SVMGUIDE(1) datasets, while it used all the sources, including f_1(x), in the case of the MAGIC dataset. This means that, in the case of SPLICE(3) and SVMGUIDE(1): (i) the queried sources were not so discrepant from f_1(x) and thus contributed in providing inducing locations for the AGP and/or (ii) early convergence towards a specific source-location pair (s', x') did not occur, so that correction (12) was not (frequently) applied.
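For reference, the per-run aggregation behind Tables 1 and 2 can be sketched as follows (assuming one final mce and one cumulated cost per run; the exact significance thresholds reported by the authors are not re-derived here):

```python
import numpy as np
from scipy.stats import wilcoxon

def summarize(mce_wild, mce_cool, cost_wild, cost_cool):
    # "delta mce" and "%cost" as defined above, plus Wilcoxon's non-parametric
    # test for paired samples on the cumulated query costs of the paired runs
    delta_mce = np.asarray(mce_wild) - np.asarray(mce_cool)
    pct_cost = 100.0 * np.asarray(cost_wild) / np.asarray(cost_cool)
    _, p_value = wilcoxon(cost_wild, cost_cool)
    return ((delta_mce.mean(), delta_mce.std()),
            (pct_cost.mean(), pct_cost.std()), p_value)
```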
Table 1: HPO results on three classification datasets and two ML classification algorithms: mce and cumulated query cost over the 10 runs of MISO-wiLDCosts and CArBO cost-cooling.
Classifier | Dataset | mce-wild | mce-cooling | cost-wild [secs] | cost-cooling [secs]
C-SVC | SPLICE(3) | 0.··· ± 0.062 | 0.··· ± 0.002 | 744.··· ± ···.760 | 3646.··· ± ···
C-SVC | SVMGUIDE(1) | 0.··· ± 0.014 | 0.··· ± 0.000 | 572.··· ± ···.167 | 3263.··· ± ···
C-SVC | MAGIC | 0.··· ± 0.000 | 0.··· ± 0.000 | 11050.··· ± ···.669 | 11575.··· ± ···
RF | SPLICE(3) | 0.··· ± 0.003 | 0.··· ± 0.000 | 78.··· ± ···.375 | 456.··· ± ···
RF | SVMGUIDE(1) | 0.··· ± 0.003 | 0.··· ± 0.000 | 25.··· ± ···.490 | 133.··· ± ···
RF | MAGIC | 0.··· ± 0.000 | 0.··· ± 0.000 | 492.··· ± ···.704 | 1360.··· ± ···

Table 2: Differences between MISO-wiLDCosts and CArBO cost-cooling: "delta mce" is mce-wild − mce-cooling and "%cost" is 100 × cost-wild / cost-cooling.

Classifier | Dataset | delta mce | %cost
C-SVC | SPLICE(3) | 0.··· ± 0.063 | 20.··· ± ···
C-SVC | SVMGUIDE(1) | 0.··· ± 0.014 | 17.··· ± ···
C-SVC | MAGIC | 0.··· ± 0.000 | 95.··· ± ···
RF | SPLICE(3) | 0.··· ± 0.003 | 17.··· ± ···
RF | SVMGUIDE(1) | −0.··· ± 0.003 | 19.··· ± ···
RF | MAGIC | 0.··· ± 0.000 | 38.··· ± ···

(In both tables, digits marked "·" are not recoverable from the source file.)

6 Conclusions

This paper shows that unifying MISO and cost-aware BO within a single framework can be accomplished while obtaining good numerical performance. MISO-wiLDCosts is the first MISO approach that is also aware of location-dependent costs within each source. Its practical value emerges from the HPO experiments: MISO-wiLDCosts yields the same classification error at a significantly lower cumulated query cost.
Acknowledgements
We gratefully acknowledge the DEMS Data Science Lab, Department of Economics, Management and Statistics (DEMS), University of Milano-Bicocca, for supporting this work by providing computational resources.
References
F. Archetti, and A. Candelieri (2019). Bayesian Optimization and Data Science. Springer International Publishing.

A. Candelieri, R. Perego, and F. Archetti (2018). Bayesian optimization of pump operations in water distribution systems. Journal of Global Optimization, 71(1), 213–235.

A. Candelieri, and F. Archetti (2019). Global optimization in machine learning: the design of a predictive analytics application. Soft Computing, 23(9), 2969–2977.

A. Candelieri, R. Perego, and F. Archetti (2020). Green machine learning via augmented Gaussian processes and multi-information source optimization. arXiv preprint arXiv:2006.14233.

A. Chaudhuri, A.N. Marques, R. Lam, and K.E. Willcox (2019). Reusing information for multifidelity active learning in reliability-based design optimization. AIAA Scitech 2019 Forum.

H. Chih-Wei, C. Chih-Chung, and L. Chih-Jen (2003). A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University.

T. Elsken, J.H. Metzen, and F. Hutter (2019). Neural architecture search. In Automated Machine Learning. Springer.

P.I. Frazier (2018). Bayesian optimization. In Recent Advances in Optimization and Modeling of Contemporary Problems, INFORMS, 255–278.

R.B. Gramacy (2020). Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences. CRC Press.

S.F. Ghoreishi, and D. Allaire (2019). Multi-information source constrained Bayesian optimization. Structural and Multidisciplinary Optimization, 59(3), 977–991.

E. Goel, and E. Abhilasha (2017). Random forest: A review. International Journal of Advanced Research in Computer Science and Software Engineering.

D. Heck, G. Schatz, J. Knapp, T. Thouw, and J. Capdevielle (1998). CORSIKA: A Monte Carlo code to simulate extensive air showers. Technical report.

F. Hutter, L. Kotthoff, and J. Vanschoren (2019). Automated Machine Learning: Methods, Systems, Challenges. Springer Nature.

K. Kandasamy, G. Dasarathy, J. Oliva, J. Schneider, and B. Poczos (2019). Multi-fidelity Gaussian process bandit optimisation. Journal of Artificial Intelligence Research, 66, 151–196.

M.C. Kennedy, and A. O'Hagan (2000). Predicting the output from a complex computer code when fast approximations are available. Biometrika, 87(1), 1–13.

A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter (2017). Fast Bayesian optimization of machine learning hyperparameters on large datasets. Artificial Intelligence and Statistics, 528–536.

R. Lam, D.L. Allaire, and K.E. Willcox (2015). Multifidelity optimization using statistical surrogate modeling for non-hierarchical information sources. 56th AIAA/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference.

E.H. Lee, V. Perrone, C. Archambeau, and M. Seeger (2020). Cost-aware Bayesian optimization.

A. March, and K. Willcox (2012). Provably convergent multifidelity optimization algorithm not requiring high-fidelity derivatives. AIAA Journal, 50(5), 1079–1089.

A. Marques, R. Lam, and K. Willcox (2018). Contour location via entropy reduction leveraging multiple information sources. Advances in Neural Information Processing Systems.

B. Paria, et al. (2020). Cost-aware Bayesian optimization via information directed sampling. ICML 2020 Workshop on Real World Experiment Design and Active Learning.

B. Peherstorfer, B. Kramer, and K. Willcox (2017). Combining multiple surrogate models to accelerate failure probability estimation with expensive high-fidelity models. Journal of Computational Physics, 341, 61–75.

M. Poloczek, J. Wang, and P. Frazier (2017). Multi-information source optimization. Advances in Neural Information Processing Systems, 4288–4298.

D. Russo, and B. Van Roy (2014). Learning to optimize via information-directed sampling. Advances in Neural Information Processing Systems, 1583–1591.

B. Scholkopf, and A.J. Smola (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

R. Sen, K. Kandasamy, and S. Shakkottai (2018). Multi-fidelity black-box optimization with hierarchical partitions. International Conference on Machine Learning.

D. Sha, et al. (2020). Applying Bayesian optimization for calibration of transportation simulation models. Transportation Research Record.

B. Shahriari, K. Swersky, Z. Wang, R.P. Adams, and N. De Freitas (2015). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), 148–175.

J. Snoek, H. Larochelle, and R.P. Adams (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 2951–2959.

J. Song, Y. Chen, and Y. Yue (2019). A general framework for multi-fidelity Bayesian optimization with Gaussian processes. Artificial Intelligence and Statistics, 3158–3167.

N. Srinivas, A. Krause, S.M. Kakade, and M.W. Seeger (2012). Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5), 3250–3265.

E. Strubell, A. Ganesh, and A. McCallum (2020). Energy and policy considerations for modern deep learning research. AAAI, 13693–13696.

K. Swersky, J. Snoek, and R.P. Adams (2013). Multi-task Bayesian optimization. Advances in Neural Information Processing Systems, 2004–2012.

C.K. Williams, and C.E. Rasmussen (2006). Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.

R.L. Winkler (1981). Combining probability distributions from dependent information sources. Management Science, 27(4), 479–488.