Practical Bayesian Optimization of Objectives with Conditioning Variables
Michael Pearce Janis Klaise Matthew Groves
Centre for Complexity Science, Warwick University, Coventry, UK; Seldon Technologies, London, UK; Centre for Doctoral Training in Mathematics for Real-World Systems, Warwick University, Coventry, UK
Abstract
Bayesian optimization is a class of data efficient, model based algorithms typically focused on global optimization. We consider the more general case where a user is faced with multiple problems that each need to be optimized conditional on a state variable. For example, given a range of cities with different patient distributions, we optimize the ambulance locations conditioned on the patient distribution; given partitions of CIFAR-10, we optimize CNN hyperparameters for each partition. Similarity across objectives boosts optimization of each objective in two ways: in modelling, by data sharing across objectives, and in acquisition, by quantifying how a single point on one objective can provide benefit to all objectives. For this we propose a framework for conditional optimization: ConBO. This can be built on top of a range of acquisition functions and we propose a new Hybrid Knowledge Gradient acquisition function. The resulting method is intuitive and theoretically grounded, performs either similarly to or significantly better than recently published works on a range of problems, and is easily parallelized to collect a batch of points.
Expensive black box functions arise in many fields such as fluid simulations (Picheny and Ginsbourger, 2013), engineering wing design (Jeong et al., 2005), and machine learning parameter tuning (Snoek et al., 2012). In this work we consider the more general case where the expensive function must be optimized for a range of settings, or conditioned on a state, often called a context, nuisance variable or source. This has many applications, as follows.
Physics simulators: the optimal packing fraction of particles in a container varies with particle size (Ginsbourger et al., 2014b). A nuclear fusion reactor has multiple plasma states, each requiring unique controls (Char et al., 2019; Chung et al., 2020).
Algorithm Parameter Tuning: given a set of graphs, one may find the best coloring algorithm for each graph (Smith-Miles et al., 2014). For a selection of datasets one can optimize model training hyperparameters (Bardenet et al., 2013) for each dataset. We consider CNN optimizer parameters for partitions of CIFAR-10.
Robust Engineering: when designing a wing for a range of environment conditions and flow angles, a safety engineer needs to find the flow angle of worst-case stress on the wing for each environment.
Logistics: given a range of warehouses spread over a country that each face different sales demand, one must optimize the stock control policy of each warehouse (Poloczek et al., 2016). Given a range of cities with different population distributions, for each city one must optimize ambulance locations (Dong et al., 2017). We also consider these two applications.

In this work we shall refer to the variables that are to be optimized as actions, though parameters, decision variables, and solutions are often used. There are many variations of the conditional setting. In contextual optimization (Paul et al., 2018), at each iteration a context (or state) is passed to the optimization algorithm, the algorithm chooses an action, and the reward is returned. Studied applications include drug design (Krause and Ong, 2011), in which an algorithm can visit molecules (states) in a round-robin fashion and chooses a test chemical (action) at each iteration. Optimization with warm starts, dynamic objectives, and transfer learning (Morales-Enciso and Branke, 2015; Feurer et al., 2015; Poloczek et al., 2016; Perrone et al., 2018) may be interpreted as contextual optimization where the state remains constant for multiple time steps, during which multiple actions can be evaluated. Such methods have been applied to the aforementioned many-warehouses example and to hyperparameters of neural networks for sequentially arriving datasets.

Figure 1: Top: GP model. Each state (vertical slice) is an objective function to maximise. At each iteration the algorithm must determine $(s,x)^{n+1}$ that provides information about the optima of all states (black line). Bottom: the acquisition function value of each local Monte Carlo state using hybrid Knowledge Gradient.

Many applications in these works may also be viewed as serialised versions of conditional optimization problems. In drug design, an experimenter may be free to choose both the molecule and the test chemical for each experiment; in hyperparameter tuning, if a collection of datasets is already in the user's possession, the user is free to choose both the dataset and the parameters to train next. Both the context/state and the action are under algorithm control and the goal is to maximise all states simultaneously. See Figure 1.

In multi-information source, or multi-fidelity, optimization (Huang et al., 2006a; Poloczek et al., 2017; Kandasamy et al., 2017), one has multiple functions over the action space; each one corresponds to a source (or state) and has a different query cost. The goal is to optimize an expensive source (target state) by sequentially evaluating actions on cheaper sources while exploiting similarity across sources. The conditional setting is a generalization where all sources are targets and (for simplicity) we assume all sources/states have equal query cost.

Past works have referred to this setting as profile optimization (Ginsbourger et al., 2014a), conditional optimization (Sambakhé et al., 2019), offline contextual optimization (Char et al., 2019), collaborative tuning (Bardenet et al., 2013), and continuous multi-task optimization (Pearce and Branke, 2018). In this work we also adopt the term conditional optimization.

This paper makes the following contributions:

1. A framework for conditional Bayesian optimization with theoretical guarantees: ConBO.
2. A new competitive global optimization method that is cheap to compute: Hybrid Knowledge Gradient.

3. Extensive evaluation on fully open source conditional optimization problems from various domains.

We focus on conditional algorithms. We define such algorithms as those that consider a set of functions, each corresponding to a state. At each iteration, the algorithm is free to choose both the state and the action. The goal is to find the peak of each function, that is, all of the conditional optima; see Figure 1.

The SCoT algorithm (Bardenet et al., 2013) considers discrete states, visits states in a round-robin fashion, and chooses actions using expected improvement. The authors consider hyperparameters of machine learning algorithms for multiple datasets, all of which are considered to be available upfront. As SCoT predetermines the sequence of states it is unnecessarily restrictive, and it also cannot be applied to problems with continuous states.

The Profile Expected Improvement (PEI) algorithm (Ginsbourger et al., 2014b) does consider continuous states and proposes an acquisition function for the next state-action pair by measuring the expected improvement of a new outcome over the best predicted outcome within the same state. The recent PEQI method (Sambakhé et al., 2019) extends PEI to noisy problems using SKO (Huang et al., 2006b). These algorithms significantly improve upon SCoT by allowing the algorithm to dynamically determine the state as well as the action of each sampled state-action pair. However, the proposed acquisition functions only quantify the benefit of a new sample by the effect it may have on finding the maximum within the sampled state; they do not account for how a sample may help to find the maxima of similar states.

The REVI algorithm (Pearce and Branke, 2018) aims to improve upon these methods and explicitly quantifies the benefit a new sample will have for finding the maxima of all states. At each iteration, the method discretizes both state and action domains, treating a continuous problem as a discrete one. Given a hypothetical state-action point to sample, the Knowledge Gradient by discretization (over actions) quantifies how the new sample will benefit each state in the discretized set of states. The sum of these benefits provides an estimate of overall benefit. The method suffers the curse of dimensionality: increasing dimension leads to an exponentially increasing discretization size with correspondingly exponential space and time requirements. Using a smaller discretization suffers discretization errors that can largely undermine any benefit of accounting for all states and actions.

Most recently, the MTS method (Char et al., 2019) uses a novel kernel with a length scale that varies across states. This is combined with a multi-task Thompson sampling method for collecting new data. Note that this method also requires a discretization, over real state-action space or Fourier space, and has cubic cost in the size of the discretization.
Further, Thompson sampling, like earlier works, does not account for how one sample provides benefit for optimizing all states, a fundamental structure of conditional problems. While REVI and MTS appear to be the strongest algorithms, both suffer from practical issues: REVI is exponentially expensive and MTS does not utilise the structure of the problem. In this work, we propose a method that exploits the basic structure of conditional optimization, like REVI, while also being much more practical and computationally favourable, like SCoT, PEI or PEQI, and therefore amenable to much more challenging real world problems.
Adopting the convenient terminology of bandits, we assume that we have an expensive black box function that takes as input both a state in state space $s \in S$ and an action in action space $x \in X$, either of which may be continuous or discrete. The expensive function returns a noisy scalar reward,
$$f(s, x) : S \times X \to \mathbb{R}. \quad (1)$$
There is a distribution over states $P[s]$ that encodes the priority or weighting of states. Given a budget of $N$ function calls, we sequentially choose points $(s, x)$ and observe a noisy reward $f(s, x)$. The objective is to learn the best action for every state, or a policy $\pi : S \to X$, that maximises expected output for all states,
$$\text{Total Reward} = \int_S \mathbb{E}[f(s, \pi(s))] \, P[s] \, ds, \quad (2)$$
where the expectation is over the stochasticity in rewards. If there is only a single state, the reward function is a single objective, the policy is a single action, and the problem reduces to global optimisation. If $S$ is a finite set, the integral reduces to a summation. In this work we consider the following applications:

CNN hyperparameters: the CIFAR-10 dataset contains 10 classes; we split this into five datasets, $S = \{1, \ldots, 5\}$, of two classes each with uniform state density $P[s] = 1/5$. Using a 5 layer CNN, we optimize dropout rates, batch size and Adam parameters over a continuous box-constrained action space $X$. $f(s, x)$ is the validation accuracy.

Ambulances in a square: (Dong et al., 2017) given a range of 30km x 30km cities parameterised by $s \in S = [0, 1]$, $P[s]$ is a truncated Gaussian around the centre of the map; most cities, like Paris, are densest in the centre while others are densest at a coast line, like Singapore. The action space is the 2D coordinates of three ambulance bases, $x \in X = [0,1]^6$, and for each city $s$ one must find bases $x$ to minimise journey times $f(s, x)$ to patients over a simulated day.

Assemble to order: (Xie et al., 2016) a company owns many stock warehouses and each one faces a different level of demand $s$ in a bounded continuous range; the distribution of demand $P[s]$ is uniform. Stock is depleted as customer orders arrive and are fulfilled, and stock levels are controlled by setting targets for replenishment, a box-constrained vector $x \in X$. The objective, $f(s, x)$, is sales profit minus holding costs over a month.

We first discuss the fitting of the Gaussian process model and the policy. We then motivate the acquisition value for a single state and how this is integrated over states, yielding the acquisition function. Integrating over states increases the computational burden, hence we propose Hybrid Knowledge Gradient as a solution.

At a stage after having observed $n$ data points $\{(s_i, x_i, y_i)\}_{i=1}^n$ where $y_i = f(s_i, x_i)$, we fit a Gaussian process from the joint space $S \times X$ to outputs $y \in \mathbb{R}$. Let $\tilde{X}^n = ((s,x)_1, \ldots, (s,x)_n)$ and $Y^n = (y_1, \ldots, y_n)$. A Gaussian process is defined by a prior mean and prior covariance function, $\mu_0(s,x)$, $k_0((s,x),(s',x'))$, which are chosen for each application; for more information see Rasmussen and Williams (2004). After observing $n$ data points, let $K = k_0(\tilde{X}^n, \tilde{X}^n) \in \mathbb{R}^{n \times n}$; the posterior mean is given by
$$\mu^n(s,x) = \mu_0(s,x) + k_0((s,x), \tilde{X}^n)\left(K + \sigma^2 I\right)^{-1}\left(Y^n - \mu_0(\tilde{X}^n)\right) \quad (3)$$
and the posterior covariance is given by
$$k^n((s,x),(s',x')) = k_0((s,x),(s',x')) - k_0((s,x), \tilde{X}^n)\left(K + \sigma^2 I\right)^{-1} k_0(\tilde{X}^n, (s',x')). \quad (4)$$
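As a concrete illustration, the posterior equations above amount to a few lines of linear algebra. The following minimal NumPy sketch implements Equations 3 and 4, assuming a zero prior mean and a squared exponential kernel over the joint $(s,x)$ space; the kernel, noise level and toy data are illustrative choices, not the settings used in the experiments.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared exponential kernel over the rows of A and B (joint (s, x) points).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.1):
    # Posterior mean (Eq. 3) and covariance (Eq. 4), zero prior mean.
    K = sq_exp_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_inv = np.linalg.inv(K)
    k_star = sq_exp_kernel(X_test, X_train)        # k((s,x), X~^n)
    mean = k_star @ K_inv @ y_train                # Eq. 3
    cov = sq_exp_kernel(X_test, X_test) - k_star @ K_inv @ k_star.T  # Eq. 4
    return mean, cov

# Toy usage: 20 noisy observations over the joint (state, action) unit square.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 2))                # columns: state s, action x
y = np.sin(3 * X[:, 0]) * np.cos(3 * X[:, 1]) + 0.1 * rng.normal(size=20)
mu, Sigma = gp_posterior(X, y, rng.uniform(0, 1, size=(5, 2)))
```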
For state $s$, the predicted optimal action $x$ defines a policy
$$\pi^n(s) = \arg\max_x \mu^n(s, x). \quad (5)$$
Although we don't use them in our experiments, a common approach for applying GPs to higher dimensions is additive kernels (Kandasamy et al., 2015; Krause and Ong, 2011). These assume the function decomposes into a sum of functions, each one depending on an exclusive subset of variables, e.g. $f((x_1, x_2)) = f_1(x_1) + f_2(x_2)$. If the state and action variables are in different subsets the policy becomes $\pi^n(s) = \arg\max_x \mu^n_1(s) + \mu^n_2(x) = \arg\max_x \mu^n_2(x)$; it is independent of the state. Additive structure assumes no interaction between variables, while conditional optimization is only necessary with interaction. Thus in settings with additive structure, one may fix all non-interacting state and action variables and apply conditional optimization only to additive components containing both state and action variables.

Collecting a new data point $y^{n+1}$ at $(s,x)^{n+1}$ will update the Gaussian process model for all points $S \times X$, updating the predicted peak of all states and thereby affecting the policy. To construct an acquisition function for conditional optimization, we start by looking for standard acquisition functions that account for how the model changes at unsampled locations $(s', x') \neq (s,x)^{n+1}$. Specifically, the popular Expected Improvement (EI) (Jones et al., 1998) and upper confidence bound (UCB) (Kandasamy et al., 2016) methods are both functions of the mean and variance at the sampled point only, $\mu^n((s,x)^{n+1})$, $k^n((s,x)^{n+1}, (s,x)^{n+1})$. Methods that utilise other points in the domain include entropy search (ES and PES), which measures the mutual information between the new output $P[y^{n+1} | x^{n+1}]$ and the (induced) location of the peak $P[x^* | \tilde{X}^n, Y^n]$; Max-value entropy search (MES) (Wang and Jegelka, 2017), which measures the mutual information between the new output $P[y^{n+1} | x^{n+1}]$ and the largest output $P[\max y | \tilde{X}^n, Y^n]$; and Knowledge Gradient (KG) (Frazier et al., 2009), which measures the expected new peak of the posterior mean, $\mathbb{E}[\max \mu^{n+1}(x)]$, caused by a new $y^{n+1}$ at $x^{n+1}$. Unlike EI and UCB, we classify ES, MES and KG as "globally aware".

Each state $s_i$ defines a single global optimization problem over $x \in X$; using ES, PES, MES or KG, the acquisition value for state $s_i$ given a sample elsewhere at $(s,x)^{n+1}$ may be computed. In this work we adopt KG for its Bayesian decision theoretic derivation that extends seamlessly to the conditional setting. For KG, the benefit for state $s_i$ given a sample at $(s,x)^{n+1}$ is the expected increase in predicted reward for the best action within state $s_i$. We denote this as $KG^c(\cdot)$, given by
$$KG^c(s_i; (s,x)^{n+1}) = \mathbb{E}_{y^{n+1}}\!\left[\max_{x'} \mu^{n+1}(s_i, x') \,\middle|\, (s,x)^{n+1}\right] - \max_{x''} \mu^n(s_i, x''). \quad (6)$$
We discuss the numerical evaluation of $KG^c(\cdot)$ in Section 4.1. Similar expressions for entropy methods, $ES^c(\cdot)$, $PES^c(\cdot)$, $MES^c(\cdot)$, are derived in the Supplementary Material. Integrating over states $s_i$ yields the total acquisition value
$$\int_S KG^c(s; (s,x)^{n+1}) \, P[s] \, ds. \quad (7)$$
For discrete $S$ the integral is replaced by summation. For continuous $S$, the integral over states $s$ cannot be computed analytically and so we use Monte Carlo with importance sampling.
When using a kernel that factorises, $k((s,x),(s',x')) = \sigma_0^2 \, k_S(s,s') \, k_X(x,x')$, like the squared exponential or Matérn, similarity across states is encoded in $k_S(s, s')$. This naturally leads to the proposal distribution $q(s | s^{n+1}) \propto k_S(s, s^{n+1})$. In our continuous state experiments, we use the Matérn kernel and a Gaussian proposal distribution with mean $s^{n+1}$ and the state kernel length scales, $l_s$, as standard deviations,
$$q(s | s^{n+1}) = \mathcal{N}(s \,|\, s^{n+1}, \text{diag}(l_s^2)). \quad (8)$$
We generate $n_s = 20$ samples $S_{MC} = \{s_1, \ldots, s_{n_s}\}$; finally, the acquisition function is
$$\text{ConBO}(s^{n+1}, x^{n+1}) = \sum_{s_i \in S_{MC}} \frac{P[s_i]}{q(s_i | s^{n+1})} KG^c(s_i; (s,x)^{n+1}). \quad (9)$$
Figure 1 shows a set of sampled states and the $KG^c(\cdot)$ value for each one. Each KG term directly measures the increase in predicted reward for one state, and the above sum measures reward increases across all states; this is a direct surrogate for maximizing the true objective (Equation 2). For entropy based methods, ConBO becomes a sum of Shannon information units, thereby indirectly optimizing the true objective. The randomly sampled states $S_{MC}$ may be resampled with each call to $\text{ConBO}(s, x)$ and gradients estimated, enabling the optimal $(s,x)^{n+1}$ to be found with a stochastic gradient ascent optimizer such as Adam (Kingma and Ba, 2014).
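The acquisition in Equations 8–9 can be sketched as follows for a one-dimensional state space. Here kg_c and p_s are hypothetical callables standing in for $KG^c(\cdot)$ (e.g. the hybrid estimator of Section 4) and the state density $P[s]$; the averaging constant $1/n_s$ does not change the argmax.

```python
import numpy as np
from scipy.stats import norm

def conbo_acquisition(s_new, x_new, kg_c, p_s, ls, n_s=20, rng=None):
    # Monte Carlo importance sampling of Eq. 9: draw states around s_new from
    # the Gaussian proposal q(s | s_new) of Eq. 8 and weight by P[s] / q(s).
    rng = rng or np.random.default_rng()
    s_samples = rng.normal(loc=s_new, scale=ls, size=n_s)     # S_MC
    q = norm.pdf(s_samples, loc=s_new, scale=ls)              # proposal density
    weights = p_s(s_samples) / q                              # importance weights
    kg_values = np.array([kg_c(s_i, (s_new, x_new)) for s_i in s_samples])
    return float(np.mean(weights * kg_values))

# Example with a uniform state density on [0, 1] and a dummy KG^c surrogate.
acq = conbo_acquisition(0.4, 0.7,
                        kg_c=lambda s, sx: max(0.0, 1.0 - abs(s - 0.5)),
                        p_s=lambda s: ((s >= 0) & (s <= 1)).astype(float),
                        ls=0.2)
```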
Selecting each point according to maximising ConBO is also myopically optimal in a value of information framework:

Theorem 1. Let $(s^*, x^*) = \arg\max \text{ConBO}(s,x)$ be a point chosen for sampling. $(s^*, x^*)$ is also the point that maximises the myopic Value of Information, the increase in predicted policy reward.

Figure 2: Methods for computing $KG(x^{n+1})$ at $x^{n+1} = 7$. Left: $\mu^n(x)$ and samples of $\mu^{n+1}(x)$ determined by a scalar $Z \sim \mathcal{N}(0,1)$. Centre-left: $KG_d$ replaces $X$ with up to 3000 points $x_i \in X_d$ and $\mu^{n+1}(x_i)$ is linear in $Z$. Centre-right: $KG_{MC}$ samples up to 1000 functions $\mu^{n+1}(x)$ and maximises each of them numerically. Right: $KG_h$ samples up to 5 functions $\mu^{n+1}(x)$ and maximizes them numerically; the argmax points $x^*_1, \ldots, x^*_5$ are used as $X_d$ in $KG_d$.

Further, in a finite search space with an infinite sampling budget, all points will be sampled infinitely often:
Theorem 2. Let $S$ and $X$ be finite sets and $N$ the budget to be sequentially allocated by ConBO. Let $n(s, x, N)$ be the number of samples allocated to point $(s,x)$ within budget $N$. Then for all $(s,x) \in S \times X$ we have that $\lim_{N \to \infty} n(s, x, N) = \infty$.

The law of large numbers then ensures that the algorithm learns the true expected reward for all points. Proofs are given in the Supplementary Material.
By definition, KG is more expensive than EI and UCB. Further, the function $KG^c(s_i; (s,x)^{n+1})$ must be computed once for each sampled state $s_i$; the computational cost is therefore $n_s$ times that of the global acquisition function equivalent. To alleviate this cost, we propose a novel, efficient algorithm for computing KG. In the following section we assume constant $s$ for brevity, reducing to the global optimization setting. Given a hypothetical location $x^{n+1}$, KG quantifies the value of a new hypothetical observation $y^{n+1}$ by the expected increase in predicted reward of the optimal action, i.e. the expected increase in the peak of the posterior mean,
$$KG(x^{n+1}) = \mathbb{E}_{y^{n+1}}\!\left[\max_{x'} \mu^{n+1}(x') \,\middle|\, x^{n+1}\right] - \max_{x''} \mu^n(x''). \quad (10)$$
However, $\max_{x'} \mu^{n+1}(x')$ has no direct formula and approximations are required, which we describe next. At time $n$, the new posterior mean is unknown; however, it may be written as $\mu^{n+1}(x) = \mu^n(x) + \tilde{\sigma}(x; x^{n+1}) Z$ where $\tilde{\sigma}(x; x^{n+1})$ is a deterministic function and the scalar $Z \sim \mathcal{N}(0,1)$ captures the randomness of $y^{n+1}$, see SM. Previously, $KG(x)$ has been computed in the following two ways.

KG by discretization (Frazier et al., 2009; Scott et al., 2011): in Equation 10, the maximizations may be performed over a discrete set of $d$ points $x' \in X_d \subset X$. Denoting $\mu = \mu^n(X_d) \in \mathbb{R}^d$ and $\tilde{\sigma}(x^{n+1}) = \tilde{\sigma}(X_d; x^{n+1}) \in \mathbb{R}^d$, then
$$KG_d(x^{n+1}) = \mathbb{E}_Z\!\left[\max\{\mu + \tilde{\sigma}(x^{n+1}) Z\}\right] - \max \mu.$$
The term $\max\{\mu + \tilde{\sigma}(x^{n+1}) Z\}$ is a piece-wise linear function of $Z$ and the expectation over $Z \sim \mathcal{N}(0,1)$ is analytically tractable. The output is a lower bound of the true $KG(x)$ over the continuous set. The MISO algorithm (Poloczek et al., 2017) used $KG_d$ with 3000 uniformly randomly distributed points. This method suffers the curse of dimensionality: $X_d$ must grow exponentially with the dimension of $X$. Even when using a dense discretization, $X_d$ may contain many unnecessary points $x_i$ that do not form part of $\max \mu^{n+1}(X_d)$; see Figure 2, centre-left plot.

KG by Monte Carlo (Wu and Frazier, 2017; Toscano-Palmerin and Frazier, 2018): given $x^{n+1}$, the method samples up to $n_z = 1000$ values of $Z$. For each $Z_j$, construct $\mu^{n+1}_j(x) = \mu^n(x) + \tilde{\sigma}(x; x^{n+1}) Z_j$, and the peak output value of each $\mu^{n+1}_j(x)$ is found using a continuous numerical Optimizer() (e.g. L-BFGS, CG),
$$KG_{MC}(x^{n+1}) = \frac{1}{n_z} \sum_j \text{Optimizer}_{x'}\!\left(\mu^n(x') + \tilde{\sigma}(x'; x^{n+1}) Z_j\right) - \text{Optimizer}_{x''}\!\left(\mu^n(x'')\right).$$
The resulting average is an unbiased estimate of the true $KG(x)$ and scales better to higher dimensional $X$, as the univariate $Z$ is discretized by Monte Carlo samples instead of $X$. However, similar $Z_j$ values will have similar $\max \mu^{n+1}_j(x)$, making many calls to Optimizer() redundant. See Figure 2, centre-right.

We instead propose a simple, natural mixture of the above two approaches that scales to higher dimensional $X$ and reduces redundant Optimizer() calls.
Hybrid KG: given $x^{n+1}$, first, following $KG_{MC}$, we use $n_z = 5$ values of $Z_j$ and for each $\mu^n(x) + \tilde{\sigma}(x; x^{n+1}) Z_j$ we find the argmax $x^*_j$ using Optimizer(). Second, following $KG_d$, we treat the set of peak locations as a dynamic optimized discretization $X^*_d = \{x^*_1, \ldots, x^*_{n_z}\}$ to analytically compute a lower bound of the true $KG(x)$. Let $\mu^* = \mu^n(X^*_d) \in \mathbb{R}^{n_z}$ and $\tilde{\sigma}^*(x^{n+1}) = \tilde{\sigma}(X^*_d; x^{n+1}) \in \mathbb{R}^{n_z}$, then
$$KG_h(x^{n+1}) = \mathbb{E}_Z\!\left[\max\{\mu^* + \tilde{\sigma}^*(x^{n+1}) Z\}\right] - \max \mu^*.$$
Compared to $KG_d$, the hybrid method removes redundant points in the discretization $X_d$: all $X^*_d$ points contribute to $\max \mu^{n+1}(X^*_d)$ and there are far fewer points. Compared to $KG_{MC}$, instead of sampling many $Z_j$ and finding many $x^*_j$, far fewer $Z_j$ are sampled and fewer $x^*_j$ need to be found.

To ensure asymptotic convergence in a discrete domain, we require that the acquisition function is non-negative, $KG_h(x) \geq 0$, and that the acquisition function is zero exactly where the GP variance is zero, $KG_h(x) = 0 \iff k^n(x, x) = 0$. Therefore, always choosing $x^{n+1} = \arg\max KG_h(x)$ ensures only points with GP variance will be revisited, until all points have no variance, i.e. the true function is known for all points. We can ensure these properties by setting $n_z \geq 2$ with at least one $Z_j$ equal to zero.
Theorem 3. Let $n_z \geq 2$ and let $Z = \{Z_j \,|\, j = 1, \ldots, n_z\}$. If $0 \in Z$ then $KG_h(x) \geq 0$ for all $x \in X$, and if $x$ is sampled infinitely often then $KG_h(x) = 0$.

The proof is in the SM. The $Z_j$ values can be fixed; for $n_z = 5$ we use equal Gaussian quantiles $Z = \{\Phi^{-1}(0.1), \Phi^{-1}(0.3), \ldots, \Phi^{-1}(0.9)\}$ where $\Phi(\cdot)$ is the Gaussian CDF. Using quantile spacing and odd $n_z$ ensures $Z_j = \Phi^{-1}(0.5) = 0$ is included, which satisfies the assumptions of asymptotic convergence. See Figure 2 (right).
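A minimal sketch of $KG_h$ under these choices follows, assuming callables mu_n(x) and sigma_tilde(x, x_new) for the posterior mean and the deterministic update factor $\tilde{\sigma}$ (hypothetical names), with SciPy's L-BFGS-B as the inner Optimizer(). For brevity the one-dimensional expectation over $Z$ is evaluated here by quadrature; the analytic piecewise-linear computation is Algorithm 3 in the SM.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize
from scipy.integrate import quad

def hybrid_kg(x_new, mu_n, sigma_tilde, bounds, n_z=5):
    # Step 1 (as in KG_MC): optimize n_z sampled one-step posterior means
    # mu^n(x) + sigma~(x; x_new) * Z_j at fixed Gaussian quantiles Z_j.
    Z = norm.ppf((np.arange(1, n_z + 1) - 0.5) / n_z)   # includes Phi^-1(0.5) = 0
    x0 = np.mean(bounds, axis=1)                        # crude common start point
    x_stars = []
    for z_j in Z:
        res = minimize(lambda x: -(mu_n(x) + sigma_tilde(x, x_new) * z_j),
                       x0, bounds=bounds, method="L-BFGS-B")
        x_stars.append(res.x)
    # Step 2 (as in KG_d): treat the argmax points X*_d as the discretization.
    mu = np.array([mu_n(x) for x in x_stars])           # mu* over X*_d
    sig = np.array([sigma_tilde(x, x_new) for x in x_stars])
    # E_Z[max(mu* + sig* Z)] by 1-D quadrature over the standard normal.
    emax = quad(lambda z: np.max(mu + sig * z) * norm.pdf(z), -8.0, 8.0)[0]
    return emax - mu.max()
```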
The computational complexity of a single call to ConBO requires the posterior variance ($O(n^2)$) and $n_s n_z$ runs of Optimizer(). Assuming Optimizer() internally makes $n_{calls}$ calls to the posterior mean, each costing $O(n)$, the total ConBO complexity is $O(n^2 + n \, n_{calls} \, n_s n_z)$. Note this is linear in $n_s n_z$, the size of the dynamic optimal discretization over $S \times X$. Thompson sampling with discretization uses one operation that scales as $O(n^2 (n_s n_z) + n (n_s n_z)^2 + (n_s n_z)^3)$ in the worst case, and to reduce this cost special techniques are required, e.g. Fourier features or CG matrix inversion.

We consider synthetic benchmarks, the three applications described in Section 3, and, in the SM, we also present parallelization results and global optimization experiments comparing KG variants to other methods. We observe that $KG_{MC}$ performs worse in the same computation time, hence we exclude $KG_{MC}$ methods from the conditional experiments.

We perform low dimensional toy experiments in an ideal setting, as a sanity check, where the error due to discretizations will be minimal. We use the popular Branin-Hoo and Rosenbrock test functions in 2D corresponding to the (state, action) domain as displayed in Figure 3.
Global Algorithms on Conditional Problems
We first modify the synthetic conditional benchmark to construct a range of problems that vary only by the width of the state space. Zero width corresponds to a singleton state space (e.g. $S = \{0\}$ for Rosenbrock) so that the problem reduces to global optimization, whilst a width of one corresponds to the original continuum of states (e.g. $S = [-2, 2]$ for Rosenbrock). We apply expected improvement and hybrid Knowledge Gradient; these acquisition functions aim to find the single global maximum over $S \times X$. We also apply ConBO, which tries to find the maximum of each state. All methods use the fitted GP to define the policy, therefore differences are solely due to the acquisition function. Results are in Figure 3. When the state space width is zero, all methods achieve near zero opportunity cost (OC): both global and conditional algorithms find the optimal action of the single state. As the state space width increases to one, there are increasingly more states to optimize. ConBO consistently achieves near zero total OC over test states. Meanwhile, global methods have increasing OC with state space width; these methods prioritise sampling high reward states, neglecting the optimization of every state and failing to converge.
[Figure 3 panels: opportunity cost at 50 samples on Branin-Hoo (left) and Rosenbrock (right), for EI, KG-h and ConBO-5.]
Figure 3: Opportunity cost (simple regret) on two synthetic functions with varying state space width. A width of zero reduces the problem to global optimization, for which the global methods EI and $KG_h$ perform well. As the state space width increases to include more states to optimize, ConBO's performance remains consistent while the global methods suffer.
[Figure 4 panels: opportunity cost versus budget $N$ on Branin-Uniform, Branin-Triangle, Rosenbrock-Uniform and Rosenbrock-Triangle, for KNN, PG, UNI, PEQI, EI, MTS, ConBO-3, REVI and ConBO-5.]
Figure 4: Opportunity cost across a range of synthetic test problems. The dummy baseline, KNN, is worst in all cases; policy gradient is better, but the Gaussian process based methods all perform better still. UNI and EI are not conditional algorithms yet outperform PEQI. Amongst the conditional algorithms, MTS, REVI and the ConBO methods all perform significantly better. ConBO-3 is outperformed by ConBO-5, demonstrating the improvement from a more accurate $KG_h$.

Conditional Algorithms
Using the same synthetic functions with full state space width, we consider a uniform and a triangular state distribution $P[s]$. For baselines, we adopt two policy based methods. KNN: on randomly collected data, the policy returns the best observed action from the 10 nearest neighbor states; this serves as a dummy baseline. PG: policy gradient, a parametric quadratic policy $\pi_\theta(s)$ is learnt by maximising observed rewards; data collection samples states from $P[s]$ then takes $\epsilon$-greedy actions. For a controlled experiment, all BO methods fit a GP defining the policy and only differ by data collection (using a different policy, e.g. recommending the global optimum $x$ or the single $x$ best on average over states, is clearly inferior and we do not investigate it here). UNI: random data collection.
PEQI: given $(s, x)$, computes the 0.75 quantile improvement of $(s, x)$ over the highest 0.75 quantile within the same state. REVI: discretizes $S \times X$ and computes $KG_d$ for each state. MTS: discretizes $S \times X$ and draws a sample output vector; the state with the highest improvement over the policy action for the same state is chosen, with the corresponding action. ConBO-k: given $(s, x)$, 20 states are importance sampled and $KG_h$ with $k = 3, 5$ points is used. EI: expected improvement, optimization over $S \times X$ treating all inputs as actions, the same as in the previous section.

Results are in Figure 4. The policy based methods KNN and PG consistently perform worse than the Gaussian process methods (which may be viewed as value based, as they predict the reward of any state-action pair). Surprisingly, the conditional BO algorithm PEQI performs similarly to UNI and much worse than EI. In all cases, the conditional methods outperform all non-conditional methods.

We apply MTS, REVI, and ConBO variants, and we adopt the recently proposed kernel used for BO with Common Random Numbers (Pearce et al., 2019),
$$k((s,x),(s',x')) = \sigma_0^2 M(x, x'; l_0) + \delta_{s's}\left(\sigma_1^2 M(x, x'; l_1) + \sigma_2^2\right),$$
where $M(x, x'; l)$ is a Matérn kernel with lengthscales $l$. The first term models a common trend function across all states and the second term models how each state independently differs from the trend. The differences are composed of another Matérn kernel and a constant kernel to model a global offset, e.g. one dataset may have universally lower validation error. This kernel has far fewer parameters than a full multi-task product kernel; it is easy to fit and scales to an arbitrary number of states (or datasets) without adding extra parameters. A sketch of this kernel is given after this paragraph.

In this problem setting, learning hyperparameters over similar datasets, one may expect that the optimal hyperparameters would be the same for all datasets. Therefore, as a baseline we apply EI to learn the hyperparameters of the first dataset (state 1). We then evaluate the objective function (validation error) on the rest of the datasets using the best observed hyperparameters from dataset 1; we refer to this as EI + Transfer (EI + Tr).
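For reference, the sketch below shows the shape of this kernel in NumPy for discrete states, assuming a unit-variance Matérn 5/2 form for $M$; the parameter names are illustrative, and in practice the kernel would be implemented in a GP library with trainable parameters.

```python
import numpy as np

def matern52(X, X2, lengthscales):
    # Unit-variance Matérn 5/2 kernel with ARD lengthscales over action rows.
    r = np.sqrt((((X[:, None, :] - X2[None, :, :]) / lengthscales) ** 2).sum(-1))
    return (1 + np.sqrt(5) * r + 5 * r ** 2 / 3) * np.exp(-np.sqrt(5) * r)

def crn_kernel(S, X, S2, X2, var0, ls0, var1, ls1, var2):
    # k((s,x),(s',x')) = var0 M(x,x'; ls0) + delta_{s s'}(var1 M(x,x'; ls1) + var2):
    # a common trend over actions plus a per-state deviation and constant offset.
    same_state = (S[:, None] == S2[None, :]).astype(float)
    return (var0 * matern52(X, X2, ls0)
            + same_state * (var1 * matern52(X, X2, ls1) + var2))

# Example: covariance of 3 (state, action) pairs with themselves.
S = np.array([0, 0, 1]); X = np.random.rand(3, 2)
K = crn_kernel(S, X, S, X, var0=1.0, ls0=0.5, var1=0.3, ls1=0.2, var2=0.1)
```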
Policy Optimization versus Data Optimization: in discrete state settings where the number of states is much lower than the sampling budget, $|S| \ll N$, a user may desire to find good observed (stochastic) values $\max y$ for each state instead of finding $\pi(s) = \arg\max_x \mathbb{E}[f(s,x)]$. Here, we refer to the former as data optimization (DO) and the latter as policy optimization (PO). For stochastic simulation, PO methods are used to find actions that have reproducible average rewards, as the actions will be deployed for real-world use.
[Figure 5 panels: mean validation error versus $N$ (left); validation error at 50 samples (centre-left); validation error at 100 samples (centre-right); acquisition overhead in seconds (right); methods EI + Tr, REVI, REVI DO, MTS, MTS DO, ConBO-3, ConBO-3 DO, ConBO-5, ConBO-5 DO.]
Figure 5: Left: validation error. Centre-left: validation error after 50 samples. Centre-right: validation error after 100 samples. Right: algorithm overhead in seconds. After 50 samples, none of the multi-task models outperform the baseline, EI + Tr (dashed line), suggesting all datasets can use similar hyperparameters. For the larger budget of 100, all models outperform the baseline, suggesting that for more fine-tuning each dataset requires different hyperparameters. In all cases, performing data optimization significantly increases performance.

In neural network training we mainly intend to find a network with good validation error regardless of the hyperparameters, as the network will be deployed for real-world use; hence DO is preferred. Past work (Pearce and Branke, 2018) showed that a PO method can be used to warm start a DO method by performing PO and reserving samples at the end to allocate one per state, with the action determined by EI (which is a DO algorithm). This was shown to significantly outperform using a DO algorithm for all samples. We apply this "DO trick" to all algorithms in this experiment.

Results are shown in Figure 5. For the medium budget of 50 samples, ConBO performs best of the standard algorithms, yet it is still worse than the baseline. Applying the final round of DO improves all results to match the baseline. For the large budget of 100 samples, all methods outperform the baseline, suggesting dataset specific fine-tuning of hyperparameters is required to achieve the best results. Again, DO provides a significant boost to the performance of all methods.

Gaussian process kernel parameter learning required approximately 2–5 seconds using Tensorflow. In Figure 5 (right) we show the run time of each algorithm excluding model fitting and network training, i.e. purely acquisition function optimization time. MTS and ConBO-3 are quickest, while ConBO-5 increases linearly over ConBO-3 and REVI takes much longer.

In future work we intend to investigate more hyperparameter specific extensions such as early stopping, variable training set size, pretraining, etc. In these preliminary experiments we consider the base case and treat the training process as a black box.

We apply all the methods from Section 5.1 to two benchmarks from a library of simulation optimization problems. The ambulance problem (AMB) consists of a range of cities with different population distributions, and one must optimize ambulance base locations for each city. The assemble-to-order problem (ATO) consists of a range of warehouses that face different demand, and for each warehouse the target stock level must be optimized.

Results are shown in Figure 6. Of the policy based methods, PG performs poorly and does not show on the plots, whilst KNN performs poorly on AMB and well on ATO, suggesting AMB is a more difficult problem. Of the GP based methods, EI performs well. Although it is not a conditional algorithm, we include it to highlight that sometimes the simplest idea can also work; note that learning the global optimum $\arg\max_{s,x} f(s,x)$ also learns the corresponding part of the policy for the "optimum" state, $\pi(s^*)$. However, we emphasize that this behaviour will not generalize to other problems, as shown by the synthetic results and the premature convergence on the ATO problem. Of the conditional methods, MTS, REVI, and ConBO-3 all perform similarly, either slightly (AMB) or largely (ATO) outperforming UNI. ConBO-5 uses a more accurate acquisition function and is the only method that learns a good policy on both problems.
This implies that these problems are more difficult than the synthetic and CNN benchmarks and truly stress test conditional algorithms.

Parallelization
Unique to the conditional setting, one may execute function calls on different states in parallel, as long as the states are far apart and there is little interaction. In the SM we use sequential batch construction by penalization (González et al., 2016) for ConBO and observe almost perfect linear scaling on AMB and mixed results on ATO, with almost no added computation overhead.
[Figure 6 panels: ambulance journey times versus $N$ (left) and warehouse negative profit versus $N$ (right), for KNN, UNI, ConBO-3, REVI, MTS, EI and ConBO-5.]
Figure 6: Left: average journey times across a range of cities. Right: average profit across a range of warehouses. ConBO-5 and EI perform best on these benchmarks.
We propose ConBO, a simple, theoretically grounded algorithm framework for conditional Bayesian optimization. We investigate this method on a range of problems; in all benchmarks, compared to state of the art methods, ConBO-5 was either the best or joint best algorithm, providing the most reliable and consistent performance, with ConBO-3 a cheaper alternative. We also emphasise the caution required with additive kernels and the conflict between data optimization and policy optimization.
Acknowledgements
We would like to thank Ayman Boustati for helpful discussions and code review. We are grateful for the emotional feline support of Fred, Tonks, Luna, and her highness Ogden Balmer Von Stroppleslouth the Third.
References
R. Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. Collaborative hyperparameter tuning. In International Conference on Machine Learning, pages 199–207, 2013.

Ian Char, Youngseog Chung, Willie Neiswanger, Kirthevasan Kandasamy, Andrew Oakleigh Nelson, Mark Boyer, Egemen Kolemen, and Jeff Schneider. Offline contextual Bayesian optimization. In Advances in Neural Information Processing Systems, pages 4629–4640, 2019.

Youngseog Chung, Ian Char, Willie Neiswanger, Kirthevasan Kandasamy, Andrew Oakleigh Nelson, Mark D. Boyer, Egemen Kolemen, and Jeff Schneider. Offline contextual Bayesian optimization for nuclear fusion. arXiv preprint arXiv:2001.01793, 2020.

Naijia Anna Dong, David J. Eckman, Xueqi Zhao, Shane G. Henderson, and Matthias Poloczek. Empirically comparing the finite-time performance of simulation-optimization algorithms. In Winter Simulation Conference (WSC), pages 2206–2217. IEEE, 2017.

Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

P. Frazier, W. Powell, and S. Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599–613, 2009.

D. Ginsbourger, J. Baccou, C. Chevalier, F. Perales, N. Garland, and Y. Monerie. Bayesian adaptive reconstruction of profile optima and optimizers. SIAM/ASA Journal on Uncertainty Quantification, 2(1):490–510, 2014a.

David Ginsbourger, Jean Baccou, Clément Chevalier, Frédéric Perales, Nicolas Garland, and Yann Monerie. Bayesian adaptive reconstruction of profile optima and optimizers. SIAM/ASA Journal on Uncertainty Quantification, 2(1):490–510, 2014b.

Javier González, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch Bayesian optimization via local penalization. In Artificial Intelligence and Statistics, pages 648–657, 2016.

Matthew Groves, Michael Pearce, and Juergen Branke. On parallelizing multi-task Bayesian optimization. In Winter Simulation Conference (WSC), pages 1993–2002. IEEE, 2018.

José Miguel Hernández-Lobato, James Requeima, Edward O. Pyzer-Knapp, and Alán Aspuru-Guzik. Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In Proceedings of the 34th International Conference on Machine Learning, pages 1470–1479. JMLR.org, 2017.

Deng Huang, Theodore T. Allen, William I. Notz, and R. Allen Miller. Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 32(5):369–382, 2006a.

Deng Huang, T. T. Allen, W. I. Notz, and R. A. Miller. Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 32(5):369–382, 2006b.

Shinkyu Jeong, Mitsuhiro Murayama, and Kazuomi Yamamoto. Efficient optimization design method using kriging model. Journal of Aircraft, 42(2):413–420, 2005.

D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning, pages 295–304, 2015.

Kirthevasan Kandasamy, Gautam Dasarathy, Junier B. Oliva, Jeff Schneider, and Barnabás Póczos. Gaussian process bandit optimisation with multi-fidelity evaluations. In Advances in Neural Information Processing Systems, pages 992–1000, 2016.

Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, and Barnabás Póczos. Multi-fidelity Bayesian optimisation with continuous approximations. In Proceedings of the 34th International Conference on Machine Learning, pages 1799–1808. JMLR.org, 2017.

Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabás Póczos. Parallelised Bayesian optimisation via Thompson sampling. In International Conference on Artificial Intelligence and Statistics, pages 133–142, 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Andreas Krause and Cheng S. Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.

S. Morales-Enciso and J. Branke. Tracking global optima in dynamic environments with efficient global optimization. European Journal of Operational Research, 242:744–755, 2015.

Supratik Paul, Michael A. Osborne, and Shimon Whiteson. Contextual policy optimisation. arXiv preprint arXiv:1805.10662, 2018.

M. Pearce and J. Branke. Efficient information collection on portfolios. Technical report, University of Warwick, 2016.

Michael Pearce and Juergen Branke. Continuous multi-task Bayesian optimisation with correlation. European Journal of Operational Research, 270(3):1074–1085, 2018.

Michael Pearce, Matthias Poloczek, and Juergen Branke. Bayesian optimization allowing for common random numbers. arXiv preprint arXiv:1910.09259, 2019.

Valerio Perrone, Rodolphe Jenatton, Matthias W. Seeger, and Cédric Archambeau. Scalable hyperparameter transfer learning. In Advances in Neural Information Processing Systems, pages 6845–6855, 2018.

Victor Picheny and David Ginsbourger. A nonstationary space-time Gaussian process model for partially converged simulations. SIAM/ASA Journal on Uncertainty Quantification, 1(1):57–78, 2013.

Matthias Poloczek, Jialei Wang, and Peter I. Frazier. Warm starting Bayesian optimization. In Winter Simulation Conference (WSC), pages 770–781. IEEE, 2016.

Matthias Poloczek, Jialei Wang, and Peter Frazier. Multi-information source optimization. In Advances in Neural Information Processing Systems, pages 4289–4299, 2017.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2004. ISBN 026218253X.

Diariétou Sambakhé, Lauriane Rouan, Jean-Noël Bacro, and Eric Gozé. Conditional optimization of a noisy function using a kriging metamodel. Journal of Global Optimization, 73(3):615–636, 2019.

Warren Scott, Peter Frazier, and Warren Powell. The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM Journal on Optimization, 21(3):996–1026, 2011.

Kate Smith-Miles, Davaatseren Baatar, Brendan Wreford, and Rhyd Lewis. Towards objective measures of algorithm performance across instance space. Computers and Operations Research, 45:12–24, 2014. doi: 10.1016/j.cor.2013.11.015.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

Saul Toscano-Palmerin and Peter I. Frazier. Bayesian optimization with expensive integrands. arXiv preprint arXiv:1803.08661, 2018.

Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient Bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning, pages 3627–3635. JMLR.org, 2017.

Jian Wu and Peter I. Frazier. Discretization-free knowledge gradient methods for Bayesian optimization. arXiv preprint arXiv:1707.06541, 2017.

Jing Xie, Peter I. Frazier, and Stephen E. Chick. Bayesian optimization via simulation with pairwise sampling and correlated prior beliefs. Operations Research, 64(2):542–559, 2016.
A Theoretical Results
We restate the theorems from the main paper and provide each proof. Firstly, in Theorem 1 we show that ConBO with knowledge gradient myopically maximises the Value of Information in a Bayesian decision theoretic framework. In Theorem 2 we show that in discrete settings ConBO will sample all pairs infinitely often. Finally, in Theorem 3 we prove conditions for hybrid KG to satisfy the results of Theorems 1 and 2.
Theorem 1
Let $(s^*, x^*) = \arg\max \text{ConBO}(s,x)$ be a point chosen for sampling. $(s^*, x^*)$ is also the point that maximises the myopic Value of Information, the increase in predicted policy reward.

Proof of Theorem 1. Given all the information available at time $n$, $\tilde{X}^n, Y^n$, and the model $\mu^n(s,x)$, $k^n((s,x),(s',x'))$, for any given state and action $(s,x)$ and a given realization of the true reward function $f(\cdot)$, in a Bayesian decision theoretic framework the loss is given by the negated output of the function,
$$\text{Loss}(s,x) = -f(s,x),$$
and the expected loss is the risk function,
$$\text{Risk}(s,x) = \mathbb{E}[\text{Loss}(s,x) \,|\, \tilde{X}^n, Y^n] = \mathbb{E}[-f(s,x) \,|\, \tilde{X}^n, Y^n] = -\mu^n(s,x).$$
For convenience we leave the conditioning on $\tilde{X}^n, Y^n$ implicit in the remaining equations. The optimal action minimizes risk,
$$x_{\text{optimal}} = \arg\min_x \text{Risk}(s,x) = \arg\min_x -\mu^n(s,x).$$
Alternatively, $\pi(s) = \arg\max_x \mu^n(s,x)$ is the Bayesian decision theoretic optimal action given all data available at time $n$. The total risk is the risk of the optimal actions over all states, i.e. the risk of the policy,
$$\text{Total Risk}(n) = -\int_s \max_x \mu^n(s,x) \, P[s] \, ds,$$
which is the negative of the model's best prediction of true reward given data up to time $n$, where we have made $n$ an explicit argument for convenience. Next assume we are able to collect more data to update the model: choose $(s,x)^{n+1}$ and observe $y^{n+1}$. The myopic Value of Information of a sample is the negative expected future risk,
$$\text{VoI}((s,x)^{n+1}) = -\mathbb{E}_{y^{n+1}}\!\left[\text{Total Risk}(n+1) \,\middle|\, (s,x)^{n+1}\right]. \quad (11, 12)$$
Note that $\arg\max \text{VoI}((s,x)^{n+1})$ is not affected by adding terms that do not depend on $(s,x)^{n+1}$; thus we may subtract the current $\text{Total Risk}(n)$. The difference between risks then simplifies to
$$\mathbb{E}_{y^{n+1}}\!\left[\int_s \max_x \mu^{n+1}(s,x) P[s] \, ds \,\middle|\, (s,x)^{n+1}\right] - \int_{s'} \max_x \mu^n(s',x) P[s'] \, ds' \quad (13)$$
$$= \mathbb{E}_{y^{n+1}}\!\left[\int_s \left(\max_x \mu^{n+1}(s,x) - \max_x \mu^n(s,x)\right) P[s] \, ds \,\middle|\, (s,x)^{n+1}\right] \quad (14)$$
$$= \int_s \left(\mathbb{E}_{y^{n+1}}\!\left[\max_x \mu^{n+1}(s,x) \,\middle|\, (s,x)^{n+1}\right] - \max_x \mu^n(s,x)\right) P[s] \, ds \quad (15)$$
$$= \int_s KG^c(s; (s,x)^{n+1}) \, P[s] \, ds \quad (16)$$
$$= \text{ConBO}((s,x)^{n+1}). \quad (17)$$
Therefore $\arg\max \text{VoI}((s,x)^{n+1}) = \arg\max \text{ConBO}((s,x)^{n+1})$. □

Theorem 2
Let $S$ and $X$ be finite sets and $N$ the budget to be sequentially allocated by ConBO. Let $n(s, x, N)$ be the number of samples allocated to point $(s,x)$ within budget $N$. Then for all $(s,x) \in S \times X$ we have that $\lim_{N \to \infty} n(s, x, N) = \infty$.

We require some intermediate results; firstly, that ConBO is non-negative.
Lemma 1
Let $(s,x)^{n+1} \in S \times X$. Then $\text{ConBO}((s,x)^{n+1}) \geq 0$.

Proof of Lemma 1.
$$\text{ConBO}((s,x)^{n+1}) = \sum_{s'} \mathbb{E}_Z\!\left[\max_{x'} \left\{\mu^n(s',x') + \tilde{\sigma}(s',x'; (s,x)^{n+1}) Z\right\}\right] - \max_{x''} \mu^n(s',x'') \quad (18)$$
$$\geq \sum_{s'} \mathbb{E}_Z\!\left[\mu^n(s', \pi^n(s')) + \tilde{\sigma}(s', \pi^n(s'); (s,x)^{n+1}) Z\right] - \max_{x''} \mu^n(s',x'') \quad (19)$$
$$= \sum_{s'} \max_{x'} \mu^n(s',x') + \tilde{\sigma}(s', \pi^n(s'); (s,x)^{n+1}) \, \mathbb{E}_Z[Z] - \max_{x''} \mu^n(s',x'') \quad (20)$$
$$= \sum_{s'} \max_{x'} \mu^n(s',x') - \max_{x''} \mu^n(s',x'') \quad (21)$$
$$= 0. \quad (22)$$
□

Secondly, we require that $\text{ConBO}(s,x)$ reduces to zero for an infinitely sampled pair. Note that for deterministic $f(s,x)$, the result simplifies to: $\text{ConBO}(s,x)$ is zero for any sampled pair.

Lemma 2
Let $(s,x)^{n+1} \in S \times X$ with $n(s,x) = \infty$. Then $\text{ConBO}((s,x)^{n+1}) = 0$.

Proof of Lemma 2.
Given infinitely many finite variance observations of $f(s,x)$, we have that $\mu^n(s,x) = \mathbb{E}[f(s,x)]$ and the posterior variance is zero, $k^n(s,x,s,x) = 0$. By the positive definiteness of the kernel we also have that $k^n(s,x,s',x') = 0$ for all $(s',x') \in S \times X$ (see Pearce and Branke (2016), Lemma 3). It follows that $\tilde{\sigma}(s',x'; (s,x)^{n+1}) = 0$ for all $(s',x') \in S \times X$ and thus
$$KG^c(s'; (s,x)^{n+1}) = \mathbb{E}_Z\!\left[\max_{x'} \mu^n(s',x') + \tilde{\sigma}(s',x'; (s,x)^{n+1}) Z\right] - \max_{x''} \mu^n(s',x'') \quad (23)$$
$$= \mathbb{E}_Z\!\left[\max_{x'} \mu^n(s',x') + 0 \cdot Z\right] - \max_{x''} \mu^n(s',x'') \quad (24)$$
$$= \max_{x'} \mu^n(s',x') - \max_{x''} \mu^n(s',x'') \quad (25)$$
$$= 0, \quad (26)$$
and therefore $\text{ConBO}((s,x)^{n+1}) = \int_s 0 \cdot P[s] \, ds = 0$. □

Thirdly, we require the converse of Lemma 2: points for which $\text{ConBO}(s,x) > 0$ must have non-zero variance, $k^n(s,x,s,x) > 0$ (and therefore cannot be infinitely sampled).

Lemma 3
Let $(s,x)^{n+1} \in S \times X$ be a point for which $\text{ConBO}((s,x)^{n+1}) > 0$. Then $n(s,x) < \infty$.

Proof of Lemma 3. $\text{ConBO}((s,x)^{n+1}) > 0$ implies that there exists an $s' \in S$ such that $KG^c(s'; (s,x)^{n+1}) > 0$. By the contrapositive of Lemma 3 in Poloczek et al. (2017), we must have that $k^n(s',x', (s,x)^{n+1})$ is not a constant function of $x'$. If $(s,x)^{n+1}$ were infinitely sampled, then $k^n(s',x', (s,x)^{n+1})$ would be a constant function of $x'$ (identically zero, by Lemma 2); thus $(s,x)^{n+1}$ is not infinitely sampled. □

Finally, combining the previous Lemmas we can complete the proof.
Proof of Theorem 2
By Lemmas 1 and 2, any infinitely sampled points become minima of the function $\text{ConBO}(s,x)$. By construction, the ConBO algorithm chooses points at maxima, $(s,x)^{n+1} = \arg\max \text{ConBO}(s,x)$. Thus in the infinite budget limit we have $\text{ConBO}(s,x) = 0$ for all $(s,x) \in S \times X$, and by the contrapositive of Lemma 3 we have that $n(s,x) = \infty$ for all points. □

Theorem 3
Let $n_z \geq 2$ and let $Z = \{Z_j \,|\, j = 1, \ldots, n_z\}$. If $0 \in Z$ then $KG_h(x) \geq 0$ for all $x \in X$, and if $x$ is sampled infinitely often then $KG_h(x) = 0$.

Proof of Theorem 3. First consider the base case $n_z = 2$. Let $Z = \{0, Z_2\}$ and, given $x^{n+1}$, let $X^* = \{x^*_1, x^*_2\}$ be the optimal discretization as found by Algorithm 2. Then
$$x^*_1 = \arg\max_x \mu^n(x) + \tilde{\sigma}(x; x^{n+1}) \cdot 0 = \arg\max_x \mu^n(x)$$
and therefore $\mu^n(x^*_1) = \max_x \mu^n(x)$. Let $\mu^* = \mu^n(X^*)$ and $\tilde{\sigma}^* = \tilde{\sigma}(X^*, x^{n+1})$. Then we have that
$$KG_h(x^{n+1}) = \mathbb{E}_Z\!\left[\max\{\mu^*_1 + \tilde{\sigma}^*_1 Z, \; \mu^*_2 + \tilde{\sigma}^*_2 Z\}\right] - \max_x \mu^n(x) \quad (27)$$
$$= \mathbb{E}_Z\!\left[\max\{\max_x \mu^n(x) + \tilde{\sigma}^*_1 Z, \; \mu^*_2 + \tilde{\sigma}^*_2 Z\}\right] - \max_x \mu^n(x) \quad (28)$$
$$= \mathbb{E}_Z\!\left[\max\{\tilde{\sigma}^*_1 Z, \; \mu^*_2 - \max_x \mu^n(x) + \tilde{\sigma}^*_2 Z\}\right] \quad (29)$$
$$\geq \max\left\{\mathbb{E}_Z[\tilde{\sigma}^*_1 Z], \; \mathbb{E}_Z\!\left[\mu^*_2 - \max_x \mu^n(x) + \tilde{\sigma}^*_2 Z\right]\right\} \quad (30)$$
$$= \max\left\{0, \; \mu^*_2 - \max_x \mu^n(x)\right\} \quad (31)$$
$$= 0, \quad (32)$$
where we use Jensen's inequality in the penultimate line and $\mu^*_2 \leq \max_x \mu^n(x)$ in the final line. The result extends trivially to the case $n_z > 2$. The proof that $KG_h(x) = 0$ at infinitely sampled points follows the proof of Lemma 2. □

B Computing ConBO and Hybrid Knowledge Gradient
Algorithm 1
Computing ConBO. The algorithm requires a new point, discretization sizes, past data, posterior GP functions, and an optimizer.
Require: $\tilde{x}^{n+1}$, $n_s$, $n_z$, $\tilde{X}^n$, $Y^n$, $\mu^n(s,x)$, $k^n((s,x),(s',x'))$, $\sigma_\epsilon$, Optimizer()
Precompute and cache $\left(k(\tilde{X}^n, \tilde{X}^n) + \sigma_\epsilon^2 I\right)^{-1} k(\tilde{X}^n, \tilde{x}^{n+1})$
$C \leftarrow 0$
for $i$ in $1, \ldots, n_s$ do
    $s_i \sim \mathcal{N}(s_i \,|\, s^{n+1}, \text{diag}(l_s^2))$
    $KG_i \leftarrow \text{Algorithm 2}(\tilde{x}^{n+1}, s_i, \ldots)$
    $w_i \leftarrow P[s_i] \,/\, \mathcal{N}(s_i \,|\, s^{n+1}, \text{diag}(l_s^2))$
    $C \leftarrow C + w_i \, KG_i / n_s$
end for
return $C$
Computing Hybrid Knowledge Gradient. The algorithm requires a new point, a state, a discretization size, past data, posterior GP functions, and an optimizer.
Require: $\tilde{x}^{n+1}$, $s$, $n_z$, $\tilde{X}^n$, $Y^n$, $\mu^n(s,x)$, $k^n((s,x),(s',x'))$, $\sigma_\epsilon$, Optimizer()
$\tilde{X}^{n+1} \leftarrow \tilde{X}^n \cup \{\tilde{x}^{n+1}\}$
$X^* \leftarrow \{\}$
Precompute and cache $\left(k(\tilde{X}^n, \tilde{X}^n) + \sigma_\epsilon^2 I\right)^{-1} k(\tilde{X}^n, \tilde{x}^{n+1})$
for $j$ in $1, \ldots, n_z$ do
    $Z_j \leftarrow \Phi^{-1}\!\left(\frac{j - 0.5}{n_z}\right)$
    compute $\tilde{Y}^{n+1}_j$ from Equation 35
    $\mu^{n+1}_j(s,x) \leftarrow \mu_0(s,x) + k_0((s,x), \tilde{X}^{n+1}) \tilde{Y}^{n+1}_j$
    $x^*_j \leftarrow \arg\max_x \mu^{n+1}_j(s,x)$ using Optimizer()
    $X^* \leftarrow X^* \cup \{x^*_j\}$
end for
$\mu \leftarrow \mu^n(s, X^*)$
$\sigma \leftarrow k^n((s, X^*), \tilde{x}^{n+1}) \,/\, \sqrt{k^n(\tilde{x}^{n+1}, \tilde{x}^{n+1}) + \sigma_\epsilon^2}$
$KG \leftarrow \text{Algorithm 3}(\mu, \sigma)$
return $KG$

Algorithm 3
Knowledge Gradient by discretization. This algorithm takes as input a set of linear functions parameterised by a vector of intercepts $\mu$ and a vector of gradients $\sigma$. It computes the intersections of the piece-wise linear epigraph (ceiling) of the functions and the expectation of the output of the function given Gaussian input. Vector indices are assumed to start from 0.

Require: $\mu, \sigma \in \mathbb{R}^{n_A}$
$O \leftarrow \text{order}(\sigma)$
$\mu \leftarrow \mu[O]$, $\sigma \leftarrow \sigma[O]$
$I \leftarrow [0, 1]$
$\tilde{Z} \leftarrow \left[-\infty, \frac{\mu_0 - \mu_1}{\sigma_1 - \sigma_0}\right]$
for $i = 2$ to $n_A - 1$ do
    ($\star$) $j \leftarrow \text{last}(I)$
    $z \leftarrow \frac{\mu_i - \mu_j}{\sigma_j - \sigma_i}$
    if $z < \text{last}(\tilde{Z})$ then
        delete the last element of $I$ and of $\tilde{Z}$
        return to ($\star$)
    end if
    add $i$ to the end of $I$ and $z$ to $\tilde{Z}$
end for
$\tilde{Z} \leftarrow [\tilde{Z}, \infty]$
$A \leftarrow \phi(\tilde{Z}[1{:}]) - \phi(\tilde{Z}[{:}-1])$
$B \leftarrow \Phi(\tilde{Z}[1{:}]) - \Phi(\tilde{Z}[{:}-1])$
$KG \leftarrow B^T \mu[I] - A^T \sigma[I] - \max \mu$
return $KG$
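A runnable NumPy translation of Algorithm 3 is sketched below; for brevity it assumes the slopes $\sigma_i$ are distinct, which holds generically for GP posterior quantities.

```python
import numpy as np
from scipy.stats import norm

def kg_discrete(mu, sigma):
    # Algorithm 3: E_Z[max_i(mu_i + sigma_i Z)] - max_i mu_i for Z ~ N(0,1),
    # computed exactly from the breakpoints of the piecewise-linear envelope.
    order = np.argsort(sigma)
    mu, sigma = np.asarray(mu, float)[order], np.asarray(sigma, float)[order]
    keep, z = [0], [-np.inf]        # envelope line indices and breakpoints
    for i in range(1, len(mu)):
        while True:
            j = keep[-1]
            zi = (mu[i] - mu[j]) / (sigma[j] - sigma[i])  # intersection with line j
            if zi > z[-1]:
                break
            keep.pop(); z.pop()     # line j never attains the envelope: prune it
        keep.append(i); z.append(zi)
    z = np.array(z + [np.inf]); keep = np.array(keep)
    A = norm.pdf(z[1:]) - norm.pdf(z[:-1])
    B = norm.cdf(z[1:]) - norm.cdf(z[:-1])
    return B @ mu[keep] - A @ sigma[keep] - mu.max()

# Example: the middle line is dominated everywhere and gets pruned.
print(kg_discrete([0.0, 1.0, 0.2], [1.0, 0.1, 0.5]))   # ~0.06, always >= 0
```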
This cost can be reduced to linear, as we now show. We focus on the first factor, k^n(s, x, x̃^{n+1}), which may also be factorized:

  k^n(s, x, x̃^{n+1}) = k(s, x, x̃^{n+1}) − k(s, x, X̃^n) K^{−1} k(X̃^n, x̃^{n+1})   (41)
                      = [k(s, x, X̃^n), k(s, x, x̃^{n+1})] [−K^{−1} k(X̃^n, x̃^{n+1}); 1]   (42)

where the first (row) vector is simply k(s, x, X̃^{n+1}). Combining Equations 34 and 42 yields

  μ^{n+1}(s, x) = μ(s, x) + k(s, x, X̃^{n+1}) ([Ỹ^n; 0] + (Z / σ^n_{y^{n+1}}(x̃^{n+1})) [−K^{−1} k(X̃^n, x̃^{n+1}); 1])   (43)
                = μ(s, x) + k(s, x, X̃^{n+1}) Ỹ^{n+1}.   (44)

The quantity Ỹ^n is precomputed at the start of each algorithm iteration, and the quantity −K^{−1} k(X̃^n, x̃^{n+1}) has quadratic cost but is computed once and reused for σ^n_{y^{n+1}}(x̃^{n+1}). Sampling posterior mean functions then reduces to sampling n_z values z_1, ..., z_{n_z} ∼ N(0, 1) and computing the corresponding Ỹ^{n+1}_1, ..., Ỹ^{n+1}_{n_z}; each sampled posterior mean is just the weighted average given by Equation 44.
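As a sanity check of this bookkeeping, the following NumPy sketch computes the one-step weight vectors of Equation 35; k0 is a stand-in for the prior kernel function and the cached quantities follow the notation above.

import numpy as np

def sample_one_step_weights(k0, X_n, Y_tilde, K_inv, x_new, sigma_eps2, z_samples):
    # One weight vector per z-score (Equation 35); each sampled posterior mean
    # is then mu(s, x) + k((s, x), X_{n+1}) @ weights (Equation 44).
    k_vec = k0(X_n, x_new)                        # k(X^n, x_new), shape (n,)
    u = K_inv @ k_vec                             # quadratic cost, computed once
    sd_new = np.sqrt(k0(x_new, x_new) + sigma_eps2 - k_vec @ u)
    update_dir = np.append(-u, 1.0)               # [-K^{-1} k(X^n, x_new); 1]
    base = np.append(Y_tilde, 0.0)                # [Y_tilde^n; 0]
    return [base + (z / sd_new) * update_dir for z in z_samples]  # linear per sample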
B.2 Choice of Optimizer()
Since evaluating Equation 44 for many points is simply a matrix multiplication, random search is cheap to evaluate in parallel. After random search, the gradient of Equation 44 with respect to (s, x) is easily computed, and, starting from the best random search point, gradient ascent over x ∈ X can be used to find the optimal action. This varies with kernel choice and application; we describe our settings in Section C.3.

We start with a set of sampled means μ^{n+1}_1(s, x), ..., μ^{n+1}_{n_z}(s, x) and a set of sampled states s_1, ..., s_{n_s}. For each state s_i, the set of n_z optimal actions X_{d,s_i} is found by optimizing the n_z posterior means,

  X_{d,s_i} = ∪_{j=1}^{n_z} { argmax_x μ^{n+1}_j(s_i, x) }.

Finally, for each point in the (state, action) discretization {s_i} × X_{d,s_i} = X̃_{d,s_i}, we evaluate two quantities: firstly, the vector of current posterior means μ_{s_i} ∈ ℝ^{n_z},

  μ_{s_i} = μ^n(X̃_{d,s_i})   (45)
          = μ(X̃_{d,s_i}) + k(X̃_{d,s_i}, X̃^n) Ỹ^n   (46)

and secondly the vector of additive updates σ_{s_i} ∈ ℝ^{n_z},

  σ_{s_i} = k^n(X̃_{d,s_i}, x̃^{n+1}) / σ^n_{y^{n+1}}(x̃^{n+1})   (47)
          = k(X̃_{d,s_i}, X̃^{n+1}) [−K^{−1} k(X̃^n, x̃^{n+1}); 1] / σ^n_{y^{n+1}}(x̃^{n+1}).   (48)

Both vectors are differentiable with respect to the sample point, ∇_{x̃^{n+1}} μ_{s_i} and ∇_{x̃^{n+1}} σ_{s_i}, and they are used to analytically compute the piecewise-linear quantity E_Z[max(μ_{s_i} + Z σ_{s_i})] − max μ_{s_i}, which is also differentiable. Thus, assuming fixed X̃_{d,s_1}, ..., X̃_{d,s_{n_s}}, approximate gradients can be computed and used in any stochastic gradient ascent optimizer.
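The piecewise-linear expectation above admits a compact implementation. The following self-contained NumPy sketch performs the same computation as Algorithm 3, with ties in σ handled explicitly:

import numpy as np
from scipy.stats import norm

def kg_discrete(mu, sigma):
    # Exact E_Z[max_i(mu_i + sigma_i Z)] - max_i mu_i for Z ~ N(0, 1),
    # computed via the piecewise-linear epigraph (Algorithm 3).
    order = np.argsort(sigma, kind="stable")
    mu = np.asarray(mu, dtype=float)[order]
    sigma = np.asarray(sigma, dtype=float)[order]
    I, Z = [0], [-np.inf]           # envelope line indices and left breakpoints
    for i in range(1, len(mu)):
        if sigma[i] == sigma[I[-1]]:
            if mu[i] <= mu[I[-1]]:
                continue            # parallel and below: never on the envelope
            I.pop(); Z.pop()        # parallel and above: previous line never wins
        while I:
            z = (mu[i] - mu[I[-1]]) / (sigma[I[-1]] - sigma[i])
            if z < Z[-1]:
                I.pop(); Z.pop()    # last line is dominated before it starts winning
            else:
                break
        Z.append(z if I else -np.inf)
        I.append(i)
    Z = np.array(Z + [np.inf])
    A = norm.pdf(Z[1:]) - norm.pdf(Z[:-1])
    B = norm.cdf(Z[1:]) - norm.cdf(Z[:-1])
    return B @ mu[I] - A @ sigma[I] - mu.max()

For example, kg_discrete(np.zeros(2), np.array([0.0, 1.0])) returns φ(0) ≈ 0.3989, i.e. E[max(0, Z)] for a standard normal Z.

C Implementation Details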
C.1 REVI
At iteration n of the algorithm, we used a discretization of size n_disc = 2n, split equally amongst actions and states, n_s = n_x = ⌈√n_disc⌉. States are sampled from P[s] and actions are sampled as a Latin hypercube over X. The acquisition function is optimized by 100 points of random search over S × X followed by Nelder-Mead ascent starting from the best 20 points of the random search phase.
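A sketch of this two-stage search, with a hypothetical acquisition callable over the joint (state, action) vector and box bounds on S × X:

import numpy as np
from scipy.optimize import minimize

def optimize_acquisition(acquisition, bounds, n_random=100, n_starts=20, seed=None):
    # Random search over S x X, then Nelder-Mead from the best points found.
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    cands = rng.uniform(lo, hi, size=(n_random, len(lo)))
    vals = np.array([acquisition(c) for c in cands])
    best_x, best_val = None, -np.inf
    for c in cands[np.argsort(vals)[-n_starts:]]:    # top starting points
        res = minimize(lambda p: -acquisition(p), c, method="Nelder-Mead")
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x

C.2 MTS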
We use a target discretization size of n_disc = 3000. Given d_s state dimensions and d_x action dimensions, we sample states uniformly; the number of sampled states is n_s = ⌈n_disc^{d_s/(d_s + d_x)}⌉ and the number of actions per state is n_x = ⌈n_disc/n_s⌉, so that n_s · n_x ≈ n_disc. This way the discretization density over all state and action dimensions is roughly constant. For each sampled state s_i, n_x actions are generated in three ways. Firstly, the policy is evaluated, x^π_{s_i} = π^n(s_i), and we generate 40 actions around this policy action. Secondly, we take the 10 nearest-neighbour states from the training set, and the points with the 4 largest y values are added to the discretization set along with randomly generated neighbours. Finally, the remaining actions in the n_x budget come from uniform random sampling over X. Each s_i therefore has a bespoke action discretization. Sampled functions are drawn using the NumPy random normal generator.
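For concreteness, the budget split can be computed as below; this is a direct transcription of the formulas above.

import math

def discretization_sizes(n_disc, d_s, d_x):
    # Allocate the discretization budget so density per dimension is constant.
    n_s = math.ceil(n_disc ** (d_s / (d_s + d_x)))   # number of sampled states
    n_x = math.ceil(n_disc / n_s)                    # actions per state
    return n_s, n_x                                  # n_s * n_x ~ n_disc

print(discretization_sizes(3000, 1, 2))  # (15, 200)

C.3 ConBO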
Each sampled posterior mean function was optimized in two steps. For a given state s_i, the action discretization used by MTS, reduced to 40 points in total, was first used for parallel random search. The best point was then refined by conjugate-gradient ascent for 20 steps.

When optimizing the sampled posterior mean function for which z_i = 0, that is μ^{n+1}(s, x) = μ^n(s, x), the optimum is given by the policy x^π_{s_i} = π^n(s_i). Since the same, or very similar, states may be used multiple times across different ConBO((s, x)^{n+1}) calls, we use caching to avoid repeated policy computation. Whenever the policy is queried for the optimal action of a given state, the final (state, action) pair is stored in a lookup table. Any future call to the policy function with a state s_j first checks the lookup table: if a very similar state exists, the same action is reused; if a somewhat similar state exists, its action is re-optimized as a warm start; and if no similar state exists, a full optimization is performed as above (see the sketch below).

In our experiments, the cache of stored policy calls is wiped clean before any testing, so ConBO is not given an unfair advantage at test time. In practical applications, this need not be the case.
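A minimal sketch of such a cache; the distance thresholds are illustrative placeholders rather than values from our experiments, and optimize_action stands in for the two-step optimizer above.

import numpy as np

class PolicyCache:
    # Cache (state, action) pairs so that repeated policy calls on
    # similar states can reuse or warm-start earlier optimizations.
    def __init__(self, optimize_action, tol_reuse=1e-3, tol_warm=0.1):
        self.optimize_action = optimize_action
        self.tol_reuse, self.tol_warm = tol_reuse, tol_warm
        self.states, self.actions = [], []

    def __call__(self, s):
        s = np.atleast_1d(np.asarray(s, dtype=float))
        if self.states:
            dists = np.linalg.norm(np.array(self.states) - s, axis=1)
            i = int(dists.argmin())
            if dists[i] < self.tol_reuse:          # very similar state: reuse action
                return self.actions[i]
            if dists[i] < self.tol_warm:           # somewhat similar: warm start
                x = self.optimize_action(s, x0=self.actions[i])
                self.states.append(s); self.actions.append(x)
                return x
        x = self.optimize_action(s, x0=None)       # no similar state: full optimization
        self.states.append(s); self.actions.append(x)
        return x

C.4 Policy Gradient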
The (stochastic) policy was a Gaussian with constant variance whose mean is a quadratic function of the state,

  π_θ(s) = P_θ[x | s] = N(x | sᵀ A s + B s + C, σ² I)

where A ∈ ℝ^{d_X × d_S × d_S}, B ∈ ℝ^{d_X × d_S}, C ∈ ℝ^{d_X} and θ = {A, B, C}. Given a dataset {(s, x, y)_i}_{i=1}^n, we first rescaled the s and x values to the hypercube, and the y values were standardized to have mean 0 and variance 1. We then used kernel regression to learn a baseline value

  V(s) = Σ_i k(s, s_i) y_i / Σ_i k(s, s_i)

where k(s, s′) = exp(−‖s − s′‖² / (2ℓ²)) with a fixed lengthscale ℓ. The parameters θ are found by optimizing the expected advantage

  Expected advantage = Σ_i P_θ[x_i | s_i] (y_i − V(s_i)).

At test time, given a state, the mean action is computed from the policy (accounting for rescaling to the hypercube and back) and recommended for use.
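A sketch of the baseline and objective, assuming pre-scaled arrays S, X, Y; the kernel lengthscale and policy variance below are illustrative constants, not the exact values used in our experiments.

import numpy as np

def kernel_baseline(S, Y, lengthscale=0.1):
    # Kernel-regression baseline V(s) = sum_i k(s, s_i) y_i / sum_i k(s, s_i).
    S, Y = np.asarray(S, dtype=float), np.asarray(Y, dtype=float)
    def V(s):
        w = np.exp(-0.5 * np.sum((S - s) ** 2, axis=1) / lengthscale ** 2)
        return w @ Y / w.sum()
    return V

def expected_advantage(A, B, C, S, X, Y, V, sigma2=0.1):
    # sum_i P_theta[x_i | s_i] (y_i - V(s_i)) for the Gaussian policy with
    # quadratic mean s^T A s + B s + C and constant variance sigma2.
    d_x = X.shape[1]
    total = 0.0
    for s, x, y in zip(S, X, Y):
        mean = np.einsum("kij,i,j->k", A, s, s) + B @ s + C
        diff = x - mean
        logp = -0.5 * (diff @ diff) / sigma2 - 0.5 * d_x * np.log(2 * np.pi * sigma2)
        total += np.exp(logp) * (y - V(s))
    return total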
C.5 CIFAR-10 Hyperparameter Experiment

Parameter Space:

• dropout_1 ∈ [0, .], linear scale
• dropout_2 ∈ [0, .], linear scale
• dropout_3 ∈ [0, .], linear scale
• learning_rate ∈ [0., .], log scale
• beta_1 ∈ [0., .], log scale
• beta_2 ∈ [0., .], log scale
• batch_size ∈ [16, ], log scale

Network architecture:

# Imports added for completeness; dropout_1/2/3, learning_rate, beta_1 and
# beta_2 are the tuned hyperparameters listed above.
from tensorflow.keras import optimizers
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv2D,
                                     Dense, Dropout, Flatten, Input, MaxPooling2D)
from tensorflow.keras.models import Model

x_in = Input(shape=(32, 32, 3))
x = Conv2D(filters=64, kernel_size=2, padding='same', use_bias=False)(x_in)
x = BatchNormalization()(x)
x = Activation("relu")(x)
x = MaxPooling2D(pool_size=2)(x)
x = Dropout(dropout_1)(x)
x = Conv2D(filters=32, kernel_size=2, padding='same', use_bias=False)(x)
x = BatchNormalization()(x)
x = Activation("relu")(x)
x = MaxPooling2D(pool_size=2)(x)
x = Dropout(dropout_2)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(dropout_3)(x)
x_out = Dense(2, activation='softmax')(x)

cnn = Model(inputs=x_in, outputs=x_out)
adam = optimizers.Adam(learning_rate=learning_rate,
                       beta_1=beta_1,
                       beta_2=beta_2)
cnn.compile(loss='categorical_crossentropy',
            optimizer=adam,
            metrics=['accuracy'])
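The sampled batch_size enters through the training call. A hypothetical usage for one evaluation, with x_train, y_train one CIFAR-10 partition and an illustrative epoch budget (the per-evaluation training budget is not restated here):

# Hypothetical training call for one sampled hyperparameter configuration.
cnn.fit(x_train, y_train, batch_size=int(batch_size), epochs=10, verbose=0)
val_accuracy = cnn.evaluate(x_val, y_val, verbose=0)[1]  # objective value y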
D Batch Construction by Sequential Penalization
For global optimization, parallelizing BO algorithms to suggest a batch of q inputs, {x^{n+1}, ..., x^{n+q}}, has been approached in multiple ways. For acquisition functions that compute an expectation over future outcomes P[y^{n+1} | x^{n+1}] (EI, KG, ES, MES), the acquisition value of a batch can be computed using the expectation over multiple correlated outcomes P[y^{n+1}, ..., y^{n+q} | x^{n+1}, ..., x^{n+q}]. This larger q-dimensional expectation, effectively looking q steps into the future, must be estimated by Monte-Carlo. At the same time, it is a function of all q points in the batch and must be optimized simultaneously over q times more dimensions, X^q. This method of parallelization quickly becomes infeasible for even moderate dimensions and batch sizes. As before, adapting the method to conditional optimization adds another layer of Monte-Carlo integration over s ∈ S, multiplying the computational cost.

Thompson sampling (TS) randomly suggests the next point to evaluate, x^{n+1}. TS also has the convenient mathematical property that q-step look-ahead is equivalent to generating q i.i.d. samples (Hernández-Lobato et al., 2017; Kandasamy et al., 2018). This property was used by Char et al. (2019) to parallelize MTS.

Alternatively, sequential construction of a batch can be done in O(q) time, and we consider the method of González et al. (2016). First, x^{n+1} is found by optimizing the chosen acquisition function, x^{n+1} = argmax_x α(x). The acquisition function is then multiplied by a non-negative penalty function φ(x, x^{n+1}) that penalizes x similar to x^{n+1}. The next point is found by x^{n+2} = argmax_x α(x) φ(x^{n+1}, x), then x^{n+3} = argmax_x α(x) φ(x^{n+1}, x) φ(x^{n+2}, x), and so on until a batch of q points is constructed. We use the inverted GP kernel as the penalty function

  φ((s, x), (s, x)_i) = 1 − k((s, x), (s, x)_i) / k((s, x)_i, (s, x)_i).

See Figure 7 for an illustration. Previous work (Groves et al., 2018) showed that the conditional optimization setting is well suited to this construction method. Two points on dissimilar states do not interact, and if they are both local peaks of the chosen acquisition function α(s, x), then both may be evaluated in parallel. In conditional optimization, the presence of state variables introduces multiple objective functions, allowing a batch of points to be more spread out, reducing interactions and possible inefficiencies. This can be achieved by penalization, thus sidestepping the need for expensive nested Monte-Carlo integration. By contrast, in global optimization all q points are "crammed" into a single state, all interacting with the same objective, which requires more care in batch construction.
Figure 7: Left: acquisition function over S × X with a peak at the first point in the batch, (s, x)^{n+1}. Centre: the penalization function that down-weighs any point (s, x) ∈ S × X according to its similarity with (s, x)^{n+1}. Right: the product of the acquisition and penalization functions, whose peak is the second point in the batch, (s, x)^{n+2}.

This method for batch construction may be applied to any acquisition function; in our experiments we apply it to REVI and ConBO. In practice, we optimize the acquisition function using multi-start gradient ascent and keep the entire history of evaluations {(s_t, x_t, α(s_t, x_t))}_{t=1}^{calls}. Since this history is very likely to contain multiple peaks, we simply apply the penalization to the set of past evaluations, avoiding the need to re-optimize the penalized acquisition function. Therefore, efficiently parallelizing a conditional BO algorithm can be done in just a few additional lines of code.
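For example, the following sketch constructs a batch of q points from the stored history, with k a stand-in for the fitted GP kernel over (state, action) pairs:

import numpy as np

def penalized_batch(points, alpha_values, k, q):
    # Greedy batch: repeatedly take the best stored acquisition evaluation,
    # then down-weight the whole history by the inverted-kernel penalty.
    pts = np.asarray(points)
    alpha = np.asarray(alpha_values, dtype=float).copy()
    batch = []
    for _ in range(q):
        i = int(np.argmax(alpha))
        batch.append(pts[i])
        penalty = 1.0 - np.array([k(p, pts[i]) for p in pts]) / k(pts[i], pts[i])
        alpha *= penalty   # phi is 0 at the chosen point, so it is not re-chosen
    return batch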
MTSREVIConBO
ATO 4x ATO 8x
100 200 300t0 . . E xp ec t e d ti m e AMB
100 200 300t
AMB 4x
100 200 300t
AMB 8x
Figure 8: Algorithm performance on the Assemble To Order (top) and Ambulances (bottom) benchmarks. Serial,parallel 4 and 8.
E Global Optimization
We perform a parameter sweep over n_z for each KG implementation. As a baseline we consider Thompson sampling. As test functions we take the popular Branin-Hoo, Rosenbrock and Hartmann6 functions. See Figure 9 for results.

F Entropy Based Methods for Conditional Bayesian Optimization
Given a GP model and a dataset, P[x* | X̃^n, Y^n] is the distribution over the peak of realizations of GP sample functions; abusing notation,

  P[x* | X̃^n, Y^n] = argmax_x GP(μ^n(x), k^n(x, x′)).

Given a new sample input x^{n+1}, the outcome P[y^{n+1} | x^{n+1}, X̃^n, Y^n] is also a random variable, and it is Gaussian. For this section, to reduce notational clutter, we suppress the dependence on X̃^n, Y^n. The mutual information between the random variables y^{n+1} and x* is defined as

  MI(x) = ∫_{x*} ∫_{y^{n+1}} log( P[y^{n+1}, x*] / (P[y^{n+1}] P[x*]) ) P[y^{n+1}, x*] dy^{n+1} dx*   (49)

where P[y^{n+1}] depends upon x^{n+1}, a dependence we also drop for convenience.

F.1 Entropy Search
The Entropy Search algorithm decomposes the above expression using P[y^{n+1}, x*] = P[y^{n+1}] P[x* | y^{n+1}], resulting in the following acquisition function:

  ES(x) = −∫_{x*} log(P[x*]) P[x*] dx* + ∫_{y^{n+1}} ∫_{x*} log(P[x* | y^{n+1}]) P[x* | y^{n+1}] dx* P[y^{n+1}] dy^{n+1}   (50)
        = H[x*] − ∫_{y^{n+1}} H[x* | y^{n+1}] P[y^{n+1}] dy^{n+1}   (51)

where H[x*] is the entropy of the distribution P[x*]. For the conditional case, the outcome P[y^{n+1} | (s, x)^{n+1}] is still a Gaussian random variable, and we measure the mutual information with the peak x*_{s_i} over actions constrained to a given state {s_i} × X, that is, P[x*_{s_i}] = argmax_x GP(μ^n(s_i, x), k^n(s_i, x, s_i, x′)). The conditional entropy search acquisition function is then simply

  ES_c(s_i; (s, x)^{n+1}) = H[x*_{s_i}] − ∫_{y^{n+1}} H[x*_{s_i} | y^{n+1}] P[y^{n+1}] dy^{n+1}.   (52)

F.2 Predictive Entropy Search
We again drop the dependence on x^{n+1} in P[y^{n+1} | x^{n+1}]. The Predictive Entropy Search algorithm uses an alternative decomposition of the mutual information, P[y^{n+1}, x*] = P[y^{n+1} | x*] P[x*], resulting in the following acquisition function:
[Figure 9 panels: rows Branin-Hoo, Rosenbrock, Hartmann6; columns n_z ∈ {3, 5, 9, 15}; legend: TS, KG-d, KG-h, KG-MC.]
Figure 9: The KG by discretization, KG_d, uses n_z random points in the search space X and performs worst by far. KG_MC uses n_z samples of the future posterior mean function and performs well on small problems but suffers from unreliable, noisy convergence. KG_h consistently outperforms the other KG methods and is the most reliable on all problems, matching Thompson sampling but still performing worse on the Hartmann6 function. This is in contrast to the conditional setting with multiple states, where integrated KG methods outperform Thompson sampling.

  PES(x) = H[y^{n+1}] − ∫_{x*} H[y^{n+1} | x*] P[x*] dx*.   (53)

For the conditional case, the outcome P[y^{n+1} | (s, x)^{n+1}] is still a Gaussian random variable, and we measure the mutual information with the peak x*_{s_i} constrained to a given state s_i, that is, as above, P[x*_{s_i}] = argmax_x GP(μ^n(s_i, x), k^n(s_i, x, s_i, x′)). The conditional predictive entropy search acquisition function is then simply

  PES_c(s_i; (s, x)^{n+1}) = H[y^{n+1}] − ∫_{x*_{s_i}} H[y^{n+1} | x*_{s_i}] P[x*_{s_i}] dx*_{s_i}   (54)

where H[y^{n+1} | x*_{s_i}] is the entropy of the (non-Gaussian) distribution of y^{n+1} at (s, x)^{n+1} given that the peak of state s_i is at x*_{s_i}.

F.3 Max-value Entropy Search
Max-value Entropy Search instead measures the mutual information between the new outcome P[y^{n+1} | x^{n+1}] and the largest possible outcome (again abusing notation) P[y*] = max_x GP(μ^n(x), k^n(x, x′)), the peak value of posterior sample functions. The acquisition function decomposes the mutual information into

  MES(x) = H[y^{n+1}] − ∫_{y*} H[y^{n+1} | y*] P[y*] dy*.   (55)

The conditional version measures the mutual information between P[y^{n+1} | (s, x)^{n+1}] and the largest y value amongst all outcomes sharing the same state, P[y*_{s_i}] = max_x GP(μ^n(s_i, x), k^n(s_i, x, s_i, x′)):

  MES_c(s_i; (s, x)^{n+1}) = H[y^{n+1}] − ∫_{y*_{s_i}} H[y^{n+1} | y*_{s_i}] P[y*_{s_i}] dy*_{s_i}.