Differentiable Expected Hypervolume Improvement for Parallel Multi-Objective Bayesian Optimization
Samuel Daulton
Facebook [email protected]
Maximilian Balandat
Facebook [email protected]
Eytan Bakshy
Facebook [email protected]
Abstract
In many real-world scenarios, decision makers seek to efficiently optimize multiple competing objectives in a sample-efficient fashion. Multi-objective Bayesian optimization (BO) is a common approach, but many existing acquisition functions do not have known analytic gradients and suffer from high computational overhead. We leverage recent advances in programming models and hardware acceleration for multi-objective BO using Expected Hypervolume Improvement (EHVI), an algorithm notorious for its high computational complexity. We derive a novel formulation of q-Expected Hypervolume Improvement (qEHVI), an acquisition function that extends EHVI to the parallel, constrained evaluation setting. qEHVI is an exact computation of the joint EHVI of q new candidate points (up to Monte-Carlo (MC) integration error). Whereas previous EHVI formulations rely on gradient-free acquisition optimization or approximated gradients, we compute exact gradients of the MC estimator via auto-differentiation, thereby enabling efficient and effective optimization using first-order and quasi-second-order methods. Lastly, our empirical evaluation demonstrates that qEHVI is computationally tractable in many practical scenarios and outperforms state-of-the-art multi-objective BO algorithms at a fraction of their wall time.
The problem of optimizing multiple competing objectives is ubiquitous in scientific and engineering applications. For example, in automobile design, an automaker will want to maximize vehicle durability and occupant safety, while using lighter materials that afford increased fuel efficiency and lower manufacturing cost [42, 68]. Evaluating the crash safety of an automobile design experimentally is expensive due to both the manufacturing time and the destruction of a vehicle. In such a scenario, sample efficiency is paramount. For a different example, video streaming web services commonly use adaptive control policies to determine the bitrate as the stream progresses in real time [45]. A decision maker may wish to optimize the control policy to maximize the quality of the video stream, while minimizing the stall time. Policy evaluation typically requires using the suggested policy on segments of live traffic, which is subject to opportunity costs. If long evaluation times are the limiting factor, multiple designs may be evaluated in parallel to significantly decrease end-to-end optimization time. For example, an automaker could manufacture multiple vehicle designs in parallel, or a web service could deploy several control policies to different segments of traffic at the same time.
Multi-objective Optimization
In this work, we address the problem of optimizing a vector-valued objective f(x) : R^d → R^M with f(x) = (f^(1)(x), ..., f^(M)(x)) over a bounded set X ⊂ R^d. We consider the scenario in which the f^(i) are expensive-to-evaluate black-box functions with no known analytical expression and no observed gradients. Such multi-objective (MO) optimization problems typically do not have a single best solution; rather, the goal is to identify the set of Pareto optimal solutions such that any improvement in one objective means deteriorating another.
Preprint. Under review.

Without loss of generality, we assume the goal is to maximize all objectives. We say a solution f(x) Pareto dominates another solution f(x′) if f^(m)(x) ≥ f^(m)(x′) for all m ∈ {1, ..., M} and there exists m′ ∈ {1, ..., M} such that f^(m′)(x) > f^(m′)(x′). We write f(x) ≻ f(x′). Let P* = {f(x) s.t. ∄ x′ ∈ X : f(x′) ≻ f(x)} denote the set of Pareto optimal solutions, and X* = {x ∈ X s.t. f(x) ∈ P*} the corresponding set of Pareto optimal inputs. Provided with the Pareto set, decision-makers can select a solution with an objective trade-off according to their preferences.

A common approach for solving MO problems is to use evolutionary algorithms (e.g. NSGA-II), which are robust multi-objective optimizers but require a large number of function evaluations [13]. Bayesian optimization (BO) offers a far more sample-efficient alternative [54]. BO [37] is an established method for optimizing expensive-to-evaluate black-box functions. BO relies on a Bayesian surrogate model, typically a Gaussian process (GP) [52], to provide a posterior distribution P(f | D) over the true function values f given the observed data D = {(x_i, y_i)}_{i=1}^n. An acquisition function α : X_cand ↦ R employs the surrogate model to assign a utility value to a set of candidates X_cand = {x_i}_{i=1}^q to be evaluated on the true function. While the true f may be expensive to evaluate, the surrogate-based acquisition function is not, and can thus be efficiently optimized to yield a set of candidates X_cand to be evaluated on f. If gradients of α(X_cand) are available, gradient-based methods can be utilized; if not, gradients are either approximated (e.g. with finite differences) or gradient-free methods (e.g. DIRECT [36] or CMA-ES [31]) are used.
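To make the dominance definition above concrete, the following is a minimal numpy sketch (maximization convention; the function name and example data are illustrative, not from the paper) that extracts the non-dominated rows of a matrix of observed objective vectors:

```python
import numpy as np

def pareto_mask(Y):
    """Boolean mask of Pareto-optimal rows of Y (maximization).

    Row i is dominated if some other row is >= in every objective
    and strictly > in at least one."""
    dominated = np.zeros(len(Y), dtype=bool)
    for i in range(len(Y)):
        dominated[i] = np.any(
            np.all(Y >= Y[i], axis=1) & np.any(Y > Y[i], axis=1)
        )
    return ~dominated

Y = np.array([[1.0, 3.0], [2.0, 2.0], [3.0, 1.0], [1.5, 1.5]])
print(Y[pareto_mask(Y)])  # the first three rows; [1.5, 1.5] is dominated by [2, 2]
```

This O(n^2) scan is adequate for the small Pareto sets that arise in BO loops; specialized algorithms exist for larger archives.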
Limitations of current approaches
In the single-objective (SO) setting, a large body of work focuses on practical extensions to BO for supporting parallel evaluation and outcome constraints [47, 29, 62, 24, 41]. Less attention has been given to such extensions in the MO setting. Moreover, the existing constrained and parallel MO BO options have limitations: 1) many rely on scalarizations to transform the MO problem into a SO one [39]; 2) many acquisition functions are computationally expensive to compute [49, 20, 6, 67]; 3) few have known analytical gradients or are differentiable [18, 58, 32]; 4) many rely on heuristics to extend sequential algorithms to the parallel setting [26, 58].

A natural acquisition function for MO BO is Expected Hypervolume Improvement (EHVI). Maximizing the hypervolume (HV) has been shown to produce Pareto fronts with excellent coverage [69, 11, 65]. However, there has been little work on EHVI in the parallel setting, and the work that has been done resorts to approximate methods [67, 27, 58]. Furthermore, a vast body of literature has focused on efficient EHVI computation [33, 19, 63], but the time complexity for computing EHVI is exponential in the number of objectives, in part because the hypervolume indicator itself incurs a time complexity that scales super-polynomially with the number of objectives [64]. Our core insight is that by exploiting advances in auto-differentiation and highly parallelized hardware [48], we can make EHVI computations fast and practical.
Contributions
In this work, we derive a novel formulation of the parallel q-Expected Hypervolume Improvement acquisition function (qEHVI) that is exact up to Monte-Carlo (MC) integration error. We compute the exact gradient of the MC estimator of qEHVI using auto-differentiation, which allows us to employ efficient and effective gradient-based optimization methods. Despite its computational cost, our formulation of qEHVI is embarrassingly parallel, and would achieve constant time complexity given infinite processing cores. We demonstrate that, using modern GPU hardware and computing exact gradients, optimizing qEHVI is faster than existing state-of-the-art methods in many practical scenarios. Moreover, we extend qEHVI to support auxiliary outcome constraints, making it practical in many real-world scenarios. Lastly, we demonstrate how modern auto-differentiation can be used to compute exact gradients of analytic EHVI, which has never been done before for M > 2 objectives. Our empirical evaluation shows that qEHVI outperforms state-of-the-art multi-objective BO algorithms using a fraction of their wall time.
Yang et al. [65] is the only previous work to consider exact gradients of EHVI, but the authors only derive an analytical gradient for the unconstrained M = 2, q = 1 setting. All other works either do not optimize EHVI (e.g. they use it for pre-screening candidates [17]), optimize it with gradient-free methods [64], or use approximate gradients [58]. In contrast, we use exact gradients and demonstrate that optimizing EHVI with gradients is far more efficient.

There are many alternatives to EHVI for MO BO. For example, ParEGO [39] randomly scalarizes the objectives and uses Expected Improvement [37], and SMS-EGO [50] uses HV in a UCB-based acquisition function and is more scalable than EHVI [51]. Both methods have only been considered for the q = 1, unconstrained setting. Predictive entropy search for MO BO (PESMO) [32] has been shown to be another competitive alternative and has been extended to handle constraints [25] and parallel evaluations [26]. MO max-value entropy search (MO-MES) has been shown to achieve superior optimization performance and faster wall times than PESMO, but is limited to q = 1.

Wilson et al. [61] empirically and theoretically show that sequential greedy selection of q candidates achieves performance comparable to jointly optimizing q candidates for many acquisition functions (including [59, 62]). The sequential greedy approach integrates over the posterior of the unobserved outcomes corresponding to the previously selected candidates in the q-batch. Sequential greedy optimization often yields better empirical results because the optimization problem has a lower dimension: d in each step, rather than q·d in the joint problem. Most prior works in the MO setting use a sequential greedy approximation or heuristics [58, 67, 27, 9], but impute the unobserved outcomes with the posterior mean rather than integrating over the posterior [29].
For many joint acquisition functions involving expectations, this shortcut sacrifices the theoretical error bound on the sequential greedy approximation because the exact joint acquisition function over x_1, ..., x_i, 1 ≤ i ≤ q, requires integration over the joint posterior P(f(x_1), ..., f(x_q) | D) and is not computed for i > 1.

Garrido-Merchán and Hernández-Lobato [26] and Wada and Hino [58] jointly optimize the q candidates and, noting the difficulty of the optimization, both papers focus on deriving gradients to aid in the optimization. Wada and Hino [58] defined the qEHVI acquisition function, but after finding it challenging to optimize q candidates jointly (without exact gradients), the authors propose optimizing an alternative acquisition function instead of exact qEHVI. In contrast, our novel qEHVI formulation allows for gradient-based parallel and sequential greedy optimization, with proper integration over the posterior for the latter. Feliot et al. [21] and Abdolshah et al. [1] proposed extensions of EHVI to the constrained q = 1 setting, but neither considers the batch setting and both rely on gradient-free optimization.

q-Expected Hypervolume Improvement

In this section, we review HVI and EHVI computation by means of box decompositions and explain our novel formulation for the parallel setting.
Definition 1.
Given a reference point r ∈ R^M, the hypervolume indicator (HV) of a finite approximate Pareto set P is the M-dimensional Lebesgue measure λ_M of the space dominated by P and bounded from below by r: HV(P, r) = λ_M(∪_{i=1}^{|P|} [r, y_i]).

Definition 2.
Given a Pareto set P and reference point r, the hypervolume improvement (HVI) of a set of points Y is: HVI(Y, P, r) = HV(P ∪ Y, r) − HV(P, r).

EHVI is the expectation of HVI over the posterior P(f | D): α_EHVI(X_cand) = E[HVI(f(X_cand))]. In the sequential setting, and assuming the objectives are independent and modeled with independent GPs, EHVI can be expressed in closed form [65]. In other settings, EHVI can be approximated with MC integration. The reference point is typically specified by the decision maker [65] (see Appendix E.1.1 for more discussion).

For a set of objective vectors {f(x_i)}_{i=1}^q, a reference point r ∈ R^M, and a non-dominated set P, let Δ({f(x_i)}_{i=1}^q, P, r) ⊂ R^M denote the set of points that {f(x_i)}_{i=1}^q dominates, but P does not dominate. Given P and r, the HVI of a new point f(x) is the HV of the intersection of the space dominated by P ∪ {f(x)} and the non-dominated space. Figure 1b illustrates this for one new point f(x) for M = 2: the yellow region is Δ({f(x)}, P, r), and the hypervolume improvement is the area covered by Δ({f(x)}, P, r). Since Δ({f(x)}, P, r) is often a non-rectangular polytope, HVI is typically computed by partitioning the non-dominated space into hyper-rectangles. In this work, we omit the arguments P and r when referring to HVI for brevity.

Let {S_k}_{k=1}^K be a partitioning of the non-dominated space into disjoint hyper-rectangles, where each S_k is defined by a pair of lower and upper vertices l_k ∈ R^M and u_k ∈ (R ∪ {∞})^M. The high-level idea is to sum the HV of S_k ∩ Δ({f(x)}, P, r) over all S_k. For each hyper-rectangle S_k, the intersection of S_k and Δ({f(x)}, P, r) is a hyper-rectangle where the lower bound vertex is l_k and the upper bound vertex is the component-wise minimum of u_k and the new point f(x): z_k := min[u_k, f(x)].
Figure 1: For M = 2, (a) the dominated space (red) and the non-dominated space partitioned into disjoint boxes (white), (b) the HVI of one new point f(x), and (c) the HVI of two new points f(x_1), f(x_2).

Hence, the HVI of a single outcome vector f(x) within S_k is given by HVI_k(f(x), l_k, u_k) = λ_M(S_k ∩ Δ({f(x)}, P, r)) = ∏_{m=1}^M [z_k^(m) − l_k^(m)]_+, where u_k^(m), l_k^(m), f^(m)(x), and z_k^(m) denote the m-th component of the corresponding vector and [·]_+ denotes the max(·, 0) operation. Summing over rectangles yields

HVI(f(x)) = Σ_{k=1}^K HVI_k(f(x), l_k, u_k) = Σ_{k=1}^K ∏_{m=1}^M [z_k^(m) − l_k^(m)]_+.   (1)

q-Hypervolume Improvement via the Inclusion-Exclusion Principle

Figure 1c illustrates the HVI in the q = 2 setting. Given q new points {f(x_i)}_{i=1}^q, let A_i := Δ({f(x_i)}, P, r) for i = 1, ..., q be the space dominated by f(x_i) but not dominated by P, independently of the other q − 1 points. Note that |A_i| = HVI(f(x_i)). The union of the subsets A_i is the space dominated jointly by the q new points: ∪_{i=1}^q A_i = ∪_{i=1}^q Δ({f(x_i)}, P, r), and so |∪_{i=1}^q A_i| is the joint HVI from the q new points. Using the inclusion-exclusion principle [12, 56],

HVI({f(x_i)}_{i=1}^q) = |∪_{i=1}^q A_i| = Σ_{j=1}^q (−1)^{j+1} Σ_{1 ≤ i_1 < ... < i_j ≤ q} |A_{i_1} ∩ ... ∩ A_{i_j}|.   (2)

Each |A_{i_1} ∩ ... ∩ A_{i_j}| can be computed in a piecewise fashion across the K hyper-rectangles {S_k}_{k=1}^K as the HV of the intersection of A_{i_1} ∩ ... ∩ A_{i_j} with each hyper-rectangle S_k, since {S_k}_{k=1}^K is a disjoint partition: |A_{i_1} ∩ ... ∩ A_{i_j}| = Σ_{k=1}^K |S_k ∩ A_{i_1} ∩ ... ∩ A_{i_j}|.
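Equation (1) translates directly into code. The sketch below is illustrative rather than the paper's implementation: the box decomposition {(l_k, u_k)} is assumed to be given (from any exact partitioning algorithm), and all names are hypothetical.

```python
import numpy as np

def hvi_single(f_x, lowers, uppers):
    """HVI of one outcome vector f_x via Eq. (1):
    sum over boxes k of prod_m [min(u_k, f(x))^(m) - l_k^(m)]_+ .
    lowers, uppers: (K, M) arrays defining the box decomposition of the
    non-dominated region; uppers may contain np.inf."""
    z = np.minimum(uppers, f_x)              # z_k = min(u_k, f(x)), shape (K, M)
    edges = np.clip(z - lowers, 0.0, None)   # [.]_+ truncation per dimension
    return edges.prod(axis=-1).sum()

# Example (M = 2): Pareto front {(1, 3), (3, 1)}, reference point (0, 0).
# The non-dominated region decomposes into three boxes:
lowers = np.array([[0.0, 3.0], [1.0, 1.0], [3.0, 0.0]])
uppers = np.array([[1.0, np.inf], [3.0, np.inf], [np.inf, np.inf]])
print(hvi_single(np.array([2.0, 2.0]), lowers, uppers))  # 1.0: the square [1,2] x [1,2]
```

Only the middle box contributes here: the candidate (2, 2) adds exactly the unit square between the two front points.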
The inclusion-exclusion principle has been proposed for computing HV (not HVI) [43], but it is rarely used because its complexity scales exponentially with the number of elements. However, the inclusion-exclusion principle is practical for computing the joint HVI of q points, since typically q ≪ |P|.

This formulation has three advantages. First, while the newly dominated space A_i can be a non-rectangular polytope, the intersection A_i ∩ S_k is a rectangular polytope, which simplifies computation of overlapping hypervolume. Second, the vertices defining the hyper-rectangle S_k ∩ A_{i_1} ∩ ... ∩ A_{i_j} are easily derived: the lower bound is simply the lower bound l_k of S_k, and the upper bound is the component-wise minimum z_{k,i_1,...,i_j} := min[u_k, f(x_{i_1}), ..., f(x_{i_j})]. Third, the computation across all intersections of subsets A_{i_1} ∩ ... ∩ A_{i_j} for 1 ≤ i_1 < ... < i_j ≤ q and across all K hyper-rectangles can be performed in parallel. Explicitly, the HVI is computed as:

HVI({f(x_i)}_{i=1}^q) = Σ_{k=1}^K Σ_{j=1}^q Σ_{X_j ∈ 𝒳_j} (−1)^{j+1} ∏_{m=1}^M [z_{k,X_j}^(m) − l_k^(m)]_+,   (3)

where 𝒳_j := {X_j ⊂ X_cand : |X_j| = j} is the superset of all subsets of X_cand of size j, and z_{k,X_j}^(m) := z_{k,i_1,...,i_j}^(m) for X_j = {x_{i_1}, ..., x_{i_j}}. See Appendix A for further details of the derivation.

Expected q-Hypervolume Improvement

The above approach for computing HVI assumes that we know the true objective values {f(x_i)}_{i=1}^q. In BO, we instead compute qEHVI as the expectation over the posterior:

α_qEHVI(x) = E[HVI({f(x_i)}_{i=1}^q)] = ∫ HVI({f(x_i)}_{i=1}^q) df.   (4)

Since no analytical form is known [66] for q > 1 (or in the case of correlated outcomes), we estimate (4) using MC integration with samples from the joint posterior {f_t(x_i)}_{i=1}^q ∼ P(f(x_1), ..., f(x_q) | D), t = 1, ..., N.
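A direct (non-vectorized) sketch of the inclusion-exclusion computation in Eqs. (2)-(3), under the same assumptions as before (box decomposition given; function and variable names hypothetical):

```python
import numpy as np
from itertools import combinations

def hvi_joint(F, lowers, uppers):
    """Joint HVI of q points via inclusion-exclusion (Eqs. 2-3).
    F: (q, M) array of outcome vectors; lowers/uppers: (K, M) boxes."""
    q = F.shape[0]
    total = 0.0
    for j in range(1, q + 1):
        sign = (-1.0) ** (j + 1)           # alternating inclusion-exclusion sign
        for idx in combinations(range(q), j):
            # z_{k, X_j}: component-wise min of u_k and the subset's outcomes
            z = np.minimum(uppers, F[list(idx)].min(axis=0))
            total += sign * np.clip(z - lowers, 0.0, None).prod(axis=-1).sum()
    return total

# Same M = 2 decomposition as before: front {(1,3), (3,1)}, r = (0,0)
lowers = np.array([[0.0, 3.0], [1.0, 1.0], [3.0, 0.0]])
uppers = np.array([[1.0, np.inf], [3.0, np.inf], [np.inf, np.inf]])
F = np.array([[2.0, 2.0], [2.5, 1.5]])
print(hvi_joint(F, lowers, uppers))  # 1.25 = 1.0 + 0.75 - 0.5 (overlap subtracted once)
```

The loop over subsets makes the exponential 2^q − 1 term explicit; as the text notes, this is acceptable because q is small in practice and every subset/box term is independent.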
Let z_{k,X_j,t}^(m) := min[u_k^(m), min_{x′ ∈ X_j} f_t^(m)(x′)]. Then,

α̂_qEHVI^N(x) := (1/N) Σ_{t=1}^N HVI({f_t(x_i)}_{i=1}^q) = (1/N) Σ_{t=1}^N Σ_{k=1}^K Σ_{j=1}^q Σ_{X_j ∈ 𝒳_j} (−1)^{j+1} ∏_{m=1}^M [z_{k,X_j,t}^(m) − l_k^(m)]_+.   (5)

Provided that {S_k}_{k=1}^K is an exact partitioning, (5) is an exact computation of qEHVI up to the MC estimation error, which scales as 1/√N regardless of the dimension of the search space [17]. In practice, we use randomized quasi-MC methods [8] to reduce the variance (see Figure 4a in the Appendix for a comparison of analytic EHVI and MC-based qEHVI).

qEHVI requires computing the volume of 2^q − 1 hyper-rectangles (the number of non-empty subsets of q points) for each of K hyper-rectangles and N MC samples. Given posterior samples, the time complexity on a single-threaded machine is T_1 = O(MNK(2^q − 1)). In the two-objective case, K = |P| + 1, but K is super-polynomial in M [64]. qEHVI is agnostic to the partitioning algorithm used, and the details of such algorithms are beyond the scope of this work. Despite the daunting workload, the critical work path (the time complexity of the smallest non-parallelizable unit) is constant: T_∞ = O(1). On highly-threaded many-core hardware (e.g. GPUs), our formulation achieves tractable wall times in many practical scenarios: as shown in Figure 7 in the Appendix, the computation time is nearly constant with increasing q until an inflection point at which the workload saturates the available cores. For additional discussion of both the time and memory complexity of qEHVI, see Appendix A.4.
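The MC estimator (5) averages the joint HVI over posterior samples, and the sums over samples, boxes, and subsets can all be expressed as array operations. A vectorized numpy sketch (all names hypothetical; posterior samples of shape (N, q, M) are assumed to come from the reparameterized joint GP posterior):

```python
import numpy as np
from itertools import combinations

def qehvi_mc(samples, lowers, uppers):
    """MC estimate of qEHVI (Eq. 5) from joint posterior samples.
    samples: (N, q, M) draws from the joint posterior at the q candidates;
    lowers/uppers: (K, M) box decomposition of the non-dominated region."""
    N, q, _ = samples.shape
    total = np.zeros(N)                      # per-sample joint HVI
    for j in range(1, q + 1):
        sign = (-1.0) ** (j + 1)
        for idx in combinations(range(q), j):
            f_min = samples[:, list(idx), :].min(axis=1)       # (N, M)
            z = np.minimum(uppers[None], f_min[:, None, :])    # (N, K, M)
            total += sign * np.clip(z - lowers[None], 0.0, None).prod(-1).sum(-1)
    return total.mean()

# Degenerate "posterior" with a single deterministic sample: the estimate
# reduces to the joint HVI of the two points (1.25 in the earlier example).
lowers = np.array([[0.0, 3.0], [1.0, 1.0], [3.0, 0.0]])
uppers = np.array([[1.0, np.inf], [3.0, np.inf], [np.inf, np.inf]])
samples = np.array([[[2.0, 2.0], [2.5, 1.5]]])
print(qehvi_mc(samples, lowers, uppers))  # 1.25
```

Broadcasting over the N and K axes is what maps onto GPU parallelism; only the small loops over j and the subsets remain in Python.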
Our proposed qEHVI acquisition function is easily extended to constraints on auxiliary outcomes. We consider the scenario where we receive observations of M objectives f(x) ∈ R^M and V constraints c(x) ∈ R^V, all of which are assumed to be "black-box". We assume w.l.o.g. that x is feasible iff c(x) ≥ 0. In the constrained optimization setting, we aim to identify the feasible Pareto set: P_feas = {f(x) s.t. c(x) ≥ 0, ∄ x′ : c(x′) ≥ 0, f(x′) ≻ f(x)}. The natural improvement measure in the constrained setting is feasible HVI, which we define for a single candidate point x as HVI_C(f(x), c(x)) := HVI[f(x)] · 1[c(x) ≥ 0]. Taking expectations, the constrained expected HVI can be seen to be the HVI weighted by the probability of feasibility. In Appendix A.3, we detail how performing feasibility-weighting at the sample level allows us to include such auxiliary outcome constraints in our MC formulation in a straightforward way.

Differentiability
While an analytic formula for the gradient of EHVI exists for the M = 2 objective case in the unconstrained, sequential (q = 1) setting, no such formula is known in 1) the case of M > 2 objectives, 2) the constrained setting, and 3) for q > 1. Leveraging the re-parameterization trick [38, 60] and auto-differentiation, we are able to automatically compute exact gradients of the MC estimator of qEHVI in all of the above settings, as well as the gradient of analytic EHVI for M ≥ 2 (see Figure 4b in the Appendix for a comparison of the exact gradients of EHVI and the sample average gradients of qEHVI for M = 3).

The number of boxes required for a decomposition of the non-dominated space is unknown for M ≥ 3 [64]. As evident from (5), the critical path consists of 3 multiplications and 5 summations.

Optimization via Sample Average Approximation
We show in Appendix C that if the mean and covariance functions of the GP are sufficiently regular, the gradient of the MC estimator (5) is an unbiased estimate of the gradient of the exact acquisition function (4). To maximize qEHVI, we could therefore directly apply stochastic optimization methods, as has previously been done for single-outcome acquisition functions [60, 62]. Instead, we opt to use the sample average approximation (SAA) approach from Balandat et al. [5], which allows us to employ deterministic, higher-order optimizers to achieve faster convergence rates. Informally (see Appendix C for the formal statement), if x̂*_N ∈ argmax_{x ∈ X} α̂^N_qEHVI(x), we can show under some regularity conditions that, as N → ∞, (i) α̂^N_qEHVI(x̂*_N) → max_{x ∈ X} α_qEHVI(x) a.s., and (ii) dist(x̂*_N, argmax_{x ∈ X} α_qEHVI(x)) → 0 a.s.

Figure 2a demonstrates the importance of using exact gradients for efficiently and effectively optimizing EHVI and qEHVI by comparing the following optimization methods: L-BFGS-B with exact gradients, L-BFGS-B with gradients approximated via finite differences, and CMA-ES (without gradients). The cumulative time spent optimizing the acquisition function is an order of magnitude less when using exact gradients rather than approximate gradients or zeroth-order methods.
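The SAA idea can be illustrated with a toy one-dimensional acquisition: fixing the base samples used in the reparameterization makes the MC estimate a deterministic function of x, so quasi-Newton optimizers such as L-BFGS-B see a smooth, noise-free objective. The "posterior" below is a made-up stand-in (mean sin(x), constant standard deviation 0.2), not a fitted GP, and the threshold 0.5 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.standard_normal(256)              # base samples: drawn once, then fixed

def acq(x):
    """Toy MC acquisition under a made-up posterior N(sin(x), 0.2^2),
    reparameterized as f_t(x) = mu(x) + sigma * eps_t."""
    f = np.sin(x) + 0.2 * eps
    return np.maximum(f - 0.5, 0.0).mean()  # EI-style improvement over 0.5

# Because eps is fixed, repeated evaluations at the same x agree exactly,
# unlike a naive MC estimator that resamples on every call:
assert acq(1.0) == acq(1.0)
```

With resampled noise the optimizer would chase MC jitter; with fixed base samples the whole acquisition surface is deterministic and can be handed to a deterministic optimizer directly.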
Figure 2: (a) A comparison of EHVI and qEHVI (q = 2) optimized with L-BFGS-B using exact gradients, L-BFGS-B using gradients approximated via finite differences, and CMA-ES, a gradient-free method. (b) A comparison of joint optimization, sequential greedy optimization with proper integration at the pending points, and sequential greedy using the posterior mean. Both plots show optimization performance (log HV difference) on a DTLZ2 problem (d = 6, M = 2) with a budget of 100 evaluations (plus the initial quasi-random design). We report means and 2 standard errors across 20 trials.

Sequential Greedy and Joint Batch Optimization
Jointly optimizing q candidates increases in difficulty with q because the problem dimension is dq. An alternative is to sequentially and greedily select candidates and condition the acquisition function on the previously selected pending points when selecting the next point [61]. Using a submodularity argument similar to that in Wilson et al. [60], the sequential greedy approximation of qEHVI enjoys regret of no more than (1/e)·α*_qEHVI, where α*_qEHVI is the optimum of α_qEHVI [22] (see Appendix B).

Although sequential greedy approaches have been considered for many acquisition functions [61], no previous work has proposed a proper sequential greedy approach (with integration over the posterior) for parallel EHVI, because it would require computing the Pareto front under each sample f_t from the joint posterior before computing the hypervolume improvement. These operations would be computationally expensive for even modest N, and non-differentiable. qEHVI avoids determining the Pareto set for each sample by using the inclusion-exclusion principle to compute the joint HVI over the pending points x_1, ..., x_{i−1} and the new candidate x_i for each MC sample. Figure 2b empirically demonstrates the improved optimization performance from properly integrating over the unobserved outcomes rather than using the posterior mean or jointly optimizing the q candidates.

Technically, min and max are only sub-differentiable, but are known to be well-behaved [60]. In our MC setting with GP posteriors, qEHVI is differentiable w.p. 1 if x contains no repeated points. For the constrained case, we replace the indicator with a differentiable sigmoid approximation.
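The sequential greedy scheme described above can be sketched generically: at step i, each remaining candidate is scored jointly with the already-selected pending points, and the best one is added. Here `acq_joint` is a placeholder for any joint acquisition value (such as a qEHVI estimator), and the finite candidate pool stands in for continuous optimization at each step.

```python
import numpy as np

def sequential_greedy(acq_joint, pool, q):
    """Greedily build a q-batch from a finite candidate pool.
    acq_joint: callable returning the joint acquisition value of a list
    of candidates (pending points plus one new candidate)."""
    selected, remaining = [], list(pool)
    for _ in range(q):
        scores = [acq_joint(selected + [c]) for c in remaining]
        selected.append(remaining.pop(int(np.argmax(scores))))
    return selected

# Toy submodular objective: joint value = number of distinct items covered.
coverage = lambda batch: len(set().union(*batch))
pool = [{1, 2}, {2, 3}, {1}]
print(sequential_greedy(coverage, pool, q=2))  # [{1, 2}, {2, 3}]
```

The coverage toy mirrors why greedy selection works for qEHVI: each step maximizes marginal gain over what the pending points already cover, which is exactly the submodular structure the regret bound exploits.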
Figure 3: Optimization performance on (a) the Branin-Currin problem (q = 1), (b) the C2-DTLZ2 problem with a proper sequential greedy approximation, (c) the vehicle crash safety problem (q = 1), and (d) the ABR control problem (q = 1). We report the means and 2 standard errors across 20 trials.

We empirically evaluate qEHVI on synthetic and real-world optimization problems. We compare qEHVI against existing state-of-the-art methods including SMS-EGO, PESMO, and analytic EHVI [64] with gradients. Additionally, we compare against a novel extension of ParEGO [39] to support parallel evaluation and constraints, neither of which has been done before to our knowledge; we call this method qParEGO. Additionally, we include a quasi-random baseline that selects candidates from a scrambled Sobol sequence. See Appendix E.1 for details on all baseline algorithms.

Synthetic Benchmarks: We evaluate optimization performance on four benchmark problems in terms of log hypervolume difference, which is defined as the difference between the hypervolume of the true Pareto front and the hypervolume of the approximate Pareto front based on the observed data; in the case that the true Pareto front is unknown (or not easily approximated), we evaluate the hypervolume indicator. All reference points and search spaces are provided in Appendix E.2. For synthetic problems, we consider the Branin-Currin problem (d = 2, M = 2, convex Pareto front) [6] and C2-DTLZ2 (d = 12, M = 2, V = 1, concave Pareto front), which is a standard constrained benchmark from the MO literature [15] (see Appendix F.1 for additional synthetic benchmarks).

Structural Optimization in Automobile Safety Design (VehicleSafety): Vehicle crash safety is an important consideration in the structural design of automobiles.
A lightweight car is preferable because of its potentially lower manufacturing cost and better fuel economy, but lighter material can fare worse than sturdier alternatives in a collision, leading to increased vehicle damage and more severe injury to the vehicle occupants [68].

Acquisition functions are implemented in BoTorch [5] and will be made publicly available. For review purposes, see the code included with this submission. We hope to compare to other, challenging-to-implement acquisition functions such as MES-MO [6] and PPESMOC [26] in a future draft of this work when the authors of these methods are prepared to share their code. We leverage existing implementations from the Spearmint library; the code is available at https://github.com/HIPS/Spearmint/tree/PESM.

We consider the problem of designing the thickness of 5
reinforced parts of the frontal frame of the vehicle that considerably affect crash safety. The goal is to minimize: 1) the mass of the vehicle; 2) the collision acceleration in a full frontal crash, a proxy for bio-mechanical trauma to the vehicle occupants from the acceleration; and 3) the toe-board intrusion, a measure of the most extreme mechanical damage to the vehicle in an off-frontal collision [42]. For this problem, we optimize the surrogate from Tanabe and Ishibuchi [57].

Policy Optimization for Adaptive Bitrate Control (ABR): Many web services adapt video playback quality on-the-fly based on the receiver's network bandwidth to maintain a steady, high-quality stream with minimal stalls and buffer periods [45]. Previous works have proposed controllers with different scalarized objective functions [44], but in many cases, engineers may prefer to learn the set of optimal trade-offs between their metrics of interest, rather than specify a scalarized objective in advance. In this problem, we decompose the objective function proposed in Mao et al. [44] into its constituent metrics and optimize 4 parameters of an ABR control policy on the Park simulator [46] to maximize video quality (bitrate) and minimize stall time. See Appendix E.2 for details.

[Table 1: acquisition optimization wall times on CPU and GPU for PESMO, SMS-EGO, qParEGO, EHVI, and qEHVI (q ∈ {1, 2, 4, 8}) on the Branin-Currin, C2-DTLZ2, ABR, and VehicleSafety problems.]

Figure 3 shows qEHVI outperforming all baselines in terms of optimization performance on all evaluated problems. Table 1 shows that qEHVI achieves wall times that are an order of magnitude smaller than those of PESMO on a CPU in sequential optimization, and maintains competitive wall times even relative to qParEGO (which has a significantly smaller workload) for large q on a GPU.

We present a simple, practical, and efficient algorithm for parallel, constrained MO BO. Leveraging differentiable programming and modern parallel hardware, we are able to efficiently optimize qEHVI via quasi-second-order methods, for which we provide convergence guarantees.
We demonstrate that our method achieves performance superior to that of state-of-the-art MO BO approaches.

One limitation of our approach is that it currently assumes noiseless observations. Extending to noisy observations (e.g. by integrating over the uncertainty around the previous observations [41]) would be nontrivial, particularly in the parallel case. Such an integration would be equivalent to noiseless qEHVI computation with a batch size of |P| + q, which would be prohibitively expensive since computation scales exponentially with the batch size. Additional wall-time performance improvements can be gained through the use of more efficient partitioning algorithms (e.g. [40, 16, 65]) that result in fewer disjoint hyper-rectangles. We hope this work encourages researchers to consider more improvements from applying modern computational paradigms and tooling to MO BO, and BO more generally.

The Park simulator is available at https://github.com/park-project/park.

References

[1] M. Abdolshah, A. Shilton, S. Rana, S. Gupta, and S. Venkatesh. Expected hypervolume improvement with constraints. In , pages 3238–3243, 2018.

[2] Arash Asadpour, Hamid Nazerzadeh, and Amin Saberi. Stochastic submodular maximization. In Christos Papadimitriou and Shuzhong Zhang, editors, Internet and Network Economics. Springer Berlin Heidelberg, 2008.

[3] R. Astudillo and P. Frazier. Bayesian optimization of composite functions.
Forthcoming, in Proceedings of the 36th International Conference on Machine Learning, 2019. [4] Anne Auger, Johannes Bader, Dimo Brockhoff, and Eckart Zitzler. Theory of the hypervolume indicator: Optimal mu-distributions and the choice of the reference point. In
Proceedings of the Tenth ACM SIGEVO Workshop on Foundations of Genetic Algorithms, FOGA '09, pages 87–102, New York, NY, USA, 2009. Association for Computing Machinery. [5] Maximilian Balandat, Brian Karrer, Daniel R. Jiang, Samuel Daulton, Benjamin Letham, Andrew Gordon Wilson, and Eytan Bakshy. BoTorch: Programmable bayesian optimization in pytorch.
ArXiv ,abs/1910.06403, 2019.[6] Syrine Belakaria, Aryan Deshwal, and Janardhan Rao Doppa. Max-value entropy search for multi-objectivebayesian optimization. In
Advances in Neural Information Processing Systems 32 , 2019.[7] Eric Bradford, Artur Schweidtmann, and Alexei Lapkin. Efficient multiobjective optimization employinggaussian processes, spectral sampling and a genetic algorithm.
Journal of Global Optimization , 71, 022018. doi: 10.1007/s10898-018-0609-2.[8] Russel E Caflisch. Monte carlo and quasi-monte carlo methods.
Acta numerica , 7:1–49, 1998.[9] Anirban Chaudhuri, Raphael Haftka, Peter Ifju, Kelvin Chang, Christopher Tyler, and Tony Schmitz.Experimental flapping wing optimization and uncertainty quantification using limited samples.
Structuraland Multidisciplinary Optimization , 51, 11 2014. doi: 10.1007/s00158-014-1184-x.[10] I. Couckuyt, D. Deschrijver, and T. Dhaene. Towards efficient multiobjective optimization: Multiobjectivestatistical criterions. In , pages 1–8, 2012.[11] Ivo Couckuyt, Dirk Deschrijver, and Tom Dhaene. Fast calculation of multiobjective probability ofimprovement and expected improvement criteria for pareto optimization.
J. of Global Optimization , 60(3):575–594, November 2014.[12] Daniel A. da Silva. Proprietades geraes.
J. de l’Ecole Polytechnique , cah. 30. I, 1854.[13] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm:Nsga-ii.
IEEE Transactions on Evolutionary Computation , 6(2):182–197, 2002.[14] Kalyan Deb, L. Thiele, Marco Laumanns, and Eckart Zitzler. Scalable multi-objective optimization testproblems. volume 1, pages 825–830, 06 2002. ISBN 0-7803-7282-4. doi: 10.1109/CEC.2002.1007032.[15] Kalyanmoy Deb.
Constrained Multi-objective Evolutionary Algorithm , pages 85–118. Springer Interna-tional Publishing, Cham, 2019.[16] Kerstin Dächert, Kathrin Klamroth, Renaud Lacour, and Daniel Vanderpooten. Efficient computation ofthe search region in multi-objective optimization.
European Journal of Operational Research , 260(3):841 –855, 2017.[17] M. T. M. Emmerich, K. C. Giannakoglou, and B. Naujoks. Single- and multiobjective evolutionary opti-mization assisted by gaussian random field metamodels.
IEEE Transactions on Evolutionary Computation ,10(4):421–439, 2006.[18] M. T. M. Emmerich, A. H. Deutz, and J. W. Klinkenberg. Hypervolume-based expected improvement:Monotonicity properties and exact computation. In , pages 2147–2154, 2011.[19] Michael Emmerich, Kaifeng Yang, André Deutz, Hao Wang, and Carlos M. Fonseca.
A MulticriteriaGeneralization of Bayesian Global Optimization , pages 229–242. Springer International Publishing, 2016.
[20] Michael T. M. Emmerich and Carlos M. Fonseca. Computing hypervolume contributions in low dimensions: Asymptotically optimal algorithm and complexity results. In Ricardo H. C. Takahashi, Kalyanmoy Deb, Elizabeth F. Wanner, and Salvatore Greco, editors,
Evolutionary Multi-Criterion Optimization , pages121–135, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.[21] Paul Feliot, Julien Bect, and Emmanuel Vazquez. A bayesian approach to constrained single- and multi-objective optimization.
Journal of Global Optimization , 67(1-2):97–133, Apr 2016. ISSN 1573-2916. doi:10.1007/s10898-016-0427-3. URL http://dx.doi.org/10.1007/s10898-016-0427-3 .[22] M. L. Fisher, G. L. Nemhauser, and L. A. Wolsey.
An analysis of approximations for maximizingsubmodular set functions—II , pages 73–87. Springer Berlin Heidelberg, Berlin, Heidelberg, 1978.[23] Tobias Friedrich and Frank Neumann. Maximizing submodular functions under matroid constraints bymulti-objective evolutionary algorithms. In Thomas Bartz-Beielstein, Jürgen Branke, Bogdan Filipiˇc, andJim Smith, editors,
Parallel Problem Solving from Nature – PPSN XIII , pages 922–931, Cham, 2014.Springer International Publishing. ISBN 978-3-319-10762-2.[24] Jacob Gardner, Matt Kusner, Zhixiang, Kilian Weinberger, and John Cunningham. Bayesian optimizationwith inequality constraints. In
Proceedings of the 31st International Conference on Machine Learning ,volume 32 of
Proceedings of Machine Learning Research , pages 937–945, Beijing, China, 22–24 Jun 2014.PMLR.[25] Eduardo C. Garrido-Merchán and Daniel Hernández-Lobato. Predictive entropy search for multi-objectivebayesian optimization with constraints.
Neurocomputing , 361:50 – 68, 2019.[26] Eduardo C. Garrido-Merchán and Daniel Hernández-Lobato. Parallel predictive entropy search for multi-objective bayesian optimization with constraints, 2020.[27] David Gaudrie, Rodolphe Le Riche, Victor Picheny, Benoît Enaux, and Vincent Herbert. Targeting solutionsin bayesian multi-objective optimization: sequential and batch versions.
Annals of Mathematics andArtificial Intelligence , 88(1-3):187–212, Aug 2019. ISSN 1573-7470. doi: 10.1007/s10472-019-09644-8.URL http://dx.doi.org/10.1007/s10472-019-09644-8 .[28] Michael A. Gelbart, Jasper Snoek, and Ryan P. Adams. Bayesian optimization with unknown constraints.In
Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence , UAI, 2014.[29] David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro.
Kriging Is Well-Suited to ParallelizeOptimization , pages 131–162. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.[30] P. Glasserman. Performance continuity and differentiability in monte carlo optimization. In , pages 518–524, 1988.[31] Nikolaus Hansen.
The CMA Evolution Strategy: A Comparing Review , volume 192, pages 75–102. 062007. doi: 10.1007/3-540-32494-1_4.[32] Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Amar Shah, and Ryan P. Adams. Predictiveentropy search for multi-objective bayesian optimization, 2015.[33] Iris Hupkens, Andre Deutz, Kaifeng Yang, and Michael Emmerich. Faster exact algorithms for computingexpected hypervolume improvement. In Antonio Gaspar-Cunha, Carlos Henggeler Antunes, and Car-los Coello Coello, editors,
Evolutionary Multi-Criterion Optimization , pages 65–79. Springer InternationalPublishing, 2015.[34] Hisao Ishibuchi, Naoya Akedo, and Yusuke Nojima. A many-objective test problem for visually examiningdiversity maintenance behavior in a decision space. In
Proceedings of the 13th Annual Conferenceon Genetic and Evolutionary Computation , GECCO ’11, page 649–656, New York, NY, USA, 2011.Association for Computing Machinery. ISBN 9781450305570. doi: 10.1145/2001576.2001666. URL https://doi.org/10.1145/2001576.2001666 .[35] Hisao Ishibuchi, Ryo Imada, Yu Setoguchi, and Yusuke Nojima. How to specify a reference point inhypervolume calculation for fair performance comparison.
Evol. Comput. , 26(3):411–440, September2018.[36] Donald Jones, C. Perttunen, and B. Stuckman. Lipschitzian optimisation without the lipschitz constant.
Journal of Optimization Theory and Applications , 79:157–181, 01 1993. doi: 10.1007/BF00941892.[37] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensiveblack-box functions.
Journal of Global Optimization , 13:455–492, 1998.
[38] Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv e-prints, page arXiv:1312.6114, Dec 2013. [39] J. Knowles. ParEGO: a hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems.
IEEE Transactions on Evolutionary Computation , 10(1):50–66, 2006.[40] Renaud Lacour, Kathrin Klamroth, and Carlos M. Fonseca. A box decomposition algorithm to computethe hypervolume indicator.
Computers & Operations Research , 79:347 – 360, 2017.[41] Benjamin Letham, Brian Karrer, Guilherme Ottoni, and Eytan Bakshy. Constrained bayesian optimizationwith noisy experiments.
Bayesian Analysis , 14(2):495–519, 06 2019. doi: 10.1214/18-BA1110.[42] Xingtao Liao, Qing Li, Xujing Yang, Weigang Zhang, and Wei Li. Multiobjective optimization for crashsafety design of vehicles using stepwise regression model.
Structural and Multidisciplinary Optimization ,35:561–569, 06 2008. doi: 10.1007/s00158-007-0163-x.[43] Edgar Manoatl Lopez, Luis Miguel Antonio, and Carlos A. Coello Coello. A gpu-based algorithm fora faster hypervolume contribution computation. In António Gaspar-Cunha, Carlos Henggeler Antunes,and Carlos Coello Coello, editors,
Evolutionary Multi-Criterion Optimization , pages 80–94. SpringerInternational Publishing, 2015.[44] Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. Neural adaptive video streaming with pensieve. In
Proceedings of the Conference of the ACM Special Interest Group on Data Communication , SIGCOMM ’17,page 197–210, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450346535.doi: 10.1145/3098822.3098843. URL https://doi.org/10.1145/3098822.3098843 .[45] Hongzi Mao, Shannon Chen, Drew Dimmery, Shaun Singh, Drew Blaisdell, Yuandong Tian, MohammadAlizadeh, and Eytan Bakshy. Real-world video adaptation with reinforcement learning. 2019.[46] Hongzi Mao, Parimarjan Negi, Akshay Narayan, Hanrui Wang, Jiacheng Yang, Haonan Wang, RyanMarcus, Ravichandra Addanki, Mehrdad Khani Shirkoohi, Songtao He, Vikram Nathan, Frank Cangialosi,Shaileshh Bojja Venkatakrishnan, Wei-Hung Weng, Shu-Wen Han, Tim Kraska, and Mohammad Alizadeh.Park: An open platform for learning-augmented computer systems. In
NeurIPS , 2019.[47] Sébastien Marmin, Clément Chevalier, and David Ginsbourger. Differentiating the multipoint expectedimprovement for optimal batch design. In Panos Pardalos, Mario Pavone, Giovanni Maria Farinella, andVincenzo Cutello, editors,
Machine Learning, Optimization, and Big Data , pages 37–48, Cham, 2015.Springer International Publishing.[48] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, ZemingLin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.[49] Victor Picheny. Multiobjective optimization using gaussian process emulators via stepwise uncertaintyreduction.
Statistics and Computing , 25, 10 2013. doi: 10.1007/s11222-014-9477-x.[50] Wolfgang Ponweiser, Tobias Wagner, Dirk Biermann, and Markus Vincze. Multiobjective optimizationon a limited budget of evaluations using model-assisted s-metric selection. In Günter Rudolph, ThomasJansen, Nicola Beume, Simon Lucas, and Carlo Poloni, editors,
Parallel Problem Solving from Nature –PPSN X , pages 784–794, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.[51] Alma A. M. Rahat, Richard M. Everson, and Jonathan E. Fieldsend. Alternative infill strategies forexpensive multi-objective optimisation. In
Proceedings of the Genetic and Evolutionary Computation Con-ference , GECCO ’17, page 873–880, New York, NY, USA, 2017. Association for Computing Machinery.ISBN 9781450349208.[52] Carl Edward Rasmussen.
Gaussian Processes in Machine Learning , pages 63–71. Springer BerlinHeidelberg, Berlin, Heidelberg, 2004.[53] Jerry Segercrantz. Inclusion-exclusion and characteristic functions.
Mathematics Magazine, 71(3):216–218, 1998. ISSN 0025570X, 19300980. [54] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of bayesian optimization.
Proceedings of the IEEE , 104(1):148–175, 2016.[55] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization inthe bandit setting: No regret and experimental design. In
Proceedings of the 27th International Conferenceon International Conference on Machine Learning , ICML’10, page 1015–1022, Madison, WI, USA, 2010.Omnipress. ISBN 9781605589077.
[56] J. Sylvester. Note sur la théorème de legendre.
Comptes Rendus Acad. Sci. , 96:463–465, 1883.[57] Ryoji Tanabe and Hisao Ishibuchi. An easy-to-use real-world multi-objective optimization problem suite.
Applied Soft Computing , 89:106078, 2020. ISSN 1568-4946. doi: https://doi.org/10.1016/j.asoc.2020.106078.[58] Takashi Wada and Hideitsu Hino. Bayesian optimization for multi-objective optimization and multi-pointsearch, 2019.[59] Jialei Wang, Scott C. Clark, Eric Liu, and Peter I. Frazier. Parallel bayesian global optimization ofexpensive functions, 2016.[60] J. T. Wilson, R. Moriconi, F. Hutter, and M. P. Deisenroth. The reparameterization trick for acquisitionfunctions.
ArXiv e-prints , December 2017.[61] James Wilson, Frank Hutter, and Marc Deisenroth. Maximizing acquisition functions for bayesianoptimization. In
Advances in Neural Information Processing Systems 31 , pages 9905–9916. 2018.[62] Jian Wu and Peter I. Frazier. The parallel knowledge gradient method for batch bayesian optimization. In
Proceedings of the 30th International Conference on Neural Information Processing Systems , NIPS’16,page 3134–3142, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.[63] Kaifeng Yang, Michael Emmerich, André Deutz, and Carlos M. Fonseca. Computing 3-d expectedhypervolume improvement and related integrals in asymptotically optimal time. In , EMO 2017, page 685–700,Berlin, Heidelberg, 2017. Springer-Verlag.[64] Kaifeng Yang, Michael Emmerich, André H. Deutz, and Thomas Bäck. Efficient computation of expectedhypervolume improvement using box decomposition algorithms.
CoRR , abs/1904.12672, 2019.[65] Kaifeng Yang, Michael Emmerich, André Deutz, and Thomas Bäck. Multi-objective bayesian globaloptimization using expected hypervolume improvement gradient.
Swarm and Evolutionary Computation, 44:945–956, 2019. ISSN 2210-6502. doi: https://doi.org/10.1016/j.swevo.2018.10.007. [66] Kaifeng Yang, Pramudita Palar, Michael Emmerich, Koji Shimoyama, and Thomas Bäck. A multi-point mechanism of expected hypervolume improvement for parallel multi-objective bayesian global optimization. pages 656–663, 07 2019. doi: 10.1145/3321707.3321784. [67] Kaifeng Yang, Pramudita Satria Palar, Michael Emmerich, Koji Shimoyama, and Thomas Bäck. A multi-point mechanism of expected hypervolume improvement for parallel multi-objective bayesian global optimization. In
Proceedings of the Genetic and Evolutionary Computation Conference , GECCO ’19, page656–663, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450361118. doi:10.1145/3321707.3321784. URL https://doi.org/10.1145/3321707.3321784 .[68] R. J. Yang, N. Wang, C. H. Tho, J. P. Bobineau, and B. P. Wang. Metamodeling Development for VehicleFrontal Impact Simulation.
Journal of Mechanical Design , 127(5):1014–1020, 01 2005.[69] E. Zitzler, L. Thiele, M. Laumanns, C. M. Fonseca, and V. G. da Fonseca. Performance assessment ofmultiobjective optimizers: an analysis and review.
IEEE Transactions on Evolutionary Computation, 7(2):117–132, 2003.

Appendix to: Differentiable Expected Hypervolume Improvement for Parallel Multi-Objective Bayesian Optimization

A Derivation of q-Expected Hypervolume Improvement

A.1 Hypervolume Improvement via the Inclusion-Exclusion Principle
The hypervolume improvement of $f(x)$ within the hyper-rectangle $S_k$ is the volume of $S_k \cap \Delta(\{f(x)\}, \mathcal{P}, r)$ and is given by:
$$\mathrm{HVI}_k\big(f(x), l_k, u_k\big) = \lambda_M\big(S_k \cap \Delta(\{f(x)\}, \mathcal{P}, r)\big) = \prod_{m=1}^M \big[z_k^{(m)} - l_k^{(m)}\big]_+,$$
where $u_k^{(m)}$, $l_k^{(m)}$, $f^{(m)}(x)$, and $z_k^{(m)}$ denote the $m$-th component of the corresponding vector and $[\cdot]_+$ denotes the $\max(\cdot, 0)$ operation. Summing over all $S_k$ gives the total hypervolume improvement:
$$\mathrm{HVI}\big(f(x)\big) = \sum_{k=1}^K \mathrm{HVI}_k\big(f(x), l_k, u_k\big) = \sum_{k=1}^K \lambda_M\big(S_k \cap \Delta(\{f(x)\}, \mathcal{P}, r)\big) = \sum_{k=1}^K \prod_{m=1}^M \big[z_k^{(m)} - l_k^{(m)}\big]_+.$$
We can extend the HVI computation to the $q > 1$ case using the inclusion-exclusion principle.

Lemma 1 (The inclusion-exclusion principle [12, 56]). Let $A$ be a $p$-system of some set $R$, i.e., a sequence of potentially empty or overlapping subsets $A_1, \ldots, A_p$ of $R$. Then
$$\bigg|\bigcup_{i=1}^p A_i\bigg| = \sum_{i=1}^p |A_i| - \sum_{1 \le i < j \le p} |A_i \cap A_j| + \sum_{1 \le i < j < k \le p} |A_i \cap A_j \cap A_k| - \ldots + (-1)^{p-1}\,|A_1 \cap \ldots \cap A_p|$$
Or equivalently,
$$\bigg|\bigcup_{i=1}^p A_i\bigg| = \sum_{j=1}^p (-1)^{j+1} \sum_{1 \le i_1 < \ldots < i_j \le p} |A_{i_1} \cap \ldots \cap A_{i_j}|$$

In the context of computing the joint HVI of $q$ new points $\{f(x_i)\}_{i=1}^q$, each subset $A_i$ for $i = 1, \ldots, q$ is the set of points contained in $\Delta(\{f(x_i)\}, \mathcal{P}, r)$, independently of the other $q - 1$ points. $|A_i|$ is the hypervolume improvement from the new point $f(x_i)$: $|A_i| = \mathrm{HVI}(f(x_i))$. The union of these subsets is the set of points in the new space dominated by the $q$ new points: $\bigcup_{i=1}^q A_i = \bigcup_{i=1}^q \Delta(\{f(x_i)\}, \mathcal{P}, r)$. The hypervolume of $\bigcup_{i=1}^q \Delta(\{f(x_i)\}, \mathcal{P}, r)$ is the hypervolume improvement from the $q$ new points:
$$\mathrm{HVI}(\{f(x_i)\}_{i=1}^q) = \bigg|\bigcup_{i=1}^q A_i\bigg| = \sum_{j=1}^q (-1)^{j+1} \sum_{1 \le i_1 < \ldots < i_j \le q} |A_{i_1} \cap \cdots \cap A_{i_j}|$$
To compute $|A_{i_1} \cap \cdots \cap A_{i_j}|$, we partition the space covered by $A_{i_1} \cap \cdots \cap A_{i_j}$ across the $K$ hyper-rectangles $\{S_k\}_{k=1}^K$ and compute the hypervolume of the overlapping space of $A_{i_1} \cap \cdots \cap A_{i_j}$ with each $S_k$ independently. Since $\{S_k\}_{k=1}^K$ is a disjoint partition, summing over $K$ gives the hypervolume of $A_{i_1} \cap \cdots \cap A_{i_j}$:
$$|A_{i_1} \cap \cdots \cap A_{i_j}| = \sum_{k=1}^K |S_k \cap A_{i_1} \cap \cdots \cap A_{i_j}|$$
This has two advantages. First, the new dominated space $A_i$ can be a non-rectangular polytope, but the intersection $A_i \cap S_k$ is a rectangular polytope, which simplifies computation of the overlapping hypervolume. Second, the vertices defining the hyper-rectangle encapsulated by $S_k \cap A_{i_1} \cap \cdots \cap A_{i_j}$ are easily derived. The lower bound is simply the lower bound $l_k$ of $S_k$ and the upper bound is the component-wise minimum $z_{k, i_1, \ldots, i_j} = \min\big[u_k, f(x_{i_1}), \ldots, f(x_{i_j})\big]$. Importantly, this is computationally tractable because this specific approach enables parallelizing computation across all intersections of subsets $A_{i_1} \cap \cdots \cap A_{i_j}$ for $1 \le i_1 < \ldots < i_j \le q$ and across all $K$ hyper-rectangles. Explicitly, the HVI is computed as:
$$\mathrm{HVI}(\{f(x_i)\}_{i=1}^q) = \bigg|\bigcup_{i=1}^q A_i\bigg| = \sum_{j=1}^q \sum_{1 \le i_1 < \ldots < i_j \le q} (-1)^{j+1}\,|A_{i_1} \cap \cdots \cap A_{i_j}|$$
$$= \sum_{k=1}^K \sum_{j=1}^q \sum_{1 \le i_1 < \ldots < i_j \le q} (-1)^{j+1}\,|S_k \cap A_{i_1} \cap \cdots \cap A_{i_j}|$$
$$= \sum_{k=1}^K \sum_{j=1}^q \sum_{1 \le i_1 < \ldots < i_j \le q} (-1)^{j+1}\,\lambda_M\big(S_k \cap \Delta(\{f(x_{i_1})\}, \mathcal{P}, r) \cap \ldots \cap \Delta(\{f(x_{i_j})\}, \mathcal{P}, r)\big)$$
$$= \sum_{k=1}^K \sum_{j=1}^q \sum_{1 \le i_1 < \ldots < i_j \le q} (-1)^{j+1} \prod_{m=1}^M \big[z^{(m)}_{k, i_1, \ldots, i_j} - l_k^{(m)}\big]_+ = \sum_{k=1}^K \sum_{j=1}^q \sum_{X_j \in \mathcal{X}_j} (-1)^{j+1} \prod_{m=1}^M \big[z^{(m)}_{k, X_j} - l_k^{(m)}\big]_+$$
where $\mathcal{X}_j$ is the set of all subsets of $X_{\mathrm{cand}}$ of size $j$: $\mathcal{X}_j = \{X_j \subseteq X_{\mathrm{cand}} : |X_j| = j\}$ and $z^{(m)}_{k, X_j} = z^{(m)}_{k, i_1, \ldots, i_j}$ for $X_j = \{x_{i_1}, \ldots, x_{i_j}\}$.

A.2 Computing Expected Hypervolume Improvement
The above approach for computing HVI assumes we know the true objective values $\{f(x_i)\}_{i=1}^q$. Since we do not know the true function values $\{f(x_i)\}_{i=1}^q$, we compute qEHVI as the expectation over the GP posterior:
$$\alpha_{q\mathrm{EHVI}} = \mathbb{E}\Big[\mathrm{HVI}(\{f(x_i)\}_{i=1}^q)\Big] = \int_{\mathbb{R}^{qM}} \mathrm{HVI}(\{f(x_i)\}_{i=1}^q)\, p(f \mid \mathcal{D})\, df \quad (6)$$
In the sequential ($q = 1$) setting and under the assumption of independent outcomes, qEHVI is simply EHVI and can be expressed in closed form [65]. However, when $q > 1$, there is no known analytical formulation [66]. Instead, we estimate the expectation in (6) using MC integration with samples from the joint posterior $P\big(f(x_1), \ldots, f(x_q) \mid \mathcal{D}\big)$:
$$\alpha_{q\mathrm{EHVI}} = \mathbb{E}\Big[\mathrm{HVI}(\{f(x_i)\}_{i=1}^q)\Big] \approx \frac{1}{N} \sum_{t=1}^N \mathrm{HVI}(\{f_t(x_i)\}_{i=1}^q) \quad (7)$$
$$= \frac{1}{N} \sum_{t=1}^N \sum_{k=1}^K \sum_{j=1}^q \sum_{X_j \in \mathcal{X}_j} (-1)^{j+1} \prod_{m=1}^M \big[z^{(m)}_{k, X_j, t} - l_k^{(m)}\big]_+ \quad (8)$$
where $\{f_t(x_i)\}_{i=1}^q \sim P\big(f(x_1), \ldots, f(x_q) \mid X, Y\big)$ is the $t$-th sample from the joint posterior over $X_{\mathrm{cand}}$ and $z^{(m)}_{k, X_j, t} = \min\big[u_k^{(m)}, \min_{x' \in X_j} f_t^{(m)}(x')\big]$.

A.3 Supporting Outcome Constraints
Recall that we defined the constrained hypervolume improvement as
$$\mathrm{HVI}_C(f(x), c(x)) = \mathrm{HVI}[f(x)] \cdot \mathbb{1}[c(x) \ge 0]. \quad (9)$$
For $q = 1$ and assuming independence of the objectives and the constraints, the expected $\mathrm{HVI}_C$ is the product of the expected HVI and the probability of feasibility (the expectation of $\mathbb{1}[c(x) \ge 0]$) [21]. However, requiring objectives and constraints to be independent is unnecessary when estimating the expectation with MC integration using samples from the joint posterior.

In the parallel setting, if all constraints are satisfied for all $q$ candidates $X_{\mathrm{cand}} = \{x_i\}_{i=1}^q$, $\mathrm{HVI}_C$ is simply HVI. If a subset $V \subset X_{\mathrm{cand}}$, $V \ne \emptyset$, of the candidates violates at least one of the constraints, then the feasible HVI is the HVI of the set of feasible candidates: $\mathrm{HVI}_C(X_{\mathrm{cand}}) = \mathrm{HVI}(X_{\mathrm{cand}} \setminus V)$. That is, the hypervolume contribution (i.e. the marginal HVI) of an infeasible point is zero. In our formulation, $\mathrm{HVI}_C$ can be computed by multiplying (5) with an additional factor $\prod_{x' \in X_j} \prod_{v=1}^V \mathbb{1}[c^{(v)}(x') \ge 0]$:
$$\mathrm{HVI}_C(\{f(x_i), c(x_i)\}_{i=1}^q) = \sum_{k=1}^K \sum_{j=1}^q \sum_{X_j \in \mathcal{X}_j} (-1)^{j+1} \Bigg[\bigg(\prod_{m=1}^M \big[z^{(m)}_{k, X_j} - l_k^{(m)}\big]_+\bigg) \prod_{x' \in X_j} \prod_{v=1}^V \mathbb{1}[c^{(v)}(x') \ge 0]\Bigg]. \quad (10)$$
The additional factor $\prod_{x' \in X_j} \prod_{v=1}^V \mathbb{1}[c^{(v)}(x') \ge 0]$ indicates whether all constraints are satisfied for all candidates in a given subset $X_j$. Thus $\mathrm{HVI}_C$ can be computed in the same fashion as HVI, but with the additional step of setting the HV of all subsets containing $x'$ to zero if $x'$ violates any constraint. We can now again perform MC integration as in (5) to compute the expected constrained hypervolume improvement.

In this formulation, the marginal hypervolume improvement from a candidate is weighted by the probability that the candidate is feasible. The marginal hypervolume improvements are highly dependent on the outcomes of the other candidates. Importantly, the MC-based approach enables us to properly estimate the marginal hypervolume improvements across candidates by sampling from the joint posterior.

Note that while the expected constrained hypervolume improvement $\mathbb{E}\big[\mathrm{HVI}_C(\{f(x_i), c(x_i)\}_{i=1}^q)\big]$ is differentiable, we may not differentiate inside the expectation (hence we cannot expect simply differentiating (10) on the sample level to provide proper gradients). We therefore replace the indicator with a sigmoid function with temperature parameter $\epsilon$, which provides a differentiable relaxation
$$\mathbb{1}[c^{(v)}(x') \ge 0] \approx s(c^{(v)}(x'); \epsilon) := \frac{1}{1 + \exp(-c^{(v)}(x')/\epsilon)} \quad (11)$$
that becomes exact in the limit $\epsilon \searrow 0$.

As in the unconstrained parallel scenario, there is no known analytical expression for the expected feasible hypervolume improvement. Therefore, we again use MC integration to approximate the expectation:
$$\alpha_{q\mathrm{EHVI}_C}(x) = \mathbb{E}\Big[\mathrm{HVI}_C(\{f(x_i), c(x_i)\}_{i=1}^q)\Big] \quad (12a)$$
$$\approx \frac{1}{N} \sum_{t=1}^N \mathrm{HVI}_C(\{f_t(x_i), c_t(x_i)\}_{i=1}^q) \quad (12b)$$
$$\approx \frac{1}{N} \sum_{t=1}^N \sum_{k=1}^K \sum_{j=1}^q \sum_{X_j \in \mathcal{X}_j} (-1)^{j+1} \Bigg[\bigg(\prod_{m=1}^M \big[z^{(m)}_{k, X_j, t} - l_k^{(m)}\big]_+\bigg) \prod_{x' \in X_j} \prod_{v=1}^V s(c_t^{(v)}(x'); \epsilon)\Bigg] \quad (12c)$$

A.3.1 Inclusion-Exclusion Principle for $\mathrm{HVI}_C$

Equation (10) holds when the indicator function is used because $\mathrm{HVI}_C$ is equivalent to HVI computed over the subset of feasible points. However, the sigmoid approximation can result in non-zero error. The error function $\varepsilon: 2^{X_{\mathrm{cand}}} \to \mathbb{R}$ can be expressed as
$$\varepsilon(X) = \prod_{x' \in X} \prod_{v=1}^V \mathbb{1}[c^{(v)}(x') > 0] - \prod_{x' \in X} \prod_{v=1}^V s(c^{(v)}(x'); \epsilon)$$
The error function assigns a value to each subset of $X_{\mathrm{cand}}$. Weight functions have been studied in conjunction with the inclusion-exclusion principle [53], but under the assumption that the weight of a set is the sum of the weights of its elements: $w(A) = \sum_{a \in A} w(a)$. In our case, the weight function of a set $A$ is the product of the weights of its elements. Therefore, it is not obvious whether the inclusion-exclusion principle holds in this case.

Theorem 1.
Given a feasible Pareto front $\mathcal{P}_{\mathrm{feas}}$ and a partitioning $\{(l_k, u_k)\}_{k=1}^K$ of the region of the objective space $\mathbb{R}^M$ that is not dominated by $\mathcal{P}_{\mathrm{feas}}$, then for a set of points $X_{\mathrm{cand}}$ with objective values $f(X_{\mathrm{cand}})$ and constraint values $c(X_{\mathrm{cand}})$,
$$\mathrm{HVI}_C(f(X_{\mathrm{cand}}), c(X_{\mathrm{cand}}), \mathcal{P}, r) = \mathrm{HVI}(f'(X_{\mathrm{cand}}), \mathcal{P}', r')$$
where $f'(X_{\mathrm{cand}})$ is the set of objective-constraint vectors for each candidate point, $f'(x) \in \mathbb{R}^{M+V}$; $\mathcal{P}'$ is the set of vectors $[f^{(1)}(x), \ldots, f^{(M)}(x), 1, \ldots, 1] \in \mathbb{R}^{M+V}$ (appending $V$ ones); and $r' = [r^{(1)}, \ldots, r^{(M)}, 0, \ldots, 0] \in \mathbb{R}^{M+V}$ (appending $V$ zeros).

Proof. Recall equation (10),
$$\mathrm{HVI}_C(\{f(x_i), c(x_i)\}_{i=1}^q) = \sum_{k=1}^K \sum_{j=1}^q \sum_{X_j \in \mathcal{X}_j} (-1)^{j+1} \Bigg[\bigg(\prod_{m=1}^M \big[z^{(m)}_{k, X_j} - l_k^{(m)}\big]_+\bigg) \prod_{x' \in X_j} \prod_{v=1}^V \mathbb{1}[c^{(v)}(x') \ge 0]\Bigg].$$
Note that the constraint product satisfies
$$\prod_{x' \in X_j} \prod_{v=1}^V \mathbb{1}[c^{(v)}(x') \ge 0] = \prod_{v=1}^V \prod_{x' \in X_j} \mathbb{1}[c^{(v)}(x') \ge 0] = \prod_{v=1}^V \min_{x' \in X_j} \mathbb{1}[c^{(v)}(x') \ge 0]$$
$$= \prod_{v=1}^V \min\Big[1,\, \min_{x' \in X_j} \mathbb{1}[c^{(v)}(x') \ge 0]\Big] = \prod_{v=1}^V \bigg[\min\Big[1,\, \min_{x' \in X_j} \mathbb{1}[c^{(v)}(x') \ge 0]\Big] - 0\bigg]_+. \quad (13)$$
For $v = 1, \ldots, V$ and $k = 1, \ldots, K$, let $l_k^{(M+v)} = 0$ and $u_k^{(M+v)} = 1$. Then, substituting into the following expression from Equation (13) gives
$$\min\Big[1,\, \min_{x' \in X_j} \mathbb{1}[c^{(v)}(x') \ge 0]\Big] = \min\Big[u_k^{(M+v)},\, \min_{x' \in X_j} \mathbb{1}[c^{(v)}(x') \ge 0]\Big]$$
Recall from Section 4 that $z$ is defined as $z_k := \min\big[u_k, f(x)\big]$. The high-level idea is that if we consider the indicators of the slack constraints $\mathbb{1}[c^{(v)}(x') \ge 0]$ as objectives, then the above expression is consistent with the definition of $z$ at the beginning of Section 4. For $v = 1, \ldots, V$,
$$z^{(M+v)}_{k, X_j} = \min\Big[1,\, \min_{x' \in X_j} \mathbb{1}[c^{(v)}(x') \ge 0]\Big]$$
Thus,
$$\prod_{x' \in X_j} \prod_{v=1}^V \mathbb{1}[c^{(v)}(x') \ge 0] = \prod_{v=1}^V \bigg[\min\Big[1,\, \min_{x' \in X_j} \mathbb{1}[c^{(v)}(x') \ge 0]\Big] - 0\bigg]_+ = \prod_{v=1}^V \big[z^{(M+v)}_{k, X_j} - l_k^{(M+v)}\big]_+$$
Returning to the $\mathrm{HVI}_C$ equation, we have
$$\mathrm{HVI}_C(\{f(x_i), c(x_i)\}_{i=1}^q) = \sum_{k=1}^K \sum_{j=1}^q \sum_{X_j \in \mathcal{X}_j} (-1)^{j+1} \Bigg[\bigg(\prod_{m=1}^M \big[z^{(m)}_{k, X_j} - l_k^{(m)}\big]_+\bigg) \prod_{v=1}^V \big[z^{(M+v)}_{k, X_j} - l_k^{(M+v)}\big]_+\Bigg]$$
$$= \sum_{k=1}^K \sum_{j=1}^q \sum_{X_j \in \mathcal{X}_j} (-1)^{j+1} \Bigg[\prod_{m=1}^{M+V} \big[z^{(m)}_{k, X_j} - l_k^{(m)}\big]_+\Bigg] \quad (14)$$
Now consider the case when a sigmoid approximation $\mathbb{1}[c^{(v)}(x') \ge 0] \approx s(c^{(v)}(x'); \epsilon)$ is used. The only change to Equation (14) is that, for $m = M + v$ with $v = 1, \ldots, V$,
$$z^{(m)}_{k, X_j} \approx \hat{z}^{(m)}_{k, X_j} = \min\Big[u_k^{(M+v)},\, \min_{x' \in X_j} s(c^{(v)}(x'); \epsilon)\Big].$$
If $s(c^{(v)}(x'); \epsilon) = \mathbb{1}[c^{(v)}(x') \ge 0]$ for all $v, x'$, then $\mathrm{HVI}_C$ is computed exactly without approximation error. If $s(c^{(v)}(x'); \epsilon) \ne \mathbb{1}[c^{(v)}(x') \ge 0]$ for any $v, x'$, then there is approximation error: the hypervolume improvement from all subsets containing $x'$ is proportional to $\prod_{v=1}^V \min_{x' \in X_j} s(c^{(v)}(x'); \epsilon)$. Since the constraint outcomes are directly considered as components in the hypervolume computation, the inclusion-exclusion principle incorporates the approximate indicator properly.

A.4 Complexity

Recall from Section 3.3 that, given posterior samples, the time complexity on a single-threaded machine is $T = O(MNK(2^q - 1))$.
The space complexity required for maximum parallelism is also $T$ (ignoring the space required by the models), which does limit scalability to larger $M$ and $q$; however, difficulty scaling to large $M$ is a known limitation of EHVI [65]. To reduce memory load, rectangles could be materialized and processed in chunks at the cost of additional runtime. In addition, our implementation of qEHVI uses the box decomposition algorithm from Couckuyt et al. [10], but we emphasize that qEHVI is agnostic to the choice of partitioning algorithm, and using a more efficient partitioning algorithm (e.g. [65, 16, 40]) may significantly improve the memory footprint on GPU and enable using larger $q$ in many scenarios.

B Error Bound on Sequential Greedy Approximation
If the acquisition function $L(X_{\mathrm{cand}})$ is a normalized, monotone, submodular set function (where submodular means that the increase in $L(X_{\mathrm{cand}})$ is non-increasing as elements are added to $X_{\mathrm{cand}}$, and normalized means that $L(\emptyset) = 0$), then the sequential greedy approximation of $L$ enjoys regret of no more than $\frac{1}{e}L^*$, where $L^*$ is the optimum of $L$ [22]. We have $\alpha_{q\mathrm{EHVI}}(X_{\mathrm{cand}}) = L(X_{\mathrm{cand}}) = \mathbb{E}_f\big(\mathrm{HVI}\big[f(X_{\mathrm{cand}})\big]\big)$. Since HVI is a submodular set function [23] and the expectation of a stochastic submodular function is also submodular [2], $\alpha_{q\mathrm{EHVI}}(X_{\mathrm{cand}})$ is also submodular, and therefore its sequential greedy approximation enjoys regret of no more than $\frac{1}{e}\alpha^*_{q\mathrm{EHVI}}$. Using the result from Wilson et al. [61], the MC-based approximation $\hat{\alpha}_{q\mathrm{EHVI}}(X_{\mathrm{cand}}) = \frac{1}{N}\sum_{t=1}^N \mathrm{HVI}\big[f_t(X_{\mathrm{cand}})\big]$ also enjoys the same regret bound, since HVI is a normalized submodular set function.

C Convergence Results
For the purpose of stating our convergence results, we recall some concepts and notation from Balandat et al. [5]. First, consider a sample $\{f_t(x_i)\}_{i=1}^q$ from the multi-output posterior of the GP surrogate model. Let $\mathbf{x} \in \mathbb{R}^{qd}$ be the stacked set of candidates $X_{\mathrm{cand}}$ and let $f_t(\mathbf{x}) := [f_t(x_1)^T, \ldots, f_t(x_q)^T]^T$ be the stacked set of corresponding objective vectors. It is well known that, using the reparameterization trick, we can write
$$f_t(\mathbf{x}) = \mu(\mathbf{x}) + L(\mathbf{x})\,\epsilon_t, \quad (15)$$
where $\mu: \mathbb{R}^{qd} \to \mathbb{R}^{qM}$ is the mean function of the multi-output GP, $L(\mathbf{x}) \in \mathbb{R}^{qM \times qM}$ is a root decomposition (typically the Cholesky decomposition) of the multi-output GP's posterior covariance $\Sigma(\mathbf{x}) \in \mathbb{R}^{qM \times qM}$, and $\epsilon_t \in \mathbb{R}^{qM}$ with $\epsilon_t \sim \mathcal{N}(0, I_{qM})$.

For $\mathbf{x} \in \mathcal{X}$, consider the MC approximation $\hat{\alpha}^N_{q\mathrm{EHVI}}(\mathbf{x})$ from (5). Denote by $\nabla_{\mathbf{x}} \hat{\alpha}^N_{q\mathrm{EHVI}}(\mathbf{x})$ the gradient of $\hat{\alpha}^N_{q\mathrm{EHVI}}(\mathbf{x})$, obtained by averaging the gradients on the sample level:
$$\nabla_{\mathbf{x}} \hat{\alpha}^N_{q\mathrm{EHVI}}(\mathbf{x}) := \frac{1}{N}\sum_{t=1}^N \nabla_{\mathbf{x}} \mathrm{HVI}(\{f_t(x_i)\}_{i=1}^q) \quad (16)$$
Let $\alpha^*_{q\mathrm{EHVI}} := \max_{\mathbf{x} \in \mathcal{X}} \alpha_{q\mathrm{EHVI}}(\mathbf{x})$ denote the maximum of the true acquisition function qEHVI, and let $\mathcal{X}^* := \arg\max_{\mathbf{x} \in \mathcal{X}} \alpha_{q\mathrm{EHVI}}(\mathbf{x})$ denote the set of associated maximizers.

Theorem 2.
Suppose that $\mathcal{X}$ is compact and that $f$ has a multi-output Gaussian process prior with continuously differentiable mean and covariance functions. If the base samples $\{\epsilon_t\}_{t=1}^{N}$ are drawn i.i.d. from $\mathcal{N}(0, I_{qM})$, and if $\hat{x}^*_N \in \arg\max_{x \in \mathcal{X}} \hat{\alpha}^N_{q\text{EHVI}}(x)$, then

(1) $\alpha_{q\text{EHVI}}(\hat{x}^*_N) \to \alpha^*_{q\text{EHVI}}$ a.s.
(2) $\text{dist}(\hat{x}^*_N, \mathcal{X}^*) \to 0$ a.s.

In addition to the almost sure convergence in Theorem 2, deriving a result on the convergence rate of the optimizer, similar to the one obtained in [5], should be possible. We leave this to future work. Moreover, the results in Theorem 2 can also be extended to the situation in which the base samples are generated using a particular class of randomized QMC methods (see similar results in [5]).
Proof.
We consider the setting from Balandat et al. [5, Section D.5]. Let $\epsilon \sim \mathcal{N}(0, I_{qM})$, so that we can write the posterior over outcome $m$ at $x$ as the random variable $f^{(m)}(x, \epsilon) = S_{\{i_j, m\}}(\mu(x) + L(x)\epsilon)$, where $\mu(x)$ and $L(x)$ are the (vector-valued) posterior mean and the Cholesky factor of the posterior covariance, respectively, and $S_{\{i_j, m\}}$ is an appropriate selection matrix (in particular, $\|S_{\{i_j, m\}}\|_\infty \le 1$ for all $i_j$ and $m$). Let

$$A(x, \epsilon) = \sum_{k=1}^{K} \sum_{j=1}^{q} \sum_{X_j \in \mathcal{X}_j} (-1)^{j+1} \prod_{m=1}^{M} \big[z^{(m)}_{k, X_j}(\epsilon) - l^{(m)}_k\big]_+,$$

where $z^{(m)}_{k, X_j}(\epsilon) = \min\big[u^{(m)}_k, f^{(m)}(x_{i_1}, \epsilon), \ldots, f^{(m)}(x_{i_j}, \epsilon)\big]$ and $X_j = \{x_{i_1}, \ldots, x_{i_j}\}$. Following [5, Theorem 3], we need to show that there exists an integrable function $\ell: \mathbb{R}^{q \times M} \mapsto \mathbb{R}$ such that for almost every $\epsilon$ and all $x, y \subseteq \mathcal{X}$, $x, y \in \mathbb{R}^{q \times d}$,

$$|A(x, \epsilon) - A(y, \epsilon)| \le \ell(\epsilon)\,\|x - y\|. \qquad (17)$$

Let us define $\tilde{a}_{kmjX_j}(x, \epsilon) := \big[\min\big[u^{(m)}_k, f^{(m)}(x_{i_1}, \epsilon), \ldots, f^{(m)}(x_{i_j}, \epsilon)\big] - l^{(m)}_k\big]_+$. Linearity implies that it suffices to show that this condition holds for

$$\tilde{A}(x, \epsilon) := \prod_{m=1}^{M} \tilde{a}_{kmjX_j}(x, \epsilon) = \prod_{m=1}^{M} \big[\min\big[u^{(m)}_k, f^{(m)}(x_{i_1}, \epsilon), \ldots, f^{(m)}(x_{i_j}, \epsilon)\big] - l^{(m)}_k\big]_+ \qquad (18)$$

for all $k$, $j$, and $X_j$.

(Footnote to the submodularity discussion above: As noted in Wilson et al. [61], submodularity technically requires the search space $\mathcal{X}$ to be finite, whereas in BO it will typically be infinite. Wilson et al. [61] note that in similar scenarios, submodularity has been extended to infinite sets $\mathcal{X}$ (e.g., Srinivas et al. [55]).)
Observe that

$$\tilde{a}_{kmjX_j}(x, \epsilon) \le \Big|\min\big[u^{(m)}_k, f^{(m)}(x_{i_1}, \epsilon), \ldots, f^{(m)}(x_{i_j}, \epsilon)\big] - l^{(m)}_k\Big| \le |l^{(m)}_k| + \Big|\min\big[u^{(m)}_k, f^{(m)}(x_{i_1}, \epsilon), \ldots, f^{(m)}(x_{i_j}, \epsilon)\big]\Big|.$$

Note that if $u^{(m)}_k = \infty$, then $\min[u^{(m)}_k, f^{(m)}(x_{i_1}, \epsilon), \ldots, f^{(m)}(x_{i_j}, \epsilon)] = \min[f^{(m)}(x_{i_1}, \epsilon), \ldots, f^{(m)}(x_{i_j}, \epsilon)]$. If $u^{(m)}_k < \infty$, then $\big|\min[u^{(m)}_k, f^{(m)}(x_{i_1}, \epsilon), \ldots, f^{(m)}(x_{i_j}, \epsilon)]\big| \le \big|\min[f^{(m)}(x_{i_1}, \epsilon), \ldots, f^{(m)}(x_{i_j}, \epsilon)]\big| + |u^{(m)}_k|$. Let $w^{(m)}_k = u^{(m)}_k$ if $u^{(m)}_k < \infty$ and $0$ otherwise. Then

$$\tilde{a}_{kmjX_j}(x, \epsilon) \le |l^{(m)}_k| + |w^{(m)}_k| + \Big|\min\big[f^{(m)}(x_{i_1}, \epsilon), \ldots, f^{(m)}(x_{i_j}, \epsilon)\big]\Big| \le |l^{(m)}_k| + |w^{(m)}_k| + \sum_{i_1, \ldots, i_j} \big|f^{(m)}(x_{i_j}, \epsilon)\big|.$$

We therefore have that

$$|\tilde{a}_{kmjX_j}(x, \epsilon)| \le |l^{(m)}_k| + |w^{(m)}_k| + |X_j|\big(\|\mu^{(m)}(x)\| + \|L^{(m)}(x)\|\,\|\epsilon\|\big)$$

for all $k, m, j, X_j$, where $|X_j|$ denotes the cardinality of the set $X_j$. Under our assumptions (compactness of $\mathcal{X}$, continuous differentiability of the mean and covariance functions), both $\mu(x)$ and $L(x)$, as well as their respective gradients w.r.t. $x$, are uniformly bounded.
In particular, there exist $C_1, C_2 < \infty$ such that $|\tilde{a}_{kmjX_j}(x, \epsilon)| \le C_1 + C_2\|\epsilon\|$ for all $k, m, j, X_j$. Dropping the indices $k, j, X_j$ for simplicity, observe that for $M = 2$,

$$\big|\tilde{A}(x, \epsilon) - \tilde{A}(y, \epsilon)\big| = \big|\tilde{a}_1(x, \epsilon)\,\tilde{a}_2(x, \epsilon) - \tilde{a}_1(y, \epsilon)\,\tilde{a}_2(y, \epsilon)\big| \qquad (19a)$$
$$= \big|\tilde{a}_1(x, \epsilon)\big(\tilde{a}_2(x, \epsilon) - \tilde{a}_2(y, \epsilon)\big) + \tilde{a}_2(y, \epsilon)\big(\tilde{a}_1(x, \epsilon) - \tilde{a}_1(y, \epsilon)\big)\big| \qquad (19b)$$
$$\le |\tilde{a}_1(x, \epsilon)|\,\big|\tilde{a}_2(x, \epsilon) - \tilde{a}_2(y, \epsilon)\big| + |\tilde{a}_2(y, \epsilon)|\,\big|\tilde{a}_1(x, \epsilon) - \tilde{a}_1(y, \epsilon)\big|. \qquad (19c)$$

Furthermore,

$$|\tilde{a}_{kmjX_j}(x, \epsilon) - \tilde{a}_{kmjX_j}(y, \epsilon)| \le \sum_{i_1, \ldots, i_j} \big|S_{\{i_j, m\}}(\mu(x) + L(x)\epsilon) - S_{\{i_j, m\}}(\mu(y) + L(y)\epsilon)\big| \le |X_j|\big(\|\mu(x) - \mu(y)\| + \|L(x) - L(y)\|\,\|\epsilon\|\big).$$

Since $\mu$ and $L$ have uniformly bounded gradients, they are Lipschitz. Therefore, there exist $C_3, C_4 < \infty$ such that

$$|\tilde{a}_{kmjX_j}(x, \epsilon) - \tilde{a}_{kmjX_j}(y, \epsilon)| \le (C_3 + C_4\|\epsilon\|)\,\|x - y\|$$

for all $x, y, k, m, j, X_j$. Plugging this into (19) above, we find that

$$\big|\tilde{A}(x, \epsilon) - \tilde{A}(y, \epsilon)\big| \le 2\big(C_1 C_3 + (C_1 C_4 + C_2 C_3)\|\epsilon\| + C_2 C_4\|\epsilon\|^2\big)\,\|x - y\|$$

for all $x, y$ and $\epsilon$. For $M > 2$ we generalize the idea from (19), making sure to telescope the respective expressions.
It is not hard to see that, with this, there exists $C < \infty$ such that

$$\big|\tilde{A}(x, \epsilon) - \tilde{A}(y, \epsilon)\big| \le C \sum_{m=0}^{M} \|\epsilon\|^{m}\, \|x - y\|.$$

Letting $\ell(\epsilon) := C\sum_{m=0}^{M} \|\epsilon\|^{m}$, we observe that $\ell(\epsilon)$ is integrable (since all absolute moments of the Normal distribution exist). The result now follows from Balandat et al. [5, Theorem 3].

Besides the above convergence result, we can also show that the sample-average gradient of the MC approximation of $q$EHVI is an unbiased estimator of the true gradient of $q$EHVI:
Proposition 1.
Suppose that the GP mean and covariance functions are continuously differentiable. Suppose further that the candidate set $x$ has no duplicates, and that the sample-level gradients $\nabla_x \text{HVI}(\{f_t(x_i)\}_{i=1}^{q})$ are obtained using the reparameterization trick as in [5]. Then

$$\mathbb{E}\big[\nabla_x \hat{\alpha}^N_{q\text{EHVI}}(x)\big] = \nabla_x \alpha_{q\text{EHVI}}(x), \qquad (20)$$

that is, the averaged sample-level gradient is an unbiased estimate of the gradient of the true acquisition function.

Proof. This proof follows the arguments of Wang et al. [59, Theorem 1], which leverage Glasserman [30, Theorem 1]. We verify the conditions of Glasserman [30, Theorem 1] below. Using the arguments from [5], we know that, under the assumption of differentiable mean and covariance functions, the samples $f_t(x)$ are continuously differentiable w.r.t. $x$ (since there are no duplicates, and thus the covariance $\Sigma(x)$ is non-singular). Hence, Glasserman [30, A1] is satisfied. Furthermore, it is easy to see from (1) that $\text{HVI}(\{f(x_i)\}_{i=1}^{q})$ is a.s. continuous and is differentiable w.r.t. $f_t(x)$ on $\mathbb{R}^M$, except on the edges of the hyper-rectangle decomposition $\{S_k\}_{k=1}^{K}$ of the non-dominated space, which satisfies [30, A3]. The set of points defined by the union of these edges clearly has measure zero under any non-degenerate (non-singular covariance) GP posterior on $\mathbb{R}^M$, so Glasserman [30, A4] holds. Therefore, Glasserman [30, Lemma 2] holds, so $\text{HVI}(\{f(x_i)\}_{i=1}^{q})$ is a.s. piecewise differentiable w.r.t. $x$.

Lastly, we need to show that the result in Glasserman [30, Lemma 3] holds: $\mathbb{E}\big[\sup_{x_{ci} \notin \tilde{D}} |A'(x, \epsilon)|\big] < \infty$. As in Wang et al. [59, Theorem 1], we fix $x$ except for $x_{ci}$, where $x_{ci}$ is the $c$-th component of the $i$-th point.
By linearity, it suffices to show that $\mathbb{E}\big[\sup_{x_{ci} \notin \tilde{D}} |\tilde{A}'(x, \epsilon)|\big] < \infty$. We have

$$\mathbb{E}\Big[\sup_{x_{ci} \notin \tilde{D}} |\tilde{A}'(x, \epsilon)|\Big] = \mathbb{E}\Big[\sup_{x_{ci} \notin \tilde{D}} \Big|\frac{\partial \tilde{A}(x, \epsilon)}{\partial x_{ci}}\Big|\Big].$$

Consider the $M = 2$ case. We have $\tilde{A}(x, \epsilon) = a_1(x, \epsilon)\, a_2(x, \epsilon)$, where

$$a_m(x, \epsilon) = \big[\min\big[u^{(m)}_k, f^{(m)}(x_{i_1}, \epsilon), \ldots, f^{(m)}(x_{i_j}, \epsilon)\big] - l^{(m)}_k\big]_+.$$

The partial derivative of $\tilde{A}(x, \epsilon)$ with respect to $x_{ci}$ is

$$\frac{\partial \tilde{A}(x, \epsilon)}{\partial x_{ci}} = \frac{\partial a_1(x, \epsilon)}{\partial x_{ci}}\, a_2(x, \epsilon) + a_1(x, \epsilon)\, \frac{\partial a_2(x, \epsilon)}{\partial x_{ci}},$$

and therefore

$$\Big|\frac{\partial \tilde{A}(x, \epsilon)}{\partial x_{ci}}\Big| \le \Big|\frac{\partial a_1(x, \epsilon)}{\partial x_{ci}}\Big| \cdot \big|a_2(x, \epsilon)\big| + \big|a_1(x, \epsilon)\big| \cdot \Big|\frac{\partial a_2(x, \epsilon)}{\partial x_{ci}}\Big|.$$

Since we are only concerned with $x_{ci} \notin \tilde{D}$,

$$a_m(x, \epsilon) = \big[\min\big[f^{(m)}(x_{i_1}, \epsilon), \ldots, f^{(m)}(x_{i_j}, \epsilon)\big] - l^{(m)}_k\big]_+.$$

As in the proof of Theorem 2, we write the posterior over outcome $m$ at $x$ as the random variable $f^{(m)}(x, \epsilon) = S_{\{i_j, m\}}(\mu(x) + L(x)\epsilon)$, where $\epsilon \sim \mathcal{N}(0, I_{qM})$ and $S_{\{i_j, m\}}$ is an appropriate selection matrix. With this,

$$a_1(x, \epsilon) = \big[\min\big[S_{\{i_1, 1\}}\big(\mu(x) + L(x)\epsilon\big), \ldots, S_{\{i_j, 1\}}\big(\mu(x) + L(x)\epsilon\big)\big] - l^{(1)}_k\big]_+.$$
Since the domain $\mathcal{X}$ is compact, and the mean $\mu(x)$, the covariance, and the Cholesky factor $L(x)$ of the covariance are continuously differentiable, for all $m$ we have

$$\sup_{x_{ci}} \Big|\frac{\partial \mu^{(m)}(x_a)}{\partial x_{ci}}\Big| = \mu^{*,(m)}_a < \infty, \qquad \sup_{x_{ci}} \Big|\frac{\partial L^{(m)}_{ca}(x)}{\partial x_{ci}}\Big| = L^{*,(m)}_{ca} < \infty.$$

Let $\mu^{(m)}_{**} = \max_a \mu^{*,(m)}_a$ and $L^{(m)}_{**} = \max_{a,b} L^{*,(m)}_{ab}$, where $L^{(m)}_{ab}$ is the element at row $a$, column $b$ of $L^{(m)}$, the Cholesky factor for outcome $m$. Let $\epsilon^{(m)} \in \mathbb{R}^q$ denote the vector of i.i.d. $\mathcal{N}(0, 1)$ samples corresponding to outcome $m$. Then we have

$$\Big|\frac{\partial}{\partial x_{ci}} \big[\min\big[S_{\{i_1, 1\}}\big(\mu(x) + L(x)\epsilon\big), \ldots, S_{\{i_j, 1\}}\big(\mu(x) + L(x)\epsilon\big)\big] - l^{(1)}_k\big]_+\Big| \le \Big|\big[\mu^{(m)}_{**} + L^{(m)}_{**}\|\epsilon^{(m)}\| - l^{(m)}_k\big]_+\Big| \le \big|\mu^{(m)}_{**} + L^{(m)}_{**}\|\epsilon^{(m)}\|\big| + \big|l^{(m)}_k\big|.$$

Under our assumptions (compactness of $\mathcal{X}$, continuous differentiability of the mean and covariance functions), both $\mu(x)$ and $L(x)$, as well as their respective gradients, are uniformly bounded.
In particular, there exist $C^{(m)}_1, C^{(m)}_2 < \infty$ such that $\big|S_{\{a, m\}}\big(\mu(x) + L(x)\epsilon\big) - l^{(m)}_k\big| \le C^{(m)}_1 + C^{(m)}_2\|\epsilon^{(m)}\|$ for all $a = i_1, \ldots, i_j$. Hence,

$$\Big|\frac{\partial \tilde{A}(x, \epsilon)}{\partial x_{ci}}\Big| \le \Big[\big|\mu^{(1)}_{**} + L^{(1)}_{**}\|\epsilon^{(1)}\|\big| + \big|l^{(1)}_k\big|\Big]\Big[C^{(2)}_1 + C^{(2)}_2\|\epsilon^{(2)}\|\Big] + \Big[C^{(1)}_1 + C^{(1)}_2\|\epsilon^{(1)}\|\Big]\Big[\big|\mu^{(2)}_{**} + L^{(2)}_{**}\|\epsilon^{(2)}\|\big| + \big|l^{(2)}_k\big|\Big].$$

Since $\epsilon$ is absolutely integrable, $\mathbb{E}\big(\big|\frac{\partial \tilde{A}(x, \epsilon)}{\partial x_{ci}}\big|\big) < \infty$, and hence $\mathbb{E}\big[\sup_{x_{ci} \notin \tilde{D}} |\tilde{A}'(x, \epsilon)|\big] < \infty$. This can be extended to $M > 2$ in the same manner using the product rule to obtain

$$\mathbb{E}\Big(\Big|\frac{\partial \tilde{A}(x, \epsilon)}{\partial x_{ci}}\Big|\Big) \le \sum_{m=1}^{M} \Big(\Big[\big|\mu^{(m)}_{**} + L^{(m)}_{**}\,\mathbb{E}[\|\epsilon^{(m)}\|]\big| + \big|l^{(m)}_k\big|\Big] \prod_{n=1, n \ne m}^{M} \Big[C^{(n)}_1 + C^{(n)}_2\,\mathbb{E}[\|\epsilon^{(n)}\|]\Big]\Big) \le \sum_{m=1}^{M} \Big(\Big[\big|\mu^{(m)}_{**} + \sqrt{\tfrac{2}{\pi}}\, q\, L^{(m)}_{**}\big| + \big|l^{(m)}_k\big|\Big] \prod_{n=1, n \ne m}^{M} \Big[C^{(n)}_1 + \sqrt{\tfrac{2}{\pi}}\, q\, C^{(n)}_2\Big]\Big).$$

Hence, $\mathbb{E}\big[\sup_{x_{ci} \notin \tilde{D}} |A'(x, \epsilon)|\big] < \infty$ for $M \ge 2$ and Glasserman [30, Theorem 1] holds.

D Monte-Carlo Approximation
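As a toy illustration of the MC estimators analyzed above: posterior samples are drawn via the reparameterization $f_t(x) = \mu(x) + L(x)\epsilon_t$ from (15), and fixing the base samples $\epsilon_t$ makes the MC estimate a deterministic, almost-everywhere differentiable function of $x$. The one-dimensional "posterior" and the improvement-style integrand below are made up purely for illustration (in $q$EHVI the integrand is HVI and samples come from the joint GP posterior):

```python
import numpy as np

def posterior(x):
    """Toy 1-D 'posterior': mean sin(x), constant std 0.5 (made up for illustration)."""
    return np.sin(x), 0.5

def mc_acq(x, eps):
    """MC estimate of E[(f(x))_+] via the reparameterization f = mu + sigma * eps."""
    mu, sigma = posterior(x)
    return np.maximum(mu + sigma * eps, 0.0).mean()

rng = np.random.default_rng(0)
xs = np.linspace(0.0, np.pi, 50)
fixed_eps = rng.standard_normal(64)   # base samples drawn once and reused

resampled = np.array([mc_acq(x, rng.standard_normal(64)) for x in xs])
fixed = np.array([mc_acq(x, fixed_eps) for x in xs])

# With fixed base samples, the estimate varies smoothly in x, so successive
# differences are small; redrawing the base samples at every x adds MC jitter
# on top of the smooth trend (cf. the top vs. bottom rows of Figure 4).
print(np.abs(np.diff(fixed)).max() < np.abs(np.diff(resampled)).max())
```

Both estimators are unbiased for the same quantity; fixing the base samples only removes the between-evaluation noise, which is what makes gradient-based acquisition optimization practical.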
Figure 4b shows the gradient of analytic EHVI and the MC estimator $q$EHVI on a slice of a 3-objective problem. Even using only $N = 32$ QMC samples, the average sample gradient has very low variance. Moreover, fixing the base samples also greatly reduces the variance without introducing bias.

Figure 4: A comparison of (a) the analytic EHVI and MC-based $q$EHVI for $q = 1$, and (b) the exact gradient $\nabla \alpha_{\text{EHVI}}$ of analytic EHVI and the average sample gradient of the MC estimator $\nabla \hat{\alpha}_{q\text{EHVI}}$, over a slice of the input space on a DTLZ2 problem ($q = 1$, $M = 3$, $d = 6$) [14]. $x^{(0)}$ is varied across $0 \le \lambda \le 1$, while $x^{(i)}$ for $i = 1, \ldots, d$ are held constant. In each of (a) and (b), the top row shows $q$EHVI where the (quasi-)standard normal base samples are resampled for each value of $x^{(0)}$; the solid line is one sample average (across (Q)MC samples) and the shaded area is the mean plus two standard errors across 50 repetitions. The bottom row uses the same base samples for evaluating each test point, and the sample average for each of the 50 repetitions is plotted.

E Experiment Details
E.1 Algorithms
For PESMO, we follow [26] and use a Pareto set of size 10 for each sampled GP, which is optimized over a discrete set of $d$ points sampled from a scrambled Sobol sequence. The current Pareto front is approximated by optimizing the posterior means over a grid, as is done in Garrido-Merchán and Hernández-Lobato [25, 26].

Table 2: Reference points used for each benchmark problem.

    Problem                      Reference Point
    Branin-Currin                (18.0, 6.0)
    DTLZ2                        (1.1, ..., 1.1) ∈ R^M
    ABR                          (-150.0, 3500.0, 5.1)
    Vehicle Crash Safety         (1864.72022, 11.81993945, 0.2903999384)
    Constrained Branin-Currin    (90.0, 10.0)
    C2-DTLZ2                     (1.1, ..., 1.1) ∈ R^M

For SMS-EGO, we use the observed Pareto front. All acquisition functions are optimized with L-BFGS-B (with a maximum of 200 iterations); SMS-EGO [50] and PESMO [25] use gradients approximated by finite differences, and all other methods use exact gradients. For all methods, each outcome is modeled with an independent Gaussian process with a Matérn-5/2 ARD kernel. The methods implemented in Spearmint use a fully Bayesian treatment of the hyperparameters with 10 samples from the posterior over the hyperparameters, and the methods implemented in BoTorch use maximum a posteriori estimates of the GP hyperparameters. All methods are initialized with $2(d + 1)$ points from a scrambled Sobol sequence. $q$ParEGO and $q$EHVI use $N = 128$ QMC samples.
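The quasi-random initialization described above can be reproduced with SciPy's QMC module. A minimal sketch, where the dimension $d = 2$ and the $2(d+1)$-point budget are illustrative:

```python
import numpy as np
from scipy.stats import qmc

d = 2                    # input dimension (illustrative)
n_init = 2 * (d + 1)     # initial design size

# Scrambled Sobol sequence in the unit hypercube [0, 1]^d.
# (SciPy notes that balance properties are best when n is a power of two.)
sobol = qmc.Sobol(d=d, scramble=True, seed=0)
X_init = sobol.random(n_init)

print(X_init.shape)  # (6, 2)
```

Each point lies in $[0, 1]^d$ and would be rescaled to the problem's bounds before evaluating the objectives.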
E.1.1 Reference Point Specification
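One heuristic discussed in this subsection sets a dynamic reference point slightly worse than the nadir point of the observed Pareto front, $r = y_{\text{nadir}} - 0.1 \cdot |y_{\text{nadir}}|$. A minimal sketch (maximization convention; the toy front is made up for illustration):

```python
import numpy as np

def dynamic_reference_point(pareto_Y):
    """r = y_nadir - 0.1 * |y_nadir|, where y_nadir is the component-wise
    minimum over the observed Pareto front (objectives are maximized)."""
    y_nadir = pareto_Y.min(axis=0)
    return y_nadir - 0.1 * np.abs(y_nadir)

front = np.array([[3.0, 1.0], [2.0, 2.0], [1.0, 3.0]])
print(dynamic_reference_point(front))  # [0.9 0.9]
```

Here the nadir point is (1, 1), so the reference point is pushed 10% further out to (0.9, 0.9), ensuring all observed Pareto-optimal points contribute positive hypervolume.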
There is a large body of literature on the effects of reference point specification [4, 34, 35]. The hypervolume indicator is sensitive to the specified reference point: a reference point that is far away from the Pareto front will favor extreme points, whereas a reference point that is close to the Pareto front gives more weight to less extreme points [35]. Sensitivity to the reference point affects both the evaluation of different MO methods and the utility function of methods that rely on HV. In practice, a decision maker may be able to use domain knowledge to specify a reference point that reflects their preferences. If a reference point is not provided by the decision maker, previous work has suggested heuristics for choosing reference points for use in an algorithm's utility function [34, 50]. We follow previous work [65, 64] and assume that the reference point is known.

We also considered (but did not use in our experiments) a dynamic reference point strategy, where at each BO iteration the reference point used for computing the acquisition function is selected to be a point slightly worse than the nadir (component-wise minimum) point of the current observed Pareto front: $r = y_{\text{nadir}} - 0.1 \cdot |y_{\text{nadir}}|$, where $y_{\text{nadir}} = \big(\min_{y^{(1)} \in \mathcal{D}^{(1)}} y^{(1)}, \ldots, \min_{y^{(M)} \in \mathcal{D}^{(M)}} y^{(M)}\big)$. This reference point is used in SMS-EMOA in Ishibuchi et al. [34], and on problems with continuous Pareto fronts we find average performance similar to (but with higher variance than) using a known reference point. If the Pareto front is discontinuous, then it is possible that not all sections of the Pareto front will be reached.

E.1.2 qParEGO
Previous work has only considered unconstrained sequential optimization with ParEGO [39, 7], and ParEGO is often optimized with gradient-free methods [50]. To the best of our knowledge, $q$ParEGO is the first variant to support parallel and constrained optimization. Moreover, we compute exact gradients via auto-differentiation for acquisition optimization. ParEGO is typically implemented by applying the augmented Chebyshev scalarization and modeling the scalarized outcome [39]. However, recent work has shown that composite objectives offer improved optimization performance [3]. $q$ParEGO uses an MC-based Expected Improvement [37] acquisition function, where the objectives are modeled independently and the augmented Chebyshev scalarization [39] is applied to the posterior samples as a composite objective. This approach enables sequential greedy optimization of the $q$ candidates with proper integration over the posterior at the pending points. Importantly, the sequential greedy approach allows for using different random scalarization weights for selecting each of the $q$ candidates. $q$ParEGO is extended to the constrained setting by weighting the EI by the probability of feasibility [24]. We estimate the probability of feasibility using the posterior samples and approximate the indicator function with a sigmoid to maintain differentiability, as in constrained $q$EHVI. $q$ParEGO could trivially be extended to the noisy setting using Noisy Expected Improvement [41, 5], but we use Expected Improvement in our experiments since all of the problems are noiseless.
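The composite-objective construction can be sketched as follows: the augmented Chebyshev scalarization is applied to posterior samples of the objectives, and MC expected improvement is computed on the scalarized values. This is a minimal sketch in a maximization convention; the function names and toy numbers are our own, not the paper's implementation:

```python
import numpy as np

def augmented_chebyshev(Y, weights, rho=0.05):
    """Augmented Chebyshev scalarization of objective values Y (shape (..., M)),
    maximization convention: s(y) = min_m w_m * y_m + rho * sum_m w_m * y_m."""
    wY = Y * weights
    return wY.min(axis=-1) + rho * wY.sum(axis=-1)

def scalarized_mc_ei(posterior_samples, weights, best_scalarized, rho=0.05):
    """MC expected improvement on the composite (scalarized) objective:
    the scalarization is applied to posterior samples, not to the model."""
    s = augmented_chebyshev(posterior_samples, weights, rho)
    return np.maximum(s - best_scalarized, 0.0).mean()

# Toy usage: N = 2 posterior samples of M = 2 objectives at one candidate.
Y = np.array([[1.0, 2.0], [2.0, 1.0]])
w = np.array([0.5, 0.5])  # in practice, random weights drawn from the simplex
print(scalarized_mc_ei(Y, w, best_scalarized=0.5))  # ≈ 0.075
```

Because the scalarization acts on samples rather than on a scalarized surrogate, the estimator remains differentiable w.r.t. the candidate through the reparameterized samples, and fresh weights can be drawn for each of the $q$ greedily selected candidates.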
E.2 Benchmark Problems
The details for the benchmark problems below assume minimization of all objectives. Table 2 provides the reference points used for all benchmark problems.

Branin-Currin

$$f^{(1)}(x'_1, x'_2) = \Big(x'_2 - \frac{5.1}{4\pi^2} x'^2_1 + \frac{5}{\pi} x'_1 - 6\Big)^2 + 10\Big(1 - \frac{1}{8\pi}\Big)\cos(x'_1) + 10$$

$$f^{(2)}(x_1, x_2) = \Big[1 - \exp\Big(-\frac{1}{2x_2}\Big)\Big] \frac{2300\, x_1^3 + 1900\, x_1^2 + 2092\, x_1 + 60}{100\, x_1^3 + 500\, x_1^2 + 4\, x_1 + 20}$$

where $x_1, x_2 \in [0, 1]$, $x'_1 = 15x_1 - 5$, and $x'_2 = 15x_2$. The constrained Branin-Currin problem uses the following disk constraint from [28]:

$$c(x'_1, x'_2) = 50 - (x'_1 - 2.5)^2 - (x'_2 - 7.5)^2 \ge 0.$$

DTLZ2
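The DTLZ2 objectives defined below can be transcribed directly. A minimal sketch; when the last $d - M + 1$ coordinates equal 0.5 we have $g = 0$, so the objective vector lies on the unit sphere:

```python
import numpy as np

def dtlz2(x, M=2):
    """DTLZ2 objectives (all minimized) for x in [0, 1]^d with d >= M."""
    x = np.asarray(x, dtype=float)
    g = np.sum((x[M - 1:] - 0.5) ** 2)       # distance term over x_M
    theta = 0.5 * np.pi * x[: M - 1]         # angles (pi/2) * x_1, ..., (pi/2) * x_{M-1}
    f = []
    for i in range(M):
        v = (1.0 + g) * np.prod(np.cos(theta[: M - 1 - i]))
        if i > 0:
            v *= np.sin(theta[M - 1 - i])    # trailing sine for f_2, ..., f_M
        f.append(v)
    return np.array(f)

# On the Pareto front (g = 0), the objectives satisfy sum_i f_i^2 = 1.
f = dtlz2([0.2, 0.7, 0.5, 0.5, 0.5, 0.5], M=3)
print(np.sum(f ** 2))  # ≈ 1.0
```

The trigonometric identity $\sum_i f_i^2 = (1 + g)^2$ provides a convenient sanity check for any transcription of this problem.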
The objectives are given by [14]:

$$f_1(x) = (1 + g(x_M)) \cos\big(\tfrac{\pi}{2} x_1\big) \cdots \cos\big(\tfrac{\pi}{2} x_{M-2}\big)\cos\big(\tfrac{\pi}{2} x_{M-1}\big)$$
$$f_2(x) = (1 + g(x_M)) \cos\big(\tfrac{\pi}{2} x_1\big) \cdots \cos\big(\tfrac{\pi}{2} x_{M-2}\big)\sin\big(\tfrac{\pi}{2} x_{M-1}\big)$$
$$f_3(x) = (1 + g(x_M)) \cos\big(\tfrac{\pi}{2} x_1\big) \cdots \sin\big(\tfrac{\pi}{2} x_{M-2}\big)$$
$$\vdots$$
$$f_M(x) = (1 + g(x_M)) \sin\big(\tfrac{\pi}{2} x_1\big)$$

where $g(x) = \sum_{x_i \in x_M} (x_i - 0.5)^2$, $x \in [0, 1]^d$, and $x_M$ represents the last $d - M + 1$ elements of $x$. The C2-DTLZ2 problem adds the following constraint [15]:

$$c(x) = -\min\bigg[\min_{i=1}^{M}\Big((f_i(x) - 1)^2 + \sum_{j=1, j \ne i}^{M} f_j^2 - r^2\Big),\ \Big(\sum_{i=1}^{M} \big(f_i(x) - \tfrac{1}{\sqrt{M}}\big)^2 - r^2\Big)\bigg] \ge 0$$

Vehicle Crash Safety
The objectives are given by [57]:

$$f_1(x) = 1640.2823 + 2.3573285\, x_1 + 2.3220035\, x_2 + 4.5688768\, x_3 + 7.7213633\, x_4 + 4.4559504\, x_5$$
$$f_2(x) = 6.5856 + 1.15\, x_1 - 1.0427\, x_2 + 0.9738\, x_3 + 0.8364\, x_4 - 0.3695\, x_1 x_4 + 0.0861\, x_1 x_5 + 0.3628\, x_2 x_4 + 0.1106\, x_1^2 - 0.3437\, x_3^2 + 0.1764\, x_4^2$$
$$f_3(x) = -0.0551 + 0.0181\, x_1 + 0.1024\, x_2 + 0.0421\, x_3 - 0.0073\, x_1 x_2 + 0.024\, x_2 x_3 - 0.0118\, x_2 x_4 - 0.0204\, x_3 x_4 - 0.008\, x_3 x_5 - 0.0241\, x_2^2 + 0.0109\, x_4^2$$

where $x \in [1, 3]^5$.

Policy Optimization for Adaptive Bitrate Control
The controller is given by: $a_t = x_1 \hat{z}_{\text{bd},t} + x_2 z_{\text{bf},t} + x_3$, where $\hat{z}_{\text{bd},t} = \sum_{i}^{t} \ldots$

F.1 Additional Synthetic Benchmarks

We include results for two additional synthetic benchmarks: a Constrained Branin-Currin problem and the DTLZ2 problem from the MO literature [14] ($d = 6$, $M = 2$). Figures 5a and 5b show that $q$EHVI outperforms all other baseline algorithms on both problems in terms of optimization performance, while achieving competitive wall times (see Table 3).

Figure 5: (a) Optimization performance on the Constrained Branin-Currin synthetic function ($d = 2$, $M = 2$, $V = 1$) for various $q$, and (b) sequential optimization performance on the DTLZ2 synthetic function ($d = 6$, $M = 2$).

F.1.1 Anytime Performance with Parallel Evaluation

Figure 6 shows that the anytime performance of $q$EHVI does not degrade with increasing $q$, whereas performance does degrade for $q$ParEGO on the Constrained Branin-Currin problem.

Figure 6: A comparison of anytime performance as $q$ increases for $q$ParEGO and $q$EHVI on the Constrained Branin-Currin test problem ($d = 2$, $M = 2$, $V = 1$).

Table 3: Acquisition optimization wall time in seconds on a CPU (2x Intel Xeon E5-2680 v4 @ 2.40GHz) and on a GPU (Tesla V100-SXM2-16GB). The mean and two standard errors are reported. NA indicates that the algorithm does not support constraints.

    CPU                 Constrained Branin-Currin   DTLZ2
    PESMO (q=1)         NA                          .53 (± .)
    SMS-EGO (q=1)       NA                          .26 (± .)
    qParEGO (q=1)       . (± .37)                   4.91 (± .)
    EHVI (q=1)          NA                          .75 (± .)
    qEHVI (q=1)         .69 (± .43)                 5.69 (± .)

    GPU                 Constrained Branin-Currin   DTLZ2
    qParEGO (q=1)       .52 (± .34)                 9.04 (± .)
    qParEGO (q=2)       . (± .56)                   14.23 (± .)
    qParEGO (q=4)       .07 (± .98)                 40. (± .)
    qParEGO (q=8)       . (± .32)                   84.15 (± .)
    EHVI (q=1)          NA                          .15 (± .)
    qEHVI (q=1)         .61 (± .17)                 10.21 (± .)
    qEHVI (q=2)         .06 (± .88)                 17.75 (± .)
    qEHVI (q=4)         .26 (± .01)                 40.41 (± .)
    qEHVI (q=8)         .56 (± .51)                 106.51 (± .)

F.2 Acquisition Computation Time

Figure 7 shows the acquisition computation time for different $M$ and $q$. The inflection points correspond to the available processor cores becoming saturated. For large $M$ and $q$ on the GPU, memory becomes an issue, but we discuss ways of mitigating this issue in Appendix A.4.

Figure 7: Acquisition computation time for different batch sizes $q$ and numbers of objectives $M$ (this measures the time required to compute the acquisition function given a box decomposition of the non-dominated space, excluding the decomposition itself). This uses $N = 512$ MC samples, $d = 6$, $|\mathcal{P}| = 10$.