Distributionally Robust Bayesian Optimization

Johannes Kirschner (ETH Zürich), Ilija Bogunovic (ETH Zürich), Stefanie Jegelka (MIT), Andreas Krause (ETH Zürich)
Abstract
Robustness to distributional shift is one of the key challenges of contemporary machine learning. Attaining such robustness is the goal of distributionally robust optimization, which seeks a solution to an optimization problem that is worst-case robust under a specified distributional shift of an uncontrolled covariate. In this paper, we study such a problem when the distributional shift is measured via the maximum mean discrepancy (MMD). For the setting of zeroth-order, noisy optimization, we present a novel distributionally robust Bayesian optimization algorithm (DRBO). Our algorithm provably obtains sub-linear robust regret in various settings that differ in how the uncertain covariate is observed. We demonstrate the robust performance of our method on both synthetic and real-world benchmarks.
1 Introduction

Bayesian optimization (BO) is a framework for model-based sequential optimization of black-box functions that are expensive to evaluate and for which noisy point evaluations are available. Bayesian optimization algorithms have been successfully applied in a wide range of applications where the goal is to discover best-performing designs from a small number of trials, e.g., in vaccine and molecular design, gene optimization, automatic machine learning, robotics and control tasks, and many more.

In many practical tasks, the objective also depends on contextual covariates of the environment. If this context follows a known distribution, the setting is essentially that of stochastic optimization, with the objective to maximize the expected pay-off.
Table 1: Different optimization objectives considered in Bayesian optimization.

    Stochastic (SO):               $\max_x \mathbb{E}_{c \sim P}[f(x, c)]$
    Worst-case robust (RO):        $\max_x \min_{c \in \Delta} f(x, c)$
    Distributionally robust (DRO): $\max_x \inf_{Q \in \mathcal{U}} \mathbb{E}_{c \sim Q}[f(x, c)]$

Often, however, there exists a distributional mismatch between the covariate distribution that the learner assumes and the true distribution of the environment. Examples include automated machine learning, where hyperparameters are tuned on training data while the test distribution can differ; recommender systems, where the distribution of the users shifts with time; and robotics, where the simulated environmental variables are only an approximation of the real physical world. In particular, whenever there is a distributional mismatch between the true distribution and the data distribution used at training time, the optimization solutions can result in inferior performance or even lead to unsafe or unreliable execution. The problem of distributional data shift has recently been identified as one of the most prevalent concrete challenges of modern AI safety (Amodei et al., 2016). While the connection between robust optimization (RO) and Bayesian optimization has recently been established by Bogunovic et al. (2018), robustness to distributional data shift remains unexplored in this field.

In this paper, we introduce the setting of distributionally robust Bayesian optimization (DRBO): the goal is to track the optimal input that maximizes the expected function value under the worst-case distribution of an external, contextual parameter. In distributionally robust optimization (DRO), such a worst-case distribution belongs to a known uncertainty set of distributions that is typically chosen as a ball centered around a given reference distribution. To measure the distance between distributions, in this work we focus on the kernel-based maximum mean discrepancy (MMD) distance. This metric fits well with the kernel-based regularity assumptions on the unknown function that are typically made in Bayesian optimization.
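To make the distinction in Table 1 concrete, the following minimal sketch computes the three objectives for a discrete toy problem. All numbers and the kernel are illustrative assumptions, and the DRO inner infimum is crudely approximated by sampling candidate distributions, whereas the paper solves it exactly as a convex program (Section 3).

```python
import numpy as np

# Toy problem: 3 actions x, 4 contexts c, known reward table f[x, c].
f = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.6, 0.6, 0.5, 0.5],
              [0.2, 0.3, 0.4, 0.8]])
P = np.array([0.7, 0.2, 0.05, 0.05])  # reference context distribution

so = f @ P          # SO: expected reward per action under P
ro = f.min(axis=1)  # RO: worst single context per action

# DRO: inf over Q with MMD(P, Q) <= eps, approximated by random simplex samples.
rng = np.random.default_rng(0)
M = np.exp(-0.5 * np.subtract.outer(np.arange(4), np.arange(4)) ** 2)  # RBF kernel over contexts

def mmd(w, v):
    d = w - v
    return np.sqrt(d @ M @ d)

eps = 0.3
candidates = np.vstack([P, rng.dirichlet(np.ones(4), size=20000)])  # include P so the set is non-empty
feasible = candidates[np.array([mmd(P, q) for q in candidates]) <= eps]
dro = (f @ feasible.T).min(axis=1)  # worst feasible Q per action

for name, vals in [("SO", so), ("RO", ro), ("DRO", dro)]:
    print(name, "best action:", vals.argmax(), "value: %.3f" % vals.max())
```

On this toy table, SO favors the high-payoff but fragile action, RO the uniformly safe one, and DRO interpolates depending on the radius eps.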
1.1 Related Work

A large number of Bayesian optimization algorithms have been developed over the years (e.g., Srinivas et al., 2010; Wang and Jegelka, 2017; Hennig and Schuler, 2012; Chowdhury and Gopalan, 2017; Bogunovic et al., 2016b). Several practical variants of the standard setting were addressed recently, including contextual (Krause and Ong, 2011; Valko et al., 2013; Lamprier et al., 2018; Kirschner and Krause, 2019) and time-varying (Bogunovic et al., 2016a) BO, high-dimensional BO (Djolonga et al., 2013; Kandasamy et al., 2015; Kirschner et al., 2019), BO with constraints (Gardner et al., 2014; Gelbart et al., 2014), heteroscedastic noise (Kirschner and Krause, 2018) and uncertain inputs (Oliveira et al., 2019).

Two classical objectives for optimization under uncertainty are stochastic optimization (SO) (Srinivas et al., 2010; Krause and Ong, 2011; Lamprier et al., 2018; Oliveira et al., 2019; Kirschner and Krause, 2019) and robust optimization (RO) (Bogunovic et al., 2018); see Table 1. SO asks for a solution that performs well in expectation over an uncontrolled, stochastic covariate. Here, the assumption is that the distribution of the contextual parameter is known, or that (i.i.d.) samples are provided. Some variants of SO have been considered in the related contextual Bayesian optimization works (Krause and Ong, 2011; Valko et al., 2013; Kirschner and Krause, 2019). RO aims at a solution that is robust with respect to the worst possible realization of the context parameter. The RO objective has recently been studied in Bayesian optimization by Bogunovic et al. (2018); the authors provide a robust BO algorithm and obtain strong regret guarantees. In many practical scenarios, however, the solution to the SO problem might be highly non-robust, while on the other hand, the worst-case RO solution might be overly pessimistic. This motivates us to consider distributionally robust optimization (DRO), a "middle ground" between SO and RO.

Distributionally robust optimization dates back to the seminal work of Scarf (1957) and has since become an important topic in robust optimization (e.g., Bertsimas et al., 2018; Goh and Sim, 2010). It has recently received significant attention in machine learning, in particular due to its relation to regularization, adversarial learning and generalization (Staib et al., 2018). The full literature on DRO is too vast to be adequately covered here, so we refer the interested reader to the recent review by Rahimian and Mehrotra (2019) and references within. For defining the uncertainty sets of distributions, different DRO works have studied φ-divergences (Ben-Tal et al., 2013; Namkoong and Duchi, 2017), Wasserstein (Gao et al., 2017; Esfahani and Kuhn, 2018; Sinha et al., 2017) and MMD (Staib and Jegelka, 2019) distances. In this work, we focus on the kernel-based MMD distance, but unlike previous DRO works, we assume that the objective function is unknown, and that only noisy point evaluations are available.

We conclude this section by mentioning other robust aspects and settings that have previously been considered in Bayesian optimization. BO with outliers has been considered by Martinez-Cantin et al. (2017), while the setting in which sampled points are subject to uncertainty has been studied by Nogueira et al. (2016); Beland and Nair (2017); Oliveira et al. (2019). These settings differ significantly from the one considered in this paper, as they do not consider robustness under distributional shift. Finally, we note that another robust BO algorithm has recently been developed for playing unknown repeated games against non-cooperative agents (Sessa et al., 2019).

While this work was under submission, a related approach for distributionally robust Bayesian quadrature appeared online (Nguyen et al., 2020). The authors propose an approach based on Thompson sampling to solve a related robust objective for Bayesian quadrature. Our work captures this scenario in the "simulator setting" detailed below. The main difference in the analysis is that we bound worst-case frequentist regret, as opposed to expected Bayesian regret.
Contributions. We propose a novel distributionally robust Bayesian optimization (DRBO) algorithm. Our analysis shows that DRBO achieves sublinear robust regret in several variants of the setting. Finally, we demonstrate the robust performance of the DRBO method on synthetic and real-world benchmarks.
2 Problem Statement

Let $f : \mathcal{X} \times \mathcal{C} \rightarrow \mathbb{R}$ be an unknown reward function defined over a parameter space $\mathcal{X} \times \mathcal{C}$ with finite action and context sets $\mathcal{X}$ and $\mathcal{C}$. (Our formulation and the theory extend to continuous sets $\mathcal{C}$ and $\mathcal{X}$, but for the algorithm we rely on solving a convex program of size $|\mathcal{C}|$.) The objective is to optimize f from sequential and noisy point evaluations. In our main setup, at each time step t, the learner chooses $x_t \in \mathcal{X}$, whereas the environment provides the context $c_t \in \mathcal{C}$ together with the noisy function observation $y_t = f(x_t, c_t) + \xi_t$, where $\xi_t \sim \mathcal{N}(0, \sigma^2)$ with known $\sigma$ and independence between time steps. More generally, our results hold if the noise is $\sigma$-sub-Gaussian, which allows for non-Gaussian likelihoods (e.g., bounded noise). Further, we assume that $c_t$ is sampled independently from an unknown, time-dependent distribution $P^*_t$.
Optimization objective. We consider the distributionally robust optimization (DRO) objective (Scarf, 1957), which asks to perform well simultaneously for a range of problems, each determined by a distribution in some uncertainty set. This is in contrast to SO, where we seek good performance against a single problem instance parametrized by a given distribution. In DRO, the objective is to find $x \in \mathcal{X}$ that solves

    $\max_{x \in \mathcal{X}} \inf_{Q \in \mathcal{U}_t} \mathbb{E}_{c \sim Q}[f(x, c)]$.    (1)

Here, $\mathcal{U}_t$ is a known uncertainty set of distributions over $\mathcal{C}$ that can depend on the step t and contains the true distribution $P^*_t \in \mathcal{U}_t$. Typically, $\mathcal{U}_t$ is chosen as a ball of radius (or margin) $\epsilon_t > 0$, centered around a given reference distribution $P_t$ on $\mathcal{C}$, i.e.,

    $\mathcal{U}_t = \{Q : d(Q, P_t) \leq \epsilon_t\}$,

where $d(\cdot, \cdot)$ measures the discrepancy between two distributions. A possible choice for the reference distribution $P_t$ is the empirical sample distribution $\hat{P}_t = \frac{1}{t}\sum_{s=1}^t \delta_{c_s}$, which is an instance of data-driven DRO (Bertsimas et al., 2018). Depending on the underlying function and the uncertainty set $\mathcal{U}_t$, the robust solution can significantly differ from the solution of the stochastic objective $\max_{x \in \mathcal{X}} \mathbb{E}_{c \sim P}[f(x, c)]$ for a fixed (and typically known) distribution P. We illustrate such a case in Fig. 1.

Hence, at time step t, the learner receives a reference distribution $P_t \in \mathcal{P}(\mathcal{C})$ and margin $\epsilon_t > 0$. Our objective is to choose a sequence of actions $x_1, \ldots, x_T$ that minimizes the robust cumulative regret:

    $R_T = \sum_{t=1}^T \inf_{Q \in \mathcal{U}_t} \mathbb{E}_Q[f(x^*_t, c)] - \inf_{Q \in \mathcal{U}_t} \mathbb{E}_Q[f(x_t, c)]$,    (2)

where $x^*_t = \arg\max_{x \in \mathcal{X}} \inf_{Q \in \mathcal{U}_t} \mathbb{E}_Q[f(x, c)]$. The robust regret measures the cumulative loss of the learner, on the chosen sequence of actions, w.r.t. the worst-case distribution over $\mathcal{C}$.
RKHS Regression. The main regularity assumption of Bayesian optimization is that f belongs to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ with known kernel k. We denote the Hilbert norm by $\|\cdot\|_{\mathcal{H}}$ and assume $\|f\|_{\mathcal{H}} \leq B$ for some known $B > 0$. From the observed data $D_t = \{(x_1, c_1, y_1), \ldots, (x_t, c_t, y_t)\}$, we can compute a kernel ridge regression estimate with

    $\hat{f}_t = \arg\min_{g \in \mathcal{H}} \sum_{i=1}^t (g(x_i, c_i) - y_i)^2 + \|g\|_{\mathcal{H}}^2$.    (3)

[Figure 1: A function where the robust solution significantly differs from the stochastic solution. The learner obtains the blue reference distribution over the context set $\mathcal{C}$ and chooses a design $x \in \mathcal{X}$. If the distribution over the context set is equal to the reference, the solution marked by the triangle maximizes the expected reward. On the other hand, if the true distribution (orange) is shifted away from the reference, the flatter region of the reward function, marked by the star, provides higher expected reward.]

The representer theorem provides the standard, closed-form solution for the least-squares estimate (Rasmussen and Williams, 2006). The next lemma is a standard result by Srinivas et al. (2010); Abbasi-Yadkori (2013). It provides a frequentist confidence interval of the form $[\hat{f}_t(x, c) \pm \beta_t \sigma_t(x, c)]$ that contains the true function values f(x, c) with high probability. The exact definitions of $\hat{f}_t$ and $\sigma_t$ can be found in Appendix A; we just note here that $\hat{f}_t(x, c)$ and $\sigma_t^2(x, c)$ are the posterior mean and posterior variance functions of the corresponding Bayesian Gaussian process model (Rasmussen and Williams, 2006). We denote the data kernel matrix by $(K_t)_{i,j = 1, \ldots, t} = k(x_i, c_i, x_j, c_j)$, and assume that $k(x, c, x', c') \leq 1$.

Lemma 1. With probability at least $1 - \delta$, for any $x \in \mathcal{X}$, $c \in \mathcal{C}$ and any time $t \geq 1$,

    $|\hat{f}_t(x, c) - f(x, c)| \leq \beta_t \sigma_t(x, c)$, with $\beta_t = \sigma\sqrt{\log\det(I_t + K_t) + 2\log(1/\delta)} + B$.

We explicitly define the upper and lower confidence bounds for every $x \in \mathcal{X}$ and $c \in \mathcal{C}$ as follows:

    $\mathrm{ucb}_t(x, c) := \hat{f}_t(x, c) + \beta_t \sigma_t(x, c)$,
    $\mathrm{lcb}_t(x, c) := \hat{f}_t(x, c) - \beta_t \sigma_t(x, c)$.

For a fixed x, we use $\mathrm{ucb}^t_x := \mathrm{ucb}_t(x, \cdot)$ and $\mathrm{lcb}^t_x := \mathrm{lcb}_t(x, \cdot)$ to refer to the corresponding vectors in $\mathbb{R}^{|\mathcal{C}|}$. Finally, we introduce a sample complexity parameter, the maximum information gain:

    $\gamma_T := \max_{\{(x_t, c_t)\}_{t=1}^T} \log\det(I_T + K_T)$.    (4)

The information gain appears in the regret bounds for Bayesian optimization (Srinivas et al., 2010). Analytical upper bounds are known for a range of kernels, e.g., for the RBF kernel, $\gamma_T \leq \mathcal{O}(\log(T)^{d+1})$ if $\mathcal{X} \times \mathcal{C} \subset \mathbb{R}^d$.
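As a quick numerical illustration of (4), the following sketch evaluates the log-determinant expression for random designs under an RBF kernel. The kernel, lengthscale and grid are illustrative assumptions, and random designs only lower-bound the maximum in (4).

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.2):
    # Pairwise RBF kernel between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def information_gain(Z):
    # log det(I_T + K_T) for observations at the rows of Z, as in Eq. (4)
    # (with the unit regularizer used in Appendix A).
    K = rbf_kernel(Z, Z)
    _, logdet = np.linalg.slogdet(np.eye(len(Z)) + K)
    return logdet

rng = np.random.default_rng(0)
for T in [5, 20, 80]:
    Z = rng.uniform(size=(T, 2))  # random points in [0, 1]^2
    print(T, round(information_gain(Z), 2))
```

The slow (polylogarithmic) growth of this quantity with T is what drives the sublinear regret bounds below.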
Maximum Mean Discrepancy (MMD). The MMD is a kernel-based discrepancy measure between distributions (e.g., Muandet et al. (2017)). It has been used in various applications, including generative modeling (Sutherland et al., 2016; Bińkowski et al., 2018), DRO (Staib and Jegelka, 2019) and kernel sample tests (Gretton et al., 2012; Chwialkowski et al., 2016). Let $\mathcal{H}_M$ be an RKHS with corresponding kernel $k_M : \mathcal{C} \times \mathcal{C} \rightarrow \mathbb{R}$. For two distributions P and Q over $\mathcal{C}$, the maximum mean discrepancy (MMD) is

    $d(P, Q) := \sup_{g \in \mathcal{H}_M : \|g\|_{\mathcal{H}_M} \leq 1} \mathbb{E}_{c \sim P}[g(c)] - \mathbb{E}_{c \sim Q}[g(c)]$.    (5)

Note that the kernel $k_M$ over $\mathcal{C}$ that defines the MMD is different from the kernel k over $\mathcal{X} \times \mathcal{C}$ that is used for regression. An equivalent way of writing d(P, Q) is via kernel mean embeddings (Muandet et al., 2017, Section 3.5). Specifically, any distribution P over $\mathcal{C}$ can be embedded into $\mathcal{H}_M$ via the mean embedding $m_P = \mathbb{E}_{c \sim P}[k_M(c, \cdot)]$, which satisfies $\langle m_P, k_M(c', \cdot)\rangle = \mathbb{E}_{c \sim P}[k_M(c', c)]$ for all $c' \in \mathcal{C}$. An equivalent expression for the MMD (5) is

    $d(P, Q) = \|m_P - m_Q\|_{\mathcal{H}_M}$.    (6)

More explicitly, for a finite context set $\mathcal{C}$ and probability vectors $w_i = \mathbb{P}_P[c_i]$ and $w'_i = \mathbb{P}_Q[c_i]$, the kernel mean embeddings are $m_P = \sum_{i=1}^n w_i k_M(c_i, \cdot)$ and $m_Q = \sum_{i=1}^n w'_i k_M(c_i, \cdot)$, respectively. With the kernel matrix $(M)_{ij} := k_M(c_i, c_j)$, the MMD becomes

    $d(P, Q) = \sqrt{(w - w')^\top M (w - w')} =: \|w - w'\|_M$.
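For a finite context set, the MMD above is a closed-form expression in the probability vectors; a minimal sketch (the contexts and the kernel are illustrative assumptions):

```python
import numpy as np

def mmd(w, w_prime, M):
    """MMD between two discrete distributions on the same finite context set,
    given probability vectors w, w' and the kernel matrix (M)_ij = k_M(c_i, c_j)."""
    d = w - w_prime
    return float(np.sqrt(d @ M @ d))

# Illustrative example: 5 contexts on a line, RBF kernel k_M.
c = np.linspace(0.0, 1.0, 5)
M = np.exp(-0.5 * np.subtract.outer(c, c) ** 2 / 0.25**2)

w = np.array([0.1, 0.2, 0.4, 0.2, 0.1])          # reference distribution
w_shift = np.array([0.05, 0.1, 0.3, 0.3, 0.25])  # shifted distribution
print(mmd(w, w_shift, M))
```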
3 Distributionally Robust Bayesian Optimization

We now introduce a Bayesian optimization algorithm for our main objective (2). We start with a general formulation that allows for time-dependent reference distributions $P_t$ and margins $\epsilon_t$. We then continue with data-driven DRO (Bertsimas et al., 2018), where we specialize the general setup and choose the empirical distribution $P_t = \frac{1}{t}\sum_{s=1}^t \delta_{c_s}$ as reference distribution; hence, our algorithm chooses actions that are robust w.r.t. the estimation error of the true context distribution. Finally, we motivate and discuss the simulator setting, where the learner is allowed to choose the context $c_t$ and obtains the corresponding evaluation $y_t = f(x_t, c_t) + \xi_t$.

Algorithm 1: DRBO - General Setting
  Initialize $(K_x)_{i,j} = k(x, c_i, x, c_j)$, $\mathcal{C} = \{c_1, \ldots, c_n\}$
  For step $t = 1, 2, \ldots, T$:
    1. Learner obtains reference distribution $P_t$ with $w^t_i = \mathbb{P}[c = c_i]$, and margin $\epsilon_t$
    2. Define $(\mathrm{ucb}^t_x)_j := \hat{f}_t(x, c_j) + \beta_t \sigma_t(x, c_j)$
    3. Define $w^{\mathrm{ucb}_t}_x := \arg\min_{w'} \langle \mathrm{ucb}^t_x, w'\rangle$, s.t. $\|w'\|_1 = 1$, $0 \leq w'_j \leq 1\ \forall j \in [n]$, and $\|w' - w^t\|_M \leq \epsilon_t$
    4. Choose action $x_t = \arg\max_{x \in \mathcal{X}} \langle w^{\mathrm{ucb}_t}_x, \mathrm{ucb}^t_x\rangle$
    5. Learner observes $c_t \sim P^*_t$ and $y_t = f(x_t, c_t) + \xi_t$
    6. Use $\{x_t, c_t, y_t\}$ to update $\hat{f}_{t+1}(\cdot, \cdot)$ and $\sigma_{t+1}(\cdot, \cdot)$

3.1 General DRBO

In our general DRBO formulation, the interaction protocol at time t is specified by the following steps:

1. The environment chooses a reference distribution $P_t$ and margin $\epsilon_t$. This defines the uncertainty set
       $\mathcal{U}_t = \{Q : d(Q, P_t) \leq \epsilon_t\}$.    (7)
2. The learner observes $P_t$ and $\epsilon_t$, and chooses a robust action $x_t \in \mathcal{X}$.
3. The environment chooses a sampling distribution $P^*_t \in \mathcal{U}_t$, and the context is realized as an independent sample $c_t \sim P^*_t$.
4. The learner observes the reward $y_t = f(x_t, c_t) + \xi_t$ and $c_t \sim P^*_t$.

We make no further assumptions on how the environment chooses the sequences $P_t$, $P^*_t$ and $\epsilon_t$. The DRBO algorithm for this setting is given in Algorithm 1. Recall that $P_t$ is a distribution over the finite context set $\mathcal{C}$ with n elements, and we use $w^t \in \mathbb{R}^n$ to denote a probability vector with entries $w^t_i = \mathbb{P}_{P_t}[c = c_i]$ for every $i \in [n]$. With this, the inner adversarial problem for a fixed action x can equivalently be written as

    $\inf_{Q : d(P_t, Q) \leq \epsilon_t} \mathbb{E}_{c \sim Q}[f(x, c)] = \min_{w' : \|w'\|_1 = 1,\ 0 \leq w'_j \leq 1\ \forall j \in [n],\ \|w' - w^t\|_M \leq \epsilon_t} \langle w', f_x\rangle$,    (8)

where $f_x := f(x, \cdot) \in \mathbb{R}^n$, and $M \in \mathbb{R}^{n \times n}$ with $(M)_{ij} := k_M(c_i, c_j)$. In particular, the solution to (8) is the worst-case distribution over c for the objective f if the learner chooses action x. Since the constraints are convex, the program (8) can be solved efficiently by standard convex optimization solvers.

Since the true function values $f_x$ are unknown to the learner, we can only obtain an approximate solution to (8). In our algorithm, we hence use an optimistic upper bound instead. Specifically, we substitute $\mathrm{ucb}^t_x$ for $f_x$ to compute the "optimistic" worst-case distribution for every action x, as sketched below. Finally, at time t, the learner chooses $x_t$ to maximize the optimistic expected reward under the worst-case distribution.
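The inner problem (8) is a linear objective over the simplex with one MMD-ball constraint; a minimal sketch using cvxpy (the naming is ours; a Cholesky factor of M turns the MMD ball into a standard second-order cone constraint):

```python
import numpy as np
import cvxpy as cp

def worst_case_distribution(values, w_ref, M, eps):
    """Solve min_{w'} <values, w'> over the simplex subject to ||w' - w_ref||_M <= eps.

    values: vector of rewards f(x, c_i), or ucb_t(x, c_i), for a fixed action x.
    w_ref:  reference probability vector w^t.
    M:      PSD kernel matrix (M)_ij = k_M(c_i, c_j).
    """
    n = len(values)
    L = np.linalg.cholesky(M + 1e-8 * np.eye(n))  # jitter for numerical PSD-ness
    w = cp.Variable(n, nonneg=True)
    constraints = [cp.sum(w) == 1, cp.norm(L.T @ (w - w_ref)) <= eps]
    problem = cp.Problem(cp.Minimize(values @ w), constraints)
    problem.solve()
    return w.value, problem.value

# Illustrative example, reusing the toy kernel matrix from above.
c = np.linspace(0.0, 1.0, 5)
M = np.exp(-0.5 * np.subtract.outer(c, c) ** 2 / 0.25**2)
ucb_x = np.array([0.9, 0.7, 0.5, 0.6, 0.8])  # optimistic values ucb_t(x, c_i)
w_ref = np.full(5, 0.2)
w_adv, val = worst_case_distribution(ucb_x, w_ref, M, eps=0.2)
print(np.round(w_adv, 3), round(val, 3))
```

In Algorithm 1, this subroutine is invoked once per action x with values set to $\mathrm{ucb}^t_x$, and the action maximizing the resulting optimistic worst-case value is played.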
The DRBO algorithm achieves the following regret bound.

Theorem 2. The robust regret $R_T$ of Algorithm 1, with $\beta_t = \sigma\sqrt{\log\det(I_t + K_t) + 2\log(1/\delta)} + B$, is bounded with probability at least $1 - 2\delta$ by

    $R_T \leq 4\beta_T\sqrt{T\left(\gamma_T + 4\log(6/\delta)\right)} + 2B'\sum_{t=1}^T \epsilon_t$.

Here, $\gamma_T$ is the maximum information gain defined in Eq. (4), $\|f\|_{\mathcal{H}} \leq B$ and $B' = \max_{x \in \mathcal{X}} \|f_x\|_{M^{-1}}$.

The complete proof is given in Appendix B.1; we only sketch the main steps here. Denote by $w^*_t$ the probability vector of the true distribution at time t, and by $w^f_{x_t}$ the solution to (8) at $x_t$. The idea is to bound the instantaneous regret at time t by

    $r_t = \inf_{Q : d(P_t, Q) \leq \epsilon_t} \mathbb{E}_Q[f(x^*_t, c)] - \inf_{Q : d(P_t, Q) \leq \epsilon_t} \mathbb{E}_Q[f(x_t, c)]$
    $\overset{(i)}{\leq} \langle w^*_t, \mathrm{ucb}^t_{x_t}\rangle - \langle w^f_{x_t}, f_{x_t}\rangle = \langle w^*_t, \mathrm{ucb}^t_{x_t} - f_{x_t}\rangle + \langle w^*_t - w^f_{x_t}, f_{x_t}\rangle$
    $\overset{(ii)}{\leq} 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot)\rangle + \|w^*_t - w^f_{x_t}\|_M \|f_{x_t}\|_{M^{-1}}$
    $\overset{(iii)}{\leq} 2\beta_T \langle w^*_t, \sigma_t(x_t, \cdot)\rangle + 2\epsilon_t B'$.

For the first inequality (i), we used that $f_x \leq \mathrm{ucb}_x$, the definition of the UCB action, and that $w^*_t \in \mathcal{U}_t$. In step (ii), we use Cauchy-Schwarz and the confidence bounds, and step (iii) follows since $w^f_{x_t} \in \mathcal{U}_t$. From here it remains to sum the instantaneous regret, where we rely on Lemma 3 in (Kirschner and Krause, 2018) to relate the expectation over the true sampling distribution, $\langle w^*_t, \sigma_t(x_t, \cdot)\rangle$, to the observed values $\sigma_t(x_t, c_t)$.

In the regret bound of Theorem 2, the first term is the same as the standard regret bound for GP-UCB (Srinivas et al., 2010; Abbasi-Yadkori, 2013) and reflects the statistical convergence rate for estimating the RKHS function. The additional term $2B'T\epsilon$ (for constant $\epsilon_t = \epsilon$) is specific to our setting. First, the complexity parameter $B' = \max_{x \in \mathcal{X}} \|f_x\|_{M^{-1}}$ quantifies how much the distributional shift can increase the regret on the given objective f. A crude upper bound is $B' \leq B\sqrt{\lambda_{\max}(M^{-1})|\mathcal{C}|}$, but in general B' can be much smaller. The linear scaling $\mathcal{O}(\epsilon T)$ of the regret bound is arguably unsatisfying, but seems unavoidable without further assumptions. A problematic case is when the true distribution $P^*_t$ is supported on a single context, e.g., $P^*_t = \delta_{c_1}$, and the learner is not able to learn the function values at contexts $c_i$ for $i > 1$.
In this case, the learner can never infer the robust solution exactly from the data and consequently incurs constant regret of order $\epsilon_t$ per round. In practice, we do not expect this to severely affect the performance of our algorithm if the true distribution sufficiently covers the context space; we leave a precise formulation of this intuition for future work. Instead, in the following sections we explore two different ways of controlling the additional regret that the learner incurs in the general DRBO setting. First, for the data-driven setting, we set the reference distribution to the empirical distribution of the observed context samples. In this case, the margin $\epsilon_t$ is the distance to the true sampling distribution, which for the MMD is of order $1/\sqrt{t}$ and results in $\sum_{t=1}^T \epsilon_t = \mathcal{O}(\sqrt{T})$. In the second variant, the learner is allowed to also choose $c_t$, which circumvents the estimation problem outlined above and avoids the linear regret term.

3.2 Data-Driven DRBO

In data-driven DRBO, we assume there is a fixed but unknown distribution $P^*$ on $\mathcal{C}$. In each round, the learner first chooses an action $x_t \in \mathcal{X}$, and then observes a context sample $c_t \sim P^*$ together with the corresponding observation $y_t = f(x_t, c_t) + \xi_t$. At the beginning of round t, the learner computes the empirical distribution $\hat{P}_t = \frac{1}{t-1}\sum_{s=1}^{t-1} \delta_{c_s}$ from the observed contexts $\{c_1, \ldots, c_{t-1}\}$. The objective is to choose a sequence of actions $x_t$ that is robust to the estimation error in $\hat{P}_t$. This corresponds to minimizing the robust regret (2), where we set $P_t = \hat{P}_t$ for every t. As the learner observes more context samples, she becomes more confident about the true unknown $P^*$. It is therefore reasonable to shrink the uncertainty set of distributions $\mathcal{U}_t = \{Q : d(Q, \hat{P}_t) \leq \epsilon_t\}$ over time. We make use of the following lemma.

Lemma 3 (Muandet et al. (2017), Theorem 3.4). Assume $k_M(c_i, c_j) \leq 1$ for all $c_i, c_j \in \mathcal{C}$. Let $P^*$ be the true context distribution over $\mathcal{C}$, and let $\hat{P}_t = \frac{1}{t}\sum_{s=1}^t \delta_{c_s}$ be the empirical sample distribution. Then, with probability at least $1 - \delta$,

    $d(P^*, \hat{P}_t) \leq \frac{1}{\sqrt{t}}\left(2 + \sqrt{2\log(1/\delta)}\right)$.

Lemma 3 shows how to set the margin $\epsilon_t$ such that, at time t, the true distribution is contained with high probability in the uncertainty set around the empirical distribution. The interaction protocol at time t is then:

1. The learner computes the empirical distribution $\hat{P}_t$ and the corresponding margin $\epsilon_t$ according to Lemma 3 (a sketch of this step follows below), and defines the uncertainty set $\mathcal{U}_t = \{Q : d(Q, \hat{P}_t) \leq \epsilon_t\}$.
2. The learner chooses a robust action $x_t$.
3. The learner observes the reward $y_t = f(x_t, c_t) + \xi_t$ and the context sample $c_t \sim P^*$.
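A minimal sketch of step 1 of this protocol, using the margin from Lemma 3 as reconstructed above (function and variable names are our illustrative choices):

```python
import numpy as np

def empirical_margin(contexts_seen, context_grid, delta=0.05):
    """Empirical reference distribution and shrinking MMD margin from Lemma 3.

    contexts_seen: indices (into context_grid) of the contexts observed so far.
    Returns the weight vector of P_hat_t and the margin eps_t.
    """
    t = len(contexts_seen)
    n = len(context_grid)
    w_hat = np.bincount(contexts_seen, minlength=n) / t
    eps_t = (2.0 + np.sqrt(2.0 * np.log(1.0 / delta))) / np.sqrt(t)
    # For the anytime guarantee of Corollary 4, replace delta by
    # delta_t = 6 * delta / (np.pi**2 * t**2), so the bound holds for all t.
    return w_hat, eps_t

w_hat, eps = empirical_margin(np.array([0, 0, 2, 1, 0, 3]), np.linspace(0, 1, 5))
print(np.round(w_hat, 2), round(eps, 3))
```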
We follow Algorithm 1, and set the reference distribution and margin as outlined above. As a consequence of Theorem 2, we obtain the following regret bound.

Corollary 4. The robust regret $R_T$ of Algorithm 1, with $\beta_t = \sigma\sqrt{\log\det(I_t + K_t) + 2\log(1/\delta)} + B$ and $\epsilon_t = \frac{1}{\sqrt{t}}\left(2 + \sqrt{2\log(\pi^2t^2/(6\delta))}\right)$, is bounded in the data-driven scenario with probability at least $1 - 3\delta$ by

    $R_T \leq 4\beta_T\sqrt{T\left(\gamma_T + 4\log(6/\delta)\right)} + 4B'\sqrt{T}\left(2 + \sqrt{2\log(\pi^2T^2/(6\delta))}\right)$,    (9)

where $\gamma_T$ is the maximum information gain as defined in (4), $\|f\|_{\mathcal{H}} \leq B$ and $B' = \max_{x \in \mathcal{X}} \|f_x\|_{M^{-1}}$.

The proof can be found in Appendix B.2. We just note that we increased the value of $\epsilon_t$ such that Lemma 3 holds simultaneously over all time steps. In the data-driven contextual setting without the robustness requirement, several related approaches have been proposed (Lamprier et al., 2018; Kirschner and Krause, 2019). These are based on computing a UCB score directly at the kernel mean embedding of the empirical distribution $\hat{P}_t$; to account for the estimation error, an additional exploration bonus is added. We note that as $t \rightarrow \infty$ and $\hat{P}_t$ becomes an accurate estimate of $P^*$, both the robust and the non-robust approaches converge to the stochastic solution. The advantage of the robust formulation is that we explicitly minimize the loss under the worst-case estimation error in the context distribution. As we demonstrate in our experiments (Section 4), DRBO obtains significantly smaller regret when the robust and stochastic solutions differ.

3.3 Simulator DRBO

In our second variant of the general setup, the learner is allowed to choose $c_t$ in addition to $x_t$, and then obtains the observation $y_t = f(x_t, c_t) + \xi_t$.
Algorithm 2: DRBO - Simulator Setting
  Initialize $(K_x)_{i,j} = k(x, c_i, x, c_j)$, $\mathcal{C} = \{c_1, \ldots, c_n\}$
  For step $t = 1, 2, \ldots, T$:
    1. Obtain reference distribution $P_t$ with $w^t_i = \mathbb{P}[c = c_i]$, and margin $\epsilon_t$
    2. Define $(\mathrm{ucb}^t_x)_j := \hat{f}_t(x, c_j) + \beta_t \sigma_t(x, c_j)$
    3. $w^{\mathrm{ucb}_t}_x := \arg\min_{w'} \langle \mathrm{ucb}^t_x, w'\rangle$, s.t. $\|w'\|_1 = 1$, $0 \leq w'_j \leq 1\ \forall j \in [n]$, $\|w' - w^t\|_M \leq \epsilon_t$
    4. $x_t = \arg\max_{x \in \mathcal{X}} \langle w^{\mathrm{ucb}_t}_x, \mathrm{ucb}^t_x\rangle$
    5. $c_t = \arg\max_{c \in \mathcal{C}} \sigma_t(x_t, c)$
    6. Observe $y_t = f(x_t, c_t) + \xi_t$ from the simulator

One example of this setting, previously considered in the context of RO (Bogunovic et al., 2018), is when the learner tunes control parameters with a simulator of the environment (e.g., for a building heating system). The simulator gives the learner the ability to evaluate the objective at any specific context $c_t$. The objective is to simultaneously (or only at the final time T) deploy a robust solution $x_T$ on the real system, where the covariate $c_t$ is uncontrolled. Again, the learner's objective is to be robust with respect to an uncertainty set of distributions on $c_t$ in the real environment (e.g., for heating control, we want robustness to predicted weather conditions that affect the building's state). With this motivation in mind, we refer to this setup as simulator DRBO. Formally, the interaction protocol is:

1. The environment provides a reference distribution $P_t$, margin $\epsilon_t$ and uncertainty set $\mathcal{U}_t$ as before.
2. The learner chooses an action $x_t$ and a context $c_t \in \mathcal{C}$.
3. The learner observes the reward $y_t = f(x_t, c_t) + \xi_t$ from the simulator.
4. The learner deploys a robust action $x_t$ on the real system (or possibly only at the final step T).

We provide Algorithm 2 for this setting. As before, $x_t$ is an optimistic action under the worst-case distribution. In addition, the learner chooses $c_t = \arg\max_{c \in \mathcal{C}} \sigma_t(x_t, c)$ as the context with the largest estimation uncertainty at $x_t$. We bound the robust regret in the next theorem.
Theorem 5. In the simulator setting, Algorithm 2, with $\beta_t = \sigma\sqrt{\log\det(I_t + K_t) + 2\log(1/\delta)} + B$, obtains bounded robust regret w.p. at least $1 - \delta$:

    $R_T \leq 2\beta_T\sqrt{2T\gamma_T}$.

We provide the proof of Theorem 5 in Appendix B.3.

[Figure 2: Results for the two synthetic benchmarks (Benchmark 1 and Benchmark 2) in the General, Data-Driven and Simulator settings, where the stochastic, worst-case robust and distributionally robust solutions are all different (left) or coincide (right). All plots show robust regret, averaged over 50 independent runs; the error bars indicate the standard error.]

Perhaps surprisingly, this rate is the same as for GP-UCB in the standard setting (a similar result was obtained for RO (Bogunovic et al., 2018)). This is because the learner can now estimate $\hat{f}_t$ globally at any input $(x_t, c_t) \in \mathcal{X} \times \mathcal{C}$, and the sample complexity to infer the robust solution only depends on the sample complexity of estimating f.

In the simulator setting, the performance of the final solution is of significant interest if we aim to deploy the obtained parameter on the real system. To this end, we allow the final solution $\hat{x}_T$ to differ from the last evaluation $x_T$. The metric of interest is then the robust simple regret, $r_T = \max_{x \in \mathcal{X}} \inf_Q \mathbb{E}_{c \sim Q} f(x, c) - \inf_Q \mathbb{E}_{c \sim Q} f(\hat{x}_T, c)$. To obtain a bound on the simple regret, we assume that the margin $\epsilon = \epsilon_t$ and the reference distribution $P = P_t$ are fixed. This is a natural requirement, which allows the learner to optimize the simple regret of the final solution $\hat{x}_T$ w.r.t. P and $\epsilon$. We choose the final solution $\hat{x}_T := x_{\hat{t}}$ among the iterates $x_1, \ldots, x_T$ from Algorithm 2 with

    $\hat{t} := \arg\max_{t = 1, \ldots, T}\ \min_{w' : \|w'\|_1 = 1,\ 0 \leq w'_j \leq 1\ \forall j \in [n],\ \|w' - w^t\|_M \leq \epsilon} \langle w', \mathrm{lcb}^t_{x_t}\rangle$.    (10)

The program computes the best robust solution among the iterates $\{x_1, \ldots, x_T\}$ using the conservative function values $\mathrm{lcb}^t_{x_t}$ of the corresponding time steps t. It is easy to maintain $\hat{x}_T$ iteratively by computing the conservative, worst-case payoff of the action $x_t$ and comparing to the previous solution $\hat{x}_{t-1}$; a sketch follows at the end of this subsection.

Corollary 6 (Simple Regret). With probability at least $1 - \delta$, the solution $\hat{x}_T$ obtains simple regret

    $r_T \leq 2\beta_T\sqrt{2\gamma_T/T}$.    (11)

This result is a consequence of the fact that the simple regret of $\hat{x}_T$ is upper bounded by the simple regret of each iterate $x_t$. The guarantee then follows from the proof of Theorem 5; we provide the complete argument in Appendix B.4.
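A minimal sketch of the selection rule (10); it reuses the hypothetical worst_case_distribution solver from the sketch in Section 3.1, now applied to the pessimistic values lcb instead of ucb:

```python
import numpy as np

def select_final_solution(iterates, lcb_values, w_refs, M, eps):
    """Pick x_hat_T = x_{t_hat} per Eq. (10): the iterate with the best
    conservative worst-case value over the MMD ball of radius eps.

    iterates:   list of actions x_1, ..., x_T.
    lcb_values: list of vectors lcb_t(x_t, .) in R^n, one per step.
    w_refs:     list of reference weight vectors w^t (all equal if P_t is fixed).
    """
    best_t, best_val = None, -np.inf
    for t, (lcb, w_ref) in enumerate(zip(lcb_values, w_refs)):
        _, val = worst_case_distribution(lcb, w_ref, M, eps)  # inner min of (10)
        if val > best_val:
            best_t, best_val = t, val
    return iterates[best_t], best_val
```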
4 Experiments

We evaluate the proposed DRBO in the general, data-driven and simulator settings on two synthetic test functions, and on a real-world wind-power prediction task. In our experiments, we compare to StableOpt (Bogunovic et al., 2018) and a stochastic UCB variant (Srinivas et al., 2010; Kirschner and Krause, 2019).

Baselines.
The first baseline is a stochastic variant of the UCB approach (Srinivas et al., 2010; Kirschner and Krause, 2019), which chooses actions according to the optimistic expected payoff w.r.t. the reference distribution,

    $x^{\mathrm{UCB}}_t = \arg\max_{x \in \mathcal{X}} \mathbb{E}_{P_t}[\mathrm{ucb}_t(x, c)]$.

Our second baseline is StableOpt (Bogunovic et al., 2018), an approach for worst-case robust optimization. It chooses actions according to

    $x^{\mathrm{STABLE}}_t = \arg\max_{x \in \mathcal{X}} \min_{c \in \Delta_t} \mathrm{ucb}_t(x, c)$,

for a robustness set of possible context values $\Delta_t \subset \mathcal{C}$. There is no canonical way of choosing $\Delta_t$ in our setting, and we use $\Delta_t = \{c \in \mathcal{C} : \|c - \mathbb{E}_{c' \sim P_t}[c']\| \leq \epsilon_t\}$. With the decreasing margin and the discretization of the context domain, it can happen that $\Delta_t$ is an empty set; in this case, we explicitly set $\Delta_t = \{\arg\min_{c \in \mathcal{C}} \|c - \mathbb{E}_{c' \sim P_t}[c']\|\}$. UCB and StableOpt optimize for the stochastic and worst-case robust solutions respectively, and can therefore exhibit linear robust regret (unless $\epsilon_t \rightarrow 0$). For all methods, we set $\beta_t = 2$, which is a common practice to improve performance over the (conservative) theoretical values.
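On a discrete grid, both baseline acquisition rules reduce to simple array operations; a minimal sketch (ucb is assumed to be a precomputed |X| x |C| array of ucb_t values, and delta_mask flags the contexts in $\Delta_t$):

```python
import numpy as np

def ucb_baseline_action(ucb, w_ref):
    # Stochastic UCB: maximize the expected optimistic payoff under P_t.
    return int(np.argmax(ucb @ w_ref))

def stableopt_action(ucb, delta_mask):
    # StableOpt: maximize the worst optimistic payoff over contexts in Delta_t.
    worst = np.where(delta_mask[None, :], ucb, np.inf).min(axis=1)
    return int(np.argmax(worst))

ucb = np.array([[1.0, 0.2, 0.1],
                [0.6, 0.5, 0.4]])
print(ucb_baseline_action(ucb, np.array([0.8, 0.1, 0.1])))   # favors action 0
print(stableopt_action(ucb, np.array([True, True, False])))  # favors action 1
```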
Benchmarks. Our first synthetic benchmark is the function illustrated in the introduction (Fig. 1). The reference distribution $P_t$ and the true sampling distribution $P^*$ are (discretized) Gaussians with variance 0.05 and different means, and we set $\epsilon_t := d(P_t, P^*)$. On this function, the stochastic, worst-case robust and distributionally robust solutions all differ, which leads to linear robust regret for UCB and StableOpt. The second synthetic benchmark is chosen such that the stochastic, worst-case and distributionally robust solutions coincide, with the same choice of $P_t$, $P^*$ and $\epsilon_t$ as before; see Appendix C, Fig. 4a for a contour plot. Fig. 2 illustrates the results.

Further, we evaluate the methods on real-world wind power data (Data Package Time Series, 2019). Wind power forecasting is an important task (Wang et al., 2011), as power sources that can be effectively scheduled are valuable on the global energy market. In our problem setup, we take hourly recorded wind power data from 2013/14 and use a 48h sliding window to compute an empirical reference distribution for each time step. The decision variable x is the amount of energy that is guaranteed to be delivered in the next hour after the end of the window. The contextual variable c is the actual power generation, which we take from the data set. We choose a piecewise-linear reward (revenue) function f(x, c) that pays for the delivered energy and penalizes commitments x that exceed the realized generation c. Fig. 3 shows the cumulative revenue and the cumulative robust regret. Clearly, the stochastic solution differs from the robust one, hence UCB obtains linear robust regret. In fact, in this case, if the DRO objective were solved exactly at each step, the DRBO method would obtain zero robust regret (we compute the solution according to (10) after T = 100 steps, so an optimization error may remain).
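A minimal sketch of the sliding-window reference distribution for the wind task (the discretization into bins and all names are our illustrative assumptions, not details from the paper):

```python
import numpy as np

def sliding_reference(power, t, window=48, bins=np.linspace(0.0, 1.0, 11)):
    """Empirical reference distribution over discretized power levels,
    computed from the 48h window preceding hour t."""
    recent = power[t - window:t]
    idx = np.clip(np.digitize(recent, bins) - 1, 0, len(bins) - 2)
    w_ref = np.bincount(idx, minlength=len(bins) - 1) / window
    return w_ref  # weight vector over the n = len(bins) - 1 context bins

power = np.random.default_rng(0).uniform(size=1000)  # stand-in for normalized hourly wind power
print(np.round(sliding_reference(power, t=100), 2))
```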
[Figure 3: Wind power prediction. We show cumulative revenue (top) and robust regret (bottom). StableOpt is too conservative to perform well on either objective. UCB does not account for the distributional shift on the sliding window. By definition, DRBO chooses the robust solution most of the time, and therefore achieves (almost) zero robust regret.]

5 Conclusion

In this work, we introduced and studied distributionally robust Bayesian optimization, where the goal is to be robust against the worst-case contextual distribution among a specified uncertainty set of distributions. Specifically, we focused on uncertainty sets determined by the MMD distance. For a few settings of interest that differ in how the contextual parameter is realized, we provided the first DRBO algorithms with theoretical guarantees. In our experimental study, we demonstrated improvements in terms of robust expected regret over stochastic and worst-case BO baselines.

Our algorithms rely on solving the inner adversary problem, which, in our case, is a linear program with convex constraints. This program can be solved efficiently but is of size $|\mathcal{C}|$, which currently limits the method to relatively small context sets. The formulation and the theory continue to hold for large or continuous context sets, but finding a tractable algorithmic approximation is an interesting direction for future work. Finally, while the considered kernel-based MMD distance fits well with the kernel-based regularity assumptions used in BO, an interesting direction is to extend the ideas to other uncertainty sets used in machine learning, such as the ones defined by φ-divergences and the Wasserstein distance. In fact, our approach is still applicable to other divergences, as long as the uncertainty set of distributions is convex and the inner problem can be solved efficiently.

Acknowledgement
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 815943), and NSF CAREER award 1553284. IB is supported by an ETH Zürich Postdoctoral Fellowship (19-2 FEL-47).
References
Abbasi-Yadkori, Y. (2013). Online learning for linearly parametrized control problems.

Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. (2016). Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

Beland, J. J. and Nair, P. B. (2017). Bayesian optimization under uncertainty. NIPS BayesOpt 2017 workshop.

Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. (2013). Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357.

Bertsimas, D., Gupta, V., and Kallus, N. (2018). Data-driven robust optimization. Mathematical Programming, 167(2):235–292.

Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. (2018). Demystifying MMD GANs. arXiv preprint arXiv:1801.01401.

Bogunovic, I., Scarlett, J., and Cevher, V. (2016a). Time-varying Gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 314–323.

Bogunovic, I., Scarlett, J., Jegelka, S., and Cevher, V. (2018). Adversarially robust optimization with Gaussian processes. In Conference on Neural Information Processing Systems (NeurIPS), pages 5760–5770.

Bogunovic, I., Scarlett, J., Krause, A., and Cevher, V. (2016b). Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Advances in Neural Information Processing Systems (NIPS), pages 1507–1515.

Chowdhury, S. R. and Gopalan, A. (2017). On kernelized multi-armed bandits. In International Conference on Machine Learning (ICML), pages 844–853.

Chwialkowski, K., Strathmann, H., and Gretton, A. (2016). A kernel test of goodness of fit. JMLR: Workshop and Conference Proceedings.

Data Package Time Series, O. (2019). Open power system data. https://doi.org/10.25832/time_series/2019-06-05. Version 2019-06-05 (primary data from various sources, for a complete list see URL).

Djolonga, J., Krause, A., and Cevher, V. (2013). High-dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems (NIPS), pages 1025–1033.

Esfahani, P. M. and Kuhn, D. (2018). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115–166.

Gao, R., Chen, X., and Kleywegt, A. J. (2017). Wasserstein distributional robustness and regularization in statistical learning. arXiv preprint arXiv:1712.06050.

Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. P. (2014). Bayesian optimization with inequality constraints. In ICML, pages 937–945.

Gelbart, M. A., Snoek, J., and Adams, R. P. (2014). Bayesian optimization with unknown constraints. arXiv preprint arXiv:1403.5607.

Goh, J. and Sim, M. (2010). Distributionally robust optimization and its tractable approximations. Operations Research, 58(4-part-1):902–917.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773.

Hennig, P. and Schuler, C. J. (2012). Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(Jun):1809–1837.

Kandasamy, K., Schneider, J., and Póczos, B. (2015). High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning (ICML), pages 295–304.

Kirschner, J. and Krause, A. (2018). Information directed sampling and bandits with heteroscedastic noise. In Proc. International Conference on Learning Theory (COLT).

Kirschner, J. and Krause, A. (2019). Stochastic bandits with context distributions. In Advances in Neural Information Processing Systems (NeurIPS).

Kirschner, J., Mutný, M., Hiller, N., Ischebeck, R., and Krause, A. (2019). Adaptive and safe Bayesian optimization in high dimensions via one-dimensional subspaces. arXiv preprint arXiv:1902.03229.

Krause, A. and Ong, C. S. (2011). Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems (NIPS), pages 2447–2455.

Lamprier, S., Gisselbrecht, T., and Gallinari, P. (2018). Profile-based bandit with unknown profiles. The Journal of Machine Learning Research, 19(1):2060–2099.

Martinez-Cantin, R., Tee, K., and McCourt, M. (2017). Practical Bayesian optimization in the presence of outliers. arXiv preprint arXiv:1712.04567.

Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B., et al. (2017). Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141.

Namkoong, H. and Duchi, J. C. (2017). Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980.

Nguyen, T. T., Gupta, S., Ha, H., Rana, S., and Venkatesh, S. (2020). Distributionally robust Bayesian quadrature optimization. arXiv preprint arXiv:2001.06814.

Nogueira, J., Martinez-Cantin, R., Bernardino, A., and Jamone, L. (2016). Unscented Bayesian optimization for safe robot grasping. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1967–1972. IEEE.

Oliveira, R., Ott, L., and Ramos, F. (2019). Bayesian optimisation under uncertain inputs. arXiv preprint arXiv:1902.07908.

Rahimian, H. and Mehrotra, S. (2019). Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659.

Rasmussen, C. E. and Williams, C. K. (2006). Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge.

Scarf, H. E. (1957). A min-max solution of an inventory problem. Technical report, RAND Corporation, Santa Monica, CA.

Sessa, P. G., Bogunovic, I., Kamgarpour, M., and Krause, A. (2019). No-regret learning in unknown games with correlated payoffs. In Conference on Neural Information Processing Systems (NeurIPS).

Sinha, A., Namkoong, H., and Duchi, J. (2017). Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571.

Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), pages 1015–1022.

Staib, M. and Jegelka, S. (2019). Distributionally robust optimization and generalization in kernel methods. In Advances in Neural Information Processing Systems (NeurIPS).

Staib, M., Wilder, B., and Jegelka, S. (2018). Distributionally robust submodular maximization. arXiv preprint arXiv:1802.05249.

Sutherland, D. J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola, A., and Gretton, A. (2016). Generative models and model criticism via optimized maximum mean discrepancy. arXiv preprint arXiv:1611.04488.

Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. (2013). Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869.

Wang, X., Guo, P., and Huang, X. (2011). A review of wind power forecasting models. Energy Procedia, 12:770–778.

Wang, Z. and Jegelka, S. (2017). Max-value entropy search for efficient Bayesian optimization. In International Conference on Machine Learning (ICML), pages 3627–3635.
A RKHS Regression
Recall that at step t, we have data $D_t = \{(x_1, c_1, y_1), \ldots, (x_t, c_t, y_t)\}$. The kernel ridge regression estimate is defined by

    $\hat{f}_t = \arg\min_{g \in \mathcal{H}} \sum_{i=1}^t (g(x_i, c_i) - y_i)^2 + \|g\|_{\mathcal{H}}^2$.    (12)

Denote by $y_t = [y_1, \ldots, y_t]^\top$ the vector of observations, $(K_t)_{i,j = 1, \ldots, t} = k(x_i, c_i, x_j, c_j)$ the data kernel matrix, and $k_t(x, c) = [k(x, c, x_1, c_1), \ldots, k(x, c, x_t, c_t)]^\top$ the data kernel features. We then have

    $\hat{f}_t(x, c) = k_t(x, c)^\top (K_t + I_t)^{-1} y_t$.    (13)

We further have the posterior variance $\sigma_t^2(x, c)$ that determines the width of the confidence intervals,

    $\sigma_t^2(x, c) = k(x, c, x, c) - k_t(x, c)^\top (K_t + I_t)^{-1} k_t(x, c)$.    (14)
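A minimal sketch of (13) and (14); the joint RBF kernel over (x, c) and all names are illustrative assumptions:

```python
import numpy as np

def rbf(A, B, ls=0.2):
    # Pairwise RBF kernel between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def krr_posterior(Z_data, y, Z_query):
    """Posterior mean (13) and variance (14) with the unit regularizer.
    Z_data: t x d array of observed joint inputs (x_i, c_i); y: observations."""
    K = rbf(Z_data, Z_data)
    k_star = rbf(Z_query, Z_data)  # query-vs-data kernel features k_t(x, c)
    A = np.linalg.solve(K + np.eye(len(y)), np.column_stack([y[:, None], k_star.T]))
    mean = k_star @ A[:, 0]
    var = rbf(Z_query, Z_query).diagonal() - np.einsum('ij,ji->i', k_star, A[:, 1:])
    return mean, var

rng = np.random.default_rng(0)
Z = rng.uniform(size=(30, 2))
y = np.sin(6 * Z[:, 0]) + 0.1 * rng.standard_normal(30)
mu, var = krr_posterior(Z, y, rng.uniform(size=(4, 2)))
print(np.round(mu, 2), np.round(np.sqrt(var), 2))
```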
B Proofs

B.1 Proof of Theorem 2
The robust cumulative regret is

    $R_T = \sum_{t=1}^T \max_{x \in \mathcal{X}} \inf_{Q : d(Q, P_t) \leq \epsilon_t} \mathbb{E}_Q[f(x, c)] - \inf_{Q : d(Q, P_t) \leq \epsilon_t} \mathbb{E}_Q[f(x_t, c)]$.    (15)

For the proof, we first bound the instantaneous robust regret,

    $r_t = \inf_{Q : d(P_t, Q) \leq \epsilon_t} \mathbb{E}_Q[f(x^*_t, c)] - \inf_{Q : d(P_t, Q) \leq \epsilon_t} \mathbb{E}_Q[f(x_t, c)]$,    (16)

where we denote by $x^*_t = \arg\max_{x \in \mathcal{X}} \inf_{Q : d(P_t, Q) \leq \epsilon_t} \mathbb{E}_Q[f(x, c)]$ the true robust solution at time t. We recall the following notation: $f_x = f(x, \cdot)$, $\mathrm{lcb}^t_x = \mathrm{lcb}_t(x, \cdot)$ and $\mathrm{ucb}^t_x = \mathrm{ucb}_t(x, \cdot)$ are vectors in $\mathbb{R}^n$, and $(M)_{i,j} = k_M(c_i, c_j)$. Further, $w_i = \mathbb{P}_P[c = c_i]$ is a probability vector in $\mathbb{R}^n$, where n denotes the size of the context set, i.e., $n = |\mathcal{C}|$. With this, note that

    $\inf_{Q : d(P_t, Q) \leq \epsilon_t} \mathbb{E}_Q[f(x, c)] = \inf_{w' \in W_t} \langle w', f_x\rangle$, where $W_t := \{w' : \|w'\|_1 = 1,\ 0 \leq w'_j \leq 1\ \forall j \in [n],\ \|w' - w^t\|_M \leq \epsilon_t\}$.    (17)

The solution to this linear program is the worst-case distribution over c if we choose action x. Define worst-case distributions $w^f_x$, $w^{\mathrm{lcb}_t}_x$ and $w^{\mathrm{ucb}_t}_x$ for the exact, pessimistic and optimistic function values (the dependence on t is implicit),

    $w^f_x = \arg\min_{w' \in W_t} \langle w', f_x\rangle$, $\quad w^{\mathrm{lcb}_t}_x = \arg\min_{w' \in W_t} \langle w', \mathrm{lcb}^t_x\rangle$, $\quad w^{\mathrm{ucb}_t}_x = \arg\min_{w' \in W_t} \langle w', \mathrm{ucb}^t_x\rangle$.    (18)

By combining (8) and (18), we can upper and lower bound the objective as follows:

    $\langle w^{\mathrm{lcb}_t}_x, \mathrm{lcb}^t_x\rangle \leq \inf_{w' \in W_t} \langle w', f_x\rangle \leq \langle w^{\mathrm{ucb}_t}_x, \mathrm{ucb}^t_x\rangle$.    (19)

Recall that Algorithm 1 takes actions $x_t = \arg\max_{x \in \mathcal{X}} \langle w^{\mathrm{ucb}_t}_x, \mathrm{ucb}^t_x\rangle$, and note that $\|w^{\mathrm{lcb}_t}_{x_t} - w^*_t\|_M \leq 2\epsilon_t$, where $(w^*_t)_i = \mathbb{P}_{P^*_t}[c = c_i]$ is the probability vector of the true sampling distribution at time t. For any $x \in \mathcal{X}$, we proceed to bound the instantaneous regret,

    $r_t(x) = \inf_{Q : d(P_t, Q) \leq \epsilon_t} \mathbb{E}_Q[f(x, c)] - \inf_{Q : d(P_t, Q) \leq \epsilon_t} \mathbb{E}_Q[f(x_t, c)]$    (20)
    $\overset{(i)}{\leq} \langle w^{\mathrm{ucb}_t}_x, \mathrm{ucb}^t_x\rangle - \langle w^f_{x_t}, f_{x_t}\rangle$    (21)
    $\overset{(ii)}{\leq} \langle w^{\mathrm{ucb}_t}_{x_t}, \mathrm{ucb}^t_{x_t}\rangle - \langle w^f_{x_t}, f_{x_t}\rangle$    (22)
    $\overset{(iii)}{\leq} \langle w^*_t, \mathrm{ucb}^t_{x_t}\rangle - \langle w^f_{x_t}, f_{x_t}\rangle$    (23)
    $= \langle w^*_t, f_{x_t}\rangle + \langle w^*_t, \mathrm{ucb}^t_{x_t} - f_{x_t}\rangle - \langle w^f_{x_t}, f_{x_t}\rangle$    (24)
    $= \langle w^*_t, \mathrm{ucb}^t_{x_t} - f_{x_t}\rangle + \langle w^*_t - w^f_{x_t}, f_{x_t}\rangle$    (25)
    $\overset{(iv)}{\leq} 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot)\rangle + \|w^*_t - w^f_{x_t}\|_M \|f_{x_t}\|_{M^{-1}}$    (26)
    $\overset{(v)}{\leq} 2\beta_t \langle w^*_t, \sigma_t(x_t, \cdot)\rangle + 2\epsilon_t \|f_{x_t}\|_{M^{-1}}$    (27)
    $\leq 2\beta_T \langle w^*_t, \sigma_t(x_t, \cdot)\rangle + 2\epsilon_t B'$.    (28)

Here, (i) follows from the definition of the upper bound (19), (ii) is by the choice of $x_t$, (iii) follows from the fact that $w^{\mathrm{ucb}_t}_{x_t}$ is a minimizer, (iv) uses again the confidence bounds together with Cauchy-Schwarz, and (v) follows from $\|w^*_t - w^f_{x_t}\|_M \leq 2\epsilon_t$, since both $w^*_t$ and $w^f_{x_t}$ lie in $\mathcal{U}_t$. Finally, the following holds for the instantaneous regret: $r_t = \max_{x \in \mathcal{X}} r_t(x) \leq 2\beta_T \langle w^*_t, \sigma_t(x_t, \cdot)\rangle + 2\epsilon_t B'$.

We now continue to bound the cumulative regret $R_T = \sum_{t=1}^T r_t$. To this end, we first apply the Cauchy-Schwarz inequality as in the standard proof, and then Jensen's inequality to find

    $R_T \leq 2\beta_T \sum_{t=1}^T \langle w^*_t, \sigma_t(x_t, \cdot)\rangle + 2B'\sum_{t=1}^T \epsilon_t$    (29)
    $\leq 2\beta_T \sqrt{T \sum_{t=1}^T \langle w^*_t, \sigma_t(x_t, \cdot)\rangle^2} + 2B'\sum_{t=1}^T \epsilon_t$    (30)
    $\leq 2\beta_T \sqrt{T \sum_{t=1}^T \langle w^*_t, \sigma_t^2(x_t, \cdot)\rangle} + 2B'\sum_{t=1}^T \epsilon_t$.    (31)

To complete the proof, we need to relate $\sum_{t=1}^T \langle w^*_t, \sigma_t^2(x_t, \cdot)\rangle$ to the observed posterior variance $\sum_{t=1}^T \sigma_t^2(x_t, c_t)$. For this, we apply Lemma 7 below. Note that $\sigma_t^2(x_t, c_t) \leq k(x, c, x', c') \leq 1$. Hence, w.p. at least $1 - \delta$,

    $\sum_{t=1}^T \langle w^*_t, \sigma_t^2(x_t, \cdot)\rangle \leq 2\sum_{t=1}^T \sigma_t^2(x_t, c_t) + 8\log(6/\delta)$.    (32)

Finally, using that $x \leq 2\log(1 + x)$ for all $x \in [0, 1]$,

    $\sum_{t=1}^T \sigma_t^2(x_t, c_t) \leq 2\sum_{t=1}^T \log\left(1 + \sigma_t^2(x_t, c_t)\right) \leq 2\gamma_T$.    (33)

The last inequality follows from Lemma 3 in Chowdhury and Gopalan (2017). The final regret bound is therefore

    $R_T \leq 2\beta_T\sqrt{T(4\gamma_T + 8\log(6/\delta))} + 2B'\sum_{t=1}^T \epsilon_t \leq 4\beta_T\sqrt{T\left(\gamma_T + 4\log(6/\delta)\right)} + 2B'\sum_{t=1}^T \epsilon_t$.    (34)

A union bound over the events such that both Lemma 1 and Lemma 7 hold yields probability at least $1 - 2\delta$ for the complete statement and completes the proof.

Lemma 7 (Concentration of conditional mean; Lemma 3 in Kirschner and Krause (2018)). Let $S_t \geq 0$ be a non-negative stochastic process adapted to a filtration $\{\mathcal{F}_t\}$, and define $m_t = \mathbb{E}[S_t | \mathcal{F}_{t-1}]$. Further assume $S_t \leq B$ for $B \geq 1$. Then, for any $T \geq 1$, with probability at least $1 - \delta$,

    $\sum_{t=1}^T m_t \leq 2\sum_{t=1}^T S_t + 4B\log\frac{1}{\delta} + 8B\log(4B) + 1 \leq 2\sum_{t=1}^T S_t + 8B\log\frac{6B}{\delta}$.

B.2 Proof of Corollary 4
Note that with $\delta_t = \frac{6\delta}{\pi^2 t^2}$, the result from Lemma 3 holds with probability at least $1 - \delta$ simultaneously for all $t = 1, 2, \ldots$ (since $\sum_{t \geq 1} \delta_t = \delta$), i.e.,

    $d(\hat{P}_t, P^*) \leq \frac{1}{\sqrt{t}}\left(2 + \sqrt{2\log(\pi^2t^2/(6\delta))}\right) = \epsilon_t$.    (35)

Therefore, with probability at least $1 - \delta$, $P^* \in \mathcal{U}_t$ for all t. The result follows from Theorem 2 and another application of the union bound. Finally, we use that $\sum_{t=1}^T t^{-1/2} \leq 2\sqrt{T}$ to complete the proof of the corollary.

B.3 Proof of Theorem 5
Recall that Algorithm 2 takes the actions $x_t = \arg\max_{x \in \mathcal{X}} \langle w^{\mathrm{ucb}_t}_x, \mathrm{ucb}^t_x\rangle$ and $c_t = \arg\max_{c \in \mathcal{C}} \sigma_t(x_t, c)$. We begin by bounding $r_t$ similarly as in the proof of the general regret bound:

    $r_t(x) = \inf_{Q : d(P_t, Q) \leq \epsilon_t} \mathbb{E}_Q[f(x, c)] - \inf_{Q : d(P_t, Q) \leq \epsilon_t} \mathbb{E}_Q[f(x_t, c)]$    (36)
    $\overset{(i)}{\leq} \langle w^{\mathrm{ucb}_t}_x, \mathrm{ucb}^t_x\rangle - \langle w^{\mathrm{lcb}_t}_{x_t}, \mathrm{lcb}^t_{x_t}\rangle$    (37)
    $\overset{(ii)}{\leq} \langle w^{\mathrm{ucb}_t}_{x_t}, \mathrm{ucb}^t_{x_t}\rangle - \langle w^{\mathrm{lcb}_t}_{x_t}, \mathrm{lcb}^t_{x_t}\rangle$    (38)
    $\overset{(iii)}{\leq} \langle w^{\mathrm{lcb}_t}_{x_t}, \mathrm{ucb}^t_{x_t}\rangle - \langle w^{\mathrm{lcb}_t}_{x_t}, \mathrm{lcb}^t_{x_t}\rangle$    (39)
    $\overset{(iv)}{\leq} 2\beta_t \sigma_t(x_t, c_t)$.    (40)

Here, as before, (i) replaces the function values by over/under-estimates from the upper/lower confidence bounds, (ii) uses the choice of the UCB action, and (iii) uses that $w^{\mathrm{ucb}_t}_{x_t}$ is a minimizer of $\langle \cdot, \mathrm{ucb}^t_{x_t}\rangle$. The last inequality (iv) uses that $c_t$ maximizes $\sigma_t(x_t, \cdot)$, as well as that $w^{\mathrm{lcb}_t}_{x_t}$ is a probability vector. With that, the cumulative regret bound follows via the standard argument.

B.4 Proof of Corollary 6
To bound the simple regret, recall that $\hat{t} = \arg\max_{t = 1, \ldots, T} \inf_Q \mathbb{E}_Q[\mathrm{lcb}_t(x_t, c)]$ and $\hat{x}_T = x_{\hat{t}}$. For any $t = 1, \ldots, T$ we have that

    $r_T = \max_{x \in \mathcal{X}} \inf_{Q : d(Q, P) \leq \epsilon} \mathbb{E}_Q f(x, c) - \inf_{Q : d(Q, P) \leq \epsilon} \mathbb{E}_Q f(\hat{x}_T, c)$    (41)
    $\overset{(i)}{\leq} \max_{x \in \mathcal{X}} \langle w^{\mathrm{ucb}_t}_x, \mathrm{ucb}^t_x\rangle - \langle w^{\mathrm{lcb}_{\hat{t}}}_{\hat{x}_T}, \mathrm{lcb}^{\hat{t}}_{\hat{x}_T}\rangle$    (42)
    $\overset{(ii)}{=} \max_{x \in \mathcal{X}} \langle w^{\mathrm{ucb}_t}_x, \mathrm{ucb}^t_x\rangle - \max_{s = 1, \ldots, T} \langle w^{\mathrm{lcb}_s}_{x_s}, \mathrm{lcb}^s_{x_s}\rangle$    (43)
    $\overset{(iii)}{\leq} \langle w^{\mathrm{ucb}_t}_{x_t}, \mathrm{ucb}^t_{x_t}\rangle - \langle w^{\mathrm{lcb}_t}_{x_t}, \mathrm{lcb}^t_{x_t}\rangle$    (44)
    $\overset{(iv)}{\leq} \langle w^{\mathrm{lcb}_t}_{x_t}, \mathrm{ucb}^t_{x_t}\rangle - \langle w^{\mathrm{lcb}_t}_{x_t}, \mathrm{lcb}^t_{x_t}\rangle$    (45)
    $\overset{(v)}{\leq} 2\beta_t \sigma_t(x_t, c_t)$.    (46)

Here, (i) bounds the function values by the upper and lower confidence bounds respectively, (ii) uses the definition of $\hat{x}_T$, (iii) uses the definition of the UCB action and drops the maximum, and finally, (iv, v) as before use the choices of $x_t$, $c_t$ and $w^{\mathrm{ucb}_t}_{x_t}$, and that $w^{\mathrm{lcb}_t}_{x_t}$ is a probability vector. With this we are able to leverage the cumulative regret bound, and find

    $r_T \leq \frac{1}{T}\sum_{t=1}^T 2\beta_t \sigma_t(x_t, c_t) \leq \frac{R_T}{T}$.    (47)

C Details on the Experiments

[Figure 4a: Contour plot of the second synthetic benchmark over $\mathcal{X} \times \mathcal{C}$, where the stochastic and robust solutions coincide.]