Mean-Variance Analysis in Bayesian Optimization under Uncertainty
Shogo Iwazaki∗  Yu Inatsu†  Ichiro Takeuchi‡†

∗ Department of Computer Science, Nagoya Institute of Technology
† RIKEN Center for Advanced Intelligence Project
‡ Department of Computer Science / Research Institute for Information Science, Nagoya Institute of Technology; mail: [email protected]

ABSTRACT
We consider active learning (AL) in an uncertain environment in which a trade-off between multiple risk measures needs to be considered. As an AL problem in such an uncertain environment, we study the Mean-Variance Analysis in Bayesian Optimization (MVA-BO) setting. Mean-variance analysis was developed in the field of financial engineering and has been used to make decisions that take into account the trade-off between the average and the variance of investment uncertainty. In this paper, we specifically focus on the BO setting with an uncertain component and consider multi-task, multi-objective, and constrained optimization scenarios for the mean-variance trade-off of the uncertain component. When the target blackbox function is modeled by a Gaussian process (GP), we derive bounds on the two risk measures and propose an AL algorithm for each of the above three problems based on these bounds. We show the effectiveness of the proposed AL algorithms through theoretical analysis and numerical experiments.
1 Introduction

Decision making in an uncertain environment has been studied in various domains. For example, in financial engineering, mean-variance analysis [1, 2, 3] has been introduced as a framework for making investment decisions, taking into account the trade-off between the return (mean) and the risk (variance) of the investment. In this paper, we study active learning (AL) in an uncertain environment. In many practical AL problems, there are two types of parameters, called design parameters and environmental parameters. For example, in a product design, while the design parameters are fully controllable, the environmental parameters vary depending on the environment in which the product is used. In this paper, we examine AL problems under such an uncertain environment, where the goal is to efficiently find the optimal design parameters by properly taking into account the uncertainty of the environmental parameters.

Concretely, let f(x, w) be a blackbox function indicating the performance of a product, where x ∈ X is the set of controllable design parameters and w ∈ Ω is the set of uncontrollable environmental parameters whose uncertainty is characterized by a probability distribution p(w). We particularly focus on the AL problem where the mean and the variance with respect to the environmental parameters,

E_w[f(x, w)] = ∫_Ω f(x, w) p(w) dw,   (1a)
V_w[f(x, w)] = ∫_Ω ( f(x, w) − E_w[f(x, w)] )² p(w) dw,   (1b)

respectively, are taken into account. Specifically, we work on these two uncertainty measures in three different scenarios: a multi-task learning scenario, a multi-objective optimization scenario, and a constrained optimization scenario. In the first scenario, we study AL for optimizing a weighted sum of these two measures. In the second scenario, we discuss how to obtain the Pareto frontier of these two measures in an AL setting. In the third scenario, we consider optimizing one of the two measures under some constraint on the other measure. We refer to these problems and the proposed framework for solving them as Mean-Variance Analysis in Bayesian Optimization (MVA-BO). Figure 1 shows an illustration of a multi-task learning scenario.

In this study, we employ a Gaussian process (GP) to model the uncertainty of the blackbox function f(x, w). In a conventional GP-based AL problem (without uncontrollable environmental parameters w), the acquisition function (AF) is designed based on how the uncertainty of the blackbox function changes when an input point is selected and the blackbox function is evaluated at that point. On the other hand, in MVA-BO, we need to know how the uncertainties of the mean function (1a) and the variance function (1b) change by evaluating the blackbox function at the selected input point. Note that we face the difficulty of not being able to directly evaluate the target functions (1a) and (1b). When f(x, w) follows a GP, the mean function (1a) also follows a GP. Unfortunately, however, the variance function (1b) does not follow a GP, indicating that we need to develop a new method to quantify how the uncertainty of the variance function changes by evaluating the blackbox function at the selected input point.
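To make the two risk measures concrete, the following minimal Python sketch approximates (1a) and (1b) by numerical integration over a discretized Ω. The toy function f, the grid, and the truncated-normal weights are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

# Toy stand-in for the blackbox f(x, w); in practice f is expensive to evaluate.
def f(x, w):
    return np.sin(3 * x) * np.cos(2 * w) - 0.5 * x * w

# Discretize Omega and define p(w) on the grid (here: a truncated standard normal).
w_grid = np.linspace(-1.0, 1.0, 200)
p_w = np.exp(-0.5 * w_grid**2)
p_w /= p_w.sum()  # normalize so the grid weights sum to one

def mean_and_variance(x):
    """Grid approximations of E_w[f(x,w)] in (1a) and V_w[f(x,w)] in (1b)."""
    vals = f(x, w_grid)
    mean = np.sum(vals * p_w)
    var = np.sum((vals - mean) ** 2 * p_w)
    return mean, var

m, v = mean_and_variance(0.3)
print(f"mean = {m:.4f}, variance = {v:.4f}")
```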
In this study, we extend the GP-UCB algorithm [5] to realize MVA-BO in the three scenarios mentioned above by overcoming these technical difficulties. We demonstrate the effectiveness of the proposed MVA-BO framework through theoretical analyses and numerical experiments.

Related Work
Various problem setups and methods have been studied for AL and Bayesian optimization (BO) problems with multiple target functions. One such problem setup is multi-task BO [6]. In this setup, the AF is designed to select input points that commonly contribute to optimizing multiple target functions. Another popular problem setup is multi-objective BO [7, 8, 9]. The goal of multi-objective optimization is to obtain so-called Pareto-optimal solutions, and the AF in this setup is designed to efficiently identify solutions on the Pareto frontier. Another common problem setup is constrained BO [10, 11, 12]. The goal here is to find the optimal solution to a constrained optimization problem in a situation where both the objective function and the constraint function are blackbox functions that are costly to evaluate. The AF in this setup is designed to select input points that are useful not only for maximizing the objective function but also for identifying the feasible region. In this paper, we study these three scenarios as concrete examples of MVA-BO. Unlike conventional multi-task, multi-objective, and constrained BOs, the main technical challenges of MVA-BO are that the two target functions (1a) and (1b) cannot be directly evaluated and that the latter does not follow a GP.

Various studies have been published on BO under various types of uncertainty. The most relevant one to our study is Bayesian quadrature optimization (BQO) [13], the goal of which is to optimize the mean function (1a). When the blackbox function follows a GP, the mean function (1a) also follows a GP, suggesting that one can efficiently solve BQO problems by properly modifying the AFs of conventional BO. By replacing the integrand in (1a) with different uncertainty measures, one can consider various types of AL problems under uncertainty [14, 15]. Another line of research dealing with uncontrollable and uncertain factors in BO is known as robust BO. The goal of robust BO is to make robust decisions that appropriately take into account the uncertainty of the BO process and the GP model. For example, input uncertainty in BO has been studied, in which probabilistic noise is inevitably added to the input points when evaluating the target blackbox function. Although research on BO in an uncertain environment has steadily progressed over the past few years, to our knowledge, there are no AL or BO studies that take into account the trade-offs between multiple uncertainty measures as in mean-variance analysis.

Decision making under uncertainty is also examined in the field of robust optimization [16, 17, 18], especially with applications to financial engineering in mind [19, 20, 21]. It has been pointed out that when making decisions under uncertainty, it is important to balance multiple uncertainty measures appropriately, as represented by the Nobel-prize-winning mean-variance analysis in portfolio theory [1, 2, 3]. Various risk measures, such as Value at Risk (VaR), have been proposed in financial engineering, and these multiple risk measures are used in combination, depending on the purpose of the decision making. However, to our knowledge, there have been no AL or BO studies that appropriately take multiple uncertainty measures into account.
2 Preliminaries

Let f : X × Ω → R be a blackbox function which is expensive to evaluate, where X ⊂ R^{d₁} is a finite set and Ω ⊂ R^{d₂} is a compact convex set. (We discuss the case where X is a continuous set in appendix D.) In our setting, the variable w ∈ Ω fluctuates probabilistically according to a given density function p(w). (When Ω is a finite set, a probability mass function can also be considered; in that case, the subsequent discussions still hold if integral operations are replaced by summation operations.) At every step t, the user chooses the next observation point x_t ∈ X, whereas w_t ∈ Ω is given as a realization of the random variable that follows the distribution p(w). The user then obtains the noisy observation y_t = f(x_t, w_t) + η_t, where η_t is independent Gaussian noise following N(0, σ²).

Furthermore, as a regularity assumption, we assume that f is an element of a reproducing kernel Hilbert space (RKHS) and has a bounded norm, as is also assumed in the standard BO literature [5]. Let k be a positive definite kernel over (X × Ω) × (X × Ω) and H_k be the RKHS corresponding to k. In this paper, for some positive constant B, we assume f ∈ H_k with ‖f‖_{H_k} ≤ B, where ‖·‖_{H_k} denotes the Hilbert norm defined on H_k.

Models
Our algorithm uses the GP method [22] to navigate the optimization process. First, we assume GP(0, k) as a prior of f, where GP(µ, k) is a GP characterized by a mean function µ and a kernel function k. Given the sequence of data {((x_i, w_i), y_i)}_{i=1}^t, the posterior distribution of f(x, w) is the Gaussian distribution whose mean µ_t(x, w) and variance σ_t²(x, w) are defined as follows:

µ_t(x, w) = k_t(x, w)ᵀ (K_t + σ² I_t)⁻¹ y_t,
σ_t²(x, w) = k((x, w), (x, w)) − k_t(x, w)ᵀ (K_t + σ² I_t)⁻¹ k_t(x, w),

where k_t(x, w) = (k((x, w), (x₁, w₁)), . . . , k((x, w), (x_t, w_t)))ᵀ, y_t = (y₁, . . . , y_t)ᵀ, I_t is the identity matrix of size t, and K_t is the t × t kernel matrix whose (i, j)th element is k((x_i, w_i), (x_j, w_j)).

[Figure 1: Illustration of the multi-task scenario. The horizontal and vertical axes correspond to x and w, respectively. Blue and yellow dotted lines indicate the points where the expected value F₁(x) and the negative standard deviation F₂(x) of f(x, w) are maximized. Our goal is to identify the point on the red line that simultaneously maximizes both F₁ and F₂.]

We will make use of the following lemma to construct the confidence bound of f from the posterior mean µ_t and the posterior variance σ_t².

Lemma 2.1 (Theorem 3.11 in [23]). Fix f ∈ H_k with ‖f‖_{H_k} ≤ B. Given δ ∈ (0, 1), define β_t = ( √( ln det(I_t + σ⁻² K_t) + 2 ln(1/δ) ) + B )². Then, the following holds with probability at least 1 − δ:

|f(x, w) − µ_{t−1}(x, w)| ≤ β_t^{1/2} σ_{t−1}(x, w),  ∀x ∈ X, ∀w ∈ Ω, ∀t ≥ 1.   (2)

Based on the above lemma, the confidence bound Q_t(x, w) := [l_t(x, w), u_t(x, w)] of f(x, w) can be computed by

l_t(x, w) = µ_{t−1}(x, w) − β_t^{1/2} σ_{t−1}(x, w),
u_t(x, w) = µ_{t−1}(x, w) + β_t^{1/2} σ_{t−1}(x, w).

Here, we consider the expectation and the variance of f(x, w) under the uncertainty of p(w) as follows:

E_w[f(x, w)] = ∫_Ω f(x, w) p(w) dw,   (3)
V_w[f(x, w)] = ∫_Ω { f(x, w) − E_w[f(x, w)] }² p(w) dw.   (4)

Using these E_w[f(x, w)] and V_w[f(x, w)], we define the objective functions F₁ and F₂ as follows:

F₁(x) = E_w[f(x, w)],  F₂(x) = −√( V_w[f(x, w)] ).   (5)

Our goal is to maximize F₁ and F₂ simultaneously with as few function evaluations as possible. To this end, we handle these objective functions in multi-task and multi-objective optimization frameworks.
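The posterior formulas above translate directly into code. The following sketch computes µ_t, σ_t², and the confidence bound Q_t; the Gaussian kernel, its hyperparameters, the noise level, and the constant β are illustrative choices (the paper sets β_t via Lemma 2.1).

```python
import numpy as np

def rbf_kernel(A, B, ell=0.25, sigma_ker=1.0):
    """Gaussian kernel on concatenated inputs z = (x, w)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_ker**2 * np.exp(-sq / (2 * ell**2))

def gp_posterior(Z_train, y_train, Z_test, noise_var=1e-4):
    """Posterior mean and variance at Z_test given data {((x_i, w_i), y_i)}."""
    K = rbf_kernel(Z_train, Z_train)
    k_star = rbf_kernel(Z_train, Z_test)
    # Solve (K + sigma^2 I)^{-1} via Cholesky for numerical stability.
    L = np.linalg.cholesky(K + noise_var * np.eye(len(Z_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = k_star.T @ alpha
    v = np.linalg.solve(L, k_star)
    var = rbf_kernel(Z_test, Z_test).diagonal() - (v**2).sum(0)
    return mu, np.maximum(var, 0.0)

def confidence_bound(mu, var, beta=4.0):
    """Confidence bound Q_t = [l_t, u_t]; a constant beta is a common
    practical substitute for the schedule from Lemma 2.1."""
    half = np.sqrt(beta) * np.sqrt(var)
    return mu - half, mu + half
```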
Multi-task Optimization Scenario

First, we formulate the problem as a single-objective optimization problem whose objective function is defined as a weighted sum of F₁ and F₂. Given a user-specified weight α ∈ [0, 1], let G be a new objective function defined as follows:

G(x) = αF₁(x) + (1 − α)F₂(x).

In this formulation, our goal is to find x* := argmax_{x∈X} G(x) efficiently. To rigorously state the theoretical properties, we introduce the notion of an ε-accurate solution. Let ˆx_t be an estimated solution which is defined by the algorithm at step t. Given a fixed constant ε ≥
0, we say that ˆx_t is ε-accurate if the following inequality holds:

G(ˆx_t) ≥ G(x*) − ε.

In section 4, for an arbitrarily small ε, we show that our algorithm can find an ε-accurate solution with high probability after a finite number of steps T.
Multi-objective Optimization Scenario

In the multi-task scenario, we assume that the user can specify the weight α before the optimization; however, this is sometimes unrealistic. We therefore also consider a more general formulation based on the Pareto optimality criterion. Hereafter, we use the vector representation of the objective functions, F(x) = (F₁(x), F₂(x)). First, let ⪯ be a relational operator defined over X × X or R² × R². Given x, x′ ∈ X, we write x ⪯ x′ or F(x) ⪯ F(x′) provided that F₁(x) ≤ F₁(x′) and F₂(x) ≤ F₂(x′) hold simultaneously. We say that x′ dominates x if x ⪯ x′. Furthermore, we write x ≺ x′ or F(x) ≺ F(x′) provided that either F₁(x) < F₁(x′) or F₂(x) < F₂(x′) holds. The goal of this scenario is to identify the following Pareto set
Π efficiently:

Π = { x ∈ X | ∀x′ ∈ E_x, F(x) ⋠ F(x′) }, where E_x = { x′ ∈ X | F(x) ≠ F(x′) }.

Moreover, the
Pareto front Z is defined by

Z = ∂{ y ∈ R² | ∃x ∈ X, y ⪯ F(x) }.

Next, we introduce the notion of an ε-accurate Pareto set [8], which is an idea similar to the ε-accurate solution in the multi-task scenario. Given a non-negative vector ε = (ε₁, ε₂), we define the relational operator ⪯_ε, which is a relaxed version of ⪯. For x, x′ ∈ X, we write x ⪯_ε x′ or F(x) ⪯_ε F(x′) if F₁(x) ≤ F₁(x′) + ε₁ and F₂(x) ≤ F₂(x′) + ε₂ hold simultaneously. Then, the ε-Pareto front is defined as:

Z_ε = { y ∈ R² | ∃y′ ∈ Z, y ⪯ y′ and ∃y″ ∈ Z, y″ ⪯_ε y }.

We say that the estimated Pareto set ˆΠ_t of the algorithm is an ε-accurate Pareto set if the following two conditions are satisfied:
1. F(ˆΠ_t) ⊂ Z_ε, where F(ˆΠ_t) := { F(x) | x ∈ ˆΠ_t }.
2. For any x ∈ Π, there is at least one point x′ ∈ ˆΠ_t such that x ⪯_ε x′.

Intuitively, condition 1 guarantees that the estimated solutions are worse than the true Pareto front by at most ε. Condition 2 indicates that ˆΠ_t can cover all points in the true Pareto set Π.

We emphasize that although many studies on multi-task or multi-objective optimization based on a GP have been reported, their methods cannot be directly applied to our setting because the objective functions F₁ and F₂ are not observed directly.

3 Proposed Method

First, we explain the basic idea of our proposed algorithms. To maximize F₁ and F₂ efficiently, one simple way is to consider the predictive distributions of F₁ and F₂ and apply existing methods (e.g., expected improvement, entropy search). However, it is difficult to handle the predictive distribution of F₂ even though f is modeled by a GP. In this paper, we first derive intervals in which F₁ and F₂ lie with high probability from the confidence bound of f, and construct the algorithms based on these derived intervals. Hereafter, with a slight abuse of notation, we refer to these derived intervals as the confidence bounds of F₁ and F₂. (In appendix B, as another formulation, we also consider the constrained optimization problem whose objective and constraint functions are F₁ and F₂, respectively.)

3.1 Confidence Bounds of Objective Functions

First, we consider the confidence bound Q_t^{(F₁)}(x) = [l_t^{(F₁)}(x), u_t^{(F₁)}(x)] of F₁(x). When (2) holds, the following inequality holds for any x ∈ X and t ≥ 1:

∫_Ω l_t(x, w) p(w) dw ≤ ∫_Ω f(x, w) p(w) dw ≤ ∫_Ω u_t(x, w) p(w) dw.

This implies that F₁(x) ∈ Q_t^{(F₁)}(x) for any x ∈ X and t ≥ 1 with probability at least 1 − δ for l_t^{(F₁)} and u_t^{(F₁)} defined as

l_t^{(F₁)}(x) = ∫_Ω l_t(x, w) p(w) dw,  u_t^{(F₁)}(x) = ∫_Ω u_t(x, w) p(w) dw.

We construct the confidence bound Q_t^{(F₂)}(x) = [l_t^{(F₂)}(x), u_t^{(F₂)}(x)] of F₂(x) in a similar way. First, we consider the quantity f(x, w) − E_w[f(x, w)], which appears in the integrand of V_w[f(x, w)]. Under condition (2), the following inequality holds:

˜l_t(x, w) ≤ f(x, w) − E_w[f(x, w)] ≤ ˜u_t(x, w),   (6)

where ˜l_t(x, w) = l_t(x, w) − E_w[u_t(x, w)] and ˜u_t(x, w) = u_t(x, w) − E_w[l_t(x, w)].
Next, the integrand of V_w[f(x, w)] can be evaluated based on (6) as follows:

˜l_t^{(sq)}(x, w) ≤ { f(x, w) − E_w[f(x, w)] }² ≤ ˜u_t^{(sq)}(x, w),

where

˜l_t^{(sq)}(x, w) = 0 if ˜l_t(x, w) ≤ 0 ≤ ˜u_t(x, w), and ˜l_t^{(sq)}(x, w) = min{ ˜l_t²(x, w), ˜u_t²(x, w) } otherwise,
˜u_t^{(sq)}(x, w) = max{ ˜l_t²(x, w), ˜u_t²(x, w) }.

Finally, from the monotonicity of the square root, the confidence bound Q_t^{(F₂)}(x) = [l_t^{(F₂)}(x), u_t^{(F₂)}(x)] of F₂(x) is computed using the following equations for l_t^{(F₂)} and u_t^{(F₂)}:

l_t^{(F₂)}(x) = −√( ∫_Ω ˜u_t^{(sq)}(x, w) p(w) dw ),  u_t^{(F₂)}(x) = −√( ∫_Ω ˜l_t^{(sq)}(x, w) p(w) dw ).
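Once Ω is discretized, the construction of Q_t^{(F₁)} and Q_t^{(F₂)} reduces to elementwise operations. A minimal sketch, assuming l_t and u_t have already been evaluated on a grid of Ω with probability weights p_w:

```python
import numpy as np

def objective_bounds(l_t, u_t, p_w):
    """Confidence bounds of F1 and F2 for one x, following Section 3.1.

    l_t, u_t : arrays of the bounds of f(x, w) over a grid of Omega
    p_w      : probability weights of the grid (sums to one)
    """
    # Bounds of F1(x) = E_w[f(x, w)]: integrate the bounds of f.
    l_F1 = np.sum(l_t * p_w)
    u_F1 = np.sum(u_t * p_w)
    # Bounds (6) of f(x, w) - E_w[f(x, w)].
    lt_tilde = l_t - u_F1
    ut_tilde = u_t - l_F1
    # Bounds of the squared deviation.
    u_sq = np.maximum(lt_tilde**2, ut_tilde**2)
    l_sq = np.where((lt_tilde <= 0) & (0 <= ut_tilde),
                    0.0, np.minimum(lt_tilde**2, ut_tilde**2))
    # Bounds of F2(x) = -sqrt(V_w[f(x, w)]) (note the sign flip).
    l_F2 = -np.sqrt(np.sum(u_sq * p_w))
    u_F2 = -np.sqrt(np.sum(l_sq * p_w))
    return (l_F1, u_F1), (l_F2, u_F2)
```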
3.2 Algorithms

Multi-task Scenario

In the multi-task scenario, our algorithm chooses the next observation point x_t based on the upper confidence bound (UCB) of the function G. From Q_t^{(F₁)}(x) and Q_t^{(F₂)}(x), the confidence bound Q_t^{(G)}(x) := [l_t^{(G)}(x), u_t^{(G)}(x)] of G(x) can be constructed by defining

l_t^{(G)}(x) = α l_t^{(F₁)}(x) + (1 − α) l_t^{(F₂)}(x),
u_t^{(G)}(x) = α u_t^{(F₁)}(x) + (1 − α) u_t^{(F₂)}(x).

At every step t, the next observation point x_t of our algorithm is defined by x_t = argmax_{x∈X} u_t^{(G)}(x). Hereafter, we call this strategy Multi-Task MVA-BO (MT-MVA-BO). The pseudo-code of MT-MVA-BO is shown as Algorithm 1.
Multi-objective Scenario

Next, we explain the proposed algorithm for finding the Pareto set efficiently. From the confidence bounds of F₁ and F₂, we define F_t^{(opt)} and F_t^{(pes)} by F_t^{(opt)}(x) = ( u_t^{(F₁)}(x), u_t^{(F₂)}(x) ) and F_t^{(pes)}(x) = ( l_t^{(F₁)}(x), l_t^{(F₂)}(x) ), which respectively represent the optimistic and pessimistic predictions of the objective functions at step t. First, we define the estimated Pareto set ˆΠ_t at step t by

ˆΠ_t = { x ∈ X | ∀x′ ∈ E_{t,x}^{(pes)}, F_t^{(pes)}(x) ⋠ F_t^{(pes)}(x′) }, where E_{t,x}^{(pes)} = { x′ ∈ X | F_t^{(pes)}(x) ≠ F_t^{(pes)}(x′) }.   (7)

For theoretical reasons, we define ˆΠ_t based on the pessimistic predictions; the same idea is used in the existing GP-based optimization literature [24, 8, 25, 26]. Furthermore, using ˆΠ_t, the potential Pareto set M_t is defined by

M_t = { x ∈ X \ ˆΠ_t | ∀x′ ∈ ˆΠ_t, F_t^{(opt)}(x) ⋠_ε F_t^{(pes)}(x′) }.

Algorithm 1 Multi-task MVA-BO (MT-MVA-BO)
Input: GP prior GP(0, k), {β_t}_{t≤T}, α ∈ [0, 1]
for t = 1 to T do
    Compute u_t^{(G)}(x) for all x ∈ X
    Choose x_t = argmax_{x∈X} u_t^{(G)}(x)
    Sample w_t ∼ p(w)
    Observe y_t ← f(x_t, w_t) + η_t
    Update the GP by adding ((x_t, w_t), y_t)
end for
Output: argmax_{x ∈ {x₁,...,x_T}} l_T^{(G)}(x)
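A minimal sketch of the MT-MVA-BO loop, reusing f, w_grid, and p_w from the sketch in the introduction and gp_posterior, confidence_bound, and objective_bounds from the sketches above; the grid sizes, step count, β, and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(-1, 1, 50)
alpha = 0.5

Z, y = [], []
for t in range(30):
    # Upper confidence bound of G(x) = alpha*F1(x) + (1-alpha)*F2(x) per x.
    ucb_G = []
    for x in x_grid:
        Z_test = np.column_stack([np.full_like(w_grid, x), w_grid])
        if Z:
            mu, var = gp_posterior(np.array(Z), np.array(y), Z_test)
        else:
            mu, var = np.zeros(len(w_grid)), np.ones(len(w_grid))  # prior
        l_t, u_t = confidence_bound(mu, var)
        (l1, u1), (l2, u2) = objective_bounds(l_t, u_t, p_w)
        ucb_G.append(alpha * u1 + (1 - alpha) * u2)
    x_t = x_grid[int(np.argmax(ucb_G))]   # choose x_t by the UCB of G
    w_t = rng.choice(w_grid, p=p_w)       # the environment samples w_t ~ p(w)
    Z.append([x_t, w_t])
    y.append(f(x_t, w_t) + 0.01 * rng.standard_normal())
    # The estimated solution would be reported via l_t^{(G)} over visited points.
```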
Algorithm 2 Multi-objective MVA-BO (MO-MVA-BO)
Input: GP prior GP(0, k), {β_t}_{t∈N}, non-negative vector ε = (ε₁, ε₂)
t ← 1
repeat
    Compute ˆΠ_t and M_t
    Compute λ_t(x) for all x ∈ M_t ∪ ˆΠ_t
    Choose x_t = argmax_{x ∈ M_t ∪ ˆΠ_t} λ_t(x)
    Sample w_t ∼ p(w)
    Observe y_t ← f(x_t, w_t) + η_t
    Update the GP by adding ((x_t, w_t), y_t)
    t ← t + 1
    Compute U_t
until M_t = ∅ and U_t = ∅
Output: ˆΠ_t

An intuitive interpretation of M_t is the set obtained by excluding the points that are ε-dominated by some other point with high probability. At every step t, our algorithm chooses x_t based on the uncertainty defined by the confidence bounds of F₁ and F₂. In this paper, we adopt the diameter λ_t(x) of the rectangle Rect_t(x) = [l_t^{(F₁)}(x), u_t^{(F₁)}(x)] × [l_t^{(F₂)}(x), u_t^{(F₂)}(x)] as the uncertainty of x:

λ_t(x) = max_{y, y′ ∈ Rect_t(x)} ‖y − y′‖.   (8)

Namely, the next observation point x_t is defined by x_t = argmax_{x ∈ M_t ∪ ˆΠ_t} λ_t(x) at every step t.

Our proposed algorithm terminates when the estimated Pareto set ˆΠ_t is guaranteed to be an ε-accurate Pareto set with high probability. To this end, our algorithm checks the uncertainty set U_t defined by

U_t = { x ∈ ˆΠ_t | ∃x′ ∈ ˆΠ_t \ {x}, F_t^{(pes)}(x) + ε ≺ F_t^{(opt)}(x′) }.

Intuitively, U_t is the set of points for which it is not possible to decide, based on the current confidence bounds, whether they are ε-Pareto solutions. Our algorithm terminates at a step t where both M_t = ∅ and U_t = ∅ hold. Hereafter, we call this algorithm Multi-Objective MVA-BO (MO-MVA-BO). The pseudo-code of MO-MVA-BO is shown as Algorithm 2.
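Two ingredients of MO-MVA-BO can be sketched compactly: the pessimistic Pareto set (7), implemented with the standard weak-domination rule, and the rectangle diameter λ_t in (8). Both assume the four bound arrays have already been computed as in section 3.1.

```python
import numpy as np

def pessimistic_pareto_set(F_pes):
    """Estimated Pareto set (7): keep x unless some x' with a different
    pessimistic vector weakly dominates it. F_pes has shape (n, 2)."""
    keep = []
    for i in range(len(F_pes)):
        different = np.any(F_pes != F_pes[i], axis=1)
        dominated = np.all(F_pes >= F_pes[i], axis=1) & different
        if not np.any(dominated):
            keep.append(i)
    return np.array(keep)

def diameter(l1, u1, l2, u2):
    """lambda_t(x) in (8): the diagonal length of Rect_t(x)."""
    return np.hypot(u1 - l1, u2 - l2)
```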
3.3 Extensions

In this section, we consider several extensions of the proposed method to deal with situations that arise in some practical applications, leaving the details to appendix C.

3.3.1 Unknown Input Distribution

Thus far, we have assumed that p(w) is known; however, this assumption is sometimes unrealistic. To deal with the case where p(w) is unknown, one simple way is to estimate p(w) during the optimization process. For example, if we estimate p(w) by an empirical distribution, we can apply our algorithm by replacing p(w) with the following ˜p_t(w) when computing the confidence bounds:

˜p_t(w) = (1/t) Σ_{t′=1}^{t} 1[w_{t′} = w].

As a more advanced method, it may be possible to consider an extension to the distributionally robust setting [26, 27]; however, we leave this as future work.
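A minimal sketch of the empirical estimate ˜p_t on a finite grid; the helper name and grid are illustrative.

```python
import numpy as np
from collections import Counter

def empirical_p(w_history, w_grid):
    """Empirical estimate p~_t(w) = (1/t) * #{t' : w_{t'} = w} on a finite grid."""
    counts = Counter(w_history)
    return np.array([counts[w] / len(w_history) for w in w_grid])
```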
3.3.2 Noisy Input Setting

One setting similar to that in this paper is the noisy input setting [14, 28]. In this setting, the observation point x_t is perturbed by noise ξ_t ∈ ∆ which follows a known density p(ξ) defined over ∆. At every step t, the user chooses x_t and observes y_t as y_t = f(x_t + ξ_t) + η_t, ξ_t ∼ p(ξ). Our problem can be extended to the noisy input setting by defining F₁ and F₂ through the expectation E_ξ[f(x + ξ)] and the variance V_ξ[f(x + ξ)] defined as follows:

E_ξ[f(x + ξ)] = ∫_∆ f(x + ξ) p(ξ) dξ,   (9)
V_ξ[f(x + ξ)] = ∫_∆ { f(x + ξ) − E_ˆξ[f(x + ˆξ)] }² p(ξ) dξ.   (10)

We can apply the same algorithms as those in section 3.2 by constructing the confidence bounds in a way similar to that in section 3.1.

3.3.3 Simulator-based Setting

In some applications, the variable w can be controlled during the optimization; for example, when the user runs the optimization process by evaluating f(x, w) with a computer simulation. Such scenarios have often been considered in similar studies reported in the BO literature that assume the existence of an uncontrollable variable w [13, 26, 27]. Our method can be extended to such a scenario by choosing w_t according to w_t = argmax_{w∈Ω} σ_{t−1}(x_t, w) after the selection of x_t.

4 Theoretical Analysis

In this section, we show the theoretical results for the proposed algorithms. The details of the proofs are given in appendix A. First, we introduce the maximum information gain [5] as a sample complexity parameter of a GP. Let A = {a₁, . . . , a_T} be a finite subset of X × Ω,
and y_A be a vector whose ith element is y_{a_i} = f(a_i) + ε_{a_i}. The maximum information gain γ_T at step T is defined by

γ_T = max_{A ⊂ X×Ω; |A| = T} I(y_A; f),

where I(y_A; f) denotes the mutual information between y_A and f. The maximum information gain γ_T is often used in BO, and analytical upper bounds on it have been derived for commonly used kernels [5].
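For a GP with Gaussian observation noise, I(y_A; f) has the closed form ½ ln det(I_T + σ⁻² K_A), so the information gain of any candidate set can be evaluated directly; a sketch:

```python
import numpy as np

def information_gain(K, noise_var):
    """I(y_A; f) = 0.5 * ln det(I_T + sigma^{-2} K_A) under Gaussian noise."""
    T = K.shape[0]
    sign, logdet = np.linalg.slogdet(np.eye(T) + K / noise_var)
    return 0.5 * logdet

# gamma_T maximizes this quantity over all size-T subsets of X x Omega;
# exact maximization is combinatorial, and greedy selection is a common
# near-optimal surrogate because the information gain is submodular.
```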
The following two theorems show the convergence properties of the proposed algorithms for the multi-task and multi-objective scenarios, respectively.

Theorem 4.1. Fix a positive definite kernel k, and assume f ∈ H_k with ‖f‖_{H_k} ≤ B. Let δ ∈ (0, 1) and ε > 0, and set β_t according to β_t = ( √( ln det(I_t + σ⁻² K_t) + 2 ln(3/δ) ) + B )² at every step t. Furthermore, for any t ≥ 1, define ˆx_t by

ˆx_t = argmax_{x_{t′} ∈ {x₁,...,x_t}} l_{t′}^{(G)}(x_{t′}).

When applying MT-MVA-BO under the above conditions, with probability at least 1 − δ, ˆx_T is an ε-accurate solution, where T is the smallest positive integer which satisfies the following inequality:

4αT⁻¹ β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + (1 − α) T⁻¹ √( 16T ˜B β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + 40T β_T ( C₁ γ_T + C₂ ) ) ≤ ε.   (11)

Here, ˜B = max_{(x,w) ∈ X×Ω} |f(x, w) − E_w[f(x, w)]|, C₁ = 2/ln(1 + σ⁻²), and C₂ = 16 ln(18/δ).
Theorem 4.2. Fix a positive definite kernel k, and assume f ∈ H_k with ‖f‖_{H_k} ≤ B. Let δ ∈ (0, 1) and ε = (ε₁, ε₂) with ε₁, ε₂ > 0, and set β_t according to β_t = ( √( ln det(I_t + σ⁻² K_t) + 2 ln(3/δ) ) + B )² at every step t. When applying MO-MVA-BO under the above conditions, the following statements 1 and 2 hold with probability at least 1 − δ:

1. The algorithm terminates after at most T steps, where T is the smallest positive integer that satisfies the following inequality:

4T⁻¹ β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + T⁻¹ √( 16T ˜B β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + 40T β_T ( C₁ γ_T + C₂ ) ) ≤ min{ε₁, ε₂}.   (12)

Here, ˜B = max_{(x,w) ∈ X×Ω} |f(x, w) − E_w[f(x, w)]|, C₁ = 2/ln(1 + σ⁻²), and C₂ = 16 ln(18/δ).

2. When the algorithm terminates, the estimated Pareto set ˆΠ_t is an ε-accurate Pareto set.

The first term β_T^{1/2}( √(T C₁ γ_T) + C₂ ) on the left-hand side of (11) and (12) also appears (up to constants) in the theoretical results for existing algorithms which only consider the expectation F₁ (e.g., Theorem 2 in [26]). The second term √( 16T ˜B β_T^{1/2}( √(T C₁ γ_T) + C₂ ) + 40T β_T ( C₁ γ_T + C₂ ) ) is specific to our problem. This term depends on the complexity parameter ˜B, which quantifies the variation of the function f(x, w) around its expectation.

5 Numerical Experiments
In this section, we demonstrate the performance of the proposed methods through numerical experiments. As baseline methods in both the multi-task and multi-objective scenarios, we adopted random sampling (RS) and uncertainty sampling (US). RS chooses x_t from X uniformly at random, and US chooses the x_t that achieves the largest average posterior variance, x_t = argmax_{x∈X} ∫_Ω σ²_{t−1}(x, w) p(w) dw. To measure the performance in the multi-task scenario, we computed the regret G(x*) − G(ˆx_t) at every step t, where ˆx_t is the estimated solution defined by each algorithm. We defined ˆx_t as ˆx_t = argmax_{t′=1,...,t} l_t^{(G)}(x_{t′}) in RS, US, and the proposed method (MT-MVA-BO). Furthermore, we set α = 0.5. In the multi-objective scenario, we used the hyper-volume gap HV − ĤV_t to measure the performance, where HV and ĤV_t denote the hyper-volumes computed based on the true Pareto set Π and the estimated Pareto set ˆΠ_t, respectively. The hyper-volume gap measures how close the estimated Pareto front is to the true Pareto front; a minimal sketch of the 2-D hyper-volume computation is given below. We defined ˆΠ_t by (7) in RS, US, and the proposed method (MO-MVA-BO). Furthermore, in the multi-task scenario, to show the effect of the difference in objective functions, we also adopted two methods, BQOUCB [26, 27] and BO-VO. BQOUCB is an existing method which aims to maximize F₁, and BO-VO is the variant of our method corresponding to the case α = 0. These methods choose x_t as the maximizer of u_t^{(F₁)}(x) and u_t^{(F₂)}(x), respectively; the estimated solution ˆx_t is defined by ˆx_t = argmax_{t′=1,...,t} l_t^{(F₁)}(x_{t′}) and ˆx_t = argmax_{t′=1,...,t} l_t^{(F₂)}(x_{t′}), respectively. Moreover, we also make comparisons with the adaptive versions of these methods, ADA-BQOUCB and ADA-BO-VO. ADA-BQOUCB and ADA-BO-VO choose x_t in the same way as BQOUCB and BO-VO, but the estimated solutions are defined as ˆx_t = argmax_{t′=1,...,t} l_t^{(G)}(x_{t′}).
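A minimal sketch of the 2-D hyper-volume used for ĤV_t, assuming both objectives are maximized and a reference point dominated by the whole front; this standard sweep is an illustration, not the authors' code.

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Area dominated by a 2-D Pareto front relative to reference point `ref`
    (both objectives maximized; `front` has shape (m, 2))."""
    pts = front[np.argsort(-front[:, 0])]   # sort by F1 descending
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 > prev_f2:                    # only non-dominated steps add area
            hv += (f1 - ref[0]) * (f2 - prev_f2)
            prev_f2 = f2
    return hv
```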
5.1 Artificial-data Experiments

In this subsection, we show the results of the artificial-data experiments.

GP Test Functions
We experimented with true oracle functions f generated from a 2D GP prior. First, we divided [−1, 1] into 25 uniformly spaced grid points in each dimension and generated a sample path from the GP prior. Next, we created a GP model with these grid points and set the true oracle function to its GP posterior mean. In this experiment, we created 50 sample paths from different seeds and conducted 10 experiments for each function; thus, we report the average performance over a total of 500 experiments. To create a GP sample path, we used the Gaussian kernel k((x, w), (x′, w′)) = σ_ker² exp( −(‖x − x′‖² + ‖w − w′‖²)/l² ) with σ_ker² = 1 and l = 0.25, which was also used to construct the confidence bounds in the algorithms. Furthermore, we set the noise variance σ² to a small fixed value. In addition, we divided [−1, 1] into 100 uniformly spaced grid points and set X and Ω as these grid points. Moreover, we defined p(w) by p(w) = φ(w)/Z with Z = Σ_{w∈Ω} φ(w), where φ is the density function of the standard normal distribution.
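A sketch of this test-function construction, reusing rbf_kernel and gp_posterior from the earlier sketch; the seed, jitter, and grid sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(-1, 1, 25)
X1, X2 = np.meshgrid(grid, grid)
Zg = np.column_stack([X1.ravel(), X2.ravel()])

# Draw one sample path of the zero-mean GP prior on the 25x25 grid.
K = rbf_kernel(Zg, Zg) + 1e-8 * np.eye(len(Zg))  # jitter for stability
sample = np.linalg.cholesky(K) @ rng.standard_normal(len(Zg))

# The oracle is the posterior mean of a GP fitted to (Zg, sample); evaluating
# it at new (x, w) gives a smooth interpolant of the sample path.
def oracle(x, w):
    mu, _ = gp_posterior(Zg, sample, np.atleast_2d([x, w]))
    return float(mu[0])
```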
Benchmark Functions of Optimization

We also experimented with the Bird function (2D) and the Rosenbrock function (3D), which are often used as benchmark functions in the field of optimization. First, we scaled the input domain to [−1, 1] and divided it into 100 grid points in each dimension. For the Bird function, we set X and Ω as the grid points of the first and second dimensions, respectively. For the Rosenbrock function, we set Ω as the grid points of the third dimension and the remaining points as X. Furthermore, we set p(w) in the same way as in the experiments on the GP test functions. We used the ARD Gaussian kernel k((x, w), (x′, w′)) = σ_ker² exp( −Σ_{i=1}^{d₁} (x_i − x′_i)²/l_i^{(x)2} − Σ_{j=1}^{d₂} (w_j − w′_j)²/l_j^{(w)2} ), and tuned its hyperparameters by maximizing the marginal likelihood every 10 steps in the algorithms. Furthermore, we set the noise variance σ² to a small fixed value and report the average performance of 100 simulations with different seeds.

Figure 2 shows the results of the artificial-data experiments. We confirmed that the proposed methods achieve better performance than the other methods. In the experiments of the multi-task scenario, we also confirmed that the regrets of BQOUCB, BO-VO, ADA-BQOUCB, and ADA-BO-VO stop decreasing at an early stage. These are reasonable results because the objective functions of these methods are inconsistent with our setting.
5.2 Real-data Experiments

We applied the proposed methods to the newsvendor problem under dynamic consumer substitution [29], whose goal is to optimize the initial inventory levels under uncertainty of customer behavior. The parameters x and w correspond to the initial inventory levels of the products and the uncertain purchasing behaviors of the customers, respectively; the latter follow mutually independent Gamma distributions. The goal of this problem is to find the x which optimizes the profit f(x, w) under the uncertainty of w. For this problem, we conducted the experiments in the simulator-based setting described in section 3.3.3, because the profit f(x, w) can be evaluated by a computer simulation. Figure 3 shows the average performance of 100 simulations with different seeds.
[Figure 2: Average performances in the artificial-data experiments (panels: GP Test Functions, Bird, Rosenbrock; horizontal axes: Iteration; vertical axes: Regret (top row) and Hyper-volume Gap (bottom row); compared methods: RS, US, BQOUCB, ADA-BQOUCB, BO-VO, ADA-BO-VO, MT-MVA-BO / MO-MVA-BO). The error bars represent 2 × [standard error]. The top and bottom figures show the results of the multi-task (α = 0.5) and multi-objective scenarios, respectively.]
[Figure 3: The results of the experiments on the newsvendor problem (horizontal axes: Iteration; vertical axes: Regret and Hyper-volume Gap; methods as in Figure 2). The left and right figures present the results of the multi-task (α = 0.5) and multi-objective scenarios, respectively.]
6 Conclusion

We introduced a novel Bayesian optimization framework, MVA-BO, which simultaneously considers two objective functions, the expectation and the variance, in an uncertain environment. In this framework, we considered three scenarios that often appear in real-world applications: multi-task, multi-objective, and constrained optimization. We established rigorous convergence properties of our MVA-BO algorithms and demonstrated their effectiveness through both artificial- and real-data experiments.
Acknowledgement
This work was partially supported by MEXT KAKENHI (20H00601, 16H06538), JST CREST (JPMJCR1502), and the RIKEN Center for Advanced Intelligence Project.
References

[1] Harry M. Markowitz. Portfolio selection. Journal of Finance, 7(1):77–91, 1952.
[2] Harry M. Markowitz and G. Peter Todd. Mean-Variance Analysis in Portfolio Choice and Capital Markets, volume 66. John Wiley & Sons, 2000.
[3] Michael C. Keeley and Frederick T. Furlong. A reexamination of mean-variance analysis of bank capital regulation. Journal of Banking & Finance, 14(1):69–84, 1990.
[4] Anthony O'Hagan. Bayes–Hermite quadrature. Journal of Statistical Planning and Inference, 29(3):245–260, 1991.
[5] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 1015–1022. Omnipress, 2010.
[6] Kevin Swersky, Jasper Snoek, and Ryan P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004–2012, 2013.
[7] Michael Emmerich. Single- and multi-objective evolutionary design optimization assisted by Gaussian random field metamodels. Dissertation, LS11, FB Informatik, Universität Dortmund, Germany, 2005.
[8] Marcela Zuluaga, Andreas Krause, and Markus Püschel. ε-PAL: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(104):1–32, 2016.
[9] Shinya Suzuki, Shion Takeno, Tomoyuki Tamura, Kazuki Shitara, and Masayuki Karasuyama. Multi-objective Bayesian optimization using Pareto-frontier entropy. In Proceedings of Machine Learning and Systems 2020, pages 10841–10850, 2020.
[10] Jacob R. Gardner, Matt J. Kusner, Zhixiang Eddie Xu, Kilian Q. Weinberger, and John P. Cunningham. Bayesian optimization with inequality constraints. In ICML, volume 2014, pages 937–945, 2014.
[11] Michael A. Gelbart, Jasper Snoek, and Ryan P. Adams. Bayesian optimization with unknown constraints. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI'14, pages 250–259, Arlington, Virginia, USA, 2014. AUAI Press.
[12] José Miguel Hernández-Lobato, Michael A. Gelbart, Ryan P. Adams, Matthew W. Hoffman, and Zoubin Ghahramani. A general framework for constrained Bayesian optimization using information-based search. The Journal of Machine Learning Research, 17(1):5549–5601, 2016.
[13] Saul Toscano-Palmerin and Peter I. Frazier. Bayesian optimization with expensive integrands. CoRR, abs/1803.08661, 2018.
[14] Justin J. Beland and Prasanth B. Nair. Bayesian optimization under uncertainty. In NIPS BayesOpt 2017 Workshop, 2017.
[15] Shogo Iwazaki, Yu Inatsu, and Ichiro Takeuchi. Bayesian quadrature optimization for probability threshold robustness measure. arXiv preprint arXiv:2006.11986, 2020.
[16] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization, volume 28. Princeton University Press, 2009.
[17] Hans-Georg Beyer and Bernhard Sendhoff. Robust optimization – a comprehensive survey. Computer Methods in Applied Mechanics and Engineering, 196(33-34):3190–3218, 2007.
[18] Aharon Ben-Tal and Arkadi Nemirovski. Robust optimization – methodology and applications. Mathematical Programming, 92(3):453–480, 2002.
[19] Alexander Schied. Risk measures and robust optimization problems. Stochastic Models, 22(4):753–831, 2006.
[20] Gordon J. Alexander and Alexandre M. Baptista. Economic implications of using a mean-VaR model for portfolio selection: A comparison with mean-variance analysis. Journal of Economic Dynamics and Control, 26(7-8):1159–1193, 2002.
[21] Frank J. Fabozzi, Petter N. Kolm, Dessislava A. Pachamanova, and Sergio M. Focardi. Robust portfolio optimization. The Journal of Portfolio Management, 33(3):40–48, 2007.
[22] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[23] Yasin Abbasi-Yadkori. Online learning for linearly parametrized control problems. 2013.
[24] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning, pages 997–1005, 2015.
[25] Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimization with Gaussian processes. In Advances in Neural Information Processing Systems, pages 5760–5770, 2018.
[26] Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka, and Andreas Krause. Distributionally robust Bayesian optimization. In Silvia Chiappa and Roberto Calandra, editors, The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pages 2174–2184. PMLR, 2020.
[27] Thanh Nguyen, Sunil Gupta, Huong Ha, Santu Rana, and Svetha Venkatesh. Distributionally robust Bayesian quadrature optimization. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1921–1931, Online, 26–28 Aug 2020. PMLR.
[28] Lukas Fröhlich, Edgar Klenske, Julia Vinogradska, Christian Daniel, and Melanie Zeilinger. Noisy-input entropy search for efficient robust Bayesian optimization. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 2262–2272, Online, 26–28 Aug 2020. PMLR.
[29] Siddharth Mahajan and Garrett Van Ryzin. Stocking retail assortments under dynamic consumer substitution. Operations Research, 49(3):334–351, 2001.
[30] Johannes Kirschner and Andreas Krause. Information directed sampling and bandits with heteroscedastic noise. In Proc. International Conference on Learning Theory (COLT), July 2018.
[31] Yanan Sui, Vincent Zhuang, Joel W. Burdick, and Yisong Yue. Stagewise safe Bayesian optimization with Gaussian processes. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 4788–4796. PMLR, 2018.
A Proofs
A.1 Proof of Theorem 4.1
From the definition of β_t and Lemma 2.1, the following holds with probability at least 1 − δ/3:

|f(x, w) − µ_{t−1}(x, w)| ≤ β_t^{1/2} σ_{t−1}(x, w),  ∀x ∈ X, ∀w ∈ Ω, ∀t ≥ 1.   (13)

Moreover, we give the following lemma about the confidence bound Q_t^{(G)}(x_t):

Lemma A.1.
Assume that (13) holds. Then, for any T ≥ 1, it holds that

Σ_{t=1}^{T} { u_t^{(G)}(x_t) − l_t^{(G)}(x_t) } ≤ 2α β_T^{1/2} Σ_{t=1}^{T} ∫_Ω σ_{t−1}(x_t, w) p(w) dw + (1 − α) √( 8T ˜B β_T^{1/2} Σ_{t=1}^{T} ∫_Ω σ_{t−1}(x_t, w) p(w) dw + 20T β_T Σ_{t=1}^{T} ∫_Ω σ²_{t−1}(x_t, w) p(w) dw ),

where ˜B = max_{(x,w) ∈ X×Ω} |f(x, w) − E_w[f(x, w)]|.

Proof. From the definition of u_t^{(G)} and l_t^{(G)}, we have

Σ_{t=1}^{T} { u_t^{(G)}(x_t) − l_t^{(G)}(x_t) } = α Σ_{t=1}^{T} { u_t^{(F₁)}(x_t) − l_t^{(F₁)}(x_t) } + (1 − α) Σ_{t=1}^{T} { u_t^{(F₂)}(x_t) − l_t^{(F₂)}(x_t) }.   (14)

Similarly, from the definition of u_t^{(F₁)} and l_t^{(F₁)}, we get the following inequality:

Σ_{t=1}^{T} { u_t^{(F₁)}(x_t) − l_t^{(F₁)}(x_t) } = Σ_{t=1}^{T} ∫_Ω { u_t(x_t, w) − l_t(x_t, w) } p(w) dw = 2 Σ_{t=1}^{T} β_t^{1/2} ∫_Ω σ_{t−1}(x_t, w) p(w) dw ≤ 2 β_T^{1/2} Σ_{t=1}^{T} ∫_Ω σ_{t−1}(x_t, w) p(w) dw.   (15)

Here, the last inequality is given by the monotonicity of β_t. In addition, noting the definition of u_t^{(F₂)} and l_t^{(F₂)}, we obtain

u_t^{(F₂)}(x_t) − l_t^{(F₂)}(x_t) = √( ∫_Ω ˜u_t^{(sq)}(x_t, w) p(w) dw ) − √( ∫_Ω ˜l_t^{(sq)}(x_t, w) p(w) dw ) ≤ √( ∫_Ω { ˜u_t^{(sq)}(x_t, w) − ˜l_t^{(sq)}(x_t, w) } p(w) dw ),   (16)

where the last inequality follows from the fact that √a − √b ≤ √(a − b) for any a ≥ b ≥ 0. Furthermore, we have

˜u_t^{(sq)}(x_t, w) − ˜l_t^{(sq)}(x_t, w) = max{ ˜l_t²(x_t, w), ˜u_t²(x_t, w) } − min{ ˜l_t²(x_t, w), ˜u_t²(x_t, w) } + STR_t(x_t, w),   (17)

where STR_t(x_t, w) = max{ 0, min( ˜u_t(x_t, w), −˜l_t(x_t, w) ) }². Moreover, we define ˜µ_{t−1}(x, w) and ˜σ_{t−1}(x, w) as

˜µ_{t−1}(x, w) = µ_{t−1}(x, w) − E_w[µ_{t−1}(x, w)],  ˜σ_{t−1}(x, w) = σ_{t−1}(x, w) + E_w[σ_{t−1}(x, w)].

Then, ˜l_t(x, w) and ˜u_t(x, w) can be expressed as follows:

˜l_t(x, w) = ˜µ_{t−1}(x, w) − β_t^{1/2} ˜σ_{t−1}(x, w),  ˜u_t(x, w) = ˜µ_{t−1}(x, w) + β_t^{1/2} ˜σ_{t−1}(x, w).

If ˜l_t²(x_t, w) ≤ ˜u_t²(x_t, w), then we have ˜µ_{t−1}(x_t, w) ≥ 0 and

max{ ˜l_t², ˜u_t² } − min{ ˜l_t², ˜u_t² } = { ˜µ_{t−1} + β_t^{1/2} ˜σ_{t−1} }² − { ˜µ_{t−1} − β_t^{1/2} ˜σ_{t−1} }² = 4 β_t^{1/2} ˜µ_{t−1} ˜σ_{t−1} = 4 β_t^{1/2} |˜µ_{t−1}| ˜σ_{t−1}.

On the other hand, if ˜l_t²(x_t, w) > ˜u_t²(x_t, w), then ˜µ_{t−1}(x_t, w) < 0 and

max{ ˜l_t², ˜u_t² } − min{ ˜l_t², ˜u_t² } = { ˜µ_{t−1} − β_t^{1/2} ˜σ_{t−1} }² − { ˜µ_{t−1} + β_t^{1/2} ˜σ_{t−1} }² = −4 β_t^{1/2} ˜µ_{t−1} ˜σ_{t−1} = 4 β_t^{1/2} |˜µ_{t−1}| ˜σ_{t−1}.

Therefore, in all cases the following equality holds:

max{ ˜l_t²(x_t, w), ˜u_t²(x_t, w) } − min{ ˜l_t²(x_t, w), ˜u_t²(x_t, w) } = 4 β_t^{1/2} |˜µ_{t−1}(x_t, w)| ˜σ_{t−1}(x_t, w).

Next, since (13) holds, we get f(x, w) − E_w[f(x, w)] ∈ [˜l_t(x, w), ˜u_t(x, w)]. This implies that |f(x, w) − E_w[f(x, w)] − ˜µ_{t−1}(x, w)| ≤ β_t^{1/2} ˜σ_{t−1}(x, w). Hence, we have

|˜µ_{t−1}(x, w)| ≤ |f(x, w) − E_w[f(x, w)]| + β_t^{1/2} ˜σ_{t−1}(x, w) ≤ ˜B + β_t^{1/2} ˜σ_{t−1}(x, w).

Thus, the following inequality holds:

max{ ˜l_t², ˜u_t² } − min{ ˜l_t², ˜u_t² } ≤ 4 β_t^{1/2} ˜σ_{t−1}(x_t, w) { ˜B + β_t^{1/2} ˜σ_{t−1}(x_t, w) } = 4 ˜B β_t^{1/2} ˜σ_{t−1}(x_t, w) + 4 β_t ˜σ²_{t−1}(x_t, w).   (18)

Moreover, STR_t(x_t, w) can be bounded as

STR_t(x_t, w) ≤ { ( ˜u_t(x_t, w) − ˜l_t(x_t, w) ) / 2 }² = β_t ˜σ²_{t−1}(x_t, w).   (19)

Hence, from (17), (18) and (19), we obtain

˜u_t^{(sq)}(x_t, w) − ˜l_t^{(sq)}(x_t, w) ≤ 4 ˜B β_t^{1/2} ˜σ_{t−1}(x_t, w) + 5 β_t ˜σ²_{t−1}(x_t, w)

and

∫_Ω { ˜u_t^{(sq)}(x_t, w) − ˜l_t^{(sq)}(x_t, w) } p(w) dw ≤ 4 ˜B β_t^{1/2} ∫_Ω ˜σ_{t−1}(x_t, w) p(w) dw + 5 β_t ∫_Ω ˜σ²_{t−1}(x_t, w) p(w) dw.

In addition, from the definition of ˜σ_{t−1}(x_t, w), the following holds:

∫_Ω ˜σ_{t−1}(x_t, w) p(w) dw = E_w[σ_{t−1}(x_t, w)] + ∫_Ω σ_{t−1}(x_t, w) p(w) dw = 2 ∫_Ω σ_{t−1}(x_t, w) p(w) dw,
∫_Ω ˜σ²_{t−1}(x_t, w) p(w) dw = ∫_Ω σ²_{t−1}(x_t, w) p(w) dw + 3 { ∫_Ω σ_{t−1}(x_t, w) p(w) dw }² ≤ 4 ∫_Ω σ²_{t−1}(x_t, w) p(w) dw.

Here, the last inequality is obtained by using Jensen's inequality and the convexity of g(x) = x². Therefore, we have

∫_Ω { ˜u_t^{(sq)}(x_t, w) − ˜l_t^{(sq)}(x_t, w) } p(w) dw ≤ 8 ˜B β_t^{1/2} ∫_Ω σ_{t−1}(x_t, w) p(w) dw + 20 β_t ∫_Ω σ²_{t−1}(x_t, w) p(w) dw.   (20)

Thus, by using (20) and Schwarz's inequality for (16), we get

Σ_{t=1}^{T} { u_t^{(F₂)}(x_t) − l_t^{(F₂)}(x_t) } ≤ √( 8T ˜B β_T^{1/2} Σ_{t=1}^{T} ∫_Ω σ_{t−1}(x_t, w) p(w) dw + 20T β_T Σ_{t=1}^{T} ∫_Ω σ²_{t−1}(x_t, w) p(w) dw ).   (21)

Therefore, from (14), (15) and (21), we have the desired inequality. □

Next, in order to evaluate Σ_{t=1}^{T} ∫_Ω σ_{t−1}(x_t, w) p(w) dw and Σ_{t=1}^{T} ∫_Ω σ²_{t−1}(x_t, w) p(w) dw on the right-hand side of the inequality of Lemma A.1, we introduce the following lemma given by [30]:

Lemma A.2.
Let S_t be any non-negative stochastic process adapted to a filtration {F_t}, and define m_t = E[S_t | F_{t−1}]. Assume that S_t ≤ K for some K ≥ 1. Then, for any T ≥ 1, the following holds with probability at least 1 − δ:

Σ_{t=1}^{T} m_t ≤ 2 Σ_{t=1}^{T} S_t + 8K ln(6K/δ).
Furthermore, from the assumption on the kernel function, we get σ²_{t−1}(x_t, w) ≤ k((x_t, w), (x_t, w)) ≤ 1. Hence, from Lemma A.2, with probability at least 1 − δ/3, it holds that

Σ_{t=1}^{T} ∫_Ω σ_{t−1}(x_t, w) p(w) dw ≤ 2 Σ_{t=1}^{T} σ_{t−1}(x_t, w_t) + 8 ln(18/δ).   (22)

Similarly, the following inequality holds with probability at least 1 − δ/3:

Σ_{t=1}^{T} ∫_Ω σ²_{t−1}(x_t, w) p(w) dw ≤ 2 Σ_{t=1}^{T} σ²_{t−1}(x_t, w_t) + 8 ln(18/δ).   (23)

In addition, we introduce the following lemma given by [5] about the maximum information gain γ_T:

Lemma A.3.
Fix T ≥ 1. Then, the following inequality holds:

Σ_{t=1}^{T} σ²_{t−1}(x_t, w_t) ≤ C₁ γ_T, where C₁ = 2/ln(1 + σ⁻²).   (24)

Moreover, from Schwarz's inequality and Lemma A.3, we get the following inequality:

Σ_{t=1}^{T} σ_{t−1}(x_t, w_t) ≤ √( T Σ_{t=1}^{T} σ²_{t−1}(x_t, w_t) ) ≤ √( T C₁ γ_T ).   (25)

Thus, from (22), (23), (24) and (25), we obtain the following corollary:

Corollary A.1.
Assume that (13), (22) and (23) hold. Then, for any T ≥ 1, it holds that

Σ_{t=1}^{T} { u_t^{(G)}(x_t) − l_t^{(G)}(x_t) } ≤ 4α β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + (1 − α) √( 16T ˜B β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + 40T β_T ( C₁ γ_T + C₂ ) ),

where C₁ = 2/ln(1 + σ⁻²) and C₂ = 16 ln(18/δ).

Proof. From Lemma A.1, (22) and (23), it holds that

Σ_{t=1}^{T} { u_t^{(G)}(x_t) − l_t^{(G)}(x_t) } ≤ 4α β_T^{1/2} { Σ_{t=1}^{T} σ_{t−1}(x_t, w_t) + 4 ln(18/δ) } + (1 − α) √( 16T ˜B β_T^{1/2} { Σ_{t=1}^{T} σ_{t−1}(x_t, w_t) + 4 ln(18/δ) } + 40T β_T { Σ_{t=1}^{T} σ²_{t−1}(x_t, w_t) + 4 ln(18/δ) } ).   (26)

Therefore, by combining (24), (25), (26) and the fact that 4 ln(18/δ) ≤ C₂, we get the desired inequality. □

Finally, we prove Theorem 4.1. Let T ≥ 1, and define ˆT = argmax_{t=1,...,T} l_t^{(G)}(x_t). Assume that (13) holds. Then, for any x ∈ X, it holds that G(x) ∈ [l_t^{(G)}(x), u_t^{(G)}(x)]. Thus, for any t′ = 1, . . . , T, we get

G(x*) − G(ˆx_T) ≤ u_{t′}^{(G)}(x_{t′}) − l_{ˆT}^{(G)}(ˆx_T) = u_{t′}^{(G)}(x_{t′}) − max_{t=1,...,T} l_t^{(G)}(x_t) ≤ u_{t′}^{(G)}(x_{t′}) − l_{t′}^{(G)}(x_{t′}).

This implies that

G(x*) − G(ˆx_T) ≤ (1/T) Σ_{t=1}^{T} { u_t^{(G)}(x_t) − l_t^{(G)}(x_t) }.   (27)

Here, note that with probability at least 1 − δ, (13), (22) and (23) all hold. Therefore, by combining Corollary A.1, the following holds with probability at least 1 − δ:

G(x*) − G(ˆx_T) ≤ 4αT⁻¹ β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + (1 − α) T⁻¹ √( 16T ˜B β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + 40T β_T ( C₁ γ_T + C₂ ) ).

Hence, if T satisfies (11), with probability at least 1 − δ, it holds that G(x*) − G(ˆx_T) ≤ ε. Therefore, ˆx_T is an ε-accurate solution. □

A.2 Proof of Theorem 4.2
In this subsection, we prove Theorem 4.2. First, we show several lemmas.
Lemma A.4.
For any t ≥ 1, ˆΠ_t has at least one element (i.e., ˆΠ_t ≠ ∅).

Proof. Let t ≥ 1. We define ˜x_t and x_t† as

˜x_t = argmax_{x∈X} l_t^{(F₁)}(x),  x_t† = argmax_{x∈X; l_t^{(F₁)}(x) = l_t^{(F₁)}(˜x_t)} l_t^{(F₂)}(x).

First, assume that E_{t,x_t†}^{(pes)} = ∅. Then, it holds that

∀x′ ∈ ∅ = E_{t,x_t†}^{(pes)}, F_t^{(pes)}(x_t†) ⋠ F_t^{(pes)}(x′).

This implies that x_t† ∈ ˆΠ_t. On the other hand, if E_{t,x_t†}^{(pes)} ≠ ∅, then the following holds for any x′ ∈ E_{t,x_t†}^{(pes)}:

l_t^{(F₁)}(x_t†) = l_t^{(F₁)}(˜x_t) ≥ l_t^{(F₁)}(x′).

Here, if l_t^{(F₁)}(x_t†) > l_t^{(F₁)}(x′), it holds that F_t^{(pes)}(x_t†) ⋠ F_t^{(pes)}(x′). Similarly, if l_t^{(F₁)}(x_t†) = l_t^{(F₁)}(x′), it holds that l_t^{(F₂)}(x_t†) ≥ l_t^{(F₂)}(x′). Noting that F_t^{(pes)}(x_t†) ≠ F_t^{(pes)}(x′) and l_t^{(F₁)}(x_t†) = l_t^{(F₁)}(x′), we have l_t^{(F₂)}(x_t†) > l_t^{(F₂)}(x′). Thus, we have F_t^{(pes)}(x_t†) ⋠ F_t^{(pes)}(x′). From the definition of ˆΠ_t, we get x_t† ∈ ˆΠ_t. □

Lemma A.5.
Let t ≥ 1, and assume that M_t ≠ ∅. Also let x^(1) be an element of M_t. Then, there exists an element x′ ∈ ˆΠ_t such that F_t^{(pes)}(x^(1)) ⪯ F_t^{(pes)}(x′).

Proof. Let t ≥ 1, M_t ≠ ∅ and x^(1) ∈ M_t. Assume that the following holds:

F_t^{(pes)}(x^(1)) ⋠ F_t^{(pes)}(x′),  ∀x′ ∈ ˆΠ_t.   (28)

From the definition of M_t, we have x^(1) ∉ ˆΠ_t. Since x^(1) ∉ ˆΠ_t, there exists x^(2) ∈ E_{t,x^(1)}^{(pes)} such that F_t^{(pes)}(x^(1)) ⪯ F_t^{(pes)}(x^(2)); by (28), x^(2) ∉ ˆΠ_t. Therefore, there exists x^(3) ∈ E_{t,x^(2)}^{(pes)} such that F_t^{(pes)}(x^(2)) ⪯ F_t^{(pes)}(x^(3)). Furthermore, by combining F_t^{(pes)}(x^(1)) ⪯ F_t^{(pes)}(x^(2)) and F_t^{(pes)}(x^(2)) ⪯ F_t^{(pes)}(x^(3)), we get F_t^{(pes)}(x^(1)) ⪯ F_t^{(pes)}(x^(3)). Thus, from (28) we obtain x^(3) ∉ ˆΠ_t. By repeating the same argument, we obtain x^(1), . . . , x^(|X|), where x^(k) ∉ ˆΠ_t for k = 1, . . . , |X|. Next, we show that x^(i) ≠ x^(j) for any i and j with i ≠ j. In fact, if there exist i and j with i < j such that x^(i) = x^(j), we get F_t^{(pes)}(x^(i)) = F_t^{(pes)}(x^(j)). Here, from i ≤ j − 1, noting the definition of x^(i) and x^(j−1), we get

F_t^{(pes)}(x^(j)) = F_t^{(pes)}(x^(i)) ⪯ F_t^{(pes)}(x^(j−1)).

Similarly, from the definition of x^(j−1) and x^(j), we obtain F_t^{(pes)}(x^(j−1)) ⪯ F_t^{(pes)}(x^(j)). Thus, we get F_t^{(pes)}(x^(j−1)) = F_t^{(pes)}(x^(j)). However, this contradicts x^(j) ∈ E_{t,x^(j−1)}^{(pes)}. Hence, x^(i) ≠ x^(j) for any i and j with i ≠ j. Therefore, the set {x^(1), . . . , x^(|X|)} is equal to X. Recall that x^(k) ∉ ˆΠ_t for every k = 1, . . . , |X|. By combining this and {x^(1), . . . , x^(|X|)} = X, we have ˆΠ_t = ∅. However, this contradicts Lemma A.4. Hence, the assumption (28) is incorrect. □

Lemma A.6.
Let $x$ be an element of $\mathcal{X}$, and let $\epsilon = (\epsilon_1, \epsilon_2)^\top$ be a positive vector. Assume that at least one of the following inequalities holds for any $x' \in \mathcal{X}$:
$$F_1(x) + \epsilon_1 \geq F_1(x'), \quad F_2(x) + \epsilon_2 \geq F_2(x').$$
Then, it holds that $F(x) \in Z_\epsilon$.

Proof. In order to prove Lemma A.6, we consider the following two cases: (1) For any $x, x' \in \Pi$, $F(x) = F(x')$. (2) There exist $x, x' \in \Pi$ such that $F(x) \neq F(x')$.

First, we consider (1). We define $x^{(1)}$ and $x^{(2)}$ as
$$\tilde{x} = \arg\max_{x \in \mathcal{X}} F_1(x), \quad x^{(1)} = \arg\max_{x;\, F_1(x) = F_1(\tilde{x})} F_2(x), \quad x^\dagger = \arg\max_{x \in \mathcal{X}} F_2(x), \quad x^{(2)} = \arg\max_{x;\, F_2(x) = F_2(x^\dagger)} F_1(x).$$
From the definitions of $x^{(1)}$ and $x^{(2)}$, it holds that $x^{(1)}, x^{(2)} \in \Pi$. Thus, from (1), we get $F(x^{(1)}) = F(x^{(2)})$. Hence, the following holds for any $x' \in \mathcal{X}$:
$$F_1(x') \leq F_1(x^{(1)}), \quad F_2(x') \leq F_2(x^{(2)}) = F_2(x^{(1)}).$$
Therefore, we get $F(x') \preceq F(x^{(1)})$. Note that $F(x^{(1)}) \in Z$. Here, let $x \in \mathcal{X}$. Then, from the lemma's assumption, at least one of the following inequalities holds:
$$F_1(x) + \epsilon_1 \geq F_1(x^{(1)}), \quad F_2(x) + \epsilon_2 \geq F_2(x^{(1)}).$$
If $F_1(x) + \epsilon_1 \geq F_1(x^{(1)})$, we set $a = (F_1(x^{(1)}), F_2(x))^\top$. Noting that $F(x') \preceq F(x^{(1)})$ for any $x' \in \mathcal{X}$, we have $a \preceq F(x^{(1)})$. This implies that $a \in Z$. Thus, the following holds:
$$a = (F_1(x^{(1)}), F_2(x))^\top \preceq (F_1(x) + \epsilon_1, F_2(x) + \epsilon_2)^\top = F(x) + \epsilon.$$
Furthermore, since $F(x) \preceq F(x^{(1)})$ and $F(x^{(1)}) \in Z$, we obtain $F(x) \in Z_\epsilon$. Similarly, if $F_2(x) + \epsilon_2 \geq F_2(x^{(1)})$, we set $b = (F_1(x), F_2(x^{(1)}))^\top$. Also in this case, by the same argument, we get $b \in Z$ and $b \preceq F(x) + \epsilon$. By combining this and $F(x) \preceq F(x^{(1)})$ (and $F(x^{(1)}) \in Z$), we obtain $F(x) \in Z_\epsilon$.

Next, we consider (2). From (2), there exist $x^{(1)}, \ldots, x^{(l)}$ such that
$$F(\Pi) = \{F(x) \mid x \in \Pi\} = \{F(x^{(i)}) \mid i = 1, \ldots, l\}, \quad F(x^{(i)}) \neq F(x^{(j)}), \ i \neq j.$$
Here, without loss of generality, we may assume the following:
$$F_1(x^{(1)}) < \cdots < F_1(x^{(l)}), \quad F_2(x^{(1)}) > \cdots > F_2(x^{(l)}).$$
Let $x$ be an element of $\mathcal{X}$. Assume that there exists $j$ such that
$$F_1(x) + \epsilon_1 \geq F_1(x^{(j)}), \quad F_2(x) + \epsilon_2 \geq F_2(x^{(j+1)}).$$
Note that $(F_1(x^{(j)}), F_2(x^{(j+1)}))^\top \in Z$. In addition, there exists $i \in \{1, \ldots, l\}$ such that $F(x) \preceq F(x^{(i)}) \in Z$. Therefore, $F(x) \in Z_\epsilon$.

Similarly, assume that at least one of the following inequalities holds for any $j$:
$$F_1(x) + \epsilon_1 < F_1(x^{(j)}), \quad F_2(x) + \epsilon_2 < F_2(x^{(j+1)}). \quad (29)$$
Here, if $F_1(x) + \epsilon_1 < F_1(x^{(1)})$, from the lemma's assumption it holds that $F_2(x) + \epsilon_2 \geq F_2(x^{(1)})$. Moreover, we define $c = (F_1(x), F_2(x^{(1)}))^\top \in Z$. Then, the following holds:
$$F(x) + \epsilon = (F_1(x) + \epsilon_1, F_2(x) + \epsilon_2)^\top \succeq (F_1(x), F_2(x^{(1)}))^\top = c \in Z.$$
Furthermore, from the definition of $x^{(1)}$, it holds that $F_2(x^{(1)}) \geq F_2(x)$. Thus, noting that $F_1(x) + \epsilon_1 < F_1(x^{(1)})$, we get $F_1(x) \leq F_1(x^{(1)})$. By combining these, we have $F(x) \preceq F(x^{(1)}) \in Z$. This implies that $F(x) \in Z_\epsilon$. On the other hand, if $F_1(x) + \epsilon_1 \geq F_1(x^{(1)})$, from (29) we get $F_2(x) + \epsilon_2 < F_2(x^{(2)})$. Therefore, from the lemma's assumption, we obtain $F_1(x) + \epsilon_1 \geq F_1(x^{(2)})$. By using (29) again, we have $F_2(x) + \epsilon_2 < F_2(x^{(3)})$. Hence, by repeating this procedure, we get $F_1(x) + \epsilon_1 \geq F_1(x^{(l)})$ and $F_2(x) + \epsilon_2 < F_2(x^{(l)})$. Finally, noting that
$$F(x) \preceq (F_1(x^{(l)}), F_2(x) + \epsilon_2)^\top \preceq (F_1(x^{(l)}), F_2(x^{(l)}))^\top = F(x^{(l)}) \in Z, \quad F(x) + \epsilon \succeq (F_1(x^{(l)}), F_2(x))^\top \in Z,$$
we get $F(x) \in Z_\epsilon$. $\blacksquare$
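The pointwise condition of Lemma A.6 is easy to check numerically on a finite design set. The following is a minimal sketch, assuming precomputed arrays F1 and F2 of the two objective values over $\mathcal{X}$; all names are illustrative and not part of the paper's implementation.

    import numpy as np

    def satisfies_lemma_a6(F1, F2, i, eps1, eps2):
        # Sufficient condition of Lemma A.6 for candidate index i:
        # for every x', F1(x)+eps1 >= F1(x') or F2(x)+eps2 >= F2(x').
        # If it holds, the objective vector of x lies in Z_eps.
        cond = (F1[i] + eps1 >= F1) | (F2[i] + eps2 >= F2)
        return bool(cond.all())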
By using these lemmas, we prove Theorem 4.2.
Proof.
First, we prove that the algorithm terminates after at most $t'$ iterations, where $t'$ is the positive integer satisfying
$$\max_{x \in M_{t'} \cup \hat{\Pi}_{t'}} \lambda_{t'}(x) = \lambda_{t'}(x_{t'}) \leq \min\{\epsilon_1, \epsilon_2\}.$$
From the definition of $\lambda_t$, noting that $u^{(F_1)}_t(x) - l^{(F_1)}_t(x) \leq \lambda_t(x)$ and $u^{(F_2)}_t(x) - l^{(F_2)}_t(x) \leq \lambda_t(x)$, we have
$$\max_{x \in M_{t'} \cup \hat{\Pi}_{t'}} \{u^{(F_1)}_{t'}(x) - l^{(F_1)}_{t'}(x)\} \leq \epsilon_1 \quad \text{and} \quad \max_{x \in M_{t'} \cup \hat{\Pi}_{t'}} \{u^{(F_2)}_{t'}(x) - l^{(F_2)}_{t'}(x)\} \leq \epsilon_2.$$
Then, for any $x' \in \hat{\Pi}_{t'}$, it holds that
$$u^{(F_1)}_{t'}(x') \leq l^{(F_1)}_{t'}(x') + \epsilon_1 \quad (30)$$
and
$$u^{(F_2)}_{t'}(x') \leq l^{(F_2)}_{t'}(x') + \epsilon_2. \quad (31)$$
Here, let $x$ be an element of $\hat{\Pi}_{t'}$. Then, from the definition of $\hat{\Pi}_t$, for any $x' \in \hat{\Pi}_{t'}$, at least one of the following inequalities holds:
$$l^{(F_1)}_{t'}(x') \leq l^{(F_1)}_{t'}(x), \quad l^{(F_2)}_{t'}(x') \leq l^{(F_2)}_{t'}(x).$$
Thus, from (30) and (31), for any $x' \in \hat{\Pi}_{t'}$, it holds that $F^{(\mathrm{pes})}_{t'}(x) + \epsilon \not\prec F^{(\mathrm{opt})}_{t'}(x')$. This implies that $U_{t'} = \emptyset$. Similarly, if $M_{t'} \neq \emptyset$, there exists $x \in M_{t'}$ such that $F^{(\mathrm{opt})}_{t'}(x) \not\preceq_{\epsilon} F^{(\mathrm{pes})}_{t'}(x')$ for any $x' \in \hat{\Pi}_{t'}$. On the other hand, from Lemma A.5, there exists $x'' \in \hat{\Pi}_{t'}$ such that $F^{(\mathrm{pes})}_{t'}(x) \preceq F^{(\mathrm{pes})}_{t'}(x'')$. Moreover, from (30) and (31), $x''$ satisfies $F^{(\mathrm{opt})}_{t'}(x) \preceq_{\epsilon} F^{(\mathrm{pes})}_{t'}(x'')$. However, this contradicts the definition of $M_t$. Hence, we get $M_{t'} = \emptyset$.

Hereafter, we assume that (13), (22) and (23) hold. From the definition of $\lambda_t$, we obtain
$$\lambda_t(x) \leq \{u^{(F_1)}_t(x) - l^{(F_1)}_t(x)\} + \{u^{(F_2)}_t(x) - l^{(F_2)}_t(x)\}.$$
This implies that
$$\sum_{t=1}^T \lambda_t(x_t) \leq \sum_{t=1}^T \{u^{(F_1)}_t(x_t) - l^{(F_1)}_t(x_t)\} + \sum_{t=1}^T \{u^{(F_2)}_t(x_t) - l^{(F_2)}_t(x_t)\}.$$
Therefore, from (15), (21), (22) and (23), we get
$$\sum_{t=1}^T \lambda_t(x_t) \leq \beta^{1/2}_T \Big\{\sum_{t=1}^T \sigma_{t-1}(x_t, w_t) + 4\ln\tfrac{18}{\delta}\Big\} + \sqrt{T\tilde{B}\beta^{1/2}_T \Big\{\sum_{t=1}^T \sigma_{t-1}(x_t, w_t) + 4\ln\tfrac{18}{\delta}\Big\}} + 40\beta_T \Big\{\sum_{t=1}^T \sigma^2_{t-1}(x_t, w_t) + 4\ln\tfrac{18}{\delta}\Big\}.$$
Hence, from (24) and (25), it holds that
$$\frac{1}{T}\sum_{t=1}^T \lambda_t(x_t) \leq T^{-1}\beta^{1/2}_T \big\{\sqrt{TC_1\gamma_T} + C_2\big\} + T^{-1}\sqrt{T\tilde{B}\beta^{1/2}_T \big\{\sqrt{TC_1\gamma_T} + 2C_2\big\}} + 5T^{-1}\beta_T\{C_1\gamma_T + 2C_2\}. \quad (32)$$
Here, let $T$ be a positive integer such that the right-hand side of (32) is less than or equal to $\min\{\epsilon_1, \epsilon_2\}$. Then, there exists a positive integer $t'$ such that $t' \leq T$ and $\lambda_{t'}(x_{t'}) \leq \min\{\epsilon_1, \epsilon_2\}$. Therefore, we have $M_{t'} = \emptyset$ and $U_{t'} = \emptyset$. This means that the algorithm terminates after at most $t'$ iterations.

Next, under (13) we show that $\hat{\Pi}_t$ is an $\epsilon$-accurate Pareto set when $M_t = \emptyset$ and $U_t = \emptyset$. First, we prove $F(\hat{\Pi}_t) \subset Z_\epsilon$. Let $x$ be an element of $\hat{\Pi}_t$.
For any $x' \in \hat{\Pi}_t \setminus \{x\}$, it holds that $F^{(\mathrm{pes})}_t(x) + \epsilon \not\prec F^{(\mathrm{opt})}_t(x')$ because $U_t = \emptyset$. Furthermore, noting that $M_t = \emptyset$, for any $x' \in \mathcal{X} \setminus \hat{\Pi}_t$, there exists $x'' \in \hat{\Pi}_t$ such that $F^{(\mathrm{opt})}_t(x') \preceq_{\epsilon} F^{(\mathrm{pes})}_t(x'')$. In addition, since $x \in \hat{\Pi}_t$, from the definition of $\hat{\Pi}_t$, at least one of the following inequalities holds:
$$l^{(F_1)}_t(x'') \leq l^{(F_1)}_t(x), \quad l^{(F_2)}_t(x'') \leq l^{(F_2)}_t(x).$$
By combining this and $F^{(\mathrm{opt})}_t(x') \preceq_{\epsilon} F^{(\mathrm{pes})}_t(x'')$, we get $F^{(\mathrm{pes})}_t(x) + \epsilon \not\prec F^{(\mathrm{opt})}_t(x')$. Therefore, under (13), at least one of the following inequalities holds for any $x' \in \mathcal{X} \setminus \{x\}$:
$$F_1(x) + \epsilon_1 \geq F_1(x'), \quad F_2(x) + \epsilon_2 \geq F_2(x').$$
For $x' = x$, we trivially have $F_1(x) + \epsilon_1 \geq F_1(x)$. Hence, from Lemma A.6, we get $F(\hat{\Pi}_t) \subset Z_\epsilon$.

Finally, we show that for any $x' \in \Pi$, there exists $x \in \hat{\Pi}_t$ such that $x' \preceq_{\epsilon} x$. When $x' \in \hat{\Pi}_t$, the existence of $x$ is obvious because $x' \preceq_{\epsilon} x'$. On the other hand, when $x' \in \mathcal{X} \setminus \hat{\Pi}_t$, since $M_t = \emptyset$ there exists $x \in \hat{\Pi}_t$ such that $F^{(\mathrm{opt})}_t(x') \preceq_{\epsilon} F^{(\mathrm{pes})}_t(x)$. Thus, under (13), this implies that $x' \preceq_{\epsilon} x$. Hence, for any $x' \in \Pi$, there exists $x \in \hat{\Pi}_t$ such that $x' \preceq_{\epsilon} x$. From this and $F(\hat{\Pi}_t) \subset Z_\epsilon$, we have that $\hat{\Pi}_t$ is an $\epsilon$-accurate Pareto set. Here, note that (13), (22) and (23) hold with probability at least $1 - \delta$. Therefore, we get the desired result. $\blacksquare$

B Extension to Constraint Optimization Problem
In real applications, there are situations where a known tolerance level for the value of the function $F_2$ is given. For example, in the parameter tuning of an engineering system, this corresponds to the case where the variance of the performance must be below a certain level. In such a situation, it is necessary to treat the functions $F_1$ and $F_2$ through the following constrained optimization problem:
$$x^* = \arg\max_{x \in \mathcal{X}} F_1(x) \quad \text{s.t.} \quad F_2(x) \geq h,$$
where $h < 0$ is a known threshold. Given a positive vector $\epsilon = (\epsilon_1, \epsilon_2)$, we define an $\epsilon$-accurate solution as a solution $\hat{x}$ satisfying
$$F_1(\hat{x}) \geq F_1(x^*) - \epsilon_1, \quad F_2(\hat{x}) \geq h - \epsilon_2.$$

Proposed Algorithm
First, we define $M^{(\mathrm{cons})}_t$, $S_t$ and $M^{(\mathrm{obj})}_t$ as
$$M^{(\mathrm{cons})}_t = \{x \in \mathcal{X} \mid u^{(F_2)}_t(x) \geq h - \epsilon_2\}, \quad S_t = \{x \in \mathcal{X} \mid l^{(F_2)}_t(x) \geq h - \epsilon_2\},$$
$$M^{(\mathrm{obj})}_t = \Big\{x \in \mathcal{X} \mid u^{(F_1)}_t(x) \geq \max_{x' \in S_t} l^{(F_1)}_t(x') - \epsilon_1\Big\}.$$
Here, we define $M^{(\mathrm{obj})}_t = \mathcal{X}$ if $S_t = \emptyset$. Note that an element in the complement of $M^{(\mathrm{cons})}_t$ or $M^{(\mathrm{obj})}_t$ is not an $\epsilon$-accurate solution with high probability. In addition, $S_t$ is a set of points that are certified to be feasible with high probability. Based on these definitions, we define the latent optimal solution set $M_t$ at the $t$-th step as $M_t = M^{(\mathrm{cons})}_t \cap M^{(\mathrm{obj})}_t$.

In our proposed algorithm, we select the most uncertain point in the latent optimal solution set $M_t$. In other words, the observation point $x_t$ at the $t$-th step is selected by using $\lambda_t$, defined by Equation (8), as follows:
$$x_t = \arg\max_{x \in M_t} \lambda_t(x). \quad (33)$$
Furthermore, if $S_t \neq \emptyset$ at the $t$-th step, we define the estimated optimal solution $\hat{x}_t$ by $\hat{x}_t = \arg\max_{x \in S_t} l^{(F_1)}_t(x)$. In order to ensure that $\hat{x}_t$ is an $\epsilon$-accurate solution, the uncertainties of the function values $F_1$ and $F_2$ at the latent optimal solutions should be sufficiently small. Thus, the proposed method terminates at the first step $t$ satisfying
$$\max_{x \in M_t} \lambda_t(x) \leq \min\{\epsilon_1, \epsilon_2\}.$$
The pseudocode of the proposed method is shown as Algorithm 3; a sketch of one acquisition step is given below.
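The following is a minimal sketch of a single acquisition step, assuming a finite candidate set with precomputed confidence bounds l1, u1 on $F_1$ and l2, u2 on $F_2$; here $\lambda_t$ is approximated by the sum of the two interval widths, which may differ from the paper's Equation (8). All names are illustrative.

    import numpy as np

    def constrained_acquisition(l1, u1, l2, u2, h, eps1, eps2):
        m_cons = u2 >= h - eps2                # possibly feasible points M^(cons)_t
        s = l2 >= h - eps2                     # certified feasible points S_t
        if s.any():
            m_obj = u1 >= l1[s].max() - eps1   # possibly near-optimal points M^(obj)_t
        else:
            m_obj = np.ones_like(m_cons)       # M^(obj)_t = X when S_t is empty
        m = m_cons & m_obj                     # latent optimal solution set M_t
        lam = (u1 - l1) + (u2 - l2)            # surrogate for lambda_t
        next_idx = np.flatnonzero(m)[np.argmax(lam[m])]
        stop = lam[m].max() <= min(eps1, eps2)
        best_idx = np.flatnonzero(s)[np.argmax(l1[s])] if s.any() else None
        return next_idx, stop, best_idx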
Theoretical Analysis
For Algorithm 3, the following theorem holds:
Theorem B.1.
Let $k$ be a positive-definite kernel, and let $f \in \mathcal{H}_k$ with $\|f\|_{\mathcal{H}_k} \leq B$. Also let $\delta \in (0, 1)$ and $\epsilon_1 > 0$, $\epsilon_2 > 0$, and define $\beta_t = \big(\sqrt{\ln\det(I_t + \sigma^{-2}K_t) + 2\ln\frac{1}{\delta}} + B\big)^2$. Then, with probability at least $1 - \delta$, the following statements 1 and 2 hold:

1. Algorithm 3 terminates after at most $T$ iterations, where $T$ is the smallest positive integer satisfying
$$T^{-1}\beta^{1/2}_T\big\{\sqrt{TC_1\gamma_T} + C_2\big\} + T^{-1}\sqrt{T\tilde{B}\beta^{1/2}_T\big\{\sqrt{TC_1\gamma_T} + 2C_2\big\}} + 5T^{-1}\beta_T\{C_1\gamma_T + 2C_2\} \leq \min\{\epsilon_1, \epsilon_2\}.$$
Here, $\tilde{B} = \max_{(x,w) \in \mathcal{X}\times\Omega} |f(x, w) - \mathbb{E}_w[f(x, w)]|$, $C_1 = 2/\ln(1+\sigma^{-2})$ and $C_2 = 16\ln\frac{18}{\delta}$.

2. If $x^*$ exists, then $S_{t'} \neq \emptyset$ at the termination step $t' \leq T$. Moreover, $\hat{x}_{t'} = \arg\max_{x \in S_{t'}} l^{(F_1)}_{t'}(x)$ is an $\epsilon$-accurate solution.

Algorithm 3 Proposed Algorithm for Constrained Optimization
Input: GP prior $\mathcal{GP}(0, k)$, $\{\beta_t\}_{t \in \mathbb{N}}$, threshold $h$, non-negative vector $\epsilon = (\epsilon_1, \epsilon_2)$.
  $M_1 \leftarrow \mathcal{X}$, $S_1 \leftarrow \emptyset$, $t \leftarrow 1$. Compute $\lambda_t(x)$ for any $x \in M_t$.
  while $\max_{x \in M_t} \lambda_t(x) > \min\{\epsilon_1, \epsilon_2\}$ do
    Choose $x_t = \arg\max_{x \in M_t} \lambda_t(x)$.
    Sample $w_t \sim p(w)$.
    Observe $y_t \leftarrow f(x_t, w_t) + \eta_t$.
    Update the GP by adding $((x_t, w_t), y_t)$.
    $t \leftarrow t + 1$.
    Compute $S_t$, $M_t$.
    Compute $\lambda_t(x)$ for any $x \in M_t$.
  end while
  if $S_t \neq \emptyset$ then
    Output $\hat{x}_t = \arg\max_{x \in S_t} l^{(F_1)}_t(x)$.
  end if

Proof. Assume that (13), (22) and (23) hold. Then, by using the same argument as in the proof of Theorem 4.2, we get
$$\frac{1}{T}\sum_{t=1}^T \lambda_t(x_t) \leq T^{-1}\beta^{1/2}_T\big\{\sqrt{TC_1\gamma_T} + C_2\big\} + T^{-1}\sqrt{T\tilde{B}\beta^{1/2}_T\big\{\sqrt{TC_1\gamma_T} + 2C_2\big\}} + 5T^{-1}\beta_T\{C_1\gamma_T + 2C_2\}. \quad (34)$$
Here, from the definition of $T$, the right-hand side of (34) is less than or equal to $\min\{\epsilon_1, \epsilon_2\}$. Hence, there exists a positive integer $t' \leq T$ such that $\max_{x \in M_{t'}} \lambda_{t'}(x) = \lambda_{t'}(x_{t'}) \leq \min\{\epsilon_1, \epsilon_2\}$. This implies that the algorithm terminates after at most $T$ iterations.

Next, we prove claim 2 of the theorem. Assume that $x^*$ exists. We consider the two cases $x^* \in M^{(\mathrm{obj})}_{t'}$ and $x^* \notin M^{(\mathrm{obj})}_{t'}$. For the case $x^* \in M^{(\mathrm{obj})}_{t'}$, since (13) holds, the following inequality holds:
$$h - \epsilon_2 \leq h \leq F_2(x^*) \leq u^{(F_2)}_{t'}(x^*).$$
This means that $x^* \in M^{(\mathrm{cons})}_{t'}$. Therefore, we have $x^* \in M_{t'}$. Furthermore, noting that $u^{(F_1)}_t(x) - l^{(F_1)}_t(x) \leq \lambda_t(x)$ and $u^{(F_2)}_t(x) - l^{(F_2)}_t(x) \leq \lambda_t(x)$, it holds that
$$\max_{x \in M_{t'}} \{u^{(F_1)}_{t'}(x) - l^{(F_1)}_{t'}(x)\} \leq \epsilon_1, \quad (35)$$
$$\max_{x \in M_{t'}} \{u^{(F_2)}_{t'}(x) - l^{(F_2)}_{t'}(x)\} \leq \epsilon_2. \quad (36)$$
Here, if $l^{(F_2)}_{t'}(x^*) < h - \epsilon_2$, then from (36) we get $u^{(F_2)}_{t'}(x^*) < h$. Thus, from (13), we obtain $F_2(x^*) < h$. However, this contradicts the definition of $x^*$, implying that $l^{(F_2)}_{t'}(x^*) \geq h - \epsilon_2$ and $x^* \in S_{t'} \neq \emptyset$. Moreover, from (35) the following holds:
$$\max_{x \in M_{t'}} \{u^{(F_1)}_{t'}(x) - l^{(F_1)}_{t'}(x)\} \leq \epsilon_1 \;\Rightarrow\; u^{(F_1)}_{t'}(x^*) - l^{(F_1)}_{t'}(x^*) \leq \epsilon_1 \;\Rightarrow\; u^{(F_1)}_{t'}(x^*) - \max_{x \in S_{t'}} l^{(F_1)}_{t'}(x) \leq \epsilon_1$$
$$\Rightarrow\; u^{(F_1)}_{t'}(x^*) - l^{(F_1)}_{t'}(\hat{x}_{t'}) \leq \epsilon_1 \;\Rightarrow\; l^{(F_1)}_{t'}(\hat{x}_{t'}) \geq u^{(F_1)}_{t'}(x^*) - \epsilon_1.$$
In addition, from the definition of $S_{t'}$, we have $l^{(F_2)}_{t'}(\hat{x}_{t'}) \geq h - \epsilon_2$.
On the other hand, if $x^* \notin M^{(\mathrm{obj})}_{t'}$, then $M^{(\mathrm{obj})}_{t'} \neq \mathcal{X}$. Thus, from the definition of $M^{(\mathrm{obj})}_{t'}$, it holds that $S_{t'} \neq \emptyset$. Therefore, since $\hat{x}_{t'} = \arg\max_{x \in S_{t'}} l^{(F_1)}_{t'}(x) \in S_{t'}$, we get $l^{(F_2)}_{t'}(\hat{x}_{t'}) \geq h - \epsilon_2$. Furthermore, since $x^* \notin M^{(\mathrm{obj})}_{t'}$, it holds that
$$u^{(F_1)}_{t'}(x^*) - \epsilon_1 \leq u^{(F_1)}_{t'}(x^*) < l^{(F_1)}_{t'}(\hat{x}_{t'}) - \epsilon_1 \leq l^{(F_1)}_{t'}(\hat{x}_{t'}).$$
Therefore, if $x^*$ exists, then we have $S_{t'} \neq \emptyset$ and
$$l^{(F_1)}_{t'}(\hat{x}_{t'}) \geq u^{(F_1)}_{t'}(x^*) - \epsilon_1, \quad (37)$$
$$l^{(F_2)}_{t'}(\hat{x}_{t'}) \geq h - \epsilon_2. \quad (38)$$
Note that (37) and (38) imply that $\hat{x}_{t'}$ is an $\epsilon$-accurate solution when (13) holds. Finally, since (13), (22) and (23) hold with probability at least $1 - \delta$, we have Theorem B.1. $\blacksquare$

C Details of Section 3.3
C.1 Noisy Input Setting
In this subsection, we consider the setting where the input $x$ contains a noise $\xi \in \Delta$. Let $\mathcal{X} \subset \mathbb{R}^d$ be an input space for optimization, and assume that $\mathcal{X}$ is a finite set. Furthermore, let $\Delta \subset \mathbb{R}^d$ be a compact and convex set, and let $\xi$ be a random noise taking values in $\Delta$. Moreover, let $f$ be a black-box function on $D := \{x + \xi \mid x \in \mathcal{X}, \xi \in \Delta\}$, and let $k: D \times D \to \mathbb{R}$ be a positive-definite kernel with $f \in \mathcal{H}_k$ and $\|f\|_{\mathcal{H}_k} \leq B$.

At each step $t$, we select an observation point $x_t \in \mathcal{X}$, and the observed value is obtained as $y_t = f(x_t + \xi_t) + \eta_t$. Here, $\eta_t$ is independent Gaussian noise with $\eta_t \sim \mathcal{N}(0, \sigma^2)$, and $\xi_t$ is the realized value of $\xi$.

In this setting, the expected value and variance of $f(x + \xi)$ with respect to $\xi$ are given by
$$\mathbb{E}_\xi[f(x+\xi)] = \int_\Delta f(x+\xi)p(\xi)\mathrm{d}\xi, \quad (39)$$
$$\mathbb{V}_\xi[f(x+\xi)] = \int_\Delta \{f(x+\xi) - \mathbb{E}_\xi[f(x+\xi)]\}^2 p(\xi)\mathrm{d}\xi, \quad (40)$$
where $p(\xi)$ is a known probability density function of $\xi$. As in (5), using (39) and (40) we define the optimization objective functions $F_1$ and $F_2$. In addition, let $\mu_t(x)$, $\sigma^2_t(x)$ and $Q_t(x) := [l_t(x), u_t(x)]$ denote the posterior mean, posterior variance and confidence bound of $f(x)$ at step $t$, respectively.
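When the integrals over $\Delta$ are intractable, (39) and (40) can be approximated by simple Monte Carlo. The following is a minimal sketch, assuming a callable black-box f and a sampler for $p(\xi)$, both placeholders for whatever the application provides:

    import numpy as np

    def mean_variance_under_input_noise(f, x, sample_xi, n_samples=10000, seed=0):
        # Draw noise realizations xi ~ p(xi); shape (n_samples, d).
        rng = np.random.default_rng(seed)
        xi = sample_xi(rng, n_samples)
        vals = np.array([f(x + xi_i) for xi_i in xi])
        mean = vals.mean()                 # Monte Carlo estimate of (39)
        var = ((vals - mean) ** 2).mean()  # Monte Carlo estimate of (40)
        return mean, var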
Confidence Bound  Confidence bounds of the objective functions $F_1$ and $F_2$ defined by using (39) and (40) can also be constructed by the same procedure as in Section 3.1. First, assume that $f(\tilde{x}) \in Q_t(\tilde{x})$ for any $\tilde{x} \in D$. Then, the following holds for any $x \in \mathcal{X}$:
$$\int_\Delta l_t(x+\xi)p(\xi)\mathrm{d}\xi \leq \int_\Delta f(x+\xi)p(\xi)\mathrm{d}\xi \leq \int_\Delta u_t(x+\xi)p(\xi)\mathrm{d}\xi.$$
Therefore, the confidence bound $Q^{(F_1)}_t(x)$ of $F_1(x)$ can be constructed as $Q^{(F_1)}_t(x) := [l^{(F_1)}_t(x), u^{(F_1)}_t(x)]$ using
$$l^{(F_1)}_t(x) = \int_\Delta l_t(x+\xi)p(\xi)\mathrm{d}\xi, \quad u^{(F_1)}_t(x) = \int_\Delta u_t(x+\xi)p(\xi)\mathrm{d}\xi.$$
Similarly, the confidence bound $Q^{(F_2)}_t(x)$ of $F_2(x)$ can be expressed as $Q^{(F_2)}_t(x) := [l^{(F_2)}_t(x), u^{(F_2)}_t(x)]$ using
$$l^{(F_2)}_t(x) = -\sqrt{\int_\Delta \tilde{u}^{(\mathrm{sq})}_t(x+\xi)p(\xi)\mathrm{d}\xi}, \quad u^{(F_2)}_t(x) = -\sqrt{\int_\Delta \tilde{l}^{(\mathrm{sq})}_t(x+\xi)p(\xi)\mathrm{d}\xi},$$
where $\tilde{l}^{(\mathrm{sq})}_t(x+\xi)$ and $\tilde{u}^{(\mathrm{sq})}_t(x+\xi)$ are given by
$$\tilde{l}_t(x+\xi) = l_t(x+\xi) - \mathbb{E}_\xi[u_t(x+\xi)], \quad \tilde{u}_t(x+\xi) = u_t(x+\xi) - \mathbb{E}_\xi[l_t(x+\xi)],$$
$$\tilde{l}^{(\mathrm{sq})}_t(x+\xi) = \begin{cases} 0 & \text{if } \tilde{l}_t(x+\xi) \leq 0 \leq \tilde{u}_t(x+\xi), \\ \min\{\tilde{l}^2_t(x+\xi), \tilde{u}^2_t(x+\xi)\} & \text{otherwise}, \end{cases} \quad \tilde{u}^{(\mathrm{sq})}_t(x+\xi) = \max\{\tilde{l}^2_t(x+\xi), \tilde{u}^2_t(x+\xi)\}.$$
Using $Q^{(F_1)}_t$ and $Q^{(F_2)}_t$ above, we can construct the proposed algorithm by the same procedure.

C.2 Simulator Based Experiment

In this subsection, we consider the setting in which $w_t$ can be selected by the algorithm at each step, and we show theoretical guarantees in this setting. Hereafter, we only discuss the multi-task scenario, but the same argument holds for the multi-objective and constrained optimization scenarios by selecting $w_t$ and $\xi_t$ by the same procedure.

In our proposed algorithm, $(x_t, w_t)$ at step $t$ is selected by
$$x_t = \arg\max_{x \in \mathcal{X}} u^{(G)}_t(x), \quad w_t = \arg\max_{w \in \Omega} \sigma_{t-1}(x_t, w).$$
A sketch of this selection rule is given below.
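A minimal sketch of this selection over finite grids, assuming precomputed callables ucb_G (the upper bound $u^{(G)}_t$) and post_sd (the posterior standard deviation $\sigma_{t-1}$); all names are ours:

    import numpy as np

    def select_simulator_query(X_cand, W_cand, ucb_G, post_sd):
        # Pick the design point with the largest upper bound on G,
        # then the environmental value with the largest posterior sd there.
        x_t = X_cand[int(np.argmax([ucb_G(x) for x in X_cand]))]
        w_t = W_cand[int(np.argmax([post_sd(x_t, w) for w in W_cand]))]
        return x_t, w_t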
In this algorithm, the following theorem holds:

Theorem C.1.
Let $k$ be a positive-definite kernel, and let $f \in \mathcal{H}_k$ with $\|f\|_{\mathcal{H}_k} \leq B$. Also let $\delta \in (0, 1)$, $\epsilon > 0$, and define $\beta_t = \big(\sqrt{\ln\det(I_t + \sigma^{-2}K_t) + 2\ln\frac{1}{\delta}} + B\big)^2$. Moreover, for any $t$, define $\hat{x}_t = \arg\max_{x_{t'} \in \{x_1, \ldots, x_t\}} l^{(G)}_{t'}(x_{t'})$. Then, when the proposed algorithm in the simulator-based setting is performed, $\hat{x}_T$ is an $\epsilon$-accurate solution with probability at least $1 - \delta$, where $T$ is the smallest positive integer satisfying
$$\alpha T^{-1}\beta^{1/2}_T\sqrt{TC_1\gamma_T} + (1-\alpha)T^{-1}\sqrt{T\tilde{B}\beta^{1/2}_T\sqrt{TC_1\gamma_T}} + 5T^{-1}\beta_T C_1\gamma_T \leq \epsilon.$$
Here, $\tilde{B}$ and $C_1$ are given by $\tilde{B} = \max_{(x,w) \in \mathcal{X}\times\Omega} |f(x, w) - \mathbb{E}_w[f(x, w)]|$ and $C_1 = 2/\ln(1+\sigma^{-2})$.

Proof. Assume that (13) holds. Then, from Lemma A.1 we have
$$\sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\} \leq \alpha\beta^{1/2}_T \sum_{t=1}^T \int_\Omega \sigma_{t-1}(x_t, w)p(w)\mathrm{d}w + (1-\alpha)\sqrt{T\tilde{B}\beta^{1/2}_T \sum_{t=1}^T \int_\Omega \sigma_{t-1}(x_t, w)p(w)\mathrm{d}w} + 20\beta_T \sum_{t=1}^T \int_\Omega \sigma^2_{t-1}(x_t, w)p(w)\mathrm{d}w.$$
In addition, from the definition of $w_t$, it holds that
$$\sum_{t=1}^T \int_\Omega \sigma_{t-1}(x_t, w)p(w)\mathrm{d}w \leq \sum_{t=1}^T \sigma_{t-1}(x_t, w_t), \quad \sum_{t=1}^T \int_\Omega \sigma^2_{t-1}(x_t, w)p(w)\mathrm{d}w \leq \sum_{t=1}^T \sigma^2_{t-1}(x_t, w_t).$$
Hence, we get
$$\sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\} \leq \alpha\beta^{1/2}_T \sum_{t=1}^T \sigma_{t-1}(x_t, w_t) + (1-\alpha)\sqrt{T\tilde{B}\beta^{1/2}_T \sum_{t=1}^T \sigma_{t-1}(x_t, w_t)} + 20\beta_T \sum_{t=1}^T \sigma^2_{t-1}(x_t, w_t).$$
Furthermore, from (24) and (25), we obtain
$$\sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\} \leq \alpha\beta^{1/2}_T\sqrt{C_1 T\gamma_T} + (1-\alpha)\sqrt{T\tilde{B}\beta^{1/2}_T\sqrt{C_1 T\gamma_T}} + 5\beta_T C_1\gamma_T.$$
Finally, by using the same argument as in the proof of Theorem 4.1, the following inequality holds:
$$G(x^*) - G(\hat{x}_T) \leq \sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\}/T.$$
Therefore, noting the definition of $T$, we get the desired result. $\blacksquare$

Noisy Input Extension  Here, we extend the setting defined in Subsection 3.3.2 to the simulator-based setting. Since the noise $\xi \in \Delta$ plays the role of $w$, we consider the observation point $x_t$ at step $t$ as $x_t := \tilde{x}_t + \xi_t$, where $(\tilde{x}_t, \xi_t)$ is given by
$$\tilde{x}_t = \arg\max_{x \in \mathcal{X}} u^{(G)}_t(x), \quad \xi_t = \arg\max_{\xi \in \Delta} \sigma_{t-1}(\tilde{x}_t + \xi).$$
Then, a theorem analogous to Theorem C.1 holds. However, the practical performance of this algorithm is not much different from that of Uncertainty Sampling, which was used as the baseline method in the numerical experiments. For this reason, in the simulator-based noisy-input setting, we propose selecting $(\tilde{x}_t, \xi_t)$ as follows:
$$\tilde{x}_t = \arg\max_{x \in \mathcal{X}} u^{(G)}_t(x), \quad \xi_t = \arg\max_{\xi \in \Delta} \sigma_{t-1}(\tilde{x}_t + \xi)p(\xi).$$
In order to derive convergence results similar to Theorem C.1, we assume that the probability density function $p(\xi)$ of $\xi$ is a bounded function on $\Delta$, i.e., $\sup_{\xi \in \Delta} p(\xi) < \infty$.

Theorem C.2.
Let $\delta \in (0, 1)$, $\epsilon > 0$, and set $\beta_t = \big(\sqrt{\ln\det(I_t + \sigma^{-2}K_t) + 2\ln\frac{1}{\delta}} + B\big)^2$. For any $t$, define $\hat{x}_t = \arg\max_{x_{t'} \in \{x_1, \ldots, x_t\}} l^{(G)}_{t'}(x_{t'})$. Moreover, assume that $\sup_{\xi \in \Delta} p(\xi) \leq R < \infty$. Then, when the proposed algorithm in the simulator-based noisy-input setting is performed, $\hat{x}_T$ is an $\epsilon$-accurate solution with probability at least $1 - \delta$, where $T$ is the smallest positive integer satisfying
$$\alpha T^{-1}\beta^{1/2}_T R\sqrt{TC_1\gamma_T} + (1-\alpha)T^{-1}\sqrt{T\tilde{B}R\beta^{1/2}_T\sqrt{TC_1\gamma_T}} + 5T^{-1}R\beta_T C_1\gamma_T \leq \epsilon.$$
Here, $\tilde{B}$ and $C_1$ are given by $\tilde{B} = \max_{(x,\xi) \in \mathcal{X}\times\Delta} |f(x+\xi) - \mathbb{E}_\xi[f(x+\xi)]|$ and $C_1 = 2/\ln(1+\sigma^{-2})$.

Proof. As in Lemma A.1, with probability at least $1 - \delta$, it holds that
$$\sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\} \leq \alpha\beta^{1/2}_T \sum_{t=1}^T \int_\Delta \sigma_{t-1}(x_t + \xi)p(\xi)\mathrm{d}\xi + (1-\alpha)\sqrt{T\tilde{B}\beta^{1/2}_T \sum_{t=1}^T \int_\Delta \sigma_{t-1}(x_t + \xi)p(\xi)\mathrm{d}\xi} + 20\beta_T \sum_{t=1}^T \int_\Delta \sigma^2_{t-1}(x_t + \xi)p(\xi)\mathrm{d}\xi.$$
Moreover, from the definition of $\xi_t$, we have
$$\sum_{t=1}^T \int_\Delta \sigma_{t-1}(x_t + \xi)p(\xi)\mathrm{d}\xi \leq \sum_{t=1}^T \sigma_{t-1}(x_t + \xi_t)p(\xi_t) \leq R\sum_{t=1}^T \sigma_{t-1}(x_t + \xi_t),$$
$$\sum_{t=1}^T \int_\Delta \sigma^2_{t-1}(x_t + \xi)p(\xi)\mathrm{d}\xi \leq \sum_{t=1}^T \sigma^2_{t-1}(x_t + \xi_t)p(\xi_t) \leq R\sum_{t=1}^T \sigma^2_{t-1}(x_t + \xi_t).$$
Thus, we get
$$\sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\} \leq \alpha\beta^{1/2}_T R\sum_{t=1}^T \sigma_{t-1}(x_t + \xi_t) + (1-\alpha)\sqrt{T\tilde{B}\beta^{1/2}_T R\sum_{t=1}^T \sigma_{t-1}(x_t + \xi_t)} + 20\beta_T R\sum_{t=1}^T \sigma^2_{t-1}(x_t + \xi_t),$$
and
$$\sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\} \leq \alpha\beta^{1/2}_T R\sqrt{C_1 T\gamma_T} + (1-\alpha)\sqrt{T\tilde{B}R\beta^{1/2}_T\sqrt{C_1 T\gamma_T}} + 5\beta_T RC_1\gamma_T.$$
By using the same argument as in the proof of Theorem 4.1, we obtain the following inequality:
$$G(x^*) - G(\hat{x}_T) \leq \sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\}/T.$$
Therefore, we get the desired result. $\blacksquare$
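A sketch of the density-weighted selection rule above, which differs from the unweighted simulator-based rule only by the $p(\xi)$ factor; Xi_cand is a finite grid over $\Delta$, and ucb_G, post_sd, p_xi are assumed callables (names are ours):

    import numpy as np

    def select_noisy_input_query(X_cand, Xi_cand, ucb_G, post_sd, p_xi):
        x_tilde = X_cand[int(np.argmax([ucb_G(x) for x in X_cand]))]
        # Weight the posterior sd by the noise density, as proposed above.
        scores = [post_sd(x_tilde + xi) * p_xi(xi) for xi in Xi_cand]
        xi_t = Xi_cand[int(np.argmax(scores))]
        return x_tilde, xi_t   # the actual query point is x_tilde + xi_t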
D Extension to Continuous Set

In this section, we consider the setting where $\mathcal{X}$ is a continuous set. First, in MT-MVA-BO, $x_t = \arg\max_{x \in \mathcal{X}} u^{(G)}_t(x)$ can be calculated by using a continuous optimization solver. However, in MO-MVA-BO, it is difficult to calculate the estimated Pareto set $\hat{\Pi}_t$ and the set of latent optimal solutions $M_t$. In this paper, based on [5], we extend the proposed algorithm by using a discretization $\tilde{\mathcal{X}}$ of $\mathcal{X}$.

Hereafter, let $\mathcal{X} = [0, 1]^d$. Furthermore, assume that $f$ is an $L$-Lipschitz continuous function, i.e., there exists $L > 0$ such that
$$|f(x, w) - f(x', w)| \leq L\|x - x'\|_1$$
for any $x, x' \in \mathcal{X}$. Note that Lipschitz continuity holds if standard kernels are used [24, 31].

From the Lipschitz continuity of $f$, the following lemmas about $F_1$ and $F_2$ hold:

Lemma D.1.
Let $f$ be an $L$-Lipschitz continuous function. Then, it holds that
$$|F_1(x) - F_1(x')| \leq L\|x - x'\|_1, \quad \forall x, x' \in \mathcal{X},$$
where $F_1$ is given by (5).

Proof. From the definition of $F_1$ and the Lipschitz continuity of $f$, the following inequality holds:
$$|F_1(x) - F_1(x')| = \Big|\int_\Omega \{f(x, w) - f(x', w)\}p(w)\mathrm{d}w\Big| \leq \int_\Omega |f(x, w) - f(x', w)|p(w)\mathrm{d}w \leq L\|x - x'\|_1. \quad \blacksquare$$

Lemma D.2.
Let $f$ be an $L$-Lipschitz continuous function, let $\tilde{B} = \max_{(x,w) \in \mathcal{X}\times\Omega} |f(x, w) - \mathbb{E}_w[f(x, w)]|$, and define $F_2$ as in (5). Then, the following inequality holds for any $x, x' \in \mathcal{X}$:
$$|F_2(x) - F_2(x')| \leq \sqrt{4\tilde{B}L\|x - x'\|_1}.$$

Proof.
From the Lipschitz continuity of $f$, for any $x, x' \in \mathcal{X}$ and $w \in \Omega$, it holds that
$$\big|\{f(x, w) - \mathbb{E}_w[f(x, w)]\}^2 - \{f(x', w) - \mathbb{E}_w[f(x', w)]\}^2\big|$$
$$= \big|\{f(x, w) - \mathbb{E}_w[f(x, w)]\} - \{f(x', w) - \mathbb{E}_w[f(x', w)]\}\big| \times \big|\{f(x, w) - \mathbb{E}_w[f(x, w)]\} + \{f(x', w) - \mathbb{E}_w[f(x', w)]\}\big|$$
$$\leq \big(|f(x, w) - f(x', w)| + |\mathbb{E}_w[f(x, w)] - \mathbb{E}_w[f(x', w)]|\big) \times \big(|f(x, w) - \mathbb{E}_w[f(x, w)]| + |f(x', w) - \mathbb{E}_w[f(x', w)]|\big)$$
$$\leq 2L\|x - x'\|_1 \times 2\tilde{B} = 4\tilde{B}L\|x - x'\|_1.$$
Here, if $F_2(x) \geq F_2(x')$, then
$$|F_2(x) - F_2(x')| = F_2(x) - F_2(x') = \sqrt{\int_\Omega \{f(x', w) - \mathbb{E}_w[f(x', w)]\}^2 p(w)\mathrm{d}w} - \sqrt{\int_\Omega \{f(x, w) - \mathbb{E}_w[f(x, w)]\}^2 p(w)\mathrm{d}w}$$
$$\leq \sqrt{\int_\Omega \{f(x', w) - \mathbb{E}_w[f(x', w)]\}^2 p(w)\mathrm{d}w - \int_\Omega \{f(x, w) - \mathbb{E}_w[f(x, w)]\}^2 p(w)\mathrm{d}w}$$
$$\leq \sqrt{\int_\Omega \big|\{f(x', w) - \mathbb{E}_w[f(x', w)]\}^2 - \{f(x, w) - \mathbb{E}_w[f(x, w)]\}^2\big| p(w)\mathrm{d}w} \leq \sqrt{4\tilde{B}L\|x - x'\|_1},$$
where the first inequality uses $\sqrt{a} - \sqrt{b} \leq \sqrt{a - b}$ for $a \geq b \geq 0$. On the other hand, if $F_2(x) < F_2(x')$, the same argument with $x$ and $x'$ exchanged gives $|F_2(x) - F_2(x')| \leq \sqrt{4\tilde{B}L\|x - x'\|_1}$. Therefore, the desired inequality holds for any $x, x' \in \mathcal{X}$. $\blacksquare$
Let $Z$ be the Pareto front for $\mathcal{X}$, and let $\epsilon = (\epsilon_1, \epsilon_2)^\top$ be a positive vector. Define
$$Z_+ = \bigcup_{(y_1, y_2) \in Z} (-\infty, y_1] \times (-\infty, y_2], \quad Z_-(\epsilon) = \bigcup_{(y_1, y_2) \in Z} (-\infty, y_1 - \epsilon_1) \times (-\infty, y_2 - \epsilon_2),$$
$$Z_*(\epsilon) = \{(y_1 - \epsilon'_1, y_2 - \epsilon'_2) \mid (y_1, y_2) \in Z, \ 0 \leq \epsilon'_1 \leq \epsilon_1, \ 0 \leq \epsilon'_2 \leq \epsilon_2\}.$$
Then, it holds that
$$Z_+ = Z_-(\epsilon) \cup Z_*(\epsilon), \quad Z_-(\epsilon) \cap Z_*(\epsilon) = \emptyset.$$

Proof.
First, we show $Z_-(\epsilon) \cap Z_*(\epsilon) = \emptyset$. Let $y$ be an element of $Z_-(\epsilon)$. Then, there exists $(y'_1, y'_2) \in Z$ such that $y_1 < y'_1 - \epsilon_1$ and $y_2 < y'_2 - \epsilon_2$. Here, for any $(y''_1, y''_2) \in Z$, $y''_2$ satisfies $y'_2 \leq y''_2$ or $y'_2 > y''_2$. If $y'_2 \leq y''_2$, from $y_2 < y'_2 - \epsilon_2$ we get $y \notin \{(y''_1 - \epsilon'_1, y''_2 - \epsilon'_2) \mid 0 \leq \epsilon'_1 \leq \epsilon_1, \ 0 \leq \epsilon'_2 \leq \epsilon_2\}$. On the other hand, if $y'_2 > y''_2$, then $y''_1$ satisfies $y'_1 \leq y''_1$, because the inequality $y'_1 > y''_1$ would imply $(y''_1, y''_2) \in (-\infty, y'_1) \times (-\infty, y'_2)$, which contradicts $(y''_1, y''_2) \in Z$. From $y'_1 \leq y''_1$ and $y_1 < y'_1 - \epsilon_1$, we have $y \notin \{(y''_1 - \epsilon'_1, y''_2 - \epsilon'_2) \mid 0 \leq \epsilon'_1 \leq \epsilon_1, \ 0 \leq \epsilon'_2 \leq \epsilon_2\}$. Therefore, it holds that $y \notin Z_*(\epsilon)$. This implies that $Z_-(\epsilon) \cap Z_*(\epsilon) = \emptyset$.

Next, we show $Z_+ = Z_-(\epsilon) \cup Z_*(\epsilon)$. It is clear that $Z_+ \supset Z_-(\epsilon) \cup Z_*(\epsilon)$. Thus, we only show that $Z_+ \subset Z_-(\epsilon) \cup Z_*(\epsilon)$. Let $y$ be an element of $Z_+$. If $y \in Z_-(\epsilon)$, it holds that $y \in Z_-(\epsilon) \cup Z_*(\epsilon)$. On the other hand, if $y \notin Z_-(\epsilon)$, at least one of the following inequalities holds for any $(y'_1, y'_2) \in Z$:
$$y_1 \geq y'_1 - \epsilon_1, \quad y_2 \geq y'_2 - \epsilon_2.$$
If there exists $\epsilon' \in [0, \epsilon_1]$ such that $(y_1 + \epsilon', y_2) \in Z$, then $y \in Z_*(\epsilon)$. Next, we consider the case that $(y_1 + \epsilon', y_2) \notin Z$ for any $\epsilon' \in [0, \epsilon_1]$. Let $Z' = \{a = (a_1, a_2) \in Z \mid y_1 \leq a_1 \leq y_1 + \epsilon_1\}$. Here, assume that $y_2 < a_2 - \epsilon_2$ for any $a \in Z'$. Then, from the continuity of $Z$, there exists $\hat{y} = (\hat{y}_1, \hat{y}_2) \in Z$ such that $y_1 < \hat{y}_1 - \epsilon_1$ and $y_2 < \hat{y}_2 - \epsilon_2$. However, this contradicts $y \notin Z_-(\epsilon)$. Hence, there exists an element $a = (a_1, a_2) \in Z'$ such that $y_2 \geq a_2 - \epsilon_2$. Moreover, there exists $b \geq y_2$ such that $(y_1, b) \in Z$. This implies that there exist $\tilde{\epsilon}_1$ and $\tilde{\epsilon}_2$ such that $0 \leq \tilde{\epsilon}_1 \leq \epsilon_1$, $0 \leq \tilde{\epsilon}_2 \leq \epsilon_2$ and $(y_1 + \tilde{\epsilon}_1, y_2 + \tilde{\epsilon}_2) \in Z$. Therefore, it holds that $y \in Z_*(\epsilon)$. $\blacksquare$

Next, we explain the method of constructing $\tilde{\mathcal{X}}$. Let $\tilde{\mathcal{X}}$ be the set of grid points obtained when each dimension of $\mathcal{X} = [0, 1]^d$ is divided into $\tau$ evenly spaced segments. Also let $[x] \in \tilde{\mathcal{X}}$ be a point closest to $x \in \mathcal{X}$ with respect to the $L^1$-norm. Then, it holds that
$$\|x - [x]\|_1 \leq \frac{d}{\tau}, \quad \forall x \in \mathcal{X}. \quad (41)$$
In the proposed algorithm for the continuous-set setting, Algorithm 2 is performed by using $\tilde{\mathcal{X}}$ instead of $\mathcal{X}$; a sketch of the discretization is given below.
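A minimal sketch of the discretization (41), assuming $\mathcal{X} = [0, 1]^d$; function names are ours:

    import numpy as np

    def make_grid(d, tau):
        # Uniform grid with tau segments (tau + 1 points) per dimension.
        axis = np.linspace(0.0, 1.0, tau + 1)
        mesh = np.meshgrid(*([axis] * d), indexing="ij")
        return np.stack([m.ravel() for m in mesh], axis=-1)

    def snap_to_grid(x, tau):
        # Rounding each coordinate gives ||x - [x]||_1 <= d/(2 tau) <= d/tau,
        # which satisfies (41).
        return np.round(np.asarray(x) * tau) / tau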
Then, we define the estimated Pareto set $\hat{\Pi}_t$, the latent Pareto set $M_t$ and the uncertain set $U_t$ in Algorithm 2 as
$$\hat{\Pi}_t = \{x \in \tilde{\mathcal{X}} \mid \forall x' \in \tilde{E}^{(\mathrm{pes})}_{t,x}, \ F^{(\mathrm{pes})}_t(x) \not\preceq F^{(\mathrm{pes})}_t(x')\}, \quad \tilde{E}^{(\mathrm{pes})}_{t,x} = \{x' \in \tilde{\mathcal{X}} \mid F^{(\mathrm{pes})}_t(x) \neq F^{(\mathrm{pes})}_t(x')\},$$
$$M_t = \{x \in \tilde{\mathcal{X}} \setminus \hat{\Pi}_t \mid \forall x' \in \hat{\Pi}_t, \ F^{(\mathrm{opt})}_t(x) \not\preceq_{\epsilon/2} F^{(\mathrm{pes})}_t(x')\},$$
$$U_t = \{x \in \hat{\Pi}_t \mid \exists x' \in \hat{\Pi}_t \setminus \{x\}, \ F^{(\mathrm{pes})}_t(x) + \epsilon/2 \prec F^{(\mathrm{opt})}_t(x')\}.$$
Note that $\epsilon/2$, not $\epsilon$, is used to calculate $M_t$ and $U_t$. A sketch of these set computations on the grid is given below.
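The following is a minimal sketch of the three set computations on a finite grid, assuming (n, 2) arrays F_pes and F_opt of pessimistic and optimistic objective vectors; all names are ours:

    import numpy as np

    def estimate_sets(F_pes, F_opt, eps):
        half = np.asarray(eps) / 2.0
        n = len(F_pes)
        # Estimated Pareto set: x whose pessimistic vector is not weakly
        # dominated by that of any point with a different pessimistic vector.
        pareto = [i for i in range(n)
                  if not any(not np.array_equal(F_pes[j], F_pes[i])
                             and np.all(F_pes[i] <= F_pes[j]) for j in range(n))]
        # Latent Pareto set: undecided points that no estimated-Pareto point
        # (eps/2)-dominates from its pessimistic side.
        latent = [i for i in range(n) if i not in pareto
                  and not any(np.all(F_opt[i] <= F_pes[j] + half) for j in pareto)]
        # Uncertain set: estimated-Pareto points whose pessimistic vector is
        # still strictly below another point's optimistic vector by eps/2.
        uncertain = [i for i in pareto
                     if any(j != i and np.all(F_pes[i] + half < F_opt[j])
                            for j in pareto)]
        return pareto, latent, uncertain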
In the algorithm using $\tilde{\mathcal{X}}$, the following theorem holds:

Theorem D.1. Let $\tilde{B} = \max_{(x,w) \in \mathcal{X}\times\Omega} |f(x, w) - \mathbb{E}_w[f(x, w)]|$, and let $\delta \in (0, 1)$ and $\epsilon = (\epsilon_1, \epsilon_2)$ with $\epsilon_1 > 0$ and $\epsilon_2 > 0$. Define $\beta_t = \big(\sqrt{\ln\det(I_t + \sigma^{-2}K_t) + 2\ln\frac{1}{\delta}} + B\big)^2$ and
$$\tau = \max\Big\{\frac{2Ld}{\epsilon_1}, \frac{16\tilde{B}Ld}{\epsilon_2^2}\Big\}.$$
Then, the following statements (1) and (2) hold with probability at least $1 - \delta$:

(1) The algorithm terminates after at most $T$ iterations, where $T$ is the smallest positive integer satisfying
$$T^{-1}\beta^{1/2}_T\big(\sqrt{TC_1\gamma_T} + C_2\big) + T^{-1}\sqrt{T\tilde{B}\beta^{1/2}_T\big(\sqrt{TC_1\gamma_T} + 2C_2\big)} + 5T^{-1}\beta_T(C_1\gamma_T + 2C_2) \leq \min\{\epsilon_1, \epsilon_2\}/2.$$
Here, $C_1$ and $C_2$ are given by $C_1 = 2/\ln(1+\sigma^{-2})$ and $C_2 = 16\ln\frac{18}{\delta}$.

(2) When the algorithm terminates, the estimated Pareto set $\hat{\Pi}_t$ is an $\epsilon$-accurate Pareto set.

Proof. We omit the proof of (1) because it is the same as the proof of Theorem 4.2. We only prove (2). From (41) and Lemmas D.1 and D.2, the following holds for any $x \in \mathcal{X}$:
$$|F_1(x) - F_1([x])| \leq L\|x - [x]\|_1 \leq \frac{Ld}{\tau} \leq \frac{\epsilon_1}{2}, \quad (42)$$
$$|F_2(x) - F_2([x])| \leq \sqrt{4\tilde{B}L\|x - [x]\|_1} \leq \sqrt{\frac{4\tilde{B}Ld}{\tau}} \leq \frac{\epsilon_2}{2}. \quad (43)$$
Assume that (13) holds. Let $\tilde{Z}$ be the Pareto front for $\tilde{\mathcal{X}}$. Then, for any $y \in \tilde{Z}$, it holds that
$$y \in \bigcup_{(y'_1, y'_2) \in Z} (-\infty, y'_1] \times (-\infty, y'_2], \quad (44)$$
where $Z$ is the Pareto front for $\mathcal{X}$. Similarly, let
$$Z_-(\epsilon/2) = \bigcup_{(y'_1, y'_2) \in Z} (-\infty, y'_1 - \epsilon_1/2) \times (-\infty, y'_2 - \epsilon_2/2).$$
Then, for any $y'' \in Z_-(\epsilon/2)$, there exists $x \in \mathcal{X}$ such that
$$y''_1 < F_1(x) - \epsilon_1/2, \quad y''_2 < F_2(x) - \epsilon_2/2.$$
Here, from (42) and (43) we have
$$F_1(x) \leq F_1([x]) + \epsilon_1/2, \quad F_2(x) \leq F_2([x]) + \epsilon_2/2.$$
Thus, it holds that $y''_1 < F_1([x])$ and $y''_2 < F_2([x])$. This implies that
$$Z_-(\epsilon/2) \subset \{y \in \mathbb{R}^2 \mid \exists x \in \tilde{\mathcal{X}}, \ y \preceq F(x)\} \equiv A.$$
Here, since $Z_-(\epsilon/2)$ is an open set, noting that $Z_-(\epsilon/2) \subset A$ we get $Z_-(\epsilon/2) \subset \mathrm{int}(A)$, where $\mathrm{int}(A)$ is the interior of $A$. In addition, from the definition of the interior and the boundary (frontier), we obtain $\mathrm{int}(A) \cap \partial A = \emptyset$. Therefore, from $\partial A = \tilde{Z}$ and $Z_-(\epsilon/2) \subset \mathrm{int}(A)$, it holds that $Z_-(\epsilon/2) \cap \tilde{Z} = \emptyset$. Hence, for any $y \in \tilde{Z}$, we have $y \notin Z_-(\epsilon/2)$; combining this with (44) and Lemma D.3 applied with $\epsilon/2$, we obtain $\tilde{Z} \subset Z_*(\epsilon/2)$. Hence, for any $y \in \tilde{Z}$, there exists $a \in Z$ such that
$$y_1 = a_1 - \epsilon'_1, \quad y_2 = a_2 - \epsilon'_2, \quad 0 \leq \epsilon'_1 \leq \epsilon_1/2, \quad 0 \leq \epsilon'_2 \leq \epsilon_2/2. \quad (45)$$
Furthermore, from Theorem 4.2, for any $x \in \hat{\Pi}_t$, there exists $y^\dagger \in \tilde{Z}$ such that
$$y^\dagger_1 \leq F_1(x) + \epsilon_1/2, \quad y^\dagger_2 \leq F_2(x) + \epsilon_2/2.$$
By combining this and (45), we get
$$a_1 = y^\dagger_1 + \epsilon'_1 \leq F_1(x) + \epsilon_1/2 + \epsilon'_1 \leq F_1(x) + \epsilon_1, \quad a_2 = y^\dagger_2 + \epsilon'_2 \leq F_2(x) + \epsilon_2/2 + \epsilon'_2 \leq F_2(x) + \epsilon_2.$$
Therefore, we have $F(\hat{\Pi}_t) \subset Z_\epsilon$.

Furthermore, let $x \in \Pi$. For $[x] \in \tilde{\mathcal{X}}$, since $\hat{\Pi}_t$ is an $(\epsilon/2)$-accurate Pareto set for $\tilde{\mathcal{X}}$, there exists $x' \in \hat{\Pi}_t$ such that $F([x]) \preceq_{\epsilon/2} F(x')$. Moreover, from (42) and (43), it holds that $F(x) \preceq F([x]) + \epsilon/2$. This implies that
$$F(x) \preceq F([x]) + \epsilon/2 \preceq F(x') + \epsilon.$$
Therefore, for any $x \in \Pi$, there exists $x' \in \hat{\Pi}_t$ such that $x \preceq_{\epsilon} x'$. Thus, $\hat{\Pi}_t$ is an $\epsilon$-accurate Pareto set for $\mathcal{X}$. $\blacksquare$