Mean-Variance Analysis in Bayesian Optimization under Uncertainty
Shogo Iwazaki∗  Yu Inatsu†  Ichiro Takeuchi‡†

∗ Department of Computer Science, Nagoya Institute of Technology
† RIKEN Center for Advanced Intelligence Project
‡ Department of Computer Science / Research Institute for Information Science, Nagoya Institute of Technology; mail: [email protected]

ABSTRACT
We consider active learning (AL) in an uncertain environment in which a trade-off between multiple risk measures needs to be considered. As an AL problem in such an uncertain environment, we study the Mean-Variance Analysis in Bayesian Optimization (MVA-BO) setting. Mean-variance analysis was developed in the field of financial engineering and has been used to make decisions that take into account the trade-off between the average and the variance of investment uncertainty. In this paper, we specifically focus on the BO setting with an uncertain component and consider multi-task, multi-objective, and constrained optimization scenarios for the mean-variance trade-off of the uncertain component. When the target blackbox function is modeled by a Gaussian process (GP), we derive bounds on the two risk measures and propose an AL algorithm for each of the above three problems based on these bounds. We show the effectiveness of the proposed AL algorithms through theoretical analysis and numerical experiments.
1 Introduction

Decision making in an uncertain environment has been studied in various domains. For example, in financial engineering, mean-variance analysis [1, 2, 3] has been introduced as a framework for making investment decisions, taking into account the trade-off between the return (mean) and the risk (variance) of the investment. In this paper, we study active learning (AL) in an uncertain environment. In many practical AL problems, there are two types of parameters, called design parameters and environmental parameters. For example, in a product design, while the design parameters are fully controllable, the environmental parameters vary depending on the environment in which the product is used. In this paper, we examine AL problems under such an uncertain environment, where the goal is to efficiently find the optimal design parameters by properly taking into account the uncertainty of the environmental parameters.

Concretely, let f(x, w) be a blackbox function indicating the performance of a product, where x ∈ X is the set of controllable design parameters and w ∈ Ω is the set of uncontrollable environmental parameters whose uncertainty is characterized by a probability distribution p(w). We particularly focus on the AL problem where the mean and the variance with respect to the environmental parameters,

E_w[f(x, w)] = ∫_Ω f(x, w) p(w) dw,   (1a)
V_w[f(x, w)] = ∫_Ω ( f(x, w) − E_w[f(x, w)] )² p(w) dw,   (1b)

respectively, are taken into account. Specifically, we work on these two uncertainty measures in three different scenarios: a multi-task learning scenario, a multi-objective optimization scenario, and a constrained optimization scenario. In the first scenario, we study AL for optimizing a weighted sum of these two measures. In the second scenario, we discuss how to obtain the Pareto frontier of these two measures in an AL setting. In the third scenario, we consider optimizing one of the two measures under some constraint on the other measure. We refer to these problems and the proposed framework for solving them as Mean-Variance Analysis in Bayesian Optimization (MVA-BO). Figure 1 shows an illustration of a multi-task learning scenario.

In this study, we employ a Gaussian process (GP) to model the uncertainty of the blackbox function f(x, w). In a conventional GP-based AL problem (without uncontrollable environmental parameters w), the acquisition function (AF) is designed based on how the uncertainty of the blackbox function changes when an input point is selected and the blackbox function is evaluated at that point. On the other hand, in MVA-BO, we need to know how the uncertainties of the mean function (1a) and the variance function (1b) change by evaluating the blackbox function at the selected input point. Note that we face the difficulty of not being able to directly evaluate the target functions (1a) and (1b). When f(x, w) follows a GP, the mean function (1a) also follows a GP. Unfortunately, however, the variance function (1b) does not follow a GP, indicating that we need to develop a new method to quantify how the uncertainty of the variance function changes by evaluating the blackbox function at the selected input point.
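To make the two risk measures concrete, the following minimal Python sketch approximates (1a) and (1b) by numerical integration over a discretized Ω. The toy function f, the grid, and the truncated-normal weights are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

# Toy stand-in for the blackbox f(x, w); in practice f is expensive to evaluate.
def f(x, w):
    return np.sin(3 * x) * np.cos(2 * w) - 0.5 * x * w

# Discretize Omega and define p(w) on the grid (here: a truncated standard normal).
w_grid = np.linspace(-1.0, 1.0, 200)
p_w = np.exp(-0.5 * w_grid**2)
p_w /= p_w.sum()  # normalize so the grid weights sum to one

def mean_and_variance(x):
    """Grid approximations of E_w[f(x,w)] in (1a) and V_w[f(x,w)] in (1b)."""
    vals = f(x, w_grid)
    mean = np.sum(vals * p_w)
    var = np.sum((vals - mean) ** 2 * p_w)
    return mean, var

m, v = mean_and_variance(0.3)
print(f"mean = {m:.4f}, variance = {v:.4f}")
```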
In this study, we extend the GP-UCB algorithm [5] to realize MVA-BO in the three scenarios mentioned above by overcoming these technical difficulties. We demonstrate the effectiveness of the proposed MVA-BO framework through theoretical analyses and numerical experiments.

Related Work
Various problem setups and methods have been studied for AL and Bayesian optimization (BO) problems with multiple target functions. One such problem setup is multi-task BO [6]. In this setup, the AF is designed to select input points that commonly contribute to optimizing multiple target functions. Another popular problem setup is multi-objective BO [7, 8, 9]. The goal of multi-objective optimization is to obtain so-called Pareto-optimal solutions, and the AF in this setup is designed to efficiently identify solutions on the Pareto frontier. Another common problem setup is constrained BO [10, 11, 12]. The goal here is to find the optimal solution to a constrained optimization problem in a situation where both the objective function and the constraint function are blackbox functions that are costly to evaluate. The AF in this setup is designed to select input points that are useful not only for maximizing the objective function but also for identifying the feasible region. In this paper, we study these three scenarios as concrete examples of MVA-BO. Unlike conventional multi-task, multi-objective, and constrained BOs, the main technical challenges of MVA-BO are that the two target functions (1a) and (1b) cannot be directly evaluated and that the latter does not follow a GP.

Various studies have been published on BO under various types of uncertainty. The most relevant one to our study is Bayesian quadrature optimization (BQO) [13], the goal of which is to optimize the mean function (1a). When the blackbox function follows a GP, the mean function (1a) also follows a GP, suggesting that one can efficiently solve BQO problems by properly modifying the AFs of conventional BO. By replacing the integrand in (1a) with different uncertainty measures, one can consider various types of AL problems under uncertainty [14, 15]. Another line of research dealing with uncontrollable and uncertain factors in BO is known as robust BO. The goal of robust BO is to make robust decisions that appropriately take into account the uncertainty of the BO process and the GP model. For example, input uncertainty in BO has been studied, in which probabilistic noise is inevitably added to the input points when evaluating the target blackbox function. Although research on BO in an uncertain environment has steadily progressed over the past few years, to our knowledge, there are no AL or BO studies that take into account the trade-offs between multiple uncertainty measures as in mean-variance analysis.

Decision making under uncertainty is also examined in the field of robust optimization [16, 17, 18], especially with applications to financial engineering in mind [19, 20, 21]. It has been pointed out that when making decisions under uncertainty, it is important to balance multiple uncertainty measures appropriately, as represented by the Nobel-prize-winning mean-variance analysis in portfolio theory [1, 2, 3]. Various risk measures, such as Value at Risk (VaR), have been proposed in financial engineering, and these multiple risk measures are used in combination, depending on the purpose of the decision making. However, to our knowledge, there have been no AL or BO studies that appropriately take multiple uncertainty measures into account.
2 Preliminaries

Let f : X × Ω → R be a blackbox function which is expensive to evaluate, where X ⊂ R^{d₁} is a finite set and Ω ⊂ R^{d₂} is a compact convex set. (We discuss the case where X is a continuous set in appendix D.) In our setting, the variable w ∈ Ω fluctuates probabilistically according to a given density function p(w). (When Ω is a finite set, a probability mass function can also be considered; in that case, the subsequent discussions still hold if integral operations are replaced by summation operations.) At every step t, the user chooses the next observation point x_t ∈ X, whereas w_t ∈ Ω is given as a realization of the random variable that follows the distribution p(w). The user then obtains the noisy observation y_t = f(x_t, w_t) + η_t, where η_t is independent Gaussian noise following N(0, σ²).

Furthermore, as a regularity assumption, we assume that f is an element of a reproducing kernel Hilbert space (RKHS) and has a bounded norm, as is also assumed in the standard BO literature [5]. Let k be a positive definite kernel over (X × Ω) × (X × Ω) and H_k be the RKHS corresponding to k. In this paper, for some positive constant B, we assume f ∈ H_k with ‖f‖_{H_k} ≤ B, where ‖·‖_{H_k} denotes the Hilbert norm defined on H_k.

Models
Our algorithm uses the GP method [22] to navigate the optimization process. First, we assume GP(0, k) as a prior of f, where GP(µ, k) is a GP characterized by a mean function µ and a kernel function k. Given the sequence of data {((x_i, w_i), y_i)}_{i=1}^t, the posterior distribution of f(x, w) is the Gaussian distribution whose mean µ_t(x, w) and variance σ_t²(x, w) are defined as follows:

µ_t(x, w) = k_t(x, w)ᵀ (K_t + σ² I_t)⁻¹ y_t,
σ_t²(x, w) = k((x, w), (x, w)) − k_t(x, w)ᵀ (K_t + σ² I_t)⁻¹ k_t(x, w),

where k_t(x, w) = (k((x, w), (x₁, w₁)), . . . , k((x, w), (x_t, w_t)))ᵀ, y_t = (y₁, . . . , y_t)ᵀ, I_t is the identity matrix of size t, and K_t is the t × t kernel matrix whose (i, j)th element is k((x_i, w_i), (x_j, w_j)).

[Figure 1: Illustration of the multi-task scenario. The horizontal and vertical axes correspond to x and w, respectively. Blue and yellow dotted lines indicate the points where the expected value F₁(x) and the negative standard deviation F₂(x) of f(x, w) are maximized. Our goal is to identify the point on the red line that simultaneously maximizes both F₁ and F₂.]

We will make use of the following lemma to construct the confidence bound of f from the posterior mean µ_t and the posterior variance σ_t².

Lemma 2.1 (Theorem 3.11 in [23]). Fix f ∈ H_k with ‖f‖_{H_k} ≤ B. Given δ ∈ (0, 1), define β_t = ( √( ln det(I_t + σ⁻² K_t) + 2 ln(1/δ) ) + B )². Then, the following holds with probability at least 1 − δ:

|f(x, w) − µ_{t−1}(x, w)| ≤ β_t^{1/2} σ_{t−1}(x, w),  ∀x ∈ X, ∀w ∈ Ω, ∀t ≥ 1.   (2)

Based on the above lemma, the confidence bound Q_t(x, w) := [l_t(x, w), u_t(x, w)] of f(x, w) can be computed by

l_t(x, w) = µ_{t−1}(x, w) − β_t^{1/2} σ_{t−1}(x, w),
u_t(x, w) = µ_{t−1}(x, w) + β_t^{1/2} σ_{t−1}(x, w).

Here, we consider the expectation and the variance of f(x, w) under the uncertainty of p(w) as follows:

E_w[f(x, w)] = ∫_Ω f(x, w) p(w) dw,   (3)
V_w[f(x, w)] = ∫_Ω { f(x, w) − E_w[f(x, w)] }² p(w) dw.   (4)

Using these E_w[f(x, w)] and V_w[f(x, w)], we define the objective functions F₁ and F₂ as follows:

F₁(x) = E_w[f(x, w)],  F₂(x) = −√( V_w[f(x, w)] ).   (5)

Our goal is to maximize F₁ and F₂ simultaneously with as few function evaluations as possible. To this end, we handle these objective functions in multi-task and multi-objective optimization frameworks.
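The posterior formulas above translate directly into code. The following sketch computes µ_t, σ_t², and the confidence bound Q_t; the Gaussian kernel, its hyperparameters, the noise level, and the constant β are illustrative choices (the paper sets β_t via Lemma 2.1).

```python
import numpy as np

def rbf_kernel(A, B, ell=0.25, sigma_ker=1.0):
    """Gaussian kernel on concatenated inputs z = (x, w)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_ker**2 * np.exp(-sq / (2 * ell**2))

def gp_posterior(Z_train, y_train, Z_test, noise_var=1e-4):
    """Posterior mean and variance at Z_test given data {((x_i, w_i), y_i)}."""
    K = rbf_kernel(Z_train, Z_train)
    k_star = rbf_kernel(Z_train, Z_test)
    # Solve (K + sigma^2 I)^{-1} via Cholesky for numerical stability.
    L = np.linalg.cholesky(K + noise_var * np.eye(len(Z_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = k_star.T @ alpha
    v = np.linalg.solve(L, k_star)
    var = rbf_kernel(Z_test, Z_test).diagonal() - (v**2).sum(0)
    return mu, np.maximum(var, 0.0)

def confidence_bound(mu, var, beta=4.0):
    """Confidence bound Q_t = [l_t, u_t]; a constant beta is a common
    practical substitute for the schedule from Lemma 2.1."""
    half = np.sqrt(beta) * np.sqrt(var)
    return mu - half, mu + half
```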
Multi-task Optimization Scenario

First, we formulate the problem as a single-objective optimization problem whose objective function is defined as a weighted sum of F₁ and F₂. Given a user-specified weight α ∈ [0, 1], let G be a new objective function defined as follows:

G(x) = αF₁(x) + (1 − α)F₂(x).

In this formulation, our goal is to find x* := argmax_{x∈X} G(x) efficiently. To rigorously state the theoretical properties, we introduce the notion of an ε-accurate solution. Let ˆx_t be an estimated solution which is defined by the algorithm at step t. Given a fixed constant ε ≥
0, we say that ˆx_t is ε-accurate if the following inequality holds:

G(ˆx_t) ≥ G(x*) − ε.

In section 4, for an arbitrarily small ε, we show that our algorithm can find an ε-accurate solution with high probability after a finite number of steps T.
Multi-objective Optimization Scenario

In the multi-task scenario, we assume that the user can specify the weight α before the optimization; however, this is sometimes unrealistic. We therefore also consider a more general formulation based on the Pareto optimality criterion. Hereafter, we use the vector representation of the objective functions, F(x) = (F₁(x), F₂(x)). First, let ⪯ be a relational operator defined over X × X or R² × R². Given x, x′ ∈ X, we write x ⪯ x′ or F(x) ⪯ F(x′) provided that F₁(x) ≤ F₁(x′) and F₂(x) ≤ F₂(x′) hold simultaneously. We say that x′ dominates x if x ⪯ x′. Furthermore, we write x ≺ x′ or F(x) ≺ F(x′) provided that either F₁(x) < F₁(x′) or F₂(x) < F₂(x′) holds. The goal of this scenario is to identify the following Pareto set
Π efficiently:

Π = { x ∈ X | ∀x′ ∈ E_x, F(x) ⋠ F(x′) }, where E_x = { x′ ∈ X | F(x) ≠ F(x′) }.

Moreover, the
Pareto front Z is defined by

Z = ∂{ y ∈ R² | ∃x ∈ X, y ⪯ F(x) }.

Next, we introduce the notion of an ε-accurate Pareto set [8], which is an idea similar to the ε-accurate solution in the multi-task scenario. Given a non-negative vector ε = (ε₁, ε₂), we define the relational operator ⪯_ε, which is a relaxed version of ⪯. For x, x′ ∈ X, we write x ⪯_ε x′ or F(x) ⪯_ε F(x′) if F₁(x) ≤ F₁(x′) + ε₁ and F₂(x) ≤ F₂(x′) + ε₂ hold simultaneously. Then, the ε-Pareto front is defined as:

Z_ε = { y ∈ R² | ∃y′ ∈ Z, y ⪯ y′ and ∃y″ ∈ Z, y″ ⪯_ε y }.

We say that the estimated Pareto set ˆΠ_t of the algorithm is an ε-accurate Pareto set if the following two conditions are satisfied:
1. F(ˆΠ_t) ⊂ Z_ε, where F(ˆΠ_t) := { F(x) | x ∈ ˆΠ_t }.
2. For any x ∈ Π, there is at least one point x′ ∈ ˆΠ_t such that x ⪯_ε x′.

Intuitively, condition 1 guarantees that the estimated solutions are worse than the true Pareto front by at most ε. Condition 2 indicates that ˆΠ_t can cover all points in the true Pareto set Π.

We emphasize that although many studies on multi-task or multi-objective optimization based on a GP have been reported, their methods cannot be directly applied to our setting because the objective functions F₁ and F₂ are not observed directly.

3 Proposed Method

First, we explain the basic idea of our proposed algorithms. To maximize F₁ and F₂ efficiently, one simple way is to consider the predictive distributions of F₁ and F₂ and apply existing methods (e.g., expected improvement, entropy search). However, it is difficult to handle the predictive distribution of F₂ even though f is modeled by a GP. In this paper, we first derive intervals in which F₁ and F₂ lie with high probability from the confidence bound of f, and construct the algorithms based on these derived intervals. Hereafter, with a slight abuse of notation, we refer to these derived intervals as the confidence bounds of F₁ and F₂. (In appendix B, as another formulation, we also consider the constrained optimization problem whose objective and constraint functions are F₁ and F₂, respectively.)

3.1 Confidence Bounds of Objective Functions

First, we consider the confidence bound Q_t^{(F₁)}(x) = [l_t^{(F₁)}(x), u_t^{(F₁)}(x)] of F₁(x). When (2) holds, the following inequality holds for any x ∈ X and t ≥ 1:

∫_Ω l_t(x, w) p(w) dw ≤ ∫_Ω f(x, w) p(w) dw ≤ ∫_Ω u_t(x, w) p(w) dw.

This implies that F₁(x) ∈ Q_t^{(F₁)}(x) for any x ∈ X and t ≥ 1 with probability at least 1 − δ for l_t^{(F₁)} and u_t^{(F₁)} defined as

l_t^{(F₁)}(x) = ∫_Ω l_t(x, w) p(w) dw,  u_t^{(F₁)}(x) = ∫_Ω u_t(x, w) p(w) dw.

We construct the confidence bound Q_t^{(F₂)}(x) = [l_t^{(F₂)}(x), u_t^{(F₂)}(x)] of F₂(x) in a similar way. First, we consider the quantity f(x, w) − E_w[f(x, w)], which appears in the integrand of V_w[f(x, w)]. Under condition (2), the following inequality holds:

˜l_t(x, w) ≤ f(x, w) − E_w[f(x, w)] ≤ ˜u_t(x, w),   (6)

where ˜l_t(x, w) = l_t(x, w) − E_w[u_t(x, w)] and ˜u_t(x, w) = u_t(x, w) − E_w[l_t(x, w)].
Next, the integrand of V_w[f(x, w)] can be evaluated based on (6) as follows:

˜l_t^{(sq)}(x, w) ≤ { f(x, w) − E_w[f(x, w)] }² ≤ ˜u_t^{(sq)}(x, w),

where

˜l_t^{(sq)}(x, w) = 0 if ˜l_t(x, w) ≤ 0 ≤ ˜u_t(x, w), and ˜l_t^{(sq)}(x, w) = min{ ˜l_t²(x, w), ˜u_t²(x, w) } otherwise,
˜u_t^{(sq)}(x, w) = max{ ˜l_t²(x, w), ˜u_t²(x, w) }.

Finally, from the monotonicity of the square root, the confidence bound Q_t^{(F₂)}(x) = [l_t^{(F₂)}(x), u_t^{(F₂)}(x)] of F₂(x) is computed using the following equations for l_t^{(F₂)} and u_t^{(F₂)}:

l_t^{(F₂)}(x) = −√( ∫_Ω ˜u_t^{(sq)}(x, w) p(w) dw ),  u_t^{(F₂)}(x) = −√( ∫_Ω ˜l_t^{(sq)}(x, w) p(w) dw ).
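Once Ω is discretized, the construction of Q_t^{(F₁)} and Q_t^{(F₂)} reduces to elementwise operations. A minimal sketch, assuming l_t and u_t have already been evaluated on a grid of Ω with probability weights p_w:

```python
import numpy as np

def objective_bounds(l_t, u_t, p_w):
    """Confidence bounds of F1 and F2 for one x, following Section 3.1.

    l_t, u_t : arrays of the bounds of f(x, w) over a grid of Omega
    p_w      : probability weights of the grid (sums to one)
    """
    # Bounds of F1(x) = E_w[f(x, w)]: integrate the bounds of f.
    l_F1 = np.sum(l_t * p_w)
    u_F1 = np.sum(u_t * p_w)
    # Bounds (6) of f(x, w) - E_w[f(x, w)].
    lt_tilde = l_t - u_F1
    ut_tilde = u_t - l_F1
    # Bounds of the squared deviation.
    u_sq = np.maximum(lt_tilde**2, ut_tilde**2)
    l_sq = np.where((lt_tilde <= 0) & (0 <= ut_tilde),
                    0.0, np.minimum(lt_tilde**2, ut_tilde**2))
    # Bounds of F2(x) = -sqrt(V_w[f(x, w)]) (note the sign flip).
    l_F2 = -np.sqrt(np.sum(u_sq * p_w))
    u_F2 = -np.sqrt(np.sum(l_sq * p_w))
    return (l_F1, u_F1), (l_F2, u_F2)
```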
3.2 Algorithms

Multi-task Scenario

In the multi-task scenario, our algorithm chooses the next observation point x_t based on the upper confidence bound (UCB) of the function G. From Q_t^{(F₁)}(x) and Q_t^{(F₂)}(x), the confidence bound Q_t^{(G)}(x) := [l_t^{(G)}(x), u_t^{(G)}(x)] of G(x) can be constructed by defining

l_t^{(G)}(x) = α l_t^{(F₁)}(x) + (1 − α) l_t^{(F₂)}(x),
u_t^{(G)}(x) = α u_t^{(F₁)}(x) + (1 − α) u_t^{(F₂)}(x).

At every step t, the next observation point x_t of our algorithm is defined by x_t = argmax_{x∈X} u_t^{(G)}(x). Hereafter, we call this strategy Multi-Task MVA-BO (MT-MVA-BO). The pseudo-code of MT-MVA-BO is shown as Algorithm 1.
Multi-objective Scenario

Next, we explain the proposed algorithm for finding the Pareto set efficiently. From the confidence bounds of F₁ and F₂, we define F_t^{(opt)} and F_t^{(pes)} by F_t^{(opt)}(x) = ( u_t^{(F₁)}(x), u_t^{(F₂)}(x) ) and F_t^{(pes)}(x) = ( l_t^{(F₁)}(x), l_t^{(F₂)}(x) ), which respectively represent the optimistic and pessimistic predictions of the objective functions at step t. First, we define the estimated Pareto set ˆΠ_t at step t by

ˆΠ_t = { x ∈ X | ∀x′ ∈ E_{t,x}^{(pes)}, F_t^{(pes)}(x) ⋠ F_t^{(pes)}(x′) }, where E_{t,x}^{(pes)} = { x′ ∈ X | F_t^{(pes)}(x) ≠ F_t^{(pes)}(x′) }.   (7)

For theoretical reasons, we define ˆΠ_t based on the pessimistic predictions; the same idea is used in the existing GP-based optimization literature [24, 8, 25, 26]. Furthermore, using ˆΠ_t, the potential Pareto set M_t is defined by

M_t = { x ∈ X \ ˆΠ_t | ∀x′ ∈ ˆΠ_t, F_t^{(opt)}(x) ⋠_ε F_t^{(pes)}(x′) }.

Algorithm 1 Multi-task MVA-BO (MT-MVA-BO)
Input: GP prior GP(0, k), {β_t}_{t≤T}, α ∈ [0, 1]
for t = 1 to T do
    Compute u_t^{(G)}(x) for all x ∈ X
    Choose x_t = argmax_{x∈X} u_t^{(G)}(x)
    Sample w_t ∼ p(w)
    Observe y_t ← f(x_t, w_t) + η_t
    Update the GP by adding ((x_t, w_t), y_t)
end for
Output: argmax_{x ∈ {x₁,...,x_T}} l_T^{(G)}(x)
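A minimal sketch of the MT-MVA-BO loop, reusing f, w_grid, and p_w from the sketch in the introduction and gp_posterior, confidence_bound, and objective_bounds from the sketches above; the grid sizes, step count, β, and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(-1, 1, 50)
alpha = 0.5

Z, y = [], []
for t in range(30):
    # Upper confidence bound of G(x) = alpha*F1(x) + (1-alpha)*F2(x) per x.
    ucb_G = []
    for x in x_grid:
        Z_test = np.column_stack([np.full_like(w_grid, x), w_grid])
        if Z:
            mu, var = gp_posterior(np.array(Z), np.array(y), Z_test)
        else:
            mu, var = np.zeros(len(w_grid)), np.ones(len(w_grid))  # prior
        l_t, u_t = confidence_bound(mu, var)
        (l1, u1), (l2, u2) = objective_bounds(l_t, u_t, p_w)
        ucb_G.append(alpha * u1 + (1 - alpha) * u2)
    x_t = x_grid[int(np.argmax(ucb_G))]   # choose x_t by the UCB of G
    w_t = rng.choice(w_grid, p=p_w)       # the environment samples w_t ~ p(w)
    Z.append([x_t, w_t])
    y.append(f(x_t, w_t) + 0.01 * rng.standard_normal())
    # The estimated solution would be reported via l_t^{(G)} over visited points.
```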
Algorithm 2 Multi-objective MVA-BO (MO-MVA-BO)
Input: GP prior GP(0, k), {β_t}_{t∈N}, non-negative vector ε = (ε₁, ε₂)
t ← 1
repeat
    Compute ˆΠ_t and M_t
    Compute λ_t(x) for all x ∈ M_t ∪ ˆΠ_t
    Choose x_t = argmax_{x ∈ M_t ∪ ˆΠ_t} λ_t(x)
    Sample w_t ∼ p(w)
    Observe y_t ← f(x_t, w_t) + η_t
    Update the GP by adding ((x_t, w_t), y_t)
    t ← t + 1
    Compute U_t
until M_t = ∅ and U_t = ∅
Output: ˆΠ_t

An intuitive interpretation of M_t is the set obtained by excluding the points that are ε-dominated by some other point with high probability. At every step t, our algorithm chooses x_t based on the uncertainty defined by the confidence bounds of F₁ and F₂. In this paper, we adopt the diameter λ_t(x) of the rectangle Rect_t(x) = [l_t^{(F₁)}(x), u_t^{(F₁)}(x)] × [l_t^{(F₂)}(x), u_t^{(F₂)}(x)] as the uncertainty of x:

λ_t(x) = max_{y, y′ ∈ Rect_t(x)} ‖y − y′‖.   (8)

Namely, the next observation point x_t is defined by x_t = argmax_{x ∈ M_t ∪ ˆΠ_t} λ_t(x) at every step t.

Our proposed algorithm terminates when the estimated Pareto set ˆΠ_t is guaranteed to be an ε-accurate Pareto set with high probability. To this end, our algorithm checks the uncertainty set U_t defined by

U_t = { x ∈ ˆΠ_t | ∃x′ ∈ ˆΠ_t \ {x}, F_t^{(pes)}(x) + ε ≺ F_t^{(opt)}(x′) }.

Intuitively, U_t is the set of points for which it is not possible to decide, based on the current confidence bounds, whether they are ε-Pareto solutions. Our algorithm terminates at a step t where both M_t = ∅ and U_t = ∅ hold. Hereafter, we call this algorithm Multi-Objective MVA-BO (MO-MVA-BO). The pseudo-code of MO-MVA-BO is shown as Algorithm 2.
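Two ingredients of MO-MVA-BO can be sketched compactly: the pessimistic Pareto set (7), implemented with the standard weak-domination rule, and the rectangle diameter λ_t in (8). Both assume the four bound arrays have already been computed as in section 3.1.

```python
import numpy as np

def pessimistic_pareto_set(F_pes):
    """Estimated Pareto set (7): keep x unless some x' with a different
    pessimistic vector weakly dominates it. F_pes has shape (n, 2)."""
    keep = []
    for i in range(len(F_pes)):
        different = np.any(F_pes != F_pes[i], axis=1)
        dominated = np.all(F_pes >= F_pes[i], axis=1) & different
        if not np.any(dominated):
            keep.append(i)
    return np.array(keep)

def diameter(l1, u1, l2, u2):
    """lambda_t(x) in (8): the diagonal length of Rect_t(x)."""
    return np.hypot(u1 - l1, u2 - l2)
```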
3.3 Extensions

In this section, we consider several extensions of the proposed method to deal with situations that arise in some practical applications, leaving the details to appendix C.

3.3.1 Unknown Input Distribution

Thus far, we have assumed that p(w) is known; however, this assumption is sometimes unrealistic. To deal with the case where p(w) is unknown, one simple way is to estimate p(w) during the optimization process. For example, if we estimate p(w) by an empirical distribution, we can apply our algorithm by replacing p(w) with the following ˜p_t(w) when computing the confidence bounds:

˜p_t(w) = (1/t) Σ_{t′=1}^{t} 1[w_{t′} = w].

As a more advanced method, it may be possible to consider an extension to the distributionally robust setting [26, 27]; however, we leave this as future work.
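A minimal sketch of the empirical estimate ˜p_t on a finite grid; the helper name and grid are illustrative.

```python
import numpy as np
from collections import Counter

def empirical_p(w_history, w_grid):
    """Empirical estimate p~_t(w) = (1/t) * #{t' : w_{t'} = w} on a finite grid."""
    counts = Counter(w_history)
    return np.array([counts[w] / len(w_history) for w in w_grid])
```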
3.3.2 Noisy Input Setting

One setting similar to that in this paper is the noisy input setting [14, 28]. In this setting, the observation point x_t is perturbed by noise ξ_t ∈ ∆ which follows a known density p(ξ) defined over ∆. At every step t, the user chooses x_t and observes y_t as y_t = f(x_t + ξ_t) + η_t, ξ_t ∼ p(ξ). Our problem can be extended to the noisy input setting by defining F₁ and F₂ through the expectation E_ξ[f(x + ξ)] and the variance V_ξ[f(x + ξ)] defined as follows:

E_ξ[f(x + ξ)] = ∫_∆ f(x + ξ) p(ξ) dξ,   (9)
V_ξ[f(x + ξ)] = ∫_∆ { f(x + ξ) − E_ˆξ[f(x + ˆξ)] }² p(ξ) dξ.   (10)

We can apply the same algorithms as those in section 3.2 by constructing the confidence bounds in a way similar to that in section 3.1.

3.3.3 Simulator-based Setting

In some applications, the variable w can be controlled during the optimization; for example, when the user runs the optimization process by evaluating f(x, w) with a computer simulation. Such scenarios have often been considered in similar studies reported in the BO literature that assume the existence of an uncontrollable variable w [13, 26, 27]. Our method can be extended to such a scenario by choosing w_t according to w_t = argmax_{w∈Ω} σ_{t−1}(x_t, w) after the selection of x_t.

4 Theoretical Analysis

In this section, we show the theoretical results for the proposed algorithms. The details of the proofs are given in appendix A. First, we introduce the maximum information gain [5] as a sample complexity parameter of a GP. Let A = {a₁, . . . , a_T} be a finite subset of X × Ω,
and y_A be a vector whose ith element is y_{a_i} = f(a_i) + ε_{a_i}. The maximum information gain γ_T at step T is defined by

γ_T = max_{A ⊂ X×Ω; |A| = T} I(y_A; f),

where I(y_A; f) denotes the mutual information between y_A and f. The maximum information gain γ_T is often used in BO, and analytical upper bounds on it have been derived for commonly used kernels [5].
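For a GP with Gaussian observation noise, I(y_A; f) has the closed form ½ ln det(I_T + σ⁻² K_A), so the information gain of any candidate set can be evaluated directly; a sketch:

```python
import numpy as np

def information_gain(K, noise_var):
    """I(y_A; f) = 0.5 * ln det(I_T + sigma^{-2} K_A) under Gaussian noise."""
    T = K.shape[0]
    sign, logdet = np.linalg.slogdet(np.eye(T) + K / noise_var)
    return 0.5 * logdet

# gamma_T maximizes this quantity over all size-T subsets of X x Omega;
# exact maximization is combinatorial, and greedy selection is a common
# near-optimal surrogate because the information gain is submodular.
```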
The following two theorems show the convergence properties of the proposed algorithms for the multi-task and multi-objective scenarios, respectively.

Theorem 4.1. Fix a positive definite kernel k, and assume f ∈ H_k with ‖f‖_{H_k} ≤ B. Let δ ∈ (0, 1) and ε > 0, and set β_t according to β_t = ( √( ln det(I_t + σ⁻² K_t) + 2 ln(3/δ) ) + B )² at every step t. Furthermore, for any t ≥ 1, define ˆx_t by

ˆx_t = argmax_{x_{t′} ∈ {x₁,...,x_t}} l_{t′}^{(G)}(x_{t′}).

When applying MT-MVA-BO under the above conditions, with probability at least 1 − δ, ˆx_T is an ε-accurate solution, where T is the smallest positive integer which satisfies the following inequality:

4αT⁻¹ β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + (1 − α) T⁻¹ √( 16T ˜B β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + 40T β_T ( C₁ γ_T + C₂ ) ) ≤ ε.   (11)

Here, ˜B = max_{(x,w) ∈ X×Ω} |f(x, w) − E_w[f(x, w)]|, C₁ = 2/ln(1 + σ⁻²), and C₂ = 16 ln(18/δ).
Theorem 4.2. Fix a positive definite kernel k, and assume f ∈ H_k with ‖f‖_{H_k} ≤ B. Let δ ∈ (0, 1) and ε = (ε₁, ε₂) with ε₁, ε₂ > 0, and set β_t according to β_t = ( √( ln det(I_t + σ⁻² K_t) + 2 ln(3/δ) ) + B )² at every step t. When applying MO-MVA-BO under the above conditions, the following statements 1 and 2 hold with probability at least 1 − δ:

1. The algorithm terminates after at most T steps, where T is the smallest positive integer that satisfies the following inequality:

4T⁻¹ β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + T⁻¹ √( 16T ˜B β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + 40T β_T ( C₁ γ_T + C₂ ) ) ≤ min{ε₁, ε₂}.   (12)

Here, ˜B = max_{(x,w) ∈ X×Ω} |f(x, w) − E_w[f(x, w)]|, C₁ = 2/ln(1 + σ⁻²), and C₂ = 16 ln(18/δ).

2. When the algorithm terminates, the estimated Pareto set ˆΠ_t is an ε-accurate Pareto set.

The first term β_T^{1/2}( √(T C₁ γ_T) + C₂ ) on the left-hand side of (11) and (12) also appears (up to constants) in the theoretical results for existing algorithms which only consider the expectation F₁ (e.g., Theorem 2 in [26]). The second term √( 16T ˜B β_T^{1/2}( √(T C₁ γ_T) + C₂ ) + 40T β_T ( C₁ γ_T + C₂ ) ) is specific to our problem. This term depends on the complexity parameter ˜B, which quantifies the variation of the function f(x, w) around its expectation.

5 Numerical Experiments
In this section, we demonstrate the performance of the proposed methods through numerical experiments. As baseline methods in both the multi-task and multi-objective scenarios, we adopted random sampling (RS) and uncertainty sampling (US). RS chooses x_t from X uniformly at random, and US chooses the x_t that achieves the largest average posterior variance, x_t = argmax_{x∈X} ∫_Ω σ²_{t−1}(x, w) p(w) dw. To measure the performance in the multi-task scenario, we computed the regret G(x*) − G(ˆx_t) at every step t, where ˆx_t is the estimated solution defined by each algorithm. We defined ˆx_t as ˆx_t = argmax_{t′=1,...,t} l_t^{(G)}(x_{t′}) in RS, US, and the proposed method (MT-MVA-BO). Furthermore, we set α = 0.5. In the multi-objective scenario, we used the hyper-volume gap HV − ĤV_t to measure the performance, where HV and ĤV_t denote the hyper-volumes computed based on the true Pareto set Π and the estimated Pareto set ˆΠ_t, respectively. The hyper-volume gap measures how close the estimated Pareto front is to the true Pareto front; a minimal sketch of the 2-D hyper-volume computation is given below. We defined ˆΠ_t by (7) in RS, US, and the proposed method (MO-MVA-BO). Furthermore, in the multi-task scenario, to show the effect of the difference in objective functions, we also adopted two methods, BQOUCB [26, 27] and BO-VO. BQOUCB is an existing method which aims to maximize F₁, and BO-VO is the variant of our method corresponding to the case α = 0. These methods choose x_t as the maximizer of u_t^{(F₁)}(x) and u_t^{(F₂)}(x), respectively; the estimated solution ˆx_t is defined by ˆx_t = argmax_{t′=1,...,t} l_t^{(F₁)}(x_{t′}) and ˆx_t = argmax_{t′=1,...,t} l_t^{(F₂)}(x_{t′}), respectively. Moreover, we also make comparisons with the adaptive versions of these methods, ADA-BQOUCB and ADA-BO-VO. ADA-BQOUCB and ADA-BO-VO choose x_t in the same way as BQOUCB and BO-VO, but the estimated solutions are defined as ˆx_t = argmax_{t′=1,...,t} l_t^{(G)}(x_{t′}).
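A minimal sketch of the 2-D hyper-volume used for ĤV_t, assuming both objectives are maximized and a reference point dominated by the whole front; this standard sweep is an illustration, not the authors' code.

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Area dominated by a 2-D Pareto front relative to reference point `ref`
    (both objectives maximized; `front` has shape (m, 2))."""
    pts = front[np.argsort(-front[:, 0])]   # sort by F1 descending
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 > prev_f2:                    # only non-dominated steps add area
            hv += (f1 - ref[0]) * (f2 - prev_f2)
            prev_f2 = f2
    return hv
```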
5.1 Artificial-data Experiments

In this subsection, we show the results of the artificial-data experiments.

GP Test Functions
We experimented with true oracle functions f generated from a 2D GP prior. First, we divided [−1, 1] into 25 uniformly spaced grid points in each dimension and generated a sample path from the GP prior. Next, we created a GP model with these grid points and set the true oracle function to its GP posterior mean. In this experiment, we created 50 sample paths from different seeds and conducted 10 experiments for each function; thus, we report the average performance over a total of 500 experiments. To create a GP sample path, we used the Gaussian kernel k((x, w), (x′, w′)) = σ_ker² exp( −(‖x − x′‖² + ‖w − w′‖²)/l² ) with σ_ker² = 1 and l = 0.25, which was also used to construct the confidence bounds in the algorithms. Furthermore, we set the noise variance σ² to a small fixed value. In addition, we divided [−1, 1] into 100 uniformly spaced grid points and set X and Ω as these grid points. Moreover, we defined p(w) by p(w) = φ(w)/Z with Z = Σ_{w∈Ω} φ(w), where φ is the density function of the standard normal distribution.
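A sketch of this test-function construction, reusing rbf_kernel and gp_posterior from the earlier sketch; the seed, jitter, and grid sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(-1, 1, 25)
X1, X2 = np.meshgrid(grid, grid)
Zg = np.column_stack([X1.ravel(), X2.ravel()])

# Draw one sample path of the zero-mean GP prior on the 25x25 grid.
K = rbf_kernel(Zg, Zg) + 1e-8 * np.eye(len(Zg))  # jitter for stability
sample = np.linalg.cholesky(K) @ rng.standard_normal(len(Zg))

# The oracle is the posterior mean of a GP fitted to (Zg, sample); evaluating
# it at new (x, w) gives a smooth interpolant of the sample path.
def oracle(x, w):
    mu, _ = gp_posterior(Zg, sample, np.atleast_2d([x, w]))
    return float(mu[0])
```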
Benchmark Functions of Optimization

We also experimented with the Bird function (2D) and the Rosenbrock function (3D), which are often used as benchmark functions in the field of optimization. First, we scaled the input domain to [−1, 1] and divided it into 100 grid points in each dimension. For the Bird function, we set X and Ω as the grid points of the first and second dimensions, respectively. For the Rosenbrock function, we set Ω as the grid points of the third dimension and the remaining points as X. Furthermore, we set p(w) in the same way as in the experiments on the GP test functions. We used the ARD Gaussian kernel k((x, w), (x′, w′)) = σ_ker² exp( −Σ_{i=1}^{d₁} (x_i − x′_i)²/l_i^{(x)2} − Σ_{j=1}^{d₂} (w_j − w′_j)²/l_j^{(w)2} ), and tuned its hyperparameters by maximizing the marginal likelihood every 10 steps in the algorithms. Furthermore, we set the noise variance σ² to a small fixed value and report the average performance of 100 simulations with different seeds.

Figure 2 shows the results of the artificial-data experiments. We confirmed that the proposed methods achieve better performance than the other methods. In the experiments of the multi-task scenario, we also confirmed that the regrets of BQOUCB, BO-VO, ADA-BQOUCB, and ADA-BO-VO stop decreasing at an early stage. These are reasonable results because the objective functions of these methods are inconsistent with our setting.
5.2 Real-data Experiments

We applied the proposed methods to the newsvendor problem under dynamic consumer substitution [29], whose goal is to optimize the initial inventory levels under uncertainty of customer behavior. The parameters x and w correspond to the initial inventory levels of the products and the uncertain purchasing behaviors of the customers, respectively; the latter follow mutually independent Gamma distributions. The goal of this problem is to find the x which optimizes the profit f(x, w) under the uncertainty of w. For this problem, we conducted the experiments in the simulator-based setting described in section 3.3.3, because the profit f(x, w) can be evaluated by a computer simulation. Figure 3 shows the average performance of 100 simulations with different seeds.
[Figure 2: Average performances in the artificial-data experiments (panels: GP Test Functions, Bird, Rosenbrock; horizontal axes: Iteration; vertical axes: Regret (top row) and Hyper-volume Gap (bottom row); compared methods: RS, US, BQOUCB, ADA-BQOUCB, BO-VO, ADA-BO-VO, MT-MVA-BO / MO-MVA-BO). The error bars represent 2 × [standard error]. The top and bottom figures show the results of the multi-task (α = 0.5) and multi-objective scenarios, respectively.]
[Figure 3: The results of the experiments on the newsvendor problem (horizontal axes: Iteration; vertical axes: Regret and Hyper-volume Gap; methods as in Figure 2). The left and right figures present the results of the multi-task (α = 0.5) and multi-objective scenarios, respectively.]
6 Conclusion

We introduced a novel Bayesian optimization framework, MVA-BO, which simultaneously considers two objective functions, the expectation and the variance, in an uncertain environment. In this framework, we considered three scenarios that often appear in real-world applications: multi-task, multi-objective, and constrained optimization. We established rigorous convergence properties of our MVA-BO algorithms and demonstrated their effectiveness through both artificial- and real-data experiments.
Acknowledgement
This work was partially supported by MEXT KAKENHI (20H00601, 16H06538), JST CREST (JPMJCR1502), and the RIKEN Center for Advanced Intelligence Project.
References

[1] Harry M. Markowitz. Portfolio selection. Journal of Finance, 7(1):77–91, 1952.
[2] Harry M. Markowitz and G. Peter Todd. Mean-Variance Analysis in Portfolio Choice and Capital Markets, volume 66. John Wiley & Sons, 2000.
[3] Michael C. Keeley and Frederick T. Furlong. A reexamination of mean-variance analysis of bank capital regulation. Journal of Banking & Finance, 14(1):69–84, 1990.
[4] Anthony O'Hagan. Bayes–Hermite quadrature. Journal of Statistical Planning and Inference, 29(3):245–260, 1991.
[5] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 1015–1022. Omnipress, 2010.
[6] Kevin Swersky, Jasper Snoek, and Ryan P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004–2012, 2013.
[7] Michael Emmerich. Single- and multi-objective evolutionary design optimization assisted by Gaussian random field metamodels. Dissertation, LS11, FB Informatik, Universität Dortmund, Germany, 2005.
[8] Marcela Zuluaga, Andreas Krause, and Markus Püschel. ε-PAL: An active learning approach to the multi-objective optimization problem. Journal of Machine Learning Research, 17(104):1–32, 2016.
[9] Shinya Suzuki, Shion Takeno, Tomoyuki Tamura, Kazuki Shitara, and Masayuki Karasuyama. Multi-objective Bayesian optimization using Pareto-frontier entropy. In Proceedings of Machine Learning and Systems 2020, pages 10841–10850, 2020.
[10] Jacob R. Gardner, Matt J. Kusner, Zhixiang Eddie Xu, Kilian Q. Weinberger, and John P. Cunningham. Bayesian optimization with inequality constraints. In ICML, volume 2014, pages 937–945, 2014.
[11] Michael A. Gelbart, Jasper Snoek, and Ryan P. Adams. Bayesian optimization with unknown constraints. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI'14, pages 250–259, Arlington, Virginia, USA, 2014. AUAI Press.
[12] José Miguel Hernández-Lobato, Michael A. Gelbart, Ryan P. Adams, Matthew W. Hoffman, and Zoubin Ghahramani. A general framework for constrained Bayesian optimization using information-based search. The Journal of Machine Learning Research, 17(1):5549–5601, 2016.
[13] Saul Toscano-Palmerin and Peter I. Frazier. Bayesian optimization with expensive integrands. CoRR, abs/1803.08661, 2018.
[14] Justin J. Beland and Prasanth B. Nair. Bayesian optimization under uncertainty. In NIPS BayesOpt 2017 Workshop, 2017.
[15] Shogo Iwazaki, Yu Inatsu, and Ichiro Takeuchi. Bayesian quadrature optimization for probability threshold robustness measure. arXiv preprint arXiv:2006.11986, 2020.
[16] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization, volume 28. Princeton University Press, 2009.
[17] Hans-Georg Beyer and Bernhard Sendhoff. Robust optimization – a comprehensive survey. Computer Methods in Applied Mechanics and Engineering, 196(33-34):3190–3218, 2007.
[18] Aharon Ben-Tal and Arkadi Nemirovski. Robust optimization – methodology and applications. Mathematical Programming, 92(3):453–480, 2002.
[19] Alexander Schied. Risk measures and robust optimization problems. Stochastic Models, 22(4):753–831, 2006.
[20] Gordon J. Alexander and Alexandre M. Baptista. Economic implications of using a mean-VaR model for portfolio selection: A comparison with mean-variance analysis. Journal of Economic Dynamics and Control, 26(7-8):1159–1193, 2002.
[21] Frank J. Fabozzi, Petter N. Kolm, Dessislava A. Pachamanova, and Sergio M. Focardi. Robust portfolio optimization. The Journal of Portfolio Management, 33(3):40–48, 2007.
[22] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[23] Yasin Abbasi-Yadkori. Online learning for linearly parametrized control problems. 2013.
[24] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning, pages 997–1005, 2015.
[25] Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimization with Gaussian processes. In Advances in Neural Information Processing Systems, pages 5760–5770, 2018.
[26] Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka, and Andreas Krause. Distributionally robust Bayesian optimization. In Silvia Chiappa and Roberto Calandra, editors, The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pages 2174–2184. PMLR, 2020.
[27] Thanh Nguyen, Sunil Gupta, Huong Ha, Santu Rana, and Svetha Venkatesh. Distributionally robust Bayesian quadrature optimization. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1921–1931, Online, 26–28 Aug 2020. PMLR.
[28] Lukas Fröhlich, Edgar Klenske, Julia Vinogradska, Christian Daniel, and Melanie Zeilinger. Noisy-input entropy search for efficient robust Bayesian optimization. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 2262–2272, Online, 26–28 Aug 2020. PMLR.
[29] Siddharth Mahajan and Garrett Van Ryzin. Stocking retail assortments under dynamic consumer substitution. Operations Research, 49(3):334–351, 2001.
[30] Johannes Kirschner and Andreas Krause. Information directed sampling and bandits with heteroscedastic noise. In Proc. International Conference on Learning Theory (COLT), July 2018.
[31] Yanan Sui, Vincent Zhuang, Joel W. Burdick, and Yisong Yue. Stagewise safe Bayesian optimization with Gaussian processes. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 4788–4796. PMLR, 2018.
A Proofs
A.1 Proof of Theorem 4.1
From the definition of β_t and Lemma 2.1, the following holds with probability at least 1 − δ/3:

|f(x, w) − µ_{t−1}(x, w)| ≤ β_t^{1/2} σ_{t−1}(x, w),  ∀x ∈ X, ∀w ∈ Ω, ∀t ≥ 1.   (13)

Moreover, we give the following lemma about the confidence bound Q_t^{(G)}(x_t):

Lemma A.1.
Assume that (13) holds. Then, for any T ≥ 1, it holds that

Σ_{t=1}^{T} { u_t^{(G)}(x_t) − l_t^{(G)}(x_t) } ≤ 2α β_T^{1/2} Σ_{t=1}^{T} ∫_Ω σ_{t−1}(x_t, w) p(w) dw + (1 − α) √( 8T ˜B β_T^{1/2} Σ_{t=1}^{T} ∫_Ω σ_{t−1}(x_t, w) p(w) dw + 20T β_T Σ_{t=1}^{T} ∫_Ω σ²_{t−1}(x_t, w) p(w) dw ),

where ˜B = max_{(x,w) ∈ X×Ω} |f(x, w) − E_w[f(x, w)]|.

Proof. From the definition of u_t^{(G)} and l_t^{(G)}, we have

Σ_{t=1}^{T} { u_t^{(G)}(x_t) − l_t^{(G)}(x_t) } = α Σ_{t=1}^{T} { u_t^{(F₁)}(x_t) − l_t^{(F₁)}(x_t) } + (1 − α) Σ_{t=1}^{T} { u_t^{(F₂)}(x_t) − l_t^{(F₂)}(x_t) }.   (14)

Similarly, from the definition of u_t^{(F₁)} and l_t^{(F₁)}, we get the following inequality:

Σ_{t=1}^{T} { u_t^{(F₁)}(x_t) − l_t^{(F₁)}(x_t) } = Σ_{t=1}^{T} ∫_Ω { u_t(x_t, w) − l_t(x_t, w) } p(w) dw = 2 Σ_{t=1}^{T} β_t^{1/2} ∫_Ω σ_{t−1}(x_t, w) p(w) dw ≤ 2 β_T^{1/2} Σ_{t=1}^{T} ∫_Ω σ_{t−1}(x_t, w) p(w) dw.   (15)

Here, the last inequality is given by the monotonicity of β_t. In addition, noting the definition of u_t^{(F₂)} and l_t^{(F₂)}, we obtain

u_t^{(F₂)}(x_t) − l_t^{(F₂)}(x_t) = √( ∫_Ω ˜u_t^{(sq)}(x_t, w) p(w) dw ) − √( ∫_Ω ˜l_t^{(sq)}(x_t, w) p(w) dw ) ≤ √( ∫_Ω { ˜u_t^{(sq)}(x_t, w) − ˜l_t^{(sq)}(x_t, w) } p(w) dw ),   (16)

where the last inequality follows from the fact that √a − √b ≤ √(a − b) for any a ≥ b ≥ 0. Furthermore, we have

˜u_t^{(sq)}(x_t, w) − ˜l_t^{(sq)}(x_t, w) = max{ ˜l_t²(x_t, w), ˜u_t²(x_t, w) } − min{ ˜l_t²(x_t, w), ˜u_t²(x_t, w) } + STR_t(x_t, w),   (17)

where STR_t(x_t, w) = max{ 0, min( ˜u_t(x_t, w), −˜l_t(x_t, w) ) }². Moreover, we define ˜µ_{t−1}(x, w) and ˜σ_{t−1}(x, w) as

˜µ_{t−1}(x, w) = µ_{t−1}(x, w) − E_w[µ_{t−1}(x, w)],  ˜σ_{t−1}(x, w) = σ_{t−1}(x, w) + E_w[σ_{t−1}(x, w)].

Then, ˜l_t(x, w) and ˜u_t(x, w) can be expressed as follows:

˜l_t(x, w) = ˜µ_{t−1}(x, w) − β_t^{1/2} ˜σ_{t−1}(x, w),  ˜u_t(x, w) = ˜µ_{t−1}(x, w) + β_t^{1/2} ˜σ_{t−1}(x, w).

If ˜l_t²(x_t, w) ≤ ˜u_t²(x_t, w), then we have ˜µ_{t−1}(x_t, w) ≥ 0 and

max{ ˜l_t², ˜u_t² } − min{ ˜l_t², ˜u_t² } = { ˜µ_{t−1} + β_t^{1/2} ˜σ_{t−1} }² − { ˜µ_{t−1} − β_t^{1/2} ˜σ_{t−1} }² = 4 β_t^{1/2} ˜µ_{t−1} ˜σ_{t−1} = 4 β_t^{1/2} |˜µ_{t−1}| ˜σ_{t−1}.

On the other hand, if ˜l_t²(x_t, w) > ˜u_t²(x_t, w), then ˜µ_{t−1}(x_t, w) < 0 and

max{ ˜l_t², ˜u_t² } − min{ ˜l_t², ˜u_t² } = { ˜µ_{t−1} − β_t^{1/2} ˜σ_{t−1} }² − { ˜µ_{t−1} + β_t^{1/2} ˜σ_{t−1} }² = −4 β_t^{1/2} ˜µ_{t−1} ˜σ_{t−1} = 4 β_t^{1/2} |˜µ_{t−1}| ˜σ_{t−1}.

Therefore, in all cases the following equality holds:

max{ ˜l_t²(x_t, w), ˜u_t²(x_t, w) } − min{ ˜l_t²(x_t, w), ˜u_t²(x_t, w) } = 4 β_t^{1/2} |˜µ_{t−1}(x_t, w)| ˜σ_{t−1}(x_t, w).

Next, since (13) holds, we get f(x, w) − E_w[f(x, w)] ∈ [˜l_t(x, w), ˜u_t(x, w)]. This implies that |f(x, w) − E_w[f(x, w)] − ˜µ_{t−1}(x, w)| ≤ β_t^{1/2} ˜σ_{t−1}(x, w). Hence, we have

|˜µ_{t−1}(x, w)| ≤ |f(x, w) − E_w[f(x, w)]| + β_t^{1/2} ˜σ_{t−1}(x, w) ≤ ˜B + β_t^{1/2} ˜σ_{t−1}(x, w).

Thus, the following inequality holds:

max{ ˜l_t², ˜u_t² } − min{ ˜l_t², ˜u_t² } ≤ 4 β_t^{1/2} ˜σ_{t−1}(x_t, w) { ˜B + β_t^{1/2} ˜σ_{t−1}(x_t, w) } = 4 ˜B β_t^{1/2} ˜σ_{t−1}(x_t, w) + 4 β_t ˜σ²_{t−1}(x_t, w).   (18)

Moreover, STR_t(x_t, w) can be bounded as

STR_t(x_t, w) ≤ { ( ˜u_t(x_t, w) − ˜l_t(x_t, w) ) / 2 }² = β_t ˜σ²_{t−1}(x_t, w).   (19)

Hence, from (17), (18) and (19), we obtain

˜u_t^{(sq)}(x_t, w) − ˜l_t^{(sq)}(x_t, w) ≤ 4 ˜B β_t^{1/2} ˜σ_{t−1}(x_t, w) + 5 β_t ˜σ²_{t−1}(x_t, w)

and

∫_Ω { ˜u_t^{(sq)}(x_t, w) − ˜l_t^{(sq)}(x_t, w) } p(w) dw ≤ 4 ˜B β_t^{1/2} ∫_Ω ˜σ_{t−1}(x_t, w) p(w) dw + 5 β_t ∫_Ω ˜σ²_{t−1}(x_t, w) p(w) dw.

In addition, from the definition of ˜σ_{t−1}(x_t, w), the following holds:

∫_Ω ˜σ_{t−1}(x_t, w) p(w) dw = E_w[σ_{t−1}(x_t, w)] + ∫_Ω σ_{t−1}(x_t, w) p(w) dw = 2 ∫_Ω σ_{t−1}(x_t, w) p(w) dw,
∫_Ω ˜σ²_{t−1}(x_t, w) p(w) dw = ∫_Ω σ²_{t−1}(x_t, w) p(w) dw + 3 { ∫_Ω σ_{t−1}(x_t, w) p(w) dw }² ≤ 4 ∫_Ω σ²_{t−1}(x_t, w) p(w) dw.

Here, the last inequality is obtained by using Jensen's inequality and the convexity of g(x) = x². Therefore, we have

∫_Ω { ˜u_t^{(sq)}(x_t, w) − ˜l_t^{(sq)}(x_t, w) } p(w) dw ≤ 8 ˜B β_t^{1/2} ∫_Ω σ_{t−1}(x_t, w) p(w) dw + 20 β_t ∫_Ω σ²_{t−1}(x_t, w) p(w) dw.   (20)

Thus, by using (20) and Schwarz's inequality for (16), we get

Σ_{t=1}^{T} { u_t^{(F₂)}(x_t) − l_t^{(F₂)}(x_t) } ≤ √( 8T ˜B β_T^{1/2} Σ_{t=1}^{T} ∫_Ω σ_{t−1}(x_t, w) p(w) dw + 20T β_T Σ_{t=1}^{T} ∫_Ω σ²_{t−1}(x_t, w) p(w) dw ).   (21)

Therefore, from (14), (15) and (21), we have the desired inequality. □

Next, in order to evaluate Σ_{t=1}^{T} ∫_Ω σ_{t−1}(x_t, w) p(w) dw and Σ_{t=1}^{T} ∫_Ω σ²_{t−1}(x_t, w) p(w) dw on the right-hand side of the inequality of Lemma A.1, we introduce the following lemma given by [30]:

Lemma A.2.
Let S_t be any non-negative stochastic process adapted to a filtration {F_t}, and define m_t = E[S_t | F_{t−1}]. Assume that S_t ≤ K for some K ≥ 1. Then, for any T ≥ 1, the following holds with probability at least 1 − δ:

Σ_{t=1}^{T} m_t ≤ 2 Σ_{t=1}^{T} S_t + 8K ln(6K/δ).
Furthermore, from the assumption on the kernel function, we get σ²_{t−1}(x_t, w) ≤ k((x_t, w), (x_t, w)) ≤ 1. Hence, from Lemma A.2, with probability at least 1 − δ/3, it holds that

Σ_{t=1}^{T} ∫_Ω σ_{t−1}(x_t, w) p(w) dw ≤ 2 Σ_{t=1}^{T} σ_{t−1}(x_t, w_t) + 8 ln(18/δ).   (22)

Similarly, the following inequality holds with probability at least 1 − δ/3:

Σ_{t=1}^{T} ∫_Ω σ²_{t−1}(x_t, w) p(w) dw ≤ 2 Σ_{t=1}^{T} σ²_{t−1}(x_t, w_t) + 8 ln(18/δ).   (23)

In addition, we introduce the following lemma given by [5] about the maximum information gain γ_T:

Lemma A.3.
Fix T ≥ 1. Then, the following inequality holds:

Σ_{t=1}^{T} σ²_{t−1}(x_t, w_t) ≤ C₁ γ_T, where C₁ = 2/ln(1 + σ⁻²).   (24)

Moreover, from Schwarz's inequality and Lemma A.3, we get the following inequality:

Σ_{t=1}^{T} σ_{t−1}(x_t, w_t) ≤ √( T Σ_{t=1}^{T} σ²_{t−1}(x_t, w_t) ) ≤ √( T C₁ γ_T ).   (25)

Thus, from (22), (23), (24) and (25), we obtain the following corollary:

Corollary A.1.
Assume that (13), (22) and (23) hold. Then, for any T ≥ 1, it holds that

Σ_{t=1}^{T} { u_t^{(G)}(x_t) − l_t^{(G)}(x_t) } ≤ 4α β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + (1 − α) √( 16T ˜B β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + 40T β_T ( C₁ γ_T + C₂ ) ),

where C₁ = 2/ln(1 + σ⁻²) and C₂ = 16 ln(18/δ).

Proof. From Lemma A.1, (22) and (23), it holds that

Σ_{t=1}^{T} { u_t^{(G)}(x_t) − l_t^{(G)}(x_t) } ≤ 4α β_T^{1/2} { Σ_{t=1}^{T} σ_{t−1}(x_t, w_t) + 4 ln(18/δ) } + (1 − α) √( 16T ˜B β_T^{1/2} { Σ_{t=1}^{T} σ_{t−1}(x_t, w_t) + 4 ln(18/δ) } + 40T β_T { Σ_{t=1}^{T} σ²_{t−1}(x_t, w_t) + 4 ln(18/δ) } ).   (26)

Therefore, by combining (24), (25), (26) and the fact that 4 ln(18/δ) ≤ C₂, we get the desired inequality. □

Finally, we prove Theorem 4.1. Let T ≥ 1, and define ˆT = argmax_{t=1,...,T} l_t^{(G)}(x_t). Assume that (13) holds. Then, for any x ∈ X, it holds that G(x) ∈ [l_t^{(G)}(x), u_t^{(G)}(x)]. Thus, for any t′ = 1, . . . , T, we get

G(x*) − G(ˆx_T) ≤ u_{t′}^{(G)}(x_{t′}) − l_{ˆT}^{(G)}(ˆx_T) = u_{t′}^{(G)}(x_{t′}) − max_{t=1,...,T} l_t^{(G)}(x_t) ≤ u_{t′}^{(G)}(x_{t′}) − l_{t′}^{(G)}(x_{t′}).

This implies that

G(x*) − G(ˆx_T) ≤ (1/T) Σ_{t=1}^{T} { u_t^{(G)}(x_t) − l_t^{(G)}(x_t) }.   (27)

Here, note that with probability at least 1 − δ, (13), (22) and (23) all hold. Therefore, by combining Corollary A.1, the following holds with probability at least 1 − δ:

G(x*) − G(ˆx_T) ≤ 4αT⁻¹ β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + (1 − α) T⁻¹ √( 16T ˜B β_T^{1/2} ( √(T C₁ γ_T) + C₂ ) + 40T β_T ( C₁ γ_T + C₂ ) ).

Hence, if T satisfies (11), with probability at least 1 − δ, it holds that G(x*) − G(ˆx_T) ≤ ε. Therefore, ˆx_T is an ε-accurate solution. □

A.2 Proof of Theorem 4.2
In this subsection, we prove Theorem 4.2. First, we show several lemmas.
Lemma A.4.
For any t ≥ 1, ˆΠ_t has at least one element (i.e., ˆΠ_t ≠ ∅).

Proof. Let t ≥ 1. We define ˜x_t and x_t† as

˜x_t = argmax_{x∈X} l_t^{(F₁)}(x),  x_t† = argmax_{x∈X; l_t^{(F₁)}(x) = l_t^{(F₁)}(˜x_t)} l_t^{(F₂)}(x).

First, assume that E_{t,x_t†}^{(pes)} = ∅. Then, it holds that

∀x′ ∈ ∅ = E_{t,x_t†}^{(pes)}, F_t^{(pes)}(x_t†) ⋠ F_t^{(pes)}(x′).

This implies that x_t† ∈ ˆΠ_t. On the other hand, if E_{t,x_t†}^{(pes)} ≠ ∅, then the following holds for any x′ ∈ E_{t,x_t†}^{(pes)}:

l_t^{(F₁)}(x_t†) = l_t^{(F₁)}(˜x_t) ≥ l_t^{(F₁)}(x′).

Here, if l_t^{(F₁)}(x_t†) > l_t^{(F₁)}(x′), it holds that F_t^{(pes)}(x_t†) ⋠ F_t^{(pes)}(x′). Similarly, if l_t^{(F₁)}(x_t†) = l_t^{(F₁)}(x′), it holds that l_t^{(F₂)}(x_t†) ≥ l_t^{(F₂)}(x′). Noting that F_t^{(pes)}(x_t†) ≠ F_t^{(pes)}(x′) and l_t^{(F₁)}(x_t†) = l_t^{(F₁)}(x′), we have l_t^{(F₂)}(x_t†) > l_t^{(F₂)}(x′). Thus, we have F_t^{(pes)}(x_t†) ⋠ F_t^{(pes)}(x′). From the definition of ˆΠ_t, we get x_t† ∈ ˆΠ_t. □

Lemma A.5.
Let t ≥ 1, and assume that M_t ≠ ∅. Also let x^(1) be an element of M_t. Then, there exists an element x′ ∈ ˆΠ_t such that F_t^{(pes)}(x^(1)) ⪯ F_t^{(pes)}(x′).

Proof. Let t ≥ 1, M_t ≠ ∅ and x^(1) ∈ M_t. Assume that the following holds:

F_t^{(pes)}(x^(1)) ⋠ F_t^{(pes)}(x′),  ∀x′ ∈ ˆΠ_t.   (28)

From the definition of M_t, we have x^(1) ∉ ˆΠ_t. Since x^(1) ∉ ˆΠ_t, there exists x^(2) ∈ E_{t,x^(1)}^{(pes)} such that F_t^{(pes)}(x^(1)) ⪯ F_t^{(pes)}(x^(2)); by (28), x^(2) ∉ ˆΠ_t. Therefore, there exists x^(3) ∈ E_{t,x^(2)}^{(pes)} such that F_t^{(pes)}(x^(2)) ⪯ F_t^{(pes)}(x^(3)). Furthermore, by combining F_t^{(pes)}(x^(1)) ⪯ F_t^{(pes)}(x^(2)) and F_t^{(pes)}(x^(2)) ⪯ F_t^{(pes)}(x^(3)), we get F_t^{(pes)}(x^(1)) ⪯ F_t^{(pes)}(x^(3)). Thus, from (28) we obtain x^(3) ∉ ˆΠ_t. By repeating the same argument, we obtain x^(1), . . . , x^(|X|), where x^(k) ∉ ˆΠ_t for k = 1, . . . , |X|. Next, we show that x^(i) ≠ x^(j) for any i and j with i ≠ j. In fact, if there exist i and j with i < j such that x^(i) = x^(j), we get F_t^{(pes)}(x^(i)) = F_t^{(pes)}(x^(j)). Here, from i ≤ j − 1, noting the definition of x^(i) and x^(j−1), we get

F_t^{(pes)}(x^(j)) = F_t^{(pes)}(x^(i)) ⪯ F_t^{(pes)}(x^(j−1)).

Similarly, from the definition of x^(j−1) and x^(j), we obtain F_t^{(pes)}(x^(j−1)) ⪯ F_t^{(pes)}(x^(j)). Thus, we get F_t^{(pes)}(x^(j−1)) = F_t^{(pes)}(x^(j)). However, this contradicts x^(j) ∈ E_{t,x^(j−1)}^{(pes)}. Hence, x^(i) ≠ x^(j) for any i and j with i ≠ j. Therefore, the set {x^(1), . . . , x^(|X|)} is equal to X. Recall that x^(k) ∉ ˆΠ_t for every k = 1, . . . , |X|. By combining this and {x^(1), . . . , x^(|X|)} = X, we have ˆΠ_t = ∅. However, this contradicts Lemma A.4. Hence, the assumption (28) is incorrect. □

Lemma A.6.
Let $x$ be an element of $\mathcal{X}$, and let $\epsilon = (\epsilon_1, \epsilon_2)^\top$ be a positive vector. Assume that at least one of the following inequalities holds for any $x' \in \mathcal{X}$:
$$F_1(x) + \epsilon_1 \geq F_1(x'), \quad F_2(x) + \epsilon_2 \geq F_2(x').$$
Then, it holds that $F(x) \in Z_\epsilon$.

Proof. In order to prove Lemma A.6, we consider the following two cases: (1) For any $x, x' \in \Pi$, $F(x) = F(x')$. (2) There exist $x, x' \in \Pi$ such that $F(x) \neq F(x')$.

First, we consider (1). We define $x^{(1)}$ and $x^{(2)}$ as
$$\tilde{x} = \arg\max_{x \in \mathcal{X}} F_1(x), \quad x^{(1)} = \arg\max_{x;\, F_1(x) = F_1(\tilde{x})} F_2(x), \quad x^\dagger = \arg\max_{x \in \mathcal{X}} F_2(x), \quad x^{(2)} = \arg\max_{x;\, F_2(x) = F_2(x^\dagger)} F_1(x).$$
From the definitions of $x^{(1)}$ and $x^{(2)}$, it holds that $x^{(1)}, x^{(2)} \in \Pi$. Thus, from (1), we get $F(x^{(1)}) = F(x^{(2)})$. Hence, the following holds for any $x' \in \mathcal{X}$:
$$F_1(x') \leq F_1(x^{(1)}), \quad F_2(x') \leq F_2(x^{(2)}) = F_2(x^{(1)}).$$
Therefore, we get $F(x') \preceq F(x^{(1)})$. Note that $F(x^{(1)}) \in Z$. Here, let $x \in \mathcal{X}$. Then, from the lemma's assumption, at least one of the following inequalities holds:
$$F_1(x) + \epsilon_1 \geq F_1(x^{(1)}), \quad F_2(x) + \epsilon_2 \geq F_2(x^{(1)}).$$
If $F_1(x) + \epsilon_1 \geq F_1(x^{(1)})$, we set $a = (F_1(x^{(1)}), F_2(x))^\top$. Noting that $F(x') \preceq F(x^{(1)})$ for any $x' \in \mathcal{X}$, we have $a \preceq F(x^{(1)})$. This implies that $a \in Z$. Thus, the following holds:
$$a = (F_1(x^{(1)}), F_2(x))^\top \preceq (F_1(x) + \epsilon_1, F_2(x) + \epsilon_2)^\top = F(x) + \epsilon.$$
Furthermore, since $F(x) \preceq F(x^{(1)})$ and $F(x^{(1)}) \in Z$, we obtain $F(x) \in Z_\epsilon$. Similarly, if $F_2(x) + \epsilon_2 \geq F_2(x^{(1)})$, we set $b = (F_1(x), F_2(x^{(1)}))^\top$. Also in this case, by the same argument, we get $b \in Z$ and $b \preceq F(x) + \epsilon$. By combining this and $F(x) \preceq F(x^{(1)})$ (and $F(x^{(1)}) \in Z$), we obtain $F(x) \in Z_\epsilon$.

Next, we consider (2). From (2), there exist $x^{(1)}, \ldots, x^{(l)}$ such that
$$F(\Pi) = \{F(x) \mid x \in \Pi\} = \{F(x^{(i)}) \mid i = 1, \ldots, l\}, \quad F(x^{(i)}) \neq F(x^{(j)}), \ i \neq j.$$
Here, without loss of generality, we may assume the following:
$$F_1(x^{(1)}) < \cdots < F_1(x^{(l)}), \quad F_2(x^{(1)}) > \cdots > F_2(x^{(l)}).$$
Let $x$ be an element of $\mathcal{X}$. Assume that there exists $j$ such that
$$F_1(x) + \epsilon_1 \geq F_1(x^{(j)}), \quad F_2(x) + \epsilon_2 \geq F_2(x^{(j+1)}).$$
Note that $(F_1(x^{(j)}), F_2(x^{(j+1)}))^\top \in Z$. In addition, there exists $i \in \{1, \ldots, l\}$ such that $F(x) \preceq F(x^{(i)}) \in Z$. Therefore, $F(x) \in Z_\epsilon$.

Similarly, assume that at least one of the following inequalities holds for any $j$:
$$F_1(x) + \epsilon_1 < F_1(x^{(j)}), \quad F_2(x) + \epsilon_2 < F_2(x^{(j+1)}). \quad (29)$$
Here, if $F_1(x) + \epsilon_1 < F_1(x^{(1)})$, from the lemma's assumption it holds that $F_2(x) + \epsilon_2 \geq F_2(x^{(1)})$. Moreover, we define $c = (F_1(x), F_2(x^{(1)}))^\top \in Z$. Then, the following holds:
$$F(x) + \epsilon = (F_1(x) + \epsilon_1, F_2(x) + \epsilon_2)^\top \succeq (F_1(x), F_2(x^{(1)}))^\top = c \in Z.$$
Furthermore, from the definition of $x^{(1)}$, it holds that $F_2(x^{(1)}) \geq F_2(x)$. Thus, noting that $F_1(x) + \epsilon_1 < F_1(x^{(1)})$, we get $F_1(x) \leq F_1(x^{(1)})$. By combining these, we have $F(x) \preceq F(x^{(1)}) \in Z$. This implies that $F(x) \in Z_\epsilon$. On the other hand, if $F_1(x) + \epsilon_1 \geq F_1(x^{(1)})$, from (29) we get $F_2(x) + \epsilon_2 < F_2(x^{(2)})$. Therefore, from the lemma's assumption, we obtain $F_1(x) + \epsilon_1 \geq F_1(x^{(2)})$. By using (29) again, we have $F_2(x) + \epsilon_2 < F_2(x^{(3)})$. Hence, by repeating this procedure, we get $F_1(x) + \epsilon_1 \geq F_1(x^{(l)})$ and $F_2(x) + \epsilon_2 < F_2(x^{(l)})$. Finally, noting that
$$F(x) \preceq (F_1(x^{(l)}), F_2(x) + \epsilon_2)^\top \preceq (F_1(x^{(l)}), F_2(x^{(l)}))^\top = F(x^{(l)}) \in Z, \quad F(x) + \epsilon \succeq (F_1(x^{(l)}), F_2(x))^\top \in Z,$$
we get $F(x) \in Z_\epsilon$. $\blacksquare$
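The pointwise condition of Lemma A.6 is easy to check numerically on a finite design set. The following is a minimal sketch, assuming precomputed arrays F1 and F2 of the two objective values over $\mathcal{X}$; all names are illustrative and not part of the paper's implementation.

    import numpy as np

    def satisfies_lemma_a6(F1, F2, i, eps1, eps2):
        # Sufficient condition of Lemma A.6 for candidate index i:
        # for every x', F1(x)+eps1 >= F1(x') or F2(x)+eps2 >= F2(x').
        # If it holds, the objective vector of x lies in Z_eps.
        cond = (F1[i] + eps1 >= F1) | (F2[i] + eps2 >= F2)
        return bool(cond.all())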
By using these lemmas, we prove Theorem 4.2.
Proof.
First, we prove that the algorithm terminates after at most $t'$ iterations, where $t'$ is the positive integer satisfying
$$\max_{x \in M_{t'} \cup \hat{\Pi}_{t'}} \lambda_{t'}(x) = \lambda_{t'}(x_{t'}) \leq \min\{\epsilon_1, \epsilon_2\}.$$
From the definition of $\lambda_t$, noting that $u^{(F_1)}_t(x) - l^{(F_1)}_t(x) \leq \lambda_t(x)$ and $u^{(F_2)}_t(x) - l^{(F_2)}_t(x) \leq \lambda_t(x)$, we have
$$\max_{x \in M_{t'} \cup \hat{\Pi}_{t'}} \{u^{(F_1)}_{t'}(x) - l^{(F_1)}_{t'}(x)\} \leq \epsilon_1 \quad \text{and} \quad \max_{x \in M_{t'} \cup \hat{\Pi}_{t'}} \{u^{(F_2)}_{t'}(x) - l^{(F_2)}_{t'}(x)\} \leq \epsilon_2.$$
Then, for any $x' \in \hat{\Pi}_{t'}$, it holds that
$$u^{(F_1)}_{t'}(x') \leq l^{(F_1)}_{t'}(x') + \epsilon_1 \quad (30)$$
and
$$u^{(F_2)}_{t'}(x') \leq l^{(F_2)}_{t'}(x') + \epsilon_2. \quad (31)$$
Here, let $x$ be an element of $\hat{\Pi}_{t'}$. Then, from the definition of $\hat{\Pi}_t$, for any $x' \in \hat{\Pi}_{t'}$, at least one of the following inequalities holds:
$$l^{(F_1)}_{t'}(x') \leq l^{(F_1)}_{t'}(x), \quad l^{(F_2)}_{t'}(x') \leq l^{(F_2)}_{t'}(x).$$
Thus, from (30) and (31), for any $x' \in \hat{\Pi}_{t'}$, it holds that $F^{(\mathrm{pes})}_{t'}(x) + \epsilon \not\prec F^{(\mathrm{opt})}_{t'}(x')$. This implies that $U_{t'} = \emptyset$. Similarly, if $M_{t'} \neq \emptyset$, there exists $x \in M_{t'}$ such that $F^{(\mathrm{opt})}_{t'}(x) \not\preceq_{\epsilon} F^{(\mathrm{pes})}_{t'}(x')$ for any $x' \in \hat{\Pi}_{t'}$. On the other hand, from Lemma A.5, there exists $x'' \in \hat{\Pi}_{t'}$ such that $F^{(\mathrm{pes})}_{t'}(x) \preceq F^{(\mathrm{pes})}_{t'}(x'')$. Moreover, from (30) and (31), $x''$ satisfies $F^{(\mathrm{opt})}_{t'}(x) \preceq_{\epsilon} F^{(\mathrm{pes})}_{t'}(x'')$. However, this contradicts the definition of $M_t$. Hence, we get $M_{t'} = \emptyset$.

Hereafter, we assume that (13), (22) and (23) hold. From the definition of $\lambda_t$, we obtain
$$\lambda_t(x) \leq \{u^{(F_1)}_t(x) - l^{(F_1)}_t(x)\} + \{u^{(F_2)}_t(x) - l^{(F_2)}_t(x)\}.$$
This implies that
$$\sum_{t=1}^T \lambda_t(x_t) \leq \sum_{t=1}^T \{u^{(F_1)}_t(x_t) - l^{(F_1)}_t(x_t)\} + \sum_{t=1}^T \{u^{(F_2)}_t(x_t) - l^{(F_2)}_t(x_t)\}.$$
Therefore, from (15), (21), (22) and (23), we get
$$\sum_{t=1}^T \lambda_t(x_t) \leq \beta^{1/2}_T \Big\{\sum_{t=1}^T \sigma_{t-1}(x_t, w_t) + 4\ln\tfrac{18}{\delta}\Big\} + \sqrt{T\tilde{B}\beta^{1/2}_T \Big\{\sum_{t=1}^T \sigma_{t-1}(x_t, w_t) + 4\ln\tfrac{18}{\delta}\Big\}} + 40\beta_T \Big\{\sum_{t=1}^T \sigma^2_{t-1}(x_t, w_t) + 4\ln\tfrac{18}{\delta}\Big\}.$$
Hence, from (24) and (25), it holds that
$$\frac{1}{T}\sum_{t=1}^T \lambda_t(x_t) \leq T^{-1}\beta^{1/2}_T \big\{\sqrt{TC_1\gamma_T} + C_2\big\} + T^{-1}\sqrt{T\tilde{B}\beta^{1/2}_T \big\{\sqrt{TC_1\gamma_T} + 2C_2\big\}} + 5T^{-1}\beta_T\{C_1\gamma_T + 2C_2\}. \quad (32)$$
Here, let $T$ be a positive integer such that the right-hand side of (32) is less than or equal to $\min\{\epsilon_1, \epsilon_2\}$. Then, there exists a positive integer $t'$ such that $t' \leq T$ and $\lambda_{t'}(x_{t'}) \leq \min\{\epsilon_1, \epsilon_2\}$. Therefore, we have $M_{t'} = \emptyset$ and $U_{t'} = \emptyset$. This means that the algorithm terminates after at most $t'$ iterations.

Next, under (13) we show that $\hat{\Pi}_t$ is an $\epsilon$-accurate Pareto set when $M_t = \emptyset$ and $U_t = \emptyset$. First, we prove $F(\hat{\Pi}_t) \subset Z_\epsilon$. Let $x$ be an element of $\hat{\Pi}_t$.
For any $x' \in \hat{\Pi}_t \setminus \{x\}$, it holds that $F^{(\mathrm{pes})}_t(x) + \epsilon \not\prec F^{(\mathrm{opt})}_t(x')$ because $U_t = \emptyset$. Furthermore, noting that $M_t = \emptyset$, for any $x' \in \mathcal{X} \setminus \hat{\Pi}_t$, there exists $x'' \in \hat{\Pi}_t$ such that $F^{(\mathrm{opt})}_t(x') \preceq_{\epsilon} F^{(\mathrm{pes})}_t(x'')$. In addition, since $x \in \hat{\Pi}_t$, from the definition of $\hat{\Pi}_t$, at least one of the following inequalities holds:
$$l^{(F_1)}_t(x'') \leq l^{(F_1)}_t(x), \quad l^{(F_2)}_t(x'') \leq l^{(F_2)}_t(x).$$
By combining this and $F^{(\mathrm{opt})}_t(x') \preceq_{\epsilon} F^{(\mathrm{pes})}_t(x'')$, we get $F^{(\mathrm{pes})}_t(x) + \epsilon \not\prec F^{(\mathrm{opt})}_t(x')$. Therefore, under (13), at least one of the following inequalities holds for any $x' \in \mathcal{X} \setminus \{x\}$:
$$F_1(x) + \epsilon_1 \geq F_1(x'), \quad F_2(x) + \epsilon_2 \geq F_2(x').$$
For $x' = x$, we trivially have $F_1(x) + \epsilon_1 \geq F_1(x)$. Hence, from Lemma A.6, we get $F(\hat{\Pi}_t) \subset Z_\epsilon$.

Finally, we show that for any $x' \in \Pi$, there exists $x \in \hat{\Pi}_t$ such that $x' \preceq_{\epsilon} x$. When $x' \in \hat{\Pi}_t$, the existence of $x$ is obvious because $x' \preceq_{\epsilon} x'$. On the other hand, when $x' \in \mathcal{X} \setminus \hat{\Pi}_t$, since $M_t = \emptyset$ there exists $x \in \hat{\Pi}_t$ such that $F^{(\mathrm{opt})}_t(x') \preceq_{\epsilon} F^{(\mathrm{pes})}_t(x)$. Thus, under (13), this implies that $x' \preceq_{\epsilon} x$. Hence, for any $x' \in \Pi$, there exists $x \in \hat{\Pi}_t$ such that $x' \preceq_{\epsilon} x$. From this and $F(\hat{\Pi}_t) \subset Z_\epsilon$, we have that $\hat{\Pi}_t$ is an $\epsilon$-accurate Pareto set. Here, note that (13), (22) and (23) hold with probability at least $1 - \delta$. Therefore, we get the desired result. $\blacksquare$

B Extension to Constraint Optimization Problem
In real applications, there are situations where a known tolerance level for the value of the function $F_2$ is given. For example, in the parameter tuning of an engineering system, this corresponds to the case where the variance of the performance must be below a certain level. In such a situation, it is necessary to treat the functions $F_1$ and $F_2$ through the following constrained optimization problem:
$$x^* = \arg\max_{x \in \mathcal{X}} F_1(x) \quad \text{s.t.} \quad F_2(x) \geq h,$$
where $h < 0$ is a known threshold. Given a positive vector $\epsilon = (\epsilon_1, \epsilon_2)$, we define an $\epsilon$-accurate solution as a solution $\hat{x}$ satisfying
$$F_1(\hat{x}) \geq F_1(x^*) - \epsilon_1, \quad F_2(\hat{x}) \geq h - \epsilon_2.$$

Proposed Algorithm
First, we define $M^{(\mathrm{cons})}_t$, $S_t$ and $M^{(\mathrm{obj})}_t$ as
$$M^{(\mathrm{cons})}_t = \{x \in \mathcal{X} \mid u^{(F_2)}_t(x) \geq h - \epsilon_2\}, \quad S_t = \{x \in \mathcal{X} \mid l^{(F_2)}_t(x) \geq h - \epsilon_2\},$$
$$M^{(\mathrm{obj})}_t = \Big\{x \in \mathcal{X} \mid u^{(F_1)}_t(x) \geq \max_{x' \in S_t} l^{(F_1)}_t(x') - \epsilon_1\Big\}.$$
Here, we define $M^{(\mathrm{obj})}_t = \mathcal{X}$ if $S_t = \emptyset$. Note that an element in the complement of $M^{(\mathrm{cons})}_t$ or $M^{(\mathrm{obj})}_t$ is not an $\epsilon$-accurate solution with high probability. In addition, $S_t$ is a set of points that are certified to be feasible with high probability. Based on these definitions, we define the latent optimal solution set $M_t$ at the $t$-th step as $M_t = M^{(\mathrm{cons})}_t \cap M^{(\mathrm{obj})}_t$.

In our proposed algorithm, we select the most uncertain point in the latent optimal solution set $M_t$. In other words, the observation point $x_t$ at the $t$-th step is selected by using $\lambda_t$, defined by Equation (8), as follows:
$$x_t = \arg\max_{x \in M_t} \lambda_t(x). \quad (33)$$
Furthermore, if $S_t \neq \emptyset$ at the $t$-th step, we define the estimated optimal solution $\hat{x}_t$ by $\hat{x}_t = \arg\max_{x \in S_t} l^{(F_1)}_t(x)$. In order to ensure that $\hat{x}_t$ is an $\epsilon$-accurate solution, the uncertainties of the function values $F_1$ and $F_2$ at the latent optimal solutions should be sufficiently small. Thus, the proposed method terminates at the first step $t$ satisfying
$$\max_{x \in M_t} \lambda_t(x) \leq \min\{\epsilon_1, \epsilon_2\}.$$
The pseudocode of the proposed method is shown as Algorithm 3; a sketch of one acquisition step is given below.
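The following is a minimal sketch of a single acquisition step, assuming a finite candidate set with precomputed confidence bounds l1, u1 on $F_1$ and l2, u2 on $F_2$; here $\lambda_t$ is approximated by the sum of the two interval widths, which may differ from the paper's Equation (8). All names are illustrative.

    import numpy as np

    def constrained_acquisition(l1, u1, l2, u2, h, eps1, eps2):
        m_cons = u2 >= h - eps2                # possibly feasible points M^(cons)_t
        s = l2 >= h - eps2                     # certified feasible points S_t
        if s.any():
            m_obj = u1 >= l1[s].max() - eps1   # possibly near-optimal points M^(obj)_t
        else:
            m_obj = np.ones_like(m_cons)       # M^(obj)_t = X when S_t is empty
        m = m_cons & m_obj                     # latent optimal solution set M_t
        lam = (u1 - l1) + (u2 - l2)            # surrogate for lambda_t
        next_idx = np.flatnonzero(m)[np.argmax(lam[m])]
        stop = lam[m].max() <= min(eps1, eps2)
        best_idx = np.flatnonzero(s)[np.argmax(l1[s])] if s.any() else None
        return next_idx, stop, best_idx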
Theoretical Analysis
For Algorithm 3, the following theorem holds:
Theorem B.1.
Let $k$ be a positive-definite kernel, and let $f \in \mathcal{H}_k$ with $\|f\|_{\mathcal{H}_k} \leq B$. Also let $\delta \in (0, 1)$ and $\epsilon_1 > 0$, $\epsilon_2 > 0$, and define $\beta_t = \big(\sqrt{\ln\det(I_t + \sigma^{-2}K_t) + 2\ln\frac{1}{\delta}} + B\big)^2$. Then, with probability at least $1 - \delta$, the following statements 1 and 2 hold:

1. Algorithm 3 terminates after at most $T$ iterations, where $T$ is the smallest positive integer satisfying
$$T^{-1}\beta^{1/2}_T\big\{\sqrt{TC_1\gamma_T} + C_2\big\} + T^{-1}\sqrt{T\tilde{B}\beta^{1/2}_T\big\{\sqrt{TC_1\gamma_T} + 2C_2\big\}} + 5T^{-1}\beta_T\{C_1\gamma_T + 2C_2\} \leq \min\{\epsilon_1, \epsilon_2\}.$$
Here, $\tilde{B} = \max_{(x,w) \in \mathcal{X}\times\Omega} |f(x, w) - \mathbb{E}_w[f(x, w)]|$, $C_1 = 2/\ln(1+\sigma^{-2})$ and $C_2 = 16\ln\frac{18}{\delta}$.

2. If $x^*$ exists, then $S_{t'} \neq \emptyset$ at the termination step $t' \leq T$. Moreover, $\hat{x}_{t'} = \arg\max_{x \in S_{t'}} l^{(F_1)}_{t'}(x)$ is an $\epsilon$-accurate solution.

Algorithm 3 Proposed Algorithm for Constrained Optimization
Input: GP prior $\mathcal{GP}(0, k)$, $\{\beta_t\}_{t \in \mathbb{N}}$, threshold $h$, non-negative vector $\epsilon = (\epsilon_1, \epsilon_2)$.
  $M_1 \leftarrow \mathcal{X}$, $S_1 \leftarrow \emptyset$, $t \leftarrow 1$. Compute $\lambda_t(x)$ for any $x \in M_t$.
  while $\max_{x \in M_t} \lambda_t(x) > \min\{\epsilon_1, \epsilon_2\}$ do
    Choose $x_t = \arg\max_{x \in M_t} \lambda_t(x)$.
    Sample $w_t \sim p(w)$.
    Observe $y_t \leftarrow f(x_t, w_t) + \eta_t$.
    Update the GP by adding $((x_t, w_t), y_t)$.
    $t \leftarrow t + 1$.
    Compute $S_t$, $M_t$.
    Compute $\lambda_t(x)$ for any $x \in M_t$.
  end while
  if $S_t \neq \emptyset$ then
    Output $\hat{x}_t = \arg\max_{x \in S_t} l^{(F_1)}_t(x)$.
  end if

Proof. Assume that (13), (22) and (23) hold. Then, by using the same argument as in the proof of Theorem 4.2, we get
$$\frac{1}{T}\sum_{t=1}^T \lambda_t(x_t) \leq T^{-1}\beta^{1/2}_T\big\{\sqrt{TC_1\gamma_T} + C_2\big\} + T^{-1}\sqrt{T\tilde{B}\beta^{1/2}_T\big\{\sqrt{TC_1\gamma_T} + 2C_2\big\}} + 5T^{-1}\beta_T\{C_1\gamma_T + 2C_2\}. \quad (34)$$
Here, from the definition of $T$, the right-hand side of (34) is less than or equal to $\min\{\epsilon_1, \epsilon_2\}$. Hence, there exists a positive integer $t' \leq T$ such that $\max_{x \in M_{t'}} \lambda_{t'}(x) = \lambda_{t'}(x_{t'}) \leq \min\{\epsilon_1, \epsilon_2\}$. This implies that the algorithm terminates after at most $T$ iterations.

Next, we prove claim 2 of the theorem. Assume that $x^*$ exists. We consider the two cases $x^* \in M^{(\mathrm{obj})}_{t'}$ and $x^* \notin M^{(\mathrm{obj})}_{t'}$. For the case $x^* \in M^{(\mathrm{obj})}_{t'}$, since (13) holds, the following inequality holds:
$$h - \epsilon_2 \leq h \leq F_2(x^*) \leq u^{(F_2)}_{t'}(x^*).$$
This means that $x^* \in M^{(\mathrm{cons})}_{t'}$. Therefore, we have $x^* \in M_{t'}$. Furthermore, noting that $u^{(F_1)}_t(x) - l^{(F_1)}_t(x) \leq \lambda_t(x)$ and $u^{(F_2)}_t(x) - l^{(F_2)}_t(x) \leq \lambda_t(x)$, it holds that
$$\max_{x \in M_{t'}} \{u^{(F_1)}_{t'}(x) - l^{(F_1)}_{t'}(x)\} \leq \epsilon_1, \quad (35)$$
$$\max_{x \in M_{t'}} \{u^{(F_2)}_{t'}(x) - l^{(F_2)}_{t'}(x)\} \leq \epsilon_2. \quad (36)$$
Here, if $l^{(F_2)}_{t'}(x^*) < h - \epsilon_2$, then from (36) we get $u^{(F_2)}_{t'}(x^*) < h$. Thus, from (13), we obtain $F_2(x^*) < h$. However, this contradicts the definition of $x^*$, implying that $l^{(F_2)}_{t'}(x^*) \geq h - \epsilon_2$ and $x^* \in S_{t'} \neq \emptyset$. Moreover, from (35) the following holds:
$$\max_{x \in M_{t'}} \{u^{(F_1)}_{t'}(x) - l^{(F_1)}_{t'}(x)\} \leq \epsilon_1 \;\Rightarrow\; u^{(F_1)}_{t'}(x^*) - l^{(F_1)}_{t'}(x^*) \leq \epsilon_1 \;\Rightarrow\; u^{(F_1)}_{t'}(x^*) - \max_{x \in S_{t'}} l^{(F_1)}_{t'}(x) \leq \epsilon_1$$
$$\Rightarrow\; u^{(F_1)}_{t'}(x^*) - l^{(F_1)}_{t'}(\hat{x}_{t'}) \leq \epsilon_1 \;\Rightarrow\; l^{(F_1)}_{t'}(\hat{x}_{t'}) \geq u^{(F_1)}_{t'}(x^*) - \epsilon_1.$$
In addition, from the definition of $S_{t'}$, we have $l^{(F_2)}_{t'}(\hat{x}_{t'}) \geq h - \epsilon_2$.
On the other hand, if $x^* \notin M^{(\mathrm{obj})}_{t'}$, then $M^{(\mathrm{obj})}_{t'} \neq \mathcal{X}$. Thus, from the definition of $M^{(\mathrm{obj})}_{t'}$, it holds that $S_{t'} \neq \emptyset$. Therefore, since $\hat{x}_{t'} = \arg\max_{x \in S_{t'}} l^{(F_1)}_{t'}(x) \in S_{t'}$, we get $l^{(F_2)}_{t'}(\hat{x}_{t'}) \geq h - \epsilon_2$. Furthermore, since $x^* \notin M^{(\mathrm{obj})}_{t'}$, it holds that
$$u^{(F_1)}_{t'}(x^*) - \epsilon_1 \leq u^{(F_1)}_{t'}(x^*) < l^{(F_1)}_{t'}(\hat{x}_{t'}) - \epsilon_1 \leq l^{(F_1)}_{t'}(\hat{x}_{t'}).$$
Therefore, if $x^*$ exists, then we have $S_{t'} \neq \emptyset$ and
$$l^{(F_1)}_{t'}(\hat{x}_{t'}) \geq u^{(F_1)}_{t'}(x^*) - \epsilon_1, \quad (37)$$
$$l^{(F_2)}_{t'}(\hat{x}_{t'}) \geq h - \epsilon_2. \quad (38)$$
Note that (37) and (38) imply that $\hat{x}_{t'}$ is an $\epsilon$-accurate solution when (13) holds. Finally, since (13), (22) and (23) hold with probability at least $1 - \delta$, we have Theorem B.1. $\blacksquare$

C Details of Section 3.3
C.1 Noisy Input Setting
In this subsection, we consider the setting where the input $x$ contains a noise $\xi \in \Delta$. Let $\mathcal{X} \subset \mathbb{R}^d$ be an input space for optimization, and assume that $\mathcal{X}$ is a finite set. Furthermore, let $\Delta \subset \mathbb{R}^d$ be a compact and convex set, and let $\xi$ be a random noise taking values in $\Delta$. Moreover, let $f$ be a black-box function on $D := \{x + \xi \mid x \in \mathcal{X}, \xi \in \Delta\}$, and let $k: D \times D \to \mathbb{R}$ be a positive-definite kernel with $f \in \mathcal{H}_k$ and $\|f\|_{\mathcal{H}_k} \leq B$.

At each step $t$, we select an observation point $x_t \in \mathcal{X}$, and the observed value is obtained as $y_t = f(x_t + \xi_t) + \eta_t$. Here, $\eta_t$ is independent Gaussian noise with $\eta_t \sim \mathcal{N}(0, \sigma^2)$, and $\xi_t$ is the realized value of $\xi$.

In this setting, the expected value and variance of $f(x + \xi)$ with respect to $\xi$ are given by
$$\mathbb{E}_\xi[f(x+\xi)] = \int_\Delta f(x+\xi)p(\xi)\mathrm{d}\xi, \quad (39)$$
$$\mathbb{V}_\xi[f(x+\xi)] = \int_\Delta \{f(x+\xi) - \mathbb{E}_\xi[f(x+\xi)]\}^2 p(\xi)\mathrm{d}\xi, \quad (40)$$
where $p(\xi)$ is a known probability density function of $\xi$. As in (5), using (39) and (40) we define the optimization objective functions $F_1$ and $F_2$. In addition, let $\mu_t(x)$, $\sigma^2_t(x)$ and $Q_t(x) := [l_t(x), u_t(x)]$ denote the posterior mean, posterior variance and confidence bound of $f(x)$ at step $t$, respectively.
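When the integrals over $\Delta$ are intractable, (39) and (40) can be approximated by simple Monte Carlo. The following is a minimal sketch, assuming a callable black-box f and a sampler for $p(\xi)$, both placeholders for whatever the application provides:

    import numpy as np

    def mean_variance_under_input_noise(f, x, sample_xi, n_samples=10000, seed=0):
        # Draw noise realizations xi ~ p(xi); shape (n_samples, d).
        rng = np.random.default_rng(seed)
        xi = sample_xi(rng, n_samples)
        vals = np.array([f(x + xi_i) for xi_i in xi])
        mean = vals.mean()                 # Monte Carlo estimate of (39)
        var = ((vals - mean) ** 2).mean()  # Monte Carlo estimate of (40)
        return mean, var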
Confidence Bound  Confidence bounds of the objective functions $F_1$ and $F_2$ defined by using (39) and (40) can also be constructed by the same procedure as in Section 3.1. First, assume that $f(\tilde{x}) \in Q_t(\tilde{x})$ for any $\tilde{x} \in D$. Then, the following holds for any $x \in \mathcal{X}$:
$$\int_\Delta l_t(x+\xi)p(\xi)\mathrm{d}\xi \leq \int_\Delta f(x+\xi)p(\xi)\mathrm{d}\xi \leq \int_\Delta u_t(x+\xi)p(\xi)\mathrm{d}\xi.$$
Therefore, the confidence bound $Q^{(F_1)}_t(x)$ of $F_1(x)$ can be constructed as $Q^{(F_1)}_t(x) := [l^{(F_1)}_t(x), u^{(F_1)}_t(x)]$ using
$$l^{(F_1)}_t(x) = \int_\Delta l_t(x+\xi)p(\xi)\mathrm{d}\xi, \quad u^{(F_1)}_t(x) = \int_\Delta u_t(x+\xi)p(\xi)\mathrm{d}\xi.$$
Similarly, the confidence bound $Q^{(F_2)}_t(x)$ of $F_2(x)$ can be expressed as $Q^{(F_2)}_t(x) := [l^{(F_2)}_t(x), u^{(F_2)}_t(x)]$ using
$$l^{(F_2)}_t(x) = -\sqrt{\int_\Delta \tilde{u}^{(\mathrm{sq})}_t(x+\xi)p(\xi)\mathrm{d}\xi}, \quad u^{(F_2)}_t(x) = -\sqrt{\int_\Delta \tilde{l}^{(\mathrm{sq})}_t(x+\xi)p(\xi)\mathrm{d}\xi},$$
where $\tilde{l}^{(\mathrm{sq})}_t(x+\xi)$ and $\tilde{u}^{(\mathrm{sq})}_t(x+\xi)$ are given by
$$\tilde{l}_t(x+\xi) = l_t(x+\xi) - \mathbb{E}_\xi[u_t(x+\xi)], \quad \tilde{u}_t(x+\xi) = u_t(x+\xi) - \mathbb{E}_\xi[l_t(x+\xi)],$$
$$\tilde{l}^{(\mathrm{sq})}_t(x+\xi) = \begin{cases} 0 & \text{if } \tilde{l}_t(x+\xi) \leq 0 \leq \tilde{u}_t(x+\xi), \\ \min\{\tilde{l}^2_t(x+\xi), \tilde{u}^2_t(x+\xi)\} & \text{otherwise}, \end{cases} \quad \tilde{u}^{(\mathrm{sq})}_t(x+\xi) = \max\{\tilde{l}^2_t(x+\xi), \tilde{u}^2_t(x+\xi)\}.$$
Using $Q^{(F_1)}_t$ and $Q^{(F_2)}_t$ above, we can construct the proposed algorithm by the same procedure.

C.2 Simulator Based Experiment

In this subsection, we consider the setting in which $w_t$ can be selected by the algorithm at each step, and we show theoretical guarantees in this setting. Hereafter, we only discuss the multi-task scenario, but the same argument holds for the multi-objective and constrained optimization scenarios by selecting $w_t$ and $\xi_t$ by the same procedure.

In our proposed algorithm, $(x_t, w_t)$ at step $t$ is selected by
$$x_t = \arg\max_{x \in \mathcal{X}} u^{(G)}_t(x), \quad w_t = \arg\max_{w \in \Omega} \sigma_{t-1}(x_t, w).$$
A sketch of this selection rule is given below.
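A minimal sketch of this selection over finite grids, assuming precomputed callables ucb_G (the upper bound $u^{(G)}_t$) and post_sd (the posterior standard deviation $\sigma_{t-1}$); all names are ours:

    import numpy as np

    def select_simulator_query(X_cand, W_cand, ucb_G, post_sd):
        # Pick the design point with the largest upper bound on G,
        # then the environmental value with the largest posterior sd there.
        x_t = X_cand[int(np.argmax([ucb_G(x) for x in X_cand]))]
        w_t = W_cand[int(np.argmax([post_sd(x_t, w) for w in W_cand]))]
        return x_t, w_t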
In this algorithm, the following theorem holds:

Theorem C.1.
Let $k$ be a positive-definite kernel, and let $f \in \mathcal{H}_k$ with $\|f\|_{\mathcal{H}_k} \leq B$. Also let $\delta \in (0, 1)$, $\epsilon > 0$, and define $\beta_t = \big(\sqrt{\ln\det(I_t + \sigma^{-2}K_t) + 2\ln\frac{1}{\delta}} + B\big)^2$. Moreover, for any $t$, define $\hat{x}_t = \arg\max_{x_{t'} \in \{x_1, \ldots, x_t\}} l^{(G)}_{t'}(x_{t'})$. Then, when the proposed algorithm in the simulator-based setting is performed, $\hat{x}_T$ is an $\epsilon$-accurate solution with probability at least $1 - \delta$, where $T$ is the smallest positive integer satisfying
$$\alpha T^{-1}\beta^{1/2}_T\sqrt{TC_1\gamma_T} + (1-\alpha)T^{-1}\sqrt{T\tilde{B}\beta^{1/2}_T\sqrt{TC_1\gamma_T}} + 5T^{-1}\beta_T C_1\gamma_T \leq \epsilon.$$
Here, $\tilde{B}$ and $C_1$ are given by $\tilde{B} = \max_{(x,w) \in \mathcal{X}\times\Omega} |f(x, w) - \mathbb{E}_w[f(x, w)]|$ and $C_1 = 2/\ln(1+\sigma^{-2})$.

Proof. Assume that (13) holds. Then, from Lemma A.1 we have
$$\sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\} \leq \alpha\beta^{1/2}_T \sum_{t=1}^T \int_\Omega \sigma_{t-1}(x_t, w)p(w)\mathrm{d}w + (1-\alpha)\sqrt{T\tilde{B}\beta^{1/2}_T \sum_{t=1}^T \int_\Omega \sigma_{t-1}(x_t, w)p(w)\mathrm{d}w} + 20\beta_T \sum_{t=1}^T \int_\Omega \sigma^2_{t-1}(x_t, w)p(w)\mathrm{d}w.$$
In addition, from the definition of $w_t$, it holds that
$$\sum_{t=1}^T \int_\Omega \sigma_{t-1}(x_t, w)p(w)\mathrm{d}w \leq \sum_{t=1}^T \sigma_{t-1}(x_t, w_t), \quad \sum_{t=1}^T \int_\Omega \sigma^2_{t-1}(x_t, w)p(w)\mathrm{d}w \leq \sum_{t=1}^T \sigma^2_{t-1}(x_t, w_t).$$
Hence, we get
$$\sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\} \leq \alpha\beta^{1/2}_T \sum_{t=1}^T \sigma_{t-1}(x_t, w_t) + (1-\alpha)\sqrt{T\tilde{B}\beta^{1/2}_T \sum_{t=1}^T \sigma_{t-1}(x_t, w_t)} + 20\beta_T \sum_{t=1}^T \sigma^2_{t-1}(x_t, w_t).$$
Furthermore, from (24) and (25), we obtain
$$\sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\} \leq \alpha\beta^{1/2}_T\sqrt{C_1 T\gamma_T} + (1-\alpha)\sqrt{T\tilde{B}\beta^{1/2}_T\sqrt{C_1 T\gamma_T}} + 5\beta_T C_1\gamma_T.$$
Finally, by using the same argument as in the proof of Theorem 4.1, the following inequality holds:
$$G(x^*) - G(\hat{x}_T) \leq \sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\}/T.$$
Therefore, noting the definition of $T$, we get the desired result. $\blacksquare$

Noisy Input Extension  Here, we extend the setting defined in Subsection 3.3.2 to the simulator-based setting. Since the noise $\xi \in \Delta$ plays the role of $w$, we consider the observation point $x_t$ at step $t$ as $x_t := \tilde{x}_t + \xi_t$, where $(\tilde{x}_t, \xi_t)$ is given by
$$\tilde{x}_t = \arg\max_{x \in \mathcal{X}} u^{(G)}_t(x), \quad \xi_t = \arg\max_{\xi \in \Delta} \sigma_{t-1}(\tilde{x}_t + \xi).$$
Then, a theorem analogous to Theorem C.1 holds. However, the practical performance of this algorithm is not much different from that of Uncertainty Sampling, which was used as the baseline method in the numerical experiments. For this reason, in the simulator-based noisy-input setting, we propose selecting $(\tilde{x}_t, \xi_t)$ as follows:
$$\tilde{x}_t = \arg\max_{x \in \mathcal{X}} u^{(G)}_t(x), \quad \xi_t = \arg\max_{\xi \in \Delta} \sigma_{t-1}(\tilde{x}_t + \xi)p(\xi).$$
In order to derive convergence results similar to Theorem C.1, we assume that the probability density function $p(\xi)$ of $\xi$ is a bounded function on $\Delta$, i.e., $\sup_{\xi \in \Delta} p(\xi) < \infty$.

Theorem C.2.
Let $\delta \in (0, 1)$, $\epsilon > 0$, and set $\beta_t = \big(\sqrt{\ln\det(I_t + \sigma^{-2}K_t) + 2\ln\frac{1}{\delta}} + B\big)^2$. For any $t$, define $\hat{x}_t = \arg\max_{x_{t'} \in \{x_1, \ldots, x_t\}} l^{(G)}_{t'}(x_{t'})$. Moreover, assume that $\sup_{\xi \in \Delta} p(\xi) \leq R < \infty$. Then, when the proposed algorithm in the simulator-based noisy-input setting is performed, $\hat{x}_T$ is an $\epsilon$-accurate solution with probability at least $1 - \delta$, where $T$ is the smallest positive integer satisfying
$$\alpha T^{-1}\beta^{1/2}_T R\sqrt{TC_1\gamma_T} + (1-\alpha)T^{-1}\sqrt{T\tilde{B}R\beta^{1/2}_T\sqrt{TC_1\gamma_T}} + 5T^{-1}R\beta_T C_1\gamma_T \leq \epsilon.$$
Here, $\tilde{B}$ and $C_1$ are given by $\tilde{B} = \max_{(x,\xi) \in \mathcal{X}\times\Delta} |f(x+\xi) - \mathbb{E}_\xi[f(x+\xi)]|$ and $C_1 = 2/\ln(1+\sigma^{-2})$.

Proof. As in Lemma A.1, with probability at least $1 - \delta$, it holds that
$$\sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\} \leq \alpha\beta^{1/2}_T \sum_{t=1}^T \int_\Delta \sigma_{t-1}(x_t + \xi)p(\xi)\mathrm{d}\xi + (1-\alpha)\sqrt{T\tilde{B}\beta^{1/2}_T \sum_{t=1}^T \int_\Delta \sigma_{t-1}(x_t + \xi)p(\xi)\mathrm{d}\xi} + 20\beta_T \sum_{t=1}^T \int_\Delta \sigma^2_{t-1}(x_t + \xi)p(\xi)\mathrm{d}\xi.$$
Moreover, from the definition of $\xi_t$, we have
$$\sum_{t=1}^T \int_\Delta \sigma_{t-1}(x_t + \xi)p(\xi)\mathrm{d}\xi \leq \sum_{t=1}^T \sigma_{t-1}(x_t + \xi_t)p(\xi_t) \leq R\sum_{t=1}^T \sigma_{t-1}(x_t + \xi_t),$$
$$\sum_{t=1}^T \int_\Delta \sigma^2_{t-1}(x_t + \xi)p(\xi)\mathrm{d}\xi \leq \sum_{t=1}^T \sigma^2_{t-1}(x_t + \xi_t)p(\xi_t) \leq R\sum_{t=1}^T \sigma^2_{t-1}(x_t + \xi_t).$$
Thus, we get
$$\sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\} \leq \alpha\beta^{1/2}_T R\sum_{t=1}^T \sigma_{t-1}(x_t + \xi_t) + (1-\alpha)\sqrt{T\tilde{B}\beta^{1/2}_T R\sum_{t=1}^T \sigma_{t-1}(x_t + \xi_t)} + 20\beta_T R\sum_{t=1}^T \sigma^2_{t-1}(x_t + \xi_t),$$
and
$$\sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\} \leq \alpha\beta^{1/2}_T R\sqrt{C_1 T\gamma_T} + (1-\alpha)\sqrt{T\tilde{B}R\beta^{1/2}_T\sqrt{C_1 T\gamma_T}} + 5\beta_T RC_1\gamma_T.$$
By using the same argument as in the proof of Theorem 4.1, we obtain the following inequality:
$$G(x^*) - G(\hat{x}_T) \leq \sum_{t=1}^T \{u^{(G)}_t(x_t) - l^{(G)}_t(x_t)\}/T.$$
Therefore, we get the desired result. $\blacksquare$
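A sketch of the density-weighted selection rule above, which differs from the unweighted simulator-based rule only by the $p(\xi)$ factor; Xi_cand is a finite grid over $\Delta$, and ucb_G, post_sd, p_xi are assumed callables (names are ours):

    import numpy as np

    def select_noisy_input_query(X_cand, Xi_cand, ucb_G, post_sd, p_xi):
        x_tilde = X_cand[int(np.argmax([ucb_G(x) for x in X_cand]))]
        # Weight the posterior sd by the noise density, as proposed above.
        scores = [post_sd(x_tilde + xi) * p_xi(xi) for xi in Xi_cand]
        xi_t = Xi_cand[int(np.argmax(scores))]
        return x_tilde, xi_t   # the actual query point is x_tilde + xi_t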
D Extension to Continuous Set

In this section, we consider the setting where $\mathcal{X}$ is a continuous set. First, in MT-MVA-BO, $x_t = \arg\max_{x \in \mathcal{X}} u^{(G)}_t(x)$ can be calculated by using a continuous optimization solver. However, in MO-MVA-BO, it is difficult to calculate the estimated Pareto set $\hat{\Pi}_t$ and the set of latent optimal solutions $M_t$. In this paper, based on [5], we extend the proposed algorithm by using a discretization $\tilde{\mathcal{X}}$ of $\mathcal{X}$.

Hereafter, let $\mathcal{X} = [0, 1]^d$. Furthermore, assume that $f$ is an $L$-Lipschitz continuous function, i.e., there exists $L > 0$ such that
$$|f(x, w) - f(x', w)| \leq L\|x - x'\|_1$$
for any $x, x' \in \mathcal{X}$. Note that Lipschitz continuity holds if standard kernels are used [24, 31].

From the Lipschitz continuity of $f$, the following lemmas about $F_1$ and $F_2$ hold:

Lemma D.1.
Let $f$ be an $L$-Lipschitz continuous function. Then, it holds that
$$|F_1(x) - F_1(x')| \leq L\|x - x'\|_1, \quad \forall x, x' \in \mathcal{X},$$
where $F_1$ is given by (5).

Proof. From the definition of $F_1$ and the Lipschitz continuity of $f$, the following inequality holds:
$$|F_1(x) - F_1(x')| = \Big|\int_\Omega \{f(x, w) - f(x', w)\}p(w)\mathrm{d}w\Big| \leq \int_\Omega |f(x, w) - f(x', w)|p(w)\mathrm{d}w \leq L\|x - x'\|_1. \quad \blacksquare$$

Lemma D.2.
Let $f$ be an $L$-Lipschitz continuous function, let $\tilde{B} = \max_{(x,w) \in \mathcal{X}\times\Omega} |f(x, w) - \mathbb{E}_w[f(x, w)]|$, and define $F_2$ as in (5). Then, the following inequality holds for any $x, x' \in \mathcal{X}$:
$$|F_2(x) - F_2(x')| \leq \sqrt{4\tilde{B}L\|x - x'\|_1}.$$

Proof.
From the Lipschitz continuity of $f$, for any $x, x' \in \mathcal{X}$ and $w \in \Omega$, it holds that
$$\big|\{f(x, w) - \mathbb{E}_w[f(x, w)]\}^2 - \{f(x', w) - \mathbb{E}_w[f(x', w)]\}^2\big|$$
$$= \big|\{f(x, w) - \mathbb{E}_w[f(x, w)]\} - \{f(x', w) - \mathbb{E}_w[f(x', w)]\}\big| \times \big|\{f(x, w) - \mathbb{E}_w[f(x, w)]\} + \{f(x', w) - \mathbb{E}_w[f(x', w)]\}\big|$$
$$\leq \big(|f(x, w) - f(x', w)| + |\mathbb{E}_w[f(x, w)] - \mathbb{E}_w[f(x', w)]|\big) \times \big(|f(x, w) - \mathbb{E}_w[f(x, w)]| + |f(x', w) - \mathbb{E}_w[f(x', w)]|\big)$$
$$\leq 2L\|x - x'\|_1 \times 2\tilde{B} = 4\tilde{B}L\|x - x'\|_1.$$
Here, if $F_2(x) \geq F_2(x')$, then
$$|F_2(x) - F_2(x')| = F_2(x) - F_2(x') = \sqrt{\int_\Omega \{f(x', w) - \mathbb{E}_w[f(x', w)]\}^2 p(w)\mathrm{d}w} - \sqrt{\int_\Omega \{f(x, w) - \mathbb{E}_w[f(x, w)]\}^2 p(w)\mathrm{d}w}$$
$$\leq \sqrt{\int_\Omega \{f(x', w) - \mathbb{E}_w[f(x', w)]\}^2 p(w)\mathrm{d}w - \int_\Omega \{f(x, w) - \mathbb{E}_w[f(x, w)]\}^2 p(w)\mathrm{d}w}$$
$$\leq \sqrt{\int_\Omega \big|\{f(x', w) - \mathbb{E}_w[f(x', w)]\}^2 - \{f(x, w) - \mathbb{E}_w[f(x, w)]\}^2\big| p(w)\mathrm{d}w} \leq \sqrt{4\tilde{B}L\|x - x'\|_1},$$
where the first inequality uses $\sqrt{a} - \sqrt{b} \leq \sqrt{a - b}$ for $a \geq b \geq 0$. On the other hand, if $F_2(x) < F_2(x')$, the same argument with $x$ and $x'$ exchanged gives $|F_2(x) - F_2(x')| \leq \sqrt{4\tilde{B}L\|x - x'\|_1}$. Therefore, the desired inequality holds for any $x, x' \in \mathcal{X}$. $\blacksquare$
Let $Z$ be the Pareto front for $\mathcal{X}$, and let $\epsilon = (\epsilon_1, \epsilon_2)^\top$ be a positive vector. Define
$$Z_+ = \bigcup_{(y_1, y_2) \in Z} (-\infty, y_1] \times (-\infty, y_2], \quad Z_-(\epsilon) = \bigcup_{(y_1, y_2) \in Z} (-\infty, y_1 - \epsilon_1) \times (-\infty, y_2 - \epsilon_2),$$
$$Z_*(\epsilon) = \{(y_1 - \epsilon'_1, y_2 - \epsilon'_2) \mid (y_1, y_2) \in Z, \ 0 \leq \epsilon'_1 \leq \epsilon_1, \ 0 \leq \epsilon'_2 \leq \epsilon_2\}.$$
Then, it holds that
$$Z_+ = Z_-(\epsilon) \cup Z_*(\epsilon), \quad Z_-(\epsilon) \cap Z_*(\epsilon) = \emptyset.$$

Proof.
First, we show $Z_-(\epsilon) \cap Z_*(\epsilon) = \emptyset$. Let $y$ be an element of $Z_-(\epsilon)$. Then, there exists $(y'_1, y'_2) \in Z$ such that $y_1 < y'_1 - \epsilon_1$ and $y_2 < y'_2 - \epsilon_2$. Here, for any $(y''_1, y''_2) \in Z$, $y''_2$ satisfies $y'_2 \leq y''_2$ or $y'_2 > y''_2$. If $y'_2 \leq y''_2$, from $y_2 < y'_2 - \epsilon_2$ we get $y \notin \{(y''_1 - \epsilon'_1, y''_2 - \epsilon'_2) \mid 0 \leq \epsilon'_1 \leq \epsilon_1, \ 0 \leq \epsilon'_2 \leq \epsilon_2\}$. On the other hand, if $y'_2 > y''_2$, then $y''_1$ satisfies $y'_1 \leq y''_1$, because the inequality $y'_1 > y''_1$ would imply $(y''_1, y''_2) \in (-\infty, y'_1) \times (-\infty, y'_2)$, which contradicts $(y''_1, y''_2) \in Z$. From $y'_1 \leq y''_1$ and $y_1 < y'_1 - \epsilon_1$, we have $y \notin \{(y''_1 - \epsilon'_1, y''_2 - \epsilon'_2) \mid 0 \leq \epsilon'_1 \leq \epsilon_1, \ 0 \leq \epsilon'_2 \leq \epsilon_2\}$. Therefore, it holds that $y \notin Z_*(\epsilon)$. This implies that $Z_-(\epsilon) \cap Z_*(\epsilon) = \emptyset$.

Next, we show $Z_+ = Z_-(\epsilon) \cup Z_*(\epsilon)$. It is clear that $Z_+ \supset Z_-(\epsilon) \cup Z_*(\epsilon)$. Thus, we only show that $Z_+ \subset Z_-(\epsilon) \cup Z_*(\epsilon)$. Let $y$ be an element of $Z_+$. If $y \in Z_-(\epsilon)$, it holds that $y \in Z_-(\epsilon) \cup Z_*(\epsilon)$. On the other hand, if $y \notin Z_-(\epsilon)$, at least one of the following inequalities holds for any $(y'_1, y'_2) \in Z$:
$$y_1 \geq y'_1 - \epsilon_1, \quad y_2 \geq y'_2 - \epsilon_2.$$
If there exists $\epsilon' \in [0, \epsilon_1]$ such that $(y_1 + \epsilon', y_2) \in Z$, then $y \in Z_*(\epsilon)$. Next, we consider the case that $(y_1 + \epsilon', y_2) \notin Z$ for any $\epsilon' \in [0, \epsilon_1]$. Let $Z' = \{a = (a_1, a_2) \in Z \mid y_1 \leq a_1 \leq y_1 + \epsilon_1\}$. Here, assume that $y_2 < a_2 - \epsilon_2$ for any $a \in Z'$. Then, from the continuity of $Z$, there exists $\hat{y} = (\hat{y}_1, \hat{y}_2) \in Z$ such that $y_1 < \hat{y}_1 - \epsilon_1$ and $y_2 < \hat{y}_2 - \epsilon_2$. However, this contradicts $y \notin Z_-(\epsilon)$. Hence, there exists an element $a = (a_1, a_2) \in Z'$ such that $y_2 \geq a_2 - \epsilon_2$. Moreover, there exists $b \geq y_2$ such that $(y_1, b) \in Z$. This implies that there exist $\tilde{\epsilon}_1$ and $\tilde{\epsilon}_2$ such that $0 \leq \tilde{\epsilon}_1 \leq \epsilon_1$, $0 \leq \tilde{\epsilon}_2 \leq \epsilon_2$ and $(y_1 + \tilde{\epsilon}_1, y_2 + \tilde{\epsilon}_2) \in Z$. Therefore, it holds that $y \in Z_*(\epsilon)$. $\blacksquare$

Next, we explain the method of constructing $\tilde{\mathcal{X}}$. Let $\tilde{\mathcal{X}}$ be the set of grid points obtained when each dimension of $\mathcal{X} = [0, 1]^d$ is divided into $\tau$ evenly spaced segments. Also let $[x] \in \tilde{\mathcal{X}}$ be a point closest to $x \in \mathcal{X}$ with respect to the $L^1$-norm. Then, it holds that
$$\|x - [x]\|_1 \leq \frac{d}{\tau}, \quad \forall x \in \mathcal{X}. \quad (41)$$
In the proposed algorithm for the continuous-set setting, Algorithm 2 is performed by using $\tilde{\mathcal{X}}$ instead of $\mathcal{X}$; a sketch of the discretization is given below.
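A minimal sketch of the discretization (41), assuming $\mathcal{X} = [0, 1]^d$; function names are ours:

    import numpy as np

    def make_grid(d, tau):
        # Uniform grid with tau segments (tau + 1 points) per dimension.
        axis = np.linspace(0.0, 1.0, tau + 1)
        mesh = np.meshgrid(*([axis] * d), indexing="ij")
        return np.stack([m.ravel() for m in mesh], axis=-1)

    def snap_to_grid(x, tau):
        # Rounding each coordinate gives ||x - [x]||_1 <= d/(2 tau) <= d/tau,
        # which satisfies (41).
        return np.round(np.asarray(x) * tau) / tau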
Then, we define the estimated Pareto set $\hat{\Pi}_t$, the latent Pareto set $M_t$ and the uncertain set $U_t$ in Algorithm 2 as
$$\hat{\Pi}_t = \{x \in \tilde{\mathcal{X}} \mid \forall x' \in \tilde{E}^{(\mathrm{pes})}_{t,x}, \ F^{(\mathrm{pes})}_t(x) \not\preceq F^{(\mathrm{pes})}_t(x')\}, \quad \tilde{E}^{(\mathrm{pes})}_{t,x} = \{x' \in \tilde{\mathcal{X}} \mid F^{(\mathrm{pes})}_t(x) \neq F^{(\mathrm{pes})}_t(x')\},$$
$$M_t = \{x \in \tilde{\mathcal{X}} \setminus \hat{\Pi}_t \mid \forall x' \in \hat{\Pi}_t, \ F^{(\mathrm{opt})}_t(x) \not\preceq_{\epsilon/2} F^{(\mathrm{pes})}_t(x')\},$$
$$U_t = \{x \in \hat{\Pi}_t \mid \exists x' \in \hat{\Pi}_t \setminus \{x\}, \ F^{(\mathrm{pes})}_t(x) + \epsilon/2 \prec F^{(\mathrm{opt})}_t(x')\}.$$
Note that $\epsilon/2$, not $\epsilon$, is used to calculate $M_t$ and $U_t$. A sketch of these set computations on the grid is given below.
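The following is a minimal sketch of the three set computations on a finite grid, assuming (n, 2) arrays F_pes and F_opt of pessimistic and optimistic objective vectors; all names are ours:

    import numpy as np

    def estimate_sets(F_pes, F_opt, eps):
        half = np.asarray(eps) / 2.0
        n = len(F_pes)
        # Estimated Pareto set: x whose pessimistic vector is not weakly
        # dominated by that of any point with a different pessimistic vector.
        pareto = [i for i in range(n)
                  if not any(not np.array_equal(F_pes[j], F_pes[i])
                             and np.all(F_pes[i] <= F_pes[j]) for j in range(n))]
        # Latent Pareto set: undecided points that no estimated-Pareto point
        # (eps/2)-dominates from its pessimistic side.
        latent = [i for i in range(n) if i not in pareto
                  and not any(np.all(F_opt[i] <= F_pes[j] + half) for j in pareto)]
        # Uncertain set: estimated-Pareto points whose pessimistic vector is
        # still strictly below another point's optimistic vector by eps/2.
        uncertain = [i for i in pareto
                     if any(j != i and np.all(F_pes[i] + half < F_opt[j])
                            for j in pareto)]
        return pareto, latent, uncertain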
In the algorithm using $\tilde{\mathcal{X}}$, the following theorem holds:

Theorem D.1. Let $\tilde{B} = \max_{(x,w) \in \mathcal{X}\times\Omega} |f(x, w) - \mathbb{E}_w[f(x, w)]|$, and let $\delta \in (0, 1)$ and $\epsilon = (\epsilon_1, \epsilon_2)$ with $\epsilon_1 > 0$ and $\epsilon_2 > 0$. Define $\beta_t = \big(\sqrt{\ln\det(I_t + \sigma^{-2}K_t) + 2\ln\frac{1}{\delta}} + B\big)^2$ and
$$\tau = \max\Big\{\frac{2Ld}{\epsilon_1}, \frac{16\tilde{B}Ld}{\epsilon_2^2}\Big\}.$$
Then, the following statements (1) and (2) hold with probability at least $1 - \delta$:

(1) The algorithm terminates after at most $T$ iterations, where $T$ is the smallest positive integer satisfying
$$T^{-1}\beta^{1/2}_T\big(\sqrt{TC_1\gamma_T} + C_2\big) + T^{-1}\sqrt{T\tilde{B}\beta^{1/2}_T\big(\sqrt{TC_1\gamma_T} + 2C_2\big)} + 5T^{-1}\beta_T(C_1\gamma_T + 2C_2) \leq \min\{\epsilon_1, \epsilon_2\}/2.$$
Here, $C_1$ and $C_2$ are given by $C_1 = 2/\ln(1+\sigma^{-2})$ and $C_2 = 16\ln\frac{18}{\delta}$.

(2) When the algorithm terminates, the estimated Pareto set $\hat{\Pi}_t$ is an $\epsilon$-accurate Pareto set.

Proof. We omit the proof of (1) because it is the same as the proof of Theorem 4.2. We only prove (2). From (41) and Lemmas D.1 and D.2, the following holds for any $x \in \mathcal{X}$:
$$|F_1(x) - F_1([x])| \leq L\|x - [x]\|_1 \leq \frac{Ld}{\tau} \leq \frac{\epsilon_1}{2}, \quad (42)$$
$$|F_2(x) - F_2([x])| \leq \sqrt{4\tilde{B}L\|x - [x]\|_1} \leq \sqrt{\frac{4\tilde{B}Ld}{\tau}} \leq \frac{\epsilon_2}{2}. \quad (43)$$
Assume that (13) holds. Let $\tilde{Z}$ be the Pareto front for $\tilde{\mathcal{X}}$. Then, for any $y \in \tilde{Z}$, it holds that
$$y \in \bigcup_{(y'_1, y'_2) \in Z} (-\infty, y'_1] \times (-\infty, y'_2], \quad (44)$$
where $Z$ is the Pareto front for $\mathcal{X}$. Similarly, let
$$Z_-(\epsilon/2) = \bigcup_{(y'_1, y'_2) \in Z} (-\infty, y'_1 - \epsilon_1/2) \times (-\infty, y'_2 - \epsilon_2/2).$$
Then, for any $y'' \in Z_-(\epsilon/2)$, there exists $x \in \mathcal{X}$ such that
$$y''_1 < F_1(x) - \epsilon_1/2, \quad y''_2 < F_2(x) - \epsilon_2/2.$$
Here, from (42) and (43) we have
$$F_1(x) \leq F_1([x]) + \epsilon_1/2, \quad F_2(x) \leq F_2([x]) + \epsilon_2/2.$$
Thus, it holds that $y''_1 < F_1([x])$ and $y''_2 < F_2([x])$. This implies that
$$Z_-(\epsilon/2) \subset \{y \in \mathbb{R}^2 \mid \exists x \in \tilde{\mathcal{X}}, \ y \preceq F(x)\} \equiv A.$$
Here, since $Z_-(\epsilon/2)$ is an open set, noting that $Z_-(\epsilon/2) \subset A$ we get $Z_-(\epsilon/2) \subset \mathrm{int}(A)$, where $\mathrm{int}(A)$ is the interior of $A$. In addition, from the definition of the interior and the boundary (frontier), we obtain $\mathrm{int}(A) \cap \partial A = \emptyset$. Therefore, from $\partial A = \tilde{Z}$ and $Z_-(\epsilon/2) \subset \mathrm{int}(A)$, it holds that $Z_-(\epsilon/2) \cap \tilde{Z} = \emptyset$. Hence, for any $y \in \tilde{Z}$, we have $y \notin Z_-(\epsilon/2)$; combining this with (44) and Lemma D.3 applied with $\epsilon/2$, we obtain $\tilde{Z} \subset Z_*(\epsilon/2)$. Hence, for any $y \in \tilde{Z}$, there exists $a \in Z$ such that
$$y_1 = a_1 - \epsilon'_1, \quad y_2 = a_2 - \epsilon'_2, \quad 0 \leq \epsilon'_1 \leq \epsilon_1/2, \quad 0 \leq \epsilon'_2 \leq \epsilon_2/2. \quad (45)$$
Furthermore, from Theorem 4.2, for any $x \in \hat{\Pi}_t$, there exists $y^\dagger \in \tilde{Z}$ such that
$$y^\dagger_1 \leq F_1(x) + \epsilon_1/2, \quad y^\dagger_2 \leq F_2(x) + \epsilon_2/2.$$
By combining this and (45), we get
$$a_1 = y^\dagger_1 + \epsilon'_1 \leq F_1(x) + \epsilon_1/2 + \epsilon'_1 \leq F_1(x) + \epsilon_1, \quad a_2 = y^\dagger_2 + \epsilon'_2 \leq F_2(x) + \epsilon_2/2 + \epsilon'_2 \leq F_2(x) + \epsilon_2.$$
Therefore, we have $F(\hat{\Pi}_t) \subset Z_\epsilon$.

Furthermore, let $x \in \Pi$. For $[x] \in \tilde{\mathcal{X}}$, since $\hat{\Pi}_t$ is an $(\epsilon/2)$-accurate Pareto set for $\tilde{\mathcal{X}}$, there exists $x' \in \hat{\Pi}_t$ such that $F([x]) \preceq_{\epsilon/2} F(x')$. Moreover, from (42) and (43), it holds that $F(x) \preceq F([x]) + \epsilon/2$. This implies that
$$F(x) \preceq F([x]) + \epsilon/2 \preceq F(x') + \epsilon.$$
Therefore, for any $x \in \Pi$, there exists $x' \in \hat{\Pi}_t$ such that $x \preceq_{\epsilon} x'$. Thus, $\hat{\Pi}_t$ is an $\epsilon$-accurate Pareto set for $\mathcal{X}$. $\blacksquare$