Active learning for distributionally robust level-set estimation
AActive learning for distributionally robust level-set estimation
Yu Inatsu Shogo Iwazaki Ichiro Takeuchi , , ∗ Department of Computer Science, Nagoya Institute of Technology RIKEN Center for Advanced Intelligence Project ∗ E-mail: [email protected]
ABSTRACT
Many cases exist in which a black-box function f with high evaluation cost depends on two types ofvariables x and w , where x is a controllable design variable and w are uncontrollable environmental variablesthat have random variation following a certain distribution P . In such cases, an important task is to find therange of design variables x such that the function f ( x , w ) has the desired properties by incorporating therandom variation of the environmental variables w . A natural measure of robustness is the probability that f ( x , w ) exceeds a given threshold h , which is known as the probability threshold robustness (PTR) measure inthe literature on robust optimization. However, this robustness measure cannot be correctly evaluated whenthe distribution P is unknown. In this study, we addressed this problem by considering the distributionallyrobust PTR (DRPTR) measure, which considers the worst-case PTR within given candidate distributions.Specifically, we studied the problem of efficiently identifying a reliable set H , which is defined as a region inwhich the DRPTR measure exceeds a certain desired probability α , which can be interpreted as a level setestimation (LSE) problem for DRPTR. We propose a theoretically grounded and computationally efficientactive learning method for this problem. We show that the proposed method has theoretical guaranteeson convergence and accuracy, and confirmed through numerical experiments that the proposed methodoutperforms existing methods.
1. Introduction
In the manufacturing industry, product performance often depends on two types of variables: design variablesand environmental variables. The design variables are completely controllable, whereas environmental variablesare random variables that change depending on the usage environment of the product. When considering sucha problem, it is important to identify the design variables that allow the product performance to exceed thedesired requirement threshold with a sufficiently high degree of confidence, taking into account the randomnessof the environmental variables. In this setting, we must emphasize that there are two distinctly different phasesof the product: the development phase and the use phase. In the development phase, we have full control overthe design variables and environmental variables. In the use phase, on the other hand, the design variables arefixed, and the environmental variables change randomly and cannot be controlled.Let f ( x , w ) represent the performance of the product, and let h ∈ R be a desired performance threshold,where x is a design variable defined on X , and w is an environmental variable defined on Ω. Then, we considerthe following robustness measure: PTR( x ) = (cid:90) Ω f ( x , w ) > h ] p † ( w )d w , where 1l[ · ] is the indicator function and p † ( w ) is the probability density function of w . This measure is called theprobability threshold robustness (PTR) measure in the field of robust optimization [2], and can be interpretedas a measure of how well the design variables behave under randomness in the environmental variables. In themanufacturing industry, it is desirable to identify the set of controllable variables x ∈ X for which PTR( x ) isgreater than a certain threshold. In other words, this problem is interpreted as a level-set estimation (LSE)[4, 8] of the PTR measure. There are two main reasons for considering LSE of the PTR measure. One is thatby enumerating all the design variables that exceed the desired threshold with a high probability, it is possibleto respond the usage conditions of various users. The other is to consider some optimization problem (e.g., tofind x with the minimum price) for design variables with PTR measures above a certain level. This is knownas the chance-constrained programming problem [5], and has many applications such as finance, in addition tomanufacturing industry. Unfortunately, however, the PTR measure cannot be correctly evaluated when p † ( w )is unknown. If p † ( w ) is unknown and the estimated density is simply plugged in, then PTR( x ) is no longervalid as a robustness measure because of the estimation error.In this study, we considered a distributionally robust PTR (DRPTR) measure, which includes uncertaintyabout p † ( w ) under the setting that p † ( w ) is unknown. Let A be a user-specified class of candidate distributionsof w . Then, the DRPTR measure can be defined as F ( x ) = inf p ( w ) ∈A (cid:90) Ω f ( x , w ) > h ] p ( w )d w . The DRPTR measure has the advantage of being robust with respect to using wrong distributions because it canbe interpreted as the PTR in the worst case among the candidate distributions. In this study, we formulated this1 a r X i v : . [ s t a t . M L ] F e b roblem as an active learning problem for the LSE for F ( x ) instead of PTR( x ), and developed a theoreticallygrounded and numerically efficient algorithm for its calculation. The basic ideas of our proposed method areas follows. First, we consider the function f ( x , w ) to be a black-box function with a high evaluation cost, andwe employ a Gaussian process (GP) model as a surrogate model. Next, we predict the target DRPTR measureusing the GP model for the black-box function f ( x , w ). Finally, we perform LSE using credible intervals of theDRPTR measure calculated on the basis of this prediction. Active learning using GP models [29] for black-box functions have been actively studied in the context ofBayesian optimization (see, e.g., [21, 22]). Several studies have been conducted on active learning for LSE[4, 8, 30, 10]. Furthermore, some researchers applied LSE to efficiently identify safety regions [25, 27, 24, 28],and others used LSE to enumerate the local minima of black-box functions [9].Many studies have been conducted on active learning under input uncertainty (including random environ-mental variables). In [11], the authors proposed an efficient method for performing LSE in the setting where theinput is a random variable generated from a certain distribution. In other studies, the researchers formulatedthe randomness of the input with some robustness measures for performing active learning on it. For example,the authors of [3] used the worst-case function value of the input shift as a robustness measure. Similarly, otherresearch ([1, 26, 18, 6, 7, 14]) dealt with the stochastic robustness (SR) measure, which is a robustness measuredefined by integrating the black-box function against the input distribution. In another study closely related tothe present work, the authors of [12] proposed an active learning method for LSE in the PTR measure on the ba-sis of random inputs; in [14], the authors considered an active learning method for both LSE and maximizationproblems in the PTR measure. However, these two are not distributionally robust settings. Distributionallyrobust optimization (DRO), which is not an active learning framework, was first introduced by [20]. DRO is animportant topic in the context of robust optimization, and there have been countless related studies (see [19]for comprehensive survey of DRO). Active learning methods for DRO with uncertainty environmental variableshave recently been proposed by [16, 17]. The main differences to our problem setup are that they focus ona distributionally robust SR (DRSR) measure for the target function, which is the worst-case SR measure incandidate distributions of the unknown environmental variable, and consider the maximization problem for theDRSR measure. In particular, for the former, we cannot directly apply their proposed methods and theoreticaltechniques because the target function is different from ours. To the best of our knowledge, none of these studieshave addressed the same research problem considered in the present work.
The main contributions of this study are summarized as follows: • We formulate the LSE problem for the DRPTR measure, i.e., the problem of finding the set of designvariables for which the DRPTR measure exceeds a given threshold. • We construct non-trivial credible intervals for the DRPTR measure and propose a new acquisition function(AF) based on an expected classification improvement. Using them, we propose an active learning methodfor the LSE of the DRPTR measure. Moreover, because the naive implementation of our proposed AFrequires a large computational cost, we propose a computationally efficient technique for its calculation. • We clarify the theoretical property of the proposed method. Under mild conditions, we show that theproposed method has desirable accuracy and convergence properties. • We describe the empirical performance of the proposed method through the results of numerical experi-ments with benchmark functions and real data.
2. Preliminary
Let f : X × Ω → R be an expensive-to-evaluate black-box function. We assume that X and Ω are finitesets. For each input ( x , w ) ∈ X × Ω, the value of f ( x , w ) is observed as f ( x , w ) + ε with an independent noise ε , where ε follows Gaussian distribution N (0 , σ ). In our setting, a variable w ∈ Ω stochastically fluctuates bythe (unknown) discrete distribution P † in the use phase, whereas we can specify w in the development phase.Moreover, let A be a family of candidate distributions of P † . In this work, we consider A = { p.m.f. p ( w ) | d ( p ( w ) , p ∗ ( w )) < (cid:15) } . where p ∗ ( w ) is a user-specified reference distribution, d ( · , · ) is a given distance metricbetween two distributions, and (cid:15) >
0. Then, under the given threshold h , we define the DRPTR F ( x ) for each x ∈ X as F ( x ) = inf p ( w ) ∈A (cid:88) w ∈ Ω f ( x , w ) > h ] p ( w ) . H of X that satisfies F ( x ) > α for a given threshold α ∈ (0 , H = { x ∈ X | F ( x ) > α } . (2.1)Moreover, we define the lower set L as L = { x ∈ X | F ( x ) ≤ α } . Gaussian process
In this study, we used the Gaussian process (GP) to model the unknown black-box function f . First, we assume that the GP, GP (0 , k (( x , w ) , ( x (cid:48) , w (cid:48) ))) is a prior distribution of f , where k (( x , w ) , ( x (cid:48) , w (cid:48) ))is a positive-definite kernel. Then, given the dataset { ( x i , w i , y i ) } ti =1 , the posterior distribution of f also followsthe GP, and its posterior mean µ t ( x , w ) and posterior variance σ t ( x , w ) are given by µ t ( x , w ) = k (cid:62) t ( x , w )( K t + σ I t ) − y t ,σ t ( x , w ) = k (( x , w ) , ( x , w )) − k (cid:62) t ( x , w )( K t + σ I t ) − k t ( x , w ) , where k t ( x , w ) is the t -dimensional vector whose j th element is k (( x , w ) , ( x j , w j )), y t = ( y , . . . , y t ) (cid:62) , I t is the t × t identity matrix, and K t is the t × t matrix whose ( j, k )th element is k (( x j , w j ) , ( x k , w k )).
3. Proposed method
In this section, we propose an active learning method for efficiently identifying (2.1). The target function F ( x ) is a random variable because F ( x ) is the function of f ( x , w ), and f ( x , w ) is drawn from GP. Thus, areasonable method to identify (2.1) is to construct a credible interval of F ( x ), and estimate H using the lowerbound of the constructed credible interval. Unfortunately, although f ( x , w ) follows GP, F ( x ) does not followGP. Hence, the credible interval of F ( x ) cannot be directly calculated on the basis of normal distributions. Inthe next section, we propose a simple and theoretically valid credible interval of F ( x ) using the credible intervalof f ( x , w ). For any input ( x , w ) ∈ X × Ω and step t , we define a credible interval of f ( x , w ) as Q t ( x , w ) = [ l t ( x , w ) , u t ( x , w )],where l t ( x , w ) = µ t ( x , w ) − β / t σ t ( x , w ), u t ( x , w ) = µ t ( x , w ) + β / t σ t ( x , w ), and β / t ≥
0. Similarly, wedefine a credible interval of 1l[ f ( x , w ) > h ] on the basis of Q t ( x , w ). For the theoretical analysis described inSection 4, we introduce a user-specified accuracy parameter η >
0. Specifically, we define the credible intervalof 1l[ f ( x , x ) > h ] at step t as ˜ Q t ( x , w ; η ) ≡ [˜ l t ( x , w ; η ) , ˜ u t ( x , w ; η )]= [1 ,
1] if l t ( x , w ) > h − η, [0 ,
1] if l t ( x , w ) ≤ h − η and u t ( x , w ) > h, [0 ,
0] if l t ( x , w ) ≤ h − η and u t ( x , w ) ≤ h. Note that when the accuracy parameter η = 0, this credible interval simply indicates that if the lower (resp.upper) bound of f ( x , w ) is greater (resp. smaller) than h , we say that 1l[ f ( x , w ) > h ] = 1 (resp. 0). Thus, acredible interval Q ( F ) t ( x ; η ) ≡ [ l ( F ) t ( x ; η ) , u ( F ) t ( x ; η )] of the target function F ( x ) can be given by l ( F ) t ( x ; η ) = inf p ( w ) ∈A (cid:88) w ∈ Ω ˜ l t ( x , w ; η ) p ( w ) , u ( F ) t ( x ; η ) = inf p ( w ) ∈A (cid:88) w ∈ Ω ˜ u t ( x , w ; η ) p ( w ) . (3.1)Note that if we use the L L d ( · , · ), equation (3.1) is equivalent to solvinga linear (or second-order cone) programming problem. In both cases, because solvers exist that can computethe optimal solution quickly, it is easy to compute Q ( F ) t ( x ; η ) when using such distance functions. Then, weestimate H and L using Q ( F ) t ( x ; η ) as follows: H t = { x ∈ X | l ( F ) t ( x ; η ) > α } , L t = { x ∈ X | u ( F ) t ( x ; η ) ≤ α } . Also, we define the unclassified set as U t = X \ ( H t ∪ L t ). In this section, we propose two acquisition functions to select the next evaluation point. Our proposedacquisition functions are based on the maximum improvement in level-set estimation (MILE) strategy proposedin [30]. In MILE, the expected value of the increase in the number of classifications after adding the new point( x ∗ , w ∗ ) is calculated, and the point with the largest expected value is selected. In this study, owing to the3 lgorithm 1 Active learning for distributionally robust level-set estimation
Input:
GP prior GP (0 , k ), threshold h ∈ R , probability α ∈ (0 , η >
0, tradeoffparameter { β t } t ≤ T H ← ∅ , L ← ∅ , U ← X , t ← while U t − (cid:54) = ∅ do Compute l ( F ) t ( x ; η ) and u ( F ) t ( x ; η ) for all x ∈ X Choose ( x t , w t ) by ( x t , w t ) = argmax ( x ∗ , w ∗ ) ∈X × Ω a (1) t − ( x ∗ , w ∗ ) (or a (2) t − ( x ∗ , w ∗ ) instead of a (1) t − ( x ∗ , w ∗ ))Observe y t ← f ( x t , w t ) + ε t Update GP by adding (( x t , w t ) , y t ) and compute H t , L t and U t t ← t + 1 end while ˆ H ← H t − , ˆ L ← L t − Output:
Estimated Set ˆ H, ˆ L computational cost of calculating the acquisition function, we consider a strategy based on the expected valuewhere points in the unclassified set are classified as H .Let ( x ∗ , w ∗ ) be a new point, and let y ∗ = f ( x ∗ , w ∗ )+ ε be a new observation at point ( x ∗ , w ∗ ). Furthermore,let l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) be the lower bound of the credible interval of F ( x ), where η = 0 when ( x ∗ , w ∗ , y ∗ ) isnewly added. Then, we consider the function a t ( x ∗ , w ∗ ): a t ( x ∗ , w ∗ ) = (cid:88) x ∈ U t E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]] . (3.2)In this work, we do not directly use (3.2) as the acquisition function because the value of (3.2) is sometimesexactly zero for any point. A reasonable method to avoid this problem is to consider a different function b t ( x ∗ , w ∗ ) only when the values of (3.2) are all zero. For theoretical treatment, we follow the strategy describedin [30], and consider the acquisition function of the form max { a t ( x ∗ , w ∗ ) , γb t ( x ∗ , w ∗ ) } with a positive constantparameter γ . Note that if we use a sufficiently small γ , it is almost the same when considering b t ( x ∗ , w ∗ ) onlywhen the values of (3.2) are all zero; otherwise, a t ( x ∗ , w ∗ ). In Section 4, we present the theoretical guaranteesof our proposed method for this acquisition function. In this section, we propose two types of b t ( x ∗ , w ∗ ). Thefirst is based on the RMILE acquisition function proposed by [30]. The basic idea of RMILE is to add anadditional variance term γσ t ( x ∗ , w ∗ ) to the original MILE acquisition function. By using the same argument,we define the following modified acquisition function: Definition 3.1 (Proposed acquisition function 1) . Let a t ( x ∗ , w ∗ ) be the function defined by (3.2), and let γ be a positive parameter. Then, we propose the following acquisition function a (1) t ( x ∗ , w ∗ ): a (1) t ( x ∗ , w ∗ ) = max { a t ( x ∗ , w ∗ ) , γσ t ( x ∗ , w ∗ ) } . Moreover, we select the next evaluation point ( x t +1 , w t +1 ) by maximizing a (1) t ( x ∗ , w ∗ ).The other acquisition function we propose uses γ RMILE t ( x ∗ , w ∗ ) instead of γσ t ( x ∗ , w ∗ ) as the function b t ( x ∗ , w ∗ ), where RMILE t ( x ∗ , w ∗ ) is the RMILE function proposed in [30]. Definition 3.2 (Proposed acquisition function 2) . Let a t ( x ∗ , w ∗ ) be the function defined by (3.2), and let γ be a positive parameter. Then, we propose the following acquisition function a (2) t ( x ∗ , w ∗ ): a (2) t (( x ∗ , w ∗ )) = max { a t ( x ∗ , w ∗ ) , γ RMILE t ( x ∗ , w ∗ ) } . Moreover, we select the next evaluation point ( x t +1 , w t +1 ) by maximizing a (2) t ( x ∗ , w ∗ ).The pseudocode of the proposed method is given in Algorithm 1. Our proposed acquisition functions are based on (3.2), where (3.2) includes the calculation of the expectedvalue. This expectation cannot be expressed as a simple expression using the cumulative distribution function(CDF) of the standard normal distribution, as in the original MILE [30]. One way to solve this problemis to generate many samples from the posterior distribution of y ∗ and numerically calculate the expectedvalue. However, because one optimization calculation is required to calculate 1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ],if the expected value is calculated using M samples, then M | U t | optimization calculations are required tocalculate a t ( x ∗ , w ∗ ) for each ( x ∗ , w ∗ ). Therefore, to calculate a t ( x ∗ , w ∗ ) for all candidate points, M | U t ||X × Ω | optimization calculations are required. To reduce this large computational cost, we provide useful lemmas forefficiently computing the acquisition function. The expected values in (3.2) can be exactly calculated using thefollowing lemma: 4 emma 3.1. Let l t ( x , w j | x ∗ , w ∗ , y ∗ ) be the lower confidence bound of f ( x , w j ) after adding ( x ∗ , w ∗ , y ∗ )to { ( x i , w i , y i ) } ti =1 . Furthermore, let r j be a number satisfying h = l t ( x , w j | x ∗ , w ∗ , r j ), and let r ( j ) bethe j th-smallest number in the range r to r | Ω | . For each s ∈ { , . . . , | Ω | + 1 } ≡ [ | Ω | + 1], define R s =( r ( s − , r ( s ) ], where r (0) = −∞ and r ( | Ω | +1) = ∞ . Moreover, let c s be a real number satisfying c s ∈ R s . Then, E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]] can be calculated as follows: E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]] = | Ω | +1 (cid:88) s =1 P ( y ∗ ∈ R s )1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] . (3.3)Lemma 3.1 implies that | Ω | +1 optimization calculations are required to calculate E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) >α ]], but the following lemma shows that the number of optimization calculations can be reduced by checking asimple inequality: Lemma 3.2.
Let c , . . . , c | Ω | +1 be numbers defined as in Lemma 3.1. Suppose that c s satisfies (cid:88) w ∈ Ω l t ( x , w | x ∗ , w ∗ , c s ) > h ] p ∗ ( w ) ≤ α. Then, 1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] = 0.Finally, noting that 0 ≤ P ( y ∗ ∈ R s ) ≤ ≤ l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] ≤
1, we can approximate (3.3)with any approximation accuracy ζ > Lemma 3.3.
Let ζ >
0, and defineˆ a t ( x ∗ , w ∗ ) = (cid:88) s ∈ S t P ( y ∗ ∈ R s )1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] ,S t = { s ∈ [ | Ω | + 1] | P ( y ∗ ∈ R s ) ≥ ζ/ ( | Ω | + 1) } . Then, ˆ a t ( x ∗ , w ∗ ) satisfies the following inequality: | E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]] − ˆ a t ( x ∗ , w ∗ ) | ≤ ζ. Lemma 3.3 implies that the number of optimization calculations for (3.3) can be further reduced if the error ζ is allowed. In addition, we must emphasize that P ( y ∗ ∈ R s ) is often very small for most s when we actuallycalculate (3.3). Therefore, from these properties, if we apply Lemma 3.3 using a sufficiently small ζ , we canreduce the computational cost of (3.3) significantly with almost no error. Detailed numerical comparisons areprovided in Section 5.
4. Theoretical analysis
In this section, we provide three theorems regarding the accuracy and convergence properties of our methods.First, we define the misclassification loss e α ( x ) for each x ∈ X as follows: e α ( x ) = (cid:26) max { , F ( x ) − α } if x ∈ ˆ L max { , α − F ( x ) } if x ∈ ˆ H .
Furthermore, for theoretical reasons, we assume that the black-box function f follows GP GP (0 , k (( x , w ) , ( x (cid:48) , w (cid:48) ))).In addition, for technical reasons, we assume that the prior variance k (( x , w ) , ( x , w )) ≡ σ ( x , w ) satisfies0 < σ ,min ≡ min ( x , w ) ∈X × Ω σ ( x , w ) ≤ max ( x , w ) ∈X × Ω σ ( x , w ) ≤ . Moreover, let κ T be the maximum information gain at step T . Note that κ T is a measure often used to showtheoretical guarantee for GP-based active learning methods (see, e.g., [23]), and can be expressed using mutualinformation I ( y ; f ) between the observed vector y and f as κ T = max A ⊂X × Ω I ( y A ; f ) . Then, the followingtheorem regarding accuracy holds:
Theorem 4.1.
Let h ∈ R , α ∈ (0 , t ≥
1, and δ ∈ (0 , β t = 2 log( |X × Ω | π t / (3 δ )). Moreover,for a user-specified accuracy parameter ξ >
0, we define η > η = min (cid:26) ξσ ,min , ξ δσ ,min |X × Ω | (cid:27) . Then, when Algorithm 1 terminates, with a probability of at least 1 − δ , the misclassification loss is boundedby ξ , that is, the following inequality holds: P (cid:18) max x ∈X e α ( x ) ≤ ξ (cid:19) ≥ − δ. Theorem 4.2.
Under the same setting as described in Theorem 4.1, let γ > C = 2 / log(1 + σ − ). Inaddition, let T be the smallest positive integer satisfying the following four inequalities:(1) σ − β / T C κ T T < η , (2) σ − C κ T T < η , (3) C β T κ T T < η , (4) 12 log β T − T η σ C κ T < log( |X | − −| Ω | ηγ (2 π ) / / . Then, Algorithm 1 terminates (i.e., U T = ∅ ) after at most T trials when we use the acquisition function a (1) t ( x ∗ , w ∗ ).Furthermore, the similar theorem holds if the acquisition function a (2) t (( x ∗ , w ∗ )) is used. In this study,owing to the practical performance, we modified the original RMILE toRMILE t ( x ∗ , w ∗ ) = max { MILE t ( x ∗ , w ∗ ) , ˜ γσ t ( x ∗ , w ∗ ) } , MILE t ( x ∗ , w ∗ ) = (cid:88) ( x , w ) ∈ U t × Ω E y ∗ [1l[ l t ( x , w | x ∗ , w ∗ , y ∗ ) > h ]] − |{ ( x , w ) ∈ U t × Ω | l t ( x , w ) > h − η }| . Then, the following theorem holds:
Theorem 4.3.
Under the same setting described in Theorem 4.1, let γ >
0, ˜ γ >
0, and C = 2 / log(1 + σ − ).In addition, let T be the smallest positive integer satisfying the following five inequalities:(1) σ − β / T C κ T T < η , (2) σ − C κ T T < η , (3) C β T κ T T < η , (4) 12 log β T − T η σ C κ T < log( |X | − −| Ω | ηγ ˜ γ (2 π ) / / , (5) 12 log β T − T η σ C κ T < log( |X × Ω | − η ˜ γ (2 π ) / / . Then, Algorithm 1 terminates (i.e., U T = ∅ ) after at most T trials when we use the acquisition function a (2) t ( x ∗ , w ∗ ).The order of the maximum information gain κ T is known to be sublinear under mild conditions [23]. Hence,because the order of β T is O (log T ), there exist positive integers satisfying the inequalities in Theorems 4.2 and4.3.
5. Numerical experiments
We confirmed the performance of the proposed method using both synthetic and real data. Because ofspace limitation, we provide a part of experimental results in the main text. All experimental results and detailparameter settings are given in the Appendix. The input space
X ×
Ω was defined as a set of grid points thatuniformly cut the region [ L , U ] × [ L , U ] into 50 ×
50. In all experiments, we used the following Gaussiankernel as the kernel function: k (( x, w ) , ( x (cid:48) , w (cid:48) )) = σ f exp( −{ ( x − x (cid:48) ) + ( w − w (cid:48) ) } /L ) . Moreover, we used L p ∗ ( w ): Uniform: p ∗ ( w ) = 1 / . Normal: p ∗ ( w ) = a ( w ) (cid:80) w ∈ Ω a ( w ) , a ( w ) = 1 √ π exp( − w / . Then, we compared the following acquisition functions:
Random:
Select ( x t +1 , w t +1 ) by using random sampling. US:
Perform uncertainty sampling, i.e., ( x t +1 , w t +1 ) = argmax ( x,w ) ∈X × Ω σ t ( x, w ).6 traddle f: Perform straddle strategy [4], i.e., ( x t +1 , w t +1 ) = argmax ( x,w ) ∈X × Ω v t ( x, w ), where v t ( x, w ) =min { u t ( x, w ) − h, h − l t ( x, w ) } . Straddle US:
Select x t +1 and w t +1 by using the straddle of F ( x ) and σ t ( x t +1 , w ), respectively, i.e., x t +1 =argmax x ∈X v Ft ( x ) and w t +1 = argmax w ∈ Ω σ t ( x t +1 , w ), where v Ft ( x ) = min { u Ft ( x ; η ) − α, α − l Ft ( x ; η ) } . Straddle random:
Replace the selection method of w t +1 in straddle US with random sampling. MILE:
Perform the original MILE strategy, i.e., ( x t +1 , w t +1 ) was selected by using (6) in [30]. Proposed1 . : Perform a (1) t ( x ∗ , w ∗ ) with γ = 0 . Proposed1 . : Perform a (1) t ( x ∗ , w ∗ ) with γ = 0 . Proposed2 . : Perform a (2) t ( x ∗ , w ∗ ) with γ = 0 . Proposed2 . : Perform a (2) t ( x ∗ , w ∗ ) with γ = 0 . η to zero. Similarly, because of the computational costof calculating acquisition functions, we replaced P ( y ∗ ∈ R s )1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] in (3.3) with zero when P ( y ∗ ∈ R s ) satisfies P ( y ∗ ∈ R s ) < . ζ/ ( | Ω | + 1) = 0 .
005 toapproximate (3.3).
We confirmed the performance of the proposed method using synthetic functions. We considered the fol-lowing four functions, which are commonly used benchmark functions (the last one adds − Booth: f ( x, w ) = ( x + 2 w − + (2 x + w − . Matyas: f ( x, w ) = 0 . x + w ) − . xw . McCormick: f ( x, w ) = sin( x + w ) + ( x − w ) − . x + 2 . w + 1. Styblinski-Tang: f ( x, w ) = ( x − x + 5 x ) / w − w + 5 w ) / − × recpre + rec , pre = | H ∩ H t || H t | , rec = | H ∩ H t || H | . From Figures 1 and 2, it can be confirmed that our proposed methods outperform other existing methods.On the other hand, in the existing methods, Straddle f and MILE exhibit high performance, because theMILE acquisition function increases the expected number of ( x, w ) satisfying l t ( x, w ) > h . As a result, because˜ l t ( x, w ; η ) and l ( F ) t ( x ; η ) become large early, the number of elements in H t also increases early. Similarly, becausethe Straddle f acquisition function can efficiently search for ( x, w ) satisfying l t ( x, w ) > h or u t ( x, w ) < h , thenumber of elements in H t also increases efficiently from the same argument as before. Furthermore, whencomparing Proposed1 and Proposed2, one of the reasons why the latter exhibits better performance is the factthat RMILE exhibits better performance than uncertainty sampling. Other experiments, a comparison of thedifference in γ is described in the Appendix. In this section, we confirmed how much the computation time of (3.2) can be improved by using Lemma3.1, 3.2 and 3.3. We evaluated the computation time of (3.2) when we performed the same experiment as inSubsection 5.1 using Proposed1 0 .
01 and Proposed2 0 .
01 for the Booth function. The experiments for Matyas,McCormick and Styblinski-Tang functions are described in the Appendix. Here, as for the parameter settings,we considered only the case of L Naive:
For each ( x ∗ , w ∗ ), we generate M samples y ∗ , . . . , y ∗ M from the posterior distribution of f ( x ∗ , w ∗ ), andapproximate (3.2) by (cid:88) x ∈ U t M M (cid:88) m =1 l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ m ) > α ] , where we set M = 1000. 7
50 100 150 200 250 300 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01
Booth Matyas McCormick Styblinski-TangFigure 1: Average F-score over 50 simulations with four benchmark functions when the distance function andreference distribution are L . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01
Booth Matyas McCormick Styblinski-TangFigure 2: Average F-score over 50 simulations with four benchmark functions when the distance function andreference distribution are L Naive L1 L2 L3 (10 − ) L3 (10 − ) L3 (10 − )Proposed1 0 .
01 138505 . ± .
87 7621 . ± .
23 2370 . ± .
94 71 . ± .
33 80 . ± .
37 86 . ± . .
01 106306 . ± .
01 5835 . ± .
99 2608 . ± .
06 63 . ± .
29 72 . ± .
99 78 . ± . L1:
Compute (3.2) using Lemma 3.1.
L2:
Compute (3.2) using Lemma 3.1 and 3.2. L3 (10 − ) : Compute (3.2) using Lemma 3.1, 3.2 and 3.3 with ζ = ( | Ω | + 1)10 − . L3 (10 − ) : Compute (3.2) using Lemma 3.1, 3.2 and 3.3 with ζ = ( | Ω | + 1)10 − . L3 (10 − ) : Compute (3.2) using Lemma 3.1, 3.2 and 3.3 with ζ = ( | Ω | + 1)10 − .Under this setup, we took one initial point at random and ran the algorithms until the number of iterationsreached to 300. Furthermore, for each trial t , we evaluated the computation time to calculate (3.2) for allcandidate points ( x ∗ , w ∗ ) ∈ X × Ω, and calculated the average computation time over 300 trials. From Table 1,it can be confirmed that the computation time is improved as the proposed computational techniques are used.Moreover, comparing L3 (10 − ), L3 (10 − ) and L3 (10 − ), it can be confirmed that the computation timebecomes shorter when a large ζ is used. However, it can be seen that the computation time of L3 (10 − ) is stillvery small compared to the computation time of Naive, L1 and L2. Therefore, from | Ω | = 50 and Lemma 3.3,it implies that by using proposed computational techniques, we can improve the computation time significantlyeven if the error from the true a t ( x ∗ , w ∗ ) is kept to a very small value such as 51 × − = 5 . × − . We compared our proposed method with other existing methods by using the infection control problem[15]. We considered a simulation-based decision-making problem for an epidemic, which aims to determinean acceptable infection rate x under an uncertain recovery rate w with as few simulations as possible. Themotivation for this simulation was to evaluate the tradeoff between economic risk and a controllable infectionrate. For example, if the infection rate x is minimized by shutting down all economic activities, the economic8
20 40 60 80 100 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 L L h with a probability of at least α . In thisexperiment, to simulate epidemic behavior, we used the SIR model [15]. The model computes the evolutionof the number of infected people by using an infection rate x and recovery rate w . In our experiment, weconsidered the infection rate as the design variable x and the recovery rate as the environmental variable w following an unknown distribution. In addition, we regarded economic risk as a black-box function f ( x, w ).Note that similar numerical experiments were performed in [13] under the setting where the distribution of w , p † ( w ), is known. Furthermore, we rescaled the ranges of x and w in the interval [ − , X ×
Ωis defined as a set of grid points that uniformly cut the region [ − , × [ − ,
1] into 50 ×
50. We used the followingeconomic risk function f ( x, w ): f ( x, w ) = n infected ( x, w ) − x, where n infected ( x, w ) is the maximum numberof infected people in a given period of time, calculated using the SIR model. Note that this risk function wasalso used by [13], and in this experiment, we used the same function they used in their experiment. Under thissetup, we took one initial point at random and ran the algorithms until the number of iterations reached 100.From 50 Monte Carlo simulations, we calculated average F-scores, where we used the following parameters forall problem settings: h = 135 , α = 0 . , σ = 0 . , σ f = 250 , L = 0 . , β / t = 4 , (cid:15) = 0 . . In this experiment, we used the following modified reference function as Normal: p ∗ ( w ) = a ( w ) (cid:80) w ∈ Ω a ( w ) , a ( w ) = 1 √ . π exp( − w / . . From Figure 3, it can be confirmed that Proposed2 and MILE performed better than the others.
6. Conclusion
We proposed active learning methods for identifying the reliable set of distributionally robust probabilitythreshold robustness (DRPTR) measure under uncertain environmental variables. We showed that our proposedmethods satisfy theoretical guarantees about convergence and accuracy, and outperform existing methods innumerical experiments.
Acknowledgement
This work was partially supported by MEXT KAKENHI (20H00601, 16H06538), JST CREST (JPMJCR1502),and RIKEN Center for Advanced Intelligence Project. 9 eferences [1] Justin J Beland and Prasanth B Nair. Bayesian optimization under uncertainty. In
NIPS BayesOpt 2017workshop , 2017.[2] Hans-Georg Beyer and Bernhard Sendhoff. Robust optimization–a comprehensive survey.
Computer meth-ods in applied mechanics and engineering , 196(33-34):3190–3218, 2007.[3] Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimizationwith Gaussian processes. In
Advances in neural information processing systems , pages 5760–5770, 2018.[4] Brent Bryan, Robert C Nichol, Christopher R Genovese, Jeff Schneider, Christopher J Miller, and LarryWasserman. Active learning for identifying function threshold boundaries. In
Advances in neural informa-tion processing systems , pages 163–170, 2006.[5] Abraham Charnes and William W Cooper. Chance-constrained programming.
Management science ,6(1):73–79, 1959.[6] Lukas Fr¨ohlich, Edgar Klenske, Julia Vinogradska, Christian Daniel, and Melanie Zeilinger. Noisy-inputentropy search for efficient robust bayesian optimization. In Silvia Chiappa and Roberto Calandra, editors,
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics , volume108 of
Proceedings of Machine Learning Research , pages 2262–2272. PMLR, 26–28 Aug 2020.[7] Alexandra Gessner, Javier Gonzalez, and Maren Mahsereci. Active multi-information source Bayesianquadrature. In
Uncertainty in Artificial Intelligence , pages 712–721. PMLR, 2020.[8] Alkis Gotovos, Nathalie Casati, Gregory Hitz, and Andreas Krause. Active learning for level set estimation.In
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence , IJCAI ’13,pages 1344–1350. AAAI Press, 2013.[9] Y Inatsu, D Sugita, K Toyoura, and I Takeuchi. Active learning for enumerating local minima based onGaussian process derivatives.
Neural Computation , 32(10):2032–2068, 2020.[10] Yu Inatsu, Masayuki Karasuyama, Keiichi Inoue, Hideki Kandori, and Ichiro Takeuchi. Active learning ofBayesian linear models with high-dimensional binary features by parameter confidence-region estimation.
Neural Computation , 32(10):1998–2031, 2020.[11] Yu Inatsu, Masayuki Karasuyama, Keiichi Inoue, and Ichiro Takeuchi. Active learning for level set estima-tion under input uncertainty and its extensions.
Neural Computation , 32(12):2486–2531, 2020.[12] Shogo Iwazaki, Yu Inatsu, and Ichiro Takeuchi. Bayesian experimental design for finding reliable level setunder input uncertainty.
IEEE Access , 8:203982–203993, 2020.[13] Shogo Iwazaki, Yu Inatsu, and Ichiro Takeuchi. Bayesian quadrature optimization for probability thresholdrobustness measure. arXiv preprint arXiv:2006.11986 , 2020.[14] Shogo Iwazaki, Yu Inatsu, and Ichiro Takeuchi. Mean-variance analysis in Bayesian optimization underuncertainty. In
The 24th International Conference on Artificial Intelligence and Statistics , 2021. To appear.[15] William Ogilvy Kermack and Anderson G McKendrick. A contribution to the mathematical theory ofepidemics.
Proceedings of the royal society of london. Series A, Containing papers of a mathematical andphysical character , 115(772):700–721, 1927.[16] Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka, and Andreas Krause. Distributionally robustBayesian optimization. In Silvia Chiappa and Roberto Calandra, editors,
Proceedings of the Twenty ThirdInternational Conference on Artificial Intelligence and Statistics , volume 108 of
Proceedings of MachineLearning Research , pages 2174–2184. PMLR, 26–28 Aug 2020.[17] Thanh Nguyen, Sunil Gupta, Huong Ha, Santu Rana, and Svetha Venkatesh. Distributionally robustBayesian quadrature optimization. In Silvia Chiappa and Roberto Calandra, editors,
Proceedings of theTwenty Third International Conference on Artificial Intelligence and Statistics , volume 108 of
Proceedingsof Machine Learning Research , pages 1921–1931. PMLR, 26–28 Aug 2020.[18] Rafael Oliveira, Lionel Ott, and Fabio Ramos. Bayesian optimisation under uncertain inputs. In
The 22ndInternational Conference on Artificial Intelligence and Statistics , pages 1177–1184, 2019.[19] Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review. arXiv preprintarXiv:1908.05659 , 2019. 1020] Herbert Scarf. A min-max solution of an inventory problem.
Studies in the mathematical theory of inventoryand production , 10:201–209, 1958.[21] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Depart-ment of Computer Sciences, 2009.[22] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the humanout of the loop: A review of Bayesian optimization.
Proceedings of the IEEE , 104(1):148–175, 2016.[23] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization inthe bandit setting: No regret and experimental design. In
Proceedings of the 27th International Conferenceon International Conference on Machine Learning , ICML’10, pages 1015–1022, USA, 2010. Omnipress.[24] Yanan Sui, Joel Burdick, Yisong Yue, et al. Stagewise safe Bayesian optimization with Gaussian processes.In
International Conference on Machine Learning , pages 4781–4789, 2018.[25] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization withGaussian processes. In
International Conference on Machine Learning , pages 997–1005, 2015.[26] Saul Toscano-Palmerin and Peter I Frazier. Bayesian optimization with expensive integrands. arXiv preprintarXiv:1803.08661 , 2018.[27] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite Markov decision pro-cesses with Gaussian processes. In
Proceedings of the 30th International Conference on Neural InformationProcessing Systems , pages 4312–4320, 2016.[28] Akifumi Wachi, Yanan Sui, Yisong Yue, and Masahiro Ono. Safe exploration and optimization of con-strained MDPs using Gaussian processes. In
Proceedings of the AAAI Conference on Artificial Intelligence ,volume 32, 2018.[29] Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for machine learning. the MITPress , 2(3):4, 2006.[30] Andrea Zanette, Junzi Zhang, and Mykel J Kochenderfer. Robust super-level set estimation using Gaussianprocesses. In
Joint European Conference on Machine Learning and Knowledge Discovery in Databases ,pages 276–291. Springer, 2018. 11 ppendixA. ProofsA.1. Proof of Theorem 4.1
In this section, we prove Theorem 4.1. First, we show two lemmas.
Lemma A.1.
Let δ ∈ (0 , β t = 2 log( |X × Ω | π t / (6 δ )). Then, with a probability of at least 1 − δ ,the following inequality holds: | f ( x , w ) − µ t − ( x , w ) | ≤ β / t σ t − ( x , w ) , ∀ ( x , w ) ∈ X × Ω , ∀ t ≥ . Proof.
By replacing D and π t in Lemma 5.1 of [23] with X ×
Ω and π t /
6, respectively, we have LemmaA.1.
Lemma A.2.
Let δ ∈ (0 , ξ > η = min (cid:110) ξσ ,min , ξ δσ ,min |X × Ω | (cid:111) . Then, with a probability of at least1 − δ/
2, the following holds for any x ∈ X and p ( w ) ∈ A :˜ F η,p ( x ) ≡ (cid:88) w ∈ Ω h ≥ f ( x , w ) > h − η ] p ( w ) < ξ. Proof.
From Chebyshev’s inequality, for any ν > x , w ) ∈ X × Ω, the following inequality holds: P ( | g η ( x , w ) − µ ( g η ) ( x , w ) | ≥ ν ) ≤ V [ g η ( x , w )] ν , where g η ( x , w ) = 1l[ h ≥ f ( x , w ) > h − η ] and µ ( g η ) ( x , w ) = E [ g η ( x , w )]. Hence, by replacing ν with ( δ/ (2 |X × Ω | )) − / ( V [ g η ( x , w )]) / , with a probability of at least 1 − δ/
2, the following holds for any ( x , w ) ∈ X × Ω: | g η ( x , w ) − µ ( g η ) ( x , w ) | < (cid:112) V [ g η ( x , w )] (cid:112) δ/ (2 |X × Ω | ) . This implies that g η ( x , w ) < µ ( g η ) ( x , w ) + (cid:112) V [ g η ( x , w )] (cid:112) δ/ (2 |X × Ω | ) . (A.1)Moreover, noting that g η ( x , w ) follows Bernoulli distribution, we get V [ g η ( x , w )] = E [ g η ( x , w )](1 − E [ g η ( x , w )]) ≤ E [ g η ( x , w )] = µ ( g η ) ( x , w ) . (A.2)In addition, µ ( g η ) ( x , w ) can be expressed as µ ( g η ) ( x , w ) = Φ (cid:18) hσ ( x , w ) (cid:19) − Φ (cid:18) h − ησ ( x , w ) (cid:19) . Furthermore, by using Taylor’s expansion, for any a < b it holds thatΦ( b ) = Φ( a ) + φ ( c )( b − a ) ≤ Φ( a ) + φ (0)( b − a ) ≤ Φ( a ) + ( b − a ) , where c ∈ ( a, b ). Thus, we obtain µ ( g η ) ( x , w ) ≤ ησ ( x , w ) ≤ ησ ,min . (A.3)Thus, by substituting (A.2) and (A.3) into (A.1), we have g η ( x , w ) < ησ ,min + (cid:115) η |X × Ω | δσ ,min . Hence, from the definition of η , we get g η ( x , w ) < ξ (cid:114) ξ ξ. Therefore, for any p ( w ) ∈ A , the following holds:˜ F η,p ( x ) = (cid:88) w ∈ Ω g η ( x , w ) p ( w ) < (cid:88) w ∈ Ω ξp ( w ) = ξ.
12y using Lemma A.1 and A.2, we prove Theorem 4.1.
Proof.
Let δ ∈ (0 ,
1) and β t = 2 log( |X × Ω | π t / (3 δ )). Then, from Lemma A.1, with a probability of at least1 − δ/ l t ( x , w ) ≤ f ( x , w ) ≤ u t ( x , w ) , ∀ ( x , w ) ∈ X × Ω , ∀ t ≥ . (A.4)Thus, from the definition of ˜ Q t ( x , w ; η ), it holds that1l[ f ( x , w ) > h ] ≤ ˜ u t ( x , w ; η ) . This implies that F ( x ) = inf p ( w ) ∈A (cid:88) w ∈ Ω f ( x , w ) > h ] p ( w ) ≤ inf p ( w ) ∈A (cid:88) w ∈ Ω ˜ u t ( x , w ; η ) p ( w ) = u ( F ) t ( x ; η ) . Therefore, noting that the definition of L t , we have x ∈ L t ⇒ F ( x ) ≤ u ( F ) t ( x ; η ) ≤ α. (A.5)On the other hand, for any x ∈ X and p ( w ) ∈ A , it holds that (cid:88) w ∈ Ω f ( x , w ) > h ] p ( w ) + ˜ F η,p ( x ) = (cid:88) w ∈ Ω f ( x , w ) > h − η ] p ( w ) . Moreover, from Lemma A.2, with a probability of at least 1 − δ/
2, the following holds: (cid:88) w ∈ Ω f ( x , w ) > h ] p ( w ) + ξ > (cid:88) w ∈ Ω f ( x , w ) > h − η ] p ( w ) . (A.6)Thus, we get the following inequality:inf p ( w ) ∈A (cid:32) (cid:88) w ∈ Ω f ( x , w ) > h ] p ( w ) + ξ (cid:33) = F ( x ) + ξ > inf p ( w ) ∈A (cid:88) w ∈ Ω f ( x , w ) > h − η ] p ( w ) . (A.7)Furthermore, from the definition of ˜ Q t ( x , w ; η ), the following inequality holds:1l[ f ( x , w ) > h − η ] ≥ ˜ l t ( x , w ; η ) . Therefore, we haveinf p ( w ) ∈A (cid:88) w ∈ Ω f ( x , w ) > h − η ] p ( w ) ≥ inf p ( w ) ∈A (cid:88) w ∈ Ω ˜ l t ( x , w ; η ) p ( w ) = l ( F ) t ( x ; η ) . (A.8)Hence, by combining (A.7) and (A.8), we obtain l ( F ) t ( x ; η ) < F ( x ) + ξ. Thus, from the definition of H t , it holds that x ∈ H t ⇒ α < F ( x ) + ξ ⇒ α − ξ < F ( x ) . (A.9)Hence, from (A.5), (A.9) and the definition of e α ( x ), the following inequality holds:max x ∈X e α ( x ) ≤ ξ. Finally, since both (A.4) and (A.6) hold with a probability of at least 1 − δ , the following holds for any t ≥ P (cid:18) max x ∈X e α ( x ) ≤ ξ (cid:19) ≥ − δ. .2. Proof of Theorem 4.2 and 4.3 In this section, we prove Theorem 4.2 and 4.3. First, we show related lemmas.
Lemma A.3.
Let η > β t >
0. Suppose that the following holds for some T ≥ β / T σ T − ( x , w ) < η, ∀ ( x , w ) ∈ X × Ω . (A.10)Then, Algorithm 1 terminates after at most T iterations. Proof.
From the definition of ˜ Q t ( x , w ; η ), if l T ( x , w ) > h − η , then ˜ l T ( x , w ; η ) = ˜ u T ( x , w ; η ) = 1. On the otherhand, noting that u T ( x , w ) − l T ( x , w ) = 2 β / T σ T − ( x , w ) and (A.10), if l T ( x , w ) ≤ h − η , then u T ( x , w ) ≤ h .This implies that ˜ l T ( x , w ; η ) = ˜ u T ( x , w ; η ) = 0. Thus, under (A.10), the following holds for any ( x , w ) ∈ X × Ω:˜ l T ( x , w ; η ) = ˜ u T ( x , w ; η ) . Hence, from the definitions of l ( F ) t ( x , w ; η ) and u ( F ) t ( x , w ; η ), we have l ( F ) t ( x , w ; η ) = u ( F ) t ( x , w ; η ). Therefore,for any x ∈ X , x satisfies x ∈ H T or x ∈ L T , i.e., U T = ∅ . Lemma A.4.
Let η > β t >
0. Suppose that the following inequalities hold for some ( x ∗ , w ∗ ) ∈ X × Ω: σ − σ t − ( x ∗ , w ∗ ) β / t < η , (A.11) σ − σ t − ( x ∗ , w ∗ ) < η / . (A.12)Then, (3.2) can be bounded as a t − ( x ∗ , w ∗ ) ≤ |X | | Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . Proof.
First, we define the set B as B = b = ( b , . . . , b | Ω | ) ∈ { , } | Ω | (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) inf p ( w ) ∈A | Ω | (cid:88) j =1 p ( w j ) b j > α . Moreover, for each b ∈ B , let N ( b ) be a subset of { , . . . , | Ω |} satisfying ∀ s ∈ N ( b ) , b s = 1 . Then, the following holds for any x ∈ U t : E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]]= P y ∗ [(1l[ l t ( x , w | x ∗ , w ∗ , y ∗ ) > h ] , . . . , l t ( x , w | Ω | | x ∗ , w ∗ , y ∗ ) > h ]) (cid:62) ∈ B ]= (cid:88) b ∈B P y ∗ [1l[ l t ( x , w | x ∗ , w ∗ , y ∗ ) > h ] = b , . . . , l t ( x , w | Ω | | x ∗ , w ∗ , y ∗ ) > h ] = b | Ω | ] ≤ (cid:88) b ∈B P y ∗ [ ∀ s ∈ N ( b ) , l t ( x , w s | x ∗ , w ∗ , y ∗ ) > h ] = b s ] , (A.13)where l t ( x , w j | x ∗ , w ∗ , y ∗ ) is the lower confidence bound of f ( x , w j ) after adding ( x ∗ , w ∗ , y ∗ ) to { ( x i , w i , y i ) } ti =1 .Next, for any N ( b ) , there exists s b ∈ N ( b ) such that l t ( x , w s b ) ≤ h − η. (A.14)In fact, if l t ( x , w s b ) > h − η for any s ∈ N ( b ) , then we get(1l[ l t ( x , w ) > h − η ] , . . . , l t ( x , w | Ω | ) > h − η ]) (cid:62) ∈ B , which contradicts x ∈ U t . Furthermore, from Lemma 2 of [30], P y ∗ [ l t ( x , w s b | x ∗ , w ∗ , y ∗ ) > h ] can be calculatedas P y ∗ [ l t ( x , w s b | x ∗ , w ∗ , y ∗ ) > h ] = Φ (cid:113) σ t − ( x ∗ , w ∗ ) + σ | k t − (( x , w s b ) , ( x ∗ , w ∗ )) | ( µ t − ( x , w s b ) − β / t σ t − ( x , w s b | x ∗ , w ∗ ) − h ) , (A.15)14here σ t − ( x , w s b | x ∗ , w ∗ ) is the posterior variance of f ( x , w s b ) after adding ( x ∗ , w ∗ , y ∗ ) to { ( x i , w i , y i ) } ti =1 .Moreover, by using (A.14) we obtain µ t − ( x , w s b ) − β / t σ t − ( x , w s b | x ∗ , w ∗ ) − h = µ t − ( x , w s b ) − β / t σ t − ( x , w s b ) + β / t σ t − ( x , w s b ) − β / t σ t − ( x , w s b | x ∗ , w ∗ ) − h = l t ( x , w s b ) + β / t σ t − ( x , w s b ) − β / t σ t − ( x , w s b | x ∗ , w ∗ ) − h ≤ − η + β / t ( σ t − ( x , w s b ) − σ t − ( x , w s b | x ∗ , w ∗ )) . (A.16)In addition, the following three inequalities hold: σ ≤ (cid:113) σ t − ( x ∗ , w ∗ ) + σ , (A.17) | k t − (( x , w s b ) , ( x ∗ , w ∗ )) | ≤ σ t − ( x , w s b ) σ t − ( x ∗ , w ∗ ) ≤ σ ( x , w s b ) σ t − ( x ∗ , w ∗ ) ≤ σ t − ( x ∗ , w ∗ ) , (A.18) σ t − ( x , w s b ) − σ t − ( x , w s b | x ∗ , w ∗ ) ≤ σ t − ( x , w s b ) σ t − ( x ∗ , w ∗ ) σ t − ( x ∗ , w ∗ ) + σ ≤ σ ( x , w s b ) σ t − ( x ∗ , w ∗ ) σ ≤ σ t − ( x ∗ , w ∗ ) σ , (A.19)where the first, second and third inequalities in (A.18) can be derived from H¨older’s inequality, monotonicityof the posterior variance and the assumption max ( x , w ) ∈X × Ω σ ( x , w ) ≤
1, respectively. Similarly, the firstinequality in (A.19) can be derived from the equation (39) of [30]. Therefore, by substituting (A.16)–(A.19)and (A.11) into (A.15), we obtain the following inequality: P y ∗ [ l t ( x , w s b | x ∗ , w ∗ , y ∗ ) > h ] ≤ Φ (cid:18) σσ t − ( x ∗ , w ∗ ) ( − η/ (cid:19) , (A.20)Moreover, noting that the assumption (A.12) is equal to the condition 1 < σσ − t − ( x ∗ , w ∗ )( η/ (cid:18) σσ t − ( x ∗ , w ∗ ) ( − η/ (cid:19) = (cid:90) σσt − x ∗ , w ∗ ) ( − η/ −∞ φ ( z )d z = (cid:90) ∞ σσt − x ∗ , w ∗ ) ( η/ φ ( z )d z ≤ (cid:90) ∞ σσt − x ∗ , w ∗ ) ( η/ zφ ( z )d z = [ − φ ( z )] ∞ σσt − x ∗ , w ∗ ) ( η/ = 1 √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . (A.21)Finally, from (A.13), (A.20) and (A.21), E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]] can be bounded as E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]] ≤ (cid:88) b ∈B P y ∗ [ ∀ s ∈ N ( b ) , l t ( x , w s | x ∗ , w ∗ , y ∗ ) > h ] = b s ] ≤ (cid:88) b ∈B P y ∗ [1l[ l t ( x , w s b | x ∗ , w ∗ , y ∗ ) > h ] = b s b ]= (cid:88) b ∈B P y ∗ [ l t ( x , w s b | x ∗ , w ∗ , y ∗ ) > h ] ≤ (cid:88) b ∈B √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) = |B| √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) ≤ | Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . a t − ( x ∗ , w ∗ ), we have a t − ( x ∗ , w ∗ ) = (cid:88) x ∈ U t E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]] ≤ (cid:88) x ∈ U t | Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) = | U t | | Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) ≤ |X | | Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . Lemma A.5.
Let η > β t > γ >
0. Also let ( x t , w t ) ∈ X × Ω be a maximum point of a (1) t − ( x ∗ , w ∗ ).Assume that the following inequalities hold for some T ≥ σ − σ T − ( x T , w T ) β / T < η , (A.22) σ − σ T − ( x T , w T ) < η / , (A.23) σ T − ( x T , w T ) β T < η / , (A.24)12 log β T − η σ σ T − ( x T , w T ) < log( |X | − −| Ω | ηγ − √ π ) . (A.25)Then, Algorithm 1 terminates after at most T iterations. Proof.
From the definitions of a (1) t − ( x ∗ , w ∗ ) and ( x t , w t ), the following holds for any ( x , w ) ∈ X × Ω: γσ T − ( x , w ) ≤ a (1) T − ( x , w ) ≤ a (1) T − ( x T , w T ) = max { a T − ( x T , w T ) , γσ T − ( x T , w T ) } . (A.26)In addition, from (A.22), (A.23) and Lemma A.4, a T − ( x T , w T ) can be bounded as a T − ( x T , w T ) ≤ |X | | Ω | √ π exp (cid:18) − σ η σ T − ( x T , w T ) (cid:19) . (A.27)Thus, by substituting (A.27) into (A.26), we have γσ T − ( x , w ) ≤ max (cid:26) |X | | Ω | √ π exp (cid:18) − σ η σ T − ( x T , w T ) (cid:19) , γσ T − ( x T , w T ) (cid:27) . This implies that β / T σ T − ( x , w ) ≤ max (cid:26) γ − β / T |X | | Ω | √ π exp (cid:18) − σ η σ T − ( x T , w T ) (cid:19) , β / T σ T − ( x T , w T ) (cid:27) . (A.28)On the other hand, (A.24) and (A.25) are equal to the following inequalities, respectively: β / T σ T − ( x T , w T ) < η/ , (A.29)exp (cid:18) − η σ σ T − ( x T , w T ) (cid:19) < |X | − −| Ω | ηγ − √ πβ / T . (A.30)Hence, by combining (A.28), (A.29) and (A.30), we get β / T σ T − ( x T , w T ) < η/
2. Therefore, from Lemma A.3,we have Lemma A.5.
Lemma A.6.
Let η > β t >
0. Assume that (A.11) and (A.12) hold for some ( x ∗ , w ∗ ) ∈ X × Ω. Then,MILE t − ( x ∗ , w ∗ ) can be bounded asMILE t − ( x ∗ , w ∗ ) ≤ |X × Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . roof. From Lemma 2 of [30] and the definition of MILE t − ( x , w ), the following holds:MILE t − ( x ∗ , w ∗ )= (cid:88) ( x , w ) ∈ U t × Ω E y ∗ [1l[ l t ( x , w | x ∗ , w ∗ , y ∗ ) > h ]] − |{ ( x , w ) ∈ U t × Ω | l t ( x , w ) > h − η }| = (cid:88) ( x , w ) ∈ U t × Ω P y ∗ [ l t ( x , w | x ∗ , w ∗ , y ∗ ) > h ] − |{ ( x , w ) ∈ U t × Ω | l t ( x , w ) > h − η }|≤ (cid:88) ( x , w ) ∈ U t × Ω Φ (cid:113) σ t − ( x ∗ , w ∗ ) + σ | k t − (( x , w ) , ( x ∗ , w ∗ )) | ( µ t − ( x , w ) − β / t σ t − ( x , w | x ∗ , w ∗ ) − h ) − |{ ( x , w ) ∈ U t × Ω | l t ( x , w ) > h − η }| . = (cid:88) ( x , w ) ∈ U t × Ω Φ (cid:113) σ t − ( x ∗ , w ∗ ) + σ | k t − (( x , w ) , ( x ∗ , w ∗ )) | ( µ t − ( x , w ) − β / t σ t − ( x , w | x ∗ , w ∗ ) − h ) − l t ( x , w ) > h − η ] . (A.31)Next, for each ( x , w ) ∈ U t × Ω, we consider the two cases of l t ( x , w ) > h − η and l t ( x , w ) ≤ h − η . If l t ( x , w ) > h − η , then the following inequality holds:Φ (cid:113) σ t − ( x ∗ , w ∗ ) + σ | k t − (( x , w ) , ( x ∗ , w ∗ )) | ( µ t − ( x , w ) − β / t σ t − ( x , w | x ∗ , w ∗ ) − h ) − l t ( x , w ) > h − η ] ≤ ≤ √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . On the other hand, if l t ( x , w ) ≤ h − η , then using (A.15)–(A.21) we haveΦ (cid:113) σ t − ( x ∗ , w ∗ ) + σ | k t − (( x , w ) , ( x ∗ , w ∗ )) | ( µ t − ( x , w ) − β / t σ t − ( x , w | x ∗ , w ∗ ) − h ) − l t ( x , w ) > h − η ]= Φ (cid:113) σ t − ( x ∗ , w ∗ ) + σ | k t − (( x , w ) , ( x ∗ , w ∗ )) | ( µ t − ( x , w ) − β / t σ t − ( x , w | x ∗ , w ∗ ) − h ) ≤ √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . Therefore, in both cases, the following inequality holds:Φ (cid:113) σ t − ( x ∗ , w ∗ ) + σ | k t − (( x , w ) , ( x ∗ , w ∗ )) | ( µ t − ( x , w ) − β / t σ t − ( x , w | x ∗ , w ∗ ) − h ) − l t ( x , w ) > h − η ] ≤ √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . (A.32)Thus, by substituting (A.32) into (A.31), we obtainMILE t − ( x ∗ , w ∗ ) ≤ (cid:88) ( x , w ) ∈ U t × Ω √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) = | U t × Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) ≤ |X × Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . Lemma A.7.
Let η > β t > γ > γ >
0. Also let ( x t , w t ) ∈ X × Ω be a maximum point of a (2) t − ( x ∗ , w ∗ ). Assume that the inequalities (A.22), (A.23) and (A.24) hold for some T ≥
1. In addition, assumethat the following inequalities hold:12 log β T − η σ σ T − ( x T , w T ) < log( |X | − −| Ω | ηγ ˜ γ − √ π ) , (A.33)12 log β T − η σ σ T − ( x T , w T ) < log( |X × Ω | − η ˜ γ − √ π ) . (A.34)Then, Algorithm 1 terminates after at most T iterations.17 roof. From the definition of a (2) t − ( x ∗ , w ∗ ) and ( x t , w t ), the following holds for any ( x , w ) ∈ X × Ω: γ ˜ γσ T − ( x , w ) ≤ γ RMILE T − ( x , w ) ≤ a (2) T − ( x , w ) ≤ a (2) T − ( x T , w T )= max { a T − ( x T , w T ) , γ RMILE T − ( x T , w T ) } . (A.35)Furthermore, from (A.22), (A.23) and Lemma A.6, we have γ RMILE T − ( x T , w T ) = max { γ MILE T − ( x T , w T ) , γ ˜ γσ T − ( x T , w T ) }≤ max (cid:26) γ |X × Ω | √ π exp (cid:18) − σ η σ T − ( x T , w T ) (cid:19) , γ ˜ γσ T − ( x T , w T ) (cid:27) . (A.36)Moreover, from (A.24) and (A.34), we get the following inequalities: σ T − ( x T , w T ) < β − / T η/ , (A.37) |X × Ω | √ π exp (cid:18) − σ η σ T − ( x T , w T ) (cid:19) < β − / T η ˜ γ/ . (A.38)Thus, by substituting (A.37) and (A.38) into (A.36), we obtain γ RMILE T − ( x T , w T ) ≤ γ ˜ γβ − / T η/ . (A.39)Similarly, from (A.22), (A.23), (A.33) and Lemma A.4, a T − ( x T , w T ) can be bounded as a T − ( x T , w T ) ≤ |X | | Ω | √ π exp (cid:18) − σ η σ T − ( x T , w T ) (cid:19) ≤ γ ˜ γβ − / T η/ . (A.40)Hence, by combining (A.39) and (A.40) into (A.35), we get γ ˜ γσ T − ( x , w ) ≤ γ ˜ γβ − / T η/ . This implies that 2 β / T σ T − ( x , w ) < η . Therefore, from Lemma A.3, we have Lemma A.7. Lemma A.8.
Let ( x , w ) , . . . , ( x t , w t ) be selected points, and define C = 2 / log(1 + σ − ). Then, there existsa natural number t (cid:48) ≤ t such that σ t (cid:48) − ( x t (cid:48) , w t (cid:48) ) ≤ C κ t t . Proof.
From Lemma 5.3 in [23], the mutual information I ( y A ; f ) can be expressed as I ( y A ; f ) = 12 t (cid:88) i =1 log(1 + σ − σ i − ( x i , w i )) . (A.41)Similarly, from Lemma 5.4 in [23], σ i − ( x i , w i ) can be bounded as σ i − ( x i , w i ) ≤ log(1 + σ − σ i − ( x i , w i ))log(1 + σ − ) . (A.42)Hence, by using (A.41) and (A.42), we get t (cid:88) i =1 σ i − ( x i , w i ) ≤ σ − ) I ( y t ; f ) ≤ C κ t . (A.43)Next, we define t (cid:48) as t (cid:48) = argmin ≤ i ≤ t σ i − ( x i , w i ). Then, it follows that tσ t (cid:48) − ( x t (cid:48) , w t (cid:48) ) ≤ t (cid:88) i =1 σ i − ( x i , w i ) . (A.44)Therefore, by combining (A.43) and (A.44), we have the desired inequality.Finally, using Lemma A.5, A.7 and A.8, we prove Theorem 4.2 and 4.3.18 roof. From Lemma A.8 and monotonicity of β t , for any t ≥
1, there exists a natural number t (cid:48) ≤ t such that σ − σ t (cid:48) − ( x t (cid:48) , w t (cid:48) ) β / t (cid:48) ≤ σ − β / t (cid:48) C κ t t ≤ σ − β / t C κ t t ,σ − σ t (cid:48) − ( x t (cid:48) , w t (cid:48) ) ≤ σ − C κ t t ,σ t (cid:48) − ( x t (cid:48) , w t (cid:48) ) β t (cid:48) ≤ C β t (cid:48) κ t t ≤ C β t κ t t ,
12 log β t (cid:48) − η σ σ t (cid:48) − ( x t (cid:48) , w t (cid:48) ) ≤
12 log β t (cid:48) − T η σ C κ t ≤
12 log β t − T η σ C κ t . (A.45)Hence, from (A.45), if the inequality conditions in Theorem 4.2 hold, then the inequality conditions in LemmaA.5 also hold for some ˜ T ≤ T . Therefore, from Lemma A.5, Algorithm 1 terminates after at most ˜ T iterations,i.e., Theorem 4.2 holds. By using the same argument, Theorem 4.3 can also be proved. A.3. Proof of Lemma 3.1 and 3.2
First, we prove Lemma 3.1
Proof.
From GP properties, the posterior mean µ t − ( x , w | x ∗ , w ∗ , y ∗ ) and the posterior variance σ t − ( x , w | x ∗ , w ∗ )of f ( x , w ) after adding ( x ∗ , w ∗ , y ∗ ) can be written as follows (see, e.g., [29]): µ t − ( x , w | x ∗ , w ∗ , y ∗ ) = µ t − ( x , w ) − k t − (( x , w ) , ( x ∗ , w ∗ )) σ t − ( x ∗ , w ∗ ) + σ ( y ∗ − µ t − ( x ∗ , w ∗ )) ,σ t − ( x , w | x ∗ , w ∗ ) = σ t − ( x , w ) − k t − (( x , w ) , ( x ∗ , w ∗ )) σ t − ( x ∗ , w ∗ ) + σ . Thus, l t ( x , w | x ∗ , w ∗ , y ∗ ) is a linear function with respect to (w.r.t.) y ∗ . Hence, the indicator function1l[ l t ( x , w j | x ∗ , w ∗ , y ∗ ) > h ] is a piecewise constant function w.r.t. y ∗ , where the breakpoint is y ∗ = r j . Therefore,for any s ∈ { , . . . , | Ω | + 1 } , the following holds:(1l[ l t ( x , w | x ∗ , w ∗ , c ) > h ] , . . . , l t ( x , w | Ω | | x ∗ , w ∗ , c ) > h ]) (cid:62) = (1l[ l t ( x , w | x ∗ , w ∗ , c (cid:48) ) > h ] , . . . , l t ( x , w | Ω | | x ∗ , w ∗ , c (cid:48) ) > h ]) (cid:62) , ∀ c, c (cid:48) ∈ R s . This implies that l ( F ) t ( x ; 0 | x ∗ , w ∗ , c ) = l ( F ) t ( x ; 0 | x ∗ , w ∗ , c (cid:48) ) , ∀ c, c (cid:48) ∈ R s . Hence, using this we have E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]]= (cid:90) l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ] p ( y ∗ )d y ∗ = | Ω | +1 (cid:88) s =1 (cid:90) y ∗ ∈ R s l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ] p ( y ∗ )d y ∗ = | Ω | +1 (cid:88) s =1 l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] (cid:90) y ∗ ∈ R s p ( y ∗ )d y ∗ = | Ω | +1 (cid:88) s =1 P ( y ∗ ∈ R s )1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] . Next, we prove Lemma 3.2.
Proof.
From the definition of l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ), l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) can be expressed as l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) = inf p ( w ) ∈A (cid:88) w ∈ Ω l t ( x , w | x ∗ , w ∗ , c s ) > h ] p ( w ) . Moreover, since p ∗ ( w ) ∈ A , the following holds:inf p ( w ) ∈A (cid:88) w ∈ Ω l t ( x , w | x ∗ , w ∗ , c s ) > h ] p ( w ) ≤ (cid:88) w ∈ Ω l t ( x , w | x ∗ , w ∗ , c s ) > h ] p ∗ ( w ) . L L L L h = 100, α = 0 . h = 100, α = 0 . h = 100, α = 0 . h = 100, α = 0 . σ = 10 − , σ f = 1300 , σ = 10 − , σ f = 1300 , σ = 10 − , σ f = 1300 , σ = 10 − , σ f = 1300 ,Booth L = 4, β / t = 2, (cid:15) = 0 . L = 4, β / t = 2, (cid:15) = 0 . L = 4, β / t = 2, (cid:15) = 0 . L = 4, β / t = 2, (cid:15) = 0 . L = − U = 10, L = − U = 10, L = − U = 10, L = − U = 10, L = − U = 10 L = − U = 10 L = − U = 10 L = − U = 10 h = 5, α = 0 . h = 5, α = 0 . h = 5, α = 0 . h = 5, α = 0 . σ = 10 − , σ f = 50 , σ = 10 − , σ f = 50 , σ = 10 − , σ f = 50 , σ = 10 − , σ f = 50 ,Matyas L = 4, β / t = 2, (cid:15) = 0 . L = 4, β / t = 2, (cid:15) = 0 . L = 4, β / t = 2, (cid:15) = 0 . L = 4, β / t = 2, (cid:15) = 0 . L = − U = 10, L = − U = 10, L = − U = 10, L = − U = 10, L = − U = 10 L = − U = 10 L = − U = 10 L = − U = 10 h = 1, α = 0 . h = 1, α = 0 . h = 2, α = 0 . h = 1, α = 0 . σ = 10 − , σ f = 20 , σ = 10 − , σ f = 20 , σ = 10 − , σ f = 20 , σ = 10 − , σ f = 20 ,McCormick L = 1, β / t = 3, (cid:15) = 0 . L = 1, β / t = 3, (cid:15) = 0 . L = 1, β / t = 3, (cid:15) = 0 . L = 1, β / t = 3, (cid:15) = 0 . L = − . U = 4, L = − . U = 4, L = − . U = 4, L = − . U = 4, L = − U = 4 L = − U = 4 L = − U = 4 L = − U = 4 h = − α = 0 . h = − α = 0 . h = − α = 0 . h = − α = 0 . σ = 10 − , σ f = 2000 , σ = 10 − , σ f = 2000 , σ = 10 − , σ f = 2000 , σ = 10 − , σ f = 2000 ,Styblinski-Tang L = 3, β / t = 2, (cid:15) = 0 . L = 3, β / t = 2, (cid:15) = 0 . L = 3, β / t = 2, (cid:15) = 0 . L = 3, β / t = 2, (cid:15) = 0 . L = − U = 10, L = − U = 10, L = − U = 10, L = − U = 10, L = − U = 10 L = − U = 10 L = − U = 10 L = − U = 10 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01
Booth Matyas McCormick Styblinski-TangFigure 4: Average F-score over 50 simulations with four benchmark functions when the distance function andreference distribution are L l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) ≤ (cid:88) w ∈ Ω l t ( x , w | x ∗ , w ∗ , c s ) > h ] p ∗ ( w ) . Hence, if the inequality assumption in Lemma 3.2 holds, then we get l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) ≤ α . This impliesthat 1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] = 0. B. Additional experimentsB.1. Synthetic and real data experiments in the L -norm setting In this section, we performed the same experiment as in Subsection 5.1 and 5.3 under the setting that thedistance function is L L B.2. Computation time experiments in the other benchmark function setting
In this section, we performed the same experiment as in Subsection 5.2 for the Matyas, McCormick andStyblinski-Tang benchmark functions. We evaluated the computation time of (3.2) when we performed thesame experiment as in Subsection 5.2 using Proposed1 0 .
01 and Proposed2 0 .
01. Here, as for the parameter20
50 100 150 200 250 300 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01
Booth Matyas McCormick Styblinski-TangFigure 5: Average F-score over 50 simulations with four benchmark functions when the distance function andreference distribution are L . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 L L L L t , weevaluated the computation time to calculate (3.2) for all candidate points ( x ∗ , w ∗ ) ∈ X × Ω, and calculated theaverage computation time over 300 trials. From Tables 3, 4 and 5, it can be confirmed that the same results asin Subsection 5.2 are obtained in the three benchmark function settings.
B.3. Hyperparameter sensitivity in the proposed acquisition function
In this section, we evaluated how the performance is affected by the hyperparameter γ in the proposedacquisition function. We calculated the F-score for the cases with acquisition functions Proposed1 γ and Pro-posed2 γ when we performed the same experiment as in Subsection 5.1 for Booth, Matyas, McCormick andStyblinski-Tang functions. Here, Proposed1 γ and Proposed2 γ respectively represent the acquisition functions a (1) t ( x ∗ , w ∗ ) and a (2) t ( x ∗ , w ∗ ) with the parameter γ , and we considered γ as 0, 10 − . , 10 − , 10 − , 10 − and10 − . In this experiment, as for the parameter settings, we considered only the case of L γ = 0. The reason is that a t ( x ∗ , w ∗ ) was zero for all ( x ∗ , w ∗ ) ∈ X × Ω when the number of data was small.Furthermore, when γ >
0, it can be seen that the performance of Proposed1 γ decreases as γ increases. Onereason is that although a (1) t ( x ∗ , w ∗ ) is closer to uncertainty sampling (US) as γ becomes large, US is not theacquisition function for efficiently estimating H t . On the other hand, it can be confirmed that the performanceof Proposed2 γ is not necessarily better when γ is smaller. From the definition of Proposed2 γ , when γ is large, a (2) t ( x ∗ , w ∗ ) behaves similarly to RMILE. RMILE is the acquisition function that works to efficiently identify21able 3: Computation time (second) for the Matyas function setting Naive L1 L2 L3 (10 − ) L3 (10 − ) L3 (10 − )Proposed1 0 .
01 112403 . ± .
33 6211 . ± .
06 1297 . ± .
31 32 . ± .
36 32 . ± .
18 33 . ± . .
01 98478 . ± .
68 5504 . ± .
62 1831 . ± .
59 32 . ± .
43 37 . ± .
58 38 . ± . Table 4: Computation time (second) for the McCormick function setting
Naive L1 L2 L3 (10 − ) L3 (10 − ) L3 (10 − )Proposed1 0 .
01 83608 . ± .
78 4692 . ± .
72 1094 . ± .
81 39 . ± .
27 41 . ± .
20 42 . ± . .
01 79782 . ± .
70 4383 . ± .
23 1525 . ± .
80 49 . ± .
33 56 . ± .
54 62 . ± . ( x , w ) that satisfies f ( x , w ) > h . However, since F ( x ) is given as the function of 1l[ f ( x , w ) > h ], as a result,RMILE also works to efficiently estimate H t . This is one of the reasons why Proposed2 γ sometimes has goodperformance even at large γ . 22able 5: Computation time (second) for the Styblinski-Tang function setting Naive L1 L2 L3 (10 − ) L3 (10 − ) L3 (10 − )Proposed1 0 .
01 118443 . ± .
13 6297 . ± .
76 900 . ± .
84 44 . ± .
66 47 . ± .
67 48 . ± . .
01 96731 . ± .
16 5240 . ± .
16 686 . ± .
10 26 . ± .
92 27 . ± .