[PDF] Active learning for distributionally robust level-set estimation

Abstract

Many cases exist in which a black-box function f with high evaluation cost depends on two types of variables \bm x and \bm w, where \bm x is a controllable \emph{design} variable and \bm w are uncontrollable \emph{environmental} variables that have random variation following a certain distribution P. In such cases, an important task is to find the range of design variables \bm x such that the function f(\bm x, \bm w) has the desired properties by incorporating the random variation of the environmental variables \bm w. A natural measure of robustness is the probability that f(\bm x, \bm w) exceeds a given threshold h, which is known as the \emph{probability threshold robustness} (PTR) measure in the literature on robust optimization. However, this robustness measure cannot be correctly evaluated when the distribution P is unknown. In this study, we addressed this problem by considering the \textit{distributionally robust PTR} (DRPTR) measure, which considers the worst-case PTR within given candidate distributions. Specifically, we studied the problem of efficiently identifying a reliable set H, which is defined as a region in which the DRPTR measure exceeds a certain desired probability \alpha, which can be interpreted as a level set estimation (LSE) problem for DRPTR. We propose a theoretically grounded and computationally efficient active learning method for this problem. We show that the proposed method has theoretical guarantees on convergence and accuracy, and confirmed through numerical experiments that the proposed method outperforms existing methods.

Full PDF

AActive learning for distributionally robust level-set estimation

Yu Inatsu Shogo Iwazaki Ichiro Takeuchi , , ∗ Department of Computer Science, Nagoya Institute of Technology RIKEN Center for Advanced Intelligence Project ∗ E-mail: [email protected]

ABSTRACT

Many cases exist in which a black-box function f with high evaluation cost depends on two types ofvariables x and w , where x is a controllable design variable and w are uncontrollable environmental variablesthat have random variation following a certain distribution P . In such cases, an important task is to ﬁnd therange of design variables x such that the function f ( x , w ) has the desired properties by incorporating therandom variation of the environmental variables w . A natural measure of robustness is the probability that f ( x , w ) exceeds a given threshold h , which is known as the probability threshold robustness (PTR) measure inthe literature on robust optimization. However, this robustness measure cannot be correctly evaluated whenthe distribution P is unknown. In this study, we addressed this problem by considering the distributionallyrobust PTR (DRPTR) measure, which considers the worst-case PTR within given candidate distributions.Speciﬁcally, we studied the problem of eﬃciently identifying a reliable set H , which is deﬁned as a region inwhich the DRPTR measure exceeds a certain desired probability α , which can be interpreted as a level setestimation (LSE) problem for DRPTR. We propose a theoretically grounded and computationally eﬃcientactive learning method for this problem. We show that the proposed method has theoretical guaranteeson convergence and accuracy, and conﬁrmed through numerical experiments that the proposed methodoutperforms existing methods.

1. Introduction

In the manufacturing industry, product performance often depends on two types of variables: design variablesand environmental variables. The design variables are completely controllable, whereas environmental variablesare random variables that change depending on the usage environment of the product. When considering sucha problem, it is important to identify the design variables that allow the product performance to exceed thedesired requirement threshold with a suﬃciently high degree of conﬁdence, taking into account the randomnessof the environmental variables. In this setting, we must emphasize that there are two distinctly diﬀerent phasesof the product: the development phase and the use phase. In the development phase, we have full control overthe design variables and environmental variables. In the use phase, on the other hand, the design variables areﬁxed, and the environmental variables change randomly and cannot be controlled.Let f ( x , w ) represent the performance of the product, and let h ∈ R be a desired performance threshold,where x is a design variable deﬁned on X , and w is an environmental variable deﬁned on Ω. Then, we considerthe following robustness measure: PTR( x ) = (cid:90) Ω f ( x , w ) > h ] p † ( w )d w , where 1l[ · ] is the indicator function and p † ( w ) is the probability density function of w . This measure is called theprobability threshold robustness (PTR) measure in the ﬁeld of robust optimization [2], and can be interpretedas a measure of how well the design variables behave under randomness in the environmental variables. In themanufacturing industry, it is desirable to identify the set of controllable variables x ∈ X for which PTR( x ) isgreater than a certain threshold. In other words, this problem is interpreted as a level-set estimation (LSE)[4, 8] of the PTR measure. There are two main reasons for considering LSE of the PTR measure. One is thatby enumerating all the design variables that exceed the desired threshold with a high probability, it is possibleto respond the usage conditions of various users. The other is to consider some optimization problem (e.g., toﬁnd x with the minimum price) for design variables with PTR measures above a certain level. This is knownas the chance-constrained programming problem [5], and has many applications such as ﬁnance, in addition tomanufacturing industry. Unfortunately, however, the PTR measure cannot be correctly evaluated when p † ( w )is unknown. If p † ( w ) is unknown and the estimated density is simply plugged in, then PTR( x ) is no longervalid as a robustness measure because of the estimation error.In this study, we considered a distributionally robust PTR (DRPTR) measure, which includes uncertaintyabout p † ( w ) under the setting that p † ( w ) is unknown. Let A be a user-speciﬁed class of candidate distributionsof w . Then, the DRPTR measure can be deﬁned as F ( x ) = inf p ( w ) ∈A (cid:90) Ω f ( x , w ) > h ] p ( w )d w . The DRPTR measure has the advantage of being robust with respect to using wrong distributions because it canbe interpreted as the PTR in the worst case among the candidate distributions. In this study, we formulated this1 a r X i v : . [ s t a t . M L ] F e b roblem as an active learning problem for the LSE for F ( x ) instead of PTR( x ), and developed a theoreticallygrounded and numerically eﬃcient algorithm for its calculation. The basic ideas of our proposed method areas follows. First, we consider the function f ( x , w ) to be a black-box function with a high evaluation cost, andwe employ a Gaussian process (GP) model as a surrogate model. Next, we predict the target DRPTR measureusing the GP model for the black-box function f ( x , w ). Finally, we perform LSE using credible intervals of theDRPTR measure calculated on the basis of this prediction. Active learning using GP models [29] for black-box functions have been actively studied in the context ofBayesian optimization (see, e.g., [21, 22]). Several studies have been conducted on active learning for LSE[4, 8, 30, 10]. Furthermore, some researchers applied LSE to eﬃciently identify safety regions [25, 27, 24, 28],and others used LSE to enumerate the local minima of black-box functions [9].Many studies have been conducted on active learning under input uncertainty (including random environ-mental variables). In [11], the authors proposed an eﬃcient method for performing LSE in the setting where theinput is a random variable generated from a certain distribution. In other studies, the researchers formulatedthe randomness of the input with some robustness measures for performing active learning on it. For example,the authors of [3] used the worst-case function value of the input shift as a robustness measure. Similarly, otherresearch ([1, 26, 18, 6, 7, 14]) dealt with the stochastic robustness (SR) measure, which is a robustness measuredeﬁned by integrating the black-box function against the input distribution. In another study closely related tothe present work, the authors of [12] proposed an active learning method for LSE in the PTR measure on the ba-sis of random inputs; in [14], the authors considered an active learning method for both LSE and maximizationproblems in the PTR measure. However, these two are not distributionally robust settings. Distributionallyrobust optimization (DRO), which is not an active learning framework, was ﬁrst introduced by [20]. DRO is animportant topic in the context of robust optimization, and there have been countless related studies (see [19]for comprehensive survey of DRO). Active learning methods for DRO with uncertainty environmental variableshave recently been proposed by [16, 17]. The main diﬀerences to our problem setup are that they focus ona distributionally robust SR (DRSR) measure for the target function, which is the worst-case SR measure incandidate distributions of the unknown environmental variable, and consider the maximization problem for theDRSR measure. In particular, for the former, we cannot directly apply their proposed methods and theoreticaltechniques because the target function is diﬀerent from ours. To the best of our knowledge, none of these studieshave addressed the same research problem considered in the present work.

The main contributions of this study are summarized as follows: • We formulate the LSE problem for the DRPTR measure, i.e., the problem of ﬁnding the set of designvariables for which the DRPTR measure exceeds a given threshold. • We construct non-trivial credible intervals for the DRPTR measure and propose a new acquisition function(AF) based on an expected classiﬁcation improvement. Using them, we propose an active learning methodfor the LSE of the DRPTR measure. Moreover, because the naive implementation of our proposed AFrequires a large computational cost, we propose a computationally eﬃcient technique for its calculation. • We clarify the theoretical property of the proposed method. Under mild conditions, we show that theproposed method has desirable accuracy and convergence properties. • We describe the empirical performance of the proposed method through the results of numerical experi-ments with benchmark functions and real data.

2. Preliminary

Let f : X × Ω → R be an expensive-to-evaluate black-box function. We assume that X and Ω are ﬁnitesets. For each input ( x , w ) ∈ X × Ω, the value of f ( x , w ) is observed as f ( x , w ) + ε with an independent noise ε , where ε follows Gaussian distribution N (0 , σ ). In our setting, a variable w ∈ Ω stochastically ﬂuctuates bythe (unknown) discrete distribution P † in the use phase, whereas we can specify w in the development phase.Moreover, let A be a family of candidate distributions of P † . In this work, we consider A = { p.m.f. p ( w ) | d ( p ( w ) , p ∗ ( w )) < (cid:15) } . where p ∗ ( w ) is a user-speciﬁed reference distribution, d ( · , · ) is a given distance metricbetween two distributions, and (cid:15) >

0. Then, under the given threshold h , we deﬁne the DRPTR F ( x ) for each x ∈ X as F ( x ) = inf p ( w ) ∈A (cid:88) w ∈ Ω f ( x , w ) > h ] p ( w ) . H of X that satisﬁes F ( x ) > α for a given threshold α ∈ (0 , H = { x ∈ X | F ( x ) > α } . (2.1)Moreover, we deﬁne the lower set L as L = { x ∈ X | F ( x ) ≤ α } . Gaussian process

In this study, we used the Gaussian process (GP) to model the unknown black-box function f . First, we assume that the GP, GP (0 , k (( x , w ) , ( x (cid:48) , w (cid:48) ))) is a prior distribution of f , where k (( x , w ) , ( x (cid:48) , w (cid:48) ))is a positive-deﬁnite kernel. Then, given the dataset { ( x i , w i , y i ) } ti =1 , the posterior distribution of f also followsthe GP, and its posterior mean µ t ( x , w ) and posterior variance σ t ( x , w ) are given by µ t ( x , w ) = k (cid:62) t ( x , w )( K t + σ I t ) − y t ,σ t ( x , w ) = k (( x , w ) , ( x , w )) − k (cid:62) t ( x , w )( K t + σ I t ) − k t ( x , w ) , where k t ( x , w ) is the t -dimensional vector whose j th element is k (( x , w ) , ( x j , w j )), y t = ( y , . . . , y t ) (cid:62) , I t is the t × t identity matrix, and K t is the t × t matrix whose ( j, k )th element is k (( x j , w j ) , ( x k , w k )).

3. Proposed method

In this section, we propose an active learning method for eﬃciently identifying (2.1). The target function F ( x ) is a random variable because F ( x ) is the function of f ( x , w ), and f ( x , w ) is drawn from GP. Thus, areasonable method to identify (2.1) is to construct a credible interval of F ( x ), and estimate H using the lowerbound of the constructed credible interval. Unfortunately, although f ( x , w ) follows GP, F ( x ) does not followGP. Hence, the credible interval of F ( x ) cannot be directly calculated on the basis of normal distributions. Inthe next section, we propose a simple and theoretically valid credible interval of F ( x ) using the credible intervalof f ( x , w ). For any input ( x , w ) ∈ X × Ω and step t , we deﬁne a credible interval of f ( x , w ) as Q t ( x , w ) = [ l t ( x , w ) , u t ( x , w )],where l t ( x , w ) = µ t ( x , w ) − β / t σ t ( x , w ), u t ( x , w ) = µ t ( x , w ) + β / t σ t ( x , w ), and β / t ≥

0. Similarly, wedeﬁne a credible interval of 1l[ f ( x , w ) > h ] on the basis of Q t ( x , w ). For the theoretical analysis described inSection 4, we introduce a user-speciﬁed accuracy parameter η >

0. Speciﬁcally, we deﬁne the credible intervalof 1l[ f ( x , x ) > h ] at step t as ˜ Q t ( x , w ; η ) ≡ [˜ l t ( x , w ; η ) , ˜ u t ( x , w ; η )]=  [1 ,

1] if l t ( x , w ) > h − η, [0 ,

1] if l t ( x , w ) ≤ h − η and u t ( x , w ) > h, [0 ,

0] if l t ( x , w ) ≤ h − η and u t ( x , w ) ≤ h. Note that when the accuracy parameter η = 0, this credible interval simply indicates that if the lower (resp.upper) bound of f ( x , w ) is greater (resp. smaller) than h , we say that 1l[ f ( x , w ) > h ] = 1 (resp. 0). Thus, acredible interval Q ( F ) t ( x ; η ) ≡ [ l ( F ) t ( x ; η ) , u ( F ) t ( x ; η )] of the target function F ( x ) can be given by l ( F ) t ( x ; η ) = inf p ( w ) ∈A (cid:88) w ∈ Ω ˜ l t ( x , w ; η ) p ( w ) , u ( F ) t ( x ; η ) = inf p ( w ) ∈A (cid:88) w ∈ Ω ˜ u t ( x , w ; η ) p ( w ) . (3.1)Note that if we use the L L d ( · , · ), equation (3.1) is equivalent to solvinga linear (or second-order cone) programming problem. In both cases, because solvers exist that can computethe optimal solution quickly, it is easy to compute Q ( F ) t ( x ; η ) when using such distance functions. Then, weestimate H and L using Q ( F ) t ( x ; η ) as follows: H t = { x ∈ X | l ( F ) t ( x ; η ) > α } , L t = { x ∈ X | u ( F ) t ( x ; η ) ≤ α } . Also, we deﬁne the unclassiﬁed set as U t = X \ ( H t ∪ L t ). In this section, we propose two acquisition functions to select the next evaluation point. Our proposedacquisition functions are based on the maximum improvement in level-set estimation (MILE) strategy proposedin [30]. In MILE, the expected value of the increase in the number of classiﬁcations after adding the new point( x ∗ , w ∗ ) is calculated, and the point with the largest expected value is selected. In this study, owing to the3 lgorithm 1 Active learning for distributionally robust level-set estimation

Input:

GP prior GP (0 , k ), threshold h ∈ R , probability α ∈ (0 , η >

0, tradeoﬀparameter { β t } t ≤ T H ← ∅ , L ← ∅ , U ← X , t ← while U t − (cid:54) = ∅ do Compute l ( F ) t ( x ; η ) and u ( F ) t ( x ; η ) for all x ∈ X Choose ( x t , w t ) by ( x t , w t ) = argmax ( x ∗ , w ∗ ) ∈X × Ω a (1) t − ( x ∗ , w ∗ ) (or a (2) t − ( x ∗ , w ∗ ) instead of a (1) t − ( x ∗ , w ∗ ))Observe y t ← f ( x t , w t ) + ε t Update GP by adding (( x t , w t ) , y t ) and compute H t , L t and U t t ← t + 1 end while ˆ H ← H t − , ˆ L ← L t − Output:

Estimated Set ˆ H, ˆ L computational cost of calculating the acquisition function, we consider a strategy based on the expected valuewhere points in the unclassiﬁed set are classiﬁed as H .Let ( x ∗ , w ∗ ) be a new point, and let y ∗ = f ( x ∗ , w ∗ )+ ε be a new observation at point ( x ∗ , w ∗ ). Furthermore,let l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) be the lower bound of the credible interval of F ( x ), where η = 0 when ( x ∗ , w ∗ , y ∗ ) isnewly added. Then, we consider the function a t ( x ∗ , w ∗ ): a t ( x ∗ , w ∗ ) = (cid:88) x ∈ U t E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]] . (3.2)In this work, we do not directly use (3.2) as the acquisition function because the value of (3.2) is sometimesexactly zero for any point. A reasonable method to avoid this problem is to consider a diﬀerent function b t ( x ∗ , w ∗ ) only when the values of (3.2) are all zero. For theoretical treatment, we follow the strategy describedin [30], and consider the acquisition function of the form max { a t ( x ∗ , w ∗ ) , γb t ( x ∗ , w ∗ ) } with a positive constantparameter γ . Note that if we use a suﬃciently small γ , it is almost the same when considering b t ( x ∗ , w ∗ ) onlywhen the values of (3.2) are all zero; otherwise, a t ( x ∗ , w ∗ ). In Section 4, we present the theoretical guaranteesof our proposed method for this acquisition function. In this section, we propose two types of b t ( x ∗ , w ∗ ). Theﬁrst is based on the RMILE acquisition function proposed by [30]. The basic idea of RMILE is to add anadditional variance term γσ t ( x ∗ , w ∗ ) to the original MILE acquisition function. By using the same argument,we deﬁne the following modiﬁed acquisition function: Deﬁnition 3.1 (Proposed acquisition function 1) . Let a t ( x ∗ , w ∗ ) be the function deﬁned by (3.2), and let γ be a positive parameter. Then, we propose the following acquisition function a (1) t ( x ∗ , w ∗ ): a (1) t ( x ∗ , w ∗ ) = max { a t ( x ∗ , w ∗ ) , γσ t ( x ∗ , w ∗ ) } . Moreover, we select the next evaluation point ( x t +1 , w t +1 ) by maximizing a (1) t ( x ∗ , w ∗ ).The other acquisition function we propose uses γ RMILE t ( x ∗ , w ∗ ) instead of γσ t ( x ∗ , w ∗ ) as the function b t ( x ∗ , w ∗ ), where RMILE t ( x ∗ , w ∗ ) is the RMILE function proposed in [30]. Deﬁnition 3.2 (Proposed acquisition function 2) . Let a t ( x ∗ , w ∗ ) be the function deﬁned by (3.2), and let γ be a positive parameter. Then, we propose the following acquisition function a (2) t ( x ∗ , w ∗ ): a (2) t (( x ∗ , w ∗ )) = max { a t ( x ∗ , w ∗ ) , γ RMILE t ( x ∗ , w ∗ ) } . Moreover, we select the next evaluation point ( x t +1 , w t +1 ) by maximizing a (2) t ( x ∗ , w ∗ ).The pseudocode of the proposed method is given in Algorithm 1. Our proposed acquisition functions are based on (3.2), where (3.2) includes the calculation of the expectedvalue. This expectation cannot be expressed as a simple expression using the cumulative distribution function(CDF) of the standard normal distribution, as in the original MILE [30]. One way to solve this problemis to generate many samples from the posterior distribution of y ∗ and numerically calculate the expectedvalue. However, because one optimization calculation is required to calculate 1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ],if the expected value is calculated using M samples, then M | U t | optimization calculations are required tocalculate a t ( x ∗ , w ∗ ) for each ( x ∗ , w ∗ ). Therefore, to calculate a t ( x ∗ , w ∗ ) for all candidate points, M | U t ||X × Ω | optimization calculations are required. To reduce this large computational cost, we provide useful lemmas foreﬃciently computing the acquisition function. The expected values in (3.2) can be exactly calculated using thefollowing lemma: 4 emma 3.1. Let l t ( x , w j | x ∗ , w ∗ , y ∗ ) be the lower conﬁdence bound of f ( x , w j ) after adding ( x ∗ , w ∗ , y ∗ )to { ( x i , w i , y i ) } ti =1 . Furthermore, let r j be a number satisfying h = l t ( x , w j | x ∗ , w ∗ , r j ), and let r ( j ) bethe j th-smallest number in the range r to r | Ω | . For each s ∈ { , . . . , | Ω | + 1 } ≡ [ | Ω | + 1], deﬁne R s =( r ( s − , r ( s ) ], where r (0) = −∞ and r ( | Ω | +1) = ∞ . Moreover, let c s be a real number satisfying c s ∈ R s . Then, E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]] can be calculated as follows: E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]] = | Ω | +1 (cid:88) s =1 P ( y ∗ ∈ R s )1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] . (3.3)Lemma 3.1 implies that | Ω | +1 optimization calculations are required to calculate E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) >α ]], but the following lemma shows that the number of optimization calculations can be reduced by checking asimple inequality: Lemma 3.2.

Let c , . . . , c | Ω | +1 be numbers deﬁned as in Lemma 3.1. Suppose that c s satisﬁes (cid:88) w ∈ Ω l t ( x , w | x ∗ , w ∗ , c s ) > h ] p ∗ ( w ) ≤ α. Then, 1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] = 0.Finally, noting that 0 ≤ P ( y ∗ ∈ R s ) ≤ ≤ l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] ≤

1, we can approximate (3.3)with any approximation accuracy ζ > Lemma 3.3.

Let ζ >

0, and deﬁneˆ a t ( x ∗ , w ∗ ) = (cid:88) s ∈ S t P ( y ∗ ∈ R s )1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] ,S t = { s ∈ [ | Ω | + 1] | P ( y ∗ ∈ R s ) ≥ ζ/ ( | Ω | + 1) } . Then, ˆ a t ( x ∗ , w ∗ ) satisﬁes the following inequality: | E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]] − ˆ a t ( x ∗ , w ∗ ) | ≤ ζ. Lemma 3.3 implies that the number of optimization calculations for (3.3) can be further reduced if the error ζ is allowed. In addition, we must emphasize that P ( y ∗ ∈ R s ) is often very small for most s when we actuallycalculate (3.3). Therefore, from these properties, if we apply Lemma 3.3 using a suﬃciently small ζ , we canreduce the computational cost of (3.3) signiﬁcantly with almost no error. Detailed numerical comparisons areprovided in Section 5.

4. Theoretical analysis

In this section, we provide three theorems regarding the accuracy and convergence properties of our methods.First, we deﬁne the misclassiﬁcation loss e α ( x ) for each x ∈ X as follows: e α ( x ) = (cid:26) max { , F ( x ) − α } if x ∈ ˆ L max { , α − F ( x ) } if x ∈ ˆ H .

Furthermore, for theoretical reasons, we assume that the black-box function f follows GP GP (0 , k (( x , w ) , ( x (cid:48) , w (cid:48) ))).In addition, for technical reasons, we assume that the prior variance k (( x , w ) , ( x , w )) ≡ σ ( x , w ) satisﬁes0 < σ ,min ≡ min ( x , w ) ∈X × Ω σ ( x , w ) ≤ max ( x , w ) ∈X × Ω σ ( x , w ) ≤ . Moreover, let κ T be the maximum information gain at step T . Note that κ T is a measure often used to showtheoretical guarantee for GP-based active learning methods (see, e.g., [23]), and can be expressed using mutualinformation I ( y ; f ) between the observed vector y and f as κ T = max A ⊂X × Ω I ( y A ; f ) . Then, the followingtheorem regarding accuracy holds:

Theorem 4.1.

Let h ∈ R , α ∈ (0 , t ≥

1, and δ ∈ (0 , β t = 2 log( |X × Ω | π t / (3 δ )). Moreover,for a user-speciﬁed accuracy parameter ξ >

0, we deﬁne η > η = min (cid:26) ξσ ,min , ξ δσ ,min |X × Ω | (cid:27) . Then, when Algorithm 1 terminates, with a probability of at least 1 − δ , the misclassiﬁcation loss is boundedby ξ , that is, the following inequality holds: P (cid:18) max x ∈X e α ( x ) ≤ ξ (cid:19) ≥ − δ. Theorem 4.2.

Under the same setting as described in Theorem 4.1, let γ > C = 2 / log(1 + σ − ). Inaddition, let T be the smallest positive integer satisfying the following four inequalities:(1) σ − β / T C κ T T < η , (2) σ − C κ T T < η , (3) C β T κ T T < η , (4) 12 log β T − T η σ C κ T < log( |X | − −| Ω | ηγ (2 π ) / / . Then, Algorithm 1 terminates (i.e., U T = ∅ ) after at most T trials when we use the acquisition function a (1) t ( x ∗ , w ∗ ).Furthermore, the similar theorem holds if the acquisition function a (2) t (( x ∗ , w ∗ )) is used. In this study,owing to the practical performance, we modiﬁed the original RMILE toRMILE t ( x ∗ , w ∗ ) = max { MILE t ( x ∗ , w ∗ ) , ˜ γσ t ( x ∗ , w ∗ ) } , MILE t ( x ∗ , w ∗ ) = (cid:88) ( x , w ) ∈ U t × Ω E y ∗ [1l[ l t ( x , w | x ∗ , w ∗ , y ∗ ) > h ]] − |{ ( x , w ) ∈ U t × Ω | l t ( x , w ) > h − η }| . Then, the following theorem holds:

Theorem 4.3.

Under the same setting described in Theorem 4.1, let γ >

0, ˜ γ >

0, and C = 2 / log(1 + σ − ).In addition, let T be the smallest positive integer satisfying the following ﬁve inequalities:(1) σ − β / T C κ T T < η , (2) σ − C κ T T < η , (3) C β T κ T T < η , (4) 12 log β T − T η σ C κ T < log( |X | − −| Ω | ηγ ˜ γ (2 π ) / / , (5) 12 log β T − T η σ C κ T < log( |X × Ω | − η ˜ γ (2 π ) / / . Then, Algorithm 1 terminates (i.e., U T = ∅ ) after at most T trials when we use the acquisition function a (2) t ( x ∗ , w ∗ ).The order of the maximum information gain κ T is known to be sublinear under mild conditions [23]. Hence,because the order of β T is O (log T ), there exist positive integers satisfying the inequalities in Theorems 4.2 and4.3.

5. Numerical experiments

We conﬁrmed the performance of the proposed method using both synthetic and real data. Because ofspace limitation, we provide a part of experimental results in the main text. All experimental results and detailparameter settings are given in the Appendix. The input space

X ×

Ω was deﬁned as a set of grid points thatuniformly cut the region [ L , U ] × [ L , U ] into 50 ×

50. In all experiments, we used the following Gaussiankernel as the kernel function: k (( x, w ) , ( x (cid:48) , w (cid:48) )) = σ f exp( −{ ( x − x (cid:48) ) + ( w − w (cid:48) ) } /L ) . Moreover, we used L p ∗ ( w ): Uniform: p ∗ ( w ) = 1 / . Normal: p ∗ ( w ) = a ( w ) (cid:80) w ∈ Ω a ( w ) , a ( w ) = 1 √ π exp( − w / . Then, we compared the following acquisition functions:

Random:

Select ( x t +1 , w t +1 ) by using random sampling. US:

Perform uncertainty sampling, i.e., ( x t +1 , w t +1 ) = argmax ( x,w ) ∈X × Ω σ t ( x, w ).6 traddle f: Perform straddle strategy [4], i.e., ( x t +1 , w t +1 ) = argmax ( x,w ) ∈X × Ω v t ( x, w ), where v t ( x, w ) =min { u t ( x, w ) − h, h − l t ( x, w ) } . Straddle US:

Select x t +1 and w t +1 by using the straddle of F ( x ) and σ t ( x t +1 , w ), respectively, i.e., x t +1 =argmax x ∈X v Ft ( x ) and w t +1 = argmax w ∈ Ω σ t ( x t +1 , w ), where v Ft ( x ) = min { u Ft ( x ; η ) − α, α − l Ft ( x ; η ) } . Straddle random:

Replace the selection method of w t +1 in straddle US with random sampling. MILE:

Perform the original MILE strategy, i.e., ( x t +1 , w t +1 ) was selected by using (6) in [30]. Proposed1 . : Perform a (1) t ( x ∗ , w ∗ ) with γ = 0 . Proposed1 . : Perform a (1) t ( x ∗ , w ∗ ) with γ = 0 . Proposed2 . : Perform a (2) t ( x ∗ , w ∗ ) with γ = 0 . Proposed2 . : Perform a (2) t ( x ∗ , w ∗ ) with γ = 0 . η to zero. Similarly, because of the computational costof calculating acquisition functions, we replaced P ( y ∗ ∈ R s )1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] in (3.3) with zero when P ( y ∗ ∈ R s ) satisﬁes P ( y ∗ ∈ R s ) < . ζ/ ( | Ω | + 1) = 0 .

005 toapproximate (3.3).

We conﬁrmed the performance of the proposed method using synthetic functions. We considered the fol-lowing four functions, which are commonly used benchmark functions (the last one adds − Booth: f ( x, w ) = ( x + 2 w − + (2 x + w − . Matyas: f ( x, w ) = 0 . x + w ) − . xw . McCormick: f ( x, w ) = sin( x + w ) + ( x − w ) − . x + 2 . w + 1. Styblinski-Tang: f ( x, w ) = ( x − x + 5 x ) / w − w + 5 w ) / − × recpre + rec , pre = | H ∩ H t || H t | , rec = | H ∩ H t || H | . From Figures 1 and 2, it can be conﬁrmed that our proposed methods outperform other existing methods.On the other hand, in the existing methods, Straddle f and MILE exhibit high performance, because theMILE acquisition function increases the expected number of ( x, w ) satisfying l t ( x, w ) > h . As a result, because˜ l t ( x, w ; η ) and l ( F ) t ( x ; η ) become large early, the number of elements in H t also increases early. Similarly, becausethe Straddle f acquisition function can eﬃciently search for ( x, w ) satisfying l t ( x, w ) > h or u t ( x, w ) < h , thenumber of elements in H t also increases eﬃciently from the same argument as before. Furthermore, whencomparing Proposed1 and Proposed2, one of the reasons why the latter exhibits better performance is the factthat RMILE exhibits better performance than uncertainty sampling. Other experiments, a comparison of thediﬀerence in γ is described in the Appendix. In this section, we conﬁrmed how much the computation time of (3.2) can be improved by using Lemma3.1, 3.2 and 3.3. We evaluated the computation time of (3.2) when we performed the same experiment as inSubsection 5.1 using Proposed1 0 .

01 and Proposed2 0 .

01 for the Booth function. The experiments for Matyas,McCormick and Styblinski-Tang functions are described in the Appendix. Here, as for the parameter settings,we considered only the case of L Naive:

For each ( x ∗ , w ∗ ), we generate M samples y ∗ , . . . , y ∗ M from the posterior distribution of f ( x ∗ , w ∗ ), andapproximate (3.2) by (cid:88) x ∈ U t M M (cid:88) m =1 l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ m ) > α ] , where we set M = 1000. 7

50 100 150 200 250 300 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01

Booth Matyas McCormick Styblinski-TangFigure 1: Average F-score over 50 simulations with four benchmark functions when the distance function andreference distribution are L . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01

Booth Matyas McCormick Styblinski-TangFigure 2: Average F-score over 50 simulations with four benchmark functions when the distance function andreference distribution are L Naive L1 L2 L3 (10 − ) L3 (10 − ) L3 (10 − )Proposed1 0 .

01 138505 . ± .

87 7621 . ± .

23 2370 . ± .

94 71 . ± .

33 80 . ± .

37 86 . ± . .

01 106306 . ± .

01 5835 . ± .

99 2608 . ± .

06 63 . ± .

29 72 . ± .

99 78 . ± . L1:

Compute (3.2) using Lemma 3.1.

L2:

Compute (3.2) using Lemma 3.1 and 3.2. L3 (10 − ) : Compute (3.2) using Lemma 3.1, 3.2 and 3.3 with ζ = ( | Ω | + 1)10 − . L3 (10 − ) : Compute (3.2) using Lemma 3.1, 3.2 and 3.3 with ζ = ( | Ω | + 1)10 − . L3 (10 − ) : Compute (3.2) using Lemma 3.1, 3.2 and 3.3 with ζ = ( | Ω | + 1)10 − .Under this setup, we took one initial point at random and ran the algorithms until the number of iterationsreached to 300. Furthermore, for each trial t , we evaluated the computation time to calculate (3.2) for allcandidate points ( x ∗ , w ∗ ) ∈ X × Ω, and calculated the average computation time over 300 trials. From Table 1,it can be conﬁrmed that the computation time is improved as the proposed computational techniques are used.Moreover, comparing L3 (10 − ), L3 (10 − ) and L3 (10 − ), it can be conﬁrmed that the computation timebecomes shorter when a large ζ is used. However, it can be seen that the computation time of L3 (10 − ) is stillvery small compared to the computation time of Naive, L1 and L2. Therefore, from | Ω | = 50 and Lemma 3.3,it implies that by using proposed computational techniques, we can improve the computation time signiﬁcantlyeven if the error from the true a t ( x ∗ , w ∗ ) is kept to a very small value such as 51 × − = 5 . × − . We compared our proposed method with other existing methods by using the infection control problem[15]. We considered a simulation-based decision-making problem for an epidemic, which aims to determinean acceptable infection rate x under an uncertain recovery rate w with as few simulations as possible. Themotivation for this simulation was to evaluate the tradeoﬀ between economic risk and a controllable infectionrate. For example, if the infection rate x is minimized by shutting down all economic activities, the economic8

20 40 60 80 100 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 L L h with a probability of at least α . In thisexperiment, to simulate epidemic behavior, we used the SIR model [15]. The model computes the evolutionof the number of infected people by using an infection rate x and recovery rate w . In our experiment, weconsidered the infection rate as the design variable x and the recovery rate as the environmental variable w following an unknown distribution. In addition, we regarded economic risk as a black-box function f ( x, w ).Note that similar numerical experiments were performed in [13] under the setting where the distribution of w , p † ( w ), is known. Furthermore, we rescaled the ranges of x and w in the interval [ − , X ×

Ωis deﬁned as a set of grid points that uniformly cut the region [ − , × [ − ,

1] into 50 ×

50. We used the followingeconomic risk function f ( x, w ): f ( x, w ) = n infected ( x, w ) − x, where n infected ( x, w ) is the maximum numberof infected people in a given period of time, calculated using the SIR model. Note that this risk function wasalso used by [13], and in this experiment, we used the same function they used in their experiment. Under thissetup, we took one initial point at random and ran the algorithms until the number of iterations reached 100.From 50 Monte Carlo simulations, we calculated average F-scores, where we used the following parameters forall problem settings: h = 135 , α = 0 . , σ = 0 . , σ f = 250 , L = 0 . , β / t = 4 , (cid:15) = 0 . . In this experiment, we used the following modiﬁed reference function as Normal: p ∗ ( w ) = a ( w ) (cid:80) w ∈ Ω a ( w ) , a ( w ) = 1 √ . π exp( − w / . . From Figure 3, it can be conﬁrmed that Proposed2 and MILE performed better than the others.

6. Conclusion

We proposed active learning methods for identifying the reliable set of distributionally robust probabilitythreshold robustness (DRPTR) measure under uncertain environmental variables. We showed that our proposedmethods satisfy theoretical guarantees about convergence and accuracy, and outperform existing methods innumerical experiments.

Acknowledgement

This work was partially supported by MEXT KAKENHI (20H00601, 16H06538), JST CREST (JPMJCR1502),and RIKEN Center for Advanced Intelligence Project. 9 eferences [1] Justin J Beland and Prasanth B Nair. Bayesian optimization under uncertainty. In

NIPS BayesOpt 2017workshop , 2017.[2] Hans-Georg Beyer and Bernhard Sendhoﬀ. Robust optimization–a comprehensive survey.

Computer meth-ods in applied mechanics and engineering , 196(33-34):3190–3218, 2007.[3] Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimizationwith Gaussian processes. In

Advances in neural information processing systems , pages 5760–5770, 2018.[4] Brent Bryan, Robert C Nichol, Christopher R Genovese, Jeﬀ Schneider, Christopher J Miller, and LarryWasserman. Active learning for identifying function threshold boundaries. In

Advances in neural informa-tion processing systems , pages 163–170, 2006.[5] Abraham Charnes and William W Cooper. Chance-constrained programming.

Management science ,6(1):73–79, 1959.[6] Lukas Fr¨ohlich, Edgar Klenske, Julia Vinogradska, Christian Daniel, and Melanie Zeilinger. Noisy-inputentropy search for eﬃcient robust bayesian optimization. In Silvia Chiappa and Roberto Calandra, editors,

Proceedings of the Twenty Third International Conference on Artiﬁcial Intelligence and Statistics , volume108 of

Proceedings of Machine Learning Research , pages 2262–2272. PMLR, 26–28 Aug 2020.[7] Alexandra Gessner, Javier Gonzalez, and Maren Mahsereci. Active multi-information source Bayesianquadrature. In

Uncertainty in Artiﬁcial Intelligence , pages 712–721. PMLR, 2020.[8] Alkis Gotovos, Nathalie Casati, Gregory Hitz, and Andreas Krause. Active learning for level set estimation.In

Proceedings of the Twenty-Third International Joint Conference on Artiﬁcial Intelligence , IJCAI ’13,pages 1344–1350. AAAI Press, 2013.[9] Y Inatsu, D Sugita, K Toyoura, and I Takeuchi. Active learning for enumerating local minima based onGaussian process derivatives.

Neural Computation , 32(10):2032–2068, 2020.[10] Yu Inatsu, Masayuki Karasuyama, Keiichi Inoue, Hideki Kandori, and Ichiro Takeuchi. Active learning ofBayesian linear models with high-dimensional binary features by parameter conﬁdence-region estimation.

Neural Computation , 32(10):1998–2031, 2020.[11] Yu Inatsu, Masayuki Karasuyama, Keiichi Inoue, and Ichiro Takeuchi. Active learning for level set estima-tion under input uncertainty and its extensions.

Neural Computation , 32(12):2486–2531, 2020.[12] Shogo Iwazaki, Yu Inatsu, and Ichiro Takeuchi. Bayesian experimental design for ﬁnding reliable level setunder input uncertainty.

IEEE Access , 8:203982–203993, 2020.[13] Shogo Iwazaki, Yu Inatsu, and Ichiro Takeuchi. Bayesian quadrature optimization for probability thresholdrobustness measure. arXiv preprint arXiv:2006.11986 , 2020.[14] Shogo Iwazaki, Yu Inatsu, and Ichiro Takeuchi. Mean-variance analysis in Bayesian optimization underuncertainty. In

The 24th International Conference on Artiﬁcial Intelligence and Statistics , 2021. To appear.[15] William Ogilvy Kermack and Anderson G McKendrick. A contribution to the mathematical theory ofepidemics.

Proceedings of the royal society of london. Series A, Containing papers of a mathematical andphysical character , 115(772):700–721, 1927.[16] Johannes Kirschner, Ilija Bogunovic, Stefanie Jegelka, and Andreas Krause. Distributionally robustBayesian optimization. In Silvia Chiappa and Roberto Calandra, editors,

Proceedings of the Twenty ThirdInternational Conference on Artiﬁcial Intelligence and Statistics , volume 108 of

Proceedings of MachineLearning Research , pages 2174–2184. PMLR, 26–28 Aug 2020.[17] Thanh Nguyen, Sunil Gupta, Huong Ha, Santu Rana, and Svetha Venkatesh. Distributionally robustBayesian quadrature optimization. In Silvia Chiappa and Roberto Calandra, editors,

Proceedings of theTwenty Third International Conference on Artiﬁcial Intelligence and Statistics , volume 108 of

Proceedingsof Machine Learning Research , pages 1921–1931. PMLR, 26–28 Aug 2020.[18] Rafael Oliveira, Lionel Ott, and Fabio Ramos. Bayesian optimisation under uncertain inputs. In

The 22ndInternational Conference on Artiﬁcial Intelligence and Statistics , pages 1177–1184, 2019.[19] Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review. arXiv preprintarXiv:1908.05659 , 2019. 1020] Herbert Scarf. A min-max solution of an inventory problem.

Studies in the mathematical theory of inventoryand production , 10:201–209, 1958.[21] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Depart-ment of Computer Sciences, 2009.[22] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the humanout of the loop: A review of Bayesian optimization.

Proceedings of the IEEE , 104(1):148–175, 2016.[23] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization inthe bandit setting: No regret and experimental design. In

Proceedings of the 27th International Conferenceon International Conference on Machine Learning , ICML’10, pages 1015–1022, USA, 2010. Omnipress.[24] Yanan Sui, Joel Burdick, Yisong Yue, et al. Stagewise safe Bayesian optimization with Gaussian processes.In

International Conference on Machine Learning , pages 4781–4789, 2018.[25] Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization withGaussian processes. In

International Conference on Machine Learning , pages 997–1005, 2015.[26] Saul Toscano-Palmerin and Peter I Frazier. Bayesian optimization with expensive integrands. arXiv preprintarXiv:1803.08661 , 2018.[27] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in ﬁnite Markov decision pro-cesses with Gaussian processes. In

Proceedings of the 30th International Conference on Neural InformationProcessing Systems , pages 4312–4320, 2016.[28] Akifumi Wachi, Yanan Sui, Yisong Yue, and Masahiro Ono. Safe exploration and optimization of con-strained MDPs using Gaussian processes. In

Proceedings of the AAAI Conference on Artiﬁcial Intelligence ,volume 32, 2018.[29] Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for machine learning. the MITPress , 2(3):4, 2006.[30] Andrea Zanette, Junzi Zhang, and Mykel J Kochenderfer. Robust super-level set estimation using Gaussianprocesses. In

Joint European Conference on Machine Learning and Knowledge Discovery in Databases ,pages 276–291. Springer, 2018. 11 ppendixA. ProofsA.1. Proof of Theorem 4.1

In this section, we prove Theorem 4.1. First, we show two lemmas.

Lemma A.1.

Let δ ∈ (0 , β t = 2 log( |X × Ω | π t / (6 δ )). Then, with a probability of at least 1 − δ ,the following inequality holds: | f ( x , w ) − µ t − ( x , w ) | ≤ β / t σ t − ( x , w ) , ∀ ( x , w ) ∈ X × Ω , ∀ t ≥ . Proof.

By replacing D and π t in Lemma 5.1 of [23] with X ×

Ω and π t /

6, respectively, we have LemmaA.1.

Lemma A.2.

Let δ ∈ (0 , ξ > η = min (cid:110) ξσ ,min , ξ δσ ,min |X × Ω | (cid:111) . Then, with a probability of at least1 − δ/

2, the following holds for any x ∈ X and p ( w ) ∈ A :˜ F η,p ( x ) ≡ (cid:88) w ∈ Ω h ≥ f ( x , w ) > h − η ] p ( w ) < ξ. Proof.

From Chebyshev’s inequality, for any ν > x , w ) ∈ X × Ω, the following inequality holds: P ( | g η ( x , w ) − µ ( g η ) ( x , w ) | ≥ ν ) ≤ V [ g η ( x , w )] ν , where g η ( x , w ) = 1l[ h ≥ f ( x , w ) > h − η ] and µ ( g η ) ( x , w ) = E [ g η ( x , w )]. Hence, by replacing ν with ( δ/ (2 |X × Ω | )) − / ( V [ g η ( x , w )]) / , with a probability of at least 1 − δ/

2, the following holds for any ( x , w ) ∈ X × Ω: | g η ( x , w ) − µ ( g η ) ( x , w ) | < (cid:112) V [ g η ( x , w )] (cid:112) δ/ (2 |X × Ω | ) . This implies that g η ( x , w ) < µ ( g η ) ( x , w ) + (cid:112) V [ g η ( x , w )] (cid:112) δ/ (2 |X × Ω | ) . (A.1)Moreover, noting that g η ( x , w ) follows Bernoulli distribution, we get V [ g η ( x , w )] = E [ g η ( x , w )](1 − E [ g η ( x , w )]) ≤ E [ g η ( x , w )] = µ ( g η ) ( x , w ) . (A.2)In addition, µ ( g η ) ( x , w ) can be expressed as µ ( g η ) ( x , w ) = Φ (cid:18) hσ ( x , w ) (cid:19) − Φ (cid:18) h − ησ ( x , w ) (cid:19) . Furthermore, by using Taylor’s expansion, for any a < b it holds thatΦ( b ) = Φ( a ) + φ ( c )( b − a ) ≤ Φ( a ) + φ (0)( b − a ) ≤ Φ( a ) + ( b − a ) , where c ∈ ( a, b ). Thus, we obtain µ ( g η ) ( x , w ) ≤ ησ ( x , w ) ≤ ησ ,min . (A.3)Thus, by substituting (A.2) and (A.3) into (A.1), we have g η ( x , w ) < ησ ,min + (cid:115) η |X × Ω | δσ ,min . Hence, from the deﬁnition of η , we get g η ( x , w ) < ξ (cid:114) ξ ξ. Therefore, for any p ( w ) ∈ A , the following holds:˜ F η,p ( x ) = (cid:88) w ∈ Ω g η ( x , w ) p ( w ) < (cid:88) w ∈ Ω ξp ( w ) = ξ.

12y using Lemma A.1 and A.2, we prove Theorem 4.1.

Proof.

Let δ ∈ (0 ,

1) and β t = 2 log( |X × Ω | π t / (3 δ )). Then, from Lemma A.1, with a probability of at least1 − δ/ l t ( x , w ) ≤ f ( x , w ) ≤ u t ( x , w ) , ∀ ( x , w ) ∈ X × Ω , ∀ t ≥ . (A.4)Thus, from the deﬁnition of ˜ Q t ( x , w ; η ), it holds that1l[ f ( x , w ) > h ] ≤ ˜ u t ( x , w ; η ) . This implies that F ( x ) = inf p ( w ) ∈A (cid:88) w ∈ Ω f ( x , w ) > h ] p ( w ) ≤ inf p ( w ) ∈A (cid:88) w ∈ Ω ˜ u t ( x , w ; η ) p ( w ) = u ( F ) t ( x ; η ) . Therefore, noting that the deﬁnition of L t , we have x ∈ L t ⇒ F ( x ) ≤ u ( F ) t ( x ; η ) ≤ α. (A.5)On the other hand, for any x ∈ X and p ( w ) ∈ A , it holds that (cid:88) w ∈ Ω f ( x , w ) > h ] p ( w ) + ˜ F η,p ( x ) = (cid:88) w ∈ Ω f ( x , w ) > h − η ] p ( w ) . Moreover, from Lemma A.2, with a probability of at least 1 − δ/

2, the following holds: (cid:88) w ∈ Ω f ( x , w ) > h ] p ( w ) + ξ > (cid:88) w ∈ Ω f ( x , w ) > h − η ] p ( w ) . (A.6)Thus, we get the following inequality:inf p ( w ) ∈A (cid:32) (cid:88) w ∈ Ω f ( x , w ) > h ] p ( w ) + ξ (cid:33) = F ( x ) + ξ > inf p ( w ) ∈A (cid:88) w ∈ Ω f ( x , w ) > h − η ] p ( w ) . (A.7)Furthermore, from the deﬁnition of ˜ Q t ( x , w ; η ), the following inequality holds:1l[ f ( x , w ) > h − η ] ≥ ˜ l t ( x , w ; η ) . Therefore, we haveinf p ( w ) ∈A (cid:88) w ∈ Ω f ( x , w ) > h − η ] p ( w ) ≥ inf p ( w ) ∈A (cid:88) w ∈ Ω ˜ l t ( x , w ; η ) p ( w ) = l ( F ) t ( x ; η ) . (A.8)Hence, by combining (A.7) and (A.8), we obtain l ( F ) t ( x ; η ) < F ( x ) + ξ. Thus, from the deﬁnition of H t , it holds that x ∈ H t ⇒ α < F ( x ) + ξ ⇒ α − ξ < F ( x ) . (A.9)Hence, from (A.5), (A.9) and the deﬁnition of e α ( x ), the following inequality holds:max x ∈X e α ( x ) ≤ ξ. Finally, since both (A.4) and (A.6) hold with a probability of at least 1 − δ , the following holds for any t ≥ P (cid:18) max x ∈X e α ( x ) ≤ ξ (cid:19) ≥ − δ. .2. Proof of Theorem 4.2 and 4.3 In this section, we prove Theorem 4.2 and 4.3. First, we show related lemmas.

Lemma A.3.

Let η > β t >

0. Suppose that the following holds for some T ≥ β / T σ T − ( x , w ) < η, ∀ ( x , w ) ∈ X × Ω . (A.10)Then, Algorithm 1 terminates after at most T iterations. Proof.

From the deﬁnition of ˜ Q t ( x , w ; η ), if l T ( x , w ) > h − η , then ˜ l T ( x , w ; η ) = ˜ u T ( x , w ; η ) = 1. On the otherhand, noting that u T ( x , w ) − l T ( x , w ) = 2 β / T σ T − ( x , w ) and (A.10), if l T ( x , w ) ≤ h − η , then u T ( x , w ) ≤ h .This implies that ˜ l T ( x , w ; η ) = ˜ u T ( x , w ; η ) = 0. Thus, under (A.10), the following holds for any ( x , w ) ∈ X × Ω:˜ l T ( x , w ; η ) = ˜ u T ( x , w ; η ) . Hence, from the deﬁnitions of l ( F ) t ( x , w ; η ) and u ( F ) t ( x , w ; η ), we have l ( F ) t ( x , w ; η ) = u ( F ) t ( x , w ; η ). Therefore,for any x ∈ X , x satisﬁes x ∈ H T or x ∈ L T , i.e., U T = ∅ . Lemma A.4.

Let η > β t >

0. Suppose that the following inequalities hold for some ( x ∗ , w ∗ ) ∈ X × Ω: σ − σ t − ( x ∗ , w ∗ ) β / t < η , (A.11) σ − σ t − ( x ∗ , w ∗ ) < η / . (A.12)Then, (3.2) can be bounded as a t − ( x ∗ , w ∗ ) ≤ |X | | Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . Proof.

First, we deﬁne the set B as B =  b = ( b , . . . , b | Ω | ) ∈ { , } | Ω | (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) inf p ( w ) ∈A | Ω | (cid:88) j =1 p ( w j ) b j > α  . Moreover, for each b ∈ B , let N ( b ) be a subset of { , . . . , | Ω |} satisfying ∀ s ∈ N ( b ) , b s = 1 . Then, the following holds for any x ∈ U t : E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]]= P y ∗ [(1l[ l t ( x , w | x ∗ , w ∗ , y ∗ ) > h ] , . . . , l t ( x , w | Ω | | x ∗ , w ∗ , y ∗ ) > h ]) (cid:62) ∈ B ]= (cid:88) b ∈B P y ∗ [1l[ l t ( x , w | x ∗ , w ∗ , y ∗ ) > h ] = b , . . . , l t ( x , w | Ω | | x ∗ , w ∗ , y ∗ ) > h ] = b | Ω | ] ≤ (cid:88) b ∈B P y ∗ [ ∀ s ∈ N ( b ) , l t ( x , w s | x ∗ , w ∗ , y ∗ ) > h ] = b s ] , (A.13)where l t ( x , w j | x ∗ , w ∗ , y ∗ ) is the lower conﬁdence bound of f ( x , w j ) after adding ( x ∗ , w ∗ , y ∗ ) to { ( x i , w i , y i ) } ti =1 .Next, for any N ( b ) , there exists s b ∈ N ( b ) such that l t ( x , w s b ) ≤ h − η. (A.14)In fact, if l t ( x , w s b ) > h − η for any s ∈ N ( b ) , then we get(1l[ l t ( x , w ) > h − η ] , . . . , l t ( x , w | Ω | ) > h − η ]) (cid:62) ∈ B , which contradicts x ∈ U t . Furthermore, from Lemma 2 of [30], P y ∗ [ l t ( x , w s b | x ∗ , w ∗ , y ∗ ) > h ] can be calculatedas P y ∗ [ l t ( x , w s b | x ∗ , w ∗ , y ∗ ) > h ] = Φ  (cid:113) σ t − ( x ∗ , w ∗ ) + σ | k t − (( x , w s b ) , ( x ∗ , w ∗ )) | ( µ t − ( x , w s b ) − β / t σ t − ( x , w s b | x ∗ , w ∗ ) − h )  , (A.15)14here σ t − ( x , w s b | x ∗ , w ∗ ) is the posterior variance of f ( x , w s b ) after adding ( x ∗ , w ∗ , y ∗ ) to { ( x i , w i , y i ) } ti =1 .Moreover, by using (A.14) we obtain µ t − ( x , w s b ) − β / t σ t − ( x , w s b | x ∗ , w ∗ ) − h = µ t − ( x , w s b ) − β / t σ t − ( x , w s b ) + β / t σ t − ( x , w s b ) − β / t σ t − ( x , w s b | x ∗ , w ∗ ) − h = l t ( x , w s b ) + β / t σ t − ( x , w s b ) − β / t σ t − ( x , w s b | x ∗ , w ∗ ) − h ≤ − η + β / t ( σ t − ( x , w s b ) − σ t − ( x , w s b | x ∗ , w ∗ )) . (A.16)In addition, the following three inequalities hold: σ ≤ (cid:113) σ t − ( x ∗ , w ∗ ) + σ , (A.17) | k t − (( x , w s b ) , ( x ∗ , w ∗ )) | ≤ σ t − ( x , w s b ) σ t − ( x ∗ , w ∗ ) ≤ σ ( x , w s b ) σ t − ( x ∗ , w ∗ ) ≤ σ t − ( x ∗ , w ∗ ) , (A.18) σ t − ( x , w s b ) − σ t − ( x , w s b | x ∗ , w ∗ ) ≤ σ t − ( x , w s b ) σ t − ( x ∗ , w ∗ ) σ t − ( x ∗ , w ∗ ) + σ ≤ σ ( x , w s b ) σ t − ( x ∗ , w ∗ ) σ ≤ σ t − ( x ∗ , w ∗ ) σ , (A.19)where the ﬁrst, second and third inequalities in (A.18) can be derived from H¨older’s inequality, monotonicityof the posterior variance and the assumption max ( x , w ) ∈X × Ω σ ( x , w ) ≤

1, respectively. Similarly, the ﬁrstinequality in (A.19) can be derived from the equation (39) of [30]. Therefore, by substituting (A.16)–(A.19)and (A.11) into (A.15), we obtain the following inequality: P y ∗ [ l t ( x , w s b | x ∗ , w ∗ , y ∗ ) > h ] ≤ Φ (cid:18) σσ t − ( x ∗ , w ∗ ) ( − η/ (cid:19) , (A.20)Moreover, noting that the assumption (A.12) is equal to the condition 1 < σσ − t − ( x ∗ , w ∗ )( η/ (cid:18) σσ t − ( x ∗ , w ∗ ) ( − η/ (cid:19) = (cid:90) σσt − x ∗ , w ∗ ) ( − η/ −∞ φ ( z )d z = (cid:90) ∞ σσt − x ∗ , w ∗ ) ( η/ φ ( z )d z ≤ (cid:90) ∞ σσt − x ∗ , w ∗ ) ( η/ zφ ( z )d z = [ − φ ( z )] ∞ σσt − x ∗ , w ∗ ) ( η/ = 1 √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . (A.21)Finally, from (A.13), (A.20) and (A.21), E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]] can be bounded as E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]] ≤ (cid:88) b ∈B P y ∗ [ ∀ s ∈ N ( b ) , l t ( x , w s | x ∗ , w ∗ , y ∗ ) > h ] = b s ] ≤ (cid:88) b ∈B P y ∗ [1l[ l t ( x , w s b | x ∗ , w ∗ , y ∗ ) > h ] = b s b ]= (cid:88) b ∈B P y ∗ [ l t ( x , w s b | x ∗ , w ∗ , y ∗ ) > h ] ≤ (cid:88) b ∈B √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) = |B| √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) ≤ | Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . a t − ( x ∗ , w ∗ ), we have a t − ( x ∗ , w ∗ ) = (cid:88) x ∈ U t E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]] ≤ (cid:88) x ∈ U t | Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) = | U t | | Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) ≤ |X | | Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . Lemma A.5.

Let η > β t > γ >

0. Also let ( x t , w t ) ∈ X × Ω be a maximum point of a (1) t − ( x ∗ , w ∗ ).Assume that the following inequalities hold for some T ≥ σ − σ T − ( x T , w T ) β / T < η , (A.22) σ − σ T − ( x T , w T ) < η / , (A.23) σ T − ( x T , w T ) β T < η / , (A.24)12 log β T − η σ σ T − ( x T , w T ) < log( |X | − −| Ω | ηγ − √ π ) . (A.25)Then, Algorithm 1 terminates after at most T iterations. Proof.

From the deﬁnitions of a (1) t − ( x ∗ , w ∗ ) and ( x t , w t ), the following holds for any ( x , w ) ∈ X × Ω: γσ T − ( x , w ) ≤ a (1) T − ( x , w ) ≤ a (1) T − ( x T , w T ) = max { a T − ( x T , w T ) , γσ T − ( x T , w T ) } . (A.26)In addition, from (A.22), (A.23) and Lemma A.4, a T − ( x T , w T ) can be bounded as a T − ( x T , w T ) ≤ |X | | Ω | √ π exp (cid:18) − σ η σ T − ( x T , w T ) (cid:19) . (A.27)Thus, by substituting (A.27) into (A.26), we have γσ T − ( x , w ) ≤ max (cid:26) |X | | Ω | √ π exp (cid:18) − σ η σ T − ( x T , w T ) (cid:19) , γσ T − ( x T , w T ) (cid:27) . This implies that β / T σ T − ( x , w ) ≤ max (cid:26) γ − β / T |X | | Ω | √ π exp (cid:18) − σ η σ T − ( x T , w T ) (cid:19) , β / T σ T − ( x T , w T ) (cid:27) . (A.28)On the other hand, (A.24) and (A.25) are equal to the following inequalities, respectively: β / T σ T − ( x T , w T ) < η/ , (A.29)exp (cid:18) − η σ σ T − ( x T , w T ) (cid:19) < |X | − −| Ω | ηγ − √ πβ / T . (A.30)Hence, by combining (A.28), (A.29) and (A.30), we get β / T σ T − ( x T , w T ) < η/

2. Therefore, from Lemma A.3,we have Lemma A.5.

Lemma A.6.

Let η > β t >

0. Assume that (A.11) and (A.12) hold for some ( x ∗ , w ∗ ) ∈ X × Ω. Then,MILE t − ( x ∗ , w ∗ ) can be bounded asMILE t − ( x ∗ , w ∗ ) ≤ |X × Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . roof. From Lemma 2 of [30] and the deﬁnition of MILE t − ( x , w ), the following holds:MILE t − ( x ∗ , w ∗ )= (cid:88) ( x , w ) ∈ U t × Ω E y ∗ [1l[ l t ( x , w | x ∗ , w ∗ , y ∗ ) > h ]] − |{ ( x , w ) ∈ U t × Ω | l t ( x , w ) > h − η }| = (cid:88) ( x , w ) ∈ U t × Ω P y ∗ [ l t ( x , w | x ∗ , w ∗ , y ∗ ) > h ] − |{ ( x , w ) ∈ U t × Ω | l t ( x , w ) > h − η }|≤ (cid:88) ( x , w ) ∈ U t × Ω Φ  (cid:113) σ t − ( x ∗ , w ∗ ) + σ | k t − (( x , w ) , ( x ∗ , w ∗ )) | ( µ t − ( x , w ) − β / t σ t − ( x , w | x ∗ , w ∗ ) − h )  − |{ ( x , w ) ∈ U t × Ω | l t ( x , w ) > h − η }| . = (cid:88) ( x , w ) ∈ U t × Ω  Φ  (cid:113) σ t − ( x ∗ , w ∗ ) + σ | k t − (( x , w ) , ( x ∗ , w ∗ )) | ( µ t − ( x , w ) − β / t σ t − ( x , w | x ∗ , w ∗ ) − h )  − l t ( x , w ) > h − η ]  . (A.31)Next, for each ( x , w ) ∈ U t × Ω, we consider the two cases of l t ( x , w ) > h − η and l t ( x , w ) ≤ h − η . If l t ( x , w ) > h − η , then the following inequality holds:Φ  (cid:113) σ t − ( x ∗ , w ∗ ) + σ | k t − (( x , w ) , ( x ∗ , w ∗ )) | ( µ t − ( x , w ) − β / t σ t − ( x , w | x ∗ , w ∗ ) − h )  − l t ( x , w ) > h − η ] ≤ ≤ √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . On the other hand, if l t ( x , w ) ≤ h − η , then using (A.15)–(A.21) we haveΦ  (cid:113) σ t − ( x ∗ , w ∗ ) + σ | k t − (( x , w ) , ( x ∗ , w ∗ )) | ( µ t − ( x , w ) − β / t σ t − ( x , w | x ∗ , w ∗ ) − h )  − l t ( x , w ) > h − η ]= Φ  (cid:113) σ t − ( x ∗ , w ∗ ) + σ | k t − (( x , w ) , ( x ∗ , w ∗ )) | ( µ t − ( x , w ) − β / t σ t − ( x , w | x ∗ , w ∗ ) − h )  ≤ √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . Therefore, in both cases, the following inequality holds:Φ  (cid:113) σ t − ( x ∗ , w ∗ ) + σ | k t − (( x , w ) , ( x ∗ , w ∗ )) | ( µ t − ( x , w ) − β / t σ t − ( x , w | x ∗ , w ∗ ) − h )  − l t ( x , w ) > h − η ] ≤ √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . (A.32)Thus, by substituting (A.32) into (A.31), we obtainMILE t − ( x ∗ , w ∗ ) ≤ (cid:88) ( x , w ) ∈ U t × Ω √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) = | U t × Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) ≤ |X × Ω | √ π exp (cid:18) − σ η σ t − ( x ∗ , w ∗ ) (cid:19) . Lemma A.7.

Let η > β t > γ > γ >

0. Also let ( x t , w t ) ∈ X × Ω be a maximum point of a (2) t − ( x ∗ , w ∗ ). Assume that the inequalities (A.22), (A.23) and (A.24) hold for some T ≥

1. In addition, assumethat the following inequalities hold:12 log β T − η σ σ T − ( x T , w T ) < log( |X | − −| Ω | ηγ ˜ γ − √ π ) , (A.33)12 log β T − η σ σ T − ( x T , w T ) < log( |X × Ω | − η ˜ γ − √ π ) . (A.34)Then, Algorithm 1 terminates after at most T iterations.17 roof. From the deﬁnition of a (2) t − ( x ∗ , w ∗ ) and ( x t , w t ), the following holds for any ( x , w ) ∈ X × Ω: γ ˜ γσ T − ( x , w ) ≤ γ RMILE T − ( x , w ) ≤ a (2) T − ( x , w ) ≤ a (2) T − ( x T , w T )= max { a T − ( x T , w T ) , γ RMILE T − ( x T , w T ) } . (A.35)Furthermore, from (A.22), (A.23) and Lemma A.6, we have γ RMILE T − ( x T , w T ) = max { γ MILE T − ( x T , w T ) , γ ˜ γσ T − ( x T , w T ) }≤ max (cid:26) γ |X × Ω | √ π exp (cid:18) − σ η σ T − ( x T , w T ) (cid:19) , γ ˜ γσ T − ( x T , w T ) (cid:27) . (A.36)Moreover, from (A.24) and (A.34), we get the following inequalities: σ T − ( x T , w T ) < β − / T η/ , (A.37) |X × Ω | √ π exp (cid:18) − σ η σ T − ( x T , w T ) (cid:19) < β − / T η ˜ γ/ . (A.38)Thus, by substituting (A.37) and (A.38) into (A.36), we obtain γ RMILE T − ( x T , w T ) ≤ γ ˜ γβ − / T η/ . (A.39)Similarly, from (A.22), (A.23), (A.33) and Lemma A.4, a T − ( x T , w T ) can be bounded as a T − ( x T , w T ) ≤ |X | | Ω | √ π exp (cid:18) − σ η σ T − ( x T , w T ) (cid:19) ≤ γ ˜ γβ − / T η/ . (A.40)Hence, by combining (A.39) and (A.40) into (A.35), we get γ ˜ γσ T − ( x , w ) ≤ γ ˜ γβ − / T η/ . This implies that 2 β / T σ T − ( x , w ) < η . Therefore, from Lemma A.3, we have Lemma A.7. Lemma A.8.

Let ( x , w ) , . . . , ( x t , w t ) be selected points, and deﬁne C = 2 / log(1 + σ − ). Then, there existsa natural number t (cid:48) ≤ t such that σ t (cid:48) − ( x t (cid:48) , w t (cid:48) ) ≤ C κ t t . Proof.

From Lemma 5.3 in [23], the mutual information I ( y A ; f ) can be expressed as I ( y A ; f ) = 12 t (cid:88) i =1 log(1 + σ − σ i − ( x i , w i )) . (A.41)Similarly, from Lemma 5.4 in [23], σ i − ( x i , w i ) can be bounded as σ i − ( x i , w i ) ≤ log(1 + σ − σ i − ( x i , w i ))log(1 + σ − ) . (A.42)Hence, by using (A.41) and (A.42), we get t (cid:88) i =1 σ i − ( x i , w i ) ≤ σ − ) I ( y t ; f ) ≤ C κ t . (A.43)Next, we deﬁne t (cid:48) as t (cid:48) = argmin ≤ i ≤ t σ i − ( x i , w i ). Then, it follows that tσ t (cid:48) − ( x t (cid:48) , w t (cid:48) ) ≤ t (cid:88) i =1 σ i − ( x i , w i ) . (A.44)Therefore, by combining (A.43) and (A.44), we have the desired inequality.Finally, using Lemma A.5, A.7 and A.8, we prove Theorem 4.2 and 4.3.18 roof. From Lemma A.8 and monotonicity of β t , for any t ≥

1, there exists a natural number t (cid:48) ≤ t such that σ − σ t (cid:48) − ( x t (cid:48) , w t (cid:48) ) β / t (cid:48) ≤ σ − β / t (cid:48) C κ t t ≤ σ − β / t C κ t t ,σ − σ t (cid:48) − ( x t (cid:48) , w t (cid:48) ) ≤ σ − C κ t t ,σ t (cid:48) − ( x t (cid:48) , w t (cid:48) ) β t (cid:48) ≤ C β t (cid:48) κ t t ≤ C β t κ t t ,

12 log β t (cid:48) − η σ σ t (cid:48) − ( x t (cid:48) , w t (cid:48) ) ≤

12 log β t (cid:48) − T η σ C κ t ≤

12 log β t − T η σ C κ t . (A.45)Hence, from (A.45), if the inequality conditions in Theorem 4.2 hold, then the inequality conditions in LemmaA.5 also hold for some ˜ T ≤ T . Therefore, from Lemma A.5, Algorithm 1 terminates after at most ˜ T iterations,i.e., Theorem 4.2 holds. By using the same argument, Theorem 4.3 can also be proved. A.3. Proof of Lemma 3.1 and 3.2

First, we prove Lemma 3.1

Proof.

From GP properties, the posterior mean µ t − ( x , w | x ∗ , w ∗ , y ∗ ) and the posterior variance σ t − ( x , w | x ∗ , w ∗ )of f ( x , w ) after adding ( x ∗ , w ∗ , y ∗ ) can be written as follows (see, e.g., [29]): µ t − ( x , w | x ∗ , w ∗ , y ∗ ) = µ t − ( x , w ) − k t − (( x , w ) , ( x ∗ , w ∗ )) σ t − ( x ∗ , w ∗ ) + σ ( y ∗ − µ t − ( x ∗ , w ∗ )) ,σ t − ( x , w | x ∗ , w ∗ ) = σ t − ( x , w ) − k t − (( x , w ) , ( x ∗ , w ∗ )) σ t − ( x ∗ , w ∗ ) + σ . Thus, l t ( x , w | x ∗ , w ∗ , y ∗ ) is a linear function with respect to (w.r.t.) y ∗ . Hence, the indicator function1l[ l t ( x , w j | x ∗ , w ∗ , y ∗ ) > h ] is a piecewise constant function w.r.t. y ∗ , where the breakpoint is y ∗ = r j . Therefore,for any s ∈ { , . . . , | Ω | + 1 } , the following holds:(1l[ l t ( x , w | x ∗ , w ∗ , c ) > h ] , . . . , l t ( x , w | Ω | | x ∗ , w ∗ , c ) > h ]) (cid:62) = (1l[ l t ( x , w | x ∗ , w ∗ , c (cid:48) ) > h ] , . . . , l t ( x , w | Ω | | x ∗ , w ∗ , c (cid:48) ) > h ]) (cid:62) , ∀ c, c (cid:48) ∈ R s . This implies that l ( F ) t ( x ; 0 | x ∗ , w ∗ , c ) = l ( F ) t ( x ; 0 | x ∗ , w ∗ , c (cid:48) ) , ∀ c, c (cid:48) ∈ R s . Hence, using this we have E y ∗ [1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ]]= (cid:90) l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ] p ( y ∗ )d y ∗ = | Ω | +1 (cid:88) s =1 (cid:90) y ∗ ∈ R s l ( F ) t ( x ; 0 | x ∗ , w ∗ , y ∗ ) > α ] p ( y ∗ )d y ∗ = | Ω | +1 (cid:88) s =1 l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] (cid:90) y ∗ ∈ R s p ( y ∗ )d y ∗ = | Ω | +1 (cid:88) s =1 P ( y ∗ ∈ R s )1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] . Next, we prove Lemma 3.2.

Proof.

From the deﬁnition of l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ), l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) can be expressed as l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) = inf p ( w ) ∈A (cid:88) w ∈ Ω l t ( x , w | x ∗ , w ∗ , c s ) > h ] p ( w ) . Moreover, since p ∗ ( w ) ∈ A , the following holds:inf p ( w ) ∈A (cid:88) w ∈ Ω l t ( x , w | x ∗ , w ∗ , c s ) > h ] p ( w ) ≤ (cid:88) w ∈ Ω l t ( x , w | x ∗ , w ∗ , c s ) > h ] p ∗ ( w ) . L L L L h = 100, α = 0 . h = 100, α = 0 . h = 100, α = 0 . h = 100, α = 0 . σ = 10 − , σ f = 1300 , σ = 10 − , σ f = 1300 , σ = 10 − , σ f = 1300 , σ = 10 − , σ f = 1300 ,Booth L = 4, β / t = 2, (cid:15) = 0 . L = 4, β / t = 2, (cid:15) = 0 . L = 4, β / t = 2, (cid:15) = 0 . L = 4, β / t = 2, (cid:15) = 0 . L = − U = 10, L = − U = 10, L = − U = 10, L = − U = 10, L = − U = 10 L = − U = 10 L = − U = 10 L = − U = 10 h = 5, α = 0 . h = 5, α = 0 . h = 5, α = 0 . h = 5, α = 0 . σ = 10 − , σ f = 50 , σ = 10 − , σ f = 50 , σ = 10 − , σ f = 50 , σ = 10 − , σ f = 50 ,Matyas L = 4, β / t = 2, (cid:15) = 0 . L = 4, β / t = 2, (cid:15) = 0 . L = 4, β / t = 2, (cid:15) = 0 . L = 4, β / t = 2, (cid:15) = 0 . L = − U = 10, L = − U = 10, L = − U = 10, L = − U = 10, L = − U = 10 L = − U = 10 L = − U = 10 L = − U = 10 h = 1, α = 0 . h = 1, α = 0 . h = 2, α = 0 . h = 1, α = 0 . σ = 10 − , σ f = 20 , σ = 10 − , σ f = 20 , σ = 10 − , σ f = 20 , σ = 10 − , σ f = 20 ,McCormick L = 1, β / t = 3, (cid:15) = 0 . L = 1, β / t = 3, (cid:15) = 0 . L = 1, β / t = 3, (cid:15) = 0 . L = 1, β / t = 3, (cid:15) = 0 . L = − . U = 4, L = − . U = 4, L = − . U = 4, L = − . U = 4, L = − U = 4 L = − U = 4 L = − U = 4 L = − U = 4 h = − α = 0 . h = − α = 0 . h = − α = 0 . h = − α = 0 . σ = 10 − , σ f = 2000 , σ = 10 − , σ f = 2000 , σ = 10 − , σ f = 2000 , σ = 10 − , σ f = 2000 ,Styblinski-Tang L = 3, β / t = 2, (cid:15) = 0 . L = 3, β / t = 2, (cid:15) = 0 . L = 3, β / t = 2, (cid:15) = 0 . L = 3, β / t = 2, (cid:15) = 0 . L = − U = 10, L = − U = 10, L = − U = 10, L = − U = 10, L = − U = 10 L = − U = 10 L = − U = 10 L = − U = 10 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01

Booth Matyas McCormick Styblinski-TangFigure 4: Average F-score over 50 simulations with four benchmark functions when the distance function andreference distribution are L l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) ≤ (cid:88) w ∈ Ω l t ( x , w | x ∗ , w ∗ , c s ) > h ] p ∗ ( w ) . Hence, if the inequality assumption in Lemma 3.2 holds, then we get l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) ≤ α . This impliesthat 1l[ l ( F ) t ( x ; 0 | x ∗ , w ∗ , c s ) > α ] = 0. B. Additional experimentsB.1. Synthetic and real data experiments in the L -norm setting In this section, we performed the same experiment as in Subsection 5.1 and 5.3 under the setting that thedistance function is L L B.2. Computation time experiments in the other benchmark function setting

In this section, we performed the same experiment as in Subsection 5.2 for the Matyas, McCormick andStyblinski-Tang benchmark functions. We evaluated the computation time of (3.2) when we performed thesame experiment as in Subsection 5.2 using Proposed1 0 .

01 and Proposed2 0 .

01. Here, as for the parameter20

Booth Matyas McCormick Styblinski-TangFigure 5: Average F-score over 50 simulations with four benchmark functions when the distance function andreference distribution are L . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 . . . . . . iteration F − sc o r e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . RandomUSStraddle_fStraddle_randomStraddle_USMILEProposed1_0.1Proposed1_0.01Proposed2_0.1Proposed2_0.01 L L L L t , weevaluated the computation time to calculate (3.2) for all candidate points ( x ∗ , w ∗ ) ∈ X × Ω, and calculated theaverage computation time over 300 trials. From Tables 3, 4 and 5, it can be conﬁrmed that the same results asin Subsection 5.2 are obtained in the three benchmark function settings.

B.3. Hyperparameter sensitivity in the proposed acquisition function

In this section, we evaluated how the performance is aﬀected by the hyperparameter γ in the proposedacquisition function. We calculated the F-score for the cases with acquisition functions Proposed1 γ and Pro-posed2 γ when we performed the same experiment as in Subsection 5.1 for Booth, Matyas, McCormick andStyblinski-Tang functions. Here, Proposed1 γ and Proposed2 γ respectively represent the acquisition functions a (1) t ( x ∗ , w ∗ ) and a (2) t ( x ∗ , w ∗ ) with the parameter γ , and we considered γ as 0, 10 − . , 10 − , 10 − , 10 − and10 − . In this experiment, as for the parameter settings, we considered only the case of L γ = 0. The reason is that a t ( x ∗ , w ∗ ) was zero for all ( x ∗ , w ∗ ) ∈ X × Ω when the number of data was small.Furthermore, when γ >

0, it can be seen that the performance of Proposed1 γ decreases as γ increases. Onereason is that although a (1) t ( x ∗ , w ∗ ) is closer to uncertainty sampling (US) as γ becomes large, US is not theacquisition function for eﬃciently estimating H t . On the other hand, it can be conﬁrmed that the performanceof Proposed2 γ is not necessarily better when γ is smaller. From the deﬁnition of Proposed2 γ , when γ is large, a (2) t ( x ∗ , w ∗ ) behaves similarly to RMILE. RMILE is the acquisition function that works to eﬃciently identify21able 3: Computation time (second) for the Matyas function setting Naive L1 L2 L3 (10 − ) L3 (10 − ) L3 (10 − )Proposed1 0 .

01 112403 . ± .

33 6211 . ± .

06 1297 . ± .

31 32 . ± .

36 32 . ± .

18 33 . ± . .

01 98478 . ± .

68 5504 . ± .

62 1831 . ± .

59 32 . ± .

43 37 . ± .

58 38 . ± . Table 4: Computation time (second) for the McCormick function setting

Naive L1 L2 L3 (10 − ) L3 (10 − ) L3 (10 − )Proposed1 0 .

01 83608 . ± .

78 4692 . ± .

72 1094 . ± .

81 39 . ± .

27 41 . ± .

20 42 . ± . .

01 79782 . ± .

70 4383 . ± .

23 1525 . ± .

80 49 . ± .

33 56 . ± .

54 62 . ± . ( x , w ) that satisﬁes f ( x , w ) > h . However, since F ( x ) is given as the function of 1l[ f ( x , w ) > h ], as a result,RMILE also works to eﬃciently estimate H t . This is one of the reasons why Proposed2 γ sometimes has goodperformance even at large γ . 22able 5: Computation time (second) for the Styblinski-Tang function setting Naive L1 L2 L3 (10 − ) L3 (10 − ) L3 (10 − )Proposed1 0 .

01 118443 . ± .

13 6297 . ± .

76 900 . ± .

84 44 . ± .

66 47 . ± .

67 48 . ± . .

01 96731 . ± .

16 5240 . ± .

16 686 . ± .

10 26 . ± .

92 27 . ± .