Bayesian Optimization with Unknown Search Space
Huong Ha, Santu Rana, Sunil Gupta, Thanh Nguyen, Hung Tran-The, Svetha Venkatesh
Applied Artificial Intelligence Institute (A2I2), Deakin University, Geelong, Australia
{huong.ha, santu.rana, sunil.gupta, thanhnt, hung.tranthe, svetha.venkatesh}@deakin.edu.au

Abstract
Applying Bayesian optimization to problems wherein the search space is unknown is challenging. To address this problem, we propose a systematic volume expansion strategy for Bayesian optimization. We devise a strategy to guarantee that in iterative expansions of the search space, our method can find a point whose function value is within $\epsilon$ of the objective function maximum. Without the need to specify any parameters, our algorithm automatically triggers a minimal expansion required iteratively. We derive analytic expressions for when to trigger the expansion and by how much to expand. We also provide theoretical analysis to show that our method achieves $\epsilon$-accuracy after a finite number of iterations. We demonstrate our method on both benchmark test functions and machine learning hyper-parameter tuning tasks, and show that it outperforms baselines.

1 Introduction

Choosing where to search matters. A time-tested path in the quest for new products or processes is through experimental optimization. Bayesian optimization offers a sample-efficient strategy for experimental design by optimizing expensive black-box functions [9-11]. But one problem is that users need to specify a bounded region to restrict the search for the objective function extrema. When tackling a completely new problem, users do not have prior knowledge, hence there is no guarantee that an arbitrarily defined search space contains the global optimum. Thus the application of the Bayesian optimization framework when the search region is unknown remains an open challenge [16].

One approach is to use a regularized acquisition function such that its maximum can never be at infinity; hence no search space needs to be declared and an unconstrained optimizer can be used [16]. Other approaches use volume expansion, i.e. starting from the user-defined region, the search space is expanded during the optimization. The simplest strategy is to repeatedly double the volume of the search space every several iterations [16]. Nguyen et al. suggest a volume expansion strategy based on the evaluation budget [12]. All these methods require users to specify critical parameters, for example, regularization parameters [16], growth rate and expansion frequency (volume doubling) [16], or budget [12]. These parameters are difficult to specify in practice. Additionally, [12] is computationally expensive and requires the user-defined search space to be close to the global optimum.

In this paper, we propose a systematic volume expansion strategy for the Bayesian optimization framework wherein the search space is unknown. Without any prior knowledge about the objective function argmax or strict assumptions on the behavior of the objective function, it is impossible to guarantee global convergence when the search space is continuously expanded. To circumvent this problem, we consider the setting where we achieve the global $\epsilon$-accuracy condition, that is, we aim to find a point whose function value is within $\epsilon$ of the objective function global maximum.

Our volume expansion strategy is based on two guiding principles: 1) the algorithm can reach a point whose function value is within $\epsilon$ of the objective function maximum in one expansion, and, 2) the search space should be minimally expanded so that the algorithm does not spend unnecessary evaluations near the search space boundary. As the objective function is unknown, it is not possible to compute this ideal expansion region.
Using the GP-UCB acquisition function as a surrogate, this region is computed as one that contains at least one point whose acquisition function value is within $\epsilon$ of the acquisition function maximum. However, by using a surrogate to approximate the objective function, there is no guarantee that we can achieve the global $\epsilon$-accuracy within one expansion. Hence multiple expansions are required, and a new expansion is triggered when the local $\epsilon$-accuracy is satisfied, i.e. when the algorithm can find a point whose function value is within $\epsilon$ of the objective function maximum in the current search space. Analytical expressions for the size of the new expansion space and for when to trigger the expansion are derived. The guarantee for the $\epsilon$-accuracy condition, however, lapses in the expanded region, and so we adjust the acquisition function appropriately to maintain the guarantee. Finally, we provide theoretical analysis to show that our proposed method achieves the global $\epsilon$-accuracy condition after a finite number of iterations.

We demonstrate our algorithm on five synthetic benchmark functions and three real hyperparameter tuning tasks for common machine learning models: linear regression with elastic net, multilayer perceptron and convolutional neural network. Our experimental results show that our method achieves better function values with fewer samples compared to state-of-the-art approaches. In summary, our contributions are:

• Formalising the analysis of the Bayesian optimization framework in an unknown search space setting, and introducing $\epsilon$-accuracy as a way to track algorithmic performance;
• Providing analytic expressions for how far to expand the search space and when to expand the search space to achieve global $\epsilon$-accuracy;
• Deriving theoretical global $\epsilon$-accuracy convergence; and,
• Demonstrating our algorithm on both synthetic and real-world problems and comparing it against state-of-the-art methods.

Our method differs from previous works in that 1) our method does not require any algorithmic parameters, automatically adjusting both when to trigger the expansion and by how much to expand, and, 2) our approach is the only one to guarantee the global $\epsilon$-accuracy condition. This is because we guarantee the local $\epsilon$-accuracy condition in each search space, so that eventually the global $\epsilon$-accuracy is achieved. Without this local guarantee, the suggested solution cannot be guaranteed to reach global $\epsilon$-accuracy. The regularization [16] and filtering [12] methods require the global optimum to be within a bound constructed by either the user-specified regularizer or the budget. The volume doubling method [16] can continue to expand the search space to infinity; however, the local $\epsilon$-accuracy condition is not guaranteed in each search space.

The paper is organized as follows. Section 2 gives an overview of Bayesian optimization and discusses some of the related work. Section 3 describes the problem setup. Section 4 proposes our new expansion strategy for the Bayesian optimization framework when the search space is unknown. A theoretical analysis of our proposed method is presented in Section 5. In Section 6, we demonstrate the effectiveness of our algorithm by numerical experiments. Finally, Section 7 concludes the paper.

2 Background

2.1 Bayesian Optimization

Bayesian optimization is a powerful optimization method to find the global optimum of an unknown objective function $f(x)$ by sequential queries [9-11, 17, 18].
First, at time $t$, a surrogate model is used to approximate the behaviour of $f(x)$ using all the currently observed data $\mathcal{D}_{t-1} = \{(x_i, y_i)\}_{i=1}^{n}$, $y_i = f(x_i) + \xi_i$, where $\xi_i \sim \mathcal{N}(0, \sigma^2)$ is the noise. Second, an acquisition function is constructed from the surrogate model that suggests the next point $x_t$ to be evaluated. The objective function is then evaluated at $x_t$ and the new data point $(x_t, y_t)$ is added to $\mathcal{D}_{t-1}$. These steps are conducted in an iterative manner to obtain the best estimate of the global optimum.

The most common choice of surrogate model in Bayesian optimization is the Gaussian Process (GP) [14]. Assume the function $f$ follows a GP with mean function $m(x)$ and covariance function $k(x, x')$. The posterior distribution of $f$ given the observed data $\mathcal{D}_{t-1} = \{(x_i, y_i)\}_{i=1}^{n}$ is a GP with the following posterior mean and variance,

$$\mu_{t-1}(x) = m(x) + k_{|\mathcal{D}_{t-1}|}(x)^T (K_{|\mathcal{D}_{t-1}|} + \sigma^2 I_{|\mathcal{D}_{t-1}|})^{-1} y_{|\mathcal{D}_{t-1}|},$$
$$\sigma^2_{t-1}(x) = k(x, x) - k_{|\mathcal{D}_{t-1}|}(x)^T (K_{|\mathcal{D}_{t-1}|} + \sigma^2 I_{|\mathcal{D}_{t-1}|})^{-1} k_{|\mathcal{D}_{t-1}|}(x), \quad (1)$$

where $y_{|\mathcal{D}_{t-1}|} = [y_1, \ldots, y_{|\mathcal{D}_{t-1}|}]^T$, $k_{|\mathcal{D}_{t-1}|}(x) = [k(x, x_i)]_{i=1}^{|\mathcal{D}_{t-1}|}$, $K_{|\mathcal{D}_{t-1}|} = [k(x_i, x_j)]_{i,j}$, $I_{|\mathcal{D}_{t-1}|}$ is the $|\mathcal{D}_{t-1}| \times |\mathcal{D}_{t-1}|$ identity matrix, and $|\mathcal{D}_{t-1}|$ denotes the cardinality of $\mathcal{D}_{t-1}$. To aid readability, in the sequel we drop the notation showing the dependence of $k$, $K$, $I$, $y$ on $|\mathcal{D}_{t-1}|$.

There are many existing acquisition functions [6, 7, 10, 11, 20]; in this paper, we focus only on the GP-UCB acquisition function [1, 2, 5, 19]. The GP-UCB acquisition function is defined as,

$$\alpha^{UCB}(x; \mathcal{D}_{t-1}) = \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x), \quad (2)$$

where $\mu_{t-1}(x)$, $\sigma_{t-1}(x)$ are the posterior mean and standard deviation of the GP given observed data $\mathcal{D}_{t-1}$ and $\beta_t \geq 0$ is an appropriate parameter that balances exploration and exploitation. Given a search domain, $\{\beta_t\}$ can be chosen as in [19] to ensure global convergence in this domain.
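To make Eqs. (1)-(2) concrete, the following is a minimal numpy sketch of the zero-mean GP posterior and the GP-UCB acquisition function, assuming a Squared Exponential kernel; the function names and the noise default are illustrative, not taken from the authors' released code.

```python
import numpy as np

def se_kernel(a, b, theta=1.0, length=1.0):
    """Squared Exponential kernel k(a,b) = theta^2 exp(-||a-b||^2 / (2 l^2))."""
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return theta ** 2 * np.exp(-sq / (2.0 * length ** 2))

def gp_posterior(x, X, y, noise=1e-2):
    """Posterior mean/std of a zero-mean GP at query points x (Eq. (1))."""
    K = se_kernel(X, X) + noise ** 2 * np.eye(len(X))
    kx = se_kernel(X, x)                   # |D| x m cross-covariances
    alpha = np.linalg.solve(K, y)          # (K + sigma^2 I)^{-1} y
    mu = kx.T @ alpha
    v = np.linalg.solve(K, kx)             # (K + sigma^2 I)^{-1} k(x)
    var = se_kernel(x, x).diagonal() - np.sum(kx * v, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

def alpha_ucb(x, X, y, beta_t, noise=1e-2):
    """GP-UCB acquisition (Eq. (2)): mu(x) + sqrt(beta_t) * sigma(x)."""
    mu, sd = gp_posterior(x, X, y, noise)
    return mu + np.sqrt(beta_t) * sd
```

In practice the linear solves would use a single Cholesky factorization of $K + \sigma^2 I$, reused across queries, for numerical stability and speed.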
2.2 Related Work

All work related to the problem of Bayesian optimization with unknown search space has been discussed in Section 1. The work in [3] also introduces the term $\epsilon$-accuracy; however, its purpose there is to unify the Bayesian optimization and level-set estimation frameworks.

3 Problem Setup

We wish to find the global argmax $x_{max}$ of an unknown objective function $f: \mathbb{R}^d \mapsto \mathbb{R}$ whose argmax is at a finite location, i.e.

$$x_{max} = \operatorname{argmax}_{x \in S^*} f(x), \quad (3)$$

where $S^*$ is a finite region that contains the argmax of the function $f(x)$. In practice, the region $S^*$ is not known in advance, so users need to identify a search domain $S_{user}$ which is likely to contain the argmax of $f(x)$. This search domain can be set arbitrarily or based on limited prior knowledge, so there is no guarantee that $S_{user}$ contains the global optimum of the objective function. In the trivial cases when the search space $S^*$ is known or when $S^* \subset S_{user}$, global convergence can be guaranteed through classical analysis [4, 19]. Here, we consider the general case when $S^*$ may or may not be a subset of $S_{user}$. Without any prior knowledge about $S^*$ or strict assumptions on the behavior of the objective function, it is impossible to guarantee global convergence. Therefore, in this work, instead of solving Eq. (3), we consider the setting where we achieve the global $\epsilon$-accuracy condition. That is, for a small positive value $\epsilon$, we find a solution $x_\epsilon$ which satisfies,

$$f(x_{max}) - f(x_\epsilon) \leq \epsilon. \quad (4)$$

4 Proposed Method

We make some mild assumptions to develop our main results.
Assumption 4.1
The prior mean function $m(x) = 0$. This can be achieved by subtracting the mean from all observations and is common practice.
Assumption 4.2
The kernel $k(x, x')$ satisfies: (1) when $\|x - x'\| \to +\infty$, $k(x, x') \to 0$; (2) $k(x, x') \leq 1$ $\forall (x, x')$; (3) $k(x, x) = \theta^2$, where $\theta \geq 0$ is the scale factor of the kernel function. Various kernels satisfy Assumption 4.2, e.g. the Matérn kernel and the Squared Exponential kernel. As the function can always be re-scaled, condition (2) is met without loss of generality [15, 19].
Defining $g_k(\gamma)$: With these types of kernels, for all small positive $\gamma$, there always exists $g_k(\gamma) > 0$ such that,

$$\forall x, x': \|x - x'\| \geq g_k(\gamma) \implies k(x, x') \leq \gamma. \quad (5)$$

The value of $g_k(\gamma)$ can be computed from $\gamma$ and the kernel covariance function $k(x, x')$; e.g. for the Squared Exponential kernel $k_{SE}(x, x') = \theta^2 \exp(-\|x - x'\|^2/(2l^2))$, $g_k(\gamma) = \sqrt{2l^2 \log(\theta^2/\gamma)}$.
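For the Squared Exponential kernel, $g_k(\gamma)$ in Eq. (5) has the closed form above. A small sketch, assuming the reconstructed SE form and hypothetical defaults for $\theta$ and the lengthscale $l$:

```python
import numpy as np

def g_k_se(gamma, theta=1.0, length=1.0):
    """Distance beyond which the SE kernel drops below gamma (Eq. (5)).

    Solves theta^2 * exp(-r^2 / (2 l^2)) <= gamma for r, giving
    r = sqrt(2 l^2 log(theta^2 / gamma)).
    """
    assert 0.0 < gamma < theta ** 2, "g_k is defined for small positive gamma"
    return np.sqrt(2.0 * length ** 2 * np.log(theta ** 2 / gamma))
```

For example, with $\theta = l = 1$, `g_k_se(0.01)` is about 3.03: the SE kernel falls below 0.01 roughly three lengthscales away from an observation.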
The kernel $k(x, x')$ is known in advance or can be learned from the observations.

Figure 1: Examples of the three cases of the GP-UCB acquisition function argmax: (1) at infinity; (2) at a finite location and its function value is larger than or equal to $\sqrt{\beta_t}\,\theta + \epsilon/2$; or (3) at a finite location and its function value is smaller than $\sqrt{\beta_t}\,\theta + \epsilon/2$.

The ideal expansion strategy should satisfy two characteristics: 1) the algorithm can reach the global $\epsilon$-accuracy condition in one expansion, and, 2) the search space should be minimally expanded so that the algorithm does not spend unnecessary evaluations near the search space boundary. Since we have a black-box objective function, it is not possible to compute the ideal expansion space $S_{ideal}$ directly. Let the exploration-exploitation parameters $\{\beta_t\}$ be chosen to ensure the objective function is upper bounded by the GP-UCB acquisition function with high probability. Then we estimate $S_{ideal}$ by a region $S$ that is minimal while containing at least one point whose acquisition function value is within $\epsilon$ of the acquisition function maximum, i.e. $\exists x_u \in S : |\alpha^{UCB}(x_u; \mathcal{D}_{t-1}) - \max_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1})| \leq \epsilon$. Due to the approximation, there is no guarantee we can achieve the global $\epsilon$-accuracy in one expansion. Thus we need multiple sequential expansions. A new expansion is triggered when the local $\epsilon$-accuracy is satisfied in the previous expansion. In the following, we first derive the value of the GP-UCB acquisition function when $x \to \infty$ (Proposition 4.1), and then use this value to derive analytical expressions for the size of the expansion space $S$ (Theorem 4.1) and for when to trigger a new expansion.

Proposition 4.1
When $x \to \infty$, the GP-UCB acquisition function $\alpha^{UCB}(x; \mathcal{D}_{t-1}) \to \sqrt{\beta_t}\,\theta$, where $\beta_t$ is the exploration-exploitation parameter of the GP-UCB acquisition function and $\theta$ is the scale factor of the kernel function $k(x, x')$.
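Proposition 4.1 can be checked numerically with the GP sketch from Section 2: far from all observations, $k(x) \to 0$, so the posterior mean goes to 0 and the posterior standard deviation to $\theta$. A toy check, assuming the earlier `alpha_ucb` helper:

```python
import numpy as np

# Far from all observations, alpha_ucb -> sqrt(beta_t) * theta (Proposition 4.1).
X = np.random.uniform(-1.0, 1.0, size=(5, 2))   # toy observations
y = np.sin(X).sum(axis=1)
far = np.array([[100.0, 100.0]])                # point far from the data
beta_t, theta = 4.0, 1.0
val = alpha_ucb(far, X, y, beta_t)              # from the Section 2 sketch
print(val, np.sqrt(beta_t) * theta)             # both approximately 2.0
```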
Derivation of the expansion search space

Our idea is to choose the region $S$ such that $S = \mathbb{R}^d \setminus A$, where 1) $A$ contains all the points $x$ that are far from all the current observations, and, 2) $A := \{x \in \mathbb{R}^d : |\alpha^{UCB}(x; \mathcal{D}_{t-1}) - \sqrt{\beta_t}\,\theta| < \epsilon/2\}$. Here, we show that with this choice of $S$, there exists at least one point in $S$ whose acquisition function value is within $\epsilon$ of the acquisition function maximum, given $\epsilon < |\sqrt{\beta_t}\,\theta - \min_{x \in \mathbb{R}^d}(\alpha^{UCB}(x; \mathcal{D}_{t-1}))|$. We consider three cases that can happen to the GP-UCB acquisition function (see Figure 1):

• Case 1: The argmax of the GP-UCB acquisition function is at infinity. This means that the GP-UCB acquisition function maximum is equal to $\sqrt{\beta_t}\,\theta$. As the GP-UCB acquisition function is continuous and $\epsilon < |\sqrt{\beta_t}\,\theta - \min_{x \in \mathbb{R}^d}(\alpha^{UCB}(x; \mathcal{D}_{t-1}))|$, there exists a point $x_u$ such that $\alpha^{UCB}(x_u) = \sqrt{\beta_t}\,\theta - \epsilon/2$. By the definition of $S$, it is straightforward that $x_u$ belongs to $S$, thus proving that there exists a point in $S$ whose GP-UCB acquisition function value is within $\epsilon$ of the maximum of the acquisition function.

• Case 2: The argmax of the GP-UCB acquisition function $x'_{max}$ is at a finite location and its acquisition function value is larger than or equal to $\sqrt{\beta_t}\,\theta + \epsilon/2$. It is straightforward to see that the argmax $x'_{max}$ belongs to the region $S$ and this is the point that satisfies $|\alpha^{UCB}(x'_{max}; \mathcal{D}_{t-1}) - \max_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1})| \leq \epsilon$.

• Case 3: The GP-UCB acquisition function argmax is at a finite location and the acquisition function maximum is smaller than $\sqrt{\beta_t}\,\theta + \epsilon/2$. As the GP-UCB acquisition function is continuous and $\epsilon < |\sqrt{\beta_t}\,\theta - \min_{x \in \mathbb{R}^d}(\alpha^{UCB}(x; \mathcal{D}_{t-1}))|$, there exists a point $x_u \in S : \alpha^{UCB}(x_u; \mathcal{D}_{t-1}) = \sqrt{\beta_t}\,\theta - \epsilon/2$. As $\max_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1}) < \sqrt{\beta_t}\,\theta + \epsilon/2$, it follows directly that $|\alpha^{UCB}(x_u; \mathcal{D}_{t-1}) - \max_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1})| \leq \epsilon$.

Theorem 4.1 now formally derives an analytical expression for one way to define the region $S$.
Algorithm 1 Bayesian optimization with unknown search space (GPUCB-UBO)

Input: Gaussian Process (GP) $\mathcal{M}$, acquisition functions $\alpha^{UCB}$, $\alpha^{LCB}$, initial observations $\mathcal{D}_{init}$, initial search space $S_{user}$, function $f$, positive small threshold $\epsilon$, evaluation budget $T$.
Output: Point $x_\epsilon$: $\max f(x) - f(x_\epsilon) \leq \epsilon$.
Initialize $\mathcal{D}_0 = \mathcal{D}_{init}$, $S = S_{user}$, $\beta_1$, $t_k = 0$. Update the GP using $\mathcal{D}_0$.
for $t = 1, 2, \ldots, T$ do
  Set $t_{local} = t - t_k$
  Compute $x_m = \operatorname{argmax}_{x \in S} \alpha^{UCB}(x; \mathcal{D}_{t-1})$
  Set $x_t = x_m$, $y_t = f(x_t)$. Update $\mathcal{D}_t = \mathcal{D}_{t-1} \cup (x_t, y_t)$.
  /* Compute the expansion trigger, the regret upper bound */
  Compute $r_b = \alpha^{UCB}(x_t; \mathcal{D}_{t-1}) - \max_{x \in \mathcal{D}_t} \alpha^{LCB}(x; \mathcal{D}_{t-1}) + 1/t_{local}^2$
  /* If expansion triggered, expand the search space */
  if ($r_b \leq \epsilon$) or ($t = 1$) then
    Compute the new search space $S$ as defined in Theorem 4.1
    Set $t_k = t_k + t_{local}$
  end if
  /* Adjust $\beta_t$ based on the search space */
  Compute $\beta_t$ following Theorem 5.1
  Update the GP using $\mathcal{D}_t$.
end for

Theorem 4.1 Consider the GP-UCB acquisition function $\alpha^{UCB}(x; \mathcal{D}_{t-1})$. Let us define the region $S = \bigcup_{i=1}^{|\mathcal{D}_{t-1}|} S_i$, $S_i = \{x : \|x - x_i\| \leq d_\epsilon\}$, $x_i \in \mathcal{D}_{t-1}$, where $|\mathcal{D}_{t-1}|$ is the cardinality of $\mathcal{D}_{t-1}$ and

$$d_\epsilon = g_k\Big(\min\Big(\frac{1}{\sqrt{\beta_t}}\sqrt{\frac{0.4\sqrt{\beta_t}\,\theta\epsilon - 0.04\epsilon^2}{|\mathcal{D}_{t-1}|\,\lambda_{max}}},\ \frac{0.1\epsilon}{\max(\sum_{z_j \leq 0} -z_j,\ \sum_{z_j \geq 0} z_j)}\Big)\Big),$$

with $g_k(\cdot)$ as in Eq. (5), $\lambda_{max}$ the largest singular value of $(K + \sigma^2 I)^{-1}$, and $z_j$ the $j$th element of $(K + \sigma^2 I)^{-1} y$. Given $\epsilon < |\sqrt{\beta_t}\,\theta - \min_{x \in \mathbb{R}^d}(\alpha^{UCB}(x; \mathcal{D}_{t-1}))|$, there exists at least one point in $S$ whose acquisition function value is within $\epsilon$ of the acquisition function maximum, i.e. $\exists x_u \in S : |\alpha^{UCB}(x_u; \mathcal{D}_{t-1}) - \max_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1})| \leq \epsilon$.
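The radius $d_\epsilon$ of Theorem 4.1 is cheap to evaluate, since every quantity involved is already available from the fitted GP. A sketch, assuming the `g_k_se` helper above; the numeric constants follow the reconstruction of the garbled formula given here and in Appendix A.3, so treat them as indicative rather than definitive:

```python
import numpy as np

def expansion_radius(K, y, beta_t, eps, theta=1.0, noise=1e-2, g_k=None):
    """Ball radius d_eps of Theorem 4.1 (constants as reconstructed above).

    K: kernel matrix of the current observations; y: their outputs.
    Assumes 0.4 * sqrt(beta_t) * theta * eps - 0.04 * eps^2 > 0.
    """
    g_k = g_k or g_k_se
    A = np.linalg.inv(K + noise ** 2 * np.eye(len(K)))
    lam_max = np.linalg.svd(A, compute_uv=False).max()  # largest singular value
    z = A @ y
    zsum = max(-z[z <= 0].sum(), z[z >= 0].sum())
    g1 = np.sqrt((0.4 * np.sqrt(beta_t) * theta * eps - 0.04 * eps ** 2)
                 / (len(K) * lam_max)) / np.sqrt(beta_t)
    g2 = 0.1 * eps / zsum
    return g_k(min(g1, g2))
```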
Acquisition function adaptation

Let us denote by $S_k$ the $k$th expansion search space ($k \geq 1$). In each $S_k$, the parameter $\{\beta_t\}$ of the GP-UCB acquisition function needs to be valid to ensure the algorithm achieves the local $\epsilon$-accuracy condition. Hence, a new $\{\beta_t\}$ is adjusted after each expansion. Details on how to compute the new $\{\beta_t\}$ are given in Theorem 5.1.

Triggering the next expansion
To guarantee the global $\epsilon$-accuracy condition, in each search space $S_k$ we aim to find an iteration $T_k$ which satisfies $r_{S_k}(T_k) = (\max_{x \in S_k} f(x) - \max_{x_i \in \mathcal{D}_{T_k}} f(x_i)) \leq \epsilon$ before the next expansion. As we do not have access to $\max_{x \in S_k} f(x)$ and $\{f(x_i)\}$, we bound $r_{S_k}(t)$ by

$$r_{b,S_k}(t) = \max_{x \in S_k} \alpha^{UCB}(x; \mathcal{D}_{t-1}) + 1/t^2 - \max_{x \in \mathcal{D}_t} \alpha^{LCB}(x; \mathcal{D}_{t-1}),$$

where $\alpha^{LCB}(x; \mathcal{D}_{t-1}) = \mu_{t-1}(x) - \sqrt{\beta_t}\,\sigma_{t-1}(x)$. The next expansion is triggered when $r_{b,S_k}(t)$ reaches $\epsilon$.
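The trigger is a one-line computation once GP-UCB and GP-LCB values are available. A sketch, approximating the UCB maximum over $S_k$ by its maximum on a candidate set; helper names continue the earlier sketches:

```python
import numpy as np

def alpha_lcb(x, X, y, beta_t, noise=1e-2):
    """GP-LCB: mu(x) - sqrt(beta_t) * sigma(x)."""
    mu, sd = gp_posterior(x, X, y, noise)
    return mu - np.sqrt(beta_t) * sd

def regret_upper_bound(cand, X, y, beta_t, t_local, noise=1e-2):
    """r_b of Section 4: UCB max over S_k (approximated on candidates `cand`)
    plus the 1/t^2 slack, minus the best LCB over the observations."""
    ucb_max = alpha_ucb(cand, X, y, beta_t, noise).max()
    lcb_best = alpha_lcb(X, X, y, beta_t, noise).max()
    return ucb_max + 1.0 / t_local ** 2 - lcb_best
```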
Search space optimization

The theoretical search space developed in Theorem 4.1 is the union of $|\mathcal{D}_{t-1}|$ balls. To suit optimizer input, this region is converted to an encompassing hypercube using,

$$\min_{x_i \in \mathcal{D}_{t-1}}(x_i^k) - d_\epsilon \leq x^k \leq \max_{x_i \in \mathcal{D}_{t-1}}(x_i^k) + d_\epsilon, \quad k = 1, \ldots, d. \quad (6)$$

Further refinement of the implementation is provided in the supplementary material.

Algorithm 1 describes the proposed Bayesian optimization with unknown search space algorithm.
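Putting the pieces together, here is a runnable sketch of Algorithm 1 in the style of the helpers above. Random candidate search stands in for the inner acquisition optimizer, a simplified $\beta_t$ schedule stands in for Theorem 5.1, and a fixed radius `d_eps` stands in for the data-dependent $d_\epsilon$ of Theorem 4.1; none of these simplifications are part of the authors' method.

```python
import numpy as np

def gpucb_ubo(f, lo, hi, eps=0.05, T=50, d_eps=1.0, seed=0):
    """Runnable sketch of Algorithm 1 (GPUCB-UBO)."""
    rng = np.random.default_rng(seed)
    d = len(lo)
    X = rng.uniform(lo, hi, size=(3 * d, d))        # initial design
    y = np.array([f(x) for x in X])
    t_k = 0
    for t in range(1, T + 1):
        t_local = t - t_k
        beta_t = 2.0 * np.log(max(t_local, 2) ** 2 * np.pi ** 2)
        cand = rng.uniform(lo, hi, size=(512, d))   # crude acquisition optimizer
        acq = alpha_ucb(cand, X, y, beta_t)         # from the Section 2 sketch
        x_t = cand[np.argmax(acq)]
        X = np.vstack([X, x_t])
        y = np.append(y, f(x_t))
        # Expansion trigger: regret upper bound r_b of Section 4.
        r_b = acq.max() + 1.0 / t_local ** 2 - alpha_lcb(X, X, y, beta_t).max()
        if r_b <= eps or t == 1:
            # Hypercube relaxation of the Theorem 4.1 region (Eq. (6)).
            lo, hi = X.min(axis=0) - d_eps, X.max(axis=0) + d_eps
            t_k += t_local
    return X[np.argmax(y)]
```

For instance, `gpucb_ubo(lambda x: -np.sum((x - 3.0) ** 2), lo=np.zeros(2), hi=np.ones(2))` starts from a box that excludes the optimum at (3, 3) and relies on the expansions to reach it.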
5 Theoretical Analysis

First, to ensure the validity of our algorithm, we prove that for a wide range of kernels, for any search space $S_k$ and any positive $\epsilon$, with a proper choice of $\{\beta_t\}$, our expansion trigger condition occurs with high probability. When this happens, the algorithm achieves the local $\epsilon$-accuracy condition.

Proposition 5.1
For any $d$-dimensional domain $S_k$ with side length $r_k$, for the kernel classes: finite dimensional linear, Squared Exponential and Matérn, suppose the kernel $k(x, x')$ satisfies the following condition on the derivatives of GP sample paths $f$: $\exists a_k, b_k > 0$, $\Pr\{\sup_{x \in S_k} |\partial f/\partial x_j| > L\} \leq a_k \exp(-(L/b_k)^2)$, $j = 1, \ldots, d$. Pick $\delta \in (0, 1)$, and define $\beta_t = 2\log(t^2 2\pi^2/(3\delta)) + 2d\log(t^2 d b_k r_k \sqrt{\log(4da_k/\delta)})$. Then $\forall \epsilon > 0$, with probability larger than $1 - \delta$, there $\exists T_k: \forall t \geq T_k$, $\max_{x \in S_k} \alpha^{UCB}(x; \mathcal{D}_{t-1}) - \max_{x \in \mathcal{D}_t} \alpha^{LCB}(x; \mathcal{D}_{t-1}) \leq \epsilon - 1/t^2$; and for all $t$ that satisfy the previous condition, $\max_{x \in S_k} f(x) - \max_{x \in \mathcal{D}_t} f(x) \leq \epsilon$.

Second, we prove that with a proper choice of $\{\beta_t\}$ and for a wide class of kernels, after a finite number of iterations, our algorithm achieves the global $\epsilon$-accuracy condition with high probability.

Theorem 5.1
Denote by $\{S_k\}$ the series of expansion search spaces suggested by our algorithm ($k \geq 1$). In each $S_k$, let $T_k$ be the smallest number of iterations that satisfies our expansion trigger condition, i.e. $r_{b,S_k}(T_k) \leq \epsilon$. Suppose the kernel $k(x, x')$ belongs to the kernel classes listed in Proposition 5.1 and it satisfies the following condition on the derivatives of GP sample paths $f$: $\exists a_k, b_k > 0$, $\Pr\{\sup_{x \in S_k} |\partial f/\partial x_j| > L\} \leq a_k \exp(-(L/b_k)^2)$, $j = 1, \ldots, d$. Pick $\delta \in (0, 1)$, and define,

$$\beta_t = 2\log\big((t - \textstyle\sum_{j \leq k-1} T_j)^2 2\pi^2/(3\delta)\big) + 2d\log\big((t - \textstyle\sum_{j \leq k-1} T_j)^2 d b_k r_k \sqrt{\log(4da_k/\delta)}\big),$$

for $\sum_{j \leq k-1} T_j + 1 \leq t \leq \sum_{j \leq k} T_j$, $k = 1, 2, \ldots$. Then running the proposed algorithm with the above choice of $\beta_t$ for a sample $f$ of a GP with mean function zero and covariance function $k(x, x')$, after a finite number of iterations we achieve global $\epsilon$-accuracy with at least $1 - \delta$ probability, i.e.

$$\Pr\{f(x_{max}) - f(x_{suggest}) \leq \epsilon\} \geq 1 - \delta,$$

where $x_{suggest}$ is the algorithm recommendation and $x_{max}$ is the objective function global argmax.
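The schedule of Theorem 5.1 is the Srinivas et al. [19] schedule restarted at every expansion: the global iteration index $t$ is replaced by the local count inside the current search space, and $r_k$ grows with each expansion. A sketch with hypothetical kernel constants $a_k$, $b_k$:

```python
import numpy as np

def beta_schedule(t, T_hist, d, r_k, a_k=1.0, b_k=1.0, delta=0.1):
    """beta_t of Theorem 5.1, restarted after each expansion.

    T_hist: iteration counts T_1, ..., T_{k-1} of the finished search spaces;
    r_k: side length of the current search space S_k.
    """
    t_local = t - sum(T_hist)
    return (2.0 * np.log(t_local ** 2 * 2.0 * np.pi ** 2 / (3.0 * delta))
            + 2.0 * d * np.log(t_local ** 2 * d * b_k * r_k
                               * np.sqrt(np.log(4.0 * d * a_k / delta))))
```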
Discussion

The difference between our method and previous works is that we guarantee the local $\epsilon$-accuracy condition in each search space, eventually achieving the global $\epsilon$-accuracy. Previous methods do not give this guarantee, and thus their final solution may not reach global $\epsilon$-accuracy.

6 Experiments

We evaluate our method on five synthetic benchmark functions and three hyperparameter tuning tasks for common machine learning models. For problems with dimension $d$, the optimization evaluation budget is proportional to $d$ (excluding the initial points, also proportional to $d$, chosen by latin hypercube sampling [8]). The experiments were repeated 30 and 20 times for the synthetic functions and the machine learning hyperparameter tuning tasks respectively. For all algorithms, the Squared Exponential kernel is used, the GP models are fitted using the Maximum Likelihood method, and the output observations $\{y_i\}$ are normalized so that $y_i \sim \mathcal{N}(0, 1)$. As with previous GP-based algorithms that use confidence bounds [3, 19], our theoretical choice of $\{\beta_t\}$ in Theorem 5.1 is typically overly conservative; hence, following the suggestion in [19], for any algorithm that uses the GP-UCB acquisition function, we scale $\beta_t$ down by a constant factor. Finally, $\epsilon$ is set to a small value for the synthetic functions and to a smaller value for the machine learning models, as we require higher accuracy in these cases.

We compare our proposed method, GPUCB-UBO, with seven baselines: (1) EI-Vanilla: the vanilla Expected Improvement (EI); (2) EI-Volx2: the EI with the search space volume doubled every several iterations [16]; (3) EI-H: the Regularized EI with a hinge-quadratic prior mean where $\beta = 1$ and $R$ is the circumradius of the initial search space [16]; (4) EI-Q: the Regularized EI with a quadratic prior mean where the widths $w$ are set to those of the initial search space [16]; (5) GPUCB-Vanilla: the vanilla GP-UCB; (6) GPUCB-Volx2: the GP-UCB with the search space volume doubled every several iterations [16]; (7) GPUCB-FBO: the GP-UCB with the filtering expansion strategy in [12].
We visualize our theoretical expansion search spaces derived in Theorem 4.1 on the Beale test function (Figure 2). We show the contour plots of the GP-UCB acquisition function, together with the observations (red stars) and the recommendation from the algorithm that corresponds to the acquisition function maximum (cyan stars). The initial user-defined search space (black rectangle) is expanded as per the theoretical search spaces developed in Theorem 4.1 (yellow rectangles). Here we use Eq. (28) to plot the expansion search spaces; the spaces developed in Theorem 4.1 are, however, tighter. The figure illustrates that when the argmax of the objective function is outside the user-defined search space, with our search space expansion strategy this argmax can be located within a finite number of expansions.

Figure 2: Expansion search spaces using Theorem 4.1 for the Beale function in two cases when the global $\epsilon$-accuracy is achieved within (a) one expansion; or (b) two expansions. The black rectangle is the user-defined search space and the yellow rectangles are the theoretical expansion search spaces. The contour plots of the acquisition function are also displayed with observations (red stars) and the recommendation at that iteration (cyan star). The global optimum of the Beale function is the magenta star.

We compare our method with the seven baselines on five benchmark test functions: Beale, Eggholder, Levy 3, Hartman 3 and Hartman 6. We use the same experimental setup as in [16]. The side length of the initial user-defined search space is set to a small fraction of the side length of the function domain, and the center of this initial search space is placed randomly in the domain of the objective function. For each test function and algorithm, we run the experiment 30 times, and each time the initial search space is placed differently. We plot the mean and the standard error of the best found values $\max_{i=1,\ldots,n} f(x_i)$ for each test function. Figure 3 shows that for most test functions, our method GPUCB-UBO achieves better function values in fewer iterations than other methods. For most test functions, our method is better than the other six state-of-the-art approaches (except GPUCB-FBO) by a high margin. Compared with GPUCB-FBO, our method is better on the test functions Hartman3 and Hartman6 while performing similarly on the other three test functions. Note that the computation time of GPUCB-FBO is 2-3 times slower than our method and other approaches (see Table 1) because it needs an extra step to numerically solve several optimization problems to construct the new search space. Since we derive the expansion search spaces analytically, our method, in contrast, can optimize the acquisition function within these spaces without any additional computation.

Figure 3: Best found values of various synthetic benchmark test functions using different algorithms. Plotting mean and standard error over 30 repetitions. (Best seen in color)

Table 1: The average runtime (seconds) of selecting the next input for different methods on the Beale, Eggholder, Hartman3, Levy3 and Hartman6 functions. All time measurements were taken when evaluating the methods on a Ubuntu 18.04.2 server with an Intel Xeon CPU E5-2670 2.60GHz and 128GB RAM. All source code is written in Python 3.6. (Table entries omitted.)

Next we apply our method to hyperparameter tuning of three machine learning models on the MNIST dataset: elastic net, multilayer perceptron and convolutional neural network. For each model, the experiments are repeated 20 times and each time the initial search space is placed differently.
Elastic Net
Elastic net is a regularized regression method that utilizes the $L_1$ and $L_2$ regularizers. In this model, the hyperparameter $\alpha > 0$ determines the magnitude of the penalty and the hyperparameter $l$ ($0 \leq l \leq 1$) balances between the $L_1$ and $L_2$ regularizers. We tune $\alpha$ in the normal space while $l$ is tuned in an exponent space (base 10). The initial search space of $\alpha$ and $l$ is randomly placed in their domain, with side length a small fraction of the domain side length. We implement the elastic net model using the function SGDClassifier in the scikit-learn package [13].
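The tuning task reduces to a black-box function mapping hyperparameters to test accuracy. A sketch of such an objective using scikit-learn's SGDClassifier with an elastic-net penalty; the small digits dataset stands in for MNIST here, and the parameterization ($\alpha$ in normal space, $l = 10^e$) follows the text:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_digits          # stand-in for MNIST
from sklearn.model_selection import train_test_split

digits = load_digits()
Xtr, Xte, ytr, yte = train_test_split(digits.data, digits.target, random_state=0)

def elastic_net_accuracy(params):
    """Black-box objective for BO: params = (alpha, log10 of l1_ratio)."""
    alpha, log_l = params
    clf = SGDClassifier(loss="log_loss",          # use loss="log" on old sklearn
                        penalty="elasticnet",
                        alpha=alpha, l1_ratio=10.0 ** log_l, random_state=0)
    clf.fit(Xtr, ytr)
    return clf.score(Xte, yte)
```

An optimizer such as Algorithm 1 would then maximize `elastic_net_accuracy` over the $(\alpha, e)$ box.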
Multilayer Perceptron (MLP)
We construct a 2-layer MLP with 512 neurons per layer. We optimize three hyperparameters: the learning rate $l$ and the $L_2$ norm regularization hyperparameters $l_{r1}$ and $l_{r2}$ of the two layers. All the hyperparameters are tuned in the exponent space (base 10). The initial search space is a randomly placed unit cube within this exponent-space domain. The model is implemented using tensorflow, and is trained with the Adam optimizer for 20 epochs with batch size 128.
We consider a CNN with two convolutional layers. The CNN architecture (e.g. the number of filters, the filter shape, etc.) is chosen as the standard architecture published on the official GitHub repository of tensorflow (https://github.com/tensorflow/tensorflow). We optimize three hyperparameters: the learning rate $l$ and the dropout rates $r_{d1}$, $r_{d2}$ in the pooling layers 1 and 2. We tune $r_{d1}$, $r_{d2}$ in the normal space while $l$ is tuned in an exponent space (base 10). The initial search space of $r_{d1}$, $r_{d2}$, $l$ is randomly placed in their domain, with side length a small fraction of the domain side length. The network is trained with the Adam optimizer for 20 epochs with batch size 128.

Figure 4: Prediction accuracy of different machine learning models on the MNIST dataset using different algorithms. Mean and standard error over 20 repetitions are shown. (Best seen in color)

Given a set of hyperparameters, we train the models with this hyperparameter setting on the MNIST training dataset (60,000 patterns) and then test the model on the MNIST test dataset (10,000 patterns). The Bayesian optimization method then suggests a new hyperparameter setting based on the resulting prediction accuracy, and this process is repeated until the evaluation budget is depleted. We plot the prediction accuracy in Figure 4. For the elastic net model, our method GPUCB-UBO performs similarly to GPUCB-FBO while outperforming the other six approaches significantly. For the MLP model, GPUCB-UBO performs far better than other approaches: after only a small number of iterations, it achieves a prediction accuracy that other approaches need several times more iterations to reach. For the CNN model, GPUCB-UBO also outperforms other approaches by a high margin, providing a model with high prediction accuracy after 30 iterations.

7 Conclusion

We propose a novel Bayesian optimization framework for the case when the search space is unknown. We guarantee that in iterative expansions of the search space, our method can find a point whose function value is within $\epsilon$ of the objective function maximum. Without the need to specify any parameters, our algorithm automatically triggers a minimal expansion required iteratively. We demonstrate our method on both synthetic benchmark functions and machine learning hyper-parameter tuning tasks, and show that it outperforms state-of-the-art approaches. Our source code is publicly available at https://github.com/HuongHa12/BO_unknown_searchspace.

Acknowledgments
This research was partially funded by the Australian Government through the Australian Research Council (ARC). Prof. Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).
References

[1] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397-422, 2003.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.
[3] I. Bogunovic, J. Scarlett, A. Krause, and V. Cevher. Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), pages 1515-1523, USA, 2016.
[4] A.D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879-2904, 2011.
[5] V. Dani, T.P. Hayes, and S.M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.
[6] P. Hennig and C.J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(1):1809-1837, 2012.
[7] J.M. Hernández-Lobato, M.W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems (NIPS), pages 918-926, 2014.
[8] D.R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345-383, 2001.
[9] D.R. Jones, M. Schonlau, and W.J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455-492, December 1998.
[10] H.J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97-106, 1964.
[11] J. Močkus, V. Tiesis, and A. Žilinskas. The application of Bayesian methods for seeking the extremum, volume 2 of Toward Global Optimization. Elsevier, 1978.
[12] V. Nguyen, S. Gupta, S. Rana, C. Li, and S. Venkatesh. Bayesian optimization in weakly specified search space. In IEEE International Conference on Data Mining (ICDM), pages 347-356, 2017.
[13] F. Pedregosa, G. Varoquaux, et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825-2830, 2011.
[14] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[15] J. Scarlett. Tight regret bounds for Bayesian optimization in one dimension. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 4500-4508, Stockholmsmässan, Stockholm, Sweden, 2018.
[16] B. Shahriari, A. Bouchard-Côté, and N. de Freitas. Unbounded Bayesian optimization via regularization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 51, pages 1168-1176, 2016.
[17] B. Shahriari, K. Swersky, Z. Wang, R.P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148-175, 2016.
[18] J. Snoek, H. Larochelle, and R.P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS), pages 2951-2959, USA, 2012.
[19] N. Srinivas, A. Krause, S.M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 1015-1022, 2010.
[20] Z. Wang and S. Jegelka. Max-value entropy search for efficient Bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3627-3635, 2017.
Appendices
A Proof of All the Propositions and Theorems
A.1 Background

Gaussian Process

A Gaussian Process (GP) is a distribution over functions, which is completely specified by its mean function and covariance function [14]. Assume the function $f$ follows a GP with mean function $m(x)$ and covariance function $k(x, x')$. Given the observed data $\mathcal{D}_{t-1} = \{(x_i, y_i)\}_{i=1}^{n}$, $y_i = f(x_i) + \xi_i$, where $\xi_i \sim \mathcal{N}(0, \sigma^2)$ is the noise, the posterior distribution of $f$ is a GP with the following posterior mean and variance,

$$\mu_{t-1}(x) = m(x) + k_{|\mathcal{D}_{t-1}|}(x)^T (K_{|\mathcal{D}_{t-1}|} + \sigma^2 I_{|\mathcal{D}_{t-1}|})^{-1} y_{|\mathcal{D}_{t-1}|},$$
$$\sigma^2_{t-1}(x) = k(x, x) - k_{|\mathcal{D}_{t-1}|}(x)^T (K_{|\mathcal{D}_{t-1}|} + \sigma^2 I_{|\mathcal{D}_{t-1}|})^{-1} k_{|\mathcal{D}_{t-1}|}(x), \quad (7)$$

where $y_{|\mathcal{D}_{t-1}|} = [y_1, \ldots, y_{|\mathcal{D}_{t-1}|}]^T$, $k_{|\mathcal{D}_{t-1}|}(x) = [k(x, x_i)]_{i=1}^{|\mathcal{D}_{t-1}|}$, $K_{|\mathcal{D}_{t-1}|} = [k(x_i, x_j)]_{i,j}$, $I_{|\mathcal{D}_{t-1}|}$ is the $|\mathcal{D}_{t-1}| \times |\mathcal{D}_{t-1}|$ identity matrix, and $|\mathcal{D}_{t-1}|$ denotes the cardinality of $\mathcal{D}_{t-1}$. To aid readability, in the sequel we drop the notation showing the dependence of $k$, $K$, $I$, $y$ on $|\mathcal{D}_{t-1}|$.
The GP-UCB acquisition function is defined as [1, 2, 5, 19], α UCB ( x ; D t − ) = µ t − ( x ) + (cid:112) β t σ t − ( x ) , (8)where µ t − ( x ) , σ t − ( x ) are the posterior mean and standard deviation of the GP given observed data D t − and β t ≥ is an appropriate parameter that balances the exploration and exploitation. Given asearch domain, { β t } can be chosen following the suggestion in [19] to ensure global convergence inthis domain. Maximum Information Gain
For any search space $S$, define the maximum mutual information $\gamma_{T,S}$ [19]:

$$\gamma_{T,S} := \max_{A \subset S, |A| = T} \tfrac{1}{2} \log\det(I_T + \sigma^{-2} K_T), \quad (9)$$

where $K_T = [k(x, x')]_{x, x' \in A}$ and $I_T$ is the identity matrix of size $T \times T$.

Defining $g_k(\gamma)$ for kernel $k(\cdot)$: With the types of kernels satisfying Assumption 4.2 in the paper, for all small positive $\gamma$, there always exists $g_k(\gamma) > 0$ such that,

$$\forall x, x': \|x - x'\| \geq g_k(\gamma) \implies k(x, x') \leq \gamma. \quad (10)$$

The value of $g_k(\gamma)$ can be computed from $\gamma$ and the kernel covariance function $k(x, x')$; e.g. for the Squared Exponential kernel $k_{SE}(x, x') = \theta^2 \exp(-\|x - x'\|^2/(2l^2))$, $g_k(\gamma) = \sqrt{2l^2 \log(\theta^2/\gamma)}$.
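The quantity inside Eq. (9) is straightforward to compute for a fixed subset $A$; the maximization over subsets is what makes $\gamma_{T,S}$ hard, and it is typically approximated greedily since the information gain is submodular [19]. A minimal sketch:

```python
import numpy as np

def information_gain(K_A, noise=1e-2):
    """Mutual information 0.5 * log det(I + sigma^-2 K_A) for a fixed subset A
    (the quantity maximized in Eq. (9))."""
    T = len(K_A)
    sign, logdet = np.linalg.slogdet(np.eye(T) + K_A / noise ** 2)
    return 0.5 * logdet
```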
A.2 Proof of Proposition 4.1

Substituting Eq. (7) into Eq. (8) and combining with Assumption 4.1 in the paper, we have,

$$\alpha^{UCB}(x; \mathcal{D}_{t-1}) = k(x)^T (K + \sigma^2 I)^{-1} y + \sqrt{\beta_t}\sqrt{k(x, x) - k(x)^T (K + \sigma^2 I)^{-1} k(x)}. \quad (11)$$

Since $k(x, x')$ is a kernel covariance function that satisfies Assumption 4.2 in the paper,

$$k(x) \xrightarrow{x \to \infty} 0, \quad k(x, x) = \theta^2. \quad (12)$$

Combining Eq. (11) and Eq. (12), the proposition is proved, i.e.,

$$\alpha^{UCB}(x; \mathcal{D}_{t-1}) \xrightarrow{x \to \infty} \sqrt{\beta_t}\,\theta. \quad ∎$$

A.3 Proof of Theorem 4.1
Firstly, we prove that with the choice of $S = \mathbb{R}^d \setminus A$, where 1) $A$ contains all the points $x$ that are far from all the current observations, and, 2) $A := \{x \in \mathbb{R}^d : |\alpha^{UCB}(x; \mathcal{D}_{t-1}) - \sqrt{\beta_t}\,\theta| < \epsilon/2\}$, there exists a point in $S$ whose GP-UCB acquisition function value is within $\epsilon$ of the maximum of the acquisition function, i.e. $\exists x_u \in S : |\alpha^{UCB}(x_u; \mathcal{D}_{t-1}) - \max_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1})| \leq \epsilon$. With the choice of $\epsilon < |\sqrt{\beta_t}\,\theta - \min_{x \in \mathbb{R}^d}(\alpha^{UCB}(x; \mathcal{D}_{t-1}))|$, let us consider three cases:

• Case 1: The argmax of the GP-UCB acquisition function is at infinity. This means that the GP-UCB acquisition function maximum is equal to $\sqrt{\beta_t}\,\theta$. As the GP-UCB acquisition function is continuous and $\epsilon < |\sqrt{\beta_t}\,\theta - \min_{x \in \mathbb{R}^d}(\alpha^{UCB}(x; \mathcal{D}_{t-1}))|$, there exists a point $x_u$ such that $\alpha^{UCB}(x_u) = \sqrt{\beta_t}\,\theta - \epsilon/2$. By the definition of $S$, it is straightforward that $x_u$ belongs to $S$, thus proving that there exists a point in $S$ whose GP-UCB acquisition function value is within $\epsilon$ of the maximum of the acquisition function.

• Case 2: The argmax of the GP-UCB acquisition function $x'_{max}$ is at a finite location and its acquisition function value is larger than or equal to $\sqrt{\beta_t}\,\theta + \epsilon/2$. It is straightforward to see that the argmax $x'_{max}$ belongs to the region $S$ and this is the point that satisfies $|\alpha^{UCB}(x'_{max}; \mathcal{D}_{t-1}) - \max_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1})| \leq \epsilon$.

• Case 3: The GP-UCB acquisition function argmax is at a finite location and the acquisition function maximum is smaller than $\sqrt{\beta_t}\,\theta + \epsilon/2$. As the GP-UCB acquisition function is continuous and $\epsilon < |\sqrt{\beta_t}\,\theta - \min_{x \in \mathbb{R}^d}(\alpha^{UCB}(x; \mathcal{D}_{t-1}))|$, there exists a point $x_u \in S : \alpha^{UCB}(x_u; \mathcal{D}_{t-1}) = \sqrt{\beta_t}\,\theta - \epsilon/2$. As $\max_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1}) < \sqrt{\beta_t}\,\theta + \epsilon/2$, it follows directly that $|\alpha^{UCB}(x_u; \mathcal{D}_{t-1}) - \max_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1})| \leq \epsilon$.

Secondly, we give an analytical expression for one way to define the region $A$. We prove that $A$ can be chosen as $\{x : \forall x_i \in \mathcal{D}_{t-1}: \|x - x_i\| \geq g_k(\gamma)\}$, where,

$$\gamma = \min\Big(\frac{1}{\sqrt{\beta_t}}\sqrt{\frac{0.4\sqrt{\beta_t}\,\theta\epsilon - 0.04\epsilon^2}{|\mathcal{D}_{t-1}|\,\lambda_{max}}},\ \frac{0.1\epsilon}{\max(\sum_{z_j \leq 0} -z_j,\ \sum_{z_j \geq 0} z_j)}\Big),$$

$g_k(\cdot)$ is defined as in Eq. (10), $\lambda_{max} \in \mathbb{R}^+$ is the largest singular value of $(K + \sigma^2 I)^{-1}$, $z_j$ is the $j$th element of the vector $(K + \sigma^2 I)^{-1} y$, and $\epsilon < |\sqrt{\beta_t}\,\theta - \min_{x \in \mathbb{R}^d}(\alpha^{UCB}(x; \mathcal{D}_{t-1}))|$. With this choice of $A$, the region $S$ can be computed as $S = \bigcup_{i=1}^{|\mathcal{D}_{t-1}|} S_i$, $S_i = \{x : \|x - x_i\| \leq g_k(\gamma)\}$, $x_i \in \mathcal{D}_{t-1}$.

Consider the GP-UCB acquisition function,

$$\alpha^{UCB}(x; \mathcal{D}_{t-1}) = \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x) = k(x)^T (K + \sigma^2 I)^{-1} y + \sqrt{\beta_t}\sqrt{\theta^2 - k(x)^T (K + \sigma^2 I)^{-1} k(x)}.$$

The error between $\alpha^{UCB}(x; \mathcal{D}_{t-1})$ and $\sqrt{\beta_t}\,\theta$ can be computed as,

$$\sqrt{\beta_t}\,\theta - \alpha^{UCB}(x; \mathcal{D}_{t-1}) = -k(x)^T (K + \sigma^2 I)^{-1} y + \frac{\sqrt{\beta_t}\, k(x)^T (K + \sigma^2 I)^{-1} k(x)}{\sqrt{\theta^2 - k(x)^T (K + \sigma^2 I)^{-1} k(x)} + \theta}. \quad (13)$$

First, we consider the first term in Eq. (13). It is easy to see that,

$$\sum_{z_i \leq 0} z_i k(x, x_i) \leq k(x)^T (K + \sigma^2 I)^{-1} y \leq \sum_{z_i > 0} z_i k(x, x_i),$$

where $z_i$ is the $i$th element of the vector $(K + \sigma^2 I)^{-1} y$. Therefore,

$$|k(x)^T (K + \sigma^2 I)^{-1} y| \leq \max\Big(\Big|\sum_{z_i \leq 0} z_i k(x, x_i)\Big|, \Big|\sum_{z_i \geq 0} z_i k(x, x_i)\Big|\Big) \leq \max\Big(\sum_{z_i \leq 0} -z_i k(x, x_i), \sum_{z_i \geq 0} z_i k(x, x_i)\Big). \quad (14)$$

For all small positive $\gamma$, for all $x$ such that $\|x - x_i\| \geq g_k(\gamma)$, $\forall x_i \in \mathcal{D}_{t-1}$, with $g_k(\gamma)$ defined in Eq. (10), we have $k(x, x_i) \leq \gamma$, $\forall x_i \in \mathcal{D}_{t-1}$. Combining this with Eq. (14), the first term in Eq. (13) is bounded as,

$$|k(x)^T (K + \sigma^2 I)^{-1} y| \leq \max\Big(\sum_{z_i \leq 0} -z_i, \sum_{z_i \geq 0} z_i\Big)\gamma. \quad (15)$$

Second, consider the second term in Eq. (13). Since $K$ is a covariance matrix, it is positive semidefinite; hence $(K + \sigma^2 I)$ and $(K + \sigma^2 I)^{-1}$ are also positive semidefinite symmetric matrices. Using the singular value decomposition (SVD), we have $(K + \sigma^2 I)^{-1} = U D U^T$, where $D$ is a diagonal matrix and $U$ is a unitary matrix. Denoting by $\lambda_{max}$ ($\lambda_{max} \in \mathbb{R}^+$) the maximum entry on the diagonal of $D$ ($\lambda_{max}$ is also the largest singular value of the matrix $(K + \sigma^2 I)^{-1}$), we then have,

$$k(x)^T ((K + \sigma^2 I)^{-1} - \lambda_{max} I) k(x) = k(x)^T (U D U^T - U \lambda_{max} I U^T) k(x) = (U^T k(x))^T (D - \lambda_{max} I)\, U^T k(x).$$

Since $(D - \lambda_{max} I)$ is a negative semidefinite matrix, $(U^T k(x))^T (D - \lambda_{max} I)\, U^T k(x) \leq 0$. Therefore,

$$k(x)^T (K + \sigma^2 I)^{-1} k(x) \leq \lambda_{max}\, k(x)^T k(x) \leq \lambda_{max} \sum_{i=1}^{|\mathcal{D}_{t-1}|} k^2(x, x_i). \quad (16)$$

For all small positive $\gamma$, for all $x$ such that $\|x - x_i\| \geq g_k(\gamma)$, $\forall x_i \in \mathcal{D}_{t-1}$, we have $k(x, x_i) \leq \gamma$, $\forall x_i \in \mathcal{D}_{t-1}$. Combining this with Eq. (16), we now have,

$$0 \leq k(x)^T (K + \sigma^2 I)^{-1} k(x) \leq n\lambda_{max}\gamma^2, \quad (17)$$

where $n$ denotes $|\mathcal{D}_{t-1}|$, i.e. the cardinality of $\mathcal{D}_{t-1}$. Consider the function $f_0(z) = \frac{\sqrt{\beta_t}\, z}{\sqrt{\theta^2 - z} + \theta}$, $0 \leq z \leq \theta^2$. It is easy to see that $f_0(z)$ is monotonically increasing on $[0, \theta^2]$. Hence, with $k(x)^T (K + \sigma^2 I)^{-1} k(x)$ in the range $[0, n\lambda_{max}\gamma^2]$,

$$0 \leq \frac{\sqrt{\beta_t}\, k(x)^T (K + \sigma^2 I)^{-1} k(x)}{\sqrt{\theta^2 - k(x)^T (K + \sigma^2 I)^{-1} k(x)} + \theta} \leq \frac{\sqrt{\beta_t}\, n\lambda_{max}\gamma^2}{\sqrt{\theta^2 - n\lambda_{max}\gamma^2} + \theta}, \quad (18)$$

where $\gamma \leq \theta/\sqrt{n\lambda_{max}}$. From Eqs. (13), (15) and (18), we have that $\forall \gamma > 0$, for all $x$ such that $\|x - x_i\| \geq g_k(\gamma)$, $i = 1, \ldots, n$, the following inequality is satisfied,

$$|\sqrt{\beta_t}\,\theta - \alpha^{UCB}(x; \mathcal{D}_{t-1})| \leq \max\Big(\sum_{z_i \leq 0} -z_i, \sum_{z_i \geq 0} z_i\Big)\gamma + \frac{\sqrt{\beta_t}\, n\lambda_{max}\gamma^2}{\sqrt{\theta^2 - n\lambda_{max}\gamma^2} + \theta}. \quad (19)$$

With the choice of

$$\gamma = \min\Big(\frac{0.1\epsilon}{\max(\sum_{z_i \leq 0} -z_i, \sum_{z_i \geq 0} z_i)},\ \frac{1}{\sqrt{\beta_t}}\sqrt{\frac{0.4\sqrt{\beta_t}\,\theta\epsilon - 0.04\epsilon^2}{n\lambda_{max}}}\Big), \quad 0 < \epsilon < \sqrt{\beta_t}\,\theta,$$

the first term of Eq. (19) is at most $0.1\epsilon$, while the second term is strictly less than $0.4\epsilon$ (its numerator equals $0.4\theta\epsilon - 0.04\epsilon^2/\sqrt{\beta_t} < 0.4\epsilon\theta$ and its denominator is at least $\theta$). It then follows from Eq. (19) that $\forall x : \|x - x_i\| \geq g_k(\gamma)$, $i = 1, \ldots, n$,

$$|\sqrt{\beta_t}\,\theta - \alpha^{UCB}(x; \mathcal{D}_{t-1})| < 0.1\epsilon + 0.4\epsilon = \epsilon/2. \quad ∎ \quad (20)$$

Remark A.1
Note that if $|\sqrt{\beta_t}\,\theta - \min_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1})| = 0$, then for $\epsilon < |\sqrt{\beta_t}\,\theta - \max_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1})|$ or $\epsilon < |\sqrt{\beta_t}\,\theta - \max_{x \in \mathcal{D}_{t-1}} \alpha^{UCB}(x; \mathcal{D}_{t-1})|$, the bound in Theorem 4.1 remains valid. In this case, the GP-UCB argmax is at a finite location and its acquisition function value is $> \sqrt{\beta_t}\,\theta$, so our arguments in Case 2 hold. However, it is worth noting that the scenario $|\sqrt{\beta_t}\,\theta - \min_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1})| = 0$ is very rare in practice. With Assumption 4.1, most of the time $\min_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1}) \leq 0$, hence $|\sqrt{\beta_t}\,\theta - \min_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1})| > 0$. Only when the noise is large is there a very small chance that $|\sqrt{\beta_t}\,\theta - \min_{x \in \mathbb{R}^d} \alpha^{UCB}(x; \mathcal{D}_{t-1})| = 0$ can happen.

A.4 Proof of Proposition 5.1
Following Lemma 5.7 in [19], for any $d$-dimensional domain $S_k$ with side length $r_k$, suppose the kernel $k(x, x')$ satisfies the following condition on the derivatives of GP sample paths $f$: $\exists a_k, b_k > 0$, $\Pr\{\sup_{x \in S_k} |\partial f/\partial x_j| > L\} \leq a_k \exp(-(L/b_k)^2)$, $j = 1, \ldots, d$. Pick $\delta \in (0, 1)$, and define $\beta_t = 2\log(t^2 2\pi^2/(3\delta)) + 2d\log(t^2 d b_k r_k \sqrt{\log(4da_k/\delta)})$. Then $\forall \epsilon > 0$, with probability larger than $1 - \delta$, we have $\max_{x \in S_k} \alpha^{UCB}(x; \mathcal{D}_{t-1}) \leq \mu_{t-1}(x_t) + \beta_t^{1/2}\sigma_{t-1}(x_t) + 1/t^2$, $\forall t \geq 1$, where $x_t$ is the suggestion of the GP-UCB algorithm at iteration $t$. Therefore,

$$\max_{x \in S_k} f(x) - \max_{x \in \mathcal{D}_t} f(x) \leq \mu_{t-1}(x_t) + \beta_t^{1/2}\sigma_{t-1}(x_t) + 1/t^2 - \max_{x \in \mathcal{D}_t} f(x)$$
$$\leq \mu_{t-1}(x_t) + \beta_t^{1/2}\sigma_{t-1}(x_t) + 1/t^2 - \max_{x \in \mathcal{D}_t} \alpha^{LCB}(x; \mathcal{D}_{t-1})$$
$$\leq \mu_{t-1}(x_t) + \beta_t^{1/2}\sigma_{t-1}(x_t) + 1/t^2 - \alpha^{LCB}(x_t; \mathcal{D}_{t-1})$$
$$\leq 2\beta_t^{1/2}\sigma_{t-1}(x_t) + 1/t^2. \quad (21)$$

Following Lemma 5.4 in [19], we have that,

$$\sum_{t=1}^{T} \beta_t \sigma^2_{t-1}(x_t) \leq C_1 \beta_T \gamma_T, \quad (22)$$

with $C_1 = 8/\log(1 + \sigma^{-2})$ and $\gamma_T$ the maximum information gain in the search space $S_k$ (computable using Eq. (9)).

Assume $\beta_t^{1/2}\sigma_{t-1}(x_t)$ does not converge to 0 when $t \to \infty$. Then there exists $T_0$ such that $\forall t \geq T_0$, $\beta_t^{1/2}\sigma_{t-1}(x_t) \geq m_0$ with $m_0$ a positive constant. Then, $\forall T \geq T_0$,

$$\sum_{t=T_0}^{T} \beta_t \sigma^2_{t-1}(x_t) \geq m_0^2 (T - T_0). \quad (23)$$

This means that, $\forall T \geq T_0$,

$$C_1 \beta_T \gamma_T \geq \sum_{t=T_0}^{T} \beta_t \sigma^2_{t-1}(x_t) \geq m_0^2 (T - T_0). \quad (24)$$

For the kernel classes: finite dimensional linear, Squared Exponential and Matérn, and assuming the kernel satisfies $k(x, x') \leq 1$ (condition 2 of Assumption 4.2 in the paper), $\gamma_T$ is upper bounded by $O(T^{d(d+1)/(2\nu + d(d+1))} \log T)$ with $\nu > 1$, so the left-hand side of Eq. (24) grows sublinearly in $T$ while the right-hand side is $O(T)$; this is a contradiction. Therefore, $\beta_t^{1/2}\sigma_{t-1}(x_t)$ converges to 0 as $t \to \infty$, which means there $\exists T_k: \forall t \geq T_k$,

$$r_{b,S_k}(t) = \max_{x \in S_k} \alpha^{UCB}(x; \mathcal{D}_{t-1}) - \max_{x \in \mathcal{D}_t} \alpha^{LCB}(x; \mathcal{D}_{t-1}) \leq \epsilon - 1/t^2. \quad (25)$$

Finally, following Eq. (21), with probability larger than $1 - \delta$ we have, $\forall t$, $\max_{x \in S_k} \alpha^{UCB}(x; \mathcal{D}_{t-1}) - \max_{x \in \mathcal{D}_t} f(x) \leq \mu_{t-1}(x_t) + \beta_t^{1/2}\sigma_{t-1}(x_t) + 1/t^2 - \max_{x \in \mathcal{D}_t} \alpha^{LCB}(x; \mathcal{D}_{t-1})$. Thus, for all $t$ satisfying Eq. (25), $\max_{x \in S_k} f(x) - \max_{x \in \mathcal{D}_t} f(x) \leq \epsilon$. ∎

A.5 Proof of Theorem 5.1
First, we prove that with our search space expansion strategy, the search space continues to expand, i.e. $size(S_k) \xrightarrow{k \to \infty} \infty$. As $\lambda_{max}$ is the largest singular value of the matrix $(K + \sigma^2 I)^{-1}$, it is also the maximum entry on the diagonal of the matrix $D$, where $D$ satisfies $(K + \sigma^2 I)^{-1} = U D U^T$ with $U$ a unitary matrix. Note that $\lambda_{max} = 1/\lambda_{min,(K+\sigma^2 I)}$, where $\lambda_{min,(K+\sigma^2 I)}$ is the smallest singular value of the matrix $(K + \sigma^2 I)$. We have,

$$n \lambda_{min,(K+\sigma^2 I)} \leq \operatorname{Tr}(K + \sigma^2 I) = n(\theta^2 + \sigma^2), \quad (26)$$

where $\operatorname{Tr}(\cdot)$ is the trace operator and $n = |\mathcal{D}_{t-1}|$. Thus $\lambda_{min,(K+\sigma^2 I)} \leq \theta^2 + \sigma^2$, which results in $\lambda_{max} \geq 1/(\theta^2 + \sigma^2)$. Therefore,

$$\frac{1}{\sqrt{\beta_t}}\sqrt{\frac{0.4\sqrt{\beta_t}\,\theta\epsilon - 0.04\epsilon^2}{|\mathcal{D}_{t-1}|\,\lambda_{max}}} \xrightarrow{t \to \infty} 0, \quad (27)$$

as $\beta_t \sim O(\log t)$ and $|\mathcal{D}_{t-1}| \geq t - 1$. This means $\gamma \to 0$, hence $d_\epsilon = g_k(\gamma) \xrightarrow{t \to \infty} \infty$ and $size(S_k) \xrightarrow{k \to \infty} \infty$. Thus $\exists k_0$ such that $S^* \subset S_{k_0}$, where $S^*$ is the region that contains the objective function global maximum. Following Proposition 5.1 in the paper, with $\{\beta_t\}$ chosen as suggested in the Theorem, the proposed algorithm achieves the local $\epsilon$-accuracy in any search space. Thus it eventually achieves the local $\epsilon$-accuracy in the search space $S_{k_0}$, hence $\exists x_{k_0}: (\max_{x \in S_{k_0}} f(x) - f(x_{k_0})) \leq \epsilon$. Note that $\max_{x \in S_{k_0}} f(x) = \max_{x \in S^*} f(x)$ as $S^* \subset S_{k_0}$. Therefore, $(\max_{x \in S^*} f(x) - f(x_{k_0})) \leq \epsilon$. ∎

B Further Details of Search Space Optimization