Warm Starting Bayesian Optimization
Matthias Poloczek, Jialei Wang, and Peter I. Frazier
School of Operations Research and Information Engineering
Cornell University
{poloczek,jw865,pf98}@cornell.edu
ABSTRACT
We develop a framework for warm-starting Bayesian optimization that reduces the solution time required to solve an optimization problem that is one in a sequence of related problems. This is useful when optimizing the output of a stochastic simulator that fails to provide derivative information, for which Bayesian optimization methods are well-suited. Solving sequences of related optimization problems arises when making several business decisions using one optimization model and input data collected over different time periods or markets. While many gradient-based methods can be warm started by initiating optimization at the solution to the previous problem, this warm start approach does not apply to Bayesian optimization methods, which carry a full metamodel of the objective function from iteration to iteration. Our approach builds a joint statistical model of the entire collection of related objective functions, and uses a value of information calculation to recommend points to evaluate.

∗ The authors were partially supported by NSF CAREER CMMI-1254298, NSF CMMI-1536895, NSF IIS-1247696, AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, and AFOSR FA9550-16-1-0046.

1. INTRODUCTION

Businesses that use optimization to make operational decisions often solve collections of related optimization problems. Consider three examples:

• First, consider a ride-sharing company, such as Uber or Lyft, that makes real-time decisions about how to assign requests for rides to drivers who may fulfill those requests. Such a company might have a single algorithm for making these decisions: its performance depends crucially on tunable parameters. Given a particular distributional forecast of demand patterns, tuned to the particular next time period, and a simulator to estimate the performance of the algorithm using this forecast, this company might wish to optimize the parameters every few hours to find a configuration that would best achieve their goals in that time interval. Such distributional forecasts would be likely to change across hours within a day and across days within a week due to seasonality, and would also vary from week to week as distributional forecasts include more recent historical data due to changing weather, sporting events, etc.

• Second, consider a company managing inventory: to avoid overstocking or outages, an inventory management system has to decide what quantity of each item to hold in stock and how to set the threshold levels to order supplies. To maximize profit, the system must take into account not only current market prices but also holding costs, delivery times that affect lead times, and patterns in demand. Thus, the company would re-optimize the parameters of its inventory management system on a regular schedule and whenever any of the key factors change.

• Third, consider a business that relies on machine learning models to predict the behavior of their clients. Suppose that such a model is used to predict “click-rates” for ads displayed on a collection of webpages. These predictions are subsequently used to decide which ad is to be displayed to each visitor, based on the user’s profile but also on the targets agreed on in the advertising campaigns.
To obtain the most accurate predictions, the company would want to retrain the parameters of their model frequently based on the latest historical data, some of which typically have to be fit by a derivative-free method.

The common characteristic of all these problems is that we are to solve very similar problems over and over again. To squeeze the most out of the real-time availability of data, we would wish to “shortcut” the optimization process, thereby saving time and money and, most importantly, gaining an edge over competitors, without negatively affecting the quality of solutions. One way to avoid a “cold start”, i.e. starting from scratch each time, is to exploit knowledge obtained while previously tackling related problems. This comprises not only the final solutions that were actually used in the application, but also all those evaluated in the respective optimization processes.

The idea of using previous solutions to speed up the optimization process is ubiquitous for applications in mathematical programming: in particular, the number of iterations that interior point methods for linear programming require to obtain an almost optimal solution can be reduced by a warm start, e.g., see [YW02, GW04, JY08]. Typically this is achieved simply by using the solution to a previous optimization problem as the starting point when solving the current problem. This same technique can be used when using stochastic gradient ascent to solve optimization via simulation problems whose objective function evaluations provide stochastic gradients. However, when evaluations of the objective function are derivative-free, as they commonly are when optimizing complex simulators, many of the optimization methods that are most effective for solving a single optimization problem instance lack the structure that would permit warm starting them using this technique.

In this article, we create the ability to warm start Bayesian optimization methods. Bayesian optimization methods are effective for solving optimization via simulation problems in which the objective does not provide gradients, is expensive to evaluate, and takes a small number (< 20) of real- or integer-valued inputs. For an overview, see the Bayesian optimization tutorials and surveys [FW16, BCd09, Kle14]. Bayesian optimization methods belong to a wider set of optimization via simulation methods that leverage Bayesian statistics and value of information calculations. See, e.g., the surveys of [Chi00, Fra11].

Bayesian optimization methods iteratively construct a Gaussian process metamodel over the objective function and use this metamodel to decide where to sample next. Because the information carried from iteration to iteration is not just a single point but the full metamodel, simply carrying over the solution from a previous problem as a starting point does not provide a warm starting capability for Bayesian optimization, necessitating a more sophisticated approach. While one can partially warm start Bayesian optimization by performing function evaluations of the current problem at solutions to previous problems, this fails to take full advantage of the set of previous evaluations at non-optimal points.

In this article, we propose a Bayesian framework that captures the observed relationship of the current black-box function to previous ones. This enables us to perform a “warm start” and hence to zoom in on an optimal solution faster. Our approach puts an emphasis on conceptual simplicity and wide applicability.
In particular, we will model the tasks and their relationships by a Gaussian process whose covariance structure is tailored to capture the differences between the current task and the previous ones.

To optimize the current objective, we iteratively sample the search space, using the Knowledge Gradient acquisition criterion proposed by Frazier, Powell, and Dayanik [FPD09]; besides its one-step Bayes optimality, this criterion has the advantage that it has a simple form and can be computed efficiently. However, our statistical model is independent of the acquisition function and can be combined with others if that seems beneficial for the application under consideration.
Related Literature.
The previous work most directly relevant to warm starting Bayesian optimization, and optimization of expensive functions more generally, comes from the machine learning literature, where hyper-parameters in machine learning models are fit by solving optimization problems with expensive derivative-free evaluations.

Within this literature, the most relevant previous work is due to Swersky, Snoek, and Adams [SSA13], who developed a warm start algorithm to tune hyper-parameters of machine learning models using Bayesian optimization. They demonstrate that their algorithm can speed up the fine-tuning process for hyper-parameters of convolutional neural networks and logistic regression for image classification. Their approach is similar to ours, as they also construct a Gaussian process model to quantify the relationships between all previous instances and then use an acquisition function, in their case Entropy Search, to optimize the hyper-parameters for the new instance.

We believe that our covariance kernel has the following advantages. Their model assumes the covariance of each pair of tasks to be uniform over the whole domain. Moreover, they require that all tasks are positively correlated. Our covariance function does not impose these restrictions and in particular allows the relationship of tasks to vary over the domain. Additionally, we quantify biases between the objective functions of different tasks. Finally, the inference of the parameters of their kernel requires an expensive sampling-based step, involving slice sampling, whereas our approach performs inference in closed form.

In other related work from machine learning, Feurer, Springenberg, and Hutter [FSH14] proposed a meta-learning procedure that improves the performance of a machine learning algorithm by selecting a small set of start-up configurations. The selection procedure explores the new task and then ranks previous tasks (and their corresponding optimal solutions) based on a metric that relies on an evaluation of “metafeatures” that capture the characteristics of each dataset. The authors demonstrate that their approach can speed up the process of optimizing hyper-parameters for Support Vector Machines and Algorithm Selection models.
This approach differs from ours in three ways: first, the metafeatures and distance functions it develops are specific to machine learning, and extending this approach in a generic way to optimization via simulation problems encountered in operations research would require substantial effort; second, it uses only the final solution from each previous problem, while our approach uses the full set of function evaluations; third, it requires that solutions from previous problems be evaluated under the current objective before they provide useful information, while our approach utilizes this information directly without requiring additional expensive evaluations.

Finally, in the aerospace engineering literature, Gano, Renaud, and Sanders [GRS04] studied the optimization of expensive-to-evaluate functions in a multi-fidelity setting and showed that warm starting optimization of an expensive high-fidelity objective using inexpensive evaluations of a low-fidelity model can reduce the number of high-fidelity queries required. This previous work is focused on engineering design problems in which computational codes of various expense and fidelity all model the same objective, while we focus on a sequence of optimization problems with different objectives that are not ordered by fidelity and do not necessarily have different computational costs.
Outline of the Paper.
The problem is stated formally in Sect. 2. The warm start algorithm is described in Sect. 3. The results of an experimental evaluation of the algorithm are presented in Sect. 4.

2. PROBLEM DEFINITION
We give a formal exposition of the setup. We indicate the current problem to be solved by T_0, and tasks that have been solved previously by T_ℓ for ℓ > 0. We also refer to these optimization problems as “instances” or “tasks” in the sequel. Let M be the number of previously solved problems, and define [M] := {1, 2, ..., M} and [M]_0 := [M] ∪ {0}.

We suppose that evaluating T_ℓ, ℓ ≥ 0, at input parameter x yields independently and normally distributed observations with mean f(ℓ, x) and variance λ_ℓ(x); in particular, f(0, x) denotes the objective value of x for the current task. We also suppose that accompanying this evaluation is an estimate of λ_ℓ(x). This assumption of normally distributed noise accompanied by an estimate of the noise’s variance is approximately valid when each evaluation is actually an average of many independent and identically distributed observations, whose sample variance provides an estimate of λ_ℓ(x).

We indicate the current optimization problem’s feasible set by D. This feasible set may be shared with previous problems, or previous problems may have different feasible sets. We assume that D is a compact subset of R^d, and that membership in this set can be determined simply without evaluating an expensive function. For example, D might be a hyper-rectangle or a simplex.

Our goal is to optimize the current task, i.e., to find argmax_{x ∈ D} f(0, x). To support us in this task we have evaluations of previous tasks, in the form of a collection of tuples (ℓ, x, y_ℓ(x), λ_ℓ(x)), where y_ℓ(x) is the sum of f(ℓ, x) and independent normally distributed noise with variance λ_ℓ(x). Typically these would be obtained from runs of the optimization algorithm we propose, or some other optimization algorithm, when solving these previous tasks. Evaluations obtained for other simulation studies can also be included.

During optimization for the current task, we assume that previous tasks cannot be evaluated. We make this assumption because maintaining the ability to evaluate previous tasks may present greater practical engineering challenges than simply storing previous evaluations, if previous tasks used different versions of the simulation software, or required access to large historical datasets. In contrast, under the assumptions made below, previous evaluations do not consume a great deal of memory, and can be stored simply in a flat file or database.

We make a few additional assumptions that should be met for our proposed methodology to be useful in comparison to other methods.

1. First, we suppose that evaluations of T_ℓ do not provide first- or second-order derivative information. If derivative information is available, it would be more productive to instead use a gradient-based method.

2. Second, we suppose that evaluations are expensive to obtain, limiting the number of evaluations that can be performed to at most a few hundred per problem. When the number of evaluations possible grows substantially larger, the computational expense incurred by our method grows, decreasing its utility.

3. Third, we will model both the current problem’s objective x ↦ f(0, x) and the collection of all objectives (ℓ, x) ↦ f(ℓ, x) with Gaussian processes, and so we assume that these objectives are amenable to Gaussian process modeling.
While Gaussian processes are flexible, and can handle a wide variety of objective functions, they require continuity of x ↦ f(ℓ, x) for each ℓ to work well.

4. Fourth, while our mathematical approach does not place restrictions on d, our numerical experiments all assume that d is small. Other work on Bayesian optimization suggests that performance degrades on high-dimensional problems, suggesting that the method we develop will provide value only for problems of moderate dimension.
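The collection of tuples (ℓ, x, y_ℓ(x), λ_ℓ(x)) described above can be stored in a very simple format, consistent with the flat-file storage mentioned earlier. The sketch below is illustrative only and not taken from the paper; the container and field names are our own choices, assuming each past evaluation is recorded as a task index, an input vector, a noisy observation, and an estimate of the noise variance.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Evaluation:
    """One noisy evaluation of task T_ell at input x."""
    task: int            # ell in [M]_0; 0 denotes the current task
    x: np.ndarray        # input parameters, a point in the feasible set D
    y: float             # observed value: f(ell, x) plus Gaussian noise
    noise_var: float     # estimate of lambda_ell(x), the variance of that noise

# Historical data from previously solved tasks (ell >= 1), e.g. loaded from a flat file.
# The numbers below are made up purely for illustration.
history: List[Evaluation] = [
    Evaluation(task=1, x=np.array([0.3, 1.2]), y=4.7, noise_var=0.25),
    Evaluation(task=2, x=np.array([0.1, 0.8]), y=3.9, noise_var=0.25),
]
```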
3. WARM START ALGORITHM
The key idea of our approach is to use samples of the objectives from previous optimization problems to learn about the current optimization problem’s objective. To accomplish this, we use a Bayesian prior probability distribution that models the relationship between past objectives x ↦ f(ℓ, x) and the current objective f(0, x). Thereby we begin our optimization of T_0 with substantial information obtained from previous tasks, allowing us to find one of its optima faster.

More specifically, we place a Gaussian process prior on f, which models the relationships between the tasks. We emphasize that the Gaussian process prior is not just over x ↦ f(ℓ, x) for a single ℓ, as is typical in Bayesian optimization, but instead is over (ℓ, x) ↦ f(ℓ, x). Modeling the objective jointly across tasks allows our approach to learn about the current task from previous ones. This is described in Sect. 3.1.

The Gaussian process prior utilizes a parametrized class of covariance functions: we describe in Sect. 3.2 how the parameters for the covariance function can be obtained in our problem setting. These parameters describe the fundamental characteristics of the instances, and therefore are typically robust. We suggest optimizing them periodically rather than for every new task.

Then, using this Gaussian process prior over the collection of tasks, we apply Gaussian process regression to update our belief on f(0, ·) based on samples obtained from previous tasks. Although this application of Gaussian process regression is relatively straightforward, we provide a detailed description for completeness in Sect. A, showing how to calculate the posterior probability distribution on f(0, ·) given the prior from Sect. 3.1 and samples from previous tasks.

Finally, starting from this Gaussian process posterior over the current task’s objective f(0, ·), we use samples of f(0, ·) to localize the optimum. Points to sample are chosen by our algorithm by iteratively solving a sub-optimization problem, in which we find the point to evaluate that maximizes some acquisition function, sample this point, use it to update the posterior, and repeat until our budget of samples is exhausted. We choose as our acquisition function the Knowledge Gradient (KG) proposed by Frazier, Powell, and Dayanik [FPD09]; a succinct exposition is given in Sect. 3.3.

The Algorithm wsKG. The proposed algorithm for warm starting Bayesian optimization proceeds as follows:

1. Using samples of previous tasks, and the prior distribution described in Sect. 3.1:
   a) Estimate the hyper-parameters of the Gaussian process prior as described in Sect. 3.2.
   b) Use the prior with the estimated hyper-parameters and the samples on previous tasks to calculate a posterior on f(0, ·), as described in Sect. A.

2. Until the budget of samples on the current task is exhausted:
   a) Choose the point x at which to sample f(0, x) next by maximizing the KG factor, as described in Sect. 3.3.
   b) Update the posterior distribution to include the new sample of f(0, ·), as described in Sect. A, so that it includes all previous samples on the current and previous tasks.

3. Choose as our solution the point with the largest estimated value according to the current posterior.
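A high-level sketch of this loop in Python is given below. It is a schematic of steps 1–3 above, not the authors’ implementation; the four trailing arguments are hypothetical user-supplied callables standing in for the components described in Sects. 3.2, 3.3, and A.

```python
def ws_kg(history, candidate_set, budget,
          estimate_hyperparameters, fit_posterior, kg_factor, evaluate_current_task):
    """Schematic warm-start loop following steps 1-3 above (not the authors' code).

    All four trailing arguments are user-supplied callables (placeholders):
      estimate_hyperparameters(history)      -> hyper-parameters of the joint GP prior
      fit_posterior(data, hyper)             -> object with .mean(x) = posterior mean of f(0, x)
      kg_factor(posterior, x, candidate_set) -> KG factor of Eq. (2) for candidate x
      evaluate_current_task(x)               -> noisy observation (y, noise_var) of f(0, x)
    """
    hyper = estimate_hyperparameters(history)          # step 1a (Sect. 3.2)
    data = list(history)
    posterior = fit_posterior(data, hyper)             # step 1b (Sect. A)

    for _ in range(budget):                            # step 2
        # step 2a: pick the next point by maximizing the KG factor over the candidates
        x_next = max(candidate_set, key=lambda x: kg_factor(posterior, x, candidate_set))
        y, noise_var = evaluate_current_task(x_next)   # one expensive simulation run
        data.append((0, x_next, y, noise_var))
        posterior = fit_posterior(data, hyper)         # step 2b: update with the new sample

    # step 3: recommend the point with the largest posterior mean for the current task
    return max(candidate_set, key=posterior.mean)
```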
We begin by describing the construction of the Gaussian process prior, which models the relationship between previous tasks and the current one. Let δ_ℓ(x) denote the difference in objective value of any design x between T_0 and T_ℓ:

δ_ℓ(x) = f(ℓ, x) − f(0, x).   (1)

We construct our Bayesian prior probability distribution over f by supposing that each δ_ℓ(·) with ℓ > 0 is drawn from an independent Gaussian process prior with mean function µ_ℓ and covariance kernel Σ_ℓ, where µ_ℓ(·) and Σ_ℓ(·, ·) belong to some pre-specified parametrized class of functions. For example, we assume in our numerical experiments that µ_ℓ(·) = 0, and that Σ_ℓ(·, ·) belongs to the Matérn covariance functions, a popular class of stationary kernels (see chapter 4.2 in Rasmussen and Williams [RW06] for details). It would also be possible to set the prior mean µ_ℓ(·) to a constant. Hyper-parameters determining the specific choices for µ_ℓ and Σ_ℓ from their respective parameterized class can be fit from data in an automated process, as we describe in more detail in Sect. 3.2.

This δ_ℓ(x) models the difference between previous tasks and the current task. By using knowledge of the typical values of δ_ℓ(x), inferred from the hyper-parameters and from evaluations of f(ℓ, x') and f(0, x'') for x' and x'' near x, we can learn about f(0, x) from knowledge of f(ℓ, ·).

We also suppose that f(0, x) is drawn from another independent Gaussian process with mean function µ_0 and covariance kernel Σ_0, which also belong to some pre-specified parameterized class of functions. As with µ_ℓ and Σ_ℓ for ℓ > 0, these parameters may also be chosen automatically as described in Sect. 3.2.

This specification of the (joint) distribution of (ℓ, x) ↦ δ_ℓ(x) and x ↦ f(0, x), together with the specification of f(ℓ, x) in terms of these quantities, defines a joint Gaussian process probability distribution over f. This is because each f(ℓ, x) is a linear function of quantities that are themselves jointly Gaussian. By specifying the mean function and covariance kernel of this Gaussian process, which we call µ and Σ respectively, we may leverage standard software for Gaussian process regression.

We calculate the functions µ and Σ as follows. For ease of notation let δ_0(x) = 0 for all x ∈ D; then we can write f as f(ℓ, x) = f(0, x) + δ_ℓ(x). Then, for ℓ ∈ [M]_0, we calculate the mean function as

µ(ℓ, x) = E[f(ℓ, x)] = E[f(0, x) + δ_ℓ(x)] = µ_0(x) + E[δ_ℓ(x)] = µ_0(x),

since E[δ_ℓ(x)] = 0. For ℓ, m ∈ [M]_0 and x, x' ∈ D, the covariance kernel is given by

Σ((ℓ, x), (m, x')) = Cov(f(ℓ, x), f(m, x'))
                  = Cov(f(0, x) + δ_ℓ(x), f(0, x') + δ_m(x'))
                  = Cov(f(0, x), f(0, x')) + Cov(δ_ℓ(x), δ_m(x')),

and, since the Gaussian processes δ_ℓ and δ_m are independent whenever ℓ ≠ m,

                  = Σ_0(x, x') + 1_{ℓ,m} · Σ_ℓ(x, x'),

where the indicator 1_{ℓ,m} is one if ℓ = m and ℓ ≥ 1, and zero otherwise.
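To make this covariance structure concrete, the following minimal sketch implements the joint kernel above using squared-exponential base kernels for Σ_0 and Σ_ℓ (the experiments in this paper use Matérn kernels instead). The function and parameter names are our own, and the hyper-parameter values in the usage example are placeholders that would normally be fit as described in Sect. 3.2.

```python
import numpy as np

def se_kernel(x, y, alpha, beta):
    """Squared-exponential kernel k(x, y) = alpha * exp(-sum_i beta_i * (x_i - y_i)^2)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return alpha * np.exp(-np.sum(beta * d * d))

def joint_kernel(ell, x, m, xp, params):
    """Covariance Sigma((ell, x), (m, x')) of the warm-start prior:
    Sigma_0(x, x') plus Sigma_ell(x, x') when both points belong to the same previous task.
    params[ell] holds the (alpha, beta) hyper-parameters of Sigma_ell."""
    cov = se_kernel(x, xp, *params[0])           # Sigma_0: covariance of f(0, .)
    if ell == m and ell >= 1:                    # indicator 1_{ell, m}
        cov += se_kernel(x, xp, *params[ell])    # Sigma_ell: covariance of the deviation delta_ell
    return cov

# Example usage with made-up hyper-parameters for tasks 0 and 1 in two dimensions.
params = {0: (1.0, np.array([0.5, 0.5])), 1: (0.2, np.array([1.0, 1.0]))}
print(joint_kernel(1, [0.3, 1.2], 1, [0.25, 1.1], params))
```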
When designing the method, we aimed for wide applicability, and therefore do not assume that further information on the current optimization task T_0 and its relationship with previous tasks is given explicitly. Rather, we estimate the important quantities from historical data in an automated fashion as follows.

Let θ denote the parameters of the covariance function Σ_ℓ: for instance, in the case of the squared exponential kernel K(x, y) = α_ℓ exp(−Σ_{i=1}^d β_{ℓ,i} (x_i − y_i)²) for x, y ∈ R^d, there are d + 1 parameters. Since we believe that θ changes little from instance to instance, we formulate a prior distribution based on previously optimal choices for θ: let p(θ) be the probability under that prior. Moreover, our statistical model provides the likelihood of the observations Y conditioned on the samples X and θ, denoted by p(Y | X, θ). Thinking of θ as a random variable, the probability of an outcome for θ conditioned on the observations is proportional to the product p(Y | X, θ) · p(θ). Thus, the method of maximum a posteriori (MAP) estimation chooses the hyper-parameters that maximize this product, or equivalently argmax_θ ln(p(Y | X, θ)) + ln(p(θ)), where ln(·) is the natural logarithm. The optimal parameters can be computed efficiently via gradient methods. We refer to Sect. 5.4.1 in [RW06] for details.
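The MAP objective above can be written down directly from the Gaussian likelihood of the observations. The sketch below is an illustration under our own assumptions: a zero prior mean, a user-supplied kernel and log-prior, and a generic optimizer with numerical gradients rather than the analytic gradients referenced in [RW06].

```python
import numpy as np
from scipy.optimize import minimize

def log_posterior(theta, X, Y, noise_var, log_prior, kernel):
    """MAP objective ln p(Y | X, theta) + ln p(theta) for GP hyper-parameters.
    kernel(X, theta) must return the n x n prior covariance matrix K(X, X);
    both it and log_prior are placeholders supplied by the user. Assumes zero prior mean."""
    n = len(Y)
    K = kernel(X, theta) + np.diag(noise_var)        # add observation noise on the diagonal
    L = np.linalg.cholesky(K)                        # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    log_lik = (-0.5 * Y @ alpha                      # Gaussian log marginal likelihood
               - np.sum(np.log(np.diag(L)))
               - 0.5 * n * np.log(2 * np.pi))
    return log_lik + log_prior(theta)

def map_estimate(theta0, X, Y, noise_var, log_prior, kernel):
    """Maximize the MAP objective; numerical gradients are used here for simplicity."""
    res = minimize(lambda t: -log_posterior(t, X, Y, noise_var, log_prior, kernel),
                   x0=theta0, method="L-BFGS-B")
    return res.x
```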
3.3. Acquisition Function

The proposed algorithm takes as input the historical data, i.e. the samples obtained while optimizing previous tasks, and “oracle access” to the current task T_0; that is, we consider the objective function a black box. The actual optimization process is conducted in an iterative manner, where in each iteration a so-called acquisition function selects a point x to sample. A wide range of such functions has been proposed in the literature. We suggest using the Knowledge Gradient [FPD09]: this greedy strategy is known to be one-step (Bayes) optimal and moreover converges to an optimal solution in settings where a finite number of alternatives is given with a multivariate normal prior.

From a practical point of view, the main advantage of the Knowledge Gradient is that it has been shown to produce good results, but nonetheless can be computed very efficiently (see their paper for details on the implementation). For the sake of a self-contained presentation, we give a brief description of the method.

We assume that we have already observed n points for T_0 and are now to decide on the next sample. Let µ^(n)(x) = E_n[f(0, x)], where E_n denotes the expected value under the posterior distribution. Thus, based on our current knowledge, the best x ∈ D for task T_0 has an expected objective value of max_{x ∈ D} µ^(n)(x). Note that this value can be computed with our information after observing the n-th sample. Then the Knowledge Gradient (KG) selects a design x^(n+1) ∈ D that maximizes

E_n[ max_{x' ∈ D} µ^(n+1)(x') | x^(n+1) = x ] − max_{x' ∈ D} µ^(n)(x').

We obtain an approximation by replacing D by a discrete set of points, denoted by A:

E_n[ max_{x' ∈ A} µ^(n+1)(x') | x^(n+1) = x ] − max_{x' ∈ A} µ^(n)(x').   (2)

The set A can be obtained by randomly sampling from D, for example using a Latin Hypercube design to ensure an even coverage. If certain regions of D seem particularly relevant, then A can represent them with higher resolution.

Note that the first summand of Eq. (2), the conditional expectation for fixed x, can be calculated from the posterior distribution: recall that µ^(n) is the mean vector and let Σ^(n) denote the covariance matrix of the posterior distribution after observing n samples. Define

σ̃_{x'}(x) = Σ^(n)((0, x'), (0, x)) / sqrt( Σ^(n)((0, x), (0, x)) + λ_0(x) ).

It is well known (e.g., see [Kai68, FPD09]) that the random variables µ^(n+1)(x') and µ^(n)(x') + σ̃_{x'}(x) · Z have the same distribution conditioned on the previous samples, where Z is a standard normal random variable. Thus µ^(n+1)(x') has the same distribution as µ^(n)(x') + σ̃_{x'}(x) · Z for every x' ∈ A.

Based on this observation, we can either enumerate all x ∈ A and pick a point that maximizes Eq. (2), which may be useful if evaluating the task is very expensive (compared to the computational cost of the enumeration), or we can utilize the fact that the gradient of the KG factor can be computed efficiently (e.g., see [FXC11, SFP11, PWF16]) and perform a local search over D, for instance using a multi-start gradient ascent method. In the latter case we require that the noise functions λ_ℓ(·) are differentiable.
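As an illustration of Eq. (2), the sketch below approximates the KG factor for one candidate point by Monte Carlo, using the identity µ^(n+1)(x') = µ^(n)(x') + σ̃_{x'}(x) · Z. Frazier, Powell, and Dayanik [FPD09] show how to compute this expectation exactly and far more efficiently, so this is only an illustrative stand-in with names of our own choosing.

```python
import numpy as np

def kg_factor(mu_n, cov_n, noise_var_x, x_index, n_draws=10_000, rng=None):
    """Monte Carlo approximation of the KG factor in Eq. (2) for one candidate x.

    mu_n:        posterior means mu^(n)(x') for all x' in the discrete set A (length |A|)
    cov_n:       posterior covariance matrix Sigma^(n) restricted to points (0, x'), x' in A
    noise_var_x: lambda_0(x), the observation-noise variance at the candidate x
    x_index:     position of the candidate x within A
    """
    rng = np.random.default_rng() if rng is None else rng
    mu_n = np.asarray(mu_n, dtype=float)
    cov_n = np.asarray(cov_n, dtype=float)
    # sigma_tilde_{x'}(x) for all x' in A, following the definition above.
    sigma_tilde = cov_n[:, x_index] / np.sqrt(cov_n[x_index, x_index] + noise_var_x)
    z = rng.standard_normal(n_draws)
    # mu^(n+1)(x') = mu^(n)(x') + sigma_tilde_{x'}(x) * Z, one column per draw of Z.
    mu_next = mu_n[:, None] + sigma_tilde[:, None] * z[None, :]
    return np.mean(np.max(mu_next, axis=0)) - np.max(mu_n)

# The next design would then be the element of A with the largest KG factor.
```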
4. EXPERIMENTAL EVALUATION

We have evaluated the warm start algorithm on two benchmark suites. The first set comprises variants of the two-dimensional Rosenbrock function, a standard test problem in Bayesian optimization. The second family of testbed instances is taken from the Assemble-To-Order (ATO) problem that was introduced by Hong and Nelson [HN06]. In this eight-dimensional simulation optimization scenario the task is to optimize the target stock level of each item under a continuous-review base stock policy that manages the inventory of a production facility. To evaluate the objective value of a specific choice of the parameters, we run the simulator of Xie, Frazier, and Chick [XFC12] available on the SimOpt website.

The performance of the warm start algorithm (wsKG) is compared to two traditional methods that do not benefit from previous optimizations of related tasks.
EGO. The first baseline method is the well-known EGO algorithm of Jones, Schonlau, and Welch [JSW98]. EGO is also a myopic algorithm that iteratively selects one point to sample. EGO’s acquisition criterion is to optimize the expected improvement (EI): for this exposition suppose that observations are noiseless and that y* is the best objective value sampled so far. Let Y_x be the random variable distributed (normally) according to the posterior distribution of the objective value of some design x; then we call EI(x) := E[max{Y_x − y*, 0}] the expected improvement of x. In each iteration EGO greedily samples an x that maximizes this expectation.
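For noiseless observations the expectation EI(x) = E[max{Y_x − y*, 0}] has a well-known closed form in terms of the posterior mean and standard deviation at x; the short sketch below states it, with function and argument names of our own choosing.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu_x, sigma_x, y_best):
    """Closed form of EI(x) = E[max{Y_x - y*, 0}] when Y_x ~ N(mu_x, sigma_x^2).

    mu_x, sigma_x: posterior mean and standard deviation of the objective at x
    y_best:        best objective value y* sampled so far (maximization)
    """
    if sigma_x <= 0.0:
        return max(mu_x - y_best, 0.0)
    z = (mu_x - y_best) / sigma_x
    return (mu_x - y_best) * norm.cdf(z) + sigma_x * norm.pdf(z)
```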
KG. In order to assess the benefit of taking data on previous tasks into account, we also compare the new method to the “vanilla version” of the Knowledge Gradient (KG) [FPD09]; see Sect. 3.3.

Experimental Setup. In our evaluation all algorithms are given the same initial data points, drawn randomly for each instance and each run. The warm start algorithm is additionally provided with one data set for the Rosenbrock instances and two for ATO: each set contains the samples collected during a single run on a related instance. A single run consists of 25 steps (also referred to as iterations) for each of the Rosenbrock instances and 50 steps for the Assemble-To-Order suite.

The hyper-parameters of the Gaussian process affect how accurately the posterior distribution predicts the objective function of the current task. In accordance with our scenario, we optimized the hyper-parameters once for a single instance of each suite and then invoked wsKG with this fixed setting for all instances. For the baseline methods EGO and KG we optimized these parameters for each instance in advance, thereby possibly giving them an edge over wsKG in this respect.

The plots below provide the mean of the gain over the initial solution for each iteration. Error bars are shown at the mean plus and minus two standard errors, averaged over at least 100 replications.

4.1. The Rosenbrock Family

The basis of these instances is the two-dimensional Rosenbrock function

RB(x_1, x_2) = (1 − x_1)² + 100 · (x_2 − x_1²)²,

which is subject to the following modifications: RB_1 adds to RB a sinusoidal perturbation proportional to sin(10·x_1 + 5·x_2); RB_2 evaluates RB at shifted inputs, with a constant added to x_1 and subtracted from x_2; and RB_3 adds to RB_1 a bias term proportional to x_2.
Moreover, each function evaluation is subject to i.i.d. noise with mean zero and constant variance. The task is to optimize the respective function over a hyper-rectangular domain.

The results are summarized in Fig. 1: the warm start algorithm is able to exploit the knowledge from a single run on a related instance and samples better solutions than the other methods from the first iteration on. In particular, wsKG obtains a near-optimal solution after only one or two samples. KG and EGO are distanced, with KG showing a superior performance to EGO.
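A noisy version of the base Rosenbrock objective can be set up in a few lines. The sketch below is purely illustrative; the noise level and domain bounds are placeholders of our own choosing rather than the values used in our experiments.

```python
import numpy as np

def rosenbrock(x1, x2):
    """Two-dimensional Rosenbrock function RB(x1, x2) = (1 - x1)^2 + 100 * (x2 - x1^2)^2."""
    return (1.0 - x1) ** 2 + 100.0 * (x2 - x1 ** 2) ** 2

def noisy_evaluation(x, noise_std=0.5, rng=None):
    """One simulated black-box evaluation: RB plus i.i.d. Gaussian noise.
    The noise level here is a placeholder, not the variance used in the paper."""
    rng = np.random.default_rng() if rng is None else rng
    x1, x2 = x
    y = rosenbrock(x1, x2) + noise_std * rng.standard_normal()
    return y, noise_std ** 2   # observation and (known) noise variance estimate

# Example: evaluate a random point in a placeholder domain [-2, 2]^2.
x = np.random.default_rng(0).uniform(-2.0, 2.0, size=2)
print(noisy_evaluation(x))
```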
4.2. The Assemble-To-Order Problem

In the Assemble-To-Order problem we consider a production facility that sells m different products that are assembled on demand whenever a customer request comes in. Products are manufactured from a selection of n different items, where we distinguish for each product between key items and non-key items: when a product is ordered whose key items are all in stock, it is assembled and sold; otherwise the customer is lost. Non-key items are optional and used if available; they increase the revenue. On the other hand, items kept in the inventory incur holding costs, so the facility would like to avoid overstocking.

The inventory is managed by a continuous-review base stock policy that sets a target stock base for each item and triggers a replenishment order whenever an item is requested. The delivery time for supplies is normally distributed and varies over the items. Moreover, the orders for products vary, too, and are modeled by m Poisson arrival processes. The goal is to choose the target stock base used by the inventory management system so that the expected daily profit is maximized.

Hong and Nelson [HN06] proposed a setting (referred to as ATO 1) for n = 8 items and m = 5 products, with the target stock levels constrained to a hyper-rectangle with lower bound 0 in each coordinate. We have created three additional instances based on ATO 1.
In the first of these additional instances, the demand for one of the products declines; on the other hand, the popularity of two other products grows, and the profit of some of the items increases slightly on average.
Figure 1: (ul) The basic Rosenbrock function RB. (ur) The Rosenbrock function RB_1 with an additive sine. (bl) The shifted Rosenbrock function RB_2. (br) The Rosenbrock function RB_3 with an additive sine and a bias depending on x_2.
Figure 2: (l), (r): Two of the ATO instances. All algorithms have the same initial data for the current problem. wsKG also has access to samples of two runs on related instances, but its hyper-parameters are not optimized for the current instance.
In the next period, all products see a higher demand and the profits for several items increase. In the final period, the profits of several items drop slightly.

When comparing the performance of the three algorithms given in Fig. 2 and Fig. 3 for the task of selecting the inventory levels used by the stocking policy, we see that wsKG consistently performs significantly better than its competitors, achieving a large fraction of the optimum after only ten samples. Among the two distanced methods, the KG policy achieves better solutions than EGO for the Assemble-To-Order problem. Looking closer, KG’s solutions typically remain some distance from the optimum after 10 steps and, while they improve over the following iterations, a gap remains even after 50 steps, depending on the instance. EGO achieves less than half of the optimum after 20 iterations, and remains well below the optimum after 50 steps. Notably, EGO shows a large discrepancy between the performance of single runs; the reported error bars are based on the mean over 100 runs.
5. CONCLUSION
We have proposed an algorithm to warm start Bayesian optimization and demonstrated its usefulness for an application in simulation optimization. The design of the algorithm is conceptually simple and computationally efficient.

A typical drawback of such Bayesian optimization methods is their limited ability to scale to larger input sizes. For the chosen acquisition function, Poloczek, Wang, and Frazier [PWF16] recently suggested a novel parallelization that efficiently utilizes several multi-core CPUs.
Figure 3: (l), (r): The two remaining ATO instances. wsKG has received the samples of two runs on related instances.

Thus, the acquisition function no longer poses the bottleneck of the approach if access to a computing cluster is provided. However, obtaining the predictive distribution of the Gaussian process model involves calculating the Cholesky decomposition to compute the inverse of a matrix, whose running time is cubic in the number of previous observations.

There are several approaches to address this, and we believe it is worthwhile to combine one of them with the proposed method. On the one hand, a possible direction is to replace the Gaussian process by a scalable approximation. Several such techniques have been proposed lately, e.g., see [FWN+15, WN15] and the references therein. A complementary approach to speed up the computations is to reduce the number of data points: instead of the whole data set, a “sketch” reduces the number of data points significantly, while the predictions made based on the sketch are nearly indistinguishable from those based on the full dataset. Geppert et al. [GIM+15] proposed such an approximation and demonstrated its usefulness for applications in Bayesian linear regression. Another approach, also based on a linear projection of the dataset to a subspace, was given by Banerjee, Dunson, and Tokdar [BDT12].

Another interesting direction is to extend our statistical model. We sketch two ideas. Recall that the statistical model in Sect. 3.1 describes the previous tasks as deviations from the current objective f(0, x), where each deviation is given by a Gaussian process δ_ℓ. In particular, δ_ℓ and δ_ℓ' are supposed to be independent for ℓ ≠ ℓ'. Alternatively, we could apply the formulation described in Sect. 2.2 of [PWF16] to additionally incorporate correlations among previous tasks.

A second approach is to suppose that all tasks are derived from an “ideal” problem f(0, ·). Then f(1, x) = f(0, x) + δ_1(x) gives the current task that we want to optimize, and f(ℓ, x) = f(0, x) + δ_ℓ(x) for ℓ ≥ 2 are the previous tasks. Note that f(0, x) is not observable in this case and is obtained via Gaussian process regression. Both approaches can be implemented with little additional effort. Therefore, we have a collection of flexible models to choose from, depending on the characteristics of the optimization problem at hand.

A. PRIOR AND POSTERIOR DISTRIBUTION OF THE GAUSSIAN PROCESS
For the sake of a self-contained exposition, we give the prior and the posterior of the Gaussian process. Both follow canonically (see [RW06] for an in-depth treatment).

Let X = ((ℓ^(1), x^(1)), (ℓ^(2), x^(2)), ..., (ℓ^(n), x^(n)))^T be a column vector of sample locations (for generality, assume that each pair belongs to [M]_0 × D). Then the prior distribution of f at X is given by

(f(ℓ^(1), x^(1)), ..., f(ℓ^(n), x^(n)))^T ∼ N(µ(X), K(X, X)),   (3)

i.e. the prior distribution is multivariate normal. Here we utilized that µ(ℓ, x) = µ_0(x) for any ℓ ∈ [M]_0 and x ∈ D as shown above; µ(X) is shorthand for the n-dimensional column vector with µ(X)_i = µ_0(x^(i)), and K(X, X) is the n × n matrix with K(X, X)_{i,j} = Σ((ℓ^(i), x^(i)), (ℓ^(j), x^(j))) for i, j ∈ [n].

Next we describe the Gaussian process regression used to predict the latent function f(0, ·) that serves as our estimator for the current objective. Upon sampling the points X, we have observed Y := (y^(1), y^(2), ..., y^(n))^T ∈ R^n with

y^(i) = y_{ℓ^(i)}(x^(i)) = f(ℓ^(i), x^(i)) + ε^(i),

where ε^(i) ∼ N(0, λ_{ℓ^(i)}(x^(i))) for i ∈ [n]. Recall that λ_{ℓ^(i)}(x^(i)) is the noise function that gives the variance when evaluating task T_{ℓ^(i)} for design x^(i), and observe that ε^(i) and ε^(i') are independent (for i ≠ i') conditioned on x^(i), x^(i'), and the respective noise functions. Then the posterior (or predictive) distribution of the latent function value f(0, x) is still normal (e.g., see Eq. (A.6) on p. 200 in [RW06]) with

f(0, x) | x, X, Y ∼ N(µ_n(x), σ²_{0,n}(x)), where
µ_n(x) = µ_0(x) + K(x, X) [K(X, X) + D(X)]^{-1} (Y − µ(X)),   (4)
σ²_{0,n}(x) = Σ((0, x), (0, x)) − K(x, X) [K(X, X) + D(X)]^{-1} K(x, X)^T,

where K(x, X) is defined as the row vector of length n with K(x, X)_i = Σ((0, x), (ℓ^(i), x^(i))) for i ∈ [n], and D(X) is the n × n diagonal matrix with D(X)_{i,i} = λ_{ℓ^(i)}(x^(i)). As usual, Q^{-1} denotes the inverse of an n × n matrix Q under matrix multiplication, i.e. Q Q^{-1} = I_n holds, where I_n is the n × n identity matrix.
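The posterior in Eq. (4) amounts to a few lines of linear algebra. The sketch below is a direct transcription for a single prediction point, assuming a joint_kernel function as in the kernel sketch given earlier and a caller-supplied prior mean µ_0; the names are our own and the code is illustrative rather than the authors' implementation.

```python
import numpy as np

def posterior_current_task(x, samples, joint_kernel, params, mu0=lambda x: 0.0):
    """Posterior mean and variance of f(0, x) given samples (ell, x_i, y_i, noise_var_i),
    following Eq. (4). joint_kernel and params are as in the kernel sketch above."""
    tasks = [s[0] for s in samples]
    xs = [s[1] for s in samples]
    y = np.array([s[2] for s in samples], dtype=float)
    noise = np.array([s[3] for s in samples], dtype=float)
    n = len(samples)

    # K(X, X) + D(X): prior covariance among the observations plus observation noise.
    K = np.array([[joint_kernel(tasks[i], xs[i], tasks[j], xs[j], params)
                   for j in range(n)] for i in range(n)]) + np.diag(noise)
    # K(x, X): covariance between f(0, x) and the observed points.
    k = np.array([joint_kernel(0, x, tasks[i], xs[i], params) for i in range(n)])

    sol = np.linalg.solve(K, y - np.array([mu0(xi) for xi in xs]))
    mean = mu0(x) + k @ sol                                      # mu_n(x) in Eq. (4)
    var = joint_kernel(0, x, 0, x, params) - k @ np.linalg.solve(K, k)  # sigma^2_{0,n}(x)
    return mean, var
```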
References

[BCd09] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report TR-2009-23, Department of Computer Science, University of British Columbia, 2009.

[BDT12] Anjishnu Banerjee, David B. Dunson, and Surya T. Tokdar. Efficient Gaussian process regression for large datasets. Biometrika, 100(1):75–89, 2012.

[Chi00] Stephen E. Chick. Bayesian methods: Bayesian methods for simulation. In Jeffrey A. Joines, Russel R. Barton, Keebom Kang, and Paul A. Fishwick, editors, Proceedings of the 2000 Winter Simulation Conference, pages 109–118, San Diego, CA, USA, 2000. Society for Computer Simulation International.

[FPD09] Peter I. Frazier, Warren B. Powell, and Savas Dayanik. The knowledge gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599–613, 2009.

[Fra11] P. I. Frazier. Decision-theoretic foundations of simulation optimization. In Wiley Encyclopedia of Operations Research and Management Science. John Wiley and Sons, 2011.

[FSH14] Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Using meta-learning to initialize Bayesian optimization of hyperparameters. In Proceedings of the International Workshop on Meta-Learning and Algorithm Selection (ECAI), pages 3–10, 2014.

[FW16] Peter I. Frazier and Jialei Wang. Bayesian optimization for materials design. In Information Science for Materials Discovery and Design, pages 45–75. 2016.

[FWN+15] Seth Flaxman, Andrew Gordon Wilson, Daniel B. Neill, Hannes Nickisch, and Alexander J. Smola. Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods. In Proceedings of the 32nd International Conference on Machine Learning, pages 607–616, 2015.

[FXC11] Peter I. Frazier, Jing Xie, and Stephen E. Chick. Value of information methods for pairwise sampling with correlations. In Sanjay Jain, Roy Creasey, and Jan Himmelspach, editors, Proceedings of the 2011 Winter Simulation Conference, pages 3979–3991, Phoenix, AZ, USA, 2011. Society for Computer Simulation International.

[GIM+15] Leo N. Geppert, Katja Ickstadt, Alexander Munteanu, Jens Quedenfeld, and Christian Sohler. Random projections for Bayesian regression. Statistics and Computing, pages 1–23, 2015.

[GRS04] Shawn E. Gano, John E. Renaud, and Brian Sanders. Variable fidelity optimization using a kriging based scaling function. In Proceedings of the Tenth AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, 2004.

[GW04] Mevludin Glavic and Louis Wehenkel. Interior point methods: A survey, short survey of applications to power systems, and research opportunities. Technical report, Department of Electrical Engineering and Computer Science, Université de Liège, Sart Tilman, Belgium, 2004.

[HN06] L. Jeff Hong and Barry L. Nelson. Discrete optimization via simulation using COMPASS. Operations Research, 54(1):115–129, 2006.

[JSW98] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.

[JY08] Elizabeth John and E. Alper Yıldırım. Implementation of warm-start strategies in interior-point methods for linear programming in fixed dimension. Computational Optimization and Applications, 41(2):151–183, 2008.

[Kai68] Thomas Kailath. An innovations approach to least-squares estimation – Part I: Linear filtering in additive white noise. IEEE Transactions on Automatic Control, 13(6):646–655, 1968.

[Kle14] Jack P. C. Kleijnen. Simulation-optimization via kriging and bootstrapping: A survey. Journal of Simulation, 8(4):241–250, 2014.

[PWF16] Matthias Poloczek, Jialei Wang, and Peter I. Frazier. Multi-information source optimization with general model discrepancies. ArXiv e-print 1603.00389, 2016. Available via http://arxiv.org/abs/1603.00389.

[RW06] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[SFP11] Warren R. Scott, Peter I. Frazier, and Warren B. Powell. The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM Journal on Optimization, 21(3):996–1026, 2011.

[SSA13] Kevin Swersky, Jasper Snoek, and Ryan P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004–2012, 2013.

[WN15] Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning, pages 1775–1784, 2015.

[XFC12] J. Xie, P. I. Frazier, and S. Chick. Assemble to order simulator. http://simopt.org/wiki/index.php?title=Assemble_to_Order&oldid=447, 2012. Accessed May 9, 2016.

[YW02] E. Alper Yildirim and Stephen J. Wright. Warm-start strategies in interior-point methods for linear programming.