Directed particle swarm optimization with Gaussian-process-based function forecasting
Johannes Jakubik a,b, Adrian Binding a, Stefan Feuerriegel a,∗
a ETH Zurich, Weinbergstr. 56/58, 8092 Zurich, Switzerland
b Karlsruhe Institute of Technology, Kaiserstraße 12, 76131 Karlsruhe, Germany
Abstract
Particle swarm optimization (PSO) is an iterative search method that moves a set of candidate solutions around a search space towards the best known global and local solutions with randomized step lengths. PSO frequently accelerates optimization in practical applications where gradients are not available and function evaluations are expensive. Yet the traditional PSO algorithm ignores the potential knowledge of the objective function that could have been gained from the observations made by individual particles. Hence, we draw upon concepts from Bayesian optimization and introduce a stochastic surrogate model of the objective function. That is, we fit a Gaussian process to past evaluations of the objective function, forecast its shape, and then adapt the particle movements based on it. Our computational experiments demonstrate that baseline implementations of PSO (i. e., SPSO2011) are outperformed. Furthermore, compared to state-of-the-art surrogate-assisted evolutionary algorithms, we achieve substantial performance improvements on several popular benchmark functions. Overall, we find that our algorithm attains desirable properties for exploratory and exploitative behavior.
Keywords:
Forecasting; Gaussian process; Surrogate model; SPSO2011; Particle swarm optimization

∗ Corresponding author.
Email addresses: [email protected] (Johannes Jakubik), [email protected] (Stefan Feuerriegel)
Preprint submitted to arXiv, February 25, 2021

1. Introduction

Stochastic optimization methods refer to optimization methods that incorporate random variables into a search process (Gentle et al., 2012, Chapter 7) and often improve the performance in a large variety of practical settings (Hoos & Stützle, 2005). Stochastic optimization methods are frequently utilized in black-box optimization settings, in particular for derivative-free optimization (Pham & Castellani, 2014; Rios & Sahinidis, 2013). Here, no assumptions regarding the analytic form of the objective function are made and, because of this, gradients are unavailable. That is, one can only query a function f at single points x, for which the corresponding evaluation f(x) is then returned. Such problems are prevalent in numerous applications from engineering, medicine, and economics, among others, where the underlying function is computationally or economically expensive to evaluate (Rios & Sahinidis, 2013).

A prominent example of stochastic search methods is particle swarm optimization (PSO), which presents the focus of this paper. The idea behind PSO stems from population-based simulations of animal behavior (Kennedy & Eberhart, 1995). As such, a set of candidate solutions (i. e., particles) is initialized at random positions, which are then moved around a search space. These then proceed towards a mixture of the best-known swarm position and the particle's best location from its past trajectory. Several variants of the original PSO algorithm have been developed (e. g., SPSO2011; Zambrano-Bigiarini et al., 2013), which we summarize in our review section. For detailed surveys of PSO variants, we refer to Poli et al. (2007) and Bonyadi & Michalewicz (2017).

PSO is straightforward to implement and makes few assumptions on the underlying optimization problem. In particular, the optimization is derivative-free and can also adapt to non-convex problems. As a result, it is used for multi-objective optimization (e. g., Liu et al., 2017; Zouache et al., 2018; Yu et al., 2017; Sethanan & Neungmatcha, 2016), as well as in various areas of application, such as energy (Yu et al., 2017), supply chain management (Hong et al., 2018), and operations research in general (Etgar et al., 2017; Tasgetiren et al., 2007). PSO can further be extended to unconstrained optimization problems (Bonyadi & Michalewicz, 2017), as well as integer programming (Laskari et al., 2002). Convergence guarantees can be made for a wide range of settings (Jiang et al., 2007).

The information actually utilized by PSO and its variants is limited: each particle is merely controlled by its personal best position, as well as the best of a selection of other particles. In other words, no knowledge regarding the past trajectories in the search space is retained, and all corresponding function evaluations are "lost". As a result, particles might re-visit a neighborhood that has previously been explored, thereby leading to potentially redundant function calls (Iqbal & de Oca, 2006). On the other hand, all particle movements are guided towards the best known positions. This strategy could miss a global optimum that is hidden in an unexplored region (cf. Pham & Castellani, 2014) and may lead to convergence to a non-optimum (van den Bergh & Engelbrecht, 2002).

This work aims at improving the search strategy that controls the movements of the particle swarm.
We thus propose a combination of the PSO methodology, where the swarm intelligence leverages a stochastic surrogate model of the objective function. This allows us to estimate the surface of the objective function (i. e., the landscape) from the past search trajectory of the particles. Based on this surrogate model, we can then enhance particle movements in the search process with respect to exploratory and exploitative behavior. More specifically, we develop a variant (A) in which we modify the movements for all particles by incorporating a third direction pointing to the global optima of the surrogate. Further variants re-locate individual particles in order to let them (B) directly exploit global optima suggested by the surrogate model and (C) explore regions with high uncertainty in the surrogate.

Mathematically, we draw upon Gaussian processes (GPs) as the stochastic model for the surrogate. This choice is motivated by common procedures in Bayesian optimization (Snoek et al., 2012) and response surface methods via kriging (Rios & Sahinidis, 2013). In addition, it provides rigorous uncertainty estimates for the function approximation, thereby facilitating explorations of locations that promise the highest probability of improvement in the surrogate model. Our function approximation thus reveals similarities to Bayesian optimization, where one also formulates a stochastic model over the function space (Močkus, 1975). However, one then chooses a sequence of points in order to improve the overall objective, as if a single particle jumps through the search space (see Online Appendix C for details). Different from PSO, these movements are deterministic and can thus not benefit from a randomization in the optimization method.

We finally perform a series of computational experiments in order to demonstrate the improved convergence of our proposed algorithm. For this purpose, we adhere to earlier works that develop modifications to particle swarm optimization, that is, we use the SPSO2011 implementation (Zambrano-Bigiarini et al., 2013). We further follow the literature with regard to the evaluation setting, that is, we use the CEC2013 suite of benchmark functions (Liang et al., 2013; Zambrano-Bigiarini et al., 2013) that is common for evaluating PSO approaches. Our algorithm can outperform SPSO2011 for the entire set of benchmark functions. We find considerable improvements in the convergence in early stages of the iterative search. Statistical tests further demonstrate that the improvements are significant at common significance levels.

Our work entails multiple contributions to the field of stochastic optimization. That is, we present an innovative combination of swarm search and machine learning tools (i. e., Gaussian processes). Thereby, we contribute to the state of the art by suggesting a statistical procedure to better leverage search trajectories in PSO. This introduces strategic control into the otherwise randomized search process. The resulting performance improvements stem from a better trade-off with respect to exploration and exploitation in the stochastic swarm search. This may prove useful in a wide variety of real-world optimization settings where one utilizes PSO.

The rest of this paper is structured as follows. Section 2 reviews the mathematical specification of particle swarm optimization. Based on it, Section 3 proposes our algorithm, which combines PSO with a Gaussian-process-based surrogate model.
Section 4 then performs a series of computational experiments that demonstrate the acceleration in convergence of our proposed algorithms. Section 5 discusses our algorithm, while Section 6 finally concludes.
2. Background
Mathematically, we define the black-box optimization problem as follows. Let $f : D \to \mathbb{R}$ denote an arbitrary (and potentially non-convex) function with a known (closed) domain $D$. Then our goal is to find $x^* \in \arg\min_{x \in D} f(x)$, where we assume that $D \subseteq \mathbb{R}^p$ is connected and $f$ is continuous. In many applications, the underlying function is expensive to evaluate and, in these cases, we might prefer to terminate the search process after a certain number of iterations or when the relative convergence fulfills predefined criteria. The last iteration of the algorithm, with returned solution $A(f)$, can then be assessed via the expectation of a fixed loss function, for example, the mean squared loss
$$\mathbb{E}\left[ \left\| f(A(f)) - \min_{x \in D} f(x) \right\|^2 \right]. \qquad (1)$$

Black-box optimization is often approached by particle swarm optimization. In the following, we provide a brief specification of particle swarm optimization (cf. Bonyadi & Michalewicz, 2017, for a detailed overview). PSO stems from the simulation of socially-coordinated behavior of animal swarms (Kennedy & Eberhart, 1995). It relies on a swarm of particles $j = 1, \ldots, n_{\mathrm{par}}$ serving as candidate solutions. These particles move through the domain of the function in a coordinated fashion, where each particle is guided by individual experience, as well as observations of other particles. For this reason, each particle comes with information $w_j^{(i)} := \big(x_j^{(i)}, v_j^{(i)}, p_j^{(i)}\big)$ in iteration $i$, where

• $x_j^{(i)}$ is the position of particle $j$,
• $v_j^{(i)}$ is its velocity, and
• $p_j^{(i)}$ is its personal best position up to iteration $i$.

During the initialization phase, the particle positions $x_j^{(0)}$ are distributed across $D$. This initialization may be uniformly at random, or according to some other initialization scheme (e. g., a Sobol sequence). The initial velocities $v_j^{(0)}$ are also chosen at random according to a predefined initialization procedure.

In subsequent iterations, the particles move around the domain $D$, updating their respective velocity based on the personal best position $p_j^{(i)}$ and on the swarm intelligence. In the original PSO, the latter component is set via the global best-known position $g^{(i)} = \arg\min_{p_j^{(i)}} f\big(p_j^{(i)}\big)$, while some PSO variants only consider a certain neighborhood $N$ of particles, depending on the particle. This yields the update rules
$$v_j^{(i+1)} = \mathrm{update}_N\big(x_j^{(i)}, v_j^{(i)}, p_j^{(i)}\big), \qquad (2)$$
$$x_j^{(i+1)} = x_j^{(i)} + v_j^{(i+1)}, \qquad (3)$$
$$p_j^{(i+1)} = \begin{cases} x_j^{(i+1)}, & \text{if } f\big(x_j^{(i+1)}\big) < f\big(p_j^{(i)}\big), \\ p_j^{(i)}, & \text{otherwise}. \end{cases} \qquad (4)$$

Accordingly, the velocity update is key in controlling the movements of the swarm in space. In the original PSO (OPSO), the velocity update is given by
$$\mathrm{update}_N\big(v_j^{(i)}, p_j^{(i)}\big) = v_j^{(i)} + \varphi_p R_{p,j}^{(i)} \odot \big(p_j^{(i)} - x_j^{(i)}\big) + \varphi_g R_{g,j}^{(i)} \odot \big(g^{(i)} - x_j^{(i)}\big), \qquad (5)$$
where $R_{p,j}$ and $R_{g,j}$ are random vectors with components drawn from a uniform distribution $U(0, 1)$, $\varphi_p$ and $\varphi_g$ are weights that additionally determine the relative influence of the personal and global best directions, and $\odot$ denotes component-wise multiplication. The vectors $p_j^{(i)} - x_j^{(i)}$ and $g^{(i)} - x_j^{(i)}$ are referred to as the cognitive influence (CI) and social influence (SI), respectively.
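To make the update rules in Eqs. (2)–(5) concrete, the following minimal Python sketch implements one synchronous OPSO iteration. It is illustrative only: the function names, the synchronous update order, and the default values of $\varphi_p$ and $\varphi_g$ are our assumptions and are not taken from the paper or any specific library.

```python
import numpy as np

def opso_velocity_update(x, v, p_best, g_best, phi_p=1.5, phi_g=1.5, rng=None):
    """One velocity update of the original PSO, cf. Eq. (5); all arguments are 1-D arrays."""
    rng = np.random.default_rng() if rng is None else rng
    r_p = rng.uniform(0.0, 1.0, size=x.shape)   # R_p ~ U(0, 1), componentwise
    r_g = rng.uniform(0.0, 1.0, size=x.shape)   # R_g ~ U(0, 1), componentwise
    ci = p_best - x                              # cognitive influence (CI)
    si = g_best - x                              # social influence (SI)
    return v + phi_p * r_p * ci + phi_g * r_g * si

def pso_step(positions, velocities, personal_best, f, phi_p=1.5, phi_g=1.5):
    """One swarm iteration following Eqs. (2)-(4) with the global-best neighborhood."""
    g_best = personal_best[np.argmin([f(p) for p in personal_best])]
    for j in range(len(positions)):
        velocities[j] = opso_velocity_update(positions[j], velocities[j],
                                             personal_best[j], g_best, phi_p, phi_g)
        positions[j] = positions[j] + velocities[j]
        if f(positions[j]) < f(personal_best[j]):
            personal_best[j] = positions[j].copy()
    return positions, velocities, personal_best
```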
Various modifications to the above update rules have been proposed in the literature. For instance, PSO has been extended by adjustments to the method itself (e. g., Yin et al., 2010) or combinations with other optimization schemes (e. g., Fan & Zahara, 2007). For the sake of establishing a benchmark moving forward and focusing on reproducible insights, Bonyadi & Michalewicz (2017) refer to the following variant as the standard PSO (SPSO in short). Another standard version of PSO was introduced by Zambrano-Bigiarini et al. (2013), which is called SPSO2011. Instead of mixing component-wise uniformly random shifts into the velocity vector, particles are accelerated according to a hyper-spherical distribution. The velocity update is here given by
$$\mathrm{update}_N\big(v_j^{(i)}, p_j^{(i)}\big) = \omega v_j^{(i)} + H\Big(G_j^{(i)}, \big\|G_j^{(i)} - x_j^{(i)}\big\|\Big), \qquad (6)$$
where $H(x, r)$ refers to a hyperspherical distribution over the sphere with center $x$ and radius $r$, and $G_j^{(i)}$ is given as $G_j^{(i)} := x_j^{(i)} + \frac{\varphi_p\,\mathrm{CI} + \varphi_g\,\mathrm{SI}}{3}$ (Bonyadi & Michalewicz, 2017). SPSO2011 is regarded as state-of-the-art and should be used for benchmarking (Zambrano-Bigiarini et al., 2013). Hence, it is later used in our experiments.

The update rules described above lead to convergence of the individual particles. This convergence behavior has been studied extensively (Bonyadi & Michalewicz, 2017). There is, however, no guarantee that the particles converge towards a global optimum. Other variants exist, such as modifications that replace the linear position update by arbitrary functions $x_j^{(i+1)} = \xi\big(x_j^{(i)}, v_j^{(i)}\big)$ (cf. e. g., van den Bergh & Engelbrecht, 2002), or which limit the neighborhood $N$ in specific ways (e. g., Bratton & Kennedy, 2007). For a detailed review, we refer to Bonyadi & Michalewicz (2017).

PSO can often require numerous function calls until convergence, which might not be feasible for problems with expensive function evaluations. Here, previous research has proposed to replace certain evaluations of the actual function $f$ by an approximation $\tilde{f}$ (e. g., Bird & Li, 2010; Parno et al., 2012; Regis, 2014b; Sun et al., 2013). Such approaches are also referred to as response surface methods, surrogate models, or fitness approximations. However, in the aforementioned stream of research, the approximation is learned from the function calls to $f$ (i. e., surrogates are predicted based on past observations), while leaving the rest of the algorithm unchanged; the approximation is then merely used as a proxy for $f$, but any probabilistic interpretation does not further guide the search process. Hence, their only similarity with our approach is that these approaches also draw upon a function approximation, but for a completely different purpose.

Following the above rationale, surrogate-assisted PSO has been developed earlier (e. g., Sun et al., 2017; Yu et al., 2018; Wang et al., 2017). Here, algorithms directly model surrogates (e. g., sigmoid-like inertia weights in modified PSO) or use function approximations as a proxy for $f$. In fact, (non-stochastic) function approximation has been shown to improve evolutionary algorithms, as the surrogate can be used to evaluate additional candidate solutions within a local neighborhood, while keeping the number of function calls to $f$ unchanged (Regis, 2014a). Examples are surrogate-assisted variants of Gaussian PSO (e. g., Krohling, 2004; Melo & Watada, 2016; Varma et al., 2013; Barman et al., 2016; Liu et al., 2013; Gao et al., 2020), Bayesian PSO (e. g., Zhang et al., 2015; Chen & Yu, 2017; Kang et al., 2018), and modified PSO (e. g., Tian & Shi, 2018; Liu et al., 2015).
However, surrogate-assisted algorithms are primarily used to speed up runtime (due to fewer evaluations of $f$) but with similar convergence characteristics. That is, the swarm movements are not directly adapted to the surrogate, whereas we use a surrogate to guide the swarm search towards exploration-exploitation.

Bird & Li (2010) compute a local regression of the function $f$ around a certain point $x$ in order to yield an interpolation $\tilde{f}$ within a close neighborhood and then move the worst particle to the best location of the local regression. This approach resembles our algorithm (B), but the specification in Bird & Li (2010) is restricted to a local neighborhood and can thus take only a small region into consideration. Conversely, our Gaussian process regression computes the stochastic function approximation from all particles and thus yields a global stochastic function approximation. Thereby, it is able to relocate particles not locally but globally. In addition, Bird & Li (2010) merely address local exploitation but ignore our idea of further exploration.

Several attempts have been made to use surrogate models for directing a random search, yet outside of PSO. Liu et al. (2013) propose a surrogate-assisted genetic algorithm based on Gaussian processes (GPEME for short). Here, the surrogate is used for prescreening potential function evaluations, whereas our algorithm integrates the surrogate in a way that it guides the search process (towards exploration-exploitation). There are further surrogate-assisted evolutionary algorithms for hierarchical swarm searches (Yu et al., 2018), committee-based active learning approaches (Wang et al., 2017), and co-operative optimization (Sun et al., 2017). However, different from the above, we develop a PSO variant on top of a Gaussian-process-based surrogate that leverages the uncertainty estimates in the GP, so that the swarm search is driven by both exploration and exploitation.

3. PSO with Gaussian-process-based swarm search

Our previous literature review has shown that particle swarm optimization suffers from returning local instead of global optima, as well as from ignoring valuable information about the underlying functional landscape. Accordingly, our goal is to extend the strengths of this stochastic, population-based search method for black-box optimization problems with a more efficient search strategy. For this reason, we propose a combination of the PSO mechanism with a stochastic surrogate model of the objective function, so that the swarm search can be directed strategically.

The stochastic surrogate model sheds light on the objective, and we can leverage its landscape in the search process of PSO. Thereby, we aim at addressing two problems of traditional PSO algorithms, namely, the risk of finding local optima and redundant function evaluations, i. e., slow convergence in terms of evaluations. Our remedy to both is provided by learning a stochastic surrogate model: it can identify regions that have only barely been explored and where we thus face a high uncertainty regarding the landscape. On the basis of this, we can direct the swarm towards both exploration and exploitation.

Key to the strategic search is an approximation of the underlying objective function. This is formulated as a learning task in which a stochastic model for the underlying function (surrogate model) is fitted. In this, we make use of ideas from the realm of Bayesian optimization to augment our particle swarm through a Gaussian-process-based surrogate model.
The latter then guides the swarm towards regions that entail the optimum in our approximation of the objective, or where we have great uncertainty in our knowledge of the underlying function. The idea of a stochastic response surface is also the basis of Bayesian optimization methods; we give a review in Online Appendix C. We choose a Gaussian-process-based surrogate model for several reasons: (1) Gaussian processes are common for modeling uncertainty in multivariate spaces (cf. Ackermann et al., 2011; Buche et al., 2005). The underlying Gaussian distribution makes few assumptions on the actual curvature of the function. (2) Gaussian processes are fairly parsimonious (i. e., they have fewer model parameters as compared to many other surrogate models such as, e. g., artificial neural networks or support vector machines) and are thus beneficial when the swarm search still has little information. (3) Gaussian distributions (that underlie GPs) are found to be highly effective in state-of-the-art Bayesian optimization (cf. Snoek et al., 2012; Swersky et al., 2013).

In principle, the stochastic function model can facilitate swarm movements with regard to both exploration and exploitation, yet it is unclear whether it should influence purely the search directions, or even the velocity governing the step sizes. In the following, we thus suggest different approaches to how we integrate the Gaussian-process-based surrogate model into the search process and adapt the corresponding swarm intelligence:
(A) PSO with heuristic direction. Traditional implementations of particle swarm optimization move the walkers towards a mixture of personal and global best-known solutions. Conversely, we extend the velocity update and introduce a third direction that points towards the expected optimum from the approximation. This changes the velocity update fundamentally: while we previously merely utilized information from positions that we have already visited, the swarm can now even move towards directions that differ from the existing search trajectories. With each iteration, we augment our knowledge of the objective function and obtain an increasingly more accurate estimate of the landscape, including the approximate location of the corresponding optima. As a result, the swarm is additionally attracted by the optimum from the stochastic function model. The optimum of the stochastic function model is considered as an effective guess of the optimum of the function, and we use this information in combination with the cognitive and social influence as a third, heuristic influence on the particles.
(B) PSO with heuristic exploitation. Often, an adjustment of the direction is not sufficient to accomplish a fast convergence behavior, and we thus propose a variant where we directly relocate unsuccessful particles to the optimum in the approximation model. In other words, the particle is moved to the best average location given the stochastic function model. As a result, this accelerates convergence, as we explicitly query points which are likely to improve our swarm according to the stochastic function model, as opposed to relying on sheer luck. At the same time, the stochastic movement of the other particles gives us continuous exploitation of the best we know. This corresponds to the combination of PSO with a response surface, albeit with a very general framework for response surfaces.
(C) PSO with heuristic exploration. The previous version ignores uncertainty in the stochastic function model for the most part, and thus does not have any incentive for exploration. For this reason, we utilize the stochastic nature of the GP model, which allows us to obtain uncertainty estimates for the stochastic function model. We further modify the exploration behavior of the swarm and, in each iteration, relocate one particle with the worst function value to a location with high uncertainty in the stochastic function model. As a result, this search strategy entails an explicit and formalized exploration process, which differs from the swarm movements in traditional variants of PSO. Mathematically, it also allows for a guaranteed convergence of the same complexity class as Bayesian optimization.

In the following, we provide a definition of Gaussian processes as our function approximation and, based on this, we subsequently specify the actual optimization routines.
A Gaussian process (GP) over a domain $D$ is given by a mean function $m : D \to \mathbb{R}$ and a covariance function $K : D \times D \to \mathbb{R}$. The covariance function must define a valid covariance form, i. e., it is a positive semi-definite kernel over the space $D$; definiteness corresponds to non-degenerate GPs. Now let $\mathrm{GP}(m, K)$ denote the distribution given by the GP with mean $m$ and covariance $K$, which is given by the following definition. We refer to Rasmussen & Williams (2008) for definitions, posteriors, and common choices of covariance functions.

Definition 3.1 (Gaussian Process)
For any finite set of points $X = \{x_1, \ldots, x_n\}$, $f \sim \mathrm{GP}(m, K)$ implies
$$f(X) \sim \mathcal{N}\big(m(X), K(X, X)\big), \qquad (7)$$
where $f(X)$ is the vector with $f(X)_i = f(x_i)$, $m(X)$ the vector with $m(X)_i = m(x_i)$, and $K(X, X)$ is the matrix with entries $K(X, X)_{i,j} = K(x_i, x_j)$.

Remark 3.2 (Gaussian Process Posterior)
Let $Y = \{y_1, \ldots, y_m\}$ be another set of points, and let $K(X, Y)$ be the matrix with $K(X, Y)_{i,j} = K(x_i, y_j)$. The definitions of $K(Y, X)$ and $K(Y, Y)$ are analogous. Given an observation of $f(X)$, the posterior distribution over $f$ is then given by
$$f(Y) \sim \mathcal{N}\Big(m(Y) + K(Y, X)\,K(X, X)^{-1}\big(f(X) - m(X)\big),\; K(Y, Y) - K(Y, X)\,K(X, X)^{-1}K(X, Y)\Big) \qquad (8)$$
for any set $Y \subset D$.

As one can readily see, this means that the posterior distribution of a GP is again a GP with modified $m$ and $K$. Thus, any posterior evaluation $f(Y)$ follows a multivariate normal distribution. We recall that the marginals of a multivariate normal distribution are also normal distributions. This allows us to efficiently calculate the distribution of function values of the GP model at arbitrary locations.
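As an illustration of Eq. (8), the following sketch computes the posterior mean and covariance with NumPy. It assumes hypothetical helpers K(A, B) and m(A) that return the covariance matrix and prior mean vector; the small jitter term is a common implementation choice for numerical stability rather than part of the formula.

```python
import numpy as np

def gp_posterior(K, m, X, fX, Y):
    """Posterior mean and covariance of a GP at test points Y given observations f(X), cf. Eq. (8).

    K(A, B): covariance matrix between point sets A and B; m(A): prior mean vector;
    X, Y are arrays of shape (n, d) and (k, d), fX is the observed vector f(X)."""
    Kxx = K(X, X) + 1e-10 * np.eye(len(X))      # jitter for numerical stability
    Kyx, Kyy = K(Y, X), K(Y, Y)
    alpha = np.linalg.solve(Kxx, fX - m(X))     # K(X, X)^{-1} (f(X) - m(X))
    post_mean = m(Y) + Kyx @ alpha
    post_cov = Kyy - Kyx @ np.linalg.solve(Kxx, Kyx.T)
    return post_mean, post_cov
```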
The choices we must make when specifying a GP as a stochastic surrogate model for our function are the covariance $K$, as well as the mean $m$. In practice, one sets $m = 0$ and concentrates on adapting $K$ to the problem setting (Rasmussen & Williams, 2008, Chapter 4). This allows the user to encapsulate, a priori, known properties of the function, and we later give guidance concerning this choice.

We now define our stochastic surrogate model based on which we approximate the objective function. Let $\theta$ denote the additional parameters in a covariance $K_\theta$. Then, the interpolation requires one to identify suitable parameters $\theta$. Their values can be inferred from the observed data based on a maximum likelihood estimation
$$\arg\max_\theta \; p\big(f(X) \mid \theta\big) \qquad (9)$$
using
$$p\big(f(X) \mid \theta\big) \propto \exp\left( -\tfrac{1}{2}\, f(X)^T K_\theta(X, X)^{-1} f(X) - \tfrac{1}{2} \log \det K_\theta(X, X) \right), \qquad (10)$$
which gives the maximum a posteriori estimate of $\theta$. The above optimization can be solved with a numerical optimization procedure of choice, e. g., quasi-Newton methods. In our case, we use L-BFGS-B (Byrd et al., 1995) from the Python package scipy, with 10 restarts from random points on the parameter grid so as to ensure that we find a global optimum, or at least a sufficient local optimum.

Even though this optimization problem is generally tractable, the matrix inversion in Equation (10) requires $O(n^3)$ floating-point operations per evaluation. While this may not prove to be a bottleneck depending on the application, it does make a comparison over a wide variety of functions with multiple runs per function prohibitively expensive for our algorithms. This is why we use a greedy heuristic to only retain certain evaluations when fitting our stochastic surrogate model. That is, we use only a subset $M \subset X$ of our observations, our memory, and fit the model via $\arg\max_\theta p\big(f(M) \mid \theta\big)$. We describe the selection process in more detail in the following section.
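The hyperparameter fit of Eqs. (9)–(10) can be sketched as follows, using L-BFGS-B from scipy with random restarts as described above. The kernel interface, the bounds, and the restart logic are illustrative assumptions; the constant term of the Gaussian log-likelihood is omitted since it does not affect the maximizer.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(theta, X, fX, kernel):
    """Negative log-likelihood of f(X) under GP(0, K_theta), cf. Eqs. (9)-(10)."""
    K = kernel(theta, X, X) + 1e-8 * np.eye(len(X))
    L = np.linalg.cholesky(K)                            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, fX))  # K^{-1} f(X)
    return 0.5 * fX @ alpha + np.sum(np.log(np.diag(L)))  # 0.5 f^T K^{-1} f + 0.5 log det K

def fit_hyperparameters(X, fX, kernel, bounds, n_restarts=10, rng=None):
    """Maximize the marginal likelihood with L-BFGS-B and random restarts."""
    rng = np.random.default_rng() if rng is None else rng
    best = None
    for _ in range(n_restarts):
        theta0 = np.array([rng.uniform(lo, hi) for lo, hi in bounds])
        res = minimize(neg_log_marginal_likelihood, theta0, args=(X, fX, kernel),
                       method="L-BFGS-B", bounds=bounds)
        if best is None or res.fun < best.fun:
            best = res
    return best.x
```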
3.4. Efficient learning of the stochastic surrogate model
We only retain a subset $M \subset X$ of our observations. The corresponding selection is made by a function $\chi$ which extracts the points that are considered most informative. We define it in the following.

Recall that $w_j^{(i)} := \big(x_j^{(i)}, v_j^{(i)}, p_j^{(i)}\big)$ is particle $j$ at step $i$ (cf. Section 2). We write $W^{(i)}$ for the set of particles at step $i$, $M^{(i)} = \{y_1, \ldots, y_k\}$ for the set of observations we keep in memory, and use the notation $f(W^{(i)}) = \big(f(x_1^{(i)}), \ldots, f(x_{n_{\mathrm{par}}}^{(i)})\big)$ as for $f(X)$ in Section 3.2. Further, we write $f(M^{(i)} \cup W^{(i)}) = \big(f(x_1^{(i)}), \ldots, f(x_{n_{\mathrm{par}}}^{(i)}), f(y_1), \ldots, f(y_k)\big)$.

As outlined in the previous section, we use a greedy heuristic to only retain a subset of the locations and evaluations at each iteration in $M^{(i)}$. At each time step, we use the current stochastic surrogate model $\mathrm{GP}(m, K_\theta)$ and set
$$\chi\big(m, K_\theta, W^{(i)}\big) = \left\{ x_j^{(i)} \;\middle|\; f\big(x_j^{(i)}\big) \notin \left[ m\big(x_j^{(i)}\big) - \rho \sqrt{K_\theta\big(x_j^{(i)}, x_j^{(i)}\big)},\; m\big(x_j^{(i)}\big) + \rho \sqrt{K_\theta\big(x_j^{(i)}, x_j^{(i)}\big)} \right] \right\} \qquad (11)$$
for a fixed value of $\rho$. We set $\rho$ consistently to correspond to the 75 % confidence interval, i. e., $\rho \approx 1.15$.
This heuristic corresponds to only keeping informative observations, that is, observations which change our assumptions about the function. All observations which appear likely under the current stochastic surrogate model are forgotten. We additionally keep all of the most recent observations in memory, to ensure that our stochastic function model is precise near our swarm.
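A minimal sketch of the selection rule in Eq. (11): given the posterior mean and variance of the surrogate at the current particle positions, only particles whose observed values fall outside the $\rho$-band are kept. The use of the posterior standard deviation (the square root of $K_\theta(x, x)$) and the function name are our assumptions.

```python
import numpy as np

def select_informative(post_mean, post_var, f_vals, rho=1.15):
    """Indices of particles whose observed value falls outside the rho-band of the
    current surrogate, cf. Eq. (11); only these are added to the memory M."""
    band = rho * np.sqrt(post_var)               # rho * sqrt(K_theta(x, x))
    outside = np.abs(f_vals - post_mean) > band
    return np.flatnonzero(outside)
```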
We now formalize the different optimization methods combining both PSO and the Gaussian-process-based surrogate model. To this end, we first specify the general form that is shared across all three algorithms (see Algorithm 1) and later discuss the individual modifications. The different algorithms only differ in the update rule that is given in Step 6 of Algorithm 1, where each variant inserts its own rule. The general layout of Algorithm 1 first consists of initializations, where the initial particle positions are determined and the initial set of points for the stochastic function approximation is set. A subsequent loop moves the swarm around the function space. Within this loop, we fit the stochastic function model in Steps 4 and 5. Then, Step 6 moves the particles around, while Step 7 updates the memory that underlies our heuristic for accelerating the learning of the GP-based surrogate model. Following Algorithm 1, we give the update rules corresponding to the individual algorithm variants.
Algorithm 1: General form of PSO with a Gaussian-process-based surrogate model.
Input: Domain $D$, parameterized family of kernels $K_\theta$, function $f$
Output: Approximate global minimum $x^*$ of $f$
1: Initialize the set of $n_{\mathrm{par}}$ particles $W^{(0)}$ uniformly at random over $D$ with multivariate normal velocities
2: Set $M^{(0)} \leftarrow W^{(0)}$
3: for $i = 1$ to convergence do
4:   Fit $\hat{\theta}$ via maximum likelihood of $f\big(M^{(i-1)} \cup W^{(i-1)}\big) \mid \mathrm{GP}(m, K_\theta)$
5:   Compute the stochastic function approximation $\mathrm{GP}(\tilde{m}, \tilde{K}_{\hat{\theta}})$ as the posterior of $\mathrm{GP}(m, K_{\hat{\theta}})$ for the observations $f\big(M^{(i-1)} \cup W^{(i-1)}\big)$
6:   Move particles via $W^{(i)} \leftarrow \mathrm{update}\big(\tilde{m}, \tilde{K}_{\hat{\theta}}, W^{(i-1)}\big)$, where the update rule depends on the algorithm variant
7:   Update memory via $M^{(i)} \leftarrow M^{(i-1)} \cup \chi\big(\tilde{m}, \tilde{K}_{\hat{\theta}}, W^{(i-1)}\big)$
8: end for
9: return Global best solution from $W^{(\mathrm{end})}$

(A) PSO with heuristic direction. This algorithm extends the traditional velocity update by an additional direction towards the optimum from the stochastic surrogate model. More precisely, we take the mean of the Gaussian process as a response surface, and use the minimum of this response surface as a heuristic direction, i. e., $h^{(i)} = \arg\min_{x \in D} m(x)$. We then set the heuristic influence (HI) of particle $j$ to $h^{(i)} - x_j^{(i)}$ and use the update rule
$$v_j^{(i+1)} = \omega v_j^{(i)} + \varphi_p R_p \odot \mathrm{CI} + \varphi_g R_g \odot \mathrm{SI} + \varphi_h R_h \odot \mathrm{HI} \qquad (12)$$
in the SPSO algorithm. As a result, the convergence analysis of SPSO found in Jiang et al. (2007) readily extends to this form, assuming convergence of the heuristic minimum $h^{(i)}$.

The question remains as to how we wish to weight the factors $\varphi_p$, $\varphi_g$, and $\varphi_h$ in relation to each other. We evaluate three settings (A1)–(A3) with different parameters. For all other parameters, we use the intuition from SPSO to set the parameters of the PSO to sensible values; in practice, this means that we retain $\omega$ as in SPSO, and let $\varphi_p + \varphi_g + \varphi_h \approx \psi_p + \psi_g$, where $\psi_p$ and $\psi_g$ are the corresponding cognitive and social weights from SPSO. A minimal sketch of this update is given below.
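The following sketch illustrates the velocity update of Eq. (12) for variant (A); the argument names and the way the surrogate minimizer $h$ is passed in are illustrative assumptions.

```python
import numpy as np

def velocity_update_heuristic(x, v, p_best, g_best, h_opt, omega,
                              phi_p, phi_g, phi_h, rng=None):
    """Variant (A): velocity update of Eq. (12) with the additional heuristic influence
    HI = h - x, where h is the minimizer of the surrogate mean."""
    rng = np.random.default_rng() if rng is None else rng
    r_p, r_g, r_h = (rng.uniform(0.0, 1.0, size=x.shape) for _ in range(3))
    ci, si, hi = p_best - x, g_best - x, h_opt - x
    return omega * v + phi_p * r_p * ci + phi_g * r_g * si + phi_h * r_h * hi
```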
(B) PSO with heuristic exploitation. This algorithm accelerates exploitation, as the "worst" particle in the swarm moves to the optimum of the stochastic function approximation. All other particles move according to the previous update rules. At each step, we choose the worst particle $j^* \in \arg\max_{j = 1, \ldots, n_{\mathrm{par}}} f\big(x_j^{(i)}\big)$ and determine its new location from the GP mean, i. e., $x^* \in \arg\min_{x \in D} m(x)$. Its new location and velocity are initialized via
$$x_{j^*}^{(i+1)} = x^*, \qquad (13)$$
$$v_{j^*}^{(i+1)} \sim \mathcal{N}(0, 1) \ \text{componentwise}, \qquad (14)$$
$$p_{j^*}^{(i+1)} = \begin{cases} p_{j^*}^{(i)}, & \text{if } f(x^*) > f\big(p_{j^*}^{(i)}\big), \\ x^*, & \text{otherwise}. \end{cases} \qquad (15)$$
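The relocation step of variant (B), cf. Eqs. (13)–(14), may be sketched as follows. The inner minimization of the surrogate mean uses L-BFGS-B as an assumed choice and is warm-started at a user-chosen point x_start; the personal-best update of Eq. (15) is applied afterwards once $f(x^*)$ has been evaluated.

```python
import numpy as np
from scipy.optimize import minimize

def relocate_worst_exploit(positions, velocities, f_vals, surrogate_mean,
                           bounds, x_start, rng=None):
    """Variant (B): move the worst particle to the minimizer of the surrogate mean."""
    rng = np.random.default_rng() if rng is None else rng
    j = int(np.argmax(f_vals))                             # worst particle under minimization
    res = minimize(surrogate_mean, x_start, method="L-BFGS-B", bounds=bounds)
    positions[j] = res.x                                   # x_j <- x* = argmin m(x), Eq. (13)
    velocities[j] = rng.standard_normal(size=res.x.shape)  # v_j ~ N(0, 1) componentwise, Eq. (14)
    return j, res.x
```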
(C) PSO with heuristic exploration. Here we use a procedure similar to version (B), but now choose the new location according to a criterion encapsulating the uncertainty of the function value at this point. The only difference appears in our choice of the location $x^*$. We introduce an acquisition function $R$ which incorporates the posterior variance $\sigma^2(x^*) := K(x^*, x^*)$. The new location is then chosen by $x^* \in \arg\max_{x \in D} R(K, m, x)$. Here we experiment with two choices of $R$. In version (C1), we let the worst particle move to the optimum of the lower bound of the 90 % confidence interval, that is, $R(K, m, x) = -\big(m(x) - \rho\,\sigma(x)\big)$ for $\rho \approx 1.6$. In the other version (C2), we move it to the position of maximal uncertainty, that is, we use $R(K, m, x) = \sigma(x)$. Both versions perform added exploration, either by explicitly querying high-uncertainty points, or by being very optimistic as to the location value.
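The two acquisition functions of variants (C1) and (C2) can be written compactly as below; evaluating them on a finite candidate set (rather than performing a continuous inner optimization) is a simplification we make for illustration.

```python
import numpy as np

def acquisition_c1(mean, var, rho=1.6):
    """Variant (C1): negative lower confidence bound, R = -(m(x) - rho * sigma(x))."""
    return -(mean - rho * np.sqrt(var))

def acquisition_c2(mean, var):
    """Variant (C2): pure posterior standard deviation, R = sigma(x)."""
    return np.sqrt(var)

def explore_target(candidates, post_mean, post_var, acquisition=acquisition_c1):
    """Pick the candidate location maximizing the acquisition value, x* = argmax_x R(x)."""
    scores = acquisition(post_mean, post_var)
    return candidates[int(np.argmax(scores))]
```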
The greedy heuristic in the stochastic function approximation makes a direct transfer of convergence guarantees difficult. However, if we include all past values in our Gaussian process, we can guarantee the same convergence behavior as the associated Bayesian optimization scheme, albeit with a higher number of concurrent function evaluations.

Each of the algorithms (A)–(C) relies internally on finding a new location from the stochastic function approximation. We note here that this identification is computationally efficient, as the location is either analytically known or one can use numerical optimization (e. g., quasi-Newton methods). Multiple restarts are replaced by initializing the search with the location of the current optimum.

3.6. Choice of covariance function

We consistently chose the covariance function
$$K(x, y) = \alpha_1 \exp\!\left( -\frac{\|x - y\|^2}{\rho^2} \right) + \alpha_2 + \alpha_3\,\mathbb{1}\{x = y\}, \qquad (16)$$
where the parameters $\alpha_1$, $\alpha_2$, $\alpha_3$, and $\rho$ are adapted by the estimation process. This choice of covariance corresponds to the assumption that our function has the form $f(x) + m + \varepsilon$, where $m$ is a global mean, $f$ is a smooth function with mean 0, and $\varepsilon$ is white noise i. i. d. at every point. This residual white noise is introduced to account for non-smoothness in the underlying function.
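For a single pair of points, the covariance of Eq. (16) may be sketched as follows; the squared-exponential form of the smooth term reflects the smoothness assumption stated above and should be treated as one possible reading, and the function name is ours.

```python
import numpy as np

def covariance(x, y, alpha1, alpha2, alpha3, rho):
    """Kernel of Eq. (16): smooth squared-exponential part, a constant offset for the
    global mean, and a white-noise term that only acts on the diagonal (x == y)."""
    sq_dist = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
    smooth = alpha1 * np.exp(-sq_dist / rho ** 2)
    noise = alpha3 if np.array_equal(x, y) else 0.0
    return smooth + alpha2 + noise
```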
We give a short example of the proposed algorithm. For this, we look at several snapshots of the algorithm running on the two-dimensional Ackley function given by
$$f(x, y) = -20 \exp\!\left( -0.2 \sqrt{0.5\,(x^2 + y^2)} \right) - \exp\!\big( 0.5\,(\cos 2\pi x + \cos 2\pi y) \big) + e + 20, \qquad (17)$$
with $e$ as Euler's number. The domain is set to a symmetric square around the origin, for which the function has a global minimum of 0 at $x = y = 0$.
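For reference, the benchmark surface of Eq. (17) in a few lines of Python; the function name is ours.

```python
import numpy as np

def ackley_2d(x, y):
    """Two-dimensional Ackley function of Eq. (17); global minimum f(0, 0) = 0."""
    term1 = -20.0 * np.exp(-0.2 * np.sqrt(0.5 * (x ** 2 + y ** 2)))
    term2 = -np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
    return term1 + term2 + np.e + 20.0
```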
As an illustrative example, we now execute PSO with heuristic exploration, more precisely, variant (C1) of our algorithm. We use 10 walkers and otherwise the same parameters as in Section 4. Figure 1 depicts different iterations of the algorithm, namely, after initialization, after 6 steps (i. e., 70 function evaluations), and after 18 steps (i. e., 190 function evaluations). After initialization, the stochastic surrogate model has learned little about the true nature of the wavy objective function and, hence, its mean suggests a rather smooth curve with a high variance associated to it. The acquisition function shows the lower bound of the 90 % confidence interval. Accordingly, the heuristic exploration guides the algorithm to further explore areas that promise high chances of improvement while taking the point-specific uncertainty into consideration.

As the algorithm progresses, the particles quickly converge towards the optimum in the center of the plot, while the mean surface yields a better resolution and serves as an increasingly accurate function approximation. Hence, the proposed location from the heuristic exploration quickly coincides with the true optimum. We can also see the convergence of the particles towards the global optimum of the Ackley function at $x = y = 0$. After 18 steps, we can barely recognize the position of individual particles as these are so closely scattered around the optimum. Here we also note the low variance of the approximation around the optimum. Similarly, the acquisition function also attains its minimum around $x = y = 0$ and thus guides the swarm to explore this area.

(Panels in Figure 1, from left to right: objective function, approximation (mean), approximation (variance), acquisition function.)

Figure 1: Illustrative execution of PSO with heuristic exploration, i. e., variant (C1). Shown is the state of the algorithm after initialization, after 6 steps, and after 18 steps. Highlighted are the global optimum (yellow cross), the current heuristic exploration (white circle), the current best solution (orange square), and the particles (brown dots). The algorithm quickly learns a crisp approximation of the true objective and, based on it, guides the search direction of the particle swarm. Darker colors correspond to lower values. After 18 steps, almost all particles have converged to the true optimum and are so closely scattered around x = y = 0 that one can barely recognize them any longer.

4. Computational experiments

We performed an extensive series of computational experiments. For reasons of comparability, we adhere to the CEC2013 benchmark functions and adopt the same experimental setup as in Liang et al. (2013). The benchmark functions are reviewed in Online Appendix A. These are challenging in different ways (e. g., their landscape suffers from multi-modality and poor scaling).

In accordance with the literature (Bonyadi & Michalewicz, 2017), we experiment with SPSO2011 as the prime benchmark algorithm. In our experiments, we have set the parameters of SPSO2011 according to the suggestions in Zambrano-Bigiarini et al. (2013) for reasons of comparability (details in Table 1). All PSO variants are compared with 50 particles (Yu et al., 2018). In practice, PSO is often used in settings where function evaluations are computationally or economically costly (Rios & Sahinidis, 2013). This is because, for instance, complex industrial simulations are involved (e. g., Hong et al., 2018). This is the setting to which our algorithms are tailored and, hence, our evaluations compare the efficiency of the algorithms after 100 · D function evaluations, following Liu et al. (2013) and Varma et al. (2013). For reasons of space, we report the results for ten-dimensional (D = 10) objective functions. In addition, we utilize Bayesian optimization as another comparison, as our approach borrows concepts from the underlying GP-based approximation of functions, though the stochastic process in PSO yields a completely different search strategy. We evaluated Bayesian optimization in the standard parallelized implementation of the Python package
GPyOpt, which uses local penalization (González et al., 2015). For each experiment, we measure the performance regarding the mean, minimum, maximum, median, and the standard deviation across 51 independent runs, as in (Liang et al., 2013; Zambrano-Bigiarini et al., 2013).
Algorithm Parameters
Version (A1): ω = 0. , φ_p = 1. , φ_g = 1. , φ_h = 0.
Version (A2): ω = 0. , φ_p = 1. , φ_g = 0. , φ_h = 0.
Version (A3): ω = 0. , φ_p = 0. , φ_g = 1. , φ_h = 0.
Version (B): ω = 0. , φ_p = 1. , φ_g = 1.
Version (C1): ω = 0. , φ_p = 1. , φ_g = 1. , R(m, K, x) = −(m(x) − 1.6 σ(x))
Version (C2): ω = 0. , φ_p = 1. , φ_g = 1. , R(m, K, x) = −σ(x)
SPSO2011: ω = 1/(2 ln 2), φ_p = 0. , φ_g = 0.
Bayesian optimization: GPyOpt library

Table 1: Overview of the evaluated algorithms, where (A) refers to our PSO variant with a GP-based heuristic direction, (B) is the exploitation variant, and (C) is the exploration variant.

4.2. Numerical performance

The results of the computational experiments are provided in Tables 2 and 3. Overall, the variants of our proposed algorithm outperform SPSO2011 and Bayesian optimization (BO) in all 28 experiments. The improvements are as large as 64 %. In sum, the performance of the proposed algorithm strongly exceeds the performance of previous implementations of PSO.

We now discuss the performance of the different variants of our proposed search strategy. Overall, our evaluation favors variant (A3), which regularly outperforms the other variants (i. e., in 13 out of 28 experiments), followed by variant (B) (7 out of 28 experiments). The favored variant (A3) improves the performance of SPSO2011 for multimodal experiments (e. g., f15) by up to 64 % and for composition functions (e. g., f23) by up to 36 %. Furthermore, variant (A3) outperforms both the other variants of our algorithms and SPSO2011 for multimodal functions in 9 of 15 experiments. For composition functions, we observe a similar result. Here, variant (A3) achieves the best overall performance in 4 out of 8 experiments. Therefore, the explorative implementation of PSO with heuristic direction (A3) drives the search more effectively to the minimum of complex functions.

In comparison, variants (A1) and (A2) are each superior in only a single experiment. Both variants (C1) and (C2) outperform the remaining alternatives of our algorithms and benchmarks in 3 experiments. Recall that variant (A) of our PSO introduces a heuristic direction as a third proponent for the swarm. Here we vary the weights of the cognitive, social, and heuristic influence. Variant (B) comprises a heuristic exploitation inside the PSO algorithm, while variant (C) accentuates exploration, either with regard to a lower confidence bound (C1) or the highest uncertainty (C2). In sum, heuristic exploitation in the PSO algorithm facilitates the optimization of complex functions as compared to the heuristic exploration. We obtain the overall best results when drawing upon a heuristic direction in the PSO algorithm.

We further compare the performance through statistical tests (see Online Appendix B). That is, we compare the performance of the favored variant (A3) against the performance of the benchmark algorithms to see whether the improvement is statistically significant. For a significance threshold of α = 5 %, variant (A3) is superior over SPSO2011 for 27 of the 28 objective functions at a statistically significant level. This confirms the effectiveness of the proposed algorithms.

Function f(x*) | Variant (A1): Min Median Mean Max SD | Variant (A2): Min Median Mean Max SD
Unimodal functions
f1 -1400 | -1397.5872 -1383.4696 -1361.6633 -1043.4621 63.3769 | -1396.0733 -1341.5089 -611.4317 1.2080e+05 2377.0420
f2 -1300 | 2.6277e+06 2.5977e+07 3.9917e+07 2.2339e+08 4.7645e+07 | 7.4935e+06 5.2075e+07 1.0334e+08 4.1734e+08 1.0360e+08
f3 -1200 |
f5 -1000 | -951.5490 -70.1066 662.2853 1.2127e+05 2238.8980 | -619.4898 1419.3945 3961.1237 3.178e+05 6624.0905
Multimodal functions f6 -900 -893.0991 -802.8364 -782.0918 -133.5772 112.9641 -866.3261 -697.4170 -603.4517 451.9740 268.6724f7 -800 -768.6168 -625.5504 -494.6803 2049.3854 475.6174 -722.3400 -541.1433 -117.2380 3836.2694 993.0082f8 -700 -679.3985 -679.0972 -679.0835 -678.7793 0.1434 -679.3306 -679.0104 -679.0234 -678.8371 0.1278f9 -600 -597.7262 -591.5365 -591.9679 -585.6585 3.1091 -591.7658 -590.7448 -590.7448 -589.7239 1.4439f10 -500 -487.8997 -412.2499 -292.6228 863.0087 277.7393 -480.5477 -303.1307 -81.0474 1630.5215 489.4571f11 -400 -361.5090 -336.4743 -334.7380 -289.1876 14.7983 -365.4087 -320.4254 -311.2982 -139.7896 37.1877f12 -300 -265.5353 -227.4410 -226.6150 -157.8276 20.0157 -261.8385 -204.3804 -196.7491 -76.1679 37.4749f13 -200 -164.2729 -129.9400 -127.4905 -76.9984 18.3428 -148.0190 -110.7818 -104.1624 21.3197 32.7836f14 -100 525.3739 1478.9113 1474.5626 2548.0959 457.6333 503.9011 1767.5706 1837.6050 3006.6952 561.8864f15 100 945.4400 1979.8693 1925.1920 3042.6037 440.3659 1177.2460 2150.8842 2111.5859 3060.6540 487.3763f16 200 200.9729 203.0626 203.0612 204.8510 0.8668 201.3839 203.6982 203.5232 206.5214 1.1423f17 300 300.0980 302.1162 307.6036 408.6987 17.6225 300.2047 304.7179 317.5512 543.9282 38.8794f18 400 400.1664 402.5121 405.8163 443.5966 7.9026 400.2131 408.2567 421.2932 606.5287 39.6921f19 500 503.0887 515.4676 1339.9376 2.2403e+05 3400.5543 505.4403 2596.9577 4.2600e+05 5.5700e+06 9.0404e+05f20 600 603.6799 604.6531 604.6227 605.0000 0.3463 604.1277 604.8752 604.7817 605.0000 0.2501
Composition functions f21 700 1196.3619 1527.3981 1543.4502 2760.8438 200.5887 1244.2648 1614.4777 1756.0408 2656.2074 329.0863f22 800 1602.6127 2694.7480 2676.1522 3733.3636 477.6080 1838.6323 2986.3565 2953.3486 3908.0153 502.4572f23 900 1834.8290 3024.2683 2936.5531 4121.2138 510.9749 1952.3371 3108.7894 3131.3746 3953.9239 460.8704f24 1000 1152.8172 1228.8007 1224.1155 1280.8363 24.8739 1179.1119 1239.1245 1238.7175 1331.0587 24.9041f25 1100 1256.1385 1350.9986 1344.9651 1376.9652 28.1815 1262.6632 1359.1513 1358.0976 1382.6490 20.4727f26 1200 1348.2933 1403.5477 1424.4960 1548.7305 57.6909 1363.1698 1418.9811 1437.5508 1547.1016 50.8241f27 1300 1636.9476 1654.0901 1655.2239 1705.1864 11.3101 1642.7216 1660.8510 1668.7147 1817.8577 26.4306f28 1400 1543.1449 2171.8226 2135.7050 2811.7222 233.6992 1625.6656 2409.6956 2526.8903 4038.9449 452.7039
Variant (A3) Variant (B)
Unimodal functions f1 -1400 -1399.8136 -1396.7569 -1395.0071 -1372.3346 5.6353 -1397.5894 -1376.2295 -1338.3509 -677.2732 109.2581f2 -1300 1.0587e+06 7.9735e+06 1.2012e+07 7.7522e+07 1.2741e+07 9.7927e+05 1.0826e+07 1.5301e+07 7.3415e+07 1.5404e+07f3 -1200 1.5196e+06 7.1068e+08 2.3204e+09 1.7155e+10 4.1159e+09 1.7493e+07 9.7294e+08 2.8782e+09 2.0762e+10 4.4020e+09f4 -1100 7876.3791 4.0557e+05 4.0105e+05 7.8208e+05 1.5994e+05 8604.8093 4.5064e+05 5.3138e+05 3.7933e+06 5.0677e+05f5 -1000 -930.3052 -549.1494 -381.1171 1582.2375 556.9127 -994.8325 -884.9113 -766.3978 822.9534 314.2315
Multimodal functions f6 -900 -897.4271 -860.0267 -848.4268 -767.2946 37.2299 -888.4831 -820.1558 -823.0307 -682.8588 49.9664f7 -800 -772.9237 -691.0189 -684.6283 -574.2905 45.7213 -782.9525 -683.7158 -682.9493 -467.0235 56.9393f8 -700 -679.4831 -679.1640 -679.1858 -678.9770 0.1205 -679.5066 -679.1245 -679.1450 -678.9398 0.1396f9 -600 -598.4306 -594.8041 -594.4478 -589.2729 2.2898 -591.7658 -590.7448 -590.7448 -589.7239 1.4439f10 -500 -498.2982 -474.0979 -451.9723 -262.2669 54.8691 -494.7935 -448.2590 -419.9692 136.3982 112.7522f11 -400 -378.8572 -356.2611 -355.2700 -330.1124 10.8450 -384.9150 -348.6396 -347.7882 -310.3917 18.1133 f12 -300 -265.9925 -248.3478 -247.0188 -226.2649 10.2883 -272.1120 -241.0569 -235.7232 -129.0170 25.8599 f13 -200 -178.7481 -148.8482 -146.9601 -121.7209 12.9224 -179.8161 -135.8240 -133.9180 -85.5803 18.6451 f14 -100 f17 300
Composition functions f21 700 f25 1100 f27 1300 1626.7491 1640.0832 1641.3110 1666.9729 8.1147 f28 1400 1515.9583 2141.2106 2043.2591 2265.9658 225.6466 1549.1988 2141.2374 1947.3751 2151.6891 344.8704
Table 2: Performance of PSO with a GP-based heuristic on CEC2013 benchmark experiments. Bold: Variantwith lowest min value. unction f ( x ∗ ) Min Median Mean Max SD Min Median Mean Max SDVariant (C1) Variant (C2) Unimodal functions f1 -1400 -1395.6798 -1371.9866 -1338.8192 -665.6236 114.7387 -1396.1872 -1371.9617 -1355.3548 -1229.9828 42.2859f2 -1300 -997.2929 -818.1903 -750.7602 1492.2827 388.5169
Multimodal functions f6 -900 -895.3502 -821.1045 -822.4812 -671.7844 52.6209 -894.6910 -834.4391 -829.8255 -724.5643 47.7397f7 -800 -782.9525 -683.7158 -682.9493 -467.0235 56.9393 -782.9525 -694.4117 -688.6216 -467.0235 58.9648f8 -700 -679.5066 -679.1449 -679.1533 -678.8807 0.1561 -679.5066 -679.1685 -679.1596 -678.9175 0.1489 f9 -600 -595.5807 -591.1572 -591.6152 -586.6314 2.1540 -595.8833 -591.6480 -591.9006 -586.6852 2.2068f10 -500 -490.7566 -456.7438 -409.3499 275.8620 146.2619 -495.2486 -447.6569 -417.5570 272.8506 119.5523f11 -400 -381.8322 -350.7278 -350.8433 -308.3906 16.9659 -385.8263 -352.6462 -349.7645 -306.3767 19.8457f12 -300 -268.4056 -236.9123 -234.3349 -173.2333 21.4976 -267.5442 -238.1376 -233.7328 -180.4475 24.0218f13 -200 -164.4919 -135.5697 -133.2558 -83.0992 16.5938 -157.3840 -133.8360 -130.9067 -101.3905 14.3010f14 -100 426.6840 1445.3719 1372.1666 1919.8069 371.0698 529.2313 1437.8050 1408.9207 2354.3925 372.3278f15 100 1220.1451 1947.0528 1975.5844 2562.6328 389.5177 1003.1814 1977.0000 1942.4554 2612.5980 420.3723f16 200 200.9639 202.4348 202.4266 203.6991 0.7239 201.4183 202.5016 202.7041 204.6743 0.8043f17 300 300.1525 301.0488 302.5322 344.4286 6.2663 300.1525 301.1105 302.6090 344.4286 6.2806f18 400 400.1525 401.0965 401.8234 406.8142 1.7579 400.1525 401.0380 401.8282 409.3670 1.9511f19 500 503.2288 506.1112 507.2618 546.8400 6.1033 503.2288 506.1112 506.5292 514.7202 2.2082f20 600 603.7083 604.3811 604.3909 605.0000 0.3448 603.7083 604.3823 604.3902 605.0000 0.3476
Composition functions f21 700 1023.3749 1513.9602 1479.0269 1552.7040 115.6359 1023.3749 1513.9602 1480.1213 1552.7040 115.4776f22 800 1548.1995 2539.2318 2522.6101 3500.2226 530.8839 f23 900 2034.5278 2775.3844 2807.9612 3892.0847 378.4048 1969.5557 2863.4119 2857.0894 3605.0865 375.8073f24 1000 1138.1878 1226.8297 1257.8455 1218.9173 25.6467 1118.7614 1222.2297 1202.6021 1235.4652 37.9750f25 1100 1241.8700 1345.1502 1339.3068 1366.6899 25.4506 1240.1102 1346.0017 1337.4075 1365.8927 29.4815f26 1200 1336.3902 1400.5789 1408.2959 1533.1272 49.3885 1331.5768 1400.6062 1407.4818 1529.5341 48.9864f27 1300 1623.8144 1642.8999 1643.3209 1665.1437 7.4070 1623.8144 1641.5731 1642.7850 1665.1437 7.0987f28 1400
SPSO2011 Bayesian Optimization
Unimodal functions f1 -1400 -1005.3148 -103.485 -7.8876 1695.6855 580.2353 5216.1430 1.5510e+04 1.5117e+04 2.4693e+04 4857.8925f2 -1300 2.3646e+07 1.3353e+08 1.4431e+08 2.6640e+08 6.6492e+07 5.6854e+07 2.1022e+08 2.3433e+08 5.5868e+08 1.1446e+08f3 -1200 1.2138e+10 8.9256e+10 1.06388e+11 2.7705e+11 5.7512e+10 1.0004e+10 8.7289e+12 5.2365e+15 1.8529e+17 2.5868e+16f4 -1100 1.6502e+05 3.4671e+05 3.9683e+05 8.9498e+05 1.8831e+05 4.6082e+04 9.1290e+05 4.1055e+06 2.9822e+07 6.2066e+06f5 -1000 -831.6348 -322.0166 -308.399 505.4016 325.4384 3434.6118 1.6727e+04 1.9594e+04 7.0332e+04 1.1099e+04
Multimodal functions f6 -900 -844.2954 -770.1236 -764.7432 -674.2552 38.683 -572.7398 875.9278 1041.0426 4309.8598 937.4901f7 -800 -701.8775 -633.7445 -632.2272 -492.1828 43.0969 -605.0234 4392.9283 4.8434e+04 6.6221e+05 1.0378e+05f8 -700 -679.4141 -679.2812 -679.2859 -679.0626 0.1107 -679.3266 -678.9201 -678.9299 -678.6414 0.1542f9 -600 -591.9697 -589.397 -589.4528 -587.2924 0.9338 -590.7963 -585.8143 -585.8881 -583.5415 1.353f10 -500 -461.6521 -274.8178 -264.2497 -0.613 101.7582 176.7729 1733.1143 1824.9214 3644.3367 741.2554f11 -400 -355.8199 -308.9896 -309.2968 -271.9039 16.7635 -253.1191 -141.5103 -137.2415 28.2658 59.2632f12 -300 -248.9015 -216.6082 -215.6252 -184.8619 14.6543 -146.2610 -18.1314 -18.5652 148.3443 66.9370f13 -200 -142.5769 -110.8844 -110.5187 -71.8861 14.7131 -40.8029 73.0554 76.9612 263.1982 67.6236f14 -100 1580.9685 1957.0233 1957.8533 2308.8529 184.3534 1763.2177 2647.0464 2657.2596 3247.7634 299.3089f15 100 1509.0158 2251.4663 2220.2589 2576.8 216.8827 1905.9186 2825.4242 2807.8911 3446.124 287.8989f16 200 201.6575 202.6773 202.6513 203.8566 0.5514 202.458 205.0509 205.0068 207.8333 1.3296f17 300 374.5278 413.6145 416.1109 473.2804 20.4491 564.6457 976.4061 960.0182 1343.7203 193.2775f18 400 482.4333 518.1552 521.4023 565.5428 19.127 664.6457 1076.4061 1060.6695 1443.7203 194.3157f19 500 507.1073 544.8849 628.3692 2936.0275 341.7338 3798.6579 1.5309e+05 2.7050e+05 1.2867e+06 2.8033e+05f20 600 603.9779 604.2864 604.3053 604.6311 0.167 604.8021 605.0000 604.9911 605.0000 0.0334
Composition functions f21 700 1117.9655 1172.2566 1176.7712 1292.8498 34.227 2234.1194 2815.4961 2782.896 3827.9943 344.2327f22 800 2515.5492 3119.0027 3090.0824 3447.205 197.5455 2832.1769 3780.0713 3740.7229 4312.6797 293.4730f23 900 2869.2363 3378.0357 3368.553 3697.4699 196.2726 3087.3422 3818.1579 3771.9333 4278.5656 280.3538f24 1000 1225.5428 1232.4759 1231.8447 1237.7855 3.3618 1238.0768 1266.8067 1275.7245 1420.2028 32.5109f25 1100 1323.2743 1330.8522 1330.7533 1338.3726 2.6161 1355.4382 1372.0921 1375.891 1400.1419 12.1785f26 1200 1364.9184 1395.6858 1398.6196 1521.2914 24.899 1407.4966 1488.573 1493.4568 1565.5591 46.6492f27 1300 1777.1847 1928.2138 1922.8312 2061.7246 69.009 1647.6561 1696.2735 1711.6029 1967.458 51.9861f28 1400 1880.9056 2359.1062 2346.6852 2469.9099 90.1052 2798.8736 3475.094 3513.2157 4967.1594 405.5373
Table 3: Performance of PSO with a GP-based heuristic on CEC2013 benchmark experiments. Bold: variant with lowest min value.

4.3. Convergence plots
Figure 2 depicts the convergence trajectories for selected CEC2013 benchmark experiments. Thereby, we provide further insights into the convergence behavior for unimodal, multimodal, and composition functions. We see that versions (A3), (B), and (C) improve the convergence and, in particular, outperform SPSO2011 (e. g., f10: rotated Griewank's function; f14: Schwefel's function). For all experiments, we yield improved candidate solutions in the initial search phase. Such behavior improves the effectiveness of the swarm search but, on top of that, is especially helpful in applications where function evaluations are costly (or even subject to budget limitations).
Figure 3 compares the runtime of SPSO2011 against our algorithms. On the one hand, our algorithms require additional computational time for estimating the GP, while, on the other hand, the GP helps in accelerating the search process. The following experiments are based on the actual runtime for evaluating f, which, in our experiments, is computationally cheap. However, we remind the reader that many applications of PSO in practice involve functions f that are computationally costly and, therefore, the relative overhead due to the GP is likely to be lower in practice.

As above, we provide results for unimodal, multimodal, and composition functions. In comparison to SPSO2011, variants (A3), (A2), (B), and (C) yield better solutions. The benefit of our algorithm over SPSO2011 becomes especially pronounced for objective functions of high complexity (e. g., f14: Schwefel's function; f25: composition function 5 (n = 3, rotated)), where a directed search can drive the swarm towards better solutions. This points towards the overall effectiveness of the directed search in our algorithm.
(Figure 2, four panels: x-axis shows function evaluations, y-axis shows the mean objective function; legend: SPSO2011, (B), (C1), (C2), (A1), (A2), (A3).)
Figure 2: Convergence plots for selected CEC2013 unimodal, multimodal, and composition benchmark experiments (from top to bottom: f1: sphere function; f10: rotated Griewank's function; f14: Schwefel's function; f20: expanded Schaffer's F6 function) for a limited number of function evaluations. Recall that the above evaluation tables favored SPSO2011 over Bayesian optimization and, hence, we exclusively focus on this benchmark.
(Figure 3, four panels: x-axis shows runtime in seconds, y-axis shows the minimum objective function; legend: SPSO2011, (B), (C1), (C2), (A1), (A2), (A3).)
Figure 3: Runtime comparison for selected CEC2013 experiments from unimodal, multimodal, and composition benchmarks (from top to bottom: f4: rotated discus function; f14: Schwefel's function; f15: rotated Schwefel's function; f25: composition function 5 (n = 3, rotated)). Experiments are performed on an Intel Core i5 with 3.1 GHz and 16 GB memory. Bayesian optimization is omitted for reasons of space, as it is already shown to be inferior in the above numerical experiments.

4.5. Comparison against surrogate-assisted evolutionary algorithms

We further compare our PSO variants against state-of-the-art surrogate-assisted evolutionary algorithms. Specifically, we compare against five popular variants: (1) a Gaussian-process-based genetic search (GPEME; Liu et al., 2013); (2) an ensemble-based algorithm with reweighting according to surrogate errors (WTA1; Goel et al., 2007); (3) a surrogate-assisted memetic algorithm with surrogate-based local searches (GS-SOMA; Lim et al., 2009); (4) an evolution strategy assisted by local Gaussian random field metamodels (MAES-ExI; Emmerich et al., 2006); and (5) a PSO extended by committee-based active learning (CAL-SAPSO; Wang et al., 2017). The evaluation setting is identical to prior literature (Wang et al., 2017). This is done to allow for a fair comparison against existing surrogate-assisted evolutionary algorithms. Specifically, the evaluation is based on four popular multi-modal benchmark functions as in Wang et al. (2017): the Ackley function, Griewank function, Rastrigin function, and Rosenbrock function. The performance of the algorithms is evaluated after 11 · D function evaluations, analogous to Wang et al. (2017). For reasons of space, we present the results for ten-dimensional (D = 10) objective functions. We report the best candidate solution values (mean) across 20 independent runs.

The results are in Table 4. We observe that all considered surrogate-assisted evolutionary algorithms are outperformed by variant (A3) for two out of four functions. For instance, for the Ackley function, variant (A3) registers a better result than the state-of-the-art benchmarks (i. e., MAES-ExI; Emmerich et al., 2006) with an improvement of over 70 %. The Ackley function has a very narrow basin in the area of the global optimum, which we evaluate in the domain D = [−·, ·]^D. On this function, non-ensemble-based techniques such as, for instance, GPEME and MAES-ExI outperform ensemble-based methods (e. g., WTA1, GS-SOMA, CAL-SAPSO). Similarly, non-ensemble-based techniques perform better than ensemble-based methods on the multi-modal Rastrigin function for the domain D = [−·, ·]^D (Wang et al., 2017). In contrast to the surface of the Ackley function, the Rastrigin function includes a large set of local minima around the global optimum. For both functions, we observe that variant (A3) outperforms the existing non-ensemble-based methods (i. e., GPEME and MAES-ExI) and therefore achieves the overall best performance. Thus, the evaluation suggests that our proposed variant (A3) is superior to non-ensemble-based methods on objective functions where non-ensemble-based methods perform particularly strongly.

We further benchmark our proposed variants on the Griewank function for D = [−·, ·]^D and on the Rosenbrock function for D = [−·, ·]^D. While the Rosenbrock function includes a narrow valley leading to the global optimum, the Griewank function comprises a large number of local minima around the global minimum. Our proposed variant (A3) outperforms GPEME on the Griewank benchmark. The comparison against GPEME is particularly interesting, as GPEME is based upon a combination of a Gaussian process and a genetic algorithm.
Here we find that variant (A3) is superior in three out of four cases. This can be directly attributed to the difference between the two approaches: GPEME uses the surrogate only for prescreening candidate solutions, whereas we integrate the surrogate to actively guide the swarm search. Overall, this establishes the strong performance of the proposed algorithms.

[Table 4 appears here. Columns: Griewank (Griewank, 1981), Rosenbrock (Rosenbrock, 1960), Rastrigin (Rastrigin, 1974), and Ackley (Ackley, 1987). Rows: the proposed variants from this paper, the benchmark algorithm SPSO2011, and the state-of-the-art surrogate-assisted evolutionary algorithms (SAEAs) listed above, including CAL-SAPSO. Entries are given as mean (± SD).]

Table 4: Comparison of the proposed variants in this paper against surrogate-assisted evolutionary algorithms. The setting is chosen as in prior work on surrogate-assisted evolutionary algorithms for D = 10 (Wang et al., 2017). The corresponding objective function was minimized. Results were obtained across 20 independent runs; reported is the mean (± SD) of the best candidate solution across runs.
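For reference, the following sketch gives minimal implementations of the four benchmark functions used in this comparison. These are the standard textbook forms; the exact shifts, rotations, and domains used by Wang et al. (2017) may differ, so the listing is illustrative only.

```python
import numpy as np

# Standard textbook forms of the four benchmarks in Table 4.

def ackley(x):
    x = np.asarray(x, dtype=float)
    return (-20.0 * np.exp(-0.2 * np.sqrt(np.mean(x ** 2)))
            - np.exp(np.mean(np.cos(2 * np.pi * x))) + 20.0 + np.e)

def griewank(x):
    x = np.asarray(x, dtype=float)
    i = np.arange(1, x.size + 1)
    return 1.0 + np.sum(x ** 2) / 4000.0 - np.prod(np.cos(x / np.sqrt(i)))

def rastrigin(x):
    x = np.asarray(x, dtype=float)
    return 10.0 * x.size + np.sum(x ** 2 - 10.0 * np.cos(2 * np.pi * x))

def rosenbrock(x):
    x = np.asarray(x, dtype=float)
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

# Ackley, Griewank, and Rastrigin attain their minimum value 0 at the origin;
# Rosenbrock attains 0 at the all-ones vector.
print(ackley(np.zeros(10)), rosenbrock(np.ones(10)))
```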
5. Discussion
As shown above, the proposed algorithms improve the efficiency over standard PSO; that is, with the same number of iterations, a better solution is obtained. This is especially crucial in practical settings where function evaluations are economically or computationally costly (cf. Rios & Sahinidis, 2013, for a discussion). For potential users of our algorithms, there are direct implications. In a cost analysis, we showed that the benefits of our algorithm depend on the cost (or runtime) of making function evaluations. If function evaluations are (computationally) cheap, a standard PSO appears beneficial, while, otherwise, our Gaussian-process-based PSO has benefits. Similarly, the benefits of our algorithms over standard PSO diminish if many function evaluations can be made; yet, when their number is limited as in the above experiments, our algorithms are favorable.

Our work developed variations of PSO that use the information gained about the underlying function to model this function itself as a stochastic process. In other words, this can also be seen as an extension of concepts from Bayesian optimization to PSO. By drawing upon Gaussian processes, we allow for a wide family of possible function structures and benefit from recent advances in Bayesian optimization, which is currently a popular approach to optimizing expensive black-box functions. Along with variant (A3), variant (C), with its combination of PSO and GP-based exploitation, showed promising results in the numerical experiments. Besides that, this variant has interesting theoretical properties, as it yields the same performance guarantees as sequential Bayesian optimization (subject to a fixed factor). As opposed to pure Bayesian optimization, the PSO search remains agnostic to the underlying function. It might therefore serve as an initial exploratory scheme until the function can be adequately modeled. Here we point to Büche et al. (2005), who use an evolutionary algorithm followed by Bayesian optimization in a similar fashion.
6. Conclusion
This work proposed an innovative direction for improving stochastic optimization methods by guiding the search process with a stochastic function approximation. Specifically, we combined particle swarm optimization with Gaussian-process-based function approximation between the particles' positions in order to improve particle movements with regard to exploration and exploitation. This allows us to profit from the stochastic convergence of the swarm, while also accelerating convergence and reducing the risk of finding local instead of global optima. To this end, three different approaches were developed: (A) using the stochastic surrogate model to locate an approximate global optimum and including this in the original PSO search direction; (B) using the approximate global optimum to leapfrog the worst particle for fast convergence; and (C) a hybridization of PSO and Bayesian optimization for increased exploitation. As a result, these modifications can improve the stochastic search process over the original PSO by obtaining a more effective swarm behavior with respect to exploration and exploitation. Finally, our computational experiments demonstrated that enhancing PSO by leveraging a Gaussian-process-based surrogate model as a further source of information improves its performance.

The above method opens avenues for future research. Specifically, one could investigate the performance when replacing the GP approximation with alternative models, such as Bayesian neural networks, random forests, or other models that allow for a stochastic interpretation of functions. Here, one could also decompose the approximation into a stochastic component for uncertainty estimates and a non-stochastic counterpart for the response surface, as this could enable more accurate approximations at lower computational cost. Our approach of GP-guided stochastic optimization might also be helpful for other population-based algorithms that involve a combination of exploratory and exploitative behavior (e. g., evolutionary algorithms, cross-entropy methods, and bees algorithms). Beyond that, we leave it to future research to adapt the proposed algorithm to parallel multiple-swarm optimization and multi-objective optimization.
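To make the mechanism of approach (A) concrete, the following sketch shows how a surrogate optimum can enter the particle movement. It assumes a scikit-learn Gaussian process, a simple random-search minimization of the posterior mean, and illustrative coefficients (w, c1, c2, c3); the paper's actual update rule, the SPSO2011 details, and the parameter values differ, so this is a minimal illustration rather than the proposed algorithm.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_guided_pso(f, bounds, n_particles=20, n_iter=50,
                  w=0.7, c1=1.5, c2=1.5, c3=1.0, seed=0):
    """Illustrative PSO in which the surrogate minimizer acts as a third attractor."""
    rng = np.random.default_rng(seed)
    dim = bounds.shape[0]
    lo, hi = bounds[:, 0], bounds[:, 1]
    x = rng.uniform(lo, hi, size=(n_particles, dim))      # positions
    v = np.zeros_like(x)                                  # velocities
    fx = np.apply_along_axis(f, 1, x)
    pbest, pbest_val = x.copy(), fx.copy()                # personal bests
    hist_x, hist_y = [x.copy()], [fx.copy()]              # all evaluations feed the GP

    for _ in range(n_iter):
        gbest = pbest[np.argmin(pbest_val)]
        # Fit the surrogate to every evaluation made so far.
        gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
        gp.fit(np.vstack(hist_x), np.concatenate(hist_y))
        # Approximate the surrogate optimum by random search over the posterior mean.
        cand = rng.uniform(lo, hi, size=(500, dim))
        sbest = cand[np.argmin(gp.predict(cand))]
        # Velocity update with the surrogate optimum as an additional attractor.
        r1, r2, r3 = rng.random((3, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x) + c3 * r3 * (sbest - x)
        x = np.clip(x + v, lo, hi)
        fx = np.apply_along_axis(f, 1, x)
        improved = fx < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], fx[improved]
        hist_x.append(x.copy()); hist_y.append(fx.copy())
    return pbest[np.argmin(pbest_val)], pbest_val.min()
```

For instance, `gp_guided_pso(rastrigin, np.array([[-5.12, 5.12]] * 10))` would run the sketch on a ten-dimensional Rastrigin function (as defined in the listing after Table 4). Note that refitting an exact GP on all past evaluations scales cubically in their number, which is acceptable precisely in the regime targeted here, namely when individual function evaluations dominate the runtime.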
Acknowledgment
Cloud computing resources were provided by a Microsoft Azure for Research award. We thank the review team for valuable suggestions, in particular with regard to the experimental setup.
References
Ackermann, E. R., De Villiers, J. P., & Cilliers, P. (2011). Nonlinear dynamic systems modeling using Gaussian processes: Predicting ionospheric total electron content over South Africa. Journal of Geophysical Research: Space Physics.
Ackley, D. (1987). A connectionist machine for genetic hillclimbing. Norwell, Massachusetts: Springer Science & Business Media.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 235–256.
Barman, D., Hasnat, A., Sarkar, S., & Murshidanad, M. A. R. (2016). Color image quantization using Gaussian particle swarm optimization (CIQ-GPSO). In (pp. 1–4). IEEE.
Bird, S., & Li, X. (2010). Improving local convergence in particle swarms by fitness approximation using regression. In L. M. Hiot, Y. S. Ong, Y. Tenne, & C.-K. Goh (Eds.), Computational Intelligence in Expensive Optimization Problems (pp. 265–293). Berlin, Heidelberg: Springer, volume 2 of Evolutionary Learning and Optimization.
Bonyadi, M. R., & Michalewicz, Z. (2017). Particle swarm optimization for single objective continuous space problems: A review. Evolutionary Computation, 1–54.
Bratton, D., & Kennedy, J. (2007). Defining a standard for particle swarm optimization. In (pp. 120–127). IEEE.
Büche, D., Schraudolph, N. N., & Koumoutsakos, P. (2005). Accelerating evolutionary algorithms with Gaussian process fitness function models. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), 183–194.
Bull, A. D. (2011). Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, October 2011, 2879–2904.
Byrd, R. H., Lu, P., Nocedal, J., & Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 1190–1208.
Chen, Z., & Yu, L. (2017). A novel PSO-based algorithm for structural damage detection using Bayesian multi-sample objective function. Structural Engineering and Mechanics, 825–835.
Emmerich, M. T., Giannakoglou, K. C., & Naujoks, B. (2006). Single- and multiobjective evolutionary optimization assisted by Gaussian random field metamodels. IEEE Transactions on Evolutionary Computation, 421–439.
Etgar, R., Gelbard, R., & Cohen, Y. (2017). Optimizing version release dates of research and development long-term processes. European Journal of Operational Research, 642–653.
Fan, S.-K. S., & Zahara, E. (2007). A hybrid simplex search and particle swarm optimization for unconstrained optimization. European Journal of Operational Research, 527–548.
Gao, H., Li, Y., Kabalyants, P., Xu, H., & Martínez-Béjar, R. (2020). A novel hybrid PSO-k-means clustering algorithm using Gaussian estimation of distribution method and Lévy flight. IEEE Access, 122848–122863.
Gentle, J. E., Härdle, W. K., & Mori, Y. (Eds.) (2012). Handbook of Computational Statistics: Concepts and Methods. Springer Handbooks of Computational Statistics (2nd ed.). Berlin, Heidelberg: Springer.
Goel, T., Haftka, R. T., Shyy, W., & Queipo, N. V. (2007). Ensemble of surrogates. Structural and Multidisciplinary Optimization, 199–216.
González, J., Dai, Z., Hennig, P., & Lawrence, N. D. (2015). Batch Bayesian optimization via local penalization.
Griewank, A. O. (1981). Generalized descent for global optimization. Journal of Optimization Theory and Applications, 11–39.
Hong, Z., Dai, W., Luh, H., & Yang, C. (2018). Optimal configuration of a green product supply chain with guaranteed service time and emission constraints. European Journal of Operational Research, 663–677.
Hoos, H. H., & Stützle, T. (2005). Stochastic Local Search: Foundations and Applications. San Francisco, CA: Morgan Kaufmann Publishers.
Iqbal, M., & de Oca, M. A. M. (2006). An estimation of distribution particle swarm optimization algorithm. In Ant Colony Optimization and Swarm Intelligence (pp. 72–83). Berlin, Heidelberg: Springer, volume 4150 of Lecture Notes in Computer Science.
Jiang, M., Luo, Y. P., & Yang, S. Y. (2007). Stochastic convergence analysis and parameter selection of the standard particle swarm optimization algorithm. Information Processing Letters, 8–16.
Jones, D. R., Schonlau, M., & Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 455–492.
Kang, J., Jin, B., Duan, X., Shang, X., & Li, W. (2018). Optimal operation of microgrid based on Bayesian-PSO algorithm. Power System Protection and Control, 33–41.
Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. In Proceedings of ICNN'95 - International Conference on Neural Networks (pp. 1942–1948). IEEE.
Krohling, R. A. (2004). Gaussian swarm: a novel particle swarm optimization algorithm. In IEEE Conference on Cybernetics and Intelligent Systems, 2004 (pp. 372–376). IEEE, volume 1.
Kushner, H. J. (1964). A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 97.
Laskari, E. C., Parsopoulos, K. E., & Vrahatis, M. N. (2002). Particle swarm optimization for integer programming. In Proceedings of the 2002 Congress on Evolutionary Computation. CEC'02 (Cat. No.02TH8600) (pp. 1582–1587). IEEE.
Liang, J., Qu, B., Suganthan, P., & Hernández-Díaz, A. G. (2013). Problem definitions and evaluation criteria for the CEC 2013 special session on real-parameter optimization. Computational Intelligence Laboratory, Zhengzhou University, Zhengzhou, China and Nanyang Technological University, Singapore, Technical Report, 281–295.
Lim, D., Jin, Y., Ong, Y.-S., & Sendhoff, B. (2009). Generalizing surrogate-assisted evolutionary computation. IEEE Transactions on Evolutionary Computation, 329–355.
Liu, B., Zhang, Q., & Gielen, G. G. (2013). A Gaussian process surrogate model assisted evolutionary algorithm for medium scale expensive optimization problems. IEEE Transactions on Evolutionary Computation, 180–192.
Liu, R., Li, J., Fan, J., Mu, C., & Jiao, L. (2017). A coevolutionary technique based on multi-swarm particle swarm optimization for dynamic multi-objective optimization. European Journal of Operational Research, 1028–1051.
Liu, Y., Mu, C., Kou, W., & Liu, J. (2015). Modified particle swarm optimization-based multilevel thresholding for image segmentation. Soft Computing, 1311–1327.
Melo, H., & Watada, J. (2016). Gaussian-PSO with fuzzy reasoning based on structural learning for training a neural network. Neurocomputing, 405–412.
Močkus, J. (1975). On Bayesian methods for seeking the extremum. In G. Goos, J. Hartmanis, P. Brinch Hansen, D. Gries, C. Moler, G. Seegmüller, N. Wirth, & G. I. Marchuk (Eds.), Optimization Techniques IFIP Technical Conference Novosibirsk, July 1–7, 1974 (pp. 400–404). Berlin, Heidelberg: Springer, volume 27 of Lecture Notes in Computer Science.
Morales-Enciso, S., & Branke, J. (2015). Tracking global optima in dynamic environments with efficient global optimization. European Journal of Operational Research, 744–755.
Parno, M. D., Hemker, T., & Fowler, K. R. (2012). Applicability of surrogates to improve efficiency of particle swarm optimization for simulation-based problems. Engineering Optimization, 521–535.
Pham, D. T., & Castellani, M. (2014). Benchmarking and comparison of nature-inspired population-based continuous optimisation algorithms. Soft Computing, 871–903.
Poli, R., Kennedy, J., & Blackwell, T. (2007). Particle swarm optimization. Swarm Intelligence, 33–57.
Rasmussen, C. E., & Williams, C. K. I. (2008). Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning (3rd ed.). Cambridge, Mass.: MIT Press.
Rastrigin, L. A. (1974). Systems of extremal control. Nauka.
Regis, R. G. (2014a). Evolutionary programming for high-dimensional constrained expensive black-box optimization using radial basis functions. IEEE Transactions on Evolutionary Computation, 326–347.
Regis, R. G. (2014b). Particle swarm with radial basis function surrogates for expensive black-box optimization. Journal of Computational Science, 12–23.
Rios, L. M., & Sahinidis, N. V. (2013). Derivative-free optimization: A review of algorithms and comparison of software implementations. Journal of Global Optimization, 1247–1293.
Rosenbrock, H. (1960). An automatic method for finding the greatest or least value of a function. The Computer Journal, 175–184.
Sethanan, K., & Neungmatcha, W. (2016). Multi-objective particle swarm optimization for mechanical harvester route planning of sugarcane field operations. European Journal of Operational Research, 969–984.
Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms.
Srinivas, N., Krause, A., Kakade, S. M., & Seeger, M. W. (2012). Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 3250–3265.
Sun, C., Jin, Y., Cheng, R., Ding, J., & Zeng, J. (2017). Surrogate-assisted cooperative swarm optimization of high-dimensional expensive problems. IEEE Transactions on Evolutionary Computation, 644–660.
Sun, C., Zeng, J., Pan, J., Xue, S., & Jin, Y. (2013). A new fitness estimation strategy for particle swarm optimization. Information Sciences, 355–370.
Swersky, K., Snoek, J., & Adams, R. P. (2013). Multi-task Bayesian optimization.
Tasgetiren, M. F., Liang, Y.-C., Sevkli, M., & Gencyilmaz, G. (2007). A particle swarm optimization algorithm for makespan and total flowtime minimization in the permutation flowshop sequencing problem. European Journal of Operational Research, 1930–1947.
Tian, D., & Shi, Z. (2018). MPSO: Modified particle swarm optimization and its applications. Swarm and Evolutionary Computation, 49–68.
van den Bergh, F., & Engelbrecht, A. P. (2002). A new locally convergent particle swarm optimizer. In Proceedings of the IEEE Conference on Systems, Man and Cybernetics, volume 3.
Varma, S. C., Murthy, K. L., & SriChandan, K. (2013). Gaussian particle swarm optimization for combined economic emission dispatch. In (pp. 1336–1340). IEEE.
Wang, H., Jin, Y., & Doherty, J. (2017). Committee-based active learning for surrogate-assisted particle swarm optimization of expensive problems. IEEE Transactions on Cybernetics, 2664–2677.
Yin, P.-Y., Glover, F., Laguna, M., & Zhu, J.-X. (2010). Cyber swarm algorithms – improving particle swarm optimization using adaptive memory strategies. European Journal of Operational Research, 377–389.
Yu, H., Tan, Y., Zeng, J., Sun, C., & Jin, Y. (2018). Surrogate-assisted hierarchical particle swarm optimization. Information Sciences, 59–72.
Yu, S., Zheng, S., Gao, S., & Yang, J. (2017). A multi-objective decision model for investment in energy savings and emission reductions in coal mining. European Journal of Operational Research, 335–347.
Zambrano-Bigiarini, M., Clerc, M., & Rojas, R. (2013). Standard particle swarm optimisation 2011 at CEC-2013: A baseline for future PSO improvements. In (pp. 2337–2344). IEEE.
Zhang, L., Tang, Y., Hua, C., & Guan, X. (2015). A new particle swarm optimization algorithm with adaptive inertia weight based on Bayesian techniques. Applied Soft Computing, 138–149.
Zouache, D., Moussaoui, A., & Abdelaziz, F. B. (2018). A cooperative swarm intelligence algorithm for multi-objective discrete optimization with application to the knapsack problem. European Journal of Operational Research, 74–88.

Online Supplements

Online Appendix A. CEC2013 benchmark functions
Function  |  Optimum  |  Domain

Unimodal functions
f1   Sphere function  |  -1400  |  [-100, 100]^D
f2   Rotated high conditioned elliptic function  |  -1300  |  [-100, 100]^D
f3   Rotated bent cigar function  |  -1200  |  [-100, 100]^D
f4   Rotated discus function  |  -1100  |  [-100, 100]^D
f5   Different powers function  |  -1000  |  [-100, 100]^D

Multimodal functions
f6   Rotated Rosenbrock's function  |  -900  |  [-100, 100]^D
f7   Rotated Schaffers F7 function  |  -800  |  [-100, 100]^D
f8   Rotated Ackley's function  |  -700  |  [-100, 100]^D
f9   Rotated Weierstrass function  |  -600  |  [-100, 100]^D
f10  Rotated Griewank's function  |  -500  |  [-100, 100]^D
f11  Rastrigin's function  |  -400  |  [-100, 100]^D
f12  Rotated Rastrigin's function  |  -300  |  [-100, 100]^D
f13  Non-continuous rotated Rastrigin's function  |  -200  |  [-100, 100]^D
f14  Schwefel's function  |  -100  |  [-100, 100]^D
f15  Rotated Schwefel's function  |  100  |  [-100, 100]^D
f16  Rotated Katsuura function  |  200  |  [-100, 100]^D
f17  Lunacek bi-Rastrigin function  |  300  |  [-100, 100]^D
f18  Rotated Lunacek bi-Rastrigin function  |  400  |  [-100, 100]^D
f19  Expanded Griewank's plus Rosenbrock's function  |  500  |  [-100, 100]^D
f20  Expanded Schaffer's F6 function  |  600  |  [-100, 100]^D

Composition functions
f21  Composition function 1 (n = 5, rotated)  |  700  |  [-100, 100]^D
f22  Composition function 2 (n = 3, unrotated)  |  800  |  [-100, 100]^D
f23  Composition function 3 (n = 3, rotated)  |  900  |  [-100, 100]^D
f24  Composition function 4 (n = 3, rotated)  |  1000  |  [-100, 100]^D
f25  Composition function 5 (n = 3, rotated)  |  1100  |  [-100, 100]^D
f26  Composition function 6 (n = 5, rotated)  |  1200  |  [-100, 100]^D
f27  Composition function 7 (n = 5, rotated)  |  1300  |  [-100, 100]^D
f28  Composition function 8 (n = 5, rotated)  |  1400  |  [-100, 100]^D

Table A.1: Evaluation functions from the CEC2013 benchmark suite (Liang et al., 2013).

Online Appendix B. Statistical tests for performance comparisons
We report p-values for the significance of the results presented in Tables 2 and 3. To this end, we calculate p-values of a one-sided t-test on the means of the favored variant (A3) versus the remaining variants and the benchmark algorithms (a minimal sketch of this computation is given after Table B.1).

[Table B.1 appears here. Columns: function, (A3), (A1), (A2), (B), (C1), (C2), SPSO2011, BO. Rows: the unimodal functions f1, f3, and f5; the multimodal functions f6–f20; and the composition functions f21–f28.]

Table B.1: Computed p-values of one-sided t-tests checking whether the means of variant (A3) are significantly lower than the means of the remaining variants and benchmark algorithms for the CEC2013 functions. We highlight p-values below a significance level of 5% in bold.
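The sketch below illustrates how such a p-value can be obtained from per-run results. The run values are synthetic and for illustration only, and since the paper does not state whether a pooled-variance or Welch variant of the t-test is used, the sketch assumes Welch's test.

```python
import numpy as np
from scipy import stats

# Synthetic final best values over 20 independent runs for variant (A3) and a competitor.
rng = np.random.default_rng(1)
a3_runs = rng.normal(loc=1.0, scale=0.3, size=20)
competitor_runs = rng.normal(loc=1.4, scale=0.3, size=20)

# One-sided Welch t-test: H1 is that the mean of (A3) is lower than the competitor's.
t_stat, p_value = stats.ttest_ind(a3_runs, competitor_runs,
                                  equal_var=False, alternative="less")
print(f"p = {p_value:.4f}")
```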
Online Appendix C. Bayesian optimization

Bayesian optimization (Močkus, 1975) aims at finding the global optimum of a (non-convex) function f : D → R. The underlying idea is a stochastic model of the black-box function f. For this reason, one sequentially chooses a candidate solution x_n that promises the highest gain according to a so-called acquisition function R : D → R, i. e., x_n = arg max_{x ∈ D} R(x), depending on the current stochastic model of the function. After evaluating the function at this location, one updates the model according to Bayes' rule. This process is repeated until convergence. This idea goes back at least to Kushner (1964).

The above approach requires a stochastic model of the function. A common choice is the family of Gaussian processes, due to their computational tractability, as seen in Section 3.2. This family has been used in several approaches to Bayesian optimization, such as Jones et al. (1998), Snoek et al. (2012), or Morales-Enciso & Branke (2015), with various choices of kernels. Mathematically speaking, at each time step n, one defines a probabilistic model f ∼ P^(n) based on the past observations. In the case of Gaussian processes, this is simply the posterior Gaussian process conditional on the past observations. The acquisition function R is then defined based on this distribution. Examples are the probability of improvement (Kushner, 1964) and the expected improvement E_{f ∼ P^(n)}[max{f(x) − y⁺, 0}], where y⁺ denotes the best observation made so far.
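The following is a generic sketch of the loop described above for maximization with the expected-improvement acquisition. The scikit-learn Gaussian process, the Matérn kernel, and the random-search maximization of the acquisition are illustrative assumptions rather than the configuration used in this paper.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(mu, sigma, y_best):
    """E[max(f(x) - y_best, 0)] under a Gaussian posterior with mean mu and std sigma."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

def bayesian_optimization(f, bounds, n_init=5, n_iter=25, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    X = rng.uniform(lo, hi, size=(n_init, bounds.shape[0]))  # initial design
    y = np.apply_along_axis(f, 1, X)
    for _ in range(n_iter):
        # Posterior GP conditional on all past observations.
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X, y)
        # Maximize the acquisition by random search over the domain.
        cand = rng.uniform(lo, hi, size=(1000, bounds.shape[0]))
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmax(expected_improvement(mu, sigma, y.max()))]
        # Evaluate the chosen candidate and update the data set.
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    return X[np.argmax(y)], y.max()
```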