Bayesian Optimization for Policy Search in High-Dimensional Systems via Automatic Domain Selection
Lukas P. Fröhlich, Edgar D. Klenske, Christian G. Daniel, Melanie N. Zeilinger
Abstract—Bayesian optimization (BO) is an effective method for optimizing expensive-to-evaluate black-box functions, with a wide range of applications, for example, in robotics, system design, and parameter optimization. However, scaling BO to problems with large input dimensions (>10) remains an open challenge. In this paper, we propose to leverage results from optimal control to scale BO to higher-dimensional control tasks and to reduce the need for manually selecting the optimization domain. The contributions of this paper are twofold: 1) We show how to use a learned dynamics model in combination with a model-based controller to simplify the BO problem by focusing on the most relevant regions of the optimization domain. 2) Based on 1), we present a method to find an embedding in parameter space that reduces the effective dimensionality of the optimization problem. To evaluate the effectiveness of the proposed approach, we present an experimental evaluation on real hardware, as well as simulated tasks including a 48-dimensional policy for a quadcopter.
I. INTRODUCTION

Bayesian optimization (BO) is a powerful method for the optimization of black-box functions that are costly to evaluate. One of its great advantages is sample efficiency, enabling a wide variety of applications, ranging from hyperparameter search for machine learning algorithms [1] and medical applications [2] to robotics [3], [4]. Especially for high-dimensional problems, however, the number of function evaluations (experiments on the hardware in many cases) can still be prohibitive. In this paper, we consider BO in the context of direct policy search for systems with continuous state/action spaces. This offers the possibility of exploiting knowledge about the problem: rather than considering the objective function as a black box, we take a gray-box approach.

For many high-dimensional objective functions it is valid to assume some underlying structure, alleviating the curse of dimensionality. One common assumption is that the effective dimensionality of the objective function is lower and lies in a linear subspace of the parameter domain [5], [6]. Another assumption is that the objective has an additive structure and many parameters are uncorrelated [7]. However, finding an appropriate subspace is hard and the effective dimensionality of the objective function is usually not known a priori.

Affiliations: Bosch Center for Artificial Intelligence, Robert Bosch GmbH, 71272 Renningen, Germany (corresponding author: [email protected]); Institute for Dynamic Systems and Control, ETH Zurich, 8092 Zurich, Switzerland. *The research of Melanie N. Zeilinger was supported by the Swiss National Science Foundation under grant no. PP00P2 157601/1.

In addition to the problem of finding a suitable embedding for dimensionality reduction, it is not clear how to set
the domain boundaries, within which the parameters are to be optimized, in a meaningful manner. Oftentimes, these boundaries have to be found experimentally or tuned by a domain expert, introducing many additional parameters that significantly influence the convergence of BO. We address both issues by using a learned dynamics model in combination with a model-based technique from optimal control.

Fig. 1: Illustrative 2D example depicting different optimization domains (see Sec. IV) and the domains' growth due to dynamic domain adaptation (DDA) (see Sec. IV-C), as well as the evaluated points (×); the independence domain required 32 evaluations, the PCA domain 24 evaluations. When an estimated optimum lies on the domain's boundary, the domain grows in the respective direction. The global optimum is marked as well.

The idea of combining BO and methods from optimal control has previously been explored to efficiently tune policies for dynamical systems [8], [9]. However, both of these approaches are susceptible to model bias because the respective policy parameterizations depend on a dynamics model. The method proposed in this paper uses a model-based approach only for selecting an appropriate parameter space and initial domain boundaries. The optimization of the policy itself is done in a model-free manner, allowing for higher flexibility and improved final performance.

In particular, our contributions are as follows (see also Fig. 1): 1) Using a learned dynamics model, we show how to automatically select the boundaries of the optimization domain such that no manual tuning or expert knowledge is needed, making BO more widely applicable for use in policy search. 2) We show how to determine a linear embedding that exploits the structure of the objective function, thus reducing the effective dimensionality of the optimization problem. 3) We propose a scheme to dynamically adapt the domain boundaries during optimization, based on the objective function's surrogate model, if the initial domain was chosen too small. The scheme to adapt the optimization domain is not limited to the application in policy search, but can be used as a general extension to BO. These contributions enable direct policy search based on BO for systems with significantly higher dimensionality than reported in the literature.

II. RELATED WORK

In recent years, BO has emerged as a promising policy search method. In [9], a stochastic optimal control problem is considered and BO is applied to optimize the entries of the linear system matrices that describe the system dynamics, which are then used to construct a linear quadratic regulator (LQR). Similarly, [8] also uses an LQR approach, but instead of tuning the model matrices, the weight matrices of the quadratic cost function are optimized. In contrast to the two aforementioned papers, [3] directly optimizes the entries of a linear state feedback policy. The different parameterizations of these methods are reviewed in Sec. III-C. Besides linear state feedback policies, other parameterizations have been proposed in the context of BO for policy search. Neural networks have been used to learn a swing-up of a double cart-pole system in simulation [10]. On average, more than 200 iterations are required to learn a good policy, although no noise is present in the system. In [11], a high-fidelity simulator of a bipedal robot is used in combination with a neural network to learn a tailored kernel for the Gaussian process (GP) surrogate model in BO.
To increase the efficiency of BO, [12] proposes to use trajectory data generated during experiments via a so-called behavior-based kernel, which compares policies by the similarity of their resulting trajectories on the system. However, this approach is limited to stochastic policies. In all methods discussed above, the domain over which the parameters are optimized is always chosen manually, indicating expert knowledge or experience from other experiments. The method proposed in this paper, in contrast, reduces the need for prior information and manual tuning.

One of the most prominent approaches for finding a lower-dimensional space in which to perform the optimization is proposed in [5]. Here, the authors assume that the effective dimensionality of the objective function is small and a randomly sampled linear mapping is used to transform the high-dimensional input parameters to the lower-dimensional space. However, the effective dimensionality of the subspace is usually not known a priori and the choice of an appropriate optimization domain is still an issue. Instead of randomly sampling an embedding, it has been proposed to actively learn an embedding [6]. However, that method has not been used in the context of BO, but rather in the active learning setting for GP regression. Thus, evaluation points are not selected by optimizing an objective; instead, they are selected according to their information gain with respect to the belief about the embedding itself.

The remainder of this paper is structured as follows: In Sec. III we formally state the policy search problem and review BO. Furthermore, we explain the different parameterizations for linear state feedback policies that have been proposed in the literature. In Sec. IV we describe the contributions of this paper. Results from the simulations and hardware experiments are then discussed in Sec. V.

III. PRELIMINARIES

In this section we formally state the policy search problem and explain how BO can be employed to solve it. Furthermore, we review different parameterizations for linear feedback policies.

A. Policy Search
Consider the problem of finding a policy, π_θ : R^{n_x} → R^{n_u}, mapping the state x to an input u = π_θ(x), which is parameterized by θ ∈ Θ ⊂ R^{n_θ}, with the goal of minimizing a specified performance criterion, or cost function, J, over a fixed time horizon. Formally, this is stated as the following optimization problem:

$$\min_\theta J(\theta) = \min_\theta \sum_{t=0}^{T} \mathbb{E}\left[ c(x_t, \pi_\theta(x_t)) \right], \quad \text{s.t.} \quad x_{t+1} = f(x_t, \pi_\theta(x_t)) + \nu, \qquad (1)$$

where c(x_t, u_t) is the cost of being in state x_t and applying input u_t, and f : R^{n_x} × R^{n_u} → R^{n_x} denotes the (generally unknown) state transition model governing the dynamics of the system, which are corrupted by white noise, ν ~ N(0, Σ_ν). In this paper, we apply BO to find the parameters θ* of a cost-minimizing policy.
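For concreteness, a noisy evaluation Ĵ(θ) of the cost in Eq. (1) corresponds to running one episode and accumulating the stage cost. The following is a minimal sketch of such a rollout, not the code used in the paper; the names rollout_cost, policy, f, and c are placeholders, and on hardware the call to f is replaced by the real system.

```python
import numpy as np

def rollout_cost(theta, f, c, policy, x0, T, noise_cov, rng=None):
    """Noisy evaluation J_hat(theta) of the episode cost in Eq. (1)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    J = 0.0
    for _ in range(T):
        u = policy(theta, x)            # u_t = pi_theta(x_t)
        J += c(x, u)                    # accumulate stage cost c(x_t, u_t)
        # transition corrupted by white noise nu ~ N(0, Sigma_nu)
        x = f(x, u) + rng.multivariate_normal(np.zeros(x.shape), noise_cov)
    return J
```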
B. Bayesian Optimization for Policy Search

In BO, GP regression (see, e.g., [13]) is used to model the performance of the system, J(θ), as a function of the policy parameters θ, based on noisy observations Ĵ(θ). In the considered setting, one function evaluation corresponds to a rollout of the current policy on the system, after which the new data point is added: D_{n+1} = D_n ∪ {(θ_{n+1}, Ĵ(θ_{n+1}))}. As the objective is expensive to evaluate, the goal is to use only few function evaluations to find the minimum of the cost. After each experiment, a new evaluation point in the domain is selected by maximizing an acquisition function, α(θ; D_n). Different acquisition functions have been proposed in the literature, such as probability of improvement (PI) [14], expected improvement (EI) [15], and upper confidence bound (UCB) [16], [17]. They all have in common that they trade off exploration (i.e., favoring regions of the domain where the objective function has not been evaluated yet) against exploitation (i.e., proposing the estimated optimum of the objective function). For a thorough introduction to BO, we refer the reader to [18].
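As an illustration of this loop, the sketch below implements BO for cost minimization with a GP surrogate and a confidence-bound acquisition (μ − κσ, the minimization analogue of UCB). It is a toy implementation under simplifying assumptions (scikit-learn GP, random candidate search instead of a proper acquisition optimizer); the experiments in this paper use the GPyOpt toolbox [23].

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def bo_minimize(evaluate_J, bounds, n_iter=30, kappa=2.0, n_candidates=2000,
                rng=np.random.default_rng(0)):
    """Minimize a noisy cost J(theta) on a box domain with a GP surrogate.

    evaluate_J -- noisy objective, evaluate_J(theta) -> scalar cost
    bounds     -- (n_theta, 2) array of [lower, upper] per dimension
    """
    lo, hi = bounds[:, 0], bounds[:, 1]
    # Initialize the data set with one random sample from the domain.
    X = [rng.uniform(lo, hi)]
    y = [evaluate_J(X[0])]
    gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(),
                                  normalize_y=True)
    for _ in range(n_iter):
        gp.fit(np.array(X), np.array(y))
        # Confidence-bound acquisition for minimization: mu - kappa * sigma,
        # evaluated on random candidates drawn from the current domain.
        cand = rng.uniform(lo, hi, size=(n_candidates, len(lo)))
        mu, sigma = gp.predict(cand, return_std=True)
        theta_next = cand[np.argmin(mu - kappa * sigma)]
        X.append(theta_next)
        y.append(evaluate_J(theta_next))
    return X[int(np.argmin(y))], min(y)
```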
C. Linear Policy Parameterization

In this paper, we consider linear state feedback policies of the form

$$\pi_\theta(x) = -K(\theta)\, x. \qquad (2)$$

The advantage of linear policies is their low dimensionality compared with other parameterizations, such as (deep) neural networks, which need large amounts of training data. As every experiment on real hardware costs a considerable amount of time (and potentially money), it is infeasible to perform hundreds or even thousands of rollouts. Another key benefit of a linear policy is that we can leverage the relation to linear optimal controllers to increase the efficiency of BO. Although linear policies are simple in their form, they have shown impressive results even on complex tasks, such as locomotion [19].

To improve the efficiency of BO, we make use of the LQR, a method commonly applied in optimal control. One way of approximating the problem in Eq. (1) is to linearize the system dynamics, f(x_t, u_t) ≈ A x_t + B u_t, and quadratize the stage cost, c(x_t, u_t) ≈ x_t^T Q x_t + u_t^T R u_t. Using these approximations, one can construct the static LQR feedback gain matrix efficiently [20, §6.1], which we denote as K = dlqr(A, B, Q, R). In fact, if the true system dynamics are linear and the stage cost is quadratic, the static LQR gain matrix is optimal for the case of an infinite time horizon, i.e., T → ∞. However, it leads to suboptimal performance in the case of a nonlinear system, as considered in this paper. In the following, we review three different parameterizations for linear policies used in BO-based policy search, two of which make use of the LQR.
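The dlqr(A, B, Q, R) operation is standard; a minimal implementation based on SciPy's solver for the discrete-time algebraic Riccati equation could look as follows.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def dlqr(A, B, Q, R):
    """Static discrete-time LQR gain K such that u = -K x.

    Solves the discrete algebraic Riccati equation for P and returns
    K = (R + B' P B)^{-1} B' P A.
    """
    P = solve_discrete_are(A, B, Q, R)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
```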
1) Optimizing the Gain Matrix:
In [3], BO is applied by directly optimizing the feedback gain matrix, i.e., each entry in K(θ) corresponds to one optimization parameter. Thus, the number of parameters scales linearly in the number of states and inputs. This method is model-free, i.e., it does not make use of the LQR, and thus no (linear) dynamics model is required for this parameterization.
2) Optimizing System Matrices:
This particular parameterization was first used in [9]. The idea is to parameterize the system matrices A and B, where each entry of the matrices corresponds to one parameter, from which the respective LQR gain is then calculated:

$$K_{AB}(\theta) = \mathrm{dlqr}(A(\theta), B(\theta), Q, R). \qquad (3)$$

With this parameterization, a task-specific model is learned, which can be better than, e.g., the true model linearized around an operating point. However, the number of parameters scales quadratically with the number of states, making this approach infeasible for large state spaces.
3) Optimizing Weight Matrices:
Given a linear approximation of the dynamics, this method tunes the LQR's weight matrices instead of the system matrices [8]:

$$K_{QR}(\theta) = \mathrm{dlqr}(A, B, Q(\theta), R(\theta)). \qquad (4)$$

Commonly, only the diagonal entries of the weight matrices are tuned, i.e., the matrices are of the following form: $Q(\theta) = \mathrm{diag}(10^{\theta_1}, \ldots, 10^{\theta_{n_x}})$ and $R(\theta) = \mathrm{diag}(10^{\theta_{n_x+1}}, \ldots, 10^{\theta_{n_x+n_u}})$, such that the number of optimization parameters is reduced to n_x + n_u.

For the remainder of this paper, we refer to the aforementioned methods as K-learning [3], AB-learning [9], and QR-learning [8], respectively. In this paper, we propose to combine the ideas of K-learning and the LQR to allow for efficient optimization of high-dimensional policies. This is achieved by selecting the initial optimization domain boundaries based on a probabilistic dynamics model and the LQR. Additionally, the model-based approach allows us to find a linear embedding that reduces the effective dimensionality of the optimization problem, as described in the next section.

IV. AUTOMATIC DOMAIN SELECTION & DIMENSIONALITY REDUCTION

It is well known that BO has to cover the parameter space sufficiently well with respect to the lengthscale of the cost function in order to find a good estimate of the true optimum. Without prior knowledge, however, it is difficult to decide on the range of the parameters for optimization, which is crucial for obtaining good performance without spending an excessive number of function evaluations on exploration. Commonly, the issue of finding an appropriate domain is left as a tuning parameter, and domains are chosen, e.g., based on prior experience or problem-specific expertise. Especially in high dimensions, manual tuning of the domain parameters is not feasible.

We address this issue and propose a technique for automatic domain selection, which consists of the following steps: first, a probabilistic model for the system dynamics is learned; second, we employ model-based techniques from optimal control to find a distribution over policies. Based on this distribution, we can define an appropriate domain for tuning the policy parameters using BO (Sec. IV-A). Furthermore, we can use the distribution over policies to find an embedding that maps the policy parameters into a lower-dimensional space, thus further increasing the efficiency of the subsequent optimization (Sec. IV-B).

To obtain a probabilistic model of the system dynamics, we perform Bayesian linear regression (see, e.g., [21, §3.3]) on recorded data of state/action trajectories, yielding an approximate linear model of the system dynamics. This results in a Gaussian distribution over linear dynamics models:

$$p(\mathrm{vec}(A, B) \mid \mathrm{Data}) = \mathcal{N}(\mathrm{vec}(A, B) \mid \mu_{AB}, \Sigma_{AB}), \qquad (5)$$

where μ_AB is the maximum a posteriori (MAP) estimate, Σ_AB quantifies the distribution's uncertainty, and the vec(·,·) notation denotes that the matrices A and B are reshaped and stacked into a vector.
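A minimal sketch of this system identification step is given below. It assumes an isotropic Gaussian prior on the weights and a known, shared observation noise variance (both hyperparameters are illustrative, not values from the paper), and treats each state dimension as an independent Bayesian linear regression with shared regressors z_t = [x_t; u_t].

```python
import numpy as np

def dynamics_posterior(X, U, X_next, noise_var=1e-2, prior_var=1e2):
    """Bayesian linear regression for x_{t+1} ~ [A B] [x; u], cf. Eq. (5).

    X, U, X_next -- arrays of shape (N, n_x), (N, n_u), (N, n_x)
    Returns the posterior mean and covariance of vec(A, B), where the
    rows of [A B] are stacked into a single vector.
    """
    Phi = np.hstack([X, U])                         # regressors z_t = [x_t; u_t]
    d = Phi.shape[1]
    # Posterior precision and covariance (shared across output dimensions).
    precision = Phi.T @ Phi / noise_var + np.eye(d) / prior_var
    Sigma_w = np.linalg.inv(precision)
    W_mean = Sigma_w @ Phi.T @ X_next / noise_var   # shape (d, n_x)
    n_x = X.shape[1]
    mu_AB = W_mean.T.reshape(-1)                    # row-wise stacking of [A B]
    Sigma_AB = np.kron(np.eye(n_x), Sigma_w)        # block-diagonal covariance
    return mu_AB, Sigma_AB
```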
Based on the probabilistic model of the system dynamics, we now seek to define an appropriate range for all policy parameters. From Eq. (5) we can sample n_s pairs of (A, B), for each of which we calculate the respective LQR feedback gain matrix, resulting in the sample distribution p(vec(K) | (A, B)_{n_s}). In general, this sample distribution can be multi-modal due to nonlinearities from the Riccati equation that is solved for the construction of the LQR [20]. However, since we are only looking for a bounding box based on the samples, it is sufficient to use a unimodal approximation. To this end, we use a product of independent normal distributions, one for each dimension:

$$p(\mathrm{vec}(K) \mid (A, B)_{n_s}) \approx \prod_{i=1}^{n_\theta} \mathcal{N}(\mathrm{vec}(K)_i \mid \mu_{K,i}, \sigma_{K,i}), \qquad (6)$$

where the individual parameters for means and variances are found using moment matching. Given this approximation, we can construct a domain centered around the mean of the distribution, where the width is governed by the distribution's standard deviation:

$$\Theta_{K,\mathrm{indep}} = [\mu_{K,1} - \beta\sigma_{K,1},\, \mu_{K,1} + \beta\sigma_{K,1}] \times \cdots \times [\mu_{K,n_\theta} - \beta\sigma_{K,n_\theta},\, \mu_{K,n_\theta} + \beta\sigma_{K,n_\theta}], \qquad (7)$$

where the parameter β determines the effective size of the domain. Due to the assumption of independence between entries of K, we call this domain the independence domain.
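Putting Eqs. (5)-(7) together, the independence domain can be constructed by sampling dynamics models, solving an LQR problem for each sample, and moment matching. The sketch below reuses the dlqr() helper from Sec. III-C; the defaults for n_s and β are illustrative, not values prescribed by the paper.

```python
import numpy as np

def independence_domain(mu_AB, Sigma_AB, n_x, n_u, Q, R, beta=1.0, n_s=200,
                        rng=np.random.default_rng(0)):
    """Box domain for the entries of K, following Eqs. (6)-(7).

    Samples (A, B) from the posterior in Eq. (5), computes the LQR gain
    for each sample, and moment-matches every entry of vec(K) with an
    independent normal. Returns a (n_theta, 2) array of [lower, upper].
    """
    gains = []
    for _ in range(n_s):
        w = rng.multivariate_normal(mu_AB, Sigma_AB).reshape(n_x, n_x + n_u)
        A, B = w[:, :n_x], w[:, n_x:]
        try:
            gains.append(dlqr(A, B, Q, R).reshape(-1))
        except (np.linalg.LinAlgError, ValueError):
            continue  # discard samples for which the Riccati solve fails
    K = np.array(gains)
    mu_K, sigma_K = K.mean(axis=0), K.std(axis=0)   # moment matching, Eq. (6)
    return np.stack([mu_K - beta * sigma_K, mu_K + beta * sigma_K], axis=1)
```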
B. PCA Domain

In the previous section, we assumed independence between the policy parameters of the sample distribution. In general, however, the entries of K are not independent. To take advantage of potential correlations between parameters, we can approximate the sample distribution with a multivariate Gaussian:

$$p(\mathrm{vec}(K) \mid (A, B)_{n_s}) \approx \mathcal{N}(\mathrm{vec}(K) \mid \mu_K, \Sigma_K), \qquad (8)$$

where the goal of the approximation is to model the overall location and spread of the samples, not to accurately model the (potentially) multiple modes.

Based on this multivariate distribution, we propose to transform the optimization parameters into the eigenspace of the covariance matrix, Σ_K. The transformation of the parameters is described by θ̃ = T(θ − μ_K), where the transformation matrix T consists of the eigenvectors of Σ_K. In the eigenspace, the optimization domain is given by

$$\tilde{\Theta}_{K,\mathrm{PCA}} = [-\beta\tilde{\sigma}_{K,1},\, \beta\tilde{\sigma}_{K,1}] \times \cdots \times [-\beta\tilde{\sigma}_{K,n_\theta},\, \beta\tilde{\sigma}_{K,n_\theta}], \qquad (9)$$

where σ̃_{K,i} denotes the i-th eigenvalue of Σ_K. In essence, we are performing a principal component analysis (PCA) [21, §12.1], and hence we call this domain the PCA domain. The benefit of this transformation is that we 1) are able to identify the most relevant directions of the parameter space and 2) still retain the uncertainty information in each direction, and are thus able to create meaningful parameter ranges in the transformed space.

For high-dimensional problems, some of the eigenvalues are often close to zero and the domain size in the respective dimension becomes negligible, effectively reducing the dimensionality of the optimization problem. Note that the volume of the domain in eigenspace is always smaller than or equal to that of the independence domain, so BO in the eigenspace leads to faster convergence if there are correlations between parameters. A visualization of the different domains is shown in Fig. 1.
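A corresponding sketch for the PCA domain: the sample covariance of the flattened gains is eigendecomposed and, following Eq. (9), the half-width of the box in each eigendirection is set by the corresponding eigenvalue of Σ_K. Function and variable names are again illustrative.

```python
import numpy as np

def pca_domain(K_samples, beta=1.0):
    """Eigenspace domain from Eqs. (8)-(9).

    K_samples -- (n_s, n_theta) array of flattened LQR gains sampled as above.
    Returns mu_K, the transformation matrix T (rows are eigenvectors of
    Sigma_K, so theta_tilde = T @ (theta - mu_K)), and the box bounds in
    the transformed coordinates. A point maps back via
    theta = T.T @ theta_tilde + mu_K.
    """
    mu_K = K_samples.mean(axis=0)
    Sigma_K = np.cov(K_samples, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(Sigma_K)   # eigenvalues in ascending order
    T = eigvecs.T
    sigma_tilde = eigvals                        # per Eq. (9): i-th eigenvalue
    bounds = np.stack([-beta * sigma_tilde, beta * sigma_tilde], axis=1)
    return mu_K, T, bounds
```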
C. Dynamic Domain Adaptation
For both the independence domain and the PCA domain presented in the previous sections, the tuning parameter β needs to be chosen carefully. With increasing β, BO has to cover a larger space during the subsequent optimization, which means that more evaluations of the cost function are needed for sufficient exploration. At the same time, it might also be possible to find better policy parameters on a larger domain. Thus, the goal is to find a trade-off between restricting the domain to a reasonable size and choosing a large enough domain such that we do not suffer from model bias. We argue that it is more efficient to start with a small domain, i.e., a small value of β, and then adapt the domain boundaries during optimization to account for the fact that the global optimum might not lie within the initial domain. In this section, we explain how to exploit the surrogate model that approximates the objective function in order to adapt the domain boundaries in a goal-oriented manner.

Fig. 2: Sketch depicting the relevant parameters used for DDA. The true objective is approximated by a GP, and the estimated optimum lies on the domain's boundary. Consequently, the domain grows in the direction of the global optimum with stepsize ΔΘ, which is proportional to the GP's lengthscale λ, the GP's gradient ∇_θ μ_GP at the estimated optimum, and the size of the current domain Θ.

While running BO, we have an estimate of the optimum, θ*, i.e., the minimum of the approximate cost function on the current domain. If BO finds the location of the estimated optimum to lie on the domain's boundary, ∂Θ, it is likely that there are better parameters outside the current domain. Thus, we propose to grow the domain in the dimensions for which the estimated optimum is at the boundary.

This dynamic domain adaptation (DDA) is guided by the GP that models the objective function. The stepsize, ΔΘ_i, by which the boundaries are increased is chosen heuristically proportional to three factors:
1) The gradient of the GP posterior mean at the current estimate of the optimum, ∇_{θ_i} μ_GP(θ*). If the gradient at the boundary is steep, we expect a potentially better point to be further away from the boundary than if the gradient is small.
2) The lengthscale, λ_i, of the GP that approximates the cost function. For large lengthscales, the model assumes the cost function to vary slowly, and thus the stepsize should be increased accordingly.
3) The domain's extent in dimension i. Dimensions in which the domain is large should also be increased by a greater stepsize.
The stepsize is then given by

$$\Delta\Theta_i = \gamma \cdot \nabla_{\theta_i}\mu_{\mathrm{GP}}(\theta^*) \cdot \lambda_i \cdot \|\Theta_i\|, \qquad (10)$$

where γ is a tuning parameter that governs the effective stepsize; a code sketch of this update is given at the end of this subsection. A one-dimensional visualization of DDA and all its relevant parameters can be seen in Fig. 2, and a summary of the proposed algorithm is given in Alg. 1. Note that DDA is not limited to policy search, but can be used with any BO procedure, irrespective of the application and the meaning of the parameters.

A similar heuristic to grow the domain during optimization has been proposed in [22], where the volume of the domain is increased isotropically by a constant factor every few evaluations of the objective function. In the provided experiments, the constant factor is chosen to be 2, and thus this approach is called volume doubling (VD).
Our approach differs in two aspects: 1) the growth of the domain is not based on a fixed schedule, i.e., increasing the domain every few iterations, and 2) we increase the domain anisotropically based on the surrogate model. We argue that when the initial domain is already close to the objective function's global optimum, DDA only increases the domain towards the global optimum and thus leads to faster convergence compared to VD. Besides the VD heuristic, another approach proposed in [22] regularizes the acquisition function with a quadratic prior mean function. In this way, no explicit domain boundaries need to be specified; however, the shape of the quadratic regularizer gives rise to implicit bounds. All of the aforementioned methods are evaluated on a synthetic 2D function in Sec. V-A.
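The DDA update referenced above (Eq. (10)) can be summarized in a few lines. This sketch assumes box bounds and a GP kernel with per-dimension lengthscales; the gradient of the posterior mean is assumed to be available (e.g., via finite differences), and γ is the tuning parameter from Eq. (10).

```python
import numpy as np

def dda_step(bounds, theta_star, grad_mu, lengthscales, gamma=0.1, tol=1e-8):
    """One dynamic domain adaptation step, following Eq. (10).

    bounds       -- (n_theta, 2) array of current [lower, upper] boundaries
    theta_star   -- current estimated optimum (minimizer of the GP mean)
    grad_mu      -- gradient of the GP posterior mean at theta_star
    lengthscales -- per-dimension lengthscales of the GP kernel

    Grows the boundary only in dimensions where theta_star lies on it.
    """
    new_bounds = bounds.copy()
    for i in range(len(theta_star)):
        extent = bounds[i, 1] - bounds[i, 0]
        step = gamma * abs(grad_mu[i]) * lengthscales[i] * extent
        if abs(theta_star[i] - bounds[i, 0]) < tol:    # optimum at lower boundary
            new_bounds[i, 0] -= step
        elif abs(theta_star[i] - bounds[i, 1]) < tol:  # optimum at upper boundary
            new_bounds[i, 1] += step
    return new_bounds
```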
Algorithm 1: BO with DDA and domain selection
p(vec(A, B) | Data) ← perform system identification to obtain the probabilistic dynamics model in Eq. (5)
Θ ← select initial optimization domain, e.g., based on the dynamics model (see Sec. IV)
repeat
  θ* ← run BO on current domain Θ
  if θ* ∈ ∂Θ then
    i ← dimension at which θ* is at ∂Θ
    Θ_i ← increase domain by stepsize ΔΘ_i, with ΔΘ_i ∝ ∇_{θ_i} μ_GP(θ*), λ_i, ‖Θ_i‖
  end if
until convergence or sufficient performance is achieved

V. SIMULATIONS AND EXPERIMENTS

With the simulations and hardware experiments presented in this section, we aim to support the following results:
• The methods proposed in Sec. IV are able to select meaningful domain boundaries for BO and are applicable to a variety of control tasks with less manual parameter tuning.
• The proposed DDA scheme helps to adjust the domain boundaries in a goal-oriented manner and is more efficient than the VD scheme if the initial domain is already close to the objective function's global optimum.
• Choosing an informed parameter transformation significantly reduces the number of experiments needed for good performance and enables BO-based policy search for higher-dimensional systems.

Fig. 3: Comparison of dynamic domain adaptation (DDA), volume doubling (VD), unbounded quadratic (UBQ) and vanilla BO on the three-hump camel function. (a) Parameter space view showing the growing domain boundaries and evaluated points (×) during optimization; the global optimum is also marked. (b) Regret of the best function value seen so far. With DDA, the regret converges faster, as the domain growth does not follow a fixed schedule (VD) and has more flexibility compared to UBQ.
A. Synthetic Function
We start by comparing the DDA scheme presented in Sec. IV-C with the approaches proposed in [22]. As the objective to be minimized, we choose the three-hump camel function, a 2-dimensional function with three local optima and one global optimum at θ* = (0, 0). We sample N = 50 random initial domains of equal size in each dimension, ensuring that the global optimum is not contained in any of the initial domains. We report the regret r_t = |f(θ*) − min_t f(θ_t)| for all N initial domains and illustrate the domain adaptation for one of the sampled initial domains in Fig. 3. The solid line represents the median over all runs and the shaded areas indicate the 25th and 75th percentiles, respectively. The example highlights that DDA leads to improved convergence and more goal-oriented sampling of the parameter space.
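For reference, the benchmark objective and the regret measure used in this comparison can be written down directly; this is a small self-contained sketch, not the experiment code.

```python
import numpy as np

def three_hump_camel(theta):
    """Three-hump camel function; global minimum f(0, 0) = 0."""
    x, y = theta
    return 2 * x**2 - 1.05 * x**4 + x**6 / 6 + x * y + y**2

def regret(f_opt, observed_values):
    """Simple regret r_t = |f(theta*) - min_t f(theta_t)| after each evaluation."""
    return np.abs(f_opt - np.minimum.accumulate(observed_values))
```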
B. Policy Search

In this section, we compare the different policy parameterizations presented in Sec. III and evaluate the influence of the methods proposed in Sec. IV and Sec. IV-C on K-learning. Additionally, we benchmark our method against REMBO [5], a well-known method for reducing the dimensionality of the objective function using random embeddings.
Fig. 5: Double integrator (hardware), K-learning. Left: Comparison of performance when the policy is optimized on the large domain and on the independence domain; 10 independent runs were performed. Right: Parameter space showing the large domain, the independence domain, the evaluated policies, and the nominal LQR (×).

To evaluate the performance of the policies, we compare the cost of one episode resulting from a learned policy to the cost resulting from the nominal LQR policy. In simulation, the nominal LQR policy is based on the true dynamics model linearized at the target position; for the hardware experiments, the nominal model provided by the manufacturer is used as the true model. This results in the normalized performance measure η = (J − J_LQR)/J_LQR. In the convergence plots shown in the following, the solid lines represent the median performance and the shaded areas indicate the 25th and 75th percentiles, respectively. For all experiments we use the GPyOpt toolbox [23] and the UCB acquisition function [17], which was shown to outperform other acquisition functions in robotics applications [3]. In addition, we use the logarithm of the cost function (1), as this has been shown to lead to faster convergence [9].

For the hardware experiments we use the Quanser Qube Servo 2, which has two setups: one being a disk, which is a double integrator system, the second being a Furuta pendulum [24], see also Fig. 4.

Fig. 4: Furuta pendulum used for hardware experiments.

a) Double Integrator (Hardware): The state of the double integrator is 2-dimensional and consists of the disk's angle, φ, and its velocity, φ̇. For system identification, we applied random inputs and recorded 5 seconds of data. Initially, the disk was placed at x_0 = [φ, φ̇]^T = [90°, 0]^T, with the goal of regulating it to the zero state.

In this experiment, we only apply K-learning, because this two-dimensional problem already shows the importance of domain selection and the domains can be easily visualized. For a comparison between AB-, QR- and K-learning, we
consider the dimensionality of this problem too low, and thus show such a comparison on more complex systems in the following sections. Although the problem appears simple, vanilla BO on a large domain often fails to find a sensible policy (see Fig. 5). Optimizing on the large domain shows slow convergence, as many evaluations are non-informative and far away from the optimum. When using the method proposed in Sec. IV to identify a relevant domain (the independence domain in this case), BO consistently finds good policies within few iterations. The nominal policy performs sub-optimally due to effects from friction and stiction, which are not modeled.

b) Cart-Pole (Simulation): The cart-pole system consists of a cart that can be moved horizontally, with a freely rotating pendulum attached. The input is a horizontal force acting on the cart. The state of the system is given by the cart's position, z, the pendulum's angle, φ, and their respective time derivatives: x = [z, φ, ż, φ̇]^T. For system identification, we started with the pendulum in the upright position (φ = 0°), applied random inputs, and recorded the resulting state trajectory until the absolute value of the angle exceeded 30°. This process was repeated five times. The task was to stabilize the pendulum in the upright position over a time horizon of five seconds, where the initial condition was set to x_0 = [0, 30°, 0 m/s, 0]^T. Note that with this initial condition, the system dynamics are strongly nonlinear.

Fig. 6: Comparison of our proposed extensions to K-learning with QR- and AB-learning on the cart-pole task. The dashed line corresponds to the nominal LQR using the true dynamics (linearized around the upright position). 30 independent runs were performed.

In Fig. 6 we compare K-learning on the independence domain with AB- and QR-learning. Additionally, we show the improved performance of K-learning when optimizing on the PCA domain, with and without DDA. QR-learning and K-learning on the independence domain show slow convergence and high variance in the resulting performance. AB-learning fails to improve within 30 iterations, which is in accordance with the results presented in [9], where several
Fig. 7: Comparison of our proposed extensions to K-learning with QR- and AB-learning on the Furuta pendulum. 5 independent runs were performed.

hundred iterations are needed to achieve good performance for the same system and a task of similar complexity. Only when optimizing on the PCA domain using K-learning do we consistently outperform the LQR within only five iterations. However, as discussed in Sec. IV-C, it is possible to converge to a sub-optimal solution when the optimization domain is chosen too conservatively, as can be seen from the red curve, which converges quickly to its final performance. When we additionally use DDA, the performance is further increased and all other methods are consistently outperformed.

c) Furuta Pendulum (Hardware): An inverted pendulum is attached to the end of a rotary arm that is actuated via a torque; the underlying system dynamics are similar to those of a cart-pole system. The state is given by x = [η, φ, η̇, φ̇]^T, with η being the angle of the rotary arm and φ the angle of the pendulum, respectively. The system identification process was the same as for the simulated cart-pole system, using a few seconds of trajectory data. The task to be optimized was regulating the system from x_0 = [45°, 0, 0, 0]^T to the zero state, accumulating a cost over five seconds.

Qualitatively, the results on the hardware are similar to the simulation results on the cart-pole system (see Fig. 7). On the hardware, only K-learning with the proposed extensions consistently outperforms the nominal policy within a small number of iterations.¹ QR-learning stagnates quickly at the nominal controller's level, which might be due to the model-based policy parameterization. AB-learning is also unable to improve within the given number of iterations due to the high dimensionality of the optimization problem. The policies obtained during K- and QR-learning result in stable behavior of the pendulum in most cases. However, when using AB-learning on the hardware, we reject policies that are not in a pre-defined safe set, in order not to damage the experimental setup. For rejected policies, we assign the same cost as obtained with the initial policy.

¹All experiments were initialized with the same policy, i.e., the LQR using the MAP estimate. The spread of the performance can be explained by the inherent stochasticity of the problem on the one hand and small differences in the initial conditions that are inevitable for mechanical setups on the other hand. The nominal controller also shows some variance in the resulting performance; we use the median of 20 independent runs for cost normalization.

Fig. 8: Performance comparison obtained from directly optimizing the full, 48-dimensional feedback gain matrix of a quadcopter (K-learning). Our proposed extensions (independence and PCA domains, each with and without DDA) are compared to REMBO with two different effective dimensionalities (d_e = 10 and d_e = 20). 30 independent runs were performed.

d) Quadcopter (Simulation): To show the applicability to high-dimensional problems, we demonstrate the approach on a simulated quadcopter. This system has twelve states (position, linear velocities, orientation, and angular velocities) and four inputs (the rotational velocities of the rotors).
For model learning, we recorded several seconds of flying through ten different waypoints, where the positions [x, y, z] and the yaw angle ψ of the waypoints were sampled uniformly from fixed intervals, and the other states of the waypoints were set to zero. In hardware experiments, this procedure can easily be adapted by using a remote controller to generate the data. The control task was to stabilize the quadcopter at the hover position, where the initial position was shifted in the x-direction and the roll and pitch angles were set to nonzero values.

Due to the superior performance of K-learning in the previous experiments, we focus on this approach for the high-dimensional quadcopter and compare the influence of DDA and the different domains introduced in Sec. IV. For the comparison with REMBO [5], we choose two different effective dimensionalities, d_e = 10 and d_e = 20, and optimize on the independence domain. For all methods, we initialize BO with the LQR policy obtained from the MAP estimate of the system identification step, which was already able to stabilize the quadcopter.

The results are shown in Fig. 8. With and without DDA, optimizing on the PCA domain shows faster convergence than on the independence domain, due to the smaller volume of the domain. Furthermore, DDA helps to 1) further speed up the convergence and 2) obtain policies that consistently outperform the LQR within only 30 iterations. Using a random embedding for dimensionality reduction, as done by REMBO, yields significantly lower performance, and none of the optimized policies is able to outperform the nominal LQR.

Fig. 9: Exemplary trajectories of the x-position and pitch angle before and after optimization, compared to the target state and to the LQR policies based on the true and learned models. Note that the optimized policy leads to faster convergence to the desired states and less overshoot of the pitch angle.

For a more intuitive understanding of the actual performance increase obtained with BO, we show trajectories of the quadcopter before (gray dotted) and after (green) optimization in Fig. 9. Especially the x-position of the quadcopter converges significantly faster to the desired value after optimization of the policy.

VI. CONCLUSION

In this paper, we have shown that the choice of the optimization domain is critical for the convergence, final performance, and scalability of BO in policy search methods. By using a model-based technique from optimal control, we proposed an automatic domain selection method for optimizing a linear feedback policy in a model-free manner, which exploits the objective function's structure and improves the sample efficiency of BO. Additionally, we introduced a dynamic domain adaptation mechanism to mitigate potential model bias due to the choice of the initial domain. Simulations and experiments have shown that these contributions enable BO-based policy search to find a policy that outperforms other control techniques that use the true dynamics model, with only few system interactions. A key benefit of the reduced search domain provided by the proposed technique is the improved scalability of BO. We have demonstrated our approach by learning a 48-dimensional policy for a quadcopter, a size that renders standard BO techniques infeasible.

REFERENCES

[1] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in
Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.
[2] J. Gonzalez, J. Longworth, D. James, and N. Lawrence, "Bayesian optimization for synthetic gene design," NIPS Workshop on Bayesian Optimization in Academia and Industry, 2014.
[3] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth, "Bayesian optimization for learning gaits under uncertainty," Annals of Mathematics and Artificial Intelligence, vol. 76, no. 1-2, pp. 5–23, 2016.
[4] R. Martinez-Cantin, N. de Freitas, A. Doucet, and J. A. Castellanos, "Active policy learning for robot planning and exploration under uncertainty," in Proceedings of the Robotics: Science and Systems Conference (RSS), 2007.
[5] Z. Wang, F. Hutter, M. Zoghi, D. Matheson, and N. de Freitas, "Bayesian optimization in a billion dimensions via random embeddings," Journal of Artificial Intelligence Research, vol. 55, pp. 361–387, 2016.
[6] R. Garnett, M. A. Osborne, and P. Hennig, "Active learning of linear embeddings for Gaussian processes," in Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2014, pp. 230–239.
[7] K. Kandasamy, J. Schneider, and B. Póczos, "High dimensional Bayesian optimisation and bandits via additive models," in Proceedings of the International Conference on Machine Learning (ICML), 2015, pp. 295–304.
[8] A. Marco, P. Hennig, J. Bohg, S. Schaal, and S. Trimpe, "Automatic LQR tuning based on Gaussian process global optimization," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2016.
[9] S. Bansal, R. Calandra, T. Xiao, S. Levine, and C. J. Tomlin, "Goal-driven dynamics learning via Bayesian optimization," arXiv preprint arXiv:1703.09260, 2017.
[10] M. Frean and P. Boyle, "Using Gaussian processes to optimize expensive functions," in Proceedings of the Australasian Joint Conference on Artificial Intelligence, 2008, pp. 258–267.
[11] R. Antonova, A. Rai, and C. G. Atkeson, "Deep kernels for optimizing locomotion controllers," arXiv preprint arXiv:1707.09062, 2017.
[12] A. Wilson, A. Fern, and P. Tadepalli, "Using trajectory data to improve Bayesian optimization for reinforcement learning," Journal of Machine Learning Research, vol. 15, no. 1, pp. 253–282, 2014.
[13] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[14] H. J. Kushner, "A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise," Journal of Basic Engineering, vol. 86, no. 1, pp. 97–106, 1964.
[15] J. Močkus, "On Bayesian methods for seeking the extremum," in Optimization Techniques IFIP Technical Conference, 1975, pp. 400–404.
[16] D. D. Cox and S. John, "A statistical method for global optimization," in IEEE Transactions on Systems, Man, and Cybernetics, 1992, pp. 1242–1246.
[17] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, "Gaussian process optimization in the bandit setting: No regret and experimental design," in Proceedings of the International Conference on Machine Learning (ICML), 2010.
[18] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, "Taking the human out of the loop: A review of Bayesian optimization," Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
[19] H. Mania, A. Guy, and B. Recht, "Simple random search provides a competitive approach to reinforcement learning," arXiv preprint arXiv:1803.07055, 2018.
[20] R. F. Stengel, Optimal Control and Estimation. Courier Corporation, 1986.
[21] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[22] B. Shahriari, A. Bouchard-Côté, and N. de Freitas, "Unbounded Bayesian optimization via regularization," in Artificial Intelligence and Statistics, 2016, pp. 1168–1176.
[23] The GPyOpt authors, "GPyOpt: A Bayesian optimization framework in Python," http://github.com/SheffieldML/GPyOpt, 2016.
[24] K. Furuta, M. Yamakita, and S. Kobayashi, "Swing-up control of inverted pendulum using pseudo-state feedback,"