Robust Model-free Reinforcement Learning with Multi-objective Bayesian Optimization
Matteo Turchetta, Andreas Krause, Sebastian Trimpe

Learning and Adaptive Systems Group, ETH Zurich, Zürich, Switzerland ([email protected], [email protected]). Intelligent Control Systems Group, Max Planck Institute for Intelligent Systems, Stuttgart, Germany ([email protected]). This research was supported in part by the Max Planck ETH Center for Learning Systems, the Max Planck Society and the Cyber Valley Initiative.

Abstract: In reinforcement learning (RL), an autonomous agent learns to perform complex tasks by maximizing an exogenous reward signal while interacting with its environment. In real-world applications, test conditions may differ substantially from the training scenario and, therefore, focusing on pure reward maximization during training may lead to poor results at test time. In these cases, it is important to trade off between performance and robustness while learning a policy. While several results exist for robust, model-based RL, the model-free case has not been widely investigated. In this paper, we cast the robust, model-free RL problem as a multi-objective optimization problem. To quantify the robustness of a policy, we use the delay margin and the gain margin, two robustness indicators that are common in control theory. We show how these metrics can be estimated from data in the model-free setting. We use multi-objective Bayesian optimization (MOBO) to solve this expensive-to-evaluate, multi-objective optimization problem efficiently. We show the benefits of our robust formulation both in sim-to-real and pure hardware experiments to balance a Furuta pendulum.
I. INTRODUCTION
In reinforcement learning (RL) [1], the goal is to learn a controller to perform a desired task from the data produced by the interaction between the learning agent and its environment. In this framework, autonomous agents are trained to maximize their return. It is common to assume that such agents will be deployed in conditions that are similar, if not equal, to those they were trained in. In this case, a return-maximizing agent performs well at test time. However, in real-world applications, this assumption may be violated. For example, in robotics, we can use RL to learn to fly a drone indoors. However, later on we may use the same drone to carry a payload in a windy environment. The new environmental conditions and the possible deterioration of the drone components due to their usage may result in a poor, if not catastrophic, performance of the learned controller. Another scenario where training and testing conditions differ substantially is the sim-to-real setting, i.e., when we deploy a controller trained in simulation on a real-world agent.

Considering robustness alongside performance when learning a controller can limit performance degradation due to different training and testing environments. In special cases, these goals may be aligned, and a high-performing controller can also be robust. This is the case for the Linear Quadratic Regulator (LQR), a linear state-feedback controller that is optimal for the case of linear dynamics, quadratic cost, and perfect state measurements. It is well known that the LQR exhibits strong robustness indicators, such as gain and phase margins [2]. While performance and robustness go hand in hand for the LQR, they are often conflicting in other cases. For example, a celebrated result in control theory shows that the Linear Quadratic Gaussian (LQG) regulator, the noisy counterpart of the LQR, can be arbitrarily close to instability, despite being optimal [3]. Thus, in general, we need to trade off between performance and robustness [4].
Contributions.
While many works investigating the performance/robustness trade-off exist in both the RL and control theory literature for the model-based setting, few results are known for the model-free scenario. However, there are several real-world scenarios where models are not available, inaccurate, or too expensive to use, but robustness is fundamental. Thus, in this paper, we introduce the first data-efficient, robust, model-free RL method based on policy optimization with multi-objective Bayesian optimization (MOBO). In particular, these are our contributions:
• We formulate robust, model-free RL as a multi-objective optimization problem.
• We propose a model-free, data-driven evaluation of delay and gain margins, two common robustness indicators from the model-based setting (where they are computed analytically).
• We solve this problem efficiently with expected hypervolume improvement (EHI).
• We introduce the first method that can learn robust controllers directly on hardware in a model-free fashion.
• We show how our approach outperforms non-robust policy optimization in evaluations on a Furuta pendulum for both a sim-to-real and a pure hardware setting.
Related work.
Robustness has been widely investigated in control theory [5], and standard robust control techniques for linear systems include loop transfer recovery [6], H∞ control, and µ synthesis [5], [7]. However, these methods typically assume the availability of a model, and none of these includes a learning component. Recently, robustness has drawn attention in data-driven settings, giving rise to the field of robust, model-based RL. Robust Markov decision processes study the RL problem when the transition model is subject to known and bounded uncertainties. For example, [8] studies the dynamic programming recursion in this setting. Other methods that consider parametric uncertainties include [9], [10]. All the previous methods are model-based. Robustness and performance are typical objectives in [...]

Fig. 1. Overview of the proposed approach: a candidate controller θ_{k+1} is evaluated on the system in terms of performance and robustness, and the observations drive the multi-objective optimization that proposes the next controller.

II. PROBLEM STATEMENT
In this section, we introduce our formulation of robust, model-free RL as a multi-objective optimization problem. For ease of exposition, we limit ourselves to two objectives. However, this approach naturally extends to any number of objectives, for example, multiple robustness indicators.

We assume we have a system with unknown dynamics, h, and unknown observation model, g,

    x_{t+1} = h(x_t, u_t, w_t),   o_t = g(x_t, v_t),   (1)

where x is the state, u is the control input, o is the observation, and w and v are the process and sensor noise. An RL agent aims at learning a controller u_t = π(o_t | θ), i.e., a mapping parametrized by θ from an observation o_t to an action u_t that allows it to complete its task. Policy optimization algorithms are a class of model-free RL methods that solve this problem by optimizing the performance of a given controller for the task at hand as a function of the parameters θ. Concretely, given a performance metric f_1 : Θ → R, standard, non-robust policy optimization algorithms aim to find θ* ∈ argmax_{θ∈Θ} f_1(θ). In this work, we consider regulation tasks, i.e., bringing and keeping the system in a desired goal state x̄. This includes common problems like stabilization, set-point tracking, or disturbance rejection. The performance indicator f_1 encodes these objectives.

To extend this framework to the robustness-aware case, we use a second function f_2 : Θ → R that measures the robustness of a controller. Since both the dynamics h and the observation model g are unknown, we must evaluate or approximate the value of f_2 from data. In Sec. III-B, we introduce the gain and the delay margin, two alternatives for f_2 that are commonly used in model-based control, and we discuss how to evaluate them in the model-free setting.

We aim at finding the best controller in terms of performance and robustness, as measured by f_1 and f_2. However, since we compare controllers based on multiple, and possibly conflicting, criteria, we cannot define a single best controller. Given a controller θ, we denote with f_θ = [f_1(θ), f_2(θ)] the array containing its performance and robustness values. To compare two controllers θ and θ', we use the canonical partial order over R^2: f_θ ⪰ f_{θ'} iff f_i(θ) ≥ f_i(θ') for i = 1, 2. This induces a relation in the controller space Θ: θ ⪰ θ' iff f_θ ⪰ f_{θ'}. If θ ⪰ θ', we say that θ dominates θ'. The Pareto set Θ* ⊆ Θ is the set of non-dominated points in the domain, i.e., θ* ∈ Θ* iff, for every other θ ∈ Θ, there exists i ∈ {1, 2} such that f_i(θ*) > f_i(θ). The Pareto front is the set of function values corresponding to the Pareto set. The Pareto set is optimal in the sense that, for each point θ* in it, it is not possible to find another point in the domain that improves the value of one objective without degrading another [28]. The goal of this paper is to approximate Θ* from data. Fig. 1 represents our problem graphically: we suggest a controller, we evaluate its performance and robustness on the system, and we select a new controller based on these observations to find an approximation of the Pareto front.
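To make the dominance relation and the resulting Pareto set concrete, here is a minimal sketch (not part of the original paper) that extracts the non-dominated points from a finite set of already evaluated controllers, assuming both objectives are maximized:

```python
import numpy as np

def dominates(f_a: np.ndarray, f_b: np.ndarray) -> bool:
    """f_a dominates f_b if it is at least as good in every objective
    (larger is better) and strictly better in at least one."""
    return bool(np.all(f_a >= f_b) and np.any(f_a > f_b))

def pareto_set(F: np.ndarray) -> np.ndarray:
    """Return the indices of the non-dominated rows of F (n_points x n_objectives)."""
    n = F.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and dominates(F[j], F[i]):
                keep[i] = False
                break
    return np.flatnonzero(keep)

# Toy example: column 0 = performance f_1, column 1 = robustness f_2.
F = np.array([[0.9, 0.1], [0.6, 0.5], [0.5, 0.4], [0.2, 0.8]])
print(pareto_set(F))  # [0 1 3]; point 2 is dominated by point 1
```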
III. LEARNING THE PERFORMANCE-ROBUSTNESS TRADE-OFF

For the robust, model-free RL setting we consider, we propose to learn the Pareto front characterizing the performance-robustness trade-off of a given system with MOBO. Here, we describe the necessary components to solve our problem in a data-efficient way: MOBO and the robustness and performance indicators used in our experiments. Moreover, we discuss how to evaluate such indicators from data in a model-free fashion.
A. Multi-objective Bayesian optimization
MOBO algorithms solve multi-objective optimization problems by sequentially querying the objectives at different inputs and obtaining noisy evaluations of the corresponding values. They build a statistical model of the objectives to capture the belief over them given the data available. They measure how informative a point in the domain is about the problem solution with an acquisition function. At every iteration, they evaluate the objectives at the most informative point, as measured by the acquisition function. Thus, the complex multi-objective optimization problem is decomposed into a sequence of simpler scalar-valued optimization problems. In the following, we describe the surrogate model and the acquisition function used in this work.
Intrinsic Model of Coregionalization
A single-output Gaussian process (GP) [29] is a probability distribution over the space of functions of the form f : Θ → R, such that the joint distribution of the function values computed over any finite subset of the domain follows a multivariate Gaussian distribution. A GP is fully specified by a mean function µ : Θ → R, which, w.l.o.g., is usually assumed to be zero, µ(θ) = 0 for all θ ∈ Θ, and a covariance function, or kernel, k : Θ × Θ → R. The kernel encodes the strength of statistical correlation between two latent function values and, therefore, it expresses our prior belief about the function behavior. Similarly, a D-output GP is a probability distribution over the space of functions of the form f : Θ → R^D. The difference with respect to single-output GPs is that, in this case, the kernel must capture the correlation across different output dimensions in addition to the correlation of function values at different inputs. The simplest way of doing this is by assuming that each output is independent. However, this model disregards the fundamental trade-off between robustness and performance that we are considering. For a review of kernels for multi-output GPs, see [30]. In this work, we use the intrinsic model of coregionalization (ICM), which defines the covariance between the i-th output of f(θ) and the j-th output of f(θ') by separating the input and the output contribution as follows: b_{ij} k(θ, θ'). In this case, we say f ∼ GP(µ(·), K(·,·) = B k(·,·)), where µ : Θ → R^D is a D-dimensional mean function, k : Θ × Θ → R is a scalar-valued kernel, and B ∈ R^{D×D} is a matrix describing the correlation in the output space (more details on B in Sec. IV). Given N noisy observations of f, D = {(θ_1, y_1), ..., (θ_N, y_N)}, with y_i = f(θ_i) + ω_i, where ω_i ∼ N(0, Σ) is i.i.d. Gaussian noise, we can compute the posterior distribution of the function values conditioned on D at a target input θ* in closed form as p(f(θ*) | D, θ*) ∼ N(f*(θ*), K*(θ*, θ*)). We denote with θ the inputs contained in D and with K(θ, θ) the ND × ND matrix with entries (K(θ_u, θ_v))_{i,j} for u, v = 1, ..., N and i, j = 1, ..., D; then

    f*(θ*) = K_{θ*}^T (K(θ, θ) + Σ̄)^{-1} y,   (2)
    K*(θ*, θ*) = K(θ*, θ*) − K_{θ*} (K(θ, θ) + Σ̄)^{-1} K_{θ*}^T,   (3)

where Σ̄ = Σ ⊗ I_N, with ⊗ denoting the Kronecker product, K_{θ*} ∈ R^{D×ND} has entries (K(θ*, θ_v))_{i,j} for v = 1, ..., N and i, j = 1, ..., D, and y is the ND-dimensional vector containing the concatenation of the observations in D.
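As a concrete illustration of Eqs. (2) and (3), the following sketch computes the ICM posterior for D = 2 outputs with plain NumPy. The squared-exponential input kernel, the coregionalization matrix B, and the noise levels are illustrative assumptions, not the hyperparameters used in this work:

```python
import numpy as np

def se_kernel(A, B_, lengthscale=1.0, variance=1.0):
    """Squared-exponential input kernel k(theta, theta')."""
    d2 = ((A[:, None, :] - B_[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def icm_posterior(Theta, Y, theta_star, B, noise_var):
    """ICM GP posterior mean/covariance at a single test input theta_star.

    Theta: (N, d) training inputs, Y: (N, D) observations,
    B: (D, D) output covariance, noise_var: (D,) per-output noise.
    Uses K = B (x) k and Eqs. (2)-(3) with Sigma_bar = Sigma (x) I_N.
    """
    N, D = Y.shape
    k_tt = se_kernel(Theta, Theta)                               # (N, N)
    k_st = se_kernel(theta_star[None, :], Theta)                 # (1, N)
    k_ss = se_kernel(theta_star[None, :], theta_star[None, :])   # (1, 1)

    K = np.kron(B, k_tt)                                 # (ND, ND)
    K_star = np.kron(B, k_st)                            # (D, ND)
    Sigma_bar = np.kron(np.diag(noise_var), np.eye(N))   # (ND, ND)
    y = Y.T.reshape(-1)                                  # stack per output, then per input

    alpha = np.linalg.solve(K + Sigma_bar, y)
    mean = K_star @ alpha                                                         # Eq. (2)
    cov = np.kron(B, k_ss) - K_star @ np.linalg.solve(K + Sigma_bar, K_star.T)    # Eq. (3)
    return mean, cov

# Toy usage with random 2-output data (all values are made up).
rng = np.random.default_rng(0)
Theta = rng.uniform(-1, 1, size=(8, 4))
Y = rng.normal(size=(8, 2))
B = np.array([[1.0, 0.4], [0.4, 1.0]])  # assumed output correlation
mean, cov = icm_posterior(Theta, Y, np.zeros(4), B, noise_var=np.array([0.01, 0.01]))
print(mean, np.diag(cov))
```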
Expected Hypervolume Improvement. EHI is an acquisition function introduced in [23], which selects inputs to evaluate based on a notion of improvement with respect to the incumbent solution. In multi-objective optimization, incumbent solutions take the form of approximations of the Pareto set, Θ*, whose quality is measured by the hypervolume indicator induced by the corresponding front, Y*, with respect to a reference r. Formally, the hypervolume indicator of a set of points A with respect to a reference r, HV(A; r), is the Lebesgue measure of the hypervolume covered by the boxes that have an element of A as upper corner and the reference as lower corner. It quantifies the size of the portion of the output space that is Pareto-dominated by the points in A. Given an estimate of the Pareto front, Y*, the hypervolume improvement of θ ∈ Θ is defined as the relative improvement in hypervolume obtained by adding the function value at θ, f(θ), to Y*: HI(f(θ); Y*, r) = HV(Y* ∪ {f(θ)}; r) − HV(Y*; r). However, we do not know f(θ). Instead, we have a belief over its value expressed by the posterior distribution of the GP, which, in turn, induces a distribution over the hypervolume improvement corresponding to an input θ. The EHI acquisition function quantifies the informativeness of an input θ toward the solution of the multi-objective optimization problem through the expectation of this distribution,

    α(θ | D, Y*, r) = ∫_{f(θ) ∈ R^D} HI(f(θ); Y*, r) p(f(θ) | D) df(θ).   (4)

[23] shows how to compute the integral in (4) in closed form.
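For two objectives, the hypervolume indicator and the resulting improvement can be computed exactly by a simple sweep, and the integral in (4) can be approximated by Monte Carlo sampling from the GP posterior. The sketch below, with made-up front, reference, and posterior values, only illustrates these quantities; it is not the closed-form computation of [23]:

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Exact 2-D hypervolume dominated by `front` (maximization) w.r.t. `ref`."""
    pts = np.asarray([p for p in front if np.all(p > ref)])
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(-pts[:, 0])]   # sort by first objective, descending
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                  # only non-dominated strips add volume
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def hypervolume_improvement(f_new, front, ref):
    """HI(f_new; front, ref) = HV(front U {f_new}; ref) - HV(front; ref)."""
    return hypervolume_2d(list(front) + [f_new], ref) - hypervolume_2d(front, ref)

def expected_hi_mc(mean, std, front, ref, n_samples=2000, seed=0):
    """Monte Carlo estimate of EHI under an independent Gaussian posterior."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(mean, std, size=(n_samples, 2))
    return np.mean([hypervolume_improvement(s, front, ref) for s in samples])

# Toy usage: current front, reference point, and a candidate's posterior.
front = [np.array([0.8, 0.2]), np.array([0.4, 0.7])]
ref = np.array([0.0, 0.0])
print(expected_hi_mc(mean=np.array([0.6, 0.6]), std=np.array([0.1, 0.1]),
                     front=front, ref=ref))
```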
B. Robustness

In general, robustness can have very different meanings. One may desire to ensure robustness to a certain class of disturbances, imperfections in the control system, or uncertainty in the process, for example. In control theory, the latter is often understood as robustness in the stricter sense. Specifically, robust stability assures that a controller stabilizes every member of a set of uncertain processes [5]. Such processes can, for example, be defined through a nominal process and variations thereof. Different variations lead to different robustness characterizations. Likewise, there are different notions of stability that are meaningful depending on the context. For example, for a deterministic system, asymptotic stability, i.e., x_t → x̄ as t → ∞, where x̄ is an equilibrium of the system, is often used; for systems that are continuously excited, e.g., through noise, and thus cannot approach x̄, one may seek the above limit to hold in expectation, or practical stability in the sense of a bounded state, i.e., ||x_t − x̄|| ≤ x_max for all t ≥ t_0. A controller is unstable when the respective condition does not hold (e.g., no asymptotic convergence, or x_t grows beyond any bound). While many sophisticated robustness metrics have been developed, stability margins such as gain and delay margins are some of the most common and intuitive ones [11, Sec. 9.3]. We consider these in this work and comment on alternatives in Sec. V. Below, we formally introduce them and we explain how to evaluate them in a model-free setting. Notice that our data-driven definitions can be extended to any setting where a success/failure outcome can be defined and, therefore, are not limited to stability considerations.
Gain margin. In classical control, the upper (lower) gain margin is defined for single-input-single-output (SISO) linear systems as the largest factor κ_max ∈ (1, ∞) (the smallest factor κ_min ∈ (0, 1)) that can multiply the open-loop transfer function so that the closed-loop system is stable [31, Sec. 9.5]. As the open-loop transfer function encodes both the process and the controller dynamics, the factor may represent uncertainty in the process gain or the actuator efficiency, for example. In this work, we consider a factor κ that multiplies the control action (i.e., u_t = κ π(o_t | θ)), which is equivalent to the definition for linear SISO systems, but can also be used for nonlinear ones. It quantifies how much we can lower/amplify the control action before making the system unstable. In a way, it quantifies how "far" we are from instability and, thus, how much we can tolerate differences between training and testing.
Delay margin. Similarly, we define the delay margin as the largest time delay on the measurement o_t such that the controlled system is still stable. Formally, it is the largest value of d ∈ (0, ∞) such that the closed-loop system with the delayed control action u_t = π(o_{t−d} | θ) is stable. As delays in data transmission between sensor, controller, and actuator, and in the control computation are present in most control systems, the delay margin is a very relevant measure.
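In a sampled-data implementation, the delayed control law u_t = π(o_{t−d} | θ) can be realized with a small observation buffer. The sketch below is a generic illustration; the environment interface and the number of delay steps are placeholders, not part of our setup:

```python
from collections import deque
import numpy as np

def run_delayed_policy(env_step, policy, o0, delay_steps, horizon):
    """Roll out a policy that only sees observations `delay_steps` samples old.

    env_step(u) -> next observation; policy(o) -> control input.
    For the first few samples, the oldest available observation is used.
    """
    buffer = deque([o0], maxlen=delay_steps + 1)  # holds the last d+1 observations
    trajectory = []
    for _ in range(horizon):
        o_delayed = buffer[0]          # observation from d steps ago
        u = policy(o_delayed)
        o_next = env_step(u)
        buffer.append(o_next)
        trajectory.append(o_next)
    return np.array(trajectory)
```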
Estimate from data. While the indicators above can be readily computed for linear systems, they are difficult to compute analytically if the model is nonlinear, or impossible if no model is available, as considered herein. We describe an experiment to estimate the delay margin from data in a model-free setting (those for the gain margins are analogous). For general nonlinear systems, stability with respect to an equilibrium is a local property. Thus, we assume we can reset the system to a state in the neighborhood of the equilibrium of interest, i.e., we have x_0 ∈ B(x̄, ρ), where B(x̄, ρ) is a ball centered at x̄ of radius ρ. We can establish whether the delay margin, denoted with d*, is larger or smaller than a delay d by resetting the system near x̄, deploying the delayed controller π(o_{t−d} | θ), and evaluating the stability of the resulting trajectory.

In practice, two problems arise with this approach: (i) we can evaluate a finite number of delays with a finite number of experiments; and (ii) while stability is an asymptotic condition on the state, we do not know the state and we run finite experiments. The first problem requires us to select carefully the delays we evaluate. We know that increasing values of delay take a stable system closer to instability. Thus, given m delays [d_1, ..., d_m], we do a binary search to find the largest one for which the closed-loop system is stable. This allows us to approximate the delay margin with log_2 m experiments. Not knowing the state x can be solved by estimating it from the noisy sensor measurements o_t or by introducing a new definition of stability based on o_t rather than x_t. Concerning the finite trajectories, we note that, in practical cases, it is rare to have small compounding deviations from x̄ resulting in a divergent behavior emerging only in the long run. Often, a controller makes the system converge to or diverge from x̄ within a short amount of time. In our experiments, we say that a controller stabilizes the system if, after a burn-in time that accounts for the transient behavior, it keeps the state within a box around x̄. Controllers with good margins are investigated further with longer experiments to eliminate potential outliers due to the finite trajectory issue.

Reliably estimating robustness indicators and stability of a system without a model is challenging. The estimation technique we presented is intuitive and easy to implement. While it does not provide formal guarantees on the estimation error, we show in Sec. IV that it is accurate enough to greatly improve the robustness of our algorithm with respect to the non-robust policy optimization baseline.
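The following sketch summarizes the binary-search estimate described above, together with the box-based practical stability check used in our experiments. The experiment interface, burn-in time, and state box are placeholders to be supplied by the practitioner's setup:

```python
import numpy as np

def is_stable(states, times, burn_in, box):
    """Practical stability: after the burn-in time, every state stays inside `box`
    (an array of per-dimension half-widths around the goal state)."""
    after = states[times >= burn_in]
    return bool(np.all(np.abs(after) <= box))

def estimate_delay_margin(run_experiment, delays, burn_in, box):
    """Binary search over a sorted (ascending) list of candidate delays.

    run_experiment(d) -> (states, times) for the closed loop with delay d.
    Returns the largest candidate delay for which the system was stable
    (None if even the smallest delay destabilizes it).
    """
    lo, hi, best = 0, len(delays) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        states, times = run_experiment(delays[mid])
        if is_stable(states, times, burn_in, box):
            best = delays[mid]
            lo = mid + 1          # stable: try a larger delay
        else:
            hi = mid - 1          # unstable: try a smaller delay
    return best
```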
C. Algorithm

In this section, we describe our robust policy optimization algorithm; for the pseudocode, see Algorithm 1. At iteration k, we select the controller that maximizes the EHI criterion. Then, we run two experiments to estimate its performance and robustness. For the performance, we introduce a state and action dependent reward and we define the return as the average reward obtained over an episode. The performance index is defined as the expectation of the return, which we approximate with a Monte Carlo estimate over multiple episodes. To estimate the robustness, we use the experiments from Sec. III-B. We update the data set with the experiment results. Finally, we update the estimate of the Pareto front that is used to compute the EHI as the set of dominating points of the data set. Other options to compute such an estimate from the posterior of the GP exist. However, they are computationally more expensive and they resulted in a similar performance in our experiments. In the end, the algorithm returns an estimate of the Pareto set and front. The choice of a controller from the Pareto set depends on the performance-robustness trade-off required by the test application and, therefore, the choice is left to the practitioner.

Algorithm 1 Robust policy optimization
  Inputs: reference r
  D_0 ← ∅, Y*_0 ← {r}
  for k = 0, 1, ... do
    θ_{k+1} ← argmax_{θ ∈ Θ} EHI(θ | D_k, Y*_k, r)
    y^1_{k+1} ← performance_experiment(θ_{k+1})
    y^2_{k+1} ← robustness_experiment(θ_{k+1})
    D_{k+1} ← D_k ∪ {(θ_{k+1}, [y^1_{k+1}, y^2_{k+1}])}
    Θ*_{k+1}, Y*_{k+1} ← Pareto_set_and_front(D_{k+1})
  end for
  Outputs: Pareto set Θ*, Pareto front Y*
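A compact Python rendering of Algorithm 1, assuming placeholder callables for the EHI acquisition, the Pareto set extraction, and the two experiments; the finite candidate set and all names are illustrative:

```python
import numpy as np

def robust_policy_optimization(performance_experiment, robustness_experiment,
                               candidates, ref, n_iters, ehi, pareto_set):
    """Sketch of Algorithm 1: EHI-driven selection of controllers.

    candidates: (n, d) array of controller parameters theta to choose from.
    ehi(theta, data, front, ref) -> expected hypervolume improvement.
    pareto_set(F) -> indices of non-dominated rows of F.
    """
    thetas, observations = [], []          # data set D
    front = np.array([ref])                # initial Pareto front estimate Y*
    for _ in range(n_iters):
        # Select the candidate maximizing EHI under the current surrogate model.
        scores = [ehi(theta, (thetas, observations), front, ref) for theta in candidates]
        theta_next = candidates[int(np.argmax(scores))]
        # Run the performance and robustness experiments on the system.
        y_perf = performance_experiment(theta_next)
        y_rob = robustness_experiment(theta_next)
        thetas.append(theta_next)
        observations.append([y_perf, y_rob])
        # Update the Pareto front estimate with the dominating observations.
        F = np.array(observations)
        front = F[pareto_set(F)]
    F = np.array(observations)
    idx = pareto_set(F)
    return [thetas[i] for i in idx], F[idx]
```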
Fig. 2. Pareto fronts identified in simulation for the delay margin experiment (MOBO-DM); delay margin [ms] vs. return. The green square indicates the controller tested on the hardware, see Table I. The gray circles indicate the controllers that appeared to be outliers and were discarded after running longer simulations.
Fig. 3. Pareto fronts identified in simulation for the gain margin experiment (MOBO-GM); lower gain margin vs. return. The green square indicates the controller that is tested on the hardware, see Table I.

IV. EXPERIMENTAL RESULTS
We compare the robust policy optimization algorithm in Algorithm 1 to its non-robust counterpart based on scalar BO as, e.g., in [19], [32]. We use the scalar equivalent of EHI for the non-robust case, i.e., the expected improvement (EI) algorithm [17]. We present two sets of experiments: training controllers in simulation and directly on hardware, respectively. In both cases, the learned controllers are tested on the hardware in a set of different conditions.
System.
We learn a controller for a Furuta pendulum [33] (see Figure 1), a system that is closely related to the well-known cart pole. It replaces the cart with a rotary arm that rotates in the horizontal plane. In our experiments, we use the Qube Servo 2 by Quanser [34], a small-scale Furuta pendulum. It uses a brushless DC motor to exert a torque on the rotary arm, and it is equipped with two incremental optical encoders with 2048 counts per revolution to measure the angles of the rotary arm and the pendulum. For sim-to-real, we use the dynamics model provided in the Qube Servo 2 manual [34], which is a nonlinear rigid body model. A more detailed model is presented in [33].
Controller.
We consider a state feedback controller to stabilize the pendulum about the vertical equilibrium. The system has four states, x = [α, β, ω, φ]: the angular positions of the rotary arm and the pendulum, α, β, with β = 0 being the vertical position, and the corresponding angular velocities, ω, φ. We control the voltage applied to the motor, V_m. We use the encoder readings as estimates of the angular positions, α̂, β̂. We apply a low-pass filter to the difference of consecutive angular positions to estimate the angular velocities, ω̂, φ̂. We aim to find a controller of the form u_t = V_m = [θ_1, θ_2, θ_3, θ_4][α̂, β̂, ω̂, φ̂]^T.
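The sketch below illustrates this controller structure, i.e., a linear state-feedback law acting on filtered, encoder-based state estimates. The filter coefficient, sampling time, and gains are placeholder values, not those identified on the Qube Servo 2:

```python
import numpy as np

class StateFeedbackController:
    """u = theta^T [alpha_hat, beta_hat, omega_hat, phi_hat] with low-pass filtered
    finite-difference estimates of the angular velocities."""

    def __init__(self, theta, dt, smoothing=0.9):
        self.theta = np.asarray(theta, dtype=float)  # 4 feedback gains
        self.dt = dt                                  # sampling time [s]
        self.smoothing = smoothing                    # low-pass filter coefficient
        self.prev_angles = None
        self.velocities = np.zeros(2)

    def __call__(self, alpha_hat, beta_hat):
        angles = np.array([alpha_hat, beta_hat])
        if self.prev_angles is not None:
            raw_vel = (angles - self.prev_angles) / self.dt
            # First-order low-pass filter on the finite-difference velocities.
            self.velocities = (self.smoothing * self.velocities
                               + (1.0 - self.smoothing) * raw_vel)
        self.prev_angles = angles
        state_estimate = np.concatenate([angles, self.velocities])
        return float(self.theta @ state_estimate)  # motor voltage V_m

# Hypothetical usage with made-up gains and sampling time.
controller = StateFeedbackController(theta=[-2.0, 30.0, -1.5, 2.5], dt=0.002)
u = controller(alpha_hat=0.05, beta_hat=-0.02)
```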
Scaling and reward. We define a state and action dependent reward as the negative of a quadratic cost, r(x, u) = −(x^T Q x + u^T R u), with a diagonal Q whose first entry is 1 and R = 8. The performance associated to a controller is the expected average reward it induces on the system, that is, for a trajectory of duration T, E[1/T ∫_0^T r(x_t, π(x_t | θ)) dt]. To prevent one of the objectives from dominating the contribution to the hypervolume improvement in the EHI algorithm, we must normalize them. We control the range of the robustness indicators, see Sec. III-B, and, therefore, it is easy to rescale them to the [0, 1] range. We observe empirically that the unnormalized return lies within a fixed range with upper bound 0. Thus, we clip every return value to this range and we rescale it to the [0, 1] interval. Since the pendulum incurs substantially different returns when a stabilizing or destabilizing controller is used, we cannot rescale the range linearly. Instead, we use a piecewise-linear function. In particular, since we observe empirically that stabilizing controllers have a performance between −20 and 0, we rescale the range below −20 linearly to the lower part of [0, 1] and the range [−20, 0] linearly to its upper part. This differentiates coarsely the quality of unstable controllers, and it gives a more refined scale over stable ones.
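The following sketch shows one way to implement such a piecewise-linear normalization. Since the exact clipping bound and the split of the [0, 1] interval are not specified above, the constants below (a lower clip of −100 and an even split at 0.5) are assumptions for illustration only:

```python
import numpy as np

def normalize_return(raw_return, clip_low=-100.0, stable_threshold=-20.0, split=0.5):
    """Map a raw (negative) return to [0, 1] with a piecewise-linear function.

    Returns below `stable_threshold` (destabilizing controllers) share the
    coarse lower part [0, split]; returns in [stable_threshold, 0] (stabilizing
    controllers) get the finer upper part [split, 1].
    All numeric constants are illustrative assumptions.
    """
    r = float(np.clip(raw_return, clip_low, 0.0))
    if r <= stable_threshold:
        return split * (r - clip_low) / (stable_threshold - clip_low)
    return split + (1.0 - split) * (r - stable_threshold) / (0.0 - stable_threshold)

# Example: a failing and a stabilizing controller.
print(normalize_return(-60.0), normalize_return(-5.0))  # 0.25, 0.875
```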
Surrogate models. For the non-robust algorithm, we use a standard GP model with a zero mean prior and a Matérn kernel with ν = 5/2 with automatic relevance determination (ARD). We place lognormal hyperpriors over the length scales and over the standard deviation. We use a Gaussian likelihood with no hyperprior. Similarly, for the robust algorithm, we use a zero prior mean. The correlation in the input space in the ICM model is captured by an ARD Matérn kernel with ν = 5/2, with the same hyperprior as in the non-robust case. For the correlation in the output space, we set a zero-mean Gaussian hyperprior over each entry of the matrix B. We use a Gaussian likelihood with a diagonal covariance matrix. In both cases, we update the hyperparameters using a maximum a posteriori estimate after every new data point is acquired.

Training. In the sim-to-real setting, we train 5 different controllers for each of these methods: scalar BO (non-robust), MOBO with performance and delay margin (DM), and MOBO with performance and lower gain margin (GM). The training consists of 200 BO iterations evaluated in simulation. In the hardware training setting, we train one controller for scalar BO and one for MOBO-GM using 70 BO iterations evaluated on hardware. In both settings, MOBO requires fewer iterations than the given budget to find satisfactory solutions. Thus, using a stopping criterion in Algorithm 1 would reduce the total number of iterations. We estimate performance by averaging the return over 10 independent runs. To estimate robustness, we require that the controller stabilizes the system for a given delay or gain for 5 independent runs. A trial is deemed stable if α_t and β_t remain within fixed angular bounds for all t ∈ [4, 5] s. Every training run lasts for 5 seconds. Figs. 2 and 3 show the fronts obtained by the MOBO-DM and MOBO-GM sim-to-real training, respectively. The gray circles correspond to controllers that appeared stabilizing at first, but that were ruled out with longer simulations, cf. Sec. III-B. The green squares indicate controllers tested on hardware. To emphasize the generality of our method, they were selected to be approximately at the elbow of the front without further tuning.

TABLE I
Sim-to-real experiment: we train in simulation 5 different controllers with each of scalar BO, MOBO with delay margin, and MOBO with lower gain margin. We test each one 5 times on the hardware in 4 scenarios and we compare them according to 3 metrics (see the main text for an in-depth description). The robust controllers consistently outperform the non-robust ones across all scenarios. Each cell reports E[R] / failure rate / failure time (s).

             Standard               Motor noise            Sensor noise           Added mass
Scalar BO    -0.150 / 80% / 0.92    -0.151 / 80% / 0.97    -0.151 / 80% / 1.05    -0.185 / 100% / 1.03
MOBO-DM      -0.044 / 32% / 4.61    -0.038 / 20% / 4.07    -0.063 / 20% / 4.32    -0.126 / 84% / 3.21
MOBO-GM      -0.003 / 0% / ∞        -0.013 / 0% / ∞        -0.057 / 0% / ∞        -0.004 / 0% / ∞

TABLE II
Hardware experiment: we train on hardware one controller with scalar BO and one with MOBO with lower gain margin. We test each one 5 times on the hardware in 4 scenarios and we compare them according to 3 metrics (see the main text for an in-depth description). The robust controller consistently outperforms the non-robust one across all scenarios. Each cell reports E[R] / failure rate / failure time (s); scenarios (i)-(iv) are defined in the main text.

             Extra mass (i)         Extra mass (ii)        Extra mass + length (iii)   Extra mass + noise (iv)
Scalar BO    -0.101 / 0% / ∞        -0.669 / 100% / 4.07   -0.711 / 100% / 3.59        -0.259 / 0% / ∞
MOBO-GM      -0.031 / 0% / ∞        -0.026 / 0% / ∞        -0.0259 / 0% / ∞            -0.366 / 0% / ∞

Sim-to-real test.
We test each controller learned in simulation on the hardware 5 times in 4 scenarios: (i) standard sim-to-real, (ii) sim-to-real adding zero-mean Gaussian noise to the motor voltage, (iii) sim-to-real adding noise to the encoder readings following a multinomial distribution over a small range of integer counts, with most of the probability mass on 0 and probability 0.05 on every other value, and (iv) sim-to-real with an increased pendulum mass. A run is a failure if |β| exceeds a fixed threshold. In Table I, we compare the controllers in terms of average return, failure rate, and failing time, averaged over the runs that resulted in a failure. The robust methods consistently outperform the non-robust policy optimization across all test scenarios. It appears that the lower gain margin is a more suitable robustness indicator in this setting. This may be due to the fact that, in our experience, the gain margin is less noisy to estimate.

Hardware test.
We test each controller learned on hardware 5 times in 4 scenarios: (i) and (ii) two different amounts of extra mass, (iii) an extra mass of 10 g together with an extra pendulum length, and (iv) an extra mass combined with the actuation and sensor noise used in the sim-to-real experiments. Table II summarizes the test results. Similarly to the sim-to-real setting, the robust algorithm consistently outperforms its non-robust counterpart.

V. CONCLUDING REMARKS
We present a data-efficient algorithm for robust policy optimization based on multi-objective Bayesian optimization. We suggest a data-driven evaluation of two common robustness indicators, which is suitable for model-free settings. Our hardware experiments on a Furuta pendulum show that (i) our method facilitates simulation-to-real transfer, and (ii) it consistently increases the robustness of the learned controllers as compared to BO with a single performance objective. Our results indicate a promising avenue toward robust learning control by leveraging robustness measures from control theory and multi-objective Bayesian optimization, and point to several directions for extensions. While we show that gain and delay margins are effective in practice on a mildly nonlinear system, they may not fully characterize robust stability in general [31], [4]. Thus, investigating other relevant robustness indicators that can efficiently be estimated from data in a model-free setting is a topic for future research. Also, using multiple robustness indicators simultaneously is relevant, which our method could do at the expense of a more complex scaling to balance robustness and performance.
REFERENCES
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[2] B. D. Anderson and J. B. Moore, Optimal Control: Linear Quadratic Methods. Courier Corporation, 2007.
[3] J. Doyle, "Guaranteed margins for LQG regulators," IEEE Transactions on Automatic Control, vol. 23, no. 4, pp. 756–757, 1978.
[4] B. Boulet and Y. Duan, "The fundamental tradeoff between performance and robustness," IEEE Control Systems Magazine, vol. 27, no. 3, pp. 30–44, June 2007.
[5] K. Zhou and J. C. Doyle, Essentials of Robust Control. Prentice Hall, 1998.
[6] A. Saberi, S. Peddapullaiah, and B. M. Chen, Loop Transfer Recovery: Analysis and Design. Springer-Verlag, 1993.
[7] M. Green and D. J. Limebeer, Linear Robust Control. Courier Corporation, 2012.
[8] A. Nilim and L. El Ghaoui, "Robust control of Markov decision processes with uncertain transition matrices," Operations Research, vol. 53, no. 5, pp. 780–798, 2005.
[9] R. Sharma and M. Gopal, "A robust Markov game controller for nonlinear systems," Applied Soft Computing, vol. 7, no. 3, pp. 818–827, 2007.
[10] E. Delage and S. Mannor, "Percentile optimization for Markov decision processes with parameter uncertainty," Operations Research, vol. 58, no. 1, pp. 203–213, 2010.
[11] K. J. Astrom and R. M. Murray, Feedback Systems: An Introduction for Scientists and Engineers. Princeton, NJ, USA: Princeton University Press, 2008.
[12] M. Neumann-Brosig, A. Marco, D. Schwarzmann, and S. Trimpe, "Data-efficient auto-tuning with Bayesian optimization: An industrial control study," IEEE Transactions on Control Systems Technology, 2019.
[13] H. K. Venkataraman and P. J. Seiler, "Recovering robustness in model-free reinforcement learning." IEEE, 2019, pp. 4210–4216.
[14] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta, "Robust adversarial reinforcement learning," in International Conference on Machine Learning, 2017, pp. 2817–2826.
[15] J. Morimoto and K. Doya, "Robust reinforcement learning," in Advances in Neural Information Processing Systems, 2001, pp. 1061–1067.
[16] A. Gambier and M. Jipp, "Multi-objective optimal control: An introduction," in Control Conference (ASCC), 2011 8th Asian. IEEE, 2011, pp. 1084–1089.
[17] J. Mockus, V. Tiesis, and A. Zilinskas, "The application of Bayesian methods for seeking the extremum," Towards Global Optimization, vol. 2, no. 117-129, p. 2, 1978.
[18] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas, "Taking the human out of the loop: A review of Bayesian optimization," Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
[19] F. Berkenkamp, A. P. Schoellig, and A. Krause, "Safe controller optimization for quadrotors with Gaussian processes," in IEEE International Conference on Robotics and Automation (ICRA), May 2016, pp. 491–496.
[20] A. Marco, F. Berkenkamp, P. Hennig, A. P. Schoellig, A. Krause, S. Schaal, and S. Trimpe, "Virtual vs. real: Trading off simulations and physical experiments in reinforcement learning with Bayesian optimization." IEEE, 2017, pp. 1557–1563.
[21] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth, "Bayesian optimization for learning gaits under uncertainty," Annals of Mathematics and Artificial Intelligence, vol. 76, no. 1, pp. 5–23, Feb. 2016.
[22] R. Antonova, A. Rai, and C. G. Atkeson, "Deep kernels for optimizing locomotion controllers," in Proceedings of the 1st Annual Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 78, 2017, pp. 47–56.
[23] M. Emmerich and J.-W. Klinkenberg, "The computation of the expected improvement in dominated hypervolume of Pareto front approximations," Rapport technique, Leiden University, vol. 34, 2008.
[24] M. Zuluaga, A. Krause, and M. Püschel, "e-PAL: An active learning approach to the multi-objective optimization problem," Journal of Machine Learning Research, vol. 17, no. 104, pp. 1–32, 2016.
[25] D. Hernández-Lobato, J. Hernandez-Lobato, A. Shah, and R. Adams, "Predictive entropy search for multi-objective Bayesian optimization," in International Conference on Machine Learning, 2016, pp. 1492–1501.
[26] M. Tesch, J. G. Schneider, and H. Choset, "Expensive multiobjective optimization for robotics," 2013, pp. 973–980.
[27] R. Ariizumi, M. Tesch, H. Choset, and F. Matsuno, "Expensive multiobjective optimization for robotics with consideration of heteroscedastic noise." IEEE, Sep. 2014, pp. 2230–2235.
[28] Y. Collette and P. Siarry, Multiobjective Optimization: Principles and Case Studies. Springer Science & Business Media, 2013.
[29] C. E. Rasmussen, "Gaussian processes in machine learning," in Advanced Lectures on Machine Learning. Springer, 2004, pp. 63–71.
[30] M. A. Alvarez, L. Rosasco, N. D. Lawrence et al., "Kernels for vector-valued functions: A review," Foundations and Trends in Machine Learning, vol. 4, no. 3, pp. 195–266, 2012.
[31] K. Zhou, J. C. Doyle, K. Glover et al., Robust and Optimal Control. Prentice Hall, New Jersey, 1996, vol. 40.
[32] A. Marco, P. Hennig, J. Bohg, S. Schaal, and S. Trimpe, "Automatic LQR tuning based on Gaussian process global optimization," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). IEEE, May 2016, pp. 270–277.
[33] B. S. Cazzolato and Z. Prime, "On the dynamics of the Furuta pendulum," Journal of Control Science and Engineering, vol. 2011, p. 3, 2011.
[34] Quanser, Qube Servo 2 user manual.