Model-based Reinforcement Learning from Signal Temporal Logic Specifications
Parv Kapoor, Anand Balakrishnan, and Jyotirmoy V. Deshmukh
Department of Computer Science, University of Southern California, Los Angeles, CA 90069
{pk_878, anandbal, jdeshmuk}@usc.edu

Abstract—Techniques based on Reinforcement Learning (RL) are increasingly being used to design control policies for robotic systems. RL fundamentally relies on state-based reward functions to encode the desired behavior of the robot, and bad reward functions are prone to exploitation by the learning agent, leading to behavior that is undesirable in the best case and critically dangerous in the worst. On the other hand, designing good reward functions for complex tasks is a challenging problem. In this paper, we propose expressing desired high-level robot behavior using a formal specification language known as Signal Temporal Logic (STL) as an alternative to reward/cost functions. We use STL specifications in conjunction with model-based learning to design model predictive controllers that try to optimize the satisfaction of the STL specification over a finite time horizon. The proposed algorithm is empirically evaluated on simulations of robotic systems such as a pick-and-place robotic arm, and adaptive cruise control for autonomous vehicles.
I. INTRODUCTION
Reinforcement learning (RL) [1] is a general class of algorithms that enable automated controller synthesis, where a high-level task/objective is coupled with repeated trial-and-error simulations to learn a policy that satisfies the given specification. Relatively recent developments in deep learning have renewed interest in designing scalable RL algorithms for highly complex systems [2], [3], [4], [5]. While many recent works have focused on so-called model-free learning, these algorithms are typically computationally expensive, and model-free policies are slow to train, as they are not sample efficient. A promising approach to increase sample efficiency is model-based reinforcement learning (MBRL) [6], [7], [8]. The goal here is to (approximately) learn a predictive model of the system and use this learned model to synthesize a controller using sampling-based methods like model predictive control (MPC) [9], [10]. Another advantage of MBRL is that, as the learned dynamics models are task-independent, they can be used to synthesize controllers for various tasks in the same environment.

An important problem to address when designing and training reinforcement learning agents is the design of reward functions [1]. Reward functions are a means to incorporate knowledge of the goal in the training of an RL agent, using hand-crafted and fine-tuned functions of the current state of the system. Poorly designed reward functions can lead to the RL algorithm learning a policy that fails to accomplish the goal, cf. [11]. Moreover, in safety-critical systems, the agent can learn a policy that performs unsafe or unrealistic actions, even though it maximizes the expected total reward [11]. The problem of an RL agent learning to maximize the total reward by exploiting the reward function, and thus performing unwanted or unsafe behavior, is called reward hacking [12].

With learning-enabled components being used increasingly commonly in real-world applications like autonomous cars, there has been an increased interest in the provably safe synthesis of controllers, especially in the context of RL [13], [14]. Work in safe RL has explored techniques from control theory and formal methods to design provably safe controllers [13], [15], [16], and to specify tasks by synthesizing reward functions from formal specifications [14], [17], [18]. Reward hacking has been addressed by using temporal logics like Linear Temporal Logic (LTL) and Signal Temporal Logic (STL) to specify tasks and create reward functions, as these logics can be used to specify complex tasks in a rich and expressive manner. The authors in [14], [18] explore the idea of using the robust satisfaction semantics of STL to define reward functions for an RL procedure. Similar ideas were extended to a related logic for RL-based control design for Markov Decision Processes (MDPs) in [19]. Although these techniques are effective in theory, in practice they do not typically scale well with the complexity of the task, and they introduce large sampling variance in the learned policy. This is especially true for tasks that have large planning horizons or sequential objectives.
In contrast, Temporal Logic-based motion planning has been used successfully together with model predictive control [20], [21], where a predefined model is used to predict and evaluate future trajectories.

In this paper, we are interested in combining model-based reinforcement learning with Signal Temporal Logic-based trajectory evaluation. The main contributions of this paper are as follows:
1) We formulate a procedure to learn a deterministic predictive model of the system dynamics using deep neural networks. Given a state and a sequence of actions, such a predictive model produces a predicted trajectory over a user-specified time horizon.
2) We use a cost function based on the quantitative semantics of STL to evaluate the optimality of the predicted trajectory, and use a black-box optimizer based on evolutionary strategies to identify the optimal sequence of actions (in an MPC setting).
3) We demonstrate the efficacy of our approach on a number of examples from the robotics and autonomous driving domains.

II. PRELIMINARIES
A. Model-based Reinforcement Learning
In more traditional forms of controller design, like model predictive control [9], a major step in the design process is to have a model of the system dynamics or some approximation of it, which is in turn used as constraints in a receding horizon optimal control problem. In MBRL, this step is automated through a learning-based algorithm that approximates the system dynamics by assuming that the system can be represented by a Markov Decision Process, and that the transition dynamics can be learned by fitting a function over a dataset of transitions.
A Markov Decision Process (MDP) is a tuple M = (S, A, T, ρ), where S is the state space of the system; A is the set of actions that can be performed on the system; T : S × A × S → [0, 1] is the transition (or dynamics) function, where T(s, a, s′) = Pr(s′ | s, a); and ρ is a user-supplied evaluation metric defined on either a transition of the system (like a reward function) or over a series of transitions.

The goal of reinforcement learning (RL) is to learn a policy that maximizes ρ on the given MDP. In model-based reinforcement learning (MBRL), a model of the dynamics is used to make predictions on the trajectory of the system, which is then used for action selection. The forward dynamics T is typically learned by fitting a function F̂ on a dataset D of sample transitions (s, a, s′) collected from simulations or real-world demonstrations. We will use F̂_θ(s_t, a_t) to denote the learned discrete-time dynamics function, parameterized by θ, that takes the current state s_t and an action a_t, and outputs a distribution over the set of possible successor states s_{t+δt}.

B. Model Predictive Control

Model Predictive Control (MPC) [9] refers to a class of online optimization algorithms used for designing robust controllers for complex dynamical systems. MPC algorithms typically involve creating a model of the system (physics-based or data-driven), which is then used to predict the trajectory of the system over some finite horizon. The online optimizer in an MPC algorithm computes a sequence of actions that optimizes the predicted trajectories given a trajectory-based cost function (such as ρ). This optimization is typically done in a receding horizon fashion, i.e., at each step during the run of the controller, the horizon is displaced towards the future by executing only the first control action and re-planning at the next step. Formally, the goal in MPC is to optimize ρ over the space of all fixed-length action sequences. For example, let an action sequence be denoted A_t^(H) = (a_t, ..., a_{t+H−1}); then at every time step t during the execution of the controller, for a finite planning horizon H, we solve the following optimization problem:

  maximize over A_t^(H):  ρ(ŝ_t, a_t, ŝ_{t+1}, ..., a_{t+H−1}, ŝ_{t+H})
  where ŝ_t = s_t, and ŝ_{t+i+1} = F̂(ŝ_{t+i}, a_{t+i}) for i ∈ 0, ..., H−1    (1)

where F̂ is the (approximate) system dynamics model. Notice that the system dynamics are directly encoded as constraints in the optimization problem: if the model is a linear system model, one can use Mixed Integer Linear Programming or Quadratic Programming to solve the optimization problem at each step [22], [23]. For nonlinear systems, this problem gets significantly harder, and can be solved using nonlinear optimization techniques [24], sampling-based optimizers like Monte-Carlo methods [25] and the Cross-Entropy Method [26], or evolutionary strategies, like CMA-ES [27] and Natural Evolution Strategies [28].

C. Signal Temporal Logic

Signal Temporal Logic (STL) [29] is a real-time logic, typically interpreted over signals defined on a dense time domain that take values in a continuous metric space (such as R^m). The basic primitive in STL is a signal predicate µ, a formula of the form f(s_t) ≥ c, where f is a function from the value domain (here, R^m) to R, and c ∈ R. STL formulas are then defined recursively using Boolean combinations of sub-formulas, or by applying an interval-restricted temporal operator to a sub-formula.
In this paper, we consider a restricted fragment of STL in which all temporal operators are unbounded. The syntax of our fragment of STL is formally defined as follows:

  ϕ ::= µ | ¬ϕ | ϕ ∧ ϕ | G ϕ | F ϕ | ϕ U ϕ    (2)

The Boolean satisfaction semantics of STL are defined recursively over the structure of an STL formula; in lieu of the Boolean semantics, we instead present the quantitative semantics of STL and explain how these relate to the Boolean semantics. The quantitative semantics of an STL formula ϕ are defined in terms of a robustness value ρ(ϕ, ξ, t) that maps the suffix of the signal ξ starting from time t to a real value. This value (approximately) measures a signed distance by which the signal can be perturbed before its (Boolean) satisfaction value changes. A positive value implies that the signal satisfies the formula, while a negative robustness value implies that the system violates the specification. Formally, the approximate robustness value (or simply robustness) is defined using the following recursive semantics [29]:

  ρ(f(ξ) ≥ c, ξ, t) ≡ f(ξ_t) − c
  ρ(¬ϕ, ξ, t) ≡ −ρ(ϕ, ξ, t)
  ρ(ϕ1 ∧ ϕ2, ξ, t) ≡ min( ρ(ϕ1, ξ, t), ρ(ϕ2, ξ, t) )
  ρ(G ϕ, ξ, t) ≡ inf_{t′ ≥ t} ρ(ϕ, ξ, t′)
  ρ(F ϕ, ξ, t) ≡ sup_{t′ ≥ t} ρ(ϕ, ξ, t′)
  ρ(ϕ1 U ϕ2, ξ, t) ≡ sup_{t′ ≥ t} min( ρ(ϕ2, ξ, t′), inf_{t″ ∈ [t, t′)} ρ(ϕ1, ξ, t″) )    (3)

The convention is that ξ ⊨ ϕ if ρ(ϕ, ξ, 0) ≥ 0, and ξ does not satisfy ϕ otherwise.
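The recursive semantics in Equation 3 translate directly into a recursive evaluator over discrete-time signals. The following is a minimal, unoptimized Python sketch for the unbounded fragment; the tuple-based formula encoding is our own illustrative convention, not part of the paper's implementation.

```python
import numpy as np

def robustness(phi, xi, t=0):
    """Compute rho(phi, xi, t) of Equation 3 over a discrete-time signal xi
    (a sequence of states indexed by time). Formulas are nested tuples, e.g.
    ('and', p1, p2), ('G', p), ('U', p1, p2), or ('pred', f, c) for f(s) >= c."""
    op = phi[0]
    if op == 'pred':                 # rho(f(xi) >= c, xi, t) = f(xi_t) - c
        _, f, c = phi
        return f(xi[t]) - c
    if op == 'not':
        return -robustness(phi[1], xi, t)
    if op == 'and':
        return min(robustness(phi[1], xi, t), robustness(phi[2], xi, t))
    if op == 'G':                    # infimum over all suffix times t' >= t
        return min(robustness(phi[1], xi, k) for k in range(t, len(xi)))
    if op == 'F':                    # supremum over all suffix times t' >= t
        return max(robustness(phi[1], xi, k) for k in range(t, len(xi)))
    if op == 'U':                    # sup over t' of min(rho(phi2, t'), inf over [t, t') of rho(phi1))
        best = -np.inf
        for k in range(t, len(xi)):
            r2 = robustness(phi[2], xi, k)
            r1 = min((robustness(phi[1], xi, j) for j in range(t, k)),
                     default=np.inf)  # empty prefix: inf, so min picks r2
            best = max(best, min(r1, r2))
        return best
    raise ValueError(f"unknown operator: {op}")
```

Note that the sign of the returned value gives the Boolean verdict, while its magnitude gives the satisfaction (or violation) margin.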
III. ROBUST CONTROLLER SYNTHESIS WITH APPROXIMATE MODEL

We now present a framework that combines model-based reinforcement learning (MBRL) and sampling-based model predictive control (MPC) to maximize the robustness value of a trajectory against a given STL formula. We first learn a deterministic neural network model of the system, F̂, which can be used to predict the next state given the current state and an action. We then use this model in the MPC setting described in Equation 1, where we use the model to sample trajectories over a finite horizon H, and use CMA-ES to optimize a sequence of actions, A_t^(H), that maximizes the robustness of the predicted trajectory against an STL specification ϕ.
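To make the trajectory-sampling step of Equation 1 concrete, the following minimal Python sketch rolls a learned model forward under a candidate action sequence and scores the result. Here `f_hat` and `rho` are placeholders for the learned dynamics and the STL-based trajectory cost; the names are illustrative assumptions of this sketch.

```python
import numpy as np

def rollout(f_hat, s_t, actions):
    """Roll the learned dynamics f_hat forward from state s_t under a fixed
    action sequence, mirroring the constraints of Equation 1."""
    traj = [np.asarray(s_t)]
    for a in actions:                    # s_hat_{t+i+1} = F_hat(s_hat_{t+i}, a_{t+i})
        traj.append(f_hat(traj[-1], a))
    return traj

def mpc_objective(f_hat, rho, s_t, actions):
    """Score an H-step action sequence by the trajectory evaluation metric rho."""
    return rho(rollout(f_hat, s_t, actions), actions)
```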
A. Learning the System Dynamics

Algorithm 1 Learning the system dynamics model.
  Initialize empty dataset D
  for i ∈ 1, ..., N_traj do
    for each time step t do
      a_t ∼ Uniform(·), s_{t+1} ∼ Env(s_t, a_t)
      D ← D ∪ {(s_t, a_t, s_{t+1})}
    end for
  end for
  θ ← SGD(D)

In MBRL, an approximate model of the system dynamics is learned through repeated sampling of the environment. This model can be represented using various means, including by the use of Gaussian Processes [8], [30] and, more recently, Deep Neural Networks [31], [32], [33]. Here, we represent the learned dynamics F̂_θ(s_t, a_t) as a deep neural network, where θ is the vector of parameters of the NN. The learned dynamics F̂_θ is deterministic, i.e., it outputs a single prediction of the change in the state of the system, ∆ŝ_t = s_{t+1} − s_t, rather than outputting a distribution over the predicted states. (We pick a deterministic model because, in our chosen case studies, we noticed that much of the system noise is not time-varying, and thus it suffices to predict the mean predicted state for a given state-action pair.) The same design decision was taken in [31], where MBRL is used as a preceding step to accelerate the convergence of a model-free reinforcement learning algorithm. The steps to train the model are shown in Algorithm 1.
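A minimal Python sketch of the data-collection phase of Algorithm 1, assuming an OpenAI Gym-style environment with the classic `(obs, reward, done, info)` step signature (newer Gym/Gymnasium releases differ):

```python
import gym

def collect_transitions(env_name, n_traj, horizon):
    """Algorithm 1, data-collection phase: roll out a random controller
    and record (s, a, s') transitions."""
    env = gym.make(env_name)
    dataset = []
    for _ in range(n_traj):
        s = env.reset()
        for _ in range(horizon):
            a = env.action_space.sample()       # a_t ~ Uniform(.)
            s_next, _, done, _ = env.step(a)    # s_{t+1} ~ Env(s_t, a_t)
            dataset.append((s, a, s_next))
            if done:
                break
            s = s_next
    return dataset
```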
Collecting data:
To generate the dataset D of samples (s, a, s′), we assume that we have access to some simulator, for example, the OpenAI Gym [34]. We sample a large number of trajectories (and hence, a large number of transitions), where each trajectory starts at some initial state s_0 from the environment, and actions are sampled from a random controller, i.e., we pick actions from a uniform distribution over the action space of the environment.

Preprocessing the dataset:
After generating the data, we calculate the targets as ∆s = s′ − s. We then calculate the mean and standard deviation of the inputs (s, a) and of the target outputs, and rescale the data by subtracting the mean and dividing by the standard deviation. This normalization aids the model training process and leads to accurate predictions across all dimensions of the state space.
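A small sketch of this normalization step; the function and variable names are illustrative:

```python
import numpy as np

def normalize_dataset(states, actions, next_states):
    """Standardize inputs (s, a) and targets delta_s = s' - s to zero mean
    and unit variance, returning the data together with the statistics
    needed to undo the scaling at prediction time."""
    inputs = np.concatenate([states, actions], axis=-1)
    targets = next_states - states                      # delta s
    stats = {
        "in_mu": inputs.mean(0),  "in_sd": inputs.std(0) + 1e-8,
        "tg_mu": targets.mean(0), "tg_sd": targets.std(0) + 1e-8,
    }
    inputs_n = (inputs - stats["in_mu"]) / stats["in_sd"]
    targets_n = (targets - stats["tg_mu"]) / stats["tg_sd"]
    return inputs_n, targets_n, stats
```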
Training the model:
We train F̂_θ(s, a) by minimizing the sum of squared error losses over the transitions (s, a, s′) in the dataset as follows:

  L(θ) = Σ_{(s, a, s′) ∈ D_train} ‖ s′ − F̂_θ(s, a) ‖²    (4)

where D_train is a partition of the dataset D allocated for training data. We use the remaining partition, D_val, of the dataset as a validation dataset, over which we calculate the same loss described in Equation 4. The loss minimization is carried out using stochastic gradient descent, where the dataset is split into randomly sampled batches and the loss is minimized over these batches.
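The following sketch illustrates this training procedure with a small fully connected network in PyTorch. The architecture and hyperparameters shown (layer widths, Adam optimizer, 90/10 train/validation split) are illustrative assumptions, not the exact values used in the experiments.

```python
import torch
import torch.nn as nn

def train_dynamics(inputs, targets, hidden=256, epochs=50, batch=512, lr=1e-3):
    """Fit F_theta by minimizing the squared-error loss of Equation 4 with
    minibatch stochastic gradient descent (Adam variant here)."""
    x = torch.as_tensor(inputs, dtype=torch.float32)
    y = torch.as_tensor(targets, dtype=torch.float32)
    model = nn.Sequential(
        nn.Linear(x.shape[-1], hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, y.shape[-1]),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    n_train = int(0.9 * len(x))                    # hold out 10% for validation
    for _ in range(epochs):
        perm = torch.randperm(n_train)             # random minibatches
        for i in range(0, n_train, batch):
            idx = perm[i:i + batch]
            loss = ((model(x[idx]) - y[idx]) ** 2).sum(-1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    with torch.no_grad():                          # validation loss (Eq. 4 on D_val)
        val = ((model(x[n_train:]) - y[n_train:]) ** 2).sum(-1).mean()
    return model, float(val)
```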
B. Sampling-based Model Predictive Control

Algorithm 2 Model Predictive Control using F̂_θ.
  for each time step t in the episode do
    for 1, ..., N_iter and 1, ..., N_samples do
      A_t^(H) = (a_t, ..., a_{t+H−1}) ∼ CMA-ES()
      for i ∈ 0, ..., H−1 do
        ŝ_{t+i+1} = F̂_θ(ŝ_{t+i}, a_{t+i})
      end for
      Compute cost ρ(ŝ_t, a_t, ŝ_{t+1}, ..., a_{t+H−1}, ŝ_{t+H})
      Update CMA-ES()
    end for
    Execute first action a* from the optimal sequence A_t^(H) ∼ CMA-ES()
  end for

Once we have an approximate dynamics model F̂_θ, where the parameters θ have been trained as in the previous section, we can use this model in a model predictive control setting. At state s_t, our goal is to compute a sequence of actions, A_t^(H) = (a_t, ..., a_{t+H−1}), for some finite horizon H. As stated earlier, this can be formulated as an optimization problem over sampled trajectories, as described in Equation 1, and the optimization is performed using the CMA-ES [27] black-box optimizer. During each iteration of the optimizer, a number of action sequences are sampled from the optimizer's internal distribution, and a trajectory is computed for each action sequence by repeatedly applying F̂(ŝ_{t+i}, a_{t+i}) to get ŝ_{t+i+1} (here, ŝ_t = s_t). These trajectories are then used as input to compute the robust satisfaction value, or robustness, of the sequence of actions given an STL specification ϕ, which CMA-ES uses as the objective function to maximize. At each step, we execute only the first action returned by CMA-ES, and repeat the optimization loop at the next step. The pseudocode for this is shown in Algorithm 2.
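A minimal sketch of one planning step of Algorithm 2 using the `cma` (pycma) implementation of CMA-ES; since that library minimizes its objective, the robustness is negated. `f_hat` and `rho_phi` are placeholders for the learned dynamics model and the STL robustness function.

```python
import numpy as np
import cma

def mpc_action(f_hat, rho_phi, s_t, horizon, act_dim, n_iter=7, n_samples=500):
    """One planning step of Algorithm 2: optimize an H-step action sequence
    with CMA-ES to maximize STL robustness, then return its first action."""
    def cost(flat_actions):
        actions = flat_actions.reshape(horizon, act_dim)
        traj = [np.asarray(s_t)]
        for a in actions:                   # predict with the learned model
            traj.append(f_hat(traj[-1], a))
        return -rho_phi(np.array(traj))     # negate: CMA-ES minimizes

    es = cma.CMAEvolutionStrategy(np.zeros(horizon * act_dim), 0.5,
                                  {'popsize': n_samples, 'verbose': -9})
    for _ in range(n_iter):
        candidates = es.ask()               # sample from internal distribution
        es.tell(candidates, [cost(np.asarray(c)) for c in candidates])
    best = es.result.xbest.reshape(horizon, act_dim)
    return best[0]                          # execute only the first action
```

Executing only the first action and re-planning from the next observed state implements the receding-horizon loop of Section II-B.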
IV. EXPERIMENTAL RESULTS AND DISCUSSIONS

To evaluate our framework, we train dynamics models for multiple different environments, and evaluate the model performance in the model predictive control setting. In this section, we describe the environments we used, the tasks defined on them, and the results of using the proposed deep model predictive control framework on them, which can be seen in Table I.

Fig. 1. Examples of environments used for evaluation: (a) Fetch robot; (b) highway cruise control.

TABLE I: HYPERPARAMETERS AND RESULTS (RESULTS AVERAGED ACROSS 30 RUNS)

Environment  | No. of training trajectories (N_traj) | No. of optimizer iterations (N_iter) | No. of optim. samples per iteration (N_samples) | Horizon (H) | Trajectory robustness (ρ) | Vanilla rewards^a
Cartpole     | 2000 | 5 | 1000 | 10 |  |
Mountain Car | 2000 | 2 | 1000 | 50 |  |
Fetch        | 2000 | 7 | 500  | 10 |  | –
ACC          | 400  | 7 | 500  | 2  |  | –
Parking Lot  | 400  | 5 | 5    | 5  |  | –

^a Not used for training. It is used purely as a means to evaluate how well the controller that optimizes the STL robustness does in maximizing the original reward function defined for the environment.
Remark. It should be noted that, to train the models, we chose a random controller to create the transition dataset: in the environments we study, we saw that random-controller exploration of the state and action spaces works either just as well as, or sometimes better than, other methods that trade off exploration for new transitions against exploitation of previous knowledge.
A. Cartpole
This is the environment described in [35], where a pole is attached to a cart on an unactuated joint. The cart moves on a frictionless track, and is controlled by applying a force to push it left or right on the track. The state vector for the environment is (θ, x, θ̇, ẋ), where θ is the angular displacement of the pole from vertical; x is the displacement of the cart from the center of the track; and θ̇ and ẋ are their corresponding velocities.

The requirement, as described in [35], is to balance the pole for as long as possible on the cart (maintain an angular displacement between ±12°). For our experiments, we try to maximize the robustness of the following specification:

  ϕ := G ( |x| < 2.4 ∧ |θ| < 12° )    (5)

The above specification requires that, at all time steps, the cart displacement should not exceed a magnitude of 2.4 and the angular displacement of the pole should not exceed a magnitude of 12°.

In Table I, we see that for a horizon length of 10, we are able to maximize the robustness, which is upper bounded by the smaller of the two margins: when the positional and angular displacements are 0, the robustness equals the minimum distance from violation. The minor loss of robustness can be attributed to the stochasticity in the environment dynamics, which makes it impossible to keep the pole perfectly balanced at every time step. Moreover, across multiple evaluation runs, the controller is also able to maximize the rewards obtained from the environment under the standard rewarding scheme, which is +1 for each time step that the pole is balanced on the cart, for a maximum of 200 time steps. Specifically, the controller is able to attain a total reward of 200 consistently, across multiple runs.
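For this specification, the robustness computation reduces to the worst per-step margin over the trajectory. A sketch, using the thresholds above and assuming the angle is handled in radians internally:

```python
import numpy as np

def cartpole_robustness(traj, x_max=2.4, theta_max=np.deg2rad(12)):
    """Robustness of Eq. (5): the per-step margin is the smaller of the
    position and angle margins; G then takes the worst margin over time."""
    theta, x = traj[:, 0], traj[:, 1]   # state layout: (theta, x, dtheta, dx)
    margins = np.minimum(x_max - np.abs(x), theta_max - np.abs(theta))
    return margins.min()
```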
This environment describes a mountain climbing task, as described in [36]. The environment consists of a car stuck at the bottom of a valley between two mountains, and the car can be controlled by applying a force pushing it left or right. The state vector for the car on its one-dimensional left-right track is (x, ẋ), where x is the displacement of the car on the track, and ẋ is the car's velocity along the horizontal track.

The goal of this environment is to synthesize a controller that pushes the car to the top of the mountain on the right. Under a traditional rewarding scheme, the controller is rewarded a large positive amount for reaching the top of the mountain, and a small negative amount for every step it doesn't reach the top. We encode this specification using the following STL formula, which requires that the car should eventually reach the top of the hill on the right side:

  ϕ := F ( x > 0.45 )    (6)

where 0.45 is the horizontal coordinate of the top of the mountain on the right side.

Here, since the goal is to reach a configuration, the maximum robustness is positive only when the car actually reaches the goal. Thus, any positive robustness implies a successful controller, as seen in Table I. The classical rewarding scheme is as follows: for every time step that the car hasn't reached the goal, it gets a reward of −0.1 · action², and it only gets a positive reward when it actually reaches the top of the hill. An algorithm is considered to "solve" the problem if it achieves a reward of 90.0 averaged over multiple evaluations. One key takeaway is that our controller is able to achieve this reward despite this requirement not being encoded explicitly in the specification!
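The corresponding robustness computation is a single maximum over the trajectory; a sketch, assuming the state layout above:

```python
def mountaincar_robustness(traj, x_goal=0.45):
    """Robustness of Eq. (6): F(x > x_goal) is the largest margin achieved
    at any step, so it is positive iff the car ever crosses the goal."""
    return (traj[:, 0] - x_goal).max()   # state layout: (x, dx)
```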
C. Fetch robot

We use the OpenAI Gym FetchReach environment [37], where the goal is to design a controller that moves the arm of a simulated Fetch manipulator robot [38] to a region in 3D space. The state vector of the environment is a 16-dimensional vector containing the positions and orientations of the robot's joints and end-effector. The specification below is defined solely on the position of the end-effector, denoted (x, y, z), and the position of the goal, (x_g, y_g, z_g):

  ϕ := F ( |x_g − x| < 0.05 ∧ |y_g − y| < 0.05 ∧ |z_g − z| < 0.05 )    (7)

Similar to the previous reach specification, we see that the controller learned for the FetchReach environment is able to attain a positive robustness on average and a 100 percent success rate, thus satisfying the specification.
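The robustness of this reach specification combines a conjunction (minimum over the three per-axis margins) with an eventually (maximum over time); a sketch:

```python
import numpy as np

def fetch_reach_robustness(ee_pos, goal, tol=0.05):
    """Robustness of Eq. (7): inside F, the per-step margin is the worst of
    the three axis margins; F then takes the best step over the trajectory."""
    margins = tol - np.abs(ee_pos - goal)   # shape (T, 3): per-axis margins
    return margins.min(axis=1).max()        # min over axes, max over time
```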
D. Adaptive Cruise-Control
Here, we simulate an adaptive cruise-control (ACC) scenario on a highway simulator [39]. The goal is to synthesize a controller for a car (referred to as the ego car) that performs cruise control safely in the presence of an adversarial (or ado) car ahead of it. The environment itself is a single lane, and the ego car is controlled only by the throttle/acceleration applied to the car. The ado agent ahead of the ego car accelerates and decelerates in an erratic manner, in an attempt to cause a crash.

Fig. 2. Model learning losses for each environment: (a) Cartpole; (b) Mountain Car; (c) Highway; (d) Parking Lot; (e) Fetch robot. Each plot shows the average decay of the loss with the number of training iterations across 7 different training runs.

The state space of the environment is (x_ego, v_ego, a_ego, d_rel, v_rel), where x_ego, v_ego, and a_ego are the position, velocity, and acceleration of the ego car on the highway; and d_rel and v_rel are, respectively, the displacement and velocity of the ado car relative to the ego car's coordinate frame. We have the following requirement for the cruise controller:

  ϕ := F G ( 50 > d_rel > 0 )    (8)

To aid the model learning process, we augment the above state space with the relative acceleration of the ado car, a_rel. This is done by taking the difference of v_rel between two consecutive time steps. Augmenting the state space with this manual (and trivial) estimate allows for more stable and consistent predictions by the model.

The above specification is a stable goal specification; that is, the formula requires that the ego car eventually comes to some stable configuration. Thus, any final configuration that maintains the relative distance between the ego and ado cars within the above bounds satisfies the specification. In Table I, we see that the robustness ρ is consistently positive and, since the robustness is the distance to violation, the controller stays well within the bounds.
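For the nested F G, the robustness can be computed in one backward pass over the trajectory; a sketch, assuming the bounds of Eq. (8) above:

```python
import numpy as np

def acc_robustness(d_rel, lo=0.0, hi=50.0):
    """Robustness of Eq. (8), FG(hi > d_rel > lo): compute per-step margins,
    take G (min) over every suffix in one backward pass, then F (max)."""
    margins = np.minimum(d_rel - lo, hi - d_rel)
    suffix_min = np.minimum.accumulate(margins[::-1])[::-1]   # G on each suffix
    return suffix_min.max()                                   # F over suffix starts
```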
E. Parking Lot

We use the same simulator [39] as in the previous environment to simulate a parking lot, where we design a controller to park a car in a specific spot. Here, the state of the system is described in the 2D coordinate space of the parking lot surface, (x, y, v_x, v_y, cos(θ), sin(θ), x_g, y_g), where x, y are the position; v_x, v_y are the velocity components; θ is the bearing angle of the car; and x_g, y_g are the coordinates of the parking spot.

We evaluate the controller using the following specification:

  ϕ := F ( |x_g − x| < ε ∧ |y_g − y| < ε )    (9)

for a small positional tolerance ε. Similar to Mountain Car and FetchReach, this specification is also a reach requirement. Thus, any positive robustness value (which is necessarily close to 0, given the small tolerance) is considered a success, and this is exactly what we see in Table I.

F. Discussion and Analysis
In the experimental results in Table I, we see that using a deterministic model suffices for predicting trajectories, as we are able to maximize the robustness and, where applicable, the rewards for a given system and specification. This even applies to complex environments like Adaptive Cruise-Control, where we have an adversarial car performing erratic behavior. We speculate that this is because the noise in the system is time-invariant, so predicting the mean configuration of the system is sufficiently accurate for the controller.

V. RELATED WORK
Model-based Reinforcement Learning:
Model-free deep reinforcement learning algorithms based on Q-learning [2], [40], actor-critic methods [3], [4], [41], and policy gradients [42], [5] have been used to directly learn policies for specific tasks in highly complex systems, including simulated robotic locomotion, driving, video game playing, and navigation. While these methods have been successful and popular, they tend to require an incredibly large number of samples and extensive computational capabilities. In contrast, in MBRL, the goal is to learn a model of the system by repeatedly acting on the environment and observing the state transition dynamics, and then use this model to synthesize a controller [6] or use an under-approximation to aid in exploration of the state space [43]. Gaussian Processes [44] have been used extensively as non-parametric Bayesian models in various MBRL problems where data-efficiency is critical [7], [45], [8], [30]. In recent years, neural network-based models have generally supplanted GPs in MBRL [46], [47], [31], [48], [32], [33], as neural networks have fixed-time runtime execution, and can generalize better to complex, non-linear, high-dimensional systems. Moreover, they perform much better than Gaussian Processes in the presence of large amounts of data, as in the case of RL. Recent work has used sampling-based model predictive control (MPC) along with deep neural network dynamics models, where an optimizer outputs a sequence of actions which is used to predict trajectories and, in turn, the cost of the sequence of actions.
The work by [31] uses a deterministic NN model with a random-search optimizer for computing an action sequence. On the other hand, the work in [32] uses an ensemble of probabilistic models to output a probability distribution over the future states, and uses the Cross-Entropy Method for optimizing the sequence of actions that maximizes predicted trajectory rewards.
Formal Methods in Controller Synthesis:
Temporal logics have been used extensively in the context of cyber-physical systems to encode temporal dependencies in the state of a system in an expressive manner [49], [50], [51]. Originating in the program synthesis and model checking literature, temporal logics like Linear Temporal Logic, Metric Temporal Logic, and Signal Temporal Logic have been used extensively to synthesize controllers for complex tasks, including motion planning [52], [53] and controller synthesis for various robot systems, including swarms [54], [21], [55]. Such synthesis algorithms typically have strong theoretical guarantees on the correctness, safety, or stability of the controlled system. Many traditional control-theoretic approaches to synthesis with STL specifications have relied on optimizing the quantitative semantics, or robustness, of trajectories generated by the controller. In [53], [55], STL specifications are used to synthesize control barrier certificates for motion planning, and in [21], the authors develop an extension of STL, named "Probabilistic STL" (PrSTL), to reason explicitly about stochastic systems. Quantitative semantics for STL have been directly encoded into model predictive control optimization problems to design controllers for robots [56], [57].
Safe Reinforcement Learning:
Temporal logics have been proposed as a means to address various problems in reinforcement learning. One such problem is reward hacking. This refers to cases where a reinforcement learning agent learns a policy that performs unsafe or unrealistic actions, even though it maximizes the expected total reward [11], [12]. One means to approach this problem is reward shaping [58]. The work in [14] proposes an extension to Q-learning [59] where STL robustness is directly used to define reward functions over trajectories in an MDP. The approach presented in [18] translates STL specifications into locally shaped reward functions using a notion of "bounded horizon nominal robustness", while the authors in [19] propose a method that augments an MDP with finite trajectories, and defines reward functions for a truncated form of Linear Temporal Logic.

Likewise, temporal logics have been used with model-based learning to prove the safety of a controller designed for the learned models. An example of this is [13], where the authors propose a PAC learning algorithm constrained using Linear Temporal Logic specifications. This is done by learning the transition dynamics of the product automaton between the unknown model and the omega-regular automaton corresponding to the LTL specification, while simultaneously using value iteration to improve the policy. The work by [15] uses Gaussian process models to learn the system dynamics, while using approximations of Lyapunov functions to design a controller using the learned models. Using a similar model learning approach, the authors in [60] propose learning a specialized dynamics model that accounts for safe exploration using barrier certificates, while generating control barrier certificate-based policies.

VI. CONCLUSION
In this paper, we propose a model-based reinforcement learning technique that uses simulations of the robotic system to learn a predictive model of the system dynamics, represented as a deep neural network (DNN). This model is used in a model predictive control setting to provide finite-horizon trajectories of the system behavior for a given sequence of control actions. We express formal specifications of the system behavior as Signal Temporal Logic (STL) formulas, and utilize the quantitative semantics of STL to evaluate these finite trajectories. At each time step, an online optimizer picks the sequence of actions that yields the lowest cost (best reward) w.r.t. the given STL formula, and uses the first action in the sequence as the control action. In other words, the DNN model that we learn is used in a receding horizon model predictive control scheme with an STL specification defining the trajectory cost. The actual optimization over the action space is performed using evolutionary strategies. We demonstrate our approach on several case studies from the robotics and autonomous driving domains.
REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., ser. Adaptive Computation and Machine Learning Series. Cambridge, Massachusetts: The MIT Press, 2018.
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[3] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv:1509.02971 [cs, stat], July 2019. [Online]. Available: http://arxiv.org/abs/1509.02971
[4] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," arXiv:1506.02438 [cs], Oct. 2018. [Online]. Available: http://arxiv.org/abs/1506.02438
[5] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv:1707.06347 [cs], Aug. 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
[6] C. Atkeson and J. Santamaria, "A comparison of direct and model-based reinforcement learning," in Proceedings of International Conference on Robotics and Automation, vol. 4, Apr. 1997, pp. 3557–3564.
[7] J. Kocijan, R. Murray-Smith, C. Rasmussen, and A. Girard, "Gaussian process model based predictive control," in Proceedings of the 2004 American Control Conference, vol. 3, June 2004, pp. 2214–2219.
[8] M. P. Deisenroth and C. E. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," in Proceedings of the International Conference on Machine Learning, 2011.
[9] E. F. Camacho and C. B. Alba, Model Predictive Control. Springer Science & Business Media, Jan. 2013.
[10] D. Q. Mayne, J. B. Rawlings, C. V. Rao, and P. O. M. Scokaert, "Constrained model predictive control: Stability and optimality," Automatica, vol. 36, no. 6, pp. 789–814, June 2000.
[11] J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, A. Lefrancq, L. Orseau, and S. Legg, "AI safety gridworlds," arXiv:1711.09883 [cs], Nov. 2017. [Online]. Available: http://arxiv.org/abs/1711.09883
[12] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, "Concrete problems in AI safety," arXiv:1606.06565 [cs], July 2016. [Online]. Available: http://arxiv.org/abs/1606.06565
[13] J. Fu and U. Topcu, "Probably approximately correct MDP learning and control with temporal logic constraints," in Robotics: Science and Systems X, 2014.
[14] D. Aksaray, A. Jones, Z. Kong, M. Schwager, and C. Belta, "Q-learning for robust satisfaction of signal temporal logic specifications," in 2016 IEEE 55th Conference on Decision and Control (CDC), Dec. 2016, pp. 6565–6570.
[15] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, "Safe model-based reinforcement learning with stability guarantees," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 908–918. [Online]. Available: http://papers.nips.cc/paper/6692-safe-model-based-reinforcement-learning-with-stability-guarantees.pdf
[16] L. Wang, E. A. Theodorou, and M. Egerstedt, "Safe learning of quadrotor dynamics using barrier certificates," arXiv:1710.05472 [cs], Oct. 2017. [Online]. Available: http://arxiv.org/abs/1710.05472
[17] M. Hasanbeig, A. Abate, and D. Kroening, "Logically-constrained reinforcement learning," arXiv:1801.08099 [cs], Jan. 2018. [Online]. Available: http://arxiv.org/abs/1801.08099
[18] A. Balakrishnan and J. V. Deshmukh, "Structured reward shaping using signal temporal logic specifications," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Nov. 2019, pp. 3481–3486.
[19] X. Li, C.-I. Vasile, and C. Belta, "Reinforcement learning with temporal logic rewards," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept. 2017, pp. 3834–3839.
[20] S. S. Farahani, V. Raman, and R. M. Murray, "Robust model predictive control for signal temporal logic synthesis," IFAC-PapersOnLine, vol. 48, no. 27, pp. 323–328, Jan. 2015.
[21] D. Sadigh and A. Kapoor, "Safe control under uncertainty with probabilistic signal temporal logic," in Robotics: Science and Systems XII, 2016.
[22] M. V. Kothare, V. Balakrishnan, and M. Morari, "Robust constrained model predictive control using linear matrix inequalities," Automatica, vol. 32, no. 10, pp. 1361–1379, Oct. 1996.
[23] A. Richards and J. How, "Mixed-integer programming for control," in Proceedings of the 2005 American Control Conference, June 2005, pp. 2676–2683.
[24] F. Allgöwer and A. Zheng, Nonlinear Model Predictive Control. Birkhäuser, 2012, vol. 26.
[25] A. Mesbah, "Stochastic model predictive control: An overview and perspectives for future research," IEEE Control Systems Magazine, vol. 36, no. 6, pp. 30–44, Dec. 2016.
[26] Z. I. Botev, D. P. Kroese, R. Y. Rubinstein, and P. L'Ecuyer, "The cross-entropy method for optimization," in Handbook of Statistics, ser. Handbook of Statistics, C. R. Rao and V. Govindaraju, Eds. Elsevier, Jan. 2013, vol. 31, pp. 35–59.
[27] N. Hansen, The CMA Evolution Strategy: A Comparing Review, 2006.
[28] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber, "Natural evolution strategies," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 949–980, Jan. 2014.
[29] A. Donzé and O. Maler, "Robust satisfaction of temporal logic over real-valued signals," in Formal Modeling and Analysis of Timed Systems, ser. Lecture Notes in Computer Science, K. Chatterjee and T. A. Henzinger, Eds. Berlin, Heidelberg: Springer, 2010, pp. 92–106.
[30] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, "Gaussian processes for data-efficient learning in robotics and control," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 2, pp. 408–423, Feb. 2015.
[31] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning," arXiv:1708.02596 [cs], Dec. 2017. [Online]. Available: http://arxiv.org/abs/1708.02596
[32] K. Chua, R. Calandra, R. McAllister, and S. Levine, "Deep reinforcement learning in a handful of trials using probabilistic dynamics models," in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. Curran Associates, Inc., 2018, pp. 4754–4765.
[33] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar, "Deep dynamics models for learning dexterous manipulation," arXiv:1909.11652 [cs], Sept. 2019. [Online]. Available: http://arxiv.org/abs/1909.11652
[34] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," arXiv:1606.01540 [cs], June 2016. [Online]. Available: http://arxiv.org/abs/1606.01540
[35] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13, no. 5, pp. 834–846, Sept. 1983.
[36] A. W. Moore, "Efficient memory-based learning for robot control," Tech. Rep., 1990.
[37] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba, "Multi-goal reinforcement learning: Challenging robotics environments and request for research," arXiv:1802.09464 [cs], Mar. 2018. [Online]. Available: http://arxiv.org/abs/1802.09464
[38] M. Wise, M. Ferguson, D. King, E. Diehr, and D. Dymesich, "Fetch and freight: Standard platforms for service robot applications," in Workshop on Autonomous Mobile Service Robots, 2016.
[39] E. Leurent, "An environment for autonomous driving decision-making," GitHub repository, 2018. [Online]. Available: https://github.com/eleurent/highway-env
[40] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, "Continuous deep Q-learning with model-based acceleration," in Proceedings of the 33rd International Conference on Machine Learning, ser. ICML'16. New York, NY, USA: JMLR.org, June 2016, pp. 2829–2838.
[41] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, June 2016, pp. 1928–1937. [Online]. Available: http://proceedings.mlr.press/v48/mniha16.html
[42] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in ICML, June 2014. [Online]. Available: https://hal.inria.fr/hal-00938992
[43] S. B. Thrun, "Efficient exploration in reinforcement learning," Tech. Rep., 1992.
[44] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, ser. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press, 2006.
[45] D. Nguyen-Tuong, J. R. Peters, and M. Seeger, "Local Gaussian process regression for real time online model learning," in Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds. Curran Associates, Inc., 2009, pp. 1193–1200. [Online]. Available: http://papers.nips.cc/paper/3403-local-gaussian-process-regression-for-real-time-online-model-learning.pdf
[46] Y. Gal, R. McAllister, and C. E. Rasmussen, "Improving PILCO with Bayesian neural network dynamics models," in Data-Efficient Machine Learning Workshop, ICML, vol. 4, 2016, p. 34.
[47] J. Fu, S. Levine, and P. Abbeel, "One-shot learning of manipulation skills with online dynamics adaptation and neural network priors," in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2016, pp. 4019–4026.
[48] S. Depeweg, J.-M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft, "Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning," in International Conference on Machine Learning, July 2018, pp. 1184–1193. [Online]. Available: http://proceedings.mlr.press/v80/depeweg18a.html
[49] A. Donzé, X. Jin, J. V. Deshmukh, and S. A. Seshia, "Automotive systems requirement mining using breach," in 2015 American Control Conference (ACC), July 2015, pp. 4097–4097.
[50] X. Jin, A. Donzé, J. V. Deshmukh, and S. A. Seshia, "Mining requirements from closed-loop control models," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 11, pp. 1704–1717, Nov. 2015.
[51] E. Bartocci, J. Deshmukh, A. Donzé, G. Fainekos, O. Maler, D. Ničković, and S. Sankaranarayanan, "Specification-based monitoring of cyber-physical systems: A survey on theory, tools and applications," in Lectures on Runtime Verification: Introductory and Advanced Topics, ser. Lecture Notes in Computer Science, E. Bartocci and Y. Falcone, Eds. Cham: Springer International Publishing, 2018, pp. 135–175.
[52] M. Lahijanian, J. Wasniewski, S. B. Andersson, and C. Belta, "Motion planning and control from temporal logic specifications with probabilistic satisfaction guarantees," in 2010 IEEE International Conference on Robotics and Automation (ICRA), May 2010, pp. 3227–3232.
[53] K. Garg and D. Panagou, "Control-Lyapunov and control-barrier functions based quadratic program for spatio-temporal specifications," arXiv:1903.06972 [cs, math], Aug. 2019. [Online]. Available: http://arxiv.org/abs/1903.06972
[54] S. Moarref and H. Kress-Gazit, "Automated synthesis of decentralized controllers for robot swarms from high-level temporal logic specifications," Autonomous Robots, vol. 44, no. 3, pp. 585–600, Mar. 2020.
[55] L. Lindemann and D. V. Dimarogonas, "Control barrier functions for signal temporal logic tasks," IEEE Control Systems Letters, vol. 3, no. 1, pp. 96–101, Jan. 2019.
[56] Y. V. Pant, H. Abbas, R. Quaye, and R. Mangharam, "Fly-by-logic: Control of multi-drone fleets with temporal logic objectives," in The 9th ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS), Mar. 2018. [Online]. Available: https://repository.upenn.edu/mlab_papers/107
[57] I. Haghighi, N. Mehdipour, E. Bartocci, and C. Belta, "Control from signal temporal logic specifications with smooth cumulative quantitative semantics," arXiv:1904.11611 [cs], Apr. 2019. [Online]. Available: http://arxiv.org/abs/1904.11611
[58] M. Grześ, "Reward shaping in episodic reinforcement learning," in Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, ser. AAMAS '17. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, May 2017, pp. 565–573.
[59] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279–292, May 1992.
[60] M. Ohnishi, L. Wang, G. Notomista, and M. Egerstedt, "Barrier-certified adaptive reinforcement learning with applications to brushbot navigation," IEEE Transactions on Robotics, vol. 35, no. 5, pp. 1186–1205, Oct. 2019.