Direct Policy Optimization using Deterministic Sampling and Collocation
Taylor A. Howell, Chunjiang Fu, and Zachary Manchester

Abstract — We present an approach for approximately solving discrete-time stochastic optimal control problems by combining direct trajectory optimization, deterministic sampling, and policy optimization. Our feedback motion planning algorithm uses a quasi-Newton method to simultaneously optimize a nominal trajectory, a set of deterministically chosen sample trajectories, and a parameterized policy. We demonstrate that this approach exactly recovers LQR policies in the case of linear dynamics, quadratic cost, and Gaussian noise. We also demonstrate the algorithm on several nonlinear, underactuated robotic systems to highlight its performance and ability to handle control limits, safely avoid obstacles, and generate robust plans in the presence of unmodeled dynamics.

Taylor A. Howell is with the Department of Mechanical Engineering, Stanford University, Stanford, CA 94305, USA. [email protected]
Chunjiang Fu is with Frontier Robotics, Innovative Research Excellence, Honda R&D Co., Ltd, Wako-shi, Saitama 3510188, Japan. chunjiang [email protected]
Zachary Manchester is with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. [email protected]
I. INTRODUCTION
Trajectory optimization (TO) is a powerful tool for solving deterministic optimal control problems in which accurate models of the system and its environment are available. However, when noise, disturbances, or unmodeled dynamics are significant, a stochastic optimal control approach, in which a feedback policy is optimized directly, can often produce more robust performance [3]. Unfortunately, general solution methods for solving stochastic optimal control problems suffer from the curse of dimensionality and are only applicable to low-dimensional systems in practice. To scale to many interesting robotic systems, approximations must be made: Typically, these include simplifying or linearizing dynamics, approximating value functions or policies with polynomials or neural networks, or approximating distributions with Gaussians or Monte Carlo sampling.

We present Direct Policy Optimization (DPO), a computationally tractable algorithm for finding approximate solutions to stochastic optimal control problems that jointly optimizes a small number of sample trajectories
Fig. 1:
Rocket landing trajectories generated by TO (blue) and DPO (orange). Unlike the TO solution tracked with TVLQR, the DPO policy lands the rocket despite unmodeled fuel slosh.

and a stabilizing feedback policy. This new algorithm combines several key ideas:
• Deterministic sampling of trajectories and approximation of expectations using the unscented transform.
• Direct collocation for implicitly enforcing dynamics along trajectories without performing rollouts.
• Joint optimization of parameterized feedback policies along with trajectories.
• Use of large-scale constrained nonlinear programming solvers based on quasi-Newton methods for fast and robust convergence.
In contrast to many reinforcement learning techniques, DPO is able to easily enforce constraints like torque limits and obstacle avoidance, makes extensive use of analytical models and their derivatives, and is extremely sample efficient.

The paper proceeds as follows: We provide background for the discrete-time stochastic optimal control problem, present related work on feedback motion planning, and give an overview of the unscented transform in Section II. In Section III, we present the DPO algorithm. We then demonstrate that DPO exactly recovers LQR policies when the dynamics are linear, costs are quadratic, and noise is Gaussian, and provide examples using DPO for several nonlinear, underactuated control problems in Section IV. Section V offers some discussion of the experimental results and, finally, we summarize our work and propose directions for future work in Section VI.

II. BACKGROUND
This section provides brief reviews of the discrete-time stochastic optimal control problem and the unscented transform, as well as a survey of related solution approaches.
A. Discrete-Time Stochastic Optimal Control
We formulate the discrete-time stochastic optimal control problem as,

$$\begin{aligned}
\underset{\theta}{\text{minimize}} \quad & \mathbb{E}_{w_t}\Big[\ell_T(x_T) + \textstyle\sum_{t=1}^{T-1} \ell_t(x_t, u_t)\Big] \\
\text{subject to} \quad & x_{t+1} = f_t(x_t, u_t, w_t), \quad (x_1 \text{ given}), \\
& u_t = \pi_t(x_t, \theta_t), \\
& \text{prob}(c(\tau) > 0) \leq \epsilon,
\end{aligned} \tag{1}$$

where we seek a policy, $\pi_t$, parameterized by $\theta_t \in \mathbf{R}^p$, $t$ is the time index, and $T$ is the planning horizon. The system's state, $x_t \in \mathbf{R}^n$, and control inputs, $u_t \in \mathbf{R}^m$, define a trajectory, $\tau = (u_1, \dots, u_{T-1}, x_1, \dots, x_T) \in \mathbf{R}^z$. The objective includes terminal and stage costs $\ell_T$ and $\ell_t$, respectively. The discrete-time stochastic dynamics, $f_t$, are subject to random disturbance inputs, $w_t \in \mathbf{R}^d$, drawn from some distribution. Inequality constraints, $c : \mathbf{R}^z \rightarrow \mathbf{R}^r$, are encoded as element-wise chance constraints on the trajectory, each with probability of violation less than tolerance $\epsilon$.

B. Related Work
Model-based approaches often tackle problem (1) in a decoupled fashion: First, generate a nominal trajectory assuming no disturbances; then, design a tracking feedback policy to reject disturbances. Using collocation methods [23] or differential dynamic programming (DDP) [6] to optimize a trajectory, then tracking it with a time-varying linear-quadratic regulator (TVLQR), has achieved impressive results for complex, real-world systems [7, 16].

Unfortunately, there are two drawbacks to these classic synthesis techniques: First, there is no explicit consideration of uncertainty or disturbances. Robustness is often achieved heuristically via post-hoc Monte Carlo testing and tuning. Second, by decoupling the synthesis of the nominal trajectory and policy, performance is often sacrificed.

DIRTREL [10] optimizes a nominal trajectory with an additional cost term that penalizes a linearized approximation of the tracking error of a TVLQR policy. Chance constraints are approximated by enforcing them at a finite number of samples. There is also a variation of DDP that can account for multiplicative noise applied to the controls [21]. Unlike these methods, DPO directly optimizes policies and propagates uncertainty through nonlinear dynamics.

Several learning-based approaches exist that directly optimize feedback policies. Policy gradient methods [17] use first-order or derivative-free methods to optimize parameterized policies, typically without direct access to the underlying dynamics model [18]. Domain randomization [20] can be employed to vary model and environment parameters to encourage robustness. Random search [12], stochastic gradient descent [4], and Newton methods [26] have also been used to optimize linear feedback policies and MPC controllers [1, 2]. A major benefit of stochastic methods is the inherent exploration of the policy space.

Guided Policy Search (GPS) [9] is a hybrid approach that alternates between optimizing sample trajectories and fitting a policy. DDP is used to generate high-reward trajectories close to the current policy that are subsequently employed to train a neural-network policy. This procedure is then iterated with a regularizer to keep new trajectories close to those generated by the previous policy.

While GPS does not require explicit dynamics models or make strong assumptions about disturbance and state distributions, it relies heavily on Monte Carlo sampling techniques that require a large number of samples. As a result, training can require significant time and computational resources. In contrast, DPO leverages analytical models and their derivatives and uses only a small number of deterministically chosen samples, making it much more efficient.

Finally, we note that many of the approaches outlined in this section—including DDP and many of the policy gradient methods—rely on explicit forward simulation or “rollouts” along with some form of backpropagation. These techniques are vulnerable to numerical ill-conditioning issues colloquially known as the “tail-wagging-the-dog problem” in control and the “vanishing” or “exploding” gradient problem in reinforcement learning. In contrast, DPO employs collocation methods that simultaneously optimize state and control trajectories with dynamics enforced as constraints. Such “direct” methods enjoy far better numerical conditioning, especially with long horizon plans.
Fig. 2:
The unscented transform visualized in 2D for initial (left) and transformed (right) distributions. Sigma points (circles) with sample mean (square) and sample covariance (dashed) are propagated through a nonlinear function (triangle) and then resampled to compute new sigma points.
C. Unscented Transform
The unscented transform is a procedure for propagating a unimodal probability distribution through a nonlinear function using deterministic samples, often referred to as sigma points [22]. This tool is commonly used for state estimation, where it is generally considered to be superior to the extended Kalman filter's linear propagation of covariance matrices [22]. There are many variations of the unscented transform [15]; the version we utilize is outlined below and visualized in Fig. 2.

Assuming a unimodal state distribution with mean $\mu \in \mathbf{R}^n$ and covariance $P \in \mathbf{S}^n_{++}$, and a zero-mean disturbance distribution with covariance $W \in \mathbf{S}^d_{++}$, we generate $M = 2(n + d)$ sigma points:

$$\begin{bmatrix} x^{(j)} \\ w^{(j)} \end{bmatrix} \leftarrow \begin{bmatrix} \mu \\ 0 \end{bmatrix} \pm \beta \left( \sqrt{\begin{bmatrix} P & 0 \\ 0 & W \end{bmatrix}} \right)^{(j)}. \tag{2}$$

The samples are constructed using the column vectors of a square root of the joint covariance matrix. In this work we use the principal matrix square root, although other decompositions, such as the Cholesky factorization, could be employed. The parameter $\beta$ can be used to control the spread of the samples around the mean.

The sigma points are first propagated through the policy,

$$u^{(j)} = \pi(x^{(j)}), \tag{3}$$

and then the dynamics,

$$x^{(j)+} = f(x^{(j)}, u^{(j)}, w^{(j)}), \tag{4}$$

in order to compute a sample mean,

$$\mu^{+} = \frac{1}{M} \sum_{j=1}^{M} x^{(j)+}, \tag{5}$$

and sample covariance,

$$P^{+} = \frac{1}{\beta M} \sum_{j=1}^{M} \big(x^{(j)+} - \mu^{+}\big)\big(x^{(j)+} - \mu^{+}\big)^{T}, \tag{6}$$

for the state distribution at the next time step. Similarly, we compute a sample mean,

$$\nu = \frac{1}{M} \sum_{j=1}^{M} u^{(j)}, \tag{7}$$

and sample covariance,

$$L = \frac{1}{\beta M} \sum_{j=1}^{M} \big(u^{(j)} - \nu\big)\big(u^{(j)} - \nu\big)^{T}, \tag{8}$$

for the control policy at the current time step.
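To make the resampling step concrete, the following is a minimal Julia sketch of (2)-(8). The function names and code organization are our own illustrative choices, not the interface of the released implementation.

```julia
using LinearAlgebra

# Sigma points for the joint state/disturbance distribution, eq. (2).
# Returns M = 2(n + d) vectors, each stacking a state and a disturbance sample.
function sigma_points(μ, P, W, β)
    n, d = length(μ), size(W, 1)
    S = sqrt(Symmetric([P zeros(n, d); zeros(d, n) W]))  # principal matrix square root
    pts = [[μ; zeros(d)] + s * β * S[:, j] for s in (1, -1), j in 1:(n + d)]
    return vec(pts)
end

# Propagate the sigma points through the policy and dynamics, eqs. (3)-(4),
# then resample: sample means, eqs. (5) and (7), and covariances, eqs. (6) and (8).
function unscented_step(f, policy, μ, P, W, β)
    n = length(μ)
    pts = sigma_points(μ, P, W, β)
    M = length(pts)
    us = [policy(p[1:n]) for p in pts]                                # eq. (3)
    xs = [f(pts[j][1:n], us[j], pts[j][n+1:end]) for j in 1:M]        # eq. (4)
    μ_next = sum(xs) / M                                              # eq. (5)
    P_next = sum((x - μ_next) * (x - μ_next)' for x in xs) / (β * M)  # eq. (6)
    ν = sum(us) / M                                                   # eq. (7)
    L = sum((u - ν) * (u - ν)' for u in us) / (β * M)                 # eq. (8)
    return μ_next, P_next, ν, L
end
```

Resampling in this way keeps the propagated distributions unimodal, and the number of samples grows only linearly with the state and disturbance dimensions.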
III. DIRECT POLICY OPTIMIZATION

We now present the Direct Policy Optimization algorithm. DPO makes several strategic approximations to the discrete-time stochastic optimal control problem: First, we seek a local feedback policy that is only valid in the neighborhood of a nominal trajectory. Second, the expectation of the objective in (1) is approximated using the unscented transform. Finally, chance constraints are approximately enforced by applying constraints to a set of sample points chosen from desired level sets of the state and input distributions.
A. Cost Function
DPO minimizes the following cost function,

$$J(\bar{\tau}) + \mathbb{E}_{w_t}[S(\bar{\tau}, \tau)], \tag{9}$$

where $J$ is a general cost applied to the nominal trajectory and $S$ is a quadratic tracking cost,

$$S(\bar{\tau}, \tau) = (x_T - \bar{x}_T)^T Q_T (x_T - \bar{x}_T) + \sum_{t=1}^{T-1} \big\{ (x_t - \bar{x}_t)^T Q_t (x_t - \bar{x}_t) + (u_t - \bar{u}_t)^T R_t (u_t - \bar{u}_t) \big\}, \tag{10}$$

that penalizes deviations from the nominal trajectory under disturbances. After some simple manipulations, we can write (9) in terms of the quantities introduced in Sec. II-C:

$$\mathbb{E}_{w_t}\big[(x_t - \bar{x}_t)^T Q_t (x_t - \bar{x}_t)\big] = \text{Tr}(P_t Q_t) + (\mu_t - \bar{x}_t)^T Q_t (\mu_t - \bar{x}_t), \tag{11}$$

$$\mathbb{E}_{w_t}\big[(u_t - \bar{u}_t)^T R_t (u_t - \bar{u}_t)\big] = \text{Tr}(L_t R_t) + (\nu_t - \bar{u}_t)^T R_t (\nu_t - \bar{u}_t). \tag{12}$$

Algorithm 1: Sample dynamics
function $g_t(\tau^{(1:N)}_t, W_t)$
    for $j = 1 : M$ do
        $x^{(j)}_t, w^{(j)}_t \leftarrow$ compute sigma points (2)
        $u^{(j)}_t, x^{(j)}_{t+1} \leftarrow$ propagate sigma points (3), (4)
    end
    $\mu_{t+1}, \nu_t \leftarrow$ compute sample means (5), (7)
    $P_{t+1}, L_t \leftarrow$ compute sample covariances (6), (8)
end function

Since the expectation in (9) depends only on the first and second moments of the state and control distributions, we can efficiently compute it using the unscented transform. As an aside, the second terms in (11) and (12) are zero in the linear-quadratic Gaussian (LQG) case, and their presence reflects the invalidity of the separation principle in more general settings.
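As a small illustration of how the expectation of the tracking cost is evaluated, the sketch below (in Julia, with our own variable names) computes the expected stage cost from the sample moments using (11) and (12).

```julia
using LinearAlgebra

# Expected stage tracking cost at time t, eqs. (11)-(12), written in terms of the
# sample means (μ, ν) and sample covariances (P, L) of the state and control.
function expected_stage_cost(μ, P, ν, L, x̄, ū, Q, R)
    state_cost   = tr(P * Q) + (μ - x̄)' * Q * (μ - x̄)
    control_cost = tr(L * R) + (ν - ū)' * R * (ν - ū)
    return state_cost + control_cost
end
```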
B. Feedback Policy
Each sample trajectory is subject to a policy constraint,

$$u_t = \pi_t(\theta_t, x_t, \bar{x}_t, \bar{u}_t). \tag{13}$$

Instead of optimizing global policies, we restrict the class to local feedback controllers. These policies could be linear state feedback, nonlinear feedback on features constructed from the state, or neural networks.

C. Constraints
Since DPO relies on direct collocation, dynamics are enforced through equality constraints. The nominal trajectory is not subject to disturbances, and the dynamics associated with the sigma-point trajectories are summarized in Algorithm 1. We maintain unimodal distributions over the state and policy by resampling at each time step using the unscented transform (2-8). Disturbances are assumed to be drawn from a zero-mean normal distribution, parameterized by a covariance matrix, $W_t \in \mathbf{S}^d_{++}$. By utilizing the discrete-time dynamics, it is possible to generate non-Gaussian noise for the system and to capture model uncertainty.

Chance constraints are approximated by constraining the nominal trajectory and each sample trajectory. With appropriate selection of the scaling parameter $\beta$, we can approximately enforce constraints on any desired level set of the state and control distributions to achieve a desired confidence bound, defined by $\epsilon$. While it is possible to check constraints on ellipsoidal level sets by solving semidefinite programs, enforcing constraints on sigma points is computationally much cheaper. Additionally, deterministic constraints, $b : \mathbf{R}^z \rightarrow \mathbf{R}^q$, can easily be enforced on the nominal trajectory.
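As an illustration of this sigma-point approximation of a chance constraint, the sketch below (our own, hypothetical helper names) evaluates a circular obstacle-avoidance constraint on the nominal state and on every sample state at a time step; the NLP then requires all of these values to be nonpositive.

```julia
# Signed constraint value for a circular obstacle centered at p with radius r:
# positive when the (y, z) position of state x is inside the obstacle.
obstacle_constraint(x, p, r) = r^2 - (x[1] - p[1])^2 - (x[2] - p[2])^2

# Stack the constraint evaluated on the nominal state x̄ and on each sample state.
# Requiring every entry to be ≤ 0 approximately enforces the chance constraint
# on the level set selected by the sigma-point scaling β.
sample_constraints(x̄, xs, p, r) =
    [obstacle_constraint(x̄, p, r); [obstacle_constraint(x, p, r) for x in xs]]
```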
D. Nonlinear Programming Formulation

In summary, DPO can be formulated as the following nonlinear program (NLP):

$$\begin{aligned}
\underset{\theta, \bar{\tau}, \tau^{(1:N)}}{\text{minimize}} \quad & J(\bar{\tau}) + \mathbb{E}_{w_t}[S(\bar{\tau}, \tau)] \\
\text{subject to} \quad & \bar{x}_{t+1} = f_t(\bar{x}_t, \bar{u}_t, 0), \\
& x^{(i)}_{t+1} = g_t(\tau^{(1:N)}_t, W_t), \\
& (\bar{x}_1, x^{(i)}_1 \text{ given}), \\
& u^{(i)}_t = \pi_t(\theta_t, x^{(i)}_t, \bar{x}_t, \bar{u}_t), \\
& b(\bar{\tau}) \leq 0, \\
& c(\bar{\tau}) \leq 0, \\
& c(\tau^{(i)}) \leq 0.
\end{aligned} \tag{14}$$

Off-the-shelf large-scale NLP solvers like Ipopt [24] and SNOPT [5] can be employed to efficiently solve (14) by taking advantage of sparsity in its associated Hessian and Jacobian matrices.
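The sketch below indicates one way the decision variables and a portion of the equality-constraint residuals of (14) could be organized before handing the problem to an NLP solver. It is a conceptual outline under our own naming and a linear feedback parameterization, not the structure of the released implementation.

```julia
# Decision variables for the DPO NLP: policy parameters, the nominal trajectory,
# and the N sample trajectories.
struct DPOVariables
    θ::Vector{Matrix{Float64}}           # feedback gains θ_1, ..., θ_{T-1}
    x̄::Vector{Vector{Float64}}           # nominal states x̄_1, ..., x̄_T
    ū::Vector{Vector{Float64}}           # nominal controls ū_1, ..., ū_{T-1}
    xs::Vector{Vector{Vector{Float64}}}  # xs[i][t]: state of sample trajectory i
    us::Vector{Vector{Vector{Float64}}}  # us[i][t]: control of sample trajectory i
end

# Equality residuals for the nominal dynamics defects and the linear policy
# constraints; the sample dynamics defects would be formed analogously with the
# unscented step of Algorithm 1. `f` is the disturbance-free discrete dynamics.
function equality_residuals(v::DPOVariables, f, T)
    r = Float64[]
    for t in 1:T-1
        append!(r, v.x̄[t+1] - f(v.x̄[t], v.ū[t]))  # x̄_{t+1} = f_t(x̄_t, ū_t, 0)
        for i in eachindex(v.xs)
            # u_t = ū_t - θ_t (x_t - x̄_t), a linear instance of the policy constraint
            append!(r, v.us[i][t] - (v.ū[t] - v.θ[t] * (v.xs[i][t] - v.x̄[t])))
        end
    end
    return r
end
```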
IV. EXAMPLES

The following examples were implemented in Julia, optimized with the SNOPT solver [5], and run on a laptop computer with an Intel Core i7-7600U 2.80 GHz CPU and 16 GB of memory. Our code is available at https://github.com/thowell/direct_policy_optimization.

In all examples, implicit midpoint integration is used and the same weights are used for the sample costs (10) and LQR controllers. To verify online tracking performance, the closed-loop policies are simulated at ten times the sample rate used in the optimization with additive zero-mean white Gaussian noise, third-order Runge-Kutta integration, zero-order-hold control interpolation, and cubic spline state interpolation. The tracking error is computed using the sample costs (10). All examples use quadratic objectives for the nominal trajectories, have discrete dynamics with additive noise,

$$f_t(x_t, u_t) + w_t, \tag{15}$$

and optimize linear feedback policies,

$$u_t = \bar{u}_t - \theta_t(x_t - \bar{x}_t). \tag{16}$$
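As an illustration of this verification procedure, the sketch below (hypothetical helper names; the finer simulation rate and interpolation steps mentioned above are omitted) rolls out the linear feedback policy (16) on noisy dynamics of the form (15) and accumulates the tracking cost with the weights from (10).

```julia
using LinearAlgebra, Random

# Roll out the policy u_t = ū_t - θ_t (x_t - x̄_t) on dynamics with additive
# Gaussian noise and accumulate the quadratic tracking cost (10).
function simulate_tracking(f, θ, x̄, ū, Q, R, Σw; rng = Random.default_rng())
    T = length(x̄)
    x = copy(x̄[1])
    cost = 0.0
    for t in 1:T-1
        u = ū[t] - θ[t] * (x - x̄[t])
        cost += (x - x̄[t])' * Q[t] * (x - x̄[t]) + (u - ū[t])' * R[t] * (u - ū[t])
        w = sqrt(Symmetric(Σw)) * randn(rng, length(x))  # additive zero-mean noise
        x = f(x, u) + w
    end
    return cost + (x - x̄[T])' * Q[T] * (x - x̄[T])
end
```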
A. Double Integrator

We first demonstrate that DPO recovers the optimal discrete-time LQR solution for a double integrator,

$$x_{t+1} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} x_t + \begin{bmatrix} 0 \\ 1 \end{bmatrix} u_t + w_t, \tag{17}$$

that is regulated to the origin over a horizon T = 51. Because this is a regulator problem, the DPO nominal trajectory is fixed to zeros. There are N = 4 sample trajectories with initial conditions x^(i)_1 = (±, ±). Resampling is performed with β = 1 and W = 0. I, the sample-cost weights are Q = I and R = I, and the solver is initialized with ones.

The maximum normalized error between the LQR and DPO policies along the trajectory is 7.68e-7, reinforcing that the approximations made in the derivation of DPO are exact in the LQG case. Additionally, the second term in (11) is zero, demonstrating that the separation principle holds for this problem.
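As a point of reference for this comparison, the finite-horizon, discrete-time LQR gains can be computed with a standard backward Riccati recursion; the following is a minimal Julia sketch with generic A, B, and weights, not tied to the specific values above.

```julia
using LinearAlgebra

# Backward Riccati recursion for the finite-horizon, discrete-time LQR gains
# K_1, ..., K_{T-1} that serve as the baseline for the DPO policy comparison.
function lqr_gains(A, B, Q, R, QT, T)
    P = copy(QT)
    K = Vector{Matrix{Float64}}(undef, T - 1)
    for t in T-1:-1:1
        K[t] = (R + B' * P * B) \ (B' * P * A)  # optimal gain at time t
        P = Q + A' * P * (A - B * K[t])         # cost-to-go update
    end
    return K
end
```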
B. Car

We plan a collision-free path through four obstacles for an autonomous car, modeled as a unicycle [8],

$$\begin{bmatrix} \dot{y} \\ \dot{z} \\ \dot{\phi} \end{bmatrix} = \begin{bmatrix} u_1 \cos(\phi) \\ u_1 \sin(\phi) \\ u_2 \end{bmatrix}, \tag{18}$$

with T = 51 and Δt = 0. . The state has positions y and z, and orientation φ. The controls are the forward and angular velocities, u_1 and u_2. For DPO, we set the sample-cost weights to be Q_t = diag(10, , ) for t = 1 : T−1, Q_T = 100 I, and R = 0. I, sample initial conditions for the N = 6 sample trajectories around the origin, and apply the unscented transform with β = 1 and W = diag(0. , 0. , 0. ). The choice of β = 1 results in constraints being enforced on the 1-sigma level set of the state distribution, corresponding to ε = 0. .

TO naturally finds a solution that is in close proximity to the obstacles. In contrast, DPO finds a path that reaches the goal state while remaining a safe distance from the obstacles (Fig. 3). Unlike ad-hoc techniques for obstacle avoidance, like obstacle inflation, the path generated by DPO results from consideration of the coupled dynamics, constraints, and disturbances of the system.

Fig. 3: Position trajectories for autonomous car avoiding obstacles. TO (left) finds a path (blue) that is near the obstacles, whereas the DPO (right) path (orange) avoids obstacles with a margin for safety.
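For reference, the continuous-time unicycle model (18) and an implicit-midpoint discretization of the kind used in our examples can be sketched as follows; here a fixed-point iteration stands in for the Newton solve an optimizer would use for the implicit step.

```julia
# Continuous-time unicycle dynamics, eq. (18): state (y, z, φ), controls (u1, u2).
unicycle(x, u) = [u[1] * cos(x[3]); u[1] * sin(x[3]); u[2]]

# Implicit midpoint step: xn satisfies xn = x + Δt * f((x + xn) / 2, u).
# A fixed-point iteration is used here for simplicity; it converges for small Δt.
function implicit_midpoint(f, x, u, Δt; iters = 20)
    xn = x + Δt * f(x, u)  # explicit Euler initial guess
    for _ in 1:iters
        xn = x + Δt * f((x + xn) / 2, u)
    end
    return xn
end
```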
Fig. 4: Simulated tracking (black) for position (dotted) and orientation (dashed) of cart-pole experiencing Coulomb friction. Comparison between TVLQR tracking a nominal trajectory without (blue) and with (magenta) modeled friction, and the DPO policy (orange).
C. Cart-Pole
A swing-up trajectory for a cart-pole [19] with a slider that experiences Coulomb friction is synthesized for T = 51 and Δt = 0. . The state, x = (y, φ, ẏ, φ̇), has cart position y, pendulum orientation φ, and their respective time derivatives. The cart position is controlled. The optimality conditions for the maximum dissipation principle [16] are used as deterministic constraints to explicitly model this behavior [11] for a coefficient of friction μ_f = 0. . The DPO sample-cost weights are Q_t = diag(10, , , ) for t = 1 : T−1, Q_T = 100 I, and R = I, initial conditions for the N = 8 sample trajectories are sampled around the origin, and resampling is performed with β = 1 and W = 0. I.

The performance of TVLQR tracking nominal trajectories optimized with and without friction, and of the DPO policy, is verified in simulation. Results are provided in Table I and tracking for the position and orientation is shown in Fig. 4. Without accounting for friction, the TVLQR policy fails to track. By first designing a trajectory that explicitly models friction, it is possible to subsequently synthesize a TVLQR policy for tracking. In contrast, DPO produces the best tracking results by simultaneously optimizing the nominal trajectory and policy.

D. Rocket Landing
We plan a soft landing trajectory for a rocket [13]. The nominal system with planar dynamics has state
TABLE I:
Tracking error, computed with (10), for cart-pole experiencing Coulomb friction. Comparison between TVLQR tracking nominal trajectories with and without modeled friction, and the DPO policy.

cost      TVLQR (no friction)   TVLQR   DPO
state     1.03e4                2.32    –
control   62.87                 0.38    –
total     1.04e4                2.70    –

Fig. 5:
Rocket with planar dynamics having lateral and vertical positions y and z, orientation φ, and controlled with thrust inputs F_y and F_z. The model parameters are the rocket inertia J_r, mass m_r, and length from center of mass to thruster l_r. An additional pendulum, with orientation ψ, and parameterized by distance from center of mass to rotation point d_f, length l_f, and mass m_f, is included to model fuel slosh.

x = (y, z, φ, ẏ, ż, φ̇), with lateral and vertical positions, y and z, orientation φ, and their respective time derivatives (Fig. 5). The rocket is controlled with gimballed thrust, u = (F_y, F_z). It is initialized with non-zero displacements and velocities relative to its target state: a zero-velocity, vertical-orientation touchdown inside a landing zone.

The rocket experiences fuel slosh during landing that is not accounted for in the nominal model. This is a critical dynamical effect, but it is difficult to model and observe. In practice, these unmodeled effects are typically handled in ad hoc ways, often with extensive Monte Carlo simulation and tuning. To approximately model fuel slosh, we augment the nominal model with two additional states associated with an unobservable and unactuated pendulum (Fig. 5). A common frequency-domain control approach is to include a notch filter at the pendulum's natural frequency in order to prevent excitation of these fuel slosh dynamics. We instead use DPO to synthesize a policy for the nominal model that is

Fig. 6:
Simulated tracking (black) for the orientation of a rocket experiencing fuel slosh during landing. Comparison between tracking of TVLQR (blue) and DPO (orange) policies.
TABLE II:
Tracking error, computed with (10), for a rocket that experiences fuel slosh while landing. The simulation uses a rocket with modeled fuel slosh to compare TVLQR tracking a trajectory generated with the nominal rocket model and the DPO policy.

cost      TVLQR     DPO
state     7294.55   –
control   35.51     –
total     7330.06   –

robust to fuel slosh by utilizing sample trajectories with the augmented model that capture fuel-slosh effects.

For DPO, the nominal trajectory is optimized with the nominal model because its states can be observed. The augmented model is utilized for each of N = 16 sample trajectories, and an output feedback policy is optimized that does not have access to the fuel-slosh states. The sample-cost weights are Q_t = 100 I for t = 1 : T−1, Q_T = 1000 I, and R = diag(1, , ). The sample initial conditions, including the fuel-slosh states, are sampled around the initial augmented state. The initial fuel-slosh state is a zero-velocity, downward orientation corresponding to no excitation. Resampling is performed with β = 1 and W = 0. I. Thrust limits are enforced, the problem has a free final time, and the planning horizon is T = 41.

TO finds a 2.72 second solution, whereas DPO finds a slightly longer, 2.89 second, path to the landing zone. The position trajectories for the nominal model are shown in Fig. 1. Simulation results with fuel slosh are provided in Table II and tracking results for the rocket's orientation are shown in Fig. 6. Due to the fuel slosh, TVLQR fails to track the path generated by TO and the rocket tips over, whereas the DPO policy successfully lands the rocket. DPO is able to optimize a policy for a model that is observable while capturing additional, unobservable dynamical effects through sample models.

E. Quadrotor
A point-to-point maneuver is planned for a quadrotor, with dynamics from [14], that experiences a random
TABLE III:
Average tracking error, computed with (10), for quadrotor with different broken propellers. In simulation, each propeller is broken in order to compare the overall robustness of the TVLQR and DPO policies.

cost (avg.)   TVLQR   DPO
state         4.06    –
control       0.03    –
total         4.09    –

Fig. 7: Simulated position trajectories (gray) for quadrotor controlled with TVLQR (top) and DPO (bottom) policies while experiencing different propellers breaking. The TVLQR policy (blue) experiences degrading tracking and generates different paths depending on which propeller breaks. The DPO policy (orange) has superior, and nearly identical, tracking regardless of which propeller breaks.

blade breaking off from a propeller during flight. A broken propeller is modeled by constraining the corresponding control input to only half its nominal maximum value. We use DPO to optimize a policy that is robust to this event occurring for any propeller. The nominal model has no broken propellers, but the N = 24 sample trajectories are divided into four groups, each with a different broken propeller. The sample-cost weights are Q_t = 10 I for t = 1 : T−1, Q_T = 100 I, and R = I. Initial conditions are sampled around the initial nominal state, and resampling is performed with β = 1 and W_t = 0. I. The planning horizon is T = 31 with free final time.

Both TO and DPO find nominal trajectories that take about 2.6 seconds, but DPO finds lower maximum controls (Fig. 8). The policies are compared in simulation and are visualized in Fig. 7. The average tracking performance over each propeller breaking is provided in Table III. DPO finds a policy with consistent and

Fig. 8:
Nominal controls for quadrotor performing point-to-point maneuver generated with TO (blue) and DPO (orange). The controls found with DPO have lower maximum values compared to TO.

superior tracking compared with TVLQR for all cases.

V. DISCUSSION
Because we use local optimization methods to solve DPO, providing good initial guesses to the solver is crucial for performance of the algorithm. In practice, we use a standard TO solution to warm start the nominal and sample trajectories for DPO. For linear state-feedback policies, we use the corresponding TVLQR solution as an initial guess. While not necessary in any of our examples, initial guesses for more complex policies could be found by running DPO without the policy constraint, then performing an offline regression to fit approximate policy parameters for warm starting.

For a TO problem solved with a sparsity-exploiting solver, the computational complexity is approximately $O\big(T(n + m)^3\big)$ [25]. For DPO, we can consider an augmented state and control with dimensions $n(2n + 1)$ and $m(2n + 1)$, respectively. The complexity of DPO is then $O\big(T(n^2 + mn)^3\big)$. In the examples, the largest state dimension is n = 12 and the solve times range from seconds to tens of minutes on a laptop computer. However, it is likely possible to exploit more of the DPO problem's structure to improve the complexity. Parallelization across sample points is also possible, which would enable faster solve times and scaling to much larger systems.

Lastly, LQR is a powerful tool and can likely be tuned to qualitatively match the performance of the linear policies found with DPO. However, the strength of DPO lies in its ability to explicitly reason about robustness and the complex coupling between dynamics, constraints, and disturbances during synthesis, instead of relying on hand tuning and heuristics.

VI. CONCLUSION
We have presented a new algorithm, Direct Policy Optimization, for approximately solving stochastic optimal control problems. The algorithm is exact in the LQG case and is capable of optimizing policies for nonlinear systems that respect control limits and obstacles, and are robust to unmodeled dynamics.

Many interesting avenues for future work remain: It should be possible to optimize sparse nonlinear policies by introducing high-dimensional features and regularizing or constraining policy parameters. Neural networks could also be used to parameterize policies. Another potential direction is to optimize a local value function approximation in place of an explicit policy. Finally, a much richer class of disturbances and model parameter uncertainty could be modeled by augmenting the state vector and dynamics.
REFERENCES

[1] Akshay Agrawal et al. “Differentiable convex optimization layers”. In: Advances in Neural Information Processing Systems. 2019, pp. 9562–9574.
[2] Brandon Amos et al. “Differentiable MPC for end-to-end planning and control”. In: Advances in Neural Information Processing Systems. 2018, pp. 8289–8300.
[3] Dimitri P. Bertsekas and Steven Shreve. Stochastic Optimal Control: The Discrete-Time Case. Athena Scientific, 2004.
[4] Jingjing Bu et al. “LQR through the lens of first order methods: Discrete-time case”. In: arXiv preprint arXiv:1907.08921 (2019).
[5] Philip E. Gill, Walter Murray, and Michael A. Saunders. “SNOPT: An SQP algorithm for large-scale constrained optimization”. In: SIAM Review.
[6] David H. Jacobson and David Q. Mayne. Differential Dynamic Programming. North-Holland, 1970.
[7] Scott Kuindersma et al. “Optimization-based locomotion planning, estimation, and control design for the Atlas humanoid robot”. In: Autonomous Robots.
[8] Steven M. LaValle. Planning Algorithms. Cambridge University Press, 2006.
[9] Sergey Levine and Vladlen Koltun. “Guided policy search”. In: International Conference on Machine Learning. 2013, pp. 1–9.
[10] Zachary Manchester and Scott Kuindersma. “Robust direct trajectory optimization using approximate invariant funnels”. In: Autonomous Robots.
[11] In: Robotics Research. Springer, 2020, pp. 985–1000.
[12] Horia Mania, Aurelia Guy, and Benjamin Recht. “Simple random search of static linear policies is competitive for reinforcement learning”. In: Advances in Neural Information Processing Systems. 2018, pp. 1800–1809.
[13] James S. Meditch. “On the problem of optimal thrust programming for a lunar soft landing”. In: IEEE Transactions on Automatic Control.
[14] In: The International Journal of Robotics Research.
[15] In: IEEE Transactions on Automatic Control.
[16] In: New Variational Techniques in Mathematical Physics. Springer, 2011, pp. 171–322.
[17] David Silver et al. “Deterministic policy gradient algorithms”. In: International Conference on Machine Learning. 2014.
[18] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[19] Russ Tedrake. “Underactuated robotics: Algorithms for walking, running, swimming, flying, and manipulation (course notes for MIT 6.832)”. In: Downloaded in Fall (2020).
[20] Josh Tobin et al. “Domain randomization for transferring deep neural networks from simulation to the real world”. IEEE. 2017, pp. 23–30.
[21] Emanuel Todorov and Weiwei Li. “A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems”. In: Proceedings of the 2005 American Control Conference. IEEE. 2005, pp. 300–306.
[22] Jeffrey K. Uhlmann. “Dynamic map building and localization: New theoretical foundations”. PhD thesis. University of Oxford, 1995.
[23] Oskar von Stryk. “Numerical solution of optimal control problems by direct collocation”. In: Optimal Control. Springer, 1993, pp. 129–143.
[24] Andreas Wächter and Lorenz T. Biegler. “On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming”. In: Mathematical Programming.
[25] In: IFAC Proceedings Volumes.
[26] In: arXiv preprint arXiv:1312.4892.