Approximately Optimal Continuous-Time Motion Planning and Control via Probabilistic Inference
Mustafa Mukadam, Ching-An Cheng, Xinyan Yan, and Byron Boots
Abstract — The problem of optimal motion planning and control is fundamental in robotics. However, this problem is intractable for continuous-time stochastic systems in general, and the solution is difficult to approximate if non-instantaneous nonlinear performance indices are present. In this work, we provide an efficient algorithm, PIPC (Probabilistic Inference for Planning and Control), that yields approximately optimal policies with arbitrary higher-order nonlinear performance indices. Using probabilistic inference and a Gaussian process representation of trajectories, PIPC exploits the underlying sparsity of the problem such that its complexity scales linearly in the number of nonlinear factors. We demonstrate the capabilities of our algorithm in a receding horizon setting with multiple systems in simulation.
I. INTRODUCTION
A fundamental goal in robotics is to efficiently compute trajectories of actions that drive a robot to achieve some desired behavior. We seek a control policy in a multi-stage decision problem [1] that can maximize performance indices that describe, for example, the smoothness of motion, energy consumption, or the likelihood of avoiding an obstacle. Hierarchical planning and control has been used to solve this problem in practice [2]. The idea is to first generate a desired state sequence [3]–[9] without considering full system dynamics, and then design a robust low-level controller for tracking. Because the dynamic constraints are relaxed, it becomes possible for an algorithm to plan a trajectory that satisfies complicated, higher-order performance indices [8]–[10], offering greater flexibility in system design. Sampling-based planning techniques can even provide formal guarantees such as probabilistically complete solutions [3], [4]. However, recent work has started to challenge this classical viewpoint by incorporating more dynamic constraints within trajectory planning in search of solutions with improved optimality [11], [12].

A theoretically elegant approach would be to address both the planning and control problems within a stochastic optimal control framework. Unfortunately, since the states and actions are coupled through the system dynamics, exact solutions become intractable with the exception of simple cases known as linearly solvable problems [13]. These challenges have motivated researchers to find approximate solutions rather than directly approximating the original problems with hierarchical approaches. One simple
Mustafa Mukadam, Ching-An Cheng, Xinyan Yan, and Byron Boots are affiliated with the Institute for Robotics and Intelligent Machines, Georgia Institute of Technology, Atlanta, GA 30332, USA. {mmukadam3, cacheng, xyan43}@gatech.edu, [email protected].
Linearly solvable problems: affine systems with quadratic instantaneous control cost, or fully controllable discrete-time systems.
Fig. 1: PIPC used on (a) a 2D holonomic robot (blue) to reach a goal (red) in a 2D environment with dynamic obstacles, where the executed trajectory is in green and the current planned horizon is in black, (b) a 7-DOF WAM arm, and (c) a PR2's right arm, where the semitransparent arm is the goal configuration and the dotted blue end-effector trajectory is the current planned horizon.

approach is direct policy search [14], [15], which uses first-order information to find a locally optimal policy. To improve the convergence rate, differential dynamic programming (DDP) has been widely adopted as the foundation of locally optimal algorithms [16]–[18], which solve local linear-quadratic Gaussian (LQG) subproblems and iteratively improve these suboptimal solutions. However, for continuous-time systems, these algorithms would require inefficient high-frequency sampling to construct the LQG subproblems, even when the given problem is close to an LQG (e.g., a performance index with only a small set of nonlinear factors, or dynamics with a small amount of nonlinearity). Compared with the hierarchical approach, these algorithms impose a strict structural assumption: they are only applicable to problems that measure performance as an integral of instantaneous functions.

In this paper, we propose a novel approximately optimal approach to continuous-time motion planning and control that can handle costs expressed as arbitrary higher-order nonlinear factors and exploit a problem's underlying sparse structure. Specifically, we consider problems with a performance index expressed as the product of an exponential-quadratic factor for instantaneous costs and a finite number of possibly higher-order nonlinear factors, and provide an algorithm that has linear complexity in the number of nonlinear factors. Moreover, we show the approximately optimal policy can be computed by posterior inference on a probabilistic graphical model, which is a dual to the performance index. We convert these theoretical results into a practical algorithm called Probabilistic Inference for Planning and Control (PIPC) that recursively updates the approximately optimal policy as more information is encountered. To evaluate our approach, we employ PIPC on both Markov decision processes (MDPs) and partially-observable MDPs (POMDPs) in dynamic environments with multiple simulated systems (see Fig. 1).

A. Related Work
Our algorithm contributes to a growing body of research that seeks to reframe planning and control problems as probabilistic inference [19]. Work in this area has formed a new class of approximately optimal algorithms that leverage tools from approximate probabilistic inference, including expectation propagation [20] and expectation maximization [21], [22]. A common framework based on KL-minimization [23], [24] summarizes the above algorithms as well as approaches like path-integral control [13].

We contribute to this field in the following ways. First, we extend the performance index for control algorithms to incorporate nonlinear factors with arbitrary higher-order connections in time. In contrast to our approach, the methods mentioned above generally assume that the performance indices factor into instantaneous terms, and thus require dense sampling to solve continuous-time problems. Second, we provide a new approach to derive a Gaussian approximation based on the Laplace approximation and Gaussian processes. Third, we define a new class of optimal control problems, called gLEQG (generalized Linear-Exponential-Quadratic-Gaussian), that are solvable after being transformed into their dual probabilistic representation. In particular, we show that gLEQG admits a solution given by posterior inference. This theoretical result, discussed in Section III-C, closes the gap in the duality between optimal control and inference.

The rest of the paper is structured as follows. We begin in Section II by defining the objective function in joint planning and control problems. Then, in Section III, we present our main results in approximately optimal motion planning and control. In Section IV, these theoretical results are summarized into an online algorithm, PIPC, that can perform simultaneous planning and control for partially observable stochastic linear systems in dynamic environments. To validate our algorithm, we present implementation details and experimental results in Sections V and VI. Finally, Section VII concludes the paper.

II. THE PROBLEM OF MOTION PLANNING AND CONTROL
We begin by introducing some notation. Let x_t, u_t, and z_t be the state, action, and observation of a continuous-time partially-observable system at time t, and let h_t = {z_0, u_0, z_δt, ..., z_t} be the history of observations and actions up to time t. (Here we assume the measurements z_t are taken in discrete time with sampling interval δt, and u_t is a constant continuous-time trajectory on the interval [t, t + δt).) As shorthand, we use boldface to denote the time trajectory of a variable, and π(u | h) to denote the collection of time-varying causal (stochastic) policies π_t(u_t | h_t) for all t.

We formulate the motion planning and control problem as a finite-horizon stochastic optimization problem over π. Let p_π be the distribution of x and u under the stochastic policy π and the system dynamics, and let S be a finite set of time indices. The goal is to find an optimal policy π that maximizes the performance index

  max_π J = max_π E_{p_π} [ ψ(x, u) ∏_{S ∈ S} φ_S(x_S, u_S) ].   (1)

The objective function in (1) is defined as the expectation of the product of two types of factors: a Gaussian process factor ψ(·) and higher-order nonlinear factors φ_S(·). These two factors, described below, cover many interesting behaviors that are often desired in planning and control problems.

A. Higher-order Nonlinear Factors φ_S(·)

We define factors of the form

  φ_S(·) = exp( −‖f_S(·)‖² )   (2)

to model the nonlinear, higher-order couplings frequently used in planning problems, where f_S(·) is a differentiable nonlinear function defined on a finite number of time indices S ∈ S. The structure of φ_S(·) can model many performance indices in planning: for example, a simple nonlinear cost function at a single time instance, a penalty based on the difference between the initial and the terminal states/actions, a penalty to enforce consistency across landmarks in time, or the cost of a robot-obstacle collision. As each factor φ_S(·) depends only on a finite number of states or actions, we refer to the corresponding states x_S and actions u_S as support states and support actions.

B. Gaussian Process Factors ψ(·)

The Gaussian process factor ψ(·) is a generalization of the exponential-of-integral cost function in the optimal control literature [25]. To illustrate, here we consider the special case ψ(·) = ψ(u); a joint factor between x and u as in (1) can be defined similarly.

Let GP_u(u_t | m^u_t, K^u_{t,t'}) be a Gaussian process [26], where ∀ t, t' ∈ R, E[u_t] = m^u_t and C[u_t, u_{t'}] = K^u_{t,t'}. Let P^u_{t,t'} be the (positive definite) Green's function of K^u_{t,t'} satisfying, ∀ t, t' ∈ R, δ_{t,t'} = ∫ K^u_{t,s} P^u_{s,t'} ds, where δ is the Dirac delta distribution and the integral is over the length of the trajectory. We define the Gaussian process factor ψ(u) as

  ψ(u) = exp( − ∫∫ (u_s − m^u_s)^T P^u_{s,s'} (u_{s'} − m^u_{s'}) ds ds' ).   (3)

Loosely speaking, we call (3) the probability of a trajectory u from GP_u(u_t | m^u_t, K^u_{t,t'}). Note that this notation does not necessarily imply that u is a sample path of GP_u(u_t | m^u_t, K^u_{t,t'}); rather, we use (3) as a metric between u and m^u.
Intuitively, the maximization in (1) encourages u to be close to m^u in terms of the distance weighted by P^u_{t,t'}. Solving a stochastic optimization problem with (3) in the objective function is intractable in general, because P^u_{t,t'} is only implicitly defined. However, here we show that when GP_u is the sum of a Gaussian white noise process and a linearly transformed Gauss-Markov process, the problem is not only tractable but can also extend the classical exponential-of-integral cost to model higher-order behaviors.

This is realized by defining GP_u(u_t | m^u_t, K^u_{t,t'}) through a linear stochastic differential equation (SDE). Let y_t be the hidden state of u_t (e.g., its higher-order derivatives) and let p(y_0) = N(y_0 | m^y_0, K^y_0) be its prior. We set GP_u(u_t | m^u_t, K^u_{t,t'}) as the solution to

  dy_t = (D y_t + η) dt + G dω
  u_t  = H y_t + r_t + ν_t   (4)

in which D, η, G, H are (time-varying) system matrices, r_t is a control bias, dω is a Wiener process, and ν_t is a Gaussian white noise process GP_ν(0, Q_ν δ_{t,t'}). In other words, the Gaussian process GP_u(u_t | m^u_t, K^u_{t,t'}) has mean and covariance functions

  m^u_t = r_t + H m^y_t   (5)
  K^u_{t,t'} = Q_ν δ_{t,t'} + H K^y_{t,t'} H^T   (6)

in which GP_y(m^y_t, K^y_{t,t'}) is another Gaussian process with

  m^y_t = Φ_y(t, t_0) m^y_0 + ∫_{t_0}^{t} Φ_y(t, s) η_s ds   (7)
  K^y_{t,t'} = Φ_y(t, t_0) K^y_0 Φ_y(t', t_0)^T + ∫_{t_0}^{min(t,t')} Φ_y(t, s) G_s G_s^T Φ_y(t', s)^T ds   (8)

and Φ_y(t, s) is the state transition matrix from s to t with respect to D. For derivations, please refer to [27] and the references therein.

The definitions (5) and (6) contain the exponential-of-integral cost [25]

  ψ(u) = exp( − ∫ (u_s − r_s)^T Q_ν^{-1} (u_s − r_s) ds )

as a special case, which can be obtained by setting H = 0 (i.e., P^u_{t,t'} = Q_ν^{-1} δ_{t,t'}). In general, it encourages the action trajectory u to stay close to r, even in terms of higher-order derivatives (or their hidden states). This leads to a preference toward smooth control signals. By extension, a joint factor between x and u would also encourage smooth state trajectories (i.e., smaller higher-order derivatives of the state).

Constructing the Gaussian process factor by an SDE results in one particularly nice property: if we consider the joint Gaussian process of y_t and u_t, then its Green's function is sparse. To see this, let θ_t = (u_t, y_t) and θ = {θ_0, θ_1, ..., θ_N}, and define ψ(θ) as its Gaussian process factor, similar to (3). Then the double integral in ψ(θ) can be broken down into the sum of smaller double integrals, or factorized as

  ψ(θ) = ψ̃(θ_0) ∏_{i=0}^{N-1} ψ̃(θ_i, θ_{i+1})   (9)

where ψ̃(·) has a similar exponential-quadratic form but over a smaller time interval [t_i, t_i + δt]. In other words, if we treat each θ_i as a coordinate, then the exponent of ψ(θ) can be written as a quadratic function with a tridiagonal Hessian matrix (please see [27] for details). This sparsity will be the foundation of the approximation procedure and algorithm proposed in Sections III and IV.
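To make the sparsity in (9) concrete, the following sketch (a minimal illustration of ours, not the paper's implementation) discretizes the simplest instance of the SDE prior (4) with D = 0, η = 0, H = I (a Wiener-process prior on the control) and assembles the resulting precision matrix over the support variables. The helper name, the weak anchoring prior, and all numeric values are illustrative assumptions; the point is only that each factor couples two adjacent support points, so the precision is block-tridiagonal.

```python
import numpy as np

def wiener_gp_precision(num_steps, dt, q_u, d):
    """Precision matrix of a Wiener-process control prior
    u_{i+1} = u_i + w_i,  w_i ~ N(0, q_u * dt * I_d),
    which is the D = 0, H = I special case of the SDE prior (4).
    Each Gauss-Markov factor couples only (u_i, u_{i+1}), so the
    result is block-tridiagonal, mirroring the factorization (9)."""
    n = (num_steps + 1) * d
    precision = np.zeros((n, n))
    Q_inv = np.linalg.inv(q_u * dt * np.eye(d))

    # Weak prior factor anchoring the first support point (illustrative).
    precision[:d, :d] += np.eye(d) / 1e2

    # One binary factor per interval [t_i, t_{i+1}].
    for i in range(num_steps):
        A = np.zeros((d, n))
        A[:, i * d:(i + 1) * d] = -np.eye(d)        # -Phi(dt) = -I for D = 0
        A[:, (i + 1) * d:(i + 2) * d] = np.eye(d)
        precision += A.T @ Q_inv @ A

    return precision

if __name__ == "__main__":
    P = wiener_gp_precision(num_steps=5, dt=0.1, q_u=1.0, d=2)
    # Nonzero block pattern: only adjacent blocks are coupled (tridiagonal).
    blocks = np.abs(P).reshape(6, 2, 6, 2).sum(axis=(1, 3)) > 1e-12
    print(blocks.astype(int))
```

Running the example prints a tridiagonal 0/1 pattern, which is exactly the structure that allows the linear-time inference discussed in Section III.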
III. APPROXIMATE OPTIMIZATION AS INFERENCE

The mixed features from both the planning and control domains in (1) present two major challenges: the optimization over continuous-time trajectories and the higher-order, nonlinear factors φ_S(·). The former results in an infinite-dimensional problem, which often requires a dense discretization. The latter precludes the direct use of algorithms based on Bellman's equation, because the factors may not factorize into instantaneous terms.

In this work, we propose a new approach inspired by approximate probabilistic inference. The goal is to derive an approximation to the problem in (1) of the form

  max_π E_{p̂_π} [ ψ(x, u) ∏_{S ∈ S} φ̂_S(x_S, u_S) ]   (10)

where φ̂_S(·) is a local exponential-quadratic approximation of φ_S(·) and p̂_π is a Gaussian process approximation of p_π. We call the problem in (10) "gLEQG," as it generalizes LEQG (Linear-Exponential-Quadratic-Gaussian) [25] to incorporate higher-order exponentials in the form of (3). In the rest of this section, we show how gLEQG can be derived by using the probabilistic interpretation [20] of the factors in (1). Further, we show this problem can be solved in linear time O(|S|) and its solution can be written in closed form as posterior inference.

A. Probabilistic Interpretation of Factors
We begin by representing each factor in (1) with a probability distribution [21]. First, for φ_S(·), we introduce additional fictitious observations e_S such that p(e_S | x_S, u_S) ∝ φ_S(x_S, u_S). These new variables e_S can be interpreted as the events that we wish the robot to achieve, whose likelihood of success is reflected proportionally by φ_S(·). Practically, they help us keep track of the message propagation over the support states/actions in later derivations. Second, we rewrite the Gaussian process factor ψ(u) to include the hidden state y_t in (4), as a joint Gaussian process factor q(u, y). (This step can be carried out similarly to the construction of ψ(u).) With the introduction of y_t, the joint Gaussian process factor q(u, y) has the sparsity property in (9) that we desire.

Now we rewrite the stochastic optimization (1) in the new notation. Let e_S = {e_S}_{S ∈ S} and ξ = (x, y, u), and let p(x | u) and p(z | x) be the conditional distributions defined by the system dynamics and the observation model, respectively. It can be shown that (1) is equivalent to

  max_π ∫ q(z, ξ | e_S) π(u | h) dξ dz   (11)

in which we define a joint distribution

  q(z, ξ, e_S) = q(ξ) p(z | x) ∏_{S ∈ S} p(e_S | x_S, u_S)   (12)

with likelihoods p(z | x) and p(e_S | x_S, u_S), and a prior on the continuous-time trajectory ξ

  q(ξ) = p(x | u) q(u, y).   (13)

Before proceeding, we clarify the notation we use to simplify writing. We use q to denote the ad hoc constructed Gaussian process factors (e.g., in (3)) and p to denote the probability distributions associated with the real system. As such, q does not always define an expectation, so the integral notation (e.g., in (11)) denotes the expectation over p and π, which are well-defined probability distributions. With some abuse of notation, we will call both of them Gaussian processes, since our results depend only on their algebraic form.

B. Gaussian Approximation
Let ξ_S = {ξ_S}_{S ∈ S} and ξ̄_S = ξ \ ξ_S. To derive the gLEQG approximation to (1), we notice, by (12), that q(z, ξ | e_S) in (11) can be factorized as

  q(z, ξ | e_S) = q(z, ξ̄_S | ξ_S) q(ξ_S | e_S)   (14)

where we have used the Markovian property from Section III-A, i.e., given ξ_S, e_S is conditionally independent of the other random variables. Therefore, if q(z, ξ̄_S | ξ_S) and q(ξ_S | e_S) can be reasonably approximated as Gaussians, then we can approximate (1) with (10).

However, q(z, ξ̄_S | ξ_S) and q(ξ_S | e_S) have notably different topologies: q(z, ξ̄_S | ξ_S) is a distribution over continuous-time trajectories, whereas q(ξ_S | e_S) is a density function on a finite number of random variables. Therefore, to approximate (1), we need to find a Gaussian process q̂(z, ξ̄_S | ξ_S) and a Gaussian density q̂(ξ_S | e_S).
1) Gaussian Process Approximation:
We derive the Gaussian process approximation q̂(z, ξ) to q(z, ξ). With this result, the desired conditional Gaussian process q̂(z, ξ̄_S | ξ_S) is given in closed form.

First we need to define the system dynamics p(x | u) and the observation model p(z | x). For now, let us assume that the system is governed by a linear SDE

  dx = (A x + B u + b) dt + F dw
  z  = C x + v   (15)

in which A, B, b, F, C are (time-varying) system matrices, dw is a Wiener process, and v is Gaussian noise with covariance Q_v. When a prior is placed on x_0 (similar to Section II-B), it can be shown that the solution to (15), p(x, z | u) = p(z | x) p(x | u), is a Gaussian process. Since q(u, y) is also a Gaussian process, we have a Gaussian process prior on z and ξ:

  q(z, ξ) = p(x, z | u) q(u, y).   (16)

In this case, no approximation is made and therefore q̂(z, ξ̄_S | ξ_S) = q(z, ξ̄_S | ξ_S).

In the case of nonlinear systems, one approach is to treat (15) as a local linear approximation and derive q̂(z, ξ) = p̂(x, z | u) q(u, y), where p̂(x, z | u) is the solution to the linearized system. Alternatively, we can learn the conditional distribution p̂(x, z | u) from data directly through Gaussian process regression [26]. However, since our main purpose here is to show the solution when p̂(x, z | u) is available, from now on we will assume the system is linear and given by (15).
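As a concrete aside on how the linear SDE (15) induces a Gaussian prior over a discretized state trajectory, the sketch below computes exact discrete-time transition matrices and process-noise covariance for a time-invariant A, B, F using the standard matrix-exponential (Van Loan) construction. The function name, the time-invariance assumption, and the zero-order-hold on u are our own illustrative choices, not part of the paper.

```python
import numpy as np
from scipy.linalg import expm

def discretize_linear_sde(A, B, F, dt):
    """Exact discretization of dx = (A x + B u) dt + F dw over a step dt
    (time-invariant case, zero-order hold on u):
        x_{k+1} = Phi x_k + Gamma u_k + w_k,  w_k ~ N(0, Q).
    Phi and Gamma come from one matrix exponential; Q from Van Loan's method."""
    n, m = A.shape[0], B.shape[1]

    # [Phi, Gamma] from the exponential of the augmented matrix [[A, B], [0, 0]].
    M = np.zeros((n + m, n + m))
    M[:n, :n], M[:n, n:] = A, B
    E = expm(M * dt)
    Phi, Gamma = E[:n, :n], E[:n, n:]

    # Van Loan: exp([[-A, F F^T], [0, A^T]] dt); then Q = exp(A dt) @ upper-right block.
    V = np.zeros((2 * n, 2 * n))
    V[:n, :n], V[:n, n:], V[n:, n:] = -A, F @ F.T, A.T
    W = expm(V * dt)
    Q = W[n:, n:].T @ W[:n, n:]

    return Phi, Gamma, Q

if __name__ == "__main__":
    # 1D double integrator: state (position, velocity), as in Section V.
    A = np.array([[0.0, 1.0], [0.0, 0.0]])
    B = np.array([[0.0], [1.0]])
    F = np.array([[0.0], [1.0]])
    Phi, Gamma, Q = discretize_linear_sde(A, B, F, dt=0.1)
    print(Phi, Gamma, Q, sep="\n")
```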
2) Gaussian Density Approximation:
Unlike q(z, ξ̄_S | ξ_S), the approximation to q(ξ_S | e_S) is more straightforward. First, because q(ξ_S | e_S) may not be available in closed form, we approximate q(ξ_S | e_S) with q̃(ξ_S | e_S):

  q(ξ_S | e_S) ∝ q(ξ_S) ∏_{S ∈ S} p(e_S | x_S, u_S) ≈ q̂(ξ_S) ∏_{S ∈ S} p(e_S | x_S, u_S) ∝ q̃(ξ_S | e_S)   (17)

where q̂(ξ_S) is the marginal distribution of q̂(z, ξ) found in the previous section. Given (17), we then find a Gaussian approximation q̂(ξ_S | e_S) of q̃(ξ_S | e_S) via a Laplace approximation [28].

For the nonlinear factor in (2), a Laplace approximation of q̃(ξ_S | e_S) amounts to solving a nonlinear least-squares optimization problem. Using the sparsity of the structured Gaussian processes defined by SDEs, the optimization can be completed using efficient data structures in O(|S|) [9]. Due to space constraints, we omit the details here; please see Appendix A and [9].
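To illustrate the Laplace-approximation step, the sketch below (our own minimal example, not the GPMP2/GTSAM solver used in the paper) runs Gauss-Newton on a tiny problem of the form implied by (2) and (17): a Gaussian prior on the support variables plus squared nonlinear residuals. The mode gives the mean of the Gaussian approximation, and the Gauss-Newton Hessian gives its precision. The hinge "obstacle" residual and all numbers are made up for the example.

```python
import numpy as np

def laplace_approx(mu_prior, P_prior, residual, jacobian, iters=20):
    """Gauss-Newton Laplace approximation of a density proportional to
        N(xi | mu_prior, P_prior^{-1}) * exp(-||residual(xi)||^2).
    Returns the mode and the (Gauss-Newton) precision at the mode."""
    xi = mu_prior.copy()
    for _ in range(iters):
        r, J = residual(xi), jacobian(xi)
        grad = P_prior @ (xi - mu_prior) + 2.0 * J.T @ r   # gradient of -log density
        H = P_prior + 2.0 * J.T @ J                        # Gauss-Newton Hessian
        step = np.linalg.solve(H, grad)
        xi = xi - step
        if np.linalg.norm(step) < 1e-10:
            break
    return xi, P_prior + 2.0 * jacobian(xi).T @ jacobian(xi)

if __name__ == "__main__":
    # Toy factor: hinge loss pushing a 2D support point away from the origin,
    # f(xi) = max(0, eps - ||xi||), with a Gaussian prior pulling toward mu.
    eps = 1.0
    mu, P = np.array([0.3, 0.0]), np.eye(2) * 4.0

    def residual(xi):
        return np.array([max(0.0, eps - np.linalg.norm(xi))])

    def jacobian(xi, h=1e-6):
        # Central-difference Jacobian keeps the example short.
        return np.array([[(residual(xi + h * e)[0] - residual(xi - h * e)[0]) / (2 * h)
                          for e in np.eye(2)]])

    mode, precision = laplace_approx(mu, P, residual, jacobian)
    print("mode:", mode, "\nprecision:\n", precision)
```

In the actual algorithm the same idea is applied to the full factor graph of Appendix A, where the block-sparse structure keeps the cost linear in the number of support points.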
3) Summary:
The above approximations allow us to approximate (12) with a Gaussian distribution

  q̂(z, ξ, e_S) = p̂(z, x | u) q(u, y) ∏_{S ∈ S} p̂(e_S | x_S, u_S).   (18)

In (18), p̂(z, x | u) is the Gaussian process approximation of the system, which is exact when the system is linear, and p̂(e_S | x_S, u_S) is proportional to the exponential-quadratic factor φ̂_S(x_S, u_S) in (10). Moreover, it can be shown that q̂(z, ξ, e_S) is a Laplace approximation of p̂(z, x | u) q(u, y) ∏_{S ∈ S} p(e_S | x_S, u_S) in terms of the continuous-time trajectories z and ξ.

C. Finding an Approximately Optimal Policy
Substituting the results of Section III-B into (11), we obtain the approximated optimization problem

  max_π ∫ q̂(z, ξ | e_S) π(u | h) dξ dz.   (19)

By (18), one can show that (19) is equivalent to the problem in (10), but expressed in probabilistic notation. However, by writing the problem probabilistically, we can avoid the algebraic complications that arise from attempting to solve the Bellman's equation of (10), which, because of the higher-order factors, requires additional state expansion. This simplicity is reflected in the optimality condition for (19):

  π*_t(u_t | h_t) = δ(u_t − u*_t(h_t))
  u*_t(h_t) = argmax_{u_t} ∫ q̂(z, ξ | e_S) π*(ū_t | h) dx dy dz dū_t = argmax_{u_t} q̂(u_t | h_t, e_S)   (20)

in which ū_t denotes u \ {u_t} and δ is the Dirac delta distribution. From the last equality in (20), we see that the solution to the maximization problem coincides with the mode of the posterior distribution q̂(u_t | h_t, e_S). As a result, the optimal policy for time t can be derived forward in time by performing inference, without solving for the future policies first. Please see Appendix B for the proof.

We call this property the duality between gLEQG and inference. This result may seem surprising, but similar ideas can be traced back to the duality between optimal control and estimation [17], [20], in which the optimal value function of a linear quadratic problem is computed by backward message propagation without performing maximization.

Compared with previous work, a stronger duality holds here: gLEQG is dual to the inference problem on the same probabilistic graphical model defined by the random variables in Section III-A. This property is the result of using an exponential performance index, and it enables us to handle higher-order factors naturally without resorting to ad hoc derivations based on dynamic programming on extended states.

Our posterior representation of the policy can also be found in [20], [29], or can be interpreted as one step of posterior iteration [24]. In [20], this representation results from the approximation of the optimal value function, but its relationship to the overall stochastic optimization is unclear. In [29], the posterior representation is reasoned from the notion of a predictive policy representation without further justification of its effect on the whole decision process. Here we derive the policy from the assumption that the distribution associated with (1) can be approximated by the Gaussian (18). Therefore, the condition under which the approximate policy remains valid can be more easily understood or even enforced, as discussed later in Section IV-B.
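Since the approximating distribution is jointly Gaussian, the posterior in (20) is itself Gaussian, so the optimal action is simply its mean (which equals its mode), obtained by standard Gaussian conditioning. The snippet below is a small self-contained illustration of that conditioning step; the joint covariance and the stand-in "observation" are invented for the example and do not come from the paper.

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx_a, idx_b, value_b):
    """Condition a joint Gaussian N(mu, Sigma) on the block `idx_b` taking
    `value_b`; return mean/covariance of the block `idx_a`. For a Gaussian,
    mean = mode, so this conditional mean plays the role of u_t^* in (20)."""
    Saa = Sigma[np.ix_(idx_a, idx_a)]
    Sab = Sigma[np.ix_(idx_a, idx_b)]
    Sbb = Sigma[np.ix_(idx_b, idx_b)]
    gain = Sab @ np.linalg.inv(Sbb)
    mean = mu[idx_a] + gain @ (value_b - mu[idx_b])
    cov = Saa - gain @ Sab.T
    return mean, cov

if __name__ == "__main__":
    # Toy joint over (u_t, x_t); numbers are illustrative only.
    mu = np.array([0.0, 0.0])
    Sigma = np.array([[1.0, 0.6],
                      [0.6, 1.0]])
    # Conditioning on x_t = 1.2 stands in for conditioning on h_t and e_S.
    u_star, u_cov = condition_gaussian(mu, Sigma, [0], [1], np.array([1.2]))
    print("optimal action (posterior mode):", u_star, "cov:", u_cov)
```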
IV. PROBABILISTIC MOTION PLANNING AND CONTROL

In Section III, we showed that if q(z, ξ | e_S) can be approximated well by a Gaussian distribution, the stochastic optimization in (1) can be approximately solved as posterior inference (20). This representation suggests that the approximately optimal policy can be updated recursively through Kalman filtering.

A. Recurrent Policy Inference as Kalman Filtering
The approximately optimal policy in (20) can be viewed as the belief about the current action u_t given the history h_t and the fictitious events e_S.

Fig. 2: Timeline of PIPC, where a system that started at t_0 is currently at time t ∈ [t_i, t_{i+1}] between support points t_i and t_{i+1}, in δt resolution. In a receding horizon setting, t_+ represents the receding horizon window [t_i, t_i + t_h], and t_max is the (infinite) final time when the algorithm terminates. In a finite horizon setting, t_h = t_max − t_i.

Here we exploit the Markovian structure underlying q̂(z, ξ | e_S) to derive a recursive algorithm for updating the belief q̂(ξ_t | h_t, e_S). Given the belief, the policy can be derived by marginalization. First, for t = 0, we write

  q̂(ξ_0 | h_0, e_S) ∝ p(z_0 | ξ_0) q̂(ξ_0 | e_S) = q(z_0 | ξ_0) ∫ q(ξ_0 | ξ_S) q̂(ξ_S | e_S) dξ_S

in which q(z_t | ξ_t) = p(z_t | x_t) and q(ξ_0 | ξ_S) is the conditional distribution defined by (16). After initialization, the posterior q̂(ξ_t | h_t, e_S) can be propagated through prediction and correction, summarized together in one step as

  q̂(ξ_{t+δt} | h_{t+δt}, e_S) ∝ p(z_{t+δt} | ξ_{t+δt}) q̂(ξ_{t+δt} | h_t, e_S) = q(z_{t+δt} | ξ_{t+δt}) ∫ q̂(ξ_{t+δt} | ξ_t, e_S) q̂(ξ_t | h_t, e_S) dξ_t   (21)

in which the transition is given by

  q̂(ξ_{t+δt} | ξ_t, e_S) ∝ q̂(ξ_t, ξ_{t+δt} | e_S) = ∫ q(ξ_t, ξ_{t+δt} | ξ_S) q̂(ξ_S | e_S) dξ_S   (22)

and q(ξ_t, ξ_{t+δt} | ξ_S) is given by (16) [30], [31]. Because of the Markovian structure in (9), the integral in (22) depends only on the two support states/actions adjacent to t and can be computed in constant time. Note that in q̂(ξ_t | h_t, e_S) in (21), the action u_t is actually conditioned on the action taken, u*_t(h_t); this notation is adopted to simplify the writing.

Thus, we can view (21) as Kalman filtering with transition dynamics q̂(ξ_{t+δt} | ξ_t, e_S) and observation process q(z_t | ξ_t). This formulation gives us the flexibility to switch between open-loop and closed-loop policies. That is, before a new observation z_{t+δt} is available, (22) provides a continuous-time open-loop action trajectory during the interval (t, t + δt).

This recurrent policy inference is based on the assumption that q̂(z, ξ | e_S) is accurate. Although this assumption does not necessarily hold in general, it is a practical approximation if the belief about the current state p(x_t | h_t) is concentrated and the horizon within which (20) applies is short.
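Once the transition (22) and the observation model are written as Gaussians, the prediction-correction recursion (21) is a standard Kalman filter. The sketch below shows one such predict/correct step in its generic linear-Gaussian form; it is a schematic stand-in for the policy filter, with made-up matrices rather than the conditioned GP transition used by PIPC.

```python
import numpy as np

def kalman_step(mean, cov, Phi, c, Q, z, C, R):
    """One prediction-correction step of a linear-Gaussian filter:
        predict:  xi' ~ N(Phi xi + c, Q)
        correct:  z   ~ N(C xi', R)
    The predicted belief plays the role of q_hat(xi_{t+dt} | h_t, e_S) in (21),
    and the correction conditions on the new observation z_{t+dt}."""
    # Prediction.
    mean_p = Phi @ mean + c
    cov_p = Phi @ cov @ Phi.T + Q
    # Correction.
    S = C @ cov_p @ C.T + R
    K = cov_p @ C.T @ np.linalg.inv(S)
    mean_c = mean_p + K @ (z - C @ mean_p)
    cov_c = (np.eye(len(mean)) - K @ C) @ cov_p
    return mean_c, cov_c

if __name__ == "__main__":
    # Toy 2D belief over (x_t, u_t); all numbers are illustrative.
    mean, cov = np.zeros(2), np.eye(2)
    Phi, c, Q = np.array([[1.0, 0.1], [0.0, 1.0]]), np.zeros(2), 0.01 * np.eye(2)
    C, R = np.array([[1.0, 0.0]]), np.array([[0.05]])
    mean, cov = kalman_step(mean, cov, Phi, c, Q, np.array([0.4]), C, R)
    print(mean, cov, sep="\n")
```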
B. Online Motion Planning and Control

We now summarize everything into the PIPC algorithm. Let t_i ∈ S. We compensate for the local nature of (20) by recomputing a new Laplace approximation q̂(z_{t+}, ξ_{t+} | h_t, e_S) whenever t ∈ S, and applying filtering to update the policy by (21) for t ∈ (t_i, t_{i+1}), in which the subscript t+ denotes the future trajectory from t (see Fig. 2). This leads to an iterative framework which solves for a new approximation with up-to-date knowledge about the system.

We can apply this scheme to MDP/POMDP problems in both finite and receding horizon cases. When facing a dynamic environment, PIPC updates the environmental information in the new Laplace approximation of Section III-B.2.

The details of the receding horizon approach are summarized in Algorithm 1; the finite horizon case can be derived similarly. First, at any time step t_i, PIPC computes the Laplace approximation for the current horizon window [t_i, t_i + t_h] with the latest information about the system and the environment, where t_h ≥ t_{i+1} − t_i is the length of the preview horizon. Second, for t ∈ (t_i, t_{i+1}), PIPC recursively updates the policy using the most current observation with a resolution of δt. These two steps repeat until the stopping criteria are met or the execution fails (for example, the robot is in collision).

Algorithm 1: Receding Horizon PIPC
  Input: horizon t_h, start time t_0, initial belief q(ξ_{t_0})
  Output: success/failure
  while not STOP CRITERIA do
    q̂(ξ_S | e_S, h_{t_i−δt}, u_{t_i−δt}) = getLaplaceApprox(t_i, t_h, q(ξ_{t_i} | h_{t_i−δt}, u_{t_i−δt}), ENVIRONMENT)
    for t ∈ [t_i, t_{i+1}] do
      z_t = makeObservation()
      q̂(ξ_t | h_t, e_S) = filterPolicy(z_t, q̂(ξ_{t−δt} | h_{t−δt}, e_S), q̂(ξ_S | e_S, h_{t_i−δt}, u_{t_i−δt}))
      executePolicy(u_t = u*_t(h_t))
      q(ξ_{t+δt} | h_t, u_t) = filterState(z_t, u_t, q(ξ_t | h_{t−δt}, u_{t−δt}))
    end for
  end while
  return checkSuccess()
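A schematic Python rendering of the control flow in Algorithm 1 is given below. The helper callables (get_laplace_approx, filter_policy, filter_state, stop_criteria, and the robot interface) are placeholders for the steps described above, not an actual API of the paper's implementation; the sketch only fixes the two-level loop structure of the receding-horizon algorithm.

```python
def pipc_receding_horizon(horizon, t0, belief0, env, robot,
                          get_laplace_approx, filter_policy, filter_state,
                          stop_criteria, dt, support_dt):
    """Skeleton of Algorithm 1 (Receding Horizon PIPC). All helpers are
    supplied by the caller; this function only encodes the loop:
    replan (Laplace approximation) at every support time t_i, then filter
    the policy and the state belief at resolution dt until t_{i+1}."""
    t, belief = t0, belief0
    while not stop_criteria(t, belief):
        # Re-linearize: Laplace approximation over the window [t, t + horizon].
        support_posterior = get_laplace_approx(t, horizon, belief, env)

        policy_belief = None
        t_next = t + support_dt
        while t < t_next:
            z = robot.observe()
            # Recursive policy inference (Kalman filtering on the policy belief).
            policy_belief = filter_policy(z, policy_belief, support_posterior)
            u = policy_belief.mode()          # u_t^* = posterior mode, cf. (20)
            robot.execute(u)
            # State-belief filter used to seed the next Laplace approximation.
            belief = filter_state(z, u, belief)
            t += dt
    return robot.at_goal()
```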
V. IMPLEMENTATION DETAILS

We perform experiments with the receding horizon version of PIPC in four different setups, including both MDP and POMDP scenarios: MDP-CL and POMDP-CL execute the receding horizon PIPC in Algorithm 1; MDP-OL and POMDP-OL ignore the policy filtering step and instead recursively apply the open-loop policy given by the mode found in the Laplace approximation. (The receding-horizon version solves a new finite-horizon problem at each iteration of the Laplace approximation.) This open-loop baseline can be viewed as the direct generalization of [9] to include action trajectories.

The Laplace approximation is implemented using GPMP2 (available at https://github.com/gtrll/gpmp2) and the GTSAM C++ library (available at https://bitbucket.org/gtborg/gtsam), which solves posterior maximization as a nonlinear least-squares optimization defined on a factor graph with the Levenberg-Marquardt algorithm. Note that in the implementation we consider y_t = u_t (i.e., ξ_t = (x_t, u_t)) and a constant time difference Δt between any two support states or actions.

We evaluate our algorithms on three different systems: a 2D holonomic robot, a 7-DOF WAM arm, and a PR2 arm. The state dynamics, following (15), is defined as a double integrator with the state consisting of position and velocity, and

  A = [ 0  I ]      B = [ 0 ]      b = 0,      F F^T = [ 0    0    ]
      [ 0  0 ],         [ I ],                         [ 0  Q_x I ],

and following (4) we define the Gaussian process factor by

  H = I,  D = 0,  η = 0,  G G^T = Q_u I,

where 0 and I are the d × d zero and identity matrices, d = 2 for the 2D holonomic robot and d = 7 for the 7-DOF WAM arm and the PR2 arm, and Q_x and Q_u are positive scalars. The observation process in the POMDP is modeled as a state observation with additive zero-mean Gaussian noise with covariance Q_v = σ_m I. The state dynamics for both arms are assumed to be feedback linearized; on a real system, the control would need to be mapped back to actual torques using inverse dynamics.
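For reference, the block matrices above can be assembled as follows for the 2D holonomic robot (d = 2). This is a small illustrative helper of ours; the Q_x and Q_u values in the example are arbitrary.

```python
import numpy as np

def double_integrator_matrices(d, q_x, q_u):
    """Build the double-integrator dynamics (15) and control-prior matrices (4)
    used in Section V: state = (position, velocity) in R^{2d}; the control
    prior is a Wiener process (H = I, D = 0, eta = 0)."""
    Z, I = np.zeros((d, d)), np.eye(d)
    A = np.block([[Z, I], [Z, Z]])
    B = np.vstack([Z, I])
    b = np.zeros(2 * d)
    FFt = np.block([[Z, Z], [Z, q_x * I]])
    H, D, eta, GGt = I, Z, np.zeros(d), q_u * I
    return A, B, b, FFt, H, D, eta, GGt

if __name__ == "__main__":
    A, B, b, FFt, H, D, eta, GGt = double_integrator_matrices(d=2, q_x=0.01, q_u=10.0)
    print(A, FFt, sep="\n")
```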
VI. EVALUATION

We conduct benchmark experiments with our receding horizon algorithm on the 2D holonomic robot in a dynamic environment, and on the WAM arm and the PR2's right arm in static environments (see Fig. 1). A video of the experiments is available at https://youtu.be/8rQcg1O-6aU. In each case, we compare the closed-loop and open-loop algorithms in both MDP and POMDP settings across different Q_x (and numbers of dynamic obstacles N_obs in the 2D case) with respect to success rate, time to reach the goal, path length, and path cost. Path cost is calculated as the negative log of the product of the factors. Each setting is run K times with a unique random generator seed to account for stochasticity, which is kept the same across all four algorithms for a fair comparison. A trial is marked "successful" if the robot reaches the goal within a Euclidean distance gdist, and is marked "failed" if at any point the robot runs into a collision or exceeds the maximum allotted time t_max.

A. 2D Robot Benchmark
We simulate a 2D holonomic robot (radius 0. m) in a 2D environment ( m ×  m) with moving obstacles (see Fig. 1 (a)). The robot's sensor returns a limited view of a  m ×  m square centered at the robot's current position.
TABLE I: Success rate across increasing Q_x and N_obs on the 2D holonomic robot.

                N_obs = 10     N_obs = 20     N_obs = 30     N_obs = 40     N_obs = 50
        Q_x     CL     OL      CL     OL      CL     OL      CL     OL      CL     OL
MDP     0.01    0.975  0.975   0.85   0.85    0.7    0.675   0.4    0.375   0.25   0.325
        0.04    0.95   0.975   0.85   0.8     0.55   0.525   0.525  0.375   0.325  0.325
        0.07    0.95   0.85    0.875  0.575   0.725  0.475   0.45   0.2     0.225  0.125
POMDP   0.01    0.975  0.975   0.925  0.875   0.725  0.7     0.375  0.4     0.15   0.225
        0.04    0.95   0.975   0.875  0.825   0.525  0.475   0.425  0.45    0.4    0.25
        0.07    0.975  0.875   0.825  0.55    0.7    0.425   0.45   0.25    0.2    0.075

Fig. 3: Results of successful runs (time to goal in s, path length in m, and path cost vs. number of obstacles) with increasing Q_x and N_obs on the 2D holonomic robot.

The moving obstacles (squares of  m ×  m) start at random locations and follow a 2D stochastic jump process, where a noisy acceleration a_obs is uniformly sampled within [−. , . ] m/s² at every time step. Their velocities v_obs are restricted to [−. , . ] m/s. All obstacles are confined inside the boundary during the simulation.

Table I summarizes the success rates for this benchmark, and Fig. 3 shows the aggregate results of successful runs. From Table I, we see that, for both MDP and POMDP cases, the closed-loop algorithms have higher success rates than the open-loop algorithms, especially in difficult problems with larger stochasticity in the system (Q_x) or increased complexity in the environment (N_obs). A similar increasing trend can be observed in the gap between the success rates of the closed-loop and open-loop algorithms. The majority of failed open-loop cases arise from collision; only a few are due to hitting the maximum run time. The performance in the POMDP cases is slightly worse than in the MDP cases on average. All three metrics (time, path length, and path cost) in Fig. 3 generally increase with more noise and obstacles. These plots should be interpreted alongside the success rates, since the sample size of successful trials is comparatively small for the harder problems. Parameters for this benchmark are set as follows: K = 40, gdist = 0. , t_max = 20, Δt = 0. , t_h = 2, n_ip = 20, σ_m = 0. , σ_g = 1, σ_fix = 10^− , Q_u = 10, σ_obs = 0. , ε = 1.

B. WAM and PR2 Benchmark
We demonstrate the scalability of PIPC to higher-dimensional systems by performing a benchmark on the WAM and PR2 robot arms. Here the WAM and the PR2 arms are set up in lab and industrial environments [6], [7], [9], respectively, in OpenRAVE. The task is to drive the robot arm from a given start to a goal configuration (see Fig. 1 (b) and (c)). The environments are static and fully observable at all times. We compare the algorithms with respect to increasing Q_x. Table II summarizes the success rates for this benchmark, and Fig. 4 shows the aggregate results of successful runs. Similar to the 2D robot benchmark, the results show that the closed-loop algorithms have higher success rates than the open-loop ones, and all three metrics increase with noise. In particular, POMDP-CL performs even better than MDP-OL. Parameters for this benchmark are set as follows: K = 20, gdist = 0. , t_max = 15, Δt = 0. , t_h = 2, n_ip = 10, σ_m = 0. , σ_g = 0. , σ_fix = 10^− , Q_u = 10, ε = 0. , σ_obs = 0.  (WAM), σ_obs = 0.  (PR2).

TABLE II: Success rate across increasing Q_x on the WAM and the PR2 robot arms.

                WAM            PR2
        Q_x     CL     OL      CL     OL
MDP     0.01    1      1       1      1
        0.02    1      1       1      0.95
        0.03    1      0.85    1      0.5
POMDP   0.01    1      1       1      1
        0.02    1      0.9     1      0.8
        0.03    0.9    0.75    1      0.8

Fig. 4: Results of successful runs (time to goal, path length, and path cost) with increasing Q_x on (a) the WAM and (b) the PR2 robot arms.

VII. CONCLUSION

We consider the problem of motion planning and control as probabilistic inference, and we propose an algorithm, PIPC, for solving this problem that can exploit the intrinsic sparsity in continuous-time stochastic systems. In particular, PIPC can address performance indices given by arbitrary higher-order nonlinear factors and a general exponential-integral-quadratic factor. Although PIPC solves a continuous-time problem, its complexity scales only linearly in the number of nonlinear factors, thus making online simultaneous planning and control possible in receding/finite horizon MDP/POMDP problems.

ACKNOWLEDGMENTS
The authors would like to thank Jing Dong for help with the GTSAM interface. This material is based upon work supported by NSF CRII Award No. 1464219 and NSF NRI Award No. 1637758.

REFERENCES

[1] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 1995, vol. 1, no. 2.
[2] R. C. Arkin, Behavior-Based Robotics. MIT Press, 1998.
[3] L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. H. Overmars, "Probabilistic roadmaps for path planning in high-dimensional configuration spaces," IEEE Transactions on Robotics and Automation, vol. 12, no. 4, pp. 566–580, 1996.
[4] J. J. Kuffner and S. M. LaValle, "RRT-connect: An efficient approach to single-query path planning," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), vol. 2, 2000, pp. 995–1001.
[5] A. Byravan, B. Boots, S. S. Srinivasa, and D. Fox, "Space-time functional gradient optimization for motion planning," in IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 6499–6506.
[6] J. Schulman, Y. Duan, J. Ho, A. Lee, I. Awwal, H. Bradlow, J. Pan, S. Patil, K. Goldberg, and P. Abbeel, "Motion planning with sequential convex optimization and convex collision checking," The International Journal of Robotics Research, vol. 33, no. 9, pp. 1251–1270, 2014.
[7] M. Mukadam, X. Yan, and B. Boots, "Gaussian process motion planning," in IEEE International Conference on Robotics and Automation (ICRA), May 2016, pp. 9–15.
[8] Z. Marinho, B. Boots, A. Dragan, A. Byravan, G. J. Gordon, and S. Srinivasa, "Functional gradient motion planning in reproducing kernel Hilbert spaces," in Proceedings of Robotics: Science and Systems (RSS), 2016.
[9] J. Dong, M. Mukadam, F. Dellaert, and B. Boots, "Motion planning as probabilistic inference using Gaussian processes and factor graphs," in Proceedings of Robotics: Science and Systems (RSS), 2016.
[10] M. Toussaint, "Newton methods for k-order Markov constrained motion problems," arXiv preprint arXiv:1407.0414, 2014.
[11] S. M. LaValle and J. J. Kuffner, "Randomized kinodynamic planning," The International Journal of Robotics Research, vol. 20, no. 5, pp. 378–400, 2001.
[12] R. Tedrake, I. R. Manchester, M. Tobenkin, and J. W. Roberts, "LQR-trees: Feedback motion planning via sums-of-squares verification," The International Journal of Robotics Research, 2010.
[13] H. J. Kappen, "Linear theory for control of nonlinear stochastic systems," Physical Review Letters, vol. 95, no. 20, p. 200201, 2005.
[14] S. Levine and V. Koltun, "Guided policy search," in International Conference on Machine Learning (ICML), 2013, pp. 1–9.
[15] M. Deisenroth and C. E. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 465–472.
[16] D. Mayne, "A second-order gradient method for determining optimal trajectories of non-linear discrete-time systems," International Journal of Control, vol. 3, no. 1, pp. 85–95, 1966.
[17] E. Todorov and W. Li, "A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems," in Proceedings of the 2005 American Control Conference, 2005, pp. 300–306.
[18] E. Todorov and Y. Tassa, "Iterative local dynamic programming," IEEE, 2009, pp. 90–95.
[19] H. Attias, "Planning by probabilistic inference," in AISTATS, 2003.
[20] M. Toussaint, "Robot trajectory optimization using approximate inference," in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 1049–1056.
[21] M. Toussaint and A. Storkey, "Probabilistic inference for solving discrete and continuous state Markov decision processes," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 945–952.
[22] S. Levine and V. Koltun, "Variational policy search via trajectory optimization," in Advances in Neural Information Processing Systems, 2013, pp. 207–215.
[23] H. J. Kappen, V. Gómez, and M. Opper, "Optimal control as a graphical model inference problem," Machine Learning, vol. 87, no. 2, pp. 159–182, 2012.
[24] K. Rawlik, M. Toussaint, and S. Vijayakumar, "On stochastic optimal control and reinforcement learning by approximate inference," in Proceedings of Robotics: Science and Systems VIII, 2012.
[25] P. Kumar and J. Van Schuppen, "On the optimal control of stochastic systems with an exponential-of-integral performance index," Journal of Mathematical Analysis and Applications, vol. 80, no. 2, pp. 312–332, 1981.
[26] C. E. Rasmussen, Gaussian Processes for Machine Learning, 2006.
[27] S. Särkkä, A. Solin, and J. Hartikainen, "Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing: A look at Gaussian process regression through Kalman filtering," IEEE Signal Processing Magazine, vol. 30, no. 4, pp. 51–61, 2013.
[28] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[29] A. Boularias, "A predictive model for imitation learning in partially observable environments," in Seventh International Conference on Machine Learning and Applications (ICMLA), 2008, pp. 83–90.
[30] T. Barfoot, C. H. Tong, and S. Särkkä, "Batch continuous-time trajectory estimation as exactly sparse Gaussian process regression," in Proceedings of Robotics: Science and Systems, Berkeley, USA, 2014.
[31] X. Yan, V. Indelman, and B. Boots, "Incremental sparse GP regression for continuous-time trajectory estimation & mapping," in Proceedings of the International Symposium on Robotics Research (ISRR), 2015.
APPENDIX
A. Laplace Approximation with Factor Graphs
PIPC updates the Laplace approximation whenever t = t_i by efficiently solving a nonlinear least-squares problem defined on a bipartite factor graph G = {ξ_S, f_S, E},

  q(ξ_S | e_S) ∝ ∏_{S ∈ S} f_S(ξ_S)   (23)

where, recall, ξ_S is the set of support augmented states, f_S = {f_S}_S denotes the set of factors, and E are the edges connecting ξ_S and f_S.

An example factor graph is shown in Fig. 5 for a trajectory starting from t_i with a length equal to t_h. The sparse set of support augmented states ξ_S are uniformly spaced Δt apart and are connected to their neighbors via the Gaussian process factors, forming a Gauss-Markov chain. Note that in our implementation ξ_t = (x_t, u_t).
1) Details of Factor Implementation:

a) Prior Factor:
For each Laplace approximation, a prior factor is placed on the first hidden state ξ_{t_i}, reflecting its current belief given the past history h_{t_i}. In the MDP setting, the covariance for the state x_{t_i} is set as a diagonal matrix σ_fix I_{d×d}, in which σ_fix is a small number indicating high confidence; for the control u_{t_i}, we use the original Gaussian process factor given by (3). Together they define Q_prior. In the POMDP setting, the belief of the hidden augmented state is obtained via Kalman filtering, and we heuristically set the covariance for the state x to σ_fix I_{d×d} as mentioned previously.

b) Gaussian Process Factors: Analogous to defining GP_u for (4), we can define GP_ξ, which in turn defines q(ξ_S) in (17). In Fig. 5, this corresponds to Gaussian process factors with

  Q_{gp,i} = ∫_{t_i}^{t_{i+1}} Φ_ξ(t_{i+1}, s) [ F ; G ] [ F ; G ]^T Φ_ξ(t_{i+1}, s)^T ds,

where Φ_ξ is the state transition matrix of the augmented system defined by (15) and (4) that takes the system from t_i to t_{i+1}.

c) Obstacle and Interpolation Factors: For obstacle avoidance, we use a hinge loss function h with safety distance ε applied to a signed distance field, as in GPMP2 [9]. In effect, it defines the obstacle factors and interpolation factors in Fig. 5, which both use Q_obs = σ_obs I. Though abstracted as a single factor in Fig. 5, between any two support points t_i and t_{i+1}, multiple (n_ip) interpolated factors can be constructed with indices evenly spaced in time (δt apart) to ensure path safety. See [9] for details.

d) Goal Factor: To drive the system to a desired goal configuration ξ_goal (for example, a particular position in configuration space with zero velocity and action), we add a goal factor to every support point except the current state. This encourages the optimizer to drive all states in the current horizon window closer to the goal. Because the covariance of this factor acts as a weighting term, we define it as Q_{goal,i} = σ_g (‖ξ_{t_i} − ξ_goal‖ / ‖ξ_start − ξ_goal‖) I such that it monotonically decreases with the Euclidean distance to the goal.

  Prior Factor:          f^prior   = exp{ −ξ_i^T Q_prior^{-1} ξ_i }
  Goal Factor:           f^goal_i  = exp{ −e_i^T Q_{goal,i}^{-1} e_i },  e_i = ξ_i − ξ_goal
  GP Prior Factor:       f^gp_i    = exp{ −e_i^T Q_{gp,i}^{-1} e_i },    e_i = Φ_ξ(t_{i+1}, t_i) ξ_i − ξ_{i+1}
  Obstacle Factor:       f^obs_i   = exp{ −e_i^T Q_obs^{-1} e_i },       e_i = h(ξ_i)
  Interpolation Factor:  f^intp_i  = exp{ −e_i^T Q_obs^{-1} e_i },       e_i = h(GPinterpolate(ξ_i, ξ_{i+1}))

Fig. 5: A factor graph of an example Laplace approximation problem showing states (white circles; ξ_i is shorthand for ξ_{t_i}) and different kinds of factors: GP prior (black circles), obstacle and interpolation (white squares), measurement (gray squares), and goal (black squares).
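As an illustration of the obstacle and goal factors above, the snippet below implements a hinge-loss obstacle residual on a precomputed signed distance field and the distance-scaled goal covariance Q_{goal,i}. The nearest-cell distance lookup and all numeric values are our own illustrative choices, not those of the GPMP2 implementation.

```python
import numpy as np

def hinge_obstacle_residual(position, sdf, cell_size, origin, eps):
    """Hinge loss on a 2D signed distance field: zero when the robot is at
    least eps away from obstacles, growing linearly as it gets closer."""
    idx = np.floor((position - origin) / cell_size).astype(int)
    dist = sdf[idx[1], idx[0]]              # nearest-cell lookup (illustrative)
    return max(0.0, eps - dist)

def goal_covariance(xi_i, xi_goal, xi_start, sigma_g, dim):
    """Distance-scaled goal covariance Q_goal,i = sigma_g * (||xi_i - xi_goal|| /
    ||xi_start - xi_goal||) * I, so the goal factor tightens near the goal."""
    scale = np.linalg.norm(xi_i - xi_goal) / np.linalg.norm(xi_start - xi_goal)
    return sigma_g * scale * np.eye(dim)

if __name__ == "__main__":
    # Tiny 5x5 signed distance field (meters), with an obstacle near the center.
    sdf = np.array([[1.2, 1.0, 0.8, 1.0, 1.2],
                    [1.0, 0.5, 0.3, 0.5, 1.0],
                    [0.8, 0.3, 0.0, 0.3, 0.8],
                    [1.0, 0.5, 0.3, 0.5, 1.0],
                    [1.2, 1.0, 0.8, 1.0, 1.2]])
    r = hinge_obstacle_residual(np.array([1.1, 2.0]), sdf, 1.0, np.zeros(2), eps=1.0)
    Q = goal_covariance(np.array([2.0, 0.0]), np.array([4.0, 0.0]),
                        np.array([0.0, 0.0]), sigma_g=1.0, dim=2)
    print("hinge residual:", r, "\ngoal covariance:\n", Q)
```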
2) Update of Laplace Approximation:
The same Laplace approximation is used to recursively update the policy for t ∈ [t_i, t_{i+1}) with a resolution of δt, and, when t = t_{i+1}, the graph is updated to construct a new nonlinear least-squares optimization problem. This is done by shifting the horizon window ahead by Δt and updating the factors to include any environmental changes during [t_i, t_{i+1}). In the updated graph, the prior factor on the first state is given by an additional Kalman filter based on (15) and (4) with (x, y) as hidden states and (z, u) as observations. For POMDP problems, we treat the estimate of the current state as perfect knowledge without uncertainty. This extra heuristic step is a compromise, and the assumption is accurate at the mean of the current belief.

B. Proof of (20)

Here we prove that the solution (20) to the approximate optimization problem (19)

  max_π ∫ q̂(z, ξ | e_S) π(u | h) dξ dz

can be written as posterior inference: ∀ t,

  π*_t(u_t | h_t) = δ(u_t − u*_t(h_t))
  u*_t(h_t) = argmax_{u_t} ∫ q̂(z, ξ | e_S) π*(ū_t | h) dx dy dz dū_t = argmax_{u_t} q̂(u_t | h_t, e_S)

where q̂(z, ξ | e_S) ∝ q(ξ) p(z | x) p̂(e_S | x_S, u_S) and p̂(e_S | x_S, u_S) = ∏_{S ∈ S} p̂(e_S | x_S, u_S) is the exponential-quadratic approximate factor given by the Laplace approximation.

Proof:
We assume the length of the trajectory is T. In the following, we first show that the optimal policy is deterministic, and then show that it corresponds to the mode of the posterior distribution q̂(u_t | h_t, e_S).

a) The Optimal Policy is Deterministic: For any t, we can write the objective function (19) as

  ∫ q̂(z, ξ | e_S) π(u | h) dξ dz ∝ ∫ q(ξ) p(z | x) p̂(e_S | x_S, u_S) π(u | h) dξ dz = ∫ π_t(u_t | h_t) f_{h_t}(u_t) du_t dh_t

in which

  f_{h_t}(u_t) = ∫ q(ξ) p(z | x) p̂(e_S | x_S, u_S) π(ū_t | h̄_t) dθ dz_{t+} du_{t+}

and π(ū_t | h̄_t) := π(u | h) / π_t(u_t | h_t). Therefore, (19) can equivalently be formulated as a variational problem over the density function π_t:

  max_{π_t} ∫ π_t(u_t | h_t) f_{h_t}(u_t) du_t   (24)
  s.t. ∫ π_t(u_t | h_t) du_t = 1,  π_t(u_t | h_t) ≥ 0,  ∀ u_t.

To deal with the inequality constraint, let g_t(u_t)² = π_t(u_t | h_t), so we can further write (24) as

  max_{g_t} ∫ g_t(u_t)² f_{h_t}(u_t) du_t   (25)
  s.t. ∫ g_t(u_t)² du_t = 1.

Let λ ∈ R be a Lagrange multiplier. We can turn (25) into an unconstrained optimization and use calculus of variations to derive the solution:

  min_λ max_{g_t} L(g_t, λ) = min_λ max_{g_t} ∫ g_t(u_t)² f_{h_t}(u_t) du_t + λ ( ∫ g_t(u_t)² du_t − 1 ).

Suppose g*_t(λ) is the optimum. Let g_t = g*_t + εη, where η is an arbitrary continuous function and ε → 0. Then the optimality condition is given by

  ∂L(g_t, λ)/∂ε = ∫ 2 g_t(u_t) η(u_t) (λ + f_{h_t}(u_t)) du_t = 0.

Since η is arbitrary, this implies that, ∀ u_t, g_t(u_t)(λ + f_{h_t}(u_t)) = 0. Given that λ is a scalar and g_t(u_t) is non-zero, we can conclude that π*_t(u_t | h_t) = δ(u_t − u*_t(h_t)) with

  u*_t(h_t) = argmax_{u_t} f_{h_t}(u_t).

b) The Optimal Policy is the Mode of the Posterior: From the previous part, we know that the policy corresponds to the mode of f_{h_t}(u_t) for any t. Therefore, to complete the proof, we only need to show that f*_{h_t}(u_t) ∝ q̂(u_t | e_S, h_t), where f*_{h_t}(u_t) is f_{h_t}(u_t) when the policies are optimal. First, let f̂_{h_t}(u_t) denote f_{h_t}(u_t) when all policies are deterministic, and define, for all t,

  f̂_{h_t}(u_t) := π_{t−}(u_{t−} | h_{t−}) q̂(z_t, z_{t−}, e_S, u_{t−}, u_t).   (26)

Next, we introduce a lemma.

Lemma 1:
Let z = (x, y) ∈ R^n. If f(x, y) ∝ N(z | m, S), then, for all y,

  max_x f(x, y) = C ∫ f(x, y) dx   (27)

for some constant C independent of y, in which m and S are the mean and covariance of a Gaussian.

Now we can show f*_{h_t}(u_t) ∝ q̂(u_t | e_S, h_t) by induction in backward order. To start, for the last policy at τ = T − δt, we can write (26) as

  f_{h_τ}(u_τ) = f̂_{h_τ}(u_τ) = π_{τ−}(u_{τ−} | h_{τ−}) q̂(z_τ, z_{τ−}, e_S, u_{τ−}, u_τ) = π_{τ−}(u_{τ−} | h_{τ−}) q̂(h_τ, e_S, u_τ) ∝ q̂(u_τ | h_τ, e_S)

in which we purposefully omit the dependency of u_{τ−} on h_{τ−}, because the exact value of u_{τ−} is observed when performing the optimization. Then we have u*_τ(h_τ) = argmax_{u_τ} f̂_{h_τ}(u_τ) = argmax_{u_τ} q̂(u_τ | h_τ, e_S).

Now we propagate the objective function one step backward from τ to τ − δt. Given h_{τ−δt}, the maximization at τ − δt is given as

  max_{π_{τ−δt}} max_{π_τ} ∫ π_τ(u_τ | h_τ) f̂_{h_τ}(u_τ) du_τ dz_τ du_{τ−δt}
  = max_{π_{τ−δt}} ∫ max_{u_τ(h_τ)} f̂_{h_τ}(u_τ) dz_τ du_{τ−δt}
  = max_{π_{τ−δt}} ∫ π_{τ−}(u_{τ−} | h_{τ−}) max_{u_τ(h_τ)} q̂(h_τ, e_S, u_τ) dz_τ du_{τ−δt}
  ∝ max_{π_{τ−δt}} ∫ π_{τ−}(u_{τ−} | h_{τ−}) q̂(h_τ, e_S, u_τ) dz_τ du_τ du_{τ−δt}
  = max_{π_{τ−δt}} ∫ π_{τ−}(u_{τ−} | h_{τ−}) q̂(h_{τ−δt}, e_S, u_{τ−δt}) du_{τ−δt}
  = max_{π_{τ−δt}} ∫ π_{τ−δt}(u_{τ−δt} | h_{τ−δt}) f̂_{h_{τ−δt}}(u_{τ−δt}) du_{τ−δt}.

The second equality holds because the policy is deterministic; the third proportionality is given by (27); the last equality is given by the definition (26). Therefore, the backward iteration maintains the policy optimization problem max_{π_t} ∫ π_t(u_t | h_t) f̂_{h_t}(u_t) du_t in the same algebraic form as in the last step for all t. Since f̂_{h_t}(u_t) ∝ q̂(u_t | h_t, e_S) at every step, the optimal action u*_t(h_t) is the mode of the posterior q̂(u_t | h_t, e_S), which completes the proof. ∎