A Nonlinear Feynman-Kac Formula with Application in Linearly Solvable Optimal Control
Tom Lefebvre and Guillaume Crevecoeur ({Tom.Lefebvre, Guillaume.Crevecoeur}@ugent.be) are with the Dept. of Electromechanical, Systems and Metal Engineering, Ghent University, 9000 Ghent, and with the Core Lab EEDT-DC, Flanders Make.

Abstract
In this article we present a solution to a nonlinear relative of the parabolic differential equation that was tackled by Feynman and Kac in the late 1940s. For the proof we rely on continuous time stochastic calculus. Second, we draw an interesting connection with a related recurrence relation, affirming the presented result in that it collapses onto the continuous time framework, but only in the limit. The equation emerges in the context of infinite horizon discounted Linearly Solvable Optimal Control which, as far as we are aware, is untreated by the literature. The continuous time setting can be treated using our new result. As we will demonstrate, the discrete time setting is intractable. Nevertheless, we can provide close estimates based on the recurrence relation, which also allows us to estimate the influence of time discretization errors. We demonstrate our solution on a small case study.

1. Introduction

Ever since the work of Feynman and Kac in the late 1940s, when both men were addressing what turned out to be essentially the same problem¹, the Feynman-Kac formula has grown into a well established means of evaluating parabolic partial differential equations (PDEs) subject to, amongst others, terminal boundary constraints

∂_t Z(t, x) = −q(t, x) Z(t, x) + ∇_x Z(t, x)^⊤ a(t, x) + tr(Σ(t, x) ∇_xx Z(t, x)),  Z(T, x) = Z_T(x)

¹ Although targeting the problem from their own respective fields of expertise. Feynman's work was motivated by attempts to produce tractable solutions for the one-dimensional Schrödinger equation. Feynman developed the concept that is now known as Path Integral theory, where the state of a particle at time t is obtained by averaging over all possible paths starting at a fixed initial state [1]. When Kac got wind of his then colleague's ideas, he showed that the same principles applied to the heat equation, reinterpreting the averaging operation probabilistically as a conditional expectation. He subsequently revealed the fundamental relationship between stochastic diffusion processes and certain partial differential equations in a series of papers, giving rise to what is now known as the Feynman-Kac formula [2, 3].

The Feynman-Kac formula states

Z(t, x) = E_{P(X(t→T)|X(t)=x)}[exp(−∫_t^T q(X(τ)) dτ) Z_T(X(T))]

Here X(τ ≥ t) is a diffusion process with drift, dX = a(X) dt + σ(X) dW, with initial condition X(t) = x and dW a Wiener process. The other quantities are introduced in theorem 1.

The formula draws a marvellous connection between parabolic PDEs and stochastic Itô or diffusion processes. Essentially the formula implies that the PDE mentioned above can be evaluated by simulating random paths of a stochastic Itô process, and it paved the way for practical methods of evaluation for problems that were otherwise deemed intractable. Major areas of interest are quantum mechanics [4], finance [5] and, more recently, also optimal control [6, 7, 8, 9].

The aim of this paper is to present a solution to the related differential equation introduced below, which can reasonably be interpreted as a nonlinear (and stationary) relative of the problem studied by Feynman and Kac,
αZ(x) log Z(x) = −q(x) Z(x) + ∇_x Z(x)^⊤ a(x) + tr(Σ(x) ∇_xx Z(x))

Specifically, we shall illustrate that the solution to this equation is given by the following conditional expectation

Z(x) = lim_{T→∞} E_{P(X(t→T)|X(t)=x)}[I(T)],  I(s) = exp(−∫_t^s e^{−α(τ−t)} q(X(τ)) dτ)

We provide two methods of proof. One involves a modified version of the modern method of proof used for the linear Feynman-Kac formula. The other is based on a stochastic recurrence relation that collapses onto the continuous time framework, but only in the limit. To the best of our knowledge the presented results are strictly original.

The studied nonlinear PDE emerges in the context of discounted infinite horizon Linearly Solvable Optimal Control (LSOC). This is a particular subclass of stochastic optimal control problems that allows for explicit evaluation of the value function and hence the optimal control. In the second part of this contribution we treat its application in LSOC [9].
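Before turning to the formal treatment, note that the conditional expectation above lends itself directly to Monte Carlo approximation. The sketch below is ours and purely illustrative: the horizon T truncates the limit T → ∞, the diffusion is discretised with an Euler-Maruyama step, and the drift a, constant diffusion matrix sigma and state cost q are assumed to be user supplied and vectorized over a batch of states.

    import numpy as np

    def estimate_Z(x0, a, sigma, q, alpha, T=10.0, dt=1e-3, n_paths=1000, seed=0):
        # Monte Carlo estimate of Z(x0) = E[exp(-int_t^T e^{-alpha(tau-t)} q(X(tau)) dtau)],
        # truncated at horizon T and discretised with Euler-Maruyama steps of size dt.
        rng = np.random.default_rng(seed)
        n, m = sigma.shape                                    # constant diffusion matrix (assumption)
        x = np.tile(np.asarray(x0, dtype=float), (n_paths, 1))
        integral = np.zeros(n_paths)                          # running discounted cost per path
        for k in range(int(T / dt)):
            integral += np.exp(-alpha * k * dt) * q(x) * dt   # left-endpoint Riemann sum
            dW = rng.normal(scale=np.sqrt(dt), size=(n_paths, m))
            x = x + a(x) * dt + dW @ sigma.T                  # uncontrolled diffusion step
        return np.exp(-integral).mean()

The returned value is a sample average of I(T) over independent uncontrolled rollouts; its accuracy is governed by the truncation horizon, the step size and the number of paths.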
2. Derivation
This section contains two methods of proof for the central result. The proofs are organised in two theorems: one addresses the continuous time problem as such, the other establishes a connection with a closely related recurrence relation that collapses onto the continuous time framework in the limit.
2.1. Preliminaries

The derivation of our results relies on Itô calculus [4, 5]. For the purpose of self containment, we briefly present notation and summarize the required results from that area for the reader to acquaint with before progressing further.

The key concept in Itô calculus is that of the Wiener process. A Wiener process is a mathematical entity that can be used to construct arbitrary stochastic processes.

Definition 1 (Wiener process). The Wiener process W(t) is defined by means of the following properties
• for ∆t > 0, the increment W(t + ∆t) − W(t) is Gaussian with mean zero and variance ∆t
• the limit lim_{∆t→0} (1/∆t) (W(t + ∆t) − W(t)) is not defined

The mathematical construction of a signal exhibiting such properties is beyond that required to establish the result proposed here. It suffices to note that the Wiener process can be understood as the limit case of a discrete-time random walk. These properties also imply that

E[dW] = 0,  E[dW²] = dt

The concept of Brownian motion can be used to construct general stochastic processes. Specifically, we will consider (Itô drift) diffusion or stochastic processes propelled by stochastic differential equations of the following general form

dX = a(X(t)) dt + σ(X(t)) dW

where X ∈ R^n and dW ∈ R^m. Further, we will denote the probability of a path spawned by such a diffusion process, conditioned on the initial value X(t) = x, as P(X(t → T) | X(t) = x). Note that such a path is a stochastic variable and that by consequence the expectation E_{P(X(t→T)|X(t)=x)}[·] is well defined.

Within this framework, Itô's lemma generalises the notion of the total derivative of a function V(t, x) along paths spawned by any diffusion process.

Lemma 1 (Itô's lemma). Let {V, a, σ} : R × R^n ↦ {R, R^n, R^{n×m}} be functions of time and space, where V(t, x) is at least twice differentiable, and let X be a stochastic process that can be modelled as a Brownian motion with drift a(t, x) and covariance Σ(t, x) = σ(t, x) σ(t, x)^⊤. Then we have that, up to higher order terms,

dV(t, X) ≈ (∂_t V(t, X) + ∇_x V(t, X)^⊤ a(t, X) + tr(Σ(t, X) ∇_xx V(t, X))) dt + ∇_x V(t, X)^⊤ σ(t, X) dW

with the major implication

lim_{dt→0} (1/dt) E_{dX}[dV(t, X)] = ∂_t V(t, X) + ∇_x V(t, X)^⊤ a(t, X) + tr(Σ(t, X) ∇_xx V(t, X))
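The following short sketch, which is ours and purely illustrative, makes these constructions concrete: sampling Gaussian increments with variance ∆t reproduces the moment properties of the Wiener process stated above, and an Euler-Maruyama recursion yields approximate realisations of the diffusion dX = a(X) dt + σ(X) dW. The scalar drift and diffusion functions below are illustrative assumptions, not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    dt, n_steps = 1e-3, 10_000
    dW = rng.normal(scale=np.sqrt(dt), size=n_steps)   # Wiener increments over intervals of length dt
    print(dW.mean(), (dW ** 2).mean())                 # approximately 0 and approximately dt

    a = lambda x: -x                                   # illustrative drift
    sigma = lambda x: 0.5                              # illustrative diffusion coefficient

    x = np.empty(n_steps + 1)
    x[0] = 1.0
    for k in range(n_steps):                           # Euler-Maruyama random walk with drift
        x[k + 1] = x[k] + a(x[k]) * dt + sigma(x[k]) * dW[k]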
2.2. Main results

For notational brevity and rapid intuitive interpretation, we henceforth refrain from writing the functional arguments as often as the context permits. We provide the continuous time treatment first.

Theorem 1.
Let {Z, q, a, σ} : R^n ↦ {R, R, R^n, R^{n×n}} be scalar, vector and matrix functions of the spatial coordinate x respectively, so that Σ = σσ^⊤ ≻ 0, and let α > 0 be a positive constant. Now consider the PDE

αZ log Z = −qZ + ∇_x Z^⊤ a + tr(Σ ∇_xx Z)

If X is a stochastic process propelled by the stochastic differential equation dX = a(X) dt + σ(X) dW, where dW is a Wiener process, then we have that

Z(x) = lim_{T→∞} E_{P(X(t→T)|X(t)=x)}[I(T)],  I(s) = exp(−∫_t^s e^{−α(τ−t)} q(X(τ)) dτ)

Proof.
Consider the stochastic process

Y(s) = Z(X(s))^{e^{−α(s−t)}} I(s)

The product rule then implies that

dY = d(Z^{e^{−αs'}}) I + Z^{e^{−αs'}} dI,  s' = s − t

and one easily verifies that dI = −e^{−αs'} q I ds. The first differential is trickier and involves the chain rule:

d(Z^{e^{−αs'}}) = ∂/∂Z (Z^{e^{−αs'}}) dZ + ∂/∂s (Z^{e^{−αs'}}) ds
 = e^{−αs'} Z^{e^{−αs'}−1} dZ − α e^{−αs'} Z^{e^{−αs'}} log Z ds
 = e^{−αs'} Z^{e^{−αs'}−1} ((∇_x Z^⊤ a + tr(Σ ∇_xx Z)) ds + ∇_x Z^⊤ σ dW) − α e^{−αs'} Z^{e^{−αs'}} log Z ds

Collecting both contributions to dY then yields

dY = e^{−αs'} Z^{e^{−αs'}−1} (∇_x Z^⊤ a + tr(Σ ∇_xx Z)) I ds − α e^{−αs'} Z^{e^{−αs'}} I log Z ds − e^{−αs'} q Z^{e^{−αs'}} I ds + e^{−αs'} Z^{e^{−αs'}−1} I ∇_x Z^⊤ σ dW

which can be rearranged into

dY = e^{−αs'} Z^{e^{−αs'}−1} (−αZ log Z − qZ + ∇_x Z^⊤ a + tr(Σ ∇_xx Z)) I ds + e^{−αs'} Z^{e^{−αs'}−1} I ∇_x Z^⊤ σ dW

The first term vanishes as dictated by the PDE. We can also get rid of the second term. First we integrate from t to T, yielding

Y(T) − Y(t) = ∫_t^T e^{−αs'} Z^{e^{−αs'}−1} I ∇_x Z^⊤ σ dW

We then note that the right-hand side depends on dW only. Hence we take the expectation over the path X(t → T) conditioned on X(t) = x and recognize an Itô integral, so that the term effectively vanishes:

E_{P(X(t→T)|X(t)=x)}[Y(T)] − Y(t) = E_{P(X(t→T)|X(t)=x)}[∫_t^T e^{−αs'} Z^{e^{−αs'}−1} I ∇_x Z^⊤ σ dW] = 0

On account of the conditioning on X(t) = x and the definition of Y it follows directly that Y(t) = Z(x). Hence we can rearrange the result above into an expression for Z(x). Finally, we must get rid of Z(X(T)). Taking the limit T → ∞ completes the proof, since the exponent e^{−α(T−t)} then vanishes and Z(X(T))^{e^{−α(T−t)}} tends to one:

Z(x) = Y(t)
 = E_{P(X(t→T)|X(t)=x)}[Y(T)]
 = E_{P(X(t→T)|X(t)=x)}[I(T) Z(X(T))^{e^{−α(T−t)}}]
 = lim_{T→∞} E_{P(X(t→T)|X(t)=x)}[I(T)]

We refer to this solution as a nonlinear or stationary Feynman-Kac formula.

We now present a second, alternative method of proof. This method relies on the identification of a stochastic recurrence relation that collapses onto the continuous time framework, but only in the limit. One might recognize that the result in discrete time is stronger in the sense that it holds for general discrete time transition probabilities P(X' = y | X = x). However, to establish the connection with the result in continuous time, we need to address a restrictive class of transition probabilities that can effectively be classified as random walks with drift. Second, in contrast to the solution in continuous time, here the solution is only exact in the limit. This is unfortunate for the anticipated application in control, as we will discuss later.

Theorem 2.
Let {Z, q, a, σ} : R^n ↦ {R, R, R^n, R^{n×m}} be scalar, vector and matrix functions of the spatial coordinate x respectively, let 1 > α > 0 be a positive constant, and let Σ = σσ^⊤ ≻ 0. Now consider the stochastic nonlinear recurrence relation

Z(x) = E_{P(X'|X=x)}[e^{−q(X)} Z(X')^α]

Here the notation X'|X = x indicates the value of the state after one time instant has passed, conditioned on X = x. This implies that we can associate a transition probability density function P(X' = y|X = x) expressing the probability of reaching y from x. The probability of a stochastic discrete time path conditioned on X(n) = x is denoted as P(X(n → N)|X(n) = x). Then

Z̃(x) ≤ Z(x) ≤ C Z̃(x)

where C ≥ 1 but bounded and

Z̃(x) = lim_{N→∞} E_{P(X(0→N)|X(0)=x)}[S(N)],  S(s) = exp(−Σ_{n=0}^{s} α^n q(X(n)))

Moreover, if X is a stochastic process propelled by the random walk with drift X' = X + ∆X, ∆X = a(X)∆t + σ(X)∆W, where ∆W ∼ N(0, ∆t I), then the stochastic recurrence relation collapses onto the stationary PDE from theorem 1 in the limit ∆t → 0 and the solution collapses onto the corresponding nonlinear Feynman-Kac formula.

Proof. First note that the random walk ∆X = a(X)∆t + σ(X)∆W, ∆W ∼ N(0, ∆t I), shares the characteristics of the continuous time diffusion process described in theorem 1 if ∆t → 0, provided that we substitute q and the scalar α for the associated entities in the discrete time setting. Specifically, we have that q ↦ q∆t and α ↦ e^{−α∆t}. Substitution into the stochastic recurrence relation yields

Z(x) = E_{P(∆X|X=x)}[e^{−q(X)∆t} Z(x + ∆X)^{e^{−α∆t}}]

With these preliminary results in place we can start manipulating the recurrence relation. Since e^{−q(x)∆t} Z(x)^{e^{−α∆t}} is not a stochastic variable, we can introduce it into the expectation operator's argument

0 = e^{−q(x)∆t} E_{P(∆X|X=x)}[Z(x + ∆X)^{e^{−α∆t}} − Z(x)^{e^{−α∆t}}] + e^{−q(x)∆t} Z(x)^{e^{−α∆t}} − Z(x)

Now we approximate the argument by its differential representation, i.e. Z(x + ∆X)^{e^{−α∆t}} − Z(x)^{e^{−α∆t}} ≈ e^{−α∆t} Z(x)^{e^{−α∆t}−1} ∆Z. This approximation is exact for ∆Z → 0, i.e. {∆X, ∆t} → 0, so that

0 = e^{−α∆t} e^{−q(x)∆t} Z(x)^{e^{−α∆t}−1} E_{P(∆X|X=x)}[∆Z] + e^{−q(x)∆t} Z(x)^{e^{−α∆t}} − Z(x)

By definition of the transition probability it follows directly that the expectation is equal to

E_{P(∆X)}[∆Z] = (∇_x Z^⊤ a + tr(Σ ∇_xx Z)) ∆t

The former result can be substituted into the recurrence relation:

0 = lim_{∆t→0} (1/∆t) e^{−α∆t} e^{−q∆t} Z^{e^{−α∆t}−1} (∇_x Z^⊤ a + tr(Σ ∇_xx Z)) ∆t + lim_{∆t→0} (1/∆t) (e^{−q∆t} Z^{e^{−α∆t}} − Z)

Evaluation of the first term is trivial and establishes the part of the PDE that can be associated to the diffusion process. The second term can be recognized as the definition of the derivative with respect to ∆t evaluated at ∆t = 0:

lim_{∆t→0} (1/∆t) (e^{−q∆t} Z^{e^{−α∆t}} − Z) = d/d∆t (e^{−q∆t} Z^{e^{−α∆t}}) |_{∆t=0} = −qZ − αZ log Z

These produce the αZ log Z and qZ terms that were still missing. Substituting the intermediary results back into the main equation recovers the stationary PDE. It follows that the discrete time stochastic recurrence relation collapses onto the differential equation in the limit:

αZ log Z = −qZ + ∇_x Z^⊤ a + tr(Σ ∇_xx Z)

It is left to prove that the same holds true for the solution. First let us show that the discrete time solution indeed satisfies the bounds in the theorem. As is easily verified, on account of the power α the recursion resists explicit evaluation. To resolve this inconvenience we introduce the variables δ_n. Since (·)^α is concave when 0 < α < 1, Jensen's inequality ascertains the lower bound. Then, since δ_0 = 1 and lim_{n→∞} δ_n = 1, the upper bound is a reasonable assumption:

1 ≤ δ_n(x) = E_{P(X'|X=x)}[Z(X')^α]^{α^n} / E_{P(X'|X=x)}[Z(X')^{α^{n+1}}] ≤ c_n

Substituting δ_n(x) E_{P(X'|X=x)}[Z(X')^{α^{n+1}}] for E_{P(X'|X=x)}[Z(X')^α]^{α^n} whenever such an expression emerges allows to evaluate the recursion anyhow and produces the result

Z(x) = E_{P(X(0→N)|X(0)=x)}[S(N) Z(X(N))^{α^N} δ(N)],  δ(N) = Π_{n=1}^{N} δ_n(X(n))

On account of the bounds on δ_n we have that 1 ≤ lim_{N→∞} δ(N) ≤ C < ∞, guaranteeing that the solution indeed satisfies

Z̃(x) ≤ Z(x) ≤ C Z̃(x)

To show that the bounds tighten when ∆t → 0, note that in the limit ∆t → 0 we have δ_n = 1 with α ↦ e^{−α∆t}, and that in the same limit P(X(n → N)|X(n) = x) = P(X(t → T)|X(t) = x). Finally, we practice the definition of the Riemann integral in the expectation operator's argument so as to obtain

Z(x) = Z̃(x) = lim_{∆t→0} lim_{N→∞} E_{P(X(n→N)|X(n)=x)}[S'(N)] = lim_{T→∞} E_{P(X(t→T)|X(t)=x)}[I(T)]

where S'(s) = exp(−Σ_{m=n}^{s} e^{−α(m−n)∆t} q(X(m)) ∆t). This finalises the connection between theorems 1 and 2.

The theorem has two interesting consequences. Firstly, it guarantees that we can estimate a lower bound for the solution Z(x), namely Z̃(x), by approximating the expectation using sampled paths. Secondly, the alternative proof is valuable in its own right as it offers a tool to study the effect of time discretization schemes when evaluating the continuous solution numerically. Specifically, for significantly small ∆t ≪ 1 the solution is contained in an interval [Z̃(x), C Z̃(x)] with C ≃ 1. We shall demonstrate this empirically in our numerical experiment.
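These two consequences translate directly into a sampling procedure. The sketch below is ours, with illustrative truncation parameters: it estimates the lower bound Z̃(x) by rolling out the random walk with drift and averaging the exponentiated, discounted cost, using the substitutions q ↦ q∆t and α ↦ e^{−α∆t} from the proof. Comparing it against the continuous time estimator for decreasing ∆t gives an empirical handle on the discretization error.

    import numpy as np

    def estimate_Z_tilde(x0, a, sigma, q, alpha, dt, N=1000, n_paths=1000, seed=0):
        # Sample-path estimate of the lower bound Z~(x0) from theorem 2.
        rng = np.random.default_rng(seed)
        n, m = sigma.shape                          # constant diffusion matrix (assumption)
        x = np.tile(np.asarray(x0, dtype=float), (n_paths, 1))
        S = np.zeros(n_paths)
        disc = np.exp(-alpha * dt)                  # discrete discount factor, alpha -> exp(-alpha*dt)
        for k in range(N):
            S += disc ** k * q(x) * dt              # accumulated discounted stage cost, q -> q*dt
            dW = rng.normal(scale=np.sqrt(dt), size=(n_paths, m))
            x = x + a(x) * dt + dW @ sigma.T        # random walk with drift
        return np.exp(-S).mean()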
3. Linearly Solvable Optimal Control
In this section we introduce two problems that are rooted in stochastic optimal control and that benefit from the theorems in the previous section. In particular, we will introduce two optimal control problems, in a continuous and a discrete time setting respectively, for which we estimate the value function by sampling paths from the uncontrolled process. This class of control problems is known as LSOC [6]. On account of the Feynman-Kac formula it is possible to estimate the control based on uncontrolled sample paths.

Initiated by the work of Kappen, who first noticed the potential use of this peculiar subclass, this led to the development of a series of control methods known as Path Integral Control (see [8, 7] and references therein), which has more recently been connected to Entropy Regularized Optimal Control [10]. As far as we are aware, the discounted infinite horizon LSOC setting is untreated by the literature.

3.1. Continuous Time Infinite Horizon
Let us begin our discussion by considering the continuous time discounted infinite horizon stochastic optimal control problem with control affine dynamics and a control quadratic cost rate term.
Definition 2.
Let q : R^n ↦ R be a strictly positive function, R ∈ R^{m×m} a strictly positive definite matrix and α a strictly positive scalar. The continuous time infinite horizon stochastic optimal control problem for control affine dynamics and control quadratic cost rate is then defined as

u* = arg min_u C[u] = arg min_u lim_{T→∞} E_{P(X(t→T)|X(t)=x)}[∫_t^T c(τ) dτ]

where c(τ) = e^{−α(τ−t)} (q(X(τ)) + u(τ)^⊤ R u(τ)) and X(τ ≥ t)|X(t) = x is governed by the control affine diffusion process dX = (a(X) + B(X)u) dt + σ(X) dW.

It is well known that the optimal control is governed by the following differential equation, also known as the stochastic Hamilton-Jacobi-Bellman (HJB) equation. Here V represents the so-called Value function, which quantifies the accumulated cost when the optimal control is continuously applied from the initial state X(t) = x:

αV = min_u q + u^⊤ R u + ∇_x V^⊤ (a + Bu) + tr(Σ ∇_xx V)

It is easily verified that the minimizer satisfies

u* = −R^{-1} B^⊤ ∇_x V

Substitution into the original equation then produces the differential equation which we refer to as the optimal HJB equation

αV = q − ∇_x V^⊤ B R^{-1} B^⊤ ∇_x V + ∇_x V^⊤ a + tr(Σ ∇_xx V)

This equation appears to be intractable on account of the quadratic term in ∇_x V. However, a well known trick in physics is to introduce the following Value function transformation

Z(x) = e^{−V(x)}
so that

∇_x V = −(1/Z) ∇_x Z,  ∇_xx V = −(1/Z) ∇_xx Z + (1/Z²) ∇_x Z ∇_x Z^⊤

and, equivalently,

u* = (1/Z) R^{-1} B^⊤ ∇_x Z

This transformation will allow us to reduce the complexity of the PDE significantly and to propose a solution in terms of theorem 1 for a limited problem subclass. Substitution of this intermediary result into the HJB equation reveals that

αZ log Z = −qZ + ∇_x Z^⊤ a + tr(Σ ∇_xx Z) + (1/2)(1/Z) ∇_x Z^⊤ B R^{-1} B^⊤ ∇_x Z − (1/2)(1/Z) ∇_x Z^⊤ Σ ∇_x Z

Hence, by choosing Σ = B R^{-1} B^⊤ the last two terms cancel out and the nonlinear PDE treated in theorem 1 emerges

αZ log Z = −qZ + ∇_x Z^⊤ a + tr(Σ ∇_xx Z)

Thus, by consequence, it follows that

Z(x) = lim_{T→∞} E_{P(X(t→T)|X(t)=x)}[I(T)]

where X(τ ≥ t)|X(t) = x is governed by the uncontrolled diffusion process dX = a(X) dt + σ(X) dW.

This remarkable result implies that we can estimate the Value function from sample paths taken from the uncontrolled diffusion process and, since the optimal control is expressed in terms of the Value function, subsequently also the policy.
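To make that last step concrete: once Z has been estimated pointwise and smoothed by some differentiable surrogate (the paper uses a Gaussian Process in section 4), the policy follows by evaluating the gradient expression for u* above. The helper below is a hypothetical sketch of ours; Z_hat and grad_Z_hat denote the fitted surrogate and its gradient, which are assumptions on our part rather than functions defined in the paper.

    import numpy as np

    def optimal_control(x, Z_hat, grad_Z_hat, B, R):
        # u*(x) = (1/Z(x)) R^{-1} B^T grad_x Z(x), following the value function transformation above.
        return np.linalg.solve(R, B.T @ grad_Z_hat(x)) / Z_hat(x)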
3.2. Discrete Time Infinite Horizon

Analogous results can be established in a discrete time setting. Here the associated optimal control problem is known as Kullback-Leibler control or discrete time Linearly Solvable Optimal Control. As will soon prove to be the case, the discrete time stochastic optimal control problem cannot be solved directly by introducing a control quadratic cost rate and assuming control affine dynamics. Instead, one introduces an information-geometric cost that penalizes the control indirectly [9]. Specifically, the quadratic control cost is replaced by the Kullback-Leibler divergence between the generic controlled transition probability P_u and the uncontrolled transition probability P that is obtained for zero control. We will show that for control affine dynamics this reduces effectively to an equivalent quadratic control cost.

D[P_u(X'|X = x) || P(X'|X = x)] = E_{P_u(X'|X=x)}[log (P_u(X'|X = x) / P(X'|X = x))]

The discrete time infinite horizon stochastic optimal control problem for control affine dynamics and Kullback-Leibler control cost rate is established rigorously in the following definition.

Definition 3.
Let q : R^n ↦ R be a strictly positive function and α a strictly positive scalar. The discrete time infinite horizon stochastic optimal control problem for control affine dynamics and Kullback-Leibler control cost rate is then defined as

u* = arg min_u C[u] = arg min_u lim_{N→∞} E_{P_u(X(n→N)|X(n)=x)}[Σ_{m=n}^{N} c(m)]

where c(m) = α^{m−n} (q(X(m)) + log (P_u(X(m+1)|X(m)) / P(X(m+1)|X(m)))) and the transition probability is governed by the control affine random walk (∆W ∼ N(0, I))

X' = X + ∆X,  ∆X = a(X) + B(X)u + σ(X)∆W

First note that for a control affine random walk the Kullback-Leibler cost rate simplifies to a quadratic control cost rate. Hence it follows that an equivalent relation between the stochastic process and the control cost rate, which allowed us to solve the stochastic optimal control problem in continuous time, is inherent to the Kullback-Leibler control setting:

D[P_u(X'|X = x) || P(X'|X = x)] = u^⊤ B^⊤ Σ^{-1} B u + c,  with R := B^⊤ Σ^{-1} B

Further, it can be shown that the problem in definition 3 is equivalent to the following recursive optimization problem. Again V represents the Value function, which is subject to a similar interpretation as in continuous time. In a generic setting this equation is referred to as the stochastic Bellman equation:

V(x) = min_{P_u ∈ P} E_{P_u(X'|X=x)}[q(x) + log (P_u(X'|X = x) / P(X'|X = x)) + αV(X')]

Here the minimization is over transition probabilities P_u(X' = y|X = x) ∈ P, where P represents the space of all transition probabilities on {X' = y|X = x}, defined as P = {P(X' = y|X = x) | ∫ P(X' = y|X = x) dy = 1}.
Hence the problem can no longer be treated as an ordinary optimization problem but rather as a variational one. We can solve this problem by introducing the Lagrangian

L[P_u, λ] = E_{P_u}[q + log(P_u/P) + αV'] + λ (E_{P_u}[1] − 1)

so that

∂_{P_u} L = ∫ (log(P_u/P) + αV' + λ + 1) dX' = 0

which is identically equal to zero if the integrand is, and so

P*_u(X' = y|X = x) ∝ P(X' = y|X = x) e^{−αV(y)}

where the relation ∝ implies that the left-hand side and right-hand side are equivalent up to a normalization constant, which in turn depends on the Lagrangian multiplier λ.

Substitution of the optimal transition probability function into the Bellman equation then reveals that the Value function is governed by the following stochastic recurrence relation

V(x) = q(x) − log E_{P(X'|X=x)}[e^{−αV(X')}]

Reintroducing the Value function transformation Z(x) = e^{−V(x)} then produces

Z(x) = e^{−q(x)} E_{P(X'|X=x)}[Z(X')^α]

and by consequence of theorem 2 it then also follows that

Z̃(x) = lim_{N→∞} E_{P(X(0→N)|X(0)=x)}[S(N)]

Equivalently, we have that

P*_u(X' = y|X = x) ∝ P(X' = y|X = x) Z(y)^α = P(X' = y|X = x) Z(y)^α / E_{P(X'|X=x)}[Z(X')^α]

Note that in the discrete time setting only the lower bound Z̃(x), rather than Z(x) itself, can be estimated from sampled paths, and hence the performance of the derived policy will deteriorate in practice.

The symmetry is completed by noting that we can abstract the optimal control for control affine dynamics by remarking that

E_{P*_u(X'|X=x)}[X'] = a(x) + B(x) u*(x) = E_{P(X'|X=x)}[(a(x) + σ(x)∆W) Z(X')^α] / E_{P(X'|X=x)}[Z(X')^α]

so that

u*(x) = R^{-1} B^⊤ Σ^{-1} E_{P(X'|X=x)}[σ(x)∆W Z(X')^α] / E_{P(X'|X=x)}[Z(X')^α]

The same result follows from the following projection strategy

u*(x) ≈ arg min_u D[P*_u(X'|X = x) || P_u(X'|X = x)]

Both in the continuous and the discrete time setting the control problem is subject to a constraint relating the diffusion covariance matrix Σ to the control penalty matrix R,

Σ = B R^{-1} B^⊤ or R = B^⊤ Σ^{-1} B

This constraint implies that the diffusion process and the admissible control are inherently balanced and that the control designer cannot choose a different control weighing. Although the constraint makes sense from a control engineering perspective (the larger the uncertainty, the higher the admissible control can be), it poses a limiting restriction that practitioners would like to see resolved. The issue can be remedied easily by scaling the solely state dependent cost rate term q. Alternatively, we could reconsider the Value function transformation. Specifically, we could also have used the parametrized transformation Z = e^{−γV}.
If one repeats the continuous time derivation substituting this transformation rather than the unparametrized transformation, we find

Σ = γ B R^{-1} B^⊤ or R = γ B^⊤ Σ^{-1} B

Finally, note that in this case the solution is adapted to

Z(x) = lim_{T→∞} E_{P(X(t→T)|X(t)=x)}[I(T)^γ]

in the continuous time setting and

Z(x) = lim_{N→∞} E_{P(X(n→N)|X(n)=x)}[S(N)^γ]

in the discrete time setting. This effectively illustrates that the parametrized value function transformation is equivalent to a rescaling of the state dependent cost rate term q.
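For completeness, the discrete time policy can likewise be approximated from one-step samples of the uncontrolled process by importance weighting the injected noise with Z(X')^α, following the expression for u*(x) above. The sketch below is ours: it parametrises the random walk as in theorem 2, assumes Σ = σσ^⊤ is invertible, and takes a fitted, vectorized estimate Z_hat of the desirability function as given.

    import numpy as np

    def kl_optimal_control(x, Z_hat, a, B, sigma, alpha, dt, n_samples=1000, seed=0):
        # u*(x) = R^{-1} B^T Sigma^{-1} E[sigma*dW Z(X')^alpha] / E[Z(X')^alpha],
        # with R = B^T Sigma^{-1} B, estimated from one-step uncontrolled samples.
        rng = np.random.default_rng(seed)
        n, m = sigma.shape
        dW = rng.normal(scale=np.sqrt(dt), size=(n_samples, m))
        noise = dW @ sigma.T                               # sigma * dW for every sample
        x_next = x + a(x) * dt + noise                     # uncontrolled one-step samples
        w = Z_hat(x_next) ** alpha                         # importance weights Z(X')^alpha
        weighted_noise = (w[:, None] * noise).sum(axis=0) / w.sum()
        Sigma = sigma @ sigma.T                            # assumed invertible
        R = B.T @ np.linalg.solve(Sigma, B)                # R = B^T Sigma^{-1} B
        return np.linalg.solve(R, B.T @ np.linalg.solve(Sigma, weighted_noise))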
4. Example
In this section we demonstrate the proposed solution for a small case study and verify the influence of some parameters empirically. The system under study is the controlled van der Pol equation

ÿ − ζω(1 − y²)ẏ + ω²y = u

Here we assume that the source of uncertainty operates solely in the input space, i.e. dX = (a(X) + Bu) dt + Bσ dW with dW ∼ N(0, I), or equivalently dX = (a(X) + Bu) dt + dW', where dW' ∼ N(0, Σ), Σ = σ²BB^⊤. We choose ω = 1, ζ = · and σ = ·. A discounted infinite horizon stochastic optimal control problem is defined with state penalty rate q = ‖x‖ and discount factor α = 1. The conversion of this system to a state-space representation is trivial. In our simulations we use a discretised version of the continuous dynamics. The discretization is performed as in the proof of theorem 2, where we choose ∆t = 1 × 10^{−·}, which is several orders of magnitude smaller than the system's time constant.

The goal is to determine the optimal policy. Therefore we first estimate the value function, making use of the nonlinear Feynman-Kac formula from theorem 1. This is done by determining the value function over a sample grid of initial values (see figure 1). We approximate the associated function value using a Monte Carlo estimate of the expectation with M sample paths and truncate the cost after time T. For sufficiently large M and T and small ∆t the estimate should be exact. We then approximate the surface using a Gaussian Process and determine the optimal policy as detailed in section 3.1.

Figure 1: Illustration of the estimation procedure. From left to right: visualisation of unactuated stochastic rollouts from coordinate (2, ·), the resulting values of I(T) for varying γ (increasing from left to right), and visualization of the used sample grid.

Figure 1 illustrates the estimation procedure for a single coordinate, here x = (2, ·), where γ is varied over the set {·, ·, ·}. Note that the same set of experiments can be used for every value of γ. Figures 2 and 3 demonstrate the estimation of Z, V and the policy response function for (T, M) = (1, 25) and (T, M) = (25, ·). In figure 4 the influence of γ on the estimates of Z is illustrated. Finally, figure 5 illustrates the stabilizing effect of the policy on the dynamic landscape governing the van der Pol equation.
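To make the set-up concrete, the following sketch assembles the case study around the Monte Carlo estimator estimate_Z sketched in section 1. It is our own illustration: the numerical values for ζ, σ, the sample grid and (T, M) are illustrative assumptions rather than the exact values used in the experiments.

    import numpy as np

    # Illustrative parameter values; zeta, sig, the grid and (T, M) are assumptions.
    omega, zeta, sig = 1.0, 1.0, 1.0
    alpha, dt = 1.0, 1e-3
    B = np.array([[0.0], [1.0]])
    sigma = sig * B                                  # noise enters through the input channel only

    def a(x):                                        # uncontrolled van der Pol in state-space form
        y, ydot = x[..., 0], x[..., 1]
        return np.stack([ydot, zeta * omega * (1 - y ** 2) * ydot - omega ** 2 * y], axis=-1)

    q = lambda x: np.linalg.norm(x, axis=-1)         # state penalty rate q = ||x||

    grid = [np.array([y, yd]) for y in np.linspace(-4, 4, 9) for yd in np.linspace(-4, 4, 9)]
    Z_grid = [estimate_Z(x0, a, sigma, q, alpha, T=15.0, dt=dt, n_paths=200) for x0 in grid]
    # Z_grid would then be smoothed with a Gaussian Process and differentiated to
    # recover V = -log Z and the policy, as described in section 3.1.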
5. Conclusion
In this contribution we present a solution to a nonlinear PDE closely related to the parabolic PDE studied by Feynman and Kac. Our results have applications in Linearly Solvable Optimal Control. In particular, our work extends the framework to incorporate the discounted infinite horizon setting, which remained untreated by the literature.
References

[1] Richard Phillips Feynman. Space-time approach to non-relativistic quantum mechanics. In Feynman's Thesis—A New Approach To Quantum Theory, pages 71–109. World Scientific, 2005.

[2] Mark Kac. On distributions of certain Wiener functionals. Transactions of the American Mathematical Society, 65(1):1–13, 1949.

Figure 2: Solution for (T, M, γ) = (1, ·, ·). From left to right: estimate of the function Z, estimate of the function V, and estimate of the optimal policy obtained by approximating and differentiating V using a Gaussian Process.

Figure 3: Solution for (T, M, γ) = (25, ·, ·).

Figure 4: Solution for varying γ and (T, M) = (15, ·).

Figure 5: Illustration of optimal control. From left to right: streamline visualization of the unactuated dynamic landscape, streamline visualization of the actuated dynamic landscape (γ = 4 versus γ = ·, actuated in blue or red respectively), and visualization of actuated stochastic rollouts.