A Nonlinear Feynman-Kac Formula with Application in Linearly Solvable Optimal Control
Tom Lefebvre and Guillaume Crevecoeur ({Tom.Lefebvre, Guillaume.Crevecoeur}@ugent.be) are with the Dept. of Electromechanical, Systems and Metal Engineering, Ghent University, 9000 Ghent, and with the Core Lab EEDT-DC, Flanders Make.

Abstract
In this article we present a solution to a nonlinear relative of the parabolic differential equation that was tackled by Feynman and Kac in the late 1940s. For the proof we rely on continuous time stochastic calculus. Second, we draw an interesting connection with a related recurrence relation, affirming the presented result in that it collapses onto the continuous time framework, but only in the limit. The equation emerges in the context of infinite horizon discounted Linearly Solvable Optimal Control which, as far as we are aware, is untreated by the literature. The continuous time setting can be treated using our new result. As we will demonstrate, the discrete time setting is intractable. Nevertheless, we can provide close estimates based on the recurrence relation, which also allows us to estimate the influence of time discretization errors. We demonstrate our solution on a small case study.

1. Introduction

Ever since the work of Feynman and Kac in the late 1940s, when both men were addressing what turned out to be essentially the same problem¹, the Feynman-Kac formula has grown into a well established means of evaluating parabolic partial differential equations (PDEs) subject to, amongst others, terminal boundary constraints

∂_t Z(t, x) = −q(t, x) Z(t, x) + ∇_x Z(t, x)^⊤ a(t, x) + tr(Σ(t, x) ∇_xx Z(t, x)),  Z(T, x) = Z_T(x)

¹ Although targeting the problem from their own respective fields of expertise. Feynman's work was motivated by attempts to produce tractable solutions for the one-dimensional Schrödinger equation. Feynman developed the concept that is now known as Path Integral theory, where the state of a particle at time t is obtained by averaging over all possible paths starting at a fixed initial state [1]. When Kac got wind of his then colleague's ideas, he showed that the same principles applied to the heat equation, reinterpreting the averaging operation probabilistically as a conditional expectation. He subsequently revealed the fundamental relationship between stochastic diffusion processes and certain partial differential equations in a series of papers, giving rise to what is now known as the Feynman-Kac formula [2, 3].

The Feynman-Kac formula states

Z(t, x) = E_{P(X(t→T)|X(t)=x)}[exp(−∫_t^T q(X(τ)) dτ) Z_T(X(T))]

Here X(τ ≥ t) is a diffusion process with drift, dX = a(X) dt + σ(X) dW, with initial condition X(t) = x and dW a Wiener process. The other quantities are introduced in theorem 1.

The formula draws a marvellous connection between parabolic PDEs and stochastic Itô or diffusion processes. Essentially the formula implies that the PDE mentioned above can be evaluated by simulating random paths of a stochastic Itô process, and it paved the way for practical methods of evaluation for problems that were otherwise deemed intractable. Major areas of interest are quantum mechanics [4], finance [5] and, more recently, also optimal control [6, 7, 8, 9].

The aim of this paper is to present a solution to the related differential equation introduced below, which can reasonably be interpreted as a nonlinear (and stationary) relative of the problem studied by Feynman and Kac,
αZ(x) log Z(x) = −q(x) Z(x) + ∇_x Z(x)^⊤ a(x) + tr(Σ(x) ∇_xx Z(x))

Specifically, we shall illustrate that the solution to this equation is given by the following conditional expectation

Z(x) = lim_{T→∞} E_{P(X(t→T)|X(t)=x)}[I(T)],  I(s) = exp(−∫_t^s e^{−α(τ−t)} q(X(τ)) dτ)

We provide two methods of proof. One involves a modified version of the modern method of proof used for the linear Feynman-Kac formula. The other is based on a stochastic recurrence relation that collapses onto the continuous time framework, but only in the limit. To the best of our knowledge the presented results are strictly original.

The studied nonlinear PDE emerges in the context of discounted infinite horizon Linearly Solvable Optimal Control (LSOC). This is a particular subclass of stochastic optimal control problems that allows for explicit evaluation of the value function and hence the optimal control. In the second part of this contribution we treat its application in LSOC [9].
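Before turning to the formal treatment, note that the conditional expectation above lends itself directly to Monte Carlo approximation. The sketch below is ours and purely illustrative: the horizon T truncates the limit T → ∞, the diffusion is discretised with an Euler-Maruyama step, and the drift a, constant diffusion matrix sigma and state cost q are assumed to be user supplied and vectorized over a batch of states.

    import numpy as np

    def estimate_Z(x0, a, sigma, q, alpha, T=10.0, dt=1e-3, n_paths=1000, seed=0):
        # Monte Carlo estimate of Z(x0) = E[exp(-int_t^T e^{-alpha(tau-t)} q(X(tau)) dtau)],
        # truncated at horizon T and discretised with Euler-Maruyama steps of size dt.
        rng = np.random.default_rng(seed)
        n, m = sigma.shape                                    # constant diffusion matrix (assumption)
        x = np.tile(np.asarray(x0, dtype=float), (n_paths, 1))
        integral = np.zeros(n_paths)                          # running discounted cost per path
        for k in range(int(T / dt)):
            integral += np.exp(-alpha * k * dt) * q(x) * dt   # left-endpoint Riemann sum
            dW = rng.normal(scale=np.sqrt(dt), size=(n_paths, m))
            x = x + a(x) * dt + dW @ sigma.T                  # uncontrolled diffusion step
        return np.exp(-integral).mean()

The returned value is a sample average of I(T) over independent uncontrolled rollouts; its accuracy is governed by the truncation horizon, the step size and the number of paths.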
2. Derivation
This section contains two methods of proof for the central result. The proofs are organised in two theorems: one addresses the continuous time problem as such, the other establishes a connection with a closely related recurrence relation that collapses onto the continuous time framework in the limit.
2.1. Preliminaries

The derivation of our results relies on Itô calculus [4, 5]. For the purpose of self containment, we briefly present notation and summarize the required results from that area for the reader to acquaint with before progressing further.

The key concept in Itô calculus is that of the Wiener process. A Wiener process is a mathematical entity that can be used to construct arbitrary stochastic processes.

Definition 1 (Wiener process). The Wiener process W(t) is defined by means of the following properties
• for ∆t > 0, the increment W(t + ∆t) − W(t) is Gaussian with mean zero and variance ∆t
• the limit lim_{∆t→0} (1/∆t) (W(t + ∆t) − W(t)) is not defined

The mathematical construction of a signal exhibiting such properties is beyond that required to establish the result proposed here. It suffices to note that the Wiener process can be understood as the limit case of a discrete-time random walk. These properties also imply that

E[dW] = 0,  E[dW²] = dt

The concept of Brownian motion can be used to construct general stochastic processes. Specifically, we will consider (Itô drift) diffusion or stochastic processes propelled by stochastic differential equations of the following general form

dX = a(X(t)) dt + σ(X(t)) dW

where X ∈ R^n and dW ∈ R^m. Further, we will denote the probability of a path spawned by such a diffusion process, conditioned on the initial value X(t) = x, as P(X(t → T) | X(t) = x). Note that such a path is a stochastic variable and that by consequence the expectation E_{P(X(t→T)|X(t)=x)}[·] is well defined.

Within this framework, Itô's lemma generalises the notion of the total derivative of a function V(t, x) along paths spawned by any diffusion process.

Lemma 1 (Itô's lemma). Let {V, a, σ} : R × R^n ↦ {R, R^n, R^{n×m}} be functions of time and space, where V(t, x) is at least twice differentiable, and let X be a stochastic process that can be modelled as a Brownian motion with drift a(t, x) and covariance Σ(t, x) = σ(t, x) σ(t, x)^⊤. Then we have that, up to higher order terms,

dV(t, X) ≈ (∂_t V(t, X) + ∇_x V(t, X)^⊤ a(t, X) + tr(Σ(t, X) ∇_xx V(t, X))) dt + ∇_x V(t, X)^⊤ σ(t, X) dW

with the major implication

lim_{dt→0} (1/dt) E_{dX}[dV(t, X)] = ∂_t V(t, X) + ∇_x V(t, X)^⊤ a(t, X) + tr(Σ(t, X) ∇_xx V(t, X))
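The following short sketch, which is ours and purely illustrative, makes these constructions concrete: sampling Gaussian increments with variance ∆t reproduces the moment properties of the Wiener process stated above, and an Euler-Maruyama recursion yields approximate realisations of the diffusion dX = a(X) dt + σ(X) dW. The scalar drift and diffusion functions below are illustrative assumptions, not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    dt, n_steps = 1e-3, 10_000
    dW = rng.normal(scale=np.sqrt(dt), size=n_steps)   # Wiener increments over intervals of length dt
    print(dW.mean(), (dW ** 2).mean())                 # approximately 0 and approximately dt

    a = lambda x: -x                                   # illustrative drift
    sigma = lambda x: 0.5                              # illustrative diffusion coefficient

    x = np.empty(n_steps + 1)
    x[0] = 1.0
    for k in range(n_steps):                           # Euler-Maruyama random walk with drift
        x[k + 1] = x[k] + a(x[k]) * dt + sigma(x[k]) * dW[k]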
2.2. Main results

For notational brevity and rapid intuitive interpretation, we henceforth refrain from writing the functional arguments as often as the context permits. We provide the continuous time treatment first.

Theorem 1.
Let {Z, q, a, σ} : R^n ↦ {R, R, R^n, R^{n×n}} be scalar, vector and matrix functions of the spatial coordinate x respectively, so that Σ = σσ^⊤ ≻ 0, and let α > 0 be a positive constant. Now consider the PDE

αZ log Z = −qZ + ∇_x Z^⊤ a + tr(Σ ∇_xx Z)

If X is a stochastic process propelled by the stochastic differential equation dX = a(X) dt + σ(X) dW, where dW is a Wiener process, then we have that

Z(x) = lim_{T→∞} E_{P(X(t→T)|X(t)=x)}[I(T)],  I(s) = exp(−∫_t^s e^{−α(τ−t)} q(X(τ)) dτ)

Proof.
Consider the stochastic process

Y(s) = Z(X(s))^{e^{−α(s−t)}} I(s)

The product rule then implies that

dY = d(Z^{e^{−αs'}}) I + Z^{e^{−αs'}} dI,  s' = s − t

and one easily verifies that dI = −e^{−αs'} q I ds. The first differential is trickier and involves the chain rule:

d(Z^{e^{−αs'}}) = ∂/∂Z (Z^{e^{−αs'}}) dZ + ∂/∂s (Z^{e^{−αs'}}) ds
 = e^{−αs'} Z^{e^{−αs'}−1} dZ − α e^{−αs'} Z^{e^{−αs'}} log Z ds
 = e^{−αs'} Z^{e^{−αs'}−1} ((∇_x Z^⊤ a + tr(Σ ∇_xx Z)) ds + ∇_x Z^⊤ σ dW) − α e^{−αs'} Z^{e^{−αs'}} log Z ds

Collecting both contributions to dY then yields

dY = e^{−αs'} Z^{e^{−αs'}−1} (∇_x Z^⊤ a + tr(Σ ∇_xx Z)) I ds − α e^{−αs'} Z^{e^{−αs'}} I log Z ds − e^{−αs'} q Z^{e^{−αs'}} I ds + e^{−αs'} Z^{e^{−αs'}−1} I ∇_x Z^⊤ σ dW

which can be rearranged into

dY = e^{−αs'} Z^{e^{−αs'}−1} (−αZ log Z − qZ + ∇_x Z^⊤ a + tr(Σ ∇_xx Z)) I ds + e^{−αs'} Z^{e^{−αs'}−1} I ∇_x Z^⊤ σ dW

The first term vanishes as dictated by the PDE. We can also get rid of the second term. First we integrate from t to T, yielding

Y(T) − Y(t) = ∫_t^T e^{−αs'} Z^{e^{−αs'}−1} I ∇_x Z^⊤ σ dW

We then note that the right-hand side depends on dW only. Hence we take the expectation over the path X(t → T) conditioned on X(t) = x and recognize an Itô integral, so that the term effectively vanishes:

E_{P(X(t→T)|X(t)=x)}[Y(T)] − Y(t) = E_{P(X(t→T)|X(t)=x)}[∫_t^T e^{−αs'} Z^{e^{−αs'}−1} I ∇_x Z^⊤ σ dW] = 0

On account of the conditioning on X(t) = x and the definition of Y it follows directly that Y(t) = Z(x). Hence we can rearrange the result above into an expression for Z(x). Finally, we must get rid of Z(X(T)). Taking the limit T → ∞ completes the proof, since the exponent e^{−α(T−t)} then vanishes and Z(X(T))^{e^{−α(T−t)}} tends to one:

Z(x) = Y(t)
 = E_{P(X(t→T)|X(t)=x)}[Y(T)]
 = E_{P(X(t→T)|X(t)=x)}[I(T) Z(X(T))^{e^{−α(T−t)}}]
 = lim_{T→∞} E_{P(X(t→T)|X(t)=x)}[I(T)]

We refer to this solution as a nonlinear or stationary Feynman-Kac formula.

We now present a second, alternative method of proof. This method relies on the identification of a stochastic recurrence relation that collapses onto the continuous time framework, but only in the limit. One might recognize that the result in discrete time is stronger in the sense that it holds for general discrete time transition probabilities P(X' = y | X = x). However, to establish the connection with the result in continuous time, we need to address a restrictive class of transition probabilities that can effectively be classified as random walks with drift. Second, in contrast to the solution in continuous time, here the solution is only exact in the limit. This is unfortunate for the anticipated application in control, as we will discuss later.

Theorem 2.
Let {Z, q, a, σ} : R^n ↦ {R, R, R^n, R^{n×m}} be scalar, vector and matrix functions of the spatial coordinate x respectively, let 1 > α > 0 be a positive constant, and let Σ = σσ^⊤ ≻ 0. Now consider the stochastic nonlinear recurrence relation

Z(x) = E_{P(X'|X=x)}[e^{−q(X)} Z(X')^α]

Here the notation X'|X = x indicates the value of the state after one time instant has passed, conditioned on X = x. This implies that we can associate a transition probability density function P(X' = y|X = x) expressing the probability of reaching y from x. The probability of a stochastic discrete time path conditioned on X(n) = x is denoted as P(X(n → N)|X(n) = x). Then

Z̃(x) ≤ Z(x) ≤ C Z̃(x)

where C ≥ 1 but bounded and

Z̃(x) = lim_{N→∞} E_{P(X(0→N)|X(0)=x)}[S(N)],  S(s) = exp(−Σ_{n=0}^{s} α^n q(X(n)))

Moreover, if X is a stochastic process propelled by the random walk with drift X' = X + ∆X, ∆X = a(X)∆t + σ(X)∆W, where ∆W ∼ N(0, ∆t I), then the stochastic recurrence relation collapses onto the stationary PDE from theorem 1 in the limit ∆t → 0 and the solution collapses onto the corresponding nonlinear Feynman-Kac formula.

Proof. First note that the random walk ∆X = a(X)∆t + σ(X)∆W, ∆W ∼ N(0, ∆t I), shares the characteristics of the continuous time diffusion process described in theorem 1 if ∆t → 0, provided that we substitute q and the scalar α for the associated entities in the discrete time setting. Specifically, we have that q ↦ q∆t and α ↦ e^{−α∆t}. Substitution into the stochastic recurrence relation yields

Z(x) = E_{P(∆X|X=x)}[e^{−q(X)∆t} Z(x + ∆X)^{e^{−α∆t}}]

With these preliminary results in place we can start manipulating the recurrence relation. Since e^{−q(x)∆t} Z(x)^{e^{−α∆t}} is not a stochastic variable, we can introduce it into the expectation operator's argument

0 = e^{−q(x)∆t} E_{P(∆X|X=x)}[Z(x + ∆X)^{e^{−α∆t}} − Z(x)^{e^{−α∆t}}] + e^{−q(x)∆t} Z(x)^{e^{−α∆t}} − Z(x)

Now we approximate the argument by its differential representation, i.e. Z(x + ∆X)^{e^{−α∆t}} − Z(x)^{e^{−α∆t}} ≈ e^{−α∆t} Z(x)^{e^{−α∆t}−1} ∆Z. This approximation is exact for ∆Z → 0, i.e. {∆X, ∆t} → 0, so that

0 = e^{−α∆t} e^{−q(x)∆t} Z(x)^{e^{−α∆t}−1} E_{P(∆X|X=x)}[∆Z] + e^{−q(x)∆t} Z(x)^{e^{−α∆t}} − Z(x)

By definition of the transition probability it follows directly that the expectation is equal to

E_{P(∆X)}[∆Z] = (∇_x Z^⊤ a + tr(Σ ∇_xx Z)) ∆t

The former result can be substituted into the recurrence relation:

0 = lim_{∆t→0} (1/∆t) e^{−α∆t} e^{−q∆t} Z^{e^{−α∆t}−1} (∇_x Z^⊤ a + tr(Σ ∇_xx Z)) ∆t + lim_{∆t→0} (1/∆t) (e^{−q∆t} Z^{e^{−α∆t}} − Z)

Evaluation of the first term is trivial and establishes the part of the PDE that can be associated to the diffusion process. The second term can be recognized as the definition of the derivative with respect to ∆t evaluated at ∆t = 0:

lim_{∆t→0} (1/∆t) (e^{−q∆t} Z^{e^{−α∆t}} − Z) = d/d∆t (e^{−q∆t} Z^{e^{−α∆t}}) |_{∆t=0} = −qZ − αZ log Z

These produce the αZ log Z and qZ terms that were still missing. Substituting the intermediary results back into the main equation recovers the stationary PDE. It follows that the discrete time stochastic recurrence relation collapses onto the differential equation in the limit:

αZ log Z = −qZ + ∇_x Z^⊤ a + tr(Σ ∇_xx Z)

It is left to prove that the same holds true for the solution. First let us show that the discrete time solution indeed satisfies the bounds in the theorem. As is easily verified, on account of the power α the recursion resists explicit evaluation. To resolve this inconvenience we introduce the variables δ_n. Since (·)^α is concave when 0 < α < 1, Jensen's inequality ascertains the lower bound. Then, since δ_0 = 1 and lim_{n→∞} δ_n = 1, the upper bound is a reasonable assumption:

1 ≤ δ_n(x) = E_{P(X'|X=x)}[Z(X')^α]^{α^n} / E_{P(X'|X=x)}[Z(X')^{α^{n+1}}] ≤ c_n

Substituting δ_n(x) E_{P(X'|X=x)}[Z(X')^{α^{n+1}}] for E_{P(X'|X=x)}[Z(X')^α]^{α^n} whenever such an expression emerges allows to evaluate the recursion anyhow and produces the result

Z(x) = E_{P(X(0→N)|X(0)=x)}[S(N) Z(X(N))^{α^N} δ(N)],  δ(N) = Π_{n=1}^{N} δ_n(X(n))

On account of the bounds on δ_n we have that 1 ≤ lim_{N→∞} δ(N) ≤ C < ∞, guaranteeing that the solution indeed satisfies

Z̃(x) ≤ Z(x) ≤ C Z̃(x)

To show that the bounds tighten when ∆t → 0, note that in the limit ∆t → 0 we have δ_n = 1 with α ↦ e^{−α∆t}, and that in the same limit P(X(n → N)|X(n) = x) = P(X(t → T)|X(t) = x). Finally, we practice the definition of the Riemann integral in the expectation operator's argument so as to obtain

Z(x) = Z̃(x) = lim_{∆t→0} lim_{N→∞} E_{P(X(n→N)|X(n)=x)}[S'(N)] = lim_{T→∞} E_{P(X(t→T)|X(t)=x)}[I(T)]

where S'(s) = exp(−Σ_{m=n}^{s} e^{−α(m−n)∆t} q(X(m)) ∆t). This finalises the connection between theorems 1 and 2.

The theorem has two interesting consequences. Firstly, it guarantees that we can estimate a lower bound for the solution Z(x), namely Z̃(x), by approximating the expectation using sampled paths. Secondly, the alternative proof is valuable in its own right as it offers a tool to study the effect of time discretization schemes when evaluating the continuous solution numerically. Specifically, for significantly small ∆t ≪ 1 the solution is contained in an interval [Z̃(x), C Z̃(x)] with C ≃ 1. We shall demonstrate this empirically in our numerical experiment.
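These two consequences translate directly into a sampling procedure. The sketch below is ours, with illustrative truncation parameters: it estimates the lower bound Z̃(x) by rolling out the random walk with drift and averaging the exponentiated, discounted cost, using the substitutions q ↦ q∆t and α ↦ e^{−α∆t} from the proof. Comparing it against the continuous time estimator for decreasing ∆t gives an empirical handle on the discretization error.

    import numpy as np

    def estimate_Z_tilde(x0, a, sigma, q, alpha, dt, N=1000, n_paths=1000, seed=0):
        # Sample-path estimate of the lower bound Z~(x0) from theorem 2.
        rng = np.random.default_rng(seed)
        n, m = sigma.shape                          # constant diffusion matrix (assumption)
        x = np.tile(np.asarray(x0, dtype=float), (n_paths, 1))
        S = np.zeros(n_paths)
        disc = np.exp(-alpha * dt)                  # discrete discount factor, alpha -> exp(-alpha*dt)
        for k in range(N):
            S += disc ** k * q(x) * dt              # accumulated discounted stage cost, q -> q*dt
            dW = rng.normal(scale=np.sqrt(dt), size=(n_paths, m))
            x = x + a(x) * dt + dW @ sigma.T        # random walk with drift
        return np.exp(-S).mean()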
3. Linearly Solvable Optimal Control
In this section we introduce two problems that are rooted in stochastic optimal control and that benefit from the theorems in the previous section. In particular, we will introduce two optimal control problems, in a continuous and a discrete time setting respectively, for which we estimate the value function by sampling paths from the uncontrolled process. This class of control problems is known as LSOC [6]. On account of the Feynman-Kac formula it is possible to estimate the control based on uncontrolled sample paths.

Initiated by the work of Kappen, who first noticed the potential use of this peculiar subclass, this led to the development of a series of control methods known as Path Integral Control (see [8, 7] and references therein), which has more recently been connected to Entropy Regularized Optimal Control [10]. As far as we are aware, the discounted infinite horizon LSOC setting is untreated by the literature.

3.1. Continuous Time Infinite Horizon
Let us begin our discussion by considering the continuous time discounted infinite horizon stochastic optimal control problem with control affine dynamics and a control quadratic cost rate term.
Definition 2.
Let q : R^n ↦ R be a strictly positive function, R ∈ R^{m×m} a strictly positive definite matrix and α a strictly positive scalar. The continuous time infinite horizon stochastic optimal control problem for control affine dynamics and control quadratic cost rate is then defined as

u* = arg min_u C[u] = arg min_u lim_{T→∞} E_{P(X(t→T)|X(t)=x)}[∫_t^T c(τ) dτ]

where c(τ) = e^{−α(τ−t)} (q(X(τ)) + u(τ)^⊤ R u(τ)) and X(τ ≥ t)|X(t) = x is governed by the control affine diffusion process dX = (a(X) + B(X)u) dt + σ(X) dW.

It is well known that the optimal control is governed by the following differential equation, also known as the stochastic Hamilton-Jacobi-Bellman (HJB) equation. Here V represents the so-called Value function, which quantifies the accumulated cost when the optimal control is continuously applied from the initial state X(t) = x:

αV = min_u q + u^⊤ R u + ∇_x V^⊤ (a + Bu) + tr(Σ ∇_xx V)

It is easily verified that the minimizer satisfies

u* = −R^{-1} B^⊤ ∇_x V

Substitution into the original equation then produces the differential equation which we refer to as the optimal HJB equation

αV = q − ∇_x V^⊤ B R^{-1} B^⊤ ∇_x V + ∇_x V^⊤ a + tr(Σ ∇_xx V)

This equation appears to be intractable on account of the quadratic term in ∇_x V. However, a well known trick in physics is to introduce the following Value function transformation

Z(x) = e^{−V(x)}
so that

∇_x V = −(1/Z) ∇_x Z,  ∇_xx V = −(1/Z) ∇_xx Z + (1/Z²) ∇_x Z ∇_x Z^⊤

and, equivalently,

u* = (1/Z) R^{-1} B^⊤ ∇_x Z

This transformation will allow us to reduce the complexity of the PDE significantly and to propose a solution in terms of theorem 1 for a limited problem subclass. Substitution of this intermediary result into the HJB equation reveals that

αZ log Z = −qZ + ∇_x Z^⊤ a + tr(Σ ∇_xx Z) + (1/2)(1/Z) ∇_x Z^⊤ B R^{-1} B^⊤ ∇_x Z − (1/2)(1/Z) ∇_x Z^⊤ Σ ∇_x Z

Hence, by choosing Σ = B R^{-1} B^⊤ the last two terms cancel out and the nonlinear PDE treated in theorem 1 emerges

αZ log Z = −qZ + ∇_x Z^⊤ a + tr(Σ ∇_xx Z)

Thus, by consequence, it follows that

Z(x) = lim_{T→∞} E_{P(X(t→T)|X(t)=x)}[I(T)]

where X(τ ≥ t)|X(t) = x is governed by the uncontrolled diffusion process dX = a(X) dt + σ(X) dW.

This remarkable result implies that we can estimate the Value function from sample paths taken from the uncontrolled diffusion process and, since the optimal control is expressed in terms of the Value function, subsequently also the policy.
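To make that last step concrete: once Z has been estimated pointwise and smoothed by some differentiable surrogate (the paper uses a Gaussian Process in section 4), the policy follows by evaluating the gradient expression for u* above. The helper below is a hypothetical sketch of ours; Z_hat and grad_Z_hat denote the fitted surrogate and its gradient, which are assumptions on our part rather than functions defined in the paper.

    import numpy as np

    def optimal_control(x, Z_hat, grad_Z_hat, B, R):
        # u*(x) = (1/Z(x)) R^{-1} B^T grad_x Z(x), following the value function transformation above.
        return np.linalg.solve(R, B.T @ grad_Z_hat(x)) / Z_hat(x)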
3.2. Discrete Time Infinite Horizon

Analogous results can be established in a discrete time setting. Here the associated optimal control problem is known as Kullback-Leibler control or discrete time Linearly Solvable Optimal Control. As will soon prove to be the case, the discrete time stochastic optimal control problem cannot be solved directly by introducing a control quadratic cost rate and assuming control affine dynamics. Instead, one introduces an information-geometric cost that penalizes the control indirectly [9]. Specifically, the quadratic control cost is replaced by the Kullback-Leibler divergence between the generic controlled transition probability P_u and the uncontrolled transition probability P that is obtained for zero control. We will show that for control affine dynamics this reduces effectively to an equivalent quadratic control cost.

D[P_u(X'|X = x) || P(X'|X = x)] = E_{P_u(X'|X=x)}[log (P_u(X'|X = x) / P(X'|X = x))]

The discrete time infinite horizon stochastic optimal control problem for control affine dynamics and Kullback-Leibler control cost rate is established rigorously in the following definition.

Definition 3.
Let q : R^n ↦ R be a strictly positive function and α a strictly positive scalar. The discrete time infinite horizon stochastic optimal control problem for control affine dynamics and Kullback-Leibler control cost rate is then defined as

u* = arg min_u C[u] = arg min_u lim_{N→∞} E_{P_u(X(n→N)|X(n)=x)}[Σ_{m=n}^{N} c(m)]

where c(m) = α^{m−n} (q(X(m)) + log (P_u(X(m+1)|X(m)) / P(X(m+1)|X(m)))) and the transition probability is governed by the control affine random walk (∆W ∼ N(0, I))

X' = X + ∆X,  ∆X = a(X) + B(X)u + σ(X)∆W

First note that for a control affine random walk the Kullback-Leibler cost rate simplifies to a quadratic control cost rate. Hence it follows that an equivalent relation between the stochastic process and the control cost rate, which allowed us to solve the stochastic optimal control problem in continuous time, is inherent to the Kullback-Leibler control setting:

D[P_u(X'|X = x) || P(X'|X = x)] = u^⊤ B^⊤ Σ^{-1} B u + c,  with R := B^⊤ Σ^{-1} B

Further, it can be shown that the problem in definition 3 is equivalent to the following recursive optimization problem. Again V represents the Value function, which is subject to a similar interpretation as in continuous time. In a generic setting this equation is referred to as the stochastic Bellman equation:

V(x) = min_{P_u ∈ P} E_{P_u(X'|X=x)}[q(x) + log (P_u(X'|X = x) / P(X'|X = x)) + αV(X')]

Here the minimization is over transition probabilities P_u(X' = y|X = x) ∈ P, where P represents the space of all transition probabilities on {X' = y|X = x}, defined as P = {P(X' = y|X = x) | ∫ P(X' = y|X = x) dy = 1}.
Hence the problem can no longer be treated as an ordinary optimization problem but rather as a variational one. We can solve this problem by introducing the Lagrangian

L[P_u, λ] = E_{P_u}[q + log(P_u/P) + αV'] + λ (E_{P_u}[1] − 1)

so that

∂_{P_u} L = ∫ (log(P_u/P) + αV' + λ + 1) dX' = 0

which is identically equal to zero if the integrand is, and so

P*_u(X' = y|X = x) ∝ P(X' = y|X = x) e^{−αV(y)}

where the relation ∝ implies that the left-hand side and right-hand side are equivalent up to a normalization constant, which in turn depends on the Lagrangian multiplier λ.

Substitution of the optimal transition probability function into the Bellman equation then reveals that the Value function is governed by the following stochastic recurrence relation

V(x) = q(x) − log E_{P(X'|X=x)}[e^{−αV(X')}]

Reintroducing the Value function transformation Z(x) = e^{−V(x)} then produces

Z(x) = e^{−q(x)} E_{P(X'|X=x)}[Z(X')^α]

and by consequence of theorem 2 it then also follows that

Z̃(x) = lim_{N→∞} E_{P(X(0→N)|X(0)=x)}[S(N)]

Equivalently, we have that

P*_u(X' = y|X = x) ∝ P(X' = y|X = x) Z(y)^α = P(X' = y|X = x) Z(y)^α / E_{P(X'|X=x)}[Z(X')^α]

Note that in the discrete time setting only the lower bound Z̃(x), rather than Z(x) itself, can be estimated from sampled paths, and hence the performance of the derived policy will deteriorate in practice.

The symmetry is completed by noting that we can abstract the optimal control for control affine dynamics by remarking that

E_{P*_u(X'|X=x)}[X'] = a(x) + B(x) u*(x) = E_{P(X'|X=x)}[(a(x) + σ(x)∆W) Z(X')^α] / E_{P(X'|X=x)}[Z(X')^α]

so that

u*(x) = R^{-1} B^⊤ Σ^{-1} E_{P(X'|X=x)}[σ(x)∆W Z(X')^α] / E_{P(X'|X=x)}[Z(X')^α]

The same result follows from the following projection strategy

u*(x) ≈ arg min_u D[P*_u(X'|X = x) || P_u(X'|X = x)]

Both in the continuous and the discrete time setting the control problem is subject to a constraint relating the diffusion covariance matrix Σ to the control penalty matrix R,

Σ = B R^{-1} B^⊤ or R = B^⊤ Σ^{-1} B

This constraint implies that the diffusion process and the admissible control are inherently balanced and that the control designer cannot choose a different control weighing. Although the constraint makes sense from a control engineering perspective (the larger the uncertainty, the higher the admissible control can be), it poses a limiting restriction that practitioners would like to see resolved. The issue can be remedied easily by scaling the solely state dependent cost rate term q. Alternatively, we could reconsider the Value function transformation. Specifically, we could also have used the parametrized transformation Z = e^{−γV}.
If one repeats the continuous time derivation substituting this transformation rather than the unparametrized transformation, we find

Σ = γ B R^{-1} B^⊤ or R = γ B^⊤ Σ^{-1} B

Finally, note that in this case the solution is adapted to

Z(x) = lim_{T→∞} E_{P(X(t→T)|X(t)=x)}[I(T)^γ]

in the continuous time setting and

Z(x) = lim_{N→∞} E_{P(X(n→N)|X(n)=x)}[S(N)^γ]

in the discrete time setting. This effectively illustrates that the parametrized value function transformation is equivalent to a rescaling of the state dependent cost rate term q.
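For completeness, the discrete time policy can likewise be approximated from one-step samples of the uncontrolled process by importance weighting the injected noise with Z(X')^α, following the expression for u*(x) above. The sketch below is ours: it parametrises the random walk as in theorem 2, assumes Σ = σσ^⊤ is invertible, and takes a fitted, vectorized estimate Z_hat of the desirability function as given.

    import numpy as np

    def kl_optimal_control(x, Z_hat, a, B, sigma, alpha, dt, n_samples=1000, seed=0):
        # u*(x) = R^{-1} B^T Sigma^{-1} E[sigma*dW Z(X')^alpha] / E[Z(X')^alpha],
        # with R = B^T Sigma^{-1} B, estimated from one-step uncontrolled samples.
        rng = np.random.default_rng(seed)
        n, m = sigma.shape
        dW = rng.normal(scale=np.sqrt(dt), size=(n_samples, m))
        noise = dW @ sigma.T                               # sigma * dW for every sample
        x_next = x + a(x) * dt + noise                     # uncontrolled one-step samples
        w = Z_hat(x_next) ** alpha                         # importance weights Z(X')^alpha
        weighted_noise = (w[:, None] * noise).sum(axis=0) / w.sum()
        Sigma = sigma @ sigma.T                            # assumed invertible
        R = B.T @ np.linalg.solve(Sigma, B)                # R = B^T Sigma^{-1} B
        return np.linalg.solve(R, B.T @ np.linalg.solve(Sigma, weighted_noise))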
4. Example
In this section we demonstrate the proposed solution for a small case study and verify the influence of some parameters empirically. The system under study is the controlled van der Pol equation

ÿ − ζω(1 − y²)ẏ + ω²y = u

Here we assume that the source of uncertainty operates solely in the input space, i.e. dX = (a(X) + Bu) dt + Bσ dW with dW ∼ N(0, I), or equivalently dX = (a(X) + Bu) dt + dW', where dW' ∼ N(0, Σ), Σ = σ²BB^⊤. We choose ω = 1, ζ = · and σ = ·. A discounted infinite horizon stochastic optimal control problem is defined with state penalty rate q = ‖x‖ and discount factor α = 1. The conversion of this system to a state-space representation is trivial. In our simulations we use a discretised version of the continuous dynamics. The discretization is performed as in the proof of theorem 2, where we choose ∆t = 1 × 10^{−·}, which is several orders of magnitude smaller than the system's time constant.

The goal is to determine the optimal policy. Therefore we first estimate the value function, making use of the nonlinear Feynman-Kac formula from theorem 1. This is done by determining the value function over a sample grid of initial values (see figure 1). We approximate the associated function value using a Monte Carlo estimate of the expectation with M sample paths and truncate the cost after time T. For sufficiently large M and T and small ∆t the estimate should be exact. We then approximate the surface using a Gaussian Process and determine the optimal policy as detailed in section 3.1.

Figure 1: Illustration of the estimation procedure. From left to right: visualisation of unactuated stochastic rollouts from coordinate (2, ·), the resulting values of I(T) for varying γ (increasing from left to right), and visualization of the used sample grid.

Figure 1 illustrates the estimation procedure for a single coordinate, here x = (2, ·), where γ is varied over the set {·, ·, ·}. Note that the same set of experiments can be used for every value of γ. Figures 2 and 3 demonstrate the estimation of Z, V and the policy response function for (T, M) = (1, 25) and (T, M) = (25, ·). In figure 4 the influence of γ on the estimates of Z is illustrated. Finally, figure 5 illustrates the stabilizing effect of the policy on the dynamic landscape governing the van der Pol equation.
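To make the set-up concrete, the following sketch assembles the case study around the Monte Carlo estimator estimate_Z sketched in section 1. It is our own illustration: the numerical values for ζ, σ, the sample grid and (T, M) are illustrative assumptions rather than the exact values used in the experiments.

    import numpy as np

    # Illustrative parameter values; zeta, sig, the grid and (T, M) are assumptions.
    omega, zeta, sig = 1.0, 1.0, 1.0
    alpha, dt = 1.0, 1e-3
    B = np.array([[0.0], [1.0]])
    sigma = sig * B                                  # noise enters through the input channel only

    def a(x):                                        # uncontrolled van der Pol in state-space form
        y, ydot = x[..., 0], x[..., 1]
        return np.stack([ydot, zeta * omega * (1 - y ** 2) * ydot - omega ** 2 * y], axis=-1)

    q = lambda x: np.linalg.norm(x, axis=-1)         # state penalty rate q = ||x||

    grid = [np.array([y, yd]) for y in np.linspace(-4, 4, 9) for yd in np.linspace(-4, 4, 9)]
    Z_grid = [estimate_Z(x0, a, sigma, q, alpha, T=15.0, dt=dt, n_paths=200) for x0 in grid]
    # Z_grid would then be smoothed with a Gaussian Process and differentiated to
    # recover V = -log Z and the policy, as described in section 3.1.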
5. Conclusion
In this contribution we present a solution to a nonlinear PDE closely related to the parabolic PDE studied by Feynman and Kac. Our results have applications in Linearly Solvable Optimal Control. In particular, our work extends the framework to incorporate the discounted infinite horizon setting, which remained untreated by the literature.
References

[1] Richard Phillips Feynman. Space-time approach to non-relativistic quantum mechanics. In Feynman's Thesis—A New Approach To Quantum Theory, pages 71–109. World Scientific, 2005.

[2] Mark Kac. On distributions of certain Wiener functionals. Transactions of the American Mathematical Society, 65(1):1–13, 1949.

Figure 2: Solution for (T, M, γ) = (1, ·, ·). From left to right: estimate of the function Z, estimate of the function V, and estimate of the optimal policy obtained by approximating and differentiating V using a Gaussian Process.

Figure 3: Solution for (T, M, γ) = (25, ·, ·).

Figure 4: Solution for varying γ and (T, M) = (15, ·).

Figure 5: Illustration of optimal control. From left to right: streamline visualization of the unactuated dynamic landscape, streamline visualization of the actuated dynamic landscape (γ = 4 versus γ = ·, actuated in blue or red respectively), and visualization of actuated stochastic rollouts.