Actor-Critic Method for High Dimensional Static Hamilton--Jacobi--Bellman Partial Differential Equations based on Neural Networks
Mo Zhou†, Jiequn Han‡, and Jianfeng Lu§

Abstract.
We propose a novel numerical method for high dimensional Hamilton–Jacobi–Bellman (HJB) type elliptic partial differential equations (PDEs). The HJB PDEs, reformulated as optimal control problems, are tackled by the actor-critic framework inspired by reinforcement learning, based on neural network parametrization of the value and control functions. Within the actor-critic framework, we employ a policy gradient approach to improve the control, while for the value function, we derive a variance reduced least square temporal difference method (VR-LSTD) using stochastic calculus. To numerically discretize the stochastic control problem, we employ an adaptive stepsize scheme to improve the accuracy near the domain boundary. Numerical examples up to 20 spatial dimensions including the linear quadratic regulators, the stochastic Van der Pol oscillators, and the diffusive Eikonal equations are presented to validate the effectiveness of our proposed method.
Key words.
Hamilton–Jacobi–Bellman equations; high dimensional partial differential equations; stochastic control; actor-critic methods
1. Introduction.
The Hamilton–Jacobi–Bellman (HJB) equation is an important family of partial differential equations (PDEs), given its connection with optimal control problems that lead to a wide range of applications. The unknown in the HJB equation can be viewed as the total expected value function for optimal control problems. The equation can be derived from the dynamic programming principle pioneered by Bellman [8], which gives a necessary and sufficient condition of optimality. Theoretical results for the existence and uniqueness of the HJB equations are well-established, see e.g., [67]. From the viewpoint of stochastic control, the relationship between the viscosity solution of the HJB equations and the backward stochastic differential equations (BSDEs) is introduced in [14, 51–54].

The wide applications of HJB equations call for efficient numerical algorithms. Various numerical approaches have been developed in the literature, including the monotone approximation scheme [3, 21], the finite volume method [59, 64], and the Galerkin method [4, 5]. In [49], non-oscillatory schemes are developed to solve the HJB equations by exploring the connection with hyperbolic conservation laws. The HJB equations related to reachability problems are studied in [42, 44, 45]. A general survey of classical methods to solve the optimal control problem numerically can be found, e.g., in [58]. While these conventional approaches have been quite successful, they fall short for solving HJB equations in high dimensions due to the curse of dimensionality [8]: the computational cost goes up exponentially with the dimensionality. Many works attempt to mitigate this fundamental difficulty by leveraging dimension reduction techniques such as proper orthogonal decomposition, sparse grids, pseudospectral collocation, and tensor decomposition (see e.g., [16, 35, 36, 41, 50]). The performance of these algorithms heavily depends on how well the low-dimensional representation matches the solutions, and is typically problem-dependent and thus of limited applicability.

∗ Date: February 20, 2021. Please address correspondence to Jiequn Han or Jianfeng Lu.
Funding: The work of JL and MZ is supported in part by National Science Foundation via grant DMS-2012286. The authors are grateful for computing time at the Terascale Infrastructure for Groundbreaking Research in Science and Engineering (TIGRESS) of Princeton University.
† Department of Mathematics, Duke University ([email protected]).
‡ Department of Mathematics, Princeton University ([email protected]).
§ Department of Mathematics, Department of Physics, and Department of Chemistry, Duke University ([email protected]).

To better address the challenge of high dimensionality, a promising direction is to consider the artificial neural network as a more flexible and efficient function approximation tool. This topic has received a considerable amount of attention and has been a rapidly developing field in recent years. Several numerical approaches for high dimensional PDEs based on neural network parameterization have been proposed, see e.g., the reviews [6, 20] and references therein.

For HJB type equations and related optimal control problems, the approach most tightly connected to our work is the deep BSDE method [19, 27], which reformulates parabolic PDEs as control problems using BSDEs, and uses deep neural network parameterization for the solution and control to solve this problem. Theoretical results for the convergence of this method are studied in [28]. The deep BSDE method and its variants have been applied to solve HJB type equations, stochastic control problems, and differential games (see e.g., [13, 19, 24, 26, 27, 32, 34, 40, 48, 55, 57]). Numerical algorithms for solving high dimensional deterministic and stochastic control problems based on other formulations combined with deep learning approximation have also been investigated in [7, 18, 25, 47].

While some methods mentioned above have been successful in solving PDEs in high dimensions, two issues remain to be addressed. On the one hand, most of these works concern parabolic PDEs of finite time horizon (often of order one), while only a few works investigate the static elliptic HJB equations corresponding to control problems with infinite time horizon. On the other hand, most existing works consider equations where the optimal controls are explicitly known given the value function, or equations without controls, while there are many important HJB type equations for which the optimal control is cast through an optimization problem and hence implicit. Recently, an algorithm for high dimensional finite-time horizon stochastic control problems with implicit optimal control is considered in [33], based on the deep BSDE formulation associated with the stochastic maximum principle. In this paper, we take a different avenue and focus on solving the static elliptic type HJB equation with implicit control, in which the above two challenges are compounded.

Our proposed numerical method is heavily inspired by the literature on reinforcement learning (RL) [62], which is of course closely related to control problems. Our motivation for borrowing techniques from RL is due to the impressive revolution and great success in recent years in deep reinforcement learning by utilizing neural network parametrization [17, 46, 60]. In the RL context, the control problem is usually formulated as a Markov decision process (MDP) on discrete time and state space.
If the model is given, finding the optimal policy can be viewed as solving a discrete HJB equation. It is then natural to ask whether algorithms developed in the RL context can be generalized to the context of solving high dimensional HJB equations.

In this paper, we reformulate the HJB type semilinear elliptic PDEs into stochastic control problems and leverage the actor-critic framework in conjunction with neural network approximation to solve the equations. The actor-critic methods are a class of algorithms in reinforcement learning [62]. These algorithms iteratively evaluate and improve the current policies (i.e., controls) until final convergence. The critic refers to the value function of a given policy; the process of estimating the critic is called policy evaluation. The most common algorithms for policy evaluation are temporal difference (TD) methods [10, 39, 63], or their variants, such as TD(λ) [15] and the least square temporal difference (LSTD) [2, 11, 43, 56]. The actor refers to the policy function, and we need to do policy improvement based on a given value function.
2. Theoretical background of the actor-critic framework.

2.1. Control formulation of semilinear elliptic equations.
Consider the following semilinear elliptic PDE
$$\tfrac{1}{2}\operatorname{Tr}\big(\sigma\sigma^\top\operatorname{Hess}(V)\big)(x)+\inf_{u\in U}\big[b(x,u)^\top\nabla V(x)+f(x,u)\big]-\gamma V(x)=0\quad\text{in }\Omega,\tag{2.1}$$
with boundary condition $V(x)=g(x)$ on $\partial\Omega$. Here the state space $\Omega$ is an open, connected set in $\mathbb{R}^d$ with piecewise smooth boundary, and the control space $U$ is a convex domain in $\mathbb{R}^{d_u}$. We assume that $V(x)\in C(\Omega)$, $f(x,u)\in C(\Omega\times U)$, $b(x,u)\in C(\Omega\times U;\mathbb{R}^d)$, and $\sigma(x)\in C(\Omega;\mathbb{R}^{d\times d_w})$ with $\sigma\sigma^\top(x)$ being uniformly elliptic and bounded, and $\gamma\ge 0$. We use $\nabla$ and $\operatorname{Hess}$ to denote the gradient and Hessian operators.

As a starting point of our approach, we reformulate the above elliptic equation as an optimal control problem. Consider the following stochastic differential equation (SDE)
$$\mathrm{d}X_t=b(X_t,u_t)\,\mathrm{d}t+\sigma(X_t)\,\mathrm{d}W_t\tag{2.2}$$
with initial condition $X_0=x\in\Omega$, where $u_t\in U\subset\mathbb{R}^{d_u}$ is a control field and $W_t$ is a $d_w$-dimensional standard Brownian motion. As we solve the equation in the domain $\Omega$, we define a stopping time
$$\tau=\inf\{t: X_t\notin\Omega\}.\tag{2.3}$$
It is standard that $\tau<\infty$ a.s., see for example [38].

We then consider an optimal control problem to minimize the following cost functional
$$J^u(x)=\mathbb{E}\Big[\int_0^\tau f(X_s,u_s)e^{-\gamma s}\,\mathrm{d}s+e^{-\gamma\tau}g(X_\tau)\;\Big|\;X_0=x\Big].\tag{2.4}$$
In this cost functional, $f$ can be interpreted as the running cost, $g$ is the terminal cost when the SDE hits the boundary $\partial\Omega$, and $\gamma$ is the discount rate. The control $u$ is chosen over the set of stochastic processes that have values in $U$ and are adapted to the filtration generated by $X_t$. Define
$$V(x)=\inf_u J^u(x)\tag{2.5}$$
as the optimal value function (i.e., optimal cost-to-go function). According to standard results in stochastic control theory [67], $V$ satisfies the time-independent HJB equation
$$\inf_u\big\{\mathcal{L}^u V(x)-\gamma V(x)+f(x,u)\big\}=0\tag{2.6}$$
in $\Omega$ with boundary condition $V(x)=g(x)$ on $\partial\Omega$, where
$$\mathcal{L}^u V(x)=\tfrac{1}{2}\operatorname{Tr}\big(\sigma\sigma^\top\operatorname{Hess}(V)\big)(x)+b(x,u)^\top\nabla V(x)$$
is the generator of the controlled SDE (2.2). Note that the HJB equation (2.6) coincides with the original PDE (2.1), and hence we can solve the PDE (2.1) by solving the optimal control problem to obtain the optimal value function. Our approach for solving the optimal control problem is based on the actor-critic framework. In such methods, one solves for both the value function and the control field. The control (i.e., policy in the RL terminology) is known as the actor, while the value function corresponding to the control is known as the critic, since it is used to evaluate the optimality of the control. Accordingly, the actor-critic algorithms consist of two parts: policy evaluation for the critic and policy improvement for the actor. While many approaches have been developed under the actor-critic framework [2, 10, 15, 39, 56, 63, 65], we will focus on simple and perhaps the most popular algorithms: temporal difference (TD) learning for the value function given a policy and policy gradient for improving the control.
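To make the control formulation concrete, the following is a minimal Monte Carlo sketch of the quantities above: it simulates the controlled SDE (2.2) with an Euler–Maruyama step, stops when the trajectory leaves Ω as in (2.3), and averages the discounted cost (2.4) for a fixed feedback control. The drift, diffusion, costs, control, and domain in the sketch are hypothetical placeholders (they are not the test problems of Section 4), and the random horizon is truncated at a finite time for simplicity, which the discount makes a small error.

```python
import numpy as np

# Hypothetical problem data on the unit ball (placeholders, not from the paper):
# dX_t = b(X_t, u_t) dt + sigma(X_t) dW_t, stopped when X_t leaves Omega, cf. (2.2)-(2.3).
d, gamma, dt, T = 5, 1.0, 1e-2, 10.0            # T truncates the (random) horizon
b     = lambda x, u: u                          # drift b(x, u)
sigma = lambda x: np.sqrt(2.0) * np.eye(d)      # constant diffusion matrix
f     = lambda x, u: np.sum(x**2) + np.sum(u**2)   # running cost f(x, u)
g     = lambda x: 1.0                           # boundary cost g(x)
u_fn  = lambda x: -x                            # a fixed feedback control u(x)
inside = lambda x: np.linalg.norm(x) < 1.0      # Omega = unit ball

def cost_one_path(x0, rng):
    """Discounted cost of one Euler-Maruyama path, a sample of the integrand in (2.4)."""
    x, t, cost = x0.copy(), 0.0, 0.0
    while t < T and inside(x):
        u = u_fn(x)
        cost += np.exp(-gamma * t) * f(x, u) * dt
        x = x + b(x, u) * dt + sigma(x) @ (np.sqrt(dt) * rng.standard_normal(d))
        t += dt
    if not inside(x):                           # boundary cost if the path exited;
        cost += np.exp(-gamma * t) * g(x)       # otherwise we simply truncate at T
    return cost

rng = np.random.default_rng(0)
x0 = np.zeros(d)
print("Monte Carlo estimate of J^u(x0):",
      np.mean([cost_one_path(x0, rng) for _ in range(2000)]))
```

Evaluating the cost of a fixed control by such brute-force averaging is exactly the policy evaluation task that the critic automates; the actor-critic algorithms below replace it with learned approximations of the value function and the control.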
Tobetter convey the idea, let us first briefly recall the algorithms for discrete-time Markovdecision process (MDP) with finite state and action space; more details can be foundin e.g., [62]. The MDP starts with some initial state S in the state space S , possiblysampled according to a distribution. At time t ∈ N , given the current state S t , theagent picks an action A t in the action set A according to a policy. We assume thatthe policy is deterministic, i.e., the policy is a map π from the state space S to theaction space A :(2.7) A t = π ( S t ) . After the action A t is chosen, the system state will transit to S t +1 , according to aprobability transition function(2.8) P ( S t +1 = s (cid:48) | S t = s, A t = a ) = p ( s (cid:48) | s, a ) . The action also incurs a cost R t +1 , which we assume to be given by a deterministicfunction of the previous state S t , action A t , and the current state S t +1 :(2.9) R t +1 = f ( S t , A t , S t +1 ) . The goal of the MDP problem is to choose the best policy to minimize the expectedtotal discounted cost(2.10) E S ∼ µ E π (cid:104) ∞ (cid:88) t =1 β t − R t (cid:105) , where β ∈ (0 ,
1) is a discount factor, µ is the distribution of the initial state S , andwe have used E π to indicate the dependence on the transition dynamics on the choiceof the policy π .To solve the MDP problem, it is convenient to introduce the (state) value functionw.r.t. a policy π as the expected cost starting at states s under that policy:(2.11) V π ( s ) = E π (cid:104) ∞ (cid:88) t =1 β t − R t | S = s (cid:105) . CTOR-CRITIC METHOD FOR HIGH DIMENSIONAL STATIC HJB EQUATIONS π , the value functionsatisfies(2.12) V π ( s ) = E π (cid:104) n (cid:88) t =1 β t − R t + β n V π ( S n ) | S = s (cid:105) for any n ≥
1. In order to minimize the total cost (2.10), we search for an optimalpolicy π ∗ that satisfies for all π ,(2.13) V π ∗ ( s ) ≤ V π ( s ) , ∀ s ∈ S . Specifically, by the optimality principle, we have(2.14) V π ∗ ( s ) = min { a t }⊂A E (cid:104) n (cid:88) t =1 β t − R t + β n V π ∗ ( S n ) | S = s, A t = a t , t = 0 , , · · · , n − . Note that while in (2.12) and (2.14) the right-hand side starts at time 0, we can startat any time and run the process for n steps due to stationarity.Let us make a couple of remarks for the setup of the discrete MDP used here.First, in reinforcement learning, usually reward is used instead of cost, and henceone maximizes the total reward instead of minimizing the cost; but of course the twoviewpoints are equivalent up to a sign change. We use cost, which is more in linewith the control literature and also our problem in the continuous setting. Second,the cost R t is not necessarily a deterministic function as in (2.9), but may follow someprobability distribution together with the next state:(2.15) P ( S t +1 = s (cid:48) , R t +1 = r | S t = s, A t = a ) = p ( s (cid:48) , r | s, a ) . Moreover, the policy can also be probabilistic rather than deterministic as assumedin (2.7). We choose the simplified setting for the cost and policy to make it consistentwith our continuous optimal control setting. Finally, we use an MDP without stoppingtime and thus terminal cost for simplicity. The adaptation to our PDE setup will bediscussed below in Sections 2.2.3 and 2.2.4.Temporal difference (TD) learning is a class of algorithms that evaluate a givencontrol (i.e., policy) by updating the value function, combining a Monte Carlo estimateof the running cost over a time period and the dynamic programming principle for thefuture cost-to-go. For a policy π to be evaluated, with a given trajectory { S t , t ≥ } ,we update the value function at each t by(2.16) (cid:98) V π ( S t ) ←− (cid:98) V π ( S t ) + α (cid:32) n (cid:88) k =1 β k − R t + k + β n (cid:98) V π ( S t + n ) − (cid:98) V π ( S t ) (cid:33) , where α is the learning rate and (cid:98) V π on the right-hand side is the current estimate ofthe value function. In Eq. (2.16), we only update the value function at the state S t and the value at other states remain unchanged. In practice, this update of the valuefunction is usually done for multiple trajectories.In the above updating rule, we have used the n -step temporal difference TD πn ( S t ),defined as(2.17) TD πn ( S t ) = n (cid:88) k =1 β k − R t + k + β n (cid:98) V π ( S t + n ) − (cid:98) V π ( S t ) , MO ZHOU, JIEQUN HAN, AND JIANFENG LU which depends on the trajectory of length n + 1 ( t to t + n ) from the starting state S t .TD πn can be understood as an indicator of the inconsistency between current estimateof value function with a sampled value using n -steps of the MDP, since according toEq. (2.12), E π TD πn vanishes if (cid:98) V π agrees with the true value function V π . Hence, TDlearning can be viewed as a stochastic fixed point iteration for the value function.When function approximation is used for the value function, in particular nonlin-ear approximations such as neural networks, an alternative approach, the least squaretemporal difference (LSTD) is often used to overcome potential divergence problemsof the TD learning. Instead of the stochastic fixed point updating formula as (2.16),in LSTD method, the parameters are optimized to minimize the square of the TDerror as a loss function. 
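As a concrete illustration of the tabular update (2.16)–(2.17), the following sketch runs n-step TD on a small toy Markov chain; the chain, costs, and learning rate are hypothetical and chosen only so that the true value function can be computed exactly by solving a linear system. With a constant learning rate the estimate fluctuates around the true values.

```python
import numpy as np

# Toy MDP (hypothetical, for illustration only): 5 states on a ring, a fixed policy
# that stays or moves right with equal probability, per-step cost c(s), discount beta.
# The true value function solves V = c + beta * P V.
rng = np.random.default_rng(0)
nS, beta, alpha, n = 5, 0.9, 0.02, 3            # n-step TD as in (2.16)-(2.17)
c = np.array([0.0, 1.0, 2.0, 1.0, 0.0])         # cost incurred when leaving state s
P = np.zeros((nS, nS))
for s in range(nS):
    P[s, s] = 0.5
    P[s, (s + 1) % nS] = 0.5
V_true = np.linalg.solve(np.eye(nS) - beta * P, c)

V = np.zeros(nS)                                # tabular estimate of V^pi
s = 0
traj_s, traj_r = [s], []                        # rolling window of the trajectory
for t in range(200000):
    s_next = rng.choice(nS, p=P[s])
    traj_r.append(c[s])                         # R_{t+1} = f(S_t, A_t, S_{t+1})
    traj_s.append(s_next)
    if len(traj_r) >= n:                        # enough steps for one n-step TD
        s0 = traj_s[0]
        G = sum(beta**k * traj_r[k] for k in range(n)) + beta**n * V[traj_s[n]]
        V[s0] += alpha * (G - V[s0])            # update (2.16) with TD_n of (2.17)
        traj_s.pop(0); traj_r.pop(0)
    s = s_next

print("true V :", np.round(V_true, 3))
print("TD est :", np.round(V, 3))
```

With a neural network in place of the lookup table, the same TD error instead enters a least squares objective, which is the route taken below.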
More specifically, if the value function is parameterized as V π ( · ; θ V ), we solve for θ V by(2.18) min θ V E S ∼ µ E π (cid:104)(cid:16) n (cid:88) t =1 β t − R t + β n V π ( S n ; θ V ) − V π ( S ; θ V ) (cid:17) | S (cid:105) , where µ is some prescribed measure for the state S . In practice, (2.18) is often solvedusing stochastic gradient descent method. Such method has been proved successfulin e.g., [17, 46]. Policygradient is a class of methods to learn parameterized policies through gradient basedalgorithms. Assume we consider a class of (deterministic) policy parameterized as(2.19) A t ( S t ) = π ( S t ; θ π ) , where θ π denotes a collection of parameters and π ( · ; θ ) is a chosen nonlinear param-etrization.To find an optimal policy, we aim to minimize the objective function (cf. (2.14))(2.20) J ( θ π ) = E S ∼ µ E π ( · ; θ π ) (cid:2) n (cid:88) k =1 β k − R k + β n (cid:98) V π ( S n ) (cid:3) with respect to the collective parameter θ π . Note that (2.20) explicitly takes intoaccount the cost of the first n steps, while using an (approximate) value function (cid:98) V π for the future cost after n steps, coming from e.g., the TD learning algorithm. Theobjective function (2.20) can be optimized using stochastic gradient method. Using astochastic estimate of the gradient (cid:100) ∇ J ≈ ∇ θ π J , so that the parameter is updated as(2.21) θ π ← θ π − α (cid:100) ∇ J, with suitable learning rate α . Note that in principle, one needs to differentiate allterms involved in (2.20) with respect to θ π , however, in practice, in the actor-criticframework, one typically neglects the derivative of (cid:98) V π with respect to π , as it isimpractical to compute since (cid:98) V π is obtained using e.g., the TD learning. Nevertheless,we would still need to differentiate (cid:98) V π ( S n ) with respect to S n , as the state S n isaffected by the choice of the policy, thus ∂ (cid:98) V π ( S n ) ∂θ π · = ∂ (cid:98) V π ∂S n ∂S n ∂θ π , where · = indicates that the (functional) derivative δ (cid:98) V π δπ is neglected. CTOR-CRITIC METHOD FOR HIGH DIMENSIONAL STATIC HJB EQUATIONS deterministic policy gradient algorithm proposedin [61]. One difference is that we use the state value function V ( s ) while [61] usesthe state-action value function ( Q -function) Q ( s, a ). Another difference is that theobjective function used in [61] is based on the stationary distribution of state-actionpair given the policy, while we roll out a trajectory for the cost function (combinedwith using estimated value function for future cost), which is more suitable to anactor-critic framework. Our approach is also easier to generalize to the continuoussetting, which will be discussed in Section 2.2.4. We now introduce how to adapt the above algorithmic ideas to the continuous setting.Given a control function u ( x ) ∈ C (Ω) (which corresponds to π in the discretesetting), the corresponding value function is given by(2.22) V u ( x ) = E u (cid:104)(cid:90) τ f ( X s , u ( X s )) e − γs d s + e − γτ g ( X τ ) | X u = x (cid:105) . Here E u indicates expectation with respect to the trajectory (with a fixed policy u ).This is just the cost functional in (2.4) with a specific control policy. 
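As a small aside, the discrete-time policy gradient step (2.19)–(2.21) can also be made concrete on a toy problem before we develop its continuous analog. In the sketch below, the dynamics, cost, frozen critic V̂(s) = ŵ s², and the finite-difference gradient are all hypothetical simplifications: the method of this paper parameterizes the control by a neural network and obtains the gradient by automatic differentiation. The structure of the update is the same, however: minimize the n-step cost plus the discounted frozen value of the terminal state, ignoring the dependence of V̂ on the policy.

```python
import numpy as np

# Toy continuous-state example (hypothetical): S_{t+1} = S_t + A_t + noise,
# cost R_{t+1} = S_t^2 + A_t^2, deterministic policy A_t = theta * S_t as in (2.19).
# The objective (2.20) uses a frozen critic w_hat * s^2 for the tail cost.
rng = np.random.default_rng(0)
beta, n, n_paths, noise = 0.95, 10, 256, 0.1
w_hat = 2.0                                     # frozen critic coefficient (placeholder)

def J(theta, seeds):
    """Monte Carlo estimate of (2.20) for the policy A = theta * S."""
    total = 0.0
    for seed in seeds:
        r = np.random.default_rng(int(seed))
        s, cost = 1.0, 0.0                      # start from S_0 = 1
        for k in range(n):
            a = theta * s
            cost += beta**k * (s**2 + a**2)
            s = s + a + noise * r.standard_normal()
        cost += beta**n * w_hat * s**2          # tail cost from the frozen critic
        total += cost
    return total / len(seeds)

theta, lr, eps = 0.0, 0.005, 1e-4
for it in range(400):
    seeds = rng.integers(0, 2**31, size=n_paths)            # common random numbers
    grad = (J(theta + eps, seeds) - J(theta - eps, seeds)) / (2 * eps)
    theta -= lr * grad                          # policy improvement step (2.21)
print("learned feedback gain theta:", round(theta, 3))
```

We now return to the continuous setting of (2.22).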
In the continuoussetting, the dynamical programming principle indicates that the value function V u satisfies the PDE (see e.g., [67])(2.23) 12 Tr (cid:0) σσ (cid:62) Hess( V u ) (cid:1) ( x )+ b ( x, u ( x )) (cid:62) ∇ V u ( x )+ f ( x, u ( x )) − γV u ( x ) = 0 , in Ωwith boundary condtion V u ( x ) = g ( x ) on ∂ Ω.To better convey the idea, we first consider a fixed time interval [0 , T ] with
T > e − γt V u ( X t ), we get(2.24) e − γT V u ( X T ) = V u ( X ) + (cid:90) T e − γs (cid:104)
12 Tr (cid:0) σσ (cid:62) Hess( V u ) (cid:1) ( X s )+ b ( X s , u ( X s )) (cid:62) ∇ V u ( X s ) − γV u ( X s ) (cid:105) d s + (cid:90) T e − γs ∇ V u ( X s ) (cid:62) σ ( X s ) d W s . Combined with the PDE (2.23), (2.24) gives(2.25) V u ( X ) = (cid:90) T e − γs f ( X s , u ( X s )) d s − (cid:90) T e − γs ∇ V u ( X s ) (cid:62) σ ( X s ) d W s + e − γT V u ( X T ) . The term (cid:82) T e − γs ∇ V u ( X s ) (cid:62) σ ( X s ) d W s is a martingale w.r.t. T [38]. Therefore,taking the expectation, we arrive at(2.26) V u ( X ) = E u (cid:104)(cid:90) T e − γs f ( X s , u ( X s )) d s + e − γT V u ( X T ) | X (cid:105) . We observe that this is the analog of (2.12) in the continuous setting, where the unittime discount e − γ is the analog of the discount factor β in the discrete setting. Com-pared with the discrete time setting, besides (2.26), we have in addition (2.25) before MO ZHOU, JIEQUN HAN, AND JIANFENG LU taking expectation, thanks to Itˆo’s lemma. Exploiting the two identities, analogousto the discrete case, we define two versions of temporal difference in the continuoussetting as TD u = (cid:90) T e − γs f ( X s , u ( X s )) d s − (cid:90) T e − γs ∇ V ( X s ) (cid:62) σ ( X s ) d W s (2.27) + e − γT V ( X T ) − V ( X );TD u = (cid:90) T e − γs f ( X s , u ( X s )) d s + e − γT V ( X T ) − V ( X ) . (2.28)Note that both TD and TD depend on the trajectory X t , in particular, they shouldbe viewed as random variables, while we suppress such dependence in the notation.From (2.25) and (2.26), if V is the exact value function corresponding to the control u , we have TD u = 0 , P -a.s.(2.29) E u TD u = 0 . (2.30)Note in particular that TD vanishes without taking the expectation for exact valuefunction. Moreover, as the difference between TD and TD is given by a martingaleterm, for any approximate value function, we have E u TD u = E u TD u . Now let us introduce two loss functionals for the critic in the spirit of LSTD: L ( V ) = E X ∼ µ E u (cid:0) TD u (cid:1) (2.31) = E X ∼ µ E u (cid:20)(cid:16)(cid:90) T ∧ τ e − γs f ( X s , u ( X s )) d s − (cid:90) T ∧ τ e − γs ∇ V ( X s ) (cid:62) σ ( X s ) d W s + e − γ ( T ∧ τ ) V ( X T ∧ τ ) − V ( X ) (cid:17) (cid:21) ; L ( V ) = E X ∼ µ E u (cid:0) TD u (cid:1) (2.32) = E X ∼ µ E u (cid:20)(cid:16)(cid:90) T ∧ τ e − γs f ( X s , u ( X s )) d s + e − γ ( T ∧ τ ) V ( X T ∧ τ ) − V ( X ) (cid:17) (cid:21) , where µ is some initial distribution for X and we have also taken into account thestopping time τ when the process hits the domain boundary. Here the two lossesare viewed as functionals of the value function V , the finite-dimensional functionapproximation will be discussed in the next section.Stochastic gradient method is used to minimize the loss function in LSTD to findthe best approximation of value function. Written in terms of functional variations,this amounts to approximating E X ∼ µ E u δ (cid:0) TD u ) δV ≈ δ (cid:0) TD u ) δV (cid:0) X t (cid:1) = 2TD u (cid:0) X t (cid:1) δ TD u δV (cid:0) X t (cid:1) ;(2.33) E X ∼ µ E u δ (cid:0) TD u ) δV ≈ δ (cid:0) TD u ) δV (cid:0) X t (cid:1) = 2TD u (cid:0) X t (cid:1) δ TD u δV (cid:0) X t (cid:1) , (2.34) CTOR-CRITIC METHOD FOR HIGH DIMENSIONAL STATIC HJB EQUATIONS X t , in particular, the variance of the stochastic gradient is 0. In comparison,while the stochastic estimate of (2.34) has expectation 0 for exact value function, foreach trajectory, the right-hand side is not 0. This means that the stochastic gradi-ent estimate (2.34) has larger variance than the estimate (2.33). 
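The difference between the two estimates is easy to see numerically. The sketch below uses a hypothetical one-dimensional problem (not one of the test cases of Section 4) in which the exact value function is known by construction, and evaluates the temporal differences (2.27)–(2.28) along discretized trajectories: TD_1 is close to zero on every single path, up to a small discretization error, while TD_2 is zero only on average.

```python
import numpy as np

# Hypothetical 1-d example: Omega = (-1, 1), b = 0, sigma = sqrt(2), gamma = 1,
# and f(x) = x^2 - 2 chosen so that V(x) = x^2 solves (2.23) with V = g on the boundary.
rng = np.random.default_rng(0)
gamma, sig, dt, T, x0 = 1.0, np.sqrt(2.0), 1e-3, 0.2, 0.0
V  = lambda x: x**2
dV = lambda x: 2.0 * x
f  = lambda x: x**2 - 2.0

def temporal_differences(rng):
    """Return (TD_1, TD_2) of (2.27)-(2.28) along one Euler-Maruyama path."""
    x, t = x0, 0.0
    run_cost, stoch_int = 0.0, 0.0
    while t < T and abs(x) < 1.0:
        xi = rng.standard_normal()
        run_cost  += np.exp(-gamma * t) * f(x) * dt
        stoch_int += np.exp(-gamma * t) * dV(x) * sig * np.sqrt(dt) * xi
        x += sig * np.sqrt(dt) * xi             # drift b = 0 in this example
        t += dt
    tail = np.exp(-gamma * t) * V(x) - V(x0)
    return run_cost - stoch_int + tail, run_cost + tail   # TD_1, TD_2

samples = np.array([temporal_differences(rng) for _ in range(2000)])
print("TD_1: mean %.4f, std %.4f" % (samples[:, 0].mean(), samples[:, 0].std()))
print("TD_2: mean %.4f, std %.4f" % (samples[:, 1].mean(), samples[:, 1].std()))
# With the exact value function, TD_1 is nearly zero on every path (only discretization
# error remains), while TD_2 is zero only in expectation and has an O(1) standard deviation.
```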
Let us remark thatthe vanishing variance property of (2.33) is similar to quantum Monte Carlo [22], forwhich the variance of the local energy estimate also vanishes at the ground state.In the sequel, to distinguish the two loss functions, we call the method basedon L (2.31) the variance reduced least square temporal difference (VR-LSTD), whilethat corresponding to L (2.32) is named the least square temporal difference (LSTD).We will demonstrate in our numerical experiments that VR-LSTD results in betterresults than LSTD. For theactor part, we use policy gradient to improve the policy. According to dynamicalprogramming principle [67], for the optimal value function V , we have(2.35) V ( X ) = inf u E u (cid:104)(cid:90) T f ( X s , u ( X s )) e − γ ( s − t ) d s + e − γT V ( X T ) | X (cid:105) , where u is minimized over the set of admissible controls. In other words, the control u should minimize the functional on the right-hand side. Therefore, we can use thefollowing loss function for actor, for which we also incorporate in the stopping time:(2.36) J ( u ) = E X ∼ µ E u (cid:104)(cid:90) T ∧ τ f ( X s , u ( X s )) e − γs d s + (cid:98) V ( X T ∧ τ ) e − γ ( T ∧ τ ) (cid:105) , where (cid:98) V is the current estimate of the value function (via TD learning in the criticpart). Observe that this loss function is a continuous analog of (2.20).In the numerical algorithm, the control, as a high dimensional function, will beparameterized as a neural network u ( · ; θ u ) where θ u denotes collectively the parame-ters. The parameters are optimized using a stochastic approximation to gradients of J ( u ). Similar to our discussion of policy gradient for the discrete case in Section 2.2.2,when differentiating the loss function (2.36) w.r.t. the parameters of the control θ u ,several terms would contribute to the derivative, including the control u ( · ) itself, theSDE trajectory X · , the stopping time τ , and also the estimated value function (cid:98) V ( · ).Similar to the discrete case, we will drop the functional derivative of (cid:98) V w.r.t. u , i.e.,the derivative δ (cid:98) Vδu ∂u∂θ u , since the dependence of (cid:98) V on u is through the algorithm for thecritic, e.g., the TD learning, which is impractical to track. Furthermore, if ˆ V is theoptimal value function, treating it as a fixed function and optimizing u in (2.36) givesthe optimal policy function. Therefore, we approximate the functional derivative as(2.37) δJδu · = E X ∼ µ E u (cid:34)(cid:90) T ∧ τ δf ( X s , u ( X s )) δu e − γs d s + { τ 3. Numerical algorithm. In this section, we present our numerical algorithmfor solving high dimensional HJB type elliptic PDEs based on the actor-critic frame-work discussed in the previous section. In order to numerically deal with the high di-mensional functions V and u , we use two neural networks to parametrize the valuefunction V ( · ; θ V ) and the control u ( · ; θ u ), the parameters of which are denoted col-lectively by θ V and θ u respectively. We apply the structure of the residual neuralnetwork [31] in pursuit of better optimization performance. 
A neural network φ ( x ; θ )with l + 1 hidden layer is represented by(3.1) φ ( x ; θ ) = F l ◦ σ l ◦ F l − ◦ σ l − · · · ◦ F ◦ σ ◦ F ( x ) , where F i are linear transforms with dimensions depending on the width of hiddenlayers and the dimensions of inputs and outputs, and σ i are element-wise activatefunctions with skip connection: σ i ( x ) = x + ReLU( x ).Moreover, note that the VR-LSTD loss function L (2.31) requires the gradientof value function. Since we are using a neural network parametrization V = V ( · ; θ V ),a direct approach is to use auto-differentiation of V ( x ; θ V ) w.r.t. x to calculate thegradient. We find that a better approach in practice is to use another neural networkto represent ∇ V , which is consistent with the observations in [27, 29]. Thus, for VR-LSTD, the gradient of the value function is represented by a separate neural network G ( x ; θ G ) with collective parameters θ G .To summarize, with respect to the collective parameters, the loss functions forthe critic corresponding to (2.31) and (2.32) are L ( θ V , θ G ) = E X ∼ µ E u (cid:20)(cid:16)(cid:90) T ∧ τ e − γs f ( X s , u ( X s )) d s (3.2) − (cid:90) T ∧ τ e − γs G ( X s ; θ G ) (cid:62) σ ( X s ) d W s + e − γ ( T ∧ τ ) V ( X T ∧ τ ; θ V ) − V ( X ; θ V ) (cid:17) (cid:21) ; L ( θ V ) = E X ∼ µ E u (cid:20)(cid:16)(cid:90) T ∧ τ e − γs f ( X s , u ( X s )) d s + e − γ ( T ∧ τ ) V ( X T ∧ τ ; θ V )(3.3) − V ( X ; θ V ) (cid:17) (cid:21) . We remark that there is no need to add penalty terms in L to ensure the consistencybetween V ( x, θ V ) and G ( x ; θ G ), because if we replace V u ( · ) by V ( · , θ V ) in (2.24) and CTOR-CRITIC METHOD FOR HIGH DIMENSIONAL STATIC HJB EQUATIONS L ( θ V , θ G ) = E X ∼ µ E u (cid:20)(cid:16)(cid:90) T ∧ τ e − γs [( L u V − γV )( X s ; θ V ) + f ( X s , u ( X s ))] d s (3.4) − (cid:90) T ∧ τ e − γs ( ∇ x V ( X s ; θ V ) − G ( X s ; θ G )) (cid:62) σ ( X s ) d W s (cid:17) (cid:21) = E X ∼ µ E u (cid:104)(cid:0)(cid:90) T ∧ τ e − γs [( L u V − γV )( X s ; θ V ) + f ( X s , u ( X s ))] d s (cid:1) (cid:105) + E X ∼ µ E u (cid:104)(cid:90) T ∧ τ e − γs (cid:12)(cid:12) σ (cid:62) ( X s ) ( ∇ x V ( X s ; θ V ) − G ( X s ; θ G )) (cid:12)(cid:12) d s (cid:105) , where L u is the generator of the SDE and we use Itˆo’s isometry in the second step.Note that the first and second terms in (3.4) enforce V and G satisfy equation andconsistency respectively.For neural-network parameterization of V , it is not easy to directly impose theDirichlet boundary condition V = g on ∂ Ω in the parametrization. Thus, instead,we add a penalty term to the loss functions for critic to help enforce the boundarycondition(3.5) η E X ∼ Unif( ∂ Ω) (cid:2) ( V ( X ) − g ( X )) (cid:3) , where η is a penalty hyperparameter and Unif( ∂ Ω) denotes the uniform distributionon ∂ Ω. In implementation,we need to simulate numerically based on a discretization of the diffusion processwith approximating stopping time and exit point. The solution to the PDE problemcrucially depends on the boundary condition, and thus in control formulation, the exittime and position of the SDE at the boundary. Several schemes have been developedin the literature to deal with the stopping time and exit point of the SDEs in relatedscenarios. Perhaps the most natural idea is to stop at the last step of the numericalSDE before exiting the domain, which has been tested in the context of using neuralnetworks for solving PDEs in [40]. The error of such boundary treatment has beenanalyzed in [23]. 
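Before describing the two discretization schemes in detail, here is a minimal sketch of the network parametrization discussed above: a residual architecture of the form (3.1) with the skip-connection activation σ_i(x) = x + ReLU(x), used once for V(·; θ_V) and once for the gradient network G(·; θ_G). The sketch uses fixed random weights in plain NumPy; the widths are placeholders, and the actual networks used in our experiments are trainable TensorFlow models.

```python
import numpy as np

# Minimal sketch (assumption: plain NumPy, fixed random weights) of the residual
# parametrization (3.1): phi(x; theta) = F_l o sigma_l o ... o F_1 o sigma_1 o F_0 (x),
# with the skip-connection activation sigma_i(x) = x + ReLU(x).
rng = np.random.default_rng(0)
d, width, n_hidden = 10, 200, 3

def init_layers(d_in, d_out):
    dims = [d_in] + [width] * n_hidden + [d_out]
    return [(rng.standard_normal((m, k)) / np.sqrt(m), np.zeros(k))
            for m, k in zip(dims[:-1], dims[1:])]

def phi(x, layers):
    """Residual network (3.1): linear maps F_i with sigma(x) = x + ReLU(x) in between."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = h @ W + b                           # linear transform F_i
        if i < len(layers) - 1:                 # skip-connection activation on hidden layers
            h = h + np.maximum(h, 0.0)
    return h

theta_V = init_layers(d, 1)                     # value network V(.; theta_V)
theta_G = init_layers(d, d)                     # gradient network G(.; theta_G) for VR-LSTD
x = rng.standard_normal((4, d))                 # a batch of 4 states
print(phi(x, theta_V).shape, phi(x, theta_G).shape)   # (4, 1) and (4, 10)
```

We now return to the treatment of the exit time and exit position of the discretized trajectories.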
Moreover, several schemes have been proposed to improve the accu-racy around the boundary. In [30], the authors approximate the exit position by theintersection of the domain’s boundary and the line segment between the consecutivetwo steps before and after exiting the domain. It has also been considered to reducestepsize when the numerical trajectory is approaching the boundary [12]. After study-ing and testing several approaches for numerical discretization in our algorithms, wepresent two choices of discretization and give some remarks on the other schemes.Let us start with a na¨ıve approach. We can discretize the SDE (2.2) by Euler-Maruyama scheme with a given partition of interval [0 , T ]: 0 = t < t < · · · < t N = T , where a constant stepsize ∆ t = TN is used, so t n = n ∆ t . The SDE is discretized as(3.6) X = X , X t n +1 = X t n + b ( X t n , u n )∆ t + σ ( X t n ) ξ n √ ∆ t, where u n = u ( X t n ; θ u ) and ξ n ∼ N (0 , I d w ) follows the standard normal distribution.Here, we use X t n to denote the discretized stochastic process, to distinguish from X t ,the continuous process. Given a numerical trajectory X t n , n = 0 , . . . , N , we define(3.7) ¯ n = max (cid:8) n ∈ { , . . . , N } | X t i ∈ Ω , i = 0 , , · · · , n (cid:9) . MO ZHOU, JIEQUN HAN, AND JIANFENG LU Thus, if ¯ n < N , X t ¯ n +1 exits the domain as X t ¯ n +1 (cid:54)∈ Ω; while if ¯ n = N the trajectory X t n remains in the domain for n = 0 , , . . . , N .Perhaps the most direct and intuitive approach for the boundary treatment is toview t = ¯ n ∆ t as the stopping time, even though X t ¯ n is still inside Ω. This schemewill be referred to as the “na¨ıve scheme” in the sequel. The stochastic integrations in(2.31), (2.32), and (2.36) are correspondingly approximated by(3.8) (cid:90) T ∧ τ e − γs f ( X s , u ) d s ≈ ¯ n − (cid:88) n =0 e − γn ∆ t f ( X t n , u n )∆ t ; (cid:90) T ∧ τ e − γs ∇ V ( X s ) (cid:62) σ ( X s ) d W s ≈ ¯ n − (cid:88) n =0 e − γn ∆ t G ( X t n ; θ G ) (cid:62) σ ( X t n ) ξ n √ ∆ t, where ξ n √ ∆ t is the same realization of Brownian increments as in (3.6). We remarkthat this discretization scheme is similar to the one used in [40] for solving degeneratesemilinear elliptic equations, in particular, both algorithms approximate the stoppingtime by ¯ n ∆ t . However, we aim to solve the value function in the whole domain, whilethe method developed in [40] only aims at the value at a specific point; thus the overallframework of the algorithm is quite different.After discretization, the loss functions (2.31), (2.32), and (2.36) are further ap-proximated by Monte Carlo samples: for each iteration, we draw K independentsample trajectories ( K is known as the batch size) by drawing initial point X fromthe distribution µ and independent increments of the Brownian motion. To updatethe parameters of the neural networks, we employ the Adam optimizer [37].To apply the policy gradient method to the loss functional (2.36), we need todifferentiate the discretized functional with respect to the control, similar to the func-tional derivative setting considered above in (2.37). While the first and third terms in(2.37), which involve derivatives of J through its dependence on u and the trajectory,can be easily dealt with on the discretized level using auto-differentiation, the secondand fourth terms in (2.37) become tricky to deal with on the discrete level, since thestopping time is approximated by ¯ n ∆ t , which is discrete so δ ¯ n/δu is not really welldefined. 
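A short sketch of the naive scheme may help fix the notation: it simulates (3.6), locates the exit index n̄ of (3.7), and accumulates the two discretized integrals of (3.8). The coefficients and control below are hypothetical placeholders, a single trajectory is generated at a time (whereas the implementation batches K trajectories), and for simplicity all N steps are simulated even after the exit.

```python
import numpy as np

# Sketch of the naive scheme (3.6)-(3.8) with hypothetical coefficients.
d, dw, gamma, T, N = 5, 5, 1.0, 0.2, 50
dt = T / N
b     = lambda x, u: u
sigma = lambda x: np.sqrt(2.0) * np.eye(d)
f     = lambda x, u: np.sum(x**2) + np.sum(u**2)
u_fn  = lambda x: -x                              # stands in for u(.; theta_u)
G_fn  = lambda x: 2.0 * x                         # stands in for G(.; theta_G)
inside = lambda x: np.linalg.norm(x) < 1.0        # Omega = unit ball

def naive_rollout(x0, rng):
    """Simulate (3.6), find the exit index (3.7), accumulate the sums in (3.8)."""
    xs, xis = [x0.copy()], []
    for n in range(N):
        xi = rng.standard_normal(dw)
        xis.append(xi)
        x = xs[-1]
        xs.append(x + b(x, u_fn(x)) * dt + sigma(x) @ xi * np.sqrt(dt))    # step (3.6)
    n_bar = N                                     # n bar of (3.7): last index such that
    for n in range(N + 1):                        # X_{t_0}, ..., X_{t_n} are all inside Omega
        if not inside(xs[n]):
            n_bar = n - 1
            break
    run_cost = sum(np.exp(-gamma * n * dt) * f(xs[n], u_fn(xs[n])) * dt
                   for n in range(n_bar))                                  # first sum in (3.8)
    stoch_int = sum(np.exp(-gamma * n * dt)
                    * (G_fn(xs[n]) @ (sigma(xs[n]) @ xis[n])) * np.sqrt(dt)
                    for n in range(n_bar))                                 # second sum in (3.8)
    return run_cost, stoch_int, xs[n_bar]         # X_{t_{n bar}} enters the terminal term

rng = np.random.default_rng(0)
print(naive_rollout(np.zeros(d), rng))
```

In this discretized picture the exit index n̄ is a piecewise constant function of the control, which is why a derivative of the stopping time with respect to u is not available.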
In our implementation, such terms are neglected in the policy gradient withrespect to u ; we leave the question of better numerical treatment of such terms tofuture works.The pseudocode for our actor-critic method for solving high-dimensional PDEs issummarized in Algorithm 3.1. It turns out in our numerical experi-ments that while the above na¨ıve scheme is able to get reasonably accurate valuefunctions, the approximation to control results in large errors, especially near theboundary (see Section 4 for more details). To improve the accuracy near the boundary,we adaptively shrink the stepsize when the trajectory approaches the boundary ∂ Ω,instead of using the uniform time stepsize as in the na¨ıve scheme. More specifically,we use the following scheme at the boundary, which is motivated by the integrationscheme used in [12] for Feynman-Kac representation of boundary value problems ofthe Poisson equation.The idea is to reduce the time stepsize adaptively when X t is close to the boundary,and thus to improve the accuracy of the trajectory. We consider the Euler-Maruyamascheme with varying stepsize given by(3.9) X t n +1 = X t n + b ( X t n , u n ) h ( X t n ) + σ ( X t n ) (cid:112) h ( X t n ) ξ n , CTOR-CRITIC METHOD FOR HIGH DIMENSIONAL STATIC HJB EQUATIONS Algorithm 3.1 Neural network based actor-critic solver for semilinear PDEs input : A semilinear PDE (2.1), terminal time T , number of time intervals N , lossweights η , neural network structures, number of iterations, learning rate,batch size K , the choice of temporal difference output: Value function V ( · ; θ V ), its gradient G ( x ; θ G ) if we choose VR-LSTD, andthe control u ( · ; θ u )initialization: θ value ( θ value = ( θ V , θ G ) for VR-LSTD and θ value = θ V for LSTD) and θ u for (cid:96) = 1 to the number of iterations do /* critic steps */ Sample K independent trajectories X kt n , k = 1 , , · · · , K Sample K points onthe boundary ∂ Ω to enforce the boundary condition Estimate the gradient ofthe chosen critic loss with respect to θ value using the K trajectories Updateparameters θ value using the Adam optimizer /* actor steps */ Sample K independent trajectories X kt n , k = 1 , , · · · , K Estimate the gradient ofthe actor loss with respect to θ u using the K trajectories Update parameters θ u using the Adam optimizer end where the stepsize h ( X t n ) depends on the current position of the trajectory. For thechoice of stepsize, we define a subset near the boundary of Ω as(3.10) Γ = (cid:8) x ∈ Ω | dist( x, ∂ Ω) ≤ ς √ d ∆ t (cid:9) , where ς = sup x ∈ Ω (cid:107) σ ( x ) (cid:107) is the supremum of operator norm of σ ( x ). The adaptivechoice of the stepsize is specified as follows:1. When X t n ∈ Ω \ Γ, it would be considered in the “interior” of Ω, as it is veryunlikely that after one time step with stepsize ∆ t that the trajectory will exit thedomain. Thus, we will use the constant stepsize h ( X t n ) = ∆ t ;2. When X t n ∈ Γ, we decrease the stepsize according to the distance of thetrajectory to the boundary, with the minimum stepsize set as ∆ t : h ( X t n ) = max (cid:110) dς dist( X t n , ∂ Ω) , ∆ t (cid:111) . This reduced stepsize, together with the width of Γ defined in (3.10), are decided suchthat the probability that X t n ∈ Ω \ Γ goes out of Ω in the next step is small. Note thatwhen the stepsize is small, the diffusion dominates the drift term, and thus it sufficesto incorporate the diffusion part in the choice. 
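The stepsize rule, summarized in (3.11) below, is straightforward to implement. In the sketch we assume Ω is the ball of radius R, so that dist(x, ∂Ω) = R − |x| as in the examples of Section 4; the shrunk stepsize is taken here as dist(x, ∂Ω)²/(dς²), the value for which a typical diffusion increment ς√(dh) just reaches the boundary, and the floor on the stepsize is a placeholder value.

```python
import numpy as np

# Sketch of the adaptive stepsize rule (3.10)-(3.11), assuming Omega = {|x| < R} so that
# dist(x, boundary) = R - |x|; varsigma = sup |sigma(x)| (operator norm) is a constant here.
d, R, dt = 10, 1.0, 0.2 / 100
varsigma = np.sqrt(2.0)                         # e.g. sigma(x) = sqrt(2) * I
dt_min = dt / 100                               # assumed floor for the stepsize (placeholder)

def h(x):
    """Adaptive stepsize: dt in the interior, shrunk inside the boundary layer Gamma."""
    dist = R - np.linalg.norm(x)                # distance to the boundary of the ball
    if dist > varsigma * np.sqrt(d * dt):       # x outside the layer Gamma of (3.10)
        return dt
    return max(dist**2 / (d * varsigma**2), dt_min)

def adaptive_step(x, u, rng):
    """One Euler-Maruyama step (3.9) with the position-dependent stepsize."""
    hx = h(x)
    xi = rng.standard_normal(d)
    return x + u * hx + varsigma * np.sqrt(hx) * xi   # drift b = u, sigma = varsigma * I here

print(h(np.zeros(d)), h(0.99 * np.ones(d) / np.sqrt(d)))   # interior vs. near-boundary stepsize
```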
Note that we have used the supremumof (cid:107) σ ( x ) (cid:107) for simplicity, one could also choose the criteria more locally if σ variesa lot across the domain. The minimum stepsize is set to balance the accuracy andcomputational cost, as otherwise, the scheme might spend unnecessarily long timeresolving the trajectory near the boundary of the domain.In summary, we choose the stepsize h = h ( X t n ) as(3.11) h ( X t n ) = (cid:40) ∆ t, X t n ∈ Ω \ Γ;max { dς dist( X t n , ∂ Ω) , ∆ t } , X t n ∈ Γ . MO ZHOU, JIEQUN HAN, AND JIANFENG LU The integrals are similarly discretized as in (3.8) with stepsize changed to h ( X t n ):(3.12) (cid:90) T ∧ τ e − γs f ( X s , u ) d s ≈ ¯ n − (cid:88) n =0 e − γnh ( X tn ) f ( X t n , u n ) h ( X t n ); (cid:90) T ∧ τ e − γs ∇ V ( X s ) (cid:62) σ ( X s ) d W s ≈ ¯ n − (cid:88) n =0 e − γnh ( X tn ) G ( X t n ; θ G ) (cid:62) σ ( X t n ) ξ n (cid:112) h ( X t n ) . For the policy gradient, similar to our numerical treatment in the case of na¨ıvescheme, we use auto-differentiation generated by the computational graph in practice,instead of directly numerically approximate the functional derivative defined in (2.37).One reason is that the adaptive stepsize scheme further complicates the dependenceof the trajectory and exit time on the control, compared with the na¨ıve scheme,and hence makes the direct numerical discretization of (2.37) even more difficult. Inpractice, the result from using auto-differentiation for the policy gradient seems to bequite accurate, as will be further discussed in the next section. Remark Remark ∂ Ωmore accurately. In this scheme, the exit position X τ on ∂ Ω is numerically approx-imated by the intersection of ∂ Ω and the line segment between X t ¯ n and X t ¯ n +1 ; thestopping time is correspondingly adjusted. The numerical result from the scheme isstill not accurate enough for the control near the boundary.The linear interpolation above gives an error of order √ ∆ t due to the diffusionterm, we can further improve the accuracy using a method proposed in [23]. Insteadof linear interpolation, we seek for coefficient ρ ∈ (0 , 1] such that X τ , defined by X τ = X t ¯ n + b ( X t ¯ n , u ¯ n ) ρ ∆ t + σ ( X t ¯ n ) (cid:112) ρ ∆ t ξ ¯ n , is on ∂ Ω. In practice, we observe that numerically solving the coefficient ρ makes thetraining unstable. 4. Numerical examples. In this section, we present the numerical results forthe proposed method. We test on several examples: the linear quadratic regulatorproblem, the stochastic Van der Pol oscillator problem, and the diffusive Eikonalequation. To test the performance of our algorithm, we do not assume knowledgeof the true solution or the explicit formula for the control given the value function.The considered dimensions in all three examples are as large as 20. The algorithm isimplemented in Python by using the machine learning library TensorFlow 2.0 [1]. Inall the three examples, the weight parameter η associated with the boundary condition(cf. (3.5)) is set to 1 and the terminal time is T = 0 . 2. The numbers of time intervals CTOR-CRITIC METHOD FOR HIGH DIMENSIONAL STATIC HJB EQUATIONS N = 50 for problems in 4 d and 5 d , and N = 100 in 10 d and 20 d . As for thearchitecture of the neural networks, the width of the hidden layers is set to 200 in allproblems while the numbers of hidden layers are 2 for problems in 4 d and 5 d , and3 in 10 d and 20 d . During the training, we use piecewise constant learning rates of1e − 3, 1e − 4, and 1e − − d , 5 d , and 10 d , and 30000in 20 d . 
The numbers of steps with learning rate 1e − − K = 1024 for problems in 4 d and 5 d , and K = 2048 in 10 d and 20 d .During the training, we sample a validation set { X k } Kk =1 uniformly in Ω, inde-pendent of the training, to evaluate the errors of value function and the control. Therelative L errors are computed by(4.1) err V = K (cid:88) k =1 ( V ( X k ) − V ( X k ; θ V )) / K (cid:88) k =1 V ( X k ) and(4.2) err u = K (cid:88) k =1 | u ( X k ) − u ( X k ; θ u ) | / K (cid:88) k =1 | u ( X k ) | , where V ( · ) and u ( · ) are the true value and control functions respectively (we willchoose test examples such that these true solutions are known). In addition to theerrors above, we also visualize the density of the true value function and comparethat with its neural network approximation, considering the difficulty of visualizingfunctions in high dimensions directly. Here, the density of a function V is defined asthe probability density function of V ( X ) where X is uniformly distributed in Ω. Inour numerical experiments, the density is estimated by Monte Carlo sampling.Our numerical results indicate that in all three examples, the value functionsare approximated accurately, and the associated densities match well with that ofthe true solution. Furthermore, the numerical results show that, for the critic, thevariance reduced temporal difference method VR-LSTD performs better than LSTD,as expected. The adaptive stepsize scheme also significantly improves the accuracy,in particular for the control. The details can be found in the following subsections. In this subsection we consider the PDEarising from the linear quadratic regulator problem, given by(4.3) ∆ V ( x ) + inf u ∈ R d (cid:0) βu (cid:62) ∇ V ( x ) + p | x | + q | u | − kd (cid:1) − γV ( x ) = 0 , in B R ⊂ R d with boundary condition V ( x ) = kR on ∂B R , where B R = { x ∈ R d : | x | < R } . Here p , q , β , k are positive constants such that(4.4) k = (cid:112) q γ + 4 pqβ − γq β . This is the HJB equation corresponding to the controlled stochastic process(4.5) d X t = βu d t + √ W t with cost functional(4.6) J u ( x ) = E (cid:104)(cid:90) τ ( p | X s | + q | u ( X s ) | − kd ) e − γs d s + e − γτ kR (cid:105) , MO ZHOU, JIEQUN HAN, AND JIANFENG LU . . . . . | x | . . . . . | u ( x ) | true controlcontrol by the na¨ıve scheme . . . . . | x | . . . . . | u ( x ) | true controlcontrol by the adaptive scheme Fig. 1: Left: the true optimal control and approximated control by the na¨ıve scheme;right: the true optimal control and approximated control by the adaptive stepsizescheme. x -axis: the norm of x ; y -axis: the norm of control u .where τ is the exit time of the domain B R . The PDE has the exact solution as aquadratic function: V ( x ) = k | x | , and the optimal control is also explicitly given as u ∗ ( x ) = − β q ∇ V ( x ) = − kβq x .We choose the model parameters p = q = R = β = γ = 1 and k = ( √ − / d are shown in table 1. The results of VR-LSTD have smallererrors due to the smaller asymptotic variance, as we discussed above. Moreover, theadaptive stepsize scheme is able to compute a more accurate control function, com-pared with the na¨ıve scheme. One possible reason is that the adaptive stepsize schemesamples more points X t n near the boundary, which helps to improve the accuracy ofthe control function near the boundary. To further illustrate the idea, let us com-pare the results for the na¨ıve scheme and the adaptive stepsize scheme in 5 d , bothwith critic optimized by VR-LSTD. 
The plot of the norm of the control | u ( x ) | w.r.t.the norm of the variable | x | is shown in Figure 1. The error of the control for thena¨ıve scheme is significantly larger near the boundary. The adaptive stepsize schemeachieves a uniform accuracy of the control in the whole domain.Therefore, for the rest of the numerical experiments, we will stick to the adaptivestepsize scheme and VR-LSTD loss function for the critic. Figure 2 shows the densityand error curves for the LQR problem when d = 5 , , 20. The sharp drop of theerrors at step 20000 and 30000 is due to the reduced learning rates at those steps.The final errors of the value functions and controls are 1 . − . − d ;1 . − . − d ; 1 . − . − d . The Van der Pol oscillator is a pop-ular example in the study of dynamical systems because of its chaotic behavior. Thestochastic Van der Pol oscillator has been studied in [66], in which some internal orexternal noise is considered. In this subsection, we consider the generalized stochasticVan der Pol oscillator in high dimensional cases and solve the PDE(4.7) ∆ V ( x ) + inf u ∈ R d/ (cid:2) b ( x, u ) (cid:62) ∇ V ( x ) + f ( x, u ) (cid:3) − γV ( x ) = 0 , in B R ⊂ R d CTOR-CRITIC METHOD FOR HIGH DIMENSIONAL STATIC HJB EQUATIONS . . V ( x ) D e n s i t y TrueNN Solution . 25 0 . V ( x ) D e n s i t y TrueNN Solution . . V ( x ) D e n s i t y TrueNN Solution Iteration E rr o r valuecontrol Iteration E rr o r valuecontrol Iteration E rr o r valuecontrol Fig. 2: Top: density of V for LQR problem with d = 5 (left), d = 10 (middle), and d = 20 (right). Bottom: associated error curves in the training process with d = 5(left), d = 10 (middle), and d = 20 (right).discretization TD variant error of value function error of controladaptive stepsize VR-LSTD . − . − adaptive stepsize LSTD 1 . − . − . − . − . − . − d LQR.where d = 2 n is even. The boundary condition is given by ( with convention x = x n and x n +1 = x n +1 )(4.8) g ( x ) = a n (cid:88) i =1 ( x i ) − (cid:15) ( n (cid:88) i =1 x i − x i + n (cid:88) i = n +1 x i x i +1 )Here a and (cid:15) are positive constants. The drift field is given by(4.9) b i ( x, u ) = (cid:40) x i + n (1 ≤ i ≤ n );(1 − x i − n ) x i − x i − n + u i − n ( n + 1 ≤ i ≤ n ) . MO ZHOU, JIEQUN HAN, AND JIANFENG LU We choose the running cost as(4.10) f ( x, u ) = q | u | + γ [ n (cid:88) i =1 ( ax i − (cid:15)x i x i − ) + n (cid:88) i = n +1 ( ax i − (cid:15)x i x i +1 )]+ 14 q [(2 ax n +1 − (cid:15)x n − (cid:15)x n +2 ) + n (cid:88) i = n +2 (2 ax i − (cid:15)x i − − (cid:15)x i +1 ) ] − na − a n (cid:88) i =1 x n + i x i + (cid:15) n (cid:88) i =1 x n + i x i − + (cid:15) n − (cid:88) i =1 x n + i x i +1 + (cid:15)x n x − ( x n +1 − x − x x n +1 )(2 ax n +1 − (cid:15)x n − (cid:15)x n +2 ) − n (cid:88) i =2 ( x i + n − x i − x i x i + n )(2 ax i + n − (cid:15)x i + n − − (cid:15)x i + n +1 ) . so that the true value function has an explicit formula:(4.11) V ( x ) = a n (cid:88) i =1 x i − (cid:15) (cid:16) n (cid:88) i =1 x i − x i + n (cid:88) i = n +1 x i x i +1 (cid:17) . The corresponding optimal control is given by u ∗ ( x ) = − q ∂ n +1 V ( x ) = 2 ax n +1 − (cid:15)x n − (cid:15)x n +2 and u ∗ i ( x ) = − q ∂ i + n V ( x ) = 2 ax i + n − (cid:15)x i + n − − (cid:15)x i + n +1 for i =2 , , · · · , n .The PDE can be reformulated as a stochastic control problem with the controlledSDE given by(4.12) d X t = b ( X t , u ) d t + √ W t . with objective function(4.13) J u ( x ) = E (cid:104)(cid:90) τ f ( X s , u ) e − γs d s + e − γτ g ( X τ ) (cid:105) . 
In the numerical experiments, we take a = q = R = γ = 1 and (cid:15) = 0 . 1. Figure 3shows the density and error curves when d = 4 , , 20. The algorithm learns reason-ably nice shapes of the value functions. The final errors of the value functions andcontrols are 1 . − . − d ; 1 . − . − d ; 1 . − . − d . The Eikonal equation corresponds to theshortest-path problems with a given metric. In our experiments, we add a smalldiffusion term to regularize the equation (otherwise, the solution has kinks, whichcreates difficulty for the neural networks to approximate well in high dimensions).The diffusive Eikonal equation is given by(4.14) (cid:15) ∆ V ( x ) + inf u ∈ B (cid:0) c ( x ) u (cid:62) ∇ V ( x ) (cid:1) + 1 = 0 in B R V ( x ) = a − a on ∂B R where(4.15) c ( x ) = 3( d + 1) a da (2 a − a | x | ) > CTOR-CRITIC METHOD FOR HIGH DIMENSIONAL STATIC HJB EQUATIONS V ( x ) . . . . D e n s i t y TrueNN Solution . . V ( x ) D e n s i t y TrueNN Solution . 50 0 . 75 1 . V ( x ) D e n s i t y TrueNN Solution Iteration E rr o r valuecontrol Iteration E rr o r valuecontrol Iteration E rr o r valuecontrol Fig. 3: Top: density of V for Van der Pol problem with d = 4 (left), d = 10 (middle),and d = 20 (right). Bottom: associated error curves in the training process with d = 4(left), d = 10 (middle), and d = 20 (right).is a real valued function. Here a and a are positive constants such that 2 a − a R > (cid:15) = 1 / (2 da ). We choose the form of c so that the true solution of the PDE isexplicitly given by(4.16) V ( x ) = a | x | − a | x | and the optimal control is u ∗ ( x ) = x/ | x | . In the numerical test, we take a = 1 . a = 0 . 2, and R = 1.Unlike the previous two examples, the constraint on the control in this exampleposes a new challenge to the numerical algorithm. In order to ensure that the control u is in the unit ball, we construct a specific structure of the neural network for thecontrol. Instead of outputting the control directly, the neural network gives a d + 1dimensional vector ( u len , u dir ) ∈ R d +1 . The control is represented by(4.17) u = u dir δ + ReLU( u len ) + | u dir | , where ReLU( x ) = max(0 , x ) and δ = 10 − . This δ is to ensure that the denominatorin (4.17) is not 0 to prevent numerical singularity. Figure 4 shows the density anderror curves for the Eikonal equation when d = 5 , , 20. We also tried the straight-forward parametrizatio of the control function as before, with an additional penaltyterm η (cid:48) E X ∼ Unif(Ω) [ReLU( | u ( X ) | − . − . − d ; 1 . − . − d ; 1 . − . − d . 5. Conclusion and future directions. In this paper, we propose and studynumerical methods for high dimensional static HJB equations based on neural net-work parametrization and actor-critic framework. There are several promising direc-tions for future research. First, the scalability of the methods shall be further tested0 MO ZHOU, JIEQUN HAN, AND JIANFENG LU − . − . . V ( x ) D e n s i t y TrueNN Solution − . − . V ( x ) D e n s i t y TrueNN Solution − . − . V ( x ) . . . . D e n s i t y TrueNN Solution Iteration E rr o r valuecontrol Iteration E rr o r valuecontrol Iteration E rr o r valuecontrol Fig. 4: Top: density of V for Eikonal equation with d = 5 (left), d = 10 (middle), and d = 20 (right). Bottom: associated error curves in the training process with d = 5(left), d = 10 (middle), and d = 20 (right).by problems of higher dimensions. 
Second, it would be interesting to extend ourmethods to quasilinear type equations, for which the diffusion term σ in the HJBequation might depend on u and ∇ u . Third, one might explore better numericaltreatment of discretization of the functional derivative (2.37), rather than relying onauto-differentiation. Finally, as an outstanding challenge in the field of deep learn-ing, theoretical analysis for convergence and error analysis of the proposed numericalmethods would be of great interest. REFERENCES[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat,G. Irving, M. Isard, et al. , Tensorflow: A system for large-scale machine learning , in12th USENIX symposium on operating systems design and implementation (OSDI 16),2016, pp. 265–283.[2] M. S. Abdulla and S. Bhatnagar , Parametrized actor-critic algorithms for finite-horizonMDPs , in 2007 American Control Conference, IEEE, 2007, pp. 534–539.[3] G. Barles and E. R. Jakobsen , On the convergence rate of approximation schemes forHamilton–Jacobi–Bellman equations , ESAIM: Mathematical Modelling and NumericalAnalysis, 36 (2002), pp. 33–54.[4] R. W. Beard, G. N. Saridis, and J. T. Wen , Galerkin approximations of the generalizedHamilton–Jacobi–Bellman equation , Automatica, 33 (1997), pp. 2159–2177.[5] R. W. Beard, G. N. Saridis, and J. T. Wen , Approximate solutions to the time-invariantHamilton–Jacobi–Bellman equation , Journal of Optimization theory and Applications, 96(1998), pp. 589–626.[6] C. Beck, M. Hutzenthaler, A. Jentzen, and B. Kuckuck , An overview on deeplearning-based approximation methods for partial differential equations , arXiv preprintarXiv:2012.12348, (2020).[7] S. Becker, P. Cheridito, and A. Jentzen , Deep optimal stopping , Journal of Machine Learn-ing Research, 20 (2019), p. 74.[8] R. Bellman , Dynamic programming , Science, 153 (1966), pp. 34–37.[9] S. Bhatnagar and M. S. Abdulla , A reinforcement learning based algorithm for finite horizonMarkov decision processes , in Proceedings of the 45th IEEE Conference on Decision andCTOR-CRITIC METHOD FOR HIGH DIMENSIONAL STATIC HJB EQUATIONS Control, IEEE, 2006, pp. 5519–5524.[10] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee , Natural actor-critic algorithms ,Automatica, 45 (2009), pp. 2471–2482.[11] J. A. Boyan , Least-squares temporal difference learning , in ICML, Citeseer, 1999, pp. 49–56.[12] F. Buchmann and W. Petersen , Solving Dirichlet problems numerically using the Feynman–Kac representation , BIT Numerical Mathematics, 43 (2003), pp. 519–540.[13] Q. Chan-Wai-Nam, J. Mikael, and X. Warin , Machine learning for semi linear PDEs , Jour-nal of Scientific Computing, 79 (2019), pp. 1667–1712.[14] M. G. Crandall, H. Ishii, and P.-L. Lions , User’s guide to viscosity solutions of secondorder partial differential equations , Bulletin of the American mathematical society, 27(1992), pp. 1–67.[15] T. Degris, M. White, and R. Sutton , Off-policy actor-critic , in International Conference onMachine Learning, 2012.[16] S. Dolgov, D. Kalise, and K. Kunisch , Tensor decompositions for high-dimensionalHamilton–Jacobi–Bellman equations , arXiv preprint arXiv:1908.01533, (2019).[17] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel , Benchmarking deepreinforcement learning for continuous control , in International Conference on MachineLearning, PMLR, 2016, pp. 1329–1338.[18] W. E and J. Han , Deep learning approximation for stochastic control problems , arXiv preprintarXiv:1611.07422, (2016).[19] W. E, J. Han, and A. 
Jentzen , Deep learning-based numerical methods for high-dimensionalparabolic partial differential equations and backward stochastic differential equations , Com-munications in Mathematics and Statistics, 5 (2017), pp. 349–380.[20] W. E, J. Han, and A. Jentzen , Algorithms for solving high dimensional PDEs: From non-linear monte carlo to machine learning , arXiv preprint arXiv:2008.13333, (2020).[21] P. A. Forsyth and G. Labahn , Numerical methods for controlled Hamilton–Jacobi–BellmanPDEs in finance , Journal of Computational Finance, 11 (2007), p. 1.[22] W. Foulkes, L. Mitas, R. Needs, and G. Rajagopal , Quantum Monte Carlo simulations ofsolids , Reviews of Modern Physics, 73 (2001), p. 33.[23] E. Gobet , Weak approximation of killed diffusion using euler schemes , Stochastic processesand their applications, 87 (2000), pp. 167–197.[24] J. Han and R. Hu , Deep fictitious play for finding markovian Nash equilibrium in multi-agentgames , in Mathematical and Scientific Machine Learning, PMLR, 2020, pp. 221–245.[25] J. Han and R. Hu , Recurrent neural networks for stochastic control problems with delay , arXivpreprint arXiv:2101.01385, (2021).[26] J. Han, R. Hu, and J. Long , Convergence of deep fictitious play for stochastic differentialgames , arXiv preprint arXiv:2008.05519, (2020).[27] J. Han, A. Jentzen, and W. E , Solving high-dimensional partial differential equations usingdeep learning , Proceedings of the National Academy of Sciences, 115 (2018), pp. 8505–8510.[28] J. Han and J. Long , Convergence of the deep bsde method for coupled FBSDEs , Probability,Uncertainty and Quantitative Risk, 5 (2020), pp. 1–33.[29] J. Han, J. Lu, and M. Zhou , Solving high-dimensional eigenvalue problems using deep neuralnetworks: A diffusion monte carlo like approach , Journal of Computational Physics, 423(2020), p. 109792.[30] J. Han, M. Nica, and A. R. Stinchcombe , A derivative-free method for solving elliptic partialdifferential equations with deep neural networks , arXiv preprint arXiv:2001.06145, (2020).[31] K. He, X. Zhang, S. Ren, and J. Sun , Deep residual learning for image recognition , inProceedings of the IEEE conference on computer vision and pattern recognition, 2016,pp. 770–778.[32] P. Henry-Labordere , Deep primal-dual algorithm for BSDEs: Applications of machine learn-ing to CVA and IM , Available at SSRN 3071506, (2017).[33] S. Ji, S. Peng, Y. Peng, and X. Zhang , Deep learning method for solving stochastic opti-mal control problem via stochastic maximum principle , arXiv preprint arXiv:2007.02227,(2020).[34] S. Ji, S. Peng, Y. Peng, and X. Zhang , Three algorithms for solving high-dimensional fullycoupled fbsdes through deep learning , IEEE Intelligent Systems, 35 (2020), pp. 71–84.[35] D. Kalise and K. Kunisch , Polynomial approximation of high-dimensional Hamilton–Jacobi–Bellman equations and applications to feedback control of semilinear parabolic PDEs , SIAMJournal on Scientific Computing, 40 (2018), pp. A629–A652.[36] W. Kang and L. C. Wilcox , Mitigating the curse of dimensionality: sparse grid characteris-tics method for optimal feedback control and HJB equations , Computational Optimizationand Applications, 68 (2017), pp. 289–315. MO ZHOU, JIEQUN HAN, AND JIANFENG LU[37] D. Kingma and J. Ba , Adam: a method for stochastic optimization , in Proceedings of theInternational Conference on Learning Representations (ICLR), 2015.[38] F. C. Klebaner , Introduction to stochastic calculus with applications , World Scientific Pub-lishing Company, 2005.[39] V. R. Konda and J. N. 
Tsitsiklis , Actor-critic algorithms , in Advances in Neural InformationProcessing Systems, Citeseer, 2000, pp. 1008–1014.[40] S. Kremsner, A. Steinicke, and M. Sz¨olgyenyi , A deep neural network algorithm for semi-linear elliptic PDEs with applications in insurance mathematics , Risks, 8 (2020), p. 136.[41] K. Kunisch, S. Volkwein, and L. Xie , HJB-POD-based feedback design for the optimal controlof evolution problems , SIAM Journal on Applied Dynamical Systems, 3 (2004), pp. 701–722.[42] J. Lygeros , On reachability and minimum cost optimal control , Automatica, 40 (2004),pp. 917–927.[43] H. R. Maei, C. Szepesv´ari, S. Bhatnagar, and R. S. Sutton , Toward off-policy learningcontrol with function approximation , in Proceedings of the 27th International Conferenceon Machine Learning (ICML-10), 2010, pp. 719–726.[44] I. M. Mitchell, A. M. Bayen, and C. J. Tomlin , A time-dependent Hamilton–Jacobi formu-lation of reachable sets for continuous dynamic games , IEEE Transactions on automaticcontrol, 50 (2005), pp. 947–957.[45] I. M. Mitchell and C. J. Tomlin , Overapproximating reachable sets by Hamilton–Jacobiprojections , Journal of Scientific Computing, 19 (2003), pp. 323–346.[46] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra,and M. Riedmiller , Playing atari with deep reinforcement learning , arXiv preprintarXiv:1312.5602, (2013).[47] T. Nakamura-Zimmerer, Q. Gong, and W. Kang , Adaptive deep learning for high-dimensional Hamilton–Jacobi-Bellman equations , arXiv preprint arXiv:1907.05317, (2019).[48] N. N¨usken and L. Richter , Solving high-dimensional Hamilton–Jacobi–Bellman PDEs usingneural networks: perspectives from the theory of controlled diffusions and measures onpath space , arXiv preprint arXiv:2005.05409, (2020).[49] S. Osher and J. A. Sethian , Fronts propagating with curvature-dependent speed: Algorithmsbased on Hamilton–Jacobi formulations , Journal of computational physics, 79 (1988),pp. 12–49.[50] M. Oster, L. Sallandt, and R. Schneider , Approximating the stationary Hamilton–Jacobi–Bellman equation by hierarchical tensor products , arXiv preprint arXiv:1911.00279, (2019).[51] ´E. Pardoux , Backward stochastic differential equations and viscosity solutions of systems ofsemilinear parabolic and elliptic PDEs of second order , in Stochastic Analysis and RelatedTopics VI, Springer, 1998, pp. 79–127.[52] E. Pardoux and S. Peng , Adapted solution of a backward stochastic differential equation ,Systems & Control Letters, 14 (1990), pp. 55–61.[53] E. Pardoux and S. Peng , Backward stochastic differential equations and quasilinear para-bolic partial differential equations , in Stochastic partial differential equations and theirapplications, Springer, 1992, pp. 200–217.[54] S. Peng , Probabilistic interpretation for systems of quasilinear parabolic partial differentialequations , Stochastics and Stochastics Reports, 37 (1991), pp. 61–74.[55] M. A. Pereira, Z. Wang, I. Exarchos, and E. A. Theodorou , Learning deep stochasticoptimal control policies using forward-backward SDEs , in Robotics: science and systems,2019.[56] J. Peters and S. Schaal , Natural actor-critic , Neurocomputing, 71 (2008), pp. 1180–1190.[57] H. Pham, X. Warin, and M. Germain , Neural networks-based backward scheme for fullynonlinear PDEs , arXiv preprint arXiv:1908.00412, (2019).[58] A. V. Rao , A survey of numerical methods for optimal control , Advances in the AstronauticalSciences, 135 (2009), pp. 497–528.[59] S. Richardson and S. 
Wang , Numerical solution of Hamilton–Jacobi–Bellman equations byan exponentially fitted finite volume method , Optimization, 55 (2006), pp. 121–140.[60] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrit-twieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot , Mastering the gameof Go with deep neural networks and tree search , Nature, 529 (2016), pp. 484–489.[61] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller , Deterministicpolicy gradient algorithms , in International Conference on Machine Learning, PMLR, 2014,pp. 387–395.[62] R. S. Sutton and A. G. Barto , Reinforcement learning: An introduction , MIT press, 2018.[63] K. G. Vamvoudakis and F. L. Lewis , Online actor-critic algorithm to solve the continuous- CTOR-CRITIC METHOD FOR HIGH DIMENSIONAL STATIC HJB EQUATIONS time infinite horizon optimal control problem , Automatica, 46 (2010), pp. 878–888.[64] S. Wang, L. S. Jennings, and K. L. Teo , Numerical solution of Hamilton–Jacobi–Bellmanequations by an upwind finite volume method , Journal of Global Optimization, 27 (2003),pp. 177–192.[65] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Fre-itas , Sample efficient actor-critic with experience replay , arXiv preprint arXiv:1611.01224,(2016).[66] Y. Xu, R. Gu, H. Zhang, W. Xu, and J. Duan , Stochastic bifurcations in a bistable Duffing–Van der Pol oscillator with colored noise , Physical Review E, 83 (2011), p. 056215.[67]