Solving stochastic optimal control problem via stochastic maximum principle with deep learning method
DDeep learning method for solving stochastic optimal controlproblem via stochastic maximum principle
Shaolin Ji , Shige Peng , Ying Peng , and Xichuan Zhang Shandong University-Zhongtai Securities Institute for Financial Studies,Shandong University, 250100, China School of Mathematics, Shandong University, 250100, ChinaSeptember 24, 2020
Abstract
In this paper, we aim to solve the stochastic optimal control problem via deep learning.Through the stochastic maximum principle and its corresponding Hamiltonian system, wepropose a framework in which the original control problem is reformulated as a new one. Thisnew stochastic optimal control problem has a quadratic loss function at the terminal timewhich provides an easier way to build a neural network structure. But the cost is that wemust deal with an additional maximum condition. Some numerical examples such as the linearquadratic (LQ) stochastic optimal control problem and the calculation of G -expectation havebeen studied, and the results demonstrates rather optimistic performance, especially for highdimensional cases. Keywords stochastic control, deep neural networks, stochastic maximum principle, Hamil-tonian system, Legendre transformation
It is well known that Pontryagin’s maximum principle [1] and Bellman’s dynamic programmingprinciple [2] are two of the most important tools in solving stochastic optimal control problems.Since these two principles are proposed, the stochastic control theory has been widely developedand extended to a variety of complicated situations in sciences and technologies.There are many numerical methods for solving stochastic optimal control problems, such as theMarkov chain approximation method [3, 4] which approximate the original controlled process byan appropriate controlled Markov chain on a finite state space, the finite-difference approximations[5, 6, 7] and the probabilistic numerical methods based on dynamic programming [8, 9]. However,few of these methods can deal with high-dimensional problems due to the “curse of dimensionality”.In other words, the computational complexity grows exponentially when the dimension increases.In recent years, the deep learning method has been developed rapidly and achieved successes insolving high-dimensional problems of many areas [10], such as computer vision, natural languageprocessing, gaming, etc. This poses a possible way to solve the “curse of dimensionality” althoughthe reason why deep-learning is so effective has not been proven completely.Recently, the deep learning method demonstrated remarkable performance in solving the stochas-tic optimal control problems and the backward stochastic differential equations (BSDEs for short),especially for high dimensional cases [11, 12, 13, 14, 15]. The main idea is to treat the controlas the parameters in deep neural networks (DNNs) and to compute the optimal parameters withstochastic gradient descent methods (SGD). Based on this idea, some researchers extended theneural network architectures to solve the stochastic optimal control problems. For example, Phamet al [16, 17] proposed deep learning algorithms from the view of dynamic programming for solvingthe stochastic control problems. Pereira et al [18] proposed two architectures consisting feed-forward and recurrent neural network to calculate a specific non-linear stochastic control problemthrough the Hamilton-Jacobi-Bellman (HJB) equation.1 a r X i v : . [ m a t h . O C ] S e p n this paper, we aim to compute the following stochastic optimal control problem introducedin [19, 20], inf u ( · ) ∈U ad [0 ,T ] E (cid:40)(cid:90) T f ( t, x t , u t )d t + h ( x T ) (cid:41) , (1.1)s.t. x t = x + (cid:90) t b ( t, x s , u s )d s + (cid:90) t σ ( t, x s , u s )d W s , by deep learning method from the view of the stochastic maximum principle (SMP for short).The SMP states that any optimal control along with the optimal state trajectory must be thesolutions of the Hamiltonian system plus a maximum condition of a function H ( t, x, u, p, q ) calledthe Hamiltonian. In our context, the (extended) Hamiltonian system is characterized by thefollowing FBSDE with a maximum condition: d x ∗ t = b ( t, x ∗ t , u ∗ t )d t + σ ( t, x ∗ t , u ∗ t )d W t , d p ∗ t = − H x ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t )d t + q ∗ t d W t ,x ∗ (0) = x , p ∗ T = − h x ( x ∗ T ) ,H ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t ) = max u ∈ U H ( t, x ∗ t , u, p ∗ t , q ∗ t ) , (1.2)which has only first-order adjoint equations and correspond to the problems with convex controldomain. Our framework is also applicable to more complex problems such as that with non-convex control domain where the Hamiltonian system has second-order adjoint equations, and thecases where the state equation is described by a fully coupled FBSDE. More details are shown inAppendix D.We first reformulate (1.2) as a new optimal control problem, inf ˜ p , { ˜ q t } ≤ t ≤ T E (cid:104) | − h x (˜ x T ) − ˜ p T | (cid:105) (1.3)s.t. ˜ x t = x + (cid:90) t b ( t, ˜ x s , ˜ u s )d s + (cid:90) t σ ( t, ˜ x s , ˜ u s )d W s , ˜ p t = ˜ p − (cid:90) t H x ( s, ˜ x s , ˜ u s , ˜ p s , ˜ q s )d s + (cid:90) t ˜ q s d W s , ˜ u t = arg max u ∈ U H ( t, ˜ x t , u, ˜ p t , ˜ q t ) , where the process { ˜ q t } ≤ t ≤ T and initial state ˜ p are regarded as controls. Comparing with (1.1),the new control problem (1.3) has a simpler quadratic loss function at time T which provides amore intuitive way to decide whether the state-control pair (˜ x ( · ) , ˜ u ( · )) is an optimal pair, that iswhether E (cid:104) |− h x (˜ x T ) − ˜ p T | (cid:105) equals to 0. And the cost of doing this transformation is that we mustdeal with an extra term, the maximum condition. Besides, the cost functional in (1.3) is similar tothat of the general neural network, therefore it is easier to build a neural network structure withsuch a loss function.In order to solve the new control problem (1.3), we propose four algorithms via neural network.In Algorithm 1, a single DNN is constructed to simulate the control ˜ q t and the time t is regardedas a part of inputs of the neural network. The adopted single DNN greatly reduces the numberof network parameters and the training time. We obtain the approximate estimation of ˜ q t bytraining such a neural network, and then get the approximate solution (˜ x t , ˜ p t , ˜ q t , ˜ u t ) ≤ t ≤ T of (1.3).For calculating the maximum condition in (1.3), we employ L-BFGS (a quasi-Newton method) toapproximate the optimal control ˜ u in Algorithm 1. In order to punish the distance of ˜ u betweentwo adjacent iterations, an improved algorithm (Algorithm 2) is proposed. The loss functionincludes three penalty terms to measure the closeness of ˜ u between two iterations. It is worthto point out that Algorithm 2 will degenerate to a successive approximation method for solvingdeterministic systems which has been used for training deep neural networks [21]. For a specialkind of stochastic optimal control problem where all the coefficients are C in u and the optimalcontrol ˜ u falls inside the control domain, we propose a third algorithm (Algorithm 3). The aim2f this algorithm is to improve the computational efficiency of the approximate solution for theoptimal control ˜ u in the maximum condition. We first transfer the maximum condition to anotherkind of constraint H u ( t, x, u, p, q ) = 0 . Then two neural networks are constructed to simulatethe two controls { ˜ q t } ≤ t ≤ T and { ˜ u t } ≤ t ≤ T , respectively. Moreover, the integral of the constraint H u ( t, x, u, p, q ) from 0 to T is added as a penalty term to the original loss function. This algorithmwill greatly save the computing time, especially for high dimensional cases where ˜ u can not besolved explicitly. When the function ¯ H defined by (2.11) is known, we can also solve a class ofhigh-dimensional stochastic optimal control problems even though the optimal control ˜ u has notan explicit solution. For this case, we propose Algorithm 4. Note that when ˜ u has an explicitsolution in Algorithm 1, we can then get the function ¯ H explicitly, therefore Algorithm 1 withexplicit representation of ˜ u is essentially a special case of Algorithm 4.The numerical results of all the four algorithms demonstrate rather optimistic performance.When ˜ u has an explicit solution and thus the function ¯ H can be solved, Algorithm 4 shows themost effective performance. On the other side, even if the optimal control ˜ u may not be solvedexplicitly, our algorithms can still deal with the stochastic optimal control problem. And in thissituation, Algorithm 3 or 4 will be a better choice for high dimensional cases when the conditionsmentioned in Section 3.3 or Section 3.4 are satisfied. Otherwise Algorithm 1 or 2 should be chosenbut they are not suitable for high-dimensional cases.The rest of this paper is structured as follows. In Section 2, we briefly introduce the preliminariesabout stochastic optimal control problems and reformulate our stochastic optimal control problemas a new control problem. In Section 3, we propose our numerical algorithms for solving thenew optimal control problem and give the neural network architecture. In Section 4, we show thenumerical results and compare the results of our proposed algorithms. More complicated cases withnon-convex control domain for solving the second-order adjoint equations are studied in AppendixD. Let
T > and (Ω , F , F , P ) be a filtered probability space, where W : [0 , T ] × Ω → R d is a d -dimensional standard F -Brownian motion on (Ω , F , P ) , F = {F t } ≤ t ≤ T is the natural filtrationgenerated by the Brownian motion W . Suppose that (Ω , F , P ) is complete, F contains all the P -null sets in F and F is right continuous. Considering the following controlled stochastic differentialequation: (cid:40) d x t = b ( t, x t , u t )d t + σ ( t, x t , u t )d W t ,x = x ∈ R n , (2.1)where u t , t ∈ [0 , T ] , is an admissible control process, i.e. a F -adapted square-integrable processvalued in a given subset U of R k . We define the distance (cid:107) · (cid:107) in an Euclidean space. b and σ arethe drift coefficient and diffusion coefficient of (2.1), respectively. They are determined functions b :[0 , T ] × R n × U → R n ,σ :[0 , T ] × R n × U → R n × d . The cost functional is J ( u ( · )) = E (cid:40)(cid:90) T f ( t, x t , u t )d t + h ( x T ) (cid:41) . (2.2)The set of all admissible controls is denoted by U ad [0 , T ] U ad [0 , T ] (cid:44) { u : [0 , T ] × Ω → U | u ∈ L F (0 , T ; R k ) } , (2.3)where L F (0 , T ; R k ) (cid:44) { x : [0 , T ] × Ω → R k | x is F -adapted and E (cid:104) (cid:90) T | x t | d t (cid:105) < ∞} . U ad [0 , T ] . The goalis to find u ∗ ( · ) ∈ U ad [0 , T ] (if it exists) such that J ( u ∗ ( · )) = inf u ( · ) ∈U ad [0 ,T ] E (cid:40)(cid:90) T f ( t, x t , u t )d t + h ( x T ) (cid:41) . (2.4)Any u ∗ ( · ) ∈ U ad [0 , T ] satisfying (2.4) is called an optimal control . The corresponding state process x ∗ ( · ) and the state-control pair ( x ∗ ( · ) , u ∗ ( · )) are called an optimal state process and an optimalpair respectively.Firstly let us make the following assumptions. Assumption 1. (i) The maps b, σ, f and h are measurable, and there exist a constant L > and a modulus of continuity ¯ ω : [0 , ∞ ) → [0 , ∞ ) such that for ϕ ( t, x, u ) = b ( t, x, u ) , σ ( t, x, u ) , f ( t, x, u ) , h ( x ) , we have | ϕ ( t, x, u ) − ϕ ( t, ˆ x, ˆ u ) | ≤ L | x − ˆ x | + ¯ ω (cid:107) u − ˆ u (cid:107) , ∀ t ∈ [0 , T ] , x, ˆ x ∈ R n , u, ˆ u ∈ U, | ϕ ( t, , u ) | ≤ L, ∀ t ∈ [0 , T ] , u ∈ U ; (2.5) (ii) The maps b, σ, f and h are C in x . Moreover, there exist a constant L > and a modulusof continuity ¯ ω : [0 , ∞ ) → [0 , ∞ ) such that for ϕ = b, σ, f, h , we have | ϕ x ( t, x, u ) − ϕ x ( t, ˆ x, ˆ u ) | ≤ L | x − ˆ x | + ¯ ω (cid:107) u − ˆ u (cid:107) , | ϕ xx ( t, x, u ) − ϕ xx ( t, ˆ x, ˆ u ) | ≤ ¯ ω ( | x − ˆ x | + (cid:107) u − ˆ u (cid:107) ) , ∀ t ∈ [0 , T ] , x, ˆ x ∈ R n , u, ˆ u ∈ U. (2.6) Assumption 2.
The control domain U is a convex body in R k . The maps b, σ and f are locallyLipschitz in u , and their derivatives in x are continuous in ( x, u ) . In the following, before introducing a set of sufficient conditions for the
Stochastic MaximumPrinciple (SMP in short), we firstly introduce the adjoint equations involved in a SMP and theassociated stochastic Hamiltonian system.Let ( x ∗ ( · ) , u ∗ ( · )) be a given optimal pair. We introduce the adjoint BSDE as follows: d p ∗ t = − (cid:110) b x ( t, x ∗ t , u ∗ t ) T p ∗ t + d (cid:80) j =1 σ jx ( t, x ∗ t , u ∗ t ) T q ∗ jt − f x ( t, x ∗ t , u ∗ t ) (cid:111) d t + q ∗ t d W t ,p ∗ T = − h x ( x ∗ T ) , t ∈ [0 , T ] . (2.7)where p ∗ ( · ) and q ∗ ( · ) are two F -adapted processes which should be solved. Any pair of processes ( p ∗ ( · ) , q ∗ ( · )) ∈ L F (0 , T ; R n ) × ( L F (0 , T ; R n )) d satisfying (2.7) is called an adapted solution of (2.7).Under Assumption 1, for any ( x ∗ ( · ) , u ∗ ( · )) ∈ L F (0 , T ; R n ) × U [0 , T ] , (2.7) admits a unique adaptedsolution ( p ∗ ( · ) , q ∗ ( · )) . We refer to (2.7) as the first-order adjoint equations and to p ∗ ( · ) as the first-order adjointprocess . If ( x ∗ ( · ) , u ∗ ( · )) is an optimal (resp. admissible) pair, and ( p ∗ ( · ) , q ∗ ( · )) is an adaptedsolution of (2.7), then ( x ∗ ( · ) , u ∗ ( · ) , p ∗ ( · ) , q ∗ ( · )) is called an optimal 4-tuple (resp. admissible 4-tuple ). According to Theorem 5.2 of Chapter 3 and the comments after it in [20], we have thefollowing sufficient conditions for the SMP: Theorem 1.
Let Assumptions 1 and 2 hold. Let ( x ∗ ( · ) , u ∗ ( · ) , p ∗ ( · ) , q ∗ ( · )) be an admissible 4-tuple.Suppose that h ( · ) is convex, H ( t, · , · , p ∗ t , q ∗ t ) defined by H ( t, x, u, p, q ) = (cid:104) p, b ( t, x, u ) (cid:105) + tr [ q T σ ( t, x, u )] − f ( t, x, u ) , ( t, x, u, p, q ) ∈ [0 , T ] × R n × U × R n × R n × d , (2.8) is concave for all t ∈ [0 , T ] almost surely and H ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t ) = max u ∈ U H ( t, x ∗ t , u, p ∗ t , q ∗ t ) , a.e. t ∈ [0 , T ] , P -a.s. (2.9) holds. Then ( x ∗ ( · ) , u ∗ ( · )) is an optimal pair of (2.4) . H satisfy b ( t, x, u ) = H p ( t, x, u, p, q ) and σ ( t, x, u ) = H q ( t, x, u, p, q ) , then the combination of (2.1), (2.7) and (2.9) can be written as follows: d x ∗ t = b ( t, x ∗ t , u ∗ t )d t + σ ( t, x ∗ t , u ∗ t )d W t , d p ∗ t = − H x ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t )d t + q ∗ t d W t , t ∈ [0 , T ] ,x ∗ = x , p ∗ T = − h x ( x ∗ T ) ,H ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t ) = max u ∈ U H ( t, x ∗ t , u, p ∗ t , q ∗ t ) , (2.10)which is called a (extended) stochastic Hamiltonian system , with its solution being a 4-tuple ( x ∗ ( · ) , u ∗ ( · ) , p ∗ ( · ) , q ∗ ( · )) . And there exists a function ¯ H : [0 , T ] × R n × R n × R n × d → R suchthat ¯ H ( t, x, p, q ) = max u ∈ U H ( t, x, u, p, q ) , (2.11)which is a function independent of control u .In this paper, we primarily study the problems with convex control domain which correspond tothe first-order adjoint equations, and more complicated cases are discussed in Appendix D. In orderto make sure that the numerical algorithms can completely solve the optimal control problem, wemainly focus on the cases when equation (2.1) has a unique optimal control u ∗ . And these casesdo exist under some more strictly convex conditions (see Theorem 1). In this subsection, we reformulate the control problem (2.4) as a new problem based on the SMPand its corresponding stochastic Hamiltonian system. Considering the (extended) stochastic Hamil-tonian system (2.10), which is essentially a coupled FBSDE with a maximum condition. Supposethat the FBSDE (2.10) has a unique solution ( x ∗ ( · ) , u ∗ ( · ) , p ∗ ( · ) , q ∗ ( · )) .As is known, the FBSDE can be regarded as a stochastic optimal control problem. Based onthis idea, we have the following state equation with a maximum condition which is equivalent to(2.10), d˜ x t = b ( t, ˜ x t , ˜ u t )d t + σ ( t, ˜ x t , ˜ u t )d W t , d˜ p t = − H x ( t, ˜ x t , ˜ u t , ˜ p t , ˜ q t )d t + ˜ q t d W t , ˜ x = x , ˜ p = ˜ p ,H ( t, ˜ x t , ˜ u t , ˜ p t , ˜ q t ) = max u ∈ U H ( t, ˜ x t , u, ˜ p t , ˜ p t ) . (2.12)where (˜ q, ˜ p ) is the pair of control valued in R n × R n × d . As ˜ u t can be represented as ˜ u t = arg max u ∈ U H ( t, ˜ x t , u, ˜ p t , ˜ q t ) , (2.13)then we have a new variational problem which is a reformulation of the control problem (2.4): inf ˜ p , { ˜ q t } ≤ t ≤ T E (cid:104) | − h x (˜ x T ) − ˜ p T | (cid:105) (2.14)s.t. ˜ x t = x + (cid:90) t b ( t, ˜ x s , ˜ u s )d s + (cid:90) t σ ( t, ˜ x s , ˜ u s )d W s , ˜ p t = ˜ p − (cid:90) t H x ( s, ˜ x s , ˜ u s , ˜ p s , ˜ q s )d s + (cid:90) t ˜ q s d W s , ˜ u t = arg max u ∈ U H ( t, ˜ x t , u, ˜ p t , ˜ q t ) . According to the following Theorem 2 and Corollary 1, we can prove that the optimal control ˜ u of (2.14) can be obtained when E (cid:104) | − h x (˜ x T ) − ˜ p T | (cid:105) = 0 . Supposing that ¯ H in (2.11) is5ifferentiable in x, p, q , we have ¯ H p ( t, x, p, q ) = H p ( t, x, u ∗ , p, q ) = b ( t, x, u ∗ ) , ¯ H q ( t, x, p, q ) = H q ( t, x, u ∗ , p, q ) = σ ( t, x, u ∗ ) , ¯ H x ( t, x, p, q ) = H x ( t, x, u ∗ , p, q ) ,u ∗ = arg max u ∈ U H ( t, x, u, p, q ) , (2.15)for any ( t, x, u, p, q ) ∈ [0 , T ] × R n × U × R n × R n × d . Then (2.10) can be rewritten as d x ∗ t = ¯ H p ( t, x ∗ t , p ∗ t , q ∗ t )d t + ¯ H q ( t, x ∗ t , p ∗ t , q ∗ t )d W t , d p ∗ t = − ¯ H x ( t, x ∗ t , p ∗ t , q ∗ t )d t + q ∗ t d W t ,x ∗ = x , p ∗ T = − h x ( x ∗ T ) , (2.16)which is a FBSDE without constraint. And the problem (2.14) is equivalent to inf ˜ p , { ˜ q t } ≤ t ≤ T E (cid:104) | − h x (˜ x T ) − ˜ p T | (cid:105) (2.17)s.t. ˜ x t = x + (cid:90) t ¯ H p ( s, ˜ x s , ˜ p s , ˜ q s )d s + (cid:90) t ¯ H q ( s, ˜ x s , ˜ p s , ˜ q s )d W s , ˜ p t = ˜ p − (cid:90) t ¯ H x ( s, ˜ x s , ˜ p s , ˜ q s )d s + (cid:90) t ˜ q s d W s . Define the cost functional of the stochastic optimal control problem (2.17) as V (0 , x ) = inf ˜ p , { ˜ q t } ≤ t ≤ T E (cid:104) | − h x (˜ x T ) − ˜ p T | (cid:105) , and we have the following theorem which can be referred to Proposition 1.1 in Chapter 3 of [22]. Theorem 2 (Proposition 1.1 in Chapter 3 in [22]) . Assume h x ( x ) is continuous in x , the map ¯ H defined by (2.11) is differentiable in x, p, q , and its derivatives are continuous in x, p, q , such thatfor ϕ = ¯ H x , ¯ H p , ¯ H q and some constant C > , | ϕ ( t, x, p, q ) − ϕ ( t, x (cid:48) , p (cid:48) , q (cid:48) ) | ≤ C ( | x − x (cid:48) | + | p − p (cid:48) | + | q − q (cid:48) | ) , | ϕ ( t, , , | , | ¯ H q ( t, x, p, | ≤ C, ∀ x, x (cid:48) , p, p (cid:48) ∈ R n , q, q (cid:48) ∈ R n × d , then the FBSDE (2.16) is solvable over [0 , T ] if and only if V (0 , x ) = 0 . Corollary 1.
Suppose the assumptions in Theorem 1, 2 hold, if there exists a solution (˜ x t , ˜ u t , ˜ p t , ˜ q t ) ≤ t ≤ T of (2.12) satisfying E (cid:104) | − h x (˜ x T ) − ˜ p T | (cid:105) = 0 , (2.18) then (˜ x t , ˜ u t , ˜ p t , ˜ q t ) is a solution of (2.10) , and the cost functional (2.4) can be obtained by J ( u ∗ ( · )) = J (˜ u ( · )) = E (cid:40)(cid:90) T f ( t, ˜ x t , ˜ u t )d t + h (˜ x T ) (cid:41) . (2.19) Remark.
In this corollary, the assumptions in Theorem 1 ensure that the optimal pair ( x ∗ ( · ) , u ∗ ( · )) can be got by solving the Hamiltonian system (2.10) , the assumptions in Theorem 2 ensure thatthe optimal control of ˜ u in (2.14) exists so that we can solve this Hamiltonian system (2.10) bysolving the optimal control problem (2.14) . Problem (2.14) is a new reformulation of the problem (2.10). This new control problem has aquadratic loss function at the terminal time which gives an easier way to build a neural networkstructure. Moreover, it provides an alternative criterion to decide whether the state-control pair (˜ x ( · ) , ˜ u ( · )) is an optimal pair, which is whether E (cid:104) | − h x (˜ x T ) − ˜ p T | (cid:105) equals to 0. But the cost isthat we must deal with an additional maximum condition ˜ u t = arg max u ∈ U H ( t, ˜ x t , u, ˜ p t , ˜ q t ) . And inthe rest of this paper, we mainly focus on solving the problem (2.14) through the deep learningmethod. 6 Numerical algorithms for solving stochastic optimal controlproblems
In Section 2, we briefly introduce the sufficient conditions of the SMP and reformulate the Hamil-tonian system to a new variational problem. In this section, we propose four algorithms for solvingthe new variational problem through deep learning and show the neural network structure of Al-gorithm 1. The structures of other algorithms are similar with that of Algorithm 1, thus we omitthem in this paper.
Let π be a partition of the time interval, t < t < t < · · · < t N − < t N = T of [0 , T ] . Define ∆ t i = t i +1 − t i and ∆ W t i = W t i +1 − W t i , where W t i ∼ N (0 , t i ) , for i = 0 , , , · · · , N − . We alsodenote δ = sup ≤ i ≤ N − ∆ t i . Then the Euler scheme of the forward SDE (2.12) and the maximum condition (2.13) can bewritten as ˜ x πt i +1 = ˜ x πt i + b ( t i , ˜ x πt i , ˜ u πt i )∆ t i + σ ( t i , ˜ x πt i , ˜ u πt i )∆ W t i , ˜ p πt i +1 = ˜ p πt i − H x ( t i , ˜ x πt i , ˜ u πt i , ˜ p πt i , ˜ q πt i )∆ t i + ˜ q πt i ∆ W t i , ˜ x π = x , ˜ p π = ˜ p , ˜ u πt i = arg max u ∈ U H ( t i , ˜ x πt i , u, ˜ p πt i , ˜ q πt i ) . (3.1)We regard { ˜ q πt i } ≤ i Input: The Brownian motion ∆ W t i , initial parameters ( θ , ˜ p ,π ) , learning rate η ; Output: The couple precess (˜ x l,πt i , ˜ u l,πt i ) and ˜ p l,πT . for l = 0 to maxstep do ˜ x l,π = x , ˜ p l,π = ˜ p l,π ; for i = 0 to N − do ˜ q l,πt i = φ ( t i , ˜ x l,πt i , ˜ p l,πt i ; θ l ); ˜ u l,πt i = arg max u ∈ U H ( t i , ˜ x l,πt i , u, ˜ p l,πt i , ˜ q l,πt i ); ˜ x l,πt i +1 = ˜ x l,πt i + b ( t i , ˜ x l,πt i , ˜ u l,πt i )∆ t i + σ ( t i , ˜ x l,πt i , ˜ u l,πt i )∆ W t i ; ˜ p l,πt i +1 = ˜ p l,πt i − H x ( t i , ˜ x l,πt i , ˜ u l,πt i , ˜ p l,πt i , ˜ q l,πt i )∆ t i + ˜ q l,πt i ∆ W t i ; end for J (˜ u l,π ( · )) = 1 M M (cid:80) j =1 (cid:104) TN N − (cid:80) i =0 f ( t i , ˜ x l,πt i , ˜ u l,πt i ) + h (˜ x l,πT ) (cid:105) ; loss = 1 M M (cid:80) j =1 (cid:104) | − h x (˜ x l,πT ) − ˜ p l,πT | (cid:105) ; ( θ l +1 , ˜ p l +1 ,π ) = ( θ l , ˜ p l,π ) − η ∇ loss . end forRemark. There is one difficulty in this algorithm. As shown in line 5 of Algorithm 1, the solutionof the maximum condition is an extremum problem of multivariate functions and in most cases ithas no analytical solution, which means that we can not get the explicit value of ˜ u l,πt i to calculatethe forward process. When the explicit solution is not available, there are some ways to get theapproximated solution, such as the BFGS and its extended methods [23], the gradient descentmethods, the Sequential Least Squares Programming (SLSQP) and so on. On the other hand, whenthe explicit solution of ˜ u is available, the Hamiltonian system (2.12) is equivalent to a FBSDE,and this situation will be discussed in subsection 3.4. In subsection 3.1, we directly calculate the control ˜ u from the maximum condition, which can notreflect the relationship between two adjacent iterations of ˜ u . In this subsection, we propose animproved algorithm and the main motivation is to control the difference of u between two iterationswithin a certain range. This method has been used to help training a deep neural network in [21]for a deterministic control problem.Consider the following (extended) Hamiltonian system, d˜ x lt = b ( t, ˜ x lt , ˜ u lt )d t + σ ( t, ˜ x lt , ˜ u lt )d W t , d˜ p lt = − H x ( t, ˜ x lt , ˜ u lt , ˜ p lt , ˜ q lt )d t + ˜ q lt d W t , ˜ x l = x , ˜ p l = ˜ p l , ˜ u l +1 t = arg max u ∈ U ˜ H ( t, ˜ x lt , u, ˜ u lt , ˜ p lt , ˜ p lt ) , (3.5)where l represents the iteration step and the process ˜ u is given. The alternative maximumcondition function ˜ H is given as ˜ H ( t, x, u, v, p, q ) = H ( t, x, u, p, q ) − ρ | b ( t, x, u ) − b ( t, x, v ) | − ρ | σ ( t, x, u ) − σ ( t, x, v ) | − ρ | f ( t, x, u, p, u ) − f ( t, x, v, p, q ) | (3.6)with hyper-parameter ρ for any ( t, x, u, v, p, q ) ∈ [0 , T ] × R n × U × U × R n × R n × d . The last threeterms are regarded as penalties when the values of the two controls u and v are different. In thisextended Hamiltonian system, we also consider the process q as a control and the correspondingstochastic optimal control problem is formulated as: inf ˜ p l , { ˜ q lt } ≤ t ≤ T E (cid:104) | − h x (˜ x lT ) − ˜ p lT | (cid:105) (3.7)9.t. ˜ x lt = x + (cid:90) t b ( t, ˜ x ls , ˜ u ls )d s + (cid:90) t σ ( t, ˜ x ls , ˜ u ls )d W s , ˜ p lt = ˜ p l − (cid:90) t H x ( s, ˜ x ls , ˜ u ls , ˜ p ls , ˜ q ls )d s + (cid:90) t ˜ q ls d W s , ˜ u l +1 t = arg max u ∈ U ˜ H ( t, ˜ x lt , u, ˜ u lt , ˜ p lt , ˜ q lt ) . The corresponding Euler scheme is as follows. ˜ x l,πt i +1 = ˜ x l,πt i + b ( t i , ˜ x l,πt i , ˜ u l,πt i )∆ t i + σ ( t i , ˜ x l,πt i , ˜ u l,πt i )∆ W t i , ˜ p l,πt i +1 = ˜ p l,πt i − H x ( t i , ˜ x l,πt i , ˜ u l,πt i , ˜ p l,πt i , ˜ q l,πt i )∆ t i + ˜ q l,πt i ∆ W t i , ˜ x l,π = x , ˜ p l,π = ˜ p , ˜ u l +1 ,πt i = arg max u ∈ U ˜ H ( t i , ˜ x l,πt i , u, ˜ u l,πt i , ˜ p l,πt i , ˜ q l,πt i ) . (3.8)With the penalty terms in (3.6), the difference of ˜ u between two iterations is controlled withina certain range. We can easily check that the Hamiltonian system (3.5) is equivalent to system(2.12).We also regard { ˜ q l,πt i } ≤ i Improved algorithm with 1-NNet Input: The Brownian motion ∆ W t i , initial process { ˜ u ,πt i } ≤ i ≤ N , initial parameters ( θ , ˜ p ,π ) ,learning rate η and hyper-parameter ρ ; Output: The couple precess (˜ x l,πt i , ˜ u l,πt i ) and ˜ p l,πT . for l = 0 to maxstep do ˜ x l,π = x , ˜ p l,π = ˜ p l,π ; for i = 0 to N − do ˜ q l,πt i = ψ ( t i , ˜ x l,πt i , ˜ u l,πt i , ˜ p l,πt i ; θ l ); ˜ x l,πt i +1 = ˜ x l,πt i + b ( t i , ˜ x l,πt i , ˜ u l,πt i )∆ t i + σ ( t i , ˜ x l,πt i , ˜ u l,πt i )∆ W t i ; ˜ p l,πt i +1 = ˜ p l,πt i − H x ( t i , ˜ x l,πt i , ˜ u l,πt i , ˜ p l,πt i , ˜ q l,πt i )∆ t i + ˜ q l,πt i ∆ W t i ; ˜ u l +1 ,πt i = arg max u ∈ U ˜ H ( t i , ˜ x l,πt i , u, ˜ u l,πt i , ˜ p l,πt i , ˜ q l,πt i ); end for J (˜ u l,π ( · )) = 1 M M (cid:80) j =1 (cid:104) TN N − (cid:80) i =0 f ( t i , ˜ x l,πt i , ˜ u l,πt i ) + h (˜ x l,πT ) (cid:105) ; loss = 1 M M (cid:80) j =1 (cid:104) | − h x (˜ x l,πT ) − ˜ p l,πT | (cid:105) ; ( θ l +1 , ˜ p l +1 ,π ) = ( θ l , ˜ p l,π ) − η ∇ loss . end for The methods for solving the maximum condition function ˜ H include but are not limited toBFGS, gradient descent, SLSQP. Although Algorithm 2 shows the relationship of ˜ u between adja-cent iterations, it will be more difficult to get the extremum of function ˜ H due to the three morepenalty terms in ˜ H . Moreover, for the cases that ˜ u can be solved explicitly, Algorithm 2 needsmore computation cost. As mentioned in Algorithms 1 and 2, when the explicit solution of the maximum condition is notavailable, some approximation methods should be used. However, it is very time consuming to10alculate the approximate maximum condition for high-dimensional cases. In order to solve thisproblem, we develop a numerical algorithm with two neural networks (2-NNets) in this subsection.Here we consider a kind of stochastic optimal control problem, where the convex control domain U = R k , all the coefficients are C in u and the optimal control ˜ u falls inside the boundary of thecontrol domain, then (2.9) implies H u ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t ) = 0 , ∀ u ∈ U, a.e. t ∈ [0 , T ] , P -a.s. (3.10)Thus the corresponding stochastic Hamiltonian system (2.10) can be represented as d x ∗ t = b ( t, x ∗ t , u ∗ t )d t + σ ( t, x ∗ t , u ∗ t )d W t , d p ∗ t = − H x ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t )d t + q ∗ t d W t , t ∈ [0 , T ] ,x ∗ = x , p ∗ T = − h x ( x ∗ T ) ,H u ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t ) = 0 , ∀ u ∈ U. (3.11)For the Hamiltonian system (3.11) with constraint H u ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t ) = 0 , we solve it by thefollowing new control problem inf ˜ p , { ˜ q t } ≤ t ≤ T , { ˜ u t } ≤ t ≤ T E (cid:104) | − h x (˜ x T ) − ˜ p T | + λ (cid:90) T H u ( t, ˜ x t , ˜ u t , ˜ p t , ˜ q t ) d t (cid:105) , (3.12)s.t. ˜ x t = x + (cid:90) t b ( t, ˜ x s , ˜ u s )d s + (cid:90) t σ ( t, ˜ x s , ˜ u s )d W s , ˜ p t = ˜ p − (cid:90) t H x ( s, ˜ x s , ˜ u s , ˜ p s , ˜ q s )d s + (cid:90) t ˜ q s d W s , where ˜ p , { ˜ q t } ≤ t ≤ T , { ˜ u t } ≤ t ≤ T are the controls and λ is the hyper-parameter. As long as the lossfunction (3.12) converges to 0, the 4-tuple (˜ x t , ˜ u t , ˜ p t , ˜ q t ) converges to ( x ∗ t , u ∗ t , p ∗ t , q ∗ t ) .The discrete Euler scheme is given as ˜ x πt i +1 = ˜ x πt i + b ( t i , ˜ x πt i , ˜ u πt i )∆ t i + σ ( t i , ˜ x πt i , ˜ u πt i )∆ W t i , ˜ p πt i +1 = ˜ p πt i − H x ( t i , ˜ x πt i , ˜ u πt i , ˜ p πt i , ˜ q πt i )∆ t i + ˜ q πt i ∆ W t i , ˜ x π = x , ˜ p π = ˜ p , (3.13)and the loss function isloss = 1 M M (cid:88) j =1 (cid:104) | − h x (˜ x πT ) − ˜ p πT | + λ N − (cid:88) i =0 H u ( t, ˜ x πt i , ˜ u πt i , ˜ x πp i , ˜ q πt i ) (cid:105) , (3.14)where the time-divided coefficient /N is merged into the coefficient λ .Different from Algorithms 1 and 2, we regard the two processes { ˜ q πt i } ≤ i ≤ N − , { ˜ u πt i } ≤ i ≤ N − as controls, which means that two neural networks (2-NNets) should be constructed to simulate ˜ q πt i , ˜ u πt i , respectively. And similar with Algorithms 1 and 2, we regard ˜ q πt i , ˜ u πt i as feedback controlsof the state ˜ x πt i and the time t i , then we construct a common neural network for all time steps,respectively, ˜ q πt i = φ ( t i , ˜ x πt i ; θ q ) , ˜ u πt i = φ ( t i , ˜ x πt i ; θ u ) . (3.15)The 2-NNets contain both one ( n + 1) − dim input layers and three ( n + 10) − dim hidden layers, theoutput layers are ( n × d ) − dim for ˜ q πt i and k − dim for ˜ u πt i respectively. The loss function is given as(3.14).The pseudo-code is given as follows. 11 lgorithm 3 Numerical algorithm with 2-NNets Input: The Brownian motion ∆ W t i , initial parameters ( θ q, , θ u, , ˜ p ,π ) , learning rate η and hyper-parameter λ ; Output: The precesses ˜ x l,πt i and ˜ p l,πT . for l = 0 to maxstep do ˜ x l,π = x , ˜ p l,π = ˜ p l,π and H = 0 for i = 0 to N − do ˜ q l,πt i = φ ( t i , ˜ x l,πt i ; θ q,l ); ˜ u l,πt i = φ ( t i , ˜ x l,πt i ; θ u,l ); ˜ x l,πt i +1 = ˜ x l,πt i + b ( t i , ˜ x l,πt i , ˜ u l,πt i )∆ t i + σ ( t i , ˜ x l,πt i , ˜ u l,πt i )∆ W t i ; ˜ p l,πt i +1 = ˜ p l,πt i − H x ( t i , ˜ x l,πt i , ˜ u l,πt i , ˜ p l,πt i , ˜ q l,πt i )∆ t i + ˜ q l,πt i ∆ W t i ; H = H + H u ( t i , ˜ x l,πt i , ˜ u l,πt i , ˜ p l,πt i , ˜ q l,πt i ) end for J (˜ u l,π ( · )) = 1 M M (cid:80) j =1 (cid:104) TN N − (cid:80) i =0 f ( t i , ˜ x l,πt i , ˜ u l,πt i ) + h (˜ x l,πT ) (cid:105) ; loss = 1 M M (cid:80) j =1 (cid:104) | − h x (˜ x l,πT ) − ˜ p l,πT | + λH (cid:105) ; ( θ q,l +1 , θ u,l +1 , ˜ p l +1 ,π ) = ( θ q,l , θ u,l , ˜ p l,π ) − η ∇ loss . end for In Algorithm 3, instead of solving the maximum condition explicitly or approximately, we onlyneed to consider the control condition H u ( t, x, u, p, q ) = 0 . Thus Algorithm 3 can deal with a widerange of high-dimensional problems more effectively even if the optimal control can not be solvedexplicitly. The above mentioned Algorithm 3 provides a method for solving a kind of high-dimensional stochas-tic optimal control problems when the optimal control ˜ u has not an explicit solution. In this sub-section, we introduce another algorithm in solving high-dimensional cases. We show that as longas the function ¯ H defined by (2.11) is known, we can solve a class of high-dimensional stochasticoptimal control problems through the deep-learning method established in our previous work [14].In more details, once we know the function ¯ H , we can establish a class of corresponding stochas-tic optimal control problems shown as follows (cid:40) d x t = ( αx t + u t )d t + ( βx t + u t )d W t ,x = x , with the cost functional J ( u ( · )) = E (cid:40)(cid:90) T f ( t, x t , u t )d t + h ( x T ) (cid:41) , (3.16)where α, β are constants, x, u are valued in R n . The map f is given as f ( t, x, u ) = max ( p,q ) {(cid:104) ( p, q ) , ( u, u ) (cid:105) − ¯ H ( t, x, p, q ) + (cid:104) p, αx (cid:105) + (cid:104) q, βx (cid:105)} = max ( p,q ) {(cid:104) p, u (cid:105) + (cid:104) q, u (cid:105) − ¯ H ( t, x, p, q ) + (cid:104) p, αx (cid:105) + (cid:104) q, βx (cid:105)} , (3.17)which is a Legendre transformation for ¯ H with respect to p, q . The Hamiltonian H of the stochasticcontrol problem (3.16) and the function ¯ H satisfy (2.11). And the Hamiltonian system with ¯ H isgiven as follows d x ∗ t = ¯ H p ( t, x ∗ t , p ∗ t , q ∗ t )d t + ¯ H q ( t, x ∗ t , p ∗ t , q ∗ t )d W t , d p ∗ t = − ¯ H x ( t, x ∗ t , p ∗ t , q ∗ t )d t + q ∗ t d W t , t ∈ [0 , T ] ,x ∗ = x , p ∗ T = − h x ( x ∗ T ) , (3.18)12hich is essentially an FBSDE . If (3.18) satisfies the monotonic conditions [24, 25], then it canbe solved through the deep-learning method proposed in [14]. Thus the optimal state processes ( x ∗ ( · ) , p ∗ ( · ) , q ∗ ( · )) can be obtained. Finally we will get the optimal control ˜ u ( · ) via the maximumcondition, and the stochastic optimal control problem can be completely solved. In this situation,the control ˜ u does not necessarily have an explicit solution. And similar with the first threealgorithms, a single DNN is constructed for all the time points to improve the computing efficiencyof our previous proposed method in [14]. Remark. In fact, for the stochastic control problem (3.16) , (2.11) can be seen as the inverseLegendre transformation of (3.17) . From the definition of H and the relationship in (2.11) , wehave ¯ H ( t, x, p, q ) = max u ∈ U H ( t, x, u, p, q )= max u ∈ U {(cid:104) p, b ( t, x, u ) (cid:105) + (cid:104) q, σ ( t, x, u ) (cid:105) − f ( t, x, u ) } = max u ∈ U {(cid:104) ( p, q ) , ( u, u ) (cid:105) − f ( t, x, u ) } + (cid:104) p, αx (cid:105) + (cid:104) q, βx (cid:105) , which is essentially a Legendre transformation of f with respect to u and also the inverse transfor-mation of (3.17) . The choice of algorithm can be determined according to the nature of the function H in themaximum condition and the character of the stochastic optimal control problem. When ˜ u can besolved explicitly, it’s better to choose Algorithm 4, then the optimal stochastic control problem isdegenerated into the problem of solving FBSDEs. When ˜ u does not have an explicit representa-tion, all of the four proposed algorithms can be chosen. If the optimal control problem satisfies theconditions of Algorithm 3 or 4, then Algorithm 3 or 4 will be better alternatives, otherwise Algo-rithm 1 or 2 should be applied. And in this situation, Algorithm 2 will be a better choice for morestable results. Our proposed framework and algorithms proposed in Section 3 can be extendedto deal with stochastic optimal control problems where the control domain is non-convex and thestate equations are described by fully coupled FBSDEs [26, 27]. More details will be discussed inAppendix D. In this section, we apply our proposed algorithms in solving some stochastic optimal control prob-lems and show the numerical results. Firstly, we give a low-dimensional example to compare theperformance of the four algorithms. Then we show some high dimensional cases and give thecorresponding results.All the examples of this section are calculated with the learning rate η = 5 × − , the numberof time-points N = 25 , and 512 sample-paths for the Brownian motion W if not specificallynoted. In order to get more general results, we calculate the means of numerical results from tenindependent runs of each algorithm. The numerical experiments are performed in PYTHON ona LENOVO computer with a 2.40 Gigahertz (GHz) Inter Core i7 processor and 8 gigabytes (GB)random-access memory (RAM). Consider the following LQ control system: d x t = ( − x t + u t )d t + ( 15 x t + u t )d W t ,x (0) = x , (4.1)and the cost functional is defined as J (0 , x ; u ( · )) = E (cid:40) (cid:90) T [ (cid:28) x t , x t (cid:29) + (cid:104) u t , u t (cid:105) ]d t + 12 (cid:104) Qx T , x T (cid:105) (cid:41) , (4.2)13here the dimensions satisfy n = d = k and Q is determined matrix taking value in R n × n . Thecontrol domain is U = R n , and the Hamiltonian H is H ( t, x, u, p, q ) = (cid:28) p, − x + u (cid:29) + (cid:28) q, x + u (cid:29) − (cid:104) x, x (cid:105) − (cid:104) u, u (cid:105) . The explicit representation of the optimal control u ∗ is u ∗ = 12 ( p + q ) . Therefore the function ¯ H has an explicit form ¯ H ( x, p, q ) = − (cid:28) x, x (cid:29) + (cid:28) p, − x (cid:29) + (cid:28) q, x (cid:29) + 14 (cid:104) p + q, p + q (cid:105) . The corresponding Hamiltonian system is d x ∗ t = ( − x ∗ t + u ∗ t )d t + ( 15 x ∗ t + u ∗ t )d W t , − d p ∗ t = ( − x ∗ t − p ∗ t + 15 q ∗ t )d t − q ∗ t d W t ,x ∗ = x , p ∗ T = − Qx ∗ T ,u ∗ t = 12 ( p ∗ t + q ∗ t ) . (4.3)We set x = 1 in this example. It can be verified that equation (4.3) satisfies the monotoniccondition and has a unique solution ( x ∗ ( · ) , p ∗ ( · ) , q ∗ ( · )) . The reader can check it in Appendix B.Supposing the solution of FBSDE (4.3) is in the following form: p ∗ t = − K t x ∗ t , q ∗ t = − M t x ∗ t . Combing it with (4.3), we obtain a Riccati equation ˙ K t − K t − K t + ( 15 E n − K t ) M t + 12 E n = 0 , K t − K t + 12 K t M t + M t = 0 ,K T = Q, (4.4)where ˙ K t is the derivative of K t with respect to t , E n is an n -order unit matrix. Equation (4.4) is adetermined one-order ordinary differential equation and we can get its numerical solution with thefour-order Runge-Kutta methods by using the ODE45 method in Matlab(ODE45 in brief for easyexpression). Therefore the numerical solutions of ODE45 is used as a benchmark to be comparedwith that of our algorithms in the LQ control problem.Firstly, we set T = 0 . , Q = E n . The numerical solution of equation (4.4) with ODE45 is K = 0 . E n for each dimension, and the value of p ∗ with ODE45 is p ∗ = − K (0) x = − . , i.e. p ∗ is a n -dim vector with all its elements equal to − . .We give the numerical results comparison of our four proposed algorithms in Table 1 when n = 5 . And the numerical solution p ∗ = − . with ODE45 is used as the benchmark forcalculating the relative errors. In order to measure the performance, the approximate solution ofthe optimal control ˜ u in Algorithm 1 and 2 are calculated through the L-BFGS method though ˜ u can be solved explicitly in this example.Table 1: Comparison of our different algorithms for n = 5 Method p Cost Time(s) Terminal step Relative errorAlg 1 -0.99391 2.4841 9,433.1 2000 3.68%Alg 2 -0.97270 2.4891 117,116.5 2000 1.47%Alg 3 -0.95813 2.3970 589.4 2000 0.05%Alg 4 -0.95824 2.4041 318.6 2000 0.04%14rom the results in Table 1, we can see that Algorithm 4 is the most effective among the fourproposed algorithms with a relative error of . and running time of . s. And when ˜ u hasan explicit solution, the function ¯ H can be obtained, then Algorithm 1 will degenerate to a specialcase of Algorithm 4. Therefore when ˜ u can be solved explicitly, Algorithm 4 is the best choiceamong all the four algorithms. We also see that even if the optimal control ˜ u has not an explicitrepresentation, we can still calculate the stochastic optimal control problem with our algorithmswith a relative error of less than . However, the disadvantage of Algorithm 1 and 2 is that it ishard to apply them to high dimensional cases when the solution of ˜ u is not explicit, as it is verytime consuming to get the approximate solution of ˜ u with L-BFGS. And from both the relativeerrors and the running time (row 3-5 in Table 1), Algorithms 3 and 4 demonstrate much moreeffective performance on comparing with Algorithms 1 and 2. Therefore, when the optimal control ˜ u is not explicit, Algorithms 3 or 4 will be a better choice for high dimensional cases when theconditions mentioned in Section 3.3 or Section 3.4 are satisfied. Otherwise Algorithms 1 and 2could be chosen but they are not suitable for high dimensional cases. Besides, Algorithm 2 is moretime-consuming but it can get more stable results on comparing with Algorithm 1. Remark. For the LQ stochastic optimal problem, the existence of optimal control does not requiresuch strong assumptions as Assumptions 1, 2, the readers can refer to assumption L1 on page 301in [20]. In this situation, the assumptions in the corresponding Theorem 2 will also be as weak asassumption L1, the proof is similar to Corollary 1 and we omit it. In this sub-section, we show some numerical results for high-dimensional cases. As Algorithm 1and 2 are not suitable for high-dimensional problems when the optimal control ˜ u is not explicit,we mainly show the results of Algorithm 3 with H u ( t, x, u, p, q ) = 0 and that of Algorithm 4 whenwe have an explicit expression of ¯ H . We first compute the LQ problem mentioned in subsection 4.1 for a n = 100 case. Let x = 1 . , T =0 . . Figure 3 shows the loss curve and the relative errors with different numbers of iteration steps,and the results of ODE45 are used as the benchmark for calculating the relative errors. L o ss Loss of Algorithm R e l a t i v e e rr o r o f p ( l n ) Explicit Simulation vs ODE45 Figure 3: Case n=100. The loss curve in the left figure shows that after 5000 steps, the loss valueis . × − . The curves in the right figure shows the relative errors of p for each dimensionon comparing with that of ODE45.We perform 10 independent runs and the means of the cost functional converges to . . Thecurve of the cost functional is shown in Figure 4.15 500 1000 1500 2000 2500 3000 3500 4000Iteration step484950515253545556 V a l u e o f t h e c o s t f un c t i o n a l The mean and scope of 10 independent runs Alg 3Alg 4 Figure 4: Case n = 100 and λ = 0 . , the figure shows the mean and scope of the cost functionalamong 10 independent runs. The green line and scope represent the results of Algorithm 3 withthe constraint H u ( t, x, u, p, q ) = 0 . The red line and scope represent the value of cost functionalwith Algorithm 4 when the optimal control ˜ u has an explicit solution. We can see that after 4000iterations, the values of the cost functionals of Algorithm 3 and 4 are very close and converge to . .Now we consider a general case with the terminal p ∗ T = − n x ∗ T , where n is an n -order matrixwith all its elements equal to 1. In this case, the solution K t of equation (4.4) at time 0 changeswith the change of the dimension n .Similarly, we first calculate the numerical solutions of p with ODE45 and take the results asthe benchmark. Then we compare our neural network solutions of p with that of ODE45. Table2 shows the comparison results for different dimensions and Figure 5 shows the curve of the costfunctional for n = 10 .Table 2: Comparison between ODE45 and our algorithms for different dimensionsn=1 n=2 n=5 n=10 n=20Solution for ODE45 -0.9586 -1.8275 -4.3638 -8.5306 -16.821Solution for Neural Network Alg 3 -0.9583 -1.8281 -4.3543 -8.4509 -16.799Alg 4 -0.9587 -1.8287 -4.3580 -8.4808 -16.780Relative error Alg 3 0.031% 0.033% 0.218% 0.943% 0.131%Alg 4 0.010% 0.066% 0.133% 0.584% 0.244%Cost functional Alg 3 0.4788 1.8357 11.250 43.590 177.21Alg 4 0.4785 1.8217 10.959 42.582 168.3316 V a l u e o f t h e c o s t f un c t i o n a l The mean and scope of 10 independent runs Alg 3Alg 4 Figure 5: Case n = 10 . The figure shows the mean and scope of the cost functional for 10independent runs. We set λ = 0 . in Algorithm 3. The results show that after 1000 iterations,the value of the cost functional for Algorithm 3 is close to that of Algorithm 4, but it has greatervolatility among 10 independent runs. After 10000 iterations, the mean of the cost functional J (0 , x ; ˜ u ( · )) is . . In this subsection, we compute an example mentioned in [16, 17]. Consider the following controlsystem (cid:40) d x t = 2 u t d t + √ W t ,x (0) = x (4.5)and the corresponding cost functional J (0 , x ; u ( · )) = E (cid:40)(cid:90) T [ (cid:104) u t , u t (cid:105) ]d t + h ( x T ) (cid:41) , (4.6)where h ( x ) = ln( (1 + | x | )) , | x | = (cid:104) x, x (cid:105) . W t is a n -dimensional Brownian motion, and theprocess x t is valued in R n . The control domain is R and the Hamiltonian H is H ( t, x, u, p, q ) = (cid:104) p, u (cid:105) + tr ( √ q ) − (cid:104) u, u (cid:105) , the optimal control u ∗ can be solved with u ∗ = p, Then we have ¯ H ( t, x, p, q ) = (cid:104) p, p (cid:105) + tr ( √ q ) − (cid:104) p, p (cid:105) . The corresponding Hamiltonian system is given as d x ∗ t = 2 u ∗ t d t + √ W t , − d p ∗ t = − q ∗ t d W t ,x ∗ = x , p ∗ T = − h x ( x ∗ T ) ,u ∗ t = p ∗ t (4.7)where p ∗· and q ∗· are valued in R n and R n × n respectively and n = d = k . It is known that the costfunctional of (4.6) can be obtained via Hopf-Cole transformation, which is given by J (0 , x ; u ∗ ( · )) = − ln( E [ exp ( − h ( x + √ W T ))]) , x ∈ R n . 17e set x = 0 and T = 1 . with the dimension n = 100 and the approximate cost functionalis 4.591 by performing the Monte Carlo simulation. Figure 6 shows the relative errors of thecost functional on comparing the results between our proposed algorithms and the Monte Carlosimulation. V a l u e o f t h e c o s t f un c t i o n a l The mean and scope of 10 independent runs Alg 3Alg 4 0 1000 2000 3000 4000 5000Iteration step10 R e l a t i v e e rr o r o f t h e c o s t f un c t i o n a l The mean and scope of 10 independent runs Alg 3Alg 4 Figure 6: Case n=100. The curve in the left figure shows the cost functional curve and the rightfigure shows the mean and scope of relative errors of the cost functional J (0 , x ; ˜ u ( · )) among 10independent runs. We can see that after 5000 iteration steps, the values of the cost functional arevery close between Algorithm 3 and 4, and Algorithm 3 shows smaller mean of relative errors. Our algorithm can also be used to solve the sub-linear expectation [28], which is an expansion ofthe traditional classical probability theory system and can be used in a wide range of applicationareas.In this subsection, we calculate the G -expectation which is a special and important sub-linearexpectation. The preliminary knowledge of the G -expectation is given in Appendix C.For a given function ϕ and a n -dimensional G -normal distributed variable X , we compute thesub-linear expectation ˆ E [ ϕ ( X )] . According to the representation of the sub-linear expectation, wehave ˆ E [ ϕ ( X )] = sup θ ∈ Θ E θ [ ϕ ( X )] = − inf θ ∈ Θ E θ [ − ϕ ( X )] , which is equivalent to the following stochastic optimal control problem d X t = θ t d W t ,σ ≤ θ t ≤ ¯ σ,X (0) = 0 , (4.8)with the cost functional J ( θ ( · )) = inf θ ∈ Θ E θ [ − ϕ ( X )] . (4.8) is a stochastic optimal control problem with constraint and its corresponding Hamiltonian H can be easily got by H ( t, x, θ, p, q ) = θq with the constraint σ ≤ θ ≤ ¯ σ . The optimal control θ ∗ is θ ∗ = ¯ σ q ≥ + σ q< , and ¯ H is given as ¯ H ( t, x, p, q ) = (¯ σ q ≥ + σ q< ) q. d X ∗ t = θ ∗ t d W t , − d p ∗ t = − q ∗ t d W t ,X ∗ = 0 , p ∗ T = ϕ x ( X ∗ T ) ,θ ∗ t = ¯ σ q ∗ t ≥ + σ q ∗ t < . (4.9)In the problem (4.9), for the cases where the function ϕ is convex, we can solve it through ourframework though the assumptions in Corollary 1 are not satisfied. The proof is similar to Corollary1 and we omit it.Let n = 100 , X d = N ( { } × Θ) = N (0 × [ σ , ¯ σ ] n ) with σ = 1 , ¯ σ = 2 and ϕ ( x ) = x . The valueof ˆ E [ ϕ ( X )] is equal to 400 which is taken as the benchmark for measuring the performance of ouralgorithm. As the Hamiltonian H doe not contain any term with the control u , thus Algorithm3 with the constraint H u ( t, x, u, p, q ) = 0 can not be applied to this example, therefore we mainlyshow the results of Algorithm 4. The following Figure 7 shows the mean and scope of relativeerrors of the cost functional among 10 independent runs. R e l a t i v e e rr o r o f t h e c o s t f un c t i o n a l The mean and scope of 10 independent runs Alg 4 Figure 7: Case n = 100 . This figure shows the mean and scope of relative errors of the costfunctional J (0 , x ; θ ∗ ( · )) among 10 independent runs. After 5000 iterations, the value of J ( θ ∗ ( · )) converges to 400.268. ˜ u without explicit solution In this subsection, we show an example whose optimal control ˜ u doesn’t have an explicit form.Consider the following stochastic control problem, (cid:40) d x t = sin u t d t + x t d W t ,x (0) = x , (4.10)with cost functional J ( u ( · )) = E (cid:40)(cid:90) T (cid:104) u t , u t (cid:105) d t + 12 (cid:104) x T , x T (cid:105) (cid:41) , where the control domain is U = R n . The Hamiltonian H is H ( t, x, u, p, q ) = (cid:104) p, sin u (cid:105) + (cid:104) q, x (cid:105) − (cid:104) u, u (cid:105) , which is a multi-dimensional transcendental equation and has not an explicit representation ofboth the optimal control ˜ u and the function ¯ H . The derivative of the Hamiltonian H in u is givenas H u ( t, x, u, p, q ) = p cos u − u, H u ( t, x, u, p, q ) = 0 is d x ∗ t = sin u ∗ t d t + x t d W t , − d p ∗ t = q ∗ t d t − q ∗ t d W t ,x ∗ = x , p ∗ T = − h x ( x ∗ T ) ,p ∗ t cos u ∗ t − u ∗ t = 0 . (4.11)We set x = 1 . , T = 0 . and the hyper-parameter λ = 0 . . As the optimal control ˜ u has notan explicit solution and ¯ H does not have an explicit expression, thus Algorithm 1, 2 and 4 arenot suitable for this example with high dimension. Therefore we mainly give the numerical resultsof Algorithm 3 in this example. In order to show the convergence of our algorithm, we compareour algorithm with the algorithm of Han and E [11] which solves the stochastic optimal controlproblems directly via deep learning. The comparing results between our algorithm and that ofHan and E are shown in Figure 8. We can see that the value of the cost functional for the twoalgorithms are very close, but the advantage of our algorithm is that it provides a criterion todecide whether the state-control pair (˜ x ( · ) , ˜ u ( · )) is an optimal pair, that is, whether the value ofthe loss function equals to 0. V a l u e o f t h e c o s t f un c t i o n a l The mean and scope of 10 independent runs Alg of Han and EAlg 3 0 2000 4000 6000 8000 10000Step0.0500.0750.1000.1250.1500.1750.2000.225 V a r o f t h e c o s t f un c t i o n a l The comparison of the variances Alg of Han and EAlg 3 Figure 8: Case n=100. The left figure shows the comparison between our algorithm and that ofHan and E [11] on the mean and scope of the cost functional among 10 independent runs. Theright figure shows the variances comparison. We can see that the value of the cost functional forthe two algorithms are very close from the left figure and our algorithm has a lower variance fromthe right figure. In this paper, we have solved the stochastic optimal control problem from the view of the stochasticmaximum principle and proposed four different algorithms via deep learning. We have comparedour proposed algorithms through numerical results and pointed out their applicative situations.The numerical results for different examples demonstrate the effectiveness of our proposed algo-rithms. Appendix For readers’ convenience, some preliminaries and the algorithm for solving problems with non-convex control domain are listed in the appendix.20 Some typical control problems In this section, we list some special cases of optimal problems with the first-order adjoint equation(2.7). A.1 Case 1: σ does not contain u ( · ) When the diffusion σ does not contain the control variable u ( · ) , i.e., σ ( t, x, u ) ≡ σ ( t, x ) , ∀ ( t, x, u ) ∈ [0 , T ] × R n × U, (A.1)the maximum condition (2.9) reduces to H ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t ) = max u ∈ U H ( t, x ∗ t , u, p ∗ t , q ∗ t ) , a.e. t ∈ [0 , T ] , P -a.s. (A.2)In this case, the corresponding stochastic Hamiltonian system is d x ∗ t = H p ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t )d t + H q ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t )d W t , d p ∗ t = − H x ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t )d t + q ∗ t d W t , t ∈ [0 , T ] ,x ∗ = x , p ∗ T = − h x ( x ∗ T ) ,H ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t ) = max u ∈ U H ( t, x ∗ t , u, p ∗ t , q ∗ t ) , (A.3)by solving (A.3) we can get an admissible pair ( x ∗ ( · ) , u ∗ ( · )) . If the sufficient conditions are satisfied,the ( x ∗ ( · ) , u ∗ ( · )) is the optimal pair. A.2 Case 2: U is convex and coefficients are C in u If the control domain U ⊂ R k is convex and all the coefficients are C in u , then (2.9) implies (cid:104) H u ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t ) , u − u ∗ t (cid:105) ≤ , ∀ u ∈ U, a.e. t ∈ [0 , T ] , P -a.s. (A.4)This is called a local form of the maximum principle. Note that the local form does not involvethe second-order adjoint process P ∗ ( · ) either. The corresponding stochastic Hamiltonian systemis d x ∗ t = H p ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t )d t + H q ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t )d W t , d p ∗ t = − H x ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t )d t + q ∗ t d W t , t ∈ [0 , T ] ,x ∗ = x , p ∗ T = − h x ( x ∗ T ) , (cid:104) H u ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t ) , u − u ∗ t (cid:105) ≤ , ∀ u ∈ U. (A.5) A.3 Case 3: LQ optimal control problems Let T > be given. For any x ∈ R n , consider the following linear equation: d x t = (cid:104) A t x t + B t u t + α t (cid:105) d t + (cid:104) C t x t + D t u t + β t (cid:105) d W t ,x = x , (A.6)where A, B, C, D, α, β are deterministic matrix-valued functions of suitable sizes. The quadraticcost functional is given by J ( u ( · )) = E (cid:110) (cid:90) T (cid:104) (cid:104) Q t x t , x t (cid:105) + 2 (cid:104) S t x t , u t (cid:105) + (cid:104) R t u t , u t (cid:105) (cid:105) d t + 12 (cid:104) Gx T , x T (cid:105) (cid:111) , (A.7)where Q, S, and R are S n − , R k × n − , and S k − valued functions, respectively, and G ∈ S n .21ccording to Proposition 5.5 of Chapter 6 in [20], the problem (A.6) is solvable at x ∈ R n with an (the) optimal pair ( x ∗ ( · ) , u ∗ ( · )) if and only if there exists a 4-tuple ( x ∗ ( · ) , u ∗ ( · ) , p ∗ ( · ) , q ∗ ( · )) satisfying the stochastic Hamiltonian system d x ∗ t = (cid:104) A t x ∗ t + B t u ∗ t + α t (cid:105) d t + (cid:104) C t x ∗ t + D t u ∗ t + β t (cid:105) d W t , d p ∗ t = − (cid:104) A T t p ∗ t + C T t q ∗ t − Q t x ∗ t − S T t u ∗ t (cid:105) d t + q ∗ t d W t ,x ∗ = x , p ∗ T = − Gx ∗ T ,R t u ∗ t + S t x ∗ t − B T t p ∗ t − D T t q ∗ t = 0 , t ∈ [ s, T ] , P -a.s. . (A.8)Note that the Hamiltonian system (A.8) has a unique solution and we assume that ( x ∗ ( · ) , u ∗ ( · ) , p ∗ ( · ) , q ∗ ( · )) is an adapted solution of the linear equation (A.8) and x ∗ ( · ) , p ∗ ( · ) are associated by p ∗ t = − K t x ∗ t − ψ t , (A.9)where K t : [0 , T ] → R n × n is a deterministic matrix-valued function independent on x ∗ t . Applyingto (A.9) and using (A.8), we can have q ∗ t = − K t C t x ∗ t − K t D t u ∗ t − K t β t . (A.10)And then we can easily get ˙ K t + K t A t + A T t K t + C T t K t C t + Q t − ( B T t K t + S t + D T t K t C t ) T ( R t + D T t K t D t ) − ( B T t K t + S t + D T t K t C t ) = 0 ,K T = G,R t + D T t K t D t > , a.e. t ∈ [0 , T ] , (A.11)and ˙ ψ t + (cid:104) A t − B t ( R t + D T t K t D t ) − ( B T t K t + S t + D T t K t C t ) (cid:105) T ψ t + (cid:104) C t − D t ( R t + D T t K t D t ) − ( B T t K t + S t + D T t K t C t ) (cid:105) T K t β t + α t = 0 ,ψ T = 0 , a.e. t ∈ [0 , T ] , (A.12)where K ( · ) , ψ ( · ) is the solution of (A.11), (A.12), respectively. We call (A.11) the stochastic Riccatiequation associated with LQ optimal control problem. B Existence and uniqueness results of FBSDEs For a special case of FBSDEs, X t = X + (cid:90) t b ( s, X s , BY s , CZ s ) , d s + (cid:90) t σ ( s, X s , BY s , CZ s )d W s ,Y t = g ( X T ) + (cid:90) Tt f ( s, X s , Y s , Z s ) d s − (cid:90) Tt Z s d W s , (B.1)where B , C are k × n matrixes, ( x, y, z ) ∈ R n × n × n , and b, f, σ have appropriate dimensions.Denote that u = xyz , A ( t, u ) = − fbσ ( t, u ) , and assume that Assumption 3. A ( t, u ) is uniformly Lipschitz with respect to u ; . A ( · , u ) is in M (0 , T ) for ∀ u ;3. g ( x ) is uniformly Lipschitz with respect to x ∈ R n ;4. g ( x ) is in L (Ω , F T , P ) for ∀ x ;5. ∀ x , | l ( t, x, By, Cz ) − l ( t, x, B ¯ y, C ¯ z ) | ≤ K ( | B ˆ y + C ˆ z | ) , K > , l = b, σ and Assumption 4. (cid:104) A ( t, u ) − A ( t, ¯ u ) , u − ¯ u (cid:105) ≤ − ν | ˆ x | − ν | ˆ y + ˆ z | , (cid:104) g ( x ) − g (¯ x ) , ( x − ¯ x ) (cid:105) ≥ , ∀ u = ( x, y, z ) , ¯ u = (¯ x, ¯ y, ¯ z ) , ˆ x = x − ¯ x, ˆ y = y − ¯ y, ˆ z = z − ¯ z, where ν and ν are given objective constants. The following result can be found in Theorem 2.6 of [24]. Theorem 3. Let Assumptions 3 and 4 hold, then there exist a unique adapted solution ( X, Y, Z ) of FBSDE (B.1) . C The preliminary of G -expectation Let X be a G -normal distributed n -dimensional random vector X = ( X , · · · , X n ) T on a sub-linearexpectation space (Ω , H , ˆ E ) , where H is a space satisfied c ∈ H for each constant c and | X | ∈ H if X ∈ H . We denote by S ( n ) the collection of all n × n symmetric matrices.According to the definition and properties of G -normal distributed random variable, we caneasily get (cid:40) aX + b ¯ X d = √ a + b X, for a, b ≥ , ˆ E [ X ] = − ˆ E [ − X ] = 0 , (C.1)where ¯ X is an independent copy of X . (C.1) means that aX + b ¯ X and √ a + b X are identicallydistributed and X has no mean-uncertainty.Let G be a function G = G X ( A ) : S ( n ) → R defined by G ( A ) := 12 ˆ E (cid:104) (cid:104) AX, X (cid:105) (cid:105) , A ∈ S ( n ) . The distribution of X is characterized by u ( t, x ) = ˆ E (cid:104) ϕ ( x + √ tX ) (cid:105) , ϕ ∈ C l,Lip ( R n ) , where u is the solution of the following parabolic partial differential equation (PDE for short)defined on [0 , ∞ ) × R n : ∂ t u − G ( D u ) = 0 , u | t =0 = ϕ (C.2)and C l,Lip ( R n ) denotes the linear space of functions ϕ satisfying | ϕ ( x ) − ϕ ( y ) | ≤ C (1 + | x | m + | y | m ) | x − y | for x, y ∈ R n , some C > , m ∈ N depending on ϕ. The parabolic PDE (C.2) is called a G -heat equation. Then there exists a bounded, convexand closed subset Θ ⊂ S ( n ) such that 12 ˆ E (cid:104) (cid:104) AX, X (cid:105) (cid:105) = G ( A ) = 12 sup Q ∈ Θ tr (cid:104) AQ (cid:105) , A ∈ S ( n ) . We denote X d = N (0 × Θ) . When n = 1 , we have X d = N (0 × [ σ , ¯ σ ]) where σ = − ˆ E [ − X ] and ¯ σ = ˆ E [ X ] .In the following two typical situations, the calculation of ˆ E [ ϕ ( X )] is obvious:23 For each convex function ϕ , we have ˆ E [ ϕ ( X )] = 1 √ π (cid:90) ∞−∞ ϕ (¯ σ y ) exp( − y y ; (C.3) • For each concave function ϕ , we have ˆ E [ ϕ ( X )] = 1 √ π (cid:90) ∞−∞ ϕ ( σ y ) exp( − y y. (C.4) D Problems with non-convex control domain The four algorithms proposed in this paper are mainly aimed at the stochastic optimal controlproblems with first-order adjoint equations which correspond to the cases with convex controldomain. In fact, for the cases with non-convex control domain, we can use the similar method. Inthis situation, we need to introduce the following second order adjoint equation when σ containsthe control u : d P ∗ t = − (cid:110) b x ( t, x ∗ t , u ∗ t ) T P ∗ t + P ∗ t b x ( t, x ∗ t , u ∗ t ) + d (cid:80) j =1 σ jx ( t, x ∗ t , u ∗ t ) T P ∗ t σ jx ( t, x ∗ t , u ∗ t )+ d (cid:80) j =1 (cid:8) σ jx ( t, x ∗ t , u ∗ t ) T Q ∗ jt + Q ∗ jt σ jx ( t, x ∗ t , u ∗ t ) (cid:9) + H xx ( t, x ∗ t , u ∗ t , p ∗ t , q ∗ t ) (cid:111) d t + d (cid:80) j =1 Q ∗ jt d W jt ,P ∗ T = − h xx ( x ∗ T ) , (D.1)where the Hamiltonian H is defined by H ( t, x, u, p, q ) = (cid:104) p, b ( t, x, u ) (cid:105) + tr [ q T σ ( t, x, u )] − f ( t, x, u ) , (( t, x, u, p, q )) ∈ [0 , T ] × R n × U × R n × R n × d , (D.2)and ( p ∗ ( · ) , q ∗ ( · )) is the solution of (2.7). In equation (D.1), the solution is a pair of processes ( P ∗ ( · ) , Q ∗ ( · )) ∈ L F (0 , T ; S n ) × ( L F (0 , T ; S n )) d where S n = { A ∈ R n × n | A T = A } .Equation (D.1) is also a BSDE with matrix-valued ( P ∗ ( · ) , Q ∗ ( · )) . As with (2.7), there existsa unique adapted solution ( P ∗ ( · ) , Q ∗ ( · )) to (D.1) under Assumption 1. We refer to (2.7) (resp.(D.1)) as the first-order (resp. second-order ) adjoint equations and to p ∗ ( · ) (resp. P ∗ ( · ) ) as the first-order (resp. second-order ) adjoint process . If ( x ∗ ( · ) , u ∗ ( · )) is an optimal (resp. admissible)pair, and ( p ∗ ( · ) , q ∗ ( · )) and ( P ∗ ( · ) , Q ∗ ( · )) are adapted solutions of (2.7) and (D.1), respectively,then ( x ∗ ( · ) , u ∗ ( · ) , p ∗ ( · ) , q ∗ ( · ) , P ∗ ( · ) , Q ∗ ( · )) is called an optimal 6-tuple (resp. admissible 6-tuple ).The following theorem can be found in Theorem 3.2, Chapter 3 of [20]. Theorem 4. Let Assumption 1 hold. Let ( x ∗ ( · ) , u ∗ ( · )) be an optimal pair of (2.4) . Then thereare pairs of processes (cid:40) ( p ∗ ( · ) , q ∗ ( · )) ∈ L F (0 , T ; R n ) × ( L F (0 , T ; R n )) d , ( P ∗ ( · ) , Q ∗ ( · )) ∈ L F (0 , T ; S n ) × ( L F (0 , T ; S n )) d , (D.3) where (cid:40) q ∗ ( · ) = ( q ∗ ( · ) , · · · , q ∗ d ( · )) , Q ∗ ( · ) = ( Q ∗ ( · ) , · · · , Q ∗ d ( · )) ,q ∗ j ( · ) ∈ L F (0 , T ; R n ) , Q ∗ j ( · ) ∈ L F (0 , T ; S n ) , ≤ j ≤ d, (D.4) satisfying the first-order and second-order adjoint equations (2.7) and (D.1) , respectively, such thata generalized Hamiltonian H ( t, x ∗ t , u ∗ t ) = max u ∈ U H ( t, x ∗ t , u ) , a.e. t ∈ [0 , T ] , P -a.s. (D.5)24 here H ( t, x, u ) (cid:44) H ( t, x, u, p t , q t ) − tr [ σ ( t, x ∗ t , u ∗ t ) T P t σ ( t, x ∗ t , u ∗ t )]+ 12 tr (cid:110) [ σ ( t, x, u ) − σ ( t, x ∗ t , u ∗ t )] T P t [ σ ( t, x, u ) − σ ( t, x ∗ t , u ∗ t )] (cid:111) . (D.6)The sufficient conditions for the optimality can be found in [20] and the unique optimal control u ∗ can be got under some strictly convexity assumptions. Then we have the corresponding newstochastic optimal control problem: inf ˜ p , ˜ P , { ˜ q t , ˜ Q t } ≤ t ≤ T E (cid:104) | − h x (˜ x T ) − ˜ p T | + | − h xx (˜ x T ) − ˜ P T | (cid:105) , (D.7)s.t. ˜ x t = x + (cid:90) t b ( t, ˜ x s , ˜ u s )d s + (cid:90) t σ ( t, ˜ x s , ˜ u s )d W s , ˜ p t = ˜ p − (cid:90) t H x ( s, ˜ x s , ˜ u s , ˜ p s , ˜ q s )d s + (cid:90) t ˜ q s d W s , ˜ P t = ˜ P − (cid:90) t F ( t, ˜ x s , ˜ u s , ˜ p s , ˜ q s , ˜ P s , ˜ Q s )d t + (cid:90) t ˜ Q s d W s , where F ( t, x, u, p, q, P, Q ) = b x ( t, x, u ) T P + P b x ( t, x, u ) + d (cid:88) j =1 σ jx ( t, x, u ) T P σ jx ( t, x, u )+ d (cid:88) j =1 (cid:8) σ jx ( t, x, u ) T Q j + Q j σ jx ( t, x, u ) (cid:9) + H xx ( t, x, u, p, q ) , and the Euler scheme is ˜ x πt i +1 = ˜ x πt i + b ( t i , ˜ x πt i , ˜ u πt i )∆ t i + σ ( t i , ˜ x πt i , ˜ u πt i )∆ W t i , ˜ p πt i +1 = ˜ p πt i − H x ( t i , ˜ x πt i , ˜ u πt i , ˜ p πt i , ˜ q πt i )∆ t i + ˜ q πt i ∆ W t i , ˜ P πt i + i = ˜ P πt i − F ( t i , ˜ x πt i , ˜ u πt i , ˜ p πt i , ˜ q πt i , ˜ P πt i , ˜ Q πt i )∆ t i + ˜ Q πt i ∆ W t i , ˜ x π = x , ˜ p π = ˜ p , ˜ P π = ˜ P , H ( t i , ˜ x πt i , ˜ u πt i ) = max u ∈ U H ( t i , ˜ x πt i , u ) . (D.8)For this second-order case, we need to construct two neural networks at the same time, one forsimulating ˜ q π · and the other for simulating ˜ Q π · , (cid:40) ˜ q πt i = φ ( t i , ˜ x πt i ; θ ) , ˜ Q πt i = φ ( t i , ˜ x πt i ; θ ) . (D.9)These two networks have both one (1 + n ) -dim input layer, a ( n × d ) -dim and ( n × n × d ) -dimoutput layer, respectively. All parameters of the two networks are represented as θ and the lossfunction is defined asloss = 1 M M (cid:88) j =1 (cid:104) | − h x (˜ x πT ) − ˜ p πT | + | − h xx (˜ x πT ) − ˜ P πT | (cid:105) , (D.10)where M is the number of samples. The pseudo-code for the second-order case is given as following:25 lgorithm 5 Numerical algorithms for second-order case Input: The Brownian motion ∆ W ( t i ) , initial parameters ( θ , ˜ p ,π , ˜ P ,π ) , learning rate η ; Output: Couple precess (˜ x l,πt i , ˜ u l,πt i ) and ˜ p l,πT . for l = 0 to maxstep do ˜ x l,π = x , ˜ p l,π = ˜ p l,π , ˜ P l,π = ˜ P l,π ; for i = 0 to N − do ˜ q l,πt i = φ ( t i , ˜ x l,πt i ; θ l, ); ˜ Q l,πt i = φ ( t i , ˜ x l,πt i ; θ l, ); ˜ u l,πt i = arg max u ∈ U H ( t i , ˜ x l,πt i , u ); ˜ x l,πt i +1 = ˜ x l,πt i + b ( t i , ˜ x l,πt i , ˜ u l,πt i )∆ t i + σ ( t i , ˜ x l,πt i , ˜ u l,πt i )∆ W t i ; ˜ p l,πt i +1 = ˜ p l,πt i − H x ( t i , ˜ x l,πt i , ˜ u l,πt i , ˜ p l,πt i , ˜ q l,πt i )∆ t i + ˜ q l,πt i ∆ W t i ; ˜ P l,πt i + i = ˜ P l,πt i − F ( t i , ˜ x l,πt i , ˜ u πt i , ˜ p l,πt i , ˜ q l,πt i , ˜ P l,πt i , ˜ Q l,πt i )∆ t i + ˜ Q l,πt i ∆ W t i ; end for J (˜ u l,π ( · )) = 1 M M (cid:80) j =1 (cid:104) TN N − (cid:80) i =0 f ( t i , ˜ x l,πt i , ˜ u l,πt i ) + h (˜ x l,πT ) (cid:105) ; loss = 1 M M (cid:80) j =1 (cid:104) | − h x (˜ x l,πT ) − ˜ p l,πT | + | − h xx (˜ x l,πT ) − ˜ P l,πT | (cid:105) ; ( θ l +1 , ˜ p l +1 ,π , ˜ P l +1 ,π ) = ( θ l , ˜ p l,π , ˜ P l,π ) − η ∇ loss . end for Similar algorithm can be given when the state equation of a stochastic optimal control systemis described by a fully coupled FBSDE. References [1] Lev S Pontrygin. Mathematical theory of optimal processes. CRC Press , 1987.[2] Bellman and Richard. Dynamic programming and stochastic control processes. Informationand Control , 1(3):228–239, 1958.[3] Harold Kushner and Paul G. Dupuis. Numerical Methods for Stochastic Control Problems inContinuous Time . Springer, 2001.[4] Harold J Kushner. Numerical methods for stochastic control problems in continuous time. SIAM, J. Control Optim. , 28:888–1048, 1990.[5] Hongjie Dong and Nicolai V Krylov. The rate of convergence of finite-difference approxima-tions for parabolic bellman equations with lipschitz coefficients in cylindrical domains. AppliedMathematics and Optimization , 56(1):37–66, 2007.[6] Espen R Jakobsen. On the rate of convergence of approximation schemes for bellman equationsassociated with optimal stopping time problems. Mathematical Models and Methods in AppliedSciences , 13(05):613–644, 2003.[7] Nicolai V Krylov. The rate of convergence of finite-difference approximations for bellmanequations with lipschitz coefficients. Applied Mathematics and Optimization , 52(3):365–399,2005.[8] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming: an overview. Pro-ceedings of 1995 34th IEEE Conference on Decision and Control , 1:560–564, 1995.[9] Pardalos and M. Panos. Approximate dynamic programming: solving the curses of dimen-sionality. Optimization Methods and Software , 24(1):155–155, 2009.[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning . MIT Press, 2016. . 2611] Jiequn Han and E Weinan. Deep learning approximation for stochastic control problems. arXiv: 1611.07422 , 2016.[12] Weinan E, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods forhigh-dimensional parabolic partial differential equations and backward stochastic differentialequations. Communications in Mathematics and Statistics , 5(4):349–380, 2017.[13] Jiequn Han and Jihao Long. Convergence of the deep bsde method for coupled fbsdes. arXiv:1811.01165 , 2018.[14] Shaolin Ji, Shige Peng, Ying Peng, and Xichuan Zhang. Three algorithms for solving high-dimensional fully coupled fbsdes through deep learning. IEEE Intelligent Systems , 35(3):71–84,2020.[15] Maziar Raissi. Forward-backward stochastic neural networks: Deep learning of high-dimensional partial differential equations. arXiv: 1804.07010 , 2018.[16] Come Hure, Huyen Pham, Achref Bachouch, and Nicolas Langrene. Deep neural networksalgorithms for stochastic control problems on finite horizon, part i: convergence analysis. arXiv: 1812.04300v1 , 2018.[17] Achref Bachouch, Come Hure, Nicolas Langrene, and Huyen Pham. Deep neural networksalgorithms for stochastic control problems on finite horizon, part 2: numerical applications. arXiv: 1812.05916v1 , 2018.[18] Marcus A Pereira, Ziyi Wang, Ioannis Exarchos, and Evangelos A Theodorou. Neu-ral network architectures for stochastic control using the nonlinear feynman-kac lemma. arXiv:1902.03986v2 , 2019.[19] Shige Peng. A general stochastic maximum principle for optimal control problems. SiamJournal on Control and Optimization , 28(4):966–979, 1990.[20] Jiongmin Yong and Xunyu Zhou. Stochastic Controls-Hamiltonian System and HJB Equa-tions . Springer, 1999.[21] Qianxiao Li, Long Chen, Cheng Tai, and E Weinan. Maximum principle based algorithms fordeep learning. Journal of Machine Learning Research , 18(165):1–29, 2017.[22] Jin Ma, J-M Morel, and Jiongmin Yong. Forward-backward stochastic differential equationsand their applications . Number 1702. Springer Science & Business Media, 1999.[23] Dong C. Liu and Jorge Nocedal. On the limited memory bfgs method for large scale opti-mization. Mathematical Programming , 45(1-3):503–528, 1989.[24] Shige Peng and Zhen Wu. Fully coupled forward-backward stochastic differential equationsand applications to optimal control. Siam Journal on Control and Optimization , 37(3):825–843, 1999.[25] Shige Peng. Problem of eigenvalues of stochastic hamiltonian systems with boundary condi-tions. Stochastic Processes & Their Applications , 88(2):259–290, 2000.[26] Shige Peng. Backward stochastic differential equations and applications to optimal control. Applied Mathematics and Optimization , 27(2):125–144, 1993.[27] Mingshang Hu, Shaolin Ji, and Xiaole Xue. A global stochastic maximum principle for fullycoupled forward-backward stochastic systems. Siam Journal on Control and Optimization ,56(6):4309–4335, 2018.[28] Shige Peng.