Deep Learning for Constrained Utility Maximisation
Ashley Davey∗ and Harry Zheng†

∗ Department of Mathematics, Imperial College, London SW7 2BZ, UK. Email: [email protected]
† Corresponding author. Department of Mathematics, Imperial College, London SW7 2BZ, UK. Email: [email protected]

Abstract
This paper proposes two algorithms for solving stochastic control problems with deep reinforcement learning, with a focus on the utility maximisation problem. The first algorithm solves Markovian problems via the Hamilton-Jacobi-Bellman (HJB) equation. We solve this highly nonlinear partial differential equation (PDE) with a second order backward stochastic differential equation (2BSDE) formulation. The convex structure of the problem allows us to describe a dual problem that can either verify the original primal approach or bypass some of the complexity. The second algorithm utilises the full power of the duality method to solve non-Markovian problems, which are often beyond the scope of stochastic control solvers in the existing literature. We solve an adjoint BSDE that satisfies the dual optimality conditions. We apply these algorithms to problems with power, log and non-HARA utilities in the Black-Scholes, the Heston stochastic volatility, and path dependent volatility models. Numerical experiments show highly accurate results with low computational cost, supporting our proposed algorithms.
Keywords:
Stochastic Control, Deep Reinforcement Learning, Primal and Dual BSDEs, HJB Equation, Portfolio Optimisation
AMS MSC 2010:
1 Introduction

In this paper we propose algorithms that combine modern machine learning practices with theoretical stochastic control principles to solve a range of control problems with high accuracy, scalability with dimension, and low computational cost. There is a natural crossover between machine learning and stochastic control problems, as they both involve searching for features within a data set. For stochastic control the data set is a state process that is influenced by a controlling process that can be chosen with the aim to optimise some objective function. Often the optimal control is a function of the underlying state process, a so-called Markovian control. This is very similar to deep reinforcement learning, where the control is chosen as the output of a neural network which takes the dataset as input, and some loss function is minimised. It is in this area of crossover that this paper resides. We propose deep learning methods to solve the utility maximisation problem, as an application of a more general stochastic control solver, in a wide range of markets with arbitrary convex control constraints. Dynamic portfolio optimisation has been extensively studied within the field of mathematical finance, see [17, 25] for exposition.

A common drawback of numerical methods for stochastic control problems is their complexity with the dimension of the space in which either the state or control processes reside. Often some grid method or exhaustive search of the underlying state space is required to find optimal trajectories, which is computationally inefficient for high dimensional state spaces. It is claimed that machine learning methods provide a solution for this dimensionality problem [11]. Numerical results show efficient solving of PDEs with up to thousands of variables [1, 2, 7]. These PDE solvers form the basis of our generic stochastic control problem solver, which is the first algorithm presented in this paper. The problems are transformed into forward-backward stochastic differential equations (FBSDEs) using the Feynman-Kac formula and the Stochastic Maximum Principle (SMP). The PDE that we solve is the Hamilton-Jacobi-Bellman (HJB) equation, a highly nonlinear PDE that is often impossible to solve analytically, or even numerically if the dimensionality is high. This paper introduces a control variable into the PDE that is constrained to some convex set. This drastically increases the complexity and takes the problem out of the scope of [2, 7] unless the control can be found analytically. We show how machine learning can overcome this difficulty and produce highly accurate value approximations with comparatively low runtime.

The FBSDE based methods developed in this paper are direct extensions of the first and second order deep BSDE solvers introduced in [2, 7]. In these papers, a system of FBSDEs is solved by simulating all processes in the forward direction, driven by a neural network process which is optimised to ensure the terminal condition holds. The extension of this paper is to add an additional driving neural network process, representing a control process. The convenience of adapting this BSDE method to stochastic control problems is that the control process can be optimised using the value function approximations implied by the 2BSDE solver, via the Hamiltonian of the system.
In the 2BSDE solver of [2], there are two unknown processes that must be found with neural networks. This suggests that our method has three unknowns, but we can use the SMP to derive one of these processes and therefore introduce control without increasing the number of unknown processes.

There are some papers that describe generic algorithms for solving stochastic control problems in the literature. [12] directly maximises the gains function at the terminal time with machine learning and is applicable to many problems. However, the computation costs may be high, as one introduces a discretisation with many time steps or a high dimensional control space, and it only outputs the value at time 0. [15] finds a local approximation of the value function with machine learning. However, it only holds for Markovian problems and the iterative nature of this value function approximation may lead to error propagation. Instead of directly optimising the value function, the first method presented in this paper maximises the Hamiltonian associated with the system. There are two key benefits of doing this. Firstly, it reconciles the potential errors of [12, 15], where the algorithm gets stuck at a local optimiser. To find the value function (given a control) we ensure that the terminal condition of the HJB equation holds, which involves minimising the least squares error of the associated processes at the terminal time. The global minimum of this error is zero by construction, ensuring that we can directly assess the quality of our approximation for any generic control problem. Secondly, the control is optimised using the Hamiltonian of the system, which is a function of only what is known at one time. This optimisation is faster than backpropagating through a discretisation of the processes to compute the derivatives as in [12]. This is even more important when the control space is high dimensional, which is normally true in portfolio optimisation problems. To implement this Hamiltonian maximisation procedure, one needs to know derivatives of the value function, which can be computed accurately with deep learning based on the 2BSDE method in [2].

There are a few papers in the literature that also utilise the deep BSDE methods of [2, 7]. For example, [9, 24, 27] first find an explicit formula for the optimal control in terms of the value function and its derivatives and then substitute this control into the HJB equation and solve the resulting PDE with the deep 2BSDE method. However, these algorithms cannot be applied in full generality like our method. [9, 24] are reliant on the diffusion of the state process being independent of the control, which leads to a semi-linear PDE and a representation with a first order BSDE. This framework is inappropriate for portfolio selection problems as the control naturally appears in the diffusion term. [27] alleviates this restriction by deriving an explicit formula for the optimal control, which is required to be unconstrained. Our approach is more general in the sense that it does not require any analytical solutions. In the literature these other algorithms are called “deep 2BSDE controllers” since the BSDE solutions are used to form the control. Our algorithm is called the deep controlled 2BSDE method as each control and state pair defines its own 2BSDE which we try to solve. A primal-dual method is introduced in [14] for solving semi-linear PDEs. The deep 2BSDE method is used to find a lower bound for the solution.
An upper bound is found by a pathwise maximising control problem that is derived using the Legendre transform. The approach in [14] is not directly related to our notion of convex duality using martingale methods. There are also some papers on applications of deep neural networks in mathematical finance, including optimal stopping [3], asset pricing with transaction cost [10], and optimal transport [8].

The utility maximisation problem involves maximising concave functions, so we can naturally apply convex duality methods. Duality is a powerful tool that can be used in complete or incomplete markets with a range of convex constraints on the control process. The unique aspect of our work is the combination of deep learning methodology with convex duality. The benefits of introducing duality are threefold. Firstly, the dual problem is a minimisation problem, and solving it independently of the primal problem naturally leads to an upper bound of the value function. Sometimes the dual problem is easier to solve, since primal control constraints are characterised by penalties in the dual state process, leading the dual problem to be potentially unconstrained. Secondly, the BSDEs associated to the dual problem can be related to those of the primal problem, giving us the ability to compare our numerical results and check their accuracy. Again, as with the Hamiltonian maximisation, this benefits us since we know the error between the dual and primal processes should be zero, which allows us to easily measure the quality of our results. We may even skip one of the problems if it is difficult to solve, as solving one problem automatically gives us the solution to the other. Thirdly, inherent in the dual formulation are certain optimality conditions that are derived from the SMP. These conditions can be used separately from the HJB equation to find the optimal control. We use these conditions as the basis for our second algorithm, called the deep SMP method, which is valid for non-Markovian control problems with convex structure. This greatly increases the range of utility maximisation problems that can be solved using the methods of this paper. In a Markovian setting we can compare the two algorithms presented in the paper.

The convex duality method works for complete and incomplete markets. A common incomplete market model is the Heston stochastic volatility model, in which the underlying variance of the stock follows some mean-reverting SDE with its own source of randomness. This model may be seen as a 2-dimensional stock market, where only one asset is traded. This framework allows us to use our deep learning method to solve it. We compare our results to [23], in which tight bounds of the value function are derived for an unconstrained optimisation problem. [23] is reliant on a convenient guess of the dual control, so is not applicable to other models where such a guess is not available, unlike our method. Another multidimensional model we consider is a path-dependent volatility model. There are several recent papers on this topic in mathematical finance with machine learning, for example [5, 6, 22]. These papers include model calibration, hedging and pricing, and are not directly related to our work.

The main contributions of this paper are the following. In the deep controlled 2BSDE algorithm, we present a Markovian stochastic control problem solver that combines the HJB equation with the deep 2BSDE method, but without the need for an analytical formula for the optimal control.
We apply this algorithm to both the primal and dual problems, which gives us tight bounds on the value function. In the deep SMP method, we exploit the dual formulation of the utility maximisation problem to create a novel algorithm for non-Markovian optimisation problems using deep reinforcement learning. Instead of dealing with the gains function, we solve the adjoint BSDE and use machine learning to satisfy the dual optimality conditions. To the best of our knowledge, we are the first to use this methodology in the literature.

The remainder of this paper is outlined as follows. Section 2 formulates the utility maximisation problem and defines the dual problem. Section 3 describes the general algorithm for solving Markovian stochastic control problems using deep learning and the HJB equation, and applies the algorithm to portfolio optimisation with various utilities. Section 4 introduces an algorithm for solving non-Markovian utility maximisation problems using the SMP and applies this algorithm to problems where the volatility process is not deterministic, and even path-dependent. Section 5 discusses convergence of the deep controlled 2BSDE algorithm based on empirical evidence. Section 6 concludes the paper.

2 The Utility Maximisation Problem and Its Dual

In this section we describe the standard portfolio optimisation problem and its dual problem. Let W = (W(t))_{0≤t≤T} be a standard m-dimensional Brownian motion on the natural filtered probability space (Ω, F, (F_t)_{0≤t≤T}, P), augmented with all P-null sets, with T a fixed finite horizon. Consider an (m+1)-dimensional stock market, consisting of a risk-free bond with interest rate r(t) and a collection S of risky stocks defined by

dS(t) = diag(S(t)) ( µ(t) dt + σ(t) dW(t) ),   (2.1)

where diag(S(t)) is the m × m matrix with diagonal elements from S(t) and all other elements zero, and r, µ and σ are uniformly bounded, progressively measurable processes valued in R, R^m and R^{m×m}, respectively, with µ(t) and σ(t) representing the drifts and volatilities of the m stock prices at time t. The matrix σ(t) is invertible and σ^{-1} is uniformly bounded. An agent invests a proportion π(t) of the wealth X(t) in the risky stocks and the rest in the risk-free bond at time t ∈ [0, T]. The 1-dimensional wealth process X evolves as

dX(t) = X(t) ( r(t) + π(t)^⊤ b(t) ) dt + X(t) π(t)^⊤ σ(t) dW(t)   (2.2)

with X(0) = x, where π(t)^⊤ is the transpose of π(t) and b is the excess return process defined by b(t) = µ(t) − r(t) 1_m, where 1_m ∈ R^m is a vector of ones. We call π an admissible control if π is a progressively measurable process valued in K ⊂ R^m, a nonempty closed convex set, almost surely for almost every t ∈ [0, T], and there exists a unique strong solution X to the SDE (2.2). The set of all admissible controls π is denoted by A. The investor wishes to maximise the expected utility of the terminal wealth, that is, to find

V := sup_{π ∈ A} E[ U(X(T)) ],

where U : [0, ∞) → R is a utility function that is twice continuously differentiable, strictly increasing, strictly concave and satisfies the Inada conditions U'(0) = ∞ and U'(∞) = 0. We say that state and control processes X̂, π̂ are optimal if V = E[ U(X̂(T)) ].

We next construct the dual problem, see [21] for more details. The basis for duality is the Legendre-Fenchel transformation Ũ : (0, ∞) → R of the utility function U, defined by

Ũ(y) = sup_{x>0} { U(x) − xy },   y ≥ 0.
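To make the transform concrete, the following is the standard computation (a routine exercise, not taken from this paper) for the power utility U(x) = x^p / p with 0 < p < 1, which reappears in Example 3.6 below:

Ũ(y) = sup_{x>0} { (1/p) x^p − xy }.

The first order condition x^{p−1} = y gives the maximiser x* = y^{1/(p−1)}, hence

Ũ(y) = (1/p) y^{p/(p−1)} − y^{1/(p−1)} y = ( 1/p − 1 ) y^{p/(p−1)} = ((1 − p)/p) y^{−p/(1−p)},   y > 0,

which is exactly the dual power utility used later as a benchmark.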
The properties of U ensure that Ũ is well defined and finite for all y ∈ (0, ∞). The R-valued dual process Y is defined by

dY(t) = −Y(t) ( r(t) + δ_K(v(t)) ) dt − Y(t) ( θ(t) + σ(t)^{-1} v(t) )^⊤ dW(t)   (2.3)

with Y(0) = y, where y is nonnegative, θ(t) = σ(t)^{-1} b(t) is the market price of risk, δ_K is the support function of the set −K, defined by

δ_K(z) := sup_{π ∈ K} { −π^⊤ z },   z ∈ R^m,

and v is an R^m-valued dual control process which is progressively measurable and ensures there exists a unique strong solution Y to the SDE (2.3). Since XY is a super-martingale for any choice of π, v and y, we have the following weak duality relation:

V ≤ Ṽ(x) := inf_{y,v} { E[ Ũ(Y(T)) | Y(0) = y ] + xy }.   (2.4)

Solving the dual problem would give us an upper bound of the value function, which can be combined with a lower bound from the primal problem to produce a confidence interval for the value function. If we have equality in (2.4), then we can find the value function by alternatively solving the dual problem. In [21] it is shown that the strong duality property holds, that is, V = Ṽ(x).

Remark 2.1. If K = B(0, R) is a ball of radius R > 0 centred at the origin, then δ_K(z) = R|z|. If K is a closed convex cone, then δ_K(z) = 0 if z ∈ K̃ and δ_K(z) = ∞ if z ∉ K̃, where K̃ is the positive polar cone of K, defined by K̃ := { z ∈ R^m : π^⊤ z ≥ 0 for all π ∈ K }, which greatly simplifies the dual problem as the drift is independent of the dual control v if v(t) ∈ K̃. In particular, if K = R^m, then K̃ = {0}, which forces the dual control v = 0. For a general closed convex set K, the function δ_K is well defined but may be difficult to compute and cannot be implemented exactly in numerics due to the infinite penalty. Instead, so-called ‘soft’ constraints can be used, where a suitably large penalty function gives us sufficient confidence that our control process lies in the required space. Sometimes it is easier to implement a ‘hard’ constraint, where the control is projected onto the space that makes this function finite before optimisation, as in the cone case. With hard constraints, we can have 100% confidence that the process lies in the required space by construction.

3 Markovian Utility Maximisation

We assume in this section that the processes µ, r and σ are deterministic. The utility maximisation problem is then Markovian and coincides with finding u(0, x), where the value function u : [0, T] × R → R is defined by

u(t, x) := sup_{π ∈ A} E[ U(X(T)) | X(t) = x ].   (3.1)

Our machine learning methods focus on the PDE associated to this control problem. If u ∈ C^{1,2} and satisfies a growth condition, then u is a solution to the HJB equation

∂u/∂t (t, x) = − sup_{π ∈ K} F( t, x, π, D_x u(t, x), D_x² u(t, x) ),

where the Hamiltonian F of the problem is defined by

F(t, x, π, z, γ) = xz ( r(t) + π^⊤ b(t) ) + ½ x² γ |σ(t)^⊤ π|²,

with |·| denoting the Euclidean norm. The verification theorem [25] says that the optimal control π̂(t) can be determined at each time t ∈ [0, T] by maximising the Hamiltonian, that is,

π̂(t) ∈ arg max_{π ∈ K} F( t, X̂(t), π, D_x u(t, X̂(t)), D_x² u(t, X̂(t)) ),

where X̂ is the state process corresponding to this control.
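As a minimal numpy sketch of what this pointwise maximisation amounts to (our own illustration with hypothetical parameter values, not code from the paper), the primal Hamiltonian above can be maximised over a ball constraint K = B(0, R) by projected gradient ascent, given approximations z ≈ D_x u and γ ≈ D_x² u at a single (t, x). In the algorithm described below this step is performed instead by a neural network trained against the Hamiltonian.

import numpy as np

def hamiltonian(pi, x, z, gamma, r, b, sigma):
    # F(t, x, pi, z, gamma) for the wealth dynamics (2.2)
    return x * z * (r + pi @ b) + 0.5 * x**2 * gamma * np.sum((sigma.T @ pi)**2)

def project_ball(pi, R):
    norm = np.linalg.norm(pi)
    return pi if norm <= R else pi * (R / norm)

def maximise_hamiltonian(x, z, gamma, r, b, sigma, R=1.0, lr=0.1, steps=500):
    pi = np.zeros_like(b)
    for _ in range(steps):
        # gradient of F with respect to pi, holding z and gamma fixed
        grad = x * z * b + x**2 * gamma * (sigma @ sigma.T @ pi)
        pi = project_ball(pi + lr * grad, R)
    return pi

# toy usage with hypothetical numbers (gamma < 0 makes F concave in pi)
m = 3
rng = np.random.default_rng(0)
sigma = 0.2 * np.eye(m) + 0.05 * rng.random((m, m))
b = np.full(m, 0.05)
pi_hat = maximise_hamiltonian(x=1.0, z=1.0, gamma=-1.0, r=0.05, b=b, sigma=sigma)
print(pi_hat)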
We can use machine learning to find the optimal control this way, but first we must have sufficient information about the optimal state process, as well as the derivatives of the value function. We use a 2BSDE approximation method to find the value function and its derivatives, again using machine learning, as outlined in the next section.

To derive a dual HJB equation, consider a series of dual value problems parametrised by y ∈ R, given by

ũ(t, y) = inf_v E[ Ũ(Y(T)) | Y(t) = y ],   (3.2)

and again suppose that ũ ∈ C^{1,2} and satisfies a growth condition. It is shown in [19, Proposition 4.15] that, under the additional assumption that Ũ(Y(T)) is square integrable, there exist optimal controls ŷ, v̂ that satisfy

ŷ ∈ arg min_y { E[ Ũ(Ŷ(T)) | Ŷ(0) = y ] + xy },

v̂(t) ∈ arg min_v { −Ŷ(t) ( r(t) + δ_K(v) ) ∂ũ/∂y(t, Ŷ(t)) + ½ |θ(t) + σ(t)^{-1} v|² Ŷ(t)² ∂²ũ/∂y²(t, Ŷ(t)) }   (3.3)

for all t ∈ [0, T], where Ŷ is the state process driven by v̂. The optimality conditions on v come from the corresponding Hamiltonian of the dual control problem (3.2). The implicit relations between ŷ and v̂ make it difficult to find explicit analytical solutions.

The deep learning approaches for both the dual and primal problems are similar, so we present the algorithm in a more general form that covers both. We first describe the general control problem, then use the machine learning method to find the value function and optimal control. We present the algorithm for a general state dimension d ≥
1, as this is required in the following sections.

Let (Ω, F, (F_t)_{0≤t≤T}, P) be a filtered probability space with a fixed finite horizon T. Consider the F_t-adapted controlled diffusion process (X(t))_{0≤t≤T} on R^d, d ∈ N, defined by

dX(t) = b(t, X(t), π(t)) dt + σ(t, X(t), π(t)) dW(t)   (3.4)

with X(0) = x_0, where W is an R^n-valued standard Brownian motion for some n ∈ N and π = (π(t))_{0≤t≤T} is a progressively measurable process valued in K ⊂ R^m for m ∈ N. It is assumed that the drift and diffusion functions b : [0, T] × R^d × K → R^d and σ : [0, T] × R^d × K → R^{d×n} are deterministic and measurable, such that (3.4) admits a strong solution. Here X represents the state process and π the control process. We consider only admissible controls that satisfy some integrability condition [25]. This control is chosen to optimise the state process via maximising a reward that is defined using a terminal gain function g : R^d → R, a measurable function that is either lower bounded or satisfies some quadratic growth condition. We wish to find the value function

u(t, x) := sup_{π ∈ A} E[ g(X(T)) | X(t) = x ].   (3.5)
Remark 3.1. The utility function U is not defined on the whole space R, but only on (0, ∞). We take g(x) to be U(x) when x > 0 and 0 otherwise, ensuring that the quadratic growth or boundedness conditions are satisfied (if they are satisfied for U). It is irrelevant what value is chosen when x ≤ 0, as one can show that the state process X defined in (2.2) satisfies X(T) > 0 if and only if X(0) > 0 almost surely.

Suppose that u ∈ C^{1,2}, so that we can use the HJB equation and the SMP. The HJB equation tells us that u solves the following PDE

∂u/∂t (t, x) = − sup_{π ∈ K} F( t, x, π, D_x u(t, x), D_x² u(t, x) ),   (t, x) ∈ [0, T) × R^d,

with the terminal condition u(T, x) = g(x) for x ∈ R^d, where the Hamiltonian F is defined by

F(t, x, π, z, γ) = b(t, x, π)^⊤ z + ½ tr( σ(t, x, π) σ(t, x, π)^⊤ γ ).   (3.6)

In this setting we can introduce running gains or other time-based reward functions by adding some term f(t, x, π) to the Hamiltonian.

We can now describe the value function and its derivatives using a system of FBSDEs. Let x_0 ∈ R^d be our initial state. Fix an arbitrary admissible control process π and the corresponding state process X. Define new processes on [0, T] by

V(t) := u(t, X(t)),   Z(t) := D_x u(t, X(t)),   Γ(t) := D_x² u(t, X(t)).   (3.7)

By the regularity assumptions on u, these processes are continuous. Applying Itô's lemma to u and using the HJB equation, we get

dV(t) = [ F(t, X(t), π(t), Z(t), Γ(t)) − sup_{a ∈ K} F(t, X(t), a, Z(t), Γ(t)) ] dt + Z(t)^⊤ σ(t, X(t), π(t)) dW(t),

with the terminal condition V(T) = g(X(T)). Furthermore, the SMP relates the pair (Z, Γ) to an SDE via the generalised Hamiltonian H : [0, T) × R^d × K × R^d × R^{d×n} → R, given by

H(t, x, π, z, q) = b(t, x, π)^⊤ z + tr( σ(t, x, π)^⊤ q ).   (3.8)

Again, in the presence of running gains we add the corresponding f term here. The supremum term in the V dynamics makes it too complex to simulate directly. To avoid this, we assume the control we chose arbitrarily is actually an optimal control, so by definition the Hamiltonian terms cancel, and we are left with the following result.
Theorem 3.2. Suppose that u ∈ C^{1,2}([0, T) × R^d) ∩ C([0, T] × R^d), and there exists an optimal control π ∈ A with associated controlled diffusion X, defined by (3.4). Then there exist continuous processes (V, Z, Γ) solving the following 2BSDE:

dV(t) = Z(t)^⊤ σ(t, X(t), π(t)) dW(t),
dZ(t) = −D_x H( t, X(t), π(t), Z(t), Γ(t) σ(t, X(t), π(t)) ) dt + Γ(t) σ(t, X(t), π(t)) dW(t),   (3.9)

with the terminal conditions V(T) = g(X(T)) and Z(T) = D_x g(X(T)), where H is defined by (3.8) and the control π satisfies

F(t, X(t), π(t), Z(t), Γ(t)) = sup_{a ∈ K} F(t, X(t), a, Z(t), Γ(t)),   (3.10)

and F is defined by (3.6).

If we have a running gains function f, this would result in a drift term −f(t, X(t), π(t)) dt being added to the V dynamics. While the supremum term in the original dynamics of V makes its simulation complicated, the derivative term in the dynamics of Z can be easily computed by modern automatic differentiation methods. This theorem is presented for a maximisation problem but easily translates to the dual problem (3.2), which is a minimisation problem, by replacing supremums with infimums.

Before describing the algorithms used to solve this problem in detail, we first look at what we mean by a neural network. Suppose we wish to approximate a function φ : R^p → R^q for some p, q ∈ N. We use the following construction for the approximating function. We take several elements of R^p, say a batch of size k ∈ N. For our purposes, k represents how many Brownian paths we generate, and p = d represents a space vector in R^d. We also have a predetermined number of 'layers' L ∈ N and 'hidden nodes' ℓ ∈ N. Suppose we have a matrix X ∈ R^{k×p} as our input. Set N_0 = X; then for i = 1, ..., L, perform the update

N_i = h( N_{i−1} M_i + 1_{k,1} v_i ),

where M_1 ∈ R^{p×ℓ}, M_j ∈ R^{ℓ×ℓ} for j > 1, v_i ∈ R^{1×ℓ}, and 1_{k,1} is a k × 1 vector of ones, so that v_i is added to each vector in the batch. The function h : R^{k×ℓ} → R^{k×ℓ} applies element-wise the non-linear (or activation) function
x ↦ max(x, 0), called a rectified linear unit (ReLU) activation function. The choice of this function for this scheme is rather arbitrary, as there seems to be no clearly preferred function here. A comparison of popular activation functions is provided in Section 5. This non-polynomial mapping is essential in ensuring the density of these networks within the space of continuous functions [20]. Finally, we output

N^θ(X) = N_L M_{L+1} + 1_{k,1} v_{L+1},

where M_{L+1} ∈ R^{ℓ×q} and v_{L+1} ∈ R^{1×q}. Note that no non-linear function is applied at the final step, else we could only approximate non-negative functions. The parameter vector θ ∈ R^ρ representing our network is a parametrisation of the M_i and v_i matrices and is optimised at the training step. We made no assumptions on the regularity of φ, but our approximation is almost everywhere differentiable. In practice this is sufficient for gradient descent algorithms, in which derivatives of this network with respect to its parameters must be taken. We use the Python package
Tensorflow to implement this neural network procedure. This methodology reduces an optimisation over an infinite dimensional space to a finite dimensional one. However, the dimension can still be high. For this neural network, the parameter space is made up of L − 1 matrices in R^{ℓ×ℓ}, L matrices in R^{1×ℓ}, and one matrix in each of the spaces R^{p×ℓ}, R^{ℓ×q} and R^{1×q} respectively, which are flattened and combined into one single vector θ ∈ R^ρ. The parameter space dimension is therefore

ρ(p, q) = (L − 1) ℓ² + ℓ (L + p + q) + q,   (3.11)

quadratic in the number of hidden nodes, and linear in the number of layers and the input dimension. Associated to each neural network N^θ is a loss function L(θ), a function of the parameters forming the network. Finding the parameters that minimise this function equates to optimising the network itself. In our implementation, we use the ADAM algorithm [18], which is an adaptation of stochastic gradient descent, implemented in Tensorflow. We furthermore use mini-batches, which means the number of simulations we use to calculate the loss function is small, and batch normalisation, meaning that we scale the input to the neural network [16].

Now that we have set up the new problem (3.9)-(3.10), we look to solve it with machine learning techniques in the spirit of [2] (the algorithm presented therein is named the "deep 2BSDE" method, thus inspiring our adapted deep controlled 2BSDE method). The trick here is to simulate all processes in the forward direction, introducing new variables v_0 ∈ R and z_0 ∈ R^d, then moving forward through a discretisation (t_i)_{i=0}^N of [0, T]. We furthermore set Γ to be a neural network that at each time t_i takes the i-th state position, denoted by X_i, as input, since we are approximating D_x² u(t, X(t)) with this process. The control process π is also a neural network, taking the same state X_i as input at time step t_i, which means that we are searching for a Markovian control. To be specific, we introduce parameters θ_i ∈ R^{ρ(d,m)} and λ_i ∈ R^{ρ(d,d²)} and then use the following Euler-Maruyama scheme. Set X_0 = x_0, V_0 = v_0 and Z_0 = z_0; then for i = 0, ..., N − 1, set π_i = N^{θ_i}(X_i), Γ_i = N^{λ_i}(X_i) and

X_{i+1} = X_i + (t_{i+1} − t_i) b(t_i, X_i, π_i) + √(t_{i+1} − t_i) σ(t_i, X_i, π_i) ΔW_i,
V_{i+1} = V_i + √(t_{i+1} − t_i) Z_i^⊤ σ(t_i, X_i, π_i) ΔW_i,
Z_{i+1} = Z_i − (t_{i+1} − t_i) D_x H(t_i, X_i, π_i, Z_i, Γ_i σ(t_i, X_i, π_i)) + √(t_{i+1} − t_i) Γ_i σ(t_i, X_i, π_i) ΔW_i,   (3.12)

where ΔW_i is a standard multivariate normal N_m(0, I_m) random variable. We have used minimal notation for these processes for clarity, but one should remember that they depend explicitly on the choice of parameters λ_i, θ_i and the initial points v_0, z_0. Our algorithm involves using this scheme multiple times, but with different parameters. Once we have repeated this iteration up to i = N − 1, we reach t_N = T but find that we have not satisfied the terminal conditions. We want to choose the neural networks and initial start points to reduce the expected error, which amounts to minimising the loss function

L_1(Θ_BSDE) := E[ |V_N − g(X_N)|² + β |Z_N − D_x g(X_N)|² ],   (3.13)

where Θ_BSDE is a vector representation of (v_0, z_0) ∪ (λ_i)_{i=0}^{N−1}. This is referred to as the BSDE loss function in the sequel. In practice, we evaluate the sample average of this loss function. The coefficient β > 0 weights the terminal condition on Z relative to that on V.
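The following is a minimal numpy sketch of a single forward pass of the scheme (3.12) and the sample-average evaluation of the BSDE loss (3.13) for the one-dimensional wealth problem. It is our own illustration, not the paper's Tensorflow code: the "networks" pi_net and gamma_net are fixed placeholder callables, all parameter values are hypothetical, and D_x H is computed by hand here rather than by automatic differentiation.

import numpy as np

rng = np.random.default_rng(0)
T, N, n_paths = 0.5, 20, 4096
dt = T / N
r, b_excess, sigma, x0, p = 0.05, 0.05, 0.2, 1.0, 0.5

def g(x):                      # terminal gain: power utility, set to 0 for x <= 0
    return np.where(x > 0, np.maximum(x, 1e-12)**p / p, 0.0)

def dg(x):                     # D_x g
    return np.where(x > 0, np.maximum(x, 1e-12)**(p - 1), 0.0)

drift = lambda x, pi: x * (r + pi * b_excess)   # b(t, x, pi) of (2.2)
diff  = lambda x, pi: x * pi * sigma            # sigma(t, x, pi) of (2.2)

# placeholder "neural networks": a constant control and a crude Gamma guess
pi_net    = lambda x: 0.5 * np.ones_like(x)
gamma_net = lambda x: -1.0 / np.maximum(x, 1e-6)

def forward_pass(v0, z0):
    X = np.full(n_paths, x0)
    V = np.full(n_paths, v0)
    Z = np.full(n_paths, z0)
    for i in range(N):
        dW  = rng.normal(scale=np.sqrt(dt), size=n_paths)
        pi  = pi_net(X)
        Gam = gamma_net(X)
        sig = diff(X, pi)
        # D_x H for this model, with q = Gamma * sigma held fixed:
        # D_x H = (r + pi*b_excess) Z + pi*sigma * Gamma * sig
        dxH = (r + pi * b_excess) * Z + pi * sigma * Gam * sig
        V = V + Z * sig * dW
        Z = Z - dt * dxH + Gam * sig * dW
        X = X + dt * drift(X, pi) + sig * dW
    return X, V, Z

def bsde_loss(v0, z0, beta=1.0):   # sample average of (3.13)
    X, V, Z = forward_pass(v0, z0)
    return np.mean((V - g(X))**2 + beta * (Z - dg(X))**2)

print(bsde_loss(v0=2.0, z0=1.0))

In the actual algorithm pi_net and gamma_net are trainable networks (one pair per time step) and v_0, z_0 are trainable variables, all updated by ADAM against this loss and the control loss introduced next.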
An arbitrarily chosen control π is not optimal, so we must also maximise the loss functions

L_2(θ_i, i) := E[ F(t_i, X_i, π_i, Z_i, Γ_i) ]   (3.14)

for i = 0, ..., N −
1. This is referred to as the control loss function in the sequel. Again, we evaluate a sample average of this loss function. We proceed to optimise these loss functions in turn, starting with our arbitrary 0-superscript parameters. Now suppose we have just finished step j ∈ N. The first sub-step is to improve the terminal condition loss. We generate V_N, Z_N and X_N using Θ^j_BSDE and θ^j_i for i = 1, ..., N −
1, which are carried forward from the previous step. We then make an improvement using one step of the ADAM algorithm against the loss function L_1, resulting in the updated parameters Θ^{j+1}_BSDE. We also keep track of the moment estimates that are present in the algorithm [18].

The second half of the iteration step is to improve the control process. We first simulate the processes π and X using our previous parameter set θ^j = (θ^j_i)_{i=0}^{N−1}, then simulate Z and Γ using the updated parameters Θ^{j+1}_BSDE, and finally make the following improvement using one step of the ADAM algorithm. For each i = 0, ..., N − 1 we update θ^j_i using the ADAM algorithm against −L_2(·, i), resulting in θ^{j+1}_i. We use a negative here since ADAM is a minimisation algorithm, but we wish to maximise the Hamiltonian F at each time step. We track the moment estimates, move to step j + 2 and repeat this process.

These two steps can be repeated until the new parameters are sufficiently close to the old parameters after an update, at which point we deem the algorithm to have converged. We find that by making small increments at each step, and ensuring that the learning rate used in the ADAM algorithm is smaller for the control step than for the BSDE step, we obtain good convergence results. This agrees with the analytical convergence assumption of two-timescale stochastic approximation algorithms presented in [4].
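The alternating two-timescale update can be sketched as follows (our own schematic illustration, not the paper's code): one ADAM step on the BSDE loss, then one ADAM step on the negated control loss, with a smaller learning rate for the control step. The losses below are toy stand-ins; in the actual method they are the sample averages of (3.13) and (3.14) computed from the forward scheme (3.12).

import tensorflow as tf

bsde_params    = [tf.Variable(tf.zeros([4]), name="theta_bsde")]
control_params = [tf.Variable(tf.zeros([4]), name="theta_control")]

def loss_bsde():          # stand-in for L_1(Theta_BSDE), to be minimised
    return tf.reduce_sum((bsde_params[0] - 1.0) ** 2)

def hamiltonian_proxy():  # stand-in for L_2(theta_i, i), to be maximised
    return -tf.reduce_sum((control_params[0] - bsde_params[0]) ** 2)

opt_bsde    = tf.keras.optimizers.Adam(learning_rate=1e-2)   # faster timescale
opt_control = tf.keras.optimizers.Adam(learning_rate=1e-3)   # slower timescale

for step in range(2000):
    # sub-step 1: improve the terminal-condition (BSDE) loss
    with tf.GradientTape() as tape:
        l1 = loss_bsde()
    opt_bsde.apply_gradients(zip(tape.gradient(l1, bsde_params), bsde_params))

    # sub-step 2: improve the control by maximising the Hamiltonian, i.e. minimising its negative
    with tf.GradientTape() as tape:
        neg_l2 = -hamiltonian_proxy()
    opt_control.apply_gradients(zip(tape.gradient(neg_l2, control_params), control_params))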
Remark 3.3. At each time step we build 2 neural networks, which have 4 layers, with the two hidden layers consisting of d + 10 nodes. The parameters are optimised using the ADAM algorithm, using a batch size of 64 Brownian paths. We choose to initialise each parameter near 0, though a prior guess may be used, and the rate of convergence is naturally much higher when the initial guess is closer to the true value. Substituting p = d, L = 4 and ℓ = d + 10 into (3.11) leads to a parameter set of size ρ(d, q) = 4d² + 74d + 340 + q(d + 11), where q is the output dimension of the network. The number of neural network parameters is therefore at most cubic in the state dimension and linear in the control dimension.

For the dual problem of utility maximisation, we have a further variable to optimise, representing the initial state. We simply define a variable y with which the state process (which for the dual problem we denote by Y_i for i = 0, ..., N) begins, as opposed to starting at x_0, and then introduce a third loss function

L_3(y) = E[ Ũ(Y_N) | Y_0 = y ] + xy   (3.15)

against which to optimise this 1-dimensional parameter. This choice of loss function comes from the optimality condition (3.3) for y, replacing the expectation with a sample mean. We cannot use the value function approximation at time 0, given by V_0, since it does not depend explicitly on y. We could alternatively find a functional representation (in the style of [15]) of V_0 in terms of y, then optimise against this function. The effectiveness of this method is left to further research.

In this section we apply the deep controlled 2BSDE algorithm to solve utility maximisation, with varying utilities and control constraints. This is a control problem (3.5) with d = 1 and n = m. For the primal problem, we have the process X defined by (2.2), controlled by π, and the primal 2BSDE system of Theorem 3.2 denoted by (V_1, Z_1, Γ_1), where we take g(x) = U(x) 1_{x>0} (c.f. Remark 3.1) and define the drift and diffusion by

b(t, x, π) = x ( r(t) + π^⊤ b(t) ),   σ(t, x, π) = x π^⊤ σ(t).

The dual state process Y is defined in (2.3), controlled by (y, v), and the corresponding 2BSDE system of Theorem 3.2 is denoted by (V_2, Z_2, Γ_2), where we take g(y) = Ũ(y) and define the drift and diffusion by

b(t, y, v) = −y ( r(t) + δ_K(v) ),   σ(t, y, v) = −y ( θ(t) + σ(t)^{-1} v )^⊤.
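To make the specialisation concrete, here is a small numpy sketch (our own illustration, with hypothetical parameter values) of these drift and diffusion callables in the unconstrained case K = R^m, where δ_K = 0 and the dynamic dual control can be taken to be zero; they are exactly what is plugged into the generic forward scheme (3.12).

import numpy as np

m = 3
r = 0.05
mu = np.full(m, 0.08)
sigma = 0.2 * np.eye(m)
b_excess = mu - r                              # b(t) = mu(t) - r(t) 1_m
theta = np.linalg.solve(sigma, b_excess)       # market price of risk theta = sigma^{-1} b

def primal_drift(x, pi):                       # b(t, x, pi) = x (r + pi^T b)
    return x * (r + pi @ b_excess)

def primal_diffusion(x, pi):                   # sigma(t, x, pi) = x pi^T sigma (row vector)
    return x * (pi @ sigma)

def dual_drift(y, v=np.zeros(m)):              # b(t, y, v) = -y (r + delta_K(v)); delta_K = 0 here
    return -y * r

def dual_diffusion(y, v=np.zeros(m)):          # sigma(t, y, v) = -y (theta + sigma^{-1} v)^T
    return -y * (theta + np.linalg.solve(sigma, v))

# one Euler step of each state process under a hypothetical control pi
dt = 0.01
dW = np.sqrt(dt) * np.random.default_rng(1).normal(size=m)
pi = np.full(m, 0.3)
x_next = 1.0 + dt * primal_drift(1.0, pi) + primal_diffusion(1.0, pi) @ dW
y_next = 1.0 + dt * dual_drift(1.0) + dual_diffusion(1.0) @ dW
print(x_next, y_next)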
We use the following primal-dual relations, derived in [21].
Theorem 3.4. Suppose that ŷ and v̂ are optimal dual controls, with the corresponding dual state process Ŷ and 2BSDE solutions (V̂_2, Ẑ_2, Γ̂_2). Then the primal state process X̂ and 2BSDE solutions (V̂_1, Ẑ_1, Γ̂_1) corresponding to the optimal primal control π̂ satisfy

X̂(t) = −Ẑ_2(t),   V̂_1(t) = V̂_2(t) + X̂(t) Ŷ(t),   Ẑ_1(t) = Ŷ(t).   (3.16)

The converse relation also applies, in that primal solutions can be used to derive the corresponding dual solutions. The direction described here is often the most useful, as the dual problem can be simpler to solve than the primal problem. In addition to these processes, formulas for the optimal controls are also given, but we do not use them here.

Example 3.5 (Unconstrained Non-HARA Utility Problem). To show the effectiveness of the algorithm we apply it to a series of utility problems, with various constraints on the control space. We start with the unconstrained case K = R^m, corresponding to a complete market. We consider the utility problem where U has the following non-HARA form

U(x) = (1/3) H(x)^{-3} + H(x)^{-1} + x H(x),   x > 0,   (3.17)

where H(x) = √2 ( √(1 + 4x) − 1 )^{-1/2}. The corresponding dual utility is slightly simpler, and is given by Ũ(y) = (1/3) y^{-3} + y^{-1}. In particular we have K̃ = {0}, so the optimal dynamic dual control must be v(t) = 0. Suppose that r(s) = r, σ(s) = σ, µ(s) = µ and θ(s) = θ are constant. In this case it is shown in [21] that the optimiser ŷ and the processes Ŷ and Ẑ_2 admit closed-form expressions: ŷ is determined explicitly from the budget constraint x = −∂ũ/∂y(0, ŷ),

Ŷ(t) = ŷ exp( −( r + ½ |θ|² ) t − θ^⊤ W(t) ),

and X̂(t) = −Ẑ_2(t) = a_1 S_1(t) + a_2 S_2(t) for explicit constants a_1, a_2 (depending on ŷ, r, |θ|² and T) and geometric processes S_1, S_2 driven by W. Furthermore, the dual value function is given by

ũ(t, y) = (1/3) y^{-3} e^{(3r + 6|θ|²)(T − t)} + y^{-1} e^{(r + |θ|²)(T − t)}.

From these analytical formulae, the primal processes can be derived using Theorem 3.4. For our application, we take m = 5, T = 0. , x = 1, r = 0. , µ = 0. , and set σ as a random matrix of positive numbers between 0 and 0.2 to introduce correlation of the underlying Brownian motions, with 0.2 added to the diagonal to ensure invertibility.

Figure 3.1 shows the development of the loss functions for the primal and dual algorithms with 50 time steps. For the control, we wish for the Hamiltonian to be maximised, which means the derivative of the control loss function should go to 0. Hence, we plot the function j ↦ Σ_{i=0}^{N−1} Σ_{θ ∈ θ^j_i} |D_θ L_2(θ^j_i, i)|², where θ^j_i denotes the parameter set for the neural network forming π_i at iteration step j of the algorithm. For the dual graph we plot the derivative of L_3 with respect to y, since the above function is always zero as the dual control is forced to be zero, hence optimal, by the support function δ_K. Over the 10000 iteration steps both losses become small and settled, and therefore are deemed to have converged. In this method we choose an initial learning rate of 10^{-2} for the 2BSDE algorithm and the start point y in the dual algorithm, and 10^{-3} for the control part. Over the iterations, we divide these rates by 10 three times at regular intervals.
We repeat this choice of learning rates for all applications in the sequel, though the number of iteration steps may change.

Figure 3.2 shows two simulations of the processes V and Z (labelled as primal), along with their dual counterparts given by the relations (3.16), using 20 time steps. We plot these along with the explicit solution for reference. Both primal and dual approximations are very close to the solutions. The dual state process, appearing in the Z graph (but very close to the solution process), is closer than the corresponding primal process since there is no error for the dynamic dual control.

Figure 3.2: Two simulations of the value process and value-derivative process for the primal and dual deep controlled 2BSDE methods applied to the unconstrained non-HARA utility problem. The displayed dual processes are those implied by the duality relations (3.16). (a) Simulation of the process V. (b) Simulation of the process Z.

Figure 3.3 shows the evolution of the approximation of the initial value u(0, x) against the number of time steps for the primal and dual algorithms, where x = 1. The dual value is higher than the primal value, as expected, and the gap between these shrinks as the number of time steps increases. The final relative approximation error is 7 × 10^{-} % for the primal problem and 3 × 10^{-} % for the dual problem, showing high accuracy of the proposed scheme at time 0.

Figure 3.1: Loss functions against iteration step for the primal and dual deep controlled 2BSDE methods applied to the unconstrained non-HARA utility problem. (a) Loss function of the BSDE. (b) Derivative of the control loss function.
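The closed-form dual value used as a benchmark above is easy to sanity-check by Monte Carlo. The following sketch is our own illustration with hypothetical r, θ and y: assuming the dual utility Ũ(y) = y^{-3}/3 + y^{-1} and the dual dynamics (2.3) with v = 0, the dual state is the geometric Brownian motion Y(T) = y exp(−(r + |θ|²/2)T − θ^⊤ W(T)), and E[Ũ(Y(T))] equals the expression for ũ(0, y) given above.

import numpy as np

rng = np.random.default_rng(0)
m, T, y, r = 5, 0.5, 1.2, 0.05
theta = np.full(m, 0.2)
th2 = theta @ theta

W_T = rng.normal(scale=np.sqrt(T), size=(1_000_000, m))
Y_T = y * np.exp(-(r + 0.5 * th2) * T - W_T @ theta)

mc     = np.mean(Y_T**-3 / 3 + Y_T**-1)
closed = y**-3 / 3 * np.exp((3 * r + 6 * th2) * T) + y**-1 * np.exp((r + th2) * T)
print(mc, closed)   # the two values should agree to Monte Carlo accuracy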
Example 3.6 (Cone-Constrained Merton Problem). Now we consider the problem with power utility U(x) = x^p / p for some 0 < p < 1 and constraint set K = R^m_+, corresponding to no short selling. Let X̂ be the primal state process and (V̂_1, Ẑ_1, Γ̂_1) the solutions of (3.9) for the primal problem with optimal control π̂, and let Ŷ and (V̂_2, Ẑ_2, Γ̂_2) be the state process and solutions respectively for the dual problem with optimal controls v̂ and ŷ. The dual state process is a geometric Brownian motion given by

Ŷ(t) = Ŷ(0) exp( −∫₀ᵗ ( r(s) + ½ |θ̂(s)|² ) ds − ∫₀ᵗ θ̂(s)^⊤ dW(s) ),

where θ̂(t) = θ(t) + σ(t)^{-1} v̂(t), in terms of the unknown start point ŷ = Ŷ(0). The corresponding dual utility function is Ũ(y) = ((1 − p)/p) y^{p/(p−1)}. In [21] the following solutions are found:

ŷ = x^{p−1} exp( (1 − p) ∫₀ᵀ [ p/(2(p−1)²) |θ̂(s)|² − p/(p−1) r(s) ] ds ),

Ẑ_2(t) = −x exp( ∫₀ᵗ [ r(s) + (1 − 2p)/(2(1−p)²) |θ̂(s)|² ] ds + 1/(1−p) ∫₀ᵗ θ̂(s)^⊤ dW(s) ).   (3.18)

The dual value function is given by

ũ(t, y) = Ũ(y) exp( ∫ₜᵀ [ p/(2(p−1)²) |θ̂(s)|² − p/(p−1) r(s) ] ds ).   (3.19)

By (3.3), v̂ is chosen such that |θ̂| is minimised. This deterministic optimal control can be found numerically using Python packages such as Scipy. For our implementation, we take T = 0. , x = 1. , m = 50, r(t) = 0.05 and p = . We define the drift as µ(t)_i = 0.04 + sin(πt + A_i)/ , where A_i is randomly generated for each i = 1, ..., 50. We choose this function so that in the unconstrained case it would be optimal to short sell at some points. We set σ(t) to be a constant matrix with 0.4 on the diagonal and 0.2 everywhere else.
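A possible realisation of these market coefficients is sketched below (our own reading of the setup, not the paper's code; the divisor of the sinusoid and the distribution of the phases A_i are assumptions, since they are not legible in this copy).

import numpy as np

m = 50
rng = np.random.default_rng(7)
A = rng.uniform(0.0, 2 * np.pi, size=m)          # random phases, one per asset (distribution assumed)
scale = 10.0                                     # assumed divisor of the sinusoid

def mu(t):
    # time-dependent drift; unconstrained optimal portfolios would short sell at some times
    return 0.04 + np.sin(np.pi * t + A) / scale

sigma = np.full((m, m), 0.2) + 0.2 * np.eye(m)   # 0.4 on the diagonal, 0.2 elsewhere
r = 0.05

def excess_return(t):
    return mu(t) - r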
Remark 3.7. To ensure our controls π and v are in the correct set K = K̃ = R^m_+, we apply the function x ↦ max(0, x) element-wise to the output of their respective neural networks. This function is an almost everywhere differentiable surjection from R^m to R^m_+, which is sufficient for using a gradient descent method to find the optimal control. The mapping is not injective, but this is not an issue as the exact output of the neural network is not relevant: we care only about the control obtained by projecting onto the control space.

Figure 3.4 shows the results of applying the primal and dual 2BSDE methods to this problem. We use N = 10 time steps and run the algorithm for 100000 steps, notably more than for the lower-dimensional unconstrained problem. This increases the run time to 5457 seconds. We start the BSDE and control learning rates at 0.01 and 0.001 respectively, then reduce these by a factor of 10 half way through training. We run both algorithms at the same time, which saves time as we can use the same generated Brownian motions for each algorithm. We see less convergent structure at the beginning of training, perhaps due to bad initial guesses for the neural networks. Indeed, we see that the value for the primal functions stays nearly constant until the loss function drops to a certain level, when the processes are accurate enough that the value approximation can move towards the analytical solution. The dual algorithm behaves similarly, but with more noise. At the end of training we have relative errors of 0.1% and 0.05% for the primal and dual algorithms respectively. This error is a few orders of magnitude higher than in the unconstrained case due to the increase in dimension.

Figure 3.4: Approximation of the value function and BSDE loss against iteration step for the primal and dual deep controlled 2BSDE methods applied to the cone-constrained power utility problem. (a) Value approximation. (b) BSDE loss.

Figure 3.3: Approximation and relative error of the value function against number of time steps N for the primal and dual deep controlled 2BSDE methods applied to the unconstrained non-HARA utility problem. (a) Value approximation. (b) Relative error of value approximation.

Example 3.8 (Ball Constrained Log Utility Problem). In this example we consider the log utility problem when the control process is constrained to the ball B(0, R) of radius R > 0 in R^m. This corresponds to the investor not wanting to invest more than a certain proportion of their wealth in risky stocks. In this case, the support function δ_K(z) = R|z| does not vanish for all admissible dual controls. Therefore, the SDE (2.3) is harder to solve. We recall from (3.3) that in general this problem is also difficult to solve analytically because of the dependence of the dual controls ŷ and v̂ on each other. However, we can exploit the multiplicative property of the log function to decouple these controls and obtain analytical solutions, which we describe here. To this end, let U(x) = log(x) be our utility function, for which the dual utility is Ũ(y) = −(1 + log(y)). Since the dual process Y is a geometric Brownian motion, we have

E[ Ũ(Y(T)) ] = −1 − log(Y(0)) + ∫₀ᵀ ( r(t) + R|v(t)| + ½ |θ(t) + σ(t)^{-1} v(t)|² ) dt.

The measurable process

v̂(t) := arg min_{v ∈ R^m} { R|v| + ½ |θ(t) + σ(t)^{-1} v|² }
is an optimal control, and can be found using existing deterministic convex optimisation algorithms. Independently of this, the dual control ŷ is found via

ŷ ∈ arg min_{y>0} { xy − log(y) + ∫₀ᵀ ( r(t) + δ_K(v̂(t)) + ½ |θ(t) + σ(t)^{-1} v̂(t)|² ) dt },

yielding ŷ = 1/x. The value function is then given by

u(0, x) = log(x) + ∫₀ᵀ ( r(t) + R|v̂(t)| + ½ |θ(t) + σ(t)^{-1} v̂(t)|² ) dt.   (3.20)

For our application, we shall take R = 1. , T = 0. , x = 5. , m = 20, r(t) = 0.05
and µ(t) = 0. . We define the volatility as a diagonal matrix with i-th diagonal element equal to σ(t)_i = sin(πt + A_i)/10 for some uniform (0, π) random number A_i. We choose this function so that in the unconstrained case it would be reasonable to invest more than we are allowed at points when the volatility is low. The volatility is non-negative for all t ∈ [0, T], so the problem is well posed. To ensure that our primal control is in the correct set, we use a penalty function method in the spirit of Remark 2.1. In this case we do not constrain π, but instead we alter the loss function (3.14) to be

L_2(θ_i, i) := E[ F(t_i, X_i, π_i, Z_i, Γ_i) − β max(0, |π_i| − R)² ],

which heavily penalises the control for straying outside of K. We can also describe this as a running cost for the control problem, which would be added to the Hamiltonian. The parameter β > 0 is chosen suitably large. If instead we use a hard constraint, mapping the control into B(0, R) via a transformation based on x ↦ e^{−|x|}, we achieve a relative error of 0.8% with the same runtime, as opposed to 0.7% using soft constraints. We do not use such a function for the dual control, as it is never infinitely penalised for its position.

Table 1 shows the evolution of the approximated value function for the primal and dual algorithms, compared with the numerical solution 1.64755570 found using (3.20). The algorithm is run using N = 10 time steps, and the value approximation, relative error and (combined) runtime are recorded. We again perform both algorithms simultaneously using the same Brownian motion simulations. The value approximation initially moves away from the solution, but then converges. For the primal algorithm, the value moves away over the first 1000 steps, then goes to the solution. For the dual algorithm, the value does not move as far from the solution, but takes longer to do so, only starting to move towards the solution at 10000 steps. The dual value initially dips below the solution, then increases, which is why the relative error is lower at 1000 steps than at the start.

Step    Primal Value   Rel-Err (%)    Dual Value     Rel-Err (%)    Time (s)
1       9.92774e-01    3.97426e+01    2.50412e+00    5.19898e+01    128
1000    4.63730e-02    9.71853e+01    1.57275e+00    4.54015e+00    183
5000    2.27800e-01    8.61735e+01    1.80325e+00    9.44972e+00    405
10000   1.39359e+00    1.54147e+01    2.01337e+00    2.22035e+01    701
12000   1.62170e+00    1.56918e+00    1.98620e+00    2.05544e+01    815
18000   1.63545e+00    7.34658e-01    1.68103e+00    2.03169e+00    1149
19000   1.63546e+00    7.34163e-01    1.65375e+00    3.76032e-01    1204
20000   1.63543e+00    7.35726e-01    1.65361e+00    3.67465e-01    1259

Table 1: Value approximation and runtime at certain iteration steps for the primal and dual deep controlled 2BSDE methods applied to the ball-constrained log utility problem.

4 Non-Markovian Utility Maximisation

In this section we consider the case in which r, µ and σ may not be deterministic. The utility problem and its dual problem then become non-Markovian. This means we cannot write the value function as a measurable function of state and time nor apply the dynamic programming principle, hence the HJB equation is invalid. However, we can still employ the SMP, and it is shown in [21] that there exist certain optimality conditions for a non-Markovian utility maximisation problem, which we use to define a new algorithm, named the deep SMP method. In some of the examples we consider in this section, the problem can be made Markovian by introducing some new artificial structure to the problem. In these cases, we will compare the accuracy and runtime of both our algorithms.

Consider an agent who invests in the market (2.1) in full generality.
We again consider the utility problem, written in this setting as

u(t) = sup_{π ∈ A} E[ U(X(T)) | F_t ].

We cannot use Itô's formula for this value process as it cannot be written as a measurable function of the wealth process X(t), defined as in (2.2) and controlled by π, but we can still use the SMP. For the primal problem we have the following adjoint BSDE:

dP_1(t) = −[ ( r(t) + π(t)^⊤ b(t) ) P_1(t) + Q_1(t)^⊤ σ(t)^⊤ π(t) ] dt + Q_1(t)^⊤ dW(t)   (4.1)

with the terminal condition P_1(T) = −U'(X(T)). Similarly, we can define the dual state Y as per (2.3), controlled by (y, v). The dual adjoint equation is the following BSDE:

dP_2(t) = [ ( r(t) + δ_K(v(t)) ) P_2(t) + Q_2(t)^⊤ ( θ(t) + σ(t)^{-1} v(t) ) ] dt + Q_2(t)^⊤ dW(t)   (4.2)

with the terminal condition P_2(T) = −Ũ'(Y(T)). Furthermore, we have the duality relations P_2(t) = X(t) and P_1(t) = −Y(t) almost surely, for almost every t ∈ [0, T], by (3.16). We now state the optimality conditions for the dual problem in terms of these processes.

Theorem 4.1 ([21], Theorem 11). Let (y, v) be an admissible dual control. Then (y, v) is optimal if and only if the solutions (P_2, Q_2) of (4.2) satisfy the following:
1. P_2(0) = x,
2. (σ(t)^⊤)^{-1} Q_2(t) / P_2(t) ∈ K,
3. P_2(t) δ_K(v(t)) + Q_2(t)^⊤ σ(t)^{-1} v(t) = 0 for all t ∈ [0, T] almost surely.
Remark 4.2. The corresponding optimality condition for a primal control π with corresponding state process X and adjoint BSDE solutions (P_1, Q_1) is

[π(t) − a]^⊤ [ X(t) σ(t) ( P_1(t) θ(t) + Q_1(t) ) ] ≥ 0

for all t ∈ [0, T] and a ∈ K almost surely. This is a much more complex condition, and its implementation would require some grid-based method to consider a sufficiently large number of points a ∈ K, which would be computationally costly, especially in higher dimensions and for unbounded (but still constrained) K. For this reason, we work with the dual conditions and then use the primal-dual relations to derive the necessary primal processes.

The machine learning methodology to solve this problem is as follows. First, we simulate the dual processes in the forward direction using the Euler-Maruyama scheme. We start the process P_2 at P_2(0) = x to ensure the first condition is satisfied. This condition is why the forward simulation method appears naturally suited to this problem. We use neural networks to define the processes v and Q_2, then update the values of Y and P_2 accordingly. To be specific, we use the following scheme. Set Y(0) = y, P_2(0) = x; then for i = 0, ..., N − 1,

Q_2(i) = P_2(i) σ(t_i)^⊤ h_K( N^{θ^Q_i}(Y(i)) ),
v(i) = N^{θ^v_i}(Y(i)),
Y(i+1) = Y(i) − (t_{i+1} − t_i) Y(i) ( r(t_i) + δ_K(v(i)) ) − √(t_{i+1} − t_i) Y(i) ( θ(t_i) + σ(t_i)^{-1} v(i) )^⊤ ΔW_i,
P_2(i+1) = P_2(i) + (t_{i+1} − t_i) [ r(t_i) P_2(i) + Q_2(i)^⊤ θ(t_i) ] + √(t_{i+1} − t_i) Q_2(i)^⊤ ΔW_i,

where θ^v_i, θ^Q_i ∈ R^{ρ(d,m)} are parameters for neural networks. It is assumed that we can sample the random coefficient processes at any time in an efficient way. We define the function h_K : R^m → K to be some differentiable surjective function. Using this function, we have by construction that condition 2 is automatically satisfied. The two neural networks used here both output to m-dimensional space, unlike the general algorithm outlined in Section 3.2, where the second neural network is valued in R^{d×d}. This increases the number of parameters, since for the utility maximisation problem we have d = 1. However, as we move into the case of random coefficients that satisfy their own SDEs, this dimensionality also increases. We have the variables y, Θ_Q := (θ^Q_i)_{i=0}^{N−1} and Θ_v := (θ^v_i)_{i=0}^{N−1}. These are optimised against the following loss functions:

L_y(y) = E[ Ũ(Y(N)) | Y(0) = y ] + xy,
L_Q(Θ_Q) = E[ |Ũ'(Y(N)) + P_2(N)|² ],
L_v(θ^v_i) = E[ |P_2(i) δ_K(v(i)) + Q_2(i)^⊤ σ(t_i)^{-1} v(i)|² ].

In practice, for each function we take a sample mean to approximate the expected value or the expected squared error. The first loss function represents the optimality condition (3.3) of y. The second loss function represents the need to ensure that the terminal condition P_2(N) = −Ũ'(Y(N)) holds. Finally, instead of using the HJB condition from (3.3), which may not hold in a non-Markovian setting, the third loss function arises from the third optimality condition of Theorem 4.1.

The optimisation process is very similar to the deep controlled 2BSDE method as outlined in Section 3.2, so we only provide a brief summary here.
For each iteration step we simulate the processes Y, v, P_2 and Q_2 using our existing (or, for the first step, randomly initialised) parameter sets y, Θ_Q and Θ_v and a batch of simulated Brownian motion paths. We then perform one iteration of the ADAM algorithm to update each parameter set in turn using their respective loss functions. In between the updates, we simulate the processes again using the new parameters, but do not generate new Brownian motions. When the 3 sets are updated we generate new Brownian motion paths and repeat. This process continues until the difference made by updating is sufficiently small. For our case, we start with a learning rate of 0.01 for all parameters, which decreases as the number of iterations increases.

In this algorithm we no longer have a variable representing the value at time 0. We therefore need to perform Monte Carlo simulations, forming bounds of the value function. An upper bound can be found using the dual state process Y and a lower bound can be found using the primal state process X = P_2, as per (3.16). The estimates outputted are given by

u_low := (1/M) Σ_{i=1}^M U(P_2^i(N)) ≤ u(0, x) ≤ (1/M) Σ_{i=1}^M Ũ(Y^i(N)) + xy =: u_high   (4.3)

for some large batch of size M ∈ N, where each path is indexed by the superscript i. The start point y is not random (and hence is not included in the sum nor indexed by i) as it is optimised using a deterministic loss function. The inequalities are valid up to discretisation and simulation errors, and as the number of time steps increases the gap u_high − u_low decreases, and the optimal value is eventually contained in this gap.
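The following numpy sketch (our own illustration, not the paper's code) shows one forward pass of the deep SMP scheme and the sample averages of the three losses, for the ball constraint K = B(0, R) with log utility, so that δ_K(v) = R|v|, Ũ(y) = −(1 + log y) and Ũ'(y) = −1/y. The "networks" q_net and v_net are fixed placeholder callables and all parameter values are hypothetical; in the actual method they are trainable networks updated by ADAM against these losses.

import numpy as np

rng = np.random.default_rng(0)
m, N, T, n_paths = 3, 20, 0.5, 4096
dt = T / N
R, x0, r = 1.0, 1.0, 0.05
sigma = 0.2 * np.eye(m)
theta = np.full(m, 0.25)
sigma_inv = np.linalg.inv(sigma)

def h_K(z):   # almost-everywhere differentiable surjection of R^m onto the ball B(0, R)
    nrm = np.linalg.norm(z, axis=-1, keepdims=True)
    return z * np.minimum(1.0, R / np.maximum(nrm, 1e-12))

q_net = lambda y: np.tile(0.3 * np.ones(m), (len(y), 1))   # placeholder for N^{theta_Q}
v_net = lambda y: np.tile(0.1 * np.ones(m), (len(y), 1))   # placeholder for N^{theta_v}

def forward(y0):
    Y = np.full(n_paths, y0)
    P = np.full(n_paths, x0)          # P_2(0) = x enforces condition 1
    L_v_terms = []
    for i in range(N):
        dW = rng.normal(scale=np.sqrt(dt), size=(n_paths, m))
        Q = P[:, None] * (h_K(q_net(Y)) @ sigma)     # Q_2 = P_2 sigma^T h_K(.), so condition 2 holds
        v = v_net(Y)
        # residual of condition 3, accumulated for the control loss L_v
        resid = P * R * np.linalg.norm(v, axis=1) + np.sum(Q * (v @ sigma_inv.T), axis=1)
        L_v_terms.append(np.mean(resid**2))
        drift_dir = theta + v @ sigma_inv.T           # theta + sigma^{-1} v, per path
        Y = Y - dt * Y * (r + R * np.linalg.norm(v, axis=1)) - Y * np.sum(drift_dir * dW, axis=1)
        P = P + dt * (r * P + Q @ theta) + np.sum(Q * dW, axis=1)
    return Y, P, L_v_terms

y0 = 1.0
Y_N, P_N, L_v_terms = forward(y0)
L_y = np.mean(-(1.0 + np.log(np.maximum(Y_N, 1e-12)))) + x0 * y0    # sample average of L_y
L_Q = np.mean((-1.0 / np.maximum(Y_N, 1e-12) + P_N) ** 2)           # terminal-condition loss
print(L_y, L_Q, np.mean(L_v_terms))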
Remark 4.3. We have tried this method for a cone-constrained power utility problem as in Example 3.6, and the value function approximation seems quite accurate. However, using the hard constraints to define the process Q_2 does not give us the required dual control. In this example, say in dimension 1 (so K = [0, ∞)), the dual control condition becomes Q_2(t) σ(t)^{-1} v(t) = 0. The algorithm therefore determines that we must have v̂ = 0. However, this is not the optimal control, since the solution for Q_2 also depends on v. Indeed, if we compare the BSDE (4.2) with the dual version of the 2BSDE (3.9) and the solutions (3.7) (replacing u with ũ), we deduce that Q_2(t) = −D_y ũ(t, Y(t)) Y(t) ( θ(t) + σ(t)^{-1} v(t) ), and the optimal control is v̂ = max(0, −σ(t)θ(t)). Taking Q_2 to be a neural network that takes (Y, v) as input, together with using soft constraints, has shown some promising results in one dimension, with the correct control process being recovered. However, it is not consistent, since we make one neural network the input to another, leading to convergence issues. In a Markovian setting we could use the form of the optimal solution above, but this does not generalise. Work is still to be done in this area.

4.2 Numerical Examples

Example 4.4 (Power Utility in a Stochastic Volatility Model). In this example we consider an extension of our utility model to stochastic volatility. In this model, our agent invests in a 2-dimensional stock market, out of which one asset S is traded, and is defined by

dS(t) = S(t) ( r + A v(t) ) dt + S(t) √v(t) dW_s(t).

The constants r, A > 0 are fixed and the variance process v has the dynamics

dv(t) = κ ( θ − v(t) ) dt + ξ √v(t) dW_v(t).
4.2 Numerical Examples

Example 4.4 (Power Utility in a Stochastic Volatility Model). In this example we consider an extension of our utility model to stochastic volatility. In this model, our agent invests in a 2-dimensional stock market, out of which one asset S is traded, defined by
$$dS(t) = S(t)\big(r + Av(t)\big)\,dt + S(t)\sqrt{v(t)}\,dW^{s}(t).$$
The constants r, A > 0, and the variance process v has the dynamics
$$dv(t) = \kappa\big(\theta - v(t)\big)\,dt + \xi\sqrt{v(t)}\,dW^{v}(t).$$
The constants κ, θ, ξ > 0, and the Brownian motions W^s and W^v are correlated with correlation parameter ρ ∈ [−1, 1]. If the agent invests a proportion π(t) of their wealth X in this market, their wealth process evolves as
$$dX(t) = X(t)\big(r + \pi(t)Av(t)\big)\,dt + X(t)\pi(t)\sqrt{v(t)}\,dW^{s}(t).$$
We will first consider this as a 2-dimensional Markovian problem, then as a random coefficient problem.
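As an illustration of these dynamics, the following is a minimal NumPy sketch of an Euler-Maruyama simulation of (S, v, X) under a fixed proportional strategy π. The parameter values, the fixed strategy and the full truncation used to keep the square root well defined are assumptions made only for the sketch.

```python
import numpy as np

# assumed model parameters (for illustration only)
r, A, kappa, theta_v, xi, rho = 0.05, 0.5, 1.0, 0.1, 0.2, -0.5
T, N, n_paths = 0.5, 50, 10_000
dt = T / N
pi = 0.4                                  # fixed proportion of wealth held in the stock

rng = np.random.default_rng(0)
S = np.full(n_paths, 1.0)
v = np.full(n_paths, 0.1)
X = np.full(n_paths, 1.0)

for _ in range(N):
    z1 = rng.standard_normal(n_paths)
    z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n_paths)   # correlated drivers
    dWs, dWv = np.sqrt(dt) * z1, np.sqrt(dt) * z2
    vol = np.sqrt(np.maximum(v, 0.0))     # full truncation of the variance
    S = S * (1.0 + (r + A * v) * dt + vol * dWs)
    X = X * (1.0 + (r + pi * A * v) * dt + pi * vol * dWs)
    v = v + kappa * (theta_v - v) * dt + xi * vol * dWv

p = 0.5
print("E[U(X(T))] under the fixed strategy:", np.mean(np.maximum(X, 1e-12) ** p / p))
```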
In the Markovian problem, the 2-dimensional process (X, v) is the state process controlled by the K ⊂ R valued process π. Consider the maximum utility problem given by
$$u(t, x, v) = \sup_{\pi \in \mathcal{A}} \mathbb{E}\big[U(X(T)) \,\big|\, (X(t), v(t)) = (x, v)\big]$$
for some utility function U. This setup fits in the framework of (3.5) with d = 2, n = 2, m = 1 and g(x, v) := U(x)1_{x>0} for (x, v) ∈ R². The drift and diffusion functions are given by
$$b(t, x, v, \pi) = \big(x(r + \pi A v),\; \kappa(\theta - v)\big)^\top, \qquad \sigma(t, x, v, \pi) = \begin{pmatrix} x\pi\sqrt{v} & 0 \\ \rho\xi\sqrt{v} & \sqrt{1-\rho^2}\,\xi\sqrt{v} \end{pmatrix}.$$
It is a usual assumption that the drift and diffusion functions are Lipschitz functions of the state and control variables, but this is not the case here due to the square root appearing in the diffusion. This may invalidate certain convergence patterns that we have seen for the utility maximisation problem thus far.

The dual state process consists of the volatility process v and an R₊-valued process Y that is controlled by some R²-valued process (η, γ), given by
$$dY(t) = -Y(t)\bigg[\big(r + \delta_K(\eta(t))\big)\,dt + \bigg(\frac{\eta(t)}{\sqrt{v(t)}} + \rho\gamma(t) + A\sqrt{v(t)}\bigg)\,dW^{s}(t) - \gamma(t)\,dW^{v}(t)\bigg]. \qquad (4.4)$$
As before, the dual process starts at some undetermined y > 0, which forms the dual control together with η and γ, and is chosen such that the process XY is a supermartingale. Now consider the dual control problem
$$\tilde{u}(t, y, v) = \inf_{\gamma,\eta}\mathbb{E}\big[\tilde{U}(Y(T)) \,\big|\, (Y(t), v(t)) = (y, v)\big]. \qquad (4.5)$$
In our numerical example we consider the unconstrained case K = R. As before, the presence of the δ_K term means that we must have η = 0. This puts the problem in the same form as [23], where only one process γ needs to be found.
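Since any admissible dual control and starting point give an upper bound on the primal value by weak duality, the dual dynamics (4.4) with η = 0 can be simulated directly for a trial γ, as in the following NumPy sketch. The constant trial control γ, the starting point y and all parameter values are assumptions for illustration; they are not claimed to be optimal.

```python
import numpy as np

# assumed parameters for the unconstrained case (eta = 0); gamma is a constant trial control
r, A, kappa, theta_v, xi, rho = 0.05, 0.5, 1.0, 0.1, 0.2, -0.5
T, N, n_paths, x0, y0, gamma, p = 0.5, 50, 10_000, 1.0, 0.5, 0.1, 0.5
dt = T / N
U_tilde = lambda y: (1.0 - p) / p * y ** (-p / (1.0 - p))   # convex dual of the power utility

rng = np.random.default_rng(1)
Y = np.full(n_paths, y0)
v = np.full(n_paths, 0.1)
for _ in range(N):
    z1 = rng.standard_normal(n_paths)
    z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n_paths)
    dWs, dWv = np.sqrt(dt) * z1, np.sqrt(dt) * z2
    vol = np.sqrt(np.maximum(v, 0.0))
    # Euler step of (4.4) with eta = 0: dY = -Y[ r dt + (rho*gamma + A sqrt(v)) dW^s - gamma dW^v ]
    Y = Y * (1.0 - r * dt - (rho * gamma + A * vol) * dWs + gamma * dWv)
    v = v + kappa * (theta_v - v) * dt + xi * vol * dWv

upper = np.mean(U_tilde(np.maximum(Y, 1e-12))) + x0 * y0    # weak-duality upper bound for this (y0, gamma)
print("dual upper bound:", upper)
```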
This setup fits in our framework for a dual problem with d = 2, m = 1, and n = 2. There is a slight difference in that only one of the initial state start points is a variable, and the other is fixed. This is not a multidimensional duality, as the process v appears in both state dynamics, but a single-dimensional duality between X and Y. The drift and diffusion functions are given by
$$\tilde{b}(t, y, v, \gamma) = \big(-yr,\; \kappa(\theta - v)\big)^\top, \qquad \tilde{\sigma}(t, y, v, \gamma) = \begin{pmatrix} -yA\sqrt{v} & \sqrt{1-\rho^2}\,y\gamma \\ \rho\xi\sqrt{v} & \sqrt{1-\rho^2}\,\xi\sqrt{v} \end{pmatrix}.$$
When we use the power utility U(x) = x^p/p for 0 < p < 1, exact solutions can be obtained for this problem. A solution is derived in [23] by assuming that the value function u has the form
$$u(t, x, v) = \frac{x^p}{p}\exp\big(C(t) + D(t)v\big) \qquad (4.6)$$
in terms of two unknown functions of time C and D. Two Riccati ODEs are derived for these functions, and exact solutions found, which we use as a benchmark. For our application we take r = 0. , ρ = −0. , p = 0. , κ = 1, θ = 0. , ξ = 0. , A = 0. , x = 1, and v = 0. . Table 2 shows the accuracy and runtime of the deep controlled 2BSDE method for a range of terminal times T and numbers of time steps N. We contrast this with the results of [23], in which, with T = 1 and N = 100, upper and lower bounds for the value were found with relative errors of 0.003% in 2240 seconds and 27600 seconds respectively, albeit for slightly different parameter choices. The upper bound was found faster using a convenient guess of the form of the dual control γ. We do not use 100 time steps for our algorithms due to computer memory constraints; in fact, the 50 time step algorithm is also slower than expected due to a memory bottleneck. Results are reported for T = 0.2, 0.5 and 1.0.
While the errors for the primal algorithm reduce as N increases, the value does not lie within the duality gap for some terminal times. This may be down to a discretisation error, and would be fixed by taking more time steps. We see convergence of the approximations to the true value as N increases, and this convergence appears to be linear, as we saw in Example 3.5. This convergence is explored further in Section 5.

T     N    Primal Value   Rel-Err (%)   Time (s)   Dual Value    Rel-Err (%)   Time (s)
0.2   5    2.03310E+00    1.01654E-02   83         2.03378E+00   4.35850E-02   98
0.2   10   2.03301E+00    5.74161E-03   206        2.03344E+00   2.66047E-02   239
0.2   20   2.03296E+00    3.27670E-03   591        2.03314E+00   1.23181E-02   604
0.2   50   2.03289E+00    2.81206E-04   2849       2.03295E+00   2.89678E-03   2622
0.5   5    2.07616E+00    2.75979E-02   152        2.07969E+00   1.97584E-01   183
0.5   10   2.07582E+00    1.14780E-02   362        2.07747E+00   9.08529E-02   411
0.5   20   2.07572E+00    6.36082E-03   927        2.07670E+00   5.38088E-02   1030
0.5   50   2.07562E+00    1.55757E-03   5070       2.07600E+00   1.98421E-02   4181
1.0   5    2.13328E+00    4.32410E-02   154        2.14847E+00   6.68466E-01   170
1.0   10   2.13377E+00    1.99985E-02   362        2.14264E+00   3.95457E-01   400
1.0   20   2.13399E+00    9.92254E-03   918        2.13715E+00   1.38171E-01   981
1.0   50   2.13403E+00    7.97535E-03   4230       2.13548E+00   6.01058E-02   4395

Table 2: Accuracy and runtime for the deep controlled 2BSDE method for various terminal times T and numbers of time steps N, applied to the unconstrained power utility problem in a stochastic volatility model.

Now, let us treat v as a random process, meaning that we can only sample v, and not the underlying Brownian motion W^v. In this setting both θ(t) = A√v(t) and σ(t) = √v(t) are random processes for the stock market, and we still wish to apply Theorem 4.1. We cannot directly apply this theorem, as v is not measurable with respect to the filtration generated by W^s. To circumvent this problem, we adopt a market completion method [17]: we first construct an artificial stock driven by W^v, add it to our market, and then make it unavailable for trading. To be specific, we define a new asset price S^v by
$$dS^{v}(t) = S^{v}(t)\big(r\,dt + dW^{v}(t)\big),$$
which ensures that v is adapted to the natural filtration of the Brownian motions driving the market. For simplicity we assume that m = k = 1, but this analysis extends naturally to multiple dimensions. We invest in this market with a 2-dimensional portfolio, restricted to the set K × {0}, ensuring that our wealth process X is still defined as in (2.2), controlled by a 1-dimensional K-valued process π. Now the dual state process (4.4) is controlled by y and a 2-dimensional dynamic control (η, γ) valued in K̃ × R. We would like to show that we can take γ = 0, as this would return the dual problem to the 1-dimensional control case, independent of W^v. We want this because we cannot 'see' W^v, and can only sample v directly at any time. To this end, let (P, Q) be the dual adjoint solutions. Fix a time t ∈ [0, T] and let Q(t) = (q₁, q₂). By Theorem 4.1 (2), the component of the implied primal control corresponding to the untraded artificial asset must vanish, hence q₂ = 0. This means that condition (3) becomes
$$P(t)\,\delta_K(\eta(t)) + q_1\,\sigma(t)^{-1}\eta(t) = 0,$$
which is independent of γ(t), leading to a free control. We can therefore take γ = 0, removing the dependence of the dual state process on W^v.

Table 3 shows the accuracy and runtime of the deep SMP method with a range of terminal times and step sizes, when we run it on the same problem formulation as above. Here K = R, so we take the constraining function h_K to be the identity.
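In this random coefficient formulation only the paths of v are observed, and θ(t) = A√v(t), σ(t) = √v(t) enter the deep SMP forward simulation as sampled coefficients. A minimal sketch of generating such coefficient paths is given below; for simplicity v is sampled on its own here, whereas in the correlated case (ρ ≠ 0) it should be simulated jointly with the Brownian motion W^s that drives Y and P. All numerical values are assumptions.

```python
import numpy as np

# sample paths of the variance process v, then use theta(t) = A sqrt(v(t)) and
# sigma(t) = sqrt(v(t)) as the (random) coefficients in the deep SMP forward pass
A, kappa, theta_v, xi = 0.5, 1.0, 0.1, 0.2
T, N, n_paths = 0.5, 20, 512
dt = T / N
rng = np.random.default_rng(2)

v = np.full(n_paths, 0.1)
theta_path, sigma_path = [], []
for _ in range(N):
    vol = np.sqrt(np.maximum(v, 0.0))
    theta_path.append(A * vol)      # market price of risk theta(t_i)
    sigma_path.append(vol)          # volatility sigma(t_i)
    v = v + kappa * (theta_v - v) * dt + xi * vol * np.sqrt(dt) * rng.standard_normal(n_paths)
# theta_path[i] and sigma_path[i] would replace the constants theta and sigma
# in the forward_pass sketched earlier, one value per path and time step
```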
We note that the upper bounds are an order of magnitude less accurate than the lower bounds for this problem. This indicates that the implied primal process approximations are more accurate than the dual approximations in this algorithm. Comparing these results to Table 2, the SMP method gives roughly the same accuracy of bounds as the dual deep controlled 2BSDE method, but with a lower computation time. In particular, the algorithm here only needed 5000 iteration steps, compared to 10000 for both of the 2BSDE algorithms. Remark 4.5.
We have applied this methodology to investing in the Vasicek model where the interestrate is a random process rather than the volatility. In the same way as for the Heston model weconsider it both as a 2-dimensional state process and a 1-dimensional state process with randomcoefficients. We find similar results as for the Heston model, so we omit this example for brevity.
T     N    u_low         Rel-Diff (%)   u_high        Rel-Diff (%)   Time (s)
0.2   5    2.03271E+00   9.17595E-03    2.03348E+00   2.90360E-02    103
0.2   10   2.03280E+00   4.40981E-03    2.03323E+00   1.64607E-02    273
0.2   20   2.03284E+00   2.60193E-03    2.03305E+00   7.49044E-03    781
0.2   50   2.03289E+00   5.76610E-05    2.03295E+00   2.54626E-03    4788
0.5   5    2.07512E+00   2.25897E-02    2.07974E+00   2.00124E-01    102
0.5   10   2.07538E+00   9.97694E-03    2.07738E+00   8.63624E-02    301
0.5   20   2.07563E+00   2.04008E-03    2.07656E+00   4.67255E-02    778
0.5   50   2.07556E+00   1.16593E-03    2.07595E+00   1.74975E-02    3818
1.0   5    2.13025E+00   1.84896E-01    2.14786E+00   6.40234E-01    104
1.0   10   2.13232E+00   8.78697E-02    2.13995E+00   2.69313E-01    263
1.0   20   2.13489E+00   3.25920E-02    2.13750E+00   1.54844E-01    761
1.0   50   2.13440E+00   9.35931E-03    2.13551E+00   6.11910E-02    4604

Table 3: Accuracy and runtime for the deep SMP method for various terminal times T and numbers of time steps N, applied to the unconstrained power utility problem in a stochastic volatility model.

Example 4.6 (Heston with non-HARA utility). We now look at a higher dimensional example with no explicit solution. Consider the same stochastic volatility market, but with multiple stocks. For simplicity we assume all stocks are independent with the same parameters, but this is not a necessary restriction. We let U have the form (3.17). For this model we do not have a convenient ansatz for u to obtain a solution, so we simply compare our three algorithms. Say we have n stocks. We then have control processes π(t) ∈ Rⁿ and Γ(t) ∈ R^{(n+1)×(n+1)} for the primal 2BSDE algorithm, v(t) ∈ R^{2n} and Γ(t) ∈ R^{(n+1)×(n+1)} for the dual 2BSDE algorithm, and v(t), Q(t) ∈ Rⁿ for the SMP algorithm.

For this example let us take T = 0. , N = 20, and n = 10. We then have m = 10, d = 11, and k = 20 in (2.2). For the 2BSDE algorithms, the combined dimensions of the unknown processes are 131 for the primal and 141 for the dual. We consider the unconstrained problem, so we can set η = 0. For the SMP algorithm we only need to find processes in 20 dimensions. We run the algorithm for 10000 steps. Table 4 shows the results. We give the value approximation for the 2BSDE algorithms, and the average of the upper and lower bounds for the SMP method. We also provide the runtime to show the efficiency of the algorithms as a whole, but the individual differences should be taken with a pinch of salt due to implementation differences, see Remark 4.7. The runtime has increased from the previous low dimensional example, but sub-linearly. The relative difference between approximations has increased to about 1%.

Method   Value         Time (s)
Primal   2.61059E+00   2316
Dual     2.63758E+00   5636
SMP      2.62514E+00   1840

Table 4: Value approximations for the 2BSDE and SMP algorithms applied to the unconstrained non-HARA utility problem in a stochastic volatility model.
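To make the parameter counts above concrete, a small helper along the following lines reproduces the quoted dimensions for n = 10, under our reading of the counts (state dimension d = n + 1); it is an illustrative check rather than code from the paper.

```python
def unknown_process_dims(n: int) -> dict:
    """Dimensions of the processes approximated at each time step, for n independent stocks."""
    d = n + 1                          # state: wealth X plus n variance processes
    return {
        "primal 2BSDE": n + d * d,     # control pi in R^n and second-order process Gamma in R^{d x d}
        "dual 2BSDE": 2 * n + d * d,   # dual control (eta, gamma) in R^{2n} and Gamma in R^{d x d}
        "deep SMP": 2 * n,             # dual control v in R^n and adjoint Q in R^n
    }

print(unknown_process_dims(10))        # {'primal 2BSDE': 131, 'dual 2BSDE': 141, 'deep SMP': 20}
```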
Remark 4.7. There are a number of factors that affect the algorithm run times. For example, we see that the dual algorithm takes twice as long as the primal, but we did not simplify the algorithm to reflect the constraints of the problem; making this change would have reduced the run time. Additionally, the Z processes defined in (3.9) involve a differentiation element. This differentiation is performed automatically by the software (automatic differentiation), as opposed to using an explicit derivative as in the BSDE (4.2). This does not affect the accuracy, but does increase the run time.

Example 4.8 (Power Utility in a Path Dependent Volatility Model). In this example not only is the control problem non-Markovian, but it cannot be made Markovian by including additional information in our state space. This means that we cannot apply our 2BSDE methods.

This example is motivated by the idea that the volatility of a stock should depend on its value: if the stock is doing well, the volatility is assumed to be lower. This may be due to infrequent trading of the stock, or the stability of the large company to which the stock is assigned. We consider an extreme example, where there are two levels of volatility. The normal, higher level is maintained until the stock reaches its historical maximum, at which point it instantly transitions to the lower volatility.

Consider the market of Section 2, where σ is a function of the stock history. For each t ∈ [0, T], let σ(t) be a diagonal matrix with i-th diagonal entry σ_i(t) defined as
$$\sigma_i(t) = \begin{cases} \sigma_{\mathrm{low}}, & S_i(t) < \sup_{s \le t} S_i(s), \\ \sigma_{\mathrm{high}}, & S_i(t) = \sup_{s \le t} S_i(s), \end{cases}$$
where σ_low > σ_high are constants and S_i is the i-th stock price; the subscripts refer to the level of the stock (below or at its running maximum) rather than the size of the volatility. In this model, the volatility of a stock is inversely related to the ratio of its value to its running maximum. This makes the diffusion coefficient of the wealth process X a function of the entire path of S, instead of its value at a single time. This path dependence ensures that the utility maximisation problem is truly non-Markovian, in the sense that it cannot be made Markovian by the introduction of a new variable. We do not need to apply an artificial stock market argument here, as the volatility is a measurable function of the adapted stock market process up to time t, hence is progressively measurable.

For our application, we consider a 2-dimensional stock market with cone constraint K = R²₊. The constraining function we use in this case is h_K(x) = x² (applied componentwise). We take σ_low = 0. , σ_high = 0. , r(t) = 0. , µ(t) = 0. , and p = 0.5.
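As an illustration of the path dependence, a minimal NumPy sketch of the stock and wealth simulation under this volatility rule might look as follows, for one stock and a fixed proportional strategy, assuming the standard wealth dynamics dX = X[(r + π(µ − r))dt + πσ dW]. All numerical values and the fixed strategy are assumptions made for the sketch; note the paper's naming convention, under which σ_low (used below the maximum) is the larger value.

```python
import numpy as np

# assumed values: two volatility levels, drift, rate, strategy and time grid
sigma_low, sigma_high, mu, r, pi = 0.3, 0.1, 0.05, 0.02, 0.5
T, N, n_paths = 0.5, 50, 10_000
dt = T / N
rng = np.random.default_rng(3)

S = np.full(n_paths, 1.0)
S_max = S.copy()                   # running maximum of each stock path
X = np.full(n_paths, 1.0)

for _ in range(N):
    at_max = S >= S_max            # volatility drops to sigma_high when the stock sits at its maximum
    sigma = np.where(at_max, sigma_high, sigma_low)
    dW = np.sqrt(dt) * rng.standard_normal(n_paths)
    S = S * (1.0 + mu * dt + sigma * dW)
    X = X * (1.0 + (r + pi * (mu - r)) * dt + pi * sigma * dW)
    S_max = np.maximum(S_max, S)

p = 0.5
print("E[U(X(T))] for the fixed strategy:", np.mean(np.maximum(X, 1e-12) ** p / p))
```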
Table 5 shows the duality gap when this algorithm is run for varying terminal times and numbers of time steps. We run the algorithms for 10000 iteration steps. The 'Diff' and 'Rel-Diff' columns show the absolute and relative distances between the upper and lower approximations respectively. We see in all cases that the duality gap reduces as the number of time steps increases. We do not have a benchmark to compare these results to, as the path dependence and incompleteness of the problem make it very difficult to solve analytically.

T     N    u_low         u_high        Diff          Rel-Diff (%)
0.2   5    2.01057E+00   2.01066E+00   8.99222E-05   4.47246E-03
0.2   10   2.01054E+00   2.01063E+00   8.55058E-05   4.25287E-03
0.2   20   2.01057E+00   2.01063E+00   6.00701E-05   2.98772E-03
0.2   50   2.01061E+00   2.01062E+00   1.30984E-05   6.51462E-04
0.5   5    2.02652E+00   2.02682E+00   3.03732E-04   1.49879E-02
0.5   10   2.02654E+00   2.02676E+00   2.11323E-04   1.04278E-02
0.5   20   2.02648E+00   2.02665E+00   1.73093E-04   8.54166E-03
0.5   50   2.02637E+00   2.02650E+00   1.31536E-04   6.49120E-03
1.0   5    2.05333E+00   2.05421E+00   8.87694E-04   4.32320E-02
1.0   10   2.05337E+00   2.05397E+00   5.93055E-04   2.88821E-02
1.0   20   2.05327E+00   2.05368E+00   4.16821E-04   2.03005E-02
1.0   50   2.05307E+00   2.05339E+00   3.20542E-04   1.56128E-02

Table 5: Duality gap for the deep SMP method for various terminal times T and numbers of time steps N, applied to the cone-constrained power utility problem with path dependent volatility.

Here we compare the results of small changes to the methodology of the algorithm. We consider the primal deep controlled 2BSDE algorithm for the unconstrained Merton problem, with T = 1.0 and m = 10 unless stated otherwise, and the other coefficients chosen as in Example 3.5. These tests were performed on an HP ZBook Studio G4 with a 2.9 GHz Intel i7 CPU, with no GPU acceleration.

The first comparison we consider is the activation function used within the neural networks (cf. the function h in the description of the algorithm). Table 6 shows the results of this test, where we take N = 20. The choice of activation function is not important for this particular problem. All the functions are valid for this algorithm, since they are continuous functions that are not polynomials, see [20].

Name       Function                 Rel-Err (%)    Time (s)
ReLU       x ↦ max(0, x)            3.57771E-03    352.7
softplus   x ↦ log(1 + e^x)         3.57250E-03    352.0
tanh       x ↦ tanh(x)              3.44745E-03    352.8
sigmoid    x ↦ e^x / (e^x + 1)

Table 6: Accuracy and runtime for different choices of activation function.

The next comparison is the dimension m of the stock market, for three values of m and a range of numbers of time steps N. Figure 5.1 shows the results of this test. The first graph shows the error of approximation against N^{-1}, and the second shows runtime against N. We see the order O(N^{-1}) convergence for each dimension, but the error increases with dimension. Therefore, to achieve the same accuracy, a larger number of time steps would be needed for a higher dimension. The dimensionality complexity of this algorithm is hence tied to the complexity with respect to the number of time steps N, which looks from the graph to be of order O(N). The time difference between the three dimensions is minimal. However, this is for a set number of iteration steps, and for even higher dimensions more steps may be needed, which would increase the runtime compared to lower dimensional models.

Figure 5.1: Accuracy and runtime of the primal deep controlled 2BSDE method applied to unconstrained power utility problems with different dimensions. (a) Approximation error of methods. (b) Runtime of methods.
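For reference, the per-time-step sub-networks used in these comparisons could be built along the following lines, with the activation function (compared in Table 6 above) left as a configurable choice. The layer widths and the use of batch normalisation follow common practice for deep BSDE solvers [16]; they are assumptions rather than the exact architecture used in the paper, and the names are hypothetical.

```python
import torch

def subnet(dim_in: int, dim_out: int, width: int = 16, activation: str = "tanh") -> torch.nn.Module:
    """Feed-forward sub-network for a single time step, with a configurable activation."""
    act = {"relu": torch.nn.ReLU, "softplus": torch.nn.Softplus,
           "tanh": torch.nn.Tanh, "sigmoid": torch.nn.Sigmoid}[activation]
    return torch.nn.Sequential(
        torch.nn.BatchNorm1d(dim_in),        # normalise the state input
        torch.nn.Linear(dim_in, width), act(),
        torch.nn.Linear(width, width), act(),
        torch.nn.Linear(width, dim_out),
    )

# example: a control network pi_i(x, v) for the 2-dimensional Markovian Heston problem
pi_net = subnet(dim_in=2, dim_out=1, activation="relu")
```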
Finally, we consider the effect of the terminal time T on the relative error of the output. Since nothing else changes in the algorithm, we have found that changing T while keeping the number of time steps N fixed does not change the runtime. However, Figure 5.2 shows an increasing error with T as it increases from 0.2 to 1.0. Both graphs show the same data, the left using a log scale for the y axis and the right squaring the x axis, to show the order O(T) error implied by the data. We display this data for several numbers of time steps, from 5 to 20. It is notable that even increasing the number of time steps by the same factor does not compensate for the error caused by increasing the terminal time: for example, the relative error when we use (T, N) = (0. , 5) is 0.05%, whereas when we use (T, N) = (0. , 10) it is 0.12%, which is just over double the error. This O(T) versus O(N^{-1}) behaviour of the error may be why an assumption of T being 'small' is used in the convergence analysis of the first order BSDE solver, see [13].

Figure 5.2: Accuracy of the primal deep controlled 2BSDE method using different terminal times and numbers of time steps, applied to the unconstrained power utility problem. Both graphs display the same data, but the left uses a log scale for the y axis, and the right a squared scale for the x axis. (a) Log-scaled approximation error. (b) Square-scaled approximation error.
Conclusion
In this paper we have studied a variety of constrained utility maximisation problems, including some that are incomplete or non-Markovian, and introduced two algorithms, called the deep controlled 2BSDE method and the deep SMP method, for solving such problems, which involve simulating a system of BSDEs of first or second order. We use these equations, together with the optimality conditions for the optimal control, to find the value function of the problem. We apply these algorithms to a range of utility functions and closed convex control constraints, and find highly accurate solutions, with order O(N^{-1}) convergence in terms of the number of time steps of the discretisation. We derive tight bounds on the value function at time 0 using the convex duality method, which results in a primal problem and a dual problem, both of which can be solved with our algorithms. There remain many interesting outstanding problems, such as convergence analysis, duality theory for multidimensional state processes, and the efficiency of the algorithm. We leave these and others for future research.

References

[1] C. Beck, S. Becker, P. Cheridito, A. Jentzen, and A. Neufeld. Deep splitting method for parabolic PDEs. arXiv:1907.03452, 2019.
[2] C. Beck, W. E, and A. Jentzen. Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. Journal of Nonlinear Science, pages 1-57, 2017.
[3] S. Becker, P. Cheridito, and A. Jentzen. Deep optimal stopping. Journal of Machine Learning Research, 20:1-25, 2019.
[4] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Springer, 2009.
[5] H. Buehler, L. Gonon, J. Teichmann, and B. Wood. Deep hedging. Quantitative Finance, pages 1-21, 2019.
[6] J. De Spiegeleer, D. B. Madan, S. Reyners, and W. Schoutens. Machine learning for quantitative finance: Fast derivative pricing, hedging and fitting. Quantitative Finance, 18(10):1635-1643, 2018.
[7] W. E, J. Han, and A. Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349-380, Dec 2017.
[8] S. Eckstein and M. Kupper. Computation of optimal transport and related hedging problems via penalization and neural networks. Applied Mathematics & Optimization, 2019.
[9] I. Exarchos and E. A. Theodorou. Stochastic optimal control via forward and backward stochastic differential equations and importance sampling. Automatica, 87:159-165, 2018.
[10] L. Gonon, J. Muhle-Karbe, and X. Shi. Asset pricing with general transaction costs: Theory and numerics. arXiv:1905.05027, 2020.
[11] P. Grohs, F. Hornung, A. Jentzen, and P. Von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv:1809.02362, 2018.
[12] J. Han. Deep learning approximation for stochastic control problems. arXiv:1611.07422, 2016.
[13] J. Han and J. Long. Convergence of the deep BSDE method for coupled FBSDEs. arXiv:1811.01165, 2018.
[14] P. Henry-Labordere. Deep primal-dual algorithm for BSDEs: Applications of machine learning to CVA and IM. SSRN 3071506, 2017.
[15] C. Huré, H. Pham, A. Bachouch, and N. Langrené. Deep neural networks algorithms for stochastic control problems on finite horizon, part 1: Convergence analysis. arXiv:1812.04300, 2018.
[16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning (ICML), 37:448-456, 2015.
[17] I. Karatzas and S. E. Shreve. Methods of Mathematical Finance. Springer, 1998.
[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
[19] C. Labbé and A. J. Heunis. Conjugate duality in problems of constrained utility maximization. Stochastics, 81(6):545-565, 2009.
[20] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861-867, 1993.
[21] Y. Li and H. Zheng. Dynamic convex duality in constrained utility maximization. Stochastics, 90(8):1145-1169, 2018.
[22] S. Liu, C. W. Oosterlee, and S. M. Bohte. Pricing options and computing implied volatilities using neural networks. Risks, 7(1):16, 2019.
[23] J. Ma, W. Li, and H. Zheng. Dual control Monte-Carlo method for tight bounds of value function under Heston stochastic volatility model. European Journal of Operational Research, 280(2):428-440, 2020.
[24] M. Pereira, Z. Wang, and E. A. Theodorou. Neural network architectures for stochastic control using the nonlinear Feynman-Kac lemma. arXiv:1902.03986, 2019.
[25] H. Pham. Continuous-Time Stochastic Control and Optimization with Financial Applications. Springer, 2009.
[26] S. Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747, 2016.
[27] Z. Wang, M. A. Pereira, and E. A. Theodorou. Deep 2FBSDEs for systems with control multiplicative noise. arXiv:1906.04762, 2019.