Deep Learning for Constrained Utility Maximisation
Ashley Davey∗ and Harry Zheng†

∗ Department of Mathematics, Imperial College, London SW7 2BZ, UK. Email: [email protected]
† Corresponding author. Department of Mathematics, Imperial College, London SW7 2BZ, UK. Email: [email protected]

Abstract
This paper proposes two algorithms for solving stochastic control problems with deep reinforcement learning, with a focus on the utility maximisation problem. The first algorithm solves Markovian problems via the Hamilton-Jacobi-Bellman (HJB) equation. We solve this highly nonlinear partial differential equation (PDE) with a second order backward stochastic differential equation (2BSDE) formulation. The convex structure of the problem allows us to describe a dual problem that can either verify the original primal approach or bypass some of the complexity. The second algorithm utilises the full power of the duality method to solve non-Markovian problems, which are often beyond the scope of stochastic control solvers in the existing literature. We solve an adjoint BSDE that satisfies the dual optimality conditions. We apply these algorithms to problems with power, log and non-HARA utilities in the Black-Scholes, the Heston stochastic volatility, and path dependent volatility models. Numerical experiments show highly accurate results with low computational cost, supporting our proposed algorithms.
Keywords:
Stochastic Control, Deep Reinforcement Learning, Primal and Dual BSDEs, HJB Equation, Portfolio Optimisation
AMS MSC 2010:
1 Introduction

In this paper we propose algorithms that combine modern machine learning practices with theoretical stochastic control principles to solve a range of control problems with high accuracy, scalability with dimension, and low computational cost. There is a natural crossover between machine learning and stochastic control problems, as they both involve searching for features within a data set. For stochastic control the data set is a state process that is influenced by a controlling process that can be chosen with the aim to optimise some objective function. Often the optimal control is a function of the underlying state process, a so-called Markovian control. This is very similar to deep reinforcement learning, where the control is chosen as the output of a neural network which takes the dataset as input, and some loss function is minimised. It is in this area of crossover that this paper resides. We propose deep learning methods to solve the utility maximisation problem, as an application of a more general stochastic control solver, in a wide range of markets with arbitrary convex control constraints. Dynamic portfolio optimisation has been extensively studied within the field of mathematical finance, see [17, 25] for exposition.

A common drawback of numerical methods for stochastic control problems is their complexity with the dimension of the space in which either the state or control processes reside. Often some grid method or exhaustive search of the underlying state space is required to find optimal trajectories, which is computationally inefficient for high dimensional state spaces. It is claimed that machine learning methods provide a solution for this dimensionality problem [11]. Numerical results show efficient solving of PDEs with up to thousands of variables [1, 2, 7]. These PDE solvers form the basis of our generic stochastic control problem solver, which is the first algorithm presented in this paper. The problems are transformed into forward-backward stochastic differential equations (FBSDEs) using the Feynman-Kac formula and the Stochastic Maximum Principle (SMP). The PDE that we solve is the Hamilton-Jacobi-Bellman (HJB) equation, a highly nonlinear PDE that is often impossible to solve analytically, or even numerically if the dimensionality is high. This paper introduces a control variable into the PDE that is constrained to some convex set. This drastically increases the complexity and takes the problem out of the scope of [2, 7] unless the control can be found analytically. We show how machine learning can overcome this difficulty and produce highly accurate value approximations with comparatively low runtime.

The FBSDE based methods developed in this paper are direct extensions of the first and second order deep BSDE solvers introduced in [2, 7]. In these papers, a system of FBSDEs is solved by simulating all processes in the forward direction, driven by a neural network process which is optimised to ensure the terminal condition holds. The extension of this paper is to add an additional driving neural network process, representing a control process. The convenience of adapting this BSDE method to stochastic control problems is that the control process can be optimised using the value function approximations implied by the 2BSDE solver, via the Hamiltonian of the system.
In the 2BSDE solver of [2], there are two unknown processes that must be found with neural networks. This suggests that our method has three unknowns, but we can use the SMP to derive one of these processes and therefore introduce control without increasing the number of unknown processes.

There are some papers that describe generic algorithms for solving stochastic control problems in the literature. [12] directly maximises the gains function at the terminal time with machine learning and is applicable to many problems. However, the computation costs may be high, as one introduces a discretisation with many time steps or a high dimensional control space, and it only outputs the value at time 0. [15] finds a local approximation of the value function with machine learning. However, it only holds for Markovian problems and the iterative nature of this value function approximation may lead to error propagation. Instead of directly optimising the value function, the first method presented in this paper maximises the Hamiltonian associated with the system. There are two key benefits of doing this. Firstly, it reconciles the potential errors of [12, 15], where the algorithm gets stuck at a local optimiser. To find the value function (given a control) we ensure that the terminal condition of the HJB equation holds, which involves minimising the least squares error of the associated processes at the terminal time. The global minimum of this error is zero by construction, ensuring that we can directly assess the quality of our approximation for any generic control problem. Secondly, the control is optimised using the Hamiltonian of the system, which is a function of only what is known at one time. This optimisation is faster than backpropagating through a discretisation of the processes to compute the derivatives as in [12]. This is even more important when the control space is high dimensional, which is normally true in portfolio optimisation problems. To implement this Hamiltonian maximisation procedure, one needs to know derivatives of the value function, which can be computed accurately with deep learning based on the 2BSDE method in [2].

There are a few papers in the literature that also utilise the deep BSDE methods of [2, 7]. For example, [9, 24, 27] first find an explicit formula for the optimal control in terms of the value function and its derivatives and then substitute this control into the HJB equation and solve the resulting PDE with the deep 2BSDE method. However, these algorithms cannot be applied in full generality like our method. [9, 24] are reliant on the diffusion of the state process being independent of the control, which leads to a semi-linear PDE and a representation with a first order BSDE. This framework is inappropriate for portfolio selection problems as the control naturally appears in the diffusion term. [27] alleviates this restriction by deriving an explicit formula for the optimal control, which is required to be unconstrained. Our approach is more general in the sense that it does not require any analytical solutions. In the literature these other algorithms are called “deep 2BSDE controllers” since the BSDE solutions are used to form the control. Our algorithm is called the deep controlled 2BSDE method as each control and state pair defines its own 2BSDE which we try to solve. A primal-dual method is introduced in [14] for solving semi-linear PDEs. The deep 2BSDE method is used to find a lower bound for the solution.
An upper bound is found by a pathwise maximising control problem that is derived using the Legendre transform. The approach in [14] is not directly related to our notion of convex duality using martingale methods. There are also some papers on applications of deep neural networks in mathematical finance, including optimal stopping [3], asset pricing with transaction cost [10], and optimal transport [8].

The utility maximisation problem involves maximising concave functions, so we can naturally apply convex duality methods. Duality is a powerful tool that can be used in complete or incomplete markets with a range of convex constraints on the control process. The unique aspect of our work is the combination of deep learning methodology with convex duality. The benefits of introducing duality are threefold. Firstly, the dual problem is a minimisation problem, and solving it independently of the primal problem naturally leads to an upper bound of the value function. Sometimes the dual problem is easier to solve, since primal control constraints are characterised by penalties in the dual state process, leading the dual problem to be potentially unconstrained. Secondly, the BSDEs associated to the dual problem can be related to those of the primal problem, giving us the ability to compare our numerical results and check their accuracy. Again, as with the Hamiltonian maximisation, this benefits us since we know the error between the dual and primal processes should be zero, which allows us to easily measure the quality of our results. We may even skip one of the problems if it is difficult to solve, as solving one problem automatically gives us the solution to the other. Thirdly, inherent in the dual formulation are certain optimality conditions that are derived from the SMP. These conditions can be used separately from the HJB equation to find the optimal control. We use these conditions as the basis for our second algorithm, called the deep SMP method, which is valid for non-Markovian control problems with convex structure. This greatly increases the range of utility maximisation problems that can be solved using the methods of this paper. In a Markovian setting we can compare the two algorithms presented in the paper.

The convex duality method works for complete and incomplete markets. A common incomplete market model is the Heston stochastic volatility model, in which the underlying variance of the stock follows some mean-reverting SDE with its own source of randomness. This model may be seen as a 2-dimensional stock market, where only one asset is traded. This framework allows us to use our deep learning method to solve it. We compare our results to [23], in which tight bounds of the value function are derived for an unconstrained optimisation problem. [23] is reliant on a convenient guess of the dual control, so is not applicable to other models where such a guess is not available, unlike our method. Another multidimensional model we consider is a path-dependent volatility model. There are several recent papers on this topic in mathematical finance with machine learning, for example [5, 6, 22]. These papers include model calibration, hedging and pricing, and are not directly related to our work.

The main contributions of this paper are the following. In the deep controlled 2BSDE algorithm, we present a Markovian stochastic control problem solver that combines the HJB equation with the deep 2BSDE method, but without the need for an analytical formula for the optimal control.
We apply this algorithm to both the primal and dual problems, which gives us tight bounds on the value function. In the deep SMP method, we exploit the dual formulation of the utility maximisation problem to create a novel algorithm for non-Markovian optimisation problems using deep reinforcement learning. Instead of dealing with the gains function, we solve the adjoint BSDE and use machine learning to satisfy the dual optimality conditions. To the best of our knowledge, we are the first to use this methodology in the literature.

The remainder of this paper is outlined as follows. Section 2 formulates the utility maximisation problem and defines the dual problem. Section 3 describes the general algorithm for solving Markovian stochastic control problems using deep learning and the HJB equation, and applies the algorithm to portfolio optimisation with various utilities. Section 4 introduces an algorithm for solving non-Markovian utility maximisation problems using the SMP and applies this algorithm to problems where the volatility process is not deterministic, and even path-dependent. Section 5 discusses convergence of the deep controlled 2BSDE algorithm based on empirical evidence. Section 6 concludes the paper.

2 The Utility Maximisation Problem and Its Dual

In this section we describe the standard portfolio optimisation problem and its dual problem. Let W = (W(t))_{0≤t≤T} be a standard m-dimensional Brownian motion on the natural filtered probability space (Ω, F, (F_t)_{0≤t≤T}, P), augmented with all P-null sets, with T a fixed finite horizon. Consider an (m+1)-dimensional stock market, consisting of a risk-free bond with interest rate r(t) and a collection S of risky stocks defined by

dS(t) = diag(S(t)) ( µ(t) dt + σ(t) dW(t) ),   (2.1)

where diag(S(t)) is the m × m matrix with diagonal elements from S(t) and all other elements zero, and r, µ and σ are uniformly bounded, progressively measurable processes valued in R, R^m and R^{m×m}, respectively, with µ(t) and σ(t) representing the drifts and volatilities of the m stock prices at time t. The matrix σ(t) is invertible and σ^{-1} is uniformly bounded. An agent invests a proportion π(t) of the wealth X(t) in the risky stocks and the rest in the risk-free bond at time t ∈ [0, T]. The 1-dimensional wealth process X evolves as

dX(t) = X(t) ( r(t) + π(t)^⊤ b(t) ) dt + X(t) π(t)^⊤ σ(t) dW(t)   (2.2)

with X(0) = x, where π(t)^⊤ is the transpose of π(t) and b is the excess return process defined by b(t) = µ(t) − r(t) 1_m, where 1_m ∈ R^m is a vector of ones. We call π an admissible control if π is a progressively measurable process valued in K ⊂ R^m, a nonempty closed convex set, almost surely for almost every t ∈ [0, T], and there exists a unique strong solution X to the SDE (2.2). The set of all admissible controls π is denoted by A. The investor wishes to maximise the expected utility of the terminal wealth, that is, to find

V := sup_{π ∈ A} E[ U(X(T)) ],

where U : [0, ∞) → R is a utility function that is twice continuously differentiable, strictly increasing, strictly concave and satisfies the Inada conditions U'(0) = ∞ and U'(∞) = 0. We say that state and control processes X̂, π̂ are optimal if V = E[ U(X̂(T)) ].

We next construct the dual problem, see [21] for more details. The basis for duality is the Legendre-Fenchel transformation Ũ : (0, ∞) → R of the utility function U, defined by

Ũ(y) = sup_{x>0} { U(x) − xy },   y ≥ 0.
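To make the transform concrete, the following is the standard computation (a routine exercise, not taken from this paper) for the power utility U(x) = x^p / p with 0 < p < 1, which reappears in Example 3.6 below:

Ũ(y) = sup_{x>0} { (1/p) x^p − xy }.

The first order condition x^{p−1} = y gives the maximiser x* = y^{1/(p−1)}, hence

Ũ(y) = (1/p) y^{p/(p−1)} − y^{1/(p−1)} y = ( 1/p − 1 ) y^{p/(p−1)} = ((1 − p)/p) y^{−p/(1−p)},   y > 0,

which is exactly the dual power utility used later as a benchmark.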
The properties of U ensure that Ũ is well defined and finite for all y ∈ (0, ∞). The R-valued dual process Y is defined by

dY(t) = −Y(t) ( r(t) + δ_K(v(t)) ) dt − Y(t) ( θ(t) + σ(t)^{-1} v(t) )^⊤ dW(t)   (2.3)

with Y(0) = y, where y is nonnegative, θ(t) = σ(t)^{-1} b(t) is the market price of risk, δ_K is the support function of the set −K, defined by

δ_K(z) := sup_{π ∈ K} { −π^⊤ z },   z ∈ R^m,

and v is an R^m-valued dual control process which is progressively measurable and ensures there exists a unique strong solution Y to the SDE (2.3). Since XY is a super-martingale for any choice of π, v and y, we have the following weak duality relation:

V ≤ Ṽ(x) := inf_{y,v} { E[ Ũ(Y(T)) | Y(0) = y ] + xy }.   (2.4)

Solving the dual problem would give us an upper bound of the value function, which can be combined with a lower bound from the primal problem to produce a confidence interval for the value function. If we have equality in (2.4), then we can find the value function by alternatively solving the dual problem. In [21] it is shown that the strong duality property holds, that is, V = Ṽ(x).

Remark 2.1. If K = B(0, R) is a ball of radius R > 0 centred at the origin, then δ_K(z) = R|z|. If K is a closed convex cone, then δ_K(z) = 0 if z ∈ K̃ and δ_K(z) = ∞ if z ∉ K̃, where K̃ is the positive polar cone of K, defined by K̃ := { z ∈ R^m : π^⊤ z ≥ 0 for all π ∈ K }, which greatly simplifies the dual problem as the drift is independent of the dual control v if v(t) ∈ K̃. In particular, if K = R^m, then K̃ = {0}, which forces the dual control v = 0. For a general closed convex set K, the function δ_K is well defined but may be difficult to compute and cannot be implemented exactly in numerics due to the infinite penalty. Instead, so-called ‘soft’ constraints can be used, where a suitably large penalty function gives us sufficient confidence that our control process lies in the required space. Sometimes it is easier to implement a ‘hard’ constraint, where the control is projected onto the space that makes this function finite before optimisation, as in the cone case. With hard constraints, we can have 100% confidence that the process lies in the required space by construction.

3 Markovian Utility Maximisation

We assume in this section that the processes µ, r and σ are deterministic. The utility maximisation problem is then Markovian and coincides with finding u(0, x), where the value function u : [0, T] × R → R is defined by

u(t, x) := sup_{π ∈ A} E[ U(X(T)) | X(t) = x ].   (3.1)

Our machine learning methods focus on the PDE associated to this control problem. If u ∈ C^{1,2} and satisfies a growth condition, then u is a solution to the HJB equation

∂u/∂t (t, x) = − sup_{π ∈ K} F( t, x, π, D_x u(t, x), D_x² u(t, x) ),

where the Hamiltonian F of the problem is defined by

F(t, x, π, z, γ) = xz ( r(t) + π^⊤ b(t) ) + ½ x² γ |σ(t)^⊤ π|²,

with |·| denoting the Euclidean norm. The verification theorem [25] says that the optimal control π̂(t) can be determined at each time t ∈ [0, T] by maximising the Hamiltonian, that is,

π̂(t) ∈ arg max_{π ∈ K} F( t, X̂(t), π, D_x u(t, X̂(t)), D_x² u(t, X̂(t)) ),

where X̂ is the state process corresponding to this control.
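As a minimal numpy sketch of what this pointwise maximisation amounts to (our own illustration with hypothetical parameter values, not code from the paper), the primal Hamiltonian above can be maximised over a ball constraint K = B(0, R) by projected gradient ascent, given approximations z ≈ D_x u and γ ≈ D_x² u at a single (t, x). In the algorithm described below this step is performed instead by a neural network trained against the Hamiltonian.

import numpy as np

def hamiltonian(pi, x, z, gamma, r, b, sigma):
    # F(t, x, pi, z, gamma) for the wealth dynamics (2.2)
    return x * z * (r + pi @ b) + 0.5 * x**2 * gamma * np.sum((sigma.T @ pi)**2)

def project_ball(pi, R):
    norm = np.linalg.norm(pi)
    return pi if norm <= R else pi * (R / norm)

def maximise_hamiltonian(x, z, gamma, r, b, sigma, R=1.0, lr=0.1, steps=500):
    pi = np.zeros_like(b)
    for _ in range(steps):
        # gradient of F with respect to pi, holding z and gamma fixed
        grad = x * z * b + x**2 * gamma * (sigma @ sigma.T @ pi)
        pi = project_ball(pi + lr * grad, R)
    return pi

# toy usage with hypothetical numbers (gamma < 0 makes F concave in pi)
m = 3
rng = np.random.default_rng(0)
sigma = 0.2 * np.eye(m) + 0.05 * rng.random((m, m))
b = np.full(m, 0.05)
pi_hat = maximise_hamiltonian(x=1.0, z=1.0, gamma=-1.0, r=0.05, b=b, sigma=sigma)
print(pi_hat)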
We can use machine learning to find the optimal control this way, but first we must have sufficient information about the optimal state process, as well as the derivatives of the value function. We use a 2BSDE approximation method to find the value function and its derivatives, again using machine learning, as outlined in the next section.

To derive a dual HJB equation, consider a series of dual value problems parametrised by y ∈ R, given by

ũ(t, y) = inf_v E[ Ũ(Y(T)) | Y(t) = y ],   (3.2)

and again suppose that ũ ∈ C^{1,2} and satisfies a growth condition. It is shown in [19, Proposition 4.15] that, under the additional assumption that Ũ(Y(T)) is square integrable, there exist optimal controls ŷ, v̂ that satisfy

ŷ ∈ arg min_y { E[ Ũ(Ŷ(T)) | Ŷ(0) = y ] + xy },

v̂(t) ∈ arg min_v { −Ŷ(t) ( r(t) + δ_K(v) ) ∂ũ/∂y(t, Ŷ(t)) + ½ |θ(t) + σ(t)^{-1} v|² Ŷ(t)² ∂²ũ/∂y²(t, Ŷ(t)) }   (3.3)

for all t ∈ [0, T], where Ŷ is the state process driven by v̂. The optimality conditions on v come from the corresponding Hamiltonian of the dual control problem (3.2). The implicit relations between ŷ and v̂ make it difficult to find explicit analytical solutions.

The deep learning approaches for both the dual and primal problems are similar, so we present the algorithm in a more general form that covers both. We first describe the general control problem, then use the machine learning method to find the value function and optimal control. We present the algorithm for a general state dimension d ≥
1, as this is required in the following sections.

Let (Ω, F, (F_t)_{0≤t≤T}, P) be a filtered probability space with a fixed finite horizon T. Consider the F_t-adapted controlled diffusion process (X(t))_{0≤t≤T} on R^d, d ∈ N, defined by

dX(t) = b(t, X(t), π(t)) dt + σ(t, X(t), π(t)) dW(t)   (3.4)

with X(0) = x_0, where W is an R^n-valued standard Brownian motion for some n ∈ N and π = (π(t))_{0≤t≤T} is a progressively measurable process valued in K ⊂ R^m for m ∈ N. It is assumed that the drift and diffusion functions b : [0, T] × R^d × K → R^d and σ : [0, T] × R^d × K → R^{d×n} are deterministic and measurable, such that (3.4) admits a strong solution. Here X represents the state process and π the control process. We consider only admissible controls that satisfy some integrability condition [25]. This control is chosen to optimise the state process via maximising a reward that is defined using a terminal gain function g : R^d → R, a measurable function that is either lower bounded or satisfies some quadratic growth condition. We wish to find the value function

u(t, x) := sup_{π ∈ A} E[ g(X(T)) | X(t) = x ].   (3.5)
Remark 3.1. The utility function U is not defined on the whole space R, but only on (0, ∞). We take g(x) to be U(x) when x > 0 and 0 otherwise, ensuring that the quadratic growth or boundedness conditions are satisfied (if they are satisfied for U). It is irrelevant what value is chosen when x ≤ 0, as one can show that the state process X defined in (2.2) satisfies X(T) > 0 if and only if X(0) > 0 almost surely.

Suppose that u ∈ C^{1,2}, so that we can use the HJB equation and the SMP. The HJB equation tells us that u solves the following PDE

∂u/∂t (t, x) = − sup_{π ∈ K} F( t, x, π, D_x u(t, x), D_x² u(t, x) ),   (t, x) ∈ [0, T) × R^d,

with the terminal condition u(T, x) = g(x) for x ∈ R^d, where the Hamiltonian F is defined by

F(t, x, π, z, γ) = b(t, x, π)^⊤ z + ½ tr( σ(t, x, π) σ(t, x, π)^⊤ γ ).   (3.6)

In this setting we can introduce running gains or other time-based reward functions by adding some term f(t, x, π) to the Hamiltonian.

We can now describe the value function and its derivatives using a system of FBSDEs. Let x_0 ∈ R^d be our initial state. Fix an arbitrary admissible control process π and the corresponding state process X. Define new processes on [0, T] by

V(t) := u(t, X(t)),   Z(t) := D_x u(t, X(t)),   Γ(t) := D_x² u(t, X(t)).   (3.7)

By the regularity assumptions on u, these processes are continuous. Applying Itô's lemma to u and using the HJB equation, we get

dV(t) = [ F(t, X(t), π(t), Z(t), Γ(t)) − sup_{a ∈ K} F(t, X(t), a, Z(t), Γ(t)) ] dt + Z(t)^⊤ σ(t, X(t), π(t)) dW(t),

with the terminal condition V(T) = g(X(T)). Furthermore, the SMP relates the pair (Z, Γ) to an SDE via the generalised Hamiltonian H : [0, T) × R^d × K × R^d × R^{d×n} → R, given by

H(t, x, π, z, q) = b(t, x, π)^⊤ z + tr( σ(t, x, π)^⊤ q ).   (3.8)

Again, in the presence of running gains we add the corresponding f term here. The supremum term in the V dynamics makes it too complex to simulate directly. To avoid this, we assume the control we chose arbitrarily is actually an optimal control, so by definition the Hamiltonian terms cancel, and we are left with the following result.
Theorem 3.2. Suppose that u ∈ C^{1,2}([0, T) × R^d) ∩ C([0, T] × R^d), and there exists an optimal control π ∈ A with associated controlled diffusion X, defined by (3.4). Then there exist continuous processes (V, Z, Γ) solving the following 2BSDE:

dV(t) = Z(t)^⊤ σ(t, X(t), π(t)) dW(t),
dZ(t) = −D_x H( t, X(t), π(t), Z(t), Γ(t) σ(t, X(t), π(t)) ) dt + Γ(t) σ(t, X(t), π(t)) dW(t),   (3.9)

with the terminal conditions V(T) = g(X(T)) and Z(T) = D_x g(X(T)), where H is defined by (3.8) and the control π satisfies

F(t, X(t), π(t), Z(t), Γ(t)) = sup_{a ∈ K} F(t, X(t), a, Z(t), Γ(t)),   (3.10)

and F is defined by (3.6).

If we have a running gains function f, this would result in a drift term −f(t, X(t), π(t)) dt being added to the V dynamics. While the supremum term in the original dynamics of V makes its simulation complicated, the derivative term in the dynamics of Z can be easily computed by modern automatic differentiation methods. This theorem is presented for a maximisation problem but easily translates to the dual problem (3.2), which is a minimisation problem, by replacing supremums with infimums.

Before describing the algorithms used to solve this problem in detail, we first look at what we mean by a neural network. Suppose we wish to approximate a function φ : R^p → R^q for some p, q ∈ N. We use the following construction for the approximating function. We take several elements of R^p, say a batch of size k ∈ N. For our purposes, k represents how many Brownian paths we generate, and p = d represents a space vector in R^d. We also have a predetermined number of 'layers' L ∈ N and 'hidden nodes' ℓ ∈ N. Suppose we have a matrix X ∈ R^{k×p} as our input. Set N_0 = X; then for i = 1, ..., L, perform the update

N_i = h( N_{i−1} M_i + 1_{k,1} v_i ),

where M_1 ∈ R^{p×ℓ}, M_j ∈ R^{ℓ×ℓ} for j > 1, v_i ∈ R^{1×ℓ}, and 1_{k,1} is a k × 1 vector of ones, so that v_i is added to each vector in the batch. The function h : R^{k×ℓ} → R^{k×ℓ} applies element-wise the non-linear (or activation) function
x ↦ max(x, 0), called a rectified linear unit (ReLU) activation function. The choice of this function for this scheme is rather arbitrary, as there seems to be no clearly preferred function here. A comparison of popular activation functions is provided in Section 5. This non-polynomial mapping is essential in ensuring the density of these networks within the space of continuous functions [20]. Finally, we output

N^θ(X) = N_L M_{L+1} + 1_{k,1} v_{L+1},

where M_{L+1} ∈ R^{ℓ×q} and v_{L+1} ∈ R^{1×q}. Note that no non-linear function is applied at the final step, else we could only approximate non-negative functions. The parameter vector θ ∈ R^ρ representing our network is a parametrisation of the M_i and v_i matrices and is optimised at the training step. We made no assumptions on the regularity of φ, but our approximation is almost everywhere differentiable. In practice this is sufficient for gradient descent algorithms, in which derivatives of this network with respect to its parameters must be taken. We use the Python package
Tensorflow to implement this neural network procedure. This methodology reduces an optimisation over an infinite dimensional space to a finite dimensional one. However, the dimension can still be high. For this neural network, the parameter space is made up of L − 1 matrices in R^{ℓ×ℓ}, L matrices in R^{1×ℓ}, and one matrix in each of the spaces R^{p×ℓ}, R^{ℓ×q} and R^{1×q} respectively, which are flattened and combined into one single vector θ ∈ R^ρ. The parameter space dimension is therefore

ρ(p, q) = (L − 1) ℓ² + ℓ (L + p + q) + q,   (3.11)

quadratic in the number of hidden nodes, and linear in the number of layers and the input dimension. Associated to each neural network N^θ is a loss function L(θ), a function of the parameters forming the network. Finding the parameters that minimise this function equates to optimising the network itself. In our implementation, we use the ADAM algorithm [18], which is an adaptation of stochastic gradient descent, implemented in Tensorflow. We furthermore use mini-batches, which means the number of simulations we use to calculate the loss function is small, and batch normalisation, meaning that we scale the input to the neural network [16].

Now that we have set up the new problem (3.9)-(3.10), we look to solve it with machine learning techniques in the spirit of [2] (the algorithm presented therein is named the "deep 2BSDE" method, thus inspiring our adapted deep controlled 2BSDE method). The trick here is to simulate all processes in the forward direction, introducing new variables v_0 ∈ R and z_0 ∈ R^d, then moving forward through a discretisation (t_i)_{i=0}^N of [0, T]. We furthermore set Γ to be a neural network that at each time t_i takes the i-th state position, denoted by X_i, as input, since we are approximating D_x² u(t, X(t)) with this process. The control process π is also a neural network, taking the same state X_i as input at time step t_i, which means that we are searching for a Markovian control. To be specific, we introduce parameters θ_i ∈ R^{ρ(d,m)} and λ_i ∈ R^{ρ(d,d²)} and then use the following Euler-Maruyama scheme. Set X_0 = x_0, V_0 = v_0 and Z_0 = z_0; then for i = 0, ..., N − 1, set π_i = N^{θ_i}(X_i), Γ_i = N^{λ_i}(X_i) and

X_{i+1} = X_i + (t_{i+1} − t_i) b(t_i, X_i, π_i) + √(t_{i+1} − t_i) σ(t_i, X_i, π_i) ΔW_i,
V_{i+1} = V_i + √(t_{i+1} − t_i) Z_i^⊤ σ(t_i, X_i, π_i) ΔW_i,
Z_{i+1} = Z_i − (t_{i+1} − t_i) D_x H(t_i, X_i, π_i, Z_i, Γ_i σ(t_i, X_i, π_i)) + √(t_{i+1} − t_i) Γ_i σ(t_i, X_i, π_i) ΔW_i,   (3.12)

where ΔW_i is a standard multivariate normal N_m(0, I_m) random variable. We have used minimal notation for these processes for clarity, but one should remember that they depend explicitly on the choice of parameters λ_i, θ_i and the initial points v_0, z_0. Our algorithm involves using this scheme multiple times, but with different parameters. Once we have repeated this iteration up to i = N − 1, we reach t_N = T but find that we have not satisfied the terminal conditions. We want to choose the neural networks and initial start points to reduce the expected error, which amounts to minimising the loss function

L_1(Θ_BSDE) := E[ |V_N − g(X_N)|² + β |Z_N − D_x g(X_N)|² ],   (3.13)

where Θ_BSDE is a vector representation of (v_0, z_0) ∪ (λ_i)_{i=0}^{N−1}. This is referred to as the BSDE loss function in the sequel. In practice, we evaluate the sample average of this loss function. The coefficient β > 0 weights the terminal condition on Z relative to that on V.
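The following is a minimal numpy sketch of a single forward pass of the scheme (3.12) and the sample-average evaluation of the BSDE loss (3.13) for the one-dimensional wealth problem. It is our own illustration, not the paper's Tensorflow code: the "networks" pi_net and gamma_net are fixed placeholder callables, all parameter values are hypothetical, and D_x H is computed by hand here rather than by automatic differentiation.

import numpy as np

rng = np.random.default_rng(0)
T, N, n_paths = 0.5, 20, 4096
dt = T / N
r, b_excess, sigma, x0, p = 0.05, 0.05, 0.2, 1.0, 0.5

def g(x):                      # terminal gain: power utility, set to 0 for x <= 0
    return np.where(x > 0, np.maximum(x, 1e-12)**p / p, 0.0)

def dg(x):                     # D_x g
    return np.where(x > 0, np.maximum(x, 1e-12)**(p - 1), 0.0)

drift = lambda x, pi: x * (r + pi * b_excess)   # b(t, x, pi) of (2.2)
diff  = lambda x, pi: x * pi * sigma            # sigma(t, x, pi) of (2.2)

# placeholder "neural networks": a constant control and a crude Gamma guess
pi_net    = lambda x: 0.5 * np.ones_like(x)
gamma_net = lambda x: -1.0 / np.maximum(x, 1e-6)

def forward_pass(v0, z0):
    X = np.full(n_paths, x0)
    V = np.full(n_paths, v0)
    Z = np.full(n_paths, z0)
    for i in range(N):
        dW  = rng.normal(scale=np.sqrt(dt), size=n_paths)
        pi  = pi_net(X)
        Gam = gamma_net(X)
        sig = diff(X, pi)
        # D_x H for this model, with q = Gamma * sigma held fixed:
        # D_x H = (r + pi*b_excess) Z + pi*sigma * Gamma * sig
        dxH = (r + pi * b_excess) * Z + pi * sigma * Gam * sig
        V = V + Z * sig * dW
        Z = Z - dt * dxH + Gam * sig * dW
        X = X + dt * drift(X, pi) + sig * dW
    return X, V, Z

def bsde_loss(v0, z0, beta=1.0):   # sample average of (3.13)
    X, V, Z = forward_pass(v0, z0)
    return np.mean((V - g(X))**2 + beta * (Z - dg(X))**2)

print(bsde_loss(v0=2.0, z0=1.0))

In the actual algorithm pi_net and gamma_net are trainable networks (one pair per time step) and v_0, z_0 are trainable variables, all updated by ADAM against this loss and the control loss introduced next.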
An arbitrarily chosen control π is not optimal, so we must also maximise the loss functions

L_2(θ_i, i) := E[ F(t_i, X_i, π_i, Z_i, Γ_i) ]   (3.14)

for i = 0, ..., N −
1. This is referred to as the control loss function in the sequel. Again, we evaluate a sample average of this loss function. We proceed to optimise these loss functions in turn, starting with our arbitrary 0-superscript parameters. Now suppose we have just finished step j ∈ N. The first sub-step is to improve the terminal condition loss. We generate V_N, Z_N and X_N using Θ^j_BSDE and θ^j_i for i = 1, ..., N −
1, which are carried forward from the previous step. We then make an improvement using one step of the ADAM algorithm against the loss function L_1, resulting in the updated parameters Θ^{j+1}_BSDE. We also keep track of the moment estimates that are present in the algorithm [18].

The second half of the iteration step is to improve the control process. We first simulate the processes π and X using our previous parameter set θ^j = (θ^j_i)_{i=0}^{N−1}, then simulate Z and Γ using the updated parameters Θ^{j+1}_BSDE, and finally make the following improvement using one step of the ADAM algorithm. For each i = 0, ..., N − 1 we update θ^j_i using the ADAM algorithm against −L_2(·, i), resulting in θ^{j+1}_i. We use a negative here since ADAM is a minimisation algorithm, but we wish to maximise the Hamiltonian F at each time step. We track the moment estimates, move to step j + 2 and repeat this process.

These two steps can be repeated until the new parameters are sufficiently close to the old parameters after an update, at which point we deem the algorithm to have converged. We find that by making small increments at each step, and ensuring that the learning rate used in the ADAM algorithm is smaller for the control step than for the BSDE step, we obtain good convergence results. This agrees with the analytical convergence assumption of two-timescale stochastic approximation algorithms presented in [4].
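The alternating two-timescale update can be sketched as follows (our own schematic illustration, not the paper's code): one ADAM step on the BSDE loss, then one ADAM step on the negated control loss, with a smaller learning rate for the control step. The losses below are toy stand-ins; in the actual method they are the sample averages of (3.13) and (3.14) computed from the forward scheme (3.12).

import tensorflow as tf

bsde_params    = [tf.Variable(tf.zeros([4]), name="theta_bsde")]
control_params = [tf.Variable(tf.zeros([4]), name="theta_control")]

def loss_bsde():          # stand-in for L_1(Theta_BSDE), to be minimised
    return tf.reduce_sum((bsde_params[0] - 1.0) ** 2)

def hamiltonian_proxy():  # stand-in for L_2(theta_i, i), to be maximised
    return -tf.reduce_sum((control_params[0] - bsde_params[0]) ** 2)

opt_bsde    = tf.keras.optimizers.Adam(learning_rate=1e-2)   # faster timescale
opt_control = tf.keras.optimizers.Adam(learning_rate=1e-3)   # slower timescale

for step in range(2000):
    # sub-step 1: improve the terminal-condition (BSDE) loss
    with tf.GradientTape() as tape:
        l1 = loss_bsde()
    opt_bsde.apply_gradients(zip(tape.gradient(l1, bsde_params), bsde_params))

    # sub-step 2: improve the control by maximising the Hamiltonian, i.e. minimising its negative
    with tf.GradientTape() as tape:
        neg_l2 = -hamiltonian_proxy()
    opt_control.apply_gradients(zip(tape.gradient(neg_l2, control_params), control_params))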
Remark 3.3. At each time step we build 2 neural networks, which have 4 layers, with the two hidden layers consisting of d + 10 nodes. The parameters are optimised using the ADAM algorithm, using a batch size of 64 Brownian paths. We choose to initialise each parameter near 0, though a prior guess may be used, and the rate of convergence is naturally much higher when the initial guess is closer to the true value. Substituting p = d, L = 4 and ℓ = d + 10 into (3.11) leads to a parameter set of size ρ(d, q) = 4d² + 74d + 340 + q(d + 11), where q is the output dimension of the network. The number of neural network parameters is therefore at most cubic in the state dimension and linear in the control dimension.

For the dual problem of utility maximisation, we have a further variable to optimise, representing the initial state. We simply define a variable y with which the state process (which for the dual problem we denote by Y_i for i = 0, ..., N) begins, as opposed to starting at x_0, and then introduce a third loss function

L_3(y) = E[ Ũ(Y_N) | Y_0 = y ] + xy   (3.15)

against which to optimise this 1-dimensional parameter. This choice of loss function comes from the optimality condition (3.3) for y, replacing the expectation with a sample mean. We cannot use the value function approximation at time 0, given by V_0, since it does not depend explicitly on y. We could alternatively find a functional representation (in the style of [15]) of V_0 in terms of y, then optimise against this function. The effectiveness of this method is left to further research.

In this section we apply the deep controlled 2BSDE algorithm to solve utility maximisation, with varying utilities and control constraints. This is a control problem (3.5) with d = 1 and n = m. For the primal problem, we have the process X defined by (2.2), controlled by π, and the primal 2BSDE system of Theorem 3.2 denoted by (V_1, Z_1, Γ_1), where we take g(x) = U(x) 1_{x>0} (c.f. Remark 3.1) and define the drift and diffusion by

b(t, x, π) = x ( r(t) + π^⊤ b(t) ),   σ(t, x, π) = x π^⊤ σ(t).

The dual state process Y is defined in (2.3), controlled by (y, v), and the corresponding 2BSDE system of Theorem 3.2 is denoted by (V_2, Z_2, Γ_2), where we take g(y) = Ũ(y) and define the drift and diffusion by

b(t, y, v) = −y ( r(t) + δ_K(v) ),   σ(t, y, v) = −y ( θ(t) + σ(t)^{-1} v )^⊤.
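To make the specialisation concrete, here is a small numpy sketch (our own illustration, with hypothetical parameter values) of these drift and diffusion callables in the unconstrained case K = R^m, where δ_K = 0 and the dynamic dual control can be taken to be zero; they are exactly what is plugged into the generic forward scheme (3.12).

import numpy as np

m = 3
r = 0.05
mu = np.full(m, 0.08)
sigma = 0.2 * np.eye(m)
b_excess = mu - r                              # b(t) = mu(t) - r(t) 1_m
theta = np.linalg.solve(sigma, b_excess)       # market price of risk theta = sigma^{-1} b

def primal_drift(x, pi):                       # b(t, x, pi) = x (r + pi^T b)
    return x * (r + pi @ b_excess)

def primal_diffusion(x, pi):                   # sigma(t, x, pi) = x pi^T sigma (row vector)
    return x * (pi @ sigma)

def dual_drift(y, v=np.zeros(m)):              # b(t, y, v) = -y (r + delta_K(v)); delta_K = 0 here
    return -y * r

def dual_diffusion(y, v=np.zeros(m)):          # sigma(t, y, v) = -y (theta + sigma^{-1} v)^T
    return -y * (theta + np.linalg.solve(sigma, v))

# one Euler step of each state process under a hypothetical control pi
dt = 0.01
dW = np.sqrt(dt) * np.random.default_rng(1).normal(size=m)
pi = np.full(m, 0.3)
x_next = 1.0 + dt * primal_drift(1.0, pi) + primal_diffusion(1.0, pi) @ dW
y_next = 1.0 + dt * dual_drift(1.0) + dual_diffusion(1.0) @ dW
print(x_next, y_next)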
We use the following primal-dual relations, derived in [21].
Theorem 3.4. Suppose that ŷ and v̂ are optimal dual controls, with the corresponding dual state process Ŷ and 2BSDE solutions (V̂_2, Ẑ_2, Γ̂_2). Then the primal state process X̂ and 2BSDE solutions (V̂_1, Ẑ_1, Γ̂_1) corresponding to the optimal primal control π̂ satisfy

X̂(t) = −Ẑ_2(t),   V̂_1(t) = V̂_2(t) + X̂(t) Ŷ(t),   Ẑ_1(t) = Ŷ(t).   (3.16)

The converse relation also applies, in that primal solutions can be used to derive the corresponding dual solutions. The direction described here is often the most useful, as the dual problem can be simpler to solve than the primal problem. In addition to these processes, formulas for the optimal controls are also given, but we do not use them here.

Example 3.5 (Unconstrained Non-HARA Utility Problem). To show the effectiveness of the algorithm we apply it to a series of utility problems, with various constraints on the control space. We start with the unconstrained case K = R^m, corresponding to a complete market. We consider the utility problem where U has the following non-HARA form

U(x) = (1/3) H(x)^{-3} + H(x)^{-1} + x H(x),   x > 0,   (3.17)

where H(x) = √2 ( √(1 + 4x) − 1 )^{-1/2}. The corresponding dual utility is slightly simpler, and is given by Ũ(y) = (1/3) y^{-3} + y^{-1}. In particular we have K̃ = {0}, so the optimal dynamic dual control must be v(t) = 0. Suppose that r(s) = r, σ(s) = σ, µ(s) = µ and θ(s) = θ are constant. In this case it is shown in [21] that the optimiser ŷ and the processes Ŷ and Ẑ_2 admit closed-form expressions: ŷ is determined explicitly from the budget constraint x = −∂ũ/∂y(0, ŷ),

Ŷ(t) = ŷ exp( −( r + ½ |θ|² ) t − θ^⊤ W(t) ),

and X̂(t) = −Ẑ_2(t) = a_1 S_1(t) + a_2 S_2(t) for explicit constants a_1, a_2 (depending on ŷ, r, |θ|² and T) and geometric processes S_1, S_2 driven by W. Furthermore, the dual value function is given by

ũ(t, y) = (1/3) y^{-3} e^{(3r + 6|θ|²)(T − t)} + y^{-1} e^{(r + |θ|²)(T − t)}.

From these analytical formulae, the primal processes can be derived using Theorem 3.4. For our application, we take m = 5, T = 0. , x = 1, r = 0. , µ = 0. , and set σ as a random matrix of positive numbers between 0 and 0.2 to introduce correlation of the underlying Brownian motions, with 0.2 added to the diagonal to ensure invertibility.

Figure 3.1 shows the development of the loss functions for the primal and dual algorithms with 50 time steps. For the control, we wish for the Hamiltonian to be maximised, which means the derivative of the control loss function should go to 0. Hence, we plot the function j ↦ Σ_{i=0}^{N−1} Σ_{θ ∈ θ^j_i} |D_θ L_2(θ^j_i, i)|², where θ^j_i denotes the parameter set for the neural network forming π_i at iteration step j of the algorithm. For the dual graph we plot the derivative of L_3 with respect to y, since the above function is always zero as the dual control is forced to be zero, hence optimal, by the support function δ_K. Over the 10000 iteration steps both losses become small and settled, and therefore are deemed to have converged. In this method we choose an initial learning rate of 10^{-2} for the 2BSDE algorithm and the start point y in the dual algorithm, and 10^{-3} for the control part. Over the iterations, we divide these rates by 10 three times at regular intervals.
We repeat this choice of learning rates for all applications in the sequel, though the number of iteration steps may change.

Figure 3.2 shows two simulations of the processes V and Z (labelled as primal), along with their dual counterparts given by the relations (3.16), using 20 time steps. We plot these along with the explicit solution for reference. Both primal and dual approximations are very close to the solutions. The dual state process, appearing in the Z graph (but very close to the solution process), is closer than the corresponding primal process since there is no error for the dynamic dual control.

Figure 3.2: Two simulations of the value process and value-derivative process for the primal and dual deep controlled 2BSDE methods applied to the unconstrained non-HARA utility problem. The displayed dual processes are those implied by the duality relations (3.16). (a) Simulation of the process V. (b) Simulation of the process Z.

Figure 3.3 shows the evolution of the approximation of the initial value u(0, x) against the number of time steps for the primal and dual algorithms, where x = 1. The dual value is higher than the primal value, as expected, and the gap between these shrinks as the number of time steps increases. The final relative approximation error is 7 × 10^{-} % for the primal problem and 3 × 10^{-} % for the dual problem, showing high accuracy of the proposed scheme at time 0.

Figure 3.1: Loss functions against iteration step for the primal and dual deep controlled 2BSDE methods applied to the unconstrained non-HARA utility problem. (a) Loss function of the BSDE. (b) Derivative of the control loss function.
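The closed-form dual value used as a benchmark above is easy to sanity-check by Monte Carlo. The following sketch is our own illustration with hypothetical r, θ and y: assuming the dual utility Ũ(y) = y^{-3}/3 + y^{-1} and the dual dynamics (2.3) with v = 0, the dual state is the geometric Brownian motion Y(T) = y exp(−(r + |θ|²/2)T − θ^⊤ W(T)), and E[Ũ(Y(T))] equals the expression for ũ(0, y) given above.

import numpy as np

rng = np.random.default_rng(0)
m, T, y, r = 5, 0.5, 1.2, 0.05
theta = np.full(m, 0.2)
th2 = theta @ theta

W_T = rng.normal(scale=np.sqrt(T), size=(1_000_000, m))
Y_T = y * np.exp(-(r + 0.5 * th2) * T - W_T @ theta)

mc     = np.mean(Y_T**-3 / 3 + Y_T**-1)
closed = y**-3 / 3 * np.exp((3 * r + 6 * th2) * T) + y**-1 * np.exp((r + th2) * T)
print(mc, closed)   # the two values should agree to Monte Carlo accuracy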
Example 3.6 (Cone-Constrained Merton Problem). Now we consider the problem with power utility U(x) = x^p / p for some 0 < p < 1 and constraint set K = R^m_+, corresponding to no short selling. Let X̂ be the primal state process and (V̂_1, Ẑ_1, Γ̂_1) the solutions of (3.9) for the primal problem with optimal control π̂, and let Ŷ and (V̂_2, Ẑ_2, Γ̂_2) be the state process and solutions respectively for the dual problem with optimal controls v̂ and ŷ. The dual state process is a geometric Brownian motion given by

Ŷ(t) = Ŷ(0) exp( −∫₀ᵗ ( r(s) + ½ |θ̂(s)|² ) ds − ∫₀ᵗ θ̂(s)^⊤ dW(s) ),

where θ̂(t) = θ(t) + σ(t)^{-1} v̂(t), in terms of the unknown start point ŷ = Ŷ(0). The corresponding dual utility function is Ũ(y) = ((1 − p)/p) y^{p/(p−1)}. In [21] the following solutions are found:

ŷ = x^{p−1} exp( (1 − p) ∫₀ᵀ [ p/(2(p−1)²) |θ̂(s)|² − p/(p−1) r(s) ] ds ),

Ẑ_2(t) = −x exp( ∫₀ᵗ [ r(s) + (1 − 2p)/(2(1−p)²) |θ̂(s)|² ] ds + 1/(1−p) ∫₀ᵗ θ̂(s)^⊤ dW(s) ).   (3.18)

The dual value function is given by

ũ(t, y) = Ũ(y) exp( ∫ₜᵀ [ p/(2(p−1)²) |θ̂(s)|² − p/(p−1) r(s) ] ds ).   (3.19)

By (3.3), v̂ is chosen such that |θ̂| is minimised. This deterministic optimal control can be found numerically using Python packages such as Scipy. For our implementation, we take T = 0. , x = 1. , m = 50, r(t) = 0.05 and p = . We define the drift as µ(t)_i = 0.04 + sin(πt + A_i)/ , where A_i is randomly generated for each i = 1, ..., 50. We choose this function so that in the unconstrained case it would be optimal to short sell at some points. We set σ(t) to be a constant matrix with 0.4 on the diagonal and 0.2 everywhere else.
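A possible realisation of these market coefficients is sketched below (our own reading of the setup, not the paper's code; the divisor of the sinusoid and the distribution of the phases A_i are assumptions, since they are not legible in this copy).

import numpy as np

m = 50
rng = np.random.default_rng(7)
A = rng.uniform(0.0, 2 * np.pi, size=m)          # random phases, one per asset (distribution assumed)
scale = 10.0                                     # assumed divisor of the sinusoid

def mu(t):
    # time-dependent drift; unconstrained optimal portfolios would short sell at some times
    return 0.04 + np.sin(np.pi * t + A) / scale

sigma = np.full((m, m), 0.2) + 0.2 * np.eye(m)   # 0.4 on the diagonal, 0.2 elsewhere
r = 0.05

def excess_return(t):
    return mu(t) - r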
Remark 3.7. To ensure our controls π and v are in the correct set K = K̃ = R^m_+, we apply the function x ↦ max(0, x) element-wise to the output of their respective neural networks. This function is an almost everywhere differentiable surjection from R^m to R^m_+, which is sufficient for using a gradient descent method to find the optimal control. The mapping is not injective, but this is not an issue as the exact output of the neural network is not relevant: we care only about the control obtained by projecting onto the control space.

Figure 3.4 shows the results of applying the primal and dual 2BSDE methods to this problem. We use N = 10 time steps and run the algorithm for 100000 steps, notably more than for the lower-dimensional unconstrained problem. This increases the run time to 5457 seconds. We start the BSDE and control learning rates at 0.01 and 0.001 respectively, then reduce these by a factor of 10 half way through training. We run both algorithms at the same time, which saves time as we can use the same generated Brownian motions for each algorithm. We see less convergent structure at the beginning of training, perhaps due to bad initial guesses for the neural networks. Indeed, we see that the value for the primal functions stays nearly constant until the loss function drops to a certain level, when the processes are accurate enough that the value approximation can move towards the analytical solution. The dual algorithm behaves similarly, but with more noise. At the end of training we have relative errors of 0.1% and 0.05% for the primal and dual algorithms respectively. This error is a few orders of magnitude higher than in the unconstrained case due to the increase in dimension.

Figure 3.4: Approximation of the value function and BSDE loss against iteration step for the primal and dual deep controlled 2BSDE methods applied to the cone-constrained power utility problem. (a) Value approximation. (b) BSDE loss.

Figure 3.3: Approximation and relative error of the value function against number of time steps N for the primal and dual deep controlled 2BSDE methods applied to the unconstrained non-HARA utility problem. (a) Value approximation. (b) Relative error of value approximation.

Example 3.8 (Ball Constrained Log Utility Problem). In this example we consider the log utility problem when the control process is constrained to the ball B(0, R) of radius R > 0 in R^m. This corresponds to the investor not wanting to invest more than a certain proportion of their wealth in risky stocks. In this case, the support function δ_K(z) = R|z| does not vanish for all admissible dual controls. Therefore, the SDE (2.3) is harder to solve. We recall from (3.3) that in general this problem is also difficult to solve analytically because of the dependence of the dual controls ŷ and v̂ on each other. However, we can exploit the multiplicative property of the log function to decouple these controls and obtain analytical solutions, which we describe here. To this end, let U(x) = log(x) be our utility function, for which the dual utility is Ũ(y) = −(1 + log(y)). Since the dual process Y is a geometric Brownian motion, we have

E[ Ũ(Y(T)) ] = −1 − log(Y(0)) + ∫₀ᵀ ( r(t) + R|v(t)| + ½ |θ(t) + σ(t)^{-1} v(t)|² ) dt.

The measurable process

v̂(t) := arg min_{v ∈ R^m} { R|v| + ½ |θ(t) + σ(t)^{-1} v|² }
is an optimal control, and can be found using existing deterministic convex optimisation algorithms. Independently of this, the dual control ŷ is found via

ŷ ∈ arg min_{y>0} { xy − log(y) + ∫₀ᵀ ( r(t) + δ_K(v̂(t)) + ½ |θ(t) + σ(t)^{-1} v̂(t)|² ) dt },

yielding ŷ = 1/x. The value function is then given by

u(0, x) = log(x) + ∫₀ᵀ ( r(t) + R|v̂(t)| + ½ |θ(t) + σ(t)^{-1} v̂(t)|² ) dt.   (3.20)

For our application, we shall take R = 1. , T = 0. , x = 5. , m = 20, r(t) = 0.05
and µ(t) = 0. . We define the volatility as a diagonal matrix with i-th diagonal element equal to σ(t)_i = sin(πt + A_i)/10 for some uniform (0, π) random number A_i. We choose this function so that in the unconstrained case it would be reasonable to invest more than we are allowed at points when the volatility is low. The volatility is non-negative for all t ∈ [0, T], so the problem is well posed. To ensure that our primal control is in the correct set, we use a penalty function method in the spirit of Remark 2.1. In this case we do not constrain π, but instead we alter the loss function (3.14) to be

L_2(θ_i, i) := E[ F(t_i, X_i, π_i, Z_i, Γ_i) − β max(0, |π_i| − R)² ],

which heavily penalises the control for straying outside of K. We can also describe this as a running cost for the control problem, which would be added to the Hamiltonian. The parameter β > 0 is chosen suitably large. If instead we use a hard constraint, mapping the control into B(0, R) via a transformation based on x ↦ e^{−|x|}, we achieve a relative error of 0.8% with the same runtime, as opposed to 0.7% using soft constraints. We do not use such a function for the dual control, as it is never infinitely penalised for its position.

Table 1 shows the evolution of the approximated value function for the primal and dual algorithms, compared with the numerical solution 1.64755570 found using (3.20). The algorithm is run using N = 10 time steps, and the value approximation, relative error and (combined) runtime are recorded. We again perform both algorithms simultaneously using the same Brownian motion simulations. The value approximation initially moves away from the solution, but then converges. For the primal algorithm, the value moves away over the first 1000 steps, then goes to the solution. For the dual algorithm, the value does not move as far from the solution, but takes longer to do so, only starting to move towards the solution at 10000 steps. The dual value initially dips below the solution, then increases, which is why the relative error is lower at 1000 steps than at the start.

Step    Primal Value   Rel-Err (%)    Dual Value     Rel-Err (%)    Time (s)
1       9.92774e-01    3.97426e+01    2.50412e+00    5.19898e+01    128
1000    4.63730e-02    9.71853e+01    1.57275e+00    4.54015e+00    183
5000    2.27800e-01    8.61735e+01    1.80325e+00    9.44972e+00    405
10000   1.39359e+00    1.54147e+01    2.01337e+00    2.22035e+01    701
12000   1.62170e+00    1.56918e+00    1.98620e+00    2.05544e+01    815
18000   1.63545e+00    7.34658e-01    1.68103e+00    2.03169e+00    1149
19000   1.63546e+00    7.34163e-01    1.65375e+00    3.76032e-01    1204
20000   1.63543e+00    7.35726e-01    1.65361e+00    3.67465e-01    1259

Table 1: Value approximation and runtime at certain iteration steps for the primal and dual deep controlled 2BSDE methods applied to the ball-constrained log utility problem.

4 Non-Markovian Utility Maximisation

In this section we consider the case in which r, µ and σ may not be deterministic. The utility problem and its dual problem then become non-Markovian. This means we cannot write the value function as a measurable function of state and time nor apply the dynamic programming principle, hence the HJB equation is invalid. However, we can still employ the SMP, and it is shown in [21] that there exist certain optimality conditions for a non-Markovian utility maximisation problem, which we use to define a new algorithm, named the deep SMP method. In some of the examples we consider in this section, the problem can be made Markovian by introducing some new artificial structure to the problem. In these cases, we will compare the accuracy and runtime of both our algorithms.

Consider an agent who invests in the market (2.1) in full generality.
We again consider the utility problem, written in this setting as

u(t) = sup_{π ∈ A} E[ U(X(T)) | F_t ].

We cannot use Itô's formula for this value process as it cannot be written as a measurable function of the wealth process X(t), defined as in (2.2) and controlled by π, but we can still use the SMP. For the primal problem we have the following adjoint BSDE:

dP_1(t) = −[ ( r(t) + π(t)^⊤ b(t) ) P_1(t) + Q_1(t)^⊤ σ(t)^⊤ π(t) ] dt + Q_1(t)^⊤ dW(t)   (4.1)

with the terminal condition P_1(T) = −U'(X(T)). Similarly, we can define the dual state Y as per (2.3), controlled by (y, v). The dual adjoint equation is the following BSDE:

dP_2(t) = [ ( r(t) + δ_K(v(t)) ) P_2(t) + Q_2(t)^⊤ ( θ(t) + σ(t)^{-1} v(t) ) ] dt + Q_2(t)^⊤ dW(t)   (4.2)

with the terminal condition P_2(T) = −Ũ'(Y(T)). Furthermore, we have the duality relations P_2(t) = X(t) and P_1(t) = −Y(t) almost surely, for almost every t ∈ [0, T], by (3.16). We now state the optimality conditions for the dual problem in terms of these processes.

Theorem 4.1 ([21], Theorem 11). Let (y, v) be an admissible dual control. Then (y, v) is optimal if and only if the solutions (P_2, Q_2) of (4.2) satisfy the following:
1. P_2(0) = x,
2. (σ(t)^⊤)^{-1} Q_2(t) / P_2(t) ∈ K,
3. P_2(t) δ_K(v(t)) + Q_2(t)^⊤ σ(t)^{-1} v(t) = 0 for all t ∈ [0, T] almost surely.
Remark 4.2. The corresponding optimality condition for a primal control π with corresponding state process X and adjoint BSDE solutions (P_1, Q_1) is

[π(t) − a]^⊤ [ X(t) σ(t) ( P_1(t) θ(t) + Q_1(t) ) ] ≥ 0

for all t ∈ [0, T] and a ∈ K almost surely. This is a much more complex condition, and its implementation would require some grid-based method to consider a sufficiently large number of points a ∈ K, which would be computationally costly, especially in higher dimensions and for unbounded (but still constrained) K. For this reason, we work with the dual conditions and then use the primal-dual relations to derive the necessary primal processes.

The machine learning methodology to solve this problem is as follows. First, we simulate the dual processes in the forward direction using the Euler-Maruyama scheme. We start the process P_2 at P_2(0) = x to ensure the first condition is satisfied. This condition is why the forward simulation method appears naturally suited to this problem. We use neural networks to define the processes v and Q_2, then update the values of Y and P_2 accordingly. To be specific, we use the following scheme. Set Y(0) = y, P_2(0) = x; then for i = 0, ..., N − 1,

Q_2(i) = P_2(i) σ(t_i)^⊤ h_K( N^{θ^Q_i}(Y(i)) ),
v(i) = N^{θ^v_i}(Y(i)),
Y(i+1) = Y(i) − (t_{i+1} − t_i) Y(i) ( r(t_i) + δ_K(v(i)) ) − √(t_{i+1} − t_i) Y(i) ( θ(t_i) + σ(t_i)^{-1} v(i) )^⊤ ΔW_i,
P_2(i+1) = P_2(i) + (t_{i+1} − t_i) [ r(t_i) P_2(i) + Q_2(i)^⊤ θ(t_i) ] + √(t_{i+1} − t_i) Q_2(i)^⊤ ΔW_i,

where θ^v_i, θ^Q_i ∈ R^{ρ(d,m)} are parameters for neural networks. It is assumed that we can sample the random coefficient processes at any time in an efficient way. We define the function h_K : R^m → K to be some differentiable surjective function. Using this function, we have by construction that condition 2 is automatically satisfied. The two neural networks used here both output to m-dimensional space, unlike the general algorithm outlined in Section 3.2, where the second neural network is valued in R^{d×d}. This increases the number of parameters, since for the utility maximisation problem we have d = 1. However, as we move into the case of random coefficients that satisfy their own SDEs, this dimensionality also increases. We have the variables y, Θ_Q := (θ^Q_i)_{i=0}^{N−1} and Θ_v := (θ^v_i)_{i=0}^{N−1}. These are optimised against the following loss functions:

L_y(y) = E[ Ũ(Y(N)) | Y(0) = y ] + xy,
L_Q(Θ_Q) = E[ |Ũ'(Y(N)) + P_2(N)|² ],
L_v(θ^v_i) = E[ |P_2(i) δ_K(v(i)) + Q_2(i)^⊤ σ(t_i)^{-1} v(i)|² ].

In practice, for each function we take a sample mean to approximate the expected value or the expected squared error. The first loss function represents the optimality condition (3.3) of y. The second loss function represents the need to ensure that the terminal condition P_2(N) = −Ũ'(Y(N)) holds. Finally, instead of using the HJB condition from (3.3), which may not hold in a non-Markovian setting, the third loss function arises from the third optimality condition of Theorem 4.1.

The optimisation process is very similar to the deep controlled 2BSDE method as outlined in Section 3.2, so we only provide a brief summary here.
For each iteration step we simulate the processes Y, v, P_2 and Q_2 using our existing (or, for the first step, randomly initialised) parameter sets y, Θ_Q and Θ_v and a batch of simulated Brownian motion paths. We then perform one iteration of the ADAM algorithm to update each parameter set in turn using their respective loss functions. In between the updates, we simulate the processes again using the new parameters, but do not generate new Brownian motions. When the 3 sets are updated we generate new Brownian motion paths and repeat. This process continues until the difference made by updating is sufficiently small. For our case, we start with a learning rate of 0.01 for all parameters, which decreases as the number of iterations increases.

In this algorithm we no longer have a variable representing the value at time 0. We therefore need to perform Monte Carlo simulations, forming bounds of the value function. An upper bound can be found using the dual state process Y and a lower bound can be found using the primal state process X = P_2, as per (3.16). The estimates outputted are given by

u_low := (1/M) Σ_{i=1}^M U(P_2^i(N)) ≤ u(0, x) ≤ (1/M) Σ_{i=1}^M Ũ(Y^i(N)) + xy =: u_high   (4.3)

for some large batch of size M ∈ N, where each path is indexed by the superscript i. The start point y is not random (and hence is not included in the sum nor indexed by i) as it is optimised using a deterministic loss function. The inequalities are valid up to discretisation and simulation errors, and as the number of time steps increases the gap u_high − u_low decreases, and the optimal value is eventually contained in this gap.
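The following numpy sketch (our own illustration, not the paper's code) shows one forward pass of the deep SMP scheme and the sample averages of the three losses, for the ball constraint K = B(0, R) with log utility, so that δ_K(v) = R|v|, Ũ(y) = −(1 + log y) and Ũ'(y) = −1/y. The "networks" q_net and v_net are fixed placeholder callables and all parameter values are hypothetical; in the actual method they are trainable networks updated by ADAM against these losses.

import numpy as np

rng = np.random.default_rng(0)
m, N, T, n_paths = 3, 20, 0.5, 4096
dt = T / N
R, x0, r = 1.0, 1.0, 0.05
sigma = 0.2 * np.eye(m)
theta = np.full(m, 0.25)
sigma_inv = np.linalg.inv(sigma)

def h_K(z):   # almost-everywhere differentiable surjection of R^m onto the ball B(0, R)
    nrm = np.linalg.norm(z, axis=-1, keepdims=True)
    return z * np.minimum(1.0, R / np.maximum(nrm, 1e-12))

q_net = lambda y: np.tile(0.3 * np.ones(m), (len(y), 1))   # placeholder for N^{theta_Q}
v_net = lambda y: np.tile(0.1 * np.ones(m), (len(y), 1))   # placeholder for N^{theta_v}

def forward(y0):
    Y = np.full(n_paths, y0)
    P = np.full(n_paths, x0)          # P_2(0) = x enforces condition 1
    L_v_terms = []
    for i in range(N):
        dW = rng.normal(scale=np.sqrt(dt), size=(n_paths, m))
        Q = P[:, None] * (h_K(q_net(Y)) @ sigma)     # Q_2 = P_2 sigma^T h_K(.), so condition 2 holds
        v = v_net(Y)
        # residual of condition 3, accumulated for the control loss L_v
        resid = P * R * np.linalg.norm(v, axis=1) + np.sum(Q * (v @ sigma_inv.T), axis=1)
        L_v_terms.append(np.mean(resid**2))
        drift_dir = theta + v @ sigma_inv.T           # theta + sigma^{-1} v, per path
        Y = Y - dt * Y * (r + R * np.linalg.norm(v, axis=1)) - Y * np.sum(drift_dir * dW, axis=1)
        P = P + dt * (r * P + Q @ theta) + np.sum(Q * dW, axis=1)
    return Y, P, L_v_terms

y0 = 1.0
Y_N, P_N, L_v_terms = forward(y0)
L_y = np.mean(-(1.0 + np.log(np.maximum(Y_N, 1e-12)))) + x0 * y0    # sample average of L_y
L_Q = np.mean((-1.0 / np.maximum(Y_N, 1e-12) + P_N) ** 2)           # terminal-condition loss
print(L_y, L_Q, np.mean(L_v_terms))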
Remark 4.3. We have tried this method for a cone-constrained power utility problem as in Example 3.6, and the value function approximation seems quite accurate. However, using the hard constraints to define the process Q_2 does not give us the required dual control. In this example, say in dimension 1 (so K = [0, ∞)), the dual control condition becomes Q_2(t) σ(t)^{-1} v(t) = 0. The algorithm therefore determines that we must have v̂ = 0. However, this is not the optimal control, since the solution for Q_2 also depends on v. Indeed, if we compare the BSDE (4.2) with the dual version of the 2BSDE (3.9) and the solutions (3.7) (replacing u with ũ), we deduce that Q_2(t) = −D_y ũ(t, Y(t)) Y(t) ( θ(t) + σ(t)^{-1} v(t) ), and the optimal control is v̂ = max(0, −σ(t)θ(t)). Taking Q_2 to be a neural network that takes (Y, v) as input, together with using soft constraints, has shown some promising results in one dimension, with the correct control process being recovered. However, it is not consistent, since we make one neural network the input to another, leading to convergence issues. In a Markovian setting we could use the form of the optimal solution above, but this does not generalise. Work is still to be done in this area.

4.2 Numerical Examples

Example 4.4 (Power Utility in a Stochastic Volatility Model). In this example we consider an extension of our utility model to stochastic volatility. In this model, our agent invests in a 2-dimensional stock market, out of which one asset S is traded, and is defined by

dS(t) = S(t) ( r + A v(t) ) dt + S(t) √v(t) dW_s(t).

The constants r, A > 0 are fixed and the variance process v has the dynamics

dv(t) = κ ( θ − v(t) ) dt + ξ √v(t) dW_v(t).
4.2 Numerical Examples

Example 4.4 (Power Utility in a Stochastic Volatility Model). In this example we consider an extension of our utility model to stochastic volatility. In this model, our agent invests in a 2-dimensional stock market, out of which one asset S is traded, defined by
$$dS(t) = S(t)\big(r + Av(t)\big)\,dt + S(t)\sqrt{v(t)}\,dW^{s}(t).$$
The constants r, A > 0, and the variance process v has the dynamics
$$dv(t) = \kappa\big(\theta - v(t)\big)\,dt + \xi\sqrt{v(t)}\,dW^{v}(t).$$
The constants κ, θ, ξ > 0, and the Brownian motions W^s and W^v are correlated with correlation parameter ρ ∈ [−1, 1]. If the agent invests a proportion π(t) of their wealth X in this market, their wealth process evolves as
$$dX(t) = X(t)\big(r + \pi(t)Av(t)\big)\,dt + X(t)\pi(t)\sqrt{v(t)}\,dW^{s}(t).$$
We will first consider this as a 2-dimensional Markovian problem, then as a random coefficient problem.
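As an illustration of these dynamics, the following is a minimal NumPy sketch of an Euler-Maruyama simulation of (S, v, X) under a fixed proportional strategy π. The parameter values, the fixed strategy and the full truncation used to keep the square root well defined are assumptions made only for the sketch.

```python
import numpy as np

# assumed model parameters (for illustration only)
r, A, kappa, theta_v, xi, rho = 0.05, 0.5, 1.0, 0.1, 0.2, -0.5
T, N, n_paths = 0.5, 50, 10_000
dt = T / N
pi = 0.4                                  # fixed proportion of wealth held in the stock

rng = np.random.default_rng(0)
S = np.full(n_paths, 1.0)
v = np.full(n_paths, 0.1)
X = np.full(n_paths, 1.0)

for _ in range(N):
    z1 = rng.standard_normal(n_paths)
    z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n_paths)   # correlated drivers
    dWs, dWv = np.sqrt(dt) * z1, np.sqrt(dt) * z2
    vol = np.sqrt(np.maximum(v, 0.0))     # full truncation of the variance
    S = S * (1.0 + (r + A * v) * dt + vol * dWs)
    X = X * (1.0 + (r + pi * A * v) * dt + pi * vol * dWs)
    v = v + kappa * (theta_v - v) * dt + xi * vol * dWv

p = 0.5
print("E[U(X(T))] under the fixed strategy:", np.mean(np.maximum(X, 1e-12) ** p / p))
```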
In the Markovian problem, the 2-dimensional process (X, v) is the state process controlled by the K ⊂ R valued process π. Consider the maximum utility problem given by
$$u(t, x, v) = \sup_{\pi \in \mathcal{A}} \mathbb{E}\big[U(X(T)) \,\big|\, (X(t), v(t)) = (x, v)\big]$$
for some utility function U. This setup fits in the framework of (3.5) with d = 2, n = 2, m = 1 and g(x, v) := U(x)1_{x>0} for (x, v) ∈ R². The drift and diffusion functions are given by
$$b(t, x, v, \pi) = \big(x(r + \pi A v),\; \kappa(\theta - v)\big)^\top, \qquad \sigma(t, x, v, \pi) = \begin{pmatrix} x\pi\sqrt{v} & 0 \\ \rho\xi\sqrt{v} & \sqrt{1-\rho^2}\,\xi\sqrt{v} \end{pmatrix}.$$
It is a usual assumption that the drift and diffusion functions are Lipschitz functions of the state and control variables, but this is not the case here due to the square root appearing in the diffusion. This may invalidate certain convergence patterns that we have seen for the utility maximisation problem thus far.

The dual state process consists of the volatility process v and an R₊-valued process Y that is controlled by some R²-valued process (η, γ), given by
$$dY(t) = -Y(t)\bigg[\big(r + \delta_K(\eta(t))\big)\,dt + \bigg(\frac{\eta(t)}{\sqrt{v(t)}} + \rho\gamma(t) + A\sqrt{v(t)}\bigg)\,dW^{s}(t) - \gamma(t)\,dW^{v}(t)\bigg]. \qquad (4.4)$$
As before, the dual process starts at some undetermined y > 0, which forms the dual control together with η and γ, and is chosen such that the process XY is a supermartingale. Now consider the dual control problem
$$\tilde{u}(t, y, v) = \inf_{\gamma,\eta}\mathbb{E}\big[\tilde{U}(Y(T)) \,\big|\, (Y(t), v(t)) = (y, v)\big]. \qquad (4.5)$$
In our numerical example we consider the unconstrained case K = R. As before, the presence of the δ_K term means that we must have η = 0. This puts the problem in the same form as [23], where only one process γ needs to be found.
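Since any admissible dual control and starting point give an upper bound on the primal value by weak duality, the dual dynamics (4.4) with η = 0 can be simulated directly for a trial γ, as in the following NumPy sketch. The constant trial control γ, the starting point y and all parameter values are assumptions for illustration; they are not claimed to be optimal.

```python
import numpy as np

# assumed parameters for the unconstrained case (eta = 0); gamma is a constant trial control
r, A, kappa, theta_v, xi, rho = 0.05, 0.5, 1.0, 0.1, 0.2, -0.5
T, N, n_paths, x0, y0, gamma, p = 0.5, 50, 10_000, 1.0, 0.5, 0.1, 0.5
dt = T / N
U_tilde = lambda y: (1.0 - p) / p * y ** (-p / (1.0 - p))   # convex dual of the power utility

rng = np.random.default_rng(1)
Y = np.full(n_paths, y0)
v = np.full(n_paths, 0.1)
for _ in range(N):
    z1 = rng.standard_normal(n_paths)
    z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n_paths)
    dWs, dWv = np.sqrt(dt) * z1, np.sqrt(dt) * z2
    vol = np.sqrt(np.maximum(v, 0.0))
    # Euler step of (4.4) with eta = 0: dY = -Y[ r dt + (rho*gamma + A sqrt(v)) dW^s - gamma dW^v ]
    Y = Y * (1.0 - r * dt - (rho * gamma + A * vol) * dWs + gamma * dWv)
    v = v + kappa * (theta_v - v) * dt + xi * vol * dWv

upper = np.mean(U_tilde(np.maximum(Y, 1e-12))) + x0 * y0    # weak-duality upper bound for this (y0, gamma)
print("dual upper bound:", upper)
```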
This setup fits in our framework for a dual problem with d = 2, m = 1, and n = 2. There is a slight difference in that only one of the initial state start points is a variable, and the other is fixed. This is not a multidimensional duality, as the process v appears in both state dynamics, but a single-dimensional duality between X and Y. The drift and diffusion functions are given by
$$\tilde{b}(t, y, v, \gamma) = \big(-yr,\; \kappa(\theta - v)\big)^\top, \qquad \tilde{\sigma}(t, y, v, \gamma) = \begin{pmatrix} -yA\sqrt{v} & \sqrt{1-\rho^2}\,y\gamma \\ \rho\xi\sqrt{v} & \sqrt{1-\rho^2}\,\xi\sqrt{v} \end{pmatrix}.$$
When we use the power utility U(x) = x^p/p for 0 < p < 1, exact solutions can be obtained for this problem. A solution is derived in [23] by assuming that the value function u has the form
$$u(t, x, v) = \frac{x^p}{p}\exp\big(C(t) + D(t)v\big) \qquad (4.6)$$
in terms of two unknown functions of time C and D. Two Riccati ODEs are derived for these functions, and exact solutions found, which we use as a benchmark. For our application we take r = 0. , ρ = −0. , p = 0. , κ = 1, θ = 0. , ξ = 0. , A = 0. , x = 1, and v = 0. . Table 2 shows the accuracy and runtime of the deep controlled 2BSDE method for a range of terminal times T and numbers of time steps N. We contrast this with the results of [23], in which, with T = 1 and N = 100, upper and lower bounds for the value were found with relative errors of 0.003% in 2240 seconds and 27600 seconds respectively, albeit for slightly different parameter choices. The upper bound was found faster using a convenient guess of the form of the dual control γ. We do not use 100 time steps for our algorithms due to computer memory constraints; in fact, the 50 time step algorithm is also slower than expected due to a memory bottleneck. Results are reported for T = 0.2, 0.5 and 1.0.
While the errors for the primal algorithm reduce as N increases, the value does not lie within the duality gap for some terminal times. This may be down to a discretisation error, and would be fixed by taking more time steps. We see convergence of the approximations to the true value as N increases, and this convergence appears to be linear, as we saw in Example 3.5. This convergence is explored further in Section 5.

T     N    Primal Value   Rel-Err (%)   Time (s)   Dual Value    Rel-Err (%)   Time (s)
0.2   5    2.03310E+00    1.01654E-02   83         2.03378E+00   4.35850E-02   98
0.2   10   2.03301E+00    5.74161E-03   206        2.03344E+00   2.66047E-02   239
0.2   20   2.03296E+00    3.27670E-03   591        2.03314E+00   1.23181E-02   604
0.2   50   2.03289E+00    2.81206E-04   2849       2.03295E+00   2.89678E-03   2622
0.5   5    2.07616E+00    2.75979E-02   152        2.07969E+00   1.97584E-01   183
0.5   10   2.07582E+00    1.14780E-02   362        2.07747E+00   9.08529E-02   411
0.5   20   2.07572E+00    6.36082E-03   927        2.07670E+00   5.38088E-02   1030
0.5   50   2.07562E+00    1.55757E-03   5070       2.07600E+00   1.98421E-02   4181
1.0   5    2.13328E+00    4.32410E-02   154        2.14847E+00   6.68466E-01   170
1.0   10   2.13377E+00    1.99985E-02   362        2.14264E+00   3.95457E-01   400
1.0   20   2.13399E+00    9.92254E-03   918        2.13715E+00   1.38171E-01   981
1.0   50   2.13403E+00    7.97535E-03   4230       2.13548E+00   6.01058E-02   4395

Table 2: Accuracy and runtime for the deep controlled 2BSDE method for various terminal times T and numbers of time steps N, applied to the unconstrained power utility problem in a stochastic volatility model.

Now, let us treat v as a random process, meaning that we can only sample v, and not the underlying Brownian motion W^v. In this setting both θ(t) = A√v(t) and σ(t) = √v(t) are random processes for the stock market, and we still wish to apply Theorem 4.1. We cannot directly apply this theorem, as v is not measurable with respect to the filtration generated by W^s. To circumvent this problem, we adopt a market completion method [17]: we first construct an artificial stock driven by W^v, add it to our market, and then make it unavailable for trading. To be specific, we define a new asset price S^v by
$$dS^{v}(t) = S^{v}(t)\big(r\,dt + dW^{v}(t)\big),$$
which ensures that v is adapted to the natural filtration of the Brownian motions driving the market. For simplicity we assume that m = k = 1, but this analysis extends naturally to multiple dimensions. We invest in this market with a 2-dimensional portfolio, restricted to the set K × {0}, ensuring that our wealth process X is still defined as in (2.2), controlled by a 1-dimensional K-valued process π. Now the dual state process (4.4) is controlled by y and a 2-dimensional dynamic control (η, γ) valued in K̃ × R. We would like to show that we can take γ = 0, as this would return the dual problem to the 1-dimensional control case, independent of W^v. We want this because we cannot 'see' W^v, and can only sample v directly at any time. To this end, let (P, Q) be the dual adjoint solutions. Fix a time t ∈ [0, T] and let Q(t) = (q₁, q₂). By Theorem 4.1 (2), the component of the implied primal control corresponding to the untraded artificial asset must vanish, hence q₂ = 0. This means that condition (3) becomes
$$P(t)\,\delta_K(\eta(t)) + q_1\,\sigma(t)^{-1}\eta(t) = 0,$$
which is independent of γ(t), leading to a free control. We can therefore take γ = 0, removing the dependence of the dual state process on W^v.

Table 3 shows the accuracy and runtime of the deep SMP method with a range of terminal times and step sizes, when we run it on the same problem formulation as above. Here K = R, so we take the constraining function h_K to be the identity.
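In this random coefficient formulation only the paths of v are observed, and θ(t) = A√v(t), σ(t) = √v(t) enter the deep SMP forward simulation as sampled coefficients. A minimal sketch of generating such coefficient paths is given below; for simplicity v is sampled on its own here, whereas in the correlated case (ρ ≠ 0) it should be simulated jointly with the Brownian motion W^s that drives Y and P. All numerical values are assumptions.

```python
import numpy as np

# sample paths of the variance process v, then use theta(t) = A sqrt(v(t)) and
# sigma(t) = sqrt(v(t)) as the (random) coefficients in the deep SMP forward pass
A, kappa, theta_v, xi = 0.5, 1.0, 0.1, 0.2
T, N, n_paths = 0.5, 20, 512
dt = T / N
rng = np.random.default_rng(2)

v = np.full(n_paths, 0.1)
theta_path, sigma_path = [], []
for _ in range(N):
    vol = np.sqrt(np.maximum(v, 0.0))
    theta_path.append(A * vol)      # market price of risk theta(t_i)
    sigma_path.append(vol)          # volatility sigma(t_i)
    v = v + kappa * (theta_v - v) * dt + xi * vol * np.sqrt(dt) * rng.standard_normal(n_paths)
# theta_path[i] and sigma_path[i] would replace the constants theta and sigma
# in the forward_pass sketched earlier, one value per path and time step
```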
We note that the upper bounds are an order of magnitude less accurate than the lower bounds for this problem. This indicates that the implied primal process approximations are more accurate than the dual approximations in this algorithm. Comparing these results to Table 2, the SMP method gives roughly the same accuracy of bounds as the dual deep controlled 2BSDE method, but with a lower computation time. In particular, the algorithm here only needed 5000 iteration steps, compared to 10000 for both of the 2BSDE algorithms. Remark 4.5.
We have applied this methodology to investing in the Vasicek model where the interestrate is a random process rather than the volatility. In the same way as for the Heston model weconsider it both as a 2-dimensional state process and a 1-dimensional state process with randomcoefficients. We find similar results as for the Heston model, so we omit this example for brevity.
T     N    u_low         Rel-Diff (%)   u_high        Rel-Diff (%)   Time (s)
0.2   5    2.03271E+00   9.17595E-03    2.03348E+00   2.90360E-02    103
0.2   10   2.03280E+00   4.40981E-03    2.03323E+00   1.64607E-02    273
0.2   20   2.03284E+00   2.60193E-03    2.03305E+00   7.49044E-03    781
0.2   50   2.03289E+00   5.76610E-05    2.03295E+00   2.54626E-03    4788
0.5   5    2.07512E+00   2.25897E-02    2.07974E+00   2.00124E-01    102
0.5   10   2.07538E+00   9.97694E-03    2.07738E+00   8.63624E-02    301
0.5   20   2.07563E+00   2.04008E-03    2.07656E+00   4.67255E-02    778
0.5   50   2.07556E+00   1.16593E-03    2.07595E+00   1.74975E-02    3818
1.0   5    2.13025E+00   1.84896E-01    2.14786E+00   6.40234E-01    104
1.0   10   2.13232E+00   8.78697E-02    2.13995E+00   2.69313E-01    263
1.0   20   2.13489E+00   3.25920E-02    2.13750E+00   1.54844E-01    761
1.0   50   2.13440E+00   9.35931E-03    2.13551E+00   6.11910E-02    4604

Table 3: Accuracy and runtime for the deep SMP method for various terminal times T and numbers of time steps N, applied to the unconstrained power utility problem in a stochastic volatility model.

Example 4.6 (Heston with non-HARA utility). We now look at a higher dimensional example with no explicit solution. Consider the same stochastic volatility market, but with multiple stocks. For simplicity we assume all stocks are independent with the same parameters, but this is not a necessary restriction. We let U have the form (3.17). For this model we do not have a convenient ansatz for u to obtain a solution, so we simply compare our three algorithms. Say we have n stocks. We then have control processes π(t) ∈ Rⁿ and Γ(t) ∈ R^{(n+1)×(n+1)} for the primal 2BSDE algorithm, v(t) ∈ R^{2n} and Γ(t) ∈ R^{(n+1)×(n+1)} for the dual 2BSDE algorithm, and v(t), Q(t) ∈ Rⁿ for the SMP algorithm.

For this example let us take T = 0. , N = 20, and n = 10. We then have m = 10, d = 11, and k = 20 in (2.2). For the 2BSDE algorithms, the combined dimensions of the unknown processes are 131 for the primal and 141 for the dual. We consider the unconstrained problem, so we can set η = 0. For the SMP algorithm we only need to find processes in 20 dimensions. We run the algorithm for 10000 steps. Table 4 shows the results. We give the value approximation for the 2BSDE algorithms, and the average of the upper and lower bounds for the SMP method. We also provide the runtime to show the efficiency of the algorithms as a whole, but the individual differences should be taken with a pinch of salt due to implementation differences, see Remark 4.7. The runtime has increased from the previous low dimensional example, but sub-linearly. The relative difference between approximations has increased to about 1%.

Method   Value         Time (s)
Primal   2.61059E+00   2316
Dual     2.63758E+00   5636
SMP      2.62514E+00   1840

Table 4: Value approximations for the 2BSDE and SMP algorithms applied to the unconstrained non-HARA utility problem in a stochastic volatility model.
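To make the parameter counts above concrete, a small helper along the following lines reproduces the quoted dimensions for n = 10, under our reading of the counts (state dimension d = n + 1); it is an illustrative check rather than code from the paper.

```python
def unknown_process_dims(n: int) -> dict:
    """Dimensions of the processes approximated at each time step, for n independent stocks."""
    d = n + 1                          # state: wealth X plus n variance processes
    return {
        "primal 2BSDE": n + d * d,     # control pi in R^n and second-order process Gamma in R^{d x d}
        "dual 2BSDE": 2 * n + d * d,   # dual control (eta, gamma) in R^{2n} and Gamma in R^{d x d}
        "deep SMP": 2 * n,             # dual control v in R^n and adjoint Q in R^n
    }

print(unknown_process_dims(10))        # {'primal 2BSDE': 131, 'dual 2BSDE': 141, 'deep SMP': 20}
```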
Remark 4.7. There are a number of factors that affect the algorithm run times. For example, we see that the dual algorithm takes twice as long as the primal, but we did not simplify the algorithm to reflect the constraints of the problem; making this change would have reduced the run time. Additionally, the Z processes defined in (3.9) involve a differentiation element. This differentiation is performed automatically by the software (automatic differentiation), as opposed to using an explicit derivative as in the BSDE (4.2). This does not affect the accuracy, but does increase the run time.

Example 4.8 (Power Utility in a Path Dependent Volatility Model). In this example not only is the control problem non-Markovian, but it cannot be made Markovian by including additional information in our state space. This means that we cannot apply our 2BSDE methods.

This example is motivated by the idea that the volatility of a stock should depend on its value: if the stock is doing well, the volatility is assumed to be lower. This may be due to infrequent trading of the stock, or the stability of the large company to which the stock is assigned. We consider an extreme example, where there are two levels of volatility. The normal, higher level is maintained until the stock reaches its historical maximum, at which point it instantly transitions to the lower volatility.

Consider the market of Section 2, where σ is a function of the stock history. For each t ∈ [0, T], let σ(t) be a diagonal matrix with i-th diagonal entry σ_i(t) defined as
$$\sigma_i(t) = \begin{cases} \sigma_{\mathrm{low}}, & S_i(t) < \sup_{s \le t} S_i(s), \\ \sigma_{\mathrm{high}}, & S_i(t) = \sup_{s \le t} S_i(s), \end{cases}$$
where σ_low > σ_high are constants and S_i is the i-th stock price; the subscripts refer to the level of the stock (below or at its running maximum) rather than the size of the volatility. In this model, the volatility of a stock is inversely related to the ratio of its value to its running maximum. This makes the diffusion coefficient of the wealth process X a function of the entire path of S, instead of its value at a single time. This path dependence ensures that the utility maximisation problem is truly non-Markovian, in the sense that it cannot be made Markovian by the introduction of a new variable. We do not need to apply an artificial stock market argument here, as the volatility is a measurable function of the adapted stock market process up to time t, hence is progressively measurable.

For our application, we consider a 2-dimensional stock market with cone constraint K = R²₊. The constraining function we use in this case is h_K(x) = x² (applied componentwise). We take σ_low = 0. , σ_high = 0. , r(t) = 0. , µ(t) = 0. , and p = 0.5.
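As an illustration of the path dependence, a minimal NumPy sketch of the stock and wealth simulation under this volatility rule might look as follows, for one stock and a fixed proportional strategy, assuming the standard wealth dynamics dX = X[(r + π(µ − r))dt + πσ dW]. All numerical values and the fixed strategy are assumptions made for the sketch; note the paper's naming convention, under which σ_low (used below the maximum) is the larger value.

```python
import numpy as np

# assumed values: two volatility levels, drift, rate, strategy and time grid
sigma_low, sigma_high, mu, r, pi = 0.3, 0.1, 0.05, 0.02, 0.5
T, N, n_paths = 0.5, 50, 10_000
dt = T / N
rng = np.random.default_rng(3)

S = np.full(n_paths, 1.0)
S_max = S.copy()                   # running maximum of each stock path
X = np.full(n_paths, 1.0)

for _ in range(N):
    at_max = S >= S_max            # volatility drops to sigma_high when the stock sits at its maximum
    sigma = np.where(at_max, sigma_high, sigma_low)
    dW = np.sqrt(dt) * rng.standard_normal(n_paths)
    S = S * (1.0 + mu * dt + sigma * dW)
    X = X * (1.0 + (r + pi * (mu - r)) * dt + pi * sigma * dW)
    S_max = np.maximum(S_max, S)

p = 0.5
print("E[U(X(T))] for the fixed strategy:", np.mean(np.maximum(X, 1e-12) ** p / p))
```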
Table 5 shows the duality gap when this algorithm is run for varying terminal times and numbers of time steps. We run the algorithms for 10000 iteration steps. The 'Diff' and 'Rel-Diff' columns show the absolute and relative distances between the upper and lower approximations respectively. We see in all cases that the duality gap reduces as the number of time steps increases. We do not have a benchmark to compare these results to, as the path dependence and incompleteness of the problem make it very difficult to solve analytically.

T     N    u_low         u_high        Diff          Rel-Diff (%)
0.2   5    2.01057E+00   2.01066E+00   8.99222E-05   4.47246E-03
0.2   10   2.01054E+00   2.01063E+00   8.55058E-05   4.25287E-03
0.2   20   2.01057E+00   2.01063E+00   6.00701E-05   2.98772E-03
0.2   50   2.01061E+00   2.01062E+00   1.30984E-05   6.51462E-04
0.5   5    2.02652E+00   2.02682E+00   3.03732E-04   1.49879E-02
0.5   10   2.02654E+00   2.02676E+00   2.11323E-04   1.04278E-02
0.5   20   2.02648E+00   2.02665E+00   1.73093E-04   8.54166E-03
0.5   50   2.02637E+00   2.02650E+00   1.31536E-04   6.49120E-03
1.0   5    2.05333E+00   2.05421E+00   8.87694E-04   4.32320E-02
1.0   10   2.05337E+00   2.05397E+00   5.93055E-04   2.88821E-02
1.0   20   2.05327E+00   2.05368E+00   4.16821E-04   2.03005E-02
1.0   50   2.05307E+00   2.05339E+00   3.20542E-04   1.56128E-02

Table 5: Duality gap for the deep SMP method for various terminal times T and numbers of time steps N, applied to the cone-constrained power utility problem with path dependent volatility.

Here we compare the results of small changes to the methodology of the algorithm. We consider the primal deep controlled 2BSDE algorithm for the unconstrained Merton problem, with T = 1.0 and m = 10 unless stated otherwise, and the other coefficients chosen as in Example 3.5. These tests were performed on an HP ZBook Studio G4 with a 2.9 GHz Intel i7 CPU, with no GPU acceleration.

The first comparison we consider is the activation function used within the neural networks (cf. the function h in the description of the algorithm). Table 6 shows the results of this test, where we take N = 20. The choice of activation function is not important for this particular problem. All the functions are valid for this algorithm, since they are continuous functions that are not polynomials, see [20].

Name       Function                 Rel-Err (%)    Time (s)
ReLU       x ↦ max(0, x)            3.57771E-03    352.7
softplus   x ↦ log(1 + e^x)         3.57250E-03    352.0
tanh       x ↦ tanh(x)              3.44745E-03    352.8
sigmoid    x ↦ e^x / (e^x + 1)

Table 6: Accuracy and runtime for different choices of activation function.

The next comparison is the dimension m of the stock market, for three values of m and a range of numbers of time steps N. Figure 5.1 shows the results of this test. The first graph shows the error of approximation against N^{-1}, and the second shows runtime against N. We see the order O(N^{-1}) convergence for each dimension, but the error increases with dimension. Therefore, to achieve the same accuracy, a larger number of time steps would be needed for a higher dimension. The dimensionality complexity of this algorithm is hence tied to the complexity with respect to the number of time steps N, which looks from the graph to be of order O(N). The time difference between the three dimensions is minimal. However, this is for a set number of iteration steps, and for even higher dimensions more steps may be needed, which would increase the runtime compared to lower dimensional models.

Figure 5.1: Accuracy and runtime of the primal deep controlled 2BSDE method applied to unconstrained power utility problems with different dimensions. (a) Approximation error of methods. (b) Runtime of methods.
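For reference, the per-time-step sub-networks used in these comparisons could be built along the following lines, with the activation function (compared in Table 6 above) left as a configurable choice. The layer widths and the use of batch normalisation follow common practice for deep BSDE solvers [16]; they are assumptions rather than the exact architecture used in the paper, and the names are hypothetical.

```python
import torch

def subnet(dim_in: int, dim_out: int, width: int = 16, activation: str = "tanh") -> torch.nn.Module:
    """Feed-forward sub-network for a single time step, with a configurable activation."""
    act = {"relu": torch.nn.ReLU, "softplus": torch.nn.Softplus,
           "tanh": torch.nn.Tanh, "sigmoid": torch.nn.Sigmoid}[activation]
    return torch.nn.Sequential(
        torch.nn.BatchNorm1d(dim_in),        # normalise the state input
        torch.nn.Linear(dim_in, width), act(),
        torch.nn.Linear(width, width), act(),
        torch.nn.Linear(width, dim_out),
    )

# example: a control network pi_i(x, v) for the 2-dimensional Markovian Heston problem
pi_net = subnet(dim_in=2, dim_out=1, activation="relu")
```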
Finally, we consider the effect of the terminal time T on the relative error of the output. Since nothing else changes in the algorithm, we have found that changing T while keeping the number of time steps N fixed does not change the runtime. However, Figure 5.2 shows an increasing error with T as it increases from 0.2 to 1.0. Both graphs show the same data, the left using a log scale for the y axis and the right squaring the x axis, to show the order O(T) error implied by the data. We display this data for several numbers of time steps, from 5 to 20. It is notable that even increasing the number of time steps by the same factor does not compensate for the error caused by increasing the terminal time: for example, the relative error when we use (T, N) = (0. , 5) is 0.05%, whereas when we use (T, N) = (0. , 10) it is 0.12%, which is just over double the error. This O(T) versus O(N^{-1}) behaviour of the error may be why an assumption of T being 'small' is used in the convergence analysis of the first order BSDE solver, see [13].

Figure 5.2: Accuracy of the primal deep controlled 2BSDE method using different terminal times and numbers of time steps, applied to the unconstrained power utility problem. Both graphs display the same data, but the left uses a log scale for the y axis, and the right a squared scale for the x axis. (a) Log-scaled approximation error. (b) Square-scaled approximation error.
Conclusion
In this paper we have studied a variety of constrained utility maximisation problems, including some that are incomplete or non-Markovian, and introduced two algorithms, called the deep controlled 2BSDE method and the deep SMP method, for solving such problems, which involve simulating a system of BSDEs of first or second order. We use these equations, together with the optimality conditions for the optimal control, to find the value function of the problem. We apply these algorithms to a range of utility functions and closed convex control constraints, and find highly accurate solutions, with order O(N^{-1}) convergence in terms of the number of time steps of the discretisation. We derive tight bounds on the value function at time 0 using the convex duality method, which results in a primal problem and a dual problem, both of which can be solved with our algorithms. There remain many interesting outstanding problems, such as convergence analysis, duality theory for multidimensional state processes, and the efficiency of the algorithm. We leave these and others for future research.

References

[1] C. Beck, S. Becker, P. Cheridito, A. Jentzen, and A. Neufeld. Deep splitting method for parabolic PDEs. arXiv:1907.03452, 2019.
[2] C. Beck, W. E, and A. Jentzen. Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. Journal of Nonlinear Science, pages 1-57, 2017.
[3] S. Becker, P. Cheridito, and A. Jentzen. Deep optimal stopping. Journal of Machine Learning Research, 20:1-25, 2019.
[4] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Springer, 2009.
[5] H. Buehler, L. Gonon, J. Teichmann, and B. Wood. Deep hedging. Quantitative Finance, pages 1-21, 2019.
[6] J. De Spiegeleer, D. B. Madan, S. Reyners, and W. Schoutens. Machine learning for quantitative finance: Fast derivative pricing, hedging and fitting. Quantitative Finance, 18(10):1635-1643, 2018.
[7] W. E, J. Han, and A. Jentzen. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349-380, Dec 2017.
[8] S. Eckstein and M. Kupper. Computation of optimal transport and related hedging problems via penalization and neural networks. Applied Mathematics & Optimization, 2019.
[9] I. Exarchos and E. A. Theodorou. Stochastic optimal control via forward and backward stochastic differential equations and importance sampling. Automatica, 87:159-165, 2018.
[10] L. Gonon, J. Muhle-Karbe, and X. Shi. Asset pricing with general transaction costs: Theory and numerics. arXiv:1905.05027, 2020.
[11] P. Grohs, F. Hornung, A. Jentzen, and P. Von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv:1809.02362, 2018.
[12] J. Han. Deep learning approximation for stochastic control problems. arXiv:1611.07422, 2016.
[13] J. Han and J. Long. Convergence of the deep BSDE method for coupled FBSDEs. arXiv:1811.01165, 2018.
[14] P. Henry-Labordere. Deep primal-dual algorithm for BSDEs: Applications of machine learning to CVA and IM. SSRN 3071506, 2017.
[15] C. Huré, H. Pham, A. Bachouch, and N. Langrené. Deep neural networks algorithms for stochastic control problems on finite horizon, part 1: Convergence analysis. arXiv:1812.04300, 2018.
[16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning (ICML), 37:448-456, 2015.
[17] I. Karatzas and S. E. Shreve. Methods of Mathematical Finance. Springer, 1998.
[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
[19] C. Labbé and A. J. Heunis. Conjugate duality in problems of constrained utility maximization. Stochastics, 81(6):545-565, 2009.
[20] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861-867, 1993.
[21] Y. Li and H. Zheng. Dynamic convex duality in constrained utility maximization. Stochastics, 90(8):1145-1169, 2018.
[22] S. Liu, C. W. Oosterlee, and S. M. Bohte. Pricing options and computing implied volatilities using neural networks. Risks, 7(1):16, 2019.
[23] J. Ma, W. Li, and H. Zheng. Dual control Monte-Carlo method for tight bounds of value function under Heston stochastic volatility model. European Journal of Operational Research, 280(2):428-440, 2020.
[24] M. Pereira, Z. Wang, and E. A. Theodorou. Neural network architectures for stochastic control using the nonlinear Feynman-Kac lemma. arXiv:1902.03986, 2019.
[25] H. Pham. Continuous-Time Stochastic Control and Optimization with Financial Applications. Springer, 2009.
[26] S. Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747, 2016.
[27] Z. Wang, M. A. Pereira, and E. A. Theodorou. Deep 2FBSDEs for systems with control multiplicative noise. arXiv:1906.04762, 2019.