Pricing Options Under Rough Volatility with Backward SPDEs
Christian Bayer ∗ Jinniao Qiu † Yao Yao † August 5, 2020
Abstract
In this paper, we study option pricing problems for rough volatility models. As the framework is non-Markovian, the value function for a European option is not deterministic; rather, it is random and satisfies a backward stochastic partial differential equation (BSPDE). The existence and uniqueness of weak solutions is proved for general nonlinear BSPDEs with unbounded random leading coefficients, whose connections with certain forward-backward stochastic differential equations are derived as well. These BSPDEs are then used to approximate American option prices. A deep learning-based method is also investigated for the numerical approximation of such BSPDEs and the associated non-Markovian pricing problems. Finally, examples of rough Bergomi type are numerically computed for both European and American options.
Mathematics Subject Classification (2010):
Keywords: rough volatility, option pricing, stochastic partial differential equation, machine learning, stochastic Feynman-Kac formula
∗ Weierstrass Institute for Applied Analysis and Stochastics (WIAS), Berlin, Germany. Email: [email protected]. C. Bayer gratefully acknowledges funding by the German Research Foundation (DFG) (project AA4-2 within the cluster of excellence MATH+).
† Department of Mathematics & Statistics, University of Calgary, 2500 University Drive NW, Calgary, AB T2N 1N4, Canada. Email: [email protected] (J. Qiu), [email protected] (Y. Yao). J. Qiu was partially supported by the Natural Sciences and Engineering Research Council of Canada and by start-up funds from the University of Calgary.

1 Introduction

Let $(\Omega,\mathscr{F},(\mathscr{F}_t)_{t\in[0,T]},\mathbb{P})$ be a complete filtered probability space, where the filtration $(\mathscr{F}_t)_{t\in[0,T]}$ is the augmented filtration generated by two independent Wiener processes $W$ and $B$. Throughout this paper, we denote by $(\mathscr{F}^W_t)_{t\in[0,T]}$ the augmented filtration generated by the Wiener process $W$. The predictable $\sigma$-algebras on $\Omega\times[0,T]$ corresponding to $(\mathscr{F}^W_t)_{t\in[0,T]}$ and $(\mathscr{F}_t)_{t\in[0,T]}$ are denoted by $\mathscr{P}^W$ and $\mathscr{P}$, respectively.

We consider a general stochastic volatility model given under a risk-neutral probability measure as
$$ dS_t = rS_t\,dt + S_t\sqrt{V_t}\left(\rho\,dW_t + \sqrt{1-\rho^2}\,dB_t\right); \qquad S_0 = s_0, \tag{1.1} $$
where $\rho\in[-1,1]$ denotes the correlation coefficient and the constant $r$ the interest rate. We impose the following assumptions on the stochastic variance process $V$.

Assumption 1.1. $V$ has continuous trajectories, takes values in $\mathbb{R}_{\ge0}$, and is adapted to the filtration generated by the Brownian motion $W$. We further assume that $V$ is integrable, i.e.,
$$ E\left[\int_0^T V_s\,ds\right] < \infty, \qquad \forall\, T>0. $$

Note that we do not assume that $V$ (or even $(S,V)$) is a Markov process or a semi-martingale, and, in fact, our main examples will be neither. Indeed, the motivation of this work is to extend the backward stochastic differential equation-based pricing theory to rough volatility models. These models were put forth in [GJR18] in order to explain the roughness of time series of daily realized variance estimates. The idea is that the spot price process is modeled by a stochastic volatility model, with the stochastic variance process essentially behaving like an exponential fractional Brownian motion with Hurst index $0 < H < 1/2$. In the pricing domain, rough volatility was found in [BFG16] to lead to extremely accurate fits of SPX implied volatility surfaces with very few parameters, in particular explaining the power law behaviour of the ATM implied volatility skew for short maturities; see also [ALV07, Fuk11]. Since then, there have been many new contributions to the literature of rough volatility models, including developments of rough Heston models with closed expressions for the characteristic functions (see [EER19]), microstructural foundations of rough volatility models ([EEFR18]), and calibration of rough volatility models by machine learning techniques ([BHM+19]).

Example 1.1.
In the rough Bergomi model (see [BFG16]), the stochastic variance is given as
$$ V_t = \xi_t\,\mathcal{E}\big(\eta\widehat{W}_t\big), \tag{1.2} $$
where $\xi_t$ denotes the forward variance curve (a quantity which can be computed from the implied volatility surface), $\mathcal{E}$ denotes the Wick exponential, i.e., $\mathcal{E}(Z) := \exp\big(Z - \frac12\operatorname{Var}Z\big)$ for a zero-mean normal random variable $Z$, and $\eta\ge0$. Finally, $\widehat{W}$ denotes a fractional Brownian motion (fBm) of Riemann-Liouville type with Hurst index $0<H<1/2$, i.e.,
$$ \widehat{W}_t := \int_0^t K(t-s)\,dW_s, \qquad K(r) := \sqrt{2H}\,r^{H-1/2}, \quad r>0. \tag{1.3} $$
If the correlation $\rho$ is negative, then Gassiat [Gas18] showed that the discounted price $e^{-rt}S_t$ is, indeed, a martingale; otherwise, it may not be a martingale. But the conditions of Assumption 1.1 are always satisfied.
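To make Example 1.1 concrete, the following Python sketch simulates the Riemann-Liouville fBm in (1.3) by a left-point Riemann-sum discretization and then forms the variance (1.2); since $\widehat{W}_t$ is Gaussian with $\operatorname{Var}(\widehat{W}_t) = t^{2H}$, the Wick exponential reads $\mathcal{E}(\eta\widehat{W}_t) = \exp(\eta\widehat{W}_t - \frac12\eta^2 t^{2H})$. This is a minimal sketch only: the function name and default parameter values are ours, and more accurate discretizations (e.g., the hybrid scheme) exist.

```python
import numpy as np

def simulate_rough_bergomi_variance(xi=0.09, eta=1.9, H=0.07, T=1.0,
                                    N=250, n_paths=1000, seed=0):
    """Simulate V_t = xi * exp(eta*What_t - 0.5*eta^2*t^{2H}) on a uniform grid.

    What_t = int_0^t sqrt(2H)(t-s)^{H-1/2} dW_s is approximated by a left-point
    Riemann sum; the exact variance t^{2H} in the Wick correction keeps
    E[V_t] = xi (a flat forward variance curve)."""
    rng = np.random.default_rng(seed)
    dt = T / N
    t = dt * np.arange(1, N + 1)                       # grid t_1, ..., t_N
    dW = rng.standard_normal((n_paths, N)) * np.sqrt(dt)
    What = np.zeros((n_paths, N))
    for j in range(N):
        # kernel evaluated at left endpoints s_i = i*dt, so t_j - s_i >= dt > 0
        k = np.sqrt(2 * H) * (t[j] - t[:j + 1] + dt) ** (H - 0.5)
        What[:, j] = dW[:, :j + 1] @ k
    V = xi * np.exp(eta * What - 0.5 * eta**2 * t ** (2 * H))
    return t, V, dW
```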
Example 1.2. In the rough Heston model introduced in [EER19], the stochastic variance satisfies the stochastic Volterra equation
$$ V_t = V_0 + \int_0^t K(t-s)\,\lambda(\theta - V_s)\,ds + \int_0^t K(t-s)\,\zeta\sqrt{V_s}\,dW_s, \tag{1.4} $$
where the kernel satisfies
$$ K(r) := \frac{r^{\alpha-1}}{\Gamma(\alpha)}, \qquad r>0, \quad 1/2<\alpha<1. \tag{1.5} $$
The rough Heston process also satisfies Assumption 1.1; see [JLP19].

For each $(t,s)\in[0,T]\times\mathbb{R}_+$, denote the asset/security price process by $S^{t,s}_\tau$, for $\tau\in[t,T]$, which satisfies the stochastic differential equation (SDE) in (1.1) but with initial time $t$ and initial state $s$ (price at time $t$). The fair price of a European option with payoff $H$, as the smallest initial wealth required to finance an admissible (super-replicating) wealth process, is given by
$$ P_t(s) := E\Big[e^{-r(T-t)}H(S^{t,s}_T)\,\big|\,\mathscr{F}_t\Big]; \tag{1.6} $$
refer to [CH05] for the cases when the discounted price $e^{-rt}S_t$ is just a local martingale. Taking $X_t = -rt + \log S_t$, we may reformulate the above pricing problem, i.e.,
$$ u_t(x) := E\Big[e^{-r(T-t)}H\big(e^{X^{t,x}_T+rT}\big)\,\big|\,\mathscr{F}_t\Big], \qquad (t,x)\in[0,T]\times\mathbb{R}, \tag{1.7} $$
subject to
$$ dX^{t,x}_s = \sqrt{V_s}\Big(\rho\,dW_s + \sqrt{1-\rho^2}\,dB_s\Big) - \frac{V_s}{2}\,ds, \quad 0\le t\le s\le T; \qquad X^{t,x}_t = x. \tag{1.8} $$
Obviously, we have the relation $u_t(x) = P_t(e^{x+rt})$ a.s.
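As a plain Monte Carlo counterpart of (1.6)-(1.8) (and a sanity check for the deep learning results of Section 4), one may estimate $u_0(x)$ for a put payoff by an Euler scheme for (1.8). The sketch below reuses the hypothetical simulate_rough_bergomi_variance helper from the previous snippet; the parameter defaults are again placeholders.

```python
import numpy as np

def european_put_price_mc(K=100.0, r=0.05, rho=-0.9, xi=0.09, T=1.0,
                          x0=np.log(100.0), n_paths=20000, N=250, seed=1):
    """Monte Carlo estimate of u_0(x0) in (1.7) for a put payoff H(s) = (K - s)^+.

    The log-price X of (1.8) is simulated with a left-point Euler scheme,
    driven by the same W that builds the variance V plus an independent B."""
    t, V, dW = simulate_rough_bergomi_variance(xi=xi, T=T, N=N,
                                               n_paths=n_paths, seed=seed)
    rng = np.random.default_rng(seed + 1)
    dt = T / N
    dB = rng.standard_normal((n_paths, N)) * np.sqrt(dt)
    # variance at left grid points: V_0 = xi, then V_{t_1}, ..., V_{t_{N-1}}
    V_left = np.hstack([np.full((n_paths, 1), xi), V[:, :-1]])
    X_T = x0 + np.sum(np.sqrt(V_left) * (rho * dW + np.sqrt(1.0 - rho**2) * dB)
                      - 0.5 * V_left * dt, axis=1)
    payoff = np.maximum(K - np.exp(X_T + r * T), 0.0)
    return np.exp(-r * T) * payoff.mean()
```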
The non-Markovianity of the pair $(S,V)$ (or $(X,V)$) makes it impossible to characterize the value function $u_t(x)$ with a conventional (deterministic) partial differential equation (PDE). Indeed, we prove that the function $u_t(x)$, for $(t,x)\in[0,T]\times\mathbb{R}$, is a random field which together with another random field $\psi_t(x)$ satisfies the following backward stochastic partial differential equation (BSPDE):
$$ \begin{cases} -du_t(x) = \Big[\dfrac{V_t}{2}D^2u_t(x) + \rho\sqrt{V_t}\,D\psi_t(x) - \dfrac{V_t}{2}Du_t(x) - ru_t(x)\Big]dt - \psi_t(x)\,dW_t; \\ u_T(x) = H(e^{x+rT}), \end{cases} \tag{1.9} $$
where the pair $(u,\psi)$ is unknown and the volatility process $(V_t)_{t\ge0}$ is defined exogenously as in Examples 1.1 and 1.2.

While BSPDEs have been extensively studied (see [BD14, DQT11, HMY02, Pen92] for instance), to the best of our knowledge, there is no available theory for the well-posedness of BSPDE (1.9), because the leading coefficient $V_t$ is neither uniformly bounded from above nor uniformly (strictly) positive from below, and the terminal value $H(e^{\cdot+rT})$ may not belong to any space $L^p(\Omega\times\mathbb{R})$ for $p\in(1,\infty)$. Hence, a weak solution theory is established for the well-posedness of general nonlinear BSPDEs and the associated stochastic Feynman-Kac formula, particularly applicable to (1.9). Such nonlinear BSPDEs are further used to approximate American option prices. Based on the stochastic Feynman-Kac formula with forward-backward stochastic differential equations (FBSDEs), we develop a deep learning-based method for the numerical approximation of the solutions, which are essentially defined on the (infinite-dimensional) probability space due to the randomness. Accordingly, the universal approximation theorem of neural networks is generalized from finite-dimensional input spaces to infinite-dimensional cases in the probabilistic setting. On the basis of this approximation result, we design schemes in the spirit of the Markovian counterpart by Huré, Pham, and Warin [HPW19], but equipped with neural networks with changing and high input dimensions. Some numerical results are also presented for examples of rough Bergomi type, along with an appended convergence analysis.

Here, although the theory and application results are presented for the case of a single risky asset under rough volatility, leading to associated BSPDEs on the one-dimensional space $\mathbb{R}$, a multi-dimensional extension may be obtained under certain assumptions in a similar manner; nevertheless, we would not seek such a generality, to avoid cumbersome arguments.

Finally, let us contrast the present work with the recent work [JO19]. Therein, with the method developed in [VZ+19], the European option price in a local rough volatility model is expressed as a function of $t$, $S_t$ and an additional, infinite-dimensional term $\Theta$, which is closely related to the forward variance curve. An infinite-dimensional pricing PDE for the option price with respect to these variables is then formulated and solved with a discretization method using deep neural networks as basis functions. The focus of [JO19] is clearly on the mathematical finance and numerical side, whereas well-posedness of the path-dependent PDE is more or less assumed. (They do refer to [EKTZ14], which, however, only covers the case of path-dependent PDEs with constant diffusion coefficients. Moreover, the arguments in [JO19] seem to require classical, rather than viscosity, solutions of the path-dependent PDE.) In this sense, our present work is complementary, as the well-posedness of the BSPDE is a serious concern of this paper. We also extend the consideration from the European to the American case, and provide a similar type of numerical discretization, also based on deep neural networks, but for the approximation of the associated FBSDEs.

The rest of this paper is organized as follows. Section 2 is devoted to the well-posedness of a class of nonlinear BSPDEs and the associated stochastic Feynman-Kac formula. The weak solution theory is then applied to approximations of American option prices under rough volatility in Section 3. In Section 4, we discuss the numerical approximations with a deep learning-based method: in the first subsection we address the approximation of random functions involving infinite-dimensional input spaces by neural networks in the probabilistic setting; a deep learning-based method is then introduced for non-Markovian BSDEs and associated BSPDEs in the second subsection; and in the third subsection we present some numerical examples for the rough Bergomi model. Finally, in the appendix, a convergence analysis is presented for the deep learning-based method.

2 Well-posedness of nonlinear BSPDEs and the stochastic Feynman-Kac formula

This section is devoted to a weak solution theory for the following nonlinear BSPDE:
$$ \begin{cases} -du_t(x) = \Big[\dfrac{V_t}{2}D^2u_t(x) + \rho\sqrt{V_t}\,D\psi_t(x) - \dfrac{V_t}{2}Du_t(x) + F_t\big(e^x, u_t(x), \sqrt{(1-\rho^2)V_t}\,Du_t(x), \psi_t(x)+\rho\sqrt{V_t}\,Du_t(x)\big)\Big]dt - \psi_t(x)\,dW_t, \quad (t,x)\in[0,T)\times\mathbb{R}; \\ u_T(x) = G(e^x), \quad x\in\mathbb{R}. \end{cases} \tag{2.1} $$
Noteworthily, BSPDE (1.9) turns out to be a particular case when $F_t(x,y,z,\tilde z)\equiv -ry$ and $G(e^x) = H(e^{x+rT})$.

We shall study the well-posedness of BSPDE (2.1) for a given continuous nonnegative process $(V_t)_{t\ge0}$ and address the representation relationship between BSPDE (2.1) and the associated FBSDE. Following are the assumptions on the coefficients $G$ and $F$.

Assumption 2.1. (1) The function $G: (\Omega\times\mathbb{R}, \mathscr{F}^W_T\otimes\mathcal{B}(\mathbb{R})) \to (\mathbb{R},\mathcal{B}(\mathbb{R}))$ satisfies
$$ |G(x)| \le L(1+|x|), \qquad x\in\mathbb{R}, $$
for some constant $L>0$.
(2) The function $F: (\Omega\times[0,T]\times\mathbb{R}^4, \mathscr{P}^W\otimes\mathcal{B}(\mathbb{R}^4)) \to (\mathbb{R},\mathcal{B}(\mathbb{R}))$ satisfies that there exists a positive constant $L\in(0,\infty)$ such that for all $x, y_1, y_2, z_1, z_2, \tilde z_1, \tilde z_2\in\mathbb{R}$ and $t\in[0,T]$,
$$ |F_t(x,y_1,z_1,\tilde z_1) - F_t(x,y_2,z_2,\tilde z_2)| \le L\big(|y_1-y_2| + |z_1-z_2| + |\tilde z_1-\tilde z_2|\big), \quad \text{a.s.}, $$
$$ |F_t(x,0,0,0)| \le L(1+|x|), \quad \text{a.s.}, $$
$$ |F_t(x,y_1,z_1,\tilde z_1) - F_t(x,y_1,0,0)| \le L, \quad \text{a.s.} $$

For the well-posedness of BSPDE (2.1) under Assumption 2.1, the difficulty lies in the combination of the non-uniform-boundedness of $(V_t)_{t\in[0,T]}$ and the non-integrability of $G(e^x)$ and $F_t(e^x,y,z,\tilde z)$ w.r.t. $x$ on the whole space $\mathbb{R}$.
Indeed, from the condition on $(V_t)_{t\ge0}$ in Assumption 1.1, we may conclude that $e^{X^{0,x}_s}$ is a positive local martingale and thus a supermartingale, satisfying $E[e^{X^{0,x}_t}]\le e^x$ for instance; however, it is not appropriate to expect $E\big[|e^{X^{0,x}_t}|^p\big]<\infty$ for some $p>1$. The boundedness condition on $F$ in $(Z,\tilde Z)$ is not necessary for the concerned examples in this paper; we assume the Lipschitz continuity and boundedness in $(Z,\tilde Z)$ for the reader's interest. In fact, for the well-posedness of the involved BSDEs and BSPDEs in the $L^1$ spaces, it is not appropriate to assume linear growth in $(Z,\tilde Z)$, as indicated in the theory of $L^1$ solutions for BSDEs (see [BDH+03, Section 6]); it might be workable for certain fractional growths in $(Z,\tilde Z)$, while we would not seek such a generality, to avoid cumbersome arguments in this work.

Corresponding to BSPDE (2.1), there follows the BSDE:
$$ \begin{cases} -dY^{t,x}_s = F_s\big(e^{X^{t,x}_s}, Y^{t,x}_s, Z^{t,x}_s, \tilde Z^{t,x}_s\big)\,ds - \tilde Z^{t,x}_s\,dW_s - Z^{t,x}_s\,dB_s, \quad 0\le t\le s\le T; \\ Y^{t,x}_T = G\big(e^{X^{t,x}_T}\big), \end{cases} \tag{2.2} $$
where the triple $(Y^{t,x}_s, Z^{t,x}_s, \tilde Z^{t,x}_s)$ is defined as the solution to BSDE (2.2) in the sense of [BDH+03, Definition 2.1]. Under Assumptions 1.1 and 2.1, BSDE (2.2) has a unique solution $(Y^{t,x}_s, Z^{t,x}_s, \tilde Z^{t,x}_s)$ for each $(t,x)\in[0,T)\times\mathbb{R}$ (see [BDH+03, Theorem 6.3]).

2.1 Weak solutions of BSPDE (2.1)
Denote by $C_c^\infty$ the space of infinitely differentiable functions with compact support in $\mathbb{R}$, and let $\mathcal{D}$ be the space of real-valued Schwartz distributions on $C_c^\infty$. The Lebesgue measure in $\mathbb{R}$ will be denoted by $dx$. $L^2(\mathbb{R})$ ($L^2$ for short) is the usual space of square-integrable functions, with scalar product and norm defined by
$$ \langle\phi,\psi\rangle = \int_{\mathbb{R}}\phi(x)\psi(x)\,dx, \qquad \|\phi\| = \langle\phi,\phi\rangle^{1/2}, \quad \forall\,\phi,\psi\in L^2. $$
For convenience, we shall also use $\langle\cdot,\cdot\rangle$ to denote the duality between the Schwartz distribution space $\mathcal{D}$ and $C_c^\infty$.

By $\mathcal{D}_{\mathscr{F}}$ (respectively, $\mathcal{D}_{\mathscr{F}^W}$) we denote the set of all $\mathcal{D}$-valued functions defined on $\Omega\times[0,T]$ such that, for any $u\in\mathcal{D}_{\mathscr{F}}$ (respectively, $u\in\mathcal{D}_{\mathscr{F}^W}$) and $\phi\in C_c^\infty$, the function $\langle u,\phi\rangle$ is $\mathscr{P}$ (respectively, $\mathscr{P}^W$)-measurable. When there is no confusion about the involved filtration, we shall just write $\mathcal{D}$.

For $p=1,2$, denote by $\mathcal{D}^p$ the totality of $u\in\mathcal{D}$ such that for any $R\in(0,\infty)$ and $\phi\in C_c^\infty$, we have
$$ \int_0^T \sup_{|x|\le R}\big|\langle u_t(\cdot),\phi(\cdot-x)\rangle\big|^p\,dt < \infty \quad \text{a.s.} $$

Lemma 2.1.
Given $u\in\mathcal{D}^p$ for $p=1,2$, it holds that:
(i) $Du\in\mathcal{D}^p$;
(ii) for each continuous function $\varrho$ on $\mathbb{R}$, we have $\varrho u\in\mathcal{D}^p$ if $u\in L^2(\Omega\times[0,T]\times\mathbb{R})$;
(iii) for any continuous processes $(x_t)_{t\in[0,T]}$ and $(y_t)_{t\in[0,T]}$ with $\max_{t\in[0,T]}|x_t|+|y_t|<\infty$ a.s., the random field $\tilde u_t(x) := y_t\,u_t(x+x_t)$ also lies in $\mathcal{D}^p$.

Proof. The assertion (i) may also be found in [Kry10, page 297]. In fact, for each $\phi\in C_c^\infty$, we have $D\phi\in C_c^\infty$, and the integration-by-parts formula indicates that
$$ \langle Du_t(\cdot),\phi(\cdot-x)\rangle = -\langle u_t(\cdot),(D\phi)(\cdot-x)\rangle. $$
Hence, $Du\in\mathcal{D}^p$ if $u\in\mathcal{D}^p$.

For assertion (ii), notice that for each $\gamma\in(0,\infty)$,
$$ \sup_{|x|\le\gamma}\big|\langle\varrho(\cdot)u_t(\cdot),\phi(\cdot-x)\rangle\big|^p \le \|u_t\|^p\,\|\phi\|^p \max_{|x|\le\gamma+R_0}|\varrho(x)|^p, $$
where we choose a sufficiently big $R_0>0$ such that the support of $\phi$ is contained in $[-R_0,R_0]$. Then it follows obviously that $\varrho u\in\mathcal{D}^p$.

Lastly, as $\max_{t\in[0,T]}|x_t|+|y_t|<\infty$ a.s. and for each $\gamma\in(0,\infty)$,
$$ \sup_{|x|\le\gamma}\big|\langle y_t u_t(\cdot+x_t),\phi(\cdot-x)\rangle\big|^p = \sup_{|x|\le\gamma}\big|\langle u_t(\cdot), y_t\phi(\cdot-x_t-x)\rangle\big|^p \le \sup_{|x|\le\gamma+\max_{t\in[0,T]}|x_t|}\big|\langle u_t(\cdot),\phi(\cdot-x)\rangle\big|^p \max_{t\in[0,T]}|y_t|^p, $$
there holds assertion (iii).

For $u,f,g\in\mathcal{D}$, we say that the equality
$$ du_t(x) = f_t(x)\,dt + g_t(x)\,dW_t, \qquad t\in[0,T], $$
holds in the sense of distributions if $f\in\mathcal{D}^1$, $g\in\mathcal{D}^2$ and for each $\phi\in C_c^\infty$, it holds a.s. that
$$ \langle u_t(\cdot),\phi\rangle = \langle u_0(\cdot),\phi\rangle + \int_0^t \langle f_s(\cdot),\phi\rangle\,ds + \int_0^t \langle g_s(\cdot),\phi\rangle\,dW_s, \qquad \forall\,t\in[0,T]. $$

Definition 2.1.
A pair $(u,\psi)\in\mathcal{D}^1_{\mathscr{F}^W}\times\mathcal{D}^2_{\mathscr{F}^W}$ is said to be a weak solution of BSPDE (2.1) if
(i) $u_T(x) = G(e^x)$ a.s.;
(ii) for almost all $(\omega,t)\in\Omega\times[0,T]$, the functions $u_t(x)$, $\sqrt{(1-\rho^2)V_t}\,Du_t(x)$, and $\rho\sqrt{V_t}\,Du_t(x)+\psi_t(x)$ are locally integrable in $x\in\mathbb{R}$;¹
(iii) the equality
$$ -du_t(x) = \Big[\frac{V_t}{2}D^2u_t(x) + \rho\sqrt{V_t}\,D\psi_t(x) - \frac{V_t}{2}Du_t(x) + F_t\big(e^x, u_t(x), \sqrt{(1-\rho^2)V_t}\,Du_t(x), \psi_t(x)+\rho\sqrt{V_t}\,Du_t(x)\big)\Big]dt - \psi_t(x)\,dW_t $$
holds in the sense of distributions.

By Assumption 2.1, the linear growth of $(G,F)$ w.r.t. $e^x$ produces the local integrability in $x\in\mathbb{R}$. Therefore, in Definition 2.1 the local integrability is required of the weak solution, which not only gives a point-wise meaning to the compositions involved in the function $F$ but also makes the weak solution potentially workable under Assumption 2.1, particularly encompassing the concerned examples in this paper. Obviously, it differs from the $L^p$ ($p\in(1,\infty]$) integrability requirements for the weak or viscosity solutions in the existing BSPDE literature (see [DQT11, HMY02, Qiu18, Zho92] for instance).

2.2 Well-posedness of BSPDE (2.1) and the stochastic Feynman-Kac formula

First comes a result about the measurability of $Y^{t,x}_t$, which basically states that the randomness from the Wiener process $B$ is averaged out, as the randomness of all the coefficients is only (explicitly) subject to the sub-filtration $\{\mathscr{F}^W_t\}_{t\ge0}$.

Theorem 2.2.
Under Assumptions 1.1 and 2.1, for each $(t,x)\in[0,T]\times\mathbb{R}$, let $(Y^{t,x}_s, Z^{t,x}_s, \tilde Z^{t,x}_s)$ be the solution to BSDE (2.2). Then the value function $\Phi_t(x) := Y^{t,x}_t$ is just $\mathscr{F}^W_t$-measurable.

Proof. We shall adopt some techniques by Buckdahn and Li in [BL08]. For the underlying probability space, w.l.o.g., we may take $\Omega = C([0,T];\mathbb{R}^2) = \Omega^W\times\Omega^B$, with $\Omega^W = C([0,T];\mathbb{R})$, $\Omega^B = C([0,T];\mathbb{R})$, and for each $\omega\in\Omega$ one has $\omega = (\omega^W,\omega^B)$ with $\omega^W\in\Omega^W$ and $\omega^B\in\Omega^B$. And the two independent Wiener processes $W$ and $B$ may be defined on $\Omega^W$ and $\Omega^B$, respectively. Set
$$ H = \Big\{h;\ h(0)=0,\ \frac{dh}{dt}\in L^2(0,T;\mathbb{R})\Big\}, $$
which is the Cameron-Martin space associated with the Wiener process $B$. For any $h\in H$, we define the translation operator $\tau_h:\Omega\to\Omega$, $\tau_h((\omega^W,\omega^B)) = (\omega^W,\omega^B+h)$ for $\omega = (\omega^W,\omega^B)\in\Omega$. It is obvious that $\tau_h$ is a bijection and that it defines the probability transformation
$$ \big(\mathbb{P}\circ\tau_h^{-1}\big)(d\omega) = \exp\Big\{\int_0^T\frac{dh}{dt}\,dB_t - \frac12\int_0^T\Big|\frac{dh}{dt}\Big|^2\,dt\Big\}\,\mathbb{P}(d\omega). $$
Fix some $(t,x)\in[0,T]\times\mathbb{R}$ and set $H_t = \{h\in H \,|\, h(\cdot) = h(\cdot\wedge t)\}$. Recall
$$ X^{t,x}_T = x - \frac12\int_t^T V_s\,ds + \int_t^T\rho\sqrt{V_s}\,dW_s + \int_t^T\sqrt{(1-\rho^2)V_s}\,dB_s. $$

¹Here, by the local integrability of a function $g$ in $x\in\mathbb{R}$ we mean that for each bounded measurable set $D\subset\mathbb{R}$, the truncated function $g\cdot 1_D$ lies in $L^1(\mathbb{R})$.
By the Girsanov theorem, it follows that $X^{t,x}_T(\tau_h) = X^{t,x}_T$ for all $h\in H_t$, and thus we have $\Phi_t(x)(\tau_h) = \Phi_t(x)$ $\mathbb{P}$-a.s. for any $h\in H_t$. In particular, for any continuous and bounded function $G_0$,
$$ E\Big[G_0(\Phi_t(x))\exp\Big\{-\frac12\int_0^T\Big|\frac{dh}{ds}\Big|^2ds - \int_0^T\frac{dh}{ds}\,dB_s\Big\}\Big] = E\Big[G_0(\Phi_t(x))(\tau_h)\exp\Big\{-\frac12\int_0^T\Big|\frac{dh}{ds}\Big|^2ds - \int_0^T\frac{dh}{ds}\,dB_s\Big\}\Big] = E\big[G_0(\Phi_t(x))\big] = E\big[G_0(\Phi_t(x))\big]\,E\Big[\exp\Big\{-\frac12\int_0^T\Big|\frac{dh}{ds}\Big|^2ds - \int_0^T\frac{dh}{ds}\,dB_s\Big\}\Big], $$
which together with the arbitrariness of $(G_0,h)$ implies that $\Phi_t(x)$ is just $\mathscr{F}^W_t$-measurable.

Following is the Itô-Wentzell-Krylov formula.

Lemma 2.3 (Theorem 1 of [Kry10]). Let $x_t$ be an $\mathbb{R}$-valued predictable process of the following form:
$$ x_t = \int_0^t b_s\,ds + \int_0^t\beta_s\,dW_s + \int_0^t\sigma_s\,dB_s, $$
where $b$, $\sigma$ and $\beta$ are predictable processes such that for all $\omega\in\Omega$ and $s\in[0,T]$, it holds that $|\beta_s|+|\sigma_s|<\infty$ and
$$ \int_0^T\big(|b_t| + |\beta_t|^2 + |\sigma_t|^2\big)\,dt < \infty. $$
Assume that the equality
$$ du_t(x) = f_t(x)\,dt + g_t(x)\,dW_t, \qquad t\in[0,T], $$
holds in the sense of distributions, and define $v_t(x) := u_t(x+x_t)$. Then
$$ dv_t(x) = \Big(f_t(x+x_t) + \frac12\big(|\beta_t|^2+|\sigma_t|^2\big)D^2v_t(x) + \beta_t\,Dg_t(x+x_t) + b_t\,Dv_t(x)\Big)dt + \big(g_t(x+x_t) + \beta_t\,Dv_t(x)\big)\,dW_t + \sigma_t\,Dv_t(x)\,dB_t, \quad t\in[0,T], $$
holds in the sense of distributions.

We note that in the Itô-Wentzell formula by Krylov [Kry10, Theorem 1], the Wiener process $(W_t)_{t\ge0}$ may be general separable Hilbert space-valued and the process $(x_t)_{t\ge0}$ may be multi-dimensional. An application of the above Itô-Wentzell-Krylov formula gives the following stochastic Feynman-Kac formula, that is, the probabilistic representation of the weak solution to BSPDE (2.1) via the solution of the associated BSDE (2.2) coupled with the forward SDE (1.8).
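For the reader's convenience (this specialization is ours, not spelled out in [Kry10]), one may record how Lemma 2.3 produces the cancellations used in the proof of the next theorem. Taking the shift process from (1.8), i.e., $b_s = -V_s/2$, $\beta_s = \rho\sqrt{V_s}$, $\sigma_s = \sqrt{(1-\rho^2)V_s}$, so that $\frac12(|\beta_s|^2+|\sigma_s|^2) = V_s/2$, and reading $g_t = \psi_t$ and $f_t$ as the negative of the $dt$-coefficient of (2.1), the $dt$-terms of $dv_t(x)$ collapse:
$$ f_t(x+x_t) + \frac{V_t}{2}D^2v_t(x) + \rho\sqrt{V_t}\,(D\psi_t)(x+x_t) - \frac{V_t}{2}Dv_t(x) = -F_t\Big(e^{x+x_t},\, v_t(x),\, \sqrt{(1-\rho^2)V_t}\,Dv_t(x),\, \psi_t(x+x_t)+\rho\sqrt{V_t}\,Dv_t(x)\Big), $$
since $Dv_t(x) = (Du_t)(x+x_t)$ and $D^2v_t(x) = (D^2u_t)(x+x_t)$; only the driver $F$ survives in the drift, while the martingale parts are exactly $\big(\psi_t(x+x_t)+\rho\sqrt{V_t}\,Dv_t(x)\big)dW_t + \sqrt{(1-\rho^2)V_t}\,Dv_t(x)\,dB_t$.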
Theorem 2.4. Let Assumptions 1.1 and 2.1 hold. Let $(u,\psi)$ be a weak solution of BSPDE (2.1) such that there is $C_u\in(0,\infty)$ satisfying for each $t\in[0,T]$
$$ |u_t(x)| \le C_u(1+e^x), \quad \text{for almost all } (\omega,x)\in\Omega\times\mathbb{R}. \tag{2.3} $$
Then $(u,\psi)$ admits a version (denoted by itself) satisfying a.s.
$$ u_\tau(X^{t,x}_\tau) = Y^{t,x}_\tau, \qquad \sqrt{(1-\rho^2)V_\tau}\,Du_\tau(X^{t,x}_\tau) = Z^{t,x}_\tau, \qquad \psi_\tau(X^{t,x}_\tau) + \rho\sqrt{V_\tau}\,Du_\tau(X^{t,x}_\tau) = \tilde Z^{t,x}_\tau, $$
for $0\le t\le\tau\le T$ and $x\in\mathbb{R}$, where $(Y^{t,x}_\tau, Z^{t,x}_\tau, \tilde Z^{t,x}_\tau)$ is the unique solution to BSDE (2.2).

Proof. For each $t\in[0,T)$, recall
$$ X^{t,x}_s = x - \frac12\int_t^s V_r\,dr + \int_t^s\rho\sqrt{V_r}\,dW_r + \int_t^s\sqrt{(1-\rho^2)V_r}\,dB_r, \qquad t\le s\le T. $$
Applying Lemma 2.3 to $u$ over the interval $[t,T]$ yields that
$$ du_s(X^{t,x}_s) = \Big(\psi_s(X^{t,x}_s) + \rho\sqrt{V_s}\,Du_s(X^{t,x}_s)\Big)dW_s + \sqrt{(1-\rho^2)V_s}\,Du_s(X^{t,x}_s)\,dB_s - F_s\Big(e^{X^{t,x}_s}, u_s(X^{t,x}_s), \sqrt{(1-\rho^2)V_s}\,Du_s(X^{t,x}_s), \psi_s(X^{t,x}_s)+\rho\sqrt{V_s}\,Du_s(X^{t,x}_s)\Big)ds, \quad s\in[t,T], $$
holds in the sense of distributions, with $u_T(X^{t,x}_T) = G(e^{X^{t,x}_T})$.

Notice that for all $\tau\in[t,T]$ we have $e^{X^{t,x}_\tau}\in L^1(\Omega,\mathbb{P})$ and $E\big[e^{X^{t,x}_\tau}\big]\le e^x$. This, together with Assumption 2.1 and relation (2.3), implies that $u_\tau(X^{t,x}_\tau)\in L^1(\Omega,\mathbb{P})$ for all $\tau\in[t,T]$. Further, the uniqueness of the $L^1$-solution for BSDEs (see [BDH+03, Section 6]) yields a version of $(u,\psi)$ (denoted by itself) satisfying a.s.
$$ u_\tau(X^{t,x}_\tau) = Y^{t,x}_\tau, \qquad \sqrt{(1-\rho^2)V_\tau}\,Du_\tau(X^{t,x}_\tau) = Z^{t,x}_\tau, \qquad \psi_\tau(X^{t,x}_\tau) + \rho\sqrt{V_\tau}\,Du_\tau(X^{t,x}_\tau) = \tilde Z^{t,x}_\tau, $$
for $0\le t\le\tau\le T$ and $x\in\mathbb{R}$, where $(Y^{t,x}_\tau, Z^{t,x}_\tau, \tilde Z^{t,x}_\tau)$ is the unique solution to BSDE (2.2).

From the proof, we may see that the growth condition (2.3) confirms that the distribution-valued process $u$ is locally integrable and a.e. defined on $\Omega\times[0,T]\times\mathbb{R}$, which means more than a distribution. More importantly, it implies the integrability of $u(\tau, X^{t,x}_\tau)$, which is needed for the uniqueness of the solution to BSDEs. The growth condition (2.3) may be relaxed; however, a power growth condition like $|u_t(x)|\le C(1+|e^x|^p)$ for some $p>1$ may fail to ensure the integrability of $u(\tau, X^{t,x}_\tau)$ (see [Gas18, Theorem 2]). On the other hand, the stochastic Feynman-Kac formula in Theorem 2.4 actually implies the uniqueness of the weak solution for BSPDE (2.1), which together with the existence is summarized in what follows.

Theorem 2.5.
Under Assumptions 1.1 and 2.1, suppose further that there is an infinitely differentiable function $\zeta$ such that $\zeta(x)>0$ for all $x\in\mathbb{R}$ and
$$ G\big(e^{\cdot+X^{0,0}_T}\big)\,\zeta(\cdot)\in L^2\big(\Omega,\mathscr{F}_T;L^2(\mathbb{R})\big), \qquad \zeta(\cdot)\,F_\cdot\big(e^{\cdot+X^{0,0}_\cdot},0,0,0\big)\in L^2\big(\Omega\times[0,T];L^2(\mathbb{R})\big). \tag{2.4} $$
Then BSPDE (2.1) admits a unique weak solution $(u,\psi)$ such that there is $C_u\in(0,\infty)$ satisfying for each $t\in[0,T]$
$$ |u_t(x)| \le C_u(1+e^x), \quad \text{for almost all } (\omega,x)\in\Omega\times\mathbb{R}. \tag{2.5} $$

Proof.
Step 1 (Existence). Put $\theta(x) = \dfrac{\zeta(x)}{(1+\zeta(x))(1+x^2)}$ for $x\in\mathbb{R}$. The theory of Banach space-valued BSDEs in [DQT11, Section 3] may be extended to nonlinear cases under Lipschitz assumptions with the standard application of Picard iteration. In particular, for the case of Hilbert spaces, applying [HP91, Theorem 3.1] to the following Hilbert space-valued BSDE (with a trivial operator $A=0$ therein),
$$ \tilde u_t(x) = G\big(e^{x+X^{0,0}_T}\big)\theta(x) + \int_t^T \theta(x)\,F_s\big(e^{x+X^{0,0}_s}, (\theta(x))^{-1}\tilde u_s(x), (\theta(x))^{-1}\tilde\psi^B_s(x), (\theta(x))^{-1}\tilde\psi^W_s(x)\big)\,ds - \int_t^T\tilde\psi^B_s(x)\,dB_s - \int_t^T\tilde\psi^W_s(x)\,dW_s, \quad t\in[0,T], \tag{2.6} $$
gives the solution as a triple of $L^2(\mathbb{R})$-valued $(\mathscr{F}_t)$-adapted random fields
$$ (\tilde u,\tilde\psi^B,\tilde\psi^W)\in L^2\big(\Omega;C([0,T];L^2(\mathbb{R}))\big)\times L^2(\Omega\times[0,T]\times\mathbb{R})\times L^2(\Omega\times[0,T]\times\mathbb{R}). \tag{2.7} $$
Obviously, we have $(\tilde u,\tilde\psi^B,\tilde\psi^W)\in\mathcal{D}_{\mathscr{F}}\times\mathcal{D}_{\mathscr{F}}\times\mathcal{D}_{\mathscr{F}}$, and thus by assertion (ii) of Lemma 2.1, it holds that
$$ (\hat u,\hat\psi^B,\hat\psi^W) := (\tilde u,\tilde\psi^B,\tilde\psi^W)\,\theta^{-1}\in\mathcal{D}_{\mathscr{F}}\times\mathcal{D}_{\mathscr{F}}\times\mathcal{D}_{\mathscr{F}}, $$
satisfying the BSDE
$$ \hat u_t(x) = G\big(e^{x+X^{0,0}_T}\big) + \int_t^T F_s\big(e^{x+X^{0,0}_s},\hat u_s(x),\hat\psi^B_s(x),\hat\psi^W_s(x)\big)\,ds - \int_t^T\hat\psi^B_s(x)\,dB_s - \int_t^T\hat\psi^W_s(x)\,dW_s, \quad t\in[0,T]. $$
Also, it is straightforward to have that
$$ \hat u_t(x) = Y^{t,\,x+X^{0,0}_t}_t \quad \text{a.s., for all } (t,x)\in[0,T]\times\mathbb{R}, \tag{2.8} $$
with the triple $(Y^{t,x}_s, Z^{t,x}_s, \tilde Z^{t,x}_s)_{s\in[t,T]}$ satisfying BSDE (2.2).

By Lemma 2.1, we may apply the Itô-Wentzell-Krylov formula in Lemma 2.3, which yields that the equality
$$ -d\hat u_t(x-X^{0,0}_t) = \Big\{\frac{V_t}{2}D^2\hat u_t(x-X^{0,0}_t) + \sqrt{(1-\rho^2)V_t}\,D\hat\psi^B_t(x-X^{0,0}_t) + \rho\sqrt{V_t}\,D\hat\psi^W_t(x-X^{0,0}_t) - \frac{V_t}{2}D\hat u_t(x-X^{0,0}_t) + F_t\big(e^x,\hat u_t(x-X^{0,0}_t),\hat\psi^B_t(x-X^{0,0}_t),\hat\psi^W_t(x-X^{0,0}_t)\big)\Big\}\,dt - \Big(\hat\psi^W_t(x-X^{0,0}_t) - \rho\sqrt{V_t}\,D\hat u_t(x-X^{0,0}_t)\Big)\,dW_t - \Big(\hat\psi^B_t(x-X^{0,0}_t) - \sqrt{(1-\rho^2)V_t}\,D\hat u_t(x-X^{0,0}_t)\Big)\,dB_t, \quad t\in[0,T], \tag{2.9-2.10} $$
holds in the sense of distributions. Notice that the equality (2.8) indicates that for each $s\in[0,T]$,
$$ \hat u_s(x-X^{0,0}_s) = Y^{s,x}_s, \tag{2.11} $$
which is just $\mathscr{F}^W_s$-measurable by Theorem 2.2. Thus, the stochastic integration w.r.t. $B$ must vanish, i.e., we have
$$ \hat\psi^B_t(x) - \sqrt{(1-\rho^2)V_t}\,D\hat u_t(x) = 0, \quad \text{a.s. for all } (t,x)\in[0,T]\times\mathbb{R}. $$
Put
$$ u_t(x) = \hat u_t(x-X^{0,0}_t) \quad \text{and} \quad \psi_t(x) = \hat\psi^W_t(x-X^{0,0}_t) - \rho\sqrt{V_t}\,D\hat u_t(x-X^{0,0}_t), \qquad (t,x)\in[0,T]\times\mathbb{R}. $$
The $\mathscr{F}^W_t$-adaptedness of $u_t(x)$ and the assertions (i) and (iii) of Lemma 2.1 imply $(u,\psi)\in\mathcal{D}_{\mathscr{F}^W}\times\mathcal{D}_{\mathscr{F}^W}$, and the equality (2.9)-(2.10) writes equivalently as
$$ -du_t(x) = \Big\{\frac{V_t}{2}D^2u_t(x) + \rho\sqrt{V_t}\,D\psi_t(x) - \frac{V_t}{2}Du_t(x) + F_t\big(e^x,u_t(x),\sqrt{(1-\rho^2)V_t}\,Du_t(x),\psi_t(x)+\rho\sqrt{V_t}\,Du_t(x)\big)\Big\}\,dt - \psi_t(x)\,dW_t, \quad t\in[0,T], $$
which holds in the sense of distributions with the terminal condition $u_T(x) = G(e^x)$. The local integrability of $\big(u,\sqrt{(1-\rho^2)V_t}\,Du,\psi+\rho\sqrt{V_t}\,Du\big)$ required in Definition 2.1(ii) may be obtained by combining the relation (2.7), the path-continuity of $(X^{0,0}_s)_{s\ge0}$, and the positivity of $\theta$. Therefore, the pair $(u,\psi)$ is a weak solution of BSPDE (2.1).

Step 2 (Growth condition (2.5)).
Consider the following Hilbert space-valued BSDE:
$$ \tilde Y_t(x) = \big|G(e^{x+X^{0,0}_T})\big|\,\theta(x) - \int_t^T\tilde Z^B_s(x)\,dB_s - \int_t^T\tilde Z^W_s(x)\,dW_s + \int_t^T\Big(\theta(x)\big|F_s(e^{x+X^{0,0}_s},0,0,0)\big| + L\,\theta(x) + L|\tilde Y_s(x)|\Big)\,ds, \tag{2.12} $$
where the positive constant $L$ is from Assumption 2.1. The standard BSDE theory (see [PP90]) yields the unique existence of the $L^2$-solution to BSDE (2.12). In fact, for each $(t,x)\in[0,T)\times\mathbb{R}$ we have
$$ \tilde Y_t(x) = E\Big[\big|G(e^{x+X^{0,0}_T})\big|\,\theta(x)\,\gamma^t_T + \int_t^T\theta(x)\big(L + \big|F_s(e^{x+X^{0,0}_s},0,0,0)\big|\big)\cdot\gamma^t_s\,ds \,\Big|\,\mathscr{F}_t\Big], \tag{2.13} $$
with $\gamma^t_s = \exp\{L(s-t)\}$, $s\in[t,T]$. Putting the BSDEs (2.6) and (2.12) together, we may use the comparison theorem (see [EPQ97, Theorem 2.2]) to achieve the relation
$$ \tilde u_t(x) \le \tilde Y_t(x), \quad \text{a.s.}, \ \forall(t,x)\in[0,T]\times\mathbb{R}, $$
which together with (2.13) implies that
$$ \begin{aligned} u_t(x) &\le \big(\theta(x-X^{0,0}_t)\big)^{-1}\,\tilde Y_t(x-X^{0,0}_t) = E\Big[\big|G(e^{X^{t,x}_T})\big|\,\gamma^t_T + \int_t^T\Big(\big|F_s(e^{X^{t,x}_s},0,0,0)\big| + L\Big)\cdot\gamma^t_s\,ds \,\Big|\,\mathscr{F}_t\Big] \\ &\le E\Big[\big(Le^{X^{t,x}_T}+L\big)\gamma^t_T + \int_t^T\big(Le^{X^{t,x}_s}+2L\big)\cdot\gamma^t_s\,ds \,\Big|\,\mathscr{F}_t\Big] \\ &\le E\Big[Le^{x}e^{L(T-t)} + Le^{L(T-t)} + \int_t^T\big(Le^{x}e^{L(s-t)} + 2Le^{L(s-t)}\big)\,ds \,\Big|\,\mathscr{F}_t\Big] \le C(L,T)(1+e^x), \quad \text{a.s.}, \ \forall(t,x)\in[0,T]\times\mathbb{R}, \end{aligned} $$
where we have used the relation $E\big[e^{X^{t,x}_s}\,\big|\,\mathscr{F}_t\big]\le e^x$ a.s., for $0\le t\le s\le T$; a symmetric comparison argument bounds $u$ from below. This gives the growth estimate (2.5).

Step 3 (Uniqueness). The uniqueness follows from Theorem 2.4, and the proof is complete.
Remark 2.1.
In view of the above proof, the assumption (2.4) on $G$ and $F$ is to ensure $(\tilde\psi^B,\tilde\psi^W)\in\mathcal{D}_{\mathscr{F}}\times\mathcal{D}_{\mathscr{F}}$ and further $\psi\in\mathcal{D}_{\mathscr{F}^W}$. It is made for simplicity and may be relaxed; for instance, the $L^2$-requirements in (2.4) may be replaced correspondingly by $L^p$-integrability with $1<p<\infty$, and the associated well-posedness result with $L^p$-integrability in (2.7) may be obtained by standardly extending the theory of Banach space-valued BSDEs in [DQT11, Section 3], as stated at the beginning of the proof. A typical example satisfying (2.4) is the European put option, where $F_t(x,y,z,\tilde z) = -ry$, $G(e^x) = (K-e^{x+rT})^+$ for some $K\in(0,\infty)$, and one may take $\zeta(x) = e^x$ for instance. However, it is by no means obvious whether it is satisfied for call options, while for pricing calls we may use the put-call parity if applicable.

3 Approximation of American option prices

Assuming the same setting as for the European options, we consider instead the American type, that is, to compute
$$ u_t(x) := \sup_{\tau\in\mathcal{T}_t} E\Big[e^{-(\tau-t)r}\,g_\tau\big(e^{X^{t,x}_\tau}\big)\,\Big|\,\mathscr{F}_t\Big], \qquad (t,x)\in[0,T]\times\mathbb{R}, $$
where $r\ge0$ and $\mathcal{T}_t$ denotes the set of all stopping times $\tau$ satisfying $t\le\tau\le T$. For simplicity, we assume:

Assumption 3.1.
The function $g: (\Omega\times[0,T]\times\mathbb{R}, \mathscr{P}^W\otimes\mathcal{B}(\mathbb{R}))\to(\mathbb{R},\mathcal{B}(\mathbb{R}))$ satisfies that there exists a positive constant $L>0$ such that for each $(t,x)\in[0,T]\times\mathbb{R}$:
(i) $g_s(e^{X^{t,x}_s})$ is almost surely continuous in $s\in[t,T]$;
(ii) $|g_s(e^x)| \le L(1+e^x)$, a.s.;
(iii) $\big|g_s\big(e^{X^{t,x}_s}\big)\big| \le \Gamma^t_s\,\tilde\theta(x)$, a.s., $\forall s\in[t,T]$, with $E\Big[\sup_{s\in[t,T]}\big|\Gamma^t_s\big|^2\Big]<\infty$, where the positive function $\tilde\theta:\mathbb{R}\to(0,\infty)$ is infinitely differentiable.

A typical example satisfying Assumption 3.1 is the American put option with $g_t(e^x) = (K-e^{x+rt})^+$ for some $K>0$, where one may take $L=K$, $\Gamma^t_s\equiv K$, and $\tilde\theta(x)\equiv1$. By the theory of reflected BSDEs (see [EKP+97, Section 3]), the following reflected BSDE
$$ \begin{cases} -dY^{t,x}_s = -rY^{t,x}_s\,ds + dA^{t,x}_s - Z^{t,x;B}_s\,dB_s - Z^{t,x;W}_s\,dW_s, \quad s\in[t,T]; \\ Y^{t,x}_T = g_T\big(e^{X^{t,x}_T}\big); \qquad Y^{t,x}_s \ge g_s\big(e^{X^{t,x}_s}\big), \quad s\in[t,T]; \\ A^{t,x}_\cdot \text{ is increasing and continuous, } A^{t,x}_t = 0, \quad \displaystyle\int_t^T\big(Y^{t,x}_s - g_s(e^{X^{t,x}_s})\big)\,dA^{t,x}_s = 0, \end{cases} \tag{3.1} $$
admits a unique solution $(Y^{t,x}, A^{t,x}, Z^{t,x;B}, Z^{t,x;W})$ for each $(t,x)\in[0,T]\times\mathbb{R}$, and in particular, by [EKP+97, Proposition 7.1], we have
$$ Y^{t,x}_t = u_t(x), \quad \text{a.s. for each } (t,x)\in[0,T]\times\mathbb{R}. \tag{3.2} $$
We would stress that the above relation (3.2) only indicates that $u_t(x)$ is $\mathscr{F}_t$-measurable for each $(t,x)\in[0,T]\times\mathbb{R}$.

In fact, the penalization method provides an approximation of the reflected BSDE (3.1) with a sequence of BSDEs without reflections (see [EKP+97, Section 6]); i.e., for each $N\in\mathbb{N}^+$, the following BSDE
$$ \begin{cases} -dY^{t,x;N}_s = \Big[-rY^{t,x;N}_s + N\big(g_s(e^{X^{t,x}_s}) - Y^{t,x;N}_s\big)^+\Big]ds - Z^{t,x;B,N}_s\,dB_s - Z^{t,x;W,N}_s\,dW_s, \quad s\in[t,T]; \\ Y^{t,x;N}_T = g_T\big(e^{X^{t,x}_T}\big), \end{cases} \tag{3.3} $$
admits a unique solution $(Y^{t,x;N}, Z^{t,x;B,N}, Z^{t,x;W,N})$ such that $Y^{t,x;N}_s$ converges increasingly to $Y^{t,x}_s$, with
$$ \lim_{N\to\infty} E\bigg[\sup_{s\in[t,T]}\big|Y^{t,x;N}_s - Y^{t,x}_s\big|^2 + \int_t^T\Big(\big|Z^{t,x;B,N}_s - Z^{t,x;B}_s\big|^2 + \big|Z^{t,x;W,N}_s - Z^{t,x;W}_s\big|^2\Big)ds\bigg] = 0, \tag{3.4} $$
$$ \lim_{N\to\infty} E\bigg[\sup_{s\in[t,T]}\big|A^{t,x;N}_s - A^{t,x}_s\big|^2\bigg] = 0, \tag{3.5} $$
for each $(t,x)\in[0,T]\times\mathbb{R}$, where
$$ A^{t,x;N}_r = \int_t^r N\big(g_s(e^{X^{t,x}_s}) - Y^{t,x;N}_s\big)^+\,ds, \qquad \text{for } 0\le t\le r\le T. $$
Notice that Theorem 2.2 says that $Y^{t,x;N}_t$ is $\mathscr{F}^W_t$-measurable for each $(t,x)\in[0,T]\times\mathbb{R}$. Hence, the approximation (3.4) implies that $Y^{t,x}_t$ (and thus $u_t(x)$) is also just $\mathscr{F}^W_t$-measurable for each $(t,x)\in[0,T]\times\mathbb{R}$, which together with Theorems 2.4 and 2.5 yields the following

Corollary 3.1.
Let Assumptions 1.1 and 3.1 hold. It holds that:
(i) The value function $u_t(x)$ is just $\mathscr{F}^W_t$-measurable for each $(t,x)\in[0,T]\times\mathbb{R}$.
(ii) For each $N\in\mathbb{N}^+$, the following BSPDE
$$ \begin{cases} -du^N_t(x) = \Big[\dfrac{V_t}{2}D^2u^N_t(x) + \rho\sqrt{V_t}\,D\psi^N_t(x) - \dfrac{V_t}{2}Du^N_t(x) - ru^N_t(x) + N\big(g_t(e^x) - u^N_t(x)\big)^+\Big]dt - \psi^N_t(x)\,dW_t; \\ u^N_T(x) = g_T(e^x), \end{cases} \tag{3.6} $$
admits a unique weak solution $(u^N,\psi^N)$ such that there exists $C_N\in(0,\infty)$ satisfying for each $t\in[0,T]$
$$ |u^N_t(x)| \le C_N(1+e^x), \quad \text{for almost all } (\omega,x)\in\Omega\times\mathbb{R}. $$
(iii) For each $N\in\mathbb{N}^+$, the above weak solution $(u^N,\psi^N)$ satisfies a.s.
$$ u^N_\tau(X^{t,x}_\tau) = Y^{t,x;N}_\tau, \qquad \sqrt{(1-\rho^2)V_\tau}\,Du^N_\tau(X^{t,x}_\tau) = Z^{t,x;B,N}_\tau, \qquad \psi^N_\tau(X^{t,x}_\tau) + \rho\sqrt{V_\tau}\,Du^N_\tau(X^{t,x}_\tau) = Z^{t,x;W,N}_\tau, $$
for $0\le t\le\tau\le T$ and $x\in\mathbb{R}$, where $(Y^{t,x;N}, Z^{t,x;B,N}, Z^{t,x;W,N})$ is the unique solution to BSDE (3.3).
(iv) For each $(t,x)\in[0,T]\times\mathbb{R}$, $u^N_t(x)$ converges increasingly to $u_t(x)$ in $L^2(\Omega,\mathscr{F}_t;\mathbb{R})$.
(v) There is a triple $(u,\psi^B,\psi^W)$ defined on $(\Omega\times[0,T]\times\mathbb{R}, \mathscr{P}^W\otimes\mathcal{B}(\mathbb{R}))$ such that
$$ u_\tau(X^{t,x}_\tau) = Y^{t,x}_\tau, \qquad \psi^B_\tau(X^{t,x}_\tau) = Z^{t,x;B}_\tau, \qquad \psi^W_\tau(X^{t,x}_\tau) = Z^{t,x;W}_\tau, \quad \text{a.s.}, $$
for $0\le t\le\tau\le T$.

Remark 3.1.
The assertion (v) is concluded from the approximating relations (3.4) and (3.5). In fact, by the theory of reflected BSPDEs (see [QW14] or [Qiu17, Section 3.3]), one may expect the value function $u_t(x)$ to be characterized via the following reflected BSPDE:
$$ \begin{cases} -du_t(x) = \Big[\dfrac{V_t}{2}D^2u_t(x) + \rho\sqrt{V_t}\,D\psi_t(x) - \dfrac{V_t}{2}Du_t(x) - ru_t(x)\Big]dt + \mu(dt,x) - \psi(t,x)\,dW_t, \quad (t,x)\in[0,T]\times\mathbb{R}; \\ u_T(x) = g_T(e^x), \quad x\in\mathbb{R}; \qquad u_t(x)\ge g_t(e^x), \quad d\mathbb{P}\otimes dt\otimes dx\text{-a.e.}; \\ \displaystyle\int_{[0,T]\times\mathbb{R}}\big(u_t(x) - g_t(e^x)\big)\,\mu(dt,dx) = 0, \quad \text{a.s.} \quad \text{(Skorokhod condition)}, \end{cases} \tag{3.7} $$
for which the solution is a triple $(u,\psi,\mu)$ with $\mu$ being a regular random Radon measure. A solution theory may be developed by generalizing the regular stochastic potential and capacity theory in [Qiu17, QW14]; nevertheless, we would not seek such a generality in this paper, in order to put more effort into the numerical approximations.

4 Numerical approximations with a deep learning-based method

Throughout this section, we assume that the functions $G$, $F$ and $g$ are deterministic, i.e.,
$$ (\mathcal{A}^*) \qquad G:\mathbb{R}\to\mathbb{R}, \quad F:[0,T]\times\mathbb{R}^4\to\mathbb{R}, \quad g:[0,T]\times\mathbb{R}\to\mathbb{R}. $$
In fact, this assumption may be relaxed by allowing (explicit) dependence on the variance process $V$ and the Wiener process $W$; together with Assumptions 1.1, 2.1, and 3.1, it ensures that all the coefficients may be simulated in the subsequent numerical computations, given the approximations of the unknown functions. In what follows, we first introduce and discuss neural networks approximating random functions; a deep learning-based method is then introduced for non-Markovian BSDEs and associated BSPDEs; and finally, numerical examples are presented for the rough Bergomi model.

4.1 Neural network approximations of random functions

First, we introduce a feedforward neural network with input dimension $d_0$ and output dimension $d_1$. Suppose that it has $M+1\in\mathbb{N}^+\setminus\{1,2\}$ layers, with each layer having $m_n$ neurons, $n=0,\dots,M$. For simplicity, we choose an identical number of neurons for all hidden layers, i.e., $m_n = m$, $n=1,\dots,M-1$.
Obviously, we have $m_0 = d_0$ and $m_M = d_1$. The neural network may be thought of as a function from $\mathbb{R}^{d_0}$ to $\mathbb{R}^{d_1}$ defined by composition of simple functions as
$$ x\in\mathbb{R}^{d_0} \mapsto A_M\circ\varrho\circ A_{M-1}\circ\cdots\circ\varrho\circ A_1(x)\in\mathbb{R}^{d_1}. \tag{4.1} $$
Here, $A_1:\mathbb{R}^{d_0}\to\mathbb{R}^m$, $A_M:\mathbb{R}^m\to\mathbb{R}^{d_1}$ and $A_n:\mathbb{R}^m\to\mathbb{R}^m$, $n=2,\dots,M-1$, are affine maps of the form
$$ A_n(x) = \mathcal{W}_n x + \beta_n, $$
where the matrix $\mathcal{W}_n$ and the vector $\beta_n$ are called the weight and bias, respectively, of the $n$th layer of the network. For the last layer we choose the identity as activation function, while the activation function $\varrho$ is applied component-wise to the outputs of $A_n$, for $n=1,\dots,M-1$. The parameters of the network are collected in $\theta = (\mathcal{W}_n,\beta_n)_{n=1}^M$. Given $d_0$, $d_1$, $M$ and $m$, the total number of parameters in a network is
$$ M_m = \sum_{n=0}^{M-1}(m_n+1)\,m_{n+1} = (d_0+1)m + (m+1)m(M-2) + (m+1)d_1, $$
and thus $\theta\in\mathbb{R}^{M_m}$. By $\Theta_m$ we denote the set of all possible parameters; if there are no constraints on the parameters, we have $\Theta_m = \mathbb{R}^{M_m}$. By $\Phi_m(\cdot;\theta)$ we denote the neural network function defined in (4.1), and the set of all such neural networks $\Phi_m(\cdot;\theta)$, $\theta\in\Theta_m$, is denoted by $\mathcal{NN}^\varrho_{d_0,d_1,M,m}(\Theta_m)$.

Deep neural networks may approximate large classes of unknown functions. Following is a fundamental result by Hornik et al. [HSW89, HSW90]:

Lemma 4.1 (Universal Approximation Theorem). It holds that:
(i) For each $M\in\mathbb{N}^+\setminus\{1\}$, the set $\cup_{m\in\mathbb{N}}\,\mathcal{NN}^\varrho_{d_0,d_1,M,m}(\mathbb{R}^{M_m})$ is dense in $L^2(\mathbb{R}^{d_0},\nu(dx);\mathbb{R}^{d_1})$ for any finite measure $\nu$ on $\mathbb{R}^{d_0}$, whenever $\varrho$ is continuous and non-constant.
(ii) Assume that $\varrho$ is a non-constant $C^k$ function. Then the neural networks/functions in $\cup_{m\in\mathbb{N}}\,\mathcal{NN}^\varrho_{d_0,d_1,2,m}(\mathbb{R}^{M_m})$ can approximate any function and its derivatives up to order $k$ arbitrarily well on any compact set of $\mathbb{R}^{d_0}$.

Notice that in the above lemma the approximated functions are defined on finite-dimensional spaces, i.e., $\mathbb{R}^{d_0}$. In fact, the approximations may be extended to some classes of functions defined on infinite-dimensional spaces. In this paper, we need the following one:

Proposition 4.2.
For each $T_0\in(0,T]$, $M\in\mathbb{N}^+\setminus\{1\}$, and $d_0,d_1\in\mathbb{N}^+$, the function set
$$ \Big\{\Phi_m(W_{t_1},\dots,W_{t_k},x;\theta) : \Phi_m(\cdot;\theta)\in\mathcal{NN}^\varrho_{d_0+k,d_1,M,m}(\mathbb{R}^{M_m}),\ m,k\in\mathbb{N}^+,\ 0<t_1<t_2<\cdots<t_k\le T_0\Big\} $$
is dense in $L^2\big(\Omega\times\mathbb{R}^{d_0},\mathscr{F}^W_{T_0}\otimes\mathcal{B}(\mathbb{R}^{d_0}),\mathbb{P}(d\omega)\otimes dx;\mathbb{R}^{d_1}\big)$, whenever $\varrho$ is continuous and non-constant.

Proof. Take $f\in L^2\big(\Omega\times\mathbb{R}^{d_0},\mathscr{F}^W_{T_0}\otimes\mathcal{B}(\mathbb{R}^{d_0}),\mathbb{P}(d\omega)\otimes dx\big)$ arbitrarily. Notice that
$$ L^2\big(\Omega\times\mathbb{R}^{d_0},\mathscr{F}^W_{T_0}\otimes\mathcal{B}(\mathbb{R}^{d_0}),\mathbb{P}(d\omega)\otimes dx;\mathbb{R}^{d_1}\big) \equiv L^2\big(\Omega,\mathscr{F}^W_{T_0},\mathbb{P};L^2(\mathbb{R}^{d_0};\mathbb{R}^{d_1})\big). $$
The denseness of simple random variables (see [DPZ14, Lemma 1.2, Page 16] for instance) implies that the function $f$ may be approximated monotonically by simple random variables of the following form:
$$ \sum_{i=1}^l 1_{A_i}(\omega)\,h_i(x), \quad \text{with } h_i\in L^2(\mathbb{R}^{d_0};\mathbb{R}^{d_1}),\ A_i\in\mathscr{F}^W_{T_0},\ l\in\mathbb{N}^+,\ i=1,\dots,l. $$
Each indicator $1_{A_i}$ may be approximated in $L^2(\Omega,\mathscr{F}^W_{T_0})$ by functions in the following set:
$$ \big\{g_i\big(W_{\tilde t^i_1},\dots,W_{\tilde t^i_{k_i}}\big) : k_i\in\mathbb{N}^+,\ g_i\in C_c^\infty(\mathbb{R}^{k_i}),\ 0<\tilde t^i_1<\cdots<\tilde t^i_{k_i}\le T_0\big\}. $$
To sum up, the function $f$ may be approximated in $L^2\big(\Omega\times\mathbb{R}^{d_0},\mathscr{F}^W_{T_0}\otimes\mathcal{B}(\mathbb{R}^{d_0}),\mathbb{P}(d\omega)\otimes dx;\mathbb{R}^{d_1}\big)$ by random fields of the form
$$ f_k(W_{\bar t_1},\dots,W_{\bar t_k},x) = \sum_{i=1}^l g_i\big(W_{\tilde t^i_1},\dots,W_{\tilde t^i_{k_i}}\big)\,h_i(x), $$
where $g_i\in C_c^\infty(\mathbb{R}^{k_i})$, $h_i\in L^2(\mathbb{R}^{d_0})$, $0<\bar t_1<\cdots<\bar t_k\le T_0$, and $\{\bar t_1,\dots,\bar t_k\} = \cup_{i=1}^l\{\tilde t^i_1,\dots,\tilde t^i_{k_i}\}$. Applying the approximation in (i) of Lemma 4.1 to the functions $f_k$ yields the approximation of $f$, and this completes the proof.
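As an illustration of the architecture (4.1) and of the path-dependent inputs $(W_{t_1},\dots,W_{t_k},x)$ from Proposition 4.2, the following PyTorch sketch builds $\Phi_m(\cdot;\theta)$ with sigmoid activation; the class name and defaults are ours, not from the paper.

```python
import torch
import torch.nn as nn

class FeedForwardNN(nn.Module):
    """Phi_m(.; theta): R^{d0} -> R^{d1} with M+1 layers, m neurons per hidden
    layer, and the activation applied after every affine map except the last.

    In the scheme of Section 4.2 the input at time t_i is the concatenation
    (W_{t_1},...,W_{t_i}, What_{t_1},...,What_{t_i}, x), so d0 = 2*i + 1
    changes with the time index i."""

    def __init__(self, d0, d1, M=2, m=None):
        super().__init__()
        m = m if m is not None else (d0 + d1 + 1) // 2   # half of in+out neurons
        dims = [d0] + [m] * (M - 1) + [d1]               # m_0, ..., m_M
        layers = []
        for n in range(M):
            layers.append(nn.Linear(dims[n], dims[n + 1]))   # affine map A_{n+1}
            if n < M - 1:
                layers.append(nn.Sigmoid())                  # rho, component-wise
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# sanity check of the parameter count M_m = sum_n (m_n + 1) * m_{n+1}:
# sum(p.numel() for p in FeedForwardNN(5, 1).parameters())
```

With $M=2$ this is exactly the single-hidden-layer configuration used in the experiments of Section 4.3; a fresh instance is created at every time grid point because the input dimension grows along the partition.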
Remark 4.1. In fact, the process $(W_t)_{t\ge0}$ and the filtration $(\mathscr{F}^W_t)_{t\ge0}$ may be replaced by an arbitrary continuous process $(\mathcal{W}_t)_{t\ge0}$ and the correspondingly generated filtration $(\mathscr{F}^{\mathcal{W}}_t)_{t\ge0}$, where the process $(\mathcal{W}_t)_{t\ge0}$ is not necessarily a Brownian motion.

4.2 A deep learning-based method for non-Markovian BSDEs and BSPDEs

Inspired by [HPW19, HJW18], we adopt a deep learning method based on the following representation relationship given by Theorems 2.4 and 2.5. Letting the quadruple $(X_s, Y_s, Z_s, \tilde Z_s)$ be the solution to the following FBSDE:
$$ \begin{cases} -dY_s = F_s(e^{X_s}, Y_s, Z_s, \tilde Z_s)\,ds - \tilde Z_s\,dW_s - Z_s\,dB_s, \quad 0\le s\le T; \qquad Y_T = G(e^{X_T}); \\ dX_s = \sqrt{V_s}\big(\rho\,dW_s + \sqrt{1-\rho^2}\,dB_s\big) - \dfrac{V_s}{2}\,ds, \quad 0\le s\le T; \qquad X_0 = x; \\ V_s = \xi_s\,\mathcal{E}(\eta\widehat{W}_s) \quad \text{with } \widehat{W}_s = \displaystyle\int_0^s K(s,r)\,dW_r, \quad s\in[0,T], \end{cases} \tag{4.2} $$
with $K$ being a general kernel function including the particular cases in Examples 1.1 and 1.2, one has
$$ u_\tau(X_\tau) = Y_\tau, \qquad \sqrt{(1-\rho^2)V_\tau}\,Du_\tau(X_\tau) = Z_\tau, \qquad \psi_\tau(X_\tau) + \rho\sqrt{V_\tau}\,Du_\tau(X_\tau) = \tilde Z_\tau, $$
for $0\le\tau\le T$ and $x\in\mathbb{R}$, where the pair $(u,\psi)$ is the unique weak solution to BSPDE (2.1) in Theorem 2.5. In particular, we may write forwardly, for $t\in[0,T]$,
$$ u_t(X_t) = u_0(X_0) - \int_0^t F_s\Big(e^{X_s}, u_s(X_s), \sqrt{(1-\rho^2)V_s}\,Du_s(X_s), \psi_s(X_s)+\rho\sqrt{V_s}\,Du_s(X_s)\Big)ds + \int_0^t\Big(\psi_s(X_s)+\rho\sqrt{V_s}\,Du_s(X_s)\Big)dW_s + \int_0^t\sqrt{(1-\rho^2)V_s}\,Du_s(X_s)\,dB_s. \tag{4.3} $$
Given a partition of the time interval, $\pi = \{0 = t_0 < t_1 < \dots < t_N = T\}$ with modulus $|\pi| = \max_{i=0,1,\dots,N-1}\Delta t_i$, $\Delta t_i = t_{i+1}-t_i$, we first simulate (or approximate) the joint process $(B,W,V)$; then the forward process $X$ may be approximated by $X^\pi$ obtained through an Euler scheme. Further, the forward representation (4.3) yields an approximation for $(u,\psi)$ under the Euler scheme:
$$ u_{t_{i+1}}(X_{t_{i+1}}) \approx \mathcal{H}_{t_i}\Big(X_{t_i}, u_{t_i}(X_{t_i}), \sqrt{(1-\rho^2)V_{t_i}}\,Du_{t_i}(X_{t_i}), \psi_{t_i}(X_{t_i})+\rho\sqrt{V_{t_i}}\,Du_{t_i}(X_{t_i}), \Delta B_{t_i}, \Delta W_{t_i}\Big) $$
with
$$ \mathcal{H}_t(x,y,z,\tilde z,b,w) := y - F_t(e^x,y,z,\tilde z)\,\Delta t_i + zb + \tilde zw. $$
Inspired by [HPW19], we design the numerical approximation of $u_{t_i}(X_{t_i})$ as follows:
(1) start with $\widehat{\mathcal{U}}_N = G$;
(2) for $i = N-1,\dots,0$,
given $\widehat{\mathcal{U}}_{i+1}$, use the triple of deep neural networks
$$ \big(\mathcal{U}_i(\cdot,\theta), \mathcal{Z}_i(\cdot,\theta), \widetilde{\mathcal{Z}}_i(\cdot,\theta)\big) \in \mathcal{NN}^\varrho_{2i+1,1,M,m}(\mathbb{R}^{M_m})\times\mathcal{NN}^\varrho_{2i+1,1,M,m}(\mathbb{R}^{M_m})\times\mathcal{NN}^\varrho_{2i+1,1,M,m}(\mathbb{R}^{M_m}) \tag{4.4} $$
for the approximation of
$$ \Big(u_{t_i}(X_{t_i}), \sqrt{(1-\rho^2)V_{t_i}}\,Du_{t_i}(X_{t_i}), \psi_{t_i}(X_{t_i})+\rho\sqrt{V_{t_i}}\,Du_{t_i}(X_{t_i})\Big), $$
to achieve an estimate
$$ \mathcal{U}_{i+1} = \mathcal{H}_{t_i}\Big(X_{t_i}, \mathcal{U}_i(X_{t_i},\theta_i), \mathcal{Z}_i(X_{t_i},\theta_i), \widetilde{\mathcal{Z}}_i(X_{t_i},\theta_i), \Delta B_{t_i}, \Delta W_{t_i}\Big); $$
(3) compute the minimizer of the expected quadratic loss function
$$ \hat L_i(\theta) := E\Big|\widehat{\mathcal{U}}_{i+1} - \mathcal{H}_{t_i}\big(X_{t_i}, \mathcal{U}_i(X_{t_i},\theta_i), \mathcal{Z}_i(X_{t_i},\theta_i), \widetilde{\mathcal{Z}}_i(X_{t_i},\theta_i), \Delta B_{t_i}, \Delta W_{t_i}\big)\Big|^2 \approx \frac1J\sum_{j=1}^J\Big|\widehat{\mathcal{U}}^{(j)}_{i+1} - \mathcal{H}_{t_i}\big(X^{(j)}_{t_i}, \mathcal{U}_i(X^{(j)}_{t_i},\theta_i), \mathcal{Z}_i(X^{(j)}_{t_i},\theta_i), \widetilde{\mathcal{Z}}_i(X^{(j)}_{t_i},\theta_i), \Delta B^{(j)}_{t_i}, \Delta W^{(j)}_{t_i}\big)\Big|^2, \qquad \theta^*_i\in\arg\min_{\theta\in\mathbb{R}^{M_m}}\hat L_i(\theta), $$
where the Adam (adaptive moment estimation) optimizer may be used to get the optimal parameter $\theta^*_i$;
(4) update and set $\widehat{\mathcal{U}}_i = \mathcal{U}_i(\cdot,\theta^*_i)$, $\widehat{\mathcal{Z}}_i = \mathcal{Z}_i(\cdot,\theta^*_i)$, and $\widehat{\widetilde{\mathcal{Z}}}_i = \widetilde{\mathcal{Z}}_i(\cdot,\theta^*_i)$.
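The following PyTorch sketch implements one backward pass of steps (1)-(4) for the European case $F_t(x,y,z,\tilde z) = -ry$, $G(e^x) = (K-e^{x+rT})^+$. It reuses the FeedForwardNN class and assumes pre-simulated paths; the epoch count and learning rate are our placeholders rather than the paper's exact settings (the paper reports $N=20$, mini-batches of 10000 trajectories, sigmoid activation, and Adam).

```python
import numpy as np
import torch

def train_backward_scheme(W, What, X, dW, dB, dt, K=100.0, r=0.05,
                          T=1.0, epochs=200, lr=1e-3):
    """Steps (1)-(4) of Section 4.2 with F = -r*y and a put terminal condition.

    W, What, X: (J, N+1) arrays of paths on the grid t_0, ..., t_N;
    dW, dB: (J, N) Brownian increments; dt: uniform step size.
    Returns the trained networks; nets[0][0] evaluated at X_0 estimates u_0."""
    J, N = dW.shape
    # step (1): hatU_N = G(e^{X_{t_N}})
    hatU = torch.tensor(np.maximum(K - np.exp(X[:, N] + r * T), 0.0)).float().view(-1, 1)
    nets = {}
    for i in range(N - 1, -1, -1):
        # inputs (W_{t_1..t_i}, What_{t_1..t_i}, X_{t_i}): dimension 2*i + 1
        feats = torch.tensor(np.hstack([W[:, 1:i+1], What[:, 1:i+1],
                                        X[:, i:i+1]])).float()
        Ui, Zi, Zti = (FeedForwardNN(2 * i + 1, 1) for _ in range(3))
        opt = torch.optim.Adam([*Ui.parameters(), *Zi.parameters(),
                                *Zti.parameters()], lr=lr)
        b = torch.tensor(dB[:, i:i+1]).float()
        w = torch.tensor(dW[:, i:i+1]).float()
        for _ in range(epochs):                 # step (3): minimize hat L_i
            y, z, zt = Ui(feats), Zi(feats), Zti(feats)
            pred = y + r * y * dt + z * b + zt * w   # H_{t_i} with F = -r*y
            loss = ((hatU - pred) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            hatU = Ui(feats)                    # step (4): hatU_i = U_i(., theta*)
        nets[i] = (Ui, Zi, Zti)
    return nets
```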
Remark 4.2. Here, $(X^{(j)}, B^{(j)}, W^{(j)}, \widehat{W}^{(j)}, V^{(j)})_{1\le j\le J}$ are independent simulations of $(X, B, W, \widehat{W}, V)$. Noticing that $\mathscr{F}^W_t = \mathscr{F}^{W,\widehat{W}}_t$ for $t\in[0,T]$, by Proposition 4.2 and Remark 4.1 we have the functions in $\mathcal{NN}^\varrho_{2i+1,1,M,m}(\mathbb{R}^{M_m})$ of the following form:
$$ \Phi_m\big(W_{t_1},\dots,W_{t_i},\widehat{W}_{t_1},\dots,\widehat{W}_{t_i},x\big), \qquad i = 0,1,2,\dots,N-1, $$
which incorporates all the simulated values of $(W,\widehat{W})$ until time $t_i$, leading to the changing dimension of the inputs. One may also see that the finer the partition of $[0,T]$ is, the higher the input dimension it involves. The changing and high dimensionality arising from the approximations prompts us to adopt a deep learning-based method, and this also unveils the difference from the scheme in [HPW19].

On the other hand, a convergence analysis of the above scheme is given in the appendix. Even though we are working with dimension-changing neural networks under a non-Markovian framework with different assumptions, we adopt a similar strategy to [HPW19] for the proof of the convergence analysis.

4.3 Numerical examples for the rough Bergomi model

European options. We consider the rough Bergomi model of [BFG16] in Example 1.1 with the following choice of parameters: $H = 0.07$, $\eta = 1.9$, $\rho = -0.9$, $r = 0.05$, $T = 1$, $X_0 = \ln(100)$. For simplicity, we choose the forward variance curve to be
$\xi(t)\equiv0.09$, independent of time. We compute the numerical approximations to the European option price given in (1.7). The value function $u$ together with another random field $\psi$ constitutes the unique solution to BSPDE (1.9), which corresponds to the BSPDE (2.1) in Theorem 2.5 with
$$ F_s(x,y,z,\tilde z) = -ry \quad \text{and} \quad G(e^x) = (K-e^{x+rT})^+. $$
By Theorems 2.4 and 2.5, the triple $(Y^{0,x}_t, Z^{0,x}_t, \tilde Z^{0,x}_t)_{t\in[0,T]}$ with
$$ Y^{0,x}_t := u_t(X^{0,x}_t), \qquad \tilde Z^{0,x}_t := \rho\sqrt{V_t}\,Du_t(X^{0,x}_t) + \psi_t(X^{0,x}_t), \qquad Z^{0,x}_t := \sqrt{(1-\rho^2)V_t}\,Du_t(X^{0,x}_t), $$
for $t\in[0,T]$, satisfies the following FBSDE:
$$ \begin{cases} dX^{0,x}_s = \sqrt{V_s}\big(\rho\,dW_s + \sqrt{1-\rho^2}\,dB_s\big) - \dfrac{V_s}{2}\,ds, \quad 0\le s\le T; \qquad X^{0,x}_0 = x; \\ V_s = \xi_s\,\mathcal{E}(\eta\widehat{W}_s) \quad \text{with } \widehat{W}_s = \displaystyle\int_0^s\sqrt{2H}\,(s-r)^{H-1/2}\,dW_r, \quad s\in[0,T]; \\ dY^{0,x}_s = rY^{0,x}_s\,ds + \tilde Z^{0,x}_s\,dW_s + Z^{0,x}_s\,dB_s, \quad s\in[0,T]; \qquad Y^{0,x}_T = G\big(e^{X^{0,x}_T}\big). \end{cases} \tag{4.5} $$
Then the deep learning-based method in Section 4.2 is used for the numerical approximations. We take $N = 20$ in the Euler scheme and set a single hidden layer whose number of neurons is equal to half of the total number of neurons in the input and output layers. We adopt the sigmoid function as the activation function, and the optimization algorithm is Adam. We implement 10000 trajectories in each mini-batch and check the loss convergence every 50 iterations. In Table 1, the reference values are calculated by the Monte Carlo method, and they are close to the results obtained by averaging 20 independent runs with the deep learning method.

Table 1: Reference values (Monte Carlo) and deep-learning estimates of the European put price, with relative standard deviations (RSD = standard deviation / average value), for strikes $K = 90, 100, 110, 120$.

To investigate the influence of the trajectories of $V$, we simulate 10000 independent trajectories of the stochastic variance process $V$ and evaluate the corresponding values of $u(0.5,\ln100)$ when $t = 0.5$, $x = \ln100$, and $K = 100$. The mean of these values of $u(0.5,\ln100)$ is about 9; four sample paths of $V$ with the associated values of $u(0.5,\ln100)$ are listed in Table 2. From Figure 1(a) and Table 2, one may see that bigger values of $V(0.5)$ are associated with bigger values of $u(0.5,\ln100)$. To isolate the influence of the trajectory of $V$, we reset the values of $V$ to be the same and equal to the average of the simulated values of $V(t)$ at time $t = 0.5$,
i.e., we fix $V(0.5)$ at this common average value. The mean of the resulting values of $u(0.5,\ln100)$ again turns out to be about 9, and four paths of $V$ with the associated values of $u(0.5,\ln100)$ are listed in Table 3. Comparing the obtained means, the standard deviations, and the four paths and associated values of $u(0.5,\ln100)$ in these two cases, we may see that the value of $V$ at $t = 0.5$ alone cannot determine $u(0.5,\ln100)$, which is different from the classical Markovian cases; this is due to the path-dependence and thus the non-Markovianity, i.e., the trajectory of $V$ before $t = 0.5$ influences $u(0.5,\ln100)$ in a non-negligible manner.

Figure 1: Different paths of $V$ on the time interval $[0,0.5]$: (a) paths of $V$ with different values at $t = 0.5$; (b) paths of $V$ with a fixed value at $t = 0.5$.

Table 2: Values of $u(0.5,\ln100)$ on the different paths of $V$ in Figure 1(a), each path reported with its value $V(0.5)$.

Table 3: Values of $u(0.5,\ln100)$ on the different paths of $V$ in Figure 1(b), all paths sharing the same value $V(0.5)$.

American options. Again, consider the rough Bergomi model in Example 1.1 with the following choice of parameters: $H = 0.07$, $\eta = 1.9$, $\rho = -0.9$, $r = 0.05$, $T = 1$, $X_0 = \ln(100)$. Also, we choose the forward variance curve to be
$\xi(t)\equiv0.09$, independent of time, for simplicity. The strike prices may take different values. Then, pricing the American put option is to compute
$$ u_0(x) := \sup_{\tau\in\mathcal{T}_0} E\Big[e^{-\tau r}\,g_\tau\big(e^{X^{0,x}_\tau}\big)\Big], \quad \text{with } g_\tau(e^x) = (K-e^{r\tau+x})^+, \quad (\tau,x)\in[0,T]\times\mathbb{R}. $$
We shall adopt two different schemes for the numerical approximations.

The first scheme is based on penalization. By Corollary 3.1, $u_0(x)$ may be approximated by $u^{\tilde N}_0(x)$ as $\tilde N$ tends to infinity, where the pair $(u^{\tilde N},\psi^{\tilde N})$ is the unique weak solution to BSPDE (2.1) with
$$ F_t(e^x,y,z,\tilde z) = -ry + \tilde N\big(g_t(e^x)-y\big)^+ \quad \text{and} \quad G(e^x) = g_T(e^x). $$
Then the first scheme is to use the algorithm in Section 4.2 to compute $u^{\tilde N}_0(X_0)$, which approximates $u_0(x)$ when $\tilde N$ tends to infinity.

The second scheme is based on the representation via the following forward-backward system:
$$ \begin{cases} dX^{0,x}_s = \sqrt{V_s}\big(\rho\,dW_s + \sqrt{1-\rho^2}\,dB_s\big) - \dfrac{V_s}{2}\,ds, \quad 0\le s\le T; \qquad X^{0,x}_0 = x; \\ V_s = \xi_s\,\mathcal{E}(\eta\widehat{W}_s) \quad \text{with } \widehat{W}_s = \displaystyle\int_0^s\sqrt{2H}\,(s-r)^{H-1/2}\,dW_r, \quad s\in[0,T]; \\ -dY^{0,x}_s = -rY^{0,x}_s\,ds + dA^{0,x}_s - Z^{0,x;B}_s\,dB_s - Z^{0,x;W}_s\,dW_s, \quad s\in[0,T]; \\ Y^{0,x}_T = g_T\big(e^{X^{0,x}_T}\big); \qquad Y^{0,x}_s \ge g_s\big(e^{X^{0,x}_s}\big), \quad s\in[0,T]; \\ A^{0,x}_\cdot \text{ is increasing and continuous, } A^{0,x}_0 = 0, \quad \displaystyle\int_0^T\big(Y^{0,x}_s - g_s(e^{X^{0,x}_s})\big)\,dA^{0,x}_s = 0. \end{cases} \tag{4.6} $$
Recalling assertion (v) in Corollary 3.1, which gives the representation
$$ u_\tau(X^{0,x}_\tau) = Y^{0,x}_\tau, \qquad \psi^B_\tau(X^{0,x}_\tau) = Z^{0,x;B}_\tau, \qquad \psi^W_\tau(X^{0,x}_\tau) = Z^{0,x;W}_\tau, \quad \text{a.s.}, $$
for $0\le\tau\le T$, for some triple $(u,\psi^B,\psi^W)$ defined on $(\Omega\times[0,T]\times\mathbb{R}, \mathscr{P}^W\otimes\mathcal{B}(\mathbb{R}))$, we may use the following scheme:
(1) Start with $\widehat{\mathcal{U}}_N = g_T$.
(2) For $i = N-1,\dots,0$,
given $\widehat{\mathcal{U}}_{i+1}$, use the triple of deep neural networks
$$ \big(\mathcal{U}_i(\cdot,\theta), \mathcal{Z}^B_i(\cdot,\theta), \mathcal{Z}^W_i(\cdot,\theta)\big) \in \mathcal{NN}^\varrho_{2i+1,1,M,m}(\mathbb{R}^{M_m})\times\mathcal{NN}^\varrho_{2i+1,1,M,m}(\mathbb{R}^{M_m})\times\mathcal{NN}^\varrho_{2i+1,1,M,m}(\mathbb{R}^{M_m}) \tag{4.7} $$
for the approximation of $\big(u_{t_i}(X_{t_i}), \psi^B_{t_i}(X_{t_i}), \psi^W_{t_i}(X_{t_i})\big)$, and obtain an estimate
$$ \mathcal{U}_{i+1} = \mathcal{U}_i(X_{t_i},\theta_i) + r\,\mathcal{U}_i(X_{t_i},\theta_i)\,\Delta t_i + \mathcal{Z}^B_i(X_{t_i},\theta_i)\,\Delta B_{t_i} + \mathcal{Z}^W_i(X_{t_i},\theta_i)\,\Delta W_{t_i}. $$
(3) Compute the minimizer of the expected quadratic loss function:
$$ \hat L_i(\theta) := E\big|\widehat{\mathcal{U}}_{i+1} - \mathcal{U}_{i+1}\big|^2, \qquad \theta^*_i\in\arg\min_{\theta\in\mathbb{R}^{M_m}}\hat L_i(\theta). $$
(4) Update $\widehat{\mathcal{U}}_i = \max\{\mathcal{U}_i(X_{t_i},\theta^*_i), g_{t_i}(X_{t_i})\}$.

The above scheme extends the one proposed in [HPW19, Section 3.3] from Markovian cases to a non-Markovian setting, with the main difference lying in the changing dimensions of the neural networks (4.7). Looking into the appendix for the convergence analysis of the scheme in Section 4.2, we may extend the convergence analysis in [HPW19, Section 4.3] to our non-Markovian setting; as such an extension is similar to that for the scheme in Section 4.2, the proof is omitted.

In Table 4, the estimates of the above two schemes are presented together with the reference values, which are lower-bound estimates from [BTW18]. We take $N = 20$ and implement a single hidden layer whose number of neurons is equal to half of the total number of neurons in the input and output layers. The activation function and optimization algorithm we use here are the sigmoid function and Adam. The results are obtained by averaging 20 independent runs. For the first scheme, in theory, $u^{\tilde N}_0(X_0)$ is (bigger and) closer to the real value than $u^{\bar N}_0(X_0)$ when $\tilde N > \bar N$, which is affirmed by the numerical experiments; we set $\tilde N$ equal to 40 and 10000 for comparisons. The same neural networks are put to use in the second scheme.
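For concreteness, the two American-specific modifications to the generic solver sketched after (4.4) can be written as follows; the helper names are ours. The penalized driver corresponds to the first scheme, and the max-update corresponds to step (4) of the second scheme.

```python
import torch

def g_put(x, t, K=100.0, r=0.05):
    """American put obstacle g_t(e^x) = (K - e^{x + r t})^+ for a tensor x."""
    return torch.clamp(K - torch.exp(x + r * t), min=0.0)

# first scheme: penalized driver F(e^x, y) = -r*y + N_pen*(g - y)^+ plugged into
# the one-step update H of Section 4.2; the rest of the solver is unchanged
def H_penalized(x, y, z, zt, db, dw, t, dt, N_pen=10000, r=0.05):
    F = -r * y + N_pen * torch.clamp(g_put(x, t) - y, min=0.0)
    return y - F * dt + z * db + zt * dw

# second scheme: after the Adam fit at time t_i, reflect the estimate on the
# obstacle, i.e., step (4): hatU_i = max{ U_i(., theta*), g_{t_i} }
def reflect_update(U_i_values, x, t):
    return torch.maximum(U_i_values, g_put(x, t))
```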
32 5 . . . . . . K = 100 8 .
51 9 . . . . . . K = 110 13 .
24 15 . . . . . . K = 120 20 22 . . . . . . ρ = η = 0 and keeping the other parameters unchanged, The estimates of the above twoschemes are compared with the option prices calculated by binprice function in the financialtoolbox of Matlab. It can be seen from Table 5 that our results are pretty close to the optionprice estimates by using the Cox-Ross-Rubinstein binomial model.Reference value 1st scheme 2nd scheme RSDN=40 RSD N=10000 RSD K = 90 5 . . . . . . . K = 100 9 . . . . . . . K = 110 15 . . . . . . . K = 120 22 . . . . . . . ρ = η = 0. A Convergence analysis
This section is devoted to a convergence analysis for the deep learning-based scheme proposed in Section 4.2. The discussions are conducted under Assumptions $(\mathcal{A}^*)$, 1.1, 2.1, and the following one:

(H1) (i) There exists a continuous and increasing function $\rho_0:[0,\infty)\to[0,\infty)$ with $\rho_0(0)=0$ such that for any $0\le t_1\le t_2\le T$, it holds that
$$ E\bigg[\int_{t_1}^{t_2} V_s\,ds\bigg] + E\bigg[\bigg(\int_{t_1}^{t_2} V_s\,ds\bigg)^2\bigg] \le \rho_0(|t_1-t_2|). $$
(ii) There exists a constant $L_0>0$ such that
$$ \big|F_{t_1}(e^{x_1},y_1,z_1,\tilde z_1) - F_{t_2}(e^{x_2},y_2,z_2,\tilde z_2)\big| \le L_0\Big(\sqrt{\rho_0(|t_1-t_2|)} + |x_1-x_2| + |y_1-y_2| + |z_1-z_2| + |\tilde z_1-\tilde z_2|\Big), $$
for all $(t_1,x_1,y_1,z_1,\tilde z_1)$ and $(t_2,x_2,y_2,z_2,\tilde z_2)$ in $[0,T]\times\mathbb{R}^4$.

Remark A.1.
In fact, for examples like 1.1 and 1.2, one has
$$ E\bigg[\int_0^T |V_t|^p\,dt\bigg] < \infty, \quad \text{for some } p>2, $$
which, by Hölder's inequality, implies
$$ E\int_{t_1}^{t_2} V_t\,dt \le |t_1-t_2|^{\frac{p-1}{p}}\bigg(E\int_{t_1}^{t_2}|V_t|^p\,dt\bigg)^{1/p} \le C_p\,|t_1-t_2|^{\frac{p-1}{p}}, \qquad \text{for } 0\le t_1\le t_2\le T, $$
$$ E\bigg[\bigg(\int_{t_1}^{t_2} V_t\,dt\bigg)^2\bigg] \le |t_1-t_2|^{\frac{2(p-1)}{p}}\bigg(E\int_{t_1}^{t_2}|V_t|^p\,dt\bigg)^{2/p} \le C_p^2\,|t_1-t_2|^{\frac{2(p-1)}{p}}, \qquad \text{for } 0\le t_1\le t_2\le T, $$
and thus we may take
$$ \rho_0(r) = (C_p + C_p^2)\cdot\Big(|r|^{\frac{p-1}{p}} \vee |r|^{\frac{2(p-1)}{p}}\Big), \qquad \text{for } r\ge0. $$
Further, one may straightforwardly check that the numerical examples discussed in Section 4.3 satisfy Assumption (H1).

In what follows, we denote by $C$ a positive generic constant whose value is independent of $\pi$ and may vary from line to line; by $\mathcal{X}$ we denote the unique (strong) solution to the SDE (1.8) started at $t=0$, and by $X = X^\pi$ the Euler-Maruyama approximation with a time grid $\pi = \{t_0 = 0 < t_1 < \dots < t_N = T\}$, with modulus $|\pi| = \max_{1\le i\le N}|t_i-t_{i-1}|$ bounded by $\frac{CT}{N}$ for some constant $C$. Under Assumptions 1.1 and (H1), standard calculations yield that
$$ E\bigg[\sup_{0\le t\le T}|\mathcal{X}_t|^2\bigg] \le C(1+|x|^2), \tag{A.1} $$
$$ \max_{i=0,\dots,N-1} E\bigg[|\mathcal{X}_{t_{i+1}} - X_{t_{i+1}}|^2 + \sup_{t\in[t_i,t_{i+1}]}|\mathcal{X}_t - \mathcal{X}_{t_i}|^2\bigg] \le C\rho_0(|\pi|). \tag{A.2} $$
By the theory of BSDEs (see [BDH+
03] for instance), Assumptions 1.1, 2.1, and (H1) imply the existence and uniqueness of an adapted $L^2$-solution $(Y,Z,\tilde Z)$ to BSDE (2.2), which together with (A.1) and (H1)(ii) gives
$$ E\bigg[\int_0^T\big|F_t(e^{\mathcal{X}_t}, Y_t, Z_t, \tilde Z_t)\big|^2\,dt\bigg] < \infty \tag{A.3} $$
and the standard $L^2$-regularity result on $Y$:
$$ \max_{i=0,\dots,N-1} E\bigg[\sup_{t\in[t_i,t_{i+1}]}|Y_t - Y_{t_i}|^2\bigg] = O(|\pi|). \tag{A.4} $$
For the pair $(Z,\tilde Z)$, set
$$ \varepsilon^Z(\pi) := E\bigg[\sum_{i=0}^{N-1}\int_{t_i}^{t_{i+1}}|Z_t - \bar Z_{t_i}|^2\,dt\bigg], \quad \text{with } \bar Z_{t_i} := \frac{1}{\Delta t_i}E_i\bigg[\int_{t_i}^{t_{i+1}} Z_t\,dt\bigg]; \qquad \varepsilon^{\tilde Z}(\pi) := E\bigg[\sum_{i=0}^{N-1}\int_{t_i}^{t_{i+1}}|\tilde Z_t - \bar{\tilde Z}_{t_i}|^2\,dt\bigg], \quad \text{with } \bar{\tilde Z}_{t_i} := \frac{1}{\Delta t_i}E_i\bigg[\int_{t_i}^{t_{i+1}}\tilde Z_t\,dt\bigg], \tag{A.5} $$
where $E_i$ denotes the conditional expectation given $\mathscr{F}_{t_i}$.

To investigate the convergence of the deep learning scheme, we define, for $i=0,\dots,N-1$,
$$ \begin{cases} \widehat V_{t_i} := E_i\big[\widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})\big] + F_{t_i}\big(e^{X_{t_i}}, \widehat V_{t_i}, \widehat Z_{t_i}, \widehat{\tilde Z}_{t_i}\big)\,\Delta t_i, \\ \widehat Z_{t_i} := \dfrac{1}{\Delta t_i}E_i\big[\widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})\,\Delta B_{t_i}\big], \qquad \widehat{\tilde Z}_{t_i} := \dfrac{1}{\Delta t_i}E_i\big[\widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})\,\Delta W_{t_i}\big], \end{cases} \tag{A.6} $$
where $\widehat V_{t_i}$ is well-defined for sufficiently small $|\pi|$ due to the uniform Lipschitz continuity of $F$. In view of Theorem 2.4, we may find $\mathscr{F}^W_{t_i}\otimes\mathcal{B}(\mathbb{R})$-measurable functions $\hat v_i$, $\hat z_i$, and $\hat{\tilde z}_i$ s.t.
$$ \widehat V_{t_i} = \hat v_i(X_{t_i}), \qquad \widehat Z_{t_i} = \hat z_i(X_{t_i}), \qquad \widehat{\tilde Z}_{t_i} = \hat{\tilde z}_i(X_{t_i}), \qquad i=0,\dots,N-1. \tag{A.7} $$
On the other hand, by the martingale representation theorem, there exist two $\mathbb{R}$-valued square integrable processes $\{\widehat Z_t\}$ and $\{\widehat{\tilde Z}_t\}$ s.t.
$$ \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}}) = \widehat V_{t_i} - F_{t_i}\big(e^{X_{t_i}}, \widehat V_{t_i}, \widehat Z_{t_i}, \widehat{\tilde Z}_{t_i}\big)\,\Delta t_i + \int_{t_i}^{t_{i+1}}\widehat Z_t\,dB_t + \int_{t_i}^{t_{i+1}}\widehat{\tilde Z}_t\,dW_t, \tag{A.8} $$
and Itô's isometry gives
$$ \widehat Z_{t_i} = \frac{1}{\Delta t_i}E_i\bigg[\int_{t_i}^{t_{i+1}}\widehat Z_t\,dt\bigg], \qquad \widehat{\tilde Z}_{t_i} = \frac{1}{\Delta t_i}E_i\bigg[\int_{t_i}^{t_{i+1}}\widehat{\tilde Z}_t\,dt\bigg], \qquad i=0,\dots,N-1. $$
The distance between the optimal triple $(\widehat{\mathcal{U}}_i, \widehat{\mathcal{Z}}_i, \widehat{\widetilde{\mathcal{Z}}}_i)$ from the deep learning-based scheme and $(\widehat V_{t_i}, \widehat Z_{t_i}, \widehat{\tilde Z}_{t_i})$ from the system (A.6) is given as follows.

Lemma A.1.
The distance between the optimal triple $(\widehat{\mathcal{U}}_i, \widehat{\mathcal{Z}}_i, \widehat{\widetilde{\mathcal{Z}}}_i)$ from the deep learning-based scheme and $(\widehat{V}_{t_i}, \widehat{Z}_{t_i}, \widehat{\widetilde{Z}}_{t_i})$ from the system (A.6) is estimated as follows.

Lemma A.1. Let Assumptions $(\mathcal{A}^*)$, 1.1, 2.1, and (H1) hold. When $|\pi|$ is sufficiently small, we have
\[
E|\widehat{V}_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i})|^2 + \Delta t_i E\left[|\widehat{Z}_{t_i} - \widehat{\mathcal{Z}}_i(X_{t_i})|^2 + |\widehat{\widetilde{Z}}_{t_i} - \widehat{\widetilde{\mathcal{Z}}}_i(X_{t_i})|^2\right] \le C\varepsilon^{N,v}_i + C\Delta t_i \varepsilon^{N,z}_i + C\Delta t_i \varepsilon^{N,\tilde{z}}_i, \tag{A.9}
\]
where we use $\varepsilon^{N,v}_i := \inf_{\xi} E|\hat{v}_i(X_{t_i}) - \mathcal{U}_i(X_{t_i}; \xi)|^2$, $\varepsilon^{N,z}_i := \inf_{\eta} E|\hat{z}_i(X_{t_i}) - \mathcal{Z}_i(X_{t_i}; \eta)|^2$, and $\varepsilon^{N,\tilde{z}}_i := \inf_{\tilde{\eta}} E|\hat{\tilde{z}}_i(X_{t_i}) - \widetilde{\mathcal{Z}}_i(X_{t_i}; \tilde{\eta})|^2$ to denote the $L^2$-approximation errors of $\hat{v}_i$, $\hat{z}_i$, and $\hat{\tilde{z}}_i$ by neural networks $\mathcal{U}_i$, $\mathcal{Z}_i$, and $\widetilde{\mathcal{Z}}_i$, for $i = 0, \dots, N-1$.

To focus on the convergence analysis, we postpone the proof of Lemma A.1. Define the following square error:
\[
\mathcal{E}[(\widehat{\mathcal{U}}, \widehat{\mathcal{Z}}, \widehat{\widetilde{\mathcal{Z}}}), (Y, Z, \widetilde{Z})] = \max_{i=0,\dots,N-1} E\left[|Y_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i})|^2\right] + E\left[\sum_{i=0}^{N-1} \int_{t_i}^{t_{i+1}} |Z_t - \widehat{\mathcal{Z}}_i(X_{t_i})|^2 \, dt\right] + E\left[\sum_{i=0}^{N-1} \int_{t_i}^{t_{i+1}} |\widetilde{Z}_t - \widehat{\widetilde{\mathcal{Z}}}_i(X_{t_i})|^2 \, dt\right].
\]
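The errors $\varepsilon^{N,v}_i$, $\varepsilon^{N,z}_i$, and $\varepsilon^{N,\tilde{z}}_i$ quantify how well the chosen network classes represent $\hat{v}_i$, $\hat{z}_i$, and $\hat{\tilde{z}}_i$; by universal approximation [HSW89, HSW90] they can in principle be made arbitrarily small by enlarging the networks. A minimal sketch of such a per-step family is the following (in PyTorch; the width, depth, activation, and the use of the discretized history as input are our assumptions, the last reflecting the dimension-varying networks mentioned after Theorem A.2).

```python
import torch
import torch.nn as nn

class StepNet(nn.Module):
    """Feedforward approximator for one of v_i, z_i, or z~_i at a fixed step.

    Width and depth are illustrative choices, not the paper's architecture.
    """
    def __init__(self, in_dim, width=32, depth=2):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers.append(nn.Linear(d, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def make_step_networks(i):
    """Three networks per time step t_i, as in Lemma A.1.

    The input dimension grows with i (here: the discretized history up to
    t_i, which is one reading of the "dimension-varying" networks).
    """
    in_dim = i + 1
    return StepNet(in_dim), StepNet(in_dim), StepNet(in_dim)
```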
Theorem A.2. Under Assumptions $(\mathcal{A}^*)$, 1.1, 2.1, and (H1), it holds that
\[
\mathcal{E}[(\widehat{\mathcal{U}}, \widehat{\mathcal{Z}}, \widehat{\widetilde{\mathcal{Z}}}), (Y, Z, \widetilde{Z})] \le C\left\{E|G(\mathcal{X}_T) - G(X_T)|^2 + \rho(|\pi|) + |\pi| + \varepsilon_Z(\pi) + \varepsilon_{\widetilde{Z}}(\pi) + \sum_{i=0}^{N-1} (N\varepsilon^{N,v}_i + \varepsilon^{N,z}_i + \varepsilon^{N,\tilde{z}}_i)\right\}, \tag{A.10}
\]
where the constant $C$ is independent of the partition $\pi$.

The computations involved in the proofs of Lemma A.1 and Theorem A.2 are conducted in a similar way to [HPW19, Section 4.1] by Hur\'e, Pham, and Warin, with the main differences lying in the approximation of the random variables by dimension-varying neural networks and in the general modulus function $\rho(|\pi|)$. We provide the proofs for the reader's convenience.

Proof of Theorem A.2. Step 1.
We first derive a recursive estimate for the square norm of $Y_{t_i} - \widehat{V}_{t_i}$, namely
\[
E|Y_{t_i} - \widehat{V}_{t_i}|^2 \le (1 + C|\pi|) E|Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})|^2 + C|\pi| E\left[\int_{t_i}^{t_{i+1}} |F_t(e^{\mathcal{X}_t}, Y_t, Z_t, \widetilde{Z}_t)|^2 \, dt\right] + CE\left[\int_{t_i}^{t_{i+1}} \left(|\widetilde{Z}_t - \bar{\widetilde{Z}}_{t_i}|^2 + |Z_t - \bar{Z}_{t_i}|^2\right) dt\right] + C\rho(|\pi|)|\pi|, \tag{A.11}
\]
for each $i \in \{0, \dots, N-1\}$.

In view of (2.2) and (A.6), we have
\[
Y_{t_i} - \widehat{V}_{t_i} = E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})] + E_i\left[\int_{t_i}^{t_{i+1}} F_t(e^{\mathcal{X}_t}, Y_t, Z_t, \widetilde{Z}_t) - F_{t_i}(e^{X_{t_i}}, \widehat{V}_{t_i}, \widehat{Z}_{t_i}, \widehat{\widetilde{Z}}_{t_i}) \, dt\right].
\]
Young's inequality gives $(a + b)^2 \le (1 + \gamma\Delta t_i)a^2 + (1 + \frac{1}{\gamma\Delta t_i})b^2$ for any $a, b \in \mathbb{R}$ and $\gamma > 0$. This, combined with the Lipschitz continuity of $F$ in (H1) and the estimate (A.2) on the forward process, implies that
\[
\begin{aligned}
E|Y_{t_i} - \widehat{V}_{t_i}|^2
&\le E\left\{(1 + \gamma\Delta t_i)\left(E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})]\right)^2 + \left(1 + \frac{1}{\gamma\Delta t_i}\right)\left(E_i\left[\int_{t_i}^{t_{i+1}} \big(F_t(e^{\mathcal{X}_t}, Y_t, Z_t, \widetilde{Z}_t) - F_{t_i}(e^{X_{t_i}}, \widehat{V}_{t_i}, \widehat{Z}_{t_i}, \widehat{\widetilde{Z}}_{t_i})\big) \, dt\right]\right)^2\right\} \\
&\le (1 + \gamma\Delta t_i) E\left[|E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})]|^2\right] + 5\left(1 + \frac{1}{\gamma\Delta t_i}\right) L^2 \Delta t_i \left\{C\rho(|\pi|)|\pi| + E\left[\int_{t_i}^{t_{i+1}} |Y_t - \widehat{V}_{t_i}|^2 \, dt\right] + E\left[\int_{t_i}^{t_{i+1}} \left(|Z_t - \widehat{Z}_{t_i}|^2 + |\widetilde{Z}_t - \widehat{\widetilde{Z}}_{t_i}|^2\right) dt\right]\right\} \\
&\le (1 + \gamma\Delta t_i) E\left[|E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})]|^2\right] + \frac{5(1 + \gamma\Delta t_i)L^2}{\gamma} \left\{C\rho(|\pi|)|\pi| + 2\Delta t_i E|Y_{t_i} - \widehat{V}_{t_i}|^2 + E\left[\int_{t_i}^{t_{i+1}} \left(|Z_t - \widehat{Z}_{t_i}|^2 + |\widetilde{Z}_t - \widehat{\widetilde{Z}}_{t_i}|^2\right) dt\right]\right\},
\end{aligned} \tag{A.12}
\]
where the $L^2$-regularity (A.4) of $Y$ is used in the last inequality.

Recalling that $\bar{Z}$ and $\bar{\widetilde{Z}}$ are the $L^2$-projections of $Z$ and $\widetilde{Z}$ respectively, we have
\[
E\left[\int_{t_i}^{t_{i+1}} |Z_t - \widehat{Z}_{t_i}|^2 \, dt\right] = E\left[\int_{t_i}^{t_{i+1}} |Z_t - \bar{Z}_{t_i}|^2 \, dt\right] + \Delta t_i E\left[|\bar{Z}_{t_i} - \widehat{Z}_{t_i}|^2\right],
\]
\[
E\left[\int_{t_i}^{t_{i+1}} |\widetilde{Z}_t - \widehat{\widetilde{Z}}_{t_i}|^2 \, dt\right] = E\left[\int_{t_i}^{t_{i+1}} |\widetilde{Z}_t - \bar{\widetilde{Z}}_{t_i}|^2 \, dt\right] + \Delta t_i E\left[|\bar{\widetilde{Z}}_{t_i} - \widehat{\widetilde{Z}}_{t_i}|^2\right]. \tag{A.13}
\]
Integrating equation (2.2) over the time interval $[t_i, t_{i+1}]$ and multiplying by $\Delta W_{t_i}$ and $\Delta B_{t_i}$ respectively, together with (A.6), gives
\[
\Delta t_i \left(\bar{\widetilde{Z}}_{t_i} - \widehat{\widetilde{Z}}_{t_i}\right) = E_i\left[\Delta W_{t_i}\left(Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}}) - E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})]\right)\right] + E_i\left[\Delta W_{t_i} \int_{t_i}^{t_{i+1}} F_t(e^{\mathcal{X}_t}, Y_t, Z_t, \widetilde{Z}_t) \, dt\right],
\]
\[
\Delta t_i \left(\bar{Z}_{t_i} - \widehat{Z}_{t_i}\right) = E_i\left[\Delta B_{t_i}\left(Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}}) - E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})]\right)\right] + E_i\left[\Delta B_{t_i} \int_{t_i}^{t_{i+1}} F_t(e^{\mathcal{X}_t}, Y_t, Z_t, \widetilde{Z}_t) \, dt\right].
\]
Standard computations further indicate that
\[
\Delta t_i E\left[|\bar{Z}_{t_i} - \widehat{Z}_{t_i}|^2\right] \le 2\left(E|Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})|^2 - E|E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})]|^2\right) + 2\Delta t_i E\left[\int_{t_i}^{t_{i+1}} |F_t(e^{\mathcal{X}_t}, Y_t, Z_t, \widetilde{Z}_t)|^2 \, dt\right]; \tag{A.14}
\]
an analogous estimate holds for $\widetilde{Z}$.
Then, plugging (A.13) and (A.14) into (A.12) and choosing $\gamma = 20L^2$, we have
\[
\begin{aligned}
E\left[|Y_{t_i} - \widehat{V}_{t_i}|^2\right]
&\le (1 + \gamma\Delta t_i) E\left[|E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})]|^2\right] + \frac{5(1 + \gamma\Delta t_i)L^2}{\gamma} \bigg\{C\rho(|\pi|)|\pi| + 2\Delta t_i E|Y_{t_i} - \widehat{V}_{t_i}|^2 \\
&\qquad + E\left[\int_{t_i}^{t_{i+1}} \left(|Z_t - \bar{Z}_{t_i}|^2 + |\widetilde{Z}_t - \bar{\widetilde{Z}}_{t_i}|^2\right) dt\right] + 4\left(E|Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})|^2 - E\left[|E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})]|^2\right]\right) \\
&\qquad + 4\Delta t_i E\left[\int_{t_i}^{t_{i+1}} |F_t(e^{\mathcal{X}_t}, Y_t, Z_t, \widetilde{Z}_t)|^2 \, dt\right]\bigg\} \\
&\le C\rho(|\pi|)|\pi| + (1 + C|\pi|) E|Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})|^2 + C\Delta t_i E|Y_{t_i} - \widehat{V}_{t_i}|^2 \\
&\qquad + CE\left[\int_{t_i}^{t_{i+1}} \left(|Z_t - \bar{Z}_{t_i}|^2 + |\widetilde{Z}_t - \bar{\widetilde{Z}}_{t_i}|^2\right) dt\right] + C\Delta t_i E\left[\int_{t_i}^{t_{i+1}} |F_t(e^{\mathcal{X}_t}, Y_t, Z_t, \widetilde{Z}_t)|^2 \, dt\right],
\end{aligned} \tag{A.15}
\]
where the choice of $\gamma$ makes the terms involving $E|E_i[\,\cdot\,]|^2$ cancel; this implies (A.11) when $|\pi|$ is sufficiently small.

Step 2.
We prove the estimate for the $Y$-component in (A.10), i.e.,
\[
\max_{i=0,\dots,N-1} E|Y_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i})|^2 \le C\rho(|\pi|) + CE|G(\mathcal{X}_T) - G(X_T)|^2 + C\varepsilon_Z(\pi) + C\varepsilon_{\widetilde{Z}}(\pi) + C\sum_{i=0}^{N-1}(N\varepsilon^{N,v}_i + \varepsilon^{N,z}_i + \varepsilon^{N,\tilde{z}}_i). \tag{A.16}
\]
Indeed, using Young's inequality in the form
\[
(a + b)^2 \ge (1 - |\pi|)a^2 + \left(1 - \frac{1}{|\pi|}\right)b^2 \ge (1 - |\pi|)a^2 - \frac{1}{|\pi|}b^2,
\]
we have
\[
E|Y_{t_i} - \widehat{V}_{t_i}|^2 = E|Y_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i}) + \widehat{\mathcal{U}}_i(X_{t_i}) - \widehat{V}_{t_i}|^2 \ge (1 - |\pi|)E|Y_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i})|^2 - \frac{1}{|\pi|}E|\widehat{\mathcal{U}}_i(X_{t_i}) - \widehat{V}_{t_i}|^2. \tag{A.17}
\]
Plugging the above inequality into (A.11) and letting $|\pi|$ be small enough yields
\[
E|Y_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i})|^2 \le C\rho(|\pi|)|\pi| + (1 + C|\pi|)E|Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})|^2 + CE\left[\int_{t_i}^{t_{i+1}} \left(|Z_t - \bar{Z}_{t_i}|^2 + |\widetilde{Z}_t - \bar{\widetilde{Z}}_{t_i}|^2\right) dt\right] + C|\pi| E\left[\int_{t_i}^{t_{i+1}} |F_t(e^{\mathcal{X}_t}, Y_t, Z_t, \widetilde{Z}_t)|^2 \, dt\right] + CN E|\widehat{V}_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i})|^2. \tag{A.18}
\]
Recalling that $Y_{t_N} = G(\mathcal{X}_T)$ and $\widehat{\mathcal{U}}_N(X_{t_N}) = G(X_T)$, and using (A.3), we may apply the discrete Gronwall inequality to reach the following estimate:
\[
\max_{i=0,\dots,N-1} E|Y_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i})|^2 \le C\left\{\rho(|\pi|) + |\pi| + E|G(\mathcal{X}_T) - G(X_T)|^2 + \varepsilon_Z(\pi) + \varepsilon_{\widetilde{Z}}(\pi) + N\sum_{i=0}^{N-1} E|\widehat{\mathcal{U}}_i(X_{t_i}) - \widehat{V}_{t_i}|^2\right\}, \tag{A.19}
\]
which combined with Lemma A.1 gives (A.16).
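For the reader's convenience, the discrete Gronwall inequality is used here in the following standard form, stated in the notation of (A.18):
\[
a_i \le (1 + C|\pi|)\,a_{i+1} + b_i, \quad i = 0, \dots, N-1, \qquad \Longrightarrow \qquad \max_{0 \le i \le N-1} a_i \le e^{CN|\pi|}\Big(a_N + \sum_{i=0}^{N-1} b_i\Big),
\]
with $a_i = E|Y_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i})|^2$ and $b_i$ collecting the remaining terms on the right-hand side of (A.18); since $N|\pi| \le CT$, the factor $e^{CN|\pi|}$ is absorbed into the constant in (A.19).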
Step 3. We prove the estimate for the $(Z, \widetilde{Z})$-component in (A.10), i.e.,
\[
E\left[\sum_{i=0}^{N-1} \int_{t_i}^{t_{i+1}} \left(|Z_t - \widehat{\mathcal{Z}}_i(X_{t_i})|^2 + |\widetilde{Z}_t - \widehat{\widetilde{\mathcal{Z}}}_i(X_{t_i})|^2\right) dt\right] \le C\left\{\varepsilon_Z(\pi) + \varepsilon_{\widetilde{Z}}(\pi) + \rho(|\pi|) + |\pi| + E|G(\mathcal{X}_T) - G(X_T)|^2 + \sum_{i=0}^{N-1}(N\varepsilon^{N,v}_i + \varepsilon^{N,z}_i + \varepsilon^{N,\tilde{z}}_i)\right\}.
\]
From (A.13) and (A.14), it follows that for any $i = 0, \dots, N-1$,
\[
E\left[\int_{t_i}^{t_{i+1}} |Z_t - \widehat{Z}_{t_i}|^2 \, dt\right] \le E\left[\int_{t_i}^{t_{i+1}} |Z_t - \bar{Z}_{t_i}|^2 \, dt\right] + 2\left(E|Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})|^2 - E|E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})]|^2\right) + 2|\pi| E\left[\int_{t_i}^{t_{i+1}} |F_t(e^{\mathcal{X}_t}, Y_t, Z_t, \widetilde{Z}_t)|^2 \, dt\right],
\]
which, together with (A.3), gives
\[
E\left[\sum_{i=0}^{N-1} \int_{t_i}^{t_{i+1}} |Z_t - \widehat{Z}_{t_i}|^2 \, dt\right] \le \varepsilon_Z(\pi) + 2E|G(\mathcal{X}_T) - G(X_T)|^2 + 2\sum_{i=0}^{N-1}\left(E|Y_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i})|^2 - E|E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})]|^2\right) + C|\pi|, \tag{A.20}
\]
where the indices are shifted in the last summation. Analogously,
\[
E\left[\sum_{i=0}^{N-1} \int_{t_i}^{t_{i+1}} |\widetilde{Z}_t - \widehat{\widetilde{Z}}_{t_i}|^2 \, dt\right] \le \varepsilon_{\widetilde{Z}}(\pi) + 2E|G(\mathcal{X}_T) - G(X_T)|^2 + 2\sum_{i=0}^{N-1}\left(E|Y_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i})|^2 - E|E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})]|^2\right) + C|\pi|. \tag{A.21}
\]
Notice that by (A.12) and (A.17) we have
\[
\begin{aligned}
&2\left(E|Y_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i})|^2 - E|E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})]|^2\right) \\
&\quad\le \frac{2}{1 - |\pi|}\bigg\{(1 + \gamma\Delta t_i)E\left[|E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})]|^2\right] + \frac{5(1 + \gamma\Delta t_i)L^2}{\gamma}\Big(C\rho(|\pi|)|\pi| + 2\Delta t_i E|Y_{t_i} - \widehat{V}_{t_i}|^2 \\
&\qquad\quad + E\Big[\int_{t_i}^{t_{i+1}} |Z_t - \widehat{Z}_{t_i}|^2 \, dt\Big] + E\Big[\int_{t_i}^{t_{i+1}} |\widetilde{Z}_t - \widehat{\widetilde{Z}}_{t_i}|^2 \, dt\Big]\Big)\bigg\} - 2E|E_i[Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})]|^2 \\
&\qquad\quad + \frac{3}{|\pi|(1 - |\pi|)} E|\widehat{\mathcal{U}}_i(X_{t_i}) - \widehat{V}_{t_i}|^2. \tag{A.22}
\end{aligned}
\]
Take $\gamma = 50L^2$, so that $\frac{5L^2}{\gamma}\cdot\frac{1 + \gamma|\pi|}{1 - |\pi|} \le \frac{1}{4}$ for $|\pi|$ small enough, and notice that $\big[\frac{1 + \gamma|\pi|}{1 - |\pi|} - 1\big] = O(|\pi|)$.
This, together with (A.3), (A.9), (A.11), (A.16), and (A.20), yields
\[
\begin{aligned}
&E\left[\sum_{i=0}^{N-1} \int_{t_i}^{t_{i+1}} \left(|Z_t - \widehat{Z}_{t_i}|^2 + |\widetilde{Z}_t - \widehat{\widetilde{Z}}_{t_i}|^2\right) dt\right] \\
&\quad\le \varepsilon_Z(\pi) + \varepsilon_{\widetilde{Z}}(\pi) + C\max_{i=0,\dots,N} E|Y_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i})|^2 + C\rho(|\pi|) + CE|G(\mathcal{X}_T) - G(X_T)|^2 \\
&\qquad + C|\pi|\sum_{i=0}^{N-1} E|Y_{t_i} - \widehat{V}_{t_i}|^2 + CN\sum_{i=0}^{N-1} E|\widehat{\mathcal{U}}_i(X_{t_i}) - \widehat{V}_{t_i}|^2 + C|\pi| \\
&\quad\le \varepsilon_Z(\pi) + \varepsilon_{\widetilde{Z}}(\pi) + C\max_{i=0,\dots,N} E|Y_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i})|^2 + C\rho(|\pi|) + C|\pi| \\
&\qquad + C|\pi|\sum_{i=0}^{N-1}\bigg\{C\rho(|\pi|)|\pi| + CE\left[\int_{t_i}^{t_{i+1}} \left(|Z_t - \bar{Z}_{t_i}|^2 + |\widetilde{Z}_t - \bar{\widetilde{Z}}_{t_i}|^2\right) dt\right] + (1 + C|\pi|)E|Y_{t_{i+1}} - \widehat{\mathcal{U}}_{i+1}(X_{t_{i+1}})|^2 \\
&\qquad\qquad + C|\pi| E\left[\int_{t_i}^{t_{i+1}} |F_t(e^{\mathcal{X}_t}, Y_t, Z_t, \widetilde{Z}_t)|^2 \, dt\right]\bigg\} + CN\sum_{i=0}^{N-1} E|\widehat{\mathcal{U}}_i(X_{t_i}) - \widehat{V}_{t_i}|^2 \\
&\quad\le C\left\{\varepsilon_Z(\pi) + \varepsilon_{\widetilde{Z}}(\pi) + \rho(|\pi|) + |\pi| + E|G(\mathcal{X}_T) - G(X_T)|^2 + \sum_{i=0}^{N-1}(N\varepsilon^{N,v}_i + \varepsilon^{N,z}_i + \varepsilon^{N,\tilde{z}}_i)\right\}.
\end{aligned} \tag{A.23}
\]
Finally, noticing the relations
\[
E\left[\int_{t_i}^{t_{i+1}} |Z_t - \widehat{\mathcal{Z}}_i(X_{t_i})|^2 \, dt\right] \le 2E\left[\int_{t_i}^{t_{i+1}} |Z_t - \widehat{Z}_{t_i}|^2 \, dt\right] + 2\Delta t_i E|\widehat{Z}_{t_i} - \widehat{\mathcal{Z}}_i(X_{t_i})|^2,
\]
\[
E\left[\int_{t_i}^{t_{i+1}} |\widetilde{Z}_t - \widehat{\widetilde{\mathcal{Z}}}_i(X_{t_i})|^2 \, dt\right] \le 2E\left[\int_{t_i}^{t_{i+1}} |\widetilde{Z}_t - \widehat{\widetilde{Z}}_{t_i}|^2 \, dt\right] + 2\Delta t_i E|\widehat{\widetilde{Z}}_{t_i} - \widehat{\widetilde{\mathcal{Z}}}_i(X_{t_i})|^2,
\]
and using (A.9) and (A.23), we obtain, by summing over $i = 0, \dots, N-1$, the desired error estimate for the $(Z, \widetilde{Z})$-component, completing the proof. $\square$

Finally, we prove the claim in Lemma A.1.
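As context for the quadratic loss $\widehat{L}_i(\theta)$ minimized in that proof, the following PyTorch training loop sketches one per-step optimization. It is our own illustration, consistent with the one-step relation (A.8); the signature of F, the optimizer, the learning rate, and the iteration count are assumptions rather than the paper's specification.

```python
import torch

def train_step_i(U_next_vals, X_i, dB, dW, dt, F, nets, n_iter=2000, lr=1e-3):
    """Minimize the empirical quadratic loss over theta (a sketch).

    U_next_vals: samples of U_hat_{i+1}(X_{t_{i+1}}) on simulated paths;
    X_i: network input at t_i; F(u, z, z_tilde): generator at t_i, applied
    pathwise. nets = (U, Z, Z_tilde) are the three step-i networks.
    """
    U, Z, Z_tilde = nets
    params = (list(U.parameters()) + list(Z.parameters())
              + list(Z_tilde.parameters()))
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        u = U(X_i).squeeze(-1)
        z = Z(X_i).squeeze(-1)
        zt = Z_tilde(X_i).squeeze(-1)
        # one-step residual of (A.8) with the networks in place of
        # (V_hat_{t_i}, Z_hat_{t_i}, tilde-Z_hat_{t_i})
        residual = U_next_vals - u + F(u, z, zt) * dt - z * dB - zt * dW
        loss = residual.pow(2).mean()
        loss.backward()
        opt.step()
    return nets
```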
Proof of Lemma A.1. Fix $i \in \{0, \dots, N-1\}$. Using relation (A.8) in the expression of the expected quadratic loss function, and recalling the definitions of $\widehat{Z}_{t_i}$ and $\widehat{\widetilde{Z}}_{t_i}$ as the $L^2$-projections of $\widehat{Z}_t$ and $\widehat{\widetilde{Z}}_t$, we have, for all parameters $\theta$ of the neural networks $\mathcal{U}_i(\cdot\,; \theta)$, $\mathcal{Z}_i(\cdot\,; \theta)$, and $\widetilde{\mathcal{Z}}_i(\cdot\,; \theta)$,
\[
\widehat{L}_i(\theta) = \widetilde{L}_i(\theta) + E\left[\int_{t_i}^{t_{i+1}} \left(|\widehat{Z}_t - \widehat{Z}_{t_i}|^2 + |\widehat{\widetilde{Z}}_t - \widehat{\widetilde{Z}}_{t_i}|^2\right) dt\right], \tag{A.24}
\]
with
\[
\begin{aligned}
\widetilde{L}_i(\theta) := {}& E\Big[\big|\widehat{V}_{t_i} - \mathcal{U}_i(X_{t_i}; \theta) + \big(F_{t_i}(e^{X_{t_i}}, \mathcal{U}_i(X_{t_i}; \theta), \mathcal{Z}_i(X_{t_i}; \theta), \widetilde{\mathcal{Z}}_i(X_{t_i}; \theta)) - F_{t_i}(e^{X_{t_i}}, \widehat{V}_{t_i}, \widehat{Z}_{t_i}, \widehat{\widetilde{Z}}_{t_i})\big)\Delta t_i\big|^2\Big] \\
&+ \Delta t_i E\left[|\widehat{Z}_{t_i} - \mathcal{Z}_i(X_{t_i}; \theta)|^2\right] + \Delta t_i E\left[|\widehat{\widetilde{Z}}_{t_i} - \widetilde{\mathcal{Z}}_i(X_{t_i}; \theta)|^2\right].
\end{aligned} \tag{A.25}
\]
Using Young's inequality $(a + b)^2 \le (1 + \gamma\Delta t_i)a^2 + (1 + \frac{1}{\gamma\Delta t_i})b^2$, together with the Lipschitz condition on $F$ in (H1), we see that
\[
\widetilde{L}_i(\theta) \le (1 + C\Delta t_i)E|\widehat{V}_{t_i} - \mathcal{U}_i(X_{t_i}; \theta)|^2 + C\Delta t_i E\left[|\widehat{Z}_{t_i} - \mathcal{Z}_i(X_{t_i}; \theta)|^2 + |\widehat{\widetilde{Z}}_{t_i} - \widetilde{\mathcal{Z}}_i(X_{t_i}; \theta)|^2\right]. \tag{A.26}
\]
On the other hand, using Young's inequality in the form $(a + b)^2 \ge (1 - \gamma\Delta t_i)a^2 + (1 - \frac{1}{\gamma\Delta t_i})b^2 \ge (1 - \gamma\Delta t_i)a^2 - \frac{1}{\gamma\Delta t_i}b^2$, together with the Lipschitz condition on $F$, gives
\[
\begin{aligned}
\widetilde{L}_i(\theta) \ge {}& (1 - \gamma\Delta t_i)E|\widehat{V}_{t_i} - \mathcal{U}_i(X_{t_i}; \theta)|^2 - \frac{3L^2\Delta t_i}{\gamma}\left(E|\widehat{V}_{t_i} - \mathcal{U}_i(X_{t_i}; \theta)|^2 + E|\widehat{Z}_{t_i} - \mathcal{Z}_i(X_{t_i}; \theta)|^2 + E|\widehat{\widetilde{Z}}_{t_i} - \widetilde{\mathcal{Z}}_i(X_{t_i}; \theta)|^2\right) \\
&+ \Delta t_i E|\widehat{Z}_{t_i} - \mathcal{Z}_i(X_{t_i}; \theta)|^2 + \Delta t_i E|\widehat{\widetilde{Z}}_{t_i} - \widetilde{\mathcal{Z}}_i(X_{t_i}; \theta)|^2.
\end{aligned} \tag{A.27}
\]
Choosing $\gamma = 6L^2$, this yields
\[
\widetilde{L}_i(\theta) \ge (1 - C\Delta t_i)E|\widehat{V}_{t_i} - \mathcal{U}_i(X_{t_i}; \theta)|^2 + \frac{\Delta t_i}{2} E\left[|\widehat{Z}_{t_i} - \mathcal{Z}_i(X_{t_i}; \theta)|^2 + |\widehat{\widetilde{Z}}_{t_i} - \widetilde{\mathcal{Z}}_i(X_{t_i}; \theta)|^2\right]. \tag{A.28}
\]
For each $i \in \{0, \dots, N-1\}$, take $\theta^*_i \in \arg\min_{\theta} \widehat{L}_i(\theta)$, so that $\widehat{\mathcal{U}}_i = \mathcal{U}_i(\cdot\,; \theta^*_i)$, $\widehat{\mathcal{Z}}_i = \mathcal{Z}_i(\cdot\,; \theta^*_i)$, and $\widehat{\widetilde{\mathcal{Z}}}_i = \widetilde{\mathcal{Z}}_i(\cdot\,; \theta^*_i)$. As the second term on the right-hand side of (A.24) is independent of the parameters $\theta$, it also holds that $\theta^*_i \in \arg\min_{\theta} \widetilde{L}_i(\theta)$. Combining (A.28) and (A.26) implies that for all $\theta$,
\[
\begin{aligned}
&(1 - C\Delta t_i)E|\widehat{V}_{t_i} - \widehat{\mathcal{U}}_i(X_{t_i})|^2 + \frac{\Delta t_i}{2} E\left[|\widehat{Z}_{t_i} - \widehat{\mathcal{Z}}_i(X_{t_i})|^2 + |\widehat{\widetilde{Z}}_{t_i} - \widehat{\widetilde{\mathcal{Z}}}_i(X_{t_i})|^2\right] \\
&\quad\le \widetilde{L}_i(\theta^*_i) \le \widetilde{L}_i(\theta) \le (1 + C\Delta t_i)E|\widehat{V}_{t_i} - \mathcal{U}_i(X_{t_i}; \theta)|^2 + C\Delta t_i E\left[|\widehat{Z}_{t_i} - \mathcal{Z}_i(X_{t_i}; \theta)|^2 + |\widehat{\widetilde{Z}}_{t_i} - \widetilde{\mathcal{Z}}_i(X_{t_i}; \theta)|^2\right].
\end{aligned} \tag{A.29}
\]
By (A.7), taking the infimum over $\theta$ and letting $|\pi|$ be sufficiently small gives (A.9). $\square$

References
[ALV07] Elisa Alòs, Jorge A. León, and Josep Vives. On the short-time behavior of the implied volatility for jump-diffusion models with stochastic volatility. Finance and Stochastics, 11(4):571–589, 2007.

[BD14] Christian Bender and Nikolai Dokuchaev. A first-order BSPDE for swing option pricing. Math. Finance, 2014. DOI: 10.1111/mafi.12067.

[BDH+03] P. Briand, B. Delyon, Y. Hu, E. Pardoux, and L. Stoica. $L^p$ solutions of backward stochastic differential equations. Stoch. Process. Appl., 108(4):604–618, 2003.

[BFG16] Christian Bayer, Peter Friz, and Jim Gatheral. Pricing under rough volatility. Quantitative Finance, 16(6):887–904, 2016.

[BFG+19] Christian Bayer, Peter K. Friz, Paul Gassiat, Jörg Martin, and Benjamin Stemper. A regularity structure for rough volatility. Mathematical Finance, 2019.

[BHM+19] Christian Bayer, Blanka Horvath, Aitor Muguruza, Benjamin Stemper, and Mehdi Tomas. On deep calibration of (rough) stochastic volatility models. arXiv preprint arXiv:1908.08806, 2019.

[BL08] Rainer Buckdahn and Juan Li. Stochastic differential games and viscosity solutions of Hamilton-Jacobi-Bellman-Isaacs equations. SIAM J. Control Optim., 47(1):444–475, 2008.

[BTW18] Christian Bayer, Raúl Tempone, and Sören Wolfers. Pricing American options by exercise rate optimization. arXiv preprint arXiv:1809.07300, 2018.

[CCR12] Fabienne Comte, Laure Coutin, and Eric Renault. Affine fractional stochastic volatility models. Annals of Finance, 8(2-3):337–378, 2012.

[CH05] Alexander M. G. Cox and David G. Hobson. Local martingales, bubbles and option prices. Finance and Stochastics, 9(4):477–492, 2005.

[DPZ14] Giuseppe Da Prato and Jerzy Zabczyk. Stochastic Equations in Infinite Dimensions. Cambridge University Press, 2014.

[DQT11] Kai Du, Jinniao Qiu, and Shanjian Tang. $L^p$ theory for super-parabolic backward stochastic partial differential equations in the whole space. Appl. Math. Optim., 65(2):175–219, 2011.

[EEFR18] Omar El Euch, Masaaki Fukasawa, and Mathieu Rosenbaum. The microstructural foundations of leverage effect and rough volatility. Finance and Stochastics, 22(2):241–280, 2018.

[EER19] Omar El Euch and Mathieu Rosenbaum. The characteristic function of rough Heston models. Mathematical Finance, 29(1):3–38, 2019.

[EKP+97] N. El Karoui, C. Kapoudjian, E. Pardoux, S. Peng, and M. C. Quenez. Reflected solutions of backward SDE's, and related obstacle problems for PDE's. Ann. Probab., 25(2):702–737, 1997.

[EKTZ14] Ibrahim Ekren, Christian Keller, Nizar Touzi, and Jianfeng Zhang. On viscosity solutions of path dependent PDEs. The Annals of Probability, 42(1):204–236, 2014.

[EPQ97] N. El Karoui, S. Peng, and M. C. Quenez. Backward stochastic differential equations in finance. Math. Finance, 7(1):1–71, 1997.

[Fuk11] Masaaki Fukasawa. Asymptotic analysis for stochastic volatility: martingale expansion. Finance and Stochastics, 15(4):635–654, 2011.

[Gas18] Paul Gassiat. On the martingale property in the rough Bergomi model, 2018.

[GJR18] Jim Gatheral, Thibault Jaisson, and Mathieu Rosenbaum. Volatility is rough. Quantitative Finance, 18(6):933–949, 2018.

[GMZ20] Ludovic Goudenège, Andrea Molent, and Antonino Zanette. Machine learning for pricing American options in high-dimensional Markovian and non-Markovian models. Quantitative Finance, 20(4):573–591, 2020.

[HJW18] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.

[HMY02] Y. Hu, J. Ma, and J. Yong. On semi-linear degenerate backward stochastic partial differential equations. Probab. Theory Relat. Fields, 123:381–411, 2002.

[HP91] Y. Hu and S. Peng. Adapted solution of a backward semilinear stochastic evolution equation. Stoch. Anal. Appl., 9:445–459, 1991.

[HPW19] Côme Huré, Huyên Pham, and Xavier Warin. Some machine learning schemes for high-dimensional nonlinear PDEs. arXiv preprint arXiv:1902.01599, 2019.

[HSW89] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[HSW90] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3(5):551–560, 1990.

[JLP19] Eduardo Abi Jaber, Martin Larsson, and Sergio Pulido. Affine Volterra processes. The Annals of Applied Probability, 29(5):3155–3200, 2019.

[JO19] Antoine Jack Jacquier and Mugad Oumgari. Deep PPDEs for rough local stochastic volatility. Available at SSRN 3400035, 2019.

[Kry10] N. V. Krylov. On the Itô-Wentzell formula for distribution-valued processes and related topics. Probab. Theory Relat. Fields, 150:295–319, 2010.

[Oks03] Bernt Øksendal. Stochastic Differential Equations: An Introduction with Applications. Springer, 2003.

[Pen92] Shige Peng. Stochastic Hamilton-Jacobi-Bellman equations. SIAM J. Control Optim., 30:284–304, 1992.

[PP90] E. Pardoux and S. Peng. Adapted solution of a backward stochastic differential equation. Syst. Control Lett., 14(1):55–61, 1990.

[Qiu17] Jinniao Qiu. Weak solution for a class of fully nonlinear stochastic Hamilton-Jacobi-Bellman equations. Stoch. Process. Appl., 127(6):1926–1959, 2017.

[Qiu18] Jinniao Qiu. Viscosity solutions of stochastic Hamilton-Jacobi-Bellman equations. SIAM J. Control Optim., 56(5):3708–3730, 2018.

[QW14] Jinniao Qiu and Wenning Wei. On the quasi-linear reflected backward stochastic partial differential equations. J. Funct. Anal., 267:3598–3656, 2014.
[VZ+19] Frederi Viens and Jianfeng Zhang. A martingale approach for fractional Brownian motions and related path dependent PDEs. The Annals of Applied Probability, 29(6):3489–3540, 2019.

[Zho92] Xun Yu Zhou. A duality analysis on stochastic partial differential equations. J. Funct. Anal., 103(2):275–293, 1992.