Deep Curve-dependent PDEs for affine rough volatility
ANTOINE JACQUIER AND MUGAD OUMGARI
Abstract.
We introduce the notion of rough local stochastic volatility models, extending the classical concept to the case where the volatility is driven by some Volterra process. In this setting, we show that the pricing function is the solution to a path-dependent PDE, for which we develop a numerical scheme based on Deep Learning techniques. Numerical simulations suggest that the latter is extremely efficient, and provides a good alternative to classical Monte Carlo simulations.

1. Introduction
Stochastic models in financial modelling have undergone many transformations since the Black and Scholes model [14], and their most recent revolution, pioneered by Gatheral, Jaisson and Rosenbaum [34], has introduced the concept of rough volatility. In this setting, the instantaneous volatility is the solution of a stochastic differential equation driven by a fractional Brownian motion with small (less than a half) Hurst exponent, synonym of low Hölder regularity of the paths. Not only is this feature consistent with historical time series [34], but it further allows one to capture the notoriously steep at-the-money skew of Equity options, as highlighted in [3, 9, 8, 29, 30]. Since then, a lot of effort has been devoted to advocating this new class of models and showing the full extent of their capabilities, in particular as good estimators for a large class of assets [11], and for consistent pricing of volatility indices [46, 51]. Nothing comes for free though, and the flip side of this new paradigm is the computational cost. With the notable exception of the rough Heston model [1, 26, 27, 25, 24] and its affine extensions [2, 35], the absence of Markovianity of the fractional Brownian motion prevents any pricing tools other than Monte Carlo simulations; the simulation of continuous Gaussian processes, including fractional Brownian motion, is traditionally slow as soon as one steps away from the standard Brownian motion.
Date: June 7, 2019.
2010 Mathematics Subject Classification.
Key words and phrases: rough volatility, Deep learning, Path-dependent PDEs.
The views and opinions expressed here are the authors' and do not represent the opinions of their employers. They are not responsible for any use that may be made of these contents. No part of this presentation is intended to influence investment decisions or promote any product or service. The authors would like to thank Lukasz Szpruch for stimulating discussions.

However, the clear superiority, for estimation and calibration, of these rough volatility models has encouraged deep and fast innovations in numerical methods for pricing, in particular the now standard hybrid scheme [10, 43], as well as Donsker-type theorems [45, 64], numerical approximations [6, 42, 36] and machine learning-based techniques [47, 71].

In fact, industry practice is often entrenched, not in pure stochastic volatility models, but in models enhanced with a local volatility component, à la Dupire [18], thereby ensuring an exact fit to the observed European option price surface. The natural next step for rough volatility models is therefore to include such a component. This is the main theme of the present work, where we provide, for the first time as it seems, a numerical algorithm able to evaluate options, European or path-dependent, in such a rough local stochastic volatility setting. Although this could be performed by simulation, we adopt a different strategy, following the recent development by Viens and Zhang [72], who proved an analogous version of the Feynman-Kac theorem for rough volatility models. The fundamental difference, though, is that the corresponding partial differential equation is now path-dependent. This forces us to revisit classical market completeness results in the setting of rough local stochastic volatility models. Path-dependent partial differential equations (PPDEs) have been studied extensively by Touzi, Zhang and co-authors [22, 21, 20], and we shall draw existence and uniqueness from their works. Despite these advances, though, very little has been developed to solve these PPDEs numerically, with the sole exception of a path-dependent version of Barles and Souganidis' monotone scheme [5] by Zhang and Zhuo [73] and Ren and Tan [66]. The implementation thereof is however far from obvious, and we propose instead a novel algorithm, based on the discretisation of the pricing PPDE, which we then solve using the deep learning technique pioneered by Jentzen [41], which does not suffer the curse of dimensionality.

We note in passing that using machine learning (or deep learning) techniques to solve high-dimensional PDEs has recently been the focus of several approaches. Neural networks have indeed been used to solve PDEs for a long time [57, 58, 59, 60]; more recently, Sirignano and Spiliopoulos [70] proposed an algorithm not depending on a given mesh (as opposed to the previous literature), thus allowing for an easier extension to the multi-dimensional case. During the writing-up of the present paper, a further two ideas, similar in spirit to the one we are borrowing from [41], came to light: Sabate-Vidales, Šiška and Szpruch [69], as well as Huré, Pham and Warin [48], also used the BSDE counterpart of the PDE, albeit in different ways, to apply machine learning techniques. Since our set-up here is not solely about solving a high-dimensional PDE, we shall leave the precise comparison of these different schemes to rest for the moment.

We shall follow a natural progression in this paper: Section 2 introduces the financial modelling setup of the analysis, introducing rough local stochastic volatility models, and proving preliminary results fundamental for their application in quantitative finance.
In Section 3, we show that, in this context, the price of a financial derivative is the solution to a path-dependent PDE, for which we propose a discretisation algorithm, and adapt the Deep Learning methodology developed in [41]. We show the validity and accuracy of this technique in Section 4 for different examples of interest. We gather in appendix some long proofs and reminders in order not to disrupt the flow of the paper.

2. Modelling framework
In order to dive right into the modelling framework and our main setup, we postpone to Appendix A a review of the functional Itô formula for stochastic Volterra systems, as developed by Viens and Zhang [72]. We introduce a rough local stochastic volatility model for the dynamics of a stock price process. Before diving into numerical considerations, we adapt the classical framework of no-arbitrage and market completeness to this setup in order to ensure that pricing and calibration make any sense at all.

2.1. Rough local stochastic volatility model.
We are interested here in local stochastic volatility models, where the volatility is rough, in the sense of [34]. This can be written, under the historical measure, as

(1)  S_t = S_0 + \int_0^t \mu_r S_r \, dr + \int_0^t l(r, S_r, V_r) S_r \, dW_r,
     V_t = V_0 + \int_0^t K(t-r) \big( b(V_r) dr + \xi(V_r) dB_r \big),
     d\langle W, B \rangle_t = \rho \, dt,

where \rho \in [-1, 0] and W and B are two standard Brownian motions. Setting X = (S, V), we can rewrite the system as

X_t = X_0 + \int_0^t b(t, r, X_r) dr + \int_0^t \sigma(t, r, X_r) \cdot d\mathbf{B}_r,

where

b(t,r,X_r) = \begin{pmatrix} \mu_r S_r \\ K(t-r) b(V_r) \end{pmatrix}
and
\sigma(t,r,X_r) = \begin{pmatrix} \bar\rho \, l(r,S_r,V_r) S_r & \rho \, l(r,S_r,V_r) S_r \\ 0 & K(t-r) \xi(V_r) \end{pmatrix},

where \bar\rho := \sqrt{1-\rho^2} and \mathbf{B} = (B^\perp, B), with W := \rho B + \bar\rho B^\perp. The SDE for X represents a stochastic Volterra system which is, in general, not Markovian. We could in principle allow for more generality and assume, following [72], that for any r \in [0, t], the coefficients b and \sigma depend on the past trajectory X_{r \wedge \cdot}, for example to include models with delay. However, such an extension is not needed in the application we are interested in, and we shall not pursue it. We restrict the correlation parameter \rho to be non-positive in order to ensure, with some assumptions on the coefficients below, that the stock price process S is a true martingale. This assumption is not so restrictive in Equity or FX markets, where it is respectively quite negative and close to zero. We assume that the function l(\cdot) is of the form l(t, S, v) = L(t, S) \varsigma(v), where L(\cdot, \cdot) is usually called the leverage function. From the results by Dupire [18] and Gyöngy [40], if one wants to ensure that this model calibrates exactly to European option prices, then the equality

\sigma_L(t,s)^2 = E^Q\big[ l(t, S_t, V_t)^2 \,\big|\, S_t = s \big] = E^Q\big[ L(t,S_t)^2 \varsigma(V_t)^2 \,\big|\, S_t = s \big] = L(t,s)^2 \, E^Q\big[ \varsigma(V_t)^2 \,\big|\, S_t = s \big]

must hold for every t, s \geq 0, where the function \sigma_L, called the local volatility, can, at least in theory, be obtained directly from European option prices; we refer the reader to Section 4.2 for more details, and for some numerical aspects. Here Q denotes the risk-neutral measure. We will show below (Assumption 2.1 and Theorem 2.4) that such a probability measure exists, thereby making the model meaningful for option pricing. The leverage function can then be recovered directly as

(2)  L(t,s) = \frac{\sigma_L(t,s)}{\sqrt{E^Q[\varsigma(V_t)^2 \,|\, S_t = s]}},

as long as the right-hand side makes sense, and Assumption 2.1 ensures this is indeed the case. Note that the term on the right-hand side is a conditional expectation with respect to the stock price, and therefore only depends on t and \{S_t = s\}, no matter whether the variance process is Markovian or not. This class of models has the advantage of ensuring perfect (at least theoretically) calibration to European option prices, while giving flexibility to price other options, in particular path-dependent or exotic options. A full account of such models and the corresponding calibration problem can be found in [39]. A rigorous proof of the existence and uniqueness of (1) is outside the scope of this paper; in fact, even in the classical (non-rough) case, a general answer does not exist yet, and only recently some advances [55, 56] have been made in this direction.

We will develop some numerics for a rough local stochastic volatility model in Section 4.2, but for now leave this concept aside and consider a general function L. We shall further assume the existence of a money market account yielding a risk-free interest rate (r_t)_{t \geq 0}. We shall always work under the following considerations:

Assumption 2.1.
(i) The kernel K \in L^2_{loc}(\mathbb{R}_+ \to \mathbb{R}) admits a resolvent of the first kind, and there exists \gamma \in (0, 2] such that
\int_0^h K(t)^2 dt = O(h^\gamma) and \int_0^T [K(t+h) - K(t)]^2 dt = O(h^\gamma), for every T \geq 0, as h tends to zero;
(ii) the functions b and \xi are linear, of the form b(y) = b_0 + b_1 y and \xi(y) = a_0 + a_1 y;
(iii) for any t \geq 0, the map L(t, \cdot) is bounded away from zero and bounded above by C_L;
(iv) the system (1) admits a unique (weak) solution;
(v) the function \varsigma is strictly positive, bounded away from zero, uniformly Hölder continuous, and \varsigma(y) \leq C_\varsigma (1 + |y|^{p_\varsigma}) for some C_\varsigma > 0 and p_\varsigma > 0.

With L(\cdot, \cdot) \equiv 1, taking \varsigma(v) = \sqrt{v}, we are exactly in the setting of an affine Volterra system (\log(S), V). In fact, this condition alone ensures that the process V is an affine Volterra process, and by [2, Theorem 3.3], Assumption 2.1(i) ensures that the SDE for V admits a continuous weak solution, with \sup_{t \geq 0} E[|V_t|^p] finite for all p \geq 2. This in particular implies [2, Theorem 4.3] that, under suitable integrability conditions,

(3)  E\big[ e^{u V_T} \,\big|\, \mathcal{F}_t \big] = \exp\{ \varphi(T-t) + \psi(T-t) V_t \},

for any 0 \leq t \leq T, where the two functions \varphi and \psi satisfy the system of Riccati equations

\dot\psi(t) = b_1 \psi(t) + \frac{a_1^2}{2} \psi(t)^2 and \dot\varphi(t) = b_0 \psi(t) + \frac{a_0^2}{2} \psi(t)^2,

with boundary conditions \psi(0) = u, \varphi(0) = 0. Assumption 2.1(i) is again borrowed from [2], and we refer the interested reader to this paper for examples of kernels satisfying this condition. Most examples so far in quantitative finance, such as the power-law kernel K(t) \equiv t^{H-1/2} and the Gamma kernel K(t) \equiv t^{H-1/2} e^{-\lambda t}, for H \in (0, 1/2) and \lambda > 0, fit into this framework. Finally, Assumption 2.1(v) allows the representation

(4)  S_t = S_0 \exp\Big\{ \int_0^t \Big( \mu_u - \frac{l(u, S_u, V_u)^2}{2} \Big) du + \int_0^t l(u, S_u, V_u) dW_u \Big\}

to be valid since the stochastic integrand is square integrable, by virtue of

\int_0^t E\big[ l(u, S_u, V_u)^2 \big] du \leq \int_0^t E\big[ L(u, S_u)^2 \varsigma(V_u)^2 \big] du \leq C_\varsigma^2 C_L^2 \int_0^t E\big[ (1 + |V_u|^{p_\varsigma})^2 \big] du.
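To make the variance dynamics in (1) concrete, the following minimal sketch simulates the Volterra process V with a left-point Euler rule, using the linear coefficients of Assumption 2.1(ii) and the power-law kernel. All numerical values (H, b0, b1, a0, a1, V0) are purely illustrative assumptions; the hybrid scheme of [10], recalled later in the paper, sharpens the crude treatment of the kernel singularity near r = t.

```python
import numpy as np

# Left-point Euler sketch of
#   V_t = V_0 + int_0^t K(t-r) ( b(V_r) dr + xi(V_r) dB_r ),
# with b(y) = b0 + b1*y, xi(y) = a0 + a1*y and K(t) = t**(H - 1/2).
H, b0, b1, a0, a1, V0 = 0.1, 0.02, -0.3, 0.3, 0.0, 0.04   # illustrative values
T, n = 1.0, 200
dt = T / n
t = np.linspace(0.0, T, n + 1)
K = lambda s: s ** (H - 0.5)      # power-law kernel, singular at 0

rng = np.random.default_rng(0)
dB = rng.normal(0.0, np.sqrt(dt), n)

V = np.full(n + 1, V0)
for i in range(1, n + 1):
    w = K(t[i] - t[:i])           # kernel weights at left endpoints (t[i] - r >= dt > 0)
    V[i] = V0 + np.sum(w * ((b0 + b1 * V[:i]) * dt + (a0 + a1 * V[:i]) * dB[:i]))

print(V[-1])
```

The left-point rule keeps every kernel evaluation at distance at least dt from the singularity, at the cost of a bias that grows as H decreases.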
Remark 2.2.
The assumption on the leverage function L may look restrictive at first, especially since, when inferring it from market data through (2), it may not be bounded; but it is customary to truncate it for large values of the underlying, and we follow this convention throughout.

Remark 2.3.
One could consider a slightly different setup to (1), where the kernel does not apply to the whole dynamics of the process V, but only to the diffusion part, in the spirit of diffusions driven by Volterra Gaussian noises [8, 34, 44]. In the simplest case, following the seminal paper by Comte and Renault [15], the variance process is the unique strong solution to the Volterra stochastic equation

(5)  V_t = V_0 e^{bt} + a \int_0^t e^{b(t-u)} dB^H_u,

where B^H is a fractional Brownian motion with Hurst exponent H \in (0, 1/2), that is B^H_t = \int_0^t K(t-u) dB_u for some standard Brownian motion B generating the same filtration [16] as B^H. The process V in (5) is a continuous Gaussian process with finite variance, and Fernique's estimate [28] yields that E\big[ \exp\big( \alpha \sup_{t \in [0,T]} |V_t|^p \big) \big] is finite for any \alpha, T > 0 and p < 2, so that for any \alpha > 0,

E\Big[ \exp\Big\{ \alpha \int_0^t l(u, S_u, V_u)^2 du \Big\} \Big] \leq E\Big[ \exp\Big\{ \alpha C_\varsigma^2 C_L^2 \int_0^t (1 + |V_u|^{p_\varsigma})^2 du \Big\} \Big] \leq E\Big[ \exp\Big\{ \alpha C_\varsigma^2 C_L^2 \, t \Big( 1 + \sup_{u \in [0,t]} |V_u|^{p_\varsigma} \Big)^2 \Big\} \Big]

is finite and (4) holds.

2.2. Market completeness and arbitrage freeness.
In the general case of non-zero correlation, the presence of the two noises renders the market incomplete. In order to be able to use the system (1) for pricing purposes, we need to show how to complete the market, and to check for the existence of some probability measure under which the stock price is a true martingale. The latter issue was solved for the rough Heston model by El Euch and Rosenbaum [25], while the rough Bergomi case with non-positive correlation was recently proved by Gassiat [31]. We show that this still holds in our framework.
Theorem 2.4.
The market is incomplete and free of arbitrage.

Proof. We wrote (1) under the historical measure P, but for pricing purposes, we need it under the pricing measure Q. Introduce the market prices of risk as adapted processes (\lambda_t)_{t \geq 0} and (\beta_t)_{t \geq 0} such that

(6)  \rho \lambda_t + \bar\rho \beta_t = \frac{\mu_t - r_t}{l(t, S_t, V_t)},

which is well defined since the denominator is strictly positive by Assumption 2.1(iii)-(v). We now define the probability measure Q via its Radon-Nikodym derivative

\frac{dQ}{dP}\Big|_{\mathcal{F}_t} := \exp\Big\{ -\frac{1}{2} \int_0^t \big( \lambda_s^2 + \beta_s^2 \big) ds - \int_0^t \big( \lambda_s dB_s + \beta_s dB^\perp_s \big) \Big\},

so that Girsanov's Theorem [54, Chapter 3.5] implies that the processes B^Q and B^{Q,\perp} defined by

B^Q_t := B_t + \int_0^t \lambda_u du and B^{Q,\perp}_t := B^\perp_t + \int_0^t \beta_u du

are orthogonal Brownian motions under Q. Therefore, under Q, the stock price satisfies

(7)  S_t = S_0 + \int_0^t r_u S_u du + \int_0^t l(u, S_u, V_u) S_u \big( \rho \, dB^Q_u + \bar\rho \, dB^{Q,\perp}_u \big),
     V_t = V_0 + \int_0^t K(t-u) \big( \widehat{b}(V_u) du + \xi(V_u) dB^Q_u \big),

where, under Q, the drift of the variance process is now of the form \widehat{b}(V_t) = b(V_t) - \lambda_t \xi(V_t). Introducing the Brownian motion W^Q := \rho B^Q + \bar\rho B^{Q,\perp} (under Q), the first SDE then reads

S_t = S_0 + \int_0^t r_u S_u du + \int_0^t l(u, S_u, V_u) S_u \, dW^Q_u.

The theorem then follows directly from Novikov's criterion, thanks to the assumptions on the function l(\cdot) in Assumption 2.1(iii)-(v). \square

Remark 2.5.
Following [31], we could in fact partially relax the assumption on the function l(\cdot). Assume for example that l(t, S, v) = L(t, S) \varsigma(v), but with Assumption 2.1 replaced by \varsigma(v) = \exp(\eta v), for some \eta > 0. Then, as long as \rho \leq 0, the proof of [31, Theorem 1], using a localisation argument with the increasing sequence of stopping times \tau_n := \inf\{t > 0 : V_t = n\}, remains the same, and the stock price is a true martingale. This in particular means that a local volatility version of the rough Bergomi model [8, 51, 52] falls within our current setup.

3. Pricing via Deep PPDEs
We shall from now on only work under the risk-neutral measure and, with a slight abuse of notation, write b instead of \widehat{b} for the drift of the variance process (equivalently taking the market price of volatility risk to be null in Theorem 2.4). In the classical Itô diffusion setting, where the kernel K is constant, the Feynman-Kac formula transforms the pricing problem from a probabilistic setting to a PDE formulation. The formulation of our rough local stochastic volatility model (1) goes beyond this scope since the system is not Markovian any longer. We adapt here the results developed by Viens and Zhang [72] to show that the pricing problem is equivalent to solving a path-dependent PDE.

3.1. Pricing via path-dependent PDEs. We first start with the following lemma, which shows, not surprisingly, that the option price should be viewed, not as a function of the state variable at a fixed given time, but as a functional over paths. The argument is not new, and versions have already appeared for the rough Heston model in [25] and in the context of Itô's formula for stochastic Volterra equations in [72]. Consider a European option with maturity T and payoff function g(\cdot) written on the stock price S. By standard no-arbitrage arguments, from Theorem 2.4, its price at time t \in [0, T] reads

(8)  P_t = E[ g(S_T) \,|\, \mathcal{F}_t ].

Lemma 3.1.
There exists a function P : [0, T] \times [0, \infty) \times C([0, T] \to \mathbb{R}) \to \mathbb{R} such that, for any t \in [0, T],

P_t = P\big( t, S_t, (\Theta^t_u)_{t \leq u \leq T} \big),

where \Theta^t is \mathcal{F}_t-measurable and is given, for any u \in [t, T], by

(9)  \Theta^t_u := E\Big[ V_u - \int_t^u K(u-r) b(V_r) dr \,\Big|\, \mathcal{F}_t \Big].

Proof. For any t \in [0, T] and u \in [t, T], we can write, from (1),

V_u = V_0 + \int_0^u K(u-r) \big( b(V_r) dr + \xi(V_r) dB_r \big)
    = \underbrace{V_0 + \int_0^t K(u-r) \big( b(V_r) dr + \xi(V_r) dB_r \big)}_{\Theta^t_u} + \underbrace{\int_t^u K(u-r) \big( b(V_r) dr + \xi(V_r) dB_r \big)}_{I^t_u},

so that \Theta^t is clearly \mathcal{F}_t-measurable and satisfies (9). We can therefore write the option price as

E[ g(S_T) \,|\, \mathcal{F}_t ] = E\Big[ g\Big( S_t + \int_t^T r_u S_u du + \int_t^T l\big( u, S_u, \Theta^t_u + I^t_u \big) S_u \, dW_u \Big) \,\Big|\, \mathcal{F}_t \Big],

and the lemma follows since I^t_u is not \mathcal{F}_t-measurable. \square

For any t \in [0, T], the curve \Theta^t (or equivalently the infinite-dimensional process (\Theta^t)_{t \in [0,T]}) plays a fundamental role. It does not exactly represent the forward variance curve, which is used in [25], and we refer to [72] for the precise correspondence between the two. Note that \Theta^t_u = V_u for u \in [0, t], in particular \Theta^t_t = V_t. The following theorem is the main result here (proved in Appendix B), and shows how to extend the classical Feynman-Kac formula to the path-dependent case. From now on, we shall adopt the notation

(10)  K^t := K(\cdot - t), for any t \geq 0,

to denote the kernel curve seen from time t. It will become clear in the following theorem how this becomes handy.
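The decomposition in the proof is directly computable on a simulated path: the following sketch evaluates the \mathcal{F}_t-measurable part \Theta^t_u = V_0 + \int_0^t K(u-r)(b(V_r)dr + \xi(V_r)dB_r) with a left-point rule, under the same illustrative parameter assumptions as before.

```python
import numpy as np

# Left-point sketch of Theta^t_u from one simulated path of V; all parameter
# values are illustrative.
H, b0, b1, a0, a1, V0 = 0.1, 0.02, -0.3, 0.3, 0.0, 0.04
T, n = 1.0, 100
dt = T / n
grid = np.linspace(0.0, T, n + 1)
K = lambda s: s ** (H - 0.5)

rng = np.random.default_rng(1)
dB = rng.normal(0.0, np.sqrt(dt), n)

# crude Euler path for V, with the same left-point rule
V = np.full(n + 1, V0)
for i in range(1, n + 1):
    w = K(grid[i] - grid[:i])
    V[i] = V0 + np.sum(w * ((b0 + b1 * V[:i]) * dt + (a0 + a1 * V[:i]) * dB[:i]))

def theta(i, j):
    """Approximation of Theta^{t_i}_{t_j}, j >= i: F_{t_i}-measurable part of V_{t_j}."""
    w = K(grid[j] - grid[:i])   # only uses the path up to time t_i
    return V0 + np.sum(w * ((b0 + b1 * V[:i]) * dt + (a0 + a1 * V[:i]) * dB[:i]))

# In this discretisation, Theta^t_t coincides exactly with V_t
i = n // 2
print(theta(i, i) - V[i])
```

The identity \Theta^t_t = V_t noted below the lemma holds exactly in this scheme, since both quantities sum the same kernel-weighted increments.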
Theorem 3.2.
The option price (8) is the unique solution to the linear path-dependent PDE

(11)  \big( \partial_t + \mathcal{L}^x + \mathcal{L}^{xx} + \mathcal{L}^{x\omega} + \mathcal{L}^{\omega} + \mathcal{L}^{\omega\omega} - r_t \big) P(t, S_t, \Theta^t) = 0, for t \in [0, T),

with boundary condition P(T, S_T, \Theta^T) = g(S_T), where the differential operators are defined as

\mathcal{L}^{x\omega} := \rho \, l(t, x, \Theta^t_t) \, \xi(\Theta^t_t) \, x \, \big\langle \partial^2_{x\omega} \cdot, K^t \big\rangle,
\mathcal{L}^{xx} := \frac{1}{2} l(t, x, \Theta^t_t)^2 x^2 \partial^2_{xx},
\mathcal{L}^{x} := r_t x \partial_x,
\mathcal{L}^{\omega\omega} := \frac{1}{2} \xi(\Theta^t_t)^2 \big\langle \partial^2_{\omega\omega} \cdot, (K^t, K^t) \big\rangle,
\mathcal{L}^{\omega} := b(\Theta^t_t) \big\langle \partial_\omega \cdot, K^t \big\rangle.

The pricing PDE (11) has both state-dependent terms, involving derivatives with respect to x, and path-dependent ones, involving functional derivatives with respect to \omega. In order to streamline the presentation, we defer to Appendix A the precise framework, borrowed from [72], to define these derivatives. Without essential loss of understanding for the rest of the analysis, the reader can view them basically as Gâteaux derivatives (with some regularisation due to the singularity of the kernel along the diagonal). Path-dependent PDEs have been studied in great detail, and we refer the interested reader to [22, 21, 20]. However, numerical schemes for such equations are rather scarce, and the only approach we are aware of is the extension of Barles and Souganidis' monotone scheme [5] to the path-dependent case by Zhang and Zhuo [73], the convergence of which was proved by Ren and Tan [66]. However, the actual implementation of this scheme in the PPDE context is far from trivial, and we consider a different route here, more amenable to computations in our opinion. We first discretise the PPDE along some basis of functions, reducing the infinite-dimensional problem to a finite-, yet high-, dimensional problem. High-dimensional PDEs suffer from the so-called curse of dimensionality, and are notoriously difficult to solve. We then adopt the deep learning approach recently developed by Han and Jentzen [41] to solve this system of PDEs.

3.2. Discretisation of the PPDE.
For each t \in [0, T], we consider a basis (\psi^t_i)_{i=1,\dots,n} of càdlàg functions, for some fixed integer n, and use it to approximate \Theta^t and K^t by

\widehat{\Theta}^t := \sum_{i=1}^n \theta^t_i \psi^t_i and \widehat{K}^t := \sum_{i=1}^n \kappa^t_i \psi^t_i,

for some sequences of real coefficients \theta^t := (\theta^t_i)_{i=1}^n and \kappa^t := (\kappa^t_i)_{i=1}^n. Since K^t \in \mathcal{D}^t, the space of càdlàg functions on [t, T] (see Appendix A.2), then, from Definition A.4,

\big\langle \partial_\omega P(t, x, \Theta^t), K^t \big\rangle := \partial_\varepsilon P\big( t, x, \Theta^t + \varepsilon K^t \mathbb{1}_{[t,T]} \big)\big|_{\varepsilon = 0} = \partial_\varepsilon P\big( t, x, \Theta^t + \varepsilon K^t \big)\big|_{\varepsilon = 0},

and we can introduce the following approximations of the path derivatives along the direction K^t:

\big\langle \partial_\omega P( t, x, \widehat{\Theta}^t ), \widehat{K}^t \big\rangle := \partial_\varepsilon P\big( t, x, \widehat{\Theta}^t + \varepsilon \widehat{K}^t \big)\big|_{\varepsilon = 0} = \partial_\varepsilon P\Big( t, x, \sum_{i=1}^n \big( \theta^t_i + \varepsilon \kappa^t_i \big) \psi^t_i \Big)\Big|_{\varepsilon = 0} =: \partial_\varepsilon \widehat{P}\big( t, x, (\theta^t_i + \varepsilon \kappa^t_i)_{i=1}^n \big)\big|_{\varepsilon = 0} = \sum_{i=1}^n \partial_{\theta^t_i} \widehat{P}\big( t, x, \theta^t \big) \kappa^t_i = \nabla_{\theta^t} \widehat{P}\big( t, x, \theta^t \big) \cdot \kappa^t,

where the function \widehat{P} now acts on [0, T] \times [0, \infty) \times \mathbb{R}^n. Likewise, for the second functional derivative, we have the approximation

\big\langle \partial^2_{\omega\omega} \widehat{P}( t, x, \widehat{\Theta}^t ), ( \widehat{K}^t, \widehat{K}^t ) \big\rangle = \sum_{i,j=1}^n \partial^2_{\theta^t_i \theta^t_j} \widehat{P}\big( t, x, \theta^t \big) \kappa^t_i \kappa^t_j = \big( \kappa^t \big)^\top \cdot \Delta_{\theta^t} \widehat{P}\big( t, x, \theta^t \big) \cdot \kappa^t,

and finally the cross derivatives can be approximated similarly as

\big\langle \partial^2_{x\omega} \widehat{P}( t, x, \widehat{\Theta}^t ), \widehat{K}^t \big\rangle := \partial_x \nabla_{\theta^t} \widehat{P}\big( t, x, \theta^t \big) \cdot \kappa^t.
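The reduction of the path derivative to the finite-dimensional gradient \nabla_{\theta^t} \widehat{P} \cdot \kappa^t can be sanity-checked numerically on a toy smooth functional \widehat{P} (a quadratic form here, purely illustrative; nothing in this block comes from the paper's model):

```python
import numpy as np

# Check that the directional (Gateaux) derivative along kappa equals the
# gradient-dot-kappa formula, for a smooth toy functional P_hat.
n = 5
rng = np.random.default_rng(2)
theta = rng.normal(size=n)     # stands in for the basis coefficients theta^t
kappa = rng.normal(size=n)     # stands in for the kernel coefficients kappa^t

A = rng.normal(size=(n, n))
P_hat = lambda th: float(th @ A @ th)   # illustrative smooth functional
grad = (A + A.T) @ theta                # its exact gradient

eps = 1e-6
fd = (P_hat(theta + eps * kappa) - P_hat(theta - eps * kappa)) / (2 * eps)
print(fd - grad @ kappa)
```

For a quadratic functional the central difference is exact up to rounding, so the two quantities agree to high precision.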
The PPDE (11) therefore becomes

(12)  \Big( \partial_t + \mathcal{L}^x + \mathcal{L}^{xx} + \sum_{i=1}^n \mathcal{L}^{x\theta^t_i} + \sum_{i=1}^n \mathcal{L}^{\theta^t_i} + \sum_{i,j=1}^n \mathcal{L}^{\theta^t_i \theta^t_j} - r_t \Big) \widehat{P} = 0,

where the differential operators are defined, for each i, j = 1, \dots, n, as

\mathcal{L}^{x\theta^t_i} := \rho \, l(t, x, \theta^t) \, \xi(\theta^t) \, x \, \kappa^t_i \, \partial^2_{x\theta^t_i},
\mathcal{L}^{xx} := \frac{1}{2} l(t, x, \theta^t)^2 x^2 \partial^2_{xx},
\mathcal{L}^{x} := r_t x \partial_x,
\mathcal{L}^{\theta^t_i \theta^t_j} := \frac{1}{2} \xi(\theta^t)^2 \kappa^t_i \kappa^t_j \partial^2_{\theta^t_i \theta^t_j},
\mathcal{L}^{\theta^t_i} := b(\theta^t) \kappa^t_i \partial_{\theta^t_i}.

We can rewrite this system in a more concise way as

(13)  \partial_t \widehat{P} + \frac{1}{2} \mathrm{Tr}\big( \Sigma \cdot \Sigma^\top \cdot \Delta \widehat{P} \big) + \mu \cdot \nabla \widehat{P} - r_t \widehat{P} = 0,

where \mu(t, x, \theta^t) := \big( x r_t, b(\theta^t) \kappa^t_1, \dots, b(\theta^t) \kappa^t_n \big)^\top and

\Sigma(t, x, \theta^t) \cdot \Sigma(t, x, \theta^t)^\top :=
\begin{pmatrix}
l(t,x,\theta^t)^2 x^2 & \rho \, l(t,x,\theta^t) x \, \xi(\theta^t) \kappa^t_1 & \cdots & \rho \, l(t,x,\theta^t) x \, \xi(\theta^t) \kappa^t_n \\
\rho \, l(t,x,\theta^t) x \, \xi(\theta^t) \kappa^t_1 & \xi(\theta^t)^2 (\kappa^t_1)^2 & \cdots & \xi(\theta^t)^2 \kappa^t_1 \kappa^t_n \\
\vdots & \vdots & \ddots & \vdots \\
\rho \, l(t,x,\theta^t) x \, \xi(\theta^t) \kappa^t_n & \xi(\theta^t)^2 \kappa^t_1 \kappa^t_n & \cdots & \xi(\theta^t)^2 (\kappa^t_n)^2
\end{pmatrix}.
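The matrix \Sigma \Sigma^\top of (13) is the Gram matrix of a rank-two factor, since only the two Brownian motions B and B^\perp drive the (n+1)-dimensional state (x, \theta^t). A minimal sketch, with all numerical inputs (\rho, x, l, \xi, b, r_t, \kappa) purely illustrative:

```python
import numpy as np

# Assemble the drift mu and the covariance Sigma.Sigma^T of (13) from its
# two-column factor of Brownian loadings, and verify the rank-two structure.
n = 4
rho = -0.7
rho_bar = np.sqrt(1.0 - rho ** 2)
x, l, xi, b, r_t = 100.0, 0.2, 0.3, 0.02, 0.01
kappa = np.array([1.3, 0.9, 0.7, 0.6])       # kappa^t_i = K(t_i - t), illustrative

mu = np.concatenate(([x * r_t], b * kappa))  # drift vector of (13)

# Loadings on (B, B_perp): the stock is driven by W = rho*B + rho_bar*B_perp,
# each theta coordinate by B alone, scaled by xi * kappa_i.
F = np.zeros((n + 1, 2))
F[0] = [rho * l * x, rho_bar * l * x]
F[1:, 0] = xi * kappa
cov = F @ F.T                                # candidate Sigma . Sigma^T

print(np.linalg.matrix_rank(cov), mu.shape)
```

The diagonal and off-diagonal entries of `cov` reproduce the displayed matrix entry by entry, and its rank is two regardless of n, which is the degeneracy the deep BSDE scheme below has to cope with.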
Remark 3.3. The simplest example is to consider piecewise constant curves \psi^t_i = \mathbb{1}_{\delta^t_i}, for i = 1, \dots, n, where (\delta^t_i)_{i=1}^n represents a mesh of the interval [t, T], say of the form \delta^t_i = [t_{i-1}, t_i). In that case, we can consider \theta^t_i = \Theta^t_{t_i} and \kappa^t_i = K(t_i - t).

There is an interesting connection between the functional Itô formula in Theorem A.6 and backward stochastic differential equations. Following [23], consider the multidimensional BSDE

(14)  X_t = \xi + \int_0^t \mu(r, X_r) dr + \int_0^t \Sigma(r, X_r) dW_r,
      Y_t = g(X_T) + \int_t^T f(r, X_r, Y_r, Z_r) dr - \int_t^T Z_r^\top \cdot dW_r,

on some filtered probability space (\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \geq 0}, P) supporting a d-dimensional Brownian motion W. The solution process (X, Y, Z) takes values, at each point in time, in \mathbb{R}^d \times \mathbb{R} \times \mathbb{R}^d. Consider further the PDE

(15)  \partial_t u(t,x) + \frac{1}{2} \mathrm{Tr}\big( \Sigma(t,x) \Sigma(t,x)^\top \Delta u(t,x) \big) + \mu(t,x) \cdot \nabla u(t,x) + f\big( t, x, u(t,x), \Sigma(t,x)^\top \nabla u(t,x) \big) = 0,

with terminal boundary condition u(T, x) = g(x). In the regular (Assumption A.3(i)) case, as shown in [72], assuming that (15) has a classical solution u(\cdot) in C^{1,2}(\Lambda) (see Appendix A.2), then the couple (Y, Z) defined as Y_t := u(t, X_t) and Z_t := \Sigma(t, X_t)^\top \cdot \nabla u(t, X_t) is the solution to (14). In light of this result, and following up on the main idea in [41], if the solution to the option pricing problem satisfies (13), it also solves the BSDE (14), namely

(16)  \widehat{P}\big( t, S_t, \theta^t \big) = \widehat{P}\big( T, S_T, \theta^T \big) - \int_t^T r_u \widehat{P}( u, S_u, \theta^u ) du - \int_t^T \Sigma( u, S_u, \theta^u )^\top \cdot \nabla \widehat{P}( u, S_u, \theta^u ) dW_u,

or, written in forward form,

(17)  \widehat{P}\big( t, S_t, \theta^t \big) = \widehat{P}\big( 0, S_0, \theta^0 \big) + \int_0^t r_u \widehat{P}( u, S_u, \theta^u ) du + \int_0^t \Sigma( u, S_u, \theta^u )^\top \cdot \nabla \widehat{P}( u, S_u, \theta^u ) dW_u.
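In the Markovian special case, the correspondence between (14) and (15) can be checked by hand: with \mu(t,x) = r x, \Sigma(t,x) = \sigma x and f(t,x,y,z) = -r y, equation (15) is the Black-Scholes PDE, which the closed-form call price satisfies. A small finite-difference sanity check, with illustrative parameters:

```python
import math

# Verify that the Black-Scholes call price solves (15) in the Markovian case,
# i.e. u_t + (1/2) sigma^2 x^2 u_xx + r x u_x - r u = 0.
r, sigma, K_strike, T = 0.05, 0.2, 100.0, 1.0
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))

def u(t, x):
    tau = T - t
    d1 = (math.log(x / K_strike) + (r + sigma ** 2 / 2) * tau) / (sigma * math.sqrt(tau))
    return x * Phi(d1) - K_strike * math.exp(-r * tau) * Phi(d1 - sigma * math.sqrt(tau))

t, x, h = 0.5, 95.0, 1e-3
u_t = (u(t + h, x) - u(t - h, x)) / (2 * h)
u_x = (u(t, x + h) - u(t, x - h)) / (2 * h)
u_xx = (u(t, x + h) - 2 * u(t, x) + u(t, x - h)) / h ** 2
residual = u_t + 0.5 * sigma ** 2 * x ** 2 * u_xx + r * x * u_x - r * u(t, x)
print(abs(residual))
```

The residual vanishes up to finite-difference error, which is the structure the forward representation (17) exploits in the path-dependent setting.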
3.3. Neural network structure. We now introduce the neural network structure that will help us solve the high-dimensional pricing problem above. We concentrate on the setting in Remark 3.3.

3.3.1. Simulation of the network inputs.
Recall from (1) that the volatility process is given by

V_t = V_0 + \int_0^t K(t-r) \big( b(V_r) dr + \xi(V_r) dB_r \big),

for any t \geq 0, and, for any t \in [0, T], the \mathcal{F}_t-measurable random variable \Theta^t introduced in (9) is given by

\Theta^t_u = V_0 + \int_0^t K(u-r) \big( b(V_r) dr + \xi(V_r) dB_r \big), for u \geq t.
We discretise both time directions (pricing time and maturity) along the same square grid (t_i = iT/n)_{0 \leq i \leq n}, with \Delta_i := t_i - t_{i-1}. The matrix (\theta^{t_i}_{t_j})_{0 \leq i, j \leq n} is given by

(18)  \theta^{t_i}_{t_j} = V_{t_j}, if 0 \leq j \leq i,
      \theta^{t_i}_{t_j} = \theta^{t_{i-1}}_{t_j} + \int_{t_{i-1}}^{t_i} K(t_j - u) \big( b(V_u) du + \xi(V_u) dB_u \big), if j > i.

Discretising the stochastic integral here is not trivial, and we appeal to the hybrid scheme developed in [10]. Over the interval [t_{i-1}, t_i], the stochastic integral can be decomposed as

\int_{t_{i-1}}^{t_i} K(t_j - u) \xi(V_u) dB_u = \int_{t_{i-1}}^{t_j} K(t_j - u) \xi(V_u) dB_u - \int_{t_i}^{t_j} K(t_j - u) \xi(V_u) dB_u.

Now, the second term on the right-hand side reads (we use the hybrid scheme truncated at the first level)

\int_{t_i}^{t_j} K(t_j - u) \xi(V_u) dB_u \approx \sum_{k=1}^{j-i} \xi(V_{t_{j-k}}) \int_{t_{j-k}}^{t_{j-k+1}} K(t_j - u) dB_u = \xi(V_{t_{j-1}}) \underbrace{\int_{t_{j-1}}^{t_j} K(t_j - u) dB_u}_{\widetilde{B}_j} + \sum_{k=2}^{j-i} \xi(V_{t_{j-k}}) K\Big( \frac{b_k}{n} \Big) \big( B_{t_{j-k+1}} - B_{t_{j-k}} \big),

where the term \xi(V_u) is approximated by its left point, and the coefficients (b_k) are chosen appropriately (see also Remark 3.4 below). The couple formed by the plain increment \Delta B_j := B_{t_j} - B_{t_{j-1}} and \widetilde{B}_j is a two-dimensional Gaussian vector, with covariance matrix \Sigma given by

\Sigma_{11} = \frac{1}{n}, \quad \Sigma_{22} = \int_{t_{j-1}}^{t_j} K(t_j - u)^2 du, \quad \Sigma_{12} = \Sigma_{21} = \int_{t_{j-1}}^{t_j} K(t_j - u) du.
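A minimal sketch of the update (18) for the matrix (\theta^{t_i}_{t_j}), using a plain left-point rule in place of the full hybrid scheme of [10], and the same illustrative parameter assumptions as before:

```python
import numpy as np

# Left-point sketch of (18): theta[i, j] ~ Theta^{t_i}_{t_j}, frozen at V_{t_j}
# for j <= i and updated with the kernel-weighted increment for j > i.
H, b0, b1, a0, a1, V0 = 0.1, 0.02, -0.3, 0.3, 0.0, 0.04   # illustrative values
T, n = 1.0, 50
dt = T / n
t = np.linspace(0.0, T, n + 1)
K = lambda s: s ** (H - 0.5)

rng = np.random.default_rng(3)
dB = rng.normal(0.0, np.sqrt(dt), n)

V = np.full(n + 1, V0)
theta = np.full((n + 1, n + 1), V0)        # theta[0, :] = Theta^0 = V_0
for i in range(1, n + 1):
    inc = (b0 + b1 * V[i - 1]) * dt + (a0 + a1 * V[i - 1]) * dB[i - 1]
    for j in range(n + 1):
        if j >= i:
            # K evaluated at the left endpoint: t[j] - t[i-1] >= dt > 0
            theta[i, j] = theta[i - 1, j] + K(t[j] - t[i - 1]) * inc
        else:
            theta[i, j] = V[j]             # frozen past values
    V[i] = theta[i, i]                     # Theta^{t_i}_{t_i} = V_{t_i}
```

The hybrid scheme replaces the crude kernel evaluation on the sub-interval adjacent to the singularity, which matters most for small H.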
Remark 3.4. In the power-law case K(t) = t^{H-1/2} with H \in (0, 1/2), the covariance entries read

\Sigma_{11} = \frac{1}{n}, \quad \Sigma_{22} = \frac{1}{2H n^{2H}}, \quad \Sigma_{12} = \Sigma_{21} = \frac{1}{(H + 1/2) \, n^{H + 1/2}},

and, as shown in [10], the coefficients (b_k) are explicitly computed as

b_k := \Big( \frac{k^{H+1/2} - (k-1)^{H+1/2}}{H + 1/2} \Big)^{1/(H-1/2)}.

For the stock price, starting from S_{t_0} = S_0, we use the discretised explicit form, for i = 1, \dots, n,

S_{t_i} = S_{t_{i-1}} \exp\Big\{ \Big( r_{t_{i-1}} - \frac{l(t_{i-1}, S_{t_{i-1}}, V_{t_{i-1}})^2}{2} \Big) \Delta_i + l(t_{i-1}, S_{t_{i-1}}, V_{t_{i-1}}) \big( \rho \, \Delta B_{t_i} + \bar\rho \, \Delta B^\perp_{t_i} \big) \Big\}.

3.3.2. Euler discretisation scheme.
We are iteratively computing the price of the option at each time step. Both \widehat{P}_0 and (\nabla \widehat{P})_0, the initial price and gradient, will be optimised together with the weights of the network. On the two-dimensional grid (t_i, t_j)_{0 \leq i, j \leq n}, we can discretise the forward stochastic equation (17) as

\widehat{P}\big( t_0, S_{t_0}, \theta^{t_0} \big) = \widehat{P}_0, \quad \nabla \widehat{P}\big( t_0, S_{t_0}, \theta^{t_0} \big) = (\nabla \widehat{P})_0,
\widehat{P}\big( t_{i+1}, S_{t_{i+1}}, \theta^{t_{i+1}} \big) = (1 + r_{t_i} \Delta_i) \widehat{P}\big( t_i, S_{t_i}, \theta^{t_i} \big) + \Sigma\big( t_i, S_{t_i}, \theta^{t_i} \big)^\top \nabla \widehat{P}\big( t_i, S_{t_i}, \theta^{t_i} \big) \Delta W_{t_i},
\widehat{P}\big( t_n, S_{t_n}, \theta^{t_n} \big) = g(S_T),

where W_{t_i} = \big( B^{Q,\perp}_{t_i}, B_{t_i}, \dots, B_{t_i} \big)^\top \in \mathbb{R}^{n+2} and the diffusion matrix reads, for any 1 \leq i \leq n,

\Sigma\big( t_i, S_{t_i}, \theta^{t_i} \big) =
\begin{pmatrix}
\bar\rho \, l(t_i, S_{t_i}, V_{t_i}) & \rho \, l(t_i, S_{t_i}, V_{t_i}) & 0 & \cdots & 0 \\
0 & \xi(V_{t_i}) K(t_1 - t_i) \mathbb{1}_{\{1 > i\}} & 0 & \cdots & 0 \\
\vdots & & \ddots & & \vdots \\
0 & \cdots & & & \xi(V_{t_i}) K(t_n - t_i) \mathbb{1}_{\{n > i\}}
\end{pmatrix} \in \mathcal{M}_{n+2, n+2}(\mathbb{R}).

In particular, for any j = 2, \dots, n+2, the j-th diagonal term reads

\Sigma\big( t_i, S_{t_i}, \theta^{t_i} \big)_{j,j} = \xi(V_{t_i}) K(t_{j-1} - t_i) \mathbb{1}_{\{j-1 > i\}}.

Discretising the backward SDE (16) would instead give

\widehat{P}\big( t_n, S_{t_n}, \theta^{t_n} \big) = g(S_T),
\widehat{P}\big( t_i, S_{t_i}, \theta^{t_i} \big) = \big( 1 - r_{t_i} \Delta_{i+1} \big) \widehat{P}\big( t_{i+1}, S_{t_{i+1}}, \theta^{t_{i+1}} \big) - \Sigma\big( t_i, S_{t_i}, \theta^{t_i} \big)^\top \nabla \widehat{P}\big( t_i, S_{t_i}, \theta^{t_i} \big) \Delta W_{t_i}.

We shall follow here this backward approach as it is more natural for pricing exotic (path-dependent) derivatives, such as Bermudan or American options.

3.4. Sub-network. At each time step, the price as well as the model parameters are known. The only unknown is the term \Sigma( t_i, S_{t_i}, \theta^{t_i} )^\top \nabla \widehat{P}( t_i, S_{t_i}, \theta^{t_i} ) involving the gradient and the diffusion matrix. We therefore use a neural network to infer its value.

[Figure: architecture of the network. At each time t_i, a multilayer sub-network (Layer 1 to Layer L) maps the input data (S_{t_i}, \theta^{t_i}) and the Brownian increments (B_{t_i} - B_{t_{i-1}}, B^\perp_{t_i} - B^\perp_{t_{i-1}}) to the output (\widehat{P}(t_{i+1}, S_{t_{i+1}}, \theta^{t_{i+1}}), \nabla \widehat{P}(t_{i+1}, S_{t_{i+1}}, \theta^{t_{i+1}})), iterating from t = t_0 to t = t_n.]

The input is composed of one input layer with the value of the processes at time t_i. Each hidden layer is computed by multiplying the previous layer by the matrix of weights w and adding a bias \delta. After computing a layer, we apply a batch normalisation by computing the mean m and the standard deviation s of the layer, and by applying the following linear transformation to each element of the layer:

T(x) := \gamma \frac{x - m}{s} + \beta,

where the scale \gamma and the offset \beta are to be calibrated. We also use the ReLU activation function a(x) = x^+ on each element of the layer.

3.5. Optimisation of the algorithm.
3.5. Optimisation of the algorithm.

In the algorithm, $n_l$ denotes the number of layers per sub-network, $n_N$ the number of neurons per layer, $n_B$ the number of batches used to split the samples, $N$ the size of each batch, and 'epoch' denotes the number of times all the samples are fed to the training algorithm. We finally introduce the following loss function, which we aim to minimise:
$$\mathcal L\big(w, \delta, \beta, \gamma, \hat P_0, (\nabla\hat P)_0\big) := \mathbb E\left[\Big| g(S_{t_n}) - \hat P\big((t_i)_{0\le i\le n}, (S_{t_i})_{0\le i\le n}, (\theta_{t_i})_{0\le i\le n}\big)\Big|^2\right],$$
or rather its sample version,
(19) $$\hat{\mathcal L}\big(w, \delta, \beta, \gamma, \hat P_0, (\nabla\hat P)_0\big) := \frac{1}{N}\sum_{k=1}^{N}\Big| g\big(S^k_{t_n}\big) - \hat P\big((t_i)_{0\le i\le n}, (S^k_{t_i})_{0\le i\le n}, (\theta^k_{t_i})_{0\le i\le n}\big)\Big|^2.$$
It represents the variance of the initial price found by backward iteration along each simulated path. Since the initial price is unique and deterministic, minimising this variance is a natural way to compute the initial price. Regarding the optimisation itself, we use the Adaptive Moment Estimation (Adam) method [4] for the first iterations, and then switch to the stochastic gradient method, which has slower but more stable convergence properties. The parameters to calibrate are then:
• the weights $w$ and biases $\delta$ for each layer of each sub-network;
• the offsets $\beta$ and scales $\gamma$ of the batch normalisation;
• the initial price $\hat P_0$;
• the initial gradient $(\nabla\hat P)_0$.
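The two-phase optimisation (Adam for the first iterations, then plain stochastic gradient descent) can be sketched as follows; `loss_grad` abstracts the gradient of the empirical loss (19) with respect to the parameters, and all hyper-parameter values here are illustrative, not the paper's.

```python
import numpy as np

def train(loss_grad, theta, n_adam=100, n_sgd=100, lr=1e-2):
    """Adam phase followed by an SGD phase, as described in the text.
    loss_grad(theta) returns the gradient of the loss at theta."""
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    b1, b2, eps = 0.9, 0.999, 1e-8
    for t in range(1, n_adam + 1):          # Adam phase [4]
        g = loss_grad(theta)
        m = b1 * m + (1 - b1) * g           # first-moment estimate
        v = b2 * v + (1 - b2) * g ** 2      # second-moment estimate
        mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        theta = theta - lr * mhat / (np.sqrt(vhat) + eps)
    for _ in range(n_sgd):                  # SGD phase: slower but more stable
        theta = theta - lr * loss_grad(theta)
    return theta
```

On a simple quadratic loss this recovers the minimiser, which is a convenient check that the switch between the two phases is wired correctly.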
4. Numerical results

4.1. The Black-Scholes case.
We first test the accuracy of the method on a European Call option with strike $K = 100$ and maturity $T = 1$ in the Black-Scholes model $\mathrm d S_t = S_t(r\,\mathrm d t + \sigma\,\mathrm d W_t)$, with $S_0 = 100$, volatility $\sigma = 5\%$ and interest rate $r = 5\%$. To train the model, we use different hyper-parameter configurations and study the impact of each parameter on the loss function (19). The default configuration is the following:
• batch size: 5000;
• 10 neurons per layer;
• activation function for the hidden layers: ReLU;
• activation function for the output layer: none.
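The benchmark prices are given by the Black-Scholes closed-form formula, which can be written compactly as:

```python
from math import log, sqrt, exp, erf

def bs_call(S0, K, T, r, sigma):
    """Black-Scholes price of a European Call option."""
    N = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))   # standard normal CDF
    d1 = (log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S0 * N(d1) - K * exp(-r * T) * N(d2)
```

With the parameters above ($S_0 = K = 100$, $T = 1$, $r = \sigma = 5\%$), this returns approximately 5.28, the mean value reported in Tables 1 and 2 below.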
4.1.1. Impact of the hyper-parameters on the loss function.
We first study the influence of each parameter of the model on the loss function (19), and summarise the results in Figure 1. As a function of the batch size, the loss function (on a logarithmic scale) flattens out rapidly, so that one needs to increase the batch size considerably to improve the accuracy, which is however limited by computing power. One solution is to use mini-batches that accumulate the gradients, and to apply them all at once after a chosen number of mini-batches. This allows one to train the model with a very large batch, but is hugely time-consuming, since the network has to process the mini-batches one by one. For 1000 iterations, it is interesting to note that the loss function reaches a minimum for two hidden layers: fewer layers would prevent the network from approximating the gradient of the price accurately, while more would give too much freedom to the model. As a function of the number of neurons per layer, the loss function decays exponentially fast, and six neurons per layer seem enough for good accuracy. When plotting the loss in terms of the number of time steps, the convexity of the graph suggests that sixty time steps are sufficient. However, since the time steps also determine the number of neural networks needed to infer each gradient (and hence the number of weights to calibrate), we have to limit their number.

Figure 1. Influence of the hyper-parameters on the loss function: (a) loss vs batch size (on a logarithmic scale); (b) loss vs number of hidden layers; (c) loss vs number of neurons per layer; (d) loss vs number of time steps.
4.1.2. Impact of the hyper-parameters on the price accuracy.
To assess the accuracy of the computed price itself, we compute the mean and standard deviation of the prices obtained over the last 100 iterations, the goal being to determine the configuration with the smallest standard deviation. Following the previous section, we consider 50 time steps and two hidden layers with six neurons per layer, and, since the batch sizes we consider are large, we limit ourselves to 200 iterations. Since we integrated gradient accumulation into our solver, we can now take very large batches as input without memory issues. Table 1 shows the influence of the batch size on the price accuracy. Even though the standard deviation decreases as the batch size increases, it does not quite converge, and the batch size does not really affect the accuracy of the convergence. This can be explained by the relatively low number of iterations, and by samples that may not be representative enough (high discrepancy) to yield accurate convergence. An interesting observation is that the number of time steps (Table 2) seems to have relatively little influence on the convergence of the accuracy. The latter could in principle be improved in many ways, but numerical tests seem to disqualify the following approaches:
• using layer normalisation instead of batch normalisation;
• using a dropout rate to randomly drop a percentage of samples at each layer;
• using different activation functions;
• using weight regularisation, by adding an $L^2$-norm penalty on the weights.
On the positive side, we used quasi-random (Sobol) sequences instead of pseudo-random generators, and found that the algorithm then converges faster. Note that, for speed purposes, the Sobol sequences can be pre-computed and stored, then loaded when needed by the network. So far we have always considered the same size for the validation sample and for the training batch.
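The gradient-accumulation and quasi-random ideas can be sketched together. The van der Corput sequence below is a simple one-dimensional stand-in for the Sobol sequences actually used (which, as noted above, would be pre-computed and stored), and `loss_grad` is a hypothetical per-mini-batch gradient.

```python
import numpy as np

def van_der_corput(n, base=2):
    """Low-discrepancy sequence (an illustrative stand-in for Sobol points)."""
    seq = np.zeros(n)
    for i in range(n):
        k, f, x = i + 1, 1.0, 0.0
        while k > 0:
            f /= base
            x += f * (k % base)     # radical-inverse digit expansion
            k //= base
        seq[i] = x
    return seq

def accumulated_gradient(loss_grad, samples, n_mini):
    """Gradient accumulation: split a very large batch into n_mini
    mini-batches, sum their (mean) gradients, and apply them in one step."""
    total = None
    for chunk in np.array_split(samples, n_mini):
        g = loss_grad(chunk)        # gradient on one mini-batch
        total = g if total is None else total + g
    return total / n_mini           # average, as if one large batch
```

This trades memory for time: each mini-batch is processed sequentially, but the resulting update is that of the full batch.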
In the backward method (our main choice, as it allows us to price callable and Bermudan options), the price is computed by minimising its variance through a Monte-Carlo-like method. Hence, increasing the size of the validation sample naturally decreases the variance. We then choose a very large validation sample (10 ) and a relatively smaller training batch size (of order 10 ).

Batch size (10 )           1      2      3      4      5      7      10
Standard deviation (10 - ) 6.34   4.29   3.17   2.54   2.4    2.01   1.47
Mean                       5.28   5.28   5.28   5.28   5.28   5.28   5.28
Absolute error (10 - )     12.3   5.16   3.11   1.82   3.52   1.08   6.47
Relative error (10 - )     2.33   0.97   0.58   0.34   0.66   0.2    1.22
Table 1: Computation results over the last 100 iterations for different batch sizes.

Time steps                 10     20     30     40     50     70     100
Standard deviation (10 - ) 6.34   4.29   3.17   2.54   2.4    2.01   1.47
Mean                       5.28   5.28   5.28   5.28   5.28   5.28   5.28
Absolute error (10 - )     3.08   1.63   2.7    2.2    4.24   2.66   6.5
Relative error (10 - )     3.65   0.65   0.9    2.4    0.61   1.32   1.15
Table 2: Computation results over the last 100 iterations for different numbers of time steps.

4.1.3. Pricing Call options.
We compute the prices of a European Call option with the previous parameters for different strikes, and compare them in Table 3 to the true Black-Scholes values. The errors seem to vary depending on whether the option is in or out of the money. For an option with a larger maturity, but with the same precision, one needs to keep the number of time steps per year constant, and hence to increase the number of time steps, which consequently increases the computation time.

Strike                  80     85     90     95     100    105    110    115    120
DL Price                23.90  19.14  14.39  9.67   5.28   2.05   0.51   0.07   0.01
True price              23.90  19.14  14.39  9.67   5.28   2.05   0.51   0.07   0.01
Absolute error (10 - )  15     12     9.6    3.8    3      3.8    1.1    0.85   0.18
Relative error (10 - )  0.62   0.63   0.67   0.4    0.56   1.8    1.7    11     27
Table 3: Deep Learning vs true prices (via the Black-Scholes formula) and errors.

4.2. Rough stochastic local volatility models.
We now investigate the accuracy of our algorithm in the more interesting case of rough local stochastic volatility models. As mentioned in the introduction, local stochastic volatility models have become the standard on Equity markets, with each bank using its own version. We consider here the model (1) with
$$b(v) = \kappa\,(V_\infty - v), \qquad \xi(v) = \nu, \qquad \varsigma(v) = \mathrm e^{v}, \qquad \mathrm K(t) = \frac{t^{H - 1/2}}{\Gamma\big(H + \frac12\big)},$$
with $\kappa, V_\infty, \nu > 0$ and $H \in \big(0, \frac12\big)$, so that $(V_t)_{t\ge 0}$ is a fractional Ornstein-Uhlenbeck process. We acknowledge that the function $\varsigma(\cdot)$ does not quite satisfy Assumption 2.1(v), but leave this for future theoretical considerations. Note in passing that the true martingale property of the stock price in the rough Bergomi model (at least for non-positive correlation) was recently proved by Gassiat [31], and similar arguments could be used to extend the result to the current setting. The OU process can become negative but, because of the function $\varsigma(\cdot)$, the actual instantaneous volatility of the stock price is guaranteed to remain strictly positive at all times. The local volatility component is more delicate, and we consider the usual approach introduced in Section 2.1. The Dupire [18] local volatility function $\sigma_{\mathrm L}$ can be recast in terms of the implied total variance $w(k, T) := \sigma(k, T)^2\, T$ through the identity [32, Chapter 1]
$$\sigma_{\mathrm L}^2(k, T) = \frac{\partial_T w(k, T)}{g(k, T)},$$
where $k$ represents the log-moneyness and
$$g(k, T) := \left.\left\{\left(1 - \frac{k\,\partial_k w}{2w}\right)^2 - \frac{(\partial_k w)^2}{4}\left(\frac14 + \frac1w\right) + \frac{\partial^2_{kk} w}{2}\right\}\right|_{(k, T)}.$$
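Given any smooth total-variance surface $w$, this identity is straightforward to apply numerically; the central finite-difference scheme below is an illustration of the formula, not the paper's procedure.

```python
import numpy as np

def dupire_local_variance(w, k, T, dk=1e-4, dT=1e-4):
    """Dupire local variance in total-variance form:
    sigma_L^2(k,T) = dw/dT / g(k,T), with
    g = (1 - k w_k/(2w))^2 - (w_k^2/4)(1/4 + 1/w) + w_kk/2.
    w(k, T) is any smooth total implied variance surface; derivatives
    are approximated by central finite differences."""
    w0 = w(k, T)
    wT = (w(k, T + dT) - w(k, T - dT)) / (2 * dT)          # calendar derivative
    wk = (w(k + dk, T) - w(k - dk, T)) / (2 * dk)          # skew derivative
    wkk = (w(k + dk, T) - 2 * w0 + w(k - dk, T)) / dk ** 2  # curvature
    g = (1 - k * wk / (2 * w0)) ** 2 - (wk ** 2 / 4) * (0.25 + 1 / w0) + wkk / 2
    return wT / g
```

On a flat surface $w(k, T) = \sigma^2 T$ the denominator $g$ equals 1 and the local variance reduces to $\sigma^2$, a convenient sanity check.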
This relationship is not trivial to apply in practice though, as the observed implied volatility surface is not smooth and is only observed discretely. We assume here that the surface $w(\cdot, \cdot)$ is given through the SSVI parameterisation developed by Gatheral and Jacquier [33], namely
(20) $$w(k, T) = \frac{\vartheta_T}{2}\left\{1 + \varrho\,\varphi(\vartheta_T)\,k + \sqrt{\big(\varphi(\vartheta_T)\,k + \varrho\big)^2 + 1 - \varrho^2}\right\},$$
where $\varrho \in [-1, 1]$, $(\vartheta_t)_{t\ge 0}$ represents the term structure of the at-the-money total implied variance, and the function $\varphi(\cdot)$ provides a backbone. In particular, the so-called Heston-like parameterisation [33, Example 4.1]
$$\varphi(\vartheta) = \frac{1}{\lambda\vartheta}\left(1 - \frac{1 - \mathrm e^{-\lambda\vartheta}}{\lambda\vartheta}\right),$$
for some strictly positive constant $\lambda$, was shown [33] to be free of static arbitrage, and is therefore suitable as a starting point for market generation. Regarding the neural network, we use the default configuration:
• batch size: 10 ;
• validation batch size: 2 × ;
• 200 iterations (epochs);
• two hidden layers;
• between 100 and 2,600 time steps per year, depending on the maturity;
• between 20 and 520 BSDE time steps per year, depending on the maturity.
We take the following default values for the parameters of the model: S_0 = 100, T = 0. , V_0 = log(0. ), ν = 0. , V_∞ = 0. , κ = 0. , ρ = −0. . At first, we try to recover the Black-Scholes price by setting ρ = 1, ν = 0 and κ = 0. Note that the number of time steps controls the dimension as well as the number of sub-networks, and one needs to decrease it in order to obtain a convenient computation time. We then study (numerically) the influence of each parameter on the implied volatility smile, and summarise the results in Figure 2. We compute the volatility smile for a given configuration and study the impact of each model parameter. We take an extreme case with a very low correlation ρ = −0.95, and very small Hurst parameters H ∈ {0. , 0. }. In order to compute a smile as accurately as possible, we use Call options when the strike is above the initial stock value and Put options when it is below; this way we only deal with out-of-the-money options, which reflect the time value rather than the intrinsic value. In Figure 2(a), we compute the smile for three values of the correlation: negative, null and positive. The correlation affects the symmetry of the smile: we observe a volatility skew as soon as the correlation is different from zero, and the orientation of the skew (more expensive OTM Call or Put options) depends on its sign (respectively positive and negative). Figure 2(b) looks at the influence of the volatility of volatility: when the latter is low, the smile is almost constant, as in the Black-Scholes model, while increasing it affects both the convexity and the level of the smile. The influence of the mean reversion can be seen in Figure 2(c), with H = 0.1. In particular, the mean reversion slightly affects the smile convexity: the higher the mean reversion, the faster the volatility reverts to its long-term mean V_∞ and the smaller the curvature. Finally, the role of the Hurst parameter can be observed in Figure 2(d), with an effect similar to that of the volatility of volatility: smaller values yield higher curvature and a (slightly) higher level.

The computations require a large number of time steps for small maturities. We differentiate the process simulation time steps from the BSDE time steps: indeed, fewer BSDE time steps are needed in order to obtain decent computation times for the BSDE solver. We compare, for T = 0. , 9; 1, the smiles obtained with different numbers of BSDE time steps.
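The SSVI surface (20) with the Heston-like backbone is straightforward to implement; the sketch below follows the two formulas above, and the parameter values used in the test are purely illustrative.

```python
import numpy as np

def phi_heston_like(theta, lam):
    """Heston-like SSVI backbone [33, Example 4.1]:
    phi(theta) = (1 - (1 - exp(-lam*theta))/(lam*theta)) / (lam*theta)."""
    x = lam * theta
    return (1.0 - (1.0 - np.exp(-x)) / x) / x

def ssvi_total_variance(k, theta_T, rho, lam):
    """SSVI total implied variance (20):
    w(k,T) = theta_T/2 * (1 + rho*phi*k + sqrt((phi*k + rho)^2 + 1 - rho^2))."""
    p = phi_heston_like(theta_T, lam)
    return 0.5 * theta_T * (1.0 + rho * p * k
                            + np.sqrt((p * k + rho) ** 2 + 1.0 - rho ** 2))
```

At $k = 0$ the formula returns $w = \vartheta_T$, so the at-the-money total variance is recovered exactly, consistent with the interpretation of $(\vartheta_t)_{t\ge 0}$ above.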
Figure 2. Influence of the rough volatility parameters on the smile, with T = 0. : (a) influence of the correlation ρ; (b) influence of the volatility of volatility ν; (c) influence of the mean reversion κ; (d) influence of the Hurst parameter.

Figure 3. Implied volatility for different numbers of BSDE time steps in the rough LSV model.

4.3. Model calibration.

We now provide a way to calibrate such a rough local stochastic volatility model. We assume that the mean reversion $\kappa$, the Hurst parameter $H$ and the initial volatility $V_0$ are given, while the volatility of volatility $\nu$, the correlation $\rho$ and the long-term variance level $V_\infty$ need to be calibrated. The reason for this choice is practical (the dimension of the optimisation problem is reduced), but can be easily justified: a good proxy for the initial variance $V_0$ is given by the short-term at-the-money smile, and the Hurst parameter can be calibrated from the maturity decay of the at-the-money skew of the implied volatility smile; proper and rigorous explanations, via asymptotic limits and expansions, can be found in [3, 9, 29, 30, 37, 44, 52]. We perform a slice-by-slice calibration, minimising the difference between market smiles and model smiles. However, with the Deep Learning method, this is extremely computer-intensive (several hours), so we instead propose the following calibration method using machine learning techniques, inspired by [47, 63]:
1. Generating the training data. We generate volatility surfaces on the grid (V_∞, ν, ρ) ∈ {0. , 0. , 0. } × {0. , 0. , 0. } × {−0. , −0. , −0. }.
2. Training the neural network. We use the 'sub-network' of Section 3.4 as the network to train. For a given maturity, the smiles generated from the training data are the inputs of the network, and the three parameters are the labels. In principle, the network needs to be fed with several thousands of samples; however, since it takes time to simulate so many smiles, we use a small number of smiles but a large number of epochs (the number of times all the samples are fed to the network).
3. Prediction. Once the weights of the neural network are calibrated, the prediction step consists in feeding the market smile used for the calibration into the network, in order to obtain the corresponding parameters. Note that, once the samples are produced and the network is trained, the calibration itself is almost instantaneous (since it only consists in multiplying the input by weight matrices).
We first train the network with only one smile, associated to a single set of parameters, in order to configure the hyper-parameters. Testing with the same smile as input, we recover the correct parameters when using one layer and 100 epochs (the number of neurons does not seem to impact the precision). We then train the network with 26 smiles generated from the grid above; the smile not used for training is then fed into the network to recover its associated parameters. The absolute precision obtained is around 1%, and improves when the network is trained with more smiles. For this training, we used 1000 epochs and two hidden layers with 100 neurons each. The training and calibration are almost instantaneous.

Appendix A. A review of the functional Itô formula for stochastic Volterra systems
We gather here several results from Viens and Zhang [72], which are key to our analysis.
A.1. Stochastic Volterra systems.
We consider here a stochastic Volterra process of the form
(21) $$X_t = X_0 + \int_0^t b(t, r, X_\cdot)\,\mathrm d r + \int_0^t \sigma(t, r, X_\cdot)\cdot\mathrm d W_r,$$
where $X_0 \in \mathbb R^d$, $W$ is a Brownian motion in $\mathbb R^n$ on a given filtered probability space $(\Omega, \mathcal F, (\mathcal F_t)_{t\ge 0}, \mathbb P)$, and $b: \mathbb R_+ \times \mathbb R_+ \times \mathbb R^d \to \mathbb R^d$ and $\sigma: \mathbb R_+ \times \mathbb R_+ \times \mathbb R^d \to \mathbb R^{d\times n}$ satisfy the following assumptions:

Assumption A.1.
The processes $b$ and $\sigma$ are adapted to $(\mathcal F_t)_{t\ge 0}$, and the derivatives $\partial_t b$ and $\partial_t\sigma$ exist. Furthermore, for $\varphi \in \{b, \sigma, \partial_t b, \partial_t\sigma\}$,
$$|\varphi(t, s, \omega)| \le C\,\big(1 + \|\omega\|_T^a\big)$$
for some $C, a > 0$, where $\|\omega\|_T := \sup_{0\le t\le T}|\omega_t|$ denotes the supremum norm on the interval $[0, T]$. Saying that $\varphi$ is adapted here is equivalent to the fact that it can be written as $\varphi(t, r, X_\cdot) = \varphi(t, r, X_{r\wedge\cdot})$. This assumption ensures that the system (21) is well defined, in the following sense:

Assumption A.2.
The SDE (21) admits a weak solution, and $\mathbb E\big[\sup_{t\in[0,T]}|X_t|^p\big]$ is finite for any $p \ge 1$.

In general $W$ is not observable, and only the process $X$ is (or at least some components thereof; when $X = (S, V)$, as in the present paper, only $S$ is indeed observable). We refer the reader to [12, 13] for precise conditions on the coefficients $b$ and $\sigma$ ensuring weak existence of a solution. The second condition, on the moments, is more technical, and is needed for the functional Itô formula below in Theorem A.6. Most interesting models in the finance literature satisfy these assumptions, and we refer to [72, Appendix] for sufficient conditions ensuring Assumption A.1(ii), in particular for the class of rough affine models developed in [2]. The key differences between (21) and a classical stochastic differential equation are (a) that both the drift and the diffusion coefficient depend on two time variables (thus violating the usual flow property), and (b) that they depend on space not at a single instant, but evaluated along a path. The other classical issue is that the coefficients may blow up, as for the Riemann-Liouville fractional Brownian motion $\int_0^t (t - s)^{H - 1/2}\,\mathrm d W_s$, where the power-law kernel explodes on the diagonal whenever the Hurst exponent $H$ lies in $\big(0, \frac12\big)$. Following the terminology introduced in [72], two cases have to be distinguished:

Assumption A.3.
(i) (Regular case) For any $s \in [0, T]$, $\partial_t b(t, s, \cdot)$ and $\partial_t\sigma(t, s, \cdot)$ exist on $[s, T]$ and, for $\varphi \in \{b, \sigma, \partial_t b, \partial_t\sigma\}$, $|\varphi(t, s, \omega)| \le C\,(1 + \|\omega\|_T^a)$ for some $a, C > 0$.
(ii) (Singular case) Let $\varphi \in \{b, \sigma\}$. For any $s \in [0, T]$, $\partial_t\varphi(t, s, \cdot)$ exists on $(s, T]$, and there exists $h \in \big(0, \frac12\big)$ such that, for some $a, C > 0$,
$$|\varphi(t, s, \omega)| \le C\,\big(1 + \|\omega\|_T^a\big)\,(t - s)^{h - 1/2} \qquad\text{and}\qquad |\partial_t\varphi(t, s, \omega)| \le C\,\big(1 + \|\omega\|_T^a\big)\,(t - s)^{h - 3/2}.$$
The first case mainly deals with the path dependence and the absence of the Markov property, while the second one allows us to treat the presence of two time variables in the kernel, which occurs in fractional models, and in particular in the setting of Section 2 above. For any $0 \le t \le u$, we can decompose (21) as
(22) $$X_u = \underbrace{X_0 + \int_0^t b(u, r, X_{r\wedge\cdot})\,\mathrm d r + \int_0^t \sigma(u, r, X_{r\wedge\cdot})\,\mathrm d W_r}_{\Theta^t_u \,\in\, \mathcal F_t} + \underbrace{\int_t^u b(u, r, X_{r\wedge\cdot})\,\mathrm d r + \int_t^u \sigma(u, r, X_{r\wedge\cdot})\,\mathrm d W_r}_{\mathrm I^t_u \,\notin\, \mathcal F_t}.$$
We further recall (from [72]) the concatenation notation for the paths $X$ and $\Theta^t$ before and after time $t$:
(23) $$\big(X \otimes_t \Theta^t\big)_u := X_u\,\mathbb 1_{\{u < t\}} + \Theta^t_u\,\mathbb 1_{\{u \ge t\}}.$$
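For the singular power-law kernel, a crude left-point Euler scheme for $\int_0^t \mathrm K(t - s)\,\mathrm d W_s$ avoids the diagonal singularity by evaluating $\mathrm K$ only at strictly positive arguments. This is an illustrative sketch under that assumption; more accurate hybrid schemes exist [10].

```python
import numpy as np
from math import gamma

def volterra_euler(H, T, n, n_paths, seed=0):
    """Left-point Euler scheme for X_t = int_0^t K(t-s) dW_s with the
    power-law kernel K(t) = t^{H-1/2} / Gamma(H+1/2), i.e. the
    Riemann-Liouville fractional Brownian motion mentioned in the text."""
    rng = np.random.default_rng(seed)
    dt = T / n
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
    t = np.linspace(0.0, T, n + 1)
    X = np.zeros((n_paths, n + 1))
    for i in range(1, n + 1):
        # kernel evaluated at t_i - t_j for j < i: strictly positive arguments
        K = (t[i] - t[:i]) ** (H - 0.5) / gamma(H + 0.5)
        X[:, i] = dW[:, :i] @ K
    return X
```

For $H = \frac12$ the kernel is constant and equal to one, so the scheme recovers a standard Brownian motion, which provides a simple sanity check.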
A.2. Functional Itô calculus.
For any $t \in [0, T]$, let $\mathbb D_t$ and $\mathcal C_t$ denote respectively the space of càdlàg functions and that of continuous functions on $[t, T]$, and define
$$\overline\Lambda := \big\{(t, \omega) \in [0, T]\times\mathbb D : \omega_{[t, T]} \in \mathcal C_t\big\} \qquad\text{and}\qquad \Lambda := [0, T]\times\mathcal C\big([0, T], \mathbb R^d\big),$$
where $\omega_{[t, T]}$ refers to the truncation of the path $\omega$ to the interval $[t, T]$. We denote by $\mathcal C(\overline\Lambda)$ the space of all functions on $\overline\Lambda$ that are continuous with respect to the distance function $d\big((t, \omega), (t', \omega')\big) := |t - t'| + \|\omega - \omega'\|_T$. For a given $u \in \mathcal C(\overline\Lambda)$, we define its (right) time derivative as
$$\partial_t u(t, \omega) := \lim_{\varepsilon\downarrow 0}\frac{u(t + \varepsilon, \omega) - u(t, \omega)}{\varepsilon}, \qquad\text{for all } (t, \omega) \in \overline\Lambda.$$
Following [72], we can then define first- and second-order spatial derivatives of $u \in \mathcal C(\overline\Lambda)$ as linear and bilinear operators on $\mathcal C_t$:

Definition A.4.
The spatial derivatives of $u \in \mathcal C(\overline\Lambda)$ are defined as Gâteaux derivatives: for any $(t, \omega) \in \overline\Lambda$,
$$\langle\partial_\omega u(t, \omega), \eta\rangle := \lim_{\varepsilon\downarrow 0}\frac{u\big(t, \omega + \varepsilon\eta\mathbb 1_{[t, T]}\big) - u(t, \omega)}{\varepsilon}, \qquad\text{for any } \eta \in \mathcal C_t,$$
$$\big\langle\partial^2_\omega u(t, \omega), (\eta, \zeta)\big\rangle := \lim_{\varepsilon\downarrow 0}\frac{\big\langle\partial_\omega u\big(t, \omega + \varepsilon\eta\mathbb 1_{[t, T]}\big), \zeta\big\rangle - \langle\partial_\omega u(t, \omega), \zeta\rangle}{\varepsilon}, \qquad\text{for any } \eta, \zeta \in \mathcal C_t.$$
This definition of the spatial derivative in the direction $\eta \in \mathcal C_t$ is obviously equivalent to
$$\langle\partial_\omega u(t, \omega), \eta\rangle = \frac{\mathrm d}{\mathrm d\varepsilon}\, u\big(t, \omega + \varepsilon\eta\mathbb 1_{[t, T]}\big)\Big|_{\varepsilon = 0}.$$
It is consistent with that of Dupire [19], as the perturbation acts on the time interval $[t, T]$ but not on $[0, t]$, and the distance function $d(\cdot, \cdot)$ is similar to Dupire's pseudo-distance (see also [67]). We shall further need the following two spaces:
$$\mathcal C^{1,2}(\overline\Lambda) := \big\{u \in \mathcal C(\overline\Lambda) : \varphi \in \mathcal C(\overline\Lambda) \text{ for } \varphi \in \{\partial_t u, \partial_\omega u, \partial^2_\omega u\}\big\},$$
$$\mathcal C^{1,2}_+(\overline\Lambda) := \big\{u \in \mathcal C^{1,2}(\overline\Lambda) : \varphi \text{ has polynomial growth for } \varphi \in \{\partial_t u, \partial_\omega u, \partial^2_\omega u\},$$
$$\qquad\qquad\text{and } \langle\partial^2_\omega u, (\eta, \eta)\rangle \text{ is locally uniformly continuous in } \omega \text{ with polynomial growth}\big\}.$$
The definition of polynomial growth here is as follows:
Definition A.5 (Definition 3.3 in [72]). Let $u \in \mathcal C(\overline\Lambda)$ be such that $\partial_\omega u$ is well defined on $\overline\Lambda$. The functional $\partial_\omega u$ is said to have polynomial growth if
$$|\langle\partial_\omega u(t, \omega), \eta\rangle| \le C\,\big(1 + \|\omega\|_T^\alpha\big)\,\big\|\eta\mathbb 1_{[t, T]}\big\|_T, \qquad\text{for all } (t, \omega) \in \overline\Lambda,\ \eta \in \mathcal C_t,$$
for some $C, \alpha > 0$. It is continuous if $\overline\Lambda \ni (t, \omega) \mapsto \langle\partial_\omega u(t, \omega), \eta\rangle$ is continuous under $d$ for every $\eta \in \mathcal C$.
We now recall the main result of Viens and Zhang [72, Theorems 3.10 and 3.17], extending the Itô formula to the stochastic Volterra framework, in both the regular and the singular case. The issue with the singular case (Assumption A.3(ii)) is that the coefficients $b$ and $\sigma$ no longer belong to $\mathcal C_t$, so that the Gâteaux derivatives in Definition A.4 do not make sense any more and need to be amended in order to develop an Itô formula. We refer the reader to [72, Definition 3.16] for a precise definition of the space $\mathcal C^{1,2}_{+,\beta}(\overline\Lambda)$, where $\beta \in (0, 1)$ intuitively monitors the rate of explosion on the (time) diagonal.
Theorem A.6.
For $t \in [0, T]$, define the process $Z^t := X \otimes_t \Theta^t$, and denote $\varphi^{t,\omega} := \varphi(\cdot, t, \omega)$ for $\varphi \in \{b, \sigma\}$ to emphasise the time dependence of the coefficients. Under Assumption A.1, the following Itô formula holds:
$$\mathrm d u(t, Z^t) = \Big(\partial_t u(t, Z^t) + \big\langle\partial_\omega u(t, Z^t), b^{t,X}\big\rangle + \frac12\big\langle\partial^2_\omega u(t, Z^t), \big(\sigma^{t,X}, \sigma^{t,X}\big)\big\rangle\Big)\,\mathrm d t + \big\langle\partial_\omega u(t, Z^t), \sigma^{t,X}\big\rangle\,\mathrm d W_t,$$
(1) in the regular case (Assumption A.3(i)), whenever $u \in \mathcal C^{1,2}_+(\Lambda)$;
(2) in the singular case (Assumption A.3(ii)), for $u \in \mathcal C^{1,2}_{+,\beta}(\Lambda)$ with $\beta + h - \frac12 > 0$, where the spatial derivatives should be understood in the regularised sense
$$\langle\partial_\omega u(t, \omega), \varphi\rangle := \lim_{\delta\downarrow 0}\big\langle\partial_\omega u(t, \omega), \varphi^\delta\big\rangle \qquad\text{and}\qquad \big\langle\partial^2_\omega u(t, \omega), (\varphi, \varphi)\big\rangle := \lim_{\delta\downarrow 0}\big\langle\partial^2_\omega u(t, \omega), (\varphi^\delta, \varphi^\delta)\big\rangle,$$
with the truncated function $\varphi^\delta(t, s, \omega) := \varphi\big(t \vee (s + \delta), s, \omega\big)$.

In the theorem, we invoked the space $\mathcal C^{1,2}_+(\Lambda)$, which represents the space of functions $u: \Lambda \to \mathbb R$ such that there exists $v \in \mathcal C^{1,2}_+(\overline\Lambda)$ with $v = u$ on $\Lambda$. The derivatives are defined similarly, as restrictions on $\Lambda$.

Appendix B. Proof of Theorem 3.2
Since the path-dependent PDE in the theorem is linear, existence and uniqueness of the solution are well known, and we refer to [20]. We consider a self-financing portfolio $\Pi$ consisting of the derivative $P$ given in (8), some quantity $\Delta$ of stock, and some other derivative $\Psi$; that is, at any time $t \in [0, T]$,
$$\Pi_t = P_t - \Delta_t S_t - \gamma_t \Psi_t.$$
Following [72] (Theorem 3.9 in the regular case and Theorem 3.16 in the singular case), we can write a functional Itô formula for the option price, using Lemma 3.1, under the pricing measure $\mathbb Q$:
(24) $$\begin{aligned}
\mathrm d P_t &= \Big(\partial_t P_t + r_t S_t\,\partial_x P_t + \frac12\, l(t, S_t, V_t)^2\,\partial^2_x P_t + \frac12\,\xi(V_t)^2\big\langle\partial^2_\omega P_t, (\mathrm K_t, \mathrm K_t)\big\rangle + b(V_t)\big\langle\partial_\omega P_t, \mathrm K_t\big\rangle + l(t, S_t, V_t)\,\rho\,\xi(V_t)\big\langle\partial^2_{x,\omega} P_t, \mathrm K_t\big\rangle\Big)\,\mathrm d t\\
&\quad + l(t, S_t, V_t)\,\partial_x P_t\,\big(\rho\,\mathrm d B_t + \overline\rho\,\mathrm d B^\perp_t\big) + \xi(V_t)\big\langle\partial_\omega P_t, \mathrm K_t\big\rangle\,\mathrm d B_t\\
&=: \mathcal A P_t\,\mathrm d t + l(t, S_t, V_t)\,\partial_x P_t\,\mathrm d W_t + \xi(V_t)\big\langle\partial_\omega P_t, \mathrm K_t\big\rangle\,\mathrm d B_t,
\end{aligned}$$
where
$$\mathcal A\psi := \partial_t\psi + r_t S_t\,\partial_x\psi + \frac12\, l(t, S_t, V_t)^2\,\partial^2_x\psi + \frac12\,\xi(V_t)^2\big\langle\partial^2_\omega\psi, (\mathrm K_t, \mathrm K_t)\big\rangle + b(V_t)\big\langle\partial_\omega\psi, \mathrm K_t\big\rangle + l(t, S_t, V_t)\,\rho\,\xi(V_t)\big\langle\partial^2_{x,\omega}\psi, \mathrm K_t\big\rangle.$$
Similarly, for the portfolio $\Pi$, under $\mathbb Q$,
$$\begin{aligned}
\mathrm d\Pi_t &= \mathrm d P_t - \Delta_t\,\mathrm d S_t - \gamma_t\,\mathrm d\Psi_t\\
&= \mathcal A P_t\,\mathrm d t + l(t, S_t, V_t)\,\partial_x P_t\,\mathrm d W_t + \xi(V_t)\big\langle\partial_\omega P_t, \mathrm K_t\big\rangle\,\mathrm d B_t\\
&\quad - \gamma_t\Big(\mathcal A\Psi_t\,\mathrm d t + l(t, S_t, V_t)\,\partial_x\Psi_t\,\mathrm d W_t + \xi(V_t)\big\langle\partial_\omega\Psi_t, \mathrm K_t\big\rangle\,\mathrm d B_t\Big) - \Delta_t\Big(\mu_t S_t\,\mathrm d t + l(t, S_t, V_t)\,\mathrm d W_t\Big).
\end{aligned}$$
The portfolio is risk-free if $\mathrm d\Pi_t = r_t\,\Pi_t\,\mathrm d t$ and the random noise is cancelled, meaning that
(25) $$\big(\mathcal A P_t - \gamma_t\,\mathcal A\Psi_t - \Delta_t\, r_t S_t\big)\,\mathrm d t = r_t\big(P_t - \Delta_t S_t - \gamma_t\Psi_t\big)\,\mathrm d t, \qquad \partial_x P_t - \gamma_t\,\partial_x\Psi_t - \Delta_t = 0, \qquad \langle\partial_\omega P_t, \mathrm K_t\rangle - \gamma_t\,\langle\partial_\omega\Psi_t, \mathrm K_t\rangle = 0,$$
since both functions $l(\cdot, \cdot, \cdot)$ and $\xi(\cdot)$ are nowhere null.
The last two equalities yield
$$\gamma_t = \frac{\langle\partial_\omega P_t, \mathrm K_t\rangle}{\langle\partial_\omega\Psi_t, \mathrm K_t\rangle} \qquad\text{and}\qquad \Delta_t = \partial_x P_t - \frac{\langle\partial_\omega P_t, \mathrm K_t\rangle}{\langle\partial_\omega\Psi_t, \mathrm K_t\rangle}\,\partial_x\Psi_t.$$
We can now rewrite the first equality in (25) as
$$\big(\mathcal A P_t - \gamma_t\,\mathcal A\Psi_t - \Delta_t\, r_t S_t\big)\,\mathrm d t = r_t\big(P_t - \Delta_t S_t - \gamma_t\Psi_t\big)\,\mathrm d t,$$
which is equivalent to
$$\frac{(\mathcal A - r_t)\, P_t}{\langle\partial_\omega P_t, \mathrm K_t\rangle} = \frac{(\mathcal A - r_t)\,\Psi_t}{\langle\partial_\omega\Psi_t, \mathrm K_t\rangle}.$$
The left-hand side is a function of $P$ only, whereas the right-hand side only depends on $\Psi$. Therefore, the only way for this equality to hold is for both sides to be equal to some function $-\hat b$ depending on $S_t$, $\Theta_t$ and $t$, but neither on $P$ nor on $\Psi$. The pricing equation for the price function is therefore
$$(\mathcal A - r_t)\, P_t = -\big\langle\partial_\omega P_t, \mathrm K_t\big\rangle\,\hat b_t.$$
Following similar computations in classical (Markovian) stochastic volatility models, we consider $\hat b_t$ of the form $\hat b_t = b(V_t) - \xi(V_t)\,\lambda_t$, where $\lambda_t$ is called the market price of risk. The final pricing PDE is therefore
$$\Big(\partial_t + r_t S_t\,\partial_x + \frac12\, l(t, S_t, V_t)^2\,\partial^2_x + \frac12\,\xi(V_t)^2\big\langle\partial^2_\omega\,\cdot\,, (\mathrm K_t, \mathrm K_t)\big\rangle + b(V_t)\big\langle\partial_\omega\,\cdot\,, \mathrm K_t\big\rangle + l(t, S_t, V_t)\,\rho\,\xi(V_t)\big\langle\partial^2_{x,\omega}\,\cdot\,, \mathrm K_t\big\rangle - r_t + \big\langle\partial_\omega\,\cdot\,, \mathrm K_t\big\rangle\,\hat b_t\Big)\, P = 0.$$
With a slight abuse of notation, writing $b$ in place of $\hat b$ proves the statement.

Appendix C. Further numerical results
We gather in this appendix further numerical results for the rough local stochastic volatility model of Section 4.2. In each table below, the first two rows correspond to prices generated by Monte Carlo (MC) and by the Deep Learning (DL) algorithm, the third row is the SSVI implied volatility given as input in (20), and the last two rows show the DL and Monte Carlo errors (in implied volatilities).
Maturity: week
Strike         0.95    0.98    1.02    1.05
MC Price       15.09   15.02   14.96   14.94
DL Price       15.12   15.06   14.99   14.91
SSVI           15.08   15.03   14.98   14.93
DL vol error   0.05    0.03    0.02    0.02
MC vol error   0.01    0.01    0.02    0.01
Maturity: month
Strike         0.89    0.93    0.98    1.02    1.07    1.11
MC Price       15.18   15.11   15.05   14.99   14.92   14.85
DL Price       15.29   15.     15.04   14.98   14.90   14.84
SSVI           15.18   15.10   15.03   14.97   14.91   14.85
DL vol error   0.11    0.01    0.01    0.02    0.00    0.01
MC vol error   0.00    0.01    0.02    0.02    0.01    0.01
Maturity: months
Strike         0.77    0.83    0.9     0.97    1.00    1.1     1.17    1.23
MC Price       15.37   15.24   15.16   15.03   14.93   14.85   14.76   14.72
DL Price       15.43   15.28   15.16   15.06   14.95   14.89   14.82   14.74
SSVI           15.41   15.27   15.16   15.05   14.95   14.86   14.78   14.71
DL vol error   0.03    0.01    0.00    0.01    0.01    0.03    0.04    0.03
MC vol error   0.03    0.04    0.01    0.02    0.02    0.02    0.02    0.02
Maturity: year
Strike         0.72    0.83    0.94    1.06    1.17    1.28    1.39    1.5
MC Price       15.49   15.29   15.11   14.96   14.81   14.67   14.55   14.44
DL Price       15.50   15.31   15.12   14.89   14.83   14.71   14.61   14.53
SSVI           15.49   15.27   15.08   14.92   14.79   14.67   14.57   14.48
DL vol error   0.02    0.04    0.04    0.03    0.04    0.04    0.04    0.05
MC vol error   0.00    0.03    0.03    0.04    0.02    0.00    0.02    0.03
Maturity: years
Strike         0.72    0.83    0.94    1.06    1.17    1.28    1.39    1.5
MC Price       15.49   15.26   15.10   14.98   14.84   14.72   14.60   14.51
DL Price       15.51   15.34   15.17   15.02   14.89   14.78   14.72   14.60
SSVI           15.46   15.25   15.08   14.93   14.80   14.69   14.59   14.51
DL vol error   0.05    0.09    0.09    0.09    0.09    0.09    0.13    0.10
MC vol error   0.03    0.01    0.02    0.05    0.04    0.03    0.01    0.01
Maturity: years
Strike         0.72    0.83    0.94    1.06    1.17    1.28    1.39    1.5
MC Price       15.43   15.24   15.09   14.95   14.82   14.71   14.60   14.51
DL Price       15.54   15.35   15.19   15.05   14.94   14.91   14.72   14.64
SSVI           15.43   15.24   15.07   14.93   14.81   14.70   14.61   14.53
DL vol error   0.11    0.11    0.12    0.12    0.13    0.21    0.11    0.11
MC vol error   0.00    0.01    0.01    0.02    0.01    0.00    0.01    0.02
References

[1] E. Abi Jaber. Lifting the Heston model. To appear in Quantitative Finance.
[2] E. Abi Jaber, M. Larsson and S. Pulido. Affine Volterra processes. To appear in Annals of Applied Probability.
[3] E. Alòs, J. León and J. Vives. On the short-time behavior of the implied volatility for jump-diffusion models with stochastic volatility. Finance and Stochastics, (4): 571-589, 2007.
[4] J. Ba and D. Kingma. Adam: a method for stochastic optimization. Proceedings of the International Conference on Learning Representations, May 2015.
[5] G. Barles and P.E. Souganidis. Convergence of approximation schemes for fully nonlinear second-order equations. Asymptotic Analysis, 271-283, 1991.
[6] C. Bayer, C. Ben Hammouda and R. Tempone. Hierarchical adaptive sparse grids for option pricing under the rough Bergomi model. arXiv:1812.08533, 2018.
[7] C. Bayer, P. Friz, P. Gassiat, J. Martin and B. Stemper. A regularity structure for rough volatility. arXiv:1710.07481, 2017.
[8] C. Bayer, P. Friz and J. Gatheral. Pricing under rough volatility. Quantitative Finance, (6): 1-18, 2015.
[9] C. Bayer, P. Friz, A. Gulisashvili, B. Horvath and B. Stemper. Short-time near-the-money skew in rough fractional stochastic volatility models. To appear in Quantitative Finance.
[10] M. Bennedsen, A. Lunde and M.S. Pakkanen. Hybrid scheme for Brownian semistationary processes. Finance and Stochastics, (4): 931-965, 2017.
[11] M. Bennedsen, A. Lunde and M.S. Pakkanen. Decoupling the short- and long-term behavior of stochastic volatility. arXiv:1610.00332, 2017.
[12] M.A. Berger and V.J. Mizel. Volterra equations with Itô integrals, I. Journal of Integral Equations, (3): 187-245, 1980.
[13] M.A. Berger and V.J. Mizel. Volterra equations with Itô integrals, II. Journal of Integral Equations, (4): 319-337, 1980.
[14] F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, (3): 637-654, 1973.
[15] F. Comte and E. Renault. Long memory continuous time models. Journal of Econometrics, (1): 101-149, 1996.
[16] L. Decreusefond and A. Üstünel. Stochastic analysis of the fractional Brownian motion. Potential Analysis, 177-214, 1996.
[17] M. Djehiche and M. Eddahbi. Hedging options in market models modulated by the fractional Brownian motion. Stochastic Analysis and Applications, (5): 753-770, 2001.
[18] B. Dupire. Pricing with a smile. Risk, 1994.
[19] B. Dupire. Functional Itô calculus. SSRN:1435551, 2009.
[20] I. Ekren, N. Touzi and J. Zhang. On viscosity solutions of path-dependent PDEs. Annals of Probability, (1): 204-236, 2014.
[21] I. Ekren, N. Touzi and J. Zhang. Viscosity solutions of fully nonlinear parabolic path-dependent PDEs: Part I. Annals of Probability, (2): 1212-1253, 2016.
[22] I. Ekren, N. Touzi and J. Zhang. Viscosity solutions of fully nonlinear parabolic path-dependent PDEs: Part II. Annals of Probability, (4): 2507-2553, 2016.
[23] N. El Karoui, S. Peng and M.C. Quenez. Backward stochastic differential equations in finance. Mathematical Finance, (1): 1-71, 1997.
[24] O. El Euch and M. Rosenbaum. The characteristic function of rough Heston models. Mathematical Finance, (1): 3-38, 2019.
[25] O. El Euch and M. Rosenbaum. Perfect hedging in rough Heston models. Annals of Applied Probability, (6): 3813-3856, 2018.
[26] O. El Euch, M. Fukasawa and M. Rosenbaum. The microstructural foundations of leverage effect and rough volatility. Finance and Stochastics, (2): 241-280, 2018.
[27] O. El Euch, J. Gatheral and M. Rosenbaum. Roughening Heston. Risk, April 2019.
[28] X. Fernique. Intégrabilité des vecteurs Gaussiens. CRAS Paris, 1698-1699, 1970.
[29] M. Forde and H. Zhang. Asymptotics for rough stochastic volatility models. SIAM Journal on Financial Mathematics, 114-145, 2017.
[30] M. Fukasawa. Asymptotic analysis for stochastic volatility: martingale expansion. Finance and Stochastics, 635-654, 2011.
[31] P. Gassiat. On the martingale property in the rough Bergomi model. arXiv:1811.10935, 2018.
[32] J. Gatheral. The Volatility Surface: A Practitioner's Guide. John Wiley & Sons, 2006.
[33] J. Gatheral and A. Jacquier. Arbitrage-free SVI volatility surfaces. Quantitative Finance, 59-71, 2014.
[34] J. Gatheral, T. Jaisson and M. Rosenbaum. Volatility is rough. Quantitative Finance, (6): 933-949, 2018.
[35] J. Gatheral and M. Keller-Ressel. Affine forward variance models. To appear in Finance and Stochastics, 2019.
[36] J. Gatheral and R. Radoičić. Rational approximation of the rough Heston solution. International Journal of Theoretical and Applied Finance, (3), 2019.
[37] H. Guennoun, A. Jacquier, P. Roome and F. Shi. Asymptotic behaviour of the fractional Heston model. SIAM Journal on Financial Mathematics, (3): 1017-1045, 2018.
[38] A. Gulisashvili. Large deviation principle for Volterra type fractional stochastic volatility models. SIAM Journal on Financial Mathematics, (3): 1102-1136, 2018.
[39] J. Guyon and P. Henry-Labordère. Nonlinear Option Pricing. CRC Press, 2013.
[40] I. Gyöngy. Mimicking the one-dimensional marginal distributions of processes having an Itô differential. Probability Theory and Related Fields, (4): 501-516, 1986.
[41] J. Han and A. Jentzen. Solving high-dimensional partial differential equations using deep learning. To appear in Proceedings of the National Academy of Sciences, 2018.
[42] P. Harms. Strong convergence rates for Markovian representations of fractional Brownian motion. arXiv:1902.01471, 2019.
[43] C. Heinrich, M. Pakkanen and A.E.D. Veraart. Hybrid simulation scheme for volatility modulated moving average fields. To appear in
Mathematics and Computers in Simulation .[44] B. Horvath, A. Jacquier and C. Lacombe. Asymptotic behaviour of randomised fractional volatility models. To appear in
Journal of Applied Probability , 2019.[45] B. Horvath, A. Jacquier and A. Muguruza. Functional central limit theorems for rough volatility arXiv:1711.03078, 2018.[46] B. Horvath, A. Jacquier and P. Tankov. Volatility options in rough volatility models. arXiv:1802.01641, 2018.[47] B. Horvath, A. Muguruza and M. Tomas. Deep learning volatility. SSRN:3322085, 2019.[48] C. Hur´e, H. Pham and X.Warin. Some machine learning schemes for high-dimensional nonlinear PDEs. arXiv:1902.01599,2019.[49] H.E. Hurst. Long-term storage capacity of reservoirs.
Transactions of the American Society of Civil Engineers, (1): 770-799, 1951.
[50] H.E. Hurst, R.P. Black and Y.M. Simaika. Long-term storage: an experimental study. London: Constable, 1965.
[51] A. Jacquier, C. Martini and A. Muguruza. On VIX futures in the rough Bergomi model. Quantitative Finance, (1): 45-61, 2018.
[52] A. Jacquier, M. Pakkanen and H. Stone. Pathwise large deviations for the rough Bergomi model. Journal of Applied Probability, (4): 1078-1092, 2018.
[53] B. Jourdain. Loss of martingality in asset price models with lognormal stochastic volatility. Preprint CERMICS, 2004.
[54] I. Karatzas and S. Shreve. Brownian Motion and Stochastic Calculus. Springer-Verlag, New York, 1988.
[55] B. Jourdain and A. Zhou. Existence of a calibrated regime switching local volatility model and new fake Brownian motions. arXiv:1607.00077, 2016.
[56] D. Lacker, M. Shkolnikov and J. Zhang. Inverting the Markovian projection, with an application to local stochastic volatility models. arXiv:1905.06213, 2019.
[57] I. Lagaris, A. Likas and D. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations.
IEEE Transactions on Neural Networks, (5): 987-1000, 1998.
[58] I. Lagaris, A. Likas and D. Papageorgiou. Neural-network methods for boundary value problems with irregular boundaries. IEEE Transactions on Neural Networks, (5): 1041-1049, 2000.
[59] H. Lee. Neural algorithm for solving differential equations. Journal of Computational Physics: 110-131, 1990.
[60] A. Malek and R. Beidokhti. Numerical solution for high order differential equations using a hybrid neural network-optimization method. Applied Mathematics and Computation, (1): 260-271, 2006.
[61] B. Mandelbrot and J. Van Ness. Fractional Brownian motions, fractional noises and applications. SIAM Review, (4): 422-437, 1968.
[62] R. McCrickerd and M.S. Pakkanen. Turbocharging Monte Carlo pricing for the rough Bergomi model. Quantitative Finance, (11): 1877-1886, 2018.
[63] W.A. McGhee. An artificial neural network representation of the SABR stochastic volatility model. SSRN:3288882, 2018.
[64] P. Parczewski. Donsker-type theorems for correlated geometric fractional Brownian motions and related processes. Electronic Communications in Probability, (55): 1-13, 2017.
[65] J. Picard. Representation formulae for the fractional Brownian motion. Séminaire de Probabilités: 3-70, 2011.
[66] Z. Ren and X. Tan. On the convergence of monotone schemes for path-dependent PDEs. Stochastic Processes and their Applications, (6): 1738-1762, 2017.
[67] Z. Ren, N. Touzi and J. Zhang. An overview of viscosity solutions of path-dependent PDEs. Stochastic Analysis and Applications, Springer Proceedings in Mathematics and Statistics: 397-454, 2014.
[68] M. Romano and N. Touzi. Contingent claims and market completeness in a stochastic volatility model. Mathematical Finance, (4): 399-412, 1997.
[69] M. Sabate Vidales, D. Šiška and L. Szpruch. Martingale functional control variates via deep learning. arXiv:1810.05094, 2018.
[70] J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics: 1339-1364, 2018.
[71] H. Stone. Calibrating rough volatility models: a convolutional neural network approach. arXiv:1812.05315, 2018.
[72] F. Viens and J. Zhang. A martingale approach for fractional Brownian motions and related path-dependent PDEs. arXiv:1712.03637, 2018.
[73] J. Zhang and J. Zhuo. Monotone schemes for fully nonlinear parabolic path dependent PDEs. Journal of Financial Engineering, 2014.

Department of Mathematics, Imperial College London, and Alan Turing Institute
E-mail address: [email protected]

Lloyds Banking Group plc, Commercial Banking, 10 Gresham Street, London, EC2V 7AE, UK
E-mail address: