Robust pricing and hedging via neural SDEs
Patryk Gierjatowicz, Marc Sabate-Vidales, David Šiška, Lukasz Szpruch, Žan Žurič
ABSTRACT. Mathematical modelling is ubiquitous in the financial industry and drives key decision processes. Any given model provides only a crude approximation to reality, and the risk of using an inadequate model is hard to detect and quantify. By contrast, modern data science techniques are opening the door to more robust and data-driven model selection mechanisms. However, most machine learning models are "black boxes", as individual parameters do not have a meaningful interpretation. The aim of this paper is to combine the above approaches, achieving the best of both worlds. Combining neural networks with risk models based on classical stochastic differential equations (SDEs), we find robust bounds for prices of derivatives and the corresponding hedging strategies while incorporating relevant market data. The resulting model, called a neural SDE, is an instantiation of generative models and is closely linked with the theory of causal optimal transport. Neural SDEs allow consistent calibration under both the risk-neutral and the real-world measures. Thus the model can be used to simulate market scenarios needed for assessing risk profiles and hedging strategies. We develop and analyse novel algorithms needed for efficient use of neural SDEs. We validate our approach with numerical experiments using both local and stochastic volatility models.
Primary: 65C30, 60H35; secondary: 60H30.

1. INTRODUCTION

1.1. Problem overview.
Model uncertainty is an essential part of mathematical modelling but is particularly acute in mathematical finance and economics, where one cannot base models on well-established physical laws. Until recently, these models were mostly conceived in a three-step fashion: 1) gathering statistical properties of the underlying time series, the so-called stylized facts; 2) handcrafting a parsimonious model which best captures the desired market characteristics without adding any needless complexity; and 3) calibrating and validating the handcrafted model. Indeed, model complexity was undesirable, amongst other reasons, for increasing the computational effort required to perform calibration in particular, but also pricing and risk calculations. With greater uptake of machine learning methods and greater computational power, more complex models can now be used. This is due to the fact that arguably the most complicated and computationally expensive step, calibration, has been addressed. Indeed, the seminal paper [Hernandez, 2016] used neural networks to learn the calibration map from market data directly to model parameters. Subsequently, many papers followed [Liu et al., 2019, Ruf and Wang, 2019, Ruf and Wang, 2019, Benth et al., 2020, Gambara and Teichmann, 2020, Sardroudi, 2019, Horvath et al., 2019, Bayer et al., 2019, Bayer and Stemper, 2018, Vidales et al., 2018]. However, these approaches focused on the calibration of a fixed parametric model and did not address the perhaps even more important issues of model selection and model uncertainty.

The approach taken in this paper is fundamentally different. We let the data dictate the model, while still keeping a strong prior on the model form. This is achieved by using SDEs for the model dynamics, but instead of choosing a fixed parametrization for the model SDEs we allow the drift and diffusion to be given by overparametrized neural networks. We will refer to these as neural SDEs. These are shown not only to provide a systematic framework for model selection, but also, quite remarkably, to produce robust estimates on derivative prices. Here, calibration and model selection are done simultaneously; in this sense, model selection is data-driven. Since the neural SDE model is overparametrized, there is a large pool of possible models and the training algorithm selects one. Unlike in handcrafted models, individual parameters do not carry any meaning. This makes it hard to argue why one model is better than another.

School of Mathematics, University of Edinburgh; Vega Protocol; Alan Turing Institute; Department of Mathematics, Imperial College London.
E-mail addresses: [email protected], [email protected], [email protected], [email protected], [email protected].
Date: 9th July 2020. arXiv: [q-fin.MF].
Key words and phrases: Stochastic differential equations, deep neural networks, derivative pricing, stochastic gradient descent.
Hence the ability to efficiently compute interval estimators, which the algorithms in this paper provide, is critical.

In parallel to this work, a similar approach to modelling was taken in [Cuchiero et al., 2020], where the authors considered local stochastic volatility models with the leverage function approximated by a neural network. Their model can be seen as an example of a neural SDE.

Let us now consider a probability space $(\Omega, \mathcal F, (\mathcal F_t)_{t\in[0,T]}, \mathbb P)$ and a random variable $\Psi \in L^1(\mathcal F_T)$ that represents the discounted payoff of an illiquid (path-dependent) derivative. The problem of calculating a market-consistent price of a financial derivative can be seen as equivalent to finding a map that takes market data (e.g. prices of underlying assets, interest rates, prices of liquid options) and returns the no-arbitrage price of the derivative. Typically an Itô process $(X^\theta_t)_{t\in[0,T]}$ with parameters $\theta\in\mathbb R^p$ has been the main component used in constructing such a pricing function. Such a parametric model induces a martingale probability measure, denoted by $\mathbb Q(\theta)$, which is then used to compute no-arbitrage prices of derivatives. The market data (input data) here is represented by payoffs $\{\Phi_i\}_{i=1}^M$ of liquid derivatives and their corresponding market prices $\{p(\Phi_i)\}_{i=1}^M$. We will assume throughout that this price set is free of arbitrage. To make the model $\mathbb Q(\theta)$ consistent with market prices, one seeks parameters $\theta^*$ such that the difference between $p(\Phi_i)$ and $\mathbb E^{\mathbb Q(\theta^*)}[\Phi_i]$ is minimized for all $i = 1, \dots, M$ (w.r.t. some metric). If $p(\Phi_i) = \mathbb E^{\mathbb Q(\theta^*)}[\Phi_i]$ for all $i = 1, \dots, M$, then we will say the model is consistent with market data (perfectly calibrated). There may be infinitely many models that are consistent with the market. This is called Knightian uncertainty [Knight, 1971, Cohen et al., 2018].

Let $\mathcal M$ be the set of all martingale measures / models that are perfectly calibrated to market inputs.
In the robust finance paradigm, see [Hobson, 1998, Cox and Obloj, 2011], one takes a conservative approach and, instead of computing a single price (that corresponds to a model from $\mathcal M$), one computes the price interval $(\inf_{\mathbb Q\in\mathcal M}\mathbb E^{\mathbb Q}[\Psi],\ \sup_{\mathbb Q\in\mathcal M}\mathbb E^{\mathbb Q}[\Psi])$. The bounds can be computed using tools from martingale optimal transport, which also, through dual representations, yield the corresponding super- and sub-hedging strategies, [Beiglböck et al., 2013]. Without imposing further constraints, the class of all calibrated models $\mathcal M$ might be too large and consequently the corresponding bounds too wide to be of practical use [Eckstein et al., 2019]; see, however, the effort to incorporate further market information to tighten the pricing interval, [Nadtochiy and Obloj, 2017, Aksamit et al., 2020]. Another shortcoming of working with the entire class of calibrated models $\mathcal M$ is that, in general, it is not clear how to obtain a practical/explicit model out of the measures that yield the price bounds. For example, such explicit models are useful when one wants to consistently calibrate under the pricing measure $\mathbb Q$ and the real-world measure $\mathbb P$, as needed for risk estimation and stress testing, [Broadie et al., 2011, Pelsser and Schweizer, 2016], or to learn hedging strategies in the presence of transaction costs and illiquidity constraints [Buehler et al., 2019].

1.2. Neural SDEs.
Fix $T > 0$ and for simplicity assume a constant interest rate $r\in\mathbb R$. Consider a parameter space $\Theta = \Theta_b\times\Theta_\sigma\subseteq\mathbb R^p$ and parametric functions $b : [0,T]\times\mathbb R^d\times\Theta_b\to\mathbb R^d$ and $\sigma : [0,T]\times\mathbb R^d\times\Theta_\sigma\to\mathbb R^{d\times n}$. Let $(W_t)_{t\in[0,T]}$ be an $n$-dimensional Brownian motion supported on $(\Omega,\mathcal F,(\mathcal F_t)_{t\in[0,T]},\mathbb Q)$, so that $\mathbb Q$ is the Wiener measure and $\Omega = C([0,T];\mathbb R^n)$. We consider the following parametric SDE

(1.1)  $dX^\theta_t = b(t, X^\theta_t, \theta)\,dt + \sigma(t, X^\theta_t, \theta)\,dW_t.$

We split $X^\theta$, which is the entire stochastic model, into traded assets and non-tradable components. Let $X^\theta = (S^\theta, V^\theta)$, where $S^\theta$ are the traded assets and $V^\theta$ are the components that are not traded. We will assume that for all $t\in[0,T]$, $x = (s,v)\in\mathbb R^d$ and $\theta\in\mathbb R^p$,

$b(t,(s,v),\theta) = \big(rs,\ b^V(t,(s,v),\theta)\big)\in\mathbb R^d \quad\text{and}\quad \sigma(t,(s,v),\theta) = \big(\sigma^S(t,(s,v),\theta),\ \sigma^V(t,(s,v),\theta)\big).$

Then we can write (1.1) as

(1.2)
$dS^\theta_t = rS^\theta_t\,dt + \sigma^S(t, X^\theta_t, \theta)\,dW_t,$
$dV^\theta_t = b^V(t, X^\theta_t, \theta)\,dt + \sigma^V(t, X^\theta_t, \theta)\,dW_t,$
$X^\theta_t = (S^\theta_t, V^\theta_t).$

Observe that $\sigma^S$ and $\sigma^V$ encode arbitrary correlation structures between the traded assets and the non-tradable components. Moreover, we immediately see that $(e^{-rt}S^\theta_t)_{t\in[0,T]}$ is a (local) martingale and thus the model is free of arbitrage.

When $(b, \sigma)$ are defined to be neural networks (see Appendix C), we call the SDE (1.1) a neural SDE, and we denote by $\mathcal M_{\mathrm{nsde}}(\theta)$ the class of all solutions to (1.1). Note that, due to the universal approximation property of neural networks, see [Hornik, 1991, Sontag and Sussmann, 1997, Cuchiero et al., 2019], $\mathcal M_{\mathrm{nsde}}(\theta)$ contains a large class of SDE solutions. Furthermore, neural networks can be efficiently trained with stochastic gradient descent methods, and hence one can easily seek calibrated models in $\mathcal M_{\mathrm{nsde}}(\theta)$. Finally, neural SDEs integrate black-box neural-network-type models with the known and well-studied SDE models.
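To make the construction concrete, here is a minimal simulation sketch of the split system (1.2), in which small tanh networks stand in for $\sigma^S$, $b^V$ and $\sigma^V$ while the drift of $S$ is pinned to $rS$. The network sizes, initialisation and all numerical values are our own illustrative assumptions, not the paper's, and a plain Euler step is used (for superlinearly growing coefficients the tamed scheme (2.7) would be preferable). Because the drift of $S$ is $rS$ regardless of the networks, the discounted asset is a martingale by construction, which we can verify on simulated paths:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_init(sizes, rng):
    """Random feed-forward network parameters (placeholder initialisation)."""
    return [(rng.normal(0.0, 0.3, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Plain tanh MLP; plays the role of a parametric SDE coefficient."""
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

# theta = (parameters of sigma^S, b^V, sigma^V); the drift of S is fixed to r*S
sigma_S = mlp_init([2, 16, 1], rng)
b_V     = mlp_init([2, 16, 1], rng)
sigma_V = mlp_init([2, 16, 1], rng)

def simulate(n_paths, n_steps, T=1.0, r=0.02, s0=1.0, v0=0.1):
    """Euler scheme for dS = rS dt + sigma^S dW, dV = b^V dt + sigma^V dW
    (one-dimensional driving noise, for brevity)."""
    dt = T / n_steps
    s = np.full(n_paths, s0)
    v = np.full(n_paths, v0)
    for _ in range(n_steps):
        x = np.stack([s, v], axis=1)
        dw = rng.normal(0.0, np.sqrt(dt), n_paths)
        s = s + r * s * dt + mlp(sigma_S, x)[:, 0] * dw
        v = v + mlp(b_V, x)[:, 0] * dt + mlp(sigma_V, x)[:, 0] * dw
    return s, v

s_T, v_T = simulate(5000, 50)
# e^{-rT} E[S_T] should be close to s0 for *any* choice of the networks
disc_mean = np.exp(-0.02) * s_T.mean()
```

Whatever the (fixed) network parameters, $e^{-rT}\,\mathbb E[S_T]\approx s_0$ up to Monte Carlo and discretisation error: the no-arbitrage property is built into the model structure rather than into particular parameter values.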
One consequence of integrating neural networks with classical SDE models is that one can: a) consistently calibrate under the risk-neutral measure as well as the real-world measure; b) easily integrate additional market information, e.g. constraints on realised variance; c) verify the martingale property. We remark that for simplicity we work in a Markovian setting, but one could consider neural SDEs with path-dependent coefficients and/or more general noise processes; we postpone the analysis of these cases to a follow-up paper. By imposing suitable conditions on the coefficients $(b, \sigma)$, we know that a unique solution to (1.1) exists, [Krylov, 1980, Chapter 2]. These conditions can be satisfied by neural networks, e.g. by applying weight clipping. We denote the law of $X^\theta$ on $C([0,T];\mathbb R^d)$ by $\mathbb Q(\theta) := \mathcal L((X^\theta_t)_{t\in[0,T]})$.

Given a loss function $\ell : \mathbb R\times\mathbb R\to\mathbb R_+$, the search for a calibrated model can be written as

$\theta^* \in \arg\min_{\theta\in\Theta} \sum_{i=1}^M \ell\big(\mathbb E^{\mathbb Q(\theta)}[\Phi_i],\, p(\Phi_i)\big), \quad\text{where}\quad \mathbb E^{\mathbb Q(\theta)}[\Phi] = \int_{C([0,T],\mathbb R^d)} \Phi(\omega)\, \mathcal L(X^\theta)(d\omega).$

To extend the calibration consistently to the real-world measure, assume that we are given some statistical facts (e.g. moments or other distributional properties) that the price process (or the non-tradable components) should satisfy. Let $\zeta : [0,T]\times\mathbb R^d\times\mathbb R^p\to\mathbb R^n$ be another parametric function (e.g. a neural network), and extend the parameter space to $\Theta = \Theta_b\times\Theta_\sigma\times\Theta_\zeta\subseteq\mathbb R^p$. Let

$b^{S,\mathbb P}(t, X^\theta_t, \theta) := rS^\theta_t + \sigma^S(t, X^\theta_t, \theta)\,\zeta(t, X^\theta_t, \theta),$
$b^{V,\mathbb P}(t, X^\theta_t, \theta) := b^V(t, X^\theta_t, \theta) + \sigma^V(t, X^\theta_t, \theta)\,\zeta(t, X^\theta_t, \theta).$

We now define a real-world measure $\mathbb P(\theta)$ via the Radon–Nikodym derivative

$\frac{d\mathbb P(\theta)}{d\mathbb Q(\theta)} := \exp\Big(\int_0^T \zeta(t, X^\theta_t, \theta)\,dW_t - \frac12\int_0^T |\zeta(t, X^\theta_t, \theta)|^2\,dt\Big).$

Under appropriate assumptions on $\zeta$ (e.g.
bounded), the measure $\mathbb P(\theta)$ is a probability measure, and by the Girsanov theorem we can find a Brownian motion $(W^{\mathbb P(\theta)}_t)_{t\in[0,T]}$ such that

(1.3)
$dS^\theta_t = b^{S,\mathbb P}(t, X^\theta_t, \theta)\,dt + \sigma^S(t, X^\theta_t, \theta)\,dW^{\mathbb P(\theta)}_t,$
$dV^\theta_t = b^{V,\mathbb P}(t, X^\theta_t, \theta)\,dt + \sigma^V(t, X^\theta_t, \theta)\,dW^{\mathbb P(\theta)}_t.$

This is the neural SDE model under the real-world measure $\mathbb P(\theta)$, and one would like to use market data to seek $\zeta$. Let $\mathbb P^{\mathrm{market}}$ denote the empirical distribution of market data and $(\mathbb E^{\mathbb P^{\mathrm{market}}}[S_i])_{i=1}^{\tilde M}$ a corresponding set of statistics one aims to match. These might be autocorrelation functions, realised variance or moment generating functions. The calibration to the real-world measure, with $(b^V, \sigma^V, \sigma^S)$ fixed, consists of finding $\theta^*$ such that

$\theta^* \in \arg\min_{\theta\in\Theta} \sum_{i=1}^{\tilde M} \ell\big(\mathbb E^{\mathbb P(\theta)}[S_i],\, \mathbb E^{\mathbb P^{\mathrm{market}}}[S_i(\omega)]\big).$

But in fact we can write

$\mathbb E^{\mathbb P(\theta)}[S_i] = \mathbb E^{\mathbb Q(\theta)}\Big[S_i\,\frac{d\mathbb P(\theta)}{d\mathbb Q(\theta)}\Big].$

Thus we see that in this framework there need be no distinction between a derivative price $\Phi_i$ and a real-world statistic $\mathbb E^{\mathbb P^{\mathrm{market}}}[S_i]$. Hence from now on we will write only about risk-neutral calibration, bearing in mind that methodologically this leads to no loss of generality.

Let us connect neural SDEs to the concept of generative modelling, see [Goodfellow et al., 2014, Kingma and Welling, 2013]. Let $\mathbb Q^{\mathrm{market}}\in\mathcal M$ be the true martingale measure (so by
We see that by construction G is a causal transport map i.e transport map that isadapted to filtration ( F t ) t ∈ [0 ,T ] , see also [Acciaio et al., 2019, Lassalle, 2013].One then seeks θ ∗ such that G θ ∗ µ is a good approximation of Q market with respect to userspecified metric. In this paper we work with D ( G θ µ, Q market ) := M (cid:88) i =1 (cid:96) (cid:32)(cid:90) C ([0 ,T ] , R d ) Φ i ( ω )( G θ µ )( dω ) , (cid:90) C ([0 ,T ] , R d ) Φ i ( ω ) Q market ( dω ) (cid:33) . As we shall see in Sections 5.1 and 5.2 there are many Neural SDE models that can be calibratedwell to market data and that produce significantly different prices for derivatives that were not partof the calibration data. In practice these would be illiquid derivatives where we require model toobtain prices. Therefore, we compute price intervals for illiquid derivatives within the class ofcalibrated neural SDEs models. To be more precise we compute inf θ (cid:110) E Q ( θ ) [Ψ] : D ( G ( θ ) µ , Q market ) = 0 (cid:111) , sup θ (cid:110) E Q ( θ ) [Ψ] : D ( G ( θ ) µ , Q market ) = 0 (cid:111) . We solve the above constraint optimisation problem by penalisation. See [Eckstein and Kupper,2019] for related ideas.1.3.
Key conclusions and methodological contributions of this paper.
The results presented below lead to the following conclusions.
i) Neural SDEs provide a systematic framework for model selection and produce robust estimates on derivative prices. Calibration and model selection are done simultaneously, and thus model selection is data-driven.
ii) With neural SDEs, the modelling choices one makes are: network architectures, the structure of the neural SDE (e.g. traded and non-traded assets), training methods and data. For classical handcrafted models the choice of the algorithm for calibrating parameters has not been considered part of the modelling choice, but for machine learning it is one of the key components. See Section 5, where we show how a change in the initialisation of the stochastic gradient method used for training leads to different prices of illiquid options, thus providing one way of obtaining price bounds. Furthermore, even the basic local volatility model, which is unique given a continuum of strikes and maturities, produces ranges of prices of illiquid derivatives when calibrated to finite data sets.
iii) The above optimisation problem is not convex. Nonetheless, empirical experiments in Sections 5.1–5.2 demonstrate that the stochastic gradient descent methods used to minimise the loss functional $D$ converge to a set of parameters for which the calibration error is of order − to − for the square loss function. A theoretical framework for analysing such algorithms is being developed in [Šiška and Szpruch, 2020].
iv) By augmenting classical risk models with modern machine learning approaches we are able to benefit from the expressibility of neural networks while staying within the realm of classical models, well understood by traders, risk managers and regulators.
This mitigates, to some extent, the concerns that regulators have around the use of black-box solutions to manage financial risk. Finally, while our focus here is on SDE-type models, the devised framework naturally extends to time-series-type models.

The main methodological contributions of this work are as follows.
i) By leveraging the martingale representation theorem, we develop an efficient Monte Carlo based method that simultaneously learns the model and the corresponding hedging strategy.
ii) The calibration problem does not fit into the classical framework of stochastic gradient algorithms, as the mini-batch estimates of the gradient of the cost function are biased. We provide an analysis of the bias and show how the inclusion of hedging strategies in training mitigates it.
iii) We devise a novel, memory-efficient randomised training procedure. The algorithm allows us to keep memory requirements constant, independently of the number of neural networks in the neural SDE. This is critical for efficient calibration to path-dependent contingent claims. We provide a theoretical analysis of our method in Section 4 and numerical experiments supporting the claims in Section 5.

The paper is organized as follows. In Section 2 we outline the exact optimization problem, introduce a deep neural network control variate (or hedging strategy), address the process of calibration to single/multiple option maturities and state the exact algorithms. In Section 3 we analyse the bias in Algorithms 1 and 2. In Section 4 we show that the novel, memory-efficient, drop-out-like training procedure for path-dependent derivatives does not introduce bias in the new estimator. Finally, the performance of the Neural Local Volatility and Local Stochastic Volatility models is presented in Section 5. Some of the proofs and more detailed results from numerical experiments are relegated to the Appendix. The code used is available at github.com/msabvid/robust_nsde.

2. ROBUST PRICING AND HEDGING
Let $\ell : \mathbb R\times\mathbb R\to[0,\infty)$ be a convex loss function such that $\min_{x\in\mathbb R, y\in\mathbb R} \ell(x,y) = 0$. For example, we can take $\ell(x,y) = |x-y|$. Given $\ell$, our aim is to solve the following optimisation problems:
i) Find model parameters $\theta^*$ such that model prices match market prices:

(2.1)  $\theta^* \in \arg\min_{\theta\in\Theta} \sum_{i=1}^M \ell\big(\mathbb E^{\mathbb Q(\theta)}[\Phi_i],\, p(\Phi_i)\big).$

In practice this is equivalent to finding some $\theta^*$ such that $\sum_{i=1}^M \ell(\mathbb E^{\mathbb Q(\theta^*)}[\Phi_i], p(\Phi_i)) = 0$. This is due to the inherent overparametrization of neural SDEs and the fact that $\ell\ge 0$ reaches its minimum at zero.
ii) Find model parameters $\theta^{l,*}$ and $\theta^{u,*}$ which provide robust arbitrage-free price bounds for an illiquid derivative, subject to available market data:

(2.2)
$\theta^{l,*} \in \arg\min_{\theta\in\Theta} \mathbb E^{\mathbb Q(\theta)}[\Psi] \quad\text{subject to}\quad \sum_{i=1}^M \ell(\mathbb E^{\mathbb Q(\theta)}[\Phi_i], p(\Phi_i)) = 0,$
$\theta^{u,*} \in \arg\max_{\theta\in\Theta} \mathbb E^{\mathbb Q(\theta)}[\Psi] \quad\text{subject to}\quad \sum_{i=1}^M \ell(\mathbb E^{\mathbb Q(\theta)}[\Phi_i], p(\Phi_i)) = 0.$

The no-arbitrage price of $\Psi$ over the class of neural SDEs used is then in $\big[\mathbb E^{\mathbb Q(\theta^{l,*})}[\Psi],\ \mathbb E^{\mathbb Q(\theta^{u,*})}[\Psi]\big]$.

2.1. Learning the hedging strategy as a control variate.
A starting point in the derivation of the practical algorithm is to estimate $\mathbb E^{\mathbb Q(\theta)}[\Phi]$ using a Monte Carlo estimator. Consider $(X^{i,\theta})_{i=1}^N$, $N$ i.i.d. copies of (1.1), and let $\mathbb Q^N(\theta) := \frac1N\sum_{i=1}^N \delta_{X^{i,\theta}}$ be the empirical approximation of $\mathbb Q(\theta)$. By the law of large numbers, $\mathbb E^{\mathbb Q^N(\theta)}[\Phi]$ converges to $\mathbb E^{\mathbb Q(\theta)}[\Phi]$ in probability. Moreover, the central limit theorem tells us that

$\mathbb P\Big(\mathbb E^{\mathbb Q(\theta)}[\Phi] \in \Big[\mathbb E^{\mathbb Q^N(\theta)}[\Phi] - z_{\alpha/2}\frac{\sigma}{\sqrt N},\ \mathbb E^{\mathbb Q^N(\theta)}[\Phi] + z_{\alpha/2}\frac{\sigma}{\sqrt N}\Big]\Big) \to 1-\alpha \quad\text{as } N\to\infty,$

where $\sigma = \sqrt{\operatorname{Var}[\Phi]}$ and $z_{\alpha/2}$ is such that $1 - \mathrm{CDF}_Z(z_{\alpha/2}) = \alpha/2$, with $Z$ the standard normal distribution. We see that by increasing $N$ we reduce the width of the above confidence interval, but this increases the overall computational cost. A better strategy is to find a good control variate, i.e. we seek a random variable $\Phi^{cv}$ such that

(2.3)  $\mathbb E^{\mathbb Q(\theta)}[\Phi^{cv}] = \mathbb E^{\mathbb Q(\theta)}[\Phi] \quad\text{and}\quad \operatorname{Var}[\Phi^{cv}] < \operatorname{Var}[\Phi].$

In the following we construct $\Phi^{cv}$ using a hedging strategy. A similar approach has recently been developed in [Vidales et al., 2018] in the context of pricing and hedging with deep networks. The martingale representation theorem (see for example Th. 14.5.1 in [Cohen and Elliott, 2015]) provides a general methodology for finding Monte Carlo estimators with the above stated properties (2.3). If $\Phi$ is such that $\mathbb E^{\mathbb Q}[|\Phi|^2] < \infty$, then there exists a unique process $Z = (Z_t)_t$ adapted to the filtration $(\mathcal F_t)_{t\in[0,T]}$ with $\mathbb E^{\mathbb Q}\big[\int_0^T |Z_s|^2\,ds\big] < \infty$ such that

$\mathbb E[\Phi\,|\,\mathcal F_0] = \Phi - \int_0^T Z_s\,dW_s.$

Define

$\Phi^{cv} := \Phi - \int_0^T Z_s\,dW_s,$

and note that $\mathbb E^{\mathbb Q(\theta)}[\Phi^{cv}\,|\,\mathcal F_0] = \mathbb E^{\mathbb Q(\theta)}[\Phi\,|\,\mathcal F_0]$ and $\operatorname{Var}^{\mathbb Q(\theta)}[\Phi^{cv}\,|\,\mathcal F_0] = 0$.
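The mechanism behind (2.3) can be seen in a toy Black–Scholes market, where the martingale-representation integrand is available in closed form (the option delta), so the ideal control variate can be built directly and its variance compared with the plain estimator. All numerical values below are our own illustrative choices; in the paper's setting the delta is replaced by a learned network $h$:

```python
import numpy as np
from math import erf, sqrt

# Black-Scholes toy market (r = 0 so the discounted asset equals S itself)
T, r, sigma, s0, K = 1.0, 0.0, 0.2, 1.0, 1.0
n_paths, n_steps = 10_000, 50
rng = np.random.default_rng(3)
erf_vec = np.vectorize(erf)

def bs_delta(t, s):
    """Call delta: the known hedging strategy whose gains give int Z_s dW_s."""
    tau = T - t + 1e-12
    d1 = (np.log(s / K) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    return 0.5 * (1.0 + erf_vec(d1 / sqrt(2.0)))

dt = T / n_steps
s = np.full(n_paths, s0)
hedge = np.zeros(n_paths)          # Riemann sum approximating the hedging integral
for k in range(n_steps):
    delta = bs_delta(k * dt, s)
    ds = r * s * dt + sigma * s * rng.normal(0.0, np.sqrt(dt), n_paths)
    hedge += delta * ds
    s = s + ds

payoff = np.maximum(s - K, 0.0)
payoff_cv = payoff - hedge         # the control-variate estimator Phi^cv of (2.3)
var_ratio = payoff_cv.var() / payoff.var()
```

The two estimators agree in expectation (the hedging gains are a martingale starting at zero), while the variance of `payoff_cv` is a small fraction of that of `payoff`; with continuous-time hedging it would vanish entirely, as in the display above.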
The process $Z$ has more explicit representations via the corresponding (possibly path-dependent) backward Kolmogorov equation or the Bismut–Elworthy–Li formula. Both approaches require further approximation, see [Vidales et al., 2018] and [Vidales et al., 2020]. Here, this approximation will be provided by an additional neural network. Without loss of generality, assume that $\Phi = \varphi((X^\theta_t)_{t\in[0,T]})$ for some $\varphi : C([0,T],\mathbb R^d)\to\mathbb R$. In the remainder of the paper we slightly abuse notation and write $\Phi$ indistinctly for both the option and the mapping $C([0,T],\mathbb R^d)\to\mathbb R$.

Consider now a neural network $h : [0,T]\times C([0,T],\mathbb R^d)\times\mathbb R^{p'}\to\mathbb R^d$ with parameters $\xi\in\mathbb R^{p'}$, $p'\in\mathbb N$, and define the following learning task, in which $\theta$ (the parameters of the neural SDE model) is fixed: find

(2.4)  $\xi^* \in \arg\min_\xi \operatorname{Var}\Big[\Phi\big((X^\theta_t)_{t\in[0,T]}\big) - \int_0^T h\big(s, (X^\theta_{s\wedge t})_{t\in[0,T]}, \xi\big)\,dW_s \,\Big|\, \mathcal F_0\Big].$

In a similar manner one can derive $\Psi^{cv}$ for the payoff of the illiquid derivative for which we seek the robust price bounds. Then (2.2) can be restated as

(2.5)
$\theta^{l,*} \in \arg\min_{\theta\in\Theta} \mathbb E^{\mathbb Q(\theta)}[\Psi^{cv}] \quad\text{subject to}\quad \sum_{i=1}^M \ell(\mathbb E^{\mathbb Q(\theta)}[\Phi^{cv}_i], p(\Phi_i)) = 0,$
$\theta^{u,*} \in \arg\max_{\theta\in\Theta} \mathbb E^{\mathbb Q(\theta)}[\Psi^{cv}] \quad\text{subject to}\quad \sum_{i=1}^M \ell(\mathbb E^{\mathbb Q(\theta)}[\Phi^{cv}_i], p(\Phi_i)) = 0.$

The learning problem (2.5) is preferable to (2.2) from the point of view of algorithmic implementation, as it enjoys lower Monte Carlo variance and hence requires the simulation of fewer paths of the neural SDE in each step of the stochastic gradient algorithm. Furthermore, when using (2.5) we learn a (possibly abstract) hedging strategy for trading in the underlying asset to replicate the derivative payoff. Since the market may be incomplete, this abstract hedging strategy may not be usable in practice.
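The variance-minimisation task (2.4) can be illustrated with the simplest conceivable "network": a single constant $\xi$ multiplying the gains from holding one unit of the asset. In that degenerate case the variance-optimal $\xi^*$ is a least-squares regression coefficient and is available in closed form, so no gradient descent is needed (all model parameters below are our own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
T, sigma, s0, K, n_steps, n_paths = 1.0, 0.2, 1.0, 1.0, 50, 20_000

# simulate paths (r = 0 for brevity, so paths are already discounted)
dt = T / n_steps
s = np.empty((n_paths, n_steps + 1))
s[:, 0] = s0
for k in range(n_steps):
    dw = rng.normal(0.0, np.sqrt(dt), n_paths)
    s[:, k + 1] = s[:, k] * (1.0 + sigma * dw)

payoff = np.maximum(s[:, -1] - K, 0.0)
gains = s[:, -1] - s0      # gains of the constant strategy "hold one unit of S"

# one-parameter hedge h = xi: the variance-minimising xi* of (2.4) is the
# least-squares coefficient Cov(payoff, gains) / Var(gains)
xi_star = np.cov(payoff, gains)[0, 1] / gains.var()
var_plain = payoff.var()
var_hedged = (payoff - xi_star * gains).var()
```

Even this crude one-parameter hedge removes a substantial fraction of the variance (for an at-the-money call, $\xi^*$ comes out near the Black–Scholes delta of roughly one half); richer networks $h$ taking the running path as input push the residual variance further down, which is exactly the role of the $\xi$-update in Algorithms 1 and 2.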
More precisely, since the process $X^\theta$ contains tradable as well as non-tradable components, the control variate for the latter has to be adapted, by either performing a projection or deriving a strategy for a corresponding tradable instrument. To deduce a real hedging strategy, recall that $X^\theta = (S^\theta, V^\theta)$, with $S^\theta$ the tradable assets and $V^\theta$ the non-tradable components. Decompose the abstract hedging strategy as $h = (h^S, h^V)$. Let $\bar S^\theta_t := e^{-rt}S^\theta_t$ and note that due to (1.2) we have $d\bar S^\theta_t = e^{-rt}\sigma^S(t, X^\theta_t, \theta)\,dW_t$. If we can solve $h^S_t = e^{-rt}\,\bar h^S_t\,\sigma^S(t, X^\theta_t, \theta)$ for $\bar h^S_t$, then this is a real hedging strategy. Therefore, an alternative approach to (2.4), possibly yielding a better hedge but worse variance reduction, is to find

(2.6)  $\bar\xi^* \in \arg\min_{\bar\xi} \operatorname{Var}\Big[\Phi\big((X_t)_{t\in[0,T]}\big) - \int_0^T \bar h\big(r, (X_{r\wedge t})_{t\in[0,T]}, \bar\xi\big)\,d\bar S^\theta_r \,\Big|\, \mathcal F_0\Big]$

for some other neural network $\bar h$. This is the version we present in Algorithms 1 and 2.
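Before turning to discretisation, it may help to see the calibration objective (2.1) in its smallest possible instance: a one-parameter Black–Scholes model fitted by stochastic gradient descent to a single call price. The target price, learning rate and batch sizes below are our own illustrative choices, not the paper's; the use of two independent batch means inside the gradient anticipates the bias issue analysed in Section 3:

```python
import numpy as np

rng = np.random.default_rng(2)
T, r, s0, K = 1.0, 0.0, 1.0, 1.0
p_market = 0.0797   # approximately the Black-Scholes call price at sigma = 0.2

def batch(theta, n):
    """Sample discounted payoffs and their pathwise derivatives in theta."""
    z = rng.normal(size=n)
    s_T = s0 * np.exp((r - 0.5 * theta**2) * T + theta * np.sqrt(T) * z)
    payoff = np.exp(-r * T) * np.maximum(s_T - K, 0.0)
    dpayoff = np.exp(-r * T) * (s_T > K) * s_T * (-theta * T + np.sqrt(T) * z)
    return payoff, dpayoff

theta, lr = 0.4, 0.5
for _ in range(200):
    _, dpay_a = batch(theta, 20_000)    # batch for the pathwise derivative
    pay_b, _ = batch(theta, 20_000)     # independent batch for the price estimate
    # gradient of the squared calibration error (E[Phi] - p)^2; the two
    # independent batch means keep this gradient estimate unbiased
    grad = 2.0 * (pay_b.mean() - p_market) * dpay_a.mean()
    theta -= lr * grad

calib_error = abs(batch(theta, 200_000)[0].mean() - p_market)
```

The loop recovers a volatility near 0.2 and drives the calibration error to the Monte Carlo noise floor; the full algorithms do the same with thousands of network parameters in place of the single `theta`, and with the learned control variate shrinking that noise floor.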
2.2. Time discretization. In order to implement (2.5) we define a partition $\pi$ of $[0,T]$ as $\pi := \{t_0, t_1, \dots, t_{N_{\mathrm{steps}}} = T\}$. We first approximate the stochastic integral in (2.4) by the appropriate Riemann sum. Depending on the choice of the neural network architecture approximating $\sigma$ in the neural SDE (1.1), we may have a $\sigma$ which grows super-linearly as a function of $x$. In such a case the moments of the classical Euler scheme are known to blow up in finite time, see [Hutzenthaler et al., 2011], even if the moments of the solution to the SDE are finite. In order to avoid blow-ups of the moments of the simulated paths during training, we apply the tamed Euler method, see [Hutzenthaler et al., 2012, Szpruch and Zhang, 2018]. The tamed Euler scheme is given by

(2.7)  $X^{\pi,\theta}_{t_{k+1}} = X^{\pi,\theta}_{t_k} + \frac{b(t_k, X^{\pi,\theta}_{t_k}, \theta)}{1 + |b(t_k, X^{\pi,\theta}_{t_k}, \theta)|\sqrt{\Delta t_k}}\,\Delta t_k + \frac{\sigma(t_k, X^{\pi,\theta}_{t_k}, \theta)}{1 + |\sigma(t_k, X^{\pi,\theta}_{t_k}, \theta)|\sqrt{\Delta t_k}}\,\Delta W_{t_{k+1}},$

with $\Delta t_k = t_{k+1} - t_k$ and $\Delta W_{t_{k+1}} = W_{t_{k+1}} - W_{t_k}$.

2.3. Algorithms.
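The tamed step (2.7), used inside all the algorithms below, is a one-line modification of the Euler scheme. Here is a direct transcription in one dimension, applied to a test SDE with cubic drift of our own choosing; the damping factors $1 + |b|\sqrt{\Delta t}$ and $1 + |\sigma|\sqrt{\Delta t}$ bound each increment, so the simulated moments stay finite even though the drift grows superlinearly:

```python
import numpy as np

rng = np.random.default_rng(4)

def tamed_euler_step(x, b, sig, dt, dw):
    """One step of the tamed scheme (2.7): drift and diffusion are damped by
    1 + |coefficient| * sqrt(dt), which bounds each increment and prevents
    moment blow-up; the damping vanishes as dt -> 0, so the limiting
    dynamics are unchanged."""
    bx, sx = b(x), sig(x)
    return (x + bx / (1.0 + np.abs(bx) * np.sqrt(dt)) * dt
              + sx / (1.0 + np.abs(sx) * np.sqrt(dt)) * dw)

# illustrative one-dimensional SDE with superlinear drift: dX = -X^3 dt + 0.5 X dW
b = lambda x: -x**3
sig = lambda x: 0.5 * x

n_paths, n_steps, T = 10_000, 100, 1.0
dt = T / n_steps
x = np.full(n_paths, 2.0)
for _ in range(n_steps):
    x = tamed_euler_step(x, b, sig, dt, rng.normal(0.0, np.sqrt(dt), n_paths))

second_moment = np.mean(x**2)
```

With the classical (untamed) Euler scheme, the same drift can produce paths whose moments explode for rare noise realisations, which is exactly what the taming is designed to prevent during training.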
We now present the algorithm to calibrate the neural SDE (1.1) to market prices of derivatives (Algorithm 1) and the algorithm to find robust price bounds for an illiquid derivative (Algorithm 2). Note that during training we aim to calibrate the SDE (1.1) and, at the same time, adapt the abstract hedging strategy to minimise the variance (2.4). Therefore, we alternate two optimisations:
i) During each epoch, we optimise the parameters $\theta$ of the neural SDE while the parameters $\xi$ of the hedging strategy are fixed. In order to calculate the Monte Carlo estimator $\mathbb E^{\mathbb Q^N(\theta)}[\Phi^{cv}]$, we generate $N_{\mathrm{trn}}$ paths $(x^{\pi,\theta,i}_{t_n})_{n=0}^{N_{\mathrm{steps}}} := (s^{\pi,\theta,i}_{t_n}, v^{\pi,\theta,i}_{t_n})_{n=0}^{N_{\mathrm{steps}}}$, $i = 1, \dots, N_{\mathrm{trn}}$, using the tamed Euler scheme on (1.1). Furthermore, we create a copy of the generated paths, denoted $(\tilde x^{\pi,i}_{t_n})_{n=0}^{N_{\mathrm{steps}}} := (\tilde s^{\pi,i}_{t_n}, \tilde v^{\pi,i}_{t_n})_{n=0}^{N_{\mathrm{steps}}}$, such that each $\tilde x^{\pi,i}_{t_k}$ no longer depends on $\theta$ for the purposes of backward propagation when calculating the gradient. The paths $(\tilde x^{\pi,i}_{t_n})_{n=0}^{N_{\mathrm{steps}}}$ are used as input to the parametrisation of the abstract hedging strategy $h$; as a result, in Algorithm 1 during this phase of the optimisation, the purpose of the hedging strategy is to reduce the variance of $\mathbb E^{\mathbb Q^N(\theta)}[\Phi^{cv}]$ in order to speed up the convergence of the gradient descent algorithm.
ii) During each epoch, we optimise the parameters $\xi$ of the parametrisation of the hedging strategy while the parameters $\theta$ of the neural SDE are fixed.

In both Algorithms 1 and 2, as well as in the numerical experiments, we use the squared error for the nested loss function: $\ell(x,y) = |x-y|^2$. Furthermore, the calibration of the neural SDE to market derivative prices with robust price bounds for an illiquid derivative in Algorithm 2 is done via constrained optimisation using the method of the augmented Lagrangian [Hestenes, 1969], with the update rule of the Lagrange multipliers specified in Algorithm 3.

2.4. Algorithm for multiple maturities.
Algorithms 1 and 2 calibrate the SDE (5.3) to one set of derivatives. If the derivatives for which we have liquid market prices can be grouped by maturity, as is the case e.g. for call/put prices, we can use a more efficient algorithm to achieve the calibration. This follows the natural approach used e.g. in [Cuchiero et al., 2020], and in [Vidales et al., 2018] in the context of learning PDEs, where the networks for $b(t, X^\theta_t, \theta)$ and $\sigma(t, X^\theta_t, \theta)$ are split

Algorithm 1
Calibration to market European option prices for one maturity
Input: $\pi = \{t_0, t_1, \dots, t_{N_{\mathrm{steps}}}\}$, time grid for the numerical scheme.
Input: option payoffs $(\Phi_i)_{i=1}^{N_{\mathrm{prices}}}$.
Input: market option prices $p(\Phi_j)$, $j = 1, \dots, N_{\mathrm{prices}}$.
Initialisation: $\theta$ for the neural SDE parameters, $N_{\mathrm{trn}}\in\mathbb N$ large.
Initialisation: $\xi$ for the control variate approximation.
for epoch $= 1 : N_{\mathrm{epochs}}$ do
  Generate $N_{\mathrm{trn}}$ paths $(x^{\pi,\theta,i}_{t_n})_{n=0}^{N_{\mathrm{steps}}} := (s^{\pi,\theta,i}_{t_n}, v^{\pi,\theta,i}_{t_n})_{n=0}^{N_{\mathrm{steps}}}$, $i = 1, \dots, N_{\mathrm{trn}}$, using the Euler scheme on (1.1), and create copies $(\tilde x^{\pi,i}_{t_n})_{n=0}^{N_{\mathrm{steps}}} := (\tilde s^{\pi,i}_{t_n}, \tilde v^{\pi,i}_{t_n})_{n=0}^{N_{\mathrm{steps}}}$ such that each $\tilde x^{\pi,i}_{t_k}$ no longer depends on $\theta$, thus $\partial_\theta \tilde x^{\pi,i}_{t_k} = 0$.
  During one epoch: freeze $\xi$ and use Adam (see [Kingma and Ba, 2014]) to update $\theta$, where
  $\theta = \widehat{\arg\min}_\theta \sum_{j=1}^{N_{\mathrm{prices}}} \Big|\mathbb E^{N_{\mathrm{trn}}}\Big[\Phi_j(X^{\pi,\theta}) - \sum_{k=0}^{N_{\mathrm{steps}}-1} \bar h(t_k, \tilde X^{\pi}_{t_k}, \xi_j)\,\Delta\tilde{\bar S}^{\pi}_{t_k}\Big] - p(\Phi_j)\Big|^2$
  and where $\mathbb E^{N_{\mathrm{trn}}}$ denotes the empirical expected value calculated on the $N_{\mathrm{trn}}$ paths.
  During one epoch: freeze $\theta$ and use Adam to update $\xi$ by optimising the sample variance
  $\xi = \widehat{\arg\min}_\xi \sum_{j=1}^{N_{\mathrm{prices}}} \operatorname{Var}^{N_{\mathrm{trn}}}\Big[\Phi_j(X^{\pi,\theta}) - \sum_{k=0}^{N_{\mathrm{steps}}-1} \bar h(t_k, X^{\pi,\theta}_{t_k}, \xi_j)\,\Delta\bar S^{\pi,\theta}_{t_k}\Big]$
end for
return $\theta$, $\xi_j$ for all prices $(\Phi_i)_{i=1}^{N_{\mathrm{prices}}}$.

into different networks, one per maturity. Let $\theta = (\theta_1, \dots, \theta_{N_m})$, where $N_m$ is the number of maturities, and let

(2.8)
$b(t, X^\theta_t, \theta) := \mathbb 1_{t\in[T_{i-1}, T_i]}(t)\, b_i(t, X^\theta_t, \theta_i), \quad i\in\{1, \dots, N_m\},$
$\sigma(t, X^\theta_t, \theta) := \mathbb 1_{t\in[T_{i-1}, T_i]}(t)\, \sigma_i(t, X^\theta_t, \theta_i), \quad i\in\{1, \dots, N_m\},$

with each $b_i$ and $\sigma_i$ a feed-forward neural network. Regarding the SDE parametrisation, we fit feed-forward neural networks (see Appendix C) to the diffusion of the SDE of the price process under the risk-neutral measure.
In the particular case where we calibrate the neural SDE to market data without imposing any bounds on the resulting exotic option prices, one can then do incremental learning as follows:
(1) Consider the first maturity $T_i$ with $i = 1$.
(2) Calibrate the SDE using Algorithm 1 to the vanilla prices with maturity $T_i$.
(3) Freeze the parameters of $\sigma_i$, set $i := i + 1$, and go back to the previous step.
The above algorithm is memory efficient, as it only needs to backpropagate through the last maturity in each gradient descent step.
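The split (2.8) is mechanically simple: one parameter set per maturity bucket, with only the bucket containing $t$ active. A minimal sketch (with trivial affine "networks" in place of the feed-forward ones, and made-up maturities, both our own assumptions) shows the selection logic that makes freezing earlier maturities possible:

```python
import numpy as np

maturities = [0.25, 0.5, 1.0]   # T_1 < T_2 < T_3 (illustrative values)
rng = np.random.default_rng(5)

# one (here trivially affine) parameter set theta_i per interval [T_{i-1}, T_i]
params = [(rng.normal(0.0, 0.1), rng.normal(0.2, 0.02)) for _ in maturities]

def sigma(t, s, params):
    """Piecewise-in-time coefficient as in (2.8): only the parameters of the
    bucket containing t are used, so buckets belonging to earlier maturities
    can be frozen once calibrated (the incremental scheme above)."""
    i = next(j for j, Ti in enumerate(maturities) if t <= Ti)
    a, c = params[i]
    return a * s + c

# t = 0.1 falls in the first bucket, t = 0.6 in the third
v1 = sigma(0.1, 1.0, params)
v3 = sigma(0.6, 1.0, params)
```

During incremental calibration, the gradient step for maturity $T_i$ touches only `params[i-1]` in this sketch, which is why backpropagation memory stays constant in the number of maturities.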
Algorithm 2
Calibration to vanilla prices for one maturity with lower bound for exotic price
Input: $\pi = \{t_0, t_1, \dots, t_{N_{\mathrm{steps}}}\}$, time grid for the numerical scheme.
Input: option payoffs $(\Phi_i)_{i=1}^{N_{\mathrm{prices}}}$.
Input: market option prices $p(\Phi_j)$, $j = 1, \dots, N_{\mathrm{prices}}$.
Initialisation: $\theta$ for the neural SDE parameters, $N_{\mathrm{trn}}\in\mathbb N$ large.
Initialisation: $\xi$ for the control variate approximation.
Initialisation: $\lambda, c$ for the augmented Lagrangian algorithm for constrained optimisation.
for epoch $= 1 : N_{\mathrm{epochs}}$ do
  Generate $N_{\mathrm{trn}}$ paths $(x^{\pi,\theta,i}_{t_n})_{n=0}^{N_{\mathrm{steps}}} := (s^{\pi,\theta,i}_{t_n}, v^{\pi,\theta,i}_{t_n})_{n=0}^{N_{\mathrm{steps}}}$, $i = 1, \dots, N_{\mathrm{trn}}$, using the Euler-type scheme on (1.1), and create copies $(\tilde x^{\pi,i}_{t_n})_{n=0}^{N_{\mathrm{steps}}} := (\tilde s^{\pi,i}_{t_n}, \tilde v^{\pi,i}_{t_n})_{n=0}^{N_{\mathrm{steps}}}$ such that each $\tilde x^{\pi,i}_{t_k}$ no longer depends on $\theta$, thus $\partial_\theta \tilde x^{\pi,i}_{t_k} = 0$.
  During one epoch: freeze $\xi$ and use Adam to find $\theta$, where
  $f(\theta) := \mathbb E^{N_{\mathrm{trn}}}\Big[\Psi(X^{\pi,\theta}) - \sum_{k=0}^{N_{\mathrm{steps}}-1} \bar h\big(t_k, (\tilde X^{\pi}_{t_k\wedge t_j})_{j=0}^{N_{\mathrm{steps}}}, \xi_\Psi\big)\,\Delta\tilde{\bar S}^{\pi}_{t_k}\Big],$
  $h(\theta) := \sum_{j=1}^{N_{\mathrm{prices}}} \Big|\mathbb E^{N_{\mathrm{trn}}}\Big[\Phi_j(X^{\pi,\theta}) - \sum_{k=0}^{N_{\mathrm{steps}}-1} \bar h(t_k, \tilde X^{\pi}_{t_k}, \xi_j)\,\Delta\tilde{\bar S}^{\pi}_{t_k}\Big] - p(\Phi_j)\Big|^2,$
  $\theta = \widehat{\arg\min}_\theta\ f(\theta) + \lambda h(\theta) + c\,(h(\theta))^2,$
  and where $\mathbb E^{N_{\mathrm{trn}}}$ denotes the empirical expected value calculated on the $N_{\mathrm{trn}}$ paths.
  During one epoch: freeze $\theta$ and use Adam to update $\xi$,
  $\xi = \widehat{\arg\min}_\xi \sum_{j=1}^{N_{\mathrm{prices}}} \operatorname{Var}^{N_{\mathrm{trn}}}\Big[\Phi_j(X^{\pi,\theta}) - \sum_{k=0}^{N_{\mathrm{steps}}-1} \bar h(t_k, X^{\pi,\theta}_{t_k}, \xi_j)\,\Delta\bar S^{\pi,\theta}_{t_k}\Big] + \operatorname{Var}^{N_{\mathrm{trn}}}\Big[\Psi(X^{\pi,\theta}) - \sum_{k=0}^{N_{\mathrm{steps}}-1} \bar h\big(t_k, (X^{\pi,\theta}_{t_k\wedge t_j})_{j=0}^{N_{\mathrm{steps}}}, \xi_\Psi\big)\,\Delta\bar S^{\pi,\theta}_{t_k}\Big]$
  Every 50 updates of $\theta$: update $\lambda, c$ using Algorithm 3
end for
return $\theta, \xi$.

3. ANALYSIS OF THE STOCHASTIC APPROXIMATION ALGORITHM FOR THE CALIBRATION PROBLEM
3.1. Classical stochastic gradient.
First, let us review the basics of the stochastic gradient algorithm. Let $H : \Omega \times \Theta \to \mathbb{R}^d$. Consider the following optimisation problem:
$\min_{\theta \in \Theta} h(\theta)$, where $h(\theta) := \mathbb{E}[H(\theta)]$.

Algorithm 3 Augmented Lagrangian parameters update
Input: $\lambda > 0$, $c > 0$.
Input: $f(\theta)$, the approximated exotic price with the current values of $\theta$.
Input: $\mathrm{MSE}(\theta)$, the MSE of the calibration to vanilla prices with the current values of $\theta$.
Update $\lambda := \lambda + c\, \mathrm{MSE}(\theta)$.
Update $c := 2c$.
return $c, \lambda$.

Notice that the minimisation task (2.1) does not fit this pattern, as in our case the expectation is inside $\ell$. Nevertheless, we know that the classical gradient algorithm, with learning rates $(\eta_k)_{k=1}^\infty$, $\eta_k > 0$ for all $k$, applied to this optimisation problem is given by
$\theta_{k+1} = \theta_k - \eta_k\, \partial_\theta\big(\mathbb{E}[H(\theta_k)]\big)$.
Under suitable conditions on $H$ and on $\eta_k$, it is known that $\theta_k$ converges to a minimiser of $h$; see [Benveniste et al., 2012]. As $\mathbb{E}[H(\theta_k)]$ can rarely be computed explicitly, the above algorithm is not practical and is replaced with stochastic gradient descent (SGD), given by
$\theta_{k+1} = \theta_k - \frac{\eta_k}{N} \sum_{i=1}^{N} \partial_\theta H^i(\theta_k)$,
where $(H^i(\theta))_{i=1}^{N}$ are independent samples from the distribution of $H(\theta)$ and $N \in \mathbb{N}$ is the size of the mini-batch; in particular $N$ could be one. The choice of a "good" estimator for $\mathbb{E}[H(\theta)]$ in the context of stochastic gradient algorithms is an active research area; see e.g. [Majka et al., 2020]. When the estimator of the gradient is unbiased, SGD can be shown to converge to a minimum of $h$ [Benveniste et al., 2012].

3.2. Stochastic algorithm for the calibration problem.
Recall that our overall objective in calibration is to minimise some $J = J(\theta)$ given by
$J(\theta) = \sum_{i=1}^{M} \ell\big(\mathbb{E}^{\mathbb{Q}(\theta)}[\Phi_i^{cv}],\, p(\Phi_i)\big)$.
We write $X^\theta := (X^\theta_t)_{t \in [0,T]}$ and note that $\mathbb{E}^{\mathbb{Q}(\theta)}[\Phi_i] = \mathbb{E}[\Phi_i(X^\theta)]$. Noting that in the calibration part of (2.1)–(2.5) the term $\bar h(s, (\tilde X_{s \wedge t})_{t \in [0,T]}, \xi)\, d\tilde{\bar S}_s$ is fixed, we have $\mathbb{E}^{\mathbb{Q}}[\partial_\theta \Phi_i^{cv}(X^\theta)] = \mathbb{E}^{\mathbb{Q}}[\partial_\theta \Phi_i(X^\theta)]$. We differentiate $J = J(\theta)$ and work with the pathwise representation of this derivative (using language from [Glasserman, 2013]). For that we impose the following assumption. Assumption 3.1.
We assume that the payoffs $G := (\Psi, \Phi)$, $G : C([0,T], \mathbb{R}^d) \to \mathbb{R}$, are such that
$\partial_\theta \mathbb{E}^{\mathbb{Q}}\big[G(X^\theta)\big] = \mathbb{E}^{\mathbb{Q}}\big[\partial_\theta G(X^\theta)\big]$.
We refer the reader to [Glasserman, 2013, Chapter 7] for exact conditions under which exchanging integration and differentiation is possible. We also remark that for payoffs for which Assumption 3.1 does not hold, one can use the likelihood ratio method and, more generally, the Malliavin weights approach for computing Greeks [Fournié et al., 1999]. We do not pursue this here for simplicity. Writing $\ell = \ell(x,y)$, applying Assumption 3.1 and noting that $\mathbb{E}^{\mathbb{Q}(\theta)}[\Phi_i] = \mathbb{E}[\Phi_i(X^\theta)]$, we see that
$\partial_\theta J(\theta) = \sum_{i=1}^{M} (\partial_x \ell)\big(\mathbb{E}^{\mathbb{Q}(\theta)}[\Phi_i^{cv}], p(\Phi_i)\big)\, \partial_\theta \mathbb{E}^{\mathbb{Q}(\theta)}\big[\Phi_i^{cv}\big] = \sum_{i=1}^{M} (\partial_x \ell)\big(\mathbb{E}[\Phi_i^{cv}(X^\theta)], p(\Phi_i)\big)\, \mathbb{E}\big[\partial_\theta \Phi_i(X^\theta)\big].$
Hence, if we wish to update $\theta$ to some $\tilde\theta$ in such a way that $J$ is decreased, then we need to take (for some $\gamma > 0$)
$\tilde\theta = \theta - \gamma \sum_{i=1}^{M} (\partial_x \ell)\big(\mathbb{E}[\Phi_i^{cv}(X^\theta)], p(\Phi_i)\big)\, \mathbb{E}\big[\partial_\theta \Phi_i(X^\theta)\big],$
so that
$\frac{d}{d\varepsilon} J\big(\theta + \varepsilon(\tilde\theta - \theta)\big)\Big|_{\varepsilon=0} = \sum_{i=1}^{M} (\partial_x \ell)\big(\mathbb{E}[\Phi_i^{cv}(X^\theta)], p(\Phi_i)\big)\, \mathbb{E}\big[\partial_\theta \Phi_i(X^\theta)\big]\, (\tilde\theta - \theta) = -\gamma \Big| \sum_{i=1}^{M} (\partial_x \ell)\big(\mathbb{E}[\Phi_i^{cv}(X^\theta)], p(\Phi_i)\big)\, \mathbb{E}\big[\partial_\theta \Phi_i(X^\theta)\big] \Big|^2 \le 0.$
If we had one network for each time step, leading to a ResNet-like network architecture for the time discretisation, then it may be more efficient to use a backward equation representation in the training. This representation can be derived using similar analysis as in [Jabir et al., 2019] (see also [Šiška and Szpruch, 2020]). Since the summation plays effectively no role in the further analysis, we will assume, without loss of generality, that $M = 1$ and work with the objective
$h(\theta) = \ell\big(\mathbb{E}^{\mathbb{Q}(\theta)}[\Phi^{cv}],\, p(\Phi)\big).$
Then in the gradient step update we have
$\partial_\theta h(\theta) = \partial_x \ell\big(\mathbb{E}^{\mathbb{Q}}[\Phi^{cv}(X^\theta)], p(\Phi)\big)\, \mathbb{E}^{\mathbb{Q}}[\partial_\theta \Phi(X^\theta)].$
Since $\ell$ is typically not the identity function, a mini-batch estimator of $\partial_\theta h(\theta)$, obtained by replacing $\mathbb{Q}$ with $\mathbb{Q}^N$ and given by
$\partial_\theta h^N(\theta) := \partial_x \ell\big(\mathbb{E}^{\mathbb{Q}^N}[\Phi^{cv}(X^\theta)], p(\Phi)\big)\, \mathbb{E}^{\mathbb{Q}^N}[\partial_\theta \Phi(X^\theta)],$
is a biased estimator of $\partial_\theta h$. Nonetheless, the bias can be estimated in terms of the number of samples $N$ and the variance. The fact that the bias is controlled by the variance justifies why it is important to reduce the variance when calibrating the models with a stochastic gradient algorithm. An alternative perspective is to view $\partial_\theta h^N(\theta)$ as a non-linear function of $\mathbb{Q}^N$. It turns out that there is a general theory studying smoothness and corresponding expansions of such functions of measures, and we refer the reader to [Chassagneux et al., 2019] for more details. In Appendix A we provide a result on the bias for a general loss function. For the square loss function the bias is given below.

Theorem 3.2.
Let Assumption 3.1 hold. Consider the family of neural SDEs (1.1). For $\ell(x,y) = |x - y|^2$, we have
$\Big| \mathbb{E}^{\mathbb{Q}}\big[\partial_\theta h^N(\theta)\big] - \partial_\theta h(\theta) \Big| \le \frac{2}{N} \big(\mathrm{Var}^{\mathbb{Q}}[\Phi^{cv}(X^\theta)]\big)^{1/2} \big(\mathrm{Var}^{\mathbb{Q}}[\partial_\theta \Phi(X^\theta)]\big)^{1/2}.$
This is an immediate consequence of Theorem A.1. □
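The bias and its size can be illustrated numerically on a toy construction of ours (not the neural SDE): take $X \sim \mathcal N(\theta, 1)$, payoff $\Phi(X) = X^2$ with pathwise derivative $\partial_\theta \Phi = 2X$, and the square loss with $p = \mathbb{E}[\Phi]$ so that the true gradient is zero. The same-batch estimator then has bias $\tfrac{2}{N}\mathrm{Cov}(\Phi, \partial_\theta\Phi) = 8\theta/N$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, p, N, M = 1.0, 2.0, 10, 200_000   # p = E[X^2], so the true gradient is 0

X = rng.normal(theta, 1.0, size=(M, N))  # M mini-batches of size N
phi, dphi = X**2, 2.0 * X                # payoff and pathwise derivative

# Same-batch (biased) estimator of d/dtheta of (E[phi] - p)^2:
est = 2.0 * (phi.mean(axis=1) - p) * dphi.mean(axis=1)

true_grad = 2.0 * ((theta**2 + 1.0) - p) * (2.0 * theta)   # = 0 here
empirical_bias = est.mean() - true_grad
theoretical_bias = 8.0 * theta / N   # (2/N) Cov(X^2, 2X) for X ~ N(theta, 1)
```

The empirical bias matches the $O(1/N)$ theory, and it shrinks when the variance of the payoff estimator is reduced, which is the mechanism exploited by the control variates.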
Hence, we see that by reducing the variance of the first term we are also reducing the bias of the gradient. This justifies the superiority of learning task (2.5) over (2.2).

4. ANALYSIS OF THE RANDOMISED TRAINING
4.1. Case of general cost function $\ell$. While the idea of calibrating to one maturity at a time described in Section 2.4 works well if our aim is only to calibrate to vanilla options, it cannot be directly applied to learn robust bounds for path-dependent derivatives, see (2.2). This is because the payoff of a path-dependent derivative, in general, is not an affine function of maturity. On the other hand, training the neural networks at all maturities at once makes every step of the gradient algorithm used for training computationally heavy.

In what follows we introduce a randomisation of the gradient, so that at each step of the gradient algorithm the derivatives with respect to the network parameters are computed at only one maturity at a time, while keeping the parameters at all other maturities unchanged. This is similar to the popular dropout method, see [Srivastava et al., 2014], which is known to help with overfitting when training deep neural networks, but for us the main aim is computational efficiency. Recall how we split the networks for drift and diffusion:
$b(t, X^\theta_t, \theta) := \mathbb{1}_{[T_{i-1}, T_i]}(t)\, b_i(t, X^\theta_t, \theta_i), \quad i \in \{1, \ldots, N_m\},$
$\sigma(t, X^\theta_t, \theta) := \mathbb{1}_{[T_{i-1}, T_i]}(t)\, \sigma_i(t, X^\theta_t, \theta_i), \quad i \in \{1, \ldots, N_m\}. \quad (2.8')$
Let $U \sim \mathcal U\{1, \ldots, N_m\}$ be a uniform random variable over the set $\{1, \ldots, N_m\}$ defined on a new probability space $(\Omega^U, \mathcal F^U, (\mathcal F^U_t)_{t \in [0,T]}, \mathbb P^U)$. Let $Z$ be given by
$dZ^\theta_t(U) = \Big( \sum_{i=1}^{N_m} \mathbb{1}_{[T_{i-1},T_i]}(t)\, \partial_x b_i(t, X^\theta_t, \theta_i)\, Z^\theta_t(U) + N_m\, \mathbb{1}_{[T_{U-1},T_U]}(t)\, \partial_{\theta_U} b_U(t, X^\theta_t, \theta_U) \Big)\, dt + \Big( \sum_{i=1}^{N_m} \mathbb{1}_{[T_{i-1},T_i]}(t)\, \partial_x \sigma_i(t, X^\theta_t, \theta_i)\, Z^\theta_t(U) + N_m\, \mathbb{1}_{[T_{U-1},T_U]}(t)\, \partial_{\theta_U} \sigma_U(t, X^\theta_t, \theta_U) \Big)\, dW_t,$
where $b_U, \sigma_U$ are simply the neural networks selected by the random index $U$.
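Before stating the result, the unbiasedness mechanism can be previewed on a finite-dimensional analogue (toy numbers of ours): sampling a single maturity's gradient block with probability $1/N_m$ and scaling it by $N_m$ recovers the full gradient in expectation.

```python
import numpy as np

rng = np.random.default_rng(1)
grads = np.array([0.3, -1.2, 0.7, 2.0])   # per-maturity gradient blocks
Nm = len(grads)

U = rng.integers(0, Nm, size=200_000)     # U ~ Uniform{1, ..., Nm}
randomised = Nm * grads[U]                # touch only maturity U, scale by Nm

full_gradient = grads.sum()               # training all maturities at once
```

Each randomised step updates only one maturity's parameters, but averaged over many steps the estimator agrees with the full gradient.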
Theorem 4.1.
Assume that $\partial_x[b,\sigma](t, \cdot, \theta)$ exists and is bounded for $(t,\theta)$ fixed and that $\partial_\theta[b,\sigma](t, x, \cdot)$ exists and is bounded for $(t,x)$ fixed. Let $h(\theta) = \ell\big(\mathbb{E}^{\mathbb{Q}(\theta)}[\phi(X^\theta_t)], p(\Phi)\big)$ and let its randomised gradient be
$(\partial_\theta h)(\theta, U) = \partial_x \ell\big(\mathbb{E}^{\mathbb{Q}(\theta)}[\phi(X^\theta_t)], p(\Phi)\big)\, \mathbb{E}^{\mathbb{Q}(\theta)}\big[(\partial_x \phi)(X^\theta_t)\, Z^\theta_t(U)\big].$
Then $\mathbb{E}^U[(\partial_\theta h)(\theta, U)] = (\partial_\theta h)(\theta)$; in other words, the randomised gradient is an unbiased estimator of $(\partial_\theta h)(\theta)$.

Less stringent assumptions on the derivatives of $b$ and $\sigma$ are possible, but we do not want to overburden the present article with technical details.

Proof.
It is well known, e.g. [Krylov, 1999, Kunita, 1997], that
$d(\partial_\theta X^\theta_t) = \sum_{i=1}^{N_m} \mathbb{1}_{[T_{i-1},T_i]}(t) \Big[ \big( \partial_x b_i(t, X^\theta_t, \theta_i)\, \partial_\theta X^\theta_t + \partial_{\theta_i} b_i(t, X^\theta_t, \theta_i) \big)\, dt + \big( \partial_x \sigma_i(t, X^\theta_t, \theta_i)\, \partial_\theta X^\theta_t + \partial_{\theta_i} \sigma_i(t, X^\theta_t, \theta_i) \big)\, dW_t \Big].$
Let $U \sim \mathcal U\{1, \ldots, N_m\}$ be a uniform random variable over the set $\{1, \ldots, N_m\}$ defined on a new probability space $(\Omega^U, \mathcal F^U, (\mathcal F^U_t)_{t \in [0,T]}, \mathbb P^U)$. We introduce the process $Z$ as follows:
$dZ^\theta_t(U) = \Big( \sum_{i=1}^{N_m} \mathbb{1}_{[T_{i-1},T_i]}(t)\, \partial_x b_i(t, X^\theta_t, \theta_i)\, Z^\theta_t(U) + N_m\, \mathbb{1}_{[T_{U-1},T_U]}(t)\, \partial_{\theta_U} b_U(t, X^\theta_t, \theta_U) \Big)\, dt + \Big( \sum_{i=1}^{N_m} \mathbb{1}_{[T_{i-1},T_i]}(t)\, \partial_x \sigma_i(t, X^\theta_t, \theta_i)\, Z^\theta_t(U) + N_m\, \mathbb{1}_{[T_{U-1},T_U]}(t)\, \partial_{\theta_U} \sigma_U(t, X^\theta_t, \theta_U) \Big)\, dW_t.$
Note that
$\mathbb{E}^U\big[N_m\, \mathbb{1}_{[T_{U-1},T_U]}(t)\, \partial_{\theta_U} b_U(t, X^\theta_t, \theta_U)\big] = \sum_{i=1}^{N_m} \mathbb{1}_{[T_{i-1},T_i]}(t)\, \partial_{\theta_i} b_i(t, X^\theta_t, \theta_i)$
and
$\mathbb{E}^U\big[N_m\, \mathbb{1}_{[T_{U-1},T_U]}(t)\, \partial_{\theta_U} \sigma_U(t, X^\theta_t, \theta_U)\big] = \sum_{i=1}^{N_m} \mathbb{1}_{[T_{i-1},T_i]}(t)\, \partial_{\theta_i} \sigma_i(t, X^\theta_t, \theta_i).$
Now, using a Fubini-type theorem for conditional expectation [Hammersley et al., 2019, Lemma A5], we have
$d\, \mathbb{E}^U\big[Z^\theta_t(U)\big] = \Big( \sum_{i=1}^{N_m} \mathbb{1}_{[T_{i-1},T_i]}(t) \big( \partial_x b_i(t, X^\theta_t, \theta_i)\, \mathbb{E}^U[Z^\theta_t(U)] + \partial_{\theta_i} b_i(t, X^\theta_t, \theta_i) \big) \Big)\, dt + \Big( \sum_{i=1}^{N_m} \mathbb{1}_{[T_{i-1},T_i]}(t) \big( \partial_x \sigma_i(t, X^\theta_t, \theta_i)\, \mathbb{E}^U[Z^\theta_t(U)] + \partial_{\theta_i} \sigma_i(t, X^\theta_t, \theta_i) \big) \Big)\, dW_t.$
Hence the process $\mathbb{E}^U[Z^\theta_t(U)]$ solves the same linear equation as $\partial_\theta X^\theta$. As the equation has a unique solution, we conclude that
(4.1) $\mathbb{E}^U[Z^\theta_t(U)] = \partial_\theta X^\theta_t.$
Recall that $h(\theta) = \ell\big(\mathbb{E}^{\mathbb{Q}(\theta)}[\phi(X^\theta_t)], p(\Phi)\big)$, and so
$(\partial_\theta h)(\theta) = \partial_x \ell\big(\mathbb{E}^{\mathbb{Q}(\theta)}[\phi(X^\theta_t)], p(\Phi)\big)\, \mathbb{E}^{\mathbb{Q}(\theta)}\big[(\partial_x \phi)(X^\theta_t)\, \partial_\theta X^\theta_t\big].$
Recall the randomised gradient
$(\partial_\theta h)(\theta, U) = \partial_x \ell\big(\mathbb{E}^{\mathbb{Q}(\theta)}[\phi(X^\theta_t)], p(\Phi)\big)\, \mathbb{E}^{\mathbb{Q}(\theta)}\big[(\partial_x \phi)(X^\theta_t)\, Z^\theta_t(U)\big].$
Note that due to (4.1),
$\mathbb{E}^U\big[\mathbb{E}^{\mathbb{Q}(\theta)}[(\partial_x \phi)(X^\theta_t)\, Z^\theta_t(U)]\big] = \mathbb{E}^{\mathbb{Q}(\theta)}\big[(\partial_x \phi)(X^\theta_t)\, \partial_\theta X^\theta_t\big].$
This implies that $\mathbb{E}^U[(\partial_\theta h)(\theta, U)] = (\partial_\theta h)(\theta)$. □

4.2. Case of square loss function $\ell$. Here we show that, in the special case when $\ell(x,y) = |x - y|^p$, the randomised gradient described in Section 4.1 is an unbiased estimator of the full gradient even in the case when $\mathbb{Q}$ is replaced by its empirical measure $\mathbb{Q}_N$. Consequently, standard theory on stochastic approximation applies. We base the presentation on the case $p = 2$, as the general case works in exactly the same way. Let $(\bar\Omega, \bar{\mathcal F}, (\bar{\mathcal F}_t)_{t \in [0,T]}, \bar{\mathbb P})$ be a copy of $(\Omega, \mathcal F, (\mathcal F_t)_{t \in [0,T]}, \mathbb P)$. Then we write
$h^N(\theta) := \big(\mathbb{E}^{\mathbb{Q}_N(\theta)}[\phi(X^\theta_t)] - p(\Phi)\big)^2 = \big(\mathbb{E}^{\bar{\mathbb{Q}}_N(\theta)}[\phi(\bar X^\theta_t)] - p(\Phi)\big)\big(\mathbb{E}^{\mathbb{Q}_N(\theta)}[\phi(X^\theta_t)] - p(\Phi)\big).$
See also [Cuchiero et al., 2020] for the same observation. The gradient of $h^N$ is given by
$(\partial_\theta h^N)(\theta) = \big(\mathbb{E}^{\bar{\mathbb{Q}}_N(\theta)}[\phi(\bar X^\theta_t)] - p(\Phi)\big)\, \mathbb{E}^{\mathbb{Q}_N(\theta)}\big[(\partial_x \phi)(X^\theta_t)\, \partial_\theta X^\theta_t\big] + \mathbb{E}^{\bar{\mathbb{Q}}_N(\theta)}\big[(\partial_x \phi)(\bar X^\theta_t)\, \partial_\theta \bar X^\theta_t\big]\, \big(\mathbb{E}^{\mathbb{Q}_N(\theta)}[\phi(X^\theta_t)] - p(\Phi)\big).$
Equivalently, one may use
$(\partial_\theta h^N)(\theta) = 2\big(\mathbb{E}^{\bar{\mathbb{Q}}_N(\theta)}[\phi(\bar X^\theta_t)] - p(\Phi)\big)\, \mathbb{E}^{\mathbb{Q}_N(\theta)}\big[(\partial_x \phi)(X^\theta_t)\, \partial_\theta X^\theta_t\big].$
To implement the above algorithm one simply needs to generate two independent sets of samples. Furthermore,
$(\partial_\theta h^N)(\theta, U) = 2\big(\mathbb{E}^{\bar{\mathbb{Q}}_N(\theta)}[\phi(\bar X^\theta_t)] - p(\Phi)\big)\, \mathbb{E}^{\mathbb{Q}_N(\theta)}\big[(\partial_x \phi)(X^\theta_t)\, Z^\theta_t(U)\big]$
is an unbiased estimator of $(\partial_\theta h^N)(\theta)$.
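The two-independent-batches trick can be checked numerically on a toy construction of ours: $X \sim \mathcal N(\theta, 1)$, $\phi(X) = X^2$, $p = \mathbb{E}[\phi(X)]$ so that the true gradient is zero. Using an independent copy for the loss factor removes the covariance term that biases the same-batch estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, p, N, M = 1.0, 2.0, 10, 200_000

X  = rng.normal(theta, 1.0, size=(M, N))   # samples under Q_N
Xb = rng.normal(theta, 1.0, size=(M, N))   # independent copy, for Q-bar_N

# Two-sample estimator 2 (E-bar[phi] - p) E[dphi]: unbiased by independence.
two_sample = 2.0 * ((Xb**2).mean(axis=1) - p) * (2.0 * X).mean(axis=1)
# Same-batch estimator for comparison: biased by (2/N) Cov(phi, dphi) = 0.8.
same_batch = 2.0 * ((X**2).mean(axis=1) - p) * (2.0 * X).mean(axis=1)
```

The cost of the unbiased version is one extra set of simulated paths per gradient step.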
5. TESTING NEURAL SDE CALIBRATIONS
All algorithms were implemented using PyTorch, see [Paszke et al., 2017] and [Paszke et al., 2019]. The code used is available at github.com/msabvid/robust_nsde. Our target data (European option prices for various strikes and maturities) is described in Appendix B. We assume that there is one traded asset $S = (S_t)_{t \in [0,T]}$. We calibrate to European option prices
$p(\Phi) := \mathbb{E}^{\mathbb{Q}(\theta)}[\Phi] = e^{-rT}\, \mathbb{E}^{\mathbb{Q}(\theta)}\big[(S_T - K)^+ \,\big|\, S_0 = 1\big]$
for maturities of 2, 4, \ldots, 12 months and typically uniformly spaced strikes (see Appendix B). As an example of an illiquid derivative for which we wish to find robust bounds we take the lookback option
$p(\Psi) := \mathbb{E}^{\mathbb{Q}(\theta)}[\Psi] = e^{-rT}\, \mathbb{E}^{\mathbb{Q}(\theta)}\Big[\max_{t \in [0,T]} S_t - S_T \,\Big|\, S_0 = 1\Big].$

5.1. Local volatility neural SDE model.
In this section we consider a local volatility (LV) neural SDE model. It has been shown by [Dupire et al., 1994] (see also [Gyöngy, 1986]) that if the market data consisted of a continuum of call/put prices for all strikes and maturities, then there is a unique function $\sigma$ such that with the price process
(5.1) $dS_t = r S_t\, dt + S_t\, \sigma(t, S_t)\, dW_t, \quad S_0 = 1,$
the model prices and market prices match exactly. In practice only some call/put prices are liquid in the market, and so to apply [Dupire et al., 1994] one has to interpolate the missing data in an arbitrage-free way. The choice of interpolation method is a further modelling choice on top of the one already made by postulating that the risky asset evolution is governed by (5.1). We will use a neural SDE instead of directly interpolating the missing data. Let our LV neural SDE model be given by
(5.2) $dS^\theta_t = r S^\theta_t\, dt + \sigma(t, S^\theta_t, \theta)\, S^\theta_t\, dW^{\mathbb{Q}}_t,$
where $S^\theta_t \ge 0$, $S^\theta_0 = 1$, and $\sigma : [0,T] \times \mathbb{R} \times \mathbb{R}^p \to \mathbb{R}^+$ allows us to calibrate the model to observed market prices.

5.2. Local stochastic volatility neural SDE model.
In this section we consider a local stochastic volatility (LSV) neural SDE model; see, for example, [Tian et al., 2015]. As in the LV neural SDE model (5.2), we have the risky asset price process $(S_t)_{t \in [0,T]}$, where the drift is equal to the risk-free bond rate $r$. However, the volatility function in the LSV neural SDE model now depends on $t$, $S_t$ and a stochastic process $(V_t)_{t \in [0,T]}$, which is not a traded asset. The model is then given by
(5.3) $dS_t = r S_t\, dt + \sigma^S(t, S_t, V_t, \nu)\, S_t\, dB^S_t, \quad S_0 = 1,$
$dV_t = b^V(V_t, \phi)\, dt + \sigma^V(V_t, \varphi)\, dB^V_t, \quad V_0 = v_0,$
$d\langle B^S, B^V \rangle_t = \rho\, dt,$
where $\theta := \{\nu, \phi, \varphi, v_0, \rho\}$, $\rho, v_0 \in \mathbb{R}$, is the set of (multi-dimensional) parameters that we aim to optimise so that the model is calibrated to the observed market data.

5.3. Deep learning setting for the LV and LSV neural SDE models.
In the SDE (5.2) the function $\sigma$, and in the SDE (5.3) the functions $\sigma^S$, $b^V$ and $\sigma^V$, are parametrised by one feed-forward neural network per maturity (see Section 2.4 and Appendix C). The non-linear activation function used in each of the hidden layers is the linear rectifier relu. In addition, in $\sigma^S$ and $\sigma^V$ we apply the non-linear rectifier $\mathrm{softplus}(x) = \log(1 + \exp(x))$ after the output layer to ensure a positive output.

The parametrisation of the hedging strategy for the vanilla option prices is also a feed-forward network with 3 hidden layers, 20 neurons per hidden layer and relu activation functions. However, in order to get one hedging strategy per vanilla option considered in the market data, the output of $\bar h(t_k, s^i_{t_k}, \theta, \xi_{K_j})$ has as many neurons as there are strikes and maturities. Finally, the parametrisation of the hedging strategy for the exotic option price is also a feed-forward network with 3 hidden layers, 20 neurons per hidden layer and relu activation functions.

The neural SDEs (5.2) and (5.3) were discretised using the tamed Euler scheme (2.7) with uniform time steps for $T = 1$ year. In each stochastic gradient descent iteration a fixed number of Monte Carlo trajectories was used, and the abstract hedging strategy was used as a control variate. Finally, in the evaluation of the calibrated neural SDE, the option prices are calculated using $N = 4 \times 10^5$ trajectories of the calibrated neural SDEs, generated from $2 \times 10^5$ Brownian paths and their antithetic paths; in addition, we also used the learned hedging strategies to calculate the Monte Carlo estimators $\mathbb{E}^{\mathbb{Q}(\theta)}[\Phi^{cv}_i]$ and $\mathbb{E}^{\mathbb{Q}(\theta)}[\Psi^{cv}]$ with lower variance.

5.4. Conclusions from calibrating the LV neural SDE.
Each calibration is run ten times with different initialisations of the network parameters, with the goal of checking the robustness of the exotic option price $\mathbb{E}^{\mathbb{Q}(\theta)}[\Psi]$ across the calibrated neural SDEs. The blue boxplots in Figure 5.1 provide different quantiles of the exotic option price $\mathbb{E}^{\mathbb{Q}(\theta)}[\Psi]$ and the bounds obtained after running all the experiments. We make the following observations from calibrating the LV neural SDE:
i) It is possible to obtain a high accuracy of calibration across maturities when the only target is to fit the market data. If we are minimising/maximising the illiquid derivative price at the same time, then the MSE increases somewhat; see Figure 5.1. The calibration has been performed using $K = 21$ strikes.
ii) The calibration is accurate not only in MSE on prices but also on the individual implied volatility curves, see Figure 5.2 and others in Appendix D.
iii) As we increase the number of strikes per maturity, the range of possible values for the illiquid derivative narrows; see Figure F.1 and Tables 1, 2, 3 and 4. The conjecture is that as the number of strikes (and maturities) increases to infinity, we would recover the unique $\sigma$ given by the Dupire formula that fits the continuum of European option prices.
iv) With a limited amount of market data (which is closer to practical applications), even the LV neural SDE produces noticeable ranges for prices of illiquid derivatives, see again Figure 5.1.
In Appendix F we provide more details on how different random seeds, different constrained optimisation algorithms and different numbers of strikes used in the market data input affect the illiquid derivative price.
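As an aside on the evaluation setup of Section 5.3 (Monte Carlo pricing from Brownian paths and their antithetic counterparts), the antithetic mechanism can be sketched in a Black-Scholes stand-in model (our toy parameters; the paper applies the same idea to the calibrated neural SDE paths):

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, T, n = 0.2, 1.0, 100_000

W = rng.normal(0.0, np.sqrt(T), size=n)
S_plus  = np.exp(sigma * W - 0.5 * sigma**2 * T)    # terminal values from W
S_minus = np.exp(-sigma * W - 0.5 * sigma**2 * T)   # antithetic values from -W

def payoff(s):
    return np.maximum(s - 1.0, 0.0)                 # ATM call, K = 1, r = 0

plain      = payoff(np.concatenate([S_plus, S_minus]))
antithetic = 0.5 * (payoff(S_plus) + payoff(S_minus))  # pairwise average
```

The pairwise averages have the same mean as the pooled samples but a markedly lower variance, because the payoff is monotone in the Brownian driver.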
FIGURE 5.1. Boxplots of the lookback option price (fits with lower bound, without bounds, and with upper bound) and the calibration MSE for the LV neural SDE, for maturities T = 8, 10 and 12 months.
5.5. Conclusions from calibrating the LSV neural SDE.
Each calibration is run ten times with different initialisations of the network parameters, with the goal of checking the robustness of the exotic option price $\mathbb{E}^{\mathbb{Q}(\theta)}[\Psi]$ across the calibrated neural SDEs. The blue boxplots in Figure 5.3 provide different quantiles of the exotic option price $\mathbb{E}^{\mathbb{Q}(\theta)}[\Psi]$ and the bounds obtained after running all the experiments. We make the following observations from calibrating the LSV neural SDE:
i) Our methods achieve high calibration accuracy to the market data (measured by MSE) with consistent bounds on the exotic option prices. See Figure 5.3.
ii) The calibration is accurate not only in MSE on prices but also on the individual implied volatility curves, see Figure 5.4 and others in Appendix E.
iii) The LSV neural SDE produces noticeable ranges for prices of illiquid derivatives, see again Figure 5.3.
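The LSV dynamics (5.3) used in these experiments can be simulated with a basic Euler sketch (hedged: plain NumPy, an untamed scheme, and generic Python functions standing in for the networks $\sigma^S$, $b^V$, $\sigma^V$; the paper uses the tamed scheme (2.7) with PyTorch networks):

```python
import numpy as np

def simulate_lsv(sigma_S, b_V, sigma_V, r, rho, v0, T, n_steps, n_paths, rng):
    # Euler scheme for the LSV system (5.3) with d<B^S, B^V>_t = rho dt.
    dt = T / n_steps
    S = np.ones(n_paths)
    V = np.full(n_paths, v0)
    for k in range(n_steps):
        Z1 = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        Z2 = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        dBS = Z1
        dBV = rho * Z1 + np.sqrt(1.0 - rho**2) * Z2   # correlated increment
        t = k * dt
        S = S + r * S * dt + sigma_S(t, S, V) * S * dBS
        V = V + b_V(V) * dt + sigma_V(V) * dBV
    return S, V

rng = np.random.default_rng(4)
# Sanity check: with constant sigma_S and a frozen V the model reduces to
# Black-Scholes; under r = 0 the Euler iterates for S form a martingale.
S_T, _ = simulate_lsv(lambda t, s, v: 0.2, lambda v: 0.0, lambda v: 0.0,
                      r=0.0, rho=-0.5, v0=0.04, T=1.0,
                      n_steps=50, n_paths=100_000, rng=rng)
```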
FIGURE 5.2. Implied volatility curves of the calibrated neural SDE vs. market data, and implied volatility errors (in bps), for maturities T = 2, 4, 6, 8, 10 and 12 months.
5.6. Hedging strategy evaluation.
We calculate the error of the portfolio hedging strategy of the lookback option at maturity $T = 6$ months, given by the empirical variance
$\mathrm{Var}^N\Big[\Psi\big(X^{\pi,\theta}\big) - \sum_{k=0}^{N_{\text{steps}}-1} \bar h\big(t_k, (X^{\pi,\theta}_{t_k \wedge t_j})_{j=0}^{N_{\text{steps}}}, \xi_\Psi\big)\, \Delta \tilde{\bar S}^{\pi,\theta}_{t_k}\Big].$
The histogram in Figure 5.6 is calculated on $N = 400\,000$ different paths and provides the values of $s^2$, where
$s := \Psi\big(X^{\pi,\theta}\big) - \sum_{k=0}^{N_{\text{steps}}-1} \bar h\big(t_k, (X^{\pi,\theta}_{t_k \wedge t_j})_{j=0}^{N_{\text{steps}}}, \xi_\Psi\big)\, \Delta \tilde{\bar S}^{\pi,\theta}_{t_k} - \mathbb{E}^N\big[\Psi\big(X^{\pi,\theta}\big)\big],$
i.e. such that
$\mathbb{E}^N[s^2] = \mathrm{Var}^N\Big[\Psi\big(X^{\pi,\theta}\big) - \sum_{k=0}^{N_{\text{steps}}-1} \bar h\big(t_k, (X^{\pi,\theta}_{t_k \wedge t_j})_{j=0}^{N_{\text{steps}}}, \xi_\Psi\big)\, \Delta \tilde{\bar S}^{\pi,\theta}_{t_k}\Big].$
We obtain a small empirical value of $\mathbb{E}^N[s^2]$. Finally, we study the effect of the control variate parametrisation on the learning speed in Algorithm 1. Figure 5.6 displays the evolution of the root mean squared error of two runs of calibration to market vanilla option prices for the two-month maturity: the blue line uses Algorithm 1 with simultaneous learning of the hedging strategy, and the orange line is obtained without the hedging strategy. We recall from Section 2.4 that the Monte Carlo estimator $\partial_\theta h^N(\theta)$ is a biased estimator of $\partial_\theta h(\theta)$. An upper bound on the bias is given by Theorem 3.2, which shows that by reducing the variance of the Monte Carlo estimator of the option price, the bias of $\partial_\theta h^N(\theta)$ is also reduced, yielding better convergence behaviour of the stochastic approximation algorithm. This can be observed in Figure 5.6.
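The variance-reduction effect of a hedging-strategy control variate can be reproduced in a Black-Scholes toy setting (our construction: a closed-form delta stands in for the learned hedging network $\bar h$, and an ATM call replaces the lookback payoff):

```python
import numpy as np
from math import erf

def norm_cdf(x):
    # Standard normal CDF via the error function (elementwise).
    return 0.5 * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))

def bs_delta(t, s, K, T, sigma):
    # Black-Scholes delta with r = 0, used as a stand-in hedging strategy.
    d1 = (np.log(s / K) + 0.5 * sigma**2 * (T - t)) / (sigma * np.sqrt(T - t))
    return norm_cdf(d1)

rng = np.random.default_rng(6)
K, T, sigma, n_steps, n_paths = 1.0, 0.5, 0.2, 50, 20_000
dt = T / n_steps

# Exact GBM paths under r = 0, S_0 = 1.
S = np.ones((n_paths, n_steps + 1))
for k in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
    S[:, k + 1] = S[:, k] * np.exp(-0.5 * sigma**2 * dt + sigma * dW)

payoff = np.maximum(S[:, -1] - K, 0.0)
hedge = np.zeros(n_paths)
for k in range(n_steps):
    hedge += bs_delta(k * dt, S[:, k], K, T, sigma) * (S[:, k + 1] - S[:, k])

var_plain = payoff.var()            # Var[Phi]
var_hedged = (payoff - hedge).var() # Var[Phi - sum h dS], the control variate
```

Subtracting the gains of the (approximate) hedging strategy leaves the Monte Carlo mean unchanged in expectation while cutting the variance by more than an order of magnitude, which is exactly what the learned $\bar h$ is trained to achieve.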
FIGURE 5.3. Boxplots of the lookback option price (fits with lower bound, without bounds, and with upper bound) and the calibration MSE for the LSV neural SDE, for maturities T = 8, 10 and 12 months.

FIGURE 5.4. Calibration of the LSV neural SDE to market data. We see vanilla option prices and implied volatility curves of the 10 calibrated neural SDEs vs. the market data for different maturities.

ACKNOWLEDGEMENTS
This work was supported by The Alan Turing Institute under EPSRC grant no. EP/N510129/1. We thank Antoine Jacquier (Imperial) for fruitful discussions on the topic of the paper.

FIGURE. Histogram (count) of the hedging errors of the lookback option.

FIGURE. Root MSE of the calibration, with the control variate and without the control variate.

DECLARATIONS OF INTEREST
The authors report no conflicts of interest. The authors alone are responsible for the content and writing of the paper.

REFERENCES

[Acciaio et al., 2019] Acciaio, B., Backhoff-Veraguas, J., and Zalashko, A. (2019). Causal optimal transport and its links to enlargement of filtrations and continuous-time stochastic optimization.
Stochastic Processes and their Applications. [Aksamit et al., 2020] Aksamit, A., Hou, Z., and Obloj, J. (2020). Robust framework for quantifying the value of information in pricing and hedging.
SIAM Journal on Financial Mathematics , 11(1):27–59.
[Albrecher et al., 2007] Albrecher, H., Mayer, P., Schoutens, W., and Tistaert, J. (2007). The little Heston trap.
Wilmott, pages 83–92. [Bayer et al., 2019] Bayer, C., Horvath, B., Muguruza, A., Stemper, B., and Tomas, M. (2019). On deep calibration of (rough) stochastic volatility models. [Bayer and Stemper, 2018] Bayer, C. and Stemper, B. (2018). Deep calibration of rough stochastic volatility models. [Beiglböck et al., 2013] Beiglböck, M., Henry-Labordère, P., and Penkner, F. (2013). Model-independent bounds for option prices – a mass transport approach.
Finance and Stochastics , 17(3):477–501.[Benth et al., 2020] Benth, F. E., Detering, N., and Lavagnini, S. (2020). Accuracy of deep learning in calibratingHJM forward curves. arXiv preprint arXiv:2006.01911 .[Benveniste et al., 2012] Benveniste, A., M´etivier, M., and Priouret, P. (2012).
Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media. [Broadie et al., 2011] Broadie, M., Du, Y., and Moallemi, C. C. (2011). Efficient risk estimation via nested sequential simulation.
Management Science , 57(6):1172–1194.[Buehler et al., 2019] Buehler, H., Gonon, L., Teichmann, J., and Wood, B. (2019). Deep hedging.
Quantitative Finance, pages 1–21. [Chassagneux et al., 2019] Chassagneux, J.-F., Szpruch, Ł., and Tse, A. (2019). Weak quantitative propagation of chaos via differential calculus on the space of measures. arXiv:1901.02556. [Cohen and Elliott, 2015] Cohen, S. N. and Elliott, R. J. (2015).
Stochastic calculus and applications, volume 2. Springer. [Cohen et al., 2018] Cohen, S. N. et al. (2018). Data and uncertainty in extreme risks – a nonlinear expectations approach.
World Scientific Book Chapters , pages 135–162.[Cox and Obloj, 2011] Cox, A. M. and Obloj, J. (2011). Robust hedging of double touch barrier options.
SIAMJournal on Financial Mathematics , 2(1):141–182.[Cuchiero et al., 2020] Cuchiero, C., Khosrawi, W., and Teichmann, J. (2020). A generative adversarial networkapproach to calibration of local stochastic volatility models. arXiv preprint arXiv:2005.02505 .[Cuchiero et al., 2019] Cuchiero, C., Larsson, M., and Teichmann, J. (2019). Deep neural networks, generic universalinterpolation, and controlled odes. arXiv preprint arXiv:1908.07838 .[Dupire et al., 1994] Dupire, B. et al. (1994). Pricing with a smile.
Risk , 7(1):18–20.[Eckstein et al., 2019] Eckstein, S., Guo, G., Lim, T., and Obloj, J. (2019). Robust pricing and hedging of options onmultiple assets and its numerics. arXiv preprint arXiv:1909.03870 .[Eckstein and Kupper, 2019] Eckstein, S. and Kupper, M. (2019). Computation of optimal transport and relatedhedging problems via penalization and neural networks.
Applied Mathematics & Optimization, pages 1–29. [Fournié et al., 1999] Fournié, E., Lasry, J.-M., Lebuchoux, J., Lions, P.-L., and Touzi, N. (1999). Applications of Malliavin calculus to Monte Carlo methods in finance.
Finance and Stochastics , 3(4):391–412.[Gambara and Teichmann, 2020] Gambara, M. and Teichmann, J. (2020). Consistent recalibration models and deepcalibration. arXiv preprint arXiv:2006.09455 .[Glasserman, 2013] Glasserman, P. (2013).
Monte Carlo methods in financial engineering , volume 53. SpringerScience & Business Media.[Goodfellow et al., 2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In
Advances in Neural Information Processing Systems, pages 2672–2680. [Gyöngy, 1986] Gyöngy, I. (1986). Mimicking the one-dimensional marginal distributions of processes having an Itô differential.
Probability Theory and Related Fields, 71(4):501–516. [Hammersley et al., 2019] Hammersley, W. R., Šiška, D., and Szpruch, Ł. (2019). Weak existence and uniqueness for McKean–Vlasov SDEs with common noise. arXiv preprint arXiv:1908.00955.
Risk .[Hestenes, 1969] Hestenes, M. R. (1969). Multiplier and gradient methods.
Journal of optimization theory andapplications , 4(5):303–320.[Heston, 1997] Heston, S. L. (1997). A closed-form solution for options with stochastic volatility and applications tobond and currency options.
The Review of Financial Studies , 6:327–343. [Hobson, 1998] Hobson, D. G. (1998). Robust hedging of the lookback option.
Finance and Stochastics ,2(4):329–347.[Hornik, 1991] Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks.
Neural Networks ,4(2):251–257.[Horvath et al., 2019] Horvath, B., Muguruza, A., and Tomas, M. (2019). Deep learning volatility.[Hutzenthaler et al., 2011] Hutzenthaler, M., Jentzen, A., and Kloeden, P. E. (2011). Strong and weak divergence infinite time of euler’s method for stochastic differential equations with non-globally lipschitz continuous coefficients.
Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences , 467(2130):1563–1576.[Hutzenthaler et al., 2012] Hutzenthaler, M., Jentzen, A., Kloeden, P. E., et al. (2012). Strong convergence of anexplicit numerical method for sdes with nonglobally lipschitz continuous coefficients.
The Annals of Applied Probability, 22(4):1611–1641. [Jabir et al., 2019] Jabir, J.-F., Šiška, D., and Szpruch, Ł. (2019). Mean-field neural ODEs via relaxed optimal control. arXiv preprint arXiv:1912.05475.
Brownian motion and stochastic calculus . Springer.[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprintarXiv:1412.6980 .[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprintarXiv:1312.6114 .[Knight, 1971] Knight, F. H. (1971). Risk, uncertainty and profit, 1921.
Library of Economics and Liberty .[Krylov, 1999] Krylov, N. (1999). On kolmogorov’s equations for finite dimensional diffusions. In
Stochastic PDE’sand Kolmogorov Equations in Infinite Dimensions , pages 1–63. Springer.[Krylov, 1980] Krylov, N. V. (1980).
Controlled diffusion processes , volume 14 of
Applications of Mathematics .Springer-Verlag, New York-Berlin. Translated from the Russian by A. B. Aries.[Kunita, 1997] Kunita, H. (1997).
Stochastic flows and stochastic differential equations, volume 24. Cambridge University Press. [Lassalle, 2013] Lassalle, R. (2013). Causal transference plans and their Monge–Kantorovich problems. arXiv preprint arXiv:1303.6925.
Journal of Mathematics in Industry , 9(1):9.[Majka et al., 2020] Majka, M. B., Sabate-Vidales, M., and Szpruch, Ł. (2020). Multi-index antithetic stochasticgradient algorithm. arXiv preprint arXiv:2006.06102 .[Nadtochiy and Obloj, 2017] Nadtochiy, S. and Obloj, J. (2017). Robust trading of implied skew.
InternationalJournal of Theoretical and Applied Finance , 20(02):1750008.[Paszke et al., 2017] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A.,Antiga, L., and Lerer, A. (2017). Automatic differentiation in pytorch.[Paszke et al., 2019] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z.,Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. In
Advances in Neural Information Processing Systems , pages 8024–8035.[Pelsser and Schweizer, 2016] Pelsser, A. and Schweizer, J. (2016). The difference between lsmc and replicatingportfolio in insurance liability modeling.
European actuarial journal , 6(2):441–494.[Ruf and Wang, 2019] Ruf, J. and Wang, W. (2019). Neural networks for option pricing and hedging: a literaturereview.
Available at SSRN 3486363 .[Sardroudi, 2019] Sardroudi, W. K. (2019). Polynomial semimartingales and a deep learning approach to localstochastic volatility calibration.[ˇSiˇska and Szpruch, 2020] ˇSiˇska, D. and Szpruch, Ł. (2020). Gradient flows for regularized stochastic controlproblems. arxiv preprint arXiv:2006.05956 .[Sontag and Sussmann, 1997] Sontag, E. and Sussmann, H. (1997). Complete controllability of continuous-timerecurrent neural networks.
Systems Control Lett. , 30(4):177–183.[Srivastava et al., 2014] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).Dropout: a simple way to prevent neural networks from overfitting.
The journal of machine learning research ,15(1):1929–1958.
OBUST PRICING AND HEDGING VIA NEURAL SDES 25 [Szpruch and Zh¯ang, 2018] Szpruch, Ł. and Zh¯ang, X. (2018). V -integrability, asymptotic stability and comparisonproperty of explicit numerical schemes for non-linear SDEs. Math. Comp. , 87(310):755–783.[Tian et al., 2015] Tian, Y., Zhu, Z., Lee, G., Klebaner, F., and Hamza, K. (2015). Calibrating and pricing with astochastic-local volatility model.
The Journal of Derivatives , 22(3):21–39.[Vidales et al., 2018] Vidales, M. S., ˇSiˇska, D., and Szpruch, L. (2018). Unbiased deep solvers for parametric PDEs. arXiv:1810.05094 .[Vidales et al., 2020] Vidales, M. S., ˇSiˇska, D., and Szpruch, L. (2020). Learning solutions to path dependent PDEswith signatures and LSTM networks.
In preparation.

APPENDIX A. BOUND ON BIAS IN GRADIENT DESCENT
We complete the analysis from Section 3.2 for a general loss function here.
Theorem A.1.
Let Assumption 3.1 hold. Consider the family of neural SDEs (1.1). We have
\[
\big|\mathbb{E}[\partial_\theta h^N(\theta)] - \partial_\theta h(\theta)\big|
\le \Big(\mathbb{E}\Big[\big(\partial_x \ell\big(\mathbb{E}^{Q^N(\theta)}[\Phi^{cv}(X^\theta)], p(\Phi)\big) - \partial_x \ell\big(\mathbb{E}^{Q}[\Phi^{cv}(X^\theta)], p(\Phi)\big)\big)^2\Big]\Big)^{1/2}
\Big(\mathbb{E}\Big[\big(\mathbb{E}^{Q^N(\theta)}[\partial_\theta \Phi(X^\theta)]\big)^2\Big]\Big)^{1/2}.
\tag{A.1}
\]
If in addition we assume that the loss function $\ell$ is three times differentiable in the first variable with all its derivatives bounded, then
\[
\begin{aligned}
\big|\mathbb{E}[\partial_\theta h^N(\theta)] - \partial_\theta h(\theta)\big|
\le \frac{1}{2}\Big\{ & \|\partial_x^3 \ell\|_\infty \big|\mathbb{E}^Q[\partial_\theta \Phi(X^\theta)]\big|\, \frac{1}{N}\operatorname{Var}^Q[\Phi^{cv}(X^\theta)] \\
& + \|\partial_x^3 \ell\|_\infty \Big(\frac{1}{N}\operatorname{Var}^Q[\partial_\theta \Phi(X^\theta)]\Big)^{1/2}
\Big(\frac{1}{N^3}\,\mathbb{E}\big[(\Phi^{cv}(X^\theta) - \mathbb{E}^Q[\Phi^{cv}(X^\theta)])^4\big] + \frac{3}{N^2}\big(\operatorname{Var}^Q[\Phi^{cv}(X^\theta)]\big)^2\Big)^{1/2} \\
& + 2\|\partial_x^2 \ell\|_\infty \Big(\frac{1}{N}\operatorname{Var}^Q[\partial_\theta \Phi(X^\theta)]\Big)^{1/2}
\Big(\frac{1}{N}\operatorname{Var}^Q[\Phi^{cv}(X^\theta)]\Big)^{1/2} \Big\}.
\end{aligned}
\tag{A.2}
\]
Proof.
Observe that $\mathbb{E}\big[\mathbb{E}^{Q^N}[\Phi^{cv}(X^\theta)]\big] = \mathbb{E}^Q[\Phi^{cv}(X^\theta)]$ and $\mathbb{E}\big[\mathbb{E}^{Q^N}[\partial_\theta \Phi(X^\theta)]\big] = \mathbb{E}^Q[\partial_\theta \Phi(X^\theta)]$. Next, by adding and subtracting $\partial_x \ell\big(\mathbb{E}^Q[\Phi^{cv}(X^\theta)], p(\Phi)\big)$ and using the Cauchy–Schwarz inequality we have
\[
\big|\mathbb{E}[\partial_\theta h^N(\theta)] - \partial_\theta h(\theta)\big|
= \Big|\mathbb{E}\Big[\Big(\partial_x \ell\big(\mathbb{E}^{Q^N}[\Phi^{cv}(X^\theta)], p(\Phi)\big) \pm \partial_x \ell\big(\mathbb{E}^{Q}[\Phi^{cv}(X^\theta)], p(\Phi)\big)\Big)\,\mathbb{E}^{Q^N}\big[\partial_\theta \Phi(X^\theta)\big]\Big] - \partial_\theta h(\theta)\Big|.
\]
Hence
\[
\begin{aligned}
\big|\mathbb{E}[\partial_\theta h^N(\theta)] - \partial_\theta h(\theta)\big|
& = \Big|\mathbb{E}\Big[\Big(\partial_x \ell\big(\mathbb{E}^{Q^N}[\Phi^{cv}(X^\theta)], p(\Phi)\big) - \partial_x \ell\big(\mathbb{E}^{Q}[\Phi^{cv}(X^\theta)], p(\Phi)\big)\Big)\,\mathbb{E}^{Q^N}\big[\partial_\theta \Phi(X^\theta)\big]\Big]\Big| \\
& \le \Big(\mathbb{E}\Big[\Big(\partial_x \ell\big(\mathbb{E}^{Q^N}[\Phi^{cv}(X^\theta)], p(\Phi)\big) - \partial_x \ell\big(\mathbb{E}^{Q}[\Phi^{cv}(X^\theta)], p(\Phi)\big)\Big)^2\Big]\Big)^{1/2}
\Big(\mathbb{E}\Big[\big(\mathbb{E}^{Q^N}[\partial_\theta \Phi(X^\theta)]\big)^2\Big]\Big)^{1/2}.
\end{aligned}
\]
This concludes the proof of (A.1).
To prove (A.2), we view $\partial_\theta h^N(\theta)$ as a function of $\big(\mathbb{E}^{Q^N}[\Phi^{cv}(X^\theta)], \mathbb{E}^{Q^N}[\partial_\theta \Phi(X^\theta)]\big)$ and expand it into its Taylor series around $\big(\mathbb{E}^{Q}[\Phi^{cv}(X^\theta)], \mathbb{E}^{Q}[\partial_\theta \Phi(X^\theta)]\big)$, i.e.
\[
\begin{aligned}
\partial_\theta h^N(\theta) = {} & \partial_\theta h(\theta)
+ \partial_x^2 \ell\big(\mathbb{E}^Q[\Phi^{cv}(X^\theta)], p(\Phi)\big)\,\mathbb{E}^Q[\partial_\theta \Phi(X^\theta)]\,\big(\mathbb{E}^{Q^N}[\Phi^{cv}(X^\theta)] - \mathbb{E}^Q[\Phi^{cv}(X^\theta)]\big) \\
& + \partial_x \ell\big(\mathbb{E}^Q[\Phi^{cv}(X^\theta)], p(\Phi)\big)\,\big(\mathbb{E}^{Q^N}[\partial_\theta \Phi(X^\theta)] - \mathbb{E}^Q[\partial_\theta \Phi(X^\theta)]\big) \\
& + \int_0^1 (1-\alpha)\Big\{ \partial_x^3 \ell\big(\xi^1_\alpha, p(\Phi)\big)\,\xi^2_\alpha\,\big(\mathbb{E}^{Q^N}[\Phi^{cv}(X^\theta)] - \mathbb{E}^Q[\Phi^{cv}(X^\theta)]\big)^2 \\
& \qquad + 2\,\partial_x^2 \ell\big(\xi^1_\alpha, p(\Phi)\big)\,\big(\mathbb{E}^{Q^N}[\partial_\theta \Phi(X^\theta)] - \mathbb{E}^Q[\partial_\theta \Phi(X^\theta)]\big)\big(\mathbb{E}^{Q^N}[\Phi^{cv}(X^\theta)] - \mathbb{E}^Q[\Phi^{cv}(X^\theta)]\big)\Big\}\,d\alpha,
\end{aligned}
\]
where
\[
\xi^1_\alpha = \mathbb{E}^Q[\Phi^{cv}(X^\theta)] + \alpha\big(\mathbb{E}^{Q^N}[\Phi^{cv}(X^\theta)] - \mathbb{E}^Q[\Phi^{cv}(X^\theta)]\big), \qquad
\xi^2_\alpha = \mathbb{E}^Q[\partial_\theta \Phi(X^\theta)] + \alpha\big(\mathbb{E}^{Q^N}[\partial_\theta \Phi(X^\theta)] - \mathbb{E}^Q[\partial_\theta \Phi(X^\theta)]\big).
\]
The first-order terms have zero mean, so taking expectations and using the Cauchy–Schwarz inequality yields
\[
\begin{aligned}
\big|\mathbb{E}[\partial_\theta h^N(\theta)] - \partial_\theta h(\theta)\big|
\le {} & \int_0^1 (1-\alpha)\,\mathbb{E}\Big[ \|\partial_x^3 \ell\|_\infty \big|\mathbb{E}^Q[\partial_\theta \Phi(X^\theta)]\big|\big(\mathbb{E}^{Q^N}[\Phi^{cv}(X^\theta)] - \mathbb{E}^Q[\Phi^{cv}(X^\theta)]\big)^2 \\
& \quad + \alpha\,\|\partial_x^3 \ell\|_\infty \big|\mathbb{E}^{Q^N}[\partial_\theta \Phi(X^\theta)] - \mathbb{E}^Q[\partial_\theta \Phi(X^\theta)]\big|\big(\mathbb{E}^{Q^N}[\Phi^{cv}(X^\theta)] - \mathbb{E}^Q[\Phi^{cv}(X^\theta)]\big)^2 \\
& \quad + 2\|\partial_x^2 \ell\|_\infty \big|\mathbb{E}^{Q^N}[\partial_\theta \Phi(X^\theta)] - \mathbb{E}^Q[\partial_\theta \Phi(X^\theta)]\big|\big|\mathbb{E}^{Q^N}[\Phi^{cv}(X^\theta)] - \mathbb{E}^Q[\Phi^{cv}(X^\theta)]\big|\Big]\,d\alpha \\
\le {} & \frac{1}{2}\Big\{ \|\partial_x^3 \ell\|_\infty \big|\mathbb{E}^Q[\partial_\theta \Phi(X^\theta)]\big|\,\frac{1}{N}\operatorname{Var}^Q[\Phi^{cv}(X^\theta)] \\
& \quad + \|\partial_x^3 \ell\|_\infty \Big(\frac{1}{N}\operatorname{Var}^Q[\partial_\theta \Phi(X^\theta)]\Big)^{1/2}\Big(\mathbb{E}\Big[\big(\mathbb{E}^{Q^N}[\Phi^{cv}(X^\theta)] - \mathbb{E}^Q[\Phi^{cv}(X^\theta)]\big)^4\Big]\Big)^{1/2} \\
& \quad + 2\|\partial_x^2 \ell\|_\infty \Big(\frac{1}{N}\operatorname{Var}^Q[\partial_\theta \Phi(X^\theta)]\Big)^{1/2}\Big(\frac{1}{N}\operatorname{Var}^Q[\Phi^{cv}(X^\theta)]\Big)^{1/2} \Big\}.
\end{aligned}
\]
Now let $\lambda_i := \Phi^{cv}(X^{\theta,i}) - \mathbb{E}^Q[\Phi^{cv}(X^\theta)]$, and note that
\[
\Big(\sum_{i=1}^N \lambda_i\Big)^4
= \sum_{i=1}^N \lambda_i^4
+ 3\sum_{i_1 \neq i_2} \lambda_{i_1}^2 \lambda_{i_2}^2
+ 4\sum_{i_1 \neq i_2} \lambda_{i_1}^3 \lambda_{i_2}
+ 6\sum_{\substack{i_1, i_2, i_3 \\ \text{distinct}}} \lambda_{i_1} \lambda_{i_2} \lambda_{i_3}^2
+ \sum_{\substack{i_1, i_2, i_3, i_4 \\ \text{distinct}}} \lambda_{i_1} \lambda_{i_2} \lambda_{i_3} \lambda_{i_4}.
\]
Since the $\lambda_i$ are i.i.d. and centred, only the first two sums have non-zero expectation. Hence
\[
\mathbb{E}\Big[\big(\mathbb{E}^{Q^N}[\Phi^{cv}(X^\theta)] - \mathbb{E}^Q[\Phi^{cv}(X^\theta)]\big)^4\Big]
= \frac{1}{N^3}\,\mathbb{E}\big[(\Phi^{cv}(X^\theta) - \mathbb{E}^Q[\Phi^{cv}(X^\theta)])^4\big]
+ \frac{3(N-1)}{N^3}\big(\operatorname{Var}^Q[\Phi^{cv}(X^\theta)]\big)^2,
\]
and since $3(N-1)/N^3 \le 3/N^2$, the bound (A.2) follows. The proof is complete. $\square$
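The $O(1/N)$ bias quantified by Theorem A.1 is easy to observe numerically. The sketch below is our own toy illustration, not code from the paper: it takes the loss $\ell(x,p) = (x-p)^2$ with target $p = 0$, payoff $\Phi(x) = x^2$ and $X^\theta = \theta Z$ with $Z \sim N(0,1)$, for which the bias of the $N$-sample gradient estimator can be computed in closed form as $8\theta^3/N$.

```python
import numpy as np

# Toy instance of Theorem A.1 (our own choice, not from the paper):
# h(theta) = (E[Phi(X^theta)])^2 with Phi(x) = x^2 and X^theta = theta * Z,
# so d/dtheta h(theta) = 4 theta^3, while the N-sample estimator
# d/dtheta h^N(theta) = 4 theta^3 S^2 with S the empirical mean of Z_i^2
# has expectation 4 theta^3 (1 + 2/N), i.e. bias 8 theta^3 / N.
rng = np.random.default_rng(0)
theta, M = 1.0, 50_000  # M independent replications of the N-sample estimator

def grad_hN(N):
    """M draws of the Monte Carlo gradient estimate for sample size N."""
    S = (rng.standard_normal((M, N)) ** 2).mean(axis=1)  # empirical E[Z^2]
    return 4.0 * theta**3 * S**2

true_grad = 4.0 * theta**3
bias_10 = grad_hN(10).mean() - true_grad    # analytic value: 8/10  = 0.8
bias_100 = grad_hN(100).mean() - true_grad  # analytic value: 8/100 = 0.08
print(bias_10, bias_100)
```

The measured bias shrinks roughly tenfold as $N$ grows from 10 to 100, in line with the $1/N$ rate in (A.2).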
APPENDIX B. DATA USED IN CALIBRATION
We used the Heston model to generate prices of calls and puts. The model is
\[
dX_t = r X_t\,dt + X_t \sqrt{V_t}\,dW_t, \qquad X_0 = x_0, \tag{B.1}
\]
\[
dV_t = \kappa(\mu - V_t)\,dt + \eta \sqrt{V_t}\,dB_t, \qquad V_0 = v_0, \tag{B.2}
\]
\[
d\langle B, W \rangle_t = \rho\,dt. \tag{B.3}
\]
It is well known that for this model a semi-analytic formula can be used to calculate option prices, see [Heston, 1997] but also [Albrecher et al., 2007]. The choice of parameters below was used to generate the target model calibration prices:
\[
x_0 = 1, \quad r = 0., \quad \kappa = 0., \quad \mu = 0., \quad \eta = 0., \quad V_0 = 0., \quad \rho = 0. \tag{B.4}
\]
[Figure: target implied volatility surface over log-moneyness (range −0.20 to 0.20) and maturity.]
FIGURE B.1. The “market” data used in calibration of the neural SDE models. The implied volatility surface comes from (B.1) and (B.4).

Options with bi-monthly maturities up to one year and a varying range of strikes (differing among calibrations) were used as market data for the neural SDE calibration. The call/put option prices were obtained from the Heston model using Monte Carlo simulation with Brownian trajectories. See Figure B.1 for the resulting “market” data.
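Target prices of this kind can be generated with a simple Euler scheme for (B.1)–(B.3). The sketch below is our own illustration; its parameter values are placeholders of our choosing, not the calibrated values in (B.4), and a full-truncation step keeps the variance argument of the square root non-negative.

```python
import numpy as np

# Full-truncation Euler simulation of the Heston model (B.1)-(B.3) and a
# Monte Carlo price of an ATM call. Parameter values are illustrative
# placeholders, not the paper's values from (B.4).
rng = np.random.default_rng(1)
x0, r, kappa, mu, eta, v0, rho = 1.0, 0.025, 0.8, 0.1, 0.3, 0.04, -0.7
T, n_steps, n_paths = 1.0, 100, 100_000
dt = T / n_steps

X = np.full(n_paths, x0)
V = np.full(n_paths, v0)
for _ in range(n_steps):
    dW = np.sqrt(dt) * rng.standard_normal(n_paths)
    dZ = np.sqrt(dt) * rng.standard_normal(n_paths)
    dB = rho * dW + np.sqrt(1.0 - rho**2) * dZ   # enforces d<B, W>_t = rho dt
    Vp = np.maximum(V, 0.0)                      # full truncation
    X = X + r * X * dt + X * np.sqrt(Vp) * dW
    V = V + kappa * (mu - Vp) * dt + eta * np.sqrt(Vp) * dB

call_price = np.exp(-r * T) * np.maximum(X - x0, 0.0).mean()  # ATM, K = x0
print(round(call_price, 4))
```

The same loop, evaluated over a grid of strikes and maturities, produces a full surface of target call/put prices.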
APPENDIX C. FEED-FORWARD NEURAL NETWORKS
Feed-forward neural networks are functions constructed by composing affine maps with a non-linear activation function. We fix a locally Lipschitz activation function $a : \mathbb{R} \to \mathbb{R}$, taken here to be the ReLU function $a(z) = \max(0, z)$, and for $d \in \mathbb{N}$ define $A_d : \mathbb{R}^d \to \mathbb{R}^d$ as the function given, for $x = (x_1, \ldots, x_d)$, by $A_d(x) = (a(x_1), \ldots, a(x_d))$. We fix $L \in \mathbb{N}$ (the number of layers), $l_k \in \mathbb{N}$, $k = 0, 1, \ldots, L-1$ (the size of the input to layer $k$) and $l_L \in \mathbb{N}$ (the size of the network output). A fully connected artificial neural network is then given by $\theta = ((W_1, B_1), \ldots, (W_L, B_L))$, where, for $k = 1, \ldots, L$, we have real $l_k \times l_{k-1}$ matrices $W_k$ and real $l_k$-dimensional vectors $B_k$. The artificial neural network defines a function $\mathcal{R}(\cdot, \theta) : \mathbb{R}^{l_0} \to \mathbb{R}^{l_L}$ given recursively, for $x_0 \in \mathbb{R}^{l_0}$, by
\[
\mathcal{R}(x_0, \theta) = W_L x_{L-1} + B_L, \qquad x_k = A_{l_k}(W_k x_{k-1} + B_k), \quad k = 1, \ldots, L-1.
\]
We will call this class of fully connected artificial neural networks $\mathcal{DN}$. Note that since the activation function and architecture are fixed, the learning task entails finding the optimal $\theta \in \mathbb{R}^p$, where $p$ is the number of parameters in $\theta$, given by
\[
p(\theta) = \sum_{k=1}^{L} (l_{k-1}\, l_k + l_k).
\]
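The construction above can be sketched in a few lines of numpy (an illustration of the definition only; the paper's experiments use PyTorch).

```python
import numpy as np

# Minimal sketch of the fully connected network R(., theta) of Appendix C.
def relu(z):
    return np.maximum(z, 0.0)  # A_d applies a(z) = max(0, z) componentwise

def init_params(layer_sizes, rng):
    """theta = ((W_1, B_1), ..., (W_L, B_L)), with W_k of size l_k x l_{k-1}."""
    return [(rng.standard_normal((m, n)) / np.sqrt(n), np.zeros(m))
            for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def network(x, theta):
    """R(x, theta): hidden layers are affine + ReLU, the last layer is affine."""
    for W, B in theta[:-1]:
        x = relu(W @ x + B)
    W, B = theta[-1]
    return W @ x + B

rng = np.random.default_rng(0)
sizes = [2, 32, 32, 1]                 # l_0 = 2, l_1 = l_2 = 32, l_3 = 1
theta = init_params(sizes, rng)
y = network(np.array([0.5, -1.0]), theta)
n_params = sum(W.size + B.size for W, B in theta)
# p(theta) = sum_k (l_{k-1} l_k + l_k) = (2*32+32) + (32*32+32) + (32*1+1) = 1185
print(y.shape, n_params)
```

The parameter count reproduces the formula $p(\theta) = \sum_{k=1}^{L}(l_{k-1} l_k + l_k)$ above.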
APPENDIX D. LV NEURAL SDES CALIBRATION ACCURACY
Figures 5.2, D.2 and D.4 present the implied volatility fit of the local volatility neural SDE model (5.2) calibrated to: market vanilla data only; market vanilla data with a lower bound constraint on the lookback option payoff; and market vanilla data with an upper bound constraint on the lookback option payoff, respectively. Figures D.1, D.3 and D.5 present the target option price fit of the local volatility neural SDE model (5.2) calibrated to the same three data sets, respectively. The high level of accuracy in all calibrations is achieved thanks to the hedging neural network incorporated into model training.
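The implied-volatility errors (reported in basis points in these figures) are obtained by inverting the Black–Scholes formula on model and target prices. A minimal sketch of such an inversion by bisection follows; this helper is our own illustration, not code from the paper.

```python
import math

# Invert the Black-Scholes call price to an implied volatility by bisection,
# then express an implied-vol discrepancy in basis points.
def bs_call(S, K, T, r, sigma):
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal cdf
    return S * cdf(d1) - K * math.exp(-r * T) * cdf(d2)

def implied_vol(price, S, K, T, r, lo=1e-4, hi=5.0, tol=1e-12):
    # bs_call is strictly increasing in sigma, so bisection converges
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if bs_call(S, K, T, r, mid) < price else (lo, mid)
    return 0.5 * (lo + hi)

target_iv = 0.2
price = bs_call(1.0, 1.0, 0.5, 0.025, target_iv)
iv = implied_vol(price, 1.0, 1.0, 0.5, 0.025)
error_bps = abs(iv - target_iv) * 1e4  # implied-vol error in basis points
print(round(iv, 6))
```

Applying this inversion to a model price and a target price and differencing the two volatilities gives the per-strike errors plotted in the figures.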
APPENDIX E. LSV NEURAL SDES CALIBRATION ACCURACY
Figures E.1, 5.4 and E.2 provide the vanilla call option price and the implied volatility curve for the calibrated models. In each plot, the blue line corresponds to the target data (generated using the Heston model), and each orange line corresponds to one run of the neural SDE calibration. We note again in these plots that the absolute error of the calibration to the vanilla prices is consistently of $O(10^{-})$.
APPENDIX F. EXOTIC PRICE IN LV NEURAL SDES

Below we see how different random seeds, constrained optimization algorithms and the number of strikes used in the market data input affect the illiquid derivative price in the local volatility neural SDE model.
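For orientation, the order of magnitude of the lookback prices in Tables 1–4 can be reproduced with a plain Monte Carlo computation. The sketch below is our own illustration, assuming a floating-strike payoff $X_T - \min_t X_t$ under a Black–Scholes reference model rather than the paper's calibrated neural SDE.

```python
import numpy as np

# Monte Carlo price of a lookback call with an assumed floating-strike payoff
# X_T - min_t X_t, under Black-Scholes dynamics (our illustration only).
rng = np.random.default_rng(3)
x0, r, sigma, T = 1.0, 0.025, 0.2, 0.5
n_steps, n_paths = 100, 20_000
dt = T / n_steps

# exact log-Euler paths for geometric Brownian motion, initial point included
increments = ((r - 0.5 * sigma**2) * dt
              + sigma * np.sqrt(dt) * rng.standard_normal((n_paths, n_steps)))
log_paths = np.hstack([np.zeros((n_paths, 1)), np.cumsum(increments, axis=1)])
X = x0 * np.exp(log_paths)

payoff = X[:, -1] - X.min(axis=1)       # lookback call, floating strike
price = np.exp(-r * T) * payoff.mean()
print(round(price, 4))
```

With a volatility comparable to the Heston data, this reference price sits in the same range as the six-month entries of the tables below; discrete monitoring of the minimum biases it slightly low relative to a continuous lookback.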
FIGURE D.1. Calibrated neural SDE LV model and market target prices comparison. [Panels: call prices and call price errors (bps) vs. strike 0.8–1.2, for maturities T = 2, 4, 6, 8, 10 and 12 months.]
FIGURE D.2. Calibrated neural SDE LV model (with lower bound minimization on exotic payoff) and target market data implied volatility comparison. [Panels: implied volatilities and errors (bps) vs. strike 0.8–1.2, for maturities T = 2, 4, 6, 8, 10 and 12 months.]
FIGURE D.3. Calibrated neural SDE LV model (with lower bound minimization on exotic payoff) and target market prices comparison. [Panels: call prices and errors (bps) vs. strike 0.8–1.2, for maturities T = 2, 4, 6, 8, 10 and 12 months.]
FIGURE D.4. Calibrated neural LV model (with upper bound maximization on exotic payoff) and target market data implied volatility comparison. [Panels: implied volatilities and errors (bps) vs. strike 0.8–1.2, for maturities T = 2, 4, 6, 8, 10 and 12 months.]
FIGURE D.5. Calibrated LV neural model (with upper bound maximization on exotic payoff) and target market prices comparison. [Panels: call prices and errors (bps) vs. strike 0.8–1.2, for maturities T = 2, 4, 6, 8, 10 and 12 months.]
FIGURE E.1. Comparing market and model data fit for the neural SDE LSV model (5.3) when targeting the lower bound on the illiquid derivative. We see vanilla option prices and implied volatility curves of the 10 calibrated neural SDEs vs. the market data for different maturities (8, 10 and 12 months).
FIGURE E.2. Comparing market and model data fit for the neural SDE LSV model (5.3) when targeting the upper bound on the illiquid derivative. We see vanilla option prices and implied volatility curves of the 10 calibrated neural SDEs vs. the market data for different maturities (8, 10 and 12 months).
FIGURE F.1. Lookback exotic option price in the lower-bound, upper-bound and unconstrained cases, implied by a perfectly calibrated LV neural SDE calibrated to a varying number of market option quotes (11, 21, 31 and 41 strikes; maturities T = 6 and 12 months).

Initialisation  Calibration type    t=2/12  t=4/12  t=6/12  t=8/12  t=10/12  t=1
1               Unconstrained       .055    .088    .113    .134    .159     .178
2               Unconstrained       .056    .086    .113    .132    .158     .175
1               LB Lag. mult.       .055    .086    .107    .127    .143     .154
2               LB Lag. mult.       .055    .084    .098    .113    .125     .139
1               UB Lag. mult.       .056    .099    .119    .142    .163     .214
2               UB Lag. mult.       .059    .101    .131    .156    .208     .220
1               LB Augmented        .055    .077    .107    .113    .127     .136
2               LB Augmented        .056    .085    .109    .123    .139     .158
1               UB Augmented        .058    .102    .139    .156    .188     .224
2               UB Augmented        .057    .088    .128    .151    .167     .184
-               Heston 400k paths   .058    .087    .111    .133    .154     .174

TABLE 1. Impact of initialisation and constrained optimization algorithms on prices of an illiquid derivative (lookback call) implied by LV neural SDE calibrated to vanilla prices with K = 11 strikes: k_1 = 0., k_2 = 0., ..., k_11 = 1. for each maturity.
Initialisation  Calibration type    t=2/12  t=4/12  t=6/12  t=8/12  t=10/12  t=1
1               Unconstrained       .056    .087    .114    .140    .161     .182
2               Unconstrained       .056    .087    .114    .136    .161     .180
1               LB Lag. mult.       .056    .086    .110    .123    .141     .153
2               LB Lag. mult.       .056    .087    .108    .125    .150     .155
1               UB Lag. mult.       .056    .088    .120    .156    .179     .205
2               UB Lag. mult.       .056    .088    .118    .153    .187     .208
1               LB Augmented        .056    .087    .108    .128    .143     .164
2               LB Augmented        .056    .087    .108    .125    .150     .155
1               UB Augmented        .056    .091    .124    .155    .173     .194
2               UB Augmented        .056    .088    .125    .146    .167     .189
-               Heston 400k paths   .058    .087    .111    .133    .154     .174
-               Heston 10mil paths  .058    .087    .111    .133    .154     .174

TABLE 2. Impact of initialisation. Prices of ATM lookback call implied by LV neural SDE calibrated to vanilla prices with K = 21 strikes: k_1 = 0., k_2 = 0., ..., k_21 = 1. for each maturity.

Initialisation  Calibration type    t=2/12  t=4/12  t=6/12  t=8/12  t=10/12  t=1
1               Unconstrained       .056    .087    .114    .138    .162     .184
2               Unconstrained       .056    .087    .114    .138    .160     .183
1               LB Lag. mult.       .056    .087    .113    .137    .149     .172
2               LB Lag. mult.       .056    .087    .113    .136    .155     .165
1               UB Lag. mult.       .056    .088    .115    .148    .170     .197
2               UB Lag. mult.       .056    .087    .114    .144    .170     .198
1               LB Augmented        .056    .087    .114    .138    .161     .183
2               LB Augmented        .056    .087    .112    .130    .154     .166
1               UB Augmented        .056    .087    .114    .138    .162     .183
2               UB Augmented        .056    .087    .114    .141    .164     .190
-               Heston 400k paths   .058    .087    .111    .133    .154     .174
-               Heston 10mil paths  .058    .087    .111    .133    .154     .174
TABLE 3. Impact of initialisation. Prices of ATM lookback call implied by LV neural SDE calibrated to vanilla prices with K = 31 strikes: k_1 = 0., k_2 = 0., ..., k_31 = 1. for each maturity.

Initialisation  Calibration type    t=2/12  t=4/12  t=6/12  t=8/12  t=10/12  t=1
1               Unconstrained       .056    .087    .114    .138    .160     .183
2               Unconstrained       .056    .087    .113    .138    .162     .184
1               LB Lag. mult.       .056    .087    .113    .137    .158     .172
2               LB Lag. mult.       .056    .087    .113    .137    .153     .171
1               UB Lag. mult.       .056    .088    .117    .141    .166     .193
2               UB Lag. mult.       .056    .087    .116    .140    .166     .192
1               LB Augmented        .056    .087    .113    .136    .153     .172
2               LB Augmented        .056    .087    .113    .136    .152     .169
1               UB Augmented        .056    .087    .114    .138    .160     .182
2               UB Augmented        .056    .087    .116    .140    .164     .191
-               Heston 400k paths   .058    .087    .111    .133    .154     .174
-               Heston 10mil paths  .058    .087    .111    .133    .154     .174

TABLE 4. Impact of initialisation. Prices of ATM lookback call implied by LV neural SDE calibrated to vanilla prices with K = 41 strikes: k_1 = 0., k_2 = 0., ..., k_41 = 1.4