A generative adversarial network approach to calibration of local stochastic volatility models
CHRISTA CUCHIERO, WAHID KHOSRAWI AND JOSEF TEICHMANN
Abstract.
We propose a fully data driven approach to calibrate local stochastic volatility (LSV) models, circumventing in particular the ad hoc interpolation of the volatility surface. To achieve this, we parametrize the leverage function by a family of feed forward neural networks and learn their parameters directly from the available market option prices. This should be seen in the context of neural SDEs and (causal) generative adversarial networks: we generate volatility surfaces by specific neural SDEs, whose quality is assessed by quantifying, in an adversarial manner, distances to market prices. The minimization of the calibration functional relies strongly on a variance reduction technique based on hedging and deep hedging, which is interesting in its own right: it allows to calculate model prices and model implied volatilities in an accurate way using only small sets of sample paths. For numerical illustration we implement a SABR-type LSV model and conduct a thorough statistical performance analysis on many samples of implied volatility smiles, showing the accuracy and stability of the method.

1. Introduction
Each day a crucial task is performed in financial institutions all over the world: the calibration of stochastic models to current market or historical data. So far the model choice was not only driven by the capacity of capturing well empirically observed market features, but also by the computational tractability of the calibration process. This is now undergoing a big change since machine learning technologies offer new perspectives on model calibration.

Calibration is the choice of one model from a pool of models, given current market and historical data, possibly with some information on their significance. Depending on the nature of the data,
Mathematics Subject Classification.
Key words and phrases.
LSV calibration, neural SDEs, generative adversarial networks, deep hedging, variance reduction, stochastic optimization.

Christa Cuchiero gratefully acknowledges financial support by the Vienna Science and Technology Fund (WWTF) under grant MA16-021.

Christa Cuchiero, University of Vienna, Department of Statistics and Operations Research, Data Science @ Uni Vienna, Oskar-Morgenstern-Platz 1, 1090 Wien, Austria. E-mail: [email protected]
Wahid Khosrawi, ETH Zürich, D-MATH, Rämistrasse 101, CH-8092 Zürich, Switzerland. E-mail: [email protected]
Josef Teichmann, ETH Zürich, D-MATH, Rämistrasse 101, CH-8092 Zürich, Switzerland. E-mail: [email protected]

this is considered as an inverse problem or a problem of statistical inference. We consider here current market data, e.g. volatility surfaces, therefore we rather emphasize the inverse problem point of view. We however stress that it is the ultimate goal of calibration to include both data sources simultaneously. In this respect machine learning might help considerably.

We can distinguish three kinds of machine learning inspired approaches for calibration to current market prices: First, having solved the inverse problem already several times, one can learn from this experience (i.e. training data) the calibration map from market data to model parameters directly. Let us here mention one of the pioneering papers by A. Hernandez [26] that applied neural networks to learn this calibration map in the context of interest rate models. This was taken up in [11] for calibrating more complex mixture models. Second, one can learn the map from model parameters to market prices and then invert this map, possibly with machine learning technology. In the context of rough volatility modeling, see [17], such approaches turned out to be very successful: we refer here to [3] and the references therein. Third, the calibration problem is considered as a search for a model which generates given market prices, and technology from generative adversarial networks, first introduced by [19], is used. This means parameterizing the model pool in a way which is accessible for machine learning techniques and interpreting the inverse problem as a training task of a generative network, whose quality is assessed by an adversary.
We pursue this approach in the present article.

1.1. Generative adversarial approaches in finance.
At first sight it might seem a bit unexpected to embed calibration problems in the realm of generative adversarial networks, which are rather applied in areas like photorealistic image generation: a generative adversarial network, here mainly in its causal interpretation, is a neural network (mostly of recurrent type) G_θ depending on parameters θ, which transports a standard input law P^I (e.g., in our case of LSV modeling the law of a Brownian motion W together with an exogenously given stochastic volatility process α) to a target output law P^O. In the context of financial modeling P^O corresponds to the "true" law of the market, deduced from data (e.g., the empirical measure) and not necessarily fully specified. It can either correspond to the physical or to a risk neutral measure or even both, depending on the goal.

Denoting the push-forward of P^I under the transport map G_θ by (G_θ)_*P^I, the goal is to find parameters θ such that (G_θ)_*P^I ≈ P^O. For this purpose appropriate distance functions have to be used. Standard examples include entropies, integral distances, Wasserstein or Radon distances, etc. The adversarial aspect appears when the chosen distance is represented as a supremum over certain classes of functions, which can themselves be parametrized via neural networks of a certain type. This leads to a game, often of zero-sum type, between the generator and the adversary. In our case the measure P^O is not even fully specified, but only certain functionals, namely the given set of market prices, are provided. Therefore we shall measure the distance by the following neo-classical calibration functional
\[
\inf_\theta \sup_\gamma \sum_{C \in \mathcal{C}} w_C^\gamma \Big( \underbrace{\mathbb{E}_{(G_\theta)_* \mathbb{P}^I}[C]}_{\text{model price}} - \underbrace{\mathbb{E}_{\mathbb{P}^O}[C]}_{\text{market price}} \Big)^2, \tag{1.1}
\]
where $\mathcal{C}$ is the class of option payoffs and $w_C^\gamma$ are weights, which are parametrized by γ to account for the adversarial part.
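As a toy numerical illustration of the functional (1.1) (this is entirely our own sketch, not the authors' implementation; all function names and numbers are hypothetical), the inner objective is a weighted squared distance between model and market prices, and one simple choice of adversary concentrates its whole weight budget on the worst-fit payoff:

```python
import numpy as np

def calibration_distance(model_prices, market_prices, weights):
    """Weighted squared price distance: the inner part of (1.1) for fixed weights."""
    diff = model_prices - market_prices
    return float(np.sum(weights * diff ** 2))

def adversarial_weights(model_prices, market_prices, budget=1.0):
    """A crude sup over gamma: put the whole weight budget on the payoff
    with the largest absolute price mismatch."""
    gap = np.abs(model_prices - market_prices)
    w = np.zeros_like(gap)
    w[np.argmax(gap)] = budget
    return w

model = np.array([0.10, 0.12, 0.08])   # hypothetical model prices
market = np.array([0.11, 0.12, 0.05])  # hypothetical market prices
w = adversarial_weights(model, market)
loss = calibration_distance(model, market, w)
```

In an actual inf-sup iteration the generator would then lower this loss while the adversary re-allocates its budget; here the third option is worst fit, so the adversary weights it with the full budget.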
Similarly one could also consider distances of Wasserstein type (in its dual representation)
\[
\inf_{\theta \in \Theta} \sup_{\gamma} \big( \mathbb{E}_{(G_\theta)_* \mathbb{P}^I}[C^\gamma] - \mathbb{E}_{\mathbb{P}^O}[C^\gamma] \big),
\]
where γ parametrizes appropriate function classes, e.g. approximations of Lipschitz functions, neural networks or again payoffs of options.

In general we can consider distance functions d^γ such that the game between generator and adversary appears as
\[
\inf_\theta \sup_\gamma\, d^\gamma\big((G_\theta)_* \mathbb{P}^I, \mathbb{P}^O\big). \tag{1.2}
\]
A distance function here just maps two laws on path space to a non-negative real number allowing for minima if P^O is fixed: we assume neither symmetry, nor a triangle inequality, nor definiteness.

The advantage of this point of view is two-fold:
(i) we have access to the unreasonable effectiveness of modeling by neural networks, i.e. the counter-intuitive, highly over-parametrized ansatz to consider a local stochastic volatility model as a recurrent neural network is beneficial through its miraculous generalization and regularization properties;
(ii) the game theoretic view disentangles realistic price generation from discriminating certain options, e.g. by putting higher weights. Notice that (1.1) is not the usual form of GAN problems, since the adversary distance d^γ is non-linear in P^I and P^O, but we believe that it is worth taking this abstract point of view.

There is no reason why these generative models, if sufficient computing power is available, should not take market price data as inputs, too. This would correspond, from the point of view of generative adversarial networks, to actually learning a map G_{θ, market prices}, such that for any configuration of market prices one has instantaneously a generative model given, which produces those prices. This requires just a rich data source of typical market prices (and computing power!).
This would ultimately connect the first and third approach of calibration.

Even though it is usually not considered like that, one can also view the generative model as an engine producing a likelihood on path space: given historic data this would allow, with precisely the same technology, a maximum likelihood approach. In this case P^O is just the empirical measure of the one observed trajectory, which is inserted in the likelihood function on path space. This then falls in the realm of generative approaches that appear in the literature under the name "market generators". Here the goal is to mimic precisely the behavior and features of historical market trajectories. This line of research has been recently pursued in e.g. [31, 43, 25, 6, 2].

From a bird's eye perspective this machine learning approach to calibration might just look like a standard inverse problem with another parametrized family of functions. We, however, insist on one important difference. Implicit or explicit regularization (see e.g. [24]), which always appears in machine learning applications and which is cumbersome to mimic in classical inverse problems, is one of the secrets of this success.

In general, machine learning approaches are becoming more and more prolific in mathematical finance. Concrete applications include hedging [5], portfolio selection [16], stochastic portfolio theory [37, 12], optimal stopping [4], optimal transport and robust finance [15], stochastic games and control problems [28] as well as high dimensional non-linear partial differential equations (PDEs) [22, 29]. Machine learning also allows for new insights into structural properties of financial markets as investigated in [39]. For an exhaustive overview of machine learning applications in mathematical finance, in particular for option pricing and hedging, we refer to [36].

1.2. Local stochastic volatility models as neural SDEs.
In the present article we focus on the calibration of local stochastic volatility (LSV) models, which is still an intricate task, both from a theoretical as well as a practical point of view. LSV models, going back to [32, 34], combine classical stochastic volatility with local volatility to achieve both, a good fit to time series data and in principle a perfect calibration to the implied volatility smiles and skews. In these models, the discounted price process (S_t)_{t≥0} of an asset satisfies
\[
dS_t = S_t\, L(t, S_t)\, \alpha_t\, dW_t, \tag{1.3}
\]
where (α_t)_{t≥0} is some stochastic process taking values in ℝ, L(t, s) is the so-called leverage function (a sufficiently regular function) depending on time and the current value of the asset, and W is a one-dimensional Brownian motion. Note that the stochastic volatility process α can be very general and could for instance be chosen as a rough volatility model. By slight abuse of terminology we call α stochastic volatility even though it is strictly speaking not the volatility of the log price of S (we can, however, imagine L to be close to 1).

For notational simplicity we consider here the one-dimensional case, but the setup easily translates to a multivariate situation with several assets and a matrix valued analog of α as well as a matrix valued leverage function.

The leverage function L is the crucial part in this model. It allows in principle to perfectly calibrate the implied volatility surface seen on the market. In order to achieve this goal L has to satisfy
\[
L^2(t, s) = \frac{\sigma_{\mathrm{Dup}}^2(t, s)}{\mathbb{E}[\alpha_t^2 \mid S_t = s]}, \tag{1.4}
\]
where σ_Dup denotes Dupire's local volatility function (see [14]). For the derivation of (1.4), we refer to [21]. Note that (1.4) is an implicit equation for L, as L is needed for the computation of E[α_t² | S_t = s]. This in turn means that the stochastic differential equation (SDE) for the price process (S_t)_{t≥0} is actually a McKean-Vlasov SDE, since the law of S_t enters the characteristics of the equation.
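To make (1.4) concrete: the conditional expectation E[α_t² | S_t = s] can be estimated from simulated particles, for instance by simple binning (the particle method of [20, 21] uses a smoothed kernel version of this idea). The following sketch is illustrative only; the function names, the quantile binning rule, and the assumption of non-empty bins are our own choices:

```python
import numpy as np

def conditional_alpha2(S, alpha, n_bins=20):
    """Binned estimate of s -> E[alpha_t^2 | S_t = s] from particle samples.
    Quantile bins keep roughly equal particle counts per bin (assumed non-empty)."""
    edges = np.quantile(S, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, S, side="right") - 1, 0, n_bins - 1)
    cond = np.array([(alpha[idx == b] ** 2).mean() for b in range(n_bins)])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, cond

def leverage(sigma_dup, cond_a2):
    """L(t, s) = sigma_Dup(t, s) / sqrt(E[alpha_t^2 | S_t = s]), cf. (1.4)."""
    return sigma_dup / np.sqrt(cond_a2)

rng = np.random.default_rng(0)
S = rng.lognormal(mean=0.0, sigma=0.2, size=10_000)  # toy price particles
alpha = np.full_like(S, 0.2)   # constant vol: conditional expectation is 0.04
centers, cond = conditional_alpha2(S, alpha)
L = leverage(0.2, cond)        # with sigma_Dup = 0.2 this must give L = 1
```

The constant-α sanity check recovers L ≡ 1, consistent with the fact that for σ_Dup = α the LSV model degenerates to the pure stochastic volatility model.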
Existence and uniqueness results for this equation are not at all obvious, since the coefficients do not satisfy any kind of standard conditions, like for instance Lipschitz continuity in the Wasserstein space. Existence of a short-time solution of the associated non-linear Fokker-Planck equation for the density of (S_t)_{t≥0} was shown in [1] under certain regularity assumptions on the initial distribution. As stated in [20] (see also [30], where existence of a simplified version of an LSV model is proved), a very challenging and still open problem is to derive the set of stochastic volatility parameters for which LSV models exist uniquely for a given market implied volatility surface.

Despite these intriguing existence issues, LSV models have attracted, due to their appealing feature of a potentially perfect smile calibration and their econometric properties, a lot of attention from the calibration and implementation point of view. We refer to [20, 21, 10] for Monte Carlo methods, to [34, 40] for PDE methods based on non-linear Fokker-Planck equations and to [38] for inverse problem techniques. Within these approaches the particle approximation method for the McKean-Vlasov SDE proposed in [20, 21] works impressively well, as very few paths have to be used in order to achieve very accurate calibration results.
In the current paper we propose an alternative, fully data driven approach circumventing in particular the interpolation of the volatility surface, which is necessary in several other approaches in order to compute Dupire's local volatility. To achieve this we learn or train the leverage function L in order to generate the available market option prices accurately. This approach allows in principle to take all kinds of options into account. In other words, we are not limited to plain vanilla European call options, in contrast to many existing methods. Setting T_0 = 0 and denoting by T_1 < T_2 < ... < T_n the maturities of the available options, we parametrize the leverage function L(t, s) via a family of feed forward neural networks F_i: ℝ → ℝ with weights θ_i ∈ Θ_i, i.e.
\[
L(t, s) = 1 + F_i(s \mid \theta_i), \quad t \in [T_{i-1}, T_i), \quad i \in \{1, \dots, n\}.
\]
We here consider for simplicity only the univariate case. The multivariate situation just means that 1 is replaced by the identity matrix and the neural networks F_i are maps from ℝ^d to ℝ^{d×d}.

This is the way how we parametrize the transport map G_θ introduced in Section 1.1. Indeed, this leads to so-called neural SDEs (see [18] for related work), which in the case of time-inhomogeneous Itô SDEs just means to parametrize the drift µ(·,·|θ) and the volatility σ(·,·|θ) by neural networks with parameters θ, i.e.,
\[
dX_t(\theta) = \mu(X_t(\theta), t \mid \theta)\, dt + \sigma(X_t(\theta), t \mid \theta)\, dW_t, \quad X_0(\theta) = x.
\]
In our case, there is no drift and the volatility function (for the price) reads as
\[
\sigma(S_t(\theta), t \mid \theta) = S_t(\theta) \Big(1 + \sum_{i=1}^n F_i(S_t(\theta) \mid \theta_i)\, \mathbf{1}_{[T_{i-1}, T_i)}(t)\Big) \alpha_t.
\]
The solution measure of the neural SDE corresponds to the transport (G_θ)_*P^I, where P^I is the law of (W, α).

Progressively for each maturity, the weights of the neural networks are learned by optimizing the calibration criterion (1.1), where we allow for more general loss functions than just the square, i.e.
\[
\inf_\theta \sup_\gamma \sum_{j=1}^{J} w_j^\gamma\, \ell^\gamma\big(\pi_j^{\mathrm{mod}}(\theta) - \pi_j^{\mathrm{mkt}}\big). \tag{1.5}
\]
We here write π_j^mod(θ) and π_j^mkt for the model and market option prices, which correspond exactly to the expected values under (G_θ)_*P^I and P^O, respectively. Moreover, for every fixed γ, ℓ^γ is a nonlinear, nonnegative, convex function with ℓ^γ(0) = 0 and ℓ^γ(x) > 0 for x ≠ 0, measuring the distance between model and market prices. The parameters w_j^γ, for fixed γ, denote some weights, e.g. of vega type (compare [9]), which allow us to match implied volatility data rather than pure prices, our actual goal, very well.

Notice that, as is somehow quite typical for financial applications, we need to guarantee a very high accuracy of approximation, whence a variance reduction technique to approximate the model prices is crucial for this learning task. Notice also that due to the general structure of the adversary no diversification effects can be expected: we have to deal with nonlinear functions of expectations with respect to P^I.

The precise algorithm is outlined in Section 3 and Section 4, where we also conduct a thorough statistical performance analysis. The implementation relies strongly on a variance reduction technique based on hedging and deep hedging: it allows to compute accurate model prices π^mod(θ) for training purposes using only small sets of sample paths. Let us remark that we do not aim to compete with existing algorithms, as e.g. the particle method by [20, 21], in terms of speed, but rather provide a generic data driven algorithm that is universally applicable for all kinds of options, also in multivariate situations, without resorting to Dupire type volatilities.

The remainder of the article is organized as follows. Section 2 introduces the variance reduction technique based on hedge control variates, which is crucial in our optimization tasks. In Section 3 we explain our calibration method, in particular how to optimize (1.5).
The details of the numerical implementation and the results of the statistical performance analysis are then given in Section 4 as well as Section 6. In Appendix A we state stability theorems for stochastic differential equations depending on parameters. This is applied to neural SDEs when calculating derivatives with respect to the parameters of the neural networks. In Appendix B we recall preliminaries on deep learning by giving a brief overview of universal approximation properties of artificial neural networks and briefly explaining stochastic gradient descent. Finally, Appendix C contains alternative optimization approaches to (1.5).

2. Variance reduction for pricing and calibration via hedging and deep hedging
This section is dedicated to introducing a generic variance reduction technique for Monte Carlo pricing and calibration by using hedging portfolios as control variates. This method will be crucial in our LSV calibration presented in Section 3. For similar considerations we refer to [41].

Consider, on a finite time horizon T > 0, a financial market in discounted terms with r traded instruments (Z_t)_{t∈[0,T]} following an ℝ^r-valued stochastic process on some filtered probability space (Ω, F, (F_t)_{t∈[0,T]}, Q). Here, Q is a risk neutral measure and (F_t)_{t∈[0,T]} is supposed to be right continuous. In particular, we suppose that (Z_t)_{t∈[0,T]} is an r-dimensional square integrable martingale with càdlàg paths.

Let C be an F_T-measurable random variable describing the payoff of some European option at maturity T > 0. Then the usual Monte Carlo estimator for the price of this option is given by
\[
\pi = \frac{1}{N} \sum_{n=1}^{N} C^n, \tag{2.1}
\]
where (C^1, ..., C^N) are i.i.d. with the same distribution as C and N ∈ ℕ. This estimator can easily be modified by adding a stochastic integral with respect to Z. Indeed, consider a strategy (h_t)_{t∈[0,T]} ∈ L²(Z) and some constant c. Denote the stochastic integral with respect to Z by I = (h • Z)_T and consider the following estimator
\[
\hat{\pi} = \frac{1}{N} \sum_{n=1}^{N} (C^n - c I^n), \tag{2.2}
\]
where (I^1, ..., I^N) are i.i.d. with the same distribution as I. Then, for any (h_t)_{t∈[0,T]} ∈ L²(Z) and c, this is still an unbiased estimator for the price of the option with payoff C, since the expected value of the stochastic integral vanishes. If we denote
\[
H = \frac{1}{N} \sum_{n=1}^{N} I^n,
\]
then the variance of π̂ is given by
\[
\operatorname{Var}(\hat{\pi}) = \operatorname{Var}(\pi) + c^2 \operatorname{Var}(H) - 2c \operatorname{Cov}(\pi, H).
\]
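As a numerical sanity check of the estimator (2.2) (a toy example of our own; we use the sample version of the variance-minimizing coefficient c = Cov(π, H)/Var(H), whose derivation follows in the text), suppose we are given payoff samples C and hedge-integral samples I:

```python
import numpy as np

def control_variate_estimate(C, I):
    """Unbiased price estimate (2.2) with the sample variance-optimal c."""
    c = np.cov(C, I)[0, 1] / np.var(I)
    return float(np.mean(C - c * I)), float(c)

rng = np.random.default_rng(0)
# Toy data: mean-zero "hedge integral" I, payoff C with true price 2.0 that
# is strongly correlated with I (i.e. a good but imperfect hedge).
I = rng.standard_normal(100_000)
C = 2.0 + 0.8 * I + 0.1 * rng.standard_normal(100_000)
price, c = control_variate_estimate(C, I)
```

Since the residual C − cI has standard deviation about 0.1 against roughly 0.8 for C itself, the hedged estimator needs far fewer samples for the same accuracy, which is exactly the effect exploited later in the calibration.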
This becomes minimal by choosing
\[
c = \frac{\operatorname{Cov}(\pi, H)}{\operatorname{Var}(H)}.
\]
With this choice, we have
\[
\operatorname{Var}(\hat{\pi}) = \big(1 - \operatorname{Corr}^2(\pi, H)\big) \operatorname{Var}(\pi).
\]
In particular, in the case of a perfect (pathwise) hedge we have Corr(π, H) = 1 and the estimator π̂ has zero variance, since in this case
\[
\operatorname{Var}(\pi) = \operatorname{Var}(H) = \operatorname{Cov}(\pi, H).
\]
Therefore it is crucial to find a good approximate hedging portfolio such that Corr(π, H) becomes large. This is the subject of Sections 2.1 and 2.2 below.

2.1. Black-Scholes Delta hedge.
In many cases of local stochastic volatility models of the form (1.3) and options depending only on the terminal value of the price process, a Delta hedge based on the Black-Scholes model works well. Indeed, let C = g(S_T) and let π_BS^g(t, T, s, σ) be the price at time t of this claim in the Black-Scholes model. Here, s stands for the price variable and σ for the volatility parameter in the Black-Scholes model. Moreover, we indicate the dependency on the maturity T as well. Then choosing as hedging instrument only the price S itself and as approximate hedging strategy
\[
h_t = \partial_s \pi_{\mathrm{BS}}^g(t, T, S_t, L(t, S_t)\alpha_t) \tag{2.3}
\]
usually already yields a considerable variance reduction. In fact it is even sufficient to consider α_t alone to achieve satisfying results, i.e., one takes
\[
h_t = \partial_s \pi_{\mathrm{BS}}^g(t, T, S_t, \alpha_t). \tag{2.4}
\]
This reduces the computational costs for the evaluation of the hedging strategies even further.
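For a call payoff g(S_T) = (S_T − K)^+ the derivative in (2.3)–(2.4) is available in closed form, namely the Black-Scholes call delta Φ(d₁) (zero rates, as the market is already in discounted terms). A minimal sketch with our own function names:

```python
from math import log, sqrt
from statistics import NormalDist

def bs_call_delta(t, T, s, K, sigma):
    """d/ds of the Black-Scholes call price pi_BS(t, T, s, sigma), zero rates:
    delta = Phi(d1) with d1 = (log(s/K) + sigma^2 tau / 2) / (sigma sqrt(tau))."""
    tau = T - t
    d1 = (log(s / K) + 0.5 * sigma ** 2 * tau) / (sigma * sqrt(tau))
    return NormalDist().cdf(d1)

# Along a simulated path one would evaluate, in the spirit of (2.4),
# h_t = bs_call_delta(t, T, S_t, K, alpha_t) at each rebalancing date.
delta_atm = bs_call_delta(0.0, 1.0, 100.0, 100.0, 0.2)   # at the money
delta_itm = bs_call_delta(0.0, 1.0, 200.0, 100.0, 0.2)   # deep in the money
```

At the money the delta is slightly above one half, and deep in the money it approaches one, as expected for a call.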
2.2. Hedging strategies as neural networks - deep hedging.
Alternatively, in particular when the number of hedging instruments becomes larger, one can learn the hedging strategy by parametrizing it via neural networks. For a brief overview of neural networks and the relevant notation used below, we refer to Appendix B.

Let the payoff again be a function of the terminal values of the hedging instruments, i.e., C = g(Z_T). Then in Markovian models it makes sense to specify the hedging strategy via a function h: ℝ_+ × ℝ^r → ℝ^r,
\[
h_t = h(t, z),
\]
which in turn will correspond to an artificial neural network (t, z) ↦ h(t, z | δ) ∈ NN_{r+1,r} with weights denoted by δ in some parameter space Δ (see Notation B.4). Following the approach in [5, Remark 3], an optimal hedge for the claim C with given market price π^mkt can be computed via
\[
\inf_{\delta \in \Delta} \mathbb{E}\big[u\big(-C + \pi^{\mathrm{mkt}} + (h(\cdot, Z_{\cdot-} \mid \delta) \bullet Z_\cdot)_T\big)\big]
\]
for some convex loss function u: ℝ → ℝ_+. If u(x) = x², which is often used in practice, this corresponds to a quadratic hedging criterion.

(We here use δ to denote the parameters of the hedging neural networks, as θ shall be used for the networks of the leverage function.)
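A self-contained toy version of this optimization (entirely our own construction: a tiny tanh network for h, a Brownian toy market, and finite-difference gradients with backtracking in place of backpropagation and stochastic gradient descent) minimizes the discretized quadratic criterion directly:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(delta, t, z):
    """Tiny feed forward network h(t, z | delta) with 4 tanh hidden units;
    delta packs (W1, b1, W2, b2) into a flat vector of 17 parameters."""
    W1, b1, W2, b2 = delta[:8].reshape(4, 2), delta[8:12], delta[12:16], delta[16]
    x = np.tanh(np.stack([t, z], axis=-1) @ W1.T + b1)
    return x @ W2 + b2

def loss(delta, Z, C, pi_mkt):
    """Discretized quadratic hedging criterion E[(-C + pi_mkt + (h . Z)_T)^2]."""
    dZ = np.diff(Z, axis=1)
    t = np.broadcast_to(np.linspace(0.0, 1.0, dZ.shape[1]), dZ.shape)
    gains = np.sum(mlp(delta, t, Z[:, :-1]) * dZ, axis=1)
    return float(np.mean((-C + pi_mkt + gains) ** 2))

# Toy market: Z Brownian started at 1, C an at-the-money call on Z_T.
Z = 1.0 + np.hstack([np.zeros((512, 1)),
                     np.cumsum(0.02 * rng.standard_normal((512, 50)), axis=1)])
C = np.maximum(Z[:, -1] - 1.0, 0.0)
pi_mkt = C.mean()

delta = 0.1 * rng.standard_normal(17)
loss0 = loss(delta, Z, C, pi_mkt)
for _ in range(10):   # plain gradient descent with backtracking line search
    g = np.array([(loss(delta + 1e-6 * e, Z, C, pi_mkt)
                   - loss(delta - 1e-6 * e, Z, C, pi_mkt)) / 2e-6
                  for e in np.eye(17)])
    lr, cur = 1.0, loss(delta, Z, C, pi_mkt)
    while lr > 1e-8 and loss(delta - lr * g, Z, C, pi_mkt) > cur:
        lr *= 0.5
    delta = delta - lr * g
loss1 = loss(delta, Z, C, pi_mkt)
```

The trained strategy then yields the hedge integral used as control variate in (2.2); in the paper this training is done with automatic differentiation and stochastic gradient descent, which the finite-difference loop here only stands in for.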
In order to tackle this optimization problem, we can apply stochastic gradient descent, because we fall in the realm of problem (B.1). Indeed, the stochastic objective function Q(δ)(ω) is given by
\[
Q(\delta)(\omega) = u\big(-C(\omega) + \pi^{\mathrm{mkt}} + (h(\cdot, Z_{\cdot-} \mid \delta)(\omega) \bullet Z_\cdot(\omega))_T\big).
\]
The optimal hedging strategy h(·, · | δ*) for an optimizer δ* can then be used to define (h(·, Z_{·−} | δ*) • Z_·)_T, which is in turn used in (2.2).

As always in this article we shall assume that the activation functions of the neural networks as well as the convex loss function u are smooth, hence we can calculate derivatives with respect to δ in a straightforward way. This is important for applying stochastic gradient descent, see Section B.2. We shall show that the gradient of Q(δ) is given by
\[
\nabla_\delta Q(\delta)(\omega) = u'\big(-C(\omega) + \pi^{\mathrm{mkt}} + (h(\cdot, Z_{\cdot-} \mid \delta)(\omega) \bullet Z_\cdot(\omega))_T\big)\, \big(\nabla_\delta h(\cdot, Z_{\cdot-} \mid \delta)(\omega) \bullet Z_\cdot(\omega)\big)_T,
\]
i.e. we are allowed to move the gradient inside the stochastic integral, and that approximations with simple processes, as we shall do in practice, converge to the correct quantities. To ensure this property, we shall apply the following theorem, which follows directly from results in Section A.

Theorem 2.1.
For ε ≥ 0, let Z^ε be a solution of a stochastic differential equation as described in Theorem A.3 with drivers Y = (Y^1, ..., Y^d), functionally Lipschitz operators F^{ε,ij}, i = 1, ..., r, j = 1, ..., d, and a process (J^{ε,1}, ..., J^{ε,r}), which is here for all ε ≥ 0 simply J 1_{{t=0}}(t) for some constant vector J ∈ ℝ^r. Let (ε, t, z) ↦ f^ε(t, z) be a map such that the bounded càglàd process f^ε := f^ε(·−, Z^ε_{·−}) converges ucp to f := f(·−, Z_{·−}); then
\[
\lim_{\varepsilon \to 0} (f^\varepsilon \bullet Z^\varepsilon) = (f \bullet Z)
\]
holds true.

Proof. Consider the extended system
\[
d(f^\varepsilon \bullet Z^\varepsilon)^i_t = \sum_{j=1}^{d} f^\varepsilon(t-, Z^\varepsilon_{t-})\, F^{\varepsilon,ij}(Z^\varepsilon)_{t-}\, dY^j_t \quad \text{and} \quad dZ^{\varepsilon,i}_t = \sum_{j=1}^{d} F^{\varepsilon,ij}(Z^\varepsilon)_{t-}\, dY^j_t,
\]
where we obtain existence, uniqueness and stability for the second equation by Theorem A.3, and from which we obtain ucp convergence of the integrand of the first equation: since stochastic integration is continuous with respect to the ucp topology, we obtain the result. □

The following corollary implies the announced properties, namely that we can move the gradient inside the stochastic integral and that the derivatives of a discretized integral with a discretized version of Z and approximations of the hedging strategies are actually close to the derivatives of the limit object.

Corollary 2.2.
Let, for ε > 0, Z^ε denote a discretization of the process of hedging instruments Z ≡ Z^0, such that the conditions of Theorem 2.1 are satisfied. Denote, for ε ≥ 0, the corresponding hedging strategies by (t, z, δ) ↦ h^ε(t, z | δ), given by neural networks in NN_{r+1,r} whose activation functions are bounded and C¹ with bounded derivatives.

(i) Then the derivative ∇_δ(h(·, Z_{·−} | δ) • Z) in direction δ at δ_0 satisfies
\[
\nabla_\delta \big(h(\cdot, Z_{\cdot-} \mid \delta_0) \bullet Z\big) = \big(\nabla_\delta h(\cdot, Z_{\cdot-} \mid \delta_0) \bullet Z\big).
\]
(ii) If additionally ∇_δ h^ε(·, Z_{·−} | δ_0), the derivative in direction δ at δ_0, converges ucp to ∇_δ h(·, Z_{·−} | δ_0) as ε → 0, then the directional derivative of the discretized integral, i.e. ∇_δ(h^ε(·, Z^ε_{·−} | δ_0) • Z^ε), or equivalently (∇_δ h^ε(·, Z^ε_{·−} | δ_0) • Z^ε), converges, as the discretization mesh ε → 0, to
\[
\lim_{\varepsilon \to 0} \big(\nabla_\delta h^\varepsilon(\cdot, Z^\varepsilon_{\cdot-} \mid \delta_0) \bullet Z^\varepsilon\big) = \big(\nabla_\delta h(\cdot, Z_{\cdot-} \mid \delta_0) \bullet Z\big).
\]

Proof.
To prove (i), we apply Theorem 2.1 with
\[
f^\varepsilon(\cdot, Z_{\cdot-}) = \frac{h(\cdot, Z_{\cdot-} \mid \delta_0 + \varepsilon\delta) - h(\cdot, Z_{\cdot-} \mid \delta_0)}{\varepsilon},
\]
which converges ucp to f = ∇_δ h(·, Z_{·−} | δ_0). Indeed, by the neural network assumptions, we have (with the sup over some compact set)
\[
\lim_{\varepsilon \to 0} \sup_{(t,z)} \left\| \frac{h(t, z \mid \delta_0 + \varepsilon\delta) - h(t, z \mid \delta_0)}{\varepsilon} - \nabla_\delta h(t, z \mid \delta_0) \right\| = 0,
\]
by equicontinuity of {(t, z) ↦ ∇_δ h(t, z | δ_0 + εδ) | ε ∈ [0, 1]}.

Concerning (ii), we apply again Theorem 2.1, this time with
\[
f^\varepsilon(\cdot, Z_{\cdot-}) = \nabla_\delta h^\varepsilon(\cdot, Z_{\cdot-} \mid \delta_0),
\]
which converges by assumption (ii) ucp to f = ∇_δ h(·, Z_{·−} | δ_0). □

3. Calibration of LSV models
Consider an LSV model of the form (1.3) defined on some filtered probability space (Ω, F, (F_t)_{t∈[0,T]}, Q), where Q is a risk neutral measure. We assume the stochastic process α to be fixed. This can for instance be achieved by first calibrating the pure stochastic volatility model with L ≡ 1 in perfect accordance with market data.

We here consider only European call options, but our approach allows in principle to take all kinds of other options into account.

Due to the universal approximation properties outlined in Appendix B (Theorem B.3), and in the spirit of neural SDEs, we choose to parametrize L via neural networks. More precisely, set T_0 = 0 and let 0 < T_1 < ... < T_n = T denote the maturities of the available European call options to which we aim to calibrate the LSV model. We then specify the leverage function L(t, s) via a family of neural networks, i.e.,
\[
L(t, s \mid \theta) = 1 + \sum_{i=1}^{n} F_i(s \mid \theta_i)\, \mathbf{1}_{[T_{i-1}, T_i)}(t), \tag{3.1}
\]
where F_i ∈ NN_{1,1} for i = 1, ..., n (see Notation B.4). For notational simplicity we shall often omit the dependence on θ_i ∈ Θ_i. However, when needed we write for instance S_t(θ), where θ then stands for all parameters θ_i used up to time t.

For purposes of training, similarly as in Section 2.2, we shall need to calculate derivatives of the LSV process with respect to θ.
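A sketch of the parametrization (3.1) together with an Euler scheme for the resulting neural SDE (the network sizes, weight layout, and fixed untrained weights are illustrative assumptions of ours; in the paper the F_i are of course trained):

```python
import numpy as np

def make_leverage(thetas, maturities):
    """L(t, s | theta) = 1 + F_i(s | theta_i) on [T_{i-1}, T_i), cf. (3.1);
    each F_i is a one-hidden-layer tanh network, thetas[i] = (W1, b1, W2, b2)."""
    def L(t, s):
        i = min(np.searchsorted(maturities, t, side="right"), len(thetas) - 1)
        W1, b1, W2, b2 = thetas[i]
        return 1.0 + np.tanh(np.outer(s, W1) + b1) @ W2 + b2
    return L

def euler_lsv(L, alpha, s0=1.0, T=1.0, seed=0):
    """Euler scheme for dS = S L(t, S) alpha dW with exogenous alpha paths
    of shape (paths, steps)."""
    rng = np.random.default_rng(seed)
    n_paths, n_steps = alpha.shape
    dt = T / n_steps
    S = np.full(n_paths, s0)
    for k in range(n_steps):
        dW = np.sqrt(dt) * rng.standard_normal(n_paths)
        S = S + S * L(k * dt, S) * alpha[:, k] * dW
    return S

# Two maturity buckets; zero networks give L = 1, i.e. the pure SV model.
zero = (np.zeros(4), np.zeros(4), np.zeros(4), 0.0)
L = make_leverage([zero, zero], maturities=[0.5, 1.0])
S_T = euler_lsv(L, alpha=np.full((20_000, 50), 0.2))
```

With zero networks the discounted price is (up to discretization error) a martingale, so the Monte Carlo mean of S_T stays close to the initial value 1.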
Theorem 3.1. Let (t, s, θ) ↦ L(t, s | θ) be of the form (3.1), where the neural networks (s, θ_i) ↦ F_i(s | θ_i) are bounded and C¹ with bounded and Lipschitz continuous derivatives, for all i = 1, ..., n. Then the directional derivative in direction θ at θ̂ satisfies the following equation
\[
d\big(\nabla_\theta S_t(\hat\theta)\big) = \Big(\nabla_\theta S_t(\hat\theta)\, L(t, S_t(\hat\theta) \mid \hat\theta) + S_t(\hat\theta)\, \partial_s L(t, S_t(\hat\theta) \mid \hat\theta)\, \nabla_\theta S_t(\hat\theta) + S_t(\hat\theta)\, \nabla_\theta L(t, S_t(\hat\theta) \mid \hat\theta)\Big)\, \alpha_t\, dW_t, \tag{3.2}
\]
with initial value 0. This can be solved by variation of constants and leads to well-known backward propagation schemes.

Proof. First note that Theorem A.2 implies the existence and uniqueness of
\[
dS_t(\theta) = S_t(\theta)\, L(t, S_t(\theta) \mid \theta)\, \alpha_t\, dW_t,
\]
for every θ. Here, the driving process is one dimensional and given by Y = ∫_0^· α_s dW_s. Indeed, according to Remark A.4, if (t, s) ↦ L(t, s | θ) is bounded, càdlàg in t and Lipschitz in s with a Lipschitz constant independent of t, then S ↦ S_·(θ) L(·, S_·(θ) | θ) is functionally Lipschitz and Theorem A.2 implies the assertion. These conditions are implied by the form of L(t, s | θ) and the conditions on the neural networks F_i.

To prove the form of the derivative process we apply Theorem A.3 to the following system: consider
\[
dS_t(\hat\theta) = S_t(\hat\theta)\, L(t, S_t(\hat\theta) \mid \hat\theta)\, \alpha_t\, dW_t,
\]
together with
\[
dS_t(\hat\theta + \varepsilon\theta) = S_t(\hat\theta + \varepsilon\theta)\, L(t, S_t(\hat\theta + \varepsilon\theta) \mid \hat\theta + \varepsilon\theta)\, \alpha_t\, dW_t,
\]
as well as
\[
d\, \frac{S_t(\hat\theta + \varepsilon\theta) - S_t(\hat\theta)}{\varepsilon} = \Big( \frac{S_t(\hat\theta + \varepsilon\theta) - S_t(\hat\theta)}{\varepsilon}\, L(t, S_t(\hat\theta + \varepsilon\theta) \mid \hat\theta + \varepsilon\theta) + S_t(\hat\theta)\, \frac{L(t, S_t(\hat\theta + \varepsilon\theta) \mid \hat\theta + \varepsilon\theta) - L(t, S_t(\hat\theta) \mid \hat\theta)}{\varepsilon} \Big)\, \alpha_t\, dW_t.
\]
In the terminology of Theorem A.3, Z^{ε,1} = S(θ̂), Z^{ε,2} = S(θ̂ + εθ) and Z^{ε,3} = (S(θ̂ + εθ) − S(θ̂))/ε. Moreover, F^{ε,3} is given by
\[
F^{\varepsilon,3}(Z_t) = Z^3_t\, L(t, Z^2_t \mid \hat\theta + \varepsilon\theta) + Z^1_t\, \partial_s L(t, Z^1_t \mid \hat\theta)\, Z^3_t + O(\varepsilon) + Z^1_t\, \frac{L(t, Z^1_t \mid \hat\theta + \varepsilon\theta) - L(t, Z^1_t \mid \hat\theta)}{\varepsilon}, \tag{3.3}
\]
which converges ucp to
\[
F^{0,3}(Z_t) = Z^3_t\, L(t, Z^1_t \mid \hat\theta) + Z^1_t\, \partial_s L(t, Z^1_t \mid \hat\theta)\, Z^3_t + Z^1_t\, \nabla_\theta L(t, Z^1_t \mid \hat\theta).
\]
Indeed, for every fixed t, the family {s ↦ L(t, s | θ̂ + εθ) | ε ∈ [0, 1]} is, due to the form of the neural networks, equicontinuous. Hence pointwise convergence implies uniform convergence in s. This, together with L(t, s | θ) being piecewise constant in t, yields
\[
\lim_{\varepsilon \to 0} \sup_{(t,s)} \big| L(t, s \mid \hat\theta + \varepsilon\theta) - L(t, s \mid \hat\theta) \big| = 0,
\]
whence ucp convergence of the first term in (3.3). The convergence of terms two and three is clear. The last one follows again from the fact that the family {s ↦ ∇_θ L(t, s | θ̂ + εθ) | ε ∈ [0, 1]} is equicontinuous, which is again a consequence of the form of the neural networks.

By the assumptions on the derivatives, F^{0,3} is functionally Lipschitz. Hence Theorem A.2 yields the existence of a unique solution to (3.2) and Theorem A.3 implies convergence. □

(The assumption on the F_i just means that the activation function is bounded and C¹ with bounded and Lipschitz continuous derivatives.)

Remark 3.2. (i) For the pure existence and uniqueness of
\[
dS_t(\theta) = S_t(\theta)\, L(t, S_t(\theta) \mid \theta)\, \alpha_t\, dW_t,
\]
with L(t, s | θ) of the form (3.1), it suffices that the neural networks s ↦ F_i(s | θ_i) are bounded and Lipschitz, for all i = 1, ..., n (see also Remark A.4).

(ii) Similarly as in Theorem 2.1 we can also consider a discretization S^ε(θ) and conclude analogously as above the form of its derivative process. If the corresponding coefficients converge in ucp, then so does the derivative process.

Theorem 3.1 guarantees the existence and uniqueness of the derivative process. This allows to set up gradient based search algorithms for training.

In view of this, let us now come to the precise optimization task as already outlined in Section 1.2. To ease the notation we shall here omit the dependence of the weights w and the loss function ℓ on the parameter γ. For each maturity T_i, we assume to have J_i options with strikes K_{ij}, j ∈ {1, ..., J_i}. The calibration functional for the i-th maturity is then of the form
\[
\operatorname*{argmin}_{\theta_i \in \Theta_i} \sum_{j=1}^{J_i} w_{ij}\, \ell\big(\pi^{\mathrm{mod}}_{ij}(\theta_i) - \pi^{\mathrm{mkt}}_{ij}\big), \quad i \in \{1, \dots, n\}, \tag{3.4}
\]
where π^mod_{ij}(θ_i) (respectively π^mkt_{ij}) denotes the model (respectively market) price of an option with maturity T_i and strike K_{ij}. Moreover, ℓ: ℝ → ℝ_+ is some nonnegative, nonlinear, convex loss function (e.g. square or absolute value) with ℓ(0) = 0 and ℓ(x) > 0 for x ≠ 0, measuring the distance between market and model prices. Finally, w_{ij} denote some weights, e.g. of vega type (compare [9]), which allow us to match implied volatility data rather than pure prices, our actual goal, very well.
Notice that we allow the weights to be trained and adapted during a run of our algorithm. We solve the minimization problems (3.4) iteratively: we start with maturity $T_1$ and fix $\theta_1$. This then enters the computation of $\pi^{\mathrm{mod}}_{2j}(\theta_2)$ and thus (3.4) for maturity $T_2$, and so on. To simplify the notation in the sequel, we shall therefore drop the index $i$, so that for a generic maturity $T > 0$, (3.4) becomes
$$\operatorname*{argmin}_{\theta \in \Theta} \sum_{j=1}^{J} w_j\, \ell\bigl(\pi^{\mathrm{mod}}_{j}(\theta) - \pi^{\mathrm{mkt}}_{j}\bigr).$$
Since the model prices are given by
$$\pi^{\mathrm{mod}}_j(\theta) = \mathbb{E}\bigl[(S_T(\theta) - K_j)^+\bigr], \quad (3.5)$$
we have $\pi^{\mathrm{mod}}_j(\theta) - \pi^{\mathrm{mkt}}_j = \mathbb{E}[Q_j(\theta)]$, where
$$Q_j(\theta)(\omega) := (S_T(\theta)(\omega) - K_j)^+ - \pi^{\mathrm{mkt}}_j. \quad (3.6)$$
The calibration task then amounts to finding a minimum of
$$f(\theta) := \sum_{j=1}^{J} w_j\, \ell\bigl(\mathbb{E}[Q_j(\theta)]\bigr). \quad (3.7)$$
As $\ell$ is a nonlinear function, this is not of the expected value form of problem (B.1). Hence standard stochastic gradient descent, as outlined in Appendix B.2, cannot be applied in a straightforward manner. We shall tackle this problem via hedge control variates as introduced in Section 2. In the following we explain this in more detail.

3.1. Minimizing the calibration functional.
Consider the standard Monte Carlo estimator for $\mathbb{E}[Q_j(\theta)]$, so that (3.7) is estimated by
$$f^{\mathrm{MC}}(\theta) := \sum_{j=1}^{J} w_j\, \ell\Bigl(\frac{1}{N}\sum_{n=1}^{N} Q_j(\theta)(\omega_n)\Bigr), \quad (3.8)$$
for i.i.d. samples $\{\omega_1, \ldots, \omega_N\} \in \Omega$. Since the Monte Carlo error decreases as $1/\sqrt{N}$, the number of simulations $N$ has to be chosen large in order to approximate the true model prices in (3.5) well. Note that implied volatility, to which we actually aim to calibrate, is even more sensitive. As stochastic gradient descent is not directly applicable due to the nonlinearity of $\ell$, it seems necessary at first sight to compute the gradient of the whole function $\widehat{f}(\theta)$ to minimize (3.8). For such large $N$, this is however computationally very expensive, leads to numerical instabilities and does not allow to find a minimum in the high dimensional parameter space $\Theta$ in a reasonable amount of time. One very expedient remedy is to apply hedge control variates, as introduced in Section 2, as a variance reduction technique. This allows to reduce the number of samples $N$ in the Monte Carlo estimator considerably, to only up to $5 \cdot 10^4$ sample paths.

Assume that we have $r$ hedging instruments (including the price process $S$), denoted by $(Z_t)_{t \in [0,T]}$, which are square integrable martingales under $\mathbb{Q}$ and take values in $\mathbb{R}^r$. Consider, for $j = 1, \ldots, J$, strategies $h^j: [0,T] \times \mathbb{R}^r \to \mathbb{R}^r$ such that $h^j(\cdot, Z_\cdot) \in L^2(Z)$, and some constant $c$. Define
$$X_j(\theta)(\omega) := (S_T(\theta)(\omega) - K_j)^+ - c\,\bigl(h^j(\cdot, Z_{\cdot-}(\theta)(\omega)) \bullet Z_\cdot(\theta)(\omega)\bigr)_T - \pi^{\mathrm{mkt}}_j. \quad (3.9)$$
The calibration functionals (3.7) and (3.8) can then simply be redefined by replacing $Q_j(\theta)(\omega)$ by $X_j(\theta)(\omega)$, so that we end up minimizing
$$\widehat{f}(\theta)(\omega_1, \ldots, \omega_N) = \sum_{j=1}^{J} w_j\, \ell\Bigl(\frac{1}{N}\sum_{n=1}^{N} X_j(\theta)(\omega_n)\Bigr). \quad (3.10)$$
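The variance reduction achieved by such a hedge control variate can be illustrated in a minimal Black-Scholes toy example. This sketch is not the paper's implementation (which uses TensorFlow and the model's own delta): it simulates a call payoff with and without subtracting the discrete delta-hedge gains, with $c = 1$.

```python
import math, random

def bs_delta(s, k, sigma, tau):
    # Black-Scholes delta (zero rates), used as hedging strategy h(t, S_t)
    if tau <= 0:
        return 1.0 if s > k else 0.0
    d1 = (math.log(s / k) + 0.5 * sigma**2 * tau) / (sigma * math.sqrt(tau))
    return 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))

def payoff_samples(n_paths, n_steps, s0=1.0, k=1.0, sigma=0.3, T=1.0, seed=0):
    rng = random.Random(seed)
    dt = T / n_steps
    plain, hedged = [], []
    for _ in range(n_paths):
        s, gain = s0, 0.0
        for i in range(n_steps):
            dw = rng.gauss(0.0, math.sqrt(dt))
            delta = bs_delta(s, k, sigma, T - i * dt)
            s_new = s * math.exp(sigma * dw - 0.5 * sigma**2 * dt)
            gain += delta * (s_new - s)   # accumulates (h . Z)_T
            s = s_new
        q = max(s - k, 0.0)
        plain.append(q)
        hedged.append(q - gain)           # control variate with c = 1
    return plain, hedged
```

Both sample means are unbiased estimators of the same price, since the hedge gains are martingale increments with mean zero, but the hedged samples have drastically smaller variance, so far fewer paths suffice.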
To tackle this task, we apply the following variant of gradient descent: starting with an initial guess $\theta^{(0)}$, we iteratively compute
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \nabla \widehat{f}(\theta)\bigl(\omega^{(k)}_1, \ldots, \omega^{(k)}_N\bigr) \quad (3.11)$$
for some learning rate $\eta_k$ and i.i.d. samples $(\omega^{(k)}_1, \ldots, \omega^{(k)}_N)$. These samples can either be chosen to be the same in each iteration or be newly sampled in each update step. The difference between these two approaches is negligible, since $N$ is chosen so as to yield a small Monte Carlo error, whence the gradient is nearly deterministic. In our numerical experiments we sample newly in each update step.

Concerning the choice of the hedging strategies, we can parametrize them as in Section 2.2 via neural networks and find the optimal weights $\delta$ by computing
$$\operatorname*{argmin}_{\delta \in \Delta} \frac{1}{N}\sum_{n=1}^{N} u\bigl(-X_j(\theta, \delta)(\omega_n)\bigr) \quad (3.12)$$
for i.i.d. samples $\{\omega_1, \ldots, \omega_N\} \in \Omega$ and some loss function $u$, when $\theta$ is fixed. Here,
$$X_j(\theta, \delta)(\omega) = (S_T(\theta)(\omega) - K_j)^+ - \bigl(h^j(\cdot, Z_{\cdot-}(\theta)(\omega) \,|\, \delta) \bullet Z_\cdot(\theta)(\omega)\bigr)_T - \pi^{\mathrm{mkt}}_j.$$
This amounts to iterating the two optimization procedures, i.e. minimizing (3.10) over $\theta$ (with $\delta$ fixed) and (3.12) over $\delta$ (with $\theta$ fixed). Clearly the Black-Scholes hedge ansatz of Section 2.1 works as well, in this case without additional optimization with respect to the hedging strategies. For alternative approaches to minimizing (3.7), we refer to Appendix C.

4. Numerical Implementation
In this section we discuss the numerical implementation of the proposed calibration method. We implement our approach in TensorFlow, taking advantage of GPU-accelerated computing. All computations are performed on a single-GPU NVIDIA GeForce RTX 2080 machine. For the implied volatility computations, we rely on the Python py_vollib library. Recall that an LSV model is given on some filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \in [0,T]}, \mathbb{Q})$ by
$$dS_t = S_t \alpha_t L(t, S_t)\, dW_t, \quad S_0 > 0,$$
for some stochastic process $\alpha$. When calibrating to data, it is therefore necessary to make further specifications. We calibrate the following SABR-type LSV model.

Definition 4.1.
The SABR-LSV model is specified via the SDE
$$dS_t = S_t L(t, S_t) \alpha_t\, dW_t, \qquad d\alpha_t = \nu \alpha_t\, dB_t, \qquad d\langle W, B\rangle_t = \varrho\, dt,$$
with parameters $\nu \in \mathbb{R}$, $\varrho \in [-1, 1]$ and initial values $\alpha_0 > 0$, $S_0 > 0$. Here, $B$ and $W$ are two correlated Brownian motions. (For the implied volatility library, see http://vollib.org/.)
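These dynamics can be sampled with an Euler scheme in log-price and an exact update for $\alpha$, which is a geometric Brownian motion. The following sketch is illustrative, not the paper's TensorFlow implementation; the default leverage $L \equiv 1$ (reducing the model to pure SABR) and the parameter values are assumptions for demonstration.

```python
import math, random

def simulate_sabr_lsv(n_paths, n_steps, T=1.0, s0=1.0, alpha0=0.3,
                      nu=0.5, rho=-0.4, leverage=lambda t, x: 1.0, seed=1):
    """Euler scheme in X = log S; alpha is updated exactly as a GBM.
    leverage(t, x) plays the role of L(t, X_t); L = 1 gives pure SABR."""
    rng = random.Random(seed)
    dt = T / n_steps
    paths = []
    for _ in range(n_paths):
        x, alpha = math.log(s0), alpha0
        for i in range(n_steps):
            dw = rng.gauss(0.0, math.sqrt(dt))
            dz = rng.gauss(0.0, math.sqrt(dt))
            db = rho * dw + math.sqrt(1.0 - rho**2) * dz   # d<W,B> = rho dt
            vol = alpha * leverage(i * dt, x)
            x += vol * dw - 0.5 * vol**2 * dt              # log-Euler step
            alpha *= math.exp(nu * db - 0.5 * nu**2 * dt)  # exact GBM update
        paths.append(math.exp(x))
    return paths
```

Since each log-Euler step has conditional mean one, $S$ stays positive and its sample mean stays close to $S_0$, consistent with the martingale property of the price.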
Remark 4.2. We shall often work in log-price coordinates for $S$. In particular, we can then consider $L$ as a function of $X := \log S$ rather than $S$. Denoting this parametrization again by $L$, we have $L(t, X)$ instead of $L(t, S)$, and the model dynamics read as
$$dX_t = \alpha_t L(t, X_t)\, dW_t - \frac{\alpha_t^2 L^2(t, X_t)}{2}\, dt, \qquad d\alpha_t = \nu \alpha_t\, dB_t, \qquad d\langle W, B\rangle_t = \varrho\, dt.$$
Note that $\alpha$ is a geometric Brownian motion; in particular, its closed form solution is available and given by
$$\alpha_t = \alpha_0 \exp\Bigl(-\frac{\nu^2 t}{2} + \nu B_t\Bigr).$$
For the rest of the paper we shall set $S_0 = 1$.

4.1. Implementation of the calibration method.
We now present a proper numerical testand demonstrate the effectiveness of our approach on a family of typical market smiles (insteadof just one calibration example). We consider as ground truth a situation where market smilesare produced by a parametric family. By randomly sampling smiles from this family we thenshow that they can be calibrated up to small errors, which we analyze statistically.4.1.1.
Ground truth assumption.
We start by specifying the ground truth assumption. It is known that a discrete set of prices can be exactly calibrated by a local volatility model using Dupire's volatility function, if an appropriate interpolation method is chosen. Hence, any market observed smile data can be reproduced by the following model (we assume a zero riskless rate and define $X = \log(S)$),
$$dS_t = \sigma_{\mathrm{Dup}}(t, X_t) S_t\, dW_t,$$
or equivalently
$$dX_t = -\frac{\sigma_{\mathrm{Dup}}^2(t, X_t)}{2}\, dt + \sigma_{\mathrm{Dup}}(t, X_t)\, dW_t, \quad (4.1)$$
where $\sigma_{\mathrm{Dup}}$ denotes Dupire's local volatility function [14]. Our ground truth assumption consists in supposing that the function $\sigma_{\mathrm{Dup}}$ (or, to be more precise, $\sigma_{\mathrm{Dup}}^2$) can be chosen from a parametric family. Such parametric families for local volatility models have been discussed in the literature, consider e.g. [8] or [7]. In the latter, the authors introduce a family of local volatility functions $\widetilde{a}_\xi$ indexed by parameters
$$\xi = (p_1, p_2, \sigma_0, \sigma_1, \sigma_2) \quad \text{and} \quad p_0 = 1 - (p_1 + p_2),$$
satisfying the constraints $\sigma_0, \sigma_1, \sigma_2, p_1, p_2 > 0$ and $p_1 + p_2 \leq 1$. Setting $k(t, x, \sigma) = \exp\bigl(-x^2/(2t\sigma^2) - t\sigma^2/8\bigr)$, $\widetilde{a}_\xi$ is then defined as
$$\widetilde{a}_\xi(t, x) = \frac{\sum_{i=0}^{2} p_i \sigma_i k(t, x, \sigma_i)}{\sum_{i=0}^{2} (p_i/\sigma_i) k(t, x, \sigma_i)}.$$
In Figure 1(a) we show plots of implied volatilities for different slices for a realistic choice of parameters. As one can see, the produced smiles seem unrealistically flat. Hence we modify the local volatility function $\widetilde{a}_\xi$ to produce more pronounced and more realistic smiles.

Table 1. Fixed parameters $\gamma_1, \gamma_2, \lambda_1, \lambda_2, \beta_1, \beta_2, \kappa$ for the ground truth assumption $a_\xi$.

To be precise, we define a new family of local volatility functions $a_\xi$, indexed by the set of parameters $\xi$, as
$$a_\xi(t, x) = \frac{1}{4} \times \min\Biggl(1, \Biggl|\frac{\bigl(\sum_{i=0}^{2} p_i \sigma_i k(t, x, \sigma_i) + \Lambda(t, x)\bigr)\bigl(1 - 0.5 \times 1_{t > 0.5}\bigr)}{\sum_{i=0}^{2} (p_i/\sigma_i) k(t, x, \sigma_i) + 0.01}\Biggr|\Biggr), \quad (4.2)$$
with
$$\Lambda(t, x) := \Bigl(1_{t \leq 0.5} + \frac{0.5}{t}\, 1_{t > 0.5}\Bigr) \lambda_1 \min\Bigl\{\bigl(\gamma_1 (x - \beta_1)^+ + \gamma_2 (-x - \beta_2)^+\bigr)^\kappa, \lambda_2\Bigr\}.$$
We fix the choice of the parameters $\gamma_i, \beta_i, \lambda_i, \kappa$ as given in Table 1. By taking absolute values above, we can drop the requirement $p_i > 0$. Note that $a_\xi$ is not defined at $t = 0$. When doing a Monte Carlo simulation, we simply replace $a_\xi(0, x)$ with $a_\xi(\Delta_t, x)$, where $\Delta_t$ is the time increment of the Monte Carlo simulation. What is left to be specified are the parameters
$$\xi = (p_1, p_2, \sigma_0, \sigma_1, \sigma_2) \quad \text{with} \quad p_0 = 1 - p_1 - p_2.$$
This motivates our statistical test for the performance evaluation of our method. To be precise, our ground truth assumption is that all observable market prices are explained by a variation of the parameters $\xi$. For illustration, we plot implied volatilities for this modified local volatility function in Figure 1(b) for a specific parameter set $\xi$. Our ground truth model is now specified as in (4.1) with $\sigma_{\mathrm{Dup}}$ replaced by $a_\xi$, i.e.
$$dX_t = -\frac{a_\xi^2(t, X_t)}{2}\, dt + a_\xi(t, X_t)\, dW_t. \quad (4.3)$$

4.1.2. Performance test.
We now come to the evaluation of our proposed method. We want to calibrate the SABR-LSV model to market prices generated by the previously formulated ground truth assumption. This corresponds to randomly sampling the parameter $\xi$ of the local volatility function $a_\xi$ and computing prices according to (4.3). Calibrating the SABR-LSV model, i.e. finding the parameters $\nu, \varrho$, the initial volatility $\alpha_0$ and the unknown leverage function $L$, to these prices, and repeating this multiple times, then allows for a statistical analysis of the errors. As explained in Section 3, we consider European call options with maturities $T_1 < \cdots < T_n$ and denote the strikes for a given maturity $T_i$ by $K_{ij}$, $j \in \{1, \ldots, J_i\}$. To compute the ground truth prices for these European calls, we use an Euler discretization of (4.3) together with a Black-Scholes delta hedge variance reduction as described previously. For a given parameter set $\xi$, we use the same Brownian paths for all strikes and maturities.
Fig. 1.
Implied volatility of the original parametric family $\widetilde{a}_\xi$ (a) versus our modification $a_\xi$ (b).

Fig. 2. (a) Maturities used to generate data for the calibration test. (b) Parameters that define the strikes of the call options to which we calibrate.

Overall, in this test, we consider $n = 4$ maturities with $J_i = 20$ strike prices for all $i = 1, \ldots, 4$. The maturities $T_i$ are given in Figure 2(a). For the choice of the strikes $K_{ij}$, we choose evenly spaced points, i.e. $K_{i,j+1} - K_{i,j} = K_{i,2} - K_{i,1}$. For the smallest and largest strikes per maturity we choose
$$K_{i,1} = \exp(-k_i), \qquad K_{i,20} = \exp(k_i),$$
with the values of $k_i$ given in Figure 2(b). We now specify a distribution under which we draw the parameters $\xi = (p_1, p_2, \sigma_0, \sigma_1, \sigma_2)$ for our test. The components are all drawn independently from each other, under the uniform distribution on the respective intervals $I_{p_1}$, $I_{p_2}$, $I_{\sigma_0}$, $I_{\sigma_1}$ and $I_{\sigma_2}$.

• For $m = 1, \ldots, 150$, simulate parameters $\xi_m$ under the law described above.
• For each $m$, compute prices of European calls for maturities $T_i$ and strikes $K_{ij}$, for $i = 1, \ldots, n = 4$ and $j = 1, \ldots, 20$, according to (4.3) (for each $m$ we use new Brownian trajectories).
• Store these prices.

The second part consists in calibrating each of these surfaces and storing pertinent values, for which we conduct a statistical analysis. In the following we describe the procedure in detail. Recall that we specify the leverage function $L(t, x)$ via a family of neural networks, i.e.,
$$L(t, x) = 1 + F^i(x), \quad t \in [T_{i-1}, T_i), \quad i \in \{1, \ldots, n = 4\},$$
where $F^i \in \mathcal{NN}$ (see Notation B.4). Each $F^i$ is specified as a 3-hidden-layer feed forward network where the dimension of each of the hidden layers is 50. As activation function we choose $\sigma = \tanh$. As before, we denote the parameters of $F^i$ by $\theta_i$ and the corresponding parameter space by $\Theta_i$. As closed form pricing formulas are not available for such an LSV model, let us here briefly specify our pricing method. For the variance reduced Monte Carlo estimator as of (3.10), we always use a standard Euler SDE discretization. The volatility $L(t, S_t)\alpha_t$ is plugged into the formula for the Black-Scholes delta as in (2.3). The only parameter that remains to be specified is the number of trajectories used for the Monte Carlo estimator, which is done in Algorithm 4.3 and Specification 4.4 below. As a first calibration step, we calibrate the SABR model (i.e. the model of Definition 4.1 with $L \equiv$
1) to the market prices and fix the calibrated SABR parameters $\nu, \varrho$ and $\alpha_0$. For the remaining parameters $\theta_i$, $i = 1, \ldots, 4$, we apply the following algorithm until all parameters are calibrated.
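The piecewise-in-time parametrization of the leverage function can be sketched as follows. This is an illustration, not the TensorFlow implementation: the networks $F^i$ are stubbed as arbitrary callables (e.g. small tanh networks), and the behavior beyond the last maturity is an assumed convention.

```python
def make_leverage(maturities, nets):
    """Leverage function L(t, x) = 1 + F_i(x) for t in [T_{i-1}, T_i),
    as in the text.  maturities = [T_1, ..., T_n]; nets = [F_1, ..., F_n]
    are any callables x -> F_i(x)."""
    def L(t, x):
        prev = 0.0
        for T_i, F_i in zip(maturities, nets):
            if prev <= t < T_i:
                return 1.0 + F_i(x)
            prev = T_i
        return 1.0 + nets[-1](x)   # assumed convention beyond T_n
    return L

# untrained networks F_i = 0 give L = 1, i.e. the pure SABR model
L0 = make_leverage([0.5, 1.0], [lambda x: 0.0, lambda x: 0.0])
```

With all $F^i \equiv 0$ this reduces to $L \equiv 1$, which is exactly the SABR starting point of the calibration above.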
Algorithm 4.3.
In the subsequent pseudocode, the index i stands for the maturities, N for the number of samples used in the variance reduced Monte Carlo estimator as of (3.10), and k for the updating step in the gradient descent:

initialize theta_1, ..., theta_4; N, k = 400, 1; Delta_t, tol = 0.01, 0.0045
for i = 1, ..., 4:
    nextslice = False
    w_j = w~_j / sum_{l=1}^{20} w~_l with w~_j = 1/v_ij, where v_ij is the
        Black-Scholes vega for strike K_ij, the corresponding market
        implied volatility and the maturity T_i.
    while nextslice == False:
        do: Simulate N trajectories of the SABR-LSV process up to time T_i;
            compute the payoffs.
        do: Compute the stochastic integral of the Black-Scholes delta hedge
            against these trajectories as of (2.3) for maturity T_i.
        do: Compute the calibration functional as of (3.10) with
            loss l(x) = x^2 and weights w_j.
        do: Make an optimization step from theta_i^(k-1) to theta_i^(k),
            similarly as in (3.11) but with the more sophisticated
            ADAM optimizer.
        do: Update the parameter N and the condition nextslice, and compute
            model prices according to Specification 4.4.
        do: k = k + 1

Specification 4.4.
We update the parameters in the above algorithm according to the following rules:

if k == 500:       N = 2000
else if k == 1500: N = 10000
else if k == 4000: N = 50000
if k >= 5000 and k mod 1000 == 0:
    do: Compute model prices pi_model for slice i via MC simulation.
        Apply the Black-Scholes delta hedge for variance reduction.
    do: Compute implied volatilities iv_model from the model prices pi_model.
    do: Compute the maximum error of model implied volatilities against
        market implied volatilities:
        err_cali = || iv_model - iv_market ||_max
    if err_cali <= tol or k == 12000:
        nextslice = True
    else:
        Apply the adversarial part: adjust the weights w_j according to
        for j = 1, ..., 20:
            w_j = w_j + 0.1 * | iv_model_j - iv_market_j |
        This puts higher weights on the options where the fit can still
        be improved. Normalize the weights:
        for j = 1, ..., 20:
            w_j = w_j / sum_{l=1}^{20} w_l

Numerical results for the calibration test.
We now discuss the results of our test. We start by pointing out that, of the 150 sampled market smiles, two caused difficulty, in the sense that our implied volatility computation failed due to the remaining Monte Carlo error. By increasing the training parameters slightly, this issue can be mitigated, but the resulting calibrated implied volatility errors stay large out of the money, where the smiles are extreme. Hence, we opt to remove those two samples from the following statistical analysis. Further, we identify six smiles for which the calibration of at least one slice failed. Here, we say the calibration of a slice has failed if the maximum error of implied volatility is larger than 0.01. Let us make the following remark: in all these examples the training got stuck in a local minimum, and a second drawing of the initial parameters actually led to satisfying results. A straightforward parallelization to reduce the additional time due to redrawing parameters is of course possible. In the following, however, we keep the data of the six failed calibrations as they are when presenting results.

In Figure 3 we show calibration results for a typical example of randomly generated market data. From this it is already visible that the worst case calibration error (which occurs out of the money) typically ranges between 20 and 40 basis points. The corresponding calibration result for the leverage function $L$ is given in Figure 4. Let us note that our method achieves a high calibration accuracy for the considered range of strikes and across all considered maturities. However, the further away from at-the-money, the more challenging the calibration becomes, as can be seen in the results of a worst case analysis of calibration errors in Figures 6 and 7. There we show the mean as well as different quantiles of the data. We present a histogram of calibration times in Figure 5. There, we see that a typical calibration of all slices finishes in under 30 minutes, and only rarely do we face the situation where a higher number of optimization steps is needed before the abort criterion is activated. Since our approach allows for straightforward parallelization strategies, these times can be reduced significantly by changing to a multi-GPU setup.

5. Conclusion
We have demonstrated how a parametrization by means of neural networks can be used to calibrate local stochastic volatility models to implied volatility data. We make the following remarks:

(i) The method we presented does not require any form of interpolation for the implied volatility surface, since we do not calibrate via Dupire's formula. As the interpolation is usually done ad hoc, this is a desirable feature of our method.
(ii) It is possible to "plug in" any stochastic variance process, such as rough volatility processes, as long as an efficient simulation of trajectories is possible.
(iii) The multivariate extension is straightforward.
(iv) The level of accuracy of the calibration result is very high, making the presented method of interest for this feature alone.
(v) The method can be significantly accelerated by applying distributed computation methods in the context of multi-GPU computational concepts.
(vi) The presented algorithm is further able to deal with path-dependent options, since all computations are done by means of Monte Carlo simulations.
(vii) We can also consider the instantaneous variance process of the price process as the short end of a forward variance process, which is assumed to follow (under appropriate assumptions) a neural SDE. This setting, as an infinite-dimensional version of the aforementioned multivariate setting, then qualifies for joint calibration to S&P and VIX options. This is investigated in a companion paper.
(viii) We stress again the advantages of the generative adversarial network point of view. Indeed, the adversarial choice of the weights leads to very accurate generative neural SDE models. We believe that this is a crucial feature in the joint calibration of S&P and VIX options.

6.
Plots
This section contains the relevant plots for the numerical test outlined in Section 4.
Fig. 3.
Left column: implied volatilities for the calibrated model together with the data (market) implied volatilities for a typical example of sampled market prices. Right column: calibration errors, obtained by subtracting model implied volatilities from the data implied volatilities. Each row corresponds to one of the four available maturities.
Fig. 4.
Plot of the calibrated leverage function $L(t, \cdot)$ at $t \in \{0, T_1, T_2, T_3\}$ in the example shown in Figure 3.
Fig. 5.
Histogram of calibration times for the statistical test introduced in Section 4.1.2. One entry in this histogram corresponds to the number of minutes before the calibration of all slices is finished.

Fig. 6.
Boxplots for the first (above) and second (below) slice. Depicted are the mean (horizontal line) as well as several quantiles of the maximum of absolute calibration errors along all strikes.

Fig. 7.
Boxplots for the third (above) and fourth (below) slice. Depicted are the mean (horizontal line) as well as several quantiles of the maximum of absolute calibration errors along all strikes.
Appendix A. Variations of stochastic differential equations
We follow here the excellent exposition of Philipp Protter in [33] in order to understand the dependence of solutions of stochastic differential equations on parameters, in particular when we aim to calculate derivatives with respect to parameters of neural networks. Let us denote by $\mathbb{D}$ the set of real-valued, càdlàg, adapted processes on a given stochastic basis $(\Omega, \mathcal{F}, \mathbb{Q})$ with a filtration (satisfying the usual conditions). By $\mathbb{D}^n$ we denote the set of $\mathbb{R}^n$-valued, càdlàg, adapted processes on the same basis.

Definition A.1.
An operator $F$ from $\mathbb{D}^n$ to $\mathbb{D}$ is called functionally Lipschitz if, for any $X, Y \in \mathbb{D}^n$,
(i) the property $X^{\tau-} = Y^{\tau-}$ implies $F(X)^{\tau-} = F(Y)^{\tau-}$ for any stopping time $\tau$;
(ii) there exists an increasing process $(K_t)_{t \geq 0}$ such that for $t \geq 0$
$$\|F(X)_t - F(Y)_t\| \leq K_t \sup_{r \leq t} \|X_r - Y_r\|.$$
Functional Lipschitz assumptions are sufficient to obtain existence and uniqueness for general stochastic differential equations, see [33, Theorem V.7].
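As a toy illustration of this definition (not part of the paper's argument), consider the memoryless operator $F(X)_t = \tanh(X_t)$, which is functionally Lipschitz with $K_t \equiv 1$ since $\tanh$ is 1-Lipschitz. The bound in (ii) can be checked numerically on discretized paths:

```python
import math, random

def F(path):
    # the memoryless operator F(X)_t = tanh(X_t); since tanh is
    # 1-Lipschitz, F is functionally Lipschitz with K_t = 1
    return [math.tanh(x) for x in path]

def check_bound(n_trials=200, n_steps=100, seed=7):
    rng = random.Random(seed)
    for _ in range(n_trials):
        X, Y = [0.0], [0.0]
        for _ in range(n_steps - 1):
            X.append(X[-1] + rng.gauss(0.0, 0.1))
            Y.append(Y[-1] + rng.gauss(0.0, 0.1))
        FX, FY = F(X), F(Y)
        for t in range(n_steps):
            sup = max(abs(X[r] - Y[r]) for r in range(t + 1))
            if abs(FX[t] - FY[t]) > sup + 1e-12:
                return False
    return True
```

The map $S \mapsto S_\cdot L(\cdot, S_\cdot\,|\,\theta)$ used below is of the same memoryless type, with the Lipschitz constant coming from the boundedness and Lipschitz continuity of $L$.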
Theorem A.2.
Let $Y = (Y^1, \ldots, Y^d)$ be a vector of semimartingales starting at $Y_0 = 0$, $(J^1, \ldots, J^n) \in \mathbb{D}^n$ a vector of processes, and let $F^i_j$, $i = 1, \ldots, n$, $j = 1, \ldots, d$, be functionally Lipschitz operators. Then there is a unique process $Z \in \mathbb{D}^n$ satisfying
$$Z^i_t = J^i_t + \sum_{j=1}^{d} \int_0^t F^i_j(Z)_{s-}\, dY^j_s$$
for $t \geq 0$ and $i = 1, \ldots, n$. If $J$ is a semimartingale, then $Z$ is a semimartingale as well.

With an additional uniformity assumption on a sequence of stochastic differential equations with converging coefficients and initial data, we obtain stability, see [33, Theorem V.15].
Theorem A.3.
Let $Y = (Y^1, \ldots, Y^d)$ be a vector of semimartingales starting at $Y_0 = 0$. Consider, for $\varepsilon \geq 0$, a vector of processes $(J^{\varepsilon,1}, \ldots, J^{\varepsilon,n}) \in \mathbb{D}^n$ and functionally Lipschitz operators $F^{\varepsilon,i}_j$ for $i = 1, \ldots, n$, $j = 1, \ldots, d$. Then, for $\varepsilon \geq 0$, there is a unique process $Z^\varepsilon \in \mathbb{D}^n$ satisfying
$$Z^{\varepsilon,i}_t = J^{\varepsilon,i}_t + \sum_{j=1}^{d} \int_0^t F^{\varepsilon,i}_j(Z^\varepsilon)_{s-}\, dY^j_s$$
for $t \geq 0$ and $i = 1, \ldots, n$. If $J^\varepsilon \to J^0$ in ucp and $F^\varepsilon(Z) \to F^0(Z)$ in ucp, then $Z^\varepsilon \to Z^0$ in ucp.

Remark A.4. We shall apply these theorems to a local stochastic volatility model of the form
$$dS_t(\theta) = S_t(\theta) L(t, S_t(\theta) \,|\, \theta) \alpha_t\, dW_t,$$
where $\theta \in \Theta$, $(W, \alpha)$ is some Brownian motion together with an adapted, càdlàg stochastic process $\alpha$ (all on a given stochastic basis), and $S_0 > 0$. Assume that
$$(t, s) \mapsto L(t, s \,|\, \theta), \quad \theta \in \Theta, \quad (A.1)$$
is bounded, càdlàg in $t$ (for fixed $s$) and Lipschitz in $s$, with a Lipschitz constant independent of $t$ on compact intervals. In this case the map $S \mapsto S_\cdot L(\cdot, S_\cdot \,|\, \theta)$ is functionally Lipschitz, and therefore the above equation has a unique solution for all times $t$ and any $\theta$ by Theorem A.2. If, additionally,
$$\lim_{\theta \to \widehat{\theta}} \sup_{(t,s)} |L(t, s \,|\, \theta) - L(t, s \,|\, \widehat{\theta})| = 0, \quad (A.2)$$
where the sup is taken over some compact set, then we also have that the solutions $S(\theta)$ converge ucp to $S(\widehat{\theta})$ as $\theta \to \widehat{\theta}$, by Theorem A.3.

Appendix B. Preliminaries on deep learning
We shall here briefly introduce two core concepts in deep learning, namely artificial neural networks and stochastic gradient descent. The latter is a widely used optimization method for solving maximization or minimization problems involving the former. In standard machine learning terminology, the optimization procedure is usually referred to as "training". We shall use both terminologies interchangeably.

B.1.
Artificial neural networks.
We start with the definition of feed forward neural networks.
These are functions obtained by composing layers consisting of an affine map and a component-wise nonlinearity. They serve as universal approximation class which is stated in Theorem B.3.Moreover, derivatives of these functions can be efficiently expressed iteratively (see e.g. [23]),which is a desirable feature from an optimization point of view.
Definition B.1.
Let
$M, N_0, N_1, \ldots, N_M \in \mathbb{N}$, $\sigma: \mathbb{R} \to \mathbb{R}$, and for any $m \in \{1, \ldots, M\}$, let $w_m: \mathbb{R}^{N_{m-1}} \to \mathbb{R}^{N_m}$, $x \mapsto A_m x + b_m$ be an affine function with $A_m \in \mathbb{R}^{N_m \times N_{m-1}}$ and $b_m \in \mathbb{R}^{N_m}$. A function $\mathbb{R}^{N_0} \to \mathbb{R}^{N_M}$ defined as
$$F(x) = w_M \circ F_{M-1} \circ \cdots \circ F_1, \quad \text{with } F_m = \sigma \circ w_m \text{ for } m \in \{1, \ldots, M-1\},$$
is called a feed forward neural network. Here the activation function $\sigma$ is applied componentwise. $M - 1$ denotes the number of hidden layers, $N_1, \ldots, N_{M-1}$ denote the dimensions of the hidden layers, and $N_0$ and $N_M$ the dimensions of the input and output layers.

Remark B.2. Unless otherwise stated, the activation functions $\sigma$ used in this article are always assumed to be smooth and globally bounded with bounded first derivative.

The following version of the so-called universal approximation theorem is due to K. Hornik [27]. An earlier version was proved by G. Cybenko [13]. To formulate the result, we denote the set of all feed forward neural networks with activation function $\sigma$, input dimension $N_0$ and output dimension $N_M$ by $\mathcal{NN}^\sigma_{\infty, N_0, N_M}$.

Theorem B.3 (Hornik (1991)). Suppose $\sigma$ is bounded and nonconstant. Then the following statements hold:
(i) For any finite measure $\mu$ on $(\mathbb{R}^{N_0}, \mathcal{B}(\mathbb{R}^{N_0}))$ and $1 \leq p < \infty$, the set $\mathcal{NN}^\sigma_{\infty, N_0, 1}$ is dense in $L^p(\mathbb{R}^{N_0}, \mathcal{B}(\mathbb{R}^{N_0}), \mu)$.
(ii) If in addition $\sigma \in C(\mathbb{R}, \mathbb{R})$, then $\mathcal{NN}^\sigma_{\infty, N_0, 1}$ is dense in $C(\mathbb{R}^{N_0}, \mathbb{R})$ for the topology of uniform convergence on compact sets.
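The layer composition of Definition B.1 can be written out in a minimal sketch (illustrative only; the actual implementation uses TensorFlow). Weights and the example architecture below are arbitrary.

```python
import math

def affine(A, b, x):
    # w_m(x) = A_m x + b_m, with A given as a list of rows
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]

def feed_forward(layers, x, sigma=math.tanh):
    """Evaluate F(x) = w_M o (sigma o w_{M-1}) o ... o (sigma o w_1):
    layers is a list of (A_m, b_m) pairs; the activation sigma is
    applied componentwise on all but the last (output) layer."""
    for A, b in layers[:-1]:
        x = [sigma(z) for z in affine(A, b, x)]
    A, b = layers[-1]
    return affine(A, b, x)
```

For instance, one hidden layer of width two with output weights $(0.5, 0.5)$ and antisymmetric hidden weights yields a constant function, since $\tanh$ is odd.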
Since each component of an $\mathbb{R}^{N_M}$-valued neural network is an $\mathbb{R}$-valued neural network, Theorem B.3 easily generalizes to $\mathcal{NN}^\sigma_{\infty, N_0, N_M}$ with $N_M > 1$.

Notation B.4.
We denote by
$\mathcal{NN}_{N_0, N_M}$ the set of all neural networks in $\mathcal{NN}^\sigma_{\infty, N_0, N_M}$ with a fixed architecture, i.e. a fixed number of hidden layers $M - 1$, fixed dimensions $N_m$ for each hidden layer $m \in \{1, \ldots, M-1\}$, and a fixed activation function $\sigma$. This set can be described by
$$\mathcal{NN}_{N_0, N_M} = \{F(\cdot \,|\, \theta) \,|\, F \text{ feed forward neural network and } \theta \in \Theta\},$$
with parameter space $\Theta \subseteq \mathbb{R}^q$ for some $q \in \mathbb{N}$, where $\theta \in \Theta$ corresponds to the entries of the matrices $A_m$ and the vectors $b_m$ for $m \in \{1, \ldots, M\}$.

B.2.
Stochastic gradient descent.
In light of Theorem B.3, it is clear that neural networks can serve as function approximators. To implement this, the entries of the matrices $A_m$ and the vectors $b_m$ for $m \in \{1, \ldots, M\}$ are subject to optimization. If the unknown function can be expressed as the expected value of a stochastic objective function, one widely applied optimization method is stochastic gradient descent, which we review below. Indeed, consider the following minimization problem
$$\min_{\theta \in \Theta} f(\theta) \quad \text{with} \quad f(\theta) = \mathbb{E}[Q(\theta)], \quad (B.1)$$
where $Q$ denotes some stochastic objective function $Q: \Omega \times \Theta \to \mathbb{R}$, $(\omega, \theta) \mapsto Q(\theta)(\omega)$, that depends on parameters $\theta$ taking values in some space $\Theta$. The classical method to solve generic optimization problems for some differentiable objective function $f$ (not necessarily of the expected value form (B.1)) is to apply a gradient descent algorithm: starting with an initial guess $\theta^{(0)}$, one iteratively defines
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \nabla f(\theta^{(k)}) \quad (B.2)$$
for some learning rate $\eta_k$. Under suitable assumptions, $\theta^{(k)}$ converges for $k \to \infty$ to a local minimum of the function $f$. One insight of deep learning is that stochastic gradient descent methods, going back to the stochastic approximation algorithms proposed by Robbins and Monro [35], are much more efficient. To apply these, it is crucial that the objective function $f$ be linear in the sampling probabilities; in other words, $f$ needs to be of the expected value form (B.1). In the simplest form of stochastic gradient descent, under the assumption that
$$\nabla f(\theta) = \mathbb{E}[\nabla Q(\theta)],$$
the true gradient of $f$ is approximated by the gradient at a single sample $Q(\theta)(\omega)$, which reduces the computational cost considerably. In the updating step (B.2) for the parameters $\theta$, $f$ is then replaced by $Q(\theta)(\omega_k)$, hence
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \nabla Q(\theta^{(k)})(\omega_k). \quad (B.3)$$
The algorithm passes through all samples $\omega_k$ of the so-called training data set, possibly several times (specified by the number of epochs), and performs the update until an approximate minimum is reached. We shall often omit the dependence on $\omega$.
A compromise between computing the true gradient of $f$ and the gradient at a single sample $Q(\theta)(\omega)$ is to compute the gradient over a subsample of size $N_{\mathrm{batch}}$, called a (mini-)batch, so that $Q(\theta^{(k)})(\omega_k)$ used in the update (B.3) is replaced by
$$Q^{(k)}(\theta) = \frac{1}{N_{\mathrm{batch}}} \sum_{n=1}^{N_{\mathrm{batch}}} Q(\theta)(\omega_{n + k N_{\mathrm{batch}}}), \quad k \in \{0, 1, \ldots, \lfloor N/N_{\mathrm{batch}} \rfloor - 1\}, \quad (B.4)$$
where $N$ is the size of the whole training data set. Any other unbiased estimator of $\nabla f(\theta)$ can of course also be applied in (B.3).
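The updates (B.3)-(B.4) can be illustrated in a toy problem (not from the paper): take $Q(\theta)(\omega) = (\theta - \omega)^2$, so that $f(\theta) = \mathbb{E}[(\theta - \omega)^2]$ is minimized at the mean of $\omega$, and run mini-batch gradient steps over the data.

```python
import random

def sgd_minibatch(data, theta0=0.0, eta=0.05, n_batch=10, epochs=20):
    """Mini-batch SGD for f(theta) = E[(theta - omega)^2], whose
    minimizer is the data mean; grad Q = 2 * (theta - omega)."""
    theta = theta0
    for _ in range(epochs):
        for k in range(len(data) // n_batch):
            batch = data[k * n_batch:(k + 1) * n_batch]
            grad = sum(2.0 * (theta - w) for w in batch) / n_batch
            theta -= eta * grad            # update (B.3) with estimator (B.4)
        eta *= 0.8                         # decaying learning rate eta_k
    return theta

rng = random.Random(0)
data = [rng.gauss(1.5, 1.0) for _ in range(1000)]
theta_hat = sgd_minibatch(data)
```

Each epoch corresponds to one pass through the batches $k = 0, \ldots, \lfloor N/N_{\mathrm{batch}} \rfloor - 1$; the decaying learning rate damps the residual gradient noise.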
We consider here alternative algorithms for minimizing (3.7).

C.1.
Stochastic compositional gradient descent.
One alternative is stochastic compositional gradient descent, as developed e.g. in [42]. Applied to our problem, this algorithm (in its simplest form) works as follows: starting with an initial guess $\theta^{(0)}$ and $y^{(0)}_j$, $j = 1, \ldots, J$, one iteratively defines
$$y^{(k+1)}_j = (1 - \beta_k) y^{(k)}_j + \beta_k Q_j(\theta^{(k)})(\omega_k), \quad j = 1, \ldots, J,$$
$$\theta^{(k+1)} = \theta^{(k)} - \eta_k \sum_{j=1}^{J} w_j\, \ell'\bigl(y^{(k+1)}_j\bigr) \nabla Q_j(\theta^{(k)})(\omega_k)$$
for some learning rates $\beta_k, \eta_k \in (0, 1)$. Here $y^{(k)}$ is an auxiliary variable to track the quantity $\mathbb{E}[Q(\theta^{(k)})]$, which has to be plugged into $\ell'$ (other, faster converging estimates have also been developed). Of course $\nabla Q_j(\theta^{(k)})(\omega_k)$ can also be replaced by other unbiased estimates of the gradient, e.g. the gradient of the (mini-)batches as in (B.4). For convergence results in the case when $\theta \mapsto \ell(\mathbb{E}[Q_j(\theta)])$ is convex, we refer to [42, Theorem 5]. Of course the same algorithm can be applied when we replace $Q_j(\theta)$ in (3.7) with $X_j(\theta)$ as defined in (3.9) for the variance reduced case.

C.2. Estimators compatible with stochastic gradient descent.
Our goal here is to apply, at least in special cases of the nonlinear function $\ell$, (variant (B.4) of) stochastic gradient descent to the calibration functional (3.7). This means that we have to cast (3.7) into expected value form. We focus on the case where $\ell(x)$ is given by $\ell(x) = x^2$ and write $f(\theta)$ as
$$f(\theta) = \sum_{j=1}^{J} w_j\, \mathbb{E}\bigl[Q_j(\theta) \widetilde{Q}_j(\theta)\bigr]$$
for some independent copy $\widetilde{Q}_j(\theta)$ of $Q_j(\theta)$, which is clearly of the expected value form required in (B.1). A Monte Carlo estimator of $f(\theta)$ is then constructed by
$$\widehat{f}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=1}^{J} w_j\, Q_j(\theta)(\omega_n)\, \widetilde{Q}_j(\theta)(\omega_n)$$
for independent draws $\omega_1, \ldots, \omega_N$ (the same $N$ samples can be used for each strike $K_j$). Equivalently we have
$$\widehat{f}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=1}^{J} w_j\, Q_j(\theta)(\omega_n)\, Q_j(\theta)(\omega_{n+N}) \quad (C.1)$$
for independent draws $\omega_1, \ldots, \omega_{2N}$. The analog of (B.4) is then given by
$$Q^{(k)}(\theta) = \frac{1}{N_{\mathrm{batch}}} \sum_{l=1}^{N_{\mathrm{batch}}} \sum_{j=1}^{J} w_j\, Q_j(\theta)(\omega_{l + 2k N_{\mathrm{batch}}})\, Q_j(\theta)(\omega_{l + (2k+1) N_{\mathrm{batch}}})$$
for $k \in \{0, 1, \ldots, \lfloor N/N_{\mathrm{batch}} \rfloor - 1\}$. Clearly we can now modify and improve the estimator by again using hedge control variates, replacing $Q_j(\theta)$ by $X_j(\theta)$ as defined in (3.9).

References

[1] F. Abergel and R. Tachet. A nonlinear partial integro-differential equation from mathematical finance.
Discrete and Continuous Dynamical Systems - Series A, 27(3):907–917, 2010.
[2] B. Acciaio and T. Xu. Learning dynamic GANs via causal optimal transport. Working paper, 2020.
[3] C. Bayer, B. Horvath, A. Muguruza, B. Stemper, and M. Tomas. On deep calibration of (rough) stochastic volatility models. Preprint, arXiv:1908.08806, 2019.
[4] S. Becker, P. Cheridito, and A. Jentzen. Deep optimal stopping. Journal of Machine Learning Research, 20, 2019.
[5] H. Buehler, L. Gonon, J. Teichmann, and B. Wood. Deep hedging. Quantitative Finance, 19(8):1271–1291, 2019.
[6] H. Buehler, B. Horvath, I. Perez Arribaz, T. Lyons, and B. Wood. A data-driven market simulator for small data environments. Working paper, 2020.
[7] R. Carmona, I. Ekeland, A. Kohatsu-Higa, J.-M. Lasry, P.-L. Lions, H. Pham, and E. Taflin. HJM: A Unified Approach to Dynamic Models for Fixed Income, Credit and Equity Markets, volume 1919, pages 1–50. 2007.
[8] R. Carmona and S. Nadtochiy. Local volatility dynamic models. Finance and Stochastics, 13(1):1–48, 2009.
[9] R. Cont and S. Ben Hamida. Recovering volatility from option prices by evolutionary optimization. Journal of Computational Finance, 8(4):43–76, 2004.
[10] A. Cozma, M. Mariapragassam, and C. Reisinger. Calibration of a hybrid local-stochastic volatility stochastic rates model with a control variate particle method. Preprint, arXiv:1701.06001, 2017.
[11] C. Cuchiero, A. Marr, M. M., A. Mitoulis, S. Singh, and J. Teichmann. Calibration of mixture interest rate models with neural networks. Technical report, 2018.
[12] C. Cuchiero, P. Schmocker, and J. Teichmann. Deep stochastic portfolio theory. Working paper, 2020.
[13] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems, 5(455), 1992.
[14] B. Dupire. A unified theory of volatility. Derivatives Pricing: The Classic Collection, pages 185–196, 1996.
[15] S. Eckstein and M. Kupper. Computation of optimal transport and related hedging problems via penalization and neural networks. Applied Mathematics & Optimization, pages 1–29, 2019.
[16] X. Gao, S. Tu, and L. Xu. A* tree search for portfolio management. Preprint, arXiv:1901.01855, 2019.
[17] J. Gatheral, T. Jaisson, and M. Rosenbaum. Volatility is rough. Quantitative Finance, 18(6):933–949, 2018.
[18] P. Gierjatowicz, M. Sabate, D. Siska, and L. Szpruch. Robust finance via neural SDEs. Working paper, 2020.
[19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[20] J. Guyon and P. Henry-Labordère. The smile calibration problem solved. Preprint, available at https://ssrn.com/abstract=1885032, 2011.
[21] J. Guyon and P. Henry-Labordère. Nonlinear Option Pricing. CRC Press, 2013.
[22] J. Han, A. Jentzen, and E. Weinan. Overcoming the curse of dimensionality: Solving high-dimensional partial differential equations using deep learning. Preprint, arXiv:1707.02568, pages 1–13, 2017.
[23] R. Hecht-Nielsen. Theory of the backpropagation neural network. In Neural Networks for Perception, pages 65–93. Elsevier, 1992.
[24] J. Heiss, J. Teichmann, and H. Wutte. How implicit regularization of neural networks affects the learned function - part I. Preprint, arXiv:1911.02903, 2019.
[25] P. Henry-Labordère. Generative models for financial data. Preprint, available at SSRN 3408007, 2019.
[26] A. Hernandez. Model calibration with neural networks. Risk, 2017.
[27] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
[28] C. Huré, H. Pham, A. Bachouch, and N. Langrené. Deep neural networks algorithms for stochastic control problems on finite horizon, part I: convergence analysis. Preprint, arXiv:1812.04300, 2018.
[29] C. Huré, H. Pham, and X. Warin. Some machine learning schemes for high-dimensional nonlinear PDEs. Preprint, arXiv:1902.01599, 2019.
[30] B. Jourdain and A. Zhou. Existence of a calibrated regime switching local volatility model and new fake Brownian motions. Preprint, arXiv:1607.00077, 2016.
[31] A. Kondratyev and C. Schwarz. The market generator. Available at SSRN, 2019.
[32] A. Lipton. The vol smile problem. Risk Magazine, 15:61–65, 2002.
[33] P. Protter. Stochastic Integration and Differential Equations, volume 21 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. A new approach.
[34] Y. Ren, D. Madan, and M. Q. Qian. Calibrating and pricing with embedded local volatility models. Risk, 20(9):138, 2007.
[35] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[36] J. Ruf and W. Wang. Neural networks for option pricing and hedging: a literature review. Available at SSRN 3486363, 2019.
[37] Y.-L. K. Samo and A. Vervuurt. Stochastic portfolio theory: a machine learning perspective. Preprint, arXiv:1605.02654, 2016.
[38] Y. F. Saporito, X. Yang, and J. P. Zubelli. The calibration of stochastic-local volatility models - an inverse problem perspective. Preprint, arXiv:1711.03023, 2017.
[39] J. Sirignano and R. Cont. Universal features of price formation in financial markets: perspectives from deep learning. Quantitative Finance, 19(9):1449–1459, 2019.
[40] Y. Tian, Z. Zhu, G. Lee, F. Klebaner, and K. Hamza. Calibrating and pricing with a stochastic-local volatility model. Journal of Derivatives, 22(3):21, 2015.
[41] M. S. Vidales, D. Siska, and L. Szpruch. Unbiased deep solvers for parametric PDEs. Preprint, arXiv:1810.05094, 2018.
[42] M. Wang, E. X. Fang, and H. Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017.
[43] M. Wiese, L. Bai, B. Wood, and H. Buehler. Deep hedging: learning to simulate equity option markets.