Deep Hedging under Rough Volatility
Blanka Horvath (King's College London and The Alan Turing Institute, [email protected]), Josef Teichmann (ETH Zürich, [email protected]), Žan Žurič (Imperial College London, [email protected])

February 4, 2021
Abstract
We investigate the performance of the Deep Hedging framework under training paths beyond the (finite dimensional) Markovian setup. In particular we analyse the hedging performance of the original architecture under rough volatility models, with a view to existing theoretical results for those. Furthermore, we suggest parsimonious but suitable network architectures capable of capturing the non-Markovianity of time-series. Secondly, we analyse the hedging behaviour in these models in terms of P&L distributions and draw comparisons to jump diffusion models if the rebalancing frequency is realistically small.

1 Introduction
Deep learning has undoubtedly had a major impact on financial modelling in the past years and has pushed the boundaries of the challenges that can be tackled: not only can existing problems be solved faster and more efficiently [1, 2, 3, 4, 5, 6, 7, 8], but deep learning also allows us to derive (approximate) solutions to optimisation problems [9] where classical solutions had so far been limited in scope and generality. Additionally, these approaches are fundamentally data driven, which makes them particularly attractive from a business perspective.

It comes as no surprise that the more similar (or "representative") the data presented to the network in the training phase is to the (unseen) test data to which the network is later applied, the better the performance of the hedging network on real data (in terms of P&L). It is also unsurprising that, as markets shift sufficiently far away from a presented regime into new, previously unseen territories, the hedging networks may have to be retrained to adapt to the new environment.

In the current paper we go a step further than just presenting an ad hoc well-chosen market simulator (see [10, 11, 12, 13, 14, 15, 16, 17]): we investigate a situation where the relevant test data is structurally so different from the original modelling setup that it calls for an adjustment of the model architecture itself. In a well-controlled synthetic data environment we study the behaviour of the hedging engine as relevant properties of the data change.

More specifically, we use synthetic data generated from a rough volatility model with varying levels of the Hurst parameter. In its initial setup we set the Hurst parameter to $H = 1/2$, which reflects a classical (finite dimensional) Markovian case, well-aligned with the majority of the most popular classical financial market models, such as, e.g., the Heston model, on which the initial version of the deep hedging results was demonstrated. We then gradually alter the level of the Hurst parameter to (rough) levels around $H \approx 0.1$, which more realistically reflects market reality as observed in [18, 19, 20, 21, 22], thereby introducing a non-Markovian memory into the volatility process.

Since rough volatility models are known to reflect the reality of financial markets (as well as the stylised statistical facts) better than classical, finite-dimensional Markovian models do, our findings also give an indication of how a naive application of model architectures to real data could lead to substantial errors. With this, our study allows us to make a number of interesting observations about deep hedging and the data it is applied to: apart from drawing parallels between discretely observed rough volatility models and jump processes, our findings highlight the need to rethink (or carefully design) risk management frameworks of deep learning models as significant structural shifts in the data occur.

The paper is organised as follows: Section 2 recalls the setup of the original deep hedging framework used in [9]. Section 3 gives a brief reminder on hedging under rough volatility models and compares the performance of a (feed-forward) hedging network on a rough Bergomi model to a theoretically derived model hedge. In Sections 3.3 and 3.4 we draw conclusions with respect to the model architecture, and in Section 3.5 we propose a new architecture that is better suited to the data. Section 4 lays out the hedging under the new architecture and draws conclusions to existing literature, which outlines some parallels between (continuous) rough volatility models and jump processes in this setting, while Section 5 summarizes our conclusions.
2 The deep hedging framework

We adopt the setting in [9] and consider a discrete finite-time financial market with time horizon $[0,T]$ for some $T \in (0,\infty)$ and a finite number of trading dates $0 = t_0 < t_1 < \cdots < t_n = T$, $n \in \mathbb{N}$. We work on a discrete probability space $(\Omega, \mathcal{F}, \mathbb{P})$, with $\Omega = \{\omega_1, \ldots, \omega_N\}$ and a probability measure $\mathbb{P}$ for which $\mathbb{P}[\{\omega_i\}] > 0$ for all $i \in \{1, \ldots, N\}$ and $N \in \mathbb{N}$. Additionally, we fix the notation $\mathcal{X} := \{X : \Omega \to \mathbb{R}\}$ for the set of all $\mathbb{R}$-valued random variables on $\Omega$.

The filtration $\mathbb{F} = (\mathcal{F}_k)_{k=0,\ldots,n}$ is generated by the $\mathbb{R}^r$-valued information process $(I_k)_{k=0,\ldots,n}$ for some $n, r \in \mathbb{N}$. For any $k \in \{0, \ldots, n\}$, the variable $I_k$ denotes all available new market information at time $t_k$ and $\mathcal{F}_k$ represents all available market information up to time $t_k$.

The market contains $d \in \mathbb{N}$ financial instruments which can be used for hedging, with mid-prices given by an $\mathbb{R}^d$-valued $\mathbb{F}$-adapted stochastic process $S = (S_k)_{k=0,\ldots,n}$. In order to hedge a claim $Z : \Omega \to \mathbb{R}$ we may trade in $S$ according to $\mathbb{R}^d$-valued $\mathbb{F}$-adapted processes (strategies), which we denote by $\delta := (\delta_k)_{k=0,\ldots,n-1}$, where $\delta_k = (\delta^1_k, \ldots, \delta^d_k)$. Here, $\delta^i_k$ denotes the agent's holdings of the $i$-th asset at time $t_k$. We denote the initial cash injected at time $t_0$ by $p_0 > 0$. For a time $t_k$ and a change in position $s \in \mathbb{R}^d$ we consider costs $c_k : s \mapsto c_k(s) \in [0,\infty)$, where $c_k$ is $\mathcal{F}_k$-adapted, upper semi-continuous and satisfies $c_k(0) = 0$ for all $k \in \{0, \ldots, n\}$. The total costs up to time $T$ when trading according to a strategy $\delta$ are denoted by $C_T(\delta) := \sum_{k=0}^{n} c_k(\delta_k - \delta_{k-1})$, with the convention $\delta_{-1} = \delta_n = 0$. Finally, we denote by $\mathcal{H}$ the set of all trading strategies.

We consider optimality of hedging under convex risk measures as in [9, 23] and [24]. For a reminder on convex risk measures see e.g. [25]. Now let $\rho : \mathcal{X} \to \mathbb{R}$ be a cash invariant convex risk measure on the set $\mathcal{X}$.
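To fix ideas, the discrete-time objects introduced above, the trading gains $(\delta \cdot S)_T$ and the total costs $C_T(\delta)$, can be computed in a few lines. This is our own illustrative sketch: the proportional cost rule, the toy prices and the positions are assumptions, not taken from the paper.

```python
import numpy as np

# Toy two-asset market on n = 3 trading dates (all numbers are illustrative).
S = np.array([[100.0, 1.00],          # mid-prices S_k of the d = 2 instruments
              [101.0, 1.02],
              [ 99.5, 0.98],
              [100.5, 1.01]])         # shape (n + 1, d)
delta = np.array([[0.5, 0.1],         # holdings delta_k, k = 0, ..., n - 1
                  [0.6, 0.1],
                  [0.4, 0.2]])        # shape (n, d)

# Trading gains (delta . S)_T = sum_k delta_k (S_{k+1} - S_k).
gains = np.sum(delta * np.diff(S, axis=0))

# Assumed proportional cost rule c_k(s) = eps * sum_i |s^i| S^i_k; the first
# trade builds delta_0 from zero and the last row unwinds the position at T,
# matching the convention delta_{-1} = delta_n = 0.
eps = 1e-3
trades = np.diff(np.vstack([np.zeros((1, 2)), delta, np.zeros((1, 2))]), axis=0)
C_T = eps * np.sum(np.abs(trades) * S)

assert C_T > 0.0                      # c_k >= 0 and some trading occurred
```

The quantity entering the risk measure below is then the terminal portfolio value $-X + (\delta \cdot S)_T - C_T(\delta)$, here `-X + gains - C_T`.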
As in [9], we consider for random variables $X \in \mathcal{X}$ the original optimization problem
$$\pi(X) := \inf_{\delta \in \mathcal{H}} \rho\bigl(-X + (\delta \cdot S)_T - C_T(\delta)\bigr). \tag{2.1}$$
An optimal hedging strategy for $X$ is a minimizer $\delta \in \mathcal{H}$ of (2.1), where the premium is $\pi(X)$.

In case of no trading costs there is an alternative viewpoint, which will be taken in this paper: consider an equivalent pricing measure $\mathbb{Q}$ of our financial market; then we can also minimize the variance (with respect to the pricing measure $\mathbb{Q}$)
$$\inf_{\delta \in \mathcal{H}} \mathbb{E}\bigl[\bigl(X - (\delta \cdot S)_T - p_0\bigr)^2\bigr], \tag{2.2}$$
where $p_0$ denotes the expectation of $X$ with respect to $\mathbb{Q}$, i.e. the risk neutral price. In other words: the price of the quadratic hedging loss (payoff) should be minimal.

In the rest of this paper, the above optimisation problem (2.2) and the corresponding optimisers are considered in terms of their numerical approximation in the framework of hedging in a neural network setting as formulated in [9]. In the remainder of this section we recall the notation and definitions needed to formulate this approximation property and the conditions that ensure its validity.

Definition 2.1 (Set of Neural Networks with a fixed activation function). We denote by
$\mathcal{NN}^\sigma_{\infty, d_0, d_1}$ the set of all NNs mapping from $\mathbb{R}^{d_0} \to \mathbb{R}^{d_1}$ with a fixed activation function $\sigma$. The set $\{\mathcal{NN}^\sigma_{M, d_0, d_1}\}_{M \in \mathbb{N}}$ is then a sequence of subsets of $\mathcal{NN}^\sigma_{\infty, d_0, d_1}$ for which $\mathcal{NN}^\sigma_{M, d_0, d_1} = \{F^\theta : \theta \in \Theta_{M, d_0, d_1}\}$ with $\Theta_{M, d_0, d_1} \subset \mathbb{R}^q$ for some $q(M)$, $M \in \mathbb{N}$.

Definition 2.2.
We call $\mathcal{H}_M \subset \mathcal{H}$ the set of unconstrained neural network trading strategies:
$$\mathcal{H}_M = \bigl\{(\delta_k)_{k=0,\ldots,n-1} \in \mathcal{H} : \delta_k = F_k(I_0, \ldots, I_k, \delta_{k-1}),\; F_k \in \mathcal{NN}_{M,\, r(k+1)+d,\, d}\bigr\} = \bigl\{(\delta_k)_{k=0,\ldots,n-1} \in \mathcal{H} : \delta_k = F^{\theta_k}_k(I_0, \ldots, I_k, \delta_{k-1}),\; \theta_k \in \Theta_{M,\, r(k+1)+d,\, d}\bigr\} \tag{2.3}$$
We now replace the set $\mathcal{H}$ in (2.2) by the finite-dimensional subset $\mathcal{H}_M \subset \mathcal{H}$. The optimisation problem then becomes
$$\pi_M(X) := \inf_{\delta \in \mathcal{H}_M} \rho\bigl(-X + (\delta \cdot S)_T - C_T(\delta)\bigr) = \inf_{\theta \in \Theta_M} \rho\bigl(-X + (\delta^\theta \cdot S)_T - C_T(\delta^\theta)\bigr), \tag{2.4}$$
where $\Theta_M = \prod_{k=0}^{n-1} \Theta_{M,\, r(k+1)+d,\, d}$ denotes the network parameters from Definition 2.2. With (2.3), (2.4) and Remark 2.1, the infinite-dimensional problem of finding an optimal hedging strategy is reduced to the finite-dimensional problem of finding the optimal NN parameters for the problem (2.4).

Remark 2.1.
Note that if, in the above, we additionally assume that $S$ is an $(\mathbb{F}, \mathbb{P})$-Markov process and that the contingent claim is of the form $Z := g(S_T)$ for a payoff function $g : \mathbb{R}^d \to \mathbb{R}$, then we may write the optimal strategy as $\delta_k = f_k(I_k, \delta_{k-1})$ for some $f_k : \mathbb{R}^{r+d} \to \mathbb{R}^d$.

The next proposition recalls the central approximation property, which states that the optimal trading strategy (2.1) can be approximated by a semi-recurrent neural network of the form of Figure 2.1, in the sense that the functional $\pi_M(X)$ converges to $\pi(X)$ as $M$ becomes large.

Proposition 2.1. Define $\mathcal{H}_M$ as in (2.3) and $\pi_M$ as in (2.4). Then for any $X \in \mathcal{X}$
$$\lim_{M \to \infty} \pi_M(X) = \pi(X),$$
where $\pi(X)$ denotes the optimal solution of the original optimisation problem (2.1).

Remark 2.2.
Of course there is a completely analogous formulation of this proposition for the optimal trading strategy in the sense of (2.2).
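To make the semi-recurrent structure of Definition 2.2 and Figure 2.1 concrete, the following sketch (our own, with untrained random weights and the Markov simplification $\delta_k = f_k(I_k, \delta_{k-1})$ of Remark 2.1; all dimensions are assumed) composes one small feedforward cell per trading date:

```python
import numpy as np

rng = np.random.default_rng(1)

n, r, d, M = 30, 2, 2, 12   # trading dates, info dim, assets, hidden width (assumed)

def make_cell():
    """One feedforward cell F_k^theta mapping (I_k, delta_{k-1}) -> delta_k."""
    W1 = rng.normal(0.0, 0.3, (M, r + d)); b1 = np.zeros(M)
    W2 = rng.normal(0.0, 0.3, (d, M));     b2 = np.zeros(d)
    def F(x):
        h = np.tanh(W1 @ x + b1)     # hidden layer with activation sigma = tanh
        return W2 @ h + b2           # linear read-out: the positions delta_k
    return F

cells = [make_cell() for _ in range(n)]   # separate parameters theta_k per date

def strategy(I):
    """Apply the semi-recurrent architecture along one information path I of shape (n, r)."""
    delta, out = np.zeros(d), []
    for k in range(n):
        delta = cells[k](np.concatenate([I[k], delta]))  # feeds the previous position
        out.append(delta)
    return np.stack(out)

deltas = strategy(rng.normal(size=(n, r)))
assert deltas.shape == (n, d)
```

In a training loop the weights `W1, b1, W2, b2` would be the parameters $\theta_k \in \Theta_M$ optimised against the objective (2.4).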
Figure 2.1: Original network architecture.

In [9] this approximation property is demonstrated for Black-Scholes and Heston models, both in their original form and in variants including market frictions such as transaction costs. These results demonstrated how deep hedging allows us to take a leap beyond classical results in scenarios where the Markovian structure is preserved.

A natural question to ask is how the approximation property of the neural network is affected if the assumption of a Markovian structure of the underlying process is no longer satisfied. Rough volatility models [18, 19, 20, 26] represent such a class of non-Markovian models. It is also well-established in a series of recent articles, including the aforementioned works, that rough volatility dynamics are superior to standard Markovian models (such as Black-Scholes and Heston) in terms of reflecting market reality, and that they allow remarkably close fits to market data.

By taking hedging behaviour under rough volatility models under the microscope, we gain insight into non-Markovian aspects of markets in a controlled numerical setting: varying the Hurst parameter $H \in (0,1)$ of the process (see [20]), which governs the deviation from the Markovian setting in a general fractional (or rough) volatility framework, enables us to control for the influence of the Markovianity assumption on the hedging performance of the deep neural network. Therefore, in this work we investigate the effect of the loss of the Markovianity property of the underlying stochastic process by considering market dynamics governed by a rough volatility setting. With this in mind, by applying the original feedforward network architecture to a more realistic model class (represented by rough volatility models), we demonstrate in particular how the choice of the network architecture may affect the performance of the Deep Hedging framework and how it could potentially break down on real-life data. We also note in passing that the approach we take can be applied as a simple routine sanity check for model governance of deep learning models on real data:

• Take a well-understood model class that generalises the modelling to more realistic market scenarios, but where the generalisation no longer satisfies assumptions made in the original architecture.

• Test the robustness of the method if the assumption is violated by controlling for the error as the deviation from the assumption increases.

• Modify the network architecture accordingly if necessary.
3 Hedging under rough volatility

Let us now consider the problem of hedging under rough volatility models in general. For this we consider now a continuous filtration $\{\mathcal{F}_t\}_{0 \le t \le T}$.¹ We know that for a Markovian process of the form
$$\tilde{X}_t = x_0 + \int_0^t b(r, \tilde{X}_r)\,dr + \int_0^t \sigma(r, \tilde{X}_r)\,dW_r,$$
where $b$ and $\sigma$ satisfy suitable conditions, the price of a contingent claim $\tilde{Z}_t := \mathbb{E}[g(\tilde{X}_T) \mid \mathcal{F}_t]$ can be written as
$$\tilde{Z}_t = u(t, \tilde{X}_t),$$
where $u$ solves a parabolic PDE by the Feynman-Kac formula [27]. However, it was shown in [26] that rough volatility models are not finite dimensional Markovian.

¹For the numerical implementation of the resulting strategies that we consider in the following sections, we naturally consider again the discrete filtration introduced above in Section 2.
We therefore have to consider a more general process $X$ and assume it to be a solution to the $d$-dimensional Volterra SDE
$$X_t = x_0 + \int_0^t b(t; r, X_\cdot)\,dr + \int_0^t \sigma(t; r, X_\cdot)\,dW_r, \qquad t \in [0,T], \tag{3.1}$$
where $W$ is an $m$-dimensional standard Brownian motion, $b \in \mathbb{R}^d$ and $\sigma \in \mathbb{R}^{d \times m}$. Both are adapted in the sense that for $\varphi = b, \sigma$ it holds that $\varphi(t; r, X_\cdot) = \varphi(t; r, X_{r \wedge \cdot})$.

In this general non-Markovian framework, a contingent claim of the form $Z_t := \mathbb{E}[g(X_T) \mid \mathcal{F}_t]$ will depend on the entire history of the process $X := (X_t)_{t \ge 0}$ up to time $t$ and not just on the value of the process at that time, i.e.
$$Z_t = u(t, X_{[0,t]}), \qquad \text{with notation } X_{[0,t]} := \{X_r\}_{r \in [0,t]},$$
where $u$ this time solves a path dependent PDE (PPDE). The setting where $X$ is a semi-martingale has already been explored in e.g. [28, 29]. Be that as it may, we know that fBm is not a semi-martingale in general, and as a consequence the volatility process is not a semi-martingale. Viens and Zhang [30] are able to cast the problem back into the semi-martingale framework by rewriting $X_s$ as an orthogonal decomposition into an auxiliary process $\Theta^t_s$ and a process $I^t_s$, which is independent of the filtration $\mathcal{F}_t$:
$$X_s = x_0 + \int_0^t b(s; r, X_\cdot)\,dr + \int_0^t \sigma(s; r, X_\cdot)\,dW_r + \int_t^s b(s; r, X_\cdot)\,dr + \int_t^s \sigma(s; r, X_\cdot)\,dW_r \tag{3.2}$$
$$:= x_0 + \Theta^t_s + I^t_s \tag{3.3}$$
for $0 \le t \le s$. By exploiting the semi-martingale property of $\Theta$, they go on to show that the contingent claim can be expressed as a solution of a PPDE
$$Z_t = u\bigl(t, X_{[0,t)} \otimes_t \Theta^t_{[t,T]}\bigr), \tag{3.4}$$
where $\otimes_t$ denotes concatenation at time $t$. Moreover, they develop an Itô-type formula for a general non-Markovian process $X_t$ from (3.1), which we present in the Appendix.
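The splitting into an $\mathcal{F}_t$-measurable part $\Theta^t_s$ and an independent part $I^t_s$ can be checked numerically. The sketch below (our own, for the Riemann-Liouville kernel $K(s,r) = \sqrt{2H}(s-r)^{H-1/2}$ with an assumed grid) uses the independence of Brownian increments before and after $t$:

```python
import numpy as np

rng = np.random.default_rng(2)

H, n, t_idx, N = 0.1, 300, 150, 20_000   # grid and sample sizes are illustrative
ds = 1.0 / n
r = np.arange(n) * ds                     # left endpoints of the grid cells
s = 1.0                                   # evaluate X at the terminal time s = 1
K = np.sqrt(2 * H) * (s - r) ** (H - 0.5) # Volterra kernel of rBergomi type

dW = rng.standard_normal((N, n)) * np.sqrt(ds)
theta = dW[:, :t_idx] @ K[:t_idx]   # Theta^t_s: driven by W on [0, t], F_t-measurable
I_part = dW[:, t_idx:] @ K[t_idx:]  # I^t_s: driven by W on (t, s], independent of F_t
X = theta + I_part                  # X_s (with x_0 = 0 and no drift, for simplicity)

# Orthogonality: the two parts are uncorrelated, so their variances add up.
assert abs(np.var(X) - np.var(theta) - np.var(I_part)) < 0.05 * np.var(X)
```

The same decomposition is what makes $\Theta^t_s$ a tractable (semi-martingale) state variable in the hedging formulas that follow.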
As an example we consider the rBergomi model with a constant initial forward variance curve $\xi_0(t) = V_0$:
$$S_t = S_0 + \int_0^t S_r \sqrt{V_r}\,\bigl[\sqrt{1-\rho^2}\,dB_r + \rho\,dW_r\bigr], \tag{3.5a}$$
$$V_t = V_0\,\mathcal{E}\Bigl(\sqrt{2H}\,\nu \int_0^t (t-r)^{H-\frac12}\,dW_r\Bigr), \qquad V_0 = v_0 > 0. \tag{3.5b}$$
The model fits into the affine structure of our Volterra SDE in (3.1) after a simple log-transformation of the volatility process. In this case we take our auxiliary process to be
$$\Theta^t_s = \sqrt{2H}\,\nu \int_0^t (s-r)^{H-\frac12}\,dW_r, \qquad t < s. \tag{3.6}$$
It is easy to check that $\Theta^t_s$ is a true martingale in $t$ for fixed $s$. The option price dynamics are obtained by using the functional Itô formula in (A.3). From this, the perfect hedge in terms of a forward variance $\hat{\Theta}^t_T$ with maturity $T$ and the stock $S_t$ follows:
$$dZ_t = \partial_x u\bigl(t, S_t, \Theta^t_{[t,T]}\bigr)\,dS_t + \frac{(T-t)^{\frac12-H}}{\hat{\Theta}^t_T}\,\bigl\langle \partial_\omega u\bigl(t, S_t, \Theta^t_{[t,T]}\bigr), a^t \bigr\rangle\,d\hat{\Theta}^t_T, \tag{3.7}$$
with $a^t_s = (s-t)^{H-\frac12}$. The path-wise derivative in (3.7) is the Gateaux derivative along the direction $a^t$. For more details and the discretization of the Gateaux derivative see Appendix A.1. We choose to hedge a plain vanilla call option $Z_T := \max(S_T - K, 0)$ with $K = 100$.
The maturity is monthly, $T = 30/365$. The hedging instruments are the stock $S$ with $S_0 = 100$ and a forward variance with maturity $T_{\mathrm{Fwd}} = 45/365$; the portfolio is rebalanced daily. For the rBergomi model the forward variance is equal to
$$\hat{\Theta}^t_{T_{\mathrm{Fwd}}} := \mathbb{E}^{\mathbb{Q}}\bigl[V_{T_{\mathrm{Fwd}}} \,\big|\, \mathcal{F}_t\bigr] = V_0 \exp\Bigl[\Theta^t_{T_{\mathrm{Fwd}}} + \tfrac12 \nu^2 \bigl((T_{\mathrm{Fwd}}-t)^{2H} - T_{\mathrm{Fwd}}^{2H}\bigr)\Bigr], \tag{3.8}$$
with $\Theta^t_{T_{\mathrm{Fwd}}}$ defined as in (3.6). Applying the classical Itô lemma to $\hat{\Theta}^t_{T_{\mathrm{Fwd}}} = \hat{\Theta}^t_{T_{\mathrm{Fwd}}}(t, \Theta^t_{T_{\mathrm{Fwd}}})$ yields the dynamics of the forward variance under the rough Bergomi model
$$d\hat{\Theta}^t_{T_{\mathrm{Fwd}}} = \hat{\Theta}^t_{T_{\mathrm{Fwd}}}\,\sqrt{2H}\,\nu\,(T_{\mathrm{Fwd}}-t)^{H-\frac12}\,dW_t, \tag{3.9}$$
which is well defined for $t \in [0, T_{\mathrm{Fwd}})$. Therefore, choosing the maturity of the forward variance to be longer than the option maturity allows us to avoid the singularity as $t \to T$. In practice this would correspond to hedging with a forward variance with a slightly longer maturity than that of the option.

For the simulation of the forward variance we used the Euler-Maruyama method, whereas paths of the volatility process were simulated with the "turbo-charged" version of the hybrid scheme proposed in [31, 32]. The parameters were chosen such that they describe a typical market scenario with a flat forward variance curve $\xi_0$, volatility of volatility $\nu$ and correlation $\rho = -0.7$.
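As a sanity check of the dynamics (3.9), the following sketch (our own; the values of $\nu$ and the initial forward variance are assumptions, since the paper's exact figures are not reproduced here) runs an Euler-Maruyama scheme in log-space, which keeps $\hat{\Theta}$ positive and the discretized process an exact martingale:

```python
import numpy as np

rng = np.random.default_rng(4)

H, nu = 0.10, 1.5                 # nu is an assumed value
theta0 = 0.04                     # flat initial forward variance (assumed)
T, T_fwd, n, N = 30 / 365, 45 / 365, 30, 20_000
dt = T / n
t = np.linspace(0.0, T, n + 1)

# Euler-Maruyama in log-space for (3.9):
#   d Theta_t = Theta_t * sqrt(2H) * nu * (T_fwd - t)^(H - 1/2) dW_t.
logTh = np.full(N, np.log(theta0))
for k in range(n):
    sig = np.sqrt(2 * H) * nu * (T_fwd - t[k]) ** (H - 0.5)  # finite since t < T_fwd
    logTh += sig * rng.standard_normal(N) * np.sqrt(dt) - 0.5 * sig**2 * dt
theta = np.exp(logTh)

assert theta.min() > 0.0                          # positivity is preserved
assert abs(theta.mean() / theta0 - 1.0) < 0.05    # martingale property holds
```

Note how the diffusion coefficient grows as $t \to T_{\mathrm{Fwd}}$, which is exactly why the forward variance maturity is chosen beyond the option maturity.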
We were particularly interested in the dependence of the hedging loss on the Hurst parameter. Finally, the quadratic loss function was chosen and the minimizing objective was therefore
$$\pi(-Z) = \inf_{\delta^\theta \in \mathcal{H}_M} \mathbb{E}\bigl[\bigl(-Z + p_0 + (\delta^\theta \cdot S)_T\bigr)^2\bigr],$$
where the price $p_0$ was obtained with a Monte-Carlo simulation (e.g. for $H = 0.1$, $p_0 = 2.39$).

H      Model hedge   Deep hedge
0.10   1.45          1.16 (*1.12)
0.20   0.52          0.67
0.30   0.34          0.46
0.40   0.24          0.36

*on 200 epochs

Table 1: Comparison of the quadratic loss between model and deep hedges trained for 75 epochs for different $H$.

Next we implement the perfect hedge from (3.7); the details of the discretization of the Gateaux derivative are presented in Appendix A.3. For the evaluation of the option price, we once again use Monte-Carlo, this time with the generating parameters. In practice we would calibrate the parameters to market data. The perfect hedge was implemented on a sample of 10 different paths for the same parameters as in the deep hedging case. The results of both hedges under quadratic loss for different Hurst parameters are shown in Table 1. We also take a closer look at the P&L distributions of the deep hedge as well as the model hedge for $H = 0.10$ in Figure 3.1. Curiously enough, the distributions are very similar to each other. The deep hedge seems to have slightly thinner tails, which is interesting considering that the semi-recurrent architecture makes a strong assumption of Markovianity of the underlying process.

Indicators that the assumption of finite dimensional Markovianity is violated might be the heavy left tail of the P&L distribution as well as relatively high hedging losses. This prompted us to question the semi-recurrent architecture and devise a way to relax the Markov assumption on the underlying. Note that the heavy tails of these distributions may also imply a link to jump diffusion models. We expand on this in Section 4.3.
As discussed before, the authors in [9] rely heavily on Remark 2.1, where they use the Markov property of the underlying process in order to write the trading strategy at time $t_k$ as a function of the information process at $t_k$ and the trading strategy at the previous time step $k-1$.

Figure 3.1: rBergomi model hedge (blue) compared to the deep hedge (red), trained on 75 epochs on rBergomi paths with $H = 0.10$. Note that the option price is only $p_0 = 2.39$ and that such a hedge can result in a substantial loss.

Of course, in the case of rough volatility models one would have to include the entire history of the information process up to $t_k$ in order to get the hedge at that time. However, this would result in a numerically infeasible scheme: to illustrate this, take for example a single vanilla call option with maturity $T = 30/365$ and daily rebalancing; the input dimension of the networks $F^{\theta_k}$ then grows with every time step, as each cell has to be fed the whole history $(I_0, \ldots, I_k)$. An alternative route is offered by the representation of fBm,
$$B^H_t := \frac{1}{\Gamma(H+\frac12)} \int_0^t (t-s)^{H-\frac12}\,dW_s,$$
where $W$ is a standard Brownian motion. Using the fact that for $\alpha \in (0,1)$ and fixed $x \in [0,\infty)$
$$\frac{(t-s)^{\alpha-1}}{\Gamma(\alpha)} = \int_0^\infty e^{-(t-s)x}\,\mu(dx), \qquad \text{with } \mu(dx) = \frac{dx}{x^{\alpha}\,\Gamma(\alpha)\Gamma(1-\alpha)}, \tag{3.10}$$
we obtain by the Fubini theorem
$$B^H_t = \int_0^t \int_0^\infty e^{-(t-s)x}\,\mu(dx)\,dW_s = \int_0^\infty \int_0^t e^{-(t-s)x}\,dW_s\,\mu(dx) = \int_0^\infty Y^x_t\,\mu(dx),$$
with $Y^x_t = \int_0^t e^{-(t-s)x}\,dW_s$. Observe that for a fixed $x \in [0,\infty)$, $(Y^x_t)_{t \ge 0}$ is an Ornstein-Uhlenbeck process with mean-reversion level zero and mean-reversion speed $x$, i.e. a Gaussian semi-martingale Markov process with the dynamics
$$dY^x_t = -x\,Y^x_t\,dt + dW_t.$$
Therefore, we have shown that $B^H$ is a linear functional of an infinite dimensional Markov process. Being able to simulate from $Y^x_t$ would mean that we can still use the architecture in Figure 2.1, even for rough processes. A numerical simulation scheme for such a process is presented in [33]. Regrettably, the estimated Hurst parameter from the generated time series stayed around $H \approx 0.5$ for any chosen input Hurst parameter to the simulation scheme.² For a fixed time-step $\Delta t$ the scheme does not produce the desired roughness, even if we use a number of OU-terms well beyond what the authors propose. We believe this is because the scheme is only valid in the limit, i.e. when the number of terms goes to infinity and $\Delta t \to 0$. The failure to recover the Hurst parameter, together with the fact that the architecture does not allow for any path dependent contingent claims, encouraged us to change the neural network architecture itself.

²Several estimation procedures for the Hurst parameter were used, see e.g. [34, 35]. Estimations of the paths simulated with the hybrid scheme [31, 32] were, on the other hand, in alignment with the input parameter.
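The failure mode described above is easy to reproduce. The sketch below (our own; the quadrature nodes, the truncation range and the crude one-step OU update are all assumptions) approximates $B^H_t \approx \sum_i w_i Y^{x_i}_t$ on a fixed grid and then estimates $H$ from the slope of $\log \mathbb{E}|B_{t+h}-B_t|^2$ against $\log h$:

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(5)

H = 0.1
alpha = H + 0.5                          # kernel exponent H - 1/2 = alpha - 1
n, dt, N, J = 500, 1 / 500, 200, 40      # grid and truncation sizes are assumptions

# Quadrature of mu(dx) = x^(-alpha) dx / (Gamma(alpha) Gamma(1 - alpha)) on a
# geometric grid of mean-reversion speeds x_i, truncated to [1e-3, 1e4].
x = np.geomspace(1e-3, 1e4, J)
w = np.gradient(x) * x ** (-alpha) / (gamma(alpha) * gamma(1 - alpha))

# Shared-noise OU factors Y^x and the lift B_t ~ sum_i w_i Y^{x_i}_t.
Y = np.zeros((N, J))
B = np.zeros((N, n + 1))
for k in range(n):
    dW = rng.standard_normal((N, 1)) * np.sqrt(dt)
    Y = Y * np.exp(-x * dt) + dW            # crude one-step OU update
    B[:, k + 1] = Y @ w

# Estimate H from the slope of log E|B_{t+h} - B_t|^2 against log h (slope = 2H).
lags = np.array([1, 2, 4, 8, 16])
m2 = [np.mean((B[:, lag:] - B[:, :-lag]) ** 2) for lag in lags]
H_est = 0.5 * np.polyfit(np.log(lags * dt), np.log(m2), 1)[0]

# As reported in the text, at fixed dt the estimate need not match the input H.
assert 0.0 < H_est < 1.0
```

Refining `dt` together with the number of factors `J` is what the limit argument in the text requires; at any fixed step size the estimated roughness is unreliable.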
By the above insights we hence modify the original architecture. In this section we suggest an alternative architecture and show that it is well-suited to the problem. When constructing the new architecture, we would like to change the semi-recurrent structure as little as possible, since it seems to perform very well in the Markovian cases. However, in order to account for non-Markovianity we propose a completely recurrent structure.³ To that end we now introduce a hidden state $\tilde{\delta}_{k-1} = (\tilde{\delta}^S_{k-1}, \tilde{\delta}^V_{k-1})$ with $\tilde{\delta}_0 = 0$, which is passed to the cell at time $t_k$ along with the information process $I_k$. So instead of adding layers to each of the state transitions separately as in [36], we simply concatenate the input vector $I_k$ with the hidden state vector and feed it into the neural network cell $F^{\theta_k}(\cdot)$:
$$F^{\theta_k}\bigl(I_k \oplus \tilde{\delta}_{k-1}\bigr) = \delta_k \oplus \tilde{\delta}_k.$$
For a visual representation see Figure 3.2. The output is still a trading strategy $\delta_k = (\delta^S_k, \delta^V_k)$, and it is evaluated on the same objective function as before in the case of quadratic hedging losses (without transaction costs):
$$L(\theta) := \mathbb{E}\Bigl[\bigl(-Z + p_0 + (\delta^\theta \cdot S)_T\bigr)^2\Bigr],$$
whereas the hidden state $\tilde{\delta}_k$ is passed forward to the next cell $F^{\theta_{k+1}}$. These states can take any value and are not restricted to having any meaningful financial interpretation, as trading strategies do.

Figure 3.2: Fully Recurrent Neural Network (fRNN) architecture. The recurrent structure of this architecture is clearly visible as hidden states are passed on to the next cell at each time step.

We illustrate the fact that the fRNN architecture is truly recurrent by showing how hidden states are able to encode the relevant history of the information process. Say, for example, that the information process $I_k = (S^1_k, S^2_k)$ is simply the price of both hedging instruments. The strategies at time $t_k$ now do not depend on the asset holdings $\delta^x_{k-1}$, but on $\tilde{\delta}^x_{k-1}$ for $x \in \{S, V\}$:
$$\delta^x_k := \delta^x_k\bigl(S^1_k, S^2_k, \tilde{\delta}^S_{k-1}, \tilde{\delta}^V_{k-1}\bigr).$$
For some $\mathcal{F}_{k-1}$-measurable function $g_{k-1}$, it holds for the hidden states themselves that
$$\tilde{\delta}^x_{k-1} = g^x_{k-1}\bigl(S^1_{k-1}, S^2_{k-1}, \tilde{\delta}^S_{k-2}, \tilde{\delta}^V_{k-2}\bigr).$$
Iterating this recursion yields
$$\tilde{\delta}^x_{k-1} = g^x_{NN}\bigl(S^1_{k-1}, S^2_{k-1}, S^1_{k-2}, S^2_{k-2}, \ldots, S^1_0, S^2_0, \tilde{\delta}_0\bigr),$$
where $g^x_{NN}$ is again $\mathcal{F}_{k-1}$-measurable. Structuring the network this way, we are hoping that the hidden states at time $t_k$ will be able to encode the history of the information process $I_0, \ldots, I_k$. More precisely, what we expect is that the network will learn by itself the function $g^x_{NN} : \mathbb{R}^{2k} \to \mathbb{R}$ for $x \in \{S, V\}$, and with that the path dependency inherent in the liability we are trying to hedge.

³Note that by completely recurrent we do not mean that the same network is used at each time step, but that the hidden state is passed on to the cell in the next time step along with the current portfolio positions.

Remark 3.1.
We remark that, in order to account for the history of the information process, one could also write the trading strategy as $\delta_k := \delta_k(I_k, \tilde{I}^n_{k-1})$, where $\tilde{I}^n_{k-1} = \{I_i\}_{i=k-1-n}^{k-1}$ is the history of the information process with a window length of $n \in \{0, \ldots, k-1\}$. However, in this case we would have to optimize the window length and would inevitably face an accuracy versus computational efficiency trade-off. We would rather outsource this task to the neural network.

Remark 3.2.
While we do think an LSTM architecture [37] would be more appropriate to capture the non-Markovian aspect of our process, we find that our architecture is adequate in that regard as well. Our architecture has the advantage of being tractable (we can still appeal to Proposition 2.1), all while being much simpler and easier to train.
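The fully recurrent cell $F^{\theta_k}(I_k \oplus \tilde{\delta}_{k-1}) = \delta_k \oplus \tilde{\delta}_k$ can be sketched as follows (our own illustration with untrained random weights and assumed dimensions). The final assertion shows that, unlike in the semi-recurrent case, the strategy at a later date reacts to a perturbation of $I_0$ through the hidden state alone:

```python
import numpy as np

rng = np.random.default_rng(6)

n, r, d, m, M = 30, 2, 2, 2, 12  # dates, info dim, assets, hidden dim, width (assumed)

def make_cell():
    """One cell F^theta_k: (I_k (+) h_{k-1}) -> (delta_k (+) h_k)."""
    W1 = rng.normal(0.0, 0.4, (M, r + m)); b1 = np.zeros(M)
    W2 = rng.normal(0.0, 0.4, (d + m, M)); b2 = np.zeros(d + m)
    def F(I_k, h):
        z = np.tanh(W1 @ np.concatenate([I_k, h]) + b1)
        out = W2 @ z + b2
        return out[:d], out[d:]     # trading strategy delta_k, new hidden state
    return F

cells = [make_cell() for _ in range(n)]   # distinct parameters at every time step

def strategy(I):
    h, out = np.zeros(m), []
    for k in range(n):
        delta_k, h = cells[k](I[k], h)    # hidden state carries the path history
        out.append(delta_k)
    return np.stack(out)

I = rng.normal(size=(n, r))
deltas = strategy(I)
assert deltas.shape == (n, d)

# Path dependence: delta_1 depends on I_0 only through the hidden state.
I2 = I.copy(); I2[0] += 1.0
assert not np.allclose(strategy(I2)[1], deltas[1])
```

In training, all cell parameters would be optimised jointly against the quadratic objective $L(\theta)$ above; only the strategy outputs $\delta_k$, not the hidden states, enter the P&L.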
4 Hedging under the new architecture

Since the fRNN should perform just as well in the Markovian case as the original architecture does, we first convinced ourselves that our architecture produces comparable results in the classical case. Quadratic losses as well as training times for the Heston model were very similar for both architectures: 0.20 under the original architecture and 0.162 under the fully recurrent one (for the Heston model we used mean reversion $\alpha = 1$ and $S_0 = 100$). We were now ready to test it on the rough Bergomi model. We hedge the ATM call from Section 3.3; the parameters were again those used there.

Table 2: Quadratic loss for different Hurst parameters $H \in \{0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80\}$ (* denotes the loss on 200 epochs). Run time on 75 epochs was approximately 2 hours for each parameter.

Comparing these results with both the model hedge and the deep hedge from Section 3.3 (see Table 3), we notice that the fRNN does indeed perform notably better. By increasing the number of epochs in the training phase from 75 to 200, the loss in the case of the deep hedge with the original architecture does not improve, while the improvement with the proposed architecture is clearly visible. This indicates that while the semi-recurrent NN saturates at a given error, the new architecture keeps converging and improving. Since the training at 200 epochs was computationally costly (in terms of both memory and time), and since we have reached the model hedge's numbers at the higher end of the $H$ range, we did not keep increasing the number of epochs. But we expect the results to keep improving as the number of epochs increases, which clearly indicates the suitability of the second approach.

Quadratic hedging loss

H      Model hedge   Deep hedge     Deep hedge (fRNN)
0.10   1.45          1.16 (*1.12)   0.83 (*0.63)
0.20   0.52          0.67           0.38
0.30   0.34          0.46           0.26
0.40   0.24          0.36           0.22

*on 200 epochs
Table 3: Comparison of the quadratic loss between the model hedge and the deep hedges (original and fRNN architectures), trained on 75 epochs for different $H$.

Looking at Figure 4.1, it is particularly interesting that the P&L distribution becomes increasingly left-tailed with lower Hurst parameters. Even under the new architecture, the distribution for $H = 0.10$ is left-skewed with an extremely heavy left tail, where sizeable relative losses occurred on a number of sample paths. What is even more compelling is that the sizeable losses occurred when the discretized stock process jumped by several thousand basis points during the hedging period. An example of such a path is shown in Figure 4.2. Although jumps are not featured in the rough Bergomi model (the price process is a continuous martingale [38]), the model clearly exhibits jump-like behaviour when discretized.

Naturally, for $H = 0.10$, where this effect was the most noticeable, we tried increasing the training, test and validation set sizes, as well as the number of epochs to 200. Doing this we managed to decrease the realized loss to 0.63, from the 0.834 obtained on smaller set sizes, but this is still far from the loss of 0.162 we obtained under the Heston model. We investigated settings with more epochs, bigger training sizes and different architectures, however the realized test loss did not improve.

Figure 4.1: Empirical P&L distributions in log-scale for different Hurst parameters under the fRNN hedge. "Loss on test" denotes the realized quadratic loss on the test set for a network trained on 75 epochs.

Figure 4.2: Under the discretized rough Bergomi model the stock can jump by more than ±30% in a single time step. This stock path caused the extreme loss of −0.73 seen in Figure 4.1.

As can be seen in Figure 4.3, the model hedge loss distribution exhibits very similar behaviour to the deep hedge distribution. The higher losses of the model hedge can be explained by its slightly fatter tail in comparison to the fully recurrent hedge. We remark that this behaviour is somewhat understandable, since re-hedging is done daily and this hedging frequency is far from being a valid approximation of a continuous hedge. In the next section we thus implement hedges at different frequencies to see whether the Hölder regularity of the underlying process is problematic only for the deep hedging procedure, or whether the heavy left-tailed P&L distribution is a general phenomenon when hedging under a discretized rough model.
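The jump-like behaviour of the discretized model is easy to observe in simulation. The sketch below (our own; it uses a crude left-point Riemann sum instead of the hybrid scheme of [31, 32], and assumed values for $\nu$ and $\xi_0$) generates discretized rBergomi paths for $H = 0.1$; inspecting `np.abs(np.diff(S, axis=1) / S[:, :-1]).max()` then shows the size of the largest single-step move:

```python
import numpy as np

rng = np.random.default_rng(3)

H, nu, xi0, rho = 0.10, 1.5, 0.04, -0.7   # nu, xi0 are assumed; rho as in the text
S0, T, n, N = 100.0, 30 / 365, 30, 5_000
dt = T / n
t = np.linspace(0.0, T, n + 1)

dW = rng.standard_normal((N, n)) * np.sqrt(dt)   # driver of the volatility
dB = rng.standard_normal((N, n)) * np.sqrt(dt)   # independent second driver

# Left-point Riemann-sum approximation of X_t = sqrt(2H) int_0^t (t-r)^(H-1/2) dW_r
# (the kernel singularity at r = t is what the hybrid scheme handles more carefully).
X = np.zeros((N, n + 1))
for k in range(1, n + 1):
    X[:, k] = np.sqrt(2 * H) * dW[:, :k] @ (t[k] - t[:k]) ** (H - 0.5)

# Spot variance V_t = xi0 * E(nu X)_t and a log-Euler step for the stock.
V = xi0 * np.exp(nu * X - 0.5 * nu**2 * t ** (2 * H))
dZ = rho * dW + np.sqrt(1 - rho**2) * dB
logS = np.log(S0) + np.cumsum(np.sqrt(V[:, :-1]) * dZ - 0.5 * V[:, :-1] * dt, axis=1)
S = np.hstack([np.full((N, 1), S0), np.exp(logS)])

assert np.isfinite(S).all() and (S > 0).all()
assert abs(S[:, -1].mean() / S0 - 1.0) < 0.05   # discretized S stays a martingale
```

For small $H$, the near-singular kernel weights on the most recent Brownian increments are what produce the large single-step moves of the spot, in line with the jump-like P&L behaviour discussed above.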
Figure 4.3: P&L distributions of the rBergomi model hedge (red) vs. the deep hedge with the proposed architecture (blue) for different Hurst parameters ((a) $H = 0.10$, (b) $H = 0.20$, (c) $H = 0.30$, (d) $H = 0.40$), realized on 10 sample paths.

4.2 Rehedges

We implement deep hedges on rBergomi paths with the Hurst parameter $H = 0.10$ at different rebalancing frequencies, from once every two days up to four times daily.

                    Every two days   Daily   Twice daily   Four times daily
Quadratic loss      1.11             0.65    0.46          0.52
Training time (h)   3.1              7.5     19.6          45.3

Table 4: Comparison of the deep hedge quadratic losses for different hedging frequencies (with $H = 0.10$).

It is rather interesting that A. Sepp [39] observes a similar behaviour when delta hedging under jump diffusion models. Similarly to our observations above, he finds (in the presence of jumps) that after a certain point the volatility of the P&L cannot be reduced by increasing the hedging frequency. More precisely, he shows that for jump diffusion models there is a lower bound on the volatility of the P&L in relation to the hedging frequency. Not only that, the P&L distributions in Figure 4.5 for delta hedges under jump diffusion models are generally fairly similar to ours.

Figure 4.5: P&L distributions for delta hedging under jump diffusion models (JDM) from [39].

This gives us the idea to treat the discretised rough models as jump models. In this case the market is incomplete and it is not possible to perfectly hedge a contingent claim with a portfolio containing a finite number of instruments [40]. In practice traders try to come as close as possible to the perfect hedge by trading a number of different options.

Unfortunately, when trying to implement the hedge approximation, we are quickly faced with the absence of analytical pricing formulas and the limitations of the slow Monte-Carlo scheme. In order for us to train the deep hedge, we would have to calculate option prices at every time step of each sample path. In a typical application we would need around 10 options with different strikes and at least 10 sample paths.

5 Conclusion

In this work, we presented and compared different methods for hedging under rough volatility models.
More specifically, we analysed and implemented the perfect hedge for the rBergomi model from [30], and we used the deep hedging scheme from [9], which had to be adapted to a non-Markovian framework.

We were particularly interested in the dependence of the P&L on the Hurst parameter. We conclude that the deep hedge with the proposed architecture performs better than the discretized perfect hedge for all H. We also find that the hedging P&L distributions for low H are highly left-skewed and have a lot of mass in the left tail, under the model hedge as well as the deep hedge.

To mitigate the heavy losses in cases when H is close to zero, we explored increasing the hedging frequency up to four times a day. The loss did improve and the P&L distribution became less leptokurtic, however only slightly.

Intriguingly, a slow response to increased hedging frequency and a left-skewed P&L distribution are characteristic of delta hedges under jump diffusion models [39]. We therefore observe that, in terms of hedging, there is a relation between jump diffusion models and rough models. In accordance with the literature, we find that the price process, despite being a continuous martingale, exhibits jump-like behaviour [32]. We believe this is an excellent illustration of the dynamics of rough volatility models: the explosive, almost jump-like patterns in the stock price might be the reason why they can fit the short end of the implied volatility surface so well.

In our view, it is crucial to take the jump aspect into account when looking for an optimal hedge in discretized rough volatility models. Our suggestion for future research is to adapt the objective function of the deep hedging scheme for jump risk optimization. A first step would be optimization of the Expected Shortfall risk measure. Next, more appropriate jump risk measures for discretized rough models can be developed. These risk measures cannot be completely analogous to the risk measures in [39], since rough models themselves do not feature jumps.

Appendix
A.1 Path derivatives
Denote by $D$ the space of càdlàg functions on $[0, T]$, and by $D_t$ and $C_t$ the space of càdlàg functions on $[t, T]$ and the space of continuous functions on $[t, T]$ respectively. Additionally, we denote by $\omega$ a sample path on $[0, T]$, by $\omega_t$ its value at time $t$, and define

$$\Lambda := [0, T] \times C([0, T], \mathbb{R}^d), \qquad \bar{\Lambda} := \big\{ (t, \omega) \in [0, T] \times D : \omega|_{[t,T]} \in C \big\};$$
$$\|\omega\|_T := \sup_{t \in [0,T]} |\omega_t|, \qquad d\big((t, \omega), (t', \omega')\big) := |t - t'| + \|\omega - \omega'\|_T.$$

Furthermore, we denote the set of all $d$-continuous functions $u : \bar{\Lambda} \to \mathbb{R}$ by $C(\bar{\Lambda})$. Define the usual horizontal time derivative for $u \in C(\bar{\Lambda})$ as in [28]:

$$\partial_t u(t, \omega) := \lim_{\delta \downarrow 0} \frac{u(t + \delta, \omega) - u(t, \omega)}{\delta} \quad \text{for all } (t, \omega) \in \bar{\Lambda}, \tag{A.1}$$

provided, of course, that the limit exists. For the spatial derivative with respect to $\omega$, however, we use the definition of the Gateaux derivative: for any $(t, \omega) \in \bar{\Lambda}$,

$$\langle \partial_\omega u(t, \omega), \eta \rangle = \lim_{\varepsilon \to 0} \frac{u(t, \omega + \varepsilon \eta \mathbf{1}_{[t,T]}) - u(t, \omega)}{\varepsilon} \quad \text{for any } \eta \in C_t. \tag{A.2}$$

Note that the function $u(t, \cdot)$ in the definition of the derivative is "lifted" only on $[t, T]$ and not on $[0, t)$. Hence the convention we follow is in fact

$$\langle \partial_\omega u(t, \omega), \eta \rangle := \big\langle \partial_\omega u(t, \omega), \eta \mathbf{1}_{[t,T]} \big\rangle \quad \text{for any } s < t \text{ and } \eta \in C_s.$$

The Gateaux derivative is clearly also equal to

$$\langle \partial_\omega u(t, \omega), \eta \rangle = \frac{d}{d\varepsilon} u\big(t, \omega + \varepsilon \eta \mathbf{1}_{[t,T]}\big) \Big|_{\varepsilon = 0}.$$

Remark A.1.
We remark that our definition of the spatial derivative differs from the one in [28, 29], where the functional derivative quantifies the sensitivity of the functional to a variation solely in the end point of the path, i.e. in $\omega_t$, while in our definition the perturbation takes place throughout the whole interval $[t, T]$.

We define two more spaces necessary for our analysis:

$$C^{1,2}(\bar{\Lambda}) := \big\{ u \in C(\bar{\Lambda}) : \varphi \in C(\bar{\Lambda}) \text{ for } \varphi \in \{\partial_t u, \partial_\omega u, \partial_{\omega\omega} u\} \big\},$$
$$C^{1,2}_{+}(\bar{\Lambda}) := \big\{ u \in C^{1,2}(\bar{\Lambda}) : \varphi \text{ has polynomial growth for } \varphi \in \{\partial_t u, \partial_\omega u, \partial_{\omega\omega} u\}, \text{ and } \langle \partial_{\omega\omega} u, (\eta, \eta) \rangle \text{ is locally uniformly continuous in } \omega \text{ with polynomial growth} \big\}.$$

A.2 Functional Itô formula

We have to distinguish two cases: the regular case, where $H \in (\frac{1}{2}, 1)$, and the singular case, where the coefficients $b, \sigma$ explode because of the power kernel in the Riemann–Liouville fractional Brownian motion whenever the Hurst exponent $H$ lies in $(0, \frac{1}{2})$. In the singular case the coefficients $b, \sigma \notin C_t$, and thus they cannot serve as test functions on the right-hand side of (A.2), since the Gateaux derivative would no longer make sense. In order to develop an Itô formula for the singular case, the definitions need to be slightly amended. Nonetheless, Viens and Zhang show that both cases yield a similar functional Itô formula.

Assumption A.1.
(i) The SDE (3.1) admits a weak solution $(X, W)$.
(ii) $\mathbb{E}\big[\sup_{t \in [0,T]} |X_t|^p\big] < \infty$ for all $p \geq 1$.

Assumption A.2.
(i) (Regular case) For any $r \in [0, T]$, $\partial_t b(t; r, \cdot)$ and $\partial_t \sigma(t; r, \cdot)$ exist for $t \in [r, T]$, and for $\varphi = b, \sigma, \partial_t b, \partial_t \sigma$,
$$|\varphi(t; r, \omega)| \leq C \big(1 + \|\omega\|_T^\kappa\big) \quad \text{for some } C, \kappa > 0.$$
(ii) (Singular case) For any $r \in [0, T]$, $\partial_t \varphi(t; r, \cdot)$ exists for $t \in (r, T]$ with $\varphi = b, \sigma$. There exists $H \in (0, \frac{1}{2})$ such that, for some $C, \kappa > 0$,
$$|\varphi(t; r, \omega)| \leq C \big(1 + \|\omega\|_T^\kappa\big)(t - r)^{H - \frac{1}{2}} \quad \text{and} \quad |\partial_t \varphi(t; r, \omega)| \leq C \big(1 + \|\omega\|_T^\kappa\big)(t - r)^{H - \frac{3}{2}}.$$

Theorem A.1 (Functional Itô formula). Let $X$ be a weak solution to the SDE (3.1) for which $\mathbb{E}\big[\sup_{t \in [0,T]} |X_t|^p\big] < \infty$ for all $p \geq 1$, and let Assumption A.2 hold. Then

$$\begin{aligned} du(t, X \otimes_t \Theta^t) = {} & \partial_t u(t, X \otimes_t \Theta^t)\, dt + \tfrac{1}{2} \big\langle \partial_{\omega\omega} u(t, X \otimes_t \Theta^t), (\sigma^{t,X}, \sigma^{t,X}) \big\rangle\, dt \\ & + \big\langle \partial_\omega u(t, X \otimes_t \Theta^t), b^{t,X} \big\rangle\, dt + \big\langle \partial_\omega u(t, X \otimes_t \Theta^t), \sigma^{t,X} \big\rangle\, dW_t, \quad \mathbb{P}\text{-a.s.} \end{aligned} \tag{A.3}$$

for $u \in C^{1,2}(\Lambda)$ in the regular case, and for $u \in C^{1,2,\beta}_{+}(\Lambda)$ with regularized Gateaux derivative in the singular case. For $\varphi = b, \sigma$ the notation $\varphi^{t,\omega}_s := \varphi(s; t, \omega)$ merely emphasizes the dependence on $s \in [t, T]$.
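As an informal sanity check (our own remark, not part of the statement in [30]): in the Markovian special case $u(t, \omega) = v(t, \omega_t)$ for a smooth function $v$, the path derivatives reduce to ordinary partial derivatives,

```latex
\partial_t u(t,\omega) = \partial_t v(t,\omega_t), \qquad
\langle \partial_\omega u(t,\omega), \eta \rangle = \partial_x v(t,\omega_t)\, \eta_t, \qquad
\langle \partial_{\omega\omega} u(t,\omega), (\eta,\eta) \rangle = \partial_{xx} v(t,\omega_t)\, \eta_t^2,
```

and (A.3) collapses to the classical Itô formula $dv(t, X_t) = \partial_t v\, dt + \partial_x v\, b\, dt + \partial_x v\, \sigma\, dW_t + \frac{1}{2} \partial_{xx} v\, \sigma^2\, dt$.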
For the definition of $C^{1,2,\beta}_{+}(\Lambda)$ and the precise statement of the theorem in the singular case, see Theorem 3.17 and Theorem 3.10 in [30].

A.3 Discretization of the Gateaux derivative

It can easily be shown that $\hat{\Theta}^t_s = f(\Theta^t_s)$ for some $f : \mathbb{R} \to \mathbb{R}$. We therefore have a direct relation between the auxiliary process $\Theta$ and the forward variance $\hat{\Theta}$, which allows us to write the option price as a function of the entire forward variance curve $\hat{\Theta}^t_{[t,T]}$ at time $t \in [0, T]$, namely $u(t, S_t, \Theta^t_{[t,T]}) = \tilde{u}(t, S_t, \hat{\Theta}^t_{[t,T]})$. This is important when performing Monte-Carlo simulations, since in the rough Bergomi model the forward variance curve is modelled directly in the variance process, with $\xi_t(\cdot) = \hat{\Theta}^t_\cdot$.

Suppose that we are able to trade at times $0 = t_0 < t_1 < \cdots < t_n = T$. In order to obtain the hedging weights at the trading times $t_i$, we have to discretize the derivatives. The Gateaux derivative with respect to the stock simplifies to the usual derivative, and its discretization is straightforward:

$$\partial_x \tilde{u}(t, S_t, \hat{\Theta}^t_{[t,T]}) \approx \frac{\tilde{u}(t, S_t + \varepsilon, \hat{\Theta}^t_{[t,T]}) - \tilde{u}(t, S_t, \hat{\Theta}^t_{[t,T]})}{\varepsilon} \quad \text{for small } \varepsilon > 0. \tag{A.4}$$

For the path-wise derivative the discretization is not immediately obvious, in particular because the option price $\tilde{u}$ at time $t$ depends on a functional over the whole interval $[t, T]$; more precisely, $\tilde{u} : [0, T] \times [0, \infty) \times C([0, T]) \to \mathbb{R}$. First, we recall the definition of the Gateaux derivative on a path $\omega$:

$$\langle \partial_\omega u(t, \omega), \eta \rangle = \lim_{\varepsilon \to 0} \frac{u(t, \omega + \varepsilon \eta \mathbf{1}_{[t,T]}) - u(t, \omega)}{\varepsilon} \quad \text{for any } \eta \in C_t.$$

We proceed as in [41] by approximating $\hat{\Theta}^t_{[t,T]}$ and the direction $a^t$ as piecewise constant functions,

$$\hat{\Theta}^t_s \approx \sum_{i \in I} \hat{\Theta}^t_{t_i} \mathbf{1}_{[t_i, t_{i+1})}(s), \qquad a^t_s = \sum_{i \in I} a^t_{t_i} \mathbf{1}_{[t_i, t_{i+1})}(s), \tag{A.5}$$

where $I := \{ i \in \mathbb{N} : t \leq t_i \leq T \}$.
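To make the bump-and-reprice recipe of (A.4) concrete, the following minimal Python sketch estimates both finite-difference sensitivities for a call option under a toy constant-variance Monte-Carlo pricer. The lognormal pricer, the function names and all parameter values here are illustrative assumptions of ours (not taken from the paper); it stands in for the much slower rBergomi pricer, with which the bump-and-reprice logic would be identical.

```python
import numpy as np

def mc_call_price(s0, xi0, strike=1.0, maturity=1.0,
                  n_steps=50, n_paths=50_000, seed=0):
    """Toy Monte-Carlo call price under a flat variance curve xi0.

    A stand-in for the (much slower) rBergomi pricer; the
    bump-and-reprice logic below is unchanged whichever pricer
    is plugged in.
    """
    rng = np.random.default_rng(seed)  # fixed seed: common random numbers
    dt = maturity / n_steps
    z = rng.standard_normal((n_paths, n_steps))
    log_s = np.log(s0) + np.cumsum(-0.5 * xi0 * dt
                                   + np.sqrt(xi0 * dt) * z, axis=1)
    payoff = np.maximum(np.exp(log_s[:, -1]) - strike, 0.0)
    return payoff.mean()

def bump_and_reprice(s0, xi0, eps=1e-3):
    """Finite-difference sensitivities in the spirit of (A.4):
    bump the spot and the (flat) forward variance separately,
    re-using the same random paths so the Monte-Carlo noise cancels."""
    base = mc_call_price(s0, xi0)
    delta = (mc_call_price(s0 + eps, xi0) - base) / eps        # d/dS
    vega_fwd = (mc_call_price(s0, xi0 + eps) - base) / eps     # d/dxi0
    return delta, vega_fwd

delta, vega_fwd = bump_and_reprice(s0=1.0, xi0=0.04)
```

For an at-the-money call with xi0 = 0.04 (20% volatility) the spot sensitivity lands near the Black–Scholes delta; the key point is that this repricing step would have to be repeated at every trading date on every path, which is exactly the Monte-Carlo cost bottleneck discussed in Section 4.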
We introduce the following approximation of the path derivative along the direction $a^t$:

$$\big\langle \partial_\omega \tilde{u}(t, S_t, \hat{\Theta}^t_{[t,T]}), a^t \big\rangle \approx \partial_\varepsilon \tilde{u}\Big(t, S_t, \sum_{i \in I} \big(\hat{\Theta}^t_{t_i} + \varepsilon a^t_{t_i}\big) \mathbf{1}_{[t_i, t_{i+1})}(s) \Big|_{s \in [t,T]}\Big)\Big|_{\varepsilon = 0} = \partial_\varepsilon \hat{u}\Big(t, S_t, \big(\hat{\Theta}^t_{t_i} + \varepsilon a^t_{t_i}\big)_{i \in I}\Big)\Big|_{\varepsilon = 0} = \sum_{i \in I} \partial_{\hat{\Theta}^t_{t_i}} \hat{u}(t, S_t, \theta^t)\, a^t_{t_i},$$

with $\theta^t := (\hat{\Theta}^t_{t_i})_{i \in I}$, where $\hat{u}$ acts on $[0, T] \times [0, \infty) \times \mathbb{R}^{|I|}$. Discretizing the derivative further, we have for a flat forward variance $\xi_0(t) = \xi_0$:

$$\big\langle \partial_\omega \tilde{u}(t, S_t, \xi_0), a^t \big\rangle \approx \frac{\tilde{u}(t, S_t, \xi_0 + \varepsilon) - \tilde{u}(t, S_t, \xi_0)}{\varepsilon}\, a^t \quad \text{for small } \varepsilon > 0.$$

The option prices $\tilde{u}$ can now be evaluated using Monte-Carlo at each time step to obtain the hedging weights. Note that this discretization of the Gateaux derivative is purely heuristic, and a rigorous proof of its convergence to the true derivative is beyond the scope of this work. For more details we refer to [41].

References

[1] A. Hernandez, Model Calibration with Neural Networks, Risk (2016).
[2] B. Horvath, A. Muguruza, and M. Tomas, Deep Learning Volatility, arXiv:1901.09647 [q-fin.MF].
[3] C. Bayer, B. Horvath, A. Muguruza, B. Stemper, and M. Tomas, On deep calibration of (rough) stochastic volatility models, arXiv:1908.08806 [q-fin.MF].
[4] S. Liu, A. Borovykh, L. A. Grzelak, and C. W. Oosterlee, A neural network-based framework for financial model calibration,
Journal of Mathematics in Industry, no. 1 (2019) 9.
[5] J. Ruf and W. Wang, Neural networks for option pricing and hedging: a literature review, Available at SSRN 3486363 (2019).
[6] F. E. Benth, N. Detering, and S. Lavagnini, Accuracy of Deep Learning in Calibrating HJM Forward Curves, arXiv:2006.01911 (2020).
[7] C. Cuchiero, W. Khosrawi, and J. Teichmann, A Generative Adversarial Network Approach to Calibration of Local Stochastic Volatility Models, Risks, no. 4 (Sep 2020) 101.
[8] P. Gierjatowicz, M. Sabate-Vidales, D. Siska, L. Szpruch, and Z. Zuric, Robust Pricing and Hedging via Neural SDEs, SSRN Electronic Journal (2020).
[9] H. Buehler, L. Gonon, J. Teichmann, and B. Wood, Deep hedging, Quantitative Finance, no. 8 (Feb 2019) 1271–1291.
[10] P. Henry-Labordere, Generative Models for Financial Data, SSRN Electronic Journal (2019).
[11] M. Wiese, L. Bai, B. Wood, and H. Buehler, Deep Hedging: Learning to Simulate Equity Option Markets, NeurIPS 2019 Workshop on Robust AI in Financial Services: Data, Fairness, Explainability, Trustworthiness, and Privacy (Nov 2019), arXiv:1911.01700 [q-fin.CP].
[12] M. Wiese, R. Knobloch, R. Korn, and P. Kretschmer, Quant GANs: deep generation of financial time series, Quantitative Finance, no. 9 (Apr 2020) 1419–1440.
[13] A. Kondratyev and C. Schwarz, The Market Generator, Risk (2020).
[14] H. Buehler, B. Horvath, T. Lyons, I. P. Arribas, and B. Wood, Generating Financial Markets With Signatures, SSRN Electronic Journal (2020).
[15] H. Buehler, B. Horvath, T. Lyons, I. P. Arribas, and B. Wood, A Data-Driven Market Simulator for Small Data Environments, SSRN Electronic Journal (2020).
[16] C. Cuchiero, M. Larsson, and J. Teichmann, Deep Neural Networks, Generic Universal Interpolation, and Controlled ODEs, SIAM Journal on Mathematics of Data Science, no. 3 (Jan 2020) 901–919.
[17] T. Xu, L. K. Wenliang, M. Munn, and B. Acciaio, COT-GAN: Generating Sequential Data via Causal Optimal Transport, arXiv:2006.08571 [stat.ML].
[18] E. Alòs, J. A. León, and J. Vives, On the short-time behavior of the implied volatility for jump-diffusion models with stochastic volatility, Finance and Stochastics, no. 4 (2007) 571–589.
[19] M. Fukasawa, Asymptotic analysis for stochastic volatility: martingale expansion, Finance and Stochastics, no. 4 (Aug 2010) 635–654.
[20] J. Gatheral, T. Jaisson, and M. Rosenbaum, Volatility is rough, Quantitative Finance, no. 6 (Mar 2018) 933–949.
[21] A. E. Bolko, K. Christensen, M. S. Pakkanen, and B. Veliyev, Roughness in spot variance? A GMM approach for estimation of fractional log-normal stochastic volatility models using realized measures, arXiv:2010.04610 [q-fin.ST].
[22] G. Livieri, S. Mouti, A. Pallavicini, and M. Rosenbaum, Rough volatility: Evidence from option prices, IISE Transactions, no. 9 (Jun 2018) 767–776.
[23] M. Xu, Risk measure pricing and hedging in incomplete markets, Annals of Finance, no. 1 (Oct 2005) 51–71.
[24] A. Ilhan, M. Jonsson, and R. Sircar, Optimal static-dynamic hedges for exotic options under convex risk measures, Stochastic Processes and their Applications, no. 10 (2009) 3608–3632.
[25] H. Föllmer and A. Schied, Stochastic Finance: An Introduction in Discrete Time, De Gruyter, 2016.
[26] C. Bayer, P. Friz, and J. Gatheral, Pricing under rough volatility, Quantitative Finance, no. 6 (Nov 2015) 887–904.
[27] M. Kac, On distributions of certain Wiener functionals, Transactions of the American Mathematical Society, no. 1 (Jan 1949) 1–13.
[28] B. Dupire, Functional Itô calculus, Quantitative Finance, no. 5 (Apr 2019) 721–729.
[29] R. Cont and D.-A. Fournié, Functional Itô calculus and stochastic integral representation of martingales, The Annals of Probability, no. 1 (Jan 2013) 109–133.
[30] F. Viens and J. Zhang, A martingale approach for fractional Brownian motions and related path dependent PDEs, The Annals of Applied Probability, no. 6 (Dec 2019) 3489–3540.
[31] M. Bennedsen, A. Lunde, and M. S. Pakkanen, Hybrid scheme for Brownian semistationary processes, Finance and Stochastics, no. 4 (Jun 2017) 931–965.
[32] R. McCrickerd and M. S. Pakkanen, Turbocharging Monte Carlo pricing for the rough Bergomi model, Quantitative Finance, no. 11 (Apr 2018) 1877–1886.
[33] P. Carmona, G. Montseny, and L. Coutin, Application of a representation of long memory Gaussian processes, Publications du Laboratoire de statistique et probabilités (1998).
[34] T. Di Matteo, T. Aste, and M. M. Dacorogna, Long-term memories of developed and emerging markets: Using the scaling analysis to characterize their stage of development, Journal of Banking & Finance, no. 4 (2005) 827–851.
[35] T. Di Matteo, Multi-scaling in finance, Quantitative Finance, no. 1 (2007) 21–36.
[36] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, How to construct deep recurrent neural networks, in Proceedings of the Second International Conference on Learning Representations, 2014.
[37] S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, no. 8 (Nov 1997) 1735–1780.
[38] P. Gassiat, On the martingale property in the rough Bergomi model, Electronic Communications in Probability (2019).
[39] A. Sepp, An approximate distribution of delta-hedging errors in a jump-diffusion model with discrete trading and transaction costs, Quantitative Finance, no. 7 (Jul 2012) 1119–1141.
[40] C. He, J. S. Kennedy, T. F. Coleman, P. A. Forsyth, Y. Li, and K. R. Vetzal, Calibration and hedging under jump diffusion, Review of Derivatives Research, no. 1 (Jan 2007) 1–35.
[41] A. Jacquier and M. Oumgari, Deep Curve-dependent PDEs for affine rough volatility, arXiv:1906.02551 [q-fin.PR].