Accuracy of Deep Learning in Calibrating HJM Forward Curves
Fred Espen Benth ∗ Nils Detering † Silvia Lavagnini ‡ June 4, 2020
Abstract
We price European-style options written on forward contracts in a commodity market, which we model with a state-dependent infinite-dimensional Heath-Jarrow-Morton (HJM) approach. We introduce a new class of volatility operators which map the square integrable noise into the Filipović space of forward curves, and we specify a deterministic parametrized version of it. For calibration purposes, we train a neural network to approximate the option price as a function of the model parameters. We then use it to calibrate the HJM parameters starting from (simulated) option market data. Finally, we introduce a new loss function that takes into account bid and ask prices and offers a solution to calibration in illiquid markets. A key issue discovered is that the trained neural network might be non-injective, which can lead to poor accuracy in calibrating the forward curve parameters, even when showing a high degree of accuracy in recovering the prices. This reveals that the original meaning of the parameters gets lost in the approximation.
Keywords:
Neural networks; Model calibration; Heath-Jarrow-Morton approach; Volatility operator; Forward curve; Option pricing; Bid-ask price; Stochastic partial differential equation.
We price European-style options written on forward contracts in a commodity market. We follow the Heath-Jarrow-Morton (HJM) approach and model the forward curve by a stochastic partial differential equation with state-dependent volatility, taking values in the Filipović space. In our setting, the Hilbert-valued Wiener process driving the noise of the forward curve takes values in $L^2(\mathcal{O})$, $\mathcal{O}$ being some Borel subset of $\mathbb{R}$ (possibly $\mathbb{R}$ itself). This requires the volatility operator to smoothen elements of $L^2(\mathcal{O})$ into elements of the Filipović space. We achieve this by constructing the volatility as an integral operator with respect to some kernel function, and derive the conditions needed on the kernel function such that the volatility operator is well defined. We then focus on the pricing of forward contracts with delivery period, also called swaps. Typical examples are forward contracts in the electricity market, such as the ones traded at Nord Pool AS and the European Energy Exchange (EEX). For a deterministic volatility structure, we derive analytic pricing formulas based on a representation theorem for swaps presented in [11].

The authors would like to thank Vegard Antun for precious coding support and related advice, and Christian Bayer for useful discussions. Part of the project has been carried out during Silvia Lavagnini's three months visit at UCSB, funded by the Kristine Bonnevie travel stipend 2019 from the Faculty of Mathematics and Natural Sciences (University of Oslo).

∗ Department of Mathematics, University of Oslo, 0316 Blindern, Norway. Email: [email protected].
† Department of Statistics and Applied Probability, University of California, Santa Barbara, CA 93106, USA. Email: [email protected].
‡ Department of Mathematics, University of Oslo, 0316 Blindern, Norway. Email: [email protected].
For a state-dependent stochastic volatility operator one needs instead to resort to simulation schemes for stochastic partial differential equations, such as the Multilevel Monte Carlo method [3, 4].

Our fully parametrized model allows for pricing and calibration based on deep neural networks. We therefore adapt the approach presented in [7] to our setting, and approximate the pricing functional with a neural network in a (possibly) costly off-line step. After training, the neural network is used to recover the model parameters in an optimization routine. The calibrated parameters can then be used together with the trained neural network for pricing options that are not traded on the exchange. Resorting to neural networks for both the calibration and the pricing step offers critical computational advantages for models requiring expensive techniques, such as Monte Carlo methods. In this neural network approach, the computationally expensive simulation is only required to generate the training set in the learning process of the network, and is not needed for intraday calibration and pricing.

We perform a comprehensive case study of our framework and analyse the accuracy of neural networks for pricing and calibration in the infinite-dimensional HJM setting. To avoid a training set based on large-scale Monte Carlo simulation, which introduces additional sources of error, we restrict our focus to a deterministic but time-dependent volatility operator. To our knowledge, this is the first application of deep neural networks in an infinite-dimensional HJM setup for the purpose of model calibration. We consider two different approaches, both presented in [29], the pointwise and the grid-based learning approach, and we compare dense and convolutional neural network architectures.
We then extend the framework to allow for calibration in markets with a wide bid-ask spread, where using a mid price is not feasible. The problem of wide bid-ask spreads is particularly pronounced in energy markets, since only the front-end swaps are traded liquidly.

In the approximation step, we observe a high degree of accuracy, with small average relative errors on the test set. The picture is different when it comes to calibration. Here the trained neural network might fail to recover the true parameters, and we do indeed observe large average relative errors in some cases. On the other hand, the prices estimated after calibration remain accurate on average. This failure in recovering the parameters is the result of two effects, one specific to the model and one to the network. In the specified model, several parameter vectors lead to similar prices for the training set, making it difficult to recover the true parameters. However, we also show that the trained neural network can be non-injective in the input parameters to a degree not justified by the original model. For instance, keeping all parameters but the volatility scaling fixed, the call price should be an increasing function of volatility. This is in fact not always the case. Given that no-arbitrage conditions are not imposed on the neural network, this is not surprising, but it contributes to the calibration problem. It may cause the original meaning of the parameters to get lost in the approximation step, and shows that careful benchmarking is required when using the neural network approach for calibration, in particular for pricing more complicated options.

For calibration in market environments with a large bid-ask spread, the simple loss function proposed here seems to work well.
We test it with respect to different bid-ask spread sizes, and in fact, after calibration, almost all prices lie within the bid-ask range; the few that fall outside remain very close to either the bid or the ask price. In particular, whenever the bid-ask spread becomes narrower, one can simply increase the number of iterations of the optimization routine to obtain good results. The optimization is fast because it does not require any simulation. Moreover, the observed errors in recovering the parameters do not increase dramatically compared to the calibration based on a zero bid-ask spread.

The rest of the article is organized as follows. In Section 1.1 we give some background and motivations for the model considered here. In Section 2 we define the HJM forward curve dynamics and our volatility operators, and we specify a deterministic version. In Section 3 we introduce forward contracts with delivery period and options written on them, and we derive the pricing formulas used in our case study. In Section 4 we define neural networks and introduce the two-step approach for model calibration, together with the newly proposed bid-ask loss function. Finally, in Section 5 we specify the parametric setting for the experiments, and in Section 6 we show our findings. Appendix A contains the proofs of all results.

In energy markets, forwards and futures on power and gas deliver the underlying commodity over a period of time, rather than at a fixed delivery time. While one usually derives the arbitrage-free forward price in a model-free way from the buy-and-hold strategy in the underlying, in energy markets this strategy cannot be applied because storing electricity is costly. This implies that the forward price as derived from the spot price model is not backed by a replication strategy. Instead, what is often adopted as an alternative in energy markets is the direct modelling of the tradable forward price.
This is referred to as the Heath-Jarrow-Morton (HJM) approach, as it was first introduced by [25] for interest rate markets. Later this idea was transferred to other markets: [15] and [30], for example, model the whole call option price surface using the HJM methodology, while [16] and [9] transferred the approach to commodity forward markets, the latter in particular in the context of power markets.

Another important characteristic of energy forward markets is the high degree of idiosyncratic risk across different maturities, which has been observed in several studies, such as [1, 31, 22]. In [10], for example, the authors performed a Principal Component Analysis on the Nord Pool AS forward contracts, revealing that more than ten factors are needed to explain the volatility. This points out the necessity of modelling the time dynamics of the forward curve by a high-dimensional, possibly infinite-dimensional, noise process. In [12] the authors show that a reasonable state space for the forward curve is the so-called Filipović space, which is a separable Hilbert space first introduced by [21]. One can indeed realize energy forward prices as linear operators on this space, as done in [11] and [12].

When considering stochastic volatility operators, as for example in our state-dependent model, it is in general not possible to have closed price formulas for options written on the forward curve. Hence one has to resort to time-consuming numerical methods, such as Monte Carlo techniques for SPDEs (see [3, 4]) or Finite Element methods (see [2, 33]). In particular, such costly pricing procedures render calibration almost impossible, as the pricing function has to be evaluated in each step of the optimization routine. As a result, more accurate models are often not used in practice.

To overcome the computational burden, machine learning may be useful.
Machine learning is in fact replacing standard techniques in many different fields within scientific computing and, in particular, within finance. In [20], for example, machine learning is used for the evaluation of derivatives, and in [14] for the problem of hedging a portfolio of derivatives with market frictions; in [19] and [32] machine learning tools are employed to learn the volatility surface and the term structure of forward crude oil prices, respectively; [38] and [26] propose approaches for solving Backward Stochastic Differential Equations in high dimension with deep neural networks. In the context of model calibration, a first application is proposed in [27], who calibrates stochastic models by training a neural network which, given market price observations, returns directly the optimal model parameters. Recently, [17] suggested to employ tools from generative adversarial networks to solve the calibration problem in the context of local stochastic volatility models.

Yet another strategy, which is also employed here, can be found in [6, 29, 7], where the authors introduce the following two-step approach. They propose to approximate the implied volatility surface, respectively the price functional, by a neural network in an off-line training step. Then they use the trained neural network to calibrate the stochastic model to fit market observations. The training step is a priori computationally demanding, especially if the underlying stochastic model is complex, such as a stochastic volatility model, since it requires many costly simulations. But even if generating the artificial data as well as training the neural network are expensive tasks, the advantage is that they are performed one-off. To ensure that the model reflects the current market situation, it is sufficient to run the calibration step regularly, say daily or even intra-daily. This step requires only evaluation of the neural network and is therefore fast.
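Schematically, the two-step approach can be sketched in code. The toy pricer, the least-squares surrogate standing in for the neural network, and all parameter values below are hypothetical; the point is only the split between an expensive off-line fit and a fast, simulation-free calibration:

```python
import numpy as np

# Toy sketch of the two-step approach; toy_price stands in for an expensive
# pricing routine and a least-squares fit stands in for the neural network.
rng = np.random.default_rng(0)

def toy_price(theta):
    # Hypothetical "true" price as a function of a 2-d parameter vector theta.
    return np.sin(theta[..., 0]) + theta[..., 1] ** 2

# Step 1 (off-line): fit a surrogate on simulated (theta, price) pairs.
thetas = rng.uniform(0.0, 1.0, size=(500, 2))
prices = toy_price(thetas)
A = np.column_stack([np.ones(500), thetas[:, 0], thetas[:, 1],
                     thetas[:, 0] ** 2, thetas[:, 1] ** 2])
coef, *_ = np.linalg.lstsq(A, prices, rcond=None)

def surrogate(theta):
    x, y = theta
    return coef @ np.array([1.0, x, y, x * x, y * y])

# Step 2 (on-line): calibrate theta to an observed market price using only
# the surrogate -- no further simulation is needed.
market_price = toy_price(np.array([0.3, 0.7]))
grid = np.stack(np.meshgrid(np.linspace(0, 1, 101), np.linspace(0, 1, 101)),
                axis=-1).reshape(-1, 2)
errors = np.array([(surrogate(th) - market_price) ** 2 for th in grid])
theta_hat = grid[errors.argmin()]
```

Note that several parameter vectors produce almost the same price, so `theta_hat` may recover the price accurately while differing from the true parameters; this is precisely the non-injectivity issue raised above.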
Let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \geq 0}, \mathbb{Q})$ be a filtered probability space, with $\mathbb{Q}$ the risk-neutral probability. We shall work directly under risk-neutrality. We consider the Filipović space on $\mathbb{R}_+$, denoted by $H_\alpha := H_\alpha(\mathbb{R}_+)$: for a given continuous and non-decreasing function $\alpha : \mathbb{R}_+ \to [1, \infty)$ with $\alpha(0) = 1$, this is the Hilbert space of all absolutely continuous functions $f : \mathbb{R}_+ \to \mathbb{R}$ for which
$$\int_{\mathbb{R}_+} f'(x)^2\, \alpha(x)\, dx < \infty.$$
It turns out that $H_\alpha$ is a separable Hilbert space with inner product defined by
$$\langle f_1, f_2 \rangle_\alpha := f_1(0) f_2(0) + \int_{\mathbb{R}_+} f_1'(x) f_2'(x)\, \alpha(x)\, dx,$$
for $f_1, f_2 \in H_\alpha$, and norm $\|f\|_\alpha := \langle f, f \rangle_\alpha^{1/2}$. We assume $\int_{\mathbb{R}_+} \alpha^{-1}(x)\, dx < \infty$. A typical example is $\alpha(x) = e^{\alpha x}$, for $\alpha > 0$. We refer to [21] for more properties of $H_\alpha$. In [12, Section 2] it is shown that the Filipović space is an appropriate state space for the forward curve in energy markets. This motivates our choice of $H_\alpha$ as the space for the forward curves.

For a Borel set $\mathcal{O} \subseteq \mathbb{R}$, we define $H := L^2(\mathcal{O})$ as the Hilbert space of all square integrable functions on $\mathcal{O}$. We further denote by $\lambda_\mathcal{O}$ the Lebesgue measure induced on $\mathcal{O}$, and by $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$ respectively the norm and scalar product on $H$. We assume that $\mathcal{O}$ is such that $H$ is a separable Hilbert space, and we shall consider $H$ as the noise space, as, e.g., in [5, 12, 13]. For a square integrable $H$-valued random variable $X$ (i.e. $\mathbb{E}[\|X\|^2] < \infty$), a linear operator $\mathcal{Q} \in L(H, H)$ is the unique covariance operator of $X$ if $\mathbb{E}[\langle X, h_1 \rangle \langle X, h_2 \rangle] = \langle \mathcal{Q} h_1, h_2 \rangle$ for any $h_1, h_2 \in H$.
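For concreteness, the inner product defining $H_\alpha$ can be evaluated numerically. The sketch below checks the squared norm of $f(x) = e^{-x}$ for $\alpha(x) = e^{\alpha x}$ against its analytic value; the truncation of $\mathbb{R}_+$ at $x = 40$ and the test function are illustrative choices, not part of the paper's setup:

```python
import numpy as np

# Numerical evaluation of the Filipovic inner product
# <f1, f2>_alpha = f1(0) f2(0) + int_{R_+} f1'(x) f2'(x) alpha(x) dx
# for alpha(x) = exp(alpha * x); grid truncation is an assumption.
alpha = 0.3
x = np.linspace(0.0, 40.0, 200_001)

def inner_alpha(f1, df1, f2, df2):
    integrand = df1(x) * df2(x) * np.exp(alpha * x)
    w = np.diff(x)
    integral = np.sum(0.5 * (integrand[:-1] + integrand[1:]) * w)  # trapezoid rule
    return f1(0.0) * f2(0.0) + integral

# Example: f(x) = exp(-x) lies in H_alpha since alpha < 2.
f = lambda y: np.exp(-y)
df = lambda y: -np.exp(-y)
norm_sq = inner_alpha(f, df, f, df)  # analytic value: 1 + 1/(2 - alpha)
```

The analytic comparison follows from $f(0)^2 = 1$ and $\int_0^\infty e^{-2x} e^{\alpha x}\, dx = 1/(2 - \alpha)$.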
We then define an $H$-valued Wiener process $W := \{W(t)\}_{t \geq 0}$ with zero mean and covariance operator $\mathcal{Q}$ as the $H$-valued stochastic process with continuous trajectories, stationary increments with law $W(t) - W(s) \sim \mathcal{N}(0, (t - s)\mathcal{Q})$, and $W(0) = 0$; see [35] or [18].

Let now $f(t, T)$ be the forward price in the commodity market at time $t \leq T$ for a contract with time of delivery $T$. Following the HJM approach in the Musiela parametrization, for $x := T - t$, the time to delivery, we define $g(t, x) := f(t, t + x)$ following the dynamics
$$dg(t, x) = (\partial_x g(t, x) + \beta(t, x))\, dt + \sigma(t, x)\, dW(t, x), \qquad (2.1)$$
$\partial_x$ being the generator of the semigroup $\{\mathcal{U}_t\}_{t \geq 0}$ given by $\mathcal{U}_t g(x) = g(t + x)$, for any $t, x \in \mathbb{R}_+$ and $g \in H_\alpha$. Here $\beta(t, \cdot) \in H_\alpha$ and $\sigma(t, \cdot) \in L(H, H_\alpha)$ is a linear and bounded operator from $H$ to $H_\alpha$. In order for the model (2.1) to be under the risk-neutral measure directly, we must impose the condition $\beta \equiv 0$ (see [11, Section 3.2]). We also want to rewrite equation (2.1) as a model for the entire forward curve $(g(t, x))_{x \geq 0}$ for any time $t \geq 0$, and to allow for the volatility function $\sigma$ to be stochastic itself. Without introducing any other external noise source, for instance with a second dynamics for the volatility, we consider $\sigma$ to be state-dependent, namely depending on the current level of the forward curve $g$. We finally obtain the following stochastic partial differential equation
$$dg_t = \partial_x g_t\, dt + \sigma_t(g_t)\, dW_t, \qquad (2.2)$$
where $g_t := g(t, \cdot)$, $\sigma_t(g_t) := \sigma(t, g_t)$ and $W_t := W(t, \cdot)$. The following Theorem states conditions ensuring that a mild solution to equation (2.2) exists.

Theorem 2.1.
Let us assume that the mapping $\sigma : \mathbb{R}_+ \times H_\alpha \to L(H, H_\alpha)$, $(t, g_t) \mapsto \sigma(t, g_t) = \sigma_t(g_t)$, is measurable and that there exists an increasing function $C : \mathbb{R}_+ \to \mathbb{R}_+$ such that for all $f_1, f_2 \in H_\alpha$ and $t \in \mathbb{R}_+$ we have
$$\|\sigma_t(f_1) - \sigma_t(f_2)\|_{L(H, H_\alpha)} \leq C(t)\, \|f_1 - f_2\|_\alpha,$$
$$\|\sigma_t(f_1)\|_{L(H, H_\alpha)} \leq C(t)\, (1 + \|f_1\|_\alpha).$$
Then for every $s \geq t$ there exists a unique mild solution to equation (2.2) of the form
$$g_s = \mathcal{U}_{s-t}\, g_t + \int_t^s \mathcal{U}_{s-u}\, \sigma_u(g_u)\, dW_u,$$
$g_t = g(t, \cdot) \in H_\alpha$ being the initial condition.

Proof. We refer to [37].

By means of Theorem 2.1 we have the solution to the SPDE in equation (2.2) if the volatility operator satisfies the Lipschitz and linear growth conditions. Let us point out that in equation (2.2) the noise, namely the Wiener process $W$, is an element of $H$, while the forward curve $g_t$ belongs to the space $H_\alpha$. This means that the volatility operator $\sigma_t(g_t)$ must turn elements of $H$ into elements of $H_\alpha$. We shall study the volatility operator in the next Section.

We focus on possible specifications for the volatility operator $\sigma : \mathbb{R}_+ \times H_\alpha \to L(H, H_\alpha)$. In particular, we need conditions ensuring that $\sigma_t(f) \in L(H, H_\alpha)$ for every $f \in H_\alpha$. The volatility operator must turn elements of $H$ into elements of $H_\alpha$. It thus has to smoothen the noise, and one way to achieve this is by integrating it against a suitably chosen kernel function. Additionally, $\sigma_t(f)$ has to fulfil the Lipschitz and growth conditions required for Theorem 2.1. We start with the following conditions on the kernel function.

Theorem 2.2.
For $t \geq 0$, let $\kappa_t : \mathbb{R}_+ \times \mathcal{O} \times H_\alpha \to \mathbb{R}_+$ be a kernel function satisfying the following assumptions:

1. The map $\kappa_t(x, \cdot, f) \in H$ for every $x \in \mathbb{R}_+$, $f \in H_\alpha$.
2. For every $x \in \mathbb{R}_+$, $f \in H_\alpha$, the derivative $\frac{\partial \kappa_t(x, y, f)}{\partial x}$ exists for $\lambda_\mathcal{O}$-almost all $y \in \mathcal{O}$. Moreover, there exists a neighbourhood $I_x$ of $x$ and a function $\bar\kappa_x \in H$ such that $\left| \frac{\partial \kappa_t(x, y, f)}{\partial x} \right| \leq \bar\kappa_x(y)$ for $\lambda_\mathcal{O}$-almost all $y$ on $I_x$.
3. $\int_{\mathbb{R}_+} \left\| \frac{\partial \kappa_t(x, \cdot, f)}{\partial x} \right\|^2 \alpha(x)\, dx < \infty$.

Then
$$\sigma_t(f) : H \to H_\alpha, \quad h \mapsto \sigma_t(f) h := \int_\mathcal{O} \kappa_t(\cdot, y, f)\, h(y)\, dy \qquad (2.3)$$
is a linear and bounded operator from $H$ to $H_\alpha$, namely $\sigma_t(f) \in L(H, H_\alpha)$. In particular, for every $x \in \mathbb{R}_+$, we can also write the equality $\sigma_t(f) h(x) = \langle \kappa_t(x, \cdot, f), h \rangle$.

Proof.
See Appendix A.1.

Given the operator $\sigma_t(f)$ in equation (2.3), it will be necessary to find the corresponding adjoint operator, namely the operator $\sigma_t(f)^* \in L(H_\alpha, H)$ that for every $\tilde f \in H_\alpha$ and every $h \in H$ satisfies $\langle \sigma_t(f) h, \tilde f \rangle_\alpha = \langle h, \sigma_t(f)^* \tilde f \rangle$. By [36, Theorem 6.1], any operator $\sigma_t(f)^*$ satisfying this equality is automatically bounded and is thus the unique adjoint operator of $\sigma_t(f)$.

Theorem 2.3.
With the assumptions of Theorem 2.2, the adjoint operator $\sigma_t(f)^*$ is given by
$$\sigma_t(f)^* : H_\alpha \to H, \quad \tilde f \mapsto \sigma_t(f)^* \tilde f := \kappa_t(0, \cdot, f)\, \tilde f(0) + \int_{\mathbb{R}_+} \frac{\partial \kappa_t(x, \cdot, f)}{\partial x}\, \tilde f'(x)\, \alpha(x)\, dx.$$
In particular, for every $y \in \mathbb{R}_+$, we can also write the equality $\sigma_t(f)^* \tilde f(y) = \langle \kappa_t(\cdot, y, f), \tilde f \rangle_\alpha$.

Proof.
See Appendix A.2.

In Theorem 2.2 we considered assumptions that the kernel $\kappa_t$ must satisfy to ensure that $\sigma_t(f) \in L(H, H_\alpha)$ for every $f \in H_\alpha$. We shall now look at conditions which also ensure that $\sigma_t(f)$ fulfils the Lipschitz continuity and linear growth assumptions of Theorem 2.1.

Theorem 2.4.
Let $\kappa_t : \mathbb{R}_+ \times \mathcal{O} \times H_\alpha \to \mathbb{R}_+$ be a kernel function satisfying the assumptions of Theorem 2.2. If there exists an increasing function $C : \mathbb{R}_+ \to \mathbb{R}_+$ such that, for every $f_1, f_2 \in H_\alpha$, it holds:

1. $\|\kappa_t(0, \cdot, f_1) - \kappa_t(0, \cdot, f_2)\| \leq C(t)\, |f_1(0) - f_2(0)|$,
2. $\left\| \frac{\partial \kappa_t(x, \cdot, f_1)}{\partial x} - \frac{\partial \kappa_t(x, \cdot, f_2)}{\partial x} \right\| \leq C(t)\, |f_1'(x) - f_2'(x)|$,
3. $\|\kappa_t(0, \cdot, f_1)\| \leq C(t)\, (1 + |f_1(0)|)$,
4. $\left\| \frac{\partial \kappa_t(x, \cdot, f_1)}{\partial x} \right\| \leq C(t)\, |f_1'(x)|$,

then $\sigma_t$ defined in equation (2.3) satisfies the Lipschitz and growth conditions of Theorem 2.1.

Proof. See Appendix A.3.

If the kernel function $\kappa_t$ satisfies the assumptions of Theorem 2.2 and Theorem 2.4, then there exists a mild solution to equation (2.2), which models the dynamics of the forward curve.

2.2 A deterministic specification

In the previous Section we defined the volatility as an integral operator with respect to a kernel function $\kappa_t$. We now specify $\kappa_t$ in order to reflect some properties that we believe to be crucial for the volatility operator. For instance, we shall include time dependency to account for seasonality effects. This was observed, for example, in electricity markets by [9], when analysing the volatility structure in a log-normal model for the forward curve. The same authors also model a maturity effect, namely a monotone decay in the volatility as the time to maturity increases, also known as the Samuelson effect. This can be easily achieved with some decay function. Finally, we want contracts with a certain maturity to be mainly influenced by the randomness of the noise in a neighbourhood of that maturity.

We shall restrict to a deterministic, time-dependent diffusion term.
We therefore drop the state-dependency and define the kernel function $\kappa_t$ as the product of two parts, one incorporating the seasonal and the Samuelson effect, and a second one smoothing the noise in a neighbourhood of the time to maturity, namely
$$\kappa_t(x, y) := a(t)\, e^{-bx}\, \omega(x - y), \qquad (2.4)$$
$$a(t) := a_0 + \sum_{j=1}^J \left( s_j \sin(2\pi j t) - c_j \cos(2\pi j t) \right), \qquad (2.5)$$
where $\omega : \mathbb{R} \to \mathbb{R}_+$ is a continuous weight function, the term $e^{-bx}$, $b \geq 0$, captures the Samuelson effect, and $a(t)$ is the seasonal function defined for $a_0 \geq 0$, $s_j$ and $c_j$ real constants, and $t$ measured in years.

With the following Proposition, we state assumptions on the weight function $\omega$ ensuring that $\kappa_t(x, y)$ as defined above fulfils the assumptions of Theorem 2.2 for every $t \geq 0$.

Proposition 2.5.
Let $\omega : \mathbb{R} \to \mathbb{R}_+$ be such that:

1. For every $x \in \mathbb{R}_+$, $\omega(x - \cdot)|_\mathcal{O} \in H$.
2. The derivative $\omega'(x)$ exists for almost all $x \in \mathbb{R}$ and, whenever it exists, there exist a neighbourhood $I_x$ of $x$ and a function $\bar\omega_x \in H$ with $\|\bar\omega_x\| \leq C$ for some $C$ independent of $x$, such that $|(\omega'(x - y) - b\,\omega(x - y))|_\mathcal{O}| \leq \bar\omega_x(y)$ on $I_x$.

Let further $\int_{\mathbb{R}_+} e^{-2bx} \alpha(x)\, dx < \infty$. Then for every $t \geq 0$ the volatility operator $\sigma_t$ given by
$$\sigma_t : H \to H_\alpha, \quad h \mapsto \sigma_t h := \int_\mathcal{O} \kappa_t(\cdot, y)\, h(y)\, dy$$
is well defined and satisfies the Lipschitz and linear growth conditions of Theorem 2.1.

Proof. See Appendix A.4.

For the function $\alpha(x) = e^{\alpha x}$, for instance, the integrability assumption of Proposition 2.5 is satisfied if $0 < \alpha < 2b$.

We consider now energy forward contracts with a delivery period, which we refer to as swaps, in order not to confuse them with the contracts discussed in the previous Section. For $0 \leq t \leq T_1 \leq T_2$, we denote by $F(t, T_1, T_2)$ the price at time $t$ of a swap contract on energy delivering over the interval $[T_1, T_2]$. From [10, Proposition 4.1], this price can be expressed by
$$F(t, T_1, T_2) = \int_{T_1}^{T_2} w(T; T_1, T_2)\, f(t, T)\, dT, \qquad (3.1)$$
$f(t, T)$ being the forward curve introduced above, and $w(T; T_1, T_2)$ a deterministic weight function. Focusing on forward-style swaps in the electricity markets as traded, for example, at Nord Pool AS and the European Energy Exchange (EEX), the weight function takes the form
$$w(T; T_1, T_2) = \frac{1}{T_2 - T_1}. \qquad (3.2)$$
According to [12], we introduce the Musiela representation for $F(t, T_1, T_2)$. For $x := T_1 - t$ the time until start of delivery, and $\ell := T_2 - T_1 > 0$ the length of delivery of the swap, we define the weight function $w_\ell(t, x, y) := w(t + y;\, t + x,\, t + x + \ell)$.
Motivated by practical examples (see [12, Section 2]), we shall consider only time-independent and stationary weight functions. With abuse of notation, let then $w_\ell : \mathbb{R}_+ \to \mathbb{R}_+$ be bounded and measurable, such that
$$G^{w_\ell}(t, x) := F(t, t + x, t + x + \ell) = \int_x^{x + \ell} w_\ell(y - x)\, g_t(y)\, dy,$$
for $g_t$ in equation (2.2). For $w(T; T_1, T_2)$ as in equation (3.2), one simply gets $w_\ell(y - x) = \frac{1}{\ell}$. Following [11, Section 4], for any $g_t \in H_\alpha$, we represent $G^{w_\ell}$ as the following linear operator:
$$\mathcal{D}^{w_\ell}(g_t) := W_\ell(\ell)\, \mathrm{Id}(g_t) + \mathcal{I}^{w_\ell}(g_t), \qquad (3.3)$$
namely, $G^{w_\ell}(t, \cdot) = \mathcal{D}^{w_\ell}(g_t)(\cdot)$. In equation (3.3), $\mathrm{Id}$ is the identity operator, while
$$W_\ell(u) := \int_0^u w_\ell(v)\, dv, \quad u \geq 0, \qquad (3.4)$$
$$\mathcal{I}^{w_\ell}(g_t)(\cdot) := \int_0^\infty q^{w_\ell}(\cdot, y)\, g_t'(y)\, dy, \qquad (3.5)$$
$$q^{w_\ell}(x, y) := \left( W_\ell(\ell) - W_\ell(y - x) \right) \mathbb{I}_{[0, \ell]}(y - x). \qquad (3.6)$$
For a swap contract of forward type with the weight function in equation (3.2), it turns out that the operator $\mathcal{D}^{w_\ell}$ has a simpler representation, as provided in the following Lemma.

Lemma 3.1.
For a forward-style swap contract, the operator $\mathcal{D}^{w_\ell}$ can be represented as
$$\mathcal{D}^{w_\ell}(g_t)(x) = \int_0^\infty d_\ell(x, y)\, g_t(y)\, dy,$$
where $d_\ell : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$, $d_\ell(x, y) := \frac{1}{\ell}\, \mathbb{I}_{[x, x + \ell]}(y)$, is called the delivery period function.

Proof. See Appendix A.5.

With the notation just introduced, thanks to [12, Lemma 3.3], for every $0 \leq t \leq \tau \leq T_1$ the price of the swap contract at time $\tau$ with delivery over the interval $[T_1, T_2]$ can be expressed as a linear functional acting on the forward curve $g_t$:
$$F(\tau, T_1, T_2) = \delta_{T_1 - t}\, \mathcal{D}^{w_\ell} g_t + \int_t^\tau \delta_{T_1 - s}\, \mathcal{D}^{w_\ell}\, \sigma_s(g_s)\, dW_s,$$
for $\ell = T_2 - T_1$. Further, [11, Theorem 2.1] allows us to derive a real-valued stochastic process for the swap dynamics. More precisely, for every $0 \leq t \leq \tau \leq T_1$, we can write
$$F(\tau, T_1, T_2) = \delta_{T_1 - t}\, \mathcal{D}^{w_\ell} g_t + \int_t^\tau \Sigma(s)\, dW(s), \qquad (3.7)$$
$$\Sigma^2(s) := \left( \delta_{T_1 - s}\, \mathcal{D}^{w_\ell}\, \sigma_s(g_s)\, \mathcal{Q}\, \sigma_s(g_s)^*\, (\delta_{T_1 - s}\, \mathcal{D}^{w_\ell})^* \right)(1), \qquad (3.8)$$
where $W$ is a standard Wiener process with values in $\mathbb{R}$. In particular, equation (3.7) tells us that the swap price $F(\tau, T_1, T_2)$, which is driven by the $H$-valued Wiener process $W$ with covariance operator $\mathcal{Q}$ and volatility operator $\sigma_t$, can in fact be represented as driven by a one-dimensional Wiener process with the diffusion term given in equation (3.8).

For a swap contract of forward type, considering the covariance operator $\mathcal{Q}$ to be an integral operator of the form
$$\mathcal{Q} h(x) = \langle h, q(x, \cdot) \rangle = \int_\mathcal{O} q(x, y)\, h(y)\, dy, \quad h \in H,$$
with kernel $q(x, y)$ such that $\mathcal{Q}$ is well defined, and given the volatility operator in equation (2.3), we can rewrite the univariate volatility in equation (3.8) more explicitly.

Proposition 3.2.
For a swap contract of forward type with the weight function in equation (3.2), the volatility $\Sigma^2(s)$, $t \leq s \leq \tau$, defined in equation (3.8) is given by the fourfold integral
$$\Sigma^2(s) = \int_{\mathbb{R}_+} \int_{\mathbb{R}_+} \int_\mathcal{O} \int_\mathcal{O} d_\ell(T_1 - s, u)\, d_\ell(T_1 - s, v)\, \kappa_s(v, z, g_s)\, q(z, y)\, \kappa_s(u, y, g_s)\, dy\, dz\, du\, dv,$$
where $d_\ell$ is the delivery period function defined in Lemma 3.1.

Proof. See Appendix A.6.

In the deterministic setting introduced in Section 2.2, the univariate volatility formula of Proposition 3.2 can be further simplified.
Corollary 3.3.
With a volatility kernel factorized as in equation (2.4), the formula for $\Sigma^2(s)$, $t \leq s \leq \tau$, in Proposition 3.2 becomes
$$\Sigma^2(s) = a^2(s) \int_{\mathbb{R}_+} \int_{\mathbb{R}_+} \int_\mathcal{O} \int_\mathcal{O} e^{-bu}\, e^{-bv}\, d_\ell(T_1 - s, u)\, d_\ell(T_1 - s, v)\, \omega(v - z)\, q(z, y)\, \omega(u - y)\, dy\, dz\, du\, dv.$$

Proof.
This is a direct consequence of Proposition 3.2.
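Once $\omega$ and the covariance kernel $q$ are specified, the fourfold integral above can be approximated by simple Riemann sums. The sketch below uses a Gaussian weight, an exponential covariance kernel, a truncation of $\mathcal{O}$, and parameter values that are all purely illustrative (not the paper's specification), together with the delivery period function of Lemma 3.1:

```python
import numpy as np

# Riemann-sum approximation of the fourfold integral in Corollary 3.3.
# omega, q, the truncation of O and all parameter values are assumptions.
a_s, b, ell, T1, s_time = 1.0, 0.5, 0.25, 1.0, 0.0

def d_ell(x, y):
    """Delivery period function of Lemma 3.1: (1/ell) * 1_{[x, x+ell]}(y)."""
    return np.where((y >= x) & (y <= x + ell), 1.0 / ell, 0.0)

def omega(u):
    return np.exp(-u ** 2 / (2 * 0.25 ** 2))  # hypothetical weight function

def q(z, y):
    return np.exp(-np.abs(z - y))             # hypothetical covariance kernel

x0 = T1 - s_time                    # time until start of delivery, T1 - s
u = np.linspace(x0, x0 + ell, 21)   # d_ell restricts u, v to [x0, x0 + ell]
y = np.linspace(0.0, 3.0, 31)       # truncation of O (assumption)
du, dy = u[1] - u[0], y[1] - y[0]

U, V, Z, Y = np.meshgrid(u, u, y, y, indexing="ij")
integrand = (np.exp(-b * U) * np.exp(-b * V)
             * d_ell(x0, U) * d_ell(x0, V)
             * omega(V - Z) * q(Z, Y) * omega(U - Y))
sigma2 = a_s ** 2 * integrand.sum() * du * du * dy * dy
```

The indicator in `d_ell` confines the outer two integrations to the delivery window, which is why the `u` grid only needs to cover $[T_1 - s,\, T_1 - s + \ell]$.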
We focus on European-style options written on energy swap contracts. These kinds of derivatives are traded, for example, at Nord Pool AS. With the price of the swap at time $t$ being $F(t, T_1, T_2)$, we consider an option with payoff function $\pi : \mathbb{R} \to \mathbb{R}$ and exercise time $0 \leq \tau \leq T_1$. Classical examples are standard call and put options with strike $K \geq 0$, for which the payoff function is defined by $\pi(x) = \max(x - K, 0)$, respectively $\pi(x) = \max(K - x, 0)$.

From equation (3.7), the price at time $0 \leq t \leq \tau$ of the option with payoff $\pi(F(\tau, T_1, T_2))$ at time $\tau$ is given by
$$\Pi(t) = e^{-r(\tau - t)}\, \mathbb{E}\left[ \left. \pi\left( \delta_{T_1 - t}\, \mathcal{D}^{w_\ell} g_t + \int_t^\tau \Sigma(s)\, dW(s) \right) \right| \mathcal{F}_t \right], \qquad (3.9)$$
for $r > 0$ the risk-free interest rate, here considered to be constant. Assuming $\pi$ to be measurable and of at most linear growth, from [12, Proposition 3.1] we get that for every $\ell > 0$ and for every $g_t \in H_\alpha$ such that $\mathbb{E}[\|g_t\|_\alpha^2] < \infty$, the expectation in (3.9) is well defined. The same holds, in particular, for the expectation of the payoff of an option on a forward with fixed time to delivery $T$, namely $\pi(f(\tau, T))$. We refer to [12] for more details.

We end the Section with a result from [12], which allows us to rewrite the price functional in equation (3.9) for a deterministic volatility operator, like the one introduced in Section 2.2.

Proposition 3.4.
For $\sigma_t$ deterministic, the price functional in equation (3.9) becomes
$$\Pi(t) = e^{-r(\tau - t)}\, \mathbb{E}\left[ \pi(\mu(g_t) + \xi X)\, |\, \mathcal{F}_t \right], \qquad (3.10)$$
$$\xi^2 := \int_t^\tau \Sigma^2(s)\, ds, \qquad (3.11)$$
$$\mu(g_t) := \delta_{T_1 - t}\, \mathcal{D}^{w_\ell} g_t, \qquad (3.12)$$
with $X$ a standard normally distributed random variable and $\Sigma^2$ as in equation (3.8).

Proof. We refer to [12, Proposition 3.7].

With Proposition 3.4 the price of an option on the forward curve can be calculated in closed form if the volatility operator is deterministic. To do that, by equations (3.11) and (3.12) one needs to specify the volatility operator, as well as the covariance operator and the initial forward curve. Before giving a full specification for those, we define in the next Section the two-step approach to calibrate the HJM model with neural networks.
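For a call payoff $\pi(x) = \max(x - K, 0)$, the Gaussian expectation in (3.10) evaluates by standard calculus to a Bachelier-type formula, $e^{-r(\tau - t)}\left[ (\mu - K)\Phi(d) + \xi\, \varphi(d) \right]$ with $d = (\mu - K)/\xi$, where $\Phi$ and $\varphi$ are the standard normal cdf and pdf. The sketch below verifies this closed form against Monte Carlo; all parameter values are illustrative:

```python
import numpy as np
from math import erf, exp, pi, sqrt

# Call price under Proposition 3.4: Pi = e^{-r(tau-t)} E[max(mu + xi*X - K, 0)]
# with X ~ N(0,1).  Bachelier-type closed form; parameter values are
# illustrative, not calibrated.
def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def norm_pdf(x):
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def call_price(mu, xi, K, r, tau):
    d = (mu - K) / xi
    return exp(-r * tau) * ((mu - K) * norm_cdf(d) + xi * norm_pdf(d))

mu, xi, K, r, tau = 30.0, 2.0, 29.0, 0.01, 0.5
price = call_price(mu, xi, K, r, tau)

# Monte Carlo check of the same Gaussian expectation.
rng = np.random.default_rng(1)
X = rng.standard_normal(400_000)
mc = exp(-r * tau) * np.maximum(mu + xi * X - K, 0.0).mean()
```

Since the swap price at exercise is normally (not log-normally) distributed here, the Bachelier form, rather than Black's formula, is the natural closed-form benchmark.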
For the purpose of calibration, we shall specify a fully parametric model, depending on a parameter vector $\theta$ taking values in a set $\Theta \subset \mathbb{R}^n$. In the framework described in Section 2, $\theta$ is a vector of parameters defining the volatility operator, the covariance operator and the initial forward curve. Moreover, the option price function depends on some features of the contract, such as time to delivery, strike, etc. We denote the vector of these contract parameters by $\lambda \in \Lambda \subset \mathbb{R}^m$. Then the price function (3.9) is $\Pi(t) = \Pi(t; \lambda, \theta)$.

As the contract features $\lambda$ are given by the market, to get a fully specified price functional we need to calibrate the chosen model and determine the vector $\theta$ that best matches the observed prices of liquidly traded options. This gives us the best coefficient functions (in terms of calibration) for the HJM model defined in Section 2. We do that by the two-step approach with neural networks presented in [7]. In what follows, we first define feedforward neural networks, and then the calibration problem, together with the two-step approach.

We define an $L$-layer feedforward neural network as a function $\mathcal{N} : \mathbb{R}^d \to \mathbb{R}^p$ of the form
$$\mathcal{N}(x) := H_L(\rho(H_{L-1}(\rho(\dots \rho(H_1(x)))))), \qquad (4.1)$$
where each $H_i : \mathbb{R}^{n_{i-1}} \to \mathbb{R}^{n_i}$ is an affine map of the form $H_i(x) = V_i x + v_i$, for $V_i \in \mathbb{R}^{n_i \times n_{i-1}}$ and $v_i \in \mathbb{R}^{n_i}$. Here the $V_i$'s are referred to as the weights and the $v_i$'s as the biases. In particular, $n_0 = d$ equals the input dimension, and $n_L = p$ equals the output dimension. We call $L$ the depth of the network, and $n_i$ represents the number of nodes of the $i$-th layer. We set $n := (n_1, \dots, n_L)$. The map $\rho : \mathbb{R} \to \mathbb{R}$ is the so-called activation function, which is typically non-linear and applies component-wise to the output of the affine maps $H_i$. Typical choices are the Rectified Linear Unit (ReLU) given by $\rho(x) = \max(x, 0)$, the Exponential Linear Unit (ELU) given by $\rho(x) = x$ for $x \geq 0$ and $\rho(x) = e^x - 1$ for $x < 0$, and the Sigmoid $\rho(x) = (1 + e^{-x})^{-1}$.
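Definition (4.1) translates directly into code: a chain of affine maps with the activation applied between layers but not after the last one. The weights below are random placeholders rather than trained values, and ReLU is used as the activation:

```python
import numpy as np

# Forward pass of the L-layer network in (4.1): affine maps H_i with ReLU
# between layers and no activation after the last layer.  Weights are
# random placeholders (no training).
rng = np.random.default_rng(2)
widths = [3, 16, 16, 1]   # (n_0, n_1, n_2, n_3): input dim 3, output dim 1
params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(widths[:-1], widths[1:])]

def relu(x):
    return np.maximum(x, 0.0)

def network(x, params):
    for i, (V, v) in enumerate(params):
        x = V @ x + v                # affine map H_i(x) = V_i x + v_i
        if i < len(params) - 1:      # activation on all but the L-th layer
            x = relu(x)
    return x

out = network(np.array([0.5, -1.0, 2.0]), params)

# Parameter count: matches the cardinality M = sum_i n_i (n_{i-1} + 1).
M = sum(V.size + v.size for V, v in params)
```

For these widths, $M = 16 \cdot 4 + 16 \cdot 17 + 1 \cdot 17 = 353$, illustrating how quickly the number of trainable parameters grows with depth and width.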
We point out that in the definition (4.1), the activation function is the same between all the layers, and we do not apply any activation function in the L-th layer. This is important to allow for expressivity: with the sigmoid function in the last layer, for example, all output values would lie between 0 and 1. We shall denote by V the set of all parameters involved, namely V := {V_i, v_i}_{i=1}^L, and we use the notation N = N(x; V). The cardinality of V is then given by M := |V| = Σ_{i=1}^L n_i(n_{i−1} + 1). As we can see, the number of parameters involved in the neural network, namely the number of parameters which must be optimized, grows quite fast. A particular subclass of feedforward neural networks is given by convolutional networks. A convolution is a special kind of linear operation, which is represented by a sparse matrix. This is cost efficient, both in terms of computing time and required memory. We call a network convolutional when at least one of the V_i's is a convolution operator. We refer to dense networks when the weights V_i are full matrices instead. The architecture of the neural network, namely the depth L, the number of nodes per layer n = (n_1, . . . , n_{L−1}), and the activation function, are the hyper-parameters which must be chosen in accordance with each specific problem to solve.

The idea behind neural networks is to approximate a map ϕ : R^d → R^p starting from a set of input-output pairs {(x_i, ϕ(x_i))}_{i=1}^N of size N. This translates into an optimization problem, called the training of the neural network, which is solved by finding the best set of parameters V ∈ R^M so that the neural network output N(x; V) best approximates the observations ϕ(x), for x ∈ {x_i}_{i=1}^N, with respect to some loss function R to be chosen. A common loss function is the mean squared error defined by

R({x_i}_{i=1}^N, {y_i}_{i=1}^N) := (1/N) Σ_{i=1}^N (x_i − y_i)².
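As a toy illustration of training under the mean squared error, the following plain-NumPy sketch fits a one-hidden-layer ReLU network to ϕ(x) = x² by gradient descent; the architecture, step size and target function are arbitrary choices for the example, not those used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(256, 1))   # inputs x_i
Y = X ** 2                                  # observations phi(x_i) = x_i^2

# Parameters V = {V_1, v_1, V_2, v_2} of a one-hidden-layer ReLU network
V1 = 0.5 * rng.standard_normal((16, 1)); v1 = np.zeros(16)
V2 = 0.5 * rng.standard_normal((1, 16)); v2 = np.zeros(1)

def mse():
    H = np.maximum(X @ V1.T + v1, 0.0)
    return np.mean((H @ V2.T + v2 - Y) ** 2)

loss_before = mse()
lr = 0.05
for _ in range(2000):                       # plain gradient descent
    H = np.maximum(X @ V1.T + v1, 0.0)      # hidden activations
    P = H @ V2.T + v2                       # network output N(x_i; V)
    G = 2.0 * (P - Y) / len(X)              # gradient of the loss w.r.t. P
    GH = (G @ V2) * (H > 0)                 # backprop through the ReLU
    V2 -= lr * (G.T @ H); v2 -= lr * G.sum(0)
    V1 -= lr * (GH.T @ X); v1 -= lr * GH.sum(0)
loss_after = mse()
```

In practice one would of course use a framework such as TensorFlow with a stochastic optimizer like Adam, as done in the experiments below; this sketch only makes the optimization problem concrete.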
Training the neural network means finding a V̂ ∈ R^M that solves the optimization problem

minimize over V ∈ R^M:  (1/N) Σ_{i=1}^N (N(x_i; V) − ϕ(x_i))².

Given the optimal weights V̂, we denote by N̂(x) := N(x; V̂) the trained neural network. We point out that this is a non-convex optimization problem, and one typically only finds an approximation to a local solution. For more details on feedforward neural networks, activation functions and the training of the network, we refer the reader to [28, 23]. We focus on the option price
Π = Π(λ, θ), omitting the time dependency to simplify the notation, and we consider N option contracts with features {λ_i}_{i=1}^N ∈ Λ, whose prices can be observed in the market. This means we have a set of market observed prices {Π_i}_{i=1}^N, where Π_i is the price of the contract corresponding to λ_i. Calibrating the HJM model for the forward curve means finding the vector of model parameters θ̂ ∈ Θ which minimizes the distance between the prices observed in the market and the corresponding prices given by the stochastic model. This is done with respect to a certain cost functional. With the mean squared error functional, the calibration corresponds to finding θ̂ ∈ Θ which solves the optimization problem

minimize over θ ∈ Θ:  (1/N) Σ_{i=1}^N (Π(λ_i, θ) − Π_i)².  (4.2)

Often, it is not possible to obtain a closed formula for the price functional Π, due to the complexity of the underlying model. In these cases, the calibration problem in equation (4.2) must be slightly adjusted by substituting Π with Π̃, an approximated price functional obtained by, for example, Monte Carlo simulation. This makes the procedure costly. In particular, the more complex the underlying stochastic model is, the greater the time needed for simulation and hence for calibration. This has the consequence that more accurate models are often left out of practical applications because of their complexity in calibration. An alternative solution proposed in [7] is to divide the calibration problem in equation (4.2) into two steps. We start by approximating the price functional Π (or its approximation Π̃) with a neural network, N(λ, θ; V) ≈ Π(λ, θ). This is done by generating a training set consisting of (many) different vectors of parameters θ and option features λ together with the corresponding price values. The training set is then used to train the neural network. This step is computationally demanding for two reasons.
First of all, we need to generate a large training set of input-output data by (potentially) costly simulations. Moreover, the training of the neural network is an M = |V|-dimensional optimization problem, with M usually large, carried out over a large training set, and is therefore time consuming. However, the training has the advantage of being off-line, namely, it does not use any market information. Thus, it can potentially be run only once and does not require frequent updating when new market information arrives.

Once the neural network is trained, the second step is calibration. The advantage is that in equation (4.2) we replace the function Π (or Π̃) with the trained neural network N̂. Evaluation of the latter is fast and does not require any further simulation, thus performing a calibration becomes an easier and much faster task, also due to the fact that most machine learning frameworks have very efficient implementations. In [7], for example, the calibration of the full implied volatility surface in the rough Bergomi model is obtained within milliseconds. Price market data are used in the second step, meaning that the model should be re-calibrated every time new market data become available. However, with this fast implementation via neural networks, the results are obtained in a short time, and the model can be updated at low computational cost.

We now present two alternative versions of this procedure. In the first case, we train a neural network as a function of both λ and θ, with output a single real number, being the price of the option with features λ given by the model with parameters θ. In the second case, we require a lower-dimensional input: the neural network is trained as a function of only the model parameters θ. The output is then a (discrete) surface, namely a (potentially multi-dimensional) grid of prices corresponding to different contracts, that is, different λ's.
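The two-step idea can be sketched end-to-end on a toy problem. Below, a simple closed-form map stands in for the trained network N̂ (it is not the HJM model of the paper), and the calibration step recovers the parameter vector by gradient descent with finite-difference gradients:

```python
import numpy as np

# Toy surrogate standing in for the trained network \hat{N}: it maps a
# parameter vector theta = (a, b) to "prices" of three hypothetical
# contracts with maturities TAUS (all names here are illustrative).
TAUS = np.array([0.25, 0.5, 1.0])

def n_hat(theta):
    a, b = theta
    return a * np.exp(-b * TAUS)

theta_true = np.array([2.0, 0.7])
market = n_hat(theta_true)          # synthetic "market" prices

def loss(theta):
    # mean squared error between surrogate prices and market prices
    return np.mean((n_hat(theta) - market) ** 2)

# Calibration step: gradient descent with central finite differences
theta = np.array([1.5, 0.5])        # starting guess
eps, lr = 1e-6, 0.2
for _ in range(5000):
    grad = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                     for e in np.eye(2)])
    theta -= lr * grad
```

Since evaluating the surrogate is cheap, the calibration loop is fast regardless of how expensive the original pricing model would be to simulate, which is precisely the point of the two-step approach.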
Let (λ, θ) ∈ Λ × Θ ⊂ R^d, for d = n + m, such that Π = Π(λ, θ). In the pointwise learning approach, we approximate the pricing map Π (or Π̃) by a neural network that maps the vector (λ, θ) into prices. Given the training set {((λ_i, θ_i), Π(λ_i, θ_i))}_{i=1}^{N_train} for (λ_i, θ_i) ∈ Λ × Θ and N_train the size of the training set, we train a neural network N : Λ × Θ → R_+ by computing

V̂ ∈ argmin over V ∈ R^M of (1/N_train) Σ_{i=1}^{N_train} (N(λ_i, θ_i; V) − Π(λ_i, θ_i))².

Given the trained network N̂(λ, θ) = N(λ, θ; V̂), we use it for the calibration step with respect to the model parameters; that is, we look for an approximate solution θ̂ ∈ Θ to the following problem:

minimize over θ ∈ Θ:  (1/N_cal) Σ_{i=1}^{N_cal} (N̂(λ_i, θ) − Π_i)²,

where {Π_i}_{i=1}^{N_cal} are market observed prices with respect to some contract features {λ_i}_{i=1}^{N_cal}, and N_cal is the size of the calibration set.

The idea of the grid-based learning approach is to train a neural network which is a function only of the model parameters θ. In this case, the output is no longer a single price value, but a discrete grid of values corresponding to different option specifications λ, which are decided in the first step. These should reflect the options traded in the market. Let us suppose m = 2 and λ = (λ_1, λ_2). Then for m_1, m_2 ∈ N, we create a grid of values {(λ_1^j, λ_2^k)}_{j=1,k=1}^{m_1,m_2} and the training set {(θ_i, {Π(θ_i, (λ_1^j, λ_2^k))}_{j=1,k=1}^{m_1,m_2})}_{i=1}^{N_train}. We then train a neural network N : Θ → R_+^{m_1 × m_2} by solving the following optimization problem:

V̂ ∈ argmin over V ∈ R^M of (1/(N_train m_1 m_2)) Σ_{i=1}^{N_train} Σ_{j,k=1}^{m_1,m_2} (N(θ_i; V)_{j,k} − Π(θ_i, (λ_1^j, λ_2^k)))².
Once trained, with N̂(θ) = N(θ; V̂), we use the neural network in the calibration step to find the optimal model parameters θ̂ ∈ Θ for fitting the market observations {Π_{j,k}}_{j=1,k=1}^{m_1,m_2}:

minimize over θ ∈ Θ:  (1/(m_1 m_2)) Σ_{j,k=1}^{m_1,m_2} (N̂(θ)_{j,k} − Π_{j,k})².

The main difference of the grid-based approach compared with the pointwise one is that the neural network is trained to price only specific options, namely options related to the grid {(λ_1^j, λ_2^k)}_{j=1,k=1}^{m_1,m_2} defined in the approximation step. This might be seen as a weakness; however, every price for a contract not included in this grid can be obtained by interpolation (which is done automatically by the network in the pointwise approach). Moreover, the grid in the first step can be chosen as fine as wished. The advantage is that the dimension of the input vector is smaller, making it easier for the network to approximate the price functional.

In order-book based markets there is no single price, but so-called bid and ask prices corresponding to the cheapest sell and the most expensive buy order. Depending on market liquidity, the spread between bid and ask price (bid-ask spread) can be significant. Thus the calibration problem described in equation (4.2) breaks down, as we do not have exact prices to aim at in a mean squared loss sense. We need a new loss function taking the bid-ask spread into account. To penalize prices lying outside the bid-ask range, we introduce the loss R_bidask as

R_bidask({x_i}_{i=1}^N, {y_i^bid}_{i=1}^N, {y_i^ask}_{i=1}^N) := (1/N) Σ_{i=1}^N { (x_i − y_i^bid)² I{x_i < y_i^bid} + (x_i − y_i^ask)² I{x_i > y_i^ask} }.
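This bid-ask penalty is straightforward to implement; here is a minimal NumPy version (a sketch, with x the vector of model prices):

```python
import numpy as np

def bidask_loss(x, y_bid, y_ask):
    """Mean squared distance to the bid-ask interval: a price inside
    [y_bid_i, y_ask_i] contributes zero, a price outside is penalised
    quadratically by its distance to the violated boundary."""
    x, y_bid, y_ask = map(np.asarray, (x, y_bid, y_ask))
    below = np.where(x < y_bid, (x - y_bid) ** 2, 0.0)
    above = np.where(x > y_ask, (x - y_ask) ** 2, 0.0)
    return float(np.mean(below + above))
```

For example, with bids at 1.5 and asks at 3.0, the model prices (1.0, 2.0, 5.0) contribute (0.25, 0, 4), giving a loss of 4.25/3.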
We now fully specify the setting. Let w_α(x) = e^{αx}, x ∈ R_+, be the weight function for the Filipović space H_α. By Proposition 2.5 the real constant α must satisfy 0 < α < b. Moreover, we deal with swap contracts of forward type on energy with delivery over an interval [T_1, T_2]. Thus we take w(T; T_1, T_2) = 1/(T_2 − T_1) (see equation (3.2)) and, consequently, w_ℓ(y − x) = 1/ℓ. Then W_ℓ(ℓ) = 1 (see equation (3.4)) and q_{w_ℓ}(x, y) = ℓ^{−2}(x + ℓ − y) I_{[x,x+ℓ]}(y) (see equation (3.6)). Finally, we select European-style call options with payoff function π(x) = max(x − K, 0), for K > 0 the strike price. The ultimate goal would of course be to price all kinds of options written on forward contracts, with or without delivery periods. This is in fact possible within the framework, but beyond the scope of the paper.

In order to benchmark the two-step approach, as explained in Section 1, we focus our attention on deterministic volatility operators, such as the one introduced in Section 2.2, which will be further specified here. For σ_t deterministic, we have seen in Proposition 3.4 that the price Π(t) at time 0 ≤ t ≤ τ of the option with payoff π(F(τ, T_1, T_2)) at time 0 ≤ τ ≤ T_1 can be expressed in terms of a standard Gaussian random variable X, a variance ξ² = ∫_t^τ Σ²(s) ds and a drift µ(g_t) = δ_{T−t} D w_ℓ g_t. In the case of a European-style call option, Π(t) has a closed-form solution, by means of a type of Black-76 formula. This allows exact price values to be used for both the training and the calibration step, with the advantage of avoiding external sources of error, such as those resulting, for example, from a Monte Carlo simulation approach. As a direct consequence of Proposition 3.4, we find the price functional explicitly.

Proposition 5.1.
The price of a European-style call option with strike price K > 0 and maturity time τ ≤ T_1 is given by

Π(t) = e^{−r(τ−t)} { ξ φ((µ(g_t) − K)/ξ) + (µ(g_t) − K) Φ((µ(g_t) − K)/ξ) }, (5.1)

φ and Φ being, respectively, the density function and the cumulative distribution function of a standard Gaussian random variable.

Proof. The proof follows by direct calculation, starting from equation (3.10) for π(x) = max(x − K, 0), and using standard techniques for the expected value of a Gaussian random variable.

To compute the price in equation (5.1), we need the variance ξ² and the shift µ(g_t); hence we need to specify the volatility operator σ_t and the covariance operator Q. Last, we must define an appropriate initial forward curve g_t ∈ H_α. In view of the neural network approach, we need to define a fully parametric model, that is, a model fully specified by a vector of parameters θ.

We need to introduce a suitable covariance operator Q that depends only on a finite number of parameters. The analysis conducted in [13] reveals a covariance structure that is well approximated by an exponential function, i.e. Cov(W_t(x), W_t(y)) ≈ e^{−k|x−y|}. Despite the operation W_t(x) = δ_x(W_t) not being well defined on our space H, we can approximate it by the scalar product δ_x(W_t) ≈ ⟨η_x, W_t⟩, with some "bell-shaped" function η_x centred in x, such as a Gaussian density function. For every x, y ∈ O, denoting by c(x, y) the empirical covariance function between W_t(x) and W_t(y), we then get:

c(x, y) = E[W_t(x) W_t(y)] = E[δ_x(W_t) δ_y(W_t)] ≈ E[⟨η_x, W_t⟩⟨η_y, W_t⟩] = ⟨Q η_y, η_x⟩ ≈ Q η_y(x) = ∫_O e^{−k|x−z|} η_y(z) dz ≈ e^{−k|x−y|},

which shows that in H a covariance operator based on an exponential kernel indeed approximates the empirically observed covariance structure of the Wiener process W across different maturities.
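The closed-form price (5.1) above is immediate to implement; the sketch below uses only the Python standard library (the parameter values used in the usage note are purely illustrative):

```python
import math

def call_price(mu, xi, K, r=0.0, tau=1.0, t=0.0):
    """Bachelier-type formula (5.1): discounted expectation of
    max(mu + xi * X - K, 0) for X standard normal, with drift mu = mu(g_t)
    and standard deviation xi."""
    d = (mu - K) / xi
    phi = math.exp(-0.5 * d * d) / math.sqrt(2.0 * math.pi)   # Gaussian density
    Phi = 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))          # Gaussian cdf
    return math.exp(-r * (tau - t)) * (xi * phi + (mu - K) * Phi)
```

A quick sanity check is to compare the output, e.g. call_price(31.5, 0.8, 31.0), against a direct quadrature of E[max(µ + ξX − K, 0)]; the two agree to high accuracy.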
We thus define a one-parameter covariance operator by

Q h(x) = ∫_O e^{−k|x−y|} h(y) dy,  h ∈ H. (5.2)

Because e^{−k|·|} is the characteristic function of a Cauchy distributed random variable with location parameter 0 and scale parameter k, it follows by Bochner's theorem that it is positive-definite. Since it is also symmetric and continuous, for O compact it follows from [35, Theorem 15.8] that Q is in fact a covariance operator. In the following, we therefore choose O := [−γ, γ] for some large γ which ensures that all maturities of interest are covered.

We consider a specification of the volatility operator that does not depend on time and space, which we denote by σ instead of σ_t to simplify the notation. For the seasonal component in equation (2.5), we consider a(t) = a ≥ 0 for every t ∈ R_+. This means that we do not account for seasonality, and the level a corresponds to the implied spot price volatility, as pointed out in [9]. Moreover, we define the following weight function

ω(x) := (1 − |x|) I{|x| ≤ 1} = { 1 − |x| if |x| ≤ 1; 0 otherwise }. (5.3)

Let us notice that this parameter-free specification of ω fulfils the assumptions of Proposition 2.5. The kernel κ_t in equation (2.4) then becomes κ_t(x, y) = κ(x, y) = a e^{−bx} ω(x − y), where b ≥ 0 determines the strength of the maturity effect. The volatility operator is given by

σh(x) = a e^{−bx} ∫_O (1 − |x − y|) I{|x − y| ≤ 1} h(y) dy, (5.4)

and is well defined by Theorem 2.2. Let us remember that the role of σ is to smooth the noise from the space H into H_α, which we have achieved by considering an integral operator. The weight function ω then has a double role. First of all, it functions as (a part of) the kernel of the integral operator which smooths the noise. On the other hand, it weights the randomness coming from the Wiener process W_t so that a contract with time to maturity x is only influenced by W_t(y), for y in a neighbourhood of x.
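Both operators can be discretised on a uniform grid over O = [−γ, γ]. The sketch below checks numerically that the exponential kernel of (5.2) yields a symmetric positive semi-definite matrix, as Bochner's theorem guarantees, and applies the volatility operator (5.4) to a test function; all parameter values are illustrative:

```python
import numpy as np

# Discretise the operators of (5.2) and (5.4) on a uniform grid over
# O = [-gamma, gamma]; the parameter values below are illustrative only.
gamma, k, a, b, n = 3.0, 1.5, 0.5, 1.0, 200
x = np.linspace(-gamma, gamma, n)
dx = x[1] - x[0]

# Covariance operator: (Q h)(x) = int_O exp(-k|x - y|) h(y) dy
Q = np.exp(-k * np.abs(x[:, None] - x[None, :])) * dx
eig = np.linalg.eigvalsh(Q)   # all eigenvalues should be (numerically) >= 0

# Volatility operator (5.4) applied to a test function h
diff = np.abs(x[:, None] - x[None, :])
omega = np.where(diff <= 1.0, 1.0 - diff, 0.0)   # triangular weight (5.3)
h = np.sin(x)
sigma_h = a * np.exp(-b * x) * (omega @ h) * dx
```

The triangular kernel ω averages the noise over a unit-width neighbourhood of each maturity, which is exactly the smoothing role described in the text.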
Other weight functions ω could be considered to obtain a similar weighting effect.

We have to calculate the volatility Σ²(s) using the formula provided in Corollary 3.3. However, the expression turns out to be cumbersome when integrating over O = [−γ, γ]. For this reason, we integrate over R instead and calculate Σ²(s) according to the formula

Σ²(s) := a² ∫_{R_+} ∫_{R_+} ∫_R ∫_R e^{−bu} e^{−bv} d_ℓ(T − s, u) d_ℓ(T − s, v) ω(v − z) q(z, y) ω(u − y) dy dz du dv.

Since all terms are positive and Σ²(s) < ∞, it is then easy to see that, for

Σ²_γ(s) := a² ∫_{R_+} ∫_{R_+} ∫_{−γ}^{γ} ∫_{−γ}^{γ} e^{−bu} e^{−bv} d_ℓ(T − s, u) d_ℓ(T − s, v) ω(v − z) q(z, y) ω(u − y) dy dz du dv,

the limit lim_{γ→∞} Σ²_γ(s) = Σ²(s) holds. We then calculate Σ²(s) explicitly.

Proposition 5.2.
In the setting described above, the volatility Σ²(s), t ≤ s ≤ τ, is given by

Σ²(s) = (2a²)/(k b ℓ²) { (b² + 3)(1 − e^{−bℓ}) − e^{−bℓ} (b(ℓ² − ℓ + 4) − ℓ + 2) } e^{−2b(T−s)}.

Proof.
See Appendix A.7.

And, consequently, we can calculate ξ² in equation (3.11).

Proposition 5.3. In the setting described above, we get

ξ² = a²/(k b² ℓ²) (e^{−2b(T−τ)} − e^{−2b(T−t)}) · { (b² + 3)(1 − e^{−bℓ}) − e^{−bℓ} (b(ℓ² − ℓ + 4) − ℓ + 2) }.

Proof.
The result follows by integrating Σ²(s) from Proposition 5.2 over the interval [t, τ].

Finally, we need to introduce a suitably parametrized initial forward curve, g_t = g(t, ·) ∈ H_α. We choose the Nelson-Siegel curve, defined for x ≥ 0 by

g_t(x) = g_NS(x) := α_0 + (α_1 + α_2 α_3 x) e^{−α_3 x}, (5.5)

for α_0, α_1, α_2, α_3 ∈ R, α_3 > 0. Let us notice that it does not depend on time t. This curve was first introduced for modelling forward rates by [34], and has already been applied in the context of energy markets by [8]. We need however to check that g_NS ∈ H_α. This is ensured in the following lemma.

Lemma 5.4.
The Nelson-Siegel curve in equation (5.5) belongs to H_α if and only if α < 2α_3.

Proof. The condition is found by calculating the H_α-norm of g_NS, namely

‖g_NS‖²_α = g_NS(0)² + ∫_{R_+} g'_NS(x)² w_α(x) dx = (α_0 + α_1)² + ∫_{R_+} α_3² (α_2 − α_1 − α_2 α_3 x)² e^{(α − 2α_3)x} dx,

where the integral converges if and only if α − 2α_3 < 0.

With the explicit representation for g_t, we compute µ(g_t): from Lemma 3.1 we get that

µ(g_t) = (1/ℓ) ∫_{T−t}^{T+ℓ−t} g_NS(y) dy,

and by direct computation we then obtain the drift

µ(g_t) = (1/ℓ) { α_0 ℓ + (1/α_3) e^{−α_3(T−t)} (α_1 + α_2 + α_2 α_3 (T − t)) − (1/α_3) e^{−α_3(T+ℓ−t)} (α_1 + α_2 + α_2 α_3 (T + ℓ − t)) }, (5.6)

which is the last component we need in order to obtain a fully specified option price functional.

In this section, we describe implementation details and report our findings. From Section 5, we obtain that the vector of model parameters is θ = (a, b, k, α_0, α_1, α_2, α_3) ∈ R⁷, i.e., n = 7. In this vector, a, b ≥ 0 are parameters of the volatility operator introduced in Section 5.2, and k ≥ 0 is the parameter of the covariance operator, see Section 5.1. Finally, α_0, α_1, α_2, α_3 ∈ R with α_3 > 0 are the parameters of the Nelson-Siegel curve as introduced in Section 5.3. Since we are considering European-style call options written on forward-style swaps with delivery period, the vector of contract features λ is given by λ = (K, τ, T_1, ℓ) ∈ R⁴, hence m = 4, where K > 0 is the strike price, τ is the time to maturity, T_1 ≥ τ is the start of delivery of the swap, and ℓ > 0 is the length of the delivery period. Let us remember that, in view of the grid-based learning approach, we shall create a grid of values for λ. In our framework, this would be a four-dimensional grid.
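The Nelson-Siegel curve (5.5) and the closed-form drift (5.6) above can be checked numerically, since µ(g_t) is just the average of g_NS over the delivery window; the parameter values in the check are illustrative:

```python
import math

def g_ns(x, a0, a1, a2, a3):
    # Nelson-Siegel curve (5.5): a0 + (a1 + a2*a3*x) * exp(-a3*x)
    return a0 + (a1 + a2 * a3 * x) * math.exp(-a3 * x)

def mu_closed(T, t, ell, a0, a1, a2, a3):
    """Closed-form drift (5.6): the average of g_NS over [T-t, T+ell-t],
    written via the antiderivative F(y) = -(a1 + a2 + a2*a3*y) e^{-a3 y}/a3
    of the non-constant part of g_NS."""
    def F(y):
        return -(a1 + a2 + a2 * a3 * y) * math.exp(-a3 * y) / a3
    lo, hi = T - t, T + ell - t
    return a0 + (F(hi) - F(lo)) / ell
```

A midpoint-rule average of g_NS over the same interval reproduces mu_closed to high accuracy, confirming the direct computation behind (5.6).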
We then decide to set T_1 = τ, namely the time to maturity of the option coincides with the start of delivery of the swap, and ℓ = 1/12, namely we consider only contracts with one month of delivery, as the time unit is one year. Then λ = (τ, K) ∈ R² and m = 2, so that we will have to create a two-dimensional grid, as introduced in Section 4.2.2. Finally, for all the experiments we fix t = 0 as evaluation time, and r = 0 as risk-free interest rate. In particular, we shall consider three different experiments: pointwise learning, grid-based learning, and grid-based learning with convolutional networks.

In all the experiments, we use the Adam optimizer both for training the neural network and in the calibration step, and we consider the mean squared error loss function. The number of epochs for the approximation step is chosen so that a higher number does not improve the accuracy of the network; a larger number of epochs is used for the calibration step, since more computational effort is necessary for fitting the prices in this second step. The batch size is fixed for all the experiments.

For the grid-based learning approach, we have λ = (τ, K); hence we create a grid of values of times to maturity and strike prices which we believe to be reasonable for a case study in the electricity markets. Let Λ_grid := Λ_grid_τ × Λ_grid_K, where Λ_grid_τ contains seven maturities and Λ_grid_K nine strike levels, so that the grid has dimension m_1 × m_2 = 7 × 9. We point out that the range in Λ_grid_K is relatively narrow compared to the strike prices available, for example, at EEX. The reason for this choice is that by selecting a wider range for K, some options are far out of the money, and thus very cheap for parameter choices that result in low overall volatility. As considering options with value less than a cent is not very interesting and also poses numerical challenges, we restrict the setting of our experiments to this narrow range.
To widen the range, one may consider options with low strikes only in combination with model parameters a, b and k that lead to a large overall volatility ξ. From a practical perspective, this is reasonable, since options far out of the money are typically only in demand in large-volatility environments. While the network architecture would not actually change significantly, choosing the grid Λ_grid_K specified above for all possible model parameters is convenient and thus preferred for this case study.

We define a neural network N : Θ → R^{7×9}, with Θ ⊂ R⁷, so that d = n = 7 is the input dimension and p = 7 × 9 = 63 is the output dimension. In particular, we consider a four-layer neural network (L = 4) with node sizes n = (30, . . .) and the ReLU activation function. The final output is reshaped into a grid of dimension 7 × 9. The total number of parameters to calibrate is M = |V| = 4053. We consider a training set of size N_train = 40000 and a test set of size N_test = 4000, which is also used for the calibration step. The code is implemented in TensorFlow 2.1.0 and is available at GitHub: HJM_calibration_with_NN. We take θ = (a, b, k, α_0, α_1, α_2, α_3) ∈ Θ, with Θ given by a product of seven compact intervals, one for each parameter. The resulting annual implied spot price volatility and the half-life of the process, that is, the time taken for the price to revert half-way back to its long-term level, are confined to realistic ranges; see [16] for details.

Since Θ has dimension n = 7, if one wants to consider for each dimension a number ν of different values for the training set, and then combine every value in the first dimension with every value in the second dimension, and so on, then the size of the training set would be ν⁷, which easily gets very large.
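To avoid this ν⁷ blow-up, one can instead take a uniform one-dimensional grid per parameter and randomly match the coordinates, as described next; a minimal NumPy sketch (the parameter bounds below are placeholders, not the intervals actually used for Θ):

```python
import numpy as np

# One uniform grid of N = N_train + N_test values per coordinate of Theta,
# then an independent random permutation of each coordinate: full coverage
# of every one-dimensional marginal without a full tensor grid.
# The bounds below are placeholders, not the paper's actual intervals.
bounds = [(0.1, 1.0), (0.1, 2.0), (8.0, 12.0), (34.0, 36.0),
          (-6.0, -4.0), (0.1, 0.5), (4.0, 6.0)]
n_train, n_test = 40000, 4000
N = n_train + n_test
rng = np.random.default_rng(42)
cols = [rng.permutation(np.linspace(lo, hi, N)) for lo, hi in bounds]
theta = np.column_stack(cols)
theta_train, theta_test = theta[:n_train], theta[n_train:]
```

Each row of theta is then priced (with formula (5.1) in the experiments below) to produce one training or test sample.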
This approach would allow for training the network with all possible combinations of the parameters and possibly ensure a good approximation also of the one-dimensional restrictions of the pricing function. However, for a reasonable size of the training set, as for instance N_train + N_test = 44000 chosen here, one would get only 44000^{1/7} ≈ 4 different values for each of the parameters. We therefore consider a uniform grid of size N_train + N_test = 44000 for each of the n = 7 dimensions composing Θ, and we randomly match the values in each dimension to form a training set of size N_train = 40000 and a test set of size N_test = 4000. The corresponding prices have then been calculated with the formula (5.1). We point out that each of the N_train (or N_test) samples is a grid of size
63 = 7 × 9, with each dimension corresponding to the different values of K and τ. Each vector calibration is thus performed starting from 63 prices, as defined in Section 4.2.2.

In Figure 2 we report the average relative error and the maximum relative error in the training step, both for the training and the test set. In particular, the errors have been clustered in order to obtain a grid corresponding to the different contracts (τ, K) ∈ Λ_grid_τ × Λ_grid_K. We notice that the average relative error is quite low, showing a good performance of the neural network in approximating the price functional. The worst accuracy is for the contracts with the highest strike price, probably due to the fact that the price of these contracts is small.

In Figure 3 we report the relative error for the components of θ̂ in the calibration step. For some of the parameters, such as a and one of the α-parameters, we notice a certain pattern, namely, for higher values of the parameter the error is smaller, and vice versa. However, the performance in calibration is not particularly good: a, b, and one of the α-parameters have a notably large mean relative error. On the other hand, even if the model parameters are not accurately estimated, by substituting θ̂ into the neural network we get a price approximation Π̂ which is quite good. This can be seen in Figure 4, where we report the average and the maximum relative error after calibration. For mid-maturity contracts we observe the best accuracy.

With an output dimension of p = 63 in the grid-based learning approach, the number of neural network parameters grows fast, since in the last layer the weight matrix V_L must have n_L = 63 rows. In imaging, a solution adopted to reduce the total number of parameters is to consider convolutional networks, as described in Section 4. We consider a four-layer neural network (L = 4). The first two layers are dense with node sizes n_1 = 30 and n_2 = 63. There follows a reshaping into R^{7×9}, and two convolutional layers with different filter sizes.
Both convolutional layers have the same kernel size. We use the ReLU activation function, and the total number of parameters to calibrate is M = |V| = 2764. The training and test sets are the same ones we used for the dense network case above; hence each of the N_train = 40000 and N_test = 4000 samples is a grid of values corresponding to the different K's and τ's, and each vector calibration is performed with a sample of 63 prices.

Figure 2: Average relative error and maximum relative error in the approximation step with the grid-based learning approach and a dense neural network (training and test sets; strike versus maturity in months).

Figure 3: Relative error for the model parameters calibrated with the grid-based learning approach and a dense neural network (mean and median per parameter).

Figure 4: Average relative error and maximum relative error after calibration with the grid-based learning approach and a dense neural network.

In Figure 5 we report the average relative error and the maximum relative error in the approximation step, clustered with respect to the different contracts. Looking at the average relative error, we notice better accuracy compared to the grid-based approach with a dense network. However, for the maximum relative error, we get in the upper-right corner of the grid a value which is almost three times the corresponding value for the dense network. Also, over the rest of the grid the maximum relative error is roughly uniform, at a higher level than for the dense network in most of the cells. We conclude that the convolutional neural network considered has better average relative errors, but worse maximum relative errors.

In Figure 6 we report the relative error for the components of θ̂ in the calibration step. We notice the same pattern for a and one of the α-parameters found previously. We also notice worse accuracy for a, but much better accuracy for b, so that overall this is not really different from the dense network experiment. In Figure 7 a pattern similar to the price approximation after calibration with the dense network is visible, where the mid-maturity contracts have the lowest relative error. Overall, the two networks considered have approximately the same level of accuracy.
Figure 5: Average relative error and maximum relative error in the approximation step with the grid-based learning approach and a convolutional neural network (training and test sets).
We implement the pointwise learning approach. We define a neural network N : Λ × Θ → R_+, with Λ ⊂ R² and Θ ⊂ R⁷, so that d = 9 is the input dimension and p = 1 is the output dimension.

Figure 6: Relative error for the model parameters calibrated with the grid-based learning approach and a convolutional neural network (mean and median per parameter).
Figure 7: Average relative error and maximum relative error after calibration with the grid-based learning approach and a convolutional neural network.

We consider a four-layer neural network (L = 4) with node sizes n = (30, . . .) and the ELU activation function. The total number of parameters to calibrate is M = |V| = 2191. We consider a training set of size N_train = 60000 and a test set of size N_test = 6000. These are bigger than the sets considered in the grid-based approach, to reflect that each sample corresponds to a single price value, while in the grid-based approach each sample corresponds to 63 values; hence in the latter case the training sample is effectively bigger. We take θ = (a, b, k, α_0, α_1, α_2, α_3) ∈ Θ with Θ as above, while λ = (τ, K) ∈ Λ_point for Λ_point = Λ_point_τ × Λ_point_K a product of intervals of maturities and strikes. We notice that Λ_grid ⊂ Λ_point: this allows us to compare the pointwise learning approach with the grid-based one. Training and test sets have been generated from Θ × Λ in the same manner as described for the grid-based approach, calculating the corresponding prices with the formula in equation (5.1).

In order to produce error plots comparable to the ones in the grid-based learning, we cluster the prices in the following way. We introduce a new grid Λ̂ := Λ̂_τ × Λ̂_K, where Λ̂_τ consists of the midpoints between consecutive maturities in Λ_grid_τ, and Λ̂_K is defined analogously for the strikes. We then label each price with the closest strike in Λ_grid_K and the closest maturity in Λ_grid_τ, in order to obtain a grid of clustered values corresponding to Λ_grid. Moreover, since each of the N_test = 6000 samples corresponds to a different parameter vector, we cannot perform calibration starting from the test set, as we have done previously. Instead, we calibrate by using the test set from the grid-based approach, after shape adjustment. In this manner, each calibration is performed as described in Section 4.2.1, starting from N_cal = 63 different prices corresponding to the same θ ∈ Θ, to be estimated.

In Figure 8 we report the average relative error and the maximum relative error in the approximation step after clustering, as described above. For both, we notice better accuracy compared to the grid-based approaches. Still, the worst accuracy is for contracts with the highest strike price. However, when it comes to calibration, the pointwise approach turns out to be the worst. Indeed, the relative errors for θ̂ in Figure 9 are on average worse than in the previous experiments. It can be noticed in particular that for a, b and k the errors are concentrated around specific levels, while before they were much more spread out. The accuracy for a is not very good, and the accuracy for b is the worst among the three methods. On the other hand, the accuracy for α has substantially improved. In Figure 10 we can see that the relative error for Π̂ after calibration is also the worst among the three methods.

In the setting of the grid-based learning approach with dense network of Section 6.1, we test the bid-ask loss function defined in Section 4.3.
In particular, we consider the dense neural network previously trained, and we use the bid-ask loss function for the calibration step. Starting from the test set of the grid-based approach, we obtain bid and ask prices by taking fixed percentages below and above the original prices, respectively. In Figure 11 we report the relative errors for the calibrated model parameters, which are a bit worse than the values found in the experiment with the same neural network but exact prices. However, considering that the information given to the network is much weaker, due to the relatively wide bid-ask range, the results are surprisingly good, and the overall error is of similar magnitude as in the rest of the experiments.
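The calibration objective used in this experiment can be sketched as follows. This is our reading of the bid-ask loss of Section 4.3 (a quadratic penalty on the part of each model price lying outside its bid-ask band); the function name and the averaging convention are illustrative, not the paper's exact specification.

```python
def bid_ask_loss(model_prices, bids, asks):
    """Penalize only the part of each model price lying outside [bid, ask].

    Prices inside the band contribute zero, so wide bands give the
    optimizer weaker information than exact prices would.
    """
    loss = 0.0
    for p, lo, hi in zip(model_prices, bids, asks):
        below = max(lo - p, 0.0)   # distance below the bid
        above = max(p - hi, 0.0)   # distance above the ask
        loss += below ** 2 + above ** 2
    return loss / len(model_prices)
```

Any price inside its band, e.g. `bid_ask_loss([1.0], [0.9], [1.1])`, contributes zero, which is why narrower bands sharpen the calibration signal.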
Figure 8: Average relative error and maximum relative error in the approximation step with the pointwise learning approach (training set and test set, by strike and maturity in months).
Figure 9: Relative error for the model parameters calibrated with the pointwise learning approach (mean and median relative error reported per parameter).
Figure 10: Average relative error and maximum relative error after calibration with the pointwise learning approach.

We calculate the rate of mismatched prices, namely the percentage of prices which do not lie within the bid-ask constraints. In Figure 12, the plot on the left shows the mismatched-price rate observed before calibration, namely the rate obtained by plugging into the neural network the starting parameters needed to initialize the optimization routine. The plot on the right shows the percentage of mismatched prices after calibration. As we can see, the final rate is zero for almost all the contracts, with only two exceptions. A random sub-sample of prices is reported in Figure 13 where, in the top panel, we can notice several prices lying outside the constraints (marked with the symbol ★), in the mid panel only some of the prices are outside the constraints, while in the bottom panel all the prices are within the constraints (marked with the symbol ×), corresponding to a mismatch rate equal to zero. However, the prices lying outside the bid-ask interval are very close to the boundaries, which confirms the suitability of the bid-ask loss function whenever exact prices are not available. We have also tested the bid-ask loss function with narrower constraints. In this case, it is sufficient to increase the number of iterations of the optimization routine to obtain very similar results.
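The mismatch rate reported in Figure 12 is a simple fraction; a minimal sketch (the function name and the convention that boundary prices count as matched are ours):

```python
def mismatch_rate(prices, bids, asks):
    """Fraction of prices lying strictly outside their bid-ask band [bid, ask]."""
    outside = sum(1 for p, lo, hi in zip(prices, bids, asks) if p < lo or p > hi)
    return outside / len(prices)
```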
Figure 11: Relative error for the model parameters calibrated with a dense neural network in the grid-based learning approach and bid-ask constraints (mean and median relative error reported per parameter).

Figure 12: Starting mismatch rate and final mismatch rate after calibration with a dense network in the grid-based learning approach and bid-ask constraints.
Figure 13: Examples of bid-ask constraints with a dense neural network and the grid-based learning approach (three contracts with different strike and maturity). The symbol × indicates the prices which after calibration are inside the constraints, the symbol ★ the ones outside.

6.5 The non-injectivity issue

In all the experiments described above, a common issue appears. The accuracy achieved in calibration is not particularly convincing, especially for the parameters of the volatility and the covariance operator, a, b and k. Slightly better results were obtained for the Nelson-Siegel curve parameters, with one exception. See Figures 3, 6, 9 and 11. On the other hand, the relative error for the price approximation after calibration shows a high degree of accuracy. See Figures 4, 7, 10 and 12. This phenomenon appears as a discrepancy. From this, one may conclude that neural networks can be used for approximation and calibration of sophisticated stochastic models, such as the HJM model considered here. However, the original meaning of the model parameters may get lost in the approximation step. As pointed out in [6], it is somehow to be expected that the neural network N is non-injective in the input parameters on a large part of the input domain. We shall briefly analyse this.

The price formula (5.1), once the strike K and the time to maturity τ are fixed, crucially depends on ξ as derived in Proposition 5.3 and on µ(g_t) (see equation (5.6)):

\[ \Pi(t) = e^{-r(\tau-t)}\left\{ \xi\,\varphi\!\left(\frac{\mu(g_t)-K}{\xi}\right) + \big(\mu(g_t)-K\big)\,\Phi\!\left(\frac{\mu(g_t)-K}{\xi}\right) \right\}. \]

A first observation is that ξ is only a scale, while µ(g_t) defines the distance from the strike price K and is hence more influential on the final price level.
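For a fixed strike and maturity, the price formula is cheap to evaluate directly. Below is a sketch assuming the Bachelier-type form Π(t) = e^{−r(τ−t)}{ξ φ(d) + (µ − K) Φ(d)} with d = (µ(g_t) − K)/ξ, where φ and Φ are the standard normal density and distribution function; `mu` and `xi` are passed as plain numbers, and the function name is ours.

```python
from math import erf, exp, pi, sqrt

def norm_pdf(x):
    """Standard normal density."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def norm_cdf(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def option_price(mu, xi, K, r, tau, t):
    """Bachelier-type price: e^{-r(tau-t)} (xi*phi(d) + (mu-K)*Phi(d)), d=(mu-K)/xi."""
    d = (mu - K) / xi
    return exp(-r * (tau - t)) * (xi * norm_pdf(d) + (mu - K) * norm_cdf(d))
```

At the money (mu = K) the price is proportional to ξ, while mu − K sets the distance from the strike, which is the asymmetry between the two quantities discussed above.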
Let us focus on ξ:

\[ \xi^2 = \frac{a^2}{k\, b^5 \ell^2}\left( e^{-2b(T-\tau)} - e^{-2b(T-t)} \right) B\big(b, e^{-b\ell}\big), \]

where B(b, e^{−bℓ}) simply indicates a term depending on b and e^{−bℓ}. In the front coefficient, the three parameters a, b and k are multiplied together directly or inversely, and a decrease in a might, for example, be compensated by a decrease in b or k, and vice versa, meaning that several combinations of values for a, b and k lead to the same overall ξ. Thus we may suspect that it can be hard for the neural network to identify the right vector of parameters in the calibration step, despite reaching a good level of accuracy for the price.

In Figure 14 we report an example of non-injectivity with respect to the parameters a, b and k that we have observed for the dense network trained in the grid-based learning approach. Here we notice indeed that the neural network is not injective as a map when all the parameters except one are fixed. We also notice that the map is only slightly sensitive to changes in the parameters a, b and k, which also explains the struggle in recovering these three parameters. Similar observations can be made for the drift:

\[ \mu(g_t) = \alpha_1 + \frac{e^{-\alpha_4(T-t)}}{\alpha_4^2 \ell}\big(\alpha_3 + \alpha_2\alpha_4 + \alpha_3\alpha_4(T-t)\big) - \frac{e^{-\alpha_4(T+\ell-t)}}{\alpha_4^2 \ell}\big(\alpha_3 + \alpha_2\alpha_4 + \alpha_3\alpha_4(T+\ell-t)\big). \]

Here the role of α₁ is specific, since it defines the starting level of the curve, and indeed α₁ is the parameter that gets the best accuracy in estimation. However, α₃ appears in two positions: first added to α₂α₄ and then multiplied by α₄. It is thus difficult for the neural network to outline the role of α₃ in the drift. In the Nelson-Siegel curve definition in equation (5.5), α₄ defines the position of the "bump" in the curve. The drift µ(g_t) is then obtained by integrating the curve over the delivery period of the contract. This integration might smoothen the curve and make it difficult to locate the "bump".
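The compensation effect in ξ can be checked numerically. The sketch below assumes the expression for ξ² derived from the proof of Proposition 5.2, with front coefficient a²/(k b⁵ ℓ²); the exact powers were partly lost in extraction, so treat them as our reconstruction. For fixed b, ξ² depends on a and k only through a²/k, so scaling a ↦ λa and k ↦ λ²k leaves ξ unchanged.

```python
from math import exp

def xi_squared(a, b, k, ell, T, tau, t):
    """xi^2 with front coefficient a^2/(k b^5 ell^2), a time factor,
    and the term B(b, e^{-b ell}) from Proposition 5.2 (reconstructed)."""
    B = (2.0 / 3.0) * (b ** 2 + 3.0) * (1.0 + exp(-2.0 * b * ell)) \
        - 2.0 * exp(-b * ell) * ((b ** 2 / 6.0) * (3.0 * ell ** 3 - 6.0 * ell ** 2 + 4.0)
                                 - 3.0 * ell + 2.0)
    time_factor = exp(-2.0 * b * (T - tau)) - exp(-2.0 * b * (T - t))
    return a ** 2 / (k * b ** 5 * ell ** 2) * time_factor * B

# (a, k) = (0.5, 2.0) and (1.0, 8.0) share a^2/k = 0.125, hence the same xi^2:
x1 = xi_squared(0.5, 1.0, 2.0, 0.25, 1.0, 0.75, 0.0)
x2 = xi_squared(1.0, 1.0, 8.0, 0.25, 1.0, 0.75, 0.0)
```

Two different parameter vectors thus produce identical option prices, which is exactly the identifiability problem the calibration step runs into.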
This might explain why the accuracy in estimating this parameter is worse than for the other Nelson-Siegel parameters.

The problem described above may arise also with more traditional calibration techniques. It might simply be due to the fact that one is dealing with a non-convex optimisation problem with potentially several local minima. What is unique to the neural network is that the approximated price function N(x) might violate some basic arbitrage conditions. One particular example of such a condition is that the call price should be an increasing function of the volatility. In our case, in particular, since a arises as a constant factor in the volatility, the price should be increasing in a when all the other parameters are fixed. In Figure 14 we observe that this is violated here. The violation is however relatively small, since overall the prices are well fitted.

Figure 14: Examples of non-injectivity for the dense neural network trained in the grid-based learning approach. In each image, only one of the parameters varies, while the rest are fixed.

We conclude the article with the following theorem, showing that it is possible to construct ReLU neural networks which act as linear maps.
Theorem 6.1. Let A ∈ ℝ^{p×d}. Then for any L ≥ 2 and any n = (d, n₁, …, n_{L−1}, p) with n_i ≥ 2d, i = 1, …, L−1, there exists an L-layer ReLU neural network N : ℝ^d → ℝ^p with dimension n which satisfies N(x) = Ax for all x ∈ ℝ^d.

Proof.
See Appendix A.8.

Theorem 6.1 proves that we can construct a ReLU L-layer neural network which acts as a linear map. As there are infinitely many non-injective linear maps (the zero map being a trivial example), it is then possible to construct infinitely many non-injective ReLU neural networks. Obviously, this does not show that a non-injective network, such as the one constructed in the proof of Theorem 6.1, will also minimize the objective function used for training. It may however represent an important starting point to (try to) understand the injectivity issue.

We consider a state-dependent HJM model defining the forward curve dynamics under the risk-neutral measure. To specify the model, we construct a volatility operator which turns elements from the noise space L²(O) into elements in the space of forward curves, the Filipović space H_α(ℝ₊) defined in Section 2. This is achieved by introducing a class of integral operators. We prove conditions which ensure that an operator of this class is well defined, and that the resulting volatility satisfies the Lipschitz and linear growth conditions required for the existence of a mild solution to the HJM equation. We then restrict our attention to a parametrized deterministic volatility operator. It captures some important properties studied in the literature, such as a pronounced seasonality and the Samuelson effect.

We then focus on forward contracts with delivery periods, called swaps, and options written on them. By considering a parametric covariance operator and a parametric model for the initial forward curve, namely the Nelson-Siegel curve, we obtain a vector θ of parameters that determines the option prices in the model. We derive the respective pricing functional and train a neural network to approximate it. This neural network is then used in a calibration step to find the best model parameters (in a mean-squared-error sense) to fit the (artificial) market price data.
We do this both with the pointwise and the grid-based learning approaches described in Section 4.2, and we test different neural network architectures.

From all the experiments, we can conclude that neural networks perform very well in approximating the price functional, with the average relative error in the test set reaching very low levels. However, the original meaning of the parameters in the underlying model might get lost during the approximation step. This becomes apparent in the calibration step, where the average relative error for the parameters can be of substantial magnitude, with large peaks in the worst cases. On the other hand, the prices obtained with the neural network and the estimated parameters are accurate, with small average relative errors, showing that the prices can be recovered even with a vector of parameters θ different from the true one.

As discussed in Section 6.5, this may be due to the non-injectivity of the neural network. Moreover, for an option with fixed maturity and strike, different combinations of the parameters can result in the same price. As stated in [6], computing the distance between the true model parameters and the estimated ones, as we do here, might not be the optimal way to quantify the accuracy of the two-step approach. In [6] the authors suggest switching to a Bayesian viewpoint and observing the posterior distribution of the model parameters θ, considered as a random variable. However, this kind of analysis was beyond the scope of the current research. While the problem in recovering the true parameters stresses the importance of proper benchmarking of a neural-network-based approach, the accuracy in pricing suggests that neural networks might in fact turn out promising in making infinite-dimensional models more tractable.

To improve accuracy in calibration, one might consider a different structure for the input vectors in the training set.
In our experiments, we consider a uniform grid for each of the dimensions and randomly match the values to generate the input set. A given value in, say, the first dimension then appears only once in the whole training set. It might help to construct the set in such a way that every value in the first dimension is combined with every value in the second dimension, and so on, to generate a uniform grid on Θ. As pointed out before, since Θ has dimension n = 7, training the network with m different values for each parameter requires a training set of size m⁷, which becomes large even for a small m.

The new loss function introduced for calibration, based on bid-ask prices instead of exact prices, works well for the calibration step. It penalizes the values lying outside the bid-ask range and, considering that the information given to the network is weaker than with exact prices, the results are promising. Indeed, after calibration, for almost all the contracts considered the price lies within the bid-ask range, or otherwise very close to its boundaries. If the prices for some of the contracts are available, while for others one has only bid and ask prices, one could consider a hybrid loss function, corresponding to the mean squared error for the exact prices and to the newly defined bid-ask loss function for the bid and ask price ranges. This allows one to exploit all the available information at once.

Since forward contracts are themselves traded, it may be interesting to test a different approach. As done here, the training of the neural network includes the parameters for the initial forward curve. Once the neural network is trained in the off-line step, one could divide the second step into two parts. By standard interpolation techniques, one could use the forward prices to calibrate the initial forward curve parameters.
These can then be considered as fixed in the proper calibration step, so that the vector of parameters to be recovered with the neural network is smaller, most likely leading to better results in the calibration.

One might also decide to fix the initial forward curve already at the level of the approximation step, and then train the neural network with a lower-dimensional input vector θ. This would require running the approximation step much more frequently, to keep it updated with the current market information, but would most likely improve accuracy. Another possibility, without changing the setup, is to include forward contracts as options with strike K = 0 and time to maturity τ = t equal to the evaluation time. These contracts most efficiently allow one to recover the parameters for the Nelson-Siegel curve. Given these considerations, our approach of calibrating the entire parameter vector θ based only on call options is likely to lead to a larger error than the one obtained in practice. For instance, when pricing standard options in equity markets based on a Black-Scholes model, only the implied volatility is calibrated, while the current stock price, dividends and interest rate are obtained from other sources.

A Proofs of the main results
We report in this section the proofs of the main results, in the order they appear in the paper.
A.1 Proof of Theorem 2.2
For every x ∈ ℝ₊ and f ∈ H_α, by the Cauchy-Schwarz inequality we can write that

\[ \int_O |\kappa_t(x,y,f)\, h(y)|\,dy \le \left(\int_O \kappa_t(x,y,f)^2\,dy\right)^{1/2} \left(\int_O h(y)^2\,dy\right)^{1/2} < \infty, \]

since κ_t(x,·,f) ∈ H for every x ∈ ℝ₊ and every f ∈ H_α by Assumption 1, and because h ∈ H. Thus σ_t(f)h is well defined for all h ∈ H.

We need to show that σ_t(f)h ∈ H_α for every f ∈ H_α. We start by noticing that for every x ∈ ℝ₊ the following equality holds:

\[ \frac{\partial \sigma_t(f)h(x)}{\partial x} = \int_O \frac{\partial \kappa_t(x,y,f)}{\partial x}\, h(y)\,dy, \]

where the differentiation under the integral sign is justified by dominated convergence because of Assumption 2 and \(\int_O \bar\kappa_x(y)\, h(y)\,dy < \infty\). Moreover, by Assumption 3 and the Cauchy-Schwarz inequality, we find that

\[ \int_{\mathbb{R}_+} \left(\int_O \frac{\partial \kappa_t(x,y,f)}{\partial x}\, h(y)\,dy\right)^2 \alpha(x)\,dx \le \|h\|^2 \int_{\mathbb{R}_+} \left\|\frac{\partial \kappa_t(x,\cdot,f)}{\partial x}\right\|^2 \alpha(x)\,dx < \infty, \]

which shows that σ_t(f)h ∈ H_α and the boundedness of the operator σ_t(f) for every f ∈ H_α.
We first observe that for every h ∈ H and every f₁, f₂ ∈ H_α, it holds that

\[ \int_{\mathbb{R}_+}\int_O \left|\frac{\partial\kappa_t(x,y,f_1)}{\partial x}\, f_2'(x)\,\alpha(x)\, h(y)\right| dy\,dx = \int_{\mathbb{R}_+} |f_2'(x)\,\alpha(x)| \int_O \left|\frac{\partial\kappa_t(x,y,f_1)}{\partial x}\right| |h(y)|\,dy\,dx \]
\[ \le \int_{\mathbb{R}_+} |f_2'(x)|\,\alpha^{1/2}(x) \left\|\frac{\partial\kappa_t(x,\cdot,f_1)}{\partial x}\right\| \alpha^{1/2}(x)\, \|h\|\,dx \le \|h\|\, \|f_2\|_\alpha \left(\int_{\mathbb{R}_+} \left\|\frac{\partial\kappa_t(x,\cdot,f_1)}{\partial x}\right\|^2 \alpha(x)\,dx\right)^{1/2} < \infty. \]

Hence we can compute

\[ \langle \sigma_t(f_1)h, f_2\rangle_\alpha = f_2(0)\int_O \kappa_t(0,y,f_1)\,h(y)\,dy + \int_{\mathbb{R}_+} \frac{\partial\sigma_t(f_1)h(x)}{\partial x}\, f_2'(x)\,\alpha(x)\,dx \]
\[ = f_2(0)\int_O \kappa_t(0,y,f_1)\,h(y)\,dy + \int_{\mathbb{R}_+}\int_O \frac{\partial\kappa_t(x,y,f_1)}{\partial x}\, h(y)\,dy\, f_2'(x)\,\alpha(x)\,dx \]
\[ = \int_O \left( f_2(0)\,\kappa_t(0,y,f_1) + \int_{\mathbb{R}_+} \frac{\partial\kappa_t(x,y,f_1)}{\partial x}\, f_2'(x)\,\alpha(x)\,dx \right) h(y)\,dy = \int_O \sigma_t(f_1)^* f_2(y)\, h(y)\,dy = \langle h, \sigma_t(f_1)^* f_2\rangle, \]

for σ_t(f₁)* f₂ defined by

\[ \sigma_t(f_1)^* f_2(y) := f_2(0)\,\kappa_t(0,y,f_1) + \int_{\mathbb{R}_+} \frac{\partial\kappa_t(x,y,f_1)}{\partial x}\, f_2'(x)\,\alpha(x)\,dx = \langle \kappa_t(\cdot,y,f_1), f_2\rangle_\alpha. \]

From [36, Theorem 6.1], σ_t(f₁)* is the unique adjoint operator of σ_t(f₁), for f₁ ∈ H_α.
We start with the growth condition. For h ∈ H and f ∈ H_α we can write that

\[ \|\sigma_t(f)h\|_\alpha^2 = \big(\sigma_t(f)h(0)\big)^2 + \int_{\mathbb{R}_+}\left(\frac{\partial\sigma_t(f)h(x)}{\partial x}\right)^2 \alpha(x)\,dx \]
\[ = \left(\int_O \kappa_t(0,y,f)\,h(y)\,dy\right)^2 + \int_{\mathbb{R}_+}\left(\int_O \frac{\partial\kappa_t(x,y,f)}{\partial x}\, h(y)\,dy\right)^2 \alpha(x)\,dx \]
\[ \le \|\kappa_t(0,\cdot,f)\|^2\,\|h\|^2 + \int_{\mathbb{R}_+}\left\|\frac{\partial\kappa_t(x,\cdot,f)}{\partial x}\right\|^2 \|h\|^2\, \alpha(x)\,dx \]
\[ \le C^2(t)\,(1+|f(0)|)^2\,\|h\|^2 + \int_{\mathbb{R}_+} C^2(t)\, f'(x)^2\, \|h\|^2\, \alpha(x)\,dx \le C^2(t)\,(1+\|f\|_\alpha)^2\,\|h\|^2, \]

where we have used the Cauchy-Schwarz inequality, together with the inequality |f(0)| ≤ ‖f‖_α and Assumption 2. With some abuse of notation, it follows that ‖σ_t(f)‖_{L(H, H_α)} ≤ C(t)(1 + ‖f‖_α) for a suitably chosen constant C(t). Similarly, from Assumption 1 it follows that

\[ \|(\sigma_t(f_1)-\sigma_t(f_2))h\|_\alpha^2 = \left(\int_O \big(\kappa_t(0,y,f_1)-\kappa_t(0,y,f_2)\big)\,h(y)\,dy\right)^2 + \int_{\mathbb{R}_+}\left(\int_O \left(\frac{\partial\kappa_t(x,y,f_1)}{\partial x}-\frac{\partial\kappa_t(x,y,f_2)}{\partial x}\right) h(y)\,dy\right)^2 \alpha(x)\,dx \]
\[ \le \|\kappa_t(0,\cdot,f_1)-\kappa_t(0,\cdot,f_2)\|^2\,\|h\|^2 + \int_{\mathbb{R}_+}\left\|\frac{\partial\kappa_t(x,\cdot,f_1)}{\partial x}-\frac{\partial\kappa_t(x,\cdot,f_2)}{\partial x}\right\|^2 \|h\|^2\,\alpha(x)\,dx \]
\[ \le C^2(t)\,|f_1(0)-f_2(0)|^2\,\|h\|^2 + \int_{\mathbb{R}_+} C^2(t)\,\big(f_1'(x)-f_2'(x)\big)^2\,\|h\|^2\,\alpha(x)\,dx \le C^2(t)\,\|f_1-f_2\|_\alpha^2\,\|h\|^2, \]

hence ‖σ_t(f₁) − σ_t(f₂)‖_{L(H, H_α)} ≤ C(t)‖f₁ − f₂‖_α for a suitably chosen C(t), which proves the Lipschitz continuity of the volatility operator and concludes the proof.
A.4 Proof of Proposition 2.5
In order for the volatility operator σ_t to be well defined, we need to check that the function κ_t introduced in equation (2.4) satisfies the assumptions of Theorem 2.2. We start by observing that κ_t(x,·) ∈ H if and only if ω ∈ H. Then, we can calculate the derivative

\[ \frac{\partial\kappa_t(x,y)}{\partial x} = a(t)\, e^{-bx}\left(\omega'(x-y) - b\,\omega(x-y)\right), \]

which, in particular, by Assumption 2 is bounded by

\[ \left|\frac{\partial\kappa_t(x,y)}{\partial x}\right| \le a(t)\, e^{-bx}\, \bar\omega_x(y). \]

For the H-norm we then have that

\[ \left\|\frac{\partial\kappa_t(x,\cdot)}{\partial x}\right\|^2 = \int_O \left(\frac{\partial\kappa_t(x,y)}{\partial x}\right)^2 dy \le a^2(t)\, e^{-2bx}\, C^2 < \infty, \]

where we have used that ‖ω̄_x‖ ≤ C. This implies that Assumption 3 in Theorem 2.2 is satisfied for α such that \(\int_{\mathbb{R}_+} e^{-2bx}\,\alpha(x)\,dx < \infty\). Finally, the Lipschitz condition is trivially satisfied, and the growth condition is fulfilled because a(t) is bounded.
For w in equation (3.2), we get that w_ℓ(v) = 1/ℓ and W_ℓ(u) = u/ℓ. Then

\[ q_{w_\ell}(x,y) = \frac{1}{\ell}\,(\ell - y + x)\,\mathbb{I}_{[0,\ell]}(y-x), \]

and from equations (3.3) and (3.5) we can write that

\[ \mathcal{D}_{w_\ell}(g_t)(x) = g_t(x) + \frac{1}{\ell}\int_0^\infty (\ell - y + x)\,\mathbb{I}_{[0,\ell]}(y-x)\, g_t'(y)\,dy = g_t(x) + \frac{1}{\ell}\int_x^{x+\ell} (\ell - y + x)\, g_t'(y)\,dy. \]

Integration by parts gives the result.
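The final integration by parts can be spelled out (a short completion, assuming the expression above; the boundary term at y = x + ℓ vanishes since ℓ + x − y = 0 there):

```latex
\mathcal{D}_{w_\ell}(g_t)(x)
  = g_t(x) + \frac{1}{\ell}\Big[(\ell + x - y)\, g_t(y)\Big]_{y=x}^{y=x+\ell}
    + \frac{1}{\ell}\int_x^{x+\ell} g_t(y)\,dy
  = \frac{1}{\ell}\int_x^{x+\ell} g_t(y)\,dy ,
```

so the operator returns the average of the curve over the delivery window [x, x + ℓ], consistent with its use in the proof of Proposition 5.2.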
A.6 Proof of Proposition 3.2
Let f := D_{w_ℓ}^* δ_{T−s}^*(1). We start by applying the covariance operator Q to h := σ_s(g_s)^* f:

\[ \big(\mathcal{Q}\,\sigma_s(g_s)^* f\big)(x) = \int_O q(x,y)\,\sigma_s(g_s)^* f(y)\,dy = \int_O q(x,y)\,\langle \kappa_s(\cdot,y,g_s), f\rangle_\alpha\,dy = \Big\langle \int_O q(x,y)\,\kappa_s(\cdot,y,g_s)\,dy,\; f \Big\rangle_\alpha, \]

and then the operator σ_s(g_s):

\[ \big(\sigma_s(g_s)\,\mathcal{Q}\,\sigma_s(g_s)^* f\big)(x) = \int_O \kappa_s(x,z,g_s)\,\big(\mathcal{Q}\,\sigma_s(g_s)^* f\big)(z)\,dz = \Big\langle \int_O\int_O \kappa_s(x,z,g_s)\, q(z,y)\,\kappa_s(\cdot,y,g_s)\,dy\,dz,\; f\Big\rangle_\alpha = \langle \Psi_s(x,\cdot), f\rangle_\alpha, \]

for \(\Psi_s(x,\cdot) := \int_O\int_O \kappa_s(x,z,g_s)\, q(z,y)\,\kappa_s(\cdot,y,g_s)\,dy\,dz\). We now go back to the definition of f:

\[ \big(\sigma_s(g_s)\,\mathcal{Q}\,\sigma_s(g_s)^*\big)\big(\mathcal{D}_{w_\ell}^* \delta_{T-s}^*(1)\big)(x) = \big\langle \Psi_s(x,\cdot),\, \mathcal{D}_{w_\ell}^* \delta_{T-s}^*(1)\big\rangle_\alpha = \big\langle \mathcal{D}_{w_\ell}\Psi_s(x,\cdot),\, \delta_{T-s}^*(1)\big\rangle_\alpha = \delta_{T-s}\big(\mathcal{D}_{w_\ell}\Psi_s(x,\cdot)\big) = (\mathcal{D}_{w_\ell}\Psi_s)(x, T-s). \]

By Lemma 3.1 we can write that

\[ (\mathcal{D}_{w_\ell}\Psi_s)(x, T-s) = \int_{\mathbb{R}_+} d_\ell(T-s,u)\,\Psi_s(x,u)\,du = \int_{\mathbb{R}_+}\int_O\int_O d_\ell(T-s,u)\,\kappa_s(x,z,g_s)\, q(z,y)\,\kappa_s(u,y,g_s)\,dy\,dz\,du, \]

to which, finally, we apply the operator δ_{T−s} D_{w_ℓ}:

\[ \delta_{T-s}\,\mathcal{D}_{w_\ell}\big(\sigma_s(g_s)\,\mathcal{Q}\,\sigma_s(g_s)^*\big)\big(\mathcal{D}_{w_\ell}^*\delta_{T-s}^*(1)\big) = \int_{\mathbb{R}_+} d_\ell(T-s,v)\,\big(\sigma_s(g_s)\,\mathcal{Q}\,\sigma_s(g_s)^*\big)\big(\mathcal{D}_{w_\ell}^*\delta_{T-s}^*(1)\big)(v)\,dv \]
\[ = \int_{\mathbb{R}_+}\int_{\mathbb{R}_+}\int_O\int_O d_\ell(T-s,v)\, d_\ell(T-s,u)\,\kappa_s(v,z,g_s)\, q(z,y)\,\kappa_s(u,y,g_s)\,dy\,dz\,du\,dv, \]

finalizing the proof.
We consider the representation

\[ \Sigma^2(s) = a^2 \int_{\mathbb{R}_+}\int_{\mathbb{R}_+} e^{-bu}\, e^{-bv}\, d_\ell(T-s,u)\, d_\ell(T-s,v)\, A(u,v)\,du\,dv, \tag{A.1} \]

where we have introduced

\[ A(u,v) := \int_{\mathbb{R}}\int_{\mathbb{R}} \omega(v-z)\, q(z,y)\, \omega(u-y)\,dy\,dz, \qquad u,v \in \mathbb{R}_+. \]

By applying integration by parts (repeatedly), and since ω'' is null, we obtain

\[ A(u,v) = \int_{\mathbb{R}} \omega(v-z)\left(\int_{\mathbb{R}} e^{-k|z-y|}\,\omega(u-y)\,dy\right)dz = \int_{\mathbb{R}} \omega(v-z)\left(\int_{-\infty}^z e^{-k(z-y)}\,\omega(u-y)\,dy + \int_z^\infty e^{-k(y-z)}\,\omega(u-y)\,dy\right)dz \]
\[ = \frac{2}{k}\int_{\mathbb{R}} \omega(v-z)\,\omega(u-z)\,dz. \tag{A.2} \]

By substituting equation (A.2) into (A.1), we get that

\[ \Sigma^2(s) = \frac{2a^2}{k}\int_{\mathbb{R}_+}\int_{\mathbb{R}_+}\int_{\mathbb{R}} e^{-bu}\, e^{-bv}\, d_\ell(T-s,u)\, d_\ell(T-s,v)\,\omega(v-z)\,\omega(u-z)\,dz\,du\,dv \]
\[ = \frac{2a^2}{k}\int_{\mathbb{R}}\left(\int_{\mathbb{R}_+} e^{-bu}\, d_\ell(T-s,u)\,\omega(u-z)\,du\right)^2 dz = \frac{2a^2}{k\ell^2}\int_{\mathbb{R}}\left(\int_{T-s}^{T-s+\ell} e^{-bu}\,\omega(u-z)\,du\right)^2 dz, \]

where we used the definition of d_ℓ in Lemma 3.1. By integration by parts, we get

\[ \Sigma^2(s) = \frac{2a^2}{k\ell^2 b^4}\int_{\mathbb{R}} \Big( e^{-b(T-s)}\big(b\,\omega(T-s-z) + \omega'(T-s-z)\big) - e^{-b(T-s+\ell)}\big(b\,\omega(T-s+\ell-z) + \omega'(T-s+\ell-z)\big) \Big)^2 dz \]
\[ = \frac{2a^2}{k\ell^2 b^4}\left( e^{-2b(T-s)}\, B_1(s) - 2\, e^{-b(T-s)}\, e^{-b(T-s+\ell)}\, B_2(s) + e^{-2b(T-s+\ell)}\, B_3(s) \right), \]

where we introduced

\[ B_1(s) := \int_{\mathbb{R}} \big(b\,\omega(T-s-z) + \omega'(T-s-z)\big)^2\, dz, \]
\[ B_2(s) := \int_{\mathbb{R}} \big(b\,\omega(T-s-z) + \omega'(T-s-z)\big)\big(b\,\omega(T-s+\ell-z) + \omega'(T-s+\ell-z)\big)\, dz, \]
\[ B_3(s) := \int_{\mathbb{R}} \big(b\,\omega(T-s+\ell-z) + \omega'(T-s+\ell-z)\big)^2\, dz. \]
By using the definition of ω in equation (5.3), we get that

\[ B_1(s) = \int_{T-s-1}^{T-s+1} \big(b\,(1-|T-s-z|) - \operatorname{sgn}(T-s-z)\big)^2\, dz = \int_{T-s-1}^{T-s} \big(b\,(1-T+s+z) - 1\big)^2\, dz + \int_{T-s}^{T-s+1} \big(b\,(1+T-s-z) + 1\big)^2\, dz = \frac{2}{3}\big(b^2+3\big), \]

where sgn denotes the sign function. Similarly,

\[ B_2(s) = \frac{b^2}{6}\big(3\ell^3 - 6\ell^2 + 4\big) - 3\ell + 2, \qquad B_3(s) = \frac{2}{3}\big(b^2+3\big). \]

By substituting these findings and rearranging the terms, we get that

\[ \Sigma^2(s) = \frac{2a^2}{k\, b^4 \ell^2}\left\{ \frac{2}{3}\big(b^2+3\big)\big(1+e^{-2b\ell}\big) - 2\, e^{-b\ell}\left( \frac{b^2}{6}\big(3\ell^3 - 6\ell^2 + 4\big) - 3\ell + 2 \right) \right\} e^{-2b(T-s)}, \]

which concludes the proof.

A.8 Proof of Theorem 6.1

We follow a similar approach to [24, Section 8.5]. Let ν_i ≥ 0 be such that n_i = 2d + ν_i for i = 1, …, L−1. For I_d the identity matrix of dimension d, we define the following weights:

\[ V_1 := \begin{bmatrix} I_d & -I_d & O_1 \end{bmatrix}^\top, \qquad V_i := \begin{bmatrix} I_d & -I_d & O_i \end{bmatrix}^\top \begin{bmatrix} I_d & -I_d & O_{i-1} \end{bmatrix}, \quad i = 2, \ldots, L-1, \qquad V_L := A \begin{bmatrix} I_d & -I_d & O_{L-1} \end{bmatrix}, \]

where ⊤ denotes the transpose operator. Here O_i ∈ ℝ^{d×ν_i} are matrices with all entries equal to 0, compensating the matrix dimensions in such a way that V_i ∈ ℝ^{n_i × n_{i−1}} for i = 1, …, L−1 (with n₀ = d). By considering zero bias vectors v_i, the linear maps H_i introduced in the neural network definition in equation (4.1) then coincide with the matrices V_i.

We observe that for every x ∈ ℝ^d, the ReLU activation function satisfies

\[ x = \rho(x) - \rho(-x) = \begin{bmatrix} I_d & -I_d \end{bmatrix} \rho\left( \begin{bmatrix} I_d & -I_d \end{bmatrix}^\top x \right), \]

where the activation function acts componentwise. By a straightforward calculation, one can then see that the neural network defined here satisfies the equality N(x) = Ax for every x ∈ ℝ^d, which means that it acts on x as a linear map.
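The construction is easy to verify numerically. Below is a minimal pure-Python sketch of the L = 2 case (ν_i = 0, so the single hidden layer has width 2d); the function names are ours.

```python
def relu(v):
    return [max(x, 0.0) for x in v]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def linear_relu_net(A, x):
    """ReLU network realizing x -> A x via the identity x = rho(x) - rho(-x):
    hidden pre-activation is [x; -x] (weights [I_d; -I_d]),
    output weights are A [I_d  -I_d]."""
    hidden = relu(list(x) + [-v for v in x])              # [rho(x); rho(-x)]
    V_out = [list(row) + [-a for a in row] for row in A]  # A [I_d  -I_d]
    return matvec(V_out, hidden)
```

Taking A to be the zero matrix gives a (trivially) non-injective ReLU network, the example mentioned after Theorem 6.1.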
References

[1] Andresen, Arne, Steen Koekebakker and Sjur Westgaard (2010). Modeling electricity forward prices using the multivariate normal inverse Gaussian distribution. The Journal of Energy Markets, 3(3), 3.
[2] Barth, Andrea and Annika Lang (2012). Simulation of stochastic partial differential equations using finite element methods. Stochastics An International Journal of Probability and Stochastic Processes, 84(2-3), 217-231.
[3] Barth, Andrea and Annika Lang (2012). Multilevel Monte Carlo method with applications to stochastic partial differential equations. International Journal of Computer Mathematics, 89(18), 2479-2498.
[4] Barth, Andrea, Annika Lang and Christoph Schwab (2013). Multilevel Monte Carlo method for parabolic stochastic partial differential equations. BIT Numerical Mathematics, 53(1), 3-27.
[5] Barth, Andrea and Fred E. Benth (2014). The forward dynamics in energy markets – infinite-dimensional modelling and simulation. Stochastics An International Journal of Probability and Stochastic Processes, 86(6), 932-966.
[6] Bayer, Christian and Benjamin Stemper (2018). Deep calibration of rough stochastic volatility models. arXiv preprint arXiv:1810.03399.
[7] Bayer, Christian, Blanka Horvath, Aitor Muguruza, Benjamin Stemper and Mehdi Tomas (2019). On deep calibration of (rough) stochastic volatility models. arXiv preprint arXiv:1908.08806.
[8] Benth, Fred E. (2015). Kriging smooth energy futures curves. Energy Risk.
[9] Benth, Fred E. and Steen Koekebakker (2008). Stochastic modeling of financial electricity contracts. Energy Economics, 30(3), 1116-1157.
[10] Benth, Fred E., Jūratė Š. Benth and Steen Koekebakker (2008). Stochastic Modelling of Electricity and Related Markets. Vol. 11. World Scientific.
[11] Benth, Fred E. and Paul Krühner (2014). Representation of infinite-dimensional forward price models in commodity markets. Communications in Mathematics and Statistics, 2(1), 47-106.
[12] Benth, Fred E. and Paul Krühner (2015). Derivatives pricing in energy markets: an infinite-dimensional approach. SIAM Journal on Financial Mathematics, 6(1), 825-869.
[13] Benth, Fred E. and Florentina Paraschiv (2018). A space-time random field model for electricity forward prices. Journal of Banking & Finance, 95, 203-216.
[14] Bühler, Hans, Lukas Gonon, Josef Teichmann and Ben Wood (2019). Deep hedging. Quantitative Finance, 19(8), 1271-1291.
[15] Carmona, René and Sergey Nadtochiy (2012). Tangent Lévy market models. Finance and Stochastics, 16(1), 63-104.
[16] Clewlow, Les and Chris Strickland (2000). Energy Derivatives: Pricing and Risk Management. Lacima Publications, London.
[17] Cuchiero, Christa, Wahid Khosrawi and Josef Teichmann (2020). A generative adversarial network approach to calibration of local stochastic volatility models. arXiv preprint arXiv:2005.02505.
[18] Da Prato, Giuseppe and Jerzy Zabczyk (2014). Stochastic Equations in Infinite Dimensions. Cambridge University Press.
[19] De Spiegeleer, Jan, Dilip B. Madan, Sofie Reyners and Wim Schoutens (2018). Machine learning for quantitative finance: fast derivative pricing, hedging and fitting. Quantitative Finance, 18(10), 1635-1643.
[20] Ferguson, Ryan and Andrew Green (2018). Deeply learning derivatives. arXiv preprint arXiv:1809.02233.
[21] Filipović, Damir (2001). Consistency Problems for Heath-Jarrow-Morton Interest Rate Models. Lecture Notes in Mathematics, vol. 1760. Springer, Berlin.
[22] Frestad, Dennis (2008). Common and unique factors influencing daily swap returns in the Nordic electricity market, 1997–2005. Energy Economics, 30(3), 1081-1097.
[23] Goodfellow, Ian, Yoshua Bengio and Aaron Courville (2016). Deep Learning. MIT Press.
[24] Gottschling, Nina M., Vegard Antun, Ben Adcock and Anders C. Hansen (2020). The troublesome kernel: why deep learning for inverse problems is typically unstable. arXiv preprint arXiv:2001.01258.
[25] Heath, David, Robert Jarrow and Andrew Morton (1992). Bond pricing and the term structure of interest rates: A new methodology for contingent claims valuation. Econometrica: Journal of the Econometric Society, 77-105.
[26] Henry-Labordere, Pierre (2017). Deep primal-dual algorithm for BSDEs: Applications of machine learning to CVA and IM. Available at SSRN 3071506.
[27] Hernandez, Andres (2016). Model calibration with neural networks. Available at SSRN 2812140.
[28] Higham, Catherine F. and Desmond J. Higham (2019). Deep learning: An introduction for applied mathematicians. SIAM Review, 61(4), 860-891.
[29] Horvath, Blanka, Aitor Muguruza and Mehdi Tomas (2019). Deep learning volatility. Available at SSRN 3322085.
[30] Kallsen, Jan and Paul Krühner (2015). On a Heath-Jarrow-Morton approach for stock options. Finance and Stochastics, 19(3), 583-615.
[31] Koekebakker, Steen and Fridthjof Ollmar (2005). Forward curve dynamics in the Nordic electricity market. Managerial Finance, 31(6), 73-94.
[32] Kondratyev, Alexei (2018). Learning curve dynamics with artificial neural networks. Available at SSRN 3041232.
[33] Kovács, Mihály, Stig Larsson and Fredrik Lindgren (2010). Strong convergence of the finite element method with truncated noise for semilinear parabolic stochastic equations with additive noise. Numerical Algorithms, 53(2-3), 309-320.
[34] Nelson, Charles R. and Andrew F. Siegel (1987). Parsimonious modeling of yield curves.
Journal of business: 473-489.[35] Peszat, Szymon and Jerzy Zabczyk (2007).
Stochastic Partial Differential Equations withLévy Noise: An evolution equation approach. (Vol. 113). Cambridge University Press.[36] Rynne, Bryan and Martin A. Youngson (2013).
Linear Functional Analysis.
Springer Sci-ence & Business Media.[37] Tappe, Stefan (2012).
Some refinements of existence results for SPDEs driven by Wienerprocesses and Poisson random measures.
International Journal of Stochastic Analysis,2012.[38] Weinan, E, Jiequn Han and Arnulf Jentzen (2017).