Accuracy of Deep Learning in Calibrating HJM Forward Curves
Fred Espen Benth ∗ Nils Detering † Silvia Lavagnini ‡ June 4, 2020
Abstract
We price European-style options written on forward contracts in a commodity market, which we model with a state-dependent infinite-dimensional Heath-Jarrow-Morton (HJM) approach. We introduce a new class of volatility operators which map the square integrable noise into the Filipović space of forward curves, and we specify a deterministic parametrized version of it. For calibration purposes, we train a neural network to approximate the option price as a function of the model parameters. We then use it to calibrate the HJM parameters starting from (simulated) option market data. Finally, we introduce a new loss function that takes into account bid and ask prices and offers a solution to calibration in illiquid markets. A key issue discovered is that the trained neural network might be non-injective, which can lead to poor accuracy in calibrating the forward curve parameters, even when showing a high degree of accuracy in recovering the prices. This reveals that the original meaning of the parameters gets lost in the approximation.
Keywords:
Neural networks; Model calibration; Heath-Jarrow-Morton approach; Volatility operator; Forward curve; Option pricing; Bid-ask price; Stochastic partial differential equation.
We price European-style options written on forward contracts in a commodity market. We follow the Heath-Jarrow-Morton (HJM) approach and model the forward curve by a stochastic partial differential equation with state-dependent volatility, taking values in the Filipović space. In our setting, the Hilbert-valued Wiener process driving the noise of the forward curve takes values in $L^2(\mathcal{O})$, $\mathcal{O}$ being some Borel subset of $\mathbb{R}$ (possibly $\mathbb{R}$ itself). This requires the volatility operator to smoothen elements of $L^2(\mathcal{O})$ into elements of the Filipović space. We achieve this by constructing the volatility as an integral operator with respect to some kernel function, and derive the conditions needed on the kernel function such that the volatility operator is well defined. We then focus on the pricing of forward contracts with delivery period, also called swaps. Typical examples are forward contracts in the electricity market, such as the ones traded at Nord Pool AS and the European Energy Exchange (EEX). For a deterministic volatility structure, we derive analytic pricing formulas based on a representation theorem for swaps presented in [11].

The authors would like to thank Vegard Antun for precious coding support and related advice, and Christian Bayer for useful discussions. Part of the project has been carried out during Silvia Lavagnini's three months visit at UCSB, funded by the Kristine Bonnevie travel stipend 2019 from the Faculty of Mathematics and Natural Sciences (University of Oslo).

∗ Department of Mathematics, University of Oslo, 0316 Blindern, Norway. Email: [email protected].
† Department of Statistics and Applied Probability, University of California, Santa Barbara, CA 93106, USA. Email: [email protected].
‡ Department of Mathematics, University of Oslo, 0316 Blindern, Norway. Email: [email protected].
For a state-dependent stochastic volatility operator one needs instead to resort to simulation schemes for stochastic partial differential equations, such as the Multilevel Monte Carlo method [3, 4].

Our fully parametrized model allows for pricing and calibration based on deep neural networks. We therefore adapt the approach presented in [7] to our setting, and approximate the pricing functional with a neural network in a (possibly) costly off-line step. After training, the neural network is used to recover the model parameters in an optimization routine. The calibrated parameters can then be used together with the trained neural network for pricing options that are not traded on the exchange. Resorting to neural networks for both the calibration and the pricing step offers critical computational advantages for models requiring expensive techniques, such as Monte Carlo methods. In this neural network approach, the computationally expensive simulation is only required to generate the training set in the learning process of the network, and is not needed for intraday calibration and pricing.

We perform a comprehensive case study of our framework and analyse the accuracy of neural networks for pricing and calibration in the infinite-dimensional HJM setting. To avoid a training set based on large-scale Monte Carlo simulation, which introduces additional sources of error, we restrict our focus to a deterministic but time-dependent volatility operator. To our knowledge, this is the first application of deep neural networks in an infinite-dimensional HJM setup for the purpose of model calibration. We consider two different approaches, both presented in [29], the pointwise and the grid-based learning approach, and we compare dense and convolutional neural network architectures.
We then extend the framework to allow for calibration in markets with a wide bid-ask spread, where using a mid price is not feasible. The problem of wide bid-ask spreads is particularly pronounced in energy markets, since only the front-end swaps are traded liquidly.

In the approximation step, we observe a high degree of accuracy, with small average relative errors on the test set. The picture is different when it comes to calibration. Here the trained neural network might fail to recover the true parameters, and we do indeed observe large average relative errors in some cases. On the other hand, the prices estimated after calibration remain accurate on average. This failure in recovering the parameters is the result of two effects, one specific to the model and one to the network. In the specified model, several parameter vectors lead to similar prices for the training set, making it difficult to recover the true parameters. However, we also show that the trained neural network can be non-injective in the input parameters to a degree not justified by the original model. For instance, keeping all parameters but the volatility scaling fixed, the call price should be an increasing function of volatility. This is in fact not always the case. Given that no-arbitrage conditions are not imposed on the neural network, this is not surprising, but it contributes to the calibration problem. It may cause the original meaning of the parameters to get lost in the approximation step, and shows that careful benchmarking is required when using the neural network approach for calibration, in particular for pricing more complicated options.

For calibration in market environments with a large bid-ask spread, the simple loss function proposed here seems to work well.
We test it with respect to different bid-ask spread sizes, and in fact, after calibration, almost all prices lie within the bid-ask range; the few that fall outside remain very close to either the bid or the ask price. In particular, whenever the bid-ask spread becomes narrower, one can simply increase the number of iterations of the optimization routine to obtain good results. The optimization is fast because it does not require any simulation. Moreover, the observed errors in recovering the parameters do not increase dramatically compared to the calibration based on a zero bid-ask spread.

The rest of the article is organized as follows. In Section 1.1 we give some background and motivations for the model considered here. In Section 2 we define the HJM forward curve dynamics and our volatility operators, and we specify a deterministic version. In Section 3 we introduce forward contracts with delivery period and options written on them, and we derive the pricing formulas used in our case study. In Section 4 we define neural networks and introduce the two-step approach for model calibration, together with the newly proposed bid-ask loss function. Finally, in Section 5 we specify the parametric setting for the experiments, and in Section 6 we show our findings. Appendix A contains the proofs of all results.

In energy markets, forwards and futures on power and gas deliver the underlying commodity over a period of time, rather than at a fixed delivery time. While one usually derives the arbitrage-free forward price in a model-free way from the buy-and-hold strategy in the underlying, in energy markets this strategy cannot be applied because storing electricity is costly. This implies that the forward price as derived from the spot price model is not backed by a replication strategy. Instead, what is often adopted as an alternative in energy markets is the direct modelling of the tradable forward price.
This is referred to as the Heath-Jarrow-Morton (HJM) approach, as it was first introduced by [25] for interest rate markets. Later this idea was transferred to other markets: [15] and [30], for example, model the whole call option price surface using the HJM methodology, while [16] and [9] transferred the approach to commodity forward markets, the latter in particular in the context of power markets.

Another important characteristic of energy forward markets is the high degree of idiosyncratic risk across different maturities, which has been observed in several studies, such as [1, 31, 22]. In [10], for example, the authors performed a Principal Component Analysis on the Nord Pool AS forward contracts, revealing that more than ten factors are needed to explain the volatility. This points out the necessity of modelling the time dynamics of the forward curve by a high-dimensional, possibly infinite-dimensional, noise process. In [12] the authors show that a reasonable state space for the forward curve is the so-called Filipović space, which is a separable Hilbert space first introduced by [21]. One can indeed realize energy forward prices as linear operators on this space, as done in [11] and [12].

When considering stochastic volatility operators, as for example in our state-dependent model, it is in general not possible to have closed price formulas for options written on the forward curve. Hence one has to resort to time-consuming numerical methods, such as Monte Carlo techniques for SPDEs (see [3, 4]) or Finite Element methods (see [2, 33]). In particular, such costly pricing procedures render calibration almost impossible, as the pricing function has to be evaluated in each step of the optimization routine. As a result, more accurate models are often not used in practice.

To overcome the computational burden, machine learning may be useful.
Machine learning is in fact replacing standard techniques in many different fields within scientific computing and, in particular, within finance. In [20], for example, machine learning is used for the evaluation of derivatives, and in [14] for the problem of hedging a portfolio of derivatives with market frictions; in [19] and [32] machine learning tools are employed to learn the volatility surface and the term structure of forward crude oil prices, respectively; [38] and [26] propose approaches for solving Backward Stochastic Differential Equations in high dimension with deep neural networks. In the context of model calibration, a first application is proposed in [27], who calibrates stochastic models by training a neural network which, given market price observations, returns directly the optimal model parameters. Recently, [17] suggested to employ tools from generative adversarial networks to solve the calibration problem in the context of local stochastic volatility models.

Yet another strategy, which is also employed here, can be found in [6, 29, 7], where the authors introduce the following two-step approach. They propose to approximate the implied volatility surface, respectively the price functional, by a neural network in an off-line training step. Then they use the trained neural network to calibrate the stochastic model to fit market observations. The training step is a priori computationally demanding, especially if the underlying stochastic model is complex, such as a stochastic volatility model, since it requires many costly simulations. But even if generating the artificial data as well as training the neural network are expensive tasks, the advantage is that they are performed one-off. To ensure that the model reflects the current market situation, it is sufficient to run the calibration step regularly, say daily or even intra-daily. This step requires only evaluation of the neural network and is therefore fast.
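Schematically, the two-step approach can be sketched in code. The toy pricer, the least-squares surrogate standing in for the neural network, and all parameter values below are hypothetical; the point is only the split between an expensive off-line fit and a fast, simulation-free calibration:

```python
import numpy as np

# Toy sketch of the two-step approach; toy_price stands in for an expensive
# pricing routine and a least-squares fit stands in for the neural network.
rng = np.random.default_rng(0)

def toy_price(theta):
    # Hypothetical "true" price as a function of a 2-d parameter vector theta.
    return np.sin(theta[..., 0]) + theta[..., 1] ** 2

# Step 1 (off-line): fit a surrogate on simulated (theta, price) pairs.
thetas = rng.uniform(0.0, 1.0, size=(500, 2))
prices = toy_price(thetas)
A = np.column_stack([np.ones(500), thetas[:, 0], thetas[:, 1],
                     thetas[:, 0] ** 2, thetas[:, 1] ** 2])
coef, *_ = np.linalg.lstsq(A, prices, rcond=None)

def surrogate(theta):
    x, y = theta
    return coef @ np.array([1.0, x, y, x * x, y * y])

# Step 2 (on-line): calibrate theta to an observed market price using only
# the surrogate -- no further simulation is needed.
market_price = toy_price(np.array([0.3, 0.7]))
grid = np.stack(np.meshgrid(np.linspace(0, 1, 101), np.linspace(0, 1, 101)),
                axis=-1).reshape(-1, 2)
errors = np.array([(surrogate(th) - market_price) ** 2 for th in grid])
theta_hat = grid[errors.argmin()]
```

Note that several parameter vectors produce almost the same price, so `theta_hat` may recover the price accurately while differing from the true parameters; this is precisely the non-injectivity issue raised above.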
Let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \geq 0}, \mathbb{Q})$ be a filtered probability space, with $\mathbb{Q}$ the risk-neutral probability. We shall work directly under risk-neutrality. We consider the Filipović space on $\mathbb{R}_+$, denoted by $H_\alpha := H_\alpha(\mathbb{R}_+)$: for a given continuous and non-decreasing function $\alpha : \mathbb{R}_+ \to [1, \infty)$ with $\alpha(0) = 1$, this is the Hilbert space of all absolutely continuous functions $f : \mathbb{R}_+ \to \mathbb{R}$ for which
$$\int_{\mathbb{R}_+} f'(x)^2\, \alpha(x)\, dx < \infty.$$
It turns out that $H_\alpha$ is a separable Hilbert space with inner product defined by
$$\langle f_1, f_2 \rangle_\alpha := f_1(0) f_2(0) + \int_{\mathbb{R}_+} f_1'(x) f_2'(x)\, \alpha(x)\, dx,$$
for $f_1, f_2 \in H_\alpha$, and norm $\|f\|_\alpha := \langle f, f \rangle_\alpha^{1/2}$. We assume $\int_{\mathbb{R}_+} \alpha^{-1}(x)\, dx < \infty$. A typical example is $\alpha(x) = e^{\alpha x}$, for $\alpha > 0$. We refer to [21] for more properties of $H_\alpha$. In [12, Section 2] it is shown that the Filipović space is an appropriate state space for the forward curve in energy markets. This motivates our choice of $H_\alpha$ as the space for the forward curves.

For a Borel set $\mathcal{O} \subseteq \mathbb{R}$, we define $H := L^2(\mathcal{O})$ as the Hilbert space of all square integrable functions on $\mathcal{O}$. We further denote by $\lambda_\mathcal{O}$ the Lebesgue measure induced on $\mathcal{O}$, and by $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$ respectively the norm and scalar product on $H$. We assume that $\mathcal{O}$ is such that $H$ is a separable Hilbert space, and we shall consider $H$ as the noise space, as, e.g., in [5, 12, 13]. For a square integrable $H$-valued random variable $X$ (i.e. $\mathbb{E}[\|X\|^2] < \infty$), a linear operator $\mathcal{Q} \in L(H, H)$ is the unique covariance operator of $X$ if $\mathbb{E}[\langle X, h_1 \rangle \langle X, h_2 \rangle] = \langle \mathcal{Q} h_1, h_2 \rangle$ for any $h_1, h_2 \in H$.
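For concreteness, the inner product defining $H_\alpha$ can be evaluated numerically. The sketch below checks the squared norm of $f(x) = e^{-x}$ for $\alpha(x) = e^{\alpha x}$ against its analytic value; the truncation of $\mathbb{R}_+$ at $x = 40$ and the test function are illustrative choices, not part of the paper's setup:

```python
import numpy as np

# Numerical evaluation of the Filipovic inner product
# <f1, f2>_alpha = f1(0) f2(0) + int_{R_+} f1'(x) f2'(x) alpha(x) dx
# for alpha(x) = exp(alpha * x); grid truncation is an assumption.
alpha = 0.3
x = np.linspace(0.0, 40.0, 200_001)

def inner_alpha(f1, df1, f2, df2):
    integrand = df1(x) * df2(x) * np.exp(alpha * x)
    w = np.diff(x)
    integral = np.sum(0.5 * (integrand[:-1] + integrand[1:]) * w)  # trapezoid rule
    return f1(0.0) * f2(0.0) + integral

# Example: f(x) = exp(-x) lies in H_alpha since alpha < 2.
f = lambda y: np.exp(-y)
df = lambda y: -np.exp(-y)
norm_sq = inner_alpha(f, df, f, df)  # analytic value: 1 + 1/(2 - alpha)
```

The analytic comparison follows from $f(0)^2 = 1$ and $\int_0^\infty e^{-2x} e^{\alpha x}\, dx = 1/(2 - \alpha)$.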
We then define an $H$-valued Wiener process $W := \{W(t)\}_{t \geq 0}$ with zero mean and covariance operator $\mathcal{Q}$ as the $H$-valued stochastic process with continuous trajectories, stationary increments with law $W(t) - W(s) \sim \mathcal{N}(0, (t - s)\mathcal{Q})$, and $W(0) = 0$; see [35] or [18].

Let now $f(t, T)$ be the forward price in the commodity market at time $t \leq T$ for a contract with time of delivery $T$. Following the HJM approach in the Musiela parametrization, for $x := T - t$, the time to delivery, we define $g(t, x) := f(t, t + x)$ following the dynamics
$$dg(t, x) = (\partial_x g(t, x) + \beta(t, x))\, dt + \sigma(t, x)\, dW(t, x), \qquad (2.1)$$
$\partial_x$ being the generator of the semigroup $\{\mathcal{U}_t\}_{t \geq 0}$ given by $\mathcal{U}_t g(x) = g(t + x)$, for any $t, x \in \mathbb{R}_+$ and $g \in H_\alpha$. Here $\beta(t, \cdot) \in H_\alpha$ and $\sigma(t, \cdot) \in L(H, H_\alpha)$ is a linear and bounded operator from $H$ to $H_\alpha$. In order for the model (2.1) to be under the risk-neutral measure directly, we must impose the condition $\beta \equiv 0$ (see [11, Section 3.2]). We also want to rewrite equation (2.1) as a model for the entire forward curve $(g(t, x))_{x \geq 0}$ for any time $t \geq 0$, and to allow for the volatility function $\sigma$ to be stochastic itself. Without introducing any other external noise source, for instance with a second dynamics for the volatility, we consider $\sigma$ to be state-dependent, namely depending on the current level of the forward curve $g$. We finally obtain the following stochastic partial differential equation
$$dg_t = \partial_x g_t\, dt + \sigma_t(g_t)\, dW_t, \qquad (2.2)$$
where $g_t := g(t, \cdot)$, $\sigma_t(g_t) := \sigma(t, g_t)$ and $W_t := W(t, \cdot)$. The following Theorem states conditions ensuring that a mild solution to equation (2.2) exists.

Theorem 2.1.
Let us assume that the mapping $\sigma : \mathbb{R}_+ \times H_\alpha \to L(H, H_\alpha)$, $(t, g_t) \mapsto \sigma(t, g_t) = \sigma_t(g_t)$, is measurable and that there exists an increasing function $C : \mathbb{R}_+ \to \mathbb{R}_+$ such that for all $f_1, f_2 \in H_\alpha$ and $t \in \mathbb{R}_+$ we have
$$\|\sigma_t(f_1) - \sigma_t(f_2)\|_{L(H, H_\alpha)} \leq C(t)\, \|f_1 - f_2\|_\alpha,$$
$$\|\sigma_t(f_1)\|_{L(H, H_\alpha)} \leq C(t)\, (1 + \|f_1\|_\alpha).$$
Then for every $s \geq t$ there exists a unique mild solution to equation (2.2) of the form
$$g_s = \mathcal{U}_{s-t}\, g_t + \int_t^s \mathcal{U}_{s-u}\, \sigma_u(g_u)\, dW_u,$$
$g_t = g(t, \cdot) \in H_\alpha$ being the initial condition.

Proof. We refer to [37].

By means of Theorem 2.1 we have the solution to the SPDE in equation (2.2) if the volatility operator satisfies the Lipschitz and linear growth conditions. Let us point out that in equation (2.2) the noise, namely the Wiener process $W$, is an element of $H$, while the forward curve $g_t$ belongs to the space $H_\alpha$. This means that the volatility operator $\sigma_t(g_t)$ must turn elements of $H$ into elements of $H_\alpha$. We shall study the volatility operator in the next Section.

We focus on possible specifications for the volatility operator $\sigma : \mathbb{R}_+ \times H_\alpha \to L(H, H_\alpha)$. In particular, we need conditions ensuring that $\sigma_t(f) \in L(H, H_\alpha)$ for every $f \in H_\alpha$. The volatility operator must turn elements of $H$ into elements of $H_\alpha$. It thus has to smoothen the noise, and one way to achieve this is by integrating it against a suitably chosen kernel function. Additionally, $\sigma_t(f)$ has to fulfil the Lipschitz and growth conditions required for Theorem 2.1. We start with the following conditions on the kernel function.

Theorem 2.2.
For $t \geq 0$, let $\kappa_t : \mathbb{R}_+ \times \mathcal{O} \times H_\alpha \to \mathbb{R}_+$ be a kernel function satisfying the following assumptions:

1. The map $\kappa_t(x, \cdot, f) \in H$ for every $x \in \mathbb{R}_+$, $f \in H_\alpha$.
2. For every $x \in \mathbb{R}_+$, $f \in H_\alpha$, the derivative $\frac{\partial \kappa_t(x, y, f)}{\partial x}$ exists for $\lambda_\mathcal{O}$-almost all $y \in \mathcal{O}$. Moreover, there exists a neighbourhood $I_x$ of $x$ and a function $\bar\kappa_x \in H$ such that $\left| \frac{\partial \kappa_t(x, y, f)}{\partial x} \right| \leq \bar\kappa_x(y)$ for $\lambda_\mathcal{O}$-almost all $y$ on $I_x$.
3. $\int_{\mathbb{R}_+} \left\| \frac{\partial \kappa_t(x, \cdot, f)}{\partial x} \right\|^2 \alpha(x)\, dx < \infty$.

Then
$$\sigma_t(f) : H \to H_\alpha, \quad h \mapsto \sigma_t(f) h := \int_\mathcal{O} \kappa_t(\cdot, y, f)\, h(y)\, dy \qquad (2.3)$$
is a linear and bounded operator from $H$ to $H_\alpha$, namely $\sigma_t(f) \in L(H, H_\alpha)$. In particular, for every $x \in \mathbb{R}_+$, we can also write the equality $\sigma_t(f) h(x) = \langle \kappa_t(x, \cdot, f), h \rangle$.

Proof.
See Appendix A.1.

Given the operator $\sigma_t(f)$ in equation (2.3), it will be necessary to find the corresponding adjoint operator, namely the operator $\sigma_t(f)^* \in L(H_\alpha, H)$ that for every $\tilde f \in H_\alpha$ and every $h \in H$ satisfies $\langle \sigma_t(f) h, \tilde f \rangle_\alpha = \langle h, \sigma_t(f)^* \tilde f \rangle$. By [36, Theorem 6.1], any operator $\sigma_t(f)^*$ satisfying this equality is automatically bounded and is thus the unique adjoint operator of $\sigma_t(f)$.

Theorem 2.3.
With the assumptions of Theorem 2.2, the adjoint operator $\sigma_t(f)^*$ is given by
$$\sigma_t(f)^* : H_\alpha \to H, \quad \tilde f \mapsto \sigma_t(f)^* \tilde f := \kappa_t(0, \cdot, f)\, \tilde f(0) + \int_{\mathbb{R}_+} \frac{\partial \kappa_t(x, \cdot, f)}{\partial x}\, \tilde f'(x)\, \alpha(x)\, dx.$$
In particular, for every $y \in \mathbb{R}_+$, we can also write the equality $\sigma_t(f)^* \tilde f(y) = \langle \kappa_t(\cdot, y, f), \tilde f \rangle_\alpha$.

Proof.
See Appendix A.2.

In Theorem 2.2 we considered assumptions that the kernel $\kappa_t$ must satisfy to ensure that $\sigma_t(f) \in L(H, H_\alpha)$ for every $f \in H_\alpha$. We shall now look at conditions which also ensure that $\sigma_t(f)$ fulfils the Lipschitz continuity and linear growth assumptions of Theorem 2.1.

Theorem 2.4.
Let $\kappa_t : \mathbb{R}_+ \times \mathcal{O} \times H_\alpha \to \mathbb{R}_+$ be a kernel function satisfying the assumptions of Theorem 2.2. If there exists an increasing function $C : \mathbb{R}_+ \to \mathbb{R}_+$ such that, for every $f_1, f_2 \in H_\alpha$, it holds:

1. $\|\kappa_t(0, \cdot, f_1) - \kappa_t(0, \cdot, f_2)\| \leq C(t)\, |f_1(0) - f_2(0)|$,
2. $\left\| \frac{\partial \kappa_t(x, \cdot, f_1)}{\partial x} - \frac{\partial \kappa_t(x, \cdot, f_2)}{\partial x} \right\| \leq C(t)\, |f_1'(x) - f_2'(x)|$,
3. $\|\kappa_t(0, \cdot, f_1)\| \leq C(t)\, (1 + |f_1(0)|)$,
4. $\left\| \frac{\partial \kappa_t(x, \cdot, f_1)}{\partial x} \right\| \leq C(t)\, |f_1'(x)|$,

then $\sigma_t$ defined in equation (2.3) satisfies the Lipschitz and growth conditions of Theorem 2.1.

Proof. See Appendix A.3.

If the kernel function $\kappa_t$ satisfies the assumptions of Theorem 2.2 and Theorem 2.4, then there exists a mild solution to equation (2.2), which models the dynamics of the forward curve.

2.2 A deterministic specification

In the previous Section we defined the volatility as an integral operator with respect to a kernel function $\kappa_t$. We now specify $\kappa_t$ in order to reflect some properties that we believe to be crucial for the volatility operator. For instance, we shall include time dependency to account for seasonality effects. This was observed, for example, in electricity markets by [9], when analysing the volatility structure in a log-normal model for the forward curve. The same authors also model a maturity effect, namely a monotone decay in the volatility as the time to maturity increases, also known as the Samuelson effect. This can be easily achieved with some decay function. Finally, we want contracts with a certain maturity to be mainly influenced by the randomness of the noise in a neighbourhood of that maturity.

We shall restrict to a deterministic, time-dependent diffusion term.
We therefore drop the state-dependency and define the kernel function $\kappa_t$ as the product of two parts, one incorporating the seasonal and the Samuelson effect, and a second one smoothing the noise in a neighbourhood of the time to maturity, namely
$$\kappa_t(x, y) := a(t)\, e^{-bx}\, \omega(x - y), \qquad (2.4)$$
$$a(t) := a_0 + \sum_{j=1}^J \left( s_j \sin(2\pi j t) - c_j \cos(2\pi j t) \right), \qquad (2.5)$$
where $\omega : \mathbb{R} \to \mathbb{R}_+$ is a continuous weight function, the term $e^{-bx}$, $b \geq 0$, captures the Samuelson effect, and $a(t)$ is the seasonal function defined for $a_0 \geq 0$, $s_j$ and $c_j$ real constants, and $t$ measured in years.

With the following Proposition, we state assumptions on the weight function $\omega$ ensuring that $\kappa_t(x, y)$ as defined above fulfils the assumptions of Theorem 2.2 for every $t \geq 0$.

Proposition 2.5.
Let $\omega : \mathbb{R} \to \mathbb{R}_+$ be such that:

1. For every $x \in \mathbb{R}_+$, $\omega(x - \cdot)|_\mathcal{O} \in H$.
2. The derivative $\omega'(x)$ exists for almost all $x \in \mathbb{R}$ and, whenever it exists, there exist a neighbourhood $I_x$ of $x$ and a function $\bar\omega_x \in H$ with $\|\bar\omega_x\| \leq C$ for some $C$ independent of $x$, such that $|(\omega'(x - y) - b\,\omega(x - y))|_\mathcal{O}| \leq \bar\omega_x(y)$ on $I_x$.

Let further $\int_{\mathbb{R}_+} e^{-2bx} \alpha(x)\, dx < \infty$. Then for every $t \geq 0$ the volatility operator $\sigma_t$ given by
$$\sigma_t : H \to H_\alpha, \quad h \mapsto \sigma_t h := \int_\mathcal{O} \kappa_t(\cdot, y)\, h(y)\, dy$$
is well defined and satisfies the Lipschitz and linear growth conditions of Theorem 2.1.

Proof. See Appendix A.4.

For the function $\alpha(x) = e^{\alpha x}$, for instance, the integrability assumption of Proposition 2.5 is satisfied if $0 < \alpha < 2b$.

We consider now energy forward contracts with a delivery period, which we refer to as swaps, in order not to confuse them with the contracts discussed in the previous Section. For $0 \leq t \leq T_1 \leq T_2$, we denote by $F(t, T_1, T_2)$ the price at time $t$ of a swap contract on energy delivering over the interval $[T_1, T_2]$. From [10, Proposition 4.1], this price can be expressed by
$$F(t, T_1, T_2) = \int_{T_1}^{T_2} w(T; T_1, T_2)\, f(t, T)\, dT, \qquad (3.1)$$
$f(t, T)$ being the forward curve introduced above, and $w(T; T_1, T_2)$ a deterministic weight function. Focusing on forward-style swaps in the electricity markets as traded, for example, at Nord Pool AS and the European Energy Exchange (EEX), the weight function takes the form
$$w(T; T_1, T_2) = \frac{1}{T_2 - T_1}. \qquad (3.2)$$
According to [12], we introduce the Musiela representation for $F(t, T_1, T_2)$. For $x := T_1 - t$ the time until start of delivery, and $\ell := T_2 - T_1 > 0$ the length of delivery of the swap, we define the weight function $w_\ell(t, x, y) := w(t + y;\, t + x,\, t + x + \ell)$.
Motivated by practical examples (see [12, Section 2]), we shall consider only time-independent and stationary weight functions. With abuse of notation, let then $w_\ell : \mathbb{R}_+ \to \mathbb{R}_+$ be bounded and measurable, such that
$$G^{w_\ell}(t, x) := F(t, t + x, t + x + \ell) = \int_x^{x + \ell} w_\ell(y - x)\, g_t(y)\, dy,$$
for $g_t$ in equation (2.2). For $w(T; T_1, T_2)$ as in equation (3.2), one simply gets $w_\ell(y - x) = \frac{1}{\ell}$. Following [11, Section 4], for any $g_t \in H_\alpha$, we represent $G^{w_\ell}$ as the following linear operator:
$$\mathcal{D}^{w_\ell}(g_t) := W_\ell(\ell)\, \mathrm{Id}(g_t) + \mathcal{I}^{w_\ell}(g_t), \qquad (3.3)$$
namely, $G^{w_\ell}(t, \cdot) = \mathcal{D}^{w_\ell}(g_t)(\cdot)$. In equation (3.3), $\mathrm{Id}$ is the identity operator, while
$$W_\ell(u) := \int_0^u w_\ell(v)\, dv, \quad u \geq 0, \qquad (3.4)$$
$$\mathcal{I}^{w_\ell}(g_t)(\cdot) := \int_0^\infty q^{w_\ell}(\cdot, y)\, g_t'(y)\, dy, \qquad (3.5)$$
$$q^{w_\ell}(x, y) := \left( W_\ell(\ell) - W_\ell(y - x) \right) \mathbb{I}_{[0, \ell]}(y - x). \qquad (3.6)$$
For a swap contract of forward type with the weight function in equation (3.2), it turns out that the operator $\mathcal{D}^{w_\ell}$ has a simpler representation, as provided in the following Lemma.

Lemma 3.1.
For a forward-style swap contract, the operator $\mathcal{D}^{w_\ell}$ can be represented as
$$\mathcal{D}^{w_\ell}(g_t)(x) = \int_0^\infty d_\ell(x, y)\, g_t(y)\, dy,$$
where $d_\ell : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$, $d_\ell(x, y) := \frac{1}{\ell}\, \mathbb{I}_{[x, x + \ell]}(y)$, is called the delivery period function.

Proof. See Appendix A.5.

With the notation just introduced, thanks to [12, Lemma 3.3], for every $0 \leq t \leq \tau \leq T_1$ the price of the swap contract at time $\tau$ with delivery over the interval $[T_1, T_2]$ can be expressed as a linear functional acting on the forward curve $g_t$:
$$F(\tau, T_1, T_2) = \delta_{T_1 - t}\, \mathcal{D}^{w_\ell} g_t + \int_t^\tau \delta_{T_1 - s}\, \mathcal{D}^{w_\ell}\, \sigma_s(g_s)\, dW_s,$$
for $\ell = T_2 - T_1$. Further, [11, Theorem 2.1] allows us to derive a real-valued stochastic process for the swap dynamics. More precisely, for every $0 \leq t \leq \tau \leq T_1$, we can write
$$F(\tau, T_1, T_2) = \delta_{T_1 - t}\, \mathcal{D}^{w_\ell} g_t + \int_t^\tau \Sigma(s)\, dW(s), \qquad (3.7)$$
$$\Sigma^2(s) := \left( \delta_{T_1 - s}\, \mathcal{D}^{w_\ell}\, \sigma_s(g_s)\, \mathcal{Q}\, \sigma_s(g_s)^*\, (\delta_{T_1 - s}\, \mathcal{D}^{w_\ell})^* \right)(1), \qquad (3.8)$$
where $W$ is a standard Wiener process with values in $\mathbb{R}$. In particular, equation (3.7) tells us that the swap price $F(\tau, T_1, T_2)$, which is driven by the $H$-valued Wiener process $W$ with covariance operator $\mathcal{Q}$ and volatility operator $\sigma_t$, can in fact be represented as driven by a one-dimensional Wiener process with the diffusion term given in equation (3.8).

For a swap contract of forward type, considering the covariance operator $\mathcal{Q}$ to be an integral operator of the form
$$\mathcal{Q} h(x) = \langle h, q(x, \cdot) \rangle = \int_\mathcal{O} q(x, y)\, h(y)\, dy, \quad h \in H,$$
with kernel $q(x, y)$ such that $\mathcal{Q}$ is well defined, and given the volatility operator in equation (2.3), we can rewrite the univariate volatility in equation (3.8) more explicitly.

Proposition 3.2.
For a swap contract of forward type with the weight function in equation (3.2), the volatility $\Sigma^2(s)$, $t \leq s \leq \tau$, defined in equation (3.8) is given by the fourfold integral
$$\Sigma^2(s) = \int_{\mathbb{R}_+} \int_{\mathbb{R}_+} \int_\mathcal{O} \int_\mathcal{O} d_\ell(T_1 - s, u)\, d_\ell(T_1 - s, v)\, \kappa_s(v, z, g_s)\, q(z, y)\, \kappa_s(u, y, g_s)\, dy\, dz\, du\, dv,$$
where $d_\ell$ is the delivery period function defined in Lemma 3.1.

Proof. See Appendix A.6.

In the deterministic setting introduced in Section 2.2, the univariate volatility formula of Proposition 3.2 can be further simplified.
Corollary 3.3.
With a volatility kernel factorized as in equation (2.4), the formula for $\Sigma^2(s)$, $t \leq s \leq \tau$, in Proposition 3.2 becomes
$$\Sigma^2(s) = a^2(s) \int_{\mathbb{R}_+} \int_{\mathbb{R}_+} \int_\mathcal{O} \int_\mathcal{O} e^{-bu}\, e^{-bv}\, d_\ell(T_1 - s, u)\, d_\ell(T_1 - s, v)\, \omega(v - z)\, q(z, y)\, \omega(u - y)\, dy\, dz\, du\, dv.$$

Proof.
This is a direct consequence of Proposition 3.2.
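Once $\omega$ and the covariance kernel $q$ are specified, the fourfold integral above can be approximated by simple Riemann sums. The sketch below uses a Gaussian weight, an exponential covariance kernel, a truncation of $\mathcal{O}$, and parameter values that are all purely illustrative (not the paper's specification), together with the delivery period function of Lemma 3.1:

```python
import numpy as np

# Riemann-sum approximation of the fourfold integral in Corollary 3.3.
# omega, q, the truncation of O and all parameter values are assumptions.
a_s, b, ell, T1, s_time = 1.0, 0.5, 0.25, 1.0, 0.0

def d_ell(x, y):
    """Delivery period function of Lemma 3.1: (1/ell) * 1_{[x, x+ell]}(y)."""
    return np.where((y >= x) & (y <= x + ell), 1.0 / ell, 0.0)

def omega(u):
    return np.exp(-u ** 2 / (2 * 0.25 ** 2))  # hypothetical weight function

def q(z, y):
    return np.exp(-np.abs(z - y))             # hypothetical covariance kernel

x0 = T1 - s_time                    # time until start of delivery, T1 - s
u = np.linspace(x0, x0 + ell, 21)   # d_ell restricts u, v to [x0, x0 + ell]
y = np.linspace(0.0, 3.0, 31)       # truncation of O (assumption)
du, dy = u[1] - u[0], y[1] - y[0]

U, V, Z, Y = np.meshgrid(u, u, y, y, indexing="ij")
integrand = (np.exp(-b * U) * np.exp(-b * V)
             * d_ell(x0, U) * d_ell(x0, V)
             * omega(V - Z) * q(Z, Y) * omega(U - Y))
sigma2 = a_s ** 2 * integrand.sum() * du * du * dy * dy
```

The indicator in `d_ell` confines the outer two integrations to the delivery window, which is why the `u` grid only needs to cover $[T_1 - s,\, T_1 - s + \ell]$.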
We focus on European-style options written on energy swap contracts. These kinds of derivatives are traded, for example, at Nord Pool AS. With the price of the swap at time $t$ being $F(t, T_1, T_2)$, we consider an option with payoff function $\pi : \mathbb{R} \to \mathbb{R}$ and exercise time $0 \leq \tau \leq T_1$. Classical examples are standard call and put options with strike $K \geq 0$, for which the payoff function is defined by $\pi(x) = \max(x - K, 0)$, respectively $\pi(x) = \max(K - x, 0)$.

From equation (3.7), the price at time $0 \leq t \leq \tau$ of the option with payoff $\pi(F(\tau, T_1, T_2))$ at time $\tau$ is given by
$$\Pi(t) = e^{-r(\tau - t)}\, \mathbb{E}\left[ \left. \pi\left( \delta_{T_1 - t}\, \mathcal{D}^{w_\ell} g_t + \int_t^\tau \Sigma(s)\, dW(s) \right) \right| \mathcal{F}_t \right], \qquad (3.9)$$
for $r > 0$ the risk-free interest rate, here considered to be constant. Assuming $\pi$ to be measurable and of at most linear growth, from [12, Proposition 3.1] we get that for every $\ell > 0$ and for every $g_t \in H_\alpha$ such that $\mathbb{E}[\|g_t\|_\alpha^2] < \infty$, the expectation in (3.9) is well defined. The same holds, in particular, for the expectation of the payoff of an option on a forward with fixed time to delivery $T$, namely $\pi(f(\tau, T))$. We refer to [12] for more details.

We end the Section with a result from [12], which allows us to rewrite the price functional in equation (3.9) for a deterministic volatility operator, like the one introduced in Section 2.2.

Proposition 3.4.
For $\sigma_t$ deterministic, the price functional in equation (3.9) becomes
$$\Pi(t) = e^{-r(\tau - t)}\, \mathbb{E}\left[ \pi(\mu(g_t) + \xi X)\, |\, \mathcal{F}_t \right], \qquad (3.10)$$
$$\xi^2 := \int_t^\tau \Sigma^2(s)\, ds, \qquad (3.11)$$
$$\mu(g_t) := \delta_{T_1 - t}\, \mathcal{D}^{w_\ell} g_t, \qquad (3.12)$$
with $X$ a standard normally distributed random variable and $\Sigma^2$ as in equation (3.8).

Proof. We refer to [12, Proposition 3.7].

With Proposition 3.4 the price of an option on the forward curve can be calculated in closed form if the volatility operator is deterministic. To do that, by equations (3.11) and (3.12) one needs to specify the volatility operator, as well as the covariance operator and the initial forward curve. Before giving a full specification for those, we define in the next Section the two-step approach to calibrate the HJM model with neural networks.
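For a call payoff $\pi(x) = \max(x - K, 0)$, the Gaussian expectation in (3.10) evaluates by standard calculus to a Bachelier-type formula, $e^{-r(\tau - t)}\left[ (\mu - K)\Phi(d) + \xi\, \varphi(d) \right]$ with $d = (\mu - K)/\xi$, where $\Phi$ and $\varphi$ are the standard normal cdf and pdf. The sketch below verifies this closed form against Monte Carlo; all parameter values are illustrative:

```python
import numpy as np
from math import erf, exp, pi, sqrt

# Call price under Proposition 3.4: Pi = e^{-r(tau-t)} E[max(mu + xi*X - K, 0)]
# with X ~ N(0,1).  Bachelier-type closed form; parameter values are
# illustrative, not calibrated.
def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def norm_pdf(x):
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def call_price(mu, xi, K, r, tau):
    d = (mu - K) / xi
    return exp(-r * tau) * ((mu - K) * norm_cdf(d) + xi * norm_pdf(d))

mu, xi, K, r, tau = 30.0, 2.0, 29.0, 0.01, 0.5
price = call_price(mu, xi, K, r, tau)

# Monte Carlo check of the same Gaussian expectation.
rng = np.random.default_rng(1)
X = rng.standard_normal(400_000)
mc = exp(-r * tau) * np.maximum(mu + xi * X - K, 0.0).mean()
```

Since the swap price at exercise is normally (not log-normally) distributed here, the Bachelier form, rather than Black's formula, is the natural closed-form benchmark.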
For the purpose of calibration, we shall specify a fully parametric model, depending on a parameter vector $\theta$ taking values in a set $\Theta \subset \mathbb{R}^n$. In the framework described in Section 2, $\theta$ is a vector of parameters defining the volatility operator, the covariance operator and the initial forward curve. Moreover, the option price function depends on some features of the contract, such as time to delivery, strike, etc. We denote the vector of these contract parameters by $\lambda \in \Lambda \subset \mathbb{R}^m$. Then the price function (3.9) is $\Pi(t) = \Pi(t; \lambda, \theta)$.

As the contract features $\lambda$ are given by the market, to get a fully specified price functional we need to calibrate the chosen model and determine the vector $\theta$ that best matches the observed prices of liquidly traded options. This gives us the best coefficient functions (in terms of calibration) for the HJM model defined in Section 2. We do that by the two-step approach with neural networks presented in [7]. In what follows, we first define feedforward neural networks, and then the calibration problem, together with the two-step approach.

We define an $L$-layer feedforward neural network as a function $\mathcal{N} : \mathbb{R}^d \to \mathbb{R}^p$ of the form
$$\mathcal{N}(x) := H_L(\rho(H_{L-1}(\rho(\dots \rho(H_1(x)))))), \qquad (4.1)$$
where each $H_i : \mathbb{R}^{n_{i-1}} \to \mathbb{R}^{n_i}$ is an affine map of the form $H_i(x) = V_i x + v_i$, for $V_i \in \mathbb{R}^{n_i \times n_{i-1}}$ and $v_i \in \mathbb{R}^{n_i}$. Here the $V_i$'s are referred to as the weights and the $v_i$'s as the biases. In particular, $n_0 = d$ equals the input dimension, and $n_L = p$ equals the output dimension. We call $L$ the depth of the network, and $n_i$ represents the number of nodes of the $i$-th layer. We set $n := (n_1, \dots, n_L)$. The map $\rho : \mathbb{R} \to \mathbb{R}$ is the so-called activation function, which is typically non-linear and applies component-wise to the output of the affine maps $H_i$. Typical choices are the Rectified Linear Unit (ReLU) given by $\rho(x) = \max(x, 0)$, the Exponential Linear Unit (ELU) given by $\rho(x) = x$ for $x \geq 0$ and $\rho(x) = e^x - 1$ for $x < 0$, and the Sigmoid $\rho(x) = (1 + e^{-x})^{-1}$.
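Definition (4.1) translates directly into code: a chain of affine maps with the activation applied between layers but not after the last one. The weights below are random placeholders rather than trained values, and ReLU is used as the activation:

```python
import numpy as np

# Forward pass of the L-layer network in (4.1): affine maps H_i with ReLU
# between layers and no activation after the last layer.  Weights are
# random placeholders (no training).
rng = np.random.default_rng(2)
widths = [3, 16, 16, 1]   # (n_0, n_1, n_2, n_3): input dim 3, output dim 1
params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(widths[:-1], widths[1:])]

def relu(x):
    return np.maximum(x, 0.0)

def network(x, params):
    for i, (V, v) in enumerate(params):
        x = V @ x + v                # affine map H_i(x) = V_i x + v_i
        if i < len(params) - 1:      # activation on all but the L-th layer
            x = relu(x)
    return x

out = network(np.array([0.5, -1.0, 2.0]), params)

# Parameter count: matches the cardinality M = sum_i n_i (n_{i-1} + 1).
M = sum(V.size + v.size for V, v in params)
```

For these widths, $M = 16 \cdot 4 + 16 \cdot 17 + 1 \cdot 17 = 353$, illustrating how quickly the number of trainable parameters grows with depth and width.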
We point out that in the definition (4.1), the activation function is the same between all the layers, and we do not apply any activation function in the L-th layer. This is important to allow for expressivity: with the sigmoid function in the last layer, for example, all output values would lie between 0 and 1. We shall denote by V the set of all parameters involved, namely V := {V_i, v_i}_{i=1}^L, and we use the notation N = N(x; V). The cardinality of V is then given by M := |V| = Σ_{i=1}^L n_i(n_{i−1} + 1). As we can see, the number of parameters involved in the neural network, namely the number of parameters which must be optimized, grows quite fast. A particular subclass of feedforward neural networks is given by convolutional networks. A convolution is a special kind of linear operation, which is represented by a sparse matrix. This is cost efficient, both in terms of computing time and required memory. We call a network convolutional when at least one of the V_i's is a convolution operator. We refer to dense networks when the weights V_i are full matrices instead. The architecture of the neural network, namely the depth L, the number of nodes per layer n = (n_1, . . . , n_{L−1}), and the activation function, are the hyper-parameters which must be chosen in accordance with each specific problem to solve.

The idea behind neural networks is to approximate a map ϕ : R^d → R^p starting from a set of input-output pairs {(x_i, ϕ(x_i))}_{i=1}^N of size N. This translates into an optimization problem, called the training of the neural network, which is solved by finding the best set of parameters V ∈ R^M so that the neural network output N(x; V) best approximates the observations ϕ(x), for x ∈ {x_i}_{i=1}^N, with respect to some loss function R to be chosen. A common loss function is the mean squared error defined by

R({x_i}_{i=1}^N, {y_i}_{i=1}^N) := (1/N) Σ_{i=1}^N (x_i − y_i)².
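As a toy illustration of training under the mean squared error, the following plain-NumPy sketch fits a one-hidden-layer ReLU network to ϕ(x) = x² by gradient descent; the architecture, step size and target function are arbitrary choices for the example, not those used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(256, 1))   # inputs x_i
Y = X ** 2                                  # observations phi(x_i) = x_i^2

# Parameters V = {V_1, v_1, V_2, v_2} of a one-hidden-layer ReLU network
V1 = 0.5 * rng.standard_normal((16, 1)); v1 = np.zeros(16)
V2 = 0.5 * rng.standard_normal((1, 16)); v2 = np.zeros(1)

def mse():
    H = np.maximum(X @ V1.T + v1, 0.0)
    return np.mean((H @ V2.T + v2 - Y) ** 2)

loss_before = mse()
lr = 0.05
for _ in range(2000):                       # plain gradient descent
    H = np.maximum(X @ V1.T + v1, 0.0)      # hidden activations
    P = H @ V2.T + v2                       # network output N(x_i; V)
    G = 2.0 * (P - Y) / len(X)              # gradient of the loss w.r.t. P
    GH = (G @ V2) * (H > 0)                 # backprop through the ReLU
    V2 -= lr * (G.T @ H); v2 -= lr * G.sum(0)
    V1 -= lr * (GH.T @ X); v1 -= lr * GH.sum(0)
loss_after = mse()
```

In practice one would of course use a framework such as TensorFlow with a stochastic optimizer like Adam, as done in the experiments below; this sketch only makes the optimization problem concrete.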
Training the neural network means finding a V̂ ∈ R^M that solves the optimization problem

minimize over V ∈ R^M:  (1/N) Σ_{i=1}^N (N(x_i; V) − ϕ(x_i))².

Given the optimal weights V̂, we denote by N̂(x) := N(x; V̂) the trained neural network. We point out that this is a non-convex optimization problem, and one typically only finds an approximation to a local solution. For more details on feedforward neural networks, activation functions and the training of the network, we refer the reader to [28, 23]. We focus on the option price
Π = Π(λ, θ), omitting the time dependency to simplify the notation, and we consider N option contracts with features {λ_i}_{i=1}^N ∈ Λ, whose prices can be observed in the market. This means we have a set of market observed prices {Π_i}_{i=1}^N, where Π_i is the price of the contract corresponding to λ_i. Calibrating the HJM model for the forward curve means finding the vector of model parameters θ̂ ∈ Θ which minimizes the distance between the prices observed in the market and the corresponding prices given by the stochastic model. This is done with respect to a certain cost functional. With the mean squared error functional, the calibration corresponds to finding θ̂ ∈ Θ which solves the optimization problem

minimize over θ ∈ Θ:  (1/N) Σ_{i=1}^N (Π(λ_i, θ) − Π_i)².  (4.2)

Often, it is not possible to obtain a closed formula for the price functional Π, due to the complexity of the underlying model. In these cases, the calibration problem in equation (4.2) must be slightly adjusted by substituting Π with Π̃, an approximated price functional obtained by, for example, Monte Carlo simulation. This makes the procedure costly. In particular, the more complex the underlying stochastic model is, the greater the time needed for simulation and hence for calibration. This has the consequence that more accurate models are often left out of practical applications because of their complexity in calibration. An alternative solution proposed in [7] is to divide the calibration problem in equation (4.2) into two steps. We start by approximating the price functional Π (or its approximation Π̃) with a neural network, N(λ, θ; V) ≈ Π(λ, θ). This is done by generating a training set consisting of (many) different vectors of parameters θ and option features λ together with the corresponding price values. The training set is then used to train the neural network. This step is computationally demanding for two reasons.
First of all, we need to generate a large training set of input-output data by (potentially) costly simulations. Moreover, the training of the neural network is an M = |V|-dimensional optimization problem, with M usually large, carried out over a large training set, and is therefore time consuming. However, the training has the advantage of being off-line, namely, it does not use any market information. Thus, it can potentially be run only once and does not require frequent updating when new market information arrives.

Once the neural network is trained, the second step is calibration. The advantage is that in equation (4.2) we replace the function Π (or Π̃) with the trained neural network N̂. Evaluation of the latter is fast and does not require any further simulation, thus performing a calibration becomes an easier and much faster task, also due to the fact that most machine learning frameworks have very efficient implementations. In [7], for example, the calibration of the full implied volatility surface in the rough Bergomi model is obtained within milliseconds. Price market data are used in the second step, meaning that the model should be re-calibrated every time new market data become available. However, with this fast implementation via neural networks, the results are obtained in a short time, and the model can be updated at low computational cost.

We now present two alternative versions of this procedure. In the first case, we train a neural network as a function of both λ and θ, with output a single real number, being the price of the option with features λ given by the model with parameters θ. In the second case, we require a lower-dimensional input: the neural network is trained as a function of only the model parameters θ. The output is then a (discrete) surface, namely a (potentially multi-dimensional) grid of prices corresponding to different contracts, that is, different λ's.
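The two-step idea can be sketched end-to-end on a toy problem. Below, a simple closed-form map stands in for the trained network N̂ (it is not the HJM model of the paper), and the calibration step recovers the parameter vector by gradient descent with finite-difference gradients:

```python
import numpy as np

# Toy surrogate standing in for the trained network \hat{N}: it maps a
# parameter vector theta = (a, b) to "prices" of three hypothetical
# contracts with maturities TAUS (all names here are illustrative).
TAUS = np.array([0.25, 0.5, 1.0])

def n_hat(theta):
    a, b = theta
    return a * np.exp(-b * TAUS)

theta_true = np.array([2.0, 0.7])
market = n_hat(theta_true)          # synthetic "market" prices

def loss(theta):
    # mean squared error between surrogate prices and market prices
    return np.mean((n_hat(theta) - market) ** 2)

# Calibration step: gradient descent with central finite differences
theta = np.array([1.5, 0.5])        # starting guess
eps, lr = 1e-6, 0.2
for _ in range(5000):
    grad = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                     for e in np.eye(2)])
    theta -= lr * grad
```

Since evaluating the surrogate is cheap, the calibration loop is fast regardless of how expensive the original pricing model would be to simulate, which is precisely the point of the two-step approach.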
Let (λ, θ) ∈ Λ × Θ ⊂ R^d, for d = n + m, such that Π = Π(λ, θ). In the pointwise learning approach, we approximate the pricing map Π (or Π̃) by a neural network that maps the vector (λ, θ) into prices. Given the training set {((λ_i, θ_i), Π(λ_i, θ_i))}_{i=1}^{N_train} for (λ_i, θ_i) ∈ Λ × Θ and N_train the size of the training set, we train a neural network N : Λ × Θ → R_+ by computing

V̂ ∈ argmin over V ∈ R^M of (1/N_train) Σ_{i=1}^{N_train} (N(λ_i, θ_i; V) − Π(λ_i, θ_i))².

Given the trained network N̂(λ, θ) = N(λ, θ; V̂), we use it for the calibration step with respect to the model parameters; that is, we look for an approximate solution θ̂ ∈ Θ to the following problem:

minimize over θ ∈ Θ:  (1/N_cal) Σ_{i=1}^{N_cal} (N̂(λ_i, θ) − Π_i)²,

where {Π_i}_{i=1}^{N_cal} are market observed prices with respect to some contract features {λ_i}_{i=1}^{N_cal}, and N_cal is the size of the calibration set.

The idea of the grid-based learning approach is to train a neural network which is a function only of the model parameters θ. In this case, the output is no longer a single price value, but a discrete grid of values corresponding to different option specifications λ, which are decided in the first step. These should reflect the options traded in the market. Let us suppose m = 2 and λ = (λ_1, λ_2). Then for m_1, m_2 ∈ N, we create a grid of values {(λ_1^j, λ_2^k)}_{j=1,k=1}^{m_1,m_2} and the training set {(θ_i, {Π(θ_i, (λ_1^j, λ_2^k))}_{j=1,k=1}^{m_1,m_2})}_{i=1}^{N_train}. We then train a neural network N : Θ → R_+^{m_1 × m_2} by solving the following optimization problem:

V̂ ∈ argmin over V ∈ R^M of (1/(N_train m_1 m_2)) Σ_{i=1}^{N_train} Σ_{j,k=1}^{m_1,m_2} (N(θ_i; V)_{j,k} − Π(θ_i, (λ_1^j, λ_2^k)))².
Once trained, with N̂(θ) = N(θ; V̂), we use the neural network in the calibration step to find the optimal model parameters θ̂ ∈ Θ for fitting the market observations {Π_{j,k}}_{j=1,k=1}^{m_1,m_2}:

minimize over θ ∈ Θ:  (1/(m_1 m_2)) Σ_{j,k=1}^{m_1,m_2} (N̂(θ)_{j,k} − Π_{j,k})².

The main difference of the grid-based approach compared with the pointwise one is that the neural network is trained to price only specific options, namely options related to the grid {(λ_1^j, λ_2^k)}_{j=1,k=1}^{m_1,m_2} defined in the approximation step. This might be seen as a weakness; however, every price for a contract not included in this grid can be obtained by interpolation (which is done automatically by the network in the pointwise approach). Moreover, the grid in the first step can be chosen as fine as wished. The advantage is that the dimension of the input vector is smaller, making it easier for the network to approximate the price functional.

In order-book based markets there is no single price, but so-called bid and ask prices corresponding to the cheapest sell and the most expensive buy order. Depending on market liquidity, the spread between bid and ask price (bid-ask spread) can be significant. Thus the calibration problem described in equation (4.2) breaks down, as we do not have exact prices to aim at in a mean squared loss sense. We need a new loss function taking the bid-ask spread into account. To penalize prices lying outside the bid-ask range, we introduce the loss R_bidask as

R_bidask({x_i}_{i=1}^N, {y_i^bid}_{i=1}^N, {y_i^ask}_{i=1}^N) := (1/N) Σ_{i=1}^N { (x_i − y_i^bid)² I{x_i < y_i^bid} + (x_i − y_i^ask)² I{x_i > y_i^ask} }.
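This bid-ask penalty is straightforward to implement; here is a minimal NumPy version (a sketch, with x the vector of model prices):

```python
import numpy as np

def bidask_loss(x, y_bid, y_ask):
    """Mean squared distance to the bid-ask interval: a price inside
    [y_bid_i, y_ask_i] contributes zero, a price outside is penalised
    quadratically by its distance to the violated boundary."""
    x, y_bid, y_ask = map(np.asarray, (x, y_bid, y_ask))
    below = np.where(x < y_bid, (x - y_bid) ** 2, 0.0)
    above = np.where(x > y_ask, (x - y_ask) ** 2, 0.0)
    return float(np.mean(below + above))
```

For example, with bids at 1.5 and asks at 3.0, the model prices (1.0, 2.0, 5.0) contribute (0.25, 0, 4), giving a loss of 4.25/3.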
We now fully specify the setting. Let w_α(x) = e^{αx}, x ∈ R_+, be the weight function for the Filipović space H_α. By Proposition 2.5 the real constant α must satisfy 0 < α < b. Moreover, we deal with swap contracts of forward type on energy with delivery over an interval [T_1, T_2]. Thus we take w(T; T_1, T_2) = 1/(T_2 − T_1) (see equation (3.2)) and, consequently, w_ℓ(y − x) = 1/ℓ. Then W_ℓ(ℓ) = 1 (see equation (3.4)) and q_{w_ℓ}(x, y) = ℓ^{−2}(x + ℓ − y) I_{[x,x+ℓ]}(y) (see equation (3.6)). Finally, we select European-style call options with payoff function π(x) = max(x − K, 0), for K > 0 the strike price. The ultimate goal would of course be to price all kinds of options written on forward contracts, with or without delivery periods. This is in fact possible within the framework, but beyond the scope of the paper.

In order to benchmark the two-step approach, as explained in Section 1, we focus our attention on deterministic volatility operators, such as the one introduced in Section 2.2, which will be further specified here. For σ_t deterministic, we have seen in Proposition 3.4 that the price Π(t) at time 0 ≤ t ≤ τ of the option with payoff π(F(τ, T_1, T_2)) at time 0 ≤ τ ≤ T_1 can be expressed in terms of a standard Gaussian random variable X, a variance ξ² = ∫_t^τ Σ²(s) ds and a drift µ(g_t) = δ_{T−t} D w_ℓ g_t. In the case of a European-style call option, Π(t) has a closed-form solution, by means of a type of Black-76 formula. This allows exact price values to be used for both the training and the calibration step, with the advantage of avoiding external sources of error, such as those resulting, for example, from a Monte Carlo simulation approach. As a direct consequence of Proposition 3.4, we find the price functional explicitly.

Proposition 5.1.
The price of a European-style call option with strike price K > 0 and maturity time τ ≤ T_1 is given by

Π(t) = e^{−r(τ−t)} { ξ φ((µ(g_t) − K)/ξ) + (µ(g_t) − K) Φ((µ(g_t) − K)/ξ) }, (5.1)

φ and Φ being, respectively, the density function and the cumulative distribution function of a standard Gaussian random variable.

Proof. The proof follows by direct calculation, starting from equation (3.10) for π(x) = max(x − K, 0), and using standard techniques for the expected value of a Gaussian random variable.

To compute the price in equation (5.1), we need the variance ξ² and the shift µ(g_t); hence we need to specify the volatility operator σ_t and the covariance operator Q. Last, we must define an appropriate initial forward curve g_t ∈ H_α. In view of the neural network approach, we need to define a fully parametric model, that is, a model fully specified by a vector of parameters θ.

We need to introduce a suitable covariance operator Q that depends only on a finite number of parameters. The analysis conducted in [13] reveals a covariance structure that is well approximated by an exponential function, i.e. Cov(W_t(x), W_t(y)) ≈ e^{−k|x−y|}. Despite the operation W_t(x) = δ_x(W_t) not being well defined on our space H, we can approximate it by the scalar product δ_x(W_t) ≈ ⟨η_x, W_t⟩, with some "bell-shaped" function η_x centred in x, such as a Gaussian density function. For every x, y ∈ O, denoting by c(x, y) the empirical covariance function between W_t(x) and W_t(y), we then get:

c(x, y) = E[W_t(x) W_t(y)] = E[δ_x(W_t) δ_y(W_t)] ≈ E[⟨η_x, W_t⟩⟨η_y, W_t⟩] = ⟨Q η_y, η_x⟩ ≈ Q η_y(x) = ∫_O e^{−k|x−z|} η_y(z) dz ≈ e^{−k|x−y|},

which shows that in H a covariance operator based on an exponential kernel indeed approximates the empirically observed covariance structure of the Wiener process W across different maturities.
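The closed-form price (5.1) above is immediate to implement; the sketch below uses only the Python standard library (the parameter values used in the usage note are purely illustrative):

```python
import math

def call_price(mu, xi, K, r=0.0, tau=1.0, t=0.0):
    """Bachelier-type formula (5.1): discounted expectation of
    max(mu + xi * X - K, 0) for X standard normal, with drift mu = mu(g_t)
    and standard deviation xi."""
    d = (mu - K) / xi
    phi = math.exp(-0.5 * d * d) / math.sqrt(2.0 * math.pi)   # Gaussian density
    Phi = 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))          # Gaussian cdf
    return math.exp(-r * (tau - t)) * (xi * phi + (mu - K) * Phi)
```

A quick sanity check is to compare the output, e.g. call_price(31.5, 0.8, 31.0), against a direct quadrature of E[max(µ + ξX − K, 0)]; the two agree to high accuracy.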
We thus define a one-parameter covariance operator by

Q h(x) = ∫_O e^{−k|x−y|} h(y) dy,  h ∈ H. (5.2)

Because e^{−k|·|} is the characteristic function of a Cauchy distributed random variable with location parameter 0 and scale parameter k, it follows by Bochner's theorem that it is positive-definite. Since it is also symmetric and continuous, for O compact it follows from [35, Theorem 15.8] that Q is in fact a covariance operator. In the following, we therefore choose O := [−γ, γ] for some large γ which ensures that all maturities of interest are covered.

We consider a specification of the volatility operator that does not depend on time and space, which we denote by σ instead of σ_t to simplify the notation. For the seasonal component in equation (2.5), we consider a(t) = a ≥ 0 for every t ∈ R_+. This means that we do not account for seasonality, and the level a corresponds to the implied spot price volatility, as pointed out in [9]. Moreover, we define the following weight function

ω(x) := (1 − |x|) I{|x| ≤ 1} = { 1 − |x| if |x| ≤ 1; 0 otherwise }. (5.3)

Let us notice that this parameter-free specification of ω fulfils the assumptions of Proposition 2.5. The kernel κ_t in equation (2.4) then becomes κ_t(x, y) = κ(x, y) = a e^{−bx} ω(x − y), where b ≥ 0 determines the strength of the maturity effect. The volatility operator is given by

σh(x) = a e^{−bx} ∫_O (1 − |x − y|) I{|x − y| ≤ 1} h(y) dy, (5.4)

and is well defined by Theorem 2.2. Let us remember that the role of σ is to smooth the noise from the space H into H_α, which we have achieved by considering an integral operator. The weight function ω then has a double role. First of all, it functions as (a part of) the kernel of the integral operator which smooths the noise. On the other hand, it weights the randomness coming from the Wiener process W_t so that a contract with time to maturity x is only influenced by W_t(y), for y in a neighbourhood of x.
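Both operators can be discretised on a uniform grid over O = [−γ, γ]. The sketch below checks numerically that the exponential kernel of (5.2) yields a symmetric positive semi-definite matrix, as Bochner's theorem guarantees, and applies the volatility operator (5.4) to a test function; all parameter values are illustrative:

```python
import numpy as np

# Discretise the operators of (5.2) and (5.4) on a uniform grid over
# O = [-gamma, gamma]; the parameter values below are illustrative only.
gamma, k, a, b, n = 3.0, 1.5, 0.5, 1.0, 200
x = np.linspace(-gamma, gamma, n)
dx = x[1] - x[0]

# Covariance operator: (Q h)(x) = int_O exp(-k|x - y|) h(y) dy
Q = np.exp(-k * np.abs(x[:, None] - x[None, :])) * dx
eig = np.linalg.eigvalsh(Q)   # all eigenvalues should be (numerically) >= 0

# Volatility operator (5.4) applied to a test function h
diff = np.abs(x[:, None] - x[None, :])
omega = np.where(diff <= 1.0, 1.0 - diff, 0.0)   # triangular weight (5.3)
h = np.sin(x)
sigma_h = a * np.exp(-b * x) * (omega @ h) * dx
```

The triangular kernel ω averages the noise over a unit-width neighbourhood of each maturity, which is exactly the smoothing role described in the text.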
Other weight functions ω could be considered to obtain a similar weighting effect.

We have to calculate the volatility Σ²(s) using the formula provided in Corollary 3.3. However, the expression turns out to be cumbersome when integrating over O = [−γ, γ]. For this reason, we integrate over R instead and calculate Σ²(s) according to the formula

Σ²(s) := a² ∫_{R_+} ∫_{R_+} ∫_R ∫_R e^{−bu} e^{−bv} d_ℓ(T − s, u) d_ℓ(T − s, v) ω(v − z) q(z, y) ω(u − y) dy dz du dv.

Since all terms are positive and Σ²(s) < ∞, it is then easy to see that, for

Σ²_γ(s) := a² ∫_{R_+} ∫_{R_+} ∫_{−γ}^{γ} ∫_{−γ}^{γ} e^{−bu} e^{−bv} d_ℓ(T − s, u) d_ℓ(T − s, v) ω(v − z) q(z, y) ω(u − y) dy dz du dv,

the limit lim_{γ→∞} Σ²_γ(s) = Σ²(s) holds. We then calculate Σ²(s) explicitly.

Proposition 5.2.
In the setting described above, the volatility Σ²(s), t ≤ s ≤ τ, is given by

Σ²(s) = (2a²)/(k b ℓ²) { (b² + 3)(1 − e^{−bℓ}) − e^{−bℓ} (b(ℓ² − ℓ + 4) − ℓ + 2) } e^{−2b(T−s)}.

Proof.
See Appendix A.7.

And, consequently, we can calculate ξ² in equation (3.11).

Proposition 5.3. In the setting described above, we get

ξ² = a²/(k b² ℓ²) (e^{−2b(T−τ)} − e^{−2b(T−t)}) · { (b² + 3)(1 − e^{−bℓ}) − e^{−bℓ} (b(ℓ² − ℓ + 4) − ℓ + 2) }.

Proof.
The result follows by integrating Σ²(s) from Proposition 5.2 over the interval [t, τ].

Finally, we need to introduce a suitably parametrized initial forward curve, g_t = g(t, ·) ∈ H_α. We choose the Nelson-Siegel curve, defined for x ≥ 0 by

g_t(x) = g_NS(x) := α_0 + (α_1 + α_2 α_3 x) e^{−α_3 x}, (5.5)

for α_0, α_1, α_2, α_3 ∈ R, α_3 > 0. Let us notice that it does not depend on time t. This curve was first introduced for modelling forward rates by [34], and has already been applied in the context of energy markets by [8]. We need however to check that g_NS ∈ H_α. This is ensured in the following lemma.

Lemma 5.4.
The Nelson-Siegel curve in equation (5.5) belongs to H_α if and only if α < 2α_3.

Proof. The condition is found by calculating the H_α-norm of g_NS, namely

‖g_NS‖²_α = g_NS(0)² + ∫_{R_+} g'_NS(x)² w_α(x) dx = (α_0 + α_1)² + ∫_{R_+} α_3² (α_2 − α_1 − α_2 α_3 x)² e^{(α − 2α_3)x} dx,

where the integral converges if and only if α − 2α_3 < 0.

With the explicit representation for g_t, we compute µ(g_t): from Lemma 3.1 we get that

µ(g_t) = (1/ℓ) ∫_{T−t}^{T+ℓ−t} g_NS(y) dy,

and by direct computation we then obtain the drift

µ(g_t) = (1/ℓ) { α_0 ℓ + (1/α_3) e^{−α_3(T−t)} (α_1 + α_2 + α_2 α_3 (T − t)) − (1/α_3) e^{−α_3(T+ℓ−t)} (α_1 + α_2 + α_2 α_3 (T + ℓ − t)) }, (5.6)

which is the last component we need in order to obtain a fully specified option price functional.

In this section, we describe implementation details and report our findings. From Section 5, we obtain that the vector of model parameters is θ = (a, b, k, α_0, α_1, α_2, α_3) ∈ R⁷, i.e., n = 7. In this vector, a, b ≥ 0 are parameters of the volatility operator introduced in Section 5.2, and k ≥ 0 is the parameter of the covariance operator, see Section 5.1. Finally, α_0, α_1, α_2, α_3 ∈ R with α_3 > 0 are the parameters of the Nelson-Siegel curve as introduced in Section 5.3. Since we are considering European-style call options written on forward-style swaps with delivery period, the vector of contract features λ is given by λ = (K, τ, T_1, ℓ) ∈ R⁴, hence m = 4, where K > 0 is the strike price, τ is the time to maturity, T_1 ≥ τ is the start of delivery of the swap, and ℓ > 0 is the length of the delivery period. Let us remember that, in view of the grid-based learning approach, we shall create a grid of values for λ. In our framework, this would be a four-dimensional grid.
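The Nelson-Siegel curve (5.5) and the closed-form drift (5.6) above can be checked numerically, since µ(g_t) is just the average of g_NS over the delivery window; the parameter values in the check are illustrative:

```python
import math

def g_ns(x, a0, a1, a2, a3):
    # Nelson-Siegel curve (5.5): a0 + (a1 + a2*a3*x) * exp(-a3*x)
    return a0 + (a1 + a2 * a3 * x) * math.exp(-a3 * x)

def mu_closed(T, t, ell, a0, a1, a2, a3):
    """Closed-form drift (5.6): the average of g_NS over [T-t, T+ell-t],
    written via the antiderivative F(y) = -(a1 + a2 + a2*a3*y) e^{-a3 y}/a3
    of the non-constant part of g_NS."""
    def F(y):
        return -(a1 + a2 + a2 * a3 * y) * math.exp(-a3 * y) / a3
    lo, hi = T - t, T + ell - t
    return a0 + (F(hi) - F(lo)) / ell
```

A midpoint-rule average of g_NS over the same interval reproduces mu_closed to high accuracy, confirming the direct computation behind (5.6).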
We then decide to set T_1 = τ, namely the time to maturity of the option coincides with the start of delivery of the swap, and ℓ = 1/12, namely we consider only contracts with one month of delivery, as the time unit is one year. Then λ = (τ, K) ∈ R² and m = 2, so that we will have to create a two-dimensional grid, as introduced in Section 4.2.2. Finally, for all the experiments we fix t = 0 as evaluation time, and r = 0 as risk-free interest rate. In particular, we shall consider three different experiments: pointwise learning, grid-based learning, and grid-based learning with convolutional networks.

In all the experiments, we use the Adam optimizer both for training the neural network and in the calibration step, and we consider the mean squared error loss function. The number of epochs for the approximation step is chosen so that a higher number does not improve the accuracy of the network; a larger number of epochs is used for the calibration step, since more computational effort is necessary for fitting the prices in this second step. The batch size is fixed for all the experiments.

For the grid-based learning approach, we have λ = (τ, K); hence we create a grid of values of times to maturity and strike prices which we believe to be reasonable for a case study in the electricity markets. Let Λ_grid := Λ_grid_τ × Λ_grid_K, where Λ_grid_τ contains seven maturities and Λ_grid_K nine strike levels, so that the grid has dimension m_1 × m_2 = 7 × 9. We point out that the range in Λ_grid_K is relatively narrow compared to the strike prices available, for example, at EEX. The reason for this choice is that by selecting a wider range for K, some options are far out of the money, and thus very cheap for parameter choices that result in low overall volatility. As considering options with value less than a cent is not very interesting and also poses numerical challenges, we restrict the setting of our experiments to this narrow range.
To widen the range, one may consider options with low strikes only in combination with model parameters a, b and k that lead to a large overall volatility ξ. From a practical perspective, this is reasonable, since options far out of the money are typically only in demand in large-volatility environments. While the network architecture would not actually change significantly, choosing the grid Λ_grid_K specified above for all possible model parameters is convenient and thus preferred for this case study.

We define a neural network N : Θ → R^{7×9}, with Θ ⊂ R⁷, so that d = n = 7 is the input dimension and p = 7 × 9 = 63 is the output dimension. In particular, we consider a four-layer neural network (L = 4) with node sizes n = (30, . . .) and the ReLU activation function. The final output is reshaped into a grid of dimension 7 × 9. The total number of parameters to calibrate is M = |V| = 4053. We consider a training set of size N_train = 40000 and a test set of size N_test = 4000, which is also used for the calibration step. The code is implemented in TensorFlow 2.1.0 and is available at GitHub: HJM_calibration_with_NN. We take θ = (a, b, k, α_0, α_1, α_2, α_3) ∈ Θ, with Θ given by a product of seven compact intervals, one for each parameter. The resulting annual implied spot price volatility and the half-life of the process, that is, the time taken for the price to revert half-way back to its long-term level, are confined to realistic ranges; see [16] for details.

Since Θ has dimension n = 7, if one wants to consider for each dimension a number ν of different values for the training set, and then combine every value in the first dimension with every value in the second dimension, and so on, then the size of the training set would be ν⁷, which easily gets very large.
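To avoid this ν⁷ blow-up, one can instead take a uniform one-dimensional grid per parameter and randomly match the coordinates, as described next; a minimal NumPy sketch (the parameter bounds below are placeholders, not the intervals actually used for Θ):

```python
import numpy as np

# One uniform grid of N = N_train + N_test values per coordinate of Theta,
# then an independent random permutation of each coordinate: full coverage
# of every one-dimensional marginal without a full tensor grid.
# The bounds below are placeholders, not the paper's actual intervals.
bounds = [(0.1, 1.0), (0.1, 2.0), (8.0, 12.0), (34.0, 36.0),
          (-6.0, -4.0), (0.1, 0.5), (4.0, 6.0)]
n_train, n_test = 40000, 4000
N = n_train + n_test
rng = np.random.default_rng(42)
cols = [rng.permutation(np.linspace(lo, hi, N)) for lo, hi in bounds]
theta = np.column_stack(cols)
theta_train, theta_test = theta[:n_train], theta[n_train:]
```

Each row of theta is then priced (with formula (5.1) in the experiments below) to produce one training or test sample.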
This approach would allow for training the network with all possible combinations of the parameters and possibly ensure a good approximation also of the one-dimensional restrictions of the pricing function. However, for a reasonable size of the training set, as for instance N_train + N_test = 44000 chosen here, one would get only 44000^{1/7} ≈ 4 different values for each of the parameters. We therefore consider a uniform grid of size N_train + N_test = 44000 for each of the n = 7 dimensions composing Θ, and we randomly match the values in each dimension to form a training set of size N_train = 40000 and a test set of size N_test = 4000. The corresponding prices have then been calculated with the formula (5.1). We point out that each of the N_train (or N_test) samples is a grid of size
63 = 7 × 9, with each dimension corresponding to the different values of K and τ. Each vector calibration is thus performed starting from 63 prices, as defined in Section 4.2.2.

In Figure 2 we report the average relative error and the maximum relative error in the training step, both for the training and the test set. In particular, the errors have been clustered in order to obtain a grid corresponding to the different contracts (τ, K) ∈ Λ_grid_τ × Λ_grid_K. We notice that the average relative error is quite low, showing a good performance of the neural network in approximating the price functional. The worst accuracy is for the contracts with the highest strike price, probably due to the fact that the price of these contracts is small.

In Figure 3 we report the relative error for the components of θ̂ in the calibration step. For some of the parameters, such as a and one of the α-parameters, we notice a certain pattern, namely, for higher values of the parameter the error is smaller, and vice versa. However, the performance in calibration is not particularly good: a, b, and one of the α-parameters have a notably large mean relative error. On the other hand, even if the model parameters are not accurately estimated, by substituting θ̂ into the neural network we get a price approximation Π̂ which is quite good. This can be seen in Figure 4, where we report the average and the maximum relative error after calibration. For mid-maturity contracts we observe the best accuracy.

With an output dimension of p = 63 in the grid-based learning approach, the number of neural network parameters grows fast, since in the last layer the weight matrix V_L must have n_L = 63 rows. In imaging, a solution adopted to reduce the total number of parameters is to consider convolutional networks, as described in Section 4. We consider a four-layer neural network (L = 4). The first two layers are dense with node sizes n_1 = 30 and n_2 = 63. There follows a reshaping into R^{7×9}, and two convolutional layers with different filter sizes.
Both convolutional layers have the same kernel size. We use the ReLU activation function, and the total number of parameters to calibrate is M = |V| = 2764. The training and test sets are the same ones we used for the dense network case above; hence each of the N_train = 40000 and N_test = 4000 samples is a grid of values corresponding to the different K's and τ's, and each vector calibration is performed with a sample of 63 prices.

Figure 2: Average relative error and maximum relative error in the approximation step with the grid-based learning approach and a dense neural network (training and test sets; strike versus maturity in months).

Figure 3: Relative error for the model parameters calibrated with the grid-based learning approach and a dense neural network (mean and median per parameter).

Figure 4: Average relative error and maximum relative error after calibration with the grid-based learning approach and a dense neural network.

In Figure 5 we report the average relative error and the maximum relative error in the approximation step, clustered with respect to the different contracts. Looking at the average relative error, we notice better accuracy compared to the grid-based approach with a dense network. However, for the maximum relative error, we get in the upper-right corner of the grid a value which is almost three times the corresponding value for the dense network. Also, over the rest of the grid the maximum relative error is roughly uniform, at a higher level than for the dense network in most of the cells. We conclude that the convolutional neural network considered has better average relative errors, but worse maximum relative errors.

In Figure 6 we report the relative error for the components of θ̂ in the calibration step. We notice the same pattern for a and one of the α-parameters found previously. We also notice worse accuracy for a, but much better accuracy for b, so that overall this is not really different from the dense network experiment. In Figure 7 a pattern similar to the price approximation after calibration with the dense network is visible, where the mid-maturity contracts have the lowest relative error. Overall, the two networks considered have approximately the same level of accuracy.
Figure 5: Average relative error and maximum relative error in the approximation step with the grid-based learning approach and a convolutional neural network (training and test sets).
We implement the pointwise learning approach. We define a neural network N : Λ × Θ → R_+, with Λ ⊂ R² and Θ ⊂ R⁷, so that d = 9 is the input dimension and p = 1 is the output dimension.

Figure 6: Relative error for the model parameters calibrated with the grid-based learning approach and a convolutional neural network (mean and median per parameter).
Figure 7: Average relative error and maximum relative error after calibration with the grid-based learning approach and a convolutional neural network.

We consider a four-layer neural network (L = 4) with node sizes n = (30, . . .) and the ELU activation function. The total number of parameters to calibrate is M = |V| = 2191. We consider a training set of size N_train = 60000 and a test set of size N_test = 6000. These are bigger than the sets considered in the grid-based approach, to reflect that each sample corresponds to a single price value, while in the grid-based approach each sample corresponds to 63 values; hence in the latter case the training sample is effectively bigger. We take θ = (a, b, k, α_0, α_1, α_2, α_3) ∈ Θ with Θ as above, while λ = (τ, K) ∈ Λ_point for Λ_point = Λ_point_τ × Λ_point_K a product of intervals of maturities and strikes. We notice that Λ_grid ⊂ Λ_point: this allows us to compare the pointwise learning approach with the grid-based one. Training and test sets have been generated from Θ × Λ in the same manner as described for the grid-based approach, calculating the corresponding prices with the formula in equation (5.1).

In order to produce error plots comparable to the ones in the grid-based learning, we cluster the prices in the following way. We introduce a new grid Λ̂ := Λ̂_τ × Λ̂_K, where Λ̂_τ consists of the midpoints between consecutive maturities in Λ_grid_τ, and Λ̂_K is defined analogously for the strikes. We then label each price with the closest strike in Λ_grid_K and the closest maturity in Λ_grid_τ, in order to obtain a grid of clustered values corresponding to Λ_grid. Moreover, since each of the N_test = 6000 samples corresponds to a different parameter vector, we cannot perform calibration starting from the test set, as we have done previously. Instead, we calibrate by using the test set from the grid-based approach, after shape adjustment. In this manner, each calibration is performed as described in Section 4.2.1, starting from N_cal = 63 different prices corresponding to the same θ ∈ Θ, to be estimated.

In Figure 8 we report the average relative error and the maximum relative error in the approximation step after clustering, as described above. For both, we notice better accuracy compared to the grid-based approaches. Still, the worst accuracy is for contracts with the highest strike price. However, when it comes to calibration, the pointwise approach turns out to be the worst. Indeed, the relative errors for θ̂ in Figure 9 are on average worse than in the previous experiments. It can be noticed in particular that for a, b and k the errors are concentrated around specific levels, while before they were much more spread out. The accuracy for a is not very good, and the accuracy for b is the worst among the three methods. On the other hand, the accuracy for α has substantially improved. In Figure 10 we can see that the relative error for Π̂ after calibration is also the worst among the three methods.

In the setting of the grid-based learning approach with dense network of Section 6.1, we test the bid-ask loss function defined in Section 4.3.
In particular, we consider the dense neural network previously trained, and we use the bid-ask loss function for the calibration step. Starting from the test set of the grid-based approach, we obtain bid and ask prices by taking fixed percentages below and above the original prices, respectively. In Figure 11 we report the relative errors for the calibrated model parameters, which are a bit worse than the values found in the experiment with the same neural network but exact prices. However, considering that the information given to the network is much weaker, due to the relatively wide bid-ask range, the results are surprisingly good, and the overall error is of similar magnitude as in the rest of the experiments.
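The calibration objective used in this experiment can be sketched as follows. This is our reading of the bid-ask loss of Section 4.3 (a quadratic penalty on the part of each model price lying outside its bid-ask band); the function name and the averaging convention are illustrative, not the paper's exact specification.

```python
def bid_ask_loss(model_prices, bids, asks):
    """Penalize only the part of each model price lying outside [bid, ask].

    Prices inside the band contribute zero, so wide bands give the
    optimizer weaker information than exact prices would.
    """
    loss = 0.0
    for p, lo, hi in zip(model_prices, bids, asks):
        below = max(lo - p, 0.0)   # distance below the bid
        above = max(p - hi, 0.0)   # distance above the ask
        loss += below ** 2 + above ** 2
    return loss / len(model_prices)
```

Any price inside its band, e.g. `bid_ask_loss([1.0], [0.9], [1.1])`, contributes zero, which is why narrower bands sharpen the calibration signal.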
Figure 8: Average relative error and maximum relative error in the approximation step with the pointwise learning approach (training set and test set, by strike and maturity in months).
Figure 9: Relative error for the model parameters calibrated with the pointwise learning approach (mean and median relative error reported per parameter).
Figure 10: Average relative error and maximum relative error after calibration with the pointwise learning approach.

We calculate the rate of mismatched prices, namely the percentage of prices which do not lie within the bid-ask constraints. In Figure 12, the plot on the left shows the mismatched-price rate observed before calibration, namely the rate obtained by plugging into the neural network the starting parameters needed to initialize the optimization routine. The plot on the right shows the percentage of mismatched prices after calibration. As we can see, the final rate is zero for almost all the contracts, with only two exceptions. A random sub-sample of prices is reported in Figure 13 where, in the top panel, we can notice several prices lying outside the constraints (marked with the symbol ★), in the mid panel only some of the prices are outside the constraints, while in the bottom panel all the prices are within the constraints (marked with the symbol ×), corresponding to a mismatch rate equal to zero. However, the prices lying outside the bid-ask interval are very close to the boundaries, which confirms the suitability of the bid-ask loss function whenever exact prices are not available. We have also tested the bid-ask loss function with narrower constraints. In this case, it is sufficient to increase the number of iterations of the optimization routine to obtain very similar results.
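The mismatch rate reported in Figure 12 is a simple fraction; a minimal sketch (the function name and the convention that boundary prices count as matched are ours):

```python
def mismatch_rate(prices, bids, asks):
    """Fraction of prices lying strictly outside their bid-ask band [bid, ask]."""
    outside = sum(1 for p, lo, hi in zip(prices, bids, asks) if p < lo or p > hi)
    return outside / len(prices)
```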
Figure 11: Relative error for the model parameters calibrated with a dense neural network in the grid-based learning approach and bid-ask constraints (mean and median relative error reported per parameter).

Figure 12: Starting mismatch rate and final mismatch rate after calibration with a dense network in the grid-based learning approach and bid-ask constraints.
Figure 13: Examples of bid-ask constraints with a dense neural network and the grid-based learning approach (three contracts with different strike and maturity). The symbol × indicates the prices which after calibration are inside the constraints, the symbol ★ the ones outside.

6.5 The non-injectivity issue

In all the experiments described above, a common issue appears. The accuracy achieved in calibration is not particularly convincing, especially for the parameters of the volatility and the covariance operator, a, b and k. Slightly better results were obtained for the Nelson-Siegel curve parameters, with one exception. See Figures 3, 6, 9 and 11. On the other hand, the relative error for the price approximation after calibration shows a high degree of accuracy. See Figures 4, 7, 10 and 12. This phenomenon appears as a discrepancy. From this, one may conclude that neural networks can be used for approximation and calibration of sophisticated stochastic models, such as the HJM model considered here. However, the original meaning of the model parameters may get lost in the approximation step. As pointed out in [6], it is somehow to be expected that the neural network N is non-injective in the input parameters on a large part of the input domain. We shall briefly analyse this.

The price formula (5.1), once the strike K and the time to maturity τ are fixed, crucially depends on ξ as derived in Proposition 5.3 and on µ(g_t) (see equation (5.6)):

\[ \Pi(t) = e^{-r(\tau-t)}\left\{ \xi\,\varphi\!\left(\frac{\mu(g_t)-K}{\xi}\right) + \big(\mu(g_t)-K\big)\,\Phi\!\left(\frac{\mu(g_t)-K}{\xi}\right) \right\}. \]

A first observation is that ξ is only a scale, while µ(g_t) defines the distance from the strike price K and is hence more influential on the final price level.
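For a fixed strike and maturity, the price formula is cheap to evaluate directly. Below is a sketch assuming the Bachelier-type form Π(t) = e^{−r(τ−t)}{ξ φ(d) + (µ − K) Φ(d)} with d = (µ(g_t) − K)/ξ, where φ and Φ are the standard normal density and distribution function; `mu` and `xi` are passed as plain numbers, and the function name is ours.

```python
from math import erf, exp, pi, sqrt

def norm_pdf(x):
    """Standard normal density."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def norm_cdf(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def option_price(mu, xi, K, r, tau, t):
    """Bachelier-type price: e^{-r(tau-t)} (xi*phi(d) + (mu-K)*Phi(d)), d=(mu-K)/xi."""
    d = (mu - K) / xi
    return exp(-r * (tau - t)) * (xi * norm_pdf(d) + (mu - K) * norm_cdf(d))
```

At the money (mu = K) the price is proportional to ξ, while mu − K sets the distance from the strike, which is the asymmetry between the two quantities discussed above.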
Let us focus on ξ:

\[ \xi^2 = \frac{a^2}{k\, b^5 \ell^2}\left( e^{-2b(T-\tau)} - e^{-2b(T-t)} \right) B\big(b, e^{-b\ell}\big), \]

where B(b, e^{−bℓ}) simply indicates a term depending on b and e^{−bℓ}. In the front coefficient, the three parameters a, b and k are multiplied together directly or inversely, and a decrease in a might, for example, be compensated by a decrease in b or k, and vice versa, meaning that several combinations of values for a, b and k lead to the same overall ξ. Thus we may suspect that it can be hard for the neural network to identify the right vector of parameters in the calibration step, despite reaching a good level of accuracy for the price.

In Figure 14 we report an example of non-injectivity with respect to the parameters a, b and k that we have observed for the dense network trained in the grid-based learning approach. Here we notice indeed that the neural network is not injective as a map when all the parameters except one are fixed. We also notice that the map is only slightly sensitive to changes in the parameters a, b and k, which also explains the struggle in recovering these three parameters. Similar observations can be made for the drift:

\[ \mu(g_t) = \alpha_1 + \frac{e^{-\alpha_4(T-t)}}{\alpha_4^2 \ell}\big(\alpha_3 + \alpha_2\alpha_4 + \alpha_3\alpha_4(T-t)\big) - \frac{e^{-\alpha_4(T+\ell-t)}}{\alpha_4^2 \ell}\big(\alpha_3 + \alpha_2\alpha_4 + \alpha_3\alpha_4(T+\ell-t)\big). \]

Here the role of α₁ is specific, since it defines the starting level of the curve, and indeed α₁ is the parameter that gets the best accuracy in estimation. However, α₃ appears in two positions: first added to α₂α₄ and then multiplied by α₄. It is thus difficult for the neural network to outline the role of α₃ in the drift. In the Nelson-Siegel curve definition in equation (5.5), α₄ defines the position of the "bump" in the curve. The drift µ(g_t) is then obtained by integrating the curve over the delivery period of the contract. This integration might smoothen the curve and make it difficult to locate the "bump".
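The compensation effect in ξ can be checked numerically. The sketch below assumes the expression for ξ² derived from the proof of Proposition 5.2, with front coefficient a²/(k b⁵ ℓ²); the exact powers were partly lost in extraction, so treat them as our reconstruction. For fixed b, ξ² depends on a and k only through a²/k, so scaling a ↦ λa and k ↦ λ²k leaves ξ unchanged.

```python
from math import exp

def xi_squared(a, b, k, ell, T, tau, t):
    """xi^2 with front coefficient a^2/(k b^5 ell^2), a time factor,
    and the term B(b, e^{-b ell}) from Proposition 5.2 (reconstructed)."""
    B = (2.0 / 3.0) * (b ** 2 + 3.0) * (1.0 + exp(-2.0 * b * ell)) \
        - 2.0 * exp(-b * ell) * ((b ** 2 / 6.0) * (3.0 * ell ** 3 - 6.0 * ell ** 2 + 4.0)
                                 - 3.0 * ell + 2.0)
    time_factor = exp(-2.0 * b * (T - tau)) - exp(-2.0 * b * (T - t))
    return a ** 2 / (k * b ** 5 * ell ** 2) * time_factor * B

# (a, k) = (0.5, 2.0) and (1.0, 8.0) share a^2/k = 0.125, hence the same xi^2:
x1 = xi_squared(0.5, 1.0, 2.0, 0.25, 1.0, 0.75, 0.0)
x2 = xi_squared(1.0, 1.0, 8.0, 0.25, 1.0, 0.75, 0.0)
```

Two different parameter vectors thus produce identical option prices, which is exactly the identifiability problem the calibration step runs into.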
This might explain why the accuracy in estimating this parameter is worse than for the other Nelson-Siegel parameters.

The problem described above may arise also with more traditional calibration techniques. It might simply be due to the fact that one is dealing with a non-convex optimisation problem with potentially several local minima. What is unique to the neural network is that the approximated price function N(x) might violate some basic arbitrage conditions. One particular example of such a condition is that the call price should be an increasing function of the volatility. In our case, in particular, since a arises as a constant factor in the volatility, the price should be increasing in a when all the other parameters are fixed. In Figure 14 we observe that this is violated here. The violation is however relatively small, since overall the prices are well fitted.

Figure 14: Examples of non-injectivity for the dense neural network trained in the grid-based learning approach. In each image, only one of the parameters varies, while the rest are fixed.

We conclude the article with the following theorem, showing that it is possible to construct ReLU neural networks which act as linear maps.
Theorem 6.1. Let A ∈ ℝ^{p×d}. Then for any L ≥ 2 and any n = (d, n₁, …, n_{L−1}, p) with n_i ≥ 2d, i = 1, …, L−1, there exists an L-layer ReLU neural network N : ℝ^d → ℝ^p with dimension n which satisfies N(x) = Ax for all x ∈ ℝ^d.

Proof.
See Appendix A.8.

Theorem 6.1 proves that we can construct a ReLU L-layer neural network which acts as a linear map. As there are infinitely many non-injective linear maps (the zero map being a trivial example), it is then possible to construct infinitely many non-injective ReLU neural networks. Obviously, this does not show that a non-injective network, such as the one constructed in the proof of Theorem 6.1, will also minimize the objective function used for training. It may however represent an important starting point to (try to) understand the injectivity issue.

We consider a state-dependent HJM model defining the forward curve dynamics under the risk-neutral measure. To specify the model, we construct a volatility operator which turns elements from the noise space L²(O) into elements in the space of forward curves, the Filipović space H_α(ℝ₊) defined in Section 2. This is achieved by introducing a class of integral operators. We prove conditions which ensure that an operator of this class is well defined, and that the resulting volatility satisfies the Lipschitz and linear growth conditions required for the existence of a mild solution to the HJM equation. We then restrict our attention to a parametrized deterministic volatility operator. It captures some important properties studied in the literature, such as a pronounced seasonality and the Samuelson effect.

We then focus on forward contracts with delivery periods, called swaps, and options written on them. By considering a parametric covariance operator and a parametric model for the initial forward curve, namely the Nelson-Siegel curve, we obtain a vector θ of parameters that determines the option prices in the model. We derive the respective pricing functional and train a neural network to approximate it. This neural network is then used in a calibration step to find the best model parameters (in a mean-squared-error sense) to fit the (artificial) market price data.
We do this both with the pointwise and the grid-based learning approaches described in Section 4.2, and we test different neural network architectures.

From all the experiments, we can conclude that neural networks perform very well in approximating the price functional, with the average relative error in the test set reaching very low levels. However, the original meaning of the parameters in the underlying model might get lost during the approximation step. This becomes apparent in the calibration step, where the average relative error for the parameters can be of substantial magnitude, with large peaks in the worst cases. On the other hand, the prices obtained with the neural network and the estimated parameters are accurate, with small average relative errors, showing that the prices can be recovered even with a vector of parameters θ different from the true one.

As discussed in Section 6.5, this may be due to the non-injectivity of the neural network. Moreover, for an option with fixed maturity and strike, different combinations of the parameters can result in the same price. As stated in [6], computing the distance between the true model parameters and the estimated ones, as we do here, might not be the optimal way to quantify the accuracy of the two-step approach. In [6] the authors suggest switching to a Bayesian viewpoint and observing the posterior distribution of the model parameters θ, considered as a random variable. However, this kind of analysis was beyond the scope of the current research. While the problem in recovering the true parameters stresses the importance of proper benchmarking of a neural-network-based approach, the accuracy in pricing suggests that neural networks might in fact turn out promising in making infinite-dimensional models more tractable.

To improve accuracy in calibration, one might consider a different structure for the input vectors in the training set.
In our experiments, we consider a uniform grid for each of the dimensions and randomly match the values to generate the input set. A given value in, say, the first dimension then appears only once in the whole training set. It might help to construct the set in such a way that every value in the first dimension is combined with every value in the second dimension, and so on, to generate a uniform grid on Θ. As pointed out before, since Θ has dimension n = 7, training the network with m different values for each parameter requires a training set of size m⁷, which becomes large even for a small m.

The new loss function introduced for calibration, based on bid-ask prices instead of exact prices, works well for the calibration step. It penalizes the values lying outside the bid-ask range and, considering that the information given to the network is weaker than with exact prices, the results are promising. Indeed, after calibration, for almost all the contracts considered the price lies within the bid-ask range, or otherwise very close to its boundaries. If the prices for some of the contracts are available, while for others one has only bid and ask prices, one could consider a hybrid loss function, corresponding to the mean squared error for the exact prices and to the newly defined bid-ask loss function for the bid and ask price ranges. This allows one to exploit all the available information at once.

Since forward contracts are themselves traded, it may be interesting to test a different approach. As done here, the training of the neural network includes the parameters for the initial forward curve. Once the neural network is trained in the off-line step, one could divide the second step into two parts. By standard interpolation techniques, one could use the forward prices to calibrate the initial forward curve parameters.
These can then be considered as fixed in the proper calibration step, so that the vector of parameters to be recovered with the neural network is smaller, most likely leading to better results in the calibration.

One might also decide to fix the initial forward curve already at the level of the approximation step, and then train the neural network with a lower-dimensional input vector θ. This would require running the approximation step much more frequently, to keep it updated with the current market information, but would most likely improve accuracy. Another possibility, without changing the setup, is to include forward contracts as options with strike K = 0 and time to maturity τ = t equal to the evaluation time. These contracts most efficiently allow one to recover the parameters for the Nelson-Siegel curve. Given these considerations, our approach of calibrating the entire parameter vector θ based only on call options is likely to lead to a larger error than the one obtained in practice. For instance, when pricing standard options in equity markets based on a Black-Scholes model, only the implied volatility is calibrated, while the current stock price, dividends and interest rate are obtained from other sources.

A Proofs of the main results
We report in this section the proofs of the main results, in the order they appear in the paper.
A.1 Proof of Theorem 2.2
For every x ∈ ℝ₊ and f ∈ H_α, by the Cauchy-Schwarz inequality we can write that

\[ \int_O |\kappa_t(x,y,f)\, h(y)|\,dy \le \left(\int_O \kappa_t(x,y,f)^2\,dy\right)^{1/2} \left(\int_O h(y)^2\,dy\right)^{1/2} < \infty, \]

since κ_t(x,·,f) ∈ H for every x ∈ ℝ₊ and every f ∈ H_α by Assumption 1, and because h ∈ H. Thus σ_t(f)h is well defined for all h ∈ H.

We need to show that σ_t(f)h ∈ H_α for every f ∈ H_α. We start by noticing that for every x ∈ ℝ₊ the following equality holds:

\[ \frac{\partial \sigma_t(f)h(x)}{\partial x} = \int_O \frac{\partial \kappa_t(x,y,f)}{\partial x}\, h(y)\,dy, \]

where the differentiation under the integral sign is justified by dominated convergence because of Assumption 2 and \(\int_O \bar\kappa_x(y)\, h(y)\,dy < \infty\). Moreover, by Assumption 3 and the Cauchy-Schwarz inequality, we find that

\[ \int_{\mathbb{R}_+} \left(\int_O \frac{\partial \kappa_t(x,y,f)}{\partial x}\, h(y)\,dy\right)^2 \alpha(x)\,dx \le \|h\|^2 \int_{\mathbb{R}_+} \left\|\frac{\partial \kappa_t(x,\cdot,f)}{\partial x}\right\|^2 \alpha(x)\,dx < \infty, \]

which shows that σ_t(f)h ∈ H_α and the boundedness of the operator σ_t(f) for every f ∈ H_α.
We first observe that for every h ∈ H and every f₁, f₂ ∈ H_α, it holds that

\[ \int_{\mathbb{R}_+}\int_O \left|\frac{\partial\kappa_t(x,y,f_1)}{\partial x}\, f_2'(x)\,\alpha(x)\, h(y)\right| dy\,dx = \int_{\mathbb{R}_+} |f_2'(x)\,\alpha(x)| \int_O \left|\frac{\partial\kappa_t(x,y,f_1)}{\partial x}\right| |h(y)|\,dy\,dx \]
\[ \le \int_{\mathbb{R}_+} |f_2'(x)|\,\alpha^{1/2}(x) \left\|\frac{\partial\kappa_t(x,\cdot,f_1)}{\partial x}\right\| \alpha^{1/2}(x)\, \|h\|\,dx \le \|h\|\, \|f_2\|_\alpha \left(\int_{\mathbb{R}_+} \left\|\frac{\partial\kappa_t(x,\cdot,f_1)}{\partial x}\right\|^2 \alpha(x)\,dx\right)^{1/2} < \infty. \]

Hence we can compute

\[ \langle \sigma_t(f_1)h, f_2\rangle_\alpha = f_2(0)\int_O \kappa_t(0,y,f_1)\,h(y)\,dy + \int_{\mathbb{R}_+} \frac{\partial\sigma_t(f_1)h(x)}{\partial x}\, f_2'(x)\,\alpha(x)\,dx \]
\[ = f_2(0)\int_O \kappa_t(0,y,f_1)\,h(y)\,dy + \int_{\mathbb{R}_+}\int_O \frac{\partial\kappa_t(x,y,f_1)}{\partial x}\, h(y)\,dy\, f_2'(x)\,\alpha(x)\,dx \]
\[ = \int_O \left( f_2(0)\,\kappa_t(0,y,f_1) + \int_{\mathbb{R}_+} \frac{\partial\kappa_t(x,y,f_1)}{\partial x}\, f_2'(x)\,\alpha(x)\,dx \right) h(y)\,dy = \int_O \sigma_t(f_1)^* f_2(y)\, h(y)\,dy = \langle h, \sigma_t(f_1)^* f_2\rangle, \]

for σ_t(f₁)* f₂ defined by

\[ \sigma_t(f_1)^* f_2(y) := f_2(0)\,\kappa_t(0,y,f_1) + \int_{\mathbb{R}_+} \frac{\partial\kappa_t(x,y,f_1)}{\partial x}\, f_2'(x)\,\alpha(x)\,dx = \langle \kappa_t(\cdot,y,f_1), f_2\rangle_\alpha. \]

From [36, Theorem 6.1], σ_t(f₁)* is the unique adjoint operator of σ_t(f₁), for f₁ ∈ H_α.
We start with the growth condition. For h ∈ H and f ∈ H_α we can write that

\[ \|\sigma_t(f)h\|_\alpha^2 = \big(\sigma_t(f)h(0)\big)^2 + \int_{\mathbb{R}_+}\left(\frac{\partial\sigma_t(f)h(x)}{\partial x}\right)^2 \alpha(x)\,dx \]
\[ = \left(\int_O \kappa_t(0,y,f)\,h(y)\,dy\right)^2 + \int_{\mathbb{R}_+}\left(\int_O \frac{\partial\kappa_t(x,y,f)}{\partial x}\, h(y)\,dy\right)^2 \alpha(x)\,dx \]
\[ \le \|\kappa_t(0,\cdot,f)\|^2\,\|h\|^2 + \int_{\mathbb{R}_+}\left\|\frac{\partial\kappa_t(x,\cdot,f)}{\partial x}\right\|^2 \|h\|^2\, \alpha(x)\,dx \]
\[ \le C^2(t)\,(1+|f(0)|)^2\,\|h\|^2 + \int_{\mathbb{R}_+} C^2(t)\, f'(x)^2\, \|h\|^2\, \alpha(x)\,dx \le C^2(t)\,(1+\|f\|_\alpha)^2\,\|h\|^2, \]

where we have used the Cauchy-Schwarz inequality, together with the inequality |f(0)| ≤ ‖f‖_α and Assumption 2. With some abuse of notation, it follows that ‖σ_t(f)‖_{L(H, H_α)} ≤ C(t)(1 + ‖f‖_α) for a suitably chosen constant C(t). Similarly, from Assumption 1 it follows that

\[ \|(\sigma_t(f_1)-\sigma_t(f_2))h\|_\alpha^2 = \left(\int_O \big(\kappa_t(0,y,f_1)-\kappa_t(0,y,f_2)\big)\,h(y)\,dy\right)^2 + \int_{\mathbb{R}_+}\left(\int_O \left(\frac{\partial\kappa_t(x,y,f_1)}{\partial x}-\frac{\partial\kappa_t(x,y,f_2)}{\partial x}\right) h(y)\,dy\right)^2 \alpha(x)\,dx \]
\[ \le \|\kappa_t(0,\cdot,f_1)-\kappa_t(0,\cdot,f_2)\|^2\,\|h\|^2 + \int_{\mathbb{R}_+}\left\|\frac{\partial\kappa_t(x,\cdot,f_1)}{\partial x}-\frac{\partial\kappa_t(x,\cdot,f_2)}{\partial x}\right\|^2 \|h\|^2\,\alpha(x)\,dx \]
\[ \le C^2(t)\,|f_1(0)-f_2(0)|^2\,\|h\|^2 + \int_{\mathbb{R}_+} C^2(t)\,\big(f_1'(x)-f_2'(x)\big)^2\,\|h\|^2\,\alpha(x)\,dx \le C^2(t)\,\|f_1-f_2\|_\alpha^2\,\|h\|^2, \]

hence ‖σ_t(f₁) − σ_t(f₂)‖_{L(H, H_α)} ≤ C(t)‖f₁ − f₂‖_α for a suitably chosen C(t), which proves the Lipschitz continuity of the volatility operator and concludes the proof.
A.4 Proof of Proposition 2.5
In order for the volatility operator σ_t to be well defined, we need to check that the function κ_t introduced in equation (2.4) satisfies the assumptions of Theorem 2.2. We start by observing that κ_t(x,·) ∈ H if and only if ω ∈ H. Then, we can calculate the derivative

\[ \frac{\partial\kappa_t(x,y)}{\partial x} = a(t)\, e^{-bx}\left(\omega'(x-y) - b\,\omega(x-y)\right), \]

which, in particular, by Assumption 2 is bounded by

\[ \left|\frac{\partial\kappa_t(x,y)}{\partial x}\right| \le a(t)\, e^{-bx}\, \bar\omega_x(y). \]

For the H-norm we then have that

\[ \left\|\frac{\partial\kappa_t(x,\cdot)}{\partial x}\right\|^2 = \int_O \left(\frac{\partial\kappa_t(x,y)}{\partial x}\right)^2 dy \le a^2(t)\, e^{-2bx}\, C^2 < \infty, \]

where we have used that ‖ω̄_x‖ ≤ C. This implies that Assumption 3 in Theorem 2.2 is satisfied for α such that \(\int_{\mathbb{R}_+} e^{-2bx}\,\alpha(x)\,dx < \infty\). Finally, the Lipschitz condition is trivially satisfied, and the growth condition is fulfilled because a(t) is bounded.
For w in equation (3.2), we get that w_ℓ(v) = 1/ℓ and W_ℓ(u) = u/ℓ. Then

\[ q_{w_\ell}(x,y) = \frac{1}{\ell}\,(\ell - y + x)\,\mathbb{I}_{[0,\ell]}(y-x), \]

and from equations (3.3) and (3.5) we can write that

\[ \mathcal{D}_{w_\ell}(g_t)(x) = g_t(x) + \frac{1}{\ell}\int_0^\infty (\ell - y + x)\,\mathbb{I}_{[0,\ell]}(y-x)\, g_t'(y)\,dy = g_t(x) + \frac{1}{\ell}\int_x^{x+\ell} (\ell - y + x)\, g_t'(y)\,dy. \]

Integration by parts gives the result.
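The final integration by parts can be spelled out (a short completion, assuming the expression above; the boundary term at y = x + ℓ vanishes since ℓ + x − y = 0 there):

```latex
\mathcal{D}_{w_\ell}(g_t)(x)
  = g_t(x) + \frac{1}{\ell}\Big[(\ell + x - y)\, g_t(y)\Big]_{y=x}^{y=x+\ell}
    + \frac{1}{\ell}\int_x^{x+\ell} g_t(y)\,dy
  = \frac{1}{\ell}\int_x^{x+\ell} g_t(y)\,dy ,
```

so the operator returns the average of the curve over the delivery window [x, x + ℓ], consistent with its use in the proof of Proposition 5.2.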
A.6 Proof of Proposition 3.2
Let f := D_{w_ℓ}^* δ_{T−s}^*(1). We start by applying the covariance operator Q to h := σ_s(g_s)^* f:

\[ \big(\mathcal{Q}\,\sigma_s(g_s)^* f\big)(x) = \int_O q(x,y)\,\sigma_s(g_s)^* f(y)\,dy = \int_O q(x,y)\,\langle \kappa_s(\cdot,y,g_s), f\rangle_\alpha\,dy = \Big\langle \int_O q(x,y)\,\kappa_s(\cdot,y,g_s)\,dy,\; f \Big\rangle_\alpha, \]

and then the operator σ_s(g_s):

\[ \big(\sigma_s(g_s)\,\mathcal{Q}\,\sigma_s(g_s)^* f\big)(x) = \int_O \kappa_s(x,z,g_s)\,\big(\mathcal{Q}\,\sigma_s(g_s)^* f\big)(z)\,dz = \Big\langle \int_O\int_O \kappa_s(x,z,g_s)\, q(z,y)\,\kappa_s(\cdot,y,g_s)\,dy\,dz,\; f\Big\rangle_\alpha = \langle \Psi_s(x,\cdot), f\rangle_\alpha, \]

for \(\Psi_s(x,\cdot) := \int_O\int_O \kappa_s(x,z,g_s)\, q(z,y)\,\kappa_s(\cdot,y,g_s)\,dy\,dz\). We now go back to the definition of f:

\[ \big(\sigma_s(g_s)\,\mathcal{Q}\,\sigma_s(g_s)^*\big)\big(\mathcal{D}_{w_\ell}^* \delta_{T-s}^*(1)\big)(x) = \big\langle \Psi_s(x,\cdot),\, \mathcal{D}_{w_\ell}^* \delta_{T-s}^*(1)\big\rangle_\alpha = \big\langle \mathcal{D}_{w_\ell}\Psi_s(x,\cdot),\, \delta_{T-s}^*(1)\big\rangle_\alpha = \delta_{T-s}\big(\mathcal{D}_{w_\ell}\Psi_s(x,\cdot)\big) = (\mathcal{D}_{w_\ell}\Psi_s)(x, T-s). \]

By Lemma 3.1 we can write that

\[ (\mathcal{D}_{w_\ell}\Psi_s)(x, T-s) = \int_{\mathbb{R}_+} d_\ell(T-s,u)\,\Psi_s(x,u)\,du = \int_{\mathbb{R}_+}\int_O\int_O d_\ell(T-s,u)\,\kappa_s(x,z,g_s)\, q(z,y)\,\kappa_s(u,y,g_s)\,dy\,dz\,du, \]

to which, finally, we apply the operator δ_{T−s} D_{w_ℓ}:

\[ \delta_{T-s}\,\mathcal{D}_{w_\ell}\big(\sigma_s(g_s)\,\mathcal{Q}\,\sigma_s(g_s)^*\big)\big(\mathcal{D}_{w_\ell}^*\delta_{T-s}^*(1)\big) = \int_{\mathbb{R}_+} d_\ell(T-s,v)\,\big(\sigma_s(g_s)\,\mathcal{Q}\,\sigma_s(g_s)^*\big)\big(\mathcal{D}_{w_\ell}^*\delta_{T-s}^*(1)\big)(v)\,dv \]
\[ = \int_{\mathbb{R}_+}\int_{\mathbb{R}_+}\int_O\int_O d_\ell(T-s,v)\, d_\ell(T-s,u)\,\kappa_s(v,z,g_s)\, q(z,y)\,\kappa_s(u,y,g_s)\,dy\,dz\,du\,dv, \]

finalizing the proof.
We consider the representation

\[ \Sigma^2(s) = a^2 \int_{\mathbb{R}_+}\int_{\mathbb{R}_+} e^{-bu}\, e^{-bv}\, d_\ell(T-s,u)\, d_\ell(T-s,v)\, A(u,v)\,du\,dv, \tag{A.1} \]

where we have introduced

\[ A(u,v) := \int_{\mathbb{R}}\int_{\mathbb{R}} \omega(v-z)\, q(z,y)\, \omega(u-y)\,dy\,dz, \qquad u,v \in \mathbb{R}_+. \]

By applying integration by parts (repeatedly), and since ω'' is null, we obtain

\[ A(u,v) = \int_{\mathbb{R}} \omega(v-z)\left(\int_{\mathbb{R}} e^{-k|z-y|}\,\omega(u-y)\,dy\right)dz = \int_{\mathbb{R}} \omega(v-z)\left(\int_{-\infty}^z e^{-k(z-y)}\,\omega(u-y)\,dy + \int_z^\infty e^{-k(y-z)}\,\omega(u-y)\,dy\right)dz \]
\[ = \frac{2}{k}\int_{\mathbb{R}} \omega(v-z)\,\omega(u-z)\,dz. \tag{A.2} \]

By substituting equation (A.2) into (A.1), we get that

\[ \Sigma^2(s) = \frac{2a^2}{k}\int_{\mathbb{R}_+}\int_{\mathbb{R}_+}\int_{\mathbb{R}} e^{-bu}\, e^{-bv}\, d_\ell(T-s,u)\, d_\ell(T-s,v)\,\omega(v-z)\,\omega(u-z)\,dz\,du\,dv \]
\[ = \frac{2a^2}{k}\int_{\mathbb{R}}\left(\int_{\mathbb{R}_+} e^{-bu}\, d_\ell(T-s,u)\,\omega(u-z)\,du\right)^2 dz = \frac{2a^2}{k\ell^2}\int_{\mathbb{R}}\left(\int_{T-s}^{T-s+\ell} e^{-bu}\,\omega(u-z)\,du\right)^2 dz, \]

where we used the definition of d_ℓ in Lemma 3.1. By integration by parts, we get

\[ \Sigma^2(s) = \frac{2a^2}{k\ell^2 b^4}\int_{\mathbb{R}} \Big( e^{-b(T-s)}\big(b\,\omega(T-s-z) + \omega'(T-s-z)\big) - e^{-b(T-s+\ell)}\big(b\,\omega(T-s+\ell-z) + \omega'(T-s+\ell-z)\big) \Big)^2 dz \]
\[ = \frac{2a^2}{k\ell^2 b^4}\left( e^{-2b(T-s)}\, B_1(s) - 2\, e^{-b(T-s)}\, e^{-b(T-s+\ell)}\, B_2(s) + e^{-2b(T-s+\ell)}\, B_3(s) \right), \]

where we introduced

\[ B_1(s) := \int_{\mathbb{R}} \big(b\,\omega(T-s-z) + \omega'(T-s-z)\big)^2\, dz, \]
\[ B_2(s) := \int_{\mathbb{R}} \big(b\,\omega(T-s-z) + \omega'(T-s-z)\big)\big(b\,\omega(T-s+\ell-z) + \omega'(T-s+\ell-z)\big)\, dz, \]
\[ B_3(s) := \int_{\mathbb{R}} \big(b\,\omega(T-s+\ell-z) + \omega'(T-s+\ell-z)\big)^2\, dz. \]
By using the definition of ω in equation (5.3), we get that

\[ B_1(s) = \int_{T-s-1}^{T-s+1} \big(b\,(1-|T-s-z|) - \operatorname{sgn}(T-s-z)\big)^2\, dz = \int_{T-s-1}^{T-s} \big(b\,(1-T+s+z) - 1\big)^2\, dz + \int_{T-s}^{T-s+1} \big(b\,(1+T-s-z) + 1\big)^2\, dz = \frac{2}{3}\big(b^2+3\big), \]

where sgn denotes the sign function. Similarly,

\[ B_2(s) = \frac{b^2}{6}\big(3\ell^3 - 6\ell^2 + 4\big) - 3\ell + 2, \qquad B_3(s) = \frac{2}{3}\big(b^2+3\big). \]

By substituting these findings and rearranging the terms, we get that

\[ \Sigma^2(s) = \frac{2a^2}{k\, b^4 \ell^2}\left\{ \frac{2}{3}\big(b^2+3\big)\big(1+e^{-2b\ell}\big) - 2\, e^{-b\ell}\left( \frac{b^2}{6}\big(3\ell^3 - 6\ell^2 + 4\big) - 3\ell + 2 \right) \right\} e^{-2b(T-s)}, \]

which concludes the proof.

A.8 Proof of Theorem 6.1

We follow a similar approach to [24, Section 8.5]. Let ν_i ≥ 0 be such that n_i = 2d + ν_i for i = 1, …, L−1. For I_d the identity matrix of dimension d, we define the following weights:

\[ V_1 := \begin{bmatrix} I_d & -I_d & O_1 \end{bmatrix}^\top, \qquad V_i := \begin{bmatrix} I_d & -I_d & O_i \end{bmatrix}^\top \begin{bmatrix} I_d & -I_d & O_{i-1} \end{bmatrix}, \quad i = 2, \ldots, L-1, \qquad V_L := A \begin{bmatrix} I_d & -I_d & O_{L-1} \end{bmatrix}, \]

where ⊤ denotes the transpose operator. Here O_i ∈ ℝ^{d×ν_i} are matrices with all entries equal to 0, compensating the matrix dimensions in such a way that V_i ∈ ℝ^{n_i × n_{i−1}} for i = 1, …, L−1 (with n₀ = d). By considering zero bias vectors v_i, the linear maps H_i introduced in the neural network definition in equation (4.1) then coincide with the matrices V_i.

We observe that for every x ∈ ℝ^d, the ReLU activation function satisfies

\[ x = \rho(x) - \rho(-x) = \begin{bmatrix} I_d & -I_d \end{bmatrix} \rho\left( \begin{bmatrix} I_d & -I_d \end{bmatrix}^\top x \right), \]

where the activation function acts componentwise. By a straightforward calculation, one can then see that the neural network defined here satisfies the equality N(x) = Ax for every x ∈ ℝ^d, which means that it acts on x as a linear map.
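The construction is easy to verify numerically. Below is a minimal pure-Python sketch of the L = 2 case (ν_i = 0, so the single hidden layer has width 2d); the function names are ours.

```python
def relu(v):
    return [max(x, 0.0) for x in v]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def linear_relu_net(A, x):
    """ReLU network realizing x -> A x via the identity x = rho(x) - rho(-x):
    hidden pre-activation is [x; -x] (weights [I_d; -I_d]),
    output weights are A [I_d  -I_d]."""
    hidden = relu(list(x) + [-v for v in x])              # [rho(x); rho(-x)]
    V_out = [list(row) + [-a for a in row] for row in A]  # A [I_d  -I_d]
    return matvec(V_out, hidden)
```

Taking A to be the zero matrix gives a (trivially) non-injective ReLU network, the example mentioned after Theorem 6.1.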
References

[1] Andresen, Arne, Steen Koekebakker and Sjur Westgaard (2010). Modeling electricity forward prices using the multivariate normal inverse Gaussian distribution. The Journal of Energy Markets, 3(3), 3.
[2] Barth, Andrea and Annika Lang (2012). Simulation of stochastic partial differential equations using finite element methods. Stochastics An International Journal of Probability and Stochastic Processes, 84(2-3), 217-231.
[3] Barth, Andrea and Annika Lang (2012). Multilevel Monte Carlo method with applications to stochastic partial differential equations. International Journal of Computer Mathematics, 89(18), 2479-2498.
[4] Barth, Andrea, Annika Lang and Christoph Schwab (2013). Multilevel Monte Carlo method for parabolic stochastic partial differential equations. BIT Numerical Mathematics, 53(1), 3-27.
[5] Barth, Andrea and Fred E. Benth (2014). The forward dynamics in energy markets – infinite-dimensional modelling and simulation. Stochastics An International Journal of Probability and Stochastic Processes, 86(6), 932-966.
[6] Bayer, Christian and Benjamin Stemper (2018). Deep calibration of rough stochastic volatility models. arXiv preprint arXiv:1810.03399.
[7] Bayer, Christian, Blanka Horvath, Aitor Muguruza, Benjamin Stemper and Mehdi Tomas (2019). On deep calibration of (rough) stochastic volatility models. arXiv preprint arXiv:1908.08806.
[8] Benth, Fred E. (2015). Kriging smooth energy futures curves. Energy Risk.
[9] Benth, Fred E. and Steen Koekebakker (2008). Stochastic modeling of financial electricity contracts. Energy Economics, 30(3), 1116-1157.
[10] Benth, Fred E., Jūratė Š. Benth and Steen Koekebakker (2008). Stochastic Modelling of Electricity and Related Markets. Vol. 11. World Scientific.
[11] Benth, Fred E. and Paul Krühner (2014). Representation of infinite-dimensional forward price models in commodity markets. Communications in Mathematics and Statistics, 2(1), 47-106.
[12] Benth, Fred E. and Paul Krühner (2015). Derivatives pricing in energy markets: an infinite-dimensional approach. SIAM Journal on Financial Mathematics, 6(1), 825-869.
[13] Benth, Fred E. and Florentina Paraschiv (2018). A space-time random field model for electricity forward prices. Journal of Banking & Finance, 95, 203-216.
[14] Bühler, Hans, Lukas Gonon, Josef Teichmann and Ben Wood (2019). Deep hedging. Quantitative Finance, 19(8), 1271-1291.
[15] Carmona, René and Sergey Nadtochiy (2012). Tangent Lévy market models. Finance and Stochastics, 16(1), 63-104.
[16] Clewlow, Les and Chris Strickland (2000). Energy Derivatives: Pricing and Risk Management. Lacima Publications, London.
[17] Cuchiero, Christa, Wahid Khosrawi and Josef Teichmann (2020). A generative adversarial network approach to calibration of local stochastic volatility models. arXiv preprint arXiv:2005.02505.
[18] Da Prato, Giuseppe and Jerzy Zabczyk (2014). Stochastic Equations in Infinite Dimensions. Cambridge University Press.
[19] De Spiegeleer, Jan, Dilip B. Madan, Sofie Reyners and Wim Schoutens (2018). Machine learning for quantitative finance: fast derivative pricing, hedging and fitting. Quantitative Finance, 18(10), 1635-1643.
[20] Ferguson, Ryan and Andrew Green (2018). Deeply learning derivatives. arXiv preprint arXiv:1809.02233.
[21] Filipović, Damir (2001). Consistency Problems for Heath-Jarrow-Morton Interest Rate Models. Lecture Notes in Mathematics, vol. 1760. Springer, Berlin.
[22] Frestad, Dennis (2008). Common and unique factors influencing daily swap returns in the Nordic electricity market, 1997–2005. Energy Economics, 30(3), 1081-1097.
[23] Goodfellow, Ian, Yoshua Bengio and Aaron Courville (2016). Deep Learning. MIT Press.
[24] Gottschling, Nina M., Vegard Antun, Ben Adcock and Anders C. Hansen (2020). The troublesome kernel: why deep learning for inverse problems is typically unstable. arXiv preprint arXiv:2001.01258.
[25] Heath, David, Robert Jarrow and Andrew Morton (1992). Bond pricing and the term structure of interest rates: A new methodology for contingent claims valuation. Econometrica: Journal of the Econometric Society, 77-105.
[26] Henry-Labordere, Pierre (2017). Deep primal-dual algorithm for BSDEs: Applications of machine learning to CVA and IM. Available at SSRN 3071506.
[27] Hernandez, Andres (2016). Model calibration with neural networks. Available at SSRN 2812140.
[28] Higham, Catherine F. and Desmond J. Higham (2019). Deep learning: An introduction for applied mathematicians. SIAM Review, 61(4), 860-891.
[29] Horvath, Blanka, Aitor Muguruza and Mehdi Tomas (2019). Deep learning volatility. Available at SSRN 3322085.
[30] Kallsen, Jan and Paul Krühner (2015). On a Heath-Jarrow-Morton approach for stock options. Finance and Stochastics, 19(3), 583-615.
[31] Koekebakker, Steen and Fridthjof Ollmar (2005). Forward curve dynamics in the Nordic electricity market. Managerial Finance, 31(6), 73-94.
[32] Kondratyev, Alexei (2018). Learning curve dynamics with artificial neural networks. Available at SSRN 3041232.
[33] Kovács, Mihály, Stig Larsson and Fredrik Lindgren (2010). Strong convergence of the finite element method with truncated noise for semilinear parabolic stochastic equations with additive noise. Numerical Algorithms, 53(2-3), 309-320.
[34] Nelson, Charles R. and Andrew F. Siegel (1987). Parsimonious modeling of yield curves.
Journal of business: 473-489.[35] Peszat, Szymon and Jerzy Zabczyk (2007).
Stochastic Partial Differential Equations withLévy Noise: An evolution equation approach. (Vol. 113). Cambridge University Press.[36] Rynne, Bryan and Martin A. Youngson (2013).
Linear Functional Analysis.
Springer Sci-ence & Business Media.[37] Tappe, Stefan (2012).
Some refinements of existence results for SPDEs driven by Wienerprocesses and Poisson random measures.
International Journal of Stochastic Analysis,2012.[38] Weinan, E, Jiequn Han and Arnulf Jentzen (2017).