Extensive networks would eliminate the demand for pricing formulas
Jaegi Jeon^a, Kyunghyun Park^a, Jeonggyu Huh^{b,∗}
^a Department of Mathematical Sciences, Seoul National University, Seoul 08826, Korea
^b Department of Statistics, Chonnam National University, Gwangju 61186, Korea
Abstract
In this study, we generate a large number of implied volatilities for the Stochastic Alpha Beta Rho (SABR) model using a graphics processing unit (GPU) based simulation and enable an extensive neural network to learn them. This model does not have any exact pricing formulas for vanilla options, and neural networks have an outstanding ability to approximate various functions. Surprisingly, the network reduces the simulation noises by itself, thereby achieving accuracy comparable to that of the Monte-Carlo simulation; such high accuracy cannot be attained via the existing approximate formulas. Moreover, the network is as efficient as the approaches based on the formulas. Judged on this accuracy and efficiency, extensive networks can eliminate the necessity of the pricing formulas for the SABR model. Another significant contribution is a novel method, based on nonlinear regression, to examine the prediction errors. This approach is easily extendable to other pricing models for which it is hard to derive analytic formulas.
Keywords: efficient pricing; deep learning; SABR model; nonlinear regression; GPU-based simulation; neural network
1. Introduction
Neural networks are often employed for regression because they have an outstanding ability to approximate a wide range of functions (refer to Cybenko [1] and Hornik et al. [2] for the celebrated universal approximation theorem). Only a decade ago, simple models, such as linear models, were preferred over neural networks: most researchers believed that too many parameters led to notorious overfitting, which contributed to an artificial intelligence winter in the past. However, as numerous techniques have been invented to prevent such an occurrence, most researchers now readily utilize such networks in their research. For reviews of the literature on the application of networks in finance, refer to Ruf and Wang [3] for network-based option pricing and Henrique et al. [4] for market predictions.

Since the work of Hutchinson et al. [5], many researchers have studied artificial neural networks that predict the option prices c (or the implied volatilities σ^I) of particular parametric models, such as the Black-Scholes model [6], the Heston model [7], and the SABR model [8]. We selected six related studies [9, 10, 11, 12, 13, 14] and summarized their approaches in Tables 1 and 2. Interestingly, we can observe similar propensities in the studies: they mainly focus on models that already provide efficient pricing formulas for vanilla options, such as the Black-Scholes model and the Heston model. For instance, the Heston model gives a closed-form characteristic function, enabling cost-effective option pricing through a Fourier transform [15]. Note that all works other than McGhee [12] are associated with either the Black-Scholes model or the Heston model. This may be because training samples must be generated exhaustively, which favors models whose prices do not require numerical algorithms. We, however, think that such contributions are somewhat marginal from a practical perspective, as the option prices can already be obtained efficiently even without the networks.

∗ Corresponding author. E-mail address: [email protected]
Preprint submitted to arXiv, January 25, 2021.

[Table 1 near here. Caption: This table outlines various methods to train neural networks using option prices c (or implied volatilities σ^I) in Culkin and Das [9], Broström and Kristiansson [10], Ferguson and Green [11], and McGhee [12]; the compared attributes are the base model, option type, pricing method, network inputs and outputs, numbers and precision of training and test samples, training configuration, and test losses. If there are several results in a work, only the best is written here. The base models are expressed as follows: df_t = (r − q) f_t dt + ᾱ f_t dW_t (Black-Scholes); df_t = (r − q) f_t dt + √(y_t) f_t dW_t, dy_t = κ(ȳ − y_t) dt + ν √(y_t) dZ_t (Heston); df_t = α_t f_t^β dW_t, dα_t = ν α_t dZ_t (Stochastic Alpha Beta Rho [SABR]), where dW_t dZ_t = ρ dt. K and T are the strike and maturity of the vanilla option, respectively. LR, SGD, MSFE, and MSPE stand for the learning rate, the stochastic gradient descent, the mean squared fitting error, and the mean squared prediction error, respectively (refer to Section 2 for the MSFE and MSPE). Finally, we use prefixes of the international system of units: k = 10^3 and M = 10^3 k. The mark [?] indicates that the value is inferred from the context or the figures in the work.]

[Table 2 near here. Caption: This table summarizes methods to train neural networks using option prices c (or implied volatilities σ^I) in Liu et al. [13], Hirsa et al. [14], and our method, with the same attributes, base models, and abbreviations as in Table 1. The mark [?] signifies that the value is deduced from the context or the figures in the work.]

The exception, McGhee [12], deals with the SABR model, for which no exact formula is available for the true price c_true. Thus, a finite difference method (FDM) of second order in space and first order in time was utilized to produce an approximation c_approx of c_true. Consequently, the network proposed by that study produces outcomes much more quickly than the FDM and outperforms the well-known approximation of Hagan et al. [8] in terms of accuracy. Nonetheless, McGhee only tries to make the predicted value c_net of the network come close to c_approx. In essence, the prediction error c_net − c_true is not considered in the study, and only the reduction of the fitting error c_net − c_approx is studied; nevertheless, c_net should be close to c_true (i.e., not to c_approx). Moreover, the research is not conducted systematically: we could not find any mention of the number of epochs, the batch size, the weight initialization method, or the loss values for the training and test datasets. In particular, the types of neural networks tested in the research are fairly limited because the number of hidden layers is fixed at one.

We believe that it is desirable to choose a parametric model without an exact pricing formula for vanilla options and to investigate a training method of neural networks using numerous option prices for that model. Therefore, we decided to study the SABR model. Furthermore, a pricing method to generate big data should be efficient enough and easily applicable to a wide range of models, such as the rough volatility model [16]. Standard procedures satisfying these requirements are the FDM and the Monte-Carlo simulation (MC).
In particular, when the number of factors of the underlying model is smaller than four, the FDM is usually more efficient and stable than the MC (see Wilmott [17]). This suggests that the FDM may be more appropriate than the MC for pricing vanilla options under the SABR model because the model has two factors. Nevertheless, we chose the MC to produce c_approx because we think that neural networks have the potential to filter out the symmetric noises caused by the MC, whereas they are unable to rectify the bias caused by the FDM. In other words, we choose the MC rather than the FDM because E[c_approx] = c_true in the MC case, but E[c_approx] ≠ c_true in the FDM case. For example, Ferguson and Green [11] made two datasets using the MC, one precise but small and the other big yet imprecise, and trained a network on each. Interestingly, the network trained on the larger and less precise dataset gives better results. This outcome indicates that the network can reduce the MC noises by integrating the information in the larger dataset. Thus, we also expect that the MC errors are reduced by a neural network provided the training data are sufficiently large.

Furthermore, based on nonlinear regression analysis, we propose a novel method to indirectly estimate the prediction error c_net − c_true even though c_true cannot be obtained. In other studies, the prediction error is usually expected to be greater than the approximation error c_approx − c_true because it is the sum of the approximation error and the fitting error c_net − c_approx that arises during training. However, neural networks are able to remove a substantial part of the approximation error and produce c_net close to c_true. The method developed in this study is an invaluable tool to evaluate how close c_net is to c_true. When the test results are analyzed with this approach, the accuracy of our network is estimated to be comparable to that of about 13 million MC simulations.

Further, it is noticeable in the literature that, although similar approaches are adopted for the same model, the losses on test data differ considerably depending on the details of the training methods. This is visible in the mean squared fitting errors (MSFE) of Culkin and Das [9], Broström and Kristiansson [10], Liu et al. [13], and Hirsa et al. [14] for the Black-Scholes model in Tables 1 and 2: the MSFEs of Broström and Kristiansson and of Liu et al. appear superior to those of Culkin and Das and of Hirsa et al. This may be because the latter networks have narrower structures or learn from smaller datasets than the former: the former have 400 nodes per layer, whereas the latter have 100 or 140 nodes, and the training data sizes of the former are 800 or 900 thousand, while that of the latter is 240 thousand. Either or both of these differences can produce the performance gaps. Similar gaps can also be confirmed through the MSFEs for the Heston model of Liu et al. and Hirsa et al. This implies that it is not easy to fit neural networks perfectly to the generated data.
For a better goodness-of-fit, one should consider several factors, such as the data size, the network architecture, and the optimization method. Therefore, we try to reduce the MSFE in our experiment by generating enormous data and tuning various hyperparameters for a considerably accurate fit.

In summary, this study contributes to the literature in the following ways. First, we generate numerous data using GPU-based simulations and verify that the network trained with the data provides extraordinarily accurate results. Second, we provide a novel method to analyze the prediction errors c_net − c_true by proposing an unbiased and consistent estimator associated with the prediction error. We make a new attempt using nonlinear regression, which forms the theoretical basis for the phenomenon that the network produces prediction errors smaller than the approximation errors c_approx − c_true.

The remainder of this paper is organized as follows. In Section 2, a new method of analyzing prediction errors with nonlinear regression is introduced. We then discuss the pricing methods of options in the SABR model and detailed methods of generating data for network learning in Section 3. In Section 4, we train neural networks of various structures and assess the impact of training data size on network performance. Finally, Section 5 concludes the study.
2. Nonlinear regression of numerous implied volatilities
As mentioned in the introduction, we only focus on parametric models that do not have any exact pricing formulas. Therefore, an important step in this work is to generate numerous approximate implied volatilities σ^I_{approx,l} for the exact volatilities σ^I_{true,l} (l = 1, 2, ..., L) under a parametric model; these are the material used to train the networks. Each volatility σ^I_{approx,l} is generated at randomly chosen parameters θ_{l,1}, θ_{l,2}, ..., θ_{l,n_θ}, maturity T_l, and strike K_l.

As σ^I_{true,l} is determined by θ_{l,1}, θ_{l,2}, ..., θ_{l,n_θ}, T_l, and K_l, there exists a function h̃ such that σ^I_{true,l} = h̃(x_l) for x_l = (θ_{l,1}, θ_{l,2}, ..., θ_{l,n_θ}, T_l, K_l). According to the renowned universal approximation theorem, it is assumed that a network with a sufficient number of weights Γ = {γ_1, γ_2, ..., γ_{n_γ}} can accurately approximate the function value h̃(x_l) by its output h(x_l; Γ). Thus, if M simulations are run to obtain the approximate volatility σ^I_{approx,l}, the central limit theorem yields the relation

    σ^I_{approx,l}(M) = h(x_l; Γ) + ε̃_l(M),

where ε̃_l(M) ~ N(0, β_l/M) for some β_l > 0 describing the simulation noise of σ^I_{approx,l} around σ^I_{true,l}, and the notation ω(M) is used to emphasize that ω is a random variable depending on M.

From the perspective of nonlinear ordinary regression (Montgomery et al. [19]), an unbiased and consistent estimator Γ̂ of Γ is

    Γ̂ = argmin_Γ L(Γ; M),   where   L(Γ; M) = Σ_{l=1}^{L} ( h(x_l; Γ) − σ^I_{approx,l}(M) )².

The Jacobian and Hessian matrices J and H of L(Γ̂; M) should satisfy the optimality conditions that J is zero and H is positive definite. On the other hand, we can derive

    J = ε̃ Q,   H ≈ Q^T Q,   (1)

(Hansen et al. [20]), where ε̃ = [ε̃_1 ε̃_2 ··· ε̃_L] and Q is the L × n_γ matrix with entries Q_{l,k} = ∂h(x_l; Γ̂)/∂γ_k. Furthermore, Γ̂ follows a multivariate normal distribution,

    Γ̂ ~ N( Γ, (1/(LM)) W^{-1} W_β W^{-1} ),

where W = (1/L) Q^T Q, W_β = (1/L) Q^T B Q, and B is the diagonal matrix with l-th diagonal entry β_l. Moreover, owing to the law of large numbers, the elements of W and W_β converge in probability to their respective expected values as L → ∞; in other words,

    W_{k,k'}   = ⟨ ∂h(x_l; Γ̂)/∂γ_k · ∂h(x_l; Γ̂)/∂γ_{k'} ⟩_{l,L}   →p   ⟨ ∂h(x_l; Γ̂)/∂γ_k · ∂h(x_l; Γ̂)/∂γ_{k'} ⟩_l,
    W_{β,k,k'} = ⟨ β_l ∂h(x_l; Γ̂)/∂γ_k · ∂h(x_l; Γ̂)/∂γ_{k'} ⟩_{l,L}   →p   ⟨ β_l ∂h(x_l; Γ̂)/∂γ_k · ∂h(x_l; Γ̂)/∂γ_{k'} ⟩_l

as L → ∞, where ⟨φ_l⟩_{l,L} and ⟨φ_l⟩_l denote a sample mean of size L and the population mean of a random variable φ_l, respectively; that is, ⟨φ_l⟩_{l,L} = (1/L) Σ_{l=1}^{L} φ_l and ⟨φ_l⟩_l = E[φ_l]. Notably, E[⟨φ_l⟩_{l,L}] = ⟨φ_l⟩_l. Accordingly, if L ≫ 1, then W and W_β barely change even though L changes a little.
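To make the covariance formula concrete, here is a worked special case (added for illustration; it is not part of the original derivation): reduce the network to a single constant weight, h(x_l; γ) = γ. Then the least-squares estimator and its variance can be computed directly and compared with the sandwich formula:

```latex
% One-parameter sanity check of \hat\Gamma \sim N\!\bigl(\Gamma,\ \tfrac{1}{LM} W^{-1} W_\beta W^{-1}\bigr).
% With h(x_l;\gamma)=\gamma, least squares reduces to the plain sample mean of the targets.
\hat\gamma
  = \operatorname*{argmin}_{\gamma}\sum_{l=1}^{L}\bigl(\gamma-\sigma^{I}_{\mathrm{approx},l}(M)\bigr)^{2}
  = \frac{1}{L}\sum_{l=1}^{L}\sigma^{I}_{\mathrm{approx},l}(M),
\qquad
\operatorname{Var}[\hat\gamma]
  = \frac{1}{L^{2}}\sum_{l=1}^{L}\frac{\beta_l}{M}
  = \frac{\langle\beta_l\rangle_{l,L}}{LM}.
```

Here Q is a column of ones, so W = 1 and W_β = ⟨β_l⟩_{l,L}, and the general expression (1/(LM)) W^{-1} W_β W^{-1} reproduces the same variance.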
We now consider another dataset for an out-of-sample test, consisting of L' volatilities σ^I_{approx,l} that approximate σ^I_{true,l} (l = 1, 2, ..., L'), each of which is generated from M' simulations. With regard to this dataset, the fitting error ε̃_{fit,l}, the prediction error ε̃_{pred,l}, and the approximation error ε̃_{approx,l} are defined as follows:

    ε̃_{fit,l}(Γ̂, M')  = σ^I_{net,l}(Γ̂) − σ^I_{approx,l}(M'),
    ε̃_{pred,l}(Γ̂)     = σ^I_{net,l}(Γ̂) − σ^I_{true,l},
    ε̃_{approx,l}(M')   = σ^I_{approx,l}(M') − σ^I_{true,l},

where σ^I_{net,l}(Γ̂) = h(x_l; Γ̂). Notably, ε̃_{pred,l} can be decomposed into ε̃_{fit,l} and ε̃_{approx,l}; that is,

    ε̃_{pred,l}(Γ̂) = ε̃_{fit,l}(Γ̂, M') + ε̃_{approx,l}(M').

Note that finding ε̃_{fit,l} is straightforward, whereas determining ε̃_{pred,l} and ε̃_{approx,l} is not simple because σ^I_{true,l} is unknown. Many researchers intuitively expect |ε̃_{pred,l}| > |ε̃_{approx,l}| because they guess that the signs of ε̃_{fit,l} and ε̃_{approx,l} are the same. Nonetheless, it would be best if ε̃_{fit,l} canceled out a part of ε̃_{approx,l} so that |ε̃_{pred,l}| < |ε̃_{approx,l}|. This is possible only when the neural network can reduce the noises in ε̃_{approx,l} and find more plausible values by itself. We show that such self-correction of networks is feasible, and it will be demonstrated in the tests of Section 4.

Additionally, the errors ε̃_{fit,l}, ε̃_{pred,l}, and ε̃_{approx,l} follow their respective normal distributions:

    ε̃_{fit,l}(Γ̂, M')  ~ N( 0, β_l/M' + (1/(LM)) q_l W^{-1} W_β W^{-1} q_l^T ),
    ε̃_{approx,l}(M')   ~ N( 0, β_l/M' ),
    ε̃_{pred,l}(Γ̂)     ~ N( 0, (1/(LM)) q_l W^{-1} W_β W^{-1} q_l^T ),   (2)

where q_l is the l-th row vector of Q. As mentioned above, it is extremely important to note that finding ε̃_{pred,l} is infeasible because σ^I_{true,l} is unknown. This problem is serious because we need ε̃_{pred,l} to evaluate the performance of the network. Although some might presume that the difficulty can be circumvented by computing Q, W, and W_β, that computation is severely unstable owing to the countless parameters of the network.

To resolve the problem, we define three mean squared errors (MSE) for the test dataset, namely the mean squared fitting error E_fit (MSFE), the mean squared prediction error E_pred (MSPE), and the mean squared approximation error E_approx (MSAE). They are given by

    E_fit(Γ̂, M'; L')  = (1/L') Σ_{l=1}^{L'} ε̃²_{fit,l}(Γ̂, M'),
    E_pred(Γ̂; L')     = (1/L') Σ_{l=1}^{L'} ε̃²_{pred,l}(Γ̂),
    E_approx(M'; L')   = (1/L') Σ_{l=1}^{L'} ε̃²_{approx,l}(M').

Among them, the MSPE E_pred can serve as an indicator of the performance of the network with the weights Γ̂. However, it also depends on the particular test set. Therefore, the following statistic will be utilized as an indicator to gauge performance:

    E_pred(Γ̂) = E[ E_pred(Γ̂; L') ],

which equals ⟨ε̃²_{pred,l}(Γ̂)⟩_l because E_pred(Γ̂; L') = ⟨ε̃²_{pred,l}(Γ̂)⟩_{l,L'}. We will explain the estimation of E_pred(Γ̂) in the later sections. The propositions below describe the expectations and variances of E_fit, E_pred, and E_approx.
Proposition 1. The expectations of E_fit, E_pred, and E_approx are given by

    E[ E_fit(Γ̂, M'; L') ]  = E[ E_approx(M'; L') ] + E[ E_pred(Γ̂; L') ],
    E[ E_approx(M'; L') ]   = (1/M') ⟨β_l⟩_{l,L'},
    E[ E_pred(Γ̂; L') ]     = (1/(LM)) ⟨ q_l W^{-1} W_β W^{-1} q_l^T ⟩_{l,L'}.

Proof. As E[ε̃_{fit,l}] = 0,

    E[ E_fit(Γ̂, M'; L') ] = (1/L') Σ_{l=1}^{L'} Var[ ε̃_{fit,l}(Γ̂, M') ]
                          = (1/L') Σ_{l=1}^{L'} ( β_l/M' + (1/(LM)) q_l W^{-1} W_β W^{-1} q_l^T )
                          = (1/M') ⟨β_l⟩_{l,L'} + (1/(LM)) ⟨ q_l W^{-1} W_β W^{-1} q_l^T ⟩_{l,L'}.

Similarly, E[E_approx] and E[E_pred] are calculated as (1/M') ⟨β_l⟩_{l,L'} and (1/(LM)) ⟨ q_l W^{-1} W_β W^{-1} q_l^T ⟩_{l,L'}, respectively.
Proposition 2. The variances of E_fit, E_pred, and E_approx are

    Var[ E_fit(Γ̂, M'; L') ]  = Var[ E_approx(M'; L') ] + Var[ E_pred(Γ̂; L') ] + (4/(L' L M' M)) ⟨ β_l q_l W^{-1} W_β W^{-1} q_l^T ⟩_{l,L'},
    Var[ E_approx(M'; L') ]   = (2/(L' (M')²)) ⟨β_l²⟩_{l,L'},
    Var[ E_pred(Γ̂; L') ]     = (2/(L' L² M²)) ⟨ ( q_l W^{-1} W_β W^{-1} q_l^T )² ⟩_{l,L'}.

Proof. The square X² of a normal random variable X ~ N(0, σ²) follows a gamma distribution Γ(1/2, 2σ²), which leads to

    ε̃²_{fit,l}(Γ̂, M') ~ Γ( 1/2, 2( β_l/M' + (1/(LM)) q_l W^{-1} W_β W^{-1} q_l^T ) ).

As E[Y] = ab and Var[Y] = ab² for Y ~ Γ(a, b),

    E[ε̃²_{fit,l}] = β_l/M' + (1/(LM)) q_l W^{-1} W_β W^{-1} q_l^T,   and   Var[ε̃²_{fit,l}] = 2( β_l/M' + (1/(LM)) q_l W^{-1} W_β W^{-1} q_l^T )².

As the ε̃²_{fit,l} are independent,

    Var[ E_fit(Γ̂, M'; L') ] = (1/(L')²) Σ_{l=1}^{L'} Var[ ε̃²_{fit,l}(Γ̂, M') ]
     = (1/(L')²) Σ_{l=1}^{L'} ( 2β_l²/(M')² + (2/(L² M²)) ( q_l W^{-1} W_β W^{-1} q_l^T )² + (4β_l/(L M' M)) q_l W^{-1} W_β W^{-1} q_l^T )
     = (2/(L'(M')²)) ⟨β_l²⟩_{l,L'} + (2/(L' L² M²)) ⟨ ( q_l W^{-1} W_β W^{-1} q_l^T )² ⟩_{l,L'} + (4/(L' L M' M)) ⟨ β_l q_l W^{-1} W_β W^{-1} q_l^T ⟩_{l,L'}.

Similarly, Var[E_approx] and Var[E_pred] are derived as (2/(L'(M')²)) ⟨β_l²⟩_{l,L'} and (2/(L' L² M²)) ⟨ ( q_l W^{-1} W_β W^{-1} q_l^T )² ⟩_{l,L'}, respectively.

Based on the propositions, the following theorem suggests an unbiased and consistent estimator of E_pred(Γ̂). The theorem requires two distinct test sets.
Theorem 3. The estimator

    Ê_pred(Γ̂) = ( M'_1 E_fit(Γ̂, M'_1; L'_1) − M'_2 E_fit(Γ̂, M'_2; L'_2) ) / ( M'_1 − M'_2 )

is unbiased and consistent for E_pred(Γ̂) whenever M'_1 ≠ M'_2. Here, (L'_1, L'_2) and (M'_1, M'_2) are the data lengths and the numbers of simulations of two distinct test sets, respectively. Further, the variance of Ê_pred(Γ̂) is given by

    Var[ Ê_pred(Γ̂) ] = ( M'_1/(M'_1 − M'_2) )² Var[ E_fit(Γ̂, M'_1; L'_1) ] + ( M'_2/(M'_1 − M'_2) )² Var[ E_fit(Γ̂, M'_2; L'_2) ],   (3)

where

    Var[ E_fit(Γ̂, M'; L') ] = (2/(L'(M')²)) ⟨β_l²⟩_{l,L'} + (2/(L' L² M²)) ⟨ ( q_l W^{-1} W_β W^{-1} q_l^T )² ⟩_{l,L'} + (4/(L' L M' M)) ⟨ β_l q_l W^{-1} W_β W^{-1} q_l^T ⟩_{l,L'}.

Proof. The unbiasedness of Ê_pred(Γ̂) with respect to E_pred(Γ̂) is exhibited as follows:

    E[ Ê_pred(Γ̂) ] = ( M'_1 E[ E_fit(Γ̂, M'_1; L'_1) ] − M'_2 E[ E_fit(Γ̂, M'_2; L'_2) ] ) / ( M'_1 − M'_2 )
     = ( ( ⟨β_l⟩_l + (M'_1/(LM)) ⟨ q_l W^{-1} W_β W^{-1} q_l^T ⟩_l ) − ( ⟨β_l⟩_l + (M'_2/(LM)) ⟨ q_l W^{-1} W_β W^{-1} q_l^T ⟩_l ) ) / ( M'_1 − M'_2 )
     = (1/(LM)) ⟨ q_l W^{-1} W_β W^{-1} q_l^T ⟩_l = E[ E_pred(Γ̂; L') ] = E_pred(Γ̂).

On the other hand, the form (3) of Var[Ê_pred(Γ̂)] is easily derived because E_fit(Γ̂, M'_1; L'_1) and E_fit(Γ̂, M'_2; L'_2) are independent of each other. Subsequently, by Proposition 2, because Var[E_fit(Γ̂, M'_1; L'_1)] and Var[E_fit(Γ̂, M'_2; L'_2)] converge to 0 as L'_1 and L'_2 go to infinity, Ê_pred(Γ̂) is consistent.

[Figure 1 near here; panels: (a) S&P 500, (b) KOSPI, each showing the estimates of α(T), ρ(T), and ν(T) against the baseline 2/√T.] Figure 1: This figure shows the estimates of α(T), ν(T), and ρ(T) for the SABR model where β(T) = 1. The values are obtained using the option data for the S&P 500 (left) and the KOSPI 200 (right) from April 2018 to March 2019.
When applying the theorem above, (M'_1, M'_2) and (L'_1, L'_2) should be set so that M'_1 ≫ M'_2 and L'_1, L'_2 ≫ 1, which keeps the variance of Ê_pred(Γ̂) small relative to E_pred(Γ̂).
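The estimator of Theorem 3 can be checked numerically on a toy problem. The following sketch (our own illustration, not the paper's experiment) uses a polynomial least-squares fit as a stand-in for the network; the function h_true, the noise level beta, and all sample sizes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def h_true(x):
    # stand-in for the exact map sigma_true = h_tilde(x); purely illustrative
    return 0.2 + 0.1 * np.sin(3.0 * x) + 0.05 * x ** 2

def noisy_targets(x, n_sims, beta=0.04):
    # MC-style targets: unbiased, with variance beta / n_sims (cf. the CLT relation above)
    return h_true(x) + rng.normal(0.0, np.sqrt(beta / n_sims), size=x.shape)

# "network" stand-in: a flexible polynomial fitted by ordinary least squares
L, M = 20_000, 1_000                      # training length and simulations per target
x_train = rng.uniform(-1.0, 1.0, L)
model = np.polynomial.Polynomial.fit(x_train, noisy_targets(x_train, M), deg=8)

# two independent test sets with different simulation counts M1 != M2
L1, M1 = 200_000, 100_000
L2, M2 = 200_000, 1_000
x1 = rng.uniform(-1.0, 1.0, L1)
x2 = rng.uniform(-1.0, 1.0, L2)
E_fit1 = np.mean((model(x1) - noisy_targets(x1, M1)) ** 2)
E_fit2 = np.mean((model(x2) - noisy_targets(x2, M2)) ** 2)

# Theorem 3 estimator of the MSPE, compared with the truth (known here because h_true is known)
E_pred_hat = (M1 * E_fit1 - M2 * E_fit2) / (M1 - M2)
E_pred_true = np.mean((model(x1) - h_true(x1)) ** 2)
print(f"estimated MSPE: {E_pred_hat:.2e}   true MSPE: {E_pred_true:.2e}")
```

The two printed numbers agree up to sampling noise, even though the estimator never touches h_true; it only combines the two fitting errors.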
3. Data generation for network learning under the SABR model
The SABR model [8] is expressed as the following stochastic differential equation (SDE):

    df_t = α_t f_t^β dW_t,   dα_t = ν α_t dZ_t,

where f_t is the forward price of an underlying asset (e.g., a stock or an interest rate) at time t, and W_t and Z_t are Brownian motions correlated with ρ ∈ (−1, 1). The hidden state α_t and the parameters β, ρ, and ν of the model have their respective roles in determining the shape of the implied volatility surface σ^I(T, K) (see Rebonato et al. [21] for a more detailed explanation). The state α_t forms the backbone of the surface because a change of α_t causes a parallel shift of the surface. The volatility-of-volatility parameter ν handles the wings of the volatility surface because it controls the curvature of the surface. Conversely, the elasticity β and the correlation ρ play similar roles in adjusting the slopes of the skews on the surface. Thus, β is commonly fixed as a constant between 0 and 1 to reduce model complexity. Aesthetic considerations lead to the choices β = 0, β = 1/2, and β = 1, which are called the normal SABR, the CIR (named after Cox, Ingersoll, and Ross) SABR, and the log-normal SABR, respectively. It is known that such arbitrary choices of β hardly ever decrease the fitting performance of the SABR model [22, 21]. Likewise, Bartlett [23] developed a hedging method that is less sensitive to particular values of β. Following these studies, we choose the log-normal SABR (β = 1), in which case α_t can be regarded as the volatility of f_t.
Figure 2: These figures compare the implied volatilities provided by the Hagan formula (4) with the values of the MC for two parameter sets: α = 0.12, ν = 0.06, ρ = −0.17, T = 1.77 (left) and α = 0.16, ν = 3.67, ρ = −0.32, T = 0.07 (right). For the MC, we simulate 10 million paths with the time interval 0.002. The blue region indicates the 99% confidence interval for the MC.
The SABR model is mostly utilized as a model fitted to market volatilities maturity by maturity, for which the state α and the parameters β, ν, and ρ are usually parameterized as α(T), β(T), ν(T), and ρ(T). Figure 1 displays the estimates of α(T), ν(T), and ρ(T) when β(T) = 1. These are derived using the option data for the S&P 500 (left) and the KOSPI 200 (right) from April 2018 to March 2019. The calibration is performed by Korean Asset Pricing, a bond rating agency located in Korea. From the figure, one can observe that all α and ρ belong to (0, 2) and (−1, 0), respectively, and all values of ν are lower than the baseline 2/√T. Moreover, the values tend to become more unstable as T gets shorter, particularly for ν. This may be because the SABR model ignores short-term effects such as fast mean-reverting volatility.

Let us consider the fair price c of a vanilla option under the SABR model. Under the risk-neutral pricing framework [24], we can obtain the price by solving the integral

    c(t, K) = ∫_0^∞ q(f_T; K) p_t(f_T) df_T,

or the partial differential equation (PDE)

    ∂c/∂t + (1/2) α² f^{2β} ∂²c/∂f² + ρ ν α² f^β ∂²c/∂f∂α + (1/2) ν² α² ∂²c/∂α² = 0,   c(T, K) = q(f_T; K),

where p_t is the density of f_T, K and T are the strike and maturity of the option, respectively, and q(·; K) is the payoff function of the option with strike K. Regrettably, no exact pricing formula can be derived for the option because the integral and the PDE are fairly hard to solve analytically. Instead, it is possible to derive a wide range of approximate formulas for the implied volatilities of the options [8, 25, 26, 27, 28, 29]. In fact, this explains why the SABR model is so popular in practice.

Additionally, only for the case β = 1, we briefly mention the most well-known asymptotic formula for the implied volatility σ^I, which was found by Hagan et al. [8]:

    σ^I(T, K) ≈ α (z/χ(z)) { 1 + [ ρ α ν / 4 + (2 − 3ρ²) ν² / 24 ] T },   (4)

where

    z = (ν/α) log(f/K),   χ(z) = log{ ( √(1 − 2ρz + z²) + z − ρ ) / (1 − ρ) }.

As the above formula is derived using an asymptotic technique, it should be applied only to options with a short maturity T and a strike K close to f. If one of these assumptions does not hold, the formula yields inexact prices; but even when all of the assumptions are satisfied, it does not always provide consistent accuracy. Figure 2 compares the implied volatilities of the formula with the MC-based values: the accuracy can be comparable to that of the MC even when T is large (left), and it can be poor even when T is short (right). For the MC simulation, we simulate 10 million paths with the time interval 0.002, and the blue region indicates the 99% confidence interval for the MC. Nevertheless, formula (4) is still popular owing to its simplicity, especially in the global over-the-counter interest rate derivatives market.
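For concreteness, approximation (4) can be transcribed directly into code. The following is a minimal sketch of the formula as written above (not a production implementation); the at-the-money limit z/χ(z) → 1 is handled explicitly, and the example parameters are taken from the right panel of Figure 2:

```python
import numpy as np

def hagan_lognormal_vol(f, K, T, alpha, nu, rho):
    """Approximation (4): SABR implied volatility for beta = 1 (Hagan et al.)."""
    z = (nu / alpha) * np.log(f / K)
    if abs(z) < 1e-12:                      # at-the-money limit: z / chi(z) -> 1
        ratio = 1.0
    else:
        chi = np.log((np.sqrt(1.0 - 2.0 * rho * z + z * z) + z - rho) / (1.0 - rho))
        ratio = z / chi
    correction = 1.0 + (rho * nu * alpha / 4.0 + (2.0 - 3.0 * rho ** 2) * nu ** 2 / 24.0) * T
    return alpha * ratio * correction

# e.g., the right-panel setting of Figure 2 with f = 1 and a strike slightly above the forward
print(hagan_lognormal_vol(1.0, 1.02, 0.07, alpha=0.16, nu=3.67, rho=-0.32))
```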
In this subsection, we explain the method used to generate extensive approximate implied volatilities σ^I_{approx,l} for the exact implied volatilities σ^I_{true,l} (l = 1, ..., L) under the SABR model with β = 1; these data will be used to train and test the neural networks later. The volatilities σ^I_{approx,l} are grouped into sets of mn data to form L̃ surfaces σ^I_{approx,s}(T_{s,k_1}, K_{s,k_1,k_2}) with respect to T_{s,k_1} and K_{s,k_1,k_2}, where L̃ = L/(mn), s = 1, ..., L̃, k_1 = 1, ..., m, and k_2 = 1, ..., n. In other words, L is the total number of generated data, and L̃ is the number of volatility surfaces (a sort of partition of the data), each with mn grid points.

When constructing the grid points for the surfaces, K_{s,k_2} is set as a function of T_{s,k_1}; that is, K_{s,k_2} = K_{s,k_2}(T_{s,k_1}). This is intended to widen the strike range [K_{s,1}, K_{s,n}] as T_{s,k_1} gets larger; in other words,

    log(K_{s,n}/f_s) − log(K_{s,1}/f_s) ∝ std[ log(f_{T_{s,k_1},s}/f_s) ],

where f_{t,s} is the forward price at time t under the parameters α_s, ν̃_s (:= ν_s/√T_{s,k_1}), and ρ_s. Note that the parameter ν̃_s depends on T_{s,k_1}. As mentioned before, practitioners usually fit the SABR model to the market data σ^I(T) for each maturity T separately; if so, as shown in Figure 1, the estimates of ν tend to get larger as T becomes shorter. Hence, ν̃_s is set up in a way that captures this phenomenon.

Let us explain the construction process of the grid points in more detail. Suppose that data are generated up to the time T_last. The maturities T_{s,k_1} and strikes K_{s,k_1,k_2} for the s-th surface σ^I_{approx,s}(T_{s,k_1}, K_{s,k_1,k_2}) are randomly chosen as follows:

1. Set a time grid interval ΔT = T_last/m, initialize the first time point T_{s,1} randomly in (0, ΔT], and choose the other points equidistantly by T_{s,k_1} = T_{s,1} + (k_1 − 1) ΔT.
2. Determine the start point K_{s,1} and the end point K_{s,n} of the K_s-range by

    K_{s,1} = f_s exp( −(α_s²/(2ν̃_s²)) ( exp{ν̃_s² T_{s,k_1}} − 1 ) − η_f (α_s/ν̃_s) √( exp{ν̃_s² T_{s,k_1}} − 1 ) ),
    K_{s,n} = f_s exp( −(α_s²/(2ν̃_s²)) ( exp{ν̃_s² T_{s,k_1}} − 1 ) + η_f (α_s/ν̃_s) √( exp{ν̃_s² T_{s,k_1}} − 1 ) ),

   where η_f follows a uniform distribution on a predetermined interval, and the hidden state α_s and the parameters ν_s (not ν̃_s) and ρ_s are also sampled uniformly within predetermined limits; that is, α_s ~ U(α_min, α_max), ν_s ~ U(ν_min, ν_max), and ρ_s ~ U(ρ_min, ρ_max).
3. Equidistantly partition the interval [K_{s,1}, K_{s,n}] for each T_{s,k_1} by K_{s,k_1,k_2} = K_{s,1} + (k_2 − 1) ΔK, where ΔK = (K_{s,n} − K_{s,1})/n.

If the number L̃ of surfaces is reasonably large, the random numbers {(α_s, ν_s, ρ_s) : s = 1, ..., L̃} fill most of the parameter space {α_min ≤ α ≤ α_max} × {ν_min ≤ ν ≤ ν_max} × {ρ_min ≤ ρ ≤ ρ_max} evenly and densely. In addition, we suppose f_s = 1 because of the scaling property of f_{t,s} in the SABR model with β = 1, namely d(f_{t,s}/f_s) = α_{t,s} (f_{t,s}/f_s) dW_t. This implies that an option price c is homogeneous of degree one in f_s and K under the SABR model, so that c/f_s depends only on K/f_s (Garcia and Gençay [30]), which supports the assumption f_s = 1.

[Figure 3 near here. Caption: An illustration of the grid of α_s, ρ_s, ν̃_s, K_s, and T_s.]
Figure 3 illustrates the grids of α_s, ρ_s, ν̃_s, K_s, and T_s. The grid for α and ρ is partitioned evenly; the partition {T_{s,k_1}} is chosen randomly, but T_{s,1} cannot be less than 0 and T_{s,m} cannot be greater than T_last. The upper boundary of ν̃_s depends on T_{s,k_1} because it equals ν_max/√T_{s,k_1}, where ν_max is selected based on Figure 1. The strike boundary also relies on T_{s,k_1}, and it widens or narrows under the influence of the random variable η_f. All grid points inside the boundaries are split equidistantly.

To obtain the approximate implied volatility σ^I_{approx,s}(T_{s,k_1}, K_{s,k_1,k_2}), we simulate N paths of the SABR model (β = 1) under the parameters (α_s, ν̃_s, ρ_s) using the Monte-Carlo Euler scheme, which can be expressed as

    f_{t+Δt,s} = f_{t,s} + α_{t,s} f_{t,s} e_{t,s} √Δt,
    α_{t+Δt,s} = α_{t,s} + ν̃_s α_{t,s} ( ρ_s e_{t,s} √Δt + √(1 − ρ_s²) ẽ_{t,s} √Δt ),

where t = 0, Δt, 2Δt, ..., T_{s,m} − Δt, f_{0,s} = 1, and (e_{t,s}, ẽ_{t,s}) are independent standard normal random variables. Further, the MC gives an approximate price c_{approx,s,k_1,k_2} for the true option price c_{true,s,k_1,k_2} under the SABR model via

    c_{approx,s,k_1,k_2} = (1/N) Σ_{j=1}^{N} c_{approx,s,k_1,k_2,j} = (1/N) Σ_{j=1}^{N} q( f_{T_{s,k_1},s,j}; K_{s,k_1,k_2} ),

where the subscript j indicates that the value originates from the j-th path, and q(·; K_{s,k_1,k_2}) is the option payoff for strike K_{s,k_1,k_2}. The price c_{approx,s,k_1,k_2} is then converted into its implied volatility σ^I_{approx,s,k_1,k_2}. We use Powell's method, minimizing ( BS(σ^I_{approx,s,k_1,k_2}) − c_{approx,s,k_1,k_2} )², where BS(σ) is the Black-Scholes formula for the option. In this process, we only utilize out-of-the-money (OTM) call and put options. The simulated price of an OTM option explodes occasionally; therefore, if the standard deviation of the simulated payoffs exceeds 100 times their average, the data point is excluded from the experiment (approximately 3.6% of the data are excluded in this manner).
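The simulation and conversion steps above can be sketched as follows. This is a simplified single-point illustration (our own, assuming f_0 = 1 and zero rates); it inverts the out-of-the-money Black-Scholes price with Brent's root-finding method rather than the Powell minimization used in the paper, and it omits the antithetic variates and GPU parallelism:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def black_scholes_otm(f, K, T, sigma):
    """Undiscounted Black price of the OTM option (call if K >= f, put otherwise)."""
    d1 = (np.log(f / K) + 0.5 * sigma ** 2 * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    call = f * norm.cdf(d1) - K * norm.cdf(d2)
    return call if K >= f else call - (f - K)       # put via put-call parity

def sabr_mc_implied_vol(K, T, alpha, nu_tilde, rho, n_paths=100_000, dt=0.002, seed=0):
    """Euler scheme for the log-normal SABR model (beta = 1, f_0 = 1),
    followed by inversion of the OTM Black-Scholes price."""
    rng = np.random.default_rng(seed)
    f = np.ones(n_paths)
    a = np.full(n_paths, alpha)
    for _ in range(int(round(T / dt))):
        e1 = rng.standard_normal(n_paths)
        e2 = rng.standard_normal(n_paths)
        f = f + a * f * e1 * np.sqrt(dt)
        a = a + nu_tilde * a * (rho * e1 + np.sqrt(1.0 - rho ** 2) * e2) * np.sqrt(dt)
    payoff = np.maximum(f - K, 0.0) if K >= 1.0 else np.maximum(K - f, 0.0)
    price = payoff.mean()
    # solve BS(sigma) = price for the implied volatility
    return brentq(lambda s: black_scholes_otm(1.0, K, T, s) - price, 1e-4, 5.0)

print(sabr_mc_implied_vol(K=1.05, T=0.5, alpha=0.3, nu_tilde=0.6, rho=-0.4))
```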
The hyperparameters are chosen as follows: m = n = 20, α_min = 0.01, α_max = 2, and T_last = 2, while (ν_min, ν_max) and (ρ_min, ρ_max) are set to cover the market estimates in Figure 1. With these hyperparameters, we separately make 1.2M surfaces (N = 5 × 10^5, L = 4.8 × 10^8) for training, 0.1M surfaces (N = 5 × 10^5, L = 4 × 10^7) for validation, and 0.1M surfaces for a test with simulation accuracy Δt = 0.002 and N = 5 × 10^5. In fact, we generate 250k paths with the antithetic variate method in the experiment; it is known that this approach is superior to generating 500k paths without the method. Moreover, we create 0.25M surfaces with a higher accuracy (N = 1.25 × 10^7, L = 1 × 10^8) while keeping Δt at 0.002. These additional data are employed for the error analysis explained in the following section. For convenience, the four kinds of datasets are summarized in Table 3. To generate the datasets, we performed the MC in parallel on many GPUs (GeForce GTX 1080 Ti × 8, GeForce RTX 2080 Ti × 16, and Tesla V100).

Table 3: This table displays the datasets for our experiments, which contain approximate implied volatilities σ^I_{approx,l} for σ^I_{true,l} (l = 1, ..., L). They are generated by the MC with N paths for a time interval of 0.002. Furthermore, the σ^I_{approx,l} are grouped to form L̃ surfaces. The symbols "k" and "M" indicate one thousand and one million, respectively.

    dataset                        | N      | L̃     | L
    training set (D_train)         | 500k   | 1.2M  | 480M
    validation set (D_validate)    | 500k   | 0.1M  | 40M
    test set (D_test)              | 500k   | 0.1M  | 40M
    more accurate set (D'_test)    | 12.5M  | 0.25M | 100M
    total                          |        | 1.65M | 660M
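The antithetic variate technique mentioned above can be illustrated by reusing each Gaussian draw with the opposite sign, so that 250k pairs play the role of 500k plain paths. A minimal sketch under the same Euler step as before (parameters are placeholders, not the values used in the experiments):

```python
import numpy as np

def sabr_paths_antithetic(alpha, nu_tilde, rho, T, n_pairs=250_000, dt=0.002, seed=0):
    """Terminal forwards from antithetic pairs under the log-normal SABR Euler scheme."""
    rng = np.random.default_rng(seed)
    f = np.ones(2 * n_pairs)
    a = np.full(2 * n_pairs, alpha)
    for _ in range(int(round(T / dt))):
        e1 = rng.standard_normal(n_pairs)
        e2 = rng.standard_normal(n_pairs)
        e1 = np.concatenate([e1, -e1])          # antithetic pairing of the driving noise
        e2 = np.concatenate([e2, -e2])
        f = f + a * f * e1 * np.sqrt(dt)
        a = a + nu_tilde * a * (rho * e1 + np.sqrt(1.0 - rho ** 2) * e2) * np.sqrt(dt)
    return f

# the forward is a martingale, so the sample mean should stay close to f_0 = 1
print(sabr_paths_antithetic(0.3, 0.6, -0.4, T=0.5).mean())
```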
4. Network-based prediction of implied volatilities
In this section, we train neural networks of various structures using the training dataset D_train and the validation dataset D_validate. Subsequently, we predict the data in the test dataset D_test with the best-performing network among them. We also predict the data in the more accurate test dataset D'_test to evaluate the networks by estimating E_pred(Γ̂) with Theorem 3 in Section 2. In addition, new notations are introduced such that E_fit(Γ̂, D) = E_fit(Γ̂, M'; L'), E_pred(Γ̂; D) = E_pred(Γ̂; L'), E_approx(D) = E_approx(M'; L'), and N_pred(Γ̂, D) = N_pred(Γ̂, M'; L') for a dataset D with data length L' generated from M' simulations (see below for the definition of N_pred). These notations are used when it is desirable to emphasize D rather than L' and M'.

The expected number N_pred of virtual simulations needed to achieve E_pred(Γ̂) is defined as

    N_pred(Γ̂, M'; L') = M' ( E_fit(Γ̂, M'; L') − E_pred(Γ̂) ) / E_pred(Γ̂),   (5)

which is utilized as an indicator of network performance, along with E_pred(Γ̂). The definition is plausible because E_approx(M'; L') = E_fit(Γ̂, M'; L') − E_pred(Γ̂; L'), and it is expected that |E_pred(Γ̂) − E_pred(Γ̂; L')| is considerably small. For instance, if N_pred is approximately one million, then the MC should perform about one million simulations to attain the accuracy of the network.

We use an extensive feedforward neural network with millions of weights; by making the network as large as possible, we expect its approximation capability to be maximized. It accepts the following five inputs, each arranged as an m × n grid:

    T_s = ( T_{s,k_1} )_{k_1,k_2},   K_s = ( K_{s,k_1,k_2} )_{k_1,k_2},   α_s = ( α_s )_{k_1,k_2},   ν_s = ( ν_s )_{k_1,k_2},   ρ_s = ( ρ_s )_{k_1,k_2},

where K_{s,k_1,k_2} = K_{s,k_1,k_2}(T_{s,k_1}), each maturity T_{s,k_1} is repeated across its row, and the state α_s and the parameters ν_s and ρ_s are constant over the grid. The network produces the corresponding SABR volatilities

    σ^I_{net,s} = ( σ^I_{net,s}(T_{s,k_1}, K_{s,k_1,k_2}, α_s, ν_s, ρ_s) )_{k_1,k_2}.

It is worth recalling that σ^I_{approx,s} (s = 1, 2, ..., L̃) is also generated in the same two-dimensional form,

    σ^I_{approx,s} = ( σ^I_{approx,s}(T_{s,k_1}, K_{s,k_1,k_2}, α_s, ν_s, ρ_s) )_{k_1,k_2}.

This kind of approach helps the network filter out simulation noise by checking adjacent values; Dimitroff et al. [31] and Bayer et al. [32] adopted similar approaches.

The optimal weights Γ̂ are determined by minimizing the sum of squared differences between σ^I_{net,s} and σ^I_{approx,s} over D_train. To this end, the adaptive moment estimation (ADAM, Kingma and Ba [33]) is used with batch size 100. The learning rate starts from a small initial value and is reduced by a factor of 10 every time the loss value for D_validate does not improve; when the learning rate reaches a preset floor, the training is finished. We refer to Liu et al. [13] for the configuration.
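The architecture and training procedure described above can be sketched as follows. The paper does not specify a framework, so this illustration uses PyTorch; the layer width, activation, and learning-rate values here are placeholders (the networks examined below are far wider, with thousands of nodes per hidden layer):

```python
import torch
import torch.nn as nn

class SabrVolNet(nn.Module):
    """Pointwise map (T, K, alpha, nu, rho) -> sigma_net, applied to every cell of an
    m x n surface grid; widths are kept small here for illustration only."""
    def __init__(self, n_hidden_layers=4, n_nodes=256):
        super().__init__()
        layers, width = [], 5
        for _ in range(n_hidden_layers):
            layers += [nn.Linear(width, n_nodes), nn.ReLU()]
            width = n_nodes
        layers.append(nn.Linear(width, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):                     # x: (batch, m, n, 5)
        return self.net(x).squeeze(-1)        # -> (batch, m, n)

model = SabrVolNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)          # initial LR: placeholder
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)

def train_one_epoch(train_loader):
    model.train()
    for x, sigma_approx in train_loader:      # mini-batches of 100 surfaces each
        optimizer.zero_grad()
        loss = torch.mean((model(x) - sigma_approx) ** 2)
        loss.backward()
        optimizer.step()

# After each epoch: compute the validation MSE on D_validate and call
# scheduler.step(val_mse); stop training once the learning rate reaches its floor.
```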
We train and test various structures of the network, varying the number of hidden layers and the number of nodes per hidden layer. Table 4 reports E_pred(Γ̂) and N_pred(Γ̂, D_test) for the networks; these are induced using Theorem 3 with E_fit(Γ̂, D_test) and E_fit(Γ̂, D'_test). The values of E_pred(Γ̂) are spread from 2.03 × 10^{-7} to 2.22 × 10^{-7}, and those of N_pred(Γ̂, D_test) are distributed from 11.52M to 12.95M ("M" indicates one million). After observing the figures in the table, we conclude that there are no significant differences between the performances of the networks; notably, however, the values tend to improve slightly as the numbers of layers and nodes increase. We chose the network with 4 hidden layers and 7,000 nodes per layer as the best-performing network, and we continue the subsequent tests only for this network.

[Table 4 near here. Caption: This table reports the values of the two indicators of network performance, E_pred(Γ̂) and N_pred(Γ̂, D_test), for various network structures ("M" indicates one million).]

[Figure 4 near here. Caption: This figure draws E_fit(Γ̂, D_test), E_fit(Γ̂, D'_test), E_pred(Γ̂), and N_pred(Γ̂, D_test) with respect to the training data size.]

Subsequently, we analyze the effect of the training data size L on network performance. Figure 4 plots E_fit(Γ̂, D_test), E_fit(Γ̂, D'_test), E_pred(Γ̂), and N_pred(Γ̂, D_test) with respect to L. To obtain the values, training and testing are repeated using parts of the training set D_train; for instance, the values for the "1/128" subset originate from the network trained using only a 1/128 part of D_train. A subset contains the smaller subsets; for example, the subset "1/256" contains the subset "1/512", and the subset "1/512" contains the subset "1/1024". In the figure, E_fit(Γ̂, D_test), E_fit(Γ̂, D'_test), and E_pred(Γ̂) decrease, and N_pred(Γ̂, D_test) increases, as the training data size L grows. This phenomenon can be understood through Proposition 1 and Theorem 3.
Accordingly, it is established that

    E[ E_fit(Γ̂, D) ] = E[ E_pred(Γ̂; D) ] + E[ E_approx(D) ] = (1/(LM)) ⟨ q_l W^{-1} W_β W^{-1} q_l^T ⟩_{l,L'} + (1/M') ⟨β_l⟩_{l,L'},
    E_pred(Γ̂) = (1/(LM)) ⟨ q_l W^{-1} W_β W^{-1} q_l^T ⟩_l,
    E[ N_pred(Γ̂, D) ] ≈ LM ⟨β_l⟩_{l,L'} ⟨ q_l W^{-1} W_β W^{-1} q_l^T ⟩_{l,L'}^{-1}

for a dataset D with data length L' generated from M' simulations (M is the number of simulations used to generate the data in D_train). By the formulas above, E_pred(Γ̂; D) is reduced with high probability as L increases, whereas E_approx(D) is independent of L. Thus, as the training data size L grows, E_fit(Γ̂, D) is likely to decrease and converge to E_approx(D). It also seems reasonable that E_pred(Γ̂) and N_pred(Γ̂, D) monotonically decrease and increase, respectively. Nonetheless, according to the formulas above, doubling L should approximately halve E_pred(Γ̂) and double N_pred(Γ̂, D); in the graphs, however, the decrease rate of E_pred(Γ̂) is higher than 0.5 and the increase rate of N_pred(Γ̂, D) is lower than 2. These results may arise because the optimal weights Γ̂ differ depending on the training data size L, so that matrices such as W and W_β change as L changes. Further, we suspect that the approximation of the Hessian of the loss function in (1) contributes to these outcomes. We leave this issue for future work.

We further investigate E_pred(Γ̂) and N_pred(Γ̂, D_test) while restricting the range of one of the inputs α, ν, ρ, K, and T. To this end, let [Q_k, Q_{k+1}] be the subset containing the bottom 20k% to 20(k+1)% of a given range. Recall that α is generated to be uniformly distributed from 0 to 2 (strictly speaking, from 0.01 to 2); thus, [Q_k, Q_{k+1}] for α is {α | 0.4k ≤ α ≤ 0.4(k+1)}. Table 5 illustrates how E_pred(Γ̂) and N_pred(Γ̂, D_test) change when one of the inputs is restricted to [Q_k, Q_{k+1}]. In the lower table, [Q_0, Q_5] indicates that no restriction is imposed on the inputs. Evidently, N_pred(Γ̂, D_test) is large on specific domains (e.g., certain quintiles of α). A large N_pred(Γ̂, D_test) means that E_fit(Γ̂, D_test)/E_pred(Γ̂) is also high (see equation (5)). As E_pred(Γ̂) is positively correlated with N_pred(Γ̂, D_test) in the table, the domains with large N_pred(Γ̂, D_test) also have greater E_fit(Γ̂, D_test). Therefore, we conclude that the network removes more noise in those domains where E_fit(Γ̂, D_test) is high.

[Table 5 near here. Caption: The table shows E_pred(Γ̂) and N_pred(Γ̂, D_test) when restricting the range of one of the inputs α, ν, ρ, K, and T to [Q_k, Q_{k+1}]. Here, [Q_k, Q_{k+1}] is the subset containing the bottom 20k% to 20(k+1)% of a given range.]

This phenomenon may occur because ordinary least squares (OLS) is employed in this study. The MC produces larger standard deviations β_l on the specific domains. Although the targets σ^I_{approx,l} of the regression have different β_l, if OLS is adopted, the capacity of the network can be exhausted by the σ^I_{approx,l} with large deviations. Thus, weighted least squares (WLS) should be used to resolve this issue; however, estimating the deviations β_l required for the WLS may be considerably difficult.

Finally, we reflect on the distribution of a random sample ε̃_pred from D_pop, where D_pop is the population producing ε̃_{pred,l} (= σ^I_{net,l} − σ^I_{true,l}). (Roughly, D_pop = {ε̃_{pred,l}}_{l=1,...,∞} in the sense that the population can be understood as the set of infinitely many samples.) By doing so, we can assess how well {σ^I_{net,l}}_{l=1,...,L'} approximates {σ^I_{true,l}}_{l=1,...,L'}; for instance, we can discuss P[ |σ^I_{net,l} − σ^I_{true,l}| < ε ] for a given tolerance ε. First, E[ε̃_pred] = E[ε̃_{pred,l}] = 0. As Var[ε̃_pred] = E_pred(Γ̂), Var[ε̃_pred] is estimated as 2.03E-07, as reported in Table 5. This signifies that if the best-performing network (i.e., 4 hidden layers and 7,000 nodes per hidden layer) is utilized, std[ε̃_pred] is about 0.00045 (4.5bp). Thus, under the assumption that ε̃_pred follows a normal distribution, approximately 97% of |ε̃_pred| are within 10bp, which is remarkable accuracy. However, the normality assumption is hard to verify because estimating higher moments of ε̃_pred is difficult at the moment. Alternatively, we draw many plots of σ^I_net with respect to log(K/f), along with the 99% confidence intervals computed using 10 million MC simulations, and confirm that most values of σ^I_net lie within the intervals. Owing to space limitations, we select two of them and draw them in Figure 5. In the figure, the blue regions indicate the 99% confidence intervals. The intervals are too thin to be observed, so auxiliary figures are drawn at the bottom, in which the intervals are drawn centered about σ^I_net. Notably, we can gauge the accuracy of the network through these subfigures. Based on the examination of the many plots, we speculate that the accuracy suggested above for |ε̃_pred| is plausible.

[Figure 5 near here; panels: α = 0.47, ν = 0.85, ρ = −0.93, T = 0.52 and α = 0.13, ν = 1.21, ρ = −0.38, T = 1.29, each comparing the network with the Hagan formula. Caption: Two examples of σ^I_net drawn with respect to log(K/f) for data selected from D_test. The blue regions indicate the 99% confidence intervals computed by 10 million MC simulations. The intervals are drawn again centered about σ^I_net in the auxiliary subfigures at the bottom.]
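As a quick arithmetic check of the normal-approximation claim above (a back-of-the-envelope computation based on the variance estimate 2.03E-07, not a result from the paper):

```python
from math import erf, sqrt

var_pred = 2.03e-7                   # estimated Var[eps_pred] from Table 5
std_pred = sqrt(var_pred)            # ~4.5e-4, i.e., about 4.5 bp of implied volatility
bound = 0.001                        # 10 bp

# P(|eps| < bound) for eps ~ N(0, var_pred)
prob = erf(bound / (std_pred * sqrt(2.0)))
print(f"std[eps_pred] ~ {std_pred:.2e} ({std_pred * 1e4:.1f} bp)")
print(f"P(|eps_pred| < 10 bp) ~ {prob:.3f}")     # roughly 0.97
```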
5. Conclusion
Recently, the application of deep learning algorithms has facilitated outstanding achievements in various fields. Considering that pricing options is truly essential in financial engineering, it seems inevitable to note recent research using artificial neural networks to predict the prices for particular parametric models. In our opinion, the models without any pricing formulas should be studied more intensively. Nevertheless, measuring prediction errors is virtually impossible because true prices are unknown for such models. To resolve this problem, we developed a novel method based on the Monte-Carlo simulation and nonlinear regression. According to the method, the best-performing network developed in this work produces results as accurate as those of 13 million MC simulations. This is a remarkable result because the MC takes much more computational time than the network.
There remain unresolved problems for future research. First, the decreasing rate of the prediction errors does not exactly match the value our theory predicts; the theory should be improved to address this gap. Second, weighted least squares should be introduced because the simulated option prices have different variances. Finally, only the expectation and variance of the prediction errors are provided in this work, whereas it would be more desirable to investigate the theoretical distribution of the errors.
Acknowledgments
We are grateful to Korean Asset Pricing, a bond rating agency located in Korea, for providing the data in Figure 1. Jaegi Jeon received financial support from the National Research Foundation of Korea (NRF) of the Korean government (Grant No. NRF-2019R1I1A1A01062911). Jeonggyu Huh received financial support from the NRF (Grant No. NRF-2019R1F1A1058352).
References

[1] G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems 2 (1989) 303–314.
[2] K. Hornik, M. Stinchcombe, H. White, et al., Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989) 359–366.
[3] J. Ruf, W. Wang, Neural networks for option pricing and hedging: a literature review, Available at SSRN 3486363 (2019).
[4] B. M. Henrique, V. A. Sobreiro, H. Kimura, Literature review: Machine learning techniques applied to financial market prediction, Expert Systems with Applications 124 (2019) 226–251.
[5] J. M. Hutchinson, A. W. Lo, T. Poggio, A nonparametric approach to pricing and hedging derivative securities via learning networks, The Journal of Finance 49 (1994) 851–889.
[6] F. Black, M. Scholes, The pricing of options and corporate liabilities, Journal of Political Economy 81 (1973) 637–654.
[7] S. L. Heston, A closed-form solution for options with stochastic volatility with applications to bond and currency options, The Review of Financial Studies 6 (1993) 327–343.
[8] P. S. Hagan, D. Kumar, A. S. Lesniewski, D. E. Woodward, Managing smile risk, The Best of Wilmott 1 (2002) 249–296.
[9] R. Culkin, S. R. Das, Machine learning in finance: the case of deep learning for option pricing, Journal of Investment Management 15 (2017) 92–100.
[10] A. Broström, R. Kristiansson, Exotic derivatives and deep learning, 2018.
[11] R. Ferguson, A. Green, Deeply learning derivatives, arXiv preprint arXiv:1809.02233 (2018).
[12] W. A. McGhee, An artificial neural network representation of the SABR stochastic volatility model, Available at SSRN 3288882 (2018).
[13] S. Liu, C. W. Oosterlee, S. M. Bohte, Pricing options and computing implied volatilities using neural networks, Risks 7 (2019) 16.
[14] A. Hirsa, T. Karatas, A. Oskoui, Supervised deep neural networks (DNNs) for pricing/calibration of vanilla/exotic options under various different processes, arXiv preprint arXiv:1902.05810 (2019).
[15] F. D. Rouah, The Heston Model and Its Extensions in Matlab and C#, John Wiley & Sons, 2013.
[16] J. Gatheral, T. Jaisson, M. Rosenbaum, Volatility is rough, Quantitative Finance 18 (2018) 933–949.
[17] P. Wilmott, Paul Wilmott on Quantitative Finance, John Wiley & Sons, 2013.
[18] P. Glasserman, Monte Carlo Methods in Financial Engineering, volume 53, Springer Science & Business Media, 2013.
[19] D. C. Montgomery, E. A. Peck, G. G. Vining, Introduction to Linear Regression Analysis, volume 821, John Wiley & Sons, 2012.
[20] P. C. Hansen, V. Pereyra, G. Scherer, Least Squares Data Fitting with Applications, JHU Press, 2013.
[21] R. Rebonato, K. McKay, R. White, The SABR/LIBOR Market Model: Pricing, Calibration and Hedging for Complex Interest-Rate Derivatives, John Wiley & Sons, 2011.
[22] G. West, Calibration of the SABR model in illiquid markets, Applied Mathematical Finance 12 (2005) 371–385.
[23] B. Bartlett, Hedging under SABR model, Wilmott Magazine 4 (2006) 2–4.
[24] S. E. Shreve, Stochastic Calculus for Finance II: Continuous-Time Models, volume 11, Springer Science & Business Media, 2004.
[25] J. Obloj, Fine-tune your smile: Correction to Hagan et al., Wilmott Magazine 2008 (2008).
[26] P. Henry-Labordere, Analysis, Geometry, and Modeling in Finance: Advanced Methods in Option Pricing, Chapman and Hall/CRC, 2008.
[27] L. Paulot, Asymptotic implied volatility at the second order with application to the SABR model, Large Deviations and Asymptotic Methods in Finance, Springer (2015) 37–69.
[28] Q. Wu, Series expansion of the SABR joint density, Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics 22 (2012) 310–345.
[29] A. Antonov, M. Konikov, M. Spector, SABR spreads its wings, Risk 26 (2013) 58.
[30] R. Garcia, R. Gençay, Pricing and hedging derivative securities with neural networks and a homogeneity hint, Journal of Econometrics 94 (2000) 93–115.
[31] G. Dimitroff, D. Roeder, C. P. Fries, Volatility model calibration with convolutional neural networks, Available at SSRN 3252432 (2018).
[32] C. Bayer, B. Horvath, A. Muguruza, B. Stemper, M. Tomas, On deep calibration of (rough) stochastic volatility models, arXiv preprint arXiv:1908.08806 (2019).
[33] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).