An unsupervised deep learning approach in solving partial integro-differential equations
Weilong Fu (Department of IEOR, Columbia University, [email protected]) and Ali Hirsa (Department of IEOR, Columbia University, [email protected])

Abstract
We investigate solving partial integro-differential equations (PIDEs) using unsupervised deep learning in this paper. To price options when the underlying processes follow Lévy processes, we need to solve PIDEs. In supervised deep learning, pre-calculated labels are used to train neural networks to fit the solution of the PIDE. In unsupervised deep learning, the neural network itself is employed as the solution, and the derivatives and the integrals in the PIDE are calculated from the neural network. By matching the PIDE and its boundary conditions, the neural network gives an accurate solution of the PIDE. Once trained, it is fast at calculating option values as well as option Greeks.

Keywords: PIDEs, neural networks, deep learning
Financial models based on Lévy processes are better at describing the fat tails of asset returns and matching the implied volatility surfaces in option markets than diffusion models, since Lévy processes take jumps into consideration in addition to Gaussian movements. Examples of such models are the variance gamma model (VG, [16]), the normal inverse Gaussian model (NIG, [1]), and the tempered stable process (also known as the CGMY model, [2]).

The partial integro-differential equation (PIDE) is used to calculate option values for models based on Lévy processes. Its difference from the partial differential equation (PDE) for diffusion models is the integral term, which is due to the jumps in Lévy processes. For this reason, the PIDE is harder to solve and, consequently, options under Lévy processes are more complex to price. The PIDE can be solved using the finite difference method in an explicit-implicit scheme as described in [9], or via the fast Fourier transform (FFT) (see [3] and [15] for details). Recently, many pricing approaches have been proposed based on machine learning (ML) and deep learning (DL).

• In [4], kernel regression is applied to pre-calculated data to price American options. The PIDE is converted into an ordinary integro-differential equation (OIDE), and kernel regression is used to calculate a correction term in the OIDE to reduce the error.

• In supervised deep learning, many labels are generated a priori by other pricing methods. Neural networks are then employed to fit the price surface or the volatility surface as a function of all the parameters involved in the model (see e.g. [8], [14], [11]). The advantage of neural network approaches is that they are fast at computing prices once trained. However, in supervised learning, it is quite costly to generate the training labels.

• There are unsupervised learning approaches as well. In [17], the authors consider solving a PDE with neural networks. The derivatives in the PDE are calculated from the neural network, and training matches the two sides of the PDE and its boundary conditions. In this way, the solution of the PDE is obtained and no labels are needed for training the neural network.

In [17], only a PDE is considered, and we aim to extend the approach to the PIDE. In this paper, we solve the PIDE by a neural network directly. The neural network only needs to be trained once, the same as in supervised deep learning. The main difference from supervised deep learning is that this approach is a self-contained pricing method, which does not need pre-calculated labels.

The paper is organized as follows. In Section 2, we describe the problem of solving the PIDE. The variance gamma model is used as an example. In Section 3, we explain how to calculate the derivatives and the integrals. In Section 4, we test the unsupervised deep learning approach in pricing options. Many choices of the neural network parameters are implemented to compare performance. Also, the same method is applied to the Black-Merton-Scholes model to show the effect of the integral in the PIDE. Finally, we show the outcome for option Greeks. Section 5 summarizes the paper.
In this paper, we choose the variance gamma (VG) model ([16]) as an example. It is a pure jump process with three model parameters θ, σ and ν. The Lévy density of the VG process is given by
$$ k(x) = \frac{e^{-\lambda_p x}}{\nu x}\,\mathbf{1}_{x>0} + \frac{e^{-\lambda_n |x|}}{\nu |x|}\,\mathbf{1}_{x<0}, \qquad (1) $$
where
$$ \lambda_p = \sqrt{\frac{\theta^2}{\sigma^4} + \frac{2}{\sigma^2\nu}} - \frac{\theta}{\sigma^2} \quad\text{and}\quad \lambda_n = \sqrt{\frac{\theta^2}{\sigma^4} + \frac{2}{\sigma^2\nu}} + \frac{\theta}{\sigma^2}. $$

Suppose S is the stock price, K is the strike price, T is the maturity, r is the risk-free interest rate and q is the dividend rate. In the VG model, the price of European options can be obtained by solving the following PIDE
$$ \int_{-\infty}^{\infty}\left[ V(Se^x, t) - V(S,t) - \frac{\partial V}{\partial S}(S,t)\, S (e^x - 1) \right] k(x)\,dx + \frac{\partial V}{\partial t}(S,t) + (r-q)\, S \frac{\partial V}{\partial S}(S,t) - r V(S,t) = 0, \qquad (2) $$
with the terminal condition V(S, T) = (K − S)^+ for put options or V(S, T) = (S − K)^+ for call options. Here V(S, t) is the price and k(x) is the Lévy density given in (1).

By the change of variables x = ln S, τ = T − t, and w(x, τ) = V(S, t), we get
$$ \frac{\partial w}{\partial x}(x,\tau) = S\frac{\partial V}{\partial S}(S,t), \qquad \frac{\partial w}{\partial \tau}(x,\tau) = -\frac{\partial V}{\partial t}(S,t), \qquad w(x+y,\tau) = V(Se^{y}, t), $$
and the following equation
$$ \int_{-\infty}^{\infty}\left[ w(x+y,\tau) - w(x,\tau) - \frac{\partial w}{\partial x}(x,\tau)(e^y - 1) \right] k(y)\,dy - \frac{\partial w}{\partial \tau}(x,\tau) + (r-q)\frac{\partial w}{\partial x}(x,\tau) - r w(x,\tau) = 0, \qquad (3) $$
with the initial condition w(x, 0) = (K − e^x)^+ for put options or w(x, 0) = (e^x − K)^+ for call options. Here x = ln S is the log-price and τ is the time to maturity.

Our goal is to solve the PIDE in (3) utilizing neural networks directly. We treat the solution w(·) as a function of x and τ as well as the other parameters, substitute w(·) with a multi-layer perceptron, and train the network to satisfy the PIDE. This is an unsupervised method, which means there is no need for labels, i.e., option prices calculated by other methods. In this study we focus on the European put.

In the neural network, we always use a smooth activation function to build the solution w(x, τ), e.g., swish (f(x) = xσ(x), where σ(·) is the sigmoid function) or softplus (f(x) = ln(e^x + 1)). Thus the derivatives ∂w/∂x(x, τ) and ∂w/∂τ(x, τ) can be calculated by backward propagation. Given a point (x, τ, Θ), where Θ = {σ, ν, θ, r, q}, we calculate the integral
$$ \int_{-\infty}^{\infty}\left[ w(x+y,\tau) - w(x,\tau) - \frac{\partial w}{\partial x}(x,\tau)(e^y - 1) \right] k(y)\,dy $$
in two parts. The inner part
$$ \int_{|y|\le\epsilon}\left[ w(x+y,\tau) - w(x,\tau) - \frac{\partial w}{\partial x}(x,\tau)(e^y - 1) \right] k(y)\,dy $$
is approximated by
$$ \frac{1}{2}\left[ \frac{\partial^2 w}{\partial x^2}(x,\tau) - \frac{\partial w}{\partial x}(x,\tau) \right]\int_{|y|\le\epsilon} y^2 k(y)\,dy, $$
in the same way as described in Chapter 5 of [7]. For the outer part, we can write
$$ \int_{|y|>\epsilon}\left[ w(x+y,\tau) - w(x,\tau) - \frac{\partial w}{\partial x}(x,\tau)(e^y - 1) \right] k(y)\,dy = \int_{|y|>\epsilon}\left[ w(x+y,\tau) - w(x,\tau) \right] k(y)\,dy - \frac{\partial w}{\partial x}(x,\tau)\int_{|y|>\epsilon}(e^y - 1)\, k(y)\,dy. $$
The first term is calculated using the trapezoidal rule as explained in [21]. Further details are included in Appendix A.
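To make these building blocks concrete, the following is a minimal NumPy sketch (not the authors' code) of the VG Lévy density in (1) and of the two truncated quantities σ²(ε) = ∫_{|y|≤ε} y²k(y)dy and ω(ε) = ∫_{|y|>ε}(1 − e^y)k(y)dy used in Appendix A. The parameter values, grids and truncation of the tails are illustrative assumptions.

```python
# Hypothetical sketch: VG Levy density and truncated moments (not the authors' code).
import numpy as np

def vg_levy_density(y, theta, sigma, nu):
    """Levy density k(y) of the VG process, Equation (1)."""
    root = np.sqrt(theta**2 / sigma**4 + 2.0 / (sigma**2 * nu))
    lam_p, lam_n = root - theta / sigma**2, root + theta / sigma**2
    return np.where(y > 0,
                    np.exp(-lam_p * np.abs(y)) / (nu * np.abs(y)),
                    np.exp(-lam_n * np.abs(y)) / (nu * np.abs(y)))

def trapezoid(f, y):
    """Trapezoidal rule on a (possibly non-uniform) grid."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(y)))

# illustrative parameters and truncation level
theta, sigma, nu, eps = -0.2, 0.2, 0.4, 0.01

# sigma^2(eps): the integrand y^2 k(y) is bounded near 0, so a fine uniform grid suffices
y_in = np.concatenate([np.linspace(-eps, -1e-8, 2000), np.linspace(1e-8, eps, 2000)])
sigma2_eps = trapezoid(y_in**2 * vg_levy_density(y_in, theta, sigma, nu), y_in)

# omega(eps): integrate (1 - e^y) k(y) over |y| > eps, truncating the tails at |y| = 10
y_out = np.concatenate([np.linspace(-10.0, -eps, 5000), np.linspace(eps, 10.0, 5000)])
omega_eps = trapezoid((1.0 - np.exp(y_out)) * vg_levy_density(y_out, theta, sigma, nu), y_out)

print(sigma2_eps, omega_eps)
```

Both quantities depend only on (θ, σ, ν) and ε, so in practice they can be precomputed once per sample point and reused in every training step.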
3.2 Loss function

The Cauchy boundary conditions are considered:
$$ w(x, 0) = (K - e^x)^+ \qquad (4) $$
$$ w(x_{\min}, \tau) = K e^{-r\tau} - e^{x_{\min} - q\tau} \qquad (5) $$
$$ w(x_{\max}, \tau) = 0 \qquad (6) $$
Given a sample point (x, τ, Θ), the loss function is the sum of the squared residuals of Equations (3), (4), (5) and (6).
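As an illustration of how such a loss can be assembled, below is a minimal PyTorch sketch (not the authors' implementation). The network, the quadrature grid for the outer integral, and the per-sample truncated moments σ²(ε) and ω(ε) are assumed to be supplied; all names are hypothetical.

```python
# Hypothetical sketch of the unsupervised loss, Equations (3)-(6); not the authors' code.
import torch

def pide_loss(model, x, tau, theta, sigma, nu, r, q,
              K, x_min, x_max, y_grid, k_weights, sigma2_eps, omega_eps):
    """model maps a (batch, 7) tensor [x, tau, r, q, theta, sigma, nu] to w.
    y_grid, k_weights: quadrature nodes y_j and weights k(y_j)*dy_j for |y| > eps.
    sigma2_eps, omega_eps: per-sample truncated moments from Appendix A."""
    x = x.clone().requires_grad_(True)
    tau = tau.clone().requires_grad_(True)
    feats = torch.stack([r, q, theta, sigma, nu], dim=1)

    def w_of(xv, tv):
        return model(torch.cat([xv[:, None], tv[:, None], feats], dim=1)).squeeze(-1)

    w = w_of(x, tau)
    # first and second derivatives by backward propagation
    w_x, w_tau = torch.autograd.grad(w.sum(), (x, tau), create_graph=True)
    w_xx = torch.autograd.grad(w_x.sum(), x, create_graph=True)[0]

    # outer integral: sum_j [w(x + y_j, tau) - w(x, tau)] k(y_j) dy_j
    m = y_grid.numel()
    x_shift = (x[:, None] + y_grid[None, :]).reshape(-1)
    tau_rep = tau[:, None].expand(-1, m).reshape(-1)
    feats_rep = feats[:, None, :].expand(-1, m, -1).reshape(-1, feats.shape[1])
    w_shift = model(torch.cat([x_shift[:, None], tau_rep[:, None], feats_rep], dim=1))
    w_shift = w_shift.reshape(-1, m)
    outer = ((w_shift - w[:, None]) * k_weights[None, :]).sum(dim=1)

    # residual of the regularized PIDE, Equation (7) in Appendix A
    res_pide = (0.5 * sigma2_eps * w_xx + outer - w_tau
                + (r - q + omega_eps - 0.5 * sigma2_eps) * w_x - r * w)

    # boundary residuals (4)-(6) for the European put
    res_init = w_of(x, torch.zeros_like(tau)) - torch.clamp(K - torch.exp(x), min=0.0)
    res_low = w_of(torch.full_like(x, x_min), tau) - (K * torch.exp(-r * tau)
                                                      - torch.exp(x_min - q * tau))
    res_up = w_of(torch.full_like(x, x_max), tau)

    return (res_pide**2 + res_init**2 + res_low**2 + res_up**2).mean()
```

In this sketch the shifted evaluations w(x + y_j, τ) stay in the computational graph, so the integral is differentiated through during training; the "fixed integral" variant discussed later simply detaches this term.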
For the VG model, the input dimension of the neural network is seven in total: x, τ, r, q, θ, σ and ν. In other models based on Lévy processes, there would be more or fewer parameters and the input dimension of the neural network would vary accordingly. The input dimension does not influence the outcome, since it is small compared with the width of the layers. We consider networks with 1 ≤ L ≤ 6 layers and between 100 and 700 neurons in each layer (see Table 5). The activation is chosen from swish and softplus. The initialization varies among he-normal, he-uniform [6], lecun-normal, lecun-uniform [13], glorot-normal and glorot-uniform [5]. The initial distributions are uniform distributions with different ranges or truncated normal distributions with different variances. For example, the he-normal initialization employs a truncated normal distribution with variance 2/N_in, where N_in is the input size of the layer. The optimizer is either Adam [12] or RMSprop [20]. We also consider regularization by batch-normalization [10] and different dropout rates [19]. We then try different boundary conditions and assess the accuracy of the numerical integral. Finally, we test the effect of the sample size.

We consider the option price w(·) in the following region:
0 < τ ≤ …,  … ≤ σ ≤ …,  0.… ≤ ν ≤ 0.…,  −0.… ≤ θ ≤ −0.…,  … ≤ r, q ≤ 0.…
K is fixed to 200. We use ε = 0.01 for the integral calculation. For the boundary conditions, we assume x_min = ln(1) and x_max = ln(10000). For the training samples, the log-price x follows the uniform distribution between ln(K/40) and ln(2K). For the test samples, the log-price x follows the uniform distribution between ln(K/2) and ln(2K); it is restricted to this narrower range since we do not want to focus too much on deep in-the-money options. The other parameters are uniformly distributed within their ranges.

Here we start from a training size of 50,000 and a test size of 2,000. The batch size is 200 and the number of training epochs is 30; these remain the same in the following experiments unless specified otherwise. The samples are given by the Sobol sequence ([18]), which is a quasi-random sequence (a sketch of this sampling and network setup is given below).

Given L = 3 layers and N = 200 neurons in each layer, we test different initializations. The activation is swish and the optimizer is Adam. No regularization is used. As shown in Table 1, he-normal initialization gives the best outcome, and we choose he-normal going forward in our study.

We compare swish and softplus, with the Adam optimizer and no regularization. As shown in Table 2, swish performs uniformly better than softplus. We will continue with swish in the following.
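Below is a minimal PyTorch sketch of the sampling and network construction described above (not the authors' code): Sobol quasi-random points are scaled into the parameter box, and a feed-forward network with swish activation and he-normal initialization is built. The parameter bounds used here are illustrative assumptions, since the exact ranges are not reproduced above.

```python
# Hypothetical sketch: Sobol sampling and network construction; not the authors' code.
import math
import torch

# illustrative parameter box [x, tau, r, q, theta, sigma, nu]; the paper's bounds differ
LOW  = torch.tensor([math.log(200.0 / 40), 1e-3, 0.0, 0.0, -0.5, 0.1, 0.1])
HIGH = torch.tensor([math.log(2 * 200.0),  3.0,  0.1, 0.1, -0.1, 0.4, 0.5])

def sobol_samples(n, dim=7, seed=0):
    """Quasi-random training points scaled into the parameter box."""
    engine = torch.quasirandom.SobolEngine(dimension=dim, scramble=True, seed=seed)
    u = engine.draw(n)                      # points in [0, 1]^dim
    return LOW + u * (HIGH - LOW)

def make_mlp(in_dim=7, width=200, depth=3):
    """Feed-forward network with swish (SiLU) activation and he-normal initialization."""
    layers, d = [], in_dim
    for _ in range(depth):
        lin = torch.nn.Linear(d, width)
        torch.nn.init.kaiming_normal_(lin.weight, nonlinearity='relu')  # variance 2/N_in
        torch.nn.init.zeros_(lin.bias)
        layers += [lin, torch.nn.SiLU()]    # SiLU is the swish activation
        d = width
    layers.append(torch.nn.Linear(d, 1))
    return torch.nn.Sequential(*layers)

model = make_mlp()
optimizer = torch.optim.Adam(model.parameters())
batch = sobol_samples(200)                  # one batch of 200 sample points
print(batch.shape, model(batch).shape)
```

Note that `kaiming_normal_` uses a plain (not truncated) normal with variance 2/N_in, which is a close stand-in for the he-normal initialization described above.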
We compare Adam and RMSprop without any regularization. As shown in Table 3, Adam performs uniformly better than RMSprop. We will continue with Adam in the following.
We now test batch-normalization. As shown in Table 4, there is no obvious improvement with batch-normalization. Considering that batch-normalization is costly, we prefer not to use it.
We now test the number of layers L and the number of neurons N in each layer. As shown in Table 5, the best combinations are (L, N) = (4, 600), (L, N) = (5, 400) and (L, N) = (5, 500), among 1 ≤ L ≤ 6 and 100 ≤ N ≤ 700.

                  RMSE      MAE
glorot-normal     2.434     13.266
glorot-uniform    2.125     10.107
he-normal         1.954     10.140
he-uniform        2.090     10.106
lecun-normal      1.956     10.433
lecun-uniform     2.129     11.486
Table 1: Root mean squared error (RMSE) and maximum absolute error (MAE).

                   softplus   swish
L = 3, N = 100     2.808      2.468
L = 3, N = 200     2.309      1.954
L = 3, N = 300     1.861      1.614
L = 3, N = 400     1.845      1.733
Table 2: Root mean squared error (RMSE).

                   Adam               RMSprop
                   RMSE     MAE       RMSE     MAE
L = 3, N = 100     2.468    11.586    4.333    14.905
L = 3, N = 200     1.954    10.140    3.033    13.991
L = 3, N = 300     1.614     8.569    3.075     9.670
L = 3, N = 400     1.733    10.668    2.170    10.345
Table 3: Root mean squared error (RMSE) and maximum absolute error (MAE).

                   batch-norm = False   batch-norm = True
                   RMSE     MAE         RMSE     MAE
L = 3, N = 100     2.468    11.586      2.135    10.366
L = 3, N = 200     1.954    10.140      2.294    12.299
L = 3, N = 300     1.614     8.569      1.888     9.882
L = 3, N = 400     1.733    10.668      1.604     9.183
Table 4: Root mean squared error (RMSE) and maximum absolute error (MAE).

RMSE     N = 100   200      300      400      500      600      700
L = 1    35.942    18.907   17.124   15.574   14.980
L = 2     2.744     2.400    2.438    2.458    2.579
L = 3     2.468     1.954    1.614    1.733    1.744    1.269    1.502
L = 4     2.143     1.860    1.657    1.401    1.221    0.981    1.229
L = 5     1.769     1.285    1.794    1.119    1.033    1.729    1.201
L = 6     1.706     1.261    1.396    1.708    1.362    1.664    1.167
Table 5: Root mean squared error (RMSE).

When we choose L = 4 and N = 400, the best choice for the dropout rate is 0.3. However, if we choose L = 4 and N = 200, the best choice for the dropout rate seems to be 0.2. So the optimal dropout rate depends on the size of the network and cannot be fixed. Also, a dropout rate slightly larger than 0 does not always reduce the error. We can easily see this phenomenon in Table 7, where the best performance is reached at a rather large dropout rate. There are also cases in which dropout does not help at all, as can be seen in Tables 12 and 14. Typically, for a given neural network, we try different dropout rates and pick the best one.

We illustrate in Table 8 the result of replacing the Dirichlet boundary conditions
$$ w(x_{\min}, \tau) = K e^{-r\tau} - e^{x_{\min} - q\tau}, \qquad w(x_{\max}, \tau) = 0 $$
with the Neumann boundary conditions
$$ \frac{\partial^2 w}{\partial x^2}(x_{\min}, \tau) - \frac{\partial w}{\partial x}(x_{\min}, \tau) = 0, \qquad \frac{\partial^2 w}{\partial x^2}(x_{\max}, \tau) - \frac{\partial w}{\partial x}(x_{\max}, \tau) = 0. $$
For small N, the Neumann conditions generally give better results than the Dirichlet conditions. However, for large N, the results of the Neumann conditions are worse. Also, as we see from Table 8, the results of the Neumann conditions are less stable and not consistent: in some cases they are better, and in others they are worse.

In this part we would like to test whether the error of the numerical integral leads to errors in pricing. In Appendix A we explain how to calculate the integral
$$ \int_{|y|>\epsilon}\left[ w(x+y,\tau) - w(x,\tau) \right] k(y)\,dy $$
via the trapezoidal rule. If we split each integration interval into two halves, i.e., add the midpoint of each interval to the grid points, we get a finer grid and can reduce the error of the trapezoidal rule. We can compare the performance of the finer grid with that of the original grid to check whether the error of the integral influences training (a small sketch of this refinement is given below). When we take the finer grid, the error gets slightly larger (Table 9). So the approximation introduced by the trapezoidal rule is not the reason for the error.
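As a small illustration of the grid-refinement check above, the following NumPy fragment shows one way to insert the midpoints of an existing non-uniform quadrature grid; the grid values are hypothetical.

```python
# Hypothetical sketch: refine a non-uniform trapezoidal grid by adding interval midpoints.
import numpy as np

y = np.array([0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0])        # illustrative coarse grid (y > eps side)
y_fine = np.sort(np.concatenate([y, 0.5 * (y[1:] + y[:-1])]))
print(y_fine)                                                 # twice as many intervals, same endpoints
```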
In this part we would like to test whether it is stable to update the numerical integral, which involves the output of the neural network at many points. In each step of training, we need to treat the integral as a function of the network parameters and take its derivatives, which is time consuming and computationally expensive. If we instead treat the integral as a constant and do not update it within each step, we do not have to take the derivatives of the integral with respect to the network parameters. This is consistent with the explicit-implicit finite difference scheme proposed in [9]. This approach is faster, but its error is larger than that of the original method (Table 10). This is easy to understand, since we use the integral approximation from the previous training step to approximate the integral of the current step. Since the fixed integral does not reduce the error, we are not concerned about the stability of the numerical integral.
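In terms of the hypothetical loss sketch given after the loss-function subsection, this "fixed integral" variant amounts to detaching the shifted network evaluations from the computational graph, for example (same illustrative variable names as in that sketch):

```python
# Treat the outer integral as a constant for the current step (no gradient flows through it).
outer = ((w_shift.detach() - w[:, None].detach()) * k_weights[None, :]).sum(dim=1)
```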
                   L = 4, N = 400       L = 4, N = 200
dropout rate       RMSE     MAE         RMSE     MAE
0                  1.954    10.140      1.860     6.606
0.1                2.092     7.114      1.708    10.718
0.2                1.201     8.232      1.522    10.250
0.3                0.955     4.764      1.612    10.488
0.4                1.211     6.760      1.635     8.638
Table 6: Root mean squared error (RMSE) and maximum absolute error (MAE).

                   L = 4, N = 600       L = 5, N = 500
dropout rate       RMSE     MAE         RMSE     MAE
0.0                0.981     6.031      0.864    3.848
0.1                0.752     3.585      1.265    8.555
0.2                1.512     8.228      1.227    5.744
0.3                2.507    10.394      1.335    8.298
0.4                1.009     6.937      1.510    8.554
0.5                1.894     6.381      0.742    3.886
0.6                0.831     6.277      1.248    6.913
0.7                1.387     8.294      1.193    6.990
Table 7: Root mean squared error (RMSE) and maximum absolute error (MAE).

         Neumann               Dirichlet
N        L = 4     L = 5       L = 4     L = 5
100      1.941     1.422       2.143     1.769
200      1.254     0.842       1.860     1.285
300      1.512     1.109       1.657     1.794
400      2.823     0.633       1.401     1.119
500      1.675     1.754       1.221     1.033
Table 8: Root mean squared error (RMSE).

L = 4, N = 400, dropout = 0.3     RMSE     MAE
original grid                     1.081    6.776
finer grid                        1.260    7.662
Table 9: Root mean squared error (RMSE) and maximum absolute error (MAE).

L = 4, N = 400, dropout = 0.3     RMSE     MAE
original method                   1.081    6.776
fixed integral                    1.249    5.611
Table 10: Root mean squared error (RMSE) and maximum absolute error (MAE).

To see the impact of the jump integral, we consider solving the Black-Merton-Scholes (BMS) equation
$$ -\frac{\partial w}{\partial \tau}(x,\tau) + \frac{\sigma^2}{2}\frac{\partial^2 w}{\partial x^2}(x,\tau) + \left(r - q - \frac{\sigma^2}{2}\right)\frac{\partial w}{\partial x}(x,\tau) - r w(x,\tau) = 0. $$
The performance is provided in Tables 11 and 12.

Note that there is no integral term in the BMS equation. If the integral term were the main reason for the errors in the VG model, the errors in the BMS model should be much smaller. But in fact, with neural networks of the same size, the errors in the BMS model are only slightly smaller than those in the VG model. Also note that the BMS model is the special case of the VG model when ν = 0 and θ = 0. The dimension of the sample space in the BMS model is 5, while the dimension in the VG model is 7. The BMS model should be less affected by the curse of dimensionality, so we should expect the errors of the neural networks to be slightly smaller under the BMS model given the same sample size. From these results we conclude that the integral part is not the main source of error.

The previous tests serve to find suitable neural networks for pricing and to assess the performance of the boundary conditions and the numerical integral. We now train the networks on large samples for many epochs to get the full performance. Tables 13-15 contain the outcomes for sample sizes from 50,000 to 1,000,000. Since the batch size is fixed to 200, there are more optimization steps in one epoch if the sample size is larger, and more steps usually mean better results in the optimization algorithms. To be fair in our comparison, we keep the total number of steps the same even though the sample sizes differ. This is fair also because the time costs are then the same for the different sample sizes. For sample sizes of 50,000, 200,000, 500,000 and 1,000,000, the numbers of epochs are 600, 150, 60 and 30, respectively. Thus the total number of optimization steps is 1,000,000/200 × 30 = 150,000.
The neural networks with the best performance are listed in Table 13. The RMSE is about 0.1, which is quite small. The MAE is about 1, roughly ten times the RMSE, which means the neural network is quite close to the true price surface but has a few larger deviations.

In Table 14, we show that dropout does not always help. For large networks such as L = 5, N = 500 or L = 4, N = 600, dropout deteriorates the performance when the sample size is large.

In Table 15, we fix the structure of the neural network and only change the sample size. When we increase the sample size from 50,000 to 1,000,000, the performance stays about the same.

Up to now, we have found that it is better to use he-normal initialization, the swish activation, the Adam optimizer and no batch-normalization. A larger network gives better outcomes, and the optimal dropout rate has to be selected case by case. The Dirichlet boundary conditions and the trapezoidal rule work well in the method. The integral term does not increase the error of the approach. The total number of optimization steps plays a more important role than the sample size in fully training the neural network.
In Figures 1 and 2, we show the curves of the price, Delta, Gamma and Theta, and compare the true values with the values predicted by the neural network. The true values are computed through the fast Fourier transform (FFT). The predicted values are calculated from the neural network by back propagation. Suppose V(S, t) is the price of an option with a certain strike K. Then Delta is ∂V/∂S, Gamma is ∂²V/∂S², and Theta is ∂V/∂t = −∂V/∂τ. Note that this Theta is one of the option Greeks and is different from the model parameter θ in the VG model. The model used for prediction is the network with L = 5, N = 500 and dropout = 0 in Section 4.14 after 30 epochs of training.
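The Greeks follow from the log-price derivatives of w via the chain rule: since x = ln S, Delta = ∂V/∂S = w_x/S, Gamma = ∂²V/∂S² = (w_xx − w_x)/S², and Theta = −w_τ. Below is a minimal, hypothetical PyTorch sketch of this conversion (not the authors' code); it assumes a trained `model` with the input layout [x, τ, r, q, θ, σ, ν] used in the earlier sketches.

```python
# Hypothetical sketch: option Greeks by back propagation from the trained network.
import torch

def greeks(model, S, tau, r, q, theta, sigma, nu):
    """Delta, Gamma, Theta of V(S, t) = w(ln S, tau) via autograd and the chain rule."""
    x = torch.log(S).clone().requires_grad_(True)
    tau = tau.clone().requires_grad_(True)
    inp = torch.stack([x, tau, r, q, theta, sigma, nu], dim=1)
    w = model(inp).squeeze(-1)
    w_x, w_tau = torch.autograd.grad(w.sum(), (x, tau), create_graph=True)
    w_xx = torch.autograd.grad(w_x.sum(), x)[0]
    delta = w_x / S                      # dV/dS
    gamma = (w_xx - w_x) / S**2          # d^2 V/dS^2
    theta_greek = -w_tau                 # dV/dt = -dw/dtau
    return delta, gamma, theta_greek
```

The inputs are 1-D tensors of equal length; Greeks with respect to r, q and the model parameters can be obtained the same way by also requiring gradients on those inputs.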
In Figure 1, we choose K = 200, τ = 1 (1 year), r = 0.…, q = 0.…, θ = −0.…, σ = 0.…, ν = 0.4. In Figure 2, all the parameters are the same except τ = 3 (3 years). Even though we only aim to fit an approximate solution for the price, we also obtain the option Greeks from the neural network. Here we only show the option Greeks with respect to the price and time. In fact, we can get the Greeks with respect to r, q and all the model parameters from the neural network by back propagation.

RMSE     N = 300   400      500      600
L = 4    1.101     0.774    1.412    0.981
L = 5    0.955     1.100    0.684    0.882
Table 11: Root mean squared error (RMSE).

L = 5, N = 500
dropout rate     RMSE     MAE
0.0              0.684     3.659
0.1              3.721    10.742
0.2              1.244     6.841
0.3              1.942     6.482
0.4              0.960     4.942
0.5              0.973     4.669
Table 12: Root mean squared error (RMSE) and maximum absolute error (MAE).

size = 1,000,000                   RMSE     MAE
L = 5, N = 500, dropout = 0        0.113    0.976
L = 4, N = 400, dropout = 0.3      0.128    1.161
L = 4, N = 600, dropout = 0        0.173    1.451
Table 13: Root mean squared error (RMSE) and maximum absolute error (MAE).

size = 1,000,000                   RMSE     MAE
L = 5, N = 500, dropout = 0        0.113    0.976
L = 5, N = 500, dropout = 0.3      0.179    1.248
L = 5, N = 500, dropout = 0.5      0.226    1.417
Table 14: Root mean squared error (RMSE) and maximum absolute error (MAE).

L = 5, N = 500, dropout = 0        RMSE     MAE
size = 50,000                      0.111    1.136
size = 200,000                     0.125    0.970
size = 500,000                     0.142    1.163
size = 1,000,000                   0.113    0.976
Table 15: Root mean squared error (RMSE) and maximum absolute error (MAE).

In this paper we have proposed a pricing approach using unsupervised deep learning. In particular, we use neural networks to express the solutions and solve the PIDE for models based on Lévy processes. The first benefit of this approach is that we only need to train the neural network once for a given model. The second benefit is that we do not need labels for training. After model selection for the neural networks, we find that he-normal initialization, the swish activation, the Adam optimizer and no batch-normalization perform well in this approach. We show that the integral term in the PIDE does not degrade the approximation of option values and option Greeks, and that the prediction error shrinks with more training epochs. Also, the approach gives not only the option price itself but also the option Greeks.

We only study European options in this paper. For future work, it would be attractive to extend the same approach to price American options under Lévy processes.
[Figure 1: four panels (Option price, Delta, Gamma, Theta), each showing the predicted and true curves against the stock price together with their difference.]

Figure 1: Price, Delta, Gamma and Theta when τ = 1, r = 0.…, q = 0.…, θ = −0.…, σ = 0.…, ν = 0.4.
00 150 200 250 300 350 400stock price2030405060708090 predictedtrue 100 150 200 250 300 350 400stock price0.100.050.000.050.10 difference
Option price
100 150 200 250 300 350 400stock price0.60.50.40.30.20.1 predictedtrue 100 150 200 250 300 350 400stock price0.00300.00250.00200.00150.00100.00050.00000.00050.0010 difference
Delta
100 150 200 250 300 350 400stock price0.0010.0020.0030.0040.005 predictedtrue 100 150 200 250 300 350 400stock price0.000100.000050.000000.000050.000100.00015 difference
Gamma
100 150 200 250 300 350 400stock price54321012 predictedtrue 100 150 200 250 300 350 400stock price0.1250.1000.0750.0500.0250.0000.0250.050 difference
Theta
Figure 2: Price, Delta, Gamma and Theta when τ = 3, r = 0 . q = 0 . θ = − . σ = 0 . ν = 0 .
References

[1] O. E. Barndorff-Nielsen. Processes of normal inverse Gaussian type. Finance and Stochastics, 2(1):41–68, 1997.
[2] P. Carr, H. Geman, D. B. Madan, and M. Yor. The fine structure of asset returns: An empirical investigation. The Journal of Business, 75(2):305–333, Apr. 2002.
[3] P. Carr and D. Madan. Option valuation using the fast Fourier transform. The Journal of Computational Finance, 2(4):61–73, 1999.
[4] W. Fu and A. Hirsa. A fast method for pricing American options under the variance gamma model, 2019.
[5] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, 2015.
[7] A. Hirsa. Computational Methods in Finance. CRC Press, Apr. 2016.
[8] A. Hirsa, T. Karatas, and A. Oskoui. Supervised deep neural networks (DNNs) for pricing/calibration of vanilla/exotic options under various different processes. 2019.
[9] A. Hirsa and D. B. Madan. Pricing American options under variance gamma. Journal of Computational Finance, 7(2):63–80, 2003.
[10] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 448–456. JMLR.org, 2015.
[11] A. Itkin. Deep learning calibration of option pricing models: some pitfalls and solutions. 2019.
[12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[13] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.
[14] S. Liu, A. Borovykh, L. A. Grzelak, and C. W. Oosterlee. A neural network-based framework for financial model calibration. Journal of Mathematics in Industry, 9(1):9, Dec. 2019.
[15] R. Lord, F. Fang, F. Bervoets, and C. W. Oosterlee. A fast and accurate FFT-based method for pricing early-exercise options under Lévy processes. SSRN Electronic Journal, 2007.
[16] D. B. Madan and E. Seneta. The variance gamma (V.G.) model for share market returns. The Journal of Business, 63(4):511, Jan. 1990.
[17] J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.
[18] I. M. Sobol'. On the distribution of points in a cube and the approximate evaluation of integrals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 7(4):784–802, 1967.
[19] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[20] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
[21] Wikipedia. Trapezoidal rule. https://en.wikipedia.org/wiki/trapezoidal rule.

Appendices

A Calculation of the integral
This part follows Chapter 5 of [7]. We can split the integral term in Equation (3) into two terms, the integrals over |y| ≤ ε and |y| > ε respectively.

In the region |y| ≤ ε,
$$ w(x+y,\tau) = w(x,\tau) + y\frac{\partial w}{\partial x}(x,\tau) + \frac{y^2}{2}\frac{\partial^2 w}{\partial x^2}(x,\tau) + O(y^3) $$
and
$$ e^y = 1 + y + \frac{y^2}{2} + O(y^3). $$
Using these two approximations, we get
$$ \int_{|y|\le\epsilon}\left[ w(x+y,\tau) - w(x,\tau) - \frac{\partial w}{\partial x}(x,\tau)(e^y - 1) \right] k(y)\,dy = \int_{|y|\le\epsilon}\left[ \frac{y^2}{2}\left(\frac{\partial^2 w}{\partial x^2}(x,\tau) - \frac{\partial w}{\partial x}(x,\tau)\right) + O(y^3) \right] k(y)\,dy \approx \int_{|y|\le\epsilon} \frac{y^2}{2}\left[ \frac{\partial^2 w}{\partial x^2}(x,\tau) - \frac{\partial w}{\partial x}(x,\tau) \right] k(y)\,dy. $$
Define σ²(ε) = ∫_{|y|≤ε} y² k(y) dy, and we get
$$ \int_{|y|\le\epsilon}\left[ w(x+y,\tau) - w(x,\tau) - \frac{\partial w}{\partial x}(x,\tau)(e^y - 1) \right] k(y)\,dy \approx \frac{1}{2}\sigma^2(\epsilon)\left( \frac{\partial^2 w}{\partial x^2}(x,\tau) - \frac{\partial w}{\partial x}(x,\tau) \right). $$
In the region |y| > ε,
$$ \int_{|y|>\epsilon}\left[ w(x+y,\tau) - w(x,\tau) - \frac{\partial w}{\partial x}(x,\tau)(e^y - 1) \right] k(y)\,dy = \int_{|y|>\epsilon}\left[ w(x+y,\tau) - w(x,\tau) \right] k(y)\,dy + \frac{\partial w}{\partial x}(x,\tau)\,\omega(\epsilon), $$
where ω(ε) = ∫_{|y|>ε} (1 − e^y) k(y) dy.

Combining the two parts of the integral and putting them back into Equation (3), we get
$$ \frac{1}{2}\sigma^2(\epsilon)\frac{\partial^2 w}{\partial x^2}(x,\tau) + \int_{|y|>\epsilon}\left[ w(x+y,\tau) - w(x,\tau) \right] k(y)\,dy - \frac{\partial w}{\partial \tau}(x,\tau) + \left(r - q + \omega(\epsilon) - \frac{1}{2}\sigma^2(\epsilon)\right)\frac{\partial w}{\partial x}(x,\tau) - r w(x,\tau) = 0. \qquad (7) $$
The derivative terms can be calculated by back-propagation in the neural network. The integral ∫_{|y|>ε}[w(x+y,τ) − w(x,τ)]k(y)dy is calculated using the trapezoidal rule. With ε = 0.01, the integrand is evaluated on a non-uniform grid consisting of three regimes of equally spaced points with increasing spacing, and the trapezoidal rule is applied to approximate the integral over each interval between the grid points. The grid points are denser around 0 and coarser far from 0, because w(x + y, τ) is bounded and k(y) decreases exponentially as |y| increases.
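As a complement to the density-and-moments sketch earlier in the paper, the fragment below is a hypothetical NumPy illustration of evaluating the outer integral ∫_{|y|>ε}[w(x+y,τ) − w(x,τ)]k(y)dy with the trapezoidal rule on a non-uniform grid. The grid spacings and the placeholder function `w` are illustrative assumptions, not the grid used in the paper.

```python
# Hypothetical sketch: trapezoidal evaluation of the outer integral on a non-uniform grid.
import numpy as np

def vg_levy_density(y, theta, sigma, nu):
    root = np.sqrt(theta**2 / sigma**4 + 2.0 / (sigma**2 * nu))
    lam = np.where(y > 0, root - theta / sigma**2, root + theta / sigma**2)
    return np.exp(-lam * np.abs(y)) / (nu * np.abs(y))

def outer_integral(w, x, tau, theta, sigma, nu, eps=0.01):
    """Approximate int_{|y|>eps} [w(x+y,tau) - w(x,tau)] k(y) dy by the trapezoidal rule."""
    # illustrative non-uniform grid: denser near 0, coarser in the tails
    pos = np.concatenate([np.arange(eps, 0.1, 0.01),
                          np.arange(0.1, 1.0, 0.05),
                          np.arange(1.0, 10.0, 0.5)])
    total = 0.0
    for y in (pos, -pos[::-1]):                       # right and left half-lines
        f = (w(x + y, tau) - w(x, tau)) * vg_levy_density(y, theta, sigma, nu)
        total += np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(y))
    return total

# usage with a placeholder payoff-like function standing in for the trained network
w_stub = lambda x, tau: np.maximum(200.0 - np.exp(x), 0.0)
print(outer_integral(w_stub, np.log(180.0), 1.0, -0.2, 0.2, 0.4))
```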