Supervised Deep Neural Networks (DNNs) for Pricing/Calibration of Vanilla/Exotic Options Under Various Different Processes
Tugce Karatas, Amir Oskoui, Ali Hirsa

Abstract
We apply supervised deep neural networks (DNNs) to the pricing and calibration of both vanilla and exotic options under both diffusion and pure jump processes, with and without stochastic volatility. We train our neural network models with different numbers of layers, different numbers of neurons per layer, and various activation functions, in order to find which combinations work better empirically. For training, we consider several loss functions and optimization routines. We demonstrate that deep neural networks expedite option pricing by orders of magnitude compared to commonly used option pricing methods, which in turn makes calibration and parameter estimation extremely fast.
Keywords:
Machine Learning, Derivatives, Deep Neural Networks
1. Introduction
A major component of model calibration and parameter estimation is derivatives pricing. Aside from a few cases, there is no closed-form solution for derivatives pricing. In computational finance, it is critical to find fast and accurate approximations for the cases in which no analytical solution exists. Some of these approaches can be very time-consuming, so possessing a model that is both accurate and fast in pricing is central. Movement in time can make the calculated values obsolete; hence, the time it takes to re-compute these values is pivotal in the derivatives business. Sometimes it is possible to exploit redundancies in the calculations: for example, pricing via fast Fourier transform (FFT) techniques [1] yields a vector of prices on the same underlier for different strikes, taking advantage of redundant matrix operations to cut down on computational cost. In the setting of neural networks, once training is done, the mapping from input parameters X to an output price y has been optimized to give the smallest loss, through many iterations of updates on weights and biases composed with non-linear activation functions. Hence, with a trained architecture, the calculation of prices using neural networks, both with and without feedback connections that mimic the idea of context or memory in the brain (i.e., recurrent and feedforward neural networks, respectively), becomes elementary and, at the same time, extremely fast.

In this paper, we revisit the work of Culkin and Das (2017) [23], who used deep neural networks for option pricing, explore various architectures, and extend the work to more advanced models and different contracts. We show how feedforward neural networks can be used in the context of derivative pricing to achieve speed-ups of many orders of magnitude. As with conservation laws, the speed-up comes at a cost in accuracy. However, we show that the loss in accuracy is well within a worst-case estimate of bid-ask spreads, making the method practical. We demonstrate how to create labels to train the model for vanilla and exotic options (with weak and strong path dependence) under diffusion and pure jump processes, with and without stochastic volatility [2, 3, 4]. To optimize the speed and accuracy of the neural network's learning, we use many perturbations of the hyper-parameters (i.e., number of layers, neurons per layer, and activation functions). In particular, we show how European, Barrier, and American option prices can be trained and priced out-of-sample swiftly and accurately under geometric Brownian motion (GBM) [5] and variance gamma (VG) [6], both with and without stochastic arrival [2, 3]. We show that this method beats traditional techniques for option pricing, which results in super-fast calibration and parameter estimation.

(We would like to thank participants at the Shanghai Advanced Institute of Finance, Shanghai, China; Orient Securities, Shanghai, China; and Cubist Systematic Strategies, New York, for their comments and suggestions on this study. Errors are our own responsibility.)

To model market behavior we use four different processes: (a) geometric Brownian motion (GBM) [5], (b) geometric Brownian motion with stochastic arrival (GBMSA), a.k.a. the Heston stochastic volatility model [2], (c) the variance gamma (VG) model [6], and (d) variance gamma with stochastic arrival (VGSA) [3]. The first two are pure diffusion processes with constant and stochastic volatility, respectively.
The last two are pure jump processes, mirroring the diffusion cases: VG has constant volatility, whereas VGSA has stochastic arrivals to allow for volatility clustering [3]. GBM, as in the Black-Merton-Scholes model, is the best-known process for derivatives pricing [5]. However, its assumption of constant volatility does not reflect market behavior, giving rise to an implied volatility surface that reconciles model inputs with market prices. The variance gamma process is a pure-jump process built on a time-changed Brownian motion to account for high activity [6]. It has been shown that, with the addition of two extra parameters controlling skewness and kurtosis, we can closely fit the smile across strikes for a fixed maturity [6]. Both GBMSA and VGSA are stochastic volatility models, for the diffusion and jump processes respectively, that aim to capture the non-constant volatility observed in the market, enabling "us to calibrate to option price surfaces across both strike and maturity simultaneously" [7].
2. Brief Introduction to Neural Networks
Here, we briefly introduce feedforward neural networks; for a more detailed discussion of DNNs we refer the reader to Chapter 6 of Deep Learning by Goodfellow, Bengio, and Courville [8]. Feedforward neural networks, also called deep feedforward networks, are considered "quintessential" to deep learning, with the goal of approximating some function f* [8]. The network defines a mapping y = f(x; θ) and learns the parameter set θ that leads to an optimal function approximation. The network is essentially a composition of many different functions in a chain: the more of these functions applied to the input data X, the more layers in the network, and the length of the composition is, in a sense, the depth of the model. These networks must be trained in order to mimic a function or model.

Training data is provided along with the input x so that the learning algorithm can decide how to use the layers to produce values as close as possible to ŷ. During training, an optimizer drives the output of the neural network y = f(x) to match ŷ = f*(x). Hence, the neural network learns to use its layers to minimize an objective function, in our case the mean squared error:

\[
\mathrm{MSE} \;=\; \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2
\]

The training data only indicates what values the output layer must approximate for each input x; it does not directly indicate how to use the intermediate layers to optimally approximate f*. When initially looking at training data, it is not clear what the appropriate output of each layer should be, which is why they are referred to as hidden layers in a "neural" network.

The term neural comes from a loose inspiration from neuroscience and the dimensionality of these hidden layers. The width of each layer is the number of elements within it, each mapping a vector to a scalar via a non-linear activation function. In that sense, each of these elements is analogous to a neuron, because it takes inputs from many different sources and applies its own activation function [8]. A visualization of both depth and width in a feedforward neural network can be seen in Figure 1.

Another way of specializing neural networks to replicate complex functions in high dimensions is the convolutional neural network (CNN) [9]. CNNs often work well with data that "has a clear grid-structured topology," like time-series data or image data, which are 1-D and 2-D grids respectively [8]. Their success in practical applications for high-dimensional functions suggests they could be tremendously useful and intuitive in the context of Asian options, which have a grid-like topology [10].
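To make the setup concrete, below is a minimal sketch of such a feedforward pricing network in Keras. The widths, activations, and optimizer settings here are illustrative assumptions, not the architecture ultimately selected later in the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_pricer(input_dim: int) -> keras.Model:
    """Minimal feedforward pricer: maps market/model parameters to a
    normalized option price. Widths and activations are illustrative."""
    model = keras.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(120, activation="relu"),   # hidden layer 1
        layers.Dense(120, activation="relu"),   # hidden layer 2
        layers.Dense(1, activation="linear"),   # output: price / K
    ])
    # The MSE above is the objective function minimized during training.
    model.compile(optimizer="adam", loss="mse")
    return model

# X: (N, input_dim) parameter combinations, y: (N,) normalized prices
# model = build_pricer(X.shape[1])
# model.fit(X, y, epochs=50, batch_size=256)
```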
Figure 2 visualizes the "effect of depth," showing how an increased number of layers in a neural network (i.e., depth) results in higher prediction accuracy, as shown by Goodfellow et al. (2014d). This seems like a compelling point in favor of deeper networks; however, Zagoruyko and Komodakis (2017) noted that doubling the number of layers improved accuracy by only a fraction of a percent, pointing to a "problem of diminishing feature reuse, which makes these networks very slow to train" [24]. They demonstrated how a "wide" neural network (meaning the number of hidden units is at least twice the input dimension) of only 16 layers could outperform "deep" neural networks, including "thousand-layer-deep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO and significant improvements on ImageNet." In the context of derivative pricing, we explored these two competing arguments, and the results can be seen in Figure 8. The results clearly show the diminishing accuracy of the networks as depth increases, leading us to explore an optimal width for our hidden layers, as shown in Figure 7 and discussed further below.

Figure 1: A graphical visualization of a deep neural network.
After exploring the width and depth dimensions of our neural network, another idea worth exercising is dropout. To prevent over-fitting, we can force the network to learn its more robust features in conjunction with many different random subsets of the other neurons. This is called dropout because nodes of the network are dropped with probability 1 − p at each training stage.

Once this structure has been integrated into the neural network, optimization can be done using either plain gradient descent or more sophisticated routines such as RMSProp, Adam, or stochastic gradient descent (SGD) with momentum. In deciding the choice of optimizer for this architecture, we draw on insights, both local and global, studied by Soltanolkotabi et al. (2018) on reaching optima in wide and over-parameterized neural networks [11]. For a 2-layer neural network with leaky ReLU activation, Soudry and Carmon (2016) showed that a global minimum can be obtained using gradient descent on a modified loss function; however, this does not establish that the global minimum of the original loss function is reached [12]. Li and Yuan (2017) introduced an identity mapping under which SGD always converges to the global minimum for a 2-layer neural network with ReLU activation under the standard weight initialization scheme [13]. Under similar realistic assumptions, Kawaguchi showed that all local minima are global minima for nonlinear activation functions [14]. Additionally, under several assumptions, Choromanska et al. (2015) simplified the loss function to a polynomial with i.i.d. Gaussian coefficients, showing that with stochastic gradient descent all critical points found "are local minima of high quality," meaning the test error is comparable to that at the global minimum [15]. They concluded that global minima are more difficult to reach with larger networks, but in practice this is "irrelevant as global minimum often leads to overfitting" [15]. Generalizing these studies further, Kawaguchi et al. (2018) showed that, without over-parameterization or any simplification assumption, local minima of a DNN are theoretically "no worse than the globally optimal values of corresponding classical machine learning models" [16]. This theoretical observation was supported empirically using stochastic gradient descent with momentum as the optimizer during training for ReLU, leaky ReLU, and absolute value activation functions [16]. Around the same time, Allen-Zhu et al. (2018) noted that, under non-degenerate input and over-parameterization assumptions, global optima can be reached using simple algorithms such as SGD during training of the DNN, though this result does not extend to testing [17]. Furthermore, Kingma and Ba (2014) introduced the Adam (Adaptive Moment Estimation) optimization routine and showed that Adam converges faster than SGD and RMSProp when training a neural network with ReLU activations [18]. These theoretical and empirical findings lead us to favor more sophisticated optimization routines over plain gradient descent.

For our problem, we first used stochastic gradient descent (SGD) as the optimization routine, which, as expected, was painfully slow in training the model.

Figure 2: Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses; the test set accuracy consistently increases with increasing depth. Data from Goodfellow et al. (2014d) [8].
We then utilized the RMSProp and Adam optimization routines, since both are commonly used by practitioners. Adam was of particular interest because it combines advantages of AdaGrad and RMSProp, both of which are extensions of stochastic gradient descent. Based on our empirical observations, both RMSProp and Adam converge much faster than SGD. Furthermore, Adam appears to work better than RMSProp, consistent with what is expected from its bias-correction term and the insights mentioned above for an over-parameterized DNN with ReLU, leaky ReLU, and absolute value activation functions [18, 19]. Thus, for the remainder of this paper, Adam is used as our optimization routine for training the network.
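As an illustration of how these routines are swapped in practice, here is a minimal Keras sketch; the learning rates and momentum are illustrative defaults, not the paper's settings.

```python
from tensorflow import keras

# Candidate optimization routines; hyper-parameters below are illustrative
# defaults, not the settings used in the paper.
optimizers = {
    "sgd":     keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
    "rmsprop": keras.optimizers.RMSprop(learning_rate=1e-3),
    "adam":    keras.optimizers.Adam(learning_rate=1e-3),  # bias-corrected moments
}

# Reusing build_pricer from the earlier sketch: recompile the same architecture
# with each routine and compare validation-loss convergence.
# for name, opt in optimizers.items():
#     model = build_pricer(input_dim=5)
#     model.compile(optimizer=opt, loss="mse")
#     history = model.fit(X, y, validation_split=0.2, epochs=20, verbose=0)
```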
3. Labels and Parameter Selection
In order to implement supervised deep neural networks, many labels are needed for training. These labels are generated from existing models. Table 1 shows some classical techniques for pricing options and indicates, for a given model, which option types each technique can price. Monte-Carlo simulation is commonly considered the slowest of these methods, and tree methods have complexity that grows quadratically with the number of time steps. Even though PDE/PIDE discretization may beat Monte-Carlo simulation in terms of speed, and both cover a very wide domain of option types and models, both are still considered computationally expensive in practice. That leaves transform techniques such as the FFT method, which, although named "fast," still takes a non-trivial amount of time to run; moreover, these transforms are not applicable to path-dependent options. This is the crux of our problem: the most expensive and computationally time-consuming step in using supervised deep neural networks is generating labels from these existing methodologies. Once labels are generated, training feedforward neural networks becomes almost trivial.
Pricing Model/Method   Transform Techniques     PDEs/PIDEs               MC simulation            DNN model
                       vanilla  weak  exotic    vanilla  weak  exotic    vanilla  weak  exotic    vanilla  weak  exotic
GBM                    ✓        ✓     ×         ✓        ✓     ✓         ✓        ✓     ✓         ✓        ✓     ✓
VG                     ✓        ✓     ×         ✓        ✓     ✓         ✓        ✓     ✓         ✓        ✓     ✓
GBMSA                  ✓        ✓     ×         ✓        ✓     ✓         ✓        ✓     ✓         ✓        ✓     ?
VGSA                   ✓        ✓     ×         ✓        ✓     ✓         ✓        ✓     ✓         ✓        ✓     ?

Table 1: Pricing schemes for the various models/processes (✓ = applicable, × = not applicable, ? = open).
After deciding which method and model to use for each option type, the next step is to decide on market and model parameter ranges. Before going through the model parameters for each option, we first decide on the ranges of the market parameters, which will be the same across all models and options. As stated before, supervised deep neural networks necessitate creating a large number of labels from existing models¹. A general set of parameters is common to all of these models: initial stock price (S), strike price (K), maturity (T), interest rate (r), and dividend yield (q). Using the fact that option prices are linearly homogeneous in S and K, as seen in the Black-Merton-Scholes option pricing formula, instead of considering S and K as two separate parameters we consider only the ratio S/K, i.e., the option moneyness. Table 2 shows the product and market parameters we considered for each model. For training, we generated labels within 20% moneyness. The range for the maturity of the options is from 1 day to 3 years.

¹ By existing models, we mean analytical or computational models used for derivatives pricing, e.g., transform techniques, numerical solutions of PDEs or PIDEs, Monte-Carlo simulation, and the like.

Parameter           Range
Moneyness (S/K)     0.8 → 1.2
Maturity (T)        1 day → 3 years
Interest rate (r)   …
Dividend yield (q)  …

Table 2: Product and market parameters.
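The moneyness reduction follows directly from this homogeneity: for a pricing function that is homogeneous of degree one in (S, K),

\[
C(S, K, T, r, q; \Theta) \;=\; K \, C\!\left(\frac{S}{K},\, 1,\, T, r, q; \Theta\right),
\]

so the network only needs to learn the normalized price C/K as a function of the moneyness S/K and the remaining parameters, rather than of S and K separately.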
In the case of vanilla options, there are many available methods for pricing, including the fast Fourier transform (FFT), trees, and Monte-Carlo simulation. This leads us to build a deep neural network (DNN) trained on call and put prices generated in closed form from the Black-Merton-Scholes formula for the GBM model. For the VG, VGSA, and GBMSA models, the deep neural network is trained using labels obtained via FFT, given that we have the characteristic function of the logarithm of the stock price process under these models.

The process begins with creating an input parameter matrix X, which consists of different combinations of market and model parameters. Here, the model parameters are: volatility (σ) for the GBM model; volatility (σ), skewness (θ), and kurtosis (ν) for the VG model; rate of mean reversion (κ), correlation between stock and variance (ρ), long-term variance (θ), volatility of variance (σ), and initial variance (v₀) for the GBMSA model; and volatility (σ), skewness (θ), kurtosis (ν), rate of mean reversion (κ), long-term rate of time change (η), and volatility of the time change (λ) for the VGSA model. Table 3 shows the ranges of model parameters we considered for European options.

After determining the parameter ranges, the parameter combinations are sampled via two different approaches. The first is the Non-Quasi approach, in which parameter combinations are sampled at random, uniformly over the given ranges of market and model parameters. The second is the Quasi approach, in which we benefit from Halton sequences, which are deterministic and of low discrepancy. Here, we considered a 5-dimensional Halton sequence for the GBM model, 7-dimensional for the VG model, 9-dimensional for the GBMSA model, and 10-dimensional for the VGSA model. For this approach, Halton sequences are first generated in the R platform; the value of each parameter is then obtained by multiplying the Halton numbers by the width of the parameter's range and adding the lower bound of the corresponding parameter.

After the parameter ranges and sampling methods are chosen, the next step is to decide on the training size. We considered 4 different training sizes, 40,000, 80,000, 160,000, and 240,000, in order to observe the effect of training size on performance. Each data set is generated separately. As expected, the discrepancy between data points decreases as the training size increases; for this reason, the performance of the model is expected to improve with training size.

Once the input parameter matrix X = [S/K, T, r, q, σ] is produced for the GBM model, the output vector y = EC/K is obtained using the closed-form Black-Merton-Scholes formula, where EC is the price of the European call option. For the VG, GBMSA, and VGSA models, the output vector y = EC/K is obtained using the FFT algorithm for the corresponding input parameter matrices X. Therefore, (X, y) forms the training set of the deep neural network.

GBM               VG                  GBMSA                VGSA
σ: 0.05 → 0.50    σ: … → …            σ: … → …             σ: … → …
                  θ: −0.90 → −0.05    κ: … → …             θ: −0.90 → −0.05
                  ν: … → …            ρ: −0.90 → −0.10     ν: … → …
                                      θ: … → …             κ: … → …
                                      v₀: … → …            η: … → …
                                                           λ: … → …

Table 3: Model parameters.
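As a concrete illustration of the Quasi approach and the GBM labelling step, here is a minimal Python sketch. The paper generates its Halton sequences in R; scipy.stats.qmc is used here as a stand-in, and the r and q bounds in the code are placeholders (their ranges are elided above), while the σ range matches the one quoted later for American options.

```python
import numpy as np
from scipy.stats import norm, qmc

# Bounds for [S/K, T, r, q, sigma]. Moneyness and maturity follow Table 2;
# the r and q bounds are assumed placeholders.
lo = np.array([0.8, 1.0 / 365.0, 0.00, 0.00, 0.05])
hi = np.array([1.2, 3.0,         0.05, 0.03, 0.50])

def sample_quasi(n: int) -> np.ndarray:
    """Quasi approach: 5-D Halton points scaled as u * (hi - lo) + lo."""
    u = qmc.Halton(d=5, scramble=False).random(n)
    return qmc.scale(u, lo, hi)

def bsm_call_over_k(x: np.ndarray) -> np.ndarray:
    """Closed-form Black-Merton-Scholes label EC/K for rows [S/K, T, r, q, sigma]."""
    m, T, r, q, sig = x.T
    d1 = (np.log(m) + (r - q + 0.5 * sig**2) * T) / (sig * np.sqrt(T))
    d2 = d1 - sig * np.sqrt(T)
    return m * np.exp(-q * T) * norm.cdf(d1) - np.exp(-r * T) * norm.cdf(d2)

X = sample_quasi(40_000)   # input parameter matrix
y = bsm_call_over_k(X)     # labels: normalized European call prices
```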
To introduce some path dependence, we transition implementation to Barrier Options;for which, closed-form solutions are available in the GBM setting. However, prices areusually computed using Monte-Carlo simulation for beyond GBM . In this section, wewill focus on pricing an Up-and-Out Put Options (UOP), where the Variance Gammaand Geometric Brownian Motion models are used to describe the market both with andwithout stochastic arrivals. The ranges of all product and market parameters are thesame with that of European options as in Table 2 except the range of H/K , where H isthe barrier level. Since we used the normalized values for stock prices and option prices,we consider the normalized values of barrier levels as well. For H/K , we consider therange between S / K and 1.2, where 1.2 is the upper bound for moneyness consideredbefore. The reason behind the lower bound of this range is that the barrier level mustbe greater than the initial stock price for up-and-out barrier options. Furthermore, theranges of the model parameters come from Table 3 as in European options. Again, theparameter combinations are sampled at random uniformly over the given ranges of marketand model parameters.Once the input parameter matrix X = [ S / K , H / K , T , r , q , σ ] is produced for theGBM model, the output vector y = UOP / K is obtained by using the closed formsolution. Here, UOP is the price of up-and-out put options. For the GBMSA and VGSAmodels the output vector y = UOP / K is obtained by using Monte Carlo procedure fortheir corresponding input parameter matrices X . This procedure is based on 10,000 pricepaths of the underlying for the GBMSA model and 8,000 price paths of the underlyingfor the VGSA model where each simulated path has 100 time-steps. For details on howto simulate GBMSA or VGSA processes, see [7] page 239. Now to really get to see where neural networks can transform the norm for pricingtimes, we introduce the strong path dependence of American Options to this architecture.American Options are usually priced using either Monte-Carlo simulation or tree methods.Since both of these methods are relatively time consuming, we construct a fully-connectedfeedforward neural network trained on the prices of American Options generated viathe Ju-Zhong (JZ) approximation in the GBM model. With input parameter matrix: Depends on the payoff, we can apply transform techniques like the Cosine-Fourier (COS) method[20] for pricing some barrier options, however for this study we tried Monte-Carlo methods for pricing. = [ S / K , T , r , q , σ ], the output vector y = AJZ / K is obtained by using the Ju-ZhongApproximation. Here, JZ is the price of American Put Options under the Ju-ZhongApproximation. Therefore, ( X , y ) will form the training set of the deep neural network.While generating input parameter matrices for American options, we used the Non-Quasiapproach, which is described in Section 3.1. Therefore, market parameter combinationsare generated at random uniformly over the ranges provided in Table 2, and the modelparameter σ is generated from a uniform random distribution over the range between0.05 and 0.50.
4. Training
Once training sets (X, y) are obtained for each model and option type, they are fed to the deep neural network algorithm. We train each model with different numbers of layers, different numbers of neurons per layer, and various combinations of activation functions, in order to observe which combinations work better empirically for the given training sets.

Non-linearity in the model is introduced through activation functions, and we use different combinations of them. In this experiment, we considered the exponential linear unit (elu), rectified linear unit (relu), leaky rectified linear unit (leaky relu), and sigmoid activation functions. For both the sigmoid and tanh functions, if the input is either very large or very small, the slope of the activation becomes very small, slowing down gradient descent and making either function poorly suited for hidden layers [21]. However, so as not to rule anything out, we still stress-tested permutations of these activation functions and visualized the results. There is a common preference for the rectified linear unit (relu) in hidden layers, as its derivative is 1 for all positive inputs and 0 for negative ones; to avoid a zero derivative, leaky relu can often be substituted [22].
Figure 3: Sample-1
Figure 4: Sample-2
As expected, Figures 3 and 4 illustrate that a sigmoid activation function is not suitable for our structure; going forward we omit it from our models and continue with combinations of the elu, relu, and leaky relu activation functions. After testing all possible combinations of these three activation functions for 2, 3, and 4 layers, we found that trained models whose first activation function is leaky relu perform better than the other trained models; Figures 5 and 6 are good examples of this. The observation holds not only for the GBM model but for the other processes as well. Based on these empirical observations, we set the initial activation function to leaky relu and test different combinations of activation functions for the other hidden layers. We then select the trained model with the smallest MSE value, or the largest R² value, as our pricing neural net.

Figure 5: Sample-3
Figure 6: Sample-4
After testing different combinations of activation functions, the next step is to test the effect of different numbers of layers and different numbers of neurons per layer. In the first experiment, we keep the number of layers and the set of activation functions fixed and vary the number of neurons per layer. Figure 7 shows how the R² value changes as the number of neurons per layer increases. Here, we train the deep neural network with European call option prices under the GBM model; there are 4 hidden layers, and the activation function is leaky relu in each hidden layer. We can see in Figure 7 that as the number of neurons per layer increases, the neural net model performs better, but with diminishing returns: beyond 120 neurons per layer the results plateau, making it pointless to increase the width further. Hence, we used 120 neurons per layer as the default.

Figure 7: Number of neurons vs. R²

In Figure 8, we examine model performance as the number of layers increases. We show how the RMSE (root mean squared error) changes when we increase the number of layers for European call options under the GBM model; this finding is fairly consistent across the other models. It can be seen in Figure 8 that performance improves as the number of layers increases, but the improvement is limited to 4 layers: with 5 layers or more, the RMSE increases. This can be interpreted as an over-parameterization problem that arises as the number of layers grows. Based on this observation, we cap the number of layers in our neural network at four.

Figure 8: Number of layers vs. RMSE
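Putting the sweep together, here is a schematic of the architecture search described in this section. The exact grid below is an illustrative subset, not the full set of combinations tested in the paper.

```python
import itertools
from tensorflow import keras
from tensorflow.keras import layers

def build(depth, width, hidden_acts, input_dim):
    """One candidate architecture: the first hidden activation is fixed to
    leaky relu (per the empirical finding above), the rest are varied."""
    model = keras.Sequential([layers.Input(shape=(input_dim,))])
    model.add(layers.Dense(width))
    model.add(layers.LeakyReLU())
    for act in hidden_acts:                 # depth - 1 further hidden layers
        model.add(layers.Dense(width, activation=act))
    model.add(layers.Dense(1, activation="linear"))
    model.compile(optimizer="adam", loss="mse")
    return model

# Sweep depth, width, and activation combinations; grid values are assumptions.
# for depth, width in itertools.product([2, 3, 4], [40, 80, 120]):
#     for acts in itertools.product(["elu", "relu"], repeat=depth - 1):
#         model = build(depth, width, list(acts), input_dim=5)
#         ...fit on (X, y), record validation MSE / R^2, keep the best...
```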
5. Validation
Once we obtain the optimal neural network architecture for each model and each option type, we need to check the architecture's validity. In order to do so, we check each trained model in three cases: (1) interpolation, (2) deep out-of-the-money, and (3) longer maturity. In the first case, we validate on interpolated points from our testing set. In the second case, we keep everything else fixed and test how the trained models behave when S/K is between 0.6 and 0.8. In the third case, we keep everything else fixed and test how the trained models behave when the maturity is between 3 and 5 years.

European call option prices are generated under 4 different models; hence, we obtain 4 different neural network architectures for European call options. Their performance is measured in terms of MSE (mean squared error) and R² values. Table 4 summarizes the empirical results for the trained models under GBM and VG, and Table 5 summarizes those for the trained models under the GBMSA and VGSA models. For all of these tables, the in-sample test size and the test sizes for the deep-out-of-the-money and longer-maturity cases are taken as 60,000.

According to the results in Tables 4 and 5, as the training size increases, the performance of the trained models improves, but with diminishing returns. The second result is that the performance of the trained neural net under the GBM model is better than under all the other processes. This is probably because, as the number of parameters increases, the process becomes more complicated to learn, and more labels are presumably needed to train it.

                            GBM                                        VG
Training set size           40,000    80,000    160,000   240,000     40,000    80,000    160,000   240,000
In-Sample: MSE              0.000060  0.000034  0.000029  0.000018    0.000678  0.000059  0.000046  0.000028
In-Sample: R²               …
Deep-Out-Of-The-Money: MSE  0.000065  0.000054  0.000022  0.000014    0.000376  0.000071  0.000050  0.000042
Deep-Out-Of-The-Money: R²   …
Longer Maturity: MSE        0.000348  0.000184  0.000191  0.000129    0.004338  0.000643  0.000524  0.000378
Longer Maturity: R²         …

Table 4: European option pricing under GBM and VG.
                            GBMSA                                      VGSA
Training set size           40,000    80,000    160,000   240,000     40,000    80,000    160,000   240,000
In-Sample: MSE              0.000122  0.000059  0.000032  0.000025    0.000240  0.000230  0.000244  0.000092
In-Sample: R²               …
Deep-Out-Of-The-Money: MSE  0.000066  0.000032  0.000021  0.000016    0.000226  0.000324  0.000271  0.000150
Deep-Out-Of-The-Money: R²   …
Longer Maturity: MSE        0.000362  0.000237  0.000158  0.000103    0.000647  0.000724  0.000755  0.000377
Longer Maturity: R²         …

Table 5: European option pricing under GBMSA and VGSA.
As one example, Figure 9 shows the performance of the architecture for the GBMSA model on in-sample test data, deep-out-of-the-money data, and longer-maturity extrapolation data.

Figure 9: Performance of the pricing machine under the GBMSA model - EC.

Furthermore, we also studied the case where the training data set is generated from deep-out-of-the-money European call option prices under the GBM model. Keeping all other parameter ranges the same, we generated the training data set with S/K between 0.6 and 0.8. Table 6 summarizes the empirical results for the corresponding trained model under the GBM model. As in the previous tables, we check the trained model for the interpolation and longer-maturity cases. We also check the model for the case where S/K is between 0.8 and 1.2, which appears under the Out-Of-The-Money label in Table 6. When the results are compared with those in Table 4, the In-Sample results are very similar; however, the extrapolation results are better in Table 6. Therefore, it can be concluded that although we trained the model on deep-out-of-the-money option prices, the neural network provides good predictions when the options are out-of-the-money. It is important to note that by out-of-the-money we mean moneyness between 0.8 and 1.2.

Training set size           40,000    80,000    160,000   240,000
In-Sample: MSE              0.000138  0.000042  0.000029  0.000018
In-Sample: R²               …
Out-Of-The-Money: MSE       0.000166  0.000034  0.000020  0.000014
Out-Of-The-Money: R²        …
Longer Maturity: MSE        0.001845  0.000311  0.000285  0.000221
Longer Maturity: R²         …

Table 6: European option pricing under GBM - 2.

Up until now, we have only reported results using the Non-Quasi approach. To see how well the Quasi approach works, we implemented it for generating data for European call options under the GBM model. Figure 10 shows the results obtained with both the Non-Quasi and Quasi approaches. When the data is generated using Halton sequences, the model fit is slightly better.

Figure 10: Non-Quasi approach vs. Quasi approach under the GBM model - EC.
Up-and-out Barrier put option prices are generated under 3 different models; hence, we implemented 3 different neural networks for pricing UOPs. For the GBMSA and VGSA models, we take the maximum training size to be 160,000, because the Monte-Carlo procedure is computationally expensive and, from the previous section, we know that the performance of models trained with 160,000 data points is close enough to that of models trained with 240,000. Table 7 summarizes the empirical results for the trained models under the GBM, GBMSA, and VGSA models. As seen in Table 7, as the number of parameters in the model decreases, the performance of the trained model increases. As an example, Figure 11 shows the performance of the GBM model on in-sample test data and longer-maturity extrapolation data.

                            GBM                                        GBMSA                          VGSA
Training set size           40,000    80,000    160,000   240,000     40,000    80,000    160,000    40,000    80,000    160,000
In-Sample: MSE              0.000159  0.000038  0.000017  0.000016    0.000114  0.000083  0.000026   0.000149  0.000089  0.000031
In-Sample: R²               …
Deep-Out-Of-The-Money: MSE  0.000684  0.000617  0.000766  0.000562    0.001230  0.000965  0.000764   0.002254  0.001733  0.001275
Deep-Out-Of-The-Money: R²   …
Longer Maturity: MSE        0.000490  0.000107  0.000042  0.000048    0.000288  0.000167  0.000053   0.000163  0.000160  0.000067
Longer Maturity: R²         …

Table 7: Barrier option pricing under GBM, GBMSA, and VGSA.
Figure 11: Performance of Pricing Machine under GBM Model - UOBP
As in the case of European options, we also applied the Quasi approach for up-and-out Barrier put options under the GBM model. Figure 12 shows the results obtained with both the Non-Quasi and Quasi approaches. We can conclude that when the data is generated under the Quasi approach, the model fit is better.
Figure 12: Non-Quasi approach vs. Quasi approach under the GBM model - UOBP.

American option prices were generated under the GBM model using the Ju-Zhong approximation. The empirical results from the neural network trained on these data are summarized in Table 8. Additionally, Figure 13 shows the performance of this architecture, trained under the Ju-Zhong approximation of the GBM model, on in-sample test data and deep-out-of-the-money extrapolation data, from left to right respectively.
Training set size           40,000    80,000    160,000   240,000
In-Sample: MSE              0.000196  0.000187  0.000113  0.000025
In-Sample: R²               …
Deep-Out-Of-The-Money: MSE  0.000221  0.000078  0.000098  0.000092
Deep-Out-Of-The-Money: R²   …
Longer Maturity: MSE        0.000788  0.000500  0.000364  0.000812
Longer Maturity: R²         …

Table 8: American option pricing under GBM.

Figure 13: Performance of the pricing machine trained on the Ju-Zhong approximation of the GBM model - JZ.
6. Deep Neural Network vs. Recurrent Neural Network
Although the trained feedforward neural network for pricing American options performs well, as mentioned earlier it is quite expensive to generate labels for American options. Since American options are strongly path-dependent, we might benefit from recurrent neural networks in pricing them. While generating data, input parameters (except maturity) are sampled at random, uniformly over the ranges defined before; for maturity, we define discrete time steps from 1 to 12 months. Figure 14 compares the performance of the deep neural network and the recurrent neural network. According to the numerical results, the MSE of the DNN model is slightly smaller than that of the RNN model. On the other hand, there is a big gap between the training times of the two models: in our experiment, training the DNN took 50% more time than training the RNN.

Although neural networks are proficient at storing implicit knowledge, they struggle with memorization of facts (i.e., a working memory). Human beings' ability to hold memory and associate it with some context is critical for problem solving. Hence, a recurrent neural network, that is, a neural network equipped with feedback loops, can "not only rapidly and 'intentionally' store and retrieve facts, but also sequentially reason with them" [8], enticing us to further explore the possibilities of RNN architectures in future work (a minimal sketch follows Figure 14 below).
Figure 14: DNN vs. RNN
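The sketch promised above: one hedged way to set up the recurrent pricer, treating the 12 discrete monthly maturities as the sequence dimension. How the authors arranged their RNN inputs is not specified, so this layout, like the architecture and widths, is an assumption.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative recurrent pricer: each sample is a sequence over the 12 monthly
# maturities, with the static parameters repeated at every step.
n_steps, n_features = 12, 5        # months 1..12; e.g. [S/K, r, q, sigma, t]

model = keras.Sequential([
    layers.Input(shape=(n_steps, n_features)),
    layers.LSTM(64, return_sequences=True),   # feedback loop carries "memory"
    layers.TimeDistributed(layers.Dense(1)),  # normalized price at each step
])
model.compile(optimizer="adam", loss="mse")

# X: (N, 12, 5) sequences; y: (N, 12, 1) Ju-Zhong prices per maturity step.
# model.fit(X, y, epochs=50, batch_size=256)
```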
7. Conclusion
In this paper, we utilized a fully connected feedforward neural network to price Vanilla, Barrier, and American options under the GBM, VG, GBMSA, and VGSA regimes, and explored the benefits of introducing recurrent neural networks to this task. The crux of the problem is creating labels and selecting hyper-parameters for training the deep neural network. We used sample sizes of 300,000 points for training and 60,000 points for validation. We observed that prediction accuracy improves with added layers and neurons per layer, but the gains eventually diminish. Once training is done, the model can be used for derivative pricing with market parameters both within and reasonably outside the initial range of the training set, with very low error. We also compared the performance of feedforward and recurrent neural networks for American put options; we observed that the training time of the RNN is almost half that of the DNN, which encourages us to explore RNNs further in future work. All in all, once labels are created, training and predicting option prices is computationally very inexpensive. This architecture results in speed-ups of many orders of magnitude, making it practical and easy to implement and use.

References

[1] P. Carr, D. Madan, Option valuation using the fast Fourier transform, Journal of Computational Finance 2 (4) (1999) 61–73.
[2] S. L. Heston, A closed-form solution for options with stochastic volatility with applications to bond and currency options, The Review of Financial Studies 6 (2) (1993) 327–343.
[3] P. Carr, H. Geman, D. B. Madan, M. Yor, Stochastic volatility for Lévy processes, Mathematical Finance 13 (3) (2003) 345–382.
[4] A. Hirsa, D. B. Madan, Pricing American options under variance gamma, Journal of Computational Finance 7 (2) (2004) 63–80.
[5] F. Black, M. Scholes, The pricing of options and corporate liabilities, Journal of Political Economy 81 (3) (1973) 637–654.
[6] D. B. Madan, P. P. Carr, E. C. Chang, The variance gamma process and option pricing, Review of Finance 2 (1) (1998) 79–105.
[7] A. Hirsa, Computational Methods in Finance, CRC Press, 2016.
[8] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, Vol. 1, MIT Press, Cambridge, 2016.
[9] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, L. D. Jackel, Handwritten digit recognition with a back-propagation network, in: Advances in Neural Information Processing Systems, 1990, pp. 396–404.
[10] J. Vecer, M. Xu, Pricing Asian options in a semimartingale model, Quantitative Finance 4 (2) (2004) 170–175.
[11] M. Soltanolkotabi, A. Javanmard, J. D. Lee, Theoretical insights into the optimization landscape of over-parameterized shallow neural networks, arXiv preprint arXiv:1707.04926.
[12] D. Soudry, Y. Carmon, No bad local minima: Data independent training error guarantees for multilayer neural networks, arXiv preprint arXiv:1605.08361.
[13] Y. Li, Y. Yuan, Convergence analysis of two-layer neural networks with ReLU activation, in: Advances in Neural Information Processing Systems, 2017, pp. 597–607.
[14] K. Kawaguchi, Deep learning without poor local minima, arXiv preprint arXiv:1605.07110v3.
[15] A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, Y. LeCun, The loss surfaces of multilayer networks, arXiv preprint arXiv:1412.0233v3.
[16] K. Kawaguchi, J. Huang, L. P. Kaelbling, Effect of depth and width on local minima in deep learning, arXiv preprint arXiv:1811.08150.
[17] Z. Allen-Zhu, Y. Li, Z. Song, A convergence theory for deep learning via over-parameterization, arXiv preprint arXiv:1811.03962.
[18] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
[19] S. Ruder, An overview of gradient descent optimization algorithms, arXiv preprint arXiv:1609.04747.
[20] F. Fang, C. W. Oosterlee, Pricing early-exercise and discrete barrier options by Fourier-cosine series expansions, Numerische Mathematik 114 (1) (2009) 27.
[21] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[22] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
[23] R. Culkin, S. R. Das, Machine learning in finance: The case of deep learning for option pricing, Journal of Investment Management 15 (4) (2017) 92–100.
[24] S. Zagoruyko, N. Komodakis, Wide residual networks, arXiv preprint arXiv:1605.07146.
[25] J. De Spiegeleer, D. B. Madan, S. Reyners, W. Schoutens, Machine learning for quantitative finance: fast derivative pricing, hedging and fitting, Quantitative Finance 18 (10) (2018) 1635–1643.
[26] J. M. Hutchinson, A. W. Lo, T. Poggio, A nonparametric approach to pricing and hedging derivative securities via learning networks, The Journal of Finance 49 (3) (1994) 851–889.
[27] P. Carr, H. Geman, D. B. Madan, M. Yor, The fine structure of asset returns: An empirical investigation, The Journal of Business 75 (2) (2002) 305–332.
[28] W. Fu, A. Hirsa, A fast method for pricing American options under the variance gamma model, 2019.