Deep Structural Estimation: With an Application to Option Pricing
Hui Chen ∗ Antoine Didisheim † Simon Scheidegger ‡§ February 9, 2021
Abstract
We propose a novel structural estimation framework in which we train a surrogate of an economic model with deep neural networks. Our methodology alleviates the curse of dimensionality and speeds up the evaluation and parameter estimation by orders of magnitude, which significantly enhances one's ability to conduct analyses that require frequent parameter re-estimation. As an empirical application, we compare two popular option pricing models (the Heston model and the Bates model with double-exponential jumps) against a non-parametric random forest model. We document that: a) the Bates model produces better out-of-sample pricing on average, but both structural models fail to outperform the random forest for large areas of the volatility surface; b) the random forest is more competitive at short horizons (e.g., 1 day), for short-dated options (with less than 7 days to maturity), and on days with poor liquidity; c) both structural models outperform the random forest in out-of-sample delta hedging; d) the Heston model's relative performance has deteriorated significantly since the 2008 financial crisis.

∗ Department of Finance, MIT Sloan School of Management and NBER. Email: [email protected].
† Department of Finance, HEC Lausanne, University of Lausanne and Swiss Finance Institute. Email: antoine.didisheim@unil.
‡ Department of Economics, HEC Lausanne, University of Lausanne. Email: simon.scheidegger@unil.
§ This work is generously supported by grants from the Swiss Platform for Advanced Scientific Computing (PASC) under project ID "Computing equilibria in heterogeneous agent macro models on contemporary HPC platforms", the Swiss National Supercomputing Centre (CSCS) under project ID 995, and the Swiss National Science Foundation under project IDs "Can Economic Policy Mitigate Climate-Change?" and "New methods for asset pricing with frictions". Simon Scheidegger gratefully acknowledges support from the MIT Sloan School of Management and the Cowles Foundation at Yale University.

1 Introduction
Driven by theoretical developments, the availability of "big data," and gains in computing power, contemporary models in economics and finance have seen tremendous growth in complexity. However, state-of-the-art structural models often impose a substantial roadblock for researchers and practitioners: with an ever-increasing number of states and parameters, they increasingly suffer from the curse of dimensionality (Bellman, 1961)—that is, the computational burden of evaluating the model or estimating model parameters and hidden states grows exponentially with every additional degree of freedom. Consequently, economists are often forced to sacrifice certain features of the model in order to reduce model dimensionality, to estimate only a partial set of parameters while pre-fixing the others, and to estimate the model only once using the full sample of data. Such restrictions limit a researcher's ability to carry out important model analyses, including: a) sub-sample or out-of-sample analyses; b) cross-validation, in particular with time-series forecasting models, where it is routine to re-estimate the model using a moving or expanding window to take into account the latest available data while avoiding any look-ahead bias (see, e.g., Welch and Goyal, 2007); c) managing heterogeneity in data through repeated re-estimation, for example, when fitting a consumption-portfolio model to a large cross-section of households; d) testing for parameter stability. In some cases, the exponentially increasing computational costs can prevent the adoption of complex models by practitioners in real time.

To tackle these issues, we introduce deep structural estimation, a new framework to evaluate and estimate structural models more efficiently. At its core, our method re-purposes the concept of surrogate models, commonly applied in physics and engineering, in the context of financial models.
We adopt deep neural networks to create cheap-to-evaluate surrogates of high-dimensional models—that is, functions that take the same input as the original model and yield the same output at a significantly lower computational cost. To construct these surrogates, we treat parameters as pseudo-states and estimate both parameters and hidden states from the data in a comparatively cheap optimization procedure.

In physics and engineering, Gaussian process regression (see, e.g., Williams and Rasmussen, 2006; Tripathy, Bilionis, and Gonzalez, 2016; Bilionis and Zabaras, 2012a; Bilionis, Zabaras, Konomi, and Lin, 2013; Chen, Zabaras, and Bilionis, 2015), radial basis functions (Park and Sandberg, 1991), or relevance vector machines (Bilionis and Zabaras, 2012b) are often used to build surrogate models. More recently, following the rapid developments in the theory of stochastic optimization and artificial intelligence, as well as the advances in computer hardware leading to the widespread availability of graphics processing units (GPUs; see, e.g., Scheidegger, Mikushin, Kubler, and Schenk (2018); Aldrich, Fernández-Villaverde, Gallant, and Rubio-Ramírez (2011), and references therein), researchers have turned their attention towards deep neural networks (see, e.g., Tripathy and Bilionis, 2018a; Liu, Borovykh, Grzelak, and Oosterlee, 2019a).

This methodology stands in contrast to traditional grid-based approximation, where approximating even a function of a single state variable on a Cartesian grid may require on the order of 10 grid points to maintain a similar accuracy level. Generalizing to d states, this procedure requires visiting O(10^d) locations in the d-dimensional state space and evaluating the function at all those locations.
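The grid-point count can be made concrete in a few lines of Python. This is our own toy illustration of the O(10^d) growth discussed above; the helper function and the choice of 10 points per dimension are assumptions for illustration, not the paper's code.

```python
# Curse of dimensionality: a tensor-product (Cartesian) grid with
# roughly 10 points per dimension needs 10**d total evaluations.
def cartesian_grid_points(d, points_per_dim=10):
    """Total model evaluations for a naive grid in d dimensions."""
    return points_per_dim ** d

for d in (1, 2, 5, 14):
    print(d, cartesian_grid_points(d))
# For d = 14 (the jump-diffusion model studied below), a naive grid
# would require 10**14 model evaluations -- clearly infeasible.
```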
Even in a situation where a single function evaluation is relatively inexpensive to compute, naively attempting to approximate a high-dimensional function in this way can quickly become infeasible. In contrast, the construction of a surrogate model based on neural networks alleviates the curse of dimensionality because one can typically train a deep network to accurately approximate the high-dimensional function of interest using substantially fewer observations than one would need with more commonly used methods based on Cartesian grids. Montanelli and Du (2019), for instance, show formally that, under certain assumptions, the error of approximating multivariate functions with deep ReLU networks can be bounded by that of sparse grids (see, e.g., Bungartz and Griebel (2004), and references therein), which themselves alleviate the curse of dimensionality.

In one of the examples we study below, a double-exponential jump-diffusion model has 14 states (d = 14). We train a surrogate using a DNN with seven hidden layers and 400 neurons in each layer, which corresponds to roughly 10^6 trainable parameters. The model is then successfully trained on a sample far smaller than the 10^14 points a Cartesian grid would require. Our numerical experiments show that while it takes over 40 minutes to estimate this model using traditional grid-based methods, the surrogate model reduces the execution time to less than one second. Although comparing the time to execute an algorithm can be misleading (it depends on expert knowledge and code optimization), the gulf between the two methods in our experiments demonstrates the potential of deep structural estimation. Such computational gains have important consequences for research.
They allow for i) re-estimating a model's parameters and hidden states at a high frequency; ii) thorough out-of-sample analysis of model performance, which ensures the generalization of the results; and iii) robust statistical testing of parameter stability.

As an application, we use the deep structural estimation methodology to compare two approaches commonly used for pricing options: structural versus entirely data-driven. Conceptually, structural models have the advantage of linking the dynamics of option prices with the dynamics of the hidden state variables and the structural parameters. While (reduced-form) data-driven models can be more flexible in fitting the data in-sample, they are more prone to over-fitting. Moreover, their performance relative to structural models could also deteriorate out-of-sample when a change in the underlying hidden state significantly alters the shape of the volatility surface; changes that a suitable structural model might be able to predict.

For reasons of transparency, we select as examples of structural models the classic stochastic volatility model proposed by Heston (1993) (HM) and a double-exponential jump-diffusion model extended from Bates (1996) (BDJM). Both models have been extensively studied in the literature and are well understood. They feature one hidden state variable: spot volatility. As an example of a data-driven model, we use a non-parametric practitioner's Black-Scholes model, modeling the implied volatility surface using random forests (RFs; see, e.g., Breiman, 2001). We create a surrogate for each of these option-pricing models.
The surrogates enable us to re-estimate each option model's parameters, as well as the spot volatility, on the daily cross-section of S&P 500 options for every individual day over the last seventeen years, producing a time series of jointly estimated optimal parameters and hidden states across the sample. We stress that without the surrogate technology, the computational cost of estimating the parameters would render this analysis infeasible.

Our empirical analysis reveals several findings. First, we compare the in- and out-of-sample performance of the structural models and the RF model. As expected, each pricing model's in-sample performance is directly related to its number of degrees of freedom, with the RF producing the smallest mean squared error (MSE) of the three option pricing models. Out of sample, the BDJM outperforms the HM and the RF on average; the RF outperforms the HM at short horizons but underperforms when the forecasting horizon exceeds 7 days. In the cross-section, the RF outperforms both parametric models across all moneyness levels for options with a time to maturity of less than one week. Furthermore, the BDJM outperforms the other models when market volatility and jump risks are elevated, while the structural models perform worse for options with poor liquidity (as measured by bid-ask spreads). These results highlight the specific areas of the volatility surface on which option pricing theory has room to improve.
Second, using the statistical test proposed by Andersen, Fusari, and Todorov (2015), we show that the estimated parameters of both structural models are highly unstable: the parameter estimates from two consecutive trading days differ significantly (at the 1% level) 41.6% of the time for the BDJM and 60.7% of the time for the HM. Furthermore, the percentage of days with significant changes in the parameters increases over time. Note that the test was originally designed to compare two large samples, but the deep surrogate estimation framework allows us to apply it at a daily frequency.

Third, we compare the models' hedging performances. One of the important tasks of option pricing models is to estimate the option delta, an option's price sensitivity to changes in the price of the underlying asset. This key estimate allows market makers and liquidity providers to hedge their positions dynamically. For each option, we perform delta hedging at a daily frequency based on the model-specific theoretical deltas calculated using parameter values estimated that day. We show that the BDJM replicating portfolio outperforms that of the HM on average. This difference is concentrated in short-dated out-of-the-money puts and long-dated out-of-the-money calls, and it mostly occurs after 2011. We also show that RFs are ill-suited for constructing hedging portfolios due to their non-smoothness. This result highlights a fundamental weakness of non-parametric models: while parametric models aim to provide a simplified description of the world and can therefore be used for multiple purposes, non-parametric models tend to be task-oriented and cannot be trivially applied to multiple goals without rethinking the model itself.
Fourth, throughout our analysis, we notice a steep decline in the parametric models' performance from the 2008 financial crisis to the present, which could be due to tightened intermediary constraints during the crisis and the new financial regulation that followed (see, e.g., Chen, Joslin, and Ni, 2019a; Haynes, McPhail, and Zhu, 2019). The parameters' in-sample fit, the out-of-sample predictive power, and the hedging performance all deteriorate significantly. This observation suggests that a structural change might have impacted these pricing models' performance. Furthermore, we observe that the HM's performance relative to that of the BDJM has decreased significantly in recent years, implying that jump risk has become a more critical pricing factor in the options market. Notice that frequent re-estimation of the models is needed to uncover these patterns. Indeed, when a model's parameters are selected to minimize the mean squared error over the whole sample, the in-sample error is mechanically smoothed across time, while out-of-sample performance cannot be measured.

From a methodological perspective, this paper makes two contributions to the existing literature. First, we present deep structural estimation and discuss how to use it in practical applications. We demonstrate how to adapt the deep neural network's architecture, the neurons' activation functions, the training procedure, and the simulated training sample to a given model's complexity—that is, to the number of parameters and states, as well as the non-linearity of the model's response surface. Second, we demonstrate the usefulness of cheap-to-evaluate surrogate models in controlled environments. With simulated data, we show that deep structural estimation can swiftly and very precisely estimate the parameters of complex structural models.

The remainder of this article is organized as follows. Section 2 provides a brief overview of the related literature.
In Section 3, we present the deep structural estimation methodology and discuss how to apply it to option pricing models. Section 4 discusses the results of our numerical experiments: in Section 4.1, we examine the performance of the surrogate in a controlled environment, whereas Section 4.2 confronts our methodology with real data. Section 5 concludes.
2 Related Literature

Our paper contributes to three strands of literature: (i) methods for constructing surrogate models in general and their application to high-dimensional models in economics and finance; (ii) applications of deep learning in finance and economics; and (iii) empirical option pricing.

Model estimation, calibration, and uncertainty quantification can be daunting numerical tasks because of the need to perform sometimes hundreds of thousands of model evaluations to obtain converging estimates of the relevant parameters and converging statistics (see, e.g., Fernández-Villaverde, Rubio-Ramírez, and Schorfheide, 2016; Fernández-Villaverde and Guerrón-Quintana, 2020; Iskhakov, Rust, and Schjerning, 2020; Igami, 2020, among others). To this end, a broad strand of literature in engineering and physics (see, e.g., Tripathy and Bilionis, 2018b), but also in finance and economics, has long tried to replace expensive model evaluations that suffer from the curse of dimensionality with cheap-to-evaluate surrogate models that mitigate the said curse. Heiss and Winschel (2008), for instance, approximated the likelihood by numerical integration on Smolyak sparse grids, whereas Scheidegger and Treccani (2018) applied adaptive sparse grids to approximate high-dimensional probability density functions (PDFs) in the context of American option pricing. Scheidegger and Bilionis (2019) propose a method to carry out uncertainty quantification in the context of discrete-time dynamic stochastic models by combining the active subspace method (Constantine, 2015) with Gaussian processes to approximate the high-dimensional policies as a function of the endogenous and exogenous states as well as the parameters. Over the course of the past two decades, there have also been significant advancements in the development of algorithms and numerical tools to compute global solutions to high-dimensional dynamic economic models (see Maliar and Maliar (2014) for a thorough review; see also Brumm and Scheidegger (2017), Pflueger, Peherstorfer, and Bungartz (2010), and references therein for more details on adaptive sparse grids). This strand of literature is loosely related to the work presented here in that, also in the context of computing global solutions to dynamic models, high-dimensional functions have to be approximated repeatedly. Liu, Borovykh, Grzelak, and Oosterlee (2019a) use Calibration Neural Networks to calibrate financial asset pricing models and provide numerical experiments based on simulated data. Kaji, Manresa, and Pouliot (2020) propose a simulation-based estimation method for economic structural models using a generative adversarial neural network.

A paper that is relatively close to ours is Norets (2012), which extends the state space by adding the model parameters as "pseudo-states" to efficiently estimate finite-horizon, dynamic discrete choice models. He uses shallow artificial neural networks to approximate the dynamic programming solution as a function of parameters and state variables prior to estimation. The main difference in our methodology is the application of deep neural networks. Compared to shallow neural networks and some of the other popular approximation methods, the benefits that deep neural nets offer include approximating power, alleviating the curse of dimensionality by reducing the amount of training data required, and the ability to make efficient use of GPUs and big data. In particular, we show that a deep network can recover the parameter values significantly more accurately than shallow networks, which is crucial for structural estimation. We also confront our surrogates with real data in the context of empirical option pricing.

Second, our paper is part of the emergent literature on applications of deep learning to economics and finance. In the seminal work of Hutchinson, Lo, and Poggio (1994), shallow neural nets are used for pricing and hedging derivative securities.
Chen and White (1999) established improved approximation error rates and root-mean-squared convergence rates for nonparametric estimation of conditional means, conditional quantiles, and conditional densities of nonlinear time series using shallow neural networks, and obtained root-n asymptotic normality of smooth functionals. Chen and Ludvigson (2009) used neural networks to solve habit-based asset pricing models. More recently, Farrell, Liang, and Misra (2021) derive non-asymptotic high-probability bounds for deep feed-forward neural nets in the context of semiparametric inference, without a pseudo-states approach. Buehler, Gonon, Teichmann, and Wood (2019) presented a methodology called deep hedging, which uses deep neural networks and reinforcement learning to compute optimal hedging strategies directly from simulated data with transaction costs. Didisheim, Karyampas, and Scheidegger (2020) introduced the concept of deep replication to extract the implied risk aversion smile from options data. Chen, Pelger, and Zhu (2019b) use deep neural networks to estimate an asset pricing model for individual stock returns that takes advantage of the vast amount of conditioning information. Becker, Cheridito, and Jentzen (2018) introduce deep optimal stopping for pricing Bermudan options. In contrast, Azinovic, Gaegauf, and Scheidegger (2019); Fernandez-Villaverde, Hurtado, and Nuno (2019); Villa and Valaitis (2019); Maliar, Maliar, and Winant (2019); Duarte (2018); and Fernandez-Villaverde, Nuno, Sorg-Langhans, and Vogler (2020) apply various formulations of deep neural networks to solve a broad range of high-dimensional dynamic stochastic models in discrete and continuous-time settings, but do not deal with estimation. We relate to this strand of literature in that we apply neural networks to tackle problems in finance.
In particular, and to the best of our knowledge, we are the first to show that very deep neural nets combined with the Swish activation function (Ramachandran, Zoph, and Le, 2017) are necessary to construct high-dimensional surrogate models for estimating dynamic models in finance.

Third, in the option pricing application, we study two popular structural models from the literature: the stochastic volatility model of Heston (1993) and the double-exponential jump-diffusion model extended from Bates (1996). Our method allows us to re-estimate the structural parameters and hidden states at high frequency and to compare the models' out-of-sample performance at various time horizons. We apply the test statistic developed by Andersen et al. (2015) to investigate the stability of the risk-neutral dynamics of the two models. Christoffersen and Jacobs (2004) also compare the pricing performance of a "standard" Practitioner's Black-Scholes (PBS) model against the Heston model. They emphasize the importance of using the same loss function to estimate and evaluate the models, and they find that, in doing so, the PBS model outperforms the Heston model out-of-sample. Our finding of the superior hedging performance of structural models echoes the finding of Schaefer and Strebulaev (2008) in the corporate bond market, who show that structural credit models are informative about out-of-sample hedge ratios even though they might imply large pricing errors.
3 Methodology

In this section, we introduce our deep structural estimation methodology. Furthermore, we briefly discuss the two option pricing models to which we apply our technology, that is to say, the Heston (Heston, 1993) and the Bates (Bates, 1996) models. We proceed in the following steps. First, we introduce the concept of deep structural estimation in Section 3.1. Second, we discuss in Section 3.2 the general benefits of using deep neural networks in the context of surrogate models. Third, we address the specifics of our option pricing applications in Section 3.3. Fourth, we elaborate in Section 3.4 on the deep neural network's architecture and the training procedure applied in our applications. In addition, we describe in Section 3.5 how to assess out-of-sample performance in our numerical experiments. Finally, Section 3.6 briefly introduces a non-parametric benchmark—random forests—against which we compare the structural results.

3.1 Deep structural estimation
Consider an economic model that can be represented by some function f : R^m → R^k mapping m input variables into k potentially observable variables. The input dimensionality m can be decomposed into ω observable and time-varying states, h hidden states, and θ parameters, with m = ω + h + θ. More formally,

f(Ω_t, H_t | Θ) = y_t,  (1)

where Ω_t is a vector of dimension ω containing the observable states, H_t is a vector of dimension h comprising the hidden states, Θ is a vector of dimension θ containing the model parameters, and y_t is a vector of dimension k comprising the predicted quantities of interest. The objective now is to replace the true function f(·), which is potentially expensive to evaluate, with a numerically cheap-to-evaluate surrogate model f̂(·). That is,

f̂(Ω_t, H_t, Θ) = f̂(X_t) = y_t,  (2)

where X_t is a vector of dimension m containing the observable and unobservable states, as well as the model parameters as pseudo-states,

X_t = [Ω_t, H_t, Θ]^T.  (3)

The core idea of deep structural estimation is to use deep neural networks to construct the said surrogate f̂(·) so as to accurately approximate the true function f(·), including its gradients, which helps with extremum estimators such as GMM (Hansen, 1982). Compared to some of the other popular approximation methods (see Judd, 1996, for a survey), the benefits that DNNs offer include approximating power (see, e.g., Hornik, Stinchcombe, and White, 1989), alleviating the curse of dimensionality by reducing the amount of training data required, and the ability to make efficient use of GPUs and big data.

We use deep neural networks to create the said surrogate model. Neural networks are universal function approximators (see, e.g., Hornik et al., 1989) that consist of stacked layers of neurons. A given layer i takes a vector I_i of length m as an input and produces a vector O_i of length n as output, that is,

O_i = σ(W_i I_i + b_i),  (4)
where W_i is a matrix of size (n, m) that consists of a priori unknown entries, b_i is a vector of weights of length n, and σ(·) is a non-linear function applied element-wise, commonly termed an activation function. For a neural network that consists of L layers, I_1 represents the network's inputs, I_i = O_{i−1} for all i ∈ [2, L], and O_L represents the output of the last layer. Popular choices for the activation function σ(·) include the rectified linear unit (ReLU), σ(x) = max(x, 0), which is popular as it does not suffer from the vanishing gradient problem (Goodfellow, Bengio, Courville, and Bengio, 2016), and the sigmoid, σ(x) = 1/(1 + e^{−x}). However, in our numerical experiments below, we use the more recent Swish activation function (Ramachandran et al., 2017), which is given by

σ(x) = x / (1 + e^{−γx}),  (5)

where γ is either a constant or a trainable parameter; the Swish can be viewed as a smooth version of the ReLU function (see Figure 1).

Figure 1. This figure depicts the Swish, ReLU, and Sigmoid activation functions.

The Swish activation function possesses two qualities that make it appropriate for the surrogate application: 1) unlike the sigmoid, it does not suffer from the vanishing gradient problem and, as our application shows, complex structural models require surrogate networks with a deep architecture; 2) unlike the ReLU activation function, the Swish is smooth, which means that the gradients of the trained surrogate model will also be smooth across the state space. This latter property substantially enhances performance when the gradients of the surrogate are needed (e.g., for estimation). Indeed, it has been shown empirically that very deep neural nets in combination with Swish activation functions outperform architectures that rely on the ReLU (see, e.g., Tripathy and Bilionis (2018b)).

Next, we discuss how to determine the hyper-parameters of the deep neural network such that it can be efficiently used as a surrogate model of f(·) (cf. equation (1)).
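As a concrete sketch (our own illustration in NumPy, not the authors' code), the layer map of equation (4) and the Swish of equation (5) can be written in a few lines; the toy dimensions m = 4 and n = 2 are assumptions for illustration:

```python
import numpy as np

def swish(x, gamma=1.0):
    """Swish activation, equation (5): x * sigmoid(gamma * x)."""
    return x / (1.0 + np.exp(-gamma * x))

def dense_layer(I, W, b, sigma=swish):
    """One network layer, equation (4): O = sigma(W @ I + b).
    I has length m, W has shape (n, m), b has length n."""
    return sigma(W @ I + b)

# Swish behaves like ReLU for large |x| but is smooth at the origin,
# so the gradients of the trained surrogate stay smooth as well.
x = np.array([-10.0, 0.0, 10.0])
print(swish(x))  # approximately [-0.0005, 0.0, 9.9995]

# A toy forward pass through one layer (m = 4 inputs, n = 2 outputs).
rng = np.random.default_rng(0)
I = rng.normal(size=4)
W = rng.normal(size=(2, 4))
b = rng.normal(size=2)
print(dense_layer(I, W, b).shape)  # (2,)
```

Stacking L such layers, with the output of one layer fed as the input of the next, yields the full network φ.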
To do so, let φ(X | θ_NN) be a neural network that consists of L layers. The said network takes the vector X of dimension ω + h + θ as an input and generates an output of dimension k. θ_NN denotes a flattened vector containing all the trainable parameters of the neural network, that is, the components of the matrices W_i and the vectors b_i for all layers i = 1, ..., L. φ(X | θ_NN) is an acceptable surrogate of the model f(·) if a convergence criterion such as

|φ(X_t | θ_NN) − f(Ω_t, H_t | Θ)| < ε  (6)

is met for all economically relevant values of Ω_t, H_t, and Θ, with ε being some small positive constant. Thus, one needs to determine the parameters θ_NN that satisfy equation (6). To do so, we define a training set—that is, data used to train the neural network—composed of pairs (X̃_i, y_i), where y_i = f(X̃_i).

Now, we populate the parameter-enhanced state space of dimension m with sample points X̃_i by drawing them from a multivariate uniform distribution. To this end, we first need to define the minimum and the maximum economically meaningful value of every state and parameter (based on expert knowledge or on mathematical conditions). Thus, for every state x^(j)_t that populates the vector X_t, let x^(j) be the minimum acceptable value and x̄^(j) the maximum admissible value,

X = [x^(1), x^(2), ..., x^(ω+h+θ)],  X̄ = [x̄^(1), x̄^(2), ..., x̄^(ω+h+θ)].  (7)

Next, we can define

X̃_i = X + R ⊙ (X̄ − X),  (8)

where ⊙ denotes the element-wise product and R = [r^(1), r^(2), ..., r^(ω+h+θ)] is a vector of random draws (Jaynes, 1982). The vector R can be drawn from any distribution appropriate to the particular context; in this paper, we use a simple uniform distribution for all states and parameters. Consequently, one can now generate a training sample of size N by drawing N random vectors X̃_i and querying the original model (cf. equation (1)) N times to obtain y_i = f(X̃_i). Given this set of training data, we can now train the surrogate.
Note that with complex structural models and large N, generating the training data can be computationally costly. However, this expense is a one-time fixed cost: the researcher needs to pay these computational costs only once and can afterwards benefit from orders-of-magnitude speedups when querying the surrogate model rather than evaluating the original function (cf. Section 4). Furthermore, the generation of training data is embarrassingly parallel and thus consumes only little human time. As an illustration, consider an estimation of the BDJM's parameters using the original model function: this optimization took roughly 40 minutes, whereas on the same hardware, the deep surrogate estimation technology reduced the optimization time to less than 1 second. Keep in mind that these numbers only illustrate the potential time gain; both times could probably be reduced significantly with expert knowledge and appropriate code optimization. All tests were conducted on "Piz Daint," a Cray XC50 system installed at the Swiss National Supercomputing Centre, whose compute nodes are equipped with a 12-core Intel(R) Xeon(R) E5-2670 and 64GB of DDR3 memory.

We train φ(X | θ_NN) by minimizing the mean absolute error on the training sample, that is,

θ*_NN = arg min_{θ_NN} (1/N) Σ_{i=1}^{N} |φ(X̃_i | θ_NN) − y_i|.  (9)

To perform this minimization, we need access to the neural network's gradients. This can be done very efficiently thanks to the backpropagation algorithm, which we can view as a recursive application of the standard chain rule (Chauvin and Rumelhart, 1995). Note that, unlike other numerical differentiation schemes, backpropagation is exact. We compute the gradient to update the network's parameters on mini-batches in a process known as stochastic gradient descent, which we can either run until convergence or stop earlier to avoid overfitting.
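The training-set construction of equations (7)–(8) can be sketched as follows. This is our own minimal NumPy illustration: the bounds, the sample size, and f_toy (a stand-in for the structural model f) are hypothetical placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Economically meaningful bounds for each of the m = omega + h + theta
# pseudo-states -- equation (7) (toy values; in practice these come from
# expert knowledge or mathematical conditions).
X_lo = np.array([0.0, 0.1, -1.0])
X_hi = np.array([1.0, 2.0,  1.0])

def f_toy(X):
    """Stand-in for the structural model f(Omega, H | Theta)."""
    return np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2]

# Equation (8): uniform draws R in [0, 1]^m, mapped into the bounds.
N = 10_000
R = rng.uniform(size=(N, len(X_lo)))
X_train = X_lo + R * (X_hi - X_lo)
y_train = f_toy(X_train)  # N queries of the original model

# The surrogate phi(X | theta_NN) is then fit by minimizing the mean
# absolute error of equation (9) with stochastic gradient descent in
# any deep learning framework (the paper uses TensorFlow 2.x).
print(X_train.shape, y_train.shape)
```

Generating the (X̃_i, y_i) pairs is embarrassingly parallel, since each query of f is independent.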
However, in the case of deep structural estimation presented here, early stopping is unnecessary because our target y_i can safely be assumed to be noise-free and N can be made arbitrarily large.

To test whether the neural network is a surrogate model of acceptable quality, we generate an additional sample of pairs (X̃_j, y_j), for j = 1, ..., J: the validation sample, with which we compute the validation error,

(1/J) Σ_{j=1}^{J} |φ(X̃_j | θ_NN) − y_j|.  (10)

If the validation error is smaller than some prescribed ε, we accept the neural network as a surrogate of the model f(·).

Let us assume that we observe a time series of economic data from time t = 1 to t = T. At each time step, we observe N_t target states and their corresponding observable states,

Ŷ_t = [ŷ_t^1, ŷ_t^2, ..., ŷ_t^{N_t}],  Ω̂_t = [Ω̂_t^1, Ω̂_t^2, ..., Ω̂_t^{N_t}],  (11)

where Ŷ_t is a matrix of size (k, N_t) consisting of all the observed targets ŷ_t^j at time t that the true model f(·) attempts to explain, and Ω̂_t is a matrix of size (ω, N_t) composed of the corresponding observable states of the economy. We now wish to use the said data to estimate the model's parameters, Θ, as well as the hidden states of the economy,

Ĥ_t = [Ĥ_t^1, Ĥ_t^2, ..., Ĥ_t^{N_t}],  (12)

where Ĥ_t is a matrix of size (h, N_t), for all t = 1, ..., T. With the deep structural estimation technology, we can directly and swiftly solve the following minimization problem:

Ĥ*_1, Ĥ*_2, ..., Ĥ*_T, Θ* = arg min_{Ĥ_1, Ĥ_2, ..., Ĥ_T, Θ} (1/(T N_t)) Σ_{τ=1}^{T} Σ_{i=1}^{N_t} (φ([Ω̂_τ^i, Ĥ_τ^i, Θ] | θ*_NN) − ŷ_τ^(i))².  (13)

In our numerical experiments below (cf. Section 4), we apply the BFGS optimization algorithm (see, e.g., Nocedal and Wright, 2006).
The BFGS procedure relies on a correct estimation of the function's gradient, which we can obtain cheaply and precisely thanks to the backpropagation algorithm. Note that even without deep structural estimation, we could reach the desired result by solving
$$\hat{H}^{*}_{1}, \ldots, \hat{H}^{*}_{T}, \Theta^{*} = \arg\min_{\hat{H}_1, \ldots, \hat{H}_T, \Theta} \frac{1}{T} \sum_{\tau=1}^{T} \frac{1}{N_\tau} \sum_{i=1}^{N_\tau} \left( f(\hat{\Omega}^{i}_\tau, \hat{H}^{i}_\tau \mid \Theta) - \hat{y}^{(i)}_\tau \right)^2. \quad (14)$$
However, unlike in equation (13), we would have to compute the model's gradient directly, which would be significantly more costly for three reasons. First, unless the model's gradient can be derived analytically, one would have to estimate it through a numerical differentiation scheme, requiring multiple costly evaluations of the model. Second, even a single evaluation of the model can be computationally very expensive for complex structural models. Third, depending on the structural model, it may be impossible to accelerate the gradient's computation. Even where the calculation of the model's gradient can be parallelized on modern GPUs, implementing this parallelization is a daunting task requiring time and expert knowledge.

When dealing with contemporary structural models in finance or economics, computational costs can be a limiting factor. The method described in the previous section offers the researcher a trade-off: one pays a one-time, potentially relatively high, upfront computational cost to reduce the marginal cost of additional model estimations. Note that contemporary APIs such as
TensorFlow 2.x provide implementations of machine-learning algorithms, including deep neural networks, alongside classical optimization algorithms such as BFGS within a single framework. We therefore implemented deep structural estimation within this API to smoothly leverage the availability of gradients with a well-tuned BFGS algorithm and to harvest the computing power of contemporary GPUs through the available parallelization. Note also that parallelizing the generation of the training set is trivial: although the overall computational cost can be "large" in node hours, it can be distributed across the compute nodes of contemporary high-performance computing hardware, reducing the runtime by orders of magnitude.

With deep structural estimation, we can create a library of models that can significantly improve research quality through easier access to prior work for model comparisons, meta-analyses, and recalibration on new data.
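To illustrate the estimation step of equation (13), the toy example below calibrates a hidden state and one parameter by BFGS, supplying analytic gradients in place of backpropagation. The quadratic "surrogate" $\phi$ is made up for illustration; it is not the paper's trained network.

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for a trained surrogate phi([Omega, H, Theta] | theta_NN*):
# omega plays the role of an observable state (moneyness), h of a hidden
# volatility state, and theta of a single model parameter.
def phi(omega, h, theta):
    return h + theta * (omega - 1.0) ** 2

def phi_grad(omega, h, theta):
    # Analytic gradients w.r.t. (h, theta), mimicking backpropagation
    return np.ones_like(omega), (omega - 1.0) ** 2

omega = np.linspace(0.0, 2.0, 21)        # observable-state grid
h_true, theta_true = 0.2, 0.5
y_hat = phi(omega, h_true, theta_true)   # "observed" targets from the true model

def loss_and_grad(x):
    h, theta = x
    resid = phi(omega, h, theta) - y_hat
    g_h, g_t = phi_grad(omega, h, theta)
    loss = np.mean(resid ** 2)           # squared-error objective as in (13)
    grad = np.array([2 * np.mean(resid * g_h), 2 * np.mean(resid * g_t)])
    return loss, grad

res = minimize(loss_and_grad, x0=np.array([0.1, 0.1]), jac=True,
               method="BFGS", options={"gtol": 1e-8})
print(res.x)  # recovers (h_true, theta_true)
```

Because the surrogate and its gradient are cheap, this minimization runs in milliseconds, which is what makes daily re-estimation feasible.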
We now briefly summarize the asset pricing models for which we build surrogate models, namely the stochastic volatility model (HM) and the double-exponential jump-diffusion model (BDJM). Both models specify a stochastic process for the underlying asset $S$ under a risk-neutral probability measure $Q$, which is guaranteed to exist by no-arbitrage. The prices of European options equal the risk-neutral expected present values of the terminal payoffs discounted at the risk-free rate.

While there are more sophisticated option pricing models in the literature (for example, Duffie, Pan, and Singleton, 2000; Bates, 2000; Pan, 2002; Andersen et al., 2015), we choose the HM and the BDJM as examples mainly for their transparency: both models are single-factor, with the latter adding a jump component relative to the former. A comparison between the two allows us to focus on the importance of jumps for option pricing.

In the Heston model, under measure $Q$, the stock price $S_t$ follows the process
$$\frac{dS_t}{S_t} = (r - d)\,dt + \sqrt{v_t}\,dW_{1,t}, \quad (15a)$$
$$dv_t = \kappa(\theta - v_t)\,dt + \sigma\sqrt{v_t}\,dW_{2,t}. \quad (15b)$$
Here, $r$ is the (constant) instantaneous risk-free rate and $d$ is the dividend yield. $v_t$ is the (diffusive) spot volatility, which follows a Feller process under $Q$ with speed of mean reversion $\kappa$, long-run mean $\theta$, and volatility parameter $\sigma$. $W_{1,t}$ and $W_{2,t}$ are two standard Brownian motions under $Q$ with $\mathrm{corr}(W_{1,t}, W_{2,t}) = \rho$.

For a European option with maturity $T$ and strike price $K$, the HM involves six parameters, $\Theta_{HM} = [r, d, \kappa, \theta, \sigma, \rho]^T$, one hidden state, $H_{HM} = [v_t]$, and three observable state variables, $\Omega_{HM} = [S_t, K, T - t]^T$. In the estimation, we treat two of the parameters, the interest rate $r$ and the dividend yield $d$, as observables, and estimate the remaining four parameters along with the hidden state $v_t$ from the options panel.

The Bates model (Bates, 1996) extends the Heston model by adding jumps in the stock price.
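Before turning to the Bates extension, here is a minimal Monte Carlo sketch of the Heston dynamics (15a)-(15b): a log-Euler scheme with full truncation of the variance, pricing an at-the-money European call. All parameter values are illustrative, not calibrated values from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative Heston parameters (not calibrated values from the paper)
r, d = 0.02, 0.01                         # risk-free rate, dividend yield
kappa, theta, sigma, rho = 2.0, 0.04, 0.3, -0.7
S0, v0, T, n_steps, n_paths = 100.0, 0.04, 1.0, 252, 20000
dt = T / n_steps

S = np.full(n_paths, S0)
v = np.full(n_paths, v0)
for _ in range(n_steps):
    z1 = rng.standard_normal(n_paths)
    z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal(n_paths)
    v_pos = np.maximum(v, 0.0)            # full truncation keeps sqrt(v) real
    S *= np.exp((r - d - 0.5 * v_pos) * dt + np.sqrt(v_pos * dt) * z1)
    v += kappa * (theta - v) * dt + sigma * np.sqrt(v_pos * dt) * z2

# Risk-neutral price of an at-the-money European call, strike K = 100
K = 100.0
call = np.exp(-r * T) * np.maximum(S - K, 0.0).mean()
print(round(call, 2))
```

In the paper the pricing itself is done with QuantLib's semi-analytic engines; the simulation above only serves to make the role of each parameter in (15a)-(15b) concrete.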
Under $Q$, the stock price follows
$$\frac{dS_t}{S_{t-}} = (r - d - \lambda m)\,dt + \sqrt{v_t}\,dW_{1,t} + dZ_t, \quad (16a)$$
$$dv_t = \kappa(\theta - v_t)\,dt + \sigma\sqrt{v_t}\,dW_{2,t}, \quad (16b)$$
where the key difference from the HM is $Z$, a pure jump process with arrival intensity $\lambda$ whose log jump size $J$ has distribution $\omega$. Different from Bates (1996), who assumes $\omega$ is normal, we model the log jump size with an asymmetric double-exponential distribution,
$$\omega(J) = p\,\nu_u e^{-\nu_u J}\,\mathbf{1}_{\{J>0\}} + (1 - p)\,\nu_d e^{\nu_d J}\,\mathbf{1}_{\{J<0\}}, \quad \text{with } p \geq 0, \quad (17)$$
where $\nu_u > 1$, $\nu_d > 0$, and $m = E[e^J] - 1 = p\,\frac{\nu_u}{\nu_u - 1} + (1 - p)\,\frac{\nu_d}{\nu_d + 1} - 1$. The double-exponential assumption allows for heavier tails in jumps than the normal distribution, a feature supported by the non-parametric option-based evidence in Bollerslev and Todorov (2011).

In summary, the BDJM consists of 10 parameters (of which 8 will be estimated, excluding $r$ and $d$), $\Theta_{BDJM} = [r, d, \kappa, \theta, \sigma, \rho, \lambda, \nu_u, \nu_d, p]^T$, plus the same hidden state $H_{BDJM} = [v_t]$ and observable states $\Omega_{BDJM} = [S_t, K, T - t]^T$ as the HM.

We now detail the specific training sample, network architecture, and training procedure we use to create surrogate models for the 10-dimensional HM and the 14-dimensional BDJM.
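Returning to the jump specification (17), the drift correction $m$ can be checked by Monte Carlo: sample log jump sizes $J$ and compare the simulated mean jump size $E[e^J] - 1$ with the closed form. Parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative double-exponential jump parameters; nu_u > 1 is required
# for E[exp(J)] to be finite.
p, nu_u, nu_d = 0.4, 10.0, 5.0
n = 1_000_000

# Draw log jump sizes J from equation (17): exponential up-jumps with rate
# nu_u with probability p, exponential down-jumps with rate nu_d otherwise.
up = rng.random(n) < p
J = np.where(up,
             rng.exponential(1.0 / nu_u, n),
             -rng.exponential(1.0 / nu_d, n))

# Closed-form mean relative jump size m = E[e^J] - 1 used in the drift of (16a)
m_closed = p * nu_u / (nu_u - 1.0) + (1.0 - p) * nu_d / (nu_d + 1.0) - 1.0
m_mc = np.exp(J).mean() - 1.0
print(m_closed, m_mc)  # the two agree to Monte Carlo accuracy
```

With these (asymmetric) illustrative values, $m$ is negative: down-jumps dominate, which is why $\lambda m$ must be subtracted from the drift to keep the discounted price a martingale.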
Before starting any numerical computation, we exploit basic properties of the two models to reduce the effective dimensionality. To decrease the original dimensionality by one, we merge the strike price $K$ and the spot price $S$ into a measure of moneyness, i.e.,
$$\hat{K} = 100\,\frac{K}{S}. \quad (18)$$
(For both structural models, we apply the state-of-the-art option pricing library QuantLib to compute prices and the corresponding BSIV.) Next, we further diminish the dimensionality of the original models by collapsing the option-type dimension through put-call parity. Instead of providing the option type (call or put) as an input of the surrogate, we use put-call parity to transform all the put prices in the sample into call prices.

Furthermore, we replace the price of the options with the corresponding BSIV. Thus, instead of having surrogate models that produce prices which we then transform into BSIVs, the surrogates directly predict the BSIV.

We populate the training sample $\tilde{X}_i, y_i$ as defined in section 3.1. To do so, we first set the minimum and maximum values for each state variable and all parameters. Tables 4 and 5 in appendix A list those ranges for the HM and the BDJM, respectively. We chose those values based on a mixture of mathematical rules, expert knowledge, and trial and error. For example, the intensity of the Poisson process $\lambda$ is, by definition, positive. We know that the correlation between volatility shocks and price shocks $\rho$ is negative in practice. Finally, we observe that a large value of $\kappa$ is often necessary to obtain the best in-sample fit; we therefore use a relatively large range of possible $\kappa$ values to accommodate this fact.

The only state variable in $\tilde{X}_i$ which we do not draw from a uniform distribution is $\hat{K}$. Indeed, the standardized strike $\hat{K}$ is an imperfect substitute for moneyness, as moneyness is best represented as a function of strike price and time-to-maturity.
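To make the maturity-adjusted moneyness introduced next concrete, the sketch below draws a moneyness value $m$ uniformly and inverts the transformation $m = \ln(\hat{K}/F_T)/(\sqrt{T}\,\sigma_{atm})$ of equation (19) to recover the standardized strike; the forward price and maturity are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative values; sigma_atm = 0.19 is the sample average quoted in the text.
sigma_atm = 0.19
F_T = 100.0            # forward price (assumed flat at 100 for illustration)
T = 30.0 / 252.0       # time to maturity in years

# Draw moneyness m uniformly and invert equation (19):
#   m = ln(K_hat / F_T) / (sqrt(T) * sigma_atm)
#   =>  K_hat = F_T * exp(m * sigma_atm * sqrt(T))
m = rng.uniform(-2.0, 2.0, size=5)
K_hat = F_T * np.exp(m * sigma_atm * np.sqrt(T))

# Round-trip check: recomputing m from K_hat recovers the draws exactly
m_back = np.log(K_hat / F_T) / (np.sqrt(T) * sigma_atm)
print(np.allclose(m, m_back))  # True
```

Sampling uniformly in $m$ rather than in $\hat{K}$ concentrates training points where short-dated options actually trade, instead of wasting them on strikes that are unreachable at short maturities.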
Following Andersen, Fusari, and Todorov (2017), we define a moneyness indicator $m$ as
$$m = \frac{\ln\left(\hat{K}/F_T\right)}{\sqrt{T}\,\sigma_{atm}}, \quad (19)$$
where $F_T$ is the forward for transactions up to the option's maturity, and $\sigma_{atm}$ is the average BSIV of at-the-money options throughout the sample (0.19). Instead of drawing $\hat{K}$, we draw a random moneyness value $m$ and invert equation (19) to obtain the corresponding $\hat{K}$.

Finally, we need to define the size of the training sample $N$, that is, the number of points populating the pseudo-state-space we use to calibrate the neural network. (Note that the choice of transforming puts into calls instead of calls into puts was arbitrary; we can obtain the same performance with a put-only surrogate model.) We choose N
We applied a network of 6 hidden layers to generate the HM surrogate, each of whichwas composed of 400 neurons with a
Swish activation function. This architecture yields a total of 806,001 trainable parameters. For the BDJM surrogate, we use a deep neural network with seven hidden layers of 400 neurons each and
Swish activation functions. The latter architecture corresponds to a total of 967,201 trainable parameters. Our trial-and-error approach to determining the architecture (a common practice when working with deep neural networks) suggests that, all things being equal, the network has to be deeper as complexity is added (such as additional states and model parameters) to keep the surrogate performance at a constant level.

We run a mini-batch stochastic gradient descent algorithm with batches of size 256 to determine the neural network's parameters. In particular, we applied the ADAM algorithm with an initial learning rate of 0. ∗ −. As optimization criterion, we minimize the mean absolute error (cf. equation (13)). We run the optimization algorithm for 15 epochs, that is, we use mini-batches of size 256 until we have used the whole data set 15 times. After each epoch, we save the model and use a validation set of 10,000 points to estimate the surrogate model's performance. At the end of the procedure, we keep the network's parameters from the epoch with the lowest validation error.

Note that while those numbers read large, they only sparsely populate the state space. Consider a naive discretization scheme where one places $N$ points along each axis; the naive generalization to $d$ dimensions would yield $N^d$ points. Consequently, those N = 10 points in the case of the BDJM would translate to N ≈
4 points along each axis, which could be considered a low resolution of the function to be approximated with traditional, Cartesian grid-based methods (see, e.g., Press, Teukolsky, Vetterling, and Flannery, 2007). Note that we also tried to increase the network's width, that is, to add more neurons in each layer. This procedure led to considerably increased training time but did not significantly improve the surrogate performance.

3.5 Out-of-sample pricing errors

To measure out-of-sample pricing errors, we estimate the structural option pricing models' parameters and hidden states at some given time $t$, and use these states and parameters to predict options' BSIVs at time $t + \tau$, where $\tau$ is the forecasting horizon defined in numbers of business days. We make the predictions assuming we can see the observable states at time $t + \tau$, that is, the options' maturity, the options' moneyness, and the risk-free rate. Formally, we define the out-of-sample prediction for each option $i$ as
$$\hat{y}_i = \phi\left( [\hat{\Omega}^{i}_\tau, \hat{H}^{i}_t, \Theta] \,\middle|\, \theta^{*}_{NN} \right). \quad (20)$$
Note that we do not update the state parameter $v_t$. In practice, we measure the volatility smile on day $t$ and use it to predict the volatility smile at time $t + \tau$. As such, the out-of-sample performance can be viewed as a measure of parameter stability.

To contrast the performance of the two structural option pricing models, we apply a non-parametric benchmark in the form of a random forest regressor.
We choose the random forest over other non-parametric benchmarks for two main reasons: i) random forest regressors require little to no tuning to perform well out-of-sample; ii) random forest regressors have a relatively short training time, which significantly facilitates our analysis.

To train a random forest, we need to define the model's input, that is, the vector of observable states $X^{(R)}_i$ the random forest uses to predict the BSIV of option $i$, as well as the loss function which the random forest tries to minimize. The input vector is
$$X^{(R)}_i = \left[1,\; \mathbf{1}_{C,i},\; \hat{K}_i,\; T_i,\; \hat{K}_i T_i,\; \mathbf{1}_{\hat{K}_i > 100}\right], \quad (21)$$
where $1$ is a constant term, $\mathbf{1}_{C,i}$ is a binary indicator equal to 1 if the contract is a call option, $\hat{K}_i$ is the moneyness measure defined in equation (18), $T_i$ is the option's days to maturity, and $\mathbf{1}_{\hat{K}_i > 100}$ is a binary indicator equal to 1 if the option's strike is above the underlying asset's price. The last two dimensions of the vector $X^{(R)}_i$ provide redundant information, which we nevertheless include.

(A random forest works by constructing a multitude of decision trees, with the algorithm's prediction defined as the mean output of the individual trees. This ensemble approach drastically reduces the risk of overfitting. For a general introduction to random forests, see Liaw and Wiener (2002); for random forests applied to finance, see Gu, Kelly, and Xiu (2018). Concerning the random forest hyperparameters, we applied the default parameters of the sklearn library, cf. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html.)

Let $R(\cdot)$ be a random forest predictor. To create a non-parametric benchmark (cf. section 3.5), we train the random forest to minimize the following loss function:
$$L = \frac{1}{N} \sum_{i=1}^{N} \left( R(X^{(R)}_i) - \hat{y}_i \right)^2, \quad (22)$$
where $N$ is the number of options in the training sample and $\hat{y}_i$ is the option's BSIV. Note that the random forest minimizes the mean squared error, just as the surrogate models discussed above.
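A minimal sketch of this benchmark with scikit-learn's defaults, building the feature vector of equation (21) on a synthetic option panel; the "smile" function generating the targets below is made up for illustration and is not one of the paper's models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Synthetic option panel: moneyness K_hat, days to maturity T, call indicator
n = 4000
K_hat = rng.uniform(80.0, 120.0, n)
T = rng.integers(1, 250, n).astype(float)
is_call = rng.random(n) < 0.5

# Made-up smooth smile standing in for observed BSIVs
y = 0.15 + 0.00005 * (K_hat - 100.0) ** 2 + 0.0001 * T

# Feature vector of equation (21): constant, call flag, K_hat, T,
# the interaction K_hat * T, and the strike-above-spot indicator
X = np.column_stack([np.ones(n), is_call, K_hat, T, K_hat * T, K_hat > 100.0])

rf = RandomForestRegressor(random_state=0)   # sklearn default hyperparameters
rf.fit(X[:3000], y[:3000])
mse = np.mean((rf.predict(X[3000:]) - y[3000:]) ** 2)  # loss of equation (22)
print(mse)
```

With the defaults, each tree is grown to full depth and the ensemble average keeps the out-of-sample error small without any tuning, which is exactly the appeal of the benchmark.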
In addition, our non-parametric benchmark does not use any information beyond what we use to estimate the option pricing models' hidden states and parameters.

This section presents the application of the deep structural estimation technology to the HM and the BDJM. For transparency, we start in section 4.1 with simulated data to demonstrate the capabilities of our framework in a controlled environment. In section 4.2, we turn our attention to a panel of S&P 500 options data and calibrate both the HM and the BDJM at a daily frequency, an analysis that would be computationally (almost) infeasible without the surrogate technology. In particular, we consider and compare several definitions of out-of-sample performance. In each case, we compare the two option pricing models against each other and against a non-parametric benchmark: the random forest. In addition, we discuss the models' parameter stability across time. Finally, we compare the models' capacity to predict the delta of individual options, that is, the options' price sensitivity to changes in the underlying asset's price.
In this section, we demonstrate the capability and versatility of the surrogate technology in a controlled environment. We start by investigating the sensitivity of the surrogates to their respective states and pseudo-states. To do so, we apply equation (8) to randomly draw a sample point. Then, we vary each state and pseudo-state in turn while keeping all other variables fixed. For each state vector, we compute the BSIV both by querying the original option pricing model and by querying the surrogate, and use the two to produce a sensitivity plot.
Figure 2. The figures above show the complexity of the surrogate model's volatility surface as well as the quality of the surrogate's interpolations by comparing the BSIVs of the surrogate against the predictions of the "true" BDJM. We populate the state space with points where we keep all parameters and states fixed at the mid-range of their possible values: $\hat{K} = 100.$, $rf = 0.$, $T = 190$, $\kappa = 25.$, $\theta = 0.$, $v_t = 0.$, $\sigma = 2.55$. On each panel, we show the BSIV predicted by the BDJM and its surrogate while varying one of the parameters or states over its admissible range (cf. table 5).

Figure 2 displays the resulting sensitivities for some states and parameters of the BDJM's surrogate. The y-axis shows the BSIV as a function of a given pseudo-state over the range given in table 5. We can see that the surrogate model almost perfectly replicates the sensitivity of the target model. Furthermore, these plots highlight the strong non-linearity of the BDJM's volatility surface. Taken together, these results show that neural network surrogates can reach a high degree of precision for very complex structural models. Figures 14, 15, and 16 in appendix A show similar sensitivity graphs for all states and parameters of both surrogates.

Next, to demonstrate the value of the introduced surrogate technology in the context of structural estimation, we perform the last steps described in section 3.1 and estimate the states and pseudo-states from the data. We start by randomly generating a cross-section of the options' BSIVs by drawing one state vector $R_{true}$ (cf. equation (8)), that is,
$$R_{true} = \left[\Omega_{true}, H_{true}, \Theta_{true}\right]^T. \quad (23)$$
After that, while keeping all other parameters and states equal to their values in $R_{true}$, we populate the simulated cross-section of options by varying the moneyness and maturity parameters, $\hat{K}$ and $T$. Thus, we create a set of state vectors $\tilde{X}_i$ with parameters and states equal to those in $R_{true}$ except for the moneyness and maturity parameters, $T, \hat{K}$. We then query the option pricing model to compute a price for each option, and we convert this price into a BSIV, $\tilde{y}_i = f(\tilde{X}_i)$.

We apply the described procedure to generate a cross-section of N = 1,000 points and use the optimization algorithm described in section 3.1 to solve the following minimization problem:
$$\hat{H}^{*}, \Theta^{*} = \arg\min_{\hat{H}, \Theta} \frac{1}{N} \sum_{i=1}^{N} \left( \phi\left( [\hat{\Omega}_i, \hat{H}_i, \Theta] \,\middle|\, \theta^{*}_{NN} \right) - \hat{y}^{(i)} \right)^2. \quad (24)$$
Subsequently, we define a performance measure for the model calibration. To this end, let $X^{*} = [\hat{H}^{*}, \Theta^{*}]^T$ be the estimated state vector of dimension $h + \theta$ containing the unobservable states and the models' parameters obtained through deep structural estimation. Moreover, let $x^{*}_i$ be the individual states and parameters populating the vector $X^{*}$, that is, $X^{*} = [x^{*}_1, x^{*}_2, \ldots, x^{*}_{h+\theta}]^T$. For each $x^{*}_i$, we compute the estimation error as
$$e_i = \frac{\left| x^{true}_i - x^{*}_i \right|}{\left| \bar{x}_i - \underline{x}_i \right|}, \quad (25)$$
where $\bar{x}_i$ and $\underline{x}_i$ represent the maximum and minimum values in the training sample of the respective surrogate model (cf. tables 4 and 5). This error measure captures the absolute difference between the estimated state and the true value, standardized by the admissible parameter range to allow for comparison across states.

For each surrogate model, we simulate 1,000 such cross-sections, and for each simulation, we estimate the parameters and compute the estimation error. Figure 3 displays the results for both models. For each non-observable state and parameter, we show the standardized estimation error $e_i$ in the form of a box plot. The (green) horizontal line represents the median error. The whiskers indicate the 1st and 99th percentiles, whereas the box shows the interquartile range across all simulations.

Besides, we compute for every simulation the in- and out-of-sample performance.

(a) HM (b) BDJM

Figure 3.
The figures above show that the surrogate can be used to estimate the parameters of the models from a small data sample. We simulated market days with the original pricing models and used our surrogates to estimate the parameters and states simultaneously. Above, we show the resulting estimation errors (see equation (25)) for each hidden state and parameter of both models. The green line represents the median error computed across 1,000 simulations. The box shows the interquartile range, while the whiskers show the 1st and 99th percentile errors across the simulations.

The in-sample performance is defined as the MSE computed on the original cross-section $\tilde{X}_i, \tilde{y}_i$ for all $i = 1, \ldots, N$. We estimate the out-of-sample performance as the MSE on an additional set of points $\tilde{X}_j, \tilde{y}_j$ for all $j = 1, \ldots, N'$. Both in- and out-of-sample, the mean squared pricing error is virtually 0, with values ranging from 0. ∗ − for the HM with call options (the smallest) to 0. ∗ − for the BDJM with put options (the largest).

The results above suggest that deep structural estimation can be used to efficiently and precisely estimate the states and pseudo-states from a simulated cross-section of options generated with the true model's pricing function. Remember that, unlike for the original models, querying the surrogate or estimating the surrogate's gradients comes at negligible computational cost.

Before discussing the option pricing models' performance on real market data, we demonstrate the importance of the neural network architecture presented in section 3.4.2 for the surrogate models' performance. In particular, we highlight the importance of the network depth combined with the Swish activation function. To do so, we first investigate the performance of shallow networks as surrogate models. In particular, we define a neural network with a single 400-neuron layer and a
Swish activation function and train it to be a surrogate for the HM call options. We apply the procedure described at the start of this section to compute the estimation error and to examine how well the shallow network's sensitivities replicate the true model's sensitivities. We measure the sensitivities of the surrogate model generated by the shallow neural network as before (cf. figure 14). Figure 17 in appendix A displays this sensitivity comparison for the HM call surrogate. We can see that this surrogate cannot replicate the true model efficiently.

Next, we use our shallow surrogate model to compute estimation errors and produce a box plot of those errors. Figure 18 in appendix A displays the main findings: while the surrogate model estimates some states correctly, notably the parameter $\theta$ and the hidden state $v_t$, the error on other parameters explodes, most notably on the $\kappa$ parameter, where the median estimation error across simulations is above 13%. We performed a similar analysis with a deep network (6 hidden layers of 400 neurons each) with ReLU activation functions and likewise obtained large replication errors.

These additional numerical results complete the tests of our method within a controlled environment and highlight the importance of choosing an appropriate network architecture to create efficient surrogate models. Note that, even though these bad surrogates cannot estimate the models' parameters reasonably, their prediction error on a validation set was still virtually 0, that is, the surrogate BSIV was close to the model BSIV for a given set of parameters. With complex structural financial models, some parameters have a relatively small impact on the target state. For example, on a cross-section of options on a given day, the $\kappa$ parameter does not have a significant impact on the options' BSIVs. Therefore, a surrogate can reach a low prediction error while almost ignoring the effect of $\kappa$.
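The gap between the deep and shallow architectures is also visible in raw parameter counts. Assuming fully connected layers with biases and a single BSIV output, the reported totals of 806,001 (HM, six hidden layers of 400 neurons) and 967,201 (BDJM, seven layers) are consistent with input dimensions of 8 and 10, respectively; those input dimensions are our inference, not stated explicitly in the text.

```python
def dense_param_count(n_in, hidden_layers, n_out=1):
    """Total weights and biases of a fully connected network."""
    sizes = [n_in] + hidden_layers + [n_out]
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

# Deep surrogates: totals match the figures quoted in the text
print(dense_param_count(8, [400] * 6))    # 806001 (HM surrogate)
print(dense_param_count(10, [400] * 7))   # 967201 (BDJM surrogate)

# The shallow single-layer network discussed in this section is far smaller
print(dense_param_count(8, [400]))        # 4001
```

The shallow network has roughly 200 times fewer parameters, so its failure on marginally important parameters such as $\kappa$ is unsurprising: it lacks the capacity to resolve directions of the state space that barely move the BSIV.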
An appropriate network architecture can ensure that an adequate degree of precision is reached even for marginally important parameters. Furthermore, this interesting result demonstrates that the prediction error alone is not a sufficient measure of the quality of a surrogate.

4.2 Applying deep structural estimation to market data

For the following numerical experiments, we use the daily options quotes from the
OptionMetrics database. We use European call and put options on the S&P 500 index between 2001 and 2019. We remove options with maturity above 250 days. Furthermore, we remove options with extreme moneyness values from the sample by keeping only options with a BS delta between 0.1 and 0.9 ($0.1 \leq \Delta \leq 0.9$).

(a) Maturities in sample (b) Number of options

Figure 4.
The two figures above show the evolution of the sample composition across time. Panel (a) shows the percentage of options in the sample per maturity bracket. Panel (b) shows the total number of options per day (left axis) and the percentage of put options in the sample (right axis). In all plots, we smooth the results with a rolling average over the last 252 business days.

The database provides, for each option, the prices, but also the BSIV and the BS delta at a daily frequency. Figure 4 summarizes our final sample size and composition. Panel (a) shows the evolution across time of the sample's composition per maturity bracket. Panel (b) shows the evolution of the sample size (left axis) and the percentage of put options in the sample (right axis). We smooth all numbers with a rolling average over the last 252 days.

In addition, we obtain the volatility index VIX as well as a non-parametric estimate of jump risk in the S&P 500 index from Todorov and Andersen's website. The variable
$JUMP_t$ represents the weekly probability of a negative jump of at least 10% in the S&P 500 index. Finally, we obtain the risk-free rate from the Fama and French data library.

In this section, we leverage the cheap-to-evaluate surrogate to re-estimate the parameters and hidden state of both option pricing models on every day of the sample. We contrast those findings by re-estimating the non-parametric benchmark described in section 3.6 at a daily frequency.

(Data sources: https://tailindex.com/volatilitymethodology.html and https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html)

Figure 5. The figures above display the mean in-sample performance of the three models across time. We show the mean across days of the $\sqrt{MSE}$, smoothed with a rolling average over the last 252 days.

We first explore the in-sample performance. In figure 5, we display on the y-axis the daily $\sqrt{MSE}$, smoothed over a rolling average of the previous 252 days. As was to be expected, we find that the parametric model with more degrees of freedom, the BDJM, achieves a better in-sample fit than the simpler HM. Note that both models' in-sample MSE varies across time, with peaks in 1998 and during the 2008 crisis. We also observe that both parametric models' in-sample errors are larger after the 2008 crisis. Finally, we note that the non-parametric model strongly outperforms both structural models in-sample.

Next, we turn to the out-of-sample measure of performance. Figure 6 displays the prediction errors of the two models and the random forest for prediction horizons of 1 to 30 days, τ = 1, 2, ...,
30. On the y-axis, we show the mean daily square root of the MSE. The left panel (a) shows the mean daily performance, the middle panel (b) shows the 10th percentile estimated across days, and the right panel (c) shows the 90th percentile.

On average, the differences in performance across models are small. For short prediction horizons $\tau$, the random forest performs better than the HM. For $\tau = 1$, the non-parametric benchmark's performance is almost equal to that of the BDJM, whereas, for longer horizons, the BDJM yields significantly lower prediction errors. Note that the difference in performance between the HM and the BDJM is stable across time.

The percentile graphs reveal that the BDJM significantly outperforms both the HM and the RF under the 10th percentile measure but not under the 90th percentile. This effect is more substantial for large out-of-sample horizons. This result suggests that the

(To facilitate interpretation of figure 5, we additionally display the square root of the MSE here as well as throughout the remainder of this section.)

(a) (b) (c)

Figure 6.
The figures above compare the models' out-of-sample performance. On the x-axis, we show the out-of-sample horizon $\tau$, whereas the y-axis depicts the daily pricing error. The left panel (a) shows the square root of the daily mean MSE. Panel (b) shows the 10th percentile of the daily mean MSE, while panel (c) shows the 90th percentile.

overperformance of the BDJM is concentrated on "good days."

Next, we look at the models' performance across time. In figure 7, we display the daily $\sqrt{MSE}$, smoothed with a rolling average over the last 252 days, as a function of time. Each panel shows the average daily $\sqrt{MSE}$ estimated on a different subsample defined by maturity. In all panels, the out-of-sample performance is measured with a forecasting horizon of one week ($\tau = 5$).

These figures reveal several interesting patterns. First, we note that, while the non-parametric model is outperformed on average, the RF does outperform both the BDJM and the HM on most days for options with short maturities ($0 < T \leq 7$ and $7 < T \leq 30$). Figure 8 displays the $\sqrt{MSE}$ per standardized strike. As in figure 7, we split the sample in each panel by maturity brackets. Panels (b) and (d) show that, for some maturity ranges, all models tend to underperform significantly more on deep out-of-the-money call options than on the rest of

(a) 0 < T ≤ 7 (b) 7 < T ≤ 30 (c) 30 < T ≤
90 (d) 90 < T ≤ ∞
Figure 7.
The figures above show the models' out-of-sample performance across time on subsamples defined by maturity brackets. All panels show the performance for a forecasting horizon of one week ($\tau = 5$).

the sample. In panel (c), we see that the parametric models significantly outperform the RF for options with a maturity between one and three months. Finally, looking at the first two panels, the relative overperformance of the RF is stable across moneyness for options with a very short time to maturity (a), but concentrated on put options for options with maturities between one week and one month (b).

Finally, we conclude this analysis by looking at the difference in the average daily performance of the models across time. We subtract the daily mean squared error of the HM and the RF from that of the BDJM to create two time series; more negative values indicate a larger overperformance of the BDJM. Figure 9 shows these time series for different forecasting horizons.

(Recall that the sample does not contain in-the-money options; the left-hand side of the graphs ($\hat{K} < 100$) therefore consists of put options.)

(a) 0 < T ≤ 7 (b) 7 < T ≤ 30 (c) 30 < T ≤ 90 (d) 90 < T ≤ ∞

Figure 8. The figures above show the models' out-of-sample performance across moneyness on subsamples defined by maturity brackets. All panels show the performance for a forecasting horizon of one week ($\tau = 5$).

Comparing the panels of figure 9, we see that the volatility of the RF performance increases with the forecasting horizon. At the same time, panels (a) and (b) show that, for small forecasting horizons, the BDJM's performance relative to that of the HM increased significantly after the 2008 crisis. This result suggests that before 2008, the market may have omitted to price jump risk properly. After the crisis, however, market participants either learned to do so or were forced to by new legislation.

Under all specifications, the RF is surprisingly competitive in forecasting BSIVs out-of-sample. The difference in average performance between the models is not large, and the RF does outperform both the BDJM and the HM for specific types of options.

In appendix D, we complement the analysis discussed in this section with additional figures showing the models' performance on other data subsamples.

(a) τ = 1 (b) τ = 5

Figure 9.
The figures above compare the out-of-sample performance across time of the HM and the RF relative to the best-performing model, the BDJM. We first compute the mean daily $\sqrt{MSE}$ for each model, then subtract the performance of the HM and the RF from that of the BDJM to create two time series. Finally, we smooth our results with a 252-day rolling average. The two panels show this measure for forecasting horizons equal to 1 and 5 business days, respectively.
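The persistence comparison drawn from figure 10 below can be illustrated on synthetic series: a highly persistent AR(1) process behaves like the BDJM-HM performance gap, while white noise behaves like the BDJM-RF gap. The series here are simulated for illustration, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative daily performance-difference series (synthetic):
# an AR(1) with strong persistence, mimicking sqrt(MSE)_BDJM - sqrt(MSE)_HM,
# and white noise, mimicking sqrt(MSE)_BDJM - sqrt(MSE)_RF.
n = 2000
persistent = np.zeros(n)
for t in range(1, n):
    persistent[t] = 0.95 * persistent[t - 1] + rng.standard_normal()
noise = rng.standard_normal(n)

def autocorr(y, lag):
    """Sample autocorrelation of y at the given lag."""
    y = y - y.mean()
    return float(np.dot(y[:-lag], y[lag:]) / np.dot(y, y))

print(autocorr(persistent, 1))  # close to 0.95
print(autocorr(noise, 1))       # close to 0
```

A slowly decaying autocorrelation function, as for the persistent series, is the signature of performance gaps driven by slowly moving states of the economy; a flat one points to short-lived, day-specific effects.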
We now investigate what drives the differences in performance among the option pricing models. To do so, we create four variables of interest. The first one is the difference in daily average performance between the BDJM and HM, √MSE_BDJM − √MSE_HM. The second one is the same difference between the BDJM and the non-parametric benchmark, √MSE_BDJM − √MSE_RF. In addition, we look at the difference in performance for each option i at time t, 100 · (MSE_BDJM,i,t − MSE_HM,i,t) and 100 · (MSE_BDJM,i,t − MSE_RF,i,t). For simplicity, we focus on a forecasting horizon of a week (τ = 5).

In figure 10, we show the autocorrelations of the time series y_t = √MSE_BDJM − √MSE_HM (a) and y_t = √MSE_BDJM − √MSE_RF (b). These graphs show that the difference in performance between the two parametric models has strong and positive autocorrelations. On the other hand, the difference in daily performance between the BDJM and the non-parametric RF has little to no autocorrelation. These results suggest that the states of the economy in which the BDJM is comparatively better suited than the HM are spread across long time periods, while the states of the economy in which the RF performs relatively better than the parametric alternative are short-lived.

Next, we investigate potential factors that could explain the relative daily performance among the models. To do so, we compute the mean daily bid-ask spread at time t, (b − a)_t, as a proxy of liquidity risk.

(a) y_t = √MSE_BDJM − √MSE_HM  (b) y_t = √MSE_BDJM − √MSE_RF

Figure 10.
The figures above show the autocorrelations of the relative performance of the BDJM against the HM (a) and the RF (b). We compute the difference between the square roots of the models' daily MSEs and show the autocorrelations on the y-axis and the lags on the x-axis.

We capture the relative volatility risk with the VIX index, VIX_t. We measure the risk of jumps in the underlying asset's price with the variable JUMP_t. Finally, because jump and volatility risk are highly correlated (0.84), we construct a decorrelated version of the volatility risk: we define ṼIX_t as the residual from the regression

VIX_t = a + b · JUMP_t + ε_t.

Panel A of table 1 reports various configurations of a regression where the dependent variable is the difference in daily performance between the BDJM and the HM, y_{i,t} = 100(MSE_BDJM,i,t − MSE_HM,i,t), and the independent variables are the market measures of liquidity, volatility, and jump risk. All three variables of interest are statistically significant. The coefficient on the liquidity measure (b − a)_t is positive, which means that the BDJM's performance relative to that of the HM is worse on days with high liquidity risk. Both parametric models assume perfect liquidity, and neither is expected to perform well on days when insufficient liquidity has a large impact on prices. When faced with an omitted pricing factor, both models misuse their degrees of freedom; it is therefore intuitive that the model with more degrees of freedom underperforms more on those low-liquidity days. The coefficient associated with jump risk is negative, meaning that the BDJM outperforms significantly more on days with large jump risk. This effect is to be expected, given that the BDJM was explicitly designed to model such risks. Similarly, the BDJM overperforms on days with high volatility risk, suggesting that the BDJM is better suited than the HM to predict steep implied volatility smiles.

Panel B of table 1 shows the same regressions when we analyze the difference in performance between the BDJM and RF, y_{i,t} = 100(MSE_BDJM,i,t − MSE_RF,i,t). When controlling for the yearly fixed effect, the coefficient associated with liquidity risk is positive and statistically significant. This suggests that the RF's relative performance is higher when liquidity risk is high, which is coherent because, unlike the RF, the parametric BDJM was derived under the assumption that liquidity has no impact on prices. The coefficient associated with jump risk is significant and negative, as is the one associated with volatility risk. These two results imply that the BDJM has a small advantage in modeling days with steep implied volatility smiles and high jump risk.
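The decorrelated volatility measure ṼIX_t can be reproduced mechanically: it is the OLS residual of the regression VIX_t = a + b · JUMP_t + ε_t. A minimal sketch with synthetic series (all magnitudes hypothetical, only the construction comes from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2500

# Synthetic, highly correlated jump-risk and volatility series.
jump = rng.standard_normal(n)
vix = 15.0 + 3.0 * jump + 1.0 * rng.standard_normal(n)

# OLS of VIX_t = a + b * JUMP_t + eps_t; the residual is the decorrelated VIX.
X = np.column_stack([np.ones(n), jump])
coef, *_ = np.linalg.lstsq(X, vix, rcond=None)
vix_tilde = vix - X @ coef

# By construction the residual is (sample-)uncorrelated with jump risk.
print(abs(np.corrcoef(vix_tilde, jump)[0, 1]) < 1e-8)  # True
```

Because OLS residuals are orthogonal to the regressors in-sample, ṼIX_t isolates the component of volatility risk not explained by jump risk, which is what lets both variables enter the regressions of table 1 simultaneously.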
Our fast-to-evaluate surrogates allow us to swiftly and cheaply create a time series of daily estimated parameters. If one of the option pricing models were correct, we should see only small variations in the parameters across time; instead, we measure large variations from one day to the next for both models (cf. appendix B).

We now test statistically whether the models' parameters are stable across time. To do so, we apply the statistical tests developed in section 5.2 of Andersen et al. (2015). According to their work, given two time periods t and t + 1, if the pricing model is valid for both periods we have

(Θ_t − Θ_{t+1})′ ( Âvar(Θ_t) + Âvar(Θ_{t+1}) )⁻¹ (Θ_t − Θ_{t+1})  →_{L-s}  χ²(q),   (26)

where Θ_t and Θ_{t+1} denote the estimated parameters in time periods t and t + 1, respectively, Âvar(Θ_t) and Âvar(Θ_{t+1}) denote consistent estimates of the asymptotic variances of these parameters (cf. Andersen et al. (2015), equations (11)-(12)), and the convergence is stable in law to a χ² distribution with q degrees of freedom.

For every day t of the sample, we estimate the statistic of equation (26) to test whether the parameters Θ_t are statistically different from Θ_{t+1}. Figure 11 shows the results of those daily tests. On the y-axis, we show the fraction of the past 252 days for which the hypothesis was rejected, that is, the percentage of days for which the models' parameters at times t and t + 1 are statistically significantly different from one another (α = 1%). In all cases, we observe a high rejection rate, which increases after the 2008 crisis. On the whole sample, we reject with 1% confidence the hypothesis that the models' parameters are stable from one day to the next 41.6% of the time for the BDJM and 60.7% of the time for the HM.
Those perhaps surprising results suggest that both models' parameters are statistically significantly unstable through time, at least when we estimate the parameters on a daily cross-section of options.

Table 1

The table below presents various configurations of a regression where the dependent variable is, for each option i on day t, the difference in performance between the BDJM and the HM (100(MSE_BDJM,i,t − MSE_HM,i,t)) in Panel A, and the difference in performance between the BDJM and the RF (100(MSE_BDJM,i,t − MSE_RF,i,t)) in Panel B. The independent variables include the bid-ask spread (b − a)_t, the volatility index VIX_t, and the jump risk JUMP_t. We estimated model performance with a forecasting horizon of one week (τ = 5). The values in parentheses are standard errors; ∗ denotes significance at the 10% level, ∗∗ at 5%, and ∗∗∗ at 1%.

Panel A: y_{i,t} = 100(MSE_BDJM,i,t − MSE_HM,i,t), columns (1)-(8)
(b − a)_t
JUMP_t: -0.0049*** (0.0001), -0.0049*** (0.0001), -0.0078*** (0.0002), -0.0083*** (0.0002)
VIX_t: -0.0005*** (0.0), -0.0006*** (0.0)
ṼIX_t: -0.0007*** (0.0), -0.0007*** (0.0)
constant: -0.0045*** (0.0001), -0.0097*** (0.0001), -0.0126*** (0.0002), -0.01*** (0.0002)
year FE: No, No, No, No, Yes, Yes, Yes, Yes
Observations: 5,493,397; 5,484,390; 5,493,397; 5,484,390; 5,493,397; 5,484,390; 5,493,397; 5,484,390
R²

Panel B: y_{i,t} = 100(MSE_BDJM,i,t − MSE_RF,i,t), columns (1)-(8)
(b − a)_t: -1.3513*** (0.5008), -3.612*** (0.5173), 15.8412*** (1.1346), 7.2956*** (0.7151)
JUMP_t: -0.7315*** (0.0525), -0.6506*** (0.0538), -0.7505*** (0.0783), -1.1787*** (0.0704)
VIX_t: -0.1244*** (0.0028), -0.1907*** (0.0039)
ṼIX_t: -0.2705*** (0.0048), -0.3086*** (0.0053)
constant: 1.1129*** (0.0501), -0.5713*** (0.0331), -0.7856*** (0.067), -0.1478** (0.0691)
year FE: No, No, No, No, Yes, Yes, Yes, Yes
Observations: 5,523,407; 5,514,080; 5,523,407; 5,514,080; 5,523,407; 5,514,080; 5,523,407; 5,514,080
R²

(a) α = 1%

Figure 11.
The figure above shows, for each day, the percentage of the past 252 days for which the models' parameters are statistically significantly different from one day to the next. We show the average rejection rate of the test at the 1% significance level.
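The day-by-day test of equation (26) is simple to apply once the parameter estimates and their asymptotic variance estimates are available. A hedged sketch with synthetic inputs (in the paper, the Âvar terms come from Andersen et al. (2015), equations (11)-(12); here they are placeholders):

```python
import numpy as np
from scipy.stats import chi2

def stability_test(theta_t, theta_t1, avar_t, avar_t1, alpha=0.01):
    """Wald-type test of equation (26): H0 says the parameters are the
    same on day t and day t+1.  Returns the statistic and a reject flag."""
    diff = theta_t - theta_t1
    stat = diff @ np.linalg.inv(avar_t + avar_t1) @ diff
    q = len(diff)  # degrees of freedom = number of tested parameters
    return stat, stat > chi2.ppf(1.0 - alpha, q)

# Identical parameters on both days: the statistic is 0, H0 is not rejected.
theta = np.array([2.0, 0.04, 0.3])
avar = 0.01 * np.eye(3)
stat, reject = stability_test(theta, theta, avar, avar)
print(stat, reject)  # 0.0 False
```

Running this test on every consecutive pair of trading days and averaging the reject flags over a 252-day window reproduces the rejection-rate series plotted in figure 11.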
Market makers and traders often use models to set up hedging portfolios, that is, to buy and sell other assets to offset or mitigate the risks inherent in holding options. Therefore, the quality of a model's predicted price sensitivity to the underlying asset is often as important as the model's predictive power itself.

In this section, we compare the hedging performance of the BDJM, the HM, and the non-parametric model. We focus on delta-hedging, that is, hedging the option's price sensitivity to changes in the underlying asset.

Previously, we used the surrogate technology to re-estimate the parameters of the BDJM and HM daily. We now use those daily sets of parameters and the original pricing models to estimate, for each model m and each option i on day t, the first derivative with respect to the underlying asset, δ^(m)_{i,t}.

Next, we compute a replication error assuming traders buy the option and counterbalance the delta-risk by taking an opposite position in the underlying asset,

ε^(m)_{i,t} = [ (p_{i,t} − p_{i,t−1}) − δ^(m)_{i,t} (S_t − S_{t−1}) ] / p_{i,t−1},   (27)

where S_t is the price of the underlying asset at time t and p_{i,t} is the mid quote of option i at time t. We standardize the error by the option price at time t − 1; ε^(m)_{i,t} therefore represents the replication error as a percentage of the original price.

Table 2 shows the mean, standard deviation, and important quantiles of the absolute replication errors across the whole sample, |ε^(m)_{i,t}|. In addition to the random forest, HM, and BDJM, we compute the delta of the Black-Scholes model (BSM). A perfect replicating portfolio would produce a replication error of zero. Note that this is not possible here, as we only hedge the sensitivity to changes in the underlying price and ignore other well-documented sensitivities (time, volatility, etc.). Nonetheless, the relative replication errors of the various models give a measure of how well suited these models are to hedging applications.

The BDJM produces an average absolute replication error of 0.14175, which only narrowly beats the HM's average replication error of 0.15501. The percentile errors suggest that this over-performance of the BDJM is not driven by outliers. Finally, while in section 4.2.2 we show that the RF is surprisingly competitive when applied to price predictions, table 2 shows that the RF is not well suited for hedging applications. Indeed, the two complex parametric models produce significantly lower replication errors than the RF.

The relative underperformance of the RF is to be expected. We can view a random forest regressor as a classifier with a large number of small categories. The RF's output is neither continuous nor smooth and is therefore not well suited to produce gradients.

This result highlights a well-known advantage of structural models over non-parametric ones. Non-parametric models are designed to perform specific tasks and need to be adapted for each new task. A parametric model, on the other hand, aims to provide an appropriate description of the world and can therefore be used for multiple applications fairly easily.

We further explore the relationship between replication error, time, and time to maturity in figure 12.
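Equation (27) reduces to a few lines of code; the sketch below uses hypothetical numbers for illustration:

```python
import numpy as np

def replication_error(p_t, p_tm1, delta, s_t, s_tm1):
    """Delta-hedging replication error of equation (27), expressed as a
    fraction of the option price at t-1."""
    return ((p_t - p_tm1) - delta * (s_t - s_tm1)) / p_tm1

# A call worth 5.00 yesterday with delta 0.5; the underlying moves +2.00.
# A perfectly linear price response (+1.00) gives a zero replication error.
print(replication_error(6.0, 5.0, 0.5, 102.0, 100.0))  # 0.0
```

Taking absolute values of this quantity across all options and days yields the |ε^(m)_{i,t}| statistics summarized in table 2.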
We show the time series of the average daily √MSE on various subsamples defined by time to maturity. The first two panels show options with a relatively low time to maturity (0 < T ≤ 7 and 7 < T ≤ 30), while the last two show the performance on the subsamples of options with a relatively long time to maturity (30 < T ≤ 90 and 90 < T ≤ ∞). Unlike the √MSE of the previous section, we standardize the replication errors by the price of the options. We do this standardization because the replication error is expressed in terms of prices instead of BSIVs, and prices vary significantly across time to maturity and moneyness, which can introduce a bias in the analysis. For each day, we take the BSIV of at-the-money options as our estimate of the Black-Scholes implied volatility for all options on that particular day.

Table 2

The table below shows the mean, standard deviation (std), and important percentiles of the absolute replication error of each model m (|ε^(m)_{i,t}|) on the whole sample. Table 6 in appendix D shows an extended version of this table.

        BSM      RF       HM       BDJM
mean    0.19380  0.20133  0.15501  0.14175
std     0.42104  0.32173  0.39458  0.28051
5%      0.00423  0.00754  0.00601  0.00595
50%     0.08686  0.12512  0.07958  0.07914
95%     0.65096  0.59113  0.48294  0.44869

These graphs highlight several interesting patterns. First, we see again that most of the BDJM's improvement over the HM concerns short-term options. This observation is in line with Andersen et al. (2017), who highlight that weekly options are very sensitive to jump risks. Second, we see that the upward trend in replication errors after the 2008 crisis exists across the different time-to-maturity subsamples and can therefore not be explained by the change in market composition highlighted in figure 4(a). Finally, figure 12(a) shows that the difference in hedging performance between the two models on options with a short time to maturity does not exist in the earlier part of the sample and only starts to appear around 2011.

Next, we discuss the relationship between replication error, time, and moneyness with figure 13. We show the average √MSE per strike level on various subsamples defined by time to maturity. These graphs show that the replication error is smaller for at-the-money options across all maturity brackets.
Indeed, a smile-like shape, reminiscent of the implied volatility smile, appears across all moneyness levels. Furthermore, panels (c) and (d) show that for options with a maturity of one month or more, the replication error is larger for the call options in the sample than for the put options. Finally, we see that the overperformance of the BDJM is mostly driven by the hedging of out-of-the-money put options.

To conclude this analysis of hedging performance, we use the models' states and parameters estimated in the previous section to compute the theoretical δ of each option in the subsamples, and we estimate the following regression:

p_{i,t+1} − p_{i,t} = β_M δ_{i,M} (S_{t+1} − S_t) + β_T (T_{i,t+1} − T_{i,t}) + β_J (−1)^{call} (JUMP_{t+1} − JUMP_t) + β_V (VIX_{t+1} − VIX_t),   (28)

where p_{i,t+1} − p_{i,t} is the change in price of option i from time t to time t + 1, δ_{i,M} is the theoretical delta of option i predicted by model M, (S_{t+1} − S_t) is the change in price of the underlying asset, and (T_{i,t+1} − T_{i,t}) is the change in the option's maturity in days. The term (−1)^{call}(JUMP_{t+1} − JUMP_t) denotes the change in the estimated jump risk; we multiply the change in jump risk by −1 for call options to account for the fact that jump risks affect put and call options differently. Finally, (VIX_{t+1} − VIX_t) is the change in the VIX index, which captures the change in the volatility premium. Because t denotes trading days and not calendar days, the change in time to maturity of an option from time t to t + 1 is not always equal to 1; for example, after a Friday, two calendar days separate two trading days.

We present the estimation of this regression's configurations in table 3. The β_M of the HM and BDJM are not equal to 1 in any configuration; however, the values get closer to 1 as we add controls. Finally, we note that the BDJM delta produces a higher adjusted R² than the HM delta.

(a) 0 < T ≤ 7  (b) 7 < T ≤ 30  (c) 30 < T ≤ 90  (d) 90 < T ≤ ∞

Figure 12. The figures above show the mean daily relative replication error, smoothed by a 252-day rolling average, on various subsamples defined by the options' time to maturity.

(a) 0 < T ≤ 7  (b) 7 < T ≤ 30  (c) 30 < T ≤ 90  (d) 90 < T ≤ ∞

Figure 13. The figures above show the mean daily relative replication error across standardized strikes on various subsamples defined by the options' time to maturity.
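The regression of equation (28) can be mimicked on synthetic data. The sketch below simulates price changes with known loadings and recovers them by no-intercept OLS; all magnitudes are hypothetical, only the regression structure comes from the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Synthetic regressors of equation (28): delta-weighted underlying moves,
# maturity decay, signed jump-risk changes, and VIX changes.
d_s = rng.standard_normal(n)                 # delta_i * (S_{t+1} - S_t)
d_T = -rng.integers(1, 4, n).astype(float)   # change in days to maturity
d_j = 0.1 * rng.standard_normal(n)           # signed jump-risk change
d_v = 0.5 * rng.standard_normal(n)           # change in the VIX

# True loadings used to simulate option price changes.
beta = np.array([0.9, -0.15, 1.5, 0.05])
X = np.column_stack([d_s, d_T, d_j, d_v])
dp = X @ beta + 0.01 * rng.standard_normal(n)

# OLS estimate of the betas (no intercept, as in equation (28)).
beta_hat, *_ = np.linalg.lstsq(X, dp, rcond=None)
print(np.round(beta_hat, 2))  # close to [0.9, -0.15, 1.5, 0.05]
```

A β_M on the delta term that is statistically close to 1 would indicate that the model's predicted deltas translate one-for-one into realized price changes, which is the benchmark against which table 3 is read.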
In this paper, we introduce deep structural estimation: a generic framework to swiftly estimate complex structural models in economics and finance. We treat the models' parameters as pseudo-state variables to create a deep neural network surrogate replicating the target model. By alleviating the curse of dimensionality, this surrogate approach considerably lowers the computational cost of prediction and of parameter estimation on data. Such a speed gain is non-trivial, as high computational costs often prohibit important analyses of structural models, including a) out-of-sample analysis, b) testing of parameter stability, and c) re-calibration and testing on multiple subsamples. All three of these examples require fast re-estimation of structural models, which is often infeasible without the surrogate technology.
Table 3

This table shows the estimations of various configurations of equation (28). The values in parentheses are standard errors; ∗ denotes significance at the 10% level, ∗∗ at 5%, and ∗∗∗ at 1%. Columns (1)-(8):

T_t − T_{t−1}: -0.2178*** (0.0008), -0.1453*** (0.0004), -0.1561*** (0.0004), -0.1566*** (0.0004)
(JUMP_t − JUMP_{t−1})(−call): 10.101*** (0.0664), 2.8129*** (0.029), 1.4542*** (0.0346), 1.3527*** (0.0331)
VIX_t − VIX_{t−1}
δ^(BDJM)_{i,t}(S_t − S_{t−1}): 0.879*** (0.0003), 0.9158*** (0.0002)
δ^(BS)(S_t − S_{t−1}): 0.92*** (0.0002), 0.9361*** (0.0002)
δ^(HM)_{i,t}(S_t − S_{t−1}): 0.9129*** (0.0003), 0.9509*** (0.0002)
δ^(RF)_{i,t}(S_t − S_{t−1}): 82.5215*** (0.0603), 80.4929*** (0.056)
Observations: 5,021,299 in each of the eight columns
R²

We illustrate the validity, performance, and usefulness of the introduced method in the context of financial models by constructing surrogates for two well-known option pricing models: the HM and the BDJM.

First, we demonstrate the performance of the surrogate models with the aid of simulated data. We show a) that the surrogate is capable of approximating a highly complex volatility surface with virtually no pricing error, and b) that the surrogate can be used to estimate the hidden states and parameters from a small cross-section of simulated option prices.

Second, we apply our deep structural estimation framework to re-estimate both option pricing models' parameters and hidden states on every trading day of the last 17 years on the cross-section of S&P 500 options. We then conduct a thorough in- and out-of-sample performance analysis of the structural models.
This exercise highlights several interesting patterns: a) while the BDJM outperforms the HM and RF on average, the non-parametric model is more efficient in specific areas of the volatility surface, namely options with a maturity of less than one week and put options with a maturity of less than one month; b) when comparing the models' out-of-sample hedging performance, we show that while the BDJM outperforms the RF, this difference is concentrated on very short maturity options and only appears after 2011; c) the parameters of the BDJM and HM are relatively unstable through time, as we can reject the hypothesis that the parameters do not change from one day to the next 41.6% of the time for the BDJM and 60.7% of the time for the HM.

These results help to identify the strengths and weaknesses of current option pricing theory. On the one hand, the fact that, on average, the BDJM outperforms the RF whereas the HM fails to do so for short forecasting horizons implies that the extension of the HM to include jump risks was important and that the more modern model reflects market realities better. On the other hand, the poor parameter stability of the HM and BDJM and the high pricing errors of these models on options with a short time to maturity a) highlight the need for further progress and b) suggest avenues and directions in which the said progress should be directed. Taken together, these insights showcase the usefulness of the deep structural estimation framework and the thorough out-of-sample analysis it allows.
Simulated results
Table 4
This table presents the ranges [x̲(j), x̄(j)] of the training sample for the surrogate model of the HM, over the inputs j ∈ {m, rf, v_t, T, κ, θ, σ, ρ}; among the surviving bounds, m spans [-9.00, 5.000] and ρ spans [-1.00, -0.000].

Table 5

This table presents the ranges [x̲(j), x̄(j)] of the training sample for the surrogate model of the BDJM, over the inputs j ∈ {m, rf, v_t, T, κ, θ, σ, ρ, λ, ν, ν}; among the surviving bounds, m spans [-9.00, 5.000] and ρ spans [-1.00, -0.000].

(a) (b) (c) (d) (e) (f) (g) (h)

Figure 14.
The figures above compare the BSIVs of the surrogate against the predictions of the "true" HM. We populate the state space with points where we keep all parameters and states fixed at the mid-range of possible values: K̂ = 100., rf = 0., T = 190, κ = 25., θ = 0., v_t = 0., σ = 2.55. On each panel, we show the BSIV predicted by the model and its surrogate while varying one of the parameters or states in its admissible range (cf. table 4).

(a) (b) (c) (d) (e) (f) (g) (h)

Figure 15. The figures above compare the theoretical BSIVs predicted by the surrogate model of the BDJM, as done in figure 14. The panels above show sensitivity graphs for all states and parameters common to the BDJM and HM.

(a) (b) (c) (d)

Figure 16. The figures above complement the sensitivity analysis presented in figure 15. The panels above show sensitivity graphs for all states and parameters unique to the BDJM model.

(a) (b) (c) (d) (e) (f) (g) (h)

Figure 17. The figures above show results similar to those displayed in figure 14. However, here we computed the surrogate's predictions with a shallow neural network composed of only one layer with 400 neurons and Swish activation functions.

Figure 18. The figure above shows the prediction error for the call model of the HM using a shallow network (1 layer of 400 neurons).
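The shallow architecture of figures 17 and 18 (a single hidden layer of 400 Swish units) can be sketched as a plain forward pass. The weights below are untrained random placeholders and the input layout is assumed; a real surrogate would train these weights on model-generated prices:

```python
import numpy as np

def swish(x):
    """Swish activation (Ramachandran et al., 2017): x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
n_inputs = 8    # e.g. m, rf, v_t, T, kappa, theta, sigma, rho (assumed layout)
n_hidden = 400  # one hidden layer, as in figure 17

# Random placeholder weights; not a trained surrogate.
W1 = rng.standard_normal((n_inputs, n_hidden)) / np.sqrt(n_inputs)
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, 1)) / np.sqrt(n_hidden)
b2 = np.zeros(1)

def surrogate(x):
    """Shallow surrogate: inputs -> 400 Swish units -> one BSIV output."""
    return swish(x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((5, n_inputs))  # five pseudo option/parameter points
print(surrogate(x).shape)  # (5, 1)
```

Because Swish is smooth, such a surrogate is differentiable in all of its inputs, which is what makes gradient-based parameter estimation (and delta computation) cheap once the network is trained.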
Parameters across time
Figure 19 shows the time series of the hidden state v_t, re-estimated at a daily frequency and smoothed with a rolling average over the last 252 days. The left panel (a) shows the value estimated with the HM, and the right panel (b) depicts the value estimated with the BDJM. The two time series are highly correlated. Furthermore, the volatility estimated with the HM is slightly higher than that estimated with the BDJM.

Figures 20 and 21 display the values of the parameters re-estimated daily for both models. We smooth each parameter across time with a rolling average over the last 252 days. We observe strong variation in all parameters across time.

Figure 19. The figure above depicts the value of the hidden volatility state v_t estimated every day along with each model's parameters. We smoothed the values with a rolling mean over the last 252 days.

(a) κ  (b) ρ  (c) σ  (d) θ

Figure 20. The figures above show the daily estimated parameters of the HM. We smoothed the values with a rolling mean over the last 252 days.

(a) κ  (b) ρ  (c) σ  (d) θ  (e) λ  (f) p  (g) ν  (h) ν

Figure 21. The figures above show the daily estimated parameters of the BDJM. We smoothed the values with a rolling mean over the last 252 days.
Errors across strikes and maturities

(a) τ = 1  (b) τ = 5  (c) τ = 30

Figure 22. The figures above show the same √MSE per strike level as in figure 8, computed on the subsample of options with an extremely short time to maturity (0 < T ≤ 7).

(a) τ = 1  (b) τ = 5  (c) τ = 30

Figure 23. The figures above show the same √MSE per strike level as in figure 8, computed on the subsample of options with a short time to maturity (7 < T ≤ 30).

(a) τ = 1  (b) τ = 5  (c) τ = 30

Figure 24. The figures above show the same √MSE per strike level as in figure 8, computed on the subsample of options with a medium time to maturity (30 < T ≤ 90).

(a) τ = 1  (b) τ = 5  (c) τ = 30

Figure 25. The figures above show the same √MSE per strike level as in figure 8, computed on the subsample of options with a long time to maturity (90 < T ≤ ∞).
Additional analysis
Table 6
This table expands on table 2 with additional percentiles of the absolute replication error for each model.

        BSM       RF        HM        BDJM
mean    0.19380   0.20133   0.15501   0.14175
std     0.42104   0.32173   0.39458   0.28051
min     0.00000   0.00000   0.00000   0.00000
5%      0.00423   0.00754   0.00601   0.00595
10%     0.01083   0.01878   0.01225   0.01220
15%     0.01747   0.02963   0.01881   0.01879
20%     0.02453   0.04081   0.02570   0.02569
25%     0.03220   0.05255   0.03305   0.03299
30%     0.04073   0.06476   0.04085   0.04073
35%     0.05014   0.07787   0.04924   0.04908
40%     0.06080   0.09247   0.05831   0.05811
45%     0.07288   0.10846   0.06836   0.06807
50%     0.08686   0.12512   0.07958   0.07914
55%     0.10295   0.14439   0.09227   0.09162
60%     0.12183   0.16658   0.10681   0.10580
65%     0.14332   0.19171   0.12358   0.12206
70%     0.17039   0.22086   0.14330   0.14120
75%     0.20360   0.25554   0.16752   0.16450
80%     0.24989   0.29937   0.19887   0.19474
85%     0.31197   0.35633   0.24335   0.23784
90%     0.41471   0.43864   0.31658   0.30624
95%     0.65096   0.59113   0.48294   0.44869
max     32.97002  30.43577  64.09164  29.19978

References
Aldrich, Eric M., Jesús Fernández-Villaverde, A. Ronald Gallant, and Juan F. Rubio-Ramírez, 2011, Tapping the supercomputer under your desk: Solving dynamic equilibrium models with graphics processors, Journal of Economic Dynamics and Control 35, 386–393.

Andersen, Torben G., Nicola Fusari, and Viktor Todorov, 2015, Parametric inference and dynamic state recovery from option panels, Econometrica 83, 1081–1145.

Andersen, Torben G., Nicola Fusari, and Viktor Todorov, 2017, Short-term market risks implied by weekly options, The Journal of Finance 72, 1335–1386.

Azinovic, Marlon, Luca Gaegauf, and Simon Scheidegger, 2019, Deep equilibrium nets, Working paper.

Bates, David S., 1996, Jumps and stochastic volatility: Exchange rate processes implicit in deutsche mark options, The Review of Financial Studies 9, 69–107.

Bates, David S., 2000, Post-'87 crash fears in the S&P 500 futures option market, Journal of Econometrics 94, 181–238.

Becker, Sebastian, Patrick Cheridito, and Arnulf Jentzen, 2018, Deep optimal stopping, arXiv preprint arXiv:1804.05394.

Bellman, Richard E., 1961, Adaptive Control Processes: A Guided Tour (Princeton University Press).

Bilionis, Ilias, and Nicholas Zabaras, 2012a, Multi-output local Gaussian process regression: Applications to uncertainty quantification, Journal of Computational Physics.

Bilionis, Ilias, and Nicholas Zabaras, 2012b, SIAM Journal on Scientific Computing 34, B881–B908.

Bilionis, Ilias, Nicholas Zabaras, Bledar A. Konomi, and Guang Lin, 2013, Multi-output separable Gaussian process: Towards an efficient, fully Bayesian paradigm for uncertainty quantification, Journal of Computational Physics.

Bollerslev, Tim, and Viktor Todorov, 2011, Tails, fears, and risk premia, The Journal of Finance 66, 2165–2211.

Breiman, Leo, 2001, Random forests, Machine Learning 45, 5–32.

Brumm, Johannes, and Simon Scheidegger, 2017, Using adaptive sparse grids to solve high-dimensional dynamic models, Econometrica 85, 1575–1612.

Buehler, H., L. Gonon, J. Teichmann, and B. Wood, 2019, Deep hedging, Quantitative Finance, 1–21.

Bungartz, Hans-Joachim, and Michael Griebel, 2004, Sparse grids, Acta Numerica 13, 147–270.

Chauvin, Yves, and David E. Rumelhart, 1995, Backpropagation: Theory, Architectures, and Applications (Psychology Press).

Chen, Hui, Scott Joslin, and Sophie Xiaoyan Ni, 2019a, Demand for crash insurance, intermediary constraints, and risk premia in financial markets, Review of Financial Studies 32, 228–265.

Chen, Luyang, Markus Pelger, and Jason Zhu, 2019b, Deep learning in asset pricing, Working paper.

Chen, Peng, Nicholas Zabaras, and Ilias Bilionis, 2015, Uncertainty propagation using infinite mixture of Gaussian processes and variational Bayesian inference, Journal of Computational Physics.

Chen, Xiaohong, and Sydney C. Ludvigson, 2009, Land of addicts? An empirical investigation of habit-based asset pricing models, Journal of Applied Econometrics 24, 1057–1093.

Chen, Xiaohong, and Halbert White, 1999, Improved rates and asymptotic normality for nonparametric neural network estimators, IEEE Transactions on Information Theory.

Christoffersen, Peter, and Kris Jacobs, 2004, The importance of the loss function in option valuation, Journal of Financial Economics 72, 291–318.

Constantine, Paul G., 2015, Active Subspaces: Emerging Ideas for Dimension Reduction in Parameter Studies (Society for Industrial and Applied Mathematics, Philadelphia, PA).

Didisheim, Antoine, Dimitrios Karyampas, and Simon Scheidegger, 2020, Implied risk aversion smile, Available at SSRN 3533089.

Duarte, Victor, 2018, Machine learning for continuous-time economics, Working paper.

Duffie, Darrell, Jun Pan, and Kenneth Singleton, 2000, Transform analysis and asset pricing for affine jump-diffusions, Econometrica 68, 1343–1376.

Farrell, Max H., Tengyuan Liang, and Sanjog Misra, 2021, Deep neural networks for estimation and inference, Econometrica 89, 181–213.

Fernández-Villaverde, J., J. F. Rubio-Ramírez, and F. Schorfheide, 2016, Solution and estimation methods for DSGE models, volume 2 of Handbook of Macroeconomics, 527–724 (Elsevier).

Fernández-Villaverde, Jesús, Samuel Hurtado, and Galo Nuño, 2019, Financial frictions and the wealth distribution, PIER Working Paper 19-015, Penn Institute for Economic Research, University of Pennsylvania.

Fernández-Villaverde, Jesús, Galo Nuño, George Sorg-Langhans, and Maximilian Vogler, 2020, Solving high-dimensional dynamic programming problems using deep learning, Technical report.

Fernández-Villaverde, Jesús, and Pablo A. Guerrón-Quintana, 2020, Estimating DSGE models: Recent advances and future challenges, Working Paper 27715, National Bureau of Economic Research.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville, 2016, Deep Learning, volume 1 (MIT Press, Cambridge).

Gu, Shihao, Bryan Kelly, and Dacheng Xiu, 2018, Empirical asset pricing via machine learning, Technical report, National Bureau of Economic Research.

Hansen, Lars Peter, 1982, Large sample properties of generalized method of moments estimators, Econometrica 50, 1029–1054.

Haynes, Richard, Lihong McPhail, and Haoxiang Zhu, 2019, When leverage ratio meets derivatives: Running out of options?, CFTC Research Paper.

Heiss, Florian, and Viktor Winschel, 2008, Likelihood approximation by numerical integration on sparse grids, Journal of Econometrics.

Heston, Steven L., 1993, A closed-form solution for options with stochastic volatility with applications to bond and currency options, The Review of Financial Studies 6, 327–343.

Hornik, Kurt, Maxwell Stinchcombe, and Halbert White, 1989, Multilayer feedforward networks are universal approximators, Neural Networks 2, 359–366.

Hutchinson, James M., Andrew W. Lo, and Tomaso Poggio, 1994, A nonparametric approach to pricing and hedging derivative securities via learning networks, The Journal of Finance 49, 851–889.

Igami, Mitsuru, 2020, Artificial intelligence as structural estimation: Deep Blue, Bonanza, and AlphaGo, The Econometrics Journal 23, S1–S24.

Iskhakov, Fedor, John Rust, and Bertel Schjerning, 2020, Machine learning and structural econometrics: Contrasts and synergies, The Econometrics Journal 23, S81–S124.

Jaynes, E. T., 1982, On the rationale of maximum-entropy methods, Proceedings of the IEEE 70, 939–952.

Judd, Kenneth, 1996, Approximation, perturbation, and projection methods in economic analysis, in H. M. Amman, D. A. Kendrick, and J. Rust, eds., Handbook of Computational Economics, volume 1, first edition, chapter 12, 509–585 (Elsevier).

Kadan, Ohad, and Xiaoxiao Tang, 2020, A bound on expected stock returns, The Review of Financial Studies 33, 1565–1617.

Kaji, Tetsuya, Elena Manresa, and Guillaume Pouliot, 2020, An adversarial approach to structural estimation, arXiv e-prints arXiv:2007.06169.

Liaw, Andy, Matthew Wiener, et al., 2002, Classification and regression by randomForest, R News 2, 18–22.

Liu, Shuaiqiang, Anastasia Borovykh, Lech A. Grzelak, and Cornelis W. Oosterlee, 2019, A neural network-based framework for financial model calibration, Journal of Mathematics in Industry 9, 9.

Maliar, Lilia, and Serguei Maliar, 2014, Numerical methods for large-scale dynamic economic models, volume 3 of Handbook of Computational Economics, 325–477 (Elsevier).

Maliar, Lilia, Serguei Maliar, and Pablo Winant, 2019, Will artificial intelligence replace computational economists any time soon?, CEPR Discussion Paper DP14024.

Montanelli, Hadrien, and Qiang Du, 2019, New error bounds for deep ReLU networks using sparse grids, SIAM Journal on Mathematics of Data Science 1, 78–92.

Nocedal, Jorge, and Stephen Wright, 2006, Numerical Optimization (Springer Science & Business Media).

Norets, Andriy, 2012, Estimation of dynamic discrete choice models using artificial neural network approximations, Econometric Reviews 31, 84–106.

Pan, Jun, 2002, The jump-risk premia implicit in options: Evidence from an integrated time-series study, Journal of Financial Economics 63, 3–50.

Park, Jooyoung, and Irwin W. Sandberg, 1991, Universal approximation using radial-basis-function networks, Neural Computation 3, 246–257.

Pflüger, Dirk, Benjamin Peherstorfer, and Hans-Joachim Bungartz, 2010, Spatially adaptive sparse grids for high-dimensional data-driven problems, Journal of Complexity 26, 508–522.

Press, William H., Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery, 2007, Numerical Recipes 3rd Edition: The Art of Scientific Computing, third edition (Cambridge University Press, New York, NY).

Ramachandran, Prajit, Barret Zoph, and Quoc V. Le, 2017, Swish: A self-gated activation function, arXiv: Neural and Evolutionary Computing.

Schaefer, Stephen M., and Ilya A. Strebulaev, 2008, Structural models of credit risk are useful: Evidence from hedge ratios on corporate bonds, Journal of Financial Economics 90, 1–19.

Scheidegger, S., D. Mikushin, F. Kubler, and O. Schenk, 2018, Rethinking large-scale economic modeling for efficiency: Optimizations for GPU and Xeon Phi clusters, 610–619.

Scheidegger, Simon, and Ilias Bilionis, 2019, Machine learning for high-dimensional dynamic stochastic economies, Journal of Computational Science 33, 68–82.

Scheidegger, Simon, and Adrien Treccani, 2018, Pricing American options under high-dimensional models with recursive adaptive sparse expectations, Journal of Financial Econometrics, nby024.

Tripathy, Rohit, Ilias Bilionis, and Marcial Gonzalez, 2016, Gaussian processes with built-in dimensionality reduction: Applications to high-dimensional uncertainty propagation, Journal of Computational Physics.

Welch, Ivo, and Amit Goyal, 2007, A comprehensive look at the empirical performance of equity premium prediction, The Review of Financial Studies 21, 1455–1508.

Williams, Christopher K. I., and Carl Edward Rasmussen, 2006, Gaussian Processes for Machine Learning (MIT Press).