A First Option Calibration of the GARCH Diffusion Model by a PDE Method
Yiannis A. Papadopoulos and Alan L. Lewis

ABSTRACT
Time-series calibrations often suggest that the GARCH diffusion model could also be a suitable candidate for option (risk-neutral) calibration. But unlike the popular Heston model, it lacks a fast, semi-analytic solution for the pricing of vanilla options, perhaps the main reason why it is not used in this way. In this paper we show how an efficient finite difference-based PDE solver can effectively replace analytical solutions, enabling accurate option calibrations in less than a minute. The proposed pricing engine is shown to be robust under a wide range of model parameters and combines smoothly with black-box optimizers. We use this approach to produce a first PDE calibration of the GARCH diffusion model to SPX options and present some benchmark results for future reference.
1. Introduction
Stochastic volatility models are a natural generalization of the seminal Black-Scholes-Merton (BSM) option theory. In such models, the constant volatility parameter $\sigma$ of the BSM theory is promoted to a random process: $dS_t = \mu S_t\,dt + \sigma_t S_t\,dW_t^S$. Indeed, there is general agreement in finance that volatility (in its many forms) is best modelled as some sort of mean-reverting stochastic process. Starting from that premise, there are many possibilities. One of the simplest has the instantaneous variance rate $v_t \equiv \sigma_t^2$ evolving as a positive diffusion process following the SDE: $dv_t = \kappa(\bar{v} - v_t)\,dt + \xi v_t\,dW_t^v$. Here $W_t^v$ is an additional Brownian motion, $(\kappa, \bar{v}, \xi) > 0$ are constant parameters, and the two Brownian motions $(W_t^S, W_t^v)$ are correlated with constant parameter $\rho$. Coupled with the (risk-neutral) stock price evolution above, this defines the GARCH diffusion model. The GARCH diffusion model has several nice properties. First, ignoring the drift term for a moment, $v_t$ evolves as a geometric Brownian motion (GBM) -- a natural way in finance to achieve a positive stochastic process. GBM was originally introduced into finance by M. F. M. Osborne in the 1950s to model stock prices under constant volatility. Indeed, time-series analysis seems to favor GBM volatility over the popular Heston '93 (square-root) volatility process. Second, with $\bar{v} = 0$, the model nests a variant of the SABR model -- very popular in interest-rate modelling. The virtue of the SABR-GARCH connection is very tractable small-time behavior, due to a close connection of the small-time dynamics with hyperbolic Brownian motion. (While tractable small-time behavior facilitates time-series analysis by Maximum Likelihood, we found it not especially helpful in option chain calibration). Finally, the model name comes from the property (due to D.
Nelson) that there exists a continuous-time limit of a discrete-time GARCH model (GJR-GARCH) that leads to a GARCH diffusion model. (This limit is briefly summarized, with $\rho = 0$, in Bollerslev and Rossi's 1995 D. Nelson remembrance piece [20].) How well can the model fit option chains? Answering that is called calibration. Unfortunately, one desirable -- but absent -- property is an analytic solution, leaving numerics. While a simulation-based (Monte Carlo) approach doesn't seem like the most efficient one, in fact the model has been calibrated using Monte Carlo to a large options data set extending over several years by Christoffersen et al. in [1]. They find the GARCH diffusion a better fit than the oft-calibrated Heston '93 model and the so-called 3/2-model, their points of comparison. Given the nice properties, prior calibration results, and the general challenge, we were motivated to develop an efficient, accurate PDE calibrator for this model. Here, we report our methods and first results.

(Author affiliations: Thessaloniki, Greece, email: [email protected]; Newport Beach, California, USA, email: [email protected].)

An important property that the model shares with a wide class of models is stock price level-independence, a well-known scaling relation for vanilla option prices. Specifically, at some initial time $t_0$, consider a vanilla European call option price $C(t_0, S_0, v_0; K, T)$ with strike price $K$, expiration $T$, and state variables $(S_0, v_0)$. Then $C(t_0, S_0, v_0; K, T) = K\,c(t_0, z_0, v_0; T)$, where the standardized option pricing function $c(t, z, v; T)$ is independent of $K$ and $z_0 = S_0/K$. Fixing and suppressing $(K, T)$, consider the pricing function $C(t, S, v)$.
It satisfies the KBE (Kolmogorov backward equation) problem: $-\partial C/\partial t = \mathcal{L}_{S,v}\, C$ with terminal condition $C(T, S, v) = (S - K)^+$, where $\mathcal{L}_{S,v}$ is the process generator. Then, of course, $c(t, z, v)$ satisfies the same PDE with $c(T, z, v) = (z - 1)^+$. Now fix $K$, say $K = K_1 \equiv S_0$, and solve the (continuum) KBE problem once for expiration $T$. This gives $c(t_0, z, v_0)$, a function of $z$ for $z \in (0, \infty)$, since $c(t_0, z, v_0) = C(t_0, z K_1, v_0)/K_1$ and the r.h.s. is known for all values of $z$. For any other strike then, say $K = K_2$, one immediately gets $C(t_0, S_0, v_0; K_2, T) = K_2\, c(t_0, S_0/K_2, v_0)$. The point is that a single KBE solution yields all the (vanilla) option values for different strikes at a given expiration. While obvious in hindsight, the KBE implication of the MAP property initially eluded us. Early on we thought a forward equation (Fokker-Planck) was the only way of pricing "all-options-at-once" at a fixed expiration. Exploiting the scaling property resulted in significant performance improvements over our original "one-option-at-a-time" approach: a speedup of roughly $3\times$ up to $N_{options}/N_{expirations}$. The somewhat subtle reasons are discussed in Sec. 3.1. Given our KBE approach, one must make choices on how to solve the pricing PDE. As with the Heston model, option prices under the GARCH diffusion model are governed by a 2-D convection-diffusion-reaction PDE with a mixed derivative term. Key characteristics of a suitable numerical scheme would be a) stability under practical usage, b) a good accuracy to execution time ratio, and c) robustness (good oscillation-damping properties). As noted in [2], spurious oscillations in numerical computation of option prices can have three distinct causes: convection dominance, time-stepping schemes that are unable to sufficiently damp the high-frequency errors stemming from the payoff discontinuity, and finally negative coefficients arising from the discretization of the diffusion terms. Here we take a closer look at the last two.
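As a quick illustration of the scaling relation just described, the following sketch (ours, not from the paper) uses the Black-Scholes formula as a stand-in pricer and checks the homogeneity $C(t_0, S_0, v_0; K, T) = K\,c(t_0, S_0/K, v_0; T)$ numerically; any level-independent (MAP-class) pricer obeys the same identity:

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, T, r, q, sigma):
    """Black-Scholes call, standing in for any level-independent pricer."""
    d1 = (log(S / K) + (r - q + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * exp(-q * T) * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

S0, T, r, q, sigma = 100.0, 0.5, 0.02, 0.01, 0.3
for K in (80.0, 100.0, 125.0):
    full = bs_call(S0, K, T, r, q, sigma)              # C(S0; K)
    scaled = K * bs_call(S0 / K, 1.0, T, r, q, sigma)  # K * c(S0/K; strike 1)
    assert abs(full - scaled) < 1e-12
```

A single solve of the standardized (strike-1) problem thus prices the whole strike ladder at one expiration.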
For the spatial discretization we use the finite difference method on non-uniform grids. We employ standard central finite difference formulae for the diffusion and convection terms, but opt for a less common formula for the mixed derivative term, one that helps reduce oscillations that may take the solution to negative values. Although not our first choice, we also discretize the PDE cast in the natural logarithm of the asset price; combined with the mixed derivative scheme, this can further guard against negative values (but not preclude them altogether).

(Footnote: We are well aware of the general limitations of simple stochastic volatility diffusions. For example, they have difficulty fitting short-dated SPX option smiles and VIX options. Overcoming the limitations seems to require jump processes. But even if you want to include jumps in so-called "non-affine" models (like the GARCH diffusion), you need to start with a good PDE solver. For American options, barrier options, and other more exotic options, individual KBE solutions are needed.)

(Footnote: More generally, replace $v$ in the scaling argument above by $\theta$, a $(D{-}1)$-vector-valued state variable for a $D$-dimensional jump-diffusion or whatever. Then scaling (and thus "all-options-at-once") for Euro-style vanillas holds if the process $(X_t, \theta_t)$ is a MAP (Markov Additive Process), where $X_t \equiv \log S_t$ is the additive component and $\theta_t$ is the Markov component. MAPs are defined in Çinlar [19]; modelling implications are stressed in Lewis (2016) [5]. Note this generality admits even discrete-time processes. Thus, for MAPs, it suffices to solve the backwards evolution problem once for a single strike to get all the vanilla option prices at a given expiration $T$. Admitting jumps, the backwards evolution problem (in continuous time) is generally a PIDE (partial integro-differential equation) problem.)
With spatial discretization in place, one is left with a large system of stiff ordinary differential equations and must adopt a time-marching method. We employ two commonly used schemes plus a rather unusual one. For this type of PDE the most popular choice would be a cross-derivative-enabled ADI variant (see [3] for an overview). We opt for the Hundsdorfer-Verwer and the Modified Craig-Sneyd schemes that offer the best overall characteristics. Our alternative is the BDF3 fully implicit scheme which, as far as we know, has not been used in such a context in the financial literature. It may have already become apparent that we do not aim for one sole scheme that is necessarily monotone by design; we believe that such a scheme would likely be less accurate or slower than it needs to be. What we aim for instead is a reliable set-up, enabling as fast and accurate calibrations as possible. To this end we propose a strategy that involves occasional re-evaluations and a hybrid engine that switches from ADI to the slower but more robust BDF3 scheme in such cases. The optimization is done with commercial software. We mainly use local constrained optimization routines, but we also try a global method (Differential Evolution). The latter, while proving too slow to be the recommended option, can be used to add confidence that the local optimizer is indeed finding global minima (which we have found to be the case in all our tests). The rest of this paper is organized as follows: Sec. 2 presents the numerical methods for the solution of the pricing PDE, with non-standard implementation specifics given in more detail. Sec. 3 describes the calibration phase and proposed strategy for optimizing performance. Sec. 4 contains various numerical results. We compare the computational efficiency of the time-marching schemes and examine the effectiveness of Richardson extrapolation in both space and time. 
This is followed by reference calibration results to real data and comparisons with the Heston model. We conclude with a brief exploration of other non-affine models that are readily handled by our framework. We finally present our conclusions and suggestions for further development.
2. Numerical solution of the GARCH diffusion PDE
The GARCH diffusion stochastic volatility model is described (under the risk-neutral measure) by

$$dS_t = (r_T - q_T) S_t\,dt + \sqrt{v_t}\, S_t\,dW_t^S, \qquad dv_t = \kappa(\bar{v} - v_t)\,dt + \xi v_t\,dW_t^v. \quad (1)$$

Here the Brownian noises associated to the underlying asset $S_t$ (here the SPX) and its variance $v_t$ are correlated; i.e., $dW_t^S\,dW_t^v = \rho\,dt$. A compatible real-world evolution is given in Appendix A. Time-series analysis (of similar real-world models) suggests that the correlation coefficient $\rho$ is negative, with typical values of around $-0.75$ (Ait-Sahalia & Kimmel [4]). So here we assume $\rho < 0$. The variance process $v_t$ has volatility $\xi > 0$ and reverts to its long-run mean $\bar{v} > 0$ with a mean-reversion rate of $\kappa > 0$. $T$ is the time of an option expiration. Generally, our model assumes an environment with deterministic interest rates and dividend yields: $(r_t, q_t)$. But we write $(r_T, q_T)$ to indicate that we are using stepwise constants for each option expiration. (There will be some deterministic behavior for $(r_t, q_t)$ compatible with this). Let then $U(S, v, t)$ denote the price of a European option when at time
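For intuition only (the calibrator in this paper is PDE-based, not simulation-based), the dynamics (1) are easy to simulate. A minimal sketch of ours, with illustrative parameter values of our own choosing, using a log-Euler step for $S$ and a truncated Euler step for $v$:

```python
import math, random

def simulate_garch_diffusion(S0=100.0, v0=0.04, r=0.02, q=0.01,
                             kappa=2.0, vbar=0.04, xi=1.0, rho=-0.75,
                             T=1.0, n_steps=252, seed=42):
    """Euler-type path of the GARCH diffusion (1): log-Euler for S, Euler for v."""
    rng = random.Random(seed)
    dt = T / n_steps
    sdt = math.sqrt(dt)
    S, v = S0, v0
    S_path, v_path = [S], [v]
    for _ in range(n_steps):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        dWs = sdt * z1
        dWv = sdt * (rho * z1 + math.sqrt(1 - rho**2) * z2)  # correlated noises
        vp = max(v, 0.0)                        # full-truncation guard
        S *= math.exp((r - q - 0.5 * vp) * dt + math.sqrt(vp) * dWs)
        v += kappa * (vbar - vp) * dt + xi * vp * dWv  # diffusion ~ v vanishes at 0
        S_path.append(S)
        v_path.append(v)
    return S_path, v_path

S_path, v_path = simulate_garch_diffusion()
assert len(S_path) == 253 and all(s > 0 for s in S_path)
assert all(math.isfinite(x) for x in v_path)
```

The log-Euler update keeps $S$ strictly positive; the variance diffusion $\xi v$ vanishing at the origin reflects the entrance-boundary behavior discussed below.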
$T - t$ the underlying asset price equals $S$ and its variance equals $v$. It is easy to verify that under the above specification $U(S, v, t)$ must satisfy the following parabolic PDE

$$\frac{\partial U}{\partial t} = \frac{1}{2} v S^2 \frac{\partial^2 U}{\partial S^2} + \rho \xi S v^{3/2} \frac{\partial^2 U}{\partial S \partial v} + \frac{1}{2} \xi^2 v^2 \frac{\partial^2 U}{\partial v^2} + (r_T - q_T) S \frac{\partial U}{\partial S} + \kappa(\bar{v} - v)\frac{\partial U}{\partial v} - r_T U \quad (2)$$

for $S > 0$, $v > 0$ and $0 < t \le T$.
We can also cast the equation in terms of the natural logarithm of the price, $X = \ln(S)$:

$$\frac{\partial U}{\partial t} = \frac{1}{2} v \frac{\partial^2 U}{\partial X^2} + \rho \xi v^{3/2} \frac{\partial^2 U}{\partial X \partial v} + \frac{1}{2} \xi^2 v^2 \frac{\partial^2 U}{\partial v^2} + \left(r_T - q_T - \frac{1}{2} v\right) \frac{\partial U}{\partial X} + \kappa(\bar{v} - v)\frac{\partial U}{\partial v} - r_T U \quad (3)$$

for $v > 0$ and $0 < t \le T$.
Equations (2) and (3) are categorized as time-dependent convection-diffusion-reaction PDEs on an unbounded spatial domain. While (3) has a slightly simpler form, it is harder to allocate points on the $X$-grid optimally. This is especially true since in many cases the grid needs to start from very small $S$ values (to avoid loss of accuracy from grid truncation), which then means that a lot of $X$-points will be placed in an area of low interest. We will thus discretize and solve (2) primarily, but the code is also (trivially) adapted for switching to solving (3) as well.

Initial and boundary conditions
As initial conditions to (2) we have the vanilla call and put payoffs

$$U^{call}(S, v, 0) = \max(S - K, 0), \qquad U^{put}(S, v, 0) = \max(K - S, 0), \quad (4)$$

where $K$ is the strike of the option. We impose (numerical) boundary conditions of Dirichlet type (5)-(6) and Neumann type (7)-(9) on the left and right-side boundaries respectively:

$$U^{call}(S_{min}, v, t) = 0, \quad (5)$$
$$U^{put}(S_{min}, v, t) = K e^{-r_T t} - S_{min} e^{-q_T t}, \quad (6)$$
$$\frac{\partial U^{call}}{\partial S}(S_{max}, v, t) = e^{-q_T t}, \quad (7)$$
$$\frac{\partial U^{put}}{\partial S}(S_{max}, v, t) = 0, \quad (8)$$
$$\frac{\partial U^{call}}{\partial v}(S, v_{max}, t) = \frac{\partial U^{put}}{\partial v}(S, v_{max}, t) = 0. \quad (9)$$

Under this model $v = 0$ is an entrance boundary for all $(\kappa, \xi) > 0$, meaning that $v = 0$ is unreachable whenever the process starts at $v_0 > 0$. However, the process may in principle be started at $v_0 = 0$, after which it immediately enters the interior and never hits the origin again (for more details the reader is referred to Lewis [5], pg. 102). Therefore, from a mathematical standpoint no boundary condition is necessary. The PDE itself can be applied at $v = v_{min} = 0$ (where all diffusion terms vanish due to the presence of the factor $v$) and there is no need for any extra condition from a numerical point of view either. The choice of the grid truncation boundaries $S_{min}, S_{max}$ (or $X_{min}, X_{max}$) and $v_{max}$ is discussed in Sec. 2.4.1.1. The boundary conditions are set in an equivalent manner for equation (3). We discretize in space using the finite difference method and work on non-uniform grids, which we consider necessary for the efficient solution of the pricing PDE. In the $S$-direction, allocating more points around the strike can significantly reduce the error stemming from the initial delta discontinuity there. In the $v$-direction, allocating more points near $v_0$ makes sense since we want to better resolve the area where we want to obtain a price. Also, since typically we have $v_{max} \gg v_0$ (see Sec.
2.4.1.1), a non-uniform $v$-grid is all but necessary to both adequately resolve the area around $v_0$ and at the same time reach out to $v_{max}$ with a reasonable number of grid points. We use the standard central finite difference formulas for the first and second derivatives in (2) and (3) and a rather less standard seven-point stencil representation for the mixed derivative. All formulas give second-order accurate approximations, provided the grid step variation is sufficiently smooth (as is indeed the case for the grid construction proposed in Sec. 2.4.1.2).

(Footnote: This means that there are indeed non-trivial option price solutions for $v = 0$.)

Let the grid in the $S$-direction be defined by $NS + 1$ points, $0 \le S_{min} = S_0 < S_1 < \cdots < S_{NS} = S_{max}$, with corresponding grid steps $\Delta S_i = S_i - S_{i-1}$, $i = 1, 2, \ldots, NS$. We then define the discretized versions of the first and second derivatives $\partial U_{i,j}/\partial S$ and $\partial^2 U_{i,j}/\partial S^2$ at $S = S_i$ as

$$\frac{\partial U_{i,j}}{\partial S} \approx \frac{-\Delta S_{i+1}}{\Delta S_i(\Delta S_i + \Delta S_{i+1})}\, U_{i-1,j} + \frac{\Delta S_{i+1} - \Delta S_i}{\Delta S_i\, \Delta S_{i+1}}\, U_{i,j} + \frac{\Delta S_i}{\Delta S_{i+1}(\Delta S_i + \Delta S_{i+1})}\, U_{i+1,j}, \quad (10)$$

$$\frac{\partial^2 U_{i,j}}{\partial S^2} \approx \frac{2}{\Delta S_i(\Delta S_i + \Delta S_{i+1})}\, U_{i-1,j} - \frac{2}{\Delta S_i\, \Delta S_{i+1}}\, U_{i,j} + \frac{2}{\Delta S_{i+1}(\Delta S_i + \Delta S_{i+1})}\, U_{i+1,j}. \quad (11)$$

We use the equivalent expressions for the derivatives $\partial U_{i,j}/\partial v$ and $\partial^2 U_{i,j}/\partial v^2$ at $v = v_j$ in the $v$-direction, where the grid is defined by $NV + 1$ points, $0 = v_0 < v_1 < \cdots < v_{NV} = v_{max}$ and $\Delta v_j = v_j - v_{j-1}$, $j = 1, 2, \ldots, NV$. An exception is the $v = 0$ boundary, where we use the one-sided (upwind) second-order formula for $\partial U_{i,j=0}/\partial v$:

$$\frac{\partial U_{i,0}}{\partial v} \approx -\frac{2\Delta v_1 + \Delta v_2}{\Delta v_1(\Delta v_1 + \Delta v_2)}\, U_{i,0} + \frac{\Delta v_1 + \Delta v_2}{\Delta v_1\, \Delta v_2}\, U_{i,1} - \frac{\Delta v_1}{\Delta v_2(\Delta v_1 + \Delta v_2)}\, U_{i,2}. \quad (12)$$

For the mixed derivative term, we opt for a custom second-order scheme based on a 7-point stencil, which is very similar but not identical to that proposed by Ikonen & Toivanen in [6].
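The non-uniform central weights in (10)-(11) are easy to get wrong, so here is a small verification sketch of ours (the helper names `d1_central`/`d2_central` are our own): applied to a smooth test function on mildly unequal steps, both approximations should be accurate to the expected truncation order:

```python
import math

def d1_central(um, u0, up, hm, hp):
    """First derivative, non-uniform 3-point central formula (10)."""
    return (-hp / (hm * (hm + hp)) * um
            + (hp - hm) / (hm * hp) * u0
            + hm / (hp * (hm + hp)) * up)

def d2_central(um, u0, up, hm, hp):
    """Second derivative, non-uniform 3-point central formula (11)."""
    return (2.0 / (hm * (hm + hp)) * um
            - 2.0 / (hm * hp) * u0
            + 2.0 / (hp * (hm + hp)) * up)

f = math.sin                      # test function: f' = cos, f'' = -sin
x0, hm, hp = 0.7, 0.010, 0.012   # mildly non-uniform steps
um, u0, up = f(x0 - hm), f(x0), f(x0 + hp)
err1 = abs(d1_central(um, u0, up, hm, hp) - math.cos(x0))
err2 = abs(d2_central(um, u0, up, hm, hp) + math.sin(x0))
assert err1 < 1e-4 and err2 < 1e-3
```

Note the second-derivative formula is only first-order on an arbitrary grid; smooth step variation (as produced by the stretching functions of Sec. 2.4.1.2) restores second-order behavior.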
Such a scheme can be constructed so that it contributes fewer negative off-diagonal coefficients to the resulting system's discretization matrix $A$ than the standard second-order scheme based on the 9-point stencil. This in turn makes the solution less likely to produce a negative valuation. When the correlation coefficient $\rho$ is negative (which we take to be the case here, as discussed in Sec. 2.1), an appropriate formula for approximating $\partial^2 U_{i,j}/\partial S \partial v$ at $(S, v) = (S_i, v_j)$ is given by

$$\frac{\partial^2 U_{i,j}}{\partial S \partial v} \approx \frac{1}{D}\Big(-U_{i+1,j-1} + 2U_{i,j} - U_{i-1,j+1} + (\Delta S_{i+1} - \Delta S_i)\frac{\partial U_{i,j}}{\partial S} + (\Delta v_{j+1} - \Delta v_j)\frac{\partial U_{i,j}}{\partial v} + \frac{1}{2}\big(\Delta S_{i+1}^2 + \Delta S_i^2\big)\frac{\partial^2 U_{i,j}}{\partial S^2} + \frac{1}{2}\big(\Delta v_{j+1}^2 + \Delta v_j^2\big)\frac{\partial^2 U_{i,j}}{\partial v^2}\Big), \quad (13)$$

where

$$D = \Delta S_{i+1}\,\Delta v_j + \Delta S_i\,\Delta v_{j+1}. \quad (14)$$

Formula (13) is readily obtained by considering Taylor expansions of the option value $U_{i,j}$ at the neighboring upper-left and lower-right grid points $(S_{i-1}, v_{j+1})$ and $(S_{i+1}, v_{j-1})$. Such a formula can be used in conjunction with a specially constructed grid (with limitations imposed on the grid steps), and some use of first-order upwind formulas for the convection terms $\partial U_{i,j}/\partial S$ and $\partial U_{i,j}/\partial v$, to make $A$ an M-matrix by design (see [6] for example). This would ensure that the solution cannot produce a negative valuation in any case, which would be a particularly useful feature for our calibration (which requires the calculation of implied volatilities). Such an approach though is not favored in the present work, since we believe that it would unnecessarily reduce the average accuracy of the solution through suboptimal grid construction: the grid point allocation should be driven by the problem's physical characteristics (e.g. the location of the payoff discontinuity) and not be forced upon it through the mathematical requirement of non-negative coefficients. We revisit this in Sec. 3, where we explain how we handle the occasional negative values that are indeed possible under the proposed discretization. We can now replace the spatial derivatives on the right-hand side of equation (2) with their discretized versions described above to obtain its semi-discretized form.

(Footnote: Using the first-order (two-point) upwind formula would be better from a stability point of view, but would result in loss of accuracy and of the overall second-order convergence of the discretization. In practice we have seen no stability issues arising from the use of (12) in extensive tests throughout numerous calibration exercises. Zvan et al. [2] provide a similar discussion, albeit in the context of finite volume/element discretization.)
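The Taylor-expansion construction behind (13)-(14) can be checked numerically. The sketch below (our own; the function name `mixed_7pt` and the test function are illustrative) combines (13) with the discretized derivatives (10)-(11) on a smooth function, where the mixed derivative is known in closed form:

```python
import math

def mixed_7pt(u, S, v, i, j):
    """Approximate U_Sv at (S_i, v_j) via (13)-(14); u is a dict {(i, j): value}."""
    hm, hp = S[i] - S[i - 1], S[i + 1] - S[i]
    km, kp = v[j] - v[j - 1], v[j + 1] - v[j]
    # discretized first/second derivatives, formulas (10)-(11)
    Us  = (-hp/(hm*(hm+hp))*u[i-1, j] + (hp-hm)/(hm*hp)*u[i, j] + hm/(hp*(hm+hp))*u[i+1, j])
    Uv  = (-kp/(km*(km+kp))*u[i, j-1] + (kp-km)/(km*kp)*u[i, j] + km/(kp*(km+kp))*u[i, j+1])
    Uss = (2/(hm*(hm+hp))*u[i-1, j] - 2/(hm*hp)*u[i, j] + 2/(hp*(hm+hp))*u[i+1, j])
    Uvv = (2/(km*(km+kp))*u[i, j-1] - 2/(km*kp)*u[i, j] + 2/(kp*(km+kp))*u[i, j+1])
    D = hp * km + hm * kp                       # (14)
    return (-u[i+1, j-1] + 2*u[i, j] - u[i-1, j+1]
            + (hp - hm)*Us + (kp - km)*Uv
            + 0.5*(hp*hp + hm*hm)*Uss + 0.5*(kp*kp + km*km)*Uvv) / D

f = lambda s, w: math.exp(0.3 * s) * math.sin(w)   # U_Sv = 0.3 e^{0.3 s} cos(w)
S = {0: 0.90, 1: 1.00, 2: 1.11}                    # mildly non-uniform grids
v = {0: 0.40, 1: 0.50, 2: 0.61}
u = {(i, j): f(S[i], v[j]) for i in range(3) for j in range(3)}
approx = mixed_7pt(u, S, v, 1, 1)
exact = 0.3 * math.exp(0.3 * S[1]) * math.cos(v[1])
assert abs(approx - exact) < 1e-2
```

The stencil uses the upper-left/lower-right corner points, matching the $\rho < 0$ case assumed in the text.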
$$\frac{dU_{i,j}}{dt} = d_S \frac{\partial^2 U_{i,j}}{\partial S^2} + d_v \frac{\partial^2 U_{i,j}}{\partial v^2} + c_S \frac{\partial U_{i,j}}{\partial S} + c_v \frac{\partial U_{i,j}}{\partial v} + c_{Sv}\big(-U_{i+1,j-1} + 2U_{i,j} - U_{i-1,j+1}\big) - r_T U_{i,j}, \quad (15)$$

where the derivatives stand for their discretized versions (10)-(11) and the diffusion, convection and mixed derivative coefficients are given by

$$d_S = \tfrac{1}{2} S_i^2 v_j + \tfrac{1}{2} c_{Sv}\big(\Delta S_{i+1}^2 + \Delta S_i^2\big), \quad (16)$$
$$d_v = \tfrac{1}{2} \xi^2 v_j^2 + \tfrac{1}{2} c_{Sv}\big(\Delta v_{j+1}^2 + \Delta v_j^2\big), \quad (17)$$
$$c_S = (r_T - q_T) S_i + c_{Sv}\big(\Delta S_{i+1} - \Delta S_i\big), \quad (18)$$
$$c_v = \kappa(\bar{v} - v_j) + c_{Sv}\big(\Delta v_{j+1} - \Delta v_j\big), \quad (19)$$
$$c_{Sv} = \frac{1}{D}\,\rho\, \xi\, S_i\, v_j^{3/2}. \quad (20)$$

Equation (15) is applied at each grid point $(S_i, v_j)$ for $i = 1, 2, \ldots, NS$ and $j = 0, 1, \ldots, NV$. We do not need to solve for $S_{i=0} = S_{min}$, since the Dirichlet boundary conditions (5) and (6) specify constant values there for the option value $U$. At $v_{j=0} = 0$ the second and mixed derivative terms in (15) vanish and the upwind discretization (12) means we only use values within our grid. This is not the case though for the far-boundary grid lines $S_{i=NS} = S_{max}$ and $v_{j=NV} = v_{max}$: the Neumann-type conditions (7)-(9) imply that the mixed derivative $\partial^2 U/\partial S \partial v$ (and thus all the terms in (15)-(19) multiplied by $c_{Sv}$) vanishes. But we still have the second derivatives, whose stencils reference a point outside the grid. Such points are treated as fictitious and their value is obtained through extrapolation based on the last actual grid point and the known value of the gradient there. With spatial discretization in place we are now left with a large system of stiff ordinary differential equations (ODEs) in time, which we can write as

$$U'(t) = F(t, U(t)), \qquad U(0) = U_0, \qquad F(t, U(t)) := A\,U(t) + g(t), \qquad 0 \le t \le T. \quad (21)$$

Here $U'(t), U(t), g(t)$ and $U_0$ are vectors of size $m$ and $A$ is the $m \times m$ spatial discretization matrix, where
$m = NS \times (NV + 1)$ is the total number of unknowns. The elements of $g(t)$ will depend on the boundary conditions (5)-(9) and those of $U_0$ on the initial conditions (4). We now need to adopt a time-marching method to solve (21). Popular choices for 1-D problems, such as the Implicit Euler and Crank-Nicolson schemes, become inefficient in higher dimensions, leading to large sparse systems that are a lot more expensive to solve than the small (typically tridiagonal) ones in the 1-D case. ADI-type splitting schemes are thus the most popular choice in 2-D and 3-D. However, standard (non-splitting) schemes can still be competitive for 2-D problems if a fast, sparse direct solver is used. This is especially true for the vanilla option pricing problem, as the coefficients are time-independent and the matrix factorization step only needs to be performed once (or a few times). We employ one such method not often used in finance, namely the BDF3 (or 4-Level Fully Implicit) scheme. Our main workhorses though will be two popular ADI schemes, which we briefly present first. For a detailed review of
ADI methods for PDEs with mixed derivatives in finance, the reader is referred to [3]. The first step for all such methods is to decompose $A$ in (21) into three submatrices:

$$A = A_0 + A_1 + A_2. \quad (22)$$

$A_0$ contains all terms stemming from the discretization of the mixed derivative term in (2), (3), i.e., all terms in (15) including $c_{Sv}$ as a factor. $A_1$ and $A_2$ contain all the terms corresponding to the discretized derivatives in the $S$-direction and $v$-direction respectively. The reaction term $-r_T U$ is evenly distributed between $A_1$ and $A_2$. By virtue of our 3-point central discretizations for the convection and diffusion terms, $A_1$ and $A_2$ are tridiagonal matrices. We split the vector $g(t)$ and function $F(t, U)$ from (21) accordingly, as $g(t) = g_0(t) + g_1(t) + g_2(t)$ and $F(t, U) = F_0(t, U) + F_1(t, U) + F_2(t, U)$. We will use a uniform temporal grid defined by the points $t_n = n\,\Delta t$, $0 \le n \le NT$, $\Delta t = T/NT$. Let $\theta$ be a real parameter which will control the exact splitting. We now outline our two main schemes, chosen for their optimal combination of stability, accuracy and inherent oscillation-damping properties [7].

Hundsdorfer-Verwer (HV) scheme:

$$\begin{cases} Y_0 = U_{n-1} + \Delta t\, F(t_{n-1}, U_{n-1}), & \text{step 1} \\ Y_j = Y_{j-1} + \theta \Delta t\,\big(F_j(t_n, Y_j) - F_j(t_{n-1}, U_{n-1})\big) \ (j = 1, 2), & \text{steps 2 \& 3} \\ \tilde{Y}_0 = Y_0 + \tfrac{1}{2}\Delta t\,\big(F(t_n, Y_2) - F(t_{n-1}, U_{n-1})\big), & \text{step 4} \\ \tilde{Y}_j = \tilde{Y}_{j-1} + \theta \Delta t\,\big(F_j(t_n, \tilde{Y}_j) - F_j(t_n, Y_2)\big) \ (j = 1, 2), & \text{steps 5 \& 6} \\ U_n = \tilde{Y}_2 & \end{cases} \quad (23)$$

Modified Craig-Sneyd (MCS) scheme:

$$\begin{cases} Y_0 = U_{n-1} + \Delta t\, F(t_{n-1}, U_{n-1}), & \text{step 1} \\ Y_j = Y_{j-1} + \theta \Delta t\,\big(F_j(t_n, Y_j) - F_j(t_{n-1}, U_{n-1})\big) \ (j = 1, 2), & \text{steps 2 \& 3} \\ \hat{Y}_0 = Y_0 + \theta \Delta t\,\big(F_0(t_n, Y_2) - F_0(t_{n-1}, U_{n-1})\big), & \text{step 4} \\ \tilde{Y}_0 = \hat{Y}_0 + \big(\tfrac{1}{2} - \theta\big)\Delta t\,\big(F(t_n, Y_2) - F(t_{n-1}, U_{n-1})\big), & \text{step 5} \\ \tilde{Y}_j = \tilde{Y}_{j-1} + \theta \Delta t\,\big(F_j(t_n, \tilde{Y}_j) - F_j(t_{n-1}, U_{n-1})\big) \ (j = 1, 2), & \text{steps 6 \& 7} \\ U_n = \tilde{Y}_2 & \end{cases} \quad (24)$$

Both schemes employ multiple intermediate steps to advance the solution from $U_{n-1}$ to $U_n$.
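To make the HV step structure concrete, here is a toy sketch of ours applying scheme (23) to a scalar linear ODE $u' = (a_0 + a_1 + a_2)u$, where the "splitting" is just the three scalars and each implicit stage reduces to a scalar solve; second-order convergence shows up as the error dropping by roughly 4 when the step is halved:

```python
import math

def hv_step(u, dt, a0, a1, a2, theta):
    """One Hundsdorfer-Verwer step (23) for u' = (a0 + a1 + a2) * u."""
    a = a0 + a1 + a2
    Y0 = u + dt * a * u                                          # step 1: explicit predictor
    Y1 = (Y0 - theta * dt * a1 * u) / (1 - theta * dt * a1)      # step 2: implicit, "direction" 1
    Y2 = (Y1 - theta * dt * a2 * u) / (1 - theta * dt * a2)      # step 3: implicit, "direction" 2
    Yt0 = Y0 + 0.5 * dt * a * (Y2 - u)                           # step 4: second predictor
    Yt1 = (Yt0 - theta * dt * a1 * Y2) / (1 - theta * dt * a1)   # step 5
    Yt2 = (Yt1 - theta * dt * a2 * Y2) / (1 - theta * dt * a2)   # step 6
    return Yt2

def integrate(n, theta=1 - math.sqrt(2) / 2, a=(-0.3, -1.0, -0.7), T=1.0):
    u, dt = 1.0, T / n
    for _ in range(n):
        u = hv_step(u, dt, *a, theta)
    return u

exact = math.exp(-2.0)
err_n = abs(integrate(40) - exact)
err_2n = abs(integrate(80) - exact)
assert err_2n < err_n / 3   # ~4x reduction expected for a second-order scheme
```

In the real 2-D problem the scalar divisions become tridiagonal solves (I - theta*dt*A_j), reused across time steps since the coefficients are time-independent.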
The HV scheme starts with a forward Euler (predictor) step (1), followed by two unidirectional implicit (corrector) steps (2 & 3) which serve to stabilize the explicit first step. Then a second predictor step (4) is followed by two more implicit corrector steps (5 & 6). The MCS scheme has an identical structure except for the double second predictor step (steps 4 & 5). The implicit steps require the solution of tridiagonal systems, which we solve efficiently with LU decomposition. We use the HV scheme with $\theta = 1 - \sqrt{2}/2$ (which we shall refer to as HV1) and $\theta = 1/2 + \sqrt{3}/6$ (HV2). It was conjectured in [3] that HV1 is only conditionally stable (but more accurate), and HV2 unconditionally stable (and less accurate). For the MCS scheme we use $\theta = 1/3$, recommended in [8] as an optimal value based on stability analysis and experiments. Regardless of the value of $\theta$, both schemes are second-order. We note that despite being proven unconditionally (von Neumann-) stable, these schemes do not always sufficiently damp local high-frequency errors caused by discontinuities in the initial conditions. This may result in spurious oscillations and reduced order of convergence; see for example [3, 9]. In this case a technique known as Rannacher time-stepping can be used to palliate the issue. This involves using a different scheme for the first time-step (which is divided into two equal sub-steps), one that can successfully damp oscillations and is usually first-order (typically the Euler Implicit scheme).

(Footnote: This is not strictly true for matrix $A_2$, because the one-sided formula (12) used for the $v = 0$ boundary involves one more point off the diagonal.)

For a more robust alternative that could conceivably provide smoother inputs to the (gradient-based) optimizer, we look at the third-order BDF3 scheme. Although not a typical choice, it is nonetheless simple to implement and has good stability (it is almost A-stable in an ODE sense) and oscillation-damping properties.
To solve the resulting systems, we use the Eigen C++ matrix library, which offers simple interfaces to several direct sparse system solvers. The fastest one for the present system structure seems to be the UMFPACK solver, which we used for our experiments here. The scheme simply amounts to replacing the time derivative $U'(t)$ in (21) with a one-sided, 4-level backward finite difference expression. The discretized version of (21) then looks like
$$\frac{\tfrac{11}{6} U_n - 3U_{n-1} + \tfrac{3}{2} U_{n-2} - \tfrac{1}{3} U_{n-3}}{\Delta t} = A U_n + g_n = F(t_n, U_n), \quad (25)$$

and the values $U_n$ at time level $n$ are calculated given the values at the previous three time-levels as

$$\tfrac{11}{6} U_n = 3U_{n-1} - \tfrac{3}{2} U_{n-2} + \tfrac{1}{3} U_{n-3} + \Delta t\,(A U_n + g_n). \quad (26)$$

Since values are required not only from the previous time-level (like the ADI methods), but also from two levels before that, we must use some alternative scheme for the first two steps of the integration. We use the first-order Implicit Euler (IE) scheme and the second-order BDF2 scheme. The IE scheme is given by $U_n = U_{n-1} + \Delta t\,(A U_n + g_n)$ and requires the factorization of $A_{IE} = (I - \Delta t A)$. The BDF2 scheme is given by $\tfrac{3}{2} U_n = 2U_{n-1} - \tfrac{1}{2} U_{n-2} + \Delta t\,(A U_n + g_n)$ and requires the factorization of $A_{BDF2} = (\tfrac{3}{2} I - \Delta t A)$. In order to improve accuracy for the first time-step, we employ Richardson extrapolation like this: we first use the IE scheme for 4 sub-steps of size $\Delta t/4$ to get the values $U_1^{(1)}$ at the end of the first time-step. We then repeat, this time using 2 sub-steps of size $\Delta t/2$, to obtain $U_1^{(2)}$ and get the final composite values for the first time-step as $U_1 = 2U_1^{(1)} - U_1^{(2)}$. Note that this requires 2 matrix factorizations, corresponding to $A_{IE}$ with $\Delta t/4$ and $\Delta t/2$. To get the values $U_2$ at the end of the second time-step we use the BDF2 scheme. In total, the present implementation requires 4 expensive factorizations, which add a substantial upfront computational cost. Let us loosely define computational efficiency (CE) as the accuracy achieved per unit CPU time. A PDE-based solver cannot match the CE of semi-analytical solutions, such as those available for the Heston model. We therefore need to look into ways of improving the CE of our set-up. Here we consider grid construction, smoothing of the initial conditions and Richardson extrapolation.
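The BDF3 recursion (26) with the IE/Richardson start-up described above is easy to prototype on a scalar test equation $u' = \lambda u$ (a sketch of ours, not the paper's code); third-order convergence means the error falls by roughly 8 when $\Delta t$ is halved:

```python
import math

def bdf3_solve(lam=-1.0, u0=1.0, T=1.0, n=16):
    """Integrate u' = lam*u with BDF3 (26); first step via Implicit Euler +
    Richardson extrapolation (4 sub-steps of dt/4 vs 2 of dt/2), second via BDF2."""
    dt = T / n
    ie = lambda u, h: u / (1 - h * lam)        # scalar Implicit Euler solve
    # first step: Richardson-extrapolated IE
    u_fine = u0
    for _ in range(4):
        u_fine = ie(u_fine, dt / 4)
    u_coarse = ie(ie(u0, dt / 2), dt / 2)
    u1 = 2 * u_fine - u_coarse                 # cancels the leading O(h) error term
    # second step: BDF2, 1.5*u2 = 2*u1 - 0.5*u0 + dt*lam*u2
    u2 = (2 * u1 - 0.5 * u0) / (1.5 - dt * lam)
    hist = [u0, u1, u2]
    # BDF3: (11/6)*un = 3*u_{n-1} - 1.5*u_{n-2} + (1/3)*u_{n-3} + dt*lam*un
    for _ in range(3, n + 1):
        un = (3 * hist[-1] - 1.5 * hist[-2] + hist[-3] / 3) / (11 / 6 - dt * lam)
        hist.append(un)
    return hist[-1]

exact = math.exp(-1.0)
err_n = abs(bdf3_solve(n=16) - exact)
err_2n = abs(bdf3_solve(n=32) - exact)
assert err_2n < err_n / 5   # ~8x reduction expected for a third-order scheme
```

In the PDE setting each scalar division becomes a sparse solve with the factorized matrix $(\tfrac{11}{6} I - \Delta t A)$, which is the point of the upfront factorization cost discussed above.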
Our domain is semi-infinite (or infinite for X in equation (3)), so in practice the grid needs to be truncated at some point. If the grid does not extend far enough then the imposed boundary conditions will not hold exactly true and forcing them on the solution will introduce some error. If the grid extends further than it needs to then the grid step sizes will be larger for the same number of points, resulting in less accurate finite difference approximations. There is no obvious way to determine the truncation limits, so here we make the empirical choices below. Note the dependence of the limits on the model parameters, which means that the grids used will be different for each objective function evaluation (based on the parameters set by the optimizer each time).
S-direction
For the S -grid, we truncate to the right at π πππ₯ = π (ππ(πππ₯(πΎ,π ))+ ππ ππ π‘ βπ) , where we set M = 5 and π ππ π‘ = 0.5(βπ£ + βπ£ πΏ ) . We then set: πππ₯ < 20πΎ. This choice leads to good solution accuracy overall, but for extreme model parameter regimes and benchmark calculations we additionally multiply π πππ₯ by a safety factor of 2 to 3. For the left boundary and for equation (2), we set π πππ =
0. For equation (3) in
π = ln(π) , we truncate at π πππ = ln(πππ(πΎ, π )) β ππ ππ π‘ βπ , where we set M = 6. We then further require that π πππ β€ πΌπΎ , where πΌ is some constant. We normally set πΌ = 0.1 but for high accuracy we recommend πΌ β€ 0.025. v-direction We set π£ πππ = 0 , i.e., we do not truncate the left boundary. To set an appropriate right boundary, we note that for π β β , π£ π‘ follows an Inverse Gamma distribution (see Appendix B). Given the distribution we can then set π£ πππ₯ = π£ ππππ‘ (π) = πΉ β² (π) (27) and πΉ β² is the inverse cumulative (Inverse Gamma) probability function. We find that a value of between β5 and β6 is necessary for accurate valuations. For short-dated options an empirical fraction of (27) can be used whenever π β π < 1 . Alternatively, one can numerically calculate the exact distribution β and thus π£ ππππ‘ (π) β for each expiration. This is described in Appendix B and is used for our experiments in Sec. 4 with β6 . We finally note that typically it will be π£ πππ₯ β« π£ . This observation alone necessitates the use of a non-uniform grid, described next. Computational efficiency can be improved significantly and any problems due to discontinuities mitigated, with a grid that concentrates more points where theyβre needed.
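A quantile of the stationary law gives a concrete $v_{max}$. As a sketch (ours, not the paper's code), we assume the standard result that the stationary density is Inverse Gamma with shape $\alpha = 1 + 2\kappa/\xi^2$ and scale $\beta = 2\kappa\bar{v}/\xi^2$ (so its mean is $\beta/(\alpha - 1) = \bar{v}$; the paper's own derivation is in Appendix B), and compute $F^{-1}(p)$ by bisection using a power series for the regularized incomplete gamma function:

```python
import math

def reg_lower_gamma(a, x, tol=1e-14):
    """Regularized lower incomplete gamma P(a, x) via its power series (good for x < a+1)."""
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > tol * abs(total):
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def invgamma_quantile(p, shape, scale, lo=1e-12, hi=1e6):
    """v with P[V <= v] = p for V ~ InvGamma(shape, scale); CDF(v) = Q(shape, scale/v)."""
    cdf = lambda v: 1.0 - reg_lower_gamma(shape, scale / v)
    for _ in range(200):                 # plain bisection on the monotone CDF
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# illustrative parameters (our assumption): kappa = 2, vbar = 0.04, xi = 1
kappa, vbar, xi = 2.0, 0.04, 1.0
alpha = 1.0 + 2.0 * kappa / xi**2        # shape of the stationary law
beta = 2.0 * kappa * vbar / xi**2        # scale; mean = beta/(alpha - 1) = vbar
v_max = invgamma_quantile(1.0 - 1e-6, alpha, beta)
assert abs(beta / (alpha - 1.0) - vbar) < 1e-15
assert v_max > vbar                      # far out in the right tail, as expected
```

Even with a modest $\bar{v}$, the heavy right tail of the Inverse Gamma pushes $v_{max}$ orders of magnitude above $v_0$, which is exactly why the non-uniform $v$-grid below is needed.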
We employ a well-known one-dimensional grid-generating (stretching) function based on the inverse hyperbolic sine, which satisfies certain criteria for use with finite difference methods. The interested reader is referred to Vinokur [10]. The same function but in slightly different form is often used in the financial literature, see for example Tavella & Randall [11] and
In 't Hout & Foulon [3]. The grid in the $S$-direction is given by:

$$S_i = S_{min} + K\left(1 + \frac{\sinh\!\big(c_S(i/NS - w_S)\big)}{\sinh(c_S\, w_S)}\right) \quad (28)$$

for $i = 0, 1, \ldots, NS$, where $K$ is the strike (or more generally the desired clustering point) and $w_S, c_S$ are free parameters. $w_S$ represents the percentage of total points that lie between $S_{min}$ and $K$, and $c_S$ controls the degree of non-uniformity. We set $c_S = 4.5$, which corresponds to moderate non-uniformity and generally results in low error profiles across the moneyness spectrum. Given $c_S$, $w_S$ can be set so that the grid goes up to $S_{max}$ using:

$$w_S = \frac{\ln\!\big((A + e^{c_S})/(A + e^{-c_S})\big)}{2 c_S}, \qquad A = \frac{S_{max} - K}{K - S_{min}}.$$

We then make sure that the strike falls exactly on a grid point by making a further slight adjustment to $w_S$: we find $i_K = \mathrm{round}(w_S \cdot NS)$ and then reset $w_S = i_K/NS$. Finally, we use the same approach for generating the $X$-grid in equation (3). For the $v$-direction we again use the same grid-generating function:

$$v_j = v_0\left(1 + \frac{\sinh\!\big(c_v(j/NV - w_v)\big)}{\sinh(c_v\, w_v)}\right), \quad (29)$$

for $j = 0, 1, \ldots, NV$, which clusters points around $v_0$. Since $v_{max} \gg v_0$ we set $c_v = 8.5$, which is as non-uniform as we can get before CE starts dropping. We first set $w_v$ so that the grid goes up to $v_{max}$:

$$w_v = \frac{\ln\!\big((A + e^{c_v})/(A + e^{-c_v})\big)}{2 c_v}, \qquad A = \frac{v_{max}}{v_0} - 1.$$

We then find $j_v = \max\big(\mathrm{round}(w_v \cdot NV), \mathrm{round}(0.2\, NV), 6\big)$ and reset $w_v = j_v/NV$ to ensure that $v_0$ lies exactly on a grid point. When the input $NV$ is low and/or $v_{max} \gg v_0$, the above $w_v$ adjustment results in the last grid point falling short of $v_{max}$. In such cases we keep adding points using (29) until $v_j \ge v_{max}$, which means that the final grid size will be $NV^*$, with $NV^* \ge NV$. Typically, $NV^*$ will be up to 50% higher than the input $NV$.

(Footnote: If we wanted to place the strike in the middle between grid points, we would use $w_S = (i_K + 0.5)/NS$ instead.)
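The construction (28) is compact to implement. Here is a sketch of ours (the helper name `sinh_grid` is our own) that builds the $S$-grid with the $w_S$ adjustment so that the strike lands exactly on a grid point; it assumes $S_{min} = 0$, in which case $S_{i_K} = K$ exactly:

```python
import math

def sinh_grid(K, S_max, NS, c=4.5, S_min=0.0):
    """S-grid from (28), clustering points near K; w is adjusted so K = S[i_K]."""
    A = (S_max - K) / (K - S_min)
    w = math.log((A + math.exp(c)) / (A + math.exp(-c))) / (2 * c)
    i_K = round(w * NS)              # snap the clustering point onto the grid
    w = i_K / NS
    S = [S_min + K * (1 + math.sinh(c * (i / NS - w)) / math.sinh(c * w))
         for i in range(NS + 1)]
    return S, i_K

K, S_max, NS = 100.0, 800.0, 100
S, i_K = sinh_grid(K, S_max, NS)
assert abs(S[0]) < 1e-12 and abs(S[i_K] - K) < 1e-9
assert all(b > a for a, b in zip(S, S[1:]))       # strictly increasing
assert S[i_K + 1] - S[i_K] < S[-1] - S[-2]        # steps grow away from the strike
assert abs(S[-1] - S_max) / S_max < 0.05          # w-rounding shifts the far end slightly
```

After the $w_S$ rounding the far boundary no longer lies exactly at $S_{max}$; as the last assertion shows, the shift is small, and since it only enlarges or mildly reduces the buffer zone it is harmless in practice.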
An alternative construction that seems to have some advantage over the one just described is a hybrid one, with the narrow (but most important) zone around v uniformly spaced and the rest non-uniform. Haentjens & In 't Hout [12] propose one way of constructing such a grid in the S-direction for the solution of the Heston PDE. Here we use our first construction above as the base, to determine j_v and NV*. The segment (
0, 2v) is then made uniform with step Ξv_u = v/j_v. We then use the simple stretching function

v_j = s_nu·sinh(b_v·j/N_nu) / sinh(b_v),  j = 0, 1, …, N_nu,  N_nu = NV* − 2j_v + 1,  s_nu = v_max − 2v + Ξv_u,   (30)

to generate the non-uniform part (appended so that it continues from the end of the uniform segment), choosing b_v so that the first step is equal to Ξv_u. This is easily achieved with any one-dimensional root-finding method. The two v-grid constructions generally result in comparable performance, but when used with Richardson extrapolation (Sec. 2.4.3), the second (hybrid) variant is always preferable. We thus use the hybrid construction for all the numerical experiments of Sec. 4.

Whenever there is a discontinuity at some point in the initial conditions, it is usually a good idea to apply some sort of averaging for that point using the value(s) of adjacent point(s), effectively smoothing out the discontinuity (in this case located at the strike K) before solving the PDE; such discontinuities increase the solution error. To this end, here we simply replace the (zero) initial-condition values along the S = K line of the grid (remember we made sure that there is a grid point S_{i_K} on K) with a simple average over nearby space, as proposed in [13]. For vanilla options this amounts to setting InitCond_{i_K,j} = 0.25·h, for j = 0, 1, …, NV, where h = S_{i_K+1} − S_{i_K} for calls and h = S_{i_K} − S_{i_K−1} for puts.

Richardson extrapolation (RE) can significantly increase accuracy for many problems while adding only a small computational overhead. It simply involves calculating solutions based on two different grids (either spatial or temporal, usually with a grid-step size ratio of 2) and combining them so that the leading error term cancels.
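The payoff smoothing just described can be sketched as follows (a minimal sketch with NumPy; array and function names are ours, assuming the strike index i_K from the grid construction above):

```python
import numpy as np

def smoothed_call_payoff(S, K, i_K):
    """Vanilla call payoff max(S - K, 0) with the value at the strike node
    (where S[i_K] == K, so the raw payoff is 0) replaced by the local
    average 0.25*h, h = S[i_K+1] - S[i_K]; the same value is used for
    every j along the S = K line of the 2-D grid."""
    payoff = np.maximum(S - K, 0.0)
    payoff[i_K] = 0.25 * (S[i_K + 1] - S[i_K])  # for a put: 0.25*(S[i_K] - S[i_K-1])
    return payoff
```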
Here we apply it on the spatial level as follows: for a given resolution NS Γ NV, we first generate an (NS/2 Γ NV/2) grid and calculate an option price on it, V_coarse. Then, using the same grid parameters (a_S, b_S) and (a_v, b_v), we generate an (NS Γ NV) grid and use it to calculate V_fine. Given that our discretization is fully second-order in both S and v, we can then calculate the extrapolated price as V_RE = (4·V_fine − V_coarse)/3. Note that the fine grid will contain all the coarse grid's points and add new ones in between. It is important that the relative location of the strike K is the same for the two grids, as is indeed the case (both have points exactly on the S = K line). The main advantage is that while the computational cost increase is merely 25%, the accuracy is typically improved by 1-2 orders of magnitude (depending on the resolution used and the model parameters). RE works very well when sufficiently fine grids are used and not so well when the grids are too coarse (in which case it may well give worse accuracy than the single evaluation). This is because the premise for RE is that the two solutions are in the asymptotic range, i.e., that the observed order of convergence for the grids used is (very close to) the theoretical one. Down to the lowest resolution (NS Γ NV) = (40 Γ 20) used in our experiments, we've found RE to clearly outperform the single evaluation in terms of CE. RE is also less effective for 2-D and 3-D problems when non-uniform grids with different stretching functions for each dimension are used (as is the case here). We find that this is effectively countered by the use of the hybrid v-grid, which makes the grid in the v-direction uniform in the region of interest. This helps to regularize convergence, which in turn leads to improved RE performance. (To guarantee that the solution around v is always adequately resolved, we make sure that there is a minimum number of allocated grid points up to v: at least 20% of the total and no fewer than 6.)
Discontinuities and/or singularities in the initial or boundary conditions will also often cause the observed order of convergence to fall below the theoretical one (and make convergence overall erratic), again reducing the effectiveness of RE. If those can be treated, the convergence order is restored and RE performance improves. This is one more reason for applying the smoothing procedure described in Sec. 2.4.2.
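With a grid-step ratio of 2 and second-order spatial convergence, the extrapolation described above amounts to the classic combination below (a toy sketch with a synthetic O(h²) error term standing in for the PDE solver):

```python
def richardson_2nd_order(V_fine, V_coarse):
    """Combine prices from an (NS x NV) grid and an (NS/2 x NV/2) grid,
    assuming full second-order convergence: V(h) ~ V* + C*h**2."""
    return V_fine + (V_fine - V_coarse) / 3.0  # = (4*V_fine - V_coarse)/3

# toy check on a quantity with a pure O(h^2) error term
V_exact = 10.0
approx = lambda h: V_exact + 0.7 * h ** 2
V_re = richardson_2nd_order(approx(0.1), approx(0.2))  # leading error cancels
```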
3. Calibration
The main goal of the present work is to fit the GARCH diffusion model to a market of options. Some people choose to fit to option prices and others to the implied volatilities (IVs). We are strong proponents of the second approach.
IVs are a natural way to regularize a set of option prices, which can range from $0.05 to hundreds of dollars.
IVs are of the same order of magnitude across all the options. For SPX and other broad-based indices, using IVs will also weight more heavily the influence of deep out-of-the-money puts. Given that such options are a difficult regime for models (especially diffusions) to fit well, we like this property as well: it stresses an area where models have difficulty. Specifically, we try to fit the model to the option data by defining the following objective function to be minimized:
RMSE_IV = √[(1/N)·Σ_{i=1}^{N} (IV_i^model − IV_i^market)²] = f(v, v̄, ΞΊ, ξ, ρ; NS, NV, NT),   (31)

where N is the number of options we wish to include in the calibration. We calibrate to two SPX option chains, denoted Chain A and Chain B. Chain A used 246 SPX option quotes from Mar 31, 2017, filtered from quotes and IVs calculated by the CBOE. The data and notes for that are found in Appendix C. For the optimization we use tools available in popular software. We test two local optimizers: Excel's Solver tool, which is based on the Generalized Reduced Gradient (GRG) method, and
Mathematica's FindMinimum function, which is based on an interior point method. We also use
Mathematica's NMinimize function, based on the global optimization Differential Evolution algorithm. All routines accept constraints, which we impose on the model parameters in (31) so that they encompass all plausible values. To work with Excel and Mathematica, we build a dll exporting a function that returns the RMSE_IV, taking just the PDE engine's configuration as inputs. The function then reads the option chain data from a file, prices the options and evaluates (31). This is readily parallelized at the chain level, distributing the N options across all available CPU cores. We apply some basic load balancing, since resolution (and thus calculation time, roughly proportional to NS Γ NV Γ NT) may vary, as we discuss next. (This excludes options with very small market prices which, as described below, will usually be priced with higher resolution than the nominal one input for the calibration.) Obviously, NS, NV and NT (as well as the rest of the PDE engine's configuration, like choice of scheme, etc.) are kept constant throughout a calibration (except when a negative value is detected, as explained next). While a more customizable solution integrated with the PDE engine would likely be made to converge faster, we wish to keep things simple here and focus mostly on the PDE engine. Calling the function from Mathematica is trivial using .NET/Link. For options of different expirations to be priced with similar accuracy, we need the number of time-steps NT to increase with the expiration T. At the same time, the initial period of the valuation (close to the discontinuous initial conditions) always requires a minimum NT to be resolved adequately. We roughly satisfy these requirements by taking the nominal NT input in (31) to be the number of time steps per year for options with T > 1, i.e., we set NT_option = ⌈NT · max(T, 1)⌉.
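A sketch of one objective-function evaluation under this setup; the pde_price and implied_vol helpers are hypothetical stand-ins for the PDE engine and the IV inversion, not the paper's actual interfaces:

```python
import math

def nt_for_option(NT, T):
    # the nominal NT is interpreted as time steps per year for T > 1
    return math.ceil(NT * max(T, 1.0))

def rmse_iv(params, chain, pde_price, implied_vol, NS, NV, NT):
    """Objective (31): RMS difference between model and market implied vols.
    `chain` is a list of option records with expiry T and market IV."""
    sq = 0.0
    for opt in chain:
        price = pde_price(params, opt, NS, NV, nt_for_option(NT, opt["T"]))
        sq += (implied_vol(price, opt) - opt["iv_market"]) ** 2
    return math.sqrt(sq / len(chain))
```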
We also find it important to ensure that some minimum spatial resolution is used for the far out-of-the-money options in the chain, since those are more likely to incur higher relative pricing errors. More specifically, whenever the market value of an option is less than 0.5% of the asset spot S, (NS Γ NV)_min is set to (120 Γ 70), which is then gradually increased to (
400 Γ 100) for market values of 0.01% of S or lower. We've found these empirical choices lead to better efficiency in terms of obtaining more accurate fitted parameters faster. As was explained in Sec. 2.2, the present discretization allows for negative option values by design. In general, those occur when the resolution is too coarse (and thus the accuracy too low) and the correlation coefficient ρ strongly negative. In practice we found such occurrences relatively rare under reasonable resolutions. With a local optimizer, and when a previous result is used as the starting point, it is not unusual to complete a calibration involving tens of thousands of individual evaluations without a single negative value occurring. Such occurrences become even less frequent if one applies the transformation x = ln(S), i.e. discretizing and solving equation (3) instead of equation (2). When negative valuations do occur during a calibration, the implied volatility cannot be calculated. In such cases we simply repeat the failed option valuation using our most robust (but least efficient) configuration, which involves switching to equation (3) and using the BDF3 scheme. We do so repeatedly, if required, using gradually increasing resolution until a positive value is returned. This 'brute-force' approach can occasionally slow down a calibration, mostly when a global optimizer is used. On the other hand, it 'automatically' ensures that the option is priced accurately, which wouldn't be the case if we used restrictions on the grid steps and/or added some sort of artificial diffusion aiming for an M-matrix (as was discussed in Sec. 2.2). Finally, since a valuation may just happen to be positive at (S, v) (i.e., V > 0) but still go significantly negative in the vicinity (and thus be inaccurate overall), the naive check of V > 0 is not sufficient.
Instead, we check for negative values at all grid points within 10% of the strike in the S-direction and 50% of v in the v-direction, and discard any positive valuation V if a negative value of magnitude more than 1% of V is detected. In general, if the model is to produce a decent fit to the market IVs (and thus prices), then as the optimizer homes in on the optimum parameter set, the chances of a model price being that close to zero, and thus susceptible to this problem, are very low. Given our KBE PDE solver, the first and most obvious approach to evaluating (31) is indeed to price each of the N options separately, i.e., solve N PDEs for each objective function evaluation.
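The vicinity test just described might look as follows (a sketch with NumPy; the array names and the solution layout V[i, j] are our assumptions):

```python
import numpy as np

def accept_valuation(V, S, v, K, v0, price, rel_tol=0.01):
    """Reject a valuation if any solution value at grid points within 10%
    of the strike in the S-direction and 50% of v0 in the v-direction is
    more negative than rel_tol * price (i.e., 1% of the computed price)."""
    region = V[np.ix_(np.abs(S - K) <= 0.10 * K,
                      np.abs(v - v0) <= 0.50 * v0)]
    return bool(price > 0 and region.min() > -rel_tol * price)
```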
Each PDE is solved on a different grid based on
K, T and S, as described in Sec. 2.4.1. This is also the most general strategy, since it can be used if we want to include options other than vanillas in the calibration exercise; the PDE engine could easily handle American or barrier options, for example, at no extra cost. We shall refer to this as Approach I hereafter. Our purpose here though is to calibrate to vanilla options, in which case we can make use of the scaling (MAP) property introduced in Sec. 1.1. This means that only one PDE solution is sufficient to provide all option prices (for all different strikes) in an expiration bucket T_k, k = 1, …, N_E, where N_E is the number of different expirations included in the calibration. We choose to price one put option per T_k, and in particular the one that is furthest out-of-the-money (with the lowest strike price K). (We include puts that are further out-of-the-money than the calls.) This put necessitates an S-grid with the highest required S_max/K ratio for our data (see Sec. 2.4.1.1); it also ensures there are enough grid points where necessary for smooth higher-order interpolation of the calculated option price at (S, v), and helps to avoid negative values by enforcing some minimum accuracy. The prices of the rest of the puts are then readily found via the scaling relation and interpolation on the S-grid. The call prices are obtained from the put price at the corresponding scaled S via put-call parity. Overall, this strategy (henceforth referred to as Approach II) requires N_E PDE solutions for one evaluation of (31), and the actual N option prices are extracted from those. Naively, one might expect this to result in N/N_E times faster computation compared to Approach I (which would represent a 30-fold increase in the case of a chain with 240 options and 8 distinct expirations, for example). In the previous section we described why for some options we want to use higher spatial resolution (NS Γ NV).
Each T_k bucket may include one or more such options, seen as weak links in terms of computational efficiency. With Approach I this means that about 80% of the N PDEs can be solved very fast on coarse grids, while the rest (a few options in each T_k bucket) will be priced on a much finer grid and take longer. Allocating 240 PDE solvers across different CPU cores amounts to reasonable medium-grain parallelism and allows for decent load-balancing. With Approach II, on the other hand, these advantages are lost: if we want all the extracted option prices to be of equivalent accuracy to when calculated individually (as in Approach I), we need to account for the weakest link in each case. If each T_k includes at least one option that requires a fine(r) grid, it follows that all N_E PDE solvers need to use some overridden (high) resolution (as per the previous section). This also means that we may well have more CPU cores available than parallel tasks, say N_E = 8 and N_cores = 10, leaving some of the processing power unutilized. It may also lead to bad load balancing if, for example, one solver uses higher resolution than the rest. For this reason, we lower the maximum enforced (NS Γ NV)_min from (400 Γ 100) for Approach I to (200 Γ 80) for Approach II. The lesser accuracy that this implies for the few deep out-of-the-money options is offset by the fact that now all option prices are extracted from fine grids (as opposed to only a few with Approach I). As we will see next, this leads to calibrated model parameter accuracy as high as under Approach I. For the ADI schemes there is an alternative strategy, and that is to parallelize at the PDE solver level. This is possible because grid lines can be updated simultaneously during both the explicit and implicit steps. Our brief tests with 8 or 10 cores/threads show that even with basic OpenMP instructions, a parallel efficiency of 80% is readily achievable this way, resulting in similar calibration times to our main, chain-level parallelization approach.
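The Approach II extraction step can be sketched as follows, assuming the degree-1 homogeneity of vanilla prices in (S, K) behind the MAP property, a solved put-price curve on the S-grid for the lowest strike K0, and put-call parity for the calls (helper names are ours; np.interp is linear and shown for brevity, whereas the text uses smooth higher-order interpolation):

```python
import math
import numpy as np

def put_from_scaling(S_grid, P_K0, K0, K, S0):
    """Homogeneity P(lam*S, lam*K) = lam*P(S, K) gives
    P(S0, K) = (K/K0) * P(S0*K0/K, K0): interpolate the solved
    strike-K0 put curve at the scaled spot and rescale."""
    return (K / K0) * np.interp(S0 * K0 / K, S_grid, P_K0)

def call_from_put(P, S0, K, r, q, T):
    """Put-call parity: C = P + S0*exp(-q*T) - K*exp(-r*T)."""
    return P + S0 * math.exp(-q * T) - K * math.exp(-r * T)
```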
4. Numerical experiments and results
We now analyze the performance of the PDE engine and calibrator and present results using our two sample SPX option datasets: the 246-option Chain A, representing the low-volatility market of 2017, and the 68-option Chain B, from the higher-volatility environment of 2010. Chain A data are given in Appendix C. The timings were taken on a 10-core Intel i9-7900X PC; the code was written in C++ and compiled in VS2013. We perform most of our tests using Approach I of the previous section (i.e. pricing each option separately). We do so since it is obviously preferable to assess the behavior/performance of the PDE engine over a sample of 246 or 68 individual option valuations, rather than just 8 or 7.
The ADI methods, as expected, prove to be more efficient than the BDF3 scheme and can be depended upon for successful and fast calibrations. At the individual option pricing level though, we found they are less robust, as they are not immune to spurious oscillations, mostly in the delta and gamma of the solution. The most likely offender is the HV1 scheme (which also proves to be the most efficient overall). We found this problem much rarer with the MCS scheme for the vanilla payoffs we're dealing with here. (Wyns [9] shows that such problems are more common when the MCS scheme is used to price cash-or-nothing options, where the discontinuity in the initial conditions is much more severe.) The fully implicit BDF3 scheme demonstrates superior damping properties; we have been unable to reproduce a single case of this problem in extensive testing. In practice, with all reasonable (NS, NV, NT) the ADI schemes are oscillation-free as well. (We will see below that a resolution of (NS Γ NV) = (60 Γ 30) is more than enough to obtain accurate IVs for the bulk of the options that are not deep out-of-the-money.) Even when trying to force the issue (using some low NT/NS), our tests show that any such mild oscillations present in the individual solutions do not usually prevent optimizer convergence (but they do slow it down). We note that this is a more general feature: the less accurate the PDE solution is, the more steps are usually required for the optimizer to converge. Using too low an NT, for example, will cause the optimizer to 'hunt' more; conversely, application of (spatial) Richardson extrapolation typically means convergence in fewer steps than when simple (non-extrapolated) solutions are used. The other relevant problem is that of the occasional negative values. The use of equation (3) results in non-negative solutions under moderate spatial grid resolutions in cases where this is impossible with equation (2) and any reasonable resolution. Moreover, we find that in such cases the BDF3 scheme does a better job than the ADI schemes. As an example, we mention in passing a particularly difficult case that arose during a calibration with the global optimizer. In that case NT = 20 was enough for the BDF3 scheme to produce a smooth, non-negative price profile near the strike, whereas the ADI schemes needed NT > 20000 to achieve the same result (with the same spatial discretization).
In terms of the optimizers we tested, we found Excel's Solver tool to be the best choice. The main reason is that it benefits from a good initial guess, whereas Mathematica's FindMinimum does not; the Solver will typically converge 2.5 to 5 times faster when the starting vector is not too far from the optimum. Despite both being local optimizers, we are confident that they can be used to find the true (global) minimum: extensive testing using many different starting points shows that both converge to the same vector. Using Mathematica's NMinimize global optimization routine further confirmed the solution in every case we tested. NMinimize also served as a torture test for the PDE engine, exploring all corners of the parameter space. No scheme ever failed to produce a valid price, though of course 'difficult' parameter sets (and insufficient resolution) generally trigger the repricing mechanism. Even so, the total optimization time is not significantly affected in practice. The allowed parameter ranges we used for the tests were: … ≤ 0.50, …, −0.95 ≤ ρ ≤ 0, which cover most market scenarios.

To test the convergence behavior of the time-marching schemes, we fix the spatial resolution to (NS Γ NV) = (60 Γ 30) and calculate (time-converged) benchmark prices using the BDF3 scheme with NT = 12800. We also apply spatial Richardson extrapolation. This way the spatial discretization error is low, but not negligible compared to the temporal error. Nonetheless, the two errors are found to be only weakly dependent, allowing the comparative performance of the schemes to be properly assessed. The prices are obtained via an objective function evaluation under Approach I, which means that every option in the chain is priced individually and under the resolution-overriding/repricing rules described in Sec. 3. The pricing errors for various NT are calculated as the differences from the benchmark prices, and the RMSE is used as an indicator of the overall performance of each scheme. Figure 1 shows the results for the HV1, HV2, MCS and BDF3 schemes, plus the HV1 scheme with Rannacher time-stepping (hereafter referred to as HV1D). The points on the left correspond to practical NT (and CPU times), while those on the right are included to better illustrate the asymptotic behavior. We plot the relative (as opposed to absolute) pricing errors, since those are more closely related to the errors in the implied volatilities and consequently the calibrated model parameters. The HV1, HV2 and MCS schemes display a linear relationship between RMSE and CPU time on the logarithmic scale. This reflects (and confirms) their theoretical second-order convergence and the fact that their execution time is proportional to NT. The Implicit Euler damping step (which requires an expensive factorization of the full system matrix) introduces an upfront cost that lowers the efficiency of the HV1D scheme. The irregular first two points from the left of the HV2 curve for Chain A are an example of the repricing mechanism in action: the scheme's accuracy is too low here, causing some options in the chain to fail the 'negative values test' (which are then revalued with a different configuration).
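A standard check that such solutions are in the asymptotic range (cf. the RE discussion in Sec. 2.4.3) is to estimate the observed temporal order from three nested resolutions; this is a generic sketch, not code from the paper:

```python
import math

def observed_order(V_NT, V_2NT, V_4NT):
    """Estimate the convergence order p from prices computed with NT, 2*NT
    and 4*NT time steps, assuming V(NT) ~ V* + c / NT**p."""
    return math.log2((V_NT - V_2NT) / (V_2NT - V_4NT))

# toy check with a synthetic second-order sequence
V = lambda NT: 5.0 + 3.0 / NT ** 2
p = observed_order(V(25), V(50), V(100))
```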
Finally, for the BDF3 scheme we have an upfront cost that, just like with HV1D, is due to the initial matrix factorizations, resulting in significantly reduced efficiency for practical NT. At large NT this effect is diluted, and the scheme is seen to confirm its theoretical third-order convergence. (The composite RE solution combines solutions on the (NS Γ NV) and (NS/2 Γ NV/2) grids. The rightmost points of the error curves in Figure 1 correspond to NT = 1600 for the ADI schemes and NT = 400 for the BDF3.)

Figure 1. Computational efficiency of time-marching schemes. RMS relative temporal error vs CPU time for the pricing of Chain A (246 options, left) and Chain B (68 options, right). Model parameters from Table 1.
Overall, the HV1, HV1D and MCS schemes offer comparable performance. Based on the present and many more similar tests, the HV1 scheme may well be the best choice. Under typical usage, any spurious oscillations do not seem to significantly affect its calibration performance. Adding damping comes at a mild cost overall, and the HV1D scheme in fact produces lower (relative) errors for Chain B. The MCS scheme is still the most robust of the considered ADI schemes and usually about as accurate as HV1/HV1D, but it is slightly more computationally expensive per time-step. The HV2 scheme is the worst-performing and we will not consider it further. Finally, the BDF3 scheme perhaps fares better than expected, beginning to outperform the ADI schemes for relative errors below around 0.02% in the case of Chain A. It is interesting to note that all temporal discretization errors are about an order of magnitude smaller for Chain B, making the BDF3 scheme less competitive in this case. Overall, we (unsurprisingly) find it cannot match the best ADI schemes' efficiency for practical accuracy goals.

Calibration (Approach I)
We now test the performance of the PDE engine in terms of the end result: the calibrated model's parameters. Benchmark values are given in Table 1, calculated using the BDF3 scheme and a resolution of (NS Γ NV Γ NT) = (800 Γ 200 Γ 50) with spatial RE.

Table 1. Benchmark calibrated model parameters.

                                    v          v̄         ΞΊ        ξ        ρ
  Option chain A (Mar 31, 2017)   0.010935   0.039139   5.3905   6.8997   −0.74579
  Option chain B (Feb 1, 2010)    0.044816   0.088529   3.6695   5.0333   −0.79206
Figure 1 suggests that the HV1 scheme is more efficient than the MCS scheme for Chain A, while the two schemes perform very similarly for Chain B. This translates to the calibrated model parameters. Table 2 shows that the parameters obtained for Chain A with the HV1 scheme are more converged for any NT than those obtained with the MCS scheme, and the CPU time required for the calibration is lower as well. We note that in this case the BDF3 scheme with NT = 25 gives about the same accuracy as the HV1 scheme with NT = 200, and requires the same time as well. (The HV1D absolute errors (not shown) are also slightly lower than those of the non-damped HV1, for both Chains A and B. This may indeed suggest the presence of mild oscillations with the HV1 scheme (resulting in lowered accuracy) that are successfully suppressed by the added damping of the HV1D scheme. We've found cases where the Euler damping (Rannacher time-stepping) procedure of the HV1D scheme reduces but doesn't eliminate spurious oscillations, whereas for the same cases the MCS scheme is oscillation-free, even without damping. In contrast, we have not seen any cases where the reverse is true.) Table 3 confirms the two ADI schemes produce similar accuracy for the Chain B parameters. The latter are also more converged for all NT compared to the Chain A parameters in Table 2, reflecting the lower overall errors on the right plot of Figure 1. As in Sec. 4.2, we used a fixed spatial resolution of (NS Γ NV) = (60 Γ 30) with (spatial) RE and obtained time-converged calibrated model parameters using the BDF3 scheme with NT = 400.

Table 2. Model parameter convergence for calibrations to Chain A (246 options) as a function of temporal resolution for the HV1, MCS and BDF3 schemes. The exact (time-converged) parameter set is (v, v̄, ΞΊ, ξ, ρ) = ( ).

                HV1                                        MCS                                        BDF3
  NT        25         50         100        200         25         50         100        200         25         50
  ρ      −0.73655   −0.74338   −0.74511   −0.74554    −0.73753   −0.74197   −0.74476   −0.74545    −0.74586   −0.74570
  CPU (mm:ss)
Table 3. Model parameter convergence for calibrations to Chain B (68 options) as a function of temporal resolution for the HV1, MCS and BDF3 schemes. The exact (time-converged) parameter set is (v, v̄, ΞΊ, ξ, ρ) = ( ).

                HV1                                        MCS                                        BDF3
  NT        25         50         100        200         25         50         100        200         25         50
  ρ      −0.79120   −0.79205   −0.79226   −0.79232    −0.79121   −0.79204   −0.79226   −0.79231    −0.79227   −0.79233
  CPU (mm:ss)
A careful inspection of the sequence of parameters obtained with the two ADI schemes with increasing (doubling) NT reveals very close to second-order convergence for NT = 50 and above, indicating that the theoretical order of the time-discretization translates to the 'functional' of the computation, i.e., the model parameter vector. This suggests the possible use of (temporal) Richardson extrapolation on the fitted parameters. The effectiveness of such an approach is demonstrated in Table 4, where the parameter vectors have been obtained as the composite (extrapolated) result of two successive calibrations using slightly different NT. The vector from the first calibration can be used as the starting point for the second, reducing the time of the latter. Comparing for example the NT = 100 calibrations in Table 2 with the (composite)
NT = (40, 60) calibrations in Table 4 for Chain A, one can see that the parameters obtained with the latter are significantly more converged, while the CPU times are lower as well. As always with RE, care should be taken not to use too low a resolution (in this case NT); to guarantee good RE performance we recommend using NT_coarse ≥ 30 and NT_fine/NT_coarse ≥ 1.2. (The ADI schemes may also not always sufficiently damp oscillations when NT/NS is too low, possibly leading to erratic convergence and thus erratic RE results; this is mainly an issue for the HV1 scheme.) The effect of spatial RE (applied on the PDE solution, i.e. the option prices, as described in Sec. 2.4.3) on the calibrated parameters is shown in Tables 5 & 6. Here we use the BDF3 scheme with NT = 50 so that the temporal discretization error is negligible. The benchmark parameters are from Table 1. Despite the resolution being low, the parameters obtained with RE are not far from the benchmark values. (Unless otherwise noted, all calibration CPU times listed in the present work are obtained with Excel's Solver starting from the average-values vector (v, v̄, ΞΊ, ξ, ρ) = (0.05, 0.05, 5, 5, −0.7).) For Chain A they are practically converged to 4 digits (the error is at most one point off in the fourth digit), while for Chain B they are somewhat less converged. In both cases the parameters obtained without the use of RE are significantly less accurate.

Table 4. Model parameter convergence using the ADI schemes and temporal Richardson extrapolation on the fitted parameters; (NS Γ NV) = (60 Γ 30) with spatial Richardson extrapolation. The exact (time-converged) parameter set is (v, v̄, ΞΊ, ξ, ρ) = (0.010937, 0.039131, 5.3914, 6.9010, −0.74567) for Chain A and (v, v̄, ΞΊ, ξ, ρ) = ( ) for Chain B.

                 Chain A (246 options)                       Chain B (68 options)
                 HV1                   MCS                   HV1                   MCS
  NT
  ρ          −0.74571   −0.74569   −0.74567   −0.74568   −0.79233   −0.79233   −0.79231   −0.79233
  CPU (mm:ss)  05:07      09:07      07:07      11:20      00:52      01:08      01:00      01:20

Table 5.
Effect of spatial Richardson extrapolation on model parameter convergence for Chain A.
  Spatial resolution               v          v̄         ΞΊ        ξ        ρ
  (NS Γ NV) = 60 Γ 30            0.010914   0.039232   5.2805   6.8705   −0.74333
  (NS Γ NV) = 60 Γ 30 w/ RE      0.010937   0.039131   5.3915   6.9009   −0.74570
  Benchmark                      0.010935   0.039139   5.3905   6.8997   −0.74579

Table 6.
Effect of spatial Richardson extrapolation on model parameter convergence for Chain B.
  Spatial resolution               v          v̄         ΞΊ        ξ        ρ
  (NS Γ NV) = 60 Γ 30            0.044811   0.088663   3.6064   5.0181   −0.79212
  (NS Γ NV) = 60 Γ 30 w/ RE      0.044815   0.088479   3.6727   5.0330   −0.79233
  Benchmark                      0.044816   0.088529   3.6695   5.0333   −0.79206

So far, we have presented calibration tests where every option in the chain is priced separately. These tests demonstrated the efficiency of the PDE engine; calibrations using this approach are already quite fast. We now test Approach II for the objective function evaluation, i.e., making use of the MAP property. Following the winning combination of Table 4, we choose the HV1 scheme, a resolution of (NS Γ NV) = (60 Γ 30) with spatial RE, as well as temporal RE on the fitted parameters (combining the results of two successive calibrations) with NT = (30, 36). (While this nominal resolution is used by most of the PDE solvers within Approach I, in the case of Approach II all (7 or 8) solvers really use higher resolution, as explained in Sec. 3.1.) As expected, Table 7 confirms that the speed-up compared to Approach I is significant, especially for the larger Chain A. The parameter accuracy is at least as good. It is obvious that this approach works well in practice and effectively decouples the calibration time from the total number of options included. For either of our datasets, a CPU time of less than a minute is needed to achieve a maximum relative (numerical) error of 0.05% for the obtained parameters. (We note that the code was developed on an older CPU (4-core Intel i7-920, 2009) and not the 10-core i9-7900X CPU used for the timings reported here. The MAP performance gains on the development CPU are almost double (6Γ) those of the newer CPU, and the maximum calibration time still around 2 mins.) We stress that a judicious S-grid construction (low to moderate non-uniformity) is key to keeping the solution error profile low across the moneyness spectrum and making this approach work. As we already mentioned in Sec.
4.1, lower overall pricing accuracy leads not only to less accurate parameters but also (perhaps more importantly) to slower convergence of the optimizer. Table 7.
Comparison of pricing Approach I (one PDE solution per option) and Approach II (one PDE solution per expiration). The relative errors (compared to the benchmark parameters of Table 1) are shown in parentheses.
Chain A (246 options) Chain B (68 options)
                 Approach I           Approach II          Approach I           Approach II
  ρ            −0.74569 (0.01%)    −0.74581 (0.00%)    −0.79233 (0.03%)    −0.79209 (0.00%)
  CPU (mm:ss)    00:03:00             00:00:55             00:00:52             00:00:40

We now present detailed calibration results demonstrating the ability of the GARCH diffusion model to fit the option market, and compare it to the popular Heston model. The Heston calibrations were performed using the present PDE engine, and the resulting parameter vectors were then confirmed with independent calibrations using pricing via well-known Fourier integral representations. Figures 2 and 3 illustrate the market fit for each option expiration bucket in Chains A and B respectively. As a first remark, we can say that the model is indeed able to capture smile behavior in the short end and achieve an overall decent fit. (By 'smile behaviour' we mean that the IV curve has an evident minimum.) This can be seen in the first two plots in Figure 2, but not in Figure 3 (the Heston model can be seen to capture the smile in both cases). The reason is that the strike point K* (where the GARCH diffusion model's IV curve turns up) for the first two expirations in Chain B lies further to the right of the last market point included in the plot (at around K = 1225). Overall, both sample calibrations seem to indicate that the Heston model is more 'flexible', managing to fit the data better overall; the GARCH diffusion model fits, on the other hand, look more 'rigid'. This is somewhat surprising and in contrast with the findings of Christoffersen et al. [1]. While our two-chain data set is tiny compared to theirs, we suspect the contrast in findings is due to our much wider (smile) moneyness coverage, as no downside puts are used in [1]. The apparent Heston model victory here comes with known problems. The very small obtained Feller ratios, R ≡ 2ΞΊv̄/ξ², (0.12 and 0.29) are well below one. Note that under S.
Heston's 1993 model [14], R is the same under both P (physical) and Q (risk-neutral) model evolutions. In our experience, the Heston P-model estimates (from time series, using maximum likelihood) will typically have R > 1.
There are some caveats to complaining about R: for either model, P-model parameter estimates are not trivially obtained because the latent volatility must either be proxied or jointly estimated. In addition, P-model time series estimates are typically quite sensitive to the inclusion or not of crash days like Oct. 19, 1987. To trivially adapt the PDE engine to the Heston pricing PDE: adjust the v-diffusion and mixed derivative coefficients in (17) and (20) accordingly. The only other change required is the choice of v_max. We believe the "hitting range", R ≤ 1, should be admitted in calibrations. However, once you find the optimal Q-estimate ratio in the hitting range, you are forced to ponder the implications. Indeed, the volatility distribution then develops an integrable divergence at zero, v = 0 becomes the most probable value, and repeated volatility hits on zero become possible. If the P-evolution model also has a Feller ratio in the hitting range, arbitrage opportunities develop, at least in the idealized continuous-time world in which the models are constructed. Yet, finding a smile-calibrated Feller ratio in the hitting range is well-known to be a common occurrence [15].

Figure 2. GARCH diffusion and Heston model implied volatilities by expiry for Chain A. The GARCH diffusion model parameters are given in Table 1. The parameters for the Heston model are v₀ = 0.007316, v̄ = 0.03608, κ = 6.794, ξ = 2.044 and ρ = -0.7184. GARCH diffusion RMSE IV = 1.68%, Heston RMSE IV = 1.28%. Heston Feller ratio = 0.12.

To purposely exaggerate the Feller ratio issue, we also fitted only the two shortest expirations in Chain A. Despite seemingly achieving a decent fit (Figure 4), the Heston model fitted v₀ is practically zero and the Feller ratio is 0.05. The GARCH diffusion model, on the other hand, achieves a closer fit with mostly reasonable parameters and a volatility process that doesn't hit zero.
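The Feller ratio quoted throughout is simple arithmetic; a one-line helper (ours, not part of the paper's code base) reproduces the figures above from the fitted Heston parameters:

```python
def feller_ratio(kappa, vbar, xi):
    """Feller condition ratio R = 2*kappa*vbar / xi**2.
    R >= 1 keeps the square-root (Heston) variance process away from v = 0;
    R < 1 is the 'hitting range' discussed in the text."""
    return 2.0 * kappa * vbar / xi ** 2

# Fitted Heston parameters from Figures 2 and 3:
print(round(feller_ratio(6.794, 0.03608, 2.044), 2))  # Chain A -> 0.12
print(round(feller_ratio(4.905, 0.06862, 1.525), 2))  # Chain B -> 0.29
```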
Nevertheless, our calibrated volatility-of-volatility ξ-values for the GARCH diffusion are also likely "too high", relative to typical P-estimates. In the case of Chain B (Figure 5) the Heston model does a slightly better (if oscillatory) job, following the smile correctly (with the model K* very close to the market K* at around K = 1140), while the GARCH diffusion model K* is about 1180.

[Figure 2 panels: implied volatility vs. strike K by expiration (T = 21 through 994 days); each panel shows Market, GARCH and Heston curves.]

Figure 3. GARCH diffusion and Heston model implied volatilities by expiry for Chain B. The GARCH diffusion model parameters are given in Table 1. The parameters for the Heston model are v₀ = 0.04576, v̄ = 0.06862, κ = 4.905, ξ = 1.525 and ρ = -0.7131. GARCH diffusion RMSE IV = 1.19%, Heston RMSE IV = 1.01%. Heston Feller ratio = 0.29.
[Figure 3 panels: implied volatility vs. strike K by expiration (T = 19 through 1054 days); each panel shows Market, GARCH and Heston curves.]

Figure 4. GARCH diffusion and Heston model implied volatilities when fitted only to the first two expiries in Chain A. The GARCH model parameters are v₀ = 0.008046, v̄ = 0.02981, κ = 10.93, ξ = 15.06 and ρ = -0.5669. The Heston model parameters are v₀ = 10⁻⁶, v̄ = 0.02, κ = 40.64, ξ = 5.65 and ρ = -0.623. GARCH diffusion RMSE IV = 0.61%, Heston RMSE IV = 0.86%. Heston Feller ratio = 0.05.

Figure 5. GARCH diffusion and Heston model implied volatilities when fitted only to the first two expiries in Chain B. The GARCH model parameters are v₀ = 0.03836, v̄ = 0.07067, κ = 11.96, ξ = 7.685 and ρ = -0.6871. The Heston model parameters are v₀ = , v̄ = 0.06011, κ = 21.3, ξ = 2.604 and ρ = -0.6637. GARCH diffusion RMSE IV = 0.64%, Heston RMSE IV = 0.59%. Heston Feller ratio = 0.38.

To summarize: at the outset, we knew both models have their nice properties and their issues. In particular, neither model handles extreme moves, in either the asset price or its underlying volatility, naturally or well. However, although our dataset is very small, we were still surprised to see the Heston model achieve the better smile fits. Both models we've calibrated so far are subcases of the more general power-law SV model

    dS_t = (r_t - q_t) S_t dt + √v_t S_t dW_t^s,    (32)
    dv_t = κ (v̄ - v_t) dt + ξ v_t^p dW_t^v.

The popular affine Heston model (p = 0.5) has some well-known limitations, such as: (a) inability to capture steep short-term volatility smiles, (b) instability of the fitted parameters over recalibrations and (c) incompatibility of fitted parameters with those estimated from the P-world. The GARCH diffusion model (p = 1) has received less attention in the literature, but our tests here indicate that it too suffers from (a). Regarding (b) and (c), more tests would be needed; it may well be that GARCH produces more stable parameters than Heston, for example.
In an effort to address these issues, researchers have introduced a variety of ideas. Again, naming some: (i) moving away from affine models, using the more general p-model above (or other two-factor diffusion variations), (ii) randomizing the (latent) spot variance v₀, (iii) adding jumps, (iv) adding more stochastic factors, (v) using fractional Brownian motion as the volatility driver.

[Figure 4 panels: implied volatility vs. strike K for T = 21 and 49 days (Market, GARCH, Heston). Figure 5 panels: implied volatility vs. strike K for T = 19 and 46 days (Market, GARCH, Heston).]

Among these extensions, (i) and (ii) are particularly simple to add to the present framework, so we will briefly explore them here. Our PDE engine can easily solve for either the GARCH diffusion, Heston, or any model "in between", i.e., with p between 0.5 and 1. It is in fact straightforward to let p float and have the optimizer decide its optimal value. As Figures 2 and 3 show, a value of p = 0.5 corresponds to more "flexible" fits, whereas p = 1 corresponds to more "rigid" ones. Intermediate values of p predictably yield fits that look like a mix between the two. A low value of p reduces the RMSE IV by locally enabling enough curvature to emulate the short-term market smiles; on the other hand, it increases the error as the (overly) flexible behavior persists into the middle expirations (the Heston case). Overall though, at least for our datasets, lower RMSE IV is achievable with lower values of p. Our tests found an optimum p of 0.62 for Chain A (RMSE IV = 1.23%), while for Chain B we found p = 0.59 (RMSE IV = 1%). As a result, the improvement in the fit was not significant over that of the Heston model (a relative RMSE IV reduction of 1%-4%). This may not be telling the whole story though. Let's accept that both Heston and GARCH diffusion are essentially misspecified for the task. Because of that, the optimizer is forced into unnatural parameters (e.g. unrealistically low v₀ and/or high ξ) to adapt to the observed short-term smiles. If those two models are misspecified, so is the general p-model. Our conclusion is that any benefit from choosing a particular value of p for the general p-model would be best assessed with some sort of short-term-smile-enabling extension in place, leaving the diffusion model more freedom to fit the rest of the expirations. Normally this would involve adding jumps, but an easier route is available through randomization; see Mechkov [16] or Jacquier & Shi [17].
While we find the dynamic rationale of this approach perhaps not entirely clear, we also found that it does succeed in "turning on" (or up) the short-term smiles. Since, again, the changes needed in our code to add this feature were minimal, we gave it a try. Table 8 presents summary RMSE IV results for Chain A. As we noted above, varying p when the model lacks the ability to match the short-term smile makes little difference. But when the smile is better accounted for (here via the randomization), the optimal p-model provides a more substantial improvement in the calibration fit over both the Heston and the GARCH diffusion model. This case is presented in more detail in Appendix D. We finally add that other non-affine two-factor variations can also easily be accommodated. As an example, we calibrated the Inverse Gamma SV model introduced by Langrené et al. [15]. As can be seen in Table 8, it placed between the Heston and GARCH diffusion models in terms of overall quality of fit (RMSE IV for Chain B was 1.11%). We also note that the fitted v₀ and v̄ for this model were quite close to each other for both datasets (not shown). Table 8.
Calibration fit for Chain A under variations of pure diffusion models.

Model                              RMSE IV
Power law models:
  Heston                           1.28%
  GARCH diffusion                  1.68%
  Optimal p-model                  1.23%
  Heston - Randomized              0.96%
  GARCH diffusion - Randomized     0.94%
  Optimal p-model - Randomized     0.79%
Inverse Gamma Vol model            1.53%
5. Conclusions
In this work we present a first (to our knowledge) full option calibration of the GARCH diffusion model using a PDE approach. The calibration is very fast and accurate (less than a minute on a modern PC), ameliorating the lack of a closed-form solution. This is accomplished with the use of an efficient yet "ordinary" second-order finite difference based PDE engine. While here we calibrate to European vanilla options, the same pricing engine can be used with only minor modifications for fast calibrations to other types of options that can easily be handled in the PDE setting, e.g. American options or barriers. Other similar models can also easily be accommodated, as our brief experiments of Sec. 4.6 show. In a small test with two SPX option chains, the smile fits with the GARCH diffusion model were inferior to the fits from the Heston '93 model. This differed from some prior literature such as [1]. The Heston fits come with very low values of the so-called Feller condition ratio, which leads to other issues. Nevertheless, we were surprised. Our more general contribution is showing that closed-form solutions need not be such a strong criterion for model selection. Similar PDE engines can potentially handle various related models that are being largely ignored in practice, and therefore allow a more informed choice for a particular trading area. For the future, it would be quite interesting to extend the solver to one that could handle bivariate jump-diffusions with similar high efficiency. Finally, the second author (Lewis) would like to stress that the first author (Papadopoulos) has done all the heavy lifting here: developing and implementing all the C/C++ solvers and their Excel and Mathematica interfaces.

References

[1]
P. Christoffersen, K. Jacobs and K. Mimouni, "Volatility dynamics for the S&P500: Evidence from realized volatility, daily returns, and option prices," Review of Financial Studies, vol. 23, no. 8, pp. 3141-3189, 2010.
[2] R. Zvan, P. Forsyth and K. Vetzal, "Negative Coefficients in Two Factor Option Pricing Models," Journal of Computational Finance, vol. 7, no. 1, pp. 37-73, 2003.
[3] K. J. in 't Hout and S. Foulon, "ADI finite difference schemes for option pricing in the Heston model with correlation," International Journal of Numerical Analysis and Modeling, vol. 7, no. 2, pp. 303-320, 2010.
[4] Y. Aït-Sahalia and R. Kimmel, "Maximum likelihood estimation of stochastic volatility models," Journal of Financial Economics, vol. 83, pp. 413-452, 2007.
[5] S. Ikonen and J. Toivanen, "Efficient numerical methods for pricing American options under stochastic volatility," Numerical Methods for Partial Differential Equations, vol. 24, no. 1, pp. 104-126, 2008.
[7] K. J. in 't Hout and J. Toivanen, "Application of Operator Splitting Methods in Finance," arXiv:1504.01022 [q-fin.CP], 2015.
[8] K. J. in 't Hout and C. Mishra, "Stability of the modified Craig-Sneyd scheme for two-dimensional convection-diffusion equations with mixed derivative term," Mathematics and Computers in Simulation, vol. 81, no. 11, pp. 2540-2548, 2011.
[9] M. Wyns, "Convergence analysis of the Modified Craig-Sneyd scheme for two-dimensional convection-diffusion equations with nonsmooth initial data," IMA Journal of Numerical Analysis, vol. 37, no. 2, 2017.
[10] M. Vinokur, "On One-Dimensional Stretching Functions For Finite Difference Calculations," NASA Contractor Report 3313, Santa Clara, California, 1980.
[11] D. Tavella and C. Randall, Pricing Financial Instruments: The Finite Difference Method, New York: Wiley, 2000.
[12] T. Haentjens and K. J. in 't Hout, "ADI schemes for pricing American options under the Heston model," Applied Mathematical Finance, vol. 22, no. 3, pp. 207-237, 2015.
[13] D. M. Pooley, K. R. Vetzal and P. A. Forsyth, "Convergence remedies for non-smooth payoffs in option pricing," Journal of Computational Finance, vol. 6, no. 4, pp. 25-40, 2003.
[14] S. L. Heston, "A Closed-form Solution for Options with Stochastic Volatility with Application to Bond and Currency Options," Review of Financial Studies, vol. 6, no. 2, pp. 327-343, 1993.
[15] N. Langrené, G. Lee and Z. Zili, "Switching to non-affine stochastic volatility: A closed-form expansion for the Inverse Gamma model," arXiv:1507.02847v2 [q-fin.CP], 2016.
[16] S. Mechkov, "'Hot-start' initialization of the Heston model," Risk, November 2016.
[17] A. Jacquier and F. Shi, "The randomised Heston model," arXiv:1608.07158v2 [q-fin.PR], London, 2017.
[18] Chicago Board Options Exchange, "VIX white paper," 2010.
[19] E. Çinlar, "Markov additive processes I, II," Z. Wahrscheinlichkeitstheorie verw. Geb., vol. 24, pp. I:85-93, II:93-121, 1972.
[20] T. Bollerslev and P. E. Rossi, "Dan Nelson Remembered," Journal of Business & Economic Statistics, vol. 13, no. 4, pp. 361-364, 1995.
[21] D. W. Peaceman and H. H. Rachford, "The numerical solution of parabolic and elliptic differential equations," Journal of the Society for Industrial and Applied Mathematics, vol. 3, no. 1, pp. 28-41, 1955.
[22] J. Douglas and H. H. Rachford, "On the numerical solution of heat conduction problems in two and three space variables," Transactions of the American Mathematical Society, vol. 82, pp. 421-439, 1956.
[23] R. Rannacher, "Finite element solution of diffusion problems with irregular data," Numerische Mathematik, vol. 43, pp. 309-327, 1984.
Appendix A – Market prices of risk and a compatible real-world evolution
We have performed option chain calibrations after postulating that the risk-neutral (aka Q-measure) evolution has the GARCH diffusion form (1). Jumping immediately to a risk-neutral model is a common finance short-cut. More carefully, even given a target Q-model, one should begin with a compatible real-world (P-measure) evolution, and then move to the desired Q-measure evolution by a Girsanov transformation. In the presence of a deterministic stock dividend yield q_t, a P-measure evolution compatible with a Q-measure GARCH diffusion under this procedure has the form:

    dS_t = (α_t^P - q_t) S_t dt + σ_t S_t dW_{P,t}^s,
    dv_t = β_t^P dt + ξ v_t dW_{P,t}^v.

Now (W_{P,t}^s, W_{P,t}^v) are a pair of correlated P-Brownian motions with correlation ρ, both (ρ, ξ) are identical under P or Q, and we used σ_t ≡ √v_t. Indeed, "no-arbitrage" requires that the Q-evolution be related to the P-evolution by the Girsanov substitutions dW_{P,t}^s = dW_{Q,t}^s - λ_t^s dt and dW_{P,t}^v = dW_{Q,t}^v - λ_t^v dt. Under this implied change-of-measure, the variance-covariance structure of the SDE is preserved but the drifts may change. Financially, (λ_t^s, λ_t^v) represent market prices of (equity, volatility) risk. The λ functions are independent of any derivative asset but generally dependent upon (t, S_t, v_t). The Q-evolution model then becomes

    dS_t = (r_t - q_t) S_t dt + σ_t S_t dW_{Q,t}^s,
    dv_t = β_t^Q dt + ξ v_t dW_{Q,t}^v,

where λ_t^s = (α_t^P - r_t)/σ_t and β_t^Q = β_t^P - ξ v_t λ_t^v. Fixing β_t^Q = ω^Q - κ^Q v_t (where ω^Q ≡ κ^Q v̄^Q) from our postulated Q-measure GARCH diffusion (1) still leaves a lot of freedom for the P-evolution. In this generality, the only remaining compatibility requirements under "no-arbitrage" are:

• α^P(v_t = 0) = r_t, since a stock holding would be instantaneously riskless, presuming a deterministic short-rate r_t.
• the boundaries v = 0 and v = ∞ should be unattainable in finite time by the P-measure v_t-process, since that is true of the postulated Q-measure process.

However, the spirit of the model is that the P-measure evolution is also a GARCH diffusion (recall the origin of the name). So, let β_t^P = ω^P - κ^P v_t, with possibly different P-parameters. For example, let's postulate (i) a volatility-dependent equity risk premium α_t^P = r_t + c v_t, where c is a (positive) constant, and (ii) λ^v is also constant. With those choices, our associated P-model GARCH diffusion is

    dS_t = (r_t - q_t + c v_t) S_t dt + σ_t S_t dW_{P,t}^s,
    dv_t = (ω^P - κ^P v_t) dt + ξ v_t dW_{P,t}^v,

where ω^P = ω^Q and κ^Q = κ^P + ξ λ^v. Now two additional parameters (c, λ^v) need to be estimated, and "P/Q compatibility" under our choices becomes a hypothesis to be tested. All that is outside our scope in this article. However, one expects c > 0 and λ^v < 0 for, say, SPX.
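As a quick check of the drift relations above, substituting the Girsanov substitution for dW_{P,t}^v into the P-dynamics of v_t recovers the stated Q-drift:

```latex
dv_t = \beta_t^P\,dt + \xi v_t\,dW_{P,t}^v
     = \beta_t^P\,dt + \xi v_t\bigl(dW_{Q,t}^v - \lambda_t^v\,dt\bigr)
     = \bigl(\beta_t^P - \xi v_t \lambda_t^v\bigr)\,dt + \xi v_t\,dW_{Q,t}^v
\quad\Rightarrow\quad \beta_t^Q = \beta_t^P - \xi v_t \lambda_t^v .
```

With β_t^P = ω^P - κ^P v_t, β_t^Q = ω^Q - κ^Q v_t and constant λ^v, matching coefficients in powers of v_t gives ω^Q = ω^P and κ^Q = κ^P + ξλ^v, as used above.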
Finally, note that with our choices, and using coordinates (x_t, v_t) where x_t = log S_t, both the real-world and risk-neutral processes are MAPs (Markov Additive Processes). As discussed in the body, the MAP property leads to an "all-options-at-once" KBE solution for vanilla options; in addition, it also allows dimensional reduction by Fourier methods.

Appendix B – Critical points for the stand-alone volatility process
In our model, the stand-alone volatility process evolves as dV_t = κ(v̄ - V_t) dt + ξ V_t dB_t, where B_t is a Brownian motion. The same functional form holds under either measure (P/Q), although the numerical values of the parameters may differ. Let p(t, v, v₀) denote the transition probability density for the process; i.e. p(t, v, v₀) dv ≡ Pr(V_t ∈ dv | V₀ = v₀). Then v_q = v_q(q, t, v₀, κ, v̄, ξ), the q-critical point for the associated distribution, is defined by

    ∫₀^{v_q} p(t, x, v₀) dx = q,  where 0 < q < 1.

Having v_q (for q close to 1) is useful for setting the v-grid (upper) truncation points for the full 2D process PDE solvers. Now p(t, v, v₀) is not known analytically, but solves the Fokker-Planck problem:

    ∂p/∂t = -∂J/∂v,  where J(t, v) ≡ -(ξ²/2) ∂/∂v {v² p} + κ(v̄ - v) p,  v ∈ (0, ∞),

and subject to the initial condition p(0, v, v₀) = δ(v - v₀), using the Dirac delta. Here J(t, v) is the probability current or flux. Mathematically, since both v = 0 and v = ∞ are inaccessible to the process (with v₀ > 0), no boundary conditions are necessary in the continuum problem. Two relations.
It is easy to find that in the limit t → ∞, v_t follows an Inverse Gamma distribution, f(x) = β^α x^{-α-1} e^{-β/x} / Γ(α), with shape parameter α = 1 + 2κ/ξ² and scale parameter β = 2κv̄/ξ². Another easy relation, for arbitrary t, is the scaling identity:

    v_q(q, t, v₀, κ, v̄, ξ) = v̄ × v_q(q, κt, v₀/v̄, 1, 1, ξ/√κ),

which reduces the effective number of parameters by two. While the scaling relation was not used in the implementations, it was checked. Mathematica implementation.
When speed is not a factor, this full problem (solving the Fokker-Planck PDE and calculating the critical point) is readily solved in Mathematica. Our short implementation is shown in Fig. 6. Even if you are not a Mathematica user, the syntax should be largely readable. The basic idea is to convert to the new coordinate x = log v and solve the resulting PDE problem using NDSolve. A uniformly-spaced x-grid with N_X points is centered at x₀ = log v₀. The grid is truncated at ±n₁ "sigmas" from x₀, where one sigma equals ξ√T. Numerical boundary conditions are taken to be (i) zero flux at x_min and (ii) a zero spatial derivative at x_max. The initial condition is a lattice Dirac delta, non-zero only at x₀, which lies exactly on a node. C/C++ implementation.
The Mathematica implementation solves the above Fokker-Planck problem using 4th-order spatial discretization and the Method of Lines (MOL) via an ODE solver in time. This yields very accurate results but is slow (though we have not tried to port it to C++). For the tests presented in this paper we have opted for a more standard approach, which we only briefly outline here. The discretization is based on uniform central second-order finite differences and the Crank-Nicolson scheme with Rannacher time-stepping. Boundary conditions are the same as above. This approach works, apart from the far-left region of the grid, where convection may dominate and result in oscillations/negative densities. In such cases we locally introduce the 1st-order upwind scheme for the convection term. If negative densities are still produced, we try to bump up N_X. If all fails, we simply return v_q from the stationary (Inverse Gamma) distribution. We also apply double spatial Richardson extrapolation (which should in theory result in 6th-order accuracy, if the upwind scheme is only used in areas where the density takes negligible values). All this results in accuracy even higher than Mathematica's, requiring CPU times of about 2-3 milliseconds with N_X = 800 and N_T = 50.

Figure 6. Mathematica code computing the stand-alone V-distribution critical points.
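For readers without access to the C/C++ engine, the same computation can be sketched compactly (ours, in Python, using an explicit conservative upwind scheme rather than the Crank-Nicolson/Rannacher discretization described above; grid limits and names are our own choices):

```python
import numpy as np

def critical_point_garch_fp(q, v0, T, kappa, vbar, xi, nx=351):
    """Sketch: q-critical point v_q of the stand-alone process
    dV = kappa*(vbar - V) dt + xi*V dB, via the Fokker-Planck equation
    in x = log V, solved with an explicit conservative upwind scheme."""
    xbar = np.log(vbar)
    x = np.linspace(xbar - 3.0, xbar + 4.0, nx)   # assumed wide enough for these params
    dx = x[1] - x[0]
    D = 0.5 * xi ** 2                             # diffusion is constant in log coordinates
    xe = 0.5 * (x[:-1] + x[1:])                   # interior cell edges
    mu = kappa * vbar * np.exp(-xe) - kappa - D   # drift of x_t, evaluated at the edges
    h = np.zeros(nx)
    h[np.searchsorted(x, np.log(v0))] = 1.0 / dx  # lattice Dirac delta at x0 = log(v0)
    dt = 0.4 * min(dx / np.abs(mu).max(), dx * dx / (2.0 * D))  # CFL-limited step
    nsteps = int(np.ceil(T / dt))
    dt = T / nsteps
    for _ in range(nsteps):
        hup = np.where(mu > 0.0, h[:-1], h[1:])   # first-order upwinding for convection
        J = mu * hup - D * (h[1:] - h[:-1]) / dx  # probability flux at interior edges
        J = np.concatenate(([0.0], J, [0.0]))     # zero flux at both ends: mass conserved
        h = h - dt * (J[1:] - J[:-1]) / dx
    cdf = np.cumsum(h) * dx
    return float(np.exp(x[np.searchsorted(cdf, q)])), float(cdf[-1])
```

For large T the result should approach the stationary Inverse Gamma quantile above; e.g. with κ = 4, v̄ = 0.09, ξ = 1 and T = 2.5 one has α = 9, β = 0.72 and v_{0.99} ≈ 0.2.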
Appendix C – Data description
Option Chain A. End-of-day (EOD) SPX option data on March 31, 2017 was obtained from the CBOE's LiveVol service: "End-of-Day Option Quotes with Calcs". These files record option quotes and CBOE-calculated option implied volatilities (IVs) at 15:45 New York time. This time is 15 minutes prior to the regular session close in both NYC and Chicago. From the CBOE (edited for brevity): "Implied volatility and Greeks are calculated off the 1545 timestamp, considered a more accurate snapshot of market liquidity than the end of day market. LiveVol applies a unified calculation methodology across both live and historical data sets to provide maximum consistency between back-testing and real-time applications. Cost of carry inputs (interest rates, dividends) are determined by a statistical regression
process … . The cost of carry projected from these inputs is compared against those implied by the at-the-money options from each option expiry. If the rates differ significantly, and the option spreads for this expiry are sufficiently narrow, the implied rates replace the standard inputs. …"

[Figure 6 content: the Mathematica function CriticalPointGarchPDE[q, V0, T, NX, vbar, kappa, xi, n1, AG] referenced in Appendix B.]

Table 9. SPX Option Implied Volatilities (%): Chain A (15:45, March 31, 2017).
Option expiration date (mm/dd/yy format) Strike
On any given date, option data can be quite voluminous and needs to be filtered, both to reduce calculational burdens and to remove irrelevant noise. Indeed, the full March 31, 2017 data file contained 8399 option line items, which we filtered to the 246 items shown in Table 9. This was done by, first, focusing on traditional "third Friday" expirations and then doing some strike filtering. The 15:45 SPX index value was S = 2367.94 (midpoint quote). We selected positive-bid, out-of-the-money options, so the IVs shown in the table are from puts when the strike K < S and otherwise calls. (The CBOE's IV methodology is somewhat of a black box, but it appears to be essentially put-call-parity preserving. See also [18]). For the first expiration we chose (Apr 21, 2017), the implied volatilities were smooth down to a strike K = 1800, chosen as a lower limit strike cutoff. For other expirations, the data looked smooth down to K = 500, our cutoff for the remaining expirations. We imposed no upper strike cutoff. To achieve a rough balance between puts and calls, we selected put strikes at multiples of 100 and call strikes at multiples of 25. We believe this filtering retains the important characteristics of the full data sets. For short-term interest rates, we found U.S. Treasury debt asked yields on Mar 31, 2017 from the Wall Street Journal (WSJ) and used those as stepwise constants for each of our 8 expirations. That series was {0.00728, 0.00723, 0.00716, 0.00865, 0.00939, 0.01118, 0.01203, 0.01434} in expiration order. For the SPX dividend yield, we used a constant q = 0.0197 for all expirations, the WSJ-reported trailing 12-month SPX yield on the same date. These are not necessarily the cost-of-carry parameters used by the CBOE, the latter being unavailable. The difference is unlikely to change the parameter fits or our conclusions in any way that matters. But if the reader is concerned about this small point, then take our combination of IVs and cost-of-carrys as one "possible" market data set (largely consistent with the 3/31/17 actual data) to which we fit various models. Option Chain B.
The second author (Lewis) collected (out-of-the-money) closing option quotes at the time (Feb. 1, 2010), using only options with positive bids. IVs were calculated from the bid-ask midpoint option price. Interest rates and a dividend yield were found from the WSJ as per Chain A.
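The Chain A strike-selection rule described above is easy to state in code (a hypothetical helper of ours; the sample strikes are illustrative, not from the paper's data files):

```python
def select_strikes(strikes, spot, lower_cutoff):
    """Keep OTM put strikes at multiples of 100 and OTM call strikes at
    multiples of 25, subject to a lower strike cutoff (no upper cutoff)."""
    keep = []
    for K in strikes:
        if K < lower_cutoff:
            continue                      # below the smoothness cutoff
        if K < spot and K % 100 == 0:     # out-of-the-money put strikes
            keep.append(K)
        elif K >= spot and K % 25 == 0:   # out-of-the-money call strikes
            keep.append(K)
    return keep

# Illustrative strikes around the 15:45 SPX level S = 2367.94:
print(select_strikes([1700, 1800, 1850, 2300, 2360, 2375, 2400, 2412], 2367.94, 1800))
# -> [1800, 2300, 2375, 2400]
```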
Appendix D – Calibration for the randomized optimal p-model

A randomized version of the Heston model was presented by Mechkov [16]. The basic idea is that instead of taking the initial value of the (latent) variance process v₀ to be a known fixed value, we assume it is given by some distribution. It is a simple and appealing idea, making for an easy extension to the present framework: the PDE solution automatically provides the option values for the whole range of possible initial variance values (corresponding to the grid in the v-direction). To find the "randomized" option price at the asset spot S₀, average the solution across the S = S₀ grid line using the assumed distribution. As suggested in [16], it is reasonable to assume the latter should be of the same type as the process' equilibrium distribution. For the GARCH diffusion model this would be the Inverse Gamma distribution. Our brief testing indicates that this choice yields the best results even for the randomization of the Heston model, so we will use it here to randomize the general p-model (32). We make the parameters of the distribution (shape α and scale β), as well as the power p of the model, part of the calibration. The total number of parameters to be fitted is now seven. As Figure 7 shows, the overall quality of fit (RMSE IV = 0.79%) is considerably better than either that of the GARCH diffusion (RMSE IV = 1.68%) or the Heston model (RMSE IV = 1.28%), especially for the shortest expiration (see Figure 3). As already discussed in Sec. 4.5, the Heston model (p = 0.5) fit implies unlikely dynamics. The optimal power of the general model was calibrated to p = 0.8, slightly closer to the GARCH diffusion model. In contrast to the latter, here we see a steep smile captured for the first (3W) expiration, which is due to the randomization (and not the change from p = 1 to p = 0.8).
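The randomization step itself is just a quadrature along the S = S₀ grid line. A sketch (ours; scipy's Inverse Gamma supplies the weights, and prices_on_grid stands in for one PDE solve):

```python
import numpy as np
from scipy.stats import invgamma

def randomized_price(v_grid, prices_on_grid, alpha, beta):
    """Average option values PDE(S0, v) along the v-grid over an assumed
    Inverse Gamma distribution of v0 (shape alpha, scale beta)."""
    w = invgamma.pdf(v_grid, a=alpha, scale=beta)
    w = w / w.sum()                         # discrete weights; renormalize for truncation
    return float(np.dot(w, prices_on_grid))
```

By construction, a v-independent price is returned unchanged, and a monotone price profile yields a value between its endpoints.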
We also note that, now that the model is not bound (and thus stressed) by its inability to account for the short-term smile, the fitted volatility-of-volatility parameter falls to arguably more realistic levels (ξ = 2). This is much lower than the GARCH diffusion's calibrated value of ξ = 6.9 (Table 1), even after accounting for the scaling from p = 1 to p = 0.8. A potential problem is seen with the calibrated correlation coefficient (ρ = -1). Other similar tests also indicate that the randomization procedure tends to lead the optimizer to rather extreme correlation values. Why? From [17], the small-T asymptotic smile "explosion" due to randomization is symmetric about x_K = 0, where x_K = log K/S is the log-moneyness. The optimizer may be trying to compensate for the "unwanted" new tendency towards symmetry by pushing ρ towards -1. Excluding ρ from the calibration and fixing it to a more reasonable value (ρ = -0.8) yields a very similar fit with RMSE IV = 0.83%. This indicates that the much-improved fit does not depend strongly on such extreme ρ values. Nevertheless, this behavior (as well as the need for a dynamical rationale) are issues for randomization.

Footnotes: The equilibrium (stationary) distribution of the square-root variance process is a Gamma distribution, but at least for our datasets we found it actually performs worse than the Inverse Gamma as the randomizing (initialization) distribution for the Heston model. Also, the p = 0.8 model is much closer to the GARCH diffusion model in terms of the implied dynamics. For example, one finds that the Heston model's fit here implies a 41% probability that the long-run (risk-neutral) volatility is less than 1%, which is not very plausible. For both GARCH diffusion and the p = 0.8 model this probability is practically zero. The mean long-run volatility is about 20% for all three models.

Figure 7. Calibration of the general power-law model (32) for Chain A, randomizing v₀ with an Inverse Gamma distribution.
The calibrated model parameters are v̄ = 0.0407, κ = 3.34, ξ = 2.00, ρ = -1.00, the initial-distribution shape and scale parameters (see Appendix B) α = 1.05, β = 0.00124, and the model power p = 0.801. RMSE IV = 0.79%.

[Figure 7 panels: implied volatility vs. strike K for expirations T = 21, 49, 77, 168, 259, 441, 630 and 994 days; each panel shows Market and Randomized p = 0.8 model curves.]