Deep Importance Sampling
Benjamin Virrion∗

July 8, 2020
Abstract
We present a generic path-dependent importance sampling algorithm where the Girsanov induced change of probability on the path space is represented by a sequence of neural networks taking the past of the trajectory as an input. At each learning step, the neural networks' parameters are trained so as to reduce the variance of the Monte Carlo estimator induced by this change of measure. This allows for a generic path-dependent change of measure which can be used to reduce the variance of any path-dependent financial payoff. We show in our numerical experiments that for payoffs consisting of either a call, an asymmetric combination of calls and puts, a symmetric combination of calls and puts, a multi coupon autocall or a single coupon autocall, we are able to reduce the variance of the Monte Carlo estimators by factors between 2 and 9. The numerical experiments also show that the method is very robust to changes in the parameter values, which means that in practice, the training can be done offline and only updated on a weekly basis.
Keywords: Importance Sampling, Neural Networks, Path-Dependence.

∗ Natixis and CEREMADE, UMR CNRS, Université Paris-Dauphine, PSL University.

1 Introduction
The objective of this paper is to compute the price of a possibly path-dependent option of payoff g, given by:

    f(x) = E^Q[ g( (X^x_t)_{0 ≤ t ≤ T} ) ]    (1)

where (X^x_t)_{0 ≤ t ≤ T} is an Itô diffusion satisfying:

    dX^x_t = b_t dt + σ_t dW_t,    X^x_0 = x    (2)

where Q is the risk-neutral measure, W_t is a Brownian motion under Q, and b_t and σ_t are two locally bounded predictable processes.

In order to reduce the variance of the Monte Carlo, we will compute f(x) by using importance sampling. The change of probability measure used in this importance sampling will be obtained using an adapted process a_θ(t, (X^x_s)_{0 ≤ s ≤ t}) and the Girsanov theorem.

Using importance sampling to reduce the standard deviation of Monte Carlo estimators is a well known and studied practice. Its use has been well described in both non-financial [1] and financial [2], [3] articles and books on Monte Carlo techniques. Among the many possible ways of using importance sampling, we decide here to separate two broad categories. In the first category, g is a function of a random variable that is not on the path space, whereas in the second category, g is a function of a random process that lives on the path space.

In the first category, importance sampling using neural networks has been studied in [4] and [5]. For these papers, as the problem at hand does not inherently live in the path space, the change of measure is described directly using the densities, and there is no attempt to use the Girsanov theorem to describe the change of measure.

On the other hand, when importance sampling is done on the path space, describing the change of measure using the Girsanov theorem becomes natural. Such a use of the Girsanov theorem to do importance sampling is well known, and has been compared to other variance reduction methods in [2] to price out of the money options.
Out of this vast literature, we would like to mention a few articles that stand out in the sense that our method is close to theirs.

One of the first papers using an adaptive importance sampling scheme in order to reduce variance in a Monte Carlo algorithm for a diffusion process in a financial context is that of [6]. In this paper, the author shows how a Robbins-Monro algorithm allows one to find the optimal Girsanov change of measure that reduces the Monte Carlo variance. However, the author does not directly represent the change of measure using a neural network, but restricts himself to using a deterministic drift vector. As our neural network can take the past of the trajectory as an input, our algorithm is more flexible and allows for a path-dependent change of measure, which is not possible with the approach of [6].

In [7], the authors use importance sampling both for a function of a random variable and a function of a random process. For a function of a random process, the authors use the Girsanov theorem to represent the change of measure. Furthermore, assuming that they have some knowledge of the law of the process, more precisely of p and ∇p, they show that a Robbins-Monro scheme using this knowledge¹

¹ In full generality, one might want to take as an input for a_θ(t, ·) all the information available at time t, that is F_t. For example, for a stochastic volatility diffusion, taking the past of the volatility as an input would be useful. For simplicity's sake, we only take the past of the trajectory of the underlying as an input in this paper.
to update the θ parameter of a parametrized family of changes of measure converges to a minimizer of this family that minimizes the standard deviation. The method in our paper is extremely close to theirs, except that we do not use a model-dependent analytical formula to obtain p and ∇p, but instead use a neural network and backpropagation to obtain this gradient, which then enables the gradient descent algorithm. So in some sense, our paper is a natural extension of their paper when using a neural network and backpropagation, which enables us to obtain the gradient for diffusion processes where no analytical formulas could be made explicit.

The paper [4] uses importance sampling and a family of changes of measure parametrized by a neural network to learn the change of measure. However, they naturally place themselves in the random variable setting, whereas we place ourselves in the random process setting. This quite naturally brings us to use the neural network as a function of the past of the process, which we believe is much more adapted to most financial payoffs.
Let us consider a parametrized family of measurable functions a_θ : [0, T] × ℝ → ℝ, with θ ∈ Θ, satisfying:

    |a_θ(t, x)| ≤ C (1 + |x|),    for t ∈ [0, T], x ∈ ℝ    (3)

    |a_θ(t, x) − a_θ(t, y)| ≤ D |x − y|,    for t ∈ [0, T], x, y ∈ ℝ    (4)

for some constants C and D.

We can then introduce the process:

    dW^θ_t = dW_t − a_θ(t, (X^x_s)_{0 ≤ s ≤ t}) dt    (5)

The diffusion of our underlying process (X^x_t) can be rewritten:

    dX^x_t = b_t dt + σ_t dW_t = b_t dt + σ_t dW^θ_t + a_θ(t, (X^x_s)_{0 ≤ s ≤ t}) σ_t dt    (6)

Furthermore, let us introduce the change of measure process, which is a Q-martingale:

    Z^θ_t = dQ^θ/dQ |_{F_t}
          = exp( ∫_0^t a_θ(s, (X^x_u)_{0 ≤ u ≤ s}) dW_s − ½ ∫_0^t a_θ(s, (X^x_u)_{0 ≤ u ≤ s})² ds )
          = exp( ∫_0^t a_θ(s, (X^x_u)_{0 ≤ u ≤ s}) dW^θ_s + ½ ∫_0^t a_θ(s, (X^x_u)_{0 ≤ u ≤ s})² ds )    (7)

By the Girsanov theorem, W^θ_t is a Brownian motion under the Q^θ probability measure. We then have:

    f(x) = E^{Q^θ}[ g( (X^x_t)_{0 ≤ t ≤ T} ) / Z^θ_T ]    (8)

2 Representing a_θ Using Neural Networks
In order to reduce the variance in the Monte Carlo, our aim is to find the a_θ that minimizes the variance of g( (X^x_t)_{0 ≤ t ≤ T} ) / Z^θ_T under Q^θ. Therefore, let us introduce:

    h̃(θ) := E^{Q^θ}[ ( g( (X^x_t)_{0 ≤ t ≤ T} ) / Z^θ_T )² ]    (9)

Considering that equation (Eq. 8) is true for all θ, the square of the mean in the variance term, E^{Q^θ}[ g( (X^x_t)_{0 ≤ t ≤ T} ) / Z^θ_T ] = f(x), is the same for all θ. Therefore, we can simply ignore it, and our optimization problem can be rewritten:

    θ̃* := argmin_{θ ∈ Θ} h̃(θ)    (10)

To do this, we will use multiple neural networks to represent a_θ, and minimize for the parameters of the neural networks using the variance as our loss function.

In practice, we do not want to allow changes of measure that are too extreme. Therefore, we will add the following constraint to the error function:

    h(θ) := E^{Q^θ}[ ( g( (X^x_t)_{0 ≤ t ≤ T} ) / Z^θ_T )² ] + λ ln( E^{Q^θ}[ (Z^θ_T − C)_+ ] )    (11)

and consider the minimizer:

    θ* := argmin_{θ ∈ Θ} h(θ)    (12)

Assuming that h is smooth, when Θ is compact, such a minimizer exists, albeit not necessarily uniquely.

As the number of inputs of a_θ(t, (X^x_s)_{0 ≤ s ≤ t}) increases with time, in practice we need one neural network per time step. If the maturity is long and we want to have a fine mesh for the diffusion process X^x_t, it is possible to introduce a coarser mesh for a_θ, so as not to have too many neural networks to calibrate. We won't do this in this paper, so as to keep the notations simple.

Let us introduce a time grid [t_0, ..., t_{N_T}] := [0, T/N_T, 2T/N_T, ..., T] with N_T ∈ ℕ* time steps. For i ∈ [[1, N_T − 1]], we introduce a neural network a_{θ_i} which has i + 1 inputs and one output.
For i = 0, as all trajectories start at the same initial point, we introduce instead a single trainable value a_{θ_0}. Mathematically, we have:

    a_{θ_i} : ℝ^{i+1} × Θ_i → ℝ,    for i ∈ [[1, N_T − 1]]
    a_{θ_0} : Θ_0 → ℝ,    for i = 0    (13)

For a Bachelier diffusion, the algorithm is then as follows:

```python
import numpy as np
import tensorflow as tf


def generate_trajectories_z_and_a_list(self):
    # Diffuse the trajectories under Q^theta and accumulate the drifts a,
    # from which the change of measure Z^theta_T is computed.
    a_list = [None for i in range(self.N_T)]
    trajectories = [None for i in range(self.N_T + 1)]
    for n_time_step in range(self.N_T + 1):
        if n_time_step == 0:
            # All trajectories start at the initial value x.
            trajectories[n_time_step] = tf.tile(tf.reshape(self.x, [1, 1]),
                                                [self.N_batch_size, 1])
        else:
            # Evaluate a_theta_{n_time_step - 1} on the past of the trajectories.
            a_list[n_time_step - 1] = self.neural_net(
                trajectories[:n_time_step],
                self.weights_list[n_time_step - 1],
                self.biases_list[n_time_step - 1],
                t_step=n_time_step - 1)
            # Euler scheme for the Bachelier process under Q^theta.
            gaussian_term = self.random_gaussians[:, n_time_step - 1:n_time_step]
            trajectories[n_time_step] = (trajectories[n_time_step - 1]
                                         + a_list[n_time_step - 1] * self.sigma * self.dt
                                         + gaussian_term * self.sigma * self.sqrt_dt)
    # Z^theta_T = exp( sum_i a_i dW^theta_i + 0.5 * sum_i a_i^2 dt ), cf. (Eq. 7).
    a_times_gaussians = tf.multiply(
        tf.reduce_sum(tf.stack(a_list, axis=1), axis=2),
        self.random_gaussians * self.sqrt_dt)
    a_squared_list = tf.square(tf.reduce_sum(tf.stack(a_list, axis=1), axis=2)) * self.dt
    first_term_z = tf.expand_dims(tf.reduce_sum(a_times_gaussians, axis=1), axis=-1)
    second_term_z = 0.5 * tf.expand_dims(tf.reduce_sum(a_squared_list, axis=1), axis=-1)
    z = tf.exp(first_term_z + second_term_z)
    return trajectories, z, a_list


def neural_net(self, trajectories, weights, biases, t_step):
    if t_step == 0:
        # At t = 0, a_theta_0 is a single trainable scalar, tiled over the batch.
        variable = tf.Variable(tf.zeros([1, 1], dtype=tf.float64))
        Y = tf.tile(variable, multiples=[self.N_batch_size, 1])
    else:
        num_layers = len(weights) + 1
        # Reshape the list of past values into a [batch, t_step] tensor,
        # recentered by the initial value x.
        H = tf.reduce_sum(tf.stack(trajectories, axis=1), axis=2) - self.x
        for l in range(0, num_layers - 2):
            W = weights[l]
            b = biases[l]
            H = tf.nn.relu(tf.add(tf.matmul(H, W), b))
        # The output layer has a weight term only (no bias).
        W = weights[-1]
        Y = tf.matmul(H, W)
    return Y
```
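The trajectory and weight generation above can be mirrored in plain NumPy for a fixed, hand-chosen drift function; the sketch below is illustrative of what the TensorFlow graph computes, with `a_fn` standing in for the trained networks (all names and parameter values here are ours, not the paper's):

```python
import numpy as np


def generate_trajectories_and_z(x0, sigma, T, n_steps, n_paths, a_fn, rng):
    """Euler scheme of a Bachelier process under Q^theta with drift a_fn(t, past),
    and the Girsanov weight Z^theta_T = exp(sum a dW^theta + 0.5 sum a^2 dt)."""
    dt = T / n_steps
    sqrt_dt = np.sqrt(dt)
    gaussians = rng.standard_normal((n_paths, n_steps))  # W^theta increments / sqrt(dt)
    paths = np.empty((n_paths, n_steps + 1))
    paths[:, 0] = x0
    a_vals = np.empty((n_paths, n_steps))
    for i in range(n_steps):
        # The drift sees the past of each trajectory, as the neural networks do.
        a_vals[:, i] = a_fn(i * dt, paths[:, :i + 1])
        paths[:, i + 1] = (paths[:, i] + a_vals[:, i] * sigma * dt
                           + gaussians[:, i] * sigma * sqrt_dt)
    log_z = ((a_vals * gaussians * sqrt_dt).sum(axis=1)
             + 0.5 * (a_vals ** 2).sum(axis=1) * dt)
    return paths, np.exp(log_z)


rng = np.random.default_rng(0)
# Constant upward drift a = 0.5, ignoring the past of the trajectory.
paths, z = generate_trajectories_and_z(
    1.0, 0.2, 1.0, 50, 10000, lambda t, past: np.full(past.shape[0], 0.5), rng)
```

With a constant drift a, Z^θ_T reduces to exp(a W^θ_T + a² T / 2), and the empirical average of 1/Z^θ_T under Q^θ is close to 1, as it should be for a change of measure.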
Let us comment on the above algorithm.

In the generate_trajectories_z_and_a_list method, in the loop on the variable n_time_step, we do the following. If n_time_step == 0, we simply initiate the first value of the trajectories, trajectories[0], with the initial value self.x. If n_time_step > 0, we do two things. First, we evaluate a_θ_{n_time_step - 1} on the past of the trajectories. This is done when we call self.neural_net(trajectories[:n_time_step], self.weights_list[n_time_step - 1], self.biases_list[n_time_step - 1], t_step=n_time_step - 1). Second, we construct the next step of the trajectory, with trajectories[n_time_step] = trajectories[n_time_step - 1] + a_list[n_time_step - 1] * self.sigma * self.dt + gaussian_term * self.sigma * self.sqrt_dt. This second step is simply the Euler scheme for the Bachelier process under Q^θ.

In the function neural_net, we should notice a few things. First, when we call neural_net in the method generate_trajectories_z_and_a_list, the trajectories input of neural_net actually consists of the variable trajectories[:n_time_step] of the method generate_trajectories_z_and_a_list. That is, it consists of all the past of the trajectories up to step n_time_step - 1 included.² Ignoring the tf.reduce_sum and tf.stack, which are there for reshaping purposes, we then define H as trajectories[:n_time_step] - self.x, where self.x is the initial value of the trajectories. That is, we decide to recenter all the trajectories by their initial value before feeding them to the networks. Second, in the case t_step == 0, the output is a single trainable variable tiled over the batch; it is part of the θ parameters that are trained in the algorithm.

² In Python, list[:n] is the sublist of list containing its first n terms. So trajectories[:n_time_step] consists of the values of the trajectories up to step n_time_step - 1, that is, the past of the trajectory.

The neural networks that we use have (i + 1) inputs, 2 intermediate layers with 16 neurons each, and one output layer. The intermediate layers have both a weight and a bias term, whereas the output layer only contains a weight term. The activation function that we use is ReLU.

Figure 1: Neural Network Architecture for a_θ_i

3 Discretization of the Processes

For numerical implementation, one needs to consider discretized versions of X^x_t, Z^θ_t, a_θ, h(θ) and θ*.
The underlying process X^x_t will be discretized into its numerical version X̄^x_t according to two possible Euler schemes in section 4.1.1. Assuming X̄^x_t already defined, we now define:

    ā_θ(t, (X̄^x_s)_{s ≤ t}) := a_{θ_{⌊tN_T/T⌋}}( (X̄^x_{t_i})_{i=0}^{⌊tN_T/T⌋} )    (14)

This allows us to define:

    dW̄^θ_t := dW_t − ā_θ(t, (X̄^x_s)_{s ≤ t}) dt    (15)

We now define the martingale:

    Z̄^θ_t := exp( ∫_0^t ā_θ(s, (X̄^x_r)_{r ≤ s}) dW̄^θ_s + ½ ∫_0^t ā_θ(s, (X̄^x_r)_{r ≤ s})² ds )    (16)

which allows us to define Q̄^θ as a measure such that for all t ∈ [0, T]:

    dQ̄^θ/dQ |_{F_t} := Z̄^θ_t    (17)

We can then define:

    h̄(θ) := Ê^{Q̄^θ}[ ( g( (X̄^x_t)_{0 ≤ t ≤ T} ) / Z̄^θ_T )² ] + λ ln( Ê^{Q̄^θ}[ (Z̄^θ_T − C)_+ ] )    (18)

where the operator Ê^A is defined as the empirical average obtained when simulating NBatchSize random variables under the probability A.

We define h̄(θ) as the function to be minimized, and then train our neural networks by doing NBatchChangeProportion learning steps for a given batch of random variables, and fetching NumberOfBatchesForTraining batches of random variables. We thus do in total NBatchChangeProportion × NumberOfBatchesForTraining learning steps. We then define θ̂* as the value obtained for θ after these NBatchChangeProportion × NumberOfBatchesForTraining learning steps.

The C used for the different experiments are those of tables 8-17. For the figures 12, 13, 20, 21, 22, 23, 30, 31, 32, 33, 36, 37, 38, 39, 42, 43, 46, 47, 50, 51, 54, 55, 56, 57, 60, 61, 64, 65, 66 and 67 showing the surfaces a_θ and ã_θ, we use the learning rates given by tables 8, 9, 10, 11, 12, 13, 14, 15, 16 and 17. For the graphs 6, 7, 14, 15, 24, 25, 34, 35, 40, 41, 44, 45, 48, 49, 52, 53, 58, 59, 62 and 63, we use the learning rates given by tables 18-27.

For the values of λ, we do two things.
For the figures 12, 13, 20, 21, 22, 23, 30, 31, 32, 33, 36, 37, 38, 39, 42, 43, 46, 47, 50, 51, 54, 55, 56, 57, 60, 61, 64, 65, 66 and 67 showing the surfaces a_θ and ã_θ, we use the values of tables 8-17. However, manually choosing each λ for the graphs 6, 7, 14, 15, 24, 25, 34, 35, 40, 41, 44, 45, 48, 49, 52, 53, 58, 59, 62 and 63 would be very cumbersome, as we might need a different λ for each point in the graph. Therefore, for these graphs, we instead use a first batch to evaluate the standard deviation σ̂ of g( (X̄^x_t)_{0 ≤ t ≤ T} ) under Q, and choose λ = BaseForAutomaticLambdaConstraint^{−⌊log(σ̂)⌋}, where BaseForAutomaticLambdaConstraint is given by table 28.

4 Numerical Experiment
For the numerical experiments, we will consider two diffusion processes (Bachelier and local volatility), and 3 types of payoffs (calls, calls & puts, and autocalls). The parameters used for these diffusions and payoffs are those of tables 1, 2, 3, 4, 5, 6 and 7.

Bachelier
The Bachelier diffusion process is defined by:

    dX^x_t = σ dW_t for t ∈ [0, T],    X^x_0 = x    (19)

In practice, we diffuse the Euler scheme given by:

    X̄^x_{(i+1)T/N_t} = X̄^x_{iT/N_t} + σ ( W_{(i+1)T/N_t} − W_{iT/N_t} ),    for i ∈ [[0, N_t − 1]],    X̄^x_0 = x    (20)

x, σ, T and N_t are defined in table (Tab. 1).
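To make the mechanics concrete, the following self-contained sketch prices an out-of-the-money call under this scheme, once with a plain Monte Carlo and once with the Girsanov change of measure of section 1.3 restricted to a constant drift a; the drift value, strike and other parameters below are illustrative choices of ours, not those of table 1:

```python
import numpy as np


def bachelier_call_mc(x0=1.0, K=1.3, sigma=0.2, T=1.0, n_steps=50,
                      n_paths=200000, a=0.0, seed=0):
    """Euler scheme for dX = a*sigma*dt + sigma*dW^theta under Q^theta.
    Returns the mean and std dev of the estimator g(X_T) / Z^theta_T."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    g_normals = rng.standard_normal((n_paths, n_steps))  # W^theta increments / sqrt(dt)
    x_T = x0 + (a * sigma * dt + sigma * np.sqrt(dt) * g_normals).sum(axis=1)
    # Z^theta_T = dQ^theta/dQ = exp(a W^theta_T + 0.5 a^2 T) for a constant drift a.
    log_z = a * np.sqrt(dt) * g_normals.sum(axis=1) + 0.5 * a * a * T
    estimator = np.maximum(x_T - K, 0.0) * np.exp(-log_z)
    return estimator.mean(), estimator.std()


price_plain, std_plain = bachelier_call_mc(a=0.0)
price_is, std_is = bachelier_call_mc(a=1.5)  # push trajectories towards the strike
```

On this example both estimators agree on the price to within Monte Carlo error, while the importance-sampled standard deviation is several times smaller; the trained, path-dependent drifts of the paper generalize this constant-drift sketch.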
Local Volatility

The Local Volatility diffusion process is defined by:

    dX^x_t = X^x_t σ(t, ln(X^x_t)) dW_t,    X^x_0 = x    (21)

In practice, we diffuse the Euler scheme of the log diffusion given by:

    Ȳ^x_{(i+1)T/N_t} = Ȳ^x_{iT/N_t} − ½ σ(iT/N_t, Ȳ^x_{iT/N_t})² T/N_t + σ(iT/N_t, Ȳ^x_{iT/N_t}) ( W_{(i+1)T/N_t} − W_{iT/N_t} ),    for i ∈ [[0, N_t − 1]],    Ȳ^x_0 = ln(x)    (22)

and define the discrete process as:

    X̄^x_{iT/N_t} = exp( Ȳ^x_{iT/N_t} ),    for i ∈ [[0, N_t]]    (23)

To obtain the local volatility σ(t, x), we start from an implied volatility surface given by a raw SVI parameterization, and obtain the corresponding local volatility by using the Dupire formula. Taking the definition of [8], for a given parameter set χ = {a, b, ρ, m, σ}, the raw SVI parameterization of total implied variance up to time t reads:

    w̃(t, χ) = a + b ( ρ (k − m) + √( (k − m)² + σ² ) )    (24)

where a ∈ ℝ, b ≥ 0, |ρ| < 1, m ∈ ℝ, σ > 0, and χ respects the condition a + bσ√(1 − ρ²) ≥ 0. Here k := ln( K / F(t, 0) ) = ln( K / X_0 ) is the log-Forward-Strike (we will only consider diffusions with no interest rates, so the forward of the underlying is its spot).

However, as we do not only want one time strand of the volatility surface, but a whole volatility surface, we will very naïvely use the following parameters for the whole volatility surface:

    w(t, k, χ) = t ( a + b ( ρ (k − m) + √( (k − m)² + σ² ) ) )    (25)

As w(t, k, χ) = t σ_imp²(t, k, χ), this gives the following parameterization for the implied volatility σ_imp(t, k, χ):

    σ_imp(t, k, χ) = √( a + b ( ρ (k − m) + √( (k − m)² + σ² ) ) )    (26)

By doing so, we consider an implied volatility surface which has the same smile for all maturities. This is not what is observed in practice: the smile tends to smooth out as maturity increases.
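Equations (25)-(26) translate directly into code; the sketch below uses placeholder SVI parameters of ours (chosen to satisfy the constraints above), not the calibrated values of the paper's tables:

```python
import math


def svi_total_variance(t, k, chi):
    """Raw SVI total implied variance w(t, k) = t * (a + b*(rho*(k-m) + sqrt((k-m)^2 + s^2)))."""
    a, b, rho, m, s = chi
    return t * (a + b * (rho * (k - m) + math.sqrt((k - m) ** 2 + s ** 2)))


def svi_implied_vol(t, k, chi):
    """Implied volatility sigma_imp(t, k) = sqrt(w(t, k) / t); independent of t here."""
    return math.sqrt(svi_total_variance(t, k, chi) / t)


# Placeholder parameters (a, b, rho, m, sigma); they must satisfy
# b >= 0, |rho| < 1, sigma > 0 and a + b*sigma*sqrt(1 - rho^2) >= 0.
chi = (0.03, 0.1, -0.3, 0.0, 0.2)
vol_atm = svi_implied_vol(1.0, 0.0, chi)  # at-the-money (k = 0)
```

Note that σ_imp does not depend on t, reflecting the "same smile for all maturities" simplification.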
However, as the main focus of this paper is not the volatility surface, we will still use this simple parameterization in order to have a simple implementation.

Using the Dupire formula, we can obtain, according to the computations in [9], the following local volatility:

    σ²(t, k) = ∂_t w / ( 1 − (k/w) ∂_k w + ¼ ( −¼ − 1/w + k²/w² ) (∂_k w)² + ½ ∂²_kk w )    (27)

For t ∈ (0, T], using the definition of w, we obtain:

    ∂_t w = a + b ( ρ (k − m) + √( (k − m)² + σ² ) )
    ∂_k w = t b ( ρ + (k − m) / √( (k − m)² + σ² ) )
    ∂²_kk w = t b σ² / ( (k − m)² + σ² )^{3/2}    (28)

For the special case t = 0, we define the local volatility as the limit as t → 0 of σ(t, k) as previously defined. Writing ω := ∂_t w, which does not depend on t, we have w = tω and ∂_k w / w = ∂_k ω / ω; the terms −(1/w)(∂_k w)², −¼(∂_k w)² and ½ ∂²_kk w of the denominator all vanish in the limit, which gives:

    σ²(0, k) = ω / ( 1 − k (∂_k ω / ω) + ¼ k² (∂_k ω / ω)² )    (29)

with

    ∂_k ω / ω = b ( ρ + (k − m) / √( (k − m)² + σ² ) ) / ( a + b ( ρ (k − m) + √( (k − m)² + σ² ) ) )

For a given χ, we can therefore compute both the corresponding implied volatility σ_imp(t, k) and local volatility surfaces σ(t, k) numerically.

Call

The Call option payoff of strike K is given by:

    g(x) = (x − K)_+    (30)

Call & Put Options
The Call & Put Options payoff with N_1 calls and N_2 puts, of strikes K_1 and K_2, is given by:

    g(x) = N_1 (x − K_1)_+ + N_2 (K_2 − x)_+    (31)

AutoCall
The AutoCall payoff is a function of the whole trajectory. In practice, we use a smoothed AutoCall payoff, as is used in the finance industry. In our case, the smoothing of the payoff is necessary for the training of the neural networks, more specifically for the computation of the derivative of our loss function with respect to the neural network parameters, to work properly. For practitioners, we do not see this as a major problem, as it is industry standard to smooth payoffs, so as to get reasonable Greeks for the trader's hedge of the product. The smoothing presented here is not the one that practitioners should use in practice. Indeed, as the AutoCall is a non-convex payoff, our smoothing can lead to under-hedging the product. As the smoothing of the AutoCall payoff is not the topic of this paper, we will do with this very crude smoothing method.

    g( (X^x_t)_{t=0}^T ) = C_PDI + Σ_{i=0}^{N_P − 1} C_{P_i}    (32)

with:

    C_{P_i} = S( i, X^x_{T_{A_i}}, B_{T_{A_i}} ) ∏_{ĩ=0}^{i−1} ( 1 − S( ĩ, X^x_{T_{A_ĩ}}, B_{T_{A_ĩ}} ) )    (33)

where

    S(i, x, b) = ( (x − b)_+ − (x − b − S_i)_+ ) / S_i    (34)

is a smoothed version of the indicator function of x ≥ b, with smoothing width S_i, and

    C_PDI = −( 1 − K/S_PDI ) ( (K + S_PDI − X^x_T)_+ − (K − X^x_T)_+ ) ∏_{ĩ=0}^{N_P − 1} ( 1 − S( ĩ, X^x_{T_{A_ĩ}}, B_{T_{A_ĩ}} ) )    (35)

Under the condition that the product is still alive at T_{A_i}, the corresponding Phoenix coupon C_{P_i} payoff is illustrated in figure (Fig. 3). Similarly, under the condition that the product is still alive at final maturity T, we illustrate in figure (Fig. 2) the put down and in C_PDI payoff.
Figure 2: C_PDI when product reaches maturity T

Figure 3: C_{P_i} when product is alive at time T_{A_i}

We illustrate in figure (Fig. 4) how the product is autocalled or not for different trajectories. For trajectory 1, the product is autocalled at the first phoenix barrier date for which the underlying is higher than the barrier value; the owner of the AutoCall then receives the corresponding phoenix coupon. For trajectory 2, the product is autocalled slightly later, and the owner of the AutoCall receives the corresponding coupon. Trajectories 3 and 4 never cross a barrier B_i, for 0 ≤ i ≤ 4. Therefore, for these trajectories, the product reaches maturity. For trajectory 3, at maturity, the underlying is above the put down and in barrier B_PDI and below the final phoenix coupon barrier. Therefore, the owner of the AutoCall doesn't pay or receive anything. For trajectory 4, at maturity, the underlying is below the put down and in barrier B_PDI. The owner therefore "receives" C_PDI (which is a negative value, so the owner actually pays |C_PDI|). The above explanation stands for the non-smoothed product. In order to smooth the payoff, we replace the indicator functions by their smoothed versions from equation (Eq. 34).
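The smoothed indicator of equation (34) and the "still alive" logic of equation (33) can be sketched as follows; the dates, barrier levels and smoothing width are illustrative choices of ours, and the put down and in leg is omitted for brevity:

```python
import numpy as np


def smoothed_indicator(x, b, width):
    """Smoothed version of 1_{x >= b}: a linear ramp from 0 at b to 1 at b + width (Eq. 34)."""
    return (np.maximum(x - b, 0.0) - np.maximum(x - b - width, 0.0)) / width


def autocall_coupons(path_at_barrier_dates, barriers, width):
    """Phoenix coupon weights C_P_i: smoothed barrier hit at date i, times the
    smoothed probability of not having been autocalled before (Eq. 33)."""
    alive = 1.0
    coupons = []
    for x, b in zip(path_at_barrier_dates, barriers):
        s = smoothed_indicator(x, b, width)
        coupons.append(s * alive)
        alive = alive * (1.0 - s)
    return coupons, alive


# One trajectory observed at three barrier dates, all barriers at 1.5:
# below the barrier at the first date, autocalled at the second.
coupons, alive = autocall_coupons([1.2, 1.6, 1.0], [1.5, 1.5, 1.5], width=0.05)
```

Away from the smoothing region the weights are exactly 0 or 1; within a band of size `width` below each barrier they interpolate linearly, which is what makes the payoff differentiable for backpropagation.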
Figure 4: AutoCall Barrier Activations for Different Underlying Trajectories

Visualizing a_θ

As a_θ is a function of the path space, we cannot represent it visually in a graph. However, we can restrict ourselves to showing its values for some simple trajectories. Let us therefore introduce ã_θ(t, x) = a_θ( t, (X^{x_0}_s)_{s=0}^t (ω(t, x)) ), where ω(t, x) is the set of events such that ∀s ∈ [0, t], X^{x_0}_s(ω(t, x)) = x_0 + (s/t)(x − x_0). In other words, ã_θ(t, x) is the evaluation of a_θ on trajectories that start at (0, x_0) and go in a straight line up to the point (t, x). Figure (Fig. 5) shows the construction of these trajectories.

Figure 5: Construction of Trajectories (X^{x_0}_s(ω(t, x)))_{s=0}^t

4.3 A Simplified Version of the Algorithm with a Local a

In the preceding sections, we have considered a very general change of measure a_θ taking the whole trajectory of (X^x_s)_{0 ≤ s ≤ t} as an input. This function being of high dimension, it is difficult to represent it visually. Therefore, one might wonder if a local version of a_θ, taking only the last value of the underlying process at time t, X_t, as an input, might be sufficient in practice.

Let us therefore consider a function a^{L,θ} : [0, T] × ℝ → ℝ. The theory of section (Sect. 1.3) still applies. We can then do as in section (Sect. 2), and use a list of neural networks to represent a^{L,θ}(·, ·) on {t_0, ..., t_{N_T − 1}} × ℝ.
Let us introduce the list of neural networks:

    a^{L,θ}_i : ℝ → ℝ,    for i ∈ [[0, N_T − 1]]

We define Z^{L,θ}, θ^{L,*} and (W^{L,θ}_t)_{0 ≤ t ≤ T} as the counterparts of Z^θ, θ* and (W^θ_t)_{0 ≤ t ≤ T}, except for the fact that we now use the local version a^{L,θ} instead of a_θ for the drift in the Girsanov change of measure.

From now on, we will refer to the algorithm using the full function of the path space a_θ as the "full" method, and to the one using the local a^{L,θ} as the "local" method.
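The only difference between the two methods is the input fed to the network attached to date t_i; a minimal sketch of the two feature constructions (applying the recentering of section 2 to both methods is our assumption here):

```python
import numpy as np


def full_features(paths, i, x0):
    # "Full" method: the whole past (X_{t_0}, ..., X_{t_i}), recentered by x0.
    return paths[:, :i + 1] - x0   # shape [batch, i + 1]


def local_features(paths, i, x0):
    # "Local" method: the last value X_{t_i} only (recentering assumed).
    return paths[:, i:i + 1] - x0  # shape [batch, 1]


paths = np.array([[1.0, 1.1, 1.3],
                  [1.0, 0.9, 0.8]])
f_in = full_features(paths, 2, 1.0)   # growing input, one network per date
l_in = local_features(paths, 2, 1.0)  # fixed-size input
```

This is why the full method needs networks with i + 1 inputs, while the local networks all map ℝ to ℝ.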
4.4 Results for a Bachelier Diffusion

In all the following graphs, the full lines and dotted lines represent, on the left axis, the standard deviations obtained when pricing with a plain Monte Carlo and a Deep Importance Sampling Monte Carlo for different values of the spot price x. The dashed lines (right axis) show the ratios of these standard deviations. The graphs on the left use the "full" version of the Deep Importance Sampling algorithm, whereas graphs on the right use the "local" version of the Deep Importance Sampling algorithm. For each value of the spot price x, we train a separate set of neural networks. This will not be the case in section 5, where the neural networks will be trained only with x = 1, and we will see the results obtained via a plain and a Deep Importance Sampling Monte Carlo when changing the different parameters, without retraining the neural networks.

4.4.1 Call

We see in figures (Fig. 6) and (Fig. 7) that the Monte Carlo obtained via our adaptive importance sampling has a lower standard deviation than a plain Monte Carlo.

Figure 6: Standard Deviation vs x for Full Method
Figure 7: Standard Deviation vs x for Local Method

To get an idea of how the trajectories are modified, we show in figures (Fig. 8) and (Fig. 10) the distributions of the weights Z^{θ̂*}_T and Z^{L,θ̂^{L,*}}_T in log scale. Figures (Fig. 9) and (Fig. 11) show the distributions of X_T under Q^{θ̂*} and Q^{L,θ̂^{L,*}} as histograms, and their theoretical distributions under Q as a solid line. We can see that the trajectories are modified so as to get closer to the call's strike at K = 1.2.

Figure 8: Z^{θ̂*}_T Distribution Under Q^{θ̂*}
Figure 9: X_T Distribution Under Q^{θ̂*} for Full Method
Figure 10: Z^{L,θ̂^{L,*}}_T Distribution Under Q^{L,θ̂^{L,*}}
Figure 11: X_T Distribution Under Q^{L,θ̂^{L,*}} for Local Method

In figures (Fig. 12) and (Fig. 13), we show the surfaces ã^{θ̂*}(t, x) and a^{L,θ̂^{L,*}}(t, x). For the call option, ã_θ and a^{L,θ} are always positive.
This agrees with the intuition that we need to apply a positive drift in order to have more trajectories reach the strike region of the product.

Figure 12: ã^{θ̂*}(t, x)
Figure 13: a^{L,θ̂^{L,*}}(t, x)

4.4.2 Asymmetric Calls and Puts

For this experiment, we price N_1 calls of strike K_1 and N_2 puts of strike K_2. The parameters used are those of table (Tab. 4), where N_1, N_2, K_1 and K_2 have been chosen so that the N_1 calls and N_2 puts have roughly the same price. Again, we see in figures (Fig. 14) and (Fig. 15) that the Monte Carlo obtained using our adaptive importance sampling has a lower standard deviation than a plain Monte Carlo.

Figure 14: Standard Deviation vs x for Full Method
Figure 15: Standard Deviation vs x for Local Method

To get an idea of how the trajectories are modified, we again show in figures (Fig. 16) and (Fig. 18) the distributions of the weights Z^{θ̂*}_T and Z^{L,θ̂^{L,*}}_T under Q^{θ̂*} and Q^{L,θ̂^{L,*}}. Figures (Fig. 17) and (Fig. 19) show the distributions of X_T under Q^{θ̂*} and Q^{L,θ̂^{L,*}}. In figures (Fig. 17) and (Fig. 19), we can see that the modes of the distributions are lower than 1, which shows that many trajectories are now sent downwards.

Figure 16: Z^{θ̂*}_T Distribution Under Q^{θ̂*}
Figure 17: X_T Distribution Under Q^{θ̂*}
Figure 18: Z^{L,θ̂^{L,*}}_T Distribution Under Q^{L,θ̂^{L,*}}
Figure 19: X_T Distribution Under Q^{L,θ̂^{L,*}}

We also show in figures (Fig. 20), (Fig. 22) and (Fig. 21), (Fig. 23) the surfaces ã^{θ̂*}(t, x) and a^{L,θ̂^{L,*}} from different viewpoints. Notice that ã^{θ*} and a^{L,θ^{L,*}} are now, roughly speaking, positive when x > 1, and negative when x < 1. This is expected, as the payoff now has two strikes, one for the call options and one for the put options, so it needs to separate the trajectories into two groups. One group goes up to get closer to the call strike K_1 = 1.2. Another group goes down to join the more extreme put strike K_2 = 0.6. Our option consists of 1 call of strike K_1 = 1.2, and 10 puts of strike K_2 = 0.6. The ratio of 10 has been chosen such that the price of the call is roughly the same as the price of the 10 puts. However, we can notice that the surface is asymmetric: the part of the surface that goes down roughly reaches values of magnitude around 2, whereas the positive part of the surface reaches around 1. This is because the trajectories that go down need to reach a more extreme strike of 0.6, whereas trajectories that go up only need to reach a strike of 1.2.

Figure 20: ã^{θ̂*}(t, x)
Figure 21: a^{L,θ̂^{L,*}}(t, x)
Figure 22: ã^{θ̂*}(t, x)
Figure 23: a^{L,θ̂^{L,*}}(t, x)

4.4.3 Symmetric Calls and Puts

For this experiment, we again price N_1 calls of strike K_1 and N_2 puts of strike K_2. However, the parameters used are now those of table (Tab. 5), and have been chosen so that the situation is perfectly symmetric: the strikes K_1 and K_2 are symmetric with respect to 1, and N_1 = N_2 = 10. As the diffusion is of Bachelier type, its distribution is also symmetric with respect to x = 1. We therefore expect a_θ to have a symmetry with respect to x = 1. Again, we see in figures (Fig. 24) and (Fig. 25) that the Monte Carlo obtained using our adaptive importance sampling has a lower standard deviation than a plain Monte Carlo.

Figure 24: Standard Deviation vs x for Full Method
Figure 25: Standard Deviation vs x for Local Method

As previously, we show in figures (Fig. 26) and (Fig. 28) the distributions of the weights Z^{θ̂*}_T and Z^{L,θ̂^{L,*}}_T in log scale, and in figures (Fig. 27) and (Fig. 29) the distributions of X_T under Q^{θ̂*} and Q^{L,θ̂^{L,*}}. The modes of the distributions are now at 1, which is expected from the symmetries of both the product and the Bachelier process law with respect to x = 1.

Figure 26: Z^{θ̂*}_T Distribution Under Q^{θ̂*}
Figure 27: X_T Distribution Under Q^{θ̂*}
Figure 28: Z^{L,θ̂^{L,*}}_T Distribution Under Q^{L,θ̂^{L,*}}
Figure 29: X_T Distribution Under Q^{L,θ̂^{L,*}}
As previously, we show in figures (Fig. 30), (Fig. 31) and (Fig. 32), (Fig. 33) the surfaces ã^{θ̂*}(t, x) and a^{L,θ̂^{L,*}} from different viewpoints. Results are similar to those obtained with the asymmetric calls and puts, except that we can now notice in figures (Fig. 32) and (Fig. 33) that ã^{θ̂*} and a^{L,θ̂^{L,*}} present a symmetry with respect to x = 1, which is expected from the symmetries of the product and the Bachelier process law.

Figure 30: ã^{θ̂*}(t, x)
Figure 31: a^{L,θ̂^{L,*}}(t, x)
Figure 32: ã^{θ̂*}(t, x)
Figure 33: a^{L,θ̂^{L,*}}(t, x)

4.4.4 Multi Coupons AutoCall

For this experiment, we price an AutoCall that has multiple coupons. The precise characteristics are those of table (Tab. 6). Again, we see in figures (Fig. 34) and (Fig. 35) that the Monte Carlo obtained using our adaptive importance sampling has a lower standard deviation than a plain Monte Carlo.

Figure 34: Standard Deviation vs x for Full Method
Figure 35: Standard Deviation vs x for Local Method

In figures (Fig. 36), (Fig. 38) and (Fig. 37), (Fig. 39), we show the surfaces ã^{θ̂*}(t, x) and a^{L,θ̂^{L,*}} from different viewpoints. We can see that the values are, roughly speaking, positive when x > 1, and negative for x < 1. This is again expected from the fact that some trajectories need to get closer to the barrier strikes region B_{A·} = 1.5, while others need to get close to the put down and in strike region K_PDI = 0.5.

Figure 36: ã_θ(t, x)
Figure 37: ã_θ(t, x)
Figure 38: ã_θ(t, x)
Figure 39: ã_θ(t, x)

4.4.5 Single Coupon AutoCall

For this experiment, we price an AutoCall that has a single coupon (but still multiple AutoCall barriers). The precise characteristics are those of table (Tab. 7). Again, we see in figures (Fig. 40) and (Fig. 41) that the Monte Carlo obtained using our adaptive importance sampling has a lower standard deviation than a plain Monte Carlo.

Figure 40: Standard Deviation vs x for Full Method
Figure 41: Standard Deviation vs x for Local Method

In figures (Fig. 42) and (Fig. 43), we again show the surfaces ã^{θ̂*}(t, x) and a^{L,θ̂^{L,*}} from different viewpoints. Compared to the multiple coupons AutoCall, we can see one striking difference: for x > 1, the values are large from t = 0 up to the coupon date, then suddenly drop to values close to 0, and slightly negative. This is expected, as once the coupon date has passed, there is no reason to deviate trajectories towards the x > 1 region. Indeed, once the coupon date has passed, the only thing left to price is the put down and in, so trajectories now need to get to values closer to the put down and in strike K_PDI.

Figure 42: ã^{θ̂*}(t, x)
Figure 43: a^{L,θ̂^{L,*}}(t, x)

4.5 Results for a Local Volatility Diffusion

Results for the local volatility diffusion, which is the most used diffusion by practitioners, are very similar to those obtained with the Bachelier diffusion. We will therefore be more brief in our comments.

4.5.1 Call
As in the previous section, we see in figures (Fig. 44) and (Fig. 45) that the Monte Carlo estimator obtained via our adaptive importance sampling has a lower standard deviation than a plain Monte Carlo.

Figure 44: Standard Deviation vs x for Full Method
Figure 45: Standard Deviation vs x for Local Method

In figures (Fig. 46) and (Fig. 47), we show the surfaces ã^{θ̂*}(t, x) and a^{L,θ̂^{L,*}}(t, x). As in the Bachelier diffusion case, for the call option, ã^{θ̂*} and a^{L,θ̂^{L,*}} are always positive, as expected.

Figure 46: ã^{θ̂*}(t, x)
Figure 47: a^{L,θ̂^{L,*}}(t, x)

4.5.2 Asymmetric Calls and Puts

For this experiment, we again price N_C calls of strike K_C and N_P puts of strike K_P with the parameters of table (Tab. 4). As in the Bachelier diffusion case, we see in figures (Fig. 48) and (Fig. 49) that the Monte Carlo estimator obtained using our adaptive importance sampling has a lower standard deviation than a plain Monte Carlo.

Figure 48: Standard Deviation vs x for Full Method
Figure 49: Standard Deviation vs x for Local Method

As previously, we show in figures (Fig. 50) and (Fig. 51) the surfaces ã^{θ̂*}(t, x) and a^{L,θ̂^{L,*}}(t, x).

Figure 50: ã^{θ̂*}(t, x)
Figure 51: a^{L,θ̂^{L,*}}(t, x)

4.5.3 Symmetric Calls and Puts

For this experiment, we again price N_C calls of strike K_C and N_P puts of strike K_P with the parameters of table (Tab. 5). Again, we see in figures (Fig. 52) and (Fig. 53) that the Monte Carlo estimator obtained using our adaptive importance sampling has a lower standard deviation than that of a plain Monte Carlo.

Figure 52: Standard Deviation vs x for Full Method
Figure 53: Standard Deviation vs x for Local Method

As previously, we show in figures (Fig. 54), (Fig. 56) and (Fig. 55), (Fig. 57) the surfaces ã^{θ̂*}(t, x) and a^{L,θ̂^{L,*}}(t, x) from different viewpoints. We see in figures (Fig. 56) and (Fig. 57) that, contrary to the Bachelier diffusion case, ã^{θ̂*} and a^{L,θ̂^{L,*}} do not exhibit a symmetry with respect to x = 1. This is expected, as the local volatility diffusion process does not possess the symmetry with respect to x = 1 that the Bachelier diffusion process does.

Figure 54: ã^{θ̂*}(t, x)
Figure 55: a^{L,θ̂^{L,*}}(t, x)
Figure 56: ã^{θ̂*}(t, x)
Figure 57: a^{L,θ̂^{L,*}}(t, x)

4.5.4 Multi Coupons AutoCall

For this experiment, we again price an AutoCall that has multiple coupons, with the characteristics of table (Tab. 6). Again, we see in figures (Fig. 58) and (Fig. 59) that the Monte Carlo estimator obtained using our adaptive importance sampling has a lower standard deviation than a plain Monte Carlo.

Figure 58: Standard Deviation vs x for Full Method
Figure 59: Standard Deviation vs x for Local Method

In figures (Fig. 60) and (Fig. 61), we again show the surfaces ã^{θ̂*}(t, x) and a^{L,θ̂^{L,*}}(t, x). As in the Bachelier diffusion case, we can see that the values are, roughly speaking, positive when x > 1 and negative when x < 1. This is again expected from the fact that some trajectories need to get closer to the AutoCall barrier strike region B_A = 1.5, while others need to get close to the put down-and-in strike region K_PDI = 0.5.

Figure 60: ã^{θ̂*}(t, x)
Figure 61: a^{L,θ̂^{L,*}}(t, x)

4.5.5 Single Coupon AutoCall

For this experiment, we again price an AutoCall that has a single coupon (but still multiple AutoCall barriers), with the characteristics of table (Tab. 7). Again, we see in figures (Fig. 62) and (Fig. 63) that the Monte Carlo estimator obtained using our adaptive importance sampling has a lower standard deviation than a plain Monte Carlo.

Figure 62: Standard Deviation vs x for Full Method
Figure 63: Standard Deviation vs x for Local Method

In figures (Fig. 64), (Fig. 66) and (Fig. 65), (Fig. 67), we show the surfaces ã^{θ̂*}(t, x) and a^{L,θ̂^{L,*}}(t, x) from different viewpoints. As in the Bachelier diffusion case, compared to the multiple coupons AutoCall, we can see one striking difference: for x > 1, the surface increases from time t = 0 up to the coupon date t = T_A, then suddenly drops to values close to 0 and slightly negative. This is expected, as once the coupon date has passed, there is no longer any reason to deviate trajectories towards the region x > 1. As in the Bachelier diffusion case, once the coupon date has passed, the only thing left to price is the put down-and-in, so trajectories now need to get to values closer to the put down-and-in strike K_PDI = 0.5.

Figure 64: ã^{θ̂*}(t, x)
Figure 65: a^{L,θ̂^{L,*}}(t, x)
Figure 66: ã^{θ̂*}(t, x)
Figure 67: a^{L,θ̂^{L,*}}(t, x)

5 Robustness to Changes in the Parameter Values

In the previous sections, we have considered a user who trains the neural networks prior to each pricing. In most cases, such a training takes only a small percentage of the time taken by the pricing itself, and thus adds only a small overhead. Indeed, our experience is that training the neural networks with a number of training steps equal to only 10% of the number of trajectories used in the final pricing is often enough to obtain good results. Furthermore, our algorithm could easily be improved upon by keeping the values obtained during the training step and using them in the final Monte Carlo estimator, instead of throwing them away.

However, an alternative approach is to train the neural networks only once for many pricings. For example, a bank might want to train the neural networks only once per week and use the same networks to price its book for the whole week. To test the feasibility of such a methodology, we study here how well the algorithm performs when we change each of the model parameters: x and σ for the Bachelier diffusion; x, σ, a, b, m and ρ for the local volatility diffusion.

To do this, for each payoff and each diffusion type, we train the neural networks with the parameters of tables (Tab. 8 - Tab. 17). Once the neural networks are trained, we vary each parameter from −50% to +50% of its original value, and plot the standard deviations of our algorithm and of a plain Monte Carlo. We see that in practice, the algorithm is very robust for each tested payoff and diffusion type. Indeed, for all parameters except the spot price x, the algorithm systematically outperforms the plain Monte Carlo, even though it was trained with parameters different from those used for the pricing. For the spot price, it usually outperforms the plain Monte Carlo in the range −10% to +10% for a^θ. Results for the local version a^{L,θ} are similar.

5.1 Bachelier Diffusion
5.1.1 Call

Figure 68: Standard Deviation vs x, Adaptive vs MC - Bachelier Diffusion - Call
Figure 69: Standard Deviation vs σ, Adaptive vs MC - Bachelier Diffusion - Call

5.1.2 Asymmetric Calls and Puts

Figure 70: Standard Deviation vs x, Adaptive vs MC - Bachelier Diffusion - Asymmetric Calls and Puts
Figure 71: Standard Deviation vs σ, Adaptive vs MC - Bachelier Diffusion - Asymmetric Calls and Puts

5.1.3 Symmetric Calls and Puts

Figure 72: Standard Deviation vs x, Adaptive vs MC - Bachelier Diffusion - Symmetric Calls and Puts
Figure 73: Standard Deviation vs σ, Adaptive vs MC - Bachelier Diffusion - Symmetric Calls and Puts

5.1.4 Multi Coupons AutoCall

Figure 74: Standard Deviation vs x, Adaptive vs MC - Bachelier Diffusion - Multi Coupons AutoCall
Figure 75: Standard Deviation vs σ, Adaptive vs MC - Bachelier Diffusion - Multi Coupons AutoCall

5.1.5 Single Coupon AutoCall

Figure 76: Standard Deviation vs x, Adaptive vs MC - Bachelier Diffusion - Single Coupon AutoCall
Figure 77: Standard Deviation vs σ, Adaptive vs MC - Bachelier Diffusion - Single Coupon AutoCall

5.2 Local Volatility Diffusion

5.2.1 Call

Figure 78: Standard Deviation vs x, Adaptive vs MC - LV Diffusion - Call
Figure 79: Standard Deviation vs σ, Adaptive vs MC - LV Diffusion - Call
Figure 80: Standard Deviation vs a, Adaptive vs MC - LV Diffusion - Call
Figure 81: Standard Deviation vs b, Adaptive vs MC - LV Diffusion - Call
Figure 82: Standard Deviation vs m, Adaptive vs MC - LV Diffusion - Call
Figure 83: Standard Deviation vs ρ, Adaptive vs MC - LV Diffusion - Call

5.2.2 Asymmetric Calls and Puts

Figure 84: Standard Deviation vs x, Adaptive vs MC - LV Diffusion - Asymmetric Calls and Puts
Figure 85: Standard Deviation vs σ, Adaptive vs MC - LV Diffusion - Asymmetric Calls and Puts
Figure 86: Standard Deviation vs a, Adaptive vs MC - LV Diffusion - Asymmetric Calls and Puts
Figure 87: Standard Deviation vs b, Adaptive vs MC - LV Diffusion - Asymmetric Calls and Puts
Figure 88: Standard Deviation vs m, Adaptive vs MC - LV Diffusion - Asymmetric Calls and Puts
Figure 89: Standard Deviation vs ρ, Adaptive vs MC - LV Diffusion - Asymmetric Calls and Puts

5.2.3 Symmetric Calls and Puts

Figure 90: Standard Deviation vs x, Adaptive vs MC - LV Diffusion - Symmetric Calls and Puts
Figure 91: Standard Deviation vs σ, Adaptive vs MC - LV Diffusion - Symmetric Calls and Puts
Figure 92: Standard Deviation vs a, Adaptive vs MC - LV Diffusion - Symmetric Calls and Puts
Figure 93: Standard Deviation vs b, Adaptive vs MC - LV Diffusion - Symmetric Calls and Puts
Figure 94: Standard Deviation vs m, Adaptive vs MC - LV Diffusion - Symmetric Calls and Puts
Figure 95: Standard Deviation vs ρ, Adaptive vs MC - LV Diffusion - Symmetric Calls and Puts

5.2.4 Multi Coupons AutoCall

Figure 96: Standard Deviation vs x, Adaptive vs MC - LV Diffusion - Multi Coupons AutoCall
Figure 97: Standard Deviation vs σ, Adaptive vs MC - LV Diffusion - Multi Coupons AutoCall
Figure 98: Standard Deviation vs a, Adaptive vs MC - LV Diffusion - Multi Coupons AutoCall
Figure 99: Standard Deviation vs b, Adaptive vs MC - LV Diffusion - Multi Coupons AutoCall
Figure 100: Standard Deviation vs m, Adaptive vs MC - LV Diffusion - Multi Coupons AutoCall
Figure 101: Standard Deviation vs ρ, Adaptive vs MC - LV Diffusion - Multi Coupons AutoCall

5.2.5 Single Coupon AutoCall

Figure 102: Standard Deviation vs x, Adaptive vs MC - LV Diffusion - Single Coupon AutoCall
Figure 103: Standard Deviation vs σ, Adaptive vs MC - LV Diffusion - Single Coupon AutoCall
Figure 104: Standard Deviation vs a, Adaptive vs MC - LV Diffusion - Single Coupon AutoCall
Figure 105: Standard Deviation vs b, Adaptive vs MC - LV Diffusion - Single Coupon AutoCall
Figure 106: Standard Deviation vs m, Adaptive vs MC - LV Diffusion - Single Coupon AutoCall
Figure 107: Standard Deviation vs ρ, Adaptive vs MC - LV Diffusion - Single Coupon AutoCall

The code used to generate the above graphs is available on the GitHub repository [10]. The code runs with Python 3.6.5, matplotlib 3.1.1, pandas 0.25.2, scikit-learn 0.21.3, numpy 1.16.4 and tensorflow 1.13.1.
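The robustness methodology described above, training once and then varying a model parameter around its training value, can be emulated with a small self-contained sketch. This is a toy, not the code of repository [10]: the antisymmetric drift `toy_drift` is a hypothetical stand-in for a network trained at σ = 0.2, and we then price with volatilities from −50% to +50% of that value. Because the Girsanov reweighting is valid for any adapted drift, the estimator remains unbiased at every σ; only its variance changes. The closed-form Bachelier price serves as ground truth, and all numerical values (strikes 0.8/1.2, drift size 0.3) are illustrative.

```python
import math
import numpy as np

def bachelier_exact(x0, strike, sigma, T):
    """Closed-form Bachelier call price, used as ground truth."""
    d = (x0 - strike) / (sigma * math.sqrt(T))
    phi = math.exp(-0.5 * d * d) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))
    return (x0 - strike) * Phi + sigma * math.sqrt(T) * phi

def is_estimate(sigma, drift, x0=1.0, T=1.0, n_steps=50, n_paths=200_000, seed=1):
    """Importance-sampled price of the symmetric payoff (10 calls at 1.2 plus
    10 puts at 0.8), reusing a fixed drift while the pricing volatility varies.
    Returns the estimate and the standard deviation of the per-path estimates."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.full(n_paths, x0)
    log_z = np.zeros(n_paths)
    for i in range(n_steps):
        lam = drift(i * dt, x) / sigma                     # lambda = a / sigma
        dw = math.sqrt(dt) * rng.standard_normal(n_paths)  # dW under Q
        log_z += -lam * dw - 0.5 * lam * lam * dt          # running Girsanov weight
        x += lam * sigma * dt + sigma * dw                 # step under Q
    wpay = (10.0 * np.maximum(x - 1.2, 0.0)
            + 10.0 * np.maximum(0.8 - x, 0.0)) * np.exp(log_z)
    return wpay.mean(), wpay.std()

toy_drift = lambda t, x: 0.3 * np.sign(x - 1.0)  # "trained" once at sigma = 0.2

for sigma in (0.1, 0.15, 0.2, 0.25, 0.3):        # -50% to +50% of 0.2
    exact = 20.0 * bachelier_exact(1.0, 1.2, sigma, 1.0)  # put(0.8) = call(1.2) by symmetry
    price, std = is_estimate(sigma, toy_drift)
    print(f"sigma={sigma:.2f}  exact={exact:.4f}  IS={price:.4f}  IS std={std:.3f}")
```

On this toy, the drift trained at the nominal volatility keeps beating the plain Monte Carlo across the sweep, mirroring the behavior seen in the figures above, though the factor of improvement degrades as the mismatch grows.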
Conclusion

We have presented a generic algorithm that finds the path-dependent change of measure one needs to apply, for any particular payoff, to reduce the variance of the corresponding Monte Carlo estimator. We have shown in the numerical results of section 4 that this enables, for a wide range of payoffs, a reduction of the variance by a factor between 2 and 9. In section 5, we have shown that even if one uses the Deep Importance Sampling algorithm with parameter values different from those used during the training, these values can vary by large amounts (10% for the spot price, 40% for the volatility parameters) while the algorithm still performs better than a plain Monte Carlo.

However, if one were to implement this algorithm, one could still improve it in multiple ways. For an on-the-fly version of the Deep Importance Sampling algorithm, contrary to what we did in this paper, one might want to keep the values obtained during the training so as to use them in the final Monte Carlo estimator, instead of throwing them away. Furthermore, if one is worried about the Deep Importance Sampling algorithm performing less well than a plain Monte Carlo, one can estimate on the fly the standard deviations obtained both via the Deep Importance Sampling algorithm and via a plain Monte Carlo, and automatically switch back to the plain Monte Carlo if results are unsatisfactory. For a user wishing to train the neural networks only on a regular basis (such as once per week), training the neural networks on multiple initial spot prices x should make the Deep Importance Sampling algorithm more robust to changes in the spot price.

Although we have not explored these applications, the method can also naturally be extended to payoffs with multiple underlyings, as well as to diffusion models with more than one driving Brownian motion. One might then want to add all relevant factors to the input of a^θ. The Deep Importance Sampling algorithm should also be useful for rare event simulation, where one might expect even larger gains in variance reduction than in the examples presented in this paper.

References

[1] Jun S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Science & Business Media, 2008.
[2] Phelim Boyle, Mark Broadie, and Paul Glasserman. Monte Carlo methods for security pricing. Journal of Economic Dynamics and Control, 21(8-9):1267–1321, 1997.
[3] Paul Glasserman. Monte Carlo Methods in Financial Engineering, volume 53. Springer Science & Business Media, 2013.
[4] Thomas Müller, Brian McWilliams, Fabrice Rousselle, Markus Gross, and Jan Novák. Neural importance sampling. arXiv preprint arXiv:1808.03856, 2018.
[5] Shixiang Shane Gu, Zoubin Ghahramani, and Richard E. Turner. Neural adaptive sequential Monte Carlo. In Advances in Neural Information Processing Systems, pages 2629–2637, 2015.
[6] Bouhari Arouna. Variance reduction and Robbins-Monro algorithms. Technical report, CERMICS, 2002.
[7] Vincent Lemaire, Gilles Pagès, et al. Unconstrained recursive importance sampling. The Annals of Applied Probability, 20(3):1029–1067, 2010.
[8] Jim Gatheral and Antoine Jacquier. Arbitrage-free SVI volatility surfaces. Quantitative Finance, 14(1):59–71, 2014.
[9] Rafael Balestro Dias da Silva. Backtesting SVI parameterization of implied volatilities. Master's thesis, Instituto Nacional de Matemática Pura e Aplicada, Rio de Janeiro, 2016.
[10] Benjamin Virrion. Deep importance sampling code repository. https://github.com/bvirrion/deep-importance-sampling, 2020.
Parameter Tables