A deep learning approach for computations of exposure profiles for high-dimensional Bermudan options
Kristoffer Andersson ∗ Cornelis W. Oosterlee ∗† September 14, 2020
Abstract
In this paper, we propose a neural network-based method for approximating expected exposures and potential future exposures of Bermudan options. In a first phase, the method relies on the Deep Optimal Stopping algorithm (DOS) proposed in [1], which learns the optimal stopping rule from Monte-Carlo samples of the underlying risk factors. Cashflow paths are then created by applying the learned stopping strategy on a new set of realizations of the risk factors. Furthermore, in a second phase, the cashflow paths are projected onto the risk factors to obtain approximations of pathwise option values. The regression step is carried out by ordinary least squares as well as by neural networks, and it is shown that the latter produces more accurate approximations.

The expected exposure is formulated both in terms of the cashflow paths and in terms of the pathwise option values, and it is shown that a simple Monte-Carlo average yields accurate approximations in both cases. The potential future exposure is estimated by the empirical $\alpha$-percentile.

Finally, it is shown that the expected exposures, as well as the potential future exposures, can be computed under either the risk neutral measure or the real world measure, without having to re-train the neural networks.

∗ Research Group of Scientific Computing, Centrum Wiskunde & Informatica
† Delft Institute of Applied Mathematics (DIAM), Delft University of Technology

arXiv: [q-fin.CP]
The exposure of a financial contract is the maximum amount that an investor stands to lose if the counterparty is unable to fulfill its obligations, for instance, due to a default. This means that, in addition to the market risk, a so-called counterparty credit risk (CCR) needs to be accounted for. Furthermore, the liquidity risk, which is the risk arising from potential costs of unwinding a position, is also closely related to the financial exposure. Over the counter (OTC) derivatives, i.e., contracts written directly between counterparties, instead of through a central clearing party (CCP), are today mainly subject to so-called valuation adjustments (XVA). These valuation adjustments aim to adjust the value of an OTC derivative for certain risk factors, e.g., credit valuation adjustment (CVA), adjusting the value for CCR, funding valuation adjustment (FVA), adjusting for the funding cost of an uncollateralized derivative, or capital valuation adjustment (KVA), adjusting for future capital costs. The financial exposure is central in calculations of many of the XVAs (for an in-depth overview of XVAs, we refer to [2] and [3]). In this paper, we therefore focus on computations of financial exposures of options.

For a European style option, the exposure is simply the option value at some future time, given all the financial information available today. If $t_0$ is the time today and $X_t$ the $d$-dimensional vector of underlying risk factors of an option, we define the exposure of the option, at time $t > t_0$, as the random variable
$$E^{\text{Eur}}_t = V_t(X_t).$$
However, for derivatives with early-exercise features, we also need to take into account the possibility that the option has been exercised prior to $t$, sending the exposure to zero. The most well-known of such options is arguably the American option, which gives the holder (or buyer) the right to exercise the option at any time between $t_0$ and maturity $T$.
In this paper, the focus is on the Bermudan option, which gives the holder the right to exercise at finitely many, predefined exercise dates between $t_0$ and maturity $T$. For early-exercise options, the exposure needs to be adjusted for the possibility that the stopping time $\tau$, representing the instance of time at which the option is exercised, is smaller than or equal to $t$. This leads to the following definition of the exposure
$$E^{\text{Ber}}_t = V_t(X_t)\,\mathbb{I}_{\{\tau > t\}},$$
where $\mathbb{I}_{\{\cdot\}}$ is the indicator function. Two of the most common ways to measure the exposure are the expected exposure (EE), which is the expected future value of the exposure, and the potential future exposure (PFE), which is some $\alpha$-percentile of the future exposure (usually $\alpha = 97.5\%$ for the upper, and $\alpha = 2.5\%$ for the lower tail of the distribution). To accurately compute EE and PFE of a Bermudan option, it is crucial to use an algorithm which is able to accurately approximate not only $V_t(X_t)$, but also $\mathbb{I}_{\{\tau > t\}}$.

It is common to compute the EE and the PFE with simulation-based methods, usually from realizations of approximate exposures, from which the distribution of $E_t$ is approximated by its empirical counterpart. A generic scheme for approximation of exposure profiles is given below:
1. At each exercise date, find a representation, $v_t\colon\mathbb{R}^d\to\mathbb{R}$, of the true value function $V_t(\cdot)$;
2. Generate trajectories of the underlying risk factors, and at each exercise date, evaluate $v_t(X_t(\omega))$;
3. At the earliest exercise date where $v_t(X_t(\omega)) \le g(X_t(\omega))$, where $g$ is the pay-off function, set $\tau(\omega) = t$;
4. The distribution of $E^{\text{Ber}}_t$ is then approximated by the empirical distribution created by many trajectories of the form $v_t(X_t(\omega))\,\mathbb{I}_{\{\tau(\omega) > t\}}$. For instance, if the target statistic is the EE, the estimation is the sample mean of the exposure paths.

Step 1 above corresponds to the option valuation problem, which can be tackled by several different methods, all with their own advantages and disadvantages. For instance, the value function can be approximated by solving an associated partial differential equation (PDE), which is done in e.g., [9], [10], [11], [12] and [13], or the value function can be approximated by a Fourier transform methodology, which is done in e.g., [14], [15] and [16]. Furthermore, classical tree-based methods such as [17], [18] and [19] can be used. These types of methods are, however, typically only computationally feasible in low dimensions (say, up to $d = 4$), see [20]. In higher dimensions, Monte-Carlo-based methods are often used, see e.g., [21], [22], [23], [24] and [25]. Monte-Carlo-based methods can generate highly accurate option values at $t_0$, i.e., $v_{t_0}(X_{t_0} = x_{t_0})$, given that all trajectories of the underlying risk factors are initiated at some deterministic $x_{t_0}$. However, at intermediate exercise dates, $v_t(X_t(\omega))$ might not be an equally good approximation, due to the need of cross-sectional regression. We show with numerical examples that, even though the approximation is good on average, the option value is underestimated in certain regions, which is compensated by overestimated values in other regions. For European options, this seems to have a minor effect on EE and PFE, but for Bermudan options the effect can be significant due to step 3 above. To provide an intuitive understanding of the problem, we give an illustrative example. Assume that $v_t$ underestimates the option value in some region A, which is close to the exercise boundary, and overestimates the option value in some region B, where it is clearly optimal to exercise the option. The effect on the exposure would be an overestimation in region A, since underestimated option values would lead to fewer exercised options. In region B, the exposure would be zero in both cases, since all options in that region would be exercised immediately. In total, this would lead to overestimated exposure. In numerical examples this is, of course, more involved, and we typically see several regions with different levels of under/overestimated values. This makes the phenomenon difficult to analyze, since the effect may lead to underestimated exposure, unchanged exposure (from offsetting effects), or overestimated exposure.

Footnote: X represents arbitrary letters, e.g., "C" for credit valuation adjustment, "F" for funding valuation adjustment, etc.
Footnote: Usually, the exposure is defined as $\max\{V_t(X_t), 0\}$, but since we only consider options with non-negative values, the max operator is omitted.
Footnote: We define a filtered probability space in the next section, from which $\omega$ is an element. In this section, $\omega$ should be viewed as an outcome of a random experiment.
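The generic four-step scheme above reduces to a few array operations once simulated values are available. The sketch below is our illustration (not code from the paper): `v` holds hypothetical approximate option values $v_{t_n}(X_{t_n}(\omega))$ and `payoff` the corresponding immediate pay-offs, from which the empirical EE and PFE are computed.

```python
import numpy as np

def exposure_profiles(v, payoff, alpha=0.975):
    """Empirical EE and PFE from approximate pathwise option values.

    v, payoff: arrays of shape (M, N+1): option-value approximations and
               immediate pay-offs at each exercise date (steps 1-2).
    Returns (EE, PFE) arrays of length N+1 (step 4).
    """
    n_dates = v.shape[1]
    # Step 3: the earliest date where the approximate value does not
    # exceed the pay-off is the (approximate) exercise date per path.
    exercised = v <= payoff
    tau_idx = np.argmax(exercised, axis=1)
    tau_idx[~exercised.any(axis=1)] = n_dates - 1  # never exercised -> maturity
    # Exposure is v * I{tau > t_n}: zero from the exercise date onward.
    alive = np.arange(n_dates)[None, :] < tau_idx[:, None]
    exposure = v * alive
    return exposure.mean(axis=0), np.quantile(exposure, alpha, axis=0)

# Toy check: two paths, three exercise dates.
v = np.array([[2.0, 1.0, 0.0], [2.0, 3.0, 1.0]])
payoff = np.array([[0.0, 1.0, 0.0], [0.0, 2.0, 1.0]])
ee, pfe = exposure_profiles(v, payoff)
```

Note that the PFE estimate is simply the empirical $\alpha$-quantile of the exposure paths, exactly as described in step 4.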
In addition to the classical regression methods with, e.g., polynomial basis functions, which are cited above, there are several papers in which a neural network plays the role of the basis functions, see e.g., [26], [27], [28] and [29].

In this paper we suggest the use of the Deep Optimal Stopping (DOS) algorithm, proposed in [1], to approximate the optimal stopping strategy. The DOS algorithm approximates the optimal stopping strategy by expressing the stopping time in terms of binary decision functions, which are approximated by deep neural networks. The DOS algorithm is very suitable for exposure computations, since the stopping strategy is computed directly, not as a consequence of the approximate value function, as is the case with most other algorithms. This procedure leads to highly accurate approximations of the exercise boundaries. Furthermore, we propose a neural network-based regression method (NN-regression) to approximate $V_t(\cdot)$, which relies on the stopping strategy approximated by the DOS algorithm. Although NN-regression is needed in order to fully approximate the distributions of future exposures, it turns out that for some target statistics it is in fact sufficient to approximate the optimal stopping strategy. This is, for instance, the case for some estimates of the EE.

Another advantage of the method is that the EE and the PFE can easily be computed under the risk neutral measure, as well as under the real world measure. When the EE is used to compute CVA, the calculations should be done under the risk neutral measure, since CVA is a tradable asset and any other measure would create opportunities for arbitrage. For other applications, such as computations of KVA (valuation adjustment to account for future capital costs associated with a derivative), the EE should be calculated under the real world measure.
The reason for this is that the KVA, among other things, depends on the required CCR capital, which is a function of the EE, which in this case, ideally, should be calculated under the real world measure. For an explanation of KVA in general, see e.g., [30], and for a discussion about the effect of different measures in KVA computations, see [31].

Finally, we emphasize that, even though the focus of this paper is to compute exposure profiles, the algorithms proposed are flexible, and with small adjustments other kinds of risk measures could be computed. The first reason for only studying exposure profiles is, as mentioned above, that they are an important building block in computations (or approximations) of many XVAs as well as of expected shortfall. Another reason is that there are a number of similar studies in the existing literature (see e.g., [4], [5], [6]). However, it should be pointed out that the algorithms proposed in this paper are not limited to computations of exposures. As an example, consider the CVA of a Bermudan option, expressed as the difference between the risky (including CCR) and the risk-free (excluding CCR) option values. To accurately compute the risky option value, one needs to adjust the exercise strategy for the risk of a default of the counterparty. Therefore, it is not sufficient to compute the exposure of the option when computing the CVA (for a discussion see e.g., [7], [8]). With minor adjustments to the proposed algorithm, one could compute this kind of advanced CVA (as opposed to the approximation, which is based on the exposure).

This paper is structured as follows: In Section 2, the mathematical formulation of a Bermudan option and its exposure profiles are given. Furthermore, the Bermudan option, as well as the exposure profiles, are reformulated to fit the algorithms introduced in later sections. The DOS algorithm is described in Section 3, and we propose some adjustments to make the training procedure more efficient.
In Section 4, we present a classical ordinary least squares regression-based method (OLS-regression), as well as the NN-regression, to approximate pathwise option values. In Section 5, the EE and PFE formulas are reformulated in a way such that they can easily be estimated by a combination of the algorithms presented in Sections 3 and 4, and a simple Monte-Carlo sampling.

Footnote: For a fixed numéraire.
Footnote: In this setting, the EE is an expectation under the real world measure, which is conditioned on a filtration generated by the underlying risk factors.
For $d, N \in \mathbb{N}\setminus\{0\}$, let $X = (X_{t_n})_{n=0}^N$ be an $\mathbb{R}^d$-valued discrete-time Markov process on a complete probability space $(\Omega, \mathcal{F}, \mathbb{A})$. The outcome set $\Omega$ is the set of all possible realizations of the stochastic economy, $\mathcal{F}$ is a $\sigma$-algebra on $\Omega$, and we define $\mathcal{F}_n$ as the sub-$\sigma$-algebra generated by $(X_{t_m})_{m=0}^n$. With little loss of generality, we restrict ourselves to the case when $X$ is constructed from time snapshots of a continuous-time Markov process at monitoring dates $\{t_0, t_1, \ldots, t_N\}$. The probability measure $\mathbb{A}$ is a generic notation, representing either the real world measure or the risk neutral measure, denoted $\mathbb{P}$ and $\mathbb{Q}$, respectively. If not specifically stated otherwise, equalities and inequalities of random variables should be interpreted in an $\omega$-wise sense.

A Bermudan option is an exotic derivative that gives the holder the opportunity to exercise the option at a finite number of exercise dates, typically one per month. We define the exercise dates as $\mathbb{T} = \{t_0, t_1, \ldots, t_N\}$, which for simplicity coincide with the monitoring dates. Furthermore, for $t_n \in \mathbb{T}$, we let the remaining exercise dates be defined as $\mathbb{T}_n = \{t_n, t_{n+1}, \ldots, t_N\}$. Let $\tau$ be an $X$-stopping time, i.e., a random variable defined on $(\Omega, \mathcal{F}, \mathbb{A})$, taking on values in $\mathbb{T}$ such that for all $t_n \in \mathbb{T}$, it holds that the event $\{\tau = t_n\} \in \mathcal{F}_n$. Assume a risk-free rate $r \in \mathbb{R}$ and let the risk-less savings account process, $M(t) = e^{r(t - t_0)}$, be our numéraire. For all $t \in \mathbb{T}$, let $g_t\colon \mathbb{R}^d \to \mathbb{R}$ be a measurable function which returns the immediate pay-off of the option, if exercised at market state $(t, X_t = x \in \mathbb{R}^d)$. The initial value of a Bermudan option, i.e., the value at market state $(t_0, X_{t_0} = x_{t_0} \in \mathbb{R}^d)$, is given by
$$V_{t_0}(x_{t_0}) = \sup_{\tau \in \mathcal{T}} \mathbb{E}_{\mathbb{Q}}\left[ e^{-r(\tau - t_0)} g_\tau(X_\tau) \,\middle|\, X_{t_0} = x_{t_0} \right], \quad (1)$$
where $\mathcal{T}$ is the set of all $X$-stopping times.
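As a concrete illustration of (1) (our example, not from the original text): for a Bermudan put on a single underlying with strike $K$, the pay-off function is $g_t(x) = \max(K - x, 0)$ at every exercise date, and (1) specializes to

```latex
% Illustrative special case: Bermudan put, pay-off g_t(x) = (K - x)^+.
V_{t_0}(x_{t_0}) \;=\; \sup_{\tau \in \mathcal{T}}
  \mathbb{E}_{\mathbb{Q}}\!\left[ e^{-r(\tau - t_0)}
  \max\!\left(K - X_\tau,\, 0\right) \,\middle|\, X_{t_0} = x_{t_0} \right],
```

so the holder trades off intrinsic value now against discounted intrinsic value at later exercise dates.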
Assuming the option has not been exercised prior to some $t_n \in \mathbb{T}$, the option value at $t_n$ is given by
$$V_{t_n}(X_{t_n}) = \sup_{\tau \in \mathcal{T}_n} \mathbb{E}^n_{\mathbb{Q}}\left[ e^{-r(\tau - t_n)} g_\tau(X_\tau) \right], \quad (2)$$
where $\mathcal{T}_n$ is the set of all $X$-stopping times greater than or equal to $t_n$, and we define $\mathbb{E}^n_{\mathbb{A}}[\,\cdot\,] = \mathbb{E}_{\mathbb{A}}[\,\cdot \mid \mathcal{F}_n]$. To guarantee that (1) and (2) are well-posed, we assume that for all $t \in \mathbb{T}$ it holds that $\mathbb{E}_{\mathbb{Q}}[|g_t(X_t)|] < \infty$.

To concretize (1) and (2), we view the problem from the holder's perspective. At each exercise date, $t_n \in \mathbb{T}$, the holder of the option is facing the decision whether or not to exercise the option and receive the immediate pay-off. Due to the Markovian nature of $X$, the decisions are of Markov-type (Markov decision process (MDP)), meaning that all the information needed to make an optimal decision is contained in the current market state, $(t_n, X_{t_n})$. With this in mind, we define for each $t_n \in \mathbb{T}$ the exercise region, $\mathcal{E}_n$, in which it is optimal to exercise, and the continuation region, $\mathcal{C}_n$, in which it is optimal to hold on, by
$$\mathcal{E}_n = \left\{x \in \mathbb{R}^d \mid V_{t_n}(x) = g_{t_n}(x)\right\}, \qquad \mathcal{C}_n = \left\{x \in \mathbb{R}^d \mid V_{t_n}(x) > g_{t_n}(x)\right\}.$$
Note that $\mathcal{E}_n \cup \mathcal{C}_n = \mathbb{R}^d$ and $\mathcal{E}_n \cap \mathcal{C}_n = \emptyset$.

Below, we give a short motivation for how these decisions can be expressed mathematically and how we can formulate a stopping time in terms of (stopping) decisions at each exercise date. For a more detailed motivation we refer to [1]. We introduce the following notation for measurable functions
$$\mathcal{D}(A_1; A_2) = \{f\colon A_1 \to A_2 \mid f \text{ measurable}\}.$$
Furthermore, for $P \in \mathbb{N}$, let $\mathcal{D}(A_1; A_2)^P$ denote the $P$-fold Cartesian product of the set $\mathcal{D}(A_1; A_2)$. Define the decision functions $f_0, f_1, \ldots, f_N \in \mathcal{D}(\mathbb{R}^d; \{0, 1\})$, with $f_N \equiv$
$1$, and denote $\mathbf{f}_n = (f_n, f_{n+1}, \ldots, f_N)$, with $\mathbf{f} = \mathbf{f}_0$. An $X$-stopping time can then be defined as
$$\tau_n(\mathbf{f}_n) = \sum_{m=n}^{N} t_m f_m(X_{t_m}) \prod_{j=n}^{m-1} \left(1 - f_j(X_{t_j})\right), \quad (3)$$
where the empty product is defined as 1. We emphasize that $\tau_n(\mathbf{f}_n) = \tau_n\left(f_n(X_{t_n}), f_{n+1}(X_{t_{n+1}}), \ldots, f_N(X_{t_N})\right)$, but to make the notation cleaner, we do not specify this dependency explicitly. When it is not clear from the context which process the decision function $f_n$ is acting on, we will denote the stopping time by $\tau_n(\mathbf{f}_n(X))$, where we recall that $X = (X_{t_m})_{m=n}^N$. As a candidate for optimal stopping decisions, we define
$$\mathbf{f}^* \in \arg\max_{\mathbf{f} \in \mathcal{D}(\mathbb{R}^d; \{0,1\})^{N+1}} \mathbb{E}_{\mathbb{Q}}\left[ e^{-r(\tau(\mathbf{f}) - t_0)} g_{\tau(\mathbf{f})}\left(X_{\tau(\mathbf{f})}\right) \,\middle|\, X_{t_0} = x_{t_0} \right]. \quad (4)$$
We define the fair price of an option that follows the strategy constructed by combining (3) and (4) as
$$V^*_{t_n}(X_{t_n}) = \mathbb{E}^n_{\mathbb{Q}}\left[ e^{-r(\tau_n(\mathbf{f}^*_n) - t_n)} g_{\tau_n(\mathbf{f}^*_n)}\left(X_{\tau_n(\mathbf{f}^*_n)}\right)\right]. \quad (5)$$
The following theorem states that (5), in fact, coincides with the fair price of the option as defined in (2).

Footnote: Allowing for different pay-off functions at different exercise dates makes the framework more flexible. We can, e.g., let $g_t$ be the pay-off function for an entire netting set of derivatives with different maturities.
Footnote: In the sense of using a decision policy such that the supremum in (1) or (2) is attained.
Footnote: We assume measurable spaces $(A_1, \mathcal{A}_1)$ and $(A_2, \mathcal{A}_2)$ and measurable functions with respect to the $\sigma$-algebras $\mathcal{A}_1$ and $\mathcal{A}_2$. This assumption holds in all cases in this paper.

Theorem 2.1.
For all $t_n \in \mathbb{T}$, and $V$ and $V^*$ as defined in (2) and (5), respectively, it holds that $V_{t_n}(X_{t_n}) = V^*_{t_n}(X_{t_n})$.

Proof.
Note that $V_{t_N}(X_{t_N}) = V^*_{t_N}(X_{t_N}) = g_{t_N}(X_{t_N})$. The proof is carried out by backward induction, and we assume that for some $t_{n+1} \in \mathbb{T}$ it holds that
$$V_{t_{n+1}}(X_{t_{n+1}}) = V^*_{t_{n+1}}(X_{t_{n+1}}). \quad (6)$$
We can rewrite $V^*_{t_n}(X_{t_n})$ as
$$V^*_{t_n}(X_{t_n}) = \mathbb{E}^n_{\mathbb{Q}}\left[ e^{-r(\tau_n(\mathbf{f}^*_n) - t_n)} g_{\tau_n(\mathbf{f}^*_n)}(X_{\tau_n(\mathbf{f}^*_n)}) \right] = g_{t_n}(X_{t_n}) f^*_n(X_{t_n}) + \left(1 - f^*_n(X_{t_n})\right) e^{-r(t_{n+1} - t_n)} \, \mathbb{E}^n_{\mathbb{Q}}\left[ e^{-r(\tau_{n+1}(\mathbf{f}^*_{n+1}) - t_{n+1})} g_{\tau_{n+1}(\mathbf{f}^*_{n+1})}(X_{\tau_{n+1}(\mathbf{f}^*_{n+1})}) \right]. \quad (7)$$
By the law of total expectation and the assumption (6), the last conditional expectation satisfies
$$\mathbb{E}^n_{\mathbb{Q}}\left[ e^{-r(\tau_{n+1}(\mathbf{f}^*_{n+1}) - t_{n+1})} g_{\tau_{n+1}(\mathbf{f}^*_{n+1})}(X_{\tau_{n+1}(\mathbf{f}^*_{n+1})}) \right] = \mathbb{E}^n_{\mathbb{Q}}\left[ V^*_{t_{n+1}}(X_{t_{n+1}}) \right] = \mathbb{E}^n_{\mathbb{Q}}\left[ V_{t_{n+1}}(X_{t_{n+1}}) \right]. \quad (8)$$
We insert the rightmost part of (8) in (7) and note that $\mathbb{I}_{\{\cdot \in \mathcal{E}_n\}} \in \mathcal{D}(\mathbb{R}^d; \{0,1\})$, which implies that
$$V^*_{t_n}(X_{t_n}) \ge \mathbb{I}_{\{X_{t_n} \in \mathcal{E}_n\}} g_{t_n}(X_{t_n}) + \mathbb{I}_{\{X_{t_n} \in \mathcal{C}_n\}} e^{-r(t_{n+1} - t_n)} \mathbb{E}^n_{\mathbb{Q}}\left[ V_{t_{n+1}}(X_{t_{n+1}}) \right] = V_{t_n}(X_{t_n}).$$
Moreover, $V^*_{t_n}(X_{t_n}) \le V_{t_n}(X_{t_n})$, since $\tau_n(\mathbf{f}^*_n)$ is one particular stopping time in $\mathcal{T}_n$, and therefore we conclude that $V^*_{t_n}(X_{t_n}) = V_{t_n}(X_{t_n})$. Alternatively, the result follows directly from [1, Theorem 1].

For $t_n \in \mathbb{T}$ and $\alpha \in (0, 1)$, we define the expected exposure and the potential future exposure under the measure $\mathbb{A}$ as
$$\text{EE}_{\mathbb{A}}(t_n) = \mathbb{E}_{\mathbb{A}}\left[ V_{t_n}(X_{t_n}) \, \mathbb{I}_{\{\tau(\mathbf{f}^*) > t_n\}} \,\middle|\, X_{t_0} = x_{t_0} \right], \quad (9)$$
$$\text{PFE}^\alpha_{\mathbb{A}}(t_n) = \inf\left\{ a \in \mathbb{R} \,\middle|\, \mathbb{A}\left( V_{t_n}(X_{t_n}) \, \mathbb{I}_{\{\tau(\mathbf{f}^*) > t_n\}} \le a \,\middle|\, X_{t_0} = x_{t_0} \right) \ge \alpha \right\}. \quad (10)$$
Note that the option value, given by (2), is a conditional expectation of future cashflows, which is by definition measured with $\mathbb{Q}$.
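The stopping time (3) and the indicator $\mathbb{I}_{\{\tau(\mathbf{f}^*) > t_n\}}$ appearing in (9) and (10) are easy to evaluate path by path once the binary decisions are known. A minimal sketch (our illustration; the decision arrays below are placeholders for outputs of the decision functions):

```python
import numpy as np

def stopping_index(decisions):
    """Evaluate the stopping time of eq. (3) path by path.

    decisions: (M, N+1) array of binary decisions f_n(X_{t_n}(omega)),
               with decisions[:, N] == 1 (always exercise at maturity).
    Returns the index n such that tau(omega) = t_n for each path.
    The product term in (3) kills all dates after the first decision 1,
    so tau is simply the first date where the decision equals 1.
    """
    return np.argmax(decisions == 1, axis=1)

# Toy example with N = 3 (four exercise dates) and M = 2 paths.
f = np.array([[0, 1, 0, 1],   # exercised at t_1
              [0, 0, 0, 1]])  # held until maturity t_3
tau_idx = stopping_index(f)
# Indicator I{tau > t_n} for n = 1, as used in (9)-(10):
alive_at_t1 = tau_idx > 1
```

Exposure statistics then follow by averaging (for the EE) or taking empirical quantiles (for the PFE) over such paths.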
The exposure profiles under the $\mathbb{P}$-measure, on the other hand, are statistics of the option value at some future time, $t_n$, where the conditional distribution of $X_{t_n}$ given $X_{t_0} = x_{t_0}$ is considered under the $\mathbb{P}$-measure. In the theorem below, the expected exposures are reformulated in terms of discounted cashflows and the decision functions introduced in Subsection 2.1. It will become clear in later sections that this is a more tractable form for the algorithms used in this paper. We recall that equalities and inequalities are in an $\mathbb{A}$-almost sure sense.

Theorem 2.2. Let $\mathbb{Q}$ and $\mathbb{P}$ be equivalent probability measures on the measurable space $(\Omega, \mathcal{F})$ and assume that the laws of $X$ under $\mathbb{Q}$ and $\mathbb{P}$ are absolutely continuous with respect to the Lebesgue measure. Then, the expected exposure (9) under the $\mathbb{Q}$- and $\mathbb{P}$-measures satisfies
$$\text{EE}_{\mathbb{Q}}(t_n) = \mathbb{E}_{\mathbb{Q}}\left[ e^{-r(\tau_n(\mathbf{f}^*_n) - t_n)} g_{\tau_n(\mathbf{f}^*_n)}(X_{\tau_n(\mathbf{f}^*_n)}) \, \mathbb{I}_{\{\tau(\mathbf{f}^*) > t_n\}} \,\middle|\, X_{t_0} = x_{t_0} \right], \quad (11)$$
$$\text{EE}_{\mathbb{P}}(t_n) = \mathbb{E}_{\mathbb{Q}}\left[ e^{-r(\tau_n(\mathbf{f}^*_n) - t_n)} g_{\tau_n(\mathbf{f}^*_n)}(X_{\tau_n(\mathbf{f}^*_n)}) \, \mathbb{I}_{\{\tau(\mathbf{f}^*) > t_n\}} \, l(X_{t_n}, X_{t_{n-1}}, \ldots, X_{t_0}) \,\middle|\, X_{t_0} = x_{t_0} \right], \quad (12)$$
where
$$l(y_n, y_{n-1}, \ldots, y_0) = \prod_{k=1}^{n} \frac{p_{X_{t_k} \mid X_{t_{k-1}}}(y_k \mid y_{k-1})}{q_{X_{t_k} \mid X_{t_{k-1}}}(y_k \mid y_{k-1})},$$
with $q_{X_{t_k} \mid X_{t_{k-1}}}(y_k \mid y_{k-1})$ and $p_{X_{t_k} \mid X_{t_{k-1}}}(y_k \mid y_{k-1})$ being transition densities for $X$ under the measures $\mathbb{Q}$ and $\mathbb{P}$, respectively. Note that $l(y_n, y_{n-1}, \ldots, y_0)$ is the Radon–Nikodym derivative of $\mathbb{P}$ with respect to $\mathbb{Q}$ evaluated at $(y_n, y_{n-1}, \ldots, y_0)$.

Proof. We begin by proving (11). By combining (5) and (9) and setting $\mathbb{A} = \mathbb{Q}$, we obtain
$$\text{EE}_{\mathbb{Q}}(t_n) = \mathbb{E}_{\mathbb{Q}}\left[ \mathbb{E}^n_{\mathbb{Q}}\left[ e^{-r(\tau_n(\mathbf{f}^*_n) - t_n)} g_{\tau_n(\mathbf{f}^*_n)}(X_{\tau_n(\mathbf{f}^*_n)}) \right] \mathbb{I}_{\{\tau(\mathbf{f}^*) > t_n\}} \,\middle|\, X_{t_0} = x_{t_0} \right] = \mathbb{E}_{\mathbb{Q}}\left[ e^{-r(\tau_n(\mathbf{f}^*_n) - t_n)} g_{\tau_n(\mathbf{f}^*_n)}(X_{\tau_n(\mathbf{f}^*_n)}) \, \mathbb{I}_{\{\tau(\mathbf{f}^*) > t_n\}} \,\middle|\, X_{t_0} = x_{t_0} \right],$$
where the final step is justified by the law of total expectation.
The expected exposure under the $\mathbb{P}$-measure can be rewritten in the following way:
$$\text{EE}_{\mathbb{P}}(t_n) = \mathbb{E}_{\mathbb{P}}\left[ \mathbb{E}^n_{\mathbb{Q}}\left[ e^{-r(\tau_n(\mathbf{f}^*_n) - t_n)} g_{\tau_n(\mathbf{f}^*_n)}(X_{\tau_n(\mathbf{f}^*_n)}) \right] \mathbb{I}_{\{\tau(\mathbf{f}^*) > t_n\}} \,\middle|\, X_{t_0} = x_{t_0} \right]$$
$$= \int_{(y_n, \ldots, y_1) \in \mathbb{R}^d \times \cdots \times \mathbb{R}^d} \mathbb{E}_{\mathbb{Q}}\left[ e^{-r(\tau_n(\mathbf{f}^*_n) - t_n)} g_{\tau_n(\mathbf{f}^*_n)}(X_{\tau_n(\mathbf{f}^*_n)}) \,\middle|\, X_{t_n} = y_n \right] \mathbb{I}_{\{\tau(\mathbf{f}^*) > t_n\}} \, p_{X_{t_n}, \ldots, X_{t_1}}(y_n, \ldots, y_1) \, \mathrm{d}y_n \cdots \mathrm{d}y_1$$
$$= \int_{(y_n, \ldots, y_1) \in \mathbb{R}^d \times \cdots \times \mathbb{R}^d} \mathbb{E}_{\mathbb{Q}}\left[ e^{-r(\tau_n(\mathbf{f}^*_n) - t_n)} g_{\tau_n(\mathbf{f}^*_n)}(X_{\tau_n(\mathbf{f}^*_n)}) \,\middle|\, X_{t_n} = y_n \right] \mathbb{I}_{\{\tau(\mathbf{f}^*) > t_n\}} \, \frac{p_{X_{t_n}, \ldots, X_{t_1}}(y_n, \ldots, y_1)}{q_{X_{t_n}, \ldots, X_{t_1}}(y_n, \ldots, y_1)} \, q_{X_{t_n}, \ldots, X_{t_1}}(y_n, \ldots, y_1) \, \mathrm{d}y_n \cdots \mathrm{d}y_1, \quad (13)$$
where $p_{X_{t_n}, \ldots, X_{t_1}}(y_n, \ldots, y_1)$ and $q_{X_{t_n}, \ldots, X_{t_1}}(y_n, \ldots, y_1)$ are joint densities of $X_{t_n}, \ldots, X_{t_1}$ (conditioned on $X_{t_0} = x_{t_0}$) under the measures $\mathbb{P}$ and $\mathbb{Q}$, respectively. The fact that $\mathbb{P}$ and $\mathbb{Q}$ are equivalent guarantees that the quotient in (13) is well defined. Furthermore, due to the Markov property of $X$, we have
$$p_{X_{t_n}, \ldots, X_{t_1}}(y_n, \ldots, y_1) = p_{X_{t_n} \mid X_{t_{n-1}}}(y_n \mid y_{n-1}) \times \cdots \times p_{X_{t_1} \mid X_{t_0}}(y_1 \mid y_0). \quad (14)$$
The same argumentation holds for $q_{X_{t_n}, \ldots, X_{t_1}}(y_n, \ldots, y_1)$. The proof is finalized by inserting the product of density functions (14) in (13), writing the expression as a $\mathbb{Q}$-expectation, and, again, using the law of total expectation.

Theorem 2.2 opens up for approximations of the expected exposure directly from the discounted cashflows. We now consider the special case when $X$ is described by a diffusion-type stochastic differential equation (SDE). Let $\mu_{\mathbb{Q}}\colon [t_0, t_N] \times \mathbb{R}^d \to \mathbb{R}^d$ and $\sigma\colon [t_0, t_N] \times \mathbb{R}^d \to \mathbb{R}^{d \times d}$ be the drift and diffusion coefficients, respectively, and let $(W^{\mathbb{Q}}_t)_{t \in [t_0, t_N]}$ be a $d$-dimensional standard Brownian motion under the measure $\mathbb{Q}$. We assume that $\mu_{\mathbb{Q}}$ and $\sigma$ satisfy the usual conditions (see e.g.
, [32]) for the existence of a unique, strong solution $X$ to
$$\mathrm{d}X_t = \mu_{\mathbb{Q}}(t, X_t)\,\mathrm{d}t + \sigma(t, X_t)\,\mathrm{d}W^{\mathbb{Q}}_t, \quad t \in [t_0, t_N]; \qquad X_{t_0} = x_{t_0}.$$
Let $\mu_{\mathbb{P}}\colon [t_0, t_N] \times \mathbb{R}^d \to \mathbb{R}^d$ and assume that $\sigma(t, X_t)$ is invertible and that, for $t \in [t_0, t_N]$, $\|\sigma^{-1}(t, X_t)\|_{\max}$ is bounded almost surely. Then, $(W^{\mathbb{P}}_t)_{t \in [t_0, t_N]}$, given by
$$\mathrm{d}W^{\mathbb{P}}_t = -\sigma^{-1}(t, X_t)\left( \mu_{\mathbb{P}}(t, X_t) - \mu_{\mathbb{Q}}(t, X_t) \right)\mathrm{d}t + \mathrm{d}W^{\mathbb{Q}}_t,$$
is a Brownian motion under the measure $\mathbb{P}$. Furthermore, under the measure $\mathbb{P}$ it holds almost surely that $X$ is described by
$$\mathrm{d}X_t = \mu_{\mathbb{P}}(t, X_t)\,\mathrm{d}t + \sigma(t, X_t)\,\mathrm{d}W^{\mathbb{P}}_t, \quad t \in [t_0, t_N]; \qquad X_{t_0} = x_{t_0}. \quad (15)$$
As a way to rewrite $\text{EE}_{\mathbb{P}}(t_n)$ as a conditional expectation under the measure $\mathbb{Q}$, we define a process $U = (U_t)_{t \in [t_0, t_N]}$, which follows the SDE
$$\mathrm{d}U_t = \mu_{\mathbb{P}}(t, U_t)\,\mathrm{d}t + \sigma(t, U_t)\,\mathrm{d}W^{\mathbb{Q}}_t, \quad t \in [t_0, t_N]; \qquad U_{t_0} = x_{t_0}. \quad (16)$$
The process $U$ has the same distribution under the measure $\mathbb{Q}$ as $X$ has under the measure $\mathbb{P}$. We can then express $\text{EE}_{\mathbb{P}}(t_n)$ with only $\mathbb{Q}$-expectations in the following way:
$$\text{EE}_{\mathbb{P}}(t_n) = \mathbb{E}_{\mathbb{P}}\left[ V_{t_n}(X_{t_n}) \, \mathbb{I}_{\{\tau(\mathbf{f}^*(X)) > t_n\}} \,\middle|\, X_{t_0} = x_{t_0} \right] = \mathbb{E}_{\mathbb{Q}}\left[ V_{t_n}(U_{t_n}) \, \mathbb{I}_{\{\tau(\mathbf{f}^*(U)) > t_n\}} \,\middle|\, U_{t_0} = x_{t_0} \right]$$
$$= \mathbb{E}_{\mathbb{Q}}\left[ \mathbb{E}_{\mathbb{Q}}\left[ e^{-r(\tau_n(\mathbf{f}^*_n(X)) - t_n)} g_{\tau_n(\mathbf{f}^*_n(X))}\left(X_{\tau_n(\mathbf{f}^*_n(X))}\right) \,\middle|\, X_{t_n} = U_{t_n} \right] \mathbb{I}_{\{\tau(\mathbf{f}^*(U)) > t_n\}} \,\middle|\, U_{t_0} = x_{t_0} \right]. \quad (17)$$

Footnote: For a matrix $A \in \mathbb{R}^{d \times d}$, $\|A\|_{\max} = \max_{ij} |A_{ij}|$.

Remark 2.1.
Regarding the equality between the right hand side of the first line and the second line, we want to emphasize that $X$ under the measure $\mathbb{P}$ and $U$ under the measure $\mathbb{Q}$ do not represent the same stochastic process (as they would have done after a change of measure). If they were, then the conditional expectation would change, and the equality would not hold. To enforce the equality to hold, we could have corrected with the Radon–Nikodym derivative, and obtained (12). However, to find a way to write $\text{EE}_{\mathbb{P}}(t_n)$ without having to include the Radon–Nikodym derivative, we introduce another process $U$, which is distributed in such a way that the equality holds, i.e., the conditional expectation under $\mathbb{P}$, when using $X$, should equal the conditional expectation under $\mathbb{Q}$, when using $U$. For this to hold, it is sufficient that the distribution of $X$ under the measure $\mathbb{P}$ equals the distribution of $U$ under the measure $\mathbb{Q}$, which is satisfied when $X$ follows (15) and $U$ follows (16).

Remark 2.2.
The final equality in (17) is obtained by the fact that $V_{t_n}(U_{t_n})$ is the option value at the (random) state $(t_n, U_{t_n})$, given by (5).

Before we get rid of the inner expectation in (17), we need to define a process following (16) on $[t_0, t_n]$, and the dynamics of (15) on $[t_n, t_N]$ with (stochastic) initial condition $X_{t_n} = U_{t_n}$. We denote such a process by $\tilde{X}^{t_n} = (\tilde{X}^{t_n}_t)_{t \in [t_0, t_N]}$, and conclude that $\tilde{X}^{t_n}$ should satisfy the following SDE
$$\mathrm{d}\tilde{X}^{t_n}_t = \mu_{\mathbb{P},\mathbb{Q},t_n}(t, \tilde{X}^{t_n}_t)\,\mathrm{d}t + \sigma(t, \tilde{X}^{t_n}_t)\,\mathrm{d}W^{\mathbb{Q}}_t, \quad t \in [t_0, t_N]; \qquad \tilde{X}^{t_n}_{t_0} = x_{t_0}, \quad (18)$$
where $\mu_{\mathbb{P},\mathbb{Q},t_n}(t, \cdot) = \mu_{\mathbb{P}}(t, \cdot)\,\mathbb{I}_{\{t \le t_n\}} + \mu_{\mathbb{Q}}(t, \cdot)\,\mathbb{I}_{\{t > t_n\}}$. Note that we have implicitly assumed that also $\mu_{\mathbb{P}}$ satisfies the usual conditions for the existence of a unique strong solution, $U$, to (16). As a consequence of this assumption, we are also guaranteed that there exists a unique strong solution, $\tilde{X}^{t_n}$, to (18). We can then use the law of total expectation to obtain
$$\text{EE}_{\mathbb{P}}(t_n) = \mathbb{E}_{\mathbb{Q}}\left[ V_{t_n}(\tilde{X}^{t_n}_{t_n}) \, \mathbb{I}_{\{\tau(\mathbf{f}^*) > t_n\}} \,\middle|\, \tilde{X}^{t_n}_{t_0} = x_{t_0} \right] = \mathbb{E}_{\mathbb{Q}}\left[ e^{-r(\tau_n(\mathbf{f}^*_n) - t_n)} g_{\tau_n(\mathbf{f}^*_n)}\left(\tilde{X}^{t_n}_{\tau_n(\mathbf{f}^*_n)}\right) \mathbb{I}_{\{\tau(\mathbf{f}^*) > t_n\}} \,\middle|\, \tilde{X}^{t_n}_{t_0} = x_{t_0} \right], \quad (19)$$
where we remind ourselves that $\tau(\mathbf{f}^*) = \tau(\mathbf{f}^*(\tilde{X}^{t_n}))$ and $\tau_n(\mathbf{f}^*_n) = \tau_n(\mathbf{f}^*_n(\tilde{X}^{t_n}))$.

In the next sections we describe a method to approximate $\mathbf{f}^*(\cdot)$. It is then straightforward to estimate (11), (12) and (19) by Monte-Carlo sampling. Furthermore, in Section 4, we introduce a method to approximate the price function $V_{t_n}(\cdot)$, which makes it straightforward to also approximate the potential future exposure (10).
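The drift-switched dynamics (18) are straightforward to simulate with, e.g., an Euler–Maruyama scheme. The sketch below is our illustration (a one-dimensional geometric-Brownian-motion-type example with hypothetical drifts, not taken from the paper): paths use the real world drift up to the switching date and the risk neutral drift thereafter, driven throughout by the same $\mathbb{Q}$-Brownian increments.

```python
import numpy as np

def simulate_tilde_x(x0, t_grid, n_switch, mu_p, mu_q, sigma, n_paths, rng):
    """Euler-Maruyama for eq. (18): drift mu_P before the switching date
    t_n (index n_switch in t_grid), drift mu_Q from t_n onward.

    mu_p, mu_q, sigma: callables (t, x) -> array, one value per path.
    Returns an array of shape (n_paths, len(t_grid)).
    """
    x = np.full((n_paths, len(t_grid)), float(x0))
    for k in range(len(t_grid) - 1):
        t, dt = t_grid[k], t_grid[k + 1] - t_grid[k]
        mu = mu_p if k < n_switch else mu_q  # drift switch at t_n
        dw = rng.standard_normal(n_paths) * np.sqrt(dt)
        x[:, k + 1] = x[:, k] + mu(t, x[:, k]) * dt + sigma(t, x[:, k]) * dw
    return x

# Hypothetical coefficients: real world drift 8%, risk neutral drift 3%.
rng = np.random.default_rng(0)
t_grid = np.linspace(0.0, 1.0, 11)
paths = simulate_tilde_x(
    x0=100.0, t_grid=t_grid, n_switch=5,
    mu_p=lambda t, x: 0.08 * x, mu_q=lambda t, x: 0.03 * x,
    sigma=lambda t, x: 0.2 * x, n_paths=1000, rng=rng)
```

Averaging discounted cashflows over such paths, as in (19), then gives a Monte-Carlo estimate of $\text{EE}_{\mathbb{P}}(t_n)$ without any Radon–Nikodym weighting.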
The idea is touse fully connected neural networks to approximate the decision functions introduced in the previous section.The neural networks are optimized backwards in time with the objective to maximize the expected discountedcashflow at each exercise date. In the second part of this section, we suggest some adjustments that can bedone in order to make the optimization more efficient.
As indicated above, the core of the algorithm is to approximate decision functions. To be more precise, for $n \in \{0, 1, \ldots, N\}$, the decision function $f_n$ is approximated by a fully connected neural network of the form $f^{\theta_n}_n\colon \mathbb{R}^d \to \{0, 1\}$, where $\theta_n \in \mathbb{R}^{q_n}$ is a vector containing all the $q_n \in \mathbb{N}$ trainable parameters (i.e., parameters that are subject to optimization) in network $n$. We assume that the initial state, $x_{t_0} \in \mathbb{R}^d$, is such that it is sub-optimal to exercise the option at $t_0$, and therefore set $\theta_0$ such that $f^{\theta_0}_0(x_{t_0}) = 0$ (for a further discussion, see Remark 6 in [1]). Since binary decision functions are discontinuous, and therefore unsuitable for gradient-type optimization algorithms, we use, as an intermediate step, the neural network $F^{\theta_n}_n\colon \mathbb{R}^d \to (0, 1)$. The output of $F^{\theta_n}_n$ can be viewed as the probability that exercise is optimal. This output is then mapped to 1 for values above (or equal to) 0.5, and to 0 otherwise, by defining $f^{\theta_n}_n(\cdot) = a \circ F^{\theta_n}_n(\cdot)$, where $a(x) = \mathbb{I}_{\{x \ge 1/2\}}$. Our objective is to find $\theta_n$ such that
$$\mathbb{E}^n_{\mathbb{Q}}\left[ f^{\theta_n}_n(X_{t_n}) g_{t_n}(X_{t_n}) + \left(1 - f^{\theta_n}_n(X_{t_n})\right) e^{-r(\tau_{n+1}(\mathbf{f}^*_{n+1}) - t_n)} g_{\tau_{n+1}(\mathbf{f}^*_{n+1})}\left( X_{\tau_{n+1}(\mathbf{f}^*_{n+1})} \right) \right], \quad (20)$$
is as close as possible to $V_{t_n}(X_{t_n})$ (in mean squared sense), where $\mathbf{f}^*_{n+1}$ is the vector of optimal decision functions defined in (4). Although (20) is an accurate representation of the optimization problem, it gives us some practical problems. In general, we have no access to either $\mathbf{f}^*_{n+1}$ or the distribution of $V_{t_n}(X_{t_n})$, and in most cases the expectation needs to be approximated. We notice, however, that at maturity the option value is equal to its intrinsic value, i.e., $V_{t_N}(\cdot) \equiv g_{t_N}(\cdot)$, which implies that $f^*_N \equiv 1$ and, consequently, $\tau_N(\mathbf{f}^*_N) = t_N$.
With this insight, we can write (20) with $n = N - 1$:
$$\mathbb{E}^{N-1}_{\mathbb{Q}}\left[ f^{\theta_{N-1}}_{N-1}(X_{t_{N-1}}) g_{t_{N-1}}(X_{t_{N-1}}) + \left(1 - f^{\theta_{N-1}}_{N-1}(X_{t_{N-1}})\right) e^{-r(t_N - t_{N-1})} g_{t_N}(X_{t_N}) \right], \quad (21)$$
which can be approximated by Monte-Carlo sampling. Given $M \in \mathbb{N}$ samples, distributed as $X$, which for $m \in \{1, 2, \ldots, M\}$ are denoted by $x(m) = (x_{t_n}(m))_{n=0}^N$, we can approximate (21) by
$$\frac{1}{M} \sum_{m=1}^{M} \left( f^{\theta_{N-1}}_{N-1}(x_{t_{N-1}}(m)) g_{t_{N-1}}(x_{t_{N-1}}(m)) + \left(1 - f^{\theta_{N-1}}_{N-1}(x_{t_{N-1}}(m))\right) e^{-r(t_N - t_{N-1})} g_{t_N}(x_{t_N}(m)) \right). \quad (22)$$
Note that the only unknown entity in (22) is the parameter $\theta_{N-1}$ in the decision function $f^{\theta_{N-1}}_{N-1}$. Furthermore, we note that we want to find $\theta_{N-1}$ such that (22) is maximized, since it represents the average cashflow in $[t_{N-1}, t_N]$. Once $\theta_{N-1}$ is optimized, we use this parameter and find $\theta_{N-2}$ such that the average cashflow on $[t_{N-2}, t_N]$ is maximized.

In the next section, we explain the parameters $\theta_n$ and present the structure of the neural networks used in this paper. For completeness, we introduce all the trainable parameters that are contained in each of the parameters $\theta_0, \theta_1, \ldots, \theta_{N-1}$, and present the structure of the networks.

• We denote the dimension of the input layers by $D_{\text{input}} \in \mathbb{N}$, and we assume the same input dimension for all networks $n \in \{0, 1, \ldots, N-1\}$. The input is assumed to be the market state $x^{\text{train}}_{t_n} \in \mathbb{R}^d$, and hence $D_{\text{input}} = d$. However, we can add additional information to the input that is mathematically redundant but helps the training, e.g., the immediate pay-off, to obtain as input $\left( \mathrm{vec}(x^{\text{train}}_{t_n}(m)), g_{t_n}\left(x^{\text{train}}_{t_n}(m)\right) \right) \in \mathbb{R}^{d+1}$, which would give $D_{\text{input}} = d + 1$. In [1], the authors claim that by adding the immediate pay-off to the input, the efficiency of the algorithm was improved;

• For network $n \in \{0, 1, \ldots, N-1\}$, we denote the number of layers by $L_n \in \mathbb{N}$, and for layer $\ell \in \{1, 2, \ldots$
, L n } , the number of nodes by N n,(cid:96) ∈ N . Note that N n, = D input ; • For network n ∈ { , , . . . , N } , and layer (cid:96) ∈ { , , . . . , L n } we denote the weight matrix, acting betweenlayers (cid:96) − (cid:96) , by w n,(cid:96) ∈ R N n,(cid:96) − × N n,(cid:96) , and the bias vector by b n,(cid:96) ∈ R (cid:96) ; • For network n ∈ { , , . . . , N } , and layer (cid:96) ∈ { , , . . . , L n } we denote the (scalar) activation function by a n,(cid:96) : R → R and the vector activation function by a n,(cid:96) : R N n,(cid:96) → R N n,(cid:96) , which for x = ( x , x , . . . , x N n,(cid:96) )is defined by a n,(cid:96) ( x ) = a n,(cid:96) ( x )... a n,(cid:96) ( x N n,(cid:96) ) ; • The output of our network should belong to (0 , ⊂ R , meaning that the output dimension of our neuralnetwork, denoted by D output should equal 1. To enforce the output to only take on values in (0 , a n, L n : R → (0 , However the interpretation as a probability may be helpful, one should be careful since it is not a rigorous mathematicalstatement. It should be clear that there is nothing random about the stopping decisions, since the stopping time is F t − measurable.It can also be interpreted as a measure on how certain we can be that exercise is optimal. Input and output layers included. n ∈ { , , . . . N − } is then defined by F θ n n ( · ) = L n, L n ◦ L n, L n − ◦ · · · ◦ L n, ( · ) , (23)where for n ∈ { , , . . . , N − } and for x ∈ R L n,(cid:96) − , the layers are defined as L n,(cid:96) ( x ) = (cid:40) x, for (cid:96) = 1 , a n,(cid:96) ( w Tn,(cid:96) x + b n,(cid:96) ) , for (cid:96) ≥ , where w Tn,(cid:96) is the matrix transpose of w n,(cid:96) . The trainable parameters of network n ∈ { , , . . . , N − } are thengiven by the list θ n = { w n, , b n, , w n, , b n, , . . . , w n, L n , b n, L n } . Furthermore, since we have N − θ n = { θ n , θ n +1 , . . . , θ N − } the trainable parameters in the neural networks at exercise dates T n and by θ = θ . 
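The layer structure (23), with a sigmoid output layer and the hard decision f_n^{θ_n} = a ∘ F_n^{θ_n}, can be sketched in a few lines of numpy. This is a minimal sketch; the ReLU hidden activation and the 1/√n_in weight scaling in the initialization are our own illustrative assumptions, not choices prescribed in the text:

```python
import numpy as np

def init_params(layer_sizes, rng):
    """Biases start at 0; weights are sampled from a normal distribution
    (with an assumed 1/sqrt(fan-in) scaling)."""
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        w = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
        b = np.zeros(n_out)
        params.append((w, b))
    return params

def soft_decision(params, x):
    """F_n^{theta_n}: R^d -> (0, 1); ReLU hidden layers (assumed), sigmoid output."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)          # hidden activation a_{n,l}
    w, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))   # a_{n,L_n}: sigmoid into (0, 1)

def hard_decision(params, x):
    """f_n^{theta_n} = a o F_n^{theta_n}, with a(x) = 1_{x >= 1/2}."""
    return (soft_decision(params, x) >= 0.5).astype(float)
```

For example, a network with D_input = 2, one hidden layer of 8 nodes and a scalar output is built with `init_params([2, 8, 1], rng)`; `soft_decision` then returns values strictly inside (0, 1), and `hard_decision` maps them to {0, 1}.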
The main idea of the training and valuation procedure is to fit the parameters to training data, and then to use the fitted parameters to make informed decisions on unseen, so-called valuation data. The training part of the algorithm is summarized below in pseudo code.
Training phase:
Sample M_train ∈ N independent realizations of X, which for m ∈ {1, 2, . . . , M_train} are denoted (x_{t_n}^{train}(m))_{n=0}^N. At maturity, define the cashflow as CF_N(m) = g_{t_N}(x_{t_N}^{train}(m)). For n = N − 1, N − 2, . . . , 1, do the following:

1. Find a θ̂_n ∈ R^{q_n} which approximates

  θ̂_n* ∈ arg max_{θ ∈ R^{q_n}} (1/M_train) Σ_{m=1}^{M_train} [ F_n^θ(x_{t_n}^{train}(m)) g_{t_n}(x_{t_n}^{train}(m)) + (1 − F_n^θ(x_{t_n}^{train}(m))) e^{−r(t_{n+1} − t_n)} CF_{n+1}(m) ].

In machine learning terminology, this gives an (empirical) loss function of the form

  L(θ; x^{train}) = −(1/M_train) Σ_{m=1}^{M_train} [ F_n^θ(x_{t_n}^{train}(m)) g_{t_n}(x_{t_n}^{train}(m)) + (1 − F_n^θ(x_{t_n}^{train}(m))) e^{−r(t_{n+1} − t_n)} CF_{n+1}(m) ].

The minus sign in the loss function transforms the problem from a maximization into a minimization, which is the standard formulation in the machine learning community. Note the straightforward relationship between the loss function and the average cashflow in (22). In practice, the data is often divided into mini-batches, for which the loss function is minimized consecutively.

2. For m = 1, 2, . . . , M_train, update the discounted cashflow according to

  CF_n(m) = f_n^{θ̂_n}(x_{t_n}^{train}(m)) g_{t_n}(x_{t_n}^{train}(m)) + (1 − f_n^{θ̂_n}(x_{t_n}^{train}(m))) e^{−r(t_{n+1} − t_n)} CF_{n+1}(m).

The performance of the algorithm is not particularly sensitive to the specific choice of the number of hidden layers, the number of nodes, the optimization algorithm, etc. Below is a list of the most relevant parameters/structural choices:

• Initialization of the trainable parameters, where a typical procedure is to initialize the biases to 0 and to sample the weights independently from a normal distribution;

• The activation functions a_{n,ℓ}, which are used to add a non-linear structure to the neural networks. In our case, we have the strict requirement that the activation function of the output layer maps R to (0, 1); see, e.g., [33];

• The batch size, B_n ∈ {1, 2, . . .
. , M_train}, is the number of training samples used for each update of θ_n; with B_n = M_train, the loss function is of the form defined in step 1 above. Note that if we want all batches to be of equal size, we need to choose B_n to be a divisor of M_train;

• For each update of θ_n, we use an optimization algorithm, for which a common choice is the Adam optimizer, proposed in [34]. Depending on the choice of optimization algorithm, there are different algorithm-specific parameters to be chosen. One example is the so-called learning rate, which decides how much the parameter θ_n is adjusted after each batch.

Once the parameters {θ_1, θ_2, . . . , θ_{N−1}} have been optimized, we can use the algorithm for valuation.

Valuation phase:
Sample M_val ∈ N independent realizations of X, denoted (x_{t_n}^{val}(m))_{n=0}^N. We emphasize that the valuation data should be independent of the training data. Denote the vector of decision functions by f^{θ̂}_n = (f_n^{θ̂_n}, f_{n+1}^{θ̂_{n+1}}, . . . , f_{N−1}^{θ̂_{N−1}}), and f^{θ̂} = f^{θ̂}_1. We then obtain for sample m, i.e., x^{val}(m), the following stopping rule at time t_n:

  τ_n(f^{θ̂}_n(x^{val}(m))) = Σ_{k=n}^N t_k f_k^{θ̂_k}(x_{t_k}^{val}(m)) Π_{j=n}^{k−1} (1 − f_j^{θ̂_j}(x_{t_j}^{val}(m))).

The estimated option value at t_0 is then given by

  V̂_{t_0}(x_0) = (1/M_val) Σ_{m=1}^{M_val} e^{−r(τ(f^{θ̂}) − t_0)} g_{τ(f^{θ̂})}(x_{τ(f^{θ̂})}^{val}(m)),   (24)

where we recall that τ(f^{θ̂}) = τ(f^{θ̂}(x^{val}(m))). Note that, by construction, any stopping strategy is sub-optimal, implying that the estimate (24) is biased low. It should be pointed out that it is possible to derive a biased-high estimate of (1) from a dual formulation of the optimal stopping problem, which is described in [1].
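The training objective in step 1, the cashflow update in step 2, and the valuation estimate (24) can be sketched in numpy as follows. This is a minimal sketch: it assumes the soft decisions F_n^θ and hard decisions f_n^{θ̂} are supplied as precomputed arrays, and the gradient-based optimizer itself (e.g., Adam) is omitted. The stopping time is represented as the first exercise date at which the hard decision equals 1, which is equivalent to the sum-product formula above:

```python
import numpy as np

def training_loss(F_soft, payoff_n, cf_next, dt, r):
    """Empirical loss of step 1 (the negated average cashflow):
    F_soft[m] = F_n^theta(x_{t_n}(m)) in (0, 1),
    payoff_n[m] = g_{t_n}(x_{t_n}(m)), cf_next[m] = CF_{n+1}(m)."""
    disc = np.exp(-r * dt)
    return -np.mean(F_soft * payoff_n + (1.0 - F_soft) * disc * cf_next)

def cashflow_update(f_hard, payoff_n, cf_next, dt, r):
    """Step 2: roll the discounted cashflow back from t_{n+1} to t_n."""
    disc = np.exp(-r * dt)
    return f_hard * payoff_n + (1.0 - f_hard) * disc * cf_next

def lower_bound_price(f_hard_all, payoffs, times, r):
    """Biased-low estimate (24). f_hard_all[m, n] in {0, 1} with
    f_hard_all[:, -1] = 1 (since f_N = 1); the stopping index per path is
    the first exercise date where the decision equals 1."""
    tau = np.argmax(f_hard_all == 1, axis=1)
    m = np.arange(f_hard_all.shape[0])
    disc = np.exp(-r * (times[tau] - times[0]))
    return np.mean(disc * payoffs[m, tau])
```

In a full implementation, `training_loss` would be minimized over θ with mini-batches before `cashflow_update` is called at each n, and `lower_bound_price` would be evaluated on the independent valuation set.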
In addition, numerical results in [1] show a tight interval between the biased-low and biased-high estimates for a wide range of different problems.

The presentation of the DOS algorithm in [1] is in a general form. In addition to the pricing of Bermudan options, the authors considered the non-Markovian problem of optimally stopping a fractional Brownian motion (this is done by including the historical states in the current state of the system). Since the aim of this paper is more specific (to approximate exposure profiles of Bermudan options), it is natural to use more of the known underlying structure of this specific problem. In this section, we use some properties of this specific problem and propose some adjustments to the DOS algorithm which make the training procedure more efficient.
The first proposed adjustment is to reuse parameters of neural networks that have already been optimized. We note that for a single Bermudan option (possibly with a high-dimensional underlying asset), the pay-off functions are identical at all exercise dates, i.e., g_{t_n} = g_{t_m} for all t_n, t_m ∈ T. In this case, the stopping rules at adjacent exercise dates are similar, especially when t_{n+1} − t_n is small. We therefore use the stopping strategy at t_{n+1} as an "initial guess" for the stopping strategy at t_n. This is done by initializing the trainable parameters in network n with the already optimized parameters in network n + 1, i.e., at t_n, we initialize θ_n by θ̂_{n+1}. This allows us to use smaller learning rates, leading to a more efficient algorithm.

The term simple stopping decisions loosely refers to stopping decisions that follow directly, without any sophisticated optimization algorithm. The most obvious example is when the contract is out of the money, in which case it is never optimal to exercise. For t_n ∈ T, we define the sets of in-the-money and out-of-the-money points as

  ITM_n = {x ∈ R^d | g_{t_n}(x) > 0},  OTM_n = {x ∈ R^d | g_{t_n}(x) = 0},

respectively. Another, less obvious insight is that, for a single Bermudan option with identical pay-off functions at all exercise dates, if it is optimal to exercise at (t_n, x), then it is also optimal to exercise at (t_{n+1}, x). In other words, the exercise region is non-decreasing in time. This statement is formulated as a theorem below.

Theorem 3.1. Define the set of exercise dates by {t_1, . . . , t_n, t_{n+1}, . . . , t_N, t_N + Δ}, and let Δ = t_{n+1} − t_n = t_{N+1} − t_N ≥ 0. Note that an equidistant time grid is sufficient, but not necessary, for the above to be satisfied. Moreover, assume that

  V_{t_n}(·; t_N) = V_{t_{n+1}}(·; t_N + Δ),

where t_N and t_N + Δ indicate the maturities of otherwise identical contracts of the form (2), with g = g_{t_n} for all exercise dates t_n.
Then, for any x̄ ∈ E_n, it holds that x̄ ∈ E_{n+1}.

Proof. Since x̄ ∈ E_n, we have V_{t_n}(x̄; t_N) = g(x̄), and with V_{t_n}(x̄; t_N) = V_{t_{n+1}}(x̄; t_N + Δ) it follows that V_{t_{n+1}}(x̄; t_N + Δ) = g(x̄). From (2) we also see that V_{t_{n+1}}(x̄; t_N) ≤ V_{t_{n+1}}(x̄; t_N + Δ) and V_{t_{n+1}}(x̄; t_N) ≥ g(x̄). Therefore V_{t_{n+1}}(x̄; t_N) = g(x̄) and x̄ ∈ E_{n+1}.

Theorem 3.1 shows that the exercise region is non-decreasing in time, but since the optimization of the neural network parameters is carried out backwards in time, we instead use the fact that the continuation region is non-increasing in time. In practice, this leads to the following three alternatives:

A1. Use all training data in the optimization algorithm (as the algorithm is described in Subsection 3.1);

A2. At t_n ∈ T, use the subset of the training data satisfying x_{t_n}^{train}(m) ∈ ITM_n in the optimization algorithm. Define the decision functions as f_n^{θ̂_n}(·) = I_{{g_{t_n}(·) > 0}} (a ∘ F_n^{θ̂_n}(·)). Using only "in the money" paths is also done in the Least Squares Method (LSM), proposed in [22];

A3. At t_n, use the subset of the training data satisfying x_{t_n}^{train}(m) ∈ E_{n+1} in the optimization algorithm. Define the decision functions as f_n^{θ̂_n}(·) = f_{n+1}^{θ̂_{n+1}}(·) (a ∘ F_n^{θ̂_n}(·)).

In Figure 1, the three cases above are visualized for a two-dimensional max call option at one of the exercise dates. In the left panel, the blue points belong to E_{n+1} (used for optimization in A3), the blue and red points belong to ITM_n (used for optimization in A2), and the blue, red and yellow points together form all the available data (used for optimization in A1). The right panel shows the fraction of the total data used in each case at each exercise date.

Figure 1: Left: Blue points in E_{n+1} (used for optimization in A3), blue and red points in ITM_n (used for optimization in A2), and blue, red and yellow points: all the available data (used for optimization in A1). Right: The fraction of the total data used in each case at each exercise date.
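The three data-selection alternatives above amount to boolean masks over the training samples at t_n. A minimal sketch (the function and argument names are our own; `exercise_next[m]` indicates whether sample m lies in the approximated exercise region E_{n+1}):

```python
import numpy as np

def training_mask(g_n, exercise_next, scheme="A2"):
    """Select which training samples enter the optimization at t_n.
    g_n[m] = g_{t_n}(x_{t_n}(m)); exercise_next[m] = 1 if x_{t_n}(m) lies in
    the (approximated) exercise region E_{n+1} (used by A3)."""
    if scheme == "A1":
        return np.ones_like(g_n, dtype=bool)    # all available samples
    if scheme == "A2":
        return g_n > 0                          # in-the-money samples only
    if scheme == "A3":
        return exercise_next.astype(bool)       # samples in E_{n+1}
    raise ValueError(scheme)
```

In A2 and A3, the final decision functions are additionally masked outside the selected region (by the ITM indicator and by f_{n+1}^{θ̂_{n+1}}, respectively), as stated above.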
4 Learning pathwise option values
In Section 3, an algorithm to learn stopping decisions was described, and (24) gives an approximation of the option value at time t_0, given some deterministic initial state X_{t_0} = x_0 ∈ R^d. As described in Subsection 2.2, to compute exposure profiles we sometimes need additional information about the future distribution of the option values. In this section, we present two methods to approximate the pathwise option values at all exercise dates. The first method is the well-established ordinary least squares (OLS) regression, and the second method is a neural network-based least squares regression. Both methods rely on projections of conditional expectations onto a finite-dimensional function space.

Central to the regression algorithms presented in this section is the cashflow process, Y = (Y_{t_n})_{n=0}^N, defined as

  Y_{t_n} = e^{−r(τ_n(f*_n) − t_n)} g_{τ_n(f*_n)}(X_{τ_n(f*_n)}),   (25)

where τ_n(·) and f* are defined in (3) and (4), respectively. We assume that for t_n ∈ T it holds that E_Q[g_{t_n}(X_{t_n})^2] < ∞, which also implies that E_Q[Y_{t_n}^2] < ∞. The following theorem states that the option value, at some t_n ∈ T, is equivalent (in L^2 sense) to the so-called regression function. Furthermore, we see that the regression function can be obtained by solving a minimization problem.

Theorem 4.1.
Let Y_{t_n} be as defined in (25), and for h_n ∈ D(R^d; R), define the so-called L^2-risk as E_Q[|h_n(X_{t_n}) − Y_{t_n}|^2]. It then holds that

  E_Q[|V_{t_n}(X_{t_n}) − Y_{t_n}|^2] = min_{h_n ∈ D(R^d; R)} E_Q[|h_n(X_{t_n}) − Y_{t_n}|^2],

or equivalently,

  V_{t_n}(·) ∈ arg min_{h_n ∈ D(R^d; R)} E_Q[|h_n(X_{t_n}) − Y_{t_n}|^2].

Proof.
Define m_n(x) = E_Q[Y_{t_n} | X_{t_n} = x]. For an arbitrary function v : R^d → R, it holds that

  E_Q[|v(X_{t_n}) − Y_{t_n}|^2] = E_Q[|v(X_{t_n}) − m_n(X_{t_n}) + m_n(X_{t_n}) − Y_{t_n}|^2]
  = E_Q[|v(X_{t_n}) − m_n(X_{t_n})|^2] + E_Q[|m_n(X_{t_n}) − Y_{t_n}|^2] + 2 E_Q[(v(X_{t_n}) − m_n(X_{t_n}))(m_n(X_{t_n}) − Y_{t_n})].

By the law of total expectation, the last term satisfies

  E_Q[(v(X_{t_n}) − m_n(X_{t_n}))(m_n(X_{t_n}) − Y_{t_n})] = E_Q[ E_Q^n[(v(X_{t_n}) − m_n(X_{t_n}))(m_n(X_{t_n}) − Y_{t_n})] ]
  = E_Q[(v(X_{t_n}) − m_n(X_{t_n}))(m_n(X_{t_n}) − E_Q^n[Y_{t_n}])] = 0,

since E_Q^n[Y_{t_n}] = m_n(X_{t_n}) by the definition of m_n. This means that

  E_Q[|v(X_{t_n}) − Y_{t_n}|^2] = E_Q[|v(X_{t_n}) − m_n(X_{t_n})|^2] + E_Q[|m_n(X_{t_n}) − Y_{t_n}|^2],   (26)

which is clearly minimized when v ≡ m_n. Also, notice that m_n(X_{t_n}) = V*_{t_n}(X_{t_n}), and by Theorem 2.1, V*_{t_n}(X_{t_n}) = V_{t_n}(X_{t_n}), which concludes the proof. Alternatively, the statement is a straightforward consequence of the fact that the conditional expectation is the (least-squares) projection onto functions of the Markov state, h_n(X_{t_n}).

Note that Y is the discounted cashflow process, which in the training phase was denoted by CF_n. The reason for using Y_{t_n} instead in this section is that we want to emphasize that the pathwise valuation problem is, in fact, a standard regression problem, in which X and Y are usually used to represent the observation vector and the response variable, respectively.
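The orthogonality argument behind (26) can be checked numerically on a discrete toy example, where the conditional expectation reduces to group means and the decomposition holds exactly under the empirical measure. The data below is illustrative only:

```python
import numpy as np

def cond_mean(x, y):
    """Empirical conditional expectation m(x) = E[Y | X = x] for discrete X."""
    m = np.empty_like(y, dtype=float)
    for v in np.unique(x):
        m[x == v] = y[x == v].mean()
    return m

# Toy data: two groups of X with two Y-observations each.
x = np.array([0, 0, 1, 1])
y = np.array([1.0, 3.0, 2.0, 6.0])
v = x.astype(float)          # an arbitrary competitor function v(X)
m = cond_mean(x, y)          # group means: [2, 2, 4, 4]

# Decomposition (26): E|v - Y|^2 = E|v - m|^2 + E|m - Y|^2.
lhs = np.mean((v - y) ** 2)
rhs = np.mean((v - m) ** 2) + np.mean((m - y) ** 2)
```

Here lhs and rhs agree exactly, illustrating that the cross term vanishes whenever m is the (empirical) conditional mean.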
In practice, the distribution of (X, Y) is usually unknown. On the other hand, we are often able to generate samples distributed as (X, Y). We consider some t_n ∈ T and generate M_reg ∈ N independent realizations of the regression pair (X_{t_n}, Y_{t_n}), which we denote by (x_{t_n}^{reg}(m), y_{t_n}^{reg}(m))_{m=1}^{M_reg}. Similarly, we define the empirical L^2-risk by

  (1/M_reg) Σ_{m=1}^{M_reg} |h_n(x_{t_n}^{reg}(m)) − y_{t_n}^{reg}(m)|^2.

With a fixed sample of regression pairs, it is possible to find a function h ∈ D(R^d; R) such that the empirical L^2-risk equals zero. However, such a function is in general not a consistent estimator. Therefore, we want to use a smaller class of more regular functions. When choosing the function class, which we denote by A_M, we need to keep two aspects in mind:

P1. It should be "rich enough" to be able to approximate V_{t_n}(·) sufficiently accurately;

P2. It should not be "too rich", since that may cause the empirical L^2-risk to be an inaccurate approximation of the L^2-risk. Since this problem is more severe for smaller M_reg, it is reasonable to keep the sample size in mind when choosing the function class, hence the subscript M in A_M, where "reg" is dropped for notational convenience. A too rich function class may lead to what is known as overfitting in the machine learning community.

Given a sample and a function class A_M, we define the empirical regression function as

  m_M(·) ∈ arg min_{h ∈ A_M} (1/M_reg) Σ_{m=1}^{M_reg} |h(x_{t_n}^{reg}(m)) − y_{t_n}^{reg}(m)|^2.

Since (26) holds for arbitrary v, we can write the L^2-risk between the empirical regression function and the option value as

  E_Q[|m_M(X_{t_n}) − V_{t_n}(X_{t_n})|^2] = E_Q[|m_M(X_{t_n}) − Y_{t_n}|^2] − E_Q[|V_{t_n}(X_{t_n}) − Y_{t_n}|^2].
This in turn can be written in terms of the so-called estimation error (first term) and approximation error (second term), i.e.,

  E_Q[|m_M(X_{t_n}) − V_{t_n}(X_{t_n})|^2]
  = ( E_Q[|m_M(X_{t_n}) − Y_{t_n}|^2] − min_{h ∈ A_M} E_Q[|h(X_{t_n}) − Y_{t_n}|^2] )
  + ( min_{h ∈ A_M} E_Q[|h(X_{t_n}) − Y_{t_n}|^2] − E_Q[|V_{t_n}(X_{t_n}) − Y_{t_n}|^2] ).   (27)

The approximation error measures how well the option value can be approximated by functions in A_M, which corresponds to (P1) above. The estimation error is the difference between the L^2-risk of the estimator m_M and that of the optimal h in A_M, which corresponds to (P2) above.

There is, however, one problem with the error decomposition above: we have assumed that we can sample realizations of (X, Y), while in practice we are only able to sample from (X, Ŷ), with Ŷ = (Ŷ_t)_{t ∈ [t_0, t_N]} given by

  Ŷ_{t_n} = e^{−r(τ_n(f^{θ̂}_n) − t_n)} g_{τ_n(f^{θ̂}_n)}(X_{τ_n(f^{θ̂}_n)}).

That is, Ŷ is the approximate discounted cashflow process obtained by using the neural network-based decision functions instead of the optimal decision functions in (25). Taking into account that the regression is carried out against an approximation of Y, (27) becomes instead

  E_Q[|m_M(X_{t_n}) − V_{t_n}(X_{t_n})|^2]
  ≤ ( E_Q[|m_M(X_{t_n}) − Ŷ_{t_n}|^2] − min_{h ∈ A_M} E_Q[|h(X_{t_n}) − Ŷ_{t_n}|^2] )
  + ( min_{h ∈ A_M} E_Q[|h(X_{t_n}) − Y_{t_n}|^2] − E_Q[|V_{t_n}(X_{t_n}) − Y_{t_n}|^2] )
  + ( min_{h ∈ A_M} E_Q[|h(X_{t_n}) − Ŷ_{t_n}|^2] − min_{h ∈ A_M} E_Q[|h(X_{t_n}) − Y_{t_n}|^2] )
  + E_Q[|Ŷ_{t_n} − Y_{t_n}|^2].   (28)
The first two rows on the right-hand side of (28) mirror the estimation and approximation errors in (27), the third row measures how well functions in A_M can approximate Ŷ_{t_n} compared to Y_{t_n}, and the final row is the L^2-distance between our approximation of the discounted cashflow and the true discounted cashflow. Furthermore, note that the equality in (27) has changed to an inequality in (28). In the next section, we introduce the two different types of function classes that are used in this paper.

At t_n ∈ T, we assume that V_{t_n}(X_{t_n}) can be represented by a linear combination of a countable set of F_{t_n}-measurable basis functions. We denote the basis functions by {φ_b}_{b=1}^∞ and, given optimal parameters α_{t_n}^{(1)}, α_{t_n}^{(2)}, . . . (in the sense that the L^2-risk against V_{t_n}(X_{t_n}) is minimized), define

  v(t_n, X_{t_n}) = Σ_{b=1}^∞ α_{t_n}^{(b)} φ_b(X_{t_n}).

For practical purposes, we use the first B ∈ N basis functions, so that

  v_B(t_n, X_{t_n}) = Σ_{b=1}^B α_{t_n}^{(b)} φ_b(X_{t_n}).   (29)

We now want to estimate (29) by projecting a sample of realizations of (X_{t_n}, Y_{t_n}) onto the first B basis functions. This procedure is similar to the LSM, [22]. In the LSM, only ITM samples are used, which is motivated by the fact that it is never optimal to exercise an option that is OTM and the objective is to find the optimal exercise strategy. Furthermore, the authors claim that the number of basis functions needed to obtain an accurate approximation is significantly reduced, since the approximation region is reduced by only considering ITM paths. However, this is not possible in our case, since we need to approximate the option value everywhere (by "everywhere" we mean the region in which the density of X_{t_n} is significantly positive). On the other hand, the exercise region, E_n, has already been approximated (as described in Subsection 3.1), and the option value in the exercise region is always known.
This means that, similar to the LSM, the approximation region can be reduced (in many cases significantly) by only considering samples belonging to the continuation region, C_n.

Given a sample of regression pairs (x_{t_n}^{reg}(m), y_{t_n}^{reg}(m))_{m=1}^{M_reg}, we let M_reg^{C_n} denote the number of samples belonging to C_n and let (x_{t_n}^{reg}(m), y_{t_n}^{reg}(m))_{m=1}^{M_reg^{C_n}} be our new sample of regression pairs (where the indexation has been appropriately changed). Assuming M_reg^{C_n} ≥ 1 (if M_reg^{C_n} = 0, all samples are in the exercise region and we simply set v̂_{B,M}(t_n, ·) ≡ g_{t_n}(·)), we want to find a set of regression coefficients α̂_{t_n} = (α̂_{t_n}^{(1)}, . . . , α̂_{t_n}^{(B)}) such that the following empirical L^2-risk is minimized:

  (1/M_reg^{C_n}) Σ_{m=1}^{M_reg^{C_n}} | Σ_{b=1}^B α_{t_n}^{(b)} φ_b(x_{t_n}^{reg}(m)) − y_{t_n}^{reg}(m) |^2.   (30)

For notational convenience, we introduce the compact notation x_{t_n} = (x_{t_n}^{reg}(1), . . . , x_{t_n}^{reg}(M_reg^{C_n})), y_{t_n} = (y_{t_n}^{reg}(1), . . . , y_{t_n}^{reg}(M_reg^{C_n})) and the design matrix

  φ(x_{t_n}) =
  [ φ_1(x_{t_n}^{reg}(1))          φ_2(x_{t_n}^{reg}(1))          · · ·  φ_B(x_{t_n}^{reg}(1))
    φ_1(x_{t_n}^{reg}(2))          φ_2(x_{t_n}^{reg}(2))          · · ·  φ_B(x_{t_n}^{reg}(2))
    . . .                          . . .                          . . .  . . .
    φ_1(x_{t_n}^{reg}(M_reg^{C_n})) φ_2(x_{t_n}^{reg}(M_reg^{C_n})) · · ·  φ_B(x_{t_n}^{reg}(M_reg^{C_n})) ].

It is a well-known fact (see, e.g., [35]) that α̂_{t_n} is given by

  α̂_{t_n} = (φ(x_{t_n})^T φ(x_{t_n}))^{−1} φ(x_{t_n})^T y_{t_n},   (31)

where we note that, if we choose linearly independent basis functions, the matrix inverse in (31) exists almost surely, since X_{t_n} has a density. In practice, we run into trouble if we choose B too large, since the computation of the matrix inverse may become unstable. We define the estimator

  v̂_{B,M}(t_n, ·) = Σ_{b=1}^B α̂_{t_n}^{(b)} φ_b(·).   (32)
Since the LSM is one of the most established algorithms for the valuation of Bermudan options, its theoretical properties have been studied extensively, and many of the results can also be applied to the algorithm above. However, we first need to make an assumption regarding M_reg^{C_n}. Assume that there exists c > 0 such that Q{X_{t_n} ∈ C_n} ≥ c. It then holds for any C ∈ R that

  Q( lim_{M_reg → ∞} Σ_{m=1}^{M_reg} I_{{x_{t_n}^{reg}(m) ∈ C_n}} ≥ C ) = 1,

which implies that M_reg^{C_n} approaches infinity almost surely as M_reg approaches infinity. Since the regression pairs are independently and identically distributed, it holds that v̂_{B,M}(t_n, X_{t_n}) converges both in mean square and in probability to v_B(t_n, X_{t_n}) as M_reg approaches infinity (see, e.g., [36]). To make the comparison between the OLS approximator of the option value and the neural network approximator (to be defined in the next section) clearer, we use the notation

  v̂_{t_n}^{OLS}(·) = v̂_{B,M}(t_n, ·),   (33)

where we assume that B and M are chosen such that both accuracy and time complexity are taken into account. A nice property of OLS regression is that, given B and a sample of regression pairs, we have a closed-form expression for the optimal parameters, and thus also for the regression function (32). On the other hand, we may face memory or runtime issues for large B and M_reg^{C_n} due to (31). This is a problem especially when we want to approximate a complicated function surface over a large approximation region. For example, consider an option based on 50 correlated assets. If we want to use the first- and second-order polynomials as basis functions (including cross-terms), we have B = 2 · 50 + 50 · 49/2 = 1325, which is often too large for practical purposes. We should also keep in mind that this corresponds to an approximation with polynomials of degree 2, which is usually not sufficient for complicated problems. There are, however, methods to get around this problem; see, e.g.
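The OLS step (30)–(32) can be sketched in numpy as follows. Instead of forming the normal equations (31) explicitly, we use np.linalg.lstsq, which solves the same least-squares problem in a numerically more stable way; the monomial basis in the usage example is only an illustration:

```python
import numpy as np

def design_matrix(x, basis):
    """phi(x): row m contains the basis functions evaluated at x[m]."""
    return np.column_stack([phi(x) for phi in basis])

def fit_ols(x, y, basis):
    """Minimize the empirical L2-risk (30); lstsq solves the least-squares
    problem underlying the normal equations (31) without explicit inversion."""
    A = design_matrix(x, basis)
    alpha, *_ = np.linalg.lstsq(A, y, rcond=None)
    return alpha

def v_hat(x, alpha, basis):
    """Estimator (32): linear combination of the basis functions."""
    return design_matrix(x, basis) @ alpha
```

For instance, with the one-dimensional basis {1, x, x^2} and data generated by y = 2 + 3x, `fit_ols` recovers the target function exactly, since it lies in the span of the basis.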
, [25], in which the state space is divided into several bundles and regression is carried out locally in each bundle. Another suitable method to overcome these difficulties is neural network regression, which is presented in the next section.

In this section, we present a simple neural network approximation of V_{t_n}(·). The neural network is a mapping v^{φ_n} : R^d → R, parametrized by the p_{t_n} ∈ N trainable parameters φ_n ∈ R^{p_{t_n}}. The objective is to find φ_n such that the empirical L^2-risk

  (1/M_reg^{C_n}) Σ_{m=1}^{M_reg^{C_n}} |v^{φ_n}(x_{t_n}^{reg}(m)) − y_{t_n}^{reg}(m)|^2   (34)

is minimized. We denote by φ̂_n an optimized version of φ_n and define our neural network approximator of the option price at t_n by

  v̂_{t_n}^{NN}(·) = v^{φ̂_n}(·).

To avoid repetition, the description of the neural networks in Subsection 3.1.1 is also valid for the neural network used here. However, one important difference is the output, which in this section is an approximation of the option value and should hence take on values in (0, ∞). A natural choice of activation function in the output layer is therefore ReLU(·) = max{0, ·}. Furthermore, by shifting the output by −g_{t_n}(·), i.e., designing the neural network to approximate v^{φ̂_n}(·) − g_{t_n}(·) with a ReLU output layer, and defining v̂_{t_n}^{NN}(·) as the network output plus g_{t_n}(·), we can, for all x ∈ R^d, enforce v̂_{t_n}^{NN}(x) ≥ g_{t_n}(x). In many cases, however, it seems to be beneficial to use the identity as the activation function in the output layer. This could possibly be explained by the fact that, when using the ReLU as activation function, the gradient of the loss function (34) with respect to the trainable parameters may vanish during training.
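The output shift just described can be sketched as follows; a minimal numpy illustration, in which `net_output` stands for the last pre-activation of the network:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def v_nn_shifted(net_output, payoff):
    """Shifted output layer: the network approximates v - g, the ReLU clips
    negative values to zero, and adding the pay-off back enforces v_hat >= g."""
    return relu(net_output) + payoff

def empirical_l2_risk(pred, y):
    """Empirical L2-risk (34) between predictions and cashflow targets."""
    return np.mean((pred - y) ** 2)
```

Note that whenever `net_output` is negative, the ReLU clips it to zero and its local gradient vanishes, which is the drawback mentioned above.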
This, in turn, leads to an inefficient use of gradient descent-type algorithms in the optimization problem. Another difference, which has to do with the training phase, is that the optimization of the parameters φ_n does not have to be carried out recursively. This opens up the possibility of parallelizing the code.

By comparing (34) to (30), we see that the optimization problems are similar. There are, however, some major differences. In Subsection 4.2, we have a closed-form expression for the optimal parameters, resulting in the final regression function (32). This is not the case for the neural network regression, and we therefore need to use an optimization algorithm to approximate the optimal parameters. On the other hand, as mentioned in Subsection 4.2, it is sometimes hard to find basis functions that are flexible enough. This problem can be overcome with neural networks, which are known to be good global approximators.

5 Approximation algorithms for exposure profiles
In this section, we introduce different ways to estimate (9) and (10), relying on Monte-Carlo sampling and the approximation algorithms described in Sections 3 and 4. Furthermore, a simple example is presented and visualized, which aims to provide an intuitive understanding of the different methods. Finally, the requirements of each method are summarized in a table. In this section, the neural network-based approximation of the value function of the option, introduced in Subsection 4.3, is used. However, it would have been possible to use the OLS-based approximation from Subsection 4.2 instead.

We use M ∈ N independent realizations of X, which for m ∈ {1, 2, . . . , M} are denoted by x(m) = (x_{t_n}(m))_{n=0}^N for t_n ∈ T. When X is given by (18), realization m is denoted by x̃^{t_n}(m) = (x̃_t^{t_n}(m))_{t ∈ [t_0, t_N]}, where we recall that the superscript t_n refers to the point in time where the discontinuity of the drift coefficient is located. We introduce the following two approximations of the expected exposure under the risk-neutral measure:

  ÊE_Q^{(1)}(t_n) = (1/M) Σ_{m=1}^M v̂_{t_n}^{NN}(x_{t_n}(m)) I_{{τ(f^{θ̂}) > t_n}},   (35)

  ÊE_Q^{(2)}(t_n) = (1/M) Σ_{m=1}^M e^{−r(τ_n(f^{θ̂}_n) − t_n)} g_{τ_n(f^{θ̂}_n)}(x_{τ_n(f^{θ̂}_n)}(m)) I_{{τ(f^{θ̂}) > t_n}}.   (36)

For the expected exposure under the real-world measure, we have the following three approximations:

  ÊE_P^{(1)}(t_n) = (1/M) Σ_{m=1}^M v̂_{t_n}^{NN}(x̃_{t_n}^{t_n}(m)) I_{{τ(f^{θ̂}) > t_n}},   (37)

  ÊE_P^{(2)}(t_n) = (1/M) Σ_{m=1}^M e^{−r(τ_n(f^{θ̂}_n) − t_n)} g_{τ_n(f^{θ̂}_n)}(x̃_{τ_n(f^{θ̂}_n)}^{t_n}(m)) I_{{τ(f^{θ̂}) > t_n}},   (38)

  ÊE_P^{(3)}(t_n) = (1/M) Σ_{m=1}^M e^{−r(τ_n(f^{θ̂}_n) − t_n)} g_{τ_n(f^{θ̂}_n)}(x_{τ_n(f^{θ̂}_n)}(m)) I_{{τ(f^{θ̂}) > t_n}} l(x_{t_n}(m), . . . , x_{t_0}(m)),   (39)

where l is the likelihood-ratio function defined in Theorem 2.2. We note that (35) and (37) are the only approximations that require a calculation of the option value. On the other hand, we need X to be described by a diffusion-type SDE in (38), and we need to know the density functions to calculate (39). To define the approximations of the potential future exposure, we start by defining the order statistics of (v̂_{t_n}^{NN}(x_{t_n}(m)) I_{{τ(f^{θ̂}) > t_n}})_{m=1}^M, i.e., the vector

  ( v̂_{t_n}^{NN}(x_{t_n}(m̃_1)) I_{{τ(f^{θ̂}) > t_n}}, . . . , v̂_{t_n}^{NN}(x_{t_n}(m̃_M)) I_{{τ(f^{θ̂}) > t_n}} )

satisfying v̂_{t_n}^{NN}(x_{t_n}(m̃_i)) ≤ v̂_{t_n}^{NN}(x_{t_n}(m̃_j)) whenever i ≤ j. Furthermore, we define

  i_α = ⌈αM⌉, for α ≥ 0.5,  i_α = ⌊αM⌋, for α < 0.5.

The approximations of the potential future exposure are then defined as

  P̂FE_Q^α(t_n) = v̂_{t_n}^{NN}(x_{t_n}(m̃_{i_α})) I_{{τ(f^{θ̂}) > t_n}},
  P̂FE_P^α(t_n) = v̂_{t_n}^{NN}(x̃_{t_n}^{t_n}(m̃_{i_α})) I_{{τ(f^{θ̂}) > t_n}}.

In Table 1, some characteristics of the calculations needed for each approximation are given. To explain the different approximations in a concrete setting, we turn to a simple example.
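A minimal numpy sketch of the expected-exposure average in (35)/(37) and of the PFE order-statistic construction; the pathwise option values v̂^NN and the survival indicators I_{τ > t_n} are assumed to be precomputed arrays:

```python
import numpy as np

def expected_exposure(values, alive):
    """Monte-Carlo average as in (35)/(37): values[m] is the pathwise option
    value at t_n; alive[m] = 1 if tau > t_n (the exposure is zero otherwise)."""
    return np.mean(values * alive)

def pfe(values, alive, alpha):
    """Empirical alpha-percentile via the order statistic, with the index
    i_alpha = ceil(alpha*M) for alpha >= 0.5 and floor(alpha*M) otherwise."""
    e = np.sort(values * alive)
    M = e.size
    i = int(np.ceil(alpha * M)) if alpha >= 0.5 else int(np.floor(alpha * M))
    i = max(i, 1)          # guard against floor(alpha*M) = 0 for small alpha
    return e[i - 1]        # i_alpha is a 1-based order-statistic index
```

For example, with exposures (1, 2, 3) on three surviving paths and a fourth path already exercised, the expected exposure is the plain average over all M paths (exercised paths contributing zero), and the PFE picks the i_α-th smallest exposure.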
Example 5.1.
Consider a one-dimensional American put option, where we for simplicity assume that r = 0. We are interested in the expected exposure and the potential future exposure at time t_n ∈ (t_0, t_N), given that we have full knowledge of the market at time t_0. We give a short explanation of the intuition behind the different methods by referring to Figure 2, where the problem is visualized.

We start with ÊE_Q^{(1)}(t_n), i.e., (35), and ÊE_P^{(1)}(t_n), i.e., (37), for which we only use the left figure. For ÊE_Q^{(1)}(t_n), we follow the blue samples and note that samples 2 and 3 are not exercised prior to, or at, t_n, which means that the indicator function in (35) equals 1. Sample 1, on the other hand, touches the exercise region prior to t_n and has therefore already been exercised, which means that

  ÊE_Q^{(1)}(t_n) = (1/3) ( v̂^{NN}(x_{t_n}(1)) I_{{τ(f^{θ̂}) > t_n}} + v̂^{NN}(x_{t_n}(2)) I_{{τ(f^{θ̂}) > t_n}} + v̂^{NN}(x_{t_n}(3)) I_{{τ(f^{θ̂}) > t_n}} ) = (1/3) ( v̂^{NN}(x_{t_n}(2)) + v̂^{NN}(x_{t_n}(3)) ).

Table 1: A specification of which approximations/entities are required in order to carry out the different calculations (✓ = required, – = not required).

  Approximation | Functional form of V_{t_n}(·) | Sampling from X̃^{t_n} | Known density functions | X given by diffusion-type SDE
  ÊE_Q^{(1)}   | ✓ | – | – | –
  ÊE_Q^{(2)}   | – | – | – | –
  ÊE_P^{(1)}   | ✓ | ✓ | – | –
  ÊE_P^{(2)}   | – | ✓ | – | ✓
  ÊE_P^{(3)}   | – | – | ✓ | –
  P̂FE_Q       | ✓ | – | – | –
  P̂FE_P       | ✓ | ✓ | – | –

When focusing on the red samples instead, we see that no sample touches the exercise region prior to t_n, and we obtain

  ÊE_P^{(1)}(t_n) = (1/3) ( v̂^{NN}(x̃_{t_n}^{t_n}(1)) + v̂^{NN}(x̃_{t_n}^{t_n}(2)) + v̂^{NN}(x̃_{t_n}^{t_n}(3)) ).

Similarly, we can, e.g., state that for a small α, P̂FE_Q^α(t_n) = 0 and P̂FE_P^α(t_n) = v̂^{NN}(x̃_{t_n}^{t_n}(2)).

Moving on to ÊE_P^{(2)}(t_n), i.e., (38), and ÊE_P^{(3)}(t_n), i.e., (39), we shift focus to the right figure. For ÊE_P^{(2)}(t_n), we want to use the red samples and notice that samples 2 and 3 end up out of the money.
We therefore obtain
$$\widehat{\mathrm{EE}}_{\mathbb{P}}(t_n) = \frac{1}{3}\, g_{\tau_n(f_{\theta_n})}\Big(\tilde x^{t_n}_{\tau_n(f_{\theta_n})}(1)\Big).$$
For the likelihood-ratio-based estimate of $\widehat{\mathrm{EE}}_{\mathbb{P}}(t_n)$, we instead consider the blue samples and see that sample 1 is exercised prior to $t_n$, and samples 2 and 3 end up in the money. However, we also need to adjust the estimate for using the wrong state process: the samples are generated from the state process under the $\mathbb{Q}$-measure between $t_0$ and $t_n$, which, if not corrected, would be in conflict with the definition of $\mathrm{EE}_{\mathbb{P}}(t_n)$. This is done by multiplying each term with the likelihood ratios $l(x(2))$ and $l(x(3))$, to finally obtain
$$\widehat{\mathrm{EE}}_{\mathbb{P}}(t_n) = \frac{1}{3}\Big( g_{\tau_n(f_{\theta_n})}\big(x_{\tau_n(f_{\theta_n})}(2)\big)\, l(x(2)) + g_{\tau_n(f_{\theta_n})}\big(x_{\tau_n(f_{\theta_n})}(3)\big)\, l(x(3)) \Big).$$
The last estimate, $\widehat{\mathrm{EE}}_{\mathbb{Q}}(t_n)$, is obtained by removing the likelihood ratios from the estimate for $\widehat{\mathrm{EE}}_{\mathbb{P}}(t_n)$ above.

Figure 2: Blue trajectories are distributed as $X$ and red trajectories are distributed as $\tilde X^{t_n}$. The boundary for immediate exercise is pointed out and should be interpreted as follows: as soon as a trajectory touches the boundary, the option is exercised and the holder receives the immediate pay-off. Recall that the exercise boundaries are calculated in order to be optimal under the $\mathbb{Q}$-measure.
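In code, the estimators of Example 5.1 amount to an indicator-weighted Monte-Carlo average and an order statistic. The following is a minimal sketch; the function name `ee_pfe` and the inputs `values` (fitted pathwise option values) and `tau` (learned stopping times per path) are hypothetical stand-ins for the quantities produced by the trained networks.

```python
import numpy as np

def ee_pfe(values, tau, t_n, alpha):
    """Monte-Carlo estimators of EE(t_n) and PFE_alpha(t_n).

    values : shape (M,), pathwise option values v_hat(x_{t_n}(m))
    tau    : shape (M,), realized stopping times per path
    """
    exposure = values * (tau > t_n)   # exposure is zero once the option is exercised
    ee = exposure.mean()              # simple Monte-Carlo average

    # empirical alpha-percentile via the order statistic i_alpha
    M = len(exposure)
    idx = int(np.ceil(alpha * M)) if alpha >= 0.5 else int(np.floor(alpha * M))
    idx = max(idx, 1)                 # guard: i_alpha is 1-based
    pfe = np.sort(exposure)[idx - 1]
    return ee, pfe
```

For instance, with three of four paths still alive at $t_n$, the estimator averages the three surviving exposures and a zero, exactly as in the worked example above.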
To conclude, we note that Figure 2 displays, to the left, the cases where functional-form approximations of the option values are used, and to the right, the cases where cashflow-paths are used (this can be read from Table 1). Furthermore, blue and red trajectories are distributed according to $X$ and $\tilde X^{t_n}$, respectively (this can also be read from Table 1).

This section is divided into two parts. In the first part we use the Black–Scholes dynamics for the underlying assets. The proposed algorithm is compared with two different Monte-Carlo-based algorithms for a two-dimensional case. We focus both on the accuracy of the computed exercise boundaries and on the exposure profiles. Furthermore, the exercise boundary is compared with the exercise boundary for the corresponding American option, computed by a finite element method from the PDE-formulation of the problem. Exposure profiles are then computed, both under the risk neutral and the real world measures, for problems in 2 and 30 dimensions. Finally, we compare the OLS-regression with the NN-regression. In the second part we consider stochastic volatility and compute exposure profiles under the Heston model.
In the Black–Scholes setting the only risk factors are the $d \in \mathbb{N}$ underlying assets, denoted by $S$. We assume a constant risk-free interest rate $r \in \mathbb{R}$ and, for each asset $i \in \{1, 2, \dots, d\}$, a volatility $\sigma_i \in (0, \infty)$ and a constant dividend rate $q_i \in (0, \infty)$. The state process $X$ is then given by the asset process $S = (S_t)_{t \in [t_0, t_N]}$, i.e., $X = S$, following the SDE
$$\frac{\mathrm{d}(S_t)_i}{(S_t)_i} = (A_i - q_i)\,\mathrm{d}t + \sigma_i\,\mathrm{d}(W^{\mathbb{A}}_t)_i, \quad (S_{t_0})_i = (s_{t_0})_i; \quad t \in [t_0, t_N], \qquad (40)$$
with initial state $(s_{t_0})_i \in (0, \infty)$, and where $\mathbb{A}$ is either the real world measure $\mathbb{P}$ or the risk neutral measure $\mathbb{Q}$. In the real world setting $A_i = \mu_i \in \mathbb{R}$, and in the risk neutral setting $A_i = r$. The process $W^{\mathbb{A}} = (W^{\mathbb{A}}_t)_{t \in [t_0, t_N]}$ is a $d$-dimensional Brownian motion under the measure $\mathbb{A}$. Furthermore, $W^{\mathbb{A}}$ is correlated with correlation parameters $\rho_{ij} \in [-1, 1]$, i.e., for $i, j \in \{1, 2, \dots, d\}$, we have $\mathbb{E}^{\mathbb{A}}\big[\mathrm{d}(W^{\mathbb{A}}_t)_i\, \mathrm{d}(W^{\mathbb{A}}_t)_j\big] = \rho_{ij}\,\mathrm{d}t$, with $\rho_{ii} = 1$. Moreover, for $t_0 \le u \le t \le t_N$, a closed-form solution to (40) is given by
$$(S_t)_i = (S_u)_i \exp\Big( \big(A_i - q_i - \tfrac{1}{2}\sigma_i^2\big)(t - u) + \sigma_i\big( (W^{\mathbb{A}}_t)_i - (W^{\mathbb{A}}_u)_i \big) \Big). \qquad (41)$$
We note that $\log S$ has Gaussian increments under both the $\mathbb{P}$- and the $\mathbb{Q}$-measure, which implies that we have access to closed-form expressions for the transition densities of $S$, and in turn also to the likelihood ratio in (39).

At exercise, a Bermudan max-call option pays the positive part of the maximum of the underlying assets after subtraction of a fixed amount $K \in \mathbb{R}$. This implies an identical pay-off function at all exercise dates, given by $g(s) = (\max\{s_1, s_2, \dots, s_d\} - K)^+$, where $s = (s_1, s_2, \dots, s_d) \in (0, \infty)^d$ and, for $c \in \mathbb{R}$, $(c)^+ = \max\{c, 0\}$.

We choose to focus on Bermudan max-call options for two reasons: first, there exist plenty of comparable results in the literature (see e.g., [1], [24], [25]); second, and more importantly, the exercise region is nontrivial, with several interesting features. For example, $\max\{s_1, s_2, \dots, s_d\}$ is not a sufficient statistic for making optimal exercise decisions, meaning that we cannot easily reduce the dimensionality (all dimensions are needed in order to price the option). This is not the case with, e.g., the geometric-average call option with pay-off function $g(s) = \big( \sqrt[d]{\prod_{i=1}^d s_i} - K \big)^+$, since $\sqrt[d]{\prod_{i=1}^d s_i}$ is a sufficient statistic for optimal exercise decisions (only the geometric average is needed to price the option, meaning that the problem can be reduced to 1 dimension, see e.g., [37]). Another example is the arithmetic-average call option with pay-off function $g(s) = \big( \frac{1}{d}\sum_{i=1}^d s_i - K \big)^+$. Similar to the max-call option, $\frac{1}{d}\sum_{i=1}^d s_i$ is not a sufficient statistic for optimal exercise decisions, but on the other hand the exercise region is convex (see [38, Proposition 6.1]). Convexity of the exercise region does not hold for the max-call option, making it hard to capture the exercise region when global polynomials are used as basis functions in, e.g., the LSM. Methods which instead rely on local regression can, to some extent, overcome this. (In this section we have assumed that results for exercise regions for American options also hold for their Bermudan counterparts. Convexity here means convex in the underlying assets for fixed $t$.)

We use the parameters $r = 0.05$ and, for $i, j \in \{1, 2, \dots, d\}$, $q_i = 0.10$, $\sigma_i = 0.2$, $\rho_{ij} = 0$ for $i \ne j$, $N = 9$, $t_0 = 0$, $t_N = 3$, $(s_{t_0})_i = 100$ and $K = 100$. We want to emphasize that the choice of having no correlation between assets is due to the fact that this case has been studied thoroughly in the literature (see e.g., [1], [24], [25], [38]). To verify that the algorithm is also able to tackle problems with correlated assets, we replicated an experiment in [29], of pricing a put option on the arithmetic average of 5 correlated assets, and obtained the same price.

The performance of the DOS algorithm is thoroughly explored for a wide range of different examples in [1]. However, the convergence with respect to the amount of training data used, $M_{\text{train}}$, was not given there. We therefore present, in Figure 3, an example of how the option price at $t_0$ converges to a reference value in terms of the amount of training data. In the considered example, we use the parameter values given at the end of Subsection 6.1.1 with $d = 2$, i.e., a two-dimensional Bermudan max-call option. The reference value (13.902) for $V_{t_0}(x_{t_0})$ is computed by a binomial lattice model in [21]. The DOS approximation, denoted by $\hat V_{t_0}(x_{t_0})$, is computed according to (24). To be more specific, one neural network is trained for each of seven values of $M_{\text{train}}$, as described in Subsection 3.1.2. For each $M_{\text{train}}$, the option value $V_{t_0}(x_{t_0})$ is computed 20 times (with 20 independent realizations of $X$, each of $M_{\text{val}}$ validation samples), and the average of these 20 values is shown in Figure 3. Furthermore, the figure to the right displays empirical 95%-confidence intervals of the sample mean, computed by adding and subtracting $1.96/\sqrt{20}$ times the sample standard deviation.

Figure 3: Convergence of approximate option values in the amount of training data. Reference value 13.902, computed by a binomial lattice model in [21]. Left:
With different levels of deviation from the reference value. Right: With empirical 95%-confidence intervals of the sample mean.
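As an illustration of the simulation setup behind these experiments, the exact solution (41) allows the asset paths to be sampled directly at the exercise dates. The sketch below uses illustrative default parameters in the spirit of the values listed above; the function names are ours, not the paper's.

```python
import numpy as np

def simulate_gbm(M, d, N, dt, s0=100.0, r=0.05, q=0.10, sigma=0.2, seed=0):
    """Sample M paths of d uncorrelated GBM assets on N steps via the exact solution (41)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((M, N, d))
    log_incr = (r - q - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * Z
    S = s0 * np.exp(np.cumsum(log_incr, axis=1))
    # prepend the (deterministic) initial state at t_0
    return np.concatenate([np.full((M, 1, d), s0), S], axis=1)

def max_call_payoff(S, K=100.0):
    """Bermudan max-call pay-off g(s) = (max_i s_i - K)^+ at every date."""
    return np.maximum(S.max(axis=-1) - K, 0.0)

S = simulate_gbm(M=10_000, d=2, N=9, dt=3.0 / 9.0)  # 9 exercise dates on [0, 3]
g = max_call_payoff(S)                               # shape (10_000, 10)
```

Because the increments of $\log S$ are Gaussian, no time-discretization error is introduced here; the only approximation error in the subsequent pricing step comes from the learned stopping rule and the regression.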
We start with a short introduction of the two Monte-Carlo-based algorithms with which we compare results for the two-dimensional max-call option.

The least squares method (LSM), proposed in [22], is one of the most used methods for pricing American/Bermudan options. The method approximates exercise boundaries and computes the option value from discounted cashflows. The regression is carried out globally, i.e., with the same regression function over the entire state space. However, if one is only interested in the option value at $t_0$, it is beneficial to only consider ITM-samples in the regression step. This is done when the exercise boundary and the upper-percentile PFE are approximated. For approximating the EE and the lower-percentile PFE, we need the entire distribution of future option values, forcing us to also include OTM-samples in the regression. We use as basis functions the first 6 weighted Laguerre polynomials
$$L_n(x) = e^{-x/2}\,\frac{e^{x}}{n!}\,\frac{\mathrm{d}^n}{\mathrm{d}x^n}\big(x^n e^{-x}\big)$$
of $(S_{t_n})_1$ and $(S_{t_n})_2$, and the first 4 powers of $\max\{\log((S_{t_n})_1), \log((S_{t_n})_2)\}$, for all $t_n \in \mathbb{T}$. Note that since we have no correlation between $(S)_1$ and $(S)_2$, we do not include cross-terms in the basis.

The second algorithm is the stochastic grid bundling method (SGBM), proposed in [25], in which regression is carried out locally in so-called bundles. In the SGBM the target function in the regression phase is the option value, which makes it suitable for approximations of exposure profiles.

(We obtained the price 0.1804, which is the same price, up to the given accuracy of 4 digits, as presented in [29, parameters as in Set II, results in Table 3 on p. 22].)

There are several papers on computations
of exposure profiles with the SGBM, see e.g., [4]. At each exercise date $t_n \in \mathbb{T}$, we use as basis functions: a constant, the first 4 powers of $(S_{t_n})_1$ and $(S_{t_n})_2$, and the first 2 powers of $\max\{\log((S_{t_n})_1), \log((S_{t_n})_2)\}$. Furthermore, the state space is divided into 32 equally-sized bundles based on $\max\{(S_{t_n})_1, (S_{t_n})_2\}$.

Before presenting any results, we recall from equations (9) and (10) that the expected exposure and the potential future exposure are two statistics of
$$V_{t_n}(S_{t_n})\,\mathbb{I}_{\{\tau > t_n\}}, \qquad (42)$$
where $\tau$ is an $S$-stopping time. This means that approximations of the exposure profiles are sensitive not only to the option value itself, but also to the exercise strategy. The SGBM and LSM only compute the exercise strategy implicitly, i.e., by comparing the approximate option value and the immediate pay-off. Therefore, small errors in the approximate option value close to the exercise boundary can lead to significant errors in the exercise strategy. We therefore start by presenting a comparison of the exercise boundaries. As a reference, we use the exercise boundary for the corresponding American option, which is computed from the PDE-formulation with the finite element method used in [39]. We note that since the PDE formulation refers to the American option, the exercise boundary differs slightly from the exercise boundary of the Bermudan counterpart, which we are interested in.

In Figure 4, a comparison of the exercise boundaries at $t \approx 0.67$ for the different algorithms is presented. As we can see, the DOS algorithm captures the shape of the exercise regions, while both the SGBM and the LSM seem to struggle, especially with the part of the continuation region along (and around) the line $(S_{t_n})_1 = (S_{t_n})_2$, in particular for high values of $(S_{t_n})_1$ and $(S_{t_n})_2$. The irregularly shaped features in the continuation region for the SGBM are a consequence of the local regression. In particular, the triangular shapes come from the specific choice of bundling rule (bundling based on $\max\{(S_{t_n})_1, (S_{t_n})_2\}$).

Moving on to the exposure profiles, we see in Figure 5 that, even though the DOS algorithm and the SGBM seem to agree on the exposure profile in general, we notice a difference in the upper-percentile PFE. This is a consequence of the slight bias towards classifying samples as belonging to the continuation region, which is shown in Figure 5, to the right. The LSM performs worse, both in terms of accuracy of exposure profiles and bias towards misclassification. This is, however, not a surprise, since the LSM is tailored to calculate the option value at $t_0$.

Finally, it should be pointed out that for both the SGBM and the LSM, it could very well be the case that another set of basis functions would better capture the shape of the exercise boundaries. In this two-dimensional example one could probably use geometric intuition to come up with a better set of basis functions, but in higher dimensions, and for more complicated pay-off functions, this becomes difficult.

In this section we compare exposure profiles under different measures for the max-call option, in 2 and 30 dimensions. In
Case I we set $d = 2$, and $\mathbb{P}_1$ and $\mathbb{P}_2$ such that for $i \in \{1, 2\}$ we have the drift $(\mu_1)_i = 15\%$ and a negative drift $(\mu_2)_i$. In Case II we set $d = 30$, and $\mathbb{P}_1$ and $\mathbb{P}_2$ such that for $i \in \{1, \dots, 30\}$ we have the drift $(\mu_1)_i = 7.5\%$ and a smaller positive drift $(\mu_2)_i$ of roughly $2\%$.

Finally, we compare the performance of the OLS-regression with the NN-regression, introduced in Subsections 4.2 and 4.3. We emphasize that both the OLS-regression and the NN-regression regress the current state on discounted cashflow paths approximated by the DOS algorithm (described in Section 3). They can therefore be seen as phase 2 of an algorithm producing pathwise option values.

After conducting numerical experiments on a variety of different examples, we conclude that the expected exposures are very similar for the two regression methods. However, the potential future exposure is not always captured by the OLS-regression. The difficulty lies in finding a set of computationally feasible basis functions, flexible enough to accurately capture a complicated function surface on a large domain (a similar problem as for the LSM in Subsection 6.1.3). To overcome this problem, we also implement a slightly different version of the algorithm, where we instead carry out local regression in the continuation regions. To be able to differentiate between the local and global OLS-regressions, we denote the regression functions by $v^{\mathrm{OLS}}_{\mathrm{loc}}$ and $v^{\mathrm{OLS}}_{\mathrm{glob}}$, respectively. The localization procedure is done similarly as in the SGBM, i.e., at each $t \in \mathbb{T}$, the state space is

(In our experiments the errors of the approximate option values seem to be of the same sign locally, i.e., the polynomial basis functions underestimate the option value in some regions and overestimate it in others. In fact, the continuation region for an American option is a subset of the continuation region for the Bermudan counterpart.)

Figure 4: Comparison of the exercise boundaries at $t \approx 0.67$. From top left to bottom right: FEM (American option), DOS, SGBM and LSM, respectively.

Figure 5: Comparison of DOS, SGBM and LSM for a two-dimensional max-call option.
Left: Expected exposure and potential future exposures at 97.5%- and 2.5%-levels. Right: Proportion of options exercised at different exercise dates.

Figure 6: Approximate exposures at the exercise dates over the lifetime of the contract. Top: Case I. Bottom: Case II. Left: For $i = 1$ and $i = 2$: $\widehat{\mathrm{PFE}}{}^{0.975}_{\mathbb{P}_i}$, $\widehat{\mathrm{PFE}}{}^{0.975}_{\mathbb{Q}}$, $\widehat{\mathrm{EE}}_{\mathbb{P}_i}$, $\widehat{\mathrm{EE}}_{\mathbb{Q}}$, $\widehat{\mathrm{PFE}}{}^{0.025}_{\mathbb{P}_i}$ and $\widehat{\mathrm{PFE}}{}^{0.025}_{\mathbb{Q}}$. Right:
For $i = 1$ and $i = 2$: the two approximations of $\widehat{\mathrm{EE}}_{\mathbb{Q}}$ and the three approximations of $\widehat{\mathrm{EE}}_{\mathbb{P}_i}$.

Figure 7: Proportion of samples exercised at different exercise dates under the measures $\mathbb{Q}$, $\mathbb{P}_1$ and $\mathbb{P}_2$ for Case I.

divided into bundles of equal size (in terms of the number of samples in each bundle), based on $\max\{(S_t)_1, (S_t)_2\}$. With local OLS-regression we obtain almost identical exposure profiles as with NN-regression. In Figure 8, top left, the exposure profiles computed with the three different algorithms are displayed. Furthermore, from top right to bottom right, we compare the approximate risk premia for holding, instead of immediately exercising, the option at $t \approx 0.667$ and some $x \in \mathbb{R}$, i.e., $v^Z_t(x) - g_t(x)$, with $Z$ representing, in order, NN, $\mathrm{OLS}_{\mathrm{glob}}$ and $\mathrm{OLS}_{\mathrm{loc}}$. We know that for all $x \in \mathbb{R}$, it holds that $V_t(x) - g_t(x) \ge 0$. We see in Figure 8, top right, that this is captured by the NN-regression, since the values range from 0 to just above 12. When we carefully evaluate the values of the risk premia computed with local and global OLS-regression (Figure 8, bottom left and bottom right), we see that negative values exist in both plots. We see a similar phenomenon for high values, i.e., the range is stretched upwards in comparison to the values obtained with the NN-regression. The reason for the negative values is that $v^{\mathrm{OLS}}_{\mathrm{loc}}$ and $v^{\mathrm{OLS}}_{\mathrm{glob}}$ underestimate the option values close to the boundary (which in this case coincides with the exercise boundary, since the regression is carried out only in the continuation region). To compensate for this, we see a tendency towards higher values in the center (not close to the exercise boundaries) of the continuation regions. As a final remark, we note that this behaviour is reduced by using local regression instead of global regression (the range of values in Figure 8 is tighter for local than for global regression). On the other hand, we note discontinuities of the risk premium computed with the local OLS-regression in Figure 8, bottom left, stemming from the localized regression. The discontinuity in Figure 8, bottom right, is a boundary issue of the OLS-regression (regression is only carried out in the continuation region, since we set the value equal to the immediate pay-off in the exercise region).

Even though the local OLS-regression is less accurate than the other two algorithms in this section when it comes to computations of exposure profiles, it is clear, by comparing Figures 5 and 8, that it outperforms the LSM. This is not a surprise since:

1. The accumulated error in the LSM (because of the recursive dependency of the regression functions) is significantly reduced, since the regression functions are not sequentially dependent. The discounted cashflows are projected onto the risk factors directly.
This implies that the only error accumulation (over time) in the OLS-regression originates from the DOS algorithm, which computes the exercise boundaries with high accuracy;

2. By recalling equation (42), we note that a less accurate stopping strategy may cause a less accurate exposure.
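The second-phase projection described above can be sketched in a few lines: the realized discounted cashflows are regressed directly onto basis functions of the current state, without any sequential dependence between dates. The polynomial basis below is only an illustration, not the exact basis used in the experiments, and the inputs `X` (states at a fixed exercise date) and `Y` (discounted cashflows under the learned stopping rule) are assumed given.

```python
import numpy as np

def ols_option_values(X, Y, degree=2):
    """Project discounted cashflows Y onto polynomial basis functions of the state X.

    X : (M, d) states at a fixed exercise date t_n
    Y : (M,)   discounted cashflows realized from t_n under the learned stopping rule
    Returns the fitted pathwise option values v_hat(X).
    """
    # monomial basis up to `degree` in each coordinate, no cross terms (illustration only);
    # X**0 supplies the constant column
    B = np.hstack([X**k for k in range(degree + 1)])
    beta, *_ = np.linalg.lstsq(B, Y, rcond=None)
    return B @ beta

# usage: recover a quadratic value surface from noisy cashflows
rng = np.random.default_rng(1)
X = rng.uniform(80.0, 120.0, size=(5000, 1))
Y = 0.01 * (X[:, 0] - 100.0) ** 2 + rng.normal(0.0, 1.0, 5000)
v = ols_option_values(X, Y)
```

The NN-regression variant replaces the fixed basis `B` with a trained network; the projection target (the cashflows `Y`) is the same in both cases, which is why the two methods differ mainly in how flexibly they can represent the value surface.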
In this section we assume a one-dimensional underlying asset following the Heston stochastic volatility model [40], which is considered only under the risk neutral measure. We therefore omit the explicit notation of the probability measure in this section. In this setting, the market is described not only by the underlying asset price process $S = (S_t)_{t \in [t_0, t_N]}$ itself, but also by the instantaneous variance process $\nu = (\nu_t)_{t \in [t_0, t_N]}$. The state process is then the two-dimensional process $X = (\nu, S)$, which satisfies the system of SDEs
$$\mathrm{d}S_t = (r - q) S_t\,\mathrm{d}t + \sqrt{\nu_t}\, S_t\,\mathrm{d}W^S_t, \quad S_{t_0} = s_{t_0}; \quad t \in [t_0, t_N], \qquad (43)$$
$$\mathrm{d}\nu_t = \kappa(\theta - \nu_t)\,\mathrm{d}t + \xi \sqrt{\nu_t}\,\mathrm{d}W^\nu_t, \quad \nu_{t_0} = \nu_{t_0}; \quad t \in [t_0, t_N], \qquad (44)$$
with risk-free interest rate $r \in \mathbb{R}$, dividend rate $q \in (0, \infty)$, initial conditions $s_{t_0}, \nu_{t_0} \in (0, \infty)$, speed of mean reversion $\kappa \in (0, \infty)$, long-term mean of the variance process $\theta \in (0, \infty)$, and volatility coefficient of the variance process $\xi \in (0, \infty)$. Furthermore, $(W^S_t)_{t \in [t_0, t_N]}$ and $(W^\nu_t)_{t \in [t_0, t_N]}$ are two one-dimensional standard Brownian motions satisfying $\mathbb{E}[\mathrm{d}W^S_t\,\mathrm{d}W^\nu_t] = \rho_{\nu,S}\,\mathrm{d}t$ for some correlation parameter $\rho_{\nu,S} \in (-1, 1)$. If the Feller condition, $2\kappa\theta \ge \xi^2$, is satisfied, it holds that 0 is an unattainable boundary for $\nu$. If the Feller condition is not satisfied, then 0 is an attainable, but strongly reflective, boundary, see e.g., [43]. (Strongly reflective in the sense that the time spent at 0 is of Lebesgue measure zero, see e.g., [41].) This leads to an accumulation of probability mass around zero, which makes it more challenging to approximate $\nu$ accurately. Unfortunately, the Feller condition is rarely satisfied for parameters calibrated to the market.

In this paper, the QE-scheme proposed in [44] is used to approximate $(\nu, S)$. If necessary, we choose a finer time grid for the approximation of $(S, \nu)$ than the exercise dates $\mathbb{T}$.
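For orientation, a simple discretization of the system (43)-(44) can be sketched as follows. This is a full-truncation Euler scheme, which is deliberately simpler than the QE-scheme of [44] used in the paper and less accurate near the zero boundary of $\nu$; all default parameter values here are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np

def heston_paths(M, N, dt, s0=100.0, v0=0.04, r=0.04, q=0.0,
                 kappa=1.5, theta=0.04, xi=0.4, rho=-0.6, seed=0):
    """Full-truncation Euler scheme for the Heston SDEs (43)-(44).

    The variance is floored at zero inside the square roots (full truncation),
    so `v` may dip slightly below zero between steps but never feeds a
    negative value into sqrt. S is advanced in log-space and stays positive.
    """
    rng = np.random.default_rng(seed)
    S = np.full(M, s0)
    v = np.full(M, v0)
    S_path = np.empty((M, N + 1)); v_path = np.empty((M, N + 1))
    S_path[:, 0], v_path[:, 0] = S, v
    for n in range(N):
        Z1 = rng.standard_normal(M)
        Z2 = rho * Z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(M)  # correlated drivers
        v_plus = np.maximum(v, 0.0)
        S = S * np.exp((r - q - 0.5 * v_plus) * dt + np.sqrt(v_plus * dt) * Z1)
        v = v + kappa * (theta - v_plus) * dt + xi * np.sqrt(v_plus * dt) * Z2
        S_path[:, n + 1], v_path[:, n + 1] = S, v
    return S_path, v_path
```

When the Feller condition fails, schemes of this type are exactly where the accumulation of probability mass near $\nu = 0$ becomes visible, which motivates the paper's choice of the QE-scheme and, if necessary, a time grid finer than the exercise dates.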
(For notational convenience, the state process is denoted by $X$, which falsely indicates that we have an exact form for $X$. This is because our focus is on approximating option values, not the underlying state process. It should, however, be mentioned that $X$ needs to be approximated in this section.)

Figure 8: Top left: Exposure profiles. From top right to bottom left:
Approximate risk premium for holding the option instead of immediately exercising it at $t \approx 0.667$, i.e., $v^Z_t(\cdot) - g_t(\cdot)$, with $Z$ representing NN, $\mathrm{OLS}_{\mathrm{loc}}$ and $\mathrm{OLS}_{\mathrm{glob}}$, respectively.

We consider a standard Bermudan put option, i.e., identical pay-off functions at all exercise dates, only depending on the underlying asset. The pay-off function for the Bermudan put option is given by $g(s) = (K - s)^+$, for $s \in (0, \infty)$ and strike $K \in \mathbb{R}$.

In this section, we again compare the DOS algorithm with the two Monte-Carlo-based algorithms, SGBM and LSM. We use the model parameters $r = 0.04$, $q = 0$, $s_{t_0} = 100$, $\kappa = 1.15$, $\theta = 0.0348$, $\xi = 0.39$, $\nu_{t_0} = 0.0348$, $\rho_{\nu,S} = -0.64$, and the contract parameters $T = 0.25$, $N = 10$, $K = 100$. The parameters coincide with Set B in [45], in which the valuation is carried out with the so-called 2D-COS method. The 2D-COS method is a Fourier-based method and is assumed to yield a highly accurate valuation of the option.

For the LSM, we use as basis functions, for $t_n \in \mathbb{T}$: Laguerre polynomials of degree 3 of $S_{t_n}$, Laguerre polynomials of degree 3 of $\nu_{t_n}$, and $\nu_{t_n} S_{t_n}$ (only one constant basis function is used). For the SGBM, we use 32 equally-sized bundles based on $S_{t_n}$ and, for $t_n \in \mathbb{T}$, we use as basis functions: a constant, $S_{t_n}$, $S_{t_n}\nu_{t_n}$ and the first 3 powers of $\nu_{t_n}$. These parameters are chosen such that the approximate option values at $t_0$ are as close as possible to the (almost) exact value of 3.198944, for a Bermudan option with 10 exercise dates, retrieved from [45]. The obtained option values (for each algorithm, the average value of 10 runs) at $t_0$ were 3.1792, 3.2033, and 3.1222 for the DOS algorithm, the SGBM and the LSM, respectively.

It should be stressed that the numerical results for the LSM and the SGBM in this section should not be seen as state-of-the-art performance of these algorithms. For example, in [4], a bundling scheme based on recursive bifurcation and rotation of the state space, to match the correlation between $S$ and $\nu$, gave accurate results. Furthermore, they use as basis functions only monomials of $\log(S_{t_n})$, letting the stochastic variance $\nu_{t_n}$ enter only through conditional expectations of the form $\mathbb{E}\big[(\log(S_{t_n}))^k \mid S_{t_{n-1}}, \nu_{t_{n-1}}\big]$. These conditional expectations are computed from the characteristic function of $S_{t_n} \mid S_{t_{n-1}}, \nu_{t_{n-1}}$, which was presented in [40]. Similar improvements could also be made for the LSM.
This comparison is nevertheless relevant because it demonstrates the flexibility of DOS and the NN-regression, in which nothing has been changed (from the examples with Black–Scholes dynamics) except the dynamics of the stochastic process from which we generate training data.

In Figure 9, the exercise boundaries are presented in the state space. It is worth noticing that for $t_n \in \mathbb{T}$, both the option value and the pay-off function are increasing in $\nu_{t_n}$ and decreasing in $S_{t_n}$. An immediate consequence of this is that we can have at most one exercise boundary along lines with constant $S_{t_n}$ or constant $\nu_{t_n}$. We can therefore conclude that there are inconsistencies in the exercise boundaries for both the LSM and the SGBM. In Figure 9, on the bottom line to the right, the empirical probability density functions (pdf) of the exposures at the exercise dates are plotted. Figure 10 shows the exposure profiles and the exercise frequency computed with the three algorithms. We note that the DOS and the SGBM seem to agree fairly well on the EE and the lower percentile of the PFE, but differ significantly on the upper PFE.

Acknowledgments
This project is part of the ABC-EU-XVA project and has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 813261. Furthermore, we are grateful for discussions with Adam Andersson, Andrea Fontanari, Lech A. Grzelak, and Shashi Jain regarding the content of this paper.
References

[1] S. Becker, P. Cheridito, and A. Jentzen. Deep optimal stopping. Journal of Machine Learning Research, 20(74), (2019): 1-25.
[2] J. Gregory. The xVA Challenge: Counterparty Credit Risk, Funding, Collateral and Capital. John Wiley & Sons, (2015).
[3] A. Green. XVA: Credit, Funding and Capital Valuation Adjustments. John Wiley & Sons, (2015).
[4] C. S. de Graaf, Q. Feng, D. Kandhai, and C. W. Oosterlee. Efficient computation of exposure profiles for counterparty credit risk. International Journal of Theoretical and Applied Finance, 17(04), 1450024, (2014).
[5] Y. Shen, J. A. M. van der Weide, and J. H. M. Anderluh. A benchmark approach of counterparty credit exposure of Bermudan option under Lévy process: the Monte Carlo-COS method. Procedia Computer Science, 18, (2013): 1163-1171.

Figure 9: Approximate exercise boundaries for a Bermudan put option under the Heston model. From top left to bottom right: DOS, SGBM, LSM, and the empirical density of the exposure.

Figure 10: Comparison of the DOS algorithm, the SGBM and the LSM for a Bermudan put option under the Heston model. Left: Expected exposure and potential future exposures at 97.5%- and 2.5%-levels. Right: Proportion of options exercised at different exercise dates.

[6] Q. Feng, S. Jain, P. Karlsson, D. Kandhai, and C. W. Oosterlee. Efficient computation of exposure profiles on real-world and risk-neutral scenarios for Bermudan swaptions. Journal of Computational Finance, 20(1), (2016): 139-172.
[7] R. Baviera, G. La Bua, and P. Pellicioli. CVA with wrong-way risk in the presence of early exercise. In Innovations in Derivatives Markets, Springer, (2016): 103-116.
[8] M. Breton and M. Oussama. An efficient method to price counterparty risk. Groupe d'études et de recherche en analyse des décisions, (2014).
[9] P. A. Forsyth and K. R. Vetzal. Quadratic convergence for valuing American options using a penalty method. SIAM Journal on Scientific Computing, 23(6), (2001): 2095-2122.
[10] C. Reisinger and J. H. Witte. On the use of policy iteration as an easy way of pricing American options. SIAM Journal on Financial Mathematics, 3(1), (2012): 459-478.
[11] C. Vázquez. An upwind numerical approach for an American and European option pricing model. Applied Mathematics and Computation, 97(2-3), (1998): 273-286.
[12] T. Haentjens and K. J. in 't Hout. ADI schemes for pricing American options under the Heston model. Applied Mathematical Finance, 22(3), (2015): 207-237.
[13] K. in 't Hout. Numerical Partial Differential Equations in Finance Explained: An Introduction to Computational Finance. Palgrave Macmillan UK, (2017).
[14] F. Fang and C. W. Oosterlee. Pricing early-exercise and discrete barrier options by Fourier-cosine series expansions. Numerische Mathematik, 114(1), 27, (2009).
[15] O. Zhylyevskyy. A fast Fourier transform technique for pricing American options under stochastic volatility. Review of Derivatives Research.
[16] F. Fang and C. W. Oosterlee. A Fourier-based valuation method for Bermudan and barrier options under Heston's model. SIAM Journal on Financial Mathematics, 2(1), (2011): 439-463.
[17] M. Broadie and J. Detemple. American option valuation: new bounds, approximations, and a comparison of existing methods. The Review of Financial Studies, 9(4), (1996): 1211-1250.
[18] M. Rubinstein. Edgeworth binomial trees. Journal of Derivatives, 5, (1998): 20-27.
[19] J. C. Jackwerth. Generalized binomial trees. Journal of Derivatives, 5(2), (1996): 7-17.
[20] R. Bellman. Dynamic programming. Science, 153(3731), (1966): 34-37.
[21] L. Andersen and M. Broadie. Primal-dual simulation algorithm for pricing multidimensional American options. Management Science, 50(9), (2004): 1222-1234.
[22] F. A. Longstaff and E. S. Schwartz. Valuing American options by simulation: a simple least-squares approach. The Review of Financial Studies, 14(1), (2001): 113-147.
[23] M. Broadie and P. Glasserman. A stochastic mesh method for pricing high-dimensional American options. Papers 98-04, Columbia - Graduate School of Business, (1997).
[24] M. Broadie and M. Cao. Improved lower and upper bound algorithms for pricing American options by simulation. Quantitative Finance, 8, (2008): 845-861.
[25] S. Jain and C. W. Oosterlee. The stochastic grid bundling method: Efficient pricing of Bermudan options and their Greeks. Applied Mathematics and Computation, 269, (2015): 412-431.
[26] M. Kohler, A. Krzyżak, and N. Todorovic. Pricing of high-dimensional American options by neural networks. Mathematical Finance, 20(3), (2010): 383-410.
[27] S. Becker, P. Cheridito, and A. Jentzen. Pricing and hedging American-style options with deep learning. arXiv preprint arXiv:1912.11060, (2019).
[28] B. Lapeyre and J. Lelong. Neural network regression for Bermudan option pricing. arXiv preprint, (2019).
[29] V. Lokeshwar, V. Bhardawaj, and S. Jain. Neural network for pricing and universal static hedging of contingent claims. Available at SSRN 3491209, (2019).
[30] A. D. Green, C. Kenyon, and C. Dennis. KVA: Capital valuation adjustment. Risk, December, (2014).
[31] S. Jain, P. Karlsson, and D. Kandhai. KVA, mind your P's and Q's! Wilmott, 2019(102), (2019): 60-73.
[32] B. Øksendal. Stochastic Differential Equations. Springer, Berlin, Heidelberg, (2003).
[33] C. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall. Activation functions: Comparison of trends in practice and research for deep learning. arXiv preprint arXiv:1811.03378, (2018).
[34] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, (2015).
[35] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, (2006).
[36] H. White. Asymptotic Theory for Econometricians. Academic Press, (2014).
[37] P. Glasserman. Monte Carlo Methods in Financial Engineering. Vol. 53, Springer Science & Business Media, (2013).
[38] M. Broadie and J. Detemple. The valuation of American options on multiple assets. Mathematical Finance, 7(3), (1997): 241-286.
[39] I. Arregui, B. Salvador, D. Ševčovič, and C. Vázquez. PDE models for American options with counterparty risk and two stochastic factors: Mathematical analysis and numerical solution. Computers & Mathematics with Applications, (2019).
[40] S. Heston. A closed-form solution for options with stochastic volatility with applications to bond and currency options. The Review of Financial Studies, 6, (1993): 327-343.
[41] L. B. Andersen and V. V. Piterbarg. Moment explosions in stochastic volatility models. Finance and Stochastics, 11(1), (2007): 29-50.
[42] J. C. Cox, J. E. Ingersoll, and S. A. Ross. A theory of the term structure of interest rates. Econometrica, 53, (1985): 385-407.
[43] W. Feller. Two singular diffusion problems. Annals of Mathematics, (1951): 173-182.
[44] L. B. Andersen. Efficient simulation of the Heston stochastic volatility model. Journal of Computational Finance, (2007).
[45] M. J. Ruijter and C. W. Oosterlee. Two-dimensional Fourier cosine series expansion method for pricing financial options. SIAM Journal on Scientific Computing, 34(5), (2012): B642-B671.
[46] P. Glasserman and B. Yu. Simulation for American options: Regression now or regression later? In Monte Carlo and Quasi-Monte Carlo Methods, Springer, Berlin, Heidelberg, (2002): 213-226.
[47] K. H. F. Kan, G. Frank, V. Mozgin, and M. Reesor.