A machine learning approach to portfolio pricing and risk management for high-dimensional problems
Lucio Fernandez-Arjona† and Damir Filipović‡
22 May 2020
Abstract
We present a general framework for portfolio risk management in discrete time, based on a replicating martingale. This martingale is learned from a finite sample in a supervised setting. The model learns the features necessary for an effective low-dimensional representation, overcoming the curse of dimensionality common to function approximation in high-dimensional spaces. We show results based on polynomial and neural network bases. Both offer superior results to naive Monte Carlo methods and other existing methods like least-squares Monte Carlo and replicating portfolios.

Keywords: Solvency capital; dimensionality reduction; neural networks; nested Monte Carlo; replicating portfolios.
∗ We thank Antoon Pelsser for comments.
† University of Zurich. Email: [email protected]
‡ EPFL and Swiss Finance Institute. Email: damir.filipovic@epfl.ch

Financial institutions face a variety of risks on their portfolios. Whether they be market and credit risk for investment portfolios, default and prepayment risk on mortgage portfolios, or longevity risk on life insurance portfolios, the balance sheet of a bank or insurance company is exposed to many risk factors. Failure to manage these risks can lead to insolvency—with the associated losses to shareholders, bondholders and/or customers—or overly conservative business strategies, which hurt consumers.

Alongside qualitative assessments—and plenty of common sense—portfolio risk management requires quantitative models that are accurate and sufficiently fast to provide useful information to management. Additionally, government regulations, such as solvency regimes, require extensive calculations to produce the required reports.

Simulation techniques are often used to explore possible future outcomes. Since quantitative models require the estimation of conditional expectations across a time interval, Monte Carlo simulations can be used to calculate those expectations. However, Monte Carlo methods suffer from problems with both accuracy and speed.

Alternatives to Monte Carlo methods have been developed over the years, some of them using functional approximation techniques. Under this approach, an approximation to the full, slow model is built using one or more faster functions.
For example, the behaviour of a portfolio can be replicated via an appropriate combination of basis functions, which are faster to calculate than the original model.

Many, if not all, of these alternatives suffer from several problems: some are not entirely data-driven—requiring subject matter expertise—which limits their applicability to complex problems and the ability to automate them; others can be automated but have low quality of approximation; while others are limited to low-dimensional problems.

In this paper we present a model that overcomes or greatly diminishes the problems above. This model for calculating conditional expectations uses a machine learning approach to learn a suitable function from finite samples. We show how this function can be used to obtain accurate estimates of price and risk measures, focusing on practical real-world situations in terms of runtime and number of samples used.

The learned functions are linear combinations of functional bases, of which we present several examples, including polynomials and neural networks (of the single-layer feed-forward type). In all cases, the conditional expectations are calculated in closed form—even for neural networks—which contributes to the accuracy and speed of the solution.

We pay special attention to high-dimensional cases, which motivate the use of a special polynomial basis. In this basis, the input vector undergoes a data-driven, linear dimensionality reduction step similar to a linear neural network layer. This allows using polynomial-based replicating martingales in high-dimensional problems where full polynomial bases are infeasible.

The solutions obtained are martingales that replicate the value processes, given as risk-neutral conditional expectations, of financial or insurance products. Drawing a parallel with the concept of a replicating portfolio, we call this approach "replicating martingale".
Replicating portfolios is a widely used method in the financial industry, which relies on building linear combinations of derivatives to approximate the conditional expectations necessary for market risk calculations. Our proposed replicating martingale method is also based on linear combinations of basis functions, but these are not restricted to market risk calculations. Any risk exposure can be modelled with replicating martingales because their basis functions—polynomials or neural networks—are agnostic to the underlying risk type.

Given its regression-based nature and use of simulated samples, the replicating martingale is a method of the least squares Monte Carlo (LSMC) family. It is, however, global in the time dimension, unlike LSMC methods used for American option pricing, which are local in the time dimension, that is, they require repeated, recursive regressions. Within the LSMC family of methods, replicating martingales are a regress-later method (Glasserman and Yu (2002)), since the regression is made against terminal payoffs and not against empirical conditional expectations.

We use two examples—a financial derivative and an insurance product—to perform extensive testing of the accuracy of the risk calculations based on replicating martingales. In line with existing machine learning literature, we also include extensive comparisons with alternative methods, such as nested Monte Carlo and other LSMC methods. Also in line with machine learning best practices, we publish the datasets for our examples, Fernandez-Arjona and Filipović (2020). We hope that these datasets can be used by others to allow more direct comparisons between methods in future research.

An early example of static replication via basis functions can be found in Madan and Milne (1994), which presents a framework for the replication of general contingent claims. These contingent claims are modeled in a Hilbert space and the static replication problem is solved by constructing a countable orthonormal basis.
The method is applied to the pricing and hedging of these contingent claims.

Carriere (1996) and Longstaff and Schwartz (2001) use a sequential approximation algorithm to calculate the conditional expectations required in the valuation of options with an early exercise feature (like American options). The method estimates these conditional expectations from the cross-sectional information in the simulation by using least squares, which gives the method the name of least squares Monte Carlo. Andreatta and Corradin (2003) apply this idea to the valuation of life insurance policies.

Glasserman and Yu (2002) were the first to make a distinction between regress-now and regress-later LSMC, the former being the direct estimation of the conditional expectation function, and the latter the indirect estimation via regression against the terminal payoff of the contingent claim. Working in the context of American option pricing via approximate dynamic programming, they find that regress-later LSMC has interesting properties, among them less-dispersed estimates than regress-now LSMC.

In the area of polynomial regress-now LSMC, Broadie, Du, and Moallemi (2015) show that such regression-based methods can—asymptotically—improve the convergence rate of nested Monte Carlo methods. They provide quality comparisons against nested Monte Carlo and a delta-gamma approach, but not against regress-later methods, whereas we do.

A regress-later model based on orthonormal piecewise linear basis functions is presented in Pelsser and Schweizer (2016). Path-dependency and the resulting high-dimensional problem in long-term projections is managed via a hand-picked A(Z) function to avoid the curse of dimensionality. In that framework, the basis functions are not guaranteed to have a closed-form solution, whose existence depends on the choice of A(Z). This function must be given to the method, and is based on expert judgement and knowledge of the problem domain.
Moreover, the choice of A implies a trade-off between complexity and dimensionality—for a given target accuracy. High-dimensional A(Z) functions lead to the curse of dimensionality, while low-dimensional A(Z) functions might be too complex to find a closed-form solution to their conditional expectations. By contrast, our framework uses a data-driven dimensionality reduction function in the parameter space instead of an arbitrary function. In comparison to the piecewise linear model, which requires fixing a grid, neural networks are able to provide a data-driven grid for their activation functions. Regarding the quality of the approximation, Pelsser and Schweizer (2016) do not provide a quantitative metric for comparison between regress-now and regress-later models, whereas we provide extensive comparisons on risk metrics between these models, for both polynomial and neural network bases.

A neural network model is applied to the solvency capital problem in life insurance in Castellani et al. (2018). The neural network model shows better performance than LSMC (a regress-now model with polynomial basis functions) and a support vector regression model. All three models—including the neural network model—are regress-now models. By contrast, we focus on regress-later methods, which show better accuracy in our examples in addition to being better supported in theory.

In the field of uncertainty quantification, polynomial surrogate functions have been used for a long time to reduce the runtime of complex models. Recently, Hokanson and Constantine (2018) showed that polynomial ridge approximation can be extended to reduce dimensionality in a data-driven manner. We follow a similar approach and apply it, to our knowledge for the first time, to portfolio pricing and risk management.

Nested Monte Carlo methods for portfolio management have been studied in S.-H. Lee and Glynn (2003) and Gordy and Juneja (2010), among others.
Gordy and Juneja (2010) show several methods that can reduce the computational cost of nested Monte Carlo for a homogeneous portfolio of financial instruments with an additive structure. Since we aim to cover both financial derivatives and insurance portfolios—where it is common that portfolios do not exhibit such additive structure—we restrict our comparisons to standard nested Monte Carlo.

The remainder of the paper is as follows. Section 2 formalizes the replicating martingale problem and recalls the standard nested Monte Carlo approach. Section 3 describes our machine learning approach to the replicating martingale problem. Sections 4 and 5 provide numerical case studies: a European call option in Section 4, and an insurance liability model in Section 5. Section 6 concludes. The appendix contains proofs and technical background material. Section A describes the quality metrics used to compare different methods. Section B contains the economic scenario generator underlying the numerical examples in the main text. Sections C, D, and E contain proofs and values of key parameters. Sections F and G present an analysis of the sensitivity of the proposed methods to different hyper-parameters.

We work in an economy with finite time horizon T. Randomness is modeled on a filtered probability space (Ω, F, (F_t), Q), with Q being the risk-neutral measure corresponding to some numeraire. If not otherwise stated, all financial values and cash flows are discounted by this numeraire. There is an R^d-valued stochastic driver process X = (X_1, ..., X_T), which is adapted to the filtration (F_t), and such that X_t is independent of F_{t-1}. The physical measure is denoted by P ∼ Q.

We consider an asset–liability portfolio whose discounted cash flows accumulate to the discounted terminal value at T, which is given as a function of X,

f(X) = Σ_{t=1}^T ζ_t,  with time-t cash flows ζ_t = ζ_t(X_1, ..., X_t) ∈ L²_Q.

There are many examples that fit this description, such as financial derivatives, insurance liabilities, and mortgage-backed instruments. Our goal is to find the gains process of the portfolio, that is, the L²_Q-martingale given by

Z_t = E^Q_t[f(X)] = Σ_{s=1}^t ζ_s  (cumulative cash flows at t)  +  E^Q_t[ Σ_{s=t+1}^T ζ_s ]  (time-t value),

where E^Q_t[·] = E^Q[· | F_t] denotes the F_t-conditional expectation. In particular, Z_t is not given in closed form but has to be derived by simulating future cash flows. This creates many computational challenges for certain applications, as described below.

Applications: risk measurement
In many insurance and banking solvency regulatory frameworks (e.g., Solvency II, Swiss Solvency Test, Basel III), capital calculations are based on the value at risk or expected shortfall of −ΔZ_{t+1}, where ΔZ_{t+1} = Z_{t+1} − Z_t denotes the profit over the period [t, t+1]. These are defined as follows:

VaR_α(Y) = inf{ y : F_Y(y) ≥ α } = F_Y^{-1}(α),
ES_α(Y) = (1/(1−α)) ∫_α^1 VaR_γ(Y) dγ,   (1)

where Y is a random variable with P-distribution function denoted by F_Y, and α ∈ (0, 1) denotes the confidence level.

In insurance regulation, risk is normally measured over a one-year horizon and the economic capital is ρ[−ΔZ_1], where ρ is a placeholder for either VaR_α or ES_α. There are cases, however, where calculations require three-year capital projections, which would be calculated by ρ[−ΔZ_1], ρ[−ΔZ_2] and ρ[−ΔZ_3] for years one, two and three respectively.

Applications: hedging
Given tradable financial instruments with gains processes G, i.e., L²_Q-martingales, find a predictable hedging strategy ψ that approximately replicates the profit, ψ_t^⊤ ΔG_t ≈ ΔZ_t. For example, ψ could minimize the local quadratic hedging error, min_{ψ_t} E^Q_{t−1}[ |ψ_t^⊤ ΔG_t − ΔZ_t|² ], so that

ψ_t = E^Q_{t−1}[ΔG_t ΔG_t^⊤]^{-1} E^Q_{t−1}[ΔG_t ΔZ_t].
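On an empirical sample, the risk measures defined in (1) reduce to order statistics: VaR_α is the α-quantile of the losses and ES_α averages the worst (1−α) fraction of them. A minimal, stdlib-only sketch (illustrative; the quality metrics actually used in the paper are described in Appendix A):

```python
import math

def var_alpha(losses, alpha):
    """Empirical VaR_alpha: smallest y with F_Y(y) >= alpha (left-continuous inverse)."""
    ys = sorted(losses)
    k = math.ceil(alpha * len(ys)) - 1
    return ys[max(k, 0)]

def es_alpha(losses, alpha):
    """Empirical ES_alpha: average of the worst (1 - alpha) fraction of the losses."""
    ys = sorted(losses)
    tail = ys[int(alpha * len(ys)):]
    return sum(tail) / len(tail)

# For losses 1..100 and alpha = 0.99: VaR is 99, and ES averages the single worst loss, 100.
losses = list(range(1, 101))
print(var_alpha(losses, 0.99), es_alpha(losses, 0.99))
```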
In a real-world setting, both applications mentioned above require computing Z_t = E^Q_t[f(X)], which is subject to restrictions on the available computing budget, since the cumulative cash-flow function f might be very costly to evaluate.

Approximating the integrals in (1) via a Monte Carlo simulation is the standard method for calculating risk metrics. However, in many cases this method cannot be used in practice due to the large computational effort required. We refer to the following process as standard nested Monte Carlo: the outer stage of the simulation consists of a set of n_o simulations, X^{(i)}_{1:t}, i = 1, ..., n_o, that are independent and identically distributed according to the distribution µ of the risk factors until the time t of the risk calculation horizon. The portfolio value Z^{(i)}_t and associated portfolio loss ΔZ^{(i)}_t for each scenario X^{(i)}_{1:t} are estimated via the inner stage of the Monte Carlo simulation by taking the sample mean of the n_i inner simulations drawn for each outer simulation X^{(i)}_{1:t}. This is illustrated in Figure 1. Once the empirical distribution for ΔZ^{(i)}_t has been obtained, the desired risk metric is approximated via its empirical equivalent.

For our model, we apply the following machine-learning approach, which can be divided into two steps:

• Approximate f by some f* in L²_µ for which E^Q_t[f*(X)] is given in closed form.
• Learn f* from a finite sample x_1, ..., x_n and corresponding f(x_1), ..., f(x_n), obtaining f̂.

Feature map
Let A be a parameter set such that, for every A ∈ A, there are m linearly independent functions φ_{A,1}, ..., φ_{A,m} in L²_µ. These functions form the feature map φ_A = (φ_{A,1}, ..., φ_{A,m})^⊤. For each function φ_{A,i}, its expectation E^Q_t[φ_{A,i}(X)] is known in closed form. This last requirement is necessary for one of the key distinguishing factors between the model proposed here and other models in the literature: we obtain the gains process of the portfolio by regressing against its terminal value. Our model, therefore, falls within the "regress later" category first mentioned in Glasserman and Yu (2002). As we will see in Section 4, this approach performs much better than the alternative "regress now".

For every A ∈ A, the L²_µ-projection of f on span φ_A is given by f_A = φ_A^⊤ β_A, for the unique minimizer

β_A = arg min_{β ∈ R^m} ‖f − φ_A^⊤ β‖_{L²_µ} = ⟨φ_A, φ_A^⊤⟩_{L²_µ}^{-1} ⟨φ_A, f⟩_{L²_µ}.   (2)

In the particular case where φ_{A,1}, ..., φ_{A,m} are orthonormal in L²_µ, the Gram matrix ⟨φ_A, φ_A^⊤⟩_{L²_µ} = I_m, so that β_A = ⟨φ_A, f⟩_{L²_µ}. The function f* is also called a surrogate function, e.g., in Hokanson and Constantine (2018).

Feature learning

Having described the feature map in our regression model, we point out that the features themselves are also learned from the data. This is the second key distinguishing factor of our model in comparison to others: we tackle the curse of dimensionality that comes naturally with regress-later models by learning a reduced set of features. To do this, we minimize the approximation error ‖f − f_A‖_{L²_µ} over A ∈ A.
We assume that there exists an optimal parameter (A*, β*) ∈ A × R^m attaining the best approximation error

min_{(A,β) ∈ A × R^m} ‖f − φ_A^⊤ β‖_{L²_µ} = ‖f − φ*^⊤ β*‖_{L²_µ},   (3)

where we have β* = β_{A*}, by (2), and we write φ* = φ_{A*}, for short. Our goal of approximating f is achieved by the best L²_µ-projection f* = φ*^⊤ β* = f_{A*}.

Approximate gains process
If for each φ_{A,i} we have E^Q_t[φ_{A,i}(X)] given in closed form, then we also have the approximate gains process in closed form,

Z*_t = E^Q_t[φ*(X)]^⊤ β*,   (4)

and the best approximation error carries over,

‖Z_t − Z*_t‖_{L²_Q} ≤ ‖f − φ*^⊤ β*‖_{L²_µ},
‖Z_t − Z*_t‖_{L¹_P} ≤ ‖dP/dQ‖_{L²_Q} ‖f − φ*^⊤ β*‖_{L²_µ}.

The expectations E^Q_t[φ_{A,i}(X)] depend on the form of φ_{A,i}(X). In our model, of which we present several examples in the following subsections, we work with φ_A that factorize as

φ_{A,i}(X) = g_i(A^⊤ X),  i = 1, ..., m,

for some functions g_i : R^p → R and a Td × p-matrix A = (A_1, ..., A_p) ∈ A ⊂ R^{Td×p}, for some p ∈ N. Since we work with X ∈ R^{T×d}, we add for convenience a flat version X ∈ R^{Td} of X such that X_{(t−1)d+j} = X_{t,j}. This allows us to easily express the role of A as a linear map that creates features and performs a key role in driving the dimensionality reduction that we seek in high-dimensional problems. For the expectations, we then obtain

E^Q_t[φ_{A,i}(X)] = E^Q_t[ g_i( Σ_{s=1}^{td} (A^⊤)_s X_s + Σ_{s'=td+1}^{Td} (A^⊤)_{s'} X_{s'} ) ] = G_{t,i}(µ_t)   (5)

for the function G_{t,i}(µ) = E^Q[ g_i( µ + Σ_{s'=td+1}^{Td} (A^⊤)_{s'} X_{s'} ) ], where µ_t = Σ_{s=1}^{td} (A^⊤)_s X_s and we write (A^⊤)_s for the s-th column vector of A^⊤.

3.2 Full polynomial bases

Let A = {I_{Td}}, so that |A| = 1, and write φ_A ≡ φ, for short, so that also φ = φ*. In this case, problem (3) reduces, and is identical, to (2). Let the feature map φ be composed of all polynomials of maximal degree δ or less,

Pol_δ(R^{Td}) = span{ x^α | α ∈ N^{Td}, |α| ≤ δ }.
This example then matches a polynomial regress-later model. The feature map suffers the curse of dimensionality from the rapid growth of the number of basis functions m as a function of d and T,

m = dim Pol_δ(R^{Td}) = C(Td + δ, Td).

If, for example, d = 3 and δ = 3, then we have

m = 816, for T = 5,
m = 302,621, for T = 40.   (6)
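The count in (6) is the standard "stars and bars" binomial formula; a quick check (illustrative):

```python
from math import comb

def dim_pol(n_vars, degree):
    """dim Pol_delta(R^n) = C(n + delta, delta): monomials of total degree <= delta in n variables."""
    return comb(n_vars + degree, degree)

d, delta = 3, 3
print(dim_pol(5 * d, delta))   # T = 5:  816 basis functions
print(dim_pol(40 * d, delta))  # T = 40: 302621 basis functions
```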
The expectation at time 0, Z , which correspondsto the present value or t = 0 price, is simply given by β , the coefficient associated to the monomialof order 0.The exact form of the conditional expectation E Q t [ φ ( X ) (cid:62) β ] = E Q t [ φ ( X )] (cid:62) β depends on thechoice of φ and the distribution of X . Choosing an orthogonal basis in L µ can greatly simplify thecalculations. For example, for a multivariate normal X ∼ N (0 , I T d ) we work below with a Hermitepolynomial basis.Consider the Hermite polynomials H , . . . , H m on R T d of degree δ or less. The multivariate(probabilists’) Hermite polynomial of order α on variable X can be expressed in terms of univariateHermite polynomials on X j : H α (X) = (cid:81) T dj =1 H α j (X j ) where α ∈ N T d . From this expression wecan see that the sequence { H i } mi =1 is not unique but depends on the mapping i (cid:55)→ α i between thenatural numbers i and the multi-index α i in order to be able to assign a unique order. Once anorder is given, { H i } mi =1 can be mapped to a subset of { H α } α ∈ N Td such that we can write H i = H α i .Under these conditions we can obtain a closed-form solution to the expectation of the featuremap: E Q t [ φ i ( X )] = E Q t [ H α i (X)] = td (cid:89) j =1 H α ij (X j ) T d (cid:89) j = td +1 E Q [ H α ij (X j )]= td (cid:89) j =1 H α ij (X j ) T d (cid:89) j = td +1 α ij =0 which is an explicit expression of the form (5). 8 .3 Polynomials and linear dimensionality reduction Let us consider a more general case now. In this example we fix p ≤ T d , a degree δ ≥ q , . . . , q m of Pol δ ( R p ) with m = dim Pol δ ( R p ) = (cid:0) p + δp (cid:1) . For any A ∈ R T d × p with full rank define the polynomials φ A,i ∈ Pol δ ( R T d ) by φ A,i ( x ) = q i ( A (cid:62) x ) , i = 1 , . . . 
, m The following theorem shows that without loss of generality we can assume that A lies in theStiefel manifold, the set of all orthonormal p -frames in R T d , given by V p ( R T d ) = { A ∈ R T d × p | A (cid:62) A = I p } . Theorem 1.
For any A ∈ R^{Td×p} with full rank, there exists Ã ∈ V_p(R^{Td}) with span φ_A = span φ_Ã.

Let us henceforth assume X ∼ N(0, I_{Td}), that is, X is Gaussian white noise. As a consequence, A^⊤ X ∼ N(0, I_p) for any A ∈ V_p(R^{Td}). Accordingly, we consider the Hermite polynomials H_1, ..., H_m on R^p of degree δ or less. For any A ∈ V_p(R^{Td}) we can define the polynomials

φ_{A,i}(X) = H_i(A^⊤ X) = H_{α_i}(A^⊤ X),  i = 1, ..., m.

The optimization problem (3) corresponds to linear dimensionality reduction with matrix manifold V_p(R^{Td}) × R^m in the spirit of Cunningham and Ghahramani (2015).

We can appreciate the effect of the dimensionality reduction when we compare the number of parameters for d = 3, δ = 3, p = 3. In this case dim V_p(R^{Td}) = Tdp − p(p+1)/2 and m = dim Pol_δ(R^p) = C(p + δ, p) = 20, so that

number of parameters = 29, for T = 5,
number of parameters = 114 + 20 = 134, for T = 40.

Compare this example to the corresponding one using the full polynomial basis (6). In the high-dimensional case of maturity T = 40, we now have 134 vs 302,621 parameters, thus achieving a high degree of dimensionality reduction in the parameter space.
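Theorem 1 says we can always pass to an orthonormal A without changing the feature span. A minimal numerical illustration using classical Gram–Schmidt (one possible orthonormalization; the theorem's proof may construct Ã differently):

```python
def orthonormalize_columns(A):
    """Return A_tilde whose columns are an orthonormal basis of the column span of A.
    A is given as a list of rows and is assumed to have full column rank."""
    n, p = len(A), len(A[0])
    cols = [[A[i][j] for i in range(n)] for j in range(p)]
    ortho = []
    for c in cols:
        v = c[:]
        for u in ortho:
            proj = sum(x * y for x, y in zip(u, c))
            v = [vi - proj * ui for vi, ui in zip(v, u)]
        norm = sum(x * x for x in v) ** 0.5
        ortho.append([x / norm for x in v])
    # transpose back to rows
    return [[ortho[j][i] for j in range(p)] for i in range(n)]

A = [[1.0, 1.0],
     [0.0, 1.0]]            # columns (1, 0) and (1, 1): full rank, not orthonormal
A_tilde = orthonormalize_columns(A)
print(A_tilde)              # columns (1, 0) and (0, 1): an orthonormal 2-frame
```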
Expectations under linear dimensionality reduction of a polynomial basis
The expectation at time 0, Z_0, which corresponds to the present value or t = 0 price, is simply given by the coefficient of β associated with the monomial of order 0.

As with the full polynomial basis, the expectation is shown below for a Hermite polynomial basis and multivariate normal X ∼ N(0, I_{Td}). For more details and the full derivation of these results, we refer to Appendix D. Specifically, applying Lemma 3 we obtain

E^Q_t[φ_{A,i}(X)] = E^Q_t[ Π_{j=1}^p H_{α_{ij}}( Σ_{s=1}^{Td} X_s A_{sj} ) ] = Π_{j=1}^p σ[W_j]^{α_{ij}} H_{α_{ij}}( W_j / σ[W_j] ),   (7)

where

W_j = Σ_{s=1}^{td} A_{sj} X_s,  σ[W_j] = sqrt( Σ_{s=1}^{td} A_{sj}² ),

which is an explicit expression of the form (5).

3.4 Neural network bases

In this third example of basis functions, we consider a one-layer neural network, which can be represented as a feature map φ_A as follows. We choose the "ReLU" activation function t ↦ t⁺ and set φ_{A,i}(X) = (A_i^⊤ X)⁺.

Remark.
Here we cannot assume that A ∈ V_p(R^{Td}) as we did in Theorem 1. Indeed, consider for example

A = ( 1 1
      0 1 ),

with columns (1, 0)^⊤ and (1, 1)^⊤, so that the "orthogonalization" Ã = A(A^⊤A)^{-1/2} has the same column span as the identity; take

Ã = ( 1 0
      0 1 ).

Matrix A gives the feature map φ_A(x) = (x_1⁺, (x_1 + x_2)⁺)^⊤, while Ã gives φ_Ã = (x_1⁺, x_2⁺)^⊤. Now observe that (x_1 + x_2)⁺ is not in the linear span of x_1⁺ and x_2⁺. Hence the analog of Theorem 1 does not apply for neural networks, in general.

For notational convenience we adopt the common practice in the neural network literature of working with augmented A and X, appending constant terms to avoid carrying explicit bias terms. For X this means adding a constant term in the first position, X_0 = 1. This additional term in X requires an additional row in A, the new first row, which holds the bias terms of the hidden layer. This makes A ∈ R^{(Td+1)×p}, where p is the number of hidden nodes. The remaining bias term, the one in the output layer, corresponds to the parameter β_{p+1}, and we have m = p + 1. In order to produce a constant term that matches this bias term, we append an additional constant column (1, 0, ..., 0)^⊤ to A. A is in effect (Td + 1) × p-dimensional, but in actual computations A ∈ R^{(Td+1)×(p+1)}, with the following block structure:

( 1  b_1 ⋯ b_p
  0  A' ),

where A' ∈ R^{Td×p} is the core A matrix without augmentation and b_1, ..., b_p are the bias terms of the hidden layer. From a financial engineering point of view, the feature map can be thought of as a portfolio of Gaussian call options. Strictly speaking, we ignore those A for which φ_{A,1}, ..., φ_{A,p+1} are not linearly independent. However, in practice this is not relevant, as the set of such singular A's has Lebesgue measure zero in R^{(Td+1)×(p+1)}.
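The remark's counterexample can be checked numerically: matching c_1·x_1⁺ + c_2·x_2⁺ to (x_1 + x_2)⁺ at the points (1, 0) and (0, 1) forces c_1 = c_2 = 1, but the match then fails at (1, −2). A small sketch (illustrative):

```python
def relu(t):
    return t if t > 0 else 0.0

def target(x1, x2):                  # (x1 + x2)^+
    return relu(x1 + x2)

def candidate(x1, x2, c1, c2):       # c1 * x1^+ + c2 * x2^+
    return c1 * relu(x1) + c2 * relu(x2)

# Matching the target at (1, 0) and (0, 1) forces the coefficients:
c1 = target(1.0, 0.0) / relu(1.0)    # = 1.0
c2 = target(0.0, 1.0) / relu(1.0)    # = 1.0
# ... yet at (1, -2) the candidate gives 1.0 while the target is 0.0:
print(candidate(1.0, -2.0, c1, c2), target(1.0, -2.0))
```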
Expectations under neural network bases

The expectation at time 0, Z_0, which corresponds to the present value or t = 0 price, is simply given by β_{p+1}, the bias coefficient in the output layer. The conditional expectation E^Q_t[φ_A(X)^⊤ β] is provided in the following lemma (Fernandez-Arjona (2019)).

Lemma 1.
Assume X ∼ N(0, I_{Td}). Then the above one-layer neural network,

f*(X) = Σ_{i=1}^{p+1} β_i φ_{A,i}(X) = Σ_{i=1}^{p+1} β_i (A_i^⊤ X)⁺,

has conditional expectation given by

E^Q_t[f*(X)] = Σ_{i=1}^p β_i E_i(t) + β_{p+1},

where

E_i(t) = E^Q_t[ (A_i^⊤ X)⁺ ] = (1/2) σ_i(t) sqrt(2/π) exp( −µ_i(t)² / (2 σ_i(t)²) ) + µ_i(t) ( 1 − Φ( −µ_i(t) / σ_i(t) ) )   (8)

with

µ_i(t) = E^Q_t[A_i^⊤ X] = Σ_{s=0}^{td} X_s A_{si},
σ_i(t) = σ^Q_t[A_i^⊤ X] = sqrt( Σ_{s=td+1}^{Td} A_{si}² ),

which is an explicit expression of the form (5).

Given a sample (x_1, y_1), ..., (x_n, y_n), where x_i is drawn from µ(dx) and y_i = f(x_i), and given A ∈ A, we can construct Φ_A ∈ R^{n×m} by Φ_{A,ij} = φ_{A,j}(x_i). Assuming that m ≤ n and Φ_A^⊤ Φ_A is invertible (rank Φ_A = m), the empirical version of (2) reads as the ordinary least squares (OLS) problem

β̂_A = arg min_{β ∈ R^m} ‖y − Φ_A β‖_{R^n} = (Φ_A^⊤ Φ_A)^{-1} Φ_A^⊤ y.   (9)

Remark. The law of large numbers implies, for a fixed A ∈ A and for n → ∞,

β̂_A = (n^{-1} Φ_A^⊤ Φ_A)^{-1} (n^{-1} Φ_A^⊤ y) →p ⟨φ_A, φ_A^⊤⟩_{L²_µ}^{-1} ⟨φ_A, f⟩_{L²_µ} = β_A.   (10)

Finite-sample feature learning

The empirical version of (3) reads as follows: we assume that there exists an optimal parameter (Â, β̂) ∈ A × R^m such that Φ_Â^⊤ Φ_Â is invertible (rank Φ_Â = m), attaining the best approximation error

min_{(A,β) ∈ A × R^m} ‖y − Φ_A β‖_{R^n} = ‖y − Φ_Â β̂‖_{R^n},   (11)

where we have β̂ = β̂_Â, by (9). Writing φ̂ = φ_Â, we obtain the estimated approximation f̂ = φ̂^⊤ β̂ of f.

Having presented the theoretical background, we now turn to a first example to show how the ideas in Section 3 can be used to construct a replicating martingale and how these compare to alternative approaches. In order to introduce this example, and the estimator quality metrics to be used, we begin in the next section by showing how the capital calculations are performed using nested Monte Carlo.
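As a sanity check before the examples: the closed form (8) is the expectation of a rectified Gaussian, E[Y⁺] for Y ∼ N(µ, σ²). A small sketch using the standard normal CDF via the error function (illustrative, not the paper's code):

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_relu(mu, sigma):
    """E[Y^+] for Y ~ N(mu, sigma^2), matching the closed form in (8)."""
    return (0.5 * sigma * math.sqrt(2.0 / math.pi)
            * math.exp(-mu * mu / (2.0 * sigma * sigma))
            + mu * (1.0 - normal_cdf(-mu / sigma)))

# At mu = 0, sigma = 1 this equals 1/sqrt(2*pi) ~ 0.3989.
print(expected_relu(0.0, 1.0))
```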
In this example, we calculate the present value (Z_0) and the expected shortfall of −ΔZ_1 for a European call on an equity index. We assume that we hold a short position in this call and therefore focus on the loss-making tail, that is, the tail where the equity index values are higher. In keeping with industry convention, the present value and expected shortfall are reported as positive numbers, without taking into account the fact that it is a short position.

The economic scenario generator is described in Appendix B. That generator maps X to a vector of economic factors. This vector contains several components, among them the equity index and the cash account. We will call these components EQ_t and C_t respectively. The European call payoff at time T will be max(EQ_T − K, 0) and

Z_T = −max(EQ_T − K, 0) / C_T,

where K is the strike of the option and T its maturity. The variables EQ_t, C_t, and K represent nominal—undiscounted—values.

In this example we work with two maturities, T = 5 and T = 40, and for both K = 100—that is, the option is at the money at time zero. The model has 3 stochastic drivers, d = 3, and therefore there are 15 and 120 total dimensions—individual stochastic variables—for T = 5 and T = 40, respectively.

To establish the benchmark value—the closest we can get to the "ground truth" without having a closed-form solution—we first run a very large nested Monte Carlo simulation (henceforth nMC), with 1,000,000 outer simulations (n_o) and 100,000 inner simulations (n_i). We then calculate the 99% expected shortfall on the loss-making tail. When working with an empirical distribution, as we do here based on simulation data, the expected shortfall reduces to simply averaging the 1% worst results.

Before we can test the quality of the nMC estimator for a finite simulation budget, we need to decide how to split said budget between inner and outer simulations. Different combinations of outer and inner simulations will produce different nMC estimators.
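The standard nested procedure can be sketched on a deliberately trivial toy model (two independent Gaussian drivers and terminal value f = X_1 + X_2; the paper's economic scenario generator of Appendix B is far richer):

```python
import random
import statistics

def nested_mc_es(n_outer, n_inner, alpha, rng):
    """Standard nested Monte Carlo estimate of ES_alpha(-dZ_1) for the toy model
    f(X) = X_1 + X_2 with X_1, X_2 ~ N(0, 1) independent, so Z_0 = 0 and Z_1 = X_1."""
    z0 = 0.0
    losses = []
    for _ in range(n_outer):
        x1 = rng.gauss(0.0, 1.0)                 # outer scenario up to t = 1
        # inner stage: estimate Z_1 = E[f | X_1] by averaging inner simulations
        z1 = statistics.fmean(x1 + rng.gauss(0.0, 1.0) for _ in range(n_inner))
        losses.append(-(z1 - z0))                # loss = -dZ_1
    losses.sort()
    tail = losses[int(alpha * len(losses)):]     # worst (1 - alpha) fraction
    return statistics.fmean(tail)

rng = random.Random(42)
# True ES_0.99(-Z_1) for Z_1 ~ N(0, 1) is about 2.67; the nested estimate is
# close, but the inner-stage noise inflates the tail slightly (finite n_inner).
print(nested_mc_es(2000, 50, 0.99, rng))
```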
The bias and variance of the nested Monte Carlo estimator depend on both the number of outer and of inner simulations. Each combination has a different bias and variance and therefore a different mean absolute error. This is shown in Tables 1 and 2 for a fixed total budget of 50,000 simulations. In each individual estimation (inner–outer combination), the error is calculated as a percentage of the benchmark value, and therefore we refer to the quality metric as "MApE", for Mean Absolute percentage Error. The formula is described in Appendix A. It is important to note that since the expected shortfall is calculated on −ΔZ_1, the error on Z_0 is also part of the error on ES(−ΔZ_1).

As described in Broadie, Du, and Moallemi (2015), it is not possible, in general cases, to decide for a finite budget how to make this inner–outer split in an optimal way. In this paper we will err on the side of presenting optimistic risk figures for nMC estimations, by choosing an optimal split based on our knowledge of the benchmark value. This bias towards more accurate nMC risk estimations than possible in practice will not be a problem for our analysis, since we find that the proposed replicating martingale method produces more accurate results than the optimal nMC, which is already better than what one would obtain in practice. Since the optimal number of simulations for risk calculations is different than for the pricing calculations, it is not possible to be optimal for both at the same time. We have chosen the split that is optimal for risk calculations. This problem does not exist for regression-based methods, since it is not necessary to split the training budget.

The effect on the present value calculations of using the split which is optimal for risk calculations can be seen in Table 3.
There, for example, the error on the 5-year call is 1.7%, while in Table 1 we can see that the error would be 0.6% when using the split optimal for present value calculations—which corresponds to flat, not nested, Monte Carlo.

In order to select the optimal outer–inner combination, each combination is run 100,000 times to be able to estimate its MApE ES. The combination with the lowest MApE ES is chosen as the optimal one for that (total) sample size. Table 3 shows the relationship between the MApE ES of the optimal combination and the total sample size. We can see that the estimation of ES[∆Z], performed via nested Monte Carlo, is much more affected by the increased dimensionality of the problem than the estimation of Z.

These Monte Carlo results will be used throughout the paper as one of the reference methods against which we measure the proposed methods. Even if the limitations of standard nested Monte Carlo mean that it is not the main approach used by practitioners in large-scale problems, it remains the simplest way to approach the estimation of conditional expectations, and it provides a common baseline that both practitioners and academics can easily understand. In order to provide a comparison with other methods widely used in practice, we also include regress-now methods in all examples.

We now apply the functional bases described in Sections 3.2, 3.3, and 3.4. For each functional basis we show the same quality metrics as in Section 4.1 in order to compare with the results from nested Monte Carlo estimation. Additionally, each functional basis is also compared to other related methods, such as regress-now LSMC. When comparing between regression-based methods, we use an additional quality metric: the mean L¹ error over the empirical distribution of ∆Z. This metric allows us to make comparisons of the goodness-of-fit along the entire distribution, not only the tail.
This metric is related to the upper bound on the error for any tail metric, not only the 1% tail used in the expected shortfall error examples. For more details about the quality metrics used, we refer to Appendix A.

The first comparison uses the full polynomial basis described in Section 3.2. In Tables 4 and 5 we present the MApE comparison among Monte Carlo, the regress-now polynomial basis, and the replicating martingale full polynomial basis. Table 4 shows results for the present value and Table 5 for the expected shortfall estimators. We can see how the replicating martingale outperforms the other two methods in the estimation of the present value and the 99% expected shortfall. For a more comprehensive comparison, we look at the mean L¹ error in Table 6. We can see that Table 6 confirms the conclusions from Table 5: the replicating martingale estimators outperform the regress-now estimators. In this regard, we verify what others in the literature have reported before for regress-later estimators.

Tables 4 and 5 do not show results for the full polynomial basis under the replicating martingale approach for T = 40. The reason for this is the combinatorial explosion in the number of basis functions as the dimensionality of the problem grows. As shown in (6), the number of basis functions for d = 3, δ = 3, T = 40 is 302,621.

The problems with high-dimensional cases described in the previous paragraph motivate our use of linear dimensionality reduction (LDR). As described in Section 3.3, the number of parameters to be estimated can be greatly reduced, from 816 to 29 for T = 5 and from 302,621 to 134 for T = 40. Whether a function f can be well approximated by a polynomial basis with linear dimensionality reduction for a small δ and p depends on the nature of the function.
Asymptotically, any function in L²_µ can be approximated with arbitrary precision.

The optimization problem in (11) is solved over the product manifold V_k(R^{Td}) × R^m rather than over R^{Td×k} × R^m, supported by Theorem 1. This reduces the effective dimensionality of the problem and simplifies the calculation of the conditional expectation of φ(X) in Equation (4). For these examples we have used the Riemannian BFGS algorithm from Huang, Gallivan, and Absil (2015), using the C++ library published by the authors. Additionally, we tested another two algorithms: Riemannian Trust Regions (Absil, Baker, and Gallivan (2007)), as implemented by the Python library “pymanopt” (Townsend, Koep, and Weichwald (2016)), and Grassmann Gauss–Newton (Hokanson and Constantine (2018)), as implemented by the authors in their publicly available Python library. In all cases, Riemannian BFGS achieved better results.

For this example, we have chosen p = 3. We performed a sensitivity analysis on this parameter and found that a larger value might lead to better results in some cases but not in all cases. The full results are included in Appendix F. In that appendix, we also provide a sensitivity analysis with respect to the starting point of the optimization. Since BFGS is a quasi-Newton method, it is not guaranteed to find a global minimum in the general case. In the case of the polynomial basis with LDR, we find that the selection of the starting point makes a big difference in the final result.

Tables 4 and 5 show the results of applying LDR (p = 3) to the European call problem, and we can see how this method performs in relation to the other alternatives. We can see that the polynomial LDR has lower error than nested Monte Carlo and both the regress-now and regress-later polynomials. It also becomes clear that the LDR approach allows high-dimensional problems where the regress-later approach on a full polynomial basis would fail.
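To make the LDR construction concrete, the sketch below builds the feature map—a polynomial basis evaluated at A⊤x for an A with orthonormal columns—and fits it by ordinary least squares. It is a simplified sketch under stated assumptions: A is a fixed random Stiefel point obtained by QR, whereas the paper optimizes A with Riemannian BFGS; the function names are ours.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(y, degree):
    """All monomials of total degree <= degree in the columns of y."""
    n, p = y.shape
    cols = [np.ones(n)]
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(p), deg):
            cols.append(np.prod(y[:, idx], axis=1))
    return np.column_stack(cols)

def fit_ldr_ols(x, z, p=3, degree=3, seed=0):
    """Regress z on a polynomial basis of A^T x, with A on the Stiefel
    manifold (orthonormal columns, here obtained via QR).

    Sketch only: the paper learns A by Riemannian optimization; keeping A
    fixed shows the feature construction, not the full procedure.
    """
    rng = np.random.default_rng(seed)
    b = rng.standard_normal((x.shape[1], p))
    a, _ = np.linalg.qr(b)                    # a.T @ a = I_p
    phi = poly_features(x @ a, degree)
    beta, *_ = np.linalg.lstsq(phi, z, rcond=None)
    return a, beta
```

With p = 3 and δ = 3 the regression has only 20 coefficients regardless of the ambient dimension Td, which is the size reduction the text describes.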
In this section, we describe the results achieved using a neural network model as defined in Section 3.4. For this example we have chosen to work with 100 nodes, that is, the hidden layer is 100 nodes wide. This choice was made via cross-validation. The neural network was optimized via backpropagation using the L-BFGS algorithm from Python’s popular library scikit-learn (Pedregosa et al. (2011)). The results in Tables 4 and 5 show an excellent quality of the neural network replicating martingale in this example, outperforming every other choice, except for the risk calculations with a very low number of samples (1,000).

In the comparison between neural network approaches, we can see that the replicating martingale—a regress-later method—outperforms the regress-now variation, the same way that it did for the polynomials.

In Figure 2 we can compare the empirical distribution for each method. This figure makes it easy to qualitatively assess the differences between the methods, for example: the high variance of nMC vis-à-vis the lower variance of replicating martingales, or the higher accuracy of regress-later methods compared to regress-now methods. We can also see that, despite the non-linear optimization with random starting points involved, the neural network replicating martingale does not have qualitatively higher variance than the polynomial equivalents.

It is interesting to consider the structure of the neural network and polynomial models, to understand what they have in common and what they do not. As seen in (5), both methods use a linear map to reduce the dimensionality of the input before applying a non-linear function. The polynomial model is based on global polynomials, while the neural network can be seen as a data-driven piece-wise linear model. While piece-wise linear models usually require a grid to be defined a priori, neural networks adjust the bias term to “place” the grid where it is most needed according to the input data.
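The network described above can be reproduced with a few lines of scikit-learn. Only the architecture (one hidden layer of 100 nodes) and the quasi-Newton solver follow the text; the data below are synthetic placeholders, not the paper’s scenario-generator output.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Illustrative sketch: fit a one-hidden-layer network (100 nodes) to cash
# flows z given stochastic drivers x. Placeholder data, not the paper's.
rng = np.random.default_rng(0)
x = rng.standard_normal((5_000, 15))            # e.g. T = 5, d = 3 -> 15 inputs
z = np.maximum(x @ rng.standard_normal(15), 0)  # placeholder payoff-like target

net = MLPRegressor(hidden_layer_sizes=(100,), solver="lbfgs",
                   max_iter=1_000, random_state=0)
net.fit(x, z)
z_hat = net.predict(x)
```

The default ReLU activation makes this exactly the data-driven piece-wise linear model discussed below: the hidden-layer biases determine where the kinks of the fit are placed.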
The lasso (Tibshirani (1996)) is a popular model building technique for generating sparse models. It is used very successfully as a method for feature selection, especially in high-dimensional problems. For that reason it is natural to compare the lasso regressor with the replicating martingale approach we present in this paper. At the same time, the comparison reveals interesting differences in the way that replicating martingales and lasso perform the dimensionality reduction.

The lasso estimates of a linear model on a full polynomial basis, as in Section 3.2, are obtained by

β̂ = argmin_β ‖y − Φ⊤β‖²_{R^n} + λ ∑_{j=1}^m |β_j|.

One important difference is that lasso does not change the size of the regression problem, only the sparsity of the solution. The regression problem therefore does not become smaller, unlike the case of polynomials with linear dimensionality reduction, where, once the input variables are projected, the remaining OLS problem is much smaller than the original.

A second important difference is that lasso selects among preexisting features, rather than creating new ones. The methods we described above create new features adaptively, under the assumption that the true function can be expressed in a low-dimensional space. This lower dimension p must be specified and becomes fixed, which is a disadvantage with respect to lasso, which does not require a predetermined number of features. The ability to “create” new features gives the replicating martingale an advantage over lasso, which is limited to the input feature map. However, doing so while being faster in high-dimensional cases is part of what makes replicating martingales an interesting method. The next section contains a comparison of runtime across different methods.

For this numerical example, the lasso estimator is built from the full polynomial basis—816 elements for T = 5 and 302,
621 for T = 40. The feature map used for lasso is consequently the upper bound of the polynomial LDR feature map, that is, the one corresponding to p = 15 and p = 120 for T = 5 and T = 40 respectively. In the example we use p = 3, and therefore the success of the polynomial LDR estimator depends on whether a good solution can be expressed in the lower-dimensional space of the resulting feature map.

To automate the selection of the penalty factor, we use the Akaike information criterion, using the algorithm provided by scikit-learn in its LassoLarsIC module, which is partly based on Zou, Hastie, Tibshirani, et al. (2007).

Table 7 shows the comparison between lasso and two replicating martingale methods—LDR and neural network—for both MApE and L¹ error. The quality of lasso is consistently worse than that of the other estimators, both in the tail and across the body of the distribution.

The last comparison that we present on this European call example shows how long it takes to run the training and prediction phases of each model. This involves running the regression on the number of samples indicated in the first column and calculating Ẑ(x_i) for 1,000,000 validation samples. The results are presented in Table 8. They clearly show the effect of the dimensionality reduction on the computational cost of the replicating martingale method. In high-dimensional problems, where the lasso regression must work with all features in the high-dimensional space, the LDR and neural network models are two orders of magnitude faster, since the dimensionality reduction takes place prior to the creation of the features, thus creating a smaller regression problem.

Having shown the effectiveness of learning the replicating martingale in the case of a European call, we present now a much more complex example: a variable annuity guarantee.
Unlike the previous example, this one features path-dependent cash flows at multiple points in time and also a dependency on a stochastic mortality model, rather than only stochastic market variables. The model has been built using models commonly in use in the insurance industry. The policyholder population is fictitious.

The example is constructed as follows. There are five stochastic drivers, d = 5: two for the interest rate model, one for the equity model, one for the real estate model, and one for the stochastic mortality. The interest rate and equity models are those described in Appendix B and used in the previous examples. The real estate model is the same as the equity model from Appendix B, but uses an independent stochastic driver and a lower volatility than the equity model. The stochastic mortality follows a Lee–Carter model (R. D. Lee and Carter (1992)) to provide a trend and random fluctuations over time.

The insurance product being simulated is an investment account with a “return premium on death” guarantee. Every policyholder has an investment account. At each time period, the policyholders pay a premium, which is used to buy assets. These assets are deposited in the fund. The fund is re-balanced at each time period to maintain a target asset allocation. The value of the fund is driven by the inflows from premiums and the market value changes, which are driven by the interest rate, equity, and real estate models. At each time step, a number of policyholders die—as determined by the stochastic life table—and the investment fund is paid out to the beneficiaries. If the investment fund is below the guaranteed amount, the company additionally pays the difference between the fund value and the guaranteed amount. The guaranteed amount is the simple sum of all the premiums paid over the life of the policy. Over the course of the simulation, the premiums paid gradually increase the guaranteed amount for each policy. All policies have the same maturity date.
In the short-term, low-dimensional example, the maturity is T = 5. In the long-term, high-dimensional example, the maturity is T = 40. At maturity, the higher of the fund value and the guaranteed amount is paid out to all survivors.

The investment portfolio holds four assets: a ten-year zero-coupon bond, a twenty-year zero-coupon bond, an equity index, and a real estate index. The bonds are annually replaced such that the time to maturity remains constant.

The model is described by the following equations, where all financial variables are nominal amounts, unless otherwise noted:

Cash flows
– D_t: total dead in period (t−1, t)
– L_t: total of policyholders alive at time t
– A_t: value of assets at time t (per policy)
– G_t: guaranteed value at time t (per policy)
– C_t: value of the cash account at time t
– T: maturity date of the policies

ζ_t = D_t max(A_t, G_t)/C_t, for t < T,
ζ_T = L_{T−1} max(A_T, G_T)/C_T, for t = T.

In these equations, ζ_t is the only discounted variable—by the cash index—all other variables are nominal amounts.

Investment fund
– V_{it}: unit price of asset i at time t
– U_{it}: number of units of asset i held (per policy) in the fund at time t
– B(t, S): value at time t of a bond maturing at time S
– EQ_t: value of equity index at time t
– RE_t: value of real estate at time t
– M: asset allocation mix
– P_t: premium paid (per policy) at time t

A_t = ∑_i U_{it} V_{it}

U_{it} = U_{i,t−1} B(t, t+9)/B(t, t+10) + P_t M_i/V_{it}, for i = 1,
U_{it} = U_{i,t−1} B(t, t+19)/B(t, t+20) + P_t M_i/V_{it}, for i = 2,
U_{it} = U_{i,t−1} + P_t M_i/V_{it}, for i = 3, 4,

where M = (M_1, M_2, M_3, M_4) is the vector of target allocation weights, and

V_{it} = B(t, t+10) for i = 1, B(t, t+20) for i = 2, EQ_t for i = 3, RE_t for i = 4.

Bond prices (B) and the equity and real estate indices (EQ and RE) are simulated according to the model in Appendix B.
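The per-policy rebalancing recursion above can be sketched directly. The function and price values below are illustrative, and the equal allocation weights are our own assumption, since the numerical target mix M is not reproduced here.

```python
import numpy as np

def update_units(u_prev, prices, prices_roll, premium, m):
    """One-period update of per-policy units U_{i,t}, following the
    recursion in the text: the aged bonds are sold and rolled into the
    on-the-run maturities, and the premium buys assets in mix m.

    prices      -- V_{i,t}: [B(t,t+10), B(t,t+20), EQ_t, RE_t]
    prices_roll -- sale prices of the aged bonds: [B(t,t+9), B(t,t+19)]
    The equal weights used below are an illustrative assumption.
    """
    u = np.empty(4)
    u[0] = u_prev[0] * prices_roll[0] / prices[0] + premium * m[0] / prices[0]
    u[1] = u_prev[1] * prices_roll[1] / prices[1] + premium * m[1] / prices[1]
    u[2] = u_prev[2] + premium * m[2] / prices[2]
    u[3] = u_prev[3] + premium * m[3] / prices[3]
    return u

# Per-policy fund value A_t = sum_i U_{i,t} V_{i,t}; placeholder prices.
m = np.full(4, 0.25)
v = np.array([0.8, 0.6, 100.0, 50.0])
u = update_units(np.zeros(4), v, np.array([0.82, 0.62]), premium=100.0, m=m)
fund_value = u @ v
```

Starting from an empty account, the first premium is fully invested, so the fund value equals the premium regardless of the price levels.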
Policy variables

G_t = G_{t−1} + P_t, P_t = 100, G_0 = 100.

Demographic variables
– D_{xt}: total dead of age x in period (t−1, t)
– L_{xt}: total alive of age x at time t
– q_x(t): death rate for age x in period (t−1, t)

D_t = ∑_x D_{xt}, D_{xt} = L_{x,t−1} q_x(t), L_{x,0} = 1000 ∀ x ∈ (30,

Mortality model (Lee–Carter)
– m_x(t): force of mortality at time t for age x
– X_t^{(lc)}: element of the innovation process X at time t used for the mortality model
– a_x and b_x: Lee–Carter parameters (table in Appendix E)

q_x(t) = 1 − e^{−m_x(t)},
m_x(t) = e^{a_x + b_x k(t)},
k(t) = k(0) − 0. t + ε_t,
ε_t = 0. X_t^{(lc)} √t,
k(0) = −0.

The results for the variable annuity guarantee confirm those of the European option case: the replicating martingale works very well, in particular the neural network model, which provides the best results in most cases. However, the more complex example also shows some limitations of the methods. In the estimation of the present value (Table 9) we can observe:
• Monte Carlo simulation is still very effective, but regression-based methods provide slightly better accuracy.

• The neural network model performs relatively badly in the case with the lowest number of samples (1,000) and high dimensions (T = 40), providing the worst results in that case. This is driven by over-fitting, as we describe in the analysis of the mean relative L¹ error below. The quality reaches that of the other methods as the number of samples increases.

• The polynomial LDR method—which is calculated with p = 10—shows its advantage over the full polynomial basis not only in being able to solve the high-dimensional case, but also in the estimation of the low-dimensional case with a low number of samples. The full polynomial basis has a MApE of 61% due to the basis containing (28 choose 3) = 3,276 elements, which exceeds the 1,000 available samples. The polynomial LDR has a MApE of less than 0.1% due to only containing 286 basis elements.

In the estimation of the expected shortfall (Table 10 and Figure 3) and the analysis of the mean relative L¹ error (Table 11) we can observe:

• Regress-later methods dominate over regress-now methods and nested Monte Carlo, with better mean absolute error and standard deviation.

• Unlike the European call example, where neural networks completely dominated the quality comparison, polynomial LDR shows better results in a few cases. However, which method shows better results is very sensitive to the choice of hyper-parameters. We provide a sensitivity analysis for hyper-parameters in Appendices F and G. Overall, neural networks have more room for improvement with an alternative choice of hyper-parameters and can be assumed to produce better results in this variable annuity example.

• We can observe several cases where an insufficient number of training samples leads to over-fitting and poor out-of-sample results.
For example, for the full polynomial basis and T = 5, we find a large improvement in results when the training data changes from 1,000 samples to 5,000 samples. This basis has m = 3,276, which means that when working with 1,000 samples we have more parameters than samples. The same effect can be seen in the neural network replicating martingale for T = 40 when the sample size changes from 10,000 to 50,000 samples. This can be explained by the fact that this model has 20,201 parameters.

• In some cases, for example the neural network regress-later estimator for T = 5, the MApE ES increases when the sample size increases from 5,000 to 10,000 and 50,000 (Table 10). This behaviour is not present in the relative mean L¹ error (Table 11). This is due to the divergence between the error being minimized—errors along the full cash flow distribution—and the error being measured—errors in the tail of the T = 1 conditional expectation distribution.

5.2 Comparison of runtime

The comparison of runtime for the insurance example is very important to determine the relative strength of the methods as feasible solutions in the real world. For the reasons described in the introduction, nested Monte Carlo is not a feasible method for this problem, and it is therefore excluded from this comparison.

Unsurprisingly, we find regress-now methods to be faster than regress-later methods. This might partially explain their popularity with practitioners, especially for frequent calculations that do not require high precision. However, for quarterly or annual calculations of regulatory solvency, it seems hard to justify the much higher error rates for the benefit of saving a few minutes of calculations.

The slowest method is the polynomial LDR, which for the high-dimensional problem takes almost 25 minutes to find the solution and make the estimation of the out-of-sample distribution. This time is entirely dominated by the optimization—training—step, not the estimation—prediction—step.
The polynomial LDR method runtime is extremely sensitive to the p parameter. For example, for T = 40 and sample size 1,000, it takes 33 seconds to solve with p = 5 and 279.1 seconds to solve with p = 10. The neural network model can be solved relatively fast, taking 2 minutes in the largest problem.

In the context of the need for accurate and fast calculations in portfolio pricing and risk management, we have introduced a data-driven method to build replicating martingales under different functional bases. This method yields lower errors than standard nested Monte Carlo simulation. The model learns the features necessary for an effective low-dimensional representation from finite samples in a supervised setting. By doing so, it can be very effective in high-dimensional problems, without some of the usual difficulties associated with them.

We have presented two examples to demonstrate the usefulness of replicating martingales in the calculation of economic capital. The first is a typical benchmark example for calculations involving financial derivatives: a European call option. The second is a path-dependent insurance product, a variable annuity guarantee. Replicating martingales outperform other methods in the literature and in use in the financial industry for these two representative cases. This is illustrated by extensive comparisons and sensitivity analyses.

A Quality metrics
Since the focus of this paper is on applications in pricing and risk management, we use two key quality metrics. The first one looks into the goodness of fit in the tail of the distribution, and the second one into the goodness of fit across the body of the distribution. For the tail of the distribution we look at expected shortfall and value at risk, for the loss-making tail. For the body of the distribution we look at the L¹ error.

We treat the models as statistical estimators, since their estimates are subject to the randomness of their inputs. For that reason, for each of the metrics described above we derive an empirical distribution based on R macro-runs of the entire simulation-estimation-prediction chain of calculations. That means that we also need to define which metric summarizes the results of the empirical distribution. In both cases (tail error and L¹ error) we use the mean absolute error. In all cases we work with relative errors, expressed as a percentage. Root mean squared errors would have been an option, but the advantages of the mean absolute error have been well documented in Willmott and Matsuura (2005) and Chai and Draxler (2014).

In the sections below we describe in detail the calculation of our two quality metrics: mean absolute percentage error on tail error (MApE), and mean relative L¹ error.

A.1 Mean absolute percentage error
Let us consider an empirical distribution of X_t, composed of n samples. From this distribution we can obtain an empirical distribution of Z_t. Given a function f*, we obtain R repetitions of its finite sample estimator f̂. For each function in these {f̂^{(j)}}_{j=1}^R, we can produce an empirical distribution of its value estimator Ẑ_t = E_t^Q[f̂(X)] using X_t, therefore obtaining a set of empirical distributions {Ẑ_t^{(j)}}_{j=1}^R.

Given a benchmark expected shortfall calculation at confidence level α (e.g., 99%), ES_α[−∆Z_t], an estimator of such quantity, ES_α[−∆Ẑ_t^{(j)}], and R repetitions (independent samples) of such estimator, j = 1, …, R, the mean absolute percentage error (MApE) is defined as

MApE ES = (1/R) ∑_{j=1}^R |ES_α[−∆Ẑ_t^{(j)}] − ES[−∆Z_t]| / ES[−∆Z_t].

The MApE metric can also be applied to the present value of Z_t, E[Z_t] = Z_0:

MApE PV = (1/R) ∑_{j=1}^R |Ẑ_0^{(j)} − Z_0| / Z_0.

A.2 Mean relative L¹ error

Given the above, the mean relative L¹ error is defined as

(1/R) ∑_{j=1}^R E[|Ẑ_t^{(j)} − Z_t|] / E[|Z_t|].

This metric is related to the error on the expected shortfall in the following way:

|ES_α[Ẑ_t] − ES_α[Z_t]| ≤ (1/(1 − α)) E[|Ẑ_t − Z_t|].

This shows that, for any α, the expected shortfall MApE is bounded by a multiple of the L¹ error. While the MApE ES is a metric calculated for a particular α and only takes into account the distribution beyond the α-th percentile, the mean relative L¹ error takes into account the whole distribution and bounds the expected shortfall error for any α.

B Economic scenario generator
B.1 Interest rate model
To model interest rates we use the one-factor Hull–White model:

dr(t) = α(b(t) − r(t)) dt + σ dW(t).

B.2 Bond prices
The nominal price at time t of a zero-coupon bond with maturity T is given by

B(t, T) = exp(−A(t, T) r(t) + C(t, T))   (12)

where

A(t, T) = (1/α)(1 − e^{−α(T−t)}),
C(t, T) = −α ∫_t^T ∫_t^u e^{−α(u−s)} b(s) ds du + (σ²/(2α²)) [ (T − t) + (1/(2α))(1 − e^{−2α(T−t)}) + (2/α)(e^{−α(T−t)} − 1) ].

If we define h(t, T) := ∫_t^T ∫_t^u e^{−α(u−s)} b(s) ds du, then

B(0, 1) = exp(−A(0, 1) r(0) + C(0, 1)) = exp( −(1/α)(1 − e^{−α}) r(0) − α h(0, 1) + (σ²/(2α²)) [ 1 + (1/(2α))(1 − e^{−2α}) + (2/α)(e^{−α} − 1) ] ).   (13)

Additionally, we define for future use

g(t) := ∫_t^{t+1} e^{−α(t+1−s)} b(s) ds,
h(t) := h(t, t+1).

B.3 Simulation

We simulate the short rate, r, and the log-cash account, Y, jointly from the Hull–White one-factor model according to the formulas in Glasserman (2013), using the innovation process X as a stochastic driver. With Y(t) = ∫_0^t r(u) du and an annual grid (t_{i+1} − t_i = 1):

μ(t_i, t_{i+1}) = α ∫_{t_i}^{t_{i+1}} e^{−α(t_{i+1}−s)} b(s) ds = α g(t_i) = μ(t_i)

σ_r(t_i, t_{i+1})² = (σ²/(2α)) (1 − e^{−2α(t_{i+1}−t_i)}) = (σ²/(2α)) (1 − e^{−2α}) = σ_r²

r(t_{i+1}) = e^{−α(t_{i+1}−t_i)} r(t_i) + μ(t_i, t_{i+1}) + σ_r(t_i, t_{i+1}) X_{i+1,1} = e^{−α} r(t_i) + μ(t_i) + σ_r X_{i+1,1}

σ_Y(t_i, t_{i+1})² = (σ²/α²) [ (t_{i+1} − t_i) + (1/(2α))(1 − e^{−2α(t_{i+1}−t_i)}) + (2/α)(e^{−α(t_{i+1}−t_i)} − 1) ] = (σ²/α²) [ 1 + (1/(2α))(1 − e^{−2α}) + (2/α)(e^{−α} − 1) ] = σ_Y²

σ_{rY}(t_i, t_{i+1}) = (σ²/(2α²)) (1 + e^{−2α(t_{i+1}−t_i)} − 2 e^{−α(t_{i+1}−t_i)}) = (σ²/(2α²)) (1 + e^{−2α} − 2 e^{−α}) = σ_{rY}

ρ_{rY}(t_i, t_{i+1}) = σ_{rY}(t_i, t_{i+1}) / [σ_r(t_i, t_{i+1}) σ_Y(t_i, t_{i+1})] = σ_{rY} / (σ_r σ_Y) = ρ_{rY}

X′_{i+1,2} = ρ_{rY} X_{i+1,1} + √(1 − ρ_{rY}²) X_{i+1,2}

μ_Y(t_i, t_{i+1}) = (1/α)(1 − e^{−α(t_{i+1}−t_i)}) r(t_i) + α ∫_{t_i}^{t_{i+1}} ∫_{t_i}^{u} e^{−α(u−s)} b(s) ds du = (1/α)(1 − e^{−α}) r(t_i) + α h(t_i) = μ_Y(t_i)

Y(t_{i+1}) = Y(t_i) + μ_Y(t_i) + σ_Y X′_{i+1,2}

B.4 Equity model
A geometric Brownian process S_t models the equity excess return; the equity index EQ_t is therefore given by EQ_t = C_t S_t, where C_t = exp(Y(t)) is the cash account.

We use this model for simulating the equity and real estate indices, with the recursive formula

S_t = S_{t−1} exp( −σ²/2 + σ X′_{t,j} ),

where j = 3 for the equity index and j = 4 for the real estate index. For a given matrix Σ that encodes the desired correlations, X′ is a (correlated) innovation process X′_t = Σ X_t. For the examples in this paper, we have not calibrated Σ to market data. In order to produce X′, we set X′_{t,1} = X_{t,1}, let X′_{t,2}, X′_{t,3}, X′_{t,4} be correlated with X_{t,1}, and set X′_{t,5} = X_t^{(lc)}—used for the mortality model—to be independent of all other variables.

C Proof of Theorem 1
Theorem 1 follows from Lemma 2 below by choosing Ã = A(A⊤A)^{−1/2}, which lies in V_k(R^{Td}).

Lemma 2.
Let A, Ã ∈ R^{Td×k} with full rank. The following are equivalent:

1. Im A = Im Ã;
2. A = ÃS, for some invertible k × k matrix S;
3. span φ_A = span φ_Ã.

Proof. 1 ⇔
2: is elementary.
2 ⇒
3: define q̃_i(y) = q_i(S⊤y) and note that {q̃_1, …, q̃_m} is a basis of Pol_δ(R^p). Note also that φ_{A,i}(x) = q_i(S⊤Ã⊤x) = q̃_i(Ã⊤x), while φ_{Ã,i}(x) = q_i(Ã⊤x). This yields the claim.
3 ⇒
1: as Pol_δ(R^p) contains all linear polynomials on R^p, we infer that for any v ∈ R^p there exists some ṽ ∈ R^p such that v⊤A⊤x = ṽ⊤Ã⊤x for all x ∈ R^{Td}, and vice versa. This completes the proof.
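The implication 2 ⇒ 3 of Lemma 2 can also be checked numerically: when A = ÃS with S invertible, every polynomial feature of A⊤x is an exact linear combination of the features of Ã⊤x. The sketch below does this with degree-2 monomial features and illustrative dimensions; all names are ours.

```python
import numpy as np
from itertools import combinations_with_replacement

def features(y, degree=2):
    """Monomials of total degree <= degree in the columns of y."""
    cols = [np.ones(y.shape[0])]
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(y.shape[1]), deg):
            cols.append(np.prod(y[:, idx], axis=1))
    return np.column_stack(cols)

# If A = A_tilde @ S, the feature spans coincide, so regressing the
# features of A^T x on those of A_tilde^T x leaves no residual.
rng = np.random.default_rng(0)
Td, k, n = 12, 3, 400
A_tilde = rng.standard_normal((Td, k))
S = rng.standard_normal((k, k))            # generically invertible
A = A_tilde @ S
x = rng.standard_normal((n, Td))
phi_A, phi_At = features(x @ A), features(x @ A_tilde)
coef, *_ = np.linalg.lstsq(phi_At, phi_A, rcond=None)
residual = np.max(np.abs(phi_At @ coef - phi_A))
```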
D Expectations under polynomial LDR

If x ∈ R^p, H_α(x) for α ∈ N^p is a multivariate Hermite polynomial of order α_j in the j-th dimension.

Lemma 3.
Given X ∼ N(0, I_{Td}) and A ∈ V_p(R^{Td}), the conditional expectation of the Hermite polynomial of order α is

E_t[H_α(A⊤X)] = ∏_{j=1}^p σ[W_j^−]^{α_j} H_{α_j}( W_j^− / σ[W_j^−] )   (14)

where σ[W_j^−] = √( ∑_{s=1}^{td} A_{sj}² ).

Proof. Since A⊤X ∼ N(0, I_p), we have

E_t[H_α(A⊤X)] = E_t[ ∏_{j=1}^p H_{α_j}(A_j⊤X) ] = ∏_{j=1}^p E_t[ H_{α_j}(A_j⊤X) ].   (15)

At this point we separate the terms before and including t from those after t, and define W_j^− = ∑_{s=1}^{td} A_{sj} X_s and W_j^+ = ∑_{s=td+1}^{Td} A_{sj} X_s, so that W_j^− is F_t-measurable, W_j^+ is independent of F_t, and A_j⊤X = W_j^− + W_j^+.

To continue the proof, we introduce the Hermite polynomials of variance ν,

H_n^{[ν]}(x) = ν^{n/2} H_n( x/√ν ),

following Roman (1984), Section 2.1. These generalized Hermite polynomials form an Appell sequence for which the following identity holds (Equation 4.2.1 in Roman (1984)):

H_n^{[ν+µ]}(x + y) = ∑_{k=0}^n (n choose k) H_k^{[ν]}(x) H_{n−k}^{[µ]}(y).

This is the binomial convolution of H_k^{[ν]} and H_{n−k}^{[µ]} under umbral calculus. Note that the variance of W_j^− + W_j^+ fulfills

σ²[W_j^− + W_j^+] = σ²[W_j^−] + σ²[W_j^+] = 1.

Setting ν = σ²[W_j^−] and µ = σ²[W_j^+], we obtain

H_{α_j}(W_j^− + W_j^+) = H_{α_j}^{[σ²[W_j^−]+σ²[W_j^+]]}(W_j^− + W_j^+)
= ∑_{k=0}^{α_j} (α_j choose k) H_k^{[σ²[W_j^−]]}(W_j^−) H_{α_j−k}^{[σ²[W_j^+]]}(W_j^+)
= ∑_{k=0}^{α_j} (α_j choose k) σ[W_j^−]^k H_k( W_j^−/σ[W_j^−] ) σ[W_j^+]^{α_j−k} H_{α_j−k}( W_j^+/σ[W_j^+] ).
Applying the F_t-conditional expectation, we obtain

E_t[ H_{α_j}(W_j^− + W_j^+) ] = ∑_{k=0}^{α_j} (α_j choose k) σ[W_j^−]^k H_k( W_j^−/σ[W_j^−] ) σ[W_j^+]^{α_j−k} E_t[ H_{α_j−k}( W_j^+/σ[W_j^+] ) ],

since

E_t[ H_k( W_j^−/σ[W_j^−] ) ] = E[ H_k( W_j^−/σ[W_j^−] ) | W_j^− ] = H_k( W_j^−/σ[W_j^−] ).

Replacing in (15), and taking into account that

σ[W_j^+]^{α_j−k} E_t[ H_{α_j−k}( W_j^+/σ[W_j^+] ) ] = 0 for k ≠ α_j, and 1 for k = α_j,

we obtain (14).

E Lee–Carter parameters
The Lee–Carter parameters are based on the findings in the original paper, R. D. Lee and Carter (1992), and they are shown in Table 13.
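The mortality recursion from Section 5 (q_x(t) = 1 − exp(−m_x(t)) with m_x(t) = exp(a_x + b_x k(t))) can be sketched as below. All parameter values are illustrative placeholders, not the Table 13 calibration, and the drift and volatility of the mortality index are likewise assumptions.

```python
import numpy as np

def death_rates(a_x, b_x, k0, drift, vol, T, rng):
    """Simulate Lee-Carter death rates q_x(t) for t = 1..T.

    Follows the structure in the text: m_x(t) = exp(a_x + b_x * k(t)),
    q_x(t) = 1 - exp(-m_x(t)), with the mortality index k(t) following a
    downward trend plus Gaussian noise. Parameter values passed in below
    are illustrative placeholders, not the paper's calibration.
    """
    q = np.empty((T, a_x.size))
    for t in range(1, T + 1):
        k_t = k0 - drift * t + vol * np.sqrt(t) * rng.standard_normal()
        m_x = np.exp(a_x + b_x * k_t)
        q[t - 1] = 1.0 - np.exp(-m_x)
    return q

rng = np.random.default_rng(0)
ages = np.arange(30, 61)
a_x = -9.0 + 0.09 * (ages - 30)   # placeholder level by age
b_x = np.full(ages.size, 0.1)     # placeholder sensitivity
q = death_rates(a_x, b_x, k0=-0.1, drift=0.2, vol=0.1, T=5, rng=rng)
```

Each row of q is a stochastic life table for one period, which is what drives the death counts D_{xt} in the product cash flows.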
F Sensitivity of polynomial LDR to hyper-parameters
The polynomial LDR method has two hyper-parameters, the target dimensionality p and the polynomial degree δ. Additionally, the Riemannian BFGS algorithm used to solve the optimization problem adds several other parameters, the main one being the starting point for the parameter A, called here A_0. The polynomial degree parameter is common to all polynomial approximations, and has the expected impact on the results. In this section, we focus on the parameter p, which is unique to the linear dimensionality reduction, and the parameter A_0, which in our empirical examples proved to have a large impact on results.

We show that the choice of p is purely a trade-off between approximation error and the number of samples required, and that the choice of starting point makes a very large difference in the final results. A random starting point performs relatively badly, compared to a starting point that takes into account the fact that in financial models, cash flows closer in time are usually more important than those farther in time.

F.1 Starting point A_0

The Riemannian BFGS algorithm used to solve the polynomial LDR optimization problem requires a starting point for A.

A simple way of generating a starting point—similar to what is done for the L-BFGS algorithm used to solve the neural network optimization problem—is to generate it randomly. To do this we draw Tdp random samples from N(0,
1) and arrange them into an
T d × p matrix B . Then A = B ( B (cid:62) B ) − / is a random matrix that follows the uniform distribution on the Stiefel mainfold V p ( R T d ).A second way is to use a rectangular diagonal matrix and fill the last column to ensure thatevery one of the
T d input dimensions has a weight in at least one of the p output dimension, thatis A = B | B ij = 1 if ( i = j ∧ i (cid:54) = p ) ∧ B ip = √ T d − p +1 if i (cid:62) p . Conceptually, this starting pointcan be thought as a point where those input dimensions farthest in the future have been groupedinto one output dimension. The following is an example for T d = 4 and p = 3: √ .
50 0 √ . .
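The random and diagonal constructions above can be sketched in NumPy as follows. This is a sketch under our own naming, not the paper's implementation; we compute B(BᵀB)^{-1/2} through its equivalent polar factor UVᵀ from the thin SVD B = USVᵀ:

```python
import numpy as np

def random_stiefel(Td, p, rng):
    """Random starting point A = B (B^T B)^{-1/2}, uniformly
    distributed on the Stiefel manifold V_p(R^Td), with B i.i.d.
    standard normal entries."""
    B = rng.standard_normal((Td, p))
    # B = U S V^T  =>  B (B^T B)^{-1/2} = U V^T (polar factor)
    U, _, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ Vt

def diagonal_start(Td, p):
    """Rectangular 'diagonal' starting point: identity in the first
    p-1 columns; the last column spreads weight 1/sqrt(Td - p + 1)
    over the remaining rows i >= p (1-indexed)."""
    A = np.zeros((Td, p))
    for i in range(p - 1):
        A[i, i] = 1.0
    A[p - 1:, p - 1] = 1.0 / np.sqrt(Td - p + 1)
    return A

rng = np.random.default_rng(0)
A_rand = random_stiefel(4, 3, rng)
A_diag = diagonal_start(4, 3)
# both lie on the Stiefel manifold: A^T A = I_p
assert np.allclose(A_rand.T @ A_rand, np.eye(3))
assert np.allclose(A_diag.T @ A_diag, np.eye(3))
```

For Td = 4 and p = 3, `diagonal_start` reproduces the example matrix above, with last-column entries √0.50.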
A third way uses the same rationale of grouping input dimensions that are far in the future into one output dimension, but does so respecting the fact that X ∈ R^{T×d}, and only groups variables across time (T) but not across dimensions (d). The following is an example for T = 5, d = 3 and p = 3, which corresponds to what was used in the European call example in Section 3.3:
\[
A_0 = \begin{pmatrix} \tfrac{1}{\sqrt{T}} I_3 \\ \vdots \\ \tfrac{1}{\sqrt{T}} I_3 \end{pmatrix},
\]
with T identical d × d blocks, so that each column has unit norm. Since this method, which we call "folding", provides the best results, we also use it in Sections 4 and 5. In the latter we work with T = 5, d = 5 and p = 10, which leads to a starting point (in block notation):
\[
A_0 = \begin{pmatrix} I_d & 0_{d \times (p-d)} \\ 0_{d \times d} & \tfrac{1}{\sqrt{T-1}} I_d \\ \vdots & \vdots \\ 0_{d \times d} & \tfrac{1}{\sqrt{T-1}} I_d \end{pmatrix}.
\]

F.2 Target dimensionality parameter p

To show the effects of the parameter p on the insurance example, we choose one of the starting point methods (diagonal) and one maturity (T = 5). The results in Tables 16 and 17 confirm the expected effect of changing this parameter: larger values of p produce better results (since f⋆ is a richer function) but also require more training samples to do so. We can see that when moving from p = 10, used in the main results for the insurance example, to p = 15, and therefore from m = 286 to m = 816, the error for 1,000 training samples increases by a factor of 10 in the expected shortfall and by a factor of 3 in the L² metric. In those cases with more training samples, 5,000 and above, the error goes down as expected.

G Sensitivity of neural network to hyper-parameters
The neural network method has one main hyper-parameter, the width of the hidden layer. Other typical neural network hyper-parameters, such as the number of layers or the activation function, do not apply in this case, since the closed form of the time-t expectation has been defined only for single-layer ReLU networks. Unlike the polynomial LDR basis, we do not explore the impact of the starting point, since a random starting point already performs very well.

In the main results in Section 5, we use the same layer width in all cases, p = 100. This value is the result of a sensitivity test done for different values (10, 50, 100, 200), after which we chose the best results over all cases. This sensitivity test is similar to cross-validation, but is performed on entirely out-of-sample data rather than by partitioning the existing training data. Using a single choice of layer width in all cases has the advantage of showing good overall results (for different maturities and training sample sizes) but the disadvantage of being neither optimized for each single case (meaning that the results could have been better when looking at each cell of the table) nor comparable to the polynomial method in terms of functional complexity, that is, the number of parameters that describe the function f⋆.

The selection could have been done in different ways, and in this section we show some alternatives and their effects on the results shown in Section 5. The results are summarized in Tables 18 and 19. We show that some of the alternatives perform even better than our choice for the main results, implying the potential for improvement in the neural network basis, which is already the best performing basis in our comparisons.

The first alternative is to use the theoretical minimum width for the network, as described in Hanin and Sellke (2017). In our case, it means using p = 25 for T = 5 and p = 200 for T = 40. This method does not show a good performance.
Interestingly, it performs worse even for T = 40, where p = 100 is below the minimum. Still, this is not a violation of the theoretical minimum, since that result is valid for neural networks of arbitrary depth, so it is always possible that using more hidden layers would result in smaller errors than the fixed-p method.

The second alternative is to backsolve the width of the network that creates a parameter space of similar dimensionality as that of the polynomial LDR method. For d = 5, T = 5 and p = 10, the polynomial LDR has 481 parameters. For T = 40 it has 2,231 parameters. This can be matched by using a neural network with p = 18 and p = 11 nodes, respectively. This alternative provides very good results for the high-dimensionality case (T = 40) but not as good for the low-dimensionality case (T = 5).

The third and final alternative is to use a neural network that matches the number of basis functions m. This means, for both T = 5 and T = 40, that p = 286. The results for this alternative are similar to the other alternatives.

References

Absil, P-A, Christopher G Baker, and Kyle A Gallivan (2007). “Trust-region methods on Riemannian manifolds”. In:
Foundations of Computational Mathematics
Proceedings of “Real option theory meets practice”, 8th Annual International Conference, Montreal (cit. on p. 3). Broadie, Mark, Yiping Du, and Ciamac C Moallemi (2015). “Risk estimation via regression”. In:
Operations Research
Insurance: Mathematics and Economics
Available at SSRN 3303296 (cit. on p. 4). Chai, Tianfeng and Roland R Draxler (2014). “Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature”. In:
Geoscientific Model Development
The Journal of Machine Learning Research (cit. on p. 11). Fernandez-Arjona, Lucio and Damir Filipović (2020).
Benchmark and training data for replicating financial and insurance examples. Zenodo. doi: (cit. on p. 3). Glasserman, Paul (2013). Monte Carlo methods in financial engineering. Vol. 53. Springer Science & Business Media (cit. on p. 24). Glasserman, Paul and Bin Yu (2002). “Simulation for American options: Regression now or regression later?” In:
Monte Carlo and Quasi-Monte Carlo Methods 2002. Springer, pp. 213–226 (cit. on pp. 2, 3, 6). Gordy, Michael B and Sandeep Juneja (2010). “Nested simulation in portfolio risk measurement”. In:
Management Science. Hanin, Boris and Mark Sellke (2017). “Approximating continuous functions by ReLU nets of minimal width”. In: arXiv preprint arXiv:1710.11278 (cit. on p. 29). Hokanson, Jeffrey M and Paul G Constantine (2018). “Data-driven polynomial ridge approximation using variable projection”. In:
SIAM Journal on Scientific Computing
SIAM Journal on Optimization
Journal of the American Statistical Association
ACM Transactions on Modelingand Computer Simulation (TOMACS)
The Review of Financial Studies
Mathematical Finance
Pedregosa, Fabian, et al. (2011). “Scikit-learn: Machine learning in Python”. In: Journal of Machine Learning Research 12, pp. 2825–2830 (cit. on p. 15). Pelsser, Antoon and Janina Schweizer (2016). “The difference between LSMC and replicating portfolio in insurance liability modeling”. In:
European Actuarial Journal
Roman, Steven (1984). The umbral calculus. Academic Press: New York. isbn: 9780125943802 (cit. on p. 26). Tibshirani, Robert (1996). “Regression shrinkage and selection via the lasso”. In:
Journal of the Royal Statistical Society: Series B (Methodological)
The Journal of Machine Learning Research
Climate Research
The Annals of Statistics.

Table 1: Nested Monte Carlo, comparison of present value (50,000 total samples)
[Columns: Maturity; Benchmark; nMC MApE.]
Table 2: Nested Monte Carlo, comparison of expected shortfall (50,000 total samples)
[Columns: Maturity (value); Benchmark; Nested Monte Carlo (MApE by inner simulations).]
Table 3:
Nested Monte Carlo, comparison of present value and expected shortfall MApE (in percentage points) for optimal inner–outer split

                 Present Value              Expected Shortfall
Samples   Maturity: 5   Maturity: 40   Maturity: 5   Maturity: 40
1,000         6.4           7.5            26.8          404.4
5,000         4.0           3.9            15.9          111.2
10,000        3.8           3.1            14.5           56.8
50,000        1.7           2.4             8.0           19.2

Table 4: European call, comparison of present value MApE (in percentage points)
[Columns: Samples; nMC (Full); Regress-now, Regress-later (Polynomial basis); Regress-later (LDR); Regress-now, Regress-later (Neural Network). Rows: T=5, T=40.]
Table 5: European call, comparison of expected shortfall MApE (in percentage points)
[Columns: Samples; nMC (Full); Regress-now, Regress-later (Polynomial basis); Regress-later (LDR); Regress-now, Regress-later (Neural Network). Rows: T=5, T=40.]
Table 6: European call, comparison of relative mean L² error (in percentage points)
[Columns: Samples; Regress-now, Regress-later (Polynomial basis); Regress-later (LDR); Regress-now, Regress-later (Neural Network). Rows: T=5, T=40.]

Table 7: European call, comparison of MApE ES and relative mean L² error (both in percentage points)
[Columns: Samples; Lasso, LDR, N. Net. (MApE ES); Lasso, LDR, N. Net. (Rel. mean L² error). Rows: T=5, T=40.]
Table 8: European call, comparison of runtime (in seconds), single core AMD Opteron 6380
[Columns: Samples; Lasso (Full); Regress-now, Regress-later (Polynomial basis); Regress-later (LDR); Regress-now, Regress-later (Neural Network). Rows: T=5, T=40.]

Table 9: Insurance liability, comparison of present value MApE (in percentage points)
[Columns: Samples; nMC (Full); Regress-now, Regress-later (Polynomial basis); Regress-later (LDR); Regress-now, Regress-later (Neural Network). Rows: T=5, T=40.]

Table 10:
Insurance liability, comparison of expected shortfall MApE (in percentage points)
[Columns: Samples; nMC (Full); Regress-now, Regress-later (Polynomial basis); Regress-later (LDR); Regress-now, Regress-later (Neural Network). Rows: T=5, T=40.]
Table 11: Insurance liability, comparison of relative mean L² error (in percentage points)
[Columns: Samples; Regress-now, Regress-later (Polynomial basis); Regress-later (LDR); Regress-now, Regress-later (Neural Network). Rows: T=5, T=40.]

Table 12: Insurance liability, comparison of runtime (in seconds), single core AMD Opteron 6380
[Columns: Samples; Regress-now, Regress-later (Polynomial basis); Regress-later (LDR); Regress-now, Regress-later (Neural Network). Rows: T=5, T=40.]

Table 13: Lee–Carter parameters a_x and b_x for every age x
[Columns: x; a_x; b_x.]

Table 14: Insurance liability, comparison of expected shortfall MApE (in percentage points) for different starting points in polynomial LDR basis
[Columns: Samples; Folding; Diagonal; Random. Rows: T=5, T=40.]

Table 15: Insurance liability, comparison of relative mean L² error (in percentage points) for different starting points in polynomial LDR basis
[Columns: Samples; Folding; Diagonal; Random. Rows: T=5, T=40.]

Table 16: Insurance liability, comparison of expected shortfall MApE (in percentage points) for different values of the target dimensionality parameter in polynomial LDR basis
[Columns: Samples; Diagonal p=5; Diagonal p=10; Diagonal p=15; Diagonal p=20. Rows: T=5.]

Table 17: Insurance liability, comparison of relative mean L² error (in percentage points) for different values of the target dimensionality parameter in polynomial LDR basis
[Columns: Samples; Diagonal p=5; Diagonal p=10; Diagonal p=15; Diagonal p=20. Rows: T=5.]

Table 18: Comparison of expected shortfall MApE (in percentage points) for different layer widths in neural network basis
[Columns: Samples; Fixed p = 100; Minimum width; Equal param dims; Equal m. Rows: T=5, T=40.]

Table 19: Comparison of relative mean L² error (in percentage points) for different layer widths in neural network basis
[Columns: Samples; Fixed p = 100; Minimum width; Equal param dims; Equal m. Rows: T=5, T=40.]

Figure 1: Nested Monte Carlo structure. The outer simulations span the time period 0 to t, and each of those serves as the starting point of a group of inner simulations.
[Plot: panels for Maturity = 5 and Maturity = 40; x-axis: Samples; y-axis: log₁₀ MApE; methods: nMC (Full), Polynomial basis (R. now, R. later), LDR (R. later), Neural Network (R. now, R. later).]
Figure 2:
Distribution of expected shortfall estimates per method for the European call example. Boxes show the upper and lower quartiles of the empirical distributions, while whiskers show their maxima and minima. Due to the logarithmic scale, a visual artifact is introduced by which the inter-quartile range seems to increase with larger sample sizes, most notably in the nMC for Maturity=40. In reality, the inter-quartile range decreases with larger sample sizes, but the logarithmic scale makes the box bigger for smaller errors.
[Plot: panels for Maturity = 5 and Maturity = 40; x-axis: Samples; y-axis: log₁₀ MApE; methods: nMC (Full), Polynomial basis (R. now, R. later), LDR (R. later), Neural Network (R. now, R. later).]

Figure 3: