Mathematical Foundations of Regression Methods for the approximation of the Forward Initial Margin
Lucia Cipolina-Kun, Simone Caenazzo, and Ksenia Ponomareva
University of Bristol, School of Computer Science, Electrical and Electronic Engineering and Engineering Mathematics
Riskcare Ltd., London
June 18, 2020
Abstract
In recent years, forward initial margin has attracted the attention of practitioners and academics. From a computational point of view, it poses a challenging problem, as it requires the implementation of a nested Monte Carlo simulation. Abundant literature has been published on approximation methods aiming to reduce the dimensionality of the problem, the most popular ones being the family of regression methods. This paper describes the mathematical foundations on which these regression approximation methods rest. Mathematical rigor is introduced to show that, in essence, all methods are different variations of approximating the conditional expectation function. In short, all methods perform orthogonal projections on Hilbert spaces, while simply choosing a different functional form to numerically estimate the conditional expectation. The most popular methods in the literature so far are covered here: polynomial approximations, kernel regressions and neural networks.
Introduction

Initial margin (IM) has become a topic of high relevance for the financial industry in recent years. The relevance stems from its wide implications for the way business will be conducted in the financial industry and, in addition, from the need for approximation methods to calculate it.
The authors would like to thank Eusebio Gardella and Bertrand Nortier for their contribution to the final manuscript.

The exchange of initial margin is not new in the financial industry; however, after regulations passed in 2016 by financial regulatory authorities [5], a wider set of financial institutions are required to post significant amounts of initial margin to each other. The initial margin amount is calculated at the netting-set level following a set of rules, depending on whether the trade has been done bilaterally [13] or facing a clearing house [7]. The calculation is performed on a daily basis and carried on throughout the life of the trade.

Since each netting set consumes initial margin throughout its entire life, financial institutions need resources to fund this amount in the future, given that the margin received from the counterparty cannot be rehypothecated. To estimate the total need for funds, each counterparty needs to forecast the initial margin amount up to the maturity of the netting set.

Forecasting the total forward initial margin amount is a computationally challenging problem. An exact initial margin calculation requires a full implementation of a nested Monte Carlo simulation, which is computationally onerous given the real-time demand for IM numbers as part of day-to-day business activities.

As a response to this computational challenge, practitioners and academics have proposed several approximation methods to reduce the dimensionality of the problem, at the cost of losing a tolerable degree of accuracy. The most popular of these approaches have been the regression methods, since they offer simplicity and reusability of a bank's legacy Monte Carlo engines, paired with good results. Inspired by the early work of Longstaff-Schwartz [16], several banks have implemented some version of the polynomial regression proposed by [1], [8] and [10], or the kernel regressions proposed by [2], [10] and [11]. More recently, estimation by neural networks has been proposed by [12] and [18].

These methods have not been developed exclusively to tackle the initial margin problem. They are well-known regression methodologies and have been used extensively in finance and engineering. As such, the current literature on initial margin has focused on the implementation and numerical validation of the regression methods from a practical point of view. The authors' main focus has been to show the applicability and performance of the different methodologies when compared to a brute-force benchmark, without going deep into mathematical formalities.

This paper aims to fill the gap between theory and practice. The underlying mathematical framework required to analyze the initial margin computation problem from a theoretical standpoint is presented here.

Organization
The rest of this paper is organized as follows. Section 2 translates the practitioner's definition of initial margin into a formal probability setting. Sections 3 and 4 develop the assumptions needed to fit the initial margin problem into a classical problem of orthogonal projections onto Hilbert spaces. The last section concludes by linking the formal theory presented back to the practitioners' suggested approaches.
Following the definition in [6], the aim of a bank is to calculate the total cost associated with posting initial margin to a counterparty over the life of a netting set. Since this is a quantity that evolves in time, it is necessary to consider a forward diffusion model, together with a probability space and the relevant discounting and funding rates.

Consider a filtered probability space (Ω, F, {F_t}_{t≥0}, P) where, as usual, Ω is the set of all possible sample paths ω, F is the sigma-algebra, {F_t} is the filtration of events at time t (i.e. the natural filtration to which the risk factors are adapted) and P is the probability measure defined on the elements of F. The total cost of posting initial margin over the entire life of a netting set is referred to as the Margin Valuation Adjustment (MVA), defined below.
$$ MVA_C(t) = E_t\left[ \int_t^T \big((1 - R_C)\lambda(u) - S_I(u)\big)\, e^{-\int_t^u \left(r(s) + \lambda_B(s) + \lambda_C(s)\right) ds}\, E_u\big[IM_C(u)\big]\, du \right] $$

where:
• E_t denotes an expectation conditional on the information at time t, i.e. E_t[...] = E[... | F_t];
• R_C(t) is the recovery rate at time t of the counterparty related to netting set C;
• T is the final maturity of netting set C;
• λ(t) is the Bank's effective financing rate (i.e. borrowing spread) at time t;
• S_I is the spread received on initial margin;
• r(t) is the risk-free rate (e.g. an OIS rate) at time t;
• λ_B(t), λ_C(t) are the survival probabilities at time t of the Bank and of the counterparty related to netting set C, respectively;
• IM_C(t) is the initial margin posted by the Bank at time t against netting set C.

Note that the expectation E_t is taken over the whole integral. This is because certain simulation setups, like the ones in [21] and [22], feature explicitly modelled credit factors that lead to stochastic credit spreads and survival probabilities. If the simulation setup instead considers deterministic credit factors, the formula can be simplified as follows:

$$ MVA_C(t) = \int_t^T \big((1 - R_C)\lambda(u) - S_I(u)\big)\, e^{-\int_t^u \left(\lambda_B(s) + \lambda_C(s)\right) ds}\, E_t\left[ E_u\big[IM_C(u)\big]\, e^{-\int_t^u r(s)\, ds} \right] du $$

Here our quantity of interest is E_t[IM(t)], the expectation of the initial margin at time t, taken across all random paths ω, conditional on the realization of the risk factors at time t. The initial margin is a path-wise random variable defined by

$$ P\big(\Delta(\omega, t, \delta_{IM}) \le IM(\omega, t) \mid F_t\big) = p \qquad (1) $$

where:
• p denotes the desired quantile (usually the 99% quantile);
• Δ(ω, t, δ_IM) is defined, in its simplest form (for alternative definitions of initial margin and of the netting-set PnL see [3]), as the change, conditional on counterparty default, in the netting set's value (i.e. profit and loss, PnL) over a time interval (t, t + δ_IM]:

$$ \Delta(\omega, t, \delta_{IM}) = V(\omega, t + \delta_{IM}) - V(\omega, t) $$

Note that the initial margin as defined above is specifically the posted IM, as the quantile is calculated on the upper tail of the PnL distribution. In order to calculate the received
IM, one can instead compute the 1% quantile of the same distribution.

The following sections describe the computational methods developed in the current literature to estimate this quantity.
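Before turning to those methods, the quantile definition in equation 1 can be illustrated with a minimal sketch: given a set of simulated PnL scenarios at a fixed time t, the posted IM is the 99% quantile and the received IM the 1% quantile of the empirical distribution. The simulated data and variable names below are illustrative assumptions, not taken from any of the cited papers.

```python
import numpy as np

# Minimal sketch of equation (1): given simulated PnL scenarios
# delta = V(omega, t + delta_IM) - V(omega, t) at a fixed time t,
# the posted IM is the 99% quantile of the PnL distribution and the
# received IM the 1% quantile. All names and numbers are illustrative.

rng = np.random.default_rng(0)
n_scenarios = 10_000
pnl = rng.normal(loc=0.0, scale=1.5e6, size=n_scenarios)  # simulated Delta(omega, t, delta_IM)

posted_im = np.quantile(pnl, 0.99)    # upper-tail quantile: IM posted by the bank
received_im = np.quantile(pnl, 0.01)  # lower-tail quantile: IM received from the counterparty

print(f"posted IM:   {posted_im:,.0f}")
print(f"received IM: {received_im:,.0f}")
```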
The most exact method to calculate the initial margin requires a brute-force implementation of a nested Monte Carlo simulation. Since this approach is computationally intractable, in practice it is only used as a ground-truth benchmark to assess the performance of the approximation methodologies.

At each time step t, an outer Monte Carlo simulation produces the values of V(ω, t) and an inner Monte Carlo simulation produces the forward probability density function (PDF) of V(ω, t + δ_IM) | F_t. The difference in the value of the netting set between the time steps t and t + δ_IM is Δ(ω, t, δ_IM). Since the PDF of V(ω, t + δ_IM) | F_t is known, it is possible to obtain the PDF of Δ(ω, t, δ_IM). Then, one takes the 99% quantile of this distribution to obtain the initial margin.

Figure 1: A simulation tree with two branches and two time steps.

The initial margin at every time step t is the 99th percentile of the inner PDF scenarios. Both processes (inner and outer) are usually simulated using the same stochastic differential equation with a sufficiently large number of scenarios to ensure convergence. The number of scenarios becomes particularly important for the inner simulation: since the 99% quantile is a tail statistic, the number of scenarios in the tail should be sufficiently large.

As with every nested Monte Carlo problem, the brute-force calculation of initial margin is computationally onerous, since the number of operations grows with the product of the number of outer and inner scenarios. This type of complexity cannot be afforded in real time, e.g. in a live trading scenario, which is the reason behind the development of approximation methods.
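To make the nested structure concrete, the following is a minimal sketch of a brute-force nested Monte Carlo IM profile. For illustration only, the netting-set value is assumed to follow a simple arithmetic Brownian motion; the model, parameters and function names are assumptions made for this sketch and are not taken from the papers cited above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative assumptions: the netting-set value follows an arithmetic Brownian
# motion, simulated on a coarse outer grid; at every outer node an inner
# simulation of V(t + delta_IM) builds the PnL distribution whose 99% quantile
# is the initial margin at that node.
sigma = 1.0e5          # daily PnL volatility (illustrative)
delta_im = 10 / 250    # 10-business-day margin period of risk, in years
dt = 1 / 12            # monthly outer time steps
n_outer, n_inner, n_steps = 500, 2_000, 12

def step(v, horizon, n_paths):
    """Diffuse the netting-set value over `horizon` years along `n_paths` paths."""
    return v + sigma * np.sqrt(horizon * 250) * rng.standard_normal(n_paths)

im_profile = np.zeros((n_steps, n_outer))
v_outer = np.zeros(n_outer)  # V(omega, 0) = 0 for a newly traded netting set

for k in range(n_steps):
    v_outer = step(v_outer, dt, n_outer)                   # outer scenarios V(omega, t)
    for i, v in enumerate(v_outer):                        # inner scenarios per outer node
        v_inner = step(np.full(n_inner, v), delta_im, n_inner)
        pnl = v_inner - v                                  # Delta(omega, t, delta_IM)
        im_profile[k, i] = np.quantile(pnl, 0.99)          # 99% quantile = posted IM

# Expected forward IM profile E_t[IM(t)] across outer scenarios
print(im_profile.mean(axis=1))
```

The double loop makes the cost explicit: every additional inner scenario multiplies the total number of pathwise valuations, which is exactly what the approximation methods below try to avoid.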
Regression Methods for the Approximation of Initial Margin

The regression proposed by Longstaff-Schwartz in [16] is the traditional framework used in finance to reduce dimensions in nested Monte Carlo problems. In short, the authors propose to reduce the number of machine operations by simulating only the outer scenarios and fitting a regression to forecast the inner ones. Their regression uses a polynomial basis as the functional choice for the regression.

The use of regressions is justified since the authors' mathematical problem is the estimation of a conditional expectation on the inner scenarios, which is usually approximated with a linear regression. The initial margin computation problem posed in this paper requires the approximation of a quantile. It will be shown that the general Longstaff-Schwartz framework can be applied to this problem as well.

Figure 2 provides a graphical description of the approach followed to approximate a nested Monte Carlo problem. The main idea is to simulate only the outer scenarios and assume a parametrized PDF for the inner ones. Under this setting, one can use a regression to estimate the required quantile.

Figure 2: Simplification of a nested Monte Carlo by applying distributional assumptions.

From equation 1, the inner scenarios are used to calculate the PDF of the change in the portfolio value, conditional on the information provided by each filtration set F_t. From this PDF, one can obtain the 99% quantile that makes up the initial margin amount. However, when only the outer scenarios are simulated, there is only a single point of the empirical distribution per branch left, which is not enough to build an entire PDF. To overcome this limitation, the relevant literature proposes to assume a parametric distribution for the unknown inner scenarios. The majority of authors propose modelling the quantity [V(ω, t + δ_IM) − V(ω, t)] with a local Normal distribution, since it can be conveniently fitted with only two parameters, the mean and the variance. This choice is justified under the assumption that the simulated V(ω, t) come from a Normal SDE and that the time lapse between observations, δ_IM, is small. Other authors, like [4], propose a fat-tailed Student-t distribution.

If it can be assumed that the inner scenarios follow a local Normal distribution, the only task left is the calculation of the distribution's expectation and variance, to then obtain the initial margin as

$$ \big(IM(\omega, t) \mid F_t\big) = \sigma(\omega, t)\cdot\Phi^{-1}(99\%) \mid F_t $$

where, for simplicity and without loss of generality, we have assumed that the data is centered and thus the expectation is zero. Under this assumption, the term σ(ω, t) can be written as

$$ \big(\sigma(\omega, t) \mid F_t\big) = \sqrt{E\big[\Delta^2(\omega, t, \delta_{IM}) \mid F_t\big]} \qquad (2) $$

Equation 2 shows that, when restated in this way, the quantile estimation problem has been transformed into a conditional expectation problem, which ends up in the same setting as the Longstaff-Schwartz model. The next sections study different approximations of E[Δ²(ω, t, δ_IM) | F_t], the conditional expectation of the netting set's squared PnL given the information available at time t.

The task is to find the best (i.e. unbiased) estimator of the conditional expectation in equation 2 at every time step t. The proposal is to approximate the conditional expectation by regressing Δ²(ω, t, δ_IM) against a set of basis functions (a sketch is given below).
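As an illustration of this proposal, before its formal justification, the following is a minimal sketch of the regression-based approximation under the local Normal assumption: Δ² is regressed on basis functions of the outer-scenario portfolio value, and the fitted conditional variance is turned into an IM estimate via Φ⁻¹(0.99). The quadratic basis and the simulated data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch: approximate E[Delta^2 | V_t] by a polynomial regression on the
# outer-scenario value V(omega, t), then IM ~= sqrt(E[Delta^2 | V_t]) * Phi^{-1}(0.99).
# The quadratic basis and the simulated data below are illustrative assumptions.

rng = np.random.default_rng(2)
n = 5_000
v_t = rng.normal(size=n)                                     # outer scenarios V(omega, t)
pnl = (0.5 + 0.3 * v_t**2) ** 0.5 * rng.standard_normal(n)   # Delta(omega, t, delta_IM)

# Regress the squared PnL on a polynomial basis {1, V, V^2} (ordinary least squares)
basis = np.vstack([np.ones_like(v_t), v_t, v_t**2]).T
beta, *_ = np.linalg.lstsq(basis, pnl**2, rcond=None)

cond_var = np.clip(basis @ beta, 0.0, None)    # fitted E[Delta^2 | V_t], floored at zero
im_estimate = np.sqrt(cond_var) * norm.ppf(0.99)

print(beta)                  # fitted regression coefficients
print(im_estimate[:5])       # path-wise IM approximation on the outer scenarios
```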
However, this use of regressions to approximate conditional expectations needs to be formally justified. The intuition behind the connection between conditional expectations and regressions is to frame conditional expectations as functions mapping a set of vectors in L² onto a given set of vectors in the same space. The particular form of these functions is an arbitrary choice, and the coefficients are calibrated following the geometric properties of L² spaces.

The connection between conditional expectations and regressions is proved in two steps. Firstly, the theoretical framework is described in order to characterize conditional expectations as orthogonal projections on Hilbert spaces; secondly, orthogonal projections are connected to regressions in an empirical step.

The conditional expectation is expressed as a function of a set of vectors x_t obtained from the filtration,

$$ E\big[\Delta^2(\omega, t, \delta_{IM}) \mid F_t\big] = m(x_t) \qquad (3) $$

where x_t is short for span(x_1, ..., x_T), an F_t-adapted process, and m(x_t) is a stochastic function with certain properties described in the following sections. In equation 3, the set of regressors (x_1, ..., x_T) has been fixed by conditioning on the information obtained from the filtration at time t, and therefore E[Δ²(ω, t, δ_IM)] is projected onto a fixed space. In this setting, a conditional expectation is a function of the vectors (x_1, ..., x_T) and not just a random variable, as it is typically defined in option pricing. By interpreting conditional expectations as functions, the properties inherited from the functional space they belong to can be used; in particular, the properties regarding orthogonal projections and decomposition into orthogonal bases in L².

In order to set up an appropriate functional space, the space of random bounded functions is considered on the filtered probability space L²(Ω, F, {F_t}_{t≥0}, P). Thus, only square-integrable functions, satisfying E[y²] < ∞ for every y ∈ L² (i.e. finite second moment), are considered. As such, the inner product of two elements y_1 and y_2 is E[y_1 · y_2], with the induced norm ||y|| = √(E[y²]).

Following Definition 4.1 below, in L² spaces conditional expectations are equivalent to linear maps that perform orthogonal projections between a space and a subspace.

Definition 4.1.
Given a sub-σ-algebra X ⊆ F, there is a closed subspace L²(Ω, X, P) ⊂ L²(Ω, F, P), each equipped with its filtration. Suppose that y ∈ L²(Ω, F, P); the conditional expectation E[y | X] is defined to be the orthogonal projection of y onto the closed subspace L²(Ω, X, P).

The next step is to find a functional form for the projection function, and this is done by choosing an orthogonal basis for span(x_1, ..., x_T). While the basis vectors can be chosen arbitrarily, the vector weights (i.e. coefficients) must be chosen so that the residual between the projected vector and its projection is orthogonal to the projection subspace (the resulting conditions are written out below).

The challenge under a Monte Carlo simulation setting is to find a finite set of suitable basis functions to represent the projected vectors. The following sections expand on the most common empirical methodologies used for this purpose.

Footnote: While conditional expectations are well defined in L¹, this is not an inner product space; only the L² space defines an inner product that can be used to meet the orthogonality requirement.

Footnote: The linear combination is guaranteed to exist, since every Hilbert space admits an orthonormal basis (Zorn's Lemma), and it is unique (by the characterization theorem of orthogonal projections, see [15]). Moreover, if the space projected onto is separable, the combination of basis functions needed to obtain the projected vector is finite.
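For concreteness, the orthogonality requirement can be written out explicitly. The following is a generic sketch of that standard Hilbert-space derivation (see e.g. [15]), not a formula taken from the initial margin papers cited above.

```latex
% Sketch: orthogonality conditions characterizing the L^2 projection
% m(X) = \sum_{i=1}^{P} \beta_i \phi_i(X) of Y onto span{\phi_1(X), ..., \phi_P(X)}.
% The residual Y - m(X) must be orthogonal to every basis vector:
\begin{align*}
  \Big\langle Y - \sum_{i=1}^{P} \beta_i \phi_i(X),\ \phi_j(X) \Big\rangle
  = E\Big[\Big(Y - \sum_{i=1}^{P} \beta_i \phi_i(X)\Big)\phi_j(X)\Big] = 0,
  \qquad j = 1,\dots,P.
\end{align*}
% Collecting the P conditions gives the normal equations for the coefficients:
\begin{align*}
  E\big[\phi(X)\,\phi(X)^{\top}\big]\,\beta = E\big[\phi(X)\,Y\big],
  \qquad \phi(X) = \big(\phi_1(X),\dots,\phi_P(X)\big)^{\top}.
\end{align*}
```

The sample analogue of these conditions is exactly the least-squares regression used in the empirical sections that follow.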
Approximation Methodologies for the Estimation of Conditional Expectations

This section deals with the practical implementation of the theory above, in particular the choice of regressors and basis functions. The theory of Hilbert spaces described above can be considered as the population version, where conditional expectations are equivalent to orthogonal projections, while the simulated vectors are the sample version. The spaces are now Euclidean (i.e. finite-dimensional) and conditional expectations are estimated with regressions.
Under a Monte Carlo simulation, the independent variables X_t and the dependent variables Y_t are generated empirically following an arbitrary choice of SDE. In particular, the discrete-time filtration for both is obtained.

The key art of estimating regressions is the choice of the regressors X_t. The output of the Monte Carlo simulation contains several variables that can be used as regressors. As explained in the previous sections, while in theory it is desirable to find an orthogonal basis to decompose the projected vectors, in practice the choice of regressors is an empirical call. However, some minimal conditions must be met.

In choosing the regressors, there should be a sound relationship between the projected vector Y_t and the projection vectors (X_t)_{0≤t≤T} obtained from the filtration. The second aspect to determine is the relevance of past data as a predictor of the dependent variable Y_t, that is, how much of the past history coming from the filtration is relevant. For Markov chains, only the most recent values of the state variables are necessary, while for non-Markov processes, past realizations of the filtration should be included as regressors. If the variables forming the filtration are Markovian, then the conditional expectation can be estimated as

$$ E[Y_{t+1} \mid F_t] = E[Y_{t+1} \mid X_t] \qquad (4) $$

If the (X_t)_{0≤t≤T} process has stochastic independent increments, then it is a Markov process ([15]).

The choice of basis functions depends on the projection subspace span(x_1, ..., x_T) and, mostly, on the empirical relationship between the independent and dependent variables X_t and Y_t. The underlying idea behind the different approximation methods is to find an approximation for E[Y_{t+δ} | X = X_t] = m(X_t) that solves the following minimization problem. Let Y_{t+δ} be a random vector in a Hilbert space and (X_t)_{0≤t≤T} a vector in a closed Hilbert subspace. Additionally, assume m(X_t) is a given function of span(X_t). Solve the minimization problem below to find β̂ ∈ R^n such that

$$ \inf_{\hat{\beta}}\; E\Big[\big(Y_{t+\delta} - m(X_t, \hat{\beta})\big)^2\Big] \qquad (5) $$

where the function m(·) can be decomposed into basis functions as explained in Section 4.2.

The choice of functional form to represent the relationship between the vectors Y_t and X_t is an empirical call, mostly based on data analysis. The question is: which is the best choice of basis functions such that m(X): X → Y agrees with our simulation data? In practice, under a nested Monte Carlo simulation, basis functions are selected arbitrarily so that the resulting functional form of the conditional expectation function appropriately represents the relationship between the vector (Y_t)_n and the filtration vector (X_t)_n, where n is the dimension of the simulated vectors at time t.

From the Monte Carlo simulation engine, one can easily obtain the simulated variables V(ω, t) and V(ω, t + δ_IM), from which Δ²(ω, t, δ_IM) can be calculated at every t. The natural choice used in the literature is to take V(ω, t) as the regressor for Δ²(ω, t, δ_IM). This assumes that V(ω, t) is a good predictor for Δ²(ω, t, δ_IM), as in equation 4.
In what follows, the following notation will be used:

$$ X_t = V(\omega, t)_n = \big[V(\omega_1, t), V(\omega_2, t), ..., V(\omega_n, t)\big]^T $$

$$ Y_{t+\delta} = \Delta^2(\omega, t, \delta_{IM})_n = \big(V(\omega, t + \delta_{IM})_n - V(\omega, t)_n\big)^2 $$

where n is the number of simulated scenarios ω.

In order to choose the functional relationship between V(ω, t) and Δ²(ω, t, δ_IM), one must remember from the previous section that the only available data are the simulations of the outer scenarios, and from this information it is necessary to infer the relationship between V(ω, t) and Δ²(ω, t, δ_IM) on the inner scenarios. The assumption is that the functional relationship between V(·) and Δ²(·) is the same in the inner and outer scenarios. In other words, it is assumed that the model fitted with data coming from the outer-scenario simulation can be used to infer the inner scenarios. This assumption is based on the fact that the same SDE is used to simulate both the inner and the outer scenarios (provided the observer is at a given t).

From the graphs in [10] one can study the relationship between V(·) and Δ²(·) for different types of regression models. From the cloud of points, the functional form is rather inconclusive.

Figure 3: A comparison of scatter plots of portfolio value vs simulated initial margin and PnL for initial margin models. Taken from [10].

Lastly, a usual simplification is to assume that the set of basis functions does not change over time, but that the weights do. Hence, after a functional form of the conditional expectation has been chosen, the problem is to estimate the weights at every simulation time step t, conditional on the information given by the filtration up to that time.

In earlier sections it was stated that the analysis should be restricted to square-integrable functions with finite second moment (variance). In the available literature this point is assumed to hold without proof, especially with respect to the squared PnL random variable Y_t. A formal derivation of two conditions that a generic mark-to-market (MtM) random variable has to satisfy in order for Y_t to be square integrable is now provided. It will be assumed that the MtM is a real random variable, as must be the case in the financial setting being analysed. This section begins by stating the theorems on square-integrable variables that will be used in the rest of the derivations.
Theorem 5.1.
If a real random variable X has finite mean and variance, it is square integrable:

$$ E[X] < \infty,\ \mathrm{Var}[X] < \infty \implies E\big[|X|^2\big] < \infty $$

Proof.
The variance of X is by definition:

$$ \mathrm{Var}[X] = E[X^2] - E[X]^2 $$

If Var[X] < ∞, then all terms on the RHS must be finite. It is known that E[X] < ∞, therefore E[X]² < ∞. It must follow that E[X²] < ∞ and therefore E[|X|²] < ∞.

Remark.
Note that L² is formally defined as the space of square-integrable functions. Since a space of random variables is considered, from now on these two concepts will be used interchangeably.

Theorem 5.2.
Sums and scalar multiples of square-integrable variables are also square integrable.

Proof.
The theorem is proved for a generic sum Z = X + Y of two square-integrable random variables X and Y. Expressing the variance of Z:

$$ \mathrm{Var}(Z) = E\big[(X + Y)^2\big] - E[X + Y]^2 = E[X^2] + E[Y^2] - E[X]^2 - E[Y]^2 - 2E[X]E[Y] + 2E[XY] $$

In the above, the first two terms are finite, as X and Y are square integrable. The third and fourth terms are also finite, as X and Y must have finite mean. Finally, the Cauchy-Schwarz inequality can be used to infer that the last term must also be finite, as its absolute value is bounded by the square root of the product of two finite terms:

$$ \big|E[XY]\big| \le \sqrt{E[X^2]\, E[Y^2]} $$

Since the variance is finite, we can refer to Theorem 5.1 and conclude that Z must also be square integrable. The result extends trivially, via induction arguments, to multiple sums of square-integrable variables as well as scalar multiplications thereof.

Recall that the generic form of the squared PnL random variable Y_{t+δ_IM} is:

$$ Y_{t+\delta_{IM}} = \Delta^2(\omega, t, \delta_{IM})_n = \big(V(\omega, t + \delta_{IM})_n - V(\omega, t)_n\big)^2 = V^2(\omega, t + \delta_{IM})_n - 2\, V(\omega, t + \delta_{IM})_n\, V(\omega, t)_n + V^2(\omega, t)_n $$

It can be observed that, without loss of generality, Y_t ∼ O(V²(ω, t)_n). By virtue of Theorem 5.2, it can be concluded that if the generic squared MtM random variable V²(ω, t)_n is square integrable, then so is Y_t. In turn, by virtue of Theorem 5.1, this only leaves the need to state the conditions on V(ω, t)_n under which the variance of V²(ω, t)_n is finite. The variance of V²(ω, t)_n is:

$$ \mathrm{Var}\big(V^2(\omega, t)_n\big) = E\big[V^4(\omega, t)_n\big] - E\big[V^2(\omega, t)_n\big]^2 $$

This variance is finite if and only if the two terms on the RHS are finite, yielding the following two conditions on V(ω, t)_n.

Proposition 5.1. V(ω, t)_n must have finite mean and variance. This ensures that V(ω, t)_n is square integrable and therefore E[V²(ω, t)_n] < ∞.

Proposition 5.2. V(ω, t)_n must have finite fourth moment. This ensures that E[V⁴(ω, t)_n] < ∞.

Several methods can be used to estimate the conditional expectation function E[Y_{t+δ} | X = X_t] = m(X_t), depending on the properties of the underlying data. A good summary of the available methodologies for nested Monte Carlo problems can be found in [9]. The next sections discuss the most popular approaches used in the context of initial margin.

In the context of initial margin estimation, linear regression methods are proposed by [1], [8] and [10]. The problem setting is as described in equation 5. In the case where the Hilbert subspace F_t is a linear vector space, the estimate can be computed by solving a linear equation system. Linear maps include the ordinary least squares regression and polynomial functions. The projection is done onto a linearly independent set of random variables (X_t), where m(X, β̂) is a linear map of the form

$$ m(X_t, \hat{\beta}) = \sum_{i=0}^{P} \hat{\beta}_i\, \phi_i(X_t) = \hat{\beta}^T \phi(X_t) \qquad (6) $$

where the φ_i(X_t) are the (arbitrarily chosen) set of P linear basis functions. A linear problem like this can be solved by inverting matrices. If Y and X are jointly Gaussian distributed, then this is the setting of the well-known normal equations. For example, in [1] P = 2 and φ_i(X_t) = X_t^i (a numerical sketch is given below). In the non-linear methods that follow, the vector spaces (and thus the function m(X, β̂)) are non-linear, and the ordinary least squares approach does not work.
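Before moving on to the non-linear methods, the following is a minimal numerical sketch of the linear-map estimator of equation 6, using the basis of [1] (P = 2, φ_i(X_t) = X_t^i) and solving the normal equations directly. The simulated data are an illustrative assumption.

```python
import numpy as np

# Minimal sketch of the linear-map estimator in equation 6 with the basis of [1]:
# P = 2 and phi_i(X_t) = X_t^i, i.e. m(X_t, beta) = beta_0 + beta_1 X_t + beta_2 X_t^2.
# The coefficients are obtained by solving the normal equations ("inverting
# matrices"); the simulated data below are illustrative.

rng = np.random.default_rng(3)
x_t = rng.normal(size=4_000)                        # regressor: outer-scenario values X_t
y = 1.0 + 0.2 * x_t + 0.5 * x_t**2 + rng.normal(scale=0.3, size=x_t.size)  # response Y_{t+delta}

phi = np.column_stack([x_t**i for i in range(3)])   # design matrix [1, X_t, X_t^2]
beta_hat = np.linalg.solve(phi.T @ phi, phi.T @ y)  # normal equations: (Phi' Phi) beta = Phi' Y

m_hat = phi @ beta_hat                              # fitted conditional expectation m(X_t)
print(beta_hat)
```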
Among the vast number of non-parametric algorithms available, kernel regressions have been the most popular. In particular, the Nadaraya-Watson kernel [17] has been used by [2], [10] and [11] in the context of initial margin estimation.

In principle, Nadaraya-Watson proposed an algorithm to approximate the empirical joint density of two random variables. That is, instead of estimating a conditional expectation directly, the method estimates the entire joint PDF, from which the conditional expectation is derived. The method calculates a locally weighted average of the regressors given a choice of kernel function. The weights provide a smoothing feature to the fit, as they decrease smoothly to zero with distance from the given point. The derivation of the Nadaraya-Watson PDF estimator can be found in [14]. Under this method, the conditional expectation function can be approximated as

$$ E[Y \mid X = x] = \frac{\sum_{i=1}^{N} K_h(x, x_i)\, y_i}{\sum_{j=1}^{N} K_h(x, x_j)} \qquad (7) $$

where h is the bandwidth parameter that controls the degree of smoothing, x is the centroid, x_i, y_i are elements of the vectors X_t and Y_t, and K(·) is a kernel function satisfying:

1. symmetry: k(−u) = k(u);
2. positive semi-definiteness: k(u) ≥ 0 for all u;
3. normalization: ∫_{−∞}^{+∞} k(u) du = 1.

(Kernel functions can be seen as data "filters" or "envelopes", in the sense that they define proximity between points; see [14] for a derivation.)

While equation 7 is non-linear in the regressors, it can be shown that the NW estimator is a particular case of the local estimation problem. The function above can be re-expressed as being locally linear (i.e. piece-wise linear) in a certain set of β parameters. Moreover, it will be shown that the conditional expectation estimator proposed by kernel regressions can be framed as a particular case of the orthogonal projection problem in equation 5, with a modified objective function and a weighted norm.

The local minimization problem works by setting a small neighborhood around a given centroid x_0 and assuming a parametric function within the selected region. The process is equivalent to partitioning the support of the regressor space, producing smooth step functions. For every time step t of the Monte Carlo simulation, consider a neighborhood of x_0 and assume a local functional relationship m(·) between the Y and X variables. If m(·) is continuous and smooth, it should be approximately constant over a small enough neighborhood (by Taylor expansion). While linear regressions assume a linear relationship between the variables and minimize a global sum of squared errors, kernel regressions assume a linear relationship only on a small neighborhood, focusing on the modelling of piece-wise areas of the observed variables and repeating the procedure for all centroids. The trick is to determine the size of the neighborhood so as to trade off error bias against model variance. The local minimization problem is now as follows: given a random vector Y_{t+δ} ∈ H and a deterministic vector X_t ⊂ H, and given the functions K_h(·) and m(·), find β̂ such that

$$ \inf_{\hat{\beta}}\; E\Big[ K_h(x_0, X_t)\cdot\big(Y_{t+\delta} - m(X_t, \hat{\beta})\big)^2 \Big] $$

In the local minimization problem, the model for m(x) is a linear expansion of basis functions as in equation 6:

$$ m(X_t, \hat{\beta}) = \sum_{i=0}^{P} \hat{\beta}_i\, \phi_i(X_t) $$

Let p be the order of the polynomial φ(X_t); some examples of m(·) are:

• if p = 0, then m(X_t, β̂) = β_0, i.e. the constant function (intercept only). This results in the NW estimate in equation 7.
• if p = 1, then m(X_t, β̂) = β_0 + β_1(x_i − x_0), with both x_i and the centroid x_0 ∈ X_t, which gives a local linear regression.

The β̂ are solved for by weighted least squares. As in the linear case, the vector W(Y_{t+δ} − m(X_t, β̂)) should be orthogonal to all other vectors of X, but now the relevant projection is orthogonal with respect to a new, weighted inner product:

$$ \big\langle W\,(Y_{t+\delta} - m(X_t, \hat{\beta})),\; m(X_t, \hat{\beta}) \big\rangle = 0 \qquad (8) $$

where W is the sequence of weights W(X) = diag(K_h(x_0, X_i))_{N×N}. The solution is given by

$$ \hat{\beta} = \big(X^T W X\big)^{-1}\big(X^T W Y\big) \qquad (9) $$

and we can obtain the same expression for m(x_0) as in equation 7:

$$ m(x_0) = \hat{\beta}_0 = \Big[\big(X^T W X\big)^{-1} X^T W\Big]\, Y = \sum_{i=1}^{N} w_i(x_0)\, y_i $$

The task is now to choose the K_h(·) function. One common choice in the literature is the radial basis functions, in which each basis function depends only on the magnitude of the radial distance (often Euclidean) from a centroid x_0 ∈ X. The most cited in the literature is the Gaussian kernel:

$$ k(x, x_0) = \exp\Big( -\frac{\lVert x - x_0 \rVert^2}{\sigma^2} \Big) $$
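The following is a minimal sketch of the Nadaraya-Watson estimator of equation 7 with a Gaussian kernel; the bandwidth and the simulated data are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the Nadaraya-Watson estimator in equation 7 with a Gaussian
# kernel. The bandwidth h and the simulated data are illustrative assumptions.

rng = np.random.default_rng(4)
x = rng.uniform(-2.0, 2.0, size=2_000)                     # regressors X_t
y = np.sin(2.0 * x) + rng.normal(scale=0.3, size=x.size)   # responses Y_{t+delta}

def nw_estimate(x0, x, y, h=0.2):
    """Locally weighted average: sum_i K_h(x0, x_i) y_i / sum_j K_h(x0, x_j)."""
    weights = np.exp(-((x0 - x) ** 2) / (2.0 * h**2))      # Gaussian kernel K_h
    return np.sum(weights * y) / np.sum(weights)

grid = np.linspace(-2.0, 2.0, 9)
m_hat = np.array([nw_estimate(x0, x, y) for x0 in grid])
print(np.round(m_hat, 3))   # fitted conditional expectation E[Y | X = x0] on the grid
```

The bandwidth h plays exactly the bias-variance role described above: a small h fits local detail at the cost of noisier estimates, while a large h over-smooths.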
Deep Neural Networks

Deep Neural Networks (DNN) have been proposed to estimate initial margin by [12] and [18]. Deep NN is an umbrella term for several related methodologies; the focus of this paper is on the methods proposed in the initial margin literature, in particular the feed-forward neural network as implemented in [18].

A DNN can be thought of as a nonlinear generalization of a linear model, where the parameters (called weights) have to be "learned" from the data. The idea is to create a nonlinear function from a stack of linear functions. In a feed-forward DNN, each layer projects onto the next layer of neurons (computational nodes). Each linear regression is called a neuron. Each neuron takes the outputs from the previous layer as its input, applies a linear transformation with weights and bias, and then applies an activation function to the result. In order to find the weights for each layer, the total sum of errors between the final output and the real observation for each example, as in equation 5, is minimized by performing an orthogonal projection of the dependent variables onto a non-linear subspace.

The theoretical basis for the approximation of a function by neural networks is given by the Universal Approximation Theorem (UAT) [19]. This theorem states that any continuous function on a compact subset of R^n can be approximated by a feed-forward network with a single hidden layer containing a finite number of neurons. However, in practice more than one layer may be needed to improve the approximation. Recently, in [20], it has been shown that any continuous function on the d_in-dimensional unit cube can be approximated to arbitrary precision by a DNN with ReLU activations on hidden layers of width exactly d_in + 1. The ReLU activation function is ReLU(x) = max(0, x). The activation function for the output layer is linear.

Following [18] and [20], assume that the feed-forward neural network has an input layer, L hidden layers and an output layer. An example architecture diagram for this general neural network is provided in figure 4.

Figure 4: Feed-forward neural network architecture.

Then the functional form for the approximation of the conditional expectation function is given by

$$ m(X_t, W, b) = \hat{Y}_t = \sum_{i=1}^{N_L} W_{i,L+1}\, A_{i,L} = \sum_{i=1}^{N_L} W_{i,L+1} \cdot \phi_L\Big( \sum_{j=1}^{N_{L-1}} W_{ij,L}\, A_{j,L-1} + b_{i,L} \Big) $$

where, for each hidden layer l = 1, ..., L, φ_l is an activation function, A_{i,l} is the activation output of the i-th neuron, W_l are the weights and b_l is the bias. Here, without loss of generality and for ease of comparison, the bias of the output layer, b_{L+1}, is assumed to be incorporated into W_{L+1}. Note that the weights W_1 are applied to the input layer X_t, and A_0 = X_t. The authors in [18] use the classic ReLU function as the activation function for all hidden layers φ_l. As in [20], the activation function for the output layer, φ_{L+1}, is linear.

Except for the input layer, each node i in layer l is a neuron (function) that performs a simple computation:

$$ A_{i,l} = \phi_l\Big( \sum_j W_{ij,l}\, A_{j,l-1} + b_{i,l} \Big) $$

where A_{i,l} is the output and A_{j,l-1} is the j-th input.
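To make the forward pass above concrete, here is a minimal numpy sketch of a feed-forward network with ReLU hidden layers and a linear output, mirroring m(X_t, W, b). The layer widths and the random weights are illustrative assumptions; in a real application W and b would be learned by minimizing the squared error of equation 5 over the simulated training set.

```python
import numpy as np

# Minimal sketch of the feed-forward map m(X_t, W, b): ReLU hidden layers and a
# linear output neuron. Layer widths and random weights are illustrative; in
# practice W and b are learned by minimizing the squared error of equation 5.

rng = np.random.default_rng(5)

def relu(z):
    return np.maximum(0.0, z)

def init_layer(n_in, n_out):
    return rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out)

def forward(x, layers):
    """Forward pass: A_0 = X_t, A_l = relu(A_{l-1} W_l + b_l), output layer is linear."""
    a = x
    for w, b in layers[:-1]:
        a = relu(a @ w + b)          # hidden layers with ReLU activation
    w_out, b_out = layers[-1]
    return a @ w_out + b_out         # linear output layer: hat{Y}_t

d_in, widths = 3, [16, 16, 1]        # input dimension and layer widths (illustrative)
sizes = [d_in] + widths
layers = [init_layer(sizes[i], sizes[i + 1]) for i in range(len(widths))]

x_t = rng.normal(size=(5, d_in))     # five sample regressor vectors X_t
print(forward(x_t, layers).ravel())  # network output hat{Y}_t for each sample
```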
This section summarises the theoretical formulations of the three approaches considered in the preceding subsections and lists their advantages and disadvantages from the practitioners' point of view.

Method: Linear Maps
Functional form m(X_t, β̂): Σ_{i=0}^{N} β̂_i · φ_i(X_t), with β̂ ∈ R^N, φ_i linear basis functions and N the degree of the polynomial.
Pros: The same few calibrated parameters are used across all simulation scenarios and time steps. Computationally cheap compared to the two other methods.
Cons: Large oscillations in the calibrated parameters can be observed due to the daily change in the IM value at t. In a distributed grid setup, a small simulation run is required to pre-compute the β̂_i prior to the main MC simulation.

Method: Kernel Regressions (NW)
Functional form m(X_t, β̂): Σ_{i=0}^{N} β̂_i · φ_i(X_t), with β̂_i = K_h(x_0, x_i) / Σ_{j=1}^{N} K_h(x_0, x_j), φ_i = y_i, N the number of neighbourhood pairs (x_i, y_i) and K_h the kernel function.
Pros: Computationally cheap. Better empirical fit to the data than linear regression. Ensures functional smoothness.
Cons: Assumes a parametric functional form. It can be challenging to obtain sensitivities to model parameters and inputs.

Method: Feed-Forward Neural Networks
Functional form m(X_t, β̂): Σ_{i=0}^{N} β̂_i · φ_i(X_t), with β̂_i = W_{i,L+1}, the weight applied to the output of neuron i in the final hidden layer, φ_i(X_t) = A_{i,L}, the activation of neuron i in the final hidden layer, and N the number of neurons in the final hidden layer (plus 1 if incorporating the bias).
Pros: Once trained, the predicted IM can be calculated extremely fast, because it only involves small-scale matrix multiplication and is done at the portfolio level. The β̂_i only need to be updated every quarter, compared to daily in the linear-maps method.
Cons: Training requires portfolio sensitivities for each MC path and time step, per the SIMM model requirements. Data generation for training is computationally expensive and needs to be done offline.

References

[1] Anfuso, Fabrizio and Aziz, Daniel and Giltinan, Paul and Loukopoulos, Klearchos. A Sound Modelling and Backtesting Framework for Forecasting Initial Margin Requirements. (January 11, 2017). Available at SSRN: https://ssrn.com/abstract=2716279
[2] Andersen, L. and Pykhtin, M. and Sokol, A. Credit Exposure in the Presence of Initial Margin. (July 22, 2016). Available at SSRN: https://ssrn.com/abstract=2806156
[3] Andersen, Leif and Pykhtin, M. (editors). Margin in Derivatives Trading. (2018). Risk Books.
[4] Andersen, Leif B.G. and Dickinson, Andrew Samuel. Funding and Credit Risk with Locally Elliptical Portfolio Processes: An Application to CCPs. (April 11, 2018). Available at SSRN: https://ssrn.com/abstract=316115
[5] Basel Committee on Banking Supervision and International Organization of Securities Commissions. Margin Requirements for Non-Centrally Cleared Derivatives. D475, July 2019.
[6] Green, Andrew David and Kenyon, Chris. MVA: Initial Margin Valuation Adjustment by Replication and Regression. (January 12, 2015). Available at SSRN: https://ssrn.com/abstract=2432281
[7] Garcia Trillos, Camilo and Henrard, Marc and Macrina, Andrea. Estimation of Future Initial Margins in a Multi-Curve Interest Rate Framework. (February 3, 2016). Available at SSRN: https://ssrn.com/abstract=2682727
[8] Caspers, Peter and Giltinan, Paul and Lichters, Roland and Nowaczyk, Nikolai. Forecasting Initial Margin Requirements - A Model Evaluation. (February 3, 2017). Available at SSRN: https://ssrn.com/abstract=2911167
[9] Carriere, J. Valuation of the early-exercise price of options using simulations and nonparametric regression. Insurance: Mathematics and Economics 19, 19-30 (1996).
[10] Chan, Justin and Zhu, Shengyao and Tourtzevitch, Boris. Practical Approximation Approaches to Forecasting and Backtesting Initial Margin Requirements. (November 29, 2017). Available at SSRN: https://ssrn.com/abstract=3079782
[11] Dahlgren, Martin. Forward Valuation of Initial Margin in Exposure and Funding Calculations. (2018). Margin in Derivatives Trading, ch. 9. Risk Books.
[12] Hernandez, Andres. Estimating Future Initial Margin with Machine Learning. (March 2017). PWC.
[13] ISDA SIMM Methodology, version 2.1. Effective Date: December 2018.
[14] Kuhn, Steffen. Kernel Regression by Mode Calculation of the Conditional Probability Distribution. Available at: https://arxiv.org/pdf/0811.3499.pdf
[15] Luenberger, David. Optimization by Vector Space Methods. (1969). Wiley.
[16] Longstaff, Francis A. and Schwartz, Eduardo S. Valuing American Options by Simulation: A Simple Least-Squares Approach. (October 1998). Available at SSRN: https://ssrn.com/abstract=137399
[17] Nadaraya, E. A. On Estimating Regression. Theory Probab. Appl., 9(1), 141-142.
[18] Ma, Xun and Spinner, Sogee and Venditti, Alex and Li, Zhao and Tang, Strong. Initial Margin Simulation with Deep Learning. (March 21, 2019). Available at SSRN: https://ssrn.com/abstract=3357626
[19] Sodhi, Anurag. American Put Option pricing using Least squares Monte Carlo method under Bakshi, Cao and Chen Model Framework (1997) and comparison to alternative regression techniques in Monte Carlo. (August 2018). Available at https://arxiv.org/abs/1808.02791
[20] Hanin, Boris and Sellke, Mark. Approximating Continuous Functions by ReLU Nets of Minimal Width. (2018). arXiv preprint arXiv:1710.11278.
[21] Albanese, Claudio and Gimonet, Guillaume and Pietronero, Giacomo. Coherent global market simulations and securitization measures for counterparty credit risk. (2011). Quantitative Finance, 11, 1-20.
[22] Albanese, Claudio and Caenazzo, Simone and Crepey, Stephane.