[PDF] A learning scheme by sparse grids and Picard approximations for semilinear parabolic PDEs

Abstract

Relying on the classical connection between Backward Stochastic Differential Equations (BSDEs) and non-linear parabolic partial differential equations (PDEs), we propose a new probabilistic learning scheme for solving high-dimensional semi-linear parabolic PDEs. This scheme is inspired by the approach coming from machine learning and developed using deep neural networks in Han and al. [32]. Our algorithm is based on a Picard iteration scheme in which a sequence of linear-quadratic optimisation problem is solved by means of stochastic gradient descent (SGD) algorithm. In the framework of a linear specification of the approximation space, we manage to prove a convergence result for our scheme, under some smallness condition. In practice, in order to be able to treat high-dimensional examples, we employ sparse grid approximation spaces. In the case of periodic coefficients and using pre-wavelet basis functions, we obtain an upper bound on the global complexity of our method. It shows in particular that the curse of dimensionality is tamed in the sense that in order to achieve a root mean squared error of order {\epsilon}, for a prescribed precision {\epsilon}, the complexity of the Picard algorithm grows polynomially in {\epsilon}^{-1} up to some logarithmic factor |log({\epsilon})| which grows linearly with respect to the PDE dimension. Various numerical results are presented to validate the performance of our method and to compare them with some recent machine learning schemes proposed in Han and al. [20] and Hur\'e and al. [37].

Full PDF

AA learning scheme by sparse grids and Picard approximationsfor semilinear parabolic PDEs

J.-F. Chassagneux ∗ , J. Chen † , N. Frikha ‡ , C. Zhou § .February 25, 2021 Abstract

Relying on the classical connection between Backward Stochastic Diﬀerential Equations(BSDEs) and non-linear parabolic partial diﬀerential equations (PDEs), we propose a newprobabilistic learning scheme for solving high-dimensional semilinear parabolic PDEs. Thisscheme is inspired by the approach coming from machine learning and developed using deepneural networks in Han and al. [32]. Our algorithm is based on a Picard iteration scheme inwhich a sequence of linear-quadratic optimisation problem is solved by means of stochasticgradient descent (SGD) algorithm. In the framework of a linear speciﬁcation of the approxi-mation space, we manage to prove a convergence result for our scheme, under some smallnesscondition. In practice, in order to be able to treat high-dimensional examples, we employsparse grid approximation spaces. In the case of periodic coeﬃcients and using pre-waveletbasis functions, we obtain an upper bound on the global complexity of our method. It showsin particular that the curse of dimensionality is tamed in the sense that in order to achievea root mean squared error of order ε , for a prescribed precision ε , the complexity of thePicard algorithm grows polynomially in ε ´ up to some logarithmic factor | log p ε q| whichgrows linearly with respect to the PDE dimension. Various numerical results are presentedto validate the performance of our method and to compare them with some recent machinelearning schemes proposed in Han and al. [20] and Huré and al. [37]. In the present work, we are interested in the numerical approximation in high dimension of thesolution to the semilinear parabolic PDE " B t u p t, x q ` L u p t, x q ` f p u p t, x q , σ J p x q ∇ x u p t, x qq “ , p t, x q P r , T q ˆ R d ,u p T, x q “ g p x q , x P R d (1.1) ∗ Université de Paris, Laboratoire de Probabilités, Statistiques et Modélisation, F-75013 Paris, France. Email:[email protected]. † Université de Paris, Laboratoire de Probabilités, Statistiques et Modélisation, F-75013 Paris, France. Partof the work was done during a visit at NUSRI, Suzhou, China, whose hospitality is greatly appreciated. Email:[email protected] ‡ Université de Paris, Laboratoire de Probabilités, Statistiques et Modélisation, F-75013 Paris, France. Email:[email protected] § Department of Mathematics, City University of Hong Kong, Kowloon Tong, Hong Kong. On leave fromDepartment of Mathematics, National University of Singapore. Research supported by NSFC Grant No. 11871364as well as Singapore MOE (Ministry of Educations) AcRF Grants R-146-000-271-112 and R-146-000-284-114.Email: [email protected] a r X i v : . [ m a t h . NA ] F e b here f : R ˆ R d Ñ R , g : R d Ñ R are measurable functions and L is the inﬁnitesimal generatorof the forward diﬀusion process with dynamics d X t “ b p X t q d t ` σ p X t q d W t (1.2)and deﬁned, for a smooth function ϕ , by L ϕ p t, x q : “ b p x q ¨ ∇ x ϕ p t, x q `

12 Tr rp σσ J qp x q ∇ x ϕ p t, x qs . (1.3)Here, W is a d -dimensional brownian motion deﬁned on a complete probability space p O , A , P q , b : R d Ñ R d and σ : R d Ñ M d are measurable functions, M d being the set of d ˆ d matrix.The initial condition X is a square integrable random variable independent from the BrownianMotion W . We denote by p F t q ď t ď T the ﬁltration generated by W and X , augmented with P null sets.Developing eﬃcient algorithms for the numerical approximation of high-dimensional non-linear PDEs is a challenging task that has attracted considerable attention from the researchcommunity in the last two decades. We can quote various approaches (limiting to the "stochastic"ones) that have proven to be eﬃcient in a high dimensional setting: branching methods, see e.g.[35], machine learning methods (especially using deep neural networks), see e.g. [32], and fullhistory recursive multilevel Picard method (abbreviated MLP in the literature) see e.g. [38]. Thisis a very active ﬁeld of research, we refer to the recent survey papers [33, 4] for more referencesand an overview of the numerical and theoretical results available. We focus now more on onestream of research which uses the celebrated link between semilinear parabolic PDEs of theform (1.1) and BSDEs. This connection, initiated in [45], reads as follows: denoting by u aclassical solution to (1.1), p u p t, X t q , σ J p X t q ∇ x u p t, X t qq “ p Y t , Z t q where the pair p Y , Z q is the R ˆ R d -valued and p F t q -adapted process solution to the BSDE with dynamics Y t “ g p X T q ` ż Tt f p Y s , Z s q d s ´ ż Tt Z s ¨ d W s , ď t ď T , (1.4)so that, the original problem boils down to the numerical approximation of the above stochasticsystem. Various strategies have been used to numerically approximate the stochastic system p X t , Y t , Z t q t Pr ,T s . The most studied one is based on a time discretization of (1.4) leading toa backward programming algorithm to approximate p Y, Z q , as exposed in e.g [10, 49] (see thereferences therein for early works). This involves computing a sequence of conditional expecta-tions and various methods have been developed: Malliavin calculus based methods [10, 18, 36],optimal quantization methods [1, 2, 44], cubature methods [16, 17, 14] and (linear) regressionmethods, see among others [26, 28, 27]. It is acknowledged that such approaches will be feasi-ble for problems up to dimension 10. This limitation is a manifestation of the so-called “curseof dimensionality”. Recently, non-linear regression methods using deep neural networks weresuccesfuly combined with this approach and proved to be capable of tackling problems in highdimension [37]. However, other strategies have been introduced in the last ﬁve years or so toapproximate (1.4) trying to adopt a "forward point of view". Relying on Wiener chaos expansionand Picard iteration, [11, 25] introduced a method that notably works in non-Markovian settingbut is still impacted by the curse of dimension. A key step forward has been realized by theso called deep BSDEs solver introduced in [32]. Interpreting the resolution of a BSDEs as anoptimisation problem, it relies on the expressivity of deep neural network and well establishedSGD algorithms to show great performance in practice. More precisely, in this approach, the2 -process is now interpreted as a forward SDE controlled by the Z -process. Then, an Euler-Maruyama approximation scheme is derived in which the derivative of the solution u appearingin the non-linear function f (through the Z -process) is approximated by a multi-layer neural net-work. The optimal weights are then computed by minimizing the mean-squared error betweenthe value of the approximation scheme at time T and a good approximation of the target g p X T q using stochastic gradient descent algorithms. Again, this kind of deep learning technique seemsto be very eﬃcient to numerically approximate the solution to semi-linear parabolic PDEs inpractice. However a complete theory concerning its theoretical performance is still not achieved[4]. One important observation is that, due to highly non-linear speciﬁcation, the optimisationproblem that has to be solved in practice, has no convexity property. The numerical proceduredesigned can only converge to local minima, whose properties (with respect to the approximationquestion) are still not completely understood.Inspired by this new forward approach, we introduce here an algorithm which is shown toconverge to a global minimum. This, of course, comes with a price. First, we move from the deepneural networks approximation space to a more classical linear speciﬁcation of the approximationspace. However, due the non-linearity in the BSDE driver, the global optimisation problem tobe solved is still non-convex. To circumvent this issue, we employ a Picard iteration procedure .The overall procedure becomes then a sequence of linear-quadratic optimisation problems whichare solved by a SGD algorithm. Our ﬁrst main result is a control of the global error betweenthe implemented algorithm and the solution to the BSDE which notably shows the convergenceof the method under some smallness conditions, see Theorem 2.1. In particular, contrary to[32, 34] or [37], our result takes into account the error induced by the SGD algorithm. In ournumerical experiments, we rely on sparse grid approximation spaces which are known to be well-suited to deal with high-dimensional problems. Under the framework of periodic coeﬃcients, weestablish as our second main result, an upper bound on the global complexity for our implementedalgorithm , see Theorem 3.1. We notably prove that the curse of dimensionality is tamed in thesense that the complexity is of order ε ´ p | log p ε q| q p d q , where p is a constant which does not dependon the PDE dimension and d ÞÑ q p d q is an aﬃne function. We also demonstrate numerically theeﬃciency of our methods in high dimensional setting.The rest of the paper is organized as follows. In Section 2, we ﬁrst recall the deep BSDEs solver of [32] but adapted to our framework. Namely, we use a linear speciﬁcation of the approximationspace together with SGD algorithms. For sake of clarity, we denote this method: the directalgorithm . Then, we introduce our new numerical method : the Picard algorithm . We presentour main assumptions on the coeﬃcients and state our main convergence results. In Section 3,we use sparse grid approximation with the direct and Picard algorithms , using two types ofbasis functions: pre-wavelet [9] and modiﬁed hat function [24]. We discuss their numericalperformances in practice through various test examples. We also compare them with some deeplearning techniques [32, 37]. We also state our main theoretical complexity result. Section 4 isdevoted to the theoretical analysis required to establish our main theorems: all the proofs arecontained in this section. Finally, we give a complete list of the algorithm parameters that havebeen used to obtain the numerical results in Appendix.

Notation:

Elements of R q are seen as column vectors. For x P R q , x i is the i th componentand | x | corresponds to its Euclidian norm, x ¨ y denotes the scalar product of x and y P R q . M q is the set of q ˆ q real matrices. We denote by e (cid:96) the (cid:96) th vector of the standard basis of R q . The vector p , . . . , q J is denoted , I d is the d ˆ d identity matrix. We use the bold face3otations l P N d for multidimensional indices l : “ p l , ¨ ¨ ¨ , l d q with (index) norms denoted by | l | p : “ p ř di “ l pi q { p and | l | : “ max ď i ď d | l i | . For later use, for a positive integer k , we introducethe set J d,k of multidimensional indices l P N d satisfying | l | ď k . For a ﬁnite set A , we denoteby | A | its cardinality.For a function f : R d Ñ R , we denote by B x l f the partial derivative function with respect to x l , ∇ f denotes the gradient function of f , valued in R d . We also use ∇ f “ pB x i ,x j f q ď i,j ď d todenote the Hessian matrix of f , valued in M d . For a suﬃciently smooth real-valued function f deﬁned in R d , we let D l f “ B l x ¨ ¨ ¨ B l d x d f denote the diﬀerentiation operator with respect to themulti-index l P N d . For a ﬁxed positive integer k and a function f deﬁned on an open domain U Ă R d , we deﬁne its Sobolev norm of mixed smoothness } f } H kmix p U q : “ ´ ÿ l P J d,k } D l f } L p U q ¯ (1.5)where the derivative D l f in the above formula has to be understood in the weak sense and fora map g : U ÞÑ R , } g } L p U q : “ ş U | g p x q| dx . The Sobolev space of mixed smoothness H kmix p U q isthen deﬁned by H kmix p U q : “ ! f P L p U q : } f } H kmix p U q ă 8 ) . (1.6)For a positive integer q , the set H q is the set of progressively measurable processes V deﬁnedon the probability space p O , A , P q with values in R q and satisfying E ”ş T | V t | d t ı ă `8 . Theset S q is the set of adapted càdlàg processes U deﬁned on the probability space p O , A , P q withvalues in R q and satisfying E ” sup t Pr ,T s | U t | ı ă `8 . We also deﬁne B : “ S ˆ H d . direct and Picard algorithms

We describe here the numerical methods studied in this work. The ﬁrst one, the direct algorithm is an adaptation of the

Deep BSDEs solver introduced in [20] to the linear speciﬁcation of theparametric space that we use here. The second one, the

Picard algorithm , is new and is the maincontribution of our work. We also give here the main general convergence results related to the

Picard algorithm . The complexity analysis is postponed to the next section.The methods we introduce below have for goal to compute an approximation of the valuefunction u , satisfying the PDE (1.1), at the initial time on a given domain or at a speciﬁc point.This lead us to introduce the following setup for the initial value X : Assumption 2.1.

One of the two following cases holds:(i) The law of X has compact support and is absolutely continuous with respect to the Lebesguemeasure.(ii) The law of X is a Dirac mass at some point x P R d . Most of our numerical applications are done in the setting of Assumption 2.1(ii), see nextsection. Then, obviously, the approximation of the value function is known only at the point x at the initial time. However, one should note that it could also be interesting to work in thesetting of Assumption 2.1(i) if one seeks to obtain an approximation of the whole value function(on the support of X ) at the initial time. 4 .1 Assumptions on the coeﬃcients and connection with the semilinear PDE In this subsection, we ﬁrst give the assumptions on the BSDE coeﬃcients that will be requiredfor our approach and then recall the connection with semilinear PDEs. In particular, under theseassumptions, the underlying PDE admits a unique classical solution. Under an additional regu-larity assumption on the coeﬃcients, the unique solution to the PDE admits smooth derivativesof enough order which are controlled on the whole domain by known parameters. This addi-tional regularity, together with a periodicity assumption, will be used to obtain our theoreticalcomplexity result, see Section 3.1.2. For sake of simplicity, it is also assumed that the coeﬃcients b, σ and f do not depend on time and that f does not depend on the space variable. Assumption 2.2. (i) The coeﬃcients b , σ , f and g are bounded, Lipschitz-continuous withrespect to all variables and g P C ` α p R d q , for some α P p , s . We will denote by L theLipschitz-modulus of the map f .(ii) The coeﬃcient a “ σσ J is uniformly elliptic, that is, there exists λ ě such that for any p x, ζ q P p R d q it holds λ ´ | ζ | ď a p x q ζ ¨ ζ ď λ | ζ | . (2.1) (iii) For any p i, j q P t , ¨ ¨ ¨ , d u , the coeﬃcients b i , σ i,j , g belong to C d ` p R d , R q and f belongs C d ` p R ˆ R d , R q . Moreover, their derivatives of any order up and equal to d ` arebounded and Lipschitz continuous.(iv) The coeﬃcients b , σ , f and g are periodic functions. From now on, we will say that Assumption 2.2 holds if and only if Assumption 2.2 (i), (ii), (iii)and (iv) are satisﬁed.Under Assumption 2.2 (i) and (ii), it is known (see e.g. [45]) that for any square integrableinitial condition X there exists a unique couple p Y , Z q P B satisfying equation (1.4) P -a.s.Moreover, from [40] Chapter VI and [23] Chapter 7, the PDE (1.1) admits a unique solution u P C , pr , T s ˆ R d , R q satisfying: there exists a positive constant C , depending on T and theparameters appearing in Assumption 2.2 (i) and (ii), such that for all p t, x q P r , T s ˆ R d | u p t, x q| ` |B t u p t, x q| ` | ∇ x u p t, x q| ` | ∇ x u p t, x q| ď C. From [48, 46, 47], the semilinear PDE (1.1) and the BSDE (1.2)-(1.4) are connected, namely,for all p t, x q P r , T q ˆ R d , it holds Y t “ u p t, X t q , Z t “ σ J p X t q ∇ x u p t, X t q . Finally, under Assumption 2.2, still from [40] Chapter IV and [23] Chapter III, the uniquesolution u to the PDE (1.1) is smooth, namely, setting v i “ p σ J ∇ x u q i , ď i ď d , for any l P J d, , D l v i p t, . q exists and is bounded. In particular, there exists a positive constant, depending on T and the known parameters appearing in Assumption 2.2 (i), (ii) and (iii) such that for all l P J d, and all p t, x q P r , T s ˆ R d , max ď i ď d | D l v i p t, x q| ď C. (2.2)5 .2 Direct algorithm

We ﬁrst consider the approximation of the forward component (1.2). Given an equidistant grid π : “ t t “ ă ¨ ¨ ¨ ă t n ă ¨ ¨ ¨ ă t N “ T u of the time interval r , T s , t n “ nh , n “ , ¨ ¨ ¨ N , withtime-step h : “ T { N , we denote by W : “ p W t n q ď n ď N the discrete-time version of the Brownianmotion W and deﬁne ∆ W n “ W t n ` ´ W t n , ď n ď N ´ .We then introduce a standard Euler-Maruyama approximation scheme of X on π deﬁned by X “ X and for ď n ď N ´ , X t n ` “ X t n ` b p X t n q h ` σ p X t n q ∆ W n . (2.3)Before discussing the approximation of the backward component, we here state an importantlemma concerning the existence of two-sided Gaussian estimates for the transition density ofthe above Euler-Maruyama approximation scheme. These estimates will prove very useful inthe sequel, when studying the theoretical complexity of the Picard algorithm . We denote by p π p t i , t j , x, ¨q the transition density function of the Euler-Maruyama scheme starting from thepoint x at time t i and taken at time t j , with ď t i ă t j ď T . We refer e.g. to [42] for a proof ofthe following result. Lemma 2.1.

Assume that the coeﬃcients b and σ satisﬁes Assumption 2.2 (i) and (ii). Thereexist constants c : “ c p λ , b, σ, d q P p , s and C : “ C p T, λ, b, σ, d q ě such that for any p x, x q Pp R d q and for any ď i ă j ď N C ´ p p c p t j ´ t i q , x ´ x q ď p π p t i , t j , x, x q ď C p p c ´ p t j ´ t i q , x ´ x q (2.4) where for any p t, x q P p ,

8q ˆ R d , p p t, x q : “ p {p πt qq d { exp p´| x | {p t qq . We now turn to the approximation of the backward component (1.4). We ﬁrst introducea linear parametrization of the process Z . For each discrete date t n P π zt T u , we consider aparametric functional approximation space V zn generated by a set of basis functions p ψ kn q ď k ď K zn ,for ď n ď N ´ and some positive integer K zn . The measurable functions ψ kn : R d ÞÑ R haveat most polynomial growth. Note that, for n ě , the speciﬁcation of the basis function coulddepend on the time t n , but in order to simplify the discussion, we let the number of basis functionsbe the same and set to K . Namely, K zn “ K , for all n ě . For n “ , the speciﬁcation willdepend on the nature of X : if Assumption 2.1(i) holds, then we will set K z “ K ; if Assumption2.1(ii) holds, then we simply set K z “ and ψ is a function satisfying ψ p x q “ . For latteruse, we set: s K z : “ N ´ ÿ n “ K zn “ K z ` p N ´ q K . (2.5)Remark that there is no need to introduce an approximation space at T since the function g isexplicitly known.For ď n ď N ´ , each component of p σ J ∇ x u qp t n , ¨q should be approximated in an optimalway by a function in V zn . The process Z appearing in the dynamics of the controlled process Y ,that has to be optimized, is parametrized using the spaces p V zn q ď n ď N ´ . Namely, the R d -valuedrandom variable Z t n will be approximated by K zn ÿ k “ ψ kn p X t n q z n,k , (2.6)6here z n,k P R d for any ď k ď K zn and ď n ď N ´ . Importantly, we denote, for later use, z J : “ pp z n,k q J q ď n ď N ´ , ď k ď K zn so that z P R d s K z . Deﬁnition 2.1 (Class of discrete control process) . We let H π,ψ be the set of discrete controlprocess Z deﬁned by: for z P R d s K z , Z t n : “ K zn ÿ k “ ψ kn p X t n q z n,k , for ď n ď N ´ , (2.7) and where we set Z t “ Z t n , t n ď t ă t n ` , ď n ď N ´ with the convention Z T “ . Remark 2.1.

We insist on the fact that for a given Z P H π,ψ , the R d -valued random variable Z t n depends only on z n , for any ď n ď N ´ . The approximation space we consider is aﬁnite dimensional vector space. This notably diﬀers from the recent works [3, 31, 37] where anon-linear approximation using neural network is used. The dynamics of Y given by (1.4), in turn, has to be approximated. As previously mentionedin the introduction, we ﬁrst rewrite it in forward form as follows Y t “ Y ´ ż t f p Y s , Z s q ds ` ż t Z s ¨ d W s , t P r , T s , with Y “ u p , X q . The main goal of the algorithm is to obtain a good estimate of u p , . q on the support of X . Inorder to do so, we deﬁne the starting point of Y , standing for the approximation of Y , by usinga linear functional approximation space denoted V y , namely Y : “ K y ÿ k “ ψ ky p X q y k with y P R K y . (2.8)The speciﬁcation of V y will depend also on the nature of X . Namely, if Assumption 2.1(i) holds,then we set K y “ K , while if Assumption 2.1(ii) holds, then we simply set K y “ and ψ y is afunction satisfying ψ y p x q “ .Then, employing a standard Euler scheme on π together with the above approximation Z P H π,ψ of the control process Z , we are naturally led to consider the following approximationscheme for Y . Deﬁnition 2.2. i) Given u “ p y , z q P R K y ˆ R d ¯ K z , we denote by Z u P H π,ψ the discrete controlprocess as given in (2.7) . Then, the discrete controlled process Y u is deﬁned as follows.(a) Initialization: Set Y u “ K ÿ k “ ψ ky p X q y k . (2.9) (b) Discrete version: for any ď n ď N ´ : Y u t n ` “ Y u t n ´ hf p Y u t n , Z u t n q ` Z u t n ¨ ∆ W n (2.10) where we recall that ∆ W n “ W t n ` ´ W t n . c) Continuous version: for any ď n ď N ´ and any t n ď t ă t n ` , Y u t “ Y u t n ´ p t ´ t n q f p Y u t n , Z u t n q ` Z u t n ¨ p W t ´ W t n q (2.11) ii) Based on the previous step, we deﬁne B π,ψ Ă B as the set of processes p Y u , Z u q , with Z u P H π,ψ , Y u deﬁned as above for some u P R K y ˆ R d ¯ K z . Remark 2.2.

Let us note that the discrete process p X t , Y u t , Z u t q t P π depends on X and p W t q t P π but we omit these dependences in the notation. The main idea of approximation by learning methods is to force the discrete controlled process Y u T at maturity T to match the approximated terminal condition g p X T q , by minimizing a lossfunction. Here, we work with the quadratic loss function, so that one faces the optimizationproblem inf u “p y , z qP R Ky ˆ R d ¯ Kz g p u q : “ E r G p X , W, u qs with G p X , W, u q “ | g p X T q ´ Y u T | . (2.12)However, one has to come up with a numerical procedure to compute the solution in practice.In order to numerically compute a solution to the optimization problem (2.12) (if any exists),one generally employs a stochastic approximation scheme such as a SGD algorithm. For anoverview of the theory of stochastic approximation, the reader may refer to [19], [39] and [6] andto [3, 20, 31, 37, 30] for applications to deep learning approximation of PDEs.We now describe the SGD algorithm that we implement in order to compute a solution p y , z q P R K y ˆ R d ¯ K z to the optimization problem (2.12).For a prescribed positive integer M representing the number of steps in the stochastic al-gorithm and two deterministic non increasing sequences of positive real number p γ ym q m ě and p γ zm q m ě representing the learning rates, we design the following direct algorithm . Deﬁnition 2.3. (Implemented direct algorithm)1. Simulate M independent discrete paths of the Brownian motion W “ p W m q ď m ď M and M independent samples of the initial condition p X m q ď m ď M .2. Initialization: select a random vector u “ p y , z q with values in R K y ˆ R d ¯ K z , independentof W and p X m q ď m ď M , and such that E r| u | s ă 8 .3. Iteration: For ď m ď M ´ , compute y m ` “ y m ´ γ ym ` ∇ y G ` X m ` , W m ` , u m ˘ , (2.13) z nm ` “ z nm ´ γ zm ` ∇ z n G ` X m ` , W m ` , u m ˘ , (2.14) for ď n ď N ´ .The output of the algorithm is then u M “ p y M , z M q . Remark 2.3.

In order to analyse the asymptotic properties of stochastic approximation schemes,one usually chooses the learning sequences p γ m q m ě “ p γ ym q m ě or p γ m q m ě “ p γ zm q m ě such that ÿ m ě γ m “ 8 and ÿ m ě γ m ă 8 , (2.15) see e.g. [19, 39, 6]. ∇ λ G p X , W, u q , λ P t y , z n , ď n ď N ´ u appearing in the aboveSGD algorithm. It shows that p y m ` , z m ` q can be easily computed once p Y u m , Z u m q have beensimulated for any ď m ď M ´ . Lemma 2.2.

For λ P (cid:32) y k , ď k ď K y ( Ť (cid:32) z n,k , ď k ď K zn , ď n ď N ´ ( and u “ p y , z q P R K y ˆ R d ¯ K z , it holds ∇ λ G p X , W, u q “ ´ p g p X T q ´ Y u T q ∇ λ Y u T (2.16) with ∇ y k Y u T “ ψ ky p X q N ´ ź l “ ` ´ h ∇ y f p Y u t l , Z u t l q ˘ , ď k ď K y , and for any ď n ď N ´ and any ď k ď K zn , ∇ z n,k Y u T “ ψ kn p X t n q ` ∆ W J n ´ h ∇ z f p Y u t n , Z u t n q ˘ N ´ ź l “ n ` ` ´ h ∇ y f p Y u t l , Z u t l q ˘ with the convention ś H “ . Under Assumption 2.1, Assumption 2.2 (i) and Assumption 2.3, the well-posedness of Algorithm2.3, that is, the fact that it holds arg min u P R Ky ˆ R d ¯ Kz g p u q ‰ H , is proved in Lemma 4.1.Additionally, for any u ‹ P arg min u P R Ky ˆ R d ¯ Kz g p u q , we show in Proposition 4.2 that E « | u p , X q ´ Y u ‹ | ` h N ´ ÿ n “ | Z t n ´ Z u ‹ t n | ﬀ ď C p E π ` E ψ q , for some positive constant C . The quantities E π and E ψ represents the discrete-time error andthe error due to the approximation in the functional spaces p V y , V zn q , respectively. They aredeﬁned by E π : “ E « N ´ ÿ n “ ż t n ` t n ` | Y s ´ Y t n | ` | Z s ´ Z t n | ` | X t n ´ X t n | ˘ ds ﬀ (2.17)and E ψ : “ inf u P R Ky ˆ R d ¯ Kz E « | u p , X q ´ Y u | ` N ´ ÿ n “ h |p σ J ∇ x u qp t n , X t n q ´ Z u t n | ﬀ . (2.18)Let us mention, for later use, that, in the setting of Assumptions 2.1 and 2.2(i), E π ď Ch , (2.19)for some positive constant C , see e.g. Ma and Zhang [41] and Pagès [43].We shall not seek to obtain theoretical convergence results for the direct algorithm itself. However,we illustrate its performance numerically in Section 3 when using sparse grids approximations[12]. 9 emark 2.4. The numerical complexity C will be measured by the number of coeﬃcients updaterealized to obtain the approximation. From the previous description, we obtain straightforwardlythat the complexity at worst satisﬁes C “ O d p N KM q . (2.20) Picard algorithm

An issue with the above algorithm comes from the fact that the optimization problem (2.12)is generally not convex. Even though u ÞÑ p Y u , Z u q is linear for our choice of parametrisation,in general the mapping u Ñ Y u is non-linear since f itself is non-linear. As a consequence, inpractical implementation, we have no guarantee that the algorithm converges to local or globalminima. On top of practical problems, this renders the theoretical analysis of the implemented direct algorithm diﬃcult, in particular if one wants to obtain rates of convergence to assessprecisely the numerical complexity of the method.In this section, we introduce a Picard algorithm which transforms this non-convex optimisa-tion problem into a sequence of linear-quadratic optimization problems. This is done by usingthe special structure of the original problem. Indeed, it is well known that the solution of theBSDE (1.4) itself is obtained as the limit of a sequence of Picard iterations, see e.g. [21] and [5]from a numerical perspective.

Our

Picard algorithm is based on the iteration of the following operator: R K y ˆ R d ¯ K z Q ˜ u ÞÑ Φ p ˜ u q : “ ˇ u P R K y ˆ R d ¯ K z (2.21)where, ˇ u “ arg min u P R Ky ˆ R d ¯ Kz E ” | g p X T q ´ U ˜ u , u T | ı . (2.22)In the above expectation, the process X is the Euler-Maruyama approximation scheme on thetime grid π with dynamics (2.3) and U ˜ u , u (simply denoted as U below) is given by the followingdecoupling approximation scheme:1. For ˜ u P R K y ˆ R d ¯ K z , we ﬁrst consider p Y ˜ u , Z ˜ u q P B π,ψ as introduced in Deﬁnition 2.2.2. Then, for any u P R K y ˆ R d ¯ K z , consider the discrete control process Z u P H π,ψ as introducedin (2.7) of Deﬁnition 2.1 and deﬁne the control process U ˜ u , u by U ˜ u , u “ Y u (2.23)recall (2.9) and for any ď n ď N ´ , U ˜ u , u t n ` “ U ˜ u , u t n ´ hf p Y ˜ u t n , Z ˜ u t n q ` Z u t n ¨ p W t n ` ´ W t n q , (2.24)and its continuous version, for any t n ď t ď t n ` , U ˜ u , u t “ U ˜ u , u t n ´ p t ´ t n q f p Y ˜ u t n , Z ˜ u t n q ` Z u t n ¨ p W t ´ W t n q . E ” sup t Pr ,T s | U ˜ u , u t | ı ă `8 . Deﬁnition 2.4 (Theoretical Picard algorithm) . For a prescribed positive integer P :1. Initialization: set u P R K y ˆ R d ¯ K z .2. Iteration: for ď p ď P , compute: u p “ Φ p u p ´ q .The output of the algorithm is then u P . The main novelty compared to the optimization problem (2.12) comes from the fact that themap u ÞÑ U ˜ u , u is now linear. This linearity is achieved by freezing the driver f in the dynamicsof the control process along the process p Y ˜ u , Z ˜ u q P B π,ψ . The parameter ˜ u P R K y ˆ R d ¯ K z is thenupdated through the Picard iteration procedure. This is of course the main purpose of this Picardalgorithm , compared to the direct algorithm of Section 2.2. At this stage, the above algorithm istheoretical and its solution (if any exists) still needs to be numerically approximated. This willbe discussed in full details in the next section.We here discuss the well-posedness of the optimization problem (2.22). We ﬁrst introducesome notations that will be useful in the sequel to study the

Picard algorithm as iterated least-square optimization problems.First, to clarify the linear structure, we introduce the following notationsi) For ď k ď K y , θ k : “ ψ ky p X q .ii) For ď n ď N ´ , ď k ď K , the R d -valued random vectors ω n,k is deﬁned by ω n,k “ Ψ n,k ∆ W n with Ψ n,k “ ψ kn p X t n q , (2.25)and we set ω J : “ pp ω n,k q J q ď n ď N ´ , ď k ď K zn (so that ω is an R d ¯ K z -valued random vector).iii) the random vector Ω “ p θ J , ω J q J which takes values in R K y ˆ R d ¯ K z .Note that both ω and Ω depends on W and X , but we will omit this in the notation. Then, werewrite g p X T q ´ U ˜ u , u T “ G ˜ u ´ u ¨ Ω (2.26)where G ˜ u “ G ˜ u p X , W q : “ g p X T q` N ´ ÿ n “ hf p Y ˜ u t n , Z ˜ u t n q . (2.27)Thus, the optimization problem (2.22) is given by ˇ u “ argmin u P R Ky ˆ R d ¯ Kz H p ˜ u , u q with H p ˜ u , u q : “ E ” | G ˜ u ´ u ¨ Ω | ı (2.28)and simply reads as a Linear-Quadratic optimization problem. Classically, we introduce semi-norms on the parameter spaces. 11 eﬁnition 2.5. For u “ p y , z q P R K y ˆ R d ¯ K z , we deﬁne } y } y : “ E “ | y ¨ θ | ‰ , } z } z : “ E “ | z ¨ ω | ‰ and ~ u ~ : “ E “ | u ¨ Ω | ‰ . Let us insist on the fact that these quantities depend on the choices of π , ψ though this is notreﬂected in the notation. Remark 2.5. i) Observe that from the very deﬁnition of the random vector Ω , for any u “p y , z q P R K y ˆ R d ¯ K z , it holds ~ u ~ “ } y } y ` } z } z . (2.29) ii) With the notations of Section 2.2, the following relations hold ~ u ~ “ E «ˇˇˇˇ Y u ` ż T Z u t d W t ˇˇˇˇ ﬀ , } y } y “ E “ | Y u | ‰ and } z } z “ N ´ ÿ n “ h E “ | Z u t n | ‰ , (2.30) for any u “ p y , z q P R K y ˆ R d ¯ K z .iii) For later use, see Section 2.3.3, we also note that } ¨ } z develops as } z } z “ N ´ ÿ n “ h d ÿ l “ p z n, ¨ l q J E “ Ψ n, ¨ p Ψ n, ¨ q J ‰ z n, ¨ l (2.31) by using the independence of the increments p ∆ W n q l , for ď n ď N ´ and l P t , . . . , d u and where we used the notations z n, ¨ l “ p z n, l , . . . , z n,K zn l q and Ψ n, ¨ “ p Ψ n, , . . . , Ψ n,K zn q . A key assumption to ensure the well-posedness of our approach is the following.

Assumption 2.3.

There exist two positive constants κ K ě ě α K such that for any p y , z q P R K y ˆ R d ¯ K z α K | y | ď } y } y ď κ K | y | and hα K | z | ď } z } z ď hκ K | z | . Lemma 2.3.

For all p ˜ u , u q P ´ R K y ˆ R d ¯ K z ¯ , it holds u J ∇ u H p ˜ u , u q u “ ~ u ~ , (2.32) where ∇ u H denotes the Hessian of the function u ÞÑ H p ˜ u , u q .Moreover, under Assumption 2.3, the optimization problem (2.22) admits a unique solution andfor any ˜ u P R K y ˆ R d ¯ K z and p y , z q P R K y ˆ R d ¯ K z , it holds p y ´ ˇ y q ¨ ∇ y H p ˜ u , u q “ } y ´ ˇ y } y ě α K | y ´ ˇ y | (2.33) p z ´ ˇ z q ¨ ∇ z H p ˜ u , u q “ } z ´ ˇ z } z ě hα K | z ´ ˇ z | (2.34) where ˇ u “ p ˇ y , ˇ z q P R K y ˆ R d ¯ K z is the unique solution to (2.22) . roof. From (2.28), we straightforwardly compute ∇ u H p ˜ u , u q “ ´ E ” p G ˜ u ´ u ¨ Ω q Ω ı and ∇ u H p ˜ u , u q “ E “ ΩΩ J ‰ (2.35)which, recalling Deﬁnition 2.5, directly yields (2.32). In particular, we have ∇ y H p ˜ u , u q “ ´ E ” p G ˜ u ´ u ¨ Ω q θ ı “ ´ E ” p G ˜ u ´ y ¨ θ q θ ı (2.36)and, for any ď n ď N ´ and any ď l ď d , ∇ z n, ¨ l H p ˜ u , u q “ ´ E ” p G ˜ u ´ Ω ¨ u q ω n, ¨ l ı “ ´ E ” p G ˜ u ´ ω n, ¨ l ¨ z n, ¨ l q ω n, ¨ l ı . (2.37)Under Assumption 2.3, we deduce from (2.32) that the problem is strictly convex and has aunique (global) minimum ˇ u “ p ˇ y , ˇ z q . The inequalities (2.33) and (2.34) then follow from (2.36)and (2.37) combined with the fact that ∇ z H p ˜ u , ˇ u q “ . l From a practical point of view, the sequence of theoretical Linear-Quadratic optimization problemdescribed in the previous section has to be approximated. Due to the possibly high dimension ofthe matrix E r ΩΩ J s , we will rely on a SGD algorithm to compute the unique solution to (2.22).Indeed, for a ﬁxed vector ˜ u in R K y ˆ R d ¯ K z , the key point is to observe that the unique minimizer ˇ u is the unique solution to the equation ∇ u H p ˜ u , u q “ . (2.38)We importantly remark, using (2.36) and (2.37) that the above relation (2.38) holds true ifand only if E r H y p X , W, ˜ u , y qs “ , and E ” H n,l p X , W, ˜ u , z n, ¨ l q ı “ , (2.39)where H y is a map from R d ˆ p R d q N ˆ p R K y ˆ R d ¯ K z q ˆ R K y to R K y and deﬁned by H y p X , W, ˜ u , y q : “ ´ β K p G ˜ u ´ y ¨ θ q θ, (2.40)and H n,l are maps deﬁned on R d ˆ p R d q N ˆ p R K y ˆ R d ¯ K z q ˆ R K zn taking values in R K zn and givenby H n,l p X , W, ˜ u , z n, ¨ l q : “ ´ β K ? h p G ˜ u ´ ω n, ¨ l ¨ z n, ¨ l q ω n, ¨ l , ď n ď N ´ , ď l ď d. (2.41)We importantly point out that we abuse the notation in (2.40) and (2.41) since the variable p X , W q stands for a vector of R d ˆ p R d q N and p X , W q ÞÑ G ˜ u “ G ˜ u p X , W q is also deﬁned by(2.27) while in (2.39) the random vector W “ p W t n q ď n ď N stands for the discrete path of theBrownian motion W and X for the starting value of X . Since it is also the procedure used for the direct algorithm , the numerical comparison between the two will bemore relevant.

13n (2.40) and (2.41), the deterministic constant β K corresponding to a normalizing factor isintroduced in order to control the L p P q -moment of the random vectors p G ˜ u ´ y ¨ θ q θ and p G ˜ u ´ ω n, ¨ l ¨ z n, ¨ l q ω n, ¨ l . Namely, we select β K large enough so that p β K q ě p ` E “ | θ | ‰ q _ max ď n ď N ´ , ď l ď d p ` E “ | ˜ ω n, ¨ l | ‰ q (2.42)with ˜ ω n, ¨ (cid:96) “ ω n, ¨ (cid:96) ? h . Let us insist on the fact that the chosen β K above should be uniform for alltime grid π . It depends only on the level of approximation coming from the deﬁnition of theapproximation spaces. This qualitative level of approximation is controlled by the number ofbasis function per time step, namely K .For latter use, comparing (2.40) to (2.36) and (2.41) to (2.37), we remark that E r H y p X , W, ˜ u , y qs “ β K ∇ y H p ˜ u , u q and E ” H n,l p X , W, ˜ u , z n, ¨ l q ı “ β K ? h ∇ z n, ¨ l H p ˜ u , u q , (2.43)for ď n ď N ´ , l P t , . . . , d u and p ˜ u , u q P p R K y ˆ R d ¯ K z q .The implemented Picard algorithm is obtained by iterating a stochastic gradient operatorwhich is the counterpart of Φ deﬁned by (2.21) obtained by the numerical approximation thatwe now introduce. Deﬁnition 2.6.

Let M be a positive integer. Let W : “ p W m q ď m ď M , be M discrete paths alongthe time grid π of the Brownian motion W , X : “ p X m q ď m ď M , be M independent samples ofthe initial condition (and independent from W ) and p γ m q m ě a deterministic sequence of positivereal numbers satisfying: ÿ m ě γ m “ 8 and ÿ m ě γ m ă 8 . (2.44) We set, for all m ě , γ ym “ γ m and γ zm “ γ m ? h . (2.45) Let u “ p y , z q be a random vector taking values in R K y ˆ R d ¯ K z , independent of p X , W q andsuch that E r| u | s ă 8 .The operator Φ M , parametrized by p u , X , W q , is given by R K y ˆ R d ¯ K z Q ˜ u ÞÑ Φ M p u , X , W , ˜ u q “ u M (2.46) where u M is the output of the SGD algorithm after M steps and is obtained as follows:1. The initial value is set to u .2. Iteration: For ď m ď M ´ , compute y m ` “ y m ´ γ ym ` H y p X m ` , W p m ` q , ˜ u , y m q , (2.47)and z n, ¨ l,m ` “ z n, ¨ l,m ´ γ zm ` H n,l p X m ` , W p m ` q , ˜ u , z n, ¨ l,m q , (2.48) for any ď n ď N ´ and any ď l ď d . eﬁnition 2.7 (Implemented Picard algorithm ) . For a prescribed positive integer P :1. Initialization: Select a random vector u taking values in R K y ˆ R d ¯ K z such that E r| u | s ă8 . Set u M : “ u .2. Iteration: for ď p ď P , simulate independently a set of M independent discrete paths W p of the Brownian motion W , independent initial condition X p and an initial starting point u p (independently also of u , u j and of the previous p X j , W j q , ď j ď p ´ ), and compute u pM : “ Φ M p u p , X p , W p , u p ´ M q as in Deﬁnition 2.6.The output of the algorithm is then u PM . Remark 2.6. i) The choice of the learning sequence γ for the SGD algorithm (2.47) , (2.48) might be delicate in practice, see e.g. Section 3.1.3.ii) The initialisation is random in the above algorithm. We do not always follow this procedurein our numerical experiments, see Section 3.iii) The numerical complexity C of the full algorithm is the sum of the local complexity of eachSGD algorithm so that C “ O d p P N KM q . (2.49)Using the output u PM of the Picard algorithm , we set the approximating function at time to be: U PM p x q : “ K y ÿ k “ p y PM q k ψ ky p x q , (2.50)recalling (2.8).We then aim to control the following mean squared error: E MSE : “ E ”ˇˇ U PM p X q ´ u p , X q ˇˇ ı . (2.51)We obtain an explicit upper bound on the mean-squared error when specifying the parametersof the algorithm as follows. For γ ą , ρ P p , q , we set γ m : “ γm ´ ρ , m ě and we assumethat the number of steps M in the SGD algorithm satisﬁes, for some η ě , γ α K β K M ´ ρ ě ? and κ K h ^ α K ˆ e ´ ? p q γ αKβK M ´ ρ ` β K α K M ρ ˙ ď η . (2.52) Theorem 2.1.

Let Assumption 2.1, Assumption 2.2 (i), (ii), Assumption 2.3 and (2.52) hold.If LT and η are small enough, then there exists δ ă such that E MSE ď C ρ,γ ˆ δ P ` κ K h ^ α K ˆ e ´ ? p q γ αKβK M ´ ρ ` β K α K M ρ ˙ ` h ` E ψ ˙ . (2.53) for some positive constant C ρ,γ , where we recall that E ψ is given by (2.18) . Remark 2.7.

1. As expected, the above upper bound is the sum of the error due to the Picarditeration, the error induced by the SGD algorithm, the discrete-time approximation error,recall (2.19) , and the error E ψ generated by the approximation in the functional spaces p V y , V zn q . . The smallness condition on LT is precisely given in the statement of Proposition 4.4.This condition should not come as a surprise since we use Picard iteration. The smallnesscondition on η is not restrictive in practice as the quantity it controls should go to zero toobtain the convergence of the numerical procedure. To deduce a rate of convergence from (2.53), one has to chose the approximation basis functions p ψ ky q ď k ď K y and p ψ kn q ď n ď N ´ , ď k ď K zn and to set optimally the algorithm’s parameters. Thechoice of the basis function has a dramatic impact on the complexity of the algorithm. In thenext section, we work with sparse grid approximation and we are able to show that the complexityis controlled both theoretically and in practice under Assumption 2.2. Both the implemented direct algorithm , see Deﬁnition 2.3, and the implemented

Picard algorithm ,see Deﬁnition 2.7, rely on the choice of the approximation spaces V y and V zn , ď n ď N ´ andthe choice of the related basis functions p ψ ky q ď k ď K y and p ψ kn q ď n ď N ´ , ď k ď K zn . The impact isboth theoretical, in terms of convergence rate and numerical complexity, and practical in termsof computational time. We choose here to use sparse grid approximations. This will allow us toobtain interesting numerical complexity results in the setting of Assumption 2.2, see Theorem 3.1.We carefully investigate the convergence of the implemented Picard Algorithm . It is not the ﬁrsttime that sparse grid approximations are investigated in the context of linear regression. We willuse the framework introduced in [9]. Note however that some restriction in the choice of sparsegrid approximations are introduced by Assumption 2.3.The basis functions are built using elementary bricks that have a compact support includedin the bounded domain O n “ d ź l “ r a nl , b nl s where a nl ă b nl for l P t , . . . , d u . (3.1)The domain speciﬁcation strongly depends on the applications under study. We will considertwo main cases in this work.1. For all ď n ď N ´ , O n “ d ź l “ r a l , b l s “ : O . (3.2)Namely, the coeﬃcients a and b do not depend on n . This will be the case in Section 3.1.2where we consider coeﬃcient functions that are O -periodic.2. Alternatively, the coeﬃcient a and b are functions of the time-step but also of the diﬀusioncoeﬃcients p b, σ q , recall (1.2), and the PDE dimension d . Namely a n : “ a p t n , b, σ, d q and b n : “ b p t n , b, σ, d q . (3.3)16owever, In both cases the basis functions are obtained by a transformation of the domain r , s d on which we deﬁne the primary basis using sparse grids . The transformation is deﬁned as follows: τ n : R d Q x ÞÑ τ n p x q “ ´ x ´ a n b n ´ a n , . . . , x d ´ a nd b nd ´ a nd ¯ J P R d . (3.4)We will introduce two types of basis functions: the ﬁrst one, based on pre-wavelet basis, followsfrom [9] and the second one, based on hat functions modiﬁed at the boundary of the domain,follows from [24]. We describe here the elementary bricks that are used to build the basis functions of the approx-imation spaces.For a level l P N and an index i P t , . . . , l u , we ﬁrst consider the family of hat functionsgiven by φ l,i p x q “ φ p l x ´ i q with φ p x q “ " ´ | x | if ´ ă x ă otherwise . (3.5)The univariate pre-wavelet basis functions χ l,i : R Ñ R are deﬁned by χ , “ r , s , χ , “ φ , , χ , “ φ , ´ (3.6)and for l ě , i P I l zt , l ´ u with I l : “ t ď i ď l ´ | i odd u χ l,i “ l ˆ φ l,i ´ ´ φ l,i ´ ` φ l,i ´ φ l,i ` ` φ l,i ` ˙ . (3.7)For the boundary points i P t , l ´ u , we set χ l, “ l ˆ ´ φ l, ` φ l, ´ φ l, ` φ l, ˙ and χ l, l ´ p x q “ χ l, p ´ x q , x P R . (3.8)The multivariate pre-wavelet function on R d are obtained by a classical tensor-product approach.For a multi-index level l “ p l , . . . , l d q and a multi-index position i “ p i , . . . , i d q , χ l , i p x q “ d ź j “ χ l j ,i j p x j q . (3.9)In this multivariate case, the index sets are given by I l “ " i P N d ˇˇˇ ď i j ď if l j “ ,i j P I l j if l j ą , for all ď j ď d * . (3.10)The hierarchical increment spaces are then deﬁned for l P N d by W l : “ span t χ l , i | i P I l u . (cid:96) is given by S (cid:96) : “ à l P L (cid:96) W l , L (cid:96) : “ t l P N d , ζ d p l q ď (cid:96) u (3.11)with ζ d p q : “ and for l ‰ ζ d p l q “ | l | ´ d ` |t j | l j “ u| ` , where for a multi-index l P N d we recall that | l | “ ř d(cid:96) “ l (cid:96) and that | A | is the cardinality of A .The key point here is that the dimension of S (cid:96) satisﬁes dim p S (cid:96) q “ O p (cid:96) (cid:96) d ´ q , (3.12)so that the curse of dimensionality only appears with respect to the level (cid:96) , see [22] (and alsoin the constant related to the notation O p . q ). The key point now is that the approximationerror when using the sparse space is also controlled if the function to be approximated is smoothenough. To this end, for the ﬁxed open domain p , q d , we consider the space of function withmixed derivatives H mix pp , q d q (see the section Notation for a precise deﬁnition). Then, for any v P H mix pp , q d q , it holds inf ξ P S (cid:96) } ξ ´ v } L pp , q d q ď C ´ (cid:96) (cid:96) d ´ } v } H mix pp , q d q (3.13)for some positive constant C : “ C p d q . We refer e.g. Theorem 3.25 in [8] for a proof of thisresult. Again, we importantly emphasize that in the above control of the error the curse ofdimensionality only appears with respect to the level (cid:96) . Remark 3.1.

The number of basis functions is thus K “ dim p S (cid:96) q . We denote by k : C ÞÑt , . . . , K u any bijection enumerating C . We will often slightly abuse the notation and writedirectly p ψ kn q ď k ď K instead of p ψ p l , i q n q p l , i qP C to be consistent with the notation introduced in theprevious section. Picard Algorithm in the case of periodic coeﬃcients

In this section, we work under the setting of Assumption 2.2 (iv). To alleviate the notation –but without loss of generality – we assume that the coeﬃcients are - periodic in the followingsense: for λ “ b, σ or g λ p x ` q q “ λ p x q , for all p x, q q P R d ˆ Z d , (3.14)which implies the same property for the value function u and its derivatives.We thus consider here that O “ r , s d , recall (3.2) and τ “ I d , recall (3.4). Here, we are lookingfor an approximation U PM p¨q of u p , ¨q on the whole domain O , recall (2.50). We thus set X tobe uniformly distributed on p , q d , which means that Assumption 2.1(i) holds true.For sake of clarity, we summarize the current setting in the following assumption: Assumption 3.1.

Let Assumption 2.1(i) and Assumption 2.2 hold true. Moreover, set O “r , s d and X „ U pp , q d q .

18o take into account the periodic setting in our approximation, let us ﬁrst deﬁne the - periodisation of a compactly supported function ϕ by q ϕ p x q : “ ÿ q P Z d ϕ p x ` q q , for all x P R d . (3.15)The basis functions ψ are then given by ψ “ q χ . Namely, for any ď n ď N ´ , for anapproximation date t n , we introduce the set of functions V zn : “ t ξ : R d ÞÑ R | ξ p x q “ q v p x q , for some v P S (cid:96) u . (3.16)Moreover, at the initial time, the approximation of u p , ¨q will also be computed in V y : “ t ξ : R d ÞÑ R | ξ p x q “ q v p x q , for some v P S (cid:96) u . (3.17) Remark 3.2.

We could have set an approximation level diﬀerent for each time step, howeverwe shall not use this possibility in our theoretical or numerical convergence results. We thussimply consider a ﬁxed positive level (cid:96) of approximation, that, obviously, will be chosen later inan optimal way.

Let also introduce the function R d Q x ÞÑ p x P r , q d (3.18)such that p x and x belong to the same equivalence class in R d { Z d . Denoting by P X tn the proba-bility measure on R d associated to the random vector X t n given by the Euler-Maruyama scheme(2.3) taken at time t n and starting from X at time 0 and using Lemma 2.1, we remark that theboundary of the domain O has null P X tn -measure. We thus deduce q ψ p X t n q “ ψ p p X t n q P ´ a.s. , (3.19)and in practice we should work with the latter quantity. Namely, we construct our approximationscheme using: Y u : “ K y ÿ k “ ψ ky p p X q y k and Z u t n : “ K zn ÿ k “ ψ kn p p X t n q z n,k , for ď n ď N ´ , (3.20)with u “ p y , z q P R K y ˆ R d ¯ K z .Under the current setting of periodic coeﬃcients and sparse grid approximation, we takebeneﬁt of the convergence results given in Theorem 2.1 to obtain our main theoretical resulton the complexity of the Picard algorithm . Indeed, the next theorem shows that the curse ofdimensionality is tamed by using the sparse grid approximation.

Theorem 3.1.

Let Assumption 3.1 hold and assume that LT is small enough. For a prescribed ε ą , the complexity C ε , deﬁned in Remark 2.6, of the full Picard algorithm in order to achievea global error E MSE of order ε , recall (2.51) , satisﬁes C ε “ O d p ε ´ p ` ι q | log p ε q| ` ` ι p d ´ q q for any ă ι ă . The proof of this theorem is given in Section 4.4 where the algorithm’s parameters are optimallyset with respect to ε . 19 eriodic example We consider here 1-periodic coeﬃcients on R d . The coeﬃcients of theforward SDE (1.2) are given by, for x P R d , b i p x q “ . p πx i q , σ i,j p x q “ ? dπ p . ` . p πx i qq t i “ j u , ď i, j ď d . The coeﬃcients of the BSDE reads g p x q “ π ˜ sin ˜ π d ÿ i “ x i ¸ ` cos ˜ π d ÿ i “ x i ¸¸ , x P R d ,f p t, x, y, z q “ π y d ÿ i “ p σ i,i p x qq ´ d ÿ i “ b i p x q z i σ i,i p x q ` h p t, x q , t P r , T s , x P R d , y P R , z P R d , where h p t, x q “ ´ cos p π ř di “ x i ` π p T ´ t qq ´ sin p π ř di “ x i ` π p T ´ t qq ¯ . The explicit so-lution is given by u p t, x q “ π ˜ sin ˜ π d ÿ i “ x i ` π p T ´ t q ¸ ` cos ˜ π d ÿ i “ x i ` π p T ´ t q ¸¸ , x P R d . We perform the test for d “ and M “ , N “ , T “ . , level “ by PicardAlgorithm with P “ , then there are K y “ K zn “ K “ basis functions. We obtain a meansquare error E MSE “ . at the 5-th Picard iteration: See Figure 1 displaying the learningperformance. The parameters of the test are shown in Table 6 in the appendix.Figure 1: m ÞÑ | ˆ Y m ´ u p , X m q| for the Picard algorithm , d “ . The MSE is computed by themean of the last 10000 steps of each Picard iteration. Picard and direct Algorithm

We will now investigate numerically the behavior of the

Picard algorithm and direct algorithm on “test” examples that have been already considered in the literature. In particular, this willallow us also to compare our methods to existing methods as the ones investigated in [20, 37].20or this section, we work in the setting of Assumption 2.1(ii). This means that at the initialtime, the output p ˆ y , ˆ z q of the algorithms: p ˆ y , ˆ z q “ p y M , z M q , for the direct algorithm , recallDeﬁnition 2.3, or p ˆ y , ˆ z q “ p y PM , p z PM q q , for the Picard algorithm , recall Deﬁnition 2.7, areapproximating p u p , x q , σ J ∇ x u p , x qq P R ˆ R d . Since, these are one point values, there isno need to introduce basis functions at the initial time and the approximating spaces are just V y “ R and V z “ R d . Then, for any ď n ď N ´ , for the discrete time t n , we set theapproximating space as follows: V zn : “ t ξ : R d ÞÑ R | ξ p x q “ v p τ n p x qq , for some v P S (cid:96) u , (3.21)recall (3.4). In particular, the basis functions are given by ψ kn p x q “ χ k p τ n p x qq , recall (3.9) andRemark 3.1.We now report more speciﬁcally the various algorithms parameters that have been used inpractice. The ﬁrst thing to note is that we are able to obtain good results with a low level ofapproximation. Indeed, in all our numerical tests, we set the level (cid:96) “ . The Table 1 belowindicates the number of basis functions that have theoretically to be considered when includingboundary function. dimensions levels (cid:96) ď (cid:96) ď (cid:96) ď d=2 49 113 257d=3 225 593 1505d=4 945 2769 7681d=5 3753 12033 36033Table 1: The number of functions in the sparse grid approximation with boundary for diﬀerentdimensions and levels.Next, we need to deﬁne the domain O n , ď n ď N ´ , where the approximation will becomputed, which depends on the underlying process, recall (3.1)-(3.3). We will consider twocases in our simulations, each component of the forward SDE is given by a Brownian motionwith drift µ and volatility σ : t ÞÑ x ` µt ` σW t or a geometric Brownian motion: t ÞÑ x exp pp µ ´ σ { q t ` σW t q .1. For the Brownian motion with drift, we set O n “ x ` r µt n ´ rσ ? t n , µt n ` rσ ? t n s d , for some r P R ` . (3.22)2. For the geometric Brownian motion, we set O n “ r x e R ´ rσ ? t n , x e R ` rσ ? t n s , R “ p µ ´ σ q t n , for some r P R ` . (3.23)Finally, a delicate parameter to chose is the the learning rate. Empirically, it was set to: Deviating slightly from Deﬁnition 2.7, we will use for the initialization of the current SGD step, the last valuecomputed at the previous step instead of a random value. m p λ, t n , α, β , β , m q “ β p λ q n ` β p λ q ` p m ` m p λ qq α p λ q , ď m ď M, λ

P t y , z , ¨ u n “ Y t z n, ¨ u ď n ď N ´ , (3.24)where β , β P R ` , m P N ` , α P ` , ‰ . Remark 3.3. i) m is a suitable positive number to decrease the learning rates for avoiding abig jump of the estimated λ in the beginning steps of the algorithm.ii) The parameter r P R ` is a suitable number to balance the running time and the errors of thealgorithms.iii) Both β and α can be used to adjust the converge speed and the variance of the estimated λ .Suitable parameters make the algorithm more stable, converge faster and reduce the variance ofthe estimated λ .iv) Usually, we increase the value of α or decrease the value of p β , β q gradually to decrease theconvergence rate with the increase of step p, ď p ď P for Picard algorithm . Concerning the number of steps in the SGD algorithm, we make the following remark.

Remark 3.4.

We used two techniques to control M in order to reduce the computational cost:i) We use α P ` , ‰ which still works well as the SGD algorithm can converge faster.ii) If β p λ q ” , for M large enough, the algorithm eventually converge, but t z n,k u ď k ď K ď n ď N ´ convergence becomes slower and slower with the increase of n (the time step). We thus choose β p λ q ą in practice to make all t z n,k u ď k ď K ď n ď N ´ converge altogether with a smaller M (thanksto the learning rates increase with n ). The remaining parameters are precised in the examples below. We refer also Section to theAppendix for the collection of all algorithm parameters values used in the numerical simulation.

Quadratic model

First, we consider the quadratic example, whose driver is set to f p y, z q “ a | z | “ a p z ` z ` ¨ ¨ ¨ ` z d q , y P R , z P R d , (3.25)where a P R is a constant, and the terminal condition to g p x q “ log ˆ ` | x | ˙ , x P R d . (3.26)The explicit solution can be obtained through the Cole-Hopf transformation(see e.g. [15, 20]): y t “ u p t, x q “ a log E “ e a ¨ g p x ` W T ´ t q ‰ , a ‰ E r g p x ` W T ´ t qs , a “ and z it “ B x i u p t, x q “ E “ B x i g p x ` W T ´ t q e a ¨ g p x ` W T ´ t q ‰ E “ e a ¨ g p x ` W T ´ t q ‰ , i “ , , ¨ ¨ ¨ , d. Thus, to obtain a numerical reference solution and conﬁdence interval for y and z i , i “ , , ¨ ¨ ¨ , d , we use classical Monte Carlo estimation of the expectations.22he underlying diﬀusion X is given by the Brownian motion W , and the parameters areselected as follows: a “ , M “ , N “ , T “ . We compute a reference solution ¯ y “ . with conﬁdence interval p . , . q when d “ by Monte Carlo method using simulation paths. Figure 2 shows the numerical approximation of y and its 95% conﬁdenceinterval by the same color line of the 5-dimensional quadratic model by direct algorithm and the deep learning algorithm introduced in [20]. The diﬀerence of ˆ y between our SGD algorithm andMonte Carlo simulation is less than ´ . It turns out that for this “low” dimensional exampleand with this set of parameter, it is more precise than the deep BSDEs solver .Figure 2: ˆ y for the quadratic model withd=5 and T=1 by direct algorithm anddeep learning algorithm. Figure 3: The value of ˆ y by Picard algorithm with d=5, level=3, T=1, P=6, M=2000.For d “ , M “ , N “ , P “ , T “ , a “ , Figure 3 shows that ˆ y converges for eachPicard iteration, and overall ˆ y Ñ . . We can observe that ˆ y is very close to the referencesolution ¯ y when the number of iteration p is greater or equal to . A ﬁnancial model

We now report our numerical results for a model with a ﬁnancial ﬂavour.The underlying process X follows a d-dimensional geometric Brownian motion, for µ P R , σ ą ,namely d X it “ X it p µdt ` σdW it q , i “ , , ¨ ¨ ¨ d, X “ x P p R ` q d . The driver of the BSDE is given by, for p y, z q P R ˆ R d , f p y, z q “ ´ R l y ´ µ ´ R l σ d ÿ i “ z i ` p R b ´ R l q max , ř di “ z i σ ´ y + , and the terminal condition g p x q “ max "„ max ď i ď d x i  ´ K , * ´ "„ max ď i ď d x i  ´ K , * . Hence, for all t P r , T q , x P R d , it holds that u p T, x q “ g p x q and B u B t ` σ d ÿ i “ x i B u B x i ´ min R b p u ´ d ÿ i “ x i B u B x i q , R l p u ´ d ÿ i “ x i B u B x i q + “ . (3.27)This is a typical example of “non-linear market” speciﬁcation, where there are two diﬀerentinterest rates for borrowing and lending money, see Bergman [7] and e.g. [20, 26, 11, 16, 5],where this example has been used as a test example for numerical methods for BSDEs.23n our numerical test below, we set the parameters as follows: N “ , M “ , µ “ . , σ “ . , R l “ . , R b “ . , K “ , K “ , T “ . and x “ p , ¨ ¨ ¨ , q .Table 2 compares the results of the direct algorithm and the deep learning algorithm [20] when d “ , . The approximated value of y obtained by the two methods are very close. Figure 4and Figure 5 show the performance of the Picard algorithm with parameters d “ , P “ , and ˆ y converges to . at the last step.dimensions SGD algo with Sparse grids Deep learning scheme [20] y

95% CI of y y

95% CI of y d=2 4.3332 [4.2921, 4.3743] 4.3516 [4.3420, 4.3612]d=4 7.0960 [7.0432, 7.1487] 7.1130 [7.0649, 7.1611]Table 2: Comparison of the direct algorithm and the deep learning algorithm for the ﬁnancialmodel.Figure 4: Approximation ˆ y by Picard al-gorithm when d=4 and T=0.5. Figure 5: Approximation ˆ z by Picard al-gorithm when d=4 and T=0.5.

Picard Algorithm

We now illustrate on a numerical example that the smallness assumption may be necessary toobtain the convergence of the

Picard Algorithm . To this end, we consider the following model.For a given a P R , the BSDE driver is given by f p y, z q : “ arctan p ay q ` d ÿ j “ z j , p y, z q P R ˆ R d , and the terminal condition g p x q : “ e ` ¨ x ` e ` ¨ x , x P R d . The underlying process X is simply equal to the Brownian motion W , namely b “ , σ “ I d .We set the terminal time T “ and the dimension d “ .We study numerically the above model for diﬀerent value of a , which controls the Lipschitzconstant of f , in the case of the Picard Algorithm . The value obtained are compared to the onesobtained by two other methods: a multistep scheme in [13] and the deep BSDEs solver of [20].The values obtained by these two methods are considered to be close to the true solution.When a “ ´ . , Figure 6 shows that ˆ y converges. However, this is not the case anymore when a “ ´ . , see Figure 7, as ˆ y oscillates between two values. Actually, we see on Figure 8 that abifurcation occurs for the Picard Algorithm around a “ ´ . .24igure 6: The value of ˆ y by Picard al-gorithm with d=2, level=3, T=1, P=9,M=5000, a=-0.4. Figure 7: The value of ˆ y by Picard al-gorithm with d=2, level=3, T=1, P=9,M=5000, a=-1.5Figure 8: The value of ˆ y by Picard algorithm , direct algorithm and deep learning method withd=2, level=3, T=1, P=9, M=5000. The last four steps are shown for the Picard Algorithm illustrating the bifurcation phenomenon. Note that the direct algorithm does not exhibit suchbehaviour.

In the previous section, using the pre-wavelet basis, we were able to establish a theoreticalupper-bound on the global complexity for the

Picard algorithm and to show that both the

Picard algorithm and direct algorithm converge in practice too. However, the number of basisfunctions, even though we use a sparse approximation, is still quite important which prevents usfrom dealing eﬀectively with high-dimensional PDE. In particular, the number of basis functionsused to capture what happens on the boundary of the domain is large. In this section, we use,the so-called “modiﬁed hat functions” that allows to get rid of the boundary basis.

The modiﬁed hat functions are deﬁned by the following method (which corresponds to equation(2.16) in [24]), ˜ φ l,i p x q : “ $’’’’’’&’’’’’’% if l “ ^ i “ " ´ l ´ ¨ x if x P r , h l s otherwise * if l ą ^ i “ " l ´ ¨ x ` p ´ i q{ if x P r ´ h l , s otherwise * if l ą ^ i “ l ´ φ l,i p x q otherwise , (3.28)25he multivariate hat function on R d are obtained by a classical tensor-product approach.For these basis functions, we can remove the points on the boundary of the space so that all thecomponents l j , j “ , ¨ ¨ ¨ , d , are positive for a multi-index level l “ p l , . . . , l d q and a multi-indexposition i “ p i , . . . , i d q , ˜ φ l , i p x q “ d ź j “ ˜ φ l j ,i j p x j q . (3.29)In this multivariate case, the index set are given by I l “ ! i P N d | i j P I l j for all ď j ď d ) . (3.30)Table 3 shows the number of points in the sparse grids without boundary. In particular, weobserve that it is much less than sparse grids with boundary for the same dimensions and levels,recall Table 1. dimensions levels l ď l ď l ď d=2 17 49 129d=4 49 209 769d=5 71 351 1471d=10 241 2001 13441d=20 881 13201 154881d=25 1351 24751 352351d=50 5201 182001 4867201d=100 20401 1394001Table 3: The number of points in the sparse grid approximation without boundary functions fordiﬀerent dimensions and levels. The quadratic model

We come back to the quadratic model introduced in (3.25)-(3.26). Inthis setting, we can test the 100-dimensional version of this model. Let M “ , N “ , T “ , a “ , the convergence of ˆ y and ˆ z , when using the direct algorithm , is shown in Figure 9 andFigure 10: 3819 seconds were spent on this test. The error for ˆ y appears to be less than 0.01.For Z , the true solution is ¯ z i “ W i E r `| W T | s “ , @ i “ , ..., d . The gain in computational timeis important in comparison with the pre-wavelet speciﬁcation of the last section. Not only lessbasis functions are used, but one should also note that the computational cost of a hat functionis less than a pre-wavelet function up to a factor 5. We do also test the Picard algorithm in25-dimensional setting. We set M “ , N “ , T “ , a “ , P “ and get ˆ y « . quite quickly, see Figure 11 (all the initial value of z n,k , ď n ď N ´ , ď k ď K zn are set to in this test). 26igure 9: ˆ y for the quadratic model withd=100 and T=1 Figure 10: ˆ z i for the quadratic modelwith d=100 and T=1Figure 11: ˆ y for the quadratic model by Picard algorithm with d=25, T=1, P=3, N=10, M=1500

The ﬁnancial model of (3.27) For the direct algorithm in this example, we set the parameters N “ , M “ , µ “ . , σ “ . , R l “ . , R b “ . , K “ , K “ , T “ . .Table 4 compares the results for the direct algorithm and deep BSDEs solver : the approximatedvalue for y obtained by the two methods are very close. The running time for the deep BSDEssolver shows almost no increase up to d ď . For the direct algorithm , it does increase with thedimensions but it stays reasonable. Actually, it is even competitive when d ď . Figure 12shows the performance of the direct algorithm when d “ . Finally in Figure 13, the Picardalgorithm is tested on this model with parameter d “ , P “ , and ˆ y « . at the lastiteration. The numerical experiments were realised by C++ 17 on a MacBook Pro 6-core Intel Core i7, using only onecore and compiling with optimisation ﬂag ‘-O3’ in gcc. The deep BSDEs solver [20], using

Tensorﬂow , spendsmost of the time to build the graph for the NN and initialize the variables when the dimension is small then thelearning phase is quick. On the contrary, our algorithm builds the approximation grid space quite eﬃciently (lessthan 1 second when d ď , level ď ) and then the runtime is spent on the the SG algorithm. imensions direct SGD algo with Sparse grids Deep BSDEs solver[20] y

95% CI of y time y

95% CI of y timed=5 8.0966 [8.0226, 8.1705] 3 s 8.1010 [8.0747, 8.1273] 115 sd=10 10.9865 [10.9224, 11.0506] 12 s 10.9216 [10.8944, 10.9489] 120 sd=15 11.848 [11.7853, 11.9107] 33 s 11.8226 [11.7750, 11.8702] 122 sd=20 11.8674 [11.7962, 11.9387] 61 s 11.9508 [11.8965, 12.0051] 127 sd=25 11.7801 [11.6467, 11.9135] 130 s 11.6416 [11.5316, 11.7517] 132 s Table 4: Comparison of the direct algorithm and the deep learning algorithm .Figure 12: ˆ y by direct algorithm ford=20 and T=0.5. Figure 13: ˆ y by Picard algorithm for d=20,T=0.5, P=4, M=5000, N=10.

A challenging example

We now consider a model with an unbounded and complex structuresolution, which has been analyzed in [37]. The value function in this case is given by: u p t, x q “ T ´ td d ÿ i “ p sin p x i q t x i ă u ` x i t x i ě u q ` cos ˜ d ÿ i “ ix i ¸ , x P R d . (3.31)It corresponds to a BSDE, with underlying process given by X t “ x ` ? d I d W t , and x “ . d and driver and terminal condition given respectively by f p t, x, y, z q “ ˆ ` p T ´ t qp d ´ C q ˙ A p x q ` p ´ p T ´ t q C q B p x q ` Cy, “ ˆ ` T ´ t d ˙ A p x q ` B p x q ` C cos ˜ d ÿ i “ ix i ¸ , x P R d , y P R , z P R d ,g p x q “ u p T, x q “ cos ˜ d ÿ i “ ix i ¸ , x P R d , where A p x q “ d d ÿ i “ sin p x i q t x i ă u , B p x q “ d d ÿ i “ x i t x i ě u , C “ p d ` qp d ` q . In Table 5, we compare the approximation of y by using ﬁve diﬀerent algorithms to thetheoretical solution. When the dimension d ď , all the algorithms perform well. However,as already mentioned in [37] the deep learning algorithm [31] fails when d ě (no matter the28hosen initial learning rate and the activation function for the hidden layers, among the tanh,ELU, ReLu and sigmoid ones; besides, taking 3 or 4 hidden layers does not improve the results.)The two deep learning schemes of [37] and our algorithms with sparse grids still works well when d ď . Figure 14 shows the performance of the direct algorithm , ˆ y converges to . when d “ , it is close to the theoretical solution . , and the 95% conﬁdence interval of ˆ y is r . , . s . When the dimension d “ , all the algorithms failed at providing correct esti-mates of the solution as shown in the table, but the errors of our algorithms appear to be smallerthan the errors of deep learning methods. Figure 15 and Figure 16 show the performance of the direct algorithm , Picard algorithm respectively. dimensions Theoreticalsolution SGD algo with L sparsegrids and hat functions DL schemeof HPW [37] DL schemeof HJE [31] direct algo Picard algo DBDP1 DBDP2d=1 1.3776 1.3790 1.3825 1.3720 1.3736 1.3724d=2 0.5707 0.5795 0.5794 0.5715 0.5708 0.5715d=5 0.8466 0.8734 0.8606 0.8666 0.8365 NCd=8 1.1603 1.1745 1.1801 1.1694 1.0758 NCd=10 -0.2149 -0.2439 -0.2594 -0.3105 -0.3961 NC

Table 5: The approach value of ˆ y of diﬀerent methods when T “ .Figure 14: ˆ y Ñ . by direct SGDalgorithm when d=8, N=60, M=10000. Figure 15: ˆ y Ñ ´ . by direct algo-rithm when d=10, N=100, M=10000.29igure 16: ˆ y Ñ ´ . by picard SGD algorithm when d=10, P=8, N=100, M=10000. In this section, we study theoretically the direct algorithm and the

Picard algorithm in order toprove the results stated in Section 2 and 3.1.2. We ﬁrst obtain forward and backward estimates onperturbed BSDEs. This allows in particular to derive the wellposedness of the direct algorithm .However, most of the work concentrates on the

Picard algorithm in Section 4.3. A careful studyof the iterated SGD algorithms allows to prove the convergence announced in Theorem 2.1.Finally, we prove Theorem 3.1 concerning the complexity of the method in the case of periodiccoeﬃcients and using pre-wavelet basis.

In this subsection, we prove general technical estimates for the backward component of a BSDEthat will be used in the proof of the convergence of the numerical methods under study. Wewill essentially compare two processes with dynamics given by (2.11) but taken at two diﬀerentstarting points and controlled processes.The ﬁrst process, denoted by V , is a scheme built with a random driver F satisfying: Assumption 4.1.

1. For all p y, z q P R ˆ R d , F p¨ , y, z q is progressively measurable.2. There exists some deterministic constant C ě such that for any t P r , T s and any p y, y , z, z q P R ˆ R d | F p t, y, z q ´ F p t, y , z q| ď C ` | y ´ y | ` | z ´ z | ˘ . For Z P S d and ζ P L p F q , we thus deﬁne V ζ,Zt “ ζ ´ ż t F p ¯ s, V ζ,Z ¯ s , Z ¯ s q d s ` ż t Z ¯ s d W s , (4.1)30here we introduced the notation ¯ s : “ t n for t n ď s ă t n ` . The second one, denoted ˜ V ,corresponds to the the true solution to the BSDE ˜ V ζ,Zt “ ζ ´ ż t ˜ F p s, ˜ V ζ,Zs , Z s q d s ` ż t Z s d W s , (4.2)where ˜ F satisﬁes the same assumptions as F above. Proposition 4.1.

Let Assumption 4.1 hold for F and ˜ F . For p ζ, ζ q P L p F q ˆ L p F q and p Z, Z q P S d ˆ S d , we consider V ζ,Z and ˜ V ζ ,Z as deﬁned in (4.1) - (4.2) and we set δF : “ ˜ F p¨ , V ζ,Z , Z q ´ F p¨ , V ζ,Z , Z q , η fs “ ˜ F p s, ˜ V ζ ,Z s , Z s q ´ ˜ F p s, ˜ V ζ ,Z ¯ s , Z ¯ s q and η zs “ Z s ´ Z ¯ s . Then,under the above assumptions on F and ˜ F , it holds1. Forward estimate: E « sup t Pr ,T s | ˜ V ζ ,Z t ´ V ζ,Zt | ﬀ ď C ˜ E « | ζ ´ ζ | ` h N ´ ÿ i “ | Z t n ´ Z t n | ﬀ ` E „ż T p| δF ¯ s | ` | η fs | ` | η zs | q d s ¸ .

2. Backward estimate: E « sup t Pr ,T s | ˜ V ζ ,Z t ´ V ζ,Zt | ` h N ´ ÿ i “ | Z t n ´ Z t n | ﬀ ď C E „ | ˜ V ζ ,Z T ´ V ζ,ZT | ` ż T p| δF ¯ s | ` | η fs | ` | η zs | q d s  . Proof.

1. Denote ∆ V : “ ˜ V ζ ,Z ´ V ζ,Z , ∆ Z “ Z ´ Z , ∆ F “ ˜ F p¨ , ˜ V ζ ,Z , Z q ´ ˜ F p¨ , V ζ,Z , Z q and ∆Γ s “ ∆ Z ¯ s ` η zs . Applying Itô’s formula, we compute | ∆ V t | “ | ∆ V | ` ż t t´ V s p ∆ F ¯ s ` δF ¯ s ` η fs q ` | ∆Γ s | u d s ` ż t ∆ V s ∆Γ s d W s . (4.3)Since ˜ F is Lipschitz-continuous, we have | ∆ V s p ∆ F ¯ s ` δF ¯ s ` η fs q| ď p ` L q sup ď r ď s | ∆ V r | ` L | ∆ Z ¯ s | ` | δF ¯ s | ` | η fs | which combined with (4.3) leads to E „ sup ď s ď t | ∆ V s |  ď E „ | ∆ V | ` C ż t t sup ď r ď s | ∆ V r | ` | ∆ Z ¯ s | ` | δF ¯ s | ` | η fs | ` | η zs | u d s  ` E „ sup ď r ď t | ż r ∆ V s p ∆ Z ¯ s ` η zs q d W s |  . (4.4)Applying the Burkholder-Davis-Gundy inequality, we obtain E « sup r Pr ,t s | ż r ∆ V s p ∆ Z ¯ s ` η zs q d W s | ﬀ ď C E „ | ż t | ∆ V s p ∆ Z ¯ s ` η zs q| d s |  ď C ˆ E „ sup ď s ď t | ∆ V s | ` ż t p| ∆ Z ¯ s | ` | η zs | q d s ˙ E „ sup ď r ď t | ∆ V t |  ď | ∆ V | ` C ż t E „ sup ď r ď s | ∆ V r | ` | ∆ Z ¯ s | ` | η fs | ` | δF ¯ s | ` | η zs |  d s . The proof for this step is concluded by applying Grönwall’s Lemma.2. From (4.3), we compute E „ | ∆ V t | ` ż Tt | ∆Γ s | d s  ď E „ | ∆ V T | ` ż Tt ∆ V s p ∆ F ¯ s ` δF ¯ s ` η fs q d s  . We observe that, since ˜ F is Lipschitz continuous, ∆ V s ∆ F ¯ s ď C p| ∆ V s | ` | ∆ V ¯ s | ` | ∆ V s ∆ Z ¯ s |q . For α ą , to be ﬁxed later on, we get using Young’s inequality, ∆ V s ∆ F ¯ s ď C ˆ p ` α q| ∆ V s | ` | ∆ V ¯ s | ` α | ∆ Z ¯ s | ˙ , ď C ˆ p ` α q| ∆ V s | ` | ∆ V ¯ s | ` α | ∆Γ s | ` | η zs | ˙ . For α small enough, we thus obtain E „ | ∆ V t | ` ż Tt | ∆Γ s | d s  ď E „ | ∆ V T | ` C ż Tt ´ | ∆ V s | ` | ∆ V ¯ s | ` | δF ¯ s | ` | η fs | ` | η zs | ¯ d s  . (4.5)Applying Grönwall’s Lemma leads to, for all t ď T , E “ | ∆ V t | ‰ ď C E „ B T ` ż Tt | ∆ V ¯ s | d s  with B T : “ | ∆ V T | ` ż T p| η fs | ` | η zs | ` | δF ¯ s | q d s . (4.6)In particular, for n ď N and t n P π , we have E “ | ∆ V t n | ‰ ď C E « B T ` h N ´ ÿ j “ n | ∆ V t j | ﬀ which in turn, using the discrete-time Grönwall Lemma, leads to max n ď N E “ | ∆ V t n | ‰ ď C E r B T s .Combining this inequality with (4.5) and (4.6), we obtain E „ | ∆ V t | ` ż Tt | ∆Γ s | d s  ď C E r B T s , t ď T. To conclude the proof one applies the Burkholder-Davis-Gundy inequality as in step 1. l .2 Application to the direct algorithm We here prove the results announced in Section 2.2. We start by proving the analytic expressionof the main quantities appearing in the direct algorithm , recall Deﬁnition 2.3.

Proof of Lemma 2.2

By standard computations, recall (2.6), for any ď n ď N ´ , ∇ y k Z u t n “ , and ∇ z n,k p Z u t q l “ ψ kn p X t n q t t “ t n u e l . This leads to, for ď q ď N ´ , ∇ z n,k p Z u t q ¨ ∆ W q q “ ψ kn p X t n q ∆ W n t q “ n u (4.7)and ∇ z n,k f p Y u t q , Z z t q q “ ∇ y f p Y u t q , Z u t q q ∇ z n,k Y u t q ` ψ kn p X t n q ∇ z f p Y u t q , Z u t q q t q “ n u . (4.8)Now, diﬀerentiating both side of (2.10) with respect to the variable y k , ď k ď K y yields ∇ y k Y u t n “ ψ ky p X q n ´ ź j “ ´ ´ h ∇ y f p Y u t j , Z u t j q ¯ for n ě . From (2.10), by diﬀerentiation, we obtain ∇ z n,k Y u “ and using (4.7) and (4.8), for q ě , ∇ z n,k Y u t q “ ∇ z n,k Y u t q ´ ´ ´ h ∇ y f p Y u t q ´ , Z z t q ´ q ¯ ` ψ kn p X t n q t n “ q ´ u ´ ∆ W J q ´ ´ h ∇ z f p Y u t q ´ , Z u t q ´ q ¯ , which in turn yields ∇ z n,k Y u t q “ for q ď n , ∇ z n,k Y u t n ` “ ψ kn p X t n q ` ∆ W J n ´ h ∇ z f p Y u t n , Z u t n q ˘ and for q ě n ` , ∇ z n,k Y u t q “ ψ kn p X t n q ` ∆ W n ´ h ∇ z f p Y u t n , Z u t n q ˘ q ´ ź j “ n ` ´ ´ h ∇ y f p Y u t j , Z u t j q ¯ . This concludes the proof. l Recall that, the time discretization error that will appear in our estimates is classically given by E π “ E « N ´ ÿ n “ ż t n ` t n ` | Y s ´ Y t n | ` | Z s ´ Z t n | ` | X t n ´ X t n | ˘ ds ﬀ . (4.9)The approximation error due to the restriction to the functional space, expressed in (2.18),is also given by E ψ “ E « | u p , X q ´ ¯ u p X q| ` N ´ ÿ n “ h |p σ J ∇ x u qp t n , X t n q ´ ¯ v n p X t n q| ﬀ , (4.10)33here, ¯ v n is the L p R d , P X tn q -projection of the map p σ J ∇ x u qp t n , ¨q onto V zn , ď n ď N ´ and ¯ u is the L p R d , P X q -projection of the map u p , ¨q onto V y . We denote ¯ y the coeﬃcientassociated to the decomposition of ¯ u , namely R d Q x ÞÑ ¯ u p x q “ K y ÿ k “ ¯ y k ψ k p x q P R , (4.11)and also ¯ z n the coeﬃcient associated to the decomposition of ¯ v n , namely R d Q x ÞÑ ¯ v n p x q “ K zn ÿ k “ ¯ z n,k ψ kn p x q P R d , ď n ď N ´ . (4.12)For later use, we introduce a reference solution ¯ u “ p ¯ y , ¯ z q P R K y ˆ R d ¯ K z . (4.13)We ﬁrst discuss the well-posedness of the optimization problem (2.12). Lemma 4.1.

Under Assumption 2.1, Assumption 2.2 (i) and Assumption 2.3, it holds arg min u P R Ky ˆ R d ¯ Kz g p u q ‰ H . Proof.

Let u “ p y , z q P R K y ˆ R d ¯ K z . Using the backward estimate of Proposition 4.1 with V ζ,Z : “ Y u and ˜ V ζ ,Z : “ Y ( ζ “ , Z ” ) yields ~ u ~ “ E « | Y u | ` N ´ ÿ n “ h | Z u t n | ﬀ ď C E “ | Y u T ´ Y T | ‰ ď C p ` E “ | g p X T q ´ Y u T | ‰ q . Under Assumption 2.3, we thus deduce that the continuous function R K y ˆ R d ¯ K z Q u ÞÑ g p u q is coercive. As a consequence, it admits a global minimizer so that the optimization problem(2.12) is well-posed. l The following proposition can be seen as a version of the results in [34] (see Theorem 1 &2) adapted to our context. Let us note that our setting is simpler as we do not deal with fullycoupled Forward Backward SDEs.

Proposition 4.2.

We introduce the following mean squared error: E p : “ E “ ~ u pM ´ ¯ u ~ ‰ , ď p ď P , (4.19)where the sequence p u pM q ď p ď P is given by Deﬁnition 2.7, ~ ¨ ~ is given by Deﬁnition 2.5 and ¯ u isthe reference solution introduced in (4.13). In this subsection, our aim is to establish an upperbound for the quantity E P that will allow us to prove Theorem 2.1.35 .3.1 Preliminary estimatesProposition 4.3. Suppose that Assumption 2.1, Assumption 2.2 (i), (ii) and Assumption 2.3hold. If T p ` L p ` h qq ă and δ h : “ L T ´ T p ` L p ` h qq ă , then for any ε ą such that δ h,ε : “ δ h p ` ε q ă there exists a positive constant C ε such that for any positive integer P E P ď δ Ph,ε E ` C ε ` E RM ` E ψ ` E π ˘ (4.20) with the notation E RM : “ max ď p ď P E ” ~ u pM ´ Φ p u p ´ M q~ ı . (4.21) Proof.

From the decomposition, u pM ´ ¯ u “ Φ M p u p ´ M q ´ Φ p u p ´ M q ` Φ p u p ´ M q ´ ¯ u we obtain, for any ε ą , E p ď p ` ε q E RM ` p ` ε q E ” ~ Φ p u p ´ M q ´ ¯ u ~ ı . Then, using Lemma 4.2 below, we get E p ď δ h,ε E p ´ ` C ε p E ψ ` E π ` E RM q up to a modiﬁcation of ε . By an induction argument, we derive E P ď δ Ph,ε E ` C ε ` E RM ` E ψ ` E π ˘ which concludes the proof. l Lemma 4.2.

Suppose that Assumption 2.1, Assumption 2.2 (i), (ii) and Assumption 2.3 hold.If T p ` L p ` h qq ă and δ h : “ L T ´ T p ` L p ` h qq ă , then, for any ε ą there exists apositive constant C ε ( ε ÞÑ C ε being non-increasing) such that for any ˜ u P R K y ˆ R d ¯ K z it holds ~ Φ p ˜ u q ´ ¯ u ~ ď δ h p ` ε q~ ˜ u ´ ¯ u ~ ` C ε p E ψ ` E π q . Proof.

Step 1:

36e now compute, for any ε ą , E ” | g p X T q ´ U ˜ u , ¯ u T | ı ď p ` ε q E ” | g p X T q ´ U ¯ u , ¯ u T | ı ` p ` ε q E ” | U ¯ u , ¯ u T ´ U ˜ u , ¯ u T | ı , which, combined with the previous inequality, yields ~ Φ p ˜ u q ´ ¯ u ~ ď p ` ε q E ” | g p X T q ´ U ¯ u , ¯ u T | ı ` p ` ε q E ” | U ¯ u , ¯ u T ´ U ˜ u , ¯ u T | ı . (4.22)Since U ¯ u , ¯ u “ Y ¯ u , we can give an upper bound for the ﬁrst term appearing on the right-hand sideof the above inequality by using (4.15), E ” | g p X T q ´ U ¯ u , ¯ u T | ı “ g p ¯ u q ď C p E ψ ` E π q . (4.23) Step 2:

Suppose that Assumption 2.1, Assumption 2.2 (i) and Assumption 2.3 hold. Let ˜ u be a ﬁxed R K y ˆ R d ¯ K z -valued random vector and set ˇ u “ p ˇ y , ˇ z q : “ Φ p ˜ u q and u M “ p y M , z M q : “ Φ M p u , X , W , ˜ u q , M being a positive integer and where p u , X , W q (recall Deﬁnition 2.6) isindependent of ˜ u . We denote by E ˜ u r¨s , the conditional expectation with respect to the sigma-ﬁeld σ p ˜ u q generated by ˜ u . Then, for any positive integer M , the random vector p y M , z M q satisﬁes: E ˜ u ” | y M ´ ˇ y | ı ď L K,M p ` | ˇ y | q and E ˜ u ” | z n, ¨ M,l ´ ˇ z n, ¨ l | ı ď L K,M ˆ h ` | ˇ z n, ¨ l | ˙ (4.26) for any l P t , ¨ ¨ ¨ , d u , with L K,M : “ (cid:37) (cid:37) ´ ` E ” | u | ı¯ M ÿ m “ exp ˆ ´ α K β K p Γ M ´ Γ m q ˙ γ m (4.27) Γ m : “ m ÿ k “ γ k , m ě , (4.28) where the constants (cid:37) , (cid:37) are deﬁned respectively in equation (4.31) and (4.35) below.Moreover, it holds E ˜ u ” ~ Φ M p ˜ u q ´ Φ p ˜ u q~ ı ď κ K L K,M p ` dN q ` κ K α K L K,M ~ Φ p ˜ u q~ . (4.29) Proof.

Step 1:

We prove the estimate for the diﬀerence z n, ¨ M,l ´ ˇ z n, ¨ l . The proof for y M ´ ˇ y follows fromsimilar arguments and we omit some technical details. From (2.41), one gets | H n,l p X , W, ˜ u , z n, ¨ l q| “ p β K ? h q | G ˜ u ´ ω n, ¨ l ¨ z n, ¨ l | | ω n, ¨ l | ď β K p| G ˜ u | ` h | ˜ ω n, ¨ l | | z n, ¨ l | q| ˜ ω n, ¨ l | . Under the boundedness Assumption 2.2 (i), recalling (2.27), we obtain | G ˜ u | ď ` | g | ` T | f | ˘ (4.30)so that setting (cid:37) : “ tp| g | ` T | f | q , u (4.31)and from the lower bound (2.42), we obtain E ˜ u ” | H n,l p X , W, ˜ u , z n, ¨ l q| ı ď (cid:37) p ` h | z n, ¨ l ´ ˇ z n, ¨ l | ` h | ˇ z n, ¨ l | q . (4.32) Step 2:

We now introduce the natural ﬁltration of the algorithm namely F “ p F m q ď m ď M ,deﬁned by F m “ σ p u , X p k q , W p k q , ď k ď m q , m ě , and F “ σ p u q . From the dynamics(2.48), we directly get | z n, ¨ l,m ` ´ ˇ z n, ¨ l | “ | z n, ¨ l,m ´ ˇ z n, ¨ l | ´ γ zm ` H n,l p X p m ` q , W p m ` q , ˜ u , z n, ¨ l,m q ¨ p z n, ¨ l,m ´ ˇ z n, ¨ l q` p γ zm ` q | H n,l p X p m ` q , W p m ` q , ˜ u , z n, ¨ l,m q|

38o that introducing the sequence of F -martingale increments, ∆ M m ` : “ ´ β K ? h ∇ z n, ¨ l H n,l p ˜ u , u m q ´ H n,l p X p m ` q , W p m ` q , ˜ u , z n, ¨ l,m q ¯ ¨ p z n, ¨ l,m ´ ˇ z n, ¨ l q , m ě , we have | z n, ¨ l,m ` ´ ˇ z n, ¨ l | “ | z n, ¨ l,m ´ ˇ z n, ¨ l | ´ β K ? h γ zm ` ∇ z n, ¨ l H n,l p ˜ u , u m q ¨ p z n, ¨ l,m ´ ˇ z n, ¨ l q` γ zm ` ∆ M m ` ` p γ zm ` q | H n,l p X p m ` q , W p m ` q , ˜ u , z n, ¨ l,m q| . Now, from the previous equality, (2.34) and the fact that E ˜ u r ∆ M m ` | F m s “ , recall (2.43), weobtain E ˜ u ” | z n, ¨ l,m ` ´ ˇ z n, ¨ l | ı ď E ˜ u ” | z n, ¨ l,m ´ ˇ z n, ¨ l | ı´ ´ hα K β K ? h γ zm ` ¯ ` p γ zm ` q E ˜ u ” | H n,l p X p m ` q , W p m ` q , ˜ u , z n, ¨ l,m q| ı ď E ˜ u ” | z n, ¨ l,m ´ ˇ z n, ¨ l | ı´ ´ hα K β K ? h γ zm ` ` (cid:37) h p γ zm ` q ¯ ` (cid:37) p γ zm ` q p ` h | ˇ z n, ¨ l | q (4.33)where, for the last inequality, we used (4.32) together with the fact that, since p X p m ` q , W p m ` q q is independent of F m and z m is F m -measurable, one has E ˜ u ” | H n,l p X p m ` q , W p m ` q , ˜ u , z n, ¨ l,m q| | F m ı “ E ˜ u ” | H n,l p X , W, ˜ u , z n, ¨ l q| ı | z n, ¨ l “ z n, ¨ l,m . Observe now that by the very deﬁnition (2.45) of the sequence p γ zm q m ě and using the fact that α K { β K ď and (cid:37) ě , one gets ´ hα K β K ? h γ zm ` ` (cid:37) h p γ zm ` q “ ´ p α K β K γ m ` ´ (cid:37) γ m ` q ě ´ p? (cid:37) γ m ` ´ (cid:37) γ m ` q ě ´ { “ { . As a consequence, Π m : “ ś mk “ p ´ hα K β K ? h γ zk ` (cid:37) h p γ zk q q “ ś mk “ p ´ α K β K γ k ` (cid:37) γ k q is a product of positive terms. From (4.33), we thusdeduce E ˜ u ” | z n, ¨ l,m ` ´ ˇ z n, ¨ l | ı ď Π m ` E ˜ u ” | z n, ¨ l, ´ ˇ z n, ¨ l | ı ` (cid:37) p h ` | ˇ z n, ¨ l | q m ` ÿ q “ Π m ` Π q γ q ď (cid:37) exp ´ ´ α K β K Γ m ` ¯´ E ˜ u ” | z n, ¨ l, | ı ` | ˇ z n, ¨ l | ¯ ` (cid:37) (cid:37) p h ` | ˇ z n, ¨ l | q m ` ÿ q “ exp ˆ ´ α K β K p Γ m ` ´ Γ q q ˙ γ q ď (cid:37) (cid:37) ´ ` E ˜ u ” | z n, ¨ l, | ı¯ ˆ h ` | ˇ z n, ¨ l | ˙ m ` ÿ q “ exp ˆ ´ α K β K p Γ m ` ´ Γ q q ˙ γ q (4.34)where we used the standard inequality ` x ď e x , the fact that ρ ě and introduced thequantity (cid:37) : “ exp p (cid:37) ÿ m ě γ m q . (4.35)39his concludes the proof for (4.26). Step 3:

Recalling Deﬁnition 2.5 and using (2.29) as well as Assumption 2.3, we directly deduce E ˜ u ” ~ Φ M p ˜ u q ´ Φ p ˜ u q~ ı ď E ˜ u ” κ K | y M ´ ˇ y | ` hκ K N ´ ÿ n “ d ÿ l “ | z n, ¨ l,M ´ ˇ z n, ¨ l | ı so that, using (4.26) E ˜ u ” ~ Φ M p ˜ u q ´ Φ p ˜ u q~ ı ď κ K L K,M p ` | ˇ y | q ` hκ K L K,M N ´ ÿ n “ d ÿ l “ p h ` | ˇ z n, ¨ l | q , ď κ K L K,M ` κ K α K L K,M } ˇ y } y ` hκ K L K,M

N d h ` κ K α K L K,M } ˇ z } z , ď κ K L K,M p ` dN q ` κ K α K L K,M ~ ˇ u ~ , which concludes the proof. l The following result provides an upper-bound for the quantity L K,M for a given speciﬁcationof the learning step that is useful to study the complexity of the global algorithm.

Lemma 4.4.

Let Assumption 2.3 hold. For γ ą , ρ P p , q , set γ m : “ γm ´ ρ , m ě . If thenumber of steps M in the stochastic gradient descent algorithm satisﬁes γ α K β K M ´ ρ ě ? , (4.36) then, there exists some positive constant C : “ C p ρ, γ q such that L K,M ď C ˆ e ´ ? p q γ αKβK M ´ ρ ` β K α K M ρ ˙ . (4.37) Remark 4.1.

1. In practice, L K,M should go to zero with respect to the optimal parameters.Thus, we must have that β K α K M ρ goes to zero and α K β K M ´ ρ goes to inﬁnity at the sametime as M goes to inﬁnity. This will be carefully discussed in Section 4.4.3. With theseconstraints, we will naturally have that (4.36) is satisﬁed.2. A careful analysis of the proof below shows that lim ρ Ñ . ` C p ρ, γ q “ `8 , which comes fromthe dependence of C p ρ, γ q with respect to ř `8 m ρ . However, one must bear in mind that ρ is a ﬁxed (but optimised) parameter. Proof of Lemma 4.4.

We have, since Γ m “ γ ř mq “ q ρ , for m ě , γ ´ ρ ` m ´ ρ ´ ˘ ď Γ m ď γ ´ ρ ` m ´ ρ ´ ˘ ` γ (4.38)by standard computations based on comparison between series and integral, leading to Γ m ´ Γ M ď γ ´ ρ ` m ´ ρ ´ M ´ ρ ˘ ` γ. (4.39)40ecalling (4.27), we employ the following decomposition L K,M “ (cid:37) (cid:37) ´ ` E ” | u | ı¯ M ÿ m “ exp ˆ ´ α K β K p Γ M ´ Γ m q ˙ γ m (4.40) ď (cid:37) (cid:37) ´ ` E ” | u | ı¯ exp ˆ γ α K β K ˙ ` A M ` B M ` γ M ˘ (4.41)with A M : “ t M { u ÿ m “ exp ˆ α K β K γ ´ ρ t m ´ ρ ´ M ´ ρ u ˙ γ m ρ ,B M : “ M ´ ÿ m “ t M { u ` exp ˆ α K β K γ ´ ρ t m ´ ρ ´ M ´ ρ u ˙ γ m ρ . For the ﬁrst term A M , we observe that, for m ď t M { u ď M { and ă ρ ă , m ´ ρ ´ M ´ ρ ď ´ ?

22 ln p qp ´ ρ q M ´ ρ . (4.42)We then compute A M ď exp ˆ ´ ? p q γ α K β K M ´ ρ ˙ t M { u ÿ m “ γ m ρ ď C ρ,γ exp ˆ ´ ? p q γ α K β K M ´ ρ ˙ . (4.43)We now study the term B M which reads B M “ γ exp ˆ ´ α K β K γ ´ ρ M ´ ρ ˙ M ´ ÿ m “ t M { u ` λ p m q where, the map λ is deﬁned for x ě by λ p x q : “ exp ˆ α K β K γ ´ ρ x ´ ρ ˙ x ρ . (4.44)We observe that λ is increasing on r t M { u ` , `8q when (4.36) holds. This leads to M ´ ÿ m “ t M { u ` λ p m q ď ż MM { λ p x q d x ď ρ M ρ ż MM { exp ˆ α K β K γ ´ ρ x ´ ρ ˙ x ρ d x ď ρ ´ β K γM ρ α K exp ˆ α K β K γ ´ ρ M ´ ρ ˙ which in turn yields B M ď γβ K M ρ α K . Inserting the previous inequality and estimate (4.43) into (4.41) concludes the proof, since E ” | u | ı ă 8 , recall Deﬁnition 2.7, step 1. and α K β K ď recall Assumption 2.3 and (2.42). l emark 4.2. Let us importantly point out that if one choses γ m “ γ { m , with γ ą , then fromstandard comparison between series and integral Γ m ´ Γ M ď γ p ln p m { M q ` q so that repeatingthe computations of the proof of Lemma 4.4, one has to consider the two disjoint cases γ ă β K α K and γ ą β K α K in order to provide an upper bound for the quantity of interest L K,M . Only thelatter allows to obtain the best convergence rate of order { M . However, in practice, the userdoes not know the exact value of β K α K so that one will often consider higher values of γ thanrequested which will have the undesirable eﬀect to deteriorate the upper-bound as suggested by thevalue of (cid:37) in (4.35) . Moreover, as shown in Section 4.4, the value β K α K actually goes to inﬁnitywhen the prescribed approximation error ε goes to zero so that the latter condition becomes moreand more stringent. We now give an upper bound for the error E P deﬁned by (4.19), with respect to all the algorithm’sparameters. These parameters will be chosen in the next section taking into account the precisespeciﬁcation of the functional approximation space. Proposition 4.4.

Suppose that Assumption 2.1, Assumption 2.2 (i), (ii) and Assumption 2.3hold. Assume that there exists a positive constant η (independent of N , M and the basis functions p ψ q ) such that κ K h ^ α K L K,M ď η . (4.45) If T p ` L p ` h qq ă and δ h,η : “ L T p ` η q ´ T p ` L p ` h qq ă , then for any ε ą there exists apositive constant C ε such that E RM ď C ε κ K h ^ α K L K,M so that, with the notations of Proposition 4.3, it holds E P ď δ Ph,ε E ` C ε ˆ κ K h ^ α K L K,M ` E ψ ` E π ˙ . (4.46) Remark 4.3.

In practice, η will be ﬁxed to be a small constant as the term in the left hand sideof (4.45) should be asymptotically zero. Proof of Proposition 4.4.

Step 1:

From Proposition 4.3, we see that to obtain (4.46), it remains to control E RM “ max ď p ď P E ” ~ Φ M p u p ´ M q ´ Φ p u p ´ M q~ ı . Using Lemma 4.3, we have E ” ~ Φ M p u p ´ M q ´ Φ p u p ´ M q~ ı ď κ K L K,M p ` dN q ` κ K α K L K,M E ” ~ Φ p u p ´ M q~ ı . (4.47)To conclude the proof, we will in the next step, provide an upper bound for the term E ” ~ Φ p u p ´ M q~ ı ,uniformly with respect to p . 42 tep 2: We denote by C ε a constant that may change from line to line along with ε . FromYoung’s inequality and Lemma 4.2, we obtain E “ ~ Φ p u pM q~ ‰ ď p ` ε q E “ ~ Φ p u pM q ´ ¯ u ~ ‰ ` p ` ε q~ ¯ u ~ ď δ h p ` ε q E “ ~ u pM ´ ¯ u ~ ‰ ` C ε p E π ` E ψ ` ~ ¯ u ~ qď δ h p ` ε q E “ ~ u pM ~ ‰ ` C ε p E π ` E ψ ` ~ ¯ u ~ q up to a modiﬁcation of ε . From the previous inequality, we readily obtain E “ ~ Φ p u M q~ ‰ ď C ε .Now, if p is a positive integer, noting again that E “ ~ u pM ~ ‰ ď E ” ~ Φ M p u p ´ M q ´ Φ p u p ´ M q~ ı ` E ” ~ Φ p u p ´ M q~ ı we obtain E “ ~ Φ p u pM q~ ‰ ď δ h p ` ε q E ” ~ Φ p u p ´ M q~ ı ` δ h p ` ε q E ” ~ Φ M p u p ´ M q ´ Φ p u p ´ M q~ ı ` C ε ˆ E π ` E ψ ` ~ ¯ u ~ ˙ so that using (4.47), we get E “ ~ Φ p u pM q~ ‰ ď δ h p ` ε q ˆ ` κ K α K L K,M ˙ E ” ~ Φ p u p ´ M q~ ı ` δ h p ` ε q κ K L K,M p ` dN q ` C ε ˆ E π ` E ψ ` ~ ¯ u ~ ˙ . From (4.45) and the fact that δ h,η , we can set ε such that δ h p ` ε qp ` κ K α K L K,M q ď δ h,η p ` ε q ă so that from the above inequality, by induction, for any positive integer p , we get E “ ~ Φ p u pM q~ ‰ ď p δ h,η p ` ε qq p E “ ~ Φ p u M q~ ‰ ` C ε p E π ` E ψ ` ~ ¯ u ~ ` δ h q ď C ε , which concludes the proof. l We now have all the ingredients to give the proof of the main result announced in Section 2.3 onthe upper bound for the global convergence error at the initial time.

Proof of Theorem 2.1

From the very deﬁnition (2.51) of the global error, we deduce E MSE ď E ” | u p , X q ´ ¯ u p X q| ` | Y ¯ u ´ Y u PM | ı (4.48)where we used the notations introduced in (2.9), (4.11) and (4.13). A fortiori, we have E “ | u p , X q ´ ¯ u p X q| ‰ ď E ψ and E ” | Y ¯ u ´ Y u PM | ı ď E P (4.49)recalling (4.10), (2.30) and (4.19). Combining (4.49) and (4.48) yields E MSE ď p E ψ ` E P q . (4.50)We eventually conclude the proof by invoking Proposition 4.4 together with Lemma 4.4. l .4 Convergence and complexity analysis for sparse grid approximations For this part, we work in the setting of Section 3.1.2. Our goal is to prove the theoreticalupper-bound on the algorithm’s complexity stated in Theorem 3.1.We ﬁrst state the following useful estimate.

Lemma 4.5.

Suppose that Assumption 3.1 is satisﬁed. Let φ : R d Ñ R be a non-negativemeasurable function whose support is included in O and q φ be its - periodisation deﬁned by (3.15) .Then, it holds C ´ E r φ p U qs ď E ” q φ p X t n q ı “ E ” φ p p X t n q ı ď C E r φ p U qs where U has law U pp , q d q and C is given in Lemma 2.1. Proof.

We denote by x ÞÑ p X p t n , x q the density function of X t n given by the Euler-Maruyamascheme taken at time t n and starting from X with law U pp , q d q at time . Note that we have p X p t n , x q “ ş p π p , t n , x, x q p , q d p x q d x and using (2.4), C ´ ż p p c t n , x ´ x q p , q d p x q d x ď p X p t n , x q ď C ż p p c ´ t n , x ´ x q p , q d p x q d x . Then, E ” φ p p X t n q ı “ E ” q φ p X t n q ı ě C ´ ż q φ p x q ż p p c t n , x ´ x q p , q d p x q d x d x so that introducing the notation Ξ “ ξ ` W c t n , ż q φ p x q ż p p c t n , x ´ x q p , q d p x q d x d x “ E ” q φ p Ξ q ı “ E ” φ p p Ξ q ı (4.51)The proof is then concluded by observing that L p p Ξ q “ L p U q . The proof of the upper-boundfollows from similar arguments. l We now provide some upper-bound estimates for the sparse grid approximation error.

Theorem 4.1.

Under Assumption 3.1, there exists a positive constant C : “ C p T, b, σ, d, λ q suchthat E ψ ď C ´ (cid:96) (cid:96) d ´ . (4.52) To obtain an error of order ε for the quantity E ψ one may thus set (cid:96) ε “ log p ε ´ | log p ε q| d ´ q so that, for each n “ , ¨ ¨ ¨ , N ´ , the number of basis functions required satisﬁes K ε “ ε ´ | log p ε q| p d ´ q . roof. For ď i ď d , n “ , ¨ ¨ ¨ , N ´ , setting v i p t n , x q “ p σ J ∇ x u q i p t n , x q t x P O u , x P R d , from(2.2) we have } u } H kmix p O q ` max ď i ď d } v i } H kmix p O q ď C. (4.53)Moreover, from (3.19), we obtain p σ J ∇ x u q i p t n , X t n q “ v i p t n , p X t n q , thus E “ |p σ J ∇ x u qp t n , X t n q ´ Z u t n | ‰ “ d ÿ i “ E «ˇˇˇ v i p t n , p X t n q ´ K ÿ k “ z n,ki ψ kn p p X t n q ˇˇˇ ﬀ ď C d ÿ i “ E «ˇˇˇ v i p t n , U q ´ K ÿ k “ z n,ki ψ kn p U q ˇˇˇ ﬀ with U „ U pp , q d q and where we use the upper-estimate given in Lemma 4.5 to obtain the lastinequality. We also recall that E “ | u p , X q ´ Y u | ‰ “ E « | u p , U q ´ K ÿ k “ y k ψ ky p U q| ﬀ . (4.54)From (2.18) and the previous estimates, we thus deduce E ψ ď C ˜ inf ξ P V y } ξ ´ u p , . q} L p O q ` max ď n ď N ´ d ÿ i “ inf ξ P V zn } ξ ´ v i p t n , . q} L p O q ¸ , ď C ´ (cid:96) (cid:96) d ´ , (4.55)where for the last inequality we used (3.13) and (4.53).Finally, setting (cid:96) ε “ log p ε ´ | log p ε q| d ´ q yields E ψ “ O p ε q as ε Ó and from (3.12) (see alsoRemark 3.1) we deduce that in this case K ε “ ε ´ | log p ε q| p d ´ q . l We now provide some estimates for the value of α K and κ K appearing in Assumption 2.3. Proposition 4.5.

Suppose that Assumption 3.1 holds. In the setting of Section 3.1.2, thereexists a constant k ď such that k “ α K “ κ K . Proof.

For any u “ p y , z q P R K y ˆ R d ¯ K z , any ď n ď N ´ and any l P t , ¨ ¨ ¨ , d u , from (2.6)we have E ” |p Z u t n q l | ı “ E «ˇˇˇ K ÿ k “ z n,kl ψ kn p p X t n q ˇˇˇ ﬀ , C ´ E «ˇˇˇ K ÿ k “ z n,kl ψ kn p U q ˇˇˇ ﬀ ď E ” |p Z u t n q l | ı ď C E «ˇˇˇ K ÿ k “ z n,kl ψ kn p U q ˇˇˇ ﬀ . (4.56)Note that in our setting, ψ kn “ χ k , so it holds ż O ˇˇˇ K ÿ k “ z n,kl ψ kn p z q ˇˇˇ d z “ ż O ˇˇˇ K ÿ k “ z n,kl χ k p x q ˇˇˇ d x . (4.57)Since the basis functions p χ k q ď k ď K forms a Riesz basis [29], there exists a constant c ě suchthat it holds c ´ K ÿ k “ | z n,kl | ď ż r , s d ˇˇˇ K ÿ k “ z n,kl χ k p x q ˇˇˇ d x ď c K ÿ k “ | z n,kl | (4.58)Combining (4.56)-(4.57)-(4.58) and taking into account all the component of Z u t n , we compute c ´ C ´ d K ÿ k “ | z n,kl | ď E “ |p Z u t n q| ‰ ď c C d K ÿ k “ | z n,kl | . (4.59)We also observe that E “ | Y u | ‰ “ E «ˇˇˇ K ÿ k “ y k χ k p X q ˇˇˇ ﬀ . (4.60)Since X „ U pp , q d q , we similarly deduce that c ´ K ÿ k “ | y n,kl | ď E “ | Y u | ‰ ď c K ÿ k “ | y n,kl | . (4.61)The proof is concluded by combining (4.59) and (4.61) with (2.30) and setting k “ c ´ C ´ d ´ . l Under Assumption 3.1, one can set β K “ C d p ` (cid:96) (cid:96) d ´ q (4.62) for some positive constant C d which depends on the PDE dimension d . Proof.

Step 1:

We now quantify the term appearing in the right-hand side of (4.68), namely Q : “ ÿ p l , i qP C | l | “ ` (cid:96) ÿ k “ ÿ l P N d | l | t ζ d p l q“ k u . (4.69)We denote by } l } “ |t j | l j “ u| . For l ‰ , recall that ζ d p l q “ | l | ` } l } ´ p d ´ q (from thedeﬁnition of ζ d ). Thus, Q “ ` (cid:96) ÿ k “ d ´ ÿ q “ ÿ l Pp N ą q d ´ q | l | t| l | “ k ` d ´ ´ q and } l } “ q u “ ` (cid:96) ÿ k “ d ´ ÿ q “ k ` d ´ ´ q C d ´ q ´ k ` d ´ q ´ C qd |t l P p N ą q d ´ q | | l | “ k ` d ´ ´ q u| “ C d ´ q ´ k ` d ´ q ´ . Introducing θ “ d ´ ´ q , we get Q “ ` (cid:96) ÿ k “ d k ` d ´ ÿ θ “ ` θ C d ´ ´ θd (cid:96) ÿ k “ k ´ C θk ´ ` θ (4.70)From Lemma 3.6 in [12], we know that ř (cid:96)k “ k ´ C θk ´ ` θ “ (cid:96) p (cid:96) θ θ ! ` O d p (cid:96) θ ´ qq . We thus obtain Q “ (cid:96) ˆ d p d ´ q ! (cid:96) d ´ ` O d p (cid:96) d ´ q ˙ which combined with (4.68) yields K ÿ k “ E ” | ψ kn p p X t n q| ı ď C d (cid:96) (cid:96) d ´ . (4.71)Combining the previous inequality with (4.64), we obtain E “ | ˜ ω n, ¨ (cid:96) | ‰ ď C d (cid:96) (cid:96) d ´ . (4.72)Using similar arguments, as the basis function are chosen to be the same in our setting, we alsohave E “ | θ | ‰ ď C d (cid:96) (cid:96) d ´ . The proof is then concluded recalling the deﬁnition of β K in (2.42). l We now turn to the analysis of the convergence and complexity of the full Picard algorithm. Thefollowing corollary is a preparatory result and expresses the main convergence results in termsof the parameters P , M , (cid:96) and h . Corollary 4.1.

Suppose that Assumption 3.1 as well as (4.36) and (4.45) hold. Set γ m “ γ { m ρ ,for some ρ P p { , q and γ ą . If T p ` L p ` h qq ă and δ h,η “ L T p ` η q ´ T p ` L p ` h qq ă , then,with the notations of Proposition 4.3, for any ε ą such that δ h,ε : “ δ h,η p ` ε q ă there existconstants C ε : “ C p ε, T, b, σ, d, γ, ρ q ě , c : “ c p T, b, σ, d, γ q ą such that it holds E P ď δ Ph,ε E ` C ε ˆ N e ´ c M ´ ρ ` (cid:96)(cid:96)d ´ ` ` (cid:96) (cid:96) d ´ hM ρ ` ´ (cid:96) (cid:96) d ´ ` h ˙ . (4.73) Proof.

Combining Proposition 4.4 with Theorem 4.1 and (2.19), we obtain E P ď δ Ph,ε E ` C ε ˆ κ K h ^ α K L K,M ` ´ (cid:96) (cid:96) d ´ ` h ˙ . (4.74)From Proposition 4.5, we have that κ K h ^ α K ď Ch , which combined with Lemma 4.4 gives κ K h ^ α K L K,M ď C ρ,γ h ˆ e ´ γ ? p q k βK M ´ ρ ` β K k M ρ ˙ (4.75) ď C ρ,γ h ˆ e ´ c M ´ ρ ` (cid:96)(cid:96)d ´ ` ` (cid:96) (cid:96) d ´ M ρ ˙ (4.76)for some positive constant c , where we used Lemma 4.6 for the last inequality. l We are now ready to establish the complexity of the full Picard algorithm.48 roof of Theorem 3.1

Step 1: Setting the parameters P , N , M , (cid:96) and ρ . We will chose the parameters P , N , M , (cid:96) and ρ in order to achieve a global error E P of order ε ,as this error controls E MSE . We ﬁrst set P “ | log δ p ε q| and N ε “ P T ε ´ T so that h ε “ T { N ε ď ε .From Theorem 4.1, we also know that setting (cid:96) ε “ log p ε ´ | log p ε q| d ´ q , we obtain E ψ “ O d p ε q and K ε “ O d p ε ´ | log p ε q| p d ´ q q .We now set M such that the term K ε M ρ h ε is of order ε , which leads to M ε “ O d p ε ´ ρ | log p ε q| p d ´ q ρ q . For ι ą , we set ¯ ρ “ ι with the constraint ¯ ρ ą and we verify that M ´ ¯ ρ(cid:15) K ε ě cε p ´ ι q | log p ε q| p d ´ qp ρ ´ q for some constant c ą . This leads to e ´ c M ´ ¯ ρε ` (cid:96)ε(cid:96)d ´ ε “ o p ε q and we also have that (4.36) is satisﬁed. Step 2: Computing the complexity C ε . Recalling Remark 2.4, we see that the overall complexity C ε to reach the prescribed approximation accuracy ε satisﬁes C ε “ P ε N ε K ε M ε “ O d ´ | log p ε q| ε ´ ε ´ | log p ε q| p d ´ q ε ´ ι | log p ε q| p d ´ q ι ¯ “ O d p ε ´ p ` ι q | log p ε q| ` ` ι p d ´ q q , which concludes the proof. l We gather below all the parameters values used in the various algorithms, examples and basisfunctions settings. In particular, we recall that the domain speciﬁcation is given in (3.22) and(3.23) The learning rates are given by (3.24). Denoting Γ p λ q : “ p α p λ q , β p λ q , β p λ q , m p λ qq P R , we set the parameters in all the approximating space V zn , ď n ď N ´ to be the same: namely Γ p z n, ¨ q “ Γ p z , ¨ q for all n ě . Thus, in the table below, the parameters of the learning rates aresimply denoted by: Γ : “ t Γ p y q , Γ p z , ¨ q , Γ p z , ¨ qu P R ˆ . lgorithms Basis functions dim N M T Initial z n,k p Γ E MSE

Picard Algorithm

Pre-wavelets 3 10 100000 0.3 0 1 (0.6, 0, 3, 15000),(0.6, 0, 20, 10000) 0.02862 (0.6, 0, 2, 8000),(0.6, 0, 10, 8000 0.02473 (0.6, 0, 1, 8000),(0.6, 0, 5, 8000 0.02194 (0.6, 0, 0.5, 8000),(0.6, 0, 3, 8000) 0.02075 (0.6, 0, 0.5, 8000),(0.6, 0, 3, 8000) 0.0201

Table 6: Parameters for the periodic example

Examples a dim M N Initial y Learning rateQuadratic 1 5 2000 10 0.5 0.01Limits toPicard algorithm -0.4 2 5000 20 0.3 0.002-1.5 2 5000 20 0 0.001Financialexample N.A. 2 6000 20 2 0.0054 6000 20 5 0.0055 5000 20 5 0.00510 5000 20 8 0.00515 5000 20 8 0.00520 5000 20 8 0.00525 5000 20 8 0.005

Table 7: Parameters by model for the deep learning method with layers “ , batchsize “ References [1]

Bally, V., and Pagès, G.

Error analysis of the optimal quantization algorithm forobstacle problems.

Stochastic Processes and their Applications 106 , 1 (2003), 1 – 40.[2]

Bally, V., and Pagès, G.

A quantization algorithm for solving multidimensional discrete-time optimal stopping problems.

Bernoulli 9 , 6 (2003), 1003–1049.[3]

Beck, C., E, W., and Jentzen, A.

Machine learning approximation algorithms forhigh-dimensional fully nonlinear partial diﬀerential equations and second-order backwardstochastic diﬀerential equations.

Journal of Nonlinear Science (2017), 1–57.[4]

Beck, C., Hutzenthaler, M., Jentzen, A., and Kuckuck, B.

An overview on deeplearning-based approximation methods for partial diﬀerential equations. arXiv preprintarXiv:2012.12348 (2020). 50 xamples Basis functions dim N r Initial value Γ y z , ¨ z n,k Quadraticmodel Pre-wavelets 5 10 2 0.5 -0.2 0 (1, 0, 1, 100),(0.8, 0, 1, 100),(0.86, 0.02, 0.05, 100)Hat 100 10 3.2 5.5 0 0 (0.9, 0, 1, 100),(0.7, 0, 1, 100),(1, 0.003, 0.01, 1000)Financialexample Pre-wavelets 2 10 2 2 0 0 (1, 0, 1.5, 1000),(1, 0, 20, 1000),(1, 0.003, 0.01, 1000)4 10 2 5 0 0 (1, 0, 1, 1000),(1, 0, 20, 1000),(1, 0.001, 0.01, 1000)Hat 5 10 2 5 0 0 (0.95, 0, 0.3, 100),(1, 0, 5, 100),(1, 0.001, 0.01, 100)10 10 2.5 8 0 0 (0.9, 0, 0.35, 300),(1, 0, 5, 300),(1, 0.001, 0.01, 300)15 10 2.5 8 0 0 (0.85, 0, 0.3, 500),(1, 0, 5, 500),(1, 0.001, 0.01, 500)20 10 2.5 8 0 0 (0.8, 0, 0.2, 1000),(1, 0, 5, 1000),(1, 0.001, 0.01, 1000)25 10 2.8 8 0 0 (0.7, 0, 0.2, 1500),(1, 0, 5, 1500),(1, 0.001, 0.01, 1500)Thechallengingexample Hat 1 10 2 0.5 0 0 (1, 0, 0.5, 100),(1, 0, 3, 100),(1, 0.1, 0.1, 100)2 20 2 0.1 0 0 (1, 0, 0.5, 100),(1, 0, 5, 100),(1, 0.2, 0.5, 100)5 40 2 0.4 -1 0 (1, 0, 0.5, 300),(0.95, 0, 5, 500),(1, 0.2, 0.1, 500)8 60 2.2 0.6 1 0.08 (1, 0, 0.35, 500),(1, 0, 5, 500),(1, 0.2, 1, 500)10 100 2.5 0.1 -1 -0.1 (1, 0, 0.35, 500),(0.95, 0, 5, 500),(1, 0.2, 1, 500)

Table 8: Parameters by model for the direct algorithm xamples Basis functions dim r Initial value a p Γ y z , ¨ z n,k Quadraticmodel Pre-wavelets 5 2 0.5 -0.2 0 1 1 (1, 0, 0.8, 100),(0.9, 0, 1, 100),(0.84, 0.02, 0.05, 100)2 (1, 0, 0.3, 100),(0.9, 0, 0.4, 100),(0.84, 0.01, 0.02, 100) p ě (1, 0, 0.2, 100),(0.9, 0, 0.2, 100),(0.84, 0.005, 0.01, 100)Hat 25 2.8 2 0.1 0 1 ď p ď (0.9, 0, 0.5, 100),(0.8, 0, 0.8, 100),(1, 0.003, 0.01, 1000)Financialexample Pre-wavelets 4 2 5 0.1 -0.01 N.A. 1 (1, 0, 1, 1000),(1, 0, 20, 1000),(1, 0.001, 0.01, 1000)2 (1, 0, 0.3, 1000),(1, 0, 5, 1000),(1, 0.0005, 0.005, 1000)3 (1, 0, 0.2, 1000),(1, 0, 5, 1000),(1, 0.0003, 0.003, 1000)4, 5 (1, 0, 0.15, 1000),(1, 0, 3, 1000),(1, 0.0002, 0.002, 1000) p ě (1, 0, 0.1, 1000),(1, 0, 2, 1000),(1, 0.0001, 0.001, 1000)Hat 20 2.5 8 0 0 N.A. 1 (0.9, 0, 0.6, 1000),(1, 0, 5, 1000),(1, 0.001, 0.01, 1000)2 (0.9, 0, 0.2, 1000),(1, 0, 4, 1000),(1, 0.001, 0.01, 1000)3 (0.9, 0, 0.15, 1000),(1, 0, 3, 1000),(1, 0.0005, 0.005, 1000) p ě (0.9, 0, 0.1, 1000),(1, 0, 2, 1000),(1, 0.0005, 0.005, 1000)Limits toPicardalgorithm Pre-wavelets 2 2 0.3 0 0 -0.4 ď p ď (1, 0, 1, 300),(1, 0, 1, 300),(1, 0.6, 0.1, 300)0 0 0 -1.5 ď p ď (1, 0, 2, 300),(1, 0, 1, 300),(1, 0.6, 0.1, 300) Table 9: Parameters by model for the

Picard algorithm xamples Basis functions dim r Initial value N p Γ y z , ¨ z n,k Thechallengingexample Hat 1 2 0.5 0 0 10 ď p ď (1, 0, . ˚ p . q p ´ , 100),(1, 0, 3 ˚p . q p ´ , 100),(1, 0.1 ˚p . q p ´ , 0.1 ˚p . q p ´ , 100)2 2 0.1 0 0 20 ď p ď (1, 0, 0.5 ˚p . q p ´ , 100),(1, 0, 5 ˚p . q p ´ , 100),(1, 0.2 ˚p . q p ´ , 0.5 ˚p . q p ´ , 100)5 2 0.4 -2 0 40 ď p ď (1, 0, 0.5 ˚p . q p ´ , 300),(0.95, 0, 5 ˚p . q p ´ , 500),(1, 0.2 ˚p . q p ´ , 0.1 ˚p . q p ´ , 500)8 2.2 0.6 1 0.08 60 ď p ď (1, 0, 0.35 ˚p . q p ´ , 500),(1, 0, 5 ˚p . q p ´ , 500),(1, 0.2 ˚p . q p ´ , p . q p ´ , 500)10 2.5 0.1 -1 -0.1 100 1 (1, 0, 0.25, 500),(0.95, 0, 4, 500),(1, 0.15, 0.5, 500)2 (1, 0, 0.2, 500),(0.95, 0, 3, 500),(1, 0.12, 0.4, 500)3 (1, 0, 0.15, 500),(0.95, 0, 2, 500),(1, 0.1, 0.3, 500) p ě (1, 0, 0.1, 500),(0.95, 0, 1, 500),(1, 0.08, 0.25, 500) p ě (1, 0, 0.05, 500),(0.95, 0, 0.5, 500),(1, 0.05, 0.2, 500) Table 10: Parameters by model for the

Picard algorithm

Bender, C., and Denk, R.

A forward scheme for backward sdes.

Stochastic Processesand their Applications 117 , 12 (2007), 1793–1812.[6]

Benveniste, A., Métivier, M., and Priouret, P.

Adaptive algorithms and stochasticapproximations , vol. 22. Springer Science & Business Media, 2012.[7]

Bergman, Y. Z.

Option pricing with diﬀerential interest rates.

The Review of FinancialStudies 8 , 2 (1995), 475–500.[8]

Bohn, B.

Error analysis of regularized and unregularized least-squares regression on dis-cretized function spaces . PhD thesis, Institute for Numerical Simulation, University of Bonn,(2017).[9]

Bohn, B.

On the convergence rate of sparse grid least squares regression. In

Sparse Gridsand Applications-Miami 2016 . Springer, (2018), pp. 19–41.[10]

Bouchard, B., and Touzi, T.

Discrete-time approximation and monte-carlo simulationof backward stochastic diﬀerential equations.

Stochastic Processes and their Applications111 , 2 (2004), 175 – 206.[11]

Briand, P., and Labart, C.

Simulation of bsdes by wiener chaos expansion.

The Annalsof Applied Probability 24 , 3 (2014), 1129–1171.[12]

Bungartz, H.-J., and Griebel, M.

Sparse grids.

Acta numerica 13 (2004), 147–269.[13]

Chassagneux, J.-F.

Linear multistep schemes for bsdes.

SIAM Journal on NumericalAnalysis 52 , 6 (2014), 2815–2836.[14]

Chassagneux, J.-F., and Garcia Trillos, C.

Cubature method to solve bsdes: Errorexpansion and complexity control.

Mathematics of Computation 89 , 324 (2020), 1895–1932.[15]

Chassagneux, J.-F., and Richou, A.

Numerical simulation of quadratic bsdes.

TheAnnals of Applied Probability 26 , 1 (2016), 262–304.[16]

Crisan, D., and Manolarakis, K.

Solving backward stochastic diﬀerential equationsusing the cubature method: application to nonlinear pricing.

SIAM Journal on FinancialMathematics 3 , 1 (2012), 534–571.[17]

Crisan, D., and Manolarakis, K.

Second order discretization of backward sdes andsimulation with the cubature method.

The Annals of Applied Probability 24 , 2 (2014),652–678.[18]

Crisan, D., Manolarakis, K., and Nizar, T.

On the monte carlo simulation of bsdes:An improvement on the malliavin weights.

Stochastic Processes and their Applications 120 ,7 (2010), 1133 – 1158.[19]

Duflo, M.

Algorithmes stochastiques , vol. 23 of

Mathématiques & Applications (Berlin)[Mathematics & Applications] . Springer-Verlag, Berlin, (1996).[20]

E, W., Han, J., and Jentzen, A.

Deep learning-based numerical methods for high-dimensional parabolic partial diﬀerential equations and backward stochastic diﬀerentialequations.

Communications in Mathematics and Statistics 5 , 4 (2017), 349–380.5421]

El Karoui, N., Peng, S., and Quenez, M. C.

Backward stochastic diﬀerential equationsin ﬁnance.

Mathematical ﬁnance 7 , 1 (1997), 1–71.[22]

Feuersänger, C.

Sparse grid methods for higher dimensional approximation.

PhD thesis,University of Bonn, (2010).[23]

Friedman, A.

Partial diﬀerential equations of parabolic type . Prentice-Hall, Inc., Engle-wood Cliﬀs, N.J., (1964).[24]

Frommert, M., Pflüger, D., Riller, T., Reinecke, M., Bungartz, H.-J., andEnßlin, T.

Eﬃcient cosmological parameter sampling using sparse grids.

Monthly Noticesof the Royal Astronomical Society 406 , 2 (2010), 1177–1189.[25]

Geiss, C., and Labart, C.

Simulation of bsdes with jumps by wiener chaos expansion.

Stochastic Processes and their Applications 126 , 7 (2016), 2123–2162.[26]

Gobet, E., Lemor, J.-P., and Warin, X.

A regression-based monte carlo method tosolve backward stochastic diﬀerential equations.

The Annals of Applied Probability 15 , 3(2005), 2172–2202.[27]

Gobet, E., López-Salas, J. G., Turkedjiev, P., and Vázquez, C.

Stratiﬁed regres-sion monte-carlo scheme for semilinear pdes and bsdes with large scale parallelization ongpus.

SIAM Journal on Scientiﬁc Computing 38 , 6 (2016), C652–C677.[28]

Gobet, E., Turkedjiev, P., et al.

Approximation of backward stochastic diﬀerentialequations using malliavin weights and least-squares regression.

Bernoulli 22 , 1 (2016), 530–562.[29]

Griebel, M., and Oswald, P.

Tensor product type subspace splittings and multileveliterative methods for anisotropic problems.

Advances in Computational Mathematics 4 , 1(1995), 171.[30]

Gu, Y., Yang, H., and Zhou, C.

Selectnet: Self-paced learning for high-dimensionalpartial diﬀerential equations. arXiv preprint arXiv:2001.04860 (2020).[31]

Han, J., Jentzen, A., and E, W.

Overcoming the curse of dimensionality: Solv-ing high-dimensional partial diﬀerential equations using deep learning. arXiv preprintarXiv:1707.02568 (2017).[32]

Han, J., Jentzen, A., and E, W.

Solving high-dimensional partial diﬀerential equationsusing deep learning.

Proceedings of the National Academy of Sciences 115 , 34 (2018), 8505–8510.[33]

Han, J., Jentzen, A., and E, W.

Algorithms for solving high dimensional pdes: Fromnonlinear monte carlo to machine learning. arXiv preprint arXiv:2008.13333 (2020).[34]

Han, J., and Long, J.

Convergence of the deep bsde method for coupled fbsdes.

Proba-bility, Uncertainty and Quantitative Risk 5 , 1 (2020), 1–33.[35]

Henry-Labordère, P., Oudjane, N., Tan, X., Touzi, N., and Warin, X.

Branchingdiﬀusion representation of semilinear pdes and monte carlo approximation.

Ann. Inst. H.Poincaré Probab. Statist. 55 , 1 (02 2019), 184–210.5536]

Hu, Y., Nualart, D., and Song, X.

Malliavin calculus for backward stochastic diﬀer-ential equations and application to numerical solutions.

The Annals of Applied Probability21 , 6 (2011), 2379–2423.[37]

Huré, C., Pham, H., and Warin, X.

Deep backward schemes for high-dimensionalnonlinear pdes.

Mathematics of Computation 89 , 324 (2020), 1547–1579.[38]

Hutzenthaler, M., Jentzen, A., Kruse, T., and E, W.

Multilevel picard iterationsfor solving smooth semilinear parabolic heat equations. arXiv preprint arXiv:1607.03295 (2016).[39]

Kushner, H. J., and Yin, G. G.

Stochastic approximation and recursive algorithms andapplications , second ed., vol. 35 of

Applications of Mathematics (New York) . Springer-Verlag,New York, (2003). Stochastic Modelling and Applied Probability.[40]

Ladyženskaja, O. A., Solonnikov, V. A., and Ural’ceva, N. N.

Linear and quasi-linear equations of parabolic type . Translated from the Russian by S. Smith. Translationsof Mathematical Monographs, Vol. 23. American Mathematical Society, Providence, R.I.,(1968).[41]

Ma, J., and Zhang, J.

Path regularity for solutions of backward stochastic diﬀerentialequations.

Probability Theory and Related Fields 122 , 2 (2002), 163–190.[42]

Menozzi, S., and Lemaire, V.

On some non asymptotic bounds for the euler scheme.

Electron. J. Probab. 15 (2010), 1645–1681.[43]

Pagès, G.

Numerical Probability: An Introduction with Applications to Finance . Universi-text. Springer International Publishing, (2018).[44]

Pagès, G., and Sagna, A.

Improved error bounds for quantization based numericalschemes for bsde and nonlinear ﬁltering.

Stochastic Processes and their Applications 128 , 3(2018), 847 – 883.[45]

Pardoux, E., and Peng, S.

Adapted solution of a backward stochastic diﬀerential equa-tion.

Systems Control Lett. 14 , 1 (1990), 55–61.[46]

Pardoux, E., and Peng, S.

Backward stochastic diﬀerential equations and quasilinearparabolic partial diﬀerential equations. In

Stochastic partial diﬀerential equations and theirapplications (Charlotte, NC, 1991) , vol. 176 of

Lect. Notes Control Inf. Sci.

Springer, Berlin,(1992), pp. 200–217.[47]

Pardoux, E., and Tang, S.

Forward-backward stochastic diﬀerential equations andquasilinear parabolic PDEs.

Probab. Theory Related Fields 114 , 2 (1999), 123–150.[48]

Peng, S.

Probabilistic interpretation for systems of quasilinear parabolic partial diﬀerentialequations.

Stochastics and Stochastics Reports (Print) (1991).[49]

Zhang, J.

A numerical scheme for bsdes.