Numerical resolution of McKean-Vlasov FBSDEs using neural networks∗

Maximilien Germain†, Joseph Mikael‡, Xavier Warin§

October 4, 2019
Abstract
We propose several algorithms to solve McKean-Vlasov Forward Backward Stochastic Differential Equations. Our schemes rely on the approximating power of neural networks to estimate the solution or its gradient through minimization problems. As a consequence, we obtain methods able to tackle both mean field games and mean field control problems in high dimension. We analyze the numerical behavior of our algorithms on several examples including non-linear quadratic models.
This paper is dedicated to the numerical resolution in high dimension of the following McKean-Vlasov Forward Backward Stochastic Differential Equations (MkV FBSDEs)

$$
\begin{cases}
X_t = \xi + \int_0^t b\big(s, X_s, Y_s, Z_s, \mathcal{L}(X_s), \mathcal{L}(Y_s), \mathcal{L}(Z_s)\big)\,\mathrm{d}s + \int_0^t \sigma(s, X_s)\,\mathrm{d}W_s \\
Y_t = g\big(X_T, \mathcal{L}(X_T)\big) + \int_t^T f\big(s, X_s, Y_s, Z_s, \mathcal{L}(X_s), \mathcal{L}(Y_s), \mathcal{L}(Z_s)\big)\,\mathrm{d}s - \int_t^T Z_s\,\mathrm{d}W_s
\end{cases}
\tag{1}
$$

with $b : \mathbb{R} \times \mathbb{R}^d \times \mathbb{R}^k \times \mathbb{R}^{k\times d} \times \mathcal{P}_2(\mathbb{R}^d) \times \mathcal{P}_2(\mathbb{R}^k) \times \mathcal{P}_2(\mathbb{R}^{k\times d}) \mapsto \mathbb{R}^d$, $\sigma : \mathbb{R} \times \mathbb{R}^d \mapsto \mathbb{R}^{d\times d}$, $g : \mathbb{R}^d \times \mathcal{P}_2(\mathbb{R}^d) \mapsto \mathbb{R}^k$, and $f : \mathbb{R} \times \mathbb{R}^d \times \mathbb{R}^k \times \mathbb{R}^{k\times d} \times \mathcal{P}_2(\mathbb{R}^d) \times \mathcal{P}_2(\mathbb{R}^k) \times \mathcal{P}_2(\mathbb{R}^{k\times d}) \mapsto \mathbb{R}^k$. $W_t$ is a $d$-dimensional $\mathcal{F}_t$-Brownian motion, where $(\Omega, \mathcal{F}, \mathcal{F}_t, \mathbb{P})$ is a given filtered probability space and $T > 0$. $\xi$ is a given random variable in $L^2(\Omega, \mathcal{F}_0, \mathbb{P}; \mathbb{R}^d)$ and $\mathcal{P}_2(\mathbb{R}^n)$ stands for the space of square integrable probability measures over $\mathbb{R}^n$ endowed with the 2-Wasserstein distance

$$
W_2(\mu, \nu) = \inf\Big\{ \sqrt{\mathbb{E}[(X - X')^2]} \;\Big|\; X, X' \in L^2(\Omega, \mathcal{F}, \mathbb{P}; \mathbb{R}^n),\ \mathbb{P}_X = \mu,\ \mathbb{P}_{X'} = \nu \Big\}.
\tag{2}
$$

At last, $\mathcal{L}(\cdot)$ is a generic notation for the law of a random variable.

∗ This work is supported by FiME, Laboratoire de Finance des Marchés de l'Energie.
† EDF R&D, maximilien.germain at edf.fr
‡ EDF R&D, joseph.mikael at edf.fr
§ EDF R&D & FiME, xavier.warin at edf.fr

This kind of equation is linked to some non-local PDEs known as master equations. We refer to (Carmona and Delarue 2018b), chapters 4 and 5 of volume 2, for an introduction to the subject. In (Chassagneux, Crisan, and Delarue 2019), (Chassagneux, Crisan, and Delarue 2014), it is shown for example that under regularity conditions, when the driver $b$ is independent of $Z_t$, $\mathcal{L}(Z_t)$ and when $f$ does not depend on $\mathcal{L}(Z_t)$, the resolution of equation (1) provides a way to estimate the solution of the equation

$$
\begin{aligned}
&\partial_t U(t,x,\mu) + b\big(x, U(t,x,\mu), \eta\big) \cdot \partial_x U(t,x,\mu) + \tfrac{1}{2}\mathrm{Tr}\big[\partial_{xx} U(t,x,\mu)\,\sigma\sigma^\top(x,\mu)\big] + f\big(x, U(t,x,\mu), \partial_x U(t,x,\mu)\,\sigma(x,\mu), \eta\big) \\
&\quad + \int_{\mathbb{R}^d} \partial_\mu U(t,x,\mu)(y) \cdot b\big(t, y, U(t,y,\eta), \eta\big)\,\mathrm{d}\mu(y) + \int_{\mathbb{R}^d} \tfrac{1}{2}\mathrm{Tr}\big[\partial_y \partial_\mu U(t,x,\mu)(y)\,\sigma\sigma^\top(y,\mu)\big]\,\mathrm{d}\mu(y) = 0,
\end{aligned}
$$

where $\eta$ is a notation for the image of the probability measure $\mu$ by $x \in \mathbb{R}^d \longrightarrow (x, U(t,x,\eta)) \in \mathbb{R}^{d+1}$. The equation above is non-local due to the integral terms, and the term $\partial_\mu U(t,x,\mu)(y)$ stands for the Wasserstein derivative of $U$ in the direction of the measure at point $(t,x,\mu)$, taken at the continuous coordinate $y$.

Equations (1) also appear in some probabilistic formulations of mean field games or mean field control problems, leading to finding the value function $V$ of the game. Mean field games were introduced by (Lasry and Lions 2006a) and (Lasry and Lions 2006b) to model games with interactions between many similar players. In this theory, a player cannot observe the states of the other players; however, his or her dynamics and cost take into account the distribution of all agents. Two probabilistic approaches based on Forward Backward Stochastic Differential Equations can be used to solve these problems:

• A first approach, called the Pontryagin approach, consists, as shown in (Carmona and Delarue 2018a), in applying the strong Pontryagin principle to these control problems. Under regularity and convexity conditions, $Y_t$ appears to be a stochastic representation of the gradient of the value function $V$.

• Another approach, called the Weak approach, permits to solve the optimization problem by estimating directly $Y_t$ as the value function $V$ of the problem, as shown in (Carmona and Lacker 2015).

The numerical resolution of equations (1) is rather difficult as far as:

• The dynamics are coupled through both the drift and the driver of the BSDE.

• The McKean-Vlasov structure of the problem requires to solve a fixed point in probability spaces.

The weak approach applied to mean field problems gives a problem in low dimension but often with a quadratic coupling in $Z_t$ appearing in the backward dynamics, whereas the Pontryagin approach gives a problem in high dimension but with a linear coupling in $Y_t$, giving some equations easier to solve numerically. In the case of mean field games, only the law of $X_t$ is present in the dynamics of (1). In other applications, individuals interact through their controls instead of their states, as in the application of trade crowding in (Cardaliaguet and Lehalle 2018). The law of the control then appears in the dynamics of (1) and may give some FBSDE depending on the law of $Z_t$ in the weak approach or the law of $Y_t$ in the Pontryagin approach.

In (Chassagneux, Crisan, and Delarue 2019) and (Angiuli et al. 2019), tree and grid algorithms are proposed and tested in dimension 1. It is worth mentioning that these techniques suffer from the so-called curse of dimensionality and cannot be applied when the dimension describing a player state is high (typically greater than 3 or 4). This is due to the discretization of the state space. However, new approaches using machine learning have been developed since 2017, first to solve non-linear PDEs. Two kinds of methods have emerged:

• The first to appear are global methods, first proposed in (Han, Jentzen, and E 2017) to solve semi-linear PDEs. The approach is extended to fully non-linear equations in (Beck, E, and Jentzen 2017) and the authors show that the methodology can solve some equations in high dimension.
In the methods proposed, $Z_t$ is represented by a neural network at each date. (Chan-Wai-Nam, Mikael, and Warin 2019) showed that it is more effective to use a single network for all dates and besides proposed an original algorithm to solve some semi-linear PDEs.

• A second kind of local methods is based on some local optimization problems solved at each time step. Some algorithms are first proposed in (Huré, Pham, and Warin 2019) to solve semi-linear PDEs and the methodology is extended to some fully non-linear PDEs in (Pham and Warin 2019) by combining some ideas proposed in (Beck, Becker, et al. 2019).

Machine learning techniques to solve coupled FBSDEs are investigated by several authors in (Han and Long 2018) and (Ji et al. 2019), and a first method for McKean-Vlasov FBSDEs with delay is studied by (Fouque and Zhang 2019) for a linear quadratic equation. Similar and more general ideas are presented in (Carmona and Laurière 2019) alongside convergence results. The resulting algorithms all rely on the global approach first initiated in (Han, Jentzen, and E 2017).

Our paper aims to extend these methods and to propose new ones for the resolution of McKean-Vlasov FBSDEs in high dimension. We first propose a modification that stabilizes the convergence of the algorithm proposed in (Ji et al. 2019). Our modification permits to deal with a high variance of the estimators used in the dynamics of $X_t$ and $Y_t$. Then we propose a second algorithm relaxing the fixed point iteration by adding a constraint on the distribution in the loss function optimized. At last we propose a resolution scheme based on some local resolution as in (Huré, Pham, and Warin 2019).

To simplify the presentation, we consider first order models, that is to say that the dependency of the drift and cost function with respect to the laws $\mathcal{L}(X_t)$, $\mathcal{L}(Y_t)$, $\mathcal{L}(Z_t)$ only concerns their mean. We define $\mu_t$ as the vector $(\mathbb{E}[X_t], \mathbb{E}[Y_t], \mathbb{E}[Z_t])$.
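In an implementation, this first-order moment vector is simply a batch mean over simulated paths; the following NumPy sketch is our own illustration (array shapes and the function name are assumptions, not from the paper):

```python
import numpy as np

def first_order_moments(X, Y, Z):
    """Empirical estimate of mu_t = (E[X_t], E[Y_t], E[Z_t]) on a batch.

    X: (B, N, d), Y: (B, N, k), Z: (B, N, k, d) hold B simulated paths on
    N time steps; the means are taken over the batch axis only, so one
    moment vector is obtained per time step.
    """
    return X.mean(axis=0), Y.mean(axis=0), Z.mean(axis=0)

# Toy usage: B = 4 paths, N = 3 time steps, d = 2, k = 1
rng = np.random.default_rng(0)
mX, mY, mZ = first_order_moments(rng.normal(size=(4, 3, 2)),
                                 rng.normal(size=(4, 3, 1)),
                                 rng.normal(size=(4, 3, 1, 2)))
```

Higher-order moments would be computed the same way, by averaging powers of the samples over the batch axis.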
This can directly be generalized to any finite number of probability distribution moments. We provide multidimensional tests to show how these machine learning approaches can overcome the curse of dimensionality on some test cases, first coming from a mean field game of controls: we solve the FBSDEs derived from the weak approach and the Pontryagin approach. Then we compare all the methods on some general test cases of FBSDEs involving some linear or quadratic dependence on the processes $X_t$, $Y_t$, $Z_t$ and on their distributions.

The structure of the paper is the following: in sections 2 and 3 we describe the proposed schemes, and in section 4 we provide a numerical study of our methods in dimension 10. We show that our algorithms can solve non linear-quadratic models with small maturities. At last we discuss the opportunity to combine neural networks with an algorithm developed in (Chassagneux, Crisan, and Delarue 2019) and extensively tested in (Angiuli et al. 2019). This tree-based algorithm is used to solve the problem with higher maturities, as it is known that the fixed point iteration method can only be applied on small time intervals.

In this section we propose three global algorithms based on the approach of (Han, Jentzen, and E 2017).

Algorithm principle
As initialization, we set $\mu_t = (x_0, 0, 0)$, $\forall t \in [0, T]$. We propose a generalized and refined version of Algorithm 2 from (Carmona and Laurière 2019). We recall that a similar technique with additional networks is used in (Fouque and Zhang 2019) for delayed McKean-Vlasov equations but is tested only on a one-dimensional linear quadratic example. Our methods also take advantage of different expectation computation methods, introduced in section 2.2. We present in section 4 several tests in dimension 10 where the laws of $X, Y, Z$ are involved.

We consider the Euler-Maruyama discretized FBSDE system (1) on a regular time grid $t_k = \frac{kT}{N}$ for $k \in [\![0, N]\!]$:

$$
\begin{cases}
X_{t_{i+1}} = X_{t_i} + b(t_i, X_{t_i}, Y_{t_i}, Z_{t_i}, \mu_{t_i})\,(t_{i+1} - t_i) + \sigma(t_i, X_{t_i})\,(W_{t_{i+1}} - W_{t_i}) \\
Y_{t_{i+1}} = Y_{t_i} - f(t_i, X_{t_i}, Y_{t_i}, Z_{t_i}, \mu_{t_i})\,(t_{i+1} - t_i) + Z_{t_i}\,(W_{t_{i+1}} - W_{t_i}).
\end{cases}
\tag{3}
$$

This FBSDE is solved by the Merged Deep BSDE method introduced in (Chan-Wai-Nam, Mikael, and Warin 2019). $Z_{t_i}$ is approximated by a feedforward neural network $Z_\theta(t_i, X_{t_i})$ and $Y_0$ is a variable. With this point of view, the discretized Brownian motion $W_t$ acts as training data in the language of machine learning. The motivation for such an approximation comes from the notion of decoupling field, also used for numerical purposes in (Angiuli et al. 2019) or (Chassagneux, Crisan, and Delarue 2019), which gives the existence of functions $u, v$ such that

$$
Y_t = u(t, X_t, \mathcal{L}(X_t)), \qquad Z_t = v(t, X_t, \mathcal{L}(X_t)).
\tag{4}
$$

For numerical purposes, it is enough to consider $Z$ as a function of the couple $(t, X_t)$. In fact, the law of the solution (and therefore its moments) can be seen as a function of $t$. That is why we search for a representation

$$
Y_t = \tilde{u}(t, X_t), \qquad Z_t = \tilde{v}(t, X_t).
$$
(5)

The forward-backward system is transformed into a forward system and an optimization problem aiming to satisfy the terminal condition of the BSDE through the loss function $\mathbb{E}\big[(Y_T - g(X_T, \mu_T))^2\big]$. The computation graph is drawn in figure 1. To simplify the graph, the expectation computation is not presented in the figure. To simplify notations, $X_i := X_{t_i}$ and similarly for $Y$ and $Z$.

Figure 1: Simplified computation graph of our global solver

The loss is minimized with the Adam gradient descent method (Kingma and Ba 2014). In any case, the goal of our scheme is to learn both the optimal control and the distribution of $X_t, Y_t, Z_t$. In the following, $B$ is the batch size, $N$ is the number of time steps and $M$ is the number of previous batch expectations to keep in memory. In the following, $\Delta t := t_{i+1} - t_i = \frac{T}{N}$.

We use a feedforward neural network with 3 hidden layers ($d + 10$ neurons in each) with hyperbolic tangent activation functions and an output layer with identity as activation. It is worth noticing that because of the coupled structure of the FBSDE, we cannot use batch normalization, as far as the distribution of $X_i$ is not stationary over time.

A key step for the methods is to estimate the moments of the processes
$X, Y, Z$. It has a significant effect on the algorithms' performances. We note $\theta_m$ the neural network parameters involved in the estimation and $\mu_i(\theta_m)$ the estimation of $\mu_{t_i}$ at iteration $m$. In the algorithms described below, the approximated processes are considered as functions of the parameters $\theta$ of the neural network. Several methods can be used to approximate the moments of the solution involved in the stochastic McKean-Vlasov dynamics:

• Direct: use the empirical mean of the current batch (in a McKean-Vlasov formulation) or the last batch (in a fixed point formulation). This approach requires to handle very large batches, typically of the order of $B = 10^4$:

$$
\mu_i(\theta_m) = \frac{1}{B}\Big( \sum_{j=1}^B X_i^j(\theta_m),\ \sum_{j=1}^B Y_i^j(\theta_m),\ \sum_{j=1}^B Z_i^j(\theta_m) \Big), \qquad i = 0, \cdots, N-1.
\tag{6}
$$

The direct solver leads to algorithm 1.

Algorithm 1
Direct solver
Let $y_0$ be a variable in $\mathbb{R}^k$, $Z_\eta(\cdot,\cdot)$ be a neural network with parameters $\eta$, defined on $\mathbb{R}_+ \times \mathbb{R}^d$ and valued in $\mathbb{R}^{k\times d}$, so that $\theta = (y_0, \eta)$ is initialized with value $(y_0^0, \eta_0)$.
for m from 0 to K do    ▷ Stochastic gradient iterations
    Set $\forall j \in [\![1, B]\!]$, $X_0^j(\theta_m) = x_0 \in \mathbb{R}^d$, $Y_0^j(\theta_m) = \theta_{m,0} \in \mathbb{R}^k$.
    for i from 0 to N−1 do
        $\mu_i(\theta_m) = \frac{1}{B}\big(\sum_{j=1}^B X_i^j(\theta_m),\ \sum_{j=1}^B Y_i^j(\theta_m),\ \sum_{j=1}^B Z_{\theta_m}(t_i, X_i^j(\theta_m))\big)$
        for j from 1 to B do
            Sample $\xi_i^j$ from a $d$-dimensional standard Gaussian vector.
            $X_{i+1}^j(\theta_m) = X_i^j(\theta_m) + b\big(t_i, X_i^j(\theta_m), Y_i^j(\theta_m), Z_{\theta_m}(t_i, X_i^j(\theta_m)), \mu_i(\theta_m)\big)\,\Delta t + \sqrt{\Delta t}\,\sigma\big(t_i, X_i^j(\theta_m)\big)\,\xi_i^j$
            $Y_{i+1}^j(\theta_m) = Y_i^j(\theta_m) - f\big(t_i, X_i^j(\theta_m), Y_i^j(\theta_m), Z_{\theta_m}(t_i, X_i^j(\theta_m)), \mu_i(\theta_m)\big)\,\Delta t + \sqrt{\Delta t}\,Z_{\theta_m}(t_i, X_i^j(\theta_m))\,\xi_i^j$
        end for
    end for
    $\bar X_N(\theta_m) = \frac{1}{B}\sum_{j=1}^B X_N^j(\theta_m)$, $\quad J(\theta_m) = \frac{1}{B}\sum_{j=1}^B \big(Y_N^j(\theta_m) - g(X_N^j(\theta_m), \bar X_N(\theta_m))\big)^2$
    Calculate $\nabla J(\theta_m)$ by back-propagation. Update $\theta_{m+1} = \theta_m - \rho_m \nabla J(\theta_m)$.
end for

• Dynamic: a method which dynamically updates the estimation on $(M+1)B$ samples. The expectations from the last $M$ batches are kept in memory in an array $(\zeta_{i,r})$, $i = 0, \ldots, N-1$, $r = 0, \ldots$
, $M-1$, taking values in $(\mathbb{R}^d \times \mathbb{R}^k \times \mathbb{R}^{k\times d})^{N\times M}$ and initialized with values $(x_0, 0, 0)^{N\times M}$. At iteration $m-1$, $\nu_i^{(m-1)}$ is defined as the empirical mean on these previous sample paths. On a new batch, the expectation is computed by averaging the previous estimation $\nu_i^{(m-1)}$ and the current batch empirical mean, by the following algorithm used for $i = 0, \cdots, N-1$:

$$
\begin{aligned}
\nu_i^{(m-1)} &= \frac{1}{M}\sum_{r=0}^{M-1} \zeta_{i,r}, \\
\mu_i(\theta_m) &= \frac{M \nu_i^{(m-1)} + \frac{1}{B}\big(\sum_{j=1}^B X_i^j(\theta_m),\ \sum_{j=1}^B Y_i^j(\theta_m),\ \sum_{j=1}^B Z_i^j(\theta_m)\big)}{M+1}, \\
\zeta_{i,\, m \% M} &= \frac{1}{B}\Big(\sum_{j=1}^B X_i^j(\theta_m),\ \sum_{j=1}^B Y_i^j(\theta_m),\ \sum_{j=1}^B Z_i^j(\theta_m)\Big).
\end{aligned}
\tag{7}
$$

The notation $m \% M$ refers to the remainder of the Euclidean division of $m$ by $M$. This technique allows to use smaller batches, of size 100 or 1000. Thus it is more efficient in terms of convergence speed in comparison with the direct approach. This method can be seen as a dynamic fixed point approach.

Remark 1.
The fixed point approach is known to be convergent theoretically only for small maturities. In practice, the theoretical bound on the maturity, found on the simple example given for instance in paragraph 3.1 of (Angiuli et al. 2019), is far too pessimistic. We will see that the restriction is not relevant on all our test cases. Other problems, such as bifurcation problems, will appear first.
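The rolling update (7) amounts to keeping a circular buffer of the last M batch means; the class below is our own small illustration of that bookkeeping (class and variable names are hypothetical, not from the paper):

```python
import numpy as np

class DynamicMoments:
    """Rolling estimator of E[X_{t_i}]: keeps the last M batch means zeta_{i,r}
    and blends them with the current batch mean, as in equation (7)."""

    def __init__(self, M, N, init):
        self.M = M
        # zeta[i, r] stores the batch mean of time step i from the r-th stored batch
        self.zeta = np.tile(np.asarray(init, dtype=float), (N, M, 1))

    def update(self, m, i, batch):
        """Return mu_i at iteration m and store the current batch mean."""
        batch_mean = np.mean(batch, axis=0)
        nu = self.zeta[i].mean(axis=0)            # nu_i^{(m-1)} over the M slots
        mu = (self.M * nu + batch_mean) / (self.M + 1.0)
        self.zeta[i, m % self.M] = batch_mean     # overwrite the oldest slot
        return mu

# With a constant initialization and a constant batch at the same value,
# the estimate stays at that value.
est = DynamicMoments(M=3, N=5, init=[1.0, 1.0])
mu = est.update(m=0, i=2, batch=np.ones((10, 2)))
```

The `m % M` indexing reproduces the circular overwrite of equation (7), so memory stays bounded while the estimator effectively averages over (M+1)B samples.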
The dynamic solver is given in algorithm 2.

Algorithm 2
Dynamic solver
Let $y_0$ be a variable in $\mathbb{R}^k$, $Z_\eta(\cdot,\cdot)$ be a neural network with parameters $\eta$, defined on $\mathbb{R}_+ \times \mathbb{R}^d$ and valued in $\mathbb{R}^{k\times d}$, so that $\theta = (y_0, \eta)$ is initialized with value $(y_0^0, \eta_0)$.
Set $\forall i \in [\![0, N-1]\!]$, $\forall r \in [\![0, M-1]\!]$, $\zeta_{i,r} = (x_0, 0, 0) \in \mathbb{R}^d \times \mathbb{R}^k \times \mathbb{R}^{k\times d}$.
for m from 0 to K do
    Set $\forall j \in [\![1, B]\!]$, $X_0^j(\theta_m) = x_0 \in \mathbb{R}^d$, $Y_0^j(\theta_m) = \theta_{m,0} \in \mathbb{R}^k$.
    for i from 0 to N−1 do
        $\mu_i(\theta_m) = \frac{1}{B}\big(\sum_{j=1}^B X_i^j(\theta_m),\ \sum_{j=1}^B Y_i^j(\theta_m),\ \sum_{j=1}^B Z_{\theta_m}(t_i, X_i^j(\theta_m))\big)$
        $\tilde\mu_i(\theta_m) = \frac{\sum_{r=0}^{M-1}\zeta_{i,r} + \mu_i(\theta_m)}{M+1}$
        for j from 1 to B do
            Sample $\xi_i^j$ from a $d$-dimensional standard Gaussian vector.
            $X_{i+1}^j(\theta_m) = X_i^j(\theta_m) + b\big(t_i, X_i^j(\theta_m), Y_i^j(\theta_m), Z_{\theta_m}(t_i, X_i^j(\theta_m)), \tilde\mu_i(\theta_m)\big)\,\Delta t + \sqrt{\Delta t}\,\sigma(t_i, X_i^j(\theta_m))\,\xi_i^j$
            $Y_{i+1}^j(\theta_m) = Y_i^j(\theta_m) - f\big(t_i, X_i^j(\theta_m), Y_i^j(\theta_m), Z_{\theta_m}(t_i, X_i^j(\theta_m)), \tilde\mu_i(\theta_m)\big)\,\Delta t + \sqrt{\Delta t}\,Z_{\theta_m}(t_i, X_i^j(\theta_m))\,\xi_i^j$
        end for
        $\zeta_{i,\, m\%M} = \mu_i(\theta_m)$
    end for
    $\bar X_N(\theta_m) = \frac{1}{B}\sum_{j=1}^B X_N^j(\theta_m)$, $\quad J(\theta_m) = \frac{1}{B}\sum_{j=1}^B \big(Y_N^j(\theta_m) - g(X_N^j(\theta_m), \bar X_N(\theta_m))\big)^2$
    Calculate $\nabla J(\theta_m)$ by back-propagation. Update $\theta_{m+1} = \theta_m - \rho_m \nabla J(\theta_m)$.
end for

• Expectation: estimate $\mu_t$ by a neural network $\Psi_\kappa$ with input $t$ and parameters $\kappa$. The loss is modified in order to force the estimation to remain close to the empirical mean of each batch at every time:

$$
\mu_i(\theta_m) = \Psi_\kappa(t_i), \qquad i = 0, \cdots, N.
\tag{8}
$$

In this case, a term $\mathbb{E}\big[\lambda \sum_{i=0}^{N-1}\big(\Psi_\kappa(t_i) - \frac{1}{B}\big(\sum_{j=1}^B X_i^j,\ \sum_{j=1}^B Y_i^j,\ \sum_{j=1}^B Z_i^j\big)\big)^2\big]$ is added to
the loss function. We will see that in practice this method is quite involved to use because the performances heavily depend upon the choice of the parameter $\lambda$. This approach provides a relaxation of the fixed point method. The expectation solver is described in algorithm 3.

Algorithm 3 Expectation solver
Let $y_0$ be a variable in $\mathbb{R}^k$, $Z_\eta(\cdot,\cdot)$ defined on $\mathbb{R}_+ \times \mathbb{R}^d$ and $\Psi_\kappa(\cdot)$ defined on $\mathbb{R}_+$ be neural networks with parameters $\eta, \kappa$, taking values respectively in $\mathbb{R}^{k\times d}$ and $\mathbb{R}^d \times \mathbb{R}^k \times \mathbb{R}^{k\times d}$, so that $\theta = (y_0, \eta, \kappa)$ is initialized with value $(y_0^0, \eta_0, \kappa_0)$.
for m from 0 to K do
    Set $\forall j \in [\![1, B]\!]$, $X_0^j(\theta_m) = x_0 \in \mathbb{R}^d$, $Y_0^j(\theta_m) = \theta_{m,0} \in \mathbb{R}^k$.
    for i from 0 to N−1 do
        $\mu_i(\theta_m) = \frac{1}{B}\big(\sum_{j=1}^B X_i^j(\theta_m),\ \sum_{j=1}^B Y_i^j(\theta_m),\ \sum_{j=1}^B Z_{\theta_m}(t_i, X_i^j(\theta_m))\big)$
        for j from 1 to B do
            Sample $\xi_i^j$ from a $d$-dimensional standard Gaussian vector.
            $X_{i+1}^j(\theta_m) = X_i^j(\theta_m) + b\big(t_i, X_i^j(\theta_m), Y_i^j(\theta_m), Z_{\theta_m}(t_i, X_i^j(\theta_m)), \Psi_{\theta_m}(t_i)\big)\,\Delta t + \sqrt{\Delta t}\,\sigma(t_i, X_i^j(\theta_m))\,\xi_i^j$
            $Y_{i+1}^j(\theta_m) = Y_i^j(\theta_m) - f\big(t_i, X_i^j(\theta_m), Y_i^j(\theta_m), Z_{\theta_m}(t_i, X_i^j(\theta_m)), \Psi_{\theta_m}(t_i)\big)\,\Delta t + \sqrt{\Delta t}\,Z_{\theta_m}(t_i, X_i^j(\theta_m))\,\xi_i^j$
        end for
    end for
    $\bar X_N(\theta_m) = \frac{1}{B}\sum_{j=1}^B X_N^j(\theta_m)$, $\quad J(\theta_m) = \frac{1}{B}\sum_{j=1}^B \big(Y_N^j(\theta_m) - g(X_N^j(\theta_m), \bar X_N(\theta_m))\big)^2 + \lambda \sum_{i=0}^{N-1}\big(\mu_i(\theta_m) - \Psi_{\theta_m}(t_i)\big)^2$
    Calculate $\nabla J(\theta_m)$ by back-propagation. Update $\theta_{m+1} = \theta_m - \rho_m \nabla J(\theta_m)$.
end for

We will compare the performances of these techniques on several examples in section 4.
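For concreteness, the penalized objective of the expectation solver can be written as a plain function of batch statistics; the NumPy sketch below is our own illustration of the loss shape only (function and argument names are hypothetical):

```python
import numpy as np

def expectation_solver_loss(Y_N, g_vals, psi_vals, batch_means, lam):
    """Terminal loss plus the lambda-weighted penalty of the expectation solver.

    Y_N, g_vals:           (B, k) terminal Y values and terminal condition values
    psi_vals, batch_means: (N, p) moment-network outputs Psi_kappa(t_i) and the
                           empirical batch means at each time step
    lam:                   penalty weight lambda
    """
    terminal = np.mean(np.sum((Y_N - g_vals) ** 2, axis=-1))
    penalty = lam * np.sum((psi_vals - batch_means) ** 2)
    return terminal + penalty
```

When Ψ_κ matches the batch means exactly, only the terminal term remains, which is why the method can be seen as a relaxation of the fixed point approach; the delicate part in practice is balancing the two terms through λ.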
We also propose a local method inspired by the Deep Backward Dynamic Programming introduced by (Huré, Pham, and Warin 2019) and (Pham and Warin 2019). It considers local minimization problems between contiguous time steps. In this case there are as many networks as time steps. We replace a global optimization setting by a set of smaller problems.

In this method, for $i \in [\![0, N-1]\!]$, $Z_i$ and $Y_i$ are approximated by a neural network $(Z^i_{\theta_m^i}(\cdot), Y^i_{\theta_m^i}(\cdot))$. At iteration $m$ we simulate $X_i$ with the previously computed parameters $\theta_m^i$. The $X_i$'s dynamics being frozen with parameters $\theta_m^i$, we solve backward problems to find the $\theta_{m+1}^i$:

• First, $Y_N$ is set to the terminal condition $g(X_N, \mu_N)$.

• For $i$ from $N-1$ to $0$ we solve successively the local backward problems

$$
\min_{\theta_m^i} \mathbb{E}\bigg[\Big( Y^{i+1}_{\theta_{m+1}^{i+1}}(X_{i+1}) - Y^i_{\theta_m^i}(X_i) + f\big(t_i, X_i, Y^i_{\theta_m^i}(X_i), Z^i_{\theta_m^i}(X_i), \mu_{t_i}\big)\,\Delta t - Z^i_{\theta_m^i}(X_i)\,\Delta W_{t_i} \Big)^2\bigg].
\tag{9}
$$

We can then update the $\theta$ value as $\theta_{m+1}^i$. In the version of the local solver given in algorithm 4, we use the dynamic update of the expectations introduced previously in the dynamic solver of section 2. In this algorithm, $H$ stands for the number of gradient steps to perform at each step of the algorithm and $R$ is the number of samples for the laws estimation.

Algorithm 4 Local solver
Let $(Y^i_{\eta^i}(\cdot), Z^i_{\eta^i}(\cdot))$ be some neural networks defined on $\mathbb{R}^d$ with values in $\mathbb{R}^k \times \mathbb{R}^{k\times d}$ for $i = 0, \cdots, N-1$ and initialized with values $\eta_0^i$. We note $\theta = (\eta^0, \cdots, \eta^{N-1})$.
Set $\forall i \in [\![0, N]\!]$, $\forall r \in [\![0, M-1]\!]$, $\zeta_{i,r} = (x_0, 0, 0) \in \mathbb{R}^d \times \mathbb{R}^k \times \mathbb{R}^{k\times d}$.
for m from 0 to K do
    Sample $\xi_i^j$ from a $d$-dimensional standard Gaussian vector, $i = 0, \cdots, N$, $j = 1, \cdots, R$.
    Set $\forall j \in [\![1, R]\!]$, $X_0^j(\theta_m) = x_0 \in \mathbb{R}^d$.
    for i from 0 to N do    ▷ Forward estimation of the laws
        $\mu_i(\theta_m) = \frac{1}{R}\big(\sum_{j=1}^R X_i^j(\theta_m),\ \sum_{j=1}^R Y^i_{\theta_{m,i}}(X_i^j(\theta_m)),\ \sum_{j=1}^R Z^i_{\theta_{m,i}}(X_i^j(\theta_m))\big)$
        $V_i(\theta_m) = \frac{1}{R}\sum_{j=1}^R \big(X_i^j(\theta_m)\big)^2 - \big(\frac{1}{R}\sum_{j=1}^R X_i^j(\theta_m)\big)^2$
        $\tilde\mu_i(\theta_m) = \frac{\sum_{r=0}^{M-1}\zeta_{i,r} + \mu_i(\theta_m)}{M+1}$
        $\zeta_{i,\, m\%M} = \mu_i(\theta_m)$
        for j from 1 to R do
            $X_{i+1}^j(\theta_m) = X_i^j(\theta_m) + b\big(t_i, X_i^j(\theta_m), Y^i_{\theta_{m,i}}(X_i^j(\theta_m)), Z^i_{\theta_{m,i}}(X_i^j(\theta_m)), \tilde\mu_i(\theta_m)\big)\,\Delta t + \sqrt{\Delta t}\,\sigma(t_i, X_i^j(\theta_m))\,\xi_i^j$
        end for
    end for
    for i from N−1 to 0 do    ▷ Backward resolution
        $\hat\theta_0 = \theta_{m,i}$
        for h from 0 to H−1 do    ▷ Gradient descent with simulated data for X
            for j from 1 to B do
                Sample $\Xi_i^j$, $\Theta_i^j$ from $d$-dimensional standard Gaussian vectors.
                $x_i^j = \mu_i(\theta_m) + \sqrt{V_i(\theta_m)}\,\Theta_i^j$
                $x_{i+1}^j = x_i^j + b\big(t_i, x_i^j, Y^i_{\hat\theta_h}(x_i^j), Z^i_{\hat\theta_h}(x_i^j), \tilde\mu_i(\theta_m)\big)\,\Delta t + \sqrt{\Delta t}\,\sigma(t_i, x_i^j)\,\Xi_i^j$
                if i = N−1 then $Y_{i+1}^j = g\big(x_N^j, \tilde\mu_N(\theta_m)\big)$ else $Y_{i+1}^j = Y^{i+1}_{\theta_{m+1,i+1}}\big(x_{i+1}^j\big)$ end if
            end for
            $J_i(\hat\theta_h) = \frac{1}{B}\sum_{j=1}^B \Big( f\big(t_i, x_i^j, Y^i_{\hat\theta_h}(x_i^j), Z^i_{\hat\theta_h}(x_i^j), \tilde\mu_i(\theta_m)\big)\,\Delta t + Y_{i+1}^j - Y^i_{\hat\theta_h}(x_i^j) - \sqrt{\Delta t}\,Z^i_{\hat\theta_h}(x_i^j)\,\Xi_i^j \Big)^2$
            Calculate $\nabla J_i(\hat\theta_h)$ by back-propagation. Update $\hat\theta_{h+1} = \hat\theta_h - \rho_h \nabla J_i(\hat\theta_h)$.
        end for
        $\theta_{m+1,i} = \hat\theta_H$
    end for
end for

Remark 2. Because we have to learn the dynamics of the forward process, the use of a backward resolution is not as obvious as in (Huré, Pham, Bachouch, et al. 2018). We have to alternate forward dynamic estimations and backward resolutions.

Numerical results
The algorithms are implemented in Python with the Tensorflow library (Abadi et al. 2016). Each numerical experiment is conducted using a node composed of 2 Intel® Xeon® Gold 5122 Processors, 192 Go of RAM, and 2 nVidia® Tesla® V100 16Go GPUs. The multi-GPU parallelization of the global solver is conducted using the Horovod library (Sergeev and Del Balso 2018). The methods we test are:

• Direct: algorithm 1 at page 5. Batch size B = 10000.
• Dynamic: algorithm 2 at page 7. Batch size B = 200 and M = 100.
• Expectation: algorithm 3 at page 8. Batch size B = 2000.
• Local: algorithm 4 at page 9. Batch size B = 300 (Weak), B = 100 (Pontryagin) and M = 20.

If the algorithm is applied to equations coming from the Pontryagin or the Weak approach, it is specified in its name.

We use a linear-quadratic mean field game of controls model studied in (Angiuli et al. 2019) and (Carmona and Delarue 2018a) for comparison. This model is useful for numerical tests as far as the analytic solution is known. The MFG of controls model for the representative player is given by:

$$
\min_{\alpha \in \mathcal{A}} \mathbb{E}\bigg[\int_0^T \Big( \frac{1}{2 c_\alpha}\|\alpha_t\|^2 + \frac{c_X}{2}\|X_t\|^2 - \gamma X_t \cdot \mu_t \Big)\,\mathrm{d}t + \frac{c_g}{2}\|X_T\|^2\bigg]
\quad \text{subject to } X_t = x_0 + \int_0^t \alpha_s\,\mathrm{d}s + \sigma W_t,
\tag{10}
$$

and the fixed point $\mathbb{E}[\alpha_t] = \mu_t$. In this case, the mean field interaction is exerted through the law of the control process. The Pontryagin optimality principle gives the system:

$$
\begin{cases}
\mathrm{d}X_t = -c_\alpha Y_t\,\mathrm{d}t + \sigma\,\mathrm{d}W_t, & X_0 = x_0 \\
\mathrm{d}Y_t = -\big(c_X X_t + \gamma c_\alpha \mathbb{E}[Y_t]\big)\,\mathrm{d}t + Z_t\,\mathrm{d}W_t, & Y_T = c_g X_T.
\end{cases}
\tag{11}
$$
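To see how this system fits the generic template (1), its coefficients can be encoded as plain functions; the sketch below is our own illustration (the parameter values are placeholders, not those used in the experiments), with one Euler-Maruyama step of the discretized scheme (3) on top:

```python
import numpy as np

# Coefficients of the Pontryagin FBSDE for the LQ mean field game of controls:
#   dX_t = -c_alpha * Y_t dt + sigma dW_t,                        X_0 = x0
#   dY_t = -(c_X * X_t + gamma * c_alpha * E[Y_t]) dt + Z_t dW_t, Y_T = c_g * X_T
# Placeholder parameters (illustrative only):
c_alpha, c_X, gamma, c_g, sigma = 0.5, 2.0, 2.0, 0.3, 0.5

def b(t, x, y, z, mean_y):
    """Forward drift; the mean field enters only through mean_y = E[Y_t]."""
    return -c_alpha * y

def f(t, x, y, z, mean_y):
    """BSDE driver, so that dY_t = -f dt + Z_t dW_t."""
    return c_X * x + gamma * c_alpha * mean_y

def g(x_T):
    """Terminal condition Y_T = c_g * X_T."""
    return c_g * x_T

def euler_step(x, y, z, mean_y, t, dt, xi):
    """One Euler-Maruyama step of the coupled discretized system (3)."""
    x_next = x + b(t, x, y, z, mean_y) * dt + sigma * np.sqrt(dt) * xi
    y_next = y - f(t, x, y, z, mean_y) * dt + np.sqrt(dt) * z * xi
    return x_next, y_next
```

In the solvers, `mean_y` would be supplied by one of the moment estimation methods of section 2, closing the McKean-Vlasov fixed point.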
In this case, the output $Z$ of the neural network is a matrix of size $d \times d$ and $Y$ is a vector of size $d$. The weak representation of the value function gives:

$$
\begin{cases}
\mathrm{d}X_t = -c_\alpha \sigma^{-1} Z_t\,\mathrm{d}t + \sigma\,\mathrm{d}W_t, & X_0 = x_0 \\
\mathrm{d}Y_t = -\Big(\frac{c_X}{2}\|X_t\|^2 + \gamma c_\alpha X_t \cdot \sigma^{-1}\mathbb{E}[Z_t] + \frac{c_\alpha}{2}\big\|\sigma^{-1} Z_t\big\|^2\Big)\,\mathrm{d}t + Z_t\,\mathrm{d}W_t, & Y_T = \frac{c_g}{2}\|X_T\|^2.
\end{cases}
\tag{12}
$$

In this case, the output $Z$ of the neural network is a vector of size $d$ and $Y$ is a scalar. Therefore we may be able to work in higher dimensions.

Remark 3.
With LQ models, the dynamics of $Y$ is linear in the Pontryagin approach and quadratic in the Weak approach. Thus the potentially high dimension of one method is counterbalanced by the complex dynamics of the other technique.

We take $c_X = 2$, $x_0 = 1$, $\sigma = 0.$, $\gamma = 2$, $c_\alpha = 2/$, $c_g = 0.$ . If not stated otherwise, the simulations are conducted with $T = 1$, $d = 10$, $\Delta t = 0.$ .

Method \ T      |        |        |        |
Reference       | 0.7709 | 0.1978 | 0.0811 | 0.0125
Pontryagin      | 0.763 (1.3e-03) | 0.187 (2.5e-03) | 0.075 (2.7e-03) | 0.012 (5.0e-03)
Dyn. Pontryagin | 0.762 (2.3e-03) | 0.189 (4.0e-03) | 0.078 (5.5e-03) | 0.013 (6.7e-03)
Exp. Pontryagin | 0.765 (1.3e-03) | 0.198 (6.7e-03) | 0.067 (2.1e-01) | 0.496 (1.0e-01)
Weak            | 0.778 (2.0e-03) | 0.200 (1.4e-02) | 0.092 (2.9e-02) | 0.025 (2.0e-02)
Dyn. Weak       | 0.775 (4.4e-03) | 0.212 (2.0e-02) | 0.083 (4.1e-02) | 0.016 (5.6e-02)
Exp. Weak       | NC              | NC              | NC              | NC
Pontryagin Loc. | 0.767 (4.3e-04) | 0.192 (8.8e-04) | 0.077 (8.8e-04) | 0.011 (1.3e-03)
Weak Loc.       | 0.786 (1.3e-03) | 0.308 (1.2e-02) | 0.054 (3.7e-02) | 0.075 (5.2e-02)

Figure 2: Mean of E[X_T] over the 10 dimensions (and standard deviation) for several maturities T (2000 iterations for global methods, 6000 iterations for local methods) on the price impact model.
Figure 3: Duration times of the methods (2000 iterations for global methods, 6000 iterations for local methods) on the price impact model.
Figure 4: Learning curves for the direct (left) and dynamic (right) Pontryagin method on the price impact model. The loss is the L² error between Y_T and the terminal condition of the backward equation.
Figure 5: Learning curves for the direct (left) and dynamic (right) Weak method on the price impact model. The loss is the L² error between Y_T and the terminal condition of the backward equation.
Figure 6: Learning curves for the expectation Pontryagin (left) and expectation Weak (right) method on the price impact model. The loss is the L² error between Y_T and the terminal condition of the backward equation.
Figure 7: Learning curves for the Local Pontryagin (left) and Local Weak (right) method on the price impact model. The loss is the sum of the local L² errors between the dynamics of Y and the Euler discretization for all time steps.
Figure 8: E[X_T] for the direct (left) and dynamic (right) Pontryagin method on the price impact model.
Figure 9: E[X_T] for the direct (left) and dynamic (right) Weak method on the price impact model.
Figure 10: E[X_T] for the expectation Pontryagin (left) and expectation Weak (right) method on the price impact model.
Figure 11: E[X_T] for the Local Pontryagin (left) and Local Weak (right) method on the price impact model.

Figure 12: First coordinate of the optimal control evaluated on a sample path for the direct (left) and dynamic (right) Pontryagin method after 2000 iterations on the price impact model.
Figure 13: First coordinate of the optimal control evaluated on a sample path for the direct (left) and dynamic (right) Weak method after 2000 iterations on the price impact model.
Figure 14: First coordinate of the optimal control evaluated on a sample path for the expectation Pontryagin (left) and expectation Weak (right) method after 2000 iterations on the price impact model.
Figure 15: First coordinate of the optimal control evaluated on a sample path for the Local Pontryagin (left) and Local Weak (right) method after 6000 iterations on the price impact model.

All methods except Expectation Weak converge to the exact solution for small maturities. As we increase the maturity, the local weak solver does not converge anymore to the right solution. For the local method, it may be due to the lack of a contraction for the fixed point problem when the maturity is too high. In fact, the algorithm converges to the true solution only for small maturities. We will not test the expectation methods on the other test cases, as far as they are less efficient than the other methods. We cannot hope for more iterations to help the convergence in the Weak method, as far as the loss in the learning curves of figure 5 reaches a plateau. The algorithms solving the system coming from the Pontryagin principle perform better than the others. The dynamic estimation of the expectation allows to gain training speed and to stabilize the loss.
In this section we design non linear-quadratic models in order to test the limitations of our methods. We construct general MkV FBSDEs with explicit solutions following a log-normal distribution. Let $X_t^i$ be defined by

$$
\mathrm{d}X_t^i = a^i X_t^i\,\mathrm{d}t + \sigma^i X_t^i\,\mathrm{d}W_t^i,
\tag{13}
$$

$$
X_0^i = \xi^i.
\tag{14}
$$

We obtain explicitly

$$
X_t^i = \xi^i e^{\big(a^i - \frac{(\sigma^i)^2}{2}\big)t + \sigma^i W_t^i}, \qquad
g_t^i := \mathbb{E}[X_t^i] = \xi^i e^{a^i t}, \qquad
k_t^i := \mathbb{E}[(X_t^i)^2] = (\xi^i)^2 e^{(2a^i + (\sigma^i)^2)t}.
$$

We choose $g : (t,x) \mapsto e^{\alpha t}\log\big(\prod_{i=1}^n x^i\big)$ and the following dynamics for $Y_t$:

$$
Y_t = e^{\alpha t}\log\Big(\prod_i X_t^i\Big) = e^{\alpha t}\sum_i \Big[\log(\xi^i) + \Big(a^i - \frac{(\sigma^i)^2}{2}\Big)t + \sigma^i W_t^i\Big],
$$

such that

$$
c_t := \mathbb{E}[Y_t] = e^{\alpha t}\sum_i\Big[\log(\xi^i) + \Big(a^i - \frac{(\sigma^i)^2}{2}\Big)t\Big], \qquad
d_t := \mathbb{E}[Y_t^2] = e^{2\alpha t}\bigg(\Big(\sum_i\Big[\log(\xi^i) + \Big(a^i - \frac{(\sigma^i)^2}{2}\Big)t\Big]\Big)^2 + \sum_i (\sigma^i)^2 t\bigg).
$$

As we want $Y_t = u(t, X_t)$, we have $Z_t^i = \sigma^i X_t^i\,\partial_{x^i} u(t, X_t)$, following:

$$
Z_t^i = \sigma^i e^{\alpha t}, \qquad
e_t^i := \mathbb{E}[Z_t^i] = \sigma^i e^{\alpha t}, \qquad
f_t^i := \mathbb{E}[(Z_t^i)^2] = (\sigma^i)^2 e^{2\alpha t}.
$$

Introducing

$$
\varphi(t,x) := \partial_t u + \sum_i a^i x^i \partial_{x^i} u + \frac12 \sum_i (\sigma^i x^i)^2 \partial^2_{x^i x^i} u
= e^{\alpha t}\Big(\alpha \log\Big(\prod_i x^i\Big) + \sum_i\Big(a^i - \frac{(\sigma^i)^2}{2}\Big)\Big),
$$

$u(t, X_t)$ solves the PDE

$$
\partial_t u + \sum_i a^i x^i \partial_{x^i} u + \frac12 \sum_i (\sigma^i x^i)^2 \partial^2_{x^i x^i} u - \varphi(t,x) = 0.
$$
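These closed-form moments are straightforward to sanity-check by Monte Carlo, since the geometric Brownian motion can be simulated exactly; the sketch below is our own code with toy parameter values (not those of the experiments), in dimension $d = 1$:

```python
import numpy as np

# Toy parameters (illustrative only), d = 1
a, sig, xi, alpha, t, n_mc = 0.1, 0.2, 1.0, 0.3, 1.0, 200_000
rng = np.random.default_rng(42)

# Exact simulation of X_t = xi * exp((a - sig^2/2) t + sig W_t)
W = np.sqrt(t) * rng.standard_normal(n_mc)
X = xi * np.exp((a - 0.5 * sig**2) * t + sig * W)
Y = np.exp(alpha * t) * np.log(X)          # Y_t = e^{alpha t} log X_t

g_t = xi * np.exp(a * t)                                          # E[X_t]
k_t = xi**2 * np.exp((2 * a + sig**2) * t)                        # E[X_t^2]
c_t = np.exp(alpha * t) * (np.log(xi) + (a - 0.5 * sig**2) * t)   # E[Y_t]

err_X = abs(X.mean() - g_t)
err_X2 = abs((X**2).mean() - k_t)
err_Y = abs(Y.mean() - c_t)
```

With 2·10⁵ samples the empirical means agree with $g_t$, $k_t$ and $c_t$ up to the expected Monte Carlo error.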
This semilinear PDE is related to the BSDE associated with the driver $f(t,x) = -\varphi(t,x)$ for the forward dynamics (13). Using some chosen $\mathbb{R}^d$-valued functions $\psi^i$ and $\mathbb{R}^k$-valued functions $\kappa$, we express all dynamics in a McKean-Vlasov setting:

$$\begin{cases}
\mathrm{d}X^i_t = \Big(a^i X^i_t + \psi^i\big(Y_t, Z^i_t, \mathbb{E}[X^i_t], \mathbb{E}[(X^i_t)^2], \mathbb{E}[Y_t], \mathbb{E}[Y_t^2], \mathbb{E}[Z^i_t], \mathbb{E}[(Z^i_t)^2]\big) \\
\qquad\quad - \psi^i\big(e^{\alpha t}\log\big(\prod_i X^i_t\big), \sigma^i e^{\alpha t}, g^i_t, k^i_t, c_t, d_t, e^i_t, f^i_t\big)\Big)\,\mathrm{d}t + \sigma^i X^i_t\,\mathrm{d}W^i_t \\
X^i_0 = \xi^i \\
\mathrm{d}Y_t = -f\big(t, X_t, Y_t, Z_t, \mathbb{E}[X_t], \mathbb{E}[X_t^2], \mathbb{E}[Y_t], \mathbb{E}[Y_t^2], \mathbb{E}[Z_t], \mathbb{E}[Z_t^2]\big)\,\mathrm{d}t + Z_t\,\mathrm{d}W_t \\
Y_T = e^{\alpha T}\log\big(\prod_i X^i_T\big)
\end{cases} \tag{15}$$

with

$$f(t, X_t, Y_t, Z_t, x_1, x_2, y_1, y_2, z_1, z_2) = -\varphi(t, X_t) + \kappa(Y_t, Z_t, x_1, x_2, y_1, y_2, z_1, z_2) - \kappa\Big(e^{\alpha t}\log\Big(\prod_i X^i_t\Big), \sigma^i e^{\alpha t}, g^i_t, k^i_t, c_t, d_t, e^i_t, f^i_t\Big),$$

and $f : \mathbb{R} \times \mathbb{R}^d \times \mathbb{R} \times \mathbb{R}^{1 \times d} \times \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R} \times \mathbb{R} \times \mathbb{R}^{1 \times d} \times \mathbb{R}^{1 \times d} \mapsto \mathbb{R}$. We consider two models of this kind for numerical tests.

Linear

We consider a linear McKean-Vlasov FBSDE in $Y_t$, $Z_t$ and their laws, with dynamics for $X_t$ and $Y_t$:

$$\begin{cases}
\mathrm{d}X^i_t = \Big(a^i X^i_t + b\big(Y_t + Z^i_t + \mathbb{E}[X^i_t] + \mathbb{E}[Y_t] + \mathbb{E}[Z^i_t]\big) - b\Big(e^{\alpha t}\log\Big(\prod_{i=1}^d X^i_t\Big) + \sigma^i e^{\alpha t} + g^i_t + c_t + e^i_t\Big)\Big)\,\mathrm{d}t + \sigma^i X^i_t\,\mathrm{d}W^i_t \\
X^i_0 = \xi^i \\
\mathrm{d}Y_t = \Big(\varphi(t, X_t) + b\Big(Y_t + \frac{1}{d}\sum_{i=1}^d Z^i_t + \frac{1}{d}\sum_{i=1}^d \mathbb{E}[X^i_t] + \mathbb{E}[Y_t] + \frac{1}{d}\sum_{i=1}^d \mathbb{E}[Z^i_t]\Big) \\
\qquad\quad - b\Big(e^{\alpha t}\log\Big(\prod_{i=1}^d X^i_t\Big) + \frac{1}{d}\sum_{i=1}^d \sigma^i e^{\alpha t} + \frac{1}{d}\sum_{i=1}^d g^i_t + c_t + \frac{1}{d}\sum_{i=1}^d e^i_t\Big)\Big)\,\mathrm{d}t + Z_t\,\mathrm{d}W_t \\
Y_T = e^{\alpha T}\log\Big(\prod_{i=1}^d X^i_T\Big)
\end{cases} \tag{16}$$

We take $a = b = 0.1$, $\alpha = 0.$, $\sigma = 0.$, $\xi = 1$.

Method \ T   | 0.25    | 0.75    | 1       | 1.5
Reference    | 1.0253  | 1.0779  | 1.1052  | 1.1618
Global       | 1.025 (1.8e-03) | 1.076 (3.3e-03) | 1.095 (3.9e-03) | 1.162 (7.2e-03)
Dyn. Global  | 1.026 (2.0e-03) | 1.077 (3.6e-03) | 1.105 (2.9e-03) | 1.163 (4.8e-03)
Local        | 1.025 (2.4e-04) | 1.096 (6.2e-04) | 1.158 (9.6e-04) | 1.310 (1.5e-03)

Figure 16: Mean of $\mathbb{E}[X_T]$ over the 10 dimensions (and standard deviation) for several maturities $T$ (2000 iterations for global methods, 6000 iterations for local methods) on the general linear model.

Figure 17: Duration times of the methods (2000 iterations for global methods, 6000 iterations for local methods) on the general linear model.
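The Reference row of figure 16 is the analytic mean $g^i_T = \xi^i e^{a^i T}$ of the forward process. Assuming $a^i = 0.1$ and $\xi^i = 1$ (values we infer, since the tabulated entries match $e^{0.1 T}$ to four digits), the row can be reproduced in a few lines:

```python
import math

# Analytic mean of the forward process: E[X_T] = xi * exp(a * T).
# a = 0.1 and xi = 1 are inferred from the tabulated reference values.
a, xi = 0.1, 1.0
for T in (0.25, 0.75, 1.0, 1.5):
    print(f"T = {T}: E[X_T] = {xi * math.exp(a * T):.4f}")
# T = 0.25: E[X_T] = 1.0253
# T = 0.75: E[X_T] = 1.0779
# T = 1.0: E[X_T] = 1.1052
# T = 1.5: E[X_T] = 1.1618
```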
Figure 18: Learning curves for the Direct Global (left) and Dynamic Global (right) methods on the general linear model. The loss is the $L^2$ error between $Y_T$ and the terminal condition of the backward equation.
Figure 19: Learning curves for the Local method on the general linear model. The loss is the sum over all time steps of the local $L^2$ errors between the dynamics of $Y$ and its Euler discretization.
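The local loss of figure 19 can be written down schematically as follows; the array shapes and names are illustrative assumptions on our part, not the authors' implementation:

```python
import numpy as np

def local_loss(y, z, f_vals, dw, dt):
    """Sum over time steps of the squared error between the simulated
    dynamics of Y and its Euler discretization (cf. figure 19).

    y:      (N, K+1) values of Y along N paths and K time steps
    z:      (N, K, d) values of Z along the paths
    f_vals: (N, K)    driver f evaluated along the paths
    dw:     (N, K, d) Brownian increments
    """
    # Euler step of the backward equation: Y_{k+1} = Y_k - f dt + Z . dW
    pred = y[:, :-1] - f_vals * dt + np.einsum("nkd,nkd->nk", z, dw)
    residual = y[:, 1:] - pred
    return np.mean(np.sum(residual ** 2, axis=1))

# Toy check: data generated exactly by the Euler recursion gives zero loss.
rng = np.random.default_rng(0)
N, K, d, dt = 4, 10, 2, 0.01
z = rng.normal(size=(N, K, d))
dw = rng.normal(scale=np.sqrt(dt), size=(N, K, d))
f_vals = rng.normal(size=(N, K))
y = np.zeros((N, K + 1))
for k in range(K):
    y[:, k + 1] = y[:, k] - f_vals[:, k] * dt + (z[:, k] * dw[:, k]).sum(axis=1)
# local_loss(y, z, f_vals, dw, dt) is then ~0
```

In the actual schemes, $Y$ and $Z$ come from neural networks, so minimizing this quantity trains the networks to satisfy the discretized backward dynamics step by step.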
Figure 20: $\mathbb{E}[X_T]$ for the Direct Global (left) and Dynamic Global (right) methods on the general linear model.
Figure 21: $\mathbb{E}[X_T]$ for the Local method on the general linear model.

Figure 22: First coordinate of $Z_t$ evaluated on a sample path for the Direct Global (left) and Dynamic Global (right) methods after 2000 iterations on the general linear model.

Figure 23: First coordinate of $Z_t$ evaluated on a sample path for the Local method after 6000 iterations on the general linear model.

Figure 24: First coordinate of $Y_t$ evaluated on a sample path for the Direct Global (left) and Dynamic Global (right) methods after 2000 iterations on the general linear model.

Figure 25: $Y_t$ evaluated on a sample path for the Local method after 6000 iterations on the general linear model.

The three algorithms demonstrate good performance on this test case. Both processes $Y$ and $Z$ are well represented by the neural networks. However, the local method is less precise than the global methods as the maturity grows. More iterations could improve the results given by the local method, since its loss has not yet reached a plateau in figure 19.

Quadratic
We consider a quadratic McKean-Vlasov FBSDE in $Y_t$, $Z_t$ and their laws, with dynamics for $X_t$ and $Y_t$:

$$\begin{cases}
\mathrm{d}X^i_t = \Big(a^i X^i_t + b\big(Y_t + Z^i_t + \mathbb{E}[X^i_t] + \mathbb{E}[Y_t] + \mathbb{E}[Z^i_t]\big) - b\Big(e^{\alpha t}\log\Big(\prod_{i=1}^d X^i_t\Big) + \sigma^i e^{\alpha t} + g^i_t + c_t + e^i_t\Big) \\
\qquad\quad + c\big(Y_t^2 + (Z^i_t)^2 + \mathbb{E}[X^i_t]^2 + \mathbb{E}[Y_t]^2 + \mathbb{E}[Z^i_t]^2\big) - c\Big(\Big(e^{\alpha t}\log\Big(\prod_{i=1}^d X^i_t\Big)\Big)^2 + (\sigma^i e^{\alpha t})^2 + (g^i_t)^2 + c_t^2 + (e^i_t)^2\Big)\Big)\,\mathrm{d}t + \sigma^i X^i_t\,\mathrm{d}W^i_t \\
X^i_0 = \xi^i \\
\mathrm{d}Y_t = \Big(\varphi(t, X_t) + b\Big(Y_t + \frac{1}{d}\sum_{i=1}^d Z^i_t + \frac{1}{d}\sum_{i=1}^d \mathbb{E}[X^i_t] + \mathbb{E}[Y_t] + \frac{1}{d}\sum_{i=1}^d \mathbb{E}[Z^i_t]\Big) - b\Big(e^{\alpha t}\log\Big(\prod_{i=1}^d X^i_t\Big) + \frac{1}{d}\sum_{i=1}^d \sigma^i e^{\alpha t} + \frac{1}{d}\sum_{i=1}^d g^i_t + c_t + \frac{1}{d}\sum_{i=1}^d e^i_t\Big) \\
\qquad\quad + c\Big(Y_t^2 + \frac{1}{d}\sum_{i=1}^d (Z^i_t)^2 + \frac{1}{d}\sum_{i=1}^d \mathbb{E}[X^i_t]^2 + \mathbb{E}[Y_t]^2 + \frac{1}{d}\sum_{i=1}^d \mathbb{E}[Z^i_t]^2\Big) - c\Big(\Big(e^{\alpha t}\log\Big(\prod_{i=1}^d X^i_t\Big)\Big)^2 + \frac{1}{d}\sum_{i=1}^d (\sigma^i e^{\alpha t})^2 + \frac{1}{d}\sum_{i=1}^d (g^i_t)^2 + c_t^2 + \frac{1}{d}\sum_{i=1}^d (e^i_t)^2\Big)\Big)\,\mathrm{d}t + Z_t\,\mathrm{d}W_t \\
Y_T = e^{\alpha T}\log\Big(\prod_{i=1}^d X^i_T\Big)
\end{cases} \tag{17}$$

We take $a = b = c = 0.1$, $\alpha = 0.$, $\sigma = 0.$, $\xi = 1$.

Method \ T   | 0.25    | 0.75    | 1       | 1.5
Reference    | 1.0253  | 1.0779  | 1.1052  | 1.1618
Global       | 1.024 (1.8e-03)  | 1.065 (4.3e-03)  | 12.776 (3.3e-02) | DV
Dyn. Global  | 1.025 (2.1e-03)  | 1.072 (3.1e-03)  | 0.961 (7.0e-03)  | DV
Local        | 1.025 (3.3e-04)  | -6.763 (7.5e-04) | 0.409 (2.7e-04)  | DV

Figure 26: Mean of $\mathbb{E}[X_T]$ over the 10 dimensions (and standard deviation) for several maturities $T$ (2000 iterations for global methods, 50 fixed point iterations for local methods) on the general quadratic model. DV indicates divergence of the method.

Figure 27: Duration times of the methods (2000 iterations for global methods, 6000 iterations for local methods) on the general quadratic model.
Figure 28: Learning curves for the Direct Global (left) and Dynamic Global (right) methods on the general quadratic model. The loss is the $L^2$ error between $Y_T$ and the terminal condition of the backward equation.
Figure 29: Learning curves for the Local method on the general quadratic model. The loss is the sum over all time steps of the local $L^2$ errors between the dynamics of $Y$ and its Euler discretization.
Figure 30: $\mathbb{E}[X_T]$ for the Direct Global (left) and Dynamic Global (right) methods on the general quadratic model.
Figure 31: $\mathbb{E}[X_T]$ for the Local method on the general quadratic model.

Figure 32: First coordinate of $Z_t$ evaluated on a sample path for the Direct Global (left) and Dynamic Global (right) methods ($T = 0.75$) after 2000 iterations on the general quadratic model.

Figure 33: First coordinate of $Z_t$ evaluated on a sample path for the Local method ($T = 0.75$) after 6000 iterations on the general quadratic model.

Figure 34: First coordinate of $Y_t$ evaluated on a sample path for the Direct Global (left) and Dynamic Global (right) methods ($T = 0.75$) after 2000 iterations on the general quadratic model.

Figure 35: $Y_t$ evaluated on a sample path for the Local method ($T = 0.75$) after 6000 iterations on the general quadratic model.

We observe convergence for small maturities and divergence beyond $T = 1$. Note that the dynamic estimation of the expectation prevents the algorithm from exploding for $T = 1$, unlike the direct method. However, it does not converge to the true solution in this case: the loss plateaus at the value 2 in figure 28 (right), so the terminal condition of the BSDE is not properly respected. The dynamic method also produces a better approximation of $Y$ (see figure 34). Concerning the local method, we see in figure 31 that the estimated expectations are stable around zero for a few iterations but then become negative. This may be due to the lack of a contraction for the fixed point problem. As we increase the coefficient values in the FBSDE, we seem to face some bifurcations. As stated in (Chassagneux, Crisan, and Delarue 2019) and (Angiuli et al. 2019), we encounter for example bifurcations as we increase the parameter $c_X$ in the price impact problem.
Suspecting that these convergence problems are partly due to the fixed point iteration procedure used in their tree algorithm, the authors of (Chassagneux, Crisan, and Delarue 2019) propose a numerical algorithm limiting the use of the FBSDE fixed point iteration procedure to small maturities. Their algorithm is used with some tree solvers, and we propose to extend it with neural networks.

The algorithm is based on levels of discretization $\{T_0 < T_1 < \dots < T_{L_p}\}$, where $L_p$ stands for the number of levels. On a level $[T_k, T_{k+1}]$, we can divide the resolution period with a time step $\frac{T_{k+1} - T_k}{N_k}$. On each level we solve the problem

$$(Y_{T_k}(\theta_k, \cdot), \mathcal{L}(X_{T_{k+1}})) = \phi_k(\mathcal{L}(X_{T_k}), Y_{T_{k+1}}(\theta_{k+1}, \cdot)), \tag{18}$$

where $\phi_k$ is an operator defined for $t \in [T_k, T_{k+1}]$ taking as arguments:

• $\mathcal{L}(X_{T_k})$, the law of an estimation of the process $X_t$ at date $T_k$: this estimation is obtained by the resolution of the sub-problem (18) on $[T_{k-1}, T_k]$, or by taking this law equal to the law of the initial condition $\xi$ for $k = 0$,

• $Y_{T_{k+1}}(\theta_{k+1}, \cdot)$, an estimation by neural networks of $Y_t$ at date $T_{k+1}$, obtained by the resolution of the sub-problem (18) on $[T_{k+1}, T_{k+2}]$, or by projection of the terminal condition $g$ when $k = L_p - 1$, once an estimation of $\mathcal{L}(X_T)$ is available.

The operator $\phi_k$ gives back an estimation of:

• $Y_{T_k}(\theta_k, \cdot)$, the solution of equations (1) with initial distribution $\mathcal{L}(X_{T_k})$ for $X$ and terminal condition $Y_{T_{k+1}}(\theta_{k+1}, \cdot)$ for $Y_{T_{k+1}}$,

• $\mathcal{L}(X_{T_{k+1}})$, the distribution of the forward process at date $T_{k+1}$,

where the resolution on each level $k$ is achieved by optimizing neural networks parametrized by $\theta_k$ with the solvers proposed in section 2 or 3.

Using the operators $\{\phi_k\}_{k=1,\dots,L_p}$ and iterating between levels, it is shown in (Chassagneux, Crisan, and Delarue 2019) that the algorithm converges to the solution when uniqueness holds, the driver $b$ is independent of $Z_t$, $\mathcal{L}(Z_t)$, and $f$ does not depend on $\mathcal{L}(Z_t)$.

We give a recursive version of the procedure in algorithm 5, where $J$ is a parameter for the convergence of the iterations between levels.

Algorithm 5
Level iteration algorithm

function Solver($k$, $\mathcal{L}(X_{T_k})$)
  if $k == L_p - 1$ then
    $(Y_{T_k}(\theta_k, \cdot), \mathcal{L}(X_{T_{k+1}})) = \phi_k(\mathcal{L}(X_{T_k}), Y_{T_{k+1}}(\theta_{k+1}, \cdot))$
  else
    Initialize $\mathcal{L}(X_{T_{k+1}}) = \mathcal{L}(X_{T_k})$
    for $1 \le j \le J$ do
      $Y_{T_{k+1}}(\theta_{k+1}, \cdot)$ = Solver($k + 1$, $\mathcal{L}(X_{T_{k+1}})$)
      $(Y_{T_k}(\theta_k, \cdot), \mathcal{L}(X_{T_{k+1}})) = \phi_k(\mathcal{L}(X_{T_k}), Y_{T_{k+1}}(\theta_{k+1}, \cdot))$
    end for
  end if
  return $Y_{T_k}(\theta_k, \cdot)$
end function

We implemented the algorithm solving (18) with the local and global solvers, without their dynamic versions for simplicity. Both versions work well, but we do not present the results since the methodology does not allow us to extend the maturity treated. For both solvers, the algorithm converges only when the direct one also converges to the true solution. Moreover, because of the need to iterate between levels, the resolution time, previously limited to a few minutes, explodes to hours. These results seem to indicate that the limiting factor of the resolution is not the fixed point iteration procedure with neural networks.

Remark 4.
In (Angiuli et al. 2019), results show that the bifurcation appears later with their scheme based on trees. In fact, their scheme is more "explicit" than ours, and between the local and the global solver, the local solver is more explicit than the global one. So it seems that the more implicit the scheme is, the later the bifurcation problem appears.
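The recursive level iteration of algorithm 5 can be sketched in Python. Here `phi[k]` is an abstract stand-in for the per-level solver of (18); laws and $Y$-estimates are opaque objects, and all names are illustrative, not the authors' code:

```python
def level_solver(k, law_x_k, phi, L_p, J):
    """Recursive level iteration (a sketch of Algorithm 5).

    `phi[k]` solves the FBSDE on [T_k, T_{k+1}]: given (law of X at T_k,
    estimate of Y at T_{k+1}), it returns (estimate of Y at T_k, law of
    X at T_{k+1}).  The real solvers of sections 2 and 3 would fill it in.
    """
    if k == L_p - 1:
        # Last level: the terminal condition g plays the role of Y_{T_{k+1}}
        y_k, _ = phi[k](law_x_k, None)
        return y_k
    law_x_next = law_x_k  # initialize L(X_{T_{k+1}}) = L(X_{T_k})
    y_k = None
    for _ in range(J):    # iterate J times between levels
        y_next = level_solver(k + 1, law_x_next, phi, L_p, J)
        y_k, law_x_next = phi[k](law_x_k, y_next)
    return y_k

# Toy check with a bookkeeping operator (purely illustrative):
calls = []
def make_phi(k):
    def op(law, y_next):
        calls.append(k)
        return (f"Y_{k}", law)
    return op
phi = {k: make_phi(k) for k in range(3)}
result = level_solver(0, "law_of_xi", phi, L_p=3, J=2)
# result == "Y_0"; each deeper level is solved J times per outer iteration
```

The toy run shows why the resolution time explodes: with $J$ iterations per level, level $k$ is solved on the order of $J^k$ times.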
Conclusion

We have shown that neural network methods can solve some high dimensional FBSDEs of McKean-Vlasov type. Comparing the different algorithms used, we find that:

• The dynamic update of the expectation is efficient in terms of computation speed (about 30% faster than the direct method) and seems to smooth the learning curve.

• Pontryagin performs better than Weak for large maturities. On the contrary, the Weak approach is the best for small maturities.

• For the linear model we observe no convergence problem, whereas for the quadratic one we can solve only on a small horizon.

• The local method faces more difficulties on quadratic problems than the global methods do. It also requires more iterations to converge.

• The methods can be used in dimension 10, and thus applied to more realistic problems than usual. For instance, in the price impact model, the dimension corresponds to the number of assets involved in the trading, so methods able to deal with high-dimensional problems can help to handle large portfolios.

• The previously developed algorithm used to postpone the appearance of bifurcations does not provide a way to extend the maturity with neural networks.
References
Abadi, Martin et al. (2016). "TensorFlow: A System for Large-scale Machine Learning". In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation. OSDI'16. Savannah, GA, USA: USENIX Association, pp. 265–283. isbn: 978-1-931971-33-1. url: http://dl.acm.org/citation.cfm?id=3026877.3026899.

Angiuli, Andrea, Christy V. Graves, Houzhi Li, Jean-François Chassagneux, François Delarue, and René Carmona (2019). "Cemracs 2017: numerical probabilistic approach to MFG". In: ESAIM: Proceedings and Surveys 65, pp. 84–113.

Beck, Christian, Sebastian Becker, Patrick Cheridito, Arnulf Jentzen, and Ariel Neufeld (2019). "Deep splitting method for parabolic PDEs". In: arXiv preprint arXiv:1907.03452.

Beck, Christian, Weinan E, and Arnulf Jentzen (2017). "Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations". In: CoRR abs/1709.05963.

Cardaliaguet, Pierre and Charles-Albert Lehalle (2018). "Mean field game of controls and an application to trade crowding". In: Mathematics and Financial Economics.

Carmona, René and François Delarue (2018a). Probabilistic Theory of Mean Field Games with Applications I. Springer. doi: 10.1007/978-3-319-58920-6. url: https://hal.archives-ouvertes.fr/hal-01868147.

Carmona, René and François Delarue (2018b). Probabilistic Theory of Mean Field Games with Applications I-II. Springer.

Carmona, René and Daniel Lacker (2015). "A probabilistic weak formulation of mean field games and applications". In: Ann. Appl. Probab. url: https://doi.org/10.1214/14-AAP1020.

Carmona, René and Mathieu Laurière (2019). Convergence Analysis of Machine Learning Algorithms for the Numerical Solution of Mean Field Control and Games: II – The Finite Horizon Case.

Chan-Wai-Nam, Quentin, Joseph Mikael, and Xavier Warin (2019). "Machine Learning for Semi Linear PDEs". In: Journal of Scientific Computing.

Chassagneux, Jean-François, Dan Crisan, and François Delarue (2014). "A probabilistic approach to classical solutions of the master equation for large population equilibria". In: arXiv preprint arXiv:1411.3009.

— (2019). "Numerical method for FBSDEs of McKean-Vlasov type". In: Annals of Applied Probability.

Fouque, Jean-Pierre and Zhaoyu Zhang (2019). Deep Learning Methods for Mean Field Control Problems with Delay.

Han, Jiequn, Arnulf Jentzen, and Weinan E (2017). "Solving high-dimensional partial differential equations using deep learning". In: Proceedings of the National Academy of Sciences.

Han, Jiequn and Jihao Long (2018). Convergence of the Deep BSDE Method for Coupled FBSDEs.

Huré, Côme, Huyên Pham, Achref Bachouch, and Nicolas Langrené (2018). Deep neural networks algorithms for stochastic control problems on finite horizon: convergence analysis.

Huré, Côme, Huyên Pham, and Xavier Warin (2019). Some machine learning schemes for high-dimensional nonlinear PDEs.

Ji, Shaolin, Shige Peng, Ying Peng, and Xichuan Zhang (2019). Three algorithms for solving high-dimensional fully-coupled FBSDEs through deep learning.

Kingma, Diederik and Jimmy Ba (2014). "Adam: A Method for Stochastic Optimization". In: International Conference on Learning Representations.

Lasry, Jean-Michel and Pierre-Louis Lions (2006a). "Jeux à champ moyen. I – Le cas stationnaire". In: