Mean-field Markov decision processes with common noise and open-loop controls
aa r X i v : . [ m a t h . O C ] D ec Mean-field Markov decision processes with common noise andopen-loop controls
M´ed´eric MOTTE ∗ Huyˆen PHAM † December 18, 2019
Abstract
We develop an exhaustive study of Markov decision process (MDP) under meanfield interaction both on states and actions in the presence of common noise, and whenoptimization is performed over open-loop controls on infinite horizon. Such model,called CMKV-MDP for conditional McKean-Vlasov MDP, arises and is obtained hererigorously with a rate of convergence as the asymptotic problem of N -cooperative agentscontrolled by a social planner/influencer that observes the environment noises but notnecessarily the individual states of the agents. We highlight the crucial role of relaxedcontrols and randomization hypothesis for this class of models with respect to classicalMDP theory. We prove the correspondence between CMKV-MDP and a general liftedMDP on the space of probability measures, and establish the dynamic programmingBellman fixed point equation satisfied by the value function, as well as the existenceof ǫ -optimal randomized feedback controls. The arguments of proof involve an originalmeasurable optimal coupling for the Wasserstein distance. This provides a procedurefor learning strategies in a large population of interacting collaborative agents. MSC Classification:
Key words:
Mean-field, Markov decision process, conditional propagation of chaos, mea-surable coupling, randomized control. ∗ LPSM, Universit´e de Paris medericmotte at gmail.com The author acknowledges support of the DIMMathInnov. † LPSM, Universit´e de Paris, and CREST-ENSAE, pham at lpsm.paris This work was partially supportedby the Chair Finance & Sustainable Development / the FiME Lab (Institut Europlace de Finance) ontents N -agent and the limiting McKean-Vlasov MDP 53 Lifted MDP on P ( X ) X . . . . . . . . . . . . . . . . . . . . . . . . 16 P ( X ) P ( X ) . . . . . . . . . . . . . . . . . . . . . . . . . 194.2 Bellman fixed point on P ( X ) . . . . . . . . . . . . . . . . . . . . . . . . . . 214.3 Building ǫ -optimal randomized feedback controls . . . . . . . . . . . . . . . 234.4 Relaxing the randomization hypothesis . . . . . . . . . . . . . . . . . . . . . 284.5 Computing value function and ǫ -optimal strategies in CMKV-MDP . . . . . 31 Optimal control of McKean-Vlasov (MKV) systems, also known as mean-field control(MFC) problems, has sparked a great interest in the domain of applied probabilities duringthe last decade. In these optimization problems, the transition dynamics of the system andthe reward/gain function depend not only on the state and action of the agent/controller,but also on their probability distributions. These problems are motivated from models oflarge population of interacting cooperative agents obeying to a social planner (center ofdecision), and are often justified heuristically as the asymptotic regime with infinite num-ber of agents under Pareto efficiency. Such problems have found numerous applications indistributed energy, herd behavior, finance, etc.A large literature has already emerged on continuous-time models for the optimal controlof McKean-Vlasov dynamics, and dynamic programming principle (in other words timeconsistency) has been established in this context in the papers [14], [16], [1], [7]. We pointout the work [13], which is the first paper to rigorously connect mean-field control to largesystems of controlled processes, see also the recent paper [9], and refer to the books [2], [5]for an overview of the subject.
Our work and main contributions.
In this paper, we introduce a general discretetime framework by providing an exhaustive study of Markov decision process (MDP) undermean-field interaction in the presence of common noise , and when optimization is performed2ver open-loop controls on infinite horizon. Such model is called conditional McKean-Vlasov MDP, and shortly abbreviated in the sequel as CMKV-MDP. Our set-up is themathematical framework for a theory of reinforcement learning with mean-field interaction,and is notably motivated by the recent popularization of targeted advertising, in whichcontrols are naturally of open-loop form as the individuals states are inaccessible. Thecommon noise is also a feature of interest to model the impact of public data on thepopulation and to understand how it may affect the strategy of the social planner/influencer.Compared to continuous-time models, discrete-time McKean-Vlasov control problemshave been less studied in the literature. In [15], the authors consider a finite-horizon problemwithout common noise and state the dynamic programming (Bellman) equation for MFCwith closed-loop (also called feedback) controls, that are restricted to depend on the state.Very recently, the works [6], [11] addressed Bellman equations for MFC problems in thecontext of reinforcement learning. The paper [11] considers relaxed controls in their MFCformulation but without common noise, and derives the Bellman equation for the Q -valuefunction as a deterministic control problem that we obtain here as a particular case (see ourRemark 4.11). The framework in [6] is closest to ours by considering also common noise,however with the following differences: these authors restrict their attention to stationaryfeedback policies, and reformulate their MFC control problem as a MDP on the spaceof probability measures by deriving formally (leaving aside the measurability issues andassuming the existence of a stationary feedback control) the associated Bellman equation,which is then used for the development of Q -learning algorithms. Notice that [6], [11] do notconsider dependence upon the probability distribution of the control in the state transitiondynamics and reward function.Besides the introduction of a general framework including a mean-field dependence onthe pair state/action, our first main contribution is to rigorously connect CMKV-MDP toa large but finite system of MDP with interacting processes. We prove the almost sureand L conditional propagation of chaos, i.e., the convergence, as the number of interactingagents N tends to infinity, of the state processes and gains of the N -individual populationcontrol problem towards the corresponding object in the CMKV-MDP. Furthermore, byrelying on rate of convergence in Wasserstein distance of the empirical measure, we give arate of convergence for the limiting CMKV-MDP under suitable Lipschitz assumptions onthe state transition and reward functions, which is new to the best of our knowledge.Our second contribution is to obtain the correspondence of our CMKV-MDP with asuitable lifted MDP on the space of probability measures. Starting from open-loop controls,this is achieved in general by introducing relaxed (i.e. measure-valued) controls in theenlarged state/action space, and by emphasizing the measurability issues arising in thepresence of common noise and with continuous state space. 
In the special case withoutcommon noise or with discrete state space, the relaxed control in the lifted MDP is reducedto the usual notion in control theory, also known as mixed or randomized strategies ingame theory. While it is known in standard MDP that an optimal control (when it exists)is in pure form, relaxed control appears naturally in MFC where the social planner hasto sample the distribution of actions instead of simply assigning the same pure strategyamong the population in order to perform the best possible collective gain.3he reformulation of the original problem as a lifted MDP leads us to consider an asso-ciated dynamic programming equation written in terms of a Bellman fixed point equationin the space of probability measures. Our third contribution is to establish rigorously theBellman equation satisfied by the state value function of the CMKV-MDP, and then by thestate-action value function, called Q -function in the reinforcement learning terminology.This is obtained under the crucial assumption that the initial information filtration is gene-rated by an atomless random variable, i.e., that it is rich enough, and calls upon originalmeasurable optimal coupling results for the Wasserstein distance. Moreover, and this isour fourth contribution, the methodology of proof allows us to obtain as a by-product theexistence of an ǫ -optimal control, which is constructed from randomized feedback policiesunder a randomization hypothesis. This shows in particular that the value function ofCMKV-MDP over open-loop controls is equal to the value function over randomized feed-back controls, and we highlight that it may be strictly larger than the value function ofCMKV-MDP over “pure” feedback controls, i.e., without randomization. This is a notabledifference with respect to the classical (without mean-field dependence) theory of MDP asstudied e.g. in [3], [18].Finally, we discuss how to compute the value function and approximate optimal ran-domized feedback controls from the Bellman equation according to value or policy iterationmethods and by discretization of the state space and of the space of probability measures.Reinforcement learning algorithms and practical implementation are postponed to a com-panion paper with applications to model for targeted advertising in social networks. Outline of the paper.
The rest of the paper is organized as follows. Section 2 carefullyformulates both the N -individual model and the CMKV-MDP, and show their connectionby providing the rate of convergence of the latter to the limiting MFC when N goes toinfinity. In Section 3, we establish the correspondence of the CMKV-MDP with a liftedMDP on the space of probability measures with usual relaxed controls when there is nocommon noise or when the state space is discrete. In the general case considered in Section4, we show how to lift the CMKV-MDP by a suitable enlargement of the action space inorder to get the correspondance with a MDP on the Wasserstein space. We then derivethe associated Bellman fixed point equation satisfied by the value function, and obtainthe existence of approximate randomized feedback controls. We conclude in Section 5 byindicating some questions for future research. Finally, we collect in the Appendix someuseful and technical results including measurable coupling arguments used in the proofs ofthe paper. Notations.
Given two measurable spaces ( X , Σ ) and ( X , Σ ), we denote by pr (resp.pr ) the projection function ( x , x ) ∈ X ×X x ∈ X (resp. x ∈ X ). For a measurablefunction Φ : X → X , and a positive measure µ on ( X , Σ ), the pushforward measureΦ ⋆ µ is the measure on ( X , Σ ) defined byΦ ⋆ µ ( B ) = µ (cid:0) Φ − ( B ) (cid:1) , ∀ B ∈ Σ . We denote by P ( X ) the set of probability measures on X , and C ( X ) the cylinder (orweak) σ -algebra on P ( X ), that is the smallest σ -algebra making all the functions µ ∈P ( X ) µ ( B ) ∈ [0 , B ∈ Σ .4 probability kernel ν on X × X , denoted ν ∈ ˆ X ( X ), is a measurable mapping from( X , Σ ) into ( P ( X ) , C ( X )), and we shall write indifferently ν ( x , B ) = ν ( x )( B ), for all x ∈ X , B ∈ Σ . Given a probability measure µ on ( X , Σ ), and a probability kernel ν ∈ ˆ X ( X ), we denote by µ · ν the probability measure on ( X × X , Σ ⊗ Σ ) defined by( µ · ν )( B × B ) = Z B × B µ (d x ) ν ( x , d x ) , ∀ B ∈ Σ , B ∈ Σ . Let X and X be two random variables valued respectively on X and X , denoted X i ∈ L (Ω; X i ). We denote by L ( X i ) the probability distribution of X i , and by L ( X | X ) theconditional probability distribution of X given X . With these notations, when X =Φ( X ), then L ( X ) = Φ ⋆ L ( X ).When ( Y , d ) is a compact metric space, the set P ( Y ) of probability measures on Y isequipped with the Wasserstein distance W ( µ, µ ′ ) = inf n Z Y d ( y, y ′ ) µ (d y, d y ′ ) : µ ∈ Π ( µ, µ ′ ) o , where Π ( µ, µ ′ ) is the set of probability measures on Y × Y with marginals µ and µ ′ , i.e.,pr ⋆ µ = µ , and pr ⋆ µ = µ ′ . Since ( Y , d ) is compact, it is known (see e.g. Corollary 6.13in [20]) that the Borel σ -algebra generated by the Wasserstein metric coincides with thecylinder σ -algebra on P ( Y ), i.e., Wasserstein distances metrize weak convergence. We alsorecall the dual Kantorovich-Rubinstein representation of the Wasserstein distance W ( µ, µ ′ ) = sup n Z Y φ d( µ − µ ′ ) : φ ∈ L lip ( Y ; R ) , [ φ ] lip ≤ o , where L lip ( Y ; R ) is the set of Lipschitz continuous functions φ from Y into R , and [ φ ] lip =sup {| φ ( y ) − φ ( y ′ ) | /d ( y, y ′ ) : y, y ′ ∈ Y , y = y ′ } . N -agent and the limiting McKean-Vlasov MDP We formulate the mean-field Markov Decision Process (MDP) in a large population modelwith indistinguishable agents i ∈ N ∗ = N \ { } .Let X (the state space) and A (the action space) be two compact Polish spaces equippedrespectively with their metric d and d A . We denote by P ( X ) (resp. P ( A )) the space ofprobability measures on X (resp. A ) equipped respectively with their Wasserstein dis-tance W and W A . We also consider the product space X × A , equipped with the metric d (( x, a ) , ( x ′ , a ′ )) = d ( x, x ′ ) + d A ( a, a ′ ), x, x ′ ∈ X , a, a ′ ∈ A , and the associated space ofprobability measure P ( X × A ), equipped with its Wasserstein distance W . Let G , E , and E be three measurable spaces, representing respectively the initial information, idiosyn-cratic noise, and common noise spaces.We denote by Π OL the set of sequences ( π t ) t ∈ N (called open-loop policies ) where π t is ameasurable function from G × E t × ( E ) t into A for t ∈ N .Let (Ω , F , P ) be a probability space on which are defined the following family of mutuallyi.i.d. 
random variables 5 (Γ i , ξ i ) i ∈ N ⋆ (initial informations and initial states) valued in G × X• ( ε it ) i ∈ N ⋆ ,t ∈ N (idiosyncratic noises) valued in E with probability distribution λ ε • ε := ( ε t ) t ∈ N (common noise) valued in E .Without loss of generality, we may assume that F contains an atomless random variable,i.e., F is rich enough, so that any probability measure ν on X (resp. A or X × A ) can berepresented by the law of some random variable Y on X (resp. A or X × A ), and we write Y ∼ ν , i.e., L ( Y ) = ν . Given an open-loop policy π , we associate an open-loop control forindividual i ∈ N ∗ as the process α i,π defined by α i,π = π t (Γ i , ( ε is ) s ≤ t , ( ε s ) s ≤ t ) , t ∈ N . In other words, an open-loop control is a non-anticipative process that depends on theinitial information, the past idiosyncratic and common noises, but not on the states of theagent in contrast with closed-loop control.Given N ∈ N ∗ , and π ∈ Π OL , the state process of agent i = 1 , . . . , N in an N -agentMDP is given by the dynamical system ( X i,N,π = ξ i X i,N,πt +1 = F ( X i,N,πt , α i,πt , N P Nj =1 δ ( X j,N,πt ,α j,πt ) , ε it +1 , ε t +1 ) , t ∈ N , where F is a measurable function from X × A × P ( X × A ) × E × E into X , called statetransition function. The i -th individual contribution to the influencer’s gain over an infinitehorizon is defined by J N,πi := ∞ X t =0 β t f (cid:16) X i,N,πt , α i,πt , N N X j =1 δ ( X j,N,πt ,α j,πt ) (cid:17) , i = 1 , . . . , N, where the reward f is a mesurable real-valued function on X × A × P ( X × A ), assumed tobe bounded (recall that X and A are compact spaces), and β is a positive discount factorin [0 , J N,π := 1 N N X i =1 J N,πi , V
N,π := E (cid:2) J N,π (cid:3) , and the optimal value of the influencer is V N := sup π ∈ Π OL V N,π . Observe that the agentsare indistinguishable in the sense that the initial pair of information/state (Γ i , ξ i ) i , andidiosyncratic noises are i.i.d., and the state transition function F , reward function f , anddiscount factor β do not depend on i .Let us now consider the asymptotic problem when the number of agents N goes toinfinity. In view of the propagation of chaos argument, we expect that the state process ofagent i ∈ N ∗ in the infinite population model to be governed by the conditional McKean-Vlasov dynamics ( X i,π = ξ i X i,πt +1 = F ( X i,πt , α i,πt , P X i,πt ,α i,πt ) , ε it +1 , ε t +1 ) , t ∈ N . (2.1)6ere, we denote by P and E the conditional probability and expectation knowing the com-mon noise ε , and then, given a random variable Y valued in Y , we denote by P Y or L ( Y )its conditional law knowing ε , which is a random variable valued in P ( Y ) (see LemmaA.2). The i -th individual contribution to the influencer’s gain in the infinite populationmodel is J πi := ∞ X t =0 β t f (cid:0) X i,πt , α i,πt , P X i,πt ,α i,πt ) (cid:1) , i ∈ N ∗ , and we define the conditional gain, expected gain, and optimal value, respectively by J π := E (cid:2) J πi (cid:3) = E (cid:2) J π (cid:3) , i ∈ N ∗ (by indistinguishability of the agents) ,V π := E (cid:2) J π (cid:3) , V := sup π ∈ Π OL V π . (2.2)Problem (2.1)-(2.2) is called conditional McKean-Vlasov Markov decision process, CMKV-MDP in short.The main goal of this Section is to rigorously connect the finite-agent model to theinfinite population model with mean-field interaction by proving the convergence of the N -agent MDP to the CMKV-MDP. First, we prove the almost sure convergence of thestate processes under some continuity assumptions on the state transition function. Proposition 2.1
Assume that for all ( x , a, ν , e ) ∈ X × A ×P ( X × A ) × E , the (random)function ( x, ν ) ∈ X × P ( X × A ) F ( x, a, ν, ε , e ) ∈ X is continuous in ( x , ν ) a.s. Then, for any π ∈ Π OL , X i,N,πt a.s. → N →∞ X i,πt , i ∈ N ∗ , t ∈ N , and W (cid:16) N N X i =1 δ ( X i,N,πt ,α i,πt ) , P X i,πt ,α i,πt ) (cid:17) a.s. −→ N →∞ , N N X i =1 d ( X i,N,πt , X i,πt ) → N →∞ a.s. Furthermore, if we assume that for all a ∈ A , the function ( x, ν ) ∈ X × P ( X × A ) f ( x, a, ν ) is continuous, then J N,πi a.s. −→ N →∞ J πi , J N,π a.s. −→ N →∞ J π , V N,π −→ N →∞ V π , and lim inf N →∞ V N ≥ V. Proof.
Fix π ∈ Π OL . We omit the dependence of the state processes and control on π ,and denote by ν Nt := N P Ni =1 δ ( X i,Nt ,α it ) , ν N, ∞ t := N P Ni =1 δ ( X it ,α it ) , and ν t := P X it ,α it ) .(1) We first prove the convergence of trajectories, for all i ∈ N ⋆ , P (cid:2) lim N →∞ ( X i,Nt , ν Nt ) = ( X it , ν t ) (cid:3) = 1 , P h lim N →∞ N N X i =1 d ( X i,Nt , X it ) = 0 i = 1 , by induction on t ∈ N . - Initialization. For t = 0, we have X i,N = ξ i = X i , α i = π (Γ i ), for all N ∈ N ⋆ and i ∈ N ⋆ ,which obviously implies that X i,N a.s. → N →∞ X i and N P Ni =1 d ( X i,N , X i ) → N →∞
0. Moreover,7 ( ν N , ν ) = W ( ν N, ∞ , ν ), which converges to zero a.s., when N goes to infinity, by theweak convergence of empirical measures (see [19]), and the fact that Wasserstein distancemetrizes weak convergence. - Hereditary Property. We have X i,Nt +1 = F ( X i,Nt , α it , ν Nt , ε it +1 , ε t +1 ) and X it +1 = F ( X it , α it , ν t , ε it +1 , ε t +1 ) . (2.3)By a simple conditioning, we notice that P (cid:2) lim N →∞ X i,Nt +1 = X it +1 (cid:3) = E (cid:2) P (cid:0) ( X i,Nt , ν Nt ) N , X it , ν t , α it (cid:1)(cid:3) ,where P (cid:0) ( x N , ν N ) N , x, ν, a ) (cid:1) = P h lim N →∞ F ( x N , a, ν N , ε it +1 , ε t +1 ) = F ( x, a, ν, ε it +1 , ε t +1 ) i . By the continuity assumption on F , we see that P (cid:0) ( x N , ν N ) N , x, ν, a ) (cid:1) is bounded frombelow by lim N →∞ ( x N ,ν N )=( x,ν ) , and thus P (cid:2) lim N →∞ X i,Nt +1 = X it +1 (cid:3) ≥ P (cid:2) lim N →∞ ( X i,Nt , ν Nt ) = ( X it , ν t ) (cid:3) . This proves by induction hypothesis that P (cid:2) lim N →∞ X i,Nt +1 = X it +1 (cid:3) = 1.Let us now show that N P Ni =1 d ( X i,Nt +1 , X it +1 ) a.s. → N →∞
0. From (2.3), we have d ( X i,Nt +1 , X it +1 ) ≤ ∆ D N F ( X it , α it , ν t , ε it +1 , ε t +1 )with D N := max[ d ( X i,Nt , X it ) , W ( ν Nt , ν t )], and∆ y F ( x, a, ν, e, e ) := sup ( x ′ ,ν ′ ) ∈ D { d ( F ( x ′ , a, ν ′ , e, e ) , F ( x, a, ν, e, e )) max[ d ( x ′ ,x ) , W ( ν ′ ,ν )] ≤ y } , where D is a fixed countable dense set of the separable space X × P ( X × A ), which impliesthat ( y, x, a, ν, e, e ) ∆ y F ( x, a, ν, e, e ) is a measurable function. Fix ǫ >
0. Let ∆ X denote the diameter of the compact metric space X . We thus have, for any η > d ( X i,Nt +1 , X it +1 ) ≤ d ( X i,Nt +1 , X it +1 ) D N ≤ η + d ( X i,Nt +1 , X it +1 ) D N >η ≤ ∆ η F ( X it , α it , ν t , ε t +1 , ε t +1 ) + ∆ X d ( X N,it ,X it ) >η + ∆ X W ( ν Nt ,ν t ) >η , and thus 1 N N X i =1 d ( X i,Nt +1 , X it +1 ) ≤ N N X i =1 ∆ η F ( X it , α it , ν t , ε t +1 , ε t +1 ) + ∆ X ηN N X i =1 d ( X i,Nt , X it ) + ∆ X W ( ν Nt ,ν t ) >η . The second and third terms in the right hand side converge to 0 by induction hypothesis,and by Proposition B.1, the first term converges to E (cid:2) ∆ η F ( X t , α t , ν t , ε t +1 , ε t +1 ) (cid:3) = E h E (cid:2) ∆ η F ( x t , a t , ν t , ε t +1 , e t +1 ) (cid:3) ( x t ,a t ,e t +1 ):=( X t ,α t ,ε t +1 ) i . η →
0, the inner expectation tends to zero by continuity assumption on F and bydominated convergence. Then, the outer expectation converges to zero by conditionaldominated convergence, and will thus be smaller than ǫ for η small enough, which impliesthat N P Ni =1 d ( X i,Nt +1 , X it +1 ) will be smaller than ǫ for N large enough.Let us finally prove that W ( ν Nt +1 , ν t +1 ) a.s. → N →∞
0. We have W ( ν Nt +1 , ν t +1 ) ≤ W ( ν Nt +1 , ν N, ∞ t +1 )+ W ( ν N, ∞ t +1 , ν t +1 ). To dominate the first term W ( ν Nt +1 , ν N, ∞ t +1 ), notice that, given a variable U N ∼ U ( { , ..., N } ), the random measure ν Nt +1 (resp. ν N, ∞ t +1 ) is, at fixed ω ∈ Ω, the law of thepair of random variable ( X U N ,Nt +1 ( ω ) , α U N t +1 ( ω )) (resp. ( X U N t +1 ( ω ) , α U N t +1 ( ω )) where we stress thatonly U N is random here, essentially selecting each sample of these empirical measures withprobability N . Thus, by definiton of the Wasserstein distance, W ( ν Nt +1 , ν N, ∞ t +1 ) is dominatedby E [ d ( X U N ,Nt +1 ( ω ) , X U N t +1 ( ω ))] = N P Ni =1 d ( X i,Nt +1 ( ω ) , X it +1 ( ω )), which has been proved to con-verge to zero. For the second term, observe that α it +1 = π t +1 (Γ i , ( ε is ) s ≤ t +1 , ( ε s ) s ≤ t +1 ), andby Proposition A.1, there exists a measurable function f t +1 : X × G × E t +1 × ( E ) t +1 → X such that X i,Nt +1 = f t +1 (Γ i , ( ε is ) s ≤ t +1 , ( ε s ) s ≤ t +1 )) . From Proposition B.1, we then deducethat W ( ν N, ∞ t +1 , ν t +1 ) converges to zero as N goes to infinity. This concludes the induction.(2) Let us now study the convergence of gains. By the continuity assumption on f , wehave f ( X i,Nt , α it , ν Nt ) a.s. → N →∞ f ( X it , α it , ν t ) for all t ∈ N . Thus, as f is bounded, we get bydominated convergence that J N,πi a.s. → N →∞ J πi . Let us now study the convergence of J N,π to J π . We write | J N,π − J π | ≤ N N X i =1 | J N,πi − J πi | + (cid:12)(cid:12)(cid:12) N N X i =1 J πi − J π (cid:12)(cid:12)(cid:12) =: S N + S N . The second term S N converges a.s. to zero by Propositions A.1 and B.1, as N goes toinfinity. On the other hand, S N ≤ ∞ X t =0 β t ∆ N ( f ) , with ∆ N ( f ) := 1 N N X i =1 (cid:12)(cid:12) f ( X i,Nt , α it , ν Nt ) − f ( X it , α it , ν t ) (cid:12)(cid:12) . By the same argument as above in (1) for showing that N P Ni =1 d ( X i,Nt +1 , X it +1 ) a.s. → N →∞
0, weprove that ∆ N ( f ) tends a.s. to zero as N → ∞ . Since f is bounded, we deduce bythe dominated convergence theorem that S N converges a.s. to zero as N goes to infinity,and thus J N,π a.s. → N →∞ J π . By dominated convergence, we then also obtain that V N,π = E [ J N ] a.s. → N →∞ E [ J π ] = V π . Finally, by considering an ǫ -optimal policy π ǫ for V , we havelim inf N →∞ V N ≥ lim N →∞ V N,π ǫ = V π ǫ ≥ V − ǫ, which implies, by sending ǫ to zero, that liminf N →∞ V N ≥ V . ✷ Next, under Lipschitz assumptions on the state transition and reward functions, weprove the corresponding convergence in L , which implies the convergence of the optimalvalue, and also a rate of convergence in terms of the rate of convergence in Wassersteindistance of the empirical measure. 9 HF lip ) There exists K F >
0, such that for all a ∈ A , e ∈ E , x, x ′ ∈ X , ν, ν ′ ∈ P ( X × A ), E (cid:2) d (cid:0) F ( x, a, ν, ε , e ) , F ( x ′ , a, ν ′ , ε , e ) (cid:1)(cid:3) ≤ K F (cid:0) d ( x, x ′ ) + W ( ν, ν ′ ) (cid:1) ) . ( Hf lip ) There exists K f >
0, such that for all a ∈ A , x, x ′ ∈ X , ν, ν ′ ∈ P ( X × A ), d ( f ( x, a, ν ) , f ( x ′ , a, ν ′ )) ≤ K f (cid:0) d ( x, x ′ ) + W ( ν, ν ′ ) (cid:1) ) . Remark 2.1
We stress the importance of making the Lipschitz assumption for F in ex-pectation only. Indeed, most of the practical applications we have in mind concerns discretespaces X , for which F cannot be, strictly speaking, Lipschitz, since it maps from a continu-ous space P ( X × A ) to a discrete space X . However, F can be Lipschitz in expectation , e.g.once integrated w.r.t. the idiosyncratic noise, and it is actually a very natural phenomenon. ✷ In the sequel, we shall denote by ∆ X the diameter of the metric space X , and define M N := sup ν ∈P ( X × A ) E [ W ( ν N , ν )] , (2.9)where ν N is the empirical measure ν N = N P Nn =1 δ Y n , ( Y n ) ≤ n ≤ N are i.i.d. random variableswith law ν . We recall in the next Lemma recent results about non asymptotic bounds forthe mean rate of convergence in Wasserstein distance of the empirical measure. Lemma 2.1
We have M N → N →∞ . Furthermore, • If X × A ⊂ R d for some d ∈ N ⋆ , then: M N = O ( N − ) for d = 1 , M N = O ( N − log(1 + N )) for d = 2 , and M N = O ( N − d ) for d ≥ . • If for all δ > , the smallest number of balls with radius δ covering the compactmetric set X × A with diameter ∆ X × A is smaller than O (cid:16)(cid:0) ∆ X× A δ (cid:1) θ (cid:17) for θ > , then M N = O ( N − /θ ) . Proof.
The first point is proved in [10], and the second one in [4]. ✷ Theorem 2.1
Assume ( HF lip ) . For all i ∈ N ∗ , t ∈ N , sup π ∈ Π OL E (cid:2) d ( X i,N,πt , X i,πt ) (cid:3) = O ( M N ) , (2.10)sup π ∈ Π OL E h W (cid:16) N N X i =1 δ ( X i,N,πt ,α i,π ) , P X i,πt ,α i,π ) (cid:17)i = O ( M N ) . (2.11) Furthermore, if we assume ( Hf lip ) , then for any η > and γ = min (cid:2) , | ln β | (ln 2 K F ) + − η (cid:3) , thereexists a constant C = C ( K F , K f , β, γ ) (explicit in the proof ) such that for N large enough, sup π ∈ Π OL | V N,π − V π | ≤ CM γN , and thus | V N − V | = O ( M γN ) . Consequently, any ε − optimal policy for the CMKV-MDPis O ( ε ) -optimal for the N -agent MDP problem for N large enough, namely M γN = O ( ǫ ) . roof. Given π ∈ Π OL , denote by ν N,πt := N P Ni =1 δ ( X i,N,πt ,α i,πt ) , ν N, ∞ ,πt := N P Ni =1 δ ( X i,πt ,α i,πt ) and ν πt := P X i,πt ,α it ) . By definition, α i,πt = π t (Γ i , ( ε is ) s ≤ t , ( ε s ) s ≤ t ), and by Lemma A.1, wehave X i,πt = f πt ( ξ i , Γ i , ( ε is ) s ≤ t , ( ε s ) s ≤ t ) for some measurable function f πt ∈ L ( X × G × E t × ( E ) t , X ).By definition of M N in (2.9), we have E h W ( ν N, ∞ ,πt , ν πt ) i ≤ M N , ∀ N ∈ N , ∀ π ∈ Π OL . (2.13)Let us now prove (2.10) by induction on t ∈ N . At t = 0, X i,N,π = X i,π = ξ i , and theresult is obvious. Now assume that it holds true at time t ∈ N and let us show that itthen holds true at time t + 1. By a simple conditioning argument, E [ d ( X i,N,πt +1 , X i,πt +1 )] = E (cid:2) ∆ (cid:0) X i,N,πt , X i,πt , α it , ν N,πt , ν πt , ε t +1 (cid:1)(cid:3) , where∆( x, x ′ , a, ν, ν ′ , e ) = E [ d ( F ( x, a, ν, ε it +1 , e ) , F ( x ′ , a, ν ′ , ε it +1 , e ))] ≤ K F (cid:0) d ( x, x ′ ) + W ( ν, ν ′ ) (cid:1) , (2.14)by ( HF lip ). On the other hand, we have E (cid:2) W ( ν N,πt , ν πt ) (cid:3) ≤ E (cid:2) W ( ν N,πt , ν N, ∞ ,πt ) (cid:3) + E (cid:2) W ( ν N, ∞ ,πt , ν πt ) (cid:3) ≤ E [ d ( X i,N,πt , X i,πt )] + M N , (2.15)where we used the fact that W ( ν N,πt , ν N, ∞ ,πt ) ≤ N P Ni =1 d ( X i,N,πt , X i,πt ), and (2.13). Itfollows from (2.14) that E (cid:2) d ( X i,N,πt +1 , X i,πt +1 ) (cid:3) ≤ K F (cid:16) E [ d ( X i,N,πt , X i,πt )] + M N (cid:17) , ∀ π ∈ Π OL , (2.16)which proves that sup π ∈ Π OL E [ d ( X i,N,πt +1 , X i,πt +1 )] = O ( M N ) by induction hypothesis, and thus(2.10). Plugging (2.10) into (2.15) then yields (2.11).Let us now prove the convergence of gains. From ( Hf lip ), and (2.15), we have | V N,π − V π | ≤ K f ∞ X t =0 β t E h d ( X i,N,πt , X i,πt ) + W ( ν N,πt , ν πt ) i ≤ K f (cid:16) ∞ X t =0 β t δ Nt + M N − β (cid:17) , ∀ π ∈ Π OL , (2.17)where we set δ Nt := sup π ∈ Π OL E [ d ( X i,N,πt , X i,πt )]. From (2.16), we have: δ Nt +1 ≤ K F δ Nt + K F M N , t ∈ N , with δ N = 0, and so by induction: δ Nt ≤ K F − K F M N + s t (cid:0) K F | K F − | M N (cid:1) , s t ( m ) := m (2 K F ) t , m ≥ . where we may assume w.l.o.g. that 2 K F = 1. Observing that we obviously have δ Nt ≤ ∆ X (the diameter of X ), we deduce that ∞ X t =0 β t δ Nt ≤ K F (1 − K F )(1 − β ) M N + S (cid:0) K F | K F − | M N (cid:1) (2.18) S ( m ) := ∞ X t =0 β t min (cid:2) s t ( m ); ∆ X (cid:3) , m ≥ . M N ) N converges to zero, we may restrict the study of the function S to the interval [0 , ∆ X ], and we notice that it satisfies the dynamic programming relation S ( m ) = m + βS (cid:0) min[ s ( m ) , ∆ X ] (cid:1) , m ∈ [0 , ∆ X ] . 
Therefore, S can be viewed as the unique fixed point in the Banach space L ∞ ([0 , ∆ X ])of the β -contracting operator H defined by [ H ϕ ]( m ) = m + βϕ (min[ s ( m ) , ∆ X ]), and isobtained as the limit of fixed point iterations: S n +1 = H S n , n ∈ N , S ≡
0. Fix γ ≥ n ∈ N , that S n ( m ) ≤ K γ,n m γ , m ∈ [0 , ∆ X ], for some suitable sequence ( K γ,n ) n . By writing that S n +1 = H S n at step n + 1,and using the induction hypothesis, we have for all m ∈ [0 , ∆ X ], S n +1 ( m ) ≤ m + βK γ,n s ( m ) γ ≤ (cid:0) ∆ − γ X + βK γ,n (2 K F ) γ (cid:1) m γ . By setting K γ,n +1 := ∆ − γ X + β (2 K F ) γ K γ,n , starting with K γ, = 0, we see that therequired relation is satisfied at iteration n + 1. Now, by taking γ < | ln( β ) | (ln 2 K F ) + , we have K γ,n = ∆ − γ X − ( β (2 K F ) γ ) n − β (2 K F ) γ → K γ := ∆ − γ X − β (2 K F ) γ , as n goes to infinity, and thus S ( m ) ≤ K γ m γ , ∀ m ∈ [0 , ∆ X ] . From (2.17)-(2.18), it follows that for N large enough (so that M N < ∆ X | K F − | K F ), andtaking γ = min (cid:2) , | ln β | (ln 2 K F ) + − η (cid:3) , for η > | V N,π − V π | ≤ CM γN , π ∈ Π OL , for some constant C that is explicit in terms of K f , K F , β and γ . This concludes the proof. ✷ Remark 2.2
If the Lipschitz constant in ( HF lip ) satisfies β K F <
1, then we can take γ = 1 in the rate of convergence (2.12) of the optimal value, and this can be directly derivedfrom (2.17) since in this case P ∞ t =0 ( β K F ) t is finite and so P ∞ t =0 β t δ Nt = O ( M N ). Thefunction S in the above proof is introduced for dealing with the case when β K F > F and f depend on the joint distribution ν ∈ P ( X × A ) onlythrough its marginals on P ( X ) and P ( A ), which is the usual framework considered incontrolled mean-field dynamics, then a careful look in the above proof shows that the rateof convergence of the CMKV-MDP will be expressed in terms of˜ M N := max n sup µ ∈P ( X ) E [ W ( µ N , µ )] , sup υ ∈P ( A ) E [ W A ( υ N , υ )] o , instead of M N in (2.9), where here µ N (resp. υ N ) is the empirical measure associated to µ (resp. υ ) ∈ P ( X ) (resp. P ( A )). From Lemma 2.1, the speed of convergence of ˜ M N is fasterthan the one of M N . For instance when X ⊂ R , A ⊂ R , then ˜ M N = O ( N − / ), while M N = O (cid:0) N − / log(1 + N ) (cid:1) . ✷ Lifted MDP on P ( X ) Theorem 2.1 justifies the CMKV-MDP as a macroscopic approximation of the N -agentMDP problem with mean-field interaction. Observe that the computation of the condi-tional gain, expected gain and optimal value of the CMKV-MDP in (2.2), only requiresthe variables associated to one agent, called representative agent. Therefore, we placeourselves in a reduced universe, dismissing other individuals variables, and renaming therepresentative agent’s initial information by Γ, initial state by ξ , idiosyncratic noise by( ε t ) t ∈ N . In the sequel, we shall denote by G = σ (Γ) the σ -algebra generated by the randomvariable Γ, hence representing the initial information filtration, and by L ( G ; X ) the set of G -measurable random variables valued in X . We shall assume that the initial state ξ ∈ L ( G ; X ), which means that the policy has access to the agent’s initial state through theinitial information filtration G .An open-loop control for the representative agent is a process α , which is adapted to thefiltration generated by (Γ , ( ε s ) s ≤ t , ( ε s ) s ≤ t ) t ∈ N , and associated to an open-loop policy by: α t = α πt := π t (Γ , ( ε s ) s ≤ t , ( ε s ) s ≤ t ) for some π ∈ Π OL . We denote by A the set of open-loopcontrols, and given α ∈ A , ξ ∈ L ( G ; X ), the state process X = X ξ,α of the representativeagent is governed by X t +1 = F ( X t , α t , P X t ,α t ) , ε t +1 , ε t +1 ) , t ∈ N , X = ξ. (3.1)For α = α π , π ∈ Π OL , we write indifferently X ξ,π = X ξ,α , and the expected gain V α = V π equal to V α ( ξ ) = E h X t ∈ N β t f ( X t , α t , P X t ,α t ) ) i , where we stress the dependence upon the initial state ξ . The value function to the CMKV-MDP is then defined by V ( ξ ) = sup α ∈A V α ( ξ ) , ξ ∈ L ( G ; X ) . Let us now show how one can lift the CMKV-MDP to a (classical) MDP on the spaceof probability measures P ( X ). We set F as the filtration generated by the common noise ε . Given an open-loop control α ∈ A , and its state process X = X ξ,α , denote by { µ t = P X t , t ∈ N } , the random P ( X )-valued process, and notice from Proposition A.1 that ( µ t ) t is F -adapted. From (3.1), and recalling the pushforward measure notation, we have µ t +1 = F (cid:0) · , · , P X t ,α t ) , · , ε t +1 (cid:1) ⋆ (cid:0) P X t ,α t ) ⊗ λ ε (cid:1) , a.s. (3.2)As the probability distribution λ ε of the idiosyncratic noise is a fixed parameter, the aboverelation means that µ t +1 only depends on P X t ,α t ) and ε t +1 . 
Moreover, by introducing theso-called relaxed control associated to the open-loop control α asˆ α t ( x ) = L (cid:0) α t | X t = x (cid:1) , t ∈ N , A ( X ), the set of probability kernels on X × A (see Lemma A.2), we seefrom Bayes formula that P X t ,α t ) = µ t · ˆ α t . The dynamics relation (3.2) is then written as µ t +1 = ˆ F ( µ t , ˆ α t , ε t +1 ) , t ∈ N , where the function ˆ F : P ( X ) × ˆ A ( X ) × E → P ( X ) is defined byˆ F ( µ, ˆ a, e ) = F (cid:0) · , · , µ · ˆ a, · , e (cid:1) ⋆ (cid:0) ( µ · ˆ a ) ⊗ λ ε (cid:1) . (3.3)On the other hand, by the law of iterated conditional expectation, the expected gaincan be written as V α ( ξ ) = E h X t ∈ N β t E (cid:2) f ( X t , α t , P X t ,α t ) ) (cid:3)i , with the conditional expectation term equal to E (cid:2) f ( X t , α t , P X t ,α t ) ) (cid:3) = ˆ f ( µ t , ˆ α t ) , where the function ˆ f : P ( X ) × ˆ A ( X ) → R is defined byˆ f ( µ, ˆ a ) = Z X × A f ( x, a, µ · ˆ a )( µ · ˆ a )(d x, d a ) . (3.4)The above derivation suggests to consider a MDP with state space P ( X ), action spaceˆ A ( X ), a state transition function ˆ F as in (3.3), a discount factor β ∈ [0 , f as in (3.4). A key point is to endow ˆ A ( X ) with a suitable σ -algebra in orderto have measurable functions ˆ F , ˆ f , and F -adapted process ˆ α valued in ˆ A ( X ), so that theMDP with characteristics ( P ( X ) , ˆ A ( X ) , ˆ F , ˆ f , β ) is well-posed. This issue is investigated inthe next sections, first in special cases, and then in general case by a suitable enlargementof the action space. When there is no common noise, the original state transition function F is defined from X × A ×P ( X × A ) × E into X , and the associated function ˆ F is then defined from P ( X ) × ˆ A ( X )into P ( X ) by ˆ F ( µ, ˆ a ) = F (cid:0) · , · , µ · ˆ a, · (cid:1) ⋆ (cid:0) ( µ · ˆ a ) ⊗ λ ε (cid:1) . In this case, we are simply reduced to a deterministic control problem on the state space P ( X ) with dynamics µ t +1 = ˆ F ( µ t , κ t ) , t ∈ N , µ = µ ∈ P ( X ) , controlled by κ = ( κ t ) t ∈ N ∈ b A , the set of deterministic sequences valued in ˆ A ( X ), andcumulated gain/value function: b V κ ( µ ) = ∞ X t =0 β t ˆ f ( µ t , κ t ) , b V ( µ ) = sup κ ∈ b A b V κ ( µ ) , µ ∈ P ( X ) , f : P ( X ) × ˆ A ( X ) → R is defined as in (3.4). Notice that thereare no measurability issues for ˆ F , ˆ f , as the problem is deterministic and all the quantitiesdefined above are well-defined.We aim to prove the correspondence and equivalence between the MKV-MDP and theabove deterministic control problem. From similar derivation as in (3.2)-(3.4) (by takingdirectly law under P instead of P ), we clearly see that for any α ∈ A , V α ( ξ ) = b V ˆ α ( µ ),with µ = L ( ξ ), and ˆ α = R ξ ( α ) where R ξ is the relaxed operator R ξ : A −→ b A α = ( α t ) t ˆ α = ( ˆ α t ) t : ˆ α t ( x ) = L (cid:0) α t | X ξ,αt = x (cid:1) , t ∈ N , x ∈ X . It follows that V ( ξ ) ≤ b V ( µ ). In order to get the reverse inequality, we have to show that R ξ is surjective. Notice that this property is not always satisfied: for instance, when the σ -algebra generated by ξ is equal to G , then for any α ∈ A , α is σ ( ξ )-measurable at time t = 0,and thus L ( α | ξ ) is a Dirac distribution, hence cannot be equal to an arbitrary probabilitykernel κ = ˆ a ∈ ˆ A ( X ). We shall then make the following randomization hypothesis. Rand ( ξ, G ) : There exists a uniform random variable U ∼ U ([0 , G -measurableand independent of ξ ∈ L ( G ; X ). Remark 3.1
The randomization hypothesis
Rand ( ξ, G ) implies in particular that Γ isatomless, i.e., G is rich enough, and thus P ( X ) = {L ( ζ ) : ζ ∈ L ( G ; X ) } . Furthermore,it means that there is extra randomness in G besides ξ , so that one can freely randomizevia the uniform random variable U the first action given ξ according to any probabilitykernel ˆ a . Moreover, one can extract from U , by standard separation of the decimals of U (see Lemma 2.21 in [12]), an i.i.d. sequence of uniform variables ( U t ) t ∈ N , which are G -measurable, independent of ξ , and can then be used to randomize the subsequent actions. ✷ Theorem 3.1 (Correspondence in the no common noise case)
Assume that
Rand ( ξ, Γ) holds true. Then R ξ is surjective from A into b A , and we have V ( ξ ) = b V ( µ ) , for µ = L ( ξ ) . Moreover, if α ǫ ∈ A is an ǫ -optimal control for V ( ξ ) , then R ξ ( α ǫ ) ∈ b A is an ǫ -optimal control for b V ( µ ) , and conversely, if ˆ α ǫ ∈ b A is an ǫ -optimalcontrol for b V ( µ ) , then any α ǫ ∈ R − ξ ( ˆ α ǫ ) is an ǫ -optimal control for V ( ξ ) . Proof.
In view of the above discussion, we only need to prove the surjectivity of R ξ . Fixa control κ ∈ b A for the MDP on P ( X ). By Proposition A.3, for all t ∈ N , there exists ameasurable function a t : X × [0 , → A such that P a t ( x,U ) = κ t ( x ), for all x ∈ X . It is thenclear that the control α defined recursively by α t := a t ( X ξ,αt , U t ), where ( U t ) t is an i.i.d.sequence of G -measurable uniform variables independent of ξ under Rand ( ξ, Γ), satisfies L ( α t | X ξ,α = x ) = κ t ( x ) (observing that U t is independent of X ξ,αt ), and thus ˆ α = κ ,which proves the surjectivity of R ξ . ✷ Remark 3.2
The above correspondence result shows in particular that the value function V of the MKV-MDP is law invariant, in the sense that it depends on its initial state ξ onlyvia its probability law µ = L ( ξ ), for ξ satisfying the randomization hypothesis. ✷ .2 Case with discrete state space X We consider the case with common noise but when the state space X is discrete, i.e., itscardinal X is finite, equal to n .In this case, one can identify P ( X ) with the simplex S n − = { p = ( p i ) i =1 ,...,n ∈ [0 , n : P ni =1 p i = 1 } , by associating any probability distribution µ ∈ P ( X ) to its weights( µ ( { x } )) x ∈X ∈ S n − . We also identify the action space ˆ A ( X ) with P ( A ) n by associating anyprobability kernel ˆ a ∈ ˆ A ( X ) to (ˆ a ( x )) x ∈X ∈ P ( A ) n , and thus ˆ A ( X ) is naturally endowedwith the product σ -algebra of the Wasserstein metric space P ( A ). Lemma 3.1
Suppose that X = n < ∞ . Then, ˆ F in (3.3) is a measurable functionfrom S n − × P ( A ) n × E into S n − , ˆ f in (3.4) is a real-valued measurable function on S n − × P ( A ) n . Moreover, for any ξ ∈ L ( G ; X ) , and α ∈ A , the P ( A ) n -valued process ˆ α defined by ˆ α t ( x ) = L ( α t | X ξ,αt = x ) , t ∈ N , x ∈ X , is F -adapted. Proof.
By Corollary A.1, it is clear, by measurable composition, that we only need toprove that Ψ : ( µ, ˆ a ) ∈ ( P ( X ) , ˆ A ( X )) µ · ˆ a ∈ P ( X × A ) is measurable. However, in thisdiscrete case, µ · ˆ a is here simply equal to P x ∈X µ ( x )ˆ a ( x ) and, thus Ψ is clearly measurable. ✷ In view of Lemma 3.1, the MDP with characteristics ( P ( X ) ≡ S n − , ˆ A ( X ) ≡ P ( A ) n , ˆ F , ˆ f , β )is well-posed. Let us then denote by b A the set of F -adapted processes valued in P ( A ) n ,and given κ ∈ b A , consider the controlled dynamics in S n − µ t +1 = ˆ F ( µ t , κ t , ε t +1 ) , t ∈ N , µ = µ ∈ S n − , (3.5)the associated expected gain and value function b V κ ( µ ) = E h ∞ X t =0 β t ˆ f ( µ t , κ t ) i , b V ( µ ) = sup κ ∈ ˆ A b V κ ( µ ) . (3.6)We aim to prove the correspondence and equivalence between the CMKV-MDP and theMDP (3.5)-(3.6). From the derivation in (3.2)-(3.4) and by Lemma 3.1, we see that for any α ∈ A , V α ( ξ ) = b V ˆ α ( µ ), where µ = L ( ξ ), and ˆ α = R ξ ( α ) where R ξ is the relaxed operator R ξ : A −→ b A α = ( α t ) t ˆ α = ( ˆ α t ) t : ˆ α t ( x ) = L (cid:0) α t | X ξ,αt = x (cid:1) , t ∈ N , x ∈ X . (3.7)It follows that V ( ξ ) ≤ b V ( µ ). In order to get the reverse inequality from the surjectivity of R ξ , we need again as in the no common noise case to make some randomization hypothesis.It turns out that when X is discrete, this randomization hypothesis is simply reduced tothe atomless property of Γ. Lemma 3.2
Assume that Γ is atomless, i.e., G is rich enough. Then, any ξ ∈ L ( G ; X ) taking a countable number of values, satisfies Rand ( ξ, Γ) . Proof.
Let S be a countable set s.t. ξ ∈ S a.s., and P [ ξ = x ] > x ∈ S . Fix x ∈ S and denote by P x the probability “knowing ξ = x ”, i.e., P x [ B ] := P [ B,ξ = x ] P [ ξ = x ] , for all B ∈ F .16t is clear that, endowing Ω with this probability, Γ is still atomless, and so there existsa G -measurable random variable U x that is uniform under P x . Then, the random variable U := P x ∈ S U x ξ = x is a G -measurable uniform random variable under P x for all x ∈ S ,which implies that it is a uniform variable under P , independent of ξ . ✷ Theorem 3.2 (Correspondance with the MDP on P ( X ) in the X discrete case) Assume that G is rich enough. Then R ξ is surjective from A into b A , and V ( ξ ) = b V ( µ ) , forany µ ∈ P ( X ) , ξ ∈ L ( G ; X ) s.t. µ = L ( ξ ) . Moreover, if α ǫ ∈ A is an ǫ -optimal controlfor V ( ξ ) , then R ξ ( α ǫ ) ∈ b A is an ǫ -optimal control for b V ( µ ) . Conversely, if ˆ α ǫ ∈ b A is an ǫ -optimal control for b V ( µ ) , then any α ǫ ∈ ( R ξ ) − ( ˆ α ǫ ) is an ǫ -optimal control for V ( ξ ) . Proof.
From the derivation in (3.5)-(3.7), we only need to prove the surjectivity of R ξ .Fix κ ∈ b A and let π t ∈ L (( E ) t ; ˆ A ( X )) be such that κ t = π t (( ε s ) s ≤ t ). As X is discrete, bydefinition of the σ -algebra on ˆ A ( X ), π t can be seen as a measurable function in L (( E ) t ×X ; P ( A )). Let φ ∈ L ( A, R ) be an embedding as in Lemma C.2. By Corollary A.1, weknow that φ ⋆ π t is in L (( E ) t × X ; P ( R )). Given m ∈ P ( R ) we denote by F − m thegeneralized inverse of its distribution function, and it is known that the mapping m ∈ ( P ( R ) , W ) F − m ∈ ( L caglad ( R ) , k · k ) is an isometry and is thus measurable. Therefore, F − φ⋆ π t is in L (( E ) t × X ; ( L caglad ( R ) , k · k )). Finally, the mapping ( f, u ) ∈ ( L caglad ( R ) , k ·k ) × ([0 , , B ([0 , f ( u ) ∈ ( R , B ( R )) is measurable, since it is the limit of the sequence n P i ∈ Z [ i +1 n , i +2 n ) ( u ) R i +1 nin f ( y ) dy when n → ∞ . Therefore, the mappinga t : ( E ) t × X × [0 , −→ A (( e s ) s ≤ t , x, u ) φ − ◦ F − φ⋆ π t (( e s ) s ≤ t ,x ) ( u )is measurable. We thus define, by induction, α t := a t (( ε s ) s ≤ t , X ξ,αt , U t ). By constructionand by the generalized inverse simulation method, it is clear that ˆ α t = κ t . ✷ Remark 3.3
We point out that when both state space X and action space A are discrete,equipped with the metrics d ( x, x ′ ) := x = y , x, x ′ ∈ X and d A ( a, a ′ ) := a = a ′ , a, a ′ ∈ A ,the transition function ˆ F and reward function ˆ f of the lifted MDP on P ( X ) inherits theLipschitz condition ( HF lip ) and ( Hf lip ) used for the propagation of chaos. Indeed, it isknown that the Wasserstein distance obtained from d (resp. d A ) coincides with twice thetotal variation distance, and thus to the L distance when naturally embedding P ( X ) (resp. P ( A )) in [0 , X (resp. [0 , A ). Thus, embedding ˆ A ( X ) in M X , A ([0 , X × A matrices with coefficients valued in [0 , k ˆ F ( µ, ˆ a, e ) , ˆ F ( ν, ˆ a ′ , e ) k ≤ (1 + K F )(2 k µ − µ ′ k + sup x ∈X k ˆ a x, · − ˆ a ′ x, · k ) . We obtain a similar property for f . In other words, lifting the CMKV-MDP not only turnsit into an MDP, but also its state and action spaces [0 , X and [0 , X × A are verystandard, and its dynamic and reward are Lipschitz functions with factors of the order of K F and K f according to the norm k · k . Thus, due to the standard nature of this MDP,most MDP algorithms can be applied and their speed will be simply expressed in terms ofthe original parameters of the CMKV-MDP, K F and K f . ✷ emark 3.4 As in the no common noise case, the correspondence result in the X discretecase shows notably that the value function of the CMKV-MDP is law-invariant.The general case (common noise and continuous state space X ) raises multiple issuesfor establishing the equivalence between CMKV-MDP and the lifted MDP on P ( X ). First,we have to endow the action space ˆ A ( X ) with a suitable σ -algebra for the lifted MDP tobe well-posed: on one hand, this σ -algebra has to be large enough to make the functionsˆ F : P ( X ) × ˆ A ( X ) × E → P ( X ) and ˆ f : P ( X ) × ˆ A ( X ) → R measurable, and on the otherhand, it should be small enough to make the process ˆ α = R ξ ( α ) F -adapted for any control α ∈ A in the CMKV-MDP. Beyond the well-posedness issue of the lifted MDP, the secondimportant concern is the surjectivity of the relaxed operator R ξ from A into b A . Indeed, ifwe try to adapt the proof of Theorem 3.2 to the case of a continuous state space X , theissue is that we cannot in general equip ˆ A ( X ) with a σ -algebra such that L (( E ) t ; ˆ A ( X ))= L (( E ) t × X ; P ( A )), and thus we cannot see π t ∈ L (( E ) t ; ˆ A ( X )) as an element of L (( E ) t × X ; P ( A )), which is crucial because the control α (such that ˆ α = κ ) is definedwith α t explicitly depending upon π t (( ε s ) s ≤ t , X t ).In the next section, we shall fix these measurability issues in the general case, and provethe correspondence between the CMKV-MDP and a general lifted MDP on P ( X ). ✷ P ( X ) We address the general case with common noise and possibly continuous state space X ,and our aim is to state the correspondence of the CMKV-MDP with a suitable lifted MDPon P ( X ) associated to a Bellman fixed point equation, characterizing the value function,and obtain as a by-product an ǫ -optimal control. 
We proceed as follows:(i) We first introduce a well-posed lifted MDP on P ( X ) by enlarging the action spaceto P ( X × A ), and call ˜ V the corresponding value function, which satisfies: V ( ξ ) ≤ ˜ V ( µ ), for µ = L ( ξ ).(ii) We then consider the Bellman equation associated to this well-posed lifted MDP on P ( X ), which admits a unique fixed point, called V ⋆ .(iii) Under the randomization hypothesis for ξ , we show the existence of an ǫ -randomizedfeedback policy, which yields both an ǫ -randomized feedback control for the CMKV-MDP and an ǫ -optimal feedback control for ˜ V . This proves that V ( ξ ) = ˜ V ( µ ) = V ∗ ( µ ), for µ = L ( ξ ).(iv) Under the condition that G is rich enough, we conclude that V is law-invariant andis equal to ˜ V = V ⋆ , hence satisfies the Bellman equation.Finally, we show how to compute from the Bellman equation by value or policy iterationapproximate optimal strategy and value function.18 .1 A general lifted MDP on P ( X ) We start again from the relation (3.2) describing the evolution of µ t = P X t , t ∈ N , for astate process X t = X ξ,αt controlled by α ∈ A : µ t +1 = F ( · , · , P X t ,α t ) , · , ε t +1 ) ⋆ ( P X t ,α t ) ⊗ λ ε ) , a.s. (4.1)Now, instead of decoupling as in Section 3, the conditional law of the pair ( X t , α t ), as P X t ,α t ) = µ t · ˆ α t where ˆ α = R ξ ( α ) is the relaxed control in (3.7), we directly considerthe control process α t = P X t ,α t ) , t ∈ N , which is F -adapted (see Proposition A.1), andvalued in the space of probability measures A := P ( X × A ), naturally endowed with the σ -algebra of its Wasserstein metric. Notice that this A -valued control α obtained fromthe CMKV-MDP has to satisfy by definition the marginal constraint pr ⋆ α t = µ t at anytime t . In order to tackle this marginal constraint, we shall rely on the following couplingresults. Lemma 4.1 (Measurable coupling)
There exists a measurable function ζ ∈ L ( P ( X ) × X × [0 , X ) s.t. for any ( µ, µ ′ ) ∈P ( X ) , and if ξ ∼ µ , then • ζ ( µ, µ ′ , ξ, U ) ∼ µ ′ , where U is an uniform random variable independent of ξ . • (i) When X ⊂ R : E (cid:2) d ( ξ, ζ ( µ, µ ′ , ξ, U )) (cid:3) = W ( µ, µ ′ ) . (ii) In general when X Polish: ∀ ε > , ∃ η > s.t. W ( µ, µ ′ ) < η ⇒ E (cid:2) d ( ξ, ζ ( µ, µ ′ , ξ, U )) (cid:3) < ε. Proof.
See Appendix C. ✷ Remark 4.1
Lemma 4.1 can be seen as a measurable version of the well-known couplingresult in optimal transport, which states that given µ , µ ′ ∈ P ( X ), there exists ξ and ξ ′ random variables with L ( ξ ) = µ , L ( ξ ′ ) = µ ′ such that W ( µ, µ ′ ) = E (cid:2) d ( ξ, ξ ′ )]. A similarmeasurable optimal coupling is proved in [8] under the assumption that there exists atransfer function realizing an optimal coupling between µ and µ ′ . However, such transferfunction does not always exist, for instance when µ has atoms but not µ ′ . Lemma 4.1 buildsa measurable coupling without making such assumption (essentially using the uniformvariable U to randomize when µ has atoms). ✷ From the measurable coupling function ζ as in Lemma 4.1, we define the couplingprojection p : P ( X ) × A → A by p ( µ, a ) = L (cid:0) ζ (pr ⋆ a , µ, ξ ′ , U ) , α (cid:1) , µ ∈ P ( X ) , a ∈ A , where ( ξ ′ , α ) ∼ a , and U is a uniform random variable independent of ξ ′ .19 emma 4.2 (Measurable coupling projection) The coupling projection p is a measurable function from P ( X ) × A into A , and for all ( µ, a ) ∈ P ( X ) × A :pr ⋆ p ( µ, a ) = µ, and if pr ⋆ a = µ, then p ( µ, a ) = a . Proof.
The only result that is not trivial is the measurability of p . Observe that p ( µ, a ) = g ( µ, a , · , · , · ) ⋆ ( a ⊗ U ([0 , g is the measurable function g : P ( X ) × P ( X × A ) × X × A × [0 , −→ X × A ( µ, a , x, a, u ) ( ζ (pr ⋆ a , µ, x, u ) , a )We thus conclude by Corollary A.1. ✷ By using this coupling projection p , we see that the dynamics (4.1) can be written as µ t +1 = ˜ F ( µ t , α t , ε t +1 ) , t ∈ N , (4.2)where the function ˜ F : P ( X ) × A × E → P ( X ) defined by˜ F ( µ, a , e ) = F ( · , · , p ( µ, a ) , · , e ) ⋆ (cid:0) p ( µ, a ) ⊗ λ ε (cid:1) , is clearly measurable. Let us also define the measurable function ˜ f : P ( X ) × A → R by˜ f ( µ, a ) = Z X × A f ( x, a, p ( µ, a )) p ( µ, a )(d x, d a ) . The MDP with characteristics ( P ( X ) , A = P ( X × A ) , ˜ F , ˜ f , β ) is then well-posed. Letus then denote by A the set of F -adapted processes valued in A , and given an open-loopcontrol ν ∈ A , consider the controlled dynamics µ t +1 = ˜ F ( µ t , ν t , ε t +1 ) , t ∈ N , µ = µ ∈ P ( X ) , (4.3)with associated expected gain/value function e V ν ( µ ) = E h X t ∈ N β t ˜ f ( µ t , ν t ) i , e V ( µ ) = sup ν ∈ A e V ν ( µ ) . (4.4)Given ξ ∈ L ( G ; X ), and α ∈ A , we set α = L ξ ( α ), where L ξ is the lifted operator L ξ : A −→ A α = ( α t ) t α = ( α t ) t : α t = P X ξ,αt ,α t ) , t ∈ N . By construction from (4.2), we see that µ t = P X ξ,αt , t ∈ N , follows the dynamics (4.3) withthe control ν = L ξ ( α ) ∈ A . Moreover, by the law of iterated conditional expectation, andthe definition of ˜ f , the expected gain of the CMKV-MDP can be written as V α ( ξ ) = E h X t ∈ N β t E (cid:2) f ( X ξ,αt , α t , P X ξ,αt ,α t ) ) (cid:3)i = E h X t ∈ N β t ˜ f ( P X ξ,αt , α t ) i = e V α ( µ ) , with µ = L ( ξ ) . (4.5)It follows that V ( ξ ) ≤ e V ( µ ), for µ = L ( ξ ). Our goal is to prove the equality, which implies inparticular that V is law-invariant, and to obtain as a by-product the corresponding Bellmanfixed point equation that characterizes analytically the solution to the CMKV-MDP.20 .2 Bellman fixed point on P ( X ) We derive and study the Bellman equation corresponding to the general lifted MDP (4.3)-(4.4) on P ( X ). By defining this MDP on the canonical space ( E ) N , an open-loop control ν = ( ν t ) t ∈ A is identified with the open-loop policy ν = ( ν t ) t where ν t is a measurablefunction from ( E ) t into A via the relation ν t = ν t ( ε , . . . , ε t ), with the convention that ν = ν is simply a constant in A . Given ν ∈ A , and e ∈ E , we denote by ~ ν ( e ) = ( ~ν t ( e )) t ∈ A the shifted open-loop policy defined by ~ν t ( e )( . ) = ν t +1 ( e , . ), t ∈ N . Given µ ∈ P ( X ),and ν ∈ A , we denote by ( µ µ, ν t ) t the solution to (4.3), which satisfies the flow property (cid:0) µ µ, ν t +1 , ν t +1 (cid:1) ≡ (cid:0) µ µ µ, ν t , ~ ν t ( ε ) (cid:1) , t ∈ N , where ≡ means equality in law under P . This implies that the expected gain of this MDPin (4.4) satisfies the relation e V ν ( µ ) = ˜ f ( µ, ν ) + β E h e V ~ ν ( ε ) ( µ µ, ν ) i Recalling that µ µ, ν = ˜ F ( µ, ν , ε ), and by definition of ˜ V as the value function, we deducethat e V ( µ ) ≤ (cid:2) T e V (cid:3) ( µ ) , (4.6)where T is the Bellman operator on L ∞ ( P ( X )), the set of bounded real-valued functionson P ( X ), defined for any W ∈ L ∞ ( P ( X )) by[ T W ]( µ ) := sup a ∈ A n ˜ f ( µ, a ) + β E (cid:2) W (cid:0) ˜ F ( µ, a , ε ) (cid:1)(cid:3)o , µ ∈ P ( X ) . 
(4.7)This Bellman operator is consistent with the lifted MDP derived in Section 3, with cha-racteristics ( P ( X ) , ˆ A ( X ) , ˆ F , ˆ f , β ), although this MDP is not always well-posed. Indeed,its corresponding Bellman operator is well-defined as it only involves the random variable ε at time 1, hence only requires the measurability of e ˆ F ( µ, ˆ a, e ), for any ( µ, ˆ a ) ∈ P ( X ) × ˆ A ( X ) (which holds true), and it turns out that it coincides with T . Proposition 4.1
For any W ∈ L ∞ ( P ( X )) , and µ ∈ P ( X ) , we have [ T W ]( µ ) = sup ˆ a ∈ ˆ A ( X ) [ ˆ T ˆ a W ]( µ ) = sup a ∈ L ( X × [0 , A ) [ T a W ]( µ ) , (4.8) where ˆ T ˆ a and T a are the operators defined on L ∞ ( P ( X )) by [ ˆ T ˆ a W ]( µ ) = ˆ f ( µ, ˆ a ) + β E (cid:2) W (cid:0) ˆ F ( µ, ˆ a, ε ) (cid:1)(cid:3) , [ T a W ]( µ ) = E h f ( ξ, a( ξ, U ) , L ( ξ, a( ξ, U ))) + βW (cid:0) P F ( ξ, a( ξ,U ) , L ( ξ, a( ξ,U )) ,ε ,ε ) (cid:1)i , (4.9) for any ( ξ, U ) ∼ µ ⊗ U ([0 , (it is clear that the right-hand side in (4.9) does not dependon the choice of such ( ξ, U ) ). Moreover, we have [ T W ]( µ ) = sup α ∈ L (Ω; A ) E h f ( ξ, α , L ( ξ, α )) + βW (cid:0) P F ( ξ,α , L ( ξ,α ) ,ε ,ε ) (cid:1)i . (4.10)21 roof. Fix W ∈ L ∞ ( P ( X )), and µ ∈ P ( X ). Let a be arbitrary in A . Since p ( µ, a ) hasfirst marginal equal to µ , there exists by Corollary A.2 a probability kernel ˆ a ∈ ˆ A ( X ) suchthat p ( µ, a ) = µ · ˆ a . Therefore, ˜ F ( µ, a , e ) = ˆ F ( µ, ˆ a, e ), ˜ f ( µ, a ) = ˆ f ( µ, ˆ a ), which impliesthat [ T W ]( µ ) ≤ sup ˆ a ∈ ˆ A ( X ) [ ˆ T ˆ a W ]( µ ) =: T .Let us consider the operator R defined by R : L ( X × [0 , A ) −→ ˆ A ( X )a ˆ a : ˆ a ( x ) = L (cid:0) a( x, U ) (cid:1) , x ∈ X , U ∼ U ([0 , , and notice that it is surjective from L ( X × [0 , A ) into ˆ A ( X ), by Lemma A.3. By notingthat for any a ∈ L ( X × [0 , A ), and ( ξ, U ) ∼ µ ⊗ U ([0 , L (cid:0) ξ, a( ξ, U ) (cid:1) = µ · R (a), it follows that [ T a W ]( µ ) = [ ˆ T R (a) W ]( µ ). Since R is surjective, this yields T =sup a ∈ L ( X × [0 , A ) [ T a W ]( µ ) =: T .Denote by T the right-hand-side in (4.10). It is clear that T ≤ T . Conversely, let α ∈ L (Ω; A ). We then set a = L ( ξ, α ) ∈ P ( X × A ), and notice that the first marginal of a is µ . Thus, p ( µ, a ) = L ( ξ, α ), and so˜ f ( µ, a ) = Z X × A f ( x, a, p ( µ, a )) p ( µ, a )(d x, d a ) = E (cid:2) f ( ξ, α , L ( ξ, α )) (cid:3) ˜ F ( µ, a , ε ) = F ( · , · , p ( µ, a ) , · , ε ) ⋆ (cid:0) p ( µ, a ) ⊗ λ ε (cid:1) = P F ( ξ,α , L ( ξ,α ) ,ε ,ε ) . We deduce that T ≤ [ T W ]( µ ), which gives finally the equalities (4.8) and (4.10). ✷ By standard and elementary arguments, we state the basic properties of the Bellmanoperator T . Lemma 4.3 (i) The operator T is contracting on L ∞ ( P ( X )) with Lipschitz factor β , andadmits a unique fixed point in L ∞ ( P ( X )) , denoted by V ⋆ , hence solution to: V ⋆ = T V ⋆ . (ii) Furthermore, it is monotone increasing: for W , W ∈ L ∞ ( P ( X )) , if W ≤ W , then T W ≤ T W . As a consequence of Lemma 4.3, we can easily show the following relation between thevalue function ˜ V of the general lifted MDP, and the fixed point V ⋆ of the Bellman operator. Lemma 4.4
As a consequence of Lemma 4.3, we can easily show the following relation between the value function Ṽ of the general lifted MDP and the fixed point V⋆ of the Bellman operator.

Lemma 4.4

For all µ ∈ P(X), we have Ṽ(µ) ≤ V⋆(µ).

Proof. From (4.6), we have Ṽ ≤ T Ṽ. By iterating this inequality with the operator T, and using the monotone increasing property of T, we get Ṽ ≤ T^n Ṽ for all n ∈ N. Since the fixed point V⋆ of the contracting operator T is the limit of T^n Ṽ as n goes to infinity, this proves the inequality Ṽ ≤ V⋆. ✷

4.3 Building ε-optimal randomized feedback controls

We aim to prove rigorously the equality Ṽ = V⋆, i.e., that the value function Ṽ of the general lifted MDP satisfies the Bellman fixed point equation Ṽ = T Ṽ, and also to show the existence of an ε-optimal control for Ṽ. Notice that this cannot be obtained directly from the classical theory of MDPs, as we consider here open-loop controls ν ∈ A while MDP theory usually deals with feedback controls on finite-dimensional spaces. Following the standard notation in MDP theory with state space P(X) and action space A, and in connection with the Bellman operator in (4.7), we introduce, for π ∈ L^0(P(X); A) (the set of measurable functions from P(X) into A), called a (measurable) feedback policy, the so-called π-Bellman operator T^π on L^∞(P(X)), defined for W ∈ L^∞(P(X)) by

[T^π W](µ) = f̃(µ, π(µ)) + β E[ W(F̃(µ, π(µ), ε^0_1)) ], µ ∈ P(X). (4.11)

As for the Bellman operator T, we have the following basic properties of the operator T^π.

Lemma 4.5
Fix π ∈ L^0(P(X); A).
(i) The operator T^π is contracting on L^∞(P(X)) with Lipschitz factor β, and admits a unique fixed point, denoted Ṽ^π.
(ii) Furthermore, it is monotone increasing: for W_1, W_2 ∈ L^∞(P(X)), if W_1 ≤ W_2, then T^π W_1 ≤ T^π W_2.

Remark 4.2
It is well known from MDP theory that the fixed point Ṽ^π of the operator T^π is equal to

Ṽ^π(µ) = E[ Σ_{t∈N} β^t f̃(µ_t, π(µ_t)) ],

where (µ_t)_t is the MDP in (4.3) controlled by the feedback and stationary control ν^π = (ν^π_t)_t ∈ A defined by ν^π_t = π(µ_t), t ∈ N. In the sequel, we shall then identify, by misuse of notation, Ṽ^π and Ṽ^{ν^π} as defined in (4.4). ✷

Our ultimate goal being to solve the CMKV-MDP, we introduce a subclass of feedback policies for the lifted MDP.
Definition 4.1 (Lifted randomized feedback policy)
A feedback policy π ∈ L^0(P(X); A) is a lifted randomized feedback policy if there exists a measurable function a ∈ L^0(P(X) × X × [0,1]; A), called a randomized feedback policy, such that

(ξ, a(µ, ξ, U)) ∼ π(µ), for all µ ∈ P(X), with (ξ, U) ∼ µ ⊗ U([0,1]).

Remark 4.3
Given a ∈ L^0(P(X) × X × [0,1]; A), denote by π_a ∈ L^0(P(X); A) the associated lifted randomized feedback policy, i.e., π_a(µ) = L(ξ, a(µ, ξ, U)), for µ ∈ P(X) and (ξ, U) ∼ µ ⊗ U([0,1]). Recalling the operator T^{π_a} in (4.11), and observing that p(µ, π_a(µ)) = π_a(µ) = L(ξ, a_µ(ξ, U)), where we set a_µ = a(µ, ·, ·) ∈ L^0(X × [0,1]; A), we see (recalling the notation in (4.9)) that for all W ∈ L^∞(P(X)),

[T^{π_a} W](µ) = [T^{a_µ} W](µ), µ ∈ P(X). (4.12)

On the other hand, let ξ ∈ L^0(G; X) be some initial state satisfying the randomization hypothesis Rand(ξ, G), and denote by α^a ∈ A the randomized feedback stationary control defined by α^a_t = a(P^0_{X_t}, X_t, U_t), where X = X^{ξ,α^a} is the state process in (3.1) of the CMKV-MDP, and (U_t)_t is an i.i.d. sequence of uniform G-measurable random variables independent of ξ. By construction, the associated lifted control a = L_ξ(α^a) satisfies a_t = P^0_{(X_t, α^a_t)} = π_a(µ_t), where µ_t = P^0_{X_t}, t ∈ N. Denoting by V^a := V^{α^a} the associated expected gain of the CMKV-MDP, and recalling Remark 4.2, we see from (4.5) that V^a(ξ) = Ṽ^{ν^{π_a}}(µ) = Ṽ^{π_a}(µ), where µ = L(ξ). ✷
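To make the randomized feedback stationary control of Remark 4.3 concrete, here is an illustrative particle rollout in which the conditional law P^0_{X_t} is replaced by the empirical measure of the simulated cloud, and the extra uniforms U_t drive the randomization. The policy a, the transition F and the reward f below are hypothetical placeholders, not the paper's model.

```python
import numpy as np

# Rollout of a randomized feedback stationary control alpha^a_t = a(law, X_t, U_t),
# with the conditional law approximated by N simulated particles (toy model).

rng = np.random.default_rng(0)
N, beta, T = 2000, 0.9, 40

def a(mu_particles, x, u):
    # hypothetical randomized feedback policy: push the state toward the mean,
    # perturbed by the uniform randomization u
    return np.clip(mu_particles.mean() - x + 0.1 * (u - 0.5), -1.0, 1.0)

def F(x, act, mu_particles, eps, eps0):
    return np.clip(x + 0.5 * act + 0.05 * eps + 0.05 * eps0, 0.0, 1.0)

def f(x, act, mu_particles):
    return -(x - mu_particles.mean()) ** 2

x = rng.uniform(size=N)         # xi ~ mu_0, one sample per particle
gain, disc = 0.0, 1.0
for t in range(T):
    u = rng.uniform(size=N)     # the i.i.d. uniforms U_t of the randomization
    act = a(x, x, u)            # the particle cloud stands in for P^0_{X_t}
    gain += disc * f(x, act, x).mean()
    x = F(x, act, x, rng.normal(size=N), rng.normal())
    disc *= beta
print("estimated V^a(xi):", gain)
```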
We now show a verification-type result for the general lifted MDP, and as a by-product for the CMKV-MDP, by means of the Bellman operator.

Proposition 4.2 (Verification result)
Fix ε ≥ 0, and suppose that there exists an ε-optimal feedback policy π_ε ∈ L^0(P(X); A) for V⋆, in the sense that

V⋆ ≤ T^{π_ε} V⋆ + ε.

Then ν^{π_ε} ∈ A is ε/(1−β)-optimal for Ṽ, i.e., Ṽ^{π_ε} ≥ Ṽ − ε/(1−β), and we have Ṽ ≥ V⋆ − ε/(1−β). Furthermore, if π_ε is a lifted randomized feedback policy, i.e., π_ε = π_{a_ε} for some a_ε ∈ L^0(P(X) × X × [0,1]; A), then under Rand(ξ, G), α^{a_ε} ∈ A is an ε/(1−β)-optimal control for V(ξ), i.e., V^{a_ε}(ξ) ≥ V(ξ) − ε/(1−β), and we have V(ξ) ≥ V⋆(µ) − ε/(1−β), for µ = L(ξ).

Proof.
Since Ṽ^{π_ε} = T^{π_ε} Ṽ^{π_ε}, and recalling from Lemma 4.4 that V⋆ ≥ Ṽ ≥ Ṽ^{π_ε}, we have for all µ ∈ P(X),

|(V⋆ − Ṽ^{π_ε})(µ)| ≤ |[T^{π_ε} V⋆](µ) − [T^{π_ε} Ṽ^{π_ε}](µ)| + ε ≤ β ‖V⋆ − Ṽ^{π_ε}‖_∞ + ε,

where we used the β-contraction property of T^{π_ε} from Lemma 4.5. We deduce that ‖V⋆ − Ṽ^{π_ε}‖_∞ ≤ ε/(1−β), and then Ṽ ≥ Ṽ^{π_ε} ≥ V⋆ − ε/(1−β), which, combined with V⋆ ≥ Ṽ, shows the first assertion. Moreover, if π_ε = π_{a_ε} is a lifted randomized feedback policy, then by Remark 4.3 and under Rand(ξ, G), we have V^{a_ε}(ξ) = Ṽ^{π_ε}(µ). Recalling that V(ξ) ≤ Ṽ(µ), and together with the first assertion, this proves the required result. ✷

Remark 4.4 If we can find, for any ε > 0, an ε-optimal lifted randomized feedback policy for V⋆, then according to Proposition 4.2, and under Rand(ξ, G), one can restrict to randomized feedback policies in the computation of the optimal value V(ξ) of the CMKV-MDP, i.e., V(ξ) = sup_{a ∈ L^0(P(X)×X×[0,1]; A)} V^a(ξ). Moreover, this proves that V(ξ) = Ṽ(µ) = V⋆(µ), hence V is law-invariant and satisfies the Bellman fixed point equation. Notice that, instead of proving directly the dynamic programming Bellman equation for V, we start from the fixed point solution V⋆ to the Bellman equation, and show via a verification result that V is indeed equal to V⋆, hence satisfies the Bellman equation.

By the formulation (4.8) of the Bellman operator in Proposition 4.1, and the fixed point equation satisfied by V⋆, we know that for all ε > 0 and µ ∈ P(X), there exists a^µ_ε ∈ L^0(X×[0,1]; A) such that

V⋆(µ) ≤ [T^{a^µ_ε} V⋆](µ) + ε. (4.13)

The crucial issue is to prove that the mapping (µ, x, u) ↦ a_ε(µ, x, u) := a^µ_ε(x, u) is measurable, so that it defines a randomized feedback policy a_ε ∈ L^0(P(X)×X×[0,1]; A) and an associated lifted randomized feedback policy π_{a_ε}. Recalling the relation (4.12), this would then show that π_{a_ε} is an ε-optimal lifted randomized feedback policy for V⋆, and we could apply the verification result. ✷

We now address the measurability issue for proving the existence of an ε-optimal randomized feedback policy for V⋆. The basic idea is to construct as in (4.13) an ε-optimal a^µ_ε ∈ L^0(X×[0,1]; A) for V⋆(µ) when µ lies in a suitable finite grid of P(X), and then to "patch" things together to obtain an ε-optimal randomized feedback policy. This is made possible under some uniform continuity property of V⋆, and we shall require the following Lipschitz assumptions on the original state transition and reward functions F and f.

(H'_lip) There exists K > 0 such that for all a ∈ A, e^0 ∈ E^0, x, x′ ∈ X, ν, ν′ ∈ P(X×A) with pr²⋆ν = pr²⋆ν′:

E[ d(F(x, a, ν, ε_1, e^0), F(x′, a, ν′, ε_1, e^0)) ] ≤ K (d(x, x′) + W(pr¹⋆ν, pr¹⋆ν′)),
|f(x, a, ν) − f(x′, a, ν′)| ≤ K (d(x, x′) + W(pr¹⋆ν, pr¹⋆ν′)).

Remark 4.5
Under Assumption (H'_lip), we see that once pr^i⋆ν = pr^i⋆ν′, i = 1, 2, we have F(x, a, ν, ε_1, e^0) = F(x, a, ν′, ε_1, e^0) and f(x, a, ν) = f(x, a, ν′). In other words, the functions F and f depend on ν only through its marginal distributions on X and on A, and we shall write, by misuse of notation, F(x, a, µ, υ, e, e^0) = F(x, a, µ⊗υ, e, e^0) and f(x, a, µ, υ) = f(x, a, µ⊗υ), for µ ∈ P(X), υ ∈ P(A). Assumption (H'_lip) then reads: there exists K > 0 such that for all a ∈ A, e^0 ∈ E^0, x, x′ ∈ X, µ, µ′ ∈ P(X), υ ∈ P(A),

E[ d(F(x, a, µ, υ, ε_1, e^0), F(x′, a, µ′, υ, ε_1, e^0)) ] ≤ K (d(x, x′) + W(µ, µ′)),
|f(x, a, µ, υ) − f(x′, a, µ′, υ)| ≤ K (d(x, x′) + W(µ, µ′)). ✷

Proposition 4.3
Assume that (H'_lip) holds true. Then, for all γ ∈ (0, min(1, |ln β| / (ln 2K)_+ − η)] with η > 0, V⋆ is γ-Hölder with constant K_γ := 2K ∆_X^{1−γ} / (1 − β(2K)^γ), where ∆_X denotes the diameter of X:

|V⋆(µ) − V⋆(µ′)| ≤ K_γ W(µ, µ′)^γ, ∀ µ, µ′ ∈ P(X).

Proof.
Notice that V⋆ is the limit in L^∞(P(X)) of the iterative sequence V_{n+1} = T V_n, V_0 ≡ 0, and we shall prove the γ-Hölder property by induction. Fix γ as in the statement. V_0 ≡ 0 is trivially γ-Hölder, and assume that at iteration n, V_n is γ-Hölder with constant K_n. Under (H'_lip) (see Remark 4.5), recall the expression (4.10) of the Bellman operator T:

[T W](µ) = sup_{α ∈ L^0(Ω; A)} E[ f(ξ, α, µ, L(α)) + β W( P^{ε^0_1}_{F(ξ, α, µ, L(α), ε_1, ε^0_1)} ) ], µ ∈ P(X),

where ξ ∈ L^0(Ω; X) is such that L(ξ) = µ. For any µ, µ′ ∈ P(X), let us consider an optimal coupling (ξ, ξ′) ∈ L^0(Ω; X)² for the Wasserstein distance, i.e., E[d(ξ, ξ′)] = W(µ, µ′). Under (H'_lip), we then have

|V_{n+1}(µ) − V_{n+1}(µ′)| ≤ 2K W(µ, µ′) + β sup_{α ∈ L^0(Ω; A)} E[ |V_n(P^{ε^0_1}_{F(ξ,α,µ,L(α),ε_1,ε^0_1)}) − V_n(P^{ε^0_1}_{F(ξ′,α,µ′,L(α),ε_1,ε^0_1)})| ].

Now, by the induction hypothesis,

|V_n(P^{ε^0_1}_{F(ξ,α,µ,L(α),ε_1,ε^0_1)}) − V_n(P^{ε^0_1}_{F(ξ′,α,µ′,L(α),ε_1,ε^0_1)})| ≤ K_n W( P^{ε^0_1}_{F(ξ,α,µ,L(α),ε_1,ε^0_1)}, P^{ε^0_1}_{F(ξ′,α,µ′,L(α),ε_1,ε^0_1)} )^γ,

and by definition of the Wasserstein distance,

W( P^{ε^0_1}_{F(ξ,α,µ,L(α),ε_1,ε^0_1)}, P^{ε^0_1}_{F(ξ′,α,µ′,L(α),ε_1,ε^0_1)} ) ≤ E[ d(F(ξ,α,µ,L(α),ε_1,e^0), F(ξ′,α,µ′,L(α),ε_1,e^0)) ]_{e^0 := ε^0_1} ≤ 2K W(µ, µ′),

where we used again (H'_lip) and the fact that E[d(ξ, ξ′)] = W(µ, µ′). We then deduce that

|V_{n+1}(µ) − V_{n+1}(µ′)| ≤ 2K W(µ, µ′) + β K_n (2K)^γ W(µ, µ′)^γ ≤ (2K ∆_X^{1−γ} + β K_n (2K)^γ) W(µ, µ′)^γ,

since W(µ, µ′) ≤ ∆_X and γ ≤ 1. The induction hypothesis at iteration n+1 thus holds with K_{n+1} := 2K ∆_X^{1−γ} + β K_n (2K)^γ, which leads to the expression K_n = 2K ∆_X^{1−γ} (1 − (β(2K)^γ)^n) / (1 − β(2K)^γ). Therefore, by taking γ ∈ (0, min(1, |ln β| / (ln 2K)_+ − η)] with η > 0, we have β(2K)^γ < 1, and the sequence (K_n) converges to K_γ := 2K ∆_X^{1−γ} / (1 − β(2K)^γ), which shows the required γ-Hölder property for V⋆ = lim_n V_n with constant K_γ. ✷

The next result provides a suitable discretization of the set of probability measures.
Lemma 4.6 (Quantization of P(X))

Fix η > 0. There exists a finite subset M_η = {µ_1, ..., µ_{N_η}} ⊂ P(X) such that for all µ ∈ P(X), there exists µ_i ∈ M_η satisfying W(µ, µ_i) ≤ η.

Proof. As X is compact, there exists a finite subset X_η ⊂ X such that d(x, x^η) ≤ η/2 for all x ∈ X, where x^η denotes the projection of x on X_η. Given µ ∈ P(X) and ξ ∼ µ, we denote by ξ^η the quantization of ξ, i.e., the projection of ξ on X_η, and by µ^η the discrete law of ξ^η. Thus E[d(ξ, ξ^η)] ≤ η/2, and therefore W(µ, µ^η) ≤ η/2. The probability measure µ^η lies in P(X_η), which is identified with the simplex of [0,1]^{X_η}. We then use another grid G_η = {i/n_η : i = 0, ..., n_η} of [0,1], and project the weights µ^η(y) ∈ [0,1], y ∈ X_η, on G_η, in order to obtain another discrete probability measure µ^{η,n_η}. From the dual Kantorovich representation of the Wasserstein distance, it is easy to see that for n_η large enough, W(µ^η, µ^{η,n_η}) ≤ η/2, and so W(µ, µ^{η,n_η}) ≤ η. We conclude the proof by noting that µ^{η,n_η} belongs to the set M_η of probability measures on X_η with weights valued in the finite grid G_η, hence M_η is a finite subset of P(X_η), of cardinality N_η. ✷
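The grid M_η of Lemma 4.6 is straightforward to enumerate in practice: fix a finite state grid X_η and a finite weight grid G_η, and list all weight vectors on the simplex lattice. The following sketch does this for X = [0,1] and projects an arbitrary measure onto M_η in 1-d Wasserstein distance; the grid sizes are arbitrary toy choices and the helper names are hypothetical.

```python
import itertools
import numpy as np

# Quantization of P(X) in the spirit of Lemma 4.6, for X = [0, 1].

x_grid = np.linspace(0.0, 1.0, 5)   # X_eta
n_eta = 4                           # weight grid G_eta = {0, 1/4, ..., 1}

def simplex_lattice(n_points, n_levels):
    # all weight vectors with entries in {i/n_levels} summing to 1 (stars and bars)
    for c in itertools.combinations(range(n_levels + n_points - 1), n_points - 1):
        parts = np.diff((-1,) + c + (n_levels + n_points - 1,)) - 1
        yield parts / n_levels

M_eta = np.array(list(simplex_lattice(len(x_grid), n_eta)))  # each row: one mu_i

def w1_on_grid(p, q):
    # 1-d Wasserstein distance between two measures on the same sorted grid
    return np.sum(np.abs(np.cumsum(p - q))[:-1] * np.diff(x_grid))

mu = np.array([0.13, 0.07, 0.40, 0.25, 0.15])   # some measure on X_eta
dists = [w1_on_grid(mu, q) for q in M_eta]
i_star = int(np.argmin(dists))
print("closest grid measure:", M_eta[i_star], "W1 =", dists[i_star])
```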
We can conclude this paragraph by showing the existence of an ε-optimal lifted randomized feedback policy for the general lifted MDP on P(X), and obtain as a by-product the corresponding Bellman fixed point equation for its value function and for the optimal value of the CMKV-MDP under the randomization hypothesis.

Theorem 4.1
Assume that (H'_lip) holds true. Then, for all ε > 0, there exists a lifted randomized feedback policy π_{a_ε}, for some a_ε ∈ L^0(P(X)×X×[0,1]; A), that is ε-optimal for V⋆. Consequently, under Rand(ξ, G), the randomized feedback stationary control α^{a_ε} ∈ A is ε/(1−β)-optimal for V(ξ), and we have V(ξ) = Ṽ(µ) = V⋆(µ), for µ = L(ξ), which thus satisfies the Bellman fixed point equation.

Proof. Fix ε > 0, and given η > 0, consider a quantizing grid M_η = {µ_1, ..., µ_{N_η}} ⊂ P(X) as in Lemma 4.6, and an associated partition C^i_η, i = 1, ..., N_η, of P(X), satisfying

C^i_η ⊂ B_η(µ_i) := { µ ∈ P(X) : W(µ, µ_i) ≤ η }, i = 1, ..., N_η.

For any µ_i, i = 1, ..., N_η, by (4.13) there exists a^i_ε ∈ L^0(X×[0,1]; A) such that

V⋆(µ_i) ≤ [T^{a^i_ε} V⋆](µ_i) + ε/3. (4.14)

From the partition C^i_η, i = 1, ..., N_η, of P(X) associated to M_η, we construct the function a_ε : P(X)×X×[0,1] → A as follows. Let h_1, h_2 be two measurable functions from [0,1] into [0,1] such that, for U ∼ U([0,1]), (h_1(U), h_2(U)) ∼ U([0,1])^{⊗2}. We then define

a_ε(µ, x, u) = a^i_ε( ζ(µ, µ_i, x, h_1(u)), h_2(u) ), when µ ∈ C^i_η, i = 1, ..., N_η, x ∈ X, u ∈ [0,1],

where ζ is the measurable coupling function defined in Lemma 4.1. Such a function a_ε is clearly measurable, i.e., a_ε ∈ L^0(P(X)×X×[0,1]; A), and we denote by π_ε = π_{a_ε} the associated lifted randomized feedback policy, which satisfies

[T^{π_ε} V⋆](µ_i) = [T^{a^i_ε} V⋆](µ_i), i = 1, ..., N_η, (4.15)

by (4.12). Let us now check that π_ε yields an ε-optimal randomized feedback policy for η small enough. For µ ∈ P(X), with (ξ, U) ∼ µ ⊗ U([0,1]), set U_1 := h_1(U), U_2 := h_2(U), define µ^η = µ_i when µ ∈ C^i_η, i = 1, ..., N_η, and ξ^η := ζ(µ, µ^η, ξ, U_1). Observe by Lemma 4.6 that W(µ, µ^η) ≤ η, and by Lemma 4.1 that (ξ^η, U_2) ∼ µ^η ⊗ U([0,1]). We then have, for µ ∈ P(X),

[T^{π_ε} V⋆](µ) − V⋆(µ) = ( [T^{π_ε} V⋆](µ) − [T^{π_ε} V⋆](µ^η) ) + ( [T^{π_ε} V⋆](µ^η) − V⋆(µ^η) ) + ( V⋆(µ^η) − V⋆(µ) )
≥ ( [T^{π_ε} V⋆](µ) − [T^{π_ε} V⋆](µ^η) ) − ε/3 − ε/3, (4.16)

where we used (4.14)-(4.15) and the fact that |V⋆(µ^η) − V⋆(µ)| ≤ ε/3 for η small enough, by uniform continuity of V⋆ from Proposition 4.3. Moreover, observing that a_ε(µ, ξ, U) = a_ε(µ^η, ξ^η, U_2) =: α, so that π_ε(µ) = L(ξ, α) and π_ε(µ^η) = L(ξ^η, α), we have

[T^{π_ε} V⋆](µ) = E[ f(Y) + β V⋆( P^{ε^0_1}_{F(Y, ε_1, ε^0_1)} ) ], [T^{π_ε} V⋆](µ^η) = E[ f(Y^η) + β V⋆( P^{ε^0_1}_{F(Y^η, ε_1, ε^0_1)} ) ],

with Y = (ξ, α, π_ε(µ)) and Y^η = (ξ^η, α, π_ε(µ^η)). Under (H'_lip), using the γ-Hölder property of V⋆ with constant K_γ from Proposition 4.3, and the definition of the Wasserstein distance (recall that ξ ∼ µ, ξ^η ∼ µ^η), we then get

|[T^{π_ε} V⋆](µ) − [T^{π_ε} V⋆](µ^η)| ≤ 2K E[d(ξ, ξ^η)] + β K_γ (2K)^γ (E[d(ξ, ξ^η)])^γ.

Now, by the coupling Lemma 4.1, E[d(ξ, ξ^η)] can be made as small as we wish provided that η ≥ W(µ, µ^η) is small enough, and therefore |[T^{π_ε} V⋆](µ) − [T^{π_ε} V⋆](µ^η)| ≤ ε/3. Plugging into (4.16), we obtain [T^{π_ε} V⋆](µ) − V⋆(µ) ≥ −ε for all µ ∈ P(X), which means that π_ε is ε-optimal for V⋆. The rest of the assertions in the theorem follow from the verification result in Proposition 4.2. ✷
Remark 4.6 We stress the importance of the coupling Lemma 4.1 in the construction of the ε-optimal control in Theorem 4.1. Indeed, as we do not make any regularity assumption on F and f with respect to the "control arguments", the only way to make [T^{π_ε} V⋆](µ) and [T^{π_ε} V⋆](µ^η) close to each other is to couple the terms so as to have the same control in F and f. This is achieved by turning µ into µ^η, ξ into ξ^η, and setting α = a_ε(µ, ξ, U) = a_ε(µ^η, ξ^η, U_2). Turning µ into µ^η is a simple quantization, but turning ξ into ξ^η is obtained thanks to the coupling Lemma. ✷

4.4 Relaxing the randomization hypothesis

We now consider the general case by relaxing the randomization hypothesis and assuming only that the initial information filtration G is rich enough. We need the following uniform continuity property of the value function V of the CMKV-MDP.

Lemma 4.7
Assume that (H'_lip) holds true. Then, for all γ ∈ (0, min(1, |ln β| / (ln 2K)_+ − η)] with η > 0, and setting K_γ = 2K ∆_X^{1−γ} / (1 − β(2K)^γ), we have

sup_{α∈A} |V^α(ξ) − V^α(ξ′)| ≤ K_γ (E[d(ξ, ξ′)])^γ, ∀ ξ, ξ′ ∈ L^0(G; X).

Consequently, V is γ-Hölder on L^0(G; X) endowed with the L¹-distance.

Proof.
Fix ξ, ξ′ ∈ L^0(G; X), and consider an arbitrary α = α^π ∈ A associated to an open-loop policy π ∈ Π_OL. By Proposition A.1, there exists a measurable function f_{t,π} ∈ L^0(X × G × E^t × (E^0)^t; X) such that X^{ξ,α}_t = f_{t,π}(ξ, Γ, (ε_s)_{s≤t}, (ε^0_s)_{s≤t}), and thus P^0_{X^{ξ,α}_t} = L( f_{t,π}(ξ, Γ, (ε_s)_{s≤t}, (e^0_s)_{s≤t}) )_{e^0_s = ε^0_s, s≤t}. We thus have

W(P^0_{X^{ξ,α}_t}, P^0_{X^{ξ′,α}_t}) ≤ E[ d( f_{t,π}(ξ, Γ, (ε_s)_{s≤t}, (e^0_s)_{s≤t}), f_{t,π}(ξ′, Γ, (ε_s)_{s≤t}, (e^0_s)_{s≤t}) ) ]_{e^0_s = ε^0_s, s≤t},

and so

E[ W(P^0_{X^{ξ,α}_t}, P^0_{X^{ξ′,α}_t}) ] ≤ E[ d(X^{ξ,α}_t, X^{ξ′,α}_t) ]. (4.17)

Under the Lipschitz condition on f in (H'_lip), we then have

|V^α(ξ) − V^α(ξ′)| ≤ 2K Σ_{t=0}^∞ β^t E[ d(X^{ξ,α}_t, X^{ξ′,α}_t) ]. (4.18)

By conditioning, and from the transition dynamics of the state process, we see that for t ∈ N,

E[ d(X^{ξ,α}_{t+1}, X^{ξ′,α}_{t+1}) ] = E[ Δ(α_t, X^{ξ,α}_t, P^0_{(X^{ξ,α}_t, α_t)}, X^{ξ′,α}_t, P^0_{(X^{ξ′,α}_t, α_t)}, ε^0_{t+1}) ],

where Δ(a, x, ν, x′, ν′, e^0) = E[ d(F(x, a, ν, ε_{t+1}, e^0), F(x′, a, ν′, ε_{t+1}, e^0)) ]. By the Lipschitz condition on F in (H'_lip), we thus have

E[ d(X^{ξ,α}_{t+1}, X^{ξ′,α}_{t+1}) ] ≤ K E[ d(X^{ξ,α}_t, X^{ξ′,α}_t) + W(P^0_{X^{ξ,α}_t}, P^0_{X^{ξ′,α}_t}) ] ≤ 2K E[ d(X^{ξ,α}_t, X^{ξ′,α}_t) ],

where the last inequality comes from (4.17). Denoting by δ_t(ξ, ξ′) := sup_{α∈A} E[d(X^{ξ,α}_t, X^{ξ′,α}_t)], and noting that δ_0(ξ, ξ′) = E[d(ξ, ξ′)], it follows by induction that

δ_t(ξ, ξ′) ≤ s_t(E[d(ξ, ξ′)]), with s_t(m) := m (2K)^t, m ≥ 0, t ∈ N.

By the same arguments as in Theorem 2.1, and choosing γ as in the statement of the lemma, we obtain

Σ_{t=0}^∞ β^t δ_t(ξ, ξ′) ≤ (∆_X^{1−γ} / (1 − β(2K)^γ)) (E[d(ξ, ξ′)])^γ, ξ, ξ′ ∈ L^0(G; X),

and conclude with (4.18). ✷

Theorem 4.2
Assume that G is rich enough and that (H'_lip) holds true. Then, for any ξ ∈ L^0(G; X), V(ξ) = Ṽ(µ), where µ = L(ξ). Consequently, V is law-invariant, identified with Ṽ, and satisfies the Bellman fixed point equation Ṽ = T Ṽ. Moreover, for all ε > 0, there exists an ε-optimal randomized feedback control for V(ξ).

Proof. As X is compact, there exists a finite subset X_η ⊂ X such that d(x, x^η) ≤ η for all x ∈ X, where x^η denotes the projection of x on X_η. Fix ξ ∈ L^0(G; X), and set µ = L(ξ). Let us then consider a random variable ξ′ ∼ µ, defined on another probability universe along with an independent uniform variable U′. We set Γ′ := (ξ′, U′) and G′ = σ(Γ′). By construction, the randomization hypothesis Rand(ξ′, G′) holds true, and we then have V(ξ′) = Ṽ(µ) from Theorem 4.1. Consider now the quantized random variables ξ^η and ξ′^η, which have the same law, and satisfy respectively the randomization hypotheses Rand(ξ^η, G) and Rand(ξ′^η, G′) from Lemma 3.2. From Theorem 4.1, we deduce that V(ξ^η) = Ṽ(L(ξ^η)) = V(ξ′^η). By uniform continuity of V (Lemma 4.7), it follows by sending η to zero that V(ξ) = V(ξ′), and thus V(ξ) = Ṽ(µ), which proves the required result.

Finally, the existence of an ε-optimal control for V(ξ) is obtained as follows. From the uniform continuity of V in Lemma 4.7, there exists η small enough so that |V(ξ) − V(ξ^η)| ≤ ε/2. We then build according to Theorem 4.1 an ε/2-optimal control for V(ξ^η), which yields an ε-optimal (randomized feedback stationary) control for V(ξ). ✷

Remark 4.7 From Theorems 4.1 and 4.2, under the condition that G is rich enough and (H'_lip) holds true, the value function V of the CMKV-MDP is law-invariant, and the supremum in the Bellman fixed point equation for V ≡ Ṽ with the operator T can be restricted to lifted randomized feedback policies, i.e.,

V = T V = sup_{a ∈ L^0(P(X)×X×[0,1]; A)} T^a V,

where we set T^a := T^{π_a}, equal to

[T^a W](µ) = E[ f(Y^a(µ, ξ, U)) + β W( P^{ε^0_1}_{F(Y^a(µ,ξ,U), ε_1, ε^0_1)} ) ],

with Y^a(µ, x, u) := (x, a(µ, x, u), π_a(µ)) and (ξ, U) ∼ µ ⊗ U([0,1]). ✷

Remark 4.8 (G rich enough and dynamic programming) The condition that G is rich enough is crucial for obtaining the Bellman fixed point equation for the value function V. Let us illustrate this fact with a counter-example similar to Example 3.1 in [11]. Consider X = {−1, 1} = A, ε_1 uniformly distributed on {−1, 1} (denoted B(1/2) by misuse of notation), F(x, a, ν, e, e^0) = ax, and f(x, a, ν) = −W(pr¹⋆ν, B(1/2)), which is maximal (equal to 0) when pr¹⋆ν = B(1/2) on X, and minimal (equal to −1/2) when pr¹⋆ν = δ_{−1} or δ_1. Assume that Γ = 1 a.s., so that G is the trivial σ-algebra. In this case, ξ =: x and α_0 are then necessarily deterministic, and thus X^{ξ,α}_1 = α_0 x has to be deterministic as well, which yields rewards equal to −1/2 at times t = 0, 1. By choosing a control with α_0 = 1, α_1 = ε_1 and α_t = 1 afterwards, we have P^0_{X^{x,α}_t} = δ_x for t = 0, 1, and P^0_{X^{x,α}_t} = B(1/2) for t ≥ 2, which yields V(ξ) = −(1+β)/2. If V satisfied the DPP, we would have V(x) = sup_{a∈A}(−1/2 + βV(ax)), which is equivalent to −(1+β)/2 = −1/2 − β(1+β)/2, and this is clearly false for β > 0.

Intuitively, the reason why the DPP is not satisfied in the previous example is that from time t = 1 we have the possibility to randomize actions using ε_1, whereas at time t = 0 no randomization was possible (even by quantizing, because G is not rich enough). As discussed in the next remark, the possibility to randomize implies that the game from time 1 is more advantageous than the game from time 0, which contradicts the idea behind the DPP that from time 1 on we are playing the same game, only starting from a different state. ✷

Remark 4.9 (Open-loop vs feedback controls and randomization hypothesis)
In Theorem 4.1, we build, under the randomization hypothesis, an ε-optimal control of the form α_t = a_ε(P^0_{X_t}, X_t, U_t) (a randomized feedback control), for any ε > 0, which implies that optimizing over open-loop controls gives the same optimal value as optimizing over randomized feedback controls. In standard (non-McKean-Vlasov) MDPs, it is well known that one can even restrict to controls of the form α_t = a(X_t), i.e., (non-randomized) feedback controls: randomizing actions is thus not necessary in standard MDPs. However, in the case of McKean-Vlasov MDPs, randomization can be crucial to optimize the gain (even when the reward does not depend upon the law of the control). To illustrate this, let us consider the same example as in Remark 4.8, and assume that we start from ξ = 1 a.s.

• If we use a (non-randomized) feedback control, it is clear that the law of X_t will always be a Dirac mass, and the gain will thus be equal to Σ_{t=0}^∞ β^t (−1/2) = −1/(2(1−β)), which is the worst possible gain.

• With open-loop controls, even when the randomization hypothesis Rand(ξ, G) is not satisfied, if at some point we can use the past noise to randomize (even just slightly), there will be some times at which the law of the state is not a Dirac mass, and the gain will then be strictly greater than −1/(2(1−β)). In our case, we have seen in the previous remark that we could reach V = −(1+β)/2.

• Finally, if G were rich enough, for instance if Γ ∼ U([0,1]), then Rand(ξ, G) would hold true, and we could freely randomize from the beginning by choosing α_0 = 2·1_{Γ<1/2} − 1 ∼ B(1/2) (thus X_1 ∼ B(1/2)), and then α_t = 1 for all t ∈ N⋆, so that X_t ∼ B(1/2) for all t ∈ N⋆. With this strategy, the total gain would be equal to −1/2, which is the best possible gain given that we start from ξ = 1 a.s.

To summarize, this example shows that the optimal value under Rand(ξ, G) is here strictly greater than the optimal value over open-loop controls without Rand(ξ, G), which is itself strictly greater than the optimal value over feedback controls. This highlights that in general the suprema over "randomized feedback" ≥ "open-loop" ≥ "feedback" controls, and that the inequalities can be strict, the underlying idea being that the more (and the sooner) we can randomize, the better. We finally point out that the randomization hypothesis Rand(ξ, G) is natural since, in practice, nothing prevents us from using (pseudo-)uniform random variables in our strategies to randomize our actions. ✷

4.5 Computing value function and ε-optimal strategies in CMKV-MDP

Having established the correspondence of our CMKV-MDP with a lifted MDP on P(X), and the associated Bellman fixed point equation, we can design two methods for computing the value function and optimal strategies:

(a) Value iteration. We approximate the value function V = Ṽ = V⋆ by iteration from the Bellman operator: V_{n+1} = T V_n, and at iteration N we compute an approximate optimal randomized feedback policy a_N by (recall Remark 4.7)

a_N ∈ arg max_{a ∈ L^0(P(X)×X×[0,1]; A)} T^a V_N.

From a_N, we then construct an approximate randomized feedback stationary control α^{a_N} according to the procedure described in Remark 4.3.

(b) Policy iteration. Starting from some initial randomized feedback policy a_0 ∈ L^0(P(X)×X×[0,1]; A), we iterate according to (see the sketch after this list):

• Policy evaluation: we compute the expected gain Ṽ^{π_{a_k}} of the lifted MDP;
• Greedy strategy: we compute a_{k+1} ∈ arg max_{a ∈ L^0(P(X)×X×[0,1]; A)} T^a Ṽ^{π_{a_k}}.
We stop at iteration K to obtain a_K, and then construct an approximate randomized feedback control α^{a_K} according to the procedure described in Remark 4.3.
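The following sketch illustrates the policy iteration loop (b) on a finite grid of measures: evaluation solves the linear fixed point of T^π exactly (Lemma 4.5), and the greedy step improves the policy until it is stable. The model arrays are toy placeholders, not the paper's model.

```python
import numpy as np

# Policy iteration on a finite grid of measures (toy model).

rng = np.random.default_rng(1)
n_mu, n_act, n_noise, beta = 30, 6, 5, 0.9
reward = rng.uniform(-1, 1, size=(n_mu, n_act))               # f~(mu_i, a_j)
next_idx = rng.integers(0, n_mu, size=(n_mu, n_act, n_noise)) # F~(mu_i, a_j, e0_k)

def evaluate(pi):
    # policy evaluation: V~^pi is the unique fixed point of T^pi (Lemma 4.5),
    # obtained here by solving (I - beta * P_pi) V = f_pi
    P = np.zeros((n_mu, n_mu))
    for i in range(n_mu):
        for k in range(n_noise):
            P[i, next_idx[i, pi[i], k]] += 1.0 / n_noise
    return np.linalg.solve(np.eye(n_mu) - beta * P, reward[np.arange(n_mu), pi])

pi = np.zeros(n_mu, dtype=int)
for _ in range(50):
    V = evaluate(pi)
    greedy = (reward + beta * V[next_idx].mean(axis=2)).argmax(axis=1)
    if np.array_equal(greedy, pi):
        break              # policy is stable, hence optimal on the grid
    pi = greedy
print("greedy policy:", pi[:10], "value at mu_0:", V[0])
```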
Practical computation. Since a randomized feedback control α is a measurable function a of (P^0_{X^{ξ,α}_t}, X^{ξ,α}_t, U_t), we would need to compute and store the (conditional) law of the state process, which is infeasible in practice when X is a continuous space. In this case, to circumvent this issue, a natural idea is to discretize the compact space X by considering a finite subset X_η = {x_1, ..., x_{N_η}} ⊂ X, associated with a partition B^i_η, i = 1, ..., N_η, of X, satisfying B^i_η ⊂ { x ∈ X : d(x, x_i) ≤ η }, i = 1, ..., N_η, with η > 0. For any x ∈ X, we denote by [x]_η (or simply x^η) its projection on X_η, defined by [x]_η = x_i for x ∈ B^i_η, i = 1, ..., N_η.

Definition 4.2 (Discretized CMKV-MDP)
Fix η > 0. Given ξ ∈ L^0(G; X_η) and a control α ∈ A, we denote by X^{η,ξ,α} the McKean-Vlasov MDP on X_η given by

X^{η,ξ,α}_{t+1} = [ F(X^{η,ξ,α}_t, α_t, P^0_{(X^{η,ξ,α}_t, α_t)}, ε_{t+1}, ε^0_{t+1}) ]_η, t ∈ N, X^{η,ξ,α}_0 = ξ,

i.e., obtained by projecting the state on X_η after each application of the transition function F. The associated expected gain V^α_η is defined by

V^α_η(ξ) = E[ Σ_{t=0}^∞ β^t f(X^{η,ξ,α}_t, α_t, P^0_{(X^{η,ξ,α}_t, α_t)}) ].

Notice that the (conditional) law of the discretized CMKV-MDP on X_η is now valued in a finite-dimensional space (the simplex of [0,1]^{N_η}), which makes the computation of the associated randomized feedback control accessible, although computationally challenging due to the high dimensionality (and beyond the scope of this paper).
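For illustration, here is a minimal particle simulation of the discretized CMKV-MDP of Definition 4.2 on X = [0,1]: the state is projected back on a finite grid after each transition, so the conditional law can be stored as a weight vector on the simplex. F, f and the policy are hypothetical placeholders.

```python
import numpy as np

# Particle simulation of the discretized CMKV-MDP (toy model on X = [0, 1]).

rng = np.random.default_rng(0)
x_grid = np.linspace(0.0, 1.0, 11)   # X_eta
proj = lambda x: x_grid[np.abs(x[:, None] - x_grid[None, :]).argmin(axis=1)]

def F(x, a, m, eps, eps0):
    return np.clip(x + 0.4 * a + 0.1 * (m - x) + 0.05 * eps + 0.05 * eps0, 0, 1)

def f(x, a, m):
    return -(x - 0.5) ** 2 - 0.05 * a ** 2 + 0.1 * m

N, beta, T = 5000, 0.9, 60
x = proj(rng.uniform(size=N))        # xi valued in X_eta
gain, disc = 0.0, 1.0
for t in range(T):
    weights = np.array([(x == g).mean() for g in x_grid])  # conditional law in the simplex
    m = weights @ x_grid                                   # its mean, fed to F and f
    a = np.clip(0.5 - x + 0.1 * (rng.uniform(size=N) - 0.5), -1, 1)  # randomized feedback
    gain += disc * f(x, a, m).mean()
    x = proj(F(x, a, m, rng.normal(size=N), rng.normal()))  # project back on X_eta
    disc *= beta
print("estimated V_eta^alpha(xi):", gain)
```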
The next result states that an ε-optimal randomized feedback control in the initial CMKV-MDP can be approximated by a randomized feedback control in the discretized CMKV-MDP.

Proposition 4.4

Assume that G is rich enough and that (H'_lip) holds true. Fix ξ ∈ L^0(G; X). Given η > 0, let ξ^η be the projection of ξ on X_η. As Rand(ξ^η, G) holds true, consider an i.i.d. sequence (U_{η,t})_{t∈N} of G-measurable uniform variables independent of ξ^η. For ε > 0, let a_ε be a randomized feedback policy that is ε-optimal for the Bellman fixed point equation satisfied by V. Finally, let α^{η,ε} be the randomized feedback control in the discretized CMKV-MDP recursively defined by α^{η,ε}_t = a_ε(P^0_{X^{η,ε}_t}, X^{η,ε}_t, U_{η,t}), t ∈ N, where we set X^{η,ε} := X^{η,ξ^η,α^{η,ε}}. Then, for any δ > 0 and γ ∈ (0, min(1, |ln β| / (ln 2K)_+ − δ)], the control α^{η,ε} is O(η^γ + ε)-optimal for the CMKV-MDP X with initial state ξ.

Proof. Step 1. For δ > 0 and γ ∈ (0, min(1, |ln β| / (ln 2K)_+ − δ)], let us show that

sup_{α∈A} Σ_{t=0}^∞ β^t E[ d(X^{ξ,α}_t, X^{η,ξ^η,α}_t) ] ≤ C η^γ, (4.19)

for some constant C that depends only on K, β and γ. Indeed, notice by definition of the projection on X_η, and by a simple conditioning argument, that for all α ∈ A and t ∈ N,

E[ d(X^{ξ,α}_{t+1}, X^{η,ξ^η,α}_{t+1}) ] ≤ η + E[ Δ(X^{ξ,α}_t, X^{η,ξ^η,α}_t, α_t, P^0_{(X^{ξ,α}_t, α_t)}, P^0_{(X^{η,ξ^η,α}_t, α_t)}, ε^0_{t+1}) ],

where Δ(x, x′, a, ν, ν′, e^0) = E[ d(F(x, a, ν, ε_{t+1}, e^0), F(x′, a, ν′, ε_{t+1}, e^0)) ]. Under (H'_lip), we then get

E[ d(X^{ξ,α}_{t+1}, X^{η,ξ^η,α}_{t+1}) ] ≤ η + K E[ d(X^{ξ,α}_t, X^{η,ξ^η,α}_t) + W(P^0_{X^{ξ,α}_t}, P^0_{X^{η,ξ^η,α}_t}) ] ≤ η + 2K E[ d(X^{ξ,α}_t, X^{η,ξ^η,α}_t) ],

by the same argument as in (4.17). Hence the sequence (E[d(X^{ξ,α}_t, X^{η,ξ^η,α}_t)])_{t∈N} satisfies the same type of induction inequality as (2.16) in Theorem 2.1, with η instead of M_N, and the same derivation leads to the required result (4.19). From the Lipschitz condition on f, we deduce by the same arguments as for (4.17) in Lemma 4.7 that

sup_{α∈A} |V^α(ξ^η) − V^α_η(ξ^η)| = O(η^γ). (4.20)

Step 2. Denote by µ = L(ξ) and µ^η = L(ξ^η), and observe that W(µ, µ^η) ≤ E[d(ξ, ξ^η)] ≤ η. We write

V^{α^{η,ε}}(ξ) − V(ξ) = [V^{α^{η,ε}}(ξ) − V^{α^{η,ε}}(ξ^η)] + [V^{α^{η,ε}}(ξ^η) − V^{α^{η,ε}}_η(ξ^η)] + [V^{α^{η,ε}}_η(ξ^η) − V(ξ^η)] + [V(ξ^η) − V(ξ)] =: I_1 + I_2 + I_3 + I_4.

The first and last terms I_1 and I_4 are of order O(η^γ) by the γ-Hölder property of V^α and V in Lemma 4.7. By (4.20), the second term I_2 is of order O(η^γ) as well for η small enough. Regarding the third term I_3, notice that by definition, V^{α^{η,ε}}_η(ξ^η) is the gain associated to the randomized feedback policy a_ε for the discretized CMKV-MDP. Denote by π_ε the lifted randomized feedback policy associated to a_ε, and recall from Remark 4.3 the identification with the lifted MDP: V^{α^{η,ε}}_η(ξ′) = Ṽ^{π_ε}_η(µ′), µ′ = L(ξ′), where Ṽ^{π_ε}_η is the expected gain of the lifted MDP associated to the discretized CMKV-MDP, hence the fixed point of the operator

[T^{a_ε}_η W](µ′) = E[ f(Y^{a_ε}(µ′, ξ′, U)) + β W( P^{ε^0_1}_{[F(Y^{a_ε}(µ′,ξ′,U), ε_1, ε^0_1)]_η} ) ],

with Y^a(µ, x, u) = (x, a(µ, x, u), π_a(µ)) and (ξ′, U) ∼ µ′ ⊗ U([0,1]). Recalling that V(ξ′) = Ṽ(µ′), µ′ = L(ξ′), with Ṽ the fixed point of the Bellman operator T, it follows that

I_3 = Ṽ^{π_ε}_η(µ^η) − Ṽ(µ^η) = ( [T^{a_ε}_η Ṽ^{π_ε}_η](µ^η) − [T^{a_ε}_η Ṽ](µ^η) ) + ( [T^{a_ε}_η Ṽ](µ^η) − [T^{a_ε} Ṽ](µ^η) ) + ( [T^{a_ε} Ṽ](µ^η) − Ṽ(µ^η) ) =: I_{31} + I_{32} + I_{33}.

By definition of a_ε, we have |I_{33}| ≤ ε. For I_{32}, notice that the only difference between the operators T^{a_ε}_η and T^{a_ε} is that F is projected on X_η. Thus

|[T^{a_ε}_η Ṽ](µ^η) − [T^{a_ε} Ṽ](µ^η)| ≤ β E[ |Ṽ( P^{ε^0_1}_{[F(Y^η, ε_1, ε^0_1)]_η} ) − Ṽ( P^{ε^0_1}_{F(Y^η, ε_1, ε^0_1)} )| ],

with Y^η = (ξ^η, a_ε(µ^η, ξ^η, U), π_ε(µ^η)). It is clear, by definition of the Wasserstein distance and of the projection on X_η, that

W( P^{ε^0_1}_{[F(Y^η, ε_1, ε^0_1)]_η}, P^{ε^0_1}_{F(Y^η, ε_1, ε^0_1)} ) ≤ E[ d(F(Y^η, ε_1, ε^0_1), [F(Y^η, ε_1, ε^0_1)]_η) ] ≤ η.

From the γ-Hölder property of Ṽ in Proposition 4.3, we deduce that I_{32} = O(η^γ). Finally, for I_{31}, since T^{a_ε}_η is a β-contracting operator on (L^∞(M_η), ‖·‖_{η,∞}), we have

|[T^{a_ε}_η Ṽ^{π_ε}_η](µ^η) − [T^{a_ε}_η Ṽ](µ^η)| ≤ β ‖Ṽ^{π_ε}_η − Ṽ‖_{η,∞},

and thus |Ṽ^{π_ε}_η(µ^η) − Ṽ(µ^η)| = |I_3| ≤ |I_{31}| + |I_{32}| + |I_{33}| ≤ β ‖Ṽ^{π_ε}_η − Ṽ‖_{η,∞} + O(η^γ + ε). Taking the supremum over µ^η ∈ M_η on the left-hand side, we obtain ‖Ṽ^{π_ε}_η − Ṽ‖_{η,∞} ≤ (1/(1−β)) O(η^γ + ε) = O(η^γ + ε), and we conclude that |I_3| ≤ ‖Ṽ^{π_ε}_η − Ṽ‖_{η,∞} ≤ O(η^γ + ε), which ends the proof. ✷

Remark 4.10
Back to the N-agent MDP problem with open-loop controls, recall from Section 2 that it suffices to find an ε-optimal open-loop policy π_ε ∈ Π_OL for the CMKV-MDP, as it will automatically be O(ε)-optimal for the N-agent MDP with N large enough. For instance, the construction of an ε-optimal control α^ε given by Proposition 4.4 can be associated to an ε-optimal open-loop policy π^ε such that α^ε = α^{π^ε} (where π^ε_t is a measurable function of (Γ, (ε_s)_{1≤s≤t}, (ε^0_s)_{1≤s≤t})). The O(ε)-optimal control process for the i-th agent, α^{ε,i}, is then the result of the same construction but with (Γ^i, ε^i, ε^0) instead of (Γ, ε, ε^0), i.e., replacing ξ by ξ^i, U_{η,t} by U^i_{η,t}, and (Γ, ε, ε^0) by (Γ^i, ε^i, ε^0) in Proposition 4.4. Notice that this construction never requires access to the individual states X^{i,N}.

Remark 4.11 (Q function) In view of the Bellman fixed point equation satisfied by the value function V of the CMKV-MDP in terms of randomized feedback policies, let us introduce the corresponding state-action value function Q defined on P(X) × Â(X) by

Q(µ, â) = [T̂^â V](µ) = f̂(µ, â) + β E[ V(F̂(µ, â, ε^0_1)) ].

From Proposition 4.1, and since V = T V, we recover the standard connection between the value function and the state-action value function, namely V(µ) = sup_{â ∈ Â(X)} Q(µ, â), from which we obtain the Bellman equation for the Q function:

Q(µ, â) = f̂(µ, â) + β E[ sup_{â′ ∈ Â(X)} Q(µ^â_1, â′) ], (4.21)

where we set µ^â_1 = F̂(µ, â, ε^0_1). Notice that this Q-Bellman equation extends the equation in [11] (see their Theorem 3.1), derived in the no-common-noise case and when there is no mean-field dependence with respect to the law of the control. The Bellman equation (4.21) is the starting point, in a model-free framework where the state transition function is unknown (in other words, in the context of reinforcement learning), for the design of Q-learning algorithms estimating the Q-value function by some Q_n, from which one computes a relaxed control by

â^µ_n ∈ arg max_{â ∈ Â(X)} Q_n(µ, â), µ ∈ P(X),

and then, by Lemma A.3, associates to â^µ_n a function a_n : P(X)×X×[0,1] → A such that L(a_n(µ, x, U)) = â^µ_n(x), µ ∈ P(X), x ∈ X, where U is a uniform random variable. In practice, one has to discretize the state space X as in Definition 4.2, and then to quantize the space P(X) as in Lemma 4.6, in order to reduce the learning problem to a finite-dimensional problem for the computation of an approximate optimal randomized feedback policy a_n for the CMKV-MDP. ✷
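As a rough illustration of such a scheme, the sketch below runs tabular Q-learning in the spirit of (4.21): the measure argument is quantized to a finite grid (Lemma 4.6) and the relaxed controls in Â(X) to a finite set, so Q becomes a matrix. The simulator `step` is a hypothetical stand-in for sampling µ^â_1 = F̂(µ, â, ε^0_1); everything here is a toy placeholder, not the paper's algorithm.

```python
import numpy as np

# Tabular Q-learning on a quantized measure space (toy stand-in for (4.21)).

rng = np.random.default_rng(0)
n_mu, n_act, beta, lr = 25, 4, 0.9, 0.1

def step(i_mu, j_act):
    # hypothetical environment: returns (reward, next measure index)
    r = -abs(i_mu / (n_mu - 1) - 0.5) + 0.1 * j_act / n_act
    i_next = (i_mu + j_act + rng.integers(0, 3)) % n_mu
    return r, i_next

Q = np.zeros((n_mu, n_act))
i = rng.integers(n_mu)
for n in range(200_000):
    # epsilon-greedy exploration over the quantized relaxed controls
    j = rng.integers(n_act) if rng.random() < 0.1 else int(Q[i].argmax())
    r, i_next = step(i, j)
    # stochastic fixed-point update for the Q-Bellman equation
    Q[i, j] += lr * (r + beta * Q[i_next].max() - Q[i, j])
    i = i_next
print("greedy quantized policy:", Q.argmax(axis=1))
```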
5 Conclusion

We have developed a theory for mean-field Markov decision processes with common noise and open-loop controls, called CMKV-MDP, for general state and action spaces. Such a problem is motivated and shown to be the asymptotic problem of a large population of cooperative agents under mean-field interaction, controlled by a social planner/influencer, and we provide a rate of convergence of the N-agent model to the CMKV-MDP. We prove the correspondence of the CMKV-MDP with a general lifted MDP on the space of probability measures, and emphasize the role of relaxed controls, which are crucial to characterize the solution via the Bellman fixed point equation. Approximate randomized feedback controls are obtained from the dynamic programming Bellman equation in a model-based framework, and future work under investigation will develop algorithms in a model-free framework, in other words in the context of reinforcement learning with many interacting and cooperative agents.

A Some useful results on conditional law
Lemma A.1
Let (S, S) and (T, T) be two measurable spaces. Then:
(1) the map x ∈ (S, S) ↦ δ_x ∈ (P(S), C(S)) is measurable;
(2) the map (µ, ν) ∈ (P(S), C(S)) × (P(T), C(T)) ↦ µ ⊗ ν ∈ (P(S×T), C(S×T)) is measurable;
(3) if F ∈ L^0(S; T), then the map µ ∈ (P(S), C(S)) ↦ F⋆µ ∈ (P(T), C(T)) is measurable.

Proof. (1) Fix A ∈ S; then x ↦ δ_x(A) = 1_A(x) is clearly a measurable function. (2) Fix (A, B) ∈ S × T; then (µ, ν) ↦ (µ⊗ν)(A×B) = µ(A)ν(B) is a measurable function. (3) Fix B ∈ T; then µ ↦ (F⋆µ)(B) = µ(F^{−1}(B)) is a measurable function. ✷

Corollary A.1
Let (S, S), (T, T) and (U, U) be three measurable spaces, and F ∈ L^0((S, S) × (T, T); (U, U)) a measurable function. Then the function F̂ : (P(S), C(S)) × (T, T) → (P(U), C(U)) given by F̂(µ, x) := F(·, x)⋆µ is measurable.
Proof. This is a consequence of Lemma A.1 and the measurable composition (µ, x) ↦ (µ, δ_x) ↦ µ ⊗ δ_x ↦ F ⋆ (µ ⊗ δ_x) = F̂(µ, x). ✷

Definition A.1 (Conditional law) Fix two measurable spaces (S, S) and (T, T), and let X, Y be two random variables on (Ω, F, P) valued respectively in S and T. A conditional law of Y knowing X is a σ(X)-measurable, (P(T), C(T))-valued random variable P^X_Y, also denoted L(Y | X), such that

P^X_Y(A) = P[Y ∈ A | X], ∀ A ∈ T, a.s.

Furthermore, given x ∈ S, the conditional law of Y knowing X = x, denoted P^{X=x}_Y or L(Y | X = x), is the image of x by any probability kernel â from S to T such that L(Y | X) = â(X) a.s.

Lemma A.2 (Conditional law)
Let (S, S) and (T, T) be two measurable spaces.
1. If (S, S) is a Borel space, there exists a conditional law of Y knowing X.
2. If Y = φ(X, Z), where Z ⊥⊥ X is a random variable valued in a measurable space V and φ : S × V → T is a measurable function, then L(φ(x, Z))|_{x=X} is a conditional law of Y knowing X. In the case S = S_1 × S_2, X = (X_1, X_2) and Y = φ(X_1, Z), then P^X_Y = L(φ(x_1, Z))|_{x_1=X_1}, and thus P^X_Y is σ(X_1)-measurable in (P(T), C(T)).
Proof. The first assertion is stated in Theorem 5.3 in [12], and the second one follows from Fubini's theorem. ✷

Corollary A.2
Fix a Borel space (S, S) and a measurable space (T, T). Then, for any joint law π ∈ P(S×T) with marginal law µ ∈ P(S), there exists a probability kernel ν from S to P(T) such that π = µ · ν.

Lemma A.3 (Kernels and randomization)
Fix a Borel space (S, S) and a measurable space (T, T). For any probability kernel ν from S to P(T), there exists a measurable function φ : S × [0,1] → T such that ν(s) = L(φ(s, U)) for all s ∈ S, where U is a uniform random variable.

Proof.
See Lemma 2.22 in [12]. ✷ Proposition A.1
Given an open-loop control α ∈ A and an initial condition ξ ∈ L^0(G; X), the solution X^{ξ,α} to the conditional McKean-Vlasov dynamics is such that: for all t ∈ N, X^{ξ,α}_t is σ(ξ, Γ, (ε_s)_{s≤t}, (ε^0_s)_{s≤t})-measurable, and P^0_{(X^{ξ,α}_t, α_t)} is F^0_t-measurable.

Proof.
We prove the result by induction on t. It is clear for t = 0. Assuming that it holds true for some t ∈ N, we write

X^{ξ,α}_{t+1} = F(X^{ξ,α}_t, α_t, P^0_{(X^{ξ,α}_t, α_t)}, ε_{t+1}, ε^0_{t+1}), t ∈ N.

By the induction hypothesis, there is a measurable function f_{t+1} : X × G × E^{t+1} × (E^0)^{t+1} → X such that X^{ξ,α}_{t+1} = f_{t+1}(ξ, Γ, (ε_s)_{s≤t+1}, (ε^0_s)_{s≤t+1}); thus X^{ξ,α}_{t+1} is σ(ξ, Γ, (ε_s)_{s≤t+1}, (ε^0_s)_{s≤t+1})-measurable, and P^0_{(X^{ξ,α}_{t+1}, α_{t+1})} is σ(ε^0_s, s ≤ t+1)-measurable by Lemma A.2. ✷

B Wasserstein convergence of the empirical law
Proposition B.1 (Conditional Wasserstein convergence of the empirical measure)
Let E, F be two measurable spaces and G a compact Polish space. Let X be an E-valued random variable, independent from a family of i.i.d. F-valued random variables (U_i)_{i∈N⋆}, and let f : E × F → G be a measurable function. Then

W( (1/N) Σ_{i=1}^N δ_{f(X, U_i)}, P^X_{f(X, U_1)} ) → 0 a.s., as N → ∞.

Proof.
It suffices to observe that the probability of this event is one by conditioning with respect to X, and to use the analogous non-conditional result, which follows from the fact that the Wasserstein distance metrizes weak convergence (as G is compact), together with the fact that the empirical measure converges weakly. ✷
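Proposition B.1 is easy to observe numerically in one dimension, where the Wasserstein distance between two laws on R has the closed form W_1 = ∫ |F_N − F| dx. The sketch below is a toy Monte Carlo check under an assumed map f and a compact image; the large reference sample and grid are illustrative devices, not part of the paper.

```python
import numpy as np

# Toy numerical check of Proposition B.1 in a 1-d case.

rng = np.random.default_rng(0)
x_common = rng.normal()                        # one draw of the common variable X

def w1_empirical_vs_limit(N):
    sample = np.sort(np.tanh(x_common + rng.uniform(size=N)))  # f(X, U_i), compact-valued
    # reference sample standing in for the conditional law P^X_{f(X, U_1)}
    ref = np.sort(np.tanh(x_common + rng.uniform(size=200_000)))
    grid = np.linspace(-1, 1, 2000)
    F_N = np.searchsorted(sample, grid) / N
    F = np.searchsorted(ref, grid) / ref.size
    return np.trapz(np.abs(F_N - F), grid)     # W_1 = integral of |F_N - F|

for N in [10, 100, 1000, 10000]:
    print(N, w1_empirical_vs_limit(N))         # decreases toward 0 as N grows
```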
C Proof of coupling results

Lemma C.1
Let U, V be two independent uniform random variables, and F a distribution function on R. We have

(F^{−1}(U), F(F^{−1}(U)) − U) =^d (F^{−1}(U), V ΔF(F^{−1}(U))),

where we denote ΔF := F − F_−.

Proof.
Notice that F(F^{−1}(U)) − U is the position (from top to bottom) of U in the set {u ∈ [0,1] : F^{−1}(u) = F^{−1}(U)}, and is thus smaller than ΔF(F^{−1}(U)). Now, given a measurable function f ∈ L^0(R × [0,1]; R), we have

E[ f(F^{−1}(U), F(F^{−1}(U)) − U) ]
= E[ f(F^{−1}(U), 0) 1_{ΔF(F^{−1}(U))=0} ] + E[ f(F^{−1}(U), F(F^{−1}(U)) − U) 1_{ΔF(F^{−1}(U))>0} ]. (C.1)

The second term can be decomposed as

Σ_{c : ΔF(c)>0} E[ f(c, F(c) − U) 1_{F^{−1}(U)=c} ] = Σ_{c : ΔF(c)>0} ∫_0^1 f(c, ΔF(c) u) ΔF(c) du,

where the equality comes from a change of variable. Summing over {c : ΔF(c) > 0}, we obtain E[ f(F^{−1}(U), V ΔF(F^{−1}(U))) 1_{ΔF(F^{−1}(U))>0} ], and combined with (C.1), we get

E[ f(F^{−1}(U), F(F^{−1}(U)) − U) ] = E[ f(F^{−1}(U), V ΔF(F^{−1}(U))) ],

which proves the result. ✷

Lemma C.2 Let X be a compact Polish space. Then there exists an embedding φ ∈ L^0(X, R) such that:
1. φ and φ^{−1} are uniformly continuous;
2. for any probability measure µ ∈ P(X), we have Im(F^{−1}_{φ⋆µ}) ⊂ Im(φ). In particular, φ^{−1} ∘ F^{−1}_{φ⋆µ} is well posed.

Proof.
1. Without loss of generality, we assume that X has diameter bounded by 1. Fix a countable dense family (x_n)_{n∈N} in X. We define the map φ_1 : x ∈ X ↦ (d(x, x_n))_{n∈N} ∈ [0,1]^N. Let us endow [0,1]^N with the metric d((u_n)_{n∈N}, (v_n)_{n∈N}) := Σ_{n≥0} 2^{−n} |u_n − v_n|. The map φ_1 is clearly injective and uniformly continuous (even Lipschitz), and the compactness of X implies that its inverse φ_1^{−1} is uniformly continuous as well. Let us now consider a map φ_2 : ([0,1]^N, d) → [0,1]: φ_2((u_n)_{n∈N}) essentially groups the decimals of the real numbers u_n, n ∈ N, into a single real number. More precisely, let ι : N → N² be a surjection; we then define the k-th decimal of φ_2((u_n)_{n∈N}) as the ι(k)_2-th decimal of u_{ι(k)_1} (with the convention that, for a number with two possible decimal representations, we choose the one that ends with 000...). The map φ_2 is clearly injective and uniformly continuous, as well as its inverse φ_2^{−1}. Thus φ := φ_2 ∘ φ_1 defines an embedding of X into R such that φ and φ^{−1} are uniformly continuous.

2. F^{−1}_{φ⋆µ} being caglad and Im(φ) being closed (by compactness of X), it is enough to prove that F^{−1}_{φ⋆µ}(u) ∈ Im(φ) for almost every u ∈ [0,1] (in the Lebesgue sense). However, given a uniform variable U, we have F^{−1}_{φ⋆µ}(U) ∼ φ⋆µ, and thus P(F^{−1}_{φ⋆µ}(U) ∈ Im(φ)) = P_{Y∼µ}(φ(Y) ∈ Im(φ)) = 1. ✷

Proof of Lemma 4.1. (1) We first consider the case where X ⊂ R. Let us call F_µ the distribution function of µ ∈ P(X), and F^{−1}_µ its generalized inverse. Let us define the function ζ : P(X) × P(X) × X × [0,1] → X by

ζ(µ, µ′, x, u) := F^{−1}_{µ′}( F_µ(x) − u ΔF_µ(x) ),

which is measurable, noting that the measurability in (µ, µ′) comes from the continuity of the map P(X) → L_caglad([0,1], X), µ ↦ F^{−1}_µ. By construction, and by Lemma C.1, we then have for any ξ ∼ µ, and U, V two independent uniform variables independent of ξ:

(ξ, ζ(µ, µ′, ξ, V)) = (ξ, F^{−1}_{µ′}(F_µ(ξ) − V ΔF_µ(ξ))) =^d (F^{−1}_µ(U), F^{−1}_{µ′}(F_µ(F^{−1}_µ(U)) − V ΔF_µ(F^{−1}_µ(U)))) =^d (F^{−1}_µ(U), F^{−1}_{µ′}(U)).

Since (F^{−1}_µ(U), F^{−1}_{µ′}(U)) is an optimal coupling for (µ, µ′), we get W(µ, µ′) = E[d(ξ, ζ(µ, µ′, ξ, V))].

(2) Let us now consider the case of a general compact Polish space X. Denoting by ζ_R the function ζ from the case X ⊂ R, and considering an embedding φ ∈ L^0(X, R) as in Lemma C.2, let us define

ζ(µ, µ′, x, u) := φ^{−1}( ζ_R(φ⋆µ, φ⋆µ′, φ(x), u) ),

which is well posed by definition of ζ_R and Lemma C.2. Now fix ξ ∼ µ, U a uniform variable independent of ξ, and define ξ′ := ζ(µ, µ′, ξ, U). By definition of ζ, it is clear that ξ′ ∼ µ′, and

E[ |φ(ξ) − φ(ξ′)| ] = W(φ⋆µ, φ⋆µ′). (C.2)

Fix ε > 0. We are looking for η, δ > 0 such that

W(µ, µ′) < η ⇒ W(φ⋆µ, φ⋆µ′) < δ ⇔ E[|φ(ξ) − φ(ξ′)|] < δ ⇒ E[d(ξ, ξ′)] < ε.

Let us first show that there exists δ > 0 such that E[|φ(ξ) − φ(ξ′)|] < δ ⇒ E[d(ξ, ξ′)] < ε. Fix γ > 0 such that |y − y′| < γ ⇒ d(φ^{−1}(y), φ^{−1}(y′)) < ε/2, by uniform continuity of φ^{−1}. Denoting by ∆_X the diameter of X, we then have

E[d(ξ, ξ′)] ≤ E[ d(ξ, ξ′) 1_{|φ(ξ) − φ(ξ′)| < γ} ] + (∆_X/γ) E[ |φ(ξ) − φ(ξ′)| ] ≤ ε/2 + (∆_X/γ) E[ |φ(ξ) − φ(ξ′)| ],

so that we can choose δ = γε/(2∆_X). On the other hand, by uniform continuity of φ and by definition of the Wasserstein metric, there exists η > 0 such that W(µ, µ′) < η ⇒ W(φ⋆µ, φ⋆µ′) < δ. From (C.2), we thus conclude that W(µ, µ′) < η ⇒ E[d(ξ, ξ′)] < ε. ✷
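The one-dimensional coupling ζ of case (1) is directly implementable for discrete measures: given a point x in an atom of µ, the extra uniform u spreads the mass of that atom over the corresponding quantile interval, so that the output is distributed exactly as µ′. The following sketch illustrates this; the discrete measures and helper names are hypothetical examples.

```python
import numpy as np

# 1-d coupling zeta(mu, mu', x, u) = F_{mu'}^{-1}(F_mu(x) - u * DeltaF_mu(x))
# for discrete measures given as (support, weights).

def quantile(support, weights, q):
    # generalized inverse F^{-1}(q) = inf{ x : F(x) >= q }
    cdf = np.cumsum(weights)
    return support[np.searchsorted(cdf, q, side="left").clip(0, len(support) - 1)]

def zeta(supp_mu, w_mu, supp_nu, w_nu, x, u):
    i = np.searchsorted(supp_mu, x)   # locate the atom of mu at x
    F = np.cumsum(w_mu)[i]            # F_mu(x)
    jump = w_mu[i]                    # DeltaF_mu(x) = mass of the atom
    return quantile(supp_nu, w_nu, F - u * jump)

rng = np.random.default_rng(0)
supp_mu, w_mu = np.array([0.0, 1.0, 2.0]), np.array([0.2, 0.5, 0.3])
supp_nu, w_nu = np.array([0.0, 0.5, 1.5]), np.array([0.4, 0.4, 0.2])

xi = rng.choice(supp_mu, p=w_mu, size=100_000)     # xi ~ mu
out = np.array([zeta(supp_mu, w_mu, supp_nu, w_nu, x, rng.uniform()) for x in xi])
for s, w in zip(supp_nu, w_nu):                    # check: output law is mu'
    print(s, w, np.mean(out == s))
```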
References

[1] E. Bayraktar, A. Cosso, and H. Pham. Randomized dynamic programming principle and Feynman-Kac representation for optimal control of McKean-Vlasov dynamics. Transactions of the American Mathematical Society, 370:2115–2160, 2018.
[2] A. Bensoussan, J. Frehse, and P. Yam. Mean field games and mean field type control theory. Springer, 2013.
[3] D. P. Bertsekas. Dynamic programming and optimal control, Vol. II: approximate dynamic programming. Athena Scientific, Belmont, MA, 4th edition, 2012.
[4] E. Boissard and T. Le Gouic. On the mean speed of convergence of empirical and occupation measures in Wasserstein distance. Annales de l'Institut Henri Poincaré - Probabilités et Statistiques, 50(2):539–563, 2014.
[5] R. Carmona and F. Delarue. Probabilistic Theory of Mean Field Games with Applications, vol. I. Probability Theory and Stochastic Modelling. Springer, 2018.
[6] R. Carmona, M. Laurière, and Z. Tan. Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning. arXiv:1910.12802v1, 2019.
[7] M. F. Djete, D. Possamai, and X. Tan. McKean-Vlasov optimal control: the dynamic programming principle. arXiv:1907.08860, 2019.
[8] J. Fontbona, H. Guérin, and S. Méléard. Measurability of optimal transportation and strong coupling of martingale measures. Electronic Communications in Probability, 15:124–133, 2010.
[9] M. Fornasier, S. Lisini, C. Orrieri, and G. Savaré. Mean-field optimal control as Gamma-limit of finite agent controls. European Journal of Applied Mathematics, pages 1–34, 2018.
[10] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162:707–738, 2015.
[11] H. Gu, X. Guo, X. Wei, and R. Xu. Dynamic programming principles for learning MFCs. arXiv:1911.07314, 2019.
[12] O. Kallenberg. Foundations of Modern Probability. Probability and its Applications (New York). Springer-Verlag, New York, 2nd edition, 2002.
[13] D. Lacker. Limit theory for controlled McKean-Vlasov dynamics. SIAM Journal on Control and Optimization, 55(3):1641–1672, 2017.
[14] M. Laurière and O. Pironneau. Dynamic programming for mean-field type control. Journal of Optimization Theory and Applications, 169(3):902–924, 2016.
[15] H. Pham and X. Wei. Discrete time McKean-Vlasov control problem: a dynamic programming approach. Applied Mathematics and Optimization, 74(3):487–506, 2016.
[16] H. Pham and X. Wei. Dynamic programming for optimal control of stochastic McKean-Vlasov dynamics. SIAM Journal on Control and Optimization, 55:1069–1101, 2017.
[17] S. T. Rachev and L. Rüschendorf. Mass transportation problems. Springer-Verlag, 1998.
[18] R. S. Sutton and A. G. Barto. Reinforcement learning: an introduction. MIT Press, Cambridge, MA, 2nd edition, 2017.
[19] A. van der Vaart and J. A. Wellner. Weak convergence and empirical processes. Springer-Verlag, 1996.
[20] C. Villani. Optimal transport: old and new, volume 338 of Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, Berlin, 2009.