Mean-field Markov decision processes with common noise and open-loop controls
aa r X i v : . [ m a t h . O C ] D ec Mean-field Markov decision processes with common noise andopen-loop controls
M´ed´eric MOTTE ∗ Huyˆen PHAM † December 18, 2019
Abstract
We develop an exhaustive study of Markov decision process (MDP) under meanfield interaction both on states and actions in the presence of common noise, and whenoptimization is performed over open-loop controls on infinite horizon. Such model,called CMKV-MDP for conditional McKean-Vlasov MDP, arises and is obtained hererigorously with a rate of convergence as the asymptotic problem of N -cooperative agentscontrolled by a social planner/influencer that observes the environment noises but notnecessarily the individual states of the agents. We highlight the crucial role of relaxedcontrols and randomization hypothesis for this class of models with respect to classicalMDP theory. We prove the correspondence between CMKV-MDP and a general liftedMDP on the space of probability measures, and establish the dynamic programmingBellman fixed point equation satisfied by the value function, as well as the existenceof ǫ -optimal randomized feedback controls. The arguments of proof involve an originalmeasurable optimal coupling for the Wasserstein distance. This provides a procedurefor learning strategies in a large population of interacting collaborative agents. MSC Classification:
Key words:
Mean-field, Markov decision process, conditional propagation of chaos, mea-surable coupling, randomized control. ∗ LPSM, Universit´e de Paris medericmotte at gmail.com The author acknowledges support of the DIMMathInnov. † LPSM, Universit´e de Paris, and CREST-ENSAE, pham at lpsm.paris This work was partially supportedby the Chair Finance & Sustainable Development / the FiME Lab (Institut Europlace de Finance) ontents N -agent and the limiting McKean-Vlasov MDP 53 Lifted MDP on P ( X ) X . . . . . . . . . . . . . . . . . . . . . . . . 16 P ( X ) P ( X ) . . . . . . . . . . . . . . . . . . . . . . . . . 194.2 Bellman fixed point on P ( X ) . . . . . . . . . . . . . . . . . . . . . . . . . . 214.3 Building ǫ -optimal randomized feedback controls . . . . . . . . . . . . . . . 234.4 Relaxing the randomization hypothesis . . . . . . . . . . . . . . . . . . . . . 284.5 Computing value function and ǫ -optimal strategies in CMKV-MDP . . . . . 31 Optimal control of McKean-Vlasov (MKV) systems, also known as mean-field control(MFC) problems, has sparked a great interest in the domain of applied probabilities duringthe last decade. In these optimization problems, the transition dynamics of the system andthe reward/gain function depend not only on the state and action of the agent/controller,but also on their probability distributions. These problems are motivated from models oflarge population of interacting cooperative agents obeying to a social planner (center ofdecision), and are often justified heuristically as the asymptotic regime with infinite num-ber of agents under Pareto efficiency. Such problems have found numerous applications indistributed energy, herd behavior, finance, etc.A large literature has already emerged on continuous-time models for the optimal controlof McKean-Vlasov dynamics, and dynamic programming principle (in other words timeconsistency) has been established in this context in the papers [14], [16], [1], [7]. We pointout the work [13], which is the first paper to rigorously connect mean-field control to largesystems of controlled processes, see also the recent paper [9], and refer to the books [2], [5]for an overview of the subject.
Our work and main contributions.
In this paper, we introduce a general discretetime framework by providing an exhaustive study of Markov decision process (MDP) undermean-field interaction in the presence of common noise , and when optimization is performed2ver open-loop controls on infinite horizon. Such model is called conditional McKean-Vlasov MDP, and shortly abbreviated in the sequel as CMKV-MDP. Our set-up is themathematical framework for a theory of reinforcement learning with mean-field interaction,and is notably motivated by the recent popularization of targeted advertising, in whichcontrols are naturally of open-loop form as the individuals states are inaccessible. Thecommon noise is also a feature of interest to model the impact of public data on thepopulation and to understand how it may affect the strategy of the social planner/influencer.Compared to continuous-time models, discrete-time McKean-Vlasov control problemshave been less studied in the literature. In [15], the authors consider a finite-horizon problemwithout common noise and state the dynamic programming (Bellman) equation for MFCwith closed-loop (also called feedback) controls, that are restricted to depend on the state.Very recently, the works [6], [11] addressed Bellman equations for MFC problems in thecontext of reinforcement learning. The paper [11] considers relaxed controls in their MFCformulation but without common noise, and derives the Bellman equation for the Q -valuefunction as a deterministic control problem that we obtain here as a particular case (see ourRemark 4.11). The framework in [6] is closest to ours by considering also common noise,however with the following differences: these authors restrict their attention to stationaryfeedback policies, and reformulate their MFC control problem as a MDP on the spaceof probability measures by deriving formally (leaving aside the measurability issues andassuming the existence of a stationary feedback control) the associated Bellman equation,which is then used for the development of Q -learning algorithms. Notice that [6], [11] do notconsider dependence upon the probability distribution of the control in the state transitiondynamics and reward function.Besides the introduction of a general framework including a mean-field dependence onthe pair state/action, our first main contribution is to rigorously connect CMKV-MDP toa large but finite system of MDP with interacting processes. We prove the almost sureand L conditional propagation of chaos, i.e., the convergence, as the number of interactingagents N tends to infinity, of the state processes and gains of the N -individual populationcontrol problem towards the corresponding object in the CMKV-MDP. Furthermore, byrelying on rate of convergence in Wasserstein distance of the empirical measure, we give arate of convergence for the limiting CMKV-MDP under suitable Lipschitz assumptions onthe state transition and reward functions, which is new to the best of our knowledge.Our second contribution is to obtain the correspondence of our CMKV-MDP with asuitable lifted MDP on the space of probability measures. Starting from open-loop controls,this is achieved in general by introducing relaxed (i.e. measure-valued) controls in theenlarged state/action space, and by emphasizing the measurability issues arising in thepresence of common noise and with continuous state space. 
In the special case withoutcommon noise or with discrete state space, the relaxed control in the lifted MDP is reducedto the usual notion in control theory, also known as mixed or randomized strategies ingame theory. While it is known in standard MDP that an optimal control (when it exists)is in pure form, relaxed control appears naturally in MFC where the social planner hasto sample the distribution of actions instead of simply assigning the same pure strategyamong the population in order to perform the best possible collective gain.3he reformulation of the original problem as a lifted MDP leads us to consider an asso-ciated dynamic programming equation written in terms of a Bellman fixed point equationin the space of probability measures. Our third contribution is to establish rigorously theBellman equation satisfied by the state value function of the CMKV-MDP, and then by thestate-action value function, called Q -function in the reinforcement learning terminology.This is obtained under the crucial assumption that the initial information filtration is gene-rated by an atomless random variable, i.e., that it is rich enough, and calls upon originalmeasurable optimal coupling results for the Wasserstein distance. Moreover, and this isour fourth contribution, the methodology of proof allows us to obtain as a by-product theexistence of an ǫ -optimal control, which is constructed from randomized feedback policiesunder a randomization hypothesis. This shows in particular that the value function ofCMKV-MDP over open-loop controls is equal to the value function over randomized feed-back controls, and we highlight that it may be strictly larger than the value function ofCMKV-MDP over “pure” feedback controls, i.e., without randomization. This is a notabledifference with respect to the classical (without mean-field dependence) theory of MDP asstudied e.g. in [3], [18].Finally, we discuss how to compute the value function and approximate optimal ran-domized feedback controls from the Bellman equation according to value or policy iterationmethods and by discretization of the state space and of the space of probability measures.Reinforcement learning algorithms and practical implementation are postponed to a com-panion paper with applications to model for targeted advertising in social networks. Outline of the paper.
The rest of the paper is organized as follows. Section 2 carefullyformulates both the N -individual model and the CMKV-MDP, and show their connectionby providing the rate of convergence of the latter to the limiting MFC when N goes toinfinity. In Section 3, we establish the correspondence of the CMKV-MDP with a liftedMDP on the space of probability measures with usual relaxed controls when there is nocommon noise or when the state space is discrete. In the general case considered in Section4, we show how to lift the CMKV-MDP by a suitable enlargement of the action space inorder to get the correspondance with a MDP on the Wasserstein space. We then derivethe associated Bellman fixed point equation satisfied by the value function, and obtainthe existence of approximate randomized feedback controls. We conclude in Section 5 byindicating some questions for future research. Finally, we collect in the Appendix someuseful and technical results including measurable coupling arguments used in the proofs ofthe paper. Notations.
Given two measurable spaces ( X , Σ ) and ( X , Σ ), we denote by pr (resp.pr ) the projection function ( x , x ) ∈ X ×X x ∈ X (resp. x ∈ X ). For a measurablefunction Φ : X → X , and a positive measure µ on ( X , Σ ), the pushforward measureΦ ⋆ µ is the measure on ( X , Σ ) defined byΦ ⋆ µ ( B ) = µ (cid:0) Φ − ( B ) (cid:1) , ∀ B ∈ Σ . We denote by P ( X ) the set of probability measures on X , and C ( X ) the cylinder (orweak) σ -algebra on P ( X ), that is the smallest σ -algebra making all the functions µ ∈P ( X ) µ ( B ) ∈ [0 , B ∈ Σ .4 probability kernel ν on X × X , denoted ν ∈ ˆ X ( X ), is a measurable mapping from( X , Σ ) into ( P ( X ) , C ( X )), and we shall write indifferently ν ( x , B ) = ν ( x )( B ), for all x ∈ X , B ∈ Σ . Given a probability measure µ on ( X , Σ ), and a probability kernel ν ∈ ˆ X ( X ), we denote by µ · ν the probability measure on ( X × X , Σ ⊗ Σ ) defined by( µ · ν )( B × B ) = Z B × B µ (d x ) ν ( x , d x ) , ∀ B ∈ Σ , B ∈ Σ . Let X and X be two random variables valued respectively on X and X , denoted X i ∈ L (Ω; X i ). We denote by L ( X i ) the probability distribution of X i , and by L ( X | X ) theconditional probability distribution of X given X . With these notations, when X =Φ( X ), then L ( X ) = Φ ⋆ L ( X ).When ( Y , d ) is a compact metric space, the set P ( Y ) of probability measures on Y isequipped with the Wasserstein distance W ( µ, µ ′ ) = inf n Z Y d ( y, y ′ ) µ (d y, d y ′ ) : µ ∈ Π ( µ, µ ′ ) o , where Π ( µ, µ ′ ) is the set of probability measures on Y × Y with marginals µ and µ ′ , i.e.,pr ⋆ µ = µ , and pr ⋆ µ = µ ′ . Since ( Y , d ) is compact, it is known (see e.g. Corollary 6.13in [20]) that the Borel σ -algebra generated by the Wasserstein metric coincides with thecylinder σ -algebra on P ( Y ), i.e., Wasserstein distances metrize weak convergence. We alsorecall the dual Kantorovich-Rubinstein representation of the Wasserstein distance W ( µ, µ ′ ) = sup n Z Y φ d( µ − µ ′ ) : φ ∈ L lip ( Y ; R ) , [ φ ] lip ≤ o , where L lip ( Y ; R ) is the set of Lipschitz continuous functions φ from Y into R , and [ φ ] lip =sup {| φ ( y ) − φ ( y ′ ) | /d ( y, y ′ ) : y, y ′ ∈ Y , y = y ′ } . N -agent and the limiting McKean-Vlasov MDP We formulate the mean-field Markov Decision Process (MDP) in a large population modelwith indistinguishable agents i ∈ N ∗ = N \ { } .Let X (the state space) and A (the action space) be two compact Polish spaces equippedrespectively with their metric d and d A . We denote by P ( X ) (resp. P ( A )) the space ofprobability measures on X (resp. A ) equipped respectively with their Wasserstein dis-tance W and W A . We also consider the product space X × A , equipped with the metric d (( x, a ) , ( x ′ , a ′ )) = d ( x, x ′ ) + d A ( a, a ′ ), x, x ′ ∈ X , a, a ′ ∈ A , and the associated space ofprobability measure P ( X × A ), equipped with its Wasserstein distance W . Let G , E , and E be three measurable spaces, representing respectively the initial information, idiosyn-cratic noise, and common noise spaces.We denote by Π OL the set of sequences ( π t ) t ∈ N (called open-loop policies ) where π t is ameasurable function from G × E t × ( E ) t into A for t ∈ N .Let (Ω , F , P ) be a probability space on which are defined the following family of mutuallyi.i.d. 
random variables 5 (Γ i , ξ i ) i ∈ N ⋆ (initial informations and initial states) valued in G × X• ( ε it ) i ∈ N ⋆ ,t ∈ N (idiosyncratic noises) valued in E with probability distribution λ ε • ε := ( ε t ) t ∈ N (common noise) valued in E .Without loss of generality, we may assume that F contains an atomless random variable,i.e., F is rich enough, so that any probability measure ν on X (resp. A or X × A ) can berepresented by the law of some random variable Y on X (resp. A or X × A ), and we write Y ∼ ν , i.e., L ( Y ) = ν . Given an open-loop policy π , we associate an open-loop control forindividual i ∈ N ∗ as the process α i,π defined by α i,π = π t (Γ i , ( ε is ) s ≤ t , ( ε s ) s ≤ t ) , t ∈ N . In other words, an open-loop control is a non-anticipative process that depends on theinitial information, the past idiosyncratic and common noises, but not on the states of theagent in contrast with closed-loop control.Given N ∈ N ∗ , and π ∈ Π OL , the state process of agent i = 1 , . . . , N in an N -agentMDP is given by the dynamical system ( X i,N,π = ξ i X i,N,πt +1 = F ( X i,N,πt , α i,πt , N P Nj =1 δ ( X j,N,πt ,α j,πt ) , ε it +1 , ε t +1 ) , t ∈ N , where F is a measurable function from X × A × P ( X × A ) × E × E into X , called statetransition function. The i -th individual contribution to the influencer’s gain over an infinitehorizon is defined by J N,πi := ∞ X t =0 β t f (cid:16) X i,N,πt , α i,πt , N N X j =1 δ ( X j,N,πt ,α j,πt ) (cid:17) , i = 1 , . . . , N, where the reward f is a mesurable real-valued function on X × A × P ( X × A ), assumed tobe bounded (recall that X and A are compact spaces), and β is a positive discount factorin [0 , J N,π := 1 N N X i =1 J N,πi , V
N,π := E (cid:2) J N,π (cid:3) , and the optimal value of the influencer is V N := sup π ∈ Π OL V N,π . Observe that the agentsare indistinguishable in the sense that the initial pair of information/state (Γ i , ξ i ) i , andidiosyncratic noises are i.i.d., and the state transition function F , reward function f , anddiscount factor β do not depend on i .Let us now consider the asymptotic problem when the number of agents N goes toinfinity. In view of the propagation of chaos argument, we expect that the state process ofagent i ∈ N ∗ in the infinite population model to be governed by the conditional McKean-Vlasov dynamics ( X i,π = ξ i X i,πt +1 = F ( X i,πt , α i,πt , P X i,πt ,α i,πt ) , ε it +1 , ε t +1 ) , t ∈ N . (2.1)6ere, we denote by P and E the conditional probability and expectation knowing the com-mon noise ε , and then, given a random variable Y valued in Y , we denote by P Y or L ( Y )its conditional law knowing ε , which is a random variable valued in P ( Y ) (see LemmaA.2). The i -th individual contribution to the influencer’s gain in the infinite populationmodel is J πi := ∞ X t =0 β t f (cid:0) X i,πt , α i,πt , P X i,πt ,α i,πt ) (cid:1) , i ∈ N ∗ , and we define the conditional gain, expected gain, and optimal value, respectively by J π := E (cid:2) J πi (cid:3) = E (cid:2) J π (cid:3) , i ∈ N ∗ (by indistinguishability of the agents) ,V π := E (cid:2) J π (cid:3) , V := sup π ∈ Π OL V π . (2.2)Problem (2.1)-(2.2) is called conditional McKean-Vlasov Markov decision process, CMKV-MDP in short.The main goal of this Section is to rigorously connect the finite-agent model to theinfinite population model with mean-field interaction by proving the convergence of the N -agent MDP to the CMKV-MDP. First, we prove the almost sure convergence of thestate processes under some continuity assumptions on the state transition function. Proposition 2.1
Assume that for all ( x , a, ν , e ) ∈ X × A ×P ( X × A ) × E , the (random)function ( x, ν ) ∈ X × P ( X × A ) F ( x, a, ν, ε , e ) ∈ X is continuous in ( x , ν ) a.s. Then, for any π ∈ Π OL , X i,N,πt a.s. → N →∞ X i,πt , i ∈ N ∗ , t ∈ N , and W (cid:16) N N X i =1 δ ( X i,N,πt ,α i,πt ) , P X i,πt ,α i,πt ) (cid:17) a.s. −→ N →∞ , N N X i =1 d ( X i,N,πt , X i,πt ) → N →∞ a.s. Furthermore, if we assume that for all a ∈ A , the function ( x, ν ) ∈ X × P ( X × A ) f ( x, a, ν ) is continuous, then J N,πi a.s. −→ N →∞ J πi , J N,π a.s. −→ N →∞ J π , V N,π −→ N →∞ V π , and lim inf N →∞ V N ≥ V. Proof.
Fix π ∈ Π OL . We omit the dependence of the state processes and control on π ,and denote by ν Nt := N P Ni =1 δ ( X i,Nt ,α it ) , ν N, ∞ t := N P Ni =1 δ ( X it ,α it ) , and ν t := P X it ,α it ) .(1) We first prove the convergence of trajectories, for all i ∈ N ⋆ , P (cid:2) lim N →∞ ( X i,Nt , ν Nt ) = ( X it , ν t ) (cid:3) = 1 , P h lim N →∞ N N X i =1 d ( X i,Nt , X it ) = 0 i = 1 , by induction on t ∈ N . - Initialization. For t = 0, we have X i,N = ξ i = X i , α i = π (Γ i ), for all N ∈ N ⋆ and i ∈ N ⋆ ,which obviously implies that X i,N a.s. → N →∞ X i and N P Ni =1 d ( X i,N , X i ) → N →∞
0. Moreover,7 ( ν N , ν ) = W ( ν N, ∞ , ν ), which converges to zero a.s., when N goes to infinity, by theweak convergence of empirical measures (see [19]), and the fact that Wasserstein distancemetrizes weak convergence. - Hereditary Property. We have X i,Nt +1 = F ( X i,Nt , α it , ν Nt , ε it +1 , ε t +1 ) and X it +1 = F ( X it , α it , ν t , ε it +1 , ε t +1 ) . (2.3)By a simple conditioning, we notice that P (cid:2) lim N →∞ X i,Nt +1 = X it +1 (cid:3) = E (cid:2) P (cid:0) ( X i,Nt , ν Nt ) N , X it , ν t , α it (cid:1)(cid:3) ,where P (cid:0) ( x N , ν N ) N , x, ν, a ) (cid:1) = P h lim N →∞ F ( x N , a, ν N , ε it +1 , ε t +1 ) = F ( x, a, ν, ε it +1 , ε t +1 ) i . By the continuity assumption on F , we see that P (cid:0) ( x N , ν N ) N , x, ν, a ) (cid:1) is bounded frombelow by lim N →∞ ( x N ,ν N )=( x,ν ) , and thus P (cid:2) lim N →∞ X i,Nt +1 = X it +1 (cid:3) ≥ P (cid:2) lim N →∞ ( X i,Nt , ν Nt ) = ( X it , ν t ) (cid:3) . This proves by induction hypothesis that P (cid:2) lim N →∞ X i,Nt +1 = X it +1 (cid:3) = 1.Let us now show that N P Ni =1 d ( X i,Nt +1 , X it +1 ) a.s. → N →∞
0. From (2.3), we have d ( X i,Nt +1 , X it +1 ) ≤ ∆ D N F ( X it , α it , ν t , ε it +1 , ε t +1 )with D N := max[ d ( X i,Nt , X it ) , W ( ν Nt , ν t )], and∆ y F ( x, a, ν, e, e ) := sup ( x ′ ,ν ′ ) ∈ D { d ( F ( x ′ , a, ν ′ , e, e ) , F ( x, a, ν, e, e )) max[ d ( x ′ ,x ) , W ( ν ′ ,ν )] ≤ y } , where D is a fixed countable dense set of the separable space X × P ( X × A ), which impliesthat ( y, x, a, ν, e, e ) ∆ y F ( x, a, ν, e, e ) is a measurable function. Fix ǫ >
0. Let ∆ X denote the diameter of the compact metric space X . We thus have, for any η > d ( X i,Nt +1 , X it +1 ) ≤ d ( X i,Nt +1 , X it +1 ) D N ≤ η + d ( X i,Nt +1 , X it +1 ) D N >η ≤ ∆ η F ( X it , α it , ν t , ε t +1 , ε t +1 ) + ∆ X d ( X N,it ,X it ) >η + ∆ X W ( ν Nt ,ν t ) >η , and thus 1 N N X i =1 d ( X i,Nt +1 , X it +1 ) ≤ N N X i =1 ∆ η F ( X it , α it , ν t , ε t +1 , ε t +1 ) + ∆ X ηN N X i =1 d ( X i,Nt , X it ) + ∆ X W ( ν Nt ,ν t ) >η . The second and third terms in the right hand side converge to 0 by induction hypothesis,and by Proposition B.1, the first term converges to E (cid:2) ∆ η F ( X t , α t , ν t , ε t +1 , ε t +1 ) (cid:3) = E h E (cid:2) ∆ η F ( x t , a t , ν t , ε t +1 , e t +1 ) (cid:3) ( x t ,a t ,e t +1 ):=( X t ,α t ,ε t +1 ) i . η →
0, the inner expectation tends to zero by continuity assumption on F and bydominated convergence. Then, the outer expectation converges to zero by conditionaldominated convergence, and will thus be smaller than ǫ for η small enough, which impliesthat N P Ni =1 d ( X i,Nt +1 , X it +1 ) will be smaller than ǫ for N large enough.Let us finally prove that W ( ν Nt +1 , ν t +1 ) a.s. → N →∞
0. We have W ( ν Nt +1 , ν t +1 ) ≤ W ( ν Nt +1 , ν N, ∞ t +1 )+ W ( ν N, ∞ t +1 , ν t +1 ). To dominate the first term W ( ν Nt +1 , ν N, ∞ t +1 ), notice that, given a variable U N ∼ U ( { , ..., N } ), the random measure ν Nt +1 (resp. ν N, ∞ t +1 ) is, at fixed ω ∈ Ω, the law of thepair of random variable ( X U N ,Nt +1 ( ω ) , α U N t +1 ( ω )) (resp. ( X U N t +1 ( ω ) , α U N t +1 ( ω )) where we stress thatonly U N is random here, essentially selecting each sample of these empirical measures withprobability N . Thus, by definiton of the Wasserstein distance, W ( ν Nt +1 , ν N, ∞ t +1 ) is dominatedby E [ d ( X U N ,Nt +1 ( ω ) , X U N t +1 ( ω ))] = N P Ni =1 d ( X i,Nt +1 ( ω ) , X it +1 ( ω )), which has been proved to con-verge to zero. For the second term, observe that α it +1 = π t +1 (Γ i , ( ε is ) s ≤ t +1 , ( ε s ) s ≤ t +1 ), andby Proposition A.1, there exists a measurable function f t +1 : X × G × E t +1 × ( E ) t +1 → X such that X i,Nt +1 = f t +1 (Γ i , ( ε is ) s ≤ t +1 , ( ε s ) s ≤ t +1 )) . From Proposition B.1, we then deducethat W ( ν N, ∞ t +1 , ν t +1 ) converges to zero as N goes to infinity. This concludes the induction.(2) Let us now study the convergence of gains. By the continuity assumption on f , wehave f ( X i,Nt , α it , ν Nt ) a.s. → N →∞ f ( X it , α it , ν t ) for all t ∈ N . Thus, as f is bounded, we get bydominated convergence that J N,πi a.s. → N →∞ J πi . Let us now study the convergence of J N,π to J π . We write | J N,π − J π | ≤ N N X i =1 | J N,πi − J πi | + (cid:12)(cid:12)(cid:12) N N X i =1 J πi − J π (cid:12)(cid:12)(cid:12) =: S N + S N . The second term S N converges a.s. to zero by Propositions A.1 and B.1, as N goes toinfinity. On the other hand, S N ≤ ∞ X t =0 β t ∆ N ( f ) , with ∆ N ( f ) := 1 N N X i =1 (cid:12)(cid:12) f ( X i,Nt , α it , ν Nt ) − f ( X it , α it , ν t ) (cid:12)(cid:12) . By the same argument as above in (1) for showing that N P Ni =1 d ( X i,Nt +1 , X it +1 ) a.s. → N →∞
0, weprove that ∆ N ( f ) tends a.s. to zero as N → ∞ . Since f is bounded, we deduce bythe dominated convergence theorem that S N converges a.s. to zero as N goes to infinity,and thus J N,π a.s. → N →∞ J π . By dominated convergence, we then also obtain that V N,π = E [ J N ] a.s. → N →∞ E [ J π ] = V π . Finally, by considering an ǫ -optimal policy π ǫ for V , we havelim inf N →∞ V N ≥ lim N →∞ V N,π ǫ = V π ǫ ≥ V − ǫ, which implies, by sending ǫ to zero, that liminf N →∞ V N ≥ V . ✷ Next, under Lipschitz assumptions on the state transition and reward functions, weprove the corresponding convergence in L , which implies the convergence of the optimalvalue, and also a rate of convergence in terms of the rate of convergence in Wassersteindistance of the empirical measure. 9 HF lip ) There exists K F >
0, such that for all a ∈ A , e ∈ E , x, x ′ ∈ X , ν, ν ′ ∈ P ( X × A ), E (cid:2) d (cid:0) F ( x, a, ν, ε , e ) , F ( x ′ , a, ν ′ , ε , e ) (cid:1)(cid:3) ≤ K F (cid:0) d ( x, x ′ ) + W ( ν, ν ′ ) (cid:1) ) . ( Hf lip ) There exists K f >
0, such that for all a ∈ A , x, x ′ ∈ X , ν, ν ′ ∈ P ( X × A ), d ( f ( x, a, ν ) , f ( x ′ , a, ν ′ )) ≤ K f (cid:0) d ( x, x ′ ) + W ( ν, ν ′ ) (cid:1) ) . Remark 2.1
We stress the importance of making the Lipschitz assumption for F in ex-pectation only. Indeed, most of the practical applications we have in mind concerns discretespaces X , for which F cannot be, strictly speaking, Lipschitz, since it maps from a continu-ous space P ( X × A ) to a discrete space X . However, F can be Lipschitz in expectation , e.g.once integrated w.r.t. the idiosyncratic noise, and it is actually a very natural phenomenon. ✷ In the sequel, we shall denote by ∆ X the diameter of the metric space X , and define M N := sup ν ∈P ( X × A ) E [ W ( ν N , ν )] , (2.9)where ν N is the empirical measure ν N = N P Nn =1 δ Y n , ( Y n ) ≤ n ≤ N are i.i.d. random variableswith law ν . We recall in the next Lemma recent results about non asymptotic bounds forthe mean rate of convergence in Wasserstein distance of the empirical measure. Lemma 2.1
We have M N → N →∞ . Furthermore, • If X × A ⊂ R d for some d ∈ N ⋆ , then: M N = O ( N − ) for d = 1 , M N = O ( N − log(1 + N )) for d = 2 , and M N = O ( N − d ) for d ≥ . • If for all δ > , the smallest number of balls with radius δ covering the compactmetric set X × A with diameter ∆ X × A is smaller than O (cid:16)(cid:0) ∆ X× A δ (cid:1) θ (cid:17) for θ > , then M N = O ( N − /θ ) . Proof.
The first point is proved in [10], and the second one in [4]. ✷ Theorem 2.1
Assume ( HF lip ) . For all i ∈ N ∗ , t ∈ N , sup π ∈ Π OL E (cid:2) d ( X i,N,πt , X i,πt ) (cid:3) = O ( M N ) , (2.10)sup π ∈ Π OL E h W (cid:16) N N X i =1 δ ( X i,N,πt ,α i,π ) , P X i,πt ,α i,π ) (cid:17)i = O ( M N ) . (2.11) Furthermore, if we assume ( Hf lip ) , then for any η > and γ = min (cid:2) , | ln β | (ln 2 K F ) + − η (cid:3) , thereexists a constant C = C ( K F , K f , β, γ ) (explicit in the proof ) such that for N large enough, sup π ∈ Π OL | V N,π − V π | ≤ CM γN , and thus | V N − V | = O ( M γN ) . Consequently, any ε − optimal policy for the CMKV-MDPis O ( ε ) -optimal for the N -agent MDP problem for N large enough, namely M γN = O ( ǫ ) . roof. Given π ∈ Π OL , denote by ν N,πt := N P Ni =1 δ ( X i,N,πt ,α i,πt ) , ν N, ∞ ,πt := N P Ni =1 δ ( X i,πt ,α i,πt ) and ν πt := P X i,πt ,α it ) . By definition, α i,πt = π t (Γ i , ( ε is ) s ≤ t , ( ε s ) s ≤ t ), and by Lemma A.1, wehave X i,πt = f πt ( ξ i , Γ i , ( ε is ) s ≤ t , ( ε s ) s ≤ t ) for some measurable function f πt ∈ L ( X × G × E t × ( E ) t , X ).By definition of M N in (2.9), we have E h W ( ν N, ∞ ,πt , ν πt ) i ≤ M N , ∀ N ∈ N , ∀ π ∈ Π OL . (2.13)Let us now prove (2.10) by induction on t ∈ N . At t = 0, X i,N,π = X i,π = ξ i , and theresult is obvious. Now assume that it holds true at time t ∈ N and let us show that itthen holds true at time t + 1. By a simple conditioning argument, E [ d ( X i,N,πt +1 , X i,πt +1 )] = E (cid:2) ∆ (cid:0) X i,N,πt , X i,πt , α it , ν N,πt , ν πt , ε t +1 (cid:1)(cid:3) , where∆( x, x ′ , a, ν, ν ′ , e ) = E [ d ( F ( x, a, ν, ε it +1 , e ) , F ( x ′ , a, ν ′ , ε it +1 , e ))] ≤ K F (cid:0) d ( x, x ′ ) + W ( ν, ν ′ ) (cid:1) , (2.14)by ( HF lip ). On the other hand, we have E (cid:2) W ( ν N,πt , ν πt ) (cid:3) ≤ E (cid:2) W ( ν N,πt , ν N, ∞ ,πt ) (cid:3) + E (cid:2) W ( ν N, ∞ ,πt , ν πt ) (cid:3) ≤ E [ d ( X i,N,πt , X i,πt )] + M N , (2.15)where we used the fact that W ( ν N,πt , ν N, ∞ ,πt ) ≤ N P Ni =1 d ( X i,N,πt , X i,πt ), and (2.13). Itfollows from (2.14) that E (cid:2) d ( X i,N,πt +1 , X i,πt +1 ) (cid:3) ≤ K F (cid:16) E [ d ( X i,N,πt , X i,πt )] + M N (cid:17) , ∀ π ∈ Π OL , (2.16)which proves that sup π ∈ Π OL E [ d ( X i,N,πt +1 , X i,πt +1 )] = O ( M N ) by induction hypothesis, and thus(2.10). Plugging (2.10) into (2.15) then yields (2.11).Let us now prove the convergence of gains. From ( Hf lip ), and (2.15), we have | V N,π − V π | ≤ K f ∞ X t =0 β t E h d ( X i,N,πt , X i,πt ) + W ( ν N,πt , ν πt ) i ≤ K f (cid:16) ∞ X t =0 β t δ Nt + M N − β (cid:17) , ∀ π ∈ Π OL , (2.17)where we set δ Nt := sup π ∈ Π OL E [ d ( X i,N,πt , X i,πt )]. From (2.16), we have: δ Nt +1 ≤ K F δ Nt + K F M N , t ∈ N , with δ N = 0, and so by induction: δ Nt ≤ K F − K F M N + s t (cid:0) K F | K F − | M N (cid:1) , s t ( m ) := m (2 K F ) t , m ≥ . where we may assume w.l.o.g. that 2 K F = 1. Observing that we obviously have δ Nt ≤ ∆ X (the diameter of X ), we deduce that ∞ X t =0 β t δ Nt ≤ K F (1 − K F )(1 − β ) M N + S (cid:0) K F | K F − | M N (cid:1) (2.18) S ( m ) := ∞ X t =0 β t min (cid:2) s t ( m ); ∆ X (cid:3) , m ≥ . M N ) N converges to zero, we may restrict the study of the function S to the interval [0 , ∆ X ], and we notice that it satisfies the dynamic programming relation S ( m ) = m + βS (cid:0) min[ s ( m ) , ∆ X ] (cid:1) , m ∈ [0 , ∆ X ] . 
Therefore, S can be viewed as the unique fixed point in the Banach space L ∞ ([0 , ∆ X ])of the β -contracting operator H defined by [ H ϕ ]( m ) = m + βϕ (min[ s ( m ) , ∆ X ]), and isobtained as the limit of fixed point iterations: S n +1 = H S n , n ∈ N , S ≡
0. Fix γ ≥ n ∈ N , that S n ( m ) ≤ K γ,n m γ , m ∈ [0 , ∆ X ], for some suitable sequence ( K γ,n ) n . By writing that S n +1 = H S n at step n + 1,and using the induction hypothesis, we have for all m ∈ [0 , ∆ X ], S n +1 ( m ) ≤ m + βK γ,n s ( m ) γ ≤ (cid:0) ∆ − γ X + βK γ,n (2 K F ) γ (cid:1) m γ . By setting K γ,n +1 := ∆ − γ X + β (2 K F ) γ K γ,n , starting with K γ, = 0, we see that therequired relation is satisfied at iteration n + 1. Now, by taking γ < | ln( β ) | (ln 2 K F ) + , we have K γ,n = ∆ − γ X − ( β (2 K F ) γ ) n − β (2 K F ) γ → K γ := ∆ − γ X − β (2 K F ) γ , as n goes to infinity, and thus S ( m ) ≤ K γ m γ , ∀ m ∈ [0 , ∆ X ] . From (2.17)-(2.18), it follows that for N large enough (so that M N < ∆ X | K F − | K F ), andtaking γ = min (cid:2) , | ln β | (ln 2 K F ) + − η (cid:3) , for η > | V N,π − V π | ≤ CM γN , π ∈ Π OL , for some constant C that is explicit in terms of K f , K F , β and γ . This concludes the proof. ✷ Remark 2.2
If the Lipschitz constant in ( HF lip ) satisfies β K F <
1, then we can take γ = 1 in the rate of convergence (2.12) of the optimal value, and this can be directly derivedfrom (2.17) since in this case P ∞ t =0 ( β K F ) t is finite and so P ∞ t =0 β t δ Nt = O ( M N ). Thefunction S in the above proof is introduced for dealing with the case when β K F > F and f depend on the joint distribution ν ∈ P ( X × A ) onlythrough its marginals on P ( X ) and P ( A ), which is the usual framework considered incontrolled mean-field dynamics, then a careful look in the above proof shows that the rateof convergence of the CMKV-MDP will be expressed in terms of˜ M N := max n sup µ ∈P ( X ) E [ W ( µ N , µ )] , sup υ ∈P ( A ) E [ W A ( υ N , υ )] o , instead of M N in (2.9), where here µ N (resp. υ N ) is the empirical measure associated to µ (resp. υ ) ∈ P ( X ) (resp. P ( A )). From Lemma 2.1, the speed of convergence of ˜ M N is fasterthan the one of M N . For instance when X ⊂ R , A ⊂ R , then ˜ M N = O ( N − / ), while M N = O (cid:0) N − / log(1 + N ) (cid:1) . ✷ Lifted MDP on P ( X ) Theorem 2.1 justifies the CMKV-MDP as a macroscopic approximation of the N -agentMDP problem with mean-field interaction. Observe that the computation of the condi-tional gain, expected gain and optimal value of the CMKV-MDP in (2.2), only requiresthe variables associated to one agent, called representative agent. Therefore, we placeourselves in a reduced universe, dismissing other individuals variables, and renaming therepresentative agent’s initial information by Γ, initial state by ξ , idiosyncratic noise by( ε t ) t ∈ N . In the sequel, we shall denote by G = σ (Γ) the σ -algebra generated by the randomvariable Γ, hence representing the initial information filtration, and by L ( G ; X ) the set of G -measurable random variables valued in X . We shall assume that the initial state ξ ∈ L ( G ; X ), which means that the policy has access to the agent’s initial state through theinitial information filtration G .An open-loop control for the representative agent is a process α , which is adapted to thefiltration generated by (Γ , ( ε s ) s ≤ t , ( ε s ) s ≤ t ) t ∈ N , and associated to an open-loop policy by: α t = α πt := π t (Γ , ( ε s ) s ≤ t , ( ε s ) s ≤ t ) for some π ∈ Π OL . We denote by A the set of open-loopcontrols, and given α ∈ A , ξ ∈ L ( G ; X ), the state process X = X ξ,α of the representativeagent is governed by X t +1 = F ( X t , α t , P X t ,α t ) , ε t +1 , ε t +1 ) , t ∈ N , X = ξ. (3.1)For α = α π , π ∈ Π OL , we write indifferently X ξ,π = X ξ,α , and the expected gain V α = V π equal to V α ( ξ ) = E h X t ∈ N β t f ( X t , α t , P X t ,α t ) ) i , where we stress the dependence upon the initial state ξ . The value function to the CMKV-MDP is then defined by V ( ξ ) = sup α ∈A V α ( ξ ) , ξ ∈ L ( G ; X ) . Let us now show how one can lift the CMKV-MDP to a (classical) MDP on the spaceof probability measures P ( X ). We set F as the filtration generated by the common noise ε . Given an open-loop control α ∈ A , and its state process X = X ξ,α , denote by { µ t = P X t , t ∈ N } , the random P ( X )-valued process, and notice from Proposition A.1 that ( µ t ) t is F -adapted. From (3.1), and recalling the pushforward measure notation, we have µ t +1 = F (cid:0) · , · , P X t ,α t ) , · , ε t +1 (cid:1) ⋆ (cid:0) P X t ,α t ) ⊗ λ ε (cid:1) , a.s. (3.2)As the probability distribution λ ε of the idiosyncratic noise is a fixed parameter, the aboverelation means that µ t +1 only depends on P X t ,α t ) and ε t +1 . 
Moreover, by introducing theso-called relaxed control associated to the open-loop control α asˆ α t ( x ) = L (cid:0) α t | X t = x (cid:1) , t ∈ N , A ( X ), the set of probability kernels on X × A (see Lemma A.2), we seefrom Bayes formula that P X t ,α t ) = µ t · ˆ α t . The dynamics relation (3.2) is then written as µ t +1 = ˆ F ( µ t , ˆ α t , ε t +1 ) , t ∈ N , where the function ˆ F : P ( X ) × ˆ A ( X ) × E → P ( X ) is defined byˆ F ( µ, ˆ a, e ) = F (cid:0) · , · , µ · ˆ a, · , e (cid:1) ⋆ (cid:0) ( µ · ˆ a ) ⊗ λ ε (cid:1) . (3.3)On the other hand, by the law of iterated conditional expectation, the expected gaincan be written as V α ( ξ ) = E h X t ∈ N β t E (cid:2) f ( X t , α t , P X t ,α t ) ) (cid:3)i , with the conditional expectation term equal to E (cid:2) f ( X t , α t , P X t ,α t ) ) (cid:3) = ˆ f ( µ t , ˆ α t ) , where the function ˆ f : P ( X ) × ˆ A ( X ) → R is defined byˆ f ( µ, ˆ a ) = Z X × A f ( x, a, µ · ˆ a )( µ · ˆ a )(d x, d a ) . (3.4)The above derivation suggests to consider a MDP with state space P ( X ), action spaceˆ A ( X ), a state transition function ˆ F as in (3.3), a discount factor β ∈ [0 , f as in (3.4). A key point is to endow ˆ A ( X ) with a suitable σ -algebra in orderto have measurable functions ˆ F , ˆ f , and F -adapted process ˆ α valued in ˆ A ( X ), so that theMDP with characteristics ( P ( X ) , ˆ A ( X ) , ˆ F , ˆ f , β ) is well-posed. This issue is investigated inthe next sections, first in special cases, and then in general case by a suitable enlargementof the action space. When there is no common noise, the original state transition function F is defined from X × A ×P ( X × A ) × E into X , and the associated function ˆ F is then defined from P ( X ) × ˆ A ( X )into P ( X ) by ˆ F ( µ, ˆ a ) = F (cid:0) · , · , µ · ˆ a, · (cid:1) ⋆ (cid:0) ( µ · ˆ a ) ⊗ λ ε (cid:1) . In this case, we are simply reduced to a deterministic control problem on the state space P ( X ) with dynamics µ t +1 = ˆ F ( µ t , κ t ) , t ∈ N , µ = µ ∈ P ( X ) , controlled by κ = ( κ t ) t ∈ N ∈ b A , the set of deterministic sequences valued in ˆ A ( X ), andcumulated gain/value function: b V κ ( µ ) = ∞ X t =0 β t ˆ f ( µ t , κ t ) , b V ( µ ) = sup κ ∈ b A b V κ ( µ ) , µ ∈ P ( X ) , f : P ( X ) × ˆ A ( X ) → R is defined as in (3.4). Notice that thereare no measurability issues for ˆ F , ˆ f , as the problem is deterministic and all the quantitiesdefined above are well-defined.We aim to prove the correspondence and equivalence between the MKV-MDP and theabove deterministic control problem. From similar derivation as in (3.2)-(3.4) (by takingdirectly law under P instead of P ), we clearly see that for any α ∈ A , V α ( ξ ) = b V ˆ α ( µ ),with µ = L ( ξ ), and ˆ α = R ξ ( α ) where R ξ is the relaxed operator R ξ : A −→ b A α = ( α t ) t ˆ α = ( ˆ α t ) t : ˆ α t ( x ) = L (cid:0) α t | X ξ,αt = x (cid:1) , t ∈ N , x ∈ X . It follows that V ( ξ ) ≤ b V ( µ ). In order to get the reverse inequality, we have to show that R ξ is surjective. Notice that this property is not always satisfied: for instance, when the σ -algebra generated by ξ is equal to G , then for any α ∈ A , α is σ ( ξ )-measurable at time t = 0,and thus L ( α | ξ ) is a Dirac distribution, hence cannot be equal to an arbitrary probabilitykernel κ = ˆ a ∈ ˆ A ( X ). We shall then make the following randomization hypothesis. Rand ( ξ, G ) : There exists a uniform random variable U ∼ U ([0 , G -measurableand independent of ξ ∈ L ( G ; X ). Remark 3.1
The randomization hypothesis
Rand ( ξ, G ) implies in particular that Γ isatomless, i.e., G is rich enough, and thus P ( X ) = {L ( ζ ) : ζ ∈ L ( G ; X ) } . Furthermore,it means that there is extra randomness in G besides ξ , so that one can freely randomizevia the uniform random variable U the first action given ξ according to any probabilitykernel ˆ a . Moreover, one can extract from U , by standard separation of the decimals of U (see Lemma 2.21 in [12]), an i.i.d. sequence of uniform variables ( U t ) t ∈ N , which are G -measurable, independent of ξ , and can then be used to randomize the subsequent actions. ✷ Theorem 3.1 (Correspondence in the no common noise case)
Assume that
Rand ( ξ, Γ) holds true. Then R ξ is surjective from A into b A , and we have V ( ξ ) = b V ( µ ) , for µ = L ( ξ ) . Moreover, if α ǫ ∈ A is an ǫ -optimal control for V ( ξ ) , then R ξ ( α ǫ ) ∈ b A is an ǫ -optimal control for b V ( µ ) , and conversely, if ˆ α ǫ ∈ b A is an ǫ -optimalcontrol for b V ( µ ) , then any α ǫ ∈ R − ξ ( ˆ α ǫ ) is an ǫ -optimal control for V ( ξ ) . Proof.
In view of the above discussion, we only need to prove the surjectivity of R ξ . Fixa control κ ∈ b A for the MDP on P ( X ). By Proposition A.3, for all t ∈ N , there exists ameasurable function a t : X × [0 , → A such that P a t ( x,U ) = κ t ( x ), for all x ∈ X . It is thenclear that the control α defined recursively by α t := a t ( X ξ,αt , U t ), where ( U t ) t is an i.i.d.sequence of G -measurable uniform variables independent of ξ under Rand ( ξ, Γ), satisfies L ( α t | X ξ,α = x ) = κ t ( x ) (observing that U t is independent of X ξ,αt ), and thus ˆ α = κ ,which proves the surjectivity of R ξ . ✷ Remark 3.2
The above correspondence result shows in particular that the value function V of the MKV-MDP is law invariant, in the sense that it depends on its initial state ξ onlyvia its probability law µ = L ( ξ ), for ξ satisfying the randomization hypothesis. ✷ .2 Case with discrete state space X We consider the case with common noise but when the state space X is discrete, i.e., itscardinal X is finite, equal to n .In this case, one can identify P ( X ) with the simplex S n − = { p = ( p i ) i =1 ,...,n ∈ [0 , n : P ni =1 p i = 1 } , by associating any probability distribution µ ∈ P ( X ) to its weights( µ ( { x } )) x ∈X ∈ S n − . We also identify the action space ˆ A ( X ) with P ( A ) n by associating anyprobability kernel ˆ a ∈ ˆ A ( X ) to (ˆ a ( x )) x ∈X ∈ P ( A ) n , and thus ˆ A ( X ) is naturally endowedwith the product σ -algebra of the Wasserstein metric space P ( A ). Lemma 3.1
Suppose that X = n < ∞ . Then, ˆ F in (3.3) is a measurable functionfrom S n − × P ( A ) n × E into S n − , ˆ f in (3.4) is a real-valued measurable function on S n − × P ( A ) n . Moreover, for any ξ ∈ L ( G ; X ) , and α ∈ A , the P ( A ) n -valued process ˆ α defined by ˆ α t ( x ) = L ( α t | X ξ,αt = x ) , t ∈ N , x ∈ X , is F -adapted. Proof.
By Corollary A.1, it is clear, by measurable composition, that we only need toprove that Ψ : ( µ, ˆ a ) ∈ ( P ( X ) , ˆ A ( X )) µ · ˆ a ∈ P ( X × A ) is measurable. However, in thisdiscrete case, µ · ˆ a is here simply equal to P x ∈X µ ( x )ˆ a ( x ) and, thus Ψ is clearly measurable. ✷ In view of Lemma 3.1, the MDP with characteristics ( P ( X ) ≡ S n − , ˆ A ( X ) ≡ P ( A ) n , ˆ F , ˆ f , β )is well-posed. Let us then denote by b A the set of F -adapted processes valued in P ( A ) n ,and given κ ∈ b A , consider the controlled dynamics in S n − µ t +1 = ˆ F ( µ t , κ t , ε t +1 ) , t ∈ N , µ = µ ∈ S n − , (3.5)the associated expected gain and value function b V κ ( µ ) = E h ∞ X t =0 β t ˆ f ( µ t , κ t ) i , b V ( µ ) = sup κ ∈ ˆ A b V κ ( µ ) . (3.6)We aim to prove the correspondence and equivalence between the CMKV-MDP and theMDP (3.5)-(3.6). From the derivation in (3.2)-(3.4) and by Lemma 3.1, we see that for any α ∈ A , V α ( ξ ) = b V ˆ α ( µ ), where µ = L ( ξ ), and ˆ α = R ξ ( α ) where R ξ is the relaxed operator R ξ : A −→ b A α = ( α t ) t ˆ α = ( ˆ α t ) t : ˆ α t ( x ) = L (cid:0) α t | X ξ,αt = x (cid:1) , t ∈ N , x ∈ X . (3.7)It follows that V ( ξ ) ≤ b V ( µ ). In order to get the reverse inequality from the surjectivity of R ξ , we need again as in the no common noise case to make some randomization hypothesis.It turns out that when X is discrete, this randomization hypothesis is simply reduced tothe atomless property of Γ. Lemma 3.2
Assume that Γ is atomless, i.e., G is rich enough. Then, any ξ ∈ L ( G ; X ) taking a countable number of values, satisfies Rand ( ξ, Γ) . Proof.
Let S be a countable set s.t. ξ ∈ S a.s., and P [ ξ = x ] > x ∈ S . Fix x ∈ S and denote by P x the probability “knowing ξ = x ”, i.e., P x [ B ] := P [ B,ξ = x ] P [ ξ = x ] , for all B ∈ F .16t is clear that, endowing Ω with this probability, Γ is still atomless, and so there existsa G -measurable random variable U x that is uniform under P x . Then, the random variable U := P x ∈ S U x ξ = x is a G -measurable uniform random variable under P x for all x ∈ S ,which implies that it is a uniform variable under P , independent of ξ . ✷ Theorem 3.2 (Correspondance with the MDP on P ( X ) in the X discrete case) Assume that G is rich enough. Then R ξ is surjective from A into b A , and V ( ξ ) = b V ( µ ) , forany µ ∈ P ( X ) , ξ ∈ L ( G ; X ) s.t. µ = L ( ξ ) . Moreover, if α ǫ ∈ A is an ǫ -optimal controlfor V ( ξ ) , then R ξ ( α ǫ ) ∈ b A is an ǫ -optimal control for b V ( µ ) . Conversely, if ˆ α ǫ ∈ b A is an ǫ -optimal control for b V ( µ ) , then any α ǫ ∈ ( R ξ ) − ( ˆ α ǫ ) is an ǫ -optimal control for V ( ξ ) . Proof.
From the derivation in (3.5)-(3.7), we only need to prove the surjectivity of R ξ .Fix κ ∈ b A and let π t ∈ L (( E ) t ; ˆ A ( X )) be such that κ t = π t (( ε s ) s ≤ t ). As X is discrete, bydefinition of the σ -algebra on ˆ A ( X ), π t can be seen as a measurable function in L (( E ) t ×X ; P ( A )). Let φ ∈ L ( A, R ) be an embedding as in Lemma C.2. By Corollary A.1, weknow that φ ⋆ π t is in L (( E ) t × X ; P ( R )). Given m ∈ P ( R ) we denote by F − m thegeneralized inverse of its distribution function, and it is known that the mapping m ∈ ( P ( R ) , W ) F − m ∈ ( L caglad ( R ) , k · k ) is an isometry and is thus measurable. Therefore, F − φ⋆ π t is in L (( E ) t × X ; ( L caglad ( R ) , k · k )). Finally, the mapping ( f, u ) ∈ ( L caglad ( R ) , k ·k ) × ([0 , , B ([0 , f ( u ) ∈ ( R , B ( R )) is measurable, since it is the limit of the sequence n P i ∈ Z [ i +1 n , i +2 n ) ( u ) R i +1 nin f ( y ) dy when n → ∞ . Therefore, the mappinga t : ( E ) t × X × [0 , −→ A (( e s ) s ≤ t , x, u ) φ − ◦ F − φ⋆ π t (( e s ) s ≤ t ,x ) ( u )is measurable. We thus define, by induction, α t := a t (( ε s ) s ≤ t , X ξ,αt , U t ). By constructionand by the generalized inverse simulation method, it is clear that ˆ α t = κ t . ✷ Remark 3.3
We point out that when both state space X and action space A are discrete,equipped with the metrics d ( x, x ′ ) := x = y , x, x ′ ∈ X and d A ( a, a ′ ) := a = a ′ , a, a ′ ∈ A ,the transition function ˆ F and reward function ˆ f of the lifted MDP on P ( X ) inherits theLipschitz condition ( HF lip ) and ( Hf lip ) used for the propagation of chaos. Indeed, it isknown that the Wasserstein distance obtained from d (resp. d A ) coincides with twice thetotal variation distance, and thus to the L distance when naturally embedding P ( X ) (resp. P ( A )) in [0 , X (resp. [0 , A ). Thus, embedding ˆ A ( X ) in M X , A ([0 , X × A matrices with coefficients valued in [0 , k ˆ F ( µ, ˆ a, e ) , ˆ F ( ν, ˆ a ′ , e ) k ≤ (1 + K F )(2 k µ − µ ′ k + sup x ∈X k ˆ a x, · − ˆ a ′ x, · k ) . We obtain a similar property for f . In other words, lifting the CMKV-MDP not only turnsit into an MDP, but also its state and action spaces [0 , X and [0 , X × A are verystandard, and its dynamic and reward are Lipschitz functions with factors of the order of K F and K f according to the norm k · k . Thus, due to the standard nature of this MDP,most MDP algorithms can be applied and their speed will be simply expressed in terms ofthe original parameters of the CMKV-MDP, K F and K f . ✷ emark 3.4 As in the no common noise case, the correspondence result in the X discretecase shows notably that the value function of the CMKV-MDP is law-invariant.The general case (common noise and continuous state space X ) raises multiple issuesfor establishing the equivalence between CMKV-MDP and the lifted MDP on P ( X ). First,we have to endow the action space ˆ A ( X ) with a suitable σ -algebra for the lifted MDP tobe well-posed: on one hand, this σ -algebra has to be large enough to make the functionsˆ F : P ( X ) × ˆ A ( X ) × E → P ( X ) and ˆ f : P ( X ) × ˆ A ( X ) → R measurable, and on the otherhand, it should be small enough to make the process ˆ α = R ξ ( α ) F -adapted for any control α ∈ A in the CMKV-MDP. Beyond the well-posedness issue of the lifted MDP, the secondimportant concern is the surjectivity of the relaxed operator R ξ from A into b A . Indeed, ifwe try to adapt the proof of Theorem 3.2 to the case of a continuous state space X , theissue is that we cannot in general equip ˆ A ( X ) with a σ -algebra such that L (( E ) t ; ˆ A ( X ))= L (( E ) t × X ; P ( A )), and thus we cannot see π t ∈ L (( E ) t ; ˆ A ( X )) as an element of L (( E ) t × X ; P ( A )), which is crucial because the control α (such that ˆ α = κ ) is definedwith α t explicitly depending upon π t (( ε s ) s ≤ t , X t ).In the next section, we shall fix these measurability issues in the general case, and provethe correspondence between the CMKV-MDP and a general lifted MDP on P ( X ). ✷ P ( X ) We address the general case with common noise and possibly continuous state space X ,and our aim is to state the correspondence of the CMKV-MDP with a suitable lifted MDPon P ( X ) associated to a Bellman fixed point equation, characterizing the value function,and obtain as a by-product an ǫ -optimal control. 
We proceed as follows:(i) We first introduce a well-posed lifted MDP on P ( X ) by enlarging the action spaceto P ( X × A ), and call ˜ V the corresponding value function, which satisfies: V ( ξ ) ≤ ˜ V ( µ ), for µ = L ( ξ ).(ii) We then consider the Bellman equation associated to this well-posed lifted MDP on P ( X ), which admits a unique fixed point, called V ⋆ .(iii) Under the randomization hypothesis for ξ , we show the existence of an ǫ -randomizedfeedback policy, which yields both an ǫ -randomized feedback control for the CMKV-MDP and an ǫ -optimal feedback control for ˜ V . This proves that V ( ξ ) = ˜ V ( µ ) = V ∗ ( µ ), for µ = L ( ξ ).(iv) Under the condition that G is rich enough, we conclude that V is law-invariant andis equal to ˜ V = V ⋆ , hence satisfies the Bellman equation.Finally, we show how to compute from the Bellman equation by value or policy iterationapproximate optimal strategy and value function.18 .1 A general lifted MDP on P ( X ) We start again from the relation (3.2) describing the evolution of µ t = P X t , t ∈ N , for astate process X t = X ξ,αt controlled by α ∈ A : µ t +1 = F ( · , · , P X t ,α t ) , · , ε t +1 ) ⋆ ( P X t ,α t ) ⊗ λ ε ) , a.s. (4.1)Now, instead of decoupling as in Section 3, the conditional law of the pair ( X t , α t ), as P X t ,α t ) = µ t · ˆ α t where ˆ α = R ξ ( α ) is the relaxed control in (3.7), we directly considerthe control process α t = P X t ,α t ) , t ∈ N , which is F -adapted (see Proposition A.1), andvalued in the space of probability measures A := P ( X × A ), naturally endowed with the σ -algebra of its Wasserstein metric. Notice that this A -valued control α obtained fromthe CMKV-MDP has to satisfy by definition the marginal constraint pr ⋆ α t = µ t at anytime t . In order to tackle this marginal constraint, we shall rely on the following couplingresults. Lemma 4.1 (Measurable coupling)
There exists a measurable function ζ ∈ L ( P ( X ) × X × [0 , X ) s.t. for any ( µ, µ ′ ) ∈P ( X ) , and if ξ ∼ µ , then • ζ ( µ, µ ′ , ξ, U ) ∼ µ ′ , where U is an uniform random variable independent of ξ . • (i) When X ⊂ R : E (cid:2) d ( ξ, ζ ( µ, µ ′ , ξ, U )) (cid:3) = W ( µ, µ ′ ) . (ii) In general when X Polish: ∀ ε > , ∃ η > s.t. W ( µ, µ ′ ) < η ⇒ E (cid:2) d ( ξ, ζ ( µ, µ ′ , ξ, U )) (cid:3) < ε. Proof.
See Appendix C. ✷ Remark 4.1
Lemma 4.1 can be seen as a measurable version of the well-known couplingresult in optimal transport, which states that given µ , µ ′ ∈ P ( X ), there exists ξ and ξ ′ random variables with L ( ξ ) = µ , L ( ξ ′ ) = µ ′ such that W ( µ, µ ′ ) = E (cid:2) d ( ξ, ξ ′ )]. A similarmeasurable optimal coupling is proved in [8] under the assumption that there exists atransfer function realizing an optimal coupling between µ and µ ′ . However, such transferfunction does not always exist, for instance when µ has atoms but not µ ′ . Lemma 4.1 buildsa measurable coupling without making such assumption (essentially using the uniformvariable U to randomize when µ has atoms). ✷ From the measurable coupling function ζ as in Lemma 4.1, we define the couplingprojection p : P ( X ) × A → A by p ( µ, a ) = L (cid:0) ζ (pr ⋆ a , µ, ξ ′ , U ) , α (cid:1) , µ ∈ P ( X ) , a ∈ A , where ( ξ ′ , α ) ∼ a , and U is a uniform random variable independent of ξ ′ .19 emma 4.2 (Measurable coupling projection) The coupling projection p is a measurable function from P ( X ) × A into A , and for all ( µ, a ) ∈ P ( X ) × A :pr ⋆ p ( µ, a ) = µ, and if pr ⋆ a = µ, then p ( µ, a ) = a . Proof.
The only result that is not trivial is the measurability of p . Observe that p ( µ, a ) = g ( µ, a , · , · , · ) ⋆ ( a ⊗ U ([0 , g is the measurable function g : P ( X ) × P ( X × A ) × X × A × [0 , −→ X × A ( µ, a , x, a, u ) ( ζ (pr ⋆ a , µ, x, u ) , a )We thus conclude by Corollary A.1. ✷ By using this coupling projection p , we see that the dynamics (4.1) can be written as µ t +1 = ˜ F ( µ t , α t , ε t +1 ) , t ∈ N , (4.2)where the function ˜ F : P ( X ) × A × E → P ( X ) defined by˜ F ( µ, a , e ) = F ( · , · , p ( µ, a ) , · , e ) ⋆ (cid:0) p ( µ, a ) ⊗ λ ε (cid:1) , is clearly measurable. Let us also define the measurable function ˜ f : P ( X ) × A → R by˜ f ( µ, a ) = Z X × A f ( x, a, p ( µ, a )) p ( µ, a )(d x, d a ) . The MDP with characteristics ( P ( X ) , A = P ( X × A ) , ˜ F , ˜ f , β ) is then well-posed. Letus then denote by A the set of F -adapted processes valued in A , and given an open-loopcontrol ν ∈ A , consider the controlled dynamics µ t +1 = ˜ F ( µ t , ν t , ε t +1 ) , t ∈ N , µ = µ ∈ P ( X ) , (4.3)with associated expected gain/value function e V ν ( µ ) = E h X t ∈ N β t ˜ f ( µ t , ν t ) i , e V ( µ ) = sup ν ∈ A e V ν ( µ ) . (4.4)Given ξ ∈ L ( G ; X ), and α ∈ A , we set α = L ξ ( α ), where L ξ is the lifted operator L ξ : A −→ A α = ( α t ) t α = ( α t ) t : α t = P X ξ,αt ,α t ) , t ∈ N . By construction from (4.2), we see that µ t = P X ξ,αt , t ∈ N , follows the dynamics (4.3) withthe control ν = L ξ ( α ) ∈ A . Moreover, by the law of iterated conditional expectation, andthe definition of ˜ f , the expected gain of the CMKV-MDP can be written as V α ( ξ ) = E h X t ∈ N β t E (cid:2) f ( X ξ,αt , α t , P X ξ,αt ,α t ) ) (cid:3)i = E h X t ∈ N β t ˜ f ( P X ξ,αt , α t ) i = e V α ( µ ) , with µ = L ( ξ ) . (4.5)It follows that V ( ξ ) ≤ e V ( µ ), for µ = L ( ξ ). Our goal is to prove the equality, which implies inparticular that V is law-invariant, and to obtain as a by-product the corresponding Bellmanfixed point equation that characterizes analytically the solution to the CMKV-MDP.20 .2 Bellman fixed point on P ( X ) We derive and study the Bellman equation corresponding to the general lifted MDP (4.3)-(4.4) on P ( X ). By defining this MDP on the canonical space ( E ) N , an open-loop control ν = ( ν t ) t ∈ A is identified with the open-loop policy ν = ( ν t ) t where ν t is a measurablefunction from ( E ) t into A via the relation ν t = ν t ( ε , . . . , ε t ), with the convention that ν = ν is simply a constant in A . Given ν ∈ A , and e ∈ E , we denote by ~ ν ( e ) = ( ~ν t ( e )) t ∈ A the shifted open-loop policy defined by ~ν t ( e )( . ) = ν t +1 ( e , . ), t ∈ N . Given µ ∈ P ( X ),and ν ∈ A , we denote by ( µ µ, ν t ) t the solution to (4.3), which satisfies the flow property (cid:0) µ µ, ν t +1 , ν t +1 (cid:1) ≡ (cid:0) µ µ µ, ν t , ~ ν t ( ε ) (cid:1) , t ∈ N , where ≡ means equality in law under P . This implies that the expected gain of this MDPin (4.4) satisfies the relation e V ν ( µ ) = ˜ f ( µ, ν ) + β E h e V ~ ν ( ε ) ( µ µ, ν ) i Recalling that µ µ, ν = ˜ F ( µ, ν , ε ), and by definition of ˜ V as the value function, we deducethat e V ( µ ) ≤ (cid:2) T e V (cid:3) ( µ ) , (4.6)where T is the Bellman operator on L ∞ ( P ( X )), the set of bounded real-valued functionson P ( X ), defined for any W ∈ L ∞ ( P ( X )) by[ T W ]( µ ) := sup a ∈ A n ˜ f ( µ, a ) + β E (cid:2) W (cid:0) ˜ F ( µ, a , ε ) (cid:1)(cid:3)o , µ ∈ P ( X ) . 
(4.7)This Bellman operator is consistent with the lifted MDP derived in Section 3, with cha-racteristics ( P ( X ) , ˆ A ( X ) , ˆ F , ˆ f , β ), although this MDP is not always well-posed. Indeed,its corresponding Bellman operator is well-defined as it only involves the random variable ε at time 1, hence only requires the measurability of e ˆ F ( µ, ˆ a, e ), for any ( µ, ˆ a ) ∈ P ( X ) × ˆ A ( X ) (which holds true), and it turns out that it coincides with T . Proposition 4.1
For any W ∈ L ∞ ( P ( X )) , and µ ∈ P ( X ) , we have [ T W ]( µ ) = sup ˆ a ∈ ˆ A ( X ) [ ˆ T ˆ a W ]( µ ) = sup a ∈ L ( X × [0 , A ) [ T a W ]( µ ) , (4.8) where ˆ T ˆ a and T a are the operators defined on L ∞ ( P ( X )) by [ ˆ T ˆ a W ]( µ ) = ˆ f ( µ, ˆ a ) + β E (cid:2) W (cid:0) ˆ F ( µ, ˆ a, ε ) (cid:1)(cid:3) , [ T a W ]( µ ) = E h f ( ξ, a( ξ, U ) , L ( ξ, a( ξ, U ))) + βW (cid:0) P F ( ξ, a( ξ,U ) , L ( ξ, a( ξ,U )) ,ε ,ε ) (cid:1)i , (4.9) for any ( ξ, U ) ∼ µ ⊗ U ([0 , (it is clear that the right-hand side in (4.9) does not dependon the choice of such ( ξ, U ) ). Moreover, we have [ T W ]( µ ) = sup α ∈ L (Ω; A ) E h f ( ξ, α , L ( ξ, α )) + βW (cid:0) P F ( ξ,α , L ( ξ,α ) ,ε ,ε ) (cid:1)i . (4.10)21 roof. Fix W ∈ L ∞ ( P ( X )), and µ ∈ P ( X ). Let a be arbitrary in A . Since p ( µ, a ) hasfirst marginal equal to µ , there exists by Corollary A.2 a probability kernel ˆ a ∈ ˆ A ( X ) suchthat p ( µ, a ) = µ · ˆ a . Therefore, ˜ F ( µ, a , e ) = ˆ F ( µ, ˆ a, e ), ˜ f ( µ, a ) = ˆ f ( µ, ˆ a ), which impliesthat [ T W ]( µ ) ≤ sup ˆ a ∈ ˆ A ( X ) [ ˆ T ˆ a W ]( µ ) =: T .Let us consider the operator R defined by R : L ( X × [0 , A ) −→ ˆ A ( X )a ˆ a : ˆ a ( x ) = L (cid:0) a( x, U ) (cid:1) , x ∈ X , U ∼ U ([0 , , and notice that it is surjective from L ( X × [0 , A ) into ˆ A ( X ), by Lemma A.3. By notingthat for any a ∈ L ( X × [0 , A ), and ( ξ, U ) ∼ µ ⊗ U ([0 , L (cid:0) ξ, a( ξ, U ) (cid:1) = µ · R (a), it follows that [ T a W ]( µ ) = [ ˆ T R (a) W ]( µ ). Since R is surjective, this yields T =sup a ∈ L ( X × [0 , A ) [ T a W ]( µ ) =: T .Denote by T the right-hand-side in (4.10). It is clear that T ≤ T . Conversely, let α ∈ L (Ω; A ). We then set a = L ( ξ, α ) ∈ P ( X × A ), and notice that the first marginal of a is µ . Thus, p ( µ, a ) = L ( ξ, α ), and so˜ f ( µ, a ) = Z X × A f ( x, a, p ( µ, a )) p ( µ, a )(d x, d a ) = E (cid:2) f ( ξ, α , L ( ξ, α )) (cid:3) ˜ F ( µ, a , ε ) = F ( · , · , p ( µ, a ) , · , ε ) ⋆ (cid:0) p ( µ, a ) ⊗ λ ε (cid:1) = P F ( ξ,α , L ( ξ,α ) ,ε ,ε ) . We deduce that T ≤ [ T W ]( µ ), which gives finally the equalities (4.8) and (4.10). ✷ By standard and elementary arguments, we state the basic properties of the Bellmanoperator T . Lemma 4.3 (i) The operator T is contracting on L ∞ ( P ( X )) with Lipschitz factor β , andadmits a unique fixed point in L ∞ ( P ( X )) , denoted by V ⋆ , hence solution to: V ⋆ = T V ⋆ . (ii) Furthermore, it is monotone increasing: for W , W ∈ L ∞ ( P ( X )) , if W ≤ W , then T W ≤ T W . As a consequence of Lemma 4.3, we can easily show the following relation between thevalue function ˜ V of the general lifted MDP, and the fixed point V ⋆ of the Bellman operator. Lemma 4.4
As a consequence of Lemma 4.3, we can easily show the following relation between the value function Ṽ of the general lifted MDP and the fixed point V⋆ of the Bellman operator.

Lemma 4.4

For all µ ∈ P(X), we have Ṽ(µ) ≤ V⋆(µ).

Proof. From (4.6), we have Ṽ ≤ T Ṽ. By iterating this inequality with the operator T, and using the monotone increasing property of T, we get Ṽ ≤ T^n Ṽ for all n ∈ N. Since the fixed point V⋆ of the contracting operator T is the limit of T^n Ṽ as n goes to infinity, this proves the inequality Ṽ ≤ V⋆. ✷

4.3 Building ε-optimal randomized feedback controls

We aim to prove rigorously the equality Ṽ = V⋆, i.e., that the value function Ṽ of the general lifted MDP satisfies the Bellman fixed point equation Ṽ = T Ṽ, and also to show the existence of an ε-optimal control for Ṽ. Notice that this cannot be obtained directly from the classical theory of MDPs, as we consider here open-loop controls ν ∈ A while MDP theory usually deals with feedback controls on finite-dimensional spaces. Following the standard notation in MDP theory with state space P(X) and action space A, and in connection with the Bellman operator in (4.7), we introduce, for π ∈ L^0(P(X); A) (the set of measurable functions from P(X) into A), called a (measurable) feedback policy, the so-called π-Bellman operator T^π on L^∞(P(X)), defined for W ∈ L^∞(P(X)) by

[T^π W](µ) = f̃(µ, π(µ)) + β E[ W(F̃(µ, π(µ), ε^0_1)) ], µ ∈ P(X). (4.11)

As for the Bellman operator T, we have the following basic properties of the operator T^π.

Lemma 4.5
Fix π ∈ L^0(P(X); A).
(i) The operator T^π is contracting on L^∞(P(X)) with Lipschitz factor β, and admits a unique fixed point, denoted Ṽ^π.
(ii) Furthermore, it is monotone increasing: for W_1, W_2 ∈ L^∞(P(X)), if W_1 ≤ W_2, then T^π W_1 ≤ T^π W_2.

Remark 4.2
It is well known from MDP theory that the fixed point Ṽ^π of the operator T^π is equal to

Ṽ^π(µ) = E[ Σ_{t∈N} β^t f̃(µ_t, π(µ_t)) ],

where (µ_t)_t is the MDP in (4.3) controlled by the feedback and stationary control ν^π = (ν^π_t)_t ∈ A defined by ν^π_t = π(µ_t), t ∈ N. In the sequel, we shall then identify, by misuse of notation, Ṽ^π and Ṽ^{ν^π} as defined in (4.4). ✷

Our ultimate goal being to solve the CMKV-MDP, we introduce a subclass of feedback policies for the lifted MDP.
Definition 4.1 (Lifted randomized feedback policy)
A feedback policy π ∈ L^0(P(X); A) is a lifted randomized feedback policy if there exists a measurable function a ∈ L^0(P(X) × X × [0,1]; A), called a randomized feedback policy, such that

(ξ, a(µ, ξ, U)) ∼ π(µ), for all µ ∈ P(X), with (ξ, U) ∼ µ ⊗ U([0,1]).

Remark 4.3
Given a ∈ L^0(P(X) × X × [0,1]; A), denote by π_a ∈ L^0(P(X); A) the associated lifted randomized feedback policy, i.e., π_a(µ) = L(ξ, a(µ, ξ, U)), for µ ∈ P(X) and (ξ, U) ∼ µ ⊗ U([0,1]). Recalling the operator T^{π_a} in (4.11), and observing that p(µ, π_a(µ)) = π_a(µ) = L(ξ, a_µ(ξ, U)), where we set a_µ = a(µ, ·, ·) ∈ L^0(X × [0,1]; A), we see (recalling the notation in (4.9)) that for all W ∈ L^∞(P(X)),

[T^{π_a} W](µ) = [T^{a_µ} W](µ), µ ∈ P(X). (4.12)

On the other hand, let ξ ∈ L^0(G; X) be some initial state satisfying the randomization hypothesis Rand(ξ, G), and denote by α^a ∈ A the randomized feedback stationary control defined by α^a_t = a(P^0_{X_t}, X_t, U_t), where X = X^{ξ,α^a} is the state process in (3.1) of the CMKV-MDP, and (U_t)_t is an i.i.d. sequence of uniform G-measurable random variables independent of ξ. By construction, the associated lifted control a = L_ξ(α^a) satisfies a_t = P^0_{(X_t, α^a_t)} = π_a(µ_t), where µ_t = P^0_{X_t}, t ∈ N. Denoting by V^a := V^{α^a} the associated expected gain of the CMKV-MDP, and recalling Remark 4.2, we see from (4.5) that V^a(ξ) = Ṽ^{ν^{π_a}}(µ) = Ṽ^{π_a}(µ), where µ = L(ξ). ✷
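To make the randomized feedback stationary control of Remark 4.3 concrete, here is an illustrative particle rollout in which the conditional law P^0_{X_t} is replaced by the empirical measure of the simulated cloud, and the extra uniforms U_t drive the randomization. The policy a, the transition F and the reward f below are hypothetical placeholders, not the paper's model.

```python
import numpy as np

# Rollout of a randomized feedback stationary control alpha^a_t = a(law, X_t, U_t),
# with the conditional law approximated by N simulated particles (toy model).

rng = np.random.default_rng(0)
N, beta, T = 2000, 0.9, 40

def a(mu_particles, x, u):
    # hypothetical randomized feedback policy: push the state toward the mean,
    # perturbed by the uniform randomization u
    return np.clip(mu_particles.mean() - x + 0.1 * (u - 0.5), -1.0, 1.0)

def F(x, act, mu_particles, eps, eps0):
    return np.clip(x + 0.5 * act + 0.05 * eps + 0.05 * eps0, 0.0, 1.0)

def f(x, act, mu_particles):
    return -(x - mu_particles.mean()) ** 2

x = rng.uniform(size=N)         # xi ~ mu_0, one sample per particle
gain, disc = 0.0, 1.0
for t in range(T):
    u = rng.uniform(size=N)     # the i.i.d. uniforms U_t of the randomization
    act = a(x, x, u)            # the particle cloud stands in for P^0_{X_t}
    gain += disc * f(x, act, x).mean()
    x = F(x, act, x, rng.normal(size=N), rng.normal())
    disc *= beta
print("estimated V^a(xi):", gain)
```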
We now show a verification-type result for the general lifted MDP, and as a by-product for the CMKV-MDP, by means of the Bellman operator.

Proposition 4.2 (Verification result)
Fix ε ≥ 0, and suppose that there exists an ε-optimal feedback policy π_ε ∈ L^0(P(X); A) for V⋆, in the sense that

V⋆ ≤ T^{π_ε} V⋆ + ε.

Then ν^{π_ε} ∈ A is ε/(1−β)-optimal for Ṽ, i.e., Ṽ^{π_ε} ≥ Ṽ − ε/(1−β), and we have Ṽ ≥ V⋆ − ε/(1−β). Furthermore, if π_ε is a lifted randomized feedback policy, i.e., π_ε = π_{a_ε} for some a_ε ∈ L^0(P(X) × X × [0,1]; A), then under Rand(ξ, G), α^{a_ε} ∈ A is an ε/(1−β)-optimal control for V(ξ), i.e., V^{a_ε}(ξ) ≥ V(ξ) − ε/(1−β), and we have V(ξ) ≥ V⋆(µ) − ε/(1−β), for µ = L(ξ).

Proof.
Since Ṽ^{π_ε} = T^{π_ε} Ṽ^{π_ε}, and recalling from Lemma 4.4 that V⋆ ≥ Ṽ ≥ Ṽ^{π_ε}, we have for all µ ∈ P(X),

|(V⋆ − Ṽ^{π_ε})(µ)| ≤ |[T^{π_ε} V⋆](µ) − [T^{π_ε} Ṽ^{π_ε}](µ)| + ε ≤ β ‖V⋆ − Ṽ^{π_ε}‖_∞ + ε,

where we used the β-contraction property of T^{π_ε} from Lemma 4.5. We deduce that ‖V⋆ − Ṽ^{π_ε}‖_∞ ≤ ε/(1−β), and then Ṽ ≥ Ṽ^{π_ε} ≥ V⋆ − ε/(1−β), which, combined with V⋆ ≥ Ṽ, shows the first assertion. Moreover, if π_ε = π_{a_ε} is a lifted randomized feedback policy, then by Remark 4.3 and under Rand(ξ, G), we have V^{a_ε}(ξ) = Ṽ^{π_ε}(µ). Recalling that V(ξ) ≤ Ṽ(µ), and together with the first assertion, this proves the required result. ✷

Remark 4.4 If we can find, for any ε > 0, an ε-optimal lifted randomized feedback policy for V⋆, then according to Proposition 4.2, and under Rand(ξ, G), one can restrict to randomized feedback policies in the computation of the optimal value V(ξ) of the CMKV-MDP, i.e., V(ξ) = sup_{a ∈ L^0(P(X)×X×[0,1]; A)} V^a(ξ). Moreover, this proves that V(ξ) = Ṽ(µ) = V⋆(µ), hence V is law-invariant and satisfies the Bellman fixed point equation. Notice that, instead of proving directly the dynamic programming Bellman equation for V, we start from the fixed point solution V⋆ to the Bellman equation, and show via a verification result that V is indeed equal to V⋆, hence satisfies the Bellman equation.

By the formulation (4.8) of the Bellman operator in Proposition 4.1, and the fixed point equation satisfied by V⋆, we know that for all ε > 0 and µ ∈ P(X), there exists a^µ_ε ∈ L^0(X×[0,1]; A) such that

V⋆(µ) ≤ [T^{a^µ_ε} V⋆](µ) + ε. (4.13)

The crucial issue is to prove that the mapping (µ, x, u) ↦ a_ε(µ, x, u) := a^µ_ε(x, u) is measurable, so that it defines a randomized feedback policy a_ε ∈ L^0(P(X)×X×[0,1]; A) and an associated lifted randomized feedback policy π_{a_ε}. Recalling the relation (4.12), this would then show that π_{a_ε} is an ε-optimal lifted randomized feedback policy for V⋆, and we could apply the verification result. ✷

We now address the measurability issue for proving the existence of an ε-optimal randomized feedback policy for V⋆. The basic idea is to construct as in (4.13) an ε-optimal a^µ_ε ∈ L^0(X×[0,1]; A) for V⋆(µ) when µ lies in a suitable finite grid of P(X), and then to "patch" things together to obtain an ε-optimal randomized feedback policy. This is made possible under some uniform continuity property of V⋆, and we shall require the following Lipschitz assumptions on the original state transition and reward functions F and f.

(H'_lip) There exists K > 0 such that for all a ∈ A, e^0 ∈ E^0, x, x′ ∈ X, ν, ν′ ∈ P(X×A) with pr²⋆ν = pr²⋆ν′:

E[ d(F(x, a, ν, ε_1, e^0), F(x′, a, ν′, ε_1, e^0)) ] ≤ K (d(x, x′) + W(pr¹⋆ν, pr¹⋆ν′)),
|f(x, a, ν) − f(x′, a, ν′)| ≤ K (d(x, x′) + W(pr¹⋆ν, pr¹⋆ν′)).

Remark 4.5
Under Assumption (H'_lip), we see that once pr^i⋆ν = pr^i⋆ν′, i = 1, 2, we have F(x, a, ν, ε_1, e^0) = F(x, a, ν′, ε_1, e^0) and f(x, a, ν) = f(x, a, ν′). In other words, the functions F and f depend on ν only through its marginal distributions on X and on A, and we shall write, by misuse of notation, F(x, a, µ, υ, e, e^0) = F(x, a, µ⊗υ, e, e^0) and f(x, a, µ, υ) = f(x, a, µ⊗υ), for µ ∈ P(X), υ ∈ P(A). Assumption (H'_lip) then reads: there exists K > 0 such that for all a ∈ A, e^0 ∈ E^0, x, x′ ∈ X, µ, µ′ ∈ P(X), υ ∈ P(A),

E[ d(F(x, a, µ, υ, ε_1, e^0), F(x′, a, µ′, υ, ε_1, e^0)) ] ≤ K (d(x, x′) + W(µ, µ′)),
|f(x, a, µ, υ) − f(x′, a, µ′, υ)| ≤ K (d(x, x′) + W(µ, µ′)). ✷

Proposition 4.3
Assume that (H'_lip) holds true. Then, for all γ ∈ (0, min(1, |ln β| / (ln 2K)_+ − η)] with η > 0, V⋆ is γ-Hölder with constant K_γ := 2K ∆_X^{1−γ} / (1 − β(2K)^γ), where ∆_X denotes the diameter of X:

|V⋆(µ) − V⋆(µ′)| ≤ K_γ W(µ, µ′)^γ, ∀ µ, µ′ ∈ P(X).

Proof.
Notice that V⋆ is the limit in L^∞(P(X)) of the iterative sequence V_{n+1} = T V_n, V_0 ≡ 0, and we shall prove the γ-Hölder property by induction. Fix γ as in the statement. V_0 ≡ 0 is trivially γ-Hölder, and assume that at iteration n, V_n is γ-Hölder with constant K_n. Under (H'_lip) (see Remark 4.5), recall the expression (4.10) of the Bellman operator T:

[T W](µ) = sup_{α ∈ L^0(Ω; A)} E[ f(ξ, α, µ, L(α)) + β W( P^{ε^0_1}_{F(ξ, α, µ, L(α), ε_1, ε^0_1)} ) ], µ ∈ P(X),

where ξ ∈ L^0(Ω; X) is such that L(ξ) = µ. For any µ, µ′ ∈ P(X), let us consider an optimal coupling (ξ, ξ′) ∈ L^0(Ω; X)² for the Wasserstein distance, i.e., E[d(ξ, ξ′)] = W(µ, µ′). Under (H'_lip), we then have

|V_{n+1}(µ) − V_{n+1}(µ′)| ≤ 2K W(µ, µ′) + β sup_{α ∈ L^0(Ω; A)} E[ |V_n(P^{ε^0_1}_{F(ξ,α,µ,L(α),ε_1,ε^0_1)}) − V_n(P^{ε^0_1}_{F(ξ′,α,µ′,L(α),ε_1,ε^0_1)})| ].

Now, by the induction hypothesis,

|V_n(P^{ε^0_1}_{F(ξ,α,µ,L(α),ε_1,ε^0_1)}) − V_n(P^{ε^0_1}_{F(ξ′,α,µ′,L(α),ε_1,ε^0_1)})| ≤ K_n W( P^{ε^0_1}_{F(ξ,α,µ,L(α),ε_1,ε^0_1)}, P^{ε^0_1}_{F(ξ′,α,µ′,L(α),ε_1,ε^0_1)} )^γ,

and by definition of the Wasserstein distance,

W( P^{ε^0_1}_{F(ξ,α,µ,L(α),ε_1,ε^0_1)}, P^{ε^0_1}_{F(ξ′,α,µ′,L(α),ε_1,ε^0_1)} ) ≤ E[ d(F(ξ,α,µ,L(α),ε_1,e^0), F(ξ′,α,µ′,L(α),ε_1,e^0)) ]_{e^0 := ε^0_1} ≤ 2K W(µ, µ′),

where we used again (H'_lip) and the fact that E[d(ξ, ξ′)] = W(µ, µ′). We then deduce that

|V_{n+1}(µ) − V_{n+1}(µ′)| ≤ 2K W(µ, µ′) + β K_n (2K)^γ W(µ, µ′)^γ ≤ (2K ∆_X^{1−γ} + β K_n (2K)^γ) W(µ, µ′)^γ,

since W(µ, µ′) ≤ ∆_X and γ ≤ 1. The induction hypothesis at iteration n+1 thus holds with K_{n+1} := 2K ∆_X^{1−γ} + β K_n (2K)^γ, which leads to the expression K_n = 2K ∆_X^{1−γ} (1 − (β(2K)^γ)^n) / (1 − β(2K)^γ). Therefore, by taking γ ∈ (0, min(1, |ln β| / (ln 2K)_+ − η)] with η > 0, we have β(2K)^γ < 1, and the sequence (K_n) converges to K_γ := 2K ∆_X^{1−γ} / (1 − β(2K)^γ), which shows the required γ-Hölder property for V⋆ = lim_n V_n with constant K_γ. ✷

The next result provides a suitable discretization of the set of probability measures.
Lemma 4.6 (Quantization of P(X))

Fix η > 0. There exists a finite subset M_η = {µ_1, ..., µ_{N_η}} ⊂ P(X) such that for all µ ∈ P(X), there exists µ_i ∈ M_η satisfying W(µ, µ_i) ≤ η.

Proof. As X is compact, there exists a finite subset X_η ⊂ X such that d(x, x^η) ≤ η/2 for all x ∈ X, where x^η denotes the projection of x on X_η. Given µ ∈ P(X) and ξ ∼ µ, we denote by ξ^η the quantization of ξ, i.e., the projection of ξ on X_η, and by µ^η the discrete law of ξ^η. Thus E[d(ξ, ξ^η)] ≤ η/2, and therefore W(µ, µ^η) ≤ η/2. The probability measure µ^η lies in P(X_η), which is identified with the simplex of [0,1]^{X_η}. We then use another grid G_η = {i/n_η : i = 0, ..., n_η} of [0,1], and project the weights µ^η(y) ∈ [0,1], y ∈ X_η, on G_η, in order to obtain another discrete probability measure µ^{η,n_η}. From the dual Kantorovich representation of the Wasserstein distance, it is easy to see that for n_η large enough, W(µ^η, µ^{η,n_η}) ≤ η/2, and so W(µ, µ^{η,n_η}) ≤ η. We conclude the proof by noting that µ^{η,n_η} belongs to the set M_η of probability measures on X_η with weights valued in the finite grid G_η, hence M_η is a finite subset of P(X_η), of cardinality N_η. ✷
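The grid M_η of Lemma 4.6 is straightforward to enumerate in practice: fix a finite state grid X_η and a finite weight grid G_η, and list all weight vectors on the simplex lattice. The following sketch does this for X = [0,1] and projects an arbitrary measure onto M_η in 1-d Wasserstein distance; the grid sizes are arbitrary toy choices and the helper names are hypothetical.

```python
import itertools
import numpy as np

# Quantization of P(X) in the spirit of Lemma 4.6, for X = [0, 1].

x_grid = np.linspace(0.0, 1.0, 5)   # X_eta
n_eta = 4                           # weight grid G_eta = {0, 1/4, ..., 1}

def simplex_lattice(n_points, n_levels):
    # all weight vectors with entries in {i/n_levels} summing to 1 (stars and bars)
    for c in itertools.combinations(range(n_levels + n_points - 1), n_points - 1):
        parts = np.diff((-1,) + c + (n_levels + n_points - 1,)) - 1
        yield parts / n_levels

M_eta = np.array(list(simplex_lattice(len(x_grid), n_eta)))  # each row: one mu_i

def w1_on_grid(p, q):
    # 1-d Wasserstein distance between two measures on the same sorted grid
    return np.sum(np.abs(np.cumsum(p - q))[:-1] * np.diff(x_grid))

mu = np.array([0.13, 0.07, 0.40, 0.25, 0.15])   # some measure on X_eta
dists = [w1_on_grid(mu, q) for q in M_eta]
i_star = int(np.argmin(dists))
print("closest grid measure:", M_eta[i_star], "W1 =", dists[i_star])
```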
We can conclude this paragraph by showing the existence of an ε-optimal lifted randomized feedback policy for the general lifted MDP on P(X), and obtain as a by-product the corresponding Bellman fixed point equation for its value function and for the optimal value of the CMKV-MDP under the randomization hypothesis.

Theorem 4.1
Assume that (H'_lip) holds true. Then, for all ε > 0, there exists a lifted randomized feedback policy π_{a_ε}, for some a_ε ∈ L^0(P(X)×X×[0,1]; A), that is ε-optimal for V⋆. Consequently, under Rand(ξ, G), the randomized feedback stationary control α^{a_ε} ∈ A is ε/(1−β)-optimal for V(ξ), and we have V(ξ) = Ṽ(µ) = V⋆(µ), for µ = L(ξ), which thus satisfies the Bellman fixed point equation.

Proof. Fix ε > 0, and given η > 0, consider a quantizing grid M_η = {µ_1, ..., µ_{N_η}} ⊂ P(X) as in Lemma 4.6, and an associated partition C^i_η, i = 1, ..., N_η, of P(X), satisfying

C^i_η ⊂ B_η(µ_i) := { µ ∈ P(X) : W(µ, µ_i) ≤ η }, i = 1, ..., N_η.

For any µ_i, i = 1, ..., N_η, by (4.13) there exists a^i_ε ∈ L^0(X×[0,1]; A) such that

V⋆(µ_i) ≤ [T^{a^i_ε} V⋆](µ_i) + ε/3. (4.14)

From the partition C^i_η, i = 1, ..., N_η, of P(X) associated to M_η, we construct the function a_ε : P(X)×X×[0,1] → A as follows. Let h_1, h_2 be two measurable functions from [0,1] into [0,1] such that, for U ∼ U([0,1]), (h_1(U), h_2(U)) ∼ U([0,1])^{⊗2}. We then define

a_ε(µ, x, u) = a^i_ε( ζ(µ, µ_i, x, h_1(u)), h_2(u) ), when µ ∈ C^i_η, i = 1, ..., N_η, x ∈ X, u ∈ [0,1],

where ζ is the measurable coupling function defined in Lemma 4.1. Such a function a_ε is clearly measurable, i.e., a_ε ∈ L^0(P(X)×X×[0,1]; A), and we denote by π_ε = π_{a_ε} the associated lifted randomized feedback policy, which satisfies

[T^{π_ε} V⋆](µ_i) = [T^{a^i_ε} V⋆](µ_i), i = 1, ..., N_η, (4.15)

by (4.12). Let us now check that π_ε yields an ε-optimal randomized feedback policy for η small enough. For µ ∈ P(X), with (ξ, U) ∼ µ ⊗ U([0,1]), set U_1 := h_1(U), U_2 := h_2(U), define µ^η = µ_i when µ ∈ C^i_η, i = 1, ..., N_η, and ξ^η := ζ(µ, µ^η, ξ, U_1). Observe by Lemma 4.6 that W(µ, µ^η) ≤ η, and by Lemma 4.1 that (ξ^η, U_2) ∼ µ^η ⊗ U([0,1]). We then have, for µ ∈ P(X),

[T^{π_ε} V⋆](µ) − V⋆(µ) = ( [T^{π_ε} V⋆](µ) − [T^{π_ε} V⋆](µ^η) ) + ( [T^{π_ε} V⋆](µ^η) − V⋆(µ^η) ) + ( V⋆(µ^η) − V⋆(µ) )
≥ ( [T^{π_ε} V⋆](µ) − [T^{π_ε} V⋆](µ^η) ) − ε/3 − ε/3, (4.16)

where we used (4.14)-(4.15) and the fact that |V⋆(µ^η) − V⋆(µ)| ≤ ε/3 for η small enough, by uniform continuity of V⋆ from Proposition 4.3. Moreover, observing that a_ε(µ, ξ, U) = a_ε(µ^η, ξ^η, U_2) =: α, so that π_ε(µ) = L(ξ, α) and π_ε(µ^η) = L(ξ^η, α), we have

[T^{π_ε} V⋆](µ) = E[ f(Y) + β V⋆( P^{ε^0_1}_{F(Y, ε_1, ε^0_1)} ) ], [T^{π_ε} V⋆](µ^η) = E[ f(Y^η) + β V⋆( P^{ε^0_1}_{F(Y^η, ε_1, ε^0_1)} ) ],

with Y = (ξ, α, π_ε(µ)) and Y^η = (ξ^η, α, π_ε(µ^η)). Under (H'_lip), using the γ-Hölder property of V⋆ with constant K_γ from Proposition 4.3, and the definition of the Wasserstein distance (recall that ξ ∼ µ, ξ^η ∼ µ^η), we then get

|[T^{π_ε} V⋆](µ) − [T^{π_ε} V⋆](µ^η)| ≤ 2K E[d(ξ, ξ^η)] + β K_γ (2K)^γ (E[d(ξ, ξ^η)])^γ.

Now, by the coupling Lemma 4.1, E[d(ξ, ξ^η)] can be made as small as we wish provided that η ≥ W(µ, µ^η) is small enough, and therefore |[T^{π_ε} V⋆](µ) − [T^{π_ε} V⋆](µ^η)| ≤ ε/3. Plugging into (4.16), we obtain [T^{π_ε} V⋆](µ) − V⋆(µ) ≥ −ε for all µ ∈ P(X), which means that π_ε is ε-optimal for V⋆. The rest of the assertions in the theorem follow from the verification result in Proposition 4.2. ✷
Remark 4.6 We stress the importance of the coupling Lemma 4.1 in the construction of the ε-optimal control in Theorem 4.1. Indeed, as we do not make any regularity assumption on F and f with respect to the "control arguments", the only way to make [T^{π_ε} V⋆](µ) and [T^{π_ε} V⋆](µ^η) close to each other is to couple the terms so as to have the same control in F and f. This is achieved by turning µ into µ^η, ξ into ξ^η, and setting α = a_ε(µ, ξ, U) = a_ε(µ^η, ξ^η, U_2). Turning µ into µ^η is a simple quantization, but turning ξ into ξ^η is obtained thanks to the coupling Lemma. ✷

4.4 Relaxing the randomization hypothesis

We now consider the general case by relaxing the randomization hypothesis and assuming only that the initial information filtration G is rich enough. We need the following uniform continuity property of the value function V of the CMKV-MDP.

Lemma 4.7
Assume that (H'_lip) holds true. Then, for all γ ∈ (0, min(1, |ln β| / (ln 2K)_+ − η)] with η > 0, and setting K_γ = 2K ∆_X^{1−γ} / (1 − β(2K)^γ), we have

sup_{α∈A} |V^α(ξ) − V^α(ξ′)| ≤ K_γ (E[d(ξ, ξ′)])^γ, ∀ ξ, ξ′ ∈ L^0(G; X).

Consequently, V is γ-Hölder on L^0(G; X) endowed with the L¹-distance.

Proof.
Fix ξ, ξ′ ∈ L^0(G; X), and consider an arbitrary α = α^π ∈ A associated to an open-loop policy π ∈ Π_OL. By Proposition A.1, there exists a measurable function f_{t,π} ∈ L^0(X × G × E^t × (E^0)^t; X) such that X^{ξ,α}_t = f_{t,π}(ξ, Γ, (ε_s)_{s≤t}, (ε^0_s)_{s≤t}), and thus P^0_{X^{ξ,α}_t} = L( f_{t,π}(ξ, Γ, (ε_s)_{s≤t}, (e^0_s)_{s≤t}) )_{e^0_s = ε^0_s, s≤t}. We thus have

W(P^0_{X^{ξ,α}_t}, P^0_{X^{ξ′,α}_t}) ≤ E[ d( f_{t,π}(ξ, Γ, (ε_s)_{s≤t}, (e^0_s)_{s≤t}), f_{t,π}(ξ′, Γ, (ε_s)_{s≤t}, (e^0_s)_{s≤t}) ) ]_{e^0_s = ε^0_s, s≤t},

and so

E[ W(P^0_{X^{ξ,α}_t}, P^0_{X^{ξ′,α}_t}) ] ≤ E[ d(X^{ξ,α}_t, X^{ξ′,α}_t) ]. (4.17)

Under the Lipschitz condition on f in (H'_lip), we then have

|V^α(ξ) − V^α(ξ′)| ≤ 2K Σ_{t=0}^∞ β^t E[ d(X^{ξ,α}_t, X^{ξ′,α}_t) ]. (4.18)

By conditioning, and from the transition dynamics of the state process, we see that for t ∈ N,

E[ d(X^{ξ,α}_{t+1}, X^{ξ′,α}_{t+1}) ] = E[ Δ(α_t, X^{ξ,α}_t, P^0_{(X^{ξ,α}_t, α_t)}, X^{ξ′,α}_t, P^0_{(X^{ξ′,α}_t, α_t)}, ε^0_{t+1}) ],

where Δ(a, x, ν, x′, ν′, e^0) = E[ d(F(x, a, ν, ε_{t+1}, e^0), F(x′, a, ν′, ε_{t+1}, e^0)) ]. By the Lipschitz condition on F in (H'_lip), we thus have

E[ d(X^{ξ,α}_{t+1}, X^{ξ′,α}_{t+1}) ] ≤ K E[ d(X^{ξ,α}_t, X^{ξ′,α}_t) + W(P^0_{X^{ξ,α}_t}, P^0_{X^{ξ′,α}_t}) ] ≤ 2K E[ d(X^{ξ,α}_t, X^{ξ′,α}_t) ],

where the last inequality comes from (4.17). Denoting by δ_t(ξ, ξ′) := sup_{α∈A} E[d(X^{ξ,α}_t, X^{ξ′,α}_t)], and noting that δ_0(ξ, ξ′) = E[d(ξ, ξ′)], it follows by induction that

δ_t(ξ, ξ′) ≤ s_t(E[d(ξ, ξ′)]), with s_t(m) := m (2K)^t, m ≥ 0, t ∈ N.

By the same arguments as in Theorem 2.1, and choosing γ as in the statement of the lemma, we obtain

Σ_{t=0}^∞ β^t δ_t(ξ, ξ′) ≤ (∆_X^{1−γ} / (1 − β(2K)^γ)) (E[d(ξ, ξ′)])^γ, ξ, ξ′ ∈ L^0(G; X),

and conclude with (4.18). ✷

Theorem 4.2
Assume that G is rich enough and that (H'_lip) holds true. Then, for any ξ ∈ L^0(G; X), V(ξ) = Ṽ(µ), where µ = L(ξ). Consequently, V is law-invariant, identified with Ṽ, and satisfies the Bellman fixed point equation Ṽ = T Ṽ. Moreover, for all ε > 0, there exists an ε-optimal randomized feedback control for V(ξ).

Proof. As X is compact, there exists a finite subset X_η ⊂ X such that d(x, x^η) ≤ η for all x ∈ X, where x^η denotes the projection of x on X_η. Fix ξ ∈ L^0(G; X), and set µ = L(ξ). Let us then consider a random variable ξ′ ∼ µ, defined on another probability universe along with an independent uniform variable U′. We set Γ′ := (ξ′, U′) and G′ = σ(Γ′). By construction, the randomization hypothesis Rand(ξ′, G′) holds true, and we then have V(ξ′) = Ṽ(µ) from Theorem 4.1. Consider now the quantized random variables ξ^η and ξ′^η, which have the same law, and satisfy respectively the randomization hypotheses Rand(ξ^η, G) and Rand(ξ′^η, G′) from Lemma 3.2. From Theorem 4.1, we deduce that V(ξ^η) = Ṽ(L(ξ^η)) = V(ξ′^η). By uniform continuity of V (Lemma 4.7), it follows by sending η to zero that V(ξ) = V(ξ′), and thus V(ξ) = Ṽ(µ), which proves the required result.

Finally, the existence of an ε-optimal control for V(ξ) is obtained as follows. From the uniform continuity of V in Lemma 4.7, there exists η small enough so that |V(ξ) − V(ξ^η)| ≤ ε/2. We then build according to Theorem 4.1 an ε/2-optimal control for V(ξ^η), which yields an ε-optimal (randomized feedback stationary) control for V(ξ). ✷

Remark 4.7 From Theorems 4.1 and 4.2, under the condition that G is rich enough and (H'_lip) holds true, the value function V of the CMKV-MDP is law-invariant, and the supremum in the Bellman fixed point equation for V ≡ Ṽ with the operator T can be restricted to lifted randomized feedback policies, i.e.,

V = T V = sup_{a ∈ L^0(P(X)×X×[0,1]; A)} T^a V,

where we set T^a := T^{π_a}, equal to

[T^a W](µ) = E[ f(Y^a(µ, ξ, U)) + β W( P^{ε^0_1}_{F(Y^a(µ,ξ,U), ε_1, ε^0_1)} ) ],

with Y^a(µ, x, u) := (x, a(µ, x, u), π_a(µ)) and (ξ, U) ∼ µ ⊗ U([0,1]). ✷

Remark 4.8 (G rich enough and dynamic programming) The condition that G is rich enough is crucial for obtaining the Bellman fixed point equation for the value function V. Let us illustrate this fact with a counter-example similar to Example 3.1 in [11]. Consider X = {−1, 1} = A, ε_1 uniformly distributed on {−1, 1} (denoted B(1/2) by misuse of notation), F(x, a, ν, e, e^0) = ax, and f(x, a, ν) = −W(pr¹⋆ν, B(1/2)), which is maximal (equal to 0) when pr¹⋆ν = B(1/2) on X, and minimal (equal to −1/2) when pr¹⋆ν = δ_{−1} or δ_1. Assume that Γ = 1 a.s., so that G is the trivial σ-algebra. In this case, ξ =: x and α_0 are then necessarily deterministic, and thus X^{ξ,α}_1 = α_0 x has to be deterministic as well, which yields rewards equal to −1/2 at times t = 0, 1. By choosing a control with α_0 = 1, α_1 = ε_1 and α_t = 1 afterwards, we have P^0_{X^{x,α}_t} = δ_x for t = 0, 1, and P^0_{X^{x,α}_t} = B(1/2) for t ≥ 2, which yields V(ξ) = −(1+β)/2. If V satisfied the DPP, we would have V(x) = sup_{a∈A}(−1/2 + βV(ax)), which is equivalent to −(1+β)/2 = −1/2 − β(1+β)/2, and this is clearly false for β > 0.

Intuitively, the reason why the DPP is not satisfied in the previous example is that from time t = 1 we have the possibility to randomize actions using ε_1, whereas at time t = 0 no randomization was possible (even by quantizing, because G is not rich enough). As discussed in the next remark, the possibility to randomize implies that the game from time 1 is more advantageous than the game from time 0, which contradicts the idea behind the DPP that from time 1 on we are playing the same game, only starting from a different state. ✷

Remark 4.9 (Open-loop vs feedback controls and randomization hypothesis)
In Theorem 4.1, we build, under the randomization hypothesis, an ε-optimal control of the form α_t = a_ε(P^0_{X_t}, X_t, U_t) (a randomized feedback control), for any ε > 0, which implies that optimizing over open-loop controls gives the same optimal value as optimizing over randomized feedback controls. In standard (non-McKean-Vlasov) MDPs, it is well known that one can even restrict to controls of the form α_t = a(X_t), i.e., (non-randomized) feedback controls: randomizing actions is thus not necessary in standard MDPs. However, in the case of McKean-Vlasov MDPs, randomization can be crucial to optimize the gain (even when the reward does not depend upon the law of the control). To illustrate this, let us consider the same example as in Remark 4.8, and assume that we start from ξ = 1 a.s.

• If we use a (non-randomized) feedback control, it is clear that the law of X_t will always be a Dirac mass, and the gain will thus be equal to Σ_{t=0}^∞ β^t (−1/2) = −1/(2(1−β)), which is the worst possible gain.

• With open-loop controls, even when the randomization hypothesis Rand(ξ, G) is not satisfied, if at some point we can use the past noise to randomize (even just slightly), there will be some times at which the law of the state is not a Dirac mass, and the gain will then be strictly greater than −1/(2(1−β)). In our case, we have seen in the previous remark that we could reach V = −(1+β)/2.

• Finally, if G were rich enough, for instance if Γ ∼ U([0,1]), then Rand(ξ, G) would hold true, and we could freely randomize from the beginning by choosing α_0 = 2·1_{Γ<1/2} − 1 ∼ B(1/2) (thus X_1 ∼ B(1/2)), and then α_t = 1 for all t ∈ N⋆, so that X_t ∼ B(1/2) for all t ∈ N⋆. With this strategy, the total gain would be equal to −1/2, which is the best possible gain given that we start from ξ = 1 a.s.

To summarize, this example shows that the optimal value under Rand(ξ, G) is here strictly greater than the optimal value over open-loop controls without Rand(ξ, G), which is itself strictly greater than the optimal value over feedback controls. This highlights that in general the suprema over "randomized feedback" ≥ "open-loop" ≥ "feedback" controls, and that the inequalities can be strict, the underlying idea being that the more (and the sooner) we can randomize, the better. We finally point out that the randomization hypothesis Rand(ξ, G) is natural since, in practice, nothing prevents us from using (pseudo-)uniform random variables in our strategies to randomize our actions. ✷

4.5 Computing value function and ε-optimal strategies in CMKV-MDP

Having established the correspondence of our CMKV-MDP with a lifted MDP on P(X), and the associated Bellman fixed point equation, we can design two methods for computing the value function and optimal strategies:

(a) Value iteration. We approximate the value function V = Ṽ = V⋆ by iteration from the Bellman operator: V_{n+1} = T V_n, and at iteration N we compute an approximate optimal randomized feedback policy a_N by (recall Remark 4.7)

a_N ∈ arg max_{a ∈ L^0(P(X)×X×[0,1]; A)} T^a V_N.

From a_N, we then construct an approximate randomized feedback stationary control α^{a_N} according to the procedure described in Remark 4.3.

(b) Policy iteration. Starting from some initial randomized feedback policy a_0 ∈ L^0(P(X)×X×[0,1]; A), we iterate according to (see the sketch after this list):

• Policy evaluation: we compute the expected gain Ṽ^{π_{a_k}} of the lifted MDP;
• Greedy strategy: we compute a_{k+1} ∈ arg max_{a ∈ L^0(P(X)×X×[0,1]; A)} T^a Ṽ^{π_{a_k}}.
We stop at iteration K to obtain a_K, and then construct an approximate randomized feedback control α^{a_K} according to the procedure described in Remark 4.3.
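The following sketch illustrates the policy iteration loop (b) on a finite grid of measures: evaluation solves the linear fixed point of T^π exactly (Lemma 4.5), and the greedy step improves the policy until it is stable. The model arrays are toy placeholders, not the paper's model.

```python
import numpy as np

# Policy iteration on a finite grid of measures (toy model).

rng = np.random.default_rng(1)
n_mu, n_act, n_noise, beta = 30, 6, 5, 0.9
reward = rng.uniform(-1, 1, size=(n_mu, n_act))               # f~(mu_i, a_j)
next_idx = rng.integers(0, n_mu, size=(n_mu, n_act, n_noise)) # F~(mu_i, a_j, e0_k)

def evaluate(pi):
    # policy evaluation: V~^pi is the unique fixed point of T^pi (Lemma 4.5),
    # obtained here by solving (I - beta * P_pi) V = f_pi
    P = np.zeros((n_mu, n_mu))
    for i in range(n_mu):
        for k in range(n_noise):
            P[i, next_idx[i, pi[i], k]] += 1.0 / n_noise
    return np.linalg.solve(np.eye(n_mu) - beta * P, reward[np.arange(n_mu), pi])

pi = np.zeros(n_mu, dtype=int)
for _ in range(50):
    V = evaluate(pi)
    greedy = (reward + beta * V[next_idx].mean(axis=2)).argmax(axis=1)
    if np.array_equal(greedy, pi):
        break              # policy is stable, hence optimal on the grid
    pi = greedy
print("greedy policy:", pi[:10], "value at mu_0:", V[0])
```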
Practical computation. Since a randomized feedback control α is a measurable function a of (P^0_{X^{ξ,α}_t}, X^{ξ,α}_t, U_t), we would need to compute and store the (conditional) law of the state process, which is infeasible in practice when X is a continuous space. In this case, to circumvent this issue, a natural idea is to discretize the compact space X by considering a finite subset X_η = {x_1, ..., x_{N_η}} ⊂ X, associated with a partition B^i_η, i = 1, ..., N_η, of X, satisfying B^i_η ⊂ { x ∈ X : d(x, x_i) ≤ η }, i = 1, ..., N_η, with η > 0. For any x ∈ X, we denote by [x]_η (or simply x^η) its projection on X_η, defined by [x]_η = x_i for x ∈ B^i_η, i = 1, ..., N_η.

Definition 4.2 (Discretized CMKV-MDP)
Fix η > 0. Given ξ ∈ L^0(G; X_η) and a control α ∈ A, we denote by X^{η,ξ,α} the McKean-Vlasov MDP on X_η given by

X^{η,ξ,α}_{t+1} = [ F(X^{η,ξ,α}_t, α_t, P^0_{(X^{η,ξ,α}_t, α_t)}, ε_{t+1}, ε^0_{t+1}) ]_η, t ∈ N, X^{η,ξ,α}_0 = ξ,

i.e., obtained by projecting the state on X_η after each application of the transition function F. The associated expected gain V^α_η is defined by

V^α_η(ξ) = E[ Σ_{t=0}^∞ β^t f(X^{η,ξ,α}_t, α_t, P^0_{(X^{η,ξ,α}_t, α_t)}) ].

Notice that the (conditional) law of the discretized CMKV-MDP on X_η is now valued in a finite-dimensional space (the simplex of [0,1]^{N_η}), which makes the computation of the associated randomized feedback control accessible, although computationally challenging due to the high dimensionality (and beyond the scope of this paper).
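For illustration, here is a minimal particle simulation of the discretized CMKV-MDP of Definition 4.2 on X = [0,1]: the state is projected back on a finite grid after each transition, so the conditional law can be stored as a weight vector on the simplex. F, f and the policy are hypothetical placeholders.

```python
import numpy as np

# Particle simulation of the discretized CMKV-MDP (toy model on X = [0, 1]).

rng = np.random.default_rng(0)
x_grid = np.linspace(0.0, 1.0, 11)   # X_eta
proj = lambda x: x_grid[np.abs(x[:, None] - x_grid[None, :]).argmin(axis=1)]

def F(x, a, m, eps, eps0):
    return np.clip(x + 0.4 * a + 0.1 * (m - x) + 0.05 * eps + 0.05 * eps0, 0, 1)

def f(x, a, m):
    return -(x - 0.5) ** 2 - 0.05 * a ** 2 + 0.1 * m

N, beta, T = 5000, 0.9, 60
x = proj(rng.uniform(size=N))        # xi valued in X_eta
gain, disc = 0.0, 1.0
for t in range(T):
    weights = np.array([(x == g).mean() for g in x_grid])  # conditional law in the simplex
    m = weights @ x_grid                                   # its mean, fed to F and f
    a = np.clip(0.5 - x + 0.1 * (rng.uniform(size=N) - 0.5), -1, 1)  # randomized feedback
    gain += disc * f(x, a, m).mean()
    x = proj(F(x, a, m, rng.normal(size=N), rng.normal()))  # project back on X_eta
    disc *= beta
print("estimated V_eta^alpha(xi):", gain)
```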
The next result states that an ε-optimal randomized feedback control in the initial CMKV-MDP can be approximated by a randomized feedback control in the discretized CMKV-MDP.

Proposition 4.4

Assume that G is rich enough and that (H'_lip) holds true. Fix ξ ∈ L^0(G; X). Given η > 0, let ξ^η be the projection of ξ on X_η. As Rand(ξ^η, G) holds true, consider an i.i.d. sequence (U_{η,t})_{t∈N} of G-measurable uniform variables independent of ξ^η. For ε > 0, let a_ε be a randomized feedback policy that is ε-optimal for the Bellman fixed point equation satisfied by V. Finally, let α^{η,ε} be the randomized feedback control in the discretized CMKV-MDP recursively defined by α^{η,ε}_t = a_ε(P^0_{X^{η,ε}_t}, X^{η,ε}_t, U_{η,t}), t ∈ N, where we set X^{η,ε} := X^{η,ξ^η,α^{η,ε}}. Then, for any δ > 0 and γ ∈ (0, min(1, |ln β| / (ln 2K)_+ − δ)], the control α^{η,ε} is O(η^γ + ε)-optimal for the CMKV-MDP X with initial state ξ.

Proof. Step 1. For δ > 0 and γ ∈ (0, min(1, |ln β| / (ln 2K)_+ − δ)], let us show that

sup_{α∈A} Σ_{t=0}^∞ β^t E[ d(X^{ξ,α}_t, X^{η,ξ^η,α}_t) ] ≤ C η^γ, (4.19)

for some constant C that depends only on K, β and γ. Indeed, notice by definition of the projection on X_η, and by a simple conditioning argument, that for all α ∈ A and t ∈ N,

E[ d(X^{ξ,α}_{t+1}, X^{η,ξ^η,α}_{t+1}) ] ≤ η + E[ Δ(X^{ξ,α}_t, X^{η,ξ^η,α}_t, α_t, P^0_{(X^{ξ,α}_t, α_t)}, P^0_{(X^{η,ξ^η,α}_t, α_t)}, ε^0_{t+1}) ],

where Δ(x, x′, a, ν, ν′, e^0) = E[ d(F(x, a, ν, ε_{t+1}, e^0), F(x′, a, ν′, ε_{t+1}, e^0)) ]. Under (H'_lip), we then get

E[ d(X^{ξ,α}_{t+1}, X^{η,ξ^η,α}_{t+1}) ] ≤ η + K E[ d(X^{ξ,α}_t, X^{η,ξ^η,α}_t) + W(P^0_{X^{ξ,α}_t}, P^0_{X^{η,ξ^η,α}_t}) ] ≤ η + 2K E[ d(X^{ξ,α}_t, X^{η,ξ^η,α}_t) ],

by the same argument as in (4.17). Hence the sequence (E[d(X^{ξ,α}_t, X^{η,ξ^η,α}_t)])_{t∈N} satisfies the same type of induction inequality as (2.16) in Theorem 2.1, with η instead of M_N, and the same derivation leads to the required result (4.19). From the Lipschitz condition on f, we deduce by the same arguments as for (4.17) in Lemma 4.7 that

sup_{α∈A} |V^α(ξ^η) − V^α_η(ξ^η)| = O(η^γ). (4.20)

Step 2. Denote by µ = L(ξ) and µ^η = L(ξ^η), and observe that W(µ, µ^η) ≤ E[d(ξ, ξ^η)] ≤ η. We write

V^{α^{η,ε}}(ξ) − V(ξ) = [V^{α^{η,ε}}(ξ) − V^{α^{η,ε}}(ξ^η)] + [V^{α^{η,ε}}(ξ^η) − V^{α^{η,ε}}_η(ξ^η)] + [V^{α^{η,ε}}_η(ξ^η) − V(ξ^η)] + [V(ξ^η) − V(ξ)] =: I_1 + I_2 + I_3 + I_4.

The first and last terms I_1 and I_4 are of order O(η^γ) by the γ-Hölder property of V^α and V in Lemma 4.7. By (4.20), the second term I_2 is of order O(η^γ) as well for η small enough. Regarding the third term I_3, notice that by definition, V^{α^{η,ε}}_η(ξ^η) is the gain associated to the randomized feedback policy a_ε for the discretized CMKV-MDP. Denote by π_ε the lifted randomized feedback policy associated to a_ε, and recall from Remark 4.3 the identification with the lifted MDP: V^{α^{η,ε}}_η(ξ′) = Ṽ^{π_ε}_η(µ′), µ′ = L(ξ′), where Ṽ^{π_ε}_η is the expected gain of the lifted MDP associated to the discretized CMKV-MDP, hence the fixed point of the operator

[T^{a_ε}_η W](µ′) = E[ f(Y^{a_ε}(µ′, ξ′, U)) + β W( P^{ε^0_1}_{[F(Y^{a_ε}(µ′,ξ′,U), ε_1, ε^0_1)]_η} ) ],

with Y^a(µ, x, u) = (x, a(µ, x, u), π_a(µ)) and (ξ′, U) ∼ µ′ ⊗ U([0,1]). Recalling that V(ξ′) = Ṽ(µ′), µ′ = L(ξ′), with Ṽ the fixed point of the Bellman operator T, it follows that

I_3 = Ṽ^{π_ε}_η(µ^η) − Ṽ(µ^η) = ( [T^{a_ε}_η Ṽ^{π_ε}_η](µ^η) − [T^{a_ε}_η Ṽ](µ^η) ) + ( [T^{a_ε}_η Ṽ](µ^η) − [T^{a_ε} Ṽ](µ^η) ) + ( [T^{a_ε} Ṽ](µ^η) − Ṽ(µ^η) ) =: I_{31} + I_{32} + I_{33}.

By definition of a_ε, we have |I_{33}| ≤ ε. For I_{32}, notice that the only difference between the operators T^{a_ε}_η and T^{a_ε} is that F is projected on X_η. Thus

|[T^{a_ε}_η Ṽ](µ^η) − [T^{a_ε} Ṽ](µ^η)| ≤ β E[ |Ṽ( P^{ε^0_1}_{[F(Y^η, ε_1, ε^0_1)]_η} ) − Ṽ( P^{ε^0_1}_{F(Y^η, ε_1, ε^0_1)} )| ],

with Y^η = (ξ^η, a_ε(µ^η, ξ^η, U), π_ε(µ^η)). It is clear, by definition of the Wasserstein distance and of the projection on X_η, that

W( P^{ε^0_1}_{[F(Y^η, ε_1, ε^0_1)]_η}, P^{ε^0_1}_{F(Y^η, ε_1, ε^0_1)} ) ≤ E[ d(F(Y^η, ε_1, ε^0_1), [F(Y^η, ε_1, ε^0_1)]_η) ] ≤ η.

From the γ-Hölder property of Ṽ in Proposition 4.3, we deduce that I_{32} = O(η^γ). Finally, for I_{31}, since T^{a_ε}_η is a β-contracting operator on (L^∞(M_η), ‖·‖_{η,∞}), we have

|[T^{a_ε}_η Ṽ^{π_ε}_η](µ^η) − [T^{a_ε}_η Ṽ](µ^η)| ≤ β ‖Ṽ^{π_ε}_η − Ṽ‖_{η,∞},

and thus |Ṽ^{π_ε}_η(µ^η) − Ṽ(µ^η)| = |I_3| ≤ |I_{31}| + |I_{32}| + |I_{33}| ≤ β ‖Ṽ^{π_ε}_η − Ṽ‖_{η,∞} + O(η^γ + ε). Taking the supremum over µ^η ∈ M_η on the left-hand side, we obtain ‖Ṽ^{π_ε}_η − Ṽ‖_{η,∞} ≤ (1/(1−β)) O(η^γ + ε) = O(η^γ + ε), and we conclude that |I_3| ≤ ‖Ṽ^{π_ε}_η − Ṽ‖_{η,∞} ≤ O(η^γ + ε), which ends the proof. ✷

Remark 4.10
Back to the N-agent MDP problem with open-loop controls, recall from Section 2 that it suffices to find an ε-optimal open-loop policy π_ε ∈ Π_OL for the CMKV-MDP, as it will automatically be O(ε)-optimal for the N-agent MDP with N large enough. For instance, the construction of an ε-optimal control α^ε given by Proposition 4.4 can be associated to an ε-optimal open-loop policy π^ε such that α^ε = α^{π^ε} (where π^ε_t is a measurable function of (Γ, (ε_s)_{1≤s≤t}, (ε^0_s)_{1≤s≤t})). The O(ε)-optimal control process for the i-th agent, α^{ε,i}, is then the result of the same construction but with (Γ^i, ε^i, ε^0) instead of (Γ, ε, ε^0), i.e., replacing ξ by ξ^i, U_{η,t} by U^i_{η,t}, and (Γ, ε, ε^0) by (Γ^i, ε^i, ε^0) in Proposition 4.4. Notice that this construction never requires access to the individual states X^{i,N}.

Remark 4.11 (Q function) In view of the Bellman fixed point equation satisfied by the value function V of the CMKV-MDP in terms of randomized feedback policies, let us introduce the corresponding state-action value function Q defined on P(X) × Â(X) by

Q(µ, â) = [T̂^â V](µ) = f̂(µ, â) + β E[ V(F̂(µ, â, ε^0_1)) ].

From Proposition 4.1, and since V = T V, we recover the standard connection between the value function and the state-action value function, namely V(µ) = sup_{â ∈ Â(X)} Q(µ, â), from which we obtain the Bellman equation for the Q function:

Q(µ, â) = f̂(µ, â) + β E[ sup_{â′ ∈ Â(X)} Q(µ^â_1, â′) ], (4.21)

where we set µ^â_1 = F̂(µ, â, ε^0_1). Notice that this Q-Bellman equation extends the equation in [11] (see their Theorem 3.1), derived in the no-common-noise case and when there is no mean-field dependence with respect to the law of the control. The Bellman equation (4.21) is the starting point, in a model-free framework where the state transition function is unknown (in other words, in the context of reinforcement learning), for the design of Q-learning algorithms estimating the Q-value function by some Q_n, from which one computes a relaxed control by

â^µ_n ∈ arg max_{â ∈ Â(X)} Q_n(µ, â), µ ∈ P(X),

and then, by Lemma A.3, associates to â^µ_n a function a_n : P(X)×X×[0,1] → A such that L(a_n(µ, x, U)) = â^µ_n(x), µ ∈ P(X), x ∈ X, where U is a uniform random variable. In practice, one has to discretize the state space X as in Definition 4.2, and then to quantize the space P(X) as in Lemma 4.6, in order to reduce the learning problem to a finite-dimensional problem for the computation of an approximate optimal randomized feedback policy a_n for the CMKV-MDP. ✷
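As a rough illustration of such a scheme, the sketch below runs tabular Q-learning in the spirit of (4.21): the measure argument is quantized to a finite grid (Lemma 4.6) and the relaxed controls in Â(X) to a finite set, so Q becomes a matrix. The simulator `step` is a hypothetical stand-in for sampling µ^â_1 = F̂(µ, â, ε^0_1); everything here is a toy placeholder, not the paper's algorithm.

```python
import numpy as np

# Tabular Q-learning on a quantized measure space (toy stand-in for (4.21)).

rng = np.random.default_rng(0)
n_mu, n_act, beta, lr = 25, 4, 0.9, 0.1

def step(i_mu, j_act):
    # hypothetical environment: returns (reward, next measure index)
    r = -abs(i_mu / (n_mu - 1) - 0.5) + 0.1 * j_act / n_act
    i_next = (i_mu + j_act + rng.integers(0, 3)) % n_mu
    return r, i_next

Q = np.zeros((n_mu, n_act))
i = rng.integers(n_mu)
for n in range(200_000):
    # epsilon-greedy exploration over the quantized relaxed controls
    j = rng.integers(n_act) if rng.random() < 0.1 else int(Q[i].argmax())
    r, i_next = step(i, j)
    # stochastic fixed-point update for the Q-Bellman equation
    Q[i, j] += lr * (r + beta * Q[i_next].max() - Q[i, j])
    i = i_next
print("greedy quantized policy:", Q.argmax(axis=1))
```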
5 Conclusion

We have developed a theory for mean-field Markov decision processes with common noise and open-loop controls, called CMKV-MDP, for general state and action spaces. Such a problem is motivated and shown to be the asymptotic problem of a large population of cooperative agents under mean-field interaction, controlled by a social planner/influencer, and we provide a rate of convergence of the N-agent model to the CMKV-MDP. We prove the correspondence of the CMKV-MDP with a general lifted MDP on the space of probability measures, and emphasize the role of relaxed controls, which are crucial to characterize the solution via the Bellman fixed point equation. Approximate randomized feedback controls are obtained from the dynamic programming Bellman equation in a model-based framework, and future work under investigation will develop algorithms in a model-free framework, in other words in the context of reinforcement learning with many interacting and cooperative agents.

A Some useful results on conditional law
Lemma A.1
Let (S, S) and (T, T) be two measurable spaces. Then:
(1) the map x ∈ (S, S) ↦ δ_x ∈ (P(S), C(S)) is measurable;
(2) the map (µ, ν) ∈ (P(S), C(S)) × (P(T), C(T)) ↦ µ ⊗ ν ∈ (P(S×T), C(S×T)) is measurable;
(3) if F ∈ L^0(S; T), then the map µ ∈ (P(S), C(S)) ↦ F⋆µ ∈ (P(T), C(T)) is measurable.

Proof. (1) Fix A ∈ S; then x ↦ δ_x(A) = 1_A(x) is clearly a measurable function. (2) Fix (A, B) ∈ S × T; then (µ, ν) ↦ (µ⊗ν)(A×B) = µ(A)ν(B) is a measurable function. (3) Fix B ∈ T; then µ ↦ (F⋆µ)(B) = µ(F^{−1}(B)) is a measurable function. ✷

Corollary A.1
Let (S, S), (T, T) and (U, U) be three measurable spaces, and F ∈ L^0((S, S) × (T, T); (U, U)) a measurable function. Then the function F̂ : (P(S), C(S)) × (T, T) → (P(U), C(U)) given by F̂(µ, x) := F(·, x)⋆µ is measurable.
Proof. This is a consequence of Lemma A.1 and the measurable composition (µ, x) ↦ (µ, δ_x) ↦ µ ⊗ δ_x ↦ F ⋆ (µ ⊗ δ_x) = F̂(µ, x). ✷

Definition A.1 (Conditional law) Fix two measurable spaces (S, S) and (T, T), and let X, Y be two random variables on (Ω, F, P) valued respectively in S and T. A conditional law of Y knowing X is a σ(X)-measurable, (P(T), C(T))-valued random variable P^X_Y, also denoted L(Y | X), such that

P^X_Y(A) = P[Y ∈ A | X], ∀ A ∈ T, a.s.

Furthermore, given x ∈ S, the conditional law of Y knowing X = x, denoted P^{X=x}_Y or L(Y | X = x), is the image of x by any probability kernel â from S to T such that L(Y | X) = â(X) a.s.

Lemma A.2 (Conditional law)
Let (S, S) and (T, T) be two measurable spaces.
1. If (S, S) is a Borel space, there exists a conditional law of Y knowing X.
2. If Y = φ(X, Z), where Z ⊥⊥ X is a random variable valued in a measurable space V and φ : S × V → T is a measurable function, then L(φ(x, Z))|_{x=X} is a conditional law of Y knowing X. In the case S = S_1 × S_2, X = (X_1, X_2) and Y = φ(X_1, Z), then P^X_Y = L(φ(x_1, Z))|_{x_1=X_1}, and thus P^X_Y is σ(X_1)-measurable in (P(T), C(T)).
Proof. The first assertion is stated in Theorem 5.3 in [12], and the second one follows from Fubini's theorem. ✷

Corollary A.2
Fix a Borel space (S, S) and a measurable space (T, T). Then, for any joint law π ∈ P(S×T) with marginal law µ ∈ P(S), there exists a probability kernel ν from S to P(T) such that π = µ · ν.

Lemma A.3 (Kernels and randomization)
Fix a Borel space (S, S) and a measurable space (T, T). For any probability kernel ν from S to P(T), there exists a measurable function φ : S × [0,1] → T such that ν(s) = L(φ(s, U)) for all s ∈ S, where U is a uniform random variable.

Proof.
See Lemma 2.22 in [12]. ✷ Proposition A.1
Given an open-loop control α ∈ A and an initial condition ξ ∈ L^0(G; X), the solution X^{ξ,α} to the conditional McKean-Vlasov dynamics is such that: for all t ∈ N, X^{ξ,α}_t is σ(ξ, Γ, (ε_s)_{s≤t}, (ε^0_s)_{s≤t})-measurable, and P^0_{(X^{ξ,α}_t, α_t)} is F^0_t-measurable.

Proof.
We prove the result by induction on t. It is clear for t = 0. Assuming that it holds true for some t ∈ N, we write

X^{ξ,α}_{t+1} = F(X^{ξ,α}_t, α_t, P^0_{(X^{ξ,α}_t, α_t)}, ε_{t+1}, ε^0_{t+1}), t ∈ N.

By the induction hypothesis, there is a measurable function f_{t+1} : X × G × E^{t+1} × (E^0)^{t+1} → X such that X^{ξ,α}_{t+1} = f_{t+1}(ξ, Γ, (ε_s)_{s≤t+1}, (ε^0_s)_{s≤t+1}); thus X^{ξ,α}_{t+1} is σ(ξ, Γ, (ε_s)_{s≤t+1}, (ε^0_s)_{s≤t+1})-measurable, and P^0_{(X^{ξ,α}_{t+1}, α_{t+1})} is σ(ε^0_s, s ≤ t+1)-measurable by Lemma A.2. ✷

B Wasserstein convergence of the empirical law
Proposition B.1 (Conditional Wasserstein convergence of the empirical measure)
Let E, F be two measurable spaces and G a compact Polish space. Let X be an E-valued random variable, independent from a family of i.i.d. F-valued random variables (U_i)_{i∈N⋆}, and let f : E × F → G be a measurable function. Then

W( (1/N) Σ_{i=1}^N δ_{f(X, U_i)}, P^X_{f(X, U_1)} ) → 0 a.s., as N → ∞.

Proof.
It suffices to observe that the probability of this event is one by conditioning with respect to X, and to use the analogous non-conditional result, which follows from the fact that the Wasserstein distance metrizes weak convergence (as G is compact), together with the fact that the empirical measure converges weakly. ✷
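Proposition B.1 is easy to observe numerically in one dimension, where the Wasserstein distance between two laws on R has the closed form W_1 = ∫ |F_N − F| dx. The sketch below is a toy Monte Carlo check under an assumed map f and a compact image; the large reference sample and grid are illustrative devices, not part of the paper.

```python
import numpy as np

# Toy numerical check of Proposition B.1 in a 1-d case.

rng = np.random.default_rng(0)
x_common = rng.normal()                        # one draw of the common variable X

def w1_empirical_vs_limit(N):
    sample = np.sort(np.tanh(x_common + rng.uniform(size=N)))  # f(X, U_i), compact-valued
    # reference sample standing in for the conditional law P^X_{f(X, U_1)}
    ref = np.sort(np.tanh(x_common + rng.uniform(size=200_000)))
    grid = np.linspace(-1, 1, 2000)
    F_N = np.searchsorted(sample, grid) / N
    F = np.searchsorted(ref, grid) / ref.size
    return np.trapz(np.abs(F_N - F), grid)     # W_1 = integral of |F_N - F|

for N in [10, 100, 1000, 10000]:
    print(N, w1_empirical_vs_limit(N))         # decreases toward 0 as N grows
```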
C Proof of coupling results

Lemma C.1
Let U, V be two independent uniform random variables, and F a distribution function on R. We have

(F^{−1}(U), F(F^{−1}(U)) − U) =^d (F^{−1}(U), V ΔF(F^{−1}(U))),

where we denote ΔF := F − F_−.

Proof.
Notice that F(F^{−1}(U)) − U is the position (from top to bottom) of U in the set {u ∈ [0,1] : F^{−1}(u) = F^{−1}(U)}, and is thus smaller than ΔF(F^{−1}(U)). Now, given a measurable function f ∈ L^0(R × [0,1]; R), we have

E[ f(F^{−1}(U), F(F^{−1}(U)) − U) ]
= E[ f(F^{−1}(U), 0) 1_{ΔF(F^{−1}(U))=0} ] + E[ f(F^{−1}(U), F(F^{−1}(U)) − U) 1_{ΔF(F^{−1}(U))>0} ]. (C.1)

The second term can be decomposed as

Σ_{c : ΔF(c)>0} E[ f(c, F(c) − U) 1_{F^{−1}(U)=c} ] = Σ_{c : ΔF(c)>0} ∫_0^1 f(c, ΔF(c) u) ΔF(c) du,

where the equality comes from a change of variable. Summing over {c : ΔF(c) > 0}, we obtain E[ f(F^{−1}(U), V ΔF(F^{−1}(U))) 1_{ΔF(F^{−1}(U))>0} ], and combined with (C.1), we get

E[ f(F^{−1}(U), F(F^{−1}(U)) − U) ] = E[ f(F^{−1}(U), V ΔF(F^{−1}(U))) ],

which proves the result. ✷

Lemma C.2 Let X be a compact Polish space. Then there exists an embedding φ ∈ L^0(X, R) such that:
1. φ and φ^{−1} are uniformly continuous;
2. for any probability measure µ ∈ P(X), we have Im(F^{−1}_{φ⋆µ}) ⊂ Im(φ). In particular, φ^{−1} ∘ F^{−1}_{φ⋆µ} is well posed.

Proof.
1. Without loss of generality, we assume that X has diameter bounded by 1. Fix a countable dense family (x_n)_{n∈N} in X. We define the map φ_1 : x ∈ X ↦ (d(x, x_n))_{n∈N} ∈ [0,1]^N. Let us endow [0,1]^N with the metric d((u_n)_{n∈N}, (v_n)_{n∈N}) := Σ_{n≥0} 2^{−n} |u_n − v_n|. The map φ_1 is clearly injective and uniformly continuous (even Lipschitz), and the compactness of X implies that its inverse φ_1^{−1} is uniformly continuous as well. Let us now consider a map φ_2 : ([0,1]^N, d) → [0,1]: φ_2((u_n)_{n∈N}) essentially groups the decimals of the real numbers u_n, n ∈ N, into a single real number. More precisely, let ι : N → N² be a surjection; we then define the k-th decimal of φ_2((u_n)_{n∈N}) as the ι(k)_2-th decimal of u_{ι(k)_1} (with the convention that, for a number with two possible decimal representations, we choose the one that ends with 000...). The map φ_2 is clearly injective and uniformly continuous, as well as its inverse φ_2^{−1}. Thus φ := φ_2 ∘ φ_1 defines an embedding of X into R such that φ and φ^{−1} are uniformly continuous.

2. F^{−1}_{φ⋆µ} being caglad and Im(φ) being closed (by compactness of X), it is enough to prove that F^{−1}_{φ⋆µ}(u) ∈ Im(φ) for almost every u ∈ [0,1] (in the Lebesgue sense). However, given a uniform variable U, we have F^{−1}_{φ⋆µ}(U) ∼ φ⋆µ, and thus P(F^{−1}_{φ⋆µ}(U) ∈ Im(φ)) = P_{Y∼µ}(φ(Y) ∈ Im(φ)) = 1. ✷

Proof of Lemma 4.1. (1) We first consider the case where X ⊂ R. Let us call F_µ the distribution function of µ ∈ P(X), and F^{−1}_µ its generalized inverse. Let us define the function ζ : P(X) × P(X) × X × [0,1] → X by

ζ(µ, µ′, x, u) := F^{−1}_{µ′}( F_µ(x) − u ΔF_µ(x) ),

which is measurable, noting that the measurability in (µ, µ′) comes from the continuity of the map P(X) → L_caglad([0,1], X), µ ↦ F^{−1}_µ. By construction, and by Lemma C.1, we then have for any ξ ∼ µ, and U, V two independent uniform variables independent of ξ:

(ξ, ζ(µ, µ′, ξ, V)) = (ξ, F^{−1}_{µ′}(F_µ(ξ) − V ΔF_µ(ξ))) =^d (F^{−1}_µ(U), F^{−1}_{µ′}(F_µ(F^{−1}_µ(U)) − V ΔF_µ(F^{−1}_µ(U)))) =^d (F^{−1}_µ(U), F^{−1}_{µ′}(U)).

Since (F^{−1}_µ(U), F^{−1}_{µ′}(U)) is an optimal coupling for (µ, µ′), we get W(µ, µ′) = E[d(ξ, ζ(µ, µ′, ξ, V))].

(2) Let us now consider the case of a general compact Polish space X. Denoting by ζ_R the function ζ from the case X ⊂ R, and considering an embedding φ ∈ L^0(X, R) as in Lemma C.2, let us define

ζ(µ, µ′, x, u) := φ^{−1}( ζ_R(φ⋆µ, φ⋆µ′, φ(x), u) ),

which is well posed by definition of ζ_R and Lemma C.2. Now fix ξ ∼ µ, U a uniform variable independent of ξ, and define ξ′ := ζ(µ, µ′, ξ, U). By definition of ζ, it is clear that ξ′ ∼ µ′, and

E[ |φ(ξ) − φ(ξ′)| ] = W(φ⋆µ, φ⋆µ′). (C.2)

Fix ε > 0. We are looking for η, δ > 0 such that

W(µ, µ′) < η ⇒ W(φ⋆µ, φ⋆µ′) < δ ⇔ E[|φ(ξ) − φ(ξ′)|] < δ ⇒ E[d(ξ, ξ′)] < ε.

Let us first show that there exists δ > 0 such that E[|φ(ξ) − φ(ξ′)|] < δ ⇒ E[d(ξ, ξ′)] < ε. Fix γ > 0 such that |y − y′| < γ ⇒ d(φ^{−1}(y), φ^{−1}(y′)) < ε/2, by uniform continuity of φ^{−1}. Denoting by ∆_X the diameter of X, we then have

E[d(ξ, ξ′)] ≤ E[ d(ξ, ξ′) 1_{|φ(ξ) − φ(ξ′)| < γ} ] + (∆_X/γ) E[ |φ(ξ) − φ(ξ′)| ] ≤ ε/2 + (∆_X/γ) E[ |φ(ξ) − φ(ξ′)| ],

so that we can choose δ = γε/(2∆_X). On the other hand, by uniform continuity of φ and by definition of the Wasserstein metric, there exists η > 0 such that W(µ, µ′) < η ⇒ W(φ⋆µ, φ⋆µ′) < δ. From (C.2), we thus conclude that W(µ, µ′) < η ⇒ E[d(ξ, ξ′)] < ε. ✷
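The one-dimensional coupling ζ of case (1) is directly implementable for discrete measures: given a point x in an atom of µ, the extra uniform u spreads the mass of that atom over the corresponding quantile interval, so that the output is distributed exactly as µ′. The following sketch illustrates this; the discrete measures and helper names are hypothetical examples.

```python
import numpy as np

# 1-d coupling zeta(mu, mu', x, u) = F_{mu'}^{-1}(F_mu(x) - u * DeltaF_mu(x))
# for discrete measures given as (support, weights).

def quantile(support, weights, q):
    # generalized inverse F^{-1}(q) = inf{ x : F(x) >= q }
    cdf = np.cumsum(weights)
    return support[np.searchsorted(cdf, q, side="left").clip(0, len(support) - 1)]

def zeta(supp_mu, w_mu, supp_nu, w_nu, x, u):
    i = np.searchsorted(supp_mu, x)   # locate the atom of mu at x
    F = np.cumsum(w_mu)[i]            # F_mu(x)
    jump = w_mu[i]                    # DeltaF_mu(x) = mass of the atom
    return quantile(supp_nu, w_nu, F - u * jump)

rng = np.random.default_rng(0)
supp_mu, w_mu = np.array([0.0, 1.0, 2.0]), np.array([0.2, 0.5, 0.3])
supp_nu, w_nu = np.array([0.0, 0.5, 1.5]), np.array([0.4, 0.4, 0.2])

xi = rng.choice(supp_mu, p=w_mu, size=100_000)     # xi ~ mu
out = np.array([zeta(supp_mu, w_mu, supp_nu, w_nu, x, rng.uniform()) for x in xi])
for s, w in zip(supp_nu, w_nu):                    # check: output law is mu'
    print(s, w, np.mean(out == s))
```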
References

[1] E. Bayraktar, A. Cosso, and H. Pham. Randomized dynamic programming principle and Feynman-Kac representation for optimal control of McKean-Vlasov dynamics. Transactions of the American Mathematical Society, 370:2115–2160, 2018.
[2] A. Bensoussan, J. Frehse, and P. Yam. Mean field games and mean field type control theory. Springer, 2013.
[3] D. P. Bertsekas. Dynamic programming and optimal control, Vol. II: approximate dynamic programming. Athena Scientific, Belmont, MA, 4th edition, 2012.
[4] E. Boissard and T. Le Gouic. On the mean speed of convergence of empirical and occupation measures in Wasserstein distance. Annales de l'Institut Henri Poincaré - Probabilités et Statistiques, 50(2):539–563, 2014.
[5] R. Carmona and F. Delarue. Probabilistic Theory of Mean Field Games with Applications, vol. I. Probability Theory and Stochastic Modelling. Springer, 2018.
[6] R. Carmona, M. Laurière, and Z. Tan. Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning. arXiv:1910.12802v1, 2019.
[7] M. F. Djete, D. Possamai, and X. Tan. McKean-Vlasov optimal control: the dynamic programming principle. arXiv:1907.08860, 2019.
[8] J. Fontbona, H. Guérin, and S. Méléard. Measurability of optimal transportation and strong coupling of martingale measures. Electronic Communications in Probability, 15:124–133, 2010.
[9] M. Fornasier, S. Lisini, C. Orrieri, and G. Savaré. Mean-field optimal control as Gamma-limit of finite agent controls. European Journal of Applied Mathematics, pages 1–34, 2018.
[10] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162:707–738, 2015.
[11] H. Gu, X. Guo, X. Wei, and R. Xu. Dynamic programming principles for learning MFCs. arXiv:1911.07314, 2019.
[12] O. Kallenberg. Foundations of Modern Probability. Probability and its Applications (New York). Springer-Verlag, New York, 2nd edition, 2002.
[13] D. Lacker. Limit theory for controlled McKean-Vlasov dynamics. SIAM Journal on Control and Optimization, 55(3):1641–1672, 2017.
[14] M. Laurière and O. Pironneau. Dynamic programming for mean-field type control. Journal of Optimization Theory and Applications, 169(3):902–924, 2016.
[15] H. Pham and X. Wei. Discrete time McKean-Vlasov control problem: a dynamic programming approach. Applied Mathematics and Optimization, 74(3):487–506, 2016.
[16] H. Pham and X. Wei. Dynamic programming for optimal control of stochastic McKean-Vlasov dynamics. SIAM Journal on Control and Optimization, 55:1069–1101, 2017.
[17] S. T. Rachev and L. Rüschendorf. Mass transportation problems. Springer-Verlag, 1998.
[18] R. S. Sutton and A. G. Barto. Reinforcement learning: an introduction. MIT Press, Cambridge, MA, 2nd edition, 2017.
[19] A. van der Vaart and J. A. Wellner. Weak convergence and empirical processes. Springer-Verlag, 1996.
[20] C. Villani. Optimal transport: old and new, volume 338 of Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, Berlin, 2009.