Approximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning
Kai Cui Heinz Koeppl
Technische Universität Darmstadt [email protected]
Technische Universität Darmstadt [email protected]
Abstract
The recent mean field game (MFG) formalism facilitates otherwise intractable computation of approximate Nash equilibria in many-agent settings. In this paper, we consider discrete-time finite MFGs subject to finite-horizon objectives. We show that all discrete-time finite MFGs with non-constant fixed point operators fail to be contractive as typically assumed in existing MFG literature, barring convergence via fixed point iteration. Instead, we incorporate entropy regularization and Boltzmann policies into the fixed point iteration. As a result, we obtain provable convergence to approximate fixed points where existing methods fail, and reach the original goal of approximate Nash equilibria. All proposed methods are evaluated with respect to their exploitability, on both instructive examples with tractable exact solutions and high-dimensional problems where exact methods become intractable. In high-dimensional scenarios, we apply established deep reinforcement learning methods and empirically combine fictitious play with our approximations.
The framework of mean field games (MFG) was introduced independently by the seminal works of Huang et al. (2006) and Lasry and Lions (2007) in the fully continuous setting of stochastic differential games. In the meantime, it has sparked great interest and investigation both in the mathematical community, where interests lie in the theoretical properties of MFGs, and in the applied research communities as a framework for solving and analyzing large-scale multi-agent problems. At its core lies the idea of reducing the classical, intractable multi-agent solution concept of Nash equilibria to the interaction between a representative agent and the 'mass' of infinitely many other agents – the so-called mean field. The solution to this limiting problem is the so-called mean field equilibrium (MFE), characterized by a forward evolution equation for the agents' state distribution and a backward optimality equation for the representative agent. Importantly, the MFE constitutes an approximate Nash equilibrium in the corresponding finite agent game of sufficiently many agents (Huang et al. (2006)), which would otherwise be intractable to compute (Daskalakis et al. (2009)). Nonetheless, computing an MFE remains difficult in the general case. Standard assumptions in existing literature are MFE uniqueness and operator contractivity (Huang et al. (2006), Anahtarcı et al. (2020), Guo et al. (2019)) to obtain convergence via simple fixed point iteration. While these assumptions hold true for some games, we address the case where such restrictive assumptions fail. Applications for such mean field models are manifold and include e.g. finance (Guéant et al. (2011)), power control (Kizilkale and Malhame (2016)), wireless communication (Aziz and Caines (2016)) or public health models (Laguzet and Turinici (2015)).
A motivating example.
Consider the following trivial situation informally: Let a large number of agents choose simultaneously between going left (L) or right (R). Afterwards, each agent shall be punished proportionally to the number of agents that chose the same action. If we had infinitely many independent, identically acting agents, the only stable solution would be to have all agents pick uniformly at random. The MFG formalism models this problem by picking one representative agent and abstracting all other agents into their state distribution. Unfortunately, analytically obtaining fixed points in general proves difficult and existing computational methods can fail.

Our contribution.
We begin by formulating the mean field analogue to finite games in game theory. In this setting we give simplified proofs for both existence and the approximate Nash equilibrium property of mean field equilibria. Moreover, we show that in finite MFGs, all non-constant fixed point operators are non-contractive, necessitating a different approach than naive fixed point iteration as in Anahtarcı et al. (2020). Consequently, we approximate the fixed point operator by introducing relative entropy regularization and Boltzmann policies. We prove guaranteed convergence for sufficiently high temperatures, while remaining arbitrarily exact for sufficiently low temperatures. Furthermore, repeatedly iterating on the prior policy allows us to perform an iterative descent on exploitability, successively improving the equilibrium approximation. Finally, our methods are extensively evaluated and compared to other methods such as fictitious play (FP, see Perrin et al. (2020)), which in general fail to converge to a fixed point. We outperform existing state-of-the-art methods in terms of exploitability in our problems, allowing us to find approximate mean field equilibria in the general case and paving the way to practical application of mean field games. In otherwise intractable problems, we apply deep reinforcement learning techniques together with particle-based simulations.
Consider a discrete-time $N$-agent stochastic game with finite agent state space $\mathcal{S}$ and finite agent action space $\mathcal{A}$, equipped with the discrete metric. Let $\mathcal{T} = \{0, 1, \ldots, T-1\}$ denote the time index set. Denote by $\mathcal{P}(\mathcal{X})$ the set of all Borel probability measures on a metric space $\mathcal{X}$. Since we work with finite spaces, we abuse notation and denote both a measure $\nu$ and its probability mass function by $\nu(\cdot)$. For each agent, the dynamical behavior is described by the state transition function $p \colon \mathcal{S} \times \mathcal{S} \times \mathcal{A} \times \mathcal{P}(\mathcal{S}) \to [0,1]$ and the initial state distribution $\mu_0 \colon \mathcal{S} \to [0,1]$. For agents $i = 1, \ldots, N$ at times $t \in \mathcal{T}$, their states $S_t^i$ and actions $A_t^i$ are random variables with values in $\mathcal{S}$ and $\mathcal{A}$ respectively. Let $G_{\mathbf{s}}^N \equiv \frac{1}{N} \sum_{i=1}^N \delta_{s_i}$ denote the empirical measure of agent states $\mathbf{s} = (s_1, \ldots, s_N) \in \mathcal{S}^N$, where $\delta$ is the Dirac measure. Consider for each agent $i$ a Markov policy $\pi^i = (\pi_t^i)_{t \in \mathcal{T}} \in \Pi$, where $\pi_t^i \colon \mathcal{A} \times \mathcal{S} \to [0,1]$ and $\Pi$ is the space of all Markov policies. The state evolution of agent $i$ begins with $S_0^i \sim \mu_0$ and subsequently for all applicable times $t$ follows
$$\mathbb{P}(A_t^i = a \mid S_t^i = s_i) \equiv \pi_t^i(a \mid s_i), \qquad \mathbb{P}(S_{t+1}^i = s_i' \mid \mathbf{S}_t = \mathbf{s}, A_t^i = a) \equiv p(s_i' \mid s_i, a, G_{\mathbf{s}}^N)$$
for arbitrary $s_i, s_i' \in \mathcal{S}$, $a \in \mathcal{A}$, $\mathbf{s} = (s_1, \ldots, s_N) \in \mathcal{S}^N$ and $\mathbf{S}_t = (S_t^1, \ldots, S_t^N)$. Finally, define agent $i$'s finite-horizon objective function
$$J_i^N(\pi^1, \ldots, \pi^N) \equiv \mathbb{E}\left[\sum_{t=0}^{T-1} r(S_t^i, A_t^i, G_{\mathbf{S}_t}^N)\right]$$
to be maximized, where $r \colon \mathcal{S} \times \mathcal{A} \times \mathcal{P}(\mathcal{S}) \to \mathbb{R}$ is the agent reward function. With this, we can give the notion of optimality used by Saldi et al. (2018).

Definition 1.
A Markov-Nash equilibrium is a $0$-Markov-Nash equilibrium. For $\varepsilon \geq 0$, an $\varepsilon$-Markov-Nash equilibrium (approximate Markov-Nash equilibrium) is defined as a tuple of policies $(\pi^1, \ldots, \pi^N) \in \Pi^N$ such that for any $i = 1, \ldots, N$, we have
$$J_i^N(\pi^1, \ldots, \pi^N) \geq \max_{\pi \in \Pi} J_i^N(\pi^1, \ldots, \pi^{i-1}, \pi, \pi^{i+1}, \ldots, \pi^N) - \varepsilon.$$
Since analyzing policies acting on joint state information or the state history is difficult, optimality has been restricted to the set of Markov policies $\Pi$ acting on the agent's own state. Although this may seem like a significant restriction, in the $N \to \infty$ limit, the evolution of all other agents – the mean field – becomes deterministic and therefore non-informative.

The $N \to \infty$ limit of the $N$-agent game constitutes its corresponding finite mean field game (i.e. with a finite state and action space). It consists of the same elements $\mathcal{T}, \mathcal{S}, \mathcal{A}, p, r, \mu_0$. However, instead of modeling $N$ separate agents, it models a single representative agent and collapses all other agents into their common state distribution, i.e. the mean field $\mu = (\mu_t)_{t \in \mathcal{T}} \in \mathcal{M}$ with $\mu_t \colon \mathcal{S} \to [0,1]$, where $\mathcal{M}$ is the space of all mean fields and $\mu_0$ is given. The deterministic mean field $\mu$ replaces the empirical measure of the finite game. Consider a Markov policy $\pi \in \Pi$ as before. For some fixed mean field $\mu$, the evolution of random states $S_t$ and actions $A_t$ begins with $S_0 \sim \mu_0$ and subsequently for all applicable times $t$ follows
$$\mathbb{P}(A_t = a \mid S_t = s) \equiv \pi_t(a \mid s), \qquad \mathbb{P}(S_{t+1} = s' \mid S_t = s, A_t = a) \equiv p(s' \mid s, a, \mu_t),$$
and the objective analogously becomes
$$J_\mu(\pi) \equiv \mathbb{E}\left[\sum_{t=0}^{T-1} r(S_t, A_t, \mu_t)\right].$$
The mean field $\mu$ induced by some fixed policy $\pi$ begins with the given $\mu_0$ and is defined recursively by
$$\mu_{t+1}(s') \equiv \sum_{s \in \mathcal{S}} \mu_t(s) \sum_{a \in \mathcal{A}} \pi_t(a \mid s)\, p(s' \mid s, a, \mu_t).$$
By fixing a mean field $\mu \in \mathcal{M}$, we obtain an induced Markov Decision Process (MDP) with time-dependent transition function $p(s' \mid s, a, \mu_t)$ and reward function $r(s, a, \mu_t)$. Denote the set-valued map from mean field to optimal policies $\pi$ of the induced MDP as $\hat{\Phi} \colon \mathcal{M} \to \Pi$ (such that $\pi \in \arg\max_{\pi'} \mathbb{E}_{\pi'}\big[\sum_{t=0}^{T-1} r(S_t, A_t, \mu_t) \,\big|\, S_0 = s\big]$ for all $s \in \mathcal{S}$). Analogously, define the map from a policy to its induced mean field as $\Psi \colon \Pi \to \mathcal{M}$. Finally, we can define the $N \to \infty$ analogue to Markov-Nash equilibria.

Definition 2.
A mean field equilibrium (MFE) is a pair $(\pi, \mu) \in \Pi \times \mathcal{M}$ such that $\pi \in \hat{\Phi}(\mu)$ and $\mu = \Psi(\pi)$ holds.

By defining any single-valued map $\Phi \colon \mathcal{M} \to \Pi$ to an optimal policy, we obtain a composition $\Gamma = \Psi \circ \Phi \colon \mathcal{M} \to \mathcal{M}$, henceforth the MFE operator. Shown by Saldi et al. (2018) for general Polish $\mathcal{S}$ and $\mathcal{A}$, the MFE exists and constitutes an approximate Markov-Nash equilibrium for sufficiently many agents under technical conditions. In the Appendix, we give simplified proofs for finite MFGs under the following standard assumption.

Assumption 1.
The functions $r(s, a, \mu_t)$ and $p(s' \mid s, a, \mu_t)$ are continuous, therefore bounded.

Note that we metrize probability measure spaces $\mathcal{P}(\mathcal{X})$ with the total variation distance $d_{TV}$. For probability measures $\nu, \nu'$ on finite spaces $\mathcal{X}$, $d_{TV}$ simplifies to
$$d_{TV}(\nu, \nu') = \frac{1}{2} \sum_{x \in \mathcal{X}} |\nu(x) - \nu'(x)|.$$
Accordingly, we equip $\Pi, \mathcal{M}$ with sup metrics, i.e. for policies $\pi, \pi' \in \Pi$ and mean fields $\mu, \mu' \in \mathcal{M}$ we define the metric spaces $(\Pi, d_\Pi)$ and $(\mathcal{M}, d_\mathcal{M})$ with
$$d_\Pi(\pi, \pi') \equiv \max_{t \in \mathcal{T}} \max_{s \in \mathcal{S}} d_{TV}(\pi_t(\cdot \mid s), \pi'_t(\cdot \mid s)), \qquad d_\mathcal{M}(\mu, \mu') \equiv \max_{t \in \mathcal{T}} d_{TV}(\mu_t, \mu'_t).$$

Proposition 1.
Under Assumption 1, there exists at least one MFE $(\pi^*, \mu^*) \in \Pi \times \mathcal{M}$.

Proof. See Appendix.
Theorem 1.
Under Assumption 1, if $(\pi^*, \mu^*)$ is an MFE, then for any $\varepsilon > 0$ there exists $N' \in \mathbb{N}$ such that for all $N > N'$, the policy $(\pi^*, \ldots, \pi^*)$ is an $\varepsilon$-Markov-Nash equilibrium in the $N$-agent game.

Proof. See Appendix.

Importantly, finding Nash equilibria in large-$N$ games is hard (Daskalakis et al. (2009)), whereas an MFE can be significantly more tractable to compute. Accordingly, solving the limiting MFG approximately solves the finite-$N$ game for large $N$ in a tractable manner.

Repeated application of the MFE operator constitutes the exact fixed point iteration approach to finding MFE. The standard assumption for convergence in the literature is contractivity and thereby MFE uniqueness (e.g. Caines and Huang (2019); Guo et al. (2019)).
Proposition 2.
Let $\Phi, \Psi$ be Lipschitz with constants $c_1, c_2$, fulfilling $c_1 c_2 < 1$. Then, the fixed point iteration $\mu^{n+1} = \Psi(\Phi(\mu^n))$ converges to the mean field of the unique MFE for any initial $\mu^0 \in \mathcal{M}$.

Proof. Let $\mu, \mu' \in \mathcal{M}$ be arbitrary, then
$$d_\mathcal{M}(\Gamma(\mu), \Gamma(\mu')) = d_\mathcal{M}(\Psi(\Phi(\mu)), \Psi(\Phi(\mu'))) \leq c_2 \cdot d_\Pi(\Phi(\mu), \Phi(\mu')) \leq c_1 \cdot c_2 \cdot d_\mathcal{M}(\mu, \mu').$$
Since $\mu, \mu'$ are arbitrary, $\Gamma$ is Lipschitz with constant $c_1 \cdot c_2 < 1$. $(\Pi, d_\Pi)$ and $(\mathcal{M}, d_\mathcal{M})$ are complete metric spaces (see Appendix). Therefore, Banach's fixed point theorem implies convergence to the unique fixed point for any starting $\mu^0 \in \mathcal{M}$.

Unfortunately, it remains unclear how to proceed if multiple optimal policies of an induced MDP exist, or if contractivity fails, e.g. when multiple MFE exist. In the following, consider again the illuminating example from the introduction. Consider $\mathcal{S} = \{C, L, R\}$, $\mathcal{A} = \mathcal{S} \setminus \{C\}$, $\mu_0(C) = 1$, $r(s, a, \mu_t) = -\mathbf{1}_{\{L\}}(s) \cdot \mu_t(L) - \mathbf{1}_{\{R\}}(s) \cdot \mu_t(R)$ and $\mathcal{T} = \{0, 1\}$. The transition function allows picking the next state directly, i.e. for all $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$,
$$\mathbb{P}(S_{t+1} = s' \mid S_t = s, A_t = a) = \mathbf{1}_{\{s'\}}(a).$$
Clearly, any MFE $(\pi^*, \mu^*)$ must fulfill $\pi_0^*(L \mid C) = \pi_0^*(R \mid C) = 1/2$, while $\pi_1^*$ can be arbitrary. Even if the operator $\Phi$ chooses suitable optimal policies, the fixed point operator $\Gamma$ remains non-contractive, as the mean field will necessarily alternate between left and right for any non-uniform starting mean field.

We observe that the example has infinitely many MFE, but no deterministic MFE, i.e. an MFE such that for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$ either $\pi_t(a \mid s) = 0$ or $\pi_t(a \mid s) = 1$ holds, similar to the classical game-theoretical insight of mixed Nash equilibrium existence (cf. Fudenberg and Tirole (1991)). Therefore, choosing optimal, deterministic policies will typically fail.

Most existing work assumes contractivity, which is too restrictive. In many scenarios, agents need to "coordinate" with each other. For example, a herd of hunting animals may collectively choose one of multiple hunting grounds, allowing for multiple MFEs. Hence, it can be difficult to apply existing MFG methodologies in practice, as many problems automatically fail contractivity.

From the previous example, we may be led to believe that non-contractivity is a general property of finite MFGs. And indeed, regardless of the number of MFEs, it turns out that in any finite MFG with non-constant MFE operator, a policy selection operator $\Phi$ with finite image $\Pi_\Phi$ will lead to non-contractivity. Note that this includes both the conventional $\arg\max$ and the $\arg\max$-e (cf. Guo et al. (2019)) choice of actions.

Theorem 2.
Let the image of $\Phi$ be a finite set $\Pi_\Phi \subseteq \Pi$. Then, either $\Gamma = \Psi \circ \Phi$ is constant, or $\Gamma$ is not Lipschitz continuous and thus not a contraction.

Proof. See Appendix.

Therefore, typical discrete-time finite MFGs have non-contractive fixed point operators and we must change our approach. Note that although non-contractivity does not imply non-convergence, the trivial example from before strongly suggests that non-convergence is the case for many finite MFGs.
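To make the non-convergence concrete, the following minimal sketch (our own illustration, not code from the paper) runs exact fixed point iteration on the left/right toy example from the introduction; the induced mean field at time 1 oscillates between all-left and all-right instead of settling at the uniform MFE.

```python
import numpy as np

# Toy left/right game: states C=0, L=1, R=2; actions go-L=0, go-R=1.
T, S, A = 2, 3, 2

def reward(s, mu_t):                     # punish choosing the crowded side
    return -mu_t[1] if s == 1 else (-mu_t[2] if s == 2 else 0.0)

def transition(a):                       # next state is determined by the action
    return 1 if a == 0 else 2

def greedy_policy(mu):                   # Phi(mu): backward induction, first argmax
    Q = np.zeros((T, S, A))
    for t in reversed(range(T)):
        for s in range(S):
            for a in range(A):
                cont = Q[t + 1, transition(a)].max() if t + 1 < T else 0.0
                Q[t, s, a] = reward(s, mu[t]) + cont
    pi = np.zeros((T, S, A))
    for t in range(T):
        for s in range(S):
            pi[t, s, Q[t, s].argmax()] = 1.0   # deterministic tie-break
    return pi

def induced_mean_field(pi):              # Psi(pi): forward propagation of mu_t
    mu = np.zeros((T, S))
    mu[0] = [1.0, 0.0, 0.0]
    for t in range(T - 1):
        for s in range(S):
            for a in range(A):
                mu[t + 1, transition(a)] += mu[t, s] * pi[t, s, a]
    return mu

pi = np.full((T, S, A), 1.0 / A)         # start from the uniform policy
for k in range(6):
    mu = induced_mean_field(pi)
    pi = greedy_policy(mu)
    print(k, np.round(mu[1], 2))         # mass alternates between L and R
```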
Exact fixed point iteration fails to solve most finite MFGs. Therefore, a different solution approach is necessary. In the following, we present two related approaches that guarantee convergence while plausibly remaining approximate Nash equilibria in the finite-$N$ case. For our results, we require a stronger Lipschitz assumption that implies Assumption 1.

Assumption 2.
The functions $r(s, a, \mu_t)$ and $p(s' \mid s, a, \mu_t)$ are Lipschitz continuous, therefore bounded.

A straightforward idea is regularization by replacing the objective by the well-known (see e.g. Abdolmaleki et al. (2018)) relative entropy objective
$$\tilde{J}_\mu(\pi) \equiv \mathbb{E}\left[\sum_{t=0}^{T-1} r(S_t, A_t, \mu_t) - \eta \log \frac{\pi_t(A_t \mid S_t)}{q_t(A_t \mid S_t)}\right]$$
with temperature $\eta > 0$ and positive prior policy $q \in \Pi$, i.e. $q_t(a \mid s) > 0$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$. Shown in the Appendix, the unique optimal policy $\tilde{\pi}^{\mu,\eta}_t$ fulfills
$$\tilde{\pi}^{\mu,\eta}_t(a \mid s) = \frac{q_t(a \mid s) \exp\left(\frac{\tilde{Q}^\eta(\mu, t, s, a)}{\eta}\right)}{\sum_{a' \in \mathcal{A}} q_t(a' \mid s) \exp\left(\frac{\tilde{Q}^\eta(\mu, t, s, a')}{\eta}\right)}$$
for the MDP induced by fixed $\mu \in \mathcal{M}$, with the soft action-value function $\tilde{Q}^\eta(\mu, t, s, a)$ given by the smooth-maximum Bellman recursion
$$\tilde{Q}^\eta(\mu, t, s, a) = r(s, a, \mu_t) + \sum_{s' \in \mathcal{S}} p(s' \mid s, a, \mu_t) \cdot \eta \log\left(\sum_{a' \in \mathcal{A}} q_{t+1}(a' \mid s') \exp \frac{\tilde{Q}^\eta(\mu, t+1, s', a')}{\eta}\right)$$
of the MDP induced by fixed $\mu \in \mathcal{M}$, with terminal condition $\tilde{Q}^\eta(\mu, T-1, s, a) \equiv r(s, a, \mu_{T-1})$. Note that we recover optimality as $\eta \to 0$, see Theorem 4. Define the relative entropy MFE operator $\tilde{\Gamma}^\eta \equiv \Psi \circ \tilde{\Phi}^\eta$ with policy selection $\tilde{\Phi}^\eta(\mu) \equiv \tilde{\pi}^{\mu,\eta}$ for all $\mu \in \mathcal{M}$.

Definition 3. An $\eta$-relative entropy mean field equilibrium ($\eta$-RelEnt MFE) for some positive prior policy $q \in \Pi$ is a pair $(\pi^E, \mu^E) \in \Pi \times \mathcal{M}$ such that $\pi^E = \tilde{\Phi}^\eta(\mu^E)$ and $\mu^E = \Psi(\pi^E)$ hold. An $\eta$-maximum entropy mean field equilibrium ($\eta$-MaxEnt MFE) is an $\eta$-RelEnt MFE with uniform prior policy $q$.

RelEnt MFE are guaranteed to exist for any $\eta > 0$ by Proposition 3. Furthermore, convergence to the regularized solution is guaranteed for large $\eta$ by Theorem 3.

Since only deterministic policies fail, a derivative approach is to use softmax policies directly with the unregularized action-value function, also called Boltzmann policies. Assume that the action-value function $Q^*$ fulfilling the Bellman equation
$$Q^*(\mu, t, s, a) = r(s, a, \mu_t) + \sum_{s' \in \mathcal{S}} p(s' \mid s, a, \mu_t) \cdot \max_{a' \in \mathcal{A}} Q^*(\mu, t+1, s', a')$$
of the MDP induced by fixed $\mu \in \mathcal{M}$ with terminal condition $Q^*(\mu, T-1, s, a) \equiv r(s, a, \mu_{T-1})$ is known. Define the map $\Phi^\eta(\mu) \equiv \pi^{\mu,\eta}$ for any $\mu \in \mathcal{M}$, where
$$\pi^{\mu,\eta}_t(a \mid s) \equiv \frac{q_t(a \mid s) \exp\left(\frac{Q^*(\mu, t, s, a)}{\eta}\right)}{\sum_{a' \in \mathcal{A}} q_t(a' \mid s) \exp\left(\frac{Q^*(\mu, t, s, a')}{\eta}\right)}$$
for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$ and temperature $\eta > 0$.

Definition 4. An $\eta$-Boltzmann mean field equilibrium ($\eta$-Boltzmann MFE) for some positive prior policy $q \in \Pi$ is a pair $(\pi^B, \mu^B) \in \Pi \times \mathcal{M}$ such that $\pi^B = \Phi^\eta(\mu^B)$ and $\mu^B = \Psi(\pi^B)$ hold.

Both $\eta$-RelEnt MFE and $\eta$-Boltzmann MFE are guaranteed to exist for any temperature $\eta > 0$.

Proposition 3.
Under Assumption 1, $\eta$-Boltzmann and $\eta$-RelEnt MFE exist for any temperature $\eta > 0$.

Proof. See Appendix.

Contractivity of both the $\eta$-Boltzmann MFE operator $\Gamma^\eta \equiv \Psi \circ \Phi^\eta$ and the $\eta$-RelEnt MFE operator $\tilde{\Gamma}^\eta \equiv \Psi \circ \tilde{\Phi}^\eta$ is guaranteed for sufficiently high temperatures, even if all possible original $\Phi$ are not Lipschitz continuous.

Theorem 3.
Under Assumption 2, $\mu \mapsto Q^*(\mu, t, s, a)$, $\mu \mapsto \tilde{Q}^\eta(\mu, t, s, a)$ and $\Psi(\pi)$ are Lipschitz continuous with constants $K_{Q^*}$, $K_{\tilde{Q}}$ and $K_\Psi$ for arbitrary $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$, $\eta > \eta'$, $\eta' > 0$. Furthermore, $\Gamma^\eta$ and $\tilde{\Gamma}^\eta$ are a contraction for
$$\eta > \max\left(\eta', \; |\mathcal{A}|\,(|\mathcal{A}| - 1)\, K_Q K_\Psi \, \frac{q_{\max}}{q_{\min}}\right),$$
where $K_Q = K_{Q^*}$ for $\Gamma^\eta$, $K_Q = K_{\tilde{Q}}$ for $\tilde{\Gamma}^\eta$, $q_{\max} \equiv \max_{t \in \mathcal{T}, s \in \mathcal{S}, a \in \mathcal{A}} q_t(a \mid s) > 0$ and $q_{\min} \equiv \min_{t \in \mathcal{T}, s \in \mathcal{S}, a \in \mathcal{A}} q_t(a \mid s) > 0$.

Proof. See Appendix.

Sufficiently large $\eta$ hence implies convergence via fixed point iteration. On the other hand, for sufficiently low temperatures $\eta$, both $\eta$-Boltzmann and $\eta$-RelEnt MFE will also constitute an approximate Markov-Nash equilibrium of the finite-$N$ game.

Theorem 4.
Under Assumption 2, if $(\pi^*_n, \mu^*_n)_{n \in \mathbb{N}}$ is a sequence of $\eta_n$-Boltzmann or $\eta_n$-RelEnt MFE with $\eta_n \to 0$, then for any $\varepsilon > 0$ there exist $n', N' \in \mathbb{N}$ such that for all $n > n'$, $N > N'$, the policy $(\pi^*_n, \ldots, \pi^*_n) \in \Pi^N$ is an $\varepsilon$-Markov-Nash equilibrium of the $N$-agent game, i.e.
$$J_i^N(\pi^*_n, \ldots, \pi^*_n) \geq \max_{\pi^i \in \Pi} J_i^N(\pi^*_n, \ldots, \pi^*_n, \pi^i, \pi^*_n, \ldots, \pi^*_n) - \varepsilon.$$

Proof.
See Appendix.

If we can obtain contractivity for sufficiently low $\eta$, we can find good approximate Markov-Nash equilibria. As it is impossible to have both $\eta \to 0$ and $\eta \to \infty$, it depends on the problem and prior whether we can converge to a good solution. Nonetheless, we find that it is often possible to empirically find a low $\eta$ that provides convergence as well as a good approximate MFE.

In principle, we can insert arbitrary prior policies $q \in \Pi$. Under Assumption 1, by boundedness of both $\tilde{Q}^\eta$ and $Q^*$ (see Appendix), both $\eta$-RelEnt and $\eta$-Boltzmann MFE policies converge to the prior policy as $\eta \to \infty$. Therefore, in principle we can show that for any $\varepsilon > 0$, for sufficiently large $\eta$ and $N$, the $\eta$-RelEnt and $\eta$-Boltzmann MFE under $q$ will be at most an $\varepsilon$-worse approximate Nash equilibrium than the prior policy. Furthermore, we obtain guaranteed contractivity by Theorem 3. Thus, any prior policy gives a worst-case bound on the performance achievable over all $\eta > 0$. On the other hand, if we obtain better results for sufficiently low $\eta$, we may iteratively improve our policy and thus our equilibrium quality.

The original work of Huang et al. (2006) introduces contractivity and uniqueness assumptions into the continuous MFG setting. Analogously, Guo et al. (2019) and Caines and Huang (2019) assume contractivity for discrete-time MFGs and dense graph limit MFGs respectively. Further existing work on discrete-time MFGs similarly assumes uniqueness of the MFE, which includes Saldi et al. (2018) and Gomes et al. (2010) for approximate optimality and existence results, and Anahtarcı et al. (2020) for an analysis on contractivity requirements. Mguni et al. (2018) solve discrete-time continuous state MFG problems under the classical uniqueness conditions of Lasry and Lions (2007). Further extensions of the MFG formalism include partial observability (Saldi et al. (2019)) or major agents (Nourian and Caines (2013)).

The work of Anahtarci et al. (2020) is related and studies theoretical properties of finite-$N$ regularized games and their limiting MFG. In their work, the existence and approximate Nash property of MFE in stationary regularized games is shown, and Q-Learning error propagation is investigated. In comparison, we consider the original, unregularized finite-$N$ game in a transient setting and perform extensive empirical evaluations. Guo et al. (2019) and Yang et al. (2018) previously proposed to apply Boltzmann policies. The former applies the approximation heuristically, while the latter focuses on directly solving finite-$N$ games.

An orthogonal approach to computing MFE is fictitious play. Rooted in game theory and classical economic works (Brown (1951)), it has since been adapted to MFGs. In fictitious play, all past mean fields (Cardaliaguet and Hadikhanloo (2017)) and policies (Perrin et al. (2020)) are averaged to produce a new mean field or policy. Importantly, convergence is guaranteed in certain special cases only (cf. Elie et al. (2019)). Although introduced in a differentiable setting, we evaluate fictitious play empirically in our setting and find that both our regularization and fictitious play may be combined successfully.
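Both policy selections above reduce to a prior-weighted softmax over (soft) action values. A minimal sketch of this map in our own notation (not the paper's code), assuming a precomputed value table Q of shape (T, |S|, |A|) and a positive prior q of the same shape:

```python
import numpy as np

def boltzmann_policy(Q, q, eta):
    """Prior-weighted softmax over action values (the map Phi^eta), assuming
    arrays Q and q of shape (T, |S|, |A|) with q > 0 and temperature eta > 0."""
    logits = np.log(q) + Q / eta
    logits -= logits.max(axis=-1, keepdims=True)   # stabilize the exponentials
    pi = np.exp(logits)
    return pi / pi.sum(axis=-1, keepdims=True)
```

As the temperature grows, the resulting policy approaches the prior q; as it shrinks towards zero, the policy approaches a greedy policy, matching the two limits discussed above.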
Figure 1: Mean exploitability over the final 10 iterations. Dashed lines represent maximum and minimum over the final 10 iterations. (a) LR, 10000 iterations; (b) RPS, 10000 iterations; (c) SIS, 10000 iterations. Maximum entropy (MaxEnt) results begin at higher temperatures due to limited floating point accuracy. Temperature zero depicts the exact fixed point iteration for both $\eta$-MaxEnt and $\eta$-Boltzmann MFE. In LR and RPS, $\eta$-MaxEnt and $\eta$-Boltzmann MFE coincide both with and without fictitious play (FP), here averaging both policy and mean field over all past iterations. The exploitability of the prior policy is indicated by the dashed horizontal line.

In practice, we find that our approaches are capable of generating solutions of lower exploitability than otherwise obtained. Unless stated otherwise, we compute everything exactly, use the maximum entropy objective (MaxEnt) with the uniform prior policy $q$ where $q_t(a \mid s) = 1/|\mathcal{A}|$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$, and initialize with $\mu^0 = \Psi(q)$ generated by $q$. As the main evaluation metric, we define the exploitability of a policy $\pi \in \Pi$ with induced mean field $\mu \equiv \Psi(\pi)$ as
$$\Delta J(\pi) \equiv \max_{\pi^*} J_\mu(\pi^*) - J_\mu(\pi).$$
Clearly, the exploitability of $\pi$ is zero if and only if $(\pi, \mu)$ is an MFE. Indeed, for any $\varepsilon > 0$, any policy $\pi \in \Pi$ is a $(\Delta J(\pi) + \varepsilon)$-Markov-Nash equilibrium if $N$ is sufficiently large, i.e. the exploitability translates directly to the limiting equilibrium quality in the finite-$N$ game, see also Theorem 4 and its proof.

We evaluate the algorithms on the LR, RPS, SIS and Taxi problems, ordered in increasing complexity. Details of the algorithms, hyperparameters, problems and experiment configurations as well as further experimental results can be found in the Appendix.

In Figure 1, we plot the minimum, maximum and mean exploitability for varying temperatures $\eta$ during the last 10 fixed point iterations, i.e. a single value when the exploitability (and usually the mean field) converges. Observe that the lowest convergent temperature outperforms not only the exact fixed point iteration (drawn at temperature zero), but also the uniform prior policy.

Although developed for a different setting, we also show results of fictitious play similar to the version from Perrin et al. (2020), i.e. both policies and mean fields are averaged over all past iterations. It can be seen that fictitious play only converges to the optimal solution in the LR problem. In the other examples, supplementing fictitious play with entropy regularization is effective at producing better results. A further fictitious play variant not found in existing literature, averaging only the policies, finds the exact MFE in RPS, but nevertheless fails in SIS. See the Appendix for further results.

Evaluating and solving finite-$N$ games is highly intractable by the curse of dimensionality, as the local state is no longer sufficient to perform dynamic programming in the presence of the random empirical state measure. Since it has already been proven that the exploitability for $N \to \infty$ will converge to the exploitability of the corresponding mean field game, we refrain from evaluating on finite-$N$ games.

Note that the plots are entirely deterministic and not stochastic as it would seem at first glance, since the depicted shaded area visualizes the non-convergence of exploitability and is a result of the fixed point updates running into a limit cycle (cf. Figure 2).
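In the exact (tabular) case, evaluating $\Delta J(\pi)$ amounts to one backward induction in the MDP induced by $\Psi(\pi)$. A minimal sketch under our own naming, assuming tabular arrays r[t, s, a] and p[t, s, a, s'] already evaluated at the induced mean field:

```python
import numpy as np

def exploitability(pi, mu0, r, p):
    """Delta J(pi) = max_pi* J_mu(pi*) - J_mu(pi), where r[t, s, a] and
    p[t, s, a, s'] are already evaluated at the mean field mu = Psi(pi)."""
    T, S, A = r.shape
    V_opt = np.zeros(S)                  # optimal value in the induced MDP
    V_pi = np.zeros(S)                   # value of the evaluated policy
    for t in reversed(range(T)):
        Q_opt = r[t] + p[t] @ V_opt      # shape (S, A), contraction over s'
        Q_pi = r[t] + p[t] @ V_pi
        V_opt = Q_opt.max(axis=-1)
        V_pi = (pi[t] * Q_pi).sum(axis=-1)
    return float(mu0 @ (V_opt - V_pi))
```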
In Figure 2, the difference between the exploitability of the current policy and the minimal exploitability reached during the final 10 iterations is shown for $\eta$-Boltzmann MFE. As the temperature $\eta$ decreases, time to convergence increases until non-convergence is reached in the form of a limit cycle. Analogous results for $\eta$-RelEnt MFE can be found in the Appendix.
Figure 2: (a) Difference between current and final minimum exploitability over the last 10 iterations; (b) Distance between current and final mean field. Plotted for the $\eta$-Boltzmann MFE iterations in SIS for different indicated temperature settings. Note the periodicity of the lowest temperature setting, indicating a limit cycle.

Note also that in LR, we can analytically find $K_Q = 1$ and $K_\Psi = 1$. Thus, we obtain guaranteed convergence via $\eta$-Boltzmann MFE iteration once $\eta$ exceeds the resulting threshold of Theorem 3, while in Figure 1 we see convergence already at considerably lower temperatures. Note further that the non-converged regime can allow for lower exploitability. However, it is unclear a priori when to stop, and for approximate solutions where DQN is used for evaluation, the evaluation of exploitability may become inaccurate.

For problems with intractably large state spaces, we adopt the DQN algorithm (Mnih et al. (2013)), using the implementation of Shengyi et al. (2020) as a base. Particle-based simulations are used for the mean field, and stochastic performance evaluation on the induced MDP is performed (see Appendix). Note that the approximation introduces three sources of stochasticity into the otherwise deterministic algorithms, i.e. stochastic evaluation, mean field simulation and DQN. To counteract the randomness, we average our results over multiple runs. The hyperparameters and architectures used are standard and can be found in the Appendix.

Fitting the soft action-value function directly using a network is numerically problematic, as the log-exponential transformation of approximated action values quickly fails due to limited floating point accuracy. Thus, we limit ourselves to the classical Bellman equation with Boltzmann policies only.

In Figure 3, we evaluate the exploitability of Boltzmann DQN iteration, evaluated exactly in SIS and RPS, and stochastically in Taxi over 2000 realizations. Minimum, maximum and mean exploitability are taken over the final 5 iterations and averaged over 5 seeds. Note that it is very time-consuming to solve a full reinforcement learning problem using DQN repeatedly in every iteration. Nonetheless, we observe that a temperature larger than zero appears to improve exploitability and convergence in the SIS example. Both due to the noisy nature of approximate solutions and the lower number of iterations, it can be seen that a higher temperature is required to converge than in the exact case.

In the intractable Taxi environment, the policy oscillates between two modes as in exact LR, and regularization fails to obtain better results, see also the Appendix. An important reason is that the prior policy performs extremely badly, as most states require specific actions for optimality. Hence we cannot find an $\eta > 0$ for which the algorithm both converges and performs well. Using prior descent and iteratively refining a better prior policy would likely increase performance, but is deferred to future investigations as the required computations grow very large.

Fictitious play is expensive in combination with approximate Q-Learning and particle simulations, as policies and particles of past iterations must be kept to perform exact fictitious play. For this reason, we do not attempt approximate fictitious play with approximate solution methods. In theory, supervised learning for fitting summarizing policies and randomly sampling particles may help, but is out of scope of this paper.
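The particle-based approximation of the induced mean field (Algorithm 5 in the Appendix) can be sketched as follows. This is our own minimal illustration, assuming user-supplied callables sample_initial_state(), sample_action(t, s) for the policy and sample_next_state(s, a, G_t) for the dynamics:

```python
import numpy as np

def simulate_mean_field(sample_initial_state, sample_action, sample_next_state,
                        T, num_states, num_particles, num_mean_fields):
    """Monte Carlo estimate of Psi(pi): propagate particles under the policy and
    average the resulting empirical state distributions over several runs."""
    mu = np.zeros((T, num_states))
    for _ in range(num_mean_fields):
        states = np.array([sample_initial_state() for _ in range(num_particles)])
        for t in range(T):
            G_t = np.bincount(states, minlength=num_states) / num_particles
            mu[t] += G_t / num_mean_fields
            states = np.array([sample_next_state(s, sample_action(t, s), G_t)
                               for s in states])
    return mu
```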
In Figure 4, we repeatedly perform outer iterations consisting of 100 $\eta$-RelEnt MFE iterations each with the indicated fixed temperature parameters in SIS. After each outer iteration, the prior policy is updated to the newest resulting policy. Note again that the results are entirely deterministic.

Searching for a suitable $\eta$ dynamically every iteration would keep the exploitability from increasing, as for $\eta \to \infty$ we obtain the original prior policy. Since it is expensive to scan over all temperatures in each outer iteration, we use a heuristic. Intuitively, since the prior will become increasingly good, it will be increasingly difficult to obtain a better policy.
Figure 3: Mean exploitability over the final 5 iterations using DQN, averaged over 5 seeds. Dashed lines represent the averaged maximum and minimum exploitability over the last 5 iterations. (a) RPS, 1000 iterations; (b) SIS, 50 iterations; (c) Taxi, 15 iterations. Evaluation of exploitability is exact except in Taxi, which uses DQN and averages over 1000 episodes. The point of zero temperature depicts fixed point iteration using exact DQN policies.

Figure 4: Exploitability over outer iterations in SIS, using 100 $\eta$-RelEnt MFE iterations per outer iteration, for several fixed temperatures $\eta$ and temperature adjustment factors $c$. Note that the results are deterministic. Not shown: running the fixed temperature settings $c = 1$ for longer still does not converge.

Thus, increasing the temperature will help sticking close to the prior and converge. Consequently, we use the simple heuristic
$$\eta_{i+1} = \eta_i \cdot c$$
for each outer iteration $i$, where $c \geq 1$ adjusts the temperature after each outer iteration.

Importantly, even for our simple heuristic, prior descent already achieves a lower exploitability than the best mean exploitability obtained with the fixed uniform prior policy in Figure 1. Furthermore, repeated prior policy updates succeed in computing the exact MFE in RPS and LR under a fixed temperature (see Appendix). Note that prior descent creates a double loop around solving the optimal control problem, becoming highly expensive under deep reinforcement learning. Hence, we refrain from prior descent with DQN. Automatically adjusting temperatures to monotonically improve exploitability is left for potential future work.

In this work, we have investigated the necessity and feasibility of approximate MFG solution approaches – entropy regularization, Boltzmann policies and prior descent – in the context of finite MFGs. We have shown that the finite MFG case typically cannot be solved by exact fixed point iteration or fictitious play alone. Entropy regularization and Boltzmann policies in combination with deep reinforcement learning may enable feasible computation of approximate MFE. We believe that lifting the restriction of inherent contractivity is an important step in ensuring applicability of MFG models in practical problems. We hope that entropy regularization and the insight for finite MFGs can help transfer the MFG formalism from its so-far mostly theory-focused context into real-world application scenarios. Nonetheless, there still remain many restrictions to the applicability of the MFG formalism. For future work, an efficient, automatic temperature adjustment for prior descent could be fruitful. Furthermore, it would be interesting to generalize relative entropy MFGs to infinite horizon discounted problems, continuous time, and continuous state and action spaces. Moreover, it could be of interest to investigate theoretical properties of fictitious play in finite MFGs in combination with entropy regularization. For non-Lipschitz mappings from policy to induced mean field, the proposed approach does not provide a solution. It could nonetheless be important to consider problems with threshold-type dynamics and rewards, e.g. majority vote problems. Most notably, the current formalism precludes common noise entirely, i.e. any games with common observations.
In practice, many problems will allow for some type of common observation between agents, leading to non-independent agent distributions and stochastic as opposed to deterministic mean fields.
Acknowledgements
This work has been funded by the LOEWE research promotion initiative of the federal state of Hessen, Germany, within the program area KOM of the emergenCITY center. The authors acknowledge the Lichtenberg high performance computing cluster of the TU Darmstadt for providing computational facilities for the calculations of this research.
References
Minyi Huang, Roland P Malhamé, Peter E Caines, et al. Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221–252, 2006.

Jean-Michel Lasry and Pierre-Louis Lions. Mean field games. Japanese Journal of Mathematics, 2(1):229–260, 2007.

Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a Nash equilibrium. SIAM Journal on Computing, 39(1):195–259, 2009.

Berkay Anahtarcı, Can Deha Karıksız, and Naci Saldi. Value iteration algorithm for mean-field games. Systems & Control Letters, 143:104744, 2020.

Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. Learning mean-field games. In Advances in Neural Information Processing Systems, pages 4966–4976, 2019.

Olivier Guéant, Jean-Michel Lasry, and Pierre-Louis Lions. Mean field games and applications. In Paris-Princeton Lectures on Mathematical Finance 2010, pages 205–266. Springer, 2011.

Arman C Kizilkale and Roland P Malhame. Collective target tracking mean field control for Markovian jump-driven models of electric water heating loads. In Control of Complex Systems, pages 559–584. Elsevier, 2016.

Mohamad Aziz and Peter E Caines. A mean field game computational methodology for decentralized cellular network optimization. IEEE Transactions on Control Systems Technology, 25(2):563–576, 2016.

Laetitia Laguzet and Gabriel Turinici. Individual vaccination as Nash equilibrium in a SIR model with application to the 2009–2010 influenza A (H1N1) epidemic in France. Bulletin of Mathematical Biology, 77(10):1955–1984, 2015.

Sarah Perrin, Julien Perolat, Mathieu Laurière, Matthieu Geist, Romuald Elie, and Olivier Pietquin. Fictitious play for mean field games: Continuous time analysis and applications. arXiv preprint arXiv:2007.03458, 2020.

Naci Saldi, Tamer Basar, and Maxim Raginsky. Markov–Nash equilibria in mean-field games with discounted cost. SIAM Journal on Control and Optimization, 56(6):4256–4287, 2018.

Peter E Caines and Minyi Huang. Graphon mean field games and the GMFG equations: ε-Nash equilibria. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 286–292. IEEE, 2019.

Drew Fudenberg and Jean Tirole. Game Theory. MIT Press, 1991.

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations, 2018.

Diogo A Gomes, Joana Mohr, and Rafael Rigao Souza. Discrete time, finite state space mean field games. Journal de mathématiques pures et appliquées, 93(3):308–328, 2010.

David Mguni, Joel Jennings, and Enrique Munoz de Cote. Decentralised learning in systems with many, many strategic agents. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Naci Saldi, Tamer Başar, and Maxim Raginsky. Approximate Nash equilibria in partially observed stochastic games with mean-field interactions. Mathematics of Operations Research, 44(3):1006–1033, 2019.

Mojtaba Nourian and Peter E Caines. Epsilon-Nash mean field game theory for nonlinear stochastic dynamical systems with major and minor agents. SIAM Journal on Control and Optimization, 51(4):3302–3331, 2013.

Berkay Anahtarci, Can Deha Kariksiz, and Naci Saldi. Q-learning in regularized mean-field games. arXiv preprint arXiv:2003.12151, 2020.

Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5571–5580, 2018.

George W Brown. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374–376, 1951.

Pierre Cardaliaguet and Saeed Hadikhanloo. Learning in mean field games: the fictitious play. ESAIM: Control, Optimisation and Calculus of Variations, 23(2):569–591, 2017.

Romuald Elie, Julien Pérolat, Mathieu Laurière, Matthieu Geist, and Olivier Pietquin. Approximate fictitious play for mean field games. arXiv preprint arXiv:1907.02633, 2019.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Huang Shengyi, Dossa Rousslan, and Chang Ye. CleanRL: High-quality single-file implementation of deep reinforcement learning algorithms. https://github.com/vwxyzjn/cleanrl/, 2020.

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995–2003, 2016.

Lloyd Shapley. Some topics in two-person games. Advances in Game Theory, 52:1–29, 1964.

Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, pages 1352–1361, 2017.

Boris Belousov and Jan Peters. Entropic regularization of Markov decision processes. Entropy, 21(7):674, 2019.
A Experimental Details
A.1 Algorithms

Algorithm 1: Exact fixed point iteration
  Initialize $\mu^0 = \Psi(q)$ as the mean field induced by the uniformly random policy $q$.
  for $k = 0, 1, \ldots$ do
    Compute the Q-function $Q^*(\mu^k, t, s, a)$ for fixed $\mu^k$.
    Choose $\pi^k \in \Pi$ such that $\pi^k_t(a \mid s) > 0 \Rightarrow a \in \arg\max_{a' \in \mathcal{A}} Q^*(\mu^k, t, s, a')$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$, by putting all probability mass on the first optimal action, or evenly on all optimal actions.
    Optionally: Overwrite $\pi^k \leftarrow \frac{1}{k+1} \pi^k + \frac{k}{k+1} \pi^{k-1}$. (FP averaged policy)
    Compute the mean field $\mu^{k+1} = \Psi(\pi^k)$ induced by $\pi^k$.
    Optionally: Overwrite $\mu^{k+1} \leftarrow \frac{1}{k+1} \mu^{k+1} + \frac{k}{k+1} \mu^k$. (FP averaged mean field)
  end for

Algorithm 2: Boltzmann / RelEnt iteration
  Input: Temperature $\eta > 0$, prior policy $q \in \Pi$.
  Initialize $\mu^0 = \Psi(q)$ as the mean field induced by $q$.
  for $k = 0, 1, \ldots$ do
    Compute the Q-function (Boltzmann) or soft Q-function (RelEnt) $Q(\mu^k, t, s, a)$ for fixed $\mu^k$.
    Define $\pi^k$ by $\pi^k_t(a \mid s) = \frac{q_t(a \mid s) \exp\left(Q(\mu^k, t, s, a)/\eta\right)}{\sum_{a' \in \mathcal{A}} q_t(a' \mid s) \exp\left(Q(\mu^k, t, s, a')/\eta\right)}$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.
    Optionally: Overwrite $\pi^k \leftarrow \frac{1}{k+1} \pi^k + \frac{k}{k+1} \pi^{k-1}$. (FP averaged policy)
    Compute the mean field $\mu^{k+1} = \Psi(\pi^k)$ induced by $\pi^k$.
    Optionally: Overwrite $\mu^{k+1} \leftarrow \frac{1}{k+1} \mu^{k+1} + \frac{k}{k+1} \mu^k$. (FP averaged mean field)
  end for

Algorithm 3: Boltzmann DQN iteration
  Input: Temperature $\eta > 0$, prior policy $q \in \Pi$.
  Input: Simulation parameters, DQN hyperparameters.
  Initialize $\mu^0 \approx \Psi(q)$ as the mean field induced by $q$ using Algorithm 5.
  for $k = 0, 1, \ldots$ do
    Approximate the Q-function $Q^*(\mu^k, t, s, a)$ using Algorithm 4 on the MDP induced by $\mu^k$.
    Define $\pi^k$ by $\pi^k_t(a \mid s) = \frac{q_t(a \mid s) \exp\left(Q^*(\mu^k, t, s, a)/\eta\right)}{\sum_{a' \in \mathcal{A}} q_t(a' \mid s) \exp\left(Q^*(\mu^k, t, s, a')/\eta\right)}$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.
    Approximately simulate the mean field $\mu^{k+1} \approx \Psi(\pi^k)$ induced by $\pi^k$ using Algorithm 5.
  end for

A.2 Implementation details

For all the DQN experiments, we use the configurations given in Table 1 and hyperparameters given in Table 2. Note that we add epsilon scheduling and a discount factor to DQN for stability reasons, i.e. the loss term has an additional factor smaller than one before the maximum operation, cf. Mnih et al. (2013). For the action-value network, we use a fully connected dueling architecture (Wang et al. (2016)) with one shared hidden layer of 256 neurons, and one separate hidden layer of 256 neurons for the value and advantage stream each. As the activation function, we use ReLU. Further, we use gradient norm clipping and the ADAM optimizer. To allow for time-dependent policies, we append the current time to the observations.

We transform all discrete-valued observations except time to corresponding one-hot vectors, except in the intractably large Taxi environment where we simply observe one value in $\{0, 1\}$ for each tile's passenger status. For evaluation of exploitability, we compare the values of the optimal policy and the evaluated policy in the MDP induced by the mean field generated by the evaluated policy. In intractable cases, we use DQN to approximately obtain the optimal policy. In this case, we obtain the values by averaging over many episodes in the MDP induced by the mean field generated by the evaluated policy via Algorithm 5.

Algorithm 4: DQN
  Input: Number of epochs $L$, mini-batch size $N$, target update frequency $M$, replay buffer size $D$.
  Input: Probability of random action $\epsilon$, discount factor $\gamma$, ADAM and gradient clipping parameters.
  Initialize network $Q_\theta$, target network $Q_{\theta'} \leftarrow Q_\theta$ and replay buffer $\mathcal{D}$ of size $D$.
  for $L$ epochs do
    for $t = 1, \ldots, T$ do
      (One environment step)
      Let new action $a_t \leftarrow \arg\max_{a \in \mathcal{A}} Q_\theta(t, s_t, a)$, or with probability $\epsilon$ sample uniformly at random instead.
      Sample new state $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.
      Add transition tuple $(s_t, a_t, r(s_t, a_t), s_{t+1})$ to replay buffer $\mathcal{D}$.
      (One mini-batch descent step)
      Sample from the replay buffer: $\{(s^i_t, a^i_t, r^i_t, s^i_{t+1})\}_{i=1,\ldots,N} \sim \mathcal{D}$.
      Compute the loss $J_Q = \sum_{i=1}^N \left(r^i_t + \gamma \max_{a' \in \mathcal{A}} Q_{\theta'}(t+1, s^i_{t+1}, a') - Q_\theta(t, s^i_t, a^i_t)\right)^2$.
      Update $\theta$ according to $\nabla_\theta J_Q$ using ADAM with gradient norm clipping.
      if number of steps mod $M = 0$ then
        Update target network $\theta' \leftarrow \theta$.
      end if
    end for
  end for

Algorithm 5: Stochastic mean field simulation
  Input: Number of mean fields $K$, number of particles $M$, policy $\pi$.
  for $k = 1, \ldots, K$ do
    Initialize particles $x^0_m \sim \mu_0$ for all $m = 1, \ldots, M$.
    for $t \in \mathcal{T}$ do
      Define the empirical measure $G^k_t \leftarrow \frac{1}{M} \sum_{m=1}^{M} \delta_{x^t_m}$.
      for $m = 1, \ldots, M$ do
        Sample action $a \sim \pi_t(\cdot \mid x^t_m)$.
        Sample new particle state $x^{t+1}_m \sim p(\cdot \mid x^t_m, a, G^k_t)$.
      end for
    end for
  end for
  return the average empirical mean field $\left(\frac{1}{K} \sum_{k=1}^K G^k_t\right)_{t \in \mathcal{T}}$

A.3 Problems
Summarizing properties of the considered problems are given in Table 3.
LR.
Similar to the example mentioned in the main text, we let a large number of agents choose simultaneously between going left (L) or right (R). Afterwards, each agent shall be punished proportionally to the number of agents that chose the same action, but more so for choosing right than left.

More formally, let $\mathcal{S} = \{C, L, R\}$, $\mathcal{A} = \mathcal{S} \setminus \{C\}$, $\mu_0(C) = 1$, $r(s, a, \mu_t) = -\mathbf{1}_{\{L\}}(s) \cdot \mu_t(L) - c \cdot \mathbf{1}_{\{R\}}(s) \cdot \mu_t(R)$ with a constant $c > 1$, and $\mathcal{T} = \{0, 1\}$. Note the difference to the toy example in the main text: right is punished more than left. The transition function allows picking the next state directly, i.e. for all $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$,
$$\mathbb{P}(S_{t+1} = s' \mid S_t = s, A_t = a) = \mathbf{1}_{\{s'\}}(a).$$
For this example, we have $K_Q = 1$ since the return $Q$ of the initial state changes linearly with $\mu$ and is bounded, while the distance between two mean fields is also bounded by 1. Analogously, $K_\Psi = 1$ since $\Psi(\pi)$ similarly changes linearly with $\pi$, and both can change at most by 1. Thus, we obtain guaranteed convergence via Boltzmann iteration once $\eta$ exceeds the threshold of Theorem 3. In numerical evaluations, we see convergence already at lower temperatures.

Algorithm 6: Prior descent
  Input: Number of outer iterations $I$.
  Input: Initial prior policy $q \in \Pi$.
  for outer iteration $i = 1, \ldots, I$ do
    Find $\eta$ heuristically or minimally such that Algorithm 2 with temperature $\eta$ and prior $q$ converges.
    if no such $\eta$ exists then
      return $q$
    end if
    $q \leftarrow$ solution of Algorithm 2 with temperature $\eta$ and prior $q$.
  end for

Table 1: Boltzmann DQN Iteration Parameters
Parameter                             RPS    SIS    Taxi
Fixed point iteration count
Number of particles for mean field
Number of mean fields
Number of episodes for evaluation
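The prior descent loop can be written compactly in terms of the Boltzmann iteration. The following is our own sketch, assuming a function boltzmann_iteration(eta, prior) that runs Algorithm 2 and reports whether the exploitability converged, and using the simple heuristic temperature update from the main text in place of a full temperature search:

```python
def prior_descent(initial_prior, eta0, c, num_outer_iterations, boltzmann_iteration):
    """Outer loop of prior descent: after every converged inner run, the resulting
    policy becomes the new prior and the temperature is scaled by c >= 1."""
    q, eta = initial_prior, eta0
    for _ in range(num_outer_iterations):
        policy, converged = boltzmann_iteration(eta=eta, prior=q)
        if not converged:          # no usable temperature found, keep current prior
            return q
        q = policy                 # iterate on the prior with the newest policy
        eta = eta * c              # heuristic temperature adjustment (cf. Figure 4)
    return q
```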
RPS.
This game is inspired by Shapley (1964) and their generalized non-zero-sum version of Rock-Paper-Scissors, for which classical fictitious play would not converge. Each of the agents can choose between rock, paper and scissors, and obtains a reward proportional to double the number of beaten agents minus the number of agents beating the agent. We modify the proportionality factors such that a uniformly random prior policy does not constitute a mean field equilibrium.

Let $\mathcal{S} = \{0, R, P, S\}$, $\mathcal{A} = \mathcal{S} \setminus \{0\}$, $\mu_0(0) = 1$, $\mathcal{T} = \{0, 1\}$, and for any $a \in \mathcal{A}$, $\mu_t \in \mathcal{P}(\mathcal{S})$,
$$r(R, a, \mu_t) = 2 \cdot \mu_t(S) - \mu_t(P), \quad r(P, a, \mu_t) = 4 \cdot \mu_t(R) - 2 \cdot \mu_t(S), \quad r(S, a, \mu_t) = 6 \cdot \mu_t(P) - 3 \cdot \mu_t(R).$$
The transition function allows picking the next state directly, i.e. for all $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$,
$$\mathbb{P}(S_{t+1} = s' \mid S_t = s, A_t = a) = \mathbf{1}_{\{s'\}}(a).$$

SIS.
In this problem, a large number of agents can choose between social distancing (D) or going out (U). If a susceptible (S) agent chooses social distancing, they cannot become infected (I). Otherwise, an agent may become infected with a probability proportional to the number of agents being infected. If infected, an agent will recover with a fixed chance every time step. Both social distancing and being infected have an associated cost.

Let $\mathcal{S} = \{S, I\}$, $\mathcal{A} = \{U, D\}$ and $\mathcal{T} = \{0, \ldots, 49\}$, with a small initial fraction $\mu_0(I)$ of infected agents. The reward penalizes being infected with a per-step cost and social distancing with a smaller per-step cost, i.e. $r(s, a, \mu_t)$ is the negative sum of these costs. We find that similar parameters produce similar results, and set the transition probability mass functions such that an infected agent recovers with a fixed probability each step, a susceptible agent going out becomes infected with a probability proportional to the fraction of infected agents, i.e. $\mathbb{P}(S_{t+1} = I \mid S_t = S, A_t = U) \propto \mu_t(I)$, and a susceptible agent that distances cannot become infected,
$$\mathbb{P}(S_{t+1} = I \mid S_t = S, A_t = D) = 0.$$
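For concreteness, a minimal sketch of the SIS mean field evolution under a fixed policy follows. The numeric constants below (initial infection level, recovery and infection rates) are illustrative placeholders of our own choosing, not the values used in the paper.

```python
import numpy as np

# Illustrative constants only -- not the parameters used in the paper.
T, INIT_INFECTED, RECOVERY, INFECTIVITY = 50, 0.1, 0.2, 0.8

def sis_mean_field(pi_distancing):
    """Propagate mu_t = (mu_t(S), mu_t(I)) given the probability pi_distancing[t]
    that a susceptible agent chooses social distancing at time t."""
    mu = np.zeros((T, 2))
    mu[0] = [1.0 - INIT_INFECTED, INIT_INFECTED]
    for t in range(T - 1):
        p_infect = (1.0 - pi_distancing[t]) * INFECTIVITY * mu[t, 1]
        new_infected = mu[t, 0] * p_infect
        recovered = mu[t, 1] * RECOVERY
        mu[t + 1] = [mu[t, 0] - new_infected + recovered,
                     mu[t, 1] + new_infected - recovered]
    return mu

mu = sis_mean_field(np.full(T, 0.5))   # e.g. half of the susceptible agents distance
```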
Taxi.

In this problem, we consider a $K \times L$ grid. The state is described by a tuple $(x, y, x', y', p, B)$ where $(x, y)$ is the agent's position, $(x', y')$ indicates the current desired destination of the passenger or is $(0, 0)$ otherwise, and $p \in \{0, 1\}$ indicates whether a passenger is in the taxi or not. Finally, $B$ is a $K \times L$ matrix indicating whether a new passenger is available for the taxi on the corresponding tile. All taxis start on the same tile and have no passengers in the queue or on the map at the beginning. The problem runs for 100 time steps.

The taxi can choose between five actions $W, U, D, L, R$, where $W$ (Wait) allows the taxi to pick up or deliver passengers, and $U, D, L, R$ (Up, Down, Left, Right) allow it to move in all four directions. As there are many taxis, there is a chance of a jam on tile $s$ proportional to the fraction of taxis $\mu_t(s)$ on that tile (capped at a fixed maximum), in which case the taxi will not move. The taxi also cannot move into walls or back into the starting tile, in which case it will stay on its current tile. With a fixed probability, a new passenger spawns on one randomly chosen free tile of each region. On picking up a passenger, the destination is generated by randomly picking any free tile of the same region. Delivering passengers to a destination and picking them up gives a region-dependent reward, with different values in regions 1 and 2.

For our experiments, we use a small map with a single starting tile (denoted S), impassable walls (denoted H), and the remaining free tiles split into the two regions.
Table 2: DQN Hyperparameters
Hyperparameter                     Value
Replay buffer size
ADAM learning rate
Discount factor
Target update frequency
Gradient clipping norm
Mini-batch size
Epsilon schedule (linear decay)
Total epochs

Table 3: Problem Properties
Problem    |T|    |S|    |A|
LR           2      3      2
RPS          2      4      3
SIS         50      2      2
Taxi       100      ~      5
This produces a similar situation as in LR, where a fraction of taxis should choose each region so that the values balance out, while also requiring the solution of a problem that is intractable to solve exactly via dynamic programming.
A.4 Further experiments
Figure 5: Mean exploitability (solid lines), maximum and minimum (dashed lines) over the final 10 iterations of the last outer iteration. 50 outer iterations and 100 inner iterations each; (a, d) LR; (b, e) RPS; (c, f) SIS. Panels (a–c) show $\eta$-Boltzmann and panels (d–f) $\eta$-MaxEnt prior descent, each for temperature adjustment factors $c = 1.0$, $1.1$ and $1.2$. Maximum entropy (MaxEnt) results begin at higher temperatures due to limited floating point accuracy. The exploitability of the initial uniform prior policy is indicated by the dashed horizontal line.
Figure 6: Mean exploitability over the final 10 iterations for fictitious play variants averaging only the policy or only the mean field, for both $\eta$-Boltzmann and $\eta$-MaxEnt iteration. Dashed lines represent maximum and minimum over the final 10 iterations. (a) LR, 10000 iterations; (b) RPS, 10000 iterations; (c) SIS, 1000 iterations. The exploitability of the uniform prior policy is indicated by the dashed horizontal line.

In Figure 5, we observe that prior descent for both Boltzmann and RelEnt MFE with the same uniform prior policy performs qualitatively similarly, and the two coincide in LR and SIS except for numerical inaccuracies. It can be seen that using a temperature sufficiently low to converge in LR and RPS allows prior descent to descend to the exact MFE iteratively. In SIS on the other hand, picking a fixed temperature that converges for the initial uniform prior policy does not guarantee monotonic improvement of exploitability afterwards. Instead, by applying the heuristic
$$\eta_{i+1} = \eta_i \cdot c$$
for each outer iteration $i$, where $c \geq 1$ adjusts the temperature after each outer iteration, we avoid scanning over all temperatures in each step and reach convergence to a good approximate mean field equilibrium for both Boltzmann and MaxEnt iteration.
Figure 7: (a) Difference between current and final minimum exploitability over the last 10 iterations; (b) Distance between current and final mean field, cut off at 500 iterations for readability. Plotted for the $\eta$-RelEnt iterations in SIS for the indicated temperature settings and uniform prior policy.
Figure 8: Difference between current and final estimated minimum exploitability over the last 5 iterations. (a) SIS, 50 iterations; (b) Taxi, 15 iterations. Plotted for the $\eta$-Boltzmann DQN iteration for the indicated temperature settings and uniform prior policy.

In Figure 6, empirical results are shown for fictitious play variants averaging only the policy or only the mean field. In the simple one-step toy problems LR and RPS, averaging the policies appears to converge to the exact solution without regularization and to the regularized solution with regularization. Averaging the mean fields on the other hand fails, since this method can only produce deterministic policies. By applying any amount of regularization, averaging the mean fields is led to success in LR and SIS. Nonetheless, both methods fail to converge to the MFE in SIS and produce worse results than obtained by prior descent in Figure 5.

In Figure 7 we depict the convergence of exploitability and mean field of MaxEnt iteration in SIS. The results are qualitatively similar to Boltzmann iteration and, as in the main text, show the convergence behaviour near the critical temperature leading to convergence.

In Figure 8 we depict the convergence of exploitability for Boltzmann DQN iteration in SIS and Taxi during one of the runs. All 4 other runs show similar qualitative behaviour. As can be seen, the highest temperature shows less oscillatory behaviour, stabilizing Boltzmann DQN iteration. In Taxi, it can be seen that the used temperatures are insufficient to allow Boltzmann DQN iteration to converge. We believe that using prior descent could allow for better results. We could not verify this due to the high computational cost, as this includes repeatedly and sequentially solving an expensive reinforcement learning problem.

Finally, in Figure 9 we depict the resulting behavior in the SIS case. In the Boltzmann iteration result, at the beginning the number of infected is high enough to make social distancing the optimal action to take. As the number of infected falls, it reaches an equilibrium point where social distancing and potentially getting infected are of equal value. Finally, as the game ends at time $t = T = 50$, there is no point in social distancing any more. Our approach yields intuitive results here, while exact fixed point iteration and FP fail to converge.

Figure 9: Fraction of infected agents and fraction of susceptible agents picking social distancing over time. (a, d): Boltzmann iteration; (b, e): exact fixed point iteration; (c, f): fictitious play (averaging both policy and mean field) results in SIS after 500 iterations. More iterations and averaging only policy or mean field show the same qualitative results.

B Proofs
B Proofs

B.1 Completeness of mean field and policy space

Lemma B.1.1.
The metric spaces (Π, d_Π) and (M, d_M) are complete metric spaces.

Proof. We first show that (M, d_M) is a complete metric space. Let (µⁿ)_{n∈ℕ} ∈ M^ℕ be a Cauchy sequence of mean fields. Then by definition, for any ε > 0 there exists an integer N > 0 such that for any m, n > N we have

d_M(µⁿ, µᵐ) < ε
⟹ ∀ t ∈ T: d_TV(µⁿ_t, µᵐ_t) = (1/2) Σ_{s∈S} |µⁿ_t(s) − µᵐ_t(s)| < ε
⟹ ∀ t ∈ T, s ∈ S: |µⁿ_t(s) − µᵐ_t(s)| < 2ε.

By completeness of ℝ, the limit of (µⁿ_t(s))_{n∈ℕ} exists for all t ∈ T, s ∈ S, suggestively denoted by µ_t(s). The mean field µ = {µ_t}_{t∈T} with the probabilities defined by the aforementioned limits fulfills µⁿ → µ and is in M, showing completeness of M. We proceed analogously for (Π, d_Π). Thus, (Π, d_Π) and (M, d_M) are complete metric spaces.
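As a concrete reading of the metrics used throughout this appendix, the following is a small helper sketch. The choice d_M(µ, µ′) = max_t d_TV(µ_t, µ′_t) and d_Π(π, π′) = max_{t,s} Σ_a |π_t(a|s) − π′_t(a|s)| is an assumption, consistent with the inequality d_TV(µ_t, µ′_t) ≤ d_M(µ, µ′) and the policy distance manipulated later in Appendix B.7; the exact definitions live in the main text.

```python
import numpy as np

def d_tv(nu1, nu2):
    """Total variation distance between two pmfs on the finite state space S."""
    return 0.5 * np.abs(np.asarray(nu1) - np.asarray(nu2)).sum()

def d_mean_field(mu1, mu2):
    """Assumed metric on M: worst total variation distance over all times t."""
    return max(d_tv(mu1[t], mu2[t]) for t in range(len(mu1)))

def d_policy(pi1, pi2):
    """Assumed metric on Pi: worst L1 distance over actions, over all t and s."""
    pi1, pi2 = np.asarray(pi1), np.asarray(pi2)   # shape (T, |S|, |A|)
    return np.abs(pi1 - pi2).sum(axis=-1).max()
```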
B.2 Lipschitz continuity

Lemma B.2.1. Assume bounded and Lipschitz functions f: X → ℝ and g: X → ℝ mapping from a metric space (X, d_X) into ℝ with Lipschitz constants C_f, C_g and bounds |f(x)| ≤ M_f, |g(x)| ≤ M_g. The sum f + g, the product f · g and the maximum max(f, g) are all Lipschitz and bounded, with Lipschitz constants C_f + C_g, (M_f C_g + M_g C_f), max(C_f, C_g) and bounds M_f + M_g, M_f M_g, max(M_f, M_g), respectively.

Proof. Let x, y ∈ X be arbitrary. By the triangle inequality, we obtain

|f(x) + g(x) − (f(y) + g(y))| ≤ |f(x) − f(y)| + |g(x) − g(y)| ≤ (C_f + C_g) d_X(x, y).

Analogously, we obtain

|f(x)g(x) − f(y)g(y)| ≤ |f(x)g(x) − f(x)g(y)| + |f(x)g(y) − f(y)g(y)| ≤ (M_f C_g + M_g C_f) d_X(x, y).

For the maximum of both functions, consider case by case. If f(x) ≥ g(x) and f(y) ≥ g(y), we obtain

|max(f(x), g(x)) − max(f(y), g(y))| = |f(x) − f(y)| ≤ C_f d_X(x, y),

and analogously for g(x) ≥ f(x) and g(y) ≥ f(y),

|max(f(x), g(x)) − max(f(y), g(y))| = |g(x) − g(y)| ≤ C_g d_X(x, y).

On the other hand, if g(x) < f(x) and g(y) ≥ f(y), we have either g(y) ≥ f(x) and thus

|max(f(x), g(x)) − max(f(y), g(y))| = |f(x) − g(y)| = g(y) − f(x) < g(y) − g(x) ≤ C_g d_X(x, y),

or g(y) < f(x) and thus

|max(f(x), g(x)) − max(f(y), g(y))| = |f(x) − g(y)| = f(x) − g(y) ≤ f(x) − f(y) ≤ C_f d_X(x, y).

The case f(x) < g(x) and f(y) ≥ g(y), as well as boundedness, is analogous.
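Lemma B.2.1 can be sanity-checked numerically; the sketch below estimates Lipschitz constants of f + g, f·g and max(f, g) on a grid for two arbitrary example functions. The chosen functions and grid are illustrative assumptions, not part of the proof.

```python
import numpy as np

# Two bounded Lipschitz example functions on [0, 1] (illustrative choices).
f = lambda x: np.sin(3 * x)          # C_f <= 3, M_f <= 1
g = lambda x: 0.5 * np.cos(2 * x)    # C_g <= 1, M_g <= 0.5

x = np.linspace(0.0, 1.0, 2001)

def lipschitz_estimate(h):
    """Largest slope between neighbouring grid points, a lower bound on the true constant."""
    return np.max(np.abs(np.diff(h)) / np.diff(x))

fs, gs = f(x), g(x)
print(lipschitz_estimate(fs + gs))             # <= C_f + C_g = 4
print(lipschitz_estimate(fs * gs))             # <= M_f*C_g + M_g*C_f = 2.5
print(lipschitz_estimate(np.maximum(fs, gs)))  # <= max(C_f, C_g) = 3
```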
B.3 Proof of Proposition 1

Proof.
Since we work with finite T, S, A, we identify the space of mean fields M with the |T|(|S|−1)-dimensional simplex S^{|T|(|S|−1)} ⊆ ℝ^{|T|(|S|−1)} via the values of the probability mass functions at all times and states. Analogously, the space of policies Π is identified with S^{|T||S|(|A|−1)} ⊆ ℝ^{|T||S|(|A|−1)}.

Define the set-valued map Γ̂ : S^{|T||S|(|A|−1)} → S^{|T||S|(|A|−1)} mapping from a policy π, represented by the input vector, to the set of vector representations of optimal policies in the MDP induced by Ψ(π).

A policy π is optimal in the MDP induced by µ ∈ M if and only if its value function, defined by

V^π(µ, t, s) = Σ_{a∈A} π_t(a|s) ( r(s, a, µ_t) + Σ_{s′∈S} p(s′|s, a, µ_t) V^π(µ, t+1, s′) ),

is equal to the optimal value function defined by

V*(µ, t, s) = max_{a∈A} ( r(s, a, µ_t) + Σ_{s′∈S} p(s′|s, a, µ_t) V*(µ, t+1, s′) )

for every t ∈ T, s ∈ S, with terminal conditions V*(µ, T, s) ≡ V^π(µ, T, s) ≡ 0. Moreover, an optimal policy always exists. For more details, see e.g. Puterman (2014). Define the optimal action-value function for every t ∈ T, s ∈ S, a ∈ A via

Q*(µ, t, s, a) = r(s, a, µ_t) + Σ_{s′∈S} p(s′|s, a, µ_t) V*(µ, t+1, s′)

with terminal condition Q*(µ, T, s, a) ≡ 0. Then, the following lemma characterizes optimality of policies.

Lemma B.3.1.
A policy π fulfills π ∈ ˆΓ(ˆ π ) if and only if π t ( a | s ) > ⇒ a ∈ arg max a (cid:48) ∈A Q ∗ (Ψ(ˆ π ) , t, s, a (cid:48) ) for all t ∈ T , s ∈ S , a ∈ A .Proof. To see the implication, consider π ∈ ˆΓ(ˆ π ) . Then, if the right-hand side was false, there exists a maximal t ∈ T and s ∈ S , a ∈ A such that π t ( a | s ) > but a (cid:54)∈ arg max a (cid:48) ∈A Q ∗ (Ψ(ˆ π ) , t, s, a (cid:48) ) . Since for any t (cid:48) > t we haveoptimality, V π ( µ, t + 1 , s (cid:48) ) = V ∗ ( µ, t + 1 , s (cid:48) ) by induction. However, V π ( µ, t, s ) < V ∗ ( µ, t, s ) since the suboptimalaction is assigned positive probability, contradicting optimality of π . On the other hand, if the right-hand side istrue, then V π ( µ, t, s ) = V ∗ ( µ, t, s ) by induction, which implies that π is optimal. (cid:4) ai Cui, Heinz Koeppl We will now check that the requirements of Kakutani’s fixed point theorem hold for ˆΓ . The finite-dimensionalsimplices are convex, closed and bounded, hence compact. ˆΓ maps to a non-empty set, as the induced mean fieldis uniquely defined and any finite MDP (induced by this mean field) has an optimal policy.For any π , ˆΓ( π ) is convex, since the set of optimal policies is convex as shown in the following. Consider a convexcombination ˜ π = λπ + (1 − λ ) π (cid:48) of optimal policies π, π (cid:48) for λ ∈ [0 , . Then, the resulting policy will be optimal,since we have ˜ π t ( a | s ) > ⇒ π t ( a | s ) > ∨ π (cid:48) t ( a | s ) > ⇒ a ∈ arg max a ∈A Q ∗ (Ψ(ˆ π ) , t, s, a ) for any t ∈ T , s ∈ S , a ∈ A and thus optimality by Lemma B.3.1.Finally, we show that ˆΓ has a closed graph. Consider arbitrary sequences ( π n , π (cid:48) n ) → ( π, π (cid:48) ) with π (cid:48) n ∈ ˆΓ( π n ) . It isthen sufficient to show that π (cid:48) ∈ ˆΓ( π ) . By the standing assumption, we have continuity of Ψ and µ → Q ∗ ( µ, t, s, a ) for any t ∈ T , s ∈ S , a ∈ A , as sums, products and compositions of continuous functions remain continuous.Therefore, the composition π → Q ∗ (Ψ( π ) , t, s, a ) is continuous. To show that π (cid:48) ∈ ˆΓ( π ) , assume that π (cid:48) (cid:54)∈ ˆΓ( π ) .By Lemma B.3.1 there exists t ∈ T , s ∈ S , a ∈ A such that π (cid:48) t ( a | s ) > and further there exists a (cid:48) ∈ A such that Q ∗ (Ψ( π ) , t, s, a (cid:48) ) > Q ∗ (Ψ( π ) , t, s, a ) . Fix such an a (cid:48) ∈ A . Let δ ≡ Q ∗ (Ψ( π ) , t, s, a (cid:48) ) − Q ∗ (Ψ( π ) , t, s, a ) , then bycontinuity there exists ε > such that for all ˆ π ∈ Π we have d Π (ˆ π, π ) < ε = ⇒ | Q ∗ (Ψ(ˆ π ) , t, s, a ) − Q ∗ (Ψ( π ) , t, s, a ) | < δ . By convergence, there is an integer N ∈ N such that for all n > N we have d Π ( π n , π ) < ε and therefore Q ∗ (Ψ( π n ) , t, s, a (cid:48) ) > Q ∗ (Ψ( π ) , t, s, a (cid:48) ) − δ Q ∗ (Ψ( π ) , t, s, a ) + δ > Q ∗ (Ψ( π n ) , t, s, a ) . Since ( π (cid:48) n ) t ( a | s ) → π (cid:48) t ( a | s ) > , there also exists M ∈ N such that for all m > M , | ( π (cid:48) m ) t ( a | s ) − π (cid:48) t ( a | s ) | < π (cid:48) t ( a | s ) . Let n > max(
N, M ) , then it follows that ( π (cid:48) n ) t ( a | s ) > but a (cid:54)∈ arg max a (cid:48) ∈A Q ∗ (Ψ( π ) , t, s, a (cid:48) ) since we have Q ∗ (Ψ( π n ) , t, s, a (cid:48) ) > Q ∗ (Ψ( π n ) , t, s, a ) , contradicting π (cid:48) n ∈ ˆΓ( π n ) by Lemma B.3.1. Hence, ˆΓ must have a closedgraph.By Kakutani’s fixed point theorem, there exists a fixed point π ∗ that generates some mean field Ψ( π ∗ ) . Theassociated pair ( π ∗ , Ψ( π ∗ )) is an MFE by definition. B.4 Proof of Proposition 3
Proof.
The space of mean fields (M, d_M) is equivalent to convex and compact finite-dimensional simplices. In this representation, each coordinate of the operators Γ̃_η(µ) and Γ_η(µ) consists of compositions, sums and products of continuous functions, since the functions r(s, a, µ_t) and p(s′|s, a, µ_t) are assumed to be continuous. Existence of a fixed point follows immediately by Brouwer's fixed point theorem.

B.5 Proof of Theorem 1
Proof.
The proof is a slightly simplified version of the one found in Saldi et al. (2018). Note that we require the results later, so for convenience we give the full details.

The empirical measure G^N_{S_t} is a random variable on P(S), i.e. its law L(G^N_{S_t}) ∈ P(P(S)) is a distribution over probability measures. Since we want to show convergence of the empirical measure to the mean field, let us pick a metric on P(P(S)). Remember that we metrized P(S) with the total variation distance. We metrize P(P(S)) with the 1-Wasserstein metric, defined for any Φ, Ψ ∈ P(P(S)) by the infimum over couplings

W_1(Φ, Ψ) ≡ inf_{L(X_1)=Φ, L(X_2)=Ψ} E[d_TV(X_1, X_2)].

Lemma B.5.1.
Let { Φ n } n ∈ N be a sequence of measures with Φ n ∈ P ( P ( S )) for all n ∈ N . Further, let µ ∈ P ( S ) arbitrary. Then, the following are equivalent. pproximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning (a) W (Φ n , δ µ ) → as n → ∞ (b) E [ | F ( X n ) − F ( X ) | ] → as n → ∞ for any continuous, bounded F : P ( S ) → R , any sequence { X n } n ∈ N of P ( S ) -valued random variables and any P ( S ) -valued random variable X with L ( X n ) = Φ n and L ( X ) = δ µ .(c) E [ | X n ( f ) − X ( f ) | ] → as n → ∞ for any f : S → R , any sequence { X n } n ∈ N of P ( S ) -valued randomvariables and any P ( S ) -valued random variable X with L ( X n ) = Φ n and L ( X ) = δ µ .Proof. Define the only possible coupling ∆ n ≡ Φ n × δ µ .(b), (c) = ⇒ (a):Define F s ( x ) ≡ x ( s ) and f s ( s (cid:48) ) ≡ { s } ( s (cid:48) ) for all s ∈ S , where F s is continuous. By assumption, W (Φ n , δ µ ) = inf L ( X n )=Φ n , L ( X )= δ µ E [ d T V ( X n , X )]= 12 (cid:90) P ( S ) ×P ( S ) (cid:88) s ∈S | X n ( s ) − X ( s ) | d ∆ n = 12 (cid:88) s ∈S E [ | X n ( s ) − X ( s ) | ] → since for any s ∈ S , we have E [ | X n ( s ) − X ( s ) | ] = E [ | F s ( X n ) − F s ( X ) | ] = E [ | X n ( f s ) − X ( f s ) | ] . (a) = ⇒ (b), (c):We have E [ | F ( X n ) − F ( X ) | ] = (cid:90) P ( S ) ×P ( S ) | F ( ν ) − F ( ν (cid:48) ) | ∆ n ( dν, dν (cid:48) )= (cid:90) P ( S ) | F ( ν ) − F ( µ ) | Φ n ( dν ) → (cid:90) P ( S ) | F ( ν ) − F ( µ ) | δ µ ( dν ) = 0 by continuity and boundedness of | F ( ν ) − F ( µ ) | , and convergence in W implying weak convergence. Analogously, E [ | X n ( f ) − X ( f ) | ] = (cid:90) P ( S ) | ν ( f ) − µ ( f ) | Φ n ( dν ) → (cid:90) P ( S ) | ν ( f ) − µ ( f ) | δ µ ( dν ) = 0 since f and thus | ν ( f ) − µ ( f ) | is automatically bounded from finiteness of S , and ν ( f ) = (cid:80) s ∈S ν ( s ) f ( s ) → (cid:80) s ∈S µ ( s ) f ( s ) as ν → µ in total variation distance implies continuity of | ν ( f ) − µ ( f ) | . (cid:4) First, it is shown that when all other agents follow the same policy π , then the empirical distribution is essentiallythe deterministic mean field as N → ∞ , i.e. L ( G NS t ) → L ( µ t ) ≡ δ µ t with µ = Ψ( π ) Lemma B.5.2.
Consider a set of policies (˜ π, π, . . . , π ) ∈ Π N for all agents. Under this set of policies, the law ofthe empirical distribution L ( G NS t ) ∈ P ( M ) converges to δ µ t where µ = Ψ( π ) as N → ∞ in 1-Wasserstein distance.Proof. Define the Markov kernel P πt,ν such that its probability mass function fulfills P πt,ν ( s (cid:48) | s ) ≡ (cid:88) a ∈A π t ( a | s ) p ( s (cid:48) | s, a, ν ) for any t ∈ T , s ∈ S , ν ∈ P ( S ) , π ∈ Π and analogously ˜ νP πt,ν ( s (cid:48) ) ≡ (cid:88) s ∈S ˜ ν ( s ) (cid:88) a ∈A π t ( a | s ) p ( s (cid:48) | s, a, ν ) ai Cui, Heinz Koeppl for any ˜ ν ∈ P ( S ) . Note that µ t +1 = µ t P πt,µ t ( g ) for mean fields µ = Ψ( π ) induced by π .We show that E (cid:2)(cid:12)(cid:12) G NS t ( f ) − µ t ( f ) (cid:12)(cid:12)(cid:3) → as N → ∞ for any function f : S → R and any time t ∈ T . From this,the desired result follows by Lemma B.5.1. Since G NS t ( · ) ≡ N (cid:80) Ni =1 δ S it ( · ) and S i ∼ µ we have at time t = 0 that lim N →∞ E (cid:2)(cid:12)(cid:12) G NS ( f ) − µ ( f ) (cid:12)(cid:12)(cid:3) = lim N →∞ E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) i =1 f ( S i ) − E (cid:2) f ( S i ) (cid:3)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:35) = 0 by the strong law of large numbers and the dominated convergence theorem.Assuming this holds for t , then for t + 1 we have E (cid:104)(cid:12)(cid:12)(cid:12) G NS t +1 ( f ) − µ t +1 ( f ) (cid:12)(cid:12)(cid:12)(cid:105) ≤ E (cid:104)(cid:12)(cid:12)(cid:12) G NS t +1 ( f ) − G N − S t +1 ( f ) (cid:12)(cid:12)(cid:12)(cid:105) + E (cid:104)(cid:12)(cid:12)(cid:12) G N − S t +1 ( f ) − G N − S t P πt, G NSt ( f ) (cid:12)(cid:12)(cid:12)(cid:105) + E (cid:104)(cid:12)(cid:12)(cid:12) G N − S t P πt, G NSt ( f ) − G NS t P πt, G NSt ( f ) (cid:12)(cid:12)(cid:12)(cid:105) + E (cid:104)(cid:12)(cid:12)(cid:12) G NS t P πt, G NSt ( f ) − µ t P πt,µ t ( f ) (cid:12)(cid:12)(cid:12)(cid:105) where we defined G N − S t ( · ) ≡ N − (cid:80) Ni =2 δ S it ( · ) .For the first term, we have as N → ∞ E (cid:104)(cid:12)(cid:12)(cid:12) G NS t +1 ( f ) − G N − S t +1 ( f ) (cid:12)(cid:12)(cid:12)(cid:105) = E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) i =1 f ( S it +1 ) − N − N (cid:88) i =2 f ( S it +1 ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:35) ≤ N E (cid:2)(cid:12)(cid:12) f ( S t +1 ) (cid:12)(cid:12)(cid:3) + (cid:12)(cid:12)(cid:12)(cid:12) N − N − (cid:12)(cid:12)(cid:12)(cid:12) N (cid:88) i =2 E (cid:2)(cid:12)(cid:12) f ( S it +1 ) (cid:12)(cid:12)(cid:3) ≤ (cid:18) N + N − N ( N − (cid:19) max s ∈S | f ( s ) | → . For the second term, as N → ∞ we have by Jensen’s inequality and bounds | f | ≤ M f (by finiteness of S ) E (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) G N − S t +1 ( f ) − G N − S t P πt, G N − St ( f ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:21) = E (cid:20) E (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) G N − S t +1 ( f ) − G N − S t P πt, G N − St ( f ) (cid:12)(cid:12)(cid:12)(cid:12) | S t (cid:21)(cid:21) = E (cid:34) E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N − N (cid:88) i =2 (cid:0) f ( S it +1 ) − E (cid:2) f ( S it +1 ) (cid:3)(cid:1)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) | S t (cid:35)(cid:35) ≤ N − N (cid:88) i =2 E (cid:104) E (cid:104)(cid:0) f ( S it +1 ) − E (cid:2) f ( S it +1 ) (cid:3)(cid:1) | S t (cid:105)(cid:105) ≤ N − · M f → . 
For the third term, we again have as N → ∞ E (cid:104)(cid:12)(cid:12)(cid:12) G N − S t P πt, G NSt ( f ) − G NS t P πt, G NSt ( f ) (cid:12)(cid:12)(cid:12)(cid:105) = E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) s ∈S (cid:0) G N − S t ( s ) − G NS t ( s ) (cid:1) (cid:88) a ∈A π t ( a | s ) (cid:88) s (cid:48) ∈S p ( s (cid:48) | s, a, G NS t ) f ( s (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:35) ≤ E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:18) N − − N (cid:19) N (cid:88) i =2 (cid:88) a ∈A π t ( a | S it ) (cid:88) s (cid:48) ∈S p ( s (cid:48) | S it , a, G NS t ) f ( s (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:35) + E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N (cid:88) a ∈A π t ( a | S t ) (cid:88) s (cid:48) ∈S p ( s (cid:48) | S t , a, G NS t ) f ( s (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:35) pproximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning ≤ (cid:18) N − N ( N −
1) + 1 N (cid:19) max s ∈S | f ( s ) | → . For the fourth term, define F : P ( S ) → R , F ( ν ) = νP πt,ν ( f ) and observe that F is continuous, since ν → ν (cid:48) if andonly if ν ( s ) → ν (cid:48) ( s ) for all s ∈ S , and therefore (as p is assumed continuous by Assumption 1) F ( ν ) = νP πt,ν ( f ) = (cid:88) s ∈S ν ( s ) (cid:88) a ∈A π t ( a | s ) (cid:88) s (cid:48) ∈S p ( s (cid:48) | s, a, ν ) f ( s (cid:48) ) is continuous for any s (cid:48) ∈ S . By Lemma B.5.1, we have from the induction hypothesis G NS t → µ t that E (cid:104)(cid:12)(cid:12)(cid:12) G NS t P πt, G NSt ( f ) − µ t P πt,µ t ( f ) (cid:12)(cid:12)(cid:12)(cid:105) → . Therefore, E (cid:104)(cid:12)(cid:12)(cid:12) G NS t +1 ( f ) − µ t +1 ( f ) (cid:12)(cid:12)(cid:12)(cid:105) → which implies the desired result by induction. (cid:4) Consider the case where all agents follow a set of policies ( π N , π, . . . , π ) ∈ Π N for each N ∈ N . Define newsingle-agent random variables S µt and A µt with S µ ∼ µ and P ( A µt = a | S µt = s ) = π Nt ( a | s ) , P ( S µt +1 = s (cid:48) | S µt = s, A µt = a ) = p ( s (cid:48) | s, a, µ t ) , where the deterministic mean field µ is used instead of the empirical distribution. Lemma B.5.3.
Consider an equicontinuous, uniformly bounded family of functions F on P ( S ) and define F t ( ν ) ≡ sup f ∈F | f ( ν ) − f ( µ t ) | for any t ∈ T . Then, F t is continuous and bounded and by Lemma B.5.1 we have lim N →∞ E (cid:34) sup f ∈F (cid:12)(cid:12) f ( G NS t ) − f ( µ ) (cid:12)(cid:12)(cid:35) = 0 Proof. F t is continuous, since for ν n → ν | F t ( ν n ) − F t ( ν ) | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) sup f ∈F | f ( ν ) − f ( µ t ) | − sup f ∈F | f ( ν (cid:48) ) − f ( µ t ) | (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ sup f ∈F | f ( ν ) − f ( ν (cid:48) ) | → by equicontinuity. Further, F t is bounded since | F t ( ν ) | ≤ sup f ∈F | f ( ν ) | + | f ( µ t ) | is uniformly bounded. ByLemma B.5.2, we have W ( G NS t , δ µ t ) → as N → ∞ , therefore Lemma B.5.1 applies. (cid:4) Lemma B.5.4.
Suppose that at some time t ∈ T , it holds that lim N →∞ (cid:12)(cid:12) L ( S t )( g N ) − L ( S µt )( g N ) (cid:12)(cid:12) = 0 for any sequence of functions { g N } N ∈ N from S to R that is uniformly bounded. Then, we have lim N →∞ (cid:12)(cid:12) L ( S t , G NS t )( T N ) − L ( S µt , µ t )( T N ) (cid:12)(cid:12) = 0 for any sequence of functions { T N } N ∈ N from S × P ( S ) to R that is equicontinuous and uniformly bounded.Proof. We have (cid:12)(cid:12) L ( S t , G NS t )( T N ) − L ( S µt , µ t )( T N ) (cid:12)(cid:12) ≤ (cid:12)(cid:12) L ( S t , G NS t )( T N ) − L ( S t , µ t )( T N ) (cid:12)(cid:12) + (cid:12)(cid:12) L ( S t , µ t )( T N ) − L ( S µt , µ t )( T N ) (cid:12)(cid:12) ai Cui, Heinz Koeppl The first term becomes (cid:12)(cid:12) L ( S t , G NS t )( T N ) − L ( S t , µ t )( T N ) (cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) T N ( x, ν ) L ( S t , G NS t )( dx, dν ) − (cid:90) T N ( x, ν ) L ( S t , µ t )( dx, dν ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ E (cid:2) E (cid:2)(cid:12)(cid:12) T N ( x, G NS t ) − T N ( x, µ t ) (cid:12)(cid:12) S t (cid:3)(cid:3) ≤ E (cid:34) sup f ∈{ T N ( · ,ν ) } ν ∈P ( S ) ,N ∈ N (cid:12)(cid:12) f ( G NS t ) − f ( µ t ) (cid:12)(cid:12)(cid:35) → by Lemma B.5.3, since { T N } N ∈ N is equicontinuous and uniformly bounded. Similarly for the second term, (cid:12)(cid:12) L ( S t , µ t )( T N ) − L ( S µt , µ t )( T N ) (cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) T N ( x, ν ) L ( S t , µ t )( dx, dν ) − (cid:90) T N ( x, ν ) L ( S µt , µ t )( dx, dν ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ E (cid:2)(cid:12)(cid:12) T N ( S t , µ t ) − T N ( S µt , µ t ) (cid:12)(cid:12)(cid:3) → by the assumption, since T N fulfills the condition of being uniformly bounded. (cid:4) Lemma B.5.5.
For any sequence { g N } N ∈ N of functions from S to R that is uniformly bounded, we have lim N →∞ (cid:12)(cid:12) L ( S t )( g N ) − L ( S µt )( g N ) (cid:12)(cid:12) = 0 for all times t ∈ T .Proof. Define l N,t as l N,t ( s, ν ) ≡ (cid:88) a ∈A π Nt ( a | s ) (cid:88) s (cid:48) ∈S p ( s (cid:48) | s, a, ν ) g N ( s (cid:48) ) . { l N,t ( s, · ) } s ∈S ,N ∈ N is equicontinuous, since for any ν, ν (cid:48) ∈ M with d T V ( ν, ν (cid:48) ) → , sup s ∈S ,N ∈ N | l N,t ( s, ν ) − l N,t ( s, ν (cid:48) ) | ≤ M g sup s ∈S ,N ∈ N (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) a ∈A π Nt ( a | s ) (cid:88) s (cid:48) ∈S ( p ( s (cid:48) | s, a, ν ) − p ( s (cid:48) | s, a, ν (cid:48) )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ M g |S| max s ∈S max a ∈A max s (cid:48) ∈S | p ( s (cid:48) | s, a, ν ) − p ( s (cid:48) | s, a, ν (cid:48) ) | → since | g N | < M g is uniformly bounded and p is continuous by assumption. Furthermore, l N,t ( s, ν ) is alwaysuniformly bounded by M g . Now the result can be shown by induction.For t = 0 , L ( S µ ) = L ( S ) fulfills the hypothesis. Assume this holds for t , then (cid:12)(cid:12) L ( S t +1 )( g N ) − L ( S µt +1 )( g N ) (cid:12)(cid:12) = (cid:12)(cid:12) L ( S t , G NS t )( l N,t ) − L ( S µt , µ t )( l N,t ) (cid:12)(cid:12) → as N → ∞ by Lemma B.5.4. (cid:4) Thus, for any sequence of policies { π N } N ∈ N with π N ∈ Π for all N ∈ N , the achieved return of the N -agent gameconverges to the return of the mean field game under the mean field generated by the other agent’s policy π as N → ∞ . Lemma B.5.6.
Let { π N } N ∈ N with π N ∈ Π for all N ∈ N be an arbitrary sequence of policies and π ∈ Π an arbitrary policy. Further, let the mean field µ = Ψ( π ) be generated by π . Then, under the joint policy ( π N , π, . . . , π ) , we have as N → ∞ that (cid:12)(cid:12) J N ( π N , π, . . . , π ) − J µ ( π N ) (cid:12)(cid:12) → . Proof.
Define for any t ∈ T , N ∈ N r π Nt ( s, ν ) ≡ (cid:88) a ∈A r ( s, a, ν ) π Nt ( a | s ) pproximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning such that the family { r π Nt ( s, · ) } s ∈S ,N ∈ N is equicontinuous, since for any ν, ν (cid:48) ∈ M as d M ( ν, ν (cid:48) ) → , max s ∈S max N ∈ N (cid:12)(cid:12)(cid:12) r π Nt ( s, ν ) − r π Nt ( s, ν (cid:48) ) (cid:12)(cid:12)(cid:12) → by continuity of r . The function r π Nt is uniformly bounded for all N ∈ N by assumption of uniformly bounded r .By Lemma B.5.4 and Lemma B.5.5, lim N →∞ (cid:12)(cid:12) E (cid:2) r ( S t , A t , G NS t ) (cid:3) − E [ r ( S µt , A µt , µ t )] (cid:12)(cid:12) | = lim N →∞ (cid:12)(cid:12)(cid:12) E (cid:104) r π Nt ( S t , G NS t ) (cid:105) − E (cid:104) r π Nt ( S µt , µ t ) (cid:105)(cid:12)(cid:12)(cid:12) = 0 . such that we have lim N →∞ (cid:12)(cid:12) J N ( π N , π, . . . , π ) − J µ ( π N ) (cid:12)(cid:12) | ≤ (cid:88) t ∈T lim N →∞ (cid:12)(cid:12) E (cid:2) r ( S t , A t , G NS t ) (cid:3) − E [ r ( S µt , A µt , µ t )] (cid:12)(cid:12) = 0 . which is the desired result. (cid:4) From Lemma B.5.6, it follows that for any sequence of optimal exploiting policies { π N } N ∈ N with π N ∈ Π for all N ∈ N and π N ∈ arg max π ∈ Π J N ( π, π ∗ , . . . , π ∗ ) for all N ∈ N , it holds that for any MFE ( π ∗ , µ ∗ ) ∈ Π × M , lim N →∞ J N ( π N , π ∗ , . . . , π ∗ ) ≤ max π ∈ Π J µ ∗ ( π )= J µ ∗ ( π ∗ )= lim N →∞ J N ( π ∗ , . . . , π ∗ ) and by instantiating for arbitrary (cid:15) > , for sufficiently large N we obtain J N ( π N , π ∗ , . . . , π ∗ ) − (cid:15) = max π ∈ Π J N ( π, π ∗ , . . . , π ∗ ) − (cid:15) ≤ max π ∈ Π J µ ∗ ( π ) − (cid:15) J µ ∗ ( π ∗ ) − (cid:15) J N ( π ∗ , π ∗ , . . . , π ∗ ) which is the desired approximate Nash property that applies to all agents by symmetry. B.6 Proof of Theorem 2
Proof. If Φ or Ψ is constant, or if the restriction Ψ|_{Π_Φ} of Ψ to Π_Φ is constant, then Γ = Ψ ∘ Φ is constant. Assume that this is not the case.

Then there exist distinct π, π′ ∈ Π_Φ such that Ψ(π) ≠ Ψ(π′). By definition of Π_Φ there also exist distinct µ, µ′ ∈ M such that Φ(µ) = π and Φ(µ′) = π′. Note that for any ν, ν′ ∈ M with Γ(ν) ≠ Γ(ν′),

d_M(Γ(ν), Γ(ν′)) ≥ min_{π,π′∈Π_Φ, π≠π′} d_M(Ψ(π), Ψ(π′)),

where the right-hand side is greater than zero by finiteness of Π_Φ. This holds in particular for µ, µ′.

To show that Γ cannot be Lipschitz continuous, assume that Γ has a Lipschitz constant C > 0. Defining µ_i = (i/N) µ + ((N−i)/N) µ′ for all i ∈ {0, …, N}, we have µ_i ∈ M, and we can find an integer N such that

d_M(µ_i, µ_{i+1}) = d_M(µ, µ′)/N < min_{π,π′∈Π_Φ, π≠π′} d_M(Ψ(π), Ψ(π′)) / C

for all i ∈ {0, …, N−1}. Since Γ(µ) = Ψ(π) ≠ Ψ(π′) = Γ(µ′), by the triangle inequality

d_M(Γ(µ), Γ(µ′)) ≤ d_M(Γ(µ_0), Γ(µ_1)) + … + d_M(Γ(µ_{N−1}), Γ(µ_N))

there exists a pair (µ_i, µ_{i+1}) with Γ(µ_i) ≠ Γ(µ_{i+1}). For this pair, Γ(µ_i) and Γ(µ_{i+1}) are distinct elements of Ψ(Π_Φ), so

d_M(Γ(µ_i), Γ(µ_{i+1})) ≥ min_{π,π′∈Π_Φ, π≠π′} d_M(Ψ(π), Ψ(π′)).

On the other hand, since Γ is Lipschitz with constant C, we have

d_M(Γ(µ_i), Γ(µ_{i+1})) ≤ C · d_M(µ_i, µ_{i+1}) < min_{π,π′∈Π_Φ, π≠π′} d_M(Ψ(π), Ψ(π′)),

which is a contradiction. Thus, Γ cannot be Lipschitz continuous and by extension cannot be contractive.
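The obstruction in this proof can also be observed numerically: once the best response Φ selects among finitely many deterministic policies, the composed operator Γ = Ψ ∘ Φ is piecewise constant and exact fixed point iteration oscillates. The sketch below uses an assumed two-action congestion reward r(a, µ) = −µ(a) in the spirit of the LR toy problem discussed above, with the mean field summarized by the population's action distribution; it is an illustration, not the exact game configuration from the experiments.

```python
import numpy as np

# Two actions, one decision epoch; mu is the population's action distribution
# (equivalently, the next-state distribution in an LR-style game). Assumed setup.
def best_response(mu):
    """Phi: deterministic greedy policy against the action distribution mu."""
    rewards = -mu                        # congestion-style reward, assumed
    pi = np.zeros(2)
    pi[np.argmax(rewards)] = 1.0         # ties broken towards action 0
    return pi

def induced_mean_field(pi):
    """Psi: with identically acting agents, the action distribution equals the policy."""
    return pi.copy()

mu = np.array([1.0, 0.0])
for k in range(6):                       # exact fixed point iteration Gamma = Psi o Phi
    mu = induced_mean_field(best_response(mu))
    print(k, mu)                         # oscillates between [0, 1] and [1, 0]
```

Any Lipschitz constant for Γ would have to bound the jump between these two outputs over arbitrarily small perturbations of µ around (1/2, 1/2), which is exactly the contradiction derived above.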
B.7 Proof of Theorem 3

Proof.
For all η > 0, µ ∈ M, t ∈ T, s ∈ S, a ∈ A, the soft action-value function of the MDP induced by µ ∈ M is given by

Q̃_η(µ, t, s, a) = r(s, a, µ_t) + Σ_{s′∈S} p(s′|s, a, µ_t) · η log Σ_{a′∈A} q_{t+1}(a′|s′) exp( Q̃_η(µ, t+1, s′, a′) / η )

with terminal condition Q̃_η(µ, T−1, s, a) ≡ r(s, a, µ_{T−1}). Analogously, the action-value function of the MDP induced by µ ∈ M is given by

Q*(µ, t, s, a) = r(s, a, µ_t) + Σ_{s′∈S} p(s′|s, a, µ_t) max_{a′∈A} Q*(µ, t+1, s′, a′),

and the similarly defined policy action-value function for π ∈ Π is given by

Q^π(µ, t, s, a) = r(s, a, µ_t) + Σ_{s′∈S} p(s′|s, a, µ_t) Σ_{a′∈A} π_{t+1}(a′|s′) Q^π(µ, t+1, s′, a′),

with terminal conditions Q*(µ, T−1, s, a) ≡ Q^π(µ, T−1, s, a) ≡ r(s, a, µ_{T−1}).

We will show that we can find a Lipschitz constant K_{Q̃_η} of Q̃_η that is independent of η if η is not arbitrarily small. To show this, we will explicitly compute such a Lipschitz constant. Note first that Q̃_η, Q* and Q^π are all uniformly bounded by M_Q ≡ |T| M_r by assumption, where M_r is the uniform bound of r.
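A minimal sketch of the soft backward recursion above, evaluated for one fixed mean field, is given below; the stand-in reward, transition kernel and prior policy in the usage example are placeholder assumptions.

```python
import numpy as np

def soft_q(mu, reward, transition, prior, eta):
    """Soft action values Q~_eta(mu, t, s, a) via the log-sum-exp backward recursion.

    mu:         array (T, |S|)              fixed mean field
    reward:     r(s, a, mu_t)    -> float
    transition: p(.|s, a, mu_t)  -> array (|S|,)
    prior:      array (T, |S|, |A|)          prior policy q_t(a|s)
    """
    T, S = mu.shape
    A = prior.shape[-1]
    Q = np.zeros((T, S, A))
    for s in range(S):
        for a in range(A):
            Q[T - 1, s, a] = reward(s, a, mu[T - 1])     # terminal condition
    for t in range(T - 2, -1, -1):
        # soft value V~(t+1,s') = eta * log sum_a' q_{t+1}(a'|s') exp(Q(t+1,s',a')/eta),
        # computed with the usual max-shift for numerical stability
        m = Q[t + 1].max(axis=-1, keepdims=True)
        V = m[:, 0] + eta * np.log(np.sum(prior[t + 1] * np.exp((Q[t + 1] - m) / eta), axis=-1))
        for s in range(S):
            for a in range(A):
                Q[t, s, a] = reward(s, a, mu[t]) + transition(s, a, mu[t]) @ V
    return Q

# Example usage with arbitrary stand-in dynamics (2 states, 2 actions, horizon 5):
mu = np.full((5, 2), 0.5)
r = lambda s, a, m: -m[s]
p = lambda s, a, m: np.full(2, 0.5)
q = np.full((5, 2, 2), 0.5)
print(soft_q(mu, r, p, q, eta=0.5)[0])
```

The η-RelEnt policy Φ̃_η(µ) used later in this proof is then proportional to q_t(a|s) exp(Q̃_η(µ, t, s, a)/η).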
Lemma B.7.1. The functions Q̃_η(µ, t, s, a), Q*(µ, t, s, a) and Q^π(µ, t, s, a) are uniformly bounded for all η > 0, µ ∈ M, t ∈ T, s ∈ S, a ∈ A by

|Q̃_η(µ, t, s, a)| ≤ (T − t) M_r ≤ T M_r =: M_Q,

where M_r is the uniform bound |r(s, a, µ_t)| ≤ M_r and T = |T|.

Proof. Make the induction hypothesis for all t ∈ T that

|Q̃_η(µ, t, s, a)| ≤ (T − t) M_r

for all η > 0, µ ∈ M, s ∈ S, a ∈ A, and note that this holds for t = T − 1, as by assumption

|Q̃_η(µ, T−1, s, a)| = |r(s, a, µ_{T−1})| ≤ M_r.

The induction step from t + 1 to t holds by

|Q̃_η(µ, t, s, a)|
= | r(s, a, µ_t) + Σ_{s′∈S} p(s′|s, a, µ_t) · η log Σ_{a′∈A} q_{t+1}(a′|s′) exp( Q̃_η(µ, t+1, s′, a′) / η ) |
≤ |r(s, a, µ_t)| + η max_{s′∈S} | log Σ_{a′∈A} q_{t+1}(a′|s′) exp( Q̃_η(µ, t+1, s′, a′) / η ) |
≤ M_r + η | log exp( (T − t − 1) M_r / η ) |
= M_r + (T − t − 1) M_r = (T − t) M_r.

By maximizing over all t ∈ T, we obtain the uniform bound. The other cases are analogous. ∎

Now we can find a Lipschitz constant of Q̃_η(µ, t, s, a) that is independent of η.

Lemma B.7.2.
Let C r be a Lipschitz constant of µ → r ( s, a, µ t ) and C p a Lipschitz constant of µ → p ( s (cid:48) | s, a, µ t ) .Further, let η min > . Then, for all η > η min , t ∈ T , the map µ (cid:55)→ ˜ Q η ( µ, t, s, a ) is Lipschitz for all s ∈ S , a ∈ A with a Lipschitz constant K t ˜ Q η independent of η . Therefore, by picking K ˜ Q η ≡ max t ∈T K t ˜ Q η , we have one singleLipschitz constant for all η > η min , t ∈ T , s ∈ S , a ∈ A .Proof. We show by induction that for all t ∈ T , s ∈ S , a ∈ A , we can find Lipschitz constants such that ˜ Q η ( µ, t, s, a ) is Lipschitz in µ with a Lipschitz constant that does not depend on η .To see this, note that this is true for t = T − and any s ∈ S , a ∈ A , as for any µ, µ (cid:48) we have (cid:12)(cid:12)(cid:12) ˜ Q η ( µ, T − , s, a ) − ˜ Q η ( µ (cid:48) , T − , s, a ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12) r ( s, a, µ T − ) − r ( s, a, µ (cid:48) T − ) (cid:12)(cid:12) ≤ C r d M ( µ, µ (cid:48) ) . The induction step from t + 1 to t is (cid:12)(cid:12)(cid:12) ˜ Q η ( µ, t, s, a ) − ˜ Q η ( µ, t, s, a ) (cid:12)(cid:12)(cid:12) ≤ | r ( s, a, µ t ) − r ( s, a, µ (cid:48) t ) | + (cid:88) s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p ( s (cid:48) | s, a, µ t ) η log (cid:88) a (cid:48) ∈A q t +1 ( a (cid:48) | s (cid:48) ) exp (cid:32) ˜ Q η ( µ, t + 1 , s (cid:48) , a (cid:48) ) η (cid:33) − p ( s (cid:48) | s, a, µ (cid:48) t ) η log (cid:88) a (cid:48) ∈A q t +1 ( a (cid:48) | s (cid:48) ) exp (cid:32) ˜ Q η ( µ (cid:48) , t + 1 , s (cid:48) , a (cid:48) ) η (cid:33)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C r d M ( µ, µ (cid:48) ) + η |S| max s (cid:48) ∈S · (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) log (cid:88) a (cid:48) ∈A q t +1 ( a (cid:48) | s (cid:48) ) exp (cid:32) ˜ Q η ( µ, t + 1 , s (cid:48) , a (cid:48) ) η (cid:33) − log (cid:88) a (cid:48) ∈A q t +1 ( a (cid:48) | s (cid:48) ) exp (cid:32) ˜ Q η ( µ (cid:48) , t + 1 , s (cid:48) , a (cid:48) ) η (cid:33)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + η |S| max s (cid:48) ∈S M Q η · | p ( s (cid:48) | s, a, µ t ) − p ( s (cid:48) | s, a, µ (cid:48) t ) |≤ C r d M ( µ, µ (cid:48) ) + η |S| max s (cid:48) ∈S (cid:88) a (cid:48) ∈A (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) η q t +1 ( a (cid:48) | s (cid:48) ) exp (cid:16) ξ a (cid:48) η (cid:17)(cid:80) a (cid:48)(cid:48) ∈A q t +1 ( a (cid:48)(cid:48) | s (cid:48) ) exp (cid:16) ξ a (cid:48)(cid:48) η (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12) ˜ Q η ( µ, t + 1 , s (cid:48) , a (cid:48) ) − ˜ Q η ( µ (cid:48) , t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12) + |S| M Q · C p d M ( µ, µ (cid:48) ) ≤ C r d M ( µ, µ (cid:48) ) + |A| q max |A| q min exp (cid:18) · M Q η (cid:19) K t +1˜ Q η d M ( µ, µ (cid:48) ) + |S| M Q C p d M ( µ, µ (cid:48) ) < (cid:18) C r + q max q min exp (cid:18) M Q η min (cid:19) K t +1˜ Q η + |S| M Q C p (cid:19) d M ( µ, µ (cid:48) ) ai Cui, Heinz Koeppl where we use the mean value theorem to obtain some ξ a ∈ [ − M Q , M Q ] for all a ∈ A bounded by Lemma B.7.1,Lemma B.2.1 for the second inequality, and defined q max = max t ∈T ,s ∈S ,a ∈A q t ( a | s ) , q min = min t ∈T ,s ∈S ,a ∈A q t ( a | s ) . Since s ∈ S , a ∈ A were arbitrary, this holds for all s ∈ S , a ∈ A .Thus, as long as η > η min , we have the Lipschitz constant K t ˜ Q η ≡ (cid:16) C r + q max q min exp (cid:16) M Q η min (cid:17) K t +1˜ Q η + |S| M Q C p (cid:17) independent of η , since by induction assumption K t +1˜ Q η is independent of η . 
∎

The optimal action-value function and the policy action-value function for any fixed policy are Lipschitz in µ.

Lemma B.7.3.
The functions µ (cid:55)→ Q ∗ ( µ, t, s, a ) and µ (cid:55)→ Q π ( µ, t, s, a ) for any fixed π ∈ Π , t ∈ T , s ∈ S , a ∈ A are Lipschitz continuous. Therefore, for any fixed π ∈ Π we can choose a Lipschitz constant K Q for all t ∈ T , s ∈ S , a ∈ A by taking the maximum over all Lipschitz constants.Proof. The action-value function is given by the recursion Q ∗ ( µ, t, s, a ) = r ( s, a, µ t ) + (cid:88) s (cid:48) ∈S p ( s (cid:48) | s, a, µ t ) max a (cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48) ) with terminal condition Q ∗ ( µ, T − , s, a ) ≡ r ( s, a, µ T − ) . The functions r ( s, a, µ t ) and p ( s (cid:48) | s, a, µ t ) are Lipschitzcontinuous by Assumption 2. Note that for any µ, µ (cid:48) ∈ M and any t ∈ T , d T V ( µ t , µ (cid:48) t ) ≤ d M ( µ, µ (cid:48) ) . Therefore,the terminal condition and all terms in the above recursion are Lipschitz. Further, Q ∗ ( µ, t, s, a ) is uniformlybounded, since r is assumed uniformly bounded.Since a finite maximum, product and sum of Lipschitz and bounded functions is again Lipschitz and bounded byLemma B.2.1, we obtain Lipschitz constants K Q,t,s,a of the maps µ → Q ∗ ( µ, t, s, a ) for any t ∈ T , s ∈ S , a ∈ A and define K Q ≡ max t ∈T ,s ∈S ,a ∈A K Q,t,s,a . The case for Q π with fixed π ∈ Π is analogous. (cid:4) The same holds for Ψ( π ) mapping from policy π to its induced mean field. Lemma B.7.4.
The function Ψ( π ) is Lipschitz with some Lipschitz constant K Ψ .Proof. Recall that Ψ( π ) maps to the mean field µ starting with µ and obtained by the recursion µ t +1 ( s (cid:48) ) = (cid:88) s ∈S (cid:88) a ∈A p ( s (cid:48) | s, a, µ t ) π t ( a | s ) µ t ( s ) . We proceed analogously to Lemma B.7.3. µ is uniformly bounded by normalization. The constant function π (cid:55)→ µ ( s ) is Lipschitz and bounded for any s ∈ S . The functions r ( s, a, µ t ) and p ( s (cid:48) | s, a, µ t ) are Lipschitzcontinuous by Assumption 2. Since a finite sum, product and composition of Lipschitz and bounded functions isagain Lipschitz and bounded by Lemma B.2.1, we obtain Lipschitz constants K Ψ ,t,s of the maps π → µ t ( s ) forany t ∈ T , s ∈ S and define K Ψ ≡ max t ∈T ,s ∈S K Ψ ,t,s , which is the desired Lipschitz constant of Ψ . (cid:4) Finally, the map from an energy function to its associated Boltzmann distribution is Lipschitz for any η > witha Lipschitz constant explicitly depending on η . Lemma B.7.5.
Let η > arbitrary and f a : M → R be a Lipschitz continuous function with Lipschitz constant K f for any a ∈ A . Further, let g : A → R be bounded by g max > g ( a ) > g min > for any a ∈ A . The function µ (cid:55)→ g ( a ) exp (cid:16) f a ( µ ) η (cid:17)(cid:80) a (cid:48) ∈A g ( a (cid:48) ) exp (cid:16) f a (cid:48) ( µ ) η (cid:17) is Lipschitz with Lipschitz constant K = ( |A|− K f g ηg for any a ∈ A .Proof. Let µ, µ (cid:48) ∈ M be arbitrary and define ∆ a f a (cid:48) ( µ ) ≡ f a (cid:48) ( µ ) − f a ( µ ) pproximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning for any a (cid:48) ∈ A , which is Lipschitz with constant K f . Then, we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) g ( a ) exp (cid:16) f a ( µ ) η (cid:17)(cid:80) a (cid:48) ∈A g ( a (cid:48) ) exp (cid:16) f a (cid:48) ( µ ) η (cid:17) − g ( a ) exp (cid:16) f a ( µ (cid:48) ) η (cid:17)(cid:80) a (cid:48) ∈A g ( a (cid:48) ) exp (cid:16) f a (cid:48) ( µ (cid:48) ) η (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)
11 + (cid:80) a (cid:48) (cid:54) = a g ( a (cid:48) ) g ( a ) exp (cid:16) ∆ a f a (cid:48) ( µ ) η (cid:17) −
11 + (cid:80) a (cid:48) (cid:54) = a g ( a (cid:48) ) g ( a ) exp (cid:16) ∆ a f a (cid:48) ( µ (cid:48) ) η (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) (cid:54) = a g ( a (cid:48) ) g ( a ) · η exp (cid:16) ξ a (cid:48) η (cid:17)(cid:16) (cid:80) a (cid:48)(cid:48) (cid:54) = a g ( a (cid:48)(cid:48) ) g ( a ) exp (cid:16) ξ a (cid:48)(cid:48) η (cid:17)(cid:17) · (∆ a f a (cid:48) ( µ ) − ∆ a f a (cid:48) ( µ (cid:48) )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:88) a (cid:48) (cid:54) = a (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) g max g min · η exp (cid:16) ξ a (cid:48) η (cid:17)(cid:16) g min g max exp (cid:16) ξ a (cid:48) η (cid:17)(cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) · | ∆ a f a (cid:48) ( µ ) − ∆ a f a (cid:48) ( µ (cid:48) ) |≤ g ηg · (cid:88) a (cid:48) (cid:54) = a K f d M ( µ, µ (cid:48) ) = ( |A| − K f g ηg · d M ( µ, µ (cid:48) ) where we applied the mean value theorem to obtain some ξ a (cid:48) ∈ R for all a (cid:48) ∈ A and used the maximum c of thefunction ˜ f ( x ) = exp( x/η )(1+ c · exp( x/η )) at x = 0 . (cid:4) For RelEnt MFE, by Lemma B.7.2 we obtain a Lipschitz constant K ˜ Q η of µ → ˜ Q η ( µ, t, s, a ) as long as η > η min for some η min > . Furthermore, note that for ˜ π µ,η ≡ ˜Φ η ( µ ) , we have (cid:12)(cid:12)(cid:12) ˜ π µ,ηt ( a | s ) − ˜ π µ (cid:48) ,ηt ( a | s )) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) q t ( a | s ) exp (cid:16) ˜ Q η ( µ,t,s,a ) η (cid:17)(cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) exp (cid:16) ˜ Q η ( µ,t,s,a (cid:48) ) η (cid:17) − q t ( a | s ) exp (cid:16) ˜ Q η ( µ (cid:48) ,t,s,a ) η (cid:17)(cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) exp (cid:16) ˜ Q η ( µ (cid:48) ,t,s,a (cid:48) ) η (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . We obtain the Lipschitz constant of ˜Φ η by applying Lemma B.7.5 to each of the maps given by µ (cid:55)→ q t ( a | s ) exp (cid:16) ˜ Q η ( µ,t,s,a ) η (cid:17)(cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) exp (cid:16) ˜ Q η ( µ,t,s,a (cid:48) ) η (cid:17) for all t ∈ T , s ∈ S , a ∈ A , resulting in the Lipschitz property d Π ( ˜Φ η ( µ ) , ˜Φ η ( µ (cid:48) )) = max s ∈S max t ∈T (cid:88) a ∈A (cid:12)(cid:12)(cid:12) ˜ π µ,ηt ( a | s ) − ˜ π µ (cid:48) ,ηt ( a | s )) (cid:12)(cid:12)(cid:12) ≤ (cid:88) a ∈A ( |A| − K ˜ Q η q ηq · d M ( µ, µ (cid:48) ) = |A| ( |A| − K ˜ Q η q ηq · d M ( µ, µ (cid:48) ) , where we define q max = max t ∈T ,s ∈S ,a ∈A q t ( a | s ) and analogously q min = min t ∈T ,s ∈S ,a ∈A q t ( a | s ) .By Lemma B.7.4, Ψ( π ) is Lipschitz with some Lipschitz constant K Ψ . Therefore, the resulting Lipschitz constantof the composition ˜Γ η = Ψ ◦ ˜Φ η is |A| ( |A|− K ˜ Qη K Ψ q ηq and leads to a contraction for any η > max (cid:32) η min , |A| ( |A| − K ˜ Q η K Ψ q q (cid:33) . Analogously for Boltzmann MFE, by Lemma B.7.3 the mapping µ → Q ∗ ( µ, t, s, a ) is Lipschitz with some Lipschitzconstant K Q ∗ for all t ∈ T , s ∈ S , a ∈ A . 
For π µ,η ≡ Φ η ( µ ) , we have (cid:12)(cid:12)(cid:12) π µ,ηt ( a | s ) − π µ (cid:48) ,ηt ( a | s )) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) q t ( a | s ) exp (cid:16) Q ∗ ( µ,t,s,a ) η (cid:17)(cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) exp (cid:16) Q ∗ ( µ,t,s,a (cid:48) ) η (cid:17) − q t ( a | s ) exp (cid:16) Q ∗ ( µ (cid:48) ,t,s,a ) η (cid:17)(cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) exp (cid:16) Q ∗ ( µ (cid:48) ,t,s,a (cid:48) ) η (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . ai Cui, Heinz Koeppl We obtain the Lipschitz constant of Φ η by applying Lemma B.7.5 to each of the maps given by µ (cid:55)→ q t ( a | s ) exp (cid:16) Q ∗ ( µ,t,s,a ) η (cid:17)(cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) exp (cid:16) Q ∗ ( µ,t,s,a (cid:48) ) η (cid:17) for all t ∈ T , s ∈ S , a ∈ A , resulting in the Lipschitz property d Π (Φ η ( µ ) , Φ η ( µ (cid:48) )) = max s ∈S max t ∈T (cid:88) a ∈A (cid:12)(cid:12)(cid:12) π µ,ηt ( a | s ) − π µ (cid:48) ,ηt ( a | s )) (cid:12)(cid:12)(cid:12) ≤ (cid:88) a ∈A ( |A| − K Q ∗ q ηq · d M ( µ, µ (cid:48) ) = |A| ( |A| − K Q ∗ q ηq · d M ( µ, µ (cid:48) ) . By Lemma B.7.4, Ψ( π ) is Lipschitz with some Lipschitz constant K Ψ . The resulting Lipschitz constant of thecomposition Γ η = Ψ ◦ Φ η is |A| ( |A|− K Q ∗ K Ψ q ηq and leads to a contraction for any η > |A| ( |A| − K Q ∗ K Ψ q q where for the uniform prior policy, q max = q min . If required, the Lipschitz constants can be computed recursivelyaccording to Lemma B.2.1. B.8 Proof of Theorem 4
Proof.
Consider any sequence ( π ∗ n , µ ∗ n ) n ∈ N of η n -Boltzmann or η n -RelEnt MFE with η n → + as n → ∞ . Notethat a pair ( π ∗ n , µ ∗ n ) is completely specified by µ ∗ n , since π ∗ n = Φ η n ( µ ∗ n ) or π ∗ n = ˜Φ η n ( µ ∗ n ) uniquely. Therefore,it suffices to show that the associated functions ( µ (cid:55)→ Q Φ ηn ( µ ) ( µ, t, s, a )) n ∈ N and ( µ (cid:55)→ Q ˜Φ ηn ( µ ) ( µ, t, s, a )) n ∈ N converge uniformly to µ (cid:55)→ Q ∗ ( µ, t, s, a ) , from which the desired result will follow. For definitions of the differentaction-value functions, see Appendix B.7.Note that pointwise convergence is insufficient, since there is no guarantee that µ ∗ n itself will converge as n → ∞ .However, we can obtain uniform convergence by pointwise convergence and equicontinuity. For RelEnt MFE, wewill additionally require uniform convergence of the sequence ( µ (cid:55)→ ˜ Q η n ( µ, t, s, a )) n ∈ N with η n → + . We beginwith pointwise convergence of ( µ (cid:55)→ Q Φ ηn ( µ ) ( µ, t, s, a )) n ∈ N to the optimal action-value function µ (cid:55)→ Q ∗ ( µ, t, s, a ) . Lemma B.8.1.
Any sequence of functions ( µ (cid:55)→ Q Φ ηn ( µ ) ( µ, t, s, a )) n ∈ N with η n → + converges pointwise to µ (cid:55)→ Q ∗ ( µ, t, s, a ) for all t ∈ T , s ∈ S , a ∈ A .Proof. Fix µ ∈ M . We make the induction hypothesis for arbitrary t ∈ T that for all s ∈ S , a ∈ A , ε > , thereexists n (cid:48) ∈ N such that for any n > n (cid:48) we have (cid:12)(cid:12)(cid:12) Q Φ ηn ( µ ) ( µ, t, s, a ) − Q ∗ ( µ, t, s, a ) (cid:12)(cid:12)(cid:12) < ε . The induction hypothesis is fulfilled for t = T − , as by definition (cid:12)(cid:12)(cid:12) Q Φ ηn ( µ ) ( µ, t, s, a ) − Q ∗ ( µ, t, s, a ) (cid:12)(cid:12)(cid:12) = | r ( s, a, µ t ) − r ( s, a, µ t ) | = 0 . Assume that the induction hypothesis is fulfilled for t + 1 , then at time t let s ∈ S , a ∈ A , ε > arbitrary.Furthermore, let s (cid:48) ∈ S arbitrary. Collect all optimal actions into a set A s (cid:48) opt ⊆ A , i.e. for a (cid:48) ∈ A s (cid:48) opt we have Q ∗ ( µ, t, s (cid:48) , a opt ) = max a ∈A Q ∗ ( µ, t, s (cid:48) , a ) . We define the minimal action gap ∆ Q s (cid:48) ,µ min ≡ min a opt ∈A s (cid:48) opt ,a sub ∈A\A s (cid:48) opt ( Q ∗ ( µ, t, s (cid:48) , a opt ) − Q ∗ ( µ, t, s (cid:48) , a sub )) > pproximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning such that for arbitrary suboptimal actions a sub ∈ A \ A s (cid:48) opt and optimal actions a opt ∈ A s (cid:48) opt , Q ∗ ( µ, t, s (cid:48) , a opt ) − Q ∗ ( µ, t, s (cid:48) , a sub ) ≥ ∆ Q s (cid:48) ,µ min . This is well defined if there are suboptimal actions, since there is always at least one optimal action. If all actionsare optimal, we can skip bounding the probability of taking suboptimal actions and the result will hold trivially.Thus, we assume henceforth that there exists a suboptimal action.It follows that the probability of taking suboptimal actions a sub ∈ A \ A s (cid:48) opt disappears, since (Φ η n ( µ )) t ( a sub | s (cid:48) ) = q t ( a sub | s ) (cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) exp (cid:16) Q ∗ ( µ,t,s,a (cid:48) ) − Q ∗ ( µ,t,s,a sub ) η (cid:17) ≤
11 + (cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) q t ( a sub | s ) exp (cid:16) Q ∗ ( µ,t,s,a (cid:48) ) − Q ∗ ( µ,t,s,a sub ) η (cid:17) ≤ | s )1 + q t ( a opt | s ) q t ( a sub | s ) exp (cid:16) Q ∗ ( µ,t,s,a opt ) − Q ∗ ( µ,t,s,a sub ) η (cid:17) ≤ | s )1 + q t ( a opt | s ) q t ( a sub | s ) exp (cid:18) ∆ Q s (cid:48) ,µ min η (cid:19) → as η → + for some arbitrary optimal action a opt ∈ A s (cid:48) opt . Since s (cid:48) ∈ S was arbitrary, this holds for all s (cid:48) ∈ S .Therefore, by finiteness of S and A we can choose n ∈ N such that for all n > n and for all a sub ∈ A \ A s (cid:48) opt wehave η n sufficiently small such that (Φ η n ( µ )) t ( a sub | s (cid:48) ) < ε |A| M Q where M Q is the uniform bound of Q Φ ηn ( µ ) .Further, by induction assumption, we can choose n s (cid:48) ,a (cid:48) for any s (cid:48) ∈ S , a (cid:48) ∈ A such that for all n > n s (cid:48) ,a (cid:48) we have (cid:12)(cid:12)(cid:12) Q Φ ηn ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12) < ε Therefore, as long as n > n (cid:48) ≡ max( n , max s (cid:48) ∈S ,a (cid:48) ∈A n s (cid:48) ,a (cid:48) ) , we have (cid:12)(cid:12)(cid:12) Q Φ ηn ( µ ) ( µ, t, s, a ) − Q ∗ ( µ, t, s, a ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) s (cid:48) ∈S p ( s (cid:48) | s, a, µ t ) (cid:32) (cid:88) a (cid:48) ∈A (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) Q Φ ηn ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − max a (cid:48)(cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48)(cid:48) ) (cid:33)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) Q Φ ηn ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − max a (cid:48)(cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48)(cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) Q Φ ηn ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − max a (cid:48)(cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48)(cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A\A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) Q Φ ηn ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) Q Φ ηn ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − (cid:88) a (cid:48) ∈A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) max a (cid:48)(cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48)(cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ai Cui, Heinz Koeppl + max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) max a (cid:48)(cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48)(cid:48) ) − max a (cid:48)(cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48)(cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A\A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) Q Φ ηn ( µ ) ( µ, 
t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ max s (cid:48) ∈S max a (cid:48) ∈A s (cid:48) opt (cid:12)(cid:12)(cid:12)(cid:12) Q Φ ηn ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − max a (cid:48)(cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48)(cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12) + max s (cid:48) ∈S M Q (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − (cid:88) a (cid:48) ∈A\A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + max s (cid:48) ∈S M Q (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A\A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) < ε ε |A| M Q · |A| M Q + ε |A| M Q · |A| M Q = ε . Since s ∈ S , a ∈ A , ε > were arbitrary, the desired result follows immediately by induction. (cid:4) As we have no control over µ ∗ n and the sequence ( π ∗ n , µ ∗ n ) n ∈ N may not even converge, pointwise convergence isinsufficient. To obtain uniform convergence, we shall use compactness of M and equicontinuity. Lemma B.8.2.
The family of functions
F ≡ { µ ↦ Q^{Φ_η(µ)}(µ, t, s, a) }_{η>0, t∈T, s∈S, a∈A} is equicontinuous, i.e. for any ε > 0 and any µ ∈ M, we can choose a δ > 0 such that for all µ′ ∈ M with d_M(µ, µ′) < δ and any f ∈ F we have |f(µ) − f(µ′)| < ε.
Fix an arbitrary µ ∈ M . We make the (backwards in time) induction hypothesis for all t ∈ T that for any s ∈ S , a ∈ A , ε t,s,a > , there exists δ t,s,a > such that for any µ (cid:48) ∈ M with d M ( µ, µ (cid:48) ) < δ t,s,a and any f ∈ F we have (cid:12)(cid:12)(cid:12) Q Φ η ( µ ) ( µ, t, s, a ) − Q Φ η ( µ (cid:48) ) ( µ (cid:48) , t, s, a ) (cid:12)(cid:12)(cid:12) < ε t,s,a . The induction hypothesis is fulfilled for t = T − , as by assumption, ν → r ( s, a, ν t ) is Lipschitz with constant C r > . Therefore, for all s ∈ S , a ∈ A we can choose δ T − ,s,a = ε t,s,a C r such that for any µ, µ (cid:48) with d M ( µ, µ (cid:48) ) < δ (cid:48) we have (cid:12)(cid:12)(cid:12) Q Φ η ( µ ) ( µ, t, s, a ) − Q Φ η ( µ (cid:48) ) ( µ (cid:48) , t, s, a ) (cid:12)(cid:12)(cid:12) = | r ( s, a, µ t ) − r ( s, a, µ (cid:48) t ) | ≤ C r d M ( µ, µ (cid:48) ) < ε t,s,a . Assume that the induction hypothesis holds for t + 1 , then at time t let ε t,s,a > , s ∈ S , a ∈ A arbitrary. Bydefinition, we have (cid:12)(cid:12)(cid:12) Q Φ η ( µ ) ( µ, t, s, a ) − Q Φ η ( µ (cid:48) ) ( µ (cid:48) , t, s, a ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) r ( s, a, µ t ) + (cid:88) s (cid:48) ∈S p ( s (cid:48) | s, a, µ t ) (cid:88) a (cid:48) ∈A (Φ η ( µ )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − r ( s, a, µ (cid:48) t ) − (cid:88) s (cid:48) ∈S p ( s (cid:48) | s, a, µ (cid:48) t ) (cid:88) a (cid:48) ∈A (Φ η ( µ (cid:48) )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ (cid:48) ) ( µ (cid:48) , t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ | r ( s, a, µ t ) − r ( s, a, µ (cid:48) t ) | + (cid:88) s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( p ( s (cid:48) | s, a, µ t ) − p ( s (cid:48) | s, a, µ (cid:48) t )) (cid:88) a (cid:48) ∈A (Φ η ( µ )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:88) s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p ( s (cid:48) | s, a, µ (cid:48) t ) (cid:88) a (cid:48) ∈A (cid:16) (Φ η ( µ )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − (Φ η ( µ (cid:48) )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ (cid:48) ) ( µ (cid:48) , t + 1 , s (cid:48) , a (cid:48) ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) pproximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning ≤ | r ( s, a, µ t ) − r ( s, a, µ (cid:48) t ) | + (cid:88) s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( p ( s (cid:48) | s, a, µ t ) − p ( s (cid:48) | s, a, µ (cid:48) t )) (cid:88) a (cid:48) ∈A (Φ η ( µ )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A s (cid:48) opt (cid:16) (Φ η ( µ )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − (Φ η ( µ (cid:48) )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ (cid:48) ) ( µ (cid:48) , t + 1 , s (cid:48) , a (cid:48) ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A\A s (cid:48) opt (cid:16) (Φ η ( µ )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − (Φ η ( µ (cid:48) )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ (cid:48) ) ( µ 
(cid:48) , t + 1 , s (cid:48) , a (cid:48) ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) where we define A s (cid:48) opt ⊆ A for any s (cid:48) ∈ S to include all optimal actions a opt ∈ A s (cid:48) opt such that Q ∗ ( µ, t, s (cid:48) , a opt ) = max a ∈A Q ∗ ( µ, t, s (cid:48) , a ) . We bound each of the four terms separately.For the first term, we choose δ t,s,a = ε t,s,a C r by Lipschitz continuity such that | r ( s, a, µ t ) − r ( s, a, µ (cid:48) t ) | < ε t,s,a for all µ (cid:48) with d M ( µ, µ (cid:48) ) < δ t,s,a .For the second term, we choose δ t,s,a = |S| M Q C p such that for any µ (cid:48) ∈ M with d M ( µ, µ (cid:48) ) < δ t,s,a we have (cid:88) s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( p ( s (cid:48) | s, a, µ t ) − p ( s (cid:48) | s, a, µ (cid:48) t )) (cid:88) a (cid:48) ∈A (Φ η ( µ )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ |S| C p d M ( µ, µ (cid:48) ) M Q < ε t,s,a where M Q denotes the uniform bound of Q and C p is the Lipschitz constant of ν (cid:55)→ p ( s (cid:48) | s, a, ν t ) .For the third and fourth term, we first fix s (cid:48) ∈ S and define the minimal action gap as ∆ Q s (cid:48) ,µ min ≡ min a opt ∈A s (cid:48) opt ,a sub ∈A\A s (cid:48) opt ( Q ∗ ( µ, t, s (cid:48) , a opt ) − Q ∗ ( µ, t, s (cid:48) , a sub )) . This is well defined if there are suboptimal actions, since there is always at least one optimal action. If all actionsare optimal, we can skip bounding the probability of taking suboptimal actions and the result will still hold.Henceforth, we assume that there exists a suboptimal action.By Lipschitz continuity of µ (cid:55)→ Q ∗ ( µ, t, s, a ) from Lemma B.7.3 implying uniform continuity, there exists some δ ,s (cid:48) t,s,a > such that | Q ∗ ( µ (cid:48) , t, s (cid:48) , a ) − Q ∗ ( µ, t, s (cid:48) , a ) | < ∆ Q s (cid:48) ,µ min for all µ (cid:48) ∈ M , a ∈ A where d M ( µ, µ (cid:48) ) < δ ,s (cid:48) t,s,a , and thus ∆ Q s (cid:48) ,µ (cid:48) min = min a opt ∈A s (cid:48) opt ,a sub ∈A\A s (cid:48) opt ( Q ∗ ( µ (cid:48) , t, s (cid:48) , a opt ) − Q ∗ ( µ (cid:48) , t, s (cid:48) , a sub )) > ∆ Q s (cid:48) ,µ min . Under this condition, we can now show that the probability of any suboptimal action can be controlled. Define R min q ≡ min t ∈T ,s ∈S ,a ∈A ,a (cid:48) ∈A q t ( a (cid:48) | s ) q t ( a | s ) > and R max q ≡ max t ∈T ,s ∈S ,a ∈A ,a (cid:48) ∈A q t ( a (cid:48) | s ) q t ( a | s ) > . Let a sub ∈ A \ A s (cid:48) opt , thenwe either have | (Φ η ( µ )) t +1 ( a sub | s (cid:48) ) − (Φ η ( µ (cid:48) )) t +1 ( a sub | s (cid:48) ) | ai Cui, Heinz Koeppl = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)
11 + (cid:80) a (cid:48) (cid:54) = a sub q t ( a (cid:48) | s (cid:48) ) q t ( a sub | s (cid:48) ) exp (cid:16) Q ∗ ( µ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ,t,s (cid:48) ,a sub ) η (cid:17) −
11 + (cid:80) a (cid:48) (cid:54) = a sub q t ( a (cid:48) | s (cid:48) ) q t ( a sub | s (cid:48) ) exp (cid:16) Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a sub ) η (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤
11 + max a (cid:48) (cid:54) = a sub R min q exp (cid:16) Q ∗ ( µ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ,t,s (cid:48) ,a sub ) η (cid:17) + 11 + max a (cid:48) (cid:54) = a sub R min q exp (cid:16) Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a sub ) η (cid:17) <
11 + R min q exp (cid:18) ∆ Q s (cid:48) ,µ min η (cid:19) + 11 + R min q exp (cid:18) ∆ Q s (cid:48) ,µ min η (cid:19) ≤
21 + R min q exp (cid:18) ∆ Q s (cid:48) ,µ min η (cid:19) < ε t,s,a M Q |A| if ε t,s,a > M Q |A| trivially, or otherwise if η < η s (cid:48) min with η s (cid:48) min ≡ ∆ Q s (cid:48) ,µ min (cid:16) M Q |A| ε t,s,a R min q − R min q (cid:17) , in which case we arbitrarily define δ ,s (cid:48) t,s,a = 1 , or if neither apply, then η ≥ η s (cid:48) min and thus | (Φ η ( µ )) t +1 ( a sub | s (cid:48) ) − (Φ η ( µ (cid:48) )) t +1 ( a sub | s (cid:48) ) | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)
11 + (cid:80) a (cid:48) (cid:54) = a sub q t ( a (cid:48) | s (cid:48) ) q t ( a sub | s (cid:48) ) exp (cid:16) Q ∗ ( µ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ,t,s (cid:48) ,a sub ) η (cid:17) −
11 + (cid:80) a (cid:48) (cid:54) = a sub q t ( a (cid:48) | s (cid:48) ) q t ( a sub | s (cid:48) ) exp (cid:16) Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a sub ) η (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:80) a (cid:48) (cid:54) = a sub q t ( a (cid:48) | s ) q t ( a sub | s (cid:48) ) (cid:16) exp (cid:16) Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a sub ) η (cid:17) − exp (cid:16) Q ∗ ( µ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ,t,s (cid:48) ,a sub ) η (cid:17)(cid:17) (1 + · · · ) · (1 + · · · ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ R max q (cid:88) a (cid:48) (cid:54) = a sub (cid:12)(cid:12)(cid:12)(cid:12) exp (cid:18) Q ∗ ( µ (cid:48) , t, s (cid:48) , a (cid:48) ) − Q ∗ ( µ (cid:48) , t, s (cid:48) , a sub ) η (cid:19) − exp (cid:18) Q ∗ ( µ, t, s (cid:48) , a (cid:48) ) − Q ∗ ( µ, t, s (cid:48) , a sub ) η (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ R max q (cid:88) a (cid:48) (cid:54) = a sub (cid:12)(cid:12)(cid:12)(cid:12) η exp (cid:18) ξ a (cid:48) η (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) | ( Q ∗ ( µ (cid:48) , t, s (cid:48) , a (cid:48) ) − Q ∗ ( µ (cid:48) , t, s (cid:48) , a sub )) − ( Q ∗ ( µ, t, s (cid:48) , a (cid:48) ) − Q ∗ ( µ, t, s (cid:48) , a sub )) |≤ R max q |A| · η s (cid:48) min exp (cid:18) M Q η s (cid:48) min (cid:19) ( | Q ∗ ( µ (cid:48) , t, s (cid:48) , a (cid:48) ) − Q ∗ ( µ, t, s (cid:48) , a (cid:48) ) | + | Q ∗ ( µ, t, s (cid:48) , a sub ) − Q ∗ ( µ (cid:48) , t, s (cid:48) , a sub ) | ) ≤ R max q |A| · η s (cid:48) min exp (cid:18) M Q η s (cid:48) min (cid:19) · K Q d M ( µ, µ (cid:48) ) < ε t,s,a M Q |A| by the mean value theorem with some ξ a (cid:48) ∈ [ − M Q , M Q ] for all a (cid:48) ∈ A , where we abbreviated the denominator (1 + · · · ) · (1 + · · · ) ≥ , as long as we choose δ ,s (cid:48) t,s,a = ε t,s,a η s (cid:48) min M Q |A| R max q · exp (cid:16) M Q η s (cid:48) min (cid:17) · K Q pproximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning and d M ( µ, µ (cid:48) ) < δ ,s (cid:48) t,s,a , where K Q is the Lipschitz constant of µ (cid:55)→ Q ∗ ( µ, t, s, a ) given by Lemma B.7.3.Since s (cid:48) ∈ S was arbitrary, we now define δ t,s,a ≡ min s (cid:48) ∈S δ ,s (cid:48) t,s,a , δ t,s,a ≡ min s (cid:48) ∈S δ ,s (cid:48) t,s,a and let d M ( µ, µ (cid:48) ) < min( δ t,s,a , δ t,s,a ) . Under these assumptions, for the third term we have approximate optimality for all optimalactions in A s (cid:48) opt , since by induction assumption we can choose δ t +1 ,s (cid:48) ,a (cid:48) for all s (cid:48) ∈ S , a (cid:48) ∈ A such that for all µ (cid:48) ∈ M with d M ( µ, µ (cid:48) ) < δ t +1 ,s (cid:48) ,a (cid:48) it holds that (cid:12)(cid:12)(cid:12) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − Q Φ η ( µ (cid:48) ) ( µ (cid:48) , t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12) < ε t,s,a |A| + 8 . 
Therefore, for all $\mu' \in \mathcal{M}$, as long as $d_{\mathcal{M}}(\mu,\mu') < \min_{s' \in \mathcal{S}, a' \in \mathcal{A}} \delta_{t+1,s',a'}$, we have
$$\max_{s' \in \mathcal{S}} \left| \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} (\Phi_\eta(\mu'))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right|$$
$$\leq \max_{s' \in \mathcal{S}} \left| \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right|$$
$$\quad + \max_{s' \in \mathcal{S}} \left| \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') - \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} (\Phi_\eta(\mu'))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right|$$
$$\leq \max_{s' \in \mathcal{S}} \max_{a' \in \mathcal{A}} \left| Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right|$$
$$\quad + \max_{s' \in \mathcal{S}} \left| \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} \left( (\Phi_\eta(\mu))_{t+1}(a'|s') - (\Phi_\eta(\mu'))_{t+1}(a'|s') \right) \left( Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') - Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') \right) \right|$$
$$\quad + \max_{s' \in \mathcal{S}} \left| \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} \left( (\Phi_\eta(\mu))_{t+1}(a'|s') - (\Phi_\eta(\mu'))_{t+1}(a'|s') \right) Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') \right|$$
$$\leq \max_{s' \in \mathcal{S}} \max_{a' \in \mathcal{A}} \left| Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right| + \max_{s' \in \mathcal{S}} \max_{a' \in \mathcal{A}} 2|\mathcal{A}| \left| Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') - Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') \right|$$
$$\quad + \max_{s' \in \mathcal{S}} \max_{a'' \in \mathcal{A}} \left| Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a'') \right| \cdot \left| \sum_{a' \in \mathcal{A} \setminus \mathcal{A}^{s'}_{\mathrm{opt}}} \left( (\Phi_\eta(\mu'))_{t+1}(a'|s') - (\Phi_\eta(\mu))_{t+1}(a'|s') \right) \right|$$
$$< (1 + 2|\mathcal{A}|) \cdot \frac{\varepsilon_{t,s,a}}{4|\mathcal{A}| + 8} + M_Q |\mathcal{A}| \cdot \frac{\varepsilon_{t,s,a}}{2 M_Q |\mathcal{A}|} < \varepsilon_{t,s,a},$$
where we use that for any $a' \in \mathcal{A}^{s'}_{\mathrm{opt}}$ we have $Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') = \max_{a'' \in \mathcal{A}} Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a'')$.
Analogously, for the fourth term we have
$$\max_{s' \in \mathcal{S}} \left| \sum_{a' \in \mathcal{A} \setminus \mathcal{A}^{s'}_{\mathrm{opt}}} \left( (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - (\Phi_\eta(\mu'))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right) \right|$$
$$\leq \max_{s' \in \mathcal{S}} \sum_{a' \in \mathcal{A} \setminus \mathcal{A}^{s'}_{\mathrm{opt}}} \left| (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right|$$
$$\quad + \max_{s' \in \mathcal{S}} \sum_{a' \in \mathcal{A} \setminus \mathcal{A}^{s'}_{\mathrm{opt}}} \left| (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') - (\Phi_\eta(\mu'))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right|$$
$$\leq \max_{s' \in \mathcal{S}} \max_{a' \in \mathcal{A}} \left| Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right| + \max_{s' \in \mathcal{S}} M_Q \sum_{a' \in \mathcal{A} \setminus \mathcal{A}^{s'}_{\mathrm{opt}}} \left| (\Phi_\eta(\mu))_{t+1}(a'|s') - (\Phi_\eta(\mu'))_{t+1}(a'|s') \right|$$
$$< \frac{\varepsilon_{t,s,a}}{4|\mathcal{A}| + 8} + M_Q |\mathcal{A}| \cdot \frac{\varepsilon_{t,s,a}}{2 M_Q |\mathcal{A}|} < \varepsilon_{t,s,a}$$
under the previous conditions, since as long as we have $d_{\mathcal{M}}(\mu,\mu') < \delta_{t+1,s',a'}$ for all $s' \in \mathcal{S}$, $a' \in \mathcal{A}$ from before, we have
$$\left| Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right| < \frac{\varepsilon_{t,s,a}}{4|\mathcal{A}| + 8} < \varepsilon_{t,s,a}.$$
Finally, by choosing $\delta_{t,s,a}$ such that all conditions are fulfilled, i.e.
$$\delta_{t,s,a} \equiv \min\left( \delta^1_{t,s,a},\, \delta^2_{t,s,a},\, \delta^3_{t,s,a},\, \delta^4_{t,s,a},\, \min_{s' \in \mathcal{S}, a' \in \mathcal{A}} \delta_{t+1,s',a'} \right) > 0,$$
the induction hypothesis is fulfilled, since then for any $\mu'$ with $d_{\mathcal{M}}(\mu,\mu') < \delta_{t,s,a}$ we have
$$\left| Q^{\Phi_\eta(\mu)}(\mu,t,s,a) - Q^{\Phi_\eta(\mu')}(\mu',t,s,a) \right| < \varepsilon_{t,s,a}.$$
Since $\eta > 0$ was arbitrary, the desired result follows immediately, as we can set $\varepsilon_{t,s,a} = \varepsilon$ for each $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$ and obtain $\delta \equiv \min_{t \in \mathcal{T}, s \in \mathcal{S}, a \in \mathcal{A}} \delta_{t,s,a}$, fulfilling the required equicontinuity property at $\mu$. $\square$
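The driving mechanism behind the bound on the second term above is that, for fixed $\eta > 0$, the prior-weighted softmax reacts continuously to perturbations of its action values, with a sensitivity factor of order $\frac{1}{\eta^{s'}_{\min}}\exp(M_Q/\eta^{s'}_{\min})$. The following minimal numerical sketch (not part of the proof; the four-action example, the uniform prior and all perturbation sizes are illustrative assumptions) makes this behaviour visible:

```python
import numpy as np

def boltzmann(q_values, prior, eta):
    # Prior-weighted softmax ("Boltzmann") policy over a single state.
    logits = np.log(prior) + q_values / eta
    logits -= logits.max()                      # numerical stabilization only
    weights = np.exp(logits)
    return weights / weights.sum()

rng = np.random.default_rng(0)
prior = np.full(4, 0.25)                        # uniform prior q(.|s)
q = rng.uniform(-1.0, 1.0, size=4)              # stand-in for Q*(mu, t+1, s', .)

for eta in [2.0, 0.5, 0.1]:
    for delta in [1e-1, 1e-2, 1e-3]:
        # The perturbation plays the role of K_Q * d_M(mu, mu') in the proof.
        q_pert = q + rng.uniform(-delta, delta, size=4)
        gap = np.abs(boltzmann(q, prior, eta) - boltzmann(q_pert, prior, eta)).max()
        print(f"eta={eta:4.1f}  |dQ| <= {delta:.0e}  max policy difference = {gap:.2e}")
```

For each fixed $\eta$, the printed policy difference shrinks roughly linearly with the perturbation, while smaller temperatures inflate the proportionality constant, which is exactly the trade-off that forces the case distinction on $\eta^{s'}_{\min}$ above.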
From equicontinuity, we get the desired uniform convergence via compactness.

Lemma B.8.3. If $(f_n)_{n \in \mathbb{N}}$ with $f_n \colon \mathcal{M} \to \mathbb{R}$ is an equicontinuous sequence of functions and for all $\mu \in \mathcal{M}$ we have $f_n(\mu) \to f(\mu)$ pointwise, then $f_n(\mu) \to f(\mu)$ uniformly.

Proof. Let $\varepsilon > 0$ be arbitrary. By equicontinuity, for any point $\mu \in \mathcal{M}$ there exists $\delta(\mu) > 0$ such that for all $\mu' \in \mathcal{M}$ with $d_{\mathcal{M}}(\mu,\mu') < \delta(\mu)$ we have $|f_n(\mu) - f_n(\mu')| < \frac{\varepsilon}{3}$ for all $n \in \mathbb{N}$, which via pointwise convergence implies $|f(\mu) - f(\mu')| \leq \frac{\varepsilon}{3}$.

Since $\mathcal{M}$ is compact, it is separable, i.e. there exists a countable dense subset $(\mu_j)_{j \in \mathbb{N}}$ of $\mathcal{M}$. Let $\delta(\mu)$ be as defined above and cover $\mathcal{M}$ by the open balls $(B_{\delta(\mu_j)}(\mu_j))_{j \in \mathbb{N}}$. By the compactness of $\mathcal{M}$, finitely many of these balls $B_{\delta(\mu_{n_1})}(\mu_{n_1}), \ldots, B_{\delta(\mu_{n_k})}(\mu_{n_k})$ cover $\mathcal{M}$. By pointwise convergence, for any $i = 1, \ldots, k$ we can find an integer $N_i$ such that for all $n > N_i$ we have $|f_n(\mu_{n_i}) - f(\mu_{n_i})| < \frac{\varepsilon}{3}$. Taken together, we find that for $n > \max_{i=1,\ldots,k} N_i$ and arbitrary $\mu \in \mathcal{M}$, we have
$$|f_n(\mu) - f(\mu)| \leq |f_n(\mu) - f_n(\mu_{n_i})| + |f_n(\mu_{n_i}) - f(\mu_{n_i})| + |f(\mu_{n_i}) - f(\mu)| < \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \varepsilon$$
for some center point $\mu_{n_i}$ of a ball containing $\mu$ from the finite cover. $\square$

Therefore, a sequence of Boltzmann MFE with vanishing $\eta$ is approximately optimal in the MFG.
Lemma B.8.4. For any sequence $(\pi^*_n, \mu^*_n)_{n \in \mathbb{N}}$ of $\eta_n$-Boltzmann MFE with $\eta_n \to 0^+$ and for any $\varepsilon > 0$ there exists an integer $N \in \mathbb{N}$ such that for all integers $n > N$ we have $J_{\mu^*_n}(\pi^*_n) \geq \max_\pi J_{\mu^*_n}(\pi) - \varepsilon$.

Proof.
By Lemma B.8.2,
$$F \equiv \left( \mu \mapsto Q^{\Phi_\eta(\mu)}(\mu,t,s,a) \right)_{\eta > 0,\, t \in \mathcal{T},\, s \in \mathcal{S},\, a \in \mathcal{A}}$$
is equicontinuous. Therefore, any sequence $(\mu \mapsto Q^{\Phi_{\eta_n}(\mu)}(\mu,t,s,a))_{n \in \mathbb{N}}$ with $\eta_n \to 0^+$ is also equicontinuous for any $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$. Furthermore, by Lemma B.8.1, the sequence $(\mu \mapsto Q^{\Phi_{\eta_n}(\mu)}(\mu,t,s,a))_{n \in \mathbb{N}}$ converges pointwise to $\mu \mapsto Q^*(\mu,t,s,a)$ for any $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.

By Lemma B.8.3, we thus have $|Q^{\Phi_{\eta_n}(\mu)}(\mu,t,s,a) - Q^*(\mu,t,s,a)| \to 0$ uniformly. Therefore, for any $\varepsilon > 0$, there exists an integer $N$ by uniform convergence such that for all integers $n > N$ we have
$$Q^{\pi^*_n}(\mu^*_n,t,s,a) \geq Q^*(\mu^*_n,t,s,a) - \varepsilon = \max_{\pi \in \Pi} Q^\pi(\mu^*_n,t,s,a) - \varepsilon,$$
and since by Lemma B.3.1 we have
$$J_{\mu^*_n}(\pi^*_n) = \sum_{s \in \mathcal{S}} \mu_0(s) \sum_{a \in \mathcal{A}} \pi^*_{n,0}(a \mid s)\, Q^{\pi^*_n}(\mu^*_n, 0, s, a) \geq \sum_{s \in \mathcal{S}} \mu_0(s) \max_{\pi \in \Pi} \sum_{a \in \mathcal{A}} \pi_0(a \mid s)\, Q^{\pi}(\mu^*_n, 0, s, a) - \varepsilon = \max_{\pi \in \Pi} J_{\mu^*_n}(\pi) - \varepsilon,$$
the desired result follows immediately. $\square$

Finally, we show approximate optimality in the actual $N$-agent game as long as a pair $(\pi^*, \mu^*) \in \Pi \times \mathcal{M}$ with $\mu^* = \Psi(\pi^*)$ has vanishing exploitability in the MFG. By Lemma B.8.4, for any sequence $(\pi^*_n, \mu^*_n)_{n \in \mathbb{N}}$ of $\eta_n$-Boltzmann MFE with $\eta_n \to 0^+$ and for any $\varepsilon > 0$ there exists an integer $n' \in \mathbb{N}$ such that for all integers $n > n'$ we have $J_{\mu^*_n}(\pi^*_n) \geq \max_\pi J_{\mu^*_n}(\pi) - \varepsilon$. Let $\varepsilon' > 0$ be arbitrary and choose a sequence of optimal policies $\{\pi^N\}_{N \in \mathbb{N}}$ such that for all $N \in \mathbb{N}$ we have $\pi^N \in \arg\max_{\pi \in \Pi} J^N(\pi, \pi^*_n, \ldots, \pi^*_n)$. By Lemma B.5.6 there exists $N' \in \mathbb{N}$ such that for all $N > N'$ and all $n > n'$, we have
$$\max_{\pi \in \Pi} J^N(\pi, \pi^*_n, \ldots, \pi^*_n) - \varepsilon - \varepsilon' \leq \max_{\pi \in \Pi} J_{\mu^*_n}(\pi) - \varepsilon - \varepsilon' \leq J_{\mu^*_n}(\pi^*_n) - \varepsilon' \leq J^N(\pi^*_n, \pi^*_n, \ldots, \pi^*_n),$$
which is the desired approximate Nash equilibrium property, since $\varepsilon, \varepsilon'$ are arbitrary. This applies by symmetry to all agents.
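The quantity controlled in the preceding argument, $\max_\pi J_{\mu^*}(\pi) - J_{\mu^*}(\pi^*)$, is the exploitability of a candidate policy. The sketch below (not from the paper; the crowd-aversion reward $r(s,a,\mu_t) = -\mu_t(s)$, the $\mu$-independent dynamics and all sizes are illustrative assumptions) computes it exactly on a tiny finite MFG, using one backward induction for the best response and one for the candidate policy:

```python
import numpy as np

S, A, T = 3, 2, 5
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a] = p(.|s, a), mu-independent for simplicity
mu0 = np.full(S, 1.0 / S)

def reward(mu_t):
    # r(s, a, mu_t) = -mu_t(s): agents dislike crowded states.
    return -np.tile(mu_t[:, None], (1, A))

def mean_field(pi):
    # Forward equation mu_{t+1}(s') = sum_s mu_t(s) sum_a pi_t(a|s) p(s'|s,a).
    mu = np.zeros((T, S)); mu[0] = mu0
    for t in range(T - 1):
        mu[t + 1] = np.einsum("s,sa,saj->j", mu[t], pi[t], P)
    return mu

def best_response_value(mu):
    # Backward induction for max_pi J_mu(pi).
    V = np.zeros(S)
    for t in reversed(range(T)):
        Q = reward(mu[t]) + P @ V
        V = Q.max(axis=1)
    return mu0 @ V

def policy_value(pi, mu):
    V = np.zeros(S)
    for t in reversed(range(T)):
        Q = reward(mu[t]) + P @ V
        V = (pi[t] * Q).sum(axis=1)
    return mu0 @ V

pi_uniform = np.full((T, S, A), 1.0 / A)
mu = mean_field(pi_uniform)
print("exploitability of the uniform policy:", best_response_value(mu) - policy_value(pi_uniform, mu))
```

Driving this number to zero, for instance along a sequence of Boltzmann MFE with vanishing $\eta$, is precisely what the lemmas above certify.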
For RelEnt MFE, the same can be done by first showing the uniform convergence of the soft action-value function to the usual action-value function. For this, note that the smooth maximum Bellman recursion converges to the hard maximum Bellman recursion for any fixed $\mu$.

Lemma B.8.5. For any $f \colon \mathcal{A} \to \mathbb{R}$ and any $g \colon \mathcal{A} \to \mathbb{R}$ with $g(a) > 0$ for all $a \in \mathcal{A}$, we have
$$\lim_{\eta \to 0^+} \eta \log \sum_{a \in \mathcal{A}} g(a) \exp\left(\frac{f(a)}{\eta}\right) = \max_{a \in \mathcal{A}} f(a).$$

Proof.
Let $\delta = \eta^{-1}$, so that $\delta \to +\infty$ as $\eta \to 0^+$. Then, by L'Hôpital's rule we have
$$\lim_{\delta \to +\infty} \frac{\log \sum_{a \in \mathcal{A}} g(a) \exp(\delta f(a))}{\delta} = \lim_{\delta \to +\infty} \frac{\sum_{a \in \mathcal{A}} g(a) \exp(\delta f(a))\, f(a)}{\sum_{a \in \mathcal{A}} g(a) \exp(\delta f(a))}$$
$$= \lim_{\delta \to +\infty} \frac{\sum_{a \in \mathcal{A}} g(a) \exp\left(\delta \left(f(a) - \max_{a \in \mathcal{A}} f(a)\right)\right) f(a)}{\sum_{a \in \mathcal{A}} g(a) \exp\left(\delta \left(f(a) - \max_{a \in \mathcal{A}} f(a)\right)\right)} = \frac{\left(\sum_{a \in \mathcal{A}_{\max}} g(a)\right) \max_{a \in \mathcal{A}} f(a)}{\sum_{a \in \mathcal{A}_{\max}} g(a)} = \max_{a \in \mathcal{A}} f(a),$$
where $\mathcal{A}_{\max} \subseteq \mathcal{A}$ denotes the set of elements of $\mathcal{A}$ that maximize $f$. $\square$
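As a quick numerical illustration of Lemma B.8.5 (not part of the proof; the vectors $f$ and $g$ below are arbitrary assumptions), the smooth maximum can be evaluated in a stabilized form and visibly approaches the hard maximum as $\eta \to 0^+$, independently of the strictly positive weights $g$:

```python
import numpy as np

def smooth_max(f, g, eta):
    # Stabilized evaluation of eta * log( sum_a g(a) * exp(f(a)/eta) ).
    m = f.max()
    return m + eta * np.log(np.sum(g * np.exp((f - m) / eta)))

f = np.array([0.3, 1.7, -0.5, 1.7])     # two maximizers, max f = 1.7
g = np.array([0.1, 0.4, 0.3, 0.2])      # strictly positive weights
for eta in [1.0, 0.1, 0.01, 0.001]:
    print(f"eta={eta:6.3f}  smooth max = {smooth_max(f, g, eta):.4f}")
```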
Using this result, we can show pointwise convergence of the soft action-value function to the action-value function.

Lemma B.8.6.
Any sequence of functions $(\mu \mapsto \tilde{Q}_{\eta_n}(\mu,t,s,a))_{n \in \mathbb{N}}$ with $\eta_n \to 0^+$ converges pointwise to $\mu \mapsto Q^*(\mu,t,s,a)$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.

Proof. Fix $\mu \in \mathcal{M}$. We show by backward induction over $t$ that for any $\varepsilon > 0$ there exists $\eta_t > 0$ such that for all $\eta < \eta_t$ we have $|\tilde{Q}_\eta(\mu,t,s,a) - Q^*(\mu,t,s,a)| < \varepsilon$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$. This holds for $t = T-1$ and arbitrary $s \in \mathcal{S}$, $a \in \mathcal{A}$ by Lemma B.8.5, since $r(s,a,\mu_{T-1})$ is independent of $\eta$. Assume the claim holds for $t+1$ and consider $t$. By the induction assumption we can choose $\eta_{t+1} > 0$ such that for $\eta < \eta_{t+1}$ we have $|\tilde{Q}_\eta(\mu,t+1,s',a') - Q^*(\mu,t+1,s',a')| < \frac{\varepsilon}{2}$ for all $s' \in \mathcal{S}$, $a' \in \mathcal{A}$, and hence, as $\eta \to 0^+$,
$$\tilde{Q}_\eta(\mu,t,s,a) = r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t)\, \eta \log \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}\right)$$
$$\leq r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t)\, \eta \log \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{Q^*(\mu,t+1,s',a') + \frac{\varepsilon}{2}}{\eta}\right)$$
$$\to r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t) \left( \max_{a' \in \mathcal{A}} Q^*(\mu,t+1,s',a') + \frac{\varepsilon}{2} \right)$$
by Lemma B.8.5 and monotonicity of $\log$ and $\exp$. Analogously,
$$\tilde{Q}_\eta(\mu,t,s,a) \geq r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t)\, \eta \log \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{Q^*(\mu,t+1,s',a') - \frac{\varepsilon}{2}}{\eta}\right)$$
$$\to r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t) \left( \max_{a' \in \mathcal{A}} Q^*(\mu,t+1,s',a') - \frac{\varepsilon}{2} \right).$$
Therefore, we can choose $\eta_t < \eta_{t+1}$ such that for all $\eta < \eta_t$ we have
$$\left| \tilde{Q}_\eta(\mu,t,s,a) - Q^*(\mu,t,s,a) \right| = \left| \tilde{Q}_\eta(\mu,t,s,a) - \left( r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t) \max_{a' \in \mathcal{A}} Q^*(\mu,t+1,s',a') \right) \right| < \varepsilon,$$
which is the desired result. $\square$
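The following small sketch runs the soft and hard backward recursions side by side on a randomly generated finite-horizon problem (purely illustrative: the mean field is held fixed, so it only enters through time-dependent rewards, and the kernel, rewards, horizon and uniform prior are assumptions); the printed gap shrinks towards zero as $\eta$ decreases, consistent with Lemma B.8.6:

```python
import numpy as np

S, A, T = 3, 2, 4
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(S), size=(S, A))      # time-homogeneous transitions p(.|s,a)
r = rng.uniform(-1.0, 1.0, size=(T, S, A))      # r(s,a,mu_t) for the fixed mean field
q_prior = np.full(A, 1.0 / A)                   # uniform prior policy q_t(.|s)

def soft_q(eta):
    # Backward recursion for the soft action-value function \tilde{Q}_eta.
    Q_next = np.zeros((S, A))                   # terminal condition \tilde{Q}_eta(mu, T, ., .) = 0
    for t in reversed(range(T)):
        m = Q_next.max(axis=1, keepdims=True)
        soft_v = (m + eta * np.log((q_prior * np.exp((Q_next - m) / eta)).sum(axis=1, keepdims=True))).squeeze(1)
        Q_next = r[t] + P @ soft_v
    return Q_next                               # \tilde{Q}_eta(mu, 0, ., .)

def hard_q():
    Q_next = np.zeros((S, A))
    for t in reversed(range(T)):
        Q_next = r[t] + P @ Q_next.max(axis=1)
    return Q_next                               # Q*(mu, 0, ., .)

q_star = hard_q()
for eta in [1.0, 0.3, 0.1, 0.03, 0.01]:
    print(f"eta={eta:5.2f}  max |soft Q - Q*| = {np.abs(soft_q(eta) - q_star).max():.4f}")
```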
We can now show that the soft action-value function converges uniformly to the action-value function as $\eta \to 0^+$.

Lemma B.8.7. Any sequence of functions $(\mu \mapsto \tilde{Q}_{\eta_n}(\mu,t,s,a))_{n \in \mathbb{N}}$ with $\eta_n \to 0^+$ converges uniformly to $\mu \mapsto Q^*(\mu,t,s,a)$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.

Proof. First, we show that $\tilde{Q}_\eta(\mu,t,s,a)$ is monotonically decreasing in $\eta$ for $\eta > 0$, i.e. $\frac{\partial}{\partial \eta} \tilde{Q}_\eta(\mu,t,s,a) \leq 0$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$. This is the case for $t = T-1$ and arbitrary $s \in \mathcal{S}$, $a \in \mathcal{A}$, since $\tilde{Q}_\eta(\mu,T-1,s,a)$ is constant in $\eta$. Assume this holds for $t+1$, then for $t$ and arbitrary $s \in \mathcal{S}$, $a \in \mathcal{A}$ we have
$$\frac{\partial}{\partial \eta} \tilde{Q}_\eta(\mu,t,s,a) = \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t) \log \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}\right)$$
$$\quad + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t)\, \eta\, \frac{\sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}\right) \left( -\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta^2} + \frac{1}{\eta}\frac{\partial}{\partial \eta}\tilde{Q}_\eta(\mu,t+1,s',a') \right)}{\sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}\right)}$$
$$\leq \max_{s' \in \mathcal{S}} \left[ \log \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}\right) - \frac{\sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}\right) \frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}}{\sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}\right)} \right]$$
by the induction hypothesis. Let $\xi_{a'} \equiv \frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta} \in \mathbb{R}$ and $s' \in \mathcal{S}$ be arbitrary, then by Jensen's inequality applied to the convex function $\phi(x) = x \log x$ we have
$$\sum_{a' \in \mathcal{A}} q_{t+1}(a'|s')\, \phi(\exp \xi_{a'}) \geq \phi\left( \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp \xi_{a'} \right)$$
$$\iff \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s')\, \xi_{a'} \exp \xi_{a'} \geq \left( \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp \xi_{a'} \right) \log \left( \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp \xi_{a'} \right)$$
$$\iff \log \left( \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp \xi_{a'} \right) - \frac{\sum_{a' \in \mathcal{A}} q_{t+1}(a'|s')\, \xi_{a'} \exp \xi_{a'}}{\sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp \xi_{a'}} \leq 0,$$
such that $\tilde{Q}_\eta(\mu,t,s,a)$ is monotonically decreasing in $\eta$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$ by induction.

Furthermore, $\mathcal{M}$ is compact and both $\tilde{Q}_\eta$ and $Q^*$ are compositions, sums, products and finite maxima of continuous functions in $\mu$ and therefore continuous in $\mu$ by the standing assumptions. Since $(\mu \mapsto \tilde{Q}_{\eta_n}(\mu,t,s,a))_{n \in \mathbb{N}}$ with $\eta_n \to 0^+$ converges pointwise to $\mu \mapsto Q^*(\mu,t,s,a)$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$ by Lemma B.8.6, by Dini's theorem the convergence is uniform. $\square$

Now that $\tilde{Q}_\eta$ converges uniformly to $Q^*$, we can show that RelEnt MFE have vanishing exploitability by replicating the proof for Boltzmann MFE.

Lemma B.8.8.
Any sequence of functions $(\mu \mapsto Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a))_{n \in \mathbb{N}}$ with $\eta_n \to 0^+$ converges pointwise to $\mu \mapsto Q^*(\mu,t,s,a)$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.

Proof. The proof is the same as in Lemma B.8.1. The only difference is that we additionally choose $n_1 \in \mathbb{N}$ in each induction step such that for all $n > n_1$ we have
$$\left| \tilde{Q}_{\eta_n}(\mu,t,s,a) - Q^*(\mu,t,s,a) \right| \leq \Delta Q^{s',\mu}_{\min}$$
for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$, which is possible since, by Lemma B.8.7, $\tilde{Q}_\eta$ converges uniformly to $Q^*$. As long as we choose $n' \equiv \max(n_1, n_2, \max_{s' \in \mathcal{S}, a' \in \mathcal{A}} n_{s',a'})$, where $n_2$ and the $n_{s',a'}$ are chosen as in the proof of Lemma B.8.1, the rest of the proof applies. $\square$

Lemma B.8.9.
Any sequence of functions $(\mu \mapsto Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a))_{n \in \mathbb{N}}$ with $\eta_n \to 0^+$ fulfills equicontinuity for large enough $n$: For any $\varepsilon > 0$ and any $\mu \in \mathcal{M}$, we can choose a $\delta > 0$ and an integer $n' \in \mathbb{N}$ such that for all $\mu' \in \mathcal{M}$ with $d_{\mathcal{M}}(\mu,\mu') < \delta$ and for all $n > n'$ we have $\left| Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a) - Q^{\tilde{\Phi}_{\eta_n}(\mu')}(\mu',t,s,a) \right| < \varepsilon$.

Proof.
To obtain the desired property, we replicate the proof of Lemma B.8.2 by setting $F = (\mu \mapsto Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a))_{n \in \mathbb{N}}$. Any bounds for $\tilde{Q}_\eta$ can be instantiated by the corresponding bound for $Q^*$ and then bounding the distance between both by uniform convergence. The only differences lie in bounding the terms
$$\left| (\tilde{\Phi}_{\eta_n}(\mu))_{t+1}(a_{\mathrm{sub}} \mid s') - (\tilde{\Phi}_{\eta_n}(\mu'))_{t+1}(a_{\mathrm{sub}} \mid s') \right|,$$
where the action-value function has been replaced by the soft action-value function. Since $\tilde{Q}_{\eta_n}$ converges uniformly to $Q^*$, we instantiate additional requirements $N^{s'}_{t,s,a}$, $\tilde{N}^{s'}_{t,s,a}$ and let $n > N^{s'}_{t,s,a}$, $n > \tilde{N}^{s'}_{t,s,a}$ be large enough such that $\eta_n$ is sufficiently small.

The first difference is to obtain
$$\left| \tilde{Q}_{\eta_n}(\mu',t,s,a) - \tilde{Q}_{\eta_n}(\mu,t,s,a) \right| < \Delta Q^{s',\mu}_{\min}$$
for all $\mu' \in \mathcal{M}$, $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$ with $d_{\mathcal{M}}(\mu,\mu')$ sufficiently small. We choose $\hat{\delta}_{t,s,a}$ slightly stronger than in the original proof, such that if $d_{\mathcal{M}}(\mu,\mu') < \hat{\delta}_{t,s,a}$, we have
$$\left| Q^*(\mu',t,s,a) - Q^*(\mu,t,s,a) \right| < \frac{\Delta Q^{s',\mu}_{\min}}{3}.$$
We must then additionally choose $N^{s'}_{t,s,a} \in \mathbb{N}$ for each induction step via uniform convergence from Lemma B.8.7 such that as long as $n > N^{s'}_{t,s,a}$, we have
$$\left| \tilde{Q}_{\eta_n}(\mu,t,s,a) - Q^*(\mu,t,s,a) \right| < \frac{\Delta Q^{s',\mu}_{\min}}{3}.$$
This implies the required inequality
$$\left| \tilde{Q}_{\eta_n}(\mu',t,s,a) - \tilde{Q}_{\eta_n}(\mu,t,s,a) \right| \leq \left| \tilde{Q}_{\eta_n}(\mu',t,s,a) - Q^*(\mu',t,s,a) \right| + \left| Q^*(\mu',t,s,a) - Q^*(\mu,t,s,a) \right| + \left| Q^*(\mu,t,s,a) - \tilde{Q}_{\eta_n}(\mu,t,s,a) \right| < \Delta Q^{s',\mu}_{\min},$$
and we can proceed as in the original proof.

The second difference lies in choosing $\delta^{2,s'}_{t,s,a}$. Note that $\tilde{Q}_{\eta_n}$ is still bounded by $M_Q$, see Lemma B.7.1. However, since $\tilde{Q}_{\eta_n}$ might no longer be Lipschitz with the same constant as $Q^*$, we choose an additional integer $\tilde{N}^{s'}_{t,s,a} \in \mathbb{N}$ for each induction step by Lemma B.8.7, such that as long as $n > \tilde{N}^{s'}_{t,s,a}$, we have
$$\left| \tilde{Q}_{\eta_n}(\mu,t,s,a) - Q^*(\mu,t,s,a) \right| \leq \Delta^{s'}_Q \equiv \frac{\varepsilon_{t,s,a}\, \eta^{s'}_{\min}}{16 M_Q |\mathcal{A}|^2\, R^{\max}_q \exp\left(\frac{M_Q}{\eta^{s'}_{\min}}\right)}$$
for all $\mu \in \mathcal{M}$, $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.
The required bound then follows immediately from
$$\left| (\tilde{\Phi}_{\eta_n}(\mu))_{t+1}(a_{\mathrm{sub}} \mid s') - (\tilde{\Phi}_{\eta_n}(\mu'))_{t+1}(a_{\mathrm{sub}} \mid s') \right|$$
$$\leq R^{\max}_q \sum_{a' \neq a_{\mathrm{sub}}} \left| \exp\left(\frac{\tilde{Q}_{\eta_n}(\mu',t+1,s',a') - \tilde{Q}_{\eta_n}(\mu',t+1,s',a_{\mathrm{sub}})}{\eta_n}\right) - \exp\left(\frac{\tilde{Q}_{\eta_n}(\mu,t+1,s',a') - \tilde{Q}_{\eta_n}(\mu,t+1,s',a_{\mathrm{sub}})}{\eta_n}\right) \right|$$
$$\leq R^{\max}_q \sum_{a' \neq a_{\mathrm{sub}}} \left| \frac{1}{\eta_n} \exp\left(\frac{\xi_{a'}}{\eta_n}\right) \right| \left| \left( \tilde{Q}_{\eta_n}(\mu',t+1,s',a') - \tilde{Q}_{\eta_n}(\mu',t+1,s',a_{\mathrm{sub}}) \right) - \left( \tilde{Q}_{\eta_n}(\mu,t+1,s',a') - \tilde{Q}_{\eta_n}(\mu,t+1,s',a_{\mathrm{sub}}) \right) \right|$$
$$\leq R^{\max}_q |\mathcal{A}| \cdot \frac{1}{\eta^{s'}_{\min}} \exp\left(\frac{M_Q}{\eta^{s'}_{\min}}\right) \left( \left| \tilde{Q}_{\eta_n}(\mu',t+1,s',a') - \tilde{Q}_{\eta_n}(\mu,t+1,s',a') \right| + \left| \tilde{Q}_{\eta_n}(\mu,t+1,s',a_{\mathrm{sub}}) - \tilde{Q}_{\eta_n}(\mu',t+1,s',a_{\mathrm{sub}}) \right| \right)$$
$$\leq R^{\max}_q |\mathcal{A}| \cdot \frac{1}{\eta^{s'}_{\min}} \exp\left(\frac{M_Q}{\eta^{s'}_{\min}}\right) \cdot \left( 2 K_Q\, d_{\mathcal{M}}(\mu,\mu') + 4 \Delta^{s'}_Q \right)$$
$$\leq R^{\max}_q |\mathcal{A}| \cdot \frac{1}{\eta^{s'}_{\min}} \exp\left(\frac{M_Q}{\eta^{s'}_{\min}}\right) \cdot 2 K_Q\, d_{\mathcal{M}}(\mu,\mu') + \frac{\varepsilon_{t,s,a}}{4 M_Q |\mathcal{A}|} < \frac{\varepsilon_{t,s,a}}{2 M_Q |\mathcal{A}|},$$
as in the original proof, by letting $d_{\mathcal{M}}(\mu,\mu') < \delta^{2,s'}_{t,s,a}$ and choosing
$$\delta^{2,s'}_{t,s,a} = \frac{\varepsilon_{t,s,a}\, \eta^{s'}_{\min}}{8 M_Q |\mathcal{A}|^2\, R^{\max}_q \exp\left(\frac{M_Q}{\eta^{s'}_{\min}}\right) K_Q}.$$
The rest of the proof is analogous. We obtain the additional requirements $n > N^{s'}_{t,s,a}$ and $n > \tilde{N}^{s'}_{t,s,a}$ for some integers $N^{s'}_{t,s,a}, \tilde{N}^{s'}_{t,s,a}$ and each $t \in \mathcal{T}$, $s \in \mathcal{S}$, $s' \in \mathcal{S}$, $a \in \mathcal{A}$. By choosing $n' \equiv \max_{t \in \mathcal{T}, s \in \mathcal{S}, s' \in \mathcal{S}, a \in \mathcal{A}} \max(N^{s'}_{t,s,a}, \tilde{N}^{s'}_{t,s,a})$, the desired result holds as long as $n > n'$. $\square$

From this property, we again obtain the desired uniform convergence via compactness of $\mathcal{M}$.

Lemma B.8.10.
Any sequence of functions $(\mu \mapsto Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a))_{n \in \mathbb{N}}$ with $\eta_n \to 0^+$ converges uniformly to $\mu \mapsto Q^*(\mu,t,s,a)$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.

Proof. Fix $\varepsilon > 0$, $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$. Then, by Lemma B.8.9, for any point $\mu \in \mathcal{M}$ there exist both $\delta(\mu) > 0$ and an integer $n'(\mu)$ such that for all $\mu' \in \mathcal{M}$ with $d_{\mathcal{M}}(\mu,\mu') < \delta(\mu)$ and all $n > n'(\mu)$ we have
$$\left| Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a) - Q^{\tilde{\Phi}_{\eta_n}(\mu')}(\mu',t,s,a) \right| < \frac{\varepsilon}{3},$$
which via pointwise convergence from Lemma B.8.8 implies $|Q^*(\mu,t,s,a) - Q^*(\mu',t,s,a)| \leq \frac{\varepsilon}{3}$.

Since $\mathcal{M}$ is compact, it is separable, i.e. there exists a countable dense subset $(\mu_j)_{j \in \mathbb{N}}$ of $\mathcal{M}$. Let $\delta(\mu)$ be as defined above and cover $\mathcal{M}$ by the open balls $(B_{\delta(\mu_j)}(\mu_j))_{j \in \mathbb{N}}$. By the compactness of $\mathcal{M}$, finitely many of these balls $B_{\delta(\mu_{n_1})}(\mu_{n_1}), \ldots, B_{\delta(\mu_{n_k})}(\mu_{n_k})$ cover $\mathcal{M}$. By pointwise convergence from Lemma B.8.8, for any $i = 1, \ldots, k$ we can find integers $m_i$ such that for all $n > m_i$ we have
$$\left| Q^{\tilde{\Phi}_{\eta_n}(\mu_{n_i})}(\mu_{n_i},t,s,a) - Q^*(\mu_{n_i},t,s,a) \right| < \frac{\varepsilon}{3}.$$
Taken together, we find that for $n > \max_{i=1,\ldots,k} \max(n'(\mu_{n_i}), m_i)$ and arbitrary $\mu \in \mathcal{M}$, we have
$$\left| Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a) - Q^*(\mu,t,s,a) \right| \leq \left| Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a) - Q^{\tilde{\Phi}_{\eta_n}(\mu_{n_i})}(\mu_{n_i},t,s,a) \right| + \left| Q^{\tilde{\Phi}_{\eta_n}(\mu_{n_i})}(\mu_{n_i},t,s,a) - Q^*(\mu_{n_i},t,s,a) \right| + \left| Q^*(\mu_{n_i},t,s,a) - Q^*(\mu,t,s,a) \right| < \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \varepsilon$$
for some center point $\mu_{n_i}$ of a ball containing $\mu$ from the finite cover. $\square$

As a result, a sequence of RelEnt MFE with $\eta \to 0^+$ is approximately optimal in the MFG.

Lemma B.8.11.
For any sequence $(\pi^*_n, \mu^*_n)_{n \in \mathbb{N}}$ of $\eta_n$-RelEnt MFE with $\eta_n \to 0^+$ and for any $\varepsilon > 0$ there exists an integer $n' \in \mathbb{N}$ such that for all integers $n > n'$ we have $J_{\mu^*_n}(\pi^*_n) \geq \max_\pi J_{\mu^*_n}(\pi) - \varepsilon$.

Proof.
By Lemma B.8.10, we have $|Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a) - Q^*(\mu,t,s,a)| \to 0$ uniformly. Therefore, for any $\varepsilon > 0$, there exists by uniform convergence an integer $n'$ such that for all integers $n > n'$ we have
$$Q^{\pi^*_n}(\mu^*_n,t,s,a) \geq Q^*(\mu^*_n,t,s,a) - \varepsilon = \max_{\pi \in \Pi} Q^\pi(\mu^*_n,t,s,a) - \varepsilon,$$
and since by Lemma B.3.1 we have
$$J_{\mu^*_n}(\pi^*_n) = \sum_{s \in \mathcal{S}} \mu_0(s) \sum_{a \in \mathcal{A}} \pi^*_{n,0}(a \mid s)\, Q^{\pi^*_n}(\mu^*_n, 0, s, a) \geq \sum_{s \in \mathcal{S}} \mu_0(s) \max_{\pi \in \Pi} \sum_{a \in \mathcal{A}} \pi_0(a \mid s)\, Q^{\pi}(\mu^*_n, 0, s, a) - \varepsilon = \max_{\pi \in \Pi} J_{\mu^*_n}(\pi) - \varepsilon,$$
the desired result follows immediately. $\square$

By repeating the previous argumentation for Boltzmann MFE with Lemma B.5.6 and replacing Lemma B.8.4 with Lemma B.8.11, we obtain the desired result for RelEnt MFE.
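Putting the pieces of Section B.8 together, the fixed point iteration alternates an evaluation of $Q^*(\mu,\cdot)$ for the current mean field with the softened policy update $\Phi_\eta$ and the induced mean field $\Psi$. The compact sketch below (a hypothetical toy instance with crowd-averse rewards and $\mu$-independent dynamics; the stopping rule and all constants are illustrative assumptions, not the paper's experimental setup) shows one way such a Boltzmann iteration can be written:

```python
import numpy as np

S, A, T = 3, 2, 5
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(S), size=(S, A))       # p(.|s,a), taken mu-independent for brevity
mu0 = np.full(S, 1.0 / S)
q_prior = np.full(A, 1.0 / A)                    # uniform prior policy

def q_values(mu):
    # Q*(mu, t, s, a) by backward induction for the fixed mean field mu.
    Q = np.zeros((T, S, A)); V = np.zeros(S)
    for t in reversed(range(T)):
        Q[t] = -mu[t][:, None] + P @ V           # r(s,a,mu_t) = -mu_t(s)
        V = Q[t].max(axis=1)
    return Q

def boltzmann_policy(Q, eta):
    # (Phi_eta(mu))_t(a|s) proportional to q_prior(a) * exp(Q*(mu,t,s,a)/eta).
    logits = np.log(q_prior)[None, None, :] + Q / eta
    logits -= logits.max(axis=2, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=2, keepdims=True)

def induced_mean_field(pi):
    # Psi(pi): forward equation for the state marginals.
    mu = np.zeros((T, S)); mu[0] = mu0
    for t in range(T - 1):
        mu[t + 1] = np.einsum("s,sa,saj->j", mu[t], pi[t], P)
    return mu

eta = 0.2
mu = np.tile(mu0, (T, 1))
for k in range(100):
    pi = boltzmann_policy(q_values(mu), eta)     # Phi_eta(mu)
    mu_next = induced_mean_field(pi)             # Psi(Phi_eta(mu))
    if np.abs(mu_next - mu).max() < 1e-10:
        break
    mu = mu_next
print("stopped after", k + 1, "iterations")
```

Running the same loop with the greedy policy in place of $\Phi_\eta$ corresponds to the unregularized fixed point iteration, which is the regime in which plain fixed point iteration need not converge.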
C Relative entropy mean field games
We show that the necessary conditions for optimality hold for the candidate solution. (For further insight, see also Neu et al. (2017), Haarnoja et al. (2017) and references therein.) Fix a mean field $\mu \in \mathcal{M}$ and formulate the induced problem as an optimization problem, with $\rho_t(s)$ as the probability of our representative agent visiting state $s \in \mathcal{S}$ at time $t \in \mathcal{T}$, to obtain
$$\max_{\rho, \pi} \quad \sum_{t=0}^{T-1} \sum_{s \in \mathcal{S}} \rho_t(s) \sum_{a \in \mathcal{A}} \pi_t(a|s)\, r(s,a,\mu_t)$$
$$\text{subject to} \quad \rho_{t+1}(s') = \sum_{s \in \mathcal{S}} \rho_t(s) \sum_{a \in \mathcal{A}} \pi_t(a|s)\, p(s'|s,a,\mu_t) \quad \forall s' \in \mathcal{S},\ t \in \{0,\ldots,T-2\},$$
$$\sum_{s \in \mathcal{S}} \rho_t(s) = 1 \quad \forall t \in \{0,\ldots,T-1\}, \qquad \sum_{a \in \mathcal{A}} \pi_t(a|s) = 1 \quad \forall s \in \mathcal{S},\ t \in \{0,\ldots,T-1\},$$
$$0 \leq \rho_t(s), \quad 0 \leq \pi_t(a|s) \quad \forall s \in \mathcal{S},\ a \in \mathcal{A},\ t \in \{0,\ldots,T-1\}, \qquad \mu_0(s) = \rho_0(s) \quad \forall s \in \mathcal{S}.$$
Note that if the agent follows the mean field policy of the other agents, we have $\rho_t = \mu_t$. The optimized objective is just the expectation $\mathbb{E}\left[ \sum_{t=0}^{T-1} r(S_t, A_t, \mu_t) \right]$. As in Belousov and Peters (2019), we change this objective to include a KL-divergence penalty weighted by the state-visitation distribution $\rho_t(\cdot)$ by introducing the temperature $\eta > 0$ and prior policy $q \in \Pi$ to obtain
$$\max_{\rho, \pi} \quad \sum_{t=0}^{T-1} \sum_{s \in \mathcal{S}} \rho_t(s) \sum_{a \in \mathcal{A}} \pi_t(a|s)\, r(s,a,\mu_t) - \eta \sum_{t=0}^{T-1} \sum_{s \in \mathcal{S}} \rho_t(s)\, D_{\mathrm{KL}}\left( \pi_t(\cdot|s) \,\|\, q_t(\cdot|s) \right)$$
subject to the same constraints. We ignore the constraints $0 \leq \pi_t(a|s)$ and $0 \leq \rho_t(s)$ and see later that they will hold automatically. This results in the simplified optimization problem
$$\max_{\rho, \pi} \quad \sum_{t=0}^{T-1} \sum_{s \in \mathcal{S}} \rho_t(s) \sum_{a \in \mathcal{A}} \pi_t(a|s)\, r(s,a,\mu_t) - \eta \sum_{t=0}^{T-1} \sum_{s \in \mathcal{S}} \rho_t(s)\, D_{\mathrm{KL}}\left( \pi_t(\cdot|s) \,\|\, q_t(\cdot|s) \right)$$
$$\text{subject to} \quad \rho_{t+1}(s') = \sum_{s \in \mathcal{S}} \rho_t(s) \sum_{a \in \mathcal{A}} \pi_t(a|s)\, p(s'|s,a,\mu_t) \quad \forall s' \in \mathcal{S},\ t \in \{0,\ldots,T-2\},$$
$$\sum_{s \in \mathcal{S}} \rho_t(s) = 1 \quad \forall t \in \{0,\ldots,T-1\}, \qquad \sum_{a \in \mathcal{A}} \pi_t(a|s) = 1 \quad \forall s \in \mathcal{S},\ t \in \{0,\ldots,T-1\}, \qquad \mu_0(s) = \rho_0(s) \quad \forall s \in \mathcal{S},$$
for which we introduce Lagrange multipliers $\lambda_1(t,s)$, $\lambda_2(t)$, $\lambda_3(t,s)$, $\lambda_4(s)$ and the Lagrangian
$$L(\rho, \pi, \lambda_1, \lambda_2, \lambda_3, \lambda_4) = \sum_{t=0}^{T-1} \sum_{s \in \mathcal{S}} \rho_t(s) \sum_{a \in \mathcal{A}} \pi_t(a|s) \left( r(s,a,\mu_t) - \eta \log \frac{\pi_t(a|s)}{q_t(a|s)} \right)$$
$$\quad - \sum_{t=0}^{T-2} \sum_{s' \in \mathcal{S}} \lambda_1(t,s') \left( \rho_{t+1}(s') - \sum_{s \in \mathcal{S}} \rho_t(s) \sum_{a \in \mathcal{A}} \pi_t(a|s)\, p(s'|s,a,\mu_t) \right) - \sum_{t=0}^{T-1} \lambda_2(t) \left( 1 - \sum_{s \in \mathcal{S}} \rho_t(s) \right)$$
$$\quad - \sum_{t=0}^{T-1} \sum_{s \in \mathcal{S}} \lambda_3(t,s) \left( \sum_{a \in \mathcal{A}} \pi_t(a|s) - 1 \right) - \sum_{s \in \mathcal{S}} \lambda_4(s) \left( \mu_0(s) - \rho_0(s) \right)$$
with the artificial constraint $\lambda_1(T-1, s) \equiv 0$, which allows us to formulate the following necessary conditions for optimality.
For $\nabla_{\pi_t(a|s)} L$ and all $s \in \mathcal{S}$, $a \in \mathcal{A}$, $t \in \{0,\ldots,T-1\}$, we obtain
$$\nabla_{\pi_t(a|s)} L = \rho_t(s) \left( r(s,a,\mu_t) - \eta \log \frac{\pi_t(a|s)}{q_t(a|s)} - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t) \right) - \lambda_3(t,s) \overset{!}{=} 0$$
$$\implies \pi^*_t(a|s) = q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t) - \frac{\lambda_3(t,s)}{\rho_t(s)}}{\eta} \right).$$
For $\nabla_{\lambda_3} L$ and all $s \in \mathcal{S}$, $t \in \{0,\ldots,T-1\}$, by inserting $\pi^*_t$ we obtain
$$\nabla_{\lambda_3(t,s)} L = 1 - \sum_{a \in \mathcal{A}} \pi_t(a|s) \overset{!}{=} 0 \iff \sum_{a \in \mathcal{A}} q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t) - \frac{\lambda_3(t,s)}{\rho_t(s)}}{\eta} \right) = 1,$$
which is fulfilled by choosing
$$\lambda^*_3(t,s) = \eta\, \rho_t(s) \log \sum_{a \in \mathcal{A}} q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t)}{\eta} \right),$$
since it fulfills the required equation
$$\sum_{a \in \mathcal{A}} q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t) - \frac{\lambda^*_3(t,s)}{\rho_t(s)}}{\eta} \right)$$
$$= \sum_{a \in \mathcal{A}} q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t)}{\eta} \right) \cdot \left( \sum_{a \in \mathcal{A}} q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t)}{\eta} \right) \right)^{-1} = 1.$$
Finally, inserting $\lambda^*_3$ and $\pi^*$, for $\nabla_{\rho_t(s)} L$ we obtain
$$\nabla_{\rho_t(s)} L = \sum_{a \in \mathcal{A}} \pi_t(a|s) \left( r(s,a,\mu_t) - \eta \log \frac{\pi_t(a|s)}{q_t(a|s)} + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t) + \lambda_2(t) \right) - \lambda_1(t-1,s)$$
$$= \sum_{a \in \mathcal{A}} \pi_t(a|s) \left( \eta + \lambda_2(t) + \frac{\lambda_3(t,s)}{\rho_t(s)} \right) - \lambda_1(t-1,s) \overset{!}{=} 0,$$
which implies
$$\lambda^*_1(t-1,s) = \eta + \lambda_2(t) + \eta \log \sum_{a \in \mathcal{A}} q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t)}{\eta} \right).$$
We can subtract $\lambda_2(t)$ and shift the time index to obtain the soft value function $\tilde{V}_\eta(\mu,t,s)$ defined via the terminal condition $\tilde{V}_\eta(\mu,T,s) \equiv 0$ and the recursion
$$\tilde{V}_\eta(\mu,t,s) = \eta \log \sum_{a \in \mathcal{A}} q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} \tilde{V}_\eta(\mu,t+1,s')\, p(s'|s,a,\mu_t)}{\eta} \right),$$
since then, by normalization, the optimal policy for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, $t \in \{0,\ldots,T-1\}$ is equivalent to
$$\pi^*_t(a|s) = \frac{q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t)}{\eta} \right)}{\sum_{a' \in \mathcal{A}} q_t(a'|s) \exp\left( \frac{r(s,a',\mu_t) + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a',\mu_t)}{\eta} \right)} = \frac{q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} \tilde{V}_\eta(\mu,t+1,s')\, p(s'|s,a,\mu_t)}{\eta} \right)}{\sum_{a' \in \mathcal{A}} q_t(a'|s) \exp\left( \frac{r(s,a',\mu_t) + \sum_{s' \in \mathcal{S}} \tilde{V}_\eta(\mu,t+1,s')\, p(s'|s,a',\mu_t)}{\eta} \right)}.$$
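To make the derived recursion concrete, the sketch below (illustrative only: the randomly generated rewards, kernel, prior and horizon are assumptions, and the fixed mean field is again folded into the data) computes $\tilde{V}_\eta$ backwards in time together with the associated optimal policy $\pi^*$, which by construction is normalized and strictly positive, consistent with the ignored constraints:

```python
import numpy as np

S, A, T = 3, 2, 4
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(S), size=(S, A))       # p(s'|s,a,mu_t), mu-dependence suppressed
r = rng.uniform(-1.0, 1.0, size=(T, S, A))       # r(s,a,mu_t) for the fixed mean field
q = rng.dirichlet(np.ones(A), size=(T, S))       # prior policy q_t(a|s)

def kl_regularized_solution(eta):
    V = np.zeros(S)                               # terminal condition: soft value at time T is zero
    pi = np.zeros((T, S, A))
    for t in reversed(range(T)):
        adv = r[t] + P @ V                        # r(s,a,mu_t) + sum_s' p(s'|s,a,mu_t) * soft value at t+1
        m = adv.max(axis=1, keepdims=True)
        w = q[t] * np.exp((adv - m) / eta)        # stabilized numerator of pi*_t(a|s)
        pi[t] = w / w.sum(axis=1, keepdims=True)
        V = (m + eta * np.log(w.sum(axis=1, keepdims=True))).squeeze(1)   # soft value at time t
    return pi

for eta in [1.0, 0.1, 0.01]:
    pi = kl_regularized_solution(eta)
    print(f"eta={eta}: normalized: {np.allclose(pi.sum(axis=2), 1.0)}, min probability = {pi.min():.3g}")
```

As $\eta \to 0^+$ the printed minimum probability shrinks, i.e. the KL-regularized policy concentrates on the greedy actions, matching the limiting results of Section B.8.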
To obtain a recursion in $\tilde{Q}_\eta$, define
$$\tilde{Q}_\eta(\mu,t,s,a) \equiv r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t)\, \eta \log \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left( \frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta} \right)$$
with terminal condition $\tilde{Q}_\eta(\mu,T,s,a) \equiv 0$ to obtain
$$\pi^*_t(a|s) = \frac{q_t(a|s) \exp\left( \frac{\tilde{Q}_\eta(\mu,t,s,a)}{\eta} \right)}{\sum_{a' \in \mathcal{A}} q_t(a'|s) \exp\left( \frac{\tilde{Q}_\eta(\mu,t,s,a')}{\eta} \right)},$$
which is the desired result, as $\pi^*$ fulfills all constraints and determines $\rho$ uniquely. For the uniform prior $q_t(a|s) = 1/|\mathcal{A}|$