Approximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning
Kai Cui Heinz Koeppl
Technische Universität Darmstadt [email protected]
Technische Universität Darmstadt [email protected]
Abstract
The recent mean field game (MFG) formalism facilitates otherwise intractable computation of approximate Nash equilibria in many-agent settings. In this paper, we consider discrete-time finite MFGs subject to finite-horizon objectives. We show that all discrete-time finite MFGs with non-constant fixed point operators fail to be contractive as typically assumed in existing MFG literature, barring convergence via fixed point iteration. Instead, we incorporate entropy regularization and Boltzmann policies into the fixed point iteration. As a result, we obtain provable convergence to approximate fixed points where existing methods fail, and reach the original goal of approximate Nash equilibria. All proposed methods are evaluated with respect to their exploitability, on both instructive examples with tractable exact solutions and high-dimensional problems where exact methods become intractable. In high-dimensional scenarios, we apply established deep reinforcement learning methods and empirically combine fictitious play with our approximations.
The framework of mean field games (MFG) was introduced independently by the seminal works of Huang et al. (2006) and Lasry and Lions (2007) in the fully continuous setting of stochastic differential games. In the meantime, it has sparked great interest and investigation both in the mathematical community, where interests lie in the theoretical properties of MFGs, and in the applied research communities as a framework for solving and analyzing large-scale multi-agent problems. At its core lies the idea of reducing the classical, intractable multi-agent solution concept of Nash equilibria to the interaction between a representative agent and the 'mass' of infinitely many other agents – the so-called mean field. The solution to this limiting problem is the so-called mean field equilibrium (MFE), characterized by a forward evolution equation for the agents' state distribution and a backward optimality equation for the representative agent. Importantly, the MFE constitutes an approximate Nash equilibrium in the corresponding finite agent game of sufficiently many agents (Huang et al. (2006)), which would otherwise be intractable to compute (Daskalakis et al. (2009)). Nonetheless, computing an MFE remains difficult in the general case. Standard assumptions in existing literature are MFE uniqueness and operator contractivity (Huang et al. (2006), Anahtarcı et al. (2020), Guo et al. (2019)) to obtain convergence via simple fixed point iteration. While these assumptions hold true for some games, we address the case where such restrictive assumptions fail. Applications for such mean field models are manifold and include e.g. finance (Guéant et al. (2011)), power control (Kizilkale and Malhame (2016)), wireless communication (Aziz and Caines (2016)) or public health models (Laguzet and Turinici (2015)).
A motivating example.
Consider the following trivial situation informally: Let a large number of agents choose simultaneously between going left (L) or right (R). Afterwards, each agent shall be punished proportionally to the number of agents that chose the same action. If we had infinitely many independent, identically acting agents, the only stable solution would be to have all agents pick uniformly at random. The MFG formalism models this problem by picking one representative agent and abstracting all other agents into their state distribution. Unfortunately, analytically obtaining fixed points in general proves difficult and existing computational methods can fail.

Our contribution.
We begin by formulating the mean field analogue to finite games in game theory. In this setting we give simplified proofs for both existence and the approximate Nash equilibrium property of mean field equilibria. Moreover, we show that in finite MFGs, all non-constant fixed point operators are non-contractive, necessitating a different approach than naive fixed point iteration as in Anahtarcı et al. (2020). Consequently, we approximate the fixed point operator by introducing relative entropy regularization and Boltzmann policies. We prove guaranteed convergence for sufficiently high temperatures, while remaining arbitrarily exact for sufficiently low temperatures. Furthermore, repeatedly iterating on the prior policy allows us to perform an iterative descent on exploitability, successively improving the equilibrium approximation. Finally, our methods are extensively evaluated and compared to other methods such as fictitious play (FP, see Perrin et al. (2020)), which in general fail to converge to a fixed point. We outperform existing state-of-the-art methods in terms of exploitability in our problems, allowing us to find approximate mean field equilibria in the general case and paving the way to practical application of mean field games. In otherwise intractable problems, we apply deep reinforcement learning techniques together with particle-based simulations.
Consider a discrete-time $N$-agent stochastic game with finite agent state space $\mathcal{S}$ and finite agent action space $\mathcal{A}$, equipped with the discrete metric. Let $\mathcal{T} = \{0, 1, \ldots, T-1\}$ denote the time index set. Denote by $\mathcal{P}(\mathcal{X})$ the set of all Borel probability measures on a metric space $\mathcal{X}$. Since we work with finite spaces, we abuse notation and denote both a measure $\nu$ and its probability mass function by $\nu(\cdot)$. For each agent, the dynamical behavior is described by the state transition function $p \colon \mathcal{S} \times \mathcal{S} \times \mathcal{A} \times \mathcal{P}(\mathcal{S}) \to [0,1]$ and the initial state distribution $\mu_0 \colon \mathcal{S} \to [0,1]$. For agents $i = 1, \ldots, N$ at times $t \in \mathcal{T}$, their states $S_t^i$ and actions $A_t^i$ are random variables with values in $\mathcal{S}$ and $\mathcal{A}$ respectively. Let $G_{\mathbf{s}}^N \equiv \frac{1}{N} \sum_{i=1}^N \delta_{s_i}$ denote the empirical measure of agent states $\mathbf{s} = (s_1, \ldots, s_N) \in \mathcal{S}^N$, where $\delta$ is the Dirac measure. Consider for each agent $i$ a Markov policy $\pi^i = (\pi_t^i)_{t \in \mathcal{T}} \in \Pi$, where $\pi_t^i \colon \mathcal{A} \times \mathcal{S} \to [0,1]$ and $\Pi$ is the space of all Markov policies. The state evolution of agent $i$ begins with $S_0^i \sim \mu_0$ and subsequently for all applicable times $t$ follows
$$\mathbb{P}(A_t^i = a \mid S_t^i = s_i) \equiv \pi_t^i(a \mid s_i), \qquad \mathbb{P}(S_{t+1}^i = s_i' \mid \mathbf{S}_t = \mathbf{s}, A_t^i = a) \equiv p(s_i' \mid s_i, a, G_{\mathbf{s}}^N)$$
for arbitrary $s_i, s_i' \in \mathcal{S}$, $a \in \mathcal{A}$, $\mathbf{s} = (s_1, \ldots, s_N) \in \mathcal{S}^N$ and $\mathbf{S}_t = (S_t^1, \ldots, S_t^N)$. Finally, define agent $i$'s finite-horizon objective function
$$J_i^N(\pi^1, \ldots, \pi^N) \equiv \mathbb{E}\left[\sum_{t=0}^{T-1} r(S_t^i, A_t^i, G_{\mathbf{S}_t}^N)\right]$$
to be maximized, where $r \colon \mathcal{S} \times \mathcal{A} \times \mathcal{P}(\mathcal{S}) \to \mathbb{R}$ is the agent reward function. With this, we can give the notion of optimality used by Saldi et al. (2018).

Definition 1.
A Markov-Nash equilibrium is a $0$-Markov-Nash equilibrium. For $\varepsilon \geq 0$, an $\varepsilon$-Markov-Nash equilibrium (approximate Markov-Nash equilibrium) is defined as a tuple of policies $(\pi^1, \ldots, \pi^N) \in \Pi^N$ such that for any $i = 1, \ldots, N$, we have
$$J_i^N(\pi^1, \ldots, \pi^N) \geq \max_{\pi \in \Pi} J_i^N(\pi^1, \ldots, \pi^{i-1}, \pi, \pi^{i+1}, \ldots, \pi^N) - \varepsilon.$$
Since analyzing policies acting on joint state information or the state history is difficult, optimality has been restricted to the set of Markov policies $\Pi$ acting on the agent's own state. Although this may seem like a significant restriction, in the $N \to \infty$ limit, the evolution of all other agents – the mean field – becomes deterministic and therefore non-informative.

The $N \to \infty$ limit of the $N$-agent game constitutes its corresponding finite mean field game (i.e. with a finite state and action space). It consists of the same elements $\mathcal{T}, \mathcal{S}, \mathcal{A}, p, r, \mu_0$. However, instead of modeling $N$ separate agents, it models a single representative agent and collapses all other agents into their common state distribution, i.e. the mean field $\mu = (\mu_t)_{t \in \mathcal{T}} \in \mathcal{M}$ with $\mu_t \colon \mathcal{S} \to [0,1]$, where $\mathcal{M}$ is the space of all mean fields and $\mu_0$ is given. The deterministic mean field $\mu$ replaces the empirical measure of the finite game. Consider a Markov policy $\pi \in \Pi$ as before. For some fixed mean field $\mu$, the evolution of random states $S_t$ and actions $A_t$ begins with $S_0 \sim \mu_0$ and subsequently for all applicable times $t$ follows
$$\mathbb{P}(A_t = a \mid S_t = s) \equiv \pi_t(a \mid s), \qquad \mathbb{P}(S_{t+1} = s' \mid S_t = s, A_t = a) \equiv p(s' \mid s, a, \mu_t),$$
and the objective analogously becomes
$$J_\mu(\pi) \equiv \mathbb{E}\left[\sum_{t=0}^{T-1} r(S_t, A_t, \mu_t)\right].$$
The mean field $\mu$ induced by some fixed policy $\pi$ begins with the given $\mu_0$ and is defined recursively by
$$\mu_{t+1}(s') \equiv \sum_{s \in \mathcal{S}} \mu_t(s) \sum_{a \in \mathcal{A}} \pi_t(a \mid s)\, p(s' \mid s, a, \mu_t).$$
By fixing a mean field $\mu \in \mathcal{M}$, we obtain an induced Markov Decision Process (MDP) with time-dependent transition function $p(s' \mid s, a, \mu_t)$ and reward function $r(s, a, \mu_t)$. Denote the set-valued map from mean field to optimal policies $\pi$ of the induced MDP as $\hat{\Phi} \colon \mathcal{M} \to \Pi$ (such that $\pi \in \arg\max_{\pi'} \mathbb{E}_{\pi'}\big[\sum_{t=0}^{T-1} r(S_t, A_t, \mu_t) \,\big|\, S_0 = s\big]$ for all $s \in \mathcal{S}$). Analogously, define the map from a policy to its induced mean field as $\Psi \colon \Pi \to \mathcal{M}$. Finally, we can define the $N \to \infty$ analogue to Markov-Nash equilibria.

Definition 2.
A mean field equilibrium (MFE) is a pair $(\pi, \mu) \in \Pi \times \mathcal{M}$ such that $\pi \in \hat{\Phi}(\mu)$ and $\mu = \Psi(\pi)$ holds.

By defining any single-valued map $\Phi \colon \mathcal{M} \to \Pi$ to an optimal policy, we obtain a composition $\Gamma = \Psi \circ \Phi \colon \mathcal{M} \to \mathcal{M}$, henceforth the MFE operator. Shown by Saldi et al. (2018) for general Polish $\mathcal{S}$ and $\mathcal{A}$, the MFE exists and constitutes an approximate Markov-Nash equilibrium for sufficiently many agents under technical conditions. In the Appendix, we give simplified proofs for finite MFGs under the following standard assumption.

Assumption 1.
The functions $r(s, a, \mu_t)$ and $p(s' \mid s, a, \mu_t)$ are continuous, therefore bounded.

Note that we metrize probability measure spaces $\mathcal{P}(\mathcal{X})$ with the total variation distance $d_{TV}$. For probability measures $\nu, \nu'$ on finite spaces $\mathcal{X}$, $d_{TV}$ simplifies to
$$d_{TV}(\nu, \nu') = \frac{1}{2} \sum_{x \in \mathcal{X}} |\nu(x) - \nu'(x)|.$$
Accordingly, we equip $\Pi, \mathcal{M}$ with sup metrics, i.e. for policies $\pi, \pi' \in \Pi$ and mean fields $\mu, \mu' \in \mathcal{M}$ we define the metric spaces $(\Pi, d_\Pi)$ and $(\mathcal{M}, d_\mathcal{M})$ with
$$d_\Pi(\pi, \pi') \equiv \max_{t \in \mathcal{T}} \max_{s \in \mathcal{S}} d_{TV}(\pi_t(\cdot \mid s), \pi'_t(\cdot \mid s)), \qquad d_\mathcal{M}(\mu, \mu') \equiv \max_{t \in \mathcal{T}} d_{TV}(\mu_t, \mu'_t).$$

Proposition 1.
Under Assumption 1, there exists at least one MFE $(\pi^*, \mu^*) \in \Pi \times \mathcal{M}$.

Proof. See Appendix.
Theorem 1.
Under Assumption 1, if $(\pi^*, \mu^*)$ is an MFE, then for any $\varepsilon > 0$ there exists $N' \in \mathbb{N}$ such that for all $N > N'$, the policy $(\pi^*, \ldots, \pi^*)$ is an $\varepsilon$-Markov-Nash equilibrium in the $N$-agent game.

Proof. See Appendix.

Importantly, finding Nash equilibria in large-$N$ games is hard (Daskalakis et al. (2009)), whereas an MFE can be significantly more tractable to compute. Accordingly, solving the limiting MFG approximately solves the finite-$N$ game for large $N$ in a tractable manner.

Repeated application of the MFE operator constitutes the exact fixed point iteration approach to finding MFE. The standard assumption for convergence in the literature is contractivity and thereby MFE uniqueness (e.g. Caines and Huang (2019); Guo et al. (2019)).
Proposition 2.
Let $\Phi, \Psi$ be Lipschitz with constants $c_1, c_2$, fulfilling $c_1 c_2 < 1$. Then, the fixed point iteration $\mu^{n+1} = \Psi(\Phi(\mu^n))$ converges to the mean field of the unique MFE for any initial $\mu^0 \in \mathcal{M}$.

Proof. Let $\mu, \mu' \in \mathcal{M}$ be arbitrary, then
$$d_\mathcal{M}(\Gamma(\mu), \Gamma(\mu')) = d_\mathcal{M}(\Psi(\Phi(\mu)), \Psi(\Phi(\mu'))) \leq c_2 \cdot d_\Pi(\Phi(\mu), \Phi(\mu')) \leq c_1 \cdot c_2 \cdot d_\mathcal{M}(\mu, \mu').$$
Since $\mu, \mu'$ are arbitrary, $\Gamma$ is Lipschitz with constant $c_1 \cdot c_2 < 1$. $(\Pi, d_\Pi)$ and $(\mathcal{M}, d_\mathcal{M})$ are complete metric spaces (see Appendix). Therefore, Banach's fixed point theorem implies convergence to the unique fixed point for any starting $\mu^0 \in \mathcal{M}$.

Unfortunately, it remains unclear how to proceed if multiple optimal policies of an induced MDP exist, or if contractivity fails, e.g. when multiple MFE exist. In the following, consider again the illuminating example from the introduction. Consider $\mathcal{S} = \{C, L, R\}$, $\mathcal{A} = \mathcal{S} \setminus \{C\}$, $\mu_0(C) = 1$, $r(s, a, \mu_t) = -\mathbf{1}_{\{L\}}(s) \cdot \mu_t(L) - \mathbf{1}_{\{R\}}(s) \cdot \mu_t(R)$ and $\mathcal{T} = \{0, 1\}$. The transition function allows picking the next state directly, i.e. for all $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$,
$$\mathbb{P}(S_{t+1} = s' \mid S_t = s, A_t = a) = \mathbf{1}_{\{s'\}}(a).$$
Clearly, any MFE $(\pi^*, \mu^*)$ must fulfill $\pi_0^*(L \mid C) = \pi_0^*(R \mid C) = 1/2$, while $\pi_1^*$ can be arbitrary. Even if the operator $\Phi$ chooses suitable optimal policies, the fixed point operator $\Gamma$ remains non-contractive, as the mean field will necessarily alternate between left and right for any non-uniform starting mean field.

We observe that the example has infinitely many MFE, but no deterministic MFE, i.e. an MFE such that for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$ either $\pi_t(a \mid s) = 0$ or $\pi_t(a \mid s) = 1$ holds, similar to the classical game-theoretical insight of mixed Nash equilibrium existence (cf. Fudenberg and Tirole (1991)). Therefore, choosing optimal, deterministic policies will typically fail.

Most existing work assumes contractivity, which is too restrictive. In many scenarios, agents need to "coordinate" with each other. For example, a herd of hunting animals may collectively choose one of multiple hunting grounds, allowing for multiple MFEs. Hence, it can be difficult to apply existing MFG methodologies in practice, as many problems automatically fail contractivity.

From the previous example, we may be led to believe that non-contractivity is a general property of finite MFGs. And indeed, regardless of the number of MFEs, it turns out that in any finite MFG with non-constant MFE operator, a policy selection operator $\Phi$ with finite image $\Pi_\Phi$ will lead to non-contractivity. Note that this includes both the conventional $\arg\max$ and the $\arg\max$-e (cf. Guo et al. (2019)) choice of actions.

Theorem 2.
Let the image of $\Phi$ be a finite set $\Pi_\Phi \subseteq \Pi$. Then, either $\Gamma = \Psi \circ \Phi$ is constant, or $\Gamma$ is not Lipschitz continuous and thus not a contraction.

Proof. See Appendix.

Therefore, typical discrete-time finite MFGs have non-contractive fixed point operators and we must change our approach. Note that although non-contractivity does not imply non-convergence, the trivial example from before strongly suggests that non-convergence is the case for many finite MFGs.
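To make the non-convergence concrete, the following minimal sketch (our own illustration, not code from the paper) runs exact fixed point iteration on the left/right toy example from the introduction; the induced mean field at time 1 oscillates between all-left and all-right instead of settling at the uniform MFE.

```python
import numpy as np

# Toy left/right game: states C=0, L=1, R=2; actions go-L=0, go-R=1.
T, S, A = 2, 3, 2

def reward(s, mu_t):                     # punish choosing the crowded side
    return -mu_t[1] if s == 1 else (-mu_t[2] if s == 2 else 0.0)

def transition(a):                       # next state is determined by the action
    return 1 if a == 0 else 2

def greedy_policy(mu):                   # Phi(mu): backward induction, first argmax
    Q = np.zeros((T, S, A))
    for t in reversed(range(T)):
        for s in range(S):
            for a in range(A):
                cont = Q[t + 1, transition(a)].max() if t + 1 < T else 0.0
                Q[t, s, a] = reward(s, mu[t]) + cont
    pi = np.zeros((T, S, A))
    for t in range(T):
        for s in range(S):
            pi[t, s, Q[t, s].argmax()] = 1.0   # deterministic tie-break
    return pi

def induced_mean_field(pi):              # Psi(pi): forward propagation of mu_t
    mu = np.zeros((T, S))
    mu[0] = [1.0, 0.0, 0.0]
    for t in range(T - 1):
        for s in range(S):
            for a in range(A):
                mu[t + 1, transition(a)] += mu[t, s] * pi[t, s, a]
    return mu

pi = np.full((T, S, A), 1.0 / A)         # start from the uniform policy
for k in range(6):
    mu = induced_mean_field(pi)
    pi = greedy_policy(mu)
    print(k, np.round(mu[1], 2))         # mass alternates between L and R
```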
Exact fixed point iteration fails to solve most finite MFGs. Therefore, a different solution approach is necessary. In the following, we present two related approaches that guarantee convergence while plausibly remaining approximate Nash equilibria in the finite-$N$ case. For our results, we require a stronger Lipschitz assumption that implies Assumption 1.

Assumption 2.
The functions $r(s, a, \mu_t)$ and $p(s' \mid s, a, \mu_t)$ are Lipschitz continuous, therefore bounded.

A straightforward idea is regularization by replacing the objective by the well-known (see e.g. Abdolmaleki et al. (2018)) relative entropy objective
$$\tilde{J}_\mu(\pi) \equiv \mathbb{E}\left[\sum_{t=0}^{T-1} r(S_t, A_t, \mu_t) - \eta \log \frac{\pi_t(A_t \mid S_t)}{q_t(A_t \mid S_t)}\right]$$
with temperature $\eta > 0$ and positive prior policy $q \in \Pi$, i.e. $q_t(a \mid s) > 0$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$. Shown in the Appendix, the unique optimal policy $\tilde{\pi}^{\mu,\eta}_t$ fulfills
$$\tilde{\pi}^{\mu,\eta}_t(a \mid s) = \frac{q_t(a \mid s) \exp\left(\frac{\tilde{Q}^\eta(\mu, t, s, a)}{\eta}\right)}{\sum_{a' \in \mathcal{A}} q_t(a' \mid s) \exp\left(\frac{\tilde{Q}^\eta(\mu, t, s, a')}{\eta}\right)}$$
for the MDP induced by fixed $\mu \in \mathcal{M}$, with the soft action-value function $\tilde{Q}^\eta(\mu, t, s, a)$ given by the smooth-maximum Bellman recursion
$$\tilde{Q}^\eta(\mu, t, s, a) = r(s, a, \mu_t) + \sum_{s' \in \mathcal{S}} p(s' \mid s, a, \mu_t) \cdot \eta \log\left(\sum_{a' \in \mathcal{A}} q_{t+1}(a' \mid s') \exp \frac{\tilde{Q}^\eta(\mu, t+1, s', a')}{\eta}\right)$$
of the MDP induced by fixed $\mu \in \mathcal{M}$, with terminal condition $\tilde{Q}^\eta(\mu, T-1, s, a) \equiv r(s, a, \mu_{T-1})$. Note that we recover optimality as $\eta \to 0$, see Theorem 4. Define the relative entropy MFE operator $\tilde{\Gamma}^\eta \equiv \Psi \circ \tilde{\Phi}^\eta$ with policy selection $\tilde{\Phi}^\eta(\mu) \equiv \tilde{\pi}^{\mu,\eta}$ for all $\mu \in \mathcal{M}$.

Definition 3. An $\eta$-relative entropy mean field equilibrium ($\eta$-RelEnt MFE) for some positive prior policy $q \in \Pi$ is a pair $(\pi^E, \mu^E) \in \Pi \times \mathcal{M}$ such that $\pi^E = \tilde{\Phi}^\eta(\mu^E)$ and $\mu^E = \Psi(\pi^E)$ hold. An $\eta$-maximum entropy mean field equilibrium ($\eta$-MaxEnt MFE) is an $\eta$-RelEnt MFE with uniform prior policy $q$.

RelEnt MFE are guaranteed to exist for any $\eta > 0$ by Proposition 3. Furthermore, convergence to the regularized solution is guaranteed for large $\eta$ by Theorem 3.

Since only deterministic policies fail, a derivative approach is to use softmax policies directly with the unregularized action-value function, also called Boltzmann policies. Assume that the action-value function $Q^*$ fulfilling the Bellman equation
$$Q^*(\mu, t, s, a) = r(s, a, \mu_t) + \sum_{s' \in \mathcal{S}} p(s' \mid s, a, \mu_t) \cdot \max_{a' \in \mathcal{A}} Q^*(\mu, t+1, s', a')$$
of the MDP induced by fixed $\mu \in \mathcal{M}$ with terminal condition $Q^*(\mu, T-1, s, a) \equiv r(s, a, \mu_{T-1})$ is known. Define the map $\Phi^\eta(\mu) \equiv \pi^{\mu,\eta}$ for any $\mu \in \mathcal{M}$, where
$$\pi^{\mu,\eta}_t(a \mid s) \equiv \frac{q_t(a \mid s) \exp\left(\frac{Q^*(\mu, t, s, a)}{\eta}\right)}{\sum_{a' \in \mathcal{A}} q_t(a' \mid s) \exp\left(\frac{Q^*(\mu, t, s, a')}{\eta}\right)}$$
for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$ and temperature $\eta > 0$.

Definition 4. An $\eta$-Boltzmann mean field equilibrium ($\eta$-Boltzmann MFE) for some positive prior policy $q \in \Pi$ is a pair $(\pi^B, \mu^B) \in \Pi \times \mathcal{M}$ such that $\pi^B = \Phi^\eta(\mu^B)$ and $\mu^B = \Psi(\pi^B)$ hold.

Both $\eta$-RelEnt MFE and $\eta$-Boltzmann MFE are guaranteed to exist for any temperature $\eta > 0$.

Proposition 3.
Under Assumption 1, $\eta$-Boltzmann and $\eta$-RelEnt MFE exist for any temperature $\eta > 0$.

Proof. See Appendix.

Contractivity of both the $\eta$-Boltzmann MFE operator $\Gamma^\eta \equiv \Psi \circ \Phi^\eta$ and the $\eta$-RelEnt MFE operator $\tilde{\Gamma}^\eta \equiv \Psi \circ \tilde{\Phi}^\eta$ is guaranteed for sufficiently high temperatures, even if all possible original $\Phi$ are not Lipschitz continuous.

Theorem 3.
Under Assumption 2, $\mu \mapsto Q^*(\mu, t, s, a)$, $\mu \mapsto \tilde{Q}^\eta(\mu, t, s, a)$ and $\Psi(\pi)$ are Lipschitz continuous with constants $K_{Q^*}$, $K_{\tilde{Q}}$ and $K_\Psi$ for arbitrary $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$, $\eta > \eta'$, $\eta' > 0$. Furthermore, $\Gamma^\eta$ and $\tilde{\Gamma}^\eta$ are a contraction for
$$\eta > \max\left(\eta', \; |\mathcal{A}|\,(|\mathcal{A}| - 1)\, K_Q K_\Psi \, \frac{q_{\max}}{q_{\min}}\right),$$
where $K_Q = K_{Q^*}$ for $\Gamma^\eta$, $K_Q = K_{\tilde{Q}}$ for $\tilde{\Gamma}^\eta$, $q_{\max} \equiv \max_{t \in \mathcal{T}, s \in \mathcal{S}, a \in \mathcal{A}} q_t(a \mid s) > 0$ and $q_{\min} \equiv \min_{t \in \mathcal{T}, s \in \mathcal{S}, a \in \mathcal{A}} q_t(a \mid s) > 0$.

Proof. See Appendix.

Sufficiently large $\eta$ hence implies convergence via fixed point iteration. On the other hand, for sufficiently low temperatures $\eta$, both $\eta$-Boltzmann and $\eta$-RelEnt MFE will also constitute an approximate Markov-Nash equilibrium of the finite-$N$ game.

Theorem 4.
Under Assumption 2, if $(\pi^*_n, \mu^*_n)_{n \in \mathbb{N}}$ is a sequence of $\eta_n$-Boltzmann or $\eta_n$-RelEnt MFE with $\eta_n \to 0$, then for any $\varepsilon > 0$ there exist $n', N' \in \mathbb{N}$ such that for all $n > n'$, $N > N'$, the policy $(\pi^*_n, \ldots, \pi^*_n) \in \Pi^N$ is an $\varepsilon$-Markov-Nash equilibrium of the $N$-agent game, i.e.
$$J_i^N(\pi^*_n, \ldots, \pi^*_n) \geq \max_{\pi^i \in \Pi} J_i^N(\pi^*_n, \ldots, \pi^*_n, \pi^i, \pi^*_n, \ldots, \pi^*_n) - \varepsilon.$$

Proof.
See Appendix.

If we can obtain contractivity for sufficiently low $\eta$, we can find good approximate Markov-Nash equilibria. As it is impossible to have both $\eta \to 0$ and $\eta \to \infty$, it depends on the problem and prior whether we can converge to a good solution. Nonetheless, we find that it is often possible to empirically find a low $\eta$ that provides convergence as well as a good approximate MFE.

In principle, we can insert arbitrary prior policies $q \in \Pi$. Under Assumption 1, by boundedness of both $\tilde{Q}^\eta$ and $Q^*$ (see Appendix), both $\eta$-RelEnt and $\eta$-Boltzmann MFE policies converge to the prior policy as $\eta \to \infty$. Therefore, in principle we can show that for any $\varepsilon > 0$, for sufficiently large $\eta$ and $N$, the $\eta$-RelEnt and $\eta$-Boltzmann MFE under $q$ will be at most an $\varepsilon$-worse approximate Nash equilibrium than the prior policy. Furthermore, we obtain guaranteed contractivity by Theorem 3. Thus, any prior policy gives a worst-case bound on the performance achievable over all $\eta > 0$. On the other hand, if we obtain better results for sufficiently low $\eta$, we may iteratively improve our policy and thus our equilibrium quality.

The original work of Huang et al. (2006) introduces contractivity and uniqueness assumptions into the continuous MFG setting. Analogously, Guo et al. (2019) and Caines and Huang (2019) assume contractivity for discrete-time MFGs and dense graph limit MFGs respectively. Further existing work on discrete-time MFGs similarly assumes uniqueness of the MFE, which includes Saldi et al. (2018) and Gomes et al. (2010) for approximate optimality and existence results, and Anahtarcı et al. (2020) for an analysis on contractivity requirements. Mguni et al. (2018) solve discrete-time continuous state MFG problems under the classical uniqueness conditions of Lasry and Lions (2007). Further extensions of the MFG formalism include partial observability (Saldi et al. (2019)) or major agents (Nourian and Caines (2013)).

The work of Anahtarci et al. (2020) is related and studies theoretical properties of finite-$N$ regularized games and their limiting MFG. In their work, the existence and approximate Nash property of MFE in stationary regularized games is shown, and Q-Learning error propagation is investigated. In comparison, we consider the original, unregularized finite-$N$ game in a transient setting and perform extensive empirical evaluations. Guo et al. (2019) and Yang et al. (2018) previously proposed to apply Boltzmann policies. The former applies the approximation heuristically, while the latter focuses on directly solving finite-$N$ games.

An orthogonal approach to computing MFE is fictitious play. Rooted in game theory and classical economic works (Brown (1951)), it has since been adapted to MFGs. In fictitious play, all past mean fields (Cardaliaguet and Hadikhanloo (2017)) and policies (Perrin et al. (2020)) are averaged to produce a new mean field or policy. Importantly, convergence is guaranteed in certain special cases only (cf. Elie et al. (2019)). Although introduced in a differentiable setting, we evaluate fictitious play empirically in our setting and find that both our regularization and fictitious play may be combined successfully.
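Both policy selections above reduce to a prior-weighted softmax over (soft) action values. A minimal sketch of this map in our own notation (not the paper's code), assuming a precomputed value table Q of shape (T, |S|, |A|) and a positive prior q of the same shape:

```python
import numpy as np

def boltzmann_policy(Q, q, eta):
    """Prior-weighted softmax over action values (the map Phi^eta), assuming
    arrays Q and q of shape (T, |S|, |A|) with q > 0 and temperature eta > 0."""
    logits = np.log(q) + Q / eta
    logits -= logits.max(axis=-1, keepdims=True)   # stabilize the exponentials
    pi = np.exp(logits)
    return pi / pi.sum(axis=-1, keepdims=True)
```

As the temperature grows, the resulting policy approaches the prior q; as it shrinks towards zero, the policy approaches a greedy policy, matching the two limits discussed above.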
Figure 1: Mean exploitability over the final 10 iterations. Dashed lines represent maximum and minimum over the final 10 iterations. (a) LR, 10000 iterations; (b) RPS, 10000 iterations; (c) SIS, 10000 iterations. Maximum entropy (MaxEnt) results begin at higher temperatures due to limited floating point accuracy. Temperature zero depicts the exact fixed point iteration for both $\eta$-MaxEnt and $\eta$-Boltzmann MFE. In LR and RPS, $\eta$-MaxEnt and $\eta$-Boltzmann MFE coincide both with and without fictitious play (FP), here averaging both policy and mean field over all past iterations. The exploitability of the prior policy is indicated by the dashed horizontal line.

In practice, we find that our approaches are capable of generating solutions of lower exploitability than otherwise obtained. Unless stated otherwise, we compute everything exactly, use the maximum entropy objective (MaxEnt) with the uniform prior policy $q$ where $q_t(a \mid s) = 1/|\mathcal{A}|$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$, and initialize with $\mu^0 = \Psi(q)$ generated by $q$. As the main evaluation metric, we define the exploitability of a policy $\pi \in \Pi$ with induced mean field $\mu \equiv \Psi(\pi)$ as
$$\Delta J(\pi) \equiv \max_{\pi^*} J_\mu(\pi^*) - J_\mu(\pi).$$
Clearly, the exploitability of $\pi$ is zero if and only if $(\pi, \mu)$ is an MFE. Indeed, for any $\varepsilon > 0$, any policy $\pi \in \Pi$ is a $(\Delta J(\pi) + \varepsilon)$-Markov-Nash equilibrium if $N$ is sufficiently large, i.e. the exploitability translates directly to the limiting equilibrium quality in the finite-$N$ game, see also Theorem 4 and its proof.

We evaluate the algorithms on the LR, RPS, SIS and Taxi problems, ordered in increasing complexity. Details of the algorithms, hyperparameters, problems and experiment configurations as well as further experimental results can be found in the Appendix.

In Figure 1, we plot the minimum, maximum and mean exploitability for varying temperatures $\eta$ during the last 10 fixed point iterations, i.e. a single value when the exploitability (and usually the mean field) converges. Observe that the lowest convergent temperature outperforms not only the exact fixed point iteration (drawn at temperature zero), but also the uniform prior policy.

Although developed for a different setting, we also show results of fictitious play similar to the version from Perrin et al. (2020), i.e. both policies and mean fields are averaged over all past iterations. It can be seen that fictitious play only converges to the optimal solution in the LR problem. In the other examples, supplementing fictitious play with entropy regularization is effective at producing better results. A further fictitious play variant not found in existing literature, averaging only the policies, finds the exact MFE in RPS, but nevertheless fails in SIS. See the Appendix for further results.

Evaluating and solving finite-$N$ games is highly intractable by the curse of dimensionality, as the local state is no longer sufficient to perform dynamic programming in the presence of the random empirical state measure. Since it has already been proven that the exploitability for $N \to \infty$ will converge to the exploitability of the corresponding mean field game, we refrain from evaluating on finite-$N$ games.

Note that the plots are entirely deterministic and not stochastic as it would seem at first glance, since the depicted shaded area visualizes the non-convergence of exploitability and is a result of the fixed point updates running into a limit cycle (cf. Figure 2).
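In the exact (tabular) case, evaluating $\Delta J(\pi)$ amounts to one backward induction in the MDP induced by $\Psi(\pi)$. A minimal sketch under our own naming, assuming tabular arrays r[t, s, a] and p[t, s, a, s'] already evaluated at the induced mean field:

```python
import numpy as np

def exploitability(pi, mu0, r, p):
    """Delta J(pi) = max_pi* J_mu(pi*) - J_mu(pi), where r[t, s, a] and
    p[t, s, a, s'] are already evaluated at the mean field mu = Psi(pi)."""
    T, S, A = r.shape
    V_opt = np.zeros(S)                  # optimal value in the induced MDP
    V_pi = np.zeros(S)                   # value of the evaluated policy
    for t in reversed(range(T)):
        Q_opt = r[t] + p[t] @ V_opt      # shape (S, A), contraction over s'
        Q_pi = r[t] + p[t] @ V_pi
        V_opt = Q_opt.max(axis=-1)
        V_pi = (pi[t] * Q_pi).sum(axis=-1)
    return float(mu0 @ (V_opt - V_pi))
```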
In Figure 2, the difference between the exploitability of the current policy and the minimal exploitability reached during the final 10 iterations is shown for $\eta$-Boltzmann MFE. As the temperature $\eta$ decreases, time to convergence increases until non-convergence is reached in the form of a limit cycle. Analogous results for $\eta$-RelEnt MFE can be found in the Appendix.
Figure 2: (a) Difference between current and final minimum exploitability over the last 10 iterations; (b) Distance between current and final mean field. Plotted for the $\eta$-Boltzmann MFE iterations in SIS for different indicated temperature settings. Note the periodicity of the lowest temperature setting, indicating a limit cycle.

Note also that in LR, we can analytically find $K_Q = 1$ and $K_\Psi = 1$. Thus, we obtain guaranteed convergence via $\eta$-Boltzmann MFE iteration once $\eta$ exceeds the resulting threshold of Theorem 3, while in Figure 1 we see convergence already at considerably lower temperatures. Note further that the non-converged regime can allow for lower exploitability. However, it is unclear a priori when to stop, and for approximate solutions where DQN is used for evaluation, the evaluation of exploitability may become inaccurate.

For problems with intractably large state spaces, we adopt the DQN algorithm (Mnih et al. (2013)), using the implementation of Shengyi et al. (2020) as a base. Particle-based simulations are used for the mean field, and stochastic performance evaluation on the induced MDP is performed (see Appendix). Note that the approximation introduces three sources of stochasticity into the otherwise deterministic algorithms, i.e. stochastic evaluation, mean field simulation and DQN. To counteract the randomness, we average our results over multiple runs. The hyperparameters and architectures used are standard and can be found in the Appendix.

Fitting the soft action-value function directly using a network is numerically problematic, as the log-exponential transformation of approximated action values quickly fails due to limited floating point accuracy. Thus, we limit ourselves to the classical Bellman equation with Boltzmann policies only.

In Figure 3, we evaluate the exploitability of Boltzmann DQN iteration, evaluated exactly in SIS and RPS, and stochastically in Taxi over 2000 realizations. Minimum, maximum and mean exploitability are taken over the final 5 iterations and averaged over 5 seeds. Note that it is very time-consuming to solve a full reinforcement learning problem using DQN repeatedly in every iteration. Nonetheless, we observe that a temperature larger than zero appears to improve exploitability and convergence in the SIS example. Both due to the noisy nature of approximate solutions and the lower number of iterations, it can be seen that a higher temperature is required to converge than in the exact case.

In the intractable Taxi environment, the policy oscillates between two modes as in exact LR, and regularization fails to obtain better results, see also the Appendix. An important reason is that the prior policy performs extremely badly, as most states require specific actions for optimality. Hence we cannot find an $\eta > 0$ for which the algorithm both converges and performs well. Using prior descent and iteratively refining a better prior policy would likely increase performance, but is deferred to future investigations as the required computations grow very large.

Fictitious play is expensive in combination with approximate Q-Learning and particle simulations, as policies and particles of past iterations must be kept to perform exact fictitious play. For this reason, we do not attempt approximate fictitious play with approximate solution methods. In theory, supervised learning for fitting summarizing policies and randomly sampling particles may help, but is out of scope of this paper.
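The particle-based approximation of the induced mean field (Algorithm 5 in the Appendix) can be sketched as follows. This is our own minimal illustration, assuming user-supplied callables sample_initial_state(), sample_action(t, s) for the policy and sample_next_state(s, a, G_t) for the dynamics:

```python
import numpy as np

def simulate_mean_field(sample_initial_state, sample_action, sample_next_state,
                        T, num_states, num_particles, num_mean_fields):
    """Monte Carlo estimate of Psi(pi): propagate particles under the policy and
    average the resulting empirical state distributions over several runs."""
    mu = np.zeros((T, num_states))
    for _ in range(num_mean_fields):
        states = np.array([sample_initial_state() for _ in range(num_particles)])
        for t in range(T):
            G_t = np.bincount(states, minlength=num_states) / num_particles
            mu[t] += G_t / num_mean_fields
            states = np.array([sample_next_state(s, sample_action(t, s), G_t)
                               for s in states])
    return mu
```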
In Figure 4, we repeatedly perform outer iterations consisting of 100 $\eta$-RelEnt MFE iterations each with the indicated fixed temperature parameters in SIS. After each outer iteration, the prior policy is updated to the newest resulting policy. Note again that the results are entirely deterministic.

Searching for a suitable $\eta$ dynamically every iteration would keep the exploitability from increasing, as for $\eta \to \infty$ we obtain the original prior policy. Since it is expensive to scan over all temperatures in each outer iteration, we use a heuristic. Intuitively, since the prior will become increasingly good, it will be increasingly difficult to obtain a better policy.
Figure 3: Mean exploitability over the final 5 iterations using DQN, averaged over 5 seeds. Dashed lines represent the averaged maximum and minimum exploitability over the last 5 iterations. (a) RPS, 1000 iterations; (b) SIS, 50 iterations; (c) Taxi, 15 iterations. Evaluation of exploitability is exact except in Taxi, which uses DQN and averages over 1000 episodes. The point of zero temperature depicts fixed point iteration using exact DQN policies.

Figure 4: Exploitability over outer iterations in SIS, using 100 $\eta$-RelEnt MFE iterations per outer iteration, for several fixed temperatures $\eta$ and temperature adjustment factors $c$. Note that the results are deterministic. Not shown: running the fixed temperature settings $c = 1$ for longer still does not converge.

Thus, increasing the temperature will help sticking close to the prior and converge. Consequently, we use the simple heuristic
$$\eta_{i+1} = \eta_i \cdot c$$
for each outer iteration $i$, where $c \geq 1$ adjusts the temperature after each outer iteration.

Importantly, even for our simple heuristic, prior descent already achieves a lower exploitability than the best mean exploitability obtained with the fixed uniform prior policy in Figure 1. Furthermore, repeated prior policy updates succeed in computing the exact MFE in RPS and LR under a fixed temperature (see Appendix). Note that prior descent creates a double loop around solving the optimal control problem, becoming highly expensive under deep reinforcement learning. Hence, we refrain from prior descent with DQN. Automatically adjusting temperatures to monotonically improve exploitability is left for potential future work.

In this work, we have investigated the necessity and feasibility of approximate MFG solution approaches – entropy regularization, Boltzmann policies and prior descent – in the context of finite MFGs. We have shown that the finite MFG case typically cannot be solved by exact fixed point iteration or fictitious play alone. Entropy regularization and Boltzmann policies in combination with deep reinforcement learning may enable feasible computation of approximate MFE. We believe that lifting the restriction of inherent contractivity is an important step in ensuring applicability of MFG models in practical problems. We hope that entropy regularization and the insight for finite MFGs can help transfer the MFG formalism from its so-far mostly theory-focused context into real-world application scenarios. Nonetheless, there still remain many restrictions to the applicability of the MFG formalism. For future work, an efficient, automatic temperature adjustment for prior descent could be fruitful. Furthermore, it would be interesting to generalize relative entropy MFGs to infinite horizon discounted problems, continuous time, and continuous state and action spaces. Moreover, it could be of interest to investigate theoretical properties of fictitious play in finite MFGs in combination with entropy regularization. For non-Lipschitz mappings from policy to induced mean field, the proposed approach does not provide a solution. It could nonetheless be important to consider problems with threshold-type dynamics and rewards, e.g. majority vote problems. Most notably, the current formalism precludes common noise entirely, i.e. any games with common observations.
In practice, many problems will allow for some type of common observation between agents, leading to non-independent agent distributions and stochastic as opposed to deterministic mean fields.
Acknowledgements
This work has been funded by the LOEWE research promotion initiative of the federal state of Hessen, Germany, within the program area KOM of the emergenCITY center. The authors acknowledge the Lichtenberg high performance computing cluster of the TU Darmstadt for providing computational facilities for the calculations of this research.
References
Minyi Huang, Roland P Malhamé, Peter E Caines, et al. Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221–252, 2006.

Jean-Michel Lasry and Pierre-Louis Lions. Mean field games. Japanese Journal of Mathematics, 2(1):229–260, 2007.

Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a Nash equilibrium. SIAM Journal on Computing, 39(1):195–259, 2009.

Berkay Anahtarcı, Can Deha Karıksız, and Naci Saldi. Value iteration algorithm for mean-field games. Systems & Control Letters, 143:104744, 2020.

Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. Learning mean-field games. In Advances in Neural Information Processing Systems, pages 4966–4976, 2019.

Olivier Guéant, Jean-Michel Lasry, and Pierre-Louis Lions. Mean field games and applications. In Paris-Princeton Lectures on Mathematical Finance 2010, pages 205–266. Springer, 2011.

Arman C Kizilkale and Roland P Malhame. Collective target tracking mean field control for Markovian jump-driven models of electric water heating loads. In Control of Complex Systems, pages 559–584. Elsevier, 2016.

Mohamad Aziz and Peter E Caines. A mean field game computational methodology for decentralized cellular network optimization. IEEE Transactions on Control Systems Technology, 25(2):563–576, 2016.

Laetitia Laguzet and Gabriel Turinici. Individual vaccination as Nash equilibrium in a SIR model with application to the 2009–2010 influenza A (H1N1) epidemic in France. Bulletin of Mathematical Biology, 77(10):1955–1984, 2015.

Sarah Perrin, Julien Perolat, Mathieu Laurière, Matthieu Geist, Romuald Elie, and Olivier Pietquin. Fictitious play for mean field games: Continuous time analysis and applications. arXiv preprint arXiv:2007.03458, 2020.

Naci Saldi, Tamer Basar, and Maxim Raginsky. Markov–Nash equilibria in mean-field games with discounted cost. SIAM Journal on Control and Optimization, 56(6):4256–4287, 2018.

Peter E Caines and Minyi Huang. Graphon mean field games and the GMFG equations: ε-Nash equilibria. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 286–292. IEEE, 2019.

Drew Fudenberg and Jean Tirole. Game Theory. MIT Press, 1991.

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations, 2018.

Diogo A Gomes, Joana Mohr, and Rafael Rigao Souza. Discrete time, finite state space mean field games. Journal de mathématiques pures et appliquées, 93(3):308–328, 2010.

David Mguni, Joel Jennings, and Enrique Munoz de Cote. Decentralised learning in systems with many, many strategic agents. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Naci Saldi, Tamer Başar, and Maxim Raginsky. Approximate Nash equilibria in partially observed stochastic games with mean-field interactions. Mathematics of Operations Research, 44(3):1006–1033, 2019.

Mojtaba Nourian and Peter E Caines. Epsilon-Nash mean field game theory for nonlinear stochastic dynamical systems with major and minor agents. SIAM Journal on Control and Optimization, 51(4):3302–3331, 2013.

Berkay Anahtarci, Can Deha Kariksiz, and Naci Saldi. Q-learning in regularized mean-field games. arXiv preprint arXiv:2003.12151, 2020.

Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5571–5580, 2018.

George W Brown. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374–376, 1951.

Pierre Cardaliaguet and Saeed Hadikhanloo. Learning in mean field games: the fictitious play. ESAIM: Control, Optimisation and Calculus of Variations, 23(2):569–591, 2017.

Romuald Elie, Julien Pérolat, Mathieu Laurière, Matthieu Geist, and Olivier Pietquin. Approximate fictitious play for mean field games. arXiv preprint arXiv:1907.02633, 2019.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Huang Shengyi, Dossa Rousslan, and Chang Ye. CleanRL: High-quality single-file implementation of deep reinforcement learning algorithms. https://github.com/vwxyzjn/cleanrl/, 2020.

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995–2003, 2016.

Lloyd Shapley. Some topics in two-person games. Advances in Game Theory, 52:1–29, 1964.

Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, pages 1352–1361, 2017.

Boris Belousov and Jan Peters. Entropic regularization of Markov decision processes. Entropy, 21(7):674, 2019.
A Experimental Details
A.1 Algorithms

Algorithm 1: Exact fixed point iteration
  Initialize $\mu^0 = \Psi(q)$ as the mean field induced by the uniformly random policy $q$.
  for $k = 0, 1, \ldots$ do
    Compute the Q-function $Q^*(\mu^k, t, s, a)$ for fixed $\mu^k$.
    Choose $\pi^k \in \Pi$ such that $\pi^k_t(a \mid s) > 0 \Rightarrow a \in \arg\max_{a' \in \mathcal{A}} Q^*(\mu^k, t, s, a')$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$, by putting all probability mass on the first optimal action, or evenly on all optimal actions.
    Optionally: Overwrite $\pi^k \leftarrow \frac{1}{k+1} \pi^k + \frac{k}{k+1} \pi^{k-1}$. (FP averaged policy)
    Compute the mean field $\mu^{k+1} = \Psi(\pi^k)$ induced by $\pi^k$.
    Optionally: Overwrite $\mu^{k+1} \leftarrow \frac{1}{k+1} \mu^{k+1} + \frac{k}{k+1} \mu^k$. (FP averaged mean field)
  end for

Algorithm 2: Boltzmann / RelEnt iteration
  Input: Temperature $\eta > 0$, prior policy $q \in \Pi$.
  Initialize $\mu^0 = \Psi(q)$ as the mean field induced by $q$.
  for $k = 0, 1, \ldots$ do
    Compute the Q-function (Boltzmann) or soft Q-function (RelEnt) $Q(\mu^k, t, s, a)$ for fixed $\mu^k$.
    Define $\pi^k$ by $\pi^k_t(a \mid s) = \frac{q_t(a \mid s) \exp\left(Q(\mu^k, t, s, a)/\eta\right)}{\sum_{a' \in \mathcal{A}} q_t(a' \mid s) \exp\left(Q(\mu^k, t, s, a')/\eta\right)}$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.
    Optionally: Overwrite $\pi^k \leftarrow \frac{1}{k+1} \pi^k + \frac{k}{k+1} \pi^{k-1}$. (FP averaged policy)
    Compute the mean field $\mu^{k+1} = \Psi(\pi^k)$ induced by $\pi^k$.
    Optionally: Overwrite $\mu^{k+1} \leftarrow \frac{1}{k+1} \mu^{k+1} + \frac{k}{k+1} \mu^k$. (FP averaged mean field)
  end for

Algorithm 3: Boltzmann DQN iteration
  Input: Temperature $\eta > 0$, prior policy $q \in \Pi$.
  Input: Simulation parameters, DQN hyperparameters.
  Initialize $\mu^0 \approx \Psi(q)$ as the mean field induced by $q$ using Algorithm 5.
  for $k = 0, 1, \ldots$ do
    Approximate the Q-function $Q^*(\mu^k, t, s, a)$ using Algorithm 4 on the MDP induced by $\mu^k$.
    Define $\pi^k$ by $\pi^k_t(a \mid s) = \frac{q_t(a \mid s) \exp\left(Q^*(\mu^k, t, s, a)/\eta\right)}{\sum_{a' \in \mathcal{A}} q_t(a' \mid s) \exp\left(Q^*(\mu^k, t, s, a')/\eta\right)}$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.
    Approximately simulate the mean field $\mu^{k+1} \approx \Psi(\pi^k)$ induced by $\pi^k$ using Algorithm 5.
  end for

A.2 Implementation details

For all the DQN experiments, we use the configurations given in Table 1 and hyperparameters given in Table 2. Note that we add epsilon scheduling and a discount factor to DQN for stability reasons, i.e. the loss term has an additional factor smaller than one before the maximum operation, cf. Mnih et al. (2013). For the action-value network, we use a fully connected dueling architecture (Wang et al. (2016)) with one shared hidden layer of 256 neurons, and one separate hidden layer of 256 neurons for the value and advantage stream each. As the activation function, we use ReLU. Further, we use gradient norm clipping and the ADAM optimizer. To allow for time-dependent policies, we append the current time to the observations.

We transform all discrete-valued observations except time to corresponding one-hot vectors, except in the intractably large Taxi environment where we simply observe one value in $\{0, 1\}$ for each tile's passenger status. For evaluation of exploitability, we compare the values of the optimal policy and the evaluated policy in the MDP induced by the mean field generated by the evaluated policy. In intractable cases, we use DQN to approximately obtain the optimal policy. In this case, we obtain the values by averaging over many episodes in the MDP induced by the mean field generated by the evaluated policy via Algorithm 5.

Algorithm 4: DQN
  Input: Number of epochs $L$, mini-batch size $N$, target update frequency $M$, replay buffer size $D$.
  Input: Probability of random action $\epsilon$, discount factor $\gamma$, ADAM and gradient clipping parameters.
  Initialize network $Q_\theta$, target network $Q_{\theta'} \leftarrow Q_\theta$ and replay buffer $\mathcal{D}$ of size $D$.
  for $L$ epochs do
    for $t = 1, \ldots, T$ do
      (One environment step)
      Let new action $a_t \leftarrow \arg\max_{a \in \mathcal{A}} Q_\theta(t, s_t, a)$, or with probability $\epsilon$ sample uniformly at random instead.
      Sample new state $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.
      Add transition tuple $(s_t, a_t, r(s_t, a_t), s_{t+1})$ to replay buffer $\mathcal{D}$.
      (One mini-batch descent step)
      Sample from the replay buffer: $\{(s^i_t, a^i_t, r^i_t, s^i_{t+1})\}_{i=1,\ldots,N} \sim \mathcal{D}$.
      Compute the loss $J_Q = \sum_{i=1}^N \left(r^i_t + \gamma \max_{a' \in \mathcal{A}} Q_{\theta'}(t+1, s^i_{t+1}, a') - Q_\theta(t, s^i_t, a^i_t)\right)^2$.
      Update $\theta$ according to $\nabla_\theta J_Q$ using ADAM with gradient norm clipping.
      if number of steps mod $M = 0$ then
        Update target network $\theta' \leftarrow \theta$.
      end if
    end for
  end for

Algorithm 5: Stochastic mean field simulation
  Input: Number of mean fields $K$, number of particles $M$, policy $\pi$.
  for $k = 1, \ldots, K$ do
    Initialize particles $x^0_m \sim \mu_0$ for all $m = 1, \ldots, M$.
    for $t \in \mathcal{T}$ do
      Define the empirical measure $G^k_t \leftarrow \frac{1}{M} \sum_{m=1}^{M} \delta_{x^t_m}$.
      for $m = 1, \ldots, M$ do
        Sample action $a \sim \pi_t(\cdot \mid x^t_m)$.
        Sample new particle state $x^{t+1}_m \sim p(\cdot \mid x^t_m, a, G^k_t)$.
      end for
    end for
  end for
  return the average empirical mean field $\left(\frac{1}{K} \sum_{k=1}^K G^k_t\right)_{t \in \mathcal{T}}$

A.3 Problems
Summarizing properties of the considered problems are given in Table 3.
LR.
Similar to the example mentioned in the main text, we let a large number of agents choose simultaneously between going left (L) or right (R). Afterwards, each agent shall be punished proportionally to the number of agents that chose the same action, but more so for choosing right than left.

More formally, let $\mathcal{S} = \{C, L, R\}$, $\mathcal{A} = \mathcal{S} \setminus \{C\}$, $\mu_0(C) = 1$, $r(s, a, \mu_t) = -\mathbf{1}_{\{L\}}(s) \cdot \mu_t(L) - c \cdot \mathbf{1}_{\{R\}}(s) \cdot \mu_t(R)$ with a constant $c > 1$, and $\mathcal{T} = \{0, 1\}$. Note the difference to the toy example in the main text: right is punished more than left. The transition function allows picking the next state directly, i.e. for all $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$,
$$\mathbb{P}(S_{t+1} = s' \mid S_t = s, A_t = a) = \mathbf{1}_{\{s'\}}(a).$$
For this example, we have $K_Q = 1$ since the return $Q$ of the initial state changes linearly with $\mu$ and is bounded, while the distance between two mean fields is also bounded by 1. Analogously, $K_\Psi = 1$ since $\Psi(\pi)$ similarly changes linearly with $\pi$, and both can change at most by 1. Thus, we obtain guaranteed convergence via Boltzmann iteration once $\eta$ exceeds the threshold of Theorem 3. In numerical evaluations, we see convergence already at lower temperatures.

Algorithm 6: Prior descent
  Input: Number of outer iterations $I$.
  Input: Initial prior policy $q \in \Pi$.
  for outer iteration $i = 1, \ldots, I$ do
    Find $\eta$ heuristically or minimally such that Algorithm 2 with temperature $\eta$ and prior $q$ converges.
    if no such $\eta$ exists then
      return $q$
    end if
    $q \leftarrow$ solution of Algorithm 2 with temperature $\eta$ and prior $q$.
  end for

Table 1: Boltzmann DQN Iteration Parameters
Parameter                             RPS    SIS    Taxi
Fixed point iteration count
Number of particles for mean field
Number of mean fields
Number of episodes for evaluation
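The prior descent loop can be written compactly in terms of the Boltzmann iteration. The following is our own sketch, assuming a function boltzmann_iteration(eta, prior) that runs Algorithm 2 and reports whether the exploitability converged, and using the simple heuristic temperature update from the main text in place of a full temperature search:

```python
def prior_descent(initial_prior, eta0, c, num_outer_iterations, boltzmann_iteration):
    """Outer loop of prior descent: after every converged inner run, the resulting
    policy becomes the new prior and the temperature is scaled by c >= 1."""
    q, eta = initial_prior, eta0
    for _ in range(num_outer_iterations):
        policy, converged = boltzmann_iteration(eta=eta, prior=q)
        if not converged:          # no usable temperature found, keep current prior
            return q
        q = policy                 # iterate on the prior with the newest policy
        eta = eta * c              # heuristic temperature adjustment (cf. Figure 4)
    return q
```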
RPS.
This game is inspired by Shapley (1964) and their generalized non-zero-sum version of Rock-Paper-Scissors, for which classical fictitious play would not converge. Each of the agents can choose between rock, paper and scissors, and obtains a reward proportional to double the number of beaten agents minus the number of agents beating the agent. We modify the proportionality factors such that a uniformly random prior policy does not constitute a mean field equilibrium.

Let $\mathcal{S} = \{0, R, P, S\}$, $\mathcal{A} = \mathcal{S} \setminus \{0\}$, $\mu_0(0) = 1$, $\mathcal{T} = \{0, 1\}$, and for any $a \in \mathcal{A}$, $\mu_t \in \mathcal{P}(\mathcal{S})$,
$$r(R, a, \mu_t) = 2 \cdot \mu_t(S) - \mu_t(P), \quad r(P, a, \mu_t) = 4 \cdot \mu_t(R) - 2 \cdot \mu_t(S), \quad r(S, a, \mu_t) = 6 \cdot \mu_t(P) - 3 \cdot \mu_t(R).$$
The transition function allows picking the next state directly, i.e. for all $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$,
$$\mathbb{P}(S_{t+1} = s' \mid S_t = s, A_t = a) = \mathbf{1}_{\{s'\}}(a).$$

SIS.
In this problem, a large number of agents can choose between social distancing (D) or going out (U). If a susceptible (S) agent chooses social distancing, they cannot become infected (I). Otherwise, an agent may become infected with a probability proportional to the number of agents being infected. If infected, an agent will recover with a fixed chance every time step. Both social distancing and being infected have an associated cost.

Let $\mathcal{S} = \{S, I\}$, $\mathcal{A} = \{U, D\}$ and $\mathcal{T} = \{0, \ldots, 49\}$, with a small initial fraction $\mu_0(I)$ of infected agents. The reward penalizes being infected with a per-step cost and social distancing with a smaller per-step cost, i.e. $r(s, a, \mu_t)$ is the negative sum of these costs. We find that similar parameters produce similar results, and set the transition probability mass functions such that an infected agent recovers with a fixed probability each step, a susceptible agent going out becomes infected with a probability proportional to the fraction of infected agents, i.e. $\mathbb{P}(S_{t+1} = I \mid S_t = S, A_t = U) \propto \mu_t(I)$, and a susceptible agent that distances cannot become infected,
$$\mathbb{P}(S_{t+1} = I \mid S_t = S, A_t = D) = 0.$$
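For concreteness, a minimal sketch of the SIS mean field evolution under a fixed policy follows. The numeric constants below (initial infection level, recovery and infection rates) are illustrative placeholders of our own choosing, not the values used in the paper.

```python
import numpy as np

# Illustrative constants only -- not the parameters used in the paper.
T, INIT_INFECTED, RECOVERY, INFECTIVITY = 50, 0.1, 0.2, 0.8

def sis_mean_field(pi_distancing):
    """Propagate mu_t = (mu_t(S), mu_t(I)) given the probability pi_distancing[t]
    that a susceptible agent chooses social distancing at time t."""
    mu = np.zeros((T, 2))
    mu[0] = [1.0 - INIT_INFECTED, INIT_INFECTED]
    for t in range(T - 1):
        p_infect = (1.0 - pi_distancing[t]) * INFECTIVITY * mu[t, 1]
        new_infected = mu[t, 0] * p_infect
        recovered = mu[t, 1] * RECOVERY
        mu[t + 1] = [mu[t, 0] - new_infected + recovered,
                     mu[t, 1] + new_infected - recovered]
    return mu

mu = sis_mean_field(np.full(T, 0.5))   # e.g. half of the susceptible agents distance
```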
Taxi.

In this problem, we consider a $K \times L$ grid. The state is described by a tuple $(x, y, x', y', p, B)$ where $(x, y)$ is the agent's position, $(x', y')$ indicates the current desired destination of the passenger or is $(0, 0)$ otherwise, and $p \in \{0, 1\}$ indicates whether a passenger is in the taxi or not. Finally, $B$ is a $K \times L$ matrix indicating whether a new passenger is available for the taxi on the corresponding tile. All taxis start on the same tile and have no passengers in the queue or on the map at the beginning. The problem runs for 100 time steps.

The taxi can choose between five actions $W, U, D, L, R$, where $W$ (Wait) allows the taxi to pick up or deliver passengers, and $U, D, L, R$ (Up, Down, Left, Right) allow it to move in all four directions. As there are many taxis, there is a chance of a jam on tile $s$ proportional to the fraction of taxis $\mu_t(s)$ on that tile (capped at a fixed maximum), in which case the taxi will not move. The taxi also cannot move into walls or back into the starting tile, in which case it will stay on its current tile. With a fixed probability, a new passenger spawns on one randomly chosen free tile of each region. On picking up a passenger, the destination is generated by randomly picking any free tile of the same region. Delivering passengers to a destination and picking them up gives a region-dependent reward, with different values in regions 1 and 2.

For our experiments, we use a small map with a single starting tile (denoted S), impassable walls (denoted H), and the remaining free tiles split into the two regions.
Table 2: DQN Hyperparameters
Hyperparameter                     Value
Replay buffer size
ADAM learning rate
Discount factor
Target update frequency
Gradient clipping norm
Mini-batch size
Epsilon schedule (linear decay)
Total epochs

Table 3: Problem Properties
Problem    |T|    |S|    |A|
LR           2      3      2
RPS          2      4      3
SIS         50      2      2
Taxi       100      ~      5
This produces a similar situation as in LR, where a fraction of taxis should choose each region so that the values balance out, while also requiring the solution of a problem that is intractable to solve exactly via dynamic programming.
A.4 Further experiments
Figure 5: Mean exploitability (solid lines), maximum and minimum (dashed lines) over the final 10 iterations of the last outer iteration. 50 outer iterations and 100 inner iterations each; (a, d) LR; (b, e) RPS; (c, f) SIS. Panels (a–c) show $\eta$-Boltzmann and panels (d–f) $\eta$-MaxEnt prior descent, each for temperature adjustment factors $c = 1.0$, $1.1$ and $1.2$. Maximum entropy (MaxEnt) results begin at higher temperatures due to limited floating point accuracy. The exploitability of the initial uniform prior policy is indicated by the dashed horizontal line.
Figure 6: Mean exploitability over the final 10 iterations for fictitious play variants averaging only the policy or only the mean field, for both $\eta$-Boltzmann and $\eta$-MaxEnt iteration. Dashed lines represent maximum and minimum over the final 10 iterations. (a) LR, 10000 iterations; (b) RPS, 10000 iterations; (c) SIS, 1000 iterations. The exploitability of the uniform prior policy is indicated by the dashed horizontal line.

In Figure 5, we observe that prior descent for both Boltzmann and RelEnt MFE with the same uniform prior policy performs qualitatively similarly, and the two coincide in LR and SIS except for numerical inaccuracies. It can be seen that using a temperature sufficiently low to converge in LR and RPS allows prior descent to descend to the exact MFE iteratively. In SIS on the other hand, picking a fixed temperature that converges for the initial uniform prior policy does not guarantee monotonic improvement of exploitability afterwards. Instead, by applying the heuristic
$$\eta_{i+1} = \eta_i \cdot c$$
for each outer iteration $i$, where $c \geq 1$ adjusts the temperature after each outer iteration, we avoid scanning over all temperatures in each step and reach convergence to a good approximate mean field equilibrium for both Boltzmann and MaxEnt iteration.
Figure 7: (a) Difference between current and final minimum exploitability over the last 10 iterations; (b) Distance between current and final mean field, cut off at 500 iterations for readability. Plotted for the $\eta$-RelEnt iterations in SIS for the indicated temperature settings and uniform prior policy.
Figure 8: Difference between current and final estimated minimum exploitability over the last 5 iterations. (a) SIS, 50 iterations; (b) Taxi, 15 iterations. Plotted for the $\eta$-Boltzmann DQN iteration for the indicated temperature settings and uniform prior policy.

In Figure 6, empirical results are shown for fictitious play variants averaging only the policy or only the mean field. In the simple one-step toy problems LR and RPS, averaging the policies appears to converge to the exact solution without regularization and to the regularized solution with regularization. Averaging the mean fields on the other hand fails, since this method can only produce deterministic policies. By applying any amount of regularization, averaging the mean fields is led to success in LR and SIS. Nonetheless, both methods fail to converge to the MFE in SIS and produce worse results than obtained by prior descent in Figure 5.

In Figure 7 we depict the convergence of exploitability and mean field of MaxEnt iteration in SIS. The results are qualitatively similar to Boltzmann iteration and, as in the main text, show the convergence behaviour near the critical temperature leading to convergence.

In Figure 8 we depict the convergence of exploitability for Boltzmann DQN iteration in SIS and Taxi during one of the runs. All 4 other runs show similar qualitative behaviour. As can be seen, the highest temperature shows less oscillatory behaviour, stabilizing Boltzmann DQN iteration. In Taxi, it can be seen that the used temperatures are insufficient to allow Boltzmann DQN iteration to converge. We believe that using prior descent could allow for better results. We could not verify this due to the high computational cost, as this includes repeatedly and sequentially solving an expensive reinforcement learning problem.

Finally, in Figure 9 we depict the resulting behavior in the SIS case. In the Boltzmann iteration result, at the beginning the number of infected is high enough to make social distancing the optimal action to take. As the number of infected falls, it reaches an equilibrium point where social distancing and potentially getting infected are of equal value. Finally, as the game ends at time $t = T = 50$, there is no point in social distancing any more. Our approach yields intuitive results here, while exact fixed point iteration and FP fail to converge.

Figure 9: Fraction of infected agents and fraction of susceptible agents picking social distancing over time. (a, d): Boltzmann iteration; (b, e): exact fixed point iteration; (c, f): fictitious play (averaging both policy and mean field) results in SIS after 500 iterations. More iterations and averaging only policy or mean field show the same qualitative results.

B Proofs
B Proofs

B.1 Completeness of mean field and policy space

Lemma B.1.1.
The metric spaces (Π, d_Π) and (M, d_M) are complete metric spaces.

Proof. We first show that (M, d_M) is a complete metric space. Let (µⁿ)_{n∈ℕ} ∈ M^ℕ be a Cauchy sequence of mean fields. Then by definition, for any ε > 0 there exists an integer N > 0 such that for any m, n > N we have

d_M(µⁿ, µᵐ) < ε
⟹ ∀ t ∈ T: d_TV(µⁿ_t, µᵐ_t) = (1/2) Σ_{s∈S} |µⁿ_t(s) − µᵐ_t(s)| < ε
⟹ ∀ t ∈ T, s ∈ S: |µⁿ_t(s) − µᵐ_t(s)| < 2ε.

By completeness of ℝ, the limit of (µⁿ_t(s))_{n∈ℕ} exists for all t ∈ T, s ∈ S, suggestively denoted by µ_t(s). The mean field µ = {µ_t}_{t∈T} with the probabilities defined by the aforementioned limits fulfills µⁿ → µ and is in M, showing completeness of M. We proceed analogously for (Π, d_Π). Thus, (Π, d_Π) and (M, d_M) are complete metric spaces.
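As a concrete reading of the metrics used throughout this appendix, the following is a small helper sketch. The choice d_M(µ, µ′) = max_t d_TV(µ_t, µ′_t) and d_Π(π, π′) = max_{t,s} Σ_a |π_t(a|s) − π′_t(a|s)| is an assumption, consistent with the inequality d_TV(µ_t, µ′_t) ≤ d_M(µ, µ′) and the policy distance manipulated later in Appendix B.7; the exact definitions live in the main text.

```python
import numpy as np

def d_tv(nu1, nu2):
    """Total variation distance between two pmfs on the finite state space S."""
    return 0.5 * np.abs(np.asarray(nu1) - np.asarray(nu2)).sum()

def d_mean_field(mu1, mu2):
    """Assumed metric on M: worst total variation distance over all times t."""
    return max(d_tv(mu1[t], mu2[t]) for t in range(len(mu1)))

def d_policy(pi1, pi2):
    """Assumed metric on Pi: worst L1 distance over actions, over all t and s."""
    pi1, pi2 = np.asarray(pi1), np.asarray(pi2)   # shape (T, |S|, |A|)
    return np.abs(pi1 - pi2).sum(axis=-1).max()
```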
B.2 Lipschitz continuity

Lemma B.2.1. Assume bounded and Lipschitz functions f: X → ℝ and g: X → ℝ mapping from a metric space (X, d_X) into ℝ with Lipschitz constants C_f, C_g and bounds |f(x)| ≤ M_f, |g(x)| ≤ M_g. The sum f + g, the product f · g and the maximum max(f, g) are all Lipschitz and bounded, with Lipschitz constants C_f + C_g, (M_f C_g + M_g C_f), max(C_f, C_g) and bounds M_f + M_g, M_f M_g, max(M_f, M_g), respectively.

Proof. Let x, y ∈ X be arbitrary. By the triangle inequality, we obtain

|f(x) + g(x) − (f(y) + g(y))| ≤ |f(x) − f(y)| + |g(x) − g(y)| ≤ (C_f + C_g) d_X(x, y).

Analogously, we obtain

|f(x)g(x) − f(y)g(y)| ≤ |f(x)g(x) − f(x)g(y)| + |f(x)g(y) − f(y)g(y)| ≤ (M_f C_g + M_g C_f) d_X(x, y).

For the maximum of both functions, consider case by case. If f(x) ≥ g(x) and f(y) ≥ g(y), we obtain

|max(f(x), g(x)) − max(f(y), g(y))| = |f(x) − f(y)| ≤ C_f d_X(x, y),

and analogously for g(x) ≥ f(x) and g(y) ≥ f(y),

|max(f(x), g(x)) − max(f(y), g(y))| = |g(x) − g(y)| ≤ C_g d_X(x, y).

On the other hand, if g(x) < f(x) and g(y) ≥ f(y), we have either g(y) ≥ f(x) and thus

|max(f(x), g(x)) − max(f(y), g(y))| = |f(x) − g(y)| = g(y) − f(x) < g(y) − g(x) ≤ C_g d_X(x, y),

or g(y) < f(x) and thus

|max(f(x), g(x)) − max(f(y), g(y))| = |f(x) − g(y)| = f(x) − g(y) ≤ f(x) − f(y) ≤ C_f d_X(x, y).

The case f(x) < g(x) and f(y) ≥ g(y), as well as boundedness, is analogous.
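Lemma B.2.1 can be sanity-checked numerically; the sketch below estimates Lipschitz constants of f + g, f·g and max(f, g) on a grid for two arbitrary example functions. The chosen functions and grid are illustrative assumptions, not part of the proof.

```python
import numpy as np

# Two bounded Lipschitz example functions on [0, 1] (illustrative choices).
f = lambda x: np.sin(3 * x)          # C_f <= 3, M_f <= 1
g = lambda x: 0.5 * np.cos(2 * x)    # C_g <= 1, M_g <= 0.5

x = np.linspace(0.0, 1.0, 2001)

def lipschitz_estimate(h):
    """Largest slope between neighbouring grid points, a lower bound on the true constant."""
    return np.max(np.abs(np.diff(h)) / np.diff(x))

fs, gs = f(x), g(x)
print(lipschitz_estimate(fs + gs))             # <= C_f + C_g = 4
print(lipschitz_estimate(fs * gs))             # <= M_f*C_g + M_g*C_f = 2.5
print(lipschitz_estimate(np.maximum(fs, gs)))  # <= max(C_f, C_g) = 3
```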
B.3 Proof of Proposition 1

Proof.
Since we work with finite T, S, A, we identify the space of mean fields M with the |T|(|S|−1)-dimensional simplex S^{|T|(|S|−1)} ⊆ ℝ^{|T|(|S|−1)} via the values of the probability mass functions at all times and states. Analogously, the space of policies Π is identified with S^{|T||S|(|A|−1)} ⊆ ℝ^{|T||S|(|A|−1)}.

Define the set-valued map Γ̂ : S^{|T||S|(|A|−1)} → S^{|T||S|(|A|−1)} mapping from a policy π, represented by the input vector, to the set of vector representations of optimal policies in the MDP induced by Ψ(π).

A policy π is optimal in the MDP induced by µ ∈ M if and only if its value function, defined by

V^π(µ, t, s) = Σ_{a∈A} π_t(a|s) ( r(s, a, µ_t) + Σ_{s′∈S} p(s′|s, a, µ_t) V^π(µ, t+1, s′) ),

is equal to the optimal value function defined by

V*(µ, t, s) = max_{a∈A} ( r(s, a, µ_t) + Σ_{s′∈S} p(s′|s, a, µ_t) V*(µ, t+1, s′) )

for every t ∈ T, s ∈ S, with terminal conditions V*(µ, T, s) ≡ V^π(µ, T, s) ≡ 0. Moreover, an optimal policy always exists. For more details, see e.g. Puterman (2014). Define the optimal action-value function for every t ∈ T, s ∈ S, a ∈ A via

Q*(µ, t, s, a) = r(s, a, µ_t) + Σ_{s′∈S} p(s′|s, a, µ_t) V*(µ, t+1, s′)

with terminal condition Q*(µ, T, s, a) ≡ 0. Then, the following lemma characterizes optimality of policies.

Lemma B.3.1.
A policy π fulfills π ∈ ˆΓ(ˆ π ) if and only if π t ( a | s ) > ⇒ a ∈ arg max a (cid:48) ∈A Q ∗ (Ψ(ˆ π ) , t, s, a (cid:48) ) for all t ∈ T , s ∈ S , a ∈ A .Proof. To see the implication, consider π ∈ ˆΓ(ˆ π ) . Then, if the right-hand side was false, there exists a maximal t ∈ T and s ∈ S , a ∈ A such that π t ( a | s ) > but a (cid:54)∈ arg max a (cid:48) ∈A Q ∗ (Ψ(ˆ π ) , t, s, a (cid:48) ) . Since for any t (cid:48) > t we haveoptimality, V π ( µ, t + 1 , s (cid:48) ) = V ∗ ( µ, t + 1 , s (cid:48) ) by induction. However, V π ( µ, t, s ) < V ∗ ( µ, t, s ) since the suboptimalaction is assigned positive probability, contradicting optimality of π . On the other hand, if the right-hand side istrue, then V π ( µ, t, s ) = V ∗ ( µ, t, s ) by induction, which implies that π is optimal. (cid:4) ai Cui, Heinz Koeppl We will now check that the requirements of Kakutani’s fixed point theorem hold for ˆΓ . The finite-dimensionalsimplices are convex, closed and bounded, hence compact. ˆΓ maps to a non-empty set, as the induced mean fieldis uniquely defined and any finite MDP (induced by this mean field) has an optimal policy.For any π , ˆΓ( π ) is convex, since the set of optimal policies is convex as shown in the following. Consider a convexcombination ˜ π = λπ + (1 − λ ) π (cid:48) of optimal policies π, π (cid:48) for λ ∈ [0 , . Then, the resulting policy will be optimal,since we have ˜ π t ( a | s ) > ⇒ π t ( a | s ) > ∨ π (cid:48) t ( a | s ) > ⇒ a ∈ arg max a ∈A Q ∗ (Ψ(ˆ π ) , t, s, a ) for any t ∈ T , s ∈ S , a ∈ A and thus optimality by Lemma B.3.1.Finally, we show that ˆΓ has a closed graph. Consider arbitrary sequences ( π n , π (cid:48) n ) → ( π, π (cid:48) ) with π (cid:48) n ∈ ˆΓ( π n ) . It isthen sufficient to show that π (cid:48) ∈ ˆΓ( π ) . By the standing assumption, we have continuity of Ψ and µ → Q ∗ ( µ, t, s, a ) for any t ∈ T , s ∈ S , a ∈ A , as sums, products and compositions of continuous functions remain continuous.Therefore, the composition π → Q ∗ (Ψ( π ) , t, s, a ) is continuous. To show that π (cid:48) ∈ ˆΓ( π ) , assume that π (cid:48) (cid:54)∈ ˆΓ( π ) .By Lemma B.3.1 there exists t ∈ T , s ∈ S , a ∈ A such that π (cid:48) t ( a | s ) > and further there exists a (cid:48) ∈ A such that Q ∗ (Ψ( π ) , t, s, a (cid:48) ) > Q ∗ (Ψ( π ) , t, s, a ) . Fix such an a (cid:48) ∈ A . Let δ ≡ Q ∗ (Ψ( π ) , t, s, a (cid:48) ) − Q ∗ (Ψ( π ) , t, s, a ) , then bycontinuity there exists ε > such that for all ˆ π ∈ Π we have d Π (ˆ π, π ) < ε = ⇒ | Q ∗ (Ψ(ˆ π ) , t, s, a ) − Q ∗ (Ψ( π ) , t, s, a ) | < δ . By convergence, there is an integer N ∈ N such that for all n > N we have d Π ( π n , π ) < ε and therefore Q ∗ (Ψ( π n ) , t, s, a (cid:48) ) > Q ∗ (Ψ( π ) , t, s, a (cid:48) ) − δ Q ∗ (Ψ( π ) , t, s, a ) + δ > Q ∗ (Ψ( π n ) , t, s, a ) . Since ( π (cid:48) n ) t ( a | s ) → π (cid:48) t ( a | s ) > , there also exists M ∈ N such that for all m > M , | ( π (cid:48) m ) t ( a | s ) − π (cid:48) t ( a | s ) | < π (cid:48) t ( a | s ) . Let n > max(
N, M ) , then it follows that ( π (cid:48) n ) t ( a | s ) > but a (cid:54)∈ arg max a (cid:48) ∈A Q ∗ (Ψ( π ) , t, s, a (cid:48) ) since we have Q ∗ (Ψ( π n ) , t, s, a (cid:48) ) > Q ∗ (Ψ( π n ) , t, s, a ) , contradicting π (cid:48) n ∈ ˆΓ( π n ) by Lemma B.3.1. Hence, ˆΓ must have a closedgraph.By Kakutani’s fixed point theorem, there exists a fixed point π ∗ that generates some mean field Ψ( π ∗ ) . Theassociated pair ( π ∗ , Ψ( π ∗ )) is an MFE by definition. B.4 Proof of Proposition 3
Proof.
The space of mean fields (M, d_M) is equivalent to convex and compact finite-dimensional simplices. In this representation, each coordinate of the operators Γ̃_η(µ) and Γ_η(µ) consists of compositions, sums and products of continuous functions, since the functions r(s, a, µ_t) and p(s′|s, a, µ_t) are assumed to be continuous. Existence of a fixed point follows immediately by Brouwer's fixed point theorem.

B.5 Proof of Theorem 1
Proof.
The proof is a slightly simplified version of the one found in Saldi et al. (2018). Note that we require the results later, so for convenience we give the full details.

The empirical measure G^N_{S_t} is a random variable on P(S), i.e. its law L(G^N_{S_t}) ∈ P(P(S)) is a distribution over probability measures. Since we want to show convergence of the empirical measure to the mean field, let us pick a metric on P(P(S)). Remember that we metrized P(S) with the total variation distance. We metrize P(P(S)) with the 1-Wasserstein metric, defined for any Φ, Ψ ∈ P(P(S)) by the infimum over couplings

W_1(Φ, Ψ) ≡ inf_{L(X_1)=Φ, L(X_2)=Ψ} E[d_TV(X_1, X_2)].

Lemma B.5.1.
Let { Φ n } n ∈ N be a sequence of measures with Φ n ∈ P ( P ( S )) for all n ∈ N . Further, let µ ∈ P ( S ) arbitrary. Then, the following are equivalent. pproximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning (a) W (Φ n , δ µ ) → as n → ∞ (b) E [ | F ( X n ) − F ( X ) | ] → as n → ∞ for any continuous, bounded F : P ( S ) → R , any sequence { X n } n ∈ N of P ( S ) -valued random variables and any P ( S ) -valued random variable X with L ( X n ) = Φ n and L ( X ) = δ µ .(c) E [ | X n ( f ) − X ( f ) | ] → as n → ∞ for any f : S → R , any sequence { X n } n ∈ N of P ( S ) -valued randomvariables and any P ( S ) -valued random variable X with L ( X n ) = Φ n and L ( X ) = δ µ .Proof. Define the only possible coupling ∆ n ≡ Φ n × δ µ .(b), (c) = ⇒ (a):Define F s ( x ) ≡ x ( s ) and f s ( s (cid:48) ) ≡ { s } ( s (cid:48) ) for all s ∈ S , where F s is continuous. By assumption, W (Φ n , δ µ ) = inf L ( X n )=Φ n , L ( X )= δ µ E [ d T V ( X n , X )]= 12 (cid:90) P ( S ) ×P ( S ) (cid:88) s ∈S | X n ( s ) − X ( s ) | d ∆ n = 12 (cid:88) s ∈S E [ | X n ( s ) − X ( s ) | ] → since for any s ∈ S , we have E [ | X n ( s ) − X ( s ) | ] = E [ | F s ( X n ) − F s ( X ) | ] = E [ | X n ( f s ) − X ( f s ) | ] . (a) = ⇒ (b), (c):We have E [ | F ( X n ) − F ( X ) | ] = (cid:90) P ( S ) ×P ( S ) | F ( ν ) − F ( ν (cid:48) ) | ∆ n ( dν, dν (cid:48) )= (cid:90) P ( S ) | F ( ν ) − F ( µ ) | Φ n ( dν ) → (cid:90) P ( S ) | F ( ν ) − F ( µ ) | δ µ ( dν ) = 0 by continuity and boundedness of | F ( ν ) − F ( µ ) | , and convergence in W implying weak convergence. Analogously, E [ | X n ( f ) − X ( f ) | ] = (cid:90) P ( S ) | ν ( f ) − µ ( f ) | Φ n ( dν ) → (cid:90) P ( S ) | ν ( f ) − µ ( f ) | δ µ ( dν ) = 0 since f and thus | ν ( f ) − µ ( f ) | is automatically bounded from finiteness of S , and ν ( f ) = (cid:80) s ∈S ν ( s ) f ( s ) → (cid:80) s ∈S µ ( s ) f ( s ) as ν → µ in total variation distance implies continuity of | ν ( f ) − µ ( f ) | . (cid:4) First, it is shown that when all other agents follow the same policy π , then the empirical distribution is essentiallythe deterministic mean field as N → ∞ , i.e. L ( G NS t ) → L ( µ t ) ≡ δ µ t with µ = Ψ( π ) Lemma B.5.2.
Consider a set of policies (˜ π, π, . . . , π ) ∈ Π N for all agents. Under this set of policies, the law ofthe empirical distribution L ( G NS t ) ∈ P ( M ) converges to δ µ t where µ = Ψ( π ) as N → ∞ in 1-Wasserstein distance.Proof. Define the Markov kernel P πt,ν such that its probability mass function fulfills P πt,ν ( s (cid:48) | s ) ≡ (cid:88) a ∈A π t ( a | s ) p ( s (cid:48) | s, a, ν ) for any t ∈ T , s ∈ S , ν ∈ P ( S ) , π ∈ Π and analogously ˜ νP πt,ν ( s (cid:48) ) ≡ (cid:88) s ∈S ˜ ν ( s ) (cid:88) a ∈A π t ( a | s ) p ( s (cid:48) | s, a, ν ) ai Cui, Heinz Koeppl for any ˜ ν ∈ P ( S ) . Note that µ t +1 = µ t P πt,µ t ( g ) for mean fields µ = Ψ( π ) induced by π .We show that E (cid:2)(cid:12)(cid:12) G NS t ( f ) − µ t ( f ) (cid:12)(cid:12)(cid:3) → as N → ∞ for any function f : S → R and any time t ∈ T . From this,the desired result follows by Lemma B.5.1. Since G NS t ( · ) ≡ N (cid:80) Ni =1 δ S it ( · ) and S i ∼ µ we have at time t = 0 that lim N →∞ E (cid:2)(cid:12)(cid:12) G NS ( f ) − µ ( f ) (cid:12)(cid:12)(cid:3) = lim N →∞ E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) i =1 f ( S i ) − E (cid:2) f ( S i ) (cid:3)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:35) = 0 by the strong law of large numbers and the dominated convergence theorem.Assuming this holds for t , then for t + 1 we have E (cid:104)(cid:12)(cid:12)(cid:12) G NS t +1 ( f ) − µ t +1 ( f ) (cid:12)(cid:12)(cid:12)(cid:105) ≤ E (cid:104)(cid:12)(cid:12)(cid:12) G NS t +1 ( f ) − G N − S t +1 ( f ) (cid:12)(cid:12)(cid:12)(cid:105) + E (cid:104)(cid:12)(cid:12)(cid:12) G N − S t +1 ( f ) − G N − S t P πt, G NSt ( f ) (cid:12)(cid:12)(cid:12)(cid:105) + E (cid:104)(cid:12)(cid:12)(cid:12) G N − S t P πt, G NSt ( f ) − G NS t P πt, G NSt ( f ) (cid:12)(cid:12)(cid:12)(cid:105) + E (cid:104)(cid:12)(cid:12)(cid:12) G NS t P πt, G NSt ( f ) − µ t P πt,µ t ( f ) (cid:12)(cid:12)(cid:12)(cid:105) where we defined G N − S t ( · ) ≡ N − (cid:80) Ni =2 δ S it ( · ) .For the first term, we have as N → ∞ E (cid:104)(cid:12)(cid:12)(cid:12) G NS t +1 ( f ) − G N − S t +1 ( f ) (cid:12)(cid:12)(cid:12)(cid:105) = E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:88) i =1 f ( S it +1 ) − N − N (cid:88) i =2 f ( S it +1 ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:35) ≤ N E (cid:2)(cid:12)(cid:12) f ( S t +1 ) (cid:12)(cid:12)(cid:3) + (cid:12)(cid:12)(cid:12)(cid:12) N − N − (cid:12)(cid:12)(cid:12)(cid:12) N (cid:88) i =2 E (cid:2)(cid:12)(cid:12) f ( S it +1 ) (cid:12)(cid:12)(cid:3) ≤ (cid:18) N + N − N ( N − (cid:19) max s ∈S | f ( s ) | → . For the second term, as N → ∞ we have by Jensen’s inequality and bounds | f | ≤ M f (by finiteness of S ) E (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) G N − S t +1 ( f ) − G N − S t P πt, G N − St ( f ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:21) = E (cid:20) E (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) G N − S t +1 ( f ) − G N − S t P πt, G N − St ( f ) (cid:12)(cid:12)(cid:12)(cid:12) | S t (cid:21)(cid:21) = E (cid:34) E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N − N (cid:88) i =2 (cid:0) f ( S it +1 ) − E (cid:2) f ( S it +1 ) (cid:3)(cid:1)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) | S t (cid:35)(cid:35) ≤ N − N (cid:88) i =2 E (cid:104) E (cid:104)(cid:0) f ( S it +1 ) − E (cid:2) f ( S it +1 ) (cid:3)(cid:1) | S t (cid:105)(cid:105) ≤ N − · M f → . 
For the third term, we again have as N → ∞ E (cid:104)(cid:12)(cid:12)(cid:12) G N − S t P πt, G NSt ( f ) − G NS t P πt, G NSt ( f ) (cid:12)(cid:12)(cid:12)(cid:105) = E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) s ∈S (cid:0) G N − S t ( s ) − G NS t ( s ) (cid:1) (cid:88) a ∈A π t ( a | s ) (cid:88) s (cid:48) ∈S p ( s (cid:48) | s, a, G NS t ) f ( s (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:35) ≤ E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:18) N − − N (cid:19) N (cid:88) i =2 (cid:88) a ∈A π t ( a | S it ) (cid:88) s (cid:48) ∈S p ( s (cid:48) | S it , a, G NS t ) f ( s (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:35) + E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N (cid:88) a ∈A π t ( a | S t ) (cid:88) s (cid:48) ∈S p ( s (cid:48) | S t , a, G NS t ) f ( s (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:35) pproximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning ≤ (cid:18) N − N ( N −
1) + 1 N (cid:19) max s ∈S | f ( s ) | → . For the fourth term, define F : P ( S ) → R , F ( ν ) = νP πt,ν ( f ) and observe that F is continuous, since ν → ν (cid:48) if andonly if ν ( s ) → ν (cid:48) ( s ) for all s ∈ S , and therefore (as p is assumed continuous by Assumption 1) F ( ν ) = νP πt,ν ( f ) = (cid:88) s ∈S ν ( s ) (cid:88) a ∈A π t ( a | s ) (cid:88) s (cid:48) ∈S p ( s (cid:48) | s, a, ν ) f ( s (cid:48) ) is continuous for any s (cid:48) ∈ S . By Lemma B.5.1, we have from the induction hypothesis G NS t → µ t that E (cid:104)(cid:12)(cid:12)(cid:12) G NS t P πt, G NSt ( f ) − µ t P πt,µ t ( f ) (cid:12)(cid:12)(cid:12)(cid:105) → . Therefore, E (cid:104)(cid:12)(cid:12)(cid:12) G NS t +1 ( f ) − µ t +1 ( f ) (cid:12)(cid:12)(cid:12)(cid:105) → which implies the desired result by induction. (cid:4) Consider the case where all agents follow a set of policies ( π N , π, . . . , π ) ∈ Π N for each N ∈ N . Define newsingle-agent random variables S µt and A µt with S µ ∼ µ and P ( A µt = a | S µt = s ) = π Nt ( a | s ) , P ( S µt +1 = s (cid:48) | S µt = s, A µt = a ) = p ( s (cid:48) | s, a, µ t ) , where the deterministic mean field µ is used instead of the empirical distribution. Lemma B.5.3.
Consider an equicontinuous, uniformly bounded family of functions F on P ( S ) and define F t ( ν ) ≡ sup f ∈F | f ( ν ) − f ( µ t ) | for any t ∈ T . Then, F t is continuous and bounded and by Lemma B.5.1 we have lim N →∞ E (cid:34) sup f ∈F (cid:12)(cid:12) f ( G NS t ) − f ( µ ) (cid:12)(cid:12)(cid:35) = 0 Proof. F t is continuous, since for ν n → ν | F t ( ν n ) − F t ( ν ) | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) sup f ∈F | f ( ν ) − f ( µ t ) | − sup f ∈F | f ( ν (cid:48) ) − f ( µ t ) | (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ sup f ∈F | f ( ν ) − f ( ν (cid:48) ) | → by equicontinuity. Further, F t is bounded since | F t ( ν ) | ≤ sup f ∈F | f ( ν ) | + | f ( µ t ) | is uniformly bounded. ByLemma B.5.2, we have W ( G NS t , δ µ t ) → as N → ∞ , therefore Lemma B.5.1 applies. (cid:4) Lemma B.5.4.
Suppose that at some time t ∈ T , it holds that lim N →∞ (cid:12)(cid:12) L ( S t )( g N ) − L ( S µt )( g N ) (cid:12)(cid:12) = 0 for any sequence of functions { g N } N ∈ N from S to R that is uniformly bounded. Then, we have lim N →∞ (cid:12)(cid:12) L ( S t , G NS t )( T N ) − L ( S µt , µ t )( T N ) (cid:12)(cid:12) = 0 for any sequence of functions { T N } N ∈ N from S × P ( S ) to R that is equicontinuous and uniformly bounded.Proof. We have (cid:12)(cid:12) L ( S t , G NS t )( T N ) − L ( S µt , µ t )( T N ) (cid:12)(cid:12) ≤ (cid:12)(cid:12) L ( S t , G NS t )( T N ) − L ( S t , µ t )( T N ) (cid:12)(cid:12) + (cid:12)(cid:12) L ( S t , µ t )( T N ) − L ( S µt , µ t )( T N ) (cid:12)(cid:12) ai Cui, Heinz Koeppl The first term becomes (cid:12)(cid:12) L ( S t , G NS t )( T N ) − L ( S t , µ t )( T N ) (cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) T N ( x, ν ) L ( S t , G NS t )( dx, dν ) − (cid:90) T N ( x, ν ) L ( S t , µ t )( dx, dν ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ E (cid:2) E (cid:2)(cid:12)(cid:12) T N ( x, G NS t ) − T N ( x, µ t ) (cid:12)(cid:12) S t (cid:3)(cid:3) ≤ E (cid:34) sup f ∈{ T N ( · ,ν ) } ν ∈P ( S ) ,N ∈ N (cid:12)(cid:12) f ( G NS t ) − f ( µ t ) (cid:12)(cid:12)(cid:35) → by Lemma B.5.3, since { T N } N ∈ N is equicontinuous and uniformly bounded. Similarly for the second term, (cid:12)(cid:12) L ( S t , µ t )( T N ) − L ( S µt , µ t )( T N ) (cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:90) T N ( x, ν ) L ( S t , µ t )( dx, dν ) − (cid:90) T N ( x, ν ) L ( S µt , µ t )( dx, dν ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ E (cid:2)(cid:12)(cid:12) T N ( S t , µ t ) − T N ( S µt , µ t ) (cid:12)(cid:12)(cid:3) → by the assumption, since T N fulfills the condition of being uniformly bounded. (cid:4) Lemma B.5.5.
For any sequence { g N } N ∈ N of functions from S to R that is uniformly bounded, we have lim N →∞ (cid:12)(cid:12) L ( S t )( g N ) − L ( S µt )( g N ) (cid:12)(cid:12) = 0 for all times t ∈ T .Proof. Define l N,t as l N,t ( s, ν ) ≡ (cid:88) a ∈A π Nt ( a | s ) (cid:88) s (cid:48) ∈S p ( s (cid:48) | s, a, ν ) g N ( s (cid:48) ) . { l N,t ( s, · ) } s ∈S ,N ∈ N is equicontinuous, since for any ν, ν (cid:48) ∈ M with d T V ( ν, ν (cid:48) ) → , sup s ∈S ,N ∈ N | l N,t ( s, ν ) − l N,t ( s, ν (cid:48) ) | ≤ M g sup s ∈S ,N ∈ N (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) a ∈A π Nt ( a | s ) (cid:88) s (cid:48) ∈S ( p ( s (cid:48) | s, a, ν ) − p ( s (cid:48) | s, a, ν (cid:48) )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ M g |S| max s ∈S max a ∈A max s (cid:48) ∈S | p ( s (cid:48) | s, a, ν ) − p ( s (cid:48) | s, a, ν (cid:48) ) | → since | g N | < M g is uniformly bounded and p is continuous by assumption. Furthermore, l N,t ( s, ν ) is alwaysuniformly bounded by M g . Now the result can be shown by induction.For t = 0 , L ( S µ ) = L ( S ) fulfills the hypothesis. Assume this holds for t , then (cid:12)(cid:12) L ( S t +1 )( g N ) − L ( S µt +1 )( g N ) (cid:12)(cid:12) = (cid:12)(cid:12) L ( S t , G NS t )( l N,t ) − L ( S µt , µ t )( l N,t ) (cid:12)(cid:12) → as N → ∞ by Lemma B.5.4. (cid:4) Thus, for any sequence of policies { π N } N ∈ N with π N ∈ Π for all N ∈ N , the achieved return of the N -agent gameconverges to the return of the mean field game under the mean field generated by the other agent’s policy π as N → ∞ . Lemma B.5.6.
Let { π N } N ∈ N with π N ∈ Π for all N ∈ N be an arbitrary sequence of policies and π ∈ Π an arbitrary policy. Further, let the mean field µ = Ψ( π ) be generated by π . Then, under the joint policy ( π N , π, . . . , π ) , we have as N → ∞ that (cid:12)(cid:12) J N ( π N , π, . . . , π ) − J µ ( π N ) (cid:12)(cid:12) → . Proof.
Define for any t ∈ T , N ∈ N r π Nt ( s, ν ) ≡ (cid:88) a ∈A r ( s, a, ν ) π Nt ( a | s ) pproximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning such that the family { r π Nt ( s, · ) } s ∈S ,N ∈ N is equicontinuous, since for any ν, ν (cid:48) ∈ M as d M ( ν, ν (cid:48) ) → , max s ∈S max N ∈ N (cid:12)(cid:12)(cid:12) r π Nt ( s, ν ) − r π Nt ( s, ν (cid:48) ) (cid:12)(cid:12)(cid:12) → by continuity of r . The function r π Nt is uniformly bounded for all N ∈ N by assumption of uniformly bounded r .By Lemma B.5.4 and Lemma B.5.5, lim N →∞ (cid:12)(cid:12) E (cid:2) r ( S t , A t , G NS t ) (cid:3) − E [ r ( S µt , A µt , µ t )] (cid:12)(cid:12) | = lim N →∞ (cid:12)(cid:12)(cid:12) E (cid:104) r π Nt ( S t , G NS t ) (cid:105) − E (cid:104) r π Nt ( S µt , µ t ) (cid:105)(cid:12)(cid:12)(cid:12) = 0 . such that we have lim N →∞ (cid:12)(cid:12) J N ( π N , π, . . . , π ) − J µ ( π N ) (cid:12)(cid:12) | ≤ (cid:88) t ∈T lim N →∞ (cid:12)(cid:12) E (cid:2) r ( S t , A t , G NS t ) (cid:3) − E [ r ( S µt , A µt , µ t )] (cid:12)(cid:12) = 0 . which is the desired result. (cid:4) From Lemma B.5.6, it follows that for any sequence of optimal exploiting policies { π N } N ∈ N with π N ∈ Π for all N ∈ N and π N ∈ arg max π ∈ Π J N ( π, π ∗ , . . . , π ∗ ) for all N ∈ N , it holds that for any MFE ( π ∗ , µ ∗ ) ∈ Π × M , lim N →∞ J N ( π N , π ∗ , . . . , π ∗ ) ≤ max π ∈ Π J µ ∗ ( π )= J µ ∗ ( π ∗ )= lim N →∞ J N ( π ∗ , . . . , π ∗ ) and by instantiating for arbitrary (cid:15) > , for sufficiently large N we obtain J N ( π N , π ∗ , . . . , π ∗ ) − (cid:15) = max π ∈ Π J N ( π, π ∗ , . . . , π ∗ ) − (cid:15) ≤ max π ∈ Π J µ ∗ ( π ) − (cid:15) J µ ∗ ( π ∗ ) − (cid:15) J N ( π ∗ , π ∗ , . . . , π ∗ ) which is the desired approximate Nash property that applies to all agents by symmetry. B.6 Proof of Theorem 2
Proof. If Φ or Ψ is constant, or if the restriction Ψ|_{Π_Φ} of Ψ to Π_Φ is constant, then Γ = Ψ ∘ Φ is constant. Assume that this is not the case.

Then there exist distinct π, π′ ∈ Π_Φ such that Ψ(π) ≠ Ψ(π′). By definition of Π_Φ there also exist distinct µ, µ′ ∈ M such that Φ(µ) = π and Φ(µ′) = π′. Note that for any ν, ν′ ∈ M with Γ(ν) ≠ Γ(ν′),

d_M(Γ(ν), Γ(ν′)) ≥ min_{π,π′∈Π_Φ, π≠π′} d_M(Ψ(π), Ψ(π′)),

where the right-hand side is greater than zero by finiteness of Π_Φ. This holds in particular for µ, µ′.

To show that Γ cannot be Lipschitz continuous, assume that Γ has a Lipschitz constant C > 0. Defining µ_i = (i/N) µ + ((N−i)/N) µ′ for all i ∈ {0, …, N}, we have µ_i ∈ M, and we can find an integer N such that

d_M(µ_i, µ_{i+1}) = d_M(µ, µ′)/N < min_{π,π′∈Π_Φ, π≠π′} d_M(Ψ(π), Ψ(π′)) / C

for all i ∈ {0, …, N−1}. Since Γ(µ) = Ψ(π) ≠ Ψ(π′) = Γ(µ′), by the triangle inequality

d_M(Γ(µ), Γ(µ′)) ≤ d_M(Γ(µ_0), Γ(µ_1)) + … + d_M(Γ(µ_{N−1}), Γ(µ_N))

there exists a pair (µ_i, µ_{i+1}) with Γ(µ_i) ≠ Γ(µ_{i+1}). For this pair, Γ(µ_i) and Γ(µ_{i+1}) are distinct elements of Ψ(Π_Φ), so

d_M(Γ(µ_i), Γ(µ_{i+1})) ≥ min_{π,π′∈Π_Φ, π≠π′} d_M(Ψ(π), Ψ(π′)).

On the other hand, since Γ is Lipschitz with constant C, we have

d_M(Γ(µ_i), Γ(µ_{i+1})) ≤ C · d_M(µ_i, µ_{i+1}) < min_{π,π′∈Π_Φ, π≠π′} d_M(Ψ(π), Ψ(π′)),

which is a contradiction. Thus, Γ cannot be Lipschitz continuous and by extension cannot be contractive.
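The obstruction in this proof can also be observed numerically: once the best response Φ selects among finitely many deterministic policies, the composed operator Γ = Ψ ∘ Φ is piecewise constant and exact fixed point iteration oscillates. The sketch below uses an assumed two-action congestion reward r(a, µ) = −µ(a) in the spirit of the LR toy problem discussed above, with the mean field summarized by the population's action distribution; it is an illustration, not the exact game configuration from the experiments.

```python
import numpy as np

# Two actions, one decision epoch; mu is the population's action distribution
# (equivalently, the next-state distribution in an LR-style game). Assumed setup.
def best_response(mu):
    """Phi: deterministic greedy policy against the action distribution mu."""
    rewards = -mu                        # congestion-style reward, assumed
    pi = np.zeros(2)
    pi[np.argmax(rewards)] = 1.0         # ties broken towards action 0
    return pi

def induced_mean_field(pi):
    """Psi: with identically acting agents, the action distribution equals the policy."""
    return pi.copy()

mu = np.array([1.0, 0.0])
for k in range(6):                       # exact fixed point iteration Gamma = Psi o Phi
    mu = induced_mean_field(best_response(mu))
    print(k, mu)                         # oscillates between [0, 1] and [1, 0]
```

Any Lipschitz constant for Γ would have to bound the jump between these two outputs over arbitrarily small perturbations of µ around (1/2, 1/2), which is exactly the contradiction derived above.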
B.7 Proof of Theorem 3

Proof.
For all η > 0, µ ∈ M, t ∈ T, s ∈ S, a ∈ A, the soft action-value function of the MDP induced by µ ∈ M is given by

Q̃_η(µ, t, s, a) = r(s, a, µ_t) + Σ_{s′∈S} p(s′|s, a, µ_t) · η log Σ_{a′∈A} q_{t+1}(a′|s′) exp( Q̃_η(µ, t+1, s′, a′) / η )

with terminal condition Q̃_η(µ, T−1, s, a) ≡ r(s, a, µ_{T−1}). Analogously, the action-value function of the MDP induced by µ ∈ M is given by

Q*(µ, t, s, a) = r(s, a, µ_t) + Σ_{s′∈S} p(s′|s, a, µ_t) max_{a′∈A} Q*(µ, t+1, s′, a′),

and the similarly defined policy action-value function for π ∈ Π is given by

Q^π(µ, t, s, a) = r(s, a, µ_t) + Σ_{s′∈S} p(s′|s, a, µ_t) Σ_{a′∈A} π_{t+1}(a′|s′) Q^π(µ, t+1, s′, a′),

with terminal conditions Q*(µ, T−1, s, a) ≡ Q^π(µ, T−1, s, a) ≡ r(s, a, µ_{T−1}).

We will show that we can find a Lipschitz constant K_{Q̃_η} of Q̃_η that is independent of η if η is not arbitrarily small. To show this, we will explicitly compute such a Lipschitz constant. Note first that Q̃_η, Q* and Q^π are all uniformly bounded by M_Q ≡ |T| M_r by assumption, where M_r is the uniform bound of r.
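A minimal sketch of the soft backward recursion above, evaluated for one fixed mean field, is given below; the stand-in reward, transition kernel and prior policy in the usage example are placeholder assumptions.

```python
import numpy as np

def soft_q(mu, reward, transition, prior, eta):
    """Soft action values Q~_eta(mu, t, s, a) via the log-sum-exp backward recursion.

    mu:         array (T, |S|)              fixed mean field
    reward:     r(s, a, mu_t)    -> float
    transition: p(.|s, a, mu_t)  -> array (|S|,)
    prior:      array (T, |S|, |A|)          prior policy q_t(a|s)
    """
    T, S = mu.shape
    A = prior.shape[-1]
    Q = np.zeros((T, S, A))
    for s in range(S):
        for a in range(A):
            Q[T - 1, s, a] = reward(s, a, mu[T - 1])     # terminal condition
    for t in range(T - 2, -1, -1):
        # soft value V~(t+1,s') = eta * log sum_a' q_{t+1}(a'|s') exp(Q(t+1,s',a')/eta),
        # computed with the usual max-shift for numerical stability
        m = Q[t + 1].max(axis=-1, keepdims=True)
        V = m[:, 0] + eta * np.log(np.sum(prior[t + 1] * np.exp((Q[t + 1] - m) / eta), axis=-1))
        for s in range(S):
            for a in range(A):
                Q[t, s, a] = reward(s, a, mu[t]) + transition(s, a, mu[t]) @ V
    return Q

# Example usage with arbitrary stand-in dynamics (2 states, 2 actions, horizon 5):
mu = np.full((5, 2), 0.5)
r = lambda s, a, m: -m[s]
p = lambda s, a, m: np.full(2, 0.5)
q = np.full((5, 2, 2), 0.5)
print(soft_q(mu, r, p, q, eta=0.5)[0])
```

The η-RelEnt policy Φ̃_η(µ) used later in this proof is then proportional to q_t(a|s) exp(Q̃_η(µ, t, s, a)/η).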
Lemma B.7.1. The functions Q̃_η(µ, t, s, a), Q*(µ, t, s, a) and Q^π(µ, t, s, a) are uniformly bounded for all η > 0, µ ∈ M, t ∈ T, s ∈ S, a ∈ A by

|Q̃_η(µ, t, s, a)| ≤ (T − t) M_r ≤ T M_r =: M_Q,

where M_r is the uniform bound |r(s, a, µ_t)| ≤ M_r and T = |T|.

Proof. Make the induction hypothesis for all t ∈ T that

|Q̃_η(µ, t, s, a)| ≤ (T − t) M_r

for all η > 0, µ ∈ M, s ∈ S, a ∈ A, and note that this holds for t = T − 1, as by assumption

|Q̃_η(µ, T−1, s, a)| = |r(s, a, µ_{T−1})| ≤ M_r.

The induction step from t + 1 to t holds by

|Q̃_η(µ, t, s, a)|
= | r(s, a, µ_t) + Σ_{s′∈S} p(s′|s, a, µ_t) · η log Σ_{a′∈A} q_{t+1}(a′|s′) exp( Q̃_η(µ, t+1, s′, a′) / η ) |
≤ |r(s, a, µ_t)| + η max_{s′∈S} | log Σ_{a′∈A} q_{t+1}(a′|s′) exp( Q̃_η(µ, t+1, s′, a′) / η ) |
≤ M_r + η | log exp( (T − t − 1) M_r / η ) |
= M_r + (T − t − 1) M_r = (T − t) M_r.

By maximizing over all t ∈ T, we obtain the uniform bound. The other cases are analogous. ∎

Now we can find a Lipschitz constant of Q̃_η(µ, t, s, a) that is independent of η.

Lemma B.7.2.
Let C r be a Lipschitz constant of µ → r ( s, a, µ t ) and C p a Lipschitz constant of µ → p ( s (cid:48) | s, a, µ t ) .Further, let η min > . Then, for all η > η min , t ∈ T , the map µ (cid:55)→ ˜ Q η ( µ, t, s, a ) is Lipschitz for all s ∈ S , a ∈ A with a Lipschitz constant K t ˜ Q η independent of η . Therefore, by picking K ˜ Q η ≡ max t ∈T K t ˜ Q η , we have one singleLipschitz constant for all η > η min , t ∈ T , s ∈ S , a ∈ A .Proof. We show by induction that for all t ∈ T , s ∈ S , a ∈ A , we can find Lipschitz constants such that ˜ Q η ( µ, t, s, a ) is Lipschitz in µ with a Lipschitz constant that does not depend on η .To see this, note that this is true for t = T − and any s ∈ S , a ∈ A , as for any µ, µ (cid:48) we have (cid:12)(cid:12)(cid:12) ˜ Q η ( µ, T − , s, a ) − ˜ Q η ( µ (cid:48) , T − , s, a ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12) r ( s, a, µ T − ) − r ( s, a, µ (cid:48) T − ) (cid:12)(cid:12) ≤ C r d M ( µ, µ (cid:48) ) . The induction step from t + 1 to t is (cid:12)(cid:12)(cid:12) ˜ Q η ( µ, t, s, a ) − ˜ Q η ( µ, t, s, a ) (cid:12)(cid:12)(cid:12) ≤ | r ( s, a, µ t ) − r ( s, a, µ (cid:48) t ) | + (cid:88) s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p ( s (cid:48) | s, a, µ t ) η log (cid:88) a (cid:48) ∈A q t +1 ( a (cid:48) | s (cid:48) ) exp (cid:32) ˜ Q η ( µ, t + 1 , s (cid:48) , a (cid:48) ) η (cid:33) − p ( s (cid:48) | s, a, µ (cid:48) t ) η log (cid:88) a (cid:48) ∈A q t +1 ( a (cid:48) | s (cid:48) ) exp (cid:32) ˜ Q η ( µ (cid:48) , t + 1 , s (cid:48) , a (cid:48) ) η (cid:33)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C r d M ( µ, µ (cid:48) ) + η |S| max s (cid:48) ∈S · (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) log (cid:88) a (cid:48) ∈A q t +1 ( a (cid:48) | s (cid:48) ) exp (cid:32) ˜ Q η ( µ, t + 1 , s (cid:48) , a (cid:48) ) η (cid:33) − log (cid:88) a (cid:48) ∈A q t +1 ( a (cid:48) | s (cid:48) ) exp (cid:32) ˜ Q η ( µ (cid:48) , t + 1 , s (cid:48) , a (cid:48) ) η (cid:33)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + η |S| max s (cid:48) ∈S M Q η · | p ( s (cid:48) | s, a, µ t ) − p ( s (cid:48) | s, a, µ (cid:48) t ) |≤ C r d M ( µ, µ (cid:48) ) + η |S| max s (cid:48) ∈S (cid:88) a (cid:48) ∈A (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) η q t +1 ( a (cid:48) | s (cid:48) ) exp (cid:16) ξ a (cid:48) η (cid:17)(cid:80) a (cid:48)(cid:48) ∈A q t +1 ( a (cid:48)(cid:48) | s (cid:48) ) exp (cid:16) ξ a (cid:48)(cid:48) η (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12) ˜ Q η ( µ, t + 1 , s (cid:48) , a (cid:48) ) − ˜ Q η ( µ (cid:48) , t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12) + |S| M Q · C p d M ( µ, µ (cid:48) ) ≤ C r d M ( µ, µ (cid:48) ) + |A| q max |A| q min exp (cid:18) · M Q η (cid:19) K t +1˜ Q η d M ( µ, µ (cid:48) ) + |S| M Q C p d M ( µ, µ (cid:48) ) < (cid:18) C r + q max q min exp (cid:18) M Q η min (cid:19) K t +1˜ Q η + |S| M Q C p (cid:19) d M ( µ, µ (cid:48) ) ai Cui, Heinz Koeppl where we use the mean value theorem to obtain some ξ a ∈ [ − M Q , M Q ] for all a ∈ A bounded by Lemma B.7.1,Lemma B.2.1 for the second inequality, and defined q max = max t ∈T ,s ∈S ,a ∈A q t ( a | s ) , q min = min t ∈T ,s ∈S ,a ∈A q t ( a | s ) . Since s ∈ S , a ∈ A were arbitrary, this holds for all s ∈ S , a ∈ A .Thus, as long as η > η min , we have the Lipschitz constant K t ˜ Q η ≡ (cid:16) C r + q max q min exp (cid:16) M Q η min (cid:17) K t +1˜ Q η + |S| M Q C p (cid:17) independent of η , since by induction assumption K t +1˜ Q η is independent of η . 
∎

The optimal action-value function and the policy action-value function for any fixed policy are Lipschitz in µ.

Lemma B.7.3.
The functions µ (cid:55)→ Q ∗ ( µ, t, s, a ) and µ (cid:55)→ Q π ( µ, t, s, a ) for any fixed π ∈ Π , t ∈ T , s ∈ S , a ∈ A are Lipschitz continuous. Therefore, for any fixed π ∈ Π we can choose a Lipschitz constant K Q for all t ∈ T , s ∈ S , a ∈ A by taking the maximum over all Lipschitz constants.Proof. The action-value function is given by the recursion Q ∗ ( µ, t, s, a ) = r ( s, a, µ t ) + (cid:88) s (cid:48) ∈S p ( s (cid:48) | s, a, µ t ) max a (cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48) ) with terminal condition Q ∗ ( µ, T − , s, a ) ≡ r ( s, a, µ T − ) . The functions r ( s, a, µ t ) and p ( s (cid:48) | s, a, µ t ) are Lipschitzcontinuous by Assumption 2. Note that for any µ, µ (cid:48) ∈ M and any t ∈ T , d T V ( µ t , µ (cid:48) t ) ≤ d M ( µ, µ (cid:48) ) . Therefore,the terminal condition and all terms in the above recursion are Lipschitz. Further, Q ∗ ( µ, t, s, a ) is uniformlybounded, since r is assumed uniformly bounded.Since a finite maximum, product and sum of Lipschitz and bounded functions is again Lipschitz and bounded byLemma B.2.1, we obtain Lipschitz constants K Q,t,s,a of the maps µ → Q ∗ ( µ, t, s, a ) for any t ∈ T , s ∈ S , a ∈ A and define K Q ≡ max t ∈T ,s ∈S ,a ∈A K Q,t,s,a . The case for Q π with fixed π ∈ Π is analogous. (cid:4) The same holds for Ψ( π ) mapping from policy π to its induced mean field. Lemma B.7.4.
The function Ψ( π ) is Lipschitz with some Lipschitz constant K Ψ .Proof. Recall that Ψ( π ) maps to the mean field µ starting with µ and obtained by the recursion µ t +1 ( s (cid:48) ) = (cid:88) s ∈S (cid:88) a ∈A p ( s (cid:48) | s, a, µ t ) π t ( a | s ) µ t ( s ) . We proceed analogously to Lemma B.7.3. µ is uniformly bounded by normalization. The constant function π (cid:55)→ µ ( s ) is Lipschitz and bounded for any s ∈ S . The functions r ( s, a, µ t ) and p ( s (cid:48) | s, a, µ t ) are Lipschitzcontinuous by Assumption 2. Since a finite sum, product and composition of Lipschitz and bounded functions isagain Lipschitz and bounded by Lemma B.2.1, we obtain Lipschitz constants K Ψ ,t,s of the maps π → µ t ( s ) forany t ∈ T , s ∈ S and define K Ψ ≡ max t ∈T ,s ∈S K Ψ ,t,s , which is the desired Lipschitz constant of Ψ . (cid:4) Finally, the map from an energy function to its associated Boltzmann distribution is Lipschitz for any η > witha Lipschitz constant explicitly depending on η . Lemma B.7.5.
Let η > arbitrary and f a : M → R be a Lipschitz continuous function with Lipschitz constant K f for any a ∈ A . Further, let g : A → R be bounded by g max > g ( a ) > g min > for any a ∈ A . The function µ (cid:55)→ g ( a ) exp (cid:16) f a ( µ ) η (cid:17)(cid:80) a (cid:48) ∈A g ( a (cid:48) ) exp (cid:16) f a (cid:48) ( µ ) η (cid:17) is Lipschitz with Lipschitz constant K = ( |A|− K f g ηg for any a ∈ A .Proof. Let µ, µ (cid:48) ∈ M be arbitrary and define ∆ a f a (cid:48) ( µ ) ≡ f a (cid:48) ( µ ) − f a ( µ ) pproximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning for any a (cid:48) ∈ A , which is Lipschitz with constant K f . Then, we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) g ( a ) exp (cid:16) f a ( µ ) η (cid:17)(cid:80) a (cid:48) ∈A g ( a (cid:48) ) exp (cid:16) f a (cid:48) ( µ ) η (cid:17) − g ( a ) exp (cid:16) f a ( µ (cid:48) ) η (cid:17)(cid:80) a (cid:48) ∈A g ( a (cid:48) ) exp (cid:16) f a (cid:48) ( µ (cid:48) ) η (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)
11 + (cid:80) a (cid:48) (cid:54) = a g ( a (cid:48) ) g ( a ) exp (cid:16) ∆ a f a (cid:48) ( µ ) η (cid:17) −
11 + (cid:80) a (cid:48) (cid:54) = a g ( a (cid:48) ) g ( a ) exp (cid:16) ∆ a f a (cid:48) ( µ (cid:48) ) η (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) (cid:54) = a g ( a (cid:48) ) g ( a ) · η exp (cid:16) ξ a (cid:48) η (cid:17)(cid:16) (cid:80) a (cid:48)(cid:48) (cid:54) = a g ( a (cid:48)(cid:48) ) g ( a ) exp (cid:16) ξ a (cid:48)(cid:48) η (cid:17)(cid:17) · (∆ a f a (cid:48) ( µ ) − ∆ a f a (cid:48) ( µ (cid:48) )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:88) a (cid:48) (cid:54) = a (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) g max g min · η exp (cid:16) ξ a (cid:48) η (cid:17)(cid:16) g min g max exp (cid:16) ξ a (cid:48) η (cid:17)(cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) · | ∆ a f a (cid:48) ( µ ) − ∆ a f a (cid:48) ( µ (cid:48) ) |≤ g ηg · (cid:88) a (cid:48) (cid:54) = a K f d M ( µ, µ (cid:48) ) = ( |A| − K f g ηg · d M ( µ, µ (cid:48) ) where we applied the mean value theorem to obtain some ξ a (cid:48) ∈ R for all a (cid:48) ∈ A and used the maximum c of thefunction ˜ f ( x ) = exp( x/η )(1+ c · exp( x/η )) at x = 0 . (cid:4) For RelEnt MFE, by Lemma B.7.2 we obtain a Lipschitz constant K ˜ Q η of µ → ˜ Q η ( µ, t, s, a ) as long as η > η min for some η min > . Furthermore, note that for ˜ π µ,η ≡ ˜Φ η ( µ ) , we have (cid:12)(cid:12)(cid:12) ˜ π µ,ηt ( a | s ) − ˜ π µ (cid:48) ,ηt ( a | s )) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) q t ( a | s ) exp (cid:16) ˜ Q η ( µ,t,s,a ) η (cid:17)(cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) exp (cid:16) ˜ Q η ( µ,t,s,a (cid:48) ) η (cid:17) − q t ( a | s ) exp (cid:16) ˜ Q η ( µ (cid:48) ,t,s,a ) η (cid:17)(cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) exp (cid:16) ˜ Q η ( µ (cid:48) ,t,s,a (cid:48) ) η (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . We obtain the Lipschitz constant of ˜Φ η by applying Lemma B.7.5 to each of the maps given by µ (cid:55)→ q t ( a | s ) exp (cid:16) ˜ Q η ( µ,t,s,a ) η (cid:17)(cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) exp (cid:16) ˜ Q η ( µ,t,s,a (cid:48) ) η (cid:17) for all t ∈ T , s ∈ S , a ∈ A , resulting in the Lipschitz property d Π ( ˜Φ η ( µ ) , ˜Φ η ( µ (cid:48) )) = max s ∈S max t ∈T (cid:88) a ∈A (cid:12)(cid:12)(cid:12) ˜ π µ,ηt ( a | s ) − ˜ π µ (cid:48) ,ηt ( a | s )) (cid:12)(cid:12)(cid:12) ≤ (cid:88) a ∈A ( |A| − K ˜ Q η q ηq · d M ( µ, µ (cid:48) ) = |A| ( |A| − K ˜ Q η q ηq · d M ( µ, µ (cid:48) ) , where we define q max = max t ∈T ,s ∈S ,a ∈A q t ( a | s ) and analogously q min = min t ∈T ,s ∈S ,a ∈A q t ( a | s ) .By Lemma B.7.4, Ψ( π ) is Lipschitz with some Lipschitz constant K Ψ . Therefore, the resulting Lipschitz constantof the composition ˜Γ η = Ψ ◦ ˜Φ η is |A| ( |A|− K ˜ Qη K Ψ q ηq and leads to a contraction for any η > max (cid:32) η min , |A| ( |A| − K ˜ Q η K Ψ q q (cid:33) . Analogously for Boltzmann MFE, by Lemma B.7.3 the mapping µ → Q ∗ ( µ, t, s, a ) is Lipschitz with some Lipschitzconstant K Q ∗ for all t ∈ T , s ∈ S , a ∈ A . 
For π µ,η ≡ Φ η ( µ ) , we have (cid:12)(cid:12)(cid:12) π µ,ηt ( a | s ) − π µ (cid:48) ,ηt ( a | s )) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) q t ( a | s ) exp (cid:16) Q ∗ ( µ,t,s,a ) η (cid:17)(cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) exp (cid:16) Q ∗ ( µ,t,s,a (cid:48) ) η (cid:17) − q t ( a | s ) exp (cid:16) Q ∗ ( µ (cid:48) ,t,s,a ) η (cid:17)(cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) exp (cid:16) Q ∗ ( µ (cid:48) ,t,s,a (cid:48) ) η (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . ai Cui, Heinz Koeppl We obtain the Lipschitz constant of Φ η by applying Lemma B.7.5 to each of the maps given by µ (cid:55)→ q t ( a | s ) exp (cid:16) Q ∗ ( µ,t,s,a ) η (cid:17)(cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) exp (cid:16) Q ∗ ( µ,t,s,a (cid:48) ) η (cid:17) for all t ∈ T , s ∈ S , a ∈ A , resulting in the Lipschitz property d Π (Φ η ( µ ) , Φ η ( µ (cid:48) )) = max s ∈S max t ∈T (cid:88) a ∈A (cid:12)(cid:12)(cid:12) π µ,ηt ( a | s ) − π µ (cid:48) ,ηt ( a | s )) (cid:12)(cid:12)(cid:12) ≤ (cid:88) a ∈A ( |A| − K Q ∗ q ηq · d M ( µ, µ (cid:48) ) = |A| ( |A| − K Q ∗ q ηq · d M ( µ, µ (cid:48) ) . By Lemma B.7.4, Ψ( π ) is Lipschitz with some Lipschitz constant K Ψ . The resulting Lipschitz constant of thecomposition Γ η = Ψ ◦ Φ η is |A| ( |A|− K Q ∗ K Ψ q ηq and leads to a contraction for any η > |A| ( |A| − K Q ∗ K Ψ q q where for the uniform prior policy, q max = q min . If required, the Lipschitz constants can be computed recursivelyaccording to Lemma B.2.1. B.8 Proof of Theorem 4
Proof.
Consider any sequence ( π ∗ n , µ ∗ n ) n ∈ N of η n -Boltzmann or η n -RelEnt MFE with η n → + as n → ∞ . Notethat a pair ( π ∗ n , µ ∗ n ) is completely specified by µ ∗ n , since π ∗ n = Φ η n ( µ ∗ n ) or π ∗ n = ˜Φ η n ( µ ∗ n ) uniquely. Therefore,it suffices to show that the associated functions ( µ (cid:55)→ Q Φ ηn ( µ ) ( µ, t, s, a )) n ∈ N and ( µ (cid:55)→ Q ˜Φ ηn ( µ ) ( µ, t, s, a )) n ∈ N converge uniformly to µ (cid:55)→ Q ∗ ( µ, t, s, a ) , from which the desired result will follow. For definitions of the differentaction-value functions, see Appendix B.7.Note that pointwise convergence is insufficient, since there is no guarantee that µ ∗ n itself will converge as n → ∞ .However, we can obtain uniform convergence by pointwise convergence and equicontinuity. For RelEnt MFE, wewill additionally require uniform convergence of the sequence ( µ (cid:55)→ ˜ Q η n ( µ, t, s, a )) n ∈ N with η n → + . We beginwith pointwise convergence of ( µ (cid:55)→ Q Φ ηn ( µ ) ( µ, t, s, a )) n ∈ N to the optimal action-value function µ (cid:55)→ Q ∗ ( µ, t, s, a ) . Lemma B.8.1.
Any sequence of functions ( µ (cid:55)→ Q Φ ηn ( µ ) ( µ, t, s, a )) n ∈ N with η n → + converges pointwise to µ (cid:55)→ Q ∗ ( µ, t, s, a ) for all t ∈ T , s ∈ S , a ∈ A .Proof. Fix µ ∈ M . We make the induction hypothesis for arbitrary t ∈ T that for all s ∈ S , a ∈ A , ε > , thereexists n (cid:48) ∈ N such that for any n > n (cid:48) we have (cid:12)(cid:12)(cid:12) Q Φ ηn ( µ ) ( µ, t, s, a ) − Q ∗ ( µ, t, s, a ) (cid:12)(cid:12)(cid:12) < ε . The induction hypothesis is fulfilled for t = T − , as by definition (cid:12)(cid:12)(cid:12) Q Φ ηn ( µ ) ( µ, t, s, a ) − Q ∗ ( µ, t, s, a ) (cid:12)(cid:12)(cid:12) = | r ( s, a, µ t ) − r ( s, a, µ t ) | = 0 . Assume that the induction hypothesis is fulfilled for t + 1 , then at time t let s ∈ S , a ∈ A , ε > arbitrary.Furthermore, let s (cid:48) ∈ S arbitrary. Collect all optimal actions into a set A s (cid:48) opt ⊆ A , i.e. for a (cid:48) ∈ A s (cid:48) opt we have Q ∗ ( µ, t, s (cid:48) , a opt ) = max a ∈A Q ∗ ( µ, t, s (cid:48) , a ) . We define the minimal action gap ∆ Q s (cid:48) ,µ min ≡ min a opt ∈A s (cid:48) opt ,a sub ∈A\A s (cid:48) opt ( Q ∗ ( µ, t, s (cid:48) , a opt ) − Q ∗ ( µ, t, s (cid:48) , a sub )) > pproximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning such that for arbitrary suboptimal actions a sub ∈ A \ A s (cid:48) opt and optimal actions a opt ∈ A s (cid:48) opt , Q ∗ ( µ, t, s (cid:48) , a opt ) − Q ∗ ( µ, t, s (cid:48) , a sub ) ≥ ∆ Q s (cid:48) ,µ min . This is well defined if there are suboptimal actions, since there is always at least one optimal action. If all actionsare optimal, we can skip bounding the probability of taking suboptimal actions and the result will hold trivially.Thus, we assume henceforth that there exists a suboptimal action.It follows that the probability of taking suboptimal actions a sub ∈ A \ A s (cid:48) opt disappears, since (Φ η n ( µ )) t ( a sub | s (cid:48) ) = q t ( a sub | s ) (cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) exp (cid:16) Q ∗ ( µ,t,s,a (cid:48) ) − Q ∗ ( µ,t,s,a sub ) η (cid:17) ≤
11 + (cid:80) a (cid:48) ∈A q t ( a (cid:48) | s ) q t ( a sub | s ) exp (cid:16) Q ∗ ( µ,t,s,a (cid:48) ) − Q ∗ ( µ,t,s,a sub ) η (cid:17) ≤ | s )1 + q t ( a opt | s ) q t ( a sub | s ) exp (cid:16) Q ∗ ( µ,t,s,a opt ) − Q ∗ ( µ,t,s,a sub ) η (cid:17) ≤ | s )1 + q t ( a opt | s ) q t ( a sub | s ) exp (cid:18) ∆ Q s (cid:48) ,µ min η (cid:19) → as η → + for some arbitrary optimal action a opt ∈ A s (cid:48) opt . Since s (cid:48) ∈ S was arbitrary, this holds for all s (cid:48) ∈ S .Therefore, by finiteness of S and A we can choose n ∈ N such that for all n > n and for all a sub ∈ A \ A s (cid:48) opt wehave η n sufficiently small such that (Φ η n ( µ )) t ( a sub | s (cid:48) ) < ε |A| M Q where M Q is the uniform bound of Q Φ ηn ( µ ) .Further, by induction assumption, we can choose n s (cid:48) ,a (cid:48) for any s (cid:48) ∈ S , a (cid:48) ∈ A such that for all n > n s (cid:48) ,a (cid:48) we have (cid:12)(cid:12)(cid:12) Q Φ ηn ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12) < ε Therefore, as long as n > n (cid:48) ≡ max( n , max s (cid:48) ∈S ,a (cid:48) ∈A n s (cid:48) ,a (cid:48) ) , we have (cid:12)(cid:12)(cid:12) Q Φ ηn ( µ ) ( µ, t, s, a ) − Q ∗ ( µ, t, s, a ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) s (cid:48) ∈S p ( s (cid:48) | s, a, µ t ) (cid:32) (cid:88) a (cid:48) ∈A (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) Q Φ ηn ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − max a (cid:48)(cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48)(cid:48) ) (cid:33)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) Q Φ ηn ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − max a (cid:48)(cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48)(cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) Q Φ ηn ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − max a (cid:48)(cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48)(cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A\A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) Q Φ ηn ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) Q Φ ηn ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − (cid:88) a (cid:48) ∈A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) max a (cid:48)(cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48)(cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ai Cui, Heinz Koeppl + max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) max a (cid:48)(cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48)(cid:48) ) − max a (cid:48)(cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48)(cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A\A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) Q Φ ηn ( µ ) ( µ, 
t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ max s (cid:48) ∈S max a (cid:48) ∈A s (cid:48) opt (cid:12)(cid:12)(cid:12)(cid:12) Q Φ ηn ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − max a (cid:48)(cid:48) ∈A Q ∗ ( µ, t + 1 , s (cid:48) , a (cid:48)(cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12) + max s (cid:48) ∈S M Q (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − (cid:88) a (cid:48) ∈A\A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + max s (cid:48) ∈S M Q (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A\A s (cid:48) opt (Φ η n ( µ )) t ( a (cid:48) | s (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) < ε ε |A| M Q · |A| M Q + ε |A| M Q · |A| M Q = ε . Since s ∈ S , a ∈ A , ε > were arbitrary, the desired result follows immediately by induction. (cid:4) As we have no control over µ ∗ n and the sequence ( π ∗ n , µ ∗ n ) n ∈ N may not even converge, pointwise convergence isinsufficient. To obtain uniform convergence, we shall use compactness of M and equicontinuity. Lemma B.8.2.
The family of functions
F ≡ { µ ↦ Q^{Φ_η(µ)}(µ, t, s, a) }_{η>0, t∈T, s∈S, a∈A} is equicontinuous, i.e. for any ε > 0 and any µ ∈ M, we can choose a δ > 0 such that for all µ′ ∈ M with d_M(µ, µ′) < δ and any f ∈ F we have |f(µ) − f(µ′)| < ε.
Fix an arbitrary µ ∈ M . We make the (backwards in time) induction hypothesis for all t ∈ T that for any s ∈ S , a ∈ A , ε t,s,a > , there exists δ t,s,a > such that for any µ (cid:48) ∈ M with d M ( µ, µ (cid:48) ) < δ t,s,a and any f ∈ F we have (cid:12)(cid:12)(cid:12) Q Φ η ( µ ) ( µ, t, s, a ) − Q Φ η ( µ (cid:48) ) ( µ (cid:48) , t, s, a ) (cid:12)(cid:12)(cid:12) < ε t,s,a . The induction hypothesis is fulfilled for t = T − , as by assumption, ν → r ( s, a, ν t ) is Lipschitz with constant C r > . Therefore, for all s ∈ S , a ∈ A we can choose δ T − ,s,a = ε t,s,a C r such that for any µ, µ (cid:48) with d M ( µ, µ (cid:48) ) < δ (cid:48) we have (cid:12)(cid:12)(cid:12) Q Φ η ( µ ) ( µ, t, s, a ) − Q Φ η ( µ (cid:48) ) ( µ (cid:48) , t, s, a ) (cid:12)(cid:12)(cid:12) = | r ( s, a, µ t ) − r ( s, a, µ (cid:48) t ) | ≤ C r d M ( µ, µ (cid:48) ) < ε t,s,a . Assume that the induction hypothesis holds for t + 1 , then at time t let ε t,s,a > , s ∈ S , a ∈ A arbitrary. Bydefinition, we have (cid:12)(cid:12)(cid:12) Q Φ η ( µ ) ( µ, t, s, a ) − Q Φ η ( µ (cid:48) ) ( µ (cid:48) , t, s, a ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) r ( s, a, µ t ) + (cid:88) s (cid:48) ∈S p ( s (cid:48) | s, a, µ t ) (cid:88) a (cid:48) ∈A (Φ η ( µ )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − r ( s, a, µ (cid:48) t ) − (cid:88) s (cid:48) ∈S p ( s (cid:48) | s, a, µ (cid:48) t ) (cid:88) a (cid:48) ∈A (Φ η ( µ (cid:48) )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ (cid:48) ) ( µ (cid:48) , t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ | r ( s, a, µ t ) − r ( s, a, µ (cid:48) t ) | + (cid:88) s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( p ( s (cid:48) | s, a, µ t ) − p ( s (cid:48) | s, a, µ (cid:48) t )) (cid:88) a (cid:48) ∈A (Φ η ( µ )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:88) s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p ( s (cid:48) | s, a, µ (cid:48) t ) (cid:88) a (cid:48) ∈A (cid:16) (Φ η ( µ )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − (Φ η ( µ (cid:48) )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ (cid:48) ) ( µ (cid:48) , t + 1 , s (cid:48) , a (cid:48) ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) pproximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning ≤ | r ( s, a, µ t ) − r ( s, a, µ (cid:48) t ) | + (cid:88) s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( p ( s (cid:48) | s, a, µ t ) − p ( s (cid:48) | s, a, µ (cid:48) t )) (cid:88) a (cid:48) ∈A (Φ η ( µ )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A s (cid:48) opt (cid:16) (Φ η ( µ )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − (Φ η ( µ (cid:48) )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ (cid:48) ) ( µ (cid:48) , t + 1 , s (cid:48) , a (cid:48) ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + max s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) a (cid:48) ∈A\A s (cid:48) opt (cid:16) (Φ η ( µ )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − (Φ η ( µ (cid:48) )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ (cid:48) ) ( µ 
(cid:48) , t + 1 , s (cid:48) , a (cid:48) ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) where we define A s (cid:48) opt ⊆ A for any s (cid:48) ∈ S to include all optimal actions a opt ∈ A s (cid:48) opt such that Q ∗ ( µ, t, s (cid:48) , a opt ) = max a ∈A Q ∗ ( µ, t, s (cid:48) , a ) . We bound each of the four terms separately.For the first term, we choose δ t,s,a = ε t,s,a C r by Lipschitz continuity such that | r ( s, a, µ t ) − r ( s, a, µ (cid:48) t ) | < ε t,s,a for all µ (cid:48) with d M ( µ, µ (cid:48) ) < δ t,s,a .For the second term, we choose δ t,s,a = |S| M Q C p such that for any µ (cid:48) ∈ M with d M ( µ, µ (cid:48) ) < δ t,s,a we have (cid:88) s (cid:48) ∈S (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( p ( s (cid:48) | s, a, µ t ) − p ( s (cid:48) | s, a, µ (cid:48) t )) (cid:88) a (cid:48) ∈A (Φ η ( µ )) t +1 ( a (cid:48) | s (cid:48) ) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ |S| C p d M ( µ, µ (cid:48) ) M Q < ε t,s,a where M Q denotes the uniform bound of Q and C p is the Lipschitz constant of ν (cid:55)→ p ( s (cid:48) | s, a, ν t ) .For the third and fourth term, we first fix s (cid:48) ∈ S and define the minimal action gap as ∆ Q s (cid:48) ,µ min ≡ min a opt ∈A s (cid:48) opt ,a sub ∈A\A s (cid:48) opt ( Q ∗ ( µ, t, s (cid:48) , a opt ) − Q ∗ ( µ, t, s (cid:48) , a sub )) . This is well defined if there are suboptimal actions, since there is always at least one optimal action. If all actionsare optimal, we can skip bounding the probability of taking suboptimal actions and the result will still hold.Henceforth, we assume that there exists a suboptimal action.By Lipschitz continuity of µ (cid:55)→ Q ∗ ( µ, t, s, a ) from Lemma B.7.3 implying uniform continuity, there exists some δ ,s (cid:48) t,s,a > such that | Q ∗ ( µ (cid:48) , t, s (cid:48) , a ) − Q ∗ ( µ, t, s (cid:48) , a ) | < ∆ Q s (cid:48) ,µ min for all µ (cid:48) ∈ M , a ∈ A where d M ( µ, µ (cid:48) ) < δ ,s (cid:48) t,s,a , and thus ∆ Q s (cid:48) ,µ (cid:48) min = min a opt ∈A s (cid:48) opt ,a sub ∈A\A s (cid:48) opt ( Q ∗ ( µ (cid:48) , t, s (cid:48) , a opt ) − Q ∗ ( µ (cid:48) , t, s (cid:48) , a sub )) > ∆ Q s (cid:48) ,µ min . Under this condition, we can now show that the probability of any suboptimal action can be controlled. Define R min q ≡ min t ∈T ,s ∈S ,a ∈A ,a (cid:48) ∈A q t ( a (cid:48) | s ) q t ( a | s ) > and R max q ≡ max t ∈T ,s ∈S ,a ∈A ,a (cid:48) ∈A q t ( a (cid:48) | s ) q t ( a | s ) > . Let a sub ∈ A \ A s (cid:48) opt , thenwe either have | (Φ η ( µ )) t +1 ( a sub | s (cid:48) ) − (Φ η ( µ (cid:48) )) t +1 ( a sub | s (cid:48) ) | ai Cui, Heinz Koeppl = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)
11 + (cid:80) a (cid:48) (cid:54) = a sub q t ( a (cid:48) | s (cid:48) ) q t ( a sub | s (cid:48) ) exp (cid:16) Q ∗ ( µ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ,t,s (cid:48) ,a sub ) η (cid:17) −
11 + (cid:80) a (cid:48) (cid:54) = a sub q t ( a (cid:48) | s (cid:48) ) q t ( a sub | s (cid:48) ) exp (cid:16) Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a sub ) η (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤
11 + max a (cid:48) (cid:54) = a sub R min q exp (cid:16) Q ∗ ( µ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ,t,s (cid:48) ,a sub ) η (cid:17) + 11 + max a (cid:48) (cid:54) = a sub R min q exp (cid:16) Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a sub ) η (cid:17) <
11 + R min q exp (cid:18) ∆ Q s (cid:48) ,µ min η (cid:19) + 11 + R min q exp (cid:18) ∆ Q s (cid:48) ,µ min η (cid:19) ≤
21 + R min q exp (cid:18) ∆ Q s (cid:48) ,µ min η (cid:19) < ε t,s,a M Q |A| if ε t,s,a > M Q |A| trivially, or otherwise if η < η s (cid:48) min with η s (cid:48) min ≡ ∆ Q s (cid:48) ,µ min (cid:16) M Q |A| ε t,s,a R min q − R min q (cid:17) , in which case we arbitrarily define δ ,s (cid:48) t,s,a = 1 , or if neither apply, then η ≥ η s (cid:48) min and thus | (Φ η ( µ )) t +1 ( a sub | s (cid:48) ) − (Φ η ( µ (cid:48) )) t +1 ( a sub | s (cid:48) ) | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)
11 + (cid:80) a (cid:48) (cid:54) = a sub q t ( a (cid:48) | s (cid:48) ) q t ( a sub | s (cid:48) ) exp (cid:16) Q ∗ ( µ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ,t,s (cid:48) ,a sub ) η (cid:17) −
11 + (cid:80) a (cid:48) (cid:54) = a sub q t ( a (cid:48) | s (cid:48) ) q t ( a sub | s (cid:48) ) exp (cid:16) Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a sub ) η (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:80) a (cid:48) (cid:54) = a sub q t ( a (cid:48) | s ) q t ( a sub | s (cid:48) ) (cid:16) exp (cid:16) Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ (cid:48) ,t,s (cid:48) ,a sub ) η (cid:17) − exp (cid:16) Q ∗ ( µ,t,s (cid:48) ,a (cid:48) ) − Q ∗ ( µ,t,s (cid:48) ,a sub ) η (cid:17)(cid:17) (1 + · · · ) · (1 + · · · ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ R max q (cid:88) a (cid:48) (cid:54) = a sub (cid:12)(cid:12)(cid:12)(cid:12) exp (cid:18) Q ∗ ( µ (cid:48) , t, s (cid:48) , a (cid:48) ) − Q ∗ ( µ (cid:48) , t, s (cid:48) , a sub ) η (cid:19) − exp (cid:18) Q ∗ ( µ, t, s (cid:48) , a (cid:48) ) − Q ∗ ( µ, t, s (cid:48) , a sub ) η (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ R max q (cid:88) a (cid:48) (cid:54) = a sub (cid:12)(cid:12)(cid:12)(cid:12) η exp (cid:18) ξ a (cid:48) η (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) | ( Q ∗ ( µ (cid:48) , t, s (cid:48) , a (cid:48) ) − Q ∗ ( µ (cid:48) , t, s (cid:48) , a sub )) − ( Q ∗ ( µ, t, s (cid:48) , a (cid:48) ) − Q ∗ ( µ, t, s (cid:48) , a sub )) |≤ R max q |A| · η s (cid:48) min exp (cid:18) M Q η s (cid:48) min (cid:19) ( | Q ∗ ( µ (cid:48) , t, s (cid:48) , a (cid:48) ) − Q ∗ ( µ, t, s (cid:48) , a (cid:48) ) | + | Q ∗ ( µ, t, s (cid:48) , a sub ) − Q ∗ ( µ (cid:48) , t, s (cid:48) , a sub ) | ) ≤ R max q |A| · η s (cid:48) min exp (cid:18) M Q η s (cid:48) min (cid:19) · K Q d M ( µ, µ (cid:48) ) < ε t,s,a M Q |A| by the mean value theorem with some ξ a (cid:48) ∈ [ − M Q , M Q ] for all a (cid:48) ∈ A , where we abbreviated the denominator (1 + · · · ) · (1 + · · · ) ≥ , as long as we choose δ ,s (cid:48) t,s,a = ε t,s,a η s (cid:48) min M Q |A| R max q · exp (cid:16) M Q η s (cid:48) min (cid:17) · K Q pproximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning and d M ( µ, µ (cid:48) ) < δ ,s (cid:48) t,s,a , where K Q is the Lipschitz constant of µ (cid:55)→ Q ∗ ( µ, t, s, a ) given by Lemma B.7.3.Since s (cid:48) ∈ S was arbitrary, we now define δ t,s,a ≡ min s (cid:48) ∈S δ ,s (cid:48) t,s,a , δ t,s,a ≡ min s (cid:48) ∈S δ ,s (cid:48) t,s,a and let d M ( µ, µ (cid:48) ) < min( δ t,s,a , δ t,s,a ) . Under these assumptions, for the third term we have approximate optimality for all optimalactions in A s (cid:48) opt , since by induction assumption we can choose δ t +1 ,s (cid:48) ,a (cid:48) for all s (cid:48) ∈ S , a (cid:48) ∈ A such that for all µ (cid:48) ∈ M with d M ( µ, µ (cid:48) ) < δ t +1 ,s (cid:48) ,a (cid:48) it holds that (cid:12)(cid:12)(cid:12) Q Φ η ( µ ) ( µ, t + 1 , s (cid:48) , a (cid:48) ) − Q Φ η ( µ (cid:48) ) ( µ (cid:48) , t + 1 , s (cid:48) , a (cid:48) ) (cid:12)(cid:12)(cid:12) < ε t,s,a |A| + 8 . 
Therefore, for all $\mu' \in \mathcal{M}$, as long as $d_{\mathcal{M}}(\mu,\mu') < \min_{s' \in \mathcal{S}, a' \in \mathcal{A}} \delta_{t+1,s',a'}$, we have
$$\max_{s' \in \mathcal{S}} \left| \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} (\Phi_\eta(\mu'))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right|$$
$$\leq \max_{s' \in \mathcal{S}} \left| \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right|$$
$$\quad + \max_{s' \in \mathcal{S}} \left| \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') - \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} (\Phi_\eta(\mu'))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right|$$
$$\leq \max_{s' \in \mathcal{S}} \max_{a' \in \mathcal{A}} \left| Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right|$$
$$\quad + \max_{s' \in \mathcal{S}} \left| \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} \left( (\Phi_\eta(\mu))_{t+1}(a'|s') - (\Phi_\eta(\mu'))_{t+1}(a'|s') \right) \left( Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') - Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') \right) \right|$$
$$\quad + \max_{s' \in \mathcal{S}} \left| \sum_{a' \in \mathcal{A}^{s'}_{\mathrm{opt}}} \left( (\Phi_\eta(\mu))_{t+1}(a'|s') - (\Phi_\eta(\mu'))_{t+1}(a'|s') \right) Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') \right|$$
$$\leq \max_{s' \in \mathcal{S}} \max_{a' \in \mathcal{A}} \left| Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right| + \max_{s' \in \mathcal{S}} \max_{a' \in \mathcal{A}} 2|\mathcal{A}| \left| Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') - Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') \right|$$
$$\quad + \max_{s' \in \mathcal{S}} \max_{a'' \in \mathcal{A}} \left| Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a'') \right| \cdot \left| \sum_{a' \in \mathcal{A} \setminus \mathcal{A}^{s'}_{\mathrm{opt}}} \left( (\Phi_\eta(\mu'))_{t+1}(a'|s') - (\Phi_\eta(\mu))_{t+1}(a'|s') \right) \right|$$
$$< (1 + 2|\mathcal{A}|) \cdot \frac{\varepsilon_{t,s,a}}{4|\mathcal{A}| + 8} + M_Q |\mathcal{A}| \cdot \frac{\varepsilon_{t,s,a}}{2 M_Q |\mathcal{A}|} < \varepsilon_{t,s,a},$$
where we use that for any $a' \in \mathcal{A}^{s'}_{\mathrm{opt}}$ we have $Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') = \max_{a'' \in \mathcal{A}} Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a'')$.
Analogously, for the fourth term we have
$$\max_{s' \in \mathcal{S}} \left| \sum_{a' \in \mathcal{A} \setminus \mathcal{A}^{s'}_{\mathrm{opt}}} \left( (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - (\Phi_\eta(\mu'))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right) \right|$$
$$\leq \max_{s' \in \mathcal{S}} \sum_{a' \in \mathcal{A} \setminus \mathcal{A}^{s'}_{\mathrm{opt}}} \left| (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right|$$
$$\quad + \max_{s' \in \mathcal{S}} \sum_{a' \in \mathcal{A} \setminus \mathcal{A}^{s'}_{\mathrm{opt}}} \left| (\Phi_\eta(\mu))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') - (\Phi_\eta(\mu'))_{t+1}(a'|s')\, Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right|$$
$$\leq \max_{s' \in \mathcal{S}} \max_{a' \in \mathcal{A}} \left| Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right| + \max_{s' \in \mathcal{S}} M_Q \sum_{a' \in \mathcal{A} \setminus \mathcal{A}^{s'}_{\mathrm{opt}}} \left| (\Phi_\eta(\mu))_{t+1}(a'|s') - (\Phi_\eta(\mu'))_{t+1}(a'|s') \right|$$
$$< \frac{\varepsilon_{t,s,a}}{4|\mathcal{A}| + 8} + M_Q |\mathcal{A}| \cdot \frac{\varepsilon_{t,s,a}}{2 M_Q |\mathcal{A}|} < \varepsilon_{t,s,a}$$
under the previous conditions, since as long as we have $d_{\mathcal{M}}(\mu,\mu') < \delta_{t+1,s',a'}$ for all $s' \in \mathcal{S}$, $a' \in \mathcal{A}$ from before, we have
$$\left| Q^{\Phi_\eta(\mu)}(\mu,t+1,s',a') - Q^{\Phi_\eta(\mu')}(\mu',t+1,s',a') \right| < \frac{\varepsilon_{t,s,a}}{4|\mathcal{A}| + 8} < \varepsilon_{t,s,a}.$$
Finally, by choosing $\delta_{t,s,a}$ such that all conditions are fulfilled, i.e.
$$\delta_{t,s,a} \equiv \min\left( \delta^1_{t,s,a},\, \delta^2_{t,s,a},\, \delta^3_{t,s,a},\, \delta^4_{t,s,a},\, \min_{s' \in \mathcal{S}, a' \in \mathcal{A}} \delta_{t+1,s',a'} \right) > 0,$$
the induction hypothesis is fulfilled, since then for any $\mu'$ with $d_{\mathcal{M}}(\mu,\mu') < \delta_{t,s,a}$ we have
$$\left| Q^{\Phi_\eta(\mu)}(\mu,t,s,a) - Q^{\Phi_\eta(\mu')}(\mu',t,s,a) \right| < \varepsilon_{t,s,a}.$$
Since $\eta > 0$ was arbitrary, the desired result follows immediately, as we can set $\varepsilon_{t,s,a} = \varepsilon$ for each $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$ and obtain $\delta \equiv \min_{t \in \mathcal{T}, s \in \mathcal{S}, a \in \mathcal{A}} \delta_{t,s,a}$, fulfilling the required equicontinuity property at $\mu$. $\square$
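The driving mechanism behind the bound on the second term above is that, for fixed $\eta > 0$, the prior-weighted softmax reacts continuously to perturbations of its action values, with a sensitivity factor of order $\frac{1}{\eta^{s'}_{\min}}\exp(M_Q/\eta^{s'}_{\min})$. The following minimal numerical sketch (not part of the proof; the four-action example, the uniform prior and all perturbation sizes are illustrative assumptions) makes this behaviour visible:

```python
import numpy as np

def boltzmann(q_values, prior, eta):
    # Prior-weighted softmax ("Boltzmann") policy over a single state.
    logits = np.log(prior) + q_values / eta
    logits -= logits.max()                      # numerical stabilization only
    weights = np.exp(logits)
    return weights / weights.sum()

rng = np.random.default_rng(0)
prior = np.full(4, 0.25)                        # uniform prior q(.|s)
q = rng.uniform(-1.0, 1.0, size=4)              # stand-in for Q*(mu, t+1, s', .)

for eta in [2.0, 0.5, 0.1]:
    for delta in [1e-1, 1e-2, 1e-3]:
        # The perturbation plays the role of K_Q * d_M(mu, mu') in the proof.
        q_pert = q + rng.uniform(-delta, delta, size=4)
        gap = np.abs(boltzmann(q, prior, eta) - boltzmann(q_pert, prior, eta)).max()
        print(f"eta={eta:4.1f}  |dQ| <= {delta:.0e}  max policy difference = {gap:.2e}")
```

For each fixed $\eta$, the printed policy difference shrinks roughly linearly with the perturbation, while smaller temperatures inflate the proportionality constant, which is exactly the trade-off that forces the case distinction on $\eta^{s'}_{\min}$ above.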
From equicontinuity, we get the desired uniform convergence via compactness.

Lemma B.8.3. If $(f_n)_{n \in \mathbb{N}}$ with $f_n \colon \mathcal{M} \to \mathbb{R}$ is an equicontinuous sequence of functions and for all $\mu \in \mathcal{M}$ we have $f_n(\mu) \to f(\mu)$ pointwise, then $f_n(\mu) \to f(\mu)$ uniformly.

Proof. Let $\varepsilon > 0$ be arbitrary. By equicontinuity, for any point $\mu \in \mathcal{M}$ there exists $\delta(\mu) > 0$ such that for all $\mu' \in \mathcal{M}$ with $d_{\mathcal{M}}(\mu,\mu') < \delta(\mu)$ we have $|f_n(\mu) - f_n(\mu')| < \frac{\varepsilon}{3}$ for all $n \in \mathbb{N}$, which via pointwise convergence implies $|f(\mu) - f(\mu')| \leq \frac{\varepsilon}{3}$.

Since $\mathcal{M}$ is compact, it is separable, i.e. there exists a countable dense subset $(\mu_j)_{j \in \mathbb{N}}$ of $\mathcal{M}$. Let $\delta(\mu)$ be as defined above and cover $\mathcal{M}$ by the open balls $(B_{\delta(\mu_j)}(\mu_j))_{j \in \mathbb{N}}$. By the compactness of $\mathcal{M}$, finitely many of these balls $B_{\delta(\mu_{n_1})}(\mu_{n_1}), \ldots, B_{\delta(\mu_{n_k})}(\mu_{n_k})$ cover $\mathcal{M}$. By pointwise convergence, for any $i = 1, \ldots, k$ we can find an integer $N_i$ such that for all $n > N_i$ we have $|f_n(\mu_{n_i}) - f(\mu_{n_i})| < \frac{\varepsilon}{3}$. Taken together, we find that for $n > \max_{i=1,\ldots,k} N_i$ and arbitrary $\mu \in \mathcal{M}$, we have
$$|f_n(\mu) - f(\mu)| \leq |f_n(\mu) - f_n(\mu_{n_i})| + |f_n(\mu_{n_i}) - f(\mu_{n_i})| + |f(\mu_{n_i}) - f(\mu)| < \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \varepsilon$$
for some center point $\mu_{n_i}$ of a ball containing $\mu$ from the finite cover. $\square$

Therefore, a sequence of Boltzmann MFE with vanishing $\eta$ is approximately optimal in the MFG.
Lemma B.8.4. For any sequence $(\pi^*_n, \mu^*_n)_{n \in \mathbb{N}}$ of $\eta_n$-Boltzmann MFE with $\eta_n \to 0^+$ and for any $\varepsilon > 0$ there exists an integer $N \in \mathbb{N}$ such that for all integers $n > N$ we have $J_{\mu^*_n}(\pi^*_n) \geq \max_\pi J_{\mu^*_n}(\pi) - \varepsilon$.

Proof.
By Lemma B.8.2,
$$F \equiv \left( \mu \mapsto Q^{\Phi_\eta(\mu)}(\mu,t,s,a) \right)_{\eta > 0,\, t \in \mathcal{T},\, s \in \mathcal{S},\, a \in \mathcal{A}}$$
is equicontinuous. Therefore, any sequence $(\mu \mapsto Q^{\Phi_{\eta_n}(\mu)}(\mu,t,s,a))_{n \in \mathbb{N}}$ with $\eta_n \to 0^+$ is also equicontinuous for any $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$. Furthermore, by Lemma B.8.1, the sequence $(\mu \mapsto Q^{\Phi_{\eta_n}(\mu)}(\mu,t,s,a))_{n \in \mathbb{N}}$ converges pointwise to $\mu \mapsto Q^*(\mu,t,s,a)$ for any $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.

By Lemma B.8.3, we thus have $|Q^{\Phi_{\eta_n}(\mu)}(\mu,t,s,a) - Q^*(\mu,t,s,a)| \to 0$ uniformly. Therefore, for any $\varepsilon > 0$, there exists an integer $N$ by uniform convergence such that for all integers $n > N$ we have
$$Q^{\pi^*_n}(\mu^*_n,t,s,a) \geq Q^*(\mu^*_n,t,s,a) - \varepsilon = \max_{\pi \in \Pi} Q^\pi(\mu^*_n,t,s,a) - \varepsilon,$$
and since by Lemma B.3.1 we have
$$J_{\mu^*_n}(\pi^*_n) = \sum_{s \in \mathcal{S}} \mu_0(s) \sum_{a \in \mathcal{A}} \pi^*_{n,0}(a \mid s)\, Q^{\pi^*_n}(\mu^*_n, 0, s, a) \geq \sum_{s \in \mathcal{S}} \mu_0(s) \max_{\pi \in \Pi} \sum_{a \in \mathcal{A}} \pi_0(a \mid s)\, Q^{\pi}(\mu^*_n, 0, s, a) - \varepsilon = \max_{\pi \in \Pi} J_{\mu^*_n}(\pi) - \varepsilon,$$
the desired result follows immediately. $\square$

Finally, we show approximate optimality in the actual $N$-agent game as long as a pair $(\pi^*, \mu^*) \in \Pi \times \mathcal{M}$ with $\mu^* = \Psi(\pi^*)$ has vanishing exploitability in the MFG. By Lemma B.8.4, for any sequence $(\pi^*_n, \mu^*_n)_{n \in \mathbb{N}}$ of $\eta_n$-Boltzmann MFE with $\eta_n \to 0^+$ and for any $\varepsilon > 0$ there exists an integer $n' \in \mathbb{N}$ such that for all integers $n > n'$ we have $J_{\mu^*_n}(\pi^*_n) \geq \max_\pi J_{\mu^*_n}(\pi) - \varepsilon$. Let $\varepsilon' > 0$ be arbitrary and choose a sequence of optimal policies $\{\pi^N\}_{N \in \mathbb{N}}$ such that for all $N \in \mathbb{N}$ we have $\pi^N \in \arg\max_{\pi \in \Pi} J^N(\pi, \pi^*_n, \ldots, \pi^*_n)$. By Lemma B.5.6 there exists $N' \in \mathbb{N}$ such that for all $N > N'$ and all $n > n'$, we have
$$\max_{\pi \in \Pi} J^N(\pi, \pi^*_n, \ldots, \pi^*_n) - \varepsilon - \varepsilon' \leq \max_{\pi \in \Pi} J_{\mu^*_n}(\pi) - \varepsilon - \varepsilon' \leq J_{\mu^*_n}(\pi^*_n) - \varepsilon' \leq J^N(\pi^*_n, \pi^*_n, \ldots, \pi^*_n),$$
which is the desired approximate Nash equilibrium property, since $\varepsilon, \varepsilon'$ are arbitrary. This applies by symmetry to all agents.
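The quantity controlled in the preceding argument, $\max_\pi J_{\mu^*}(\pi) - J_{\mu^*}(\pi^*)$, is the exploitability of a candidate policy. The sketch below (not from the paper; the crowd-aversion reward $r(s,a,\mu_t) = -\mu_t(s)$, the $\mu$-independent dynamics and all sizes are illustrative assumptions) computes it exactly on a tiny finite MFG, using one backward induction for the best response and one for the candidate policy:

```python
import numpy as np

S, A, T = 3, 2, 5
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a] = p(.|s, a), mu-independent for simplicity
mu0 = np.full(S, 1.0 / S)

def reward(mu_t):
    # r(s, a, mu_t) = -mu_t(s): agents dislike crowded states.
    return -np.tile(mu_t[:, None], (1, A))

def mean_field(pi):
    # Forward equation mu_{t+1}(s') = sum_s mu_t(s) sum_a pi_t(a|s) p(s'|s,a).
    mu = np.zeros((T, S)); mu[0] = mu0
    for t in range(T - 1):
        mu[t + 1] = np.einsum("s,sa,saj->j", mu[t], pi[t], P)
    return mu

def best_response_value(mu):
    # Backward induction for max_pi J_mu(pi).
    V = np.zeros(S)
    for t in reversed(range(T)):
        Q = reward(mu[t]) + P @ V
        V = Q.max(axis=1)
    return mu0 @ V

def policy_value(pi, mu):
    V = np.zeros(S)
    for t in reversed(range(T)):
        Q = reward(mu[t]) + P @ V
        V = (pi[t] * Q).sum(axis=1)
    return mu0 @ V

pi_uniform = np.full((T, S, A), 1.0 / A)
mu = mean_field(pi_uniform)
print("exploitability of the uniform policy:", best_response_value(mu) - policy_value(pi_uniform, mu))
```

Driving this number to zero, for instance along a sequence of Boltzmann MFE with vanishing $\eta$, is precisely what the lemmas above certify.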
For RelEnt MFE, the same can be done by first showing the uniform convergence of the soft action-value function to the usual action-value function. For this, note that the smooth maximum Bellman recursion converges to the hard maximum Bellman recursion for any fixed $\mu$.

Lemma B.8.5. For any $f \colon \mathcal{A} \to \mathbb{R}$ and any $g \colon \mathcal{A} \to \mathbb{R}$ with $g(a) > 0$ for all $a \in \mathcal{A}$, we have
$$\lim_{\eta \to 0^+} \eta \log \sum_{a \in \mathcal{A}} g(a) \exp\left(\frac{f(a)}{\eta}\right) = \max_{a \in \mathcal{A}} f(a).$$

Proof.
Let $\delta = \eta^{-1}$, so that $\delta \to +\infty$ as $\eta \to 0^+$. Then, by L'Hôpital's rule we have
$$\lim_{\delta \to +\infty} \frac{\log \sum_{a \in \mathcal{A}} g(a) \exp(\delta f(a))}{\delta} = \lim_{\delta \to +\infty} \frac{\sum_{a \in \mathcal{A}} g(a) \exp(\delta f(a))\, f(a)}{\sum_{a \in \mathcal{A}} g(a) \exp(\delta f(a))}$$
$$= \lim_{\delta \to +\infty} \frac{\sum_{a \in \mathcal{A}} g(a) \exp\left(\delta \left(f(a) - \max_{a \in \mathcal{A}} f(a)\right)\right) f(a)}{\sum_{a \in \mathcal{A}} g(a) \exp\left(\delta \left(f(a) - \max_{a \in \mathcal{A}} f(a)\right)\right)} = \frac{\left(\sum_{a \in \mathcal{A}_{\max}} g(a)\right) \max_{a \in \mathcal{A}} f(a)}{\sum_{a \in \mathcal{A}_{\max}} g(a)} = \max_{a \in \mathcal{A}} f(a),$$
where $\mathcal{A}_{\max} \subseteq \mathcal{A}$ denotes the set of elements of $\mathcal{A}$ that maximize $f$. $\square$
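As a quick numerical illustration of Lemma B.8.5 (not part of the proof; the vectors $f$ and $g$ below are arbitrary assumptions), the smooth maximum can be evaluated in a stabilized form and visibly approaches the hard maximum as $\eta \to 0^+$, independently of the strictly positive weights $g$:

```python
import numpy as np

def smooth_max(f, g, eta):
    # Stabilized evaluation of eta * log( sum_a g(a) * exp(f(a)/eta) ).
    m = f.max()
    return m + eta * np.log(np.sum(g * np.exp((f - m) / eta)))

f = np.array([0.3, 1.7, -0.5, 1.7])     # two maximizers, max f = 1.7
g = np.array([0.1, 0.4, 0.3, 0.2])      # strictly positive weights
for eta in [1.0, 0.1, 0.01, 0.001]:
    print(f"eta={eta:6.3f}  smooth max = {smooth_max(f, g, eta):.4f}")
```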
Using this result, we can show pointwise convergence of the soft action-value function to the action-value function.

Lemma B.8.6.
Any sequence of functions $(\mu \mapsto \tilde{Q}_{\eta_n}(\mu,t,s,a))_{n \in \mathbb{N}}$ with $\eta_n \to 0^+$ converges pointwise to $\mu \mapsto Q^*(\mu,t,s,a)$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.

Proof. Fix $\mu \in \mathcal{M}$. We show by backward induction over $t$ that for any $\varepsilon > 0$ there exists $\eta_t > 0$ such that for all $\eta < \eta_t$ we have $|\tilde{Q}_\eta(\mu,t,s,a) - Q^*(\mu,t,s,a)| < \varepsilon$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$. This holds for $t = T-1$ and arbitrary $s \in \mathcal{S}$, $a \in \mathcal{A}$ by Lemma B.8.5, since $r(s,a,\mu_{T-1})$ is independent of $\eta$. Assume the claim holds for $t+1$ and consider $t$. By the induction assumption we can choose $\eta_{t+1} > 0$ such that for $\eta < \eta_{t+1}$ we have $|\tilde{Q}_\eta(\mu,t+1,s',a') - Q^*(\mu,t+1,s',a')| < \frac{\varepsilon}{2}$ for all $s' \in \mathcal{S}$, $a' \in \mathcal{A}$, and hence, as $\eta \to 0^+$,
$$\tilde{Q}_\eta(\mu,t,s,a) = r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t)\, \eta \log \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}\right)$$
$$\leq r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t)\, \eta \log \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{Q^*(\mu,t+1,s',a') + \frac{\varepsilon}{2}}{\eta}\right)$$
$$\to r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t) \left( \max_{a' \in \mathcal{A}} Q^*(\mu,t+1,s',a') + \frac{\varepsilon}{2} \right)$$
by Lemma B.8.5 and monotonicity of $\log$ and $\exp$. Analogously,
$$\tilde{Q}_\eta(\mu,t,s,a) \geq r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t)\, \eta \log \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{Q^*(\mu,t+1,s',a') - \frac{\varepsilon}{2}}{\eta}\right)$$
$$\to r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t) \left( \max_{a' \in \mathcal{A}} Q^*(\mu,t+1,s',a') - \frac{\varepsilon}{2} \right).$$
Therefore, we can choose $\eta_t < \eta_{t+1}$ such that for all $\eta < \eta_t$ we have
$$\left| \tilde{Q}_\eta(\mu,t,s,a) - Q^*(\mu,t,s,a) \right| = \left| \tilde{Q}_\eta(\mu,t,s,a) - \left( r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t) \max_{a' \in \mathcal{A}} Q^*(\mu,t+1,s',a') \right) \right| < \varepsilon,$$
which is the desired result. $\square$
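The following small sketch runs the soft and hard backward recursions side by side on a randomly generated finite-horizon problem (purely illustrative: the mean field is held fixed, so it only enters through time-dependent rewards, and the kernel, rewards, horizon and uniform prior are assumptions); the printed gap shrinks towards zero as $\eta$ decreases, consistent with Lemma B.8.6:

```python
import numpy as np

S, A, T = 3, 2, 4
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(S), size=(S, A))      # time-homogeneous transitions p(.|s,a)
r = rng.uniform(-1.0, 1.0, size=(T, S, A))      # r(s,a,mu_t) for the fixed mean field
q_prior = np.full(A, 1.0 / A)                   # uniform prior policy q_t(.|s)

def soft_q(eta):
    # Backward recursion for the soft action-value function \tilde{Q}_eta.
    Q_next = np.zeros((S, A))                   # terminal condition \tilde{Q}_eta(mu, T, ., .) = 0
    for t in reversed(range(T)):
        m = Q_next.max(axis=1, keepdims=True)
        soft_v = (m + eta * np.log((q_prior * np.exp((Q_next - m) / eta)).sum(axis=1, keepdims=True))).squeeze(1)
        Q_next = r[t] + P @ soft_v
    return Q_next                               # \tilde{Q}_eta(mu, 0, ., .)

def hard_q():
    Q_next = np.zeros((S, A))
    for t in reversed(range(T)):
        Q_next = r[t] + P @ Q_next.max(axis=1)
    return Q_next                               # Q*(mu, 0, ., .)

q_star = hard_q()
for eta in [1.0, 0.3, 0.1, 0.03, 0.01]:
    print(f"eta={eta:5.2f}  max |soft Q - Q*| = {np.abs(soft_q(eta) - q_star).max():.4f}")
```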
We can now show that the soft action-value function converges uniformly to the action-value function as $\eta \to 0^+$.

Lemma B.8.7. Any sequence of functions $(\mu \mapsto \tilde{Q}_{\eta_n}(\mu,t,s,a))_{n \in \mathbb{N}}$ with $\eta_n \to 0^+$ converges uniformly to $\mu \mapsto Q^*(\mu,t,s,a)$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.

Proof. First, we show that $\tilde{Q}_\eta(\mu,t,s,a)$ is monotonically decreasing in $\eta$ for $\eta > 0$, i.e. $\frac{\partial}{\partial \eta} \tilde{Q}_\eta(\mu,t,s,a) \leq 0$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$. This is the case for $t = T-1$ and arbitrary $s \in \mathcal{S}$, $a \in \mathcal{A}$, since $\tilde{Q}_\eta(\mu,T-1,s,a)$ is constant in $\eta$. Assume this holds for $t+1$, then for $t$ and arbitrary $s \in \mathcal{S}$, $a \in \mathcal{A}$ we have
$$\frac{\partial}{\partial \eta} \tilde{Q}_\eta(\mu,t,s,a) = \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t) \log \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}\right)$$
$$\quad + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t)\, \eta\, \frac{\sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}\right) \left( -\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta^2} + \frac{1}{\eta}\frac{\partial}{\partial \eta}\tilde{Q}_\eta(\mu,t+1,s',a') \right)}{\sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}\right)}$$
$$\leq \max_{s' \in \mathcal{S}} \left[ \log \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}\right) - \frac{\sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}\right) \frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}}{\sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left(\frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta}\right)} \right]$$
by the induction hypothesis. Let $\xi_{a'} \equiv \frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta} \in \mathbb{R}$ and $s' \in \mathcal{S}$ be arbitrary, then by Jensen's inequality applied to the convex function $\phi(x) = x \log x$ we have
$$\sum_{a' \in \mathcal{A}} q_{t+1}(a'|s')\, \phi(\exp \xi_{a'}) \geq \phi\left( \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp \xi_{a'} \right)$$
$$\iff \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s')\, \xi_{a'} \exp \xi_{a'} \geq \left( \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp \xi_{a'} \right) \log \left( \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp \xi_{a'} \right)$$
$$\iff \log \left( \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp \xi_{a'} \right) - \frac{\sum_{a' \in \mathcal{A}} q_{t+1}(a'|s')\, \xi_{a'} \exp \xi_{a'}}{\sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp \xi_{a'}} \leq 0,$$
such that $\tilde{Q}_\eta(\mu,t,s,a)$ is monotonically decreasing in $\eta$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$ by induction.

Furthermore, $\mathcal{M}$ is compact and both $\tilde{Q}_\eta$ and $Q^*$ are compositions, sums, products and finite maxima of continuous functions in $\mu$ and therefore continuous in $\mu$ by the standing assumptions. Since $(\mu \mapsto \tilde{Q}_{\eta_n}(\mu,t,s,a))_{n \in \mathbb{N}}$ with $\eta_n \to 0^+$ converges pointwise to $\mu \mapsto Q^*(\mu,t,s,a)$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$ by Lemma B.8.6, by Dini's theorem the convergence is uniform. $\square$

Now that $\tilde{Q}_\eta$ converges uniformly to $Q^*$, we can show that RelEnt MFE have vanishing exploitability by replicating the proof for Boltzmann MFE.

Lemma B.8.8.
Any sequence of functions $(\mu \mapsto Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a))_{n \in \mathbb{N}}$ with $\eta_n \to 0^+$ converges pointwise to $\mu \mapsto Q^*(\mu,t,s,a)$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.

Proof. The proof is the same as in Lemma B.8.1. The only difference is that we additionally choose $n_1 \in \mathbb{N}$ in each induction step such that for all $n > n_1$ we have
$$\left| \tilde{Q}_{\eta_n}(\mu,t,s,a) - Q^*(\mu,t,s,a) \right| \leq \Delta Q^{s',\mu}_{\min}$$
for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$, which is possible since, by Lemma B.8.7, $\tilde{Q}_\eta$ converges uniformly to $Q^*$. As long as we choose $n' \equiv \max(n_1, n_2, \max_{s' \in \mathcal{S}, a' \in \mathcal{A}} n_{s',a'})$, where $n_2$ and the $n_{s',a'}$ are chosen as in the proof of Lemma B.8.1, the rest of the proof applies. $\square$

Lemma B.8.9.
Any sequence of functions $(\mu \mapsto Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a))_{n \in \mathbb{N}}$ with $\eta_n \to 0^+$ fulfills equicontinuity for large enough $n$: For any $\varepsilon > 0$ and any $\mu \in \mathcal{M}$, we can choose a $\delta > 0$ and an integer $n' \in \mathbb{N}$ such that for all $\mu' \in \mathcal{M}$ with $d_{\mathcal{M}}(\mu,\mu') < \delta$ and for all $n > n'$ we have $\left| Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a) - Q^{\tilde{\Phi}_{\eta_n}(\mu')}(\mu',t,s,a) \right| < \varepsilon$.

Proof.
To obtain the desired property, we replicate the proof of Lemma B.8.2 by setting $F = (\mu \mapsto Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a))_{n \in \mathbb{N}}$. Any bounds for $\tilde{Q}_\eta$ can be instantiated by the corresponding bound for $Q^*$ and then bounding the distance between both by uniform convergence. The only differences lie in bounding the terms
$$\left| (\tilde{\Phi}_{\eta_n}(\mu))_{t+1}(a_{\mathrm{sub}} \mid s') - (\tilde{\Phi}_{\eta_n}(\mu'))_{t+1}(a_{\mathrm{sub}} \mid s') \right|,$$
where the action-value function has been replaced by the soft action-value function. Since $\tilde{Q}_{\eta_n}$ converges uniformly to $Q^*$, we instantiate additional requirements $N^{s'}_{t,s,a}$, $\tilde{N}^{s'}_{t,s,a}$ and let $n > N^{s'}_{t,s,a}$, $n > \tilde{N}^{s'}_{t,s,a}$ be large enough such that $\eta_n$ is sufficiently small.

The first difference is to obtain
$$\left| \tilde{Q}_{\eta_n}(\mu',t,s,a) - \tilde{Q}_{\eta_n}(\mu,t,s,a) \right| < \Delta Q^{s',\mu}_{\min}$$
for all $\mu' \in \mathcal{M}$, $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$ with $d_{\mathcal{M}}(\mu,\mu')$ sufficiently small. We choose $\hat{\delta}_{t,s,a}$ slightly stronger than in the original proof, such that if $d_{\mathcal{M}}(\mu,\mu') < \hat{\delta}_{t,s,a}$, we have
$$\left| Q^*(\mu',t,s,a) - Q^*(\mu,t,s,a) \right| < \frac{\Delta Q^{s',\mu}_{\min}}{3}.$$
We must then additionally choose $N^{s'}_{t,s,a} \in \mathbb{N}$ for each induction step via uniform convergence from Lemma B.8.7 such that as long as $n > N^{s'}_{t,s,a}$, we have
$$\left| \tilde{Q}_{\eta_n}(\mu,t,s,a) - Q^*(\mu,t,s,a) \right| < \frac{\Delta Q^{s',\mu}_{\min}}{3}.$$
This implies the required inequality
$$\left| \tilde{Q}_{\eta_n}(\mu',t,s,a) - \tilde{Q}_{\eta_n}(\mu,t,s,a) \right| \leq \left| \tilde{Q}_{\eta_n}(\mu',t,s,a) - Q^*(\mu',t,s,a) \right| + \left| Q^*(\mu',t,s,a) - Q^*(\mu,t,s,a) \right| + \left| Q^*(\mu,t,s,a) - \tilde{Q}_{\eta_n}(\mu,t,s,a) \right| < \Delta Q^{s',\mu}_{\min},$$
and we can proceed as in the original proof.

The second difference lies in choosing $\delta^{2,s'}_{t,s,a}$. Note that $\tilde{Q}_{\eta_n}$ is still bounded by $M_Q$, see Lemma B.7.1. However, since $\tilde{Q}_{\eta_n}$ might no longer be Lipschitz with the same constant as $Q^*$, we choose an additional integer $\tilde{N}^{s'}_{t,s,a} \in \mathbb{N}$ for each induction step by Lemma B.8.7, such that as long as $n > \tilde{N}^{s'}_{t,s,a}$, we have
$$\left| \tilde{Q}_{\eta_n}(\mu,t,s,a) - Q^*(\mu,t,s,a) \right| \leq \Delta^{s'}_Q \equiv \frac{\varepsilon_{t,s,a}\, \eta^{s'}_{\min}}{16 M_Q |\mathcal{A}|^2\, R^{\max}_q \exp\left(\frac{M_Q}{\eta^{s'}_{\min}}\right)}$$
for all $\mu \in \mathcal{M}$, $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.
The required bound then follows immediately from
$$\left| (\tilde{\Phi}_{\eta_n}(\mu))_{t+1}(a_{\mathrm{sub}} \mid s') - (\tilde{\Phi}_{\eta_n}(\mu'))_{t+1}(a_{\mathrm{sub}} \mid s') \right|$$
$$\leq R^{\max}_q \sum_{a' \neq a_{\mathrm{sub}}} \left| \exp\left(\frac{\tilde{Q}_{\eta_n}(\mu',t+1,s',a') - \tilde{Q}_{\eta_n}(\mu',t+1,s',a_{\mathrm{sub}})}{\eta_n}\right) - \exp\left(\frac{\tilde{Q}_{\eta_n}(\mu,t+1,s',a') - \tilde{Q}_{\eta_n}(\mu,t+1,s',a_{\mathrm{sub}})}{\eta_n}\right) \right|$$
$$\leq R^{\max}_q \sum_{a' \neq a_{\mathrm{sub}}} \left| \frac{1}{\eta_n} \exp\left(\frac{\xi_{a'}}{\eta_n}\right) \right| \left| \left( \tilde{Q}_{\eta_n}(\mu',t+1,s',a') - \tilde{Q}_{\eta_n}(\mu',t+1,s',a_{\mathrm{sub}}) \right) - \left( \tilde{Q}_{\eta_n}(\mu,t+1,s',a') - \tilde{Q}_{\eta_n}(\mu,t+1,s',a_{\mathrm{sub}}) \right) \right|$$
$$\leq R^{\max}_q |\mathcal{A}| \cdot \frac{1}{\eta^{s'}_{\min}} \exp\left(\frac{M_Q}{\eta^{s'}_{\min}}\right) \left( \left| \tilde{Q}_{\eta_n}(\mu',t+1,s',a') - \tilde{Q}_{\eta_n}(\mu,t+1,s',a') \right| + \left| \tilde{Q}_{\eta_n}(\mu,t+1,s',a_{\mathrm{sub}}) - \tilde{Q}_{\eta_n}(\mu',t+1,s',a_{\mathrm{sub}}) \right| \right)$$
$$\leq R^{\max}_q |\mathcal{A}| \cdot \frac{1}{\eta^{s'}_{\min}} \exp\left(\frac{M_Q}{\eta^{s'}_{\min}}\right) \cdot \left( 2 K_Q\, d_{\mathcal{M}}(\mu,\mu') + 4 \Delta^{s'}_Q \right)$$
$$\leq R^{\max}_q |\mathcal{A}| \cdot \frac{1}{\eta^{s'}_{\min}} \exp\left(\frac{M_Q}{\eta^{s'}_{\min}}\right) \cdot 2 K_Q\, d_{\mathcal{M}}(\mu,\mu') + \frac{\varepsilon_{t,s,a}}{4 M_Q |\mathcal{A}|} < \frac{\varepsilon_{t,s,a}}{2 M_Q |\mathcal{A}|},$$
as in the original proof, by letting $d_{\mathcal{M}}(\mu,\mu') < \delta^{2,s'}_{t,s,a}$ and choosing
$$\delta^{2,s'}_{t,s,a} = \frac{\varepsilon_{t,s,a}\, \eta^{s'}_{\min}}{8 M_Q |\mathcal{A}|^2\, R^{\max}_q \exp\left(\frac{M_Q}{\eta^{s'}_{\min}}\right) K_Q}.$$
The rest of the proof is analogous. We obtain the additional requirements $n > N^{s'}_{t,s,a}$ and $n > \tilde{N}^{s'}_{t,s,a}$ for some integers $N^{s'}_{t,s,a}, \tilde{N}^{s'}_{t,s,a}$ and each $t \in \mathcal{T}$, $s \in \mathcal{S}$, $s' \in \mathcal{S}$, $a \in \mathcal{A}$. By choosing $n' \equiv \max_{t \in \mathcal{T}, s \in \mathcal{S}, s' \in \mathcal{S}, a \in \mathcal{A}} \max(N^{s'}_{t,s,a}, \tilde{N}^{s'}_{t,s,a})$, the desired result holds as long as $n > n'$. $\square$

From this property, we again obtain the desired uniform convergence via compactness of $\mathcal{M}$.

Lemma B.8.10.
Any sequence of functions $(\mu \mapsto Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a))_{n \in \mathbb{N}}$ with $\eta_n \to 0^+$ converges uniformly to $\mu \mapsto Q^*(\mu,t,s,a)$ for all $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.

Proof. Fix $\varepsilon > 0$, $t \in \mathcal{T}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$. Then, by Lemma B.8.9, for any point $\mu \in \mathcal{M}$ there exist both $\delta(\mu) > 0$ and an integer $n'(\mu)$ such that for all $\mu' \in \mathcal{M}$ with $d_{\mathcal{M}}(\mu,\mu') < \delta(\mu)$ and all $n > n'(\mu)$ we have
$$\left| Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a) - Q^{\tilde{\Phi}_{\eta_n}(\mu')}(\mu',t,s,a) \right| < \frac{\varepsilon}{3},$$
which via pointwise convergence from Lemma B.8.8 implies $|Q^*(\mu,t,s,a) - Q^*(\mu',t,s,a)| \leq \frac{\varepsilon}{3}$.

Since $\mathcal{M}$ is compact, it is separable, i.e. there exists a countable dense subset $(\mu_j)_{j \in \mathbb{N}}$ of $\mathcal{M}$. Let $\delta(\mu)$ be as defined above and cover $\mathcal{M}$ by the open balls $(B_{\delta(\mu_j)}(\mu_j))_{j \in \mathbb{N}}$. By the compactness of $\mathcal{M}$, finitely many of these balls $B_{\delta(\mu_{n_1})}(\mu_{n_1}), \ldots, B_{\delta(\mu_{n_k})}(\mu_{n_k})$ cover $\mathcal{M}$. By pointwise convergence from Lemma B.8.8, for any $i = 1, \ldots, k$ we can find integers $m_i$ such that for all $n > m_i$ we have
$$\left| Q^{\tilde{\Phi}_{\eta_n}(\mu_{n_i})}(\mu_{n_i},t,s,a) - Q^*(\mu_{n_i},t,s,a) \right| < \frac{\varepsilon}{3}.$$
Taken together, we find that for $n > \max_{i=1,\ldots,k} \max(n'(\mu_{n_i}), m_i)$ and arbitrary $\mu \in \mathcal{M}$, we have
$$\left| Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a) - Q^*(\mu,t,s,a) \right| \leq \left| Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a) - Q^{\tilde{\Phi}_{\eta_n}(\mu_{n_i})}(\mu_{n_i},t,s,a) \right| + \left| Q^{\tilde{\Phi}_{\eta_n}(\mu_{n_i})}(\mu_{n_i},t,s,a) - Q^*(\mu_{n_i},t,s,a) \right| + \left| Q^*(\mu_{n_i},t,s,a) - Q^*(\mu,t,s,a) \right| < \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \varepsilon$$
for some center point $\mu_{n_i}$ of a ball containing $\mu$ from the finite cover. $\square$

As a result, a sequence of RelEnt MFE with $\eta \to 0^+$ is approximately optimal in the MFG.

Lemma B.8.11.
For any sequence $(\pi^*_n, \mu^*_n)_{n \in \mathbb{N}}$ of $\eta_n$-RelEnt MFE with $\eta_n \to 0^+$ and for any $\varepsilon > 0$ there exists an integer $n' \in \mathbb{N}$ such that for all integers $n > n'$ we have $J_{\mu^*_n}(\pi^*_n) \geq \max_\pi J_{\mu^*_n}(\pi) - \varepsilon$.

Proof.
By Lemma B.8.10, we have $|Q^{\tilde{\Phi}_{\eta_n}(\mu)}(\mu,t,s,a) - Q^*(\mu,t,s,a)| \to 0$ uniformly. Therefore, for any $\varepsilon > 0$, there exists by uniform convergence an integer $n'$ such that for all integers $n > n'$ we have
$$Q^{\pi^*_n}(\mu^*_n,t,s,a) \geq Q^*(\mu^*_n,t,s,a) - \varepsilon = \max_{\pi \in \Pi} Q^\pi(\mu^*_n,t,s,a) - \varepsilon,$$
and since by Lemma B.3.1 we have
$$J_{\mu^*_n}(\pi^*_n) = \sum_{s \in \mathcal{S}} \mu_0(s) \sum_{a \in \mathcal{A}} \pi^*_{n,0}(a \mid s)\, Q^{\pi^*_n}(\mu^*_n, 0, s, a) \geq \sum_{s \in \mathcal{S}} \mu_0(s) \max_{\pi \in \Pi} \sum_{a \in \mathcal{A}} \pi_0(a \mid s)\, Q^{\pi}(\mu^*_n, 0, s, a) - \varepsilon = \max_{\pi \in \Pi} J_{\mu^*_n}(\pi) - \varepsilon,$$
the desired result follows immediately. $\square$

By repeating the previous argumentation for Boltzmann MFE with Lemma B.5.6 and replacing Lemma B.8.4 with Lemma B.8.11, we obtain the desired result for RelEnt MFE.
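Putting the pieces of Section B.8 together, the fixed point iteration alternates an evaluation of $Q^*(\mu,\cdot)$ for the current mean field with the softened policy update $\Phi_\eta$ and the induced mean field $\Psi$. The compact sketch below (a hypothetical toy instance with crowd-averse rewards and $\mu$-independent dynamics; the stopping rule and all constants are illustrative assumptions, not the paper's experimental setup) shows one way such a Boltzmann iteration can be written:

```python
import numpy as np

S, A, T = 3, 2, 5
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(S), size=(S, A))       # p(.|s,a), taken mu-independent for brevity
mu0 = np.full(S, 1.0 / S)
q_prior = np.full(A, 1.0 / A)                    # uniform prior policy

def q_values(mu):
    # Q*(mu, t, s, a) by backward induction for the fixed mean field mu.
    Q = np.zeros((T, S, A)); V = np.zeros(S)
    for t in reversed(range(T)):
        Q[t] = -mu[t][:, None] + P @ V           # r(s,a,mu_t) = -mu_t(s)
        V = Q[t].max(axis=1)
    return Q

def boltzmann_policy(Q, eta):
    # (Phi_eta(mu))_t(a|s) proportional to q_prior(a) * exp(Q*(mu,t,s,a)/eta).
    logits = np.log(q_prior)[None, None, :] + Q / eta
    logits -= logits.max(axis=2, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=2, keepdims=True)

def induced_mean_field(pi):
    # Psi(pi): forward equation for the state marginals.
    mu = np.zeros((T, S)); mu[0] = mu0
    for t in range(T - 1):
        mu[t + 1] = np.einsum("s,sa,saj->j", mu[t], pi[t], P)
    return mu

eta = 0.2
mu = np.tile(mu0, (T, 1))
for k in range(100):
    pi = boltzmann_policy(q_values(mu), eta)     # Phi_eta(mu)
    mu_next = induced_mean_field(pi)             # Psi(Phi_eta(mu))
    if np.abs(mu_next - mu).max() < 1e-10:
        break
    mu = mu_next
print("stopped after", k + 1, "iterations")
```

Running the same loop with the greedy policy in place of $\Phi_\eta$ corresponds to the unregularized fixed point iteration, which is the regime in which plain fixed point iteration need not converge.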
C Relative entropy mean field games
We show that the necessary conditions for optimality hold for the candidate solution. (For further insight, see also Neu et al. (2017), Haarnoja et al. (2017) and references therein.) Fix a mean field $\mu \in \mathcal{M}$ and formulate the induced problem as an optimization problem, with $\rho_t(s)$ as the probability of our representative agent visiting state $s \in \mathcal{S}$ at time $t \in \mathcal{T}$, to obtain
$$\max_{\rho, \pi} \quad \sum_{t=0}^{T-1} \sum_{s \in \mathcal{S}} \rho_t(s) \sum_{a \in \mathcal{A}} \pi_t(a|s)\, r(s,a,\mu_t)$$
$$\text{subject to} \quad \rho_{t+1}(s') = \sum_{s \in \mathcal{S}} \rho_t(s) \sum_{a \in \mathcal{A}} \pi_t(a|s)\, p(s'|s,a,\mu_t) \quad \forall s' \in \mathcal{S},\ t \in \{0,\ldots,T-2\},$$
$$\sum_{s \in \mathcal{S}} \rho_t(s) = 1 \quad \forall t \in \{0,\ldots,T-1\}, \qquad \sum_{a \in \mathcal{A}} \pi_t(a|s) = 1 \quad \forall s \in \mathcal{S},\ t \in \{0,\ldots,T-1\},$$
$$0 \leq \rho_t(s), \quad 0 \leq \pi_t(a|s) \quad \forall s \in \mathcal{S},\ a \in \mathcal{A},\ t \in \{0,\ldots,T-1\}, \qquad \mu_0(s) = \rho_0(s) \quad \forall s \in \mathcal{S}.$$
Note that if the agent follows the mean field policy of the other agents, we have $\rho_t = \mu_t$. The optimized objective is just the expectation $\mathbb{E}\left[ \sum_{t=0}^{T-1} r(S_t, A_t, \mu_t) \right]$. As in Belousov and Peters (2019), we change this objective to include a KL-divergence penalty weighted by the state-visitation distribution $\rho_t(\cdot)$ by introducing the temperature $\eta > 0$ and prior policy $q \in \Pi$ to obtain
$$\max_{\rho, \pi} \quad \sum_{t=0}^{T-1} \sum_{s \in \mathcal{S}} \rho_t(s) \sum_{a \in \mathcal{A}} \pi_t(a|s)\, r(s,a,\mu_t) - \eta \sum_{t=0}^{T-1} \sum_{s \in \mathcal{S}} \rho_t(s)\, D_{\mathrm{KL}}\left( \pi_t(\cdot|s) \,\|\, q_t(\cdot|s) \right)$$
subject to the same constraints. We ignore the constraints $0 \leq \pi_t(a|s)$ and $0 \leq \rho_t(s)$ and see later that they will hold automatically. This results in the simplified optimization problem
$$\max_{\rho, \pi} \quad \sum_{t=0}^{T-1} \sum_{s \in \mathcal{S}} \rho_t(s) \sum_{a \in \mathcal{A}} \pi_t(a|s)\, r(s,a,\mu_t) - \eta \sum_{t=0}^{T-1} \sum_{s \in \mathcal{S}} \rho_t(s)\, D_{\mathrm{KL}}\left( \pi_t(\cdot|s) \,\|\, q_t(\cdot|s) \right)$$
$$\text{subject to} \quad \rho_{t+1}(s') = \sum_{s \in \mathcal{S}} \rho_t(s) \sum_{a \in \mathcal{A}} \pi_t(a|s)\, p(s'|s,a,\mu_t) \quad \forall s' \in \mathcal{S},\ t \in \{0,\ldots,T-2\},$$
$$\sum_{s \in \mathcal{S}} \rho_t(s) = 1 \quad \forall t \in \{0,\ldots,T-1\}, \qquad \sum_{a \in \mathcal{A}} \pi_t(a|s) = 1 \quad \forall s \in \mathcal{S},\ t \in \{0,\ldots,T-1\}, \qquad \mu_0(s) = \rho_0(s) \quad \forall s \in \mathcal{S},$$
for which we introduce Lagrange multipliers $\lambda_1(t,s)$, $\lambda_2(t)$, $\lambda_3(t,s)$, $\lambda_4(s)$ and the Lagrangian
$$L(\rho, \pi, \lambda_1, \lambda_2, \lambda_3, \lambda_4) = \sum_{t=0}^{T-1} \sum_{s \in \mathcal{S}} \rho_t(s) \sum_{a \in \mathcal{A}} \pi_t(a|s) \left( r(s,a,\mu_t) - \eta \log \frac{\pi_t(a|s)}{q_t(a|s)} \right)$$
$$\quad - \sum_{t=0}^{T-2} \sum_{s' \in \mathcal{S}} \lambda_1(t,s') \left( \rho_{t+1}(s') - \sum_{s \in \mathcal{S}} \rho_t(s) \sum_{a \in \mathcal{A}} \pi_t(a|s)\, p(s'|s,a,\mu_t) \right) - \sum_{t=0}^{T-1} \lambda_2(t) \left( 1 - \sum_{s \in \mathcal{S}} \rho_t(s) \right)$$
$$\quad - \sum_{t=0}^{T-1} \sum_{s \in \mathcal{S}} \lambda_3(t,s) \left( \sum_{a \in \mathcal{A}} \pi_t(a|s) - 1 \right) - \sum_{s \in \mathcal{S}} \lambda_4(s) \left( \mu_0(s) - \rho_0(s) \right)$$
with the artificial constraint $\lambda_1(T-1, s) \equiv 0$, which allows us to formulate the following necessary conditions for optimality.
For $\nabla_{\pi_t(a|s)} L$ and all $s \in \mathcal{S}$, $a \in \mathcal{A}$, $t \in \{0,\ldots,T-1\}$, we obtain
$$\nabla_{\pi_t(a|s)} L = \rho_t(s) \left( r(s,a,\mu_t) - \eta \log \frac{\pi_t(a|s)}{q_t(a|s)} - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t) \right) - \lambda_3(t,s) \overset{!}{=} 0$$
$$\implies \pi^*_t(a|s) = q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t) - \frac{\lambda_3(t,s)}{\rho_t(s)}}{\eta} \right).$$
For $\nabla_{\lambda_3} L$ and all $s \in \mathcal{S}$, $t \in \{0,\ldots,T-1\}$, by inserting $\pi^*_t$ we obtain
$$\nabla_{\lambda_3(t,s)} L = 1 - \sum_{a \in \mathcal{A}} \pi_t(a|s) \overset{!}{=} 0 \iff \sum_{a \in \mathcal{A}} q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t) - \frac{\lambda_3(t,s)}{\rho_t(s)}}{\eta} \right) = 1,$$
which is fulfilled by choosing
$$\lambda^*_3(t,s) = \eta\, \rho_t(s) \log \sum_{a \in \mathcal{A}} q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t)}{\eta} \right),$$
since it fulfills the required equation
$$\sum_{a \in \mathcal{A}} q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t) - \frac{\lambda^*_3(t,s)}{\rho_t(s)}}{\eta} \right)$$
$$= \sum_{a \in \mathcal{A}} q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t)}{\eta} \right) \cdot \left( \sum_{a \in \mathcal{A}} q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t)}{\eta} \right) \right)^{-1} = 1.$$
Finally, inserting $\lambda^*_3$ and $\pi^*$, for $\nabla_{\rho_t(s)} L$ we obtain
$$\nabla_{\rho_t(s)} L = \sum_{a \in \mathcal{A}} \pi_t(a|s) \left( r(s,a,\mu_t) - \eta \log \frac{\pi_t(a|s)}{q_t(a|s)} + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t) + \lambda_2(t) \right) - \lambda_1(t-1,s)$$
$$= \sum_{a \in \mathcal{A}} \pi_t(a|s) \left( \eta + \lambda_2(t) + \frac{\lambda_3(t,s)}{\rho_t(s)} \right) - \lambda_1(t-1,s) \overset{!}{=} 0,$$
which implies
$$\lambda^*_1(t-1,s) = \eta + \lambda_2(t) + \eta \log \sum_{a \in \mathcal{A}} q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) - \eta + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t)}{\eta} \right).$$
We can subtract $\lambda_2(t)$ and shift the time index to obtain the soft value function $\tilde{V}_\eta(\mu,t,s)$ defined via the terminal condition $\tilde{V}_\eta(\mu,T,s) \equiv 0$ and the recursion
$$\tilde{V}_\eta(\mu,t,s) = \eta \log \sum_{a \in \mathcal{A}} q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} \tilde{V}_\eta(\mu,t+1,s')\, p(s'|s,a,\mu_t)}{\eta} \right),$$
since then, by normalization, the optimal policy for all $s \in \mathcal{S}$, $a \in \mathcal{A}$, $t \in \{0,\ldots,T-1\}$ is equivalent to
$$\pi^*_t(a|s) = \frac{q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a,\mu_t)}{\eta} \right)}{\sum_{a' \in \mathcal{A}} q_t(a'|s) \exp\left( \frac{r(s,a',\mu_t) + \sum_{s' \in \mathcal{S}} \lambda_1(t,s')\, p(s'|s,a',\mu_t)}{\eta} \right)} = \frac{q_t(a|s) \exp\left( \frac{r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} \tilde{V}_\eta(\mu,t+1,s')\, p(s'|s,a,\mu_t)}{\eta} \right)}{\sum_{a' \in \mathcal{A}} q_t(a'|s) \exp\left( \frac{r(s,a',\mu_t) + \sum_{s' \in \mathcal{S}} \tilde{V}_\eta(\mu,t+1,s')\, p(s'|s,a',\mu_t)}{\eta} \right)}.$$
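To make the derived recursion concrete, the sketch below (illustrative only: the randomly generated rewards, kernel, prior and horizon are assumptions, and the fixed mean field is again folded into the data) computes $\tilde{V}_\eta$ backwards in time together with the associated optimal policy $\pi^*$, which by construction is normalized and strictly positive, consistent with the ignored constraints:

```python
import numpy as np

S, A, T = 3, 2, 4
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(S), size=(S, A))       # p(s'|s,a,mu_t), mu-dependence suppressed
r = rng.uniform(-1.0, 1.0, size=(T, S, A))       # r(s,a,mu_t) for the fixed mean field
q = rng.dirichlet(np.ones(A), size=(T, S))       # prior policy q_t(a|s)

def kl_regularized_solution(eta):
    V = np.zeros(S)                               # terminal condition: soft value at time T is zero
    pi = np.zeros((T, S, A))
    for t in reversed(range(T)):
        adv = r[t] + P @ V                        # r(s,a,mu_t) + sum_s' p(s'|s,a,mu_t) * soft value at t+1
        m = adv.max(axis=1, keepdims=True)
        w = q[t] * np.exp((adv - m) / eta)        # stabilized numerator of pi*_t(a|s)
        pi[t] = w / w.sum(axis=1, keepdims=True)
        V = (m + eta * np.log(w.sum(axis=1, keepdims=True))).squeeze(1)   # soft value at time t
    return pi

for eta in [1.0, 0.1, 0.01]:
    pi = kl_regularized_solution(eta)
    print(f"eta={eta}: normalized: {np.allclose(pi.sum(axis=2), 1.0)}, min probability = {pi.min():.3g}")
```

As $\eta \to 0^+$ the printed minimum probability shrinks, i.e. the KL-regularized policy concentrates on the greedy actions, matching the limiting results of Section B.8.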
To obtain a recursion in $\tilde{Q}_\eta$, define
$$\tilde{Q}_\eta(\mu,t,s,a) \equiv r(s,a,\mu_t) + \sum_{s' \in \mathcal{S}} p(s'|s,a,\mu_t)\, \eta \log \sum_{a' \in \mathcal{A}} q_{t+1}(a'|s') \exp\left( \frac{\tilde{Q}_\eta(\mu,t+1,s',a')}{\eta} \right)$$
with terminal condition $\tilde{Q}_\eta(\mu,T,s,a) \equiv 0$ to obtain
$$\pi^*_t(a|s) = \frac{q_t(a|s) \exp\left( \frac{\tilde{Q}_\eta(\mu,t,s,a)}{\eta} \right)}{\sum_{a' \in \mathcal{A}} q_t(a'|s) \exp\left( \frac{\tilde{Q}_\eta(\mu,t,s,a')}{\eta} \right)},$$
which is the desired result, as $\pi^*$ fulfills all constraints and determines $\rho$ uniquely. For the uniform prior $q_t(a|s) = 1/|\mathcal{A}|$