Stochastic Stability of a Recency Weighted Sampling Dynamic
Alexander Aurell (Princeton University, ORFE) and Gustav Karreskog (Stockholm School of Economics)

September 29, 2020
Abstract
It is common to model learning in games so that either a deterministic process or a finite state Markov chain describes the evolution of play. Such processes can however produce undesired outputs, where the players' behavior is heavily influenced by the modeling. In simulations we see how the assumptions in Young (1993), a well-studied model for stochastic stability, lead to unexpected behavior in games without strict equilibria, such as Matching Pennies. The behavior should be considered a modeling artifact. In this paper we propose a continuous state space model for learning in games that can converge to mixed Nash equilibria, the Recency Weighted Sampler (RWS). The RWS is similar in spirit to Young's model, but introduces a notion of best response where the players sample from a recency weighted history of interactions. We derive properties of the RWS which are known to hold for finite state space models of adaptive play, such as the convergence to and existence of a unique invariant distribution of the process, and the concentration of that distribution on minimal CURB blocks. Then, we establish conditions under which the RWS process concentrates on mixed Nash equilibria inside minimal CURB blocks. While deriving the results, we develop a methodology that is relevant for a larger class of continuous state space learning models.
JEL: C72, C73
Keywords: evolutionary game theory, learning in games, stochastic stability, recency, mixed Nash equilibria, minimal CURB blocks

Contents (appendices)

A The basic properties of the learning process: proofs
  A.1 Exponential history
  A.2 Lipschitz continuity
  A.3 Ergodicity
  A.4 Proof of Theorem 5
B Concentration around approximate Nash equilibrium: proofs
  B.1 Unique fixed point to the expected best reply
  B.2 Global exponential stability of mean-field dynamics
  B.3 Trajectories over bounded time intervals
  B.4 Proof of Theorem 6
1 Introduction

The general setting considered in this paper is the evolution of social conventions as introduced in Young (1993). There are large populations, one for each player role, from which players are randomly drawn to play a normal form game. Before deciding which action to take, the players get access to a sample of historical interactions. The players use the sample to form beliefs about the opposite role's historical behavior, and thereafter respond to the mixed strategy induced by that sample. Once they have played, the history is updated, new players are randomly drawn from the populations, and the process is repeated with the updated history.

Figure 1: The players form expectations by sampling from historical records of interactions and then act based on those expectations. The realized play is appended to the history.

Social conventions form and evolve in many real life situations. For example, when buying a house each bidder (player) might not have participated in the exact same bidding (game) before, but has knowledge about some, but not all, previous interactions and assumes that the other bidders interacting with her will behave similarly to how bidders have historically behaved. By modeling repeated play based on historical records, as diagrammed in Figure 1, one hopes to answer questions about which actions will be taken in the long run, and therefore which stable conventions, if any, will arise. We will refer to a dynamical model for the likelihood of the interactions, interpreted as the social convention, as a learning process.

When studying the long run distribution of the (state of the) learning process it is helpful, both theoretically and numerically, if it is a Markov process converging to its unique invariant distribution.
In the original formulation of Young (1993) this is achieved by defining the state of the learning process as a vector of size m, a "finite memory" containing the m last interactions, by letting the players form beliefs by sampling k ∈ {1, . . . , m} strategies from the memory without replacement, and by assuming a small mistake probability ε > 0. However, finite memory based learning is ill-suited to answer questions about the players' behavior around mixed Nash equilibria. The approach requires complete information about the order of the history, and exhibits behavior around even simple mixed Nash equilibria that is better viewed as a modeling artifact than as a realistic description of behavior. The purpose of this paper is to define a new learning process with the following features: firstly, it converges to some minimal CURB configuration and secondly, it behaves reasonably also inside minimal CURB blocks and around mixed Nash equilibria. A subset (block) of strategy profiles C is called Closed Under Rational Behavior (CURB) if the best replies to any strategy profile with support in C are also in C. It is called a minimal CURB block if it does not contain any strictly smaller CURB block (Basu and Weibull, 1991).

          1         2
1       1, −1    −1, 1
2      −1, 1      1, −1

Table 1: Matching Pennies. The unique Nash equilibrium is the fifty-fifty randomization for both players.

To better understand the limitations of the standard finite memory learning process, consider perhaps the simplest normal form game with a unique mixed Nash equilibrium: Matching Pennies, presented in Table 1. Consider the case where the length of the history is m = 9, and both players sample the whole history and play without a mistake, i.e., k = m and ε = 0. Assume that the history contains, reading from the oldest to the latest entry, four interactions where both players took action 1, followed by five interactions where both took action 2. The row player will then take action 2 and the column player action 1.
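The bookkeeping in this example is simple enough to simulate directly. Below is a minimal sketch of the finite-memory process just described (Matching Pennies with m = 9, k = m, ε = 0), starting from the history above; the function name and the encoding of actions as 1 and 2 are our own:

```python
from collections import Counter, deque

def young_step(history):
    """One period of Young's finite-memory process on Matching Pennies
    with full sampling (k = m) and no mistakes (epsilon = 0)."""
    col_counts = Counter(c for _, c in history)  # column player's past actions
    row_counts = Counter(r for r, _ in history)  # row player's past actions
    # The row player matches the column majority; the column player
    # mismatches the row majority (m = 9 is odd, so no ties occur).
    row_action = 1 if col_counts[1] > col_counts[2] else 2
    col_action = 2 if row_counts[1] > row_counts[2] else 1
    history.append((row_action, col_action))
    history.popleft()  # the oldest interaction falls out of the memory
    return row_action, col_action

# Oldest to latest: four periods of (1, 1) followed by five of (2, 2).
history = deque([(1, 1)] * 4 + [(2, 2)] * 5)
plays = [young_step(history) for _ in range(30)]
print(plays)
```

Starting from this history, the simulation reproduces the pattern described next in the text: five periods of (2, 1), then five periods of (1, 1), and so on around the cycle.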
However, since the interaction that falls out of the history is one where the column player played 1, the sample to which the row player responds will not change until the 1:s in the beginning of the history have all fallen out and the first interaction with a 2 falls out of the history. At that point, the history contains five interactions where the column player played 1, so now the row player wants to play 1 as well. However, by now all the interactions in the history are such that the row player played 2. So for the coming five interactions they will both take action 1.

The behavior in the next period depends as much on what falls out of the history as on what is added, generating a cycling behavior. The cycling behavior does not only happen in this special case but is a general feature observed when simulating finite memory based learning processes, see Figure 2 for another example.

Figure 2: A 10 000 period simulation of Young's finite memory learning process on Matching Pennies with m = 1000, k = 20, ε = 0.05. Initiated at the mixed Nash equilibrium.

To address the problem of unwanted cycling and to increase stability of social conventions we introduce a new learning process, the
Recency Weighted Sampler (RWS). It differs from previous work in the structure of the historical record of plays. The history is assumed to be infinite, but more recent interactions are more likely to be sampled. A total of k samples are drawn with replacement by each player at each period. The probability of sampling the interaction of a past game decreases geometrically with a factor β, 0 < β < 1.

Figure 3: A simulation of the RWS on Matching Pennies with k = 20 and ε = 0.05. Initiated at the corner (1,1).
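For comparison with the finite-memory simulation above, the RWS itself is equally short to simulate using the state recursion p ← βp + (1 − β)e_s that is formalized in Section 2 (equation (2)). The sketch below is our own, with parameter values matching the figure, and with uniform tie-breaking and ε-trembles as in the model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Matching Pennies: the row player wants to match, the column player to mismatch.
A1 = np.array([[1.0, -1.0], [-1.0, 1.0]])  # row payoffs, A1[own action, opp action]
A2 = -A1                                   # column payoffs (zero sum, symmetric matrix)

def sampled_best_reply(A, p_opp, k=20, eps=0.05):
    """Draw k plays from the recency-weighted frequencies p_opp, best-respond
    to the empirical mixture, and tremble uniformly with probability eps."""
    if rng.random() < eps:
        return int(rng.integers(A.shape[0]))
    counts = rng.multinomial(k, p_opp / p_opp.sum())  # renormalize against float drift
    payoffs = A @ (counts / k)
    best = np.flatnonzero(np.isclose(payoffs, payoffs.max()))
    return int(rng.choice(best))  # uniform tie-breaking among best replies

def simulate(T=5000, beta=0.99, k=20, eps=0.05):
    p1, p2 = np.array([1.0, 0.0]), np.array([1.0, 0.0])  # the corner (1, 1)
    path = []
    for _ in range(T):
        s1 = sampled_best_reply(A1, p2, k, eps)
        s2 = sampled_best_reply(A2, p1, k, eps)
        p1 = beta * p1 + (1 - beta) * np.eye(2)[s1]  # state recursion, cf. (2)
        p2 = beta * p2 + (1 - beta) * np.eye(2)[s2]
        path.append((p1, p2))
    return path

path = simulate()
```

Plotting the first coordinates of the path shows the state hovering around the mixed equilibrium (1/2, 1/2) rather than cycling between corners.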
Already in his dissertation, John Nash gave a second interpretation of the Nash equilibrium, the mass action interpretation (Nash, 1950). He assumes that a large population is associated with each player role, that one player per role is selected in each period to play the game, and that the individual players accumulate empirical information on the relative advantage of the different available pure strategies. He then argues, informally, that in such a setting the stable points correspond to Nash equilibria and those points should eventually be reached by the process.

The mass action interpretation is appealing since its assumptions about bounded rationality and repeated interactions are more credible than those underlying the rationalistic interpretation built on assumptions of perfect rationality and common knowledge. (Especially since perfect rationality and common knowledge by themselves only lead to rationalizability, not all the way to Nash equilibrium.) Furthermore, experimental evidence clearly favors some kind of learning and adjustment over the rationalistic motivation. The general result is that in a one-shot interaction, play rarely corresponds to a Nash equilibrium, but if the players have a chance to learn and adjust, play often (but far from always) moves to a Nash equilibrium. See (Camerer, 2003, Ch. 6) for an overview of experimental models and results.

Appealing as the motivation might be, the theoretical picture has turned out to be considerably more complicated than indicated by Nash's informal argument.
In Section 2 the proposed learning process, the Recency Weighted Sampler, is formalized and we introduce the tools we need to analyze the process. Since we define a framework different from existing models (most crucially, the RWS has a continuous state space) we cannot rely directly on any existing results, and we therefore begin by proving some standard properties of the learning process. We prove weak convergence for a class of learning processes, of which the RWS is a member, to their respective unique invariant distributions. Following that, we show that in the limit as the error rate tends to zero, ε →
0, the invariant distribution of the RWS will concentrate on the minimal CURB blocks of the game. Once we have recovered these crucial results, we turn to the question of behavior inside minimal CURB blocks that are non-singleton, and show that for any generic game where the minimal CURB blocks are at most 2 × 2, the process spends most of the time, when ε is small, close to some point on the k-grid spanning the simplices which is also close to the Nash equilibrium. Proofs have been appended at the end of the paper.

2 The Recency Weighted Sampler

We consider a two-player finite game G, iteratively played by two new players drawn from large populations. The game has two asymmetric player roles, 1 and 2. The sets of pure strategies in the game are S_1 and S_2, containing m_1 ∈ N and m_2 ∈ N pure strategies respectively; the spaces of mixed strategies are thus ∆(S_1) and ∆(S_2). Throughout the paper, −i denotes the index {1, 2}\{i}, i ∈ {1, 2}. For σ ∈ ∆(S_{−i}), we denote by BR_i(σ) ⊂ S_i the set of best replies of player i to the mixed strategy σ. We identify ∆(S_i) with the (m_i − 1)-dimensional unit simplex and write □(S) := ∆(S_1) × ∆(S_2), □(S) being endowed with the usual uniform distance ‖·‖. We denote by B(□(S)) and P(□(S)) the Borel σ-field over □(S) and the set of probability measures over □(S), respectively.

2.1 The stochastic best reply of the RWS

Each interaction is recorded as a pair (s_1, s_2), with s_1 ∈ S_1 and s_2 ∈ S_2 the strategies played by each player. Denoting by s_1(t) and s_2(t) the strategies played at time t, the history is thus a sequence of plays ((s_1(t), s_2(t)))_{t∈Z}. Notice that for t <
0, the history is just some infinite history, coding for fictional plays for the purposes of our mechanisms.

At each time t, each player of role i ∈ {1, 2} samples k ∈ N plays (with replacement) from the history of the opposing player role −i. Each sample is drawn independently, and samples are drawn with bias towards more recent plays in a geometric fashion. Namely, players of role i have a bias β ∈ [0, 1), the recency parameter, such that at time t the probability of selecting the time period t − τ, τ ∈ {1, 2, . . .}, is (1 − β)β^{τ−1}. Therefore, a play of the strategy s ∈ S_{−i} will be sampled by player i with total probability

p_{−i,s}(t) = (1 − β) Σ_{τ=1}^∞ β^{τ−1} 1_s(s_{−i}(t − τ)),

where 1_s is the indicator function on s. We will call p_i(t) := (p_{i,1}(t), . . . , p_{i,m_i}(t)) the state process of player role i at time t, and p(t) := (p_1(t), p_2(t)) the state process or the learning process, interchangeably. The vector p_{−i}(t) ∈ ∆(S_{−i}) collects the sampling probabilities obtained by player i from player −i's history. The result of player i's sampling is a random vector (n_{−i,1}(t), . . . , n_{−i,m_{−i}}(t)) of integers, multinomially distributed with parameters k and p_{−i}(t). For s ∈ S_i, let e_{i,s} ∈ ∆(S_i) be the unit vector representing the pure strategy s ∈ S_i, i.e., a vector of size m_i with 0 everywhere except at position s, where it is 1. From her sample, player i forms an empirical (average) opposing strategy profile

D_{−i}(t) := (1/k) Σ_{s=1}^{m_{−i}} n_{−i,s}(t) e_{−i,s} ∈ ∆(S_{−i}). (1)

Player i now deems that her opponent will play at turn t according to the mixed strategy D_{−i}(t) and tries to play the best response to it. However, player i can make a mistake. Player i's error parameter (or mistake frequency) ε ∈ [0,
1] indicates the probability that she will fail to play a strategy in BR_i(D_{−i}(t)), and instead play a strategy in S_i at random (with uniform probability). If BR_i(D_{−i}(t)) is not a singleton, the realized action is sampled uniformly from all the elements of BR_i(D_{−i}(t)). We denote the outcome of the uniform sampling among all best replies to x ∈ ∆(S_{−i}) by B̂R_i(x) ∈ S_i. The distinction we want to emphasize with this notation is that BR_i(x) is set-valued (the set of all best replies to x) while B̂R_i(x) is S_i-valued and random (one of the best replies has been randomly selected).

In the end, player i will play B̂R_i(D_{−i}(t)), with D_{−i}(t) obtained as described above, with probability 1 − ε; and additionally, play any strategy s ∈ S_i with probability ε/m_i. We complete this section by calling B̃R_i(p_{−i}) ∈ S_i the random choice of strategy obtained by a player i through the following process:

1. Looking at a history where plays of strategies by the opposing role get sampled with probabilities given by p_{−i};
2. Sampling k of them to form the belief D_{−i} ∈ ∆(S_{−i});
3. Actually playing the best response B̂R_i(D_{−i}), except in a fraction ε of the time when a randomly selected strategy is played.

At t = 0, an initial history ((s_1(u), s_2(u)))_{u∈Z_−}, s_i(u) ∈ S_i, is given. At each time t ∈ N, two new individuals are assigned to the roles. They use the same values of the parameters k, β, and ε. After sampling from the history with recency parameter β, they play s_i(t) = B̃R_i(p_{−i}(t)), i = 1,
2, where p_{−i}(t) is exactly the historical distribution of plays with recency bias. The realized strategy profile (s_1(t), s_2(t)) is appended to the history, and the procedure restarts. The exponential nature of sampling leads to the following characterization of the RWS learning process.

Proposition 1.
The state process of player i, p_i(t) ∈ ∆(S_i), obeys the equation

p_i(t + 1) = β p_i(t) + (1 − β) e_{i,s_i(t)}, (2)

where s_i(t) = B̃R_i(p_{−i}(t)) is drawn randomly according to the model.

The order of historical plays is not necessary to characterize the model; all the relevant information is captured by (p_1(t), p_2(t)) ∈ □(S). From the position (p_1(t), p_2(t)) ∈ □(S), at most m_1 m_2 different points (p_1(t + 1), p_2(t + 1)) may be reached. Conditioned on p(t), for any s_1 ∈ S_1 and s_2 ∈ S_2 the point

( β p_1(t) + (1 − β) e_{1,s_1}, β p_2(t) + (1 − β) e_{2,s_2} )

is reached if s_1(t) = s_1 and s_2(t) = s_2, which happens with probability

∏_{i=1}^{2} P( B̃R_i(p_{−i}(t)) = s_i | p_{−i}(t) ),

since players sample independently, and

P( B̃R_i(p_{−i}(t)) = s_i ) = (1 − ε) P( B̂R_i(D_{−i}(t)) = s_i ) + ε/m_i,

where D_{−i}(t) ∈ ∆(S_{−i}) is a multinomial combination of strategies (with parameters k and p_{−i}(t)).

By construction, (p(t); t ∈ N) is a Markov chain taking values in □(S). Since the state space is the continuous set □(S), the chain's transition kernel is a function P : □(S) × B(□(S)) → R with the standard Markov kernel properties. The kernel takes a tuple (x, B) and returns the probability of the chain transitioning from x into B in one period. The kernel is the continuous state space equivalent of the transition matrix in models with a discrete state space. P is given as the following Markovian kernel: for all (p_1, p_2) ∈ □(S) and B ∈ B(□(S)),

P((p_1, p_2), B) = Σ_{s_1=1}^{m_1} Σ_{s_2=1}^{m_2} P( B̃R_1(p_2) = s_1, B̃R_2(p_1) = s_2 ) 1_B( β p_1 + (1 − β) e_{1,s_1}, β p_2 + (1 − β) e_{2,s_2} ).

Remark 2.
An underlying assumption of the RWS is that there exists a probability space (Ω, F, P) carrying all the random variables necessary for defining the learning process. The space is filtered by the natural filtration of the state process and satisfies the usual conditions. The assumption is innocent; it only requires the space to carry a countable number of independent random variables. It is in this filtered space that we subsequently study the learning process as a Markov chain.

Our first result is Theorem 3, which states conditions under which the RWS state process is uniformly ergodic. We use the theory of Markov processes for the proof; the theory can be found in, for example, Meyn and Tweedie (2012), and the proof is found in the appendices.

Theorem 3. If ε > 0 and β ∈ (1 − max{m_1, m_2}^{−1}, 1), then the Markov chain with kernel P is uniformly ergodic.

In other words, for whichever initial distribution ν ∈ P(□(S)) that p(0) is drawn from, the distribution of p(t) will converge "geometrically uniformly" as t → ∞ to the probability measure µ*_ε which is the unique solution of µ*_ε P = µ*_ε. More precisely, for every ε ∈ (0,
1] there exists a unique µ*_ε ∈ P(□(S)), c ∈ R_+, and λ ∈ (0,
1) such that for all p ≥ 1,

W_p(νP^n, µ*_ε)^p ≤ c λ^n, ν ∈ P(□(S)),

where W_p is the Wasserstein distance of order p between measures on □(S) (Villani, 2008, Def. 6.1) and c is a positive constant depending only on max_{x∈□(S)} |x| and p.

The theorem itself is more general than what is needed for the goal of this paper. The result holds for any Markov chain with a compact state space and with a dynamic of the form (2), as long as there is a positive lower bound for the probability that any strategy is played (in any state) and that this probability is Lipschitz as a function of the state. Examples of best response functions other than the one studied here for which Theorem 3 applies are the logit best reply, i.e.,

P[ B̃R_i(p) = s_i ] = exp(η π_i(s_i, p_{−i})) / Σ_{a∈S_i} exp(η π_i(a, p_{−i}))

for some η >
0, models where k itself is a random parameter, and models where only robust best responses to the sample are considered.

Before turning to the convergence to minimal CURB blocks, one minor technical detail must be resolved. A minimal CURB block is a collection of strategy profiles C_1 × C_2 ⊂ S such that the best reply to all mixed strategies in the sub-simplex spanned by those strategies is always inside the set, i.e., BR(σ) ⊂ C for all σ ∈ □(C), where □(C) := ∆(C_1) × ∆(C_2). However, since our agents only reply to samples of size k, it might be the case that the mixed strategy from the simplex that has a best reply outside a non-CURB block simply never is sampled. As a simple illustration, consider a game in which the row player has a third strategy that is a best reply only to properly mixed strategies of the opponent. With k = 1 only the best replies to pure strategies will ever be considered. If the process initially has support only on the block {1, 2} × {1, 2}, the best reply to any sample will be inside that block, even though the third strategy is the best reply to most properly mixed strategies. We could call this smaller set of blocks that are closed under best replies to any strategies on the k-lattice k-CURB blocks. In most settings, a relatively small k is enough for the k-CURB blocks to coincide with the CURB blocks. In the rest of the paper, we will speak of CURB blocks and by that mean k-CURB blocks. Alternatively, one can think of k as sufficiently large so that the notions coincide.

In what follows, we first prove that the RWS concentrates (in probability) on minimal CURB blocks for general two-player games. Then we prove the concentration of RWS paths to an approximate mixed Nash equilibrium for games with m_1 = m_2 = 2 and a unique mixed Nash equilibrium.

While proving concentration of the RWS on minimal CURB blocks we will partially rely on results for the original finite memory learning process.
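For small games, whether a block is k-CURB can be checked mechanically by enumerating all beliefs on the k-grid. The sketch below does this for an illustrative 3 × 2 game of the kind just described; the payoff numbers and helper names are our own, and strategies are indexed from 0. The row player's third strategy is the unique best reply to balanced mixed beliefs but never to a pure one, so the block of the first two strategies is 1-CURB but not 2-CURB:

```python
import itertools

import numpy as np

def k_grid(support, k, n):
    """All beliefs on the k-grid: probability vectors of length n supported
    on `support`, with entries that are integer multiples of 1/k."""
    for comp in itertools.product(range(k + 1), repeat=len(support)):
        if sum(comp) == k:
            q = np.zeros(n)
            q[list(support)] = np.array(comp) / k
            yield q

def best_replies(payoff, belief):
    values = payoff @ belief
    return set(np.flatnonzero(np.isclose(values, values.max())))

def is_k_curb(A1, A2, C1, C2, k):
    """Is the block C1 x C2 closed under best replies to k-grid beliefs?
    A1[r, c]: row payoffs; A2[r, c]: column payoffs."""
    m1, m2 = A1.shape
    rows_ok = all(best_replies(A1, q) <= set(C1) for q in k_grid(C2, k, m2))
    cols_ok = all(best_replies(A2.T, p) <= set(C2) for p in k_grid(C1, k, m1))
    return rows_ok and cols_ok

# Illustrative 3 x 2 game (our own numbers): row strategy 2 pays 2 against
# everything, so it beats the coordination strategies only on mixed beliefs.
A1 = np.array([[3.0, 0.0], [0.0, 3.0], [2.0, 2.0]])  # row payoffs
A2 = np.array([[3.0, 0.0], [0.0, 3.0], [1.0, 1.0]])  # column payoffs

print(is_k_curb(A1, A2, (0, 1), (0, 1), 1))  # closed under pure-belief best replies
print(is_k_curb(A1, A2, (0, 1), (0, 1), 2))  # the 50-50 belief breaks closure
```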
The RWS dynamics introduces some difficulties that are not present in the original model, mainly that once a strategy has been played it never truly disappears from memory but always has a positive probability of being sampled. However, the probability of sampling that strategy decreases over time as long as the strategy is not played again. A notion well-suited for the RWS is therefore the neighbourhood B_δ(C), δ >
0, of C := C_1 × C_2 ⊂ S, defined as all pairs (p_1, p_2) in □(S) such that each of the components puts at least 1 − δ probability on the block C.

Definition 4.
For all δ > 0,

B_δ(C) := { p = (p_1, p_2) ∈ □(S) | Σ_{s=1}^{m_i} p_{i,s} 1_{C_i}(s) ≥ 1 − δ, i = 1, 2 }.

Let C denote the union of all minimal CURB blocks in the game. To prove the concentration result, Theorem 5, we show that the expected time to go from B_δ(C)^c to B_δ(C) is always bounded, but the expected time spent inside B_δ(C) once entered goes to infinity as ε goes to zero. This in turn will imply that as ε goes to zero, the invariant distribution concentrates on the neighbourhood of C, the union of all minimal CURB blocks.

Theorem 5. If β ∈ (1 − max{m_1, m_2}^{−1}, 1), then for all δ > 0 it holds that as ε → 0, the invariant distribution of the Markov chain p concentrates on B_δ(C):

lim_{ε→0} µ*_ε(B_δ(C)) = 1.

2.2 Behavior inside minimal CURB

The previous section shows that as ε approaches zero, the RWS spends almost all the time inside minimal CURB blocks, possibly with rare excursions between different minimal CURB blocks. In this section, we justify that the RWS can actually concentrate on mixed Nash equilibria inside minimal CURB sets. This property is the main motivation for introducing the RWS and stands in contrast to similar learning processes.

Consider the deterministic mean-value process x,

ẋ_i(t) = E[ B̃R_i(x_{−i}(t)) ] − x_i(t), x_i(0) = p_i(0). (3)

The process in (3) is a deterministic process that can be thought of as a continuous-time evolution of the expected value of the RWS state process (2). As a consequence of Lemma 18, inside a given minimal CURB the process (3) converges to either a stable point or a stable orbit with constant distance to a stable point, at least for ε small enough.

We show in Lemma 19, found in the appendices, that for a given time horizon T, divided into N time steps of size (1 − β), T = N(1 − β), and η >
0, the probability that the RWS stays closer than η to the deterministic process x during [0, T] goes to 1 as β goes to 1. Taken together, if the deterministic process behaves well in the minimal CURB blocks of a game, we can by tuning β control the RWS and its concentration around stable points or stable orbits. The next theorem states that, for a 2 × 2 game with a unique completely mixed Nash equilibrium, the invariant distribution concentrates around the unique stationary point of (3).

Theorem 6.
Let G be a 2 × 2 normal form game with a unique completely mixed Nash equilibrium. If β > 1/2, then, for all ε, η > 0 there exists a positive constant K such that

µ*_ε( x ∈ □(S) : ‖x − n*‖_∞ ≥ η ) = o( exp( −Kη²/(1 − β) ) ),

where n* is the unique stationary point of (3).

The stationary point of (3) naturally depends on k. Under the assumptions in the theorem above, as k → ∞ the equation (ẋ_1(t), ẋ_2(t)) = (0,
0) is satisfied only by the Nash equilibrium, and we have that lim_{k→∞} n* = N*. So n* can be interpreted as an approximation of the Nash equilibrium.

The result of Theorem 6 can be extended to games of any size as long as they contain only minimal CURB blocks that are either 1 ×
1, or are 2 × 2 with a unique completely mixed Nash equilibrium.

3 Conclusions and outlook
In this paper we have introduced a new process of adaptive play with sampling from history and recency, the RWS, and shown that it has several interesting properties. The invariant distribution of the RWS, which is a Markov process, concentrates on minimal CURB blocks as the mistake probability ε goes to zero. So in the long run, the RWS will almost always be inside a minimal CURB, perhaps with rare transitions between them. While the process is inside a given minimal CURB, the deterministic (mean) dynamics of the RWS will converge to either a stable point or a stable orbit, and the stochastic RWS state process does not deviate far from it during any finite time horizon with high probability, if β is sufficiently close to 1. Combining these results we see that as ε and β approach 0 and 1, respectively, the RWS almost always is in the neighbourhood of a stable point or a stable orbit inside a minimal CURB. Furthermore, since the sampling best reply function we consider is continuous, this implies that if the state p(t) is close to some stable point, then so is play.

For 2 × 2 minimal CURB blocks, the stable point is an approximation of the mixed Nash equilibrium that improves as k grows. For games with minimal CURB blocks larger than 2 ×
2, the picture is more complicated, and it is beyond the scope of this paper to completely map it out. However, for small to intermediate k the RWS behaves well, at least numerically, when other learning dynamics do not. Consider the unstable rock paper scissors game, see Table 2, studied in e.g. Benaïm, Hofbauer and Hopkins (2009).

        R        P        S
R     0, 0    −3, 1    1, −3
P    1, −3     0, 0    −3, 1
S    −3, 1    1, −3     0, 0

Table 2: The unstable Rock Paper Scissors game. The unique Nash equilibrium is (1/3, 1/3, 1/3).

Classical learning processes such as fictitious play or reinforcement learning circle the Nash equilibrium in a stable cycle. In Figure 4 we compare the performance of the RWS with k = 20 and fictitious play with recency. The RWS remains close to the equilibrium over time, even in this unstable game, while the fictitious play dynamic circles the equilibrium. When k is larger the RWS behaves as fictitious play with recency. This is expected: as k grows, the sampled beliefs (D_1, D_2), see (1), become more and more similar to the sampling probabilities by the law of large numbers.

Figure 4: Simulations of behavior in the Unstable Rock Paper Scissors game. (a) RWS with ε = 0 and k = 20. (b) Fictitious play with recency.
Left: RWS with a low k-value and no noise. Right: fictitious play with recency.

References
Aurell, Alexander.
Balkenborg, Dieter, Josef Hofbauer, and Christoph Kuzmics. 2013. "Refined best reply correspondence and dynamics." Theoretical Economics, 8(1): 165–192.
Basu, Kaushik, and Jörgen W. Weibull. 1991. "Strategy subsets closed under rational behavior." Economics Letters, 36(2): 141–146.
Benaïm, Michel, and Jörgen W. Weibull. 2003. "Deterministic approximation of stochastic evolution in games." Econometrica, 71(3): 873–903.
Benaïm, Michel, and Morris W. Hirsch. 1999. "Mixed equilibria and dynamical systems arising from fictitious play in perturbed games." Games and Economic Behavior, 29(1-2): 36–72.

Benaïm, Michel, Josef Hofbauer, and Ed Hopkins. 2009. "Learning in games with unstable equilibria." Journal of Economic Theory, 144(4): 1694–1709.
Block, Juan I., Drew Fudenberg, and David K. Levine. 2019. "Learning dynamics with social comparisons and limited memory." Theoretical Economics, 14(1): 135–172.
Brown, George W. 1951. "Iterative solution of games by fictitious play." Activity Analysis of Production and Allocation, 13(1): 374–376.
Camerer, Colin F. 2003. Behavioral Game Theory: Experiments in Strategic Interaction. Princeton University Press.
Ellison, Glenn. 2000. "Basins of attraction, long-run stochastic stability, and the speed of step-by-step evolution." Review of Economic Studies, 67(1): 17–45.
Folland, Gerald B. 1999. Real Analysis: Modern Techniques and Their Applications. Vol. 40, John Wiley & Sons.
Foster, Dean P., and H. Peyton Young. 2003. "Learning, hypothesis testing, and Nash equilibrium." Games and Economic Behavior, 45(1): 73–96.
Fudenberg, Drew, and David M. Kreps. 1993. "Learning mixed equilibria." Games and Economic Behavior, 5: 320–367.
Fudenberg, Drew, David K. Levine, et al. 2014. "Recency, consistent learning, and Nash equilibrium." Proceedings of the National Academy of Sciences, 111: 10826–10829.
Fudenberg, Drew, and David K. Levine. 1998. The Theory of Learning in Games. Vol. 2, MIT Press.
Hart, Sergiu, and Andreu Mas-Colell. 2006. "Stochastic uncoupled dynamics and Nash equilibrium." Games and Economic Behavior, 57(2): 286–303.
Hofbauer, Josef, and William H. Sandholm. 2002. "On the global convergence of stochastic fictitious play." Econometrica, 70(6): 2265–2294.
Hurkens, Sjaak. 1995. "Learning by forgetful players." Games and Economic Behavior, 11(2): 304–329.
Kreindler, Gabriel E., and H. Peyton Young. 2013. "Fast convergence in evolutionary equilibrium selection." Games and Economic Behavior, 80: 39–67.
Meyn, Sean P., and Richard L. Tweedie. 2012. Markov Chains and Stochastic Stability. London: Springer-Verlag.
Nash, John. 1950. Non-Cooperative Games. PhD dissertation, Princeton University.
Ritzberger, Klaus, and Jörgen W. Weibull. 1995. "Evolutionary selection in normal-form games." Econometrica, 63(6): 1371.

Sandholm, William H. 2010. Population Games and Evolutionary Dynamics. MIT Press.
Shapley, Lloyd. 1964. "Some topics in two-person games." Advances in Game Theory, 52: 1–29.
Slotine, Jean-Jacques E., and Weiping Li. 1991. Applied Nonlinear Control. Vol. 199, Prentice Hall, Englewood Cliffs, NJ.
Villani, Cédric. 2008. Optimal Transport: Old and New. Vol. 338, Springer Science & Business Media.
Weibull, Jörgen W. 1995. Evolutionary Game Theory. MIT Press.
Young, H. Peyton. 1993. "The evolution of conventions." Econometrica, 61(1): 57–84.
Young, H. Peyton. 1998. Individual Strategy and Social Structure: An Evolutionary Theory of Institutions. Princeton University Press.
Young, H. Peyton, and Dean P. Foster. 2006. "Regret testing: learning to play Nash equilibrium without knowing you have an opponent." Theoretical Economics, 1(3): 341–367.
A The basic properties of the learning process: proofs
A.1 Exponential history
Let us prove Proposition 1. Starting from the definition, we have

p_{−i,s}(t + 1) = (1 − β) Σ_{τ=1}^∞ β^{τ−1} 1_s(s_{−i}(t − τ + 1)).

After the index substitution v = τ − 1, splitting off the term v = 0 yields

p_{−i,s}(t + 1) = (1 − β) ( 1_s(s_{−i}(t)) + Σ_{v=1}^∞ β^v 1_s(s_{−i}(t − v)) ).

In other words,

p_{−i,s}(t + 1) = (1 − β) β Σ_{v=1}^∞ β^{v−1} 1_s(s_{−i}(t − v)) + (1 − β) 1_s(s_{−i}(t)).

We recognize the first term as β p_{−i,s}(t), so we are left, for every s ∈ S_{−i}, with

p_{−i,s}(t + 1) = β p_{−i,s}(t) + (1 − β) 1_s(s_{−i}(t)),

which is the representation we seek.

A.2 Lipschitz continuity

Lemma 7.
For all k ∈ N, i ∈ {1, 2}, and a ∈ {1, . . . , m_i}, the map ∆(S_{−i}) ∋ p ↦ P( B̃R_i(p) = a ) is Lipschitz continuous with Lipschitz coefficient at most (1 − ε) k m_{−i}.

Proof. At the beginning there is a sample with respect to the probabilities p, yielding a random vector N := (n_{−i,1}(t), . . . , n_{−i,m_{−i}}(t)) of integers from the (discrete) probability distribution

P( N = (n_1, . . . , n_{m_{−i}}) ) = k! ∏_{j=1}^{m_{−i}} p_j^{n_j} / n_j!.

Each N will lead to an empirical opposing strategy profile D, which must belong to the finite 'simplex grid'

∆^{(−i,k)} := { (1/k) Σ_{s∈S_{−i}} n_s e_{−i,s} ; n_s ∈ N, Σ_{s∈S_{−i}} n_s = k }.

Now let us form m_i subsets of ∆^{(−i,k)} (which is finite), named ∆^{(−i,k)}_s for s ∈ S_i, where x ∈ ∆^{(−i,k)}_s whenever s ∈ BR_i(x). Note that (∆^{(−i,k)}_s)_s is not a disjoint cover of ∆^{(−i,k)} except in the special case when each x ∈ ∆^{(−i,k)} has a unique best response. Also, ∪_s ∆^{(−i,k)}_s = ∆^{(−i,k)} since the best response set is never empty.

For a ≤ m_i, the probability that a is going to be played is thus obtained as follows:

• If player i trembles, which happens a fraction ε of the time, strategy a is played with probability 1/m_i, totalling ε/m_i.
• Otherwise the player selects her best response, so it will be a with probability P( D ∈ ∆^{(−i,k)}_a, B̂R_i(D) = a ).

In short,

P( B̃R_i(p) = a ) = ε/m_i + (1 − ε) Σ_{x ∈ ∆^{(−i,k)}_a} P( B̂R_i(x) = a ) P(D = x). (4)

However, D = x is an event of the shape N = (n_1, . . . , n_{m_{−i}}), so considering P(D = x) as a function of p_1, . . . , p_{m_{−i}}, we get

∂P( N = (n_1, . . . , n_{m_{−i}}) )/∂p_b = k! ( p_b^{n_b − 1} / (n_b − 1)! ) ∏_{j≠b} p_j^{n_j} / n_j!.

With respect to the norm ‖·‖_∞ over ∆(S_{−i}), the Lipschitz constants of the probabilities P( D ∈ ∆^{(−i,k)}_a ) are therefore at most

Σ_{b=1}^{m_{−i}} | ∂P( D ∈ ∆^{(−i,k)}_a )/∂p_b | ≤ Σ_{b=1}^{m_{−i}} Σ_{x ∈ ∆^{(−i,k)}_a} k! ( p_b^{n_b − 1} / (n_b − 1)! ) ∏_{j≠b} p_j^{n_j} / n_j!.

However, we know that

Σ_{x ∈ ∆^{(−i,k)}} ( p_b^{n_b − 1} / (n_b − 1)! ) ∏_{j≠b} p_j^{n_j} / n_j! = 1 / (k − 1)!,

as this is the multinomial formula for k − 1 samples. Since ∆^{(−i,k)}_a ⊂ ∆^{(−i,k)}, the Lipschitz constant of P( D ∈ ∆^{(−i,k)}_a ) is at most

Σ_{b=1}^{m_{−i}} k! / (k − 1)! = k m_{−i}.

Bounding P( B̂R_i(x) = a ) in (4) by 1, the Lipschitz constant of p ↦ P( B̃R_i(p) = a ) is at most (1 − ε) k m_{−i}.

A.3 Ergodicity
The proof of ergodicity relies on standard Markov chain theory and a positive lower bound for the probability that the chain, initiated at any point in $\square(S)$, visits any open set in $\square(S)$ after a finite number of time steps. To prove the lower bound, we first establish the intermediate result Lemma 8. It is assumed throughout this section that $\varepsilon > 0$.

A.3.1 Approximative history
For $i \in \{1,2\}$, $j \in \{1,\dots,m_i\}$, and $t \in \mathbb{N}$, let $\omega_{i,j,t} := \mathbb{1}_j(s_i(t))$ be the indicator of a play of $j$ by player $i$ at time $t$, so that
\[
p_{i,j}(t) = (1-\beta) \sum_{\tau=1}^{\infty} \beta^{\tau-1} \omega_{i,j,t-\tau}.
\]
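The recency-weighted frequencies above are straightforward to compute from a finite record of plays; a minimal sketch (the helper name `recency_weights` is illustrative, and the geometric tail preceding the recorded history is simply truncated):

```python
def recency_weights(history, m, beta):
    """Recency-weighted empirical frequencies (a sketch; the helper
    name is illustrative).  `history` lists pure strategies with the
    most recent play last; the tail before the recorded history is
    truncated, so the entries sum to 1 - beta**len(history)."""
    p = [0.0] * m
    for tau, s in enumerate(reversed(history), start=1):
        # a play tau periods ago carries weight (1 - beta) * beta**(tau - 1)
        p[s] += (1 - beta) * beta ** (tau - 1)
    return p
```

For instance, with $\beta = 1/2$ and the record $(0, 1, 0)$ (most recent play last), the weights $1/2, 1/4, 1/8$ are assigned to the plays in reverse time order, giving $p = (0.625, 0.25)$.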
We will call $\Sigma^{(i)} \subset \{0,1\}^{m_i \times \mathbb{N}}$ the set of binary arrays, indexed by $s \in \{1,\dots,m_i\}$ and $t \in \mathbb{N}$, such that for every $t$ there is exactly one $s$ with $\Sigma^{(i)}_{s,t} = 1$. In other words, $\Sigma^{(i)}$ represents a possible history for player $i$, where a 1 at entry $(s,t)$ indicates that $s$ was played at time $t$. Likewise, for $N \in \mathbb{N}$, we will call $\Sigma^{(i,N)} \subset \{0,1\}^{m_i \times N}$ the set of binary arrays indexed by $s \in \{1,\dots,m_i\}$ and $t \in \{1,\dots,N\}$ obeying the same condition; in other words, the histories up to time $N$.

Let $p_i \in \Delta(S_i)$. We are going to exhibit a sequence of plays of finite length $N$, i.e., an $\omega \in \Sigma^{(i,N)}$ for some $N \in \mathbb{N}$, such that the partial sum
\[
p^{(N)}_{i,j} := (1-\beta) \sum_{\tau=1}^{N} \beta^{\tau-1} \omega_{i,j,\tau}
\]
falls close to $p_i$. Namely, we want to prove the following.

Lemma 8.
Let $p_i \in \Delta(S_i)$ and $\delta > 0$, and assume that $(1-\beta)m_i \le 1$. There exists an $N(\delta) \in \mathbb{N}$, independent of $i$ and $p_i$, such that for each $N \ge N(\delta)$ there is a history $\omega^{(N)}_i \in \Sigma^{(i,N)}$ which satisfies
\[
p^{(N)}_{i,j} = (1-\beta) \sum_{\tau=1}^{N} \beta^{\tau-1} \omega^{(N)}_{i,j,\tau} \in \big(\max\{p_{i,j} - \delta,\, 0\},\, p_{i,j}\big] \tag{5}
\]
for all $j \in \{1,\dots,m_i\}$.

Proof. The following algorithm provides a proof of Lemma 8. Start by setting $p^{(0)}_{i,j} = 0$ for all $j = 1,\dots,m_i$, and let $\omega^{(0)}_i$ be the empty array of dimensions $0$ and $m_i$. Define $N(\delta)$ as the smallest $N \in \mathbb{N}$ such that $\beta^N < \delta$, i.e., $N(\delta) := \inf\{N \in \mathbb{N} : \beta^N < \delta\}$. For $t \in \{1,\dots,N(\delta)\}$, repeat the following steps:

1. Find the indices $j \in \{1,\dots,m_i\}$ for which $p_{i,j} - p^{(t-1)}_{i,j}$ is maximal, and call any one of these indices $a$.
2. Append $\vec{e}_a$ to $\omega^{(t)}_i$, so that $\omega^{(t)}_{i,a,t} = 1$ and $\omega^{(t)}_{i,j,t} = 0$ for $j \ne a$.
3. Compute $p^{(t)}_{i,j}$ according to (5) and the updated history $\omega^{(t)}_i$.

Return the final history $\omega^{(N(\delta))}_i$ and values $p^{(N(\delta))}_{i,j}$.

We are going to prove inductively that for every $t \in \mathbb{N}$, we always have
\[
p^{(t)}_{i,j} \le p_{i,j}, \quad j = 1,\dots,m_i, \tag{6}
\]
and
\[
\sum_{j=1}^{m_i} p^{(t)}_{i,j} = 1 - \beta^t. \tag{7}
\]
For $t = 0$, (6) is true since $p_{i,j}$ is non-negative and $p^{(0)}_{i,j} = 0$ for all $j = 1,\dots,m_i$, which also yields that (7) holds at $t = 0$. Now assume that (6)–(7) hold at time $t$. Since $\sum_{j=1}^{m_i} p_{i,j} = 1$, the maximal difference $\max_{1 \le j \le m_i}(p_{i,j} - p^{(t)}_{i,j})$ must be at least $\beta^t/m_i$. By definition, then, $\omega_{i,a,t+1} = 1$ for some $a \in \{1,\dots,m_i\}$ with $p_{i,a} - p^{(t)}_{i,a} \ge \beta^t/m_i$, and
\[
p^{(t+1)}_{i,a} = (1-\beta)\beta^t \omega^{(t+1)}_{i,a,t+1} + (1-\beta)\sum_{\tau=1}^{t} \beta^{\tau-1} \omega^{(t)}_{i,a,\tau} = (1-\beta)\beta^t + p^{(t)}_{i,a} \le \Big((1-\beta) - \frac{1}{m_i}\Big)\beta^t + p_{i,a} \le p_{i,a},
\]
where the last inequality uses $(1-\beta)m_i \le 1$. As for the other strategies $j \ne a$, since $p^{(t+1)}_{i,j} = p^{(t)}_{i,j}$, the inequality $p^{(t+1)}_{i,j} \le p_{i,j}$ holds, and we have proven the induction step for (6).
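The greedy construction in the algorithm above can be sketched in code (a minimal sketch; the function name `approx_history` and the parameter values are illustrative):

```python
def approx_history(p, beta, delta):
    """Greedy construction from the proof of Lemma 8 (a sketch).

    Builds a play sequence of length N(delta) whose recency-weighted
    frequencies land in (p_j - delta, p_j] for every strategy j.
    Requires (1 - beta) * len(p) <= 1, as in the lemma."""
    m = len(p)
    assert (1 - beta) * m <= 1, "the lemma's assumption on beta"
    # N(delta) = inf{N : beta^N < delta}
    N = 1
    while beta ** N >= delta:
        N += 1
    approx = [0.0] * m
    history = []
    for t in range(1, N + 1):
        # step 1: pick a strategy with the largest remaining deficit
        a = max(range(m), key=lambda j: p[j] - approx[j])
        # steps 2-3: record the play; it carries weight (1-beta)*beta^(t-1)
        history.append(a)
        approx[a] += (1 - beta) * beta ** (t - 1)
    return history, approx
```

Running the sketch with, say, $p_i = (0.5, 0.3, 0.2)$, $\beta = 0.9$, and $\delta = 0.05$ returns frequencies that undershoot each target by less than $\delta$, never overshoot, and sum to $1 - \beta^{N(\delta)}$, exactly as in (6)–(7).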
Now we also know that $p^{(t+1)}_{i,j} - p^{(t)}_{i,j} = (1-\beta)\beta^t \omega^{(t+1)}_{i,j,t+1}$, and since exactly one among the $m_i$ entries of $\omega^{(t+1)}_{i,\cdot,t+1}$ is 1, the others being zero, we have
\[
\sum_{j=1}^{m_i} \big(p^{(t+1)}_{i,j} - p^{(t)}_{i,j}\big) = (1-\beta)\beta^t.
\]
The induction hypothesis thus leads us to
\[
\sum_{j=1}^{m_i} p^{(t+1)}_{i,j} = 1 - \beta^t + (1-\beta)\beta^t = 1 - \beta^{t+1},
\]
which proves (7) by induction. In particular, by the choice of $N(\delta)$, for every $N \ge N(\delta)$ we have
\[
\sum_{j=1}^{m_i} p^{(N)}_{i,j} > 1 - \delta,
\]
while $p^{(N)}_{i,j} \le p_{i,j}$ for every $j$. Since $\sum_j p_{i,j} = 1$, this is possible only if $p^{(N)}_{i,j} > p_{i,j} - \delta$ for each $j = 1,\dots,m_i$, which yields the result.

A.3.2 A useful lower bound

Let $x = (x_1, x_2) \in \square(S)$. We apply Lemma 8 with $\delta = \varepsilon$ to $x_1$ and $x_2$, yielding play records $\omega^{(N(\varepsilon))}_1$ and $\omega^{(N(\varepsilon))}_2$, and values $p^{(N)}_{i,j}$ such that for every $1 \le j \le m_i$ and $N \ge N(\varepsilon)$,
\[
p^{(N)}_{i,j} \in \big(\max\{x_{i,j} - \varepsilon,\, 0\},\, x_{i,j}\big].
\]
We know that for any $B \in \mathcal{B}(\square(S))$,
\[
P(x, B) = \sum_{s_1=1}^{m_1} \sum_{s_2=1}^{m_2} \sigma(x, s)\, \mathbb{1}_B(\Gamma(x, s)),
\]
and $\sigma(x, s)$ is uniformly bounded from below by $\eta = \varepsilon/(m_1 m_2) > 0$. The records $\omega^{(N)}_1$ and $\omega^{(N)}_2$ up to time $N$ from the previous lemma are now read in reverse time order. At each time step $t \in \{0,\dots,N-1\}$, there is a probability of at least $\eta$ that player 1 chooses the strategy $1 \le a \le m_1$ given by $\omega^{(N)}_{1,a,N-t} = 1$, and player 2 chooses the strategy $1 \le b \le m_2$ given by $\omega^{(N)}_{2,b,N-t} = 1$. Therefore, the plays up to time $N$ follow $\omega^{(N)}_1$ and $\omega^{(N)}_2$ with probability at least $\eta^N > 0$.
When this happens, thanks to Proposition 1, a history started at $(p_1(0), p_2(0)) = p$ will now be at the position
\[
(p_1(N), p_2(N)) = \sum_{t=1}^{N} \Big( (1-\beta)\beta^{t-1} \omega^{(N)}_{1,\cdot,t},\ (1-\beta)\beta^{t-1} \omega^{(N)}_{2,\cdot,t} \Big) + \big(\beta^N p_1(0),\ \beta^N p_2(0)\big), \tag{8}
\]
with probability greater than or equal to $\eta^N$. By Lemma 8, the choice of the records $\omega^{(N)}_1$ and $\omega^{(N)}_2$ makes the $j$:th component of the sum on the right-hand side of (8) for player $i$ take a value in $(\max\{x_{i,j} - \varepsilon, 0\},\, x_{i,j}]$. As we also have $\beta^N < \varepsilon$ and $p_{i,j}(0) \le 1$, it follows that $p_{i,j}(N) \in (x_{i,j} - \varepsilon,\, x_{i,j} + \varepsilon)$. We conclude that for all $N \ge N(\varepsilon)$,
\[
P(|p(N) - x| < \varepsilon) \ge \eta^N. \tag{10}
\]
In other words, the point $y = (p_1(N), p_2(N))$, which lies in an $\varepsilon$-neighbourhood of $x$, is accessible from $p$ in $N$ steps.

A.3.3 Proof of uniform ergodicity
The path to uniform ergodicity goes through proving that the chain is a so-called T-chain. For the Markov chain concepts used below, we follow the definitions of Meyn and Tweedie (2012).

Lemma 9. The Markov chain $p$ is a T-chain.

Proof. From (10) we know that for all rectangles $R = R_1 \times R_2 \in \mathcal{B}(\square(S))$ and $N \ge N(\varepsilon_R)$,
\[
P^N(x, R) \ge \eta^N, \quad x \in \square(S), \tag{11}
\]
where $\varepsilon_R = \min\{\lambda(R_1), \lambda(R_2)\}$ with $\lambda$ the Lebesgue measure, $N(\varepsilon_R) = \inf\{N \in \mathbb{N} : \beta^N < \varepsilon_R\}$, and $\eta = \varepsilon/(m_1 m_2)$. Note that $\eta < 1$, so if $\varepsilon_R = 0$, which implies that $N(\varepsilon_R) = \infty$, then $\eta^{N(\varepsilon_R)} = 0$; hence the estimate also covers degenerate rectangles.

Let $O$ be an open subset of $\square(S)$. Then $O$ can be written as a countable union of almost disjoint (their boundaries may overlap) closed rectangles $(R_j)_{j=1}^{\infty}$, $R_j = R_{j,1} \times R_{j,2}$. Since $O$ is open, at least one of the rectangles in the cover must have a nonempty interior. The probability of reaching $O$ from a point $x \in \square(S)$ is bounded from below by the sum of the probabilities of reaching each of the rectangles covering $O$ when starting the chain from $x$. Hence we have the following lower bound:
\[
\sum_{n=0}^{\infty} P^n(x, O) \ge \sum_{j=1}^{\infty} P^{N(\varepsilon_{R_j})}(x, R_j) \ge \sum_{j=1}^{\infty} \eta^{N(\varepsilon_{R_j})} > 0, \quad x \in \square(S). \tag{12}
\]
In Lemma 10 below, we use (12) to construct a nontrivial measure $\nu_\square$ on $(\square(S), \mathcal{B}(\square(S)))$ such that
\[
\sum_{n=0}^{\infty} P^n(x, B) \ge \nu_\square(B), \quad B \in \mathcal{B}(\square(S)),\ x \in \square(S), \tag{13}
\]
hence yielding that $\square(S)$ is a $\nu_\square$-petite set. By (12), all open sets are uniformly accessible from any subset of $\square(S)$. Since $\square(S)$ is open in the relative topology, this implies that all subsets of $\square(S)$ are petite (Meyn and Tweedie, 2012, Prop. 5.5.3). In particular, every compact set is petite, and it follows that $p$ is a T-chain (Meyn and Tweedie, 2012, Thm. 6.0.1).

Lemma 10.
There exists a nontrivial measure $\nu_\square$ on $(\square(S), \mathcal{B}(\square(S)))$ that satisfies (13).

Proof. Define $\mathcal{R}$ to be the collection of all half-open rectangles $\big(\times_{j=1}^{m_1-1}[a_{1,j}, b_{1,j})\big) \times \big(\times_{j=1}^{m_2-1}[a_{2,j}, b_{2,j})\big)$ in $\mathbb{R}^{m_1-1} \times \mathbb{R}^{m_2-1}$. Let the function $\bar\eta : \mathcal{R} \to [0, \infty]$ be given by $\bar\eta(R) = \eta^{N(\varepsilon_R)}$ (clearly, if $R \cap \square(S) = \emptyset$ then $\bar\eta(R) = 0$). We define, for any $A \subset \mathbb{R}^{m_1-1} \times \mathbb{R}^{m_2-1}$,
\[
\nu^*(A) := \inf\Big\{ \sum_{j=1}^{\infty} \bar\eta(R_j) : R_j \in \mathcal{R},\ A \subset \cup_{j=1}^{\infty} R_j \Big\}.
\]
Then $\nu^*$ is a countably additive pre-measure on the semi-ring $\mathcal{R}$, and an outer measure on $\mathbb{R}^{m_1-1} \times \mathbb{R}^{m_2-1}$. We denote by $\nu$ the restriction of $\nu^*$ to its measurable sets. Carathéodory's extension theorem says that $\nu$ is a measure on the smallest $\sigma$-algebra containing $\mathcal{R}$, which is $\mathcal{B}(\mathbb{R}^{m_1-1} \times \mathbb{R}^{m_2-1})$ since the half-open rectangles generate the Borel $\sigma$-algebra. Furthermore, since $\nu^*$ is $\sigma$-finite, $\nu$ is the unique extension of $\nu^*$, and $\nu$ agrees with $\bar\eta$ on $\mathcal{R}$.

Let, for each $x \in \square(S)$, $\bar\eta_\Delta(x, \cdot)$ be a set function on $\mathbb{R}^{m_1-1} \times \mathbb{R}^{m_2-1}$ that satisfies
\[
\bar\eta_\Delta(x, R) = \sum_{n=1}^{\infty} \bar P^n(x, R) - \nu(R) = \sum_{n=1}^{\infty} \bar P^n(x, R) - \bar\eta(R), \quad R \in \mathcal{R},
\]
where $\bar P(x, R) := P(x, R \cap \square(S))$ extends $P$ to $\mathcal{B}(\mathbb{R}^{m_1-1} \times \mathbb{R}^{m_2-1})$. By (11), $\bar\eta_\Delta(x, \cdot)$ is non-negative for all $x \in \square(S)$. For each $x \in \square(S)$, define for any $A \subset \mathbb{R}^{m_1-1} \times \mathbb{R}^{m_2-1}$
\[
\nu^*_\Delta(x, A) := \inf\Big\{ \sum_{j=1}^{\infty} \bar\eta_\Delta(x, R_j) : R_j \in \mathcal{R},\ A \subset \cup_{j=1}^{\infty} R_j \Big\}.
\]
Then $\nu^*_\Delta(x, \cdot)$ is, for each $x \in \square(S)$, a countably additive pre-measure on $\mathcal{R}$ and an outer measure on $\mathbb{R}^{m_1-1} \times \mathbb{R}^{m_2-1}$, and with the same argument used to construct $\nu$, we construct the measures $(\nu_\Delta(x, \cdot))_{x \in \square(S)}$, $\nu_\Delta(x, \cdot)$ being the unique extension of $\nu^*_\Delta(x, \cdot)$ (the $\sigma$-finiteness follows from the definition of $\bar\eta_\Delta(x, \cdot)$; a countable sum of $\sigma$-finite measures is again $\sigma$-finite). The following fact is essentially (Folland, 1999, Theorem 1.14) and follows as a corollary to Carathéodory's extension theorem: since
\[
\nu_\Delta(x, R) = \sum_{n=1}^{\infty} \bar P^n(x, R) - \nu(R) \quad \text{for all } R \in \mathcal{R},
\]
we have
\[
\nu_\Delta(x, B) = \sum_{n=1}^{\infty} \bar P^n(x, B) - \nu(B)
\]
for all $B \in \mathcal{B}(\mathbb{R}^{m_1-1} \times \mathbb{R}^{m_2-1})$. Then, in particular,
\[
\nu_\Delta(x, B) = \sum_{n=1}^{\infty} P^n(x, B) - \nu(B), \quad B \in \mathcal{B}(\square(S)),\ x \in \square(S),
\]
from which it follows, by the non-negativity of $\nu_\Delta(x, \cdot)$, that
\[
\sum_{n=1}^{\infty} P^n(x, B) \ge \nu(B), \quad B \in \mathcal{B}(\square(S)),\ x \in \square(S).
\]
Defining $\nu_\square$ as the restriction of $\nu$ to $\mathcal{B}(\square(S))$ completes the proof.

Lemma 11. The Markov chain $p$ is open set irreducible.

Proof. A point $x \in \square(S)$ is called reachable if for every open set $O \in \mathcal{B}(\square(S))$ containing $x$,
\[
\sum_{n=1}^{\infty} P^n(y, O) > 0, \quad y \in \square(S).
\]
By (10), all $x \in \square(S)$ are reachable. A Markov chain is open set irreducible if every point is reachable.

Proposition 12.
The chain $p$ is $\psi$-irreducible.

Proof. We know that $p$ is an open set irreducible T-chain. By (Meyn and Tweedie, 2012, Prop. 6.2.1), $p$ is $\psi$-irreducible.

Remark 13.
The measure $\nu_\square$ is an irreducibility measure for $p$ (Meyn and Tweedie, 2012, Prop. 5.5.4 (ii)).

Remark 14.
The state space $\square(S)$ is a petite set but not a small set, since there are sets in the corners of $\square(S)$ which the chain needs an arbitrarily long time to reach.

We move on to showing uniform ergodicity for $p$. The argument is based on (Meyn and Tweedie, 2012, Thm. 16.2.5), which says that if $p$ is a $\psi$-irreducible and aperiodic T-chain, and if the state space $\square(S)$ is compact, then $p$ is uniformly ergodic.

Lemma 15.
The Markov chain $p$ is aperiodic.

Proof. The contrapositive of (Meyn and Tweedie, 2012, Prop. 5.4.6) says that if there exists no absorbing set for $p^{(d)}$, the chain corresponding to the transition kernel $P^d$, for any $d \ge 2$, then $p$ is aperiodic. Assume that $D$ is an absorbing set for $p^{(d)}$, that is, $\inf_{x \in D} P^d(x, D) = 1$. By (11), $D$ must contain all rectangles $R \subset \square(S)$, since $\inf_{x \in D} P^{Nd}(x, R) \ge \eta^{Nd} > 0$ for $N \ge N(\varepsilon_R)$. This implies that $D = \square(S)$ is the only absorbing set for $p^{(d)}$, and we conclude that the chain is aperiodic.

We have proven the following result:

Proposition 16.
The chain $p$ is uniformly ergodic.

Proof. This follows from $\psi$-irreducibility (Proposition 12) and aperiodicity (Lemma 15); see (Meyn and Tweedie, 2012, Thm. 16.2.5).

By (Meyn and Tweedie, 2012, Thm. 15.0.1), $\psi$-irreducibility and aperiodicity imply that $p$ has an invariant probability measure $\mu^*_\varepsilon$. By (Meyn and Tweedie, 2012, Thm. 16.0.2), uniform ergodicity of $p$ is equivalent to the existence of $r > 1$ and $R < \infty$ such that for all $x$,
\[
d_{TV}\big(P^n(x, \cdot),\, \mu^*_\varepsilon\big) \le R r^{-n},
\]
where $d_{TV}$ is the total variation distance on $\mathcal{P}(\square(S))$. Clearly, $\mu^*_\varepsilon$ is the unique invariant probability measure of $p$ (in $(\mathcal{P}(\square(S)), d_{TV})$). By (Villani, 2008, Thm. 6.18), the $p$-Wasserstein distance $W_p$ satisfies, for all $p \ge 1$ and $x \in \square(S)$,
\[
W_p^p\big(P^n(x, \cdot),\, \mu^*_\varepsilon\big) \le C R r^{-n},
\]
where $C$ depends on $p$ and $|\square(S)|$. The last inequality implies the statement of Theorem 3.

A.4 Proof of Theorem 5
Proof.
The proof consists of four steps.
Step 1. Bounding the probability of reaching $B_\delta(C)$ in finite time.

To find a lower bound for the probability of moving from an arbitrary point $p(t) \in B_\delta(C)^c$ to $B_\delta(C)$ in finite time, we construct a particular path of positive probability that does exactly that. Let $p(t) \in \square(S)$ be given and let $s^0 \in S_1 \times S_2$ be the strategy profile played in period $t$. Either $s^0$ spans a CURB block, or the best reply set to $s^0$ contains a strategy not in $s^0$, $BR(\vec{s}^{\,0}) \not\subset s^0$. If the former holds, this step of the proof is complete. Therefore, assume the latter, i.e. that the best reply set to $s^0$ contains a strategy not in $s^0$. The probability that both players sample only $s^0$ at time $t+1$ is bounded from below by $(1-\beta)^{2k}$. Hence the probability that a strategy profile $s^1 \in BR(\vec{s}^{\,0})$, $s^1 \ne s^0$, is played is bounded from below by
\[
\mathbb{P}\big(g^{BR}(p(t)) = s^1 \,\big|\, p(t)\big) \ge \frac{(1-\beta)^{2k}}{m_1 m_2}(1-\varepsilon)^2.
\]
Now let $F_1$ be the smallest block $F_1 \subset S_1 \times S_2$ that contains $\{s^0, s^1\}$. Either $F_1$ spans a CURB block, or $BR(\Delta(F_1)) \not\subset F_1$, in which case there is at least one sample $D$ of size $k$ from $F_1$ such that $BR(D) \not\subset F_1$. The probability of sampling that particular $D$, and the best replies to $D$ being such that at least one of them is not in $F_1$, is again bounded away from zero. Until we have sampled a sequence of strategy profiles, each extending the set $F_i$, such that $F_i$ spans a CURB block, there is always some sample with positive sampling probability such that $BR(D) \not\subset F_i$. The probability of playing a strategy profile $s^i$ which is a best reply to $D$ but is not in $F_i$, $s^i \in BR(D) \cap (F_i)^c$, is therefore bounded from below by
\[
\mathbb{P}\big(g^{BR}(p(t+i-1)) = s^i \,\big|\, p(t+i-1)\big) \ge \frac{\big(\beta^{i-1}(1-\beta)\big)^{2k}}{m_1 m_2}(1-\varepsilon)^2.
\]
Keep filling $F_i, F_{i+1}, F_{i+2}, \dots$ with strategies from the CURB block in this fashion, so that $F_T$ spans a CURB block and $T \le m_1 + m_2$ (Hurkens, 1995, Lemma 1).
To get a uniform lower bound, assume that $T = m_1 + m_2$ and that once $F_i$ spans a CURB block, the following $T - i$ strategy profiles lie inside the CURB block. The probability of this progression of plays is bounded from below: let $E$ be the event that $p(t+T)$ puts at most $\beta^{T+1}$ mass outside the CURB block spanned by $F_T$; then
\[
\mathbb{P}(E) \ge \frac{\big(\beta^{k}\big)^{T(T-1)}(1-\beta)^{2Tk}}{m_1^T m_2^T}(1-\varepsilon)^{2T}.
\]
Inside the CURB block spanned by $F_T$ there is a minimal CURB block, which we denote by $C = C_1 \times C_2$. The probability that both players sample from $C$ given the state $p(t+T)$ (as described above) is greater than or equal to
\[
\mathbb{P}\big((D_1/k, D_2/k) \in \square(C) \,\big|\, D \text{ from } p(t+T)\big) \ge \big(\beta^T(1-\beta)\big)^{2k}(1-\varepsilon)^2.
\]
Starting from $p(t) \in B_\delta(C)^c$, a sequence of plays that results in $p(t+T+T^*) \in B_\delta(C)$ is to play $T$ strategy profiles to fill $F_T$, followed by $T^*$ strategy profiles from the minimal CURB block $C$. Conditional on $p(t) \in B_\delta(C)^c$ and the aforementioned event $E$, the probability that $p(t+T+T^*) \in B_\delta(C)$ is bounded from below by
\[
\mathbb{P}\big((D_1, D_2)(t+T+i) \in \square(C),\ i = 0,\dots,T^*-1 \,\big|\, p(t+T) \text{ as above}\big) \ge \big(\beta^T(1-\beta)(1-\varepsilon)\big)^{2kT^*} =: \gamma(\varepsilon, T, T^*).
\]
Now $p(t+T+T^*)$ gives at most $\beta^{T^*}$ probability mass to strategy profiles outside $\square(C)$. Therefore, we pick $\delta > 0$ and let $T^* \in \mathbb{N}$ be such that $\beta^{T^*} < \delta$. Summarizing the analysis in this step, we have derived a bound, denoted by $K$, on the probability of moving from any point $p(t) \in B_\delta(C)^c$ to $B_\delta(C)$ in $T + T^*$ steps:
\[
P^{T+T^*}\big(p(t), B_\delta(C)\big) \ge \frac{\big(\beta^{k}\big)^{T(T-1)}(1-\beta)^{2Tk}}{m_1^T m_2^T}(1-\varepsilon)^{2T}\, \gamma(\varepsilon, T, T^*) =: K.
\]

Step 2. Expected exit time from $B_\delta(C)$.

Once in $B_\delta(C)$, one of two things must happen for the process to leave.
Either one player makes a mistake, or one player samples at least one strategy from outside the minimal CURB block $C$ around which the process is currently centered. So instead of calculating the time to the first exit, denoted $\tau_\varepsilon$, we calculate the expected time until one of these two things happens for the first time. Let $\tau^*_\varepsilon$ denote the time, starting from $t = 0$, until either a strategy is sampled outside $C$ or one player makes an $\varepsilon$-tremble. We denote the probability that $\tau^*_\varepsilon > t^*$, $t^* \in \mathbb{N}$, by $Q_\varepsilon(t^*)$:
\[
Q_\varepsilon(t^*) := \mathbb{P}\big(\tau^*_\varepsilon > t^* \,\big|\, p(0) \in B_\delta(C)\big) = \prod_{t=0}^{t^*} (1 - \beta^t\delta)^{2k}(1-\varepsilon)^2.
\]
For the case $\varepsilon = 0$, we use the fact that $\sum_{t=0}^{\infty} \beta^t\delta$ is convergent to conclude that $\prod_{t=0}^{\infty}(1-\beta^t\delta)^{2k}$ approaches a non-zero limit. Since $Q_\varepsilon$ is decreasing and non-negative,
\[
\lim_{t^*\to\infty} Q_\varepsilon(t^*) = \begin{cases} Q^* \in (0,1), & \text{if } \varepsilon = 0, \\ 0, & \text{if } \varepsilon > 0. \end{cases}
\]
We can now derive a bound for $\tau_\varepsilon$, the expected time to exit from $B_\delta(C)$:
\[
\mathbb{E}[\tau_\varepsilon] \ge \mathbb{E}[\tau^*_\varepsilon] \ge \mathbb{E}\big[\tau^*_\varepsilon \,\big|\, \tau^*_\varepsilon \ge t^*,\ p(0) \in B_\delta(C)\big]\, \mathbb{P}\big(\tau^*_\varepsilon \ge t^* \,\big|\, p(0) \in B_\delta(C)\big)\, \mathbb{P}\big(p(0) \in B_\delta(C)\big) \ge t^*\, Q_\varepsilon(t^*)\, \nu(B_\delta(C)),
\]
where $\nu$ is the initial distribution of the state process and $\nu(B_\delta(C))$ is the probability that $p(0) \in B_\delta(C)$. We know that the state process converges weakly to the invariant distribution for all initial distributions, and therefore $\nu$ can be any distribution on $\square(S)$ of our choice. Choosing $\nu$ as the distribution of the constructed $p(t+T+T^*)$ from above,
\[
\mathbb{E}[\tau_\varepsilon] \ge t^* \prod_{t=0}^{t^*}(1-\beta^t\delta)^{2k}(1-\varepsilon)^2 = t^*(1-\varepsilon)^{2t^*} Q_0(t^*) \ge t^*(1-\varepsilon)^{2t^*} Q^*,
\]
where $t^*$ is any positive integer. For fixed $\varepsilon$, the function $t^* \mapsto t^*(1-\varepsilon)^{2t^*}$ is maximized by $t^*(\varepsilon) = -(2\ln(1-\varepsilon))^{-1}$. There is therefore a decreasing sequence of positive numbers $(\varepsilon_j)_{j=1}^{\infty}$, tending to zero as $j \to \infty$, such that $t^*(\varepsilon_j)$ is an integer and
\[
\mathbb{E}[\tau_{\varepsilon_j}] \ge -\frac{Q^*}{2e\ln(1-\varepsilon_j)},
\]
which diverges to $\infty$ as $j \to \infty$.

Step 3. Bounding $\mu^*_\varepsilon(B_\delta(C)^c)$ from above.
We know that for any $\varepsilon > 0$ the process has a unique invariant distribution $\mu^*_\varepsilon$. We also have a lower bound for $P^{T+T^*}(x, B_\delta(C))$, uniform over $x \in B_\delta(C)^c$, and a lower bound for the expected time the process stays in $B_\delta(C)$ once it has entered. The probability given by the invariant distribution to the set $B_\delta(C)$ is at least the sum over $n$ of the probabilities of: the state process not being in it $(n+1)(T+T^*)$ steps ago, but in it $n(T+T^*)$ steps ago, and then staying there for at least $n(T+T^*)$ time steps,
\[
1 \ge \mu^*_\varepsilon(B_\delta(C)) \ge \sum_{n=0}^{\infty} \Big( \int_{B_\delta(C)^c} P^{T+T^*}\big(x, B_\delta(C)\big)\, d\mu^*_\varepsilon(x) \Big)\, \mathbb{P}\big(\tau_\varepsilon \ge n(T+T^*)\big) \ge \mu^*_\varepsilon\big(B_\delta(C)^c\big)\, K \sum_{n=0}^{\infty} \mathbb{P}\Big(\frac{\tau_\varepsilon}{T+T^*} \ge n\Big) \ge \mu^*_\varepsilon\big(B_\delta(C)^c\big)\, \frac{K}{T+T^*}\, \mathbb{E}[\tau^*_\varepsilon].
\]

Step 4. Putting it all together.
The collection $(\mu^*_\varepsilon)_{\varepsilon>0}$ is tight because $\square(S)$ is compact. So there exists a subsequence that converges weakly to some $\mu^* \in \mathcal{P}(\square(S))$. The limit $\mu^*$ is not necessarily unique; however, by the Portmanteau theorem,
\[
\liminf_{\varepsilon\to 0} \mu^*_\varepsilon(U) \ge \mu^*(U)
\]
for all open sets $U$ of $\square(S)$. Note that $B_\delta(C)^c$ is open, and
\[
\mu^*_\varepsilon\big(B_\delta(C)^c\big) \le \frac{T+T^*}{K\, \mathbb{E}[\tau^*_\varepsilon]}.
\]
Since $K > 0$ for all $\varepsilon$, $\mathbb{E}[\tau^*_\varepsilon] \to \infty$ as $\varepsilon \to 0$, and $T + T^*$ does not depend on $\varepsilon$,
\[
\mu^*\big(B_\delta(C)^c\big) \le \liminf_{\varepsilon\to 0} \mu^*_\varepsilon\big(B_\delta(C)^c\big) \le (T+T^*) \liminf_{\varepsilon\to 0} \frac{1}{K\, \mathbb{E}[\tau^*_\varepsilon]} = 0.
\]
We conclude that $\mu^*_\varepsilon(B_\delta(C)) \to 1$ as $\varepsilon \to 0$.

B Concentration around approximate Nash equilibrium: proofs
Parts of this appendix rely on the assumption that the game is of size $2 \times 2$.

B.1 Unique fixed point to the expected best reply

Lemma 17.
Let $G$ be a $2\times 2$ game with a unique mixed Nash equilibrium $N^*$, and let $k$, the number of samples, be an integer such that $N^*_1 k \notin \mathbb{N}$ and $N^*_2 k \notin \mathbb{N}$. Then there exists a unique fixed point $n^* = (n^*_1, n^*_2) \in \mathrm{int}(\square(S))$ to the system
\[
\mathbb{E}\big[g^{BR}_1(n^*_2)\big] = n^*_1, \qquad \mathbb{E}\big[g^{BR}_2(n^*_1)\big] = n^*_2. \tag{14}
\]

Proof. We will refer to players 1 and 2 as the agreeing and the disagreeing player, respectively. The Nash equilibrium $N^* = (N^*_1, N^*_2)$ defines the 'cut-offs' $M_i := \lfloor N^*_i k \rfloor$, $i = 1, 2$. The cut-offs are such that if more than $M_2$ of the agreeing player's $k$ samples from the disagreeing player's history are 1, he plays 1; the disagreeing player plays strategy 1 if sufficiently many of his $k$ samples from the agreeing player's history of plays are 0, that is, if at most $M_1$ of them are 1. Consider the function
\[
p_{k,M}(x) := (1-\varepsilon)\sum_{i=M+1}^{k} \binom{k}{i} x^i (1-x)^{k-i} + \varepsilon/2.
\]
Given that the player history is in state $(a, d)$, the probabilities that the agreeing and the disagreeing player play strategy 1 are $p_a(d) := p_{k,M_2}(d)$ and $p_d(a) := 1 - p_{k,M_1}(a)$, respectively. We can now rewrite (14) as
\[
p_a(n^*_2) = n^*_1, \qquad p_d(n^*_1) = n^*_2.
\]
The ranges of $p_a$ and $p_d$ are contained in $I_\varepsilon := [\varepsilon/2,\, 1-\varepsilon/2]$. Composing $p_a$ and $p_d$, we may rewrite (14) again, now as
\[
(p_a \circ p_d)(n^*_1) = n^*_1, \quad n^*_1 \in I_\varepsilon, \qquad (p_d \circ p_a)(n^*_2) = n^*_2, \quad n^*_2 \in I_\varepsilon.
\]
Note that since $p_a$ and $p_d$ are strictly increasing and strictly decreasing, respectively, both $p_a \circ p_d$ and $p_d \circ p_a$ are strictly decreasing functions on $[0,1]$, and
\[
\min\{(p_a \circ p_d)(\varepsilon/2),\ (p_d \circ p_a)(\varepsilon/2)\} \ge \min\{p_d(1-\varepsilon/2),\ p_a(\varepsilon/2)\} > \varepsilon/2,
\]
\[
\max\{(p_a \circ p_d)(1-\varepsilon/2),\ (p_d \circ p_a)(1-\varepsilon/2)\} \le \max\{p_d(\varepsilon/2),\ p_a(1-\varepsilon/2)\} < 1-\varepsilon/2.
\]
Hence, since $p_a \circ p_d$ and $p_d \circ p_a$ are continuous and strictly decreasing, each intersects the line $y = x$ at a unique point in $I_\varepsilon$, and these intersection points are $n^*_1$ and $n^*_2$, respectively.

B.2 Global exponential stability of mean-field dynamics

Denote by $\xi$ the solution mapping of $\dot x(t) = F(x(t))$, $x(0) = p$, where $F(x) := \mathbb{E}[g^{BR}(x)] - x$. Then
\[
\xi(t, p) = p + \int_0^t F(\xi(s, p))\, ds.
\]

Lemma 18.
Let $\Sigma$ contain all points $x \in \square(S)$ such that $F(x) = 0$, or such that $\xi(t, x)$ satisfies $(\xi(t, x) - y)^T F(\xi(t, x)) = 0$ for all $t \ge 0$ and some $y$ with $F(y) = 0$. The mapping $t \mapsto \xi(t, p)$ is globally asymptotically stable, with $\lim_{t\to\infty} \xi(t, p) \in \Sigma$. Furthermore, if the game is $2\times 2$ with a unique mixed Nash equilibrium, then $\Sigma = \{n^*\}$, the unique root of $F$.

Proof. Let $V(x) := \frac{1}{2}\|x - n^*\|^2$, where $n^*$ is a root of $F$. The existence of $n^*$ is granted by Brouwer's fixed point theorem: $\square(S)$ is compact and convex and $F$ is continuous. Differentiating $V$ with respect to time along the solution mapping $\xi(t, p)$, we get
\[
\begin{aligned}
-\dot V(\xi(t,p)) &= -\nabla V(\xi(t,p))\, \dot\xi(t,p) \\
&= -(\xi(t,p) - n^*)^T F(\xi(t,p)) \\
&= -(\xi(t,p) - n^*)^T \big( \mathbb{E}[g^{BR}(\xi(t,p)) \mid \xi(t,p)] - \xi(t,p) \big) \\
&= 2V(\xi(t,p)) - (\xi(t,p) - n^*)^T \big( \mathbb{E}[g^{BR}(\xi(t,p)) \mid \xi(t,p)] - n^* \big) \\
&= V(\xi(t,p)) - V\big(\mathbb{E}[g^{BR}(\xi(t,p)) \mid \xi(t,p)]\big) + \tfrac{1}{2}\big\| \xi(t,p) - \mathbb{E}[g^{BR}(\xi(t,p)) \mid \xi(t,p)] \big\|^2,
\end{aligned}
\]
where in the last step we used the identity $2y^Tz = \|y\|^2 + \|z\|^2 - \|y-z\|^2$, $y, z \in \mathbb{R}^d$. We notice that
\[
V\big(\mathbb{E}[g^{BR}(\xi(t,p)) \mid \xi(t,p)]\big) = \tfrac{1}{2}\big\| \mathbb{E}[g^{BR}(\xi(t,p)) \mid \xi(t,p)] - \xi(t,p) + \xi(t,p) - n^* \big\|^2 \le \big\| \mathbb{E}[g^{BR}(\xi(t,p)) \mid \xi(t,p)] - \xi(t,p) \big\|^2 + 2V(\xi(t,p)),
\]
hence $\dot V(\xi(t,p)) \le 0$. Furthermore, $V$ is radially unbounded. Let $R := \{x \in \square(S) : (x - n^*)^T F(x) = 0\}$; then $R = \{x \in \square(S) : \dot V(x) = 0\}$, and $R$ contains $n^*$, any other solution to $F(x) = 0$, and all $x$ such that the vectors $x - n^*$ and $F(x)$ are orthogonal. By a global invariant set theorem (Slotine, Li et al., 1991, Thm. 3.5), $\xi(t, p)$ converges to the largest invariant set of $R$, which is $\Sigma$.

Next, for $2\times 2$ games with a unique mixed Nash equilibrium, we show that points of $R$ different from $n^*$ (now unique) cannot be in $\Sigma$. First note that if $x^0 \in R\setminus\{n^*\}$, then $x^0_i \ne n^*_i$, $i = 1, 2$. Without loss of generality, assume that player 2 has the disagreeing role and that $x^0_1 > n^*_1$. If $x^0 \in R\setminus\{n^*\}$ then $F(x^0) \ne 0$, and a trajectory starting at $x^0$ evolves according to the dynamical system $\dot x(t) = F(x(t))$, $x(0) = x^0$. Assume, towards a contradiction, that $x(t) \in R\setminus\{n^*\}$ for all $t \ge 0$. After some finite positive time, call it $t^*$, the path must cross the line $\{(x_1, n^*_2) : x_1 \in [0,1]\}$ (since $x^0_1 > n^*_1$ and player 2 is disagreeing, the trajectory moves "south-east" in $\square(S)$). This crossing contradicts $x(t^*) \in R\setminus\{n^*\}$, since $x(t^*) \in R\setminus\{n^*\}$ would require both components of $x(t^*)$ to differ from those of $n^*$. The same argument can be carried out for all other possible initial positions ($x^0_1 - n^*_1 < 0$, etc.), so $\{n^*\}$ is the only invariant set in $R$.

B.3 Trajectories over bounded time intervals
By (Benaïm and Weibull, 2003, Lemma 1), the state process $p(\cdot)$ and its mean-field approximation $\xi(\cdot, p(0))$ stay close to each other over bounded time intervals with high probability. To apply the result, we make one modification: we re-scale the size of the time steps taken by our learning process. This has no effect on previous results, since we will always (for a fixed $\beta$) have a fixed positive step size. The original proof of Benaïm and Weibull (2003) can be used to prove the lemma below.

Lemma 19.
Scale the step size of $t$ by $(1-\beta)$. Let $T = N(1-\beta)$ for some $N \in \mathbb{N}$, and let $(\hat p(t);\ t \in [0,T])$ be the linear interpolation of the path $(p(t);\ t = 0,\, 1-\beta,\, \dots,\, (1-\beta)N)$. Then, for all $\eta > 0$,
\[
\mathbb{P}\Big( \max_{t \in [0,T]} \|\hat p(t) - \xi(t, p(0))\|_\infty \ge \eta \Big) \le 2(m_1 + m_2 - 2)\, e^{-\eta^2 c},
\]
where $c$ is a positive constant proportional to $e^{-\gamma T}(T(1-\beta))^{-1}$, and $\gamma > 0$ depends only on the size of the game.

B.4 Proof of Theorem 6
Let t ≥ s ≥
0. Below, K will denote a generic positive constant. Whenever η > k ξ ( t, ˆ p ( t − s )) − ξ ( t, p (0)) k ∞ , Lemma 19 yields that P ( k ˆ p ( t ) − ξ ( t, k ∞ ≥ η )= P ( k ˆ p ( t ) − ξ ( t, ˆ p ( t − s )) k ∞ ≥ η − k ξ ( t, ˆ p ( t − s ) − ξ ( t, p (0)) k ∞ ) ≤ K exp (cid:18) − ( η − k ξ ( t, ˆ p ( t − s )) − ξ ( t, p (0)) k ∞ ) K e − γs s (1 − β ) (cid:19) . Furthermore, P ( k ˆ p ( t ) − n ∗ k ∞ ≥ η )= P ( k ˆ p ( t ) − ξ ( t, p (0)) k ∞ ≥ η − k ξ ( t, p (0)) − n ∗ k ∞ ) ,
33o we have that P ( k ˆ p ( t ) − n ∗ k ∞ ≥ η ) = P (cid:16) k ˆ p ( t ) − ξ ( t, ˆ p ( t − s )) k ∞ ≥ η − k ξ ( t, ˆ p ( t − s )) − ξ ( t, p (0)) k ∞ − k ξ ( t, p (0)) − n ∗ k ∞ (cid:17) ≤ K exp − ( η − k ξ ( t, ˆ p ( t − s )) − ξ ( t, p (0)) k ∞ − k ξ ( t, p (0)) − n ∗ k ∞ ) × K e − γs s (1 − β ) ! . Letting t → ∞ , we know from Lemma 18 that ξ ( t, p (0)) → n ∗ , solim t →∞ P ( k ˆ p ( t ) − n ∗ k ∞ ≥ η ) ≤ sup x ∈ (cid:3) ( S ) K exp (cid:18) − ( η − k ξ ( s, x ) − n ∗ k ∞ ) K e − γs s (1 − β ) (cid:19) . Choosing σ large enough, so that for all s ≥ σ : k ξ ( s, x ) − n ∗ k ∞ ≤ η/ x . Then lim t →∞ P (cid:0) k ˆ p ( t ) − n ∗ k ∞ ≥ η (cid:1) = o (cid:18) exp (cid:18) − Kη − β (cid:19)(cid:19) ,,
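The concentration result can be illustrated numerically. The following is a schematic RWS-style simulation for Matching Pennies, not the paper's exact specification: sampling here is i.i.d. from the current recency-weighted state, ties cannot occur since $k$ is odd, and all parameter values are illustrative.

```python
import random

def simulate_matching_pennies(beta=0.95, k=5, eps=0.1, T=20000, seed=0):
    """Schematic RWS-style dynamic for Matching Pennies (a sketch, not
    the paper's exact specification).  Each player's state is the
    recency-weighted probability of Heads; each period both players
    draw k i.i.d. samples from the opponent's state, best-respond to
    the empirical sample, and tremble uniformly with probability eps."""
    rng = random.Random(seed)
    p1, p2 = 0.5, 0.5           # prob. of Heads in each player's state
    avg1 = avg2 = 0.0
    for _ in range(T):
        # empirical frequency of Heads among k samples of the opponent
        d1 = sum(rng.random() < p2 for _ in range(k)) / k
        d2 = sum(rng.random() < p1 for _ in range(k)) / k
        # player 1 matches the sampled majority, player 2 mismatches it
        a1 = (rng.random() < 0.5) if rng.random() < eps else (d1 > 0.5)
        a2 = (rng.random() < 0.5) if rng.random() < eps else (d2 < 0.5)
        # recency-weighted update: the newest play gets weight 1 - beta
        p1 = beta * p1 + (1 - beta) * a1
        p2 = beta * p2 + (1 - beta) * a2
        avg1 += p1 / T
        avg2 += p2 / T
    return avg1, avg2
```

In runs of this sketch, the time-averaged states of both players hover around $1/2$, the unique mixed Nash equilibrium, rather than locking into a pure profile, which is the qualitative behavior that Theorem 6 formalizes for the RWS.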