[PDF] Best-response dynamics, playing sequences, and convergence to equilibrium in random games

Abstract

We show that the playing sequence--the order in which players update their actions--is a crucial determinant of whether the best-response dynamic converges to a Nash equilibrium. Specifically, we analyze the probability that the best-response dynamic converges to a pure Nash equilibrium in random n-player m-action games under three distinct playing sequences: clockwork sequences (players take turns according to a fixed cyclic order), random sequences, and simultaneous updating by all players. We analytically characterize the convergence properties of the clockwork sequence best-response dynamic. Our key asymptotic result is that this dynamic almost never converges to a pure Nash equilibrium when n and m are large. By contrast, the random sequence best-response dynamic converges almost always to a pure Nash equilibrium when one exists and n and m are large. The clockwork best-response dynamic deserves particular attention: we show through simulation that, compared to random or simultaneous updating, its convergence properties are closest to those exhibited by three popular learning rules that have been calibrated to human game-playing in experiments (reinforcement learning, fictitious play, and replicator dynamics).

Full PDF

BBEST-RESPONSE DYNAMICS, PLAYING SEQUENCES, ANDCONVERGENCE TO EQUILIBRIUM IN RANDOM GAMES

TORSTEN HEINRICH , , , YOOJIN JANG , , LUCA MUNGO , , MARCO PANGALLO , ALEXSCOTT , BASSEL TARBUSH , SAMUEL WIESE , Abstract.

We show that the playing sequence–the order in which players update theiractions–is a crucial determinant of whether the best-response dynamic converges to aNash equilibrium. Speciﬁcally, we analyze the probability that the best-response dy-namic converges to a pure Nash equilibrium in random n -player m -action games underthree distinct playing sequences: clockwork sequences (players take turns according to aﬁxed cyclic order), random sequences, and simultaneous updating by all players. We ana-lytically characterize the convergence properties of the clockwork sequence best-responsedynamic. Our key asymptotic result is that this dynamic almost never converges to apure Nash equilibrium when n and m are large. By contrast, the random sequence best-response dynamic converges almost always to a pure Nash equilibrium when one existsand n and m are large. The clockwork best-response dynamic deserves particular atten-tion: we show through simulation that, compared to random or simultaneous updating,its convergence properties are closest to those exhibited by three popular learning rulesthat have been calibrated to human game-playing in experiments (reinforcement learning,ﬁctitious play, and replicator dynamics). JEL codes : C62, C72, C73, D83.

Keywords : Best-response dynamics, equilibrium convergence, random games, learningmodels in games. Faculty for Economics and Business Administration, Chemnitz University of Technology, Chemnitz, Ger-many Institute for New Economic Thinking at the Oxford Martin School, University of Oxford, Oxford, UK Oxford Martin Programme on Technological and Economic Change (OMPTEC), Oxford Martin School,University of Oxford, Oxford, UK Department of Computer Science, University of Oxford, Oxford, UK Mathematical Institute, University of Oxford, Oxford, UK Institute of Economics and EMbeDS Department, Sant’Anna School of Advanced Studies, Pisa, Italy Department of Economics, University of Oxford, Oxford, UK

Email addresses : [email protected] , [email protected] , [email protected] , [email protected] , [email protected] , [email protected] , [email protected] Date : Wednesday 13 th January, 2021.We thank Doyne Farmer for useful comments at the early stages of this project. We acknowledgefunding from Baillie Giﬀord (Luca Mungo), the James S Mc Donnell Foundation (Marco Pangallo) and theFoundation of German Business (Samuel Wiese). a r X i v : . [ ec on . T H ] J a n ontents

1. Introduction 32. Best-response dynamics in games 72.1. Games 72.2. Best-response digraphs 82.3. Best-response dynamics 92.4. Convergence 102.5. Best-response dynamics with random inputs 123. Theoretical results 143.1. m -action games with n > m -action games with n = 2 players 184. Simulation results 214.1. Simulations of clockwork best-response dynamics 214.2. Simulations of best-response dynamics under clockwork, random, andsimultaneous updating 234.3. Simulation of other learning rules 25Appendix A. Proof of Theorem 2 32Appendix B. Proofs of Theorem 3, Proposition 3, and Theorem 4 43B.1. Proof of Theorem 3 43B.2. Proof of Theorem 4 45B.3. Proof of Proposition 3 47Appendix C. Descriptions of the learning rules 48C.1. Reinforcement learning 48C.2. Fictitious play 50C.3. Replicator dynamics 51References 54 . Introduction

The best-response dynamic is a ubiquitous iterative game-playing process in which, ineach period, players myopically select actions that are a best-response to the actions lastchosen by all other players. Most of the existing work on best-response dynamics and onlearning rules, such as ﬁctitious play, establishes suﬃcient conditions on a game’s payoﬀstructure to guarantee convergence to a Nash equilibrium. In this paper we also investigatethe convergence properties of the best-response dynamic but, rather than restricting ourattention to games with a particular structure, our focus is instead on the role of the playingsequence –the order in which players update their actions. Our key insight is that the playingsequence is a crucial determinant of whether the best-response dynamic converges to a pureNash equilibrium.We focus on three speciﬁc playing sequences: “clockwork” sequences, random sequences,and simultaneous updating. Under the clockwork playing sequence, players take turns toplay one at a time according to a ﬁxed cyclic order. Player 1 plays ﬁrst, followed by player2, and so on up to player n , and then the sequence returns to player 1, and so on. Toour knowledge, the behavior of the best-response dynamic under this playing sequence hasreceived relatively little attention in the literature. Under the random playing sequence,players take turns to play one at a time and the next player to play is chosen uniformlyat random from among all players. This playing sequence is the most well-studied in theliterature. Finally, we also consider simultaneous updating by all players in each period. , To investigate the role of the playing sequence in determining the convergence propertiesof the best-response dynamic, we analyze the probability that the best-response dynamicconverges to a pure Nash equilibrium in random n -player m -action games under clockwork,random, and simultaneous updating. In other words, we generate a game by drawing allpayoﬀs at random (from atomless distributions to avoid payoﬀ ties) and we determine the For example, previous work has established that the best-response dynamic converges to a Nash equilibriumin weakly acyclic games (Fabrikant et al., 2013), potential games (Monderer and Shapley, 1996), aggregativegames (Dindoˇs and Mezzetti, 2006), and quasi-acyclic games (Friedman and Mezzetti, 2001, Takahashi andYamamori, 2002). Boucher (2017) analyzes the clockwork sequence best-response dynamic in potential games. The random sequence best-response dynamic has been analyzed in anonymous games (Babichenko, 2013),near-potential games (Candogan et al., 2013), potential games (Christodoulou et al., 2012, Coucheney et al.,2014, Durand and Gaujal, 2016, Swenson et al., 2018, Durand et al., 2019), and games on a lattice (Blumeet al., 1993). “Sink” equilibria are studied in (Goemans et al., 2005, Mirrokni and Skopalik, 2009). This case is studied in Quint et al. (1997) for 2-player games and in Kash et al. (2011) for anonymousgames. There are, of course, many other possible playing sequences. For example, Feldman and Tamir (2012)study the case in which the sequence of play depends on current payoﬀs. robability that the best-response dynamic starting at a random initial action proﬁle con-verges to a pure Nash equilibrium of the randomly drawn game. Our paper therefore buildson the growing literature on random games. Studying such games allows us to abstractfrom the speciﬁc structure of a given game, thereby allowing us to focus solely on the roleof the playing sequence. Furthermore, random games are conceptually useful because theycan be seen as null models for generic situations involving strategic interactions. The novel theoretical contributions of this paper are primarily about the convergenceproperties of the clockwork best-response dynamic in random games in which payoﬀs aredrawn independently. Our main ﬁnding, which is presented in Section 3.1, is that theprobability that the clockwork best-response dynamic converges to a pure Nash equilibriumis, up to a polynomial factor, of order 1 / √ m n − . This has two implications: (i) when thenumber of players n and/or the number of actions m is large ( nm → ∞ ), the probabilitythat the clockwork best-response dynamic converges to a pure Nash equilibrium goes tozero, and (ii) since the asymptotic convergence probability depends essentially only on thequantity m n − , we have that, when n and/or m are large, the probability of convergenceto a pure Nash equilibrium in n -player m -action games is approximately the same as itis in 2-player m n − -action games. In fact, our simulations indicate that this asymptoticrelationship between n -player m -action games and 2-player m n − -action games is also fairlyaccurate for small values of n and m . In Section 3.2 we focus exclusively on 2-playergames. This allows us to provide more granular results on the convergence properties ofthe clockwork best-response dynamic. In particular, we provide results on game durationand we derive an exact expression for the probability that the best-response dynamicreaches a (best-response) cycle of given length at a particular period. As a special case, weobtain the exact probability that the clockwork best-response dynamic converges to a pureNash equilibrium in 2-player m -action games (and we argue that, in the 2-player m -actioncase, this probability is the same for random playing sequences). Furthermore, we showthat this probability is asymptotically (cid:112) π/m when m is large (and π ≈ . The literature on randomly generated games starts with Goldman (1957), Goldberg et al. (1968), andDresher (1970). Since then, a number of papers have analyzed the distribution of pure and mixed Nashequilibria in random games (Powers, 1990, Stanford, 1995, 1996, 1997, 1999, McLennan, 2005, McLennanand Berg, 2005, Takahashi, 2008, Kultti et al., 2011, Daskalakis et al., 2011). Cohen (1998) derives theprobability that a pure Nash equilibrium is Pareto eﬃcient. More recently, Alon et al. (2020) derive theprobability that a random game is dominance-solvable. See Pangallo et al. (2019) for a general discussion on the usefulness of considering null models and statisticalensembles in game theory, and on how this approach is extensively used in other disciplines such as statisticalmechanics and ecology. e brieﬂy comment on the approach that we adopted to derive the main result of Section3.1. We represent the best-response structure of a game by a directed graph (or digraph) inwhich the vertices are the action proﬁles and the directed edges correspond to the players’best-responses. A pure Nash equilibrium corresponds to a sink of the digraph. The best-response dynamic can be represented by a path that starts at some initial proﬁle in thedigraph and travels along the directed edges in a direction that is determined by the playingsequence. Drawing payoﬀs independently at random (from atomless distributions) inducesa uniform distribution over the best-response digraphs, so the probability of convergenceto a pure Nash equilibrium can be reduced to working out the probability that the best-response path initiated at a random vertex reaches a sink of the randomly drawn digraph.The main theoretical challenge that we face when analyzing the best-response dynamic isthat it exhibits some path-dependence: if a player encounters an environment that theyhad seen before, they must play the same action that they played when the environmentwas ﬁrst encountered. We tackle this issue by relying on a coupling argument in whichthe best-response dynamic is coupled to a (memoryless) random walk through the digraphthat is easier to analyze.Our results for the convergence properties of the best-response dynamic under ran-dom and simultaneous updating rely mostly on simulations. Under a random sequence,conditional on the game having a pure Nash equilibrium, we show that the probability ofconvergence to a pure Nash equilibrium goes to one when n or m are large. This is in sharpcontrast to our results for the clockwork sequence and highlights one of the key insights ofthis paper; namely, that the playing sequence is a crucial determinant of the probabilityof convergence to equilibrium. While the existing literature on best-response dynamicshas focused primarily on identifying suﬃcient conditions on a game’s payoﬀ structure toguarantee convergence to equilibrium, our results indicate that the playing sequence mustalso be given careful consideration. To further corroborate this insight, we show that theplaying sequence also determines the probability of convergence to equilibrium in randomgames with correlated payoﬀs. For example, in two player games with strongly positivelycorrelated payoﬀs, we ﬁnd that the best-response dynamic with simultaneous updating isunlikely to converge to equilibrium, whereas it is very likely to do so under a clockworkor a random sequence. Over all possible values for the payoﬀ correlation parameter, thebest-response dynamic tends to converge to a pure Nash equilibrium most frequently undera random playing sequence and least frequently under simultaneous updating. Amiet et al. (2019) prove this result analytically for the case m = 2 and n → ∞ . See Goldberg et al. (1968), Stanford (1999), Berg and Weigt (1999), Rinott and Scarsini (2000), Galla andFarmer (2013), Sanders et al. (2018) for work on random games with payoﬀ correlations. mong the three playing sequences, the clockwork playing sequence stands out as de-serving particular attention. Through extensive simulations, we show that the frequencyof convergence to equilibrium of the clockwork best-response dynamic most closely tracksthe convergence frequency of three popular learning rules, namely the Bush-Mosteller re-inforcement learning algorithm (Bush and Mosteller, 1953), ﬁctitious play (Brown, 1951,Robinson, 1951), and replicator dynamics (Maynard Smith, 1982). The three learningalgorithms are most naturally deﬁned as involving simultaneous updating, yet when wevary n , m , or the payoﬀ correlation parameter, the clockwork sequence best-response dy-namic outperforms both the random sequence and the simultaneous updating best-responsedynamics in most of our simulations. Additionally, when compared with the random se-quence best-response dynamic, the paths traced by the clockwork sequence best-responsedynamic in the space of all action proﬁles more closely resemble the paths traced by thethree learning algorithms.Our focus on reinforcement learning, ﬁctitious play, and replicator dynamics is driven bythe fact that these learning rules have been used to calibrate human game-play in exper-iments (Bush and Mosteller, 1953, Arthur, 1991, Erev and Roth, 1998, Sarin and Vahid,2001, Van Huyck et al., 1995, Friedman, 1996, Cheung and Friedman, 1997). Our resultssuggest that, to the extent that the learning algorithms are consistent with human game-play in randomly-generated games, the clockwork best-response dynamic could provide aﬁrst-order approximation for the evolution of play in such games.The paper is structured as follows. In Section 2 we present our analytical framework.Section 3 contains our theoretical results on the probability that the clockwork sequencebest-response dynamic converges to pure Nash equilibria in random games with indepen-dently drawn payoﬀs. The section also compares our ﬁndings to existing analytical resultsregarding the random playing sequence. Section 4 contains all our numerical simulationresults. All proofs and detailed descriptions of the three learning rules (reinforcementlearning, ﬁctitious play, and replicator dynamics) are in the appendix. The convergence properties of these learning algorithms have been extensively studied (Fudenberg andLevine, 1998), but there is no general result about their probability of convergence to Nash equilibria inrandom games. Of course, Bush-Mosteller reinforcement learning, ﬁctitious play, and replicator dynamics are not repre-sentative of all learning algorithms that have been studied in game theory. For example, diﬀerently fromthese learning rules, regret testing (Foster and Young, 2006, Germano and Lugosi, 2007) converges to a Nashequilibrium in essentially every n -player, m -action game with high probability. Therefore, its convergenceproperties are better approximated by best-response dynamics under a random sequence rather than undera clockwork sequence. . Best-response dynamics in games

In this section, we introduce the central concepts of our paper. For clarity, we summarizesome of our keys terms in Table 1.

Table 1.

Terminology

Game g n,m Game with n players and m actions per playerEnvironment a − i Part of the action proﬁle a that is played by all players but i Best-response b i ( a − i ) Maps a − i to the actions giving highest payoﬀ to i Non-degenerate game Game with no payoﬀ ties, i.e. the best-response is unique for each i and a − i Playing sequence s The function s : N → [ n ] determines whose turn it is to play s -best-responsedynamic on g n,m initiated at a Starting at proﬁle a , in each period t ∈ N , player s ( t ) plays her myopicbest-response to environment a t − − s ( t ) in the game g n,m Path (cid:104) (cid:126) a , s (cid:105) Inﬁnite sequence of action proﬁles (cid:126) a = ( a , a , ... ) satisfying a t − s ( t ) = a t − − s ( t ) for each t ∈ N Games.

A game with n ≥ m ≥ g n,m := ([ n ] , [ m ] , { u i } i ∈ [ n ] ) , where [ n ] := { , ..., n } is the set of players and each player i ∈ [ n ] has a set of actions[ m ] := { , ..., m } and a payoﬀ function u i : [ m ] n → R .An action proﬁle is a vector of actions a = ( a , ..., a n ) belonging to the set [ m ] n that liststhe action taken by each player. An environment for player i is a vector a − i belonging to theset [ m ] n − that lists the action taken by each player but i . A best-response correspondence b i for player i is a mapping from the set of environments for player i to the set of allnon-empty subsets of i ’s actions and is deﬁned by b i ( a − i ) := arg max a i ∈ [ m ] u i ( a i , a − i ) . A game is non-degenerate if for each player i and environment a − i , the best-responseaction is unique. Games in which there are no ties in payoﬀs are non-degenerate games. In the rest of this paper, we focus only on non-degenerate games, so each instance of“game” will be taken to mean “non-degenerate game”. Since best-responses are unique innon-degenerate games, we write a i = b i ( a − i ) whenever a i ∈ b i ( a − i ). There are no ties in payoﬀs if for all players i , all environments a − i , and all a i (cid:54) = a (cid:48) i , u i ( a i , a − i ) (cid:54) = u i ( a (cid:48) i , a − i ). , − , − , , , , − , , , , , , , , − − , , (1,1,1)(2,1,1) (1,1,2)(2,1,2) (1,2,1)(2,2,1) (1,2,2)(2,2,2) (1,2,1) Player 2Player 3Player 1 − − −− −− Figure 1.

Illustration of a 3-player 2-action non-degenerate game (left)and its associated best-response digraph (right). The axes shown in thecenter give us our coordinate system.An action proﬁle a ∈ [ m ] n is a pure Nash equilibrium (PNE) if for all i ∈ [ n ] and all a i ∈ [ m ], u i ( a ) ≥ u i ( a i , a − i ) . Equivalently, a ∈ [ m ] n is a PNE if each player i ∈ [ n ] is playing their best-responseaction i.e. a i = b i ( a − i ). Denote the set of PNE of the game g n,m by PNE( g n,m ) and let g n,m ) denote the cardinality of this set.2.2. Best-response digraphs.

The best-response structure of a non-degenerate game g n,m can be represented by a best-response digraph D ( g n,m ) whose vertex set is the set ofaction proﬁles [ m ] n and whose edges are constructed as follows: for each i ∈ [ n ] and eachpair of distinct vertices a = ( a i , a − i ) and a (cid:48) = ( a (cid:48) i , a − i ), place a directed edge from a to a (cid:48) if and only if a (cid:48) i is player i ’s best-response to environment a − i , i.e. a (cid:48) i = b i ( a − i ). There areedges only between action proﬁles that diﬀer in exactly one coordinate. A proﬁle a is aPNE of g n,m if and only if it is a sink of the best-response digraph D ( g n,m ). Example (Best-response digraphs) . Panel (A) of Figure 1 illustrates a 3-player 2-actiongame (on the left) and its associated best-response digraph (on the right). Player 1 selectsrows (along the depth), player 2 selects columns (along the width), and player 3 selectslevels (along height). In the left-hand panel, the payoﬀs of players 1, 2 and 3 are listed inthat order. The vertices of the best-response digraph are the action proﬁles. The uniquePNE at the proﬁle (1 , ,

1) is underlined and is a sink of the digraph. (cid:4) .3. Best-response dynamics.

We now consider games played over time, with eachplayer in turn myopically best-responding to their current environment. A playing se-quence function s : N → [ n ] determines whose turn it is to play at each time period t ∈ N ,where N denotes the set of positive integers { , , ... } . A path (cid:104) (cid:126) a , s (cid:105) is an inﬁnite sequenceof action proﬁles (cid:126) a = ( a , a , ... ) and an associated playing sequence function s : N → [ n ]satisfying the constraint that only one player changes her action at a time, a t − s ( t ) = a t − − s ( t ) for each t ∈ N . So only the action of player s ( t ) is allowed to diﬀer between proﬁles a t − and a t along a path.Note that this set up rules out simultaneous updating from our theoretical analysisbecause we allow only one player to play in any given period. A more general frameworkwould allow subsets of players to update their actions simultaneously in each period – inother words, a playing sequence would be a sequence of non-empty subsets of [ n ] – butwe avoid this generality here. As mentioned in our introduction, we focus exclusively onclockwork and random playing sequences for our theoretical results.The best-response dynamic with playing sequence s : N → [ n ] on a game g n,m initiatedat the action proﬁle a generates a path (cid:104) (cid:126) a , s (cid:105) according to Algorithm 1. Namely, setthe initial action proﬁle to a and, in each period t ∈ N , player s ( t ) myopically plays thebest-response to her current environment a t − − s ( t ) . Algorithm 1 s -sequence best-response dynamic on g n,m initiated at a (1) For t ∈ N :(a) Set i = s ( t ) (b) Set a t − i = a t − − i (c) Set a ti = b i ( a t − − i ) where b i ( a t − − i ) := arg max x i ∈ [ m ] u i ( x i , a t − − i )Algorithm 1 generates a path by traveling along the edges of the best-response digraph D ( g n,m ) in direction s ( t ) at step t starting from the initial proﬁle a . More precisely,the inﬁnite sequence of actions (cid:126) a is determined as follows: if player s ( t ) is already bestresponding then a t − does not point to any vertex ( a (cid:48) s ( t ) , a t − − s ( t ) ) (cid:54) = a t − and the next proﬁlein the sequence is a t − itself, i.e. a t = a t − ; otherwise, if player s ( t ) is not already playingher best response then travel to the vertex that corresponds to her playing her best-responseaction, i.e. set a t = ( a (cid:48) s ( t ) , a t − − s ( t ) ) where ( a (cid:48) s ( t ) , a t − − s ( t ) ) (cid:54) = a t − is the unique vertex that a t − points to. .4. Convergence.

For any path (cid:104) (cid:126) a , s (cid:105) and any set of action proﬁles A ⊆ [ m ] n deﬁne τ (cid:104) (cid:126) a ,s (cid:105) ( A ) as the ﬁrst period t ≥ (cid:126) a is in the set A : τ (cid:104) (cid:126) a ,s (cid:105) ( A ) := inf { t ∈ N : a t ∈ A} , where inf is the inﬁmum operator and we use the convention that inf ∅ = ∞ (i.e. we take τ (cid:104) (cid:126) a ,s (cid:105) ( A ) to be inﬁnite if no element of the sequence (cid:126) a is in A ). The path (cid:104) (cid:126) a , s (cid:105) reaches theset A (in period t ) if t = τ (cid:104) (cid:126) a ,s (cid:105) ( A ) and t is ﬁnite. Deﬁnition 1.

The s -sequence best-response dynamic on game g n,m initiated at a con-verges to a PNE if the path (cid:104) (cid:126) a , s (cid:105) generated according to Algorithm 1 reaches PNE( g n,m ).Clearly, if the path reaches a PNE in some period, it stays there forever.2.4.1. Convergence for the clockwork playing sequence.

There are inﬁnitely many possibleplaying sequences. We will be particularly interested in the clockwork playing sequencewhich is deﬁned by s ( t ) = s c ( t ) := 1 + ( t −

1) mod n . In other words, player 1 plays inperiod 1, followed by player 2, then 3, and so on until player n , and then the sequencereturns to player 1, and so on.Deﬁnition 1 applies to all playing sequences but, when the sequence is clockwork, we cancharacterize convergence (and non-convergence) more simply in terms of path properties.We refer to one complete rotation of the clockwork sequence as a round of play; for example,if a round starts at player i then each player plays once and the round is complete when itis once again i ’s turn to play. For any k ∈ N deﬁne T (cid:104) (cid:126) a ,s c (cid:105) ( k ) := inf (cid:110) t ∈ N : a t = a t + nk and a t (cid:54) = a t + nk (cid:48) for all k (cid:48) ∈ N such that k (cid:48) < k (cid:111) , to be the ﬁrst period in which an action proﬁle is repeated k rounds later (and at noearlier round). If T (cid:104) (cid:126) a ,s c (cid:105) ( k ) is ﬁnite then the path (cid:104) (cid:126) a , s c (cid:105) has the property that from period T (cid:104) (cid:126) a ,s c (cid:105) ( k ) onwards, the sequence of nk possibly non-distinct action proﬁles a t , ..., a t + nk − repeats itself forever. We therefore say that the path (cid:104) (cid:126) a , s c (cid:105) reaches a nk - cycle in period T (cid:104) (cid:126) a ,s c (cid:105) ( k ) or, equivalently, that the clockwork sequence best-response dynamic converges toa nk -cycle (in period T (cid:104) (cid:126) a ,s c (cid:105) ( k )).Notice that if the action proﬁle is a t in some period t and no one deviates from thisproﬁle in a single round (i.e. a t = a t + n ), then a t must be a PNE. Therefore, if the path (cid:104) (cid:126) a , s c (cid:105) reaches a nk -cycle in period T (cid:104) (cid:126) a ,s c (cid:105) ( k ) and k = 1 then we say that the clockworksequence best-response dynamic reaches a PNE in that period. However, if the path (cid:104) (cid:126) a , s c (cid:105) We also say that the path reaches the set A by period t if it reaches A in period τ with τ ≤ t and thepath reaches A before ( after ) period t if it reaches A in period τ < t ( τ > t ). ◦ ◦◦ ◦◦ ◦◦ a , a a a , a (a) sequence: clockwork a = (1 , , ◦◦ ◦◦ ◦◦ ◦◦ a a , a , a a a , a a (b) sequence: clockwork a = (1 , , Player 2Player 3Player 1 − − −− −− ◦◦ ◦◦ ◦◦ ◦◦ a a a a (c) sequence: 1-2-1- · · · a = (1 , , Figure 2.

The digraphs in panels (A)-(C) above are all identical and cor-respond to the best-response digraph of the game shown in Figure 1. In thepanels we show the ﬁrst few elements of paths generated according to thebest-response dynamic for diﬀerent initial proﬁles and playing sequences.reaches a nk -cycle in period T (cid:104) (cid:126) a ,s c (cid:105) ( k ) and k > best-response cycle (of length nk ) in that period.Because the number of action proﬁles is ﬁnite, T (cid:104) (cid:126) a ,s c (cid:105) ( k ) must be ﬁnite for some k , sothe clockwork best-response dynamic must always converge either to a PNE or to a best-response cycle. In dynamical systems language, nk -cycles (including PNE) are attractors of the clockwork best-response dynamic. Clearly, T (cid:104) (cid:126) a ,s c (cid:105) := inf (cid:8) k ∈ N : T (cid:104) (cid:126) a ,s c (cid:105) ( k ) (cid:9) , is the period in which the path (cid:104) (cid:126) a , s c (cid:105) reaches a PNE or a best-response cycle. xample (Best-response dynamics and convergence) . The digraphs in panels (A)-(C) ofFigure 2 are all identical and correspond to the best-response digraph of the game shownin Figure 1. Each vertex of a best-response digraph is a point in [ m ] n but, unlike the right-hand panel of Figure 1, we no longer show the explicit coordinate of each vertex in ourillustrations to avoid clutter. In panels (A)-(C) of Figure 2 we show the ﬁrst few elementsof paths generated according to the best-response dynamic for diﬀerent initial proﬁles andplaying sequences.In panel (A) the initial proﬁle is set to a = (1 , ,

2) and the playing sequence is clock-work. The ﬁrst few elements of the inﬁnite sequence (cid:126) a are shown in the ﬁgure. The pathstays at (1 , ,

2) in period 1 because player 1 does not change her action. The path thenmoves to a = (1 , ,

2) in period 2 because player 2 plays action 2. In period 3, player 3plays action 1 which takes the path to a = (1 , , T (cid:104) (cid:126) a ,s c (cid:105) (1) = 3.In panel (B) the initial proﬁle is set to a = (1 , ,

1) and the playing sequence is clock-work. This time, the path moves to the bottom left corner on the front face of the cubein period 1. The path then cycles forever among the four proﬁles on the front face of thecube. Since period t = 1 is the ﬁrst period in which a t (cid:54) = a t +3 but a t = a t +3 k for someinteger k (namely k = 2) we have that T (cid:104) (cid:126) a ,s c (cid:105) (2) = 1. In other words, the path reachesa best-response cycle in period 1. In fact, the path reaches a 6-cycle in period 1: oncereached, the action proﬁle sequence a , ..., a is repeated forever. Note that not all actionproﬁles in the sequence a , ..., a are distinct.In panel (C) the initial proﬁle is once again set to a = (1 , ,

1) but the playing sequenceis 1-2-1- · · · . This time, the path reaches the PNE in period 3. The playing sequence fromperiod 4 onwards is irrelevant: once at the PNE, the path will remain forever regardless ofthe playing sequence.The examples in panels (B) and (C) illustrate how changes to the playing sequence andto the initial proﬁle can aﬀect convergence to a PNE. (cid:4)

Best-response dynamics with random inputs.

How likely is it for best-responsedynamics to converge to pure Nash equilibria? As discussed in the introduction, severalpapers have analyzed the convergence properties of best-response dynamics in games with aspeciﬁc payoﬀ structure (e.g. in potential games, aggregative games, etc) and our examplesabove illustrate how diﬃcult it is to obtain general results for all games. In particular, thedynamics depend on the details of the game, the playing sequence, and the initial condition.Rather than imposing restrictions on the game itself, we answer our question by working ut the probability that the best-response dynamic convergences to a PNE when the inputsto the dynamic are drawn at random.We generate random games by drawing all payoﬀs at random: for each a ∈ [ m ] n and i ∈ [ n ], the payoﬀ U i ( a ) is a random number that is drawn from an atomless distribution P . The draws are independent across all i ∈ [ n ] and a ∈ [ m ] n . The distribution P ensuresthat any ties in payoﬀs have zero measure, so any resulting game is non-degenerate almostsurely. A random n -player m -action game drawn according to this process is denoted by G n,m := ([ n ] , [ m ] , { U i } i ∈ [ n ] ).In addition to the clockwork playing sequence, we will also consider the random playingsequence s r which is determined as follows: for each t ∈ N , draw s r ( t ) uniformly at randomfrom [ n ]. So, in each period, the player playing in that period is drawn uniformly at randomfrom among all players. In what follows, we will take a playing sequence s to be an elementof { s c , s r } because we will compare our results concerning the clockwork sequence againstexisting analytical results concerning the random playing sequence (Amiet et al., 2019).Finally, we will draw the initial proﬁle A uniformly at random from among all proﬁles.Since the game itself is drawn at random, the choice of initial condition is actually irrelevant,i.e. our results would not change if we had arbitrarily ﬁxed the initial proﬁle to some speciﬁcvalue. The advantage of drawing the initial proﬁle at random is that it allows us to dropthe dependence on the initial proﬁle in our description of the best-response dynamic.The best-response dynamic on a random game, playing sequence, and initial conditionis described by Algorithm 2. We randomly draw the game and initial condition and thenessentially run Algorithm 1. Doing so induces a distribution over paths and PNE sets. Ourdeﬁnitions of convergence given in Section 2.4 all apply here. For example, we say that the s -sequence best-response dynamic on game G n,m (and initial condition A ) converges to aPNE if the path (cid:104) (cid:126) A , s (cid:105) generated according to Algorithm 2 reaches PNE( G n,m ). Algorithm 2 s -sequence best-response dynamic on G n,m (1) For all i ∈ [ n ] and a ∈ [ m ] n draw U i ( a ) at random according to P (2) Draw A uniformly at random from [ m ] n (3) For t ∈ N :(a) Set i = s ( t ) (b) Set A t − i = A t − − i (c) Set A ti = B i ( A t − − i ) where B i ( A t − − i ) := arg max x i ∈ [ m ] U i ( x i , A t − − i )Step (1) of Algorithm 2 eﬀectively creates a best-response digraph D ( G n,m ) on thevertices [ m ] n according to the following stochastic process: for each i ∈ [ n ] and environment − i select an action a (cid:48) i uniformly at random from [ m ] and then for each a i (cid:54) = a (cid:48) i create adirected edge from ( a i , a − i ) to ( a (cid:48) i , a − i ). This follows from the manner in which the payoﬀsare drawn: there is a zero probability of ties because P is atomless and for each i ∈ [ n ] theprobability that action a i ∈ [ m ] is a best-response to environment a − i is given byPr (cid:20) U i ( a i , a − i ) ≥ max x i ∈ [ m ] U i ( x i , a − i ) (cid:21) = 1 m . Step (2) of Algorithm 2 then selects an initial proﬁle and step (3) essentially traces a pathby traveling along the edges of the best-response digraph in direction s ( t ) at step t startingfrom the initial proﬁle. 3. Theoretical results

In this section we show that there is a striking diﬀerence between the probability ofconvergence to a PNE of the clockwork sequence vs. the random sequence best-responsedynamic. Roughly, while the clockwork sequence best-response dynamic never convergesto a PNE in large games, the random sequence best-response dynamic always converges inlarge games (conditional on there being a PNE). Our next example develops some intuitionfor this result.

Example (Best-response dynamics for clockwork vs. random playing sequence) . In thebest-response digraph of Figure 2, we saw that the clockwork playing sequence may notconverge to the PNE depending on the initial proﬁle. By contrast, the random playingsequence best-response dynamic converges to the PNE at (1 , ,

1) with probability 1 givensuﬃcient time. The path cannot be stuck in the best-response cycle on the front face ofthe cube for example because there is a positive probability that the path will escape tothe PNE.Figure 3 provides further examples of best-response digraphs. Panels (A) and (B) illus-trate possible best-response digraphs for 3-player 2-action games. In the digraph of panel(A), there are two PNE which are represented by black dots at the proﬁles (1 , ,

1) and(2 , , layer 2Player 3Player 1 − − −− −− ◦◦ ◦◦ ◦◦ ◦◦• • (a) ◦◦ ◦◦ ◦◦ ◦◦ (b) ◦◦◦ ◦◦◦ ◦◦◦ ◦◦◦ ◦◦◦ ◦◦◦ ◦◦◦ ◦◦◦ ◦◦◦•• •• •• • (c) Figure 3.

Panels (A) and (B) provide examples of possible best-responsedigraphs for n = 3 and m = 2. Panel (C) illustrates a possible best-responsedigraph for n = m = 3.Finally, panel (C) illustrates a possible best-response digraph for 3-player 3-action games.The unique PNE at (3 , ,

1) is represented by a black dot. Note that there is a directed edgefrom (3 , ,

1) to (3 , ,

1) as well as a directed edge from (3 , ,

1) to (3 , ,

1) but these edgesoverlap in our illustration so appear as a single long edge at the bottom of the front face ofthe cube. This digraph shows us a situation in which the random sequence best-responsedynamic may not converge to a PNE even if there is one: indeed, if the path reaches oneof the action proﬁles illustrated as a red dot then the path can never escape to the PNEregardless of the playing sequence. As implied by the results below, cases like this onebecome vanishingly rare in large games. (cid:4)

We start by noting that in random n -player m -action games, the probability that thereis a pure Nash equilibrium is asymptotically 1 − exp {− } ≈ .

63 as either n or m (or both)get large. Proposition 1 (Rinott and Scarsini, 2000) . lim nm →∞ Pr [ G n,m ) ≥

1] = 1 − exp {− } . n fact, Rinott and Scarsini (2000) prove a much stronger result: they characterize theasymptotic distribution of the number of pure Nash equilibria in random games, showingthat G n,m ) is asymptotically Poisson(1) as nm → ∞ . The probability that a PNEexists in a random game has been studied by Goldberg et al. (1968) in the 2-player caseand by Dresher (1970) in the n -player case as m → ∞ . Powers (1990) and Stanford (1995)noted that the distribution of the number of PNE approaches a Poisson(1) as m → ∞ .Building on Arratia et al. (1989), Rinott and Scarsini (2000) show that the Poisson(1) limitholds as nm → ∞ (i.e. when m or n get large).Next, we present the theoretical results for best-response dynamics in random games. In Section 3.1 we focus on games with n > n = 2 players. In this case, the probability of convergence toequilibrium is the same under both clockwork and random playing sequences. Furthermore,we are able to provide asymptotic as well as exact results for game duration and for theprobability of convergence to equilibrium.3.1. m -action games with n > players. The following result shows that, in large 2-action games, the random sequence best-response dynamic converges with high probabilityto a PNE if there is one.

Proposition 2 (Amiet et al., 2019) . lim n →∞ Pr [ s r -best-response dynamic on G n, converges to a PNE | G n, ) ≥

1] = 1 . Combined with Proposition 1, it follows that over the class of all (non-degenerate) 2-action games, the random sequence best-response dynamic converges to a PNE in (1 − exp {− } ) × ≈

63% of those games when the number of players is large.A generalization of Proposition 2 to m -action games is non-trivial and we are not awareof existing analytical results for m > However, we conjecture that the random sequence Note that all the convergence results hold modulo any relabelling of the players. For example, while wedescribed the clockwork playing sequence as ordering the players according to 1-2- · · · - n , our results wouldequally hold if the sequence had ordered the players according to n - · · · -2-1 or any such permutation. When m = 2, the random digraph D ( G n, ) is a random n -cube in which, independently, for each pair ofproﬁles a and a (cid:48) that diﬀer in exactly one coordinate, there is a directed edge from a to a (cid:48) with probability1 /

2; otherwise, there is a directed edge from a (cid:48) to a with complementary probability 1 /

2. For such an n -cube, Amiet et al. (2019) show that when n is large, every pure Nash equilibrium belongs to the set ofvertices that are reachable by some directed path from the initial action proﬁle a . This is suﬃcient toshow that, when the number of players is large, the random sequence best-response dynamic converges with est-response dynamic converges to a PNE with high probability if there is one as mn →∞ . Consistent with this conjecture, in the simulations of Section 4 we show that therandom sequence best-response dynamic does converge to a PNE with probability close to1 − exp {− } when m or n get large, provided that n > n > Theorem 1. lim nm →∞ Pr [ s c -best-response dynamic on G n,m converges to a PNE] = 0 . So, with high probability, the clockwork sequence best-response dynamic does not con-verge to a PNE as the number of players or actions gets large. This is in sharp contrastwith the asymptotic behavior of the random sequence best-response dynamic.Theorem 1 is an immediate consequence of the result below, which gives us bounds onthe probability of convergence to equilibrium:

Theorem 2. √ n √ m n − ≤ Pr (cid:34) s c -best-response dynamicon G n,m converges to a PNE (cid:35) ≤ n / √ log m √ m n − . Theorem 2 also gives us the following corollary:

Corollary 1.

The probability that the clockwork sequence best-response dynamic convergesto a PNE is, up to a polynomial factor, of order / √ m n − . This result gives us a clear “scaling” law: since the asymptotic convergence probabilitydepends essentially only on the quantity m n − , we have that, when n and/or m are large,the probability of convergence to a pure Nash equilibrium in n -player m -action games isapproximately the same as it is in 2-player m n − -action games. This scaling is reﬂectedin our simulations even for relatively small values of m and n .We brieﬂy comment on the approach that we take in the appendix to derive Theorem2. As is clear from the discussion following Algorithm 2, drawing payoﬀs independently probability 1 to a pure Nash equilibrium if there is one: indeed, the random playing sequence ensures thatsome path to equilibrium is played given suﬃcient time. The edges in D ( G n, ) are oriented in one way orthe other independently of each other but this is no longer true when m >

2. In D ( G n,m ) with m > a i , a − i ) to ( a (cid:48) i , a − i ) for some a i (cid:54) = a (cid:48) i then the graph must also have thedirected edges ( a (cid:48)(cid:48) i , a − i ) to ( a (cid:48) i , a − i ) for all a (cid:48)(cid:48) i (cid:54) = a (cid:48) i . This dependence and the more complex graph structurerenders a generalization of Proposition 2 to m > Indeed, consider G n ,m and G n ,m where n = 2 and m = m n − . Then, (cid:113) m n − = (cid:113) m n − . t random (from atomless distributions) induces a uniform distribution over best-responsedigraphs. So, the probability of convergence to a pure Nash equilibrium can be reduced toworking out the probability that the path generated by Algorithm 2 initiated at a randomvertex reaches a sink of the randomly drawn digraph. The main theoretical challenge thatwe face when analyzing such paths is that they exhibit some path-dependence: if a playerencounters an environment that they had seen before, they must play the same action thatthey played when the environment was ﬁrst encountered. We tackle this issue by relyingon a coupling argument in which the clockwork best-response dynamic is coupled to a(memoryless) random walk through the digraph that is easier to analyze.We are not aware of any analytical results on the probability of convergence to equi-librium in random games for the best-response dynamic under simultaneous updating.Obtaining results for simultaneous updating is non-trivial because the pattern of path-dependence is more complex than it is for the clockwork best-response sequence. Seefootnote 28 for a more detailed comment.3.2. m -action games with n = 2 players. When there are n = 2 players, we are ableto provide detailed results on both game duration and on the probability of convergenceto equilibrium.The following theorem gives us an exact expression for the probability that the clockworksequence best-response dynamic converges to a 2 k -cycle in period t . Theorem 3.

For any k ∈ { , ..., m } and t ∈ { , ..., m − k + 1) } , Pr (cid:34) s c -best-response dynamic on G ,m converges to a 2 k -cycle in period t (cid:35) = 1 m t +2( k − (cid:89) i =1 (cid:18) − m (cid:22) i (cid:23)(cid:19) . (1)For any k ∈ { , ..., m } , if t > m − k + 1) then the probability that the clockworksequence best-response dynamic reaches a 2 k -cycle is zero. Setting k = 1 in (1) yields theexact probability that the clockwork sequence best-response dynamic on G ,m convergesto a PNE in period t .As a straightforward corollary of Theorem 3, we can sum (1) over all t ∈ { , ..., m − k + 1) } to obtain the exact probability that the clockwork sequence best-response dynamicconverges to a 2 k -cycle: See also Pangallo et al. (2019) for an exact formula giving the probability of existence of cycles of anylength in 2-player games. Since the number of action proﬁles is ﬁnite, a path cannot reach a 2 k -cycle only after 2( m − k +1) periods. orollary 2. (2) Pr (cid:34) s c -best-response dynamic on G ,m converges to a 2 k -cycle (cid:35) = 1 m m − k +1) (cid:88) t =1 t +2( k − (cid:89) i =1 (cid:18) − m (cid:22) i (cid:23)(cid:19) . Setting k = 1 in (2) yields the exact probability that the clockwork sequence best-response dynamic on G ,m converges to a PNE.In order to get a better sense of the behavior of (2), we now study the expression when m is large. To do so, let (cid:98) x (cid:99) be the ﬂoor operator of x and let Φ( · ) denote the standardnormal cumulative distribution function:Φ( x ) := 1 √ π (cid:90) x −∞ exp (cid:26) − z (cid:27) dz. The asymptotics of (2) are given below.

Proposition 3. If k = o ( m / ) then, as m → ∞ , (2) is asymptotically (cid:114) πm (cid:18) − Φ (cid:18) k − √ m (cid:19)(cid:19) . And if k = o ( √ m ) then, as m → ∞ , (2) is asymptotically (cid:112) π/m . The asymptotics given in Proposition 3 help us to better understand the behavior of theclockwork sequence best-response dynamic in large 2-player games. (i) The probability ofconvergence to a PNE, which corresponds to setting k = 1, goes to zero when m → ∞ . (ii)Short cycles all have about the same probability. Indeed, for k = o ( √ m ) the probabilityis asymptotically (cid:112) π/m . Finally, (iii) it is very unlikely that the best-response dynamicconverges to a very long cycle: if k/ √ m → ∞ then the probability that the dynamicconverges to a cycle of length at least 2 k tends to 0. Our results are illustrated in Figure 4, which shows the probability of convergence tocycles of given length as calculated from the exact formula in Theorem 3. The panels onthe left and in the center plot (2) for m = 5 and m = 10 and show that the probability ofconvergence is lower for long cycles than for short cycles. The panel on the right plots (2)for m = 1000. Since 31 ≈ √ k = 62. The probability f ( n ) = o ( g ( n )) denotes f ( n ) /g ( n ) → n → ∞ . When k = o ( √ m ), the argument of Φ( · ) goes to zero. Since Φ(0) = 1 / (cid:112) π/m which is independent of k . If, instead, k/ √ m → ∞ then the argument of Φ( · )grows large and since Φ( ∞ ) = 1, the convergence probability goes zero. Our proof of Proposition 3 allowsus to derive the asymptotics only for the range k = o ( m / ), but this is suﬃcient to obtain some insightinto the behavior of (2). f convergence is more or less uniform up to that point, which is consistent with ourobservations above. −3 −2 −1 C o n v e r g e n ce t o C y c l e Cycle Length −6 −3 −230 −148 −66 Figure 4.

Convergence of the clockwork sequence best-response dynamicto 2 k -cycles in 2-player games, using the exact formula from Theorem 3.We now compare the behavior of the clockwork sequence best-response dynamic in 2-player games with the behavior of the random sequence best-response dynamic in 2-playergames. (i) The probability of convergence to a PNE is the same for clockwork and forrandom playing sequences in 2-player games. The reason is that, under the random playingsequence, players’ actions do not change whenever the sequence asks the same player toplay several times in a row. The proﬁles that are therefore visited along the path are thesame under both playing sequences, which induces the same probability of convergence toequilibrium. However, (ii) the game duration will be diﬀerent since the random playingsequence introduces delays. In fact, the result below allows us to more precisely pin downgame duration. Theorem 4.

The probability that the s c -best-response dynamic on G ,m does not convergeto a PNE or to a best-response cycle before period x √ m is exp {− x / } as m → ∞ . This result shows that the clockwork sequence best-response dynamic in 2-player gamesis likely to converge to a PNE or to a best-response cycle within √ m periods when m is large. The game duration for the random playing sequence should be greater than forthe clockwork playing sequence by a factor of 2. The reason is that, under the clockworkplaying sequence, the players alternate at the tick of each period whereas, under the randomplaying sequence, the number of periods that it takes for the playing sequence to turn tothe other player is Geometric( ). Thus the random playing sequence can be considered asa slowing down of the clockwork playing sequence in which the expected time to play thenext step is 2. . Simulation results

In this section, we run simulations of the clockwork sequence best-response dynamic.This allows us to compare its behavior against the best-response dynamic under randomand simultaneous updating and against other learning rules (reinforcement learning, ﬁcti-tious play, and replicator dynamics).For our theoretical results, we limited ourselves to analyzing best-response dynamics inrandom games where the payoﬀs are drawn independently across players and action proﬁles.For our simulations, we allow the payoﬀs to be correlated across players (Goldberg et al.,1968, Stanford, 1999, Berg and Weigt, 1999, Rinott and Scarsini, 2000, Galla and Farmer,2013, Sanders et al., 2018). To do so, at initialization, for each action proﬁle a we drawthe vector U ( a ) = ( U ( a ) , . . . , U n ( a )) at random from a multivariate normal distributionwith mean zero, unit variance, and covariance matrix with 1s on the diagonal and Γ n − on the other entries. So Γ ∈ [ − , n −

1] parametrizes the degree to which payoﬀs arecorrelated. Once drawn, the payoﬀs are kept ﬁxed for the rest of the simulation. If Γ = 0then the payoﬀs are chosen independently, so this recovers the case for which we derivedour theoretical results. At one extreme, if Γ = n − − > < n , m , and Γ, we draw 500 games and simulatefor 5000 time steps starting from randomly chosen initial conditions. In Section 4.1 wesimulate the clockwork sequence best-response dynamic in random n -player m -action gameswith independent payoﬀs (i.e. Γ = 0). This allows us to verify our theoretical results. InSection 4.2 we compare the behavior of the clockwork sequence best-response dynamicagainst the best-response dynamic under random and simultaneous updating for diﬀerentvalues of n , m , and Γ. In Section 4.3 we compare the behavior of the best-responsedynamic (with clockwork, random, and simultaneous updating) against other learningrules (reinforcement learning, ﬁctitious play, and replicator dynamics) for diﬀerent valuesof n , m , and Γ.4.1. Simulations of clockwork best-response dynamics.

We simulate the clockworksequence best-response dynamic in n -player m -action games with Γ = 0. We ﬁnd goodagreement with our main theoretical results and we show that Corollary 1, which statesthat the asymptotic probability of convergence to a PNE is, up to a polynomial factor, oforder 1 / √ m n − , is also reﬂected in our simulations for relatively small values of m and n . he blue markers in Figure 5 show the frequency of convergence to a PNE in oursimulations for diﬀerent values of n and m . Clearly, the frequency of convergence to aPNE decreases as the number of players and/or actions increases. The solid black linein the top panel is the analytical probability of convergence to a PNE in 2-player gameswhich is calculated using equation (2). Up to sampling noise, our analytical result perfectlymatches the numerical simulations. Number of Actions Number of Actions - Rescaled C o n v e r g e n ce t o P u r e N a s h E q u ili b r i u m Figure 5.

Frequency of convergence to a PNE for the clockwork best-response dynamic with Γ = 0. In the bottom panel, the horizontal axis isrescaled according to Corollary 1.Figure 5 also allows us to verify Corollary 1; namely, that the frequency of convergence toa PNE in a n -player m -action game is roughly the same as in a 2-player m n − action game.While in Section 3 this result was only proved asymptotically ( mn → ∞ ), we investigatethe extent to which the result holds for small values of m and n . Using Corollary 1,we approximate the frequency of convergence to a PNE in a n -player m -action game byreplacing the number of actions m in equation (2) by m n − , which is the equivalent numberof actions in the corresponding 2-player game. We plot these approximate frequencies asdashed lines in the top panel of Figure 5. As can be seen, there is a good match betweenthe approximation and our simulation results, particularly when the number of actions m is elatively large. The bottom panel of Figure 5 gives us another way to illustrate Corollary 1.Here, we rescale the number of actions for n -player games to match the number of actionsof the equivalent 2-player game. After rescaling, a point corresponding to m actions inan n -player game is moved on the horizontal axis to a number of actions given by m n − .For example, the point giving the convergence frequency for 4-player 10-action games istranslated to the right to the horizontal coordinate corresponding to 10 = 1000 actionsin a 2-player game. The re-scaled markers all lie relatively close to the black line – whichcorresponds to the analytical probability of convergence to a PNE in 2-player m -actiongames. C l o c k w o r k R a nd o m Number of Actions S i m u l t a n e o u s C o n v e r g e n ce t o P u r e N a s h E q u ili b r i u m Figure 6.

Frequency of convergence to pure Nash equilibria under clock-work, random, and simultaneous best-response dynamics. The solid linecorresponds to Γ = 0, the darker dashed lines to Γ = n − , n − , . . . , ( n − − . , − . , . . . , − Simulations of best-response dynamics under clockwork, random, and si-multaneous updating.

We simulate best-response dynamics under clockwork, random, nd simultaneous updating. We ﬁnd that there are signiﬁcant diﬀerences in the proba-bility of convergence to a PNE when Γ = 0. When comparing clockwork against randomsequences, the diﬀerences are consistent with the theoretical ﬁndings of Section 3. WhenΓ (cid:54) = 0, we ﬁnd that the diﬀerences in the probability of convergence to a PNE become moremuted but, overall, best-response dynamics converge to a PNE most frequently under arandom sequence and least frequently under simultaneous updating, with the clockworkcase lying somewhere in between.Figure 6 shows the frequency of convergence to a PNE under clockwork, random, andsimultaneous best-response dynamics. (i) The solid line corresponds to Γ = 0, (ii) thedarker lines correspond to positive correlations, Γ = n − , n − , . . . , ( n − − . , − . , . . . , −

1. We discuss eachcase in turn:(i) Uncorrelated payoﬀs: Γ = 0. The frequency of convergence to a PNE is decreasingin n and m for the clockwork playing sequence and for simultaneous updating.The random playing sequence is diﬀerent. When there are only n = 2 players, therandom playing sequence has the same convergence probability as the clockworkplaying sequence – as argued in Section 3.2. Amiet et al. (2019) proved that therandom sequence best-response dynamic always converges to a PNE if there is onewhen m = 2 and n → ∞ . As argued in Section 3.1, this gives us an unconditionalprobability of convergence of 1 − /e ≈ n >

2. In fact, the random sequence best-response dynamic almost alwaysconverges to a PNE in games that have a PNE (even for relatively small values of n and m ).(ii) Positively correlated payoﬀs: Γ >

0. For any value of m and n convergence tends tobe more likely than with Γ = 0 under all playing sequences. The reason is that un-der positively correlated payoﬀs (especially if the correlation is very strong) thereis a proliferation of pure Nash equilibria (Goldberg et al., 1968, Stanford, 1999,Berg and Weigt, 1999, Rinott and Scarsini, 2000). The best-response dynamic istherefore very likely to converge to one of these equilibria. The only exception isthe simultaneous best-response dynamic in 2-player games with highly correlatedpayoﬀs (e.g. Γ = 0 . , iii) Negatively correlated payoﬀs: Γ <

0. For any value of m and n convergence to aPNE tends to be less likely than with Γ = 0 under all playing sequences. WhenΓ ≈ − Simulation of other learning rules.

In this section we compare the behavior ofbest-response dynamics (under clockwork, random, and simultaneous updating) againstthree classic learning rules: Bush-Mosteller reinforcement learning (Bush and Mosteller,1953), ﬁctitious play (Brown, 1951, Robinson, 1951), and replicator dynamics (Maynard Smith,1982). Our interest in these rules stems from the fact that they are well-known, theyembody diﬀerent behavioral assumptions about learning in games, and they have beencalibrated to human game-play in experiments. The upshot of our simulation results isthat, compared with random and simultaneous updating, the convergence properties ofthe clockwork best-response dynamic most closely match the convergence properties of thethree learning rules.4.3.1.

Description of the learning rules.

Here, we provide high-level descriptions of ourthree learning rules (reinforcement learning, ﬁctitious play, and replicator dynamics) andof the convergence criteria that we use in our simulations. More detailed descriptions ofthe rules and of the convergence criteria are given in Appendix C.Bush-Mosteller reinforcement learning is based on the idea that players are more likelyto play actions that yielded a better payoﬀ in the past. It is a standard learning algorithmthat is used to model game playing under limited information and/or without sophisticatedreasoning, such as in animal learning. Variants of reinforcement learning models have beencalibrated to human game-play in experiments in Arthur (1991), Erev and Roth (1998),and Sarin and Vahid (2001). Under Bush-Mosteller learning, in each period, each playerchooses their action by sampling according to a mixed strategy vector whose evolution isgoverned by reinforcement learning. We assess convergence of these vectors, i.e. whetherthe diﬀerence from one period to the next falls below a threshold and becomes indistin-guishable from sampling noise. Tracking the mixed strategy vectors rather than the actionsplayed makes it possible for us to determine whether the dynamic converges to mixed Nashequilibria.Fictitious play requires more sophistication, as it assumes that the players construct amental model of their opponent. Each player assumes that the empirical distribution of heropponent’s past actions is her mixed strategy, and plays the best response to this belief. lassical experiments with human players in which ﬁctitious play is used as a learning modelare those by Cheung and Friedman (1997) who consider coordination, dominance-solvable,and cyclic 2-player 2-action games. To assess convergence, we follow Fudenberg and Levine(1998) in tracking the convergence of the belief vectors rather than the convergence ofactions played. As with Bush-Mosteller learning, this choice makes it possible to includeconvergence to mixed equilibria in our analysis.The replicator equation is commonly used in ecology and population biology, but it hasalso been viewed as a learning algorithm in which each population trait corresponds toan action (B¨orgers and Sarin, 1997). In our implementation, in each period, each playerchooses their action by sampling according a mixed strategy vector whose evolution isgoverned by the replicator equation. Van Huyck et al. (1995) study two tacit bargaininggames and show that players’ behavior is in line with what they would do if they wereplaying replicator dynamics. (Friedman (1996) comes to a similar conclusion in a largersample of games.) As above, we track the convergence of the mixed strategy vectors. Note,however, that the multi-population replicator dynamic never converges to mixed equilibriain random games. In other words, if the dynamic converges to an equilibrium, each mixedstrategy vector will assign all the mass to a single action.The convergence properties of our three learning algorithms have been studied theoret-ically. It is well-known, for instance, that ﬁctitious play converges to Nash equilibrium incertain classes of games such as potential, zero-sum, and supermodular games (Fudenbergand Levine, 1998). It is also well-known that evolutionarily stable strategies are locallystable ﬁxed points of replicator dynamics (Hofbauer and Sigmund, 1998). However, thereis no general result about the probability of convergence of these learning rules to pureNash equilibria in random games.4.3.2.

Results.

We compare the probability of convergence of best-response dynamics (un-der clockwork, random, and simultaneous updating) against each of the three learningrules. In Figure 7 we do this for uncorrelated payoﬀs so Γ = 0, for n = 2 and 3 players,and for a varying number of actions m . In Figure 8 we allow Γ to vary but ﬁx the numberof actions to m = 5. Finally, in Figure 9 we allow n , m , and Γ to all vary.In our comparisons, we distinguish between cases in which Bush-Mosteller learning andﬁctitious play converge to pure vs. mixed Nash equilibria. In Figures 7 and 8, we use adashed line to indicate the frequency of convergence to pure equilibria only, while a solidline indicates convergence to any type of equilibrium. Thus, for any number of actions, We consider a multi-population version of the replicator dynamic because our randomly drawn payoﬀmatrices are, in general, not symmetric. .00.20.40.60.81.0 C o n v . i n - P l a y e r G a m e s Bush-Mosteller Learning Fictitious Play Replicator Dynamics C o n v . i n - P l a y e r G a m e s Number of Actions

Figure 7.

The frequency of convergence to PNE under best-response dy-namics compared to the frequency of convergence to (mixed and pure) Nashequilibria under the other learning rules for Γ = 0.the vertical distance between the dashed line and the solid line indicates the frequency ofconvergence to mixed equilibria only.As can be seen in Figure 7, the frequency of convergence to a PNE under the clockworkbest-response dynamic most closely tracks the frequency of convergence to PNE under thethree learning rules. In games with 3 or more players, the random sequence best-responsedynamic almost always converges to a PNE if there is one, which is why the orange linerapidly ﬂattens at 1 − /e ≈ .

63. By contrast, the frequency of convergence to pure (ormixed) Nash equilibria under the three learning rules decreases as the number of actions rows. The best-response dynamic under simultaneous updating converges to a PNE tooinfrequently relative to the three learning rules. When also considering convergence to mixed Nash equilibria, we see that the clockworkbest-response dynamic converges too infrequently compared to the three learning rules,especially in the case of ﬁctitious play with two players. However, (i) as the number ofactions increases, it tracks the trend in convergence of the learning rules to mixed or pureNash equilibria better than random best-response dynamics when there are three or moreplayers, and (ii) its frequency of convergence to equilibrium is closer to that of the threelearning rules as compared to the simultaneous best-response dynamic.In Figure 8 we ﬁx the number of actions to m = 5 and vary Γ. The clockwork andrandom sequence best-response dynamics tend to track the other learning rules relativelywell (though the clockwork sequence appears to outperform the random sequence). Note,however, that the best-response dynamic under simultaneous updating converges too infre-quently. This is consistent with our previous observation that even when there are manyNash equilibria (under strongly positively correlated payoﬀs), this version of the dynamicwill not converge as often as the other versions.Figure 9 shows us scatter plots of the frequency of convergence to a PNE for the best-response dynamic (under each playing sequence) against the frequency of convergence topure or mixed equilibria for each of the three learning rules. Each dot corresponds to theconvergence frequency for a particular value of n , m , and Γ. The identity line is plottedfor reference. The frequency of convergence to a PNE for the clockwork sequence best-response dynamic does not perfectly match the frequency of convergence to pure or mixedequilibria for the three learning rules, but it does appear to outperform the other versionsof the best-response dynamic that we have considered in this paper. We emphasize that,in Figure 9, Bush-Mosteller learning and ﬁctitious play are allowed to converge to eitherpure or mixed equilibria. If we had considered convergence to pure equilibria only for eachof our learning rules, then the clockwork sequence best-response dynamic would match theoutcomes of the three learning rules even more closely. There is an oﬀset between the frequency of convergence to a PNE for the replicator dynamic and thefrequency of convergence to a PNE for the clockwork best-response dynamic. As we explain in the appendix,this is mainly due to numerical limitations: the replicator dynamic has inﬁnite memory, so a trajectorymight hit the machine precision limit without having reached a PNE. Note that for the 2-player case with Γ = −

1, the frequency of convergence to any equilibrium (pure ormixed) for ﬁctitious play is close to one. This is consistent with existing theoretical results regarding theconvergence of ﬁctitious play in 2-player zero-sum games (Fudenberg and Levine, 1998). The 418 combinations for the values of n , m , and Γ that we consider are: n = 2 with m = 2 , ..., n = 3with m = 2 , ..., n = 4 with m = 2 , ..., n = 5 with m = 2 , ..., n = 6 with m = 2 , n = 7 with m = 2 ,

3; each for Γ = − , − . , ..., , n − , n − , ..., ( n − C o n v . i n - P l a y e r G a m e s Bush-Mosteller Learning −1 0 1

Fictitious Play −1 0 1

Replicator Dynamics −1 0 1 20.00.20.40.60.81.0 C o n v . i n - P l a y e r G a m e s −1 0 1 2 −1 0 1 2 Value of the Payoff Correlation Parameter Γ Clockwork Best-Response DynamicsRandom Best-Response DynamicsSimultaneous Best-Response Dynamics Convergence to Pure or Mixed Nash EquilibriumConvergence to Pure Nash Equilibrium only

Figure 8.

The frequency of convergence to PNE under best-response dy-namics compared to the frequency of convergence to (mixed and pure) Nashequilibria under the three learning rules, with varying Γ and m = 5.We now compare how the clockwork best-response dynamic performs against the threelearning rules not only in terms of convergence probability but also in terms of the evolutionof play itself. Figure 10 shows a best-response digraph as well as the paths traced by Bush-Mosteller learning, ﬁctitious play, and the replicator dynamic starting from various initialconditions. The paths show the evolution of the mixed strategy vectors for the learningrules, and these appear to follow the directions of the edges in the best-response digraph. The underlying game has binary payoﬀs of zero or one. The trajectories in the action proﬁle space wouldbe distorted if the payoﬀ values had been diﬀerent, though Pangallo et al. (2019) show that the patternsexhibited by the learning algorithms in two player games are quite robust to general payoﬀ values. .00.51.0 Bush-Mosteller LearningR = 0.96 Fictitious PlayR = 0.82 C l o c k w o r k Replicator DynamicsR = 0.92 R = 0.55 R = 0.31 R a nd o m R = 0.46 R = 0.81 R = 0.77 S i m u l t a n e o u s R = 0.91Convergence to Nash Equilibrium of Learning Dynamics C o n v e r g e n ce t o P u r e N a s h E q u ili b r i u m o f B e s t - R e s p o n s e D y n a m i c s Figure 9.

The frequency of convergence to PNE under best-response dy-namics against the frequency of convergence to (mixed or pure) Nash equi-libria under the three learning rules, for varying values of n , m , and Γ and m = 5.These edges also govern the evolution of play in the clockwork best-response dynamic, butthey do not govern the evolution of play under a random sequence. In fact, the randomsequence best-response dynamic would eventually converge to the pure Nash equilibrium UBES

Pl. 3 V VII Pl. 2 III (0,0,0) (0,0,1)Pl. 1 IV (1,1,1) (0,1,0)II Pl. 2 III (1,0,1) (1,1,0)IV (0,1,0) (1,0,1)

Table 1.

Binary 3-player, 2-action game with one pure Nash equilibrium, and two mixedNash equilibria.

Bush-Mosteller Learning

Fictitious Play

Replicator Dynamics

SimulataneousBest-Response Dynamics

Figure 1.

Trajectories of Bush-Mosteller learning, ﬁctitious play, and replicator dynamicsin comparison to clockwork best-response dynamics, and in comparison to simultaneousupdating (last panel) for the same game 1. Blue trajectories converge, red trajectories donot. Blue arrows correspond to pure Nash equilibria, light blue arrows lead there; red arrowscorrespond to cycles, orange arrows lead there. Figure 10.

Trajectories of Bush-Mosteller learning, ﬁctitious play, andreplicator dynamics. Blue trajectories converge to Nash equilibria (pureor mixed), red trajectories do not. Blue arrows correspond to pure Nashequilibria, light blue arrows lead there; red arrows correspond to cycles,orange arrows lead there.in this digraph, given suﬃcient time. So, the paths traced by the clockwork sequence best-response dynamic more closely resemble the paths traced by the three learning algorithmsthan those traced by the random sequence best-response dynamic, and this is true in spiteof the fact the three learning algorithms are most naturally deﬁned as involving simulta-neous updating. Our conclusion regarding how “close” the paths of the clockwork vs.random sequence best-response dynamics are to those exhibited by the learning algorithmsis based on our observations in a number of games. And, our results corroborate Pangalloet al. (2019) who ﬁnd that the prevalence of 2 k -cycles is a good predictor of the frequencyof convergence to Nash equilibrium of the learning algorithms in 2-player random games.More generally our ﬁndings suggest that, to the extent that the learning algorithms are con-sistent with human game-play in randomly-generated games, the clockwork best-responsedynamic could provide a ﬁrst-order approximation for the evolution of play in such games. The paths traced by the learning algorithms are likely to have features resembling elements of the pathstraced by the best-response dynamic under both clockwork and simultaneous updating. The degree towhich the learning algorithms have “memory” is likely to modulate the extent to which the paths resemblethose generated by the best-response dynamic under clockwork vs. simultaneous updating. We do not carry out a comprehensive quantitative analysis of “path closeness” though we expect ourﬁnding regarding clockwork vs. random sequence best-response dynamics to be robust, particularly forlarge games. ppendix A. Proof of Theorem 2

We start by stating two lemmas that will be used to prove Theorem 2. Lemma 1 boundsthe probability that the clockwork sequence best-response dynamic converges to a pureNash equilibrium or to a best-response cycle only after period t . Lemma 2 bounds theprobability that the clockwork sequence best-response dynamic converges to a pure Nashequilibrium by period t . Lemma 1.

Let (cid:104) (cid:126) A , s c (cid:105) be generated according to Algorithm 2. For any t ∈ N , Pr (cid:2) T (cid:104) (cid:126) a ,s c (cid:105) > t (cid:3) ≤ exp (cid:26) − ( (cid:98) tn − (cid:99) ) m n − (cid:27) . Recall that T (cid:104) (cid:126) A ,s c (cid:105) is the period in which (cid:104) (cid:126) A , s c (cid:105) reaches PNE( G n,m ) or a best-responsecycle. Lemma 2.

Let (cid:104) (cid:126) A , s c (cid:105) be generated according to Algorithm 2. For any t ∈ N , (cid:24) tn (cid:25) m n − (cid:18) − nm n − ( (cid:100) tn (cid:101) ) (cid:19) ≤ Pr (cid:104) (cid:104) (cid:126) A , s c (cid:105) reaches PNE( G n,m ) by t (cid:105) ≤ tm n − . We now show how Theorem 2 follows from Lemmas 1 and 2. In what remains of thissection we provide proofs for the lemmas themselves.

Proof of Theorem 2.

Let (cid:104) (cid:126) A , s c (cid:105) be generated according to Algorithm 2. The probabilitythat the s c -best-response dynamic on G n,m converges to a PNE is equal to the probabilitythat (cid:104) (cid:126) A , s c (cid:105) reaches PNE( G n,m ). Let us start with the upper bound. For any t ∈ N ,Pr (cid:104) (cid:104) (cid:126) A , s c (cid:105) reaches PNE( G n,m ) (cid:105) = Pr (cid:104) (cid:104) (cid:126) A , s c (cid:105) reaches PNE( G n,m ) by t (cid:105) + Pr (cid:104) (cid:104) (cid:126) A , s c (cid:105) reaches PNE( G n,m ) after t (cid:105) ≤ Pr (cid:104) (cid:104) (cid:126) A , s c (cid:105) reaches PNE( G n,m ) by t (cid:105) + Pr (cid:2) T (cid:104) (cid:126) a ,s c (cid:105) > t (cid:3) ≤ tm n − + exp (cid:26) − ( (cid:98) tn − (cid:99) ) m n − (cid:27) . (3)Equation (3) follows from Lemmas 1 and 2. Now, set t = n (cid:16)(cid:108)(cid:112) m n − log( m n − ) (cid:109) + 1 (cid:17) . Since x ≤ (cid:100) x (cid:101) ≤ x + 1 and (cid:112) m n − log( m n − ) > m ≥ n ≥

2, we obtain n (cid:16)(cid:112) m n − log( m n − ) + 1 (cid:17) ≤ t ≤ n (cid:16)(cid:112) m n − log( m n − ) + 2 (cid:17) < n (cid:112) m n − log( m n − ) . t follows that(4) tm n − < n (cid:112) m n − log( m n − ) m n − < n / √ log m √ m n − , and that(5) exp (cid:26) − ( (cid:98) tn − (cid:99) ) m n − (cid:27) ≤ m n − < n / √ log m √ m n − . Adding the upper bounds in (4) and (5) yields the desired result.Let us now turn to the lower bound. For any t ∈ N ,Pr (cid:104) (cid:104) (cid:126) A , s c (cid:105) reaches PNE( G n,m ) (cid:105) ≥ Pr (cid:104) (cid:104) (cid:126) A , s c (cid:105) reaches PNE( G n,m ) by t (cid:105) ≥ (cid:24) tn (cid:25) m n − (cid:18) − nm n − ( (cid:100) tn (cid:101) ) (cid:19) . (6)Equation (6) follows from Lemma 2. Now, set t = n (cid:36) √ m n − √ n (cid:37) . Since m ≥ n ≥

2, we obtain √ n √ m n − ≤ t ≤ √ n √ m n − . It follows that(7) 1 − nm n − ( (cid:100) tn (cid:101) ) ≥ , and that(8) (cid:24) tn (cid:25) m n − ≥ √ n √ m n − . Multiplying the lower bounds in (7) and (8) together yields the desired result. (cid:3)

We now turn to the proofs of Lemmas 1 and 2. The main challenge posed by pathsgenerated according to Algorithm 2 is that they have “memory”: whenever player s c ( t )encounters an environment that she has encountered before (i.e. A t − − s c ( t ) = A u − − s c ( u ) and s c ( t ) = s c ( u )) then, in period t , the player must play the same action that she played whenshe previously encountered the environment (i.e. A ts c ( t ) = A us c ( u ) ). This path-dependencecomplicates the analysis of the clockwork best-response dynamic. We therefore study asimpler (random walk) process that is “memoryless” to which we couple a dynamic thatinduces the same distribution over paths as Algorithm 2. The coupled dynamic follows the andom walk process until an environment is encountered by some player for the secondtime and becomes deterministic thereafter.The coupled system is described by Algorithms 3 and 4 below. Algorithm 3

Clockwork random walk (1) Draw an initial profile X uniformly at random from [ m ] n (2) For t ∈ N :(a) Set i = s c ( t ) (b) Set X t − i = X t − − i (c) Independently draw X ti uniformly at random from [ m ] Algorithm 4

Coupled dynamic (1) Set R i ( a − i ) = 0 for all i ∈ [ n ] and a − i ∈ [ m ] n − (2) Set the initial action profile to Y = X (3) For t ∈ N :(a) Set i = s c ( t ) (b) Set Y t − i = Y t − − i (c) If R i ( Y t − − i ) = 0 : set Y ti = X ti and R i ( Y t − − i ) = Y ti If R i ( Y t − − i ) (cid:54) = 0 : set Y ti = R i ( Y t − − i ) (cid:104) (cid:126) X , s c (cid:105) and (cid:104) (cid:126) Y , s c (cid:105) denote paths generated according to Algorithms 3 and 4 respectively.Algorithm 3 is a “clockwork random walk” on the set of action proﬁles [ m ] n . The walkstarts at some randomly drawn initial proﬁle X and, in each period t , moves in direction s c ( t ) to a proﬁle chosen uniformly at random from among the m proﬁles in that direction.A path generated according to this process does not have memory.Algorithm 4 describes the coupled dynamic. The process starts at the same initial proﬁleas the clockwork random walk. For each player i and environment a − i , we set the initial“response” value R i ( a − i ) to zero. The crucial step to how the process evolves is (3c): if theresponse value to the current environment Y t − − i is zero, then the environment was neverencountered before and, in that case, player i ’s response value is set to X ti , the action drawnby the clockwork random walk in period t . If, on the other hand, the response value tothe current environment Y t − − i is non-zero (i.e. the environment was encountered before),then this value is the action that i takes in period t . In other words, (cid:104) (cid:126) Y , s c (cid:105) has the samememory property that is characteristic of paths generated according to Algorithm 2. The pattern of path-dependence is more complex for simultaneous updating than for the clockworkplaying sequence. In the case of simultaneous updating, the choices of the players who have encountered an ecall that Algorithm 2 essentially draws a best-response digraph, selects an initialproﬁle, and then traces a path by traveling along the edges of the digraph starting at theinitial proﬁle and moving in direction s c ( t ) at step t . Under Algorithm 2, the entire best-response digraph is drawn up front. In contrast, Algorithm 4 starts with an empty digraphand then generates its edges in an “online” manner. Nevertheless, both algorithms inducethe same distribution over paths, as summarized in the following remark. Remark 1.

Let (cid:104) (cid:126) A , s c (cid:105) and (cid:104) (cid:126) Y , s c (cid:105) be generated according to Algorithms 2 and 4 respec-tively. Then (cid:104) (cid:126) A , s c (cid:105) and (cid:104) (cid:126) Y , s c (cid:105) have the same distribution.For any path (cid:104) (cid:126) a , s c (cid:105) and for each t ∈ N deﬁne f (cid:104) (cid:126) a ,s c (cid:105) ( t ) := min (cid:110) u ≤ t : a u − − s c ( u ) = a t − − s c ( t ) and s c ( u ) = s c ( t ) (cid:111) . So f (cid:104) (cid:126) a ,s c (cid:105) ( t ) is the ﬁrst period along the path (cid:104) (cid:126) a , s c (cid:105) that player s c ( t ) encounters the envi-ronment a t − − s c ( t ) . Notice that if s c ( t ) encounters a t − − s c ( t ) for the ﬁrst time in period t then f (cid:104) (cid:126) a ,s c (cid:105) ( t ) = t , and if s c ( t ) encountered a t − − s c ( t ) for the ﬁrst time in some period u < t then f (cid:104) (cid:126) a ,s c (cid:105) ( t ) < t . We also deﬁne F (cid:104) (cid:126) a ,s c (cid:105) := inf (cid:8) t ∈ N : f (cid:104) (cid:126) a ,s c (cid:105) ( t ) < t (cid:9) . So F (cid:104) (cid:126) a ,s c (cid:105) is the ﬁrst period in which some player encounters an environment that theyencountered previously along the path. The value F (cid:104) (cid:126) a ,s c (cid:105) is bounded above by 1 + nm n − for any path.By construction, the sequences (cid:126) X and (cid:126) Y must agree at least up to (but not including)the period at which some player encounters an environment for the second time. In thatperiod, under Algorithm 4, the player must play the action determined by their responsefunction evaluated at that environment but, under Algorithm 3, the next action may beany of the available actions for that player. Remark 2 summarizes the key relationshipbetween the clockwork random walk and the coupled dynamic. Remark 2. F (cid:104) (cid:126) X ,s c (cid:105) = F (cid:104) (cid:126) Y ,s c (cid:105) . Example (Illustration of Algorithms 3 and 4) . Figure 11 illustrates the relationship be-tween (cid:104) (cid:126) X , s c (cid:105) and (cid:104) (cid:126) Y , s c (cid:105) by plotting the ﬁrst few elements of (cid:126) X and of (cid:126) Y . Panel (A) environment twice is deterministic but the choices of the players who have never encountered an environmenttwice remain random. For example, suppose A = (1 , ,

1) and A = (2 , , layer 2Player 3Player 1 − − −− −− ◦◦ ◦◦ ◦◦ ◦◦ X , X , X X , X X X , X X (a) ◦◦ ◦◦ ◦◦ ◦◦ Y , Y Y , Y , Y , Y Y Y , Y (b) Period Player (cid:126) X f (cid:104) (cid:126) X ,s c (cid:105) ( t ) (cid:126) Y f (cid:104) (cid:126) Y ,s c (cid:105) ( t )– – (1 , , ) – (1 , , ) –1 1 ( , , ) 1 ( , , ) 12 2 ( , ,

1) 2 ( , ,

1) 23 3 (2 , , ) 3 (2 , , ) 34 1 ( , , ) 4 ( , , ) 45 2 ( , ,

2) 5 ( , ,

2) 56 3 (1 , , ) 6 (1 , , ) 67 1 ( , , ) 1 ( , , ) 18 2 ( , ,

1) 8 ( , ,

1) 2 (c)

Figure 11.

Illustration of Algorithms 3 and 4. Panel (A) shows the ﬁrstelements of a possible path generated according to Algorithm 3 and panel(B) shows the corresponding path generated according to Algorithm 4. Thetable in panel (C) provides details, with environments highlighted in bold.shows the ﬁrst few elements of a possible path generated according to the clockwork ran-dom walk starting at the proﬁle X = (1 , , Y to Y in period 1 in panel (B). The paths are identical up to andincluding period 6. In period 7, however, player 1 encounters the same environment thatshe had encountered in period 1 (namely, players 2 and 3 each choosing action 1). Theﬁrst time that player 1 encountered this environment, she responded by playing action 2,so she must play action 2 again in period 7. In other words, the path must follow the edge anel (A) Panel (B)Period Player (cid:126) a f (cid:104) (cid:126) a ,s c (cid:105) ( t ) (cid:126) a f (cid:104) (cid:126) a ,s c (cid:105) ( t )– – (1 , , ) – (1 , , ) –1 1 ( , , ) 1 ( , , ) 12 2 ( , ,

2) 2 ( , ,

1) 23 3 (1 , , ) 3 (2 , , ) 34 1 ( , , ) 4 ( , , ) 45 2 ( , ,

1) 5 ( , ,

2) 56 3 (1 , , ) 3 (2 , , ) 67 1 ( , , ) 4 ( , , ) 18 2 ( , ,

1) 5 ( , ,

1) 2

Table 2.

First few elements of the paths in panels (A) and (B) of Figure2.that was placed in period 1 and therefore Y = (2 , , X = (1 , ,

1) in period 7and travel to X = (1 , ,

1) in period 8.The path in panel (B) will keep cycling among the action proﬁles on the left-hand sideof the cube forever whereas the path in panel (A) is allowed to freely wander. Note herethat F (cid:104) (cid:126) X ,s c (cid:105) = F (cid:104) (cid:126) Y ,s c (cid:105) = 7. (cid:4) Remark 3. T (cid:104) (cid:126) A ,s c (cid:105) < F (cid:104) (cid:126) A ,s c (cid:105) .Remark 3 notes that any path (cid:104) (cid:126) A , s c (cid:105) generated according to Algorithm 2 must reachPNE( G n,m ) or a best-response cycle before any player encounters an environment for thesecond time. Example (Illustration of Remark 3) . Table 2 shows the values of the function f (cid:104) (cid:126) a ,s c (cid:105) ( t )for the ﬁrst few elements of the paths generated according to the clockwork sequence best-response dynamic in panels (A) and (B) of Figure 2. Recall that the path in panel (A)reaches the pure Nash equilibrium in period 3 and that the path in panel (B) reaches abest-response cycle in period 1. Furthermore, from Table 2 we can see that the value of F (cid:104) (cid:126) a ,s c (cid:105) is 6 for panel (A) and 7 for panel (B). We therefore conclude that, for panel (A), T (cid:104) (cid:126) a ,s c (cid:105) = 3 < F (cid:104) (cid:126) a ,s c (cid:105) , and for panel (B), T (cid:104) (cid:126) a ,s c (cid:105) = 1 < F (cid:104) (cid:126) a ,s c (cid:105) . (cid:4) The lemma below, which concerns paths (cid:104) (cid:126) X , s c (cid:105) that are generated by the clockworkrandom walk, is useful for proving Lemmas 1 and 2. Under the clockwork sequence, player i ∈ [ n ] plays in period h i ( k ) := i + ( k − n for k ∈ N . For any i ∈ [ n ] and any period ∈ N , deﬁne k ∗ i ( t ) := 1 + (cid:22) t − in (cid:23) . So k ∗ i ( t ) is the largest k ∈ N such that h i ( k ) ≤ t . The environments that player i ∈ [ n ]encounters on her turns between (and including) periods 1 and t are given in the sequence( X h i (1) − − i , X h i (2) − − i , ..., X h i ( k ∗ i ( t )) − − i ). Lemma 3 establishes bounds on the probability thatthese environments are all distinct. Lemma 3.

For any i ∈ [ n ] and t ∈ N , − m n − ( (cid:100) tn (cid:101) ) ≤ Pr (cid:104) X h i ( k ) − − i for k ∈ { , ..., k ∗ i ( t ) } are all distinct (cid:105) ≤ exp (cid:26) − ( (cid:98) tn − (cid:99) ) m n − (cid:27) . Proof of Lemma 3.

For any i ∈ [ n ], the environments X h i (1) − − i , X h i (2) − − i , ..., X h i ( k ∗ i ( t )) − − i areindependent because they are disjoint subsets of the draws of the clockwork random walk.Each environment is distributed uniformly on [ m ] n − . Therefore,(9) Pr (cid:104) X h i ( k ) − − i for k ∈ { , ..., k ∗ i ( t ) } are all distinct (cid:105) = k ∗ i ( t ) − (cid:89) k =1 (cid:18) − km n − (cid:19) . If k ∗ i ( t ) > m n − then equation (9) is zero, and the lemma holds trivially ( k ∗ i ( t ) > m n − implies (cid:98) t − in (cid:99) > m n − which, in turn, implies (cid:100) tn (cid:101) > m n − , so the lower bound inthe statement of the lemma is negative and the upper bound is positive). We will thereforeconsider the case in which k ∗ i ( t ) ≤ m n − .We obtain the following upper bound: k ∗ i ( t ) − (cid:89) k =1 (cid:18) − km n − (cid:19) ≤ k ∗ i ( t ) − (cid:89) k =1 exp (cid:26) − km n − (cid:27) ≤ exp (cid:26) − ( k ∗ i ( t ) − m n − (cid:27) ≤ exp (cid:26) − ( (cid:98) tn − (cid:99) ) m n − (cid:27) . The ﬁrst step follows from exp { x } ≥ x for all x . The ﬁnal inequality follows from k ∗ i ( t ) − (cid:98) t − in (cid:99) ≥ (cid:98) t − nn (cid:99) = (cid:98) tn − (cid:99) .We now turn to the lower bound: k ∗ i ( t ) − (cid:89) k =1 (cid:18) − km n − (cid:19) ≥ − k ∗ i ( t ) − (cid:88) k =1 km n − = 1 − m n − k ∗ i ( t ) ≥ − m n − ( (cid:100) tn (cid:101) ) . The ﬁrst step is an application of the Weierstrass product inequality. The ﬁnal inequalityfollows from the fact that k ∗ i ( t ) = 1 + (cid:98) t − in (cid:99) ≤ (cid:98) t − n (cid:99) = (cid:100) tn (cid:101) . (cid:3) roof of Lemma 1. T (cid:104) (cid:126) A ,s c (cid:105) > t is the event that (cid:104) (cid:126) A , s c (cid:105) reaches PNE( G n,m ) or a best-response cycle only after period t . Remark 3 implies that F (cid:104) (cid:126) A ,s c (cid:105) > t . SoPr (cid:104) T (cid:104) (cid:126) A ,s c (cid:105) > t (cid:105) ≤ Pr (cid:104) F (cid:104) (cid:126) A ,s c (cid:105) > t (cid:105) . By Remarks 1 and 2,Pr (cid:104) F (cid:104) (cid:126) A ,s c (cid:105) > t (cid:105) = Pr (cid:104) F (cid:104) (cid:126) Y ,s c (cid:105) > t (cid:105) = Pr (cid:104) F (cid:104) (cid:126) X ,s c (cid:105) > t (cid:105) . Now, let us focus on the path (cid:104) (cid:126) X , s c (cid:105) and on player 1. The environments that player 1 facesbetween periods 1 and t are given in the sequence ( X h (1) − − , X h (2) − − , ..., X h ( k ∗ ( t )) − − ). Theevent F (cid:104) (cid:126) X ,s c (cid:105) > t implies that the environments in this sequence are all distinct. HencePr (cid:104) F (cid:104) (cid:126) X ,s c (cid:105) > t (cid:105) ≤ Pr (cid:104) X h ( k ) − − for k ∈ { , ..., k ∗ ( t ) } are all distinct (cid:105) ≤ exp (cid:26) − ( (cid:98) tn − (cid:99) ) m n − (cid:27) , where the ﬁnal step follows from Lemma 3. (cid:3) To prove Lemma 2, we introduce Algorithm 5 which describes a dynamic that is alsocoupled with the clockwork random walk. (cid:104) (cid:126) Z , x , s c (cid:105) denotes a path generated according toAlgorithm 5. Algorithm 5

Coupled dynamic with sink x (1) Set R i ( a − i ) = 0 for all i ∈ [ n ] and a − i ∈ [ m ] n − (2) Set R i ( x − i ) = x i for all i ∈ [ n ] (3) Set the initial action profile to Z = X (4) For t ∈ N :(a) Set i = s c ( t ) (b) Set Z t − i = Z t − − i (c) If R i ( Z t − − i ) = 0 : set Z ti = X ti and R i ( Z t − − i ) = Z ti If R i ( Z t − − i ) (cid:54) = 0 : set Z ti = R i ( Z t − − i )Algorithm 5 is identical to Algorithm 4 except that for some particular proﬁle x thealgorithm is initialized with R i ( x − i ) = x i for all i ∈ [ n ]. Algorithm 5 therefore initializesthe digraph with the directed edges ( x (cid:48) i , x − i ) to ( x i , x − i ) for all i and x (cid:48) i (cid:54) = x i , so that theproﬁle x is a sink. In the remaining steps, the algorithm selects a random initial proﬁleand starts tracing a path by traveling along edges that (other than those edges alreadypointing to x in the initialization) are generated in an online manner. The paths tracedby the clockwork random walk and this coupled dynamic with a sink at x must agree atleast up to (but not including) the period at which either an environment is encounteredby a player for the second time or the environment is x − i for some player i . layer 2Player 3Player 1 − − −− −− ◦◦ ◦◦ ◦◦ ◦◦ X , X X , X X X , X (a) ◦◦ ◦◦ ◦◦ ◦◦ Z Z , Z Z Z x (b) Figure 12.

Illustration of Algorithms 3 and 5. Panel (A) shows the ﬁrstelements of a possible path generated according to Algorithm 3 and panel(B) shows the corresponding path generated according to Algorithm 5.

Example (Illustration of Algorithms 3 and 5) . Figure 12 illustrates the relationship be-tween (cid:104) (cid:126) X , s c (cid:105) and (cid:104) (cid:126) Z , x , s c (cid:105) by plotting the ﬁrst few elements of (cid:126) X and of (cid:126) Z . Panel (A)shows the ﬁrst few elements of a possible path generated according to the clockwork randomwalk starting at the proﬁle X = (1 , , x is made a sink (with the red edges placed in period 0). The remainingdirected edges are numbered according to the period in which they are ﬁrst placed.The clockwork random walk takes the path (cid:126) Z to Z = (1 , ,

2) in period 4. While therandom walk can continue wandering through the action proﬁles according to the clockworksequence, the path (cid:126) Z must end up at Z = x in period 5. (cid:4) Remark 4.

Let (cid:104) (cid:126) A , s c (cid:105) and (cid:104) (cid:126) Z , x , s c (cid:105) be generated according to Algorithms 2 and 5 re-spectively. Then the distribution of (cid:104) (cid:126) A , s c (cid:105) conditional on x ∈ PNE( G n,m ) is the same asthe distribution of (cid:104) (cid:126) Z , x , s c (cid:105) . Proof of Lemma 2.

For any t ∈ N ,Pr (cid:104) (cid:104) (cid:126) A , s c (cid:105) reaches PNE( G n,m ) by t (cid:105) = (cid:88) x ∈ [ m ] n Pr (cid:104) (cid:104) (cid:126) A , s c (cid:105) reaches { x } by t and x ∈ PNE( G n,m ) (cid:105) = (cid:88) x ∈ [ m ] n Pr (cid:104) (cid:104) (cid:126) A , s c (cid:105) reaches { x } by t (cid:12)(cid:12)(cid:12) x ∈ PNE( G n,m ) (cid:105) Pr [ x ∈ PNE( G n,m )] (cid:88) x ∈ [ m ] n Pr (cid:104) (cid:104) (cid:126) Z , x , s c (cid:105) reaches { x } by t (cid:105)(cid:124) (cid:123)(cid:122) (cid:125) (10 . Pr [ x ∈ PNE( G n,m )] (cid:124) (cid:123)(cid:122) (cid:125) (10 . (10)The ﬁrst step follows from the deﬁnition of reaching a pure Nash equilibrium. The ﬁnalstep follows from Remark 4; namely, the probability that (cid:104) (cid:126) A , s c (cid:105) reaches { x } by period t conditional on x ∈ PNE( G n,m ) is equal to the probability that (cid:104) (cid:126) Z , x , s c (cid:105) reaches { x } byperiod t . We now analyze the expressions (10.1) and (10.2).For (10.2), since payoﬀs are drawn identically and independently according to the atom-less distribution P , we have that(11) Pr [ x ∈ PNE( G n,m )] = n (cid:89) i =1 Pr (cid:20) U i ( x ) ≥ max x (cid:48) i ∈ [ m ] U i (cid:0) x (cid:48) i , x − i (cid:1)(cid:21) = 1 m n . We now ﬁnd upper and lower bounds on (10.1) by relating (cid:104) (cid:126) Z , x , s c (cid:105) to the clockworkrandom walk path (cid:104) (cid:126) X , s c (cid:105) . We start with the upper bound. Notice that (cid:104) (cid:126) Z , x , s c (cid:105) cannotreach { x } by period t unless X τ − − s c ( τ ) = x − s c ( τ ) for some τ ≤ t . ThereforePr (cid:104) (cid:104) (cid:126) Z , x , s c (cid:105) reaches { x } by t (cid:105) ≤ Pr (cid:34) t (cid:91) τ =1 { X τ − − s c ( τ ) } = x − s c ( τ ) } (cid:35) ≤ t (cid:88) τ =1 Pr (cid:104) X τ − − s c ( τ ) = x − s c ( τ ) (cid:105) = tm n − . (12)The ﬁnal step follows from the fact that X τ − − s c ( τ ) consists of n − m ].We now turn to the lower bound. If F (cid:104) (cid:126) X ,s c (cid:105) > t and X τ − − s c ( τ ) = x − s c ( τ ) for some τ ≤ t then (cid:104) (cid:126) Z , x , s c (cid:105) must reach { x } by period t . In other words, if no environments are repeatedfor any player and the environment is x − i for some player i by period t , then (cid:104) (cid:126) Z , x , s c (cid:105) must reach { x } by period t . Therefore,Pr (cid:104) (cid:104) (cid:126) Z , x , s c (cid:105) reaches { x } by t (cid:105) ≥ Pr (cid:34) t (cid:91) τ =1 { X τ − − s c ( τ ) = x − s c ( τ ) } and F (cid:104) (cid:126) X ,s c (cid:105) > t (cid:35) = Pr (cid:34) t (cid:91) τ =1 { X τ − − s c ( τ ) = x − s c ( τ ) } (cid:12)(cid:12)(cid:12)(cid:12) F (cid:104) (cid:126) X ,s c (cid:105) > t (cid:35) Pr (cid:104) F (cid:104) (cid:126) X ,s c (cid:105) > t (cid:105) . (13) o bound the ﬁrst term in (13), notice that X h ( k ) − − = x − for some k ∈ { , ..., k ∗ ( t ) } implies that X τ − − s c ( τ ) = x − s c ( τ ) for some τ ≤ t . ThereforePr (cid:34) t (cid:91) τ =1 { X τ − − s c ( τ ) = x − s c ( τ ) } (cid:12)(cid:12)(cid:12)(cid:12) F (cid:104) (cid:126) X ,s c (cid:105) > t (cid:35) ≥ Pr  k ∗ ( t ) (cid:91) k =1 { X h ( k ) − − = x − } (cid:12)(cid:12)(cid:12)(cid:12) F (cid:104) (cid:126) X ,s c (cid:105) > t  = k ∗ ( t ) (cid:88) k =1 Pr (cid:104) X h ( k ) − − = x − (cid:12)(cid:12)(cid:12) F (cid:104) (cid:126) X ,s c (cid:105) > t (cid:105) = k ∗ ( t ) (cid:88) k =1 m n − = (cid:24) tn (cid:25) m n − . (14)The ﬁrst summation follows from the fact that since all the environments for player 1are distinct, the events in the union are mutually exclusive. The next step follows fromthe fact that our process is invariant under symmetry. So for any k ∈ { , ..., k ∗ ( t ) } andfor all x − and y − , Pr[ X h ( k ) − − = x − | F (cid:104) (cid:126) X ,s c (cid:105) > t ] = Pr[ X h ( k ) − − = y − | F (cid:104) (cid:126) X ,s c (cid:105) > t ]which implies that Pr[ X h ( k ) − − = x − | F (cid:104) (cid:126) X ,s c (cid:105) > t ] = m n − . The last step follows from k ∗ ( t ) = 1 + (cid:98) t − n (cid:99) = (cid:100) tn (cid:101) .To bound the second term in (13), notice that if for each i ∈ [ n ] the environments X h i (1) − − i , X h i (2) − − i , ..., X h i ( k ∗ i ( t )) − − i are all distinct then F (cid:104) (cid:126) X ,s c (cid:105) > t . ThereforePr[ F (cid:104) (cid:126) X ,s c (cid:105) > t ] ≥ Pr  (cid:92) i ∈ [ n ] { X h i ( k ) − − i for k ∈ { , ..., k ∗ i ( t ) } are all distinct }  = 1 − Pr  (cid:91) i ∈ [ n ] { X h i ( k ) − − i for k ∈ { , ..., k ∗ i ( t ) } are not all distinct }  ≥ − (cid:88) i ∈ [ n ] Pr (cid:104) X h i ( k ) − − i for k ∈ { , ..., k ∗ i ( t ) } are not all distinct (cid:105) ≥ − nm n − ( (cid:100) tn (cid:101) ) . (15)The ﬁnal step follows from Lemma 3.Gathering the results (10), (11), (12), (14), and (15) together yields the desired conclu-sion. (cid:3) ppendix B. Proofs of Theorem 3, Proposition 3, and Theorem 4

In this section, we focus exclusively on the clockwork sequence best-response dynamicin 2-player games. We ﬁrst explicitly work out the exact probability that a path generatedby Algorithm 4 reaches a 2 k -cycle in period t . We then turn to the asymptotic behaviorof our formulas of interest.Recall the deﬁnitions of h i ( k ) and k ∗ i ( t ) preceding Lemma 3. On a path (cid:104) (cid:126) X , s c (cid:105) generatedby Algorithm 3, the environments that player i ∈ { , } encounters on her turns between(and including) periods 1 and t are given in the sequence ( X h i (1) − − i , X h i (2) − − i , ..., X h i ( k ∗ i ( t )) − − i ).B.1. Proof of Theorem 3.

In order for a path generated by Algorithm 4 to reach neithera pure Nash equilibrium nor a best-response cycle by period t it must be the case that,by period t + 1 (inclusive), no player encounters an environment that they have seenbefore, and the action taken by player s c ( t + 1) in period t + 1 must not repeat any ofthe environments encountered by period t by player s c ( t ). To put it diﬀerently, for each i ∈ { , } it must be the case that the environments ( X h i (1) − − i , X h i (2) − − i , ..., X h i ( k ∗ i ( t +1)) − − i )are all distinct, and the action X t +1 s c ( t +1) taken by player s c ( t + 1) in period t + 1 is distinctfrom each of the environments encountered by period t by player s c ( t ). It follows that theprobability that the clockwork sequence best-response dynamic converges to neither a pureNash equilibrium nor a best-response cycle by period t for t ∈ [2 m ] is(16) t (cid:89) i =1 (cid:18) − m (cid:22) i (cid:23)(cid:19) . In order for the path to reach a pure Nash equilibrium in period t , for each i ∈ { , } it must be the case that the environments ( X h i (1) − − i , X h i (2) − − i , ..., X h i ( k ∗ i ( t +1)) − − i ) are alldistinct, and the action X t +1 s c ( t +1) taken by player s c ( t + 1) in period t + 1 is equal to theenvironment X t − − s c ( t ) encountered by player s c ( t ) in period t . Therefore, the probabilitythat the clockwork sequence best-response dynamic converges to a pure Nash equilibriumin period t ∈ [2 m ] is(17) 1 m t (cid:89) i =1 (cid:18) − m (cid:22) i (cid:23)(cid:19) . More generally, in order for the path to reach a 2 k -cycle in period t , for each i ∈ { , } it must be the case that the environments ( X h i (1) − − i , X h i (2) − − i , ..., X h i ( k ∗ i ( t +2 k − − − i ) are alldistinct, and the action X t +2 k − s c ( t +2 k − taken by player s c ( t + 2 k −

1) in period t + 2 k − X t − − s c ( t ) encountered by player s c ( t ) in period t . Therefore, the robability that the clockwork sequence best-response dynamic converges to a 2 k -cycle for k ∈ [ m ] in period t ∈ [2( m − k + 1)] is1 m t +2( k − (cid:89) i =1 (cid:18) − m (cid:22) i (cid:23)(cid:19) . (18)This is precisely formula (1) given in the statement of Theorem 3. Notice that setting k = 1 in (18) recovers formula (17).The probability that the clockwork sequence best-response dynamic reaches a 2 k -cyclefor k ∈ [ m ] is obtained by summing (18) over all t ∈ [2( m − k + 1)]:(19) 1 m m − k +1) (cid:88) t =1 t +2( k − (cid:89) i =1 (cid:18) − m (cid:22) i (cid:23)(cid:19) . This is expression (2) in Corollary 2.

Example (Illustration of formulas (16) to (18)) . To illustrate our results we schematicallymap out all the possible paths of the clockwork sequence best-response dynamic in 2-player m -action games in the tree shown in Figure 13. The initial proﬁle, in period 0, is arbitrarilyset to (1 , , , (1 , , (1 , , (2 , , (2 , , , (2 , , (2 , , (1 , , (1 , , , (1 , , (1 , , (2 , , (2 ,

3) or (1 , , (2 , , (2 , , (1 , , (1 ,

3) is givenby formula (16) with t = 3.Let us consider the probability of reaching a pure Nash equilibrium in period 4. In Figure13, we must travel along the sequence (1 , , (1 , , (1 , , (2 , , (2 , , (2 ,

3) or the sequence(1 , , (2 , , (2 , , (1 , , (1 , , (1 , ,

3) wasreached in period 4. The probability of traveling along (1 , , (1 , , (1 , , (2 , , (2 , , (2 , , , (2 , , (2 , , (1 , , (1 , , (1 ,

3) is given by formula (17) with t = 4. et us consider the probability of reaching a 4-cycle in period 2. In Figure 13 we musttravel down the tree along the sequence (1 , , (1 , , (1 , , (2 , , (2 , , (1 ,

3) or the se-quence (1 , , (2 , , (2 , , (1 , , (1 , , (2 , ,

2) in period 2, and this is the ﬁrst action proﬁle of our 4-cycle.A further 2 k − , , , (1 , , (1 , , (2 , , (2 , , (1 ,

3) or (1 , , (2 , , (2 , , (1 , , (1 , , (2 ,

3) is given by for-mula (18) with t = k = 2. (cid:4) B.2.

Proof of Theorem 4.

To prove Theorem 4 we now work out the asymptotic behaviorof formula (16). Note that (16) can be written as t (cid:89) i =1 (cid:18) − m (cid:22) i (cid:23)(cid:19) =  m ! ( m − t +12 )! m t +1 if t is odd (cid:16) m − t m (cid:17) m ! ( m − t )! m t if t is even . Using Stirling’s formula which states that n ! ∼ √ πn · n n exp {− n } , as n → ∞ , we obtain (20) m ! ( m − t +12 )! m t +1 ∼ (cid:32) m − t +12 m (cid:33) t − m exp {− ( t + 1) } . and(21) (cid:18) m − t m (cid:19) m ! ( m − t )! m t ∼ (cid:18) m − t m (cid:19) t − m exp {− t } . whenever m − t → ∞ . Taking a logarithm of the last expression, − t + ( t − m ) ln (cid:18) − m t (cid:19) = − t + ( t − m ) (cid:18) − tm − t m + O (cid:18) t m (cid:19)(cid:19) = − t m + O (cid:18) t m (cid:19) . f ( n ) ∼ g ( n ) denotes f ( n ) /g ( n ) → n → ∞ . f ( n ) = O ( g ( n )) if there is M > N such that | f ( n ) | ≤ Mg ( n ) for all n ≥ N . = 0(1 , , , m (1 , , m (2 , , m (2 , m (2 , , m (2 , m (3 , , m (3 , m (3 , m (3 , ,

4) (2 ,

4) (3 , ... − m − m − m − m − m m (2 , ,

1) (2 , ,

2) (1 , ,

1) (1 ,

2) (1 , ,

3) (1 ,

3) (3 , ,

1) (3 ,

2) (3 ,

3) (3 , ,

4) (2 ,

4) (3 , ... − m Figure 13.

Illustration of possible paths for 2-player m -action games. Wearbitrarily set the initial action proﬁle to be (1 ,

1) in period 0. In period1, player 1 either plays action 1 (left branch) or some other action (rightbranch) which we arbitrarily call action 2. Player 2 then responds in period2, and so on. All red leaves are Nash equilibria and all blue leaves areproﬁles that belong to best-response cycles.Provided that t = o ( m / ), the second term goes to zero and therefore equation (21)behaves asymptotically like exp {− t / (4 m ) } . An identical argument shows that, under thesame conditions, (20) is also asymptotically exp {− t / (4 m ) } . Hence,(22) t (cid:89) i =1 (cid:18) − m (cid:22) i (cid:23)(cid:19) ∼ exp (cid:26) − t m (cid:27) . This completes the proof of Theorem 4. Note that approximation (22) holds uniformly inthe range [1 , o ( m / )]. .3. Proof of Proposition 3.

To prove Proposition 3, we now turn to the asymptoticbehavior of (19). Let T = T ( m ) satisfy T = o ( m / ) and k = o ( T ). We assume that T ≥ m / ln( m ) so that T is not too small, and we split the summation in (19) into two ranges: t ≤ T and t > T . Since (22) holds uniformly in our ﬁrst range, we have1 m T (cid:88) t =1 t +2( k − (cid:89) i =1 (cid:18) − m (cid:22) i (cid:23)(cid:19) ∼ m T (cid:88) t =1 exp (cid:26) − ( t + 2( k − m (cid:27) . We now approximate the summation on the right-hand side with an integral. Firstly, notethat 1 m (cid:90) T +11 exp (cid:26) − ( t + 2( k − m (cid:27) dt = (cid:114) m (cid:90) T +1+2( k − √ m k − √ m exp (cid:26) − x (cid:27) dx ∼ (cid:114) m (cid:90) ∞ k − √ m exp (cid:26) − x (cid:27) dx = 2 (cid:114) πm (cid:18) − Φ (cid:18) k − √ m (cid:19)(cid:19) , (23)where the ﬁrst step uses the transformation x = ( t + 2( k − / √ m . Furthermore,1 m (cid:90) exp (cid:26) − ( t + 2( k − m (cid:27) dt ≤ m , which goes to zero faster than (23). Since (cid:90) T +11 f ( t ) dt ≤ T (cid:88) t =1 f ( t ) ≤ (cid:90) T f ( t ) dt ≤ (cid:90) T +11 f ( t ) dt + (cid:90) f ( t ) dt, for any positive and decreasing function f ( · ), it follows that1 m T (cid:88) t =1 exp (cid:26) − ( t + 2( k − m (cid:27) ∼ (cid:114) πm (cid:18) − Φ (cid:18) k − √ m (cid:19)(cid:19) . t remains for us to show that the summation (19) over the second range is negligible.Since exp { x } ≥ x and (cid:98) x (cid:99) > x − x , we obtain the following upper bound:1 m m − k +1) (cid:88) t = T +1 t +2( k − (cid:89) i =1 (cid:18) − m (cid:22) i (cid:23)(cid:19) ≤ m m − k +1) (cid:88) t = T +1 T +1+2( k − (cid:89) i =1 (cid:18) − m (cid:22) i (cid:23)(cid:19) ≤ m m − k +1) (cid:88) t = T +1 exp  − m T +1+2( k − (cid:88) i =1 (cid:18) i − (cid:19) ≤ m − k − T + 1 m exp (cid:26) − m ( T + 2( k − − (cid:27) . This expression is small compared to the other half of the sum.

Appendix C. Descriptions of the learning rules

We compare best-response dynamics to three more complicated learning dynamics:

Bush-Mosteller learning as an example of reinforcement learning , ﬁctitious play as anexample of belief learning , and replicator dynamics as the most important equation inevolutionary biology.There are unavoidable arbitrary choices in the speciﬁcation of the learning dynamics,the values of the parameters and the criteria that determine convergence to mixed or pureNash equilibria, however the overall picture is robust to the speciﬁc implementation forall sensible parametrizations. The dynamics here are all in discrete-time or had to beconverted to discrete-time.We now describe the learning dynamics in detail, as well as the convergence criteria andour choice of parameters. We use similar convergence criteria to those used by Pangalloet al. (2019) in the two-player case.C.1. Reinforcement learning.

We consider the Bush-Mosteller learning algorithm as anexample of reinforcement learning (Bush and Mosteller, 1953), using the speciﬁcations inMacy and Flache (2002) and Galla and Farmer (2013).Each player has an aspiration level that corresponds to a weighted average of the payoﬀsthat the player has received while playing the game. Each player then associates a level ofsatisfaction with each action, which is positive if the payoﬀ the player gets when choosingthis action is higher than the player’s aspiration level, and negative otherwise. The prob-ability of playing an action is increased if the satisfaction was positive and decreased if itwas negative. ormal description. In each period, each player i ∈ [ n ] chooses an action x ∈ [ m ] with prob-ability p ti ( x ). The evolution of the mixed strategy of each player i , p ti = ( p ti (1) , ..., p ti ( m )),is governed by reinforcement learning, as we describe below. The learning rule generates amapping from p t = ( p t , ..., p tn ) to p t +1 .Let ℵ ti be the aspiration level of player i in period t . It evolves according to ℵ t +1 i = (1 − α ) ℵ ti + αu i ( x, a t − i ) . when ( x, a t − i ) is the proﬁle played in period t .The updated aspiration level is therefore a weighted average of the payoﬀ received attime t and the player’s past aspiration level. Payoﬀs received in the past are discountedby a factor of (1 − α ), where α stands for the rate of memory loss . Player i ’s satisfaction with action x ∈ [ m ] in period t is deﬁned by σ ti ( x ) = u i ( x, a t − i ) − ℵ ti max y ∈ [ m ] n | u i ( y ) − ℵ ti | . Note that σ ti ( x ) lies within − i chooses action x in period t , player i associates positive satisfaction with this action if the payoﬀ they received in period t ishigher than the player’s aspiration level.If player i played action x in period t , then the probability that i plays x again in period t + 1 is updated as(24) p t +1 i ( x ) =  p ti ( x ) + βσ ti ( x )(1 − p ti ( x )) σ ti ( x ) ≥ p t +1 i ( x ) + βσ ti ( x ) p ti ( x ) σ ti ( x ) < , and the probability of choosing a diﬀerent action y (cid:54) = x in period t + 1 is updated as(25) p t +1 i ( y ) =  p ti ( y ) − βσ ti ( x ) p ti ( y ) σ ti ( x ) ≥ p ti ( y ) − βσ ti ( x ) p ti ( x ) p ti ( y )1 − p ti ( x ) σ ti ( x ) < . In the equations above, β represents the learning rate. Positive satisfaction for action x leads to an increase of the probability to choose action x , negative satisfaction has theopposite eﬀect. Note that the actions of all players but i , a t − i , only enter the learningprocess of i through the payoﬀ that i receives from playing action x against a t − i , u i ( x, a t − i ).Player i need not know the actions of the other players. Rather i needs only to keep track ofher own actions and of her own payoﬀs in order to update her aspiration, satisfaction, andmixed strategy vector. This implementation of the Bush-Mosteller dynamic is therefore aclassic example of reinforcement learning in which limited information is required. onvergence criteria. To assess convergence, we check whether p t converges to a ﬁxedpoint of the mapping p t (cid:55)→ p t +1 . This choice makes it possible to also assess convergenceto mixed Nash equilibria, which would be missed if we only looked at the actions played.Of course, because players play actions by randomly sampling from their mixed strategyvectors, the evolution of p t is stochastic, and so we need to allow for noise in our assessmentof convergence. Additionally, p t never reaches a ﬁxed point of p t (cid:55)→ p t +1 within simulationtime. The reason for that is that equations (24) and (25) have no memory loss term, sothe probability for playing an unsuccessful action keeps decreasing over time without everreaching a steady state. To address these issues, we use the same heuristic as in Pangalloet al. (2019):(1) Only consider the last 20% time steps, to avoid transient eﬀects.(2) Only keep the actions that have been played with a probability larger than 0 . .

01, the simulationrun will be regarded as non-convergent, otherwise as convergent. We identify aconvergent simulation run as having reached a pure Nash equilibrium if each beliefvector p ti has a component that is larger than 0 . Parameter values.

We perform the simulations with α = 0 . β = 0 .

5, but could notobserve much sensitivity to the parameter values. We simulate for 5000 time steps withrandomly chosen initial conditions.C.2.

Fictitious play.

Fictitious play is an example of belief learning and was ﬁrst pro-posed as an algorithm to calculate Nash equilibria. It was later interpreted as a learningalgorithm (Brown, 1951, Robinson, 1951). Each player takes the empirical distribution ofactions taken by the opponents as an estimate of their mixed strategies, calculates the ex-pected payoﬀ of each action based on this estimate, and chooses the (pure) action with thehighest expected payoﬀ. Variants include weighted ﬁctitious play (Fudenberg and Levine,1998), in which the players discount opponents’ past actions and give higher weight to morerecent actions, and stochastic ﬁctitious play , where the players choose the best performingaction with a certain probability, and the other actions with a smaller probability. ormal description. In period t ≥

0, each player’s belief p tj ( x ) that player j will play action x in period t + 1 is given by the fraction of times that player j chose action x in the past: p tj ( x ) = 1 t + 1 t (cid:88) τ =0 [ a τj = x ] , where [ a τj = x ] = 1 if j played action x in period τ and [ a τj = x ] = 0 otherwise. In eachperiod, each player i then deterministically selects the action with the highest expectedpayoﬀ given their belief about their opponents, p t − i : a t +1 i = arg max x ∈ [ m ] (cid:88) x − i ∈ [ m ] n − u i ( x, x − i ) (cid:89) j ∈ [ n ] \{ i } p tj ( x j ) . Convergence criteria.

To study convergence to mixed equilibria, we follow Fudenberg andLevine (1998) in considering convergence of beliefs p t = ( p t , ..., p tn ) rather than convergenceof the actions played. Our convergence criteria for p t are the same as those described abovefor reinforcement learning. A minor diﬀerence is that we identify a convergent simulationrun as having reached a pure Nash equilibrium if each belief vector p ti has a componentthat is larger than 0 . Parameter values.

Fictitious play has no parameters.C.3.

Replicator dynamics.

Replicator dynamics are the most basic evolutionary model(Maynard Smith, 1982). They play an important role in describing evolutionary gamedynamics and population dynamics. Following the interpretation in B¨orgers and Sarin(1997), we view replicator dynamics as a learning algorithm for individual players. Be-cause our randomly generated payoﬀ matrices are not necessarily symmetric, we considerthe multi-population version of the replicator dynamic (Taylor and Nowak, 2006, Gokhaleand Traulsen, 2010).

Formal description.

In each period t , each player i chooses an action x with probability p ti ( x ), and the probability vector p ti = ( p ti (1) , ..., p ti ( m )) evolves according to the replicatorequation, as described below.When all other players sample their actions according to p t − i , the expected payoﬀ ofplayer i when choosing action x in period t is˜ u i ( x, p t − i ) = (cid:88) x − i ∈ [ m ] n − u i ( x, x − i ) (cid:89) j ∈ [ n ] \{ i } p tj ( x j ) . Replicator dynamics are also obtained as the continuous time limit of discrete time reinforcement-learningalgorithms (B¨orgers and Sarin, 1997, Sato and Crutchﬁeld, 2003, Tuyls et al., 2006, Pangallo et al., 2017). he average expected payoﬀ for player i is then¯ u i ( p t ) = (cid:88) x ∈ [ m ] ˜ u i ( x, p t − i ) p ti ( x ) . For our simulation, the usual continuous replicator equation˙ p ti ( x ) = p ti ( x ) (cid:0) ˜ u i ( x, p t − i ) − ¯ u ti ( p t ) (cid:1) , must be discretized. We use the discretization proposed in Maynard Smith (1982), where δ takes small values: p t +1 i ( x ) = p ti ( x ) 1 + δ ˜ u i ( x, p t − i )1 + δ ¯ u i ( p t ) . Convergence criteria.

Similarly to the other learning rules, we consider the convergence of p t = ( p t , ..., p tn ). There are several technical problems associated with simulating replicatordynamics, including the fact that all stable ﬁxed points are on the boundary of the strategyspace and therefore cannot be reached in ﬁnite simulation time, and that the period of cyclesincreases over time, due to the inﬁnite memory of the process.Additionally, we must stop the simulation run as soon as one component of one of theplayers’ mixed strategy vector reaches the machine precision limit and is taken to be zeroby the simulator. Indeed, by the properties of replicator dynamics, if p ti ( x ) reaches zero, itremains at zero forever. However, it often happens in simulations of replicator dynamicsthat an action whose probability had been decreasing for a long time suddenly becomesadvantageous due to changes in what other players are playing, leading to a reversal ofthe dynamics. This reversal will not be reﬂected in our simulations if p ti ( x ) is stuck atzero due to the machine precision limit being reached, leading to an unfaithful numericalrepresentation of the dynamics.To address all these issues, and to speciﬁcally account for the behavior of replicatordynamics, we choose the following simulation criteria:(1) Only consider the last 20% time steps.(2) For each player, ﬁnd the action with the highest probability and verify whetherthis probability has been increasing over the full time interval.(3) Check that the probabilities of all other actions have been decreasing.(4) If conditions 2-3 are satisﬁed for all players, identify the solution run as convergent.Note that the issue of machine precision unavoidably creates biases when the replicatordynamics take long to reach an attractor, be it a ﬁxed point or a cycle. In particular,it could lead us to consider as non-convergent a simulation run that would eventuallyconverge, because the replicator dynamics hit the machine precision limit while still in a ransient phase. Empirically, it turns out that transient dynamics are longer as the numberof players or actions increases, thus these biases are likely to be more serious in “large”games than in games with just a few actions and players. Parameter values.

We choose δ = 0 . eferences Alon, N., K. Rudov, and L. Yariv (2020). Dominance solvability in random games. https://lyariv.mycpanel.princeton.edu/papers/DominanceSolvability.pdf .Amiet, B., A. Collevecchio, and M. Scarsini (2019). Pure Nash equilibria and best-response dynamics inrandom games. arXiv:1905.10758.Arratia, R., L. Goldstein, L. Gordon, et al. (1989). Two moments suﬃce for Poisson approximations: theChen-Stein method.

The Annals of Probability 17 (1), 9–25.Arthur, W. B. (1991). Designing economic agents that act like human agents: A behavioral approach tobounded rationality.

The American economic review 81 (2), 353–359.Babichenko, Y. (2013). Best-reply dynamics in large binary-choice anonymous games.

Games and EconomicBehavior 81 , 130–144.Berg, J. and M. Weigt (1999). Entropy and typical properties of nash equilibria in two-player games.

EPL(Europhysics Letters) 48 (2), 129–135.Blume, L. E. et al. (1993). The statistical mechanics of strategic interaction.

Games and EconomicBehavior 5 (3), 387–424.B¨orgers, T. and R. Sarin (1997). Learning through reinforcement and replicator dynamics.

Journal ofEconomic Theory 77 (1), 1–14.Boucher, V. (2017). Selecting equilibria using best-response dynamics.

Economics Bulletin 37 (4), 2728–2734.Brown, G. W. (1951).

Iterative solutions of games by ﬁctitious play . Activity Analysis of Production andAllocation. New York: Wiley.Bush, R. R. and F. Mosteller (1953). A stochastic model with applications to learning.

The Annals ofMathematical Statistics 24 (4), 559–585.Candogan, O., A. Ozdaglar, and P. A. Parrilo (2013). Dynamics in near-potential games.

Games andEconomic Behavior 82 , 66–90.Cheung, Y.-W. and D. Friedman (1997). Individual learning in normal form games: Some laboratoryresults.

Games and Economic Behavior 19 (1), 46–76.Christodoulou, G., V. S. Mirrokni, and A. Sidiropoulos (2012). Convergence and approximation in potentialgames.

Theoretical Computer Science 438 , 13–27.Cohen, J. E. (1998). Cooperation and self-interest: Pareto-ineﬃciency of Nash equilibria in ﬁnite randomgames.

Proceedings of the National Academy of Sciences 95 (17), 9724–9731.Coucheney, P., S. Durand, B. Gaujal, and C. Touati (2014). General revision protocols in best responsealgorithms for potential games. In , pp. 239–246. IEEE.Daskalakis, C., A. G. Dimakis, E. Mossel, et al. (2011). Connectivity and equilibrium in random games.

The Annals of Applied Probability 21 (3), 987–1016.Dindoˇs, M. and C. Mezzetti (2006). Better-reply dynamics and global convergence to Nash equilibrium inaggregative games.

Games and Economic Behavior 54 (2), 261–292.Dresher, M. (1970). Probability of a pure equilibrium point in n -person games. Journal of CombinatorialTheory 8 (1), 134–145. urand, S., F. Garin, and B. Gaujal (2019). Distributed best response dynamics with high playing ratesin potential games. Performance Evaluation 129 , 40–59.Durand, S. and B. Gaujal (2016). Complexity and optimality of the best response algorithm in randompotential games. In

International Symposium on Algorithmic Game Theory , pp. 40–51. Springer.Erev, I. and A. E. Roth (1998). Predicting how people play games: Reinforcement learning in experimentalgames with unique, mixed strategy equilibria.

American economic review , 848–881.Fabrikant, A., A. D. Jaggard, and M. Schapira (2013). On the structure of weakly acyclic games.

Theoryof Computing Systems 53 (1), 107–122.Feldman, M. and T. Tamir (2012). Convergence of best-response dynamics in games with conﬂictingcongestion eﬀects. In

International Workshop on Internet and Network Economics , pp. 496–503. Springer.Foster, D. P. and H. P. Young (2006). Regret testing: Learning to play nash equilibrium without knowingyou have an opponent.

Theoretical Economics 1 (3), 341–367.Friedman, D. (1996). Equilibrium in evolutionary games: Some experimental results.

Economic Journal ,1–25.Friedman, J. W. and C. Mezzetti (2001). Learning in games by random sampling.

Journal of EconomicTheory 98 (1), 55–84.Fudenberg, D. and D. K. Levine (1998).

The theory of learning in games . MIT Press.Galla, T. and J. D. Farmer (2013). Complex dynamics in learning complicated games.

Proceedings of theNational Academy of Sciences 110 (4), 1232–1236.Germano, F. and G. Lugosi (2007). Global nash convergence of foster and young’s regret testing.

Gamesand Economic Behavior 60 (1), 135–154.Goemans, M., V. Mirrokni, and A. Vetta (2005). Sink equilibria and convergence. In , pp. 142–151. IEEE.Gokhale, C. S. and A. Traulsen (2010). Evolutionary games in the multiverse.

Proceedings of the NationalAcademy of Sciences 107 (12), 5500–5504.Goldberg, K., A. Goldman, and M. Newman (1968). The probability of an equilibrium point.

Journal ofResearch of the National Bureau of Standards 72 (2), 93–101.Goldman, A. (1957). The probability of a saddlepoint.

The American Mathematical Monthly 64 (10),729–730.Hofbauer, J. and K. Sigmund (1998).

Evolutionary games and population dynamics . Cambridge universitypress.Kash, I. A., E. J. an, and J. Y. Halpern (2011). Multiagent learning in large anonymous games.

Journal ofArtiﬁcial Intelligence Research 40 , 571–598.Kultti, K., H. Salonen, and H. Vartiainen (2011). Distribution of pure Nash equilibria in n-person gameswith random best responses. Technical Report 71, Aboa Centre for Economics. Discussion Papers.Macy, M. W. and A. Flache (2002). Learning dynamics in social dilemmas.

Proceedings of the NationalAcademy of Sciences of the United States of America 99 , 7229–7236.Maynard Smith, J. (1982).

Evolution and the Theory of Games . Cambridge University Press.McLennan, A. (2005). The expected number of Nash equilibria of a normal form game.

Econometrica 73 (1),141–174. cLennan, A. and J. Berg (2005). Asymptotic expected number of Nash equilibria of two-player normalform games. Games and Economic Behavior 51 (2), 264–295.Mirrokni, V. S. and A. Skopalik (2009). On the complexity of Nash dynamics and sink equilibria. In

Proceedings of the 10th ACM conference on Electronic commerce , pp. 1–10.Monderer, D. and L. S. Shapley (1996). Potential games.

Games and economic behavior 14 (1), 124–143.Pangallo, M., T. Heinrich, and J. D. Farmer (2019). Best reply structure and equilibrium convergence ingeneric games.

Science Advances 5 (2), eaat1328.Pangallo, M., J. Sanders, T. Galla, and D. Farmer (2017). A taxonomy of learning dynamics in 2 × n -persongames. International Journal of Game Theory 19 (3), 277–286.Quint, T., M. Shubik, and D. Yan (1997). Dumb bugs vs. bright noncooperative players: A comparison. InW. Albers, W. G¨uth, P. Hammerstein, B. Moldvanu, and E. van Damme (Eds.),

Understanding StrategicInteraction , pp. 185–197. Springer.Rinott, Y. and M. Scarsini (2000). On the number of pure strategy Nash equilibria in random games.

Games and Economic Behavior 33 (2), 274–293.Robinson, J. (1951). An iterative method of solving a game.

The Annals of Mathematics 54 (2), 296.Sanders, J. B., J. D. Farmer, and T. Galla (2018). The prevalence of chaotic dynamics in games with manyplayers.

Scientiﬁc reports 8 (1), 4902.Sarin, R. and F. Vahid (2001). Predicting how people play games: a simple dynamic model of choice.

Games and Economic Behavior 34 (1), 104–122.Sato, Y. and J. P. Crutchﬁeld (2003). Coupled replicator equations for the dynamics of learning in multiagentsystems.

Physical Review E 67 (1), 1–5.Stanford, W. (1995). A note on the probability of k pure Nash equilibria in matrix games. Games andEconomic Behavior 9 (2), 238–246.Stanford, W. (1996). The limit distribution of pure strategy Nash equilibria in symmetric bimatrix games.

Mathematics of Operations Research 21 (3), 726–733.Stanford, W. (1997). On the distribution of pure strategy equilibria in ﬁnite games with vector payoﬀs.

Mathematical Social Sciences 33 (2), 115–127.Stanford, W. (1999). On the number of pure strategy Nash equilibria in ﬁnite common payoﬀs games.

Economics Letters 62 (1), 29–34.Swenson, B., R. Murray, and S. Kar (2018). On best-response dynamics in potential games.

SIAM Journalon Control and Optimization 56 (4), 2734–2767.Takahashi, S. (2008). The number of pure Nash equilibria in a random game with nondecreasing bestresponses.

Games and Economic Behavior 63 (1), 328–340.Takahashi, S. and T. Yamamori (2002). The pure Nash equilibrium property and the quasi-acyclic condition.

Economics Bulletin 3 (22), 1–6.Taylor, C. and M. A. Nowak (2006). Evolutionary game dynamics with non-uniform interaction rates.

Theoretical Population Biology 69 (3), 243–252.Tuyls, K., P. J. T. Hoen, and B. Vanschoenwinkel (2006). An evolutionary dynamical analysis of multi-agentlearning in iterated games.

Autonomous Agents and Multi-Agent Systems 12 (1), 115–153. an Huyck, J., R. Battalio, S. Mathur, P. Van Huyck, and A. Ortmann (1995). On the origin of convention:Evidence from symmetric bargaining games. International Journal of Game Theory 24 (2), 187–212.(2), 187–212.