A taxonomy of learning dynamics in 2 x 2 games
Marco Pangallo, James Sanders, Tobias Galla, and J. Doyne Farmer

Institute for New Economic Thinking at the Oxford Martin School, University of Oxford, Oxford OX2 6ED, UK
Mathematical Institute, University of Oxford, Oxford OX1 3LP, UK
Theoretical Physics, School of Physics and Astronomy, University of Manchester, Manchester M13 9PL, UK
Santa Fe Institute, Santa Fe, NM 87501, US
Abstract
Learning would be a convincing method to achieve coordination on an equilibrium. But does learning converge, and to what? We answer this question in generic 2-player, 2-strategy games, using Experience-Weighted Attraction (EWA), which encompasses many extensively studied learning algorithms. We exhaustively characterize the parameter space of EWA learning, for any payoff matrix, and we understand the generic properties that imply convergent or non-convergent behaviour in 2 × 2 games.

Key Words:
Behavioural Game Theory, EWA Learning, Convergence, Equilibrium, Chaos.
JEL Class.:
C62, C73, D83.

∗ Corresponding author: [email protected]. We thank Vince Crawford, Mike Harré, Cars Hommes, Peiran Jiao, Robin Nicole, Shabnam Mousavi and Peyton Young, as well as seminar participants at Nuffield College, INET YSI Plenary, Herbert Simon Society International Workshop, Conference on Complex Systems 2016 and King's College, for helpful comments and suggestions.

Introduction
How do players coordinate on specific profiles of strategies in non-cooperative games, and why should they coordinate on an equilibrium profile? If the game is simple or one-shot, a reasonable explanation is provided by strategic thinking and introspection. Another justification, which is more generally valid in complicated and repeated games, is learning and interaction. However, as has been well known since the contribution of Shapley (1964), the learning dynamics may fail to converge to an equilibrium. This calls into question the validity of equilibrium thinking in game theory: at least in some contexts, strategic interactions might be governed by learning in an ever-changing environment, rather than by rational and fully-informed decision making. The literature has faced the dilemma about the convergence of the learning dynamics to Nash Equilibria (NE) in several ways. Most theoretical work has identified classes of games and learning algorithms in which the dynamics succeeds in converging; some authors provided counter-examples in which learning would not converge. Little has been said about the generic properties of games and learning algorithms which yield convergent or non-convergent dynamics. Recent work (Galla and Farmer, 2013) addressed this issue by considering ensembles of 2-person, N-strategy games and finding the regions of the parameter space where learning was less likely to converge: negatively correlated payoffs and "rational" long-memory learning implied limit cycles and high-dimensional chaos in the learning dynamics. However, little understanding of the reasons for non-convergent behaviour was provided.

In order to shed light on the mechanisms behind (non-)convergence, this paper investigates the drivers of instability in the simplest possible non-trivial setting, that is, generic 2-person, 2-strategy normal form games, trying to capture the typical features of the payoff matrix and of the learning behaviour that yield cycling or irregular dynamics.
We study a slightly simplified version of Experience-Weighted Attraction (EWA), which is general enough to encompass both reinforcement and belief learning and has been shown to be in accord with experimental data (Camerer and Ho, 1999). In short, we find that the existence of a cycle of best responses in the payoff matrix, coupled with quick enough learning dynamics (in a sense that will be specified later), is a sufficient condition for the non-convergence of learning. In particular, in games with a unique mixed strategy equilibrium (to which we refer as discoordination games, lacking an established terminology in the literature) the players follow the cycle of best responses and never converge to the NE: we rather observe limit cycles or low-dimensional chaos. Lack of convergence is driven by the players adapting too quickly to the moves of their opponent. In the same learning scenario, if the payoff matrix is acyclic (there is at least one fixed point in terms of best responses, that is, a profile of strategies which is the best response by both players to some beliefs on their opponent), as in dominance-solvable and coordination games, convergence to a pure strategy NE occurs immediately. On the contrary, if the players are "irrational" and/or do not have enough incentives to switch their moves, they do not recognize that a pure strategy may be better and simply randomize between their possible moves, reaching a mixed strategy fixed point.

We find such a taxonomy of the learning dynamics by looking at relevant combinations of parameters, which naturally emerge from the mathematical analysis. Figure 1 illustrates our approach and provides a qualitative characterization of the parameter space. We denote by "irrationality" the ratio of two parameters of EWA, namely the memory loss of past performance α divided by the closeness to optimal decision making β (payoff sensitivity or intensity of choice).
"Coordination" (AC) depends on the payoff matrix and quantifies the preference of the players for "diagonal" outcomes: if we denote their pure strategies by 1 and 2, coordination is large when the payoffs associated with the profiles of strategies (1, 1) and (2, 2) are much larger than the payoffs for (1, 2) and (2, 1). "Dominance" (|BD|) on the other hand quantifies the relative strength of a pure strategy with respect to the other one. Coordination and dominance naturally relate to well-known classes of 2 × 2 games.

Footnote: Robinson (1951); Miyazawa (1961); Shapley (1964); Crawford (1974); Stahl (1988); Nachbar (1990); Milgrom and Roberts (1991); Krishna (1992); Conlisk (1993a); Monderer and Shapley (1996); Hahn (1999); Arieli and Young (2016).

Footnote: For instance, in Matching Pennies, if player Row (who wins if the pennies are matched) thinks that player Column would play Heads, the best response for Row would be to play Heads. The best response for Column to this move of player Row is to play Tails. Row would then switch to Tails as well, and so on.

[Figure 1: regions of the (β/α AC, β/α |BD|) plane labelled "Dominance-solvable games", "(Anti)coordination games" and "Discoordination games", with legend entries "Multiple fixed points", "Unique pure fixed point", "Unique mixed fixed point" and "Limit cycles and chaos", and example payoff matrices for each region.]

Figure 1: Qualitative characterization of the parameter space. The irrationality α/β refers to the intrinsic noise in the learning algorithm. Coordination (AC) and dominance (|BD|) quantify properties of the payoff matrix. The combinations of these parameters characterize the learning dynamics and relate to specific classes of 2 × 2 games.

Related literature
The first example of a normal form game where convergence of fictitious play (Brown, 1951; Robinson, 1951) did not occur was provided by Shapley (1964). He considered a 3 × 3 game in which fictitious play cycles among the profiles of strategies and never converges to the unique mixed strategy NE.

Footnote: Or vice versa: coordination is also large if the payoffs for (1, 2) and (2, 1) are much larger than the payoffs for (1, 1) and (2, 2).

Table 1: Games in the taxonomy. The games are defined in terms of the orderings in the payoff matrix

$$\Pi = \begin{pmatrix} a, e & b, g \\ c, f & d, h \end{pmatrix}.$$

  Coordination:        a > c, b < d, e > g, f < h.  Two pure strategy NE, (1, 1) and (2, 2).  (Example: AC = 72, |BD| = 6.)
  Anticoordination:    a < c, b > d, e < g, f > h.  Two pure strategy NE, (1, 2) and (2, 1).  (Example: AC = 12, |BD| = 0.)
  Discoordination:     a > c, e < g, b < d, f > h, or a < c, e > g, b > d, f < h.  Unique mixed strategy NE.  (Example: AC < 0, |BD| = 18.)
  Dominance-solvable:  all other possible orderings, e.g. a > c, b > d, e > g, f > h.  Unique pure strategy NE.  (Example: AC = 4, |BD| = 18.)

Coordination is AC, where A = (a + d − b − c)/4 and C = (e + h − f − g)/4, while dominance is |BD|, with B = (a + b − c − d)/4 and D = (e + f − g − h)/4. In dominance-solvable games, |BD| > |AC|; in coordination and anticoordination games, |BD| < |AC| and AC > 0; in discoordination games, |BD| < |AC| and AC < 0. Note that there are some exceptions: see Proposition 1.
Another literature focused on the generic properties of the payoff matrices and learning algorithms that were associated with multiplicity of NE or non-convergent behaviour. Berg and Weigt (1999) showed how the number of NE increases exponentially with the correlation of the payoffs, while Opper and Diederich (1992) considered the replicator dynamics with a large number of species and used techniques from the statistical physics of disordered systems to show how, below a certain level of cooperation pressure (a parameter characterizing the learning algorithm), the dynamics becomes unstable. More recently, Galla and Farmer (2013) analysed random games and EWA learning, showing that high-dimensional chaos and limit cycles could be observed in a significant portion of the parameter space, for negatively correlated payoffs.

This paper bridges the two described literatures in that we exhaustively characterize the parameter space of EWA in generic 2 × 2 games, relating ex-post the learning dynamics to specific classes of games based on the convergence properties of the learning algorithm, rather than focusing ex-ante on any specific class of games. We show that convergence occurs in acyclic 2 × 2 games. In 2 × 2 games, with only two pure strategies available to each player, we find low-dimensional chaos, in contrast with Galla and Farmer (2013), who find high-dimensional chaotic attractors (which are consistent with an essentially random and unpredictable learning dynamics) in games with many pure strategies. Since we find a quasi-cyclical learning dynamics, it can sensibly be argued that the pattern can be guessed by one of the players, who could then take advantage of her forecast of the moves of her opponent in order to systematically outguess his choices, and thereby perform better than him. In evolutionary terms, the player who can guess the cyclical behaviour of her opponent has higher fitness and is eventually expected to take over the entire population.
This is the rational expectations argument of Muth (1961) and would suggest that the cyclic behaviour is expected to die out. However, in line with the view of the rational route to randomness (Brock and Hommes, 1997), this is not an obvious outcome. The information cost for guessing the moves of the other player and the interaction between two or more forecasting strategies easily yield complex dynamics, preventing rational and perfectly informed players from outperforming less sophisticated players. Hommes et al. (2016) apply this formalism to the theory of learning in games by considering the interplay between rational play and a short-memory adjustment process such as best-response dynamics or fictitious play in Cournot games. Rational players are able to outguess the choices of their opponents, but complex dynamics may still occur. In a different context, Huberman and Hogg (1988) show that more sophisticated learning algorithms may lead to chaotic dynamics.

Another understandable critique is whether our learning algorithm can be considered representative of how players learn in reality, and whether limit cycles or chaos in the learning dynamics play a role in the real world and could be detected in experiments. Camerer and Ho (1999) and Ho et al. (2007) fit the EWA model to experimental data in several classes of games and show that it outperforms other learning models in most cases. However, it is likely that the players would change their learning strategy as the game evolves, implying that they learn how to learn. Stahl (1996) considered a model of rule learning where the players are of different k-levels (Nagel, 1995) and change their k-level using reinforcement learning. Crawford (1995) proposed a generalization of the standard belief learning algorithms to take into account time-varying memory and idiosyncratic shocks.
Would we find the same qualitative learning dynamics if we used more sophisticated learning algorithms? Our analysis suggests that limit cycles and chaos may theoretically be observed as long as the players are willing to quickly switch their moves, independently of the reason why they behave so. A property of the cycling behaviour, as opposed to convergence to a mixed strategy equilibrium, is the slower decay in the autocorrelation function of the moves chosen by each player. In the language of time series, the sequence of moves by each player exhibits persistence. This is a precise theoretical prediction that can be tested against data on experimental learning of discoordination games.

Organization of the paper
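The persistence prediction can be made concrete: the sample autocorrelation of a player's move sequence (coded as 0/1) decays slowly when the player follows a best-response cycle. A minimal sketch (the function name is ours, not part of the paper):

```python
def autocorrelation(moves, lag):
    """Lag-k sample autocorrelation of a move sequence coded as 0/1."""
    n = len(moves)
    mean = sum(moves) / n
    var = sum((m - mean) ** 2 for m in moves) / n
    if var == 0.0:
        return 0.0  # a constant sequence has no defined correlation structure
    cov = sum((moves[t] - mean) * (moves[t + lag] - mean)
              for t in range(n - lag)) / n
    return cov / var

# A player cycling with period 4 (e.g. following a cycle of best responses)
# shows persistent, slowly decaying correlations:
cycling = [1, 1, 0, 0] * 250
print(autocorrelation(cycling, 2))  # close to -1: anti-phase at half the period
print(autocorrelation(cycling, 4))  # close to +1: in phase at the full period
```

For an i.i.d. randomizing player, by contrast, all autocorrelations at positive lags are close to zero, which is the signature of convergence to a mixed strategy fixed point.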
The rest of this paper is organized as follows: in Section 2 we define the classes of 2 × 2 games, in Section 3 we describe EWA learning and the simplifications we adopt, in Section 4 we analyse the dynamical properties of EWA learning, and in Section 5 we consider extensions.

Based on the properties one wants to look at, it is possible to construct several classifications of 2-person, 2-strategy (2 × 2) games. Rapoport et al. (1976) find 78 classes of games, which can be reduced to 24 when fewer properties are considered. Here we are only concerned with the number of Nash Equilibria (NE) and with their type, i.e. whether they are pure or mixed strategy NE.

Footnote: The dimensionality of the chaotic attractors quantifies the departure from regular oscillations.

We only find 3 classes of 2 × 2 games under this criterion. The generic payoff matrix of a 2 × 2 game is

$$\Pi = \begin{pmatrix} a, e & b, g \\ c, f & d, h \end{pmatrix}. \qquad (1)$$

The number and type of the NE depend on the pairwise ordering of the payoffs each player compares, namely (a, c) and (b, d) for player Row, and (e, g) and (f, h) for player Column. There are 2⁴ = 16 such orderings. We find the following classes of 2 × 2 games:

• Coordination and anticoordination games are respectively defined by the orderings a > c, d > b, e > g, h > f and a < c, d < b, e < g, h < f. Coordination 2 × 2 games have two pure strategy NE, (1, 1) and (2, 2); anticoordination 2 × 2 games have two pure strategy NE, (1, 2) and (2, 1).

• Discoordination games are defined by the orderings a > c, d > b, e < g, h < f and a < c, d < b, e > g, h > f (again, there exists no standard terminology for this class of games). They have a unique mixed strategy NE and no pure strategy NE because the players have incentives to coordinate on different profiles of strategies. The prototypical discoordination game is Matching Pennies.

• Dominance-solvable games are defined by all 12 remaining possible orderings. They have a unique pure strategy NE, obtainable from the elimination of strongly dominated strategies. For instance, if a > c, d < b, e > g, h < f, the NE is (1, 1).

To characterize 2 × 2 games it is useful to define the following combinations of the payoffs:

$$A = \frac{1}{4}(a + d - b - c), \quad B = \frac{1}{4}(a + b - c - d), \quad C = \frac{1}{4}(e + h - f - g), \quad D = \frac{1}{4}(e + f - g - h). \qquad (2)$$

The parameter A indicates the preference of player Row for outcomes of the type (1, 1) and (2, 2) over the cases (1, 2) and (2, 1); C is a measure for the preference of player Column for the same "diagonal" outcomes. It is then sensible to use the product AC as a measure of overall coordination. We then name AC the "coordination" parameter. If both A and C are positive and large, coordination is positive and large and both players prefer outcomes (1, 1) and (2, 2). If both A and C are negative and large, coordination is still positive and large; the payoff matrix describes an anticoordination game and both players prefer outcomes (1, 2) and (2, 1). If A is positive and large and C is negative and large, coordination is negative and large: one player prefers outcomes (1, 1) and (2, 2), while the other prefers (1, 2) and (2, 1).

B is a measure for the dominance of player Row's first strategy over her second, and similarly D measures the dominance of player Column's first strategy over her second. We refer to |BD| as the "dominance" parameter (we take the absolute value of the product BD because its sign only determines which profile of strategies is selected as the NE, but does not change the type of game). If dominance is large the payoff matrix describes a dominance-solvable game and it is sensible that the learning dynamics is characterized by a unique fixed point, close to the pure strategy NE.

These statements are made more precise in the following proposition:

Proposition 1. (i) In symmetric games (A = C, B = D), where coordination (AC = A²) and dominance (|BD| = B²) are automatically positive, it is equivalent to consider |A| as the coordination parameter and |B| as the dominance parameter. If coordination is larger than dominance (|A| > |B|), the payoff matrix describes a coordination (if A > 0) or anticoordination (if A < 0) game. Vice versa, if |A| < |B|, it describes a dominance-solvable game.

(ii) In asymmetric games (A ≠ C, B ≠ D), if coordination in absolute value is smaller than dominance (|AC| < |BD|), the game is dominance-solvable; in the opposite case (|AC| > |BD|), we cannot disambiguate between the classes of games using only these parameters. In particular, if both |B| < |A| and |D| < |C|, the payoff matrix describes a coordination (if AC > 0, A > 0, C > 0), anticoordination (if AC > 0, A < 0, C < 0) or discoordination (if AC < 0) game. On the other hand, if |B| > |A| or |D| > |C|, even if |AC| > |BD|, the game is dominance-solvable. However, the larger the value of coordination (compared to dominance), the less likely it is that the payoff matrix describes a dominance-solvable game.
The proof of Proposition 1 is in Appendix A, where we also show that there are only 4 effective degrees of freedom in the payoff matrix, as far as the NE and the dynamical properties of EWA learning are concerned.
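The classification by orderings of Section 2 and the parameter definitions in Eq. (2) can be implemented mechanically; a minimal sketch (function names are ours), with payoffs laid out as in Eq. (1) and ties between payoffs ignored (the analysis assumes generic games):

```python
def parameters(a, b, c, d, e, f, g, h):
    """Combinations of payoffs defined in Eq. (2); layout as in Eq. (1)."""
    A = (a + d - b - c) / 4
    B = (a + b - c - d) / 4
    C = (e + h - f - g) / 4
    D = (e + f - g - h) / 4
    return A, B, C, D

def classify(a, b, c, d, e, f, g, h):
    """Classify a generic 2x2 game by the pairwise payoff orderings:
    Row compares (a, c) and (b, d); Column compares (e, g) and (f, h)."""
    row_diag = (a > c, d > b)   # Row best-responds on the diagonal
    col_diag = (e > g, h > f)   # Column best-responds on the diagonal
    if all(row_diag) and all(col_diag):
        return "coordination"
    if not any(row_diag) and not any(col_diag):
        return "anticoordination"
    if (all(row_diag) and not any(col_diag)) or \
       (not any(row_diag) and all(col_diag)):
        return "discoordination"
    return "dominance-solvable"

# Matching Pennies: Row wins (payoff 1) on matched pennies, Column on mismatched.
print(classify(1, -1, -1, 1, -1, 1, 1, -1))    # discoordination
print(parameters(1, -1, -1, 1, -1, 1, 1, -1))  # (1.0, 0.0, -1.0, 0.0) -> AC < 0
```

Consistently with Table 1, Matching Pennies has AC = −1 < 0 and BD = 0.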
In this section we describe Experience-Weighted Attraction (EWA) learning and we list all mathematical simplifications that ease the subsequent analysis. In Section 3.1 we provide a formal definition of EWA and discuss the meaning of its parameters. In Section 3.2 we start to simplify the dynamics by assuming that the experience (one of the EWA components) has already reached a steady state and by taking a deterministic limit. In Section 3.3 we specify a diffeomorphism that allows us to substantially simplify the equations governing the learning dynamics, with no loss in generality.
Camerer and Ho (1999) proposed EWA as a hybrid of reinforcement learning (the players learn on the basis of the performance of their actions) and belief learning (the players construct beliefs on the possible actions of their opponents and respond to these beliefs). They noticed that the two largely studied classes of learning algorithms are in fact equivalent if the players also consider forgone payoffs. Thanks to the generality of EWA, the fit with experimental data is better than with pure reinforcement or pure belief learning. The reason is that real players learn using both information about performance and beliefs.

Footnote: The other requirement is that they average between the current payoff for a certain strategy and the past tendency to play the same strategy, see Section 3.2.

We now introduce some notation. Consider a 2-person, 2-strategy normal form game. We index the players by µ ∈ {Row = R, Column = C} and the pure strategies by i = 1, 2. We denote by x(t) the probability for player R to play pure strategy 1 at time t, and by y(t) the probability for player C to play pure strategy 1 at time t. We further denote by s^µ(t) the pure strategy which is actually chosen by player µ at time t, so that Π^µ(i, s^{−µ}(t)) represents the payoff that player µ receives at t if she plays pure strategy i and the other player chooses the pure strategy s^{−µ}(t).

In EWA, the mixed strategies are determined from the so-called attractions or propensities Q^µ_i(t) following a logit rule. For example, the probability for player R to play pure strategy 1 is given by

$$x(t+1) = \frac{e^{\beta Q^R_1(t+1)}}{e^{\beta Q^R_1(t+1)} + e^{\beta Q^R_2(t+1)}}, \qquad (3)$$

where β is the payoff sensitivity or intensity of choice, and a similar expression holds for y(t + 1). The propensities update as follows:

$$Q^\mu_i(t+1) = \frac{(1-\alpha)N(t)Q^\mu_i(t) + \left[\delta + (1-\delta)I(i, s^\mu(t+1))\right]\Pi^\mu(i, s^{-\mu}(t+1))}{N(t+1)}, \qquad (4)$$

where N(t + 1) = (1 − α)(1 − κ)N(t) + 1. Here N(t) represents experience and increases with the number of rounds played; the more it grows, the smaller becomes the influence of the received payoffs on the attractions. The propensities change according to the received payoff when playing action i against the strategy s^{−µ} chosen by the other player, i.e. Π^µ(i, s^{−µ}(t+1)). The indicator function I(i, s^µ(t+1)) is equal to 1 if i is the actual pure strategy that was played by µ at time t + 1, that is i = s^µ(t+1). All attractions (those corresponding to strategies that were and were not played) are updated with weight δ, while an additional weight 1 − δ is given to the specific attraction corresponding to the strategy that was actually played. Finally, the memory loss parameter α determines how quickly previous attraction and experience are discounted, and the parameter κ interpolates between cumulative and average reinforcement learning (see below).

Here we make two substantial, albeit rather innocuous, simplifications. First, EWA has two state variables: attraction and experience. The dynamics of the latter is trivial, as it reaches a fixed point extremely fast (for many combinations of parameters, the time scale of convergence is of the order of 2-3 time steps). Therefore we assume, with a small loss in generality, that experience has already reached a fixed point N* when the dynamics starts. To ensure the existence of such a fixed point we need to assume that (1 − α)(1 − κ) < 1. This restriction only rules out standard fictitious play, in which all past actions are taken into account with the same weight and therefore the relative weight of the most recent actions becomes smaller and smaller. There is no further loss in generality, as all other reinforcement and belief learning algorithms can still be viewed as particular cases of the EWA dynamics once N(t) has reached a fixed point.

The update rule (4) now reads:

$$Q^\mu_i(t+1) = (1-\alpha)Q^\mu_i(t) + \left(1-(1-\alpha)(1-\kappa)\right)\left[\delta + (1-\delta)I(i, s^\mu(t+1))\right]\Pi^\mu(i, s^{-\mu}(t+1)). \qquad (5)$$

The interpretation of κ is now more transparent: if κ = 1 the past payoffs are cumulated, hence cumulative reinforcement learning; if κ = 0 the past attraction and the current payoff are averaged with weight given by the memory loss parameter α, hence average reinforcement learning. Note that the two learning algorithms can be made equivalent by rescaling the propensities (or equivalently the intensity of choice) by α (see Galla and Farmer 2013).

Footnote: Due to the normalization condition, the learning dynamics is fully characterized by {x(t), y(t)} for t = 0, 1, 2, ….

Footnote: The larger β, the more the players consider the attractions in determining their strategy. In the limit β → ∞ the players choose with certainty the pure strategy with the larger attraction. In the limit β → 0 they choose among their pure strategies uniformly at random.

Following Camerer and Ho (1999), note that belief learning is recovered if δ = 1 and at least one of the following conditions is satisfied:

• There is no memory (α = 1). If, in addition, β → ∞, one recovers best-response dynamics (Cournot, 1838), in that the players best respond to the last period beliefs only;

• Average reinforcement learning (κ = 0).

Therefore, by studying the dynamic properties of (3) and (5) we are considering a wide class of learning algorithms, including reinforcement learning, best-response dynamics and weighted fictitious play.
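The logit rule (3) and the simplified propensity update (5) translate into code directly; the following sketch (function and variable names are ours) hard-codes two strategies per player:

```python
import math
import random

def logit_prob(Q, beta):
    """Probability of playing strategy 0 given the attractions Q, Eq. (3)."""
    z0, z1 = math.exp(beta * Q[0]), math.exp(beta * Q[1])
    return z0 / (z0 + z1)

def ewa_update(Q, own_move, opp_move, payoffs, alpha, delta, kappa):
    """Propensity update of Eq. (5), with experience at its fixed point.
    payoffs[i][j] is the payoff for playing i against opponent move j."""
    pre = 1 - (1 - alpha) * (1 - kappa)
    return [(1 - alpha) * Q[i]
            + pre * (delta + (1 - delta) * (i == own_move)) * payoffs[i][opp_move]
            for i in (0, 1)]

# One round of Matching Pennies from Row's perspective (Row wins on a match):
rng = random.Random(0)
payoffs_row = [[1, -1], [-1, 1]]
Q = [0.0, 0.0]
move = 0 if rng.random() < logit_prob(Q, beta=1.0) else 1
Q = ewa_update(Q, move, opp_move=1, payoffs=payoffs_row,
               alpha=0.5, delta=1.0, kappa=1.0)
```

With δ = 1 and κ = 1 the update reduces to Q′ = (1 − α)Q + Π, the cumulative reinforcement learning benchmark studied in Section 4.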
As a benchmark case we consider cumulative reinforcement learning (Section 4), which excludes belief learning, but we allow for average reinforcement learning in Section 5.2, where we generalize the results to belief learning.

We make another bold assumption in this section, which will then be relaxed in Section 5.1: we assume that the players play against each other many times before updating their propensities, so that the empirical frequency of their moves corresponds to their mixed strategy. This sort of argument was already made by Crawford (1974) and justified by Conlisk (1993a) in terms of "two-rooms experiments": the players only interact through a computer console and need to specify several moves before they know the moves of their opponent. This assumption is useful from a theoretical point of view and does not affect the results in most cases (Section 5.1): the only difference when noise is allowed is a blurring of the dynamical properties.

We denote by Π^µ_i the expected payoff for player µ playing pure strategy i at time t, given that player −µ plays a distribution of strategies given by her mixed strategy. An important remark is that, under the deterministic assumption, it is intended that δ = 1, as it would be ambiguous to distinguish between the strategies which were and were not played (as long as the players choose a non-degenerate mixed strategy, both pure strategies would be chosen by each player with non-zero frequency), so in order to recover belief learning it is really just enough to consider average reinforcement learning (κ = 0).

Finally, it is useful to combine (3) and (5) and to write the probabilities x(t + 1) and y(t + 1) directly in terms of the same probabilities at time t, that is x(t) and y(t).
In the deterministic limit (and so with δ = 1) we get

$$x(t+1) = \frac{x(t)^{1-\alpha}\, e^{\beta(1-(1-\alpha)(1-\kappa))\Pi^R_1(y(t))}}{Z_x}, \qquad (6)$$

where $Z_x = x(t)^{1-\alpha}\, e^{\beta(1-(1-\alpha)(1-\kappa))\Pi^R_1(y(t))} + (1-x(t))^{1-\alpha}\, e^{\beta(1-(1-\alpha)(1-\kappa))\Pi^R_2(y(t))}$, and an analogous expression holds for y(t + 1).

The remaining simplification implies no loss of generality and matters for technical reasons: we propose a diffeomorphism that transforms the coordinates of the learning dynamics and leads to a simpler set of equations. As we consider the combinations of parameters in the transformed coordinates, the taxonomy of the learning dynamics starts naturally to emerge. A diffeomorphism between a coordinate space (x, y), henceforth denoted as original coordinates, and a coordinate space (x̃, ỹ), henceforth denoted as transformed coordinates, leaves the dynamical properties (e.g. Jacobian, Lyapunov Exponents) in (x, y) unchanged in (x̃, ỹ), thanks to a well-known property in dynamical systems theory (Ott, 2002).

We consider the generic 2 × 2 game and the transformation

$$\tilde{x} = -\frac{1}{2}\ln\left(\frac{1}{x}-1\right), \qquad \tilde{y} = -\frac{1}{2}\ln\left(\frac{1}{y}-1\right). \qquad (7)$$

In terms of the transformed coordinates, the map (6) writes:

$$\tilde{x}(t+1) = (1-\alpha)\tilde{x}(t) + \beta\left(1-(1-\alpha)(1-\kappa)\right)\left(A\tanh\tilde{y}(t) + B\right),$$
$$\tilde{y}(t+1) = (1-\alpha)\tilde{y}(t) + \beta\left(1-(1-\alpha)(1-\kappa)\right)\left(C\tanh\tilde{x}(t) + D\right), \qquad (8)$$

where A, B, C and D have been defined in (2).

The original coordinates are restricted to x(t) ∈ [0, 1] and y(t) ∈ [0, 1]. The pure strategy profiles (x, y) ∈ {(0, 0), (0, 1), (1, 0), (1, 1)} in the original coordinates map to (x̃, ỹ) ∈ {(±∞, ±∞)} in the transformed coordinates. Note also that mixed strategies where the players choose among their actions with the same probability, i.e. x, y = 1/2, are mapped to x̃, ỹ = 0. The inverse transformation is given by

$$x = \frac{1}{1 + e^{-2\tilde{x}}}, \qquad y = \frac{1}{1 + e^{-2\tilde{y}}}. \qquad (9)$$

We analyse the dynamical properties of EWA learning in generic 2 × 2 games. Throughout this section we take the deterministic limit (δ = 1) and consider cumulative reinforcement learning (κ = 1). The extensions are considered in Section 5.

We first analyse the existence and the position of the fixed points in the strategy simplex, and then we consider their stability. In Section 4.1.1 we focus on symmetric games. We find that there always exists at least one stable fixed point, which may or may not correspond to the NE. In Section 4.1.2 we consider "antisymmetric" games, where, for any combination of strategies, the payoffs received by one player are the opposite of the payoffs received by the other player (this does not necessarily correspond to zero-sum games, see below). For discoordination games, the learning dynamics may not settle to a fixed point. Finally, in Section 4.1.3, we analyse the most general class of asymmetric games.
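The two-dimensional map (8), together with the inverse transformation (9), is straightforward to iterate numerically; a minimal sketch (names ours), where the prefactor 1 − (1 − α)(1 − κ) is computed explicitly:

```python
import math

def transformed_map(xt, yt, A, B, C, D, alpha, beta, kappa=1.0):
    """One iteration of the map (8) in the transformed coordinates."""
    pre = beta * (1 - (1 - alpha) * (1 - kappa))
    xn = (1 - alpha) * xt + pre * (A * math.tanh(yt) + B)
    yn = (1 - alpha) * yt + pre * (C * math.tanh(xt) + D)
    return xn, yn

def to_original(xt, yt):
    """Inverse transformation (9): back to mixed-strategy probabilities."""
    return 1 / (1 + math.exp(-2 * xt)), 1 / (1 + math.exp(-2 * yt))

# Dominance-solvable payoff structure (A = C = 0, B = D = 1, kappa = 1):
# the iteration contracts onto the unique fixed point x~* = y~* = (beta/alpha) B.
x, y = 0.0, 0.0
for _ in range(100):
    x, y = transformed_map(x, y, A=0.0, B=1.0, C=0.0, D=1.0, alpha=0.5, beta=0.5)
print(to_original(x, y))  # both probabilities close to 0.88
```

Choosing instead A = −C (as in Matching Pennies) with large β/α produces the non-convergent behaviour discussed below.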
We start from the simplest case from the point of view of the analysis, namely symmetric 2 × 2 games, in which Π^R_ij = Π^C_ij, so A = C and B = D. Therefore, coordination is A and dominance is B. Recall from Proposition 1 that if |A| > |B| and A > 0, the payoff matrix describes a coordination game; if |A| > |B| and A < 0, the payoff matrix describes an anticoordination game; if |B| > |A|, the game is dominance-solvable.

The fixed points in the transformed coordinates can be obtained from (8), by setting x̃(t + 1) = x̃(t) = x̃* and ỹ(t + 1) = ỹ(t) = ỹ*. The fixed point equation is x̃* = Ψ(x̃*), where

$$\Psi(\tilde{x}^\star) = \frac{\beta}{\alpha}\left[A\tanh\left(\frac{\beta}{\alpha}\left(A\tanh\tilde{x}^\star + B\right)\right) + B\right]. \qquad (10)$$

An identical expression holds for ỹ*. Note that the EWA parameters α and β combine as the ratio α/β (or β/α). It makes sense to define α/β as the "irrationality" parameter because it is large if there is substantial memory loss and/or small intensity of choice. Eq. (10) can have either 1 or 3 solutions. If there are 3 intersections between Ψ(x̃*) and the x̃* line, we denote as central solution the intersection with an intermediate value of x̃*, and as lateral solutions the intersections with the maximum and minimum values. Note that the fixed points are a vector (x̃*, ỹ*), so it is not enough to compute the solutions of Eq. (10); one also needs to find the right couplings by replacing the possible combinations in (8).

Thanks to the fact that the maps (6) and (8) are topologically conjugate, their Jacobian is the same. We compute it from (8):

$$J\big|_{\tilde{x}^\star,\tilde{y}^\star} = \begin{pmatrix} 1-\alpha & \dfrac{A\beta}{\cosh^2(\tilde{y}^\star)} \\[2mm] \dfrac{C\beta}{\cosh^2(\tilde{x}^\star)} & 1-\alpha \end{pmatrix}. \qquad (11)$$

The eigenvalues are

$$\lambda_\pm = 1 - \alpha \pm \frac{|A|\beta}{\cosh(\tilde{x}^\star)\cosh(\tilde{y}^\star)}. \qquad (12)$$

Since 1 − α >
0, the leading eigenvalue is λ₊ and it is enough to study that for the stability properties. After a little algebra we get the stability condition

$$\frac{\alpha}{\beta}\cosh(\tilde{x}^\star)\cosh(\tilde{y}^\star) - |A| \ge 0. \qquad (13)$$

The shape of Ψ(x̃*) varies according to the irrationality (α/β), coordination (|A|) and dominance (|B|) parameters. Due to the strong non-linearity of Ψ(x̃*), it is not possible to study it analytically in full. Therefore, we first solve Eq. (10) numerically, and then provide a mathematical analysis of a number of specific cases. Figure 2 shows the properties of the fixed points obtained from the numerical solution of (10), keeping irrationality constant, i.e. α/β = 1 (since the parameters combine as (β/α)A and (β/α)B, it is equivalent to change the values of A and B). We also check the stability of the fixed points by using Eq. (13). We find that there is always at least one stable fixed point. If there are multiple fixed points, only the lateral solutions are stable. For small values of the payoffs, such that the players do not have strong incentives to choose a specific pure strategy, learning converges to a mixed strategy fixed point, where the players randomly choose between the pure strategies. If dominance is larger than coordination, the payoff matrix describes a dominance-solvable game and learning converges to a pure strategy fixed point corresponding to the NE. If coordination is larger than dominance, the payoff matrix may represent a coordination or an anticoordination game. Note that multiple fixed points are much more likely in anticoordination games. To see why this is the case, consider the following payoff matrices, with A = C = ±1.75 and B = D = 1.25:

$$\Pi_{\text{coord}} = \begin{pmatrix} 6, 6 & 0, 0 \\ 0, 0 & 1, 1 \end{pmatrix}; \qquad \Pi_{\text{anticoord}} = \begin{pmatrix} 5, 5 & 6, 6 \\ 6, 6 & 0, 0 \end{pmatrix}. \qquad (14)$$

While in Π_coord the NE which yields payoffs (6, 6) is clearly to be preferred over the NE yielding (1, 1), so it is reasonable that learning only converges to the preferred outcome (unique pure strategy fixed point), in Π_anticoord there is no preferred NE, so it is sensible that learning displays multiple fixed points. This is indeed what happens in the top right and top left corners of Figure 2, which correspond to the payoff matrices in (14).

Footnote: It never occurs that the components of a fixed point are the central and lateral solutions: either both components are central solutions, or both components are lateral solutions.

[Figure 2: in the (A = C, B = D) plane, regions labelled "Dominance-solvable games", "Anticoordination games" and "Coordination games", with markers for multiple fixed points, a unique pure fixed point and a unique mixed fixed point.]
Figure 2: Numerical solution of Eq. (10) for α/β = 1 and several values of A = C and B = D. If 0.3 < x* < 0.7 and 0.3 < y* < 0.7, the solution is classified as a "mixed strategy fixed point". All unique fixed points are stable; if there are multiple fixed points, only the "lateral solutions" are stable. The overall picture is that for small values of the payoffs learning converges to a mixed strategy fixed point; if dominance is strong, to a pure strategy fixed point; if coordination is strong, to multiple fixed points. There are noticeable differences between coordination and anticoordination games. This figure corresponds to the case AC > 0 (because both A = C < 0 and A = C > 0 yield AC > 0).
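The numerical procedure just described can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: it assumes that the symmetric fixed-point equation (10) takes the self-consistent form x̃* = (β/α)[A tanh((β/α)(A tanh x̃* + B)) + B] (i.e. Eq. (20) below specialized to C = A, D = B), together with the stability condition (13); grid bounds and resolution are arbitrary choices.

```python
import math

def fixed_points(A, B, alpha, beta, lo=-50.0, hi=50.0, n=4000):
    """Find the roots x of  x - r*(A*tanh(r*(A*tanh(x) + B)) + B) = 0,
    with r = beta/alpha, by scanning for sign changes and bisecting."""
    r = beta / alpha
    f = lambda x: x - r * (A * math.tanh(r * (A * math.tanh(x) + B)) + B)
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    roots = []
    for a, b in zip(xs, xs[1:]):
        fa, fb = f(a), f(b)
        if fa == 0.0:
            roots.append(a)
        elif fa * fb < 0.0:
            for _ in range(80):  # plain bisection on the bracketing interval
                m = 0.5 * (a + b)
                if fa * f(m) <= 0.0:
                    b = m
                else:
                    a, fa = m, f(m)
            roots.append(0.5 * (a + b))
    return roots

def is_stable(x, y, A, alpha, beta):
    """Stability condition (13): (alpha/beta)*cosh(x)*cosh(y) >= |A|."""
    return (alpha / beta) * math.cosh(x) * math.cosh(y) >= abs(A)

# Example with alpha = beta = 1 and B = 0: for (beta/alpha)|A| <= 1 there is
# a unique central fixed point; for larger |A| two lateral solutions appear.
print(fixed_points(0.5, 0.0, 1.0, 1.0))   # one root, at 0
print(fixed_points(3.0, 0.0, 1.0, 1.0))   # three roots
```

Checking `is_stable` at each root reproduces the pattern reported above: the central solution loses stability exactly when the lateral solutions appear.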
We now proceed with the mathematical solution for a number of specific cases. We first set B = 0, and study the interplay of the coordination parameter, |A|, and the irrationality parameter, α/β. The lateral solutions do not exist if

(β/α)|A| ≤ 1.  (15)

The interpretation is straightforward: if irrationality is large (so its inverse is small) and/or coordination is small (i.e. the absolute value of the payoffs is small), there is a unique fixed point in the centre of the strategy simplex. This transition can be seen in Figure 2 for B = 0. We next consider B >
0. It is possible to check in Eq. (10) that a large value of (β/α)|B| flattens Ψ(x̃*) (because it makes the argument less sensitive to the values of x̃*) and moves the offset Ψ(0) away from zero. Therefore, for a sufficiently large value of (β/α)|B| there is a unique fixed point far from the centre of the simplex. This is indeed what happens in Figure 2. Stability is addressed in the following proposition.

Proposition 2. We consider a symmetric 2 × 2 game. The following results hold:
(i) if B = 0 and (β/α)|A| ≤ 1, the unique fixed point is stable.
(ii) if B = 0 and (β/α)|A| → 1⁺ or (β/α)|A| → +∞, the fixed point whose components are the central solutions is unstable and the fixed points whose components are the lateral solutions are stable. In particular, at (β/α)|A| = 1 a supercritical pitchfork bifurcation occurs.
(iii) if (β/α)|B| → +∞ and B ≫ A, the unique fixed point is stable.

The proof is in Appendix B.

4.1.2 "Antisymmetric" games

So far, we analysed the learning dynamics in dominance-solvable, coordination and anticoordination games. We now focus on the remaining class of 2 × 2 games, in which Π^R_ij = −Π^C_ij, and so A = −C, B = −D. Note that this condition does not generally define zero-sum games, as the latter are rather defined by the equality Π^R_ij = −Π^C_ji (so the two classes of games correspond only if Π
^R_ij = −Π^C_ij = 0 for i ≠ j). Again, if B > A the game is dominance-solvable, but if A > B we have a discoordination game. The fixed points (x̃*, ỹ*) are again obtained from (8):

x̃* = (β/α)[−A tanh((β/α)(A tanh x̃* + B)) + B],
ỹ* = (β/α)[−A tanh((β/α)(A tanh ỹ* + B)) − B],  (16)

where we have used the identity tanh(−x) = −tanh(x). It is immediate to note from (16) that there exists a unique fixed point. Indeed, the functions on the RHS are monotonically decreasing, so only one intersection with the x̃* and ỹ* lines is possible. Moreover, given AC = −A² <
0, the eigenvalues of the Jacobian (11) are complex:

λ± = 1 − α ± i β|A| / (cosh(x̃*) cosh(ỹ*)).  (17)

The stability condition reads:

(β/√(2α − α²)) |A| / (cosh(x̃*) cosh(ỹ*)) ≤ 1.  (18)

In Figure 3 we show the properties of the unique fixed point obtained from the numerical solution of (16), for several values of A and B. We also check the stability of the fixed points by using Eq. (18). Focusing on small B, a larger value of |A| or a smaller value of √(2α − α²)/β (which is close to the irrationality parameter α/β) implies that instability is more likely. The intuition is straightforward: if the players are rational and/or have strong incentives to switch away from a strategy which is not performing well, they follow the cycle of best responses in the payoff matrix and keep switching their moves, rather than smoothly converging to a fixed point in the centre of the simplex, where they would randomize between the pure strategies. On the contrary, if B is large (with respect to A), the learning dynamics simply converges towards a fixed point close to the pure strategy NE.

We conclude this section by focusing on one specific example of discoordination games, where dominance is null: B = D = 0, C = −A. The unique fixed point is (0, 0) and, assuming without loss of generality A > 0 (so that C < 0), it is stable if

(β/√(2α − α²)) A ≤ 1.  (19)

Replacing the values of α and β used in Figure 3, the fixed point becomes unstable above a critical value A*.

We now consider asymmetric 2 × 2 games, A ≠ C and B ≠ D. There is a larger variety of behaviours, but in general asymmetric games are broadly similar to their symmetric counterparts (e.g. asymmetric dominance-solvable games are broadly similar to symmetric dominance-solvable games), except that the player with the strongest incentive to play a certain move plays mixed strategies farther from the centre of the strategy simplex.
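The instability threshold (19) can be checked by iterating the learning map directly. This is a minimal sketch under the assumption that the map in the transformed coordinates reads x̃' = (1 − α)x̃ + β(A tanh ỹ + B) and ỹ' = (1 − α)ỹ + β(C tanh x̃ + D), a form consistent with the fixed-point equations (16) and (20) and with the eigenvalues (17):

```python
import math

def simulate(A, B, C, D, alpha, beta, x0=0.1, y0=-0.1, steps=10000):
    """Iterate the assumed deterministic learning map in the transformed
    coordinates; returns the trajectory of (x, y)."""
    x, y = x0, y0
    traj = []
    for _ in range(steps):
        x, y = ((1 - alpha) * x + beta * (A * math.tanh(y) + B),
                (1 - alpha) * y + beta * (C * math.tanh(x) + D))
        traj.append((x, y))
    return traj

# Discoordination game with B = D = 0, C = -A, alpha = 0.2, beta = 1.
# Condition (19) predicts stability iff A <= sqrt(2*alpha - alpha**2)/beta = 0.6.
tail_stable = simulate(0.3, 0.0, -0.3, 0.0, 0.2, 1.0)[-200:]   # A below threshold
tail_cycling = simulate(3.0, 0.0, -3.0, 0.0, 0.2, 1.0)[-200:]  # A above threshold
```

With A = 0.3 the trajectory collapses onto the mixed strategy fixed point at the origin; with A = 3 it settles on a persistent bounded oscillation, in line with Figure 3.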
Figure 3: Numerical solution of Eq. (20) for α = β and several values of A = −C and B = −D. If 0.3 < x* < 0.7 and 0.3 < y* < 0.7, the solution is classified as a "mixed strategy fixed point". There is always a unique fixed point, which may become unstable in discoordination games for low values of irrationality and/or high (absolute) values of coordination. The intuition is that the players have strong incentives to try to improve their payoffs, so they fail to coordinate on the mixed strategy NE and the learning dynamics keeps cycling. This figure corresponds to the case AC < 0.

The fixed points (x̃*, ỹ*) are given by

x̃* = (β/α)[A tanh((β/α)(C tanh x̃* + D)) + B],
ỹ* = (β/α)[C tanh((β/α)(A tanh ỹ* + B)) + D].  (20)

Without loss of generality, we can write the (combinations of) payoffs of player Column as a rescaled version of the (combinations of) payoffs of player Row, that is C = WA and D = ZB, with W and Z scale factors. The magnitudes of W and Z quantify the imbalance in coordination and dominance for the two players. For instance, if W is large, player Column has stronger incentives to converge on one of the pure strategy NE (just consider the payoff matrix (1), with a = d = 1, e = h = 5, b = c = f = g = 0). Consistently, the height of the hyperbolic tangent in (20) for player Column is larger, leading to an intersection with the ỹ* line which is farther away from zero (ỹ* ≫ x̃*). Therefore, player Column will choose a mixed strategy farther from the centre of the simplex. Likewise, if Z is large, player Column ends up at a fixed point closer to the pure strategy NE. Concerning the signs, the sign of Z does not matter in determining the stability properties, while the sign of W has a substantial effect. If W > 0, we find little difference with the symmetric case; if W < 0, the game is a discoordination game (provided |B| < |A| and |D| < |C|), which may have no stable fixed points, as shown in Section 4.1.2.

We choose a parameter setting where the fixed point of the discoordination game is unstable. Figure 4 shows some examples of the dynamics for several values of α and β, for a given payoff matrix. The dynamics superficially appears to follow a limit cycle, whose shape is governed by α and β: Fig. 4a shows that, for high α and β, the players frequently change their strategies, whereas in Fig. 4b, for low values of α and β, the dynamics is smoother; in Fig. 4c, where α is very small but β is reasonably high, the players spend a long time playing mostly one pure strategy and then quickly switch to the other one (because they have long memory). Finally, in Fig. 4d we choose B ≠ 0: the discrepancy between the pure strategies seems to yield the most irregular dynamics.

In order to get further insights into the learning dynamics, Figure 5 represents the bifurcation diagrams and the largest Lyapunov Exponent (LLE), varying α and β. We focus on the values of the payoff matrix in Fig. 4d, as the behaviour of the learning dynamics in Fig. 4a is only marginally chaotic. Figures 5a and 5c refer to a parameter setting where the fixed point is unstable, and we observe alternating limit cycles and chaotic bands. On the other hand, in Figures 5b and 5d, for small intensity of choice the fixed point is stable, while for intermediate values up to β ≈ 0.8 the dynamics is not periodic, but the LLE is almost null. This case corresponds to a marginally chaotic behaviour, like the one in Fig. 4a. For larger values of β we observe again chaotic bands and limit cycles. At the points where the limit cycles become chaotic we can observe a higher density of trajectories, probably related to the intermittency scenario of transition to chaos (Pomeau and Manneville, 1980).

Figure 6 shows that chaos is more frequently observed if one of the pure strategies is dominant over the other, B > 0. Moreover, the LLE is always negative if B > A (consistently with the diagram depicted in Figure 3) and is larger for high absolute values of the payoffs (so that A and B are larger).

Since the system is 2-dimensional, in order to compute the Lyapunov exponents it is necessary to periodically orthogonalize the unit vectors using a Gram-Schmidt procedure, see Benettin et al. (1980). Note that, while this is strictly necessary only in order to obtain the whole Lyapunov spectrum, and so compute the Kaplan-Yorke dimension, in practice the estimate of the LLE is much more accurate if one uses the orthogonalization method even just to compute the LLE. Since we choose C = −A and D = −B, there is a 4-fold symmetry in the AB plane, so we only plot the first quadrant.
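For the two-dimensional map assumed above, the LLE can be estimated by evolving a tangent vector with the Jacobian and renormalizing it at every step, in the spirit of Benettin et al. (1980); in this two-dimensional sketch the per-step renormalization plays the role that Gram-Schmidt orthogonalization plays when the whole spectrum is needed. A hypothetical illustration, not the authors' code:

```python
import math

def lle(A, B, C, D, alpha, beta, steps=20000, transient=500):
    """Largest Lyapunov exponent of the assumed map
    x' = (1-a)x + b(A tanh y + B),  y' = (1-a)y + b(C tanh x + D),
    via tangent-vector evolution with per-step renormalization."""
    x, y = 0.1, -0.1
    vx, vy = 1.0, 0.0
    total, count = 0.0, 0
    for t in range(steps):
        # Jacobian at (x, y): [[1-a, b*A*sech^2(y)], [b*C*sech^2(x), 1-a]]
        j12 = beta * A / math.cosh(y) ** 2
        j21 = beta * C / math.cosh(x) ** 2
        x, y = ((1 - alpha) * x + beta * (A * math.tanh(y) + B),
                (1 - alpha) * y + beta * (C * math.tanh(x) + D))
        vx, vy = (1 - alpha) * vx + j12 * vy, j21 * vx + (1 - alpha) * vy
        norm = math.hypot(vx, vy)
        vx, vy = vx / norm, vy / norm
        if t >= transient:
            total += math.log(norm)
            count += 1
    return total / count
```

For a stable discoordination setting (A = 0.3, C = −0.3, B = D = 0, α = 0.2, β = 1) the estimate converges to ln|λ±| = (1/2) ln(0.73) ≈ −0.157, the modulus of the complex eigenvalues (17) at the origin.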
Figure 4: Time series of the probabilities x (in blue) and y (in red) for four (α, β) settings: (a) high α with β = 1; (b) low α and β; (c) very small α and moderately high β; (d) B ≠ 0 with β = 1. The payoff parameters are b = c = f = g = 0, with a = d = 4 and e = h = −4 in panels (a)-(c); panel (d) uses the asymmetric values a, d < 0, e ≈ 11, h ≈ 1.8 (the same payoff matrix as in Figures 5 and 8).

In the above sections, the taxonomy of learning dynamics is determined by three classes of 2 × 2 games. We now ask how frequently each class occurs a priori. We choose an ensemble of payoff matrices obtained by constraining the mean, variance and correlation of the payoff elements. In particular, we assume that the mean is 0, the variance is 1 and the two payoffs for the same profile of pure strategies in the payoff matrix are correlated by a parameter Γ. A value Γ = −1 implies perfectly anticorrelated payoffs, as in zero-sum games; negative values, Γ <
0, are more generally associated with competitive games; on the contrary, Γ = 1 reveals perfect correlation and positive values of Γ are related to cooperative games. Finally, Γ = 0 implies lack of correlation. Under these constraints, the maximum entropy distribution is a bivariate Gaussian with the specified mean and covariance matrix (Galla and Farmer, 2013).

Figure 7 represents the fraction of games which belong to each of the three classes, as a function of the correlation parameter Γ. We see that for all values of Γ, dominance-solvable games are the most likely. Positive values of Γ are associated with (anti)coordination games, which display multiple fixed points under EWA learning, whereas for negative values of Γ it is more likely to obtain a discoordination game, and consequently limit cycles or chaos in the learning dynamics. Indeed, this was observed by Galla and Farmer (2013), who find convergence to multiple fixed points for Γ > 0 and non-convergent dynamics for Γ < 0 (provided irrationality α/β is low).

Figure 5: Bifurcation diagrams and largest Lyapunov exponent for b = c = f = g = 0, a, d < 0, e ≈ 11 and h ≈ 1.8. Panels (a) and (c): β = 1, α varying. Panels (b) and (d): α = 0.2, β varying. Low-dimensional chaos may be observed.

Figure 6: Largest Lyapunov exponent as a function of the parameters A and B. The colour scale is set such that chaos is observed from green to red. The parameters are C = −A, D = −B, β = 1, and two values of α in panels (a) and (b). Chaos is mainly observed where B > 0 and the payoffs are quite large in absolute value.

Figure 7: Fraction of dominance-solvable, (anti)coordination and discoordination games, as a function of the correlation Γ. These results are averaged over 10000 random draws of the (Gaussian) payoff matrix.

A difference with Galla and Farmer (2013) is that, whereas they find consistently unstable behaviour in certain regions of the parameter space, we cannot rule out convergence to a fixed point for Γ < 0. In fact, for all values of Γ, most payoff matrices describe dominance-solvable games, which always display a stable fixed point. This difference might be explained by the fact that Galla and Farmer (2013) consider high-dimensional strategy spaces, whereas we are restricted to two pure strategies. A reasonable conjecture is that, by increasing the number of pure strategies, the fixed points of the learning dynamics may become unstable. We leave the exploration of this conjecture to future work.
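The fractions in Figure 7 can be reproduced with a short Monte Carlo. This sketch makes two assumptions worth flagging: the reduced parameters are computed as in Appendix A (H = a − c, K = d − b for Row, L = e − g, M = h − f for Column) with A = (H + K)/2, B = (H − K)/2 and analogously C, D for Column; and a game is classified as dominance-solvable when at least one player has a dominant strategy (|B| ≥ |A| or |D| ≥ |C|), as (anti)coordination when AC > 0, and as discoordination otherwise.

```python
import random

def classify(a, b, c, d, e, f, g, h):
    """Classify a 2x2 game via the reduced parameters of Appendix A."""
    H, K = a - c, d - b          # Row
    L, M = e - g, h - f          # Column
    A, B = (H + K) / 2, (H - K) / 2
    C, D = (L + M) / 2, (L - M) / 2
    if abs(B) >= abs(A) or abs(D) >= abs(C):
        return "dominance-solvable"      # at least one dominant strategy
    return "(anti)coordination" if A * C > 0 else "discoordination"

def fractions(gamma, n=20000, seed=0):
    """Fractions of the three classes when each profile's payoff pair is
    drawn from a bivariate Gaussian with mean 0, variance 1, correlation gamma."""
    rng = random.Random(seed)
    s = (1 - gamma ** 2) ** 0.5
    counts = {"dominance-solvable": 0, "(anti)coordination": 0,
              "discoordination": 0}
    for _ in range(n):
        pay = []
        for _ in range(4):               # one (Row, Column) pair per profile
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            pay += [z1, gamma * z1 + s * z2]
        a, e, b, f, c, g, d, h = pay
        counts[classify(a, b, c, d, e, f, g, h)] += 1
    return {k: v / n for k, v in counts.items()}
```

At Γ = 0 this classification gives dominance-solvable games a fraction of 3/4 (each player independently has a dominant strategy with probability 1/2), with the remaining 1/4 split evenly between the other two classes; for Γ < 0 discoordination games dominate the remainder, in line with Figure 7.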
Here we show that the NE in pure strategies are “infinitely” unstable.
Proposition 3.
Consider a generic 2 × 2 game and the learning dynamics in the original coordinates (6). At the profiles of pure strategies, (x, y) ∈ {(0, 0), (0, 1), (1, 0), (1, 1)}, for positive memory loss, α > 0, the Jacobian has infinite elements along the main diagonal and null elements along the antidiagonal.

The proof of Proposition 3 is in Appendix C. A clarification is in order here: while the NE in pure strategies are formally unstable (unless α = 0), for most values of the parameters there is a fixed point nearby. In particular, if irrationality is not too high and the absolute values of the payoffs are not too small, it is likely that one of the fixed points will be quite close to the NE in pure strategies. This result could be anticipated, since, e.g., for dominance-solvable games a reasonable learning dynamics is expected to converge sufficiently close to the NE.

We generalize the results in Section 4 by relaxing two seemingly restrictive assumptions. In Section 5.1 we drop the simplification of deterministic learning and analyse the stochastic learning dynamics. All previous results still hold, and the only effect of this extension is a blurring of the dynamical properties. In Section 5.2 we analyse the more general case where the parameter κ, which interpolates between average and cumulative reinforcement learning, is not restricted to κ = 1. This allows us to recover belief learning and to reproduce the well-known result about the convergence of fictitious play in 2 × 2 games.

When playing a game, except for very specific experimental arrangements (Conlisk, 1993a), the players would update their propensities after observing each move by their opponent. This questions whether the deterministic dynamics (6), which assumes that the participants of the game play against each other many times before updating their propensities, provides robust conclusions. We interpolate from the deterministic limit by considering batches of size T, in which the players sample their mixed strategies.
The limit T → ∞ recovers deterministic learning, whereas actual learning would occur with T = 1. As noted in Section 3.2, unless T = 1, the meaning of the parameter δ is unclear. Indeed, a value of δ different from 1 implies that the players give an additional update to the attraction corresponding to the move which they chose. This rule is not well defined if they play against each other many times before updating their attractions, as they might choose both pure strategies at least once. However, for T = 1 we consider several values of δ, and we show that the lower the value of δ, the noisier the learning dynamics becomes, as there is an additional source of stochasticity: which strategy the player randomly chooses, in addition to which strategy is randomly chosen by his opponent.

It is beyond the scope of this paper to systematically study the effect of noise on the learning dynamics, and we refer the reader to Galla (2009) for a study of the effect of noise on learning, and to Crutchfield et al. (1982) for a more general discussion of the effect of noise on the properties of dynamical systems. In the following we show a few numerical examples where we investigate what happens as we progressively increase the level of noise. We simply describe our findings and leave most of the numerical support to Appendix D. We stress that the dynamical properties in the deterministic limit, in order to be considered robust, need to hold down to T = 1, as that is the natural choice for a realistic learning dynamics. We focus on the three classes of games which we identified in the paper.

Dominance-solvable games. Provided that the irrationality parameter is not too high, the players converge close to the pure strategy NE (Figures D.1 (a)-(d)). After an irregular transient, as the learning dynamics moves close to the faces of the simplex, it becomes remarkably stable. On the contrary, if α/β is high, the players converge to the centre of the simplex, as occurs with deterministic learning (Figures D.1 (e)-(f)). However, the learning dynamics is much more irregular. The asymptotic learning behaviour is explained by two factors: deviations from the previous moves, and their effect. If the players always played the same moves, the learning dynamics would converge to a fixed point. But as soon as one of them switches her move, we observe a perturbation away from such a fixed point. This explains in part why, close to the centre of the simplex, the learning dynamics is more irregular: the players converge to a mixed strategy where they choose each move with roughly the same probability. The other factor is that the attractions are large at the faces of the simplex, so the relative magnitude of their update (due to the deviation) is smaller. We also observe another pattern in Figures D.1 (a)-(d): the higher the level of noise (i.e., the smaller T and/or the smaller δ), the more irregular the transient.

(Anti)Coordination games. As for dominance-solvable games, we observe convergence to a fixed point close to one of the pure strategy NE (for low levels of irrationality). We investigate whether noise can help in reaching the Pareto-optimal NE, as it does in the theory of stochastic stability (Young, 1993). Given the previous remark on the effect of noise near the faces of the simplex, we expect that stochastic learning can help reach the Pareto-optimal NE only in the first steps of the dynamics. This conjecture is confirmed by the numerical simulations in Figure D.2.
We find that EWA is path dependent, unlike the learning algorithms introduced by Young (1993), which are based on ergodic Markov chains. With EWA, the learning dynamics reaches the Pareto-optimal NE only if there is a favourable fluctuation in the first stage of the dynamics.
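This path dependence can be illustrated with a sampled-move version of the map assumed earlier: at T = 1 we replace the opponent's average-move term tanh(ỹ) with a realized move s ∈ {−1, +1}, drawn with probability (1 + tanh ỹ)/2 for +1. This is a hypothetical illustration, not the exact EWA update: in a symmetric coordination game the process locks into one of the two pure NE depending on early fluctuations.

```python
import math
import random

def run_coordination(A, alpha, beta, steps=500, seed=0):
    """Stochastic (T = 1) learning in a symmetric coordination game:
    each player's average-move term is replaced by a sampled move."""
    rng = random.Random(seed)
    x = y = 0.0
    for _ in range(steps):
        sx = 1 if rng.random() < (1 + math.tanh(x)) / 2 else -1
        sy = 1 if rng.random() < (1 + math.tanh(y)) / 2 else -1
        x = (1 - alpha) * x + beta * A * sy
        y = (1 - alpha) * y + beta * A * sx
    return x, y

# 40 independent runs: roughly half lock into each pure NE,
# with the outcome decided by the early noise.
finals = [run_coordination(3.0, 0.2, 1.0, seed=s)[0] for s in range(40)]
```

Once the propensities are large, the probability of a deviating move is exponentially small, so the early fluctuation is effectively irreversible, which is the path dependence described above.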
Discoordination games
In Section 4 we identify two learning behaviours: if irrationality is high, the dynamics converges to the centre of the strategy simplex and the players simply randomize between their moves; if irrationality is low, the players do not converge to an equilibrium and the mixed strategies keep oscillating. This distinction survives when we allow for noise. In Figure 8 we plot the stochastic time series for both behaviours. In Figure 8a the mixed strategy fixed point of the corresponding deterministic dynamics is unstable and the stochastic learning dynamics is chaotic (the parameters are the same as in Fig. 5), whereas in Figure 8b the mixed strategy fixed point is an attractor of the (deterministic) dynamics. It is immediately clear that in the latter case there is a total lack of autocorrelation in the moves of each player (because the dynamics does not spend much time near the faces of the simplex), whereas in the former the autocorrelation function decays more slowly as a function of the time lag. These results are confirmed in Figure 9 and constitute a precise theoretical prediction that can be tested against data on experimental learning in discoordination games. Finally, Figure 10 represents the same bifurcation diagram and largest Lyapunov exponent as in Figs. 5a and 5c respectively, with the only difference that we consider stochastic learning with T = 1. For small values of α the LLE is still positive, so the dynamics is chaotic. We consider several values of T in Figure D.3. We observe the equivalence between parametric and additive noise (Crutchfield et al., 1982): the effect of noise on the properties of dynamical systems equivalently occurs as a perturbation of their trajectories or as a perturbation of their parameter values. (In fact, the noise source induced by finite T is not additive, but it is always possible to express the noise through a properly defined additive stochastic term in the dynamical equations.) By progressively increasing the level of noise, we observe a smoothing of both the bifurcation diagram and the plot representing the LLE, losing the finely alternating structure with bands of chaos and windows of regularity.
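The qualitative difference between the two regimes of Figure 8 can be mimicked with the same sampled-move (T = 1) version of our assumed reduced map — a hedged sketch, not the full EWA update with attractions:

```python
import math
import random

def run_stochastic(A, alpha, beta, steps=20000, seed=1):
    """T = 1 stochastic version of the assumed discoordination map
    (C = -A, B = D = 0): the opponent's average-move term tanh(.)
    is replaced by a sampled move in {-1, +1}."""
    rng = random.Random(seed)
    x = y = 0.1
    xs = []
    for _ in range(steps):
        sx = 1 if rng.random() < (1 + math.tanh(x)) / 2 else -1
        sy = 1 if rng.random() < (1 + math.tanh(y)) / 2 else -1
        x, y = (1 - alpha) * x + beta * A * sy, (1 - alpha) * y - beta * A * sx
        xs.append(x)
    return xs

xs_stable = run_stochastic(0.3, 0.2, 1.0)   # deterministic fixed point stable
xs_chaotic = run_stochastic(3.0, 0.2, 1.0)  # deterministic fixed point unstable
```

In the stable regime the propensity is confined near the centre (here |x̃| ≤ βA/α = 1.5 by construction) and the moves look like uncorrelated coin flips; in the unstable regime the trajectory makes large, persistent excursions, consistent with the slowly decaying autocorrelation in Figure 9.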
Figure 8: Time series of the probabilities x (in blue) and y (in red), for stochastic learning with T = 1. The payoff parameters are b = c = f = g = 0, a, d < 0, e ≈ 11 and h ≈ 1.8. The memory loss is α = 0.2; the intensity of choice is β = 1 in panel (a), implying deterministic chaotic behaviour, and β < 1 in panel (b).

Figure 9: (a) Time series of the moves of player Column, for stochastic learning with T = 1. The upper (lower) panel corresponds to the stochastic learning dynamics in Fig. 8a (8b). (b) Autocorrelation function of the moves of player Column, for both learning dynamics represented in the left panel. If irrationality is high, the players randomize between their moves and the autocorrelation decays instantaneously.

We drop the assumption of cumulative reinforcement learning (κ = 1) and analyse other learning algorithms in the EWA family. Looking at Eqs. (6) and (8), in order to consider a general value of κ, it is sufficient to rescale the intensity of choice β, replacing it by β̃ = β[1 − (1 − α)(1 − κ)]. As the quantity multiplying β is lower than one, the intensity of choice is smaller and so the irrationality parameter is larger. Therefore, the learning dynamics is generally more stable, and it is easier to converge to a fixed point in the centre of the simplex.

If κ = 0 and δ = 1 we recover most forms of belief learning (α = 1: best-response dynamics; α = 0: fictitious play; 0 < α <
1: weighted fictitious play). The rescaled intensity of choice is β̃ = βα. First of all, this means that the coordinates of the fixed points no longer depend on α (Eqs. (10) and (20)). However, the memory loss sets the timescale of convergence to the fixed points, and the stability condition retains a dependence on α, which does not cancel out:

(βα/√(2α − α²)) A ≤ 1.  (21)

The derivative of the LHS with respect to α is positive, so smaller and smaller values of α make stability more and more likely. In other words, the parameter space where it is possible to observe unstable behaviour shrinks as α is reduced. In the limit α → 0, the LHS goes to zero, so stability is ensured for all parameter values. Note that the case α = 0, κ = 0 is the standard fictitious play learning algorithm (see Ho et al. (2007), Fig. 1) that was ruled out by obtaining the steady state dynamical equations (5) from the more general EWA rule (4). However, fictitious play can still be approached by taking the limit α → 0.

Figure 10: Bifurcation diagram and largest Lyapunov exponent as a function of α for stochastic learning with T = 1. The payoff parameters are the same as in Figure 5 (in the reduced notation, H, K < 0, L ≈ 11, M ≈ 1.8), with β = 1. Chaos survives for small values of α and we observe the equivalence between additive and parametric noise.

In this paper we have exhaustively characterized the dynamics of EWA learning in generic 2 × 2 games. A variety, indeed a taxonomy, of different behaviours can be observed, according to the properties of the payoff matrix and to the values of the parameters of the learning algorithm. The taxonomy naturally relates to classes of games that have been extensively studied in the literature: in dominance-solvable games we observe convergence towards the unique pure strategy NE; in coordination games we find multiple fixed points corresponding to the NE; in discoordination games the unique mixed strategy NE may be unstable and the learning dynamics may settle in a limit cycle or a low-dimensional chaotic attractor. However, for all classes of games, if the players cannot choose with certainty the best performing strategy (because of finite intensity of choice), quickly forget the past performance of their moves and/or have little incentive in terms of payoffs, the learning dynamics converges to a fixed point well in the centre of the simplex, where the players simply randomize between the pure strategies.

The novelty of this work is first of all in its approach: we have identified a number of relevant parameters and classified the learning dynamics accordingly, by ex-post relating the values of the parameters to the classes of games described above. In particular, we have found that irrationality, defined as the ratio of memory loss α to intensity of choice β, if large implies convergence to a mixed strategy in the centre of the simplex.
We have then defined a coordination parameter by computing the difference between the diagonal and the off-diagonal elements in the payoff matrix for each player (A for Row and C for Column), and multiplying the two numbers (AC). A large positive value of coordination is related to a coordination or an anticoordination game, where the players try to coordinate on the same profiles of pure strategies. (In coordination games, where A and C are positive, the players try to coordinate on profiles where they play the same strategy; in anticoordination games, where A and C are negative, on profiles where their moves differ.) If coordination is negative (A is positive and C is negative, or vice versa), the players try to coordinate on different profiles of strategies and there is no pure strategy NE. The payoff matrix then defines a discoordination game and, for a good level of rationality, is related to an unstable learning dynamics. The third parameter is called dominance. It is obtained as the absolute value of the product of the differences between the payoffs associated with one pure strategy and the payoffs associated with the other one, for both players (B for Row and D for Column, so dominance is |BD|). If it is large, it is likely that the payoff matrix describes a dominance-solvable game.

Thanks to the exhaustive characterization of 2 × 2 games, one can ask which learning behaviour is typical in more general games, a question that has not been thoroughly explored. It is sensible that, by increasing the size of the strategy sets and/or the number of players, unstable dynamics may become prevalent. Some work has in part confirmed this conjecture (Sanders et al., 2016), but a more systematic investigation is required.

The ultimate goal of this line of research is to test whether learning converges in experiments. Most experiments show approximate aggregate convergence, but the underlying games usually have distinct equilibria and paths of convergent best replies.
For general payoff matrices with cycles in best responses and several players, the players may just endlessly cycle between the profiles of strategies, even in reality, and equilibrium concepts would be meaningless. For 2 × 2 games, our taxonomy provides precise predictions that can be tested against experimental data.

Bibliography
Arieli, I. and Young, H. P. (2016) “Stochastic Learning Dynamics and Speed of Convergence in Population Games,” Econometrica, Vol. 84, pp. 627–676.

Benaïm, M., Hofbauer, J., and Hopkins, E. (2009) “Learning in games with unstable equilibria,” Journal of Economic Theory, Vol. 144, pp. 1694–1709.

Benettin, G., Galgani, L., Giorgilli, A., and Strelcyn, J.-M. (1980) “Lyapunov characteristic exponents for smooth dynamical systems and for Hamiltonian systems; a method for computing all of them. Part 1: Theory,” Meccanica, Vol. 15, pp. 9–20.

Berg, J. and Weigt, M. (1999) “Entropy and typical properties of Nash equilibria in two-player games,” EPL (Europhysics Letters), Vol. 48, pp. 129–135.

Brock, W. A. and Hommes, C. H. (1997) “A rational route to randomness,” Econometrica: Journal of the Econometric Society, pp. 1059–1095.

Brown, G. W. (1951) “Iterative solution of games by fictitious play,” in T. Koopmans ed. Activity Analysis of Production and Allocation, New York: Wiley, pp. 374–376.

Camerer, C. and Ho, T. (1999) “Experience-weighted attraction learning in normal form games,” Econometrica, Vol. 67, pp. 827–874.

Conlisk, J. (1993a) “Adaptation in games: Two solutions to the Crawford puzzle,” Journal of Economic Behavior & Organization, Vol. 22, pp. 25–50.

(1993b) “Adaptive tactics in games: Further solutions to the Crawford puzzle,” Journal of Economic Behavior & Organization, Vol. 22, pp. 51–68.

Cournot, A.-A. (1838) Recherches sur les principes mathématiques de la théorie des richesses, Paris: L. Hachette.

Crawford, V. P. (1974) “Learning the optimal strategy in a zero-sum game,” Econometrica: Journal of the Econometric Society, pp. 885–891.

(1995) “Adaptive dynamics in coordination games,” Econometrica: Journal of the Econometric Society, pp. 103–143.

Crutchfield, J. P., Farmer, J. D., and Huberman, B. A. (1982) “Fluctuations and simple chaotic dynamics,” Physics Reports, Vol. 92, pp. 45–82.

Foster, D. P. and Young, H. P. (1998) “On the nonconvergence of fictitious play in coordination games,” Games and Economic Behavior, Vol. 25, pp. 79–96.

Galla, T. (2009) “Intrinsic noise in game dynamical learning,” Physical Review Letters, Vol. 103, p. 198702.

Galla, T. and Farmer, J. D. (2013) “Complex dynamics in learning complicated games,” Proceedings of the National Academy of Sciences, Vol. 110, pp. 1232–1236.

Hahn, S. (1999) “The convergence of fictitious play in 3 × 3 games with strategic complementarities,” Economics Letters, Vol. 64, pp. 57–60.

Ho, T. H., Camerer, C. F., and Chong, J.-K. (2007) “Self-tuning experience weighted attraction learning in games,” Journal of Economic Theory, Vol. 133, pp. 177–198.

Hofbauer, J. and Sigmund, K. (1998) Evolutionary Games and Population Dynamics: Cambridge University Press.

Hommes, C. H., Ochea, M. I., and Tuinstra, J. (2016) “Evolutionary Competition between Adjustment Processes in Cournot Oligopoly: Instability and Complex Dynamics,” Technical report, THEMA (THéorie Economique, Modélisation et Applications), Université de Cergy-Pontoise.

Huberman, B. A. and Hogg, T. (1988) “The behavior of computational ecologies,” in The Ecology of Computation, pp. 77–115: North-Holland.

Krishna, V. (1992) “Learning in games with strategic complementarities,” Technical report, Harvard Business School.

Milgrom, P. and Roberts, J. (1991) “Adaptive and sophisticated learning in normal form games,” Games and Economic Behavior, Vol. 3, pp. 82–100.

Miyazawa, K. (1961) “On the Convergence of the Learning Process in a 2 × 2 Non-Zero-Sum Two-Person Game,” Research Memorandum No. 33, Econometric Research Program, Princeton University.

Monderer, D. and Sela, A. (1996) “A 2 × 2 game without the fictitious play property,” Games and Economic Behavior, Vol. 14, pp. 144–148.

Monderer, D. and Shapley, L. S. (1996) “Fictitious play property for games with identical interests,” Journal of Economic Theory, Vol. 68, pp. 258–265.

Muth, J. F. (1961) “Rational expectations and the theory of price movements,” Econometrica: Journal of the Econometric Society, Vol. 29, pp. 315–335.

Nachbar, J. H. (1990) ““Evolutionary” selection dynamics in games: Convergence and limit properties,” International Journal of Game Theory, Vol. 19, pp. 59–89.

Nagel, R. (1995) “Unraveling in guessing games: An experimental study,” The American Economic Review, Vol. 85, pp. 1313–1326.

Opper, M. and Diederich, S. (1992) “Phase transition and 1/f noise in a game dynamical model,” Physical Review Letters, Vol. 69, pp. 1616–1619.

Ott, E. (2002) Chaos in Dynamical Systems: Cambridge University Press.

Pomeau, Y. and Manneville, P. (1980) “Intermittent transition to turbulence in dissipative dynamical systems,” Communications in Mathematical Physics, Vol. 74, pp. 189–197.

Rapoport, A., Guyer, M. J., and Gordon, D. G. (1976) The 2 × 2 Game, Ann Arbor: University of Michigan Press.

Robinson, J. (1951) “An iterative method of solving a game,” Annals of Mathematics, pp. 296–301.

Sanders, J. B., Galla, T., and Farmer, J. D. (2016) “The prevalence of complex dynamics in games with many players,” in preparation.

Shapley, L. S. (1964) “Some topics in two-person games,” Advances in Game Theory, Annals of Mathematical Studies, Vol. 52, pp. 1–29.

Stahl, D. O. (1988) “On the instability of mixed-strategy Nash equilibria,” Journal of Economic Behavior & Organization, Vol. 9, pp. 59–69.

(1996) “Boundedly rational rule learning in a guessing game,” Games and Economic Behavior, Vol. 16, pp. 303–330.

Vilone, D., Robledo, A., and Sánchez, A. (2011) “Chaos and unpredictability in evolutionary dynamics in discrete time,” Physical Review Letters, Vol. 107, p. 038101.

Young, H. P. (1993) “The evolution of conventions,” Econometrica: Journal of the Econometric Society, Vol. 61, pp. 57–84.

Payoff parameters and classes of games
To prove Proposition 1 for the generic 2 × 2 payoff matrix (1), it is convenient to consider the transformed payoff matrix

Π′ = ( (H, L)  (0, 0) ; (0, 0)  (K, M) ),  (22)

where H = a − c, K = d − b, L = e − g and M = h − f. Finally, consider

Π′′ = ( (0, 0)  (0, L′) ; (H′, 0)  (K′, M′) ),  (23)

where H′ = −H, L′ = −L, K′ = K and M′ = M.

Proposition A.1. (i) The payoff matrices Π and Π′, defined by (1) and (22) respectively, have the same pure and mixed strategy NE. (ii) The EWA dynamics (6) is identical in the two cases, and so is any learning dynamics in which the propensities are mapped to the probabilities through a logit function and the expected payoff enters as an additive term in the update of the propensities. (iii) Any other payoff matrix Π′′ whose elements H′, K′, L′ and M′ are either identical or opposite to H, K, L and M, placed in the same position if identical and in the opposite position if opposite (up rather than down for Row, left rather than right for Column), is equivalent to Π and Π′. An example of such a payoff matrix is Π′′, defined in (23).

Therefore, we set the off-diagonal elements to zero and prove Proposition 1 using payoff matrix (22). We then prove Proposition A.1.
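This equivalence is easy to check numerically. The sketch below is our illustration, not the paper's code: it samples random payoff matrices, builds the diagonal elements of the transformed matrix (22) via H = a − c, K = d − b, L = e − g, M = h − f, and verifies that the mixed-strategy NE computed from the indifference formulas used later in the proof of Proposition A.1 (Eq. (26)) coincides for Π and Π′. Function names and tolerances are our choices.

```python
import random

def transform(a, b, c, d, e, f, g, h):
    # Diagonal elements of the transformed matrix (22): H, K, L, M.
    return a - c, d - b, e - g, h - f

def mixed_ne(a, b, c, d, e, f, g, h):
    # Interior mixed-strategy NE (p, q), Eq. (26); assumes nonzero denominators.
    p = (h - f) / (e - g + h - f)  # probability that Row plays strategy 1
    q = (d - b) / (a - c + d - b)  # probability that Column plays strategy 1
    return p, q

random.seed(0)
max_dev = 0.0
for _ in range(1000):
    a, b, c, d, e, f, g, h = [random.uniform(-1, 1) for _ in range(8)]
    if abs(e - g + h - f) < 1e-2 or abs(a - c + d - b) < 1e-2:
        continue  # skip near-degenerate draws with no well-defined interior NE
    H, K, L, M = transform(a, b, c, d, e, f, g, h)
    p1, q1 = mixed_ne(a, b, c, d, e, f, g, h)
    # Payoff matrix (22): Row payoffs (H, 0; 0, K), Column payoffs (L, 0; 0, M).
    p2, q2 = mixed_ne(H, 0, 0, K, L, 0, 0, M)
    max_dev = max(max_dev, abs(p1 - p2), abs(q1 - q2))
```

Up to floating-point rounding, `max_dev` stays at machine precision, in line with part (i) of Proposition A.1 for the mixed equilibrium.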
Proof of Proposition 1. In terms of the payoff matrix (22), the parameters A, B, C and D are defined as

A = (1/4)(H + K),  B = (1/4)(H − K),  C = (1/4)(L + M),  D = (1/4)(L − M).  (24)

As we are interested only in their relative magnitudes, we drop the 1/4 prefactor and consider

AC = (H + K)(L + M),  |BD| = |(H − K)(L − M)|.  (25)

We start by proving (i). The game is symmetric, so H = L and K = M. Hence |A| = |H + K| and |B| = |H − K|. Moreover, the condition H, K > 0 identifies a coordination game, H, K < 0 an anticoordination game, and if only one of H or K is negative the game is dominance-solvable. So, if H and K have the same sign, the payoff matrix describes a coordination game and the sum of H and K is larger (in absolute value) than their difference, so that coordination is larger than dominance; if the signs of H and K differ, the game is dominance-solvable and the difference between H and K is larger (in absolute value) than their sum: dominance is larger than coordination.

We then consider (ii). If |BD| > |AC|, either |B| > |A|, or |D| > |C|, or both. Therefore, either H and K do not have the same sign, or L and M do not have the same sign, or both. All of these cases represent dominance-solvable games (which profile of pure strategies is the NE of the game depends on the relative signs). On the contrary, the condition |BD| < |AC| does not necessarily imply that both |B| < |A| and |D| < |C|. However, if that is the case, the sums H + K and L + M are larger (in absolute value) than the differences H − K and L − M, which means that H, K have the same sign and so do L, M. If AC > 0, A and C also have the same sign, so either H, K, L, M are all positive or they are all negative. If H, K, L, M > 0, the payoff matrix describes a coordination game; if H, K, L, M < 0, it describes an anticoordination game. If AC < 0, A and C have different signs. Suppose without loss of generality that A > 0 and C < 0. Then H, K > 0 and L, M < 0. The payoff matrix represents a discoordination game.

We still have to show that the larger the value of coordination (compared to dominance), the more likely it is that the payoff matrix describes a coordination or anticoordination game (rather than a dominance-solvable game). This is not obvious. Coordination may be large because A ≫ B, but it could still be that C ≲ D. An extreme example is B = 0 (so dominance is null) while A, C ≠ 0: then |AC| > |BD| = 0 always, but this condition imposes no restriction on whether |C| > |D| or |C| < |D|. The intuition is that randomly chosen payoff elements are unlikely to produce such a specific payoff matrix. We verify this conjecture by running extensive numerical simulations. For each (AC, |BD|) point, we generate 1000 random realizations of the payoff matrix with the specified AC and |BD|; we then compute the fraction of dominance-solvable games (the remaining games are coordination or discoordination games, according to whether we are in the positive or negative AC semiplane). The results are shown in Figure A.1. As expected, if |BD| > AC, all games are dominance-solvable. Vice versa, the larger the absolute value of AC, the more likely the payoff matrix is to represent (anti)coordination or discoordination games. Interestingly, the fraction of dominance-solvable games never drops to zero. Finally, notice the consistency between Figure A.1 and Figure 1 (apart from the fact that there is no neat separation between the dominance-solvable and the (anti)coordination and discoordination regions).

[Panels (a) and (b) omitted: fraction of dominance-solvable games over the (AC, |BD|) plane.]

Figure A.1: Fraction of dominance-solvable games for randomly generated payoff matrices, as a function of the coordination (AC) and dominance (|BD|) parameters.
The larger coordination is compared to dominance, the more likely the payoff matrix describes a coordination (if AC > 0) or an anticoordination or discoordination (if AC < 0) game. For instance, consider the payoff matrix (1) with a = 3, e = 1, d = −h = 2, b = c = f = g = 0: this is a dominance-solvable game, but |AC| = 3/8 > |BD| = 2/8. Note that |D| = 1/4 < |C| = 3/4, but |B| = 1 > |A| = 1/2.

Proof of Proposition A.1. We start by proving (i). The pure strategy NE are determined only by the ordinal properties of the payoffs. Consider player Row. Her contribution in determining the pure strategy Nash Equilibria depends on whether a > c or d > b, so it is unchanged if we consider H = a − c > 0 or K = d − b > 0. The same argument applies to player Column: his contribution in determining the pure strategy Nash Equilibria depends on whether e > g or h > f, so it is unchanged if we consider L = e − g > 0 or M = h − f > 0. The same is true for all other positive/negative combinations.

In the 2 × 2 game, the mixed strategy NE (p, 1 − p) and (q, 1 − q) for players Row and Column respectively are given by

p = (h − f) / (e − g + h − f),  q = (d − b) / (a − c + d − b).  (26)

Again, we can rewrite the above equations without loss of generality in terms of H, K, L and M, namely

q = K / (H + K),  p = M / (L + M).  (27)

We then consider (ii). We focus only on player Row (the proof for Column is identical). If, at time t, Column plays a mixed strategy (y(t), 1 − y(t)), the expected payoff to Row of playing pure strategy 1 is Π_R^1(y(t)) = a y(t) + b(1 − y(t)) and the expected payoff of strategy 2 is Π_R^2(y(t)) = c y(t) + d(1 − y(t)). Now, the ratio x(t+1)/(1 − x(t+1)) fully determines x(t+1). Using (6) we find

x(t+1) / (1 − x(t+1)) ∝ exp[ β (1 − (1 − α)(1 − κ)) ( Π_R^1(y(t)) − Π_R^2(y(t)) ) ],  (28)

where Π_R^1(y(t)) − Π_R^2(y(t)) = (a − c) y(t) − (d − b)(1 − y(t)) = H y(t) − K(1 − y(t)). Note that the same argument applies to any other learning algorithm in which the expected payoffs appear in the argument of an exponential and can be separated from the past propensities (i.e. do not enter multiplicatively).

Finally, (iii) follows from the above results. If we consider H′ = −H at the bottom left of the payoff matrix, we have H′ = c − a, so a > c implies H′ < 0: the ordinal properties are preserved under H ↔ −H′. The learning dynamics depends on H as well, since Π_R^1(y(t)) − Π_R^2(y(t)) ∝ (0 − H′) y(t) = H y(t).

Apart from the above properties, we stress that the transformed payoff matrix (22) is not fully equivalent to (1). For instance, consider the following Prisoner's Dilemma (PD):

Π_PD = ( (2, 2)  (0, 3) ; (3, 0)  (1, 1) ),  (29)

where strategy 1 is Cooperate and strategy 2 is Defect. The transformed payoff matrix is

Π′_PD = ( (−1, −1)  (0, 0) ; (0, 0)  (1, 1) ).  (30)

The payoff matrices (30) and (29) are not equivalent, in that the property that the NE and the Pareto-efficient outcome do not coincide is lost, and so is the dilemma between cooperation and defection.

In a similar manner, consider the Stag Hunt (SH) game:

Π_SH = ( (3, 3)  (0, 2) ; (2, 0)  (2, 2) ),  (31)

where strategy 1 is Stag (S) and strategy 2 is Hare (H). Here (S,S) is the payoff-dominant NE, while (H,H) is the risk-dominant NE. If we apply the transformation we find

Π′_SH = ( (1, 1)  (0, 0) ; (0, 0)  (2, 2) ).  (32)

The above is a pure coordination game, and the properties of payoff dominance and risk dominance no longer hold.

However, note that both in (29)–(30) and in (31)–(32) the NE are the same, and so are all differences in payoffs obtained holding the strategy of the other player fixed. A learning algorithm that bases its updates on the performance of one pure strategy relative to the other should therefore be invariant under the payoff matrix transformation we described: this is probably the most intuitive explanation of why Proposition 1 holds.

B Proof of Proposition 2
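The bifurcation established in this appendix can be previewed numerically. The sketch below is ours, under the assumption that the relevant one-dimensional reduced map has the form x̃ ↦ r tanh(x̃) with r = (β/α)|A|; the function name, starting point and parameter values are illustrative choices. Iterating on either side of r = 1 shows the fixed point at zero losing stability to two symmetric lateral fixed points.

```python
import math

def iterate_reduced_map(r, x0=0.1, n_steps=500):
    # Iterate the assumed reduced map x -> r * tanh(x), with r = beta*|A|/alpha.
    x = x0
    for _ in range(n_steps):
        x = r * math.tanh(x)
    return x

# Below the bifurcation (r < 1) the iteration collapses onto the fixed point at 0.
x_sub = iterate_reduced_map(0.8)
# Above it (r > 1) the iteration settles on a nonzero lateral fixed point;
# starting from -x0 would select the symmetric negative one.
x_super = iterate_reduced_map(1.5)
```

Near r = 1 the lateral fixed point is small, consistent with a third-order expansion of the map; far above the threshold it approaches the height r of the hyperbolic tangent.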
We first consider assertion (i). Since B = 0, there is always a fixed point x̃* = 0. It is stable if (from Eq. (13))

(β/α)|A| ≤ 1.  (33)

So, as long as x̃* = 0 is the unique fixed point, it is stable.

We then consider assertion (ii), and in particular the lower bound, (β/α)|A| → 1⁺. There are two fixed points x̃* = ±ε, where ε is an arbitrarily small number. Thanks to the symmetry of the game, we focus on a profile of mixed strategies given by (x̃*, x̃*). To second order, cosh x̃* ≈ 1 + (x̃*)²/2. The stability condition becomes

(α/β) (1 + (x̃*)²/2)(1 + (x̃*)²/2) − |A| ≥ 0,  (34)

i.e., to leading order,

(x̃*)² ≥ (β/α)|A| − 1.  (35)

Now, we Taylor expand Ψ(x̃*) (defined in Section 4.1.1) to third order (first order would just yield x̃* = 0) and solve x̃* = Ψ(x̃*). Apart from the null solution, we get

(x̃*)² = 3 ( (β/α)|A| − 1 ) / ( (β/α)|A| ).  (36)

It is easily checked that for (β/α)|A| → 1⁺, condition (35) is satisfied: the fixed points whose components are the "lateral solutions" are stable. Therefore, there is a supercritical pitchfork bifurcation at (β/α)|A| = 1.

The upper bound, (β/α)|A| → ∞, is easily dealt with. Indeed, because we are searching for the intersection with the x̃* line, the fixed point is approximately the height of the hyperbolic tangent itself: x̃* ≈ ±(β/α)|A|. Now, for (β/α)|A| → ∞ the hyperbolic cosine can be approximated by

cosh( (β/α)|A| ) ≈ exp( (β/α)|A| ) / 2.  (37)

We can then rewrite the stability condition as

4 (β/α)|A| exp( −2(β/α)|A| ) ≤ 1.  (38)

For (β/α)|A| → ∞, the LHS of the above equation goes to zero, so the inequality obviously holds.

Finally, the proof of (iii) is identical to the proof of the upper bound for (β/α)|A|, in that the same arguments apply to sufficiently large values of (β/α)|B|, for which the only fixed point will be far enough from zero to be stable.

C Proof of Proposition 3
In order to study the properties of the pure strategy NE we need to consider the learning dynamics in the original coordinates (the pure strategies map into infinite elements in the transformed coordinates). Using (6) and the payoff matrix (1), the EWA dynamics reads

x(t+1) = x(t)^{1−α} e^{β(a y(t) + b(1−y(t)))} / [ x(t)^{1−α} e^{β(a y(t) + b(1−y(t)))} + (1 − x(t))^{1−α} e^{β(c y(t) + d(1−y(t)))} ],
y(t+1) = y(t)^{1−α} e^{β(e x(t) + f(1−x(t)))} / [ y(t)^{1−α} e^{β(e x(t) + f(1−x(t)))} + (1 − y(t))^{1−α} e^{β(g x(t) + h(1−x(t)))} ].  (39)

From Eq. (39) we see that the pure strategy profiles (x, y) ∈ {(0,0), (0,1), (1,0), (1,1)} are all fixed points of the dynamics. Let us study their stability properties. The Jacobian is

J = ( J11  J12 ; J21  J22 ),  (40)

with

J11 = (1 − α)(x − x²)^α e^{β(y(a−b−c+d)+b−d)} / [ x(1 − x)^α e^{β(y(a−b−c+d)+b−d)} + (1 − x)x^α ]²,
J12 = β (x − x²)^{α+1} (a − b − c + d) e^{β(y(a−b−c+d)+b−d)} / [ x(1 − x)^α e^{β(y(a−b−c+d)+b−d)} + (1 − x)x^α ]²,
J21 = β (y − y²)^{α+1} (e − f − g + h) e^{β(x(e−f−g+h)+f−h)} / [ y(1 − y)^α e^{β(x(e−f−g+h)+f−h)} + (1 − y)y^α ]²,
J22 = (1 − α)(y − y²)^α e^{β(x(e−f−g+h)+f−h)} / [ y(1 − y)^α e^{β(x(e−f−g+h)+f−h)} + (1 − y)y^α ]².  (41)

As can be seen by taking the appropriate limits in Eqs. (41), for all pure strategy profiles the Jacobian has infinite elements along the main diagonal (for α > 0) and null elements along the antidiagonal. This means that the NE in pure strategies are infinitely unstable, and this may be the reason for the extreme nonlinearities observed in Galla and Farmer (2013) near the faces of the simplex. The only case in which the elements of the Jacobian at the pure strategy NE are not infinite is that of no memory loss, α = 0, as can be seen by computing the eigenvalues under this parameter restriction.
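The divergence of the diagonal Jacobian entries can also be seen numerically. The following sketch is ours (the payoff values, the finite-difference step and all names are arbitrary illustrative choices): it implements the map (39) and estimates the entry ∂x(t+1)/∂x(t) by central differences at points approaching the pure profile (1, 1). The entry grows without bound when α > 0 but stays bounded when α = 0.

```python
import math

def ewa_step(x, y, a, b, c, d, e, f, g, h, alpha, beta):
    # One step of the EWA map, Eq. (39).
    ux = x ** (1 - alpha) * math.exp(beta * (a * y + b * (1 - y)))
    vx = (1 - x) ** (1 - alpha) * math.exp(beta * (c * y + d * (1 - y)))
    uy = y ** (1 - alpha) * math.exp(beta * (e * x + f * (1 - x)))
    vy = (1 - y) ** (1 - alpha) * math.exp(beta * (g * x + h * (1 - x)))
    return ux / (ux + vx), uy / (uy + vy)

def j11(x, y, payoffs, alpha, beta, eps=1e-8):
    # Central finite-difference estimate of the Jacobian entry dx(t+1)/dx(t).
    xp, _ = ewa_step(x + eps, y, *payoffs, alpha, beta)
    xm, _ = ewa_step(x - eps, y, *payoffs, alpha, beta)
    return (xp - xm) / (2 * eps)

payoffs = (2, 0, 0, 2, 2, 0, 0, 2)  # a simple coordination game (illustrative)
# Approach the pure strategy profile (x, y) = (1, 1) from inside the simplex.
points = [1 - 10 ** -k for k in (2, 4, 6)]
j_memory_loss = [j11(p, p, payoffs, alpha=0.2, beta=1.0) for p in points]
j_no_memory_loss = [j11(p, p, payoffs, alpha=0.0, beta=1.0) for p in points]
```

With α = 0.2 the estimates increase steadily as the profile is approached and eventually exceed 1; with α = 0 they settle near a finite value below 1, consistent with the pure NE becoming stable fixed points in the no-memory-loss case.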
In that case, the NE in pure strategies become stable fixed points of the learning dynamics.

D Effect of stochasticity on learning

[Figure D.1 panels omitted: time series plots of x and y against t.]

Figure D.1: Time series of the probabilities x (in blue) and y (in red). Values of the parameters: α = 0. , b = c = f = g = 0, a = e = 2, d = h = − , β = 0. ; (b) Stochastic learning with T = 2; (c) Stochastic learning with T = 1 and δ = 1; (d) Stochastic learning with T = 1 and δ = 0; (e) Deterministic learning; (f) Stochastic learning with T = 1 and δ = 1. Deterministic and stochastic learning are largely similar. See Section 5.1 for further comments.

[Figure D.2 panels omitted: time series plots of x and y against t.]

Figure D.2: Time series of the probabilities x (in blue) and y (in red). Values of the parameters: α = 0. , β = 1, b = c = f = g = 0, a = e = 6, d = h = 1. (a) Deterministic learning starting from the initial conditions x(0) = 0.05 and y(0) = 0.05, close to the Pareto-dominated NE; (b) Deterministic learning starting from the initial conditions x(0) = 0. and y(0) = 0.