Intrinsic noise in game dynamical learning
Tobias Galla∗
Theoretical Physics, School of Physics and Astronomy, The University of Manchester, Manchester M13 9PL, United Kingdom
∗ Electronic address: [email protected]
(Dated: November 5, 2018)

Demographic noise has profound effects on evolutionary and population dynamics, as well as on chemical reaction systems and models of epidemiology. Such noise is intrinsic and due to the discreteness of the dynamics in finite populations. We here show that similar noise-sustained trajectories arise in game dynamical learning, where the stochasticity has a different origin: agents sample a finite number of moves of their opponents in between adaptation events. The limit of infinite batches results in deterministic modified replicator equations, whereas finite sampling leads to a stochastic dynamics. The characteristics of these fluctuations can be computed analytically using methods from statistical physics, and such noise can affect the attractors significantly, leading to noise-sustained cycling or removing periodic orbits of the standard replicator dynamics.
PACS numbers: 02.50.Le, 87.23.Kg, 02.50.Ey, 05.40.-a
Intrinsic noise has been seen to have significant effects on dynamical systems, and may alter their attractors substantially. Noise-sustained oscillations, generated via an amplification mechanism, are for example present in models of population dynamics [1], epidemiology [2] or biochemical reaction systems [3]. The origin of these fluctuations is the discreteness of the dynamics in finite systems; deterministic descriptions are then no longer appropriate. The class of systems in which intrinsic noise cannot be neglected includes models of evolutionary dynamics and game theory, and much current research aims at understanding the effects of this demographic stochasticity using methods from nonequilibrium statistical mechanics and the theory of stochastic processes [4].

Here, we will focus on intrinsic noise resulting from a different origin, and will consider the learning dynamics of agents in a game theoretic setting [5]. This is complementary to more conventional approaches to game theory, which concentrate on the characterisation of equilibrium points [6] or on evolutionary processes [7]. In the learning scenario one considers a small number of agents who interact repeatedly in a given game, observe their opponents' actions, and aim to react by adapting their own strategy profile. Such dynamical models are of particular importance for the understanding of experiments in game theory and behavioral economics, in which human subjects play a given game repeatedly under controlled conditions [8, 9]. As a key result we show that stochasticity, induced by imperfect sampling of the opponents' strategy profiles, can result in trajectories quite different from those of deterministic learning, very much akin to the mechanism by which intrinsic noise in finite populations affects the trajectories of evolutionary systems. While the amount of intrinsic noise in evolutionary dynamics is determined by the number of individuals in the population, our objective here is to characterise the fluctuations in the learning dynamics of two fixed agents. The quantity controlling the noise strength is the number of observations made by the agents in between adaptation events. Furthermore, in a deterministic setting and depending on the game, we demonstrate that memory loss can promote or impede convergence to a Nash equilibrium.

Consider a general symmetric two-player game, played repeatedly by players X and Y, and assume there are $p$ pure strategies in this game. The payoff matrix is given by $a_{ij}$, where $i, j \in \{1, \ldots, p\}$. The rounds of the repeated interaction will be labeled by $t = 1, 2, \ldots$ in the following. In each round player X plays one pure strategy $i(t) \in \{1, \ldots, p\}$, and player Y plays $j(t) \in \{1, \ldots, p\}$. The payoff for X is then $a_{i(t)j(t)}$ and that for Y is $a_{j(t)i(t)}$. If the players play stochastically, i.e. if they resort to mixed strategies, $i(t)$ and $j(t)$ will be random variables. Assuming that player X carries a (time-dependent) mixed strategy profile $\mathbf{x}(t) = (x_1(t), \ldots, x_p(t))$, and similarly $\mathbf{y}(t) = (y_1(t), \ldots, y_p(t))$ for player Y, a learning dynamics is then a prescription used to update these strategy profiles between subsequent rounds of the game. Here $x_i(t)$ denotes the probability with which player X plays pure strategy $i \in \{1, \ldots, p\}$ in round $t$, and similarly for $y_j(t)$.
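To make this notation concrete, the following minimal Python sketch (our own illustration, not code from the paper; the payoff matrix shown is the prisoners' dilemma used later in the text, and the function name play_round is our choice) samples one round of play from given mixed-strategy profiles:

```python
import numpy as np

rng = np.random.default_rng(0)

# Payoff matrix of the prisoners' dilemma used later in the text,
# strategies ordered (C, D); a[i, j] is the payoff to a player choosing
# row i against an opponent choosing column j.
a = np.array([[3.0, 0.0],
              [5.0, 1.0]])

def play_round(x, y):
    """One round: each player draws a pure strategy from their mixed
    strategy profile; returns both actions and both payoffs."""
    i = rng.choice(len(x), p=x)  # action i(t) of player X
    j = rng.choice(len(y), p=y)  # action j(t) of player Y
    return i, j, a[i, j], a[j, i]

# e.g. both players fully mixed:
print(play_round(np.array([0.5, 0.5]), np.array([0.5, 0.5])))
```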
Normalization requires $\sum_{i=1}^{p} x_i(t) = \sum_{j=1}^{p} y_j(t) = 1$.

In order to define a specific learning dynamics, we follow [9, 10] and assume that each player keeps valuations of each pure strategy, measuring their relative performance in the past. More precisely, in a situation without memory loss, the valuation $q_i(t)$ player X has for pure strategy $i$ is the total payoff X would have obtained, had he/she always played strategy $i$ up to time $t$, and given Y's actions. The valuation $r_j(t)$ player Y has for $j$ has an analogous meaning. Following [9, 10], players then use a logit rule

$$x_i(t) = \frac{e^{\Gamma q_i(t)}}{\sum_k e^{\Gamma q_k(t)}}, \qquad y_j(t) = \frac{e^{\Gamma r_j(t)}}{\sum_k e^{\Gamma r_k(t)}}. \qquad (1)$$

The parameter $\Gamma \geq 0$ plays the role of an intensity of choice: $\Gamma = 0$ corresponds to purely random play, and $\Gamma \to \infty$ to deterministic play; we will here focus on the case $0 < \Gamma < \infty$. It is important to distinguish between two types of randomness in the actual play: as prescribed by (1), the players will generally use mixed strategies, so that their actions can be stochastic, even at given strategy valuations. Secondly, the update of the valuations itself will contain some stochasticity, as we will detail next. We will here assume that players update their scores only once every $N$ rounds of the game, and keep them constant in between. This is known as batch learning in computer science [12]. Specifically, we will assume

$$q_k(t+N) = (1-\lambda)\, q_k(t) + \frac{1}{N} \sum_{t'=t}^{t+N-1} a_{k j(t')},$$
$$r_k(t+N) = (1-\lambda)\, r_k(t) + \frac{1}{N} \sum_{t'=t}^{t+N-1} a_{k i(t')}, \qquad (2)$$

with $q_k(t+\tau) = q_k(t)$ for all $\tau = 1, 2, \ldots, N-1$, and similarly for player Y.
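A minimal sketch of one adaptation event of the batch dynamics, Eqs. (1, 2), might look as follows in Python (again our own illustration; the function names logit and batch_update are our choices, and the max-shift inside logit is a standard numerical-stability trick not mentioned in the text):

```python
import numpy as np

rng = np.random.default_rng(1)

def logit(q, gamma):
    """Mixed strategy from valuations via the logit rule, Eq. (1)."""
    w = np.exp(gamma * (q - q.max()))  # shift for numerical stability
    return w / w.sum()

def batch_update(q, r, a, gamma, lam, N):
    """One adaptation event of Eq. (2): both players play N rounds at
    fixed mixed strategies, then all valuations are updated with the
    per-round average of the corresponding (foregone) payoffs."""
    x, y = logit(q, gamma), logit(r, gamma)
    i = rng.choice(len(x), p=x, size=N)  # X's N actions
    j = rng.choice(len(y), p=y, size=N)  # Y's N actions
    q_new = (1 - lam) * q + a[:, j].mean(axis=1)  # average of a_{k j(t')}
    r_new = (1 - lam) * r + a[:, i].mean(axis=1)  # average of a_{k i(t')}
    return q_new, r_new
```

For $N = 1$ this reduces to on-line learning; increasing $N$ suppresses the sampling noise as $N^{-1/2}$.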
On-line learning [12], i.e. updating after each round, is recovered for $N = 1$. In our model all $\{q_i, r_j\}$ are updated at each adaptation event. This corresponds to reinforcement learning in which foregone payoffs are known and reinforced, equivalent to weighted fictitious play belief learning, see Ho et al. [9]. The interpretation of these update rules is understood best by first considering the case $\lambda = 0$: then the increment of $q_k$ between time-steps $t$ and $t+N$ is given by $N^{-1} \sum_{t'=t}^{t+N-1} a_{k j(t')}$. This increment is recognized as the average payoff X would have received per round had he/she played pure strategy $k$ in all rounds $t, t+1, \ldots, t+N-1$.
A non-zero memory-loss rate, $\lambda \in (0, 1)$, leads to an exponential discounting of past payoffs in the payoff terms in Eq. (2). In this paper we follow the setup of [10].

The update rules are intrinsically stochastic; we will refer to (1, 2) as discrete-time stochastic learning (DTSL). After a re-scaling of time, and for large but finite batch size $N$, we can write

$$q_k(\ell+1) = (1-\lambda)\, q_k(\ell) + \sum_j a_{kj}\, y_j(\ell) + \frac{\xi_k(\ell)}{\sqrt{N}},$$
$$r_k(\ell+1) = (1-\lambda)\, r_k(\ell) + \sum_i a_{ki}\, x_i(\ell) + \frac{\eta_k(\ell)}{\sqrt{N}}, \qquad (3)$$

where we approximate the noise variables $\xi_k, \eta_k$ as Gaussian random variables. This amounts to an expansion in $N^{-1/2}$, and within this approximation the covariances of the $\xi_k, \eta_k$ can be obtained, as we will report elsewhere [14]. In the limit of infinite batch size, $N \to \infty$, the dynamics becomes deterministic; we will refer to this as discrete-time deterministic learning (DTDL). Assuming $\Gamma \ll 1$, the dynamics can be approximated by the continuous-time equations

$$\dot{x}_i / x_i = \Gamma \sum_j a_{ij} y_j - \Gamma f[\mathbf{x}, \mathbf{y}] + \lambda \sum_k x_k \ln \frac{x_k}{x_i},$$
$$\dot{y}_j / y_j = \Gamma \sum_i a_{ji} x_i - \Gamma f[\mathbf{y}, \mathbf{x}] + \lambda \sum_k y_k \ln \frac{y_k}{y_j}, \qquad (4)$$

where $f[\mathbf{x}, \mathbf{y}] = \sum_{ij} a_{ij} x_i y_j$, as previously reported and studied in [10], see also [11]. This system maintains the normalisation of probabilities, and is hence effectively $2(p-1)$-dimensional. If the deterministic dynamics has fixed points $\mathbf{z}^* = (x_1^*, \ldots, x_p^*, y_1^*, \ldots, y_p^*)$, they are identical to the fixed points of (4). We now perform an expansion about the fixed point in powers of $N^{-1/2}$, akin to the expansion first proposed in [13]. Writing $\mathbf{z}(\ell) = \mathbf{z}^* + N^{-1/2} \mathbf{\Delta}(\ell)$, one finds

$$\mathbf{\Delta}(\ell+1) = J\, \mathbf{\Delta}(\ell) + \boldsymbol{\zeta}(\ell), \qquad (5)$$

with $J$ the Jacobian at the fixed point, and where $\boldsymbol{\zeta}(\ell)$ is Gaussian white noise, with correlations among its components, which can be worked out analytically [14]. Eq. (5) is the discrete-time analogue of a linear Langevin equation, and the starting point for the analysis of fluctuations about the deterministic limit. In particular, Eq. (5) allows one to compute the stationary distributions of the components of $\mathbf{\Delta}$, as well as their temporal correlations and power spectra $P_i(\omega) = \langle |\widetilde{\Delta}_i(\omega)|^2 \rangle$, with $\widetilde{\Delta}_i(\omega)$ the Fourier transform of $\Delta_i(\ell)$ [14]. This follows the lines of [1].

Here we will illustrate the effects noise has on the learning dynamics using the two examples of the prisoners' dilemma and of the rock-papers-scissors game. The prisoners' dilemma describes a problem of mutual cooperation, where two players each face the choice whether to co-operate (C) or to defect (D). We will here choose the payoff matrix $a_{CC} = 3$, $a_{CD} = 0$, $a_{DC} = 5$, $a_{DD} = 1$. The Nash equilibrium, and fixed point of the standard replicator dynamics ($\lambda = 0$), is defection, and we will in the following discuss the outcome of the batch and on-line learning dynamics with and without memory loss. As seen in Fig. 1a, the deterministic learning dynamics converges to a fixed point; a numerical analysis shows that this fixed point is symmetric with respect to the exchange of players ($\mathbf{x}^* = \mathbf{y}^*$). The defection rate of either player decreases with increasing memory loss (Fig. 1b). The fixed point of (4) depends only on the ratio $\lambda/\Gamma$, and the different curves in Fig. 1b can be collapsed.

FIG. 1: (Color online) Defection rate in the prisoners' dilemma. (a) Defection rate versus time at fixed $\Gamma$ for several values of $\lambda$; symbols from DTSL ($N = 10$, averaged over 1000 runs, defection rate shown for one fixed player), lines from DTDL. (b) Defection rate as a function of the memory-loss rate $\lambda$ for several values of $\Gamma$; $N = 10$, other parameters as in (a).
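The spectra compared with simulations in Figs. 2b and 4d below follow directly from Eq. (5): Fourier transforming gives $\widetilde{\mathbf{\Delta}}(\omega) = (e^{i\omega}\mathbb{1} - J)^{-1}\, \widetilde{\boldsymbol{\zeta}}(\omega)$, so that $P_i(\omega)$ is the $i$-th diagonal element of $M^{-1} B\, (M^{-1})^{\dagger}$, with $M = e^{i\omega}\mathbb{1} - J$ and $B = \langle \boldsymbol{\zeta} \boldsymbol{\zeta}^T \rangle$. A short Python sketch of this evaluation (our own illustration; $J$ and $B$ must be supplied, e.g. from the calculation announced for [14]):

```python
import numpy as np

def power_spectra(J, B, omegas):
    """Spectra P_i(omega) of the components of Delta in Eq. (5),
    Delta(l+1) = J Delta(l) + zeta(l), with white-noise covariance B:
    P(omega) = diag[ M^{-1} B (M^{-1})^dagger ], M = e^{i omega} 1 - J."""
    d = J.shape[0]
    out = np.empty((len(omegas), d))
    for n, w in enumerate(omegas):
        Minv = np.linalg.inv(np.exp(1j * w) * np.eye(d) - J)
        S = Minv @ B @ Minv.conj().T
        out[n] = S.diagonal().real  # diagonal of the Hermitian S is real
    return out
```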
The learning dynamics at finite batch size and $\lambda > 0$ behaves differently: while the deterministic dynamics converges to a mixed-strategy fixed point, learning at finite batch sizes leads to a distribution of mixed-strategy vectors, as indicated in Fig. 2a. The width of these distributions scales as $N^{-1/2}$, and can be obtained from the theory to great accuracy. Panel 2b demonstrates that our analytical approach captures spectral properties of the fluctuations as well, and again near-perfect agreement between theory and simulations is found. These results show that the expansion in the inverse batch size is a viable analytical tool for the characterization of stochastic effects in game dynamical learning, and we will proceed to apply it to a second matrix game in the following.

Rock-papers-scissors (RPS) is a game with $p = 3$ strategies and cyclic dominance, as indicated by the payoff matrix $a_{RS} = a_{SP} = a_{PR} = 1$, $a_{SR} = a_{PS} = a_{RP} = -1$, $a_{RR} = a_{PP} = a_{SS} = 0$. If the system is started from symmetric initial conditions, $(x_R, x_P, x_S) = (y_R, y_P, y_S)$, the continuous-time replicator dynamics, Eqs. (4) at $\lambda = 0$, reduces to a one-population dynamics, and these have one neutrally stable fixed point at $x_R^* = x_P^* = x_S^* = 1/3$. As a measure of the distance from this central fixed point we use $H = -\ln(x_R x_P x_S) - 3\ln 3$, which vanishes at the center of the simplex and grows towards its boundary.
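In code, the cyclic payoff structure and the distance measure read as follows (a sketch assuming the definition of $H$ given above, with the additive constant chosen so that $H$ vanishes at the center):

```python
import numpy as np

# RPS payoff matrix, strategies ordered (R, P, S): a_RS = a_SP = a_PR = 1,
# a_SR = a_PS = a_RP = -1, zero on the diagonal.
a_rps = np.array([[ 0.0, -1.0,  1.0],
                  [ 1.0,  0.0, -1.0],
                  [-1.0,  1.0,  0.0]])

def distance_from_center(x):
    """H = -ln(x_R x_P x_S) - 3 ln 3: zero at (1/3, 1/3, 1/3) and
    divergent towards the boundary of the strategy simplex."""
    return -np.log(np.prod(x)) - 3.0 * np.log(3.0)
```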
FIG. 2: (Color online) Defectors in the prisoners' dilemma. (a) Distribution of defection rates at fixed $\Gamma$ and $\lambda$, for $N = 1000$, $100$, $10$ from top to bottom at the peak. (b) Spectrum of fluctuations of the defection rate for the same batch sizes. Symbols from simulations in both panels, solid lines from theory.

We first investigate the case without memory loss in Fig. 3. The discrete-time learning dynamics at infinite and at finite batch sizes does not proceed along the cycles of the continuous-time replicator dynamics, but instead drifts towards the edges of the strategy simplex. Fig. 3a shows the distance $H$ from the center. This distance increases monotonically, so that the learning dynamics operates mostly at the borders of the strategy simplex after some transient time. In the deterministic case this effect is due to the discreteness in time of the learning process: the relevant eigenvalues of the map at the central fixed point are given by $1 - \lambda \pm i\Gamma/\sqrt{3}$, with modulus $\sqrt{(1-\lambda)^2 + \Gamma^2/3}$, so that the fixed point is unstable for $\lambda < \lambda_c(\Gamma) = 1 - \sqrt{1 - \Gamma^2/3}$, and stable for $\lambda > \lambda_c$. In the unstable regime, fluctuations due to finite batch sizes enhance the outwards drift.

The differences between the noise-free learning process and on-line adaptation for the case $\lambda > \lambda_c$ are studied in Fig. 4. Here the fixed point of the DTDL dynamics is stable. The eigenvalues of the Jacobian $J$ at the fixed point are complex, and hence a resonant amplification of fluctuations is possible, similar to the enhanced demographic fluctuations reported in [1]. Indeed, Fig. 4 shows that the stochastic learning dynamics at finite batch size sustains coherent stochastic oscillations about the deterministic fixed point. Their power spectrum can be computed based on an analysis of Eq. (5). Results are compared with simulations in Fig. 4d, and as seen the agreement is excellent, provided the batch size is large enough to justify the expansion in $N^{-1/2}$. Fig. 4 shows that this is the case even for small batch sizes; for other games this will most likely depend on the number of strategies available to the players.

FIG. 3: (Color online) Rock-papers-scissors without memory loss ($\lambda = 0$). Distance $H$ from the center of the simplex versus time. Solid line is the DTDL dynamics, markers from DTSL at finite batch size, $N = 1, 10, 100$ (averages over 1000 runs). The inset shows the frequency of one of the pure strategies versus time for DTDL and for one run of DTSL, and illustrates the drift towards the edges of the strategy simplex.

FIG. 4: (Color online) Rock-papers-scissors at $\Gamma = 0.1$ in the stable regime $\lambda > \lambda_c$. (a) Distance $H$ versus time; (b) deterministic and stochastic trajectories ($N = 10$) in the strategy simplex; (c) probability of playing rock for the same run as in (b); (d) power spectra of fluctuations for $N = 1, 2, 3, 10$.
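The noise-sustained cycling of Fig. 4 can be reproduced qualitatively with the sketches above (reusing logit, batch_update and a_rps; $\Gamma = 0.1$ matches the figure, while the values of $\lambda$, $N$ and the run length are our own illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# Gamma = 0.1 as in Fig. 4; lambda_c(0.1) = 1 - sqrt(1 - 0.01/3) ~ 0.0017,
# so lambda = 0.05 places the deterministic dynamics in the stable regime.
gamma, lam, N, steps = 0.1, 0.05, 10, 5000

q = rng.normal(scale=0.01, size=3)  # small random initial valuations
r = rng.normal(scale=0.01, size=3)
rock = []
for _ in range(steps):
    q, r = batch_update(q, r, a_rps, gamma, lam, N)
    rock.append(logit(q, gamma)[0])  # probability of playing rock
# In the limit N -> infinity the trajectory settles at 1/3; at finite N
# it keeps oscillating around 1/3 (cf. Fig. 4c).
```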
These phenomena are dynamically similar to those in evolutionary systems, where a linear scaling of extinction times in the system size has been reported for neutrally stable dynamics [4]. In the learning system there is no extinction, but escape times from a region around the fixed point can be measured [14], and a similar linear scaling in the batch size is found for the neutrally stable case $\lambda = \lambda_c$. In the stable phase escape times are sub-extensive; in the unstable regime escape times grow faster than linearly in $N$, very akin to what is reported in [4].

Fluctuations in finite populations have profound consequences in evolutionary game theory, and we have here shown that similar stochastic effects can be seen in a learning-theoretic scenario. The source of noise is different from that in evolutionary systems, and the analogue of finite populations are finite batches of observations which players make in between adaptation events. Our analysis demonstrates that memory loss can lead the system away from Nash equilibria and bring about co-operation in social dilemmas. In cyclic games such as RPS, convergence is only possible with sufficient memory loss; the center of the strategy simplex then becomes a stable fixed point for deterministic learning. The stochasticity and discreteness in the adaptation dynamics can affect the asymptotic attractors considerably, and noise-sustained oscillations can be observed. These oscillations are induced by an amplification mechanism similar to that observed in population dynamics [1] and in other biological systems, and may have significant amplitudes, impeding the convergence to the Nash equilibrium. We expect this to be the case for a variety of different games and learning algorithms [14], with compelling consequences for the learnability of games and their Nash equilibria. Deterministic learning of asymmetric games is known to lead to chaotic motion [10], and we expect that a dynamics with imperfect sampling would make it even less likely that the players collectively retrieve a Nash equilibrium.

The author thanks J. D. Farmer for discussions, and Research Councils UK for financial support.

[1] A. J. McKane and T. J. Newman, Phys. Rev. Lett. 94, 218102 (2005).
[2] D. Alonso, A. J. McKane, M. Pascual, J. Roy. Soc. Interface 4, 575 (2007); M. Simoes, M. M. Telo da Gama, A. Nunes, J. Roy. Soc. Interface 5, 555 (2008).
[3] A. J. McKane, J. D. Nagy, T. J. Newman, M. O. Stefanini, J. Stat. Phys. 128, 165 (2007).
[4] A. Traulsen, J. C. Claussen, C. Hauert, Phys. Rev. Lett. 95, 238701 (2005); Eur. Phys. J. B,
373 (2008); L. A. Imhof, D. Fudenberg, M. A. Nowak, Proc. Nat. Acad. Sci. USA 102, 10797 (2005); arXiv:0811.3538; A. Traulsen, J. M. Pacheco, L. A. Imhof, Phys. Rev. E 74, 021905 (2006).
[5] D. Fudenberg and D. K. Levine, The Theory of Learning in Games (MIT Press, Cambridge, MA, 1998); F. Vega-Redondo, Economics and the Theory of Games (Cambridge Univ. Press, Cambridge, UK, 2003).
[6] J. v. Neumann, O. Morgenstern, Theory of Games and Economic Behavior (Princeton Univ. Press, Princeton, NJ, 1953).
[7] J. Maynard Smith, G. Price, Nature 246, 15 (1973); J. Maynard Smith, Evolution and the Theory of Games (Cambridge Univ. Press, Cambridge, UK, 1998).
[8] J. Henrich, R. Boyd, S. Bowles, C. Camerer, E. Fehr, H. Gintis (Eds.), Foundations of Human Sociality (Oxford Univ. Press, Oxford, UK, 2004).
[9] T. H. Ho, C. F. Camerer, J.-K. Chong, J. Econ. Theory 133, 177 (2007).
[10] Y. Sato, J. P. Crutchfield, Phys. Rev. E 67, 015206(R) (2003); Y. Sato, E. Akiyama, J. D. Farmer, Proc. Nat. Acad. Sci. USA 99, 4748 (2002); Int. J. Mod. Phys. C 14