Intrinsic noise in game dynamical learning
Tobias Galla∗
Theoretical Physics, School of Physics and Astronomy, The University of Manchester, Manchester M13 9PL, United Kingdom
∗ Electronic address: [email protected]
(Dated: November 5, 2018)

Demographic noise has profound effects on evolutionary and population dynamics, as well as on chemical reaction systems and models of epidemiology. Such noise is intrinsic and due to the discreteness of the dynamics in finite populations. We here show that similar noise-sustained trajectories arise in game dynamical learning, where the stochasticity has a different origin: agents sample a finite number of moves of their opponents in between adaptation events. The limit of infinite batches results in deterministic modified replicator equations, whereas finite sampling leads to a stochastic dynamics. The characteristics of these fluctuations can be computed analytically using methods from statistical physics, and such noise can affect the attractors significantly, leading to noise-sustained cycling or removing periodic orbits of the standard replicator dynamics.
PACS numbers: 02.50.Le, 87.23.Kg, 02.50.Ey, 05.40.-a
Intrinsic noise has been seen to have significant effects on dynamical systems, and may alter their attractors substantially. Noise-sustained oscillations, generated via an amplification mechanism, are for example present in models of population dynamics [1], epidemiology [2] or biochemical reaction systems [3]. The origin of these fluctuations is the discreteness of the dynamics in finite systems; deterministic descriptions are then no longer appropriate. The class of systems in which intrinsic noise cannot be neglected includes models of evolutionary dynamics and game theory, and much current research aims at understanding the effects of this demographic stochasticity using methods from nonequilibrium statistical mechanics and the theory of stochastic processes [4].

Here, we will focus on intrinsic noise resulting from a different origin, and will consider the learning dynamics of agents in a game theoretic setting [5]. This is complementary to more conventional approaches to game theory, which concentrate on the characterisation of equilibrium points [6] or on evolutionary processes [7]. In the learning scenario one considers a small number of agents who interact repeatedly in a given game, observe their opponents' actions, and aim to react by adapting their own strategy profile. Such dynamical models are of particular importance for the understanding of experiments in game theory and behavioral economics, in which human subjects play a given game repeatedly under controlled conditions [8, 9]. As a key result we show that stochasticity, induced by imperfect sampling of the opponents' strategy profiles, can result in trajectories quite different from those of deterministic learning, very much akin to the mechanism by which intrinsic noise in finite populations affects the trajectories of evolutionary systems. While the amount of intrinsic noise in evolutionary dynamics is determined by the number of individuals in the population, our objective here is to characterise the fluctuations in the learning dynamics of two fixed agents. The quantity controlling the noise strength is the number of observations made by the agents in between adaptation events. Furthermore, in a deterministic setting and depending on the game, we demonstrate that memory loss can promote or impede convergence to a Nash equilibrium.

Consider a general symmetric two-player game, played repeatedly by players X and Y, and assume there are $p$ pure strategies in this game. The payoff matrix is given by $a_{ij}$, where $i, j \in \{1, \ldots, p\}$. The rounds of the repeated interaction will be labeled by $t = 1, 2, \ldots$ in the following. In each round player X plays one pure strategy $i(t) \in \{1, \ldots, p\}$, and player Y plays $j(t) \in \{1, \ldots, p\}$. The payoff for X is then $a_{i(t)j(t)}$ and that for Y is $a_{j(t)i(t)}$. If the players play stochastically, i.e. if they resort to mixed strategies, $i(t)$ and $j(t)$ will be random variables. Assuming that player X carries a (time-dependent) mixed strategy profile $\mathbf{x}(t) = (x_1(t), \ldots, x_p(t))$, and similarly $\mathbf{y}(t) = (y_1(t), \ldots, y_p(t))$ for player Y, a learning dynamics is then a prescription used to update these strategy profiles between subsequent rounds of the game. Here $x_i(t)$ denotes the probability with which player X plays pure strategy $i \in \{1, \ldots, p\}$ in round $t$, and similarly for $y_j(t)$.
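To make this notation concrete, the following minimal Python sketch (our own illustration, not code from the paper; the payoff matrix shown is the prisoners' dilemma used later in the text, and the function name play_round is our choice) samples one round of play from given mixed-strategy profiles:

```python
import numpy as np

rng = np.random.default_rng(0)

# Payoff matrix of the prisoners' dilemma used later in the text,
# strategies ordered (C, D); a[i, j] is the payoff to a player choosing
# row i against an opponent choosing column j.
a = np.array([[3.0, 0.0],
              [5.0, 1.0]])

def play_round(x, y):
    """One round: each player draws a pure strategy from their mixed
    strategy profile; returns both actions and both payoffs."""
    i = rng.choice(len(x), p=x)  # action i(t) of player X
    j = rng.choice(len(y), p=y)  # action j(t) of player Y
    return i, j, a[i, j], a[j, i]

# e.g. both players fully mixed:
print(play_round(np.array([0.5, 0.5]), np.array([0.5, 0.5])))
```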
Normalization requires $\sum_{i=1}^{p} x_i(t) = \sum_{j=1}^{p} y_j(t) = 1$.

In order to define a specific learning dynamics, we follow [9, 10] and assume that each player keeps valuations of each pure strategy, measuring their relative performance in the past. More precisely, in a situation without memory loss, the valuation $q_i(t)$ player X has for pure strategy $i$ is the total payoff X would have obtained, had he/she always played strategy $i$ up to time $t$, and given Y's actions. The valuation $r_j(t)$ player Y has for $j$ has an analogous meaning. Following [9, 10], players then use a logit rule

$$x_i(t) = \frac{e^{\Gamma q_i(t)}}{\sum_k e^{\Gamma q_k(t)}}, \qquad y_j(t) = \frac{e^{\Gamma r_j(t)}}{\sum_k e^{\Gamma r_k(t)}}. \qquad (1)$$

The parameter $\Gamma \geq 0$ plays the role of an intensity of choice: $\Gamma = 0$ corresponds to purely random play, and $\Gamma \to \infty$ to deterministic play; we will here focus on the case $0 < \Gamma < \infty$. It is important to distinguish between two types of randomness in the actual play: as prescribed by (1), the players will generally use mixed strategies, so that their actions can be stochastic, even at given strategy valuations. Secondly, the update of the valuations itself will contain some stochasticity, as we will detail next. We will here assume that players update their scores only once every $N$ rounds of the game, and keep them constant in between. This is known as batch learning in computer science [12]. Specifically, we will assume

$$q_k(t+N) = (1-\lambda)\, q_k(t) + \frac{1}{N} \sum_{t'=t}^{t+N-1} a_{k j(t')},$$
$$r_k(t+N) = (1-\lambda)\, r_k(t) + \frac{1}{N} \sum_{t'=t}^{t+N-1} a_{k i(t')}, \qquad (2)$$

with $q_k(t+\tau) = q_k(t)$ for all $\tau = 1, 2, \ldots, N-1$, and similarly for player Y.
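A minimal sketch of one adaptation event of the batch dynamics, Eqs. (1, 2), might look as follows in Python (again our own illustration; the function names logit and batch_update are our choices, and the max-shift inside logit is a standard numerical-stability trick not mentioned in the text):

```python
import numpy as np

rng = np.random.default_rng(1)

def logit(q, gamma):
    """Mixed strategy from valuations via the logit rule, Eq. (1)."""
    w = np.exp(gamma * (q - q.max()))  # shift for numerical stability
    return w / w.sum()

def batch_update(q, r, a, gamma, lam, N):
    """One adaptation event of Eq. (2): both players play N rounds at
    fixed mixed strategies, then all valuations are updated with the
    per-round average of the corresponding (foregone) payoffs."""
    x, y = logit(q, gamma), logit(r, gamma)
    i = rng.choice(len(x), p=x, size=N)  # X's N actions
    j = rng.choice(len(y), p=y, size=N)  # Y's N actions
    q_new = (1 - lam) * q + a[:, j].mean(axis=1)  # average of a_{k j(t')}
    r_new = (1 - lam) * r + a[:, i].mean(axis=1)  # average of a_{k i(t')}
    return q_new, r_new
```

For $N = 1$ this reduces to on-line learning; increasing $N$ suppresses the sampling noise as $N^{-1/2}$.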
On-line learning [12], i.e. updating after each round, is recovered for $N = 1$. In our model all $\{q_i, r_j\}$ are updated at each adaptation event. This corresponds to reinforcement learning in which foregone payoffs are known and reinforced, equivalent to weighted fictitious play belief learning, see Ho et al. [9]. The interpretation of these update rules is understood best by first considering the case $\lambda = 0$: then the increment of $q_k$ between time-steps $t$ and $t+N$ is given by $N^{-1} \sum_{t'=t}^{t+N-1} a_{k j(t')}$. This increment is recognized as the average payoff X would have received per round had he/she played pure strategy $k$ in all rounds $t, t+1, \ldots, t+N-1$.
A non-zero memory-loss rate, $\lambda \in (0, 1)$, leads to an exponential discounting of past payoffs in the payoff terms in Eq. (2). In this paper we follow the setup of [10].

The update rules are intrinsically stochastic; we will refer to (1, 2) as discrete-time stochastic learning (DTSL). After a re-scaling of time, and for large but finite batch size $N$, we can write

$$q_k(\ell+1) = (1-\lambda)\, q_k(\ell) + \sum_j a_{kj}\, y_j(\ell) + \frac{\xi_k(\ell)}{\sqrt{N}},$$
$$r_k(\ell+1) = (1-\lambda)\, r_k(\ell) + \sum_i a_{ki}\, x_i(\ell) + \frac{\eta_k(\ell)}{\sqrt{N}}, \qquad (3)$$

where we approximate the noise variables $\xi_k, \eta_k$ as Gaussian random variables. This amounts to an expansion in $N^{-1/2}$, and within this approximation the covariances of the $\xi_k, \eta_k$ can be obtained, as we will report elsewhere [14]. In the limit of infinite batch size, $N \to \infty$, the dynamics becomes deterministic; we will refer to this as discrete-time deterministic learning (DTDL). Assuming $\Gamma \ll 1$, the dynamics can be approximated by the continuous-time equations

$$\dot{x}_i / x_i = \Gamma \sum_j a_{ij} y_j - \Gamma f[\mathbf{x}, \mathbf{y}] + \lambda \sum_k x_k \ln \frac{x_k}{x_i},$$
$$\dot{y}_j / y_j = \Gamma \sum_i a_{ji} x_i - \Gamma f[\mathbf{y}, \mathbf{x}] + \lambda \sum_k y_k \ln \frac{y_k}{y_j}, \qquad (4)$$

where $f[\mathbf{x}, \mathbf{y}] = \sum_{ij} a_{ij} x_i y_j$, as previously reported and studied in [10], see also [11]. This system maintains the normalisation of probabilities, and is hence effectively $2(p-1)$-dimensional. If the deterministic dynamics has fixed points $\mathbf{z}^* = (x_1^*, \ldots, x_p^*, y_1^*, \ldots, y_p^*)$, they are identical to the fixed points of (4). We now perform an expansion about the fixed point in powers of $N^{-1/2}$, akin to the expansion first proposed in [13]. Writing $\mathbf{z}(\ell) = \mathbf{z}^* + N^{-1/2} \mathbf{\Delta}(\ell)$, one finds

$$\mathbf{\Delta}(\ell+1) = J\, \mathbf{\Delta}(\ell) + \boldsymbol{\zeta}(\ell), \qquad (5)$$

with $J$ the Jacobian at the fixed point, and where $\boldsymbol{\zeta}(\ell)$ is Gaussian white noise, with correlations among its components, which can be worked out analytically [14]. Eq. (5) is the discrete-time analogue of a linear Langevin equation, and the starting point for the analysis of fluctuations about the deterministic limit. In particular, Eq. (5) allows one to compute the stationary distributions of the components of $\mathbf{\Delta}$, as well as their temporal correlations and power spectra $P_i(\omega) = \langle |\widetilde{\Delta}_i(\omega)|^2 \rangle$, with $\widetilde{\Delta}_i(\omega)$ the Fourier transform of $\Delta_i(\ell)$ [14]. This follows the lines of [1].

Here we will illustrate the effects noise has on the learning dynamics using the two examples of the prisoners' dilemma and of the rock-papers-scissors game. The prisoners' dilemma describes a problem of mutual cooperation, where two players each face the choice whether to co-operate (C) or to defect (D). We will here choose the payoff matrix $a_{CC} = 3$, $a_{CD} = 0$, $a_{DC} = 5$, $a_{DD} = 1$. The Nash equilibrium, and fixed point of the standard replicator dynamics ($\lambda = 0$), is defection, and we will in the following discuss the outcome of the batch and on-line learning dynamics with and without memory loss. As seen in Fig. 1a, the deterministic learning dynamics converges to a fixed point; a numerical analysis shows that this fixed point is symmetric with respect to the exchange of players ($\mathbf{x}^* = \mathbf{y}^*$). The defection rate of either player decreases with increasing memory loss (Fig. 1b). The fixed point of (4) depends only on the ratio $\lambda/\Gamma$, and the different curves in Fig. 1b can be collapsed.

FIG. 1: (Color online) Defection rate in the prisoners' dilemma. (a) Defection rate versus time at fixed $\Gamma$ for several values of $\lambda$; symbols from DTSL ($N = 10$, averaged over 1000 runs, defection rate shown for one fixed player), lines from DTDL. (b) Defection rate as a function of the memory-loss rate $\lambda$ for several values of $\Gamma$; $N = 10$, other parameters as in (a).
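The spectra compared with simulations in Figs. 2b and 4d below follow directly from Eq. (5): Fourier transforming gives $\widetilde{\mathbf{\Delta}}(\omega) = (e^{i\omega}\mathbb{1} - J)^{-1}\, \widetilde{\boldsymbol{\zeta}}(\omega)$, so that $P_i(\omega)$ is the $i$-th diagonal element of $M^{-1} B\, (M^{-1})^{\dagger}$, with $M = e^{i\omega}\mathbb{1} - J$ and $B = \langle \boldsymbol{\zeta} \boldsymbol{\zeta}^T \rangle$. A short Python sketch of this evaluation (our own illustration; $J$ and $B$ must be supplied, e.g. from the calculation announced for [14]):

```python
import numpy as np

def power_spectra(J, B, omegas):
    """Spectra P_i(omega) of the components of Delta in Eq. (5),
    Delta(l+1) = J Delta(l) + zeta(l), with white-noise covariance B:
    P(omega) = diag[ M^{-1} B (M^{-1})^dagger ], M = e^{i omega} 1 - J."""
    d = J.shape[0]
    out = np.empty((len(omegas), d))
    for n, w in enumerate(omegas):
        Minv = np.linalg.inv(np.exp(1j * w) * np.eye(d) - J)
        S = Minv @ B @ Minv.conj().T
        out[n] = S.diagonal().real  # diagonal of the Hermitian S is real
    return out
```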
The learning dynamics at finite batch size and $\lambda > 0$ behaves differently: while the deterministic dynamics converges to a mixed-strategy fixed point, learning at finite batch sizes leads to a distribution of mixed-strategy vectors, as indicated in Fig. 2a. The width of these distributions scales as $N^{-1/2}$, and can be obtained from the theory to great accuracy. Panel 2b demonstrates that our analytical approach captures spectral properties of the fluctuations as well, and again near-perfect agreement between theory and simulations is found. These results show that the expansion in the inverse batch size is a viable analytical tool for the characterization of stochastic effects in game dynamical learning, and we will proceed to apply it to a second matrix game in the following.

Rock-papers-scissors (RPS) is a game with $p = 3$ strategies and cyclic dominance, as indicated by the payoff matrix $a_{RS} = a_{SP} = a_{PR} = 1$, $a_{SR} = a_{PS} = a_{RP} = -1$, $a_{RR} = a_{PP} = a_{SS} = 0$. If the system is started from symmetric initial conditions, $(x_R, x_P, x_S) = (y_R, y_P, y_S)$, the continuous-time replicator dynamics, Eqs. (4) at $\lambda = 0$, reduces to a one-population dynamics, and these have one neutrally stable fixed point at $x_R^* = x_P^* = x_S^* = 1/3$. As a measure of the distance from this central fixed point we use $H = -\ln(x_R x_P x_S) - 3\ln 3$, which vanishes at the center of the simplex and grows towards its boundary.
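In code, the cyclic payoff structure and the distance measure read as follows (a sketch assuming the definition of $H$ given above, with the additive constant chosen so that $H$ vanishes at the center):

```python
import numpy as np

# RPS payoff matrix, strategies ordered (R, P, S): a_RS = a_SP = a_PR = 1,
# a_SR = a_PS = a_RP = -1, zero on the diagonal.
a_rps = np.array([[ 0.0, -1.0,  1.0],
                  [ 1.0,  0.0, -1.0],
                  [-1.0,  1.0,  0.0]])

def distance_from_center(x):
    """H = -ln(x_R x_P x_S) - 3 ln 3: zero at (1/3, 1/3, 1/3) and
    divergent towards the boundary of the strategy simplex."""
    return -np.log(np.prod(x)) - 3.0 * np.log(3.0)
```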
FIG. 2: (Color online) Defectors in the prisoners' dilemma. (a) Distribution of defection rates at fixed $\Gamma$ and $\lambda$, for $N = 1000$, $100$, $10$ from top to bottom at the peak. (b) Spectrum of fluctuations of the defection rate for the same batch sizes. Symbols from simulations in both panels, solid lines from theory.

We first investigate the case without memory loss in Fig. 3. The discrete-time learning dynamics at infinite and at finite batch sizes does not proceed along the cycles of the continuous-time replicator dynamics, but instead drifts towards the edges of the strategy simplex. Fig. 3a shows the distance $H$ from the center. This distance increases monotonically, so that the learning dynamics operates mostly at the borders of the strategy simplex after some transient time. In the deterministic case this effect is due to the discreteness in time of the learning process: the relevant eigenvalues of the map at the central fixed point are given by $1 - \lambda \pm i\Gamma/\sqrt{3}$, with modulus $\sqrt{(1-\lambda)^2 + \Gamma^2/3}$, so that the fixed point is unstable for $\lambda < \lambda_c(\Gamma) = 1 - \sqrt{1 - \Gamma^2/3}$, and stable for $\lambda > \lambda_c$. In the unstable regime, fluctuations due to finite batch sizes enhance the outwards drift.

The differences between the noise-free learning process and on-line adaptation for the case $\lambda > \lambda_c$ are studied in Fig. 4. Here the fixed point of the DTDL dynamics is stable. The eigenvalues of the Jacobian $J$ at the fixed point are complex, and hence a resonant amplification of fluctuations is possible, similar to the enhanced demographic fluctuations reported in [1]. Indeed, Fig. 4 shows that the stochastic learning dynamics at finite batch size sustains coherent stochastic oscillations about the deterministic fixed point. Their power spectrum can be computed based on an analysis of Eq. (5). Results are compared with simulations in Fig. 4d, and as seen the agreement is excellent, provided the batch size is large enough to justify the expansion in $N^{-1/2}$. Fig. 4 shows that this is the case even for small batch sizes; for other games this will most likely depend on the number of strategies available to the players.

FIG. 3: (Color online) Rock-papers-scissors without memory loss ($\lambda = 0$). Distance $H$ from the center of the simplex versus time. Solid line is the DTDL dynamics, markers from DTSL at finite batch size, $N = 1, 10, 100$ (averages over 1000 runs). The inset shows the frequency of one of the pure strategies versus time for DTDL and for one run of DTSL, and illustrates the drift towards the edges of the strategy simplex.

FIG. 4: (Color online) Rock-papers-scissors at $\Gamma = 0.1$ in the stable regime $\lambda > \lambda_c$. (a) Distance $H$ versus time; (b) deterministic and stochastic trajectories ($N = 10$) in the strategy simplex; (c) probability of playing rock for the same run as in (b); (d) power spectra of fluctuations for $N = 1, 2, 3, 10$.
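The noise-sustained cycling of Fig. 4 can be reproduced qualitatively with the sketches above (reusing logit, batch_update and a_rps; $\Gamma = 0.1$ matches the figure, while the values of $\lambda$, $N$ and the run length are our own illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# Gamma = 0.1 as in Fig. 4; lambda_c(0.1) = 1 - sqrt(1 - 0.01/3) ~ 0.0017,
# so lambda = 0.05 places the deterministic dynamics in the stable regime.
gamma, lam, N, steps = 0.1, 0.05, 10, 5000

q = rng.normal(scale=0.01, size=3)  # small random initial valuations
r = rng.normal(scale=0.01, size=3)
rock = []
for _ in range(steps):
    q, r = batch_update(q, r, a_rps, gamma, lam, N)
    rock.append(logit(q, gamma)[0])  # probability of playing rock
# In the limit N -> infinity the trajectory settles at 1/3; at finite N
# it keeps oscillating around 1/3 (cf. Fig. 4c).
```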
These phenomena are dynamically similar to those in evolutionary systems, where a linear scaling of extinction times in the system size has been reported for neutrally stable dynamics [4]. In the learning system there is no extinction, but escape times from a region around the fixed point can be measured [14], and a similar linear scaling in the batch size is found for the neutrally stable case $\lambda = \lambda_c$. In the stable phase escape times are sub-extensive; in the unstable regime escape times grow faster than linearly in $N$, very akin to what is reported in [4].

Fluctuations in finite populations have profound consequences in evolutionary game theory, and we have here shown that similar stochastic effects can be seen in a learning-theoretic scenario. The source of noise is different from that in evolutionary systems, and the analogue of finite populations are finite batches of observations which players make in between adaptation events. Our analysis demonstrates that memory loss can lead the system away from Nash equilibria and bring about co-operation in social dilemmas. In cyclic games such as RPS, convergence is only possible with sufficient memory loss; the center of the strategy simplex then becomes a stable fixed point for deterministic learning. The stochasticity and discreteness in the adaptation dynamics can affect the asymptotic attractors considerably, and noise-sustained oscillations can be observed. These oscillations are induced by an amplification mechanism similar to that observed in population dynamics [1] and in other biological systems, and may have significant amplitudes, impeding the convergence to the Nash equilibrium. We expect this to be the case for a variety of different games and learning algorithms [14], with compelling consequences for the learnability of games and their Nash equilibria. Deterministic learning of asymmetric games is known to lead to chaotic motion [10], and we expect that a dynamics with imperfect sampling would make it even less likely that the players collectively retrieve a Nash equilibrium.

The author thanks J. D. Farmer for discussions, and Research Councils UK for financial support.

[1] A. J. McKane and T. J. Newman, Phys. Rev. Lett. 94, 218102 (2005).
[2] D. Alonso, A. J. McKane, M. Pascual, J. Roy. Soc. Interface 4, 575 (2007); M. Simoes, M. M. Telo da Gama, A. Nunes, J. Roy. Soc. Interface 5, 555 (2008).
[3] A. J. McKane, J. D. Nagy, T. J. Newman, M. O. Stefanini, J. Stat. Phys. 128, 165 (2007).
[4] A. Traulsen, J. C. Claussen, C. Hauert, Phys. Rev. Lett. 95, 238701 (2005); Eur. Phys. J. B,
373 (2008); L. A. Imhof, D. Fudenberg, M. A. Nowak, Proc. Nat. Acad. Sci. USA 102, 10797 (2005); arXiv:0811.3538; A. Traulsen, J. M. Pacheco, L. A. Imhof, Phys. Rev. E 74, 021905 (2006).
[5] D. Fudenberg and D. K. Levine, The Theory of Learning in Games (MIT Press, Cambridge, MA, 1998); F. Vega-Redondo, Economics and the Theory of Games (Cambridge Univ. Press, Cambridge, UK, 2003).
[6] J. v. Neumann, O. Morgenstern, Theory of Games and Economic Behavior (Princeton Univ. Press, Princeton, NJ, 1953).
[7] J. Maynard Smith, G. Price, Nature 246, 15 (1973); J. Maynard Smith, Evolution and the Theory of Games (Cambridge Univ. Press, Cambridge, UK, 1998).
[8] J. Henrich, R. Boyd, S. Bowles, C. Camerer, E. Fehr, H. Gintis (Eds.), Foundations of Human Sociality (Oxford Univ. Press, Oxford, UK, 2004).
[9] T. H. Ho, C. F. Camerer, J.-K. Chong, J. Econ. Theory 133, 177 (2007).
[10] Y. Sato, J. P. Crutchfield, Phys. Rev. E 67, 015206(R) (2003); Y. Sato, E. Akiyama, J. D. Farmer, Proc. Nat. Acad. Sci. USA 99, 4748 (2002); Int. J. Mod. Phys. C 14