Hysteresis effects of changing parameters of noncooperative games
David H. Wolpert, Michael Harré, Eckehard Olbrich, Nils Bertschinger, Juergen Jost
NASA Ames Research Center, MailStop 269-1, Moffett Field, CA 94035-1000
Centre for the Mind, The University of Sydney, Australia
Max Planck Institute for Mathematics in the Sciences, Inselstrasse 22, D-04103 Leipzig, Germany
(Dated: August 17, 2018)

We adapt the method used by Jaynes to derive the equilibria of statistical physics to instead derive equilibria of bounded rational game theory. We analyze the dependence of these equilibria on the parameters of the underlying game, focusing on hysteresis effects. In particular, we show that by gradually imposing individual-specific tax rates on the players of the game, and then gradually removing those taxes, the players move from a poor equilibrium to one that is better for all of them.

INTRODUCTION
The Maximum Entropy (Maxent) principle is an information-theoretic formalization of Occam's razor. It says that if we are given the expectation values of some functions of a system's state, then we should predict that the associated distribution is the one with minimal information (i.e., maximal entropy) consistent with those expectations [1, 2]. Maxent provides a succinct way to derive much of statistical physics [3, 4], e.g., the canonical ensemble.

Noncooperative game theory [5–8] is the foundation of conventional economics. It uses the utility functions of a set of human "players" to predict how the players will model one another, and then uses this to predict the players' joint behavior. Many recent applications of statistical physics to economics analyze economies at a coarse-grained level, bypassing their game-theoretic foundation. Here we build on [9] and apply Maxent to game theory, thereby introducing statistical physics techniques into the foundation of economics.

In this application of Maxent, there is a separate expectation value for each player. In contrast, when applying Maxent to derive the canonical ensemble, there is a single expectation value (of the system's energy). Accordingly, rather than the canonical ensemble's single Boltzmann distribution, involving a single Hamiltonian and a single "temperature", we derive a separate Boltzmann distribution for each player, involving only that player's utility function and a "temperature" unique to that player. The players' Boltzmann distributions are coupled, and the joint solution provides a bounded rationality version of the Nash equilibrium (NE) of game theory, where each player's inverse temperature quantifies their rationality.

We analyze the dependence of this modified NE on the parameters of the underlying game, focusing on bifurcation behavior and hysteresis effects.
In particular, we show how by gradually imposing taxes on the players, and then gradually removing them, the joint behavior of the players can be moved from a poor equilibrium to a Pareto-superior one. (This can be done even if we require that the players agree to each infinitesimal change in tax rates, since each such change increases every player's expected utility.) This is particularly interesting given estimates that non-OECD countries could increase their wealth by one third by moving from their current equilibrium to a different one. Next we introduce three toy models of how a society can modify tax rates: via "socialism", a "market", or "anarchy". We then compare these three models in terms of the associated discounted sum of total utilities along the path of tax rates.

BACKGROUND
Many different axiomatic arguments establish that the amount of syntactic information in a distribution $P(y)$ increases as the Shannon entropy of that distribution, $S(P) \equiv -\sum_y P(y) \ln[P(y)]$ [1, 3, 4], decreases. This provides a way to formalize "Occam's razor": given limited prior data concerning $P(y)$, predict that $P(y)$ is the distribution with minimal information (maximum entropy) consistent with that data. This formalization of Occam's razor is called the maximum entropy principle (Maxent). When the data concerning $P(y)$ are expectation values of functions under $P$, Maxent has proven extremely accurate in domains ranging from signal processing to supervised learning [2]. Jaynes used it to derive statistical physics [3], e.g., by taking the data to be the expected energy of a system or its expected number of particles of various types.

A finite, strategic-form noncooperative game consists of a set of $N$ players, where each player $i$ has her own set of allowed pure strategies $X_i$ of size $|X_i| < \infty$. A mixed strategy is a distribution $q_i(x_i)$ over $X_i$. The joint distribution over $X \equiv \times_i X_i$ is $q(x) = \prod_i q_i(x_i)$, and is called a strategy profile. Each player $i$ has a utility function $u_i : X \to \mathbb{R}$. So given strategy profile $q$, the expected utility of player $i$ is

$$E(u_i) = \sum_x \prod_j q_j(x_j)\, u_i(x) = \sum_x q_i(x_i)\, q_{-i}(x_{-i})\, u_i(x),$$

where $q_{-i}(x_{-i}) \equiv \prod_{j \neq i} q_j(x_j)$. The Nash equilibrium (NE) is the strategy profile defined by having every player $i$ set $q_i$ to maximize $i$'s expected utility, i.e.,

$$\forall i, \quad q_i = \operatorname*{argmax}_{q'_i} \Big[ \sum_x q'_i(x_i)\, q_{-i}(x_{-i})\, u_i(x) \Big].$$
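As a concrete check of these definitions, here is a minimal sketch for a two-player, two-strategy game. It computes $E(u_i)$ by summing over joint pure strategies with weights $\prod_j q_j(x_j)$, and tests the NE condition by comparing each player's expected utility with her best pure deviation; checking pure deviations suffices because $E(u_i)$ is linear in $q_i$. The payoff values are a hypothetical battle-of-the-sexes instance chosen for illustration and are not taken from this paper, and `expected_utility` and `is_nash` are illustrative names.

```python
from itertools import product

# Hypothetical battle-of-the-sexes payoffs (illustrative; not the paper's Eq. (3)).
# UTILS[i][a][b] is player i's utility when Row plays a and Column plays b.
UTILS = [
    [[2.0, 0.0], [0.0, 1.0]],  # Row
    [[1.0, 0.0], [0.0, 2.0]],  # Column
]


def expected_utility(i, profile):
    """E(u_i) = sum_x prod_j q_j(x_j) u_i(x) for a two-player profile."""
    total = 0.0
    for a, b in product(range(2), repeat=2):
        total += profile[0][a] * profile[1][b] * UTILS[i][a][b]
    return total


def is_nash(profile, tol=1e-9):
    """Check the NE condition: no player gains from a unilateral deviation.

    Checking pure deviations suffices because E(u_i) is linear in q_i.
    """
    for i in range(2):
        current = expected_utility(i, profile)
        for a in range(2):
            deviation = list(profile)
            deviation[i] = [1.0 if k == a else 0.0 for k in range(2)]
            if expected_utility(i, deviation) > current + tol:
                return False
    return True
```

Under these assumed payoffs, `is_nash` accepts both pure coordination profiles and the mixed profile in which Row plays her first strategy with probability 2/3 and Column plays its first strategy with probability 1/3, and rejects the miscoordinated profile in which Row plays her first strategy and Column its second. So even this small game has three NE.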
In general, the coupled NE conditions have multiple solutions.

A well-recognized problem of using the NE to predict real-world behavior is its assumption that every player chooses their optimal mixed strategy, which is called full rationality. This assumption is violated (often badly) in many experimental settings [10, 11]. Our modified NE derived using Maxent accommodates such bounded rationality.

MAXENT AND QUANTAL RESPONSE EQUILIBRIA
To predict what $q$ the players in a given $N$-player game $\Gamma$ will adopt, first pick one of the players, $i$. Consider a counterfactual situation, where $i$ has the same move space and utility function as in $\Gamma$, but rather than have a set of $N - 1$ other players set the distribution over $X_{-i}$, an inanimate stochastic system sets that distribution, to some $q_{-i}(x_{-i})$. In general, due to her limited knowledge of $q_{-i}$, limited computational power, etc., $i$ will choose a suboptimal $q_i$, i.e., $q_i \notin \operatorname*{argmax}_{p_i} [E_{p_i q_{-i}}(u_i)]$. To quantify this bounded rationality, in analogy to Jaynes' derivation of the canonical ensemble, constrain $q_i$ so that $E_{q_i, q_{-i}}(u_i)$ has some (nonmaximal) value for the given $q_{-i}$. Then Maxent says

$$q_i(x_i) \propto \exp\left[\beta_i\, E_{q_{-i}}(u_i \mid x_i)\right], \qquad (1)$$

where $\beta_i$ is the Lagrange parameter enforcing the constraint. Note that as $\beta_i \to \infty$, $i$ becomes increasingly rational, whereas as $\beta_i \to 0$, she becomes increasingly irrational.

Next, recall that by the axioms of utility theory [12], all that player $i$ is concerned with in choosing her mixed strategy is the resultant expected utility. Accordingly, we presume that if the best $i$ can do is choose a particular $q_i$ when $q_{-i}$ is set by an inanimate system, she would also choose $q_i$ if she faced that same distribution $q_{-i}$ when it is set by other humans. Generalizing, Maxent says that Eq. (1) should hold simultaneously for all $N$ players $i$, with player-specific Lagrange parameters. This gives a set of $N$ coupled nonlinear equations for $q$. Brouwer's fixed point theorem [13] guarantees that this set always has a solution, and in general it has more than one.

This prediction for $q$ is not based on a model of bounded rational human behavior derived from experimental data. It is based on desiderata concerning the prediction process, not on a model of the system being predicted. Nonetheless, it is intriguing to note that maximizing Shannon entropy has a natural interpretation in terms of a common model of human bounded rationality, involving the cost of computation. To see this, recall that $-S(q_i)$ measures the amount of information in the distribution $q_i$. Say we equate the cost to $i$ of computing $q_i$ with this amount of information. Then under Maxent, player $i$ minimizes the cost of computing her mixed strategy, subject to a constraint on the value of her expected utility that acts as an "aspiration level". Under this interpretation, $\beta_i$ quantifies $i$'s cost of computing $q_i$, in units of expected utility. Future work involves incorporating experimental data concerning human behavior as additional constraints in the Maxent.
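A minimal sketch of Eq. (1) and of the resulting coupled system, assuming a two-player, two-strategy game with hypothetical battle-of-the-sexes payoffs (not the paper's). The solver `solve_qre` uses damped fixed-point iteration, which is our choice rather than a method from the text: it converges to a solution near its starting point and can miss solutions that repel the iteration. The helper `regularized_value` encodes the cost-of-computation reading above, since the logit response of Eq. (1) maximizes expected utility plus $S(q_i)/\beta_i$. All function names are illustrative.

```python
import math

# Hypothetical battle-of-the-sexes payoffs, for illustration only (not Eq. (3)).
# Index [a][b]: a = Row's pure strategy, b = Column's pure strategy.
U_ROW = [[2.0, 0.0], [0.0, 1.0]]
U_COL = [[1.0, 0.0], [0.0, 2.0]]


def logit_response(beta, cond_utils):
    """Eq. (1): q(x) proportional to exp(beta * E(u | x))."""
    top = max(cond_utils)  # shift by the max so exp() cannot overflow for beta >= 0
    weights = [math.exp(beta * (u - top)) for u in cond_utils]
    z = sum(weights)
    return [w / z for w in weights]


def conditional_utils(q_row, q_col):
    """E_{q_-i}(u_i | x_i) for each player and each of her pure strategies."""
    row = [sum(q_col[b] * U_ROW[a][b] for b in range(2)) for a in range(2)]
    col = [sum(q_row[a] * U_COL[a][b] for a in range(2)) for b in range(2)]
    return row, col


def solve_qre(beta_row, beta_col, q_row, q_col, damping=0.5, tol=1e-12, max_iter=100_000):
    """Damped fixed-point iteration on the coupled logit equations.

    Returns a solution reached from the starting point (q_row, q_col);
    solutions that repel the iteration are not found this way.
    """
    for _ in range(max_iter):
        eu_row, eu_col = conditional_utils(q_row, q_col)
        target_row = logit_response(beta_row, eu_row)
        target_col = logit_response(beta_col, eu_col)
        new_row = [(1 - damping) * q + damping * t for q, t in zip(q_row, target_row)]
        new_col = [(1 - damping) * q + damping * t for q, t in zip(q_col, target_col)]
        change = max(abs(n - o) for n, o in zip(new_row + new_col, q_row + q_col))
        q_row, q_col = new_row, new_col
        if change < tol:
            break
    return q_row, q_col


def entropy(q):
    """Shannon entropy S(q) in nats."""
    return -sum(p * math.log(p) for p in q if p > 0)


def regularized_value(beta, cond_utils, q):
    """Expected utility plus S(q)/beta; logit_response(beta, ...) maximizes it."""
    return sum(p * u for p, u in zip(q, cond_utils)) + entropy(q) / beta
```

For example, at $\vec\beta = (10, 10)$ warm starts near the two pure coordination outcomes converge to two different solutions, illustrating that the coupled equations can have more than one solution, while at $\vec\beta = (0, 0)$ every start converges to uniform play.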
Other models of the cost of computation can be found in [14–18].

Solutions to our $N$ coupled equations for $q$ are typically called "logit Quantal Response Equilibria" (QRE) in game theory [19–22]. They have been independently suggested several times as a way to model human players [14, 23–27]. In all this earlier work the logit distribution is not derived from first principles. Nor is it related to information theory, or the cost of computation. Rather, the logit QRE has typically been used as an ad hoc, few-degree-of-freedom model of bounded rational play. As such it has been widely and successfully used to fit experimental data concerning human behavior.

An alternative Maxent approach would use it to set the entire joint distribution $q(x) = \prod_i q_i(x_i)$ at once, rather than use it to set each $q_i$ separately and then impose self-consistency. However, there are difficulties in choosing what constraints to use under this approach. See [9].

FIG. 1. $E(u_{col})$ vs. $\vec\beta$ under the QRE of the game in Eq. (3). The hysteresis path involving bonds discussed in the text is highlighted.

THE SHAPE OF THE QRE SURFACE
To analyze the QRE surface of Eq. (1), we express it as a set of functional relationships, $q_i = f_i(q_{-i}, \beta_i)$, $q_{-i} = f_{-i}(q_i, \beta_{-i})$. Differentiating with respect to $\beta_i$ gives the linearized self-consistency condition

$$\frac{\partial f_i}{\partial q_{-i}} \frac{\partial f_{-i}}{\partial q_i} \frac{\partial q_i}{\partial \beta_i} + \frac{\partial f_i}{\partial \beta_i} = \frac{\partial q_i}{\partial \beta_i}.$$

A bifurcation may occur if for some $i$ this condition fails to determine $\partial q_i / \partial \beta_i$ uniquely, i.e., if

$$\det\left(\frac{\partial f_i}{\partial q_{-i}} \frac{\partial f_{-i}}{\partial q_i} - \mathrm{Id}\right) = 0.$$

To illustrate this and related phenomena, we consider games between a Row and a Column player, each with two pure strategies. The first is the famous "battle of the sexes" coordination game [5], with utility functions given in Eq. (3).

Each $\vec\beta \equiv (\beta_{row}, \beta_{col})$ fixes the QRE $q$'s for this game, and therefore the QRE expected utilities. Fig. 1 plots the resulting map from $\vec\beta$ to $E_q(u_{col})$. At bifurcations the number of QRE solutions changes between one and three, and infinitesimal changes in $\vec\beta$ may result in discontinuous changes in expected utility. (E.g., this happens if the system starts at $\vec\beta = (5, 5)$ on the top surface, and then $\beta_{row}$ is reduced to 0.)

The QRE literature justifies the logit distribution by appealing to choice theory [28], where it arises if double-exponential noise is added to player utility values. However, that double-exponential noise assumption is never axiomatically justified in choice theory; it is assumed for the calculational convenience of producing the logit distribution.

The logit distribution in Eq. (1) also arises in reinforcement learning [29–32], as a way to design artificial agents that learn from experience.

FIG. 2. The expected utility of the Row player along the path through $\vec\beta$ highlighted in Fig. 1, shown as a function of the Row player's rationality, $\beta_{row}$. The path starts at the bottom right, then travels left, before turning and finishing at the top right. Even though the Row player ultimately benefits if society follows this path, at the beginning they lose expected utility. They may demand to be compensated for that initial drop, e.g., with proceeds of a bond that is paid off by both players when the end of the path is reached.

An interesting effect occurs if we multiply the utilities by $-1$. Fig. 3 illustrates part of the surface after this switch. Note that on the bottom fold, for fixed $\beta_{col}$, decreasing $\beta_{row}$ increases $E(u_{row})$. So Row benefits by being less rational, due to how Column responds to Row's drop in rationality.

By Eq. (1), changing $\beta_i$ affects the QRE $q$ the same way as keeping $\beta_i$ fixed but multiplying $u_i$ by some $\alpha_i$. So Figs. 1 and 3 give QRE surfaces where $\vec\beta$ is fixed, but each $u_i$ is multiplied by $\alpha_i$. (Formally, reinterpret the x and y axes as $\alpha_{row}$ and $\alpha_{col}$ rescaled, and reinterpret the z axis as $E_q(u_i)/\alpha_i$.) Note we can interpret $1 - \alpha_i$ as a tax rate on player $i$. So if we model rationalities $\beta_i$ as fixed, e.g., as behavioral attributes, then on the bottom surface in Fig. 3, Row benefits if her tax rate increases.

The fact that Row may prefer a higher tax rate suggests that by varying tax rates "adiabatically" slowly, so that the joint behavior of the players is always on the QRE surface, we may be able to monotonically improve expected utilities for both players. Indeed, by changing tax rates we can gradually move the equilibrium across the surface from one fold to the other, and then undo those changes, returning the rates to their original values, but leaving both players with higher expected utility. (See [33] for other work that exploits the shape of a QRE surface to optimize player joint behavior.)

More precisely, there are paths of $\vec\beta$'s (i.e., of $\vec\alpha$'s) such that:
1. Neither player is ever more rational (i.e., taxed at a lower rate) on the path than at the starting point.
2. At each step on the path, if after the next infinitesimal change in $\vec\beta$ there is a QRE $q$ infinitesimally close to the current one, it is adopted. (Adiabaticity.)
3. Each infinitesimal change in $\vec\beta$ increases both $E_q(u_i)$'s.
4. At each infinitesimal step, if multiple changes in $q$ meet (1)–(3), but one is Pareto superior to the others (i.e., better for both players), the players coordinate on that one.

Examples of such paths are illustrated in Fig. 3.

The existence of such paths raises the question of how a society should dynamically update its tax rates. We now compare three procedures for how this could be done by society as a whole. (For notational simplicity, and to emphasize the analogy with annealing, we parameterize the procedures in terms of their action on $\vec\beta$ rather than on $\vec{1} - \vec\alpha$.)

I. "Anarchy": Players independently decide how to modify their $\beta$'s. To do this they follow gradient ascent with a small step size $\Delta$, subject to the constraint that no player $i$ can go to a $\beta_i$ larger than her starting one. Thus, each player $i$ changes $\beta_i$ by $\delta\beta_i \in [-\Delta, \Delta]$, using $\partial E(u_i)/\partial \beta_i$ to choose the value of $\delta\beta_i$. (Since this is a linear procedure, the players will always choose one of the three values $\{-\Delta, 0, \Delta\}$.)

II. "Socialism": An external regulator determines the path, again using gradient ascent, this time over the sum of the players' expected utilities. At each step of the path, $\vec\beta$ is changed by the $(\delta\beta_{row}, \delta\beta_{col})$ vector that maximizes

$$\left[\delta\beta_{row}\frac{\partial E(u_{row})}{\partial \beta_{row}} + \delta\beta_{col}\frac{\partial E(u_{row})}{\partial \beta_{col}}\right] + \left[\delta\beta_{row}\frac{\partial E(u_{col})}{\partial \beta_{row}} + \delta\beta_{col}\frac{\partial E(u_{col})}{\partial \beta_{col}}\right]$$

subject to $\|(\delta\beta_{row}, \delta\beta_{col})\| \le \Delta$. (The constraint matches the step size to that of the first procedure.)

III. "Market": Certain mild axioms concerning the bargaining behavior of humans give a unique prediction for what bargain is reached in any bargaining scenario. Let $T$ be the set of joint expected utilities for all the bargains that a set of $N$ bargainers might reach in a particular bargaining scenario.
Then the "Nash bargaining concept" [6, 7] predicts that the joint expected utility of the bargain reached is $\operatorname*{argmax}_{\vec u \in T} \left[\prod_{i=1}^N (u_i - d_i)\right]$, where $\vec d$ is the disagreement point, i.e., the joint expected utility if no bargain is reached.

We can use the Nash bargaining concept to predict what change to $\vec\beta$ the players would agree to under a "market" where they bargain with one another to determine that change. To do this we fix the set of all allowed bargains to the set of all pairs $\vec\beta$ such that $\|\vec\beta - \vec\beta(t)\| \le \Delta$, where $\vec\beta(t)$ is the current joint $\beta$. We also choose $\vec d$ to be the joint expected utility at $\vec\beta(t)$. So under Nash bargaining, at each iteration $t$, the players choose the change in joint $\beta$, $\delta\vec\beta$, that maximizes the product

$$\left[E(u_{Row} \mid \vec\beta(t) + \delta\vec\beta) - E(u_{Row} \mid \vec\beta(t))\right] \times \left[E(u_{Col} \mid \vec\beta(t) + \delta\vec\beta) - E(u_{Col} \mid \vec\beta(t))\right]$$

subject to $\|\delta\vec\beta\| \le \Delta$. As in the other two procedures, we use first-order approximations in this one to evaluate the two differences in expected utilities.

FIG. 3. A QRE surface with paths shown for the anarchy (red), socialism (blue) and market (purple) procedures. As in Fig. 1, the x and y axes are player rationalities, $\beta_{row}$ and $\beta_{col}$, and the z axis is expected utility (this time of player Row).

In all three procedures the total change in $\vec\beta$ in any step never exceeds $\sqrt{2}\,\Delta$. This adiabaticity reduces the computational burden on the players, by not changing the game too much from one timestep to the next.
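The three one-step update rules can be sketched as follows. The input `grad` is a hypothetical matrix of first-order sensitivities, with `grad[i][j]` standing for $\partial E(u_i)/\partial \beta_j$ at the current $\vec\beta$; how those derivatives are obtained is left open here. Two details are our assumptions rather than the text's: the anarchy rule enforces the starting-value cap by clipping, and the two-player market rule searches a grid of directions on the circle $\|\delta\vec\beta\| = \Delta$, keeping only directions that leave both players better off than the disagreement point. Searching only the circle suffices because the product of two linear gains is homogeneous of degree two, so a positive maximum over the disk lies on its boundary.

```python
import math


def anarchy_step(grad, beta, beta0, delta):
    """Procedure I: each player follows the sign of her own dE(u_i)/dbeta_i."""
    new_beta = []
    for i, b in enumerate(beta):
        g = grad[i][i]
        step = delta if g > 0 else (-delta if g < 0 else 0.0)
        # Enforce "no player may exceed her starting beta" by clipping (our choice).
        new_beta.append(min(b + step, beta0[i]))
    return new_beta


def socialism_step(grad, beta, delta):
    """Procedure II: move a distance delta along the gradient of sum_i E(u_i)."""
    total = [sum(row[j] for row in grad) for j in range(len(beta))]
    norm = math.sqrt(sum(s * s for s in total))
    if norm == 0.0:
        return list(beta)
    return [b + delta * s / norm for b, s in zip(beta, total)]


def market_step(grad, beta, delta, n_angles=3600):
    """Procedure III (two players): maximize the product of first-order gains.

    The disagreement point is the current joint expected utility, so only
    directions that raise both players' utilities count as bargains.
    If no such direction exists, the players stay put.
    """
    best_step, best_value = (0.0, 0.0), 0.0
    for k in range(n_angles):
        theta = 2.0 * math.pi * k / n_angles
        d = (delta * math.cos(theta), delta * math.sin(theta))
        gain_row = grad[0][0] * d[0] + grad[0][1] * d[1]
        gain_col = grad[1][0] * d[0] + grad[1][1] * d[1]
        if gain_row > 0 and gain_col > 0 and gain_row * gain_col > best_value:
            best_step, best_value = d, gain_row * gain_col
    return [beta[0] + best_step[0], beta[1] + best_step[1]]
```

With the hypothetical sensitivities `grad = [[0.3, -0.1], [0.2, 0.4]]`, the socialism step points along the summed gradient $(0.5, 0.3)$, while the market step lands at an angle of about $\pi/8$, where the product of the two first-order gains is largest. The anarchy step can reach length $\sqrt{2}\,\Delta$ when both players move, matching the bound stated above.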
Similar assumptions are called comparative statics in economics [34].

As in standard economics, we can quantify how good a full path produced by a procedure is for society as a whole by calculating the discounted sum of future utilities along the path,

$$Q \equiv \sum_{t' > t} (1 + \gamma)^{t - t'} \sum_{i=1}^{N} E(u_i(t')). \qquad (4)$$

So we can compare the three procedures by calculating the $Q$'s for the paths they generate starting from some shared $\vec\beta$ at time $t = 0$. We did this for several representative initial $\vec\beta$'s for the surface in Fig. 3. Anarchy always did worse than the other two procedures. Those others are compared to each other in Fig. 4. When the discounting factor $\gamma$ is large (i.e., we are more concerned with near-term than long-term utility) the market procedure does better; otherwise socialism does.

All three procedures are local, looking only a single step into the future. A procedure that also considers the QRE surface's global geometry will produce better paths in general. In particular, such global information allows us to consider paths where a player loses expected utility for certain periods, but in the end all players are better off. Fig. 1 highlights such a path, along which player Column always benefits but player Row loses initially, before ultimately benefiting. (A cross-section of the expected utility of Row along the path is shown in Fig. 2.) Note that player Row might demand compensation to agree to follow such a path where they temporarily lose expected utility, e.g., in terms of a subsidy paid for with a bond that is repaid by all players at the end of the path.

FIG. 4. The difference between the discounted sums of future expected utilities of the two players under the "socialism" and "market" procedures, plotted against the discounting factor $\gamma$.

Particularly interesting issues arise when setting full paths under the market model, if the players use the discounted sum of future utilities to value full paths. For example, say that at $t = 0$ the players agree on a full path $\vec\beta(t)$ that is a Nash bargaining solution at that time. Then in general, for $t' > 0$, the path $\vec\beta_{t'}(t)$ that is a Nash bargaining solution for full paths starting from $\vec\beta(t')$ is not a truncation of $\vec\beta(t)$ to $t > t'$. There is an inconsistency across time. This raises many interesting issues concerning binding commitments, what it means for a path chosen by bargaining to be renegotiation-proof, etc.

Multiple folds will exist for the QRE surfaces involving many kinds of game parameters, not just tax rates. Often such parameters will be set externally, perhaps in a noisy process. When this is the case, the QRE surface tells us how stable player behavior is against that external noise. For example, say the players are on the top fold of the surface in Fig. 1, with $\vec\beta = (2, \ldots)$. If external noise then pushes them "off the edge", they undergo a discontinuous jump to the lower surface. Moreover, even if the players managed to (adiabatically slowly) restore their original rationalities after such a jump, they would end up on the middle fold of the region where $\beta_{row}$ is near 2, not on the good fold they started in. Due to this, when an economic situation exhibits such qualitative features, it may behoove society to stay away from such edges in the QRE surface, even if that lowers total expected utility.

∗ Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA

[1] T. Cover and J. Thomas,
Elements of Information Theory (Wiley-Interscience, New York, 1991).
[2] D. MacKay,
Information Theory, Inference, and Learning Algorithms (Cambridge University Press, 2003).
[3] E. T. Jaynes, Physical Review , 620 (1957).
[4] E. T. Jaynes and G. L. Bretthorst,
Probability Theory: The Logic of Science (Cambridge University Press, 2003).
[5] D. Fudenberg and J. Tirole,
Game Theory (MIT Press, Cambridge, MA, 1991).
[6] R. B. Myerson,
Game Theory: Analysis of Conflict (Harvard University Press, 1991).
[7] M. Osborne and A. Rubinstein,
A Course in Game Theory (MIT Press, Cambridge, MA, 1994).
[8] R. Aumann and S. Hart,
Handbook of Game Theory with Economic Applications (North-Holland Press, 1992).
[9] D. H. Wolpert, in
Complex Engineered Systems: Science Meets Technology, edited by D. Braha, A. Minai, and Y. Bar-Yam (Springer, 2004) pp. 262–290.
[10] C. Camerer,
Behavioral Game Theory: Experiments in Strategic Interaction (Princeton University Press, 2003).
[11] C. Starmer, Journal of Economic Literature , 332 (2000).
[12] J. von Neumann and O. Morgenstern, Theory of Games and Economic Behavior (Princeton University Press, 1944).
[13] C. Aliprantis and K. C. Border,
Infinite Dimensional Analysis (Springer Verlag, 2006).
[14] D. Fudenberg and D. K. Levine,
The Theory of Learning in Games (MIT Press, Cambridge, MA, 1998).
[15] S. Hart, Econometrica , 1401 (2005).
[16] A. Rubinstein, Modeling Bounded Rationality (MIT Press, 1998).
[17] S. Russell and D. Subramanian, Journal of AI Research , 575 (1995).
[18] M. Georgeff, B. Pell, M. Pollack, M. Tambe, and M. Wooldridge, in Intelligent Agents V, Lecture Notes in Computer Science, Vol. 1555 (Springer, Berlin/Heidelberg, 1999) pp. 1–10.
[19] R. D. McKelvey and T. R. Palfrey, Games and Economic Behavior , 6 (1995).
[20] R. D. McKelvey and T. R. Palfrey, Japanese Economic Review , 186 (1996).
[21] R. D. McKelvey and T. R. Palfrey, in Handbook of Experimental Economics Results, Vol. 1 (North Holland, 2008) pp. 541–548.
[22] S. P. Anderson, J. Goeree, and C. A. Holt, Southern Economic Journal , 21 (2002).
[23] J. Shamma and G. Arslan, IEEE Trans. on Automatic Control , 312 (2004).
[24] D. Fudenberg and D. Kreps, Games and Economic Behavior , 320 (1993).
[25] J. R. Meginniss, Proceedings of the American Statistical Association, Business and Economics Statistics Section , 471 (1976).
[26] S. P. Anderson, J. Goeree, and C. A. Holt, Scandinavian Journal of Economics , 21 (2004), preprint title was "Stochastic game theory: adjustment to equilibrium under noisy directional learning."
[27] S. Durlauf, Proceedings Natl. Acad. Sci. USA , 10582 (1999).
[28] K. E. Train, Discrete Choice Methods with Simulation (Cambridge University Press, 2003).
[29] R. H. Crites and A. G. Barto, in
Advances in Neural Information Processing Systems 8, edited by D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (MIT Press, 1996) pp. 1017–1023.
[30] J. Hu and M. P. Wellman, in
Proceedings of the Fifteenth International Conference on Machine Learning (1998) pp. 242–250.
[31] D. H. Wolpert and K. Tumer, Journal of Artificial Intelligence Research , 359 (2002).
[32] D. H. Wolpert, K. Tumer, and E. Bandari, Physical Review E (2004).
[33] D. Wolpert and N. Kulkarni, in Proceedings of the 2008 NASA/ESA Conference on Adaptive Hardware and Systems, edited by A. Erdogan (IEEE Press, 2008).
[34] T. J. Kehoe, in