Win-Stay-Lose-Shift as a self-confirming equilibrium in the iterated Prisoner's Dilemma
Minjae Kim, Jung-Kyoo Choi, and Seung Ki Baek
Department of Physics, Pukyong National University, Busan 48513, Korea
School of Economics and Trade, Kyungpook National University, Daegu 41566, Korea

Abstract

Evolutionary game theory assumes that players replicate a highly scored player's strategy through genetic inheritance. In terms of cultural learning, however, it is often difficult to recognize a strategy just by observing the behaviour. In this work, we consider players with memory-one stochastic strategies in the iterated prisoner's dilemma, with an assumption that they cannot directly access each other's strategy but only observe the actual moves for a certain number of rounds. Based on the observation, the observer has to infer the resident strategy in a Bayesian way and adjust his or her own strategy accordingly. By examining best-response relations, we argue that players can escape from full defection into a cooperative equilibrium supported by Win-Stay-Lose-Shift in a self-confirming manner, provided that the cost of cooperation is low and the observational learning supplies sufficiently large uncertainty.
I. INTRODUCTION
Between nature and nurture, evolutionary game theorists usually take the former approach by assuming that behavioural traits can be genetically transmitted across generations [1]. Along this line, researchers have investigated the genetic basis of cooperative behaviour [2, 3]. However, humans learn many culture-specific behavioural rules through observational learning [4], and this mechanism mediates "cultural" transmission, which has been proved to exist among a number of non-human animals as well [5, 6]. Mirror-neuron research suggests that the primate brain may even have a specialized circuit for imitating each other's behaviour, which facilitates social learning [7–9]. In comparison with direct genetic transmission, non-genetic inheritance through social learning can provide better adaptability by responding faster to environmental changes [10].

In contrast with genetic inheritance, however, observational learning may lead to imperfect mimicry if observation is not sufficiently informative or involves a systematic bias. The notion of self-confirming equilibrium (SCE) has been proposed to incorporate such imperfectness of observation in learning [11]: When an SCE strategy is played, some of the possible information sets may not be reached, so the players do not have exact knowledge but only certain untested beliefs about what their co-players would do at those unreached sets. It is nevertheless sustained as an equilibrium in the sense that no player can expect a better payoff by unilaterally deviating from it once given such beliefs, and that the beliefs do not conflict with observed moves.

In this work, we investigate the iterated prisoner's dilemma (PD) game among players with memory-one strategies, who infer the resident strategy from observation and optimize their own strategies against it. By memory-one, we mean that a player refers to the previous round to choose a move between cooperation and defection [12]. If we restrict ourselves to memory-one strategies, it is already well known in evolutionary game theory that 'Win-Stay-Lose-Shift' (WSLS) [13–15] can appear through mutation and take over the population from defectors if the cost of cooperation is low [12]. Compared with such an evolutionary approach, we will impose "less bounded" rationality in that our players are assumed to be capable of computing the best response to a given strategy within the memory-one pure-strategy space. We will identify the best-response dynamics in this space and examine how the dynamics should be modified when observational learning introduces uncertainty into Bayesian inference about strategies. If every player exactly replicated each other's strategy, full defection would be a Nash equilibrium (NE) for any cost of cooperation. Under uncertainty in observation, however, our finding is that defection is not always an SCE, so the population can move to a cooperative equilibrium supported by WSLS, which is both an SCE and an NE and can thus be called a SCENE.
II. METHOD AND RESULT

A. Best-response relations without observational uncertainty
Let us define the one-shot PD game in the following form:
        C       D
  C   1 − c    −c
  D     1       0                                   (1)

where we abbreviate cooperation and defection as C and D, respectively, and c is the cost of cooperation, assumed to satisfy 0 < c < 1. In this work, the game of Eq. (1) will be repeated indefinitely. Furthermore, the environment is noisy: Even if a player intends to cooperate, the move can be misimplemented as defection, or vice versa, with probability ε. In the analysis below, we will take ε as an arbitrarily small positive number.

We will restrict ourselves to the space of memory-one (M₁) pure strategies. By an M₁ pure strategy, we mean one that chooses a move between C and D as a function of the two players' moves in the previous round. We thus describe such a strategy as [p_CC, p_CD, p_DC, p_DD], where p_XY = 1 means that C is prescribed when the players did X and Y, respectively, in the previous round, and p_XY = 0 if D is prescribed in the same situation. Note that the initial move in the first round is irrelevant to the long-term average payoff in the presence of error, so it has been discarded in the description of a strategy. The set of M₁ pure strategies, denoted by ∆, contains 16 elements from d₀ ≡ [0, 0, 0, 0] to d₁₅ ≡ [1, 1, 1, 1], where the subscript is the binary number encoded by the four digits. Suppose that Alice has chosen an M₁ pure strategy d_α as her strategy s_A. The noisy environment effectively modifies her behaviour to

s_A^ε ≡ (1 − ε) d_α + ε (1 − d_α)      (2)

as if she were playing a mixed strategy, where 1 ≡ [1, 1, 1, 1]. Likewise, her co-player Bob has chosen d_β, and his effective behaviour is described by

s_B^ε ≡ (1 − ε) d_β + ε (1 − d_β).     (3)

The repeated interaction between Alice and Bob is Markovian, and it is straightforward to obtain the stationary probability distribution

v(d_α, d_β, ε) = (v_CC, v_CD, v_DC, v_DD),     (4)

where v_XY means the long-term average probability to observe Alice and Bob choosing X and Y, respectively [16–18]. The presence of ε > 0 guarantees the uniqueness of v. Alice's long-term average payoff against Bob is then calculated as

Π(d_α, d_β, ε) = v · P,     (5)

where P ≡ (1 − c, −c, 1, 0) is a payoff vector corresponding to Eq. (1).
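Equations (2)–(5) are straightforward to evaluate numerically. Below is a minimal sketch in Python with NumPy; the function names, the state ordering, and the binary strategy encoding are our own conventions for illustration, not code from the original work:

```python
import numpy as np

# Joint states from Alice's viewpoint, ordered CC, CD, DC, DD.
SWAP = [0, 2, 1, 3]  # the same four states re-indexed from Bob's viewpoint

def effective(d, eps):
    """Eqs. (2)-(3): each prescribed move is flipped with probability eps."""
    d = np.asarray(d, dtype=float)
    return (1.0 - eps) * d + eps * (1.0 - d)

def stationary(dA, dB, eps):
    """Stationary distribution v(dA, dB, eps) of Eq. (4)."""
    a = effective(dA, eps)        # Alice's cooperation probability per state
    b = effective(dB, eps)[SWAP]  # Bob's, re-indexed to Alice's viewpoint
    T = np.column_stack([a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b)])
    A = np.vstack([T.T - np.eye(4), np.ones(4)])  # v T = v and sum(v) = 1
    v, *_ = np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 0.0, 1.0]), rcond=None)
    return v

def payoff(dA, dB, c, eps):
    """Eq. (5): Alice's long-term average payoff against Bob."""
    P = np.array([1.0 - c, -c, 1.0, 0.0])  # payoff vector from Eq. (1)
    return stationary(dA, dB, eps) @ P
```

As a sanity check, payoff([1,0,0,1], [1,0,0,1], c=0.2, eps=1e-4) evaluates to approximately 0.8, i.e., 1 − c + O(ε) for two WSLS players.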
As long as Alice can exactly identify Bob's strategy d_β with no observational uncertainty, she can find the best response to Bob within the set of M₁ pure strategies by applying every d_α ∈ ∆ to Eq. (5). In Table I, we list the best response to each strategy in ∆ in the limit of small ε (see also Fig. 1 for its graphical representation). In most cases, the best-response dynamics ends up with d₀ = [0, 0, 0, 0], d₈ = [1, 0, 0, 0], or d₉ = [1, 0, 0, 1].

[TABLE I. Best response among M₁ pure strategies. Against each strategy in the first column, we obtain the best response (the second column), and the resulting average payoff [Eq. (5)] earned by the best response is given as a power series of ε in the third column. In the second column, we have placed a dagger next to a strategy when it is the best response to itself.]

The generic outcome is full defection: AllD = d₀ is the best response to itself for any cost of cooperation. However, two exceptions exist: The first one is d₈ = [1, 0, 0, 0], also known as Grim Trigger (GT).
If c > 1/3, this strategy is the best response to itself, and it is an inefficient equilibrium giving each player an average payoff of O(ε). The other exception is WSLS, represented by d₉ = [1, 0, 0, 1], which is the best response to itself as long as c ≤ 1/2.
It is an efficient NE, at which each player earns 1 − c + O(ε) per round on average.

[FIG. 1. Graphical representation of best-response relations in Table I. If d_μ is the best response to d_ν, we represent it as an arrow from d_ν to d_μ. The blue node means an efficient NE with 1 − v_CC ∼ O(ε), whereas the red nodes mean inefficient ones with v_CC ≲ O(ε), as shown in Table II.]
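The contents of Table I and Fig. 1 can be reproduced by brute force, scoring all 16 pure strategies against a given resident with the payoff() helper defined above. A sketch, continuing the earlier code (near-ties at O(ε²), e.g. among mostly defecting strategies, can make the argmax sensitive to the exact value of ε, so we print only cases with O(1) gaps):

```python
from itertools import product

# All 16 memory-one pure strategies, d_i indexed by the binary digits
# [p_CC, p_CD, p_DC, p_DD]; the encoding is our own convention.
STRATS = [list(bits) for bits in product([0, 1], repeat=4)]

def best_response(d_resident, c, eps=1e-6):
    """Scan Delta for the maximizer of Eq. (5) against d_resident."""
    scores = [payoff(d, d_resident, c, eps) for d in STRATS]
    return STRATS[int(np.argmax(scores))]

WSLS, GT = [1, 0, 0, 1], [1, 0, 0, 0]
print(best_response(WSLS, c=0.3))  # -> [1, 0, 0, 1]: WSLS itself, since c < 1/2
print(best_response(GT, c=0.2))    # -> [1, 1, 1, 1]: AllC, since c < 1/3
```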
B. Observational learning

Now, let us imagine a monomorphic population of players who have adopted a strategy d_γ in common. The population is in equilibrium in the sense that a large ensemble of their states XY ∈ {CC, CD, DC, DD} can represent the stationary probability distribution v(d_γ, d_γ, ε). We have an observer, say, Alice, with a potential strategy d_α. Just as we learn social norms in childhood, it is assumed that Alice does not yet participate in the game but has a learning period in which she observes M (≫ 1) pairs of players, all of whom have used the resident strategy d_γ. How their minds work is a black box to her: Just by observing their states XY and subsequent moves, Alice has to form a belief about d_γ, based on which she chooses her own strategy d_α to maximize the expected payoff. If Alice's optimal strategy turns out to be identical to the resident strategy d_γ, it constitutes an SCE.

To see how Alice can specify d_γ ∈ ∆ from observation, let us consider an example in which the observed probability distribution over states XY is best described as v ≈ (0, 1/4, 1/4, 1/2). Comparing this with v for every strategy in ∆ as listed in Table II, the observation suggests that the resident strategy is unlikely to be TFT (d₁₀ = [1, 0, 1, 0]), which would generate v = (1/4, 1/4, 1/4, 1/4). Rather, d_γ can be either d₂ = [0, 0, 1, 0] or d₄ = [0, 1, 0, 0]. To discriminate between the two, Alice can check the moves prescribed at CD or DC. According to Table II, these states will be observed frequently because v_CD = v_DC = 1/4. Thus, in this example, Alice succeeds in identifying d_γ as long as M ≫ 1. Eight strategies have this property, constituting Category I in ∆ (Table II).
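Alice's first filter, matching the observed state distribution against v(d, d, ε) for every d ∈ ∆, can be sketched as follows, continuing the code above; the fixed tolerance is a crude stand-in for a proper statistical test and is our own simplification:

```python
def candidates(v_obs, eps=1e-6, tol=0.05):
    """Strategies whose self-play distribution is compatible with v_obs."""
    return [d for d in STRATS
            if np.allclose(stationary(d, d, eps), v_obs, atol=tol)]

# The example from the text: v ~ (0, 1/4, 1/4, 1/2) leaves two candidates,
# which differ at the frequently visited states CD and DC and can therefore
# be told apart once M >> 1.
print(candidates(np.array([0.0, 0.25, 0.25, 0.5])))
# -> [[0, 0, 1, 0], [0, 1, 0, 0]], i.e., d2 and d4
```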
As another example, if v ≈ (1/2, 0, 0, 1/2), then d_γ must be one of two strategies of the form [0, ·, ·, 1]. To discriminate between them, Alice would have to check the moves at CD or DC, but she may actually save this effort because the best response turns out to be d₀ in either case (Table I). This is the case of Category II in ∆ (Table II).

In general, the first important piece of information for inferring d_γ is the stationary distribution v, because it depends heavily on d_γ (Table II). However, the information in v may be insufficient to single out the answer: Suppose that v gives multiple candidate strategies which prescribe different moves at a certain state XY and thus have different best responses. Alice then needs to observe what players actually choose at XY, and such observations should be performed sufficiently many times, i.e., M v_XY ≫ 1, for the sake of statistical power. If we check every d_γ ∈ ∆ one by one in this way, we see that the best response to the resident strategy can readily be identified as long as M ≫ ε⁻¹, in which case the result of observational learning would be the same as that of exact identification of strategies.

If M ≪ ε⁻¹, on the other hand, Alice cannot fully resolve such uncertainty through observation. Still, note that M should be taken as far greater than O(1) for statistical inference to be meaningful. Furthermore, ε has been introduced as a regularization parameter whose exact magnitude is irrelevant, so we look at the behaviour in the limit of small ε. When 1 ≪ M ≪ ε⁻¹, uncertainty in the best response remains only when v ≈ (0, 0, 0, 1) or v ≈ (1, 0, 0, 0). In the former case, d₀, d₆, and d₈ are the candidate strategies for d_γ, whereas in the latter case, the candidates are d₉, d₁₄, and d₁₅. From the Bayesian perspective, it is reasonable to assign equal probability to each of the candidate strategies. However, if Mε ≪ 1, the number of observations cannot be enough to update this prior probability (see Appendix A for a detailed discussion). Therefore, when v ≈ (0, 0, 0, 1) and d_γ = d₀ or d₆ or d₈, Alice tries to maximize the expected payoff

Π_α = [Π(d_α, d₀, ε) + Π(d_α, d₆, ε) + Π(d_α, d₈, ε)] / 3,     (6)

and the calculation shows that it can be achieved by playing

d₈, if c > 16/33;   d₉, if c < 16/33     (7)

in the limit of ε → 0. Likewise, when v ≈ (1, 0, 0, 0) and d_γ = d₉ or d₁₄ or d₁₅, Alice tries to maximize her expected payoff over the three possibilities, which is achieved when she plays

d₀, if c > 2/9;   d₉, if c < 2/9     (8)

in the limit of ε → 0.

[FIG. 2. Best-looking responses to maximize the expected payoff under uncertainty in observation, when 1 ≪ M ≪ ε⁻¹. Compared with Fig. 1, the first difference is that Alice uses Eq. (7) against d₀, d₆, and d₈. In addition, she will use Eq. (8) against d₉, d₁₄, and d₁₅.]
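The behaviour of Eqs. (7) and (8) can be probed by maximizing the uniform-prior average of Eq. (6) over ∆. A sketch, continuing the code above; the candidate sets are those named in the text, and the c values are chosen safely away from the thresholds so that the winners have O(1) margins:

```python
def best_looking_response(candidate_set, c, eps=1e-6):
    """Maximize the uniform-prior expected payoff of Eq. (6) over Delta."""
    def expected(d):
        return np.mean([payoff(d, q, c, eps) for q in candidate_set])
    return max(STRATS, key=expected)

DEFECTIVE   = [[0, 0, 0, 0], [0, 1, 1, 0], [1, 0, 0, 0]]  # d0, d6, d8
COOPERATIVE = [[1, 0, 0, 1], [1, 1, 1, 0], [1, 1, 1, 1]]  # d9, d14, d15

print(best_looking_response(DEFECTIVE, c=0.3))    # -> WSLS, c below 16/33
print(best_looking_response(DEFECTIVE, c=0.6))    # -> GT,   c above 16/33
print(best_looking_response(COOPERATIVE, c=0.1))  # -> WSLS, c below 2/9
print(best_looking_response(COOPERATIVE, c=0.7))  # -> AllD, c above 2/9
```

Against the defective candidate set, for instance, WSLS earns roughly 2/5 − 11c/30 on average (it milks d₆ while paying the alternation cost against d₀ and d₈), which crosses GT's average of 2/9 exactly at c = 16/33.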
Now, AllD ceases to be the best-looking response to itself (Fig. 2): The expected payoff against AllD will be higher when WSLS is played, if c < 16/33. On the other hand, if we consider a WSLS population with c < 2/9, its cooperative equilibrium is protected from the invasion of defectors because Alice under observational uncertainty will keep choosing WSLS, which is truly the best response to itself.
III. SUMMARY AND DISCUSSION
In summary, we have investigated the iterated PD game in terms of best-response relations and checked how they are modified by observational learning. Thereby we have addressed the question of how cooperation is affected by cultural transmission, which may systematically involve observational uncertainty. The notion of SCE takes this systematic uncertainty into account, and its intersection with NE can serve as an equilibrium refinement.

It is worth pointing out the following: If everyone plays a certain strategy d_i with the belief that everyone else does the same, the whole situation is self-consistent in the sense that observation will always confirm the belief, which in turn agrees with the actual behaviour. The importance of a SCENE becomes clear when someone happens to play a different strategy or begins to doubt the belief: If d_i is not a NE, the player will benefit from the deviant behaviour and reinforce it. If d_i is not an SCE, the player may fail to dispel the doubt, which will undermine the prevailing culture. Therefore, a strategy has to be a SCENE to be transmitted in a stable manner through observational learning.

As a reference point, we have started with the conventional assumption that one can identify a strategy without uncertainty, and checked the best-response relations within the set of M₁ pure strategies. Our finding is that a symmetric NE is possible if one uses one of the following three strategies: AllD, GT, and WSLS (Fig. 1). Only the last one is efficient. Although we have restricted ourselves to pure strategies, we can discuss the idea behind this restriction as follows: Let us consider a monomorphic population playing a mixed strategy q = [q_CC, q_CD, q_DC, q_DD], where each element means the probability to cooperate in the given circumstance. Such a mixed strategy can be represented as a point inside a four-dimensional unit hypercube. The observer seeks the best response to it, say, p = [p_CC, p_CD, p_DC, p_DD]. Suppose that p also turns out to be a mixed strategy, say, containing d_k and d_l with k ≠ l. According to the Bishop–Cannings theorem [19], this implies that

Π(d_k, q, ε) = Π(d_l, q, ε),     (9)

and this equality imposes a set of constraints on q, rendering the dimensionality of the solution manifold lower than four. Therefore, for almost all q in the four-dimensional hypercube, only one pure strategy will be found as the best response. In Appendix B, we provide an explicit proof of this argument in the case of reactive strategies.

If we take observational learning into consideration, our result suggests that WSLS can be a SCENE for a Bayesian observer, whereas AllD cannot under observational uncertainty. That is, if the number of observations is too small to see how players would behave after an error, the uncertainty provides a way to escape from full defection, whereas WSLS can still maintain cooperation: The point is that AllD is not easy to learn by observing defectors because it is difficult to tell what they would choose if someone actually cooperated. WSLS is also difficult to learn, but the uncertainty works in an asymmetric way because one can expect more from mutual cooperation than from full defection by the very definition of the PD game.
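The dimensionality argument around Eq. (9) can be probed numerically: draw random mixed strategies q and check that the maximum of Eq. (5) over ∆ is attained by a single pure strategy. A sketch under the same conventions as before (stationary() and payoff() accept fractional cooperation probabilities as written; the tolerance is our own choice):

```python
rng = np.random.default_rng(0)

def unique_pure_best(q, c=0.3, eps=1e-3, tol=1e-9):
    """True if a single pure strategy attains the maximal payoff against q."""
    scores = np.array([payoff(d, q, c, eps) for d in STRATS])
    return np.sum(scores > scores.max() - tol) == 1

# Against almost every mixed strategy q in the hypercube, the best
# response within Delta is unique, hence pure (ties have measure zero).
print(all(unique_pure_best(q) for q in rng.random((1000, 4))))  # -> True
```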
Appendix A: Bayesian inference

To illustrate the inference procedure, let us assume that v ≈ (0, 0, 0, 1) is given to Alice. She has a set of candidate strategies Λ ≡ {d₀, d₆, d₈} for the resident strategy q. Alice assigns equal prior probability to each of these candidate strategies. In a certain round t, she observes an interaction between Eve and Frank, both of whom use q. Let E_t and F_t denote Eve's and Frank's moves, respectively, in round t. If Alice sees Eve cooperate, i.e., E_t = C, after S_{t−1} ≡ (E_{t−1}, F_{t−1}) = (C, C), she may use this additional information in a Bayesian way to calculate the posterior probability of q = d₀ as follows:

P(q = d₀ | E_t, S_{t−1}) = P(E_t | S_{t−1}, d₀) P(S_{t−1} | d₀) P(d₀) / Σ_{d_i ∈ Λ} P(E_t | S_{t−1}, d_i) P(S_{t−1} | d_i) P(d_i)     (A1)

= [ε · v_CC(d₀) · (1/3)] / [ε · v_CC(d₀) · (1/3) + ε · v_CC(d₆) · (1/3) + (1 − ε) · v_CC(d₈) · (1/3)],     (A2)

where v_CC(d₀) = ε², v_CC(d₆) = 2ε + O(ε²), and v_CC(d₈) = ε/2 + O(ε²) to leading order (Table II). Here, P(E_t | S_{t−1}, d_i) is directly obtained from d_i, and P(S_{t−1} | d_i) is taken from the stationary probability distribution v. This posterior probability is used as the prior probability for the next observation. If q is actually d_i, the average number of times to observe E_t = C after S_{t−1} = (C, C) will be

M P(E_t, S_{t−1} | q = d_i) = M P(E_t | S_{t−1}, d_i) P(S_{t−1} | d_i).     (A3)

In this way, Alice obtains the final posterior probability of q = d_i after observing interactions between M pairs of players. If ε is fixed at a small positive value, this inference procedure approaches the correct answer as M → ∞. The effect of observational uncertainty manifests itself when Mε ≪ 1. For example, we may choose M ≈ ε^{−1/2} as a representative value for 1 ≪ M ≪ ε⁻¹ and check various small values of ε. Then, the above calculation confirms that the posterior probabilities remain essentially identical to the prior ones due to the lack of observations.
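The update rule of Eq. (A1) is a standard sequential application of Bayes' theorem. A sketch, continuing the earlier code; the helper name and the data format are ours:

```python
def posterior(prior, observations, eps):
    """Sequential Bayesian updating over the candidate set, Eq. (A1).

    prior        : dict mapping strategy tuples to prior probabilities
    observations : iterable of (state, move), with state in {0,...,3} for
                   CC, CD, DC, DD and move 1 (C) or 0 (D) of the acting player
    """
    post = dict(prior)
    for state, move in observations:
        for d in post:
            p_move = effective(d, eps)[state]        # P(E_t | S_{t-1}, d_i)
            p_state = stationary(d, d, eps)[state]   # P(S_{t-1} | d_i)
            post[d] *= (p_move if move == 1 else 1.0 - p_move) * p_state
        norm = sum(post.values())
        post = {d: w / norm for d, w in post.items()}
    return post

LAMBDA = {(0, 0, 0, 0): 1/3, (0, 1, 1, 0): 1/3, (1, 0, 0, 0): 1/3}  # d0, d6, d8
# A single cooperation observed right after mutual cooperation already
# points sharply to GT = d8:
print(posterior(LAMBDA, [(0, 1)], eps=1e-3))
```

The catch, as stated above, is that when Mε ≪ 1 such an informative event practically never appears among the M samples, so the prior stays where it started.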
Appendix B: Best-response relations among reactive strategies

Let us consider two reactive strategies p = [p_C, p_D, p_C, p_D] and q = [q_C, q_D, q_C, q_D], where the subscript denotes the co-player's move in the previous round. The long-term average payoff of p against q is

Π = [(p_D q_C − p_D q_D + q_D) − c (p_D + p_C q_D − p_D q_D)] / [1 − (p_C − p_D)(q_C − q_D)]     (B1)

in the limit of ε → 0.
After some algebra, we find the following: First, if q_C − q_D > c, both ∂Π/∂p_C and ∂Π/∂p_D are positive, so the best response is given by p_C = p_D = 1. Or, if q_C − q_D < c, both ∂Π/∂p_C and ∂Π/∂p_D are negative, so the best response is given by p_C = p_D = 0. Note that we have neglected the measure-zero line defined by q_C − q_D = c, on which the best response is not uniquely determined.
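This conclusion can be checked numerically by embedding reactive strategies into the memory-one format of the main text and comparing the corner responses; a sketch reusing payoff() from above:

```python
def reactive(pC, pD):
    """Embed a reactive strategy into the memory-one format [pCC,pCD,pDC,pDD]."""
    return [pC, pD, pC, pD]

def best_reactive_corner(qC, qD, c, eps=1e-6):
    """Best response to (qC, qD) among the four deterministic reactive corners."""
    corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
    scores = [payoff(reactive(*p), reactive(qC, qD), c, eps) for p in corners]
    return corners[int(np.argmax(scores))]

print(best_reactive_corner(0.9, 0.2, c=0.3))  # qC - qD = 0.7 > c -> (1, 1)
print(best_reactive_corner(0.5, 0.4, c=0.3))  # qC - qD = 0.1 < c -> (0, 0)
```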
[1] J. Maynard Smith, Evolution and the Theory of Games (Cambridge University Press, Cambridge, UK, 1982).
[2] C. Kasper, M. Vierbuchen, U. Ernst, S. Fischer, R. Radersma, A. Raulo, F. Cunha-Saraiva, M. Wu, K. B. Mobley, and B. Taborsky, Mol. Ecol., 4364 (2017).
[3] F. Manfredini, M. J. Brown, and A. L. Toth, J. Comp. Physiol. A, 449 (2018).
[4] A. Bandura, Social Learning Theory (Prentice Hall, Englewood Cliffs, NJ, 1977).
[5] M. Krützen, J. Mann, M. R. Heithaus, R. C. Connor, L. Bejder, and W. B. Sherwin, Proc. Natl. Acad. Sci. USA, 8939 (2005).
[6] C. D. Frith and U. Frith, Annu. Rev. Psychol., 287 (2012).
[7] G. Di Pellegrino, L. Fadiga, L. Fogassi, V. Gallese, and G. Rizzolatti, Exp. Brain Res., 176 (1992).
[8] V. Gallese, L. Fadiga, L. Fogassi, and G. Rizzolatti, Brain, 593 (1996).
[9] P. F. Ferrari and G. Rizzolatti, Philos. Trans. R. Soc. Lond. B, 20130169 (2014).
[10] O. Leimar and J. M. McNamara, Am. Nat., E55 (2015).
[11] D. Fudenberg and D. K. Levine, The Theory of Learning in Games (MIT Press, Cambridge, MA, 1998).
[12] S. K. Baek, H.-C. Jeong, C. Hilbe, and M. A. Nowak, Sci. Rep., 25676 (2016).
[13] D. Kraines and V. Kraines, Theory Decis., 47 (1989).
[14] M. Nowak and K. Sigmund, Nature, 56 (1993).
[15] L. A. Imhof, D. Fudenberg, and M. A. Nowak, J. Theor. Biol., 574 (2007).
[16] M. Nowak, Theor. Popul. Biol., 93 (1990).
[17] M. A. Nowak, K. Sigmund, and E. El-Sedy, J. Math. Biol., 703 (1995).
[18] W. H. Press and F. J. Dyson, Proc. Natl. Acad. Sci. USA, 10409 (2012).
[19] D. Bishop and C. Cannings, J. Theor. Biol., 85 (1978).

Category  Strategy              v_CC   v_CD   v_DC   v_DD
   I      d₂  = [0, 0, 1, 0]    O(ε)   1/4    1/4    1/2
          d₃  = [0, 0, 1, 1]    1/4    1/4    1/4    1/4
          d₄  = [0, 1, 0, 0]    O(ε)   1/4    1/4    1/2
          d₅  = [0, 1, 0, 1]    1/4    1/4    1/4    1/4
          d₁₀ = [1, 0, 1, 0]    1/4    1/4    1/4    1/4
          d₁₁ = [1, 0, 1, 1]    1/2    1/4    1/4    O(ε)
          d₁₂ = [1, 1, 0, 0]    1/4    1/4    1/4    1/4
          d₁₃ = [1, 1, 0, 1]    1/2    1/4    1/4    O(ε)
   II     d₁  = [0, 0, 0, 1]    1/2    O(ε)   O(ε)   1/2
          d₇  = [0, 1, 1, 1]    1/2    O(ε)   O(ε)   1/2
   III    d₀  = [0, 0, 0, 0]    ε²     ε      ε      1
          d₆  = [0, 1, 1, 0]    2ε     ε      ε      1
          d₈  = [1, 0, 0, 0]    ε/2    ε      ε      1
          d₉  = [1, 0, 0, 1]    1      ε      ε      2ε
          d₁₄ = [1, 1, 1, 0]    1      ε      ε      ε/2
          d₁₅ = [1, 1, 1, 1]    1      ε      ε      ε²

TABLE II. Stationary probability distribution v(d_γ, d_γ, ε), where we have retained only the leading-order term in the ε-expansion for each v_XY. When we describe a strategy in binary, the digits corresponding to frequently visited states, i.e., those with v_XY ∼ O(1), are readily identifiable as long as M ≫ 1. The eight strategies in Category I have three or four such digits, so if the population is using one of these strategies, Alice can tell which one is being played after M (≫ 1) observations. As for Category II, the member strategies d₁ and d₇ would be indistinguishable if M ≪ ε⁻¹ because they differ only at their rarely visited states. Still, Alice can find the best response d₀, which is common to both of them (see Table I). In Category III, each member strategy has just one such digit, so the strategies, as well as the best responses, can be identified only if M ≫ ε⁻¹.