Symmetric equilibrium of multi-agent reinforcement learning in repeated prisoner's dilemma
Yuki Usui
Faculty of Science, Yamaguchi University, Yamaguchi 753-8511, Japan
Masahiko Ueda
Graduate School of Sciences and Technology for Innovation, Yamaguchi University, Yamaguchi 753-8511, Japan
Abstract
We investigate the repeated prisoner's dilemma game where both players alternately use reinforcement learning to obtain their optimal memory-one strategies. We theoretically solve the simultaneous Bellman optimality equations of reinforcement learning. We find that the Win-stay Lose-shift strategy, the Grim strategy, and the strategy which always defects can form symmetric equilibrium of the mutual reinforcement learning process amongst all deterministic memory-one strategies.
Keywords:
Repeated prisoner’s dilemma game; Reinforcement learning
1. Introduction
The prisoner's dilemma game describes a dilemma where rational behavior of each player cannot achieve a favorable situation for both players [1]. In the game, each player chooses cooperation or defection. Each player obtains a larger payoff by defecting than by cooperating, regardless of the opponent's action. Mutual defection is therefore realized as a result of rational thought by both players, even though the payoffs of both players would increase if both chose cooperation.
Email addresses: [email protected] (Yuki Usui), [email protected] (Masahiko Ueda)
Although the Nash equilibrium of the one-shot game is mutual defection, it is known that mutual cooperation can be achieved as a Nash equilibrium when the game is infinitely repeated. This fact is known as the folk theorem. Because the repeated version of the prisoner's dilemma game is also simple, it has been investigated substantially [2].

Recently, reinforcement learning techniques have attracted much attention in the context of game theory [3, 4, 5, 6, 7, 8, 9, 10, 11]. In reinforcement learning, a player gradually learns his/her optimal strategy against his/her opponents. Both learning by a single player and learning by several players have been investigated. Because the rationality of players is bounded in reality, modeling players as learning agents is crucial [12]. It is also significant in the context of reinforcement learning itself, since the original reinforcement learning framework was formulated for Markov decision processes with stationary environments [13]. Because the existence of multiple agents in game theory leads to non-stationarity of the environment for each player, the standard application of reinforcement learning to games breaks down [4, 14], and further theoretical understanding of reinforcement learning in game theory is needed. Moreover, since the acquisition process of optimal strategies in reinforcement learning is generally different from that in evolutionary game theory [15], accumulating knowledge about equilibrium in each learning dynamics is needed.

In this paper, we investigate the situation where both players alternately learn their optimal strategies by using reinforcement learning in the repeated prisoner's dilemma game. We theoretically derive equilibrium points of mutual reinforcement learning where both players take the same deterministic strategy. We find that the strategy which always defects (All-D), the Win-stay Lose-shift (WSLS) strategy [16], and the Grim strategy can form such symmetric equilibrium amongst all memory-one deterministic strategies.

This paper is organized as follows. In Section 2, we introduce the repeated prisoner's dilemma game and players using reinforcement learning. In Section 3, we theoretically derive deterministic optimal strategies against the strategy of a learning opponent. In Section 4, we provide numerical results using Q-learning which support our theoretical results. Section 5 is devoted to conclusion.
2. Model
We consider the repeated prisoner's dilemma game [3]. There are two players in the game, labeled 1 and 2. Each player chooses cooperation ($C$) or defection ($D$) on every trial. The action of player $a$ is written as $\sigma_a \in \{C, D\}$, and we collectively write $\sigma := (\sigma_1, \sigma_2)$. The payoff of player $a \in \{1, 2\}$ when the state is $\sigma$ is described as $r_a(\sigma)$. The payoffs in the prisoner's dilemma game are defined as

\[ \left( r_1(C,C),\ r_1(C,D),\ r_1(D,C),\ r_1(D,D) \right) = \left( R,\ S,\ T,\ P \right) \qquad (1) \]

and

\[ \left( r_2(C,C),\ r_2(C,D),\ r_2(D,C),\ r_2(D,D) \right) = \left( R,\ T,\ S,\ P \right) \qquad (2) \]

with $T > R > P > S$ and $2R > T + S$. We consider the situation where both players use memory-one strategies. The memory-one strategy of player $a$ is described as the conditional probability $T_a(\sigma_a | \sigma')$ of taking action $\sigma_a$ when the state in the previous round is $\sigma'$. Then, when we define the probability distribution of a state $\sigma'$ at time $t$ by $P(\sigma', t)$, the time evolution of this system is described as the Markov chain

\[ P(\sigma, t+1) = \sum_{\sigma'} T(\sigma | \sigma') P(\sigma', t) \qquad (3) \]

with the transition probability

\[ T(\sigma | \sigma') := \prod_{a=1}^{2} T_a(\sigma_a | \sigma'). \qquad (4) \]

Below we introduce the notation $-a := \{1, 2\} \setminus \{a\}$.

We consider the situation that both players learn their strategies by reinforcement learning [13]. We assume that the two players alternately learn and update their strategies [17]; that is, player 1 first learns her strategy against a fixed initial strategy of player 2, then player 2 learns his strategy against the strategy of player 1, then player 1 learns her strategy against the strategy of player 2, and so on. (We consider the situation that the two players infinitely repeat the infinitely repeated game.) In reinforcement learning, each player learns a mapping (called a policy) from a state to his/her action so as to maximize his/her expected future reward. In our memory-one situation, a state and an action of player $a$ are regarded as the state $\sigma'$ in the previous round and the action $\sigma_a$ in the present round, respectively. We define the action-value function of player $a$ as

\[ Q_a\left( \sigma_a^{(1)}, \sigma^{(0)} \right) := E\left[ \sum_{k=0}^{\infty} \gamma^k r_a(t+k+1) \,\middle|\, \sigma_a(t+1) = \sigma_a^{(1)},\ \sigma(t) = \sigma^{(0)} \right], \qquad (5) \]

where $\gamma$ is a discounting factor satisfying $0 \leq \gamma \leq 1$. The action $\sigma_a(t)$ represents the action of player $a$ at round $t$. Similarly, the payoff $r_a(t)$ represents the payoff of player $a$ at round $t$. Due to the Markov property, the action-value function $Q_a$ obeys the Bellman equation against a fixed strategy $T_{-a}$ of the opponent:

\[ Q_a\left( \sigma_a^{(1)}, \sigma^{(0)} \right) = \sum_{\sigma_{-a}^{(1)}} T_{-a}\left( \sigma_{-a}^{(1)} | \sigma^{(0)} \right) r_a\left( \sigma^{(1)} \right) + \gamma \sum_{\sigma_a^{(2)}} \sum_{\sigma_{-a}^{(1)}} T_a\left( \sigma_a^{(2)} | \sigma^{(1)} \right) T_{-a}\left( \sigma_{-a}^{(1)} | \sigma^{(0)} \right) Q_a\left( \sigma_a^{(2)}, \sigma^{(1)} \right). \qquad (6) \]

It has been known that the optimal value of $Q_a$ obeys the following Bellman optimality equation:

\[ Q_a^*\left( \sigma_a^{(1)}, \sigma^{(0)} \right) = \sum_{\sigma_{-a}^{(1)}} T_{-a}\left( \sigma_{-a}^{(1)} | \sigma^{(0)} \right) r_a\left( \sigma^{(1)} \right) + \gamma \sum_{\sigma_{-a}^{(1)}} T_{-a}\left( \sigma_{-a}^{(1)} | \sigma^{(0)} \right) \max_{\sigma_a^{(2)}} Q_a^*\left( \sigma_a^{(2)}, \sigma^{(1)} \right) \qquad (7) \]

with the support

\[ \mathrm{supp}\, T_a\left( \cdot | \sigma^{(0)} \right) = \arg\max_{\sigma} Q_a^*\left( \sigma, \sigma^{(0)} \right). \qquad (8) \]

In other words, in the optimal policy against $T_{-a}$, player $a$ takes the action $\sigma_a$ which maximizes the value of $Q_a^*\left( \cdot, \sigma^{(0)} \right)$ when the state at the previous round is $\sigma^{(0)}$. Therefore, the strategies obtained in this learning process are deterministic. Because the number of deterministic memory-one strategies in the repeated prisoner's dilemma game is sixteen, we check whether each deterministic strategy forms equilibrium or not.
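For a fixed memory-one strategy of the opponent, Eqs. (7) and (8) can also be solved numerically by standard value iteration. The following Python sketch illustrates this; it is an illustration added here rather than part of the original text, all function and variable names are ours, and the payoff values follow the numerical example of Section 4:

    import itertools

    # States are the previous joint action (sigma_1, sigma_2); actions are 'C' or 'D'.
    ACTIONS = ('C', 'D')
    STATES = list(itertools.product(ACTIONS, ACTIONS))
    R, S, T, P = 4.0, 0.0, 6.0, 1.0          # payoffs of player 1, cf. Eq. (1)
    r1 = {('C', 'C'): R, ('C', 'D'): S, ('D', 'C'): T, ('D', 'D'): P}

    def optimal_Q(opponent, gamma, sweeps=2000):
        """Approximate Q*_1 of Eq. (7) against a fixed memory-one strategy of player 2.
        opponent[state] = probability that player 2 cooperates after `state`."""
        Q = {(a, s): 0.0 for a in ACTIONS for s in STATES}
        for _ in range(sweeps):
            new_Q = {}
            for a1 in ACTIONS:
                for s in STATES:
                    value = 0.0
                    for a2, prob in (('C', opponent[s]), ('D', 1.0 - opponent[s])):
                        nxt = (a1, a2)   # the next state is the current joint action
                        value += prob * (r1[(a1, a2)]
                                         + gamma * max(Q[('C', nxt)], Q[('D', nxt)]))
                    new_Q[(a1, s)] = value
            Q = new_Q
        return Q

    # Example: greedy policy (Eq. (8)) against a WSLS opponent for gamma = 0.9.
    wsls = {('C', 'C'): 1.0, ('C', 'D'): 0.0, ('D', 'C'): 0.0, ('D', 'D'): 1.0}
    Q = optimal_Q(wsls, gamma=0.9)
    print({s: max(ACTIONS, key=lambda a: Q[(a, s)]) for s in STATES})

For a sufficiently large discounting factor the printed greedy policy is itself WSLS, which anticipates the case analysis of the next section.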
3. Results
We consider symmetric solutions of Eq. (7), that is,

\[ Q^*\left( \sigma_1^{(1)}, \sigma^{(0)} \right) = \sum_{\sigma_2^{(1)}} T\left( \sigma_2^{(1)} | \sigma^{(0)} \right) r_1\left( \sigma^{(1)} \right) + \gamma \sum_{\sigma_2^{(1)}} T\left( \sigma_2^{(1)} | \sigma^{(0)} \right) \max_{\sigma_1^{(2)}} Q^*\left( \sigma_1^{(2)}, \sigma^{(1)} \right) \qquad (9) \]

with

\[ T(C|C,C) = I\left( Q^*(C,(C,C)) > Q^*(D,(C,C)) \right) \qquad (10) \]
\[ T(C|C,D) = I\left( Q^*(C,(D,C)) > Q^*(D,(D,C)) \right) \qquad (11) \]
\[ T(C|D,C) = I\left( Q^*(C,(C,D)) > Q^*(D,(C,D)) \right) \qquad (12) \]
\[ T(C|D,D) = I\left( Q^*(C,(D,D)) > Q^*(D,(D,D)) \right) \qquad (13) \]

where $I(\cdots)$ is the indicator function that returns 1 when $\cdots$ holds and 0 otherwise. Then, Eq. (7) becomes

\[ Q^*(C,(C,C)) = I\left( Q^*(C,(C,C)) > Q^*(D,(C,C)) \right) \left\{ R + \gamma \max_{\sigma} Q^*(\sigma,(C,C)) \right\} + I\left( Q^*(C,(C,C)) < Q^*(D,(C,C)) \right) \left\{ S + \gamma \max_{\sigma} Q^*(\sigma,(C,D)) \right\} \qquad (14) \]

\[ Q^*(C,(C,D)) = I\left( Q^*(C,(D,C)) > Q^*(D,(D,C)) \right) \left\{ R + \gamma \max_{\sigma} Q^*(\sigma,(C,C)) \right\} + I\left( Q^*(C,(D,C)) < Q^*(D,(D,C)) \right) \left\{ S + \gamma \max_{\sigma} Q^*(\sigma,(C,D)) \right\} \qquad (15) \]

\[ Q^*(C,(D,C)) = I\left( Q^*(C,(C,D)) > Q^*(D,(C,D)) \right) \left\{ R + \gamma \max_{\sigma} Q^*(\sigma,(C,C)) \right\} + I\left( Q^*(C,(C,D)) < Q^*(D,(C,D)) \right) \left\{ S + \gamma \max_{\sigma} Q^*(\sigma,(C,D)) \right\} \qquad (16) \]

\[ Q^*(C,(D,D)) = I\left( Q^*(C,(D,D)) > Q^*(D,(D,D)) \right) \left\{ R + \gamma \max_{\sigma} Q^*(\sigma,(C,C)) \right\} + I\left( Q^*(C,(D,D)) < Q^*(D,(D,D)) \right) \left\{ S + \gamma \max_{\sigma} Q^*(\sigma,(C,D)) \right\} \qquad (17) \]

\[ Q^*(D,(C,C)) = I\left( Q^*(C,(C,C)) > Q^*(D,(C,C)) \right) \left\{ T + \gamma \max_{\sigma} Q^*(\sigma,(D,C)) \right\} + I\left( Q^*(C,(C,C)) < Q^*(D,(C,C)) \right) \left\{ P + \gamma \max_{\sigma} Q^*(\sigma,(D,D)) \right\} \qquad (18) \]

\[ Q^*(D,(C,D)) = I\left( Q^*(C,(D,C)) > Q^*(D,(D,C)) \right) \left\{ T + \gamma \max_{\sigma} Q^*(\sigma,(D,C)) \right\} + I\left( Q^*(C,(D,C)) < Q^*(D,(D,C)) \right) \left\{ P + \gamma \max_{\sigma} Q^*(\sigma,(D,D)) \right\} \qquad (19) \]

\[ Q^*(D,(D,C)) = I\left( Q^*(C,(C,D)) > Q^*(D,(C,D)) \right) \left\{ T + \gamma \max_{\sigma} Q^*(\sigma,(D,C)) \right\} + I\left( Q^*(C,(C,D)) < Q^*(D,(C,D)) \right) \left\{ P + \gamma \max_{\sigma} Q^*(\sigma,(D,D)) \right\} \qquad (20) \]

\[ Q^*(D,(D,D)) = I\left( Q^*(C,(D,D)) > Q^*(D,(D,D)) \right) \left\{ T + \gamma \max_{\sigma} Q^*(\sigma,(D,C)) \right\} + I\left( Q^*(C,(D,D)) < Q^*(D,(D,D)) \right) \left\{ P + \gamma \max_{\sigma} Q^*(\sigma,(D,D)) \right\}. \qquad (21) \]

For simplicity, we introduce the following notation:

\[ q_1 := Q^*(C,(C,C)), \quad q_2 := Q^*(C,(C,D)), \quad q_3 := Q^*(C,(D,C)), \quad q_4 := Q^*(C,(D,D)), \]
\[ q_5 := Q^*(D,(C,C)), \quad q_6 := Q^*(D,(C,D)), \quad q_7 := Q^*(D,(D,C)), \quad q_8 := Q^*(D,(D,D)). \qquad (22) \]

We consider the following sixteen situations separately.

3.1. Case 1: $q_1 > q_5$, $q_2 > q_6$, $q_3 > q_7$, and $q_4 > q_8$

For this case, the strategy obtained by reinforcement learning is the All-C strategy. The solution of Eq. (7) is

\[ q_1 = q_2 = q_3 = q_4 = \frac{1}{1-\gamma} R \qquad (23) \]
\[ q_5 = q_6 = q_7 = q_8 = T + \frac{\gamma}{1-\gamma} R. \qquad (24) \]

This contradicts the definition of the game, $T > R$.

3.2. Case 2: $q_1 > q_5$, $q_2 > q_6$, $q_3 > q_7$, and $q_4 < q_8$

The solution of Eq. (7) is

\[ q_1 = q_2 = q_3 = \frac{1}{1-\gamma} R \qquad (25) \]
\[ q_4 = S + \frac{\gamma}{1-\gamma} R \qquad (26) \]
\[ q_5 = q_6 = q_7 = T + \frac{\gamma}{1-\gamma} R \qquad (27) \]
\[ q_8 = \frac{1}{1-\gamma} P. \qquad (28) \]

This contradicts the definition of the game, $T > R$.

3.3. Case 3: $q_1 > q_5$, $q_2 > q_6$, $q_3 < q_7$, and $q_4 > q_8$

The solution of Eq. (7) is

\[ q_1 = q_3 = q_4 = \frac{1}{1-\gamma} R \qquad (29) \]
\[ q_2 = \frac{1}{1-\gamma} S \qquad (30) \]
\[ q_5 = q_7 = q_8 = \frac{1}{1-\gamma} T \qquad (31) \]
\[ q_6 = P + \frac{\gamma}{1-\gamma} R. \qquad (32) \]

This contradicts the definition of the game, $T > R$.

3.4. Case 4: $q_1 > q_5$, $q_2 > q_6$, $q_3 < q_7$, and $q_4 < q_8$

For this case, the strategy obtained by reinforcement learning is "Repeat" [18]. The solution of Eq. (7) is

\[ q_1 = q_3 = \frac{1}{1-\gamma} R \qquad (33) \]
\[ q_2 = q_4 = \frac{1}{1-\gamma} S \qquad (34) \]
\[ q_5 = q_7 = \frac{1}{1-\gamma} T \qquad (35) \]
\[ q_6 = q_8 = \frac{1}{1-\gamma} P. \qquad (36) \]

This contradicts the definition of the game, $T > R$.
3.5. Case 5: $q_1 > q_5$, $q_2 < q_6$, $q_3 > q_7$, and $q_4 > q_8$

The solution of Eq. (7) is

\[ q_1 = q_2 = q_4 = \frac{1}{1-\gamma} R \qquad (37) \]
\[ q_3 = \frac{1}{1-\gamma^2} S + \frac{\gamma}{1-\gamma^2} T \qquad (38) \]
\[ q_5 = q_6 = q_8 = \frac{1}{1-\gamma^2} T + \frac{\gamma}{1-\gamma^2} S \qquad (39) \]
\[ q_7 = P + \frac{\gamma}{1-\gamma} R. \qquad (40) \]

This contradicts $2R > T + S$.

3.6. Case 6: $q_1 > q_5$, $q_2 < q_6$, $q_3 > q_7$, and $q_4 < q_8$

For this case, the strategy obtained by reinforcement learning is Tit-for-Tat (TFT) [1, 19]. The solution of Eq. (7) is

\[ q_1 = q_2 = \frac{1}{1-\gamma} R \qquad (41) \]
\[ q_3 = q_4 = \frac{1}{1-\gamma^2} S + \frac{\gamma}{1-\gamma^2} T \qquad (42) \]
\[ q_5 = q_6 = \frac{1}{1-\gamma^2} T + \frac{\gamma}{1-\gamma^2} S \qquad (43) \]
\[ q_7 = q_8 = \frac{1}{1-\gamma} P. \qquad (44) \]

This solution becomes consistent with the condition of the case only when $T + S = R + P$ and $\gamma = \frac{T-R}{R-S}$.

3.7. Case 7: $q_1 > q_5$, $q_2 < q_6$, $q_3 < q_7$, and $q_4 > q_8$

For this case, the strategy obtained by reinforcement learning is Win-stay Lose-shift (WSLS) [16]. The solution of Eq. (7) is

\[ q_1 = q_4 = \frac{1}{1-\gamma} R \qquad (45) \]
\[ q_2 = q_3 = S + \gamma P + \frac{\gamma^2}{1-\gamma} R \qquad (46) \]
\[ q_5 = q_8 = T + \gamma P + \frac{\gamma^2}{1-\gamma} R \qquad (47) \]
\[ q_6 = q_7 = P + \frac{\gamma}{1-\gamma} R. \qquad (48) \]

This solution becomes consistent with the condition of the case when $T + P < 2R$ and $\gamma > \frac{T-R}{R-P}$.

3.8. Case 8: $q_1 > q_5$, $q_2 < q_6$, $q_3 < q_7$, and $q_4 < q_8$

For this case, the strategy obtained by reinforcement learning is the Grim strategy. The solution of Eq. (7) is

\[ q_1 = \frac{1}{1-\gamma} R \qquad (49) \]
\[ q_2 = q_3 = q_4 = S + \frac{\gamma}{1-\gamma} P \qquad (50) \]
\[ q_5 = T + \frac{\gamma}{1-\gamma} P \qquad (51) \]
\[ q_6 = q_7 = q_8 = \frac{1}{1-\gamma} P. \qquad (52) \]

This solution becomes consistent with the condition of the case when $\gamma > \frac{T-R}{T-P}$.

3.9. Case 9: $q_1 < q_5$, $q_2 > q_6$, $q_3 > q_7$, and $q_4 > q_8$

For this case, the strategy obtained by reinforcement learning is the anti-Grim strategy. The solution of Eq. (7) is

\[ q_1 = S + \frac{\gamma}{1-\gamma^2} R + \frac{\gamma^2}{1-\gamma^2} P \qquad (53) \]
\[ q_2 = q_3 = q_4 = \frac{1}{1-\gamma^2} R + \frac{\gamma}{1-\gamma^2} P \qquad (54) \]
\[ q_5 = \frac{1}{1-\gamma^2} P + \frac{\gamma}{1-\gamma^2} R \qquad (55) \]
\[ q_6 = q_7 = q_8 = T + \frac{\gamma}{1-\gamma^2} R + \frac{\gamma^2}{1-\gamma^2} P. \qquad (56) \]

This contradicts $\gamma \geq 0$.

3.10. Case 10: $q_1 < q_5$, $q_2 > q_6$, $q_3 > q_7$, and $q_4 < q_8$

For this case, the strategy obtained by reinforcement learning is anti-Win-stay Lose-shift (AWSLS). The solution of Eq. (7) is

\[ q_1 = q_4 = S + \gamma R + \frac{\gamma^2}{1-\gamma} P \qquad (57) \]
\[ q_2 = q_3 = R + \frac{\gamma}{1-\gamma} P \qquad (58) \]
\[ q_5 = q_8 = \frac{1}{1-\gamma} P \qquad (59) \]
\[ q_6 = q_7 = T + \gamma R + \frac{\gamma^2}{1-\gamma} P. \qquad (60) \]

This contradicts $\gamma \geq 0$.
3.11. Case 11: $q_1 < q_5$, $q_2 > q_6$, $q_3 < q_7$, and $q_4 > q_8$

For this case, the strategy obtained by reinforcement learning is anti-Tit-for-Tat (ATFT). The solution of Eq. (7) is

\[ q_1 = q_2 = \frac{1}{1-\gamma} S \qquad (61) \]
\[ q_3 = q_4 = \frac{1}{1-\gamma^2} R + \frac{\gamma}{1-\gamma^2} P \qquad (62) \]
\[ q_5 = q_6 = \frac{1}{1-\gamma^2} P + \frac{\gamma}{1-\gamma^2} R \qquad (63) \]
\[ q_7 = q_8 = \frac{1}{1-\gamma} T. \qquad (64) \]

This contradicts $\gamma \geq 0$.

3.12. Case 12: $q_1 < q_5$, $q_2 > q_6$, $q_3 < q_7$, and $q_4 < q_8$

The solution of Eq. (7) is

\[ q_1 = q_2 = q_4 = \frac{1}{1-\gamma} S \qquad (65) \]
\[ q_3 = R + \frac{\gamma}{1-\gamma} P \qquad (66) \]
\[ q_5 = q_6 = q_8 = \frac{1}{1-\gamma} P \qquad (67) \]
\[ q_7 = \frac{1}{1-\gamma} T. \qquad (68) \]

This contradicts the definition of the game, $P > S$.

3.13. Case 13: $q_1 < q_5$, $q_2 < q_6$, $q_3 > q_7$, and $q_4 > q_8$

For this case, the strategy obtained by reinforcement learning is anti-Repeat. The solution of Eq. (7) is

\[ q_1 = q_3 = \frac{1}{1-\gamma^2} S + \frac{\gamma}{1-\gamma^2} T \qquad (69) \]
\[ q_2 = q_4 = \frac{1}{1-\gamma^2} R + \frac{\gamma}{1-\gamma^2} P \qquad (70) \]
\[ q_5 = q_7 = \frac{1}{1-\gamma^2} P + \frac{\gamma}{1-\gamma^2} R \qquad (71) \]
\[ q_6 = q_8 = \frac{1}{1-\gamma^2} T + \frac{\gamma}{1-\gamma^2} S. \qquad (72) \]

This solution becomes consistent with the condition of the case only when $T + S = R + P$ and $\gamma = 1$.

3.14. Case 14: $q_1 < q_5$, $q_2 < q_6$, $q_3 > q_7$, and $q_4 < q_8$

The solution of Eq. (7) is

\[ q_1 = q_3 = q_4 = \frac{1}{1-\gamma^2} S + \frac{\gamma}{1-\gamma^2} T \qquad (73) \]
\[ q_2 = R + \frac{\gamma}{1-\gamma} P \qquad (74) \]
\[ q_5 = q_7 = q_8 = \frac{1}{1-\gamma} P \qquad (75) \]
\[ q_6 = \frac{1}{1-\gamma^2} T + \frac{\gamma}{1-\gamma^2} S. \qquad (76) \]

This solution becomes consistent with the condition of the case only when $T + S > 2P$ and $\gamma = \frac{P-S}{T-P}$.

3.15. Case 15: $q_1 < q_5$, $q_2 < q_6$, $q_3 < q_7$, and $q_4 > q_8$

The solution of Eq. (7) is

\[ q_1 = q_2 = q_3 = S + \frac{\gamma}{1-\gamma^2} P + \frac{\gamma^2}{1-\gamma^2} R \qquad (77) \]
\[ q_4 = \frac{1}{1-\gamma^2} R + \frac{\gamma}{1-\gamma^2} P \qquad (78) \]
\[ q_5 = q_6 = q_7 = \frac{1}{1-\gamma^2} P + \frac{\gamma}{1-\gamma^2} R \qquad (79) \]
\[ q_8 = T + \frac{\gamma}{1-\gamma^2} P + \frac{\gamma^2}{1-\gamma^2} R. \qquad (80) \]

This contradicts the definition of the game, $T > R$.

3.16. Case 16: $q_1 < q_5$, $q_2 < q_6$, $q_3 < q_7$, and $q_4 < q_8$

For this case, the strategy obtained by reinforcement learning is the All-D strategy. The solution of Eq. (7) is

\[ q_1 = q_2 = q_3 = q_4 = S + \frac{\gamma}{1-\gamma} P \qquad (81) \]
\[ q_5 = q_6 = q_7 = q_8 = \frac{1}{1-\gamma} P. \qquad (82) \]

This solution is always consistent with the condition of the case.

3.17. Summary

From the above subsections, we find that the symmetric solution of the Bellman optimality equation exists in finite regions of the parameter $\gamma$ only for Cases 7, 8, and 16. In other words, only WSLS, the Grim strategy, and the All-D strategy can form the symmetric equilibrium of mutual reinforcement learning. TFT does not form symmetric equilibrium. The results are summarized in Table 1, where the strategy vector of player 1 is defined by

\[ T_1(C) := \left( T_1(C|C,C),\ T_1(C|C,D),\ T_1(C|D,C),\ T_1(C|D,D) \right)^{\mathsf{T}}. \qquad (83) \]
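The case analysis above can be cross-checked numerically: a deterministic memory-one strategy forms a symmetric equilibrium of the learning process exactly when the greedy best response to its symmetric counterpart reproduces the strategy itself. The following Python sketch performs this check for all sixteen candidates; it is our illustration with our own variable names, and the payoff values and the choice $\gamma = 0.9$ are examples consistent with Section 4. At this $\gamma$, which exceeds both $(T-R)/(R-P)$ and $(T-R)/(T-P)$, the output should mark exactly WSLS, Grim, and All-D, in agreement with Table 1:

    import itertools

    A = ('C', 'D')
    STATES = [(x, y) for x in A for y in A]
    R, S, T, P = 4.0, 0.0, 6.0, 1.0      # example payoffs with T > R > P > S and 2R > T + S
    r1 = {('C', 'C'): R, ('C', 'D'): S, ('D', 'C'): T, ('D', 'D'): P}

    def greedy_reply(opp_coop, gamma, sweeps=2000):
        """Greedy policy of player 1 obtained from Eq. (7) against a fixed opponent."""
        Q = {(a, s): 0.0 for a in A for s in STATES}
        for _ in range(sweeps):
            Q = {(a1, s): sum(p * (r1[(a1, a2)]
                                   + gamma * max(Q[('C', (a1, a2))], Q[('D', (a1, a2))]))
                              for a2, p in (('C', opp_coop[s]), ('D', 1 - opp_coop[s])))
                 for a1 in A for s in STATES}
        return {s: max(A, key=lambda a: Q[(a, s)]) for s in STATES}

    gamma = 0.9
    for bits in itertools.product((1, 0), repeat=4):
        # bits = (T(C|C,C), T(C|C,D), T(C|D,C), T(C|D,D)) for player 1's own strategy
        own = dict(zip(STATES, bits))
        # the symmetric opponent plays the same strategy with the two players' roles swapped
        opp = {(s1, s2): float(own[(s2, s1)]) for (s1, s2) in STATES}
        reply = greedy_reply(opp, gamma)
        is_equilibrium = all(reply[s] == ('C' if own[s] else 'D') for s in STATES)
        print(bits, 'symmetric equilibrium' if is_equilibrium else '-')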
4. Numerical results
In this section, we check the theoretical results in the previous section by numerical simulation. We use Q-learning [13] as a method of reinforcement learning. In Q-learning, the optimal action-value function of agent $a$ against a fixed strategy of agent $-a$ is learned through the following update rule:

\[ Q_a^{(t+1)}\left( \sigma_a^{(1)}, \sigma^{(0)} \right) = Q_a^{(t)}\left( \sigma_a^{(1)}, \sigma^{(0)} \right) + \eta \left( r_a + \gamma \max_{\sigma_a^{(2)}} Q_a^{(t)}\left( \sigma_a^{(2)}, \sigma^{(1)} \right) - Q_a^{(t)}\left( \sigma_a^{(1)}, \sigma^{(0)} \right) \right), \qquad (84) \]

where $r_a$ is the reward obtained by taking action $\sigma_a^{(1)}$ when the state is $\sigma^{(0)}$, and $\sigma^{(1)}$ is the next state. The parameter $\eta$ is called the learning rate. Here, we assume that, in each step, agent $a$ chooses the action $\sigma_a^{(1)}$ by using $\epsilon$-greedy search; that is, agent $a$ chooses an action uniformly randomly among all possible actions with probability $\epsilon$, and chooses the best action with respect to the current action-value function with probability $1-\epsilon$. As before, we consider the situation that the two agents alternately learn their optimal strategies until the $Q$ values converge.

Number | $q_1 \lessgtr q_5$ | $q_2 \lessgtr q_6$ | $q_3 \lessgtr q_7$ | $q_4 \lessgtr q_8$ | strategy $T_1(C)$ | name | Equilibrium?
Case 1 | > | > | > | > | $(1,1,1,1)^{\mathsf{T}}$ | All-C | No
Case 2 | > | > | > | < | $(1,1,1,0)^{\mathsf{T}}$ | | No
Case 3 | > | > | < | > | $(1,1,0,1)^{\mathsf{T}}$ | | No
Case 4 | > | > | < | < | $(1,1,0,0)^{\mathsf{T}}$ | Repeat | No
Case 5 | > | < | > | > | $(1,0,1,1)^{\mathsf{T}}$ | | No
Case 6 | > | < | > | < | $(1,0,1,0)^{\mathsf{T}}$ | TFT | No in general
Case 7 | > | < | < | > | $(1,0,0,1)^{\mathsf{T}}$ | WSLS | Yes for $\gamma > \frac{T-R}{R-P}$
Case 8 | > | < | < | < | $(1,0,0,0)^{\mathsf{T}}$ | Grim | Yes for $\gamma > \frac{T-R}{T-P}$
Case 9 | < | > | > | > | $(0,1,1,1)^{\mathsf{T}}$ | anti-Grim | No
Case 10 | < | > | > | < | $(0,1,1,0)^{\mathsf{T}}$ | AWSLS | No
Case 11 | < | > | < | > | $(0,1,0,1)^{\mathsf{T}}$ | ATFT | No
Case 12 | < | > | < | < | $(0,1,0,0)^{\mathsf{T}}$ | | No
Case 13 | < | < | > | > | $(0,0,1,1)^{\mathsf{T}}$ | anti-Repeat | No in general
Case 14 | < | < | > | < | $(0,0,1,0)^{\mathsf{T}}$ | | No in general
Case 15 | < | < | < | > | $(0,0,0,1)^{\mathsf{T}}$ | | No
Case 16 | < | < | < | < | $(0,0,0,0)^{\mathsf{T}}$ | All-D | Yes
Table 1: Summary of the results.
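To make the procedure concrete, the following Python sketch implements one learning phase of the alternating scheme: tabular Q-learning with the update rule (84) and $\epsilon$-greedy exploration for player 1 against a fixed memory-one strategy of player 2. The sketch is our illustration, the names are ours, and the exploration rate $\epsilon = 0.1$ used here is an arbitrary illustrative choice:

    import random

    A = ('C', 'D')
    STATES = [(x, y) for x in A for y in A]
    R, S, T, P = 4.0, 0.0, 6.0, 1.0
    r1 = {('C', 'C'): R, ('C', 'D'): S, ('D', 'C'): T, ('D', 'D'): P}

    def q_learning_vs_fixed(opp_coop, gamma, eta=0.2, eps=0.1, steps=200_000, seed=0):
        """One learning phase: player 1 runs Q-learning, Eq. (84), against a fixed
        memory-one strategy of player 2 (opp_coop[state] = Prob(player 2 plays C))."""
        rng = random.Random(seed)
        Q = {(a, s): 0.0 for a in A for s in STATES}   # initial condition Q = 0
        state = ('C', 'C')                             # arbitrary initial previous joint action
        for _ in range(steps):
            # epsilon-greedy action selection for player 1
            if rng.random() < eps:
                a1 = rng.choice(A)
            else:
                a1 = max(A, key=lambda a: Q[(a, state)])
            a2 = 'C' if rng.random() < opp_coop[state] else 'D'
            nxt = (a1, a2)
            target = r1[(a1, a2)] + gamma * max(Q[('C', nxt)], Q[('D', nxt)])
            Q[(a1, state)] += eta * (target - Q[(a1, state)])   # update rule (84)
            state = nxt
        return Q

    wsls = {('C', 'C'): 1.0, ('C', 'D'): 0.0, ('D', 'C'): 0.0, ('D', 'D'): 1.0}
    Q = q_learning_vs_fixed(wsls, gamma=0.9)
    print({s: max(A, key=lambda a: Q[(a, s)]) for s in STATES})   # learned greedy strategy

In the alternating scheme of the paper, player 2 would next run the same procedure against the strategy just learned by player 1, and so on.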
We set the parameters $(R, S, T, P) = (4.0, 0.0, 6.0, 1.0)$, $\eta = 0.2$, and a fixed value of $\epsilon$. For the $Q$ values, we take the statistical average over 10 realizations. The initial condition of $Q$ is $Q\left( \sigma^{(1)}, \sigma^{(0)} \right) = 0$ for all $\sigma^{(1)}$ and $\sigma^{(0)}$.

In Figure 1, we display the time evolution of $Q$ when the strategy of player 2 is WSLS.

[Figure 1: The time evolution of $Q$ when the strategy of player 2 is WSLS. The value of $\gamma$ is $\gamma = 0.9$ (top) and $\gamma = 0.2$ (bottom).]

On the top side of Figure 1, we provide the numerical results for $\gamma = 0.9$. The theoretical values of $Q$ are also provided in Appendix A:

\[ q_1 = q_4 = \frac{1}{1-\gamma} R = 40 \qquad (85) \]
\[ q_2 = q_3 = S + \gamma P + \frac{\gamma^2}{1-\gamma} R = 33.3 \qquad (86) \]
\[ q_5 = q_8 = T + \gamma P + \frac{\gamma^2}{1-\gamma} R = 39.3 \qquad (87) \]
\[ q_6 = q_7 = P + \frac{\gamma}{1-\gamma} R = 37.0. \qquad (88) \]

We can expect that the numerical results converge to the theoretical values in the limit $t \to \infty$. We emphasize that the learned strategy of player 1 is also WSLS, which is consistent with the result in the previous section. On the bottom side of Figure 1, we provide the numerical results for $\gamma = 0.2$. The theoretical values of $Q$ are also provided in Appendix A:

\[ q_1 = q_4 = R + \frac{\gamma}{1-\gamma^2} T + \frac{\gamma^2}{1-\gamma^2} P \simeq 5.29 \qquad (89) \]
\[ q_2 = q_3 = S + \frac{\gamma}{1-\gamma^2} P + \frac{\gamma^2}{1-\gamma^2} T \simeq 0.458 \qquad (90) \]
\[ q_5 = q_8 = \frac{1}{1-\gamma^2} T + \frac{\gamma}{1-\gamma^2} P \simeq 6.46 \qquad (91) \]
\[ q_6 = q_7 = \frac{1}{1-\gamma^2} P + \frac{\gamma}{1-\gamma^2} T \simeq 2.29. \qquad (92) \]

We can expect that the numerical results also converge to the theoretical values in the limit $t \to \infty$. For this case, the learned strategy of player 1 is All-D. Therefore, we conclude that WSLS forms the equilibrium of mutual reinforcement learning for sufficiently large $\gamma$.

In Figure 2, we display the time evolution of $Q$ when the strategy of player 2 is Grim. On the top side of Figure 2, we provide the numerical results for $\gamma = 0.9$.
The theoretical values of $Q$ are also provided in Appendix A:

\[ q_1 = \frac{1}{1-\gamma} R = 40 \qquad (93) \]
\[ q_2 = q_3 = q_4 = S + \frac{\gamma}{1-\gamma} P = 9 \qquad (94) \]
\[ q_5 = T + \frac{\gamma}{1-\gamma} P = 15 \qquad (95) \]
\[ q_6 = q_7 = q_8 = \frac{1}{1-\gamma} P = 10. \qquad (96) \]

[Figure 2: The time evolution of $Q$ when the strategy of player 2 is Grim. The value of $\gamma$ is $\gamma = 0.9$ (top) and $\gamma = 0.2$ (bottom).]

We find that, although the learned strategy of player 1 is Grim, there are discrepancies between the theoretical values and the numerical results for $Q(C,(C,C))$, $Q(C,(D,C))$, $Q(D,(C,C))$, and $Q(D,(D,C))$. This is due to the property of the Grim strategy. In our simulation, player 1 (a learning agent against Grim) stochastically chooses $C$ or $D$. However, once player 1 chooses $D$, player 2 (the agent with the Grim strategy) switches to a defector who always defects. Therefore, the state $(D,C)$ occurs only once. Similarly, the state $(C,C)$ occurs only while player 1 keeps cooperating. Therefore, the number of times that the states $(C,C)$ and $(D,C)$ occur in one trial of the infinitely repeated game cannot be large enough for the $Q$ values to converge to the theoretical values. In addition, as the learning proceeds, cooperation by player 1 after the state $(C,D)$ becomes difficult to occur, which leads to the slow convergence of $Q(C,(C,D))$. On the bottom side of Figure 2, we provide the numerical results for $\gamma = 0.2$. The theoretical values of $Q$ are also provided in Appendix A:

\[ q_1 = R + \gamma T + \frac{\gamma^2}{1-\gamma} P = 5.25 \qquad (97) \]
\[ q_2 = q_3 = q_4 = S + \frac{\gamma}{1-\gamma} P = 0.25 \qquad (98) \]
\[ q_5 = T + \frac{\gamma}{1-\gamma} P = 6.25 \qquad (99) \]
\[ q_6 = q_7 = q_8 = \frac{1}{1-\gamma} P = 1.25. \qquad (100) \]

We find that the learned strategy of player 1 is Grim, although the theoretical prediction is All-D. Due to the same reason as above, there are discrepancies between the theoretical values and the numerical results for $Q(C,(C,C))$, $Q(C,(D,C))$, $Q(D,(C,C))$, and $Q(D,(D,C))$. In addition, due to the same reason as above, the convergence of $Q(C,(C,D))$ is slow. Besides these facts, our numerical results are consistent with the theoretical prediction, and we conclude that Grim can form the equilibrium of mutual reinforcement learning.
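As a quick sanity check (this evaluation is ours, not part of the original text), the equilibrium conditions of Cases 7 and 8 can be evaluated at the payoff values used in this section:

\[ \frac{T-R}{R-P} = \frac{6-4}{4-1} = \frac{2}{3} \approx 0.67, \qquad \frac{T-R}{T-P} = \frac{6-4}{6-1} = \frac{2}{5} = 0.4. \]

Hence $\gamma = 0.9$ exceeds both thresholds, so WSLS and Grim are greedy best responses to themselves, while $\gamma = 0.2$ lies below both, so the greedy response degenerates to All-D; this is consistent with the strategies learned in Figures 1 and 2.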
5. Conclusion
In this paper, we theoretically investigated the situation where both players alternately use reinforcement learning to obtain their optimal memory-one strategies in the repeated prisoner's dilemma game. We derived the symmetric solutions of the Bellman optimality equations. We found that WSLS, the Grim strategy, and the All-D strategy can form equilibrium of the mutual reinforcement learning process amongst the sixteen deterministic memory-one strategies. We checked this result by numerical simulation using Q-learning. The following problems should be studied in the future: (i) whether asymmetric equilibrium points exist or not, (ii) analysis of non-deterministic strategies, and (iii) extension of our analysis to memory-two strategies. In addition, elucidating the relation between equilibrium in mutual reinforcement learning and equilibrium in evolutionary game theory [15] is a significant problem.

Acknowledgement
This study was supported by JSPS KAKENHI Grant Number JP20K19884.
Appendix A. Optimal strategy against fixed strategies
In this appendix, we provide theoretical results on the deterministic optimal strategy of a learning agent against the other agent with a fixed strategy. We regard agent 1 as the learning agent and agent 2 as the agent with a fixed strategy. The Bellman optimality equation of agent 1 is Eq. (9) as before. We consider the situations where agent 2 uses the TFT strategy, the WSLS strategy, or the Grim strategy. We introduce the notation (22) as before.
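As an illustration of how the solutions below are obtained (this worked example is ours and is not part of the original appendix), consider the candidate greedy policy All-D against a TFT opponent. Substituting this candidate into Eq. (9), the defection values close among themselves,

\[ q_7 = q_8 = P + \gamma q_8 \;\Rightarrow\; q_7 = q_8 = \frac{1}{1-\gamma} P, \qquad q_5 = q_6 = T + \gamma q_7 = T + \frac{\gamma}{1-\gamma} P, \]

and the cooperation values follow as $q_1 = q_2 = R + \gamma q_5$ and $q_3 = q_4 = S + \gamma q_6$. The candidate is self-consistent when $q_i < q_{i+4}$ for $i = 1, \dots, 4$, which yields the ranges of $\gamma$ stated in Appendix A.1.3 and Appendix A.1.6.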
Appendix A.1. Optimal strategy against TFT
Here we consider the situation that the strategy of agent 2 is TFT:

\[ T_2(C) := \left( T_2(C|C,C),\ T_2(C|C,D),\ T_2(C|D,C),\ T_2(C|D,D) \right)^{\mathsf{T}} = \left( 1, 1, 0, 0 \right)^{\mathsf{T}}, \qquad (A.1) \]

that is, agent 2 copies the previous action of agent 1. Then, the solution of Eq. (9) is as follows.

Appendix A.1.1. The case $T + S < R + P$ and $\gamma > \frac{P-S}{R-S}$

For this case, the solution is

\[ q_1 = q_2 = \frac{1}{1-\gamma} R \qquad (A.2) \]
\[ q_3 = q_4 = S + \frac{\gamma}{1-\gamma} R \qquad (A.3) \]
\[ q_5 = q_6 = T + \gamma S + \frac{\gamma^2}{1-\gamma} R \qquad (A.4) \]
\[ q_7 = q_8 = P + \gamma S + \frac{\gamma^2}{1-\gamma} R \qquad (A.5) \]

and because $q_1 > q_5$, $q_2 > q_6$, $q_3 > q_7$, and $q_4 > q_8$, the optimal strategy is All-C.

Appendix A.1.2. The case $T + S < R + P$ and $\frac{T-R}{T-P} < \gamma < \frac{P-S}{R-S}$

For this case, the solution is

\[ q_1 = q_2 = \frac{1}{1-\gamma} R \qquad (A.6) \]
\[ q_3 = q_4 = S + \frac{\gamma}{1-\gamma} R \qquad (A.7) \]
\[ q_5 = q_6 = T + \frac{\gamma}{1-\gamma} P \qquad (A.8) \]
\[ q_7 = q_8 = \frac{1}{1-\gamma} P \qquad (A.9) \]

and because $q_1 > q_5$, $q_2 > q_6$, $q_3 < q_7$, and $q_4 < q_8$, the optimal strategy is Repeat.

Appendix A.1.3. The case $T + S < R + P$ and $\gamma < \frac{T-R}{T-P}$

For this case, the solution is

\[ q_1 = q_2 = R + \gamma T + \frac{\gamma^2}{1-\gamma} P \qquad (A.10) \]
\[ q_3 = q_4 = S + \gamma T + \frac{\gamma^2}{1-\gamma} P \qquad (A.11) \]
\[ q_5 = q_6 = T + \frac{\gamma}{1-\gamma} P \qquad (A.12) \]
\[ q_7 = q_8 = \frac{1}{1-\gamma} P \qquad (A.13) \]

and because $q_1 < q_5$, $q_2 < q_6$, $q_3 < q_7$, and $q_4 < q_8$, the optimal strategy is All-D.

Appendix A.1.4. The case $T + S > R + P$ and $\gamma > \frac{T-R}{R-S}$

For this case, the solution is

\[ q_1 = q_2 = \frac{1}{1-\gamma} R \qquad (A.14) \]
\[ q_3 = q_4 = S + \frac{\gamma}{1-\gamma} R \qquad (A.15) \]
\[ q_5 = q_6 = T + \gamma S + \frac{\gamma^2}{1-\gamma} R \qquad (A.16) \]
\[ q_7 = q_8 = P + \gamma S + \frac{\gamma^2}{1-\gamma} R \qquad (A.17) \]

and because $q_1 > q_5$, $q_2 > q_6$, $q_3 > q_7$, and $q_4 > q_8$, the optimal strategy is All-C.

Appendix A.1.5. The case $T + S > R + P$ and $\frac{P-S}{T-P} < \gamma < \frac{T-R}{R-S}$

For this case, the solution is

\[ q_1 = q_2 = R + \frac{\gamma}{1-\gamma^2} T + \frac{\gamma^2}{1-\gamma^2} S \qquad (A.18) \]
\[ q_3 = q_4 = \frac{1}{1-\gamma^2} S + \frac{\gamma}{1-\gamma^2} T \qquad (A.19) \]
\[ q_5 = q_6 = \frac{1}{1-\gamma^2} T + \frac{\gamma}{1-\gamma^2} S \qquad (A.20) \]
\[ q_7 = q_8 = P + \frac{\gamma}{1-\gamma^2} S + \frac{\gamma^2}{1-\gamma^2} T \qquad (A.21) \]

and because $q_1 < q_5$, $q_2 < q_6$, $q_3 > q_7$, and $q_4 > q_8$, the optimal strategy is anti-Repeat.

Appendix A.1.6. The case $T + S > R + P$ and $\gamma < \frac{P-S}{T-P}$

For this case, the solution is

\[ q_1 = q_2 = R + \gamma T + \frac{\gamma^2}{1-\gamma} P \qquad (A.22) \]
\[ q_3 = q_4 = S + \gamma T + \frac{\gamma^2}{1-\gamma} P \qquad (A.23) \]
\[ q_5 = q_6 = T + \frac{\gamma}{1-\gamma} P \qquad (A.24) \]
\[ q_7 = q_8 = \frac{1}{1-\gamma} P \qquad (A.25) \]

and because $q_1 < q_5$, $q_2 < q_6$, $q_3 < q_7$, and $q_4 < q_8$, the optimal strategy is All-D.

Appendix A.2. Optimal strategy against WSLS

Here we consider the situation that the strategy of agent 2 is WSLS:

\[ T_2(C) = \left( 1, 0, 0, 1 \right)^{\mathsf{T}}, \qquad (A.26) \]

that is, agent 2 cooperates when the previous joint action is $(C,C)$ or $(D,D)$. Then, the solution of Eq. (9) is as follows.

Appendix A.2.1. The case $T + P < 2R$ and $\gamma > \frac{T-R}{R-P}$

For this case, the solution is

\[ q_1 = q_4 = \frac{1}{1-\gamma} R \qquad (A.27) \]
\[ q_2 = q_3 = S + \gamma P + \frac{\gamma^2}{1-\gamma} R \qquad (A.28) \]
\[ q_5 = q_8 = T + \gamma P + \frac{\gamma^2}{1-\gamma} R \qquad (A.29) \]
\[ q_6 = q_7 = P + \frac{\gamma}{1-\gamma} R \qquad (A.30) \]

and because $q_1 > q_5$, $q_2 < q_6$, $q_3 < q_7$, and $q_4 > q_8$, the optimal strategy is WSLS.

Appendix A.2.2. The case $T + P < 2R$ and $\gamma < \frac{T-R}{R-P}$

For this case, the solution is

\[ q_1 = q_4 = R + \frac{\gamma}{1-\gamma^2} T + \frac{\gamma^2}{1-\gamma^2} P \qquad (A.31) \]
\[ q_2 = q_3 = S + \frac{\gamma}{1-\gamma^2} P + \frac{\gamma^2}{1-\gamma^2} T \qquad (A.32) \]
\[ q_5 = q_8 = \frac{1}{1-\gamma^2} T + \frac{\gamma}{1-\gamma^2} P \qquad (A.33) \]
\[ q_6 = q_7 = \frac{1}{1-\gamma^2} P + \frac{\gamma}{1-\gamma^2} T \qquad (A.34) \]

and because $q_1 < q_5$, $q_2 < q_6$, $q_3 < q_7$, and $q_4 < q_8$, the optimal strategy is All-D.

Appendix A.2.3. The case $T + P > 2R$

For this case, the solution is

\[ q_1 = q_4 = R + \frac{\gamma}{1-\gamma^2} T + \frac{\gamma^2}{1-\gamma^2} P \qquad (A.35) \]
\[ q_2 = q_3 = S + \frac{\gamma}{1-\gamma^2} P + \frac{\gamma^2}{1-\gamma^2} T \qquad (A.36) \]
\[ q_5 = q_8 = \frac{1}{1-\gamma^2} T + \frac{\gamma}{1-\gamma^2} P \qquad (A.37) \]
\[ q_6 = q_7 = \frac{1}{1-\gamma^2} P + \frac{\gamma}{1-\gamma^2} T \qquad (A.38) \]

and because $q_1 < q_5$, $q_2 < q_6$, $q_3 < q_7$, and $q_4 < q_8$, the optimal strategy is All-D.

Appendix A.3. Optimal strategy against Grim