Payoff Information and Learning in Signaling Games

Abstract

We add the assumption that players know their opponents' payoff functions and rationality to a model of non-equilibrium learning in signaling games. Agents are born into player roles and play against random opponents every period. Inexperienced agents are uncertain about the prevailing distribution of opponents' play, but believe that opponents never choose conditionally dominated strategies. Agents engage in active learning and update beliefs based on personal observations. Payoff information can refine or expand learning predictions, since patient young senders' experimentation incentives depend on which receiver responses they deem plausible. We show that with payoff knowledge, the limiting set of long-run learning outcomes is bounded above by rationality-compatible equilibria (RCE), and bounded below by uniform RCE. RCE refine the Intuitive Criterion (Cho and Kreps, 1987) and include all divine equilibria (Banks and Sobel, 1987). Uniform RCE sometimes but not always exists, and implies universally divine equilibrium.


Payoff Information and Learning in Signaling Games ∗

Drew Fudenberg † Kevin He ‡

First version: August 31, 2017
This version: July 11, 2019

Abstract

We show how to add the assumption that players know their opponents’ payoff functions to the theory of learning in games, and use it to derive restrictions on signaling-game play in the spirit of divine equilibrium. In our learning model, agents are born into player roles and play the game against a random opponent each period. Inexperienced agents are uncertain about the prevailing distribution of opponents’ play, and update their beliefs based on their observations. Long-lived and patient senders experiment with every signal that they think might yield an improvement over their myopically best play. We show that divine equilibrium (Banks and Sobel, 1987) is nested between “rationality-compatible” equilibrium, which corresponds to an upper bound on the set of possible learning outcomes, and “uniform rationality-compatible” equilibrium, which provides a lower bound.

Keywords: learning, equilibrium refinements, bandit problems, payoff information, signaling games.

JEL classification codes: C72, C73, D83

∗ We thank Laura Doval, Glenn Ellison, Lorens Imhof, Yuichiro Kamada, Robert Kleinberg, David K. Levine, Kevin K. Li, Eric Maskin, Dilip Mookherjee, Harry Pei, Matthew Rabin, Bill Sandholm, Lones Smith, Joel Sobel, Philipp Strack, Bruno Strulovici, Tomasz Strzalecki, Jean Tirole, Juuso Toikka, and our seminar participants for helpful comments and conversations, and National Science Foundation grant SES 1643517 for financial support.

† Department of Economics, MIT. Email: [email protected]

‡ California Institute of Technology and University of Pennsylvania. Email: [email protected]

Introduction

Signaling games typically have many perfect Bayesian equilibria, because Bayes’ rule does not pin down the receiver’s off-path beliefs about the sender’s type. Different off-path beliefs for the receiver can justify different off-path receiver behaviors, which in turn sustain equilibria with a variety of on-path outcomes. For this reason, applied work using signaling games typically invokes some equilibrium refinement to obtain a smaller and (hopefully) more accurate subset of predictions.

However, most refinements impose restrictions on the equilibrium beliefs without any reference to the process that might lead to equilibrium. Our earlier paper Fudenberg and He (2018) provided a learning-theoretic foundation for the compatibility criterion (CC), based on the idea that “out of equilibrium” signals are not zero-probability events during learning, but instead arise as rare but positive-probability experiments by inexperienced patient senders trying to learn how the receivers respond to different signals. Unlike the classic refinement literature, we did not assume that agents know their opponents’ payoff functions. This paper discusses how ex-ante payoff information influences learning dynamics and learning outcomes, showing that the additional equilibrium restrictions that follow from this prior knowledge nest the divine equilibrium of Banks and Sobel (1987). In addition, we provide the first general sufficient condition for an outcome to emerge as the result of patient Bayesian learning in settings where the relative probabilities of different off-path experiments matter.

In our learning model, agents repeatedly play the same signaling game against random opponents each period. Agents are Bayesians who believe they face a fixed but unknown distribution of the opposing players’ strategies.
Importantly, the senders hold independent beliefs about how receivers respond to different signals, so they cannot use the response to one signal to infer anything about the distribution of responses to a different signal. This introduces an exploration-exploitation trade-off, as each sender only observes the response to the one signal she sends each period. Long-lived and patient senders will therefore experiment with every signal that they think might yield a substantially higher payoff than the signal that is myopically best. The key to our results is that different types of senders have different incentives for experimenting with various signals, so that some of the sender types will send certain signals more often than other types do. Consequently, even though long-lived senders only experiment for a vanishingly small fraction of their lifetimes, the play of the long-lived receivers will be a best response to beliefs about the senders’ types that reflect this difference in experimentation probabilities.

Of course, the senders’ experimentation incentives depend on their prior beliefs about which receiver responses are plausible after each signal. In Fudenberg and He (2018), we assumed that learners are ignorant of others’ utility functions, and that the senders’ beliefs assign positive probability to the receivers playing actions that are not best responses to any belief about the sender’s type. In this paper, we instead assume that the players’ prior beliefs encode knowledge of their opponents’ payoff functions, so in particular the senders all assign zero probability to the event that the receivers use conditionally dominated strategies. Inexperienced senders with full-support beliefs about the receivers’ play may experiment with a signal in the hopes that the receivers respond with a certain favorable action, not knowing that this action will never be played as it is not a best response to any receiver belief.
With payoff information, even very patient senders will never undertake such experiments. Conversely, receivers know that no sender type would ever want to play a signal that does not best respond to any receiver strategy, because no possible response by the receiver would make playing that signal worthwhile. For this reason, the receivers’ beliefs after each signal assign probability zero to the types for whom that signal is dominated.

Priors with payoff information lead to additional restrictions on different types’ comparative experimentation frequencies, which can generate stronger restrictions on the receiver’s beliefs in some games. For instance, Example 2 considers a signaling game where two types of senders choose between a safe option Out that yields a known payoff, and a risky option In whose payoff depends on the receiver’s response. The receiver has three responses to In: Up, which is optimal against the strong sender; Down, which is optimal against the weak sender; and X, which is never optimal. We show that when priors encode payoff information, the strong types experiment more with In than the weak types do. But this comparison can be reversed when the senders do not know the receivers’ payoff functions, since the weak types like the X response more than the strong types do. In this game, the new refinement concept we propose based on payoff knowledge rules out a sequential-equilibrium outcome that passes the CC.

Footnote: This example is a simplified variant of Cho and Kreps (1987)’s beer-quiche game, with an extra conditionally dominated response for the receiver.

In some other games, payoff information expands the set of long-run learning outcomes for patient and long-lived learners. Example 3 shows a signaling game where no type with payoff information ever experiments with a certain signal, so the receivers’ beliefs and behavior after this signal are arbitrarily determined from their prior beliefs. On the other hand, when senders are ignorant of the receivers’ payoff functions, one sender type will experiment much more frequently with this signal than the other type, leading to a refinement of the receivers’ off-path beliefs after the signal.

In general, for learners starting with these priors with restricted supports, Theorem 1 shows that every patient learning outcome is consistent with “rational compatibility,” while Theorem 2 shows that every equilibrium satisfying a uniform version of rational compatibility and some strictness assumptions can arise as a patient learning outcome. As we show in Section 3, these belief restrictions resemble those imposed by divine equilibrium (Banks and Sobel, 1987): every divine equilibrium is also consistent with rational compatibility, and every equilibrium satisfying the uniform version of rational compatibility is universally divine.

This paper is most closely related to the work of Fudenberg and Levine (1993), Fudenberg and Levine (2006), and Fudenberg and He (2018) on patient learning by Bayesian agents who believe they face a steady-state distribution of play. Except for the support of the agents’ priors, our learning model is exactly the same as that of Fudenberg and He (2018), and the proof of Theorem […] patiently stable, which means that it is the limit of play in a society of Bayesian agents as these agents become patient and long lived, for some non-doctrinaire prior beliefs. The proof of this sufficient condition for patient stability constructs a suitable prior and analyzes the corresponding patiently stable profiles. The only other constructive sufficient condition for strategy profiles to be patiently stable is Theorem 5.5 of Fudenberg and Levine (2006), which only applies to a subclass of perfect-information games. In such games the relative probabilities of various off-path actions do not matter, because each off-path experiment is perfectly revealed when it occurs. Indeed, the central lemma leading to Theorem 2 constructs a prior belief to ensure that the receivers correctly learn the relative frequencies with which different types undertake various off-path experiments. This lemma deals with an issue specific to signaling games, and is not implied by any result in Fudenberg and Levine (2006).

Our paper is also related to other models of Bayesian non-equilibrium learning, such as Kalai and Lehrer (1993) and Esponda and Pouzo (2016), and to the equilibrium concepts of the Intuitive Criterion (Cho and Kreps, 1987) and divine equilibrium (Banks and Sobel, 1987). One other contribution of this work relative to Fudenberg and He (2018) is that we compare our learning-based equilibrium refinements with these equilibrium refinements, both of which implicitly assume that players are certain of the payoff functions of their opponents.
Footnote: “Constructive,” as opposed to proofs that rule out all but one equilibrium using necessary conditions and then appeal to an existence theorem for patiently stable steady states. Constructive sufficient conditions allow us to characterize learning outcomes more precisely in games where multiple equilibria satisfy the necessary conditions, such as Example 1.

Two Equilibrium Refinements for Signaling Games

A signaling game has two players, a sender (“she,” player 1) and a receiver (“he,” player 2). At the start of the game, the sender learns her type $\theta \in \Theta$, but the receiver only knows the sender’s type distribution $\lambda \in \Delta(\Theta)$. Next, the sender chooses a signal $s \in S$. The receiver observes $s$ and chooses an action $a \in A$ in response. We assume that $\Theta$, $S$, $A$ are finite and that $\lambda(\theta) > 0$ for all $\theta$.

The players’ payoffs depend on the triple $(\theta, s, a)$. Let $u_1 : \Theta \times S \times A \to \mathbb{R}$ and $u_2 : \Theta \times S \times A \to \mathbb{R}$ denote the utility functions of the sender and the receiver, respectively.

For $P \subseteq \Delta(\Theta)$, we define

$$\mathrm{BR}(P, s) := \bigcup_{p \in P} \left( \arg\max_{a \in A} \mathbb{E}_{\theta \sim p}\left[ u_2(\theta, s, a) \right] \right)$$

as the set of best responses to $s$ supported by some belief in $P$. Letting $P = \Delta(\Theta)$, the set $A^{BR}_s := \mathrm{BR}(\Delta(\Theta), s) \subseteq A$ contains the receiver actions that best respond to some belief about the sender’s type after $s$. We say that actions in $A^{BR}_s$ are conditionally undominated after signal $s$, and that actions in $A \setminus A^{BR}_s$ are conditionally dominated after signal $s$. We denote by $\Pi_2^{\bullet} := \times_{s \in S} \Delta(A^{BR}_s)$ the rational receiver strategies; these are the strategies that assign probability 0 to conditionally dominated actions. The rational receiver strategies form a subset of $\Pi_2 := \times_{s \in S} \Delta(A)$, the set of all receiver strategies. A sender who knows the receiver’s payoff function expects the receiver to choose a strategy in $\Pi_2^{\bullet}$.

A sender strategy $\pi_1 = (\pi_1(\cdot \mid \theta))_{\theta \in \Theta} \in \Pi_1$ specifies a distribution on $S$ for each type, $\pi_1(\cdot \mid \theta) \in \Delta(S)$.
For a given $\pi_1$, signal $s$ is off the path of play if $\pi_1(s \mid \theta) = 0$ for all $\theta$. Let

$$S_\theta := \bigcup_{\pi_2 \in \Pi_2} \left( \arg\max_{s \in S} u_1(\theta, s, \pi_2(\cdot \mid s)) \right)$$

be the set of signals that best respond to some (not necessarily rational) receiver strategy for type $\theta$. Signals in $S \setminus S_\theta$ are dominated for type $\theta$, and $\Pi_1^{\bullet} := \times_{\theta} \Delta(S_\theta)$ denotes the rational sender strategies where no type ever sends a dominated signal. We also write $\Theta_s$ for the types $\theta$ for whom $s \in S$ is not dominated. A receiver who knows the sender’s payoff function expects the sender to choose a strategy in $\Pi_1^{\bullet}$ and only expects types in $\Theta_s$ to play signal $s$.

Footnote: The notation $\Delta(X)$ means the set of all probability distributions on $X$.

Footnote: Throughout we adopt the terminology “strategies” to mean behavior strategies, not mixed strategies.

We now introduce rationality-compatible equilibrium (RCE) and uniform rationality-compatible equilibrium (uRCE), two refinements of Nash equilibrium in signaling games.

In Section 4, we develop a steady-state learning model where populations of senders and receivers, initially uncertain as to the aggregate play of the opponent population, undergo random anonymous matching each period to play the signaling game. We study the steady states when agents are patient and long lived, which we term “patiently stable.” Under some strictness assumptions, we show that only RCE can be patiently stable (Theorem 1) and that every uRCE is path-equivalent to a patiently stable profile (Theorem 2). Thus we provide a learning foundation for these solution concepts.

Our learning foundation will assume that agents know other agents’ utility functions and know that other agents are rational in the sense of playing strategies that maximize the corresponding expected utilities.
We will not, however, iteratively assume higher orders of payoff knowledge and rationality, so that we model “rationality” as opposed to “rationalizability.”

Footnote: It is straightforward to extend our results to priors that reflect higher-order knowledge of the rationality and payoff functions of the other player. The resulting equilibrium refinement always exists, and like RCE is implied by universal divinity. We do not include it here both because we are unaware of any interesting examples where the additional power has bite, and because we are skeptical about the hypothesis of iterated rationality.

In the learning model, this implies senders’ uncertainty about receivers’ play is always supported on $\Pi_2^{\bullet}$ instead of $\Pi_2$, and similarly receivers’ uncertainty about senders’ play is supported on $\Pi_1^{\bullet}$ instead of $\Pi_1$. In Section 2.3, we discuss heuristically how our solution concepts capture some of the ways in which payoff information affects learning outcomes. This discussion will later be formalized in the context of the learning model we develop in Section 4.

Definition 1.

Signal $s$ is more rationally-compatible with $\theta'$ than $\theta''$, written as $\theta' \succsim_s \theta''$, if for every $\pi_2 \in \Pi_2^{\bullet}$ such that

$$u_1(\theta'', s, \pi_2(\cdot \mid s)) \ge \max_{s' \neq s} u_1(\theta'', s', \pi_2(\cdot \mid s')),$$

we have

$$u_1(\theta', s, \pi_2(\cdot \mid s)) > \max_{s' \neq s} u_1(\theta', s', \pi_2(\cdot \mid s')).$$

In words, $\theta' \succsim_s \theta''$ means whenever $s$ is a weak best response for $\theta''$ against some rational receiver behavior strategy $\pi_2$, it is a strict best response for $\theta'$ against $\pi_2$.

The next proposition shows that $\succsim_s$ is transitive and “almost” asymmetric. A signal $s$ is rationally strictly dominant for $\theta$ if it is a strict best response against any rational receiver strategy, $\pi_2 \in \Pi_2^{\bullet}$. A signal $s$ is rationally strictly dominated for $\theta$ if it is not a weak best response against any rational receiver strategy.

Proposition 1.

We have:
1. $\succsim_s$ is transitive.
2. Except when $s$ is either rationally strictly dominant for both $\theta'$ and $\theta''$ or rationally strictly dominated for both $\theta'$ and $\theta''$, $\theta' \succsim_s \theta''$ implies $\theta'' \not\succsim_s \theta'$.

The Appendix provides proofs for all of our results except where otherwise noted.

We require two auxiliary definitions before defining RCE.

Definition 2.

For any two types $\theta'$, $\theta''$, let $P_{\theta' \triangleright \theta''}$ be the set of beliefs where the odds ratio of $\theta'$ to $\theta''$ is at least their prior odds ratio, that is

$$P_{\theta' \triangleright \theta''} := \left\{ p \in \Delta(\Theta) : \frac{p(\theta'')}{p(\theta')} \le \frac{\lambda(\theta'')}{\lambda(\theta')} \right\}. \tag{1}$$

Note that if $\pi_1(s \mid \theta') \ge \pi_1(s \mid \theta'')$, $\pi_1(s \mid \theta') > 0$, and the receiver updates beliefs using $\pi_1$, then the receiver’s posterior belief about the sender’s type after observing $s$ falls in the set $P_{\theta' \triangleright \theta''}$. In particular, in any Bayesian Nash equilibrium, the receiver’s on-path belief falls in $P_{\theta' \triangleright \theta''}$ after any on-path signal $s$ with $\theta' \succsim_s \theta''$.

We now introduce some additional definitions to let us investigate the implications of the agents’ knowledge of their opponent’s payoff function. For a strategy profile $\pi^*$, let $\mathbb{E}_{\pi^*}[u_1 \mid \theta]$ denote type $\theta$’s expected payoff under $\pi^*$.

Definition 3.

For any strategy profile $\pi^*$, let

$$\widetilde{J}(s, \pi^*) := \left\{ \theta \in \Theta : \max_{a \in A^{BR}_s} u_1(\theta, s, a) \ge \mathbb{E}_{\pi^*}[u_1 \mid \theta] \right\}.$$

This is the set of types for which some best response to signal $s$ is at least as good as their payoff under $\pi^*$. For all other types, the signal $s$ is equilibrium dominated in the sense of Cho and Kreps (1987).

Definition 4.

The set of rationality-compatible beliefs for the receiver at strategy profile $\pi^*$, $\left( \widetilde{P}(s, \pi^*) \right)_s$, is defined as follows:

$$\widetilde{P}(s, \pi^*) := \Delta(\widetilde{J}(s, \pi^*)) \cap \left( \bigcap_{(\theta', \theta'') \text{ s.t. } \theta' \succsim_s \theta''} P_{\theta' \triangleright \theta''} \right) \quad \text{if } \widetilde{J}(s, \pi^*) \neq \emptyset,$$

$$\widetilde{P}(s, \pi^*) := \Delta(\Theta_s) \quad \text{if } \widetilde{J}(s, \pi^*) = \emptyset.$$

Footnote (to equation (1)): With the convention $\frac{0}{0} := 0$.

The main idea behind the rationality-compatible beliefs is that the receiver’s posterior likelihood ratio for types $\theta'$ and $\theta''$ dominates the prior likelihood ratio whenever $\theta' \succsim_s \theta''$. A second feature involves equilibrium dominance: $\widetilde{P}$ assigns probability 0 to equilibrium-dominated types; this is similar to the belief restriction of the Intuitive Criterion. Note that this definition imposes no belief restrictions based on $\theta' \succsim_s \theta''$ when $s$ is equilibrium dominated for every type. As we illustrate in Example 3, the receiver need not learn the rational compatibility relation when equilibrium dominance leads to steady states where no type ever experiments with a certain signal.

Definition 5.

Strategy profile $\pi^*$ is a rationality-compatible equilibrium (RCE) if it is a Nash equilibrium and $\pi_2^*(\cdot \mid s) \in \Delta(\mathrm{BR}(\widetilde{P}(s, \pi^*), s))$ for every $s$.

RCE requires that the receiver only plays best responses to rationality-compatible beliefs after each signal. This solution concept allows for the possibility that after off-path signals the receiver’s strategy $\pi_2^*(\cdot \mid s)$ may not correspond to a single belief about the sender’s type.

Theorem 1 shows that RCE is a necessary condition for a strategy profile where receivers have strict preferences after each on-path signal to be patiently stable. Intuitively, this result holds because the optimal experimentation behavior of the senders respects the compatibility order, and because, since players eventually learn the equilibrium path, types will not experiment much with signals that are equilibrium dominated. As we show in Section 3, RCE rules out the implausible equilibria in a number of games, but is weaker than some past signaling game refinements in the literature. However, RCE is only a necessary condition for patient stability, which leaves open the question of whether patient learning has additional implications. For this reason, we now define uRCE, a subset of RCE (up to path-equivalence). As we show below, uRCE is a sufficient condition for patient stability.

Definition 6.

The set of uniformly rationality-compatible beliefs for the receiver is $\left( \widehat{P}(s) \right)_s$ where

$$\widehat{P}(s) := \Delta(\Theta_s) \cap \left( \bigcap_{(\theta', \theta'') \text{ s.t. } \theta' \succsim_s \theta''} P_{\theta' \triangleright \theta''} \right).$$

Note that $\left( \widehat{P}(s) \right)_s$ makes no reference to a particular strategy profile, unlike $\left( \widetilde{P}(s, \pi^*) \right)_s$. Since $\Delta(\Theta_s)$ contains types for whom $s$ is undominated and $\widetilde{J}(s, \pi^*)$ contains types for whom $s$ is equilibrium-undominated (relative to the profile $\pi^*$), we have $\widetilde{P}(s, \pi^*) \subseteq \widehat{P}(s)$ whenever $\widetilde{J}(s, \pi^*) \neq \emptyset$.

Definition 7.

A Nash equilibrium strategy profile $\pi^*$ is called a uniform rationality-compatible equilibrium (uRCE) if for all $\theta$, all off-path signals $s$ and all $a \in \mathrm{BR}(\widehat{P}(s), s)$, we have $\mathbb{E}_{\pi^*}[u_1 \mid \theta] \ge u_1(\theta, s, a)$.

The “uniformity” in uniform RCE comes from the requirement that every best response to every belief in $\widehat{P}(s)$ deters every type from deviating to the off-path $s$. By contrast, an RCE is a Nash equilibrium where some best response to $\widetilde{P}(s, \pi^*)$ deters every type from deviating to $s$.

Proposition 2.

Every uRCE is path-equivalent to an RCE.
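The objects defined in this section can be computed directly for small games. The following is a minimal sketch, not the authors' procedure: it approximates $\Delta(\Theta)$ for two types by a finite grid, and computes the conditionally undominated actions and the best responses to beliefs in $P_{\theta_1 \triangleright \theta_2}$. The function names, the grid, and the type labels are assumptions made for illustration; the receiver payoffs and the prior are those of the off-path signal in Example 4 later in the paper.

```python
from fractions import Fraction

# Receiver payoffs u2[type][action] after one fixed signal.
# Illustrative numbers: the off-path signal of Example 4 later in the paper.
U2 = {
    "theta1": {"a1": 1, "a2": 0},
    "theta2": {"a1": 0, "a2": 1},
}
PRIOR = {"theta1": Fraction(3, 4), "theta2": Fraction(1, 4)}


def belief_grid(steps=100):
    """Finite grid standing in for Delta(Theta) with two types (an approximation)."""
    for k in range(steps + 1):
        p1 = Fraction(k, steps)
        yield {"theta1": p1, "theta2": 1 - p1}


def in_P(p, hi, lo, prior):
    """Membership in P_{hi |> lo}: p(lo)/p(hi) <= prior(lo)/prior(hi).

    The inequality is cross-multiplied so that p(hi) = 0 needs no special case.
    """
    return p[lo] * prior[hi] <= prior[lo] * p[hi]


def best_responses(u2, beliefs):
    """BR(P, s): actions that are optimal against at least one belief in `beliefs`."""
    actions = list(next(iter(u2.values())))
    found = set()
    for p in beliefs:
        value = {a: sum(p[t] * u2[t][a] for t in u2) for a in actions}
        best = max(value.values())
        found |= {a for a in actions if value[a] == best}
    return found


all_beliefs = list(belief_grid())
restricted = [p for p in all_beliefs if in_P(p, "theta1", "theta2", PRIOR)]

undominated = best_responses(U2, all_beliefs)  # grid version of A_s^BR
compatible = best_responses(U2, restricted)    # BR(P_{theta1 |> theta2}, s)
print(sorted(undominated), sorted(compatible))
```

Exact rational arithmetic (`Fraction`) is used so that ties between actions are detected without floating-point error. With more than two types the grid would have to range over a higher-dimensional simplex, which this sketch does not attempt.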

The following example illustrates that uRCE is a strict subset of RCE in some games.

Example 1.

Suppose a worker has either high ability ($\theta_H$) or low ability ($\theta_L$). She chooses between three levels of higher education: None ($N$), College ($C$), or Ph.D. ($D$). An employer observes the worker’s education level and pays a wage, $a \in \{\text{low}, \text{med}, \text{high}\}$. The worker’s utility function is separable between wage and (ability, education) pair, with $u_1(\theta, s, a) = z(a) + v(\theta, s)$ where $z(\text{low}) = 0$, $z(\text{med}) = 6$, $z(\text{high}) = 9$ and $v(\theta_H, N) = 0$, $v(\theta_L, N) = 0$, $v(\theta_H, C) = 2$, $v(\theta_L, C) = 1$, $v(\theta_H, D) = -2$, $v(\theta_L, D) = -4$. (With this payoff function, going to college has a consumption value while getting a Ph.D. is costly.) The employer’s payoffs reflect a desire to pay a wage corresponding to the worker’s ability and increased productivity with education, given in the tables below.

N            low      med      high
$\theta_H$   […]      […]      […]
$\theta_L$   […]      […]      […]

C            low      med      high
$\theta_H$   […]      […]      […]
$\theta_L$   […]      […]      […]

D            low      med      high
$\theta_H$   -2, 0    4, 2     7, 3
$\theta_L$   -4, 3    2, 2     5, 0

No education level is dominated for either type and no wage is conditionally dominated after any signal. Since $v(\theta_H, \cdot) - v(\theta_L, \cdot)$ is maximized at $D$, it is simple to verify that $\theta_H \succsim_D \theta_L$. Similarly, $\theta_L \succsim_N \theta_H$. There is no compatibility relation at signal $C$.

When the prior is $\lambda(\theta_H) = 0.5$, the strategy profile where the employer always pays a medium wage and both types of worker choose $C$ is a uRCE. This is because $\widehat{P}(N)$ contains only those beliefs with $p(\theta_H) \le 0.5$, so $\mathrm{BR}(\widehat{P}(N), N) = \{\text{low}, \text{med}\}$. Both of these wages deter every type from deviating to $N$. At the same time, no type wants to deviate to $D$, even if she gets paid the best wage.

On the other hand, the equilibrium $\pi^*$ where the employer pays low wages for $N$ and $C$, a medium wage for $D$, and both types choose $D$ is an RCE but not a uRCE. The belief that puts probability 1 on the worker being $\theta_L$ belongs to $\widetilde{P}(N, \pi^*)$ and $\widetilde{P}(C, \pi^*)$ and induces the employer to choose a low wage. However, a medium wage is a best response to $\lambda \in \widehat{P}(N)$, and a medium wage would tempt type $\theta_L$ to deviate to $N$.

In the learning model of Fudenberg and He (2018), agents do not know others’ utility functions and have full-support prior beliefs about others’ play. That paper’s compatibility criterion (CC) is based on a family of binary relations on types (one for each signal $s$) that are less complete than the rational compatibility relations, because the condition that “whenever $s$ is a weak best response for $\theta''$, it is also a strict best response for $\theta'$” is required to hold for all $\pi_2 \in \Pi_2$ instead of only for $\pi_2 \in \Pi_2^{\bullet}$. Hence, RCE is always at least as restrictive as the CC, and RCE can eliminate some equilibria that the CC allows.

Example 2.

Consider a game where the sender has type distribution $\lambda(\theta_{\text{strong}}) = 0.[…]$, $\lambda(\theta_{\text{weak}}) = 0.[…]$. The sender chooses In or Out. The game ends with payoffs (0,0) if the sender chooses Out. If the sender chooses In, the receiver then chooses Up, Down, or X. Up is the receiver’s optimal response if the sender is more likely to be $\theta_{\text{strong}}$, Down is optimal when the sender is more likely to be $\theta_{\text{weak}}$, and X is never optimal. This game has two sequential equilibrium outcomes: one where both types choose Out, and another where both types go In and the receiver responds with Up.

Without payoff knowledge, a compatibility relation based on all $\pi_2 \in \Pi_2$ does not rank the two types after signal In. If $\pi_2(\text{Down} \mid \text{In}) = 2/3$ and $\pi_2(\text{X} \mid \text{In}) = 1/3$, for example, $\theta_{\text{weak}}$ finds In optimal but $\theta_{\text{strong}}$ does not. So the sequential equilibrium outcome Out satisfies the CC. However, since X is conditionally dominated after In, we can verify that the stronger rational compatibility relation ranks $\theta_{\text{strong}} \succsim_{\text{In}} \theta_{\text{weak}}$ and that the unique RCE is the equilibrium where both types go In. Underlying this is the fact that if the conditionally dominated response X is removed from the game tree, then $\theta_{\text{strong}}$ will experiment more frequently with In than $\theta_{\text{weak}}$ does because $\theta_{\text{strong}}$ potentially has more to gain. This story breaks down if senders do not know receivers’ payoffs and thus suspect that X might be used after In. We will show in Section 7 that for some full-support prior beliefs, $\theta_{\text{weak}}$ experiments more with In than $\theta_{\text{strong}}$ does under any patience level.

Footnote: This is a modified version of the Cho-Kreps “beer-quiche game,” where an outside option with certain payoffs (Out) replaces the Quiche signal. The responses Up and Down correspond to Not Fight and Fight in the beer-quiche game, while X is a conditionally dominated response for the receiver following In. Also, while our definition of signaling games requires that the receiver has the same action set after every signal, this situation is clearly equivalent to one where the receiver chooses Up, Down, or X after Out, but all of these choices lead to the payoffs (0,0).

While the previous example shows payoff information may lead to more […]

Example 3.

Consider a game with two sender types, $\theta_1$ and $\theta_2$, equally likely, and two possible signals, $L$ or $R$. Payoffs are given in the tables below.

signal: L          action: $a_1$   action: $a_2$   action: $a_3$
type: $\theta_1$   -2, 0           2, […]          […], […]
type: $\theta_2$   -2, 1           2, 0            2, -1

signal: R          action: $a_1$   action: $a_2$   action: $a_3$
type: $\theta_1$   5, -1           -3, 2           -4, 0
type: $\theta_2$   -2, -1          1, 0            0, 1

Action $a_1$ is conditionally dominated for the receiver after signal $R$. It is easy to see that in every perfect Bayesian equilibrium $\pi^*$, we must have $\pi_1^*(L \mid \theta_1) = \pi_1^*(L \mid \theta_2) = 1$, $\pi_2^*(a_{[\ldots]} \mid L) = 1$, and that $\pi_2^*(\cdot \mid R)$ must be supported on $A^{BR}_R = \{a_2, a_3\}$. This means the off-path signal $R$ is equilibrium dominated for every type in $\pi^*$, i.e. $\widetilde{J}(R, \pi^*) = \emptyset$. So, $\widetilde{P}(R, \pi^*) = \Delta(\Theta_R) = \Delta(\Theta)$ and RCE permits the receiver to play either $a_2$ or $a_3$ after $R$. (This is despite the fact that $\theta_{[\ldots]}$ is more rationally compatible with $R$ than $\theta_{[\ldots]}$ is. As we discussed after Definition 4, RCE does not restrict the receiver’s belief based on rational type compatibility after an off-path signal that is equilibrium dominated for every type.)

We will show in Section 7 that when learners have payoff information, there is a patiently stable state where the receivers play $a_2$ after $R$ and another patiently stable state where the receivers respond to $R$ with $a_3$. However, we will also show that without payoff information, patient stability requires that the receivers play $a_{[\ldots]}$ after $R$.

This section compares RCE to other equilibrium refinement concepts in the literature.

3.1 Iterated dominance

We first relate RCE to a form of iterated dominance in the ex-ante strategic form of the game, where the sender chooses a strategy $\pi_1$ that maps her type to a signal. We show that every sender strategy that specifies playing signal $s$ as a less compatible type $\theta''$ but not as a more compatible type $\theta'$ will be removed by iterated deletion. The idea is that such a strategy is never a weak best response to any receiver strategy in $\Pi_2^{\bullet}$: if the less compatible $\theta''$ does not have a profitable deviation, then the more compatible type strictly prefers deviating to $s$.

Proposition 3.

Suppose $\theta' \succsim_s \theta''$. Then any ex-ante strategy of the sender $\pi_1$ with $\pi_1(s \mid \theta'') > 0$ but $\pi_1(s \mid \theta') < 1$ is removed by strict dominance once the receiver is restricted to using strategies in $\Pi_2^{\bullet}$.

We next relate RCE to the Intuitive Criterion.

Proposition 4.

Every RCE satisﬁes the Intuitive Criterion.
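One way to see why this is plausible is sketched below. This is a heuristic outline rather than the Appendix proof, and it assumes the Intuitive Criterion is stated in the following form: an equilibrium fails the test at an off-path signal $s$ if some type's equilibrium payoff is strictly below her payoff from $s$ under every receiver best response to beliefs concentrated on $\widetilde{J}(s, \pi^*)$.

```latex
\begin{align*}
&\text{Let } \pi^* \text{ be an RCE and } s \text{ an off-path signal with }
  \widetilde{J}(s,\pi^*) \neq \emptyset. \\[4pt]
&\widetilde{P}(s,\pi^*) \subseteq \Delta\big(\widetilde{J}(s,\pi^*)\big)
  \;\Longrightarrow\;
  \mathrm{BR}\big(\widetilde{P}(s,\pi^*), s\big) \subseteq
  \mathrm{BR}\big(\Delta(\widetilde{J}(s,\pi^*)), s\big). \\[4pt]
&\text{For every } \theta:\quad
  \mathbb{E}_{\pi^*}[u_1 \mid \theta]
  \;\ge\; u_1\big(\theta, s, \pi_2^*(\cdot \mid s)\big)
  \;\ge\; \min_{a \in \mathrm{BR}(\Delta(\widetilde{J}(s,\pi^*)),\, s)} u_1(\theta, s, a).
\end{align*}
```

The first inequality is the Nash condition, and the second holds because $\pi_2^*(\cdot \mid s)$ is supported on $\mathrm{BR}(\widetilde{P}(s, \pi^*), s)$. Hence no type strictly gains from $s$ under every admissible best response, so the test cannot fail at $s$. When $\widetilde{J}(s, \pi^*) = \emptyset$, the test as stated here imposes no restriction.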

The next example shows that the set of RCE is strictly smaller than the set of equilibria that pass the Intuitive Criterion. The idea is that the Intuitive Criterion does not impose any restriction on the relative likelihood of two types after a signal that is not equilibrium dominated for either of them, but RCE can.

Example 4.

Consider a signaling game where the prior probabilities of the two types are $\lambda(\theta_1) = 3/4$ and $\lambda(\theta_2) = 1/4$, and the payoffs are:

signal: $s_1$      action: $a_1$   action: $a_2$
type: $\theta_1$   4, 1            0, 0
type: $\theta_2$   6, 0            2, 1

signal: $s_2$      action: $a_1$   action: $a_2$
type: $\theta_1$   7, 1            3, 0
type: $\theta_2$   7, 0            3, 1

Against any receiver strategy, the two types $\theta_1$ and $\theta_2$ get the same payoffs from $s_2$, but $\theta_2$ gets strictly higher payoffs than $\theta_1$ from $s_1$. So, $\theta_1 \succsim_{s_2} \theta_2$.

Consider now the Nash equilibrium in which the types pool on $s_1$, i.e. $\pi_1^*(s_1 \mid \theta_1) = \pi_1^*(s_1 \mid \theta_2) = 1$, $\pi_2^*(a_1 \mid s_1) = 1$, and $\pi_2^*(a_2 \mid s_2) = 1$. It passes the Intuitive Criterion since the off-path signal $s_2$ is not equilibrium dominated for either type. On the other hand, RCE requires that every action played with positive probability in $\pi_2^*(\cdot \mid s_2)$ best responds to some belief $p$ about the sender’s type satisfying $\frac{p(\theta_2)}{p(\theta_1)} \le \frac{\lambda(\theta_2)}{\lambda(\theta_1)} = \frac{1}{3}$. But action $a_2$ does not best respond to any such belief, so $\pi^*$ is not an RCE.

Next, we compare divine equilibrium with RCE and uRCE. For a strategy profile $\pi^*$, let

$$D(\theta, s; \pi^*) := \{ \alpha \in \mathrm{MBR}(s) \text{ s.t. } \mathbb{E}_{\pi^*}[u_1 \mid \theta] < u_1(\theta, s, \alpha) \}$$

be the subset of mixed best responses to $s$ that would make type $\theta$ strictly prefer deviating from the strategy $\pi_1^*(\cdot \mid \theta)$. Similarly let

$$D^{\circ}(\theta, s; \pi^*) := \{ \alpha \in \mathrm{MBR}(s) \text{ s.t. } \mathbb{E}_{\pi^*}[u_1 \mid \theta] = u_1(\theta, s, \alpha) \}$$

be the set of mixed best responses that would make $\theta$ indifferent to deviating.

Proposition 5.

1. If $\pi^*$ is a Nash equilibrium where $s$ is off-path, and $\theta' \succsim_s \theta''$, then $D(\theta'', s; \pi^*) \cup D^{\circ}(\theta'', s; \pi^*) \subseteq D(\theta', s; \pi^*)$.
2. Every divine equilibrium is an RCE.

However, the converse is not true, as the following example illustrates.

Example 5.

Consider the following signaling game with two types and threesignals, with prior λ ( θ ) = 2 / s a a θ

0, 1 -1, 0 θ

0, 0 -1, 1 s a a θ

2, 1 -1, 0 θ

1, 0 -1, 1 s a a θ

5, 0 -3, 1 θ

0, 1 -2, 0 To be precise, MBR( p, s ) := arg max α ∈ ∆( A ) ( E θ ∼ p [ u ( θ, s, α )]) and MBR( s ) := ∪ p ∈ ∆(Θ) MBR( p, s ).

15e check that the following is a pure-strategy RCE: π ( s | θ ) = π ( s | θ ) =1 , π ( a | s ) = 1 , π ( a | s ) = 1 , π ( a | s ) = 1 . Evidently π is a Nash equilibriumand no type is equilibrium-dominated at any oﬀ-path signal. We now checkthat we do not have θ (cid:37) s θ or θ (cid:37) s θ . Observe that against the receiverstrategy ˜ π ( a | s ) = for every s , s is strictly optimal for θ but s is strictlyoptimal for θ , so θ (cid:37) s θ . And for the receiver strategy ˆ π ( a | s ) = 1 for every s , s is strictly optimal for θ but s is strictly optimal for θ , so θ (cid:37) s θ .This shows the strategy proﬁle is an RCE.However, D ( θ , s ; π ) ∪ D ◦ ( θ , s ; π ) is the set of distributions on { a , a } that put at least weight 0.5 on a . Any such distribution is in D ( θ , s ; π ). Soin every divine equilibrium, the receiver plays a best response to a belief thatputs weight no less than 2/3 on θ after signal s , which can only be a . (cid:7) This example illustrates one diﬀerence between divine equilibrium andRCE: under divine equilibrium, the beliefs after signal s only depend onthe comparison between the payoﬀs to s with those of the equilibrium signal s , while the compatibility criterion also considers the payoﬀs to a third signal s . In the learning model, this corresponds to the possibility that θ choosesto experiment with s at beliefs that induce θ to experiment with s . Our RCE diﬀers from divine equilibrium in another way: divine equilibriuminvolves an iterative application of a belief restriction. The next exampleillustrates this diﬀerence . Example 6.

There are three types, $\theta_1, \theta_2, \theta_3$, all equally likely. The signal space is $S = \{s_1, s_2\}$, and the set of receiver actions is $A = \{a_1, a_2, a_3, a_4\}$. When any sender type chooses the signal $s_1$, all parties get a payoff of 0 regardless of the receiver's action. When the sender chooses $s_2$, the payoffs are determined by the following matrix.

                $a_1$       $a_2$      $a_3$       $a_4$
  $\theta_1$    1, 0.9     -1, 0     -2, 0      -7, 0
  $\theta_2$    5, 0        3, 1     -1, 0      -5, 0.8
  $\theta_3$   -3, 0        5, 0      1, 1.7    -3, 0.8

Footnote: As noted by Van Damme (1987), it may seem more natural to replace the set $\alpha \in \mathrm{MBR}(m)$ in the definitions of $D$ and $D^{\circ}$ with the larger set $\alpha \in \mathrm{co}(\mathrm{BR}(s))$, which leads to the weaker equilibrium refinement that Sobel, Stole, and Zapater (1990) call "co-divinity". This example also shows that RCE need not be co-divine.

Footnote: We thank Joel Sobel for this example.

Consider the pure strategy profile $\pi_1^*(s_1 \mid \theta) = 1$ for all $\theta \in \Theta$ and $\pi_2^*(a_4 \mid s) = 1$ for all $s \in S$. Since $\theta_2$ gains more from deviating to $s_2$ than $\theta_1$ does, applying the divine belief restriction for the off-path signal $s_2$ eliminates the action $a_1$, since it is not a best response to any belief $p \in \Delta(\Theta)$ with $p(\theta_2) \ge p(\theta_1)$. But after action $a_1$ is deleted for the receiver after signal $s_2$, type $\theta_3$ now gains more from deviating to $s_2$ than $\theta_2$ does. So, applying the divine belief restriction again eliminates actions $a_2$ and $a_4$, since neither is a best response against any $p \in \Delta(\Theta)$ with $p(\theta_1) = 0$ (for now $s_2$ is equilibrium dominated for $\theta_1$) and $p(\theta_3) \ge p(\theta_2)$. So $\pi^*$ is not a divine equilibrium.

On the other hand, no type is equilibrium dominated at $s_2$ and the only rational compatibility order is $\theta_2 \succsim_{s_2} \theta_1$. But $a_4$ is a best response against the belief $p(\theta_1) = 0$, $p(\theta_2) = 0.6$, $p(\theta_3) = 0.4$, which belongs to the set $\Delta(\Theta_{s_2}) \cap P_{\theta_2 \triangleright \theta_1}$. So $\pi^*$ is an RCE.

Finally, we show that every uRCE is path-equivalent to an equilibrium that is not ruled out by the "NWBR in signaling games" test (Banks and Sobel, 1987; Cho and Kreps, 1987), which comes from iterative applications of the following pruning procedure: after signal $s$ the receiver is required to put 0 probability on those types $\theta$ such that

$D^{\circ}(\theta, s; \pi^*) \subseteq \cup_{\theta' \neq \theta} D(\theta', s; \pi^*)$.

If this would delete every type, then the procedure instead puts no restriction on the receiver's beliefs and no type is deleted.

By "path-equivalent" we mean that by modifying some of the receiver's off-path responses, but without altering the sender's strategy or the receiver's on-path responses, we can change the uRCE into another uRCE that passes the NWBR test.

Footnote: This is closely related to, but not the same as, the NWBR property of Kohlberg and Mertens (1986).
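The two rounds of belief-based elimination in Example 6 can be checked by brute force. The sketch below is illustrative only: it assumes the matrix columns are $a_1, \dots, a_4$ and the rows are $\theta_1, \theta_2, \theta_3$, keeps just the receiver's payoffs after $s_2$, and scans beliefs on a rational grid, which is a numerical check rather than a proof. The grid step and the helper names `best_responses` and `grid` are hypothetical choices.

```python
from fractions import Fraction as F
from itertools import product

# Receiver payoffs after s2: U2[action] = (payoff vs theta1, theta2, theta3).
U2 = {
    "a1": (F(9, 10), F(0), F(0)),
    "a2": (F(0), F(1), F(0)),
    "a3": (F(0), F(0), F(17, 10)),
    "a4": (F(0), F(4, 5), F(4, 5)),
}

def best_responses(p):
    """Pure actions maximizing the receiver's expected payoff under belief p."""
    vals = {a: sum(pi * ui for pi, ui in zip(p, u)) for a, u in U2.items()}
    top = max(vals.values())
    return {a for a, v in vals.items() if v == top}

def grid(n=40):
    """All beliefs (p1, p2, p3) whose entries are multiples of 1/n."""
    for i, j in product(range(n + 1), repeat=2):
        if i + j <= n:
            yield (F(i, n), F(j, n), F(n - i - j, n))

# Round 1: beliefs with p(theta2) >= p(theta1).
round1 = set().union(*(best_responses(p) for p in grid() if p[1] >= p[0]))
# Round 2: beliefs with p(theta1) = 0 and p(theta3) >= p(theta2).
round2 = set().union(
    *(best_responses(p) for p in grid() if p[0] == 0 and p[2] >= p[1])
)

print(sorted(round1), sorted(round2))
print(best_responses((F(0), F(3, 5), F(2, 5))))
```

On this grid, $a_1$ never appears in the first round, only $a_3$ survives the second round, and $a_4$ is the unique best response to the belief $(0, 0.6, 0.4)$, in line with the argument in the example.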

Proposition 6.

Every uRCE is path-equivalent to a uRCE that passes the NWBR test.

Corollary 1.

Every uRCE is path-equivalent to a universally divine equilibrium.

To summarize this subsection, we note that for strategy profiles that are on-path strict for the receiver, we have the following inclusion relationships. The first inclusion should be understood as inclusion up to path-equivalence. We use the symbol "$\subsetneq$" to mean that the former solution set is always nested within the latter one in every signaling game, and that there exist games where the nesting relationship is strict.

uRCE $\subsetneq$ universally divine equilibria $\subsetneq$ RCE $\subsetneq$ Intuitive Criterion $\subsetneq$ Nash equilibria.

We study the same discrete-time steady-state learning model as Fudenberg and He (2018) except for an extra restriction on the players' prior beliefs over other players' strategies.

There is a continuum of agents in the society, with a unit mass of receivers and $\lambda(\theta)$ mass of type $\theta$ senders. Each population is further stratified by age, with a fraction $(1-\gamma)\cdot\gamma^t$ of each population age $t$ for $t = 0, 1, 2, \ldots$ At the end of each period, each agent has probability $0 \le \gamma < 1$ of surviving into the next period. Each period, $(1-\gamma)$ new receivers and $\lambda(\theta)(1-\gamma)$ new type $\theta$ senders are born into the society, thus preserving population sizes and the age distribution.

Agents play the signaling game every period against a randomly matched opponent. Each sender has probability $(1-\gamma)\gamma^t$ of matching with a receiver of age $t$, while each receiver has probability $\lambda(\theta)(1-\gamma)\gamma^t$ of matching with a type $\theta$ sender of age $t$.

Each agent is born into a player role in the signaling game: either a receiver or a type $\theta$ sender. Agents know their role, which is fixed for life. The agents' payoff each period is determined by the outcome of the signaling game they played, which consists of the sender's type, the signal sent, and the action played in response. The agents observe this outcome, but the sender does not observe how her matched receiver would have played had she sent a different signal.

In addition to only surviving to the next period with probability $0 \le \gamma < 1$, agents discount future utility flows by $0 \le \delta < 1$. Letting $u_t$ represent the payoff $t$ periods from today, each agent's objective function is $\mathbb{E}\left[\sum_{t=0}^{\infty} (\gamma\delta)^t \cdot u_t\right]$. (Define $0^0 := 1$, so that a myopic agent just maximizes the current period's expected payoff in every period.)

Agents believe they face a fixed but unknown distribution of opponents' aggregate play, updating their beliefs at the end of every period based on the outcome in their own game. Formally, each sender is born with a prior density function over receivers' behavior strategies, $g_1 : \Pi_2 \to \mathbb{R}_+$.
Similarly, each receiver is born with a prior density over the senders' behavior strategies, $g_2 : \Pi_1 \to \mathbb{R}_+$. We denote the marginal distribution of $g_1$ on signal $s$ as $g_1^{(s)} : \Delta(A) \to \mathbb{R}_+$, so that $g_1^{(s)}(\pi_2(\cdot \mid s))$ is the density of the new senders' prior over how receivers respond to signal $s$. Similarly, we denote the $\theta$ marginal of $g_2$ as $g_2^{(\theta)} : \Delta(S) \to \mathbb{R}_+$, so that $g_2^{(\theta)}(\pi_1(\cdot \mid \theta))$ is the new receivers' prior density over the signal choice of type $\theta$.

Footnote: We separately consider survival probability and patience so that we may consider agents who are impatient relative to their expected lifespan. Such agents experiment early in their life cycle, but spend most of their life myopically best responding to their beliefs, which makes our analysis more tractable.

We now state a regularity assumption on agents' priors that will be maintained throughout.

Definition 8.

A prior g = ( g , g ) is regular if(a). [ independence ] g ( π ) = Q s ∈ S g ( s )1 ( π ( ·| s )) and g ( π ) = Q θ ∈ Θ g ( θ )2 ( π ( ·| θ )).(b). [ payoﬀ knowledge ] g puts probability 1 on Π • and g puts probability 1on Π • .(c). [ g non-doctrinaire ] g is continuous and strictly positive on the interiorof Π • . (d). [ g nice ] For each type θ, there are positive constants (cid:16) α ( θ ) s (cid:17) s ∈ S such that π ( ·| θ ) g ( θ )2 ( π ( ·| θ )) Q s ∈ S π ( s | θ ) α ( θ ) s − is uniformly continuous and bounded away from zero on the relativeinterior of Π • θ , the set of rational behavior strategies of type θ .This assumption bears the same name as the regularity assumption in Fu-denberg and He (2018), and is identical except that agents now know others’payoﬀs and others’ rationality. In the learning model, this payoﬀ knowledgetranslates into a restriction on the supports of the priors g , g , reﬂecting adogmatic belief that senders will never play dominated signals and receiverswill never play conditionally dominated actions. (These beliefs are correct inthe learning model.)Even with payoﬀ knowledge, the receiver’s prior can assign positive prob-ability to ex-ante dominated sender strategies. For instance, in the signalinggame below, 20he sender strategy π ( s | θ ) = π ( s | θ ) = 1 belongs to the set Π • , andso must belong to the support of any regular receiver prior. But, even though s ∈ S θ and s ∈ S θ , the receiver strategies to which they respectivelybest respond form disjoint sets, and π is ex-ante dominated because it is nota best response to any single receiver strategy. It is nevertheless consistentfor a receiver who knows the sender’s payoﬀ as a function of their type toassign positive density to π , because diﬀerent types of agents can choose bestresponses to diﬀerent beliefs about receiver play. Let Y θ [ t ] := ( ∪ s ∈ S ( s × A BR s )) t represent the set of possible histories for a type θ sender with age t . 
Note that a valid history encodes the signal that $\theta$ sent each period and the (conditionally undominated) action that her opponent played in response. Let $Y_\theta := \bigcup_{t=0}^{\infty} Y_\theta[t]$ be the set of all histories for type $\theta$.

Similarly, write $Y_2[t] := (\Theta \times S_\theta)^t$ for the set of possible histories for a receiver with age $t$. Each period, his history encodes the type of the matched sender and the (undominated) signal observed. The union $Y_2 := \bigcup_{t=0}^{\infty} Y_2[t]$ then stands for the set of all receiver histories.

The agents' dynamic optimization problems discussed in Subsection 4.2 give rise to optimal policies $\sigma_\theta : Y_\theta \to S_\theta$ and $\sigma_2 : Y_2 \to \times_s (A_s^{BR})$. Here, $\sigma_\theta(y_\theta)$ is the signal that a type $\theta$ sender with history $y_\theta$ would send the next time she plays the game, and $\sigma_2(y_2)$ is the pure extensive-form strategy that a receiver with history $y_2$ would commit to the next time he plays the game. In the learning model, each agent solves a (single-agent) dynamic optimization problem, and chooses a deterministic optimal policy.

Footnote: For notational simplicity, we suppress the dependence of these optimal policies on the effective discount factor $\delta\gamma$ and on the priors.

A state $\psi$ of the learning model is a demographic description of how many agents have each possible history. It can be viewed as a distribution $\psi \in (\times_{\theta \in \Theta} \Delta(Y_\theta)) \times \Delta(Y_2)$, and its components are denoted by $\psi_\theta \in \Delta(Y_\theta)$ and $\psi_2 \in \Delta(Y_2)$.

Since each state $\psi$ is a distribution over histories and optimal policies map histories to play, $\psi$ induces a distribution over play (i.e., a rational behavior strategy) in the signaling game $\sigma(\psi) \in \Pi^\bullet$, given by

$\sigma_\theta(\psi_\theta)(s) := \psi_\theta\{y_\theta \in Y_\theta : \sigma_\theta(y_\theta) = s\}$ and $\sigma_2(\psi_2)(a \mid s) := \psi_2\{y_2 \in Y_2 : \sigma_2(y_2)(s) = a\}$.

Here, $\sigma_\theta(\psi_\theta)$ and $\sigma_2(\psi_2)$ are the aggregate behaviors of the type $\theta$ and receiver populations in state $\psi$, respectively.
Note that the aggregate play of a population can be stochastic even if the entire population uses the same deterministic optimal policy, because different senders will be matched with different receivers, and so different agents on the same side will observe different histories and play differently.

Of particular interest are the steady states, to be defined more precisely in Section 6. Loosely speaking, a steady state induces a time-invariant distribution over how the signaling game is played in the society.

This section defines the notion of a steady state using the "aggregate responses" of one population to the distribution of play in the other. These responses are defined using the "one-period forward" maps that describe how the agents' policies induce a map from current distributions over histories to what the distributions will be after the agents are matched and play the game using the strategies their policies prescribe.

Fix the receivers' aggregate play at $\pi_2 \in \Pi_2^\bullet$ and fix an optimal policy $\sigma_\theta$ for each type $\theta$. The one-period-forward map for type $\theta$, $f_\theta$, describes the distribution over histories that will prevail next period when the current distribution over histories in the type-$\theta$ population is $\psi_\theta$. The next definition specifies the probability that $f_\theta[\psi_\theta, \pi_2]$ assigns to the history $(y_\theta, (s, a)) \in Y_\theta[t+1]$, that is to say a one-period concatenation of $(s, a)$ onto the history $y_\theta \in Y_\theta[t]$.

Definition 9.

The one-period-forward map for type $\theta$, $f_\theta : \Delta(Y_\theta) \times \Pi_2^\bullet \to \Delta(Y_\theta)$, is

$f_\theta[\psi_\theta, \pi_2](y_\theta, (s, a)) := \psi_\theta(y_\theta) \cdot \gamma \cdot \mathbf{1}\{\sigma_\theta(y_\theta) = s\} \cdot \pi_2(a \mid s)$

and $f_\theta(\emptyset) := 1 - \gamma$.

To interpret, of the $\psi_\theta(y_\theta)$ fraction of the type-$\theta$ population with history $y_\theta$, a $\gamma$ fraction survives into the next period. The survivors all choose $\sigma_\theta(y_\theta)$ next period, which is met with response $a$ with probability $\pi_2(a \mid \sigma_\theta(y_\theta))$.

Write $f_\theta^T$ for the $T$-fold application of $f_\theta$ on $\Delta(Y_\theta)$, holding fixed some $\pi_2$. It is easy to show that $\lim_{T \to \infty} f_\theta^T(\psi_\theta, \pi_2)$ exists and is independent of the initial $\psi_\theta$. (This is because for any two states $\psi_\theta, \psi_\theta'$, the two distributions over histories $f_\theta^T(\psi_\theta, \pi_2)$ and $f_\theta^T(\psi_\theta', \pi_2)$ agree on all $Y_\theta[t]$ for $t < T$. As $T$ grows large, the two resulting distributions must converge to each other since the fraction of very old agents with very long histories is small.) Denote this limit as $\tilde{\psi}_\theta^{\pi_2}$. It is the distribution over type-$\theta$ histories induced by the receivers' aggregate play $\pi_2$.

Definition 10.

The aggregate sender response $R_1 : \Pi_2^\bullet \to \Pi_1^\bullet$ is defined by

$R_1[\pi_2](s \mid \theta) := \tilde{\psi}_\theta^{\pi_2}(y_\theta : \sigma_\theta(y_\theta) = s)$.

That is, $R_1[\pi_2](\cdot \mid \theta)$ describes the asymptotic aggregate play of the type-$\theta$ population when the aggregate play of the receiver population is fixed at $\pi_2$ each period. Note that $R_1$ maps into $\Pi_1^\bullet$ because no type ever wants to send a dominated signal, even as an experiment, regardless of their beliefs about the receiver's response.

Technically, $R_1$ depends on $g_1$, $\delta$, and $\gamma$, just like $\sigma_\theta$ does. When relevant, we will make these dependencies clear by adding the appropriate parameters as superscripts to $R_1$, but we will mostly suppress them to lighten notation.

We now turn to the receivers, who have a passive learning problem. They always observe the sender's type and signal at the end of each period, so their optimal policy $\sigma_2$ myopically best responds to the posterior belief at every history $y_2$.

Definition 11.

The one-period-forward map for the receivers $f_2 : \Delta(Y_2) \times \Pi_1^\bullet \to \Delta(Y_2)$ is

$f_2[\psi_2, \pi_1](y_2, (\theta, s)) := \psi_2(y_2) \cdot \gamma \cdot \lambda(\theta) \cdot \pi_1(s \mid \theta)$

and $f_2(\emptyset) := 1 - \gamma$.

As with the one-period-forward maps $f_\theta$ for senders, $f_2[\psi_2, \pi_1]$ describes the distribution over receiver histories next period starting with a society where the distribution is $\psi_2$ and the sender population's aggregate play is $\pi_1$. We write $\tilde{\psi}_2^{\pi_1} := \lim_{T \to \infty} f_2^T(\psi_2, \pi_1)$ for the long-run distribution over $Y_2$ induced by fixing the sender population's play at $\pi_1$. (This limit is again independent of the initial state $\psi_2$.)

Definition 12.

The aggregate receiver response $R_2 : \Pi_1^\bullet \to \Pi_2^\bullet$ is

$R_2[\pi_1](a \mid s) := \tilde{\psi}_2^{\pi_1}(y_2 : \sigma_2(y_2)(s) = a)$.

Steady States and Patient Stability

A steady-state strategy profile is a pair of mutual aggregate replies, so it is time-invariant under learning.

Definition 13. $\pi^*$ is a steady-state strategy profile if $R_1^{g,\delta,\gamma}(\pi_2^*) = \pi_1^*$ and $R_2^{g,\delta,\gamma}(\pi_1^*) = \pi_2^*$. Denote the set of all such strategy profiles as $\Pi^*(g, \delta, \gamma)$.

We now state two results about these steady states. We do not provide proofs because they follow easily from analogous results in Fudenberg and He (2018).

First, steady-state profiles always exist.

Proposition 7.

For any regular prior $g$ and any $0 \le \delta, \gamma < 1$, $\Pi^*(g, \delta, \gamma)$ is non-empty and compact in the norm topology.

The patiently stable strategy profiles correspond to the set $\lim_{\delta \to 1} \lim_{\gamma \to 1} \Pi^*(g, \delta, \gamma)$. This order of limits was first introduced in Fudenberg and Levine (1993). It ensures agents spend most of their lifetime playing myopically instead of experimenting, which is important for proving that patiently stable profiles are Nash equilibria.

Definition 14.

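For each $0 \le \delta < 1$, a strategy profile $\pi^*$ is $\delta$-stable under $g$ if there is a sequence $\gamma_k \to 1$ and an associated sequence of steady-state strategy profiles $\pi^{(k)} \in \Pi^*(g, \delta, \gamma_k)$, such that $\pi^{(k)} \to \pi^*$. Strategy profile $\pi^*$ is patiently stable under $g$ if there is a sequence $\delta_k \to 1$ and an associated sequence of strategy profiles $\pi^{(k)}$ where each $\pi^{(k)}$ is $\delta_k$-stable under $g$ and $\pi^{(k)} \to \pi^*$. Strategy profile $\pi^*$ is patiently stable if it is patiently stable under some regular prior $g$.

Proposition 8.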

If strategy profile $\pi^*$ is patiently stable, then it is a Nash equilibrium.

Note that Propositions 7 and 8 apply even if all of the Nash equilibria of the game are in mixed strategies; as noted above, the randomization here arises from the random matching process.
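The one-period-forward map and the aggregate sender response defined earlier in this section can be simulated directly on a small example. The sketch below is a toy illustration, not the model's optimal policy: the stopping rule in `policy`, the receiver play `PI2`, the survival probability `GAMMA`, and all function names are hypothetical. It iterates the forward map from two different initial states and compares the aggregate play they induce.

```python
from collections import defaultdict

GAMMA = 0.8
# Hypothetical receiver play: responses to each signal with their probabilities.
PI2 = {"s1": {"a1": 1.0}, "s2": {"a1": 0.3, "a2": 0.7}}

def policy(history):
    """Illustrative stopping rule (an assumption, not the Gittins-index policy):
    try s2 until the response a2 has been seen twice, then play s1 forever."""
    bad = sum(1 for (s, a) in history if s == "s2" and a == "a2")
    return "s2" if bad < 2 else "s1"

def forward(psi, pi2=PI2, gamma=GAMMA):
    """One application of the one-period-forward map to a state psi,
    represented as a dict from history tuples to population mass."""
    nxt = defaultdict(float)
    nxt[()] = 1.0 - gamma  # newborns replace the agents who exit
    for hist, mass in psi.items():
        s = policy(hist)
        for a, prob in pi2[s].items():
            nxt[hist + ((s, a),)] += mass * gamma * prob
    return dict(nxt)

def aggregate_play(psi):
    """Share of the population whose policy prescribes each signal."""
    out = defaultdict(float)
    for hist, mass in psi.items():
        out[policy(hist)] += mass
    return dict(out)

def iterate(psi, T):
    """Apply the forward map T times."""
    for _ in range(T):
        psi = forward(psi)
    return psi

T = 40
from_newborns = iterate({(): 1.0}, T)
from_veterans = iterate({(("s2", "a2"),) * 3: 1.0}, T)
print(aggregate_play(from_newborns)["s2"], aggregate_play(from_veterans)["s2"])
```

The two starting states differ only in the cohort that has survived all $T$ periods, whose mass is $\gamma^T$, so the two aggregate plays differ by at most $\gamma^T$. This is the sense in which the limit does not depend on the initial state.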

Patient Stability, Payoff Knowledge, and Equilibrium Refinements

In this section, we relate the equilibrium refinements proposed in Section 2 to the steady-state learning model. We show that under certain strictness assumptions, RCE is necessary for patient stability while uRCE is sufficient for patient stability. We also discuss how payoff knowledge matters for learning outcomes.

We show that any patiently stable strategy profile satisfying a strictness assumption must be an RCE. The key lemma is analogous to Lemma 1 from Fudenberg and He (2018), so we will omit its proof.

Lemma 1.

Suppose $\theta' \succsim_s \theta''$. Then for any regular prior $g$, $0 \le \delta, \gamma < 1$, and any $\pi_2 \in \Pi_2^\bullet$, we have $R_1[\pi_2](s \mid \theta') \ge R_1[\pi_2](s \mid \theta'')$.

This result says that over their lifetimes, the relative frequencies with which different sender types experiment with signal $s$ respect the rational compatibility order $\succsim_s$. This follows from the fact that sender types who are more compatible with a signal will play it at least as often. The payoff knowledge embedded in $g_1$'s support implies that senders never experiment in the hopes of seeing a response which is highly profitable for the sender but dominated for the receiver, such as the Charity action in Example 2 for $\theta_{weak}$. This extra assumption leads to a stronger result than Lemma 1 from Fudenberg and He (2018), which is stated in terms of the less-complete compatibility order.

For a fixed strategy profile $\pi$ and on-path signal $s^*$, let $\mathbb{E}_{\theta \mid \pi_1, s^*}[u_2(\theta, s^*, a)]$ denote the receiver's expected utility from responding to $s^*$ with $a$, where the expectation over the sender's type $\theta$ is taken with respect to the posterior type distribution after signal $s^*$ given the sender's strategy $\pi_1(\cdot \mid \theta)$.

Definition 15.

A Nash equilibrium $\pi^*$ is on-path strict for the receiver if for every on-path signal $s^*$, $\pi_2^*(a^* \mid s^*) = 1$ for some $a^* \in A$ and $\mathbb{E}_{\theta \mid \pi_1^*, s^*}[u_2(\theta, s^*, a^*)] > \max_{a \neq a^*} \mathbb{E}_{\theta \mid \pi_1^*, s^*}[u_2(\theta, s^*, a)]$.

We call this condition "on-path" strict for the receiver because we do not make assumptions about the receiver's incentives after off-path signals. For generic payoffs, all pure-strategy equilibria will be on-path strict for the receiver.

Theorem 1.

Every strategy profile that is patiently stable and on-path strict for the receiver is an RCE.

RCE rules out two kinds of receiver beliefs after signal $s$: those that assign non-zero probability to equilibrium-dominated sender types, and those that violate the rational compatibility order. The restriction on equilibrium-dominated types uses the assumption that the receiver has a strict best response to each on-path signal to put a lower bound on how slowly aggregate receiver play at on-path signals converges to its limit. The fact that the receiver beliefs respect the rational compatibility order comes from Lemma 1, which uses our assumptions about prior $g$ to derive restrictions on the aggregate sender response $R_1$; the proof then shows that these are reflected in the aggregate receiver response $R_2$. The proof of Theorem 1 closely follows the analogous proof in Fudenberg and He (2018) and is omitted.

We now prove our main result: as a partial converse to Theorem 1, we show that under additional strictness conditions, every uRCE is path-equivalent to a patiently stable strategy profile.

Definition 16. A quasi-strict uRCE $\pi^*$ is a uRCE that is on-path strict for the receiver, strict for the sender (that is, there exists an equilibrium signal $s^*$ for each type $\theta$ with $u_1(\theta, s^*, \pi_2^*(\cdot \mid s^*)) > \max_{s \neq s^*} u_1(\theta, s, \pi_2^*(\cdot \mid s))$, so every type strictly prefers its equilibrium signal to any other), and satisfies $\mathbb{E}_{\pi^*}[u_1 \mid \theta] > u_1(\theta, s, a)$ for all $\theta$, all off-path signals $s$ and all $a \in \mathrm{BR}(\hat{P}(s), s)$.

Footnote: If the receiver mixes after some equilibrium signal $s$ for type $\theta$, then our techniques for showing that $\theta$ does not experiment very much with equilibrium dominated signals do not go through, but we do not have a counterexample.

That is, every receiver best response to $\hat{P}(s)$ strictly deters every type from deviating to $s$, whenever $s$ is off-path. Every uRCE satisfies the weaker version of this condition where "strictly deters" is replaced with "weakly deters."

Theorem 2. If $\pi^*$ is a quasi-strict uRCE, then it is path-equivalent to a patiently stable strategy profile.

This theorem follows from three lemmas on $R_1$ and $R_2$. Indeed, the theorem remains valid in any modified learning model where $R_1$ and $R_2$ satisfy the conclusions of these lemmas.

$R_1$ under a confident prior

The first lemma shows that under a suitable prior, the aggregate sender response of the dynamic learning model approximates the sender's static best response function when applied to certain receiver strategies, namely strategies that are "close" to one inducing a unique optimal signal for each sender type. The precise meaning of "close" that we use treats on- and off-path responses differently, so it requires some auxiliary definitions.

Deﬁnition 17.

Let $\pi^*$ be a strategy profile where every type plays a pure strategy and the receiver plays a pure action after each on-path signal. Say $\pi^*$ induces a unique optimal signal for each sender type if

$\mathbb{E}_{\pi^*}[u_1 \mid \theta] > \max_{s \neq \pi_1^*(\theta)} u_1(\theta, s, \pi_2^*(\cdot \mid s))$

for every type $\theta$.

Starting with a strategy profile $\pi^*$ that induces a unique optimal signal for each sender type, define for each off-path $s$ in $\pi^*$ the set of receiver actions $\tilde{A}(s) := \{a : \mathbb{E}_{\pi^*}[u_1 \mid \theta] > u_1(\theta, s, a) \ \forall \theta\}$ that strictly deter every type from deviation. Because $\pi^*$ induces a unique optimal signal, each $\tilde{A}(s)$ must contain at least one element in the support of $\pi_2^*(\cdot \mid s)$, but could also contain other actions. It is clear that if $\pi^*$ were modified off-path by changing each $\pi_2^*(\cdot \mid s)$ to be an arbitrary mixture over $\tilde{A}(s)$, then the resulting strategy profile would continue to induce (the same) unique optimal signal for each sender type.

For $\pi^*$ that induces a unique optimal signal for each sender type, write $B_2^{on}(\pi^*, \epsilon)$ for the elements of $\Pi_2^\bullet$ no more than $\epsilon$ away from $\pi_2^*$ at the on-path signals in $\pi^*$, that is

$B_2^{on}(\pi^*, \epsilon) := \{\pi_2 \in \Pi_2^\bullet : |\pi_2(a \mid s) - \pi_2^*(a \mid s)| \le \epsilon, \ \forall a, \text{ on-path } s \text{ in } \pi^*\}$.

Similarly, define $B_2^{off}(\pi^*, \epsilon)$ as the elements of $\Pi_2^\bullet$ putting no more than $\epsilon$ probability on actions outside of $\tilde{A}(s)$ after each off-path $s$, where $\tilde{A}(s)$ is the set of actions that would deter every type from deviating to $s$, as above.

$B_2^{off}(\pi^*, \epsilon) := \{\pi_2 \in \Pi_2^\bullet : \pi_2(\tilde{A}(s) \mid s) \ge 1 - \epsilon, \ \forall \text{ off-path } s \text{ in } \pi^*\}$.

Lemma 2.

Suppose $\pi^*$ induces a unique optimal signal for each sender type. Then there exists a regular prior $g_1$, some $0 < \epsilon_{off} < 1$, and a function $\gamma(\delta, \epsilon)$ valued in $(0, 1)$, such that for every $0 < \delta < 1$, $0 < \epsilon < \epsilon_{off}$, and $\gamma(\delta, \epsilon) < \gamma < 1$, if $\pi_2 \in B_2^{on}(\pi^*, \epsilon) \cap B_2^{off}(\pi^*, \epsilon_{off})$, then $|R_1^{g_1,\delta,\gamma}[\pi_2](s \mid \theta) - \pi_1^*(s \mid \theta)| < \epsilon$ for every $\theta$ and $s$.

Note that the same $\epsilon$ appears in the hypothesis $\pi_2 \in B_2^{on}(\pi^*, \epsilon)$ as in the conclusion. That is, the aggregate sender response gets closer to $\pi_1^*$ as receivers' play gets closer to $\pi_2^*$.

The idea is to specify a sender prior $g_1$ that is highly confident and correct about the receiver's response to on-path signals, and is also confident that the receiver responds to each off-path signal $s$ with actions in $\tilde{A}(s)$. Take a signal $s$ other than the one that $\theta$ sends in $\pi^*$. If $\theta$ has not experimented much with $s$, then her belief is close to the prior and she thinks deviation does not pay. If $\theta$ has experimented a lot with $s$, then by the law of large numbers her belief is likely to be concentrated on $\tilde{A}(s)$, so again she thinks deviation does not pay. Since the option value for experimentation eventually goes to 0, at most histories all sender types are playing a myopic best response to their beliefs, meaning they will not deviate from $\pi_1^*$. The intuition is similar to that of Lemmas 6.1 and 6.4 from Fudenberg and Levine (2006), which say that we can construct a highly concentrated and correct prior so that in the steady state, most agents have correct beliefs about opponents' play both on and one step off the equilibrium path.

This lemma requires the assumption that $\pi^*$ is strict for the sender. If $s^*$ were only weakly optimal for $\theta$ in $\pi^*$, there could be receiver strategies arbitrarily close to $\pi_2^*$ that make some other signal $s \neq s^*$ strictly optimal for $\theta$. In that case, we cannot rule out that a non-negligible fraction of the $\theta$ population will rationally play $s$ forever when the receiver population plays close to $\pi_2^*$.

$R_2$ and learning rational compatibility

Let $C$ be the set of sender strategies that respect the rational compatibility order, that is

$C := \{\pi_1 \in \Pi_1^\bullet : \pi_1(s \mid \theta') \ge \pi_1(s \mid \theta'') \text{ whenever } \theta' \succsim_s \theta''\}$.

The next lemma shows that there is a prior for the receivers so that when the aggregate sender play is any strategy in $C$, almost all receivers end up with beliefs consistent with the rational compatibility order. This lemma is the main technical contribution of the paper and enables us to provide a sufficient condition for patient stability when the relative frequencies of off-path experiments matter.

Lemma 3.

For each $\epsilon > 0$, there exists a regular receiver prior $g_2$ and $0 < \underline{\gamma} < 1$ so that for any $\underline{\gamma} < \gamma < 1$, $0 < \delta < 1$, and $\pi_1 \in C$, $R_2^{g_2,\delta,\gamma}[\pi_1](\mathrm{BR}(\hat{P}(s), s) \mid s) \ge 1 - \epsilon$ for each signal $s$.

The key step in the proof is constructing a prior belief for the receivers so that when the senders' aggregate play is sufficiently close to the target equilibrium, the receiver beliefs respect the compatibility order. This step was not necessary in Fudenberg and Levine (2006), which is the only other paper that has given a sufficient condition for patient stability in a class of games.

To prove Lemma 3, we construct a Dirichlet prior $g_2$ so that for any $s$ such that $\theta' \succsim_s \theta''$, $g_2$ assigns much greater prior weight to $\theta'$ playing $s$ than to $\theta''$ playing $s$. In the absence of data, the receiver strongly believes that the senders are using strategies $\pi_1$ such that $p(\theta'' \mid s) / p(\theta' \mid s) \le \lambda(\theta'') / \lambda(\theta')$. This strong prior belief can only be overturned by a very large number of observations to the contrary. But because $\pi_1 \in C$ respects the rational compatibility order, if the receiver has a very large number of observations of senders choosing $s$, the law of large numbers implies this large sample is unlikely to lead the receiver to have a belief outside of $\hat{P}(s)$. So we can ensure that with high probability sufficiently long-lived receivers play a best response to $\hat{P}(s)$ after the off-path $s$.

Finally, we state a lemma that says for any Dirichlet receiver prior, when lifetimes are long enough, the aggregate receiver response approximates the receiver's best response function on-path when applied to a sender strategy that provides strict incentives after every on-path signal. Write $B_1^{on}(\pi^*, \epsilon)$ for the elements of $\Pi_1^\bullet$ where each type $\theta$ plays $\epsilon$-close to $\pi_1^*(\cdot \mid \theta)$, that is

$B_1^{on}(\pi^*, \epsilon) := \{\pi_1 \in \Pi_1^\bullet : |\pi_1(s \mid \theta) - \pi_1^*(s \mid \theta)| \le \epsilon, \ \forall \theta, s\}$.

Lemma 4.

Fix a strategy profile $\pi^*$ where the receiver has strict incentives after every on-path signal. For each regular Dirichlet receiver prior $g_2$, there exists $\epsilon > 0$ and a function $\gamma(\epsilon)$ valued in $(0, 1)$, so that whenever $\pi_1 \in B_1^{on}(\pi^*, \epsilon)$, $0 < \delta < 1$, and $\gamma(\epsilon) < \gamma < 1$, we have $|R_2^{g_2,\delta,\gamma}[\pi_1](a \mid s) - \pi_2^*(a \mid s)| < \epsilon$ for every on-path signal $s$ in $\pi^*$ and $a$.

The intuition is that when the aggregate sender strategy is close to $\pi_1^*$, then after every signal to which $\pi_1^*$ gives positive probability, a receiver with enough data is likely to have a belief close to the Bayesian belief assigned by $\pi^*$. Coupled with the fact that $\pi^*$ is on-path strict for the receiver, this lets us conclude that long-lived receivers play $\pi_2^*(\cdot \mid s)$ after every on-path $s$ with high probability.

Footnote (on Fudenberg and Levine, 2006): Their result guarantees that the receivers' beliefs about the frequency of type $\theta$ sending signal $s$ is within $\epsilon$ of the truth. This is not sufficient for our purposes, because when signal $s$ has probability 0 under a given sender strategy, perturbing the strategy of every type by up to $\epsilon$ can generate arbitrary off-path beliefs about the sender's type.

Footnote: The Dirichlet prior is the conjugate prior to multinomial data, and corresponds to the updating used in fictitious play (Fudenberg and Kreps, 1993). It is readily verified that if each of $g_1^{(s)}$ and $g_2^{(\theta)}$ is Dirichlet and independent of the other components, then $g$ is regular. In the proof, we work with Dirichlet priors since they give tractable closed-form expressions for the posterior mean belief of the opponent's strategy after a given history.

We revisit the examples from Section 2.3 and discuss how prior beliefs reflecting knowledge or ignorance of payoff information lead to different implications for learning.
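Both the proof sketch of Lemma 3 and the examples that follow rely on Dirichlet updating, which has a closed-form posterior mean. The sketch below illustrates how a receiver prior that puts much more weight on the more compatible type playing an off-path signal keeps the posterior type odds below the prior odds unless the data are strongly to the contrary. The two type labels, the weights in `PRIOR`, the counts, and the function names are hypothetical and are not the prior constructed in the paper's proof.

```python
from fractions import Fraction as F

# Prior over types; "hi" stands for the type that is more compatible with s_off.
LAMBDA = {"hi": F(1, 2), "lo": F(1, 2)}

# Hypothetical Dirichlet weights over the signals (s_eq, s_off) for each type.
PRIOR = {"hi": {"s_eq": 1, "s_off": 5}, "lo": {"s_eq": 5, "s_off": 1}}

def posterior_mean(weights, counts):
    """Posterior mean of a Dirichlet after multinomial counts: (alpha_i + n_i) / total."""
    total = sum(weights.values()) + sum(counts.values())
    return {s: F(weights[s] + counts.get(s, 0), total) for s in weights}

def type_odds_after(signal, data):
    """Posterior odds Pr(lo | signal) / Pr(hi | signal) implied by the receiver's data,
    where data[type][signal] counts how often that type was seen sending that signal."""
    mean = {t: posterior_mean(PRIOR[t], data.get(t, {})) for t in PRIOR}
    return (LAMBDA["lo"] * mean["lo"][signal]) / (LAMBDA["hi"] * mean["hi"][signal])

no_data = type_odds_after("s_off", {})
some_data = type_odds_after(
    "s_off", {"hi": {"s_eq": 100}, "lo": {"s_eq": 98, "s_off": 2}}
)
print(no_data, some_data)
```

With these illustrative numbers the odds move from 1/5 to 3/5 after two contrary observations, still below the prior odds $\lambda(\text{lo})/\lambda(\text{hi}) = 1$. Larger prior weights would require proportionally more contrary data to overturn the belief.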

In Example 2, it follows from Lemma 1 that for any 0 ≤ δ, γ <

1, any re-ceiver play π ∈ Π • , and any regular prior g , we have R g [ π ](In | θ strong ) ≥ R g [ π ](In | θ weak ). In the absence of payoﬀ information,we show that thereexists a full-support prior g so that, ﬁxing π to always play Down , we get R g [ π ](In | θ strong ) ≤ R g [ π ](In | θ weak ) for any 0 ≤ δ, γ <

1, with strictinequality for an open set of parameter values.Let g (In)1 be Dirichlet with weights (1 , K,

1) on ( Up , Down , X ) for arbitrary K ≥

4. After observing k ≥ In with Down , a sender would have the posterior Dirichlet(1 , K + k, θ weak type’s Gittins index for In would be unchanged if her payoﬀs to ( Up , Down , X ) were (3 , − ,

1) instead of (1 , − , Up and X . This observation shows her Gittins index for In is at least as largeas θ strong ’s, whose payoﬀs to ( Up , Down , X ) are (2 , − , In after fewer observations of Down than the weaktype does (this includes the case of “switching away” after 0 observations of

Down, i.e. the strong type never experimenting with In.) We have proven $R^{g}[\pi](\mathrm{In} \mid \theta_{\mathrm{strong}}) \le R^{g}[\pi](\mathrm{In} \mid \theta_{\mathrm{weak}})$ for any $0 \le \delta, \gamma < 1$. In is myopically suboptimal for both types, and by the previous argument, the minimum effective discount factor $\delta\gamma$ that would induce at least one period of experimentation with In is strictly higher for the strong type than for the weak type. This shows that for an open set of $\delta, \gamma$ parameters, $R^{g}[\pi](\mathrm{In} \mid \theta_{\mathrm{strong}}) = 0$ but $R^{g}[\pi](\mathrm{In} \mid \theta_{\mathrm{weak}}) > 0$.

In Example 3 we showed that there is an RCE in which the receivers play a after R. Because RCE is not a sufficient condition for patient stability, this leaves open the question of whether this strategy can arise in our learning model. Here we verify that it can, and also show that “a after R” cannot be part of a patiently stable outcome in the absence of payoff information. This is because patient but inexperienced θ’s without payoff information find it plausible that receivers choose a after R, so they will experiment much more frequently with the off-path signal R than θ’s, for whom every possible response to R leads to worse payoffs than their equilibrium payoff of 2. As a result, receivers learn that R-senders have type θ so they respond with a. On the other hand, when senders know ex-ante that receivers will never choose a after R, for some priors there are steady states where no one ever experiments with R. When this happens, the receivers’ belief about the likelihood ratio of the types following the off-path R is governed by their prior beliefs, which may be arbitrary and thus support a richer class of learning outcomes.

Specifically, in Example 3, suppose $g_1^{(L)}$ is Dirichlet(1, , 1) over all three responses to L, while $g_1^{(R)}$ is Dirichlet(1, 1) on $A^{\mathrm{BR}}_{R}$ = {a, a}, which reflects the sender’s knowledge that a is a conditionally dominated response to R. And suppose that g θ is the Dirichlet(2, 1) distribution on {L, R} and g θ is the Dirichlet(2, x) distribution, where $x > 0$. For any $0 \le \delta, \gamma < 1$, there exists a steady state where senders always choose L and receivers always respond to L with a. This is because the Gittins index for R is no larger than − θ and no larger than 1 for θ after any history, while the myopic expected payoff of L already exceeds these values in the first period. The expected payoff of L only increases with additional observations of a after L.

On the receiver side, every positive-probability history y must involve the senders playing L every period. Following such a history, the receiver believes θ plays L with probability at least , hence an L-sender is the θ type with probability at least / / = . We have a ∈ BR({p}, L) whenever p(θ) ≥ , so we have shown that in the steady state receivers always play a after L.

In this steady state, signal R is never sent, so by choosing different values of $x > 0$, we can sustain either a or a after R as part of a patiently stable profile. To be more precise, let $n_1$ and $n_2$ count the number of times the two types of senders appear in a positive-probability history y, with $n_1$ referring to the type whose prior is Dirichlet(2, 1). The receiver’s posterior assigns the following likelihood ratio to the type of an R-sender:

$$\frac{1/(3 + n_1)}{x/(2 + x + n_2)} = \frac{1}{x} \cdot \left( \frac{2 + x + n_2}{3 + n_1} \right).$$

Since the two types are equally likely, the fraction of receivers with histories y so that 0. ≤ $\frac{2 + x + n_2}{3 + n_1}$ ≤ . γ → . Depending on whether x = 1/ x = 4, these receivers will play a or a after R, so π(a | R) = 1 and π(a | R) = 1 are both δ-stable for any $\delta \ge 0$, under two different regular priors reflecting payoff knowledge.

By contrast, Theorem 3 of Fudenberg and He (2018) implies that if priors $g_1, g_2$ have full support on $\Pi_2$ and $\Pi_1$ respectively, then we must have a after R in every patiently stable profile. The idea is that when senders are patient and long-lived, new θ start off by trying R but new θ start off by trying L. When receivers play a after L with high probability, it is very unlikely that θ ever switches away from L, providing a bound on their frequency of playing R. On the other hand, as their effective discount factor increases, θ will spend arbitrarily many periods of its early life playing R in hopes of getting the best payoff of 5, lacking the payoff knowledge that a is conditionally dominated for the receivers after R. Receivers therefore end up learning that R-senders have type θ.

Conclusion

This paper studies non-equilibrium learning about other players’ strategies in the setting of signaling games. When the agents’ prior beliefs about their opponents’ play reflect prior knowledge of others’ payoff functions, the steady states of societies of Bayesian learners can be bounded by two equilibrium refinements, RCE and uRCE, that nest and resemble divine equilibrium. Divine equilibrium and RCE are only defined for signaling games. In general extensive-form games, agents may find it optimal to play strictly dominated strategies as experiments to learn about the consequences of their other strategies, so requiring prior beliefs to be supported on opponents’ undominated strategies can lead to situations where agents observe play that they had assigned zero prior probability. We leave the associated complications for future work.

References

Banks, J. S. and J. Sobel (1987): “Equilibrium Selection in Signaling Games,” Econometrica, 55, 647–661.

Cho, I.-K. and D. M. Kreps (1987): “Signaling Games and Stable Equilibria,” Quarterly Journal of Economics, 102, 179–221.

Esponda, I. and D. Pouzo (2016): “Berk-Nash Equilibrium: A Framework for Modeling Agents With Misspecified Models,” Econometrica, 84, 1093–1130.

Fudenberg, D. and K. He (2018): “Learning and Type Compatibility in Signaling Games,” Econometrica, 86, 1215–1255.

Fudenberg, D. and D. M. Kreps (1993): “Learning Mixed Equilibria,” Games and Economic Behavior, 5, 320–367.

Fudenberg, D. and D. K. Levine (1993): “Steady State Learning and Nash Equilibrium,” Econometrica, 61, 547–573.

Fudenberg, D. and D. K. Levine (2006): “Superstition and Rational Learning,” American Economic Review, 96, 630–651.

Kalai, E. and E. Lehrer (1993): “Rational Learning Leads to Nash Equilibrium,” Econometrica, 61, 1019–1045.

Kohlberg, E. and J.-F. Mertens (1986): “On the Strategic Stability of Equilibria,” Econometrica, 54, 1003–1037.

Sobel, J., L. Stole, and I. Zapater (1990): “Fixed-Equilibrium Rationalizability in Signaling Games,” Journal of Economic Theory, 52, 304–331.

Van Damme, E. (1987): Stability and Perfection of Nash Equilibria, Springer-Verlag.

A Appendix

A.1 Proof of Proposition 1

Proof.

To show (1), suppose $\theta' \succsim_{s'} \theta''$ and $\theta'' \succsim_{s'} \theta'''$. For any $\pi_2 \in \Pi_2^{\bullet}$ where $s'$ is weakly optimal for $\theta'''$, it must be strictly optimal for $\theta''$, hence also strictly optimal for $\theta'$. This shows $\theta' \succsim_{s'} \theta'''$.

To establish (2), partition the set of rational receiver strategies as $\Pi_2^{\bullet} = \Pi_2^{+} \cup \Pi_2^{0} \cup \Pi_2^{-}$, where the three subsets refer to receiver strategies that make $s'$ strictly better, indifferent, or strictly worse than the best alternative signal for $\theta''$. If the set $\Pi_2^{0}$ is nonempty, then $\theta' \succsim_{s'} \theta''$ implies $\theta'' \not\succsim_{s'} \theta'$. This is because against any $\pi_2 \in \Pi_2^{0}$, signal $s'$ is strictly optimal for $\theta'$ but only weakly optimal for $\theta''$. At the same time, if both $\Pi_2^{+}$ and $\Pi_2^{-}$ are nonempty, then $\Pi_2^{0}$ is nonempty. This is because both $\pi_2 \mapsto u_1(\theta'', s', \pi_2(\cdot \mid s'))$ and $\pi_2 \mapsto \max_{s'' \neq s'} u_1(\theta'', s'', \pi_2(\cdot \mid s''))$ are continuous functions, so for any $\pi_2^{+} \in \Pi_2^{+}$ and $\pi_2^{-} \in \Pi_2^{-}$, there exists $\alpha \in (0, 1)$ so that $\alpha \pi_2^{+} + (1 - \alpha) \pi_2^{-} \in \Pi_2^{0}$. (Note that $\pi_2^{+}$ and $\pi_2^{-}$ must be supported on $A^{\mathrm{BR}}_{s}$ after every signal $s$, so the same must hold for the mixture $\alpha \pi_2^{+} + (1 - \alpha) \pi_2^{-}$. Thus, this mixture also belongs to $\Pi_2^{\bullet}$.) If only $\Pi_2^{+}$ is nonempty and $\theta' \succsim_{s'} \theta''$, then $s'$ is rationally strictly dominant for both $\theta'$ and $\theta''$. If only $\Pi_2^{-}$ is nonempty, then we can have $\theta'' \succsim_{s'} \theta'$ only when $s'$ is never a weak best response for $\theta'$ against any $\pi_2 \in \Pi_2^{\bullet}$.

A.2 Proof of Proposition 2

Proof.

Let $\pi^*$ be a uRCE. We construct a path-equivalent RCE, $\pi^{\circ}$, as follows. Set $\pi_1^{\circ} = \pi_1^*$ and set $\pi_2^{\circ}(\cdot \mid s) = \pi_2^*(\cdot \mid s)$ for every on-path signal $s$. At each off-path signal $s$ where $\tilde{J}(s, \pi^*) \neq \emptyset$, let $\pi_2^{\circ}(\cdot \mid s)$ prescribe some best response to a belief in $\tilde{P}(s, \pi^*)$. At each off-path signal $s$ where $\tilde{J}(s, \pi^*) = \emptyset$, let $\pi_2^{\circ}(\cdot \mid s)$ prescribe some best response to a belief in $\Delta(\Theta_s)$.

In this strategy profile, the receiver’s play is a best response to rationality-compatible beliefs after every off-path $s$ by construction, and because the sender’s play is the same as before, the receiver is still playing best responses to on-path signals.

Because the on-path play of the receivers did not change, no sender type wishes to deviate to any on-path signal. Now we check that no sender type wishes to deviate to any off-path signal. Consider first off-path $s$ where $\tilde{J}(s, \pi^*) \neq \emptyset$. Here we have $\tilde{J}(s, \pi^*) \subseteq \Theta_s$, which implies that $\tilde{P}(s, \pi^*) \subseteq \hat{P}(s)$. By the definition of uRCE, $\pi_2^{\circ}(\cdot \mid s)$ must deter every type from deviating to such $s$. Finally, no sender type wishes to deviate to any $s$ where $\tilde{J}(s, \pi^*) = \emptyset$, by the definition of equilibrium dominance.

A.3 Proof of Proposition 3

Proof.

Fix a $\pi_1$ with $\pi_1(s' \mid \theta'') > 0$ and $\pi_1(s' \mid \theta') < 1$. Because the space of rational receiver strategies $\Pi_2^{\bullet}$ is convex, it suffices to show there is no receiver strategy $\pi_2 \in \Pi_2^{\bullet}$ such that $\pi_1$ is a best response to $\pi_2$ in the ex-ante strategic form. If $\pi_1$ is an ex-ante best response, then it needs to be at least weakly optimal for type $\theta''$ to play $s'$ against $\pi_2$. By $\theta' \succsim_{s'} \theta''$, this implies $s'$ is strictly optimal for type $\theta'$. This shows $\pi_1$ is not a best response to $\pi_2$, as the sender can increase her ex-ante expected payoffs by playing $s'$ with probability 1 when her type is $\theta'$.

A.4 Proof of Proposition 4

Proof.

Suppose $\pi^*$ does not pass the Intuitive Criterion. Then there exists a type $\theta$ and a signal $s$ such that

$$u_1(\theta; \pi^*) < \min_{a \in \mathrm{BR}(\Delta(\tilde{J}(s, \pi^*)), s)} u_1(\theta, s, a).$$

If $\pi^*$ were an RCE, then we would have $\pi_2^*(\cdot \mid s) \in \Delta(\mathrm{BR}(\tilde{P}(s, \pi^*), s))$. Since $\tilde{P}(s, \pi^*) \subseteq \Delta(\tilde{J}(s, \pi^*))$, we have

$$u_1(\theta; \pi^*) < u_1(\theta, s, \pi_2^*(\cdot \mid s)).$$

This means $\pi^*$ is not a Nash equilibrium, a contradiction.

A.5 Proof of Proposition 5

Proof.

To show (a), note first that if $D(\theta'', s'; \pi^*) \cup D^{\circ}(\theta'', s'; \pi^*) = \emptyset$ the conclusion holds vacuously. If $D(\theta'', s'; \pi^*) \cup D^{\circ}(\theta'', s'; \pi^*)$ is not empty, take any $\alpha \in D(\theta'', s'; \pi^*) \cup D^{\circ}(\theta'', s'; \pi^*)$ and define $\pi_2 \in \Pi_2^{\bullet}$ by $\pi_2(\cdot \mid s') = \alpha$, $\pi_2(\cdot \mid s) = \pi_2^*(\cdot \mid s)$ for $s \neq s'$. Then

$$u_1(\theta''; \pi^*) = \max_{s \neq s'} u_1(\theta'', s, \pi_2(\cdot \mid s)) \le u_1(\theta'', s', \pi_2(\cdot \mid s')) = u_1(\theta'', s', \alpha),$$

and when $\theta' \succsim_{s'} \theta''$, this implies that

$$u_1(\theta'; \pi^*) = \max_{s \neq s'} u_1(\theta', s, \pi_2(\cdot \mid s)) < u_1(\theta', s', \pi_2(\cdot \mid s')) = u_1(\theta', s', \alpha).$$

Hence $\alpha \in D(\theta', s'; \pi^*)$.

To show (b), suppose $\pi^*$ is a divine equilibrium. Then it is a Nash equilibrium, and furthermore for any off-path signal $s'$ where $\theta' \succsim_{s'} \theta''$, Proposition 5(a) implies that $D(\theta'', s'; \pi^*) \cup D^{\circ}(\theta'', s'; \pi^*) \subseteq D(\theta', s'; \pi^*)$. Since $\pi^*$ is a divine equilibrium, $\pi_2^*(\cdot \mid s')$ must then best respond to some belief $p \in \Delta(\Theta)$ with $\frac{p(\theta'')}{p(\theta')} \le \frac{\lambda(\theta'')}{\lambda(\theta')}$. Considering all $(\theta', \theta'')$ pairs, we see that in a divine equilibrium $\pi_2^*(\cdot \mid s')$ best responds to some belief in

$$\bigcap_{(\theta', \theta'') \text{ s.t. } \theta' \succsim_{s'} \theta''} P_{\theta' \triangleright \theta''}.$$

At the same time, in every divine equilibrium, belief after off-path $s'$ puts zero probability on equilibrium-dominated types, meaning $\pi_2^*(\cdot \mid s')$ best responds to $\Delta(\tilde{J}(s', \pi^*))$. This shows $\pi^*$ is an RCE.

A.6 Proof of Proposition 6

Proof.

Consider a uRCE $\pi^*$. For every off-path $s$, perform the following modifications on $\pi_2^*(\cdot \mid s)$: if the first-round application of the NWBR procedure would have deleted every type, then do not modify $\pi_2^*(\cdot \mid s)$. Otherwise, find some $\theta_s$ not deleted by the iterated NWBR procedure, then change $\pi_2^*(\cdot \mid s)$ to some action in $\mathrm{BR}(\{\theta_s\}, s)$, i.e. a best response to the belief putting probability 1 on $\theta_s$.

This modified strategy profile passes the NWBR test. We now establish that it remains a uRCE by checking that for those off-path $s$ where $\pi_2^*(\cdot \mid s)$ was modified, the modified version is still a best response to $\hat{P}(s)$. (By uniformity, this would ensure that the modified receiver play continues to deter every type from deviating to $s$.)

Type $\theta_s$ satisfies $\theta_s \in \Theta_s$. Otherwise, $D^{\circ}(\theta_s, s; \pi^*) = \emptyset$ and $\theta_s$ would have been deleted by NWBR in the first round. Now it suffices to argue there is no $\theta$ such that $\theta \succsim_{s} \theta_s$, which implies the belief putting probability 1 on $\theta_s$ is in $\hat{P}(s)$. If there were such $\theta$, by Proposition 5(a) we would have $D^{\circ}(\theta_s, s; \pi^*) \subseteq D(\theta, s; \pi^*)$, so $\theta_s$ should have been deleted by NWBR in the first round, contradicting the fact that $\theta_s$ survives all iterations of the NWBR procedure.

A.7 Proof of Corollary 1

Proof.

This follows from Proposition 6 because every NWBR equilibrium is a universally divine equilibrium.
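The chain of implications behind Corollary 1 can be written schematically as below. This is only a restatement of the two proofs above, not an additional result; $\pi'$ is a label introduced here (not the paper's notation) for the modified profile built in the proof of Proposition 6, which is path-equivalent to $\pi^*$ because only off-path receiver responses are changed.

```latex
\[
\pi^* \text{ is a uRCE}
\;\Longrightarrow\;
\exists\, \pi' \text{ path-equivalent to } \pi^* \text{ that is a uRCE passing the NWBR test}
\;\Longrightarrow\;
\pi' \text{ is a universally divine equilibrium.}
\]
```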

A.8 Proof of Lemma 2

Proof.

Here are three lemmas from Fudenberg and Levine (2006):

FL06 Lemma A.1: Suppose {X_k} is a sequence of i.i.d. Bernoulli random variables with E[X_k] = µ, and define for each n the random variable S_n := |Σ_{k=1}^{n} (X_k − µ)| / n. Then for any n̲, n̄ ∈ ℕ, P[ max_{n̲ ≤ n ≤ n̄} S_n > ε ] ≤ · n · µ ε .

FL06 Lemma A.2: For all ε, ε′ > 0, there is an N > 0 so that for all δ, γ, g, π, signal s and action a ∈ A, ψ_θ^{π;(g,δ,γ)}{ y_θ : |π̂(a | s; y_θ) − π(a | s)| > ε, #(s | y_θ) > N } < ε′. (Here, π̂(a | s; y_θ) is the empirical frequency of receiver playing a after signal s in history y_θ, that is to say π̂(a | s; y_θ) = #((a, s), y_θ) / #(s, y_θ).)

FL06 Lemma A.4: For all ε, ε′ > 0 and δ < 1, there exists N such that for all π, g, and γ, we get ψ_θ^{π;(g,δ,γ)}{ y_θ ∉ Y_θ(ε), #(σ_θ(y_θ), y_θ) > N } ≤ ε′, where Y_θ(ε) ⊆ Y_θ are those histories y_θ where max_{s ∈ S} u(θ, s | y_θ) ≤ u(σ_θ(y_θ) | y_θ) + ε, that is, type θ is playing a myopic ε best response according to posterior belief after history y_θ.

Now we proceed with our argument.

Since π* is strict on-path, there exist ξ, ξ > π satisfies |π(a | s) − π*(a | s)| ≤ ξ for every on-path s and action a, while for every off-path s we have π(Ã(s) | s) ≥ 1 − ξ, then for each type θ we get u(θ, π*(θ), π) > ξ + max_{s ≠ π*(θ)} u(θ, s, π_R). That is, if receiver plays ξ-close to π* on-path and ξ-close to Ã(s) off-path, then for every type of sender, playing the prescribed equilibrium signal is strictly better than any other signal by at least ξ > g such that whenever sender has fewer than n := 2/ξ observations of playing signal s, her belief as to receiver’s probability of taking action a after signal s differs from π*(a | s) by no more than ξ if s is on-path, while her belief as to the probability that receiver strategy assigns to Ã(s) is at least 1 − ξ if s is off-path. Also, let ε_off := ξ/ . Let δ ∈ (0, 1) and 0 < ε < ε_off be given. We construct γ(δ, ε) satisfying the conclusion of the lemma.

To do this, in FL06 Lemma A.4 put ε = ξ and ε′ = ε/6, to obtain N_1(ε). Next, in FL06 Lemma A.2 put ε = ξ/ and ε′ = ε/6, to obtain N_2(ε). Let N(ε) := N_1(ε) ∨ N_2(ε). There are 5 classes of exceptional histories for type θ that can lead to playing some signal ŝ other than the one prescribed by the equilibrium strategy, s* := π*(θ).

Exception 1: θ has played ŝ fewer than N(ε) times before, that is σ_θ(y_θ) = ŝ but #(ŝ, y_θ) < N(ε). Such histories can be made to have mass no larger than ε/ γ(δ, ε) large enough.

Exception 2: y_θ is in the exceptional set described in FL06 Lemma A.4. But by choice of N(ε) ≥ N_1(ε), we know that ψ_θ^{π;(g,δ,γ)}{ y_θ ∉ Y_θ(ξ), #(σ_θ(y_θ), y_θ) > N(ε) } ≤ ε/6.

Exception 3: θ has played ŝ more than N(ε) times, but has a misleading sample. By FL06 Lemma A.2, ψ_θ^{π;(g,δ,γ)}{ y_θ : |π̂(a | ŝ; y_θ) − π(a | ŝ)| > ξ/ , #(ŝ | y_θ) > N(ε) } < ε/6. Since π ∈ B_2^on(π*, ε) ∩ B_2^off(π*, ε_off), we know π differs from π* by no more than ε_off = ξ/ ξ/ A(s) after off-path signal s. So in particular, ψ_θ^{π;(g,δ,γ)}{ y_θ : |π̂(a | ŝ; y_θ) − π*(a | ŝ)| > ξ if ŝ on-path, or π̂(Ã(ŝ) | ŝ) < 1 − ξ if ŝ off-path, #(ŝ | y_θ) > N(ε) } < ε/6.

Exception 4: θ has played the equilibrium signal s* more than N(ε) times, but has a misleading sample. As before, we get ψ_θ^{π;(g,δ,γ)}{ y_θ : |π̂(a | s*; y_θ) − π*(a | s*)| > ξ, #(s* | y_θ) > N(ε) } < ε/6.

Exception 5: θ has played the equilibrium signal s* between n and N(ε) times, but has a misleading sample. Let X_k ∈ {0, 1} denote whether θ sees the equilibrium response π*(s*) the k-th time she plays s* (X_k = 0) or whether she sees instead a different response (X_k = 1). As in FL06 Lemma A.1, define S_n := |Σ_{k=1}^{n} (X_k − µ)| / n, where µ = 1 − π(π*(s*) | s*) < ε since s* is an on-path signal in π*. The probability that the fraction of responses other than π*(s*) exceeds ξ between the n-th time and N(ε)-th time that θ plays s* is bounded above by FL06 Lemma A.1: P[ max_{n ≤ n ≤ N(ε)} S_n > ξ/ ] ≤ · n · µ (ξ/ ≤ · µ (by choice of n) ≤ ε/ .

Finally, at a history y_θ that does not belong to those exceptions, we must have σ_θ(y_θ) = s*. This is because y_θ is not in exception 1, so θ has played σ_θ(y_θ) at least N(ε) times before, and it is not in exception 2, so σ_θ(y_θ) is a ξ myopic best response to current beliefs. Yet the empirical frequency for response after signal σ_θ(y_θ) is no more than ξ away from π*(σ_θ(y_θ)) as y_θ is not in exception 3. Since the prior is Dirichlet and also has this property, this means the current posterior belief about response after signal σ_θ(y_θ) also has this property. If #(s*, y_θ) > n, then y_θ not being in exceptions 4 or 5 implies belief as to response after signal s* is also no more than ξ away from π*(s*), while if #(s*, y_θ) < n then choice of prior implies the same. In short, beliefs on both responses after s* and responses after σ_θ(y_θ) are no more than ξ away from their π* counterparts. But in that case, no signal other than s* can be a ξ best response.

A.9 Proof of Lemma 3

Proof.

For each ξ > 0, consider the approximation to P_{θ▷θ′},

P^ξ_{θ▷θ′} := { p ∈ Δ(Θ) : p(θ′)/p(θ) ≤ (1 + ξ) · λ(θ′)/λ(θ) },

and hence the approximation to P̂(s),

P̂^ξ(s) := Δ(Θ_s) ∩ ⋂ { P^ξ_{θ▷θ′} : θ ≿_s θ′ }.

Since the BR correspondence has a closed graph, there is an ξ > 0 so that BR(P̂^ξ(s), s) = BR(P̂(s), s). Take some such ξ. Next we will choose a series of constants.

• Pick 0 < h < 1 so that (1 − h)/(1 + h) > (1 − ξ) / .

• Pick G > θ ∈ Θ, 1/(h · G · (1 − h) · λ(θ)) < ε/(4 · |S| · |Θ|).

• For each θ, construct a Dirichlet prior on S θ with parameters α(θ, s) ≥ α(θ, s) ≥ θ ≿_s θ′, we have

α(θ, s) − α(θ′, s) > ( √((4 · |S| · |Θ|)/ε) + 1 ) · G. (2)

In the event that θ ≿_s θ′ and θ′ ≿_s θ, put α(θ, s) = α(θ′, s).

• Pick N̲ ∈ ℕ so that for any N > N̲, θ, θ′ ∈ Θ, we have P[ (1 − h) · N · λ(θ) ≤ Binom(N, λ(θ)) ≤ (1 + h) · N · λ(θ) ] > − ε · |Θ| and

(1 − h) · N · λ(θ′) / ( (1 + h) · N · λ(θ) + max_θ Σ_{s ∈ S} α(θ, s) ) > (1 − ξ) / · λ(θ′)/λ(θ).

• Pick γ̲ ∈ (0, 1) such that 1 − (γ̲)^{N̲+1} < ε/ .

Suppose the receiver’s prior over the strategy of type θ is Dirichlet with parameters (α(θ, s))_{s ∈ S}. We claim that the conclusion of the lemma holds.

Fix some strategy π ∈ C. Write #(θ | y) for the number of times the sender has been of θ type in history y, while #(θ, s | y) counts the number of times type θ has sent signal s in history y. Put ψ = ψ_2^{π;(g,δ,γ)} and write E ⊆ Y for those receiver histories with length at least N̲ satisfying

(1 − h) · N · λ(θ) ≤ #(θ | y) ≤ (1 + h) · N · λ(θ)

for every θ ∈ Θ. By the choice of N̲ and γ̲, whenever γ > γ̲ we have ψ(E) ≥ 1 − ε/2. We now show that given E, the conditional probability that the receiver’s posterior belief after every off-equilibrium signal s lies in P̂^ξ(s) is at least 1 − ε/2. To do this, fix signal s and two types with θ ≿_s θ′.

If s is strictly dominated for both θ and θ′, then according to the receivers’ Dirichlet prior, θ and θ′ each sends s with zero probability. Since π ∈ Π•, we have π(s | θ) = π(s | θ′) = 0. So after every positive-probability history, receiver’s belief falls in P̂^ξ(s) as it puts zero probability on the s-sender being θ or θ′. Henceforth we only consider the case where s is not strictly dominated for both.

After history y, the receiver’s updated posterior likelihood ratio for types θ and θ′ upon seeing signal s is

(λ(θ)/λ(θ′)) · [ (α(θ, s) + #(θ, s | y)) / (#(θ | y) + Σ_{s ∈ S} α(θ, s)) ] / [ (α(θ′, s) + #(θ′, s | y)) / (#(θ′ | y) + Σ_{s ∈ S} α(θ′, s)) ]
= (λ(θ)/λ(θ′)) · [ (α(θ, s) + #(θ, s | y)) / (α(θ′, s) + #(θ′, s | y)) ] · [ (#(θ′ | y) + Σ_{s ∈ S} α(θ′, s)) / (#(θ | y) + Σ_{s ∈ S} α(θ, s)) ].

Since we have #(θ′ | y) ≥ (1 − h) · N · λ(θ′) while #(θ | y) ≤ (1 + h) · N · λ(θ), we get

(#(θ′ | y) + Σ_{s ∈ S} α(θ′, s)) / (#(θ | y) + Σ_{s ∈ S} α(θ, s)) ≥ (1 − h) · N · λ(θ′) / ( (1 + h) · N · λ(θ) + Σ_{s ∈ S} α(θ, s) ) > (1 − ξ) / · λ(θ′)/λ(θ).

If s is strictly dominant for both θ and θ′, then π ∈ Π• means that π(s | θ) = π(s | θ′) = 1. In this case, #(θ, s | y) = #(θ | y) and #(θ′, s | y) = #(θ′ | y). Since #(θ | y) ≥ (1 − h) · N · λ(θ) and #(θ′ | y) ≤ (1 + h) · N · λ(θ′), we have:

(α(θ, s) + #(θ, s | y)) / (α(θ′, s) + #(θ′, s | y)) ≥ (1 − h) · N · λ(θ) / ( Σ_{s ∈ S} α(θ′, s) + (1 + h) · N · λ(θ′) ) ≥ (1 − ξ) / λ(θ)/λ(θ′).

This shows the product is no smaller than (1 − ξ) / λ(θ)/λ(θ′), so receiver believes in P^ξ_{θ▷θ′} after every history in E.

Now we analyze the term (α(θ, s) + #(θ, s | y)) / (α(θ′, s) + #(θ′, s | y)) for the case where s is not strictly dominant for both θ and θ′. We consider two cases, depending on whether N is “large enough” so that the compatible type θ experiments enough on average in a receiver history of length N under sender strategy π.

Case A: π(s | θ) · N < G. In this case, since π ∈ C and θ ≿_s θ′, we must also have π(s | θ′) · N < G. Then #(θ′, s | y) is distributed as a binomial random variable with mean smaller than G, hence standard deviation smaller than √G. By Chebyshev’s inequality, the probability that it exceeds ( √((4 · |S| · |Θ|)/ε) + 1 ) · G is no larger than

1 / ( G · (4 · |S| · |Θ|)/ε ) < ε |S| · |Θ| .

But in any history y where #(θ′, s | y) does not exceed this number, we would have

α(θ′, s) + #(θ′, s | y) ≤ α(θ, s) ≤ α(θ, s) + #(θ, s | y)

by choice of the difference between prior parameters α(θ′, s) and α(θ, s). Therefore (α(θ, s) + #(θ, s | y)) / (α(θ′, s) + #(θ′, s | y)) ≥ 1. In summary, under Case A, there is probability no smaller than 1 − ε |S| · |Θ| that (α(θ, s) + #(θ, s | y)) / (α(θ′, s) + #(θ′, s | y)) ≥ .

Case B: π(s | θ) · N ≥ G. In this case, we can bound the probability that

#(θ, s | y) / #(θ′, s | y) ≤ (λ(θ)/λ(θ′)) · ( (1 − h)/(1 + h) ).

Let p := π(s | θ). Given that #(θ | y) ≥ (1 − h) · N · λ(θ), the distribution of #(θ, s | y) first order stochastically dominates Binom((1 − h) · N · λ(θ), p). On the other hand, given that #(θ′ | y) ≤ (1 + h) · N · λ(θ′) and furthermore π(s | θ′) ≤ π(s | θ) = p, the distribution of #(θ′, s | y) is first order stochastically dominated by Binom((1 + h) · N · λ(θ′), p).

The first distribution has mean (1 − h) · N · λ(θ) · p with standard deviation no larger than √((1 − h) · N · λ(θ) · p). Thus

P[ Binom((1 − h) · N · λ(θ), p) < (1 − h) · (1 − h) · N · λ(θ) · p ] < 1/( h · √(p(1 − h)Nλ(θ)) ) ≤ 1/( h · √(G · (1 − h) · λ(θ)) ) < ε/(4 · |S| · |Θ|),

where we used the fact that pN ≥ G in the second-to-last inequality, while the choice of G ensured the final inequality.

At the same time, the second distribution has mean (1 + h) · N · λ(θ′) · p with standard deviation no larger than √((1 + h) · N · λ(θ′) · p), so

P[ Binom((1 + h) · N · λ(θ′), p) > (1 + h) · (1 + h) · N · λ(θ′) · p ] < 1/( h · √(p(1 + h)Nλ(θ′)) ) ≤ 1/( h · √(G · (1 + h) · λ(θ′)) ) < ε/(4 · |S| · |Θ|)

by the same arguments. Combining the bounds on these two binomial random variables,

P[ Binom((1 − h) · N · λ(θ), p) / Binom((1 + h) · N · λ(θ′), p) ≤ (λ(θ)/λ(θ′)) · ( (1 − h)/(1 + h) ) ] < ε/(2 · |S| · |Θ|),

and a fortiori

P[ #(θ, s | y) / #(θ′, s | y) ≤ (λ(θ)/λ(θ′)) · ( (1 − h)/(1 + h) ) ] < ε/(2 · |S| · |Θ|).

Therefore, for any s, θ, θ′ such that θ ≿_s θ′,

ψ( y : (α(θ, s) + #(θ, s | y)) / (α(θ′, s) + #(θ′, s | y)) ≥ (λ(θ)/λ(θ′)) · ( (1 − h)/(1 + h) ) | E ) ≥ 1 − ε/(2 · |S| · |Θ|).
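The Chebyshev step used in Case B can be sanity-checked numerically. The sketch below is illustrative only: it assumes the standard Chebyshev form P[X < (1 − h)·n·p] ≤ 1/(h²·n·p) for X ~ Binom(n, p), obtained from Var(X) ≤ n·p, and it treats n as an integer. The function names and the parameter values are hypothetical and do not come from the paper.

```python
from math import ceil, comb


def binom_lower_tail(n: int, p: float, threshold: float) -> float:
    """Exact P[X < threshold] for X ~ Binom(n, p)."""
    k_max = ceil(threshold) - 1  # largest integer strictly below threshold
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(0, k_max + 1))


def chebyshev_bound(n: int, p: float, h: float) -> float:
    """Chebyshev bound on P[|X - n*p| >= h*n*p], using Var(X) <= n*p."""
    return 1.0 / (h**2 * n * p)


# Hypothetical illustration values (not taken from the paper).
n, p, h = 200, 0.25, 0.5
exact = binom_lower_tail(n, p, (1 - h) * n * p)  # P[X < 25]
bound = chebyshev_bound(n, p, h)                 # 1 / (0.25 * 50) = 0.08
print(f"exact tail = {exact:.3e}, Chebyshev bound = {bound:.3f}")
```

Under the same assumption, taking n = (1 − h)·N·λ(θ) and using p·N ≥ G gives a bound of at most 1/(h²·G·(1 − h)·λ(θ)), which is how the choice of G enters the argument.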
This concludes Case B.

In either case, at a history y with (1 − h) · N · λ(θ) ≤ #(θ | y) ≤ (1 + h) · N · λ(θ) for every θ, for every pair θ, θ′ such that θ ≿_s θ′, we get (α(θ, s) + #(θ, s | y)) / (α(θ′, s) + #(θ′, s | y)) ≥ (λ(θ)/λ(θ′)) · ( (1 − h)/(1 + h) ) with probability at least 1 − ε/(2 · |S| · |Θ|). But at any history y where this happens, the receiver’s posterior likelihood ratio for types θ and θ′ after signal s satisfies

(λ(θ)/λ(θ′)) · [ (α(θ, s) + #(θ, s | y)) / (α(θ′, s) + #(θ′, s | y)) ] · [ (#(θ′ | y) + Σ_{s ∈ S} α(θ′, s)) / (#(θ | y) + Σ_{s ∈ S} α(θ, s)) ]
≥ (λ(θ)/λ(θ′)) · ( (λ(θ)/λ(θ′)) · ( (1 − h)/(1 + h) ) ) · (1 − ξ) / · λ(θ′)/λ(θ)
≥ (λ(θ)/λ(θ′)) · (1 − ξ) / · (1 − ξ) /
≥ (λ(θ)/λ(θ′)) · (1 − ξ).

As there are at most |Θ| such pairs for each signal s and |S| total signals,

ψ( y : (λ(θ)/λ(θ′)) · [ (α(θ, s) + #(θ, s | y)) / (α(θ′, s) + #(θ′, s | y)) ] · [ (#(θ′ | y) + Σ_{s ∈ S} α(θ′, s)) / (#(θ | y) + Σ_{s ∈ S} α(θ, s)) ] ≥ (λ(θ)/λ(θ′)) · (1 − ξ) ∀ s, θ ≿_s θ′ | E ) ≥ 1 − ε/2.

Since E has ψ-probability no smaller than 1 − ε/2, there is ψ-probability at least 1 − ε that receiver’s posterior belief is in P̂^ξ(s) after every off-path s.

A.10 Proof of Lemma 4

Proof.
2, thereis ψ probability at least 1 − (cid:15) that receiver’s posterior belief is in ˆ P ξ ( s ) afterevery oﬀ-path s . 47 .10 Proof of Lemma 4 Proof.

Since π* is on-path strict for the receiver, there exists some ξ > 0 such that for every on-path signal s and every belief p ∈ Δ(Θ) with

|p(θ) − p(θ; s, π*)| < ξ, ∀ θ ∈ Θ (3)

(where p(·; s, π*) is the Bayesian belief after on-path signal s induced by the equilibrium π*), we have BR(p, s) = {π*(s)}. For each s, we show that there is a large enough N(s, ε) and small enough ζ(s) so that when receiver observes history y generated by any π ∈ B^on(π*, ε) with ε < ζ(s)/ N(s, ε), there is probability at least 1 − ε |S| that receiver’s posterior belief satisfies (3). Hence, conditional on having a history length of at least N(s, ε), there is 1 − ε |S| chance that receiver will play as in π* after s. By taking the maximum N*(ε) := max_s N(s, ε) and minimum ε := min_s ζ(s), we see that whenever history is length N*(ε) or more, and π ∈ B^on(π*, ε) with ε < ε, there is at least 1 − ε/ π* after every on-path signal. Since we can pick γ(ε) large enough that 1 − ε/ N*(ε) or older, we are done.

To construct N(s, ε) and ζ(s), let Λ(s) := λ{θ : π*(s | θ) = 1}. Find small enough ζ(s) ∈ (0, 1) so that:

• | λ(θ) / (Λ(s) · (1 − ζ(s))) − λ(θ)/Λ(s) | < ξ

• | λ(θ) · (1 − ζ(s)) / (Λ(s) + (1 − Λ(s)) · ζ(s)) − λ(θ)/Λ(s) | < ξ

• (ζ(s) / (1 − ζ(s))) · (λ(θ)/Λ(s)) < ξ

for every θ ∈ Θ. After a history y, the receiver’s posterior belief as to the type of sender who sends signal s satisfies

p(θ | s; y) ∝ λ(θ) · (#(θ, s | y) + α(θ, s)) / (#(θ | y) + A(θ)),

where α(θ, s) is the Dirichlet prior parameter on signal s for type θ and A(θ) := Σ_{s ∈ S} α(θ, s). By the law of large numbers, for long enough history length, we can ensure that if π(s | θ) > 1 − ζ(s)/4, then

(#(θ, s | y) + α(θ, s)) / (#(θ | y) + A(θ)) ≥ 1 − ζ(s)

with probability at least 1 − ε |S|, while if π(s | θ) < ζ(s)/4, then

(#(θ, s | y) + α(θ, s)) / (#(θ | y) + A(θ)) < ζ(s)

with probability at least 1 − ε |S|. Moreover there is some N(s, ε) so that there is probability at least 1 − ε |S| that a history y with length at least N(s, ε) satisfies the above for all θ. But at such a history, for any θ such that π*(s | θ) = 1,

p(θ | s; y) ≥ λ(θ) · (1 − ζ(s)) / (Λ(s) + (1 − Λ(s)) · ζ(s))

and

p(θ | s; y) ≤ λ(θ) / (Λ(s) · (1 − ζ(s))),

while for some θ such that π*(s | θ) = 0,

p(θ | s; y) ≤ (ζ(s) / (1 − ζ(s))) · (λ(θ)/Λ(s)).

Therefore the belief p(· | s; y) is no more than ξ away from p(θ; s, π*), as desired.

A.11 Proof of Theorem 2

Proof.

We will construct a regular prior g. We will then show that for every 0 < δ < 1, there exist convex and compact sets of strategy profiles E_j ⊆ Π• with E_j ↓ E* ⊆ B_1^on(π*, 0) ∩ B_2^on(π*, 0) and a corresponding sequence of survival probabilities γ_j → 1 so that (R_1^{g,δ,γ_j}[π_2], R_2^{g,δ,γ_j}[π_1]) ∈ E_j whenever π ∈ E_j. We proved in Fudenberg and He (2018) that R_1 and R_2 are continuous maps, so a fixed point theorem implies that for each j, some strategy profile in E_j is a steady state profile under parameters (g, δ, γ_j). Any convergent subsequence of these j-indexed steady state profiles has a limit in E*, so this limit agrees with π* on path. This shows that for every δ there is a δ-stable strategy profile path-equivalent to π*, so there is a patiently stable strategy profile with the same property.

Step 1: Constructing g and some thresholds.

Since π* induces a unique optimal signal for each sender type, by Lemma 2 find a regular sender prior g, < ε_off < 0, and a function γ_LM1(δ, ε). In Lemma 3, substitute ε = ε_off to find a regular receiver prior g and 0 < γ_LM2 < g be as constructed above to find ε_LM3 > γ_LM3(ε).

Step 2: Constructing the sets E_j.

For each j, let

E_j := C ∩ B_1^on(π*, (ε_off ∧ ε_LM3)/j) ∩ B_2^on(π*, (ε_off ∧ ε_LM3)/j) ∩ B_2^off(π*, ε_off).

That is, E_j is the set of strategy profiles that respect rational compatibility, differ by no more than ε_off/j from π* on path, and differ by no more than ε_off from π* off path. It is clear that each E_j is convex and compact, and that lim_{j→∞} E_j ⊆ B_1^on(π*, 0) ∩ B_2^on(π*, 0) as claimed.

We may find an accompanying sequence of survival probabilities satisfying

γ_j > γ_LM1(δ, (ε_off ∧ ε_LM3)/j) ∨ γ_LM2 ∨ γ_LM3((ε_off ∧ ε_LM3)/j)

with γ_j ↑ 1.

Step 3: R^{g,δ,γ_j} maps E_j into itself.

Let some π ∈ E_j be given.

By Lemma 1, R_1^{g,δ,γ_j}[π_2] ∈ C.

By Lemma 3, R_2^{g,δ,γ_j}[π_1] ∈ B_2^off(π*, ε_off), because uniformity of π* means BR(P̂(s), s) ⊆ Ã(s) for each off-path s.

By Lemma 4, R_2^{g,δ,γ_j}[π_1] ∈ B_2^on(π*, (ε_off ∧ ε_LM3)/j).

Finally, from Lemma 2 and the fact that π_2 ∈ B_2^on(π*, (ε_off ∧ ε_LM3)/j) ∩ B_2^off(π*, ε_off), we have R_1^{g,δ,γ_j}[π_2] ∈ B_1^on(π*, ε_off ∧ ε_LM3 jj