Payoff Information and Learning in Signaling Games
Drew Fudenberg†, Kevin He‡
First version: August 31, 2017
This version: July 11, 2019
Abstract
We show how to add the assumption that players know their opponents’ payoff functions to the theory of learning in games, and use it to derive restrictions on signaling-game play in the spirit of divine equilibrium. In our learning model, agents are born into player roles and play the game against a random opponent each period. Inexperienced agents are uncertain about the prevailing distribution of opponents’ play, and update their beliefs based on their observations. Long-lived and patient senders experiment with every signal that they think might yield an improvement over their myopically best play. We show that divine equilibrium (Banks and Sobel, 1987) is nested between “rationality-compatible” equilibrium, which corresponds to an upper bound on the set of possible learning outcomes, and “uniform rationality-compatible” equilibrium, which provides a lower bound.
Keywords: learning, equilibrium refinements, bandit problems, payoff information, signaling games.
JEL classification codes: C72, C73, D83

∗ We thank Laura Doval, Glenn Ellison, Lorens Imhof, Yuichiro Kamada, Robert Kleinberg, David K. Levine, Kevin K. Li, Eric Maskin, Dilip Mookherjee, Harry Pei, Matthew Rabin, Bill Sandholm, Lones Smith, Joel Sobel, Philipp Strack, Bruno Strulovici, Tomasz Strzalecki, Jean Tirole, Juuso Toikka, and our seminar participants for helpful comments and conversations, and National Science Foundation grant SES 1643517 for financial support.
† Department of Economics, MIT. Email: [email protected]
‡ California Institute of Technology and University of Pennsylvania. Email: [email protected]

1 Introduction
Signaling games typically have many perfect Bayesian equilibria, because Bayes rule does not pin down the receiver’s off-path beliefs about the sender’s type. Different off-path beliefs for the receiver can justify different off-path receiver behaviors, which in turn sustain equilibria with a variety of on-path outcomes. For this reason, applied work using signaling games typically invokes some equilibrium refinement to obtain a smaller and (hopefully) more accurate subset of predictions.

However, most refinements impose restrictions on the equilibrium beliefs without any reference to the process that might lead to equilibrium. Our earlier paper Fudenberg and He (2018) provided a learning-theoretic foundation for the compatibility criterion (CC), based on the idea that “out of equilibrium” signals are not zero-probability events during learning, but instead arise as rare but positive-probability experiments by inexperienced patient senders trying to learn how the receivers respond to different signals. Unlike the classic refinement literature, we did not assume that agents know their opponents’ payoff functions. This paper discusses how ex-ante payoff information influences learning dynamics and learning outcomes, showing that the additional equilibrium restrictions that follow from this prior knowledge nest the divine equilibrium of Banks and Sobel (1987). In addition, we provide the first general sufficient condition for an outcome to emerge as the result of patient Bayesian learning in settings where the relative probabilities of different off-path experiments matter.

In our learning model, agents repeatedly play the same signaling game against random opponents each period. Agents are Bayesians who believe they face a fixed but unknown distribution of the opposing players’ strategies. Importantly, the senders hold independent beliefs about how receivers respond to different signals, so they cannot use the response to one signal to infer anything about the distribution of responses to a different signal. This introduces an exploration-exploitation trade-off, as each sender only observes the response to the one signal she sends each period. Long-lived and patient senders will therefore experiment with every signal that they think might yield a substantially higher payoff than the signal that is myopically best. The key to our results is that different types of senders have different incentives for experimenting with various signals, so that some of the sender types will send certain signals more often than other types do. Consequently, even though long-lived senders only experiment for a vanishingly small fraction of their lifetimes, the play of the long-lived receivers will be a best response to beliefs about the senders’ types that reflect this difference in experimentation probabilities.

Of course, the senders’ experimentation incentives depend on their prior beliefs about which receiver responses are plausible after each signal. In Fudenberg and He (2018), we assumed that learners are ignorant of others’ utility functions, and that the senders’ beliefs assign positive probability to the receivers playing actions that are not best responses to any belief about the sender’s type. In this paper, we instead assume that the players’ prior beliefs encode knowledge of their opponents’ payoff functions, so in particular the senders all assign zero probability to the event that the receivers use conditionally dominated strategies. Inexperienced senders with full-support beliefs about the receivers’ play may experiment with a signal in the hopes that the receivers respond with a certain favorable action, not knowing that this action will never be played as it is not a best response to any receiver belief.
With payoff information, even very patient senders will never undertake such experiments. Conversely, receivers know that no sender type would ever want to play a signal that does not best respond to any receiver strategy, because no possible response by the receiver would make playing that signal worthwhile. For this reason, the receivers’ beliefs after each signal assign probability zero to the types for whom that signal is dominated.

Priors with payoff information lead to additional restrictions on different types’ comparative experimentation frequencies, which can generate stronger restrictions on the receiver’s beliefs in some games. For instance, Example 2 considers a signaling game where two types of senders choose between a safe option Out that yields a known payoff, and a risky option In whose payoff depends on the receiver’s response. The receiver has three responses to In: Up, which is optimal against the strong sender; Down, which is optimal against the weak sender; and X, which is never optimal. We show that when priors encode payoff information, the strong types experiment more with In than the weak types do. But this comparison can be reversed when the senders do not know the receivers’ payoff functions, since the weak types like the X response more than the strong types do. In this game, the new refinement concept we propose based on payoff knowledge rules out a sequential-equilibrium outcome that passes the CC.

In some other games, payoff information expands the set of long-run learning outcomes for patient and long-lived learners. Example 3 shows a signaling game where no type with payoff information ever experiments with a certain signal, so the receivers’ beliefs and behavior after this signal are arbitrarily determined from their prior beliefs. On the other hand, when senders are ignorant of the receivers’ payoff functions, one sender type will experiment much more frequently with this signal than the other type, leading to a refinement of the receivers’ off-path beliefs after the signal.

In general, for learners starting with these priors with restricted supports, Theorem 1 shows that every patient learning outcome is consistent with “rational compatibility,” while Theorem 2 shows that every equilibrium satisfying a uniform version of rational compatibility and some strictness assumptions can arise as a patient learning outcome. As we show in Section 3, these belief restrictions resemble those imposed by divine equilibrium (Banks and Sobel, 1987): every divine equilibrium is also consistent with rational compatibility, and every equilibrium satisfying the uniform version of rational compatibility is universally divine.

This paper is most closely related to the work of Fudenberg and Levine (1993), Fudenberg and Levine (2006), and Fudenberg and He (2018) on patient learning by Bayesian agents who believe they face a steady-state distribution of play.
Except for the support of the agents’ priors, our learning model is exactly the same as that of Fudenberg and He (2018), and the proof of Theorem […] patiently stable, which means that it is the limit of play in a society of Bayesian agents as these agents become patient and long lived, for some non-doctrinaire prior beliefs. The proof of this sufficient condition for patient stability constructs a suitable prior and analyzes the corresponding patiently stable profiles. The only other constructive sufficient condition for strategy profiles to be patiently stable is Theorem 5.5 of Fudenberg and Levine (2006), which only applies to a subclass of perfect-information games. In such games the relative probabilities of various off-path actions do not matter, because each off-path experiment is perfectly revealed when it occurs. Indeed, the central lemma leading to Theorem 2 constructs a prior belief to ensure that the receivers correctly learn the relative frequencies that different types undertake various off-path experiments. This lemma deals with an issue specific to signaling games, and is not implied by any result in Fudenberg and Levine (2006). Our paper is also related to other models of Bayesian non-equilibrium learning, such as Kalai and Lehrer (1993) and Esponda and Pouzo (2016), and to the equilibrium concepts of the Intuitive Criterion (Cho and Kreps, 1987) and divine equilibrium (Banks and Sobel, 1987). One other contribution of this work relative to Fudenberg and He (2018) is that we compare our learning-based equilibrium refinements with these equilibrium refinements, both of which implicitly assume that players are certain of the payoff functions of their opponents.

Footnote (Example 2): This example is a simplified variant of Cho and Kreps (1987)’s beer-quiche game, with an extra conditionally dominated response for the receiver.
Footnote: “Constructive,” as opposed to proofs that rule out all but one equilibrium using necessary conditions and then appeal to an existence theorem for patiently stable steady states. Constructive sufficient conditions allow us to characterize learning outcomes more precisely in games where multiple equilibria satisfy the necessary conditions, such as Example 1.

2 Two Equilibrium Refinements for Signaling Games

A signaling game has two players, a sender (“she,” player 1) and a receiver (“he,” player 2). At the start of the game, the sender learns her type $\theta \in \Theta$, but the receiver only knows the sender’s type distribution $\lambda \in \Delta(\Theta)$. Next, the sender chooses a signal $s \in S$. The receiver observes $s$ and chooses an action $a \in A$ in response. We assume that $\Theta, S, A$ are finite and that $\lambda(\theta) > 0$ for all $\theta$. The players’ payoffs depend on the triple $(\theta, s, a)$. Let $u_1 : \Theta \times S \times A \to \mathbb{R}$ and $u_2 : \Theta \times S \times A \to \mathbb{R}$ denote the utility functions of the sender and the receiver, respectively.

Footnote: The notation $\Delta(X)$ means the set of all probability distributions on $X$.

For $P \subseteq \Delta(\Theta)$, we have
$$\mathrm{BR}(P, s) := \bigcup_{p \in P} \left( \arg\max_{a \in A} \mathbb{E}_{\theta \sim p}[u_2(\theta, s, a)] \right)$$
as the set of best responses to $s$ supported by some belief in $P$. Letting $P = \Delta(\Theta)$, the set $A_s^{\mathrm{BR}} := \mathrm{BR}(\Delta(\Theta), s) \subseteq A$ contains the receiver actions that best respond to some belief about the sender’s type after $s$. We say that actions in $A_s^{\mathrm{BR}}$ are conditionally undominated after signal $s$, and that actions in $A \setminus A_s^{\mathrm{BR}}$ are conditionally dominated after signal $s$. We denote by $\Pi_2^\bullet := \times_{s \in S} \Delta(A_s^{\mathrm{BR}})$ the rational receiver strategies; these are the strategies that assign probability 0 to conditionally dominated actions. The rational receiver strategies form a subset of $\Pi_2 := \times_{s \in S} \Delta(A)$, the set of all receiver strategies. A sender who knows the receiver’s payoff function expects the receiver to choose a strategy in $\Pi_2^\bullet$.

Footnote: Throughout we adopt the terminology “strategies” to mean behavior strategies, not mixed strategies.

A sender strategy $\pi_1 = (\pi_1(\cdot \mid \theta))_{\theta \in \Theta} \in \Pi_1$ specifies a distribution on $S$ for each type, $\pi_1(\cdot \mid \theta) \in \Delta(S)$. For a given $\pi_1$, signal $s$ is off the path of play if $\pi_1(s \mid \theta) = 0$ for all $\theta$. Let
$$S_\theta := \bigcup_{\pi_2 \in \Pi_2} \left( \arg\max_{s \in S} u_1(\theta, s, \pi_2(\cdot \mid s)) \right)$$
be the set of signals that best respond to some (not necessarily rational) receiver strategy for type $\theta$. Signals in $S \setminus S_\theta$ are dominated for type $\theta$, and $\Pi_1^\bullet := \times_{\theta} \Delta(S_\theta)$ denotes the rational sender strategies where no type ever sends a dominated signal. We also write $\Theta_s$ for the types $\theta$ for whom $s \in S$ is not dominated. A receiver who knows the sender’s payoff function expects the sender to choose a strategy in $\Pi_1^\bullet$ and only expects types in $\Theta_s$ to play signal $s$.

We now introduce rationality-compatible equilibrium (RCE) and uniform rationality-compatible equilibrium (uRCE), two refinements of Nash equilibrium in signaling games.

In Section 4, we develop a steady-state learning model where populations of senders and receivers, initially uncertain as to the aggregate play of the opponent population, undergo random anonymous matching each period to play the signaling game. We study the steady states when agents are patient and long lived, which we term “patiently stable.” Under some strictness assumptions, we show that only RCE can be patiently stable (Theorem 1) and that every uRCE is path-equivalent to a patiently stable profile (Theorem 2). Thus we provide a learning foundation for these solution concepts.

Our learning foundation will assume that agents know other agents’ utility functions and know that other agents are rational in the sense of playing strategies that maximize the corresponding expected utilities.
We will not however iteratively assume higher orders of payoff knowledge and rationality, so that we model “rationality” as opposed to “rationalizability.” In the learning model, this implies senders’ uncertainty about receivers’ play is always supported on $\Pi_2^\bullet$ instead of $\Pi_2$, and similarly receivers’ uncertainty about senders’ play is supported on $\Pi_1^\bullet$ instead of $\Pi_1$. In Section 2.3, we discuss heuristically how our solution concepts capture some of the ways in which payoff information affects learning outcomes. This discussion will later be formalized in the context of the learning model we develop in Section 4.

Footnote: It is straightforward to extend our results to priors that reflect higher-order knowledge of the rationality and payoff functions of the other player. The resulting equilibrium refinement always exists, and like RCE is implied by universal divinity. We do not include it here both because we are unaware of any interesting examples where the additional power has bite, and because we are skeptical about the hypothesis of iterated rationality.

Definition 1. Signal $s$ is more rationally-compatible with $\theta'$ than $\theta''$, written as $\theta' \succsim_s \theta''$, if for every $\pi_2 \in \Pi_2^\bullet$ such that
$$u_1(\theta'', s, \pi_2(\cdot \mid s)) \ge \max_{s' \ne s} u_1(\theta'', s', \pi_2(\cdot \mid s')),$$
we have
$$u_1(\theta', s, \pi_2(\cdot \mid s)) > \max_{s' \ne s} u_1(\theta', s', \pi_2(\cdot \mid s')).$$

In words, $\theta' \succsim_s \theta''$ means whenever $s$ is a weak best response for $\theta''$ against some rational receiver behavior strategy $\pi_2$, it is a strict best response for $\theta'$ against $\pi_2$.

The next proposition shows that $\succsim_s$ is transitive and “almost” asymmetric. A signal $s$ is rationally strictly dominant for $\theta$ if it is a strict best response against any rational receiver strategy, $\pi_2 \in \Pi_2^\bullet$. A signal $s$ is rationally strictly dominated for $\theta$ if it is not a weak best response against any rational receiver strategy.

Proposition 1. We have
1. $\succsim_s$ is transitive.
2. Except when $s$ is either rationally strictly dominant for both $\theta'$ and $\theta''$ or rationally strictly dominated for both $\theta'$ and $\theta''$, $\theta' \succsim_s \theta''$ implies $\theta'' \not\succsim_s \theta'$.

The Appendix provides proofs for all of our results except where otherwise noted.

We require two auxiliary definitions before defining RCE.

Definition 2. For any two types $\theta', \theta''$, let $P_{\theta' \triangleright \theta''}$ be the set of beliefs where the odds ratio of $\theta'$ to $\theta''$ exceeds their prior odds ratio, that is
$$P_{\theta' \triangleright \theta''} := \left\{ p \in \Delta(\Theta) : \frac{p(\theta'')}{p(\theta')} \le \frac{\lambda(\theta'')}{\lambda(\theta')} \right\}. \tag{1}$$

Note that if $\pi_1(s \mid \theta') \ge \pi_1(s \mid \theta'')$, $\pi_1(s \mid \theta') > 0$, and the receiver updates beliefs using $\pi_1$, then the receiver’s posterior belief about the sender’s type after observing $s$ falls in the set $P_{\theta' \triangleright \theta''}$. In particular, in any Bayesian Nash equilibrium, the receiver’s on-path belief falls in $P_{\theta' \triangleright \theta''}$ after any on-path signal $s$ with $\theta' \succsim_s \theta''$.

We now introduce some additional definitions to let us investigate the implications of the agents’ knowledge of their opponent’s payoff function. For a strategy profile $\pi^*$, let $\mathbb{E}_{\pi^*}[u_1 \mid \theta]$ denote type $\theta$’s expected payoff under $\pi^*$.

Definition 3.
For any strategy profile $\pi^*$, let
$$\tilde{J}(s, \pi^*) := \left\{ \theta \in \Theta : \max_{a \in A_s^{\mathrm{BR}}} u_1(\theta, s, a) \ge \mathbb{E}_{\pi^*}[u_1 \mid \theta] \right\}.$$
This is the set of types for which some best response to signal $s$ is at least as good as their payoff under $\pi^*$. For all other types, the signal $s$ is equilibrium dominated in the sense of Cho and Kreps (1987).

Definition 4. The set of rationality-compatible beliefs for the receiver at strategy profile $\pi^*$, $\left(\tilde{P}(s, \pi^*)\right)_s$, is defined as follows:
$$\tilde{P}(s, \pi^*) := \begin{cases} \Delta(\tilde{J}(s, \pi^*)) \cap \bigcap_{(\theta', \theta'') \text{ s.t. } \theta' \succsim_s \theta''} P_{\theta' \triangleright \theta''} & \text{if } \tilde{J}(s, \pi^*) \ne \emptyset, \\ \Delta(\Theta_s) & \text{if } \tilde{J}(s, \pi^*) = \emptyset. \end{cases}$$

The main idea behind the rationality-compatible beliefs is that the receiver’s posterior likelihood ratio for types $\theta'$ and $\theta''$ dominates the prior likelihood ratio whenever $\theta' \succsim_s \theta''$. A second feature involves equilibrium dominance: $\tilde{P}$ assigns probability 0 to equilibrium-dominated types; this is similar to the belief restriction of the Intuitive Criterion. Note that this definition imposes no belief restrictions based on $\theta' \succsim_s \theta''$ when $s$ is equilibrium dominated for every type. As we illustrate in Example 3, the receiver need not learn the rational compatibility relation when equilibrium dominance leads to steady states where no type ever experiments with a certain signal.

Footnote (equation (1)): With the convention $\frac{0}{0} := 0$.

Definition 5.
Strategy profile $\pi^*$ is a rationality-compatible equilibrium (RCE) if it is a Nash equilibrium and $\pi_2^*(\cdot \mid s) \in \Delta(\mathrm{BR}(\tilde{P}(s, \pi^*), s))$ for every $s$.

RCE requires that the receiver only plays best responses to rationality-compatible beliefs after each signal. This solution concept allows for the possibility that after off-path signals the receiver’s strategy $\pi_2^*(\cdot \mid s)$ may not correspond to a single belief about the sender’s type.

Theorem 1 shows that RCE is a necessary condition for a strategy profile where receivers have strict preferences after each on-path signal to be patiently stable. Intuitively, this result holds because the optimal experimentation behavior of the senders respects the compatibility order, and because, since players eventually learn the equilibrium path, types will not experiment much with signals that are equilibrium dominated. As we show in Section 3, RCE rules out the implausible equilibria in a number of games, but is weaker than some past signaling game refinements in the literature. However, RCE is only a necessary condition for patient stability, which leaves open the question of whether patient learning has additional implications. For this reason, we now define uRCE, a subset of RCE (up to path-equivalence). As we show below, uRCE is a sufficient condition for patient stability.

Definition 6. The set of uniformly rationality-compatible beliefs for the receiver is $\left(\hat{P}(s)\right)_s$ where
$$\hat{P}(s) := \Delta(\Theta_s) \cap \bigcap_{(\theta', \theta'') \text{ s.t. } \theta' \succsim_s \theta''} P_{\theta' \triangleright \theta''}.$$
Note that $\left(\hat{P}(s)\right)_s$ makes no reference to a particular strategy profile, unlike $\left(\tilde{P}(s, \pi^*)\right)_s$. Since $\Theta_s$ contains the types for whom $s$ is undominated and $\tilde{J}(s, \pi^*)$ contains the types for whom $s$ is equilibrium-undominated (relative to the profile $\pi^*$), we have $\tilde{P}(s, \pi^*) \subseteq \hat{P}(s)$ whenever $\tilde{J}(s, \pi^*) \ne \emptyset$.

Definition 7.
A Nash equilibrium strategy profile $\pi^*$ is called a uniform rationality-compatible equilibrium (uRCE) if for all $\theta$, all off-path signals $s$ and all $a \in \mathrm{BR}(\hat{P}(s), s)$, we have $\mathbb{E}_{\pi^*}[u_1 \mid \theta] \ge u_1(\theta, s, a)$.

The “uniformity” in uniform RCE comes from the requirement that every best response to every belief in $\hat{P}(s)$ deters every type from deviating to the off-path $s$. By contrast, an RCE is a Nash equilibrium where some best response to $\tilde{P}(s, \pi^*)$ deters every type from deviating to $s$.

Proposition 2.
Every uRCE is path-equivalent to an RCE.
The following example illustrates that uRCE is a strict subset of RCE in some games.
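The deviation checks behind Definition 7 can be replayed on the worker payoffs of Example 1, stated next. The following Python fragment is a minimal sketch and not part of the paper: the names `u1` and `deters` are hypothetical, the value v(θ_H, D) = -2 is read off the D payoff table of the example, and the set BR(P̂(N), N) = {low, med} is taken from the example's text rather than computed.

```python
# Worker payoff u1(theta, s, a) = z(a) + v(theta, s), numbers from Example 1.
Z = {"low": 0, "med": 6, "high": 9}
V = {("H", "N"): 0, ("L", "N"): 0,
     ("H", "C"): 2, ("L", "C"): 1,
     ("H", "D"): -2, ("L", "D"): -4}   # v(H, D) = -2 is read off the D table


def u1(theta, signal, wage):
    """Worker's payoff from sending `signal` and being paid `wage`."""
    return Z[wage] + V[(theta, signal)]


def deters(eq_signal, eq_wage, off_signal, wages):
    """True if no type gains by deviating to off_signal under any wage in `wages`."""
    return all(u1(t, eq_signal, eq_wage) >= u1(t, off_signal, w)
               for t in ("H", "L") for w in wages)


# Pooling on C at a medium wage: both wages in {low, med} (assumed from the
# text to be BR(P_hat(N), N)) deter N, and even the high wage deters D.
pool_C_ok = (deters("C", "med", "N", ["low", "med"])
             and deters("C", "med", "D", ["high"]))

# Pooling on D at a medium wage: a medium wage after N tempts a deviation,
# so the uniform requirement of Definition 7 fails.
pool_D_ok = deters("D", "med", "N", ["low", "med"])

print(pool_C_ok, pool_D_ok)
```

The second check fails because a medium wage after N pays 6, which exceeds the payoff either type earns when pooling on D at a medium wage.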
Example 1.
Suppose a worker has either high ability ($\theta_H$) or low ability ($\theta_L$). She chooses between three levels of higher education: None ($N$), College ($C$), or Ph.D. ($D$). An employer observes the worker’s education level and pays a wage, $a \in \{\text{low}, \text{med}, \text{high}\}$. The worker’s utility function is separable between wage and (ability, education) pair, with $u_1(\theta, s, a) = z(a) + v(\theta, s)$ where $z(\text{low}) = 0$, $z(\text{med}) = 6$, $z(\text{high}) = 9$ and $v(\theta_H, N) = 0$, $v(\theta_L, N) = 0$, $v(\theta_H, C) = 2$, $v(\theta_L, C) = 1$, $v(\theta_H, D) = -2$, $v(\theta_L, D) = -4$. (With this payoff function, going to college has a consumption value while getting a Ph.D. is costly.) The employer’s payoffs reflect a desire to pay a wage corresponding to the worker’s ability and increased productivity with education, given in the tables below.

N        low      med      high
θ_H      …        …        …
θ_L      …        …        …

C        low      med      high
θ_H      …        …        …
θ_L      …        …        …

D        low      med      high
θ_H      -2, 0    4, 2     7, 3
θ_L      -4, 3    2, 2     5, 0

No education level is dominated for either type and no wage is conditionally dominated after any signal. Since $v(\theta_H, \cdot) - v(\theta_L, \cdot)$ is maximized at $D$, it is simple to verify that $\theta_H \succsim_D \theta_L$. Similarly, $\theta_L \succsim_N \theta_H$. There is no compatibility relation at signal $C$.

When the prior is $\lambda(\theta_H) = 0.5$, the strategy profile where the employer always pays a medium wage and both types of worker choose $C$ is a uRCE. This is because $\hat{P}(N)$ contains only those beliefs with $p(\theta_H) \le 0.5$, so $\mathrm{BR}(\hat{P}(N), N) = \{\text{low}, \text{med}\}$. Both of these wages deter every type from deviating to $N$. At the same time, no type wants to deviate to $D$, even if she gets paid the best wage.

On the other hand, the equilibrium $\pi^*$ where the employer pays low wages for $N$ and $C$, a medium wage for $D$, and both types choose $D$ is an RCE but not a uRCE. The belief that puts probability 1 on the worker being $\theta_L$ belongs to $\tilde{P}(N, \pi^*)$ and $\tilde{P}(C, \pi^*)$ and induces the employer to choose a low wage. However, a medium wage is a best response to $\lambda \in \hat{P}(N)$, and a medium wage would tempt type $\theta_L$ to deviate to $N$. ♦

In the learning model of Fudenberg and He (2018), agents do not know others’ utility functions and have full-support prior beliefs about others’ play. That paper’s compatibility criterion (CC) is based on a family of binary relations on types (one for each signal $s$) that are less complete than the rational compatibility relations, because the condition that “whenever $s$ is a weak best response for $\theta''$, it is also a strict best response for $\theta'$” is required to hold for all $\pi_2 \in \Pi_2$ instead of only for $\pi_2 \in \Pi_2^\bullet$. Hence, RCE is always at least as restrictive as the CC, and RCE can eliminate some equilibria that the CC allows.

Example 2.
Consider a game where the sender has type distribution $\lambda(\theta_{\text{strong}}) = 0.\ldots$, $\lambda(\theta_{\text{weak}}) = 0.\ldots$ and chooses between In or Out. The game ends with payoffs (0,0) if the sender chooses Out. If the sender chooses In, the receiver then chooses Up, Down, or X. Up is the receiver’s optimal response if the sender is more likely to be $\theta_{\text{strong}}$, Down is optimal when the sender is more likely to be $\theta_{\text{weak}}$, and X is never optimal. This game has two sequential equilibria: one where both types choose Out, and another where both types go In and the receiver responds with Up.

Without payoff knowledge, a compatibility relation based on all $\pi_2 \in \Pi_2$ does not rank the two types after signal In. If $\pi_2(\text{Down} \mid \text{In}) = 2/3$, $\pi_2(\text{X} \mid \text{In}) = 1/3$, for example, $\theta_{\text{weak}}$ finds In optimal but $\theta_{\text{strong}}$ does not. So the sequential equilibrium outcome Out satisfies the CC. However, since X is conditionally dominated after In, we can verify that the stronger rational compatibility relation ranks $\theta_{\text{strong}} \succsim_{\text{In}} \theta_{\text{weak}}$ and that the unique RCE is the equilibrium where both types go In. Underlying this is the fact that if the conditionally dominated response X is removed from the game tree, then $\theta_{\text{strong}}$ will experiment more frequently with In than $\theta_{\text{weak}}$ does because $\theta_{\text{strong}}$ potentially has more to gain. This story breaks down if senders do not know receivers’ payoffs and thus suspect that X might be used after In. We will show in Section 7 that for some full-support prior beliefs, $\theta_{\text{weak}}$ experiments more with In than $\theta_{\text{strong}}$ does under any patience level. ♦

While the previous example shows payoff information may lead to more […]

Footnote: This is a modified version of Cho-Kreps “beer-quiche game,” where an outside option with certain payoffs (Out) replaces the Quiche signal. The responses Up and Down correspond to Not Fight and Fight in the beer-quiche game, while X is a conditionally dominated response for the receiver following In. Also, while our definition of signaling games requires that the receiver has the same action set after every signal, this situation is clearly equivalent to one where the receiver chooses Up, Down, or X after Out, but all of these choices lead to the payoffs (0,0).
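Both Example 2 and the example that follows turn on responses that are conditionally dominated, that is, actions outside $A_s^{\mathrm{BR}}$. With two sender types this set can be computed exactly, because each action's expected payoff is linear in the belief. The Python fragment below is a minimal sketch under that two-type assumption; the function name `undominated_actions` is hypothetical, and the payoffs used are the receiver's payoffs after signal R in Example 3.

```python
from fractions import Fraction


def undominated_actions(u_theta1, u_theta2):
    """Indices of actions that are a best response to some belief.

    u_theta1[a] and u_theta2[a] are the receiver's payoffs from action a
    against each type. A belief is q = Pr(theta1) in [0, 1]; expected payoffs
    are linear in q, so it suffices to test q = 0, q = 1 and every point
    where two payoff lines cross.
    """
    n = len(u_theta1)
    candidates = {Fraction(0), Fraction(1)}
    for a in range(n):
        for b in range(a + 1, n):
            # Difference in slopes (with respect to q) of the two payoff lines.
            d = (u_theta1[a] - u_theta2[a]) - (u_theta1[b] - u_theta2[b])
            if d != 0:
                q = Fraction(u_theta2[b] - u_theta2[a], d)
                if 0 <= q <= 1:
                    candidates.add(q)
    best = set()
    for q in candidates:
        payoff = [q * u_theta1[a] + (1 - q) * u_theta2[a] for a in range(n)]
        top = max(payoff)
        best.update(a for a in range(n) if payoff[a] == top)
    return sorted(best)


# Receiver payoffs after R in Example 3, in the order (a1, a2, a3).
after_R = undominated_actions([-1, 2, 0], [-1, 0, 1])
print(after_R)
```

Indices are zero-based, so the result lists the second and third actions and omits the first, which is the conditionally dominated one.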
Example 3.
Consider a game with two sender types, $\theta_1$ and $\theta_2$, equally likely, and two possible signals, $L$ or $R$. Payoffs are given in the tables below.

signal: L     action: a_1    action: a_2    action: a_3
type: θ_1     -2, 0          2, …           …, …
type: θ_2     -2, 1          2, 0           2, -1

signal: R     action: a_1    action: a_2    action: a_3
type: θ_1     5, -1          -3, 2          -4, 0
type: θ_2     -2, -1         1, 0           0, 1

Action $a_1$ is conditionally dominated for the receiver after signal $R$. It is easy to see that in every perfect Bayesian equilibrium $\pi^*$, we must have $\pi_1^*(L \mid \theta_1) = \pi_1^*(L \mid \theta_2) = 1$, $\pi_2^*(a_{\ldots} \mid L) = 1$, and that $\pi_2^*(\cdot \mid R)$ must be supported on $A_R^{\mathrm{BR}} = \{a_2, a_3\}$. This means the off-path signal $R$ is equilibrium dominated for every type in $\pi^*$, i.e. $\tilde{J}(R, \pi^*) = \emptyset$. So, $\tilde{P}(R, \pi^*) = \Delta(\Theta_R) = \Delta(\Theta)$ and RCE permits the receiver to play either $a_2$ or $a_3$ after $R$. (This is despite the fact that $\theta_2$ is more rationally compatible with $R$ than $\theta_1$ is. As we discussed after Definition 4, RCE does not restrict the receiver’s belief based on rational type compatibility after an off-path signal that is equilibrium dominated for every type.)

We will show in Section 7 that when learners have payoff information, there is a patiently stable state where the receivers play $a_2$ after $R$ and another patiently stable state where the receivers respond to $R$ with $a_3$. However, we will also show that without payoff information, patient stability requires that the receivers play $a_2$ after $R$. ♦

This section compares RCE to other equilibrium refinement concepts in the literature.

3.1 Iterated dominance
We first relate RCE to a form of iterated dominance in the ex-ante strategic form of the game, where the sender chooses a signal $\pi_1$ as a function of her type. We show that every sender strategy that specifies playing signal $s$ as a less compatible type $\theta''$ but not as a more compatible type $\theta'$ will be removed by iterated deletion. The idea is that such a strategy is never a weak best response to any receiver strategy in $\Pi_2^\bullet$: if the less compatible $\theta''$ does not have a profitable deviation, then the more compatible type strictly prefers deviating to $s$.

Proposition 3.
Suppose $\theta' \succsim_s \theta''$. Then any ex-ante strategy of the sender $\pi_1$ with $\pi_1(s \mid \theta'') > 0$ but $\pi_1(s \mid \theta') < 1$ is removed by strict dominance once the receiver is restricted to using strategies in $\Pi_2^\bullet$.

We next relate RCE to the Intuitive Criterion.
Proposition 4.
Every RCE satisfies the Intuitive Criterion.
The next example shows that the set of RCE is strictly smaller than the set of equilibria that pass the Intuitive Criterion. The idea is that the Intuitive Criterion does not impose any restriction on the relative likelihood of two types after a signal that is not equilibrium dominated for either of them, but RCE can.
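The contrast between the two tests can be checked mechanically on the payoffs of Example 4, stated next. The Python fragment below is a minimal sketch and not part of the paper: it hard-codes that example's payoff tables, labels the types `t1` and `t2`, and uses hypothetical helper names. It first confirms that the off-path signal is not equilibrium dominated for either type, so the Intuitive Criterion has no bite, and then confirms that the deterring response is not a best response to any belief allowed by the odds-ratio restriction of Definition 2.

```python
from fractions import Fraction as F

# (sender payoff, receiver payoff) after the off-path signal s2 in Example 4.
S2 = {"t1": {"a1": (7, 1), "a2": (3, 0)},
      "t2": {"a1": (7, 0), "a2": (3, 1)}}
PRIOR = {"t1": F(3, 4), "t2": F(1, 4)}
EQ_PAYOFF = {"t1": 4, "t2": 6}   # both types send s1 and the receiver plays a1


def equilibrium_dominated(theta):
    """True if no response to s2 gives theta at least the equilibrium payoff."""
    return max(u for u, _ in S2[theta].values()) < EQ_PAYOFF[theta]


def a2_is_best_response(p_t2):
    """Is a2 optimal after s2 when the receiver puts probability p_t2 on t2?"""
    p_t1 = 1 - p_t2
    value = {a: p_t1 * S2["t1"][a][1] + p_t2 * S2["t2"][a][1]
             for a in ("a1", "a2")}
    return value["a2"] >= value["a1"]


# Largest weight on t2 compatible with p(t2)/p(t1) <= prior odds.
odds = PRIOR["t2"] / PRIOR["t1"]
max_p_t2 = odds / (1 + odds)

print(equilibrium_dominated("t1"), equilibrium_dominated("t2"))
print(max_p_t2, a2_is_best_response(max_p_t2))
```

Because the receiver's gain from `a2` rises with the weight on `t2`, testing the largest admissible weight is enough to rule out every admissible belief.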
Example 4.
Consider a signaling game where the prior probabilities of the two types are $\lambda(\theta_1) = 3/4$ and $\lambda(\theta_2) = 1/4$, and the payoffs are:

signal: s_1    action: a_1    action: a_2
type: θ_1      4, 1           0, 0
type: θ_2      6, 0           2, 1

signal: s_2    action: a_1    action: a_2
type: θ_1      7, 1           3, 0
type: θ_2      7, 0           3, 1

Against any receiver strategy, the two types $\theta_1$ and $\theta_2$ get the same payoffs from $s_2$, but $\theta_2$ gets strictly higher payoffs than $\theta_1$ from $s_1$. So, $\theta_1 \succsim_{s_2} \theta_2$.

Consider now the Nash equilibrium in which the types pool on $s_1$, i.e. $\pi_1^*(s_1 \mid \theta_1) = \pi_1^*(s_1 \mid \theta_2) = 1$, $\pi_2^*(a_1 \mid s_1) = 1$, and $\pi_2^*(a_2 \mid s_2) = 1$. It passes the Intuitive Criterion since the off-path signal $s_2$ is not equilibrium dominated for either type. On the other hand, RCE requires that every action played with positive probability in $\pi_2^*(\cdot \mid s_2)$ best responds to some belief $p$ about sender’s type satisfying $\frac{p(\theta_2)}{p(\theta_1)} \le \frac{\lambda(\theta_2)}{\lambda(\theta_1)} = \frac{1}{3}$. But action $a_2$ does not best respond to any such belief, so $\pi^*$ is not an RCE. ♦

Next, we compare divine equilibrium with RCE and uRCE. For a strategy profile $\pi^*$, let
$$D(\theta, s; \pi^*) := \{ \alpha \in \mathrm{MBR}(s) \text{ s.t. } \mathbb{E}_{\pi^*}[u_1 \mid \theta] < u_1(\theta, s, \alpha) \}$$
be the subset of mixed best responses to $s$ that would make type $\theta$ strictly prefer deviating from the strategy $\pi_1^*(\cdot \mid \theta)$. Similarly let
$$D^\circ(\theta, s; \pi^*) := \{ \alpha \in \mathrm{MBR}(s) \text{ s.t. } \mathbb{E}_{\pi^*}[u_1 \mid \theta] = u_1(\theta, s, \alpha) \}$$
be the set of mixed best responses that would make $\theta$ indifferent to deviating.

Proposition 5.
1. If $\pi^*$ is a Nash equilibrium where $s$ is off-path, and $\theta' \succsim_s \theta''$, then $D(\theta'', s; \pi^*) \cup D^\circ(\theta'', s; \pi^*) \subseteq D(\theta', s; \pi^*)$.
2. Every divine equilibrium is an RCE.

However, the converse is not true, as the following example illustrates.
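The sets $D$ and $D^\circ$ reduce to simple thresholds when the receiver has two actions. The Python fragment below is a minimal sketch and not part of the paper: it uses the sender payoffs after the second signal of Example 5, stated next, assumes the pooling profile of that example in which both types earn 0 in equilibrium, and introduces the hypothetical helper `indifference_weight`. For each type it computes the weight on $a_1$ at which the type is exactly indifferent to deviating.

```python
from fractions import Fraction as F

# Sender payoffs after the off-path signal s2 in Example 5:
# (payoff if the receiver plays a1, payoff if the receiver plays a2).
SENDER_S2 = {"theta1": (2, -1), "theta2": (1, -1)}
EQ_PAYOFF = {"theta1": 0, "theta2": 0}   # both types send s1, receiver plays a1


def indifference_weight(theta):
    """Weight on a1 at which theta is exactly indifferent to deviating to s2.

    Solves w * u(a1) + (1 - w) * u(a2) = equilibrium payoff for w.
    """
    u_a1, u_a2 = SENDER_S2[theta]
    return F(EQ_PAYOFF[theta] - u_a2, u_a1 - u_a2)


w1 = indifference_weight("theta1")
w2 = indifference_weight("theta2")

# Any response that makes theta2 weakly willing to deviate puts weight at
# least w2 on a1, which is strictly above w1, so theta1 strictly gains.
nested = w1 < w2
print(w1, w2, nested)
```

The ordering of the two thresholds is the inclusion used by the divine belief restriction, which looks only at the deviation signal and the equilibrium payoff.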
Example 5.
Consider the following signaling game with two types and three signals, with prior $\lambda(\theta_1) = 2/3$.

s_1      a_1      a_2
θ_1      0, 1     -1, 0
θ_2      0, 0     -1, 1

s_2      a_1      a_2
θ_1      2, 1     -1, 0
θ_2      1, 0     -1, 1

s_3      a_1      a_2
θ_1      5, 0     -3, 1
θ_2      0, 1     -2, 0

Footnote: To be precise, $\mathrm{MBR}(p, s) := \arg\max_{\alpha \in \Delta(A)} \left( \mathbb{E}_{\theta \sim p}[u_2(\theta, s, \alpha)] \right)$ and $\mathrm{MBR}(s) := \cup_{p \in \Delta(\Theta)} \mathrm{MBR}(p, s)$.

We check that the following is a pure-strategy RCE: $\pi_1(s_1 \mid \theta_1) = \pi_1(s_1 \mid \theta_2) = 1$, $\pi_2(a_1 \mid s_1) = 1$, $\pi_2(a_2 \mid s_2) = 1$, $\pi_2(a_2 \mid s_3) = 1$. Evidently $\pi$ is a Nash equilibrium and no type is equilibrium-dominated at any off-path signal. We now check that we do not have $\theta_1 \succsim_{s_2} \theta_2$ or $\theta_2 \succsim_{s_3} \theta_1$. Observe that against the receiver strategy $\tilde{\pi}_2(a_1 \mid s) = \ldots$ for every $s$, $s_3$ is strictly optimal for $\theta_1$ but $s_2$ is strictly optimal for $\theta_2$, so $\theta_1 \not\succsim_{s_2} \theta_2$. And for the receiver strategy $\hat{\pi}_2(a_1 \mid s) = 1$ for every $s$, $s_3$ is strictly optimal for $\theta_1$ but $s_2$ is strictly optimal for $\theta_2$, so $\theta_2 \not\succsim_{s_3} \theta_1$. This shows the strategy profile is an RCE.

However, $D(\theta_2, s_2; \pi) \cup D^\circ(\theta_2, s_2; \pi)$ is the set of distributions on $\{a_1, a_2\}$ that put at least weight 0.5 on $a_1$. Any such distribution is in $D(\theta_1, s_2; \pi)$. So in every divine equilibrium, the receiver plays a best response to a belief that puts weight no less than 2/3 on $\theta_1$ after signal $s_2$, which can only be $a_1$. ♦

This example illustrates one difference between divine equilibrium and RCE: under divine equilibrium, the beliefs after signal $s_2$ only depend on the comparison between the payoffs to $s_2$ with those of the equilibrium signal $s_1$, while the compatibility criterion also considers the payoffs to a third signal $s_3$. In the learning model, this corresponds to the possibility that $\theta_1$ chooses to experiment with $s_3$ at beliefs that induce $\theta_2$ to experiment with $s_2$.

Our RCE differs from divine equilibrium in another way: divine equilibrium involves an iterative application of a belief restriction. The next example illustrates this difference.

Example 6.
There are three types, $\theta_1, \theta_2, \theta_3$, all equally likely. The signal space is $S = \{s_1, s_2\}$, and the set of receiver actions is $A = \{a_1, a_2, a_3, a_4\}$. When any sender type chooses the signal $s_1$, all parties get a payoff of 0 regardless of the receiver’s action. When the sender chooses $s_2$, the payoffs are determined by the following matrix.

         a_1       a_2      a_3       a_4
θ_1      1, 0.9    -1, 0    -2, 0     -7, 0
θ_2      5, 0      3, 1     -1, 0     -5, 0.8
θ_3      -3, 0     5, 0     1, 1.7    -3, 0.8

Consider the pure strategy profile $\pi_1^*(s_1 \mid \theta) = 1$ for all $\theta \in \Theta$ and $\pi_2^*(a_4 \mid s) = 1$ for all $s \in S$. Since $\theta_2$ gains more from deviating to $s_2$ than $\theta_1$ does, applying the divine belief restriction for the off-path signal $s_2$ eliminates the action $a_1$, since it is not a best response to any belief $p \in \Delta(\Theta)$ with $p(\theta_2) \ge p(\theta_1)$. But after action $a_1$ is deleted for the receiver after signal $s_2$, type $\theta_3$ now gains more from deviating to $s_2$ than $\theta_2$ does. So, applying the divine belief restriction again eliminates actions $a_2$ and $a_4$, since they are not best responses against any $p \in \Delta(\Theta)$ with $p(\theta_1) = 0$ (for now $s_2$ is equilibrium dominated for $\theta_1$) and $p(\theta_3) \ge p(\theta_2)$. So $\pi^*$ is not a divine equilibrium.

On the other hand, no type is equilibrium dominated at $s_2$ and the only rational compatibility order is $\theta_2 \succsim_{s_2} \theta_1$. But $a_4$ is a best response against the belief $p(\theta_1) = 0$, $p(\theta_2) = 0.6$, $p(\theta_3) = 0.4$, which belongs to the set $\Delta(\Theta_{s_2}) \cap P_{\theta_2 \triangleright \theta_1}$. So $\pi^*$ is an RCE. ♦

Footnote: As noted by Van Damme (1987), it may seem more natural to replace the set $\alpha \in \mathrm{MBR}(s)$ in the definitions of $D$ and $D^\circ$ with the larger set $\alpha \in \mathrm{co}(\mathrm{BR}(s))$, which leads to the weaker equilibrium refinement that Sobel, Stole, and Zapater (1990) call “co-divinity”. This example also shows that RCE need not be co-divine.

Footnote: We thank Joel Sobel for this example.

Finally, we show that every uRCE is path-equivalent to an equilibrium that is not ruled out by the “NWBR in signaling games” test (Banks and Sobel, 1987; Cho and Kreps, 1987), which comes from iterative applications of the following pruning procedure: after signal $s$ the receiver is required to put 0 probability on those types $\theta$ such that
$$D^\circ(\theta, s; \pi^*) \subseteq \bigcup_{\theta' \ne \theta} D(\theta', s; \pi^*).$$
If this would delete every type, then the procedure instead puts no restriction on receiver’s beliefs and no type is deleted.

By “path-equivalent” we mean that by modifying some of the receiver’s off-path responses, but without altering the sender’s strategy or the receiver’s on-path responses, we can change the uRCE into another uRCE that passes the NWBR test.

Footnote: This is closely related to, but not the same as, the NWBR property of Kohlberg and Mertens (1986).
Proposition 6.
Every uRCE is path-equivalent to a uRCE that passes the NWBR test.
Corollary 1.
Every uRCE is path-equivalent to a universally divine equilibrium.
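The best-response claims in Example 6 can be checked numerically. The following Python sketch is only an illustration and not part of the paper's argument. It assumes that the second entry of each cell in the payoff matrix is the receiver's payoff, indexes types and actions from 0, and searches a rational grid on the belief simplex. The grid step and the helper names `best_responses` and `simplex_grid` are introduced here for illustration.

```python
from fractions import Fraction

# Receiver payoffs after the off-path signal in Example 6
# (assumed to be the second entry of each cell).
# Rows are the three sender types, columns are the four receiver actions.
U2 = [
    [Fraction(9, 10), 0, 0, 0],
    [0, 1, 0, Fraction(4, 5)],
    [0, 0, Fraction(17, 10), Fraction(4, 5)],
]


def best_responses(belief, allowed=range(4)):
    """Return the allowed actions that maximize the receiver's expected payoff."""
    values = {a: sum(p * U2[t][a] for t, p in enumerate(belief)) for a in allowed}
    top = max(values.values())
    return {a for a, v in values.items() if v == top}


def simplex_grid(steps=20):
    """Yield beliefs (p1, p2, p3) on a grid with exact rational coordinates."""
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            yield (Fraction(i, steps), Fraction(j, steps),
                   Fraction(steps - i - j, steps))


# Step 1: beliefs with p(type 2) >= p(type 1); is the first action ever optimal?
step1 = [p for p in simplex_grid() if p[1] >= p[0]]
first_action_survives = any(0 in best_responses(p) for p in step1)

# Step 2: first action deleted, p(type 1) = 0 and p(type 3) >= p(type 2).
step2 = [p for p in simplex_grid() if p[0] == 0 and p[2] >= p[1]]
survivors_step2 = set().union(
    *(best_responses(p, allowed=(1, 2, 3)) for p in step2)
)

# The belief used in the RCE argument: (0, 0.6, 0.4).
rce_belief = (Fraction(0), Fraction(3, 5), Fraction(2, 5))
rce_best = best_responses(rce_belief)
```

Under these assumptions, the first action is never optimal on the step-1 grid, only the third action survives step 2, and the fourth action is the unique best response to the belief (0, 0.6, 0.4). A grid search is a sanity check rather than a proof, although here each claim also follows from comparing the linear payoff expressions directly.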
To summarize this subsection, we note that for strategy profiles that are on-path strict for the receiver, we have the following inclusion relationships. The first inclusion should be understood as inclusion up to path-equivalence. We use the symbol "$\subsetneq$" to mean that the former solution set is always nested within the latter one in every signaling game, and that there exist games where the nesting relationship is strict.

$$\text{uRCE} \subsetneq \text{universally divine equilibria} \subsetneq \text{RCE} \subsetneq \text{Intuitive Criterion} \subsetneq \text{Nash equilibria}.$$

We study the same discrete-time steady-state learning model as Fudenberg and He (2018) except for an extra restriction on the players' prior beliefs over other players' strategies.

There is a continuum of agents in the society, with a unit mass of receivers and $\lambda(\theta)$ mass of type $\theta$ senders. Each population is further stratified by age, with a fraction $(1-\gamma)\cdot\gamma^t$ of each population age $t$ for $t = 0, 1, 2, \ldots$ At the end of each period, each agent has probability $0 \le \gamma < 1$ of surviving into the next period. Each period, $(1-\gamma)$ new receivers and $\lambda(\theta)(1-\gamma)$ new type $\theta$ senders are born into the society, thus preserving population sizes and the age distribution.

Agents play the signaling game every period against a randomly matched opponent. Each sender has probability $(1-\gamma)\gamma^t$ of matching with a receiver of age $t$, while each receiver has probability $\lambda(\theta)(1-\gamma)\gamma^t$ of matching with a type $\theta$ sender of age $t$.

Each agent is born into a player role in the signaling game: either a receiver or a type $\theta$ sender. Agents know their role, which is fixed for life. The agents' payoff each period is determined by the outcome of the signaling game they played, which consists of the sender's type, the signal sent, and the action played in response. The agents observe this outcome, but the sender does not observe how her matched receiver would have played had she sent a different signal.

In addition to only surviving to the next period with probability $0 \le \gamma < 1$, agents discount future utility flows by $0 \le \delta < 1$. Letting $u_t$ represent the payoff $t$ periods from today, each agent's objective function is $E\left[\sum_{t=0}^{\infty} (\gamma\delta)^t \cdot u_t\right]$. (Define $0^0 := 1$, so that a myopic agent just maximizes the current period's expected payoff in every period.)

Agents believe they face a fixed but unknown distribution of opponents' aggregate play, updating their beliefs at the end of every period based on the outcome in their own game. Formally, each sender is born with a prior density function over receivers' behavior strategies, $g_1 : \Pi_2 \to \mathbb{R}_+$.
Similarly, each receiver is born with a prior density over the senders' behavior strategies, $g_2 : \Pi_1 \to \mathbb{R}_+$. We denote the marginal distribution of $g_1$ on signal $s$ as $g_1^{(s)} : \Delta(A) \to \mathbb{R}_+$, so that $g_1^{(s)}(\pi_2(\cdot \mid s))$ is the density of the new senders' prior over how receivers respond to signal $s$. Similarly, we denote the $\theta$ marginal of $g_2$ as $g_2^{(\theta)} : \Delta(S) \to \mathbb{R}_+$, so that $g_2^{(\theta)}(\pi_1(\cdot \mid \theta))$ is the new receivers' prior density over the signal choice of type $\theta$.

Footnote: We separately consider survival probability and patience so that we may consider agents who are impatient relative to their expected lifespan. Such agents experiment early in their life cycle, but spend most of their life myopically best responding to their beliefs, which makes our analysis more tractable.

We now state a regularity assumption on agents' priors that will be maintained throughout.

Definition 8.
A prior $g = (g_1, g_2)$ is regular if

(a). [independence] $g_1(\pi_2) = \prod_{s \in S} g_1^{(s)}(\pi_2(\cdot \mid s))$ and $g_2(\pi_1) = \prod_{\theta \in \Theta} g_2^{(\theta)}(\pi_1(\cdot \mid \theta))$.

(b). [payoff knowledge] $g_1$ puts probability 1 on $\Pi_2^\bullet$ and $g_2$ puts probability 1 on $\Pi_1^\bullet$.

(c). [$g_1$ non-doctrinaire] $g_1$ is continuous and strictly positive on the interior of $\Pi_2^\bullet$.

(d). [$g_2$ nice] For each type $\theta$, there are positive constants $\left(\alpha_s^{(\theta)}\right)_{s \in S}$ such that

$$\pi_1(\cdot \mid \theta) \mapsto \frac{g_2^{(\theta)}(\pi_1(\cdot \mid \theta))}{\prod_{s \in S} \pi_1(s \mid \theta)^{\alpha_s^{(\theta)} - 1}}$$

is uniformly continuous and bounded away from zero on the relative interior of $\Pi^\bullet_\theta$, the set of rational behavior strategies of type $\theta$.

This assumption bears the same name as the regularity assumption in Fudenberg and He (2018), and is identical except that agents now know others' payoffs and others' rationality. In the learning model, this payoff knowledge translates into a restriction on the supports of the priors $g_1, g_2$, reflecting a dogmatic belief that senders will never play dominated signals and receivers will never play conditionally dominated actions. (These beliefs are correct in the learning model.)

Even with payoff knowledge, the receiver's prior can assign positive probability to ex-ante dominated sender strategies. For instance, in the signaling game below, the sender strategy $\pi_1(s_1 \mid \theta_1) = \pi_1(s_2 \mid \theta_2) = 1$ belongs to the set $\Pi_1^\bullet$, and so must belong to the support of any regular receiver prior. But, even though $s_1 \in S_{\theta_1}$ and $s_2 \in S_{\theta_2}$, the receiver strategies to which they respectively best respond form disjoint sets, and $\pi_1$ is ex-ante dominated because it is not a best response to any single receiver strategy. It is nevertheless consistent for a receiver who knows the sender's payoff as a function of their type to assign positive density to $\pi_1$, because different types of agents can choose best responses to different beliefs about receiver play.

Let $Y_\theta[t] := \left(\cup_{s \in S} (s \times A^{BR}_s)\right)^t$ represent the set of possible histories for a type $\theta$ sender with age $t$.
Note that a valid history encodes the signal that $\theta$ sent each period and the (conditionally undominated) action that her opponent played in response. Let $Y_\theta := \cup_{t=0}^{\infty} Y_\theta[t]$ be the set of all histories for type $\theta$.

Similarly, write $Y_2[t] := (\Theta \times S_\theta)^t$ for the set of possible histories for a receiver with age $t$. Each period, his history encodes the type of the matched sender and the (undominated) signal observed. The union $Y_2 := \cup_{t=0}^{\infty} Y_2[t]$ then stands for the set of all receiver histories.

The agents' dynamic optimization problems discussed in Subsection 4.2 give rise to optimal policies $\sigma_\theta : Y_\theta \to S_\theta$ and $\sigma_2 : Y_2 \to \times_s (A^{BR}_s)$. Here, $\sigma_\theta(y_\theta)$ is the signal that a type $\theta$ sender with history $y_\theta$ would send the next time she plays the game, and $\sigma_2(y_2)$ is the pure extensive-form strategy that a receiver with history $y_2$ would commit to next time he plays the game. In the learning model, each agent solves a (single-agent) dynamic optimization problem, and chooses a deterministic optimal policy.

Footnote: For notational simplicity, we suppress the dependence of these optimal policies on the effective discount factor $\delta\gamma$ and on the priors.

A state $\psi$ of the learning model is a demographic description of how many agents have each possible history. It can be viewed as a distribution

$$\psi \in \left(\times_{\theta \in \Theta} \Delta(Y_\theta)\right) \times \Delta(Y_2),$$

and its components are denoted by $\psi_\theta \in \Delta(Y_\theta)$ and $\psi_2 \in \Delta(Y_2)$.

Since each state $\psi$ is a distribution over histories and optimal policies map histories to play, $\psi$ induces a distribution over play (i.e., a rational behavior strategy) in the signaling game $\sigma(\psi) \in \Pi^\bullet$, given by

$$\sigma_\theta(\psi_\theta)(s) := \psi_\theta\{y_\theta \in Y_\theta : \sigma_\theta(y_\theta) = s\}$$

and

$$\sigma_2(\psi_2)(a \mid s) := \psi_2\{y_2 \in Y_2 : \sigma_2(y_2)(s) = a\}.$$

Here, $\sigma_\theta(\psi_\theta)$ and $\sigma_2(\psi_2)$ are the aggregate behaviors of the type $\theta$ and receiver populations in state $\psi$, respectively.
Note that the aggregate play of a population can be stochastic even if the entire population uses the same deterministic optimal policy, because different senders will be matched with different receivers, and so different agents on the same side will observe different histories and play differently.

Of particular interest are the steady states, to be defined more precisely in Section 6. Loosely speaking, a steady state induces a time-invariant distribution over how the signaling game is played in the society.

This section defines the notion of a steady state using the "aggregate responses" of one population to the distribution of play in the other. These responses are defined using the "one-period forward" maps that describe how the agents' policies induce a map from current distributions over histories to what the distributions will be after the agents are matched and play the game using the strategies their policies prescribe.
Fix the receivers' aggregate play at $\pi_2 \in \Pi_2^\bullet$ and fix an optimal policy $\sigma_\theta$ for each type $\theta$. The one-period-forward map for type $\theta$, $f_\theta$, describes the distribution over histories that will prevail next period when the current distribution over histories in the type-$\theta$ population is $\psi_\theta$. The next definition specifies the probability that $f_\theta[\psi_\theta, \pi_2]$ assigns to the history $(y_\theta, (s, a)) \in Y_\theta[t+1]$, that is to say a one-period concatenation of $(s, a)$ onto the history $y_\theta \in Y_\theta[t]$.

Definition 9.
The one-period-forward map for type $\theta$, $f_\theta : \Delta(Y_\theta) \times \Pi_2^\bullet \to \Delta(Y_\theta)$, is

$$f_\theta[\psi_\theta, \pi_2](y_\theta, (s, a)) := \psi_\theta(y_\theta) \cdot \gamma \cdot \mathbf{1}\{\sigma_\theta(y_\theta) = s\} \cdot \pi_2(a \mid s)$$

and $f_\theta(\emptyset) := 1 - \gamma$.

To interpret, of the $\psi_\theta(y_\theta)$ fraction of the type-$\theta$ population with history $y_\theta$, a $\gamma$ fraction survives into the next period. The survivors all choose $\sigma_\theta(y_\theta)$ next period, which is met with response $a$ with probability $\pi_2(a \mid \sigma_\theta(y_\theta))$.

Write $f_\theta^T$ for the $T$-fold application of $f_\theta$ on $\Delta(Y_\theta)$, holding fixed some $\pi_2$. It is easy to show that $\lim_{T \to \infty} f_\theta^T(\psi_\theta, \pi_2)$ exists and is independent of the initial $\psi_\theta$. (This is because for any two states $\psi_\theta, \psi_\theta'$, the two distributions over histories $f_\theta^T(\psi_\theta, \pi_2)$ and $f_\theta^T(\psi_\theta', \pi_2)$ agree on all $Y_\theta[t]$ for $t < T$. As $T$ grows large, the two resulting distributions must converge to each other since very old agents with very long histories are rare.) Denote this limit as $\tilde{\psi}_\theta^{\pi_2}$. It is the distribution over type-$\theta$ histories induced by the receivers' aggregate play $\pi_2$.

Definition 10.
The aggregate sender response $R_1 : \Pi_2^\bullet \to \Pi_1^\bullet$ is defined by

$$R_1[\pi_2](s \mid \theta) := \tilde{\psi}_\theta^{\pi_2}(y_\theta : \sigma_\theta(y_\theta) = s).$$

That is, $R_1[\pi_2](\cdot \mid \theta)$ describes the asymptotic aggregate play of the type-$\theta$ population when the aggregate play of the receiver population is fixed at $\pi_2$ each period. Note that $R_1$ maps into $\Pi_1^\bullet$ because no type ever wants to send a dominated signal, even as an experiment, regardless of their beliefs about the receiver's response.

Technically, $R_1$ depends on $g_1$, $\delta$, and $\gamma$, just like $\sigma_\theta$ does. When relevant, we will make these dependencies clear by adding the appropriate parameters as superscripts to $R_1$, but we will mostly suppress them to lighten notation.

We now turn to the receivers, who have a passive learning problem. They always observe the sender's type and signal at the end of each period, so their optimal policy $\sigma_2$ myopically best responds to the posterior belief at every history $y_2$.

Definition 11.
The one-period-forward map for the receivers, $f_2 : \Delta(Y_2) \times \Pi_1^\bullet \to \Delta(Y_2)$, is

$$f_2[\psi_2, \pi_1](y_2, (\theta, s)) := \psi_2(y_2) \cdot \gamma \cdot \lambda(\theta) \cdot \pi_1(s \mid \theta)$$

and $f_2(\emptyset) := 1 - \gamma$.

As with the one-period-forward maps $f_\theta$ for senders, $f_2[\psi_2, \pi_1]$ describes the distribution over receiver histories next period starting with a society where the distribution is $\psi_2$ and the sender population's aggregate play is $\pi_1$. We write $\tilde{\psi}_2^{\pi_1} := \lim_{T \to \infty} f_2^T(\psi_2, \pi_1)$ for the long-run distribution over $Y_2$ induced by fixing the sender population's play at $\pi_1$. (This limit is again independent of the initial state $\psi_2$.)

Definition 12.
The aggregate receiver response $R_2 : \Pi_1^\bullet \to \Pi_2^\bullet$ is

$$R_2[\pi_1](a \mid s) := \tilde{\psi}_2^{\pi_1}(y_2 : \sigma_2(y_2)(s) = a).$$

Steady States and Patient Stability

A steady-state strategy profile is a pair of mutual aggregate replies, so it is time-invariant under learning.

Definition 13. $\pi^*$ is a steady-state strategy profile if $R_1^{g,\delta,\gamma}(\pi_2^*) = \pi_1^*$ and $R_2^{g,\delta,\gamma}(\pi_1^*) = \pi_2^*$. Denote the set of all such strategy profiles as $\Pi^*(g, \delta, \gamma)$.

We now state two results about these steady states. We omit the proofs because they follow easily from analogous results in Fudenberg and He (2018).

First, steady-state profiles always exist.

Proposition 7.
For any regular prior $g$ and any $0 \le \delta, \gamma < 1$, $\Pi^*(g, \delta, \gamma)$ is non-empty and compact in the norm topology.

The patiently stable strategy profiles correspond to the set $\lim_{\delta \to 1} \lim_{\gamma \to 1} \Pi^*(g, \delta, \gamma)$. This order of limits was first introduced in Fudenberg and Levine (1993). It ensures agents spend most of their lifetime playing myopically instead of experimenting, which is important for proving that patiently stable profiles are Nash equilibria.

Definition 14.
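For each $0 \le \delta < 1$, a strategy profile $\pi^*$ is $\delta$-stable under $g$ if there is a sequence $\gamma_k \to 1$ and an associated sequence of steady-state strategy profiles $\pi^{(k)} \in \Pi^*(g, \delta, \gamma_k)$, such that $\pi^{(k)} \to \pi^*$. Strategy profile $\pi^*$ is patiently stable under $g$ if there is a sequence $\delta_k \to 1$ and an associated sequence of strategy profiles $\pi^{(k)}$ where each $\pi^{(k)}$ is $\delta_k$-stable under $g$ and $\pi^{(k)} \to \pi^*$. Strategy profile $\pi^*$ is patiently stable if it is patiently stable under some regular prior $g$.

Proposition 8.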
If strategy profile $\pi^*$ is patiently stable, then it is a Nash equilibrium.

Note that Propositions 7 and 8 apply even if all of the Nash equilibria of the game are in mixed strategies; as noted above, the randomization here arises from the random matching process.
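To make the one-period-forward map and the aggregate sender response concrete, the following Python sketch computes the steady-state shares of histories layer by layer (age 0, age 1, and so on) for a single sender type and sums the mass of histories at which the policy prescribes each signal. It is an illustration under stated assumptions, not the paper's construction: the cutoff policy, the signal and action labels, and the truncation at `horizon` are hypothetical choices made here, and the policy is not claimed to be the optimal (Gittins index) policy.

```python
def aggregate_sender_response(policy, pi2, signals, actions, gamma, horizon):
    """Approximate R_1[pi2](. | theta) for one sender type.

    policy(history) -> signal, where history is a tuple of (signal, action) pairs.
    pi2[s][a] is the probability that receivers answer signal s with action a.
    Histories longer than `horizon` are dropped, which omits a total population
    mass of gamma ** (horizon + 1).
    """
    response = {s: 0.0 for s in signals}
    # layer maps each history of the current age to its steady-state share.
    layer = {(): 1.0 - gamma}  # newborns have the empty history
    for _ in range(horizon + 1):
        next_layer = {}
        for history, mass in layer.items():
            s = policy(history)
            response[s] += mass
            for a in actions:
                p = pi2[s][a]
                if p > 0.0:
                    # Survive with probability gamma, then observe (s, a).
                    next_layer[history + ((s, a),)] = mass * gamma * p
        layer = next_layer
    return response


def cutoff_policy(k):
    """Hypothetical policy: send 'In' until k 'Down' replies to 'In', then 'Out'."""
    def policy(history):
        bad = sum(1 for (s, a) in history if s == "In" and a == "Down")
        return "In" if bad < k else "Out"
    return policy


signals = ("In", "Out")
actions = ("Up", "Down")
# Receivers always answer 'In' with 'Down' (an illustrative choice).
pi2 = {"In": {"Up": 0.0, "Down": 1.0}, "Out": {"Up": 1.0, "Down": 0.0}}

more_persistent = aggregate_sender_response(
    cutoff_policy(3), pi2, signals, actions, gamma=0.9, horizon=200)
less_persistent = aggregate_sender_response(
    cutoff_policy(1), pi2, signals, actions, gamma=0.9, horizon=200)
```

With these inputs, a type that tolerates three bad responses plays In only at ages 0, 1, and 2, so its aggregate frequency of In is $(1-\gamma)(1+\gamma+\gamma^2) = 1-\gamma^3$, while a type that gives up after one bad response plays In with frequency $1-\gamma$. The comparison mirrors the kind of ranking of experimentation frequencies across types that the aggregate responses are designed to capture.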
Patient Stability, Payoff Knowledge, and Equilibrium Refinements
In this section, we relate the equilibrium refinements proposed in Section 2 to the steady-state learning model. We show that under certain strictness assumptions, RCE is necessary for patient stability while uRCE is sufficient for patient stability. We also discuss how payoff knowledge matters for learning outcomes.
We show that any patiently stable strategy profile satisfying a strictness assumption must be an RCE. The key lemma is analogous to Lemma 1 from Fudenberg and He (2018), so we will omit its proof.
Lemma 1.
Suppose $\theta' \succsim_s \theta''$. Then for any regular prior $g$, $0 \le \delta, \gamma < 1$, and any $\pi_2 \in \Pi_2^\bullet$, we have $R_1[\pi_2](s \mid \theta') \ge R_1[\pi_2](s \mid \theta'')$.

This result says that over their lifetimes, the relative frequencies with which different sender types experiment with signal $s$ respect the rational compatibility order $\succsim_s$. This follows from the fact that sender types who are more compatible with a signal will play it at least as often. The payoff knowledge embedded in $g_1$'s support implies that senders never experiment in the hopes of seeing a response which is highly profitable for the sender but dominated for the receiver, such as the Charity action in Example 2 for $\theta_{weak}$. This extra assumption leads to a stronger result than Lemma 1 from Fudenberg and He (2018), which is stated in terms of the less-complete compatibility order.

For a fixed strategy profile $\pi$ and on-path signal $s^*$, let $E_{\theta \mid \pi_1, s^*}[u_2(\theta, s^*, a)]$ denote the receiver's expected utility from responding to $s^*$ with $a$, where the expectation over the sender's type $\theta$ is taken with respect to the posterior type distribution after signal $s^*$ given the sender's strategy $\pi_1(\cdot \mid \theta)$.

Definition 15.
A Nash equilibrium $\pi^*$ is on-path strict for the receiver if for every on-path signal $s^*$, $\pi_2(a^* \mid s^*) = 1$ for some $a^* \in A$ and $E_{\theta \mid \pi_1, s^*}[u_2(\theta, s^*, a^*)] > \max_{a \neq a^*} E_{\theta \mid \pi_1, s^*}[u_2(\theta, s^*, a)]$.

We call this condition "on-path" strict for the receiver because we do not make assumptions about the receiver's incentives after off-path signals. For generic payoffs, all pure-strategy equilibria will be on-path strict for the receiver.

Theorem 1.
Every strategy profile that is patiently stable and on-path strict for the receiver is an RCE.
RCE rules out two kinds of receiver beliefs after signal $s$: those that assign non-zero probability to equilibrium-dominated sender types, and those that violate the rational compatibility order. The restriction on equilibrium-dominated types uses the assumption that the receiver has a strict best response to each on-path signal to bound how slowly aggregate receiver play at on-path signals converges to its limit. The fact that the receiver beliefs respect the rational compatibility order comes from Lemma 1, which uses our assumptions about the prior $g_1$ to derive restrictions on the aggregate sender response $R_1$, and from showing that these restrictions are reflected in the aggregate receiver response $R_2$. The proof of Theorem 1 closely follows the analogous proof in Fudenberg and He (2018) and is omitted.

We now prove our main result: as a partial converse to Theorem 1, we show that under additional strictness conditions, every uRCE is path-equivalent to a patiently stable strategy profile.
Definition 16. A quasi-strict uRCE $\pi^*$ is a uRCE that is on-path strict for the receiver, strict for the sender (that is, there exists an equilibrium signal $s^*$ for each type $\theta$ with $u_1(\theta, s^*, \pi_2^*(\cdot \mid s^*)) > \max_{s \neq s^*} u_1(\theta, s, \pi_2^*(\cdot \mid s))$, so every type strictly prefers its equilibrium signal to any other), and satisfies $E_{\pi^*}[u_1 \mid \theta] > u_1(\theta, s, a)$ for all $\theta$, all off-path signals $s$ and all $a \in BR(\hat{P}(s), s)$.

In other words, every receiver best response to a belief in $\hat{P}(s)$ strictly deters every type from deviating to $s$, whenever $s$ is off-path. Every uRCE satisfies the weaker version of this condition where "strictly deters" is replaced with "weakly deters."

Footnote: If the receiver mixes after some equilibrium signal $s$ for type $\theta$, then our techniques for showing that $\theta$ does not experiment very much with equilibrium dominated signals do not go through, but we do not have a counterexample.

Theorem 2. If $\pi^*$ is a quasi-strict uRCE, then it is path-equivalent to a patiently stable strategy profile.

This theorem follows from three lemmas on $R_1$ and $R_2$. Indeed, the theorem remains valid in any modified learning model where $R_1$ and $R_2$ satisfy the conclusions of these lemmas.

$R_1$ under a confident prior

The first lemma shows that under a suitable prior, the aggregate sender response of the dynamic learning model approximates the sender's static best response function when applied to certain receiver strategies, namely strategies that are "close" to one inducing a unique optimal signal for each sender type. The precise meaning of "close" that we use treats on- and off-path responses differently, so it requires some auxiliary definitions.
Definition 17.
Let $\pi^*$ be a strategy profile where every type plays a pure strategy and the receiver plays a pure action after each on-path signal. Say $\pi^*$ induces a unique optimal signal for each sender type if

$$E_{\pi^*}[u_1 \mid \theta] > \max_{s \neq \pi_1^*(\theta)} u_1(\theta, s, \pi_2^*(\cdot \mid s))$$

for every type $\theta$.

Starting with a strategy profile $\pi^*$ that induces a unique optimal signal for each sender type, define for each off-path $s$ in $\pi^*$ the set of receiver actions $\tilde{A}(s) := \{a : E_{\pi^*}[u_1 \mid \theta] > u_1(\theta, s, a)\ \forall \theta\}$ that strictly deter every type from deviating. Because $\pi^*$ induces a unique optimal signal, each $\tilde{A}(s)$ must contain at least one element in the support of $\pi_2^*(\cdot \mid s)$, but could also contain other actions. It is clear that if $\pi^*$ were modified off-path by changing each $\pi_2^*(\cdot \mid s)$ to be an arbitrary mixture over $\tilde{A}(s)$, then the resulting strategy profile would continue to induce (the same) unique optimal signal for each sender type.

For $\pi^*$ that induces a unique optimal signal for each sender type, write $B_2^{on}(\pi^*, \epsilon)$ for the elements of $\Pi_2^\bullet$ no more than $\epsilon$ away from $\pi^*$ at the on-path signals in $\pi^*$, that is

$$B_2^{on}(\pi^*, \epsilon) := \{\pi_2 \in \Pi_2^\bullet : |\pi_2(a \mid s) - \pi_2^*(a \mid s)| \le \epsilon,\ \forall a,\ \text{on-path } s \text{ in } \pi^*\}.$$

Similarly, define $B_2^{off}(\pi^*, \epsilon)$ as the elements of $\Pi_2^\bullet$ putting no more than $\epsilon$ probability on actions outside of $\tilde{A}(s)$ after each off-path $s$, where $\tilde{A}(s)$ is the set of actions that would deter every type from deviating to $s$, as above:

$$B_2^{off}(\pi^*, \epsilon) := \{\pi_2 \in \Pi_2^\bullet : \pi_2(\tilde{A}(s) \mid s) \ge 1 - \epsilon,\ \forall\ \text{off-path } s \text{ in } \pi^*\}.$$

Lemma 2.
Suppose $\pi^*$ induces a unique optimal signal for each sender type. Then there exists a regular prior $g_1$, some $0 < \epsilon_{off} < 1$, and a function $\gamma(\delta, \epsilon)$ valued in $(0, 1)$, such that for every $0 < \delta < 1$, $0 < \epsilon < \epsilon_{off}$, and $\gamma(\delta, \epsilon) < \gamma < 1$, if $\pi_2 \in B_2^{on}(\pi^*, \epsilon) \cap B_2^{off}(\pi^*, \epsilon_{off})$, then $|R_1^{g_1,\delta,\gamma}[\pi_2](s \mid \theta) - \pi_1^*(s \mid \theta)| < \epsilon$ for every $\theta$ and $s$.

Note that the same $\epsilon$ appears in the hypothesis $\pi_2 \in B_2^{on}(\pi^*, \epsilon)$ as in the conclusion. That is, the aggregate sender response gets closer to $\pi^*$ as receivers' play gets closer to $\pi^*$.

The idea is to specify a sender prior $g_1$ that is highly confident and correct about the receiver's response to on-path signals, and is also confident that the receiver responds to each off-path signal $s$ with actions in $\tilde{A}(s)$. Take a signal $s$ other than the one that $\theta$ sends in $\pi^*$. If $\theta$ has not experimented much with $s$, then her belief is close to the prior and she thinks deviation does not pay. If $\theta$ has experimented a lot with $s$, then by the law of large numbers her belief is likely to be concentrated in $\tilde{A}(s)$, so again she thinks deviation does not pay. Since the option value for experimentation eventually goes to 0, at most histories all sender types are playing a myopic best response to their beliefs, meaning they will not deviate from $\pi^*$. The intuition is similar to that of Lemmas 6.1 and 6.4 from Fudenberg and Levine (2006), which say that we can construct a highly concentrated and correct prior so that in the steady state, most agents have correct beliefs about opponents' play both on and one step off the equilibrium path.

This lemma requires the assumption that $\pi^*$ is strict for the sender. If $s^*$ were only weakly optimal for $\theta$ in $\pi^*$, there could be receiver strategies arbitrarily close to $\pi^*$ that make some other signal $s \neq s^*$ strictly optimal for $\theta$.
In that case, we cannot rule out that a non-negligible fraction of the $\theta$ population will rationally play $s$ forever when the receiver population plays close to $\pi^*$.

$R_2$ and learning rational compatibility

Let $C$ be the set of sender strategies that respect the rational compatibility order, that is

$$C := \{\pi_1 \in \Pi_1^\bullet : \pi_1(s \mid \theta') \ge \pi_1(s \mid \theta'') \text{ whenever } \theta' \succsim_s \theta''\}.$$

The next lemma shows that there is a prior for the receivers so that when the aggregate sender play is any strategy in $C$, almost all receivers end up with beliefs consistent with the rational compatibility order. This lemma is the main technical contribution of the paper and enables us to provide a sufficient condition for patient stability when the relative frequencies of off-path experiments matter.

Lemma 3.
For each $\epsilon > 0$, there exists a regular receiver prior $g_2$ and $0 < \underline{\gamma} < 1$ so that for any $\underline{\gamma} < \gamma < 1$, $0 < \delta < 1$, and $\pi_1 \in C$,

$$R_2^{g_2,\delta,\gamma}[\pi_1](BR(\hat{P}(s), s) \mid s) \ge 1 - \epsilon$$

for each signal $s$.

The key step in the proof is constructing a prior belief for the receivers so that when the senders' aggregate play is sufficiently close to the target equilibrium, the receiver beliefs respect the compatibility order. This step was not necessary in Fudenberg and Levine (2006), which is the only other paper that has given a sufficient condition for patient stability in a class of games.

To prove Lemma 3, we construct a Dirichlet prior $g_2$ so that for any $s$ such that $\theta' \succsim_s \theta''$, $g_2$ assigns much greater prior weight to $\theta'$ playing $s$ than to $\theta''$ playing $s$. In the absence of data, the receiver strongly believes that the senders are using strategies $\pi_1$ such that $p(\theta'' \mid s) / p(\theta' \mid s) \le \lambda(\theta'') / \lambda(\theta')$. This strong prior belief can only be overturned by a very large number of observations to the contrary. But because $\pi_1 \in C$ respects the rational compatibility order, if the receiver has a very large number of observations of senders choosing $s$, the law of large numbers implies this large sample is unlikely to lead the receiver to have a belief outside of $\hat{P}(s)$. So we can ensure that with high probability sufficiently long-lived receivers play a best response to $\hat{P}(s)$ after the off-path $s$.

Finally, we state a lemma that says for any Dirichlet receiver prior, when lifetimes are long enough, the aggregate receiver response approximates the receiver's best response function on-path when applied to a sender strategy that provides strict incentives after every on-path signal. Write $B_1^{on}(\pi^*, \epsilon)$ for the elements of $\Pi_1^\bullet$ where each type $\theta$ plays $\epsilon$-close to $\pi_1^*(\cdot \mid \theta)$, that is

$$B_1^{on}(\pi^*, \epsilon) := \{\pi_1 \in \Pi_1^\bullet : |\pi_1(s \mid \theta) - \pi_1^*(s \mid \theta)| \le \epsilon,\ \forall \theta, s\}.$$

Lemma 4.
Fix a strategy profile $\pi^*$ where the receiver has strict incentives after every on-path signal. For each regular Dirichlet receiver prior $g_2$, there exists $\epsilon > 0$ and a function $\gamma(\epsilon)$ valued in $(0, 1)$, so that whenever $\pi_1 \in B_1^{on}(\pi^*, \epsilon)$, $0 < \delta < 1$, and $\gamma(\epsilon) < \gamma < 1$, we have $|R_2^{g_2,\delta,\gamma}[\pi_1](a \mid s) - \pi_2^*(a \mid s)| < \epsilon$ for every on-path signal $s$ in $\pi^*$ and $a$.

The intuition is that when the aggregate sender strategy is close to $\pi_1^*$, then at each signal to which $\pi_1^*$ gives positive probability, a receiver with enough data is likely to have a belief close to the Bayesian belief assigned by $\pi^*$. Coupled with the fact that $\pi^*$ is on-path strict for the receiver, this lets us conclude that long-lived receivers play $\pi_2^*(\cdot \mid s)$ after every on-path $s$ with high probability.

Footnote: The result of Fudenberg and Levine (2006) guarantees that the receivers' beliefs about the frequency of type $\theta$ sending signal $s$ are within $\epsilon$ of the truth. This is not sufficient for our purposes, because when signal $s$ has probability 0 under a given sender strategy, perturbing the strategy of every type by up to $\epsilon$ can generate arbitrary off-path beliefs about the sender's type.

Footnote: The Dirichlet prior is the conjugate prior to multinomial data, and corresponds to the updating used in fictitious play (Fudenberg and Kreps, 1993). It is readily verified that if each of $g_1^{(s)}$ and $g_2^{(\theta)}$ is Dirichlet and independent of the other components, then $g$ is regular. In the proof, we work with Dirichlet priors since they give tractable closed-form expressions for the posterior mean belief of the opponent's strategy after a given history.

We revisit the examples from Section 2.3 and discuss how prior beliefs reflecting knowledge or ignorance of payoff information lead to different implications for learning.
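As background for the Dirichlet priors that appear in the examples, the following Python sketch shows how a receiver with independent Dirichlet priors over each type's signal choice forms a belief about the sender's type after a signal. It is a minimal illustration, not the paper's proof: the function names, the type labels, the weight `x = 4.0`, the counts, and the assignment of the Dirichlet(2, 1) and Dirichlet(2, x) priors to the two types are assumptions chosen here.

```python
def posterior_mean(prior_weights, counts):
    """Posterior mean of a Dirichlet prior after observing multinomial counts."""
    total = sum(prior_weights.values()) + sum(
        counts.get(k, 0) for k in prior_weights)
    return {k: (w + counts.get(k, 0)) / total for k, w in prior_weights.items()}


def belief_after_signal(signal, priors, counts, type_probs):
    """Receiver's posterior over sender types after `signal`.

    Each type's predicted probability of sending `signal` is the posterior
    mean of that type's Dirichlet prior; types are weighted by type_probs.
    """
    joint = {
        t: type_probs[t] * posterior_mean(priors[t], counts.get(t, {}))[signal]
        for t in type_probs
    }
    norm = sum(joint.values())
    return {t: v / norm for t, v in joint.items()}


# Illustrative inputs (assumptions): two equally likely types, signals L and R.
x = 4.0
priors = {"theta1": {"L": 2.0, "R": 1.0}, "theta2": {"L": 2.0, "R": x}}
type_probs = {"theta1": 0.5, "theta2": 0.5}

# The receiver has only ever seen L: n1 times from theta1, n2 times from theta2.
n1, n2 = 50, 50
counts = {"theta1": {"L": n1}, "theta2": {"L": n2}}

belief = belief_after_signal("R", priors, counts, type_probs)
likelihood_ratio = belief["theta1"] / belief["theta2"]
```

Because R is never observed in this illustration, the posterior means for R are $1/(3+n_1)$ and $x/(2+x+n_2)$, so the likelihood ratio equals $\frac{1}{x}\cdot\frac{2+x+n_2}{3+n_1}$. When the two counts grow at the same rate the ratio approaches $1/x$, which is why the off-path belief is governed by the prior weight rather than by the data.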
In Example 2, it follows from Lemma 1 that for any $0 \le \delta, \gamma < 1$, any receiver play $\pi_2 \in \Pi_2^\bullet$, and any regular prior $g_1$, we have $R_1^{g_1}[\pi_2](\text{In} \mid \theta_{strong}) \ge R_1^{g_1}[\pi_2](\text{In} \mid \theta_{weak})$. In the absence of payoff information, we show that there exists a full-support prior $g_1$ so that, fixing $\pi_2$ to always play Down, we get $R_1^{g_1}[\pi_2](\text{In} \mid \theta_{strong}) \le R_1^{g_1}[\pi_2](\text{In} \mid \theta_{weak})$ for any $0 \le \delta, \gamma < 1$, with strict inequality for an open set of parameter values.

Let $g_1^{(\text{In})}$ be Dirichlet with weights (1, K,
1) on ( Up , Down , X ) for arbitrary K ≥
4. After observing k ≥ In with Down , a sender would have the posterior Dirichlet(1 , K + k, θ weak type’s Gittins index for In would be unchanged if her payoffs to ( Up , Down , X ) were (3 , − ,
1) instead of (1 , − , Up and X . This observation shows her Gittins index for In is at least as largeas θ strong ’s, whose payoffs to ( Up , Down , X ) are (2 , − , In after fewer observations of Down than the weaktype does (this includes the case of “switching away” after 0 observations of
Down , i.e. the strong type never experimenting with In .) We have proven R g [ π ](In | θ strong ) ≤ R g [ π ](In | θ weak ) for any 0 ≤ δ, γ < In is myopically suboptimal for both types, and by the previousargument, the minimum effective discount factor δγ that would induce atleast one period of experimentation with In is strictly higher for the strongtype than the weak type. This shows for an open set of δ, γ parameters, R g [ π ](In | θ strong ) = 0 but R g [ π ](In | θ weak ) > In Example 3 we showed that there is an RCE in which the receivers play a after R . Because RCE is not a sufficient condition for patient stability, thisleaves open the question of whether this strategy can arise in our learningmodel. Here we verify that it can, and also show that “ a after R ” cannotbe part of a patiently stable outcome in the absence of payoff information.This is because patient but inexperienced θ ’s without payoff information findit plausible that receivers choose a after R , so they will experiment muchmore frequently with the off-path signal R than θ ’s, for whom every possibleresponse to R leads to worse payoffs than their equilibrium payoff of 2. As aresult, receivers learn that R -senders have type θ so they respond with a . Onthe other hand, when senders know ex-ante that receivers will never choose a after R , for some priors there are steady states where no one ever experimentswith R . When this happens, the receivers’ belief about the likelihood ratioof the types following the off-path R is governed by their prior beliefs, whichmay be arbitrary and thus support a richer class of learning outcomes.Specifically, in Example 3, suppose g ( L )1 is Dirichlet (1 , ,
1) over all threeresponses to L , while g ( R )1 is Dirichlet(1 ,
1) on A BR R = { a , a } , which reflectsthe sender’s knowledge that a is a conditionally dominated response to R . And suppose that g θ is the Dirichlet(2 ,
1) distribution on { L , R } and g θ isthe Dirichlet(2 , x ) distribution, where x > ≤ δ, γ < , there exists a steady state where senders always choose L andreceivers always respond to L with a . This is because the Gittins index for R is no larger than − θ and no larger than 1 for θ after any history, whilethe myopic expected payoff of L already exceeds these values in the first period.The expected payoff of L only increases with additional observations of a after33 . On the receiver side, every positive-probability history y must involve thesenders playing L every period. Following such a history, the receiver believes θ plays L with probability at least , hence an L -sender is the θ type withprobability at least / / = . We have a ∈ BR( { p } , L ) whenever p ( θ ) ≥ ,so we have shown that in the steady state receivers always play a after L .In this steady state, signal R is never sent, so by choosing different valuesof x >
0, we can sustain either a or a after R as part of a patiently stableprofile. To be more precise, let n and n count the number of times the twotypes of senders appear in a positive-probability history y . The receiver’sposterior assigns the following likelihood ratio to the type of an R -sender:13 + n / x x + n = 1 x · (cid:18) x + n n (cid:19) . Since the two types are equally likely, the fraction of receivers with histories y so that 0 . ≤ (cid:16) x + n n (cid:17) ≤ . γ → . Depending on whether x = 1 / x = 4, these receivers will play a or a after R , so π ( a | R ) = 1and π ( a | R ) = 1 are both δ -stable for any δ ≥
0, under two different regularpriors reflecting payoff knowledge.By contrast, Theorem 3 of Fudenberg and He (2018) implies that if priors g , g have full support on Π and Π respectively, then we must have a after R in every patiently stable profile. The idea is that when senders are patientand long-lived, new θ start off by trying R but new θ start off by trying L .When receivers play a after L with high probability, it is very unlikely that θ ever switches away from L , providing a bound on their frequency of playing R . On the other hand, as their effective discount factor increases, θ will spendarbitrarily many periods of its early life playing R in hopes of getting the bestpayoff of 5, lacking the payoff knowledge that a is conditionally dominatedfor the receivers after R . Receivers therefore end up learning that R -sendershave type θ . 34 Conclusion
This paper studies non-equilibrium learning about other players’ strategies in the setting of signaling games. When the agents’ prior beliefs about their opponents’ play reflect prior knowledge of others’ payoff functions, the steady states of societies of Bayesian learners can be bounded by two equilibrium refinements, RCE and uRCE, that nest and resemble divine equilibrium. Divine equilibrium and RCE are only defined for signaling games. In general extensive-form games, agents may find it optimal to play strictly dominated strategies as experiments to learn about the consequences of their other strategies, so requiring prior beliefs to be supported on opponents’ undominated strategies can lead to situations where agents observe play that they had assigned zero prior probability. We leave the associated complications for future work.
References
Banks, J. S. and J. Sobel (1987): “Equilibrium Selection in Signaling Games,” Econometrica, 55, 647–661.
Cho, I.-K. and D. M. Kreps (1987): “Signaling Games and Stable Equilibria,” Quarterly Journal of Economics, 102, 179–221.
Esponda, I. and D. Pouzo (2016): “Berk-Nash Equilibrium: A Framework for Modeling Agents With Misspecified Models,” Econometrica, 84, 1093–1130.
Fudenberg, D. and K. He (2018): “Learning and Type Compatibility in Signaling Games,” Econometrica, 86, 1215–1255.
Fudenberg, D. and D. M. Kreps (1993): “Learning Mixed Equilibria,” Games and Economic Behavior, 5, 320–367.
Fudenberg, D. and D. K. Levine (1993): “Steady State Learning and Nash Equilibrium,” Econometrica, 61, 547–573.
Fudenberg, D. and D. K. Levine (2006): “Superstition and Rational Learning,” American Economic Review, 96, 630–651.
Kalai, E. and E. Lehrer (1993): “Rational Learning Leads to Nash Equilibrium,” Econometrica, 61, 1019–1045.
Kohlberg, E. and J.-F. Mertens (1986): “On the Strategic Stability of Equilibria,” Econometrica, 54, 1003–1037.
Sobel, J., L. Stole, and I. Zapater (1990): “Fixed-Equilibrium Rationalizability in Signaling Games,” Journal of Economic Theory, 52, 304–331.
Van Damme, E. (1987): Stability and Perfection of Nash Equilibria, Springer-Verlag.
A Appendix
A.1 Proof of Proposition 1
Proof.
To show (1), suppose $\theta' \succsim_{s'} \theta''$ and $\theta'' \succsim_{s'} \theta'''$. For any $\pi_2 \in \Pi_2^{\bullet}$ where $s'$ is weakly optimal for $\theta'''$, it must be strictly optimal for $\theta''$, hence also strictly optimal for $\theta'$. This shows $\theta' \succsim_{s'} \theta'''$.

To establish (2), partition the set of rational receiver strategies as $\Pi_2^{\bullet} = \Pi_2^{+} \cup \Pi_2^{0} \cup \Pi_2^{-}$, where the three subsets refer to receiver strategies that make $s'$ strictly better than, indifferent to, or strictly worse than the best alternative signal for $\theta''$. If the set $\Pi_2^{0}$ is nonempty, then $\theta' \succsim_{s'} \theta''$ implies $\theta'' \not\succsim_{s'} \theta'$. This is because against any $\pi_2 \in \Pi_2^{0}$, signal $s'$ is strictly optimal for $\theta'$ but only weakly optimal for $\theta''$. At the same time, if both $\Pi_2^{+}$ and $\Pi_2^{-}$ are nonempty, then $\Pi_2^{0}$ is nonempty. This is because both $\pi_2 \mapsto u_1(\theta'', s', \pi_2(\cdot \mid s'))$ and $\pi_2 \mapsto \max_{s'' \neq s'} u_1(\theta'', s'', \pi_2(\cdot \mid s''))$ are continuous functions, so for any $\pi_2^{+} \in \Pi_2^{+}$ and $\pi_2^{-} \in \Pi_2^{-}$, there exists $\alpha \in (0, 1)$ so that $\alpha \pi_2^{+} + (1 - \alpha) \pi_2^{-} \in \Pi_2^{0}$. (Note that $\pi_2^{+}$ and $\pi_2^{-}$ must be supported on $A^{\mathrm{BR}}_{s}$ after every signal $s$, so the same must hold for the mixture $\alpha \pi_2^{+} + (1 - \alpha) \pi_2^{-}$. Thus, this mixture also belongs to $\Pi_2^{\bullet}$.)

If only $\Pi_2^{+}$ is nonempty and $\theta' \succsim_{s'} \theta''$, then $s'$ is rationally strictly dominant for both $\theta'$ and $\theta''$. If only $\Pi_2^{-}$ is nonempty, then we can have $\theta'' \succsim_{s'} \theta'$ only when $s'$ is never a weak best response for $\theta'$ against any $\pi_2 \in \Pi_2^{\bullet}$.

A.2 Proof of Proposition 2
Proof.
Let $\pi^*$ be a uRCE. We construct a path-equivalent RCE, $\pi^{\circ}$, as follows. Set $\pi_1^{\circ} = \pi_1^*$ and set $\pi_2^{\circ}(\cdot \mid s) = \pi_2^*(\cdot \mid s)$ for every on-path signal $s$. At each off-path signal $s$ where $\tilde{J}(s, \pi^*) \neq \emptyset$, let $\pi_2^{\circ}(\cdot \mid s)$ prescribe some best response to a belief in $\tilde{P}(s, \pi^*)$. At each off-path signal $s$ where $\tilde{J}(s, \pi^*) = \emptyset$, let $\pi_2^{\circ}(\cdot \mid s)$ prescribe some best response to a belief in $\Delta(\Theta_s)$.

In this strategy profile, the receiver’s play is a best response to rationality-compatible beliefs after every off-path $s$ by construction, and because the sender’s play is the same as before, the receiver is still playing best responses to on-path signals.

Because the on-path play of the receivers did not change, no sender type wishes to deviate to any on-path signal. Now we check that no sender type wishes to deviate to any off-path signal. Consider first off-path $s$ where $\tilde{J}(s, \pi^*) \neq \emptyset$. Here we have $\tilde{J}(s, \pi^*) \subseteq \Theta_s$, which implies that $\tilde{P}(s, \pi^*) \subseteq \hat{P}(s)$. By the definition of uRCE, $\pi_2^{\circ}(\cdot \mid s)$ must deter every type from deviating to such $s$. Finally, no sender type wishes to deviate to any $s$ where $\tilde{J}(s, \pi^*) = \emptyset$, by the definition of equilibrium dominance.

A.3 Proof of Proposition 3
Proof.
Fix a $\pi_1$ with $\pi_1(s' \mid \theta'') > 0$ but $\pi_1(s' \mid \theta') < 1$. Because the space of rational receiver strategies $\Pi_2^{\bullet}$ is convex, it suffices to show there is no receiver strategy $\pi_2 \in \Pi_2^{\bullet}$ such that $\pi_1$ is a best response to $\pi_2$ in the ex-ante strategic form. If $\pi_1$ is an ex-ante best response, then it needs to be at least weakly optimal for type $\theta''$ to play $s'$ against $\pi_2$. By $\theta' \succsim_{s'} \theta''$, this implies $s'$ is strictly optimal for type $\theta'$. This shows $\pi_1$ is not a best response to $\pi_2$, as the sender can increase her ex-ante expected payoffs by playing $s'$ with probability 1 when her type is $\theta'$.

A.4 Proof of Proposition 4
Proof.
Suppose $\pi^*$ does not pass the Intuitive Criterion. Then there exists a type $\theta$ and a signal $s'$ such that
$$u_1(\theta; \pi^*) < \min_{a \in \mathrm{BR}(\Delta(\tilde{J}(s', \pi^*)),\, s')} u_1(\theta, s', a).$$
If $\pi^*$ were an RCE, then we would have $\pi_2^*(\cdot \mid s') \in \Delta(\mathrm{BR}(\tilde{P}(s', \pi^*), s'))$. Since $\tilde{P}(s', \pi^*) \subseteq \Delta(\tilde{J}(s', \pi^*))$, we have
$$u_1(\theta; \pi^*) < u_1(\theta, s', \pi_2^*(\cdot \mid s')).$$
This means $\pi^*$ is not a Nash equilibrium, a contradiction.

A.5 Proof of Proposition 5
Proof.
To show (a), note first that if $D(\theta'', s'; \pi^*) \cup D^{\circ}(\theta'', s'; \pi^*) = \emptyset$ the conclusion holds vacuously. If $D(\theta'', s'; \pi^*) \cup D^{\circ}(\theta'', s'; \pi^*)$ is not empty, take any $\alpha \in D(\theta'', s'; \pi^*) \cup D^{\circ}(\theta'', s'; \pi^*)$ and define $\pi_2 \in \Pi_2^{\bullet}$ by $\pi_2(\cdot \mid s') = \alpha$ and $\pi_2(\cdot \mid s) = \pi_2^*(\cdot \mid s)$ for $s \neq s'$. Then
$$u_1(\theta''; \pi^*) = \max_{s \neq s'} u_1(\theta'', s, \pi_2(\cdot \mid s)) \leq u_1(\theta'', s', \pi_2(\cdot \mid s')) = u_1(\theta'', s', \alpha),$$
and when $\theta' \succsim_{s'} \theta''$, this implies that
$$u_1(\theta'; \pi^*) = \max_{s \neq s'} u_1(\theta', s, \pi_2(\cdot \mid s)) < u_1(\theta', s', \pi_2(\cdot \mid s')) = u_1(\theta', s', \alpha).$$
Hence $\alpha \in D(\theta', s'; \pi^*)$.

To show (b), suppose $\pi^*$ is a divine equilibrium. Then it is a Nash equilibrium, and furthermore for any off-path signal $s'$ where $\theta' \succsim_{s'} \theta''$, Proposition 5(a) implies that $D(\theta'', s'; \pi^*) \cup D^{\circ}(\theta'', s'; \pi^*) \subseteq D(\theta', s'; \pi^*)$. Since $\pi^*$ is a divine equilibrium, $\pi_2^*(\cdot \mid s')$ must then best respond to some belief $p \in \Delta(\Theta)$ with $\frac{p(\theta'')}{p(\theta')} \leq \frac{\lambda(\theta'')}{\lambda(\theta')}$. Considering all $(\theta', \theta'')$ pairs, we see that in a divine equilibrium $\pi_2^*(\cdot \mid s')$ best responds to some belief in
$$\bigcap_{(\theta', \theta'') \text{ s.t. } \theta' \succsim_{s'} \theta''} P_{\theta' \triangleright \theta''}.$$
At the same time, in every divine equilibrium, the belief after off-path $s'$ puts zero probability on equilibrium-dominated types, meaning $\pi_2^*(\cdot \mid s')$ best responds to $\Delta(\tilde{J}(s', \pi^*))$. This shows $\pi^*$ is an RCE.

A.6 Proof of Proposition 6
Proof.
Consider a uRCE $\pi^*$. For every off-path $s'$, perform the following modifications on $\pi_2^*(\cdot \mid s')$: if the first-round application of the NWBR procedure would have deleted every type, then do not modify $\pi_2^*(\cdot \mid s')$. Otherwise, find some $\theta_{s'}$ not deleted by the iterated NWBR procedure, then change $\pi_2^*(\cdot \mid s')$ to some action in $\mathrm{BR}(\{\theta_{s'}\}, s')$, i.e. a best response to the belief putting probability 1 on $\theta_{s'}$.

This modified strategy profile passes the NWBR test. We now establish that it remains a uRCE by checking that for those off-path $s'$ where $\pi_2^*(\cdot \mid s')$ was modified, the modified version is still a best response to $\hat{P}(s')$. (By uniformity, this would ensure that the modified receiver play continues to deter every type from deviating to $s'$.)

Type $\theta_{s'}$ satisfies $\theta_{s'} \in \Theta_{s'}$. Otherwise, $D^{\circ}(\theta_{s'}, s'; \pi^*) = \emptyset$ and $\theta_{s'}$ would have been deleted by NWBR in the first round. Now it suffices to argue there is no $\theta'$ such that $\theta' \succsim_{s'} \theta_{s'}$, which implies the belief putting probability 1 on $\theta_{s'}$ is in $\hat{P}(s')$. If there were such $\theta'$, by Proposition 5(a) we would have $D^{\circ}(\theta_{s'}, s'; \pi^*) \subseteq D(\theta', s'; \pi^*)$, so $\theta_{s'}$ should have been deleted by NWBR in the first round, contradicting the fact that $\theta_{s'}$ survives all iterations of the NWBR procedure.

A.7 Proof of Corollary 1
Proof.
This follows from Proposition 6 because every NWBR equilibrium is a universally divine equilibrium.
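As a small numerical illustration of the belief restriction used in the proof of Proposition 5(b), the sketch below checks whether a receiver belief $p$ satisfies $p(\theta'')/p(\theta') \leq \lambda(\theta'')/\lambda(\theta')$ for every pair in which $\theta'$ is more compatible with the signal than $\theta''$. The function names, the type labels `t1` and `t2`, and the numbers are hypothetical and not taken from the paper. The inequality is cross-multiplied to avoid division by zero, and treating the case $p(\theta') = p(\theta'') = 0$ as satisfying the restriction is an assumption of this sketch.

```python
from fractions import Fraction


def in_pairwise_set(p, lam, more, less):
    """Check p(less)/p(more) <= lam(less)/lam(more) in cross-multiplied form.

    `more` is the type assumed to be more compatible with the signal.
    Cross-multiplying avoids division by zero; counting 0/0 as satisfying
    the inequality is an assumption of this sketch.
    """
    return p[less] * lam[more] <= lam[less] * p[more]


def in_intersection(p, lam, compatible_pairs):
    """True if p satisfies the restriction for every (more, less) pair."""
    return all(in_pairwise_set(p, lam, m, l) for m, l in compatible_pairs)


# Hypothetical two-type example: equal prior weights, and "t1" is assumed
# to be more compatible with the off-path signal than "t2".
lam = {"t1": Fraction(1, 2), "t2": Fraction(1, 2)}
pairs = [("t1", "t2")]

belief_ok = {"t1": Fraction(7, 10), "t2": Fraction(3, 10)}
belief_bad = {"t1": Fraction(1, 5), "t2": Fraction(4, 5)}

print(in_intersection(belief_ok, lam, pairs))   # likelihood ratio 3/7 <= 1
print(in_intersection(belief_bad, lam, pairs))  # likelihood ratio 4 > 1
```

With equal prior weights, the restriction reduces to requiring that the belief put at least as much weight on the more compatible type, so the first belief passes and the second fails.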
A.8 Proof of Lemma 2
Proof.
Here are three lemmas from Fudenberg and Levine (2006):
FL06 Lemma A.1 : Suppose { X k } is a sequence of i.i.d. Bernoulli randomvariables with E [ X k ] = µ , and define for each n the random variable S n := | P nk =1 ( X k − µ ) | n . Then for any n, ¯ n ∈ N , P (cid:20) max n ≤ n ≤ ¯ n S n > (cid:15) (cid:21) ≤ · n · µ(cid:15) . FL06 Lemma A.2 : For all (cid:15), (cid:15) >
0, there is an
N > δ, γ, g, π , signal s and action a ∈ A , ψ π ;( g,δ,γ ) θ { y θ : | ˆ π ( a | s ; y θ ) − π ( a | s ) | > (cid:15), s | y θ ) > N } < (cid:15) . (Here, ˆ π ( a | s ; y θ ) is the empirical frequency of receiver playing a after signal m in history y θ , that is to say ˆ π ( a | s ; y θ ) = a, s ) , y θ ) / s, y θ ).) FL06 Lemma A.4 : For all (cid:15), (cid:15) > δ <
1, there exists N such thatfor all π , g, and γ , we get ψ π ;( g,δ,γ ) θ { y θ / ∈ Y θ ( (cid:15) ) , σ θ ( y θ ) , y θ ) > N } ≤ (cid:15) where Y θ ( (cid:15) ) ⊆ Y θ are those histories y θ wheremax s ∈ S u ( θ, s | y θ ) ≤ u ( σ θ ( y θ ) | y θ ) + (cid:15), that is, type θ is playing a myopic (cid:15) best response according to posterior beliefafter history y θ .Now we proceed with our argument.40ince π ∗ is strict on-path , there exist ξ , ξ > π satisfies | π ( a | s ) − π ∗ ( a | s ) | ≤ ξ for every on-path s and action a , while forevery off-path s we have π ( ˜ A ( s ) | s ) ≥ − ξ , then for each type θ we get u ( θ, π ∗ ( θ ) , π ) > ξ + max s = π ∗ ( θ ) u ( θ, s, π R ) . That is, if receiver plays ξ -close to π ∗ on-path and ξ -close to ˜ A ( s ) off-path,then for every type of sender, playing the prescribed equilibrium signal isstrictly better than any other signal by at least ξ > g such that when-ever sender has fewer than n := 2 /ξ observations of playing signal s , herbelief as to receiver’s probability of taking action a after signal s differs from π ∗ ( a | s ) by no more than ξ if s is on-path, while her belief as to the probabilitythat receiver strategy assigns to ˜ A ( s ) is at least 1 − ξ if s is off-path. Also, let (cid:15) off := ξ / δ ∈ (0 ,
1) and 0 < (cid:15) < (cid:15) off be given. We construct γ ( δ, (cid:15) ) satisfyingthe conclusion of the lemma.To do this, in FL06 Lemma A.4 put (cid:15) = ξ and (cid:15) = (cid:15)/
6, to obtain a N ( (cid:15) ). Next, in FL06 Lemma A.2 put (cid:15) = ξ / (cid:15) = (cid:15)/
6, to obtain N ( (cid:15) ). Let N ( (cid:15) ) := N ( (cid:15) ) ∨ N ( (cid:15) ). There are 5 classes of exceptional histories for type θ that can lead to playing some signal ˆ s other than the one prescribed by theequilibrium strategy, s ∗ := π ∗ ( θ ). Exception 1 : θ has played ˆ s fewer than N ( (cid:15) ) times before, that is σ θ ( y θ ) =ˆ s but s, y θ ) < N ( (cid:15) ). Such histories can be made to have mass no larger than (cid:15)/ γ ( δ, (cid:15) ) large enough. Exception 2 : y θ is in the exceptional set described in FL06 Lemma A.4.But by choice of N ( (cid:15) ) ≥ N ( (cid:15) ), we know that ψ π ;( g,δ,γ ) θ { y θ / ∈ Y θ ( ξ ) , σ θ ( y θ ) , y θ ) > N ( (cid:15) ) } ≤ (cid:15)/ . Exception 3 : θ has played ˆ s more than N ( (cid:15) ) times, but has a misleadingsample. By FL93 Lemma A.2, ψ π ;( g,δ,γ ) θ { y θ : | ˆ π ( a | ˆ s ; y θ ) − π ( a | ˆ s ) | > ξ / , s | y θ ) > N ( (cid:15) ) } < (cid:15)/ . π ∈ B on2 ( π ∗ , (cid:15) ) ∩ B off2 ( π ∗ , (cid:15) off ), we know π differs from π ∗ by no more than (cid:15) off = ξ / ξ / A ( s ) after off-path signal s . So in particular, ψ π ;( g,δ,γ ) θ y θ : | ˆ π ( a | ˆ s ; y θ ) − π ∗ ( a | ˆ s ) | > ξ if ˆ s on-path, orˆ π ( ˜ A (ˆ s ) | ˆ s ) < − ξ if ˆ s off-path s | y θ ) > N ( (cid:15) ) < (cid:15)/ . Exception 4 : θ has played the equilibrium signal s ∗ more than N ( (cid:15) ) times,but has a misleading sample. As before, we get ψ π ;( g,δ,γ ) θ { y θ : | ˆ π ( a | s ∗ ; y θ ) − π ∗ ( a | s ∗ ) | > ξ , s ∗ | y θ ) > N ( (cid:15) ) } < (cid:15)/ . Exception 5 : θ has played the equilibrium signal s ∗ between n and N ( (cid:15) )times, but has a misleading sample. Let X k ∈ { , } denote whether θ seesthe equilibrium response π ∗ ( s ∗ ) the k -th time she plays s ∗ ( X k = 0) or whethershe sees instead a different response ( X k = 1). 
As in FL06 Lemma A.1, define S n := | P nk =1 ( X k − µ ) | n where µ = 1 − π ( π ∗ ( s ∗ ) | s ∗ ) < (cid:15) since s ∗ is an on-path signal in π ∗ .The probability that the fraction of responses other than π ∗ ( s ∗ ) exceeds ξ between the n -th time and N ( (cid:15) )-th time that θ plays s ∗ is bounded above byFL06 Lemma A.1, P " max n ≤ n ≤ N ( (cid:15) ) S n > ξ / ≤ · n · µ ( ξ / ≤ · µ (by choice of n ) ≤ (cid:15) / . Finally, at a history y θ that does not belong to those exceptions, we musthave σ θ ( y θ ) = m ∗ . This is because y θ is not in exception 1, so θ has played σ θ ( y θ ) at least N ( (cid:15) ) times before, and it is not in exception 2, so σ θ ( y θ ) is a ξ myopic best response to current beliefs. Yet the empirical frequency for42esponse after signal σ θ ( y θ ) is no more than ξ away from π ∗ ( σ θ ( y θ )) as y θ isnot in exception 3 . Since the prior is Dirichlet and also has this property, thismeans the current posterior belief about response after signal σ θ ( y θ ) also hasthis property. If s ∗ , y θ ) > n , then y θ not being in exceptions 4 or 5 impliesbelief as to response after signal s ∗ is also no more than ξ away from π ∗ ( s ∗ ),while if s ∗ , y θ ) < n then choice of prior implies the same. In short, beliefson both responses after s ∗ and responses after σ θ ( y θ ) are no more than ξ awayfrom their π ∗ counterparts. But in that case, no signal other than s ∗ can bean ξ best response. A.9 Proof of Lemma 3
Proof.
For each ξ >
0, consider the approximation to P θ .θ , P ξθ .θ := ( p ∈ ∆(Θ) : p ( θ ) p ( θ ) ≤ (1 + ξ ) λ ( θ ) λ ( θ ) ) and hence the approximation to ˆ P ( s ),ˆ P ξ ( s ) := ∆(Θ s ) \ n P ξθ .θ : θ (cid:37) s θ o . Since the BR correspondence has a closed graph, there is an ξ > P ξ ( s ) , s ) = BR( ˆ P ( s ) , s ).Take some such ξ. Next we will choose a series of constants. • Pick 0 < h < − h h > (1 − ξ ) / . • Pick
G > θ ∈ Θ, 1 / ( h · G · (1 − h ) · λ ( θ )) <(cid:15)/ (4 · | S | · | Θ | ) . • For each θ , construct a Dirichlet prior on S θ with parameters α ( θ, s ) ≥ α ( θ, s ) ≥ θ (cid:37) s θ , wehave α ( θ, s ) − α ( θ , s ) > ( q (4 · | S | · | Θ | ) /(cid:15) + 1) · G. (2)In the event that θ (cid:37) s θ and θ (cid:37) s θ , put α ( θ, s ) = α ( θ , s ).43 Pick N ∈ N so that for any N > N , θ, θ ∈ Θ, we have P [(1 − h ) · N · λ ( θ ) ≤ Binom(
N, λ ( θ )) ≤ (1 + h ) · N · λ ( θ )] > − (cid:15) · | Θ | and (1 − h ) · N · λ ( θ )(1 + h ) · N · λ ( θ ) + max θ P s ∈ S α ( θ, s ) > (1 − ξ ) / λ ( θ ) λ ( θ ) . • Pick γ ∈ (0 ,
1) such that 1 − ( γ ) N +1 < (cid:15)/ . Suppose the receiver’s prior over the strategy of type θ is Dirichlet with pa-rameters ( α ( θ, s )) s ∈ S . We claim that the conclusion of the lemma holds.Fix some strategy π ∈ C . Write θ | y ) for the number of times thesender has been of θ type in history y , while θ, s | y ) counts the numberof times type θ has sent signal s in history y . Put ψ = ψ π ;( g,δ,γ )2 and write E ⊆ Y for those receiver histories with length at least N satisfying(1 − h ) · N · λ ( θ ) ≤ θ | y ) ≤ (1 + h ) · N · λ ( θ )for every θ ∈ Θ. By the choice of N and γ , whenever γ > γ we have ψ ( E ) ≥ − (cid:15)/
2. We now show that given E , the conditional probability that thereceiver’s posterior belief after every off-equilibrium signal s lies in ˆ P ξ ( s ) is atleast 1 − (cid:15)/
2. To do this, fix signal s and two types with θ (cid:37) s θ .If s is strictly dominated for both θ and θ , then according to the receivers’Dirichlet prior, θ and θ each sends s with zero probability. Since π ∈ Π • , we have π ( s | θ ) = π ( s | θ ) = 0. So after every positive-probability history,receiver’s belief falls in ˆ P ξ ( s ) as it puts zero probability on the s -sender being θ or θ . Henceforth we only consider the case where s is not strictly dominatedfor both.After history y , the receiver’s updated posterior likelihood ratio for types44 and θ upon seeing signal s is λ ( θ ) λ ( θ ) · α ( θ, s ) + θ, s | y ) θ | y ) + P s ∈ S α ( θ, s ) / α ( θ , s ) + θ , s | y ) θ | y ) + P s ∈ S α ( θ , s ) ! = λ ( θ ) λ ( θ ) · α ( θ, s ) + θ, s | y ) α ( θ , s ) + θ , s | y ) · θ | y ) + P s ∈ S α ( θ , s ) θ | y ) + P s ∈ S α ( θ, s ) . Since we have θ | y ) ≥ (1 − h ) · N · λ ( θ ) while θ | y ) ≤ (1 + h ) · N · λ ( θ ),we get θ | y ) + P s ∈ S α ( θ , s ) θ | y ) + P s ∈ S α ( θ, s ) ≥ (1 − h ) · N · λ ( θ )(1 + h ) · N · λ ( θ ) + P s ∈ S α ( θ, s ) > (1 − ξ ) / · λ ( θ ) λ ( θ ) . If s is strictly dominant for both θ and θ , then π ∈ Π • means that π ( s | θ ) = π ( s | θ ) = 1. In this case, θ, s | y ) = θ | y ) and θ , s | y ) = θ | y ).Since θ | y ) ≥ (1 − h ) · N · λ ( θ ), θ | y ) ≤ (1 + h ) · N · λ ( θ ), we have: α ( θ, s ) + θ, s | y ) α ( θ , s ) + θ , s | y ) ≥ (1 − h ) · N · λ ( θ ) P s ∈ S α ( θ , s ) + (1 + h ) · N · λ ( θ ) ≥ (1 − ξ ) / λ ( θ ) λ ( θ ) . This shows the product is no smaller than (1 − ξ ) / λ ( θ ) λ ( θ ) , so receiver believesin P ξθ.θ after every history in E .Now we analyze the term α ( θ,s )+ θ,s | y ) α ( θ ,s )+ θ ,s | y ) for the case where s is not strictlydominant for both θ and θ . We consider two cases, depending on whether N is“large enough” so that the compatible type θ experiments enough on averagein a receiver history of length N under sender strategy π . Case A : π ( s | θ ) · N < G . 
In this case, since π ∈ C and θ (cid:37) s θ , we mustalso have π ( s | θ ) · N < G . Then θ , s | y ) is distributed as a binomial randomvariable with mean smaller than G , hence standard deviation smaller than √ G .By Chebyshev’s inequality, the probability that it exceeds ( q (4 · | S | · | Θ | ) /(cid:15) +1) · G is no larger than 1 G · (4 · | S | · | Θ | ) /(cid:15) < (cid:15) | S | · | Θ | . But in any history y where θ , s | y R ) does not exceed this number, we would45ave α ( θ , s ) + θ , s | y ) ≤ α ( θ, s ) ≤ α ( θ, s ) + θ, s | y )by choice of the difference between prior parameters α ( θ , s ) and α ( θ, s ). There-fore α ( θ,s )+ θ,s | y ) α ( θ ,s )+ θ ,s | y ) ≥
1. In summary, under Case A, there is probability nosmaller than 1 − (cid:15) | S |·| Θ | that α ( θ,s )+ θ,s | y ) α ( θ ,s )+ θ ,s | y ) ≥ Case B : π ( s | θ ) · N ≥ G . In this case, we can bound the probability that θ, s | y ) / θ , s | y ) ≤ λ ( θ ) λ ( θ ) · ( 1 − h h ) . Let p := π ( s | θ ). Given that θ | y ) ≥ (1 − h ) · N · λ ( θ ), the distribution of θ, s | y ) first order stochastically dominates Binom((1 − h ) · N · λ ( θ ) , p ) . On the other hand, given that θ | y ) ≤ (1 + h ) · N · λ ( θ ) and furthermore π ( s | θ ) ≤ π ( s | θ ) = p , the distribution of θ , s | y ) is first order stochasticallydominated by Binom((1 + h ) · N · λ ( θ ) , p ) . The first distribution has mean (1 − h ) · N · λ ( θ ) · p with standard deviationno larger than q (1 − h ) · N · λ ( θ ) · p . Thus P [Binom((1 − h ) · N · λ ( θ ) , p ) < (1 − h ) · (1 − h ) · N · λ ( θ ) · p ] < / ( h · q p (1 − h ) N λ ( θ )) ≤ / ( h · q G · (1 − h ) · λ ( θ )) < (cid:15)/ (4 · | S | · | Θ | )where we used the fact that pN ≥ G in the second-to-last inequality, whilethe choice of G ensured the final inequality.At the same time, the second distribution has mean (1 + h ) · N · λ ( θ ) · p with standard deviation no larger than q (1 + h ) · N · λ ( θ ) · p , so P [Binom((1 + h ) · N · λ ( θ ) , p ) > (1 + h ) · (1 + h ) · N · λ ( θ ) · p ] < / ( h · q p (1 + h ) N λ ( θ )) ≤ / ( h · q G · (1 + h ) · λ ( θ )) < (cid:15)/ (4 · | S | · | Θ | )by the same arguments. Combining the bounds on these two binomial randomvariables, P " Binom((1 − h ) · N · λ ( θ ) , p )Binom((1 + h ) · N · λ ( θ ) , p ) ≤ λ ( θ ) λ ( θ ) · ( 1 − h h ) < (cid:15)/ (2 · | S | · | Θ | ) . a fortiori P " θ, s | y ) / θ , s | y ) ≤ λ ( θ ) λ ( θ ) · ( 1 − h h ) < (cid:15)/ (2 · | S | · | Θ | ) . Therefore, for any s, θ, θ such that θ (cid:37) s θ , ψ y : α ( θ, s ) + θ, s | y ) α ( θ , s ) + θ , s | y ) ≥ λ ( θ ) λ ( θ ) · ( 1 − h h ) | E ! ≥ − (cid:15)/ (2 · | S | · | Θ | ) . 
This concludes case B.In either case, at a history y with (1 − h ) · N · λ ( θ ) ≤ θ | y ) ≤ (1 + h ) · N · λ ( θ ) for every θ, for every pair θ, θ such that θ (cid:37) s θ , we get α ( θ,s )+ θ,s | y ) α ( θ ,s )+ θ ,s | y ) ≥ λ ( θ ) λ ( θ ) · ( − h h ) with probability at least 1 − (cid:15)/ (2 · | S | · | Θ | ).But at any history y where this happens, the receiver’s posterior likelihoodratio for types θ and θ after signal s satisfies λ ( θ ) λ ( θ ) · α ( θ, s ) + θ, s | y ) α ( θ , s ) + θ , s | y ) · θ | y ) + P s ∈ S α ( θ , s ) θ | y ) + P s ∈ S α ( θ, s ) ≥ λ ( θ ) λ ( θ ) · λ ( θ ) λ ( θ ) · − h h ! · (1 − ξ ) / · λ ( θ ) λ ( θ ) ≥ λ ( θ ) λ ( θ ) · (1 − ξ ) / · (1 − ξ ) / ≥ λ ( θ ) λ ( θ ) · (1 − ξ ) . As there are at most | Θ | such pairs for each signal s and | S | total signals, ψ y : λ ( θ ) λ ( θ ) · α ( θ,s )+ θ,s | y ) α ( θ ,s )+ θ ,s | y ) · θ | y )+ P s ∈ S α ( θ ,s ) θ | y )+ P s ∈ S α ( θ,s ) ≥ λ ( θ ) λ ( θ ) · (1 − ξ ) ∀ s, θ (cid:37) s θ | E ≥ − (cid:15)/ E has ψ -probability no smaller than 1 − (cid:15)/
2, thereis ψ probability at least 1 − (cid:15) that receiver’s posterior belief is in ˆ P ξ ( s ) afterevery off-path s . 47 .10 Proof of Lemma 4 Proof.
Since π ∗ is on-path strict for the receiver, there exists some ξ > s and every belief p ∈ ∆(Θ) with | p ( θ ) − p ( θ ; s, π ∗ ) | < ξ, ∀ θ ∈ Θ (3)(where p ( · ; s, π ∗ ) is the Bayesian belief after on-path signal s induced by theequilibrium π ∗ ), we have BR( p, s ) = { π ∗ ( s ) } . For each s, we show that thereis a large enough N ( s, (cid:15) ) and small enough ζ ( s ) so that when receiver observeshistory y generated by any π ∈ B on ( π ∗ , (cid:15) ) with (cid:15) < ζ ( s ) / N ( s, (cid:15) ), there is probability at least 1 − (cid:15) | S | that receiver’s posteriorbelief satisfies (3). Hence, conditional on having a history length of at least N ( s, (cid:15) ) , there is 1 − (cid:15) | S | chance that receiver will play as in π ∗ after s . Bytaking the maximum N ∗ ( (cid:15) ) := max s ( N ( s, (cid:15) )) and minimum (cid:15) := min s ζ ( s ),we see that whenever history is length N ∗ ( (cid:15) ) or more, and π ∈ B on ( π ∗ , (cid:15) ) with (cid:15) < (cid:15) , there is at least 1 − (cid:15)/ π ∗ after every on-path signal . Since we can pick γ ( (cid:15) ) large enough that 1 − (cid:15)/ N ∗ ( (cid:15) ) or older, we are done.To construct N ( s, (cid:15) ) and ζ ( s ), let Λ( s ) := λ { θ : π ∗ ( s | θ ) = 1 } . Find smallenough ζ ( s ) ∈ (0 ,
1) so that: • | λ ( θ )Λ( s ) · (1 − ζ ( s )) − λ ( θ )Λ( s ) | < ξ • | λ ( θ ) · (1 − ζ ( s ))Λ( s )+(1 − Λ( s )) · ζ ( s ) − λ ( θ )Λ( s ) | < ξ • ζ ( s )1 − ζ ( s ) · λ ( θ )Λ( s ) < ξ for every θ ∈ Θ. After a history y , the receiver’s posterior belief as to thetype of sender who sends signal s satisfies p ( θ | s ; y ) ∝ λ ( θ ) · θ, s | y ) + α ( θ, s ) θ | y ) + A ( θ ) , where α ( θ, s ) is the Dirichlet prior parameter on signal s for type θ and A ( θ ) := P s ∈ S α ( θ, s ). By the law of large numbers, for long enough history length, we48an ensure that if π ( s | θ ) > − ζ ( s )4 , then θ, s | y ) + α ( θ, s ) θ | y ) + A ( θ ) ≥ − ζ ( s )with probability at least 1 − (cid:15) | S | , while if π ( s | θ ) < ζ ( s ) /
4, then θ, s | y ) + α ( θ, s ) θ | y ) + A ( θ ) < ζ ( s )with probability at least 1 − (cid:15) | S | . Moreover there is some N ( s, (cid:15) ) so that thereis probability at least 1 − (cid:15) | S | that a history y with length at least N ( s, (cid:15) )satisfies above for all θ . But at such a history, for any θ such that π ∗ ( s | θ ) = 1, p ( θ | s ; y ) ≥ λ ( θ ) · (1 − ζ ( s ))Λ( s ) + (1 − Λ( s )) · ζ ( s )and p ( θ | s ; y ) ≤ λ ( θ )Λ( s ) · (1 − ζ ( s )) , while for some θ such that π ∗ ( s | θ ) = 0, p ( θ | s ; y ) ≤ ζ ( s )1 − ζ ( s ) · λ ( θ )Λ( s ) . Therefore the belief p ( ·| s ; y R ) is no more than ξ away from p ( θ ; s, π ∗ ), asdesired. A.11 Proof of Theorem 2
Proof.
We will construct a regular prior g . We will then show that for every0 < δ <
1, there exists convex and compact sets of strategy profiles E j ⊆ Π • with E j ↓ E ∗ ⊆ B on1 ( π ∗ , ∩ B on2 ( π ∗ ,
0) and a corresponding sequence of survivalprobabilities γ j → R g,δ,γ j [ π ] , R g,δ,γ j [ π ]) ∈ E j whenever π ∈ E j .We proved in Fudenberg and He (2018) that R and R are continuous maps,so a fixed point theorem implies that for each j , some strategy profile in E j isa steady state profile under parameters ( g, δ, γ j ). Any convergent subsequence49f these j -indexed steady state profiles has a limit in E ∗ , so this limit agreeswith π ∗ on path. This shows that for every δ there is a δ -stable strategy profilepath-equivalent to π ∗ , so there is a patiently stable strategy profile with thesame property. Step 1 : Constructing g and some thresholds.Since π ∗ induces a unique optimal signal for each sender type, by Lemma2 find a regular sender prior g , < (cid:15) off <
0, and a function γ LM1 ( δ, (cid:15) ).In Lemma 3, substitute (cid:15) = (cid:15) off to find a regular receiver prior g and0 < γ LM2 < g be as constructed above to find (cid:15) LM3 > γ LM3 ( (cid:15) ). Step 2 : Constructing the sets E j .For each j , let E j := C ∩ B on1 ( π ∗ , (cid:15) off ∧ (cid:15) LM3 j ) ∩ B on2 ( π ∗ , (cid:15) off ∧ (cid:15) LM3 j ) ∩ B off2 ( π ∗ , (cid:15) off ) . That is, E j is the set of strategy profiles that respect rational compatibility,differ by no more than (cid:15) off /j from π ∗ on path, and differ by no more than (cid:15) off from π ∗ off path. It is clear that each E j is convex and compact, and thatlim j →∞ E j ⊆ B on1 ( π ∗ , ∩ B on2 ( π ∗ ,
0) as claimed.We may find an accompanying sequence of survival probabilities satisfying γ j > γ LM1 ( δ, (cid:15) off ∧ (cid:15) LM3 j ) ∨ γ LM2 ∨ γ LM3 ( (cid:15) off ∧ (cid:15) LM3 j )with γ j ↑ Step 3 : R g,δ,γ j maps E j into itself.Let some π ∈ E j be given.By Lemma 1 , R g,δ,γ j [ π ] ∈ C .By Lemma 3, R g,δ,γ j [ π ] ∈ B off2 ( π ∗ , (cid:15) off ), because uniformity of π ∗ meansBR( ˆ P ( s ) , s ) ⊆ ˜ A ( s ) for each off-path s .By Lemma 4, R g,δ,γ j [ π ] ∈ B on2 ( π ∗ , (cid:15) off ∧ (cid:15) LM3 j ).Finally, from Lemma 2 and the fact that π ∈ B on2 ( π ∗ , (cid:15) off ∧ (cid:15) LM3 j ) ∩ B off2 ( π ∗ , (cid:15) off ) , we have R g,δ,γ j [ π ] ∈ B on1 ( π ∗ , (cid:15) off ∧ (cid:15) LM3 jj