Learning and Type Compatibility in Signaling Games
Drew Fudenberg †    Kevin He ‡
First version: October 12, 2016. This version: June 30, 2018.
Abstract
Which equilibria will arise in signaling games depends on how the receiver interprets deviations from the path of play. We develop a micro-foundation for these off-path beliefs, and an associated equilibrium refinement, in a model where equilibrium arises through non-equilibrium learning by populations of patient and long-lived senders and receivers. In our model, young senders are uncertain about the prevailing distribution of play, so they rationally send out-of-equilibrium signals as experiments to learn about the behavior of the population of receivers. Differences in the payoff functions of the types of senders generate different incentives for these experiments. Using the Gittins index (Gittins, 1979), we characterize which sender types use each signal more often, leading to a constraint on the receiver's off-path beliefs based on "type compatibility" and hence a learning-based equilibrium selection.

∗ This material was previously part of a larger paper titled "Type-Compatible Equilibria in Signaling Games." We thank Dan Clark, Laura Doval, Glenn Ellison, Mira Frick, Ryota Iijima, Lorens Imhof, Yuichiro Kamada, Robert Kleinberg, David K. Levine, Kevin K. Li, Eric Maskin, Dilip Mookherjee, Harry Pei, Matthew Rabin, Bill Sandholm, Lones Smith, Joel Sobel, Philipp Strack, Bruno Strulovici, Tomasz Strzalecki, Jean Tirole, Juuso Toikka, Alex Wolitzky, and four anonymous referees for helpful comments and conversations, and National Science Foundation grant SES 1643517 for financial support.
† Department of Economics, MIT. Email: [email protected]
‡ Department of Economics, Harvard University. Email: [email protected]

Introduction
In a signaling game, a privately informed sender (for instance, a student) observes her type (e.g. ability) and chooses a signal (e.g. education level) that is observed by a receiver (such as an employer), who then picks an action without observing the sender's type. These signaling games can have many perfect Bayesian equilibria, which are supported by different specifications of how the receiver would update his beliefs about the sender's type following the observation of off-path signals that the equilibrium predicts will never occur. These off-path beliefs are not pinned down by Bayes rule, and solution concepts such as perfect Bayesian equilibrium and sequential equilibrium place no restrictions on them. This has led to the development of equilibrium refinements like Cho and Kreps (1987)'s Intuitive Criterion and Banks and Sobel (1987)'s divine equilibrium that reduce the set of equilibria by imposing restrictions on off-path beliefs, using arguments about how players should infer the meaning of observations that the equilibrium says should never occur.

This paper uses a learning model to provide a micro-foundation for restrictions on the off-path beliefs in signaling games, and thus derive restrictions on which Nash equilibria can emerge from learning. Our learning model has a continuum of agents who are randomly matched each period, with a constant inflow of new agents who do not know the prevailing distribution of strategies and a constant outflow of equal size. The large population makes it rational for the agents to ignore repeated-game effects and ensures the aggregate system is deterministic, while turnover in the population lets us analyze learning in a stationary model where social steady states exist, even though individual agents learn.
To give agents adequate learning opportunities, we assume that their expected lifetimes are long, so that most agents in the population live a long time. And to ensure that agents have sufficiently strong incentives to experiment, we suppose that they are very patient. This leads us to analyze what we call the "patiently stable" steady states of our learning model.

Our agents are Bayesians who believe they face a time-invariant distribution of opponents' play. As in much of the learning-in-games literature and most laboratory experiments, these agents only learn from their personal observations and not from sources such as newspapers, parents, or friends. Therefore, patient young senders will rationally try out different signals to see how receivers react. This implies some "off-path" signals that have probability zero in a given equilibrium will occur with small but positive probabilities in the steady states that approximate it, so we can use Bayes rule to derive restrictions on the receivers' typical posterior beliefs following these rare but positive-probability observations. Moreover, differences in the payoff functions of the sender types lead them to experiment in different ways. As a consequence, we can prove that patiently stable steady states must be a subset of the Nash equilibria in which the receiver responds to beliefs about the sender's type that respect a type compatibility condition. This provides a learning-based justification for eliminating certain "unintuitive" equilibria in signaling games. These results also suggest that learning theory could be used to control the rates of off-path play and hence generate equilibrium refinements in other games.

It is interesting to note that Spence (1973) also interpreted equilibria as steady states (or "nontransitory configurations") of a learning process, though he did not explicitly specify what sort of process he had in mind. As we explain in Corollary 1, our main result extends to environments where some fraction of the population has access to data about the play of others.
To give some of the intuition for our general results, we study a particular stage game embedded in an artificially simple learning model, and explain why optimal experimentation rules out a seemingly unappealing equilibrium outcome. Consider the following signaling game: the sender is either the high type θH or the low type θL, both equally likely. The sender chooses between two signals, s ∈ {In, Out}. If the sender plays Out, the game ends and both parties get 0 payoff. If the sender plays In, the receiver then chooses an action a ∈ {Up, Down}. Payoffs following the signal In depend on the sender's type and the receiver's action, as in the following matrix.

signal: In    action: Up    action: Down
type: θH      2, 1          −2, −1
type: θL      1, −1         −3, 1

Both types prefer (In, Up) to Out and prefer Out to (In, Down), while the receiver prefers Up over Down after signal In if he believes there is greater than 1/2 chance that the sender has type θH.

This game has a perfect Bayesian equilibrium (PBE) where both types choose Out and the receiver plays Down after In, sustained by the belief that anyone who sends In has probability p ≤ 1/2 of being θH. This updating requires the receiver to interpret the off-path In as a signal that the sender is more likely to be θL, even though θH gets 1 more utility than θL does from In regardless of the receiver's strategy. So, "both Out" is eliminated by the D1 criterion. (Any receiver play at the off-path signal In that makes it weakly optimal for θL to deviate to In would also make it strictly optimal for θH to deviate. Cho and Kreps (1987)'s D1 criterion therefore requires the receiver to put 0 probability on θ = θL after In. However, the PBE passes their Intuitive Criterion.)

Now suppose there are three infinitely lived agents: θH, θL, and R (for receiver). Suppose that in each period t ∈ {1, 2, 3, ...}, the three agents play a simultaneous-move game, where each sender type θi chooses a signal s_it, and R chooses a single action a_t to use against both of the senders. (This is a deterministic analog of the receiver randomly matching with each type with probability 1/2 without knowing the sender's type.) At the end of period t, R observes the signal choices of both types, while θi observes a_t if and only if s_it = In. That is, each agent only learns from his/her personal experience; by choosing the "outside option" Out, the sender does not learn how the receiver would have responded to signal In that period.

Agents think that each opponent is committed to some mixed strategy of the stage game and plays this strategy each period, regardless of their observations of past play; that is, all agents treat their observations as exchangeable. At t = 1, each type θi is endowed with a Beta(cU, cD) prior about the probability that R responds to In with Up, with cD > cU > 0, so they assign higher probability to Down than to Up. R starts with two independent priors Beta(cHI, cHO) and Beta(cLI, cLO) about the probabilities that θH and θL choose In each period, where we only assume cHI, cHO, cLI, cLO > 0. The independence assumption means that R does not learn about the behavior of one type from the play of the other.

Agents discount payoffs in future periods at rate 0 ≤ δ < 1. Each δ induces a deterministic infinite history of play (s_Ht, s_Lt, a_t) for t = 1, 2, ..., which we denote Y(δ). When δ = 0, the agents play myopically every period, and because of our assumption that cD > cU, both types choose Out in t = 1. They thus gain no information about R's play, do not update their beliefs, and continue playing Out in every future period. So, the unintuitive "both Out" PBE is the learning outcome when agents are sufficiently impatient. However, we can show for all large enough δ that eventually behavior converges to R playing Up and θH playing In each period. We give a sketch of the argument, beginning with characterizing agents' optimal behavior each period. R observes the same information regardless of his play, so he plays myopically under any δ. Let p(h_t) be R's Bayesian posterior belief about the probability that an In sender has type θH, given history h_t. Then a_{t+1} = Up if p(h_t) > 1/2 and a_{t+1} = Down if p(h_t) < 1/2.

Now we turn to θi, whose problem involves active experimentation. Formally, the dynamic optimization problem facing θi is a one-armed Bernoulli bandit. Choosing s_it = Out is equivalent to taking the safe outside option, while choosing s_it = In is equivalent to pulling the risky arm and getting a payoff depending on whether the pull results in a success (a_t = Up) or a failure (a_t = Down). The optimal policy for θi involves the Gittins index (defined later in Equation (2)). Type θi plays In at those histories where In has a positive Gittins index.

Once a type chooses to play Out in some period, she receives no further information and will continue to play Out in all subsequent periods. Denote the period in Y(δ) that θi first switches from In to Out as T(i, δ) ∈ N ∪ {∞}, where T(i, δ) = ∞ means θi plays In forever. The argument that learning eliminates pooling on Out follows from three observations:
Observation 1. The high type switches to Out later than the low type does, that is, T(H, δ) ≥ T(L, δ). To see why, suppose by way of contradiction that T(H, δ) < T(L, δ). Then, in period t = T(H, δ), both θH and θL have played In until now and have seen the same history, so they hold the same belief about R's play. Yet θH chooses Out at this history while θL chooses In, meaning θH has a negative Gittins index for In while θL has a positive one. This is impossible, since θH's payoff from In is always 1 higher than that of θL, so θH's index for In is also always 1 higher than that of θL when the two types have the same belief about R's play.

(In practice, the required patience level is not unreasonably high. When cD = 1. , cU = 1, cHI = cLO = 1, and cHO = cLI = 3, for example, δ = 0 yields the pathological PBE as the long-run outcome, but when δ ≥ 0.92 the long-run outcome involves s_Ht = In and a_t = Up.)

Observation 2. As the high type becomes patient, she experiments with In arbitrarily many times, that is, lim_{δ→1} T(H, δ) = ∞. This follows because for any fixed full-support prior belief of θH about R's mixed strategy, the Gittins index for In stays close to the "success payoff" of 2 for a length of time that grows to infinity as δ → 1, even in the worst case where R plays Down in every period.
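Observations 1 and 2 rest on two properties of the Gittins index of a Bernoulli arm with a Beta prior: shifting the arm's payoffs by a constant shifts the index by the same constant (so θH's index always exceeds θL's by exactly 1 when beliefs agree), and the index grows with the discount factor, reducing to the myopic expected payoff at β = 0. The following sketch computes the index by bisection on a per-period retirement charge; the code and the particular payoff values are our own illustration, not part of the paper's formal apparatus:

```python
from functools import lru_cache

def gittins_index(a, b, u_succ, u_fail, beta, horizon=60, tol=1e-9):
    """Gittins index of a one-armed Bernoulli bandit: payoff u_succ on a
    success and u_fail on a failure, Beta(a, b) prior over the success
    probability, discount factor beta. Computed as the per-period charge m
    at which one more pull becomes exactly worthwhile (truncated DP)."""
    @lru_cache(maxsize=None)
    def value(k, n, depth, m):
        # k successes and n failures observed so far; retiring yields 0.
        if depth == 0:
            return 0.0
        p = (a + k) / (a + b + k + n)          # posterior mean of success prob.
        pull = p * u_succ + (1 - p) * u_fail - m \
            + beta * (p * value(k + 1, n, depth - 1, m)
                      + (1 - p) * value(k, n + 1, depth - 1, m))
        return max(0.0, pull)

    lo, hi = min(u_succ, u_fail), max(u_succ, u_fail)
    while hi - lo > tol:
        m = (lo + hi) / 2
        if value(0, 0, horizon, m) > 0:
            lo = m                              # charge too small: still pull
        else:
            hi = m
        value.cache_clear()
    return (lo + hi) / 2
```

Here `gittins_index(a, b, u_succ, u_fail, beta)` treats In as an arm paying u_succ on Up and u_fail on Down, with a Beta(a, b) belief about the probability of Up. At β = 0 the index is just the posterior mean payoff, which is negative under a prior weighted toward Down, matching the impatient agents' pooling on Out.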
Observation 3. If the high type plays In sufficiently many times and more often than the low type does, then eventually R will believe that In senders have greater than 1/2 chance of being θH; that is, there exists N̄ ∈ N so that p(h_T) > 1/2 for any history h_T where (i) θH played In at least N̄ times and (ii) θL played In no more than θH did. This follows from the fact that R's belief about θi's play after n_iI instances of In and n_iO instances of Out is Beta(c_iI + n_iI, c_iO + n_iO).

From Observation 2, we see that T(H, δ) is larger than the N̄ of Observation 3 when δ is sufficiently large. The history up to period t for any t ≥ N̄ will therefore contain at least N̄ periods of θH playing In (namely, the very first N̄ periods of the game), and by Observation 1 θL will have played In no more than θH did in this history. So by Observation 3, p(h_t) > 1/2 for t ≥ N̄, meaning a_t = Up for t ≥ N̄. Since s_Ht = In for all t ≤ N̄ and observing Up increases the Gittins index of In, the high type must always play In. This means lim_{t→∞} s_Ht = In and lim_{t→∞} a_t = Up for large δ < 1.

The simple learning model above makes Out an absorbing state and, together with the assumption of Beta priors, lets us explicitly calculate how the system evolves. This paper's focus is on general signaling games embedded in a learning model with large populations and anonymous random matching, eliminating repeated-game effects. We focus on steady states of the model, where the agents' stationarity assumption is satisfied. Also, we relax the Beta prior assumption and allow learners to have fairly general non-doctrinaire priors. Many results about the steady-state model, however, have analogs in the simple model above.

Intuitively, θH is "more compatible" with signal In than θL. Definition 2 formalizes this relation in general signaling games. Observation 1 corresponds to Lemma 2, which shows that whenever one type is more compatible than another with a signal, the more compatible type sends the signal more often.
Observation 2 corresponds to Lemma 4, which says a sufficiently patient and long-lived sender type will experiment many times with all signals that have the potential to strictly improve that type's equilibrium payoff. Observation 3 corresponds to Lemma 3, which says receivers can eventually learn the compatibility relation associated with each signal, provided senders' play respects the relation and the more compatible type experiments enough with the signal. Lemmas 2, 3, and 4 are combined to prove the main result of the paper (Theorem 2), a learning-based refinement in general signaling games.

Section 2 lays out the notation we will use for signaling games and introduces our learning model. Section 3 introduces the Gittins index, which we use to analyze the senders' learning problem. It also defines type compatibility, which is a partial order that drives our results. We say that type θ′ is more type-compatible with signal s than type θ″ if, whenever s is a weak best response for θ″ against some receiver behavior strategy, it is a strict best response for θ′ against the same strategy. To relate this static definition to the senders' optimal dynamic learning behavior, we show that, under our assumptions, the senders' learning problem is formally a multi-armed bandit, so the optimal policy of each type is characterized by the Gittins index. Theorem 1 shows that the compatibility order on types is equivalent to an order on their Gittins indices: θ′ is more type-compatible with signal s than type θ″ if and only if, whenever s has the (weakly) highest Gittins index for θ″, it has the strictly highest index for θ′, provided the two types hold the same beliefs and have the same discount factor.

Section 4 studies the aggregate behavior of the sender and receiver populations. There we define and characterize the aggregate responses of the senders and of the receivers, which are the analogs of the best-response functions in the one-shot signaling game.
First, we use a coupling argument to extend Theorem 1 to the aggregate sender behavior, proving that types who are more compatible with a signal send it more often in aggregate (Lemma 2). Then we turn to the receivers. Intuitively, we would expect that when receivers are long-lived, most of them will have beliefs that respect type compatibility, and we show that this is the case. More precisely, we show that most receivers best respond to a posterior belief whose likelihood ratio of θ′ to θ″ dominates the prior likelihood ratio of these two types whenever they observe a signal s which is more type-compatible with θ′ than θ″. Lemma 3 shows this is true for any signal that is sent "frequently enough" relative to the receivers' expected lifespan, using a result of Fudenberg, He, and Imhof (2017) on updating posteriors after rare events.

Finally, Section 5 combines the earlier results to characterize the steady states of the learning model, which can be viewed as pairs of mutual aggregate responses, analogous to the definition of Nash equilibrium. We start by proving Lemma 4, which shows that any signal that is not weakly equilibrium dominated (see Definition 11) gets sent "frequently enough" in steady state when senders are sufficiently patient and long-lived. Combining the three lemmas discussed above, we establish our main result: any patiently stable steady state must be a Nash equilibrium satisfying the additional restriction that the receivers best respond to certain admissible beliefs after every off-path signal (Theorem 2).

As an example, consider Cho and Kreps (1987)'s beer-quiche game, where it is easy to verify that the strong type is more compatible with Beer than the weak type.
Our results imply that the strong types will in aggregate send this signal at least as often as the weak types do, and that a very patient strong type will experiment with it "many times." As a consequence, when senders are patient, long-lived receivers are unlikely to revise the probability of the strong type downwards following an observation of Beer. Thus, the "both types eat quiche" equilibrium is not a patiently stable steady state of the learning model, as it would require receivers to interpret Beer as a signal that the sender is weak.

Finally, Theorem 3 provides a stronger implication of patient stability in generic pure-strategy equilibria, showing that off-path beliefs must assign probability zero to types that are equilibrium dominated in the sense of Cho and Kreps (1987).
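The receiver-side updating that drives these conclusions (Observation 3 in the introductory example) is just Beta-Binomial conjugacy. A minimal sketch, using the prior parameters cHI = cLO = 1, cHO = cLI = 3 from the footnote example; the play counts passed in below are hypothetical, chosen only for illustration:

```python
def receiver_posterior(n_HI, n_HO, n_LI, n_LO,
                       c_HI=1.0, c_HO=3.0, c_LI=3.0, c_LO=1.0,
                       prior_H=0.5):
    """Receiver's posterior probability that an In sender has type θH,
    given independent Beta priors on each type's In-probability and
    observed counts of In/Out play by each type."""
    # Posterior-predictive probability that each type plays In next period.
    pred_H = (c_HI + n_HI) / (c_HI + c_HO + n_HI + n_HO)
    pred_L = (c_LI + n_LI) / (c_LI + c_LO + n_LI + n_LO)
    # Bayes rule over types, conditional on observing the signal In.
    return prior_H * pred_H / (prior_H * pred_H + (1 - prior_H) * pred_L)
```

With no observations the posterior is 0.25, so the receiver plays Down; once θH has accumulated many In plays while θL has mostly switched to Out, the posterior crosses 1/2 and the receiver switches to Up.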
Fudenberg and Kreps (1988, 1994, 1995) pointed out that experimentation plays an important role in determining learning outcomes in extensive-form games. As in Fudenberg and Kreps (1993), they studied a model with a single infinitely-lived and strategically myopic agent in each player role who acts as if the opponent's play is stationary. Because these models involved accumulating information over time, they did not have steady states. Our work is closer to that of Fudenberg and Levine (1993) and Fudenberg and Levine (2006), which also studied learning by Bayesian agents in a large population who believe that society is in a steady state. A key issue in this work, and more generally in studying learning in extensive-form games, is characterizing how much agents will experiment with myopically suboptimal actions. If agents do not experiment at all, then non-Nash equilibria can persist, because players can maintain incorrect but self-confirming beliefs about off-path play. Fudenberg and Levine (1993) showed that patient long-lived agents will experiment enough at their on-path information sets to learn if they have any profitable deviations, thus ruling out steady states that are not Nash equilibria. However, more experimentation than that is needed for learning to generate the sharper predictions associated with backward induction and sequential equilibrium. Fudenberg and Levine (2006) showed that patient rational agents need not do enough experimentation to imply backwards induction in games of perfect information. Later on, we say more about how the models and proofs of those papers differ from ours.

This paper is also related to the Bayesian learning models of Kalai and Lehrer (1993), which studied two-player games with one agent on each side, so that every self-confirming equilibrium is path-equivalent to a Nash equilibrium, and Esponda and Pouzo (2016), which allowed agents to experiment but did not characterize when and how this occurs. It is also related to the literature on boundedly rational experimentation in extensive-form games (e.g. Jehiel and Samet (2005), Laslier and Walliser (2015)), where the experimentation rules of the agents are exogenously specified. We assume that each sender's type is fixed at birth, as opposed to being i.i.d. over time. Dekel, Fudenberg, and Levine (2004) showed some of the differences this can make using various equilibrium concepts, but they did not develop an explicit model of non-equilibrium learning.
It is also related to the literatureon boundedly rational experimentation in extensive-form games (e.g. Jehiel and Samet (2005),Laslier and Walliser (2015)), where the experimentation rules of the agents are exogenouslyspecified. We assume that each sender’s type is fixed at birth, as opposed to being i.i.d. overtime. Dekel, Fudenberg, and Levine (2004) showed some of the differences this can make usingvarious equilibrium concepts, but they did not develop an explicit model of non-equilibriumlearning. 6or simplicity, we assume here that agents do not know the payoffs of other players and havefull support priors over the opposing side’s behavior strategies. Our companion paper Fudenbergand He (2017) supposed that players assign zero probability to dominated strategies of theiropponents, as in the Intuitive Criterion (Cho and Kreps, 1987), divine equilibrium (Banks andSobel, 1987), and rationalizable self-confirming equilibrium (Dekel, Fudenberg, and Levine, 1999).There, we analyzed how the resulting micro-founded equilibrium refinement compares to thosein past work. A signaling game has two players, a sender (player 1, “she”) and a receiver (player 2, “he”). Thesender’s type is drawn from a finite set Θ according to a prior λ ∈ ∆(Θ) with λ ( θ ) > θ . There is a finite set S of signals for the sender and a finite set A of actions for the receiver. The utility functions of the sender and receiver are u : Θ × S × A → R and u : Θ × S × A → R respectively.When the game is played, the sender knows her type and sends a signal s ∈ S to the receiver.The receiver observes the signal, then responds with an action a ∈ A . Finally, payoffs are realized.A behavior strategy for the sender π = ( π ( ·| θ )) θ ∈ Θ is a type-contingent mixture over signals S . Write Π for the set of all sender behavior strategies.A behavior strategy for the receiver π = ( π ( ·| s )) s ∈ S is a signal-contingent mixture overactions A . Write Π for the set of all receiver behavior strategies. 
We now build a learning model with a given signaling game as the stage game. In this subsection, we explain an individual agent's learning problem. In the next subsection, we complete the learning model by describing a society of learning agents who are randomly matched to play the signaling game every period.

Time is discrete and all agents are rational Bayesians with geometrically distributed lifetimes. They survive between periods with probability 0 ≤ γ < 1 and discount future payoffs at rate 0 ≤ δ < 1, so their objective is to maximize the expected value of Σ_{t=0}^∞ (γδ)^t · u_t, where 0 ≤ γδ < 1 and u_t is the payoff t periods from today.

At birth, each agent is assigned a role in the signaling game: either as a sender with type θ or as a receiver. Agents know their role, which is fixed for life. Every period, each agent is randomly matched with an opponent and plays the signaling game. Agents update their beliefs and play the signaling game again with new random opponents next period, provided they are still alive.

(Here and subsequently, ∆(X) denotes the collection of probability distributions on the set X. To lighten notation, we assume that the same set of actions is feasible following any signal. This is without loss of generality for our results, as we could let the receiver have very negative payoffs when he responds to a signal with an "impossible" action.)

Agents believe they face a fixed but unknown distribution of opponents' aggregate play, so they believe that their observations will be exchangeable. We feel that this is a plausible first hypothesis in many situations, so we expect that agents will maintain their belief in stationarity when it is approximately correct, but will reject it given clear evidence to the contrary, as when there is a strong time trend or a high-frequency cycle. The environment will indeed be constant in the steady states that we analyze.

Formally, each sender is born with a prior density function over the aggregate behavior strategy of the receivers, g1 : Π2 → R+, which integrates to 1. Similarly, each receiver is born with a prior density over the senders' behavior strategies, g2 : Π1 → R+. We denote the marginal distribution of g1 on signal s as g1^(s), so that g1^(s)(π2(·|s)) is the density of the new senders' prior over how receivers respond to signal s. Similarly, we denote the θ marginal of g2 as g2^(θ), so that g2^(θ)(π1(·|θ)) is the new receivers' prior density over π1(·|θ) ∈ ∆(S).

It is important to remember that g1 and g2 are beliefs over opponents' strategies, but not strategies themselves. A new sender expects the response to s to be ∫ π2(·|s) · g1(π2) dπ2, while a new receiver expects type θ to play ∫ π1(·|θ) · g2(π1) dπ1.

We now state a regularity assumption on the agents' priors that will be maintained throughout.

Definition 1.
A prior g = (g1, g2) is regular if:

(i) [independence] g1(π2) = ∏_{s∈S} g1^(s)(π2(·|s)) and g2(π1) = ∏_{θ∈Θ} g2^(θ)(π1(·|θ)).

(ii) [g1 non-doctrinaire] g1 is continuous and strictly positive on the interior of Π2.

(iii) [g2 nice] for each type θ, there are positive constants (α_s^(θ))_{s∈S} such that

π1(·|θ) ↦ g2^(θ)(π1(·|θ)) / ∏_{s∈S} π1(s|θ)^{α_s^(θ) − 1}

is uniformly continuous and bounded away from zero on the relative interior of Π1^(θ), the set of behavior strategies of type θ.

(The receiver's payoff reveals the sender's type for generic assignments of payoffs to terminal nodes. If the receiver's payoff function is independent of the sender's type, his beliefs about it are irrelevant. If the receiver does care about the sender's type but observes neither the sender's type nor his own realized payoff, a great many outcomes can persist, as in Dekel, Fudenberg, and Levine (2004). Note that the agent's prior belief is over opponents' aggregate play (i.e. Π1 or Π2) and not over the prevailing distribution of behavior strategies in the opponent population (i.e. ∆(Π1) or ∆(Π2)), since under our assumption of anonymous random matching, these are observationally equivalent for our agents. For instance, a receiver cannot distinguish between a society where all type θ randomize 50-50 between signals s′ and s″ each period, and another society where half of the type θ always play s′ while the other half always plays s″. Note also that because agents believe the system is in a steady state, they do not care about calendar time and do not have beliefs about it. Fudenberg and Kreps (1994) suppose that agents append a non-Bayesian statistical test of whether their observations are exchangeable to a Bayesian model that presumes exchangeability.)

Independence ensures that a receiver does not learn how type θ plays by observing the behavior of some other type θ′ ≠ θ, and that a sender does not learn how receivers react to signal s by experimenting with some other signal s′ ≠ s. For example, this means in Cho and Kreps (1987)'s beer-quiche game that the sender does not learn how receivers respond to beer by eating quiche. The non-doctrinaire nature of g1 and g2 implies that the agents never see an observation that they assigned zero prior probability, so that they have a well-defined optimization problem after any history. Non-doctrinaire priors also imply that a large enough data set can outweigh prior beliefs (Diaconis and Freedman, 1990). The niceness assumption in (iii) ensures that g2 behaves like a power function near the boundary of Π1. Any density that is strictly positive on Π1 satisfies this condition, as does the Dirichlet distribution, which is the prior associated with fictitious play (Fudenberg and Kreps, 1993).

The set of histories for an age t sender of type θ is Y_θ[t] := (S × A)^t, where each period the history records the signal sent and the action that her receiver opponent took in response. The set of all histories for a type θ is the union Y_θ := ∪_{t=0}^∞ Y_θ[t]. The dynamic optimization problem of type θ has an optimal policy function σ_θ : Y_θ → S, where σ_θ(y_θ) is the signal that a type θ with history y_θ would send the next time she plays the signaling game. Analogously, the set of histories for an age t receiver is Y2[t] := (Θ × S)^t, where each period the history records the type of his sender opponent and the signal that she sent.
The set of all receiver histories is the union Y2 := ∪_{t=0}^∞ Y2[t]. The receiver's learning problem admits an optimal policy function σ2 : Y2 → A^S, where σ2(y2) is the pure strategy that a receiver with history y2 would commit to the next time he plays the game.

We analyze learning in a deterministic stationary model with a continuum of agents, as in Fudenberg and Levine (1993, 2006). One innovation is that we let lifetimes follow a geometric distribution. The society contains a unit mass of agents in the role of the receiver and a mass λ(θ) in the role of type θ for each θ ∈ Θ. As described in Subsection 2.2, each agent has survival probability 0 ≤ γ < 1, hence probability 1 − γ of dying, each period. To preserve population sizes, (1 − γ) new receivers and λ(θ)(1 − γ) new type θ senders are born into the society every period.

(One could imagine learning environments where the senders believe that the responses to various signals are correlated, but independence is a natural special case. Because our agents are expected-utility maximizers, it is without loss of generality to assume each agent uses a deterministic policy rule. If more than one such rule exists, we fix one arbitrarily. Of course, the optimal policies σ_θ and σ2 depend on the prior g as well as the effective discount factor δγ. Where no confusion arises, we suppress these dependencies.)

Each period, agents in the society are matched uniformly at random to play the signaling game. In the spirit of the law of large numbers, each sender has probability (1 − γ)γ^t of matching with a receiver of age t, while each receiver has probability λ(θ)(1 − γ)γ^t of matching with a type θ of age t.

A state ψ of the learning model is described by the mass of agents with each possible history. We write it as ψ ∈ (×_{θ∈Θ} ∆(Y_θ)) × ∆(Y2). We refer to the components of a state ψ by ψ_θ ∈ ∆(Y_θ) and ψ2 ∈ ∆(Y2). Given the agents' optimal policies, each possible history for an agent completely determines how that agent will play in their next match.
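The demographic bookkeeping can be verified directly: with survival probability γ and an inflow of mass 1 − γ per period, the age distribution (1 − γ)γ^t reproduces itself, so the matching probabilities above are time-invariant. A small self-check (our own illustration, with γ = 0.9 and ages truncated at 200 only for the computation):

```python
def age_distribution_step(dist, gamma):
    """One period of the demographic dynamics: survivors age by one period,
    and a mass (1 - gamma) of newborns enters at age 0. `dist` maps each
    age t to the population mass of agents of that age."""
    new_dist = {0: 1.0 - gamma}
    for age, mass in dist.items():
        new_dist[age + 1] = gamma * mass   # each agent survives w.p. gamma
    return new_dist

gamma = 0.9
T = 200                                     # truncation for the numeric check
steady = {t: (1 - gamma) * gamma ** t for t in range(T)}
next_dist = age_distribution_step(steady, gamma)
```

One step of aging, death, and births leaves the geometric distribution unchanged, and its mean age γ/(1 − γ) confirms the interpretation of large γ as long expected lifetimes.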
The sender policy functions σ_θ are maps from sender histories to signals, so they naturally extend to maps from distributions over sender histories to distributions over signals. That is, given the policy function σ_θ, each state ψ induces an aggregate behavior strategy σ_θ(ψ_θ) ∈ ∆(S) for each type θ population, where we extend the domain of σ_θ from Y_θ to ∆(Y_θ) in the natural way:

σ_θ(ψ_θ)(s) := ψ_θ{y_θ ∈ Y_θ : σ_θ(y_θ) = s}.    (1)

Similarly, state ψ and the optimal receiver policy σ2 together induce an aggregate behavior strategy σ2(ψ2) for the receiver population, where

σ2(ψ2)(a|s) := ψ2{y2 ∈ Y2 : σ2(y2)(s) = a}.

We will study the steady states of this learning model, to be defined more precisely in Section 5. Loosely speaking, a steady state is a state ψ that reproduces itself indefinitely when agents use their optimal policies. Put another way, a steady state induces a time-invariant distribution over how the signaling game is played in the society. Suppose society is at steady state today and we measure what fraction of type θ sent a certain signal s in today's matches. After all agents modify their strategies based on their updated beliefs and all births and deaths take place, the fraction of type θ playing s in the matches tomorrow will be the same as today. (Remember that we have fixed deterministic policy functions.)

Senders' Optimal Policies and Type Compatibility
This section studies the senders' learning problem. We will prove that differences in the payoff structures of the various sender types generate certain restrictions on their behavior in the learning model. Subsection 3.1 notes that the senders face a multi-armed bandit, so the Gittins index characterizes their optimal policies, and shows how to relate the Gittins index of a signal to the expected sender payoff versus a particular mixed strategy of the receiver. In Subsection 3.2, we define type compatibility, which formalizes what it means for type θ′ to be more "compatible" with a given signal s than type θ″ is. The definition of type compatibility is static, in the sense that it depends only on the two types' payoff functions in the one-shot signaling game. Subsection 3.3 relates type compatibility to the Gittins index, which applies to the dynamic learning model. Lemma 2 in Section 4 uses this relationship to show that if type θ′ is more compatible with signal s than type θ″, then faced with any fixed distribution of receiver play, the type θ′ population sends s more often in the aggregate than the type θ″ population does.

Each type θ sender thinks she is facing a fixed but unknown aggregate receiver behavior strategy π2, so each period when she sends signal s, she believes that the response is drawn from some π2(·|s) ∈ ∆(A), i.i.d. across periods. Because her beliefs about the responses to the various signals are independent, her problem is equivalent to a discounted multi-armed bandit, with signals s ∈ S as the arms, where the rewards of arm s are distributed according to u1(θ, s, π2(·|s)).

Let ν_s ∈ ∆(∆(A)) be a belief over the space of mixed replies to signal s, and let ν = (ν_s)_{s∈S} be a profile of such beliefs.
Write I(θ, s, ν, β) for the Gittins index of signal s for type θ, with beliefs ν over the receiver's play after the various signals and with effective discount factor β = δγ, so that

I(θ, s, ν, β) := sup_{τ>0} E_{ν_s}{ Σ_{t=0}^{τ−1} β^t · u₁(θ, s, a_s(t)) } / E_{ν_s}{ Σ_{t=0}^{τ−1} β^t }.   (2)

Here a_s(t) is the receiver's response that the sender observes the t-th time she sends signal s, τ is a stopping time (that is, whether or not τ = t depends only on the realizations of a_s(0), a_s(1), ..., a_s(t−1)), and the expectation E_{ν_s} over the sequence of responses {a_s(t)}_{t≥0} depends on the sender's belief ν_s about responses to signal s.

The Gittins index can be interpreted as the value of an auxiliary optimization problem, where type θ chooses each period to either send signal s and obtain a payoff according to a random receiver action drawn according to π₂(·|s), or to stop forever. The objective of the auxiliary problem is to maximize the per-period expected discounted payoff until stopping: the numerator of Equation (2) is the expected discounted sum of payoffs until stopping, while the denominator is the expected discounted number of periods until stopping.

The Gittins index theorem (Gittins, 1979) implies that after every positive-probability history y_θ, the optimal policy σ_θ for a sender of type θ sends the signal that has the highest Gittins index at the belief ν = (ν_s)_{s∈S} induced by y_θ. Importantly, we can reformulate the objective function defining the Gittins index in Equation (2), linking it to the one-shot signaling game payoff structure.

Lemma 1.
For every signal s, stopping time τ, belief ν_s, and discount factor β, there exists π̃_{2,s}(τ, ν_s, β) ∈ ∆(A) so that for every θ,

E_{ν_s}{ Σ_{t=0}^{τ−1} β^t · u₁(θ, s, a_s(t)) } / E_{ν_s}{ Σ_{t=0}^{τ−1} β^t } = u₁(θ, s, π̃_{2,s}(τ, ν_s, β)).

That is to say, when the stopping problem in Equation (2) is evaluated at an arbitrary stopping time τ, the payoff is equal to the sender's expected utility from playing s against the receiver strategy π̃_{2,s}(τ, ν_s, β) in the one-shot signaling game.

The proof of Lemma 1 is in Appendix A.2 and shows how to construct π̃_{2,s}(τ, ν_s, β), which can be interpreted as a discounted time average over the receiver actions that are observed before stopping. To illustrate the construction, suppose ν_s is supported on two pure receiver strategies after s: either π₂(a′|s) = 1 or π₂(a″|s) = 1, with both strategies equally likely. Suppose also u₁(θ, s, a′) > u₁(θ, s, a″). Consider the stopping time τ that specifies stopping after the first time the receiver plays a″. Then the discounted time-average frequency of a′ is:

Σ_{t=0}^∞ β^t · P_{ν_s}[τ > t and receiver plays a′ in period t] / Σ_{t=0}^∞ β^t · P_{ν_s}[τ > t] = (0.5/(1−β)) / (1 + Σ_{t=1}^∞ β^t · 0.5) = 1/(2−β).

So π̃_{2,s}(τ, ν_s, β)(a′) = 1/(2−β), and similarly we can calculate that π̃_{2,s}(τ, ν_s, β)(a″) = (1−β)/(2−β), which shows that π̃_{2,s} indeed corresponds to a mixture over receiver actions for each β. As β → 1, this mixture converges to the pure strategy of always playing a′, so u₁(θ, s, π̃_{2,s}(τ, ν_s, β)) converges to u₁(θ, s, a′), the highest possible payoff for type θ after s; this parallels the fact that as β tends to 1, the Gittins index for θ after s converges to the highest payoff in the support of the belief ν_s.

We now introduce a notion of the comparative compatibility of two types with a given signal in the one-shot signaling game.
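Before turning to that definition, the closed form just derived admits a quick numerical check (a standalone sketch; the truncation horizon is an implementation choice, not part of the model):

```python
# Verify the discounted time-average frequency of a' for the two-point belief
# nu_s = {always a' w.p. 1/2, always a'' w.p. 1/2} and the stopping time
# "stop after the first a''". The formula above gives 1/(2 - beta).

def discounted_frequency_a_prime(beta, horizon=10_000):
    # P[tau > t and a' played at t] = 1/2 for every t (the "always a'" world);
    # P[tau > t] = 1 at t = 0 and 1/2 for t >= 1 (we continue past period 0
    # only in the "always a'" world).
    num = sum(beta**t * 0.5 for t in range(horizon))
    den = 1.0 + sum(beta**t * 0.5 for t in range(1, horizon))
    return num / den

for beta in [0.0, 0.5, 0.9, 0.99]:
    assert abs(discounted_frequency_a_prime(beta) - 1.0 / (2.0 - beta)) < 1e-6

# As beta -> 1, the mixture puts almost all weight on a'.
assert discounted_frequency_a_prime(0.999) > 0.99
```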
Definition 2. Signal s′ is more type-compatible with θ′ than θ, written as θ′ ≻_{s′} θ, if for every π₂ ∈ Π₂ such that

u₁(θ, s′, π₂(·|s′)) ≥ max_{s″≠s′} u₁(θ, s″, π₂(·|s″)),

we have

u₁(θ′, s′, π₂(·|s′)) > max_{s″≠s′} u₁(θ′, s″, π₂(·|s″)).

In words, θ′ ≻_{s′} θ means that whenever s′ is a weak best response for θ against some receiver behavior strategy π₂, it is also a strict best response for θ′ against π₂. The following proposition says the compatibility order is transitive and essentially asymmetric. Its proof is in Appendix A.1.

Proposition 1. (i) ≻_{s′} is transitive. (ii) Except when s′ is either strictly dominant for both θ′ and θ or strictly dominated for both θ′ and θ, θ′ ≻_{s′} θ implies θ ⊁_{s′} θ′.

To check the compatibility condition, one must consider all strategies in Π₂, just as the belief restrictions in divine equilibrium involve all the possible mixed best responses to various beliefs. However, when the sender's utility function is separable in the sense that u₁(θ, s, a) = v(θ, s) + z(a), as in Spence (1973)'s job-market signaling game and in Cho and Kreps (1987)'s beer-quiche game (given below), a sufficient condition for θ′ ≻_{s′} θ is

v(θ′, s′) − v(θ, s′) > max_{s″≠s′} [v(θ′, s″) − v(θ, s″)].

This can be interpreted as saying s′ is the least costly signal for θ′ relative to θ. In the Online Appendix, we present a general sufficient condition for θ′ ≻_{s′} θ under general payoff functions.

Example 1. (Cho and Kreps (1987)'s beer-quiche game) The sender (P1) is either strong (θ_strong) or weak (θ_weak), with prior probability λ(θ_strong) = 0.9. The sender chooses to either drink Beer or eat Quiche for breakfast. The receiver (P2), observing this breakfast choice but not the sender's type, chooses whether to Fight the sender. If the sender is θ_weak, the receiver prefers to Fight. If the sender is θ_strong, the receiver prefers to NotFight. Also, θ_strong prefers Beer for breakfast while θ_weak prefers Quiche for breakfast. Both types prefer not being fought over having their favorite breakfast.

This game has separable sender utility with v(θ_strong, Beer) = v(θ_weak, Quiche) = 1, v = 0 otherwise, z(Fight) = 0, and z(NotFight) = 2. So, we have θ_strong ≻_Beer θ_weak. ◊

It is easy to see that in every Nash equilibrium π∗, if θ′ ≻_{s′} θ, then π∗(s′|θ) > 0 implies π∗(s′|θ′) = 1. By Bayes rule, this implies that the receiver's equilibrium belief p after every on-path signal s′ satisfies the restriction p(θ|s′)/p(θ′|s′) ≤ λ(θ)/λ(θ′) if θ′ ≻_{s′} θ. Thus in every Nash equilibrium of the beer-quiche game, if the sender chooses Beer with positive ex ante probability, then the receiver's odds ratio that the sender is strong after seeing this signal cannot be less than the prior odds ratio. Our main result, Theorem 2, essentially shows for any strategy profile that can be approximated by steady-state outcomes with patient and long-lived agents, that the same compatibility-based restriction is satisfied even for off-path signals. In particular, this allows us to place restrictions on the receiver's belief after seeing Beer in equilibria where no type of sender ever plays this signal.
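For concreteness, the compatibility ranking in Example 1 can be verified by brute force over a grid of receiver mixed replies (a minimal sketch; the grid parameterization by NotFight probabilities and the payoff encoding follow the separable specification above):

```python
import itertools

# Beer-quiche payoffs from Example 1: u1(theta, s, a) = v(theta, s) + z(a),
# with v(strong, Beer) = v(weak, Quiche) = 1, v = 0 otherwise,
# z(Fight) = 0, z(NotFight) = 2.
def v(theta, s):
    return 1.0 if (theta, s) in {("strong", "Beer"), ("weak", "Quiche")} else 0.0

def u1(theta, s, p_notfight):
    # p_notfight: probability the receiver plays NotFight after signal s
    return v(theta, s) + 2.0 * p_notfight

# Definition 2 on a grid of receiver strategies (NotFight prob. after Beer,
# NotFight prob. after Quiche): whenever Beer is a weak best response for
# theta_weak, it must be a strict best response for theta_strong.
grid = [k / 100 for k in range(101)]
for pB, pQ in itertools.product(grid, grid):
    if u1("weak", "Beer", pB) >= u1("weak", "Quiche", pQ):
        assert u1("strong", "Beer", pB) > u1("strong", "Quiche", pQ)

# Separable sufficient condition: Beer is the least costly signal for
# theta_strong relative to theta_weak.
assert (v("strong", "Beer") - v("weak", "Beer")
        > v("strong", "Quiche") - v("weak", "Quiche"))
print("theta_strong is more type-compatible with Beer than theta_weak")
```

With only two signals, the max over s″ ≠ s′ reduces to a single comparison, which keeps the check a two-dimensional grid scan.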
We now connect the type compatibility order for a given signal with the associated Gittins indices.

Theorem 1. θ′ ≻_{s′} θ if and only if for every β ∈ [0, 1) and every profile of beliefs ν,

I(θ, s′, ν, β) ≥ max_{s″≠s′} I(θ, s″, ν, β) implies I(θ′, s′, ν, β) > max_{s″≠s′} I(θ′, s″, ν, β).

That is, θ′ ≻_{s′} θ if and only if whenever s′ has the (weakly) highest Gittins index for θ, it has the strictly highest index for θ′, provided the two types hold the same beliefs and have the same discount factor. The proof involves reformulating the Gittins index as in Lemma 1, then applying the compatibility definition.

Proof.
Step 1: Only If. Suppose θ′ ≻_{s′} θ and fix some β ∈ [0, 1) and prior belief ν. Suppose I(θ, s′, ν, β) ≥ max_{s″≠s′} I(θ, s″, ν, β). We show that I(θ′, s′, ν, β) > max_{s″≠s′} I(θ′, s″, ν, β). Write τ^θ_s for a stopping time attaining the supremum in Equation (2) for type θ on arm s.

On any arm s″ ≠ s′, type θ could use the (for her suboptimal) stopping time τ^{θ′}_{s″}, which by Lemma 1 yields an expected per-period payoff of u₁(θ, s″, π̃_{2,s″}(τ^{θ′}_{s″}, ν_{s″}, β)). This is a lower bound for the Gittins index of arm s″ for type θ, so combined with the hypothesis that I(θ, s′, ν, β) ≥ max_{s″≠s′} I(θ, s″, ν, β), we get

I(θ, s′, ν, β) ≥ max_{s″≠s′} u₁(θ, s″, π̃_{2,s″}(τ^{θ′}_{s″}, ν_{s″}, β)).   (3)

Now define the receiver strategy π₂ ∈ Π₂ by π₂(·|s′) := π̃_{2,s′}(τ^θ_{s′}, ν_{s′}, β) and π₂(·|s″) := π̃_{2,s″}(τ^{θ′}_{s″}, ν_{s″}, β) for all s″ ≠ s′. By Lemma 1, u₁(θ, s′, π₂(·|s′)) = I(θ, s′, ν, β), so Equation (3) can be rewritten as

u₁(θ, s′, π₂(·|s′)) ≥ max_{s″≠s′} u₁(θ, s″, π₂(·|s″)),

that is, s′ is weakly optimal for θ against π₂. By the definition of θ′ ≻_{s′} θ, this implies s′ is strictly optimal for θ′ against π₂.

From the definition of π₂ and Lemma 1, the expected utility of θ′ playing any s″ ≠ s′ against π₂ is equal to the Gittins index of that arm for θ′, namely I(θ′, s″, ν, β). On the other hand, u₁(θ′, s′, π₂(·|s′)) is only a lower bound for I(θ′, s′, ν, β). This shows I(θ′, s′, ν, β) > max_{s″≠s′} I(θ′, s″, ν, β), as desired.

Step 2: If.
Suppose θ′ ⊁_{s′} θ. Then there is some receiver strategy π₂* ∈ Π₂ such that

u₁(θ, s′, π₂*(·|s′)) ≥ max_{s″≠s′} u₁(θ, s″, π₂*(·|s″)), and u₁(θ′, s′, π₂*(·|s′)) ≤ max_{s″≠s′} u₁(θ′, s″, π₂*(·|s″)).

Let ν* be any belief that induces π₂* on average, that is to say, for each s,

π₂*(·|s) = ∫_{π_{2,s} ∈ ∆(A)} π_{2,s} dν*_s(π_{2,s}).

Let β = 0. Then I(θ, s, ν*, 0) = u₁(θ, s, π₂*(·|s)) for every θ and s, since the Gittins index is equal to the myopic payoff when the decision-maker is perfectly impatient. This shows I(θ, s′, ν*, 0) ≥ max_{s″≠s′} I(θ, s″, ν*, 0) and I(θ′, s′, ν*, 0) ≤ max_{s″≠s′} I(θ′, s″, ν*, 0), so the index condition in the theorem fails at (ν*, 0).

In this section, we will define and analyze the aggregate sender response R₁ : Π₂ → Π₁ and the aggregate receiver response R₂ : Π₁ → Π₂. Loosely speaking, these are the large-population learning analogs of the best-response functions in the static signaling game. If we fix the aggregate play of the −i population at π_{−i} and run the learning model period after period from an arbitrary initial state, the distribution of play in the i population will approach R_i[π_{−i}]. Later in Section 5, the fixed points of the pair (R₁, R₂) will characterize the steady states of the learning system. To formally define the aggregate sender response, we first introduce the one-period-forward map.
Definition 3. The one-period-forward map for type θ, f_θ : ∆(Y_θ) × Π₂ → ∆(Y_θ), is

f_θ[ψ_θ, π₂](y_θ, (s, a)) := ψ_θ(y_θ) · γ · 1{σ_θ(y_θ) = s} · π₂(a|s)

and f_θ[ψ_θ, π₂](∅) := 1 − γ.

If the distribution over histories in the type θ population is ψ_θ and the receiver population's aggregate play is π₂, the resulting distribution over histories in the type θ population is f_θ[ψ_θ, π₂]. Specifically, there will be a 1 − γ mass of new type θ senders who have no history. Also, if the optimal first signal of a new type θ is s₀, that is if σ_θ(∅) = s₀, then f_θ[ψ_θ, π₂]((s₀, a)) = γ · (1 − γ) · π₂(a|s₀): new senders send s₀ in their first match, observe action a in response, and survive. In general, a type θ who has history y_θ and whose policy σ_θ(y_θ) prescribes playing s has π₂(a|s) chance of having subsequent history (y_θ, (s, a)) provided she survives until next period; the survival probability corresponds to the factor γ.

Write f^T_θ for the T-fold application of f_θ on ∆(Y_θ), holding fixed some π₂. Note that for arbitrary states ψ′ and ψ″, f_θ[ψ′_θ, π₂] and f_θ[ψ″_θ, π₂] agree on the length-1 histories Y_θ[1]: each (y_θ, (s, a)) ∈ Y_θ[1] has y_θ = ∅, and both states must assign mass 1 − γ to ∅. Iterating, for T = 2, f²_θ[ψ′_θ, π₂] and f²_θ[ψ″_θ, π₂] agree on Y_θ[2], because each history in Y_θ[2] can be written as (y_θ, (s, a)) for y_θ ∈ Y_θ[1], and f_θ[ψ′_θ, π₂] and f_θ[ψ″_θ, π₂] match on all y_θ ∈ Y_θ[1]. Proceeding inductively, we can conclude that f^T_θ(ψ′_θ, π₂) and f^T_θ(ψ″_θ, π₂) agree on all Y_θ[t] for t ≤ T, for any pair of type θ states ψ′_θ and ψ″_θ. This means lim_{T→∞} f^T_θ(ψ_θ, π₂) exists and is independent of the initial state ψ_θ. Denote this limit as ψ^{π₂}_θ. It is the long-run distribution over type θ histories induced by starting at an arbitrary state and fixing the receiver population's play at π₂, as stated formally in the next definition.

Definition 4. The aggregate sender response R₁ : Π₂ → Π₁ is defined by

R₁[π₂](s|θ) := ψ^{π₂}_θ({y_θ : σ_θ(y_θ) = s}),

where ψ^{π₂}_θ := lim_{T→∞} f^T_θ(ψ_θ, π₂) with ψ_θ any arbitrary type θ state. That is, R₁[π₂](·|θ) is the long-run aggregate behavior in the type θ population when the receivers' aggregate play is fixed at π₂.

Remark 1. Technically, R₁ depends on g₁, δ, and γ, just like σ_θ does. When relevant, we will make these dependencies clear by adding the appropriate parameters as superscripts to R₁, but we will mostly suppress them to lighten notation.

Remark 2. Although the aggregate sender response is defined at the aggregate level, R₁[π₂](·|θ) also describes the probability distribution of the play of a single type θ sender over her lifetime when she faces receiver play drawn from π₂ every period. Observe that f_θ[ψ_θ, π₂] restricted to Y_θ[1] gives the probability distribution over histories for a type θ who uses σ_θ and faces play drawn from π₂ for one period: it puts weight π₂(a|s₀) on history (s₀, a), where s₀ = σ_θ(∅). Similarly, f^T_θ[ψ_θ, π₂] restricted to Y_θ[t] for any t ≤ T gives the probability distribution over histories for someone who uses σ_θ and faces play drawn from π₂ for t periods. Since ψ^{π₂}_θ assigns probability (1 − γ)γ^t to the set of histories Y_θ[t], R₁[π₂](·|θ) = σ_θ(ψ^{π₂}_θ) is a weighted average over the distributions of play of someone using σ_θ and facing π₂, with weight (1 − γ)γ^t given to the play of an agent who has already lived through t periods (t = 0, 1, 2, ...).

Type Compatibility and the Aggregate Sender Response

The next lemma shows how type compatibility translates into restrictions on the aggregate sender response for different types.
Lemma 2. Suppose θ′ ≻_{s′} θ. Then for any regular prior g₁, any 0 ≤ δ, γ < 1, and any π₂ ∈ Π₂, we have R₁[π₂](s′|θ′) ≥ R₁[π₂](s′|θ).

Theorem 1 showed that when θ′ ≻_{s′} θ and the two types share the same beliefs, if θ plays s′ then θ′ must also play s′. But even though new agents of both types start with the same prior g₁, their beliefs may quickly diverge during the learning process, due to σ_{θ′} and σ_θ prescribing different experiments after the same history. This lemma shows that compatibility still imposes restrictions on the aggregate play of the sender population: regardless of the aggregate play π₂ in the receiver population, the frequencies with which s′ appears in the aggregate responses of the different types are always co-monotonic with the compatibility order ≻_{s′}.

To gain intuition for Lemma 2, consider two new senders with types θ_strong and θ_weak who are learning to play the beer-quiche game from Example 1. Suppose they have uniform priors over the responses to each signal, and that they face a sequence of receivers programmed to play Fight after Beer and NotFight after Quiche. Since observing Fight is the worst possible news about a signal's payoff, the Gittins index of a signal decreases when Fight is observed. Conversely, the Gittins index of a signal increases after each observation of NotFight. Thus, given the assumed play of the receivers, there are n′, n ≥ 0 such that type θ_strong plays Beer for n′ periods (and observes n′ instances of Fight) and then switches to Quiche forever after, while type θ_weak plays Beer for n periods before switching to Quiche forever after. Now we claim that n′ ≥ n. To see why, suppose instead that n′ < n, and let ν be the posterior belief about the receivers' aggregate play induced from n′ periods of observing Fight after Beer. After n′ periods, both types would share the belief ν. Then at belief ν, type θ_weak must play Beer while type θ_strong plays Quiche, so signal Beer must have the highest Gittins index for θ_weak but not for θ_strong. But this would contradict Theorem 1.

The proof of Lemma 2 relies on the similar idea of fixing a particular "programming" of receiver play and studying the induced paths of experimentation for different types. In the aggregate learning model, the sequence of responses that a given sender encounters in her life depends on the realization of the random matching process, because different receivers have different histories and respond differently to a given signal. We can index all possible sequences of random matching realizations using a device we call the "pre-programmed response path". To show that more compatible types play a given signal more often, it suffices to show this comparison holds on each pre-programmed response path, thus coupling the learning processes of types θ′ and θ. We will show that the intuition above extends to signaling games with any number of signals and to any pre-programmed response path. (The two-signal intuition above follows from Bellman (1956)'s Theorem 2 on Bernoulli bandits.)

Definition 5. A pre-programmed response path a = (a_{1,s}, a_{2,s}, ...)_{s∈S} is an element of ×_{s∈S}(A^∞).

A pre-programmed response path is an |S|-tuple of infinite sequences of receiver actions, one sequence for each signal. For a given pre-programmed response path a, we can imagine starting with a new type θ and generating receiver play each period in the following programmatic manner: when the sender plays s for the j-th time, respond with receiver action a_{j,s}. (If the sender sends s′ five times and then sends s″ ≠ s′, the response she gets to s″ is a_{1,s″}, not a_{6,s″}.) For a type θ who applies σ_θ each period, a induces a deterministic history of experiments and responses, which we denote y_θ(a). The induced history y_θ(a) can be used to calculate R₁[a](·|θ), the distribution of signals over the lifetime of a type θ induced by the pre-programmed response path a.
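To illustrate the coupling, here is a minimal numerical sketch of R₁[a](·|θ) for both beer-quiche types on one pre-programmed response path, "always Fight after Beer, always NotFight after Quiche" (a hypothetical setup of our own: myopic senders with δ = 0 and independent Beta(1,1) beliefs about the NotFight probability after each signal; none of these choices come from the paper's proof). The lifetime distribution weights the period-t signal by (1 − γ)γ^{t−1}:

```python
# Couple the two types on one pre-programmed response path in beer-quiche.
GAMMA = 0.9
T = 200  # lifetime truncation; the tail weight GAMMA**T is negligible

def v(theta, s):
    return 1.0 if (theta, s) in {("strong", "Beer"), ("weak", "Quiche")} else 0.0

def lifetime_beer_weight(theta):
    # counts[s] = [# NotFight observed after s, # times s was sent]
    counts = {"Beer": [0, 0], "Quiche": [0, 0]}
    weight = 0.0
    for t in range(T):
        # Myopic value under a Beta(1,1) prior on P(NotFight | s):
        # v(theta, s) + 2 * posterior mean of P(NotFight | s).
        def val(s):
            nf, n = counts[s]
            return v(theta, s) + 2.0 * (1 + nf) / (2 + n)
        s = max(["Beer", "Quiche"], key=val)
        if s == "Beer":
            weight += (1 - GAMMA) * GAMMA**t  # weight of the period t+1 signal
        # pre-programmed responses: Fight after Beer, NotFight after Quiche
        nf, n = counts[s]
        counts[s] = [nf + (1 if s == "Quiche" else 0), n + 1]
    return weight

# The more Beer-compatible type plays Beer weakly more often on this path.
assert lifetime_beer_weight("strong") >= lifetime_beer_weight("weak")
print(lifetime_beer_weight("strong"), lifetime_beer_weight("weak"))
```

On this path the strong type keeps drinking Beer (its myopic value stays above Quiche's even after repeated Fights), while the weak type starts with Quiche and never leaves it, so the inequality of Lemma 2 holds with slack.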
Namely, R₁[a](·|θ) is simply a mixture over all signals sent along the history y_θ(a), with weight (1 − γ)γ^{t−1} given to the signal in period t.

Now consider a type θ facing actions generated i.i.d. from the receiver behavior strategy π₂ each period, as in the interpretation of R₁ in Remark 2. This data-generating process is equivalent to drawing a random pre-programmed response path a at time 0 according to a suitable distribution, then producing all receiver actions using a. That is, R₁[π₂](·|θ) = ∫ R₁[a](·|θ) dπ₂(a), where we abuse notation and use dπ₂(a) to denote the distribution over pre-programmed response paths associated with π₂. Importantly, any two types θ′ and θ face the same distribution over pre-programmed response paths, so to prove the lemma it suffices to show R₁[a](s′|θ′) ≥ R₁[a](s′|θ) for all a.

Proof.
For t ≥ 0, write y^t_θ for the truncation of the infinite history y_θ to its first t periods, with y^∞_θ := y_θ. Given a finite or infinite history y^t_θ for type θ, the signal counting function #(s | y^t_θ) returns how many times signal s has appeared in y^t_θ. (We need this counting function since the receiver play generated by a pre-programmed response path each period depends on how many times each signal has been sent so far.)

As discussed above, we need only show R₁[a](s′|θ′) ≥ R₁[a](s′|θ). Let a be given and write T^θ_j for the period in which type θ sends signal s′ for the j-th time in the induced history y_θ(a). If no such period exists, then set T^θ_j = ∞. Since R₁[a](·|θ) is a weighted average over signals in y_θ(a) with decreasing weights given to later signals, to prove R₁[a](s′|θ′) ≥ R₁[a](s′|θ) it suffices to show that T^{θ′}_j ≤ T^θ_j for every j. Towards this goal, we will prove a sequence of statements by induction:

Statement j: Provided T^θ_j is finite, #(s″ | y^{T^{θ′}_j}_{θ′}(a)) ≤ #(s″ | y^{T^θ_j}_θ(a)) for all s″ ≠ s′.

For every j where T^θ_j < ∞, Statement j implies that the number of periods type θ′ spent sending each signal s″ ≠ s′ before sending s′ for the j-th time is no greater than the number of periods θ spent doing the same. Therefore it follows that θ′ sent s′ for the j-th time no later than θ did, that is, T^{θ′}_j ≤ T^θ_j. Finally, if T^θ_j = ∞, then evidently T^{θ′}_j ≤ ∞ = T^θ_j. It now remains to prove the sequence of statements by induction.

Statement 1 is the base case. By way of contradiction, suppose T^θ_1 < ∞ and #(s″ | y^{T^{θ′}_1}_{θ′}(a)) > #(s″ | y^{T^θ_1}_θ(a)) for some s″ ≠ s′. Then there is some earliest period t* < T^{θ′}_1 where #(s″ | y^{t*}_{θ′}(a)) > #(s″ | y^{T^θ_1}_θ(a)), and type θ′ played s″ in period t*: σ_{θ′}(y^{t*−1}_{θ′}(a)) = s″. But by construction, by the end of period t* − 1, type θ′ has sent s″ exactly as many times as type θ has sent it by period T^θ_1 − 1, so that #(s″ | y^{t*−1}_{θ′}(a)) = #(s″ | y^{T^θ_1 − 1}_θ(a)). Furthermore, neither type has sent s′ yet, so also #(s′ | y^{t*−1}_{θ′}(a)) = 0 = #(s′ | y^{T^θ_1 − 1}_θ(a)). Therefore, type θ′ holds the same posterior over the receiver's reaction to signals s′ and s″ at period t* − 1 as type θ does at period T^θ_1 − 1. (In the following equation and elsewhere in the proof, we abuse notation and write I(θ, s, y) to mean I(θ, s, g₁(·|y), δγ), which is the Gittins index of type θ for signal s at the posterior obtained from updating the prior g₁ using history y, with effective discount factor δγ.) So by Theorem 1,

s′ ∈ argmax_{ŝ∈S} I(θ, ŝ, y^{T^θ_1 − 1}_θ(a)) ⟹ I(θ′, s′, y^{t*−1}_{θ′}(a)) > I(θ′, s″, y^{t*−1}_{θ′}(a)).   (4)

However, by the construction of T^θ_1, we have σ_θ(y^{T^θ_1 − 1}_θ(a)) = s′. By the optimality of the Gittins index policy, the left-hand side of Equation (4) is satisfied. But, again by the optimality of the Gittins index policy, the right-hand side of Equation (4) contradicts σ_{θ′}(y^{t*−1}_{θ′}(a)) = s″. Therefore we have proven Statement 1.

Now suppose Statement j holds for all j ≤ K. We show Statement K+1 also holds. If T^θ_{K+1} is finite, then T^θ_K is also finite. The inductive hypothesis then shows #(s″ | y^{T^{θ′}_K}_{θ′}(a)) ≤ #(s″ | y^{T^θ_K}_θ(a)) for all s″ ≠ s′. Suppose there is some s″ ≠ s′ such that #(s″ | y^{T^{θ′}_{K+1}}_{θ′}(a)) > #(s″ | y^{T^θ_{K+1}}_θ(a)). Together with the previous inequality, this implies type θ′ played s″ for the (#(s″ | y^{T^θ_{K+1}}_θ(a)) + 1)-th time sometime between playing s′ for the K-th time and playing s′ for the (K+1)-th time. That is, if we put t* := min{t : #(s″ | y^t_{θ′}(a)) > #(s″ | y^{T^θ_{K+1}}_θ(a))}, then T^{θ′}_K < t* < T^{θ′}_{K+1}. By the construction of t*, #(s″ | y^{t*−1}_{θ′}(a)) = #(s″ | y^{T^θ_{K+1} − 1}_θ(a)), and also #(s′ | y^{t*−1}_{θ′}(a)) = K = #(s′ | y^{T^θ_{K+1} − 1}_θ(a)). Therefore, type θ′ holds the same posterior over the receiver's reaction to signals s′ and s″ at period t* − 1 as type θ does at period T^θ_{K+1} − 1. As in the base case, we can invoke Theorem 1 to show that it is impossible for θ′ to play s″ in period t* while θ plays s′ in period T^θ_{K+1}. This shows Statement j is true for every j by induction.

We now turn to the receivers' problem. Each new receiver thinks he is facing a fixed but unknown aggregate sender behavior strategy π₁, with belief over π₁ given by his regular prior g₂. To maximize his expected utility, the receiver must learn to infer the type of the sender from the signal, using his personal experience. Unlike the senders, whose optimal policies may involve experimentation, the receivers' problem only involves passive learning. Since the receiver observes the same information in a match regardless of his action, the optimal policy σ₂(y₂) simply best responds to the posterior belief induced by history y₂.

Definition 6.
The one-period-forward map for receivers, f₂ : ∆(Y₂) × Π₁ → ∆(Y₂), is

f₂[ψ₂, π₁](y₂, (θ, s)) := ψ₂(y₂) · γ · λ(θ) · π₁(s|θ)

and f₂[ψ₂, π₁](∅) := 1 − γ.

As with the one-period-forward maps f_θ for senders, f₂[ψ₂, π₁] describes the new distribution over receiver histories tomorrow if the distribution over histories in the receiver population today is ψ₂ and the sender population's aggregate play is π₁. We write ψ^{π₁}₂ := lim_{T→∞} f^T₂(ψ₂, π₁) for the long-run distribution over Y₂ induced by fixing the sender population's play at π₁, which is independent of the particular choice of initial state ψ₂.

Definition 7.
The aggregate receiver response R₂ : Π₁ → Π₂ is

R₂[π₁](a|s) := ψ^{π₁}₂({y₂ : σ₂(y₂)(s) = a}),

where ψ^{π₁}₂ := lim_{T→∞} f^T₂(ψ₂, π₁) with ψ₂ any arbitrary receiver state.

We are interested in the extent to which R₂[π₁] responds to inequalities of the form π₁(s′|θ′) ≥ π₁(s′|θ) embedded in π₁, such as those generated when θ′ ≻_{s′} θ (Lemma 2). To this end, for any two types θ′, θ we define P_{θ′,θ} as those beliefs where the odds ratio of θ′ to θ is at least their prior odds ratio, that is,

P_{θ′,θ} := { p ∈ ∆(Θ) : p(θ)/p(θ′) ≤ λ(θ)/λ(θ′) }.   (5)

If π₁(s′|θ′) ≥ π₁(s′|θ), π₁(s′|θ′) > 0, and the receiver knows π₁, then the receiver's posterior belief about the sender's type after observing s′ falls in the set P_{θ′,θ}. The next lemma shows that, under the additional provisions that π₁(s′|θ′) is "large enough" and receivers are sufficiently long-lived, R₂[π₁] will best respond to P_{θ′,θ} with high probability when s′ is sent.

For P ⊆ ∆(Θ), we let BR(P, s′) := ∪_{p∈P} argmax_{a∈A} u₂(p, s′, a); this is the set of best responses to s′ supported by some belief in P. (We abuse notation here and write u₂(p, s, a) to mean Σ_{θ∈Θ} u₂(θ, s, a) · p(θ).)

Lemma 3. Let regular prior g₂, types θ′, θ, and signal s′ be fixed. For every ε > 0, there exist C > 0 and γ̄ < 1 so that for any 0 ≤ δ < 1, γ̄ ≤ γ < 1, and n ≥ 1, if π₁(s′|θ′) ≥ π₁(s′|θ) and π₁(s′|θ′) ≥ (1 − γ)nC, then R₂[π₁](BR(P_{θ′,θ}, s′) | s′) ≥ 1 − 1/n − ε.

This lemma gives a lower bound on the probability that R₂[π₁] best responds to P_{θ′,θ} after signal s′. Note that the bound only applies for survival probabilities γ that are close enough to 1, because when receivers have short lifetimes they need not get enough data to outweigh their prior. Note also that more of the receivers learn the compatibility condition when π₁(s′|θ′) is large compared to (1 − γ), and almost all of them do in the limit n → ∞. The proof of Lemma 3 relies on Theorem 2 from Fudenberg, He, and Imhof (2017) about updating Bayesian posteriors after rare events, where the rare event corresponds to observing θ′ play s′. The details are in Appendix A.3.
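The Bayes-rule observation above — that when π₁(s|θ′) ≥ π₁(s|θ) and the receiver knows π₁, the posterior odds of θ against θ′ after s cannot exceed the prior odds — can be checked directly (hypothetical numbers):

```python
from fractions import Fraction as F

# Prior over two types and the probability each type plays signal s,
# chosen so that theta' plays s at least as often as theta.
prior = {"theta'": F(1, 3), "theta": F(2, 3)}
pi1_s = {"theta'": F(3, 4), "theta": F(1, 5)}

# Unnormalized posterior is prior * likelihood; normalization cancels in odds.
posterior = {t: prior[t] * pi1_s[t] for t in prior}
assert posterior["theta"] / posterior["theta'"] <= prior["theta"] / prior["theta'"]
```

The inequality holds because observing s multiplies the prior odds of θ to θ′ by the likelihood ratio π₁(s|θ)/π₁(s|θ′) ≤ 1.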
To interpret the condition π₁(s′|θ′) ≥ (1 − γ)nC, recall that an agent with survival chance γ has a typical lifespan of 1/(1 − γ). If π₁ describes the aggregate play in the sender population, then on average a type θ′ plays s′ for π₁(s′|θ′)/(1 − γ) periods in her life. So when a typical type θ′ plays s′ for nC periods, this lemma provides a bound of 1 − 1/n − ε on the share of the receiver responses to s′ that lie in BR(P_{θ′,θ}, s′). Note that the hypothesis that θ′ plays s′ for nC periods does not require that π₁(s′|θ′) is bounded away from 0 as γ → 1. To preview, Lemma 4 in the next section will establish that signals that are not weakly equilibrium dominated for a given type are played sufficiently often that Lemma 3 has bite when both δ and γ are close to 1.

Section 4 separately examined the senders' and receivers' learning problems. In this section, we turn to the two-sided learning problem. We will first define steady-state strategy profiles, which are signaling game strategy profiles π∗ where π∗₁ and π∗₂ are mutual aggregate responses, and then characterize the steady states using our previous results.

Steady States, δ-Stability, and Patient Stability

We introduced the one-period-forward maps f_θ and f₂ in Section 4, which describe the deterministic transition from state ψ^t this period to state ψ^{t+1} next period through the learning dynamics and the birth-death process. More precisely, ψ^{t+1}_θ = f_θ(ψ^t_θ, σ₂(ψ^t₂)) and ψ^{t+1}₂ = f₂(ψ^t₂, (σ_θ(ψ^t_θ))_{θ∈Θ}). A steady state is a fixed point ψ∗ of this transition map.

Definition 8.
A state ψ∗ is a steady state if ψ∗_θ = f_θ(ψ∗_θ, σ₂(ψ∗₂)) for every θ and ψ∗₂ = f₂(ψ∗₂, (σ_θ(ψ∗_θ))_{θ∈Θ}). The set of all steady states for regular prior g and 0 ≤ δ, γ < 1 is denoted Ψ∗(g, δ, γ), while the set of steady-state strategy profiles is Π∗(g, δ, γ) := {σ(ψ∗) : ψ∗ ∈ Ψ∗(g, δ, γ)}.

The strategy profiles associated with steady states represent time-invariant distributions of play, as the information lost when agents die each period exactly balances the information agents gain through learning that period. This means the exchangeability assumption of the learners will be satisfied in any steady state.

We now give an equivalent characterization of Π∗(g, δ, γ) in terms of R₁ and R₂. The proof is in Appendix A.4.

Proposition 2. π∗ ∈ Π∗(g, δ, γ) if and only if R₁^{g,δ,γ}(π∗₂) = π∗₁ and R₂^{g,δ,γ}(π∗₁) = π∗₂.

(Note that here we make the dependence of R₁ and R₂ on the parameters (g, δ, γ) explicit to avoid confusion.) That is, a steady-state strategy profile is a pair of mutual aggregate replies. The next proposition guarantees that there always exists at least one steady-state strategy profile.

Proposition 3. Π∗(g, δ, γ) is nonempty and compact in the norm topology.

The proof is in the Online Appendix. We establish that Ψ∗(g, δ, γ) is nonempty and compact in the ℓ¹ norm on the space of distributions, which immediately implies the same properties for Π∗(g, δ, γ). Intuitively, if lifetimes are finite, the set of histories is finite, so the set of states is of finite dimension. Here the one-period-forward map f = ((f_θ)_{θ∈Θ}, f₂) is continuous, so the usual version of Brouwer's fixed-point theorem applies. With geometric lifetimes, very old agents are rare, so truncating the agents' lifetimes at some large T yields a good approximation. Instead of using these approximations directly, our proof shows that under the ℓ¹ norm f is continuous, and that (because of the geometric lifetimes) the feasible states form a compact, locally convex Hausdorff space. This lets us appeal to a fixed-point theorem for that domain.

We now focus on the iterated limit lim_{δ→1} lim_{γ→1} Π∗(g, δ, γ), that is, the set of steady-state strategy profiles for δ and γ near 1, where we first send γ to 1 holding δ fixed, and then send δ to 1.

Definition 9.
For each 0 ≤ δ < 1, a strategy profile π∗ is δ-stable under g if there is a sequence γ_k → 1 and profiles π^(k) ∈ Π∗(g, δ, γ_k) such that π^(k) → π∗. Strategy profile π∗ is patiently stable under g if there is a sequence δ_k → 1 and profiles π^(k), where each π^(k) is δ_k-stable under g and π^(k) → π∗. Strategy profile π∗ is patiently stable if it is patiently stable under some regular prior g.

Heuristically, patiently stable strategy profiles are the limits of learning outcomes when agents become infinitely patient (so that senders are willing to make many experiments) and long-lived (so that agents on both sides can learn enough for their data to outweigh their prior). As in past work on steady-state learning (Fudenberg and Levine, 1993, 2006), the reason for this order of limits is to ensure that most agents have enough data that they stop experimenting and play myopic best responses. (If agents did not eventually stop experimenting as they age, then even if most agents have approximately correct beliefs, aggregate play need not be close to a Nash equilibrium, because most agents would not be playing a (static) best response to their beliefs.) We do not know whether our results extend to the other order of limits; we explain the issues involved below, after sketching the intuition for Proposition 5.

δ-Stability and Patient Stability

When γ is near 1, agents correctly learn the consequences of the strategies they play frequently. But for a fixed patience level they may choose to rarely or never experiment, and so can maintain incorrect beliefs about the consequences of strategies that they do not play. The next result formally states this, and parallels Fudenberg and Levine (1993)'s result that δ-stable strategy profiles are self-confirming equilibria.

Proposition 4. Suppose strategy profile π∗ is δ-stable under a regular prior. Then for every type θ and signal s with π∗₁(s|θ) > 0, s is a best response for type θ to some π₂ ∈ Π₂ with π₂(·|s) = π∗₂(·|s). Also, for any signal s such that π∗₁(s|θ) > 0 for at least one type θ, π∗₂(·|s) is supported on pure best responses to the Bayesian belief generated by π∗₁ after s.

We prove this result in the Online Appendix. The idea of the proof is the following: If signal s has positive probability in the limit, then it is played many times by the senders, so the receivers eventually learn the correct posterior distribution for θ given s. As the receivers have no incentive to experiment, their actions after s will be a best response to this correct posterior belief. For the senders, suppose π∗₁(s|θ) > 0, but s is not a best response for type θ to any π₂ ∈ Π₂ that matches π∗₂(·|s). If a sender has played s many times, then with high probability her belief about π₂(·|s) is close to π∗₂(·|s), so playing s is not myopically optimal. This would imply that type θ has persistent option value for signal s, which contradicts the fact that this option value must converge to 0 with the sample size.

Remark 3. This proposition says that each sender type is playing a best response to a belief about the receiver's play that is correct on the equilibrium path, and that the receivers are playing an aggregate best response to the aggregate play of the senders. Thus the δ-stable outcomes are a version of self-confirming equilibrium where different types of sender are allowed to have different beliefs. Moreover, as the next example shows, this sort of heterogeneity in the senders' beliefs about the aggregate strategy of the receivers can endogenously arise in a δ-stable strategy profile even when all types of new senders start with the same prior over how the receivers play.

Example 2.
Consider the following game. (Dekel, Fudenberg, and Levine (2004) defined type-heterogeneous self-confirming equilibrium in static Bayesian games. As they noted, this sort of heterogeneity is natural when the type of each agent is fixed, but not if each agent's type is drawn i.i.d. in each period. To extend their definition to signaling games, we can define the "signal functions" y_i(a, θ) from that paper to respect the extensive form of the game. See also ?.) Fix any regular prior g2 for the receiver and any regular prior g1 for the sender such that the sender's prior belief about responses to s2, g1^{s2}, is Beta(1, 3) on a1 and a2 respectively. We claim that when δ = 0, it is δ-stable for both types to send s1 and for the receiver to respond to every signal with a1, which is a type-heterogeneous rationalizable self-confirming equilibrium. However, this pooling behavior cannot occur in a Nash equilibrium or in a unitary self-confirming equilibrium, where both sender types must hold the same belief about how the receiver responds to s2.

To establish this claim, note that since δ = 0, each sender plays the myopically optimal signal after every history. For any γ, there is a steady state where the receivers' policy responds to every signal with a1 after every history, type θ1 senders play s1 after every history and never update their prior belief about how receivers react to s2, and type θ2 senders with fewer than 6 periods of experience play s2 but switch to playing s1 forever starting at age 7. The behavior of the θ2 agents is optimal because after k periods of playing s2 and seeing the response a1 every period, the sender's posterior belief about π2(·|s2) is Beta(1+k, 3), so the expected payoff of playing s2 next period is

(1+k)/(4+k) · (−1) + 3/(4+k) · 2.

This expression is nonnegative when 0 ≤ k ≤ 5; once it turns negative, s1 becomes myopically optimal, so type θ2 senders play s2 for their first six periods and switch to s1 forever starting at age 7. The fraction of type θ2 senders aged 6 and below approaches 0 as γ →
1, hence we have constructed a sequence of steady-state strategy profiles converging to the s1-pooling equilibrium. So even though both types start with the same prior g1, their beliefs about how the receivers react to s2 eventually diverge. □

In contrast to the plethora of δ-stable profiles, we now show that only Nash equilibrium profiles can be steady-state outcomes as δ tends to 1. Moreover, this limit also rules out strategy profiles in which the sender's strategy can only be supported by the belief that the receiver would play a dominated action in response to some unsent signal.

Definition 10.
In a signaling game, a perfect Bayesian equilibrium with heterogeneous off-path beliefs is a strategy profile (π∗1, π∗2) such that:

• For each θ ∈ Θ, u1(θ; π∗) = max_{s∈S} u1(θ, s, π∗2(·|s)).

• For each on-path signal s, u2(p∗(·|s), s, π∗2(·|s)) = max_{â∈A} u2(p∗(·|s), s, â).

• For each off-path signal s and each a ∈ A with π∗2(a|s) > 0, there exists a belief p ∈ Δ(Θ) such that u2(p, s, a) = max_{â∈A} u2(p, s, â).

Here u1(θ; π∗) refers to type θ's payoff under π∗, and p∗(·|s) is the Bayesian posterior belief about the sender's type after signal s under strategy π∗1.

The first two conditions imply that the profile is a Nash equilibrium. The third condition resembles that of perfect Bayesian equilibrium but is somewhat weaker, as it allows the receiver's play after an off-path signal s to be a mixture over several actions, each of which is a best response to a different belief about the sender's type. This means π∗2(·|s) ∈ Δ(BR(Δ(Θ), s)), but π∗2(·|s) itself may not be a best response to any unitary belief about the sender's type.

Proposition 5.
If strategy profile π∗ is patiently stable, then it is a perfect Bayesian equilibrium with heterogeneous off-path beliefs.

Proof. In the Online Appendix, we prove that patiently stable profiles must be Nash equilibria. This argument follows the proof strategy of Fudenberg and Levine (1993), which derived a contradiction via excess option values. In outline, if π∗ is patiently stable, each player's strategy is a best response to a belief that is correct about the opponent's on-path play. Thus if π∗ is not a Nash equilibrium, some type should perceive a persistent option value to experimenting with some signal that she plays with probability 0. But this would contradict the fact that the option values evaluated at sufficiently long histories must go to 0. We now explain why a patiently stable profile π∗ must satisfy the third condition in Definition 10. After observing any history y, a receiver who started with a regular prior thinks every signal has positive probability in his next match. So his optimal policy prescribes, for each signal s, a best response to that receiver's posterior belief about the sender's type upon seeing signal s after history y. For any regular prior g, 0 ≤ δ, γ < 1, and any sender aggregate play π1, we thus deduce that R2^{g,δ,γ}[π1](·|s) is entirely supported on BR(Δ(Θ), s). This means the same is true about the aggregate receiver response in every steady state, and hence in every patiently stable strategy profile.

In Fudenberg and Levine (1993), this argument relies on the finite lifetime of the agents only to ensure that "almost all" histories are long enough, by picking a large enough lifetime. We can achieve the analogous effect in our geometric-lifetime model by picking γ close to 1. Our proof uses the fact that if δ is fixed and γ → 1, then the number of experiments that a sender needs to exhaust her option value is negligible relative to her expected lifespan, so that most senders play approximate best responses to their current beliefs. The same conclusion does not hold if we fix γ and let δ → 1, even though the optimal sender policy only depends on the product δγ, because for a fixed sender policy the induced distribution on sender play depends on γ but not on δ.

Proposition 5 allows the receiver to sustain his off-path actions using any belief p ∈ Δ(Θ). We now turn to our main result, which focuses on refining off-path beliefs. We prove that patient stability selects a strict subset of the Nash equilibria, namely those that satisfy the compatibility criterion.

Definition 11.
For a fixed strategy profile π∗, let u1(θ; π∗) denote the payoff to type θ under π∗, and let

J(s, π∗) := { θ ∈ Θ : max_{a∈A} u1(θ, s, a) > u1(θ; π∗) }

be the set of types for which some response to signal s is strictly better than their payoff under π∗. Signal s is weakly equilibrium dominated for types in the complement of J(s, π∗). The admissible beliefs at signal s under profile π∗ are

P(s, π∗) := ⋂ { P_{θ'≻θ''} : θ' ≻_s θ'' and θ' ∈ J(s, π∗) },

where P_{θ'≻θ''} is defined in Equation (5). That is, P(s, π∗) is the joint belief restriction imposed by the family of P_{θ'≻θ''} for pairs (θ', θ'') satisfying two conditions: θ' is more type-compatible with s than θ'', and furthermore the more compatible type θ' belongs to J(s, π∗). If there are no pairs (θ', θ'') satisfying these two conditions, then (by the convention that an intersection over the empty family is everything) P(s, π∗) is defined as Δ(Θ). In any signaling game and for any π∗, the set P(s, π∗) is nonempty because it always contains the prior λ.

Definition 12.
Strategy profile π∗ satisfies the compatibility criterion if π∗2(·|s) ∈ Δ(BR(P(s, π∗), s)) for every s.

Like divine equilibrium but unlike the Intuitive Criterion or Cho and Kreps (1987)'s D1 criterion, the compatibility criterion says only that some signals should not increase the relative probability of "implausible" types, as opposed to requiring that these types have probability 0.

One might imagine a version of the compatibility criterion where the belief restriction P_{θ'≻θ''} applies whenever θ' ≻_s θ''. To understand why we require the additional condition θ' ∈ J(s, π∗) in the definition of admissible beliefs, recall that Lemma 3 only gives a learning guarantee in the receiver's problem when π1(s|θ') is "large enough" for the more type-compatible θ'. In the extreme case where s is a strictly dominated signal for θ', she will never play it during learning. It turns out that if s is weakly equilibrium dominated for θ', then θ' may still not experiment very much with it. On the other hand, the next lemma provides a lower bound on the frequency with which θ' experiments with s' when θ' ∈ J(s', π∗) and δ and γ are close to 1.

Lemma 4.
Fix a regular prior g and a strategy profile π∗ where, for some type θ' and signal s', θ' ∈ J(s', π∗). There exist a number ε ∈ (0, 1) and threshold functions δ̄ : N → (0, 1) and γ̄ : N × (0, 1) → (0, 1) such that for every N ∈ N, whenever π ∈ Π∗(g, δ, γ) with δ ≥ δ̄(N) and γ ≥ γ̄(N, δ), and π is no more than ε away from π∗ in L1 distance, we have π1(s'|θ') ≥ (1−γ)·N.

Here the L1 distance is d(π, π̃) = Σ_{θ∈Θ} Σ_{s∈S} |π1(s|θ) − π̃1(s|θ)| + Σ_{s∈S} Σ_{a∈A} |π2(a|s) − π̃2(a|s)|. Since π1(s'|θ') is between 0 and 1, the thresholds implicitly satisfy (1 − γ̄(N, δ))·N < 1.

The proof of this lemma is in the Online Appendix. For intuition, suppose that not only is s' equilibrium undominated for θ' in π∗, but furthermore s' can lead to the highest signaling-game payoff for type θ' under some receiver response. Because the prior is non-doctrinaire, the Gittins index of each signal in the learning problem approaches its highest possible payoff in the stage game as the sender becomes infinitely patient. Therefore, for every N ∈ N, when γ and δ are close enough to 1, a new type θ' will play s' in each of the first N periods of her life, regardless of the responses she receives during that time. These N periods account for roughly a (1−γ)·N fraction of her life, proving the lemma in this special case. It turns out that even if s' does not lead to the highest potential payoff in the signaling game, long-lived players will have a good estimate of their steady-state payoff, so type θ' will still play any s' that is equilibrium undominated in strategy profile π∗ at least N times in any steady state sufficiently close to π∗, though these N periods may not occur at the beginning of her life.

Theorem 2.
Every patiently stable strategy profile π∗ satisfies the compatibility criterion.

The proof combines Lemmas 2, 3, and 4. Lemma 2 shows that types that are more compatible with s' play it more often. Lemma 4 says that types for whom s' is not weakly equilibrium dominated will play it "many times." Finally, Lemma 3 shows that "many times" here is sufficiently large that most receivers correctly believe that more compatible types play s' more than less compatible types do, so their posterior odds ratio for more versus less compatible types is at least the prior ratio.

Proof.
Suppose π∗ is patiently stable under regular prior g. Fix a signal s' and an action â ∉ BR(P(s', π∗), s'). Let h > 0 be arbitrary; we will show that π∗2(â|s') < h. Since the choices of s', â, and h > 0 are arbitrary, this proves the theorem.

Step 1: Setting some constants. In the statement of Lemma 3, for each pair (θ', θ'') such that θ' ≻_{s'} θ'' and θ' ∈ J(s', π∗), put ε = h/(2|Θ|²) and find C_{θ',θ''} and γ̄_{θ',θ''} so that the result holds. Let C be the maximum of all such C_{θ',θ''} and γ̄1 the maximum of all such γ̄_{θ',θ''}. Also find n ≥ 1 so that

1 − 1/n > 1 − h/(2|Θ|²).   (6)

In the statement of Lemma 4, for each θ' such that θ' ≻_{s'} θ'' for at least one θ'', find ε_{θ'}, δ̄_{θ'}(nC), and γ̄_{θ'}(nC, δ) so that the lemma holds. Write ε∗ for the minimum of the ε_{θ'}, and let δ̄∗(nC) and γ̄∗(nC, δ) denote the maxima of δ̄_{θ'}(nC) and γ̄_{θ'}(nC, δ) across such θ'.

Step 2: Finding a steady-state profile with large δ, γ that approximates π∗. Since π∗ is patiently stable under g, there exists a sequence of strategy profiles π^(j) → π∗ where π^(j) is δ_j-stable under g with δ_j → 1. Each π^(j) can in turn be written as the limit of steady-state strategy profiles: for each j, there exist γ_{j,k} → 1 and π^(j,k) ∈ Π∗(g, δ_j, γ_{j,k}) such that lim_{k→∞} π^(j,k) = π^(j). The convergence of the array π^(j,k) to π∗ means we may find j0 ∈ N and a function k0(j) so that whenever j ≥ j0 and k ≥ k0(j), π^(j,k) is no more than min(ε∗, h/(2|Θ|²)) away from π∗. Find j° ≥ j0 large enough that δ° := δ_{j°} > δ̄∗(nC), and then find a large enough k° ≥ k0(j°) so that γ° := γ_{j°,k°} > max(γ̄∗(nC, δ°), γ̄1). So we have identified a steady-state profile π° := π^(j°,k°) ∈ Π∗(g, δ°, γ°) that approximates π∗ to within min(ε∗, h/(2|Θ|²)).

Step 3: Applying properties of R1 and R2. For each pair (θ', θ'') such that θ' ≻_{s'} θ'' and θ' ∈ J(s', π∗), we will bound the probability that π°2(·|s') does not best respond to P_{θ'≻θ''} by h/|Θ|². Since there are at most |Θ|·(|Θ|−1) such pairs in the intersection defining P(s', π∗), this would imply that π°2(â|s') < |Θ|·(|Θ|−1)·h/|Θ|², since â ∉ BR(P(s', π∗), s'). And since π° is no more than h/(2|Θ|²) away from π∗, this would show π∗2(â|s') < h.

By construction, π° is closer than ε_{θ'} to π∗, and furthermore δ° ≥ δ̄_{θ'}(nC) and γ° ≥ γ̄_{θ'}(nC, δ°). By Lemma 4, π°1(s'|θ') ≥ nC(1−γ°). At the same time, π°1 = R1[π°] and θ' ≻_{s'} θ'', so Lemma 2 implies that π°1(s'|θ') ≥ π°1(s'|θ''). Turning to the receiver side, π°2 = R2[π°], with π° satisfying the conditions of Lemma 3 associated with ε = h/(2|Θ|²) and γ° ≥ γ̄1. Therefore we conclude

π°2(BR(P_{θ'≻θ''}, s') | s') ≥ 1 − 1/n − h/(2|Θ|²).

But by the construction of n in Equation (6), 1 − 1/n > 1 − h/(2|Θ|²), so the left-hand side is at least 1 − h/|Θ|², as desired.

Remark. More generally, consider any model for our populations of agents with geometrically distributed lifetimes that generates aggregate response functions R1 and R2. Defining the steady states under (g, δ, γ) as the strategy profiles π∗ such that R1^{g,δ,γ}[π∗] = π∗1 and R2^{g,δ,γ}[π∗] = π∗2, the proof of Theorem 2 applies to the patiently stable profiles of the new learning model provided that R1 satisfies the conclusion of Lemma 2, R2 satisfies the conclusion of Lemma 3, and Lemma 4 is valid for (θ', s') pairs such that θ' ≻_{s'} θ'' for at least one type θ'' and θ' ∈ J(s', π∗). We outline two such more general learning models below. (The proof is in the Online Appendix.)

Corollary 1.
With either of the following modifications of the steady-state learning model from Section 2, every patiently stable strategy profile still satisfies the compatibility criterion.

(i) Heterogeneous priors. There is a finite collection of regular sender priors {g_{1,k}}_{k=1}^{n_1} and a finite collection of regular receiver priors {g_{2,k}}_{k=1}^{n_2}. Upon birth, an agent is endowed with a random prior, where the distributions over priors are μ1 and μ2 for senders and receivers. An agent's prior is independent of her payoff type, and furthermore no one ever observes another person's prior.

(ii) Social learning. Suppose a 1 − α fraction of the senders are "normal learners" as described in Section 2, but the remaining 0 < α < 1 fraction are "social learners." At the end of each period, a social learner can observe the extensive-form strategies of her matched receiver and of c ≥ 1 other matches sampled uniformly at random. Each sender knows upon birth whether she is a normal learner or a social learner, and this is uncorrelated with her payoff type. Receivers cannot distinguish between the two kinds of senders.

Example 1 (Continued). The beer-quiche game of Example 1 has two components of Nash equilibria: "beer-pooling equilibria," where both types play Beer with probability 1, and "quiche-pooling equilibria," where both types play Quiche with probability 1. In a quiche-pooling equilibrium π∗, type θ_strong's equilibrium payoff is 2, so θ_strong ∈ J(Beer, π∗), since θ_strong's highest possible payoff under Beer is 3, and we have already shown that θ_strong ≻_Beer θ_weak. So

P(Beer, π∗) = { p ∈ Δ(Θ) : p(θ_weak)/p(θ_strong) ≤ λ(θ_weak)/λ(θ_strong) = 1/9 }.

Fight is not a best response after Beer to any such belief, so equilibria in which Fight occurs with positive probability after Beer do not satisfy the compatibility criterion, and thus no quiche-pooling equilibrium is patiently stable. Since the set of patiently stable outcomes is a nonempty subset of the set of Nash equilibria, pooling on beer is the unique patiently stable outcome. By Corollary 1, quiche-pooling equilibria are still not patiently stable in the more general learning models involving either heterogeneous priors or social learners. □
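The best-response claim in this example is a finite check, and a short numerical sketch can make it concrete. The receiver payoffs and the prior below follow the textbook beer-quiche specification (payoff 1 for the "correct" action, 0 otherwise, with λ(θ_strong) = 0.9); since the example's payoff matrix is not reproduced here, treat these primitives as illustrative assumptions.

```python
# Receiver payoffs in the textbook beer-quiche specification (an assumption,
# not reproduced from the paper's Example 1): payoff 1 for the "correct"
# action (Fight vs. theta_weak, NotFight vs. theta_strong), 0 otherwise.
def fight_is_best_response(p_weak):
    u_fight = 1.0 * p_weak            # Fight is correct only against theta_weak
    u_not_fight = 1.0 * (1 - p_weak)  # NotFight is correct against theta_strong
    return u_fight >= u_not_fight

# Admissible beliefs after Beer: posterior odds p(weak)/p(strong) may not
# exceed the prior odds 1/9, i.e. p_weak <= 0.1.
admissible = [k / 1000 for k in range(0, 101)]
assert not any(fight_is_best_response(p) for p in admissible)
assert fight_is_best_response(0.6)  # unrestricted beliefs can justify Fight
```

Under these payoffs, Fight only becomes a best response once p(θ_weak) ≥ 1/2, far above anything the compatibility criterion admits after Beer.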
In generic signaling games, equilibria where the receiver plays a pure strategy must satisfy astronger condition than the compatibility criterion to be patiently stable.
Definition 13.
Let

J̃(s, π∗) := { θ ∈ Θ : max_{a∈A} u1(θ, s, a) ≥ u1(θ; π∗) }.

If J̃(s', π∗) is nonempty, define the strongly admissible beliefs at signal s' under profile π∗ to be

P̃(s', π∗) := Δ(J̃(s', π∗)) ∩ ⋂ { P_{θ'≻θ''} : θ' ≻_{s'} θ'' },

where P_{θ'≻θ''} is defined in Equation (5). Otherwise, define P̃(s', π∗) := Δ(Θ). Here J̃(s, π∗) is the set of types for which some response to signal s is at least as good as their equilibrium payoff under π∗; that is, the set of types for whom s is not equilibrium dominated in the sense of Cho and Kreps (1987). Note that P̃, unlike P, assigns probability 0 to equilibrium-dominated types, which is the belief restriction of the Intuitive Criterion.

Definition 14. A Nash equilibrium π∗ is on-path strict for the receiver if for every on-path signal s∗, π∗2(a∗|s∗) = 1 for some a∗ ∈ A and u2(p∗(·|s∗), s∗, a∗) > max_{a≠a∗} u2(p∗(·|s∗), s∗, a).

Of course, the receiver cannot have strict ex ante preferences over play at unreached information sets; this condition is called "on-path strict" because it places no restrictions on the receiver's incentives after off-path signals. In generic signaling games, all pure-strategy equilibria are on-path strict for the receiver, but the same is not true for mixed-strategy equilibria.

Definition 15. A strategy profile π∗ satisfies the strong compatibility criterion if at every signal s' we have π∗2(·|s') ∈ Δ(BR(P̃(s', π∗), s')).

It is immediate that the strong compatibility criterion implies the compatibility criterion, since it places more stringent restrictions on the receiver's behavior. It is also immediate that the strong compatibility criterion implies the Intuitive Criterion.
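The nesting P̃(s, π∗) ⊆ P(s, π∗) behind these implications can be checked directly on a grid of beliefs in a two-type example; the uniform prior and the dominance configuration below are hypothetical stand-ins, not taken from the paper.

```python
# Two types: index 0 is theta' (more compatible with s), index 1 is theta''.
# Hypothetical primitives: uniform prior, theta' in J-tilde, and theta''
# equilibrium dominated, so the strong restriction also forces p(theta'') = 0.
lam = (0.5, 0.5)
grid = [(k / 100, 1 - k / 100) for k in range(101)]  # discretized Delta(Theta)

def admissible(p):  # P: odds of theta'' vs theta' do not exceed the prior odds
    return p[1] * lam[0] <= p[0] * lam[1]

def strongly_admissible(p):  # P-tilde: admissible and supported on J-tilde
    return admissible(p) and p[1] == 0.0

P = {p for p in grid if admissible(p)}
P_tilde = {p for p in grid if strongly_admissible(p)}
assert P_tilde <= P          # strong compatibility refines compatibility
assert lam in P              # the prior itself is always admissible
assert lam not in P_tilde    # but not strongly admissible here
```

Because P̃ is a subset of P, best responses to beliefs in P̃ are a subset of best responses to beliefs in P, which is exactly why the strong criterion is the more restrictive one.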
Theorem 3.
Suppose π∗ is on-path strict for the receiver and patiently stable. Then it satisfies the strong compatibility criterion.

The proof of this theorem appears in Appendix A.5. The main idea is that when an off-path signal s' is equilibrium dominated in π∗ for type θ_D but not even weakly equilibrium dominated for type θ_U, type θ_U will experiment "infinitely more often" with s' than θ_D does. Indeed, we can provide an upper bound on the steady-state probability that θ_D ever switches away from her equilibrium signal s∗ after trying it for the first time, which is also an upper bound on how often θ_D experiments with s', while Lemma 4 provides a lower bound on how often θ_U plays s'. We show there is a sequence of steady-state profiles π^(k) ∈ Π∗(g, δ_k, γ_k) with γ_k → 1 and π^(k) → π∗ along which the ratio of the lower bound to the upper bound goes to infinity. Applying Theorem 2 of Fudenberg, He, and Imhof (2017), we can then prove that receivers infer an s'-sender is "infinitely more likely" to be θ_U than θ_D, which means receivers must assign probability 0 to θ_D after s' in equilibrium π∗.

Remark. As noted by Fudenberg and Kreps (1988) and Sobel, Stole, and Zapater (1990), it seems "intuitive" that learning and rational experimentation should lead receivers to assign probability 0 to types that are equilibrium dominated, so it might seem surprising that this theorem needs the additional assumption that the equilibrium is on-path strict for the receiver. (The upper bound above does not apply when π∗ is not on-path strict for the receiver. When π∗ involves the receiver strictly mixing between several responses after s∗, some of these responses might make θ_D strictly worse off than her worst payoff after s', so there is non-vanishing probability that θ_D observes a large number of these bad responses in a row and then stops playing s∗.) In our model, senders start out uncertain about the receivers' play, and so even types for whom a signal is equilibrium dominated might initially experiment with it. Showing that these experiments do not lead to "perverse" responses by the receivers requires some arguments about the relative probabilities with which equilibrium-dominated types and non-equilibrium-dominated types play off-path signals. When the equilibrium involves on-path receiver randomization, a non-trivial fraction of receivers could play an action after a type's equilibrium signal that the type finds strictly worse than her worst payoff under an off-path signal. In this case, we do not see how to show that the probability she ever switches away from her equilibrium signal tends to 0 with patience, since the event of seeing a large number of these unfavorable responses in a row has probability bounded away from 0 even when the receiver population plays exactly their equilibrium strategy. However, we do not have a counterexample showing that the conclusion of the theorem fails without on-path strictness for the receiver.
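The streak event in this remark is easy to quantify. As a hedged illustration (the mixing probability and run length below are arbitrary choices, not the paper's), the chance that a sender sees m unfavorable responses in a row at least once grows to 1 with her lifespan, which is why the upper-bound argument breaks down under on-path mixing:

```python
def prob_bad_run(q, m, T):
    """P(at least one run of m consecutive unfavorable responses among
    T i.i.d. draws), where each draw is unfavorable with probability q.
    Computed exactly by dynamic programming over the current streak length."""
    dist = [1.0] + [0.0] * (m - 1)  # dist[j]: mass with current streak length j
    hit = 0.0                        # absorbed mass: a run of m already occurred
    for _ in range(T):
        new = [0.0] * m
        for streak, mass in enumerate(dist):
            new[0] += mass * (1 - q)      # favorable response resets the streak
            if streak + 1 == m:
                hit += mass * q           # streak reaches m: absorb
            else:
                new[streak + 1] += mass * q
        dist = new
    return hit

# With a fixed receiver mixture, longer-lived senders almost surely hit a run.
assert prob_bad_run(0.5, 2, 2) == 0.25
assert prob_bad_run(0.5, 3, 10) < prob_bad_run(0.5, 3, 100)
assert prob_bad_run(0.5, 3, 1000) > 0.99
```

For a fixed mixture the run probability does not vanish as patience grows; only the sender's tolerance for bad runs changes, which is the wedge the on-path strictness assumption closes.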
Example 3.
In the following modified beer-quiche game, the payoffs from fighting a type θ_weak who drinks beer have been substantially increased relative to Example 1, so that Fight is now a best response to the prior belief λ after Beer.

Since the prior λ is always an admissible belief in any signaling game after any signal, the Nash equilibrium π∗ where both types play Quiche (supported by the receiver playing Fight after Beer) is not ruled out by the compatibility criterion, unlike in Example 1. However, this equilibrium is ruled out by the strong compatibility criterion. To see why, note that this pooling equilibrium is on-path strict for the receiver, because the receiver has a strict preference for NotFight at the only on-path signal, Quiche. Moreover, π∗ does not satisfy the strong compatibility criterion, because J̃(Beer, π∗) = {θ_strong} implies the only strongly admissible belief after Beer assigns probability 1 to the sender being θ_strong. Thus Theorem 3 implies that this equilibrium is not patiently stable. □

Discussion
Our learning model supposes that the agents have geometrically distributed lifetimes, which is one of the reasons that the senders' optimization problems can be solved using the Gittins index. If agents instead had fixed finite lifetimes, as in Fudenberg and Levine (1993, 2006), their optimization problem would not be stationary, and the finite-horizon analog of the Gittins index is only approximately optimal for the finite-horizon multi-armed bandit problem (Niño-Mora, 2011). Applying the geometric-lifetime framework to steady-state learning models for other classes of extensive-form games could prove fruitful, especially for games where we need to compare the behavior of various players or player types, and in studies of other sorts of dynamic decisions.

Theorem 1 provides a comparison between the dynamic behavior of two agents in a geometric-lifetime bandit problem based on their static preferences over the prizes. As an immediate application, consider a principal-agent setting where the agent faces a multi-armed bandit with arms s ∈ S, where s leads to a prize drawn from Z_s according to some distribution. The principal knows the agent's per-period utility function u : ∪_s Z_s → R, but not the agent's beliefs over the prize distributions of the different arms or the agent's discount factor. Suppose the principal observes the agent choosing arm 1 in the first period. The principal can impose taxes and subsidies on the different prizes and arms, changing the agent's utility function to ũ. For what taxes and subsidies would the agent still have chosen arm 1 in the first period, irrespective of her initial beliefs and discount factor? According to Theorem 1, the answer is precisely those taxes and subsidies such that arm 1 is more type-compatible with ũ than with u.

Our results provide an upper bound on the set of patiently stable strategy profiles in a signaling game.
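Since the Gittins index does the computational work in the sender's problem, a numerical sketch may help. The code below approximates the index of a single Bernoulli arm with a Beta prior via the standard calibration against a safe arm, using truncated value iteration; the depth, tolerance, and priors are illustrative choices, not the paper's. It also illustrates the fact used in the intuition for Lemma 4: the index rises toward the arm's best possible payoff as the discount factor approaches 1.

```python
from functools import lru_cache

def gittins_index(a, b, beta, depth=50, tol=1e-4):
    """Approximate Gittins index (in per-period payoff units) of a Bernoulli
    arm with Beta(a, b) prior: the largest safe per-period payoff lam at which
    pulling the arm at least once more is still weakly optimal."""
    def excess_value(lam):
        @lru_cache(maxsize=None)
        def V(x, y, d):
            if d == 0:
                return 0.0          # truncate the horizon
            p = x / (x + y)         # posterior mean probability of a success
            pull = p - lam + beta * (p * V(x + 1, y, d - 1)
                                     + (1 - p) * V(x, y + 1, d - 1))
            return max(0.0, pull)   # 0 = retire to the safe payoff lam
        return V(a, b, depth)
    lo, hi = 0.0, 1.0               # bisect on the calibration payoff
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if excess_value(mid) > 0 else (lo, mid)
    return (lo + hi) / 2

# The index exceeds the myopic mean (the option value of experimentation) and
# increases toward the arm's best payoff as the agent becomes more patient.
i_half, i_90, i_99 = (gittins_index(1, 1, beta) for beta in (0.5, 0.9, 0.99))
assert 0.5 < i_half < i_90 < i_99 < 1.0
```

In the paper's setting the relevant discount factor is the product δγ, so patience and survival enter the sender's experimentation incentives symmetrically through this index.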
In Fudenberg and He (2017), we provided a lower bound for the same set, as well as a sharper upper bound under additional restrictions on the priors. Even taken together, these results do not give an exact characterization of patiently stable outcomes. Nevertheless, they show how the theory of learning in games provides a foundation for refining the set of equilibria in signaling games.

In future work, we hope to investigate a learning model featuring temporary sender types. Instead of the sender's type being assigned at birth and fixed for life, at the start of each period each sender takes an i.i.d. draw from λ to discover her type for that period. When the players are impatient, this yields different steady states than the fixed-type model here, as noted by Dekel, Fudenberg, and Levine (2004). This model will require different tools to analyze, since the sender's problem becomes a restless bandit.

References
Banks, J. S. and J. Sobel (1987): "Equilibrium Selection in Signaling Games," Econometrica, 55, 647–661.

Bellman, R. (1956): "A Problem in the Sequential Design of Experiments," Sankhyā: The Indian Journal of Statistics (1933-1960), 16, 221–229.

Billingsley, P. (1995): Probability and Measure, John Wiley & Sons.

Cho, I.-K. and D. M. Kreps (1987): "Signaling Games and Stable Equilibria," Quarterly Journal of Economics, 102, 179–221.

Dekel, E., D. Fudenberg, and D. K. Levine (1999): "Payoff Information and Self-Confirming Equilibrium," Journal of Economic Theory, 89, 165–185.

——— (2004): "Learning to Play Bayesian Games," Games and Economic Behavior, 46, 282–303.

Diaconis, P. and D. Freedman (1990): "On the Uniform Consistency of Bayes Estimates for Multinomial Probabilities," Annals of Statistics, 18, 1317–1327.

Esponda, I. and D. Pouzo (2016): "Berk-Nash Equilibrium: A Framework for Modeling Agents With Misspecified Models," Econometrica, 84, 1093–1130.

Fudenberg, D. and K. He (2017): "Learning and Equilibrium Refinements in Signalling Games," Mimeo.

Fudenberg, D., K. He, and L. A. Imhof (2017): "Bayesian Posteriors for Arbitrarily Rare Events," Proceedings of the National Academy of Sciences, 114, 4925–4929.

Fudenberg, D. and D. M. Kreps (1988): "A Theory of Learning, Experimentation, and Equilibrium in Games," Mimeo.

——— (1993): "Learning Mixed Equilibria," Games and Economic Behavior, 5, 320–367.

——— (1994): "Learning in Extensive-Form Games, II: Experimentation and Nash Equilibrium," Mimeo.

——— (1995): "Learning in Extensive-Form Games I. Self-Confirming Equilibria," Games and Economic Behavior, 8, 20–55.

Fudenberg, D. and D. K. Levine (1993): "Steady State Learning and Nash Equilibrium," Econometrica, 61, 547–573.

——— (2006): "Superstition and Rational Learning," American Economic Review, 96, 630–651.

Gittins, J. C. (1979): "Bandit Processes and Dynamic Allocation Indices," Journal of the Royal Statistical Society, Series B (Methodological), 41, 148–177.

Jehiel, P. and D. Samet (2005): "Learning to Play Games in Extensive Form by Valuation," Journal of Economic Theory, 124, 129–148.

Kalai, E. and E. Lehrer (1993): "Rational Learning Leads to Nash Equilibrium," Econometrica, 61, 1019–1045.

Laslier, J.-F. and B. Walliser (2015): "Stubborn Learning," Theory and Decision, 79, 51–93.

Niño-Mora, J. (2011): "Computing a Classic Index for Finite-Horizon Bandits," INFORMS Journal on Computing, 23, 254–267.

Sobel, J., L. Stole, and I. Zapater (1990): "Fixed-Equilibrium Rationalizability in Signaling Games," Journal of Economic Theory, 52, 304–331.

Spence, M. (1973): "Job Market Signaling," Quarterly Journal of Economics, 87, 355–374.
A Appendix – Relegated Proofs
A.1 Proof of Proposition 1
Proposition 1: (i) ≻_s is transitive. (ii) Except when s is either strictly dominant for both θ' and θ'' or strictly dominated for both θ' and θ'', θ' ≻_s θ'' implies that θ'' ⊁_s θ'.

Proof. To show (i), suppose θ1 ≻_s θ2 and θ2 ≻_s θ3. For any π2 ∈ Π2 where s is weakly optimal for θ3, it must be strictly optimal for θ2, hence also strictly optimal for θ1. This shows θ1 ≻_s θ3.

To establish (ii), partition the set of receiver strategies as Π2 = Π2⁺ ∪ Π2⁰ ∪ Π2⁻, where the three subsets consist of the receiver strategies that make s strictly better than, indifferent to, or strictly worse than the best alternative signal for θ''. If the set Π2⁰ is nonempty, then θ' ≻_s θ'' implies θ'' ⊁_s θ'. This is because against any π2 ∈ Π2⁰, signal s is strictly optimal for θ' but only weakly optimal for θ''. At the same time, if both Π2⁺ and Π2⁻ are nonempty, then Π2⁰ is nonempty. This is because both π2 ↦ u1(θ'', s, π2(·|s)) and π2 ↦ max_{s'≠s} u1(θ'', s', π2(·|s')) are continuous functions, so for any π2⁺ ∈ Π2⁺ and π2⁻ ∈ Π2⁻, there exists α ∈ (0, 1) so that απ2⁺ + (1−α)π2⁻ ∈ Π2⁰. If only Π2⁺ is nonempty and θ' ≻_s θ'', then s is strictly dominant for both θ' and θ''. If only Π2⁻ is nonempty, then we can have θ'' ≻_s θ' only when s is never a weak best response for θ' against any π2 ∈ Π2.

A.2 Proof of Lemma 1

Lemma 1: For every signal s, stopping time τ, belief ν_s, and discount factor β, there exists π2,s(τ, ν_s, β) ∈ Δ(A) so that for every θ,

E_{ν_s}[ Σ_{t=0}^{τ−1} β^t · u1(θ, s, a_s(t)) ] / E_{ν_s}[ Σ_{t=0}^{τ−1} β^t ] = u1(θ, s, π2,s(τ, ν_s, β)).

Proof.
Step 1: Induced mixed actions.
A belief ν_s and a stopping time τ_s together define a stochastic process (A_t)_{t≥0} over the space A ∪ {∅}, where A_t ∈ A corresponds to the receiver action seen in period t if τ_s has not yet stopped (τ_s > t), and A_t := ∅ if τ_s has stopped (τ_s ≤ t). Enumerating A = {a_1, ..., a_n}, we write p_{t,i} := P_{ν_s}[A_t = a_i] for 1 ≤ i ≤ n to record the probability of seeing receiver action a_i in period t, and p_{t,0} := P_{ν_s}[A_t = ∅] = P_{ν_s}[τ_s ≤ t] for the probability of seeing no receiver action in period t due to τ_s having stopped. Given ν_s and τ_s, we define the induced mixed action after signal s, π2,s(ν_s, τ_s, β) ∈ Δ(A), by

π2,s(ν_s, τ_s, β)(a) := ( Σ_{t=0}^∞ β^t p_{t,i} ) / ( Σ_{t=0}^∞ β^t (1 − p_{t,0}) )  for the i such that a = a_i.

Since Σ_{i=1}^n p_{t,i} = 1 − p_{t,0} for each t ≥ 0, it is clear that π2,s(ν_s, τ_s, β) puts nonnegative weights on actions in A that sum to 1, so π2,s(ν_s, τ_s, β) ∈ Δ(A) may indeed be viewed as a mixture over receiver actions.

Step 2: Induced mixed actions and per-period payoff.

We now show that for any β and any stopping time τ_s for signal s, the normalized payoff in the stopping problem is equal to the utility of playing s against π2,s(ν_s, τ_s, β) for one period, that is,

u1(θ, s, π2,s(ν_s, τ_s, β)) = E_{ν_s}[ Σ_{t=0}^{τ_s−1} β^t · u1(θ, s, a_s(t)) ] / E_{ν_s}[ Σ_{t=0}^{τ_s−1} β^t ].

To see why this is true, rewrite the denominator of the right-hand side as

E_{ν_s}[ Σ_{t=0}^{τ_s−1} β^t ] = E_{ν_s}[ Σ_{t=0}^∞ 1{τ_s > t} · β^t ] = Σ_{t=0}^∞ β^t · P_{ν_s}[τ_s > t] = Σ_{t=0}^∞ β^t (1 − p_{t,0}),

and rewrite the numerator as

E_{ν_s}[ Σ_{t=0}^{τ_s−1} β^t · u1(θ, s, a_s(t)) ] = Σ_{t=0}^∞ β^t · ( p_{t,0} · 0 + Σ_{i=1}^n p_{t,i} · u1(θ, s, a_i) ) = Σ_{i=1}^n ( Σ_{t=0}^∞ β^t · p_{t,i} ) · u1(θ, s, a_i),

where the middle expression uses the fact that the payoff is 0 if the process has already stopped and that otherwise a_s(t) is distributed according to (p_{t,i})_i. So overall we get, as desired,

E_{ν_s}[ Σ_{t=0}^{τ_s−1} β^t · u1(θ, s, a_s(t)) ] / E_{ν_s}[ Σ_{t=0}^{τ_s−1} β^t ] = Σ_{i=1}^n [ ( Σ_{t=0}^∞ β^t p_{t,i} ) / ( Σ_{t=0}^∞ β^t (1 − p_{t,0}) ) ] · u1(θ, s, a_i) = u1(θ, s, π2,s(ν_s, τ_s, β)).

A.3 Proof of Lemma 3
Lemma 3 : Let regular prior g , types θ , θ , and signal s be fixed. For every (cid:15) >
0, there exists
C > γ < ≤ δ < γ ≤ γ <
1, and n ≥
1, if π ( s | θ ) ≥ π ( s | θ ) and π ( s | θ ) ≥ (1 − γ ) nC , then R [ π ](BR( P θ .θ , s ) | s ) ≥ − n − (cid:15). We invoke Theorem 2 of Fudenberg, He, and Imhof (2017), which in our setting says:
Let regular prior g and signal s be fixed. Let < (cid:15), h < . There exists C such thatwhenever π ( s | θ ) ≥ π ( s | θ ) and t · π ( s | θ ) ≥ C , we get ψ π y ∈ Y [ t ] : p ( θ | s ; y ) p ( θ | s ; y ) ≤ − h · λ ( θ ) λ ( θ ) ! /ψ π ( Y [ t ]) ≥ − (cid:15) where p ( θ | s ; y ) refers to the conditional probability that a sender of s is type θ ac-cording to the posterior belief induced by history y . That is, if at age t a receiver would have observed in expectation C instances of type θ sending s , then the belief of at least 1 − (cid:15) fraction of age t receivers (essentially) falls in P θ .θ afterseeing the signal s . The proof of Lemma 3 calculates what fraction of receivers meets this “agerequirement.” Proof.
We will show the following stronger result:

Let regular prior $g$, types $\theta', \theta''$, and signal $s'$ be fixed. For every $\epsilon > 0$, there exists $C > 0$ such that for all $0 \le \delta, \gamma < 1$ and $n \ge 1$, if $\pi_1(s'|\theta') \ge \pi_1(s'|\theta'')$ and $\pi_1(s'|\theta') \ge (1-\gamma) n C$, then

$$R_2[\pi](\mathrm{BR}(P_{\theta'.\theta''}, s') \,|\, s') \ge \gamma^{\lceil 1/(n(1-\gamma)) \rceil} - \epsilon.$$

The lemma follows because we may pick a large enough $\underline{\gamma} < 1$ so that $\gamma^{\lceil 1/(n(1-\gamma)) \rceil} > 1 - \frac{1}{n}$ for all $n \ge 1$ whenever $\gamma \ge \underline{\gamma}$.

For each $0 < h < 1$, define

$$P^{h}_{\theta'.\theta''} := \left\{ p \in \Delta(\Theta) : \frac{p(\theta'')}{p(\theta')} \le (1-h) \cdot \frac{\lambda(\theta'')}{\lambda(\theta')} \right\},$$

with the convention that $\frac{0}{0} = 0$. Then it is clear that each $P^{h}_{\theta'.\theta''}$, as well as $P_{\theta'.\theta''}$ itself, is a closed subset of $\Delta(\Theta)$. Also, $P^{h}_{\theta'.\theta''} \to P_{\theta'.\theta''}$ as $h \to 0$. Fix $a \in A$. If for all $\bar{h} > 0$ there exists $0 < h \le \bar{h}$ so that $a \in \mathrm{BR}(P^{h}_{\theta'.\theta''}, s')$, then $a \in \mathrm{BR}(P_{\theta'.\theta''}, s')$ as well, due to the best-response correspondence having a closed graph. This means that, for each $a \notin \mathrm{BR}(P_{\theta'.\theta''}, s')$, there exists $\bar{h}_a > 0$ so that $a \notin \mathrm{BR}(P^{h}_{\theta'.\theta''}, s')$ whenever $0 < h \le \bar{h}_a$. Let $\bar{h} := \min_{a \notin \mathrm{BR}(P_{\theta'.\theta''}, s')} \bar{h}_a$. Let $\epsilon > 0$ be given. Apply Theorem 2 of Fudenberg, He, and Imhof (2017) with $\epsilon$ and $\bar{h}$ to find the constant $C$.

When $\pi_1(s'|\theta') \ge \pi_1(s'|\theta'')$ and $\pi_1(s'|\theta') \ge (1-\gamma) n C$, consider an age-$t$ receiver for $t \ge \lceil 1/(n(1-\gamma)) \rceil$. Since $t \cdot \pi_1(s'|\theta') \ge C$, Theorem 2 of Fudenberg, He, and Imhof (2017) implies there is probability at least $1-\epsilon$ that this receiver's belief about the types who send $s'$ falls in $P^{\bar{h}}_{\theta'.\theta''}$. By the construction of $\bar{h}$, $\mathrm{BR}(P^{\bar{h}}_{\theta'.\theta''}, s') = \mathrm{BR}(P_{\theta'.\theta''}, s')$, so a $1-\epsilon$ fraction of age-$t$ receivers have a history $y$ where $\sigma_2(y)(s') \in \mathrm{BR}(P_{\theta'.\theta''}, s')$.

Since agents survive between periods with probability $\gamma$, the mass of the receiver population aged $\lceil 1/(n(1-\gamma)) \rceil$ or older is

$$(1-\gamma) \cdot \sum_{t = \lceil 1/(n(1-\gamma)) \rceil}^{\infty} \gamma^t = \gamma^{\lceil 1/(n(1-\gamma)) \rceil}.$$

This shows

$$R_2[\pi](\mathrm{BR}(P_{\theta'.\theta''}, s') \,|\, s') \ge \gamma^{\lceil 1/(n(1-\gamma)) \rceil} \cdot (1-\epsilon) \ge \gamma^{\lceil 1/(n(1-\gamma)) \rceil} - \epsilon,$$

as desired.

A.4 Proof of Proposition 2
Proposition 2: $\pi^* \in \Pi^*(g, \delta, \gamma)$ if and only if $R_1^{g,\delta,\gamma}[\pi^*] = \pi_1^*$ and $R_2^{g,\delta,\gamma}[\pi^*] = \pi_2^*$.

Proof. If: Suppose $\pi^*$ is such that $R_1[\pi^*] = \pi_1^*$ and $R_2[\pi^*] = \pi_2^*$. Consider the state $\psi^*$ defined as $\psi^*_\theta := \psi^{\pi^*}_\theta$ for each $\theta$ and $\psi^*_2 := \psi^{\pi^*}_2$. Then, by construction, $\sigma_\theta(\psi^{\pi^*}_\theta) = \pi^*_\theta$ and $\sigma_2(\psi^{\pi^*}_2) = \pi_2^*$, so the state $\psi^*$ gives rise to $\pi^*$. To verify that $\psi^*$ is a steady state, we can expand by the definition of $\psi^{\pi^*}_\theta$,

$$f_\theta(\psi^{\pi^*}_\theta, \pi^*) = f_\theta\left( \lim_{T \to \infty} f^T_\theta(\tilde{\psi}_\theta, \pi^*), \ \pi^* \right),$$

where $\tilde{\psi}_\theta$ is any arbitrary initial state. Since $f_\theta$ is continuous at $\psi^{\pi^*}_\theta$ in the $L^1$ distance defined in Footnote 20 (this is implied by Step 1 of the proof of Proposition 3 in the Online Appendix, which shows $f_\theta$ is continuous at all states that assign $(1-\gamma)\gamma^t$ mass to the set of length-$t$ histories), $\lim_{T \to \infty} f^T_\theta(\tilde{\psi}_\theta, \pi^*) = \psi^{\pi^*}_\theta$ is a fixed point of $f_\theta(\cdot, \pi^*)$. To see this, write $\psi^{(T)}_\theta := f^T_\theta(\tilde{\psi}_\theta, \pi^*)$ for each $T \ge 1$. Given $\epsilon > 0$, continuity of $f_\theta$ implies there is $\zeta > 0$ so that $d(f_\theta(\psi^{\pi^*}_\theta, \pi^*), f_\theta(\psi^{(T)}_\theta, \pi^*)) < \epsilon/2$ whenever $d(\psi^{\pi^*}_\theta, \psi^{(T)}_\theta) < \zeta$. So pick a large enough $T$ so that $d(\psi^{\pi^*}_\theta, \psi^{(T)}_\theta) < \zeta$ and also $d(\psi^{\pi^*}_\theta, \psi^{(T+1)}_\theta) < \epsilon/2$. Then

$$d(f_\theta(\psi^{\pi^*}_\theta, \pi^*), \psi^{\pi^*}_\theta) \le d(f_\theta(\psi^{\pi^*}_\theta, \pi^*), f_\theta(\psi^{(T)}_\theta, \pi^*)) + d(\psi^{(T+1)}_\theta, \psi^{\pi^*}_\theta) < \epsilon/2 + \epsilon/2 = \epsilon.$$

Since $\epsilon > 0$ was arbitrary, $f_\theta(\psi^{\pi^*}_\theta, \pi^*) = \psi^{\pi^*}_\theta$, and a similar argument shows $f_2(\psi^{\pi^*}_2, \pi^*) = \psi^{\pi^*}_2$. This tells us $\psi^* = ((\psi^{\pi^*}_\theta)_{\theta \in \Theta}, \psi^{\pi^*}_2)$ is a steady state.

Only if: Conversely, suppose $\pi^* \in \Pi^*(g, \delta, \gamma)$. Then there exists a steady state $\psi^* \in \Psi^*(g, \delta, \gamma)$ such that $\pi^* = \sigma(\psi^*)$. This means $f_\theta(\psi^*_\theta, \pi^*) = \psi^*_\theta$, so iterating shows $\psi^{\pi^*}_\theta := \lim_{T \to \infty} f^T_\theta(\psi^*_\theta, \pi^*) = \psi^*_\theta$. Since $R_1[\pi^*](\cdot|\theta) := \sigma_\theta(\psi^{\pi^*}_\theta)$, the above implies $R_1[\pi^*](\cdot|\theta) = \sigma_\theta(\psi^*_\theta) = \pi_1^*(\cdot|\theta)$ by the choice of $\psi^*$. We can similarly show $R_2[\pi^*] = \pi_2^*$.

A.5 Proof of Theorem 3
Throughout this subsection, we will make use of the following version of Hoeffding’s inequality.
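As a quick sense check of how the bound behaves (a Monte Carlo sketch with arbitrary toy parameters, not part of the argument), the empirical frequency of a large deviation of a sum of bounded i.i.d. draws indeed stays below the exponential bound stated next:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, trials = 200, 30.0, 20_000
# X_i uniform on [0, 1]: a_i = 0, b_i = 1, so sum_i (b_i - a_i)^2 = n.
S = rng.random((trials, n)).sum(axis=1)
freq = np.mean(np.abs(S - n / 2) >= d)   # empirical deviation frequency
bound = 2 * np.exp(-2 * d**2 / n)        # Hoeffding upper bound
print(freq <= bound)
```

Here a deviation of $d = 30$ is about seven standard deviations of the sum, so the empirical frequency is essentially zero, well inside the bound of roughly $2e^{-9}$.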
Fact. (Hoeffding's inequality) Suppose $X_1, \ldots, X_n$ are independent random variables on $\mathbb{R}$ such that $a_i \le X_i \le b_i$ with probability 1 for each $i$. Write $S_n := \sum_{i=1}^n X_i$. Then,

$$\mathbb{P}\left[ |S_n - \mathbb{E}[S_n]| \ge d \right] \le 2 \exp\left( \frac{-2 d^2}{\sum_{i=1}^n (b_i - a_i)^2} \right).$$

Lemma A.1.
In strategy profile $\pi^*$, suppose $s^*$ is on-path and $\pi_2^*(a^*|s^*) = 1$, where $a^*$ is a strict best response to $s^*$ given $\pi^*$. Then there exists $N \in \mathbb{R}$ so that, for any regular prior $g$ and any sequence of steady-state strategy profiles $\pi^{(k)} \in \Pi^*(g, \delta_k, \gamma_k)$ where $\gamma_k \to 1$, $\pi^{(k)} \to \pi^*$, there exists $K \in \mathbb{N}$ such that whenever $k \ge K$, we have $\pi_2^{(k)}(a^*|s^*) \ge 1 - (1-\gamma_k) \cdot N$.

Proof. Since $a^*$ is a strict best response after $s^*$ for $\pi^*$, there exists $\epsilon > 0$ so that $a^*$ will continue to be a strict best response after $s^*$ for any $\pi \in \Pi$ where, for every $\theta \in \Theta$, $|\pi_1(s^*|\theta) - \pi_1^*(s^*|\theta)| < 3\epsilon$. Since $\pi^{(k)} \to \pi^*$, find a large enough $K$ such that $k \ge K$ implies, for every $\theta \in \Theta$, $|\pi_1^{(k)}(s^*|\theta) - \pi_1^*(s^*|\theta)| < \epsilon$.

Write $e^{\mathrm{obs}}_{n,\theta}$ for the probability that an age-$n$ receiver has encountered type $\theta$ fewer than $\frac{n\lambda(\theta)}{2}$ times. We will find a number $N^{\mathrm{obs}} < \infty$ so that

$$\sum_{\theta \in \Theta} \sum_{n=1}^{\infty} e^{\mathrm{obs}}_{n,\theta} \le N^{\mathrm{obs}}.$$

Fix some $\theta \in \Theta$. Write $Z^{(\theta)}_t \in \{0,1\}$ as the indicator random variable for whether the receiver sees a type $\theta$ in period $t$ of his life, and write $S_n := \sum_{t=1}^n Z^{(\theta)}_t$ for the total number of type $\theta$ encountered up to age $n$. We have $\mathbb{E}[S_n] = n\lambda(\theta)$, so we can use Hoeffding's inequality to bound $e^{\mathrm{obs}}_{n,\theta}$:

$$e^{\mathrm{obs}}_{n,\theta} \le \mathbb{P}\left[ |S_n - \mathbb{E}[S_n]| \ge \frac{n\lambda(\theta)}{2} \right] \le 2 \exp\left( \frac{-2 \cdot [n\lambda(\theta)/2]^2}{n} \right).$$

This shows $e^{\mathrm{obs}}_{n,\theta}$ tends to 0 at the same rate as $\exp(-n)$, so

$$\sum_{n=1}^{\infty} e^{\mathrm{obs}}_{n,\theta} \le \sum_{n=1}^{\infty} 2 \exp\left( \frac{-2 \cdot [n\lambda(\theta)/2]^2}{n} \right) =: N^{\mathrm{obs}}_\theta < \infty.$$

So we set $N^{\mathrm{obs}} := \sum_{\theta \in \Theta} N^{\mathrm{obs}}_\theta$.

Next, write $e^{\mathrm{bias},k}_{n,\theta}$ for the probability that, after observing $\lfloor \frac{n\lambda(\theta)}{2} \rfloor$ i.i.d. draws from $\pi_1^{(k)}(\cdot|\theta)$, the empirical frequency of signal $s^*$ differs from $\pi_1^*(s^*|\theta)$ by $2\epsilon$ or more. So again, write $Z^{\theta,k}_t \in \{0,1\}$ to indicate if the $t$-th draw resulted in signal $s^*$, with $\mathbb{E}[Z^{\theta,k}_t] = \pi_1^{(k)}(s^*|\theta)$, and put $S_{n,k} := \sum_{t=1}^{\lfloor n\lambda(\theta)/2 \rfloor} Z^{\theta,k}_t$ for the total number of $s^*$ out of $\lfloor \frac{n\lambda(\theta)}{2} \rfloor$ draws. We have $\mathbb{E}[S_{n,k}] = \lfloor \frac{n\lambda(\theta)}{2} \rfloor \cdot \pi_1^{(k)}(s^*|\theta)$, but $|\pi_1^{(k)}(s^*|\theta) - \pi_1^*(s^*|\theta)| < \epsilon$ whenever $k \ge K$. That means, for $k \ge K$,

$$e^{\mathrm{bias},k}_{n,\theta} := \mathbb{P}\left[ \left| \frac{S_{n,k}}{\lfloor n\lambda(\theta)/2 \rfloor} - \pi_1^*(s^*|\theta) \right| \ge 2\epsilon \right] \le \mathbb{P}\left[ \left| \frac{S_{n,k}}{\lfloor n\lambda(\theta)/2 \rfloor} - \pi_1^{(k)}(s^*|\theta) \right| \ge \epsilon \right] = \mathbb{P}\left[ |S_{n,k} - \mathbb{E}[S_{n,k}]| \ge \lfloor n\lambda(\theta)/2 \rfloor \cdot \epsilon \right] \le 2 \exp\left( \frac{-2 \cdot (\lfloor n\lambda(\theta)/2 \rfloor \cdot \epsilon)^2}{\lfloor n\lambda(\theta)/2 \rfloor} \right)$$

by Hoeffding's inequality. Let

$$N^{\mathrm{bias}}_\theta := \sum_{n=1}^{\infty} 2 \exp\left( \frac{-2 \cdot (\lfloor n\lambda(\theta)/2 \rfloor \cdot \epsilon)^2}{\lfloor n\lambda(\theta)/2 \rfloor} \right),$$

with $N^{\mathrm{bias}}_\theta < \infty$ since the summand tends to 0 at the same rate as $\exp(-n)$. This argument shows that, whenever $k \ge K$, we have $\sum_{n=1}^{\infty} e^{\mathrm{bias},k}_{n,\theta} \le N^{\mathrm{bias}}_\theta$. Now let $N^{\mathrm{bias}} := \sum_{\theta \in \Theta} N^{\mathrm{bias}}_\theta$.

Finally, since $g$ is regular, we appeal to Proposition 1 of Fudenberg, He, and Imhof (2017) to see that there exists some $N'$ so that whenever the receiver has a data set of size $n \ge N'$ on type $\theta$'s play, his Bayesian posterior as to the probability that $\theta$ plays $s^*$ differs from the empirical distribution by no more than $\epsilon$. Put $N^{\mathrm{age}} := \frac{2 N'}{\min_{\theta \in \Theta} \lambda(\theta)}$.

Consider any steady state $\psi^{(k)}$ with $k \ge K$. With probability no smaller than $1 - \sum_{\theta \in \Theta} e^{\mathrm{bias},k}_{n,\theta}$, an age-$n$ receiver who has seen at least $\frac{n\lambda(\theta)}{2}$ instances of type $\theta$ for every $\theta \in \Theta$ will have an empirical distribution such that every type's probability of playing $s^*$ differs from $\pi_1^*(s^*|\theta)$ by less than $2\epsilon$. If, furthermore, $n \ge N^{\mathrm{age}}$, then in fact $\frac{n\lambda(\theta)}{2} \ge N'$ for each $\theta$, so the same probability bound applies to the event that the receiver's Bayesian posterior on every type $\theta$ playing $s^*$ is closer than $3\epsilon$ to $\pi_1^*(s^*|\theta)$. By the construction of $\epsilon$, playing $a^*$ after $s^*$ is the unique best response to such a posterior.

Therefore, for $k \ge K$, the probability that the receiver population plays some action other than $a^*$ after $s^*$ in $\psi^{(k)}$ is bounded by

$$N^{\mathrm{age}} (1-\gamma_k) + (1-\gamma_k) \cdot \sum_{n=0}^{\infty} \gamma_k^n \cdot \sum_{\theta \in \Theta} \left( e^{\mathrm{obs}}_{n,\theta} + e^{\mathrm{bias},k}_{n,\theta} \right).$$

To explain this expression: receivers aged $N^{\mathrm{age}}$ or younger account for no more than $N^{\mathrm{age}}(1-\gamma_k)$ of the population. Among the age-$n$ receivers, no more than a $\sum_{\theta \in \Theta} e^{\mathrm{obs}}_{n,\theta}$ fraction has a sample size smaller than $\frac{n\lambda(\theta)}{2}$ for some type $\theta$, while $\sum_{\theta \in \Theta} e^{\mathrm{bias},k}_{n,\theta}$ is an upper bound on the probability (conditional on having a large enough sample) of having a biased enough sample so that some type's empirical frequency of playing $s^*$ differs by $2\epsilon$ or more from $\pi_1^*(s^*|\theta)$.

But since $\gamma_k \in [0,1)$,

$$\sum_{n=0}^{\infty} \gamma_k^n \cdot \sum_{\theta \in \Theta} e^{\mathrm{obs}}_{n,\theta} < \sum_{n=0}^{\infty} \sum_{\theta \in \Theta} e^{\mathrm{obs}}_{n,\theta} \le N^{\mathrm{obs}} \quad \text{and} \quad \sum_{n=0}^{\infty} \gamma_k^n \cdot \sum_{\theta \in \Theta} e^{\mathrm{bias},k}_{n,\theta} < \sum_{n=0}^{\infty} \sum_{\theta \in \Theta} e^{\mathrm{bias},k}_{n,\theta} \le N^{\mathrm{bias}}.$$

We conclude that whenever $k \ge K$,

$$\pi_2^{(k)}(a^*|s^*) \ge 1 - (1-\gamma_k) \cdot (N^{\mathrm{age}} + N^{\mathrm{obs}} + N^{\mathrm{bias}}).$$

Finally, observe that none of $N^{\mathrm{age}}, N^{\mathrm{obs}}, N^{\mathrm{bias}}$ depends on the sequence $\pi^{(k)}$, so $N := N^{\mathrm{age}} + N^{\mathrm{obs}} + N^{\mathrm{bias}}$ is chosen independent of the sequence $\pi^{(k)}$.

Lemma A.2.
Assume $g$ is regular. Suppose there are some $a^* \in A$ and $v \in \mathbb{R}$ so that $u(\theta, s^*, a^*) > v$. Then there exist $C_1 \in (0,1)$, $C_2 > 0$ so that, in every sender history $y_\theta$, $\#(s^*, a^*|y_\theta) \ge C_1 \cdot \#(s^*|y_\theta) + C_2$ implies $\mathbb{E}[u(\theta, s^*, \pi_2(\cdot|s^*)) \,|\, y_\theta] > v$, where $\#(s^*|y_\theta)$ counts the periods of $y_\theta$ in which the sender played $s^*$, and $\#(s^*, a^*|y_\theta)$ counts those in which he played $s^*$ and the receiver responded with $a^*$.

Proof. Write $\underline{u} := \min_{a \in A} u(\theta, s^*, a)$. There exists $q \in (0,1)$ so that

$$q \cdot u(\theta, s^*, a^*) + (1-q) \cdot \underline{u} > v.$$

Find a small enough $\epsilon > 0$ so that $0 < \frac{q}{1-\epsilon} < 1$. Since $g$ is regular, Proposition 1 of Fudenberg, He, and Imhof (2017) tells us there exists some $C$ so that the posterior probability that a sender with history $y_\theta$ assigns to the receiver playing $a^*$ in response to $s^*$ is no less than

$$(1-\epsilon) \cdot \frac{\#(s^*, a^*|y_\theta)}{\#(s^*|y_\theta) + C}.$$

Whenever this probability is at least $q$, the expected payoff to $\theta$ of playing $s^*$ exceeds $v$. That is, it suffices to have

$$(1-\epsilon) \cdot \frac{\#(s^*, a^*|y_\theta)}{\#(s^*|y_\theta) + C} \ge q \iff \#(s^*, a^*|y_\theta) \ge \frac{q}{1-\epsilon} \cdot \#(s^*|y_\theta) + \frac{q}{1-\epsilon} \cdot C.$$

Putting $C_1 := \frac{q}{1-\epsilon}$ and $C_2 := \frac{q}{1-\epsilon} \cdot C$ proves the lemma.

Lemma A.3.
Let $Z_t$ be i.i.d. Bernoulli random variables, where $\mathbb{E}[Z_t] = 1 - \epsilon$. Write $S_n := \sum_{t=1}^n Z_t$. For $0 < C_1 < 1$ and $C_2 > 0$, there exist $\bar{\epsilon}, G_1, G_2 > 0$ such that whenever $0 < \epsilon < \bar{\epsilon}$,

$$\mathbb{P}\left[ S_n \ge C_1 n + C_2 \ \ \forall n \ge G_1 \right] \ge 1 - G_2 \epsilon.$$

Proof. We make use of a lemma from Fudenberg and Levine (2006), which in turn extends some inequalities from Billingsley (1995).
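Alongside this probabilistic input, the argument below rests on an elementary deterministic inequality: with $\bar\epsilon := (1-C_1)/2$ and $G_1 := 2C_2/\bar\epsilon$, every $\epsilon < \bar\epsilon$ and $n \ge G_1$ satisfy $(1-\epsilon-C_1)n - C_2 \ge \bar\epsilon n/2$. A numerical check of this inequality over hypothetical values of $(C_1, C_2)$ (the constants below are arbitrary, chosen only for illustration):

```python
import math

# Deterministic inequality used in Lemma A.3's proof, for hypothetical (C1, C2):
# (1 - eps - C1) * n - C2 >= (eps_bar / 2) * n for all n >= G1 and eps < eps_bar.
C1, C2 = 0.7, 5.0
eps_bar = (1 - C1) / 2
G1 = 2 * C2 / eps_bar
worst = min(
    (1 - eps - C1) * n - C2 - 0.5 * eps_bar * n
    for eps in (0.1 * eps_bar, 0.5 * eps_bar, 0.99 * eps_bar)
    for n in range(math.ceil(G1), math.ceil(G1) + 500)
)
print(worst >= 0)
```

The worst case occurs at $\epsilon$ close to $\bar\epsilon$ and $n$ close to $G_1$, where the margin is smallest but still nonnegative.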
FL06 Lemma A.1: Suppose $\{X_k\}$ is a sequence of i.i.d. Bernoulli random variables with $\mathbb{E}[X_k] = \mu$, and define for each $n$ the random variable $\bar{S}_n := |\sum_{k=1}^n (X_k - \mu)| / n$. Then for any $\underline{n}, \bar{n} \in \mathbb{N}$,

$$\mathbb{P}\left[ \max_{\underline{n} \le n \le \bar{n}} \bar{S}_n > \epsilon \right] \le \frac{\mu}{3 \, \underline{n} \, \epsilon^2}.$$

For every $G_1 > 0$ and $0 < \epsilon < 1$,

$$\mathbb{P}\left[ S_n \ge C_1 n + C_2 \ \forall n \ge G_1 \right] = 1 - \mathbb{P}\left[ (\exists n \ge G_1)\ \sum_{t=1}^n Z_t < C_1 n + C_2 \right] = 1 - \mathbb{P}\left[ (\exists n \ge G_1)\ \sum_{t=1}^n (X_t - \epsilon) > (1 - \epsilon - C_1) n - C_2 \right],$$

where $X_t := 1 - Z_t$. Let $\bar{\epsilon} := \frac{1 - C_1}{2}$ and $G_1 := 2 C_2 / \bar{\epsilon}$. Suppose $0 < \epsilon < \bar{\epsilon}$. Then for every $n \ge G_1$,

$$(1 - \epsilon - C_1) n - C_2 \ge \bar{\epsilon} n - C_2 \ge \frac{1}{2} \bar{\epsilon} n.$$

Hence,

$$\mathbb{P}\left[ S_n \ge C_1 n + C_2 \ \forall n \ge G_1 \right] \ge 1 - \mathbb{P}\left[ (\exists n \ge G_1)\ \sum_{t=1}^n (X_t - \epsilon) > \frac{1}{2} \bar{\epsilon} n \right],$$

and, by FL06 Lemma A.1 (applied with $\mu = \epsilon$ and $\underline{n} = G_1$), the probability on the right-hand side is at most $G_2 \epsilon$ with $G_2 := \frac{4}{3 G_1 \bar{\epsilon}^2}$.

We now prove Theorem 3.
Theorem 3: Suppose $\pi^*$ is on-path strict for the receiver and patiently stable. Then it satisfies the strong compatibility criterion.

Proof.
Let some $a \notin \mathrm{BR}(\Delta(\tilde{J}(s', \pi^*)), s')$ and $h > 0$ be given. We will show that $\pi_2^*(a|s') \le 3h$; since $h > 0$ is arbitrary, this proves the theorem.

Step 1: Defining the constants $\xi$, $\theta_J$, $a_\theta$, $s_\theta$, $C_1$, $C_2$, $G_1$, $G_2$, and $N^{\mathrm{recv}}$.

(i) For each $\xi > 0$, define the $\xi$-approximations to $\Delta(\tilde{J}(s', \pi^*))$ as the probability distributions with weight no more than $\xi$ on types outside of $\tilde{J}(s', \pi^*)$,

$$\Delta^{\xi}(\tilde{J}(s', \pi^*)) := \left\{ p \in \Delta(\Theta) : p(\theta) \le \xi \ \forall \theta \notin \tilde{J}(s', \pi^*) \right\}.$$

Because the best-response correspondence has a closed graph, there exists some $\xi > 0$ such that $a \notin \mathrm{BR}(\Delta^{\xi}(\tilde{J}(s', \pi^*)), s')$.

(ii) Since $\tilde{J}(s', \pi^*)$ is nonempty, we can fix some $\theta_J \in \tilde{J}(s', \pi^*)$.

(iii) For each equilibrium-dominated type $\theta \in \Theta \setminus \tilde{J}(s', \pi^*)$, identify some on-path signal $s_\theta$ so that $\pi_1^*(s_\theta|\theta) > 0$. By the assumption of on-path strictness for the receiver, there is some $a_\theta \in A$ so that $\pi_2^*(a_\theta|s_\theta) = 1$, and furthermore, $a_\theta$ is the strict best response to $s_\theta$ in $\pi^*$. By the definition of equilibrium dominance,

$$u(\theta, s_\theta, a_\theta) > \max_{a \in A} u(\theta, s', a) =: v_\theta.$$

By applying Lemma A.2 to each $\theta \in \Theta \setminus \tilde{J}(s', \pi^*)$, we obtain some $C_1 \in (0,1)$, $C_2 > 0$ so that, for every $\theta \in \Theta \setminus \tilde{J}(s', \pi^*)$ and in every sender history $y_\theta$, $\#(s_\theta, a_\theta|y_\theta) \ge C_1 \cdot \#(s_\theta|y_\theta) + C_2$ implies $\mathbb{E}[u(\theta, s_\theta, \pi_2(\cdot|s_\theta)) \,|\, y_\theta] > v_\theta$.

(iv) By Lemma A.3, find $\bar{\epsilon}, G_1, G_2 > 0$ such that if the $Z_t$ with $\mathbb{E}[Z_t] = 1-\epsilon$ are i.i.d. Bernoulli and $S_n := \sum_{t=1}^n Z_t$, then whenever $0 < \epsilon < \bar{\epsilon}$,

$$\mathbb{P}\left[ S_n \ge C_1 n + C_2 \ \forall n \ge G_1 \right] \ge 1 - G_2 \epsilon.$$

(v) Because at $\pi^*$, $a_\theta$ is a strict best response to $s_\theta$ for every $\theta \in \Theta \setminus \tilde{J}(s', \pi^*)$, from Lemma A.1 we may find an $N^{\mathrm{recv}}$ so that for each sequence $\pi^{(k)} \in \Pi^*(g, \delta_k, \gamma_k)$ where $\gamma_k \to 1$, $\pi^{(k)} \to \pi^*$, there corresponds $K^{\mathrm{recv}} \in \mathbb{N}$ so that $k \ge K^{\mathrm{recv}}$ implies $\pi_2^{(k)}(a_\theta|s_\theta) \ge 1 - (1-\gamma_k) \cdot N^{\mathrm{recv}}$ for every $\theta \in \Theta \setminus \tilde{J}(s', \pi^*)$.

Step 2: Two conditions to ensure that all but $3h$ of the receivers believe in $\Delta^{\xi}(\tilde{J}(s', \pi^*))$.

Consider some steady state $\psi \in \Psi^*(g, \delta, \gamma)$ for $g$ regular, $\delta, \gamma \in [0,1)$, and write $\pi = \sigma(\psi)$. Put

$$c := \frac{\max_{\theta \in \Theta} \lambda(\theta)}{\xi \cdot \lambda(\theta_J)}.$$

Appealing to Theorem 2 of Fudenberg, He, and Imhof (2017) as in the proof of Lemma 3, we conclude that there exists some $N^{\mathrm{rare}}$ (not dependent on $\psi$) such that whenever

$$\pi_1(s'|\theta_J) \ge c \cdot \pi_1(s'|\theta_D) \ \text{ for every equilibrium-dominated type } \theta_D \notin \tilde{J}(s', \pi^*), \quad \text{and} \quad n \cdot \pi_1(s'|\theta_J) \ge N^{\mathrm{rare}}, \tag{7}$$

then an age-$n$ receiver in steady state $\psi$ where $\pi = \sigma(\psi)$ has probability at least $1-h$ of holding a posterior belief $g(\cdot|y)$ such that $\theta_J$ is at least $c$ times as likely to play $s'$ as $\theta_D$ is, for every $\theta_D \notin \tilde{J}(s', \pi^*)$. Such a history $y$ generates a posterior belief after $s'$, $p(\cdot|s'; y)$, with

$$\frac{p(\theta_D|s'; y)}{p(\theta_J|s'; y)} \le \frac{\lambda(\theta_D)}{\lambda(\theta_J)} \cdot \frac{\xi \cdot \lambda(\theta_J)}{\max_{\theta \in \Theta} \lambda(\theta)} \le \xi.$$
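Whether a receiver meets the age requirement in Equation (7) is governed by the geometric age distribution: with survival rate $\gamma$, the share of receivers younger than a cutoff $\bar n$ is $1 - \gamma^{\bar n}$, and for the cutoff of order $h/(1-\gamma)$ used shortly this share tends to $1 - e^{-h} < h$ as $\gamma \to 1$. A quick numerical check of this geometric fact, with arbitrary toy values of $h$ and $\gamma$:

```python
import math

# Share of receivers younger than h / (1 - gamma) under survival rate gamma;
# as gamma approaches 1 this tends to 1 - exp(-h), which is below 2h.
h = 0.05
max_young = max(
    1 - gamma ** math.ceil(h / (1 - gamma)) for gamma in (0.999, 0.9999)
)
print(max_young <= 2 * h)
```

At both values of $\gamma$ the young-receiver share is roughly $1 - e^{-0.05} \approx 0.049$, comfortably below $2h = 0.1$.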
In particular, $p(\cdot|s'; y)$ must assign weight no greater than $\xi$ to each type not in $\tilde{J}(s', \pi^*)$; therefore, the belief belongs to $\Delta^{\xi}(\tilde{J}(s', \pi^*))$. By the construction of $\xi$, $a$ is then not a best response to $s'$ after history $y$.

A receiver whose age $n$ satisfies Equation (7) plays $a$ with probability less than $h$, provided $\pi_1(s'|\theta_J) \ge c \cdot \pi_1(s'|\theta_D)$ for every $\theta_D \notin \tilde{J}(s', \pi^*)$. However, to bound the overall probability of $a$ in the entire receiver population in steady state $\psi$, we ensure that Equation (7) is satisfied for all except a $2h$ fraction of receivers in $\psi$. We claim that when $\gamma$ is large enough, a sufficient condition is for $\pi = \sigma(\psi)$ to satisfy $\pi_1(s'|\theta_J) \ge (1-\gamma) N^*$ for some $N^* \ge N^{\mathrm{rare}}/h$. This is because under this condition, any agent aged $n \ge \frac{h}{1-\gamma}$ satisfies Equation (7), while the fraction of receivers younger than $\frac{h}{1-\gamma}$ is

$$1 - \gamma^{\lceil h/(1-\gamma) \rceil} \le 2h$$

for $\gamma$ near enough to 1.

To summarize, in Step 2 we have found a constant $N^{\mathrm{rare}}$ and shown that if $\gamma$ is near enough to 1, then $\pi = \sigma(\psi)$ has $\pi_2(a|s') \le 3h$ if the following two conditions are satisfied:

(C1) $\pi_1(s'|\theta_J) \ge c \cdot \pi_1(s'|\theta_D)$ for every equilibrium-dominated type $\theta_D \notin \tilde{J}(s', \pi^*)$;

(C2) $\pi_1(s'|\theta_J) \ge (1-\gamma) N^*$ for some $N^* \ge N^{\mathrm{rare}}/h$.

In the following step, we show there is a sequence of steady states $\psi^{(k)} \in \Psi^*(g, \delta_k, \gamma_k)$ with $\delta_k \to 1$, $\gamma_k \to$
1, and $\sigma(\psi^{(k)}) = \pi^{(k)} \to \pi^*$ such that, in every $\pi^{(k)}$, the above two conditions are satisfied. Using the fact that $\gamma_k \to 1$, we conclude that, for large enough $k$, we get $\pi_2^{(k)}(a|s') \le 3h$, which in turn shows $\pi_2^*(a|s') \le 3h$ due to the convergence $\pi^{(k)} \to \pi^*$.

Step 3: Extracting a suitable subsequence of steady states.

In the statement of Lemma 4, put $\theta' := \theta_J$. We obtain some number $\epsilon$ and functions $\bar{\delta}(N)$, $\bar{\gamma}(N, \delta)$. Put

$$N^{\mathrm{ratio}} := \frac{G_2 \cdot N^{\mathrm{recv}}}{\xi} \cdot \frac{\max_{\theta \in \Theta} \lambda(\theta)}{\lambda(\theta_J)}$$

and $N^* := \max(N^{\mathrm{ratio}}, N^{\mathrm{rare}}/h)$.

Since $\pi^*$ is patiently stable, it can be written as the limit of some strategy profiles $\pi^* = \lim_{k \to \infty} \pi^{(k)}$, where each $\pi^{(k)}$ is $\delta_k$-stable with $\delta_k \to 1$. By the definition of $\delta$-stable, each $\pi^{(k)}$ is the limit $\pi^{(k)} = \lim_{j \to \infty} \pi^{(k,j)}$ with $\pi^{(k,j)} \in \Pi^*(g, \delta_k, \gamma_{k,j})$, where $\lim_{j \to \infty} \gamma_{k,j} = 1$. It is without loss to assume that for every $k \ge 1$, $\delta_k \ge \bar{\delta}(N^*)$, and that the $L^1$ distance between $\pi^{(k)}$ and $\pi^*$ is less than $\epsilon/2$. Now, for each $k$, find a large enough index $j(k)$ so that (i) $\gamma_{k,j(k)} \ge \bar{\gamma}(N^*, \delta_k)$, (ii) the $L^1$ distance between $\pi^{(k,j(k))}$ and $\pi^{(k)}$ is less than $\min(\epsilon/2, 1/k)$, and (iii) $\lim_{k \to \infty} \gamma_{k,j(k)} = 1$. This generates a sequence of $k$-indexed steady states, $\psi^{(k,j(k))} \in \Psi^*(g, \delta_k, \gamma_{k,j(k)})$. We will henceforth drop the dependence through the function $j(k)$ and just refer to $\psi^{(k)}$ and $\gamma_k$. The sequence $\psi^{(k)} \in \Psi^*(g, \delta_k, \gamma_k)$ satisfies: (1) $\delta_k \to 1$, $\gamma_k \to 1$