Learning and Type Compatibility in Signaling Games
Drew Fudenberg †    Kevin He ‡
First version: October 12, 2016. This version: June 30, 2018.
Abstract
Which equilibria will arise in signaling games depends on how the receiver interprets deviations from the path of play. We develop a micro-foundation for these off-path beliefs, and an associated equilibrium refinement, in a model where equilibrium arises through non-equilibrium learning by populations of patient and long-lived senders and receivers. In our model, young senders are uncertain about the prevailing distribution of play, so they rationally send out-of-equilibrium signals as experiments to learn about the behavior of the population of receivers. Differences in the payoff functions of the types of senders generate different incentives for these experiments. Using the Gittins index (Gittins, 1979), we characterize which sender types use each signal more often, leading to a constraint on the receiver's off-path beliefs based on "type compatibility" and hence a learning-based equilibrium selection.

∗ This material was previously part of a larger paper titled "Type-Compatible Equilibria in Signaling Games." We thank Dan Clark, Laura Doval, Glenn Ellison, Mira Frick, Ryota Iijima, Lorens Imhof, Yuichiro Kamada, Robert Kleinberg, David K. Levine, Kevin K. Li, Eric Maskin, Dilip Mookherjee, Harry Pei, Matthew Rabin, Bill Sandholm, Lones Smith, Joel Sobel, Philipp Strack, Bruno Strulovici, Tomasz Strzalecki, Jean Tirole, Juuso Toikka, Alex Wolitzky, and four anonymous referees for helpful comments and conversations, and National Science Foundation grant SES 1643517 for financial support.
† Department of Economics, MIT. Email: [email protected]
‡ Department of Economics, Harvard University. Email: [email protected]

Introduction
In a signaling game, a privately informed sender (for instance, a student) observes her type (e.g. ability) and chooses a signal (e.g. education level) that is observed by a receiver (such as an employer), who then picks an action without observing the sender's type. These signaling games can have many perfect Bayesian equilibria, which are supported by different specifications of how the receiver would update his beliefs about the sender's type following the observation of off-path signals that the equilibrium predicts will never occur. These off-path beliefs are not pinned down by Bayes rule, and solution concepts such as perfect Bayesian equilibrium and sequential equilibrium place no restrictions on them. This has led to the development of equilibrium refinements like Cho and Kreps (1987)'s Intuitive Criterion and Banks and Sobel (1987)'s divine equilibrium that reduce the set of equilibria by imposing restrictions on off-path beliefs, using arguments about how players should infer the meaning of observations that the equilibrium says should never occur.

This paper uses a learning model to provide a micro-foundation for restrictions on the off-path beliefs in signaling games, and thus derive restrictions on which Nash equilibria can emerge from learning. Our learning model has a continuum of agents who are randomly matched each period, with a constant inflow of new agents who do not know the prevailing distribution of strategies and a constant outflow of equal size. The large population makes it rational for the agents to ignore repeated-game effects and ensures the aggregate system is deterministic, while turnover in the population lets us analyze learning in a stationary model where social steady states exist, even though individual agents learn.
To give agents adequate learning opportunities, we assume that their expected lifetimes are long, so that most agents in the population live a long time. And to ensure that agents have sufficiently strong incentives to experiment, we suppose that they are very patient. This leads us to analyze what we call the "patiently stable" steady states of our learning model.

Our agents are Bayesians who believe they face a time-invariant distribution of opponents' play. As in much of the learning-in-games literature and most laboratory experiments, these agents only learn from their personal observations and not from sources such as newspapers, parents, or friends. Therefore, patient young senders will rationally try out different signals to see how receivers react. This implies some "off-path" signals that have probability zero in a given equilibrium will occur with small but positive probabilities in the steady states that approximate it, so we can use Bayes rule to derive restrictions on the receivers' typical posterior beliefs following these rare but positive-probability observations. Moreover, differences in the payoff functions of the sender types lead them to experiment in different ways. As a consequence, we can prove that patiently stable steady states must be a subset of the Nash equilibria in which the receiver responds to beliefs about the sender's type that respect a type compatibility condition. This provides a learning-based justification for eliminating certain "unintuitive" equilibria in signaling games. These results also suggest that learning theory could be used to control the rates of off-path play and hence generate equilibrium refinements in other games.

It is interesting to note that Spence (1973) also interpreted equilibria as steady states (or "nontransitory configurations") of a learning process, though he did not explicitly specify what sort of process he had in mind. As we explain in Corollary 1, our main result extends to environments where some fraction of the population has access to data about the play of others.
To give some of the intuition for our general results, we study a particular stage game embedded in an artificially simple learning model, and explain why optimal experimentation rules out a seemingly unappealing equilibrium outcome. Consider the following signaling game: the sender is either the high type θH or the low type θL, both equally likely. The sender chooses between two signals, s ∈ {In, Out}. If the sender plays Out, the game ends and both parties get 0 payoff. If the sender plays In, the receiver then chooses an action a ∈ {Up, Down}. Payoffs following the signal In depend on the sender's type and the receiver's action, as in the following matrix.

signal: In    action: Up    action: Down
type: θH      2, 1          −2, −1
type: θL      1, −1         −3, 1

Both types prefer (In, Up) to Out and prefer Out to (In, Down), while the receiver prefers Up over Down after signal In if he believes there is greater than 1/2 chance that the sender has type θH.

This game has a perfect Bayesian equilibrium (PBE) where both types choose Out and the receiver plays Down after In, sustained by the belief that anyone who sends In has probability p ≤ 1/2 of being θH. This updating requires the receiver to interpret the off-path In as a signal that the sender is more likely to be θL, even though θH gets 1 more utility than θL does from In regardless of the receiver's strategy. So, "both Out" is eliminated by the D1 criterion. (Any receiver play at the off-path signal In that makes it weakly optimal for θL to deviate to In would also make it strictly optimal for θH to deviate. Cho and Kreps (1987)'s D1 criterion therefore requires the receiver to put 0 probability on θ = θL after In. However, the PBE passes their Intuitive Criterion.)

Now suppose there are three infinitely lived agents: θH, θL, and R (for receiver). Suppose that in each period t ∈ {1, 2, 3, ...}, the three agents play a simultaneous-move game, where each sender type θi chooses a signal s_it, and R chooses a single action a_t to use against both of the senders. (This is a deterministic analog of the receiver randomly matching with each type with probability 1/2 without knowing the sender's type.) At the end of period t, R observes the signal choices of both types, while θi observes a_t if and only if s_it = In. That is, each agent only learns from his/her personal experience; by choosing the "outside option" Out, the sender does not learn how the receiver would have responded to signal In that period.

Agents think that each opponent is committed to some mixed strategy of the stage game and plays this strategy each period, regardless of their observations of past play; that is, all agents treat their observations as exchangeable. At t = 1, each type θi is endowed with a Beta(cU, cD) prior about the probability that R responds to In with Up, with cD > cU > 0, so they assign higher probability to Down than to Up. R starts with two independent priors Beta(cHI, cHO) and Beta(cLI, cLO) about the probabilities that θH and θL choose In each period, where we only assume cHI, cHO, cLI, cLO > 0. The independence assumption means that R does not learn about the behavior of one type from the play of the other.

Agents discount payoffs in future periods at rate 0 ≤ δ < 1. Each δ induces a deterministic infinite history of play (s_Ht, s_Lt, a_t) for t = 1, 2, ..., which we denote Y(δ). When δ = 0, the agents play myopically every period, and because of our assumption that cD > cU, both types choose Out in t = 1. They thus gain no information about R's play, do not update their beliefs, and continue playing Out in every future period. So, the unintuitive "both Out" PBE is the learning outcome when agents are sufficiently impatient. However, we can show for all large enough δ that eventually behavior converges to R playing Up and θH playing In each period. We give a sketch of the argument, beginning with characterizing agents' optimal behavior each period. R observes the same information regardless of his play, so he plays myopically under any δ. Let p(h_t) be R's Bayesian posterior belief about the probability that an In sender has type θH, given history h_t. Then a_{t+1} = Up if p(h_t) > 1/2 and a_{t+1} = Down if p(h_t) < 1/2.

Now we turn to θi, whose problem involves active experimentation. Formally, the dynamic optimization problem facing θi is a one-armed Bernoulli bandit. Choosing s_it = Out is equivalent to taking the safe outside option, while choosing s_it = In is equivalent to pulling the risky arm and getting a payoff depending on whether the pull results in a success (a_t = Up) or a failure (a_t = Down). The optimal policy for θi involves the Gittins index (defined later in Equation (2)). Type θi plays In at those histories where In has a positive Gittins index.

Once a type chooses to play Out in some period, she receives no further information and will continue to play Out in all subsequent periods. Denote the period in Y(δ) that θi first switches from In to Out as T(i, δ) ∈ N ∪ {∞}, where T(i, δ) = ∞ means θi plays In forever. The argument that learning eliminates pooling on Out follows from three observations:
Observation 1. The high type switches to Out later than the low type does, that is, T(H, δ) ≥ T(L, δ). To see why, suppose by way of contradiction that T(H, δ) < T(L, δ). Then, in period t = T(H, δ), both θH and θL have played In until now and have seen the same history, so they hold the same belief about R's play. Yet θH chooses Out at this history while θL chooses In, meaning θH has a negative Gittins index for In while θL has a positive one. This is impossible, since θH's payoff from In is always 1 higher than that of θL, so θH's index for In is also always 1 higher than that of θL when the two types have the same belief about R's play.

(In practice, the required patience level is not unreasonably high. When cD = 1. , cU = 1, cHI = cLO = 1, and cHO = cLI = 3, for example, δ = 0 yields the pathological PBE as the long-run outcome, but when δ ≥ 0.92 the long-run outcome involves s_Ht = In and a_t = Up.)

Observation 2. As the high type becomes patient, she experiments with In arbitrarily many times, that is, lim_{δ→1} T(H, δ) = ∞. This follows because for any fixed full-support prior belief of θH about R's mixed strategy, the Gittins index for In stays close to the "success payoff" of 2 for a length of time that grows to infinity as δ → 1, even in the worst case where R plays Down in every period.
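Observations 1 and 2 rest on two properties of the Gittins index of a Bernoulli arm with a Beta prior: shifting the arm's payoffs by a constant shifts the index by the same constant (so θH's index always exceeds θL's by exactly 1 when beliefs agree), and the index grows with the discount factor, reducing to the myopic expected payoff at β = 0. The following sketch computes the index by bisection on a per-period retirement charge; the code and the particular payoff values are our own illustration, not part of the paper's formal apparatus:

```python
from functools import lru_cache

def gittins_index(a, b, u_succ, u_fail, beta, horizon=60, tol=1e-9):
    """Gittins index of a one-armed Bernoulli bandit: payoff u_succ on a
    success and u_fail on a failure, Beta(a, b) prior over the success
    probability, discount factor beta. Computed as the per-period charge m
    at which one more pull becomes exactly worthwhile (truncated DP)."""
    @lru_cache(maxsize=None)
    def value(k, n, depth, m):
        # k successes and n failures observed so far; retiring yields 0.
        if depth == 0:
            return 0.0
        p = (a + k) / (a + b + k + n)          # posterior mean of success prob.
        pull = p * u_succ + (1 - p) * u_fail - m \
            + beta * (p * value(k + 1, n, depth - 1, m)
                      + (1 - p) * value(k, n + 1, depth - 1, m))
        return max(0.0, pull)

    lo, hi = min(u_succ, u_fail), max(u_succ, u_fail)
    while hi - lo > tol:
        m = (lo + hi) / 2
        if value(0, 0, horizon, m) > 0:
            lo = m                              # charge too small: still pull
        else:
            hi = m
        value.cache_clear()
    return (lo + hi) / 2
```

Here `gittins_index(a, b, u_succ, u_fail, beta)` treats In as an arm paying u_succ on Up and u_fail on Down, with a Beta(a, b) belief about the probability of Up. At β = 0 the index is just the posterior mean payoff, which is negative under a prior weighted toward Down, matching the impatient agents' pooling on Out.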
Observation 3. If the high type plays In sufficiently many times and more often than the low type does, then eventually R will believe that In senders have greater than 1/2 chance of being θH; that is, there exists N̄ ∈ N so that p(h_T) > 1/2 for any history h_T where (i) θH played In at least N̄ times and (ii) θL played In no more than θH did. This follows from the fact that R's belief about θi's play after n_iI instances of In and n_iO instances of Out is Beta(c_iI + n_iI, c_iO + n_iO).

From Observation 2, we see that T(H, δ) is larger than the N̄ of Observation 3 when δ is sufficiently large. The history up to period t for any t ≥ N̄ will therefore contain at least N̄ periods of θH playing In (namely, the very first N̄ periods of the game), and by Observation 1 θL will have played In no more than θH did in this history. So by Observation 3, p(h_t) > 1/2 for t ≥ N̄, meaning a_t = Up for t ≥ N̄. Since s_Ht = In for all t ≤ N̄ and observing Up increases the Gittins index of In, the high type must always play In. This means lim_{t→∞} s_Ht = In and lim_{t→∞} a_t = Up for large δ < 1.

The simple learning model above makes Out an absorbing state and, together with the assumption of Beta priors, lets us explicitly calculate how the system evolves. This paper's focus is on general signaling games embedded in a learning model with large populations and anonymous random matching, eliminating repeated-game effects. We focus on steady states of the model, where the agents' stationarity assumption is satisfied. Also, we relax the Beta prior assumption and allow learners to have fairly general non-doctrinaire priors. Many results about the steady-state model, however, have analogs in the simple model above.

Intuitively, θH is "more compatible" with signal In than θL. Definition 2 formalizes this relation in general signaling games. Observation 1 corresponds to Lemma 2, which shows that whenever one type is more compatible than another with a signal, the more compatible type sends the signal more often.
Observation 2 corresponds to Lemma 4, which says a sufficiently patient and long-lived sender type will experiment many times with all signals that have the potential to strictly improve that type's equilibrium payoff. Observation 3 corresponds to Lemma 3, which says receivers can eventually learn the compatibility relation associated with each signal, provided senders' play respects the relation and the more compatible type experiments enough with the signal. Lemmas 2, 3, and 4 are combined to prove the main result of the paper (Theorem 2), a learning-based refinement in general signaling games.

Section 2 lays out the notation we will use for signaling games and introduces our learning model. Section 3 introduces the Gittins index, which we use to analyze the senders' learning problem. It also defines type compatibility, which is a partial order that drives our results. We say that type θ′ is more type-compatible with signal s than type θ″ if, whenever s is a weak best response for θ″ against some receiver behavior strategy, it is a strict best response for θ′ against the same strategy. To relate this static definition to the senders' optimal dynamic learning behavior, we show that, under our assumptions, the senders' learning problem is formally a multi-armed bandit, so the optimal policy of each type is characterized by the Gittins index. Theorem 1 shows that the compatibility order on types is equivalent to an order on their Gittins indices: θ′ is more type-compatible with signal s than type θ″ if and only if, whenever s has the (weakly) highest Gittins index for θ″, it has the strictly highest index for θ′, provided the two types hold the same beliefs and have the same discount factor.

Section 4 studies the aggregate behavior of the sender and receiver populations. There we define and characterize the aggregate responses of the senders and of the receivers, which are the analogs of the best-response functions in the one-shot signaling game.
First, we use a coupling argument to extend Theorem 1 to the aggregate sender behavior, proving that types who are more compatible with a signal send it more often in aggregate (Lemma 2). Then we turn to the receivers. Intuitively, we would expect that when receivers are long-lived, most of them will have beliefs that respect type compatibility, and we show that this is the case. More precisely, we show that most receivers best respond to a posterior belief whose likelihood ratio of θ′ to θ″ dominates the prior likelihood ratio of these two types whenever they observe a signal s which is more type-compatible with θ′ than θ″. Lemma 3 shows this is true for any signal that is sent "frequently enough" relative to the receivers' expected lifespan, using a result of Fudenberg, He, and Imhof (2017) on updating posteriors after rare events.

Finally, Section 5 combines the earlier results to characterize the steady states of the learning model, which can be viewed as pairs of mutual aggregate responses, analogous to the definition of Nash equilibrium. We start by proving Lemma 4, which shows that any signal that is not weakly equilibrium dominated (see Definition 11) gets sent "frequently enough" in steady state when senders are sufficiently patient and long-lived. Combining the three lemmas discussed above, we establish our main result: any patiently stable steady state must be a Nash equilibrium satisfying the additional restriction that the receivers best respond to certain admissible beliefs after every off-path signal (Theorem 2).

As an example, consider Cho and Kreps (1987)'s beer-quiche game, where it is easy to verify that the strong type is more compatible with Beer than the weak type.
Our results imply that the strong types will in aggregate send this signal at least as often as the weak types do, and that a very patient strong type will experiment with it "many times." As a consequence, when senders are patient, long-lived receivers are unlikely to revise the probability of the strong type downwards following an observation of Beer. Thus, the "both types eat quiche" equilibrium is not a patiently stable steady state of the learning model, as it would require receivers to interpret Beer as a signal that the sender is weak.

Finally, Theorem 3 provides a stronger implication of patient stability in generic pure-strategy equilibria, showing that off-path beliefs must assign probability zero to types that are equilibrium dominated in the sense of Cho and Kreps (1987).
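The receiver-side updating that drives these conclusions (Observation 3 in the introductory example) is just Beta-Binomial conjugacy. A minimal sketch, using the prior parameters cHI = cLO = 1, cHO = cLI = 3 from the footnote example; the play counts passed in below are hypothetical, chosen only for illustration:

```python
def receiver_posterior(n_HI, n_HO, n_LI, n_LO,
                       c_HI=1.0, c_HO=3.0, c_LI=3.0, c_LO=1.0,
                       prior_H=0.5):
    """Receiver's posterior probability that an In sender has type θH,
    given independent Beta priors on each type's In-probability and
    observed counts of In/Out play by each type."""
    # Posterior-predictive probability that each type plays In next period.
    pred_H = (c_HI + n_HI) / (c_HI + c_HO + n_HI + n_HO)
    pred_L = (c_LI + n_LI) / (c_LI + c_LO + n_LI + n_LO)
    # Bayes rule over types, conditional on observing the signal In.
    return prior_H * pred_H / (prior_H * pred_H + (1 - prior_H) * pred_L)
```

With no observations the posterior is 0.25, so the receiver plays Down; once θH has accumulated many In plays while θL has mostly switched to Out, the posterior crosses 1/2 and the receiver switches to Up.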
Fudenberg and Kreps (1988, 1994, 1995) pointed out that experimentation plays an important role in determining learning outcomes in extensive-form games. As in Fudenberg and Kreps (1993), they studied a model with a single infinitely-lived and strategically myopic agent in each player role who acts as if the opponent's play is stationary. Because these models involved accumulating information over time, they did not have steady states. Our work is closer to that of Fudenberg and Levine (1993) and Fudenberg and Levine (2006), which also studied learning by Bayesian agents in a large population who believe that society is in a steady state. A key issue in this work, and more generally in studying learning in extensive-form games, is characterizing how much agents will experiment with myopically suboptimal actions. If agents do not experiment at all, then non-Nash equilibria can persist, because players can maintain incorrect but self-confirming beliefs about off-path play. Fudenberg and Levine (1993) showed that patient long-lived agents will experiment enough at their on-path information sets to learn if they have any profitable deviations, thus ruling out steady states that are not Nash equilibria. However, more experimentation than that is needed for learning to generate the sharper predictions associated with backward induction and sequential equilibrium. Fudenberg and Levine (2006) showed that patient rational agents need not do enough experimentation to imply backwards induction in games of perfect information. Later on, we say more about how the models and proofs of those papers differ from ours.

This paper is also related to the Bayesian learning models of Kalai and Lehrer (1993), which studied two-player games with one agent on each side, so that every self-confirming equilibrium is path-equivalent to a Nash equilibrium, and Esponda and Pouzo (2016), which allowed agents to experiment but did not characterize when and how this occurs. It is also related to the literature on boundedly rational experimentation in extensive-form games (e.g. Jehiel and Samet (2005), Laslier and Walliser (2015)), where the experimentation rules of the agents are exogenously specified. We assume that each sender's type is fixed at birth, as opposed to being i.i.d. over time. Dekel, Fudenberg, and Levine (2004) showed some of the differences this can make using various equilibrium concepts, but they did not develop an explicit model of non-equilibrium learning.
It is also related to the literatureon boundedly rational experimentation in extensive-form games (e.g. Jehiel and Samet (2005),Laslier and Walliser (2015)), where the experimentation rules of the agents are exogenouslyspecified. We assume that each sender’s type is fixed at birth, as opposed to being i.i.d. overtime. Dekel, Fudenberg, and Levine (2004) showed some of the differences this can make usingvarious equilibrium concepts, but they did not develop an explicit model of non-equilibriumlearning. 6or simplicity, we assume here that agents do not know the payoffs of other players and havefull support priors over the opposing side’s behavior strategies. Our companion paper Fudenbergand He (2017) supposed that players assign zero probability to dominated strategies of theiropponents, as in the Intuitive Criterion (Cho and Kreps, 1987), divine equilibrium (Banks andSobel, 1987), and rationalizable self-confirming equilibrium (Dekel, Fudenberg, and Levine, 1999).There, we analyzed how the resulting micro-founded equilibrium refinement compares to thosein past work. A signaling game has two players, a sender (player 1, “she”) and a receiver (player 2, “he”). Thesender’s type is drawn from a finite set Θ according to a prior λ ∈ ∆(Θ) with λ ( θ ) > θ . There is a finite set S of signals for the sender and a finite set A of actions for the receiver. The utility functions of the sender and receiver are u : Θ × S × A → R and u : Θ × S × A → R respectively.When the game is played, the sender knows her type and sends a signal s ∈ S to the receiver.The receiver observes the signal, then responds with an action a ∈ A . Finally, payoffs are realized.A behavior strategy for the sender π = ( π ( ·| θ )) θ ∈ Θ is a type-contingent mixture over signals S . Write Π for the set of all sender behavior strategies.A behavior strategy for the receiver π = ( π ( ·| s )) s ∈ S is a signal-contingent mixture overactions A . Write Π for the set of all receiver behavior strategies. 
We now build a learning model with a given signaling game as the stage game. In this subsection, we explain an individual agent's learning problem. In the next subsection, we complete the learning model by describing a society of learning agents who are randomly matched to play the signaling game every period.

Time is discrete and all agents are rational Bayesians with geometrically distributed lifetimes. They survive between periods with probability 0 ≤ γ < 1 and discount future payoffs at rate 0 ≤ δ < 1, so their objective is to maximize the expected value of Σ_{t=0}^∞ (γδ)^t · u_t, where 0 ≤ γδ < 1 and u_t is the payoff t periods from today.

At birth, each agent is assigned a role in the signaling game: either as a sender with type θ or as a receiver. Agents know their role, which is fixed for life. Every period, each agent is randomly matched with an opponent and plays the signaling game. Agents update their beliefs and play the signaling game again with new random opponents next period, provided they are still alive.

(Here and subsequently, ∆(X) denotes the collection of probability distributions on the set X. To lighten notation, we assume that the same set of actions is feasible following any signal. This is without loss of generality for our results, as we could let the receiver have very negative payoffs when he responds to a signal with an "impossible" action.)

Agents believe they face a fixed but unknown distribution of opponents' aggregate play, so they believe that their observations will be exchangeable. We feel that this is a plausible first hypothesis in many situations, so we expect that agents will maintain their belief in stationarity when it is approximately correct, but will reject it given clear evidence to the contrary, as when there is a strong time trend or a high-frequency cycle. The environment will indeed be constant in the steady states that we analyze.

Formally, each sender is born with a prior density function over the aggregate behavior strategy of the receivers, g1 : Π2 → R+, which integrates to 1. Similarly, each receiver is born with a prior density over the senders' behavior strategies, g2 : Π1 → R+. We denote the marginal distribution of g1 on signal s as g1^(s), so that g1^(s)(π2(·|s)) is the density of the new senders' prior over how receivers respond to signal s. Similarly, we denote the θ marginal of g2 as g2^(θ), so that g2^(θ)(π1(·|θ)) is the new receivers' prior density over π1(·|θ) ∈ ∆(S).

It is important to remember that g1 and g2 are beliefs over opponents' strategies, but not strategies themselves. A new sender expects the response to s to be ∫ π2(·|s) · g1(π2) dπ2, while a new receiver expects type θ to play ∫ π1(·|θ) · g2(π1) dπ1.

We now state a regularity assumption on the agents' priors that will be maintained throughout.

Definition 1.
A prior g = (g1, g2) is regular if:

(i) [independence] g1(π2) = ∏_{s∈S} g1^(s)(π2(·|s)) and g2(π1) = ∏_{θ∈Θ} g2^(θ)(π1(·|θ)).

(ii) [g1 non-doctrinaire] g1 is continuous and strictly positive on the interior of Π2.

(iii) [g2 nice] for each type θ, there are positive constants (α_s^(θ))_{s∈S} such that

π1(·|θ) ↦ g2^(θ)(π1(·|θ)) / ∏_{s∈S} π1(s|θ)^{α_s^(θ) − 1}

is uniformly continuous and bounded away from zero on the relative interior of Π1^(θ), the set of behavior strategies of type θ.

(The receiver's payoff reveals the sender's type for generic assignments of payoffs to terminal nodes. If the receiver's payoff function is independent of the sender's type, his beliefs about it are irrelevant. If the receiver does care about the sender's type but observes neither the sender's type nor his own realized payoff, a great many outcomes can persist, as in Dekel, Fudenberg, and Levine (2004). Note that the agent's prior belief is over opponents' aggregate play (i.e. Π1 or Π2) and not over the prevailing distribution of behavior strategies in the opponent population (i.e. ∆(Π1) or ∆(Π2)), since under our assumption of anonymous random matching, these are observationally equivalent for our agents. For instance, a receiver cannot distinguish between a society where all type θ randomize 50-50 between signals s′ and s″ each period, and another society where half of the type θ always play s′ while the other half always plays s″. Note also that because agents believe the system is in a steady state, they do not care about calendar time and do not have beliefs about it. Fudenberg and Kreps (1994) suppose that agents append a non-Bayesian statistical test of whether their observations are exchangeable to a Bayesian model that presumes exchangeability.)

Independence ensures that a receiver does not learn how type θ plays by observing the behavior of some other type θ′ ≠ θ, and that a sender does not learn how receivers react to signal s by experimenting with some other signal s′ ≠ s. For example, this means in Cho and Kreps (1987)'s beer-quiche game that the sender does not learn how receivers respond to beer by eating quiche. The non-doctrinaire nature of g1 and g2 implies that the agents never see an observation that they assigned zero prior probability, so that they have a well-defined optimization problem after any history. Non-doctrinaire priors also imply that a large enough data set can outweigh prior beliefs (Diaconis and Freedman, 1990). The niceness assumption in (iii) ensures that g2 behaves like a power function near the boundary of Π1. Any density that is strictly positive on Π1 satisfies this condition, as does the Dirichlet distribution, which is the prior associated with fictitious play (Fudenberg and Kreps, 1993).

The set of histories for an age t sender of type θ is Y_θ[t] := (S × A)^t, where each period the history records the signal sent and the action that her receiver opponent took in response. The set of all histories for a type θ is the union Y_θ := ∪_{t=0}^∞ Y_θ[t]. The dynamic optimization problem of type θ has an optimal policy function σ_θ : Y_θ → S, where σ_θ(y_θ) is the signal that a type θ with history y_θ would send the next time she plays the signaling game. Analogously, the set of histories for an age t receiver is Y2[t] := (Θ × S)^t, where each period the history records the type of his sender opponent and the signal that she sent.
The set of all receiver histories is the union Y2 := ∪_{t=0}^∞ Y2[t]. The receiver's learning problem admits an optimal policy function σ2 : Y2 → A^S, where σ2(y2) is the pure strategy that a receiver with history y2 would commit to the next time he plays the game.

We analyze learning in a deterministic stationary model with a continuum of agents, as in Fudenberg and Levine (1993, 2006). One innovation is that we let lifetimes follow a geometric distribution. The society contains a unit mass of agents in the role of the receiver and a mass λ(θ) in the role of type θ for each θ ∈ Θ. As described in Subsection 2.2, each agent has survival probability 0 ≤ γ < 1, hence probability 1 − γ of dying, each period. To preserve population sizes, (1 − γ) new receivers and λ(θ)(1 − γ) new type θ senders are born into the society every period.

(One could imagine learning environments where the senders believe that the responses to various signals are correlated, but independence is a natural special case. Because our agents are expected-utility maximizers, it is without loss of generality to assume each agent uses a deterministic policy rule. If more than one such rule exists, we fix one arbitrarily. Of course, the optimal policies σ_θ and σ2 depend on the prior g as well as the effective discount factor δγ. Where no confusion arises, we suppress these dependencies.)

Each period, agents in the society are matched uniformly at random to play the signaling game. In the spirit of the law of large numbers, each sender has probability (1 − γ)γ^t of matching with a receiver of age t, while each receiver has probability λ(θ)(1 − γ)γ^t of matching with a type θ of age t.

A state ψ of the learning model is described by the mass of agents with each possible history. We write it as ψ ∈ (×_{θ∈Θ} ∆(Y_θ)) × ∆(Y2). We refer to the components of a state ψ by ψ_θ ∈ ∆(Y_θ) and ψ2 ∈ ∆(Y2). Given the agents' optimal policies, each possible history for an agent completely determines how that agent will play in their next match.
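The demographic bookkeeping can be verified directly: with survival probability γ and an inflow of mass 1 − γ per period, the age distribution (1 − γ)γ^t reproduces itself, so the matching probabilities above are time-invariant. A small self-check (our own illustration, with γ = 0.9 and ages truncated at 200 only for the computation):

```python
def age_distribution_step(dist, gamma):
    """One period of the demographic dynamics: survivors age by one period,
    and a mass (1 - gamma) of newborns enters at age 0. `dist` maps each
    age t to the population mass of agents of that age."""
    new_dist = {0: 1.0 - gamma}
    for age, mass in dist.items():
        new_dist[age + 1] = gamma * mass   # each agent survives w.p. gamma
    return new_dist

gamma = 0.9
T = 200                                     # truncation for the numeric check
steady = {t: (1 - gamma) * gamma ** t for t in range(T)}
next_dist = age_distribution_step(steady, gamma)
```

One step of aging, death, and births leaves the geometric distribution unchanged, and its mean age γ/(1 − γ) confirms the interpretation of large γ as long expected lifetimes.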
The sender policy functions σ_θ are maps from sender histories to signals, so they naturally extend to maps from distributions over sender histories to distributions over signals. That is, given the policy function σ_θ, each state ψ induces an aggregate behavior strategy σ_θ(ψ_θ) ∈ ∆(S) for each type θ population, where we extend the domain of σ_θ from Y_θ to ∆(Y_θ) in the natural way:

σ_θ(ψ_θ)(s) := ψ_θ{y_θ ∈ Y_θ : σ_θ(y_θ) = s}.    (1)

Similarly, state ψ and the optimal receiver policy σ2 together induce an aggregate behavior strategy σ2(ψ2) for the receiver population, where

σ2(ψ2)(a|s) := ψ2{y2 ∈ Y2 : σ2(y2)(s) = a}.

We will study the steady states of this learning model, to be defined more precisely in Section 5. Loosely speaking, a steady state is a state ψ that reproduces itself indefinitely when agents use their optimal policies. Put another way, a steady state induces a time-invariant distribution over how the signaling game is played in the society. Suppose society is at steady state today and we measure what fraction of type θ sent a certain signal s in today's matches. After all agents modify their strategies based on their updated beliefs and all births and deaths take place, the fraction of type θ playing s in the matches tomorrow will be the same as today. (Remember that we have fixed deterministic policy functions.)

Senders' Optimal Policies and Type Compatibility
This section studies the senders' learning problem. We will prove that differences in the payoff structures of the various sender types generate certain restrictions on their behavior in the learning model. Subsection 3.1 notes that the senders face a multi-armed bandit, so the Gittins index characterizes their optimal policies, and shows how to relate the Gittins index of a signal to the expected sender payoff versus a particular mixed strategy of the receiver. In Subsection 3.2, we define type compatibility, which formalizes what it means for type θ′ to be more "compatible" with a given signal s than type θ″ is. The definition of type compatibility is static, in the sense that it depends only on the two types' payoff functions in the one-shot signaling game. Subsection 3.3 relates type compatibility to the Gittins index, which applies to the dynamic learning model. Lemma 2 in Section 4 uses this relationship to show that if type θ′ is more compatible with signal s than type θ″, then faced with any fixed distribution of receiver play, the type θ′ population sends s more often in the aggregate than the type θ″ population does.

Each type θ sender thinks she is facing a fixed but unknown aggregate receiver behavior strategy π2, so each period when she sends signal s, she believes that the response is drawn from some π2(·|s) ∈ ∆(A), i.i.d. across periods. Because her beliefs about the responses to the various signals are independent, her problem is equivalent to a discounted multi-armed bandit, with signals s ∈ S as the arms, where the rewards of arm s are distributed according to u1(θ, s, π2(·|s)).

Let ν_s ∈ ∆(∆(A)) be a belief over the space of mixed replies to signal s, and let ν = (ν_s)_{s∈S} be a profile of such beliefs.
Write I(θ, s, ν, β) for the Gittins index of signal s for type θ, with beliefs ν over the receiver's play after the various signals and with effective discount factor β = δγ, so that

I(θ, s, ν, β) := sup_{τ>0} E_{ν_s}{ Σ_{t=0}^{τ−1} β^t · u₁(θ, s, a_s(t)) } / E_{ν_s}{ Σ_{t=0}^{τ−1} β^t }.   (2)

Here a_s(t) is the receiver's response that the sender observes the t-th time she sends signal s, τ is a stopping time (that is, whether or not τ = t depends only on the realizations of a_s(0), a_s(1), ..., a_s(t−1)), and the expectation E_{ν_s} over the sequence of responses {a_s(t)}_{t≥0} depends on the sender's belief ν_s about responses to signal s.

The Gittins index can be interpreted as the value of an auxiliary optimization problem, where type θ chooses each period to either send signal s and obtain a payoff according to a random receiver action drawn according to π₂(·|s), or to stop forever. The objective of the auxiliary problem is to maximize the per-period expected discounted payoff until stopping: the numerator of Equation (2) is the expected discounted sum of payoffs until stopping, while the denominator is the expected discounted number of periods until stopping.

The Gittins index theorem (Gittins, 1979) implies that after every positive-probability history y_θ, the optimal policy σ_θ for a sender of type θ sends the signal that has the highest Gittins index at the belief ν = (ν_s)_{s∈S} induced by y_θ. Importantly, we can reformulate the objective function defining the Gittins index in Equation (2), linking it to the one-shot signaling game payoff structure.

Lemma 1.
For every signal s, stopping time τ, belief ν_s, and discount factor β, there exists π̃_{2,s}(τ, ν_s, β) ∈ ∆(A) so that for every θ,

E_{ν_s}{ Σ_{t=0}^{τ−1} β^t · u₁(θ, s, a_s(t)) } / E_{ν_s}{ Σ_{t=0}^{τ−1} β^t } = u₁(θ, s, π̃_{2,s}(τ, ν_s, β)).

That is to say, when the stopping problem in Equation (2) is evaluated at an arbitrary stopping time τ, the payoff is equal to the sender's expected utility from playing s against the receiver strategy π̃_{2,s}(τ, ν_s, β) in the one-shot signaling game.

The proof of Lemma 1 is in Appendix A.2 and shows how to construct π̃_{2,s}(τ, ν_s, β), which can be interpreted as a discounted time average over the receiver actions that are observed before stopping. To illustrate the construction, suppose ν_s is supported on two pure receiver strategies after s: either π₂(a′|s) = 1 or π₂(a″|s) = 1, with both strategies equally likely. Suppose also u₁(θ, s, a′) > u₁(θ, s, a″). Consider the stopping time τ that specifies stopping after the first time the receiver plays a″. Then the discounted time-average frequency of a′ is:

Σ_{t=0}^∞ β^t · P_{ν_s}[τ > t and receiver plays a′ in period t] / Σ_{t=0}^∞ β^t · P_{ν_s}[τ > t] = (0.5/(1−β)) / (1 + Σ_{t=1}^∞ β^t · 0.5) = 1/(2−β).

So π̃_{2,s}(τ, ν_s, β)(a′) = 1/(2−β), and similarly we can calculate that π̃_{2,s}(τ, ν_s, β)(a″) = (1−β)/(2−β), which shows that π̃_{2,s} indeed corresponds to a mixture over receiver actions for each β. As β → 1, this mixture converges to the pure strategy of always playing a′, so u₁(θ, s, π̃_{2,s}(τ, ν_s, β)) converges to u₁(θ, s, a′), the highest possible payoff for type θ after s; this parallels the fact that as β tends to 1, the Gittins index for θ after s converges to the highest payoff in the support of the belief ν_s.

We now introduce a notion of the comparative compatibility of two types with a given signal in the one-shot signaling game.
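Before turning to that definition, the closed form just derived admits a quick numerical check (a standalone sketch; the truncation horizon is an implementation choice, not part of the model):

```python
# Verify the discounted time-average frequency of a' for the two-point belief
# nu_s = {always a' w.p. 1/2, always a'' w.p. 1/2} and the stopping time
# "stop after the first a''". The formula above gives 1/(2 - beta).

def discounted_frequency_a_prime(beta, horizon=10_000):
    # P[tau > t and a' played at t] = 1/2 for every t (the "always a'" world);
    # P[tau > t] = 1 at t = 0 and 1/2 for t >= 1 (we continue past period 0
    # only in the "always a'" world).
    num = sum(beta**t * 0.5 for t in range(horizon))
    den = 1.0 + sum(beta**t * 0.5 for t in range(1, horizon))
    return num / den

for beta in [0.0, 0.5, 0.9, 0.99]:
    assert abs(discounted_frequency_a_prime(beta) - 1.0 / (2.0 - beta)) < 1e-6

# As beta -> 1, the mixture puts almost all weight on a'.
assert discounted_frequency_a_prime(0.999) > 0.99
```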
Definition 2. Signal s′ is more type-compatible with θ′ than θ, written as θ′ ≻_{s′} θ, if for every π₂ ∈ Π₂ such that

u₁(θ, s′, π₂(·|s′)) ≥ max_{s″≠s′} u₁(θ, s″, π₂(·|s″)),

we have

u₁(θ′, s′, π₂(·|s′)) > max_{s″≠s′} u₁(θ′, s″, π₂(·|s″)).

In words, θ′ ≻_{s′} θ means that whenever s′ is a weak best response for θ against some receiver behavior strategy π₂, it is also a strict best response for θ′ against π₂. The following proposition says the compatibility order is transitive and essentially asymmetric. Its proof is in Appendix A.1.

Proposition 1. (i) ≻_{s′} is transitive. (ii) Except when s′ is either strictly dominant for both θ′ and θ or strictly dominated for both θ′ and θ, θ′ ≻_{s′} θ implies θ ⊁_{s′} θ′.

To check the compatibility condition, one must consider all strategies in Π₂, just as the belief restrictions in divine equilibrium involve all the possible mixed best responses to various beliefs. However, when the sender's utility function is separable in the sense that u₁(θ, s, a) = v(θ, s) + z(a), as in Spence (1973)'s job-market signaling game and in Cho and Kreps (1987)'s beer-quiche game (given below), a sufficient condition for θ′ ≻_{s′} θ is

v(θ′, s′) − v(θ, s′) > max_{s″≠s′} [v(θ′, s″) − v(θ, s″)].

This can be interpreted as saying s′ is the least costly signal for θ′ relative to θ. In the Online Appendix, we present a general sufficient condition for θ′ ≻_{s′} θ under general payoff functions.

Example 1. (Cho and Kreps (1987)'s beer-quiche game) The sender (P1) is either strong (θ_strong) or weak (θ_weak), with prior probability λ(θ_strong) = 0.9. The sender chooses to either drink Beer or eat Quiche for breakfast. The receiver (P2), observing this breakfast choice but not the sender's type, chooses whether to Fight the sender. If the sender is θ_weak, the receiver prefers to Fight. If the sender is θ_strong, the receiver prefers to NotFight. Also, θ_strong prefers Beer for breakfast while θ_weak prefers Quiche for breakfast. Both types prefer not being fought over having their favorite breakfast.

This game has separable sender utility with v(θ_strong, Beer) = v(θ_weak, Quiche) = 1, v = 0 otherwise, z(Fight) = 0, and z(NotFight) = 2. So, we have θ_strong ≻_Beer θ_weak. ◊

It is easy to see that in every Nash equilibrium π∗, if θ′ ≻_{s′} θ, then π∗(s′|θ) > 0 implies π∗(s′|θ′) = 1. By Bayes rule, this implies that the receiver's equilibrium belief p after every on-path signal s′ satisfies the restriction p(θ|s′)/p(θ′|s′) ≤ λ(θ)/λ(θ′) if θ′ ≻_{s′} θ. Thus in every Nash equilibrium of the beer-quiche game, if the sender chooses Beer with positive ex ante probability, then the receiver's odds ratio that the sender is strong after seeing this signal cannot be less than the prior odds ratio. Our main result, Theorem 2, essentially shows for any strategy profile that can be approximated by steady-state outcomes with patient and long-lived agents, that the same compatibility-based restriction is satisfied even for off-path signals. In particular, this allows us to place restrictions on the receiver's belief after seeing Beer in equilibria where no type of sender ever plays this signal.
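For concreteness, the compatibility ranking in Example 1 can be verified by brute force over a grid of receiver mixed replies (a minimal sketch; the grid parameterization by NotFight probabilities and the payoff encoding follow the separable specification above):

```python
import itertools

# Beer-quiche payoffs from Example 1: u1(theta, s, a) = v(theta, s) + z(a),
# with v(strong, Beer) = v(weak, Quiche) = 1, v = 0 otherwise,
# z(Fight) = 0, z(NotFight) = 2.
def v(theta, s):
    return 1.0 if (theta, s) in {("strong", "Beer"), ("weak", "Quiche")} else 0.0

def u1(theta, s, p_notfight):
    # p_notfight: probability the receiver plays NotFight after signal s
    return v(theta, s) + 2.0 * p_notfight

# Definition 2 on a grid of receiver strategies (NotFight prob. after Beer,
# NotFight prob. after Quiche): whenever Beer is a weak best response for
# theta_weak, it must be a strict best response for theta_strong.
grid = [k / 100 for k in range(101)]
for pB, pQ in itertools.product(grid, grid):
    if u1("weak", "Beer", pB) >= u1("weak", "Quiche", pQ):
        assert u1("strong", "Beer", pB) > u1("strong", "Quiche", pQ)

# Separable sufficient condition: Beer is the least costly signal for
# theta_strong relative to theta_weak.
assert (v("strong", "Beer") - v("weak", "Beer")
        > v("strong", "Quiche") - v("weak", "Quiche"))
print("theta_strong is more type-compatible with Beer than theta_weak")
```

With only two signals, the max over s″ ≠ s′ reduces to a single comparison, which keeps the check a two-dimensional grid scan.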
We now connect the type compatibility order for a given signal with the associated Gittins indices.

Theorem 1. θ′ ≻_{s′} θ if and only if for every β ∈ [0, 1) and every profile of beliefs ν,

I(θ, s′, ν, β) ≥ max_{s″≠s′} I(θ, s″, ν, β) implies I(θ′, s′, ν, β) > max_{s″≠s′} I(θ′, s″, ν, β).

That is, θ′ ≻_{s′} θ if and only if whenever s′ has the (weakly) highest Gittins index for θ, it has the strictly highest index for θ′, provided the two types hold the same beliefs and have the same discount factor. The proof involves reformulating the Gittins index as in Lemma 1, then applying the compatibility definition.

Proof.
Step 1: Only If. Suppose θ′ ≻_{s′} θ and fix some β ∈ [0, 1) and prior belief ν. Suppose I(θ, s′, ν, β) ≥ max_{s″≠s′} I(θ, s″, ν, β). We show that I(θ′, s′, ν, β) > max_{s″≠s′} I(θ′, s″, ν, β). Write τ^θ_s for a stopping time attaining the supremum in Equation (2) for type θ on arm s.

On any arm s″ ≠ s′, type θ could use the (for her suboptimal) stopping time τ^{θ′}_{s″}, which by Lemma 1 yields an expected per-period payoff of u₁(θ, s″, π̃_{2,s″}(τ^{θ′}_{s″}, ν_{s″}, β)). This is a lower bound for the Gittins index of arm s″ for type θ, so combined with the hypothesis that I(θ, s′, ν, β) ≥ max_{s″≠s′} I(θ, s″, ν, β), we get

I(θ, s′, ν, β) ≥ max_{s″≠s′} u₁(θ, s″, π̃_{2,s″}(τ^{θ′}_{s″}, ν_{s″}, β)).   (3)

Now define the receiver strategy π₂ ∈ Π₂ by π₂(·|s′) := π̃_{2,s′}(τ^θ_{s′}, ν_{s′}, β) and π₂(·|s″) := π̃_{2,s″}(τ^{θ′}_{s″}, ν_{s″}, β) for all s″ ≠ s′. By Lemma 1, u₁(θ, s′, π₂(·|s′)) = I(θ, s′, ν, β), so Equation (3) can be rewritten as

u₁(θ, s′, π₂(·|s′)) ≥ max_{s″≠s′} u₁(θ, s″, π₂(·|s″)),

that is, s′ is weakly optimal for θ against π₂. By the definition of θ′ ≻_{s′} θ, this implies s′ is strictly optimal for θ′ against π₂.

From the definition of π₂ and Lemma 1, the expected utility of θ′ playing any s″ ≠ s′ against π₂ is equal to the Gittins index of that arm for θ′, namely I(θ′, s″, ν, β). On the other hand, u₁(θ′, s′, π₂(·|s′)) is only a lower bound for I(θ′, s′, ν, β). This shows I(θ′, s′, ν, β) > max_{s″≠s′} I(θ′, s″, ν, β), as desired.

Step 2: If.
Suppose θ′ ⊁_{s′} θ. Then there is some receiver strategy π₂* ∈ Π₂ such that

u₁(θ, s′, π₂*(·|s′)) ≥ max_{s″≠s′} u₁(θ, s″, π₂*(·|s″)), and u₁(θ′, s′, π₂*(·|s′)) ≤ max_{s″≠s′} u₁(θ′, s″, π₂*(·|s″)).

Let ν* be any belief that induces π₂* on average, that is to say, for each s,

π₂*(·|s) = ∫_{π_{2,s} ∈ ∆(A)} π_{2,s} dν*_s(π_{2,s}).

Let β = 0. Then I(θ, s, ν*, 0) = u₁(θ, s, π₂*(·|s)) for every θ and s, since the Gittins index is equal to the myopic payoff when the decision-maker is perfectly impatient. This shows I(θ, s′, ν*, 0) ≥ max_{s″≠s′} I(θ, s″, ν*, 0) and I(θ′, s′, ν*, 0) ≤ max_{s″≠s′} I(θ′, s″, ν*, 0), so the index condition in the theorem fails at (ν*, 0).

In this section, we will define and analyze the aggregate sender response R₁ : Π₂ → Π₁ and the aggregate receiver response R₂ : Π₁ → Π₂. Loosely speaking, these are the large-population learning analogs of the best-response functions in the static signaling game. If we fix the aggregate play of the −i population at π_{−i} and run the learning model period after period from an arbitrary initial state, the distribution of play in the i population will approach R_i[π_{−i}]. Later in Section 5, the fixed points of the pair (R₁, R₂) will characterize the steady states of the learning system. To formally define the aggregate sender response, we first introduce the one-period-forward map.
Definition 3. The one-period-forward map for type θ, f_θ : ∆(Y_θ) × Π₂ → ∆(Y_θ), is

f_θ[ψ_θ, π₂](y_θ, (s, a)) := ψ_θ(y_θ) · γ · 1{σ_θ(y_θ) = s} · π₂(a|s)

and f_θ[ψ_θ, π₂](∅) := 1 − γ.

If the distribution over histories in the type θ population is ψ_θ and the receiver population's aggregate play is π₂, the resulting distribution over histories in the type θ population is f_θ[ψ_θ, π₂]. Specifically, there will be a 1 − γ mass of new type θ senders who have no history. Also, if the optimal first signal of a new type θ is s₀, that is if σ_θ(∅) = s₀, then f_θ[ψ_θ, π₂]((s₀, a)) = γ · (1 − γ) · π₂(a|s₀): new senders send s₀ in their first match, observe action a in response, and survive. In general, a type θ who has history y_θ and whose policy σ_θ(y_θ) prescribes playing s has π₂(a|s) chance of having subsequent history (y_θ, (s, a)) provided she survives until next period; the survival probability corresponds to the factor γ.

Write f^T_θ for the T-fold application of f_θ on ∆(Y_θ), holding fixed some π₂. Note that for arbitrary states ψ′ and ψ″, f_θ[ψ′_θ, π₂] and f_θ[ψ″_θ, π₂] agree on the length-1 histories Y_θ[1]: each (y_θ, (s, a)) ∈ Y_θ[1] has y_θ = ∅, and both states must assign mass 1 − γ to ∅. Iterating, for T = 2, f²_θ[ψ′_θ, π₂] and f²_θ[ψ″_θ, π₂] agree on Y_θ[2], because each history in Y_θ[2] can be written as (y_θ, (s, a)) for y_θ ∈ Y_θ[1], and f_θ[ψ′_θ, π₂] and f_θ[ψ″_θ, π₂] match on all y_θ ∈ Y_θ[1]. Proceeding inductively, we can conclude that f^T_θ(ψ′_θ, π₂) and f^T_θ(ψ″_θ, π₂) agree on all Y_θ[t] for t ≤ T, for any pair of type θ states ψ′_θ and ψ″_θ. This means lim_{T→∞} f^T_θ(ψ_θ, π₂) exists and is independent of the initial state ψ_θ. Denote this limit as ψ^{π₂}_θ. It is the long-run distribution over type θ histories induced by starting at an arbitrary state and fixing the receiver population's play at π₂, as stated formally in the next definition.

Definition 4. The aggregate sender response R₁ : Π₂ → Π₁ is defined by

R₁[π₂](s|θ) := ψ^{π₂}_θ({y_θ : σ_θ(y_θ) = s}),

where ψ^{π₂}_θ := lim_{T→∞} f^T_θ(ψ_θ, π₂) with ψ_θ any arbitrary type θ state. That is, R₁[π₂](·|θ) is the long-run aggregate behavior in the type θ population when the receivers' aggregate play is fixed at π₂.

Remark 1. Technically, R₁ depends on g₁, δ, and γ, just like σ_θ does. When relevant, we will make these dependencies clear by adding the appropriate parameters as superscripts to R₁, but we will mostly suppress them to lighten notation.

Remark 2. Although the aggregate sender response is defined at the aggregate level, R₁[π₂](·|θ) also describes the probability distribution of the play of a single type θ sender over her lifetime when she faces receiver play drawn from π₂ every period. Observe that f_θ[ψ_θ, π₂] restricted to Y_θ[1] gives the probability distribution over histories for a type θ who uses σ_θ and faces play drawn from π₂ for one period: it puts weight π₂(a|s₀) on history (s₀, a), where s₀ = σ_θ(∅). Similarly, f^T_θ[ψ_θ, π₂] restricted to Y_θ[t] for any t ≤ T gives the probability distribution over histories for someone who uses σ_θ and faces play drawn from π₂ for t periods. Since ψ^{π₂}_θ assigns probability (1 − γ)γ^t to the set of histories Y_θ[t], R₁[π₂](·|θ) = σ_θ(ψ^{π₂}_θ) is a weighted average over the distributions of play of someone using σ_θ and facing π₂, with weight (1 − γ)γ^t given to the play of an agent who has already lived through t periods (t = 0, 1, 2, ...).

Type Compatibility and the Aggregate Sender Response

The next lemma shows how type compatibility translates into restrictions on the aggregate sender response for different types.
Lemma 2. Suppose θ′ ≻_{s′} θ. Then for any regular prior g₁, any 0 ≤ δ, γ < 1, and any π₂ ∈ Π₂, we have R₁[π₂](s′|θ′) ≥ R₁[π₂](s′|θ).

Theorem 1 showed that when θ′ ≻_{s′} θ and the two types share the same beliefs, if θ plays s′ then θ′ must also play s′. But even though new agents of both types start with the same prior g₁, their beliefs may quickly diverge during the learning process, due to σ_{θ′} and σ_θ prescribing different experiments after the same history. This lemma shows that compatibility still imposes restrictions on the aggregate play of the sender population: regardless of the aggregate play π₂ in the receiver population, the frequencies with which s′ appears in the aggregate responses of the different types are always co-monotonic with the compatibility order ≻_{s′}.

To gain intuition for Lemma 2, consider two new senders with types θ_strong and θ_weak who are learning to play the beer-quiche game from Example 1. Suppose they have uniform priors over the responses to each signal, and that they face a sequence of receivers programmed to play Fight after Beer and NotFight after Quiche. Since observing Fight is the worst possible news about a signal's payoff, the Gittins index of a signal decreases when Fight is observed. Conversely, the Gittins index of a signal increases after each observation of NotFight. Thus, given the assumed play of the receivers, there are n′, n ≥ 0 such that type θ_strong plays Beer for n′ periods (and observes n′ instances of Fight) and then switches to Quiche forever after, while type θ_weak plays Beer for n periods before switching to Quiche forever after. Now we claim that n′ ≥ n. To see why, suppose instead that n′ < n, and let ν be the posterior belief about the receivers' aggregate play induced from n′ periods of observing Fight after Beer. After n′ periods, both types would share the belief ν. Then at belief ν, type θ_weak must play Beer while type θ_strong plays Quiche, so signal Beer must have the highest Gittins index for θ_weak but not for θ_strong. But this would contradict Theorem 1.

The proof of Lemma 2 relies on the similar idea of fixing a particular "programming" of receiver play and studying the induced paths of experimentation for different types. In the aggregate learning model, the sequence of responses that a given sender encounters in her life depends on the realization of the random matching process, because different receivers have different histories and respond differently to a given signal. We can index all possible sequences of random matching realizations using a device we call the "pre-programmed response path". To show that more compatible types play a given signal more often, it suffices to show this comparison holds on each pre-programmed response path, thus coupling the learning processes of types θ′ and θ. We will show that the intuition above extends to signaling games with any number of signals and to any pre-programmed response path. (The two-signal intuition above follows from Bellman (1956)'s Theorem 2 on Bernoulli bandits.)

Definition 5. A pre-programmed response path a = (a_{1,s}, a_{2,s}, ...)_{s∈S} is an element of ×_{s∈S}(A^∞).

A pre-programmed response path is an |S|-tuple of infinite sequences of receiver actions, one sequence for each signal. For a given pre-programmed response path a, we can imagine starting with a new type θ and generating receiver play each period in the following programmatic manner: when the sender plays s for the j-th time, respond with receiver action a_{j,s}. (If the sender sends s′ five times and then sends s″ ≠ s′, the response she gets to s″ is a_{1,s″}, not a_{6,s″}.) For a type θ who applies σ_θ each period, a induces a deterministic history of experiments and responses, which we denote y_θ(a). The induced history y_θ(a) can be used to calculate R₁[a](·|θ), the distribution of signals over the lifetime of a type θ induced by the pre-programmed response path a.
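To illustrate the coupling, here is a minimal numerical sketch of R₁[a](·|θ) for both beer-quiche types on one pre-programmed response path, "always Fight after Beer, always NotFight after Quiche" (a hypothetical setup of our own: myopic senders with δ = 0 and independent Beta(1,1) beliefs about the NotFight probability after each signal; none of these choices come from the paper's proof). The lifetime distribution weights the period-t signal by (1 − γ)γ^{t−1}:

```python
# Couple the two types on one pre-programmed response path in beer-quiche.
GAMMA = 0.9
T = 200  # lifetime truncation; the tail weight GAMMA**T is negligible

def v(theta, s):
    return 1.0 if (theta, s) in {("strong", "Beer"), ("weak", "Quiche")} else 0.0

def lifetime_beer_weight(theta):
    # counts[s] = [# NotFight observed after s, # times s was sent]
    counts = {"Beer": [0, 0], "Quiche": [0, 0]}
    weight = 0.0
    for t in range(T):
        # Myopic value under a Beta(1,1) prior on P(NotFight | s):
        # v(theta, s) + 2 * posterior mean of P(NotFight | s).
        def val(s):
            nf, n = counts[s]
            return v(theta, s) + 2.0 * (1 + nf) / (2 + n)
        s = max(["Beer", "Quiche"], key=val)
        if s == "Beer":
            weight += (1 - GAMMA) * GAMMA**t  # weight of the period t+1 signal
        # pre-programmed responses: Fight after Beer, NotFight after Quiche
        nf, n = counts[s]
        counts[s] = [nf + (1 if s == "Quiche" else 0), n + 1]
    return weight

# The more Beer-compatible type plays Beer weakly more often on this path.
assert lifetime_beer_weight("strong") >= lifetime_beer_weight("weak")
print(lifetime_beer_weight("strong"), lifetime_beer_weight("weak"))
```

On this path the strong type keeps drinking Beer (its myopic value stays above Quiche's even after repeated Fights), while the weak type starts with Quiche and never leaves it, so the inequality of Lemma 2 holds with slack.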
Namely, R₁[a](·|θ) is simply a mixture over all signals sent along the history y_θ(a), with weight (1 − γ)γ^{t−1} given to the signal in period t.

Now consider a type θ facing actions generated i.i.d. from the receiver behavior strategy π₂ each period, as in the interpretation of R₁ in Remark 2. This data-generating process is equivalent to drawing a random pre-programmed response path a at time 0 according to a suitable distribution, then producing all receiver actions using a. That is, R₁[π₂](·|θ) = ∫ R₁[a](·|θ) dπ₂(a), where we abuse notation and use dπ₂(a) to denote the distribution over pre-programmed response paths associated with π₂. Importantly, any two types θ′ and θ face the same distribution over pre-programmed response paths, so to prove the lemma it suffices to show R₁[a](s′|θ′) ≥ R₁[a](s′|θ) for all a.

Proof.
For t ≥ 0, write y^t_θ for the truncation of the infinite history y_θ to its first t periods, with y^∞_θ := y_θ. Given a finite or infinite history y^t_θ for type θ, the signal counting function #(s | y^t_θ) returns how many times signal s has appeared in y^t_θ. (We need this counting function since the receiver play generated by a pre-programmed response path each period depends on how many times each signal has been sent so far.)

As discussed above, we need only show R₁[a](s′|θ′) ≥ R₁[a](s′|θ). Let a be given and write T^θ_j for the period in which type θ sends signal s′ for the j-th time in the induced history y_θ(a). If no such period exists, then set T^θ_j = ∞. Since R₁[a](·|θ) is a weighted average over signals in y_θ(a) with decreasing weights given to later signals, to prove R₁[a](s′|θ′) ≥ R₁[a](s′|θ) it suffices to show that T^{θ′}_j ≤ T^θ_j for every j. Towards this goal, we will prove a sequence of statements by induction:

Statement j: Provided T^θ_j is finite, #(s″ | y^{T^{θ′}_j}_{θ′}(a)) ≤ #(s″ | y^{T^θ_j}_θ(a)) for all s″ ≠ s′.

For every j where T^θ_j < ∞, Statement j implies that the number of periods type θ′ spent sending each signal s″ ≠ s′ before sending s′ for the j-th time is no greater than the number of periods θ spent doing the same. Therefore it follows that θ′ sent s′ for the j-th time no later than θ did, that is, T^{θ′}_j ≤ T^θ_j. Finally, if T^θ_j = ∞, then evidently T^{θ′}_j ≤ ∞ = T^θ_j. It now remains to prove the sequence of statements by induction.

Statement 1 is the base case. By way of contradiction, suppose T^θ_1 < ∞ and #(s″ | y^{T^{θ′}_1}_{θ′}(a)) > #(s″ | y^{T^θ_1}_θ(a)) for some s″ ≠ s′. Then there is some earliest period t* < T^{θ′}_1 where #(s″ | y^{t*}_{θ′}(a)) > #(s″ | y^{T^θ_1}_θ(a)), and type θ′ played s″ in period t*: σ_{θ′}(y^{t*−1}_{θ′}(a)) = s″. But by construction, by the end of period t* − 1, type θ′ has sent s″ exactly as many times as type θ has sent it by period T^θ_1 − 1, so that #(s″ | y^{t*−1}_{θ′}(a)) = #(s″ | y^{T^θ_1 − 1}_θ(a)). Furthermore, neither type has sent s′ yet, so also #(s′ | y^{t*−1}_{θ′}(a)) = 0 = #(s′ | y^{T^θ_1 − 1}_θ(a)). Therefore, type θ′ holds the same posterior over the receiver's reaction to signals s′ and s″ at period t* − 1 as type θ does at period T^θ_1 − 1. (In the following equation and elsewhere in the proof, we abuse notation and write I(θ, s, y) to mean I(θ, s, g₁(·|y), δγ), which is the Gittins index of type θ for signal s at the posterior obtained from updating the prior g₁ using history y, with effective discount factor δγ.) So by Theorem 1,

s′ ∈ argmax_{ŝ∈S} I(θ, ŝ, y^{T^θ_1 − 1}_θ(a)) ⟹ I(θ′, s′, y^{t*−1}_{θ′}(a)) > I(θ′, s″, y^{t*−1}_{θ′}(a)).   (4)

However, by the construction of T^θ_1, we have σ_θ(y^{T^θ_1 − 1}_θ(a)) = s′. By the optimality of the Gittins index policy, the left-hand side of Equation (4) is satisfied. But, again by the optimality of the Gittins index policy, the right-hand side of Equation (4) contradicts σ_{θ′}(y^{t*−1}_{θ′}(a)) = s″. Therefore we have proven Statement 1.

Now suppose Statement j holds for all j ≤ K. We show Statement K+1 also holds. If T^θ_{K+1} is finite, then T^θ_K is also finite. The inductive hypothesis then shows #(s″ | y^{T^{θ′}_K}_{θ′}(a)) ≤ #(s″ | y^{T^θ_K}_θ(a)) for all s″ ≠ s′. Suppose there is some s″ ≠ s′ such that #(s″ | y^{T^{θ′}_{K+1}}_{θ′}(a)) > #(s″ | y^{T^θ_{K+1}}_θ(a)). Together with the previous inequality, this implies type θ′ played s″ for the (#(s″ | y^{T^θ_{K+1}}_θ(a)) + 1)-th time sometime between playing s′ for the K-th time and playing s′ for the (K+1)-th time. That is, if we put t* := min{t : #(s″ | y^t_{θ′}(a)) > #(s″ | y^{T^θ_{K+1}}_θ(a))}, then T^{θ′}_K < t* < T^{θ′}_{K+1}. By the construction of t*, #(s″ | y^{t*−1}_{θ′}(a)) = #(s″ | y^{T^θ_{K+1} − 1}_θ(a)), and also #(s′ | y^{t*−1}_{θ′}(a)) = K = #(s′ | y^{T^θ_{K+1} − 1}_θ(a)). Therefore, type θ′ holds the same posterior over the receiver's reaction to signals s′ and s″ at period t* − 1 as type θ does at period T^θ_{K+1} − 1. As in the base case, we can invoke Theorem 1 to show that it is impossible for θ′ to play s″ in period t* while θ plays s′ in period T^θ_{K+1}. This shows Statement j is true for every j by induction.

We now turn to the receivers' problem. Each new receiver thinks he is facing a fixed but unknown aggregate sender behavior strategy π₁, with belief over π₁ given by his regular prior g₂. To maximize his expected utility, the receiver must learn to infer the type of the sender from the signal, using his personal experience. Unlike the senders, whose optimal policies may involve experimentation, the receivers' problem only involves passive learning. Since the receiver observes the same information in a match regardless of his action, the optimal policy σ₂(y₂) simply best responds to the posterior belief induced by history y₂.

Definition 6.
The one-period-forward map for receivers, f₂ : ∆(Y₂) × Π₁ → ∆(Y₂), is

f₂[ψ₂, π₁](y₂, (θ, s)) := ψ₂(y₂) · γ · λ(θ) · π₁(s|θ)

and f₂[ψ₂, π₁](∅) := 1 − γ.

As with the one-period-forward maps f_θ for senders, f₂[ψ₂, π₁] describes the new distribution over receiver histories tomorrow if the distribution over histories in the receiver population today is ψ₂ and the sender population's aggregate play is π₁. We write ψ^{π₁}₂ := lim_{T→∞} f^T₂(ψ₂, π₁) for the long-run distribution over Y₂ induced by fixing the sender population's play at π₁, which is independent of the particular choice of initial state ψ₂.

Definition 7.
The aggregate receiver response R₂ : Π₁ → Π₂ is

R₂[π₁](a|s) := ψ^{π₁}₂({y₂ : σ₂(y₂)(s) = a}),

where ψ^{π₁}₂ := lim_{T→∞} f^T₂(ψ₂, π₁) with ψ₂ any arbitrary receiver state.

We are interested in the extent to which R₂[π₁] responds to inequalities of the form π₁(s′|θ′) ≥ π₁(s′|θ) embedded in π₁, such as those generated when θ′ ≻_{s′} θ (Lemma 2). To this end, for any two types θ′, θ we define P_{θ′,θ} as those beliefs where the odds ratio of θ′ to θ is at least their prior odds ratio, that is,

P_{θ′,θ} := { p ∈ ∆(Θ) : p(θ)/p(θ′) ≤ λ(θ)/λ(θ′) }.   (5)

If π₁(s′|θ′) ≥ π₁(s′|θ), π₁(s′|θ′) > 0, and the receiver knows π₁, then the receiver's posterior belief about the sender's type after observing s′ falls in the set P_{θ′,θ}. The next lemma shows that, under the additional provisions that π₁(s′|θ′) is "large enough" and receivers are sufficiently long-lived, R₂[π₁] will best respond to P_{θ′,θ} with high probability when s′ is sent.

For P ⊆ ∆(Θ), we let BR(P, s′) := ∪_{p∈P} argmax_{a∈A} u₂(p, s′, a); this is the set of best responses to s′ supported by some belief in P. (We abuse notation here and write u₂(p, s, a) to mean Σ_{θ∈Θ} u₂(θ, s, a) · p(θ).)

Lemma 3. Let regular prior g₂, types θ′, θ, and signal s′ be fixed. For every ε > 0, there exist C > 0 and γ̄ < 1 so that for any 0 ≤ δ < 1, γ̄ ≤ γ < 1, and n ≥ 1, if π₁(s′|θ′) ≥ π₁(s′|θ) and π₁(s′|θ′) ≥ (1 − γ)nC, then R₂[π₁](BR(P_{θ′,θ}, s′) | s′) ≥ 1 − 1/n − ε.

This lemma gives a lower bound on the probability that R₂[π₁] best responds to P_{θ′,θ} after signal s′. Note that the bound only applies for survival probabilities γ that are close enough to 1, because when receivers have short lifetimes they need not get enough data to outweigh their prior. Note also that more of the receivers learn the compatibility condition when π₁(s′|θ′) is large compared to (1 − γ), and almost all of them do in the limit n → ∞. The proof of Lemma 3 relies on Theorem 2 from Fudenberg, He, and Imhof (2017) about updating Bayesian posteriors after rare events, where the rare event corresponds to observing θ′ play s′. The details are in Appendix A.3.
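The Bayes-rule observation above — that when π₁(s|θ′) ≥ π₁(s|θ) and the receiver knows π₁, the posterior odds of θ against θ′ after s cannot exceed the prior odds — can be checked directly (hypothetical numbers):

```python
from fractions import Fraction as F

# Prior over two types and the probability each type plays signal s,
# chosen so that theta' plays s at least as often as theta.
prior = {"theta'": F(1, 3), "theta": F(2, 3)}
pi1_s = {"theta'": F(3, 4), "theta": F(1, 5)}

# Unnormalized posterior is prior * likelihood; normalization cancels in odds.
posterior = {t: prior[t] * pi1_s[t] for t in prior}
assert posterior["theta"] / posterior["theta'"] <= prior["theta"] / prior["theta'"]
```

The inequality holds because observing s multiplies the prior odds of θ to θ′ by the likelihood ratio π₁(s|θ)/π₁(s|θ′) ≤ 1.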
To interpret the condition π₁(s′|θ′) ≥ (1 − γ)nC, recall that an agent with survival chance γ has a typical lifespan of 1/(1 − γ). If π₁ describes the aggregate play in the sender population, then on average a type θ′ plays s′ for π₁(s′|θ′)/(1 − γ) periods in her life. So when a typical type θ′ plays s′ for nC periods, this lemma provides a bound of 1 − 1/n − ε on the share of the receiver responses to s′ that lie in BR(P_{θ′,θ}, s′). Note that the hypothesis that θ′ plays s′ for nC periods does not require that π₁(s′|θ′) is bounded away from 0 as γ → 1. To preview, Lemma 4 in the next section will establish that signals that are not weakly equilibrium dominated for a given type are played sufficiently often that Lemma 3 has bite when both δ and γ are close to 1.

Section 4 separately examined the senders' and receivers' learning problems. In this section, we turn to the two-sided learning problem. We will first define steady-state strategy profiles, which are signaling game strategy profiles π∗ where π∗₁ and π∗₂ are mutual aggregate responses, and then characterize the steady states using our previous results.

Steady States, δ-Stability, and Patient Stability

We introduced the one-period-forward maps f_θ and f₂ in Section 4, which describe the deterministic transition from state ψ^t this period to state ψ^{t+1} next period through the learning dynamics and the birth-death process. More precisely, ψ^{t+1}_θ = f_θ(ψ^t_θ, σ₂(ψ^t₂)) and ψ^{t+1}₂ = f₂(ψ^t₂, (σ_θ(ψ^t_θ))_{θ∈Θ}). A steady state is a fixed point ψ∗ of this transition map.

Definition 8.
A state ψ∗ is a steady state if ψ∗_θ = f_θ(ψ∗_θ, σ₂(ψ∗₂)) for every θ and ψ∗₂ = f₂(ψ∗₂, (σ_θ(ψ∗_θ))_{θ∈Θ}). The set of all steady states for regular prior g and 0 ≤ δ, γ < 1 is denoted Ψ∗(g, δ, γ), while the set of steady-state strategy profiles is Π∗(g, δ, γ) := {σ(ψ∗) : ψ∗ ∈ Ψ∗(g, δ, γ)}.

The strategy profiles associated with steady states represent time-invariant distributions of play, as the information lost when agents die each period exactly balances the information agents gain through learning that period. This means the exchangeability assumption of the learners will be satisfied in any steady state.

We now give an equivalent characterization of Π∗(g, δ, γ) in terms of R₁ and R₂. The proof is in Appendix A.4.

Proposition 2. π∗ ∈ Π∗(g, δ, γ) if and only if R₁^{g,δ,γ}(π∗₂) = π∗₁ and R₂^{g,δ,γ}(π∗₁) = π∗₂.

(Note that here we make the dependence of R₁ and R₂ on the parameters (g, δ, γ) explicit to avoid confusion.) That is, a steady-state strategy profile is a pair of mutual aggregate replies. The next proposition guarantees that there always exists at least one steady-state strategy profile.

Proposition 3. Π∗(g, δ, γ) is nonempty and compact in the norm topology.

The proof is in the Online Appendix. We establish that Ψ∗(g, δ, γ) is nonempty and compact in the ℓ¹ norm on the space of distributions, which immediately implies the same properties for Π∗(g, δ, γ). Intuitively, if lifetimes are finite, the set of histories is finite, so the set of states is of finite dimension. Here the one-period-forward map f = ((f_θ)_{θ∈Θ}, f₂) is continuous, so the usual version of Brouwer's fixed-point theorem applies. With geometric lifetimes, very old agents are rare, so truncating the agents' lifetimes at some large T yields a good approximation. Instead of using these approximations directly, our proof shows that under the ℓ¹ norm f is continuous, and that (because of the geometric lifetimes) the feasible states form a compact, locally convex Hausdorff space. This lets us appeal to a fixed-point theorem for that domain.

We now focus on the iterated limit lim_{δ→1} lim_{γ→1} Π∗(g, δ, γ), that is, the set of steady-state strategy profiles for δ and γ near 1, where we first send γ to 1 holding δ fixed, and then send δ to 1.

Definition 9.
For each 0 ≤ δ < 1, a strategy profile π∗ is δ-stable under g if there is a sequence γ_k → 1 and profiles π^(k) ∈ Π∗(g, δ, γ_k) such that π^(k) → π∗. Strategy profile π∗ is patiently stable under g if there is a sequence δ_k → 1 and profiles π^(k), where each π^(k) is δ_k-stable under g and π^(k) → π∗. Strategy profile π∗ is patiently stable if it is patiently stable under some regular prior g.

Heuristically, patiently stable strategy profiles are the limits of learning outcomes when agents become infinitely patient (so that senders are willing to make many experiments) and long-lived (so that agents on both sides can learn enough for their data to outweigh their prior). As in past work on steady-state learning (Fudenberg and Levine, 1993, 2006), the reason for this order of limits is to ensure that most agents have enough data that they stop experimenting and play myopic best responses. (If agents did not eventually stop experimenting as they age, then even if most agents have approximately correct beliefs, aggregate play need not be close to a Nash equilibrium, because most agents would not be playing a (static) best response to their beliefs.) We do not know whether our results extend to the other order of limits; we explain the issues involved below, after sketching the intuition for Proposition 5.

δ-Stability and Patient Stability

When γ is near 1, agents correctly learn the consequences of the strategies they play frequently. But for a fixed patience level they may choose to rarely or never experiment, and so can maintain incorrect beliefs about the consequences of strategies that they do not play. The next result formally states this, and parallels Fudenberg and Levine (1993)'s result that δ-stable strategy profiles are self-confirming equilibria.

Proposition 4. Suppose strategy profile π∗ is δ-stable under a regular prior. Then for every type θ and signal s with π∗₁(s|θ) > 0, s is a best response for type θ to some π₂ ∈ Π₂ with π₂(·|s) = π∗₂(·|s). Also, for any signal s such that π∗₁(s|θ) > 0 for at least one type θ, π∗₂(·|s) is supported on pure best responses to the Bayesian belief generated by π∗₁ after s.

We prove this result in the Online Appendix. The idea of the proof is the following: If signal s has positive probability in the limit, then it is played many times by the senders, so the receivers eventually learn the correct posterior distribution for θ given s. As the receivers have no incentive to experiment, their actions after s will be a best response to this correct posterior belief. For the senders, suppose π∗₁(s|θ) > 0, but s is not a best response for type θ to any π₂ ∈ Π₂ that matches π∗₂(·|s). If a sender has played s many times, then with high probability her belief about π₂(·|s) is close to π∗₂(·|s), so playing s is not myopically optimal. This would imply that type θ has persistent option value for signal s, which contradicts the fact that this option value must converge to 0 with the sample size.

Remark 3. This proposition says that each sender type is playing a best response to a belief about the receiver's play that is correct on the equilibrium path, and that the receivers are playing an aggregate best response to the aggregate play of the senders. Thus the δ-stable outcomes are a version of self-confirming equilibrium where different types of sender are allowed to have different beliefs. Moreover, as the next example shows, this sort of heterogeneity in the senders' beliefs about the aggregate strategy of the receivers can endogenously arise in a δ-stable strategy profile even when all types of new senders start with the same prior over how the receivers play.

Example 2.
Consider the following game. (Dekel, Fudenberg, and Levine (2004) defined type-heterogeneous self-confirming equilibrium in static Bayesian games. As they noted, this sort of heterogeneity is natural when the type of each agent is fixed, but not if each agent's type is drawn i.i.d. in each period. To extend their definition to signaling games, we can define the "signal functions" y_i(a, θ) from that paper to respect the extensive form of the game. See also ?.) Fix any regular prior g2 for the receiver and any regular prior g1 for the sender such that the sender's prior belief about responses to s2, g1^{s2}, is Beta(1, 3) on a1 and a2 respectively. We claim that when δ = 0, it is δ-stable for both types to send s1 and for the receiver to respond to every signal with a1, which is a type-heterogeneous rationalizable self-confirming equilibrium. However, this pooling behavior cannot occur in a Nash equilibrium or in a unitary self-confirming equilibrium, where both sender types must hold the same belief about how the receiver responds to s2.

To establish this claim, note that since δ = 0, each sender plays the myopically optimal signal after every history. For any γ, there is a steady state where the receivers' policy responds to every signal with a1 after every history, type θ1 senders play s1 after every history and never update their prior belief about how receivers react to s2, and type θ2 senders with fewer than 6 periods of experience play s2 but switch to playing s1 forever starting at age 7. The behavior of the θ2 agents is optimal because after k periods of playing s2 and seeing the response a1 every period, the sender's posterior belief about π2(·|s2) is Beta(1+k, 3), so the expected payoff of playing s2 next period is

(1+k)/(4+k) · (−1) + 3/(4+k) · 2.

This expression is nonnegative when 0 ≤ k ≤ 5; once it turns negative, s1 becomes myopically optimal, so type θ2 senders play s2 for their first six periods and switch to s1 forever starting at age 7. The fraction of type θ2 senders aged 6 and below approaches 0 as γ →
1, hence we have constructed a sequence of steady-state strategy profiles converging to the s1-pooling equilibrium. So even though both types start with the same prior g1, their beliefs about how the receivers react to s2 eventually diverge. □

In contrast to the plethora of δ-stable profiles, we now show that only Nash equilibrium profiles can be steady-state outcomes as δ tends to 1. Moreover, this limit also rules out strategy profiles in which the sender's strategy can only be supported by the belief that the receiver would play a dominated action in response to some unsent signal.

Definition 10.
In a signaling game, a perfect Bayesian equilibrium with heterogeneous off-path beliefs is a strategy profile (π∗1, π∗2) such that:

• For each θ ∈ Θ, u1(θ; π∗) = max_{s∈S} u1(θ, s, π∗2(·|s)).

• For each on-path signal s, u2(p∗(·|s), s, π∗2(·|s)) = max_{â∈A} u2(p∗(·|s), s, â).

• For each off-path signal s and each a ∈ A with π∗2(a|s) > 0, there exists a belief p ∈ Δ(Θ) such that u2(p, s, a) = max_{â∈A} u2(p, s, â).

Here u1(θ; π∗) refers to type θ's payoff under π∗, and p∗(·|s) is the Bayesian posterior belief about the sender's type after signal s under strategy π∗1.

The first two conditions imply that the profile is a Nash equilibrium. The third condition resembles that of perfect Bayesian equilibrium but is somewhat weaker, as it allows the receiver's play after an off-path signal s to be a mixture over several actions, each of which is a best response to a different belief about the sender's type. This means π∗2(·|s) ∈ Δ(BR(Δ(Θ), s)), but π∗2(·|s) itself may not be a best response to any unitary belief about the sender's type.

Proposition 5.
If strategy profile π∗ is patiently stable, then it is a perfect Bayesian equilibrium with heterogeneous off-path beliefs.

Proof. In the Online Appendix, we prove that patiently stable profiles must be Nash equilibria. This argument follows the proof strategy of Fudenberg and Levine (1993), which derived a contradiction via excess option values. In outline, if π∗ is patiently stable, each player's strategy is a best response to a belief that is correct about the opponent's on-path play. Thus if π∗ is not a Nash equilibrium, some type should perceive a persistent option value to experimenting with some signal that she plays with probability 0. But this would contradict the fact that the option values evaluated at sufficiently long histories must go to 0. We now explain why a patiently stable profile π∗ must satisfy the third condition in Definition 10. After observing any history y, a receiver who started with a regular prior thinks every signal has positive probability in his next match. So his optimal policy prescribes, for each signal s, a best response to that receiver's posterior belief about the sender's type upon seeing signal s after history y. For any regular prior g, 0 ≤ δ, γ < 1, and any sender aggregate play π1, we thus deduce that R2^{g,δ,γ}[π1](·|s) is entirely supported on BR(Δ(Θ), s). This means the same is true about the aggregate receiver response in every steady state, and hence in every patiently stable strategy profile.

In Fudenberg and Levine (1993), this argument relies on the finite lifetime of the agents only to ensure that "almost all" histories are long enough, by picking a large enough lifetime. We can achieve the analogous effect in our geometric-lifetime model by picking γ close to 1. Our proof uses the fact that if δ is fixed and γ → 1, then the number of experiments that a sender needs to exhaust her option value is negligible relative to her expected lifespan, so that most senders play approximate best responses to their current beliefs. The same conclusion does not hold if we fix γ and let δ → 1, even though the optimal sender policy only depends on the product δγ, because for a fixed sender policy the induced distribution on sender play depends on γ but not on δ.

Proposition 5 allows the receiver to sustain his off-path actions using any belief p ∈ Δ(Θ). We now turn to our main result, which focuses on refining off-path beliefs. We prove that patient stability selects a strict subset of the Nash equilibria, namely those that satisfy the compatibility criterion.

Definition 11.
For a fixed strategy profile π∗, let u1(θ; π∗) denote the payoff to type θ under π∗, and let

J(s, π∗) := { θ ∈ Θ : max_{a∈A} u1(θ, s, a) > u1(θ; π∗) }

be the set of types for which some response to signal s is strictly better than their payoff under π∗. Signal s is weakly equilibrium dominated for types in the complement of J(s, π∗). The admissible beliefs at signal s under profile π∗ are

P(s, π∗) := ⋂ { P_{θ'≻θ''} : θ' ≻_s θ'' and θ' ∈ J(s, π∗) },

where P_{θ'≻θ''} is defined in Equation (5). That is, P(s, π∗) is the joint belief restriction imposed by the family of P_{θ'≻θ''} for pairs (θ', θ'') satisfying two conditions: θ' is more type-compatible with s than θ'', and furthermore the more compatible type θ' belongs to J(s, π∗). If there are no pairs (θ', θ'') satisfying these two conditions, then (by the convention that an intersection over the empty family is everything) P(s, π∗) is defined as Δ(Θ). In any signaling game and for any π∗, the set P(s, π∗) is nonempty because it always contains the prior λ.

Definition 12.
Strategy profile π∗ satisfies the compatibility criterion if π∗2(·|s) ∈ Δ(BR(P(s, π∗), s)) for every s.

Like divine equilibrium but unlike the Intuitive Criterion or Cho and Kreps (1987)'s D1 criterion, the compatibility criterion says only that some signals should not increase the relative probability of "implausible" types, as opposed to requiring that these types have probability 0.

One might imagine a version of the compatibility criterion where the belief restriction P_{θ'≻θ''} applies whenever θ' ≻_s θ''. To understand why we require the additional condition θ' ∈ J(s, π∗) in the definition of admissible beliefs, recall that Lemma 3 only gives a learning guarantee in the receiver's problem when π1(s|θ') is "large enough" for the more type-compatible θ'. In the extreme case where s is a strictly dominated signal for θ', she will never play it during learning. It turns out that if s is weakly equilibrium dominated for θ', then θ' may still not experiment very much with it. On the other hand, the next lemma provides a lower bound on the frequency with which θ' experiments with s' when θ' ∈ J(s', π∗) and δ and γ are close to 1.

Lemma 4.
Fix a regular prior g and a strategy profile π∗ where, for some type θ' and signal s', θ' ∈ J(s', π∗). There exist a number ε ∈ (0, 1) and threshold functions δ̄ : N → (0, 1) and γ̄ : N × (0, 1) → (0, 1) such that for every N ∈ N, whenever π ∈ Π∗(g, δ, γ) with δ ≥ δ̄(N) and γ ≥ γ̄(N, δ), and π is no more than ε away from π∗ in L1 distance, we have π1(s'|θ') ≥ (1−γ)·N.

Here the L1 distance is d(π, π̃) = Σ_{θ∈Θ} Σ_{s∈S} |π1(s|θ) − π̃1(s|θ)| + Σ_{s∈S} Σ_{a∈A} |π2(a|s) − π̃2(a|s)|. Since π1(s'|θ') is between 0 and 1, the thresholds implicitly satisfy (1 − γ̄(N, δ))·N < 1.

The proof of this lemma is in the Online Appendix. For intuition, suppose that not only is s' equilibrium undominated for θ' in π∗, but furthermore s' can lead to the highest signaling-game payoff for type θ' under some receiver response. Because the prior is non-doctrinaire, the Gittins index of each signal in the learning problem approaches its highest possible payoff in the stage game as the sender becomes infinitely patient. Therefore, for every N ∈ N, when γ and δ are close enough to 1, a new type θ' will play s' in each of the first N periods of her life, regardless of the responses she receives during that time. These N periods account for roughly a (1−γ)·N fraction of her life, proving the lemma in this special case. It turns out that even if s' does not lead to the highest potential payoff in the signaling game, long-lived players will have a good estimate of their steady-state payoff, so type θ' will still play any s' that is equilibrium undominated in strategy profile π∗ at least N times in any steady state sufficiently close to π∗, though these N periods may not occur at the beginning of her life.

Theorem 2.
Every patiently stable strategy profile π∗ satisfies the compatibility criterion.

The proof combines Lemmas 2, 3, and 4. Lemma 2 shows that types that are more compatible with s' play it more often. Lemma 4 says that types for whom s' is not weakly equilibrium dominated will play it "many times." Finally, Lemma 3 shows that "many times" here is sufficiently large that most receivers correctly believe that more compatible types play s' more than less compatible types do, so their posterior odds ratio for more versus less compatible types is at least the prior ratio.

Proof.
Suppose π∗ is patiently stable under regular prior g. Fix a signal s' and an action â ∉ BR(P(s', π∗), s'). Let h > 0 be arbitrary; we will show that π∗2(â|s') < h. Since the choices of s', â, and h > 0 are arbitrary, this proves the theorem.

Step 1: Setting some constants. In the statement of Lemma 3, for each pair (θ', θ'') such that θ' ≻_{s'} θ'' and θ' ∈ J(s', π∗), put ε = h/(2|Θ|²) and find C_{θ',θ''} and γ̄_{θ',θ''} so that the result holds. Let C be the maximum of all such C_{θ',θ''} and γ̄1 the maximum of all such γ̄_{θ',θ''}. Also find n ≥ 1 so that

1 − 1/n > 1 − h/(2|Θ|²).   (6)

In the statement of Lemma 4, for each θ' such that θ' ≻_{s'} θ'' for at least one θ'', find ε_{θ'}, δ̄_{θ'}(nC), and γ̄_{θ'}(nC, δ) so that the lemma holds. Write ε∗ for the minimum of the ε_{θ'}, and let δ̄∗(nC) and γ̄∗(nC, δ) denote the maxima of δ̄_{θ'}(nC) and γ̄_{θ'}(nC, δ) across such θ'.

Step 2: Finding a steady-state profile with large δ, γ that approximates π∗. Since π∗ is patiently stable under g, there exists a sequence of strategy profiles π^(j) → π∗ where π^(j) is δ_j-stable under g with δ_j → 1. Each π^(j) can in turn be written as the limit of steady-state strategy profiles: for each j, there exist γ_{j,k} → 1 and π^(j,k) ∈ Π∗(g, δ_j, γ_{j,k}) such that lim_{k→∞} π^(j,k) = π^(j). The convergence of the array π^(j,k) to π∗ means we may find j0 ∈ N and a function k0(j) so that whenever j ≥ j0 and k ≥ k0(j), π^(j,k) is no more than min(ε∗, h/(2|Θ|²)) away from π∗. Find j° ≥ j0 large enough that δ° := δ_{j°} > δ̄∗(nC), and then find a large enough k° ≥ k0(j°) so that γ° := γ_{j°,k°} > max(γ̄∗(nC, δ°), γ̄1). So we have identified a steady-state profile π° := π^(j°,k°) ∈ Π∗(g, δ°, γ°) that approximates π∗ to within min(ε∗, h/(2|Θ|²)).

Step 3: Applying properties of R1 and R2. For each pair (θ', θ'') such that θ' ≻_{s'} θ'' and θ' ∈ J(s', π∗), we will bound the probability that π°2(·|s') does not best respond to P_{θ'≻θ''} by h/|Θ|². Since there are at most |Θ|·(|Θ|−1) such pairs in the intersection defining P(s', π∗), this would imply that π°2(â|s') < |Θ|·(|Θ|−1)·h/|Θ|², since â ∉ BR(P(s', π∗), s'). And since π° is no more than h/(2|Θ|²) away from π∗, this would show π∗2(â|s') < h.

By construction, π° is closer than ε_{θ'} to π∗, and furthermore δ° ≥ δ̄_{θ'}(nC) and γ° ≥ γ̄_{θ'}(nC, δ°). By Lemma 4, π°1(s'|θ') ≥ nC(1−γ°). At the same time, π°1 = R1[π°] and θ' ≻_{s'} θ'', so Lemma 2 implies that π°1(s'|θ') ≥ π°1(s'|θ''). Turning to the receiver side, π°2 = R2[π°], with π° satisfying the conditions of Lemma 3 associated with ε = h/(2|Θ|²) and γ° ≥ γ̄1. Therefore we conclude

π°2(BR(P_{θ'≻θ''}, s') | s') ≥ 1 − 1/n − h/(2|Θ|²).

But by the construction of n in Equation (6), 1 − 1/n > 1 − h/(2|Θ|²), so the left-hand side is at least 1 − h/|Θ|², as desired.

Remark. More generally, consider any model for our populations of agents with geometrically distributed lifetimes that generates aggregate response functions R1 and R2. Defining the steady states under (g, δ, γ) as the strategy profiles π∗ such that R1^{g,δ,γ}[π∗] = π∗1 and R2^{g,δ,γ}[π∗] = π∗2, the proof of Theorem 2 applies to the patiently stable profiles of the new learning model provided that R1 satisfies the conclusion of Lemma 2, R2 satisfies the conclusion of Lemma 3, and Lemma 4 is valid for (θ', s') pairs such that θ' ≻_{s'} θ'' for at least one type θ'' and θ' ∈ J(s', π∗). We outline two such more general learning models below. (The proof is in the Online Appendix.)

Corollary 1.
With either of the following modifications of the steady-state learning model from Section 2, every patiently stable strategy profile still satisfies the compatibility criterion.

(i) Heterogeneous priors. There is a finite collection of regular sender priors {g_{1,k}}_{k=1}^{n_1} and a finite collection of regular receiver priors {g_{2,k}}_{k=1}^{n_2}. Upon birth, an agent is endowed with a random prior, where the distributions over priors are μ1 and μ2 for senders and receivers. An agent's prior is independent of her payoff type, and furthermore no one ever observes another person's prior.

(ii) Social learning. Suppose a 1 − α fraction of the senders are "normal learners" as described in Section 2, but the remaining 0 < α < 1 fraction are "social learners." At the end of each period, a social learner can observe the extensive-form strategies of her matched receiver and of c ≥ 1 other matches sampled uniformly at random. Each sender knows upon birth whether she is a normal learner or a social learner, and this is uncorrelated with her payoff type. Receivers cannot distinguish between the two kinds of senders.

Example 1 (Continued). The beer-quiche game of Example 1 has two components of Nash equilibria: "beer-pooling equilibria," where both types play Beer with probability 1, and "quiche-pooling equilibria," where both types play Quiche with probability 1. In a quiche-pooling equilibrium π∗, type θ_strong's equilibrium payoff is 2, so θ_strong ∈ J(Beer, π∗), since θ_strong's highest possible payoff under Beer is 3, and we have already shown that θ_strong ≻_Beer θ_weak. So

P(Beer, π∗) = { p ∈ Δ(Θ) : p(θ_weak)/p(θ_strong) ≤ λ(θ_weak)/λ(θ_strong) = 1/9 }.

Fight is not a best response after Beer to any such belief, so equilibria in which Fight occurs with positive probability after Beer do not satisfy the compatibility criterion, and thus no quiche-pooling equilibrium is patiently stable. Since the set of patiently stable outcomes is a nonempty subset of the set of Nash equilibria, pooling on beer is the unique patiently stable outcome. By Corollary 1, quiche-pooling equilibria are still not patiently stable in the more general learning models involving either heterogeneous priors or social learners. □
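The best-response claim in this example is a finite check, and a short numerical sketch can make it concrete. The receiver payoffs and the prior below follow the textbook beer-quiche specification (payoff 1 for the "correct" action, 0 otherwise, with λ(θ_strong) = 0.9); since the example's payoff matrix is not reproduced here, treat these primitives as illustrative assumptions.

```python
# Receiver payoffs in the textbook beer-quiche specification (an assumption,
# not reproduced from the paper's Example 1): payoff 1 for the "correct"
# action (Fight vs. theta_weak, NotFight vs. theta_strong), 0 otherwise.
def fight_is_best_response(p_weak):
    u_fight = 1.0 * p_weak            # Fight is correct only against theta_weak
    u_not_fight = 1.0 * (1 - p_weak)  # NotFight is correct against theta_strong
    return u_fight >= u_not_fight

# Admissible beliefs after Beer: posterior odds p(weak)/p(strong) may not
# exceed the prior odds 1/9, i.e. p_weak <= 0.1.
admissible = [k / 1000 for k in range(0, 101)]
assert not any(fight_is_best_response(p) for p in admissible)
assert fight_is_best_response(0.6)  # unrestricted beliefs can justify Fight
```

Under these payoffs, Fight only becomes a best response once p(θ_weak) ≥ 1/2, far above anything the compatibility criterion admits after Beer.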
In generic signaling games, equilibria where the receiver plays a pure strategy must satisfy astronger condition than the compatibility criterion to be patiently stable.
Definition 13.
Let

J̃(s, π∗) := { θ ∈ Θ : max_{a∈A} u1(θ, s, a) ≥ u1(θ; π∗) }.

If J̃(s', π∗) is nonempty, define the strongly admissible beliefs at signal s' under profile π∗ to be

P̃(s', π∗) := Δ(J̃(s', π∗)) ∩ ⋂ { P_{θ'≻θ''} : θ' ≻_{s'} θ'' },

where P_{θ'≻θ''} is defined in Equation (5). Otherwise, define P̃(s', π∗) := Δ(Θ). Here J̃(s, π∗) is the set of types for which some response to signal s is at least as good as their equilibrium payoff under π∗; that is, the set of types for whom s is not equilibrium dominated in the sense of Cho and Kreps (1987). Note that P̃, unlike P, assigns probability 0 to equilibrium-dominated types, which is the belief restriction of the Intuitive Criterion.

Definition 14. A Nash equilibrium π∗ is on-path strict for the receiver if for every on-path signal s∗, π∗2(a∗|s∗) = 1 for some a∗ ∈ A and u2(p∗(·|s∗), s∗, a∗) > max_{a≠a∗} u2(p∗(·|s∗), s∗, a).

Of course, the receiver cannot have strict ex ante preferences over play at unreached information sets; this condition is called "on-path strict" because it places no restrictions on the receiver's incentives after off-path signals. In generic signaling games, all pure-strategy equilibria are on-path strict for the receiver, but the same is not true for mixed-strategy equilibria.

Definition 15. A strategy profile π∗ satisfies the strong compatibility criterion if at every signal s' we have π∗2(·|s') ∈ Δ(BR(P̃(s', π∗), s')).

It is immediate that the strong compatibility criterion implies the compatibility criterion, since it places more stringent restrictions on the receiver's behavior. It is also immediate that the strong compatibility criterion implies the Intuitive Criterion.
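The nesting P̃(s, π∗) ⊆ P(s, π∗) behind these implications can be checked directly on a grid of beliefs in a two-type example; the uniform prior and the dominance configuration below are hypothetical stand-ins, not taken from the paper.

```python
# Two types: index 0 is theta' (more compatible with s), index 1 is theta''.
# Hypothetical primitives: uniform prior, theta' in J-tilde, and theta''
# equilibrium dominated, so the strong restriction also forces p(theta'') = 0.
lam = (0.5, 0.5)
grid = [(k / 100, 1 - k / 100) for k in range(101)]  # discretized Delta(Theta)

def admissible(p):  # P: odds of theta'' vs theta' do not exceed the prior odds
    return p[1] * lam[0] <= p[0] * lam[1]

def strongly_admissible(p):  # P-tilde: admissible and supported on J-tilde
    return admissible(p) and p[1] == 0.0

P = {p for p in grid if admissible(p)}
P_tilde = {p for p in grid if strongly_admissible(p)}
assert P_tilde <= P          # strong compatibility refines compatibility
assert lam in P              # the prior itself is always admissible
assert lam not in P_tilde    # but not strongly admissible here
```

Because P̃ is a subset of P, best responses to beliefs in P̃ are a subset of best responses to beliefs in P, which is exactly why the strong criterion is the more restrictive one.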
Theorem 3.
Suppose π∗ is on-path strict for the receiver and patiently stable. Then it satisfies the strong compatibility criterion.

The proof of this theorem appears in Appendix A.5. The main idea is that when an off-path signal s' is equilibrium dominated in π∗ for type θ_D but not even weakly equilibrium dominated for type θ_U, type θ_U will experiment "infinitely more often" with s' than θ_D does. Indeed, we can provide an upper bound on the steady-state probability that θ_D ever switches away from her equilibrium signal s∗ after trying it for the first time, which is also an upper bound on how often θ_D experiments with s', while Lemma 4 provides a lower bound on how often θ_U plays s'. We show there is a sequence of steady-state profiles π^(k) ∈ Π∗(g, δ_k, γ_k) with γ_k → 1 and π^(k) → π∗ along which the ratio of the lower bound to the upper bound goes to infinity. Applying Theorem 2 of Fudenberg, He, and Imhof (2017), we can then prove that receivers infer an s'-sender is "infinitely more likely" to be θ_U than θ_D, which means receivers must assign probability 0 to θ_D after s' in equilibrium π∗.

Remark. As noted by Fudenberg and Kreps (1988) and Sobel, Stole, and Zapater (1990), it seems "intuitive" that learning and rational experimentation should lead receivers to assign probability 0 to types that are equilibrium dominated, so it might seem surprising that this theorem needs the additional assumption that the equilibrium is on-path strict for the receiver. (The upper bound above does not apply when π∗ is not on-path strict for the receiver. When π∗ involves the receiver strictly mixing between several responses after s∗, some of these responses might make θ_D strictly worse off than her worst payoff after s', so there is non-vanishing probability that θ_D observes a large number of these bad responses in a row and then stops playing s∗.) In our model, senders start out uncertain about the receivers' play, and so even types for whom a signal is equilibrium dominated might initially experiment with it. Showing that these experiments do not lead to "perverse" responses by the receivers requires some arguments about the relative probabilities with which equilibrium-dominated types and non-equilibrium-dominated types play off-path signals. When the equilibrium involves on-path receiver randomization, a non-trivial fraction of receivers could play an action after a type's equilibrium signal that the type finds strictly worse than her worst payoff under an off-path signal. In this case, we do not see how to show that the probability she ever switches away from her equilibrium signal tends to 0 with patience, since the event of seeing a large number of these unfavorable responses in a row has probability bounded away from 0 even when the receiver population plays exactly their equilibrium strategy. However, we do not have a counterexample showing that the conclusion of the theorem fails without on-path strictness for the receiver.
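The streak event in this remark is easy to quantify. As a hedged illustration (the mixing probability and run length below are arbitrary choices, not the paper's), the chance that a sender sees m unfavorable responses in a row at least once grows to 1 with her lifespan, which is why the upper-bound argument breaks down under on-path mixing:

```python
def prob_bad_run(q, m, T):
    """P(at least one run of m consecutive unfavorable responses among
    T i.i.d. draws), where each draw is unfavorable with probability q.
    Computed exactly by dynamic programming over the current streak length."""
    dist = [1.0] + [0.0] * (m - 1)  # dist[j]: mass with current streak length j
    hit = 0.0                        # absorbed mass: a run of m already occurred
    for _ in range(T):
        new = [0.0] * m
        for streak, mass in enumerate(dist):
            new[0] += mass * (1 - q)      # favorable response resets the streak
            if streak + 1 == m:
                hit += mass * q           # streak reaches m: absorb
            else:
                new[streak + 1] += mass * q
        dist = new
    return hit

# With a fixed receiver mixture, longer-lived senders almost surely hit a run.
assert prob_bad_run(0.5, 2, 2) == 0.25
assert prob_bad_run(0.5, 3, 10) < prob_bad_run(0.5, 3, 100)
assert prob_bad_run(0.5, 3, 1000) > 0.99
```

For a fixed mixture the run probability does not vanish as patience grows; only the sender's tolerance for bad runs changes, which is the wedge the on-path strictness assumption closes.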
Example 3.
In the following modified beer-quiche game, the payoffs from fighting a type θ_weak who drinks beer have been substantially increased relative to Example 1, so that Fight is now a best response to the prior belief λ after Beer.

Since the prior λ is always an admissible belief in any signaling game after any signal, the Nash equilibrium π∗ where both types play Quiche (supported by the receiver playing Fight after Beer) is not ruled out by the compatibility criterion, unlike in Example 1. However, this equilibrium is ruled out by the strong compatibility criterion. To see why, note that this pooling equilibrium is on-path strict for the receiver, because the receiver has a strict preference for NotFight at the only on-path signal, Quiche. Moreover, π∗ does not satisfy the strong compatibility criterion, because J̃(Beer, π∗) = {θ_strong} implies the only strongly admissible belief after Beer assigns probability 1 to the sender being θ_strong. Thus Theorem 3 implies that this equilibrium is not patiently stable. □

Discussion
Our learning model supposes that the agents have geometrically distributed lifetimes, which is one of the reasons that the senders' optimization problems can be solved using the Gittins index. If agents instead had fixed finite lifetimes, as in Fudenberg and Levine (1993, 2006), their optimization problem would not be stationary, and the finite-horizon analog of the Gittins index is only approximately optimal for the finite-horizon multi-armed bandit problem (Niño-Mora, 2011). Applying the geometric-lifetime framework to steady-state learning models for other classes of extensive-form games could prove fruitful, especially for games where we need to compare the behavior of various players or player types, and in studies of other sorts of dynamic decisions.

Theorem 1 provides a comparison between the dynamic behavior of two agents in a geometric-lifetime bandit problem based on their static preferences over the prizes. As an immediate application, consider a principal-agent setting where the agent faces a multi-armed bandit with arms s ∈ S, where s leads to a prize drawn from Z_s according to some distribution. The principal knows the agent's per-period utility function u : ∪_s Z_s → R, but not the agent's beliefs over the prize distributions of the different arms or the agent's discount factor. Suppose the principal observes the agent choosing arm 1 in the first period. The principal can impose taxes and subsidies on the different prizes and arms, changing the agent's utility function to ũ. For what taxes and subsidies would the agent still have chosen arm 1 in the first period, irrespective of her initial beliefs and discount factor? According to Theorem 1, the answer is precisely those taxes and subsidies such that arm 1 is more type-compatible with ũ than with u.

Our results provide an upper bound on the set of patiently stable strategy profiles in a signaling game.
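Since the Gittins index does the computational work in the sender's problem, a numerical sketch may help. The code below approximates the index of a single Bernoulli arm with a Beta prior via the standard calibration against a safe arm, using truncated value iteration; the depth, tolerance, and priors are illustrative choices, not the paper's. It also illustrates the fact used in the intuition for Lemma 4: the index rises toward the arm's best possible payoff as the discount factor approaches 1.

```python
from functools import lru_cache

def gittins_index(a, b, beta, depth=50, tol=1e-4):
    """Approximate Gittins index (in per-period payoff units) of a Bernoulli
    arm with Beta(a, b) prior: the largest safe per-period payoff lam at which
    pulling the arm at least once more is still weakly optimal."""
    def excess_value(lam):
        @lru_cache(maxsize=None)
        def V(x, y, d):
            if d == 0:
                return 0.0          # truncate the horizon
            p = x / (x + y)         # posterior mean probability of a success
            pull = p - lam + beta * (p * V(x + 1, y, d - 1)
                                     + (1 - p) * V(x, y + 1, d - 1))
            return max(0.0, pull)   # 0 = retire to the safe payoff lam
        return V(a, b, depth)
    lo, hi = 0.0, 1.0               # bisect on the calibration payoff
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if excess_value(mid) > 0 else (lo, mid)
    return (lo + hi) / 2

# The index exceeds the myopic mean (the option value of experimentation) and
# increases toward the arm's best payoff as the agent becomes more patient.
i_half, i_90, i_99 = (gittins_index(1, 1, beta) for beta in (0.5, 0.9, 0.99))
assert 0.5 < i_half < i_90 < i_99 < 1.0
```

In the paper's setting the relevant discount factor is the product δγ, so patience and survival enter the sender's experimentation incentives symmetrically through this index.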
In Fudenberg and He (2017), we provided a lower bound for the same set, as well as a sharper upper bound under additional restrictions on the priors. Even taken together, these results do not give an exact characterization of patiently stable outcomes. Nevertheless, they show how the theory of learning in games provides a foundation for refining the set of equilibria in signaling games.

In future work, we hope to investigate a learning model featuring temporary sender types. Instead of the sender's type being assigned at birth and fixed for life, at the start of each period each sender takes an i.i.d. draw from λ to discover her type for that period. When the players are impatient, this yields different steady states than the fixed-type model here, as noted by Dekel, Fudenberg, and Levine (2004). This model will require different tools to analyze, since the sender's problem becomes a restless bandit.

References
Banks, J. S. and J. Sobel (1987): "Equilibrium Selection in Signaling Games," Econometrica, 55, 647–661.

Bellman, R. (1956): "A Problem in the Sequential Design of Experiments," Sankhyā: The Indian Journal of Statistics (1933-1960), 16, 221–229.

Billingsley, P. (1995): Probability and Measure, John Wiley & Sons.

Cho, I.-K. and D. M. Kreps (1987): "Signaling Games and Stable Equilibria," Quarterly Journal of Economics, 102, 179–221.

Dekel, E., D. Fudenberg, and D. K. Levine (1999): "Payoff Information and Self-Confirming Equilibrium," Journal of Economic Theory, 89, 165–185.

——— (2004): "Learning to Play Bayesian Games," Games and Economic Behavior, 46, 282–303.

Diaconis, P. and D. Freedman (1990): "On the Uniform Consistency of Bayes Estimates for Multinomial Probabilities," Annals of Statistics, 18, 1317–1327.

Esponda, I. and D. Pouzo (2016): "Berk-Nash Equilibrium: A Framework for Modeling Agents With Misspecified Models," Econometrica, 84, 1093–1130.

Fudenberg, D. and K. He (2017): "Learning and Equilibrium Refinements in Signalling Games," Mimeo.

Fudenberg, D., K. He, and L. A. Imhof (2017): "Bayesian Posteriors for Arbitrarily Rare Events," Proceedings of the National Academy of Sciences, 114, 4925–4929.

Fudenberg, D. and D. M. Kreps (1988): "A Theory of Learning, Experimentation, and Equilibrium in Games," Mimeo.

——— (1993): "Learning Mixed Equilibria," Games and Economic Behavior, 5, 320–367.

——— (1994): "Learning in Extensive-Form Games, II: Experimentation and Nash Equilibrium," Mimeo.

——— (1995): "Learning in Extensive-Form Games I. Self-Confirming Equilibria," Games and Economic Behavior, 8, 20–55.

Fudenberg, D. and D. K. Levine (1993): "Steady State Learning and Nash Equilibrium," Econometrica, 61, 547–573.

——— (2006): "Superstition and Rational Learning," American Economic Review, 96, 630–651.

Gittins, J. C. (1979): "Bandit Processes and Dynamic Allocation Indices," Journal of the Royal Statistical Society, Series B (Methodological), 41, 148–177.

Jehiel, P. and D. Samet (2005): "Learning to Play Games in Extensive Form by Valuation," Journal of Economic Theory, 124, 129–148.

Kalai, E. and E. Lehrer (1993): "Rational Learning Leads to Nash Equilibrium," Econometrica, 61, 1019–1045.

Laslier, J.-F. and B. Walliser (2015): "Stubborn Learning," Theory and Decision, 79, 51–93.

Niño-Mora, J. (2011): "Computing a Classic Index for Finite-Horizon Bandits," INFORMS Journal on Computing, 23, 254–267.

Sobel, J., L. Stole, and I. Zapater (1990): "Fixed-Equilibrium Rationalizability in Signaling Games," Journal of Economic Theory, 52, 304–331.

Spence, M. (1973): "Job Market Signaling," Quarterly Journal of Economics, 87, 355–374.
A Appendix – Relegated Proofs
A.1 Proof of Proposition 1
Proposition 1: (i) ≻_s is transitive. (ii) Except when s is either strictly dominant for both θ' and θ'' or strictly dominated for both θ' and θ'', θ' ≻_s θ'' implies that θ'' ⊁_s θ'.

Proof. To show (i), suppose θ1 ≻_s θ2 and θ2 ≻_s θ3. For any π2 ∈ Π2 where s is weakly optimal for θ3, it must be strictly optimal for θ2, hence also strictly optimal for θ1. This shows θ1 ≻_s θ3.

To establish (ii), partition the set of receiver strategies as Π2 = Π2⁺ ∪ Π2⁰ ∪ Π2⁻, where the three subsets consist of the receiver strategies that make s strictly better than, indifferent to, or strictly worse than the best alternative signal for θ''. If the set Π2⁰ is nonempty, then θ' ≻_s θ'' implies θ'' ⊁_s θ'. This is because against any π2 ∈ Π2⁰, signal s is strictly optimal for θ' but only weakly optimal for θ''. At the same time, if both Π2⁺ and Π2⁻ are nonempty, then Π2⁰ is nonempty. This is because both π2 ↦ u1(θ'', s, π2(·|s)) and π2 ↦ max_{s'≠s} u1(θ'', s', π2(·|s')) are continuous functions, so for any π2⁺ ∈ Π2⁺ and π2⁻ ∈ Π2⁻, there exists α ∈ (0, 1) so that απ2⁺ + (1−α)π2⁻ ∈ Π2⁰. If only Π2⁺ is nonempty and θ' ≻_s θ'', then s is strictly dominant for both θ' and θ''. If only Π2⁻ is nonempty, then we can have θ'' ≻_s θ' only when s is never a weak best response for θ' against any π2 ∈ Π2.

A.2 Proof of Lemma 1

Lemma 1: For every signal s, stopping time τ, belief ν_s, and discount factor β, there exists π2,s(τ, ν_s, β) ∈ Δ(A) so that for every θ,

E_{ν_s}[ Σ_{t=0}^{τ−1} β^t · u1(θ, s, a_s(t)) ] / E_{ν_s}[ Σ_{t=0}^{τ−1} β^t ] = u1(θ, s, π2,s(τ, ν_s, β)).

Proof.
Step 1: Induced mixed actions.
A belief ν_s and a stopping time τ_s together define a stochastic process (A_t)_{t≥0} over the space A ∪ {∅}, where A_t ∈ A corresponds to the receiver action seen in period t if τ_s has not yet stopped (τ_s > t), and A_t := ∅ if τ_s has stopped (τ_s ≤ t). Enumerating A = {a_1, ..., a_n}, we write p_{t,i} := P_{ν_s}[A_t = a_i] for 1 ≤ i ≤ n to record the probability of seeing receiver action a_i in period t, and p_{t,0} := P_{ν_s}[A_t = ∅] = P_{ν_s}[τ_s ≤ t] for the probability of seeing no receiver action in period t due to τ_s having stopped. Given ν_s and τ_s, we define the induced mixed action after signal s, π2,s(ν_s, τ_s, β) ∈ Δ(A), by

π2,s(ν_s, τ_s, β)(a) := ( Σ_{t=0}^∞ β^t p_{t,i} ) / ( Σ_{t=0}^∞ β^t (1 − p_{t,0}) )  for the i such that a = a_i.

Since Σ_{i=1}^n p_{t,i} = 1 − p_{t,0} for each t ≥ 0, it is clear that π2,s(ν_s, τ_s, β) puts nonnegative weights on actions in A that sum to 1, so π2,s(ν_s, τ_s, β) ∈ Δ(A) may indeed be viewed as a mixture over receiver actions.

Step 2: Induced mixed actions and per-period payoff.

We now show that for any β and any stopping time τ_s for signal s, the normalized payoff in the stopping problem is equal to the utility of playing s against π2,s(ν_s, τ_s, β) for one period, that is,

u1(θ, s, π2,s(ν_s, τ_s, β)) = E_{ν_s}[ Σ_{t=0}^{τ_s−1} β^t · u1(θ, s, a_s(t)) ] / E_{ν_s}[ Σ_{t=0}^{τ_s−1} β^t ].

To see why this is true, rewrite the denominator of the right-hand side as

E_{ν_s}[ Σ_{t=0}^{τ_s−1} β^t ] = E_{ν_s}[ Σ_{t=0}^∞ 1{τ_s > t} · β^t ] = Σ_{t=0}^∞ β^t · P_{ν_s}[τ_s > t] = Σ_{t=0}^∞ β^t (1 − p_{t,0}),

and rewrite the numerator as

E_{ν_s}[ Σ_{t=0}^{τ_s−1} β^t · u1(θ, s, a_s(t)) ] = Σ_{t=0}^∞ β^t · ( p_{t,0} · 0 + Σ_{i=1}^n p_{t,i} · u1(θ, s, a_i) ) = Σ_{i=1}^n ( Σ_{t=0}^∞ β^t · p_{t,i} ) · u1(θ, s, a_i),

where the middle expression uses the fact that the payoff is 0 if the process has already stopped and that otherwise a_s(t) is distributed according to (p_{t,i})_i. So overall we get, as desired,

E_{ν_s}[ Σ_{t=0}^{τ_s−1} β^t · u1(θ, s, a_s(t)) ] / E_{ν_s}[ Σ_{t=0}^{τ_s−1} β^t ] = Σ_{i=1}^n [ ( Σ_{t=0}^∞ β^t p_{t,i} ) / ( Σ_{t=0}^∞ β^t (1 − p_{t,0}) ) ] · u1(θ, s, a_i) = u1(θ, s, π2,s(ν_s, τ_s, β)).

A.3 Proof of Lemma 3
Lemma 3 : Let regular prior g , types θ , θ , and signal s be fixed. For every (cid:15) >
0, there exists
C > γ < ≤ δ < γ ≤ γ <
1, and n ≥
1, if π ( s | θ ) ≥ π ( s | θ ) and π ( s | θ ) ≥ (1 − γ ) nC , then R [ π ](BR( P θ .θ , s ) | s ) ≥ − n − (cid:15). We invoke Theorem 2 of Fudenberg, He, and Imhof (2017), which in our setting says:
Let regular prior g and signal s be fixed. Let < (cid:15), h < . There exists C such thatwhenever π ( s | θ ) ≥ π ( s | θ ) and t · π ( s | θ ) ≥ C , we get ψ π y ∈ Y [ t ] : p ( θ | s ; y ) p ( θ | s ; y ) ≤ − h · λ ( θ ) λ ( θ ) ! /ψ π ( Y [ t ]) ≥ − (cid:15) where p ( θ | s ; y ) refers to the conditional probability that a sender of s is type θ ac-cording to the posterior belief induced by history y . That is, if at age t a receiver would have observed in expectation C instances of type θ sending s , then the belief of at least 1 − (cid:15) fraction of age t receivers (essentially) falls in P θ .θ afterseeing the signal s . The proof of Lemma 3 calculates what fraction of receivers meets this “agerequirement.” Proof.
We will show the following stronger result:

Let regular prior $g$, types $\theta', \theta''$, and signal $s'$ be fixed. For every $\epsilon > 0$, there exists $C > 0$ such that for all $0 \le \delta, \gamma < 1$ and $n \ge 1$, if $\pi_1(s'|\theta') \ge \pi_1(s'|\theta'')$ and $\pi_1(s'|\theta') \ge (1-\gamma) n C$, then

$$R_2[\pi](\mathrm{BR}(P_{\theta'.\theta''}, s') \,|\, s') \ge \gamma^{\lceil 1/(n(1-\gamma)) \rceil} - \epsilon.$$

The lemma follows because we may pick a large enough $\underline{\gamma} < 1$ so that $\gamma^{\lceil 1/(n(1-\gamma)) \rceil} > 1 - \frac{1}{n}$ for all $n \ge 1$ whenever $\gamma \ge \underline{\gamma}$.

For each $0 < h < 1$, define

$$P^{h}_{\theta'.\theta''} := \left\{ p \in \Delta(\Theta) : \frac{p(\theta'')}{p(\theta')} \le (1-h) \cdot \frac{\lambda(\theta'')}{\lambda(\theta')} \right\},$$

with the convention that $\frac{0}{0} = 0$. Then it is clear that each $P^{h}_{\theta'.\theta''}$, as well as $P_{\theta'.\theta''}$ itself, is a closed subset of $\Delta(\Theta)$. Also, $P^{h}_{\theta'.\theta''} \to P_{\theta'.\theta''}$ as $h \to 0$. Fix $a \in A$. If for all $\bar{h} > 0$ there exists $0 < h \le \bar{h}$ so that $a \in \mathrm{BR}(P^{h}_{\theta'.\theta''}, s')$, then $a \in \mathrm{BR}(P_{\theta'.\theta''}, s')$ as well, due to the best-response correspondence having a closed graph. This means that, for each $a \notin \mathrm{BR}(P_{\theta'.\theta''}, s')$, there exists $\bar{h}_a > 0$ so that $a \notin \mathrm{BR}(P^{h}_{\theta'.\theta''}, s')$ whenever $0 < h \le \bar{h}_a$. Let $\bar{h} := \min_{a \notin \mathrm{BR}(P_{\theta'.\theta''}, s')} \bar{h}_a$. Let $\epsilon > 0$ be given. Apply Theorem 2 of Fudenberg, He, and Imhof (2017) with $\epsilon$ and $\bar{h}$ to find the constant $C$.

When $\pi_1(s'|\theta') \ge \pi_1(s'|\theta'')$ and $\pi_1(s'|\theta') \ge (1-\gamma) n C$, consider an age-$t$ receiver for $t \ge \lceil 1/(n(1-\gamma)) \rceil$. Since $t \cdot \pi_1(s'|\theta') \ge C$, Theorem 2 of Fudenberg, He, and Imhof (2017) implies there is probability at least $1-\epsilon$ that this receiver's belief about the types who send $s'$ falls in $P^{\bar{h}}_{\theta'.\theta''}$. By the construction of $\bar{h}$, $\mathrm{BR}(P^{\bar{h}}_{\theta'.\theta''}, s') = \mathrm{BR}(P_{\theta'.\theta''}, s')$, so a $1-\epsilon$ fraction of age-$t$ receivers have a history $y$ where $\sigma_2(y)(s') \in \mathrm{BR}(P_{\theta'.\theta''}, s')$.

Since agents survive between periods with probability $\gamma$, the mass of the receiver population aged $\lceil 1/(n(1-\gamma)) \rceil$ or older is

$$(1-\gamma) \cdot \sum_{t = \lceil 1/(n(1-\gamma)) \rceil}^{\infty} \gamma^t = \gamma^{\lceil 1/(n(1-\gamma)) \rceil}.$$

This shows

$$R_2[\pi](\mathrm{BR}(P_{\theta'.\theta''}, s') \,|\, s') \ge \gamma^{\lceil 1/(n(1-\gamma)) \rceil} \cdot (1-\epsilon) \ge \gamma^{\lceil 1/(n(1-\gamma)) \rceil} - \epsilon,$$

as desired.

A.4 Proof of Proposition 2
Proposition 2: $\pi^* \in \Pi^*(g, \delta, \gamma)$ if and only if $R_1^{g,\delta,\gamma}[\pi^*] = \pi_1^*$ and $R_2^{g,\delta,\gamma}[\pi^*] = \pi_2^*$.

Proof. If: Suppose $\pi^*$ is such that $R_1[\pi^*] = \pi_1^*$ and $R_2[\pi^*] = \pi_2^*$. Consider the state $\psi^*$ defined as $\psi^*_\theta := \psi^{\pi^*}_\theta$ for each $\theta$ and $\psi^*_2 := \psi^{\pi^*}_2$. Then, by construction, $\sigma_\theta(\psi^{\pi^*}_\theta) = \pi^*_\theta$ and $\sigma_2(\psi^{\pi^*}_2) = \pi_2^*$, so the state $\psi^*$ gives rise to $\pi^*$. To verify that $\psi^*$ is a steady state, we can expand by the definition of $\psi^{\pi^*}_\theta$,

$$f_\theta(\psi^{\pi^*}_\theta, \pi^*) = f_\theta\left( \lim_{T \to \infty} f^T_\theta(\tilde{\psi}_\theta, \pi^*), \ \pi^* \right),$$

where $\tilde{\psi}_\theta$ is any arbitrary initial state. Since $f_\theta$ is continuous at $\psi^{\pi^*}_\theta$ in the $L^1$ distance defined in Footnote 20 (this is implied by Step 1 of the proof of Proposition 3 in the Online Appendix, which shows $f_\theta$ is continuous at all states that assign $(1-\gamma)\gamma^t$ mass to the set of length-$t$ histories), $\lim_{T \to \infty} f^T_\theta(\tilde{\psi}_\theta, \pi^*) = \psi^{\pi^*}_\theta$ is a fixed point of $f_\theta(\cdot, \pi^*)$. To see this, write $\psi^{(T)}_\theta := f^T_\theta(\tilde{\psi}_\theta, \pi^*)$ for each $T \ge 1$. Given $\epsilon > 0$, continuity of $f_\theta$ implies there is $\zeta > 0$ so that $d(f_\theta(\psi^{\pi^*}_\theta, \pi^*), f_\theta(\psi^{(T)}_\theta, \pi^*)) < \epsilon/2$ whenever $d(\psi^{\pi^*}_\theta, \psi^{(T)}_\theta) < \zeta$. So pick a large enough $T$ so that $d(\psi^{\pi^*}_\theta, \psi^{(T)}_\theta) < \zeta$ and also $d(\psi^{\pi^*}_\theta, \psi^{(T+1)}_\theta) < \epsilon/2$. Then

$$d(f_\theta(\psi^{\pi^*}_\theta, \pi^*), \psi^{\pi^*}_\theta) \le d(f_\theta(\psi^{\pi^*}_\theta, \pi^*), f_\theta(\psi^{(T)}_\theta, \pi^*)) + d(\psi^{(T+1)}_\theta, \psi^{\pi^*}_\theta) < \epsilon/2 + \epsilon/2 = \epsilon.$$

Since $\epsilon > 0$ was arbitrary, $f_\theta(\psi^{\pi^*}_\theta, \pi^*) = \psi^{\pi^*}_\theta$, and a similar argument shows $f_2(\psi^{\pi^*}_2, \pi^*) = \psi^{\pi^*}_2$. This tells us $\psi^* = ((\psi^{\pi^*}_\theta)_{\theta \in \Theta}, \psi^{\pi^*}_2)$ is a steady state.

Only if: Conversely, suppose $\pi^* \in \Pi^*(g, \delta, \gamma)$. Then there exists a steady state $\psi^* \in \Psi^*(g, \delta, \gamma)$ such that $\pi^* = \sigma(\psi^*)$. This means $f_\theta(\psi^*_\theta, \pi^*) = \psi^*_\theta$, so iterating shows $\psi^{\pi^*}_\theta := \lim_{T \to \infty} f^T_\theta(\psi^*_\theta, \pi^*) = \psi^*_\theta$. Since $R_1[\pi^*](\cdot|\theta) := \sigma_\theta(\psi^{\pi^*}_\theta)$, the above implies $R_1[\pi^*](\cdot|\theta) = \sigma_\theta(\psi^*_\theta) = \pi_1^*(\cdot|\theta)$ by the choice of $\psi^*$. We can similarly show $R_2[\pi^*] = \pi_2^*$.

A.5 Proof of Theorem 3
Throughout this subsection, we will make use of the following version of Hoeffding’s inequality.
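As a quick sense check of how the bound behaves (a Monte Carlo sketch with arbitrary toy parameters, not part of the argument), the empirical frequency of a large deviation of a sum of bounded i.i.d. draws indeed stays below the exponential bound stated next:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, trials = 200, 30.0, 20_000
# X_i uniform on [0, 1]: a_i = 0, b_i = 1, so sum_i (b_i - a_i)^2 = n.
S = rng.random((trials, n)).sum(axis=1)
freq = np.mean(np.abs(S - n / 2) >= d)   # empirical deviation frequency
bound = 2 * np.exp(-2 * d**2 / n)        # Hoeffding upper bound
print(freq <= bound)
```

Here a deviation of $d = 30$ is about seven standard deviations of the sum, so the empirical frequency is essentially zero, well inside the bound of roughly $2e^{-9}$.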
Fact. (Hoeffding's inequality) Suppose $X_1, \ldots, X_n$ are independent random variables on $\mathbb{R}$ such that $a_i \le X_i \le b_i$ with probability 1 for each $i$. Write $S_n := \sum_{i=1}^n X_i$. Then,

$$\mathbb{P}\left[ |S_n - \mathbb{E}[S_n]| \ge d \right] \le 2 \exp\left( \frac{-2 d^2}{\sum_{i=1}^n (b_i - a_i)^2} \right).$$

Lemma A.1.
In strategy profile $\pi^*$, suppose $s^*$ is on-path and $\pi_2^*(a^*|s^*) = 1$, where $a^*$ is a strict best response to $s^*$ given $\pi^*$. Then there exists $N \in \mathbb{R}$ so that, for any regular prior $g$ and any sequence of steady-state strategy profiles $\pi^{(k)} \in \Pi^*(g, \delta_k, \gamma_k)$ where $\gamma_k \to 1$, $\pi^{(k)} \to \pi^*$, there exists $K \in \mathbb{N}$ such that whenever $k \ge K$, we have $\pi_2^{(k)}(a^*|s^*) \ge 1 - (1-\gamma_k) \cdot N$.

Proof. Since $a^*$ is a strict best response after $s^*$ for $\pi^*$, there exists $\epsilon > 0$ so that $a^*$ will continue to be a strict best response after $s^*$ for any $\pi \in \Pi$ where, for every $\theta \in \Theta$, $|\pi_1(s^*|\theta) - \pi_1^*(s^*|\theta)| < 3\epsilon$. Since $\pi^{(k)} \to \pi^*$, find a large enough $K$ such that $k \ge K$ implies, for every $\theta \in \Theta$, $|\pi_1^{(k)}(s^*|\theta) - \pi_1^*(s^*|\theta)| < \epsilon$.

Write $e^{\mathrm{obs}}_{n,\theta}$ for the probability that an age-$n$ receiver has encountered type $\theta$ fewer than $\frac{n\lambda(\theta)}{2}$ times. We will find a number $N^{\mathrm{obs}} < \infty$ so that

$$\sum_{\theta \in \Theta} \sum_{n=1}^{\infty} e^{\mathrm{obs}}_{n,\theta} \le N^{\mathrm{obs}}.$$

Fix some $\theta \in \Theta$. Write $Z^{(\theta)}_t \in \{0,1\}$ as the indicator random variable for whether the receiver sees a type $\theta$ in period $t$ of his life, and write $S_n := \sum_{t=1}^n Z^{(\theta)}_t$ for the total number of type $\theta$ encountered up to age $n$. We have $\mathbb{E}[S_n] = n\lambda(\theta)$, so we can use Hoeffding's inequality to bound $e^{\mathrm{obs}}_{n,\theta}$:

$$e^{\mathrm{obs}}_{n,\theta} \le \mathbb{P}\left[ |S_n - \mathbb{E}[S_n]| \ge \frac{n\lambda(\theta)}{2} \right] \le 2 \exp\left( \frac{-2 \cdot [n\lambda(\theta)/2]^2}{n} \right).$$

This shows $e^{\mathrm{obs}}_{n,\theta}$ tends to 0 at the same rate as $\exp(-n)$, so

$$\sum_{n=1}^{\infty} e^{\mathrm{obs}}_{n,\theta} \le \sum_{n=1}^{\infty} 2 \exp\left( \frac{-2 \cdot [n\lambda(\theta)/2]^2}{n} \right) =: N^{\mathrm{obs}}_\theta < \infty.$$

So we set $N^{\mathrm{obs}} := \sum_{\theta \in \Theta} N^{\mathrm{obs}}_\theta$.

Next, write $e^{\mathrm{bias},k}_{n,\theta}$ for the probability that, after observing $\lfloor \frac{n\lambda(\theta)}{2} \rfloor$ i.i.d. draws from $\pi_1^{(k)}(\cdot|\theta)$, the empirical frequency of signal $s^*$ differs from $\pi_1^*(s^*|\theta)$ by $2\epsilon$ or more. So again, write $Z^{\theta,k}_t \in \{0,1\}$ to indicate if the $t$-th draw resulted in signal $s^*$, with $\mathbb{E}[Z^{\theta,k}_t] = \pi_1^{(k)}(s^*|\theta)$, and put $S_{n,k} := \sum_{t=1}^{\lfloor n\lambda(\theta)/2 \rfloor} Z^{\theta,k}_t$ for the total number of $s^*$ out of $\lfloor \frac{n\lambda(\theta)}{2} \rfloor$ draws. We have $\mathbb{E}[S_{n,k}] = \lfloor \frac{n\lambda(\theta)}{2} \rfloor \cdot \pi_1^{(k)}(s^*|\theta)$, but $|\pi_1^{(k)}(s^*|\theta) - \pi_1^*(s^*|\theta)| < \epsilon$ whenever $k \ge K$. That means, for $k \ge K$,

$$e^{\mathrm{bias},k}_{n,\theta} := \mathbb{P}\left[ \left| \frac{S_{n,k}}{\lfloor n\lambda(\theta)/2 \rfloor} - \pi_1^*(s^*|\theta) \right| \ge 2\epsilon \right] \le \mathbb{P}\left[ \left| \frac{S_{n,k}}{\lfloor n\lambda(\theta)/2 \rfloor} - \pi_1^{(k)}(s^*|\theta) \right| \ge \epsilon \right] = \mathbb{P}\left[ |S_{n,k} - \mathbb{E}[S_{n,k}]| \ge \lfloor n\lambda(\theta)/2 \rfloor \cdot \epsilon \right] \le 2 \exp\left( \frac{-2 \cdot (\lfloor n\lambda(\theta)/2 \rfloor \cdot \epsilon)^2}{\lfloor n\lambda(\theta)/2 \rfloor} \right)$$

by Hoeffding's inequality. Let

$$N^{\mathrm{bias}}_\theta := \sum_{n=1}^{\infty} 2 \exp\left( \frac{-2 \cdot (\lfloor n\lambda(\theta)/2 \rfloor \cdot \epsilon)^2}{\lfloor n\lambda(\theta)/2 \rfloor} \right),$$

with $N^{\mathrm{bias}}_\theta < \infty$ since the summand tends to 0 at the same rate as $\exp(-n)$. This argument shows that, whenever $k \ge K$, we have $\sum_{n=1}^{\infty} e^{\mathrm{bias},k}_{n,\theta} \le N^{\mathrm{bias}}_\theta$. Now let $N^{\mathrm{bias}} := \sum_{\theta \in \Theta} N^{\mathrm{bias}}_\theta$.

Finally, since $g$ is regular, we appeal to Proposition 1 of Fudenberg, He, and Imhof (2017) to see that there exists some $N'$ so that whenever the receiver has a data set of size $n \ge N'$ on type $\theta$'s play, his Bayesian posterior as to the probability that $\theta$ plays $s^*$ differs from the empirical distribution by no more than $\epsilon$. Put $N^{\mathrm{age}} := \frac{2 N'}{\min_{\theta \in \Theta} \lambda(\theta)}$.

Consider any steady state $\psi^{(k)}$ with $k \ge K$. With probability no smaller than $1 - \sum_{\theta \in \Theta} e^{\mathrm{bias},k}_{n,\theta}$, an age-$n$ receiver who has seen at least $\frac{n\lambda(\theta)}{2}$ instances of type $\theta$ for every $\theta \in \Theta$ will have an empirical distribution such that every type's probability of playing $s^*$ differs from $\pi_1^*(s^*|\theta)$ by less than $2\epsilon$. If, furthermore, $n \ge N^{\mathrm{age}}$, then in fact $\frac{n\lambda(\theta)}{2} \ge N'$ for each $\theta$, so the same probability bound applies to the event that the receiver's Bayesian posterior on every type $\theta$ playing $s^*$ is closer than $3\epsilon$ to $\pi_1^*(s^*|\theta)$. By the construction of $\epsilon$, playing $a^*$ after $s^*$ is the unique best response to such a posterior.

Therefore, for $k \ge K$, the probability that the receiver population plays some action other than $a^*$ after $s^*$ in $\psi^{(k)}$ is bounded by

$$N^{\mathrm{age}} (1-\gamma_k) + (1-\gamma_k) \cdot \sum_{n=0}^{\infty} \gamma_k^n \cdot \sum_{\theta \in \Theta} \left( e^{\mathrm{obs}}_{n,\theta} + e^{\mathrm{bias},k}_{n,\theta} \right).$$

To explain this expression: receivers aged $N^{\mathrm{age}}$ or younger account for no more than $N^{\mathrm{age}}(1-\gamma_k)$ of the population. Among the age-$n$ receivers, no more than a $\sum_{\theta \in \Theta} e^{\mathrm{obs}}_{n,\theta}$ fraction has a sample size smaller than $\frac{n\lambda(\theta)}{2}$ for some type $\theta$, while $\sum_{\theta \in \Theta} e^{\mathrm{bias},k}_{n,\theta}$ is an upper bound on the probability (conditional on having a large enough sample) of having a biased enough sample so that some type's empirical frequency of playing $s^*$ differs by $2\epsilon$ or more from $\pi_1^*(s^*|\theta)$.

But since $\gamma_k \in [0,1)$,

$$\sum_{n=0}^{\infty} \gamma_k^n \cdot \sum_{\theta \in \Theta} e^{\mathrm{obs}}_{n,\theta} < \sum_{n=0}^{\infty} \sum_{\theta \in \Theta} e^{\mathrm{obs}}_{n,\theta} \le N^{\mathrm{obs}} \quad \text{and} \quad \sum_{n=0}^{\infty} \gamma_k^n \cdot \sum_{\theta \in \Theta} e^{\mathrm{bias},k}_{n,\theta} < \sum_{n=0}^{\infty} \sum_{\theta \in \Theta} e^{\mathrm{bias},k}_{n,\theta} \le N^{\mathrm{bias}}.$$

We conclude that whenever $k \ge K$,

$$\pi_2^{(k)}(a^*|s^*) \ge 1 - (1-\gamma_k) \cdot (N^{\mathrm{age}} + N^{\mathrm{obs}} + N^{\mathrm{bias}}).$$

Finally, observe that none of $N^{\mathrm{age}}, N^{\mathrm{obs}}, N^{\mathrm{bias}}$ depends on the sequence $\pi^{(k)}$, so $N := N^{\mathrm{age}} + N^{\mathrm{obs}} + N^{\mathrm{bias}}$ is chosen independent of the sequence $\pi^{(k)}$.

Lemma A.2.
Assume $g$ is regular. Suppose there are some $a^* \in A$ and $v \in \mathbb{R}$ so that $u(\theta, s^*, a^*) > v$. Then there exist $C_1 \in (0,1)$, $C_2 > 0$ so that, in every sender history $y_\theta$, $\#(s^*, a^*|y_\theta) \ge C_1 \cdot \#(s^*|y_\theta) + C_2$ implies $\mathbb{E}[u(\theta, s^*, \pi_2(\cdot|s^*)) \,|\, y_\theta] > v$, where $\#(s^*|y_\theta)$ counts the periods of $y_\theta$ in which the sender played $s^*$, and $\#(s^*, a^*|y_\theta)$ counts those in which he played $s^*$ and the receiver responded with $a^*$.

Proof. Write $\underline{u} := \min_{a \in A} u(\theta, s^*, a)$. There exists $q \in (0,1)$ so that

$$q \cdot u(\theta, s^*, a^*) + (1-q) \cdot \underline{u} > v.$$

Find a small enough $\epsilon > 0$ so that $0 < \frac{q}{1-\epsilon} < 1$. Since $g$ is regular, Proposition 1 of Fudenberg, He, and Imhof (2017) tells us there exists some $C$ so that the posterior probability that a sender with history $y_\theta$ assigns to the receiver playing $a^*$ in response to $s^*$ is no less than

$$(1-\epsilon) \cdot \frac{\#(s^*, a^*|y_\theta)}{\#(s^*|y_\theta) + C}.$$

Whenever this probability is at least $q$, the expected payoff to $\theta$ of playing $s^*$ exceeds $v$. That is, it suffices to have

$$(1-\epsilon) \cdot \frac{\#(s^*, a^*|y_\theta)}{\#(s^*|y_\theta) + C} \ge q \iff \#(s^*, a^*|y_\theta) \ge \frac{q}{1-\epsilon} \cdot \#(s^*|y_\theta) + \frac{q}{1-\epsilon} \cdot C.$$

Putting $C_1 := \frac{q}{1-\epsilon}$ and $C_2 := \frac{q}{1-\epsilon} \cdot C$ proves the lemma.

Lemma A.3.
Let $Z_t$ be i.i.d. Bernoulli random variables, where $\mathbb{E}[Z_t] = 1 - \epsilon$. Write $S_n := \sum_{t=1}^n Z_t$. For $0 < C_1 < 1$ and $C_2 > 0$, there exist $\bar{\epsilon}, G_1, G_2 > 0$ such that whenever $0 < \epsilon < \bar{\epsilon}$,

$$\mathbb{P}\left[ S_n \ge C_1 n + C_2 \ \ \forall n \ge G_1 \right] \ge 1 - G_2 \epsilon.$$

Proof. We make use of a lemma from Fudenberg and Levine (2006), which in turn extends some inequalities from Billingsley (1995).
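Alongside this probabilistic input, the argument below rests on an elementary deterministic inequality: with $\bar\epsilon := (1-C_1)/2$ and $G_1 := 2C_2/\bar\epsilon$, every $\epsilon < \bar\epsilon$ and $n \ge G_1$ satisfy $(1-\epsilon-C_1)n - C_2 \ge \bar\epsilon n/2$. A numerical check of this inequality over hypothetical values of $(C_1, C_2)$ (the constants below are arbitrary, chosen only for illustration):

```python
import math

# Deterministic inequality used in Lemma A.3's proof, for hypothetical (C1, C2):
# (1 - eps - C1) * n - C2 >= (eps_bar / 2) * n for all n >= G1 and eps < eps_bar.
C1, C2 = 0.7, 5.0
eps_bar = (1 - C1) / 2
G1 = 2 * C2 / eps_bar
worst = min(
    (1 - eps - C1) * n - C2 - 0.5 * eps_bar * n
    for eps in (0.1 * eps_bar, 0.5 * eps_bar, 0.99 * eps_bar)
    for n in range(math.ceil(G1), math.ceil(G1) + 500)
)
print(worst >= 0)
```

The worst case occurs at $\epsilon$ close to $\bar\epsilon$ and $n$ close to $G_1$, where the margin is smallest but still nonnegative.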
FL06 Lemma A.1: Suppose $\{X_k\}$ is a sequence of i.i.d. Bernoulli random variables with $\mathbb{E}[X_k] = \mu$, and define for each $n$ the random variable $\bar{S}_n := |\sum_{k=1}^n (X_k - \mu)| / n$. Then for any $\underline{n}, \bar{n} \in \mathbb{N}$,

$$\mathbb{P}\left[ \max_{\underline{n} \le n \le \bar{n}} \bar{S}_n > \epsilon \right] \le \frac{\mu}{3 \, \underline{n} \, \epsilon^2}.$$

For every $G_1 > 0$ and $0 < \epsilon < 1$,

$$\mathbb{P}\left[ S_n \ge C_1 n + C_2 \ \forall n \ge G_1 \right] = 1 - \mathbb{P}\left[ (\exists n \ge G_1)\ \sum_{t=1}^n Z_t < C_1 n + C_2 \right] = 1 - \mathbb{P}\left[ (\exists n \ge G_1)\ \sum_{t=1}^n (X_t - \epsilon) > (1 - \epsilon - C_1) n - C_2 \right],$$

where $X_t := 1 - Z_t$. Let $\bar{\epsilon} := \frac{1 - C_1}{2}$ and $G_1 := 2 C_2 / \bar{\epsilon}$. Suppose $0 < \epsilon < \bar{\epsilon}$. Then for every $n \ge G_1$,

$$(1 - \epsilon - C_1) n - C_2 \ge \bar{\epsilon} n - C_2 \ge \frac{1}{2} \bar{\epsilon} n.$$

Hence,

$$\mathbb{P}\left[ S_n \ge C_1 n + C_2 \ \forall n \ge G_1 \right] \ge 1 - \mathbb{P}\left[ (\exists n \ge G_1)\ \sum_{t=1}^n (X_t - \epsilon) > \frac{1}{2} \bar{\epsilon} n \right],$$

and, by FL06 Lemma A.1 (applied with $\mu = \epsilon$ and $\underline{n} = G_1$), the probability on the right-hand side is at most $G_2 \epsilon$ with $G_2 := \frac{4}{3 G_1 \bar{\epsilon}^2}$.

We now prove Theorem 3.
Theorem 3: Suppose $\pi^*$ is on-path strict for the receiver and patiently stable. Then it satisfies the strong compatibility criterion.

Proof.
Let some $a \notin \mathrm{BR}(\Delta(\tilde{J}(s', \pi^*)), s')$ and $h > 0$ be given. We will show that $\pi_2^*(a|s') \le 3h$; since $h > 0$ is arbitrary, this proves the theorem.

Step 1: Defining the constants $\xi$, $\theta_J$, $a_\theta$, $s_\theta$, $C_1$, $C_2$, $G_1$, $G_2$, and $N^{\mathrm{recv}}$.

(i) For each $\xi > 0$, define the $\xi$-approximations to $\Delta(\tilde{J}(s', \pi^*))$ as the probability distributions with weight no more than $\xi$ on types outside of $\tilde{J}(s', \pi^*)$,

$$\Delta^{\xi}(\tilde{J}(s', \pi^*)) := \left\{ p \in \Delta(\Theta) : p(\theta) \le \xi \ \forall \theta \notin \tilde{J}(s', \pi^*) \right\}.$$

Because the best-response correspondence has a closed graph, there exists some $\xi > 0$ such that $a \notin \mathrm{BR}(\Delta^{\xi}(\tilde{J}(s', \pi^*)), s')$.

(ii) Since $\tilde{J}(s', \pi^*)$ is nonempty, we can fix some $\theta_J \in \tilde{J}(s', \pi^*)$.

(iii) For each equilibrium-dominated type $\theta \in \Theta \setminus \tilde{J}(s', \pi^*)$, identify some on-path signal $s_\theta$ so that $\pi_1^*(s_\theta|\theta) > 0$. By the assumption of on-path strictness for the receiver, there is some $a_\theta \in A$ so that $\pi_2^*(a_\theta|s_\theta) = 1$, and furthermore, $a_\theta$ is the strict best response to $s_\theta$ in $\pi^*$. By the definition of equilibrium dominance,

$$u(\theta, s_\theta, a_\theta) > \max_{a \in A} u(\theta, s', a) =: v_\theta.$$

By applying Lemma A.2 to each $\theta \in \Theta \setminus \tilde{J}(s', \pi^*)$, we obtain some $C_1 \in (0,1)$, $C_2 > 0$ so that, for every $\theta \in \Theta \setminus \tilde{J}(s', \pi^*)$ and in every sender history $y_\theta$, $\#(s_\theta, a_\theta|y_\theta) \ge C_1 \cdot \#(s_\theta|y_\theta) + C_2$ implies $\mathbb{E}[u(\theta, s_\theta, \pi_2(\cdot|s_\theta)) \,|\, y_\theta] > v_\theta$.

(iv) By Lemma A.3, find $\bar{\epsilon}, G_1, G_2 > 0$ such that if the $Z_t$ with $\mathbb{E}[Z_t] = 1-\epsilon$ are i.i.d. Bernoulli and $S_n := \sum_{t=1}^n Z_t$, then whenever $0 < \epsilon < \bar{\epsilon}$,

$$\mathbb{P}\left[ S_n \ge C_1 n + C_2 \ \forall n \ge G_1 \right] \ge 1 - G_2 \epsilon.$$

(v) Because at $\pi^*$, $a_\theta$ is a strict best response to $s_\theta$ for every $\theta \in \Theta \setminus \tilde{J}(s', \pi^*)$, from Lemma A.1 we may find an $N^{\mathrm{recv}}$ so that for each sequence $\pi^{(k)} \in \Pi^*(g, \delta_k, \gamma_k)$ where $\gamma_k \to 1$, $\pi^{(k)} \to \pi^*$, there corresponds $K^{\mathrm{recv}} \in \mathbb{N}$ so that $k \ge K^{\mathrm{recv}}$ implies $\pi_2^{(k)}(a_\theta|s_\theta) \ge 1 - (1-\gamma_k) \cdot N^{\mathrm{recv}}$ for every $\theta \in \Theta \setminus \tilde{J}(s', \pi^*)$.

Step 2: Two conditions to ensure that all but $3h$ of the receivers believe in $\Delta^{\xi}(\tilde{J}(s', \pi^*))$.

Consider some steady state $\psi \in \Psi^*(g, \delta, \gamma)$ for $g$ regular, $\delta, \gamma \in [0,1)$, and write $\pi = \sigma(\psi)$. Put

$$c := \frac{\max_{\theta \in \Theta} \lambda(\theta)}{\xi \cdot \lambda(\theta_J)}.$$

Appealing to Theorem 2 of Fudenberg, He, and Imhof (2017) as in the proof of Lemma 3, we conclude that there exists some $N^{\mathrm{rare}}$ (not dependent on $\psi$) such that whenever

$$\pi_1(s'|\theta_J) \ge c \cdot \pi_1(s'|\theta_D) \ \text{ for every equilibrium-dominated type } \theta_D \notin \tilde{J}(s', \pi^*), \quad \text{and} \quad n \cdot \pi_1(s'|\theta_J) \ge N^{\mathrm{rare}}, \tag{7}$$

then an age-$n$ receiver in steady state $\psi$ where $\pi = \sigma(\psi)$ has probability at least $1-h$ of holding a posterior belief $g(\cdot|y)$ such that $\theta_J$ is at least $c$ times as likely to play $s'$ as $\theta_D$ is, for every $\theta_D \notin \tilde{J}(s', \pi^*)$. Such a history $y$ generates a posterior belief after $s'$, $p(\cdot|s'; y)$, with

$$\frac{p(\theta_D|s'; y)}{p(\theta_J|s'; y)} \le \frac{\lambda(\theta_D)}{\lambda(\theta_J)} \cdot \frac{\xi \cdot \lambda(\theta_J)}{\max_{\theta \in \Theta} \lambda(\theta)} \le \xi.$$
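Whether a receiver meets the age requirement in Equation (7) is governed by the geometric age distribution: with survival rate $\gamma$, the share of receivers younger than a cutoff $\bar n$ is $1 - \gamma^{\bar n}$, and for the cutoff of order $h/(1-\gamma)$ used shortly this share tends to $1 - e^{-h} < h$ as $\gamma \to 1$. A quick numerical check of this geometric fact, with arbitrary toy values of $h$ and $\gamma$:

```python
import math

# Share of receivers younger than h / (1 - gamma) under survival rate gamma;
# as gamma approaches 1 this tends to 1 - exp(-h), which is below 2h.
h = 0.05
max_young = max(
    1 - gamma ** math.ceil(h / (1 - gamma)) for gamma in (0.999, 0.9999)
)
print(max_young <= 2 * h)
```

At both values of $\gamma$ the young-receiver share is roughly $1 - e^{-0.05} \approx 0.049$, comfortably below $2h = 0.1$.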
In particular, $p(\cdot|s'; y)$ must assign weight no greater than $\xi$ to each type not in $\tilde{J}(s', \pi^*)$; therefore, the belief belongs to $\Delta^{\xi}(\tilde{J}(s', \pi^*))$. By the construction of $\xi$, $a$ is then not a best response to $s'$ after history $y$.

A receiver whose age $n$ satisfies Equation (7) plays $a$ with probability less than $h$, provided $\pi_1(s'|\theta_J) \ge c \cdot \pi_1(s'|\theta_D)$ for every $\theta_D \notin \tilde{J}(s', \pi^*)$. However, to bound the overall probability of $a$ in the entire receiver population in steady state $\psi$, we ensure that Equation (7) is satisfied for all except a $2h$ fraction of receivers in $\psi$. We claim that when $\gamma$ is large enough, a sufficient condition is for $\pi = \sigma(\psi)$ to satisfy $\pi_1(s'|\theta_J) \ge (1-\gamma) N^*$ for some $N^* \ge N^{\mathrm{rare}}/h$. This is because under this condition, any agent aged $n \ge \frac{h}{1-\gamma}$ satisfies Equation (7), while the fraction of receivers younger than $\frac{h}{1-\gamma}$ is

$$1 - \gamma^{\lceil h/(1-\gamma) \rceil} \le 2h$$

for $\gamma$ near enough to 1.

To summarize, in Step 2 we have found a constant $N^{\mathrm{rare}}$ and shown that if $\gamma$ is near enough to 1, then $\pi = \sigma(\psi)$ has $\pi_2(a|s') \le 3h$ if the following two conditions are satisfied:

(C1) $\pi_1(s'|\theta_J) \ge c \cdot \pi_1(s'|\theta_D)$ for every equilibrium-dominated type $\theta_D \notin \tilde{J}(s', \pi^*)$;

(C2) $\pi_1(s'|\theta_J) \ge (1-\gamma) N^*$ for some $N^* \ge N^{\mathrm{rare}}/h$.

In the following step, we show there is a sequence of steady states $\psi^{(k)} \in \Psi^*(g, \delta_k, \gamma_k)$ with $\delta_k \to 1$, $\gamma_k \to$
1, and $\sigma(\psi^{(k)}) = \pi^{(k)} \to \pi^*$ such that, in every $\pi^{(k)}$, the above two conditions are satisfied. Using the fact that $\gamma_k \to 1$, we conclude that, for large enough $k$, we get $\pi_2^{(k)}(a|s') \le 3h$, which in turn shows $\pi_2^*(a|s') \le 3h$ due to the convergence $\pi^{(k)} \to \pi^*$.

Step 3: Extracting a suitable subsequence of steady states.

In the statement of Lemma 4, put $\theta' := \theta_J$. We obtain some number $\epsilon$ and functions $\bar{\delta}(N)$, $\bar{\gamma}(N, \delta)$. Put

$$N^{\mathrm{ratio}} := \frac{G_2 \cdot N^{\mathrm{recv}}}{\xi} \cdot \frac{\max_{\theta \in \Theta} \lambda(\theta)}{\lambda(\theta_J)}$$

and $N^* := \max(N^{\mathrm{ratio}}, N^{\mathrm{rare}}/h)$.

Since $\pi^*$ is patiently stable, it can be written as the limit of some strategy profiles $\pi^* = \lim_{k \to \infty} \pi^{(k)}$, where each $\pi^{(k)}$ is $\delta_k$-stable with $\delta_k \to 1$. By the definition of $\delta$-stable, each $\pi^{(k)}$ is the limit $\pi^{(k)} = \lim_{j \to \infty} \pi^{(k,j)}$ with $\pi^{(k,j)} \in \Pi^*(g, \delta_k, \gamma_{k,j})$, where $\lim_{j \to \infty} \gamma_{k,j} = 1$. It is without loss to assume that for every $k \ge 1$, $\delta_k \ge \bar{\delta}(N^*)$, and that the $L^1$ distance between $\pi^{(k)}$ and $\pi^*$ is less than $\epsilon/2$. Now, for each $k$, find a large enough index $j(k)$ so that (i) $\gamma_{k,j(k)} \ge \bar{\gamma}(N^*, \delta_k)$, (ii) the $L^1$ distance between $\pi^{(k,j(k))}$ and $\pi^{(k)}$ is less than $\min(\epsilon/2, 1/k)$, and (iii) $\lim_{k \to \infty} \gamma_{k,j(k)} = 1$. This generates a sequence of $k$-indexed steady states, $\psi^{(k,j(k))} \in \Psi^*(g, \delta_k, \gamma_{k,j(k)})$. We will henceforth drop the dependence through the function $j(k)$ and just refer to $\psi^{(k)}$ and $\gamma_k$. The sequence $\psi^{(k)} \in \Psi^*(g, \delta_k, \gamma_k)$ satisfies: (1) $\delta_k \to 1$, $\gamma_k \to 1$