Player-Compatible Learning and Player-Compatible Equilibrium
Player-Compatible Equilibrium∗

Drew Fudenberg†  Kevin He‡

First version: September 23, 2017. This version: July 10, 2019
Abstract
Player-Compatible Equilibrium (PCE) imposes cross-player restrictions on the magnitudes of the players' "trembles" onto different strategies. These restrictions capture the idea that trembles correspond to deliberate experiments by agents who are unsure of the prevailing distribution of play. PCE selects intuitive equilibria in a number of examples where trembling-hand perfect equilibrium (Selten, 1975) and proper equilibrium (Myerson, 1978) have no bite. We show that rational learning and some near-optimal heuristics imply our compatibility restrictions in a steady-state setting.
Keywords: non-equilibrium learning, equilibrium refinements, trembling-hand perfect equilibrium, combinatorial bandits, Bayesian upper confidence bounds.

∗ We thank Alessandro Bonatti, Dan Clark, Glenn Ellison, Ben Golub, Shengwu Li, Dave Rand, Alex Wolitzky, and Muhamet Yildiz for valuable conversations and comments. We thank National Science Foundation grant SES 1643517 for financial support.
† Department of Economics, MIT. Email: [email protected]
‡ California Institute of Technology and University of Pennsylvania. Email: [email protected]

1 Introduction
Starting with Selten (1975), a number of papers have used the device of vanishingly small "trembles" to refine the set of Nash equilibria. This paper introduces player-compatible equilibrium (PCE), which extends this approach by imposing cross-player restrictions on these trembles in a way that is invariant to the utility representations of players' preferences over game outcomes. The heart of this refinement is the concept of player compatibility, which says player $i$ is more compatible with strategy $s_i^*$ than player $j$ is with strategy $s_j^*$ if whenever $s_j^*$ is optimal for $j$ against some correlated profile $\sigma$, $s_i^*$ is optimal for $i$ against any profile $\hat{\sigma}$ matching $\sigma$ in terms of the play of the third parties $-ij$. PCE requires that cross-player tremble magnitudes respect compatibility rankings. As we will explain, PCE interprets "trembles" as deliberate experiments to learn how others play, not as mistakes, and derives its cross-player tremble restrictions from an analysis of the relative frequencies of experiments that different players choose to undertake.

Section 2 defines PCE, studies its basic properties, and proves that PCE exist in all finite games. The compatibility relation is easiest to satisfy when $i$ and $j$ are "non-interacting," meaning that their payoffs do not depend on each other's play. But PCE can have bite even when all players interact with each other, provided that the interactions are not too strong. Moreover, as shown by the examples in Section 3, PCE can rule out seemingly implausible equilibria that other tremble-based refinements such as trembling-hand perfect equilibrium (Selten, 1975) and proper equilibrium (Myerson, 1978) cannot eliminate.

One of these examples is a "link-formation game," where players on each side decide whether or not to pay a cost to be Active and form links with all of the active players on the other side.
In the "anti-monotonic" version of the game, players who incur a higher private cost of link formation give lower benefits to their linked partners; in the "co-monotonic" version, higher-cost players give others higher benefits. We show that the only PCE outcome in the anti-monotonic version is for all players to choose Active, while in the co-monotonic case both "all Active" and "all Inactive" are PCE outcomes. In contrast, the equilibria that satisfy other equilibrium refinements do not depend on whether payoffs are anti-monotonic or co-monotonic.

PCE is defined for general strategic-form games, and stands on its own as a useful refinement of trembling-hand perfection. Moreover, PCE's compatibility restrictions on trembles are implied by models of learning in a class of extensive-form games we describe below. In our learning framework, agents are born into different player roles of a stage game and believe that they face an unknown, time-invariant distribution of opponents' play, as they would in a steady state of a model where a continuum of anonymous agents are randomly matched each period. Each agent only learns about others' play through her own payoffs at the end of the game. Because agents expect to play the game many times, they may choose to "experiment" and use myopically sub-optimal strategies for their informational value. The compatibility restriction on trembles then arises from the differences in the attractiveness of various experiments for different players. For example, in the link-formation game, an agent choosing Inactive always receives the same payoff and same information regardless of others' play, so they may try playing Active even if their prior belief is that the low-benefits counterparty is more likely to play Active than the high-benefits one. As is intuitive, we show that a low-cost agent has a stronger incentive to experiment with Active than a high-cost one does, and will do so more frequently against any mixed play of the counterparties.

The analysis of learning requires details about the extensive form that are not represented by the strategic form, and we are not able to capture its implications in general extensive forms. To make the analysis more tractable, Section 5 restricts attention to a class of "factorable" games, where repeatedly playing a given strategy $s_i$ would reveal all of the payoff consequences of that strategy and no information about the payoff consequences of any other strategy $s_i' \neq s_i$. This restriction implies that at any strategy profile $s$, if player $i$ potentially cares about the action taken at some information set $h$ of $-i$, then either $h$ is on the path of $s$ or $i$ can put $h$ onto the path of play via a unilateral deviation. Thus there is no possibility of learning being "blocked" by other players, and no "free riding" by learning from others' experiments. For simplicity we also require that each player moves at most once along any path of play. The three examples in Section 3 all satisfy these restrictions for generic extensive-form payoffs.

In factorable games, each agent faces a combinatorial bandit problem (see Section 5.2). We consider two related models of how agents deal with the trade-off between exploration and exploitation: the classic model of rational Bayesians maximizing discounted expected utility under the belief that the environment (the aggregate strategy distribution in the population) is constant, and the computationally simpler method of Bayesian upper-confidence bounds (Kaufmann, Cappé, and Garivier, 2012).
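To fix ideas about the upper-confidence-bound approach, here is a minimal one-dimensional sketch (ours, not from the paper). It simplifies in two ways that we flag as assumptions: the posterior over a strategy's payoff is Gaussian with a known standard deviation, and the quantile schedule is $1 - 1/(t+1)$ rather than the schedule analyzed by Kaufmann, Cappé, and Garivier (2012). The names `bayes_ucb_index` and `choose` are ours.

```python
from statistics import NormalDist

def bayes_ucb_index(mean, sd, t):
    # Bayes-UCB-style index for a Gaussian posterior: the posterior quantile
    # at level 1 - 1/(t + 1), where t counts past plays. The index exceeds
    # the posterior mean by more when the strategy is still poorly understood.
    level = 1 - 1 / (t + 1)
    return mean + NormalDist().inv_cdf(level) * sd

def choose(posteriors, t):
    # Index policy: play the strategy with the highest index. `posteriors`
    # maps each strategy name to (posterior mean, posterior std. deviation).
    return max(posteriors, key=lambda s: bayes_ucb_index(*posteriors[s], t))
```

For instance, with posteriors `{"A": (0.2, 1.0), "B": (0.5, 0.0)}` the policy exploits the known strategy B at `t = 1` (the 0.5 quantile adds no bonus) but experiments with the uncertain A at `t = 9`, since the 0.9 quantile of A's posterior exceeds 0.5. This is the sense in which "trembles" are deliberate experiments rather than mistakes.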
In both of these models, the agent uses an "index policy," meaning that they assign a numerical index to each strategy that depends only on past observations when that strategy was used, and then choose the strategy with the highest index. We formulate a compatibility condition for index policies, and show that any index policies for $i$ and $j$ satisfying this compatibility condition for strategies $s_i^*$ and $s_j^*$ will lead to $i$ experimenting relatively more with $s_i^*$ than $j$ with $s_j^*$. To complete the micro-foundation of PCE, we then show that the Bayes-optimal policy and the Bayes-UCB heuristic satisfy the compatibility condition for strategies $s_i^*$ and $s_j^*$ whenever $i$ is more compatible with $s_i^*$ than player $j$ is with strategy $s_j^*$ and the agents in roles $i$ and $j$ face comparable learning problems (e.g., start with the same patience level, same prior beliefs about the play of third parties, etc.). Briefly, upper confidence bound algorithms originated as computationally tractable algorithms for multi-armed bandit problems (Agrawal, 1995; Katehakis and Robbins, 1995). We consider a Bayesian version of the algorithm that keeps track of the learner's posterior beliefs about the payoffs of different strategies, first analyzed by Kaufmann, Cappé, and Garivier (2012). We say more about this procedure in Section 5. See Francetich and Kreps (2018) for a discussion of other heuristics for active learning.

1.1 Related Work

Tremble-based solution concepts date back to Selten (1975), who thanks Harsanyi for suggesting them. These solution concepts consider totally mixed strategy profiles where players do not play an exact best reply to the strategies of others, but may assign positive probability to some or all strategies that are not best replies. Different solution concepts in this class consider different kinds of "trembles," but they all make predictions based on the limits of these non-equilibrium strategy profiles as the probability of trembling tends to zero.
Since we compare PCE to these refinements below, we summarize them here for the reader's convenience. An $\epsilon$-perfect equilibrium is a totally mixed strategy profile where every non-best reply has weight less than $\epsilon$. A limit of $\epsilon_t$-perfect equilibria where $\epsilon_t \to 0$ is a trembling-hand perfect equilibrium. An $\epsilon$-proper equilibrium is a totally mixed strategy profile $\sigma$ where for every player $i$ and strategies $s_i$ and $s_i'$, if $U_i(s_i, \sigma_{-i}) < U_i(s_i', \sigma_{-i})$ then $\sigma_i(s_i) < \epsilon \cdot \sigma_i(s_i')$. A limit of $\epsilon_t$-proper equilibria where $\epsilon_t \to 0$ is a proper equilibrium; in this limit a more costly tremble is infinitely less likely than a less costly one, regardless of the cost difference. Approachable equilibrium (Van Damme, 1987) is also based on the idea that strategies with worse payoffs are played less often. It too is the limit of $\epsilon_t$-perfect equilibria, but where the players pay control costs to reduce their tremble probabilities. When these costs are "regular," all of the trembles are of the same order. Because PCE does not require that the less likely trembles are infinitely less likely than more likely ones, it is closer to approachable equilibrium than to proper equilibrium. The strategic stability concept of Kohlberg and Mertens (1986) is also defined using trembles, but applies to components of Nash equilibria as opposed to single strategy profiles.

Unlike the central feature of PCE, proper equilibrium and approachable equilibrium do not impose cross-player restrictions on the relative probabilities of various trembles. For this reason, when each type of the sender is viewed as a different player, these equilibrium concepts reduce to perfect Bayesian equilibrium in signaling games with two possible signals, such as the beer-quiche game of Cho and Kreps (1987). They do impose restrictions when applied to the ex-ante form of the game, i.e., at the stage before the sender has learned their type.
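The $\epsilon$-proper condition above is mechanical to verify for a given player once expected payoffs against the opponents' profile are in hand. A minimal sketch (ours, with hypothetical strategy labels; the helper name is not from the paper):

```python
def is_eps_proper_for_player(sigma_i, expected_payoff, eps):
    # sigma_i: a totally mixed strategy, mapping each pure strategy to its
    # probability. expected_payoff: expected utility of each pure strategy
    # against the (fixed) profile of the other players.
    # The eps-proper condition: any strictly worse strategy must receive less
    # than eps times the weight of the better one.
    return all(
        sigma_i[s] < eps * sigma_i[t]
        for s in sigma_i
        for t in sigma_i
        if expected_payoff[s] < expected_payoff[t]
    )
```

For example, the strategy `{"a": 0.99, "b": 0.01}` with payoffs `{"a": 1.0, "b": 0.0}` passes the check at $\epsilon = 0.1$ but fails at $\epsilon = 0.005$, illustrating how the condition tightens as $\epsilon \to 0$.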
However, as Cho and Kreps (1987) point out, evaluating the cost of mistakes at the ex-ante stage means that the interim losses are weighted by the prior distribution over sender types, so that less likely types are more likely to tremble. In addition, applying a different positive linear rescaling to each type's utility function preserves every type's preference over lotteries on outcomes, but changes the sets of proper and approachable equilibria, while such utility rescalings have no effect on the set of PCE. In light of these issues, when discussing tremble-based refinements in Bayesian games we will always apply them at the interim stage.

Like PCE, extended proper equilibrium (Milgrom and Mollner, 2017) places restrictions on the relative probabilities of trembles by different players, but it does so in a different way: An extended proper equilibrium is the limit of $(\beta, \epsilon_t)$-proper equilibria, where $\beta = (\beta_1, \ldots, \beta_I)$ is a strictly positive vector of utility rescalings, and $\sigma_i(s_i) < \epsilon_t \cdot \sigma_j(s_j)$ if player $i$'s rescaled loss from $s_i$ (compared to the best response) is greater than $j$'s rescaled loss from $s_j$. In a signaling game with only two possible signals, every Nash equilibrium where each sender type strictly prefers not to deviate from her equilibrium signal is an extended proper equilibrium at the interim stage, because suitable utility rescalings for the types can lead to any ranking of their utility costs of deviating to the off-path signal. By contrast, Proposition 4 shows every PCE must satisfy the compatibility criterion of Fudenberg and He (2018), which has bite even in binary signaling games such as the beer-quiche example of Cho and Kreps (1987). So an extended proper equilibrium need not be a PCE, a fact that Examples 1 and 2 further demonstrate. Conversely, because extended proper equilibrium makes some trembles infinitely less likely than others, it can eliminate some PCE (example available on request).
This paper builds on the work of Fudenberg and Levine (1993) and Fudenberg and Kreps (1995, 1994) on learning foundations for self-confirming and Nash equilibrium. It is also related to recent work that provides explicit learning foundations for various equilibrium concepts that reflect ambiguity aversion, misspecified priors, or model uncertainty, such as Battigalli, Cerreia-Vioglio, Maccheroni, and Marinacci (2016), Battigalli, Francetich, Lanzani, and Marinacci (2017), Esponda and Pouzo (2016), and Lehrer (2012). Unlike those papers, we focus on very patient agents who undertake many "experiments," and characterize the relative rates of experimentation under rational expected-utility maximization and related "near-optimal" heuristics. For this reason our analysis of learning is closer to Fudenberg and Levine (2006) and Fudenberg and He (2018).

Our investigation of learning dynamics significantly expands on that of Fudenberg and He (2018), which focused on a particular learning rule (rational Bayesians) in a restricted set of games (signaling games). In contrast, our analysis applies to a broader class of learning rules (specifically, index policies that satisfy a related compatibility condition) and to a larger family of games, the factorable games defined in Section 4. We develop new tools to deal with new issues that arise in this more general setting. For instance, Fudenberg and He (2018) compare the Gittins indices of different sender types using the fact that any stopping time (for the auxiliary optimal-stopping problem defining the index) of the less-compatible type is also feasible for the more-compatible type. But our general setting allows player roles to interact, so it is not valid to exchange the stopping times of different players, as they may condition on observed play in different parts of the game tree. We deal with this problem by considering how $i$ can nevertheless construct a feasible stopping time that mimics that of $j$.
Moreover, when a player faces more than one opponent, their optimal experimentation policy may lead them to observe a correlated distribution of opponents' play, even though the opponents do not actually play correlated strategies. We discuss this issue of endogenous correlation in Section 5.4.2; it is the reason we define PCE in terms of correlated play.

In methodology the paper is related to other work on active learning and experimentation. In single-agent settings, these include Doval (2018), Francetich and Kreps (2018), and Fryer and Harms (2017). In multi-agent settings additional issues arise, such as free-riding and encouraging others to learn; see e.g. Bolton and Harris (1999), Keller et al. (2005), Klein and Rady (2011), Heidhues, Rady, and Strack (2015), Frick and Ishii (2015), Halac, Kartik, and Liu (2016), Strulovici (2010), and the survey by Hörner and Skrzypacz (2016). Unlike most models of multi-agent bandit problems, our agents only learn from personal histories, not from the actions or histories of others. Our focus is the comparison of experimentation policies under different payoff parameters, which is central to PCE's cross-player tremble restrictions.
2 Player-Compatible Equilibrium

In this section, we first define the player-compatibility relation and discuss its basic properties. We then introduce PCE, which embodies cross-player tremble restrictions based on this relation.

Consider a strategic-form game with a finite set of players $i \in I$, finite strategy sets $|S_i| \geq 2$, and utility functions $U_i : S \to \mathbb{R}$, where $S := \times_i S_i$. We assume no player has a strictly dominated strategy, which lets us avoid some complications that would otherwise need to be treated separately. (If $S_i = \{s_i^*\}$ were a singleton, we would have $(s_i^* \mid i) \succsim (s_j \mid j)$ and $(s_j \mid j) \succsim (s_i^* \mid i)$ for any strategy $s_j$ of any player $j$, following the convention that the maximum over an empty set is $-\infty$.) This assumption is consistent with our learning model, where playing $s_i$ gives no information about the payoff consequences of any other strategy $s_i' \neq s_i$: strictly dominated strategies will never be played, even as experiments, so they may be deleted from the game.

For each $i$, let $\Delta(S_i)$ denote the set of mixed strategies for $i$. For $K \subseteq I$, set $S_K = \times_{i \in K} S_i$ and let $\Delta(S_K)$ represent the set of correlated strategies among players $K$. Let $\Delta^\circ(S_K)$ represent the interior of $\Delta(S_K)$, that is, the set of full-support correlated strategies on $S_K$. We formalize the concept of "compatibility" between players and their strategies in this general setting, which will play a central role in the definition of PCE in determining cross-player restrictions on trembles.
Definition.
For players $i \neq j$ and strategies $s_i^* \in S_i$, $s_j^* \in S_j$, say $i$ is more compatible with $s_i^*$ than $j$ is with $s_j^*$, abbreviated as $s_i^* \succsim s_j^*$, if for every totally mixed correlated strategy profile $\sigma \in \Delta^\circ(S)$ with
$$\sum_{s \in S} U_j(s_j^*, s_{-j}) \cdot \sigma(s) = \max_{s_j \in S_j} \sum_{s \in S} U_j(s_j, s_{-j}) \cdot \sigma(s),$$
we get
$$\sum_{s \in S} U_i(s_i^*, s_{-i}) \cdot \tilde{\sigma}(s) > \max_{s_i \in S_i \setminus \{s_i^*\}} \sum_{s \in S} U_i(s_i, s_{-i}) \cdot \tilde{\sigma}(s)$$
for every totally mixed correlated strategy profile $\tilde{\sigma} \in \Delta^\circ(S)$ satisfying $\mathrm{marg}_{-ij}(\sigma) = \mathrm{marg}_{-ij}(\tilde{\sigma})$.

In words, if $s_j^*$ is weakly optimal for the less-compatible $j$ against $\sigma$, then $s_i^*$ is strictly optimal for the more-compatible $i$ against any $\tilde{\sigma}$ whose marginal on $-ij$'s play is totally mixed and agrees with that of $\sigma$. As this restatement makes clear, the compatibility condition only depends on players' preferences over probability distributions on $S$, and not on the particular utility representations chosen. (Recall that a full-support correlated strategy assigns positive probability to every pure strategy profile. The notation $s_i^* \succsim s_j^*$ is unambiguous provided $i$ and $j$ have disjoint strategy sets; in the event that $i$ and $j$ share some strategies, we will clarify this notation by attaching player subscripts.)

Since $\times_i \Delta^\circ(S_i) \subseteq \Delta^\circ(S)$, our definition of compatibility ranks fewer strategy-player pairs than an alternative definition that only considers mixed strategy profiles with independent mixing between different opponents. We use the more stringent definition to match the microfoundations of our compatibility-based cross-player restrictions.

The compatibility relation is transitive, as the next proposition shows.
Proposition 1.
Suppose $s_i^* \succsim s_j^* \succsim s_k^*$, where $s_i^*, s_j^*, s_k^*$ are strategies of players $i, j, k$ respectively. Then $s_i^* \succsim s_k^*$.

The next result states that the compatibility relation is asymmetric, except in the corner case where both strategies are weakly dominated.
Proposition 2. If $s_i^* \succsim s_j^*$, then either $s_j^* \not\succsim s_i^*$, or both $s_j^*$ and $s_i^*$ are weakly dominated strategies.

The proofs of Propositions 1 and 2 are straightforward; they can be found in the Online Appendix.

We think of PCE as primarily a solution concept for games with three or more players, where the relative tremble probabilities of players $i \neq j$ affect some third party's best response. If players $i$ and $j$ care a great deal about each other's strategies, then their best responses are unlikely to be determined only by the play of the third parties. In the other extreme, a game has a multipartite structure if the set of players $I$ can be divided into $C$ mutually exclusive classes, $I = I_1 \cup \ldots \cup I_C$, in such a way that whenever $i$ and $j$ belong to the same class $i, j \in I_c$: (1) they are non-interacting, meaning $i$'s payoff does not depend on the strategy of $j$ and $j$'s payoff does not depend on the strategy of $i$; and (2) they have the same strategy set, $S_i = S_j$. As a leading case, every Bayesian game has a multipartite structure when each type of each Bayesian player is viewed as a distinct player, since different types of the same Bayesian player are non-interacting and share a strategy set. For non-interacting $i, j \in I_c$, we may write $U_i(s_c, s_{-ij})$ without ambiguity for $s_c \in S_i$, since all augmentations of the strategy profile $s_{-ij}$ with a strategy by player $j$ lead to the same payoff for $i$. For $s_c^* \in S_i = S_j$, the definition of $s_{ic}^* \succsim s_{jc}^*$ reduces to: for every totally mixed correlated $\sigma$ with $\sigma_{-ij} \in \Delta^\circ(S_{-ij})$,
$$\sum_{s \in S} U_j(s_{jc}^*, s_{-ij}) \cdot \sigma(s) = \max_{s_j \in S_j} \sum_{s \in S} U_j(s_j, s_{-ij}) \cdot \sigma(s)$$
implies
$$\sum_{s \in S} U_i(s_{ic}^*, s_{-ij}) \cdot \tilde{\sigma}(s) > \max_{s_i \in S_i \setminus \{s_{ic}^*\}} \sum_{s \in S} U_i(s_i, s_{-ij}) \cdot \tilde{\sigma}(s).$$
(Here $s_{ic}^*$ refers to $i$'s copy of $s_c^*$ and $s_{jc}^*$ to $j$'s copy. Formally, the alternative definition of compatibility mentioned earlier would replace "totally mixed correlated strategy profiles" with "independent and totally mixed strategy profiles" in the definition of $s_i^* \succsim s_j^*$.)
While the player-compatibility condition is especially easy to state for non-interacting players, our learning foundation will also justify cross-player tremble restrictions for pairs of players $i, j$ whose payoffs do depend on each other's strategies, as in the "restaurant game" we discuss in Example 1.
We now move towards the definition of PCE. PCE is a tremble-based solution concept. It builds on and modifies Selten (1975)'s definition of trembling-hand perfect equilibrium as the limit of equilibria of perturbed games in which agents are constrained to tremble, so we begin by defining our notation for the trembles and the associated constrained equilibria.
Definition. A tremble profile $\epsilon$ assigns a positive number $\epsilon(s_i \mid i) > 0$ to each player $i$ and pure strategy $s_i$. Given a tremble profile $\epsilon$, write $\Pi_i^\epsilon$ for the set of $\epsilon$-strategies
of player $i$, namely
$$\Pi_i^\epsilon := \{ \sigma_i \in \Delta(S_i) : \sigma_i(s_i) \geq \epsilon(s_i \mid i) \ \ \forall s_i \in S_i \}.$$
We call $\sigma^\circ$ an $\epsilon$-equilibrium if for each $i$,
$$\sigma_i^\circ \in \arg\max_{\sigma_i \in \Pi_i^\epsilon} U_i(\sigma_i, \sigma_{-i}^\circ).$$
Note that $\Pi_i^\epsilon$ is compact and convex. It is also non-empty when $\epsilon$ is close enough to $0$. By standard results, whenever $\epsilon$ is small enough so that $\Pi_i^\epsilon$ is non-empty for each $i$, an $\epsilon$-equilibrium exists.

The key building block for PCE is $\epsilon$-PCE, which is an $\epsilon$-equilibrium where the tremble profile is "co-monotonic" with $\succsim$ in the following sense:
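Because utility is linear in a player's own mixing, a best reply in the perturbed game has a simple form: every strategy receives its floor $\epsilon(s_i \mid i)$, and all remaining mass goes on a strategy with the highest expected payoff (unique for generic payoffs). A minimal sketch of this computation (ours; the helper name and labels are hypothetical):

```python
def constrained_best_reply(expected_payoff, floors):
    # expected_payoff: expected utility of each pure strategy against the
    # opponents' fixed profile. floors: the tremble floor eps(s_i | i) for
    # each pure strategy; the floors must sum to at most 1 for the
    # constrained set to be non-empty.
    assert sum(floors.values()) <= 1.0
    best = max(expected_payoff, key=expected_payoff.get)
    reply = dict(floors)  # start every strategy at its floor
    reply[best] += 1.0 - sum(floors.values())  # pile the rest on the optimum
    return reply
```

For instance, with payoffs `{"Active": 0.3, "Inactive": 0.0}` and floors of 0.01 on each strategy, the constrained best reply plays Active with probability 0.99 and Inactive with exactly its floor 0.01.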
Definition.
Tremble profile $\epsilon$ is player compatible if $\epsilon(s_i^* \mid i) \geq \epsilon(s_j^* \mid j)$ for all $i, j, s_i^*, s_j^*$ such that $s_i^* \succsim s_j^*$. An $\epsilon$-equilibrium where $\epsilon$ is player compatible is called a player-compatible $\epsilon$-equilibrium (or $\epsilon$-PCE).

The condition on $\epsilon$ says the minimum weight $i$ could assign to $s_i^*$ is no smaller than the minimum weight $j$ could assign to $s_j^*$ in the constrained game:
$$\min_{\sigma_i \in \Pi_i^\epsilon} \sigma_i(s_i^*) \geq \min_{\sigma_j \in \Pi_j^\epsilon} \sigma_j(s_j^*).$$
This is a "cross-player tremble restriction," that is, a restriction on the relative probabilities of trembles by different players. Note that it, like the compatibility relation, depends on the players' preferences over distributions on $S$ but not on the particular utility representation used. This invariance property distinguishes player-compatible trembles from other models of stochastic behavior such as the stochastic terms in logit best responses.

As is usual for tremble-based equilibrium refinements, we now define PCE as the limit of a sequence of $\epsilon$-PCE where $\epsilon \to 0$.

Definition. A strategy profile $\sigma^*$ is a player-compatible equilibrium (PCE) if there exists a sequence of player-compatible tremble profiles $\epsilon^{(t)} \to 0$ and an associated sequence of strategy profiles $\sigma^{(t)}$, where each $\sigma^{(t)}$ is an $\epsilon^{(t)}$-PCE, such that $\sigma^{(t)} \to \sigma^*$.

The cross-player restrictions embodied in player-compatible trembles translate into analogous restrictions on PCE, as shown in the next result.

Proposition 3.
For any PCE $\sigma^*$, player $k$, and strategy $\bar{s}_k$ such that $\sigma_k^*(\bar{s}_k) > 0$, there exists a sequence of totally mixed strategy profiles $\sigma_{-k}^{(t)} \to \sigma_{-k}^*$ such that (i) for every pair $i, j \neq k$ with $s_i^* \succsim s_j^*$,
$$\liminf_{t \to \infty} \frac{\sigma_i^{(t)}(s_i^*)}{\sigma_j^{(t)}(s_j^*)} \geq 1,$$
and (ii) $\bar{s}_k$ is a best response for $k$ against every $\sigma_{-k}^{(t)}$.

The proofs of this and subsequent results in the main text appear in the Appendix. That is, treating each $\sigma_{-k}^{(t)}$ as a totally mixed approximation to $\sigma_{-k}^*$, in a PCE each player $k$ essentially best responds to totally mixed opponent play that respects player compatibility.

It is easy to show that every $\epsilon$-PCE respects player compatibility up to the "adding-up constraint" that probabilities on different strategies must sum to 1 and that $i$ must place probability no smaller than $\epsilon(s_i \mid i)$ on strategies $s_i \neq s_i^*$. The "up to" qualification disappears in the $\epsilon^{(t)} \to 0$ limit because the required probabilities on $s_i \neq s_i^*$ tend to 0.

Since PCE is defined as the limit of $\epsilon$-equilibria for a restricted class of trembles, PCE form a subset of trembling-hand perfect equilibria; the next result shows this subset is not empty. It uses the fact that tremble profiles with the same lower bound on the probability of each action satisfy the compatibility condition in any game.

Theorem 1.
PCE exists in every finite strategic-form game.

2.3 Some Properties of PCE

A tremble profile $\epsilon$ is uniform if for all $i$ and $s_i \in S_i$, we have $\epsilon(s_i \mid i) = \bar{\epsilon}$ for the same $\bar{\epsilon} >$
0. A trembling-hand perfect equilibrium is a uniform THPE if it is the limit of $\epsilon^{(t)}$-equilibria where $\epsilon^{(t)} \to 0$ and each $\epsilon^{(t)}$ is uniform. The proof of Theorem 1 in fact establishes the existence of uniform THPE, which form a subset of PCE since uniform trembles are always player compatible regardless of the stage game.

One drawback of uniform THPE is that there is no clear microfoundation for uniform trembles. In addition to the cross-player restrictions of the compatibility condition, these uniform trembles impose the same lower bound on the tremble probabilities for all strategies of each given player. PCE and the learning foundation we develop allow for more complicated patterns of experimentation that respect the compatibility structure. We study a more permissive refinement than uniform THPE for which we can offer a learning story for the tremble restrictions. PCE is a fairly weak solution concept that nevertheless has bite in some cases of interest, as we discuss in Section 3.

3 Examples

In this section, we study examples of games where PCE rules out unintuitive Nash equilibria. We will also use these examples to distinguish PCE from existing refinements.
We start with a complete-information game where PCE differs from other solution concepts.
Example 1.
There are three players in the game: a food critic, a regular diner, and a restaurant. Simultaneously, the restaurant decides between ordering high-quality ($H$) or low-quality ($L$) ingredients, while the critic and the diner each decide whether to eat at the restaurant ($R$) or order pizza ($Z$) and eat at home. The utility from $Z$ is normalized to 0. If both customers choose $Z$, the restaurant also gets 0 payoff. Otherwise, the restaurant's payoff depends on the ingredient quality and clientele. Choosing $L$ yields a profit of +2 per customer while choosing $H$ yields a profit of +1 per customer. In addition, if the food critic is present, she will write a review based on ingredient quality, which raises the restaurant's payoff under $H$ and lowers it under $L$. Each customer gets a payoff of $x < 0$ from eating low-quality food and $y > 0$ from eating high-quality food.

The profile $(Z_c, Z_d, L)$ is a proper equilibrium, sustained by the restaurant's belief that when at least one customer plays $R$, it is far more likely that the diner deviated to patronizing the restaurant than the critic, even though the critic has a greater incentive to go to the restaurant as she gets paid for writing reviews. It is also an extended proper equilibrium (because scaling the critic's payoff by a large positive constant makes it more costly for the critic to deviate to $R_c$ than for the diner to deviate to $R_d$).

We claim that $R_c \succsim R_d$. To see this, note that for any profile $\sigma$ of totally mixed, correlated play that makes the diner indifferent between $Z_d$ and $R_d$, we must have $U(R_c, \tilde{\sigma}_{-c}) \geq 0.5$ for every $\tilde{\sigma}$ that agrees with $\sigma$ in terms of the restaurant's play. This is because the critic's utility from $R_c$ is minimized when the diner chooses $R_d$ with probability 1, but even then the critic gets 0.5 higher utility from going to a crowded restaurant than the diner gets from going to an empty restaurant, holding fixed food quality at the restaurant. This shows $R_c \succsim R_d$.

Whenever $\sigma_c^{(t)}(R_c)/\sigma_d^{(t)}(R_d) \geq 1$, the restaurant strictly prefers $H$ over $L$. Thus by Proposition 3, there is no PCE where the restaurant plays $L$ with positive probability.
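A small numeric sketch of the restaurant's comparison may help (ours, not from the paper; the review's payoff impact is set to ±2 purely as an illustrative assumption, and `restaurant_payoff` is a hypothetical helper name):

```python
def restaurant_payoff(quality, p_critic, p_diner, review=2.0):
    # Expected restaurant payoff when the critic patronizes with probability
    # p_critic and the diner with probability p_diner: L earns 2 per customer
    # and H earns 1 per customer, and when the critic is present her review
    # adds `review` under H and subtracts it under L (assumed magnitude).
    per_customer = 2.0 if quality == "L" else 1.0
    swing = review if quality == "H" else -review
    return per_customer * (p_critic + p_diner) + swing * p_critic
```

Under this assumption the difference $U(H) - U(L)$ equals $3 p_c - p_d$, so $H$ is strictly better whenever $3 p_c > p_d$, which in particular covers every case where the critic trembles to $R$ at least as often as the diner does.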
□

In the next example, PCE makes different predictions in two versions of a game with different payoff parameters, while all other solution concepts we know of make the same predictions in both versions.
Example 2.
There are 4 players in the game, split into two sides: North and South. The players are named North-1, North-2, South-1, and South-2, abbreviated as N1, N2, S1, and S2 respectively.

These players engage in a strategic link-formation game. Each player simultaneously takes an action: either Inactive or Active. An Inactive player forms no links. An Active player forms a link with every Active player on the opposite side. (Two players on the same side cannot form links.) For example, suppose N1 plays Active, N2 plays Active, S1 plays Inactive, and S2 plays Active. Then N1 creates a link to S2, N2 creates a link to S2, S1 creates no links, and S2 creates links to both N1 and N2.

Each player $i$ is characterized by two parameters: cost ($c_i$) and quality ($q_i$). Cost refers to the private cost that a player pays for each link she creates. Quality refers to the benefit that a player provides to others when they link to her. A player who forms no links gets a payoff of 0. In the above example, the payoff to North-1 is $q_{S2} - c_{N1}$ and the payoff to South-2 is $(q_{N1} - c_{S2}) + (q_{N2} - c_{S2})$. □

We consider two versions of this game, shown below. In the anti-monotonic version on the left, players with a higher cost have a lower quality. In the co-monotonic version on the right, players with a higher cost have a higher quality. There are two pure-strategy Nash outcomes for each version: all links form or no links form. "All links form" is the unique PCE outcome in the anti-monotonic case, while both "all links" and "no links" are PCE outcomes under co-monotonicity.

Anti-Monotonic
Player Cost Quality
North-1    14     30
North-2    19     10
South-1    14     30
South-2    19     10

Co-Monotonic
Player Cost Quality
North-1    14     10
North-2    19     30
South-1    14     10
South-2    19     30

The compatibility structure with respect to own quality is reversed between these two versions of the game. In both versions, $\text{Active}_{N1} \succsim \text{Active}_{N2}$, but N1 has high quality in the anti-monotonic version and low quality in the co-monotonic version. Thus, in the anti-monotonic version but not in the co-monotonic version, player-compatible trembles lead to the high-quality counterparty choosing Active at least as often as the low-quality counterparty, which means
Active has a positive expected payoff even when one's own cost is high. For this reason, the set of PCE is different in these two cases. In contrast, the sets of equilibria that satisfy extended proper equilibrium, proper equilibrium, trembling-hand perfect equilibrium, $p$-dominance, Pareto efficiency, and strategic stability do not depend on whether payoffs are co- or anti-monotonic, as shown in the Online Appendix.

Recall that a signaling game is a two-player Bayesian game where P1 is a sender who knows her own type $\theta$, and P2 only knows that P1's type is drawn according to the distribution $\lambda \in \Delta(\Theta)$ on a finite type space $\Theta$. After learning her type, the sender sends a signal $s \in S$ to the receiver. Then the receiver responds with an action $a \in A$. Utilities depend on the sender's type $\theta$, the signal $s$, and the action $a$.

Fudenberg and He (2018)'s compatibility criterion is defined only for signaling games. It does not use limits of games with trembles, but instead restricts the beliefs that the receiver can have about the sender's type. That sort of restriction does not seem easy to generalize beyond games with observed actions, while using trembles allows us to define PCE for general strategic-form games. As we will see, the more general PCE definition implies the compatibility criterion in signaling games.

With each sender type viewed as a different player, this game has $|\Theta| + 1$ players, $I = \Theta \cup \{2\}$, where the strategy set of each sender type $\theta$ is $S_\theta = S$ while the strategy set of the receiver is $S_2 = A^S$, the set of signal-contingent plans. So a mixed strategy of $\theta$ is a possibly mixed signal choice $\sigma(\cdot \mid \theta) \in \Delta(S)$, while a mixed strategy $\sigma_2 \in \Delta(A^S)$ of the receiver is a mixed plan about how to respond to each signal.

Fudenberg and He (2018) define type compatibility for signaling games.
A signal s* is more type-compatible with θ′ than with θ″ if, for every behavioral strategy σ_2 of the receiver,

u(s*, σ_2; θ″) ≥ max_{s′≠s*} u(s′, σ_2; θ″) implies u(s*, σ_2; θ′) > max_{s′≠s*} u(s′, σ_2; θ′).

They also define the compatibility criterion, which imposes restrictions on off-path beliefs in signaling games. Consider a Nash equilibrium (σ*_1, σ*_2). For any signal s* and receiver action a with σ*_2(a|s*) >
0, the compatibility criterion requires that a best responds to some belief p ∈ ∆(Θ) about the sender's type such that, whenever s* is more type-compatible with θ′ than with θ″ and s* is not equilibrium dominated for θ′, p satisfies p(θ″)/p(θ′) ≤ λ(θ″)/λ(θ′).

Since every totally mixed strategy of the receiver is payoff-equivalent to a behavioral strategy, it is easy to see that type compatibility implies s*_θ′ ≽ s*_θ″. The next result shows that, when specialized to signaling games, all PCE pass the compatibility criterion.
Proposition 4.
In a signaling game, every PCE σ* is a Nash equilibrium satisfying the compatibility criterion of Fudenberg and He (2018).

This proposition in particular implies that in the beer-quiche game of Cho and Kreps (1987), the quiche-pooling equilibrium is not a PCE, as it does not satisfy the compatibility criterion.

(Signal s* is not equilibrium dominated for θ if max_{a∈A} u(s*, a; θ) > u(s, σ*_2; θ) for every s with σ*_1(s|θ) > 0.)

The converse does not hold. We defined type compatibility to require testing against all receiver strategies, not just the totally mixed ones, so it is possible that s*_θ′ ≽ s*_θ″ while s* is not more type-compatible with θ′ than with θ″; in this sense type compatibility is harder to satisfy than player compatibility. We now realize that we could have restricted type compatibility to consider only totally mixed strategies, and all of the results of Fudenberg and He (2018) would still hold.

Factorability and Isomorphic Factoring
This section defines a "factorability" condition that we will use in developing a learning foundation for PCE. Factorability implies that the information gathered from playing one strategy is not at all informative about the payoff consequences of any other strategy. We then define a notion of "isomorphic factoring" for players i and j to formalize the idea that the learning problems faced by these two players are essentially the same. The next section will provide a learning foundation for the compatibility restriction for pairs of players whose learning problems are isomorphic in this way. The examples discussed in Section 3 are factorable and isomorphically factorable for the players ranked by compatibility.

We begin by introducing some notation. Fix an extensive-form game Γ as the stage game, with players i ∈ I, along with a player 0 modeling Nature's moves. The collection of information sets of player i ∈ I is written H_i. At each h ∈ H_i, player i chooses an action a_h from the finite set of possible actions A_h. So an extensive-form pure strategy of i specifies an action at each information set h ∈ H_i; we denote by S_i the set of all such strategies. For simplicity, we maintain the following assumption throughout.

Assumption 1.
Each player moves at most once along any path of play in Γ.

In addition to any information a player gets in the course of play, we assume that after each play each player observes her own payoff. In general, this need not perfectly reveal other players' actions at all information sets. We now define factorability, which roughly says that playing strategy s_i against any strategy profile of −i identifies all of the opponents' actions that can be payoff-relevant for s_i, but does not reveal any information about the payoff consequences of any other strategy s′_i ≠ s_i.

For an information set h of a player j ≠ i, write P_h for the partition on S_−i where two strategies s_−i, s′_−i are in the same element of the partition if they prescribe the same play at h. Thus the partition P_h contains perfect information about play at h, but no other information.

Definition.
For each player i and strategy s_i ∈ S_i, let Π_i[s_i] be the coarsest partition of S_−i that makes s_−i ↦ U_i(s_i, s_−i) measurable. The game Γ is factorable for i if:

1. For each s_i ∈ S_i there exists a (possibly empty) collection of −i's information sets F_i[s_i] ⊆ H_−i so that Π_i[s_i] = ∨_{h ∈ F_i[s_i]} P_h. (The meet over an empty collection is the coarsest possible partition on S_−i, i.e., no information.)

2. For any two strategies s_i ≠ s′_i, F_i[s_i] ∩ F_i[s′_i] = ∅.

When Γ is factorable for i, we refer to F_i[s_i] as the s_i-relevant information sets, a terminology we now justify. In general, i's payoff from playing s_i can depend on the profile of −i's actions at all opponent information sets. Condition (1) implies that only opponents' actions on F_i[s_i] matter for i's payoff after choosing s_i, and furthermore that this dependence is one-to-one. That is,

U_i(s_i, s_−i) = U_i(s_i, s′_−i) ⇔ ( s_−i(h) = s′_−i(h) for all h ∈ F_i[s_i] ).

The substantive restriction in Condition (1) is that i's learning cannot be blocked by another player: by choosing s_i, i can always identify the actions on F_i[s_i], regardless of what happens elsewhere in the game tree. (It is easy but expositionally costly to extend this to the case where several actions in A_h lead to the same payoff for i.)

Condition (2) implies that i does not learn about the payoff consequences of a different strategy s′_i ≠ s_i through playing s_i (provided i's prior about opponents' play is independent across information sets). This is because the s_i-relevant information sets do not intersect the s′_i-relevant ones. In particular, this means that player i cannot "free ride" on others' experiments and learn about the payoff consequences of strategies she does not play herself. If F_i[s_i] is empty, then s_i is a kind of "opt out" action for i.
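To make Condition (1) concrete, the following sketch checks it by brute force for a small hypothetical payoff function (the three information sets, action labels, and payoff numbers are illustrative assumptions, not an example from the text): it builds the partition of S_−i induced by the payoff of a fixed s_i and compares it with the join of the per-information-set partitions P_h over a candidate F_i[s_i].

```python
from itertools import product

# Hypothetical setup: opponents move at three information sets h1, h2, h3,
# and i's payoff from some fixed strategy s_i depends only on play at h1 and h2.
ACTIONS = {"h1": ["L", "R"], "h2": ["U", "D"], "h3": ["A", "B"]}

def U(s_minus_i):
    # One-to-one in (a_h1, a_h2), insensitive to a_h3.
    return {"L": 0, "R": 1}[s_minus_i["h1"]] + 2 * {"U": 0, "D": 1}[s_minus_i["h2"]]

S_minus_i = [dict(zip(ACTIONS, prof)) for prof in product(*ACTIONS.values())]

def partition_by(key):
    """Partition S_minus_i into the blocks on which `key` is constant."""
    blocks = {}
    for s in S_minus_i:
        blocks.setdefault(key(s), []).append(tuple(sorted(s.items())))
    return {frozenset(b) for b in blocks.values()}

# Coarsest partition making s_-i -> U(s_i, s_-i) measurable ...
Pi = partition_by(U)
# ... versus the join of P_h over the candidate F_i[s_i] = {h1, h2}:
join = partition_by(lambda s: (s["h1"], s["h2"]))

print(Pi == join)  # prints True: F_i[s_i] = {h1, h2} satisfies Condition (1)
```

Because the payoff is one-to-one in the actions at h1 and h2, the two partitions coincide; a payoff that also depended on h3, or that pooled distinct profiles, would break the equality.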
After choosing s_i, i receives the same utility at every reachable terminal node and gets no information about the payoff consequences of any of her other strategies.

We now illustrate factorability using the examples from Section 3 and some other general classes of games.
Consider the restaurant game from Example 1. Since x ≠ y and x ≠ y + 0.5, by choosing R the customer's payoff perfectly reveals the others' play. By choosing Z, the customer always gets a payoff of 0 (these nodes are colored in the diagram below) and so cannot infer anyone else's play.

The restaurant game is factorable for the Critic and the Diner. Let F_i[R_i] consist of the two information sets of −i, and let F_i[Z_i] be the empty set, for each customer i. It is easy to verify that the two conditions of factorability are satisfied.

It is important for factorability that a customer who takes the "outside option" of ordering pizza gets the same payoff regardless of the restaurant's play, and does not observe the restaurant's quality choice even if the other customer patronizes the restaurant. Factorability rules out this sort of "free information," so that when we analyze the non-equilibrium learning problem we know that each agent can only learn a strategy's payoff consequences by playing it herself.

Consider the link-formation game from Example 2. The payoff for a player choosing
Inactive is always 0, whereas the payoff for a player choosing
Active exactly identifies the play of the two players on the opposite side. It is now easy to see that we can let F_i[Active_i] consist of the information sets of the two agents on the other side of i and let F_i[Inactive_i] be empty. This specification of the s_i-relevant information sets shows the stage game is factorable for every player. More generally, Γ is factorable for i whenever it is a binary participation game for i.

Definition.
Γ is a binary participation game for i if the following are satisfied:

1. i has a unique information set, with two actions, labeled In and Out without loss of generality.
2. All paths of play in Γ pass through i's information set.
3. All paths of play where i plays In pass through the same information sets.
4. Terminal vertices associated with i playing Out all give i the same payoff.
5. Terminal vertices associated with i playing In all give i different payoffs.

Action Out is an outside option for i that leads to a constant payoff regardless of others' play. We are implicitly assuming in part (5) of the definition that the game has generic payoffs for i after choosing In, in the sense that changing the action at any one information set on the path of play will change i's payoff.

If Γ is a binary participation game for i, then let F_i[In] be the common collection of −i information sets encountered on paths of play where i chooses In, and let F_i[Out] be the empty set. We see that Γ is factorable for i. Clearly F_i[In] ∩ F_i[Out] = ∅, so Condition (2) of factorability is satisfied. When i chooses the strategy In, the tree structure of Γ implies that different profiles of play on F_i[In] must lead to different terminal nodes, so the generic-payoff condition means Condition (1) of factorability is satisfied for strategy In. When i plays Out, i gets the same payoff regardless of the others' play, so Condition (1) of factorability is satisfied for strategy Out.

The restaurant game is a binary participation game for the critic and the diner, where ordering pizza is the outside option. The link-formation game is a binary participation game for every player, where
Inactive is the outside option.
To give a different class of examples of factorable games, consider a game of signaling to one or more audiences. To be precise, Nature moves first and chooses a type for the sender, drawn according to λ ∈ ∆(Θ), where Θ is a finite set. The sender then chooses a signal s ∈ S, observed by all receivers r_1, ..., r_{n_r}. Each receiver then simultaneously chooses an action. The profile of receiver actions, together with the sender's type and signal, determines payoffs for all players. Viewing different types of the sender as different players, this game is factorable for all sender types, provided payoffs are generic. This is because for each type i, F_i[s] is the set of n_r information sets of the receivers after seeing signal s.

The next result gives a necessary condition for factorability. Suppose h is an information set of a player j ≠ i. Player i's payoff is independent of h if u_i(a_h, a_−h) = u_i(a′_h, a_−h) for all a_h, a′_h, a_−h, where a_h, a′_h are actions at information set h and a_−h is a profile of actions at all other information sets in the game tree. If i's payoff is not independent of the action taken at some information set h, then i can always put h onto the path of play via a unilateral deviation at one of her information sets.

Proposition 5.
Suppose the game is factorable for i and i's payoff is not independent of h*. Then for any strategy profile, either h* is on the path of play, or i has a deviation at one of her information sets that puts h* onto the path of play.

This result follows from two lemmas.
Lemma 1.
For any game factorable for i and any information set h* of a player j ≠ i where j has at least two different actions, if h* ∈ F_i[s_i] for some extensive-form strategy s_i ∈ S_i, then h* is always on the path of play when i chooses s_i.

Lemma 2.
For any game factorable for i and any information set h* of a player j ≠ i, suppose i's payoff is not independent of h*. Then: (i) j has at least two different actions at h*; (ii) there exists some extensive-form strategy s_i ∈ S_i such that h* ∈ F_i[s_i].

We can combine these two lemmas to prove the proposition.
Proof.
By combining Lemmas 1 and 2, there exists some extensive-form strategy s_i ∈ S_i such that h* is on the path of play whenever i chooses s_i. Consider some strategy profile (s°_i, s°_−i) where h* is off the path. Then i can unilaterally deviate to s_i, and h* is on the path of (s_i, s°_−i). Furthermore, i's play differs on the new path relative to the old path at exactly one information set, since i plays at most once on any path. So instead of deviating to s_i, i can deviate to the strategy s′_i that matches s_i at this one information set where i's play is modified, but otherwise is the same as s°_i. Then h* is also on the path of play for (s′_i, s°_−i), where s′_i differs from s°_i at only one information set.

Consider the centipede game for three players below. Each player moves at most once on each path, and 1's and 2's payoffs are not independent of the (unique) information set of player 3. But if both 1 and 2 choose "drop," then no one-step deviation by either 1 or 2 can put the information set of 3 onto the path of play. Proposition 5 thus implies the centipede game is not factorable for either 1 or 2. Moreover, Fudenberg and Levine (2006) showed that in this game even very patient player 2s may not learn to play a best response to player 3, so that the outcome (drop, drop, pass) can persist even though it is not trembling-hand perfect. Intuitively, if the player 1s only play "pass" as experiments, then when the fraction of new players is very small, the player 2s may not get to play often enough to make experimentation with "pass" worthwhile.

As another example, Selten's horse game displayed above is not factorable for 1 or 2 if the payoffs are generic, even though the conclusion of Proposition 5 is satisfied. The information set of 3 must belong to both F_1[Down] and F_1[Across], because 3's play can affect 1's payoff even if 1 chooses Across, as 2 could choose Down.
This violates the factorability requirement that F_1[Down] ∩ F_1[Across] = ∅. The same argument shows the information set of 3 must belong to both F_2[Down] and F_2[Across], since when 1 chooses Down the play of 3 affects 2's payoff regardless of 2's play. So, again, F_2[Down] ∩ F_2[Across] = ∅ is violated.

Condition (2) of factorability also rules out games where i has two strategies that give the same information, but one of which always yields a worse payoff under all profiles of opponents' play. In this case, we can think of the worse strategy as an informationally equivalent but more costly experiment than the better strategy. Reasonable learning rules (including rational learning) will not use such strategies, but we do not capture that in the general definition of PCE because our setup there only considers abstract strategy spaces S_i and not an extensive-form game tree.

Before we turn to comparing the learning behavior of agents i and j, we must deal with one final issue. To make sensible comparisons between strategies s*_i and s*_j of two different players i ≠ j in a learning setting, we must make assumptions on their informational value about the play of others: namely, the information i gets from choosing s*_i must be essentially the same as the information that j gets from choosing s*_j. To do this we require that the game be factorable for both i and j, and that the factoring is "isomorphic" for these two players.

Definition.
When Γ is factorable for both i and j, the factoring is isomorphic for i and j if there exists a bijection ϕ: S_i → S_j such that F_i[s_i] ∩ H_−ij = F_j[ϕ(s_i)] ∩ H_−ij for every s_i ∈ S_i.

This says the s_i-relevant information sets (for i) are the same as the ϕ(s_i)-relevant information sets (for j), insofar as the actions of −ij are concerned. For example, the restaurant game is isomorphically factorable for the critic and the diner (under the isomorphism ϕ(R1) = R2, ϕ(Z1) = Z2), because F_1[R1] ∩ H_−12 = F_2[R2] ∩ H_−12 is the singleton set containing the unique information set of the restaurant.

(It would be interesting to try to refine the definition of PCE to capture this, perhaps using the "signal function" approach of Battigalli and Guaitoli (1997) and Rubinstein and Wolinsky (1994).)
In this section, we provide a learning foundation for PCE's cross-player tremble restrictions. Our main learning result, Theorem 2, studies long-lived agents who are permanently assigned to player roles and face a fixed but unknown distribution of opponents' play. We prove that when s*_i ≽ s*_j and the game is isomorphically factorable for i and j, agents in the role of i use s*_i more frequently than agents in the role of j use s*_j. We obtain this result both for rational agents who maximize discounted expected utility, and for boundedly rational agents who employ the computationally simpler Bayes upper confidence bound algorithm. Under either of these behavioral assumptions, "trembles" emerge endogenously during learning as deliberate experiments that seek to learn opponents' play.

We consider an agent born into player role i who maintains this role throughout her life. She has a geometrically distributed lifetime: with 0 ≤ γ < 1 her survival chance, each period she plays the stage game and chooses some s_i ∈ S_i. The agent observes and collects her payoffs at the end of the game. Then, with probability γ, she continues into the next period and plays the stage game again. With complementary probability, she exits the system. Thus each period the agent observes her own payoff. We assume that players have perfect recall, so she also remembers her chosen strategy. (This is a special case of the terminal-node partitions of Fudenberg and Kamada (2015, 2018), where the elements of each player's terminal-node partition are isomorphic to their possible payoffs.)

Definition. The set of all finite histories of all lengths for i is Y_i := ∪_{t≥0} (S_i × ℝ)^t. For a history y_i ∈ Y_i and s_i ∈ S_i, the subhistory y_{i,s_i} is the (possibly empty) subsequence of y_i in which the agent played s_i.

When Γ is factorable for i, there is a one-to-one mapping from the set of action profiles on the s_i-relevant information sets to the range of s_−i ↦ U_i(s_i, s_−i), as required by the first condition of the factorability definition.
Through this identification, we may think of each one-period history where i plays s_i as an element of {s_i} × (×_{h∈F_i[s_i]} A_h) instead of an element of {s_i} × ℝ. This convention will make it easier to compare the histories of different player roles.

Notation 1. A history y_i will also refer to an element of ∪_{t≥0} ( ∪_{s_i∈S_i} [ {s_i} × (×_{h∈F_i[s_i]} A_h) ] )^t.

The agent decides which strategy to use in each period based on her history so far. This mapping is her learning rule.

Definition. A learning rule r_i: Y_i → S_i specifies a pure strategy in the stage game after each history.

Note that the learning rule depends only on what the agent has observed in past play; the effect of anything learned during the play of the current stage game is captured by the specified strategy. Note also that since the agent's play in each period depends on her past observations, the sequence of her plays is a stochastic process whose distribution depends on the distribution of the opponents' play. We assume that there is a fixed objective distribution of opponents' play, which we call player i's learning environment. The leading case of this is when there are multiple populations of learners, one for each player role, and the aggregate system is in a steady state. But when analyzing the play of a single agent, we remain agnostic about the reason why opponents' play is i.i.d.
Definition. A learning environment for player i is a probability distribution σ_−i ∈ ∏_{j≠i} ∆(S_j) over the strategies of players −i.

The learning environment, together with the agent's learning rule, generates a stochastic process X_i^t describing i's strategy in period t.

Definition.
Let X_i^t be the S_i-valued random variable representing i's play in period t. The induced response of i to σ_−i under learning rule r_i is the map φ_i(·; r_i, σ_−i): S_i → [0, 1], where for each s_i ∈ S_i we have

φ_i(s_i; r_i, σ_−i) := (1 − γ) Σ_{t=1}^{∞} γ^{t−1} · P_{r_i, σ_−i}{X_i^t = s_i}.

We can interpret the induced response φ_i(·; r_i, σ_−i) as a mixed strategy for i representing i's weighted lifetime average play, where the weight on X_i^t, the strategy she uses in period t of her life, is proportional to the probability γ^{t−1} of surviving into that period. The induced response also has a population interpretation. Suppose there is a continuum of agents in the society, each engaged in her own copy of the learning problem above. In each period, enough new agents are added to the society to exactly balance the share of agents who exit between periods. Then φ_i(·; r_i, σ_−i) describes the distribution on S_i we would find if we sampled an individual uniformly at random from the subpopulation in the role of i and asked her which s_i ∈ S_i she plans on playing today.

Our learning foundation for compatible trembles involves comparing the induced responses of different player roles under the same learning rule and in the same learning environment.

We will consider two different specifications of the agents' learning rules in factorable games, namely the maximization of expected discounted utility and the Bayes upper confidence bound heuristic. With both rules, agents form a Bayesian belief over opponents' play, independent across information sets. More precisely, we will assume that each agent i starts with a regular independent prior:

Definition.
Agent i has a regular independent prior if her belief g_i on ×_{h∈H_−i} ∆(A_h) can be written as the product of full-support marginal densities on ∆(A_h) across the different h ∈ H_−i, so that g_i((α_h)_{h∈H_−i}) = ∏_{h∈H_−i} g_i^h(α_h) with g_i^h(α_h) > 0 for all α_h ∈ ∆°(A_h).

Thus the agent holds a belief about the distribution of actions at each −i information set h, and thinks that actions at different information sets are generated independently, whether the information sets belong to the same player or to different ones. Furthermore, the agent holds independent beliefs about the randomizing probabilities at different information sets. The agent updates g_i by applying Bayes' rule to her history y_i. If the stage game is a signaling game, for example, this independence assumption means that the senders update their beliefs about the receiver's response to a given signal s based only on the responses received to that signal, and that their beliefs about this response do not depend on the responses they have observed to other signals s′ ≠ s.

If i starts with independent prior beliefs in a stage game factorable for i, the learning problem she faces is a combinatorial bandit problem. A combinatorial bandit consists of a set of basic arms, each with an unknown distribution of outcomes, together with a collection of subsets of basic arms called super arms. Each period, the agent must choose a super arm, which results in pulling all of the basic arms in that subset and obtaining a utility based on the outcomes of these pulls. To translate into our language, each basic arm corresponds to a −i information set h, and the super arms are identified with strategies s_i ∈ S_i. The subset of basic arms in s_i is the set of s_i-relevant information sets, F_i[s_i]. The collection of outcomes from these basic arms, i.e., the action profile (a_h)_{h∈F_i[s_i]}, determines i's payoff U_i(s_i, (a_h)_{h∈F_i[s_i]}).
(We assume that agents do not know Nature's mixed actions, which must be learned just like the play of the other players. If agents knew Nature's move, then a regular independent prior would be a density g_i on ×_{h∈H_{I∖{i}}} ∆(A_h), so that g_i((α_h)_{h∈H_{I∖{i}}}) = ∏_{h∈H_{I∖{i}}} g_i^h(α_h) with g_i^h(α_h) > 0 for all α_h ∈ ∆°(A_h). As Fudenberg and Kreps (1993) point out, an agent who believes two opponents are randomizing independently may nevertheless have subjective correlation in her uncertainty about the randomizing probabilities of these opponents.)
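The induced response φ_i defined earlier can be approximated by simulation. The sketch below is a minimal illustration under assumed ingredients, none of which comes from the paper: a two-strategy role, Bernoulli payoffs standing in for a fixed opponent environment, and a naive try-each-once-then-greedy learning rule. It weights each period-t play by (1 − γ)γ^(t−1) and averages over many simulated agents.

```python
import random

random.seed(0)

GAMMA = 0.9      # survival chance gamma
T = 200          # truncation horizon (gamma**T is negligible)
N_AGENTS = 2000  # Monte Carlo sample of agents in role i

# Hypothetical learning environment: each strategy's payoff is an i.i.d.
# Bernoulli draw, a stand-in for a fixed distribution of opponents' play.
def payoff(s):
    return random.random() < (0.7 if s == "A" else 0.4)

def greedy_rule(history):
    """Toy learning rule: try each strategy once, then play the empirical best."""
    for s in ("A", "B"):
        if not history[s]:
            return s
    return max(("A", "B"), key=lambda s: sum(history[s]) / len(history[s]))

phi = {"A": 0.0, "B": 0.0}  # induced-response estimate
for _ in range(N_AGENTS):
    history = {"A": [], "B": []}
    for t in range(1, T + 1):
        s = greedy_rule(history)
        phi[s] += (1 - GAMMA) * GAMMA ** (t - 1) / N_AGENTS
        history[s].append(payoff(s))

print(phi)  # the weights sum to ~1; the better strategy "A" gets most of the mass
```

The forced first trial of each strategy is the "tremble" here: even the worse strategy receives positive induced-response weight, which is how deliberate experimentation generates the tremble probabilities that PCE restricts.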
A special case of combinatorial bandits is additive separability, where the outcome from pulling each basic arm is simply an ℝ-valued reward, and the payoff from choosing a super arm is the sum of these rewards. This corresponds to the stage game being additively separable for i.

Definition.
A factorable game Γ is additively separable for i if there is a collection of auxiliary functions u_{i,h}: A_h → ℝ such that U_i(s_i, (a_h)_{h∈F_i[s_i]}) = Σ_{h∈F_i[s_i]} u_{i,h}(a_h).

The term u_{i,h}(a_h) is the "reward" that action a_h contributes to i's payoff. The total payoff from s_i is the sum of such rewards over all s_i-relevant information sets. A factorable game is not additively separable for i when the opponents' actions on F_i[s_i] interact in some way to determine i's payoff following s_i. All the examples discussed in Section 3 are additively separable for the players ranked by compatibility. While we provide our learning foundation for rational agents in any factorable game, our analysis of the Bayes upper confidence bound algorithm will restrict to such additively separable games.
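A payoff table over two relevant information sets can be tested for additive separability directly: U(a_h1, a_h2) = u_h1(a_h1) + u_h2(a_h2) for some auxiliary functions exactly when all interaction terms vanish. A small sketch with hypothetical numbers (the −0.5 echoes the restaurant-game contribution discussed later, but this table itself is made up):

```python
from itertools import product

# Hypothetical payoffs after a strategy s_i with F_i[s_i] = {h1, h2}.
A_h1, A_h2 = ["L", "R"], ["U", "D"]
U = {("L", "U"): 1.0, ("L", "D"): 0.5, ("R", "U"): 2.0, ("R", "D"): 1.5}

def additively_separable(U, A1, A2):
    """U(a1, a2) = u1(a1) + u2(a2) iff every entry equals
    U(a1, b2) + U(b1, a2) - U(b1, b2) for fixed baselines b1, b2."""
    b1, b2 = A1[0], A2[0]
    return all(
        abs(U[(a1, a2)] - (U[(a1, b2)] + U[(b1, a2)] - U[(b1, b2)])) < 1e-9
        for a1, a2 in product(A1, A2)
    )

print(additively_separable(U, A_h1, A_h2))  # prints True: here D contributes -0.5 at h2
```

Changing any single entry of this table (e.g. setting U[("R", "D")] to 3.0) introduces an interaction between h1 and h2 and makes the check fail, which is exactly the case our Bayes-UCB analysis excludes.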
(Additive separability is trivially satisfied whenever |F_i[s_i]| ≤ 1 for each s_i, so that there is at most one s_i-relevant information set for each strategy s_i of i. So every signaling game is additively separable for every sender type. It is also satisfied in the link-formation game in Section 4.2.2, even though there |F_i[Active_i]| = 2, as each agent computes her payoff by summing her linking costs/benefits with respect to each potential counterparty. Additive separability is also satisfied in the restaurant game in Section 4.2.1 for each customer i: F_i[R_i] contains two information sets, corresponding to the play of the Restaurant and the other customer, and the play of the other customer additively contributes either 0 or −0.5 to i's payoff, depending on whether they choose R or not.)

Consider a rational agent who maximizes discounted expected utility. In addition to the survival chance 0 ≤ γ < 1, she discounts future payoffs by a factor 0 ≤ δ < 1, so her overall effective discount factor is 0 ≤ δγ < 1. Given her beliefs at any history, we may calculate the Gittins index of each strategy s_i ∈ S_i, corresponding to a super arm in the combinatorial bandit problem. We write the solution to the rational agent's problem as OPT_i, which involves playing the strategy s_i with the highest Gittins index after each history y_i.

The drawback of this learning rule is that the Gittins index is computationally intractable even in simple bandit problems. The combinatorial structure of our bandit problem makes computing the index even more complex, as it needs to consider the evolution of beliefs about each basic arm.

The Bayesian upper confidence bound (Bayes-UCB) procedure was first proposed by Kaufmann, Cappé, and Garivier (2012) as a computationally tractable algorithm for dealing with the exploration-exploitation trade-off in bandit problems. We restrict attention to games additively separable for i and adopt a variant of Bayes-UCB.
Every subhistory y_{i,h} of play at h ∈ F_i[s_i] induces a posterior belief g_i(·|y_{i,h}) over play at h, so g_i(·|y_{i,h}) is an element of ∆(∆(A_h)). By an abuse of notation, we use u_{i,h}(g_i(·|y_{i,h})) ∈ ∆(ℝ) to mean the distribution over contributions u_{i,h}(α_h) when α_h is distributed according to g_i(·|y_{i,h}). As a final bit of notation, when F is a distribution on ℝ, Q(F; q) denotes the q-quantile of F.

Definition.
Let a prior g_i and a quantile-choice function q: ℕ → [0, 1] be given for i. The Bayes-UCB index for s_i after history y_i (relative to g_i and q) is

Σ_{h ∈ F_i[s_i]} Q( u_{i,h}(g_i(·|y_{i,h})) ; q(#(s_i|y_i)) ),

where #(s_i|y_i) is the number of times s_i has been used in history y_i.

In words, our Bayes-UCB index computes the q-th quantile of u_{i,h}(a_h) under i's belief about −i's play at h, then sums these quantiles to return an index for the strategy s_i. The Bayes-UCB policy
UCB_i prescribes choosing the strategy with the highest Bayes-UCB index after every history.

This procedure embodies a kind of wishful thinking for q ≥ 0.5. The agent optimistically evaluates the payoff consequence of each s_i under the assessment that opponents will play a favorable response to s_i at each of the s_i-relevant information sets, where a greater q corresponds to greater optimism in this evaluation procedure. Indeed, if q approaches 1 for every s_i, the Bayes-UCB procedure approaches picking the strategy with the highest potential payoff.

If F_i[s_i] consists of only a single information set for every s_i, then the procedure we define is the standard Bayes-UCB policy. In general, our procedure differs from the usual Bayes-UCB procedure, which would instead compute

Q( Σ_{h ∈ F_i[s_i]} u_{i,h}(g_i(·|y_{i,h})) ; q(#(s_i|y_i)) ).

Instead, our procedure computes the sum of the quantiles, which is easier than computing the quantile of the sum, a calculation that requires taking the convolution of the associated distributions. This variant of the Bayesian UCB is analogous to variants of the non-Bayesian UCB algorithm (see e.g. Gai, Krishnamachari, and Jain (2012) and Chen, Wang, and Yuan (2013)) that separately compute an index for each basic arm and choose the super arm maximizing the sum of the basic-arm indices.

The analysis that follows makes heavy use of the fact that the Gittins index and the Bayes-UCB are index policies in the following sense:
Definition.
When Γ is factorable for i, a learning rule r_i: Y_i → S_i is an index policy if there exist functions (ι_{s_i})_{s_i∈S_i}, with each ι_{s_i} mapping subhistories of s_i to real numbers, such that

r_i(y_i) ∈ argmax_{s_i∈S_i} { ι_{s_i}(y_{i,s_i}) }.

(The non-Bayesian UCB index of a basic arm is an "optimistic" estimate of its mean reward that combines its past empirical mean with a term inversely proportional to the number of times the basic arm has been pulled. Kveton, Wen, Ashkan, and Szepesvari (2015) have established tight O(√(n log n)) regret bounds for this kind of algorithm across n periods.)

If an agent uses an index policy, we can think of her behavior in the following way. At each history, she computes an index for each strategy s_i ∈ S_i based on the subhistory of those periods where she chose s_i, and she then plays a strategy with the highest index with probability 1.

We now analyze how compatibility relations in the stage game translate into restrictions on experimentation frequencies. We aim to demonstrate that if s*_i ≽ s*_j, then i's induced response plays s*_i more frequently than j's induced response plays s*_j. There is little hope of proving a comparative result of this kind if i and j face completely unrelated learning problems. Instead, we will require that i and j use the same learning rule with the same parameters (that is, the same patience in the case of OPT and the same quantile-choice function in the case of UCB), start with the same prior beliefs about −ij's play, and face the same distribution of −ij's play. These assumptions are natural when a common population of agents is randomly assigned to player roles, as in a lab experiment.

Theorem 2 shows that when i and j use the same learning rule and face the same learning environment, we have φ_i(s*_i; r_i, σ_−i) ≥ φ_j(s*_j; r_j, σ_−j).
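To illustrate how the Bayes-UCB index fits the index-policy definition, the sketch below computes the index for a strategy with two binary-action relevant information sets. Everything concrete here is an assumption made for illustration: uniform Beta(1, 1) priors over each arm, a quantile schedule q(n) = 1 − 1/(n + 1), and Monte Carlo approximation of the posterior quantiles via Python's `random.betavariate`.

```python
import random

random.seed(1)

# Hypothetical strategy s_i with F_i[s_i] = {h1, h2}; each h has two actions,
# and u_{i,h} gives each action's reward contribution to i's payoff.
u = {"h1": {"L": 0.0, "R": 1.0}, "h2": {"U": 0.0, "D": 2.0}}

def contribution_quantile(h, counts, q, n_draws=4000):
    """Approximate q-quantile of u_{i,h}(alpha_h), where the posterior over the
    probability of the high-reward action is Beta(1 + #high obs, 1 + #low obs)."""
    hi = max(u[h], key=u[h].get)
    lo = min(u[h], key=u[h].get)
    a, b = 1 + counts[hi], 1 + counts[lo]  # Beta posterior from a uniform prior
    draws = sorted(u[h][lo] + (u[h][hi] - u[h][lo]) * random.betavariate(a, b)
                   for _ in range(n_draws))
    return draws[int(q * (n_draws - 1))]

def bayes_ucb_index(counts_by_h, n_pulls):
    """Sum of per-information-set quantiles, as in the index definition."""
    q = 1 - 1 / (n_pulls + 1)  # hypothetical quantile-choice function q(n)
    return sum(contribution_quantile(h, counts_by_h[h], q) for h in counts_by_h)

# After two pulls of s_i, having observed R then L at h1 and D twice at h2:
counts = {"h1": {"R": 1, "L": 1}, "h2": {"D": 2, "U": 0}}
idx = bayes_ucb_index(counts, n_pulls=2)
print(round(idx, 2))  # optimistic: exceeds the posterior-mean payoff 0.5 + 1.5 = 2.0
```

This is an index in the required sense: it depends only on the subhistory of periods where s_i itself was played, and an agent following UCB_i would play an argmax of such indices across her strategies.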
This provides a microfoundation for the compatibility-based cross-player restrictions on trembles. Throughout, we fix a stage game Γ that is isomorphically factorable for i and j, with isomorphism ϕ: S_i → S_j between their strategies.

Definition.
Regular independent priors for i and j are equivalent if for each s_i ∈ S_i and h ∈ F_i[s_i] ∩ F_j[ϕ(s_i)], we have g_i^h(α) = g_j^h(α) for all α ∈ ∆(A_h).

(To handle possible ties, we can introduce a strict order over each agent's strategy set and specify that if two strategies have the same index, the agent plays the one that is ranked higher. We believe that our learning foundation for player-compatible trembles continues to hold even when i and j start with different priors, under a stronger version of the compatibility condition that converges to the current one as the priors become closer together, but we have not been able to prove this.)

Theorem 2. Suppose s*_i ≽ s*_j with ϕ(s*_i) = s*_j. Consider two learning agents in the roles of i and j with equivalent regular independent priors. For any common survival chance 0 ≤ γ < 1 and any mixed strategy profile σ, we have φ_i(s*_i; r_i, σ_−i) ≥ φ_j(s*_j; r_j, σ_−j) under either of the following conditions:

• r_i = OPT_i, r_j = OPT_j, and i and j have the same patience 0 ≤ δ < 1.
• The stage game is additively separable for i and j, at every h ∈ H_−ij the auxiliary functions u_{i,h}, u_{j,h} rank the α ∈ ∆(A_h) in the same way, r_i = UCB_i, r_j = UCB_j, and i and j have the same quantile-choice function q_i = q_j.

This result provides learning foundations for player-compatible trembles in a number of games, including the restaurant game from Section 4.2.1 and the link-formation game from Section 4.2.2, where the additive separability and same-ranking assumptions are satisfied for the players ranked by compatibility.
The proof of Theorem 2 follows two steps. In Proposition 6, we abstract away from particular models of experimentation and consider two general index policies r_i, r_j in a stage game that is isomorphically factorable for i and j. Policy r_i is more compatible with s_i^* than r_j is with s_j^* if, following respective histories y_i, y_j for i and j that contain the same observations about the play of the third parties -ij, whenever s_j^* has the highest index under r_j, no s_i ≠ s_i^* has the highest index under r_i. We prove that for any index policies r_i, r_j where r_i is more compatible with s_i^* than r_j is with s_j^*, we get φ_i(s_i^*; r_i, σ_{-i}) ≥ φ_j(s_j^*; r_j, σ_{-j}) in any learning environment σ. In Corollaries 1 and 2, we show that under the conditions of Theorem 2 that relate i and j's learning problems to each other (e.g., i and j have equivalent regular priors and the same patience), s_i^* ≿ s_j^* implies OPT_i is more compatible with s_i^* than OPT_j is with s_j^*, and that the same is true for UCB_i and UCB_j.

(The theorem easily generalizes to the case where i starts with one of L priors g_i^(1), ..., g_i^(L) with probabilities p_1, ..., p_L and j starts with priors g_j^(1), ..., g_j^(L) with the same probabilities, where each pair g_i^(l), g_j^(l) consists of equivalent regular priors for 1 ≤ l ≤ L.)

We begin by introducing a notion of equivalence between the histories of i and j. Since i could observe j's play and vice versa, this equivalence is only defined in terms of the actions of the -ij third parties. Definition.
For Γ isomorphically factorable for i and j with ϕ(s_i) = s_j, i's subhistory y_{i,s_i} is third-party equivalent to j's subhistory y_{j,s_j}, written as y_{i,s_i} ∼ y_{j,s_j}, if they contain the same sequence of observations about the actions of -ij.

Recall that, by Notation 1, we identify each subhistory y_{i,s_i} with a sequence in ×_{h∈F_i[s_i]} A_h and each subhistory y_{j,s_j} with a sequence in ×_{h∈F_j[s_j]} A_h. By isomorphic factorability, F_i[s_i] ∩ H_{-ij} = F_j[s_j] ∩ H_{-ij}. Third-party equivalence of y_{i,s_i} and y_{j,s_j} says that i has played s_i as many times as j has played s_j, and that the sequence of -ij's actions that i encountered from experimenting with s_i is the same as the one that j encountered from experimenting with s_j.

As an example, the following histories for the critic and the diner of the restaurant game are third-party equivalent for the strategy R. This is because the subhistories y_{Critic,R} and y_{Diner,R} contain the same sequences of the restaurant's play (even though the two agents have different observations in terms of how often the other patron goes to the restaurant).

y_Critic:  period        1      2      3      4      5
           own strategy  R      Z      Z      Z      R
           others' play  (L,Z)  ∅      ∅      ∅      (H,Z)

y_Diner:   period        1      2      3      4
           own strategy  Z      R      Z      R
           others' play  ∅      (L,R)  ∅      (H,Z)

Table 1: The two histories y_Critic (with length 5) and y_Diner (with length 4) have third-party equivalent subhistories for R. The row "others' play" shows what the agent infers about others' play from her payoffs — recall that a customer choosing Z always gets the same payoff and so cannot infer anything about how others play.

We use third-party equivalent histories to define a comparison between two abstract index policies. Definition.
Suppose Γ is isomorphically factorable for i and j with ϕ(s_i^*) = s_j^*. For two index policies r_i and r_j, we say r_i is more compatible with s_i^* than r_j is with s_j^* if for any histories y_i, y_j and strategy s_i ∈ S_i with s_i ≠ s_i^* satisfying

1. y_{i,s_i^*} ∼ y_{j,s_j^*} and y_{i,s_i} ∼ y_{j,ϕ(s_i)},
2. s_j^* has weakly the highest index for j,

s_i does not have the weakly highest index for i.

This definition is a property of the index policies r_i, r_j, and does not make reference to payoffs in the underlying stage game. The comparison applies to pairs of policies r_i, r_j such that whenever the subhistories of y_i for strategies s_i^* and s_i ≠ s_i^* are third-party equivalent to the subhistories of y_j for s_j^* and ϕ(s_i), and s_j^* has the highest r_j-index at history y_j, then s_i does not have the highest r_i-index under y_i.

We can now state the first intermediary result we need to establish Theorem 2, which is about the relative experimentation frequencies generated by a pair of index policies where the compatibility relation applies.

Proposition 6.
Suppose Γ is isomorphically factorable for i and j with ϕ(s_i^*) = s_j^*, and that index policy r_i is more compatible with s_i^* than index policy r_j is with s_j^*. Then φ_i(s_i^*; r_i, σ_{-i}) ≥ φ_j(s_j^*; r_j, σ_{-j}) for any 0 ≤ γ < 1 and σ ∈ ×_k ∆(S_k).

The proof extends the coupling argument in the proof of Fudenberg and He (2018)'s Lemma 2, which only applies to the Gittins index in signaling games, and also fills in a missing step (Lemma 4) that the earlier proof implicitly assumed. Proposition 6 applies to any index policies satisfying the comparative compatibility condition stated above. The proof uses this hypothesis to deduce a general conclusion about the induced responses of these agents in the learning problem, where the two agents typically do not have third-party equivalent histories in any given period.

To deal with the issue that i and j learn from endogenous data that diverge as they undertake different experiments, we couple the learning problems of i and j using what we call response paths, A ∈ (×_{h∈H} A_h)^∞. For each such path and learning rule r_i for player i, imagine running the rule against the data-generating process where, the k-th time i plays s_i, i observes the action a_{k,h} ∈ A_h at each information set h ∈ F_i[s_i]. Given a learning rule r_i, each A induces a deterministic infinite history of i's strategies y_i(A, r_i) ∈ (S_i)^∞. We show that under the hypothesis that r_i is more compatible with s_i^* than r_j is with s_j^*, the weighted lifetime frequency of s_i^* in y_i(A, r_i) is larger than that of s_j^* in y_j(A, r_j) for every A, where play in different periods of the infinite histories y_i(A, r_i), y_j(A, r_j) is weighted by the probabilities of surviving into these periods, just as in the definition of induced responses.

Lemma 4 in the Appendix shows that when i and j face i.i.d.
draws of opponents’play from a fixed learning environment σ, the induced responses are the same as ifthey each faced a random response path A drawn at birth according to the (infi-nite) product measure over ( × h ∈H A h ) ∞ whose marginal distribution on each copy of × h ∈H A h corresponds to σ . 38 .4.2 OPT and UCB Satisfy Comparative Compatibility The second step of our proof is carried out in Appendix 8. There, Corollaries 1 and2 show that when the assumptions of Theorem 2 hold and s ∗ i (cid:37) s ∗ j , both OPT andUCB are more compatible with s ∗ i than with s ∗ j provided the additional regularityconditions of Theorem 2 hold. This proves the theorem and provides two learningmodels that microfound PCE’s tremble restrictions. Since the compatibility relationis defined in the language of best responses against opponents’ strategy profiles in thestage game, the key step in showing that OPT and UCB satisfy the comparativecompatibility condition involves reformulating these indices as the expected utility ofusing each strategy against a certain opponent strategy profile.For the Gittins index, this profile is the “synthetic” opponent strategy profileconstructed from the best stopping rule in the auxiliary optimal-stopping problemdefining the index. This is similar to the construction of Fudenberg and He (2018),but in the more general setting of this paper the arguments become more subtle. Theinduced synthetic strategy may be correlated if the learner observes opponents’ play atmultiple information sets after playing s i , even if the learner starts with independentprior beliefs over play at these information. For example, suppose F i [ s i ] consists oftwo information sets, one for each of two players k = k , whose choose between Heads and
Tails. Agent i's prior belief is that each of k' and k'' is either always playing Heads or always playing
Tails, with each of the 4 possible combinations of strategies given 25% prior probability. Now consider the stopping rule where i stops if k' and k'' play differently in the first period, but continues for 100 more periods if they play the same action in the first period. Then the procedure defined above generates a distribution over pairs of Heads and
Tails that is mostly given by play in periods 2 through 100, which is either (Heads, Heads) or (Tails, Tails), each with 50% probability. Thus the stopping rule τ creates correlation in the observed play of the two players. (Other natural index rules that we do not analyze explicitly here also serve as microfoundations of our cross-player restrictions on trembles, provided they satisfy Proposition 6 whenever s_i^* ≿ s_j^*.) This endogenous correlation through the optimal stopping rule is the reason player compatibility is defined in terms of correlated profiles.

For Bayes-UCB, under the assumptions of Theorem 2, the agent may rank opponents' mixed actions on each h ∈ F_i[s_i] from least favorable to most favorable. The analogous opponent strategy profile is the behavior strategy where the q-th quantile mixed action is played on each h, in terms of i's current belief about opponents' play. Importantly, if i and j share the same beliefs about -ij's play and rank -ij's mixed actions in the same way, then the "q-th quantile profile" is the same for both agents.

PCE makes two key contributions. First, it generates new and sensible restrictions on equilibrium play by imposing cross-player restrictions on the relative probabilities that different players assign to certain strategies — namely, those strategy pairs s_i, s_j ranked by the compatibility relation s_i ≿ s_j. As we have shown through examples, this distinguishes PCE from other refinement concepts and allows us to make comparative statics predictions in some games where other equilibrium refinements do not.

Second, PCE shows how the device of restricted "trembles" can capture some of the implications of non-equilibrium learning. As we saw, PCE's cross-player restrictions arise endogenously both in the standard model of Bayesian agents maximizing their expected discounted lifetime utility and in the computationally tractable heuristics of Bayesian upper confidence bounds.
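The quantile step at the heart of the Bayes-UCB heuristic can be sketched as follows. We keep a discrete posterior over finitely many candidate opponent mixed actions, rank them from least to most favorable by an auxiliary utility, and pick the q-th quantile. The discrete belief, the candidate grid, and all names are illustrative assumptions; the paper works with general regular priors.

```python
def quantile_action(posterior, utility, q):
    """Return the q-th quantile mixed action under `posterior`.

    posterior: dict mapping candidate mixed actions (hashable) to probabilities.
    utility:   auxiliary function ranking candidates from least to most favorable.
    q:         quantile in (0, 1].

    Candidates are sorted from least to most favorable; we return the first
    one at which the cumulative posterior probability reaches q.
    """
    ranked = sorted(posterior, key=utility)
    cum = 0.0
    for a in ranked:
        cum += posterior[a]
        if cum >= q - 1e-12:  # tolerance for floating-point accumulation
            return a
    return ranked[-1]
```

Because the output depends only on the posterior, the ranking, and q, two agents with the same beliefs, the same auxiliary ranking, and the same quantile-choice function necessarily compute the same quantile profile, which is the observation used in the text above.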
We conjecture that the result that i is more likely to experiment with s_i than j with s_j when s_i ≿ s_j applies in other natural models of learning or dynamic adjustment, such as those considered by Francetich and Kreps (2018), and that it may be possible to provide foundations for PCE in other and perhaps larger classes of games.

The strength of the PCE refinement depends on the completeness of the compatibility order ≿, since ε-PCE imposes restrictions on i and j's play only when the relation s_i ≿ s_j holds. Our player compatibility definition supposes that player i thinks all mixed strategies of other players are possible, as it considers the set of all totally mixed correlated strategies σ_{-i} ∈ ∆°(S_{-i}). If the players have some prior knowledge about their opponents' utility functions, player i might deduce a priori that the other players will only play strategies in some subset A_{-i} of ∆°(S_{-i}). As we show in Fudenberg and He (2017), in signaling games imposing this kind of prior knowledge leads to a more complete version of the compatibility order. It may similarly lead to a more refined version of PCE.

PCE is defined for general strategic forms. We have only provided learning foundations for player-compatible trembles in factorable games, but we view this as an improvement over the more typical situation in which refinements have no learning foundations at all.

In more general extensive-form games two complications arise. First, player i may have several actions that lead to the same information set of player j, which makes the optimal learning strategy more complicated. Second, player i may get information about how player j plays at some information sets thanks to an experiment by some other player k, so that player i has an incentive to free ride. We plan to deal with these complications in future work.
Moreover, we conjecture that in games where actions have a natural ordering, learning rules based on the idea that nearby strategies induce similar responses can provide learning foundations for refinements in which players tremble more onto nearby actions, as in Simon (1987). More speculatively, the interpretation of trembles as arising from learning may provide learning-theoretic foundations for equilibrium refinements that restrict beliefs at off-path information sets in general extensive-form games, such as perfect Bayesian equilibrium (Fudenberg and Tirole, 1991; Watson, 2017), sequential equilibrium (Kreps and Wilson, 1982), and its extension to games with infinitely many actions (Simon and Stinchcombe, 1995; Myerson and Reny, 2018).

References

Agrawal, R. (1995): "Sample mean based index policies by O(log n) regret for the multi-armed bandit problem,"
Advances in Applied Probability, 27, 1054–1078.

Battigalli, P., S. Cerreia-Vioglio, F. Maccheroni, and M. Marinacci (2016): "Analysis of information feedback and selfconfirming equilibrium," Journal of Mathematical Economics, 66, 40–51.

Battigalli, P., A. Francetich, G. Lanzani, and M. Marinacci (2017): "Learning and Self-confirming Long-Run Biases," Working Paper.

Battigalli, P. and D. Guaitoli (1997): "Conjectural equilibria and rationalizability in a game with incomplete information," in Decisions, Games and Markets, Springer, 97–124.

Bolton, P. and C. Harris (1999): "Strategic experimentation," Econometrica, 67, 349–374.

Chen, W., Y. Wang, and Y. Yuan (2013): "Combinatorial Multi-Armed Bandit: General Framework and Applications," in Proceedings of the 30th International Conference on Machine Learning, ed. by S. Dasgupta and D. McAllester, Atlanta, Georgia, USA: PMLR, vol. 28 of Proceedings of Machine Learning Research, 151–159.

Cho, I.-K. and D. M. Kreps (1987): "Signaling Games and Stable Equilibria," Quarterly Journal of Economics, 102, 179–221.

Doval, L. (2018): "Whether or not to open Pandora's box," Journal of Economic Theory, 175, 127–158.

Esponda, I. and D. Pouzo (2016): "Berk-Nash Equilibrium: A Framework for Modeling Agents With Misspecified Models," Econometrica, 84, 1093–1130.

Francetich, A. and D. M. Kreps (2018): "Choosing a Good Toolkit: Bayes-rule Based Heuristics," Working Paper.

Frick, M. and Y. Ishii (2015): "Innovation adoption by forward-looking social learners," Working Paper.

Fryer, R. and P. Harms (2017): "Two-armed restless bandits with imperfect information: Stochastic control and indexability," Mathematics of Operations Research, 43, 399–427.

Fudenberg, D. and K. He (2017): "Learning and Equilibrium Refinements in Signalling Games," Working Paper.

——— (2018): "Learning and Type Compatibility in Signaling Games," Econometrica, 86, 1215–1255.

Fudenberg, D. and Y. Kamada (2015): "Rationalizable partition-confirmed equilibrium," Theoretical Economics, 10, 775–806.

——— (2018): "Rationalizable partition-confirmed equilibrium with heterogeneous beliefs," Games and Economic Behavior, 109, 364–381.

Fudenberg, D. and D. M. Kreps (1993): "Learning Mixed Equilibria," Games and Economic Behavior, 5, 320–367.

——— (1994): "Learning in Extensive-Form Games, II: Experimentation and Nash Equilibrium," Working Paper.

——— (1995): "Learning in Extensive-Form Games I. Self-Confirming Equilibria," Games and Economic Behavior, 8, 20–55.

Fudenberg, D. and D. K. Levine (1993): "Steady State Learning and Nash Equilibrium," Econometrica, 61, 547–573.

——— (2006): "Superstition and Rational Learning," American Economic Review, 96, 630–651.

Fudenberg, D. and J. Tirole (1991): "Perfect Bayesian equilibrium and sequential equilibrium," Journal of Economic Theory, 53, 236–260.

Gai, Y., B. Krishnamachari, and R. Jain (2012): "Combinatorial Network Optimization With Unknown Variables: Multi-Armed Bandits With Linear Rewards and Individual Observations," IEEE/ACM Transactions on Networking, 20, 1466–1478.

Halac, M., N. Kartik, and Q. Liu (2016): "Optimal contracts for experimentation," Review of Economic Studies, 83, 1040–1091.

Heidhues, P., S. Rady, and P. Strack (2015): "Strategic experimentation with private payoffs," Journal of Economic Theory, 159, 531–551.

Hörner, J. and A. Skrzypacz (2016): "Learning, experimentation and information design," in Advances in Economics and Econometrics: Eleventh World Congress, ed. by B. Honore, A. Pakes, M. Piazzesi, and L. Samuelson, Cambridge University Press, chap. 2, 63–97.

Jackson, M. O. and A. Wolinsky (1996): "A strategic model of social and economic networks," Journal of Economic Theory, 71, 44–74.

Katehakis, M. N. and H. Robbins (1995): "Sequential choice from several populations," Proceedings of the National Academy of Sciences of the United States of America, 92, 8584.

Kaufmann, E., O. Cappé, and A. Garivier (2012): "On Bayesian upper confidence bounds for bandit problems," in Artificial Intelligence and Statistics, 592–600.

Keller, G., S. Rady, and M. Cripps (2005): "Strategic experimentation with exponential bandits," Econometrica, 73, 39–68.

Klein, N. and S. Rady (2011): "Negatively correlated bandits," Review of Economic Studies, 78, 693–732.

Kohlberg, E. and J.-F. Mertens (1986): "On the Strategic Stability of Equilibria," Econometrica, 54, 1003–1037.

Kreps, D. M. and R. Wilson (1982): "Sequential equilibria," Econometrica, 50, 863–894.

Kveton, B., Z. Wen, A. Ashkan, and C. Szepesvari (2015): "Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits," in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, ed. by G. Lebanon and S. V. N. Vishwanathan, San Diego, California, USA: PMLR, vol. 38 of Proceedings of Machine Learning Research, 535–543.

Lehrer, E. (2012): "Partially specified probabilities: decisions and games," American Economic Journal: Microeconomics, 4, 70–100.

Milgrom, P. and J. Mollner (2017): "Extended Proper Equilibrium," Working Paper.

Monderer, D. and L. S. Shapley (1996): "Potential games," Games and Economic Behavior, 14, 124–143.

Myerson, R. B. (1978): "Refinements of the Nash equilibrium concept," International Journal of Game Theory, 7, 73–80.

Myerson, R. B. and P. J. Reny (2018): "Perfect Conditional ε-Equilibria of Multi-Stage Games with Infinite Sets of Signals and Actions," Working Paper.

Rubinstein, A. and A. Wolinsky (1994): "Rationalizable conjectural equilibrium: between Nash and rationalizability," Games and Economic Behavior, 6, 299–311.

Selten, R. (1975): "Reexamination of the perfectness concept for equilibrium points in extensive games," International Journal of Game Theory, 4, 25–55.

Simon, L. K. (1987): "Local perfection," Journal of Economic Theory, 43, 134–156.

Simon, L. K. and M. B. Stinchcombe (1995): "Equilibrium refinement for infinite normal-form games," Econometrica, 63, 1421–1443.

Strulovici, B. (2010): "Learning while voting: Determinants of collective experimentation," Econometrica, 78, 933–971.

Van Damme, E. (1987): Stability and Perfection of Nash Equilibria, Springer-Verlag.

Watson, J. (2017): "A General, Practicable Definition of Perfect Bayesian Equilibrium," Working Paper.

Appendix
We first state an auxiliary lemma.
Lemma 3. If σ° is an ε-PCE and s_i^* ≿ s_j^*, then

σ_i°(s_i^*) ≥ min{ σ_j°(s_j^*), 1 − Σ_{s_i ≠ s_i^*} ε(s_i | i) }.

Proof.
Suppose ε is player-compatible and let ε-equilibrium σ° be given. For s_i^* ≿ s_j^*, suppose first that σ_j°(s_j^*) = ε(s_j^* | j). Then σ_i°(s_i^*) ≥ ε(s_i^* | i) ≥ ε(s_j^* | j) = σ_j°(s_j^*), where the second inequality comes from ε being player-compatible. On the other hand, suppose σ_j°(s_j^*) > ε(s_j^* | j). Since σ° is an ε-equilibrium, the fact that j puts more than the minimum required weight on s_j^* implies s_j^* is at least a weak best response for j against σ°, with σ° totally mixed due to the trembles. The definition of s_i^* ≿ s_j^* then implies that s_i^* must be a strict best response for i against σ° as well. In the ε-equilibrium, i must assign as much weight to s_i^* as possible, so that σ_i°(s_i^*) = 1 − Σ_{s_i ≠ s_i^*} ε(s_i | i). Combining these two cases establishes the desired result.

Proposition 3: For any PCE σ*, player k, and strategy s̄_k such that σ_k^*(s̄_k) > 0, there exists a sequence of totally mixed strategy profiles σ_{-k}^(t) → σ_{-k}^* such that (i) for every pair i, j ≠ k with s_i^* ≿ s_j^*,

lim inf_{t→∞} σ_i^(t)(s_i^*) / σ_j^(t)(s_j^*) ≥ 1,

and (ii) s̄_k is a best response for k against every σ_{-k}^(t).

Proof. By Lemma 3, for every ε^(t)-PCE we get

σ_i^(t)(s_i^*) / σ_j^(t)(s_j^*) ≥ min{ σ_j^(t)(s_j^*) / σ_j^(t)(s_j^*), (1 − Σ_{s_i ≠ s_i^*} ε^(t)(s_i | i)) / σ_j^(t)(s_j^*) }
                              = min{ 1, (1 − Σ_{s_i ≠ s_i^*} ε^(t)(s_i | i)) / σ_j^(t)(s_j^*) }
                              ≥ 1 − Σ_{s_i ≠ s_i^*} ε^(t)(s_i | i).

This says

inf_{t≥T} σ_i^(t)(s_i^*) / σ_j^(t)(s_j^*) ≥ 1 − sup_{t≥T} Σ_{s_i ≠ s_i^*} ε^(t)(s_i | i).

For any sequence of trembles such that ε^(t) → 0,

lim_{T→∞} sup_{t≥T} Σ_{s_i ≠ s_i^*} ε^(t)(s_i | i) = 0,

so

lim inf_{t→∞} σ_i^(t)(s_i^*) / σ_j^(t)(s_j^*) = lim_{T→∞} inf_{t≥T} σ_i^(t)(s_i^*) / σ_j^(t)(s_j^*) ≥ 1.
This shows that if we fix a PCE σ* and consider a sequence of player-compatible trembles ε^(t) → 0 and ε^(t)-PCE σ^(t) → σ*, then each σ_{-k}^(t) satisfies lim inf_{t→∞} σ_i^(t)(s_i^*) / σ_j^(t)(s_j^*) ≥ 1 for every pair i, j ≠ k with s_i^* ≿ s_j^*. Furthermore, from σ_k^*(s̄_k) > 0 and σ_k^(t) → σ_k^*, we know there is some T_1 ∈ ℕ so that σ_k^(t)(s̄_k) > σ_k^*(s̄_k)/2 for all t ≥ T_1. We may also find T_2 ∈ ℕ so that ε^(t)(s̄_k | k) < σ_k^*(s̄_k)/2 for all t ≥ T_2, since ε^(t) → 0. So when t ≥ max(T_1, T_2), σ_k^(t) places strictly more than the required weight on s̄_k, so s̄_k is at least a weak best response for k against σ_{-k}^(t). Now the subsequence of opponent play (σ_{-k}^(t))_{t ≥ max(T_1, T_2)} satisfies the requirement of this proposition.

Theorem 1: PCE exists in every finite strategic-form game.

Proof.
Consider a sequence of tremble profiles with the same lower bound on the probability of each strategy, that is, ε^(t)(s_i | i) = ε^(t) for all i and s_i, with ε^(t) decreasing monotonically to 0 in t. Each of these tremble profiles is player-compatible (regardless of the compatibility structure ≿), and there is some finite T large enough that t ≥ T implies an ε^(t)-equilibrium exists; some subsequence of these ε^(t)-equilibria converges since the space of strategy profiles is compact. By definition these ε^(t)-equilibria are also ε^(t)-PCE, which establishes the existence of PCE.

Proposition 4: In a signaling game, every PCE σ* is a Nash equilibrium satisfying the compatibility criterion, as defined in Fudenberg and He (2018).

Proof. Since every PCE is a trembling-hand perfect equilibrium and since this latter solution concept refines Nash, σ* is a Nash equilibrium.

To show that it satisfies the compatibility criterion, we need to show that σ* assigns probability 0 to plans in A^S that do not best respond to beliefs in the set P(s, σ*) as defined in Fudenberg and He (2018). For any plan assigned positive probability under σ*, by Proposition 3 we may find a sequence of totally mixed signal profiles σ_1^(t) of the sender so that, whenever s is more compatible with type θ′ than with type θ′′ (s_{θ′} ≿ s_{θ′′}), we have lim inf_{t→∞} σ_1^(t)(s | θ′) / σ_1^(t)(s | θ′′) ≥ 1. Write q^(t)(·|s) for the Bayesian posterior belief about the sender's type after signal s under σ_1^(t), which is well defined because each σ_1^(t) is totally mixed. Whenever s_{θ′} ≿ s_{θ′′}, this sequence of posterior beliefs satisfies lim inf_{t→∞} q^(t)(θ′ | s) / q^(t)(θ′′ | s) ≥ λ(θ′) / λ(θ′′), so if the receiver's plan best responds to every element in the sequence, it also best responds to an accumulation point (q^∞(·|s))_{s∈S} with q^∞(θ′ | s) / q^∞(θ′′ | s) ≥ λ(θ′) / λ(θ′′) whenever s_{θ′} ≿ s_{θ′′}.
Since the player compatibility definition used in this paper is slightly easier to satisfy than the type compatibility definition that the set P(s, σ*) is based on, the plan best responds to P(s, σ*) after every signal s.

Proof of Lemma 1

Proof.
By way of contradiction, suppose there is some profile of moves by -i, (a_h)_{h∈H_{-i}}, so that h* is off the path of play in (s_i, (a_h)_{h∈H_{-i}}) = (s_i, a_{h*}, (a_h)_{h∈H_{-i}\h*}). Find a different action of j on h*, a′_{h*} ≠ a_{h*}. Since h* is off the path of play, both (s_i, a_{h*}, (a_h)_{h∈H_{-i}\h*}) and (s_i, a′_{h*}, (a_h)_{h∈H_{-i}\h*}) lead to the same payoff for i. But by Condition (1) in the definition of factorability and the fact that h* ∈ F_i[s_i], we will have found two -i action profiles s_{-i}, s′_{-i} in two different blocks of Π_i[s_i] with U_i(s_i, s_{-i}) = U_i(s_i, s′_{-i}). This contradicts Π_i[s_i] being the coarsest partition of S_{-i} that makes U_i(s_i, ·) measurable.
First, there must be at least two different actions for j on h*, for otherwise i's payoff would be trivially independent of h*. So, there exist actions a_{h*} ≠ a′_{h*} on h* and a profile a_{-h*} of actions elsewhere in the game tree so that U_i(a_{h*}, a_{-h*}) ≠ U_i(a′_{h*}, a_{-h*}). Consider the strategy s_i for i that matches a_{-h*} in terms of play on i's information sets, so we may equivalently write

U_i(s_i, a_{h*}, (a_h)_{h∈H_{-i}\h*}) ≠ U_i(s_i, a′_{h*}, (a_h)_{h∈H_{-i}\h*}),

where (a_h)_{h∈H_{-i}\h*} are the components of a_{-h*} corresponding to information sets of -i. If h* ∉ F_i[s_i], then by Condition (1) of factorability, (a_{h*}, (a_h)_{h∈H_{-i}\h*}) and (a′_{h*}, (a_h)_{h∈H_{-i}\h*}) belong to the same block in Π_i[s_i]. Yet they give different payoffs to i, which contradicts that i's payoff after s_i must be measurable with respect to Π_i[s_i].

We first show that i's induced response against i.i.d. play drawn from σ_{-i} is the same as playing against a response path drawn from η at the start of i's life. This η is the same for all agents and does not depend on their (possibly stochastic) learning rules.

Lemma 4.
In a factorable game, for each σ ∈ ×_k ∆(S_k), there is a distribution η over response paths, so that for any player i, any possibly random rule r_i : Y_i → ∆(S_i), and any strategy s_i ∈ S_i, we have

φ_i(s_i; r_i, σ_{-i}) = (1 − γ) · E_{A∼η}[ Σ_{t=1}^∞ γ^{t−1} · 1{ y_i^t(A, r_i) = s_i } ],

where y_i^t(A, r_i) refers to the t-th period history in y_i(A, r_i).

Proof. In fact, we will prove a stronger statement: we will show there is such a distribution that induces the same distribution over period-t histories for every i, every learning rule r_i, and every t. Think of each response path A as a two-dimensional array, A = (a_{t,h})_{t∈ℕ, h∈H}. For non-negative integers (N_h)_{h∈H}, each profile of sequences of actions ((a_{n_h,h})_{n_h=1}^{N_h})_{h∈H} where a_{n_h,h} ∈ A_h defines a "cylinder set" of response paths of the form:

{ A : a_{n_h,h} equals the given action, for each h ∈ H and 1 ≤ n_h ≤ N_h }.

That is, the cylinder set consists of those response paths whose first N_h elements for information set h match a given sequence, (a_{n_h,h})_{n_h=1}^{N_h}. (If N_h = 0, then there is no restriction on a_{t,h} for any t.) We specify the distribution η by specifying the probability it assigns to these cylinder sets:

η{ ((a_{n_h,h})_{n_h=1}^{N_h})_{h∈H} } = Π_{h∈H} Π_{n_h=1}^{N_h} σ( s : s(h) = a_{n_h,h} ),

where we have abused notation to write ((a_{n_h,h})_{n_h=1}^{N_h})_{h∈H} for the cylinder set satisfying this profile of sequences, and we have used the convention that the empty product is defined to be 1. Recall that a strategy profile s in the extensive-form game specifies an action s(h) ∈ A_h for every information set h in the game tree.
The probability that η assigns to the cylinder set multiplies, across all the a_{n_h,h} restrictions defining the cylinder set, the probabilities that the given mixed strategy σ draws a pure-strategy profile s that plays a_{n_h,h} at information set h.

We establish the claim by induction on t for period-t histories. For t ≥ 1, let Y_i[t] ⊆ Y_i be the set of possible period-t histories of i, that is, Y_i[t] := (S_i × R)^t. In the base case of t = 1, we show that playing against a response path drawn according to η and playing against a pure strategy drawn from σ_{-i} ∈ ×_{k≠i} ∆(S_k) generate the same period-1 history. Fixing a learning rule r_i : Y_i → ∆(S_i) of i, the probability of i having the period-1 history (s_i^(1), (a_h^(1))_{h∈F_i[s_i^(1)]}) ∈ Y_i[1] in the random-matching model is

r_i(∅)(s_i^(1)) · σ( s : s(h) = a_h^(1) for all h ∈ F_i[s_i^(1)] ).

That is, i's rule must play s_i^(1) in the first period of i's life, which happens with probability r_i(∅)(s_i^(1)). Then, i must encounter a pure strategy that generates the required profile of moves (a_h^(1))_{h∈F_i[s_i^(1)]} on the s_i^(1)-relevant information sets, which has probability σ(s : s(h) = a_h^(1) for all h ∈ F_i[s_i^(1)]). The probability of this happening against a response path drawn from η is

r_i(∅)(s_i^(1)) · η( A : a_{1,h} = a_h^(1) for all h ∈ F_i[s_i^(1)] )
= r_i(∅)(s_i^(1)) · Π_{h∈F_i[s_i^(1)]} σ( s : s(h) = a_h^(1) )
= r_i(∅)(s_i^(1)) · σ( s : s(h) = a_h^(1) for all h ∈ F_i[s_i^(1)] ),

where the second line comes from the probability η assigns to cylinder sets, and the third line comes from the fact that σ ∈ ×_k ∆(S_k) involves independent mixing of pure strategies across different players.

We now proceed with the inductive step. By induction, suppose random matching and the η-distributed response path induce the same distribution over the set of period-T histories, Y_i[T], where T ≥ 1. Write this common distribution as φ_{i,T}^RM = φ_{i,T}^η = φ_{i,T} ∈ ∆(Y_i[T]). (In the random-matching model, agents face a randomly drawn pure strategy profile each period, not a fixed behavior strategy: they are matched with random opponents, who each play a pure strategy in the game as a function of their personal history. From Kuhn's theorem, this is equivalent to facing a fixed profile of behavior strategies.) We prove that they also generate the same distribution over length-(T+1) histories.

Suppose random matching generates distribution φ_{i,T+1}^RM ∈ ∆(Y_i[T+1]) and the η-distributed response path generates distribution φ_{i,T+1}^η ∈ ∆(Y_i[T+1]). Each length-(T+1) history y_i[T+1] ∈ Y_i[T+1] may be written as (y_i[T], (s_i^(T+1), (a_h^(T+1))_{h∈F_i[s_i^(T+1)]})), where y_i[T] is a length-T history and (s_i^(T+1), (a_h^(T+1))_{h∈F_i[s_i^(T+1)]}) is a one-period history corresponding to what happens in period T+1. Therefore, we may write for each y_i[T+1],

φ_{i,T+1}^RM(y_i[T+1]) = φ_{i,T}^RM(y_i[T]) · φ_{i,T+1|T}^RM( (s_i^(T+1), (a_h^(T+1))_{h∈F_i[s_i^(T+1)]}) | y_i[T] ),

and

φ_{i,T+1}^η(y_i[T+1]) = φ_{i,T}^η(y_i[T]) · φ_{i,T+1|T}^η( (s_i^(T+1), (a_h^(T+1))_{h∈F_i[s_i^(T+1)]}) | y_i[T] ),

where φ_{i,T+1|T}^RM and φ_{i,T+1|T}^η are the conditional probabilities of the form "having history (s_i^(T+1), (a_h^(T+1))_{h∈F_i[s_i^(T+1)]}) in period T+1, conditional on having history y_i[T] ∈ Y_i[T] in the first T periods." If such conditional probabilities are always the same for the random-matching model and the η-distributed response path model, then from the hypothesis φ_{i,T}^RM = φ_{i,T}^η, we can conclude φ_{i,T+1}^RM = φ_{i,T+1}^η.
By an argument exactly analogous to the base case, we have for the random-matching model

φ^RM_{i,T+1|T}((s^(T+1)_i, (a^(T+1)_h)) | y_i[T]) = r_i(y_i[T])(s^(T+1)_i) · σ(s : s(h) = a^(T+1)_h for all h ∈ F_i[s^(T+1)_i]),

since the matching is independent across periods. But in the η-distributed response path model, a single response path is drawn once and then fixed, so one must compute the conditional probability that the drawn A is such that the response (a^(T+1)_h)_{h ∈ F_i[s^(T+1)_i]} will be seen in period T+1, given the history y_i[T] (which is informative about which response path i is facing).

For each h ∈ H_{−i}, let the non-negative integer N_h represent the number of times i has observed play at the information set h in the history y_i[T]. For each h, let (a_{n_h,h})_{n_h=1}^{N_h} represent the sequence of opponent actions observed at h, in chronological order. The history y_i[T] shows that i is facing a response sequence in the cylinder set consistent with ((a_{n_h,h})_{n_h=1}^{N_h})_{h ∈ H}. If A is to respond to i's next play of s^(T+1)_i with a^(T+1)_h on the s^(T+1)_i-relevant information sets, then A must belong to a more restrictive cylinder set, satisfying the restrictions

((a_{n_h,h})_{n_h=1}^{N_h})_{h ∈ H \ F_i[s^(T+1)_i]},   ((a_{n_h,h})_{n_h=1}^{N_h+1})_{h ∈ F_i[s^(T+1)_i]},

where for each h ∈ F_i[s^(T+1)_i], a_{N_h+1,h} = a^(T+1)_h. The conditional probability is then given by the ratio of the η-probabilities of these two cylinder sets, which from the definition of η must be Π_{h ∈ F_i[s^(T+1)_i]} σ(s : s(h) = a^(T+1)_h).
As before, the independence of σ across players means this is equal to σ(s : s(h) = a^(T+1)_h for all h ∈ F_i[s^(T+1)_i]).

Given this result, to prove that φ_i(s*_i; r_i, σ_{−i}) ≥ φ_j(s*_j; r_j, σ_{−j}), it suffices to show that for every response path A, the period in which s*_i is played for the k-th time in the induced history y_i(A, r_i) happens no later than the period in which s*_j is played for the k-th time in the history y_j(A, r_j). Now we turn to the proof of Proposition 6.
Proof.
Let 0 ≤ δγ < 1 and the totally mixed profile σ be fixed. Consider the product distribution η on the space of response paths, (×_{h ∈ H} A_h)^∞, whose marginal on each copy of ×_{h ∈ H} A_h is the action distribution of σ.

Using Lemma 4, fix a response path A drawn from η and denote by T^(k)_i the period where s*_i appears in y_i(A, r_i) for the k-th time, and by T^(k)_j the period where s*_j appears in y_j(A, r_j) for the k-th time. The quantities T^(k)_i, T^(k)_j are defined to be ∞ if the corresponding strategies do not appear at least k times in the infinite histories. Write n(s_i; k) ∈ N ∪ {∞} for the number of times s_i ∈ S_i is played in the history y_i(A, r_i) before period T^(k)_i and, similarly, n(s_j; k) ∈ N ∪ {∞} for the number of times s_j ∈ S_j is played in y_j(A, r_j) before period T^(k)_j.

Since ϕ establishes a bijection between S_i and S_j, it suffices to show that for every k = 1, 2, 3, ..., either T^(k)_j = ∞ or, for all s_i ≠ s*_i, n(s_i; k) ≤ n(ϕ(s_i); k). We show this by induction on k.

First we establish the base case of k = 1. Suppose T^(1)_j ≠ ∞ and, by way of contradiction, suppose there is some s_i ≠ s*_i such that n(s_i; 1) > n(s_j; 1), where s_j := ϕ(s_i). Find the subhistory y_i of y_i(A, r_i) that leads to s_i being played for the (n(s_j; 1) + 1)-th time, and find the subhistory y_j of y_j(A, r_j) that leads to j playing s*_j for the first time (y_j is well-defined because T^(1)_j ≠ ∞). Note that y_{i,s*_i} ∼ y_{j,s*_j} vacuously, since i has never played s*_i in y_i and j has never played s*_j in y_j. Also, y_{i,s_i} ∼ y_{j,s_j}, since i has played s_i for n(s_j; 1) times and j has played s_j the same number of times, while the definition of a response sequence implies they have seen the same history of play on the common information sets of −ij, F_i[s_i] ∩ F_j[s_j]. This satisfies the definition of third-party equivalence of histories.

Since r_j(y_j) = s*_j and r_j is an index rule, s*_j must have weakly the highest index at y_j. Since r_i is more compatible with s*_i than r_j is with s*_j, s_i must not have weakly the highest index at y_i. And yet r_i(y_i) = s_i, a contradiction.

Now suppose the statement holds for all k ≤ K for some K ≥ 1. We show it also holds for k = K + 1. If T^(K+1)_j = ∞ or T^(K)_j = ∞, we are done. Otherwise, by way of contradiction, suppose there is some s_i ≠ s*_i so that n(s_i; K+1) > n(ϕ(s_i); K+1). Find the subhistory y_i of y_i(A, r_i) that leads to s_i being played for the (n(ϕ(s_i); K+1) + 1)-th time. Since T^(K)_j ≠ ∞, the inductive hypothesis gives T^(K)_i ≠ ∞ and n(s_i; K) ≤ n(ϕ(s_i); K). That is, i played s_i no more than n(ϕ(s_i); K) times before playing s*_i for the K-th time. Since n(ϕ(s_i); K+1) + 1 > n(ϕ(s_i); K), the subhistory y_i must extend beyond period T^(K)_i, so it contains K instances of i playing s*_i.

Next, find the subhistory y_j of y_j(A, r_j) that leads to j playing s*_j for the (K+1)-th time. (This is well-defined because T^(K+1)_j ≠ ∞.) Note that y_{i,s*_i} ∼ y_{j,s*_j}, since i and j have played s*_i and s*_j for K times each, and they were facing the same response path. Also, y_{i,s_i} ∼ y_{j,s_j}, since i has played s_i for n(ϕ(s_i); K+1) times and j has played s_j the same number of times. Since r_j(y_j) = s*_j and r_j is an index rule, s*_j must have weakly the highest index at y_j. Since r_i is more compatible with s*_i than r_j is with s*_j, s_i must not have weakly the highest index at y_i. And yet r_i(y_i) = s_i, a contradiction.
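The response-path construction used throughout this proof, one pre-drawn sequence of opponent actions per information set, consumed one element per visit, can be sketched directly. The code below is our own illustration (the class and the two toy rules are not the paper's notation); it demonstrates the coupling property the proof exploits: two different learning rules fed the same path observe identical opponent play on their shared visits to an information set.

```python
import random

class ResponsePath:
    """A fixed response path: for each information set h, a pre-drawn
    (lazily extended) sequence of opponent actions. The n-th visit to h is
    answered with the n-th element, regardless of which learner is asking."""
    def __init__(self, action_dists, seed=0):
        self.action_dists = action_dists                     # h -> (actions, weights)
        self.rngs = {h: random.Random(f"{seed}:{h}") for h in action_dists}
        self.seqs = {h: [] for h in action_dists}

    def answer(self, h, n):
        acts, weights = self.action_dists[h]
        while len(self.seqs[h]) <= n:                        # lazily extend the sequence
            self.seqs[h].append(self.rngs[h].choices(acts, weights)[0])
        return self.seqs[h][n]

def run(rule, path, periods):
    """Feed a learning rule the path. `rule` maps the history so far to the
    information set it queries this period (a stand-in for choosing a strategy)."""
    visits = {h: 0 for h in path.action_dists}
    history = []
    for _ in range(periods):
        h = rule(history)
        history.append((h, path.answer(h, visits[h])))
        visits[h] += 1
    return history
```

In the proof, this coupling is what allows the histories of players i and j to be compared visit-by-visit under a common response path A.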
In this section, we show that under the conditions of Theorem 2, the Gittins index and the UCB index satisfy the comparative compatibility condition for index rules. Omitted proofs from this section can be found in the Online Appendix.
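Before the formal definitions, it may help to see the object being computed. The sketch below is our own illustration, not code from the paper: it computes a Gittins-style index for the simplest case, a single strategy whose payoff is Bernoulli with unknown success probability under a Beta(a, b) prior, with a discount factor beta standing in for the effective discount factor δγ. It uses the standard retirement-charge calibration (the index is the per-period charge m at which the optimal stopping value of the arm is exactly zero); the finite truncation horizon and the bisection tolerance are numerical conveniences.

```python
from functools import lru_cache

def gittins_index_bernoulli(a, b, beta=0.9, horizon=100, tol=1e-4):
    """Gittins-style index of a Bernoulli arm with Beta(a, b) posterior.

    Retirement-charge calibration: the index is the per-period charge m
    at which the optimal stopping value of the arm is zero. `horizon`
    truncates the stopping problem; `tol` is the bisection width.
    """
    def stopping_value(m):
        @lru_cache(maxsize=None)
        def V(succ, fail, t):
            if t == horizon:
                return 0.0
            p = succ / (succ + fail)          # posterior mean of the arm
            cont = (p - m) + beta * (p * V(succ + 1, fail, t + 1)
                                     + (1 - p) * V(succ, fail + 1, t + 1))
            return max(0.0, cont)             # stopping ("retiring") is free
        return V(a, b, 0)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if stopping_value(mid) > 0:
            lo = mid                          # charge too low: still worth playing
        else:
            hi = mid
    return (lo + hi) / 2
```

For a uniform Beta(1, 1) prior the computed index exceeds the myopic mean 0.5, reflecting the option value of experimentation that drives the analysis in this section.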
Let survival chance γ ∈ [0, 1) and patience δ ∈ [0, 1) be fixed. Let ν_{s_i} ∈ ×_{h ∈ F_i[s_i]} Δ(Δ(A_h)) be a belief over opponents' mixed actions at the s_i-relevant information sets. The Gittins index of s_i under belief ν_{s_i} is given by the maximum value of the following auxiliary optimization problem:

sup_{τ ≥ 1}  E_{ν_{s_i}} { Σ_{t=1}^{τ} (δγ)^{t−1} · u_i(s_i, (a_h(t))_{h ∈ F_i[s_i]}) } / E_{ν_{s_i}} { Σ_{t=1}^{τ} (δγ)^{t−1} },

where the supremum is taken over all positive-valued stopping times τ ≥ 1. Here (a_h(t))_{h ∈ F_i[s_i]} is the profile of actions that −i play on the s_i-relevant information sets the t-th time that i uses s_i: by factorability, only these actions, and not actions elsewhere in the game tree, determine i's payoff from playing s_i. The distribution over the infinite sequence of profiles (a_h(t))_{t=1}^∞ is given by i's belief ν_{s_i}; that is, there is some fixed mixed action in ×_{h ∈ F_i[s_i]} Δ(A_h) that generates the profiles (a_h(t)) i.i.d. across periods t. The event {τ = T} for T ≥ 1 corresponds to playing s_i for T times, observing the first T elements (a_h(t))_{t=1}^T, then stopping.

Write V(τ; s_i, ν_{s_i}) for the value of the above auxiliary problem under the (not necessarily optimal) stopping time τ. The Gittins index of s_i is sup_{τ ≥ 1} V(τ; s_i, ν_{s_i}).

We begin by linking V(τ; s_i, ν_{s_i}) to i's stage-game payoff from playing s_i. From belief ν_{s_i} and stopping time τ, we will construct the correlated profile α(ν_{s_i}, τ) ∈ Δ°(×_{h ∈ F_i[s_i]} A_h), so that V(τ; s_i, ν_{s_i}) is equal to i's expected payoff when playing s_i while opponents play according to this correlated profile on the s_i-relevant information sets.

Definition.
A full-support belief ν_{s_i} ∈ ×_{h ∈ F_i[s_i]} Δ(Δ(A_h)) for player i, together with a (possibly random) stopping rule τ ≥ 1, induces a stochastic process (ã_{(−i),t})_{t ≥ 1} over the space (×_{h ∈ F_i[s_i]} A_h) ∪ {∅}, where ã_{(−i),t} ∈ ×_{h ∈ F_i[s_i]} A_h represents the opponents' actions observed in period t if τ ≥ t, and ã_{(−i),t} = ∅ if τ < t. We call ã_{(−i),t} player i's internal history at period t and write P_{(−i)} for the distribution over internal histories that the stochastic process induces.

Internal histories live in the same space as player i's actual experience in the learning problem, represented as a history in Y_i. The process over internal histories is i's prediction about what would happen in the auxiliary problem (which is an artificial device for computing the Gittins index) if he were to use τ. Enumerate all possible profiles of moves at information sets F_i[s_i] as ×_{h ∈ F_i[s_i]} A_h = {a^(1)_{(−i)}, ..., a^(K)_{(−i)}}; let p_{t,k} := P_{(−i)}[ã_{(−i),t} = a^(k)_{(−i)}] for 1 ≤ k ≤ K be the probability under ν_{s_i} of seeing the profile of actions a^(k)_{(−i)} in period t of the stochastic process over internal histories (ã_{(−i),t})_{t ≥ 1}, and let p_{t,0} := P_{(−i)}[ã_{(−i),t} = ∅] be the probability of having stopped before period t.

Definition.
The synthetic correlated profile at information sets in F_i[s_i] is the element of Δ°(×_{h ∈ F_i[s_i]} A_h) (i.e., a totally mixed correlated random action) that assigns probability

Σ_{t=1}^∞ β^{t−1} p_{t,k} / Σ_{t=1}^∞ β^{t−1} (1 − p_{t,0})

to the profile of actions a^(k)_{(−i)}. Denote this profile by α(ν_{s_i}, τ).

Note that the synthetic correlated profile depends on the belief ν_{s_i}, the stopping rule τ, and the discount factor β. Since the belief ν_{s_i} has full support, there is always a positive probability assigned to observing every possible profile of actions on F_i[s_i] in the first period, so the synthetic correlated profile is totally mixed. The significance of the synthetic correlated profile is that it gives an alternative expression for the value of the auxiliary problem under stopping rule τ.

Lemma 5. V(τ; s_i, ν_{s_i}) = U_i(s_i, α(ν_{s_i}, τ)).

The proof is the same as in Fudenberg and He (2018) and is omitted.

Consider now the situation where i and j share the same beliefs about play of −ij on the common information sets F_i[s_i] ∩ F_j[s_j] ⊆ H_{−ij}. For any pure-strategy stopping time τ_j of j, we define a random stopping rule of i, the mimicking stopping time for τ_j. Lemma 6 will establish that the mimicking stopping time generates a synthetic correlated profile that matches the corresponding profile of τ_j on F_i[s_i] ∩ F_j[s_j].

The key issue in this construction is that τ_j maps j's internal histories to stopping decisions, and these do not live in the same space as i's internal histories. In particular, τ_j makes use of i's play to decide whether to stop. To mimic such a rule, i makes use of external histories, which include both the common component of i's internal history on F_i[s_i] ∩ F_j[s_j] and simulated histories on F_j[s_j] \ (F_i[s_i] ∩ F_j[s_j]). For Γ isomorphically factorable for i and j with ϕ(s_i) = s_j, we may write F_i[s_i] = F_C ∪ F_j with F_C ⊆ H_{−ij} and F_j ⊆ H_j.
Similarly, we may write F_j[s_j] = F_C ∪ F_i with F_i ⊆ H_i. (So F_C is the set of common information sets that are payoff-relevant for both s_i and s_j.) Whenever j plays s_j, he observes some (a_{(C)}, a_{(i)}) ∈ (×_{h ∈ F_C} A_h) × (×_{h ∈ F_i} A_h), where a_{(C)} is a profile of actions at information sets in F_C and a_{(i)} is a profile of actions at information sets in F_i. So a pure-strategy stopping rule in the auxiliary problem defining j's Gittins index for s_j is a function τ_j : ∪_{t ≥ 0} [(×_{h ∈ F_C} A_h) × (×_{h ∈ F_i} A_h)]^t → {0, 1} that maps finite histories of observations to stopping decisions, where "0" means continue and "1" means stop.

(Footnote: Notice that even though i starts with the belief that opponents randomize independently at different information sets, and also holds an independent prior belief, V(τ; s_i, ν_{s_i}) may not be the payoff of playing s_i against independent randomizations by the opponents, because of the endogenous correlation that we discussed in the text.)

Definition.
Player i's mimicking stopping rule for τ_j first draws α_i ∈ ×_{h ∈ F_i} Δ(A_h) from j's belief ν_{s_j} on F_i, and then draws (a_{(i),ℓ})_{ℓ ≥ 1} by independently generating a_{(i),ℓ} from α_i each period. Conditional on (a_{(i),ℓ}), i stops according to the rule

(τ_i | (a_{(i),ℓ})) ((a_{(C),ℓ}, a_{(j),ℓ})_{ℓ=1}^t) := τ_j((a_{(C),ℓ}, a_{(i),ℓ})_{ℓ=1}^t).

That is, the mimicking stopping rule involves ex-ante randomization across the pure-strategy stopping rules τ_i | (a_{(i),ℓ})_{ℓ=1}^∞. First, i draws a behavior strategy on the information sets F_i according to j's belief about i's play. Then, i simulates an infinite sequence (a_{(i),ℓ})_{ℓ=1}^∞ of i's play using this drawn behavior strategy and follows the pure-strategy stopping rule τ_i | (a_{(i),ℓ})_{ℓ=1}^∞.

As in the definition of internal histories, the mimicking strategy and i's belief ν_{s_i} generate a stochastic process (ã_{(j),t}, ã_{(C),t})_{t ≥ 1} of internal histories for i (representing the actions on F_i[s_i] that i anticipates seeing when he plays s_i). It also induces a stochastic process (ẽ_{(i),t}, ẽ_{(C),t})_{t ≥ 1} of "external histories," defined in the following way:

Definition.
The stochastic process of external histories (ẽ_{(i),t}, ẽ_{(C),t})_{t ≥ 1} is defined from the process of internal histories (ã_{(j),t}, ã_{(C),t})_{t ≥ 1} that τ_i generates, as follows: (i) if τ_i < t, then (ẽ_{(i),t}, ẽ_{(C),t}) = ∅; (ii) otherwise, ẽ_{(C),t} = ã_{(C),t}, and ẽ_{(i),t} is the t-th element of the infinite sequence (a_{(i),ℓ})_{ℓ=1}^∞ that i simulated before the first period of the auxiliary problem.

Write P_e for the distribution over the sequence of external histories generated by i's mimicking stopping time for τ_j, which is a function of τ_j, ν_{s_j}, and ν_{s_i}.

(Footnote: Here (a_{(−j),ℓ})_{ℓ=1}^t = ((a_{(C),ℓ}, a_{(i),ℓ}))_{ℓ=1}^t. Note that the mimicking rule is a valid (stochastic) stopping time, as the event {τ_i ≤ T} is independent of play after period T.)

To understand the distinction between internal and external histories, note that the probability of i's first-period internal history satisfying (ã_{(j),1}, ã_{(C),1}) = (ā_{(j)}, ā_{(C)}) for some fixed values (ā_{(j)}, ā_{(C)}) ∈ ×_{h ∈ F_i[s_i]} A_h is given by the probability that a mixed action α_{−i} on F_i[s_i], drawn according to i's belief ν_{s_i}, would generate the profile of actions (ā_{(j)}, ā_{(C)}). On the other hand, the probability of i's first-period external history satisfying (ẽ_{(i),1}, ẽ_{(C),1}) = (ā_{(i)}, ā_{(C)}) for some fixed values (ā_{(i)}, ā_{(C)}) ∈ ×_{h ∈ F_j[s_j]} A_h also depends on j's belief ν_{s_j}, for this belief determines the distribution over (a_{(i),ℓ})_{ℓ=1}^∞ drawn before the start of the auxiliary problem.

When using the mimicking stopping time for τ_j in the auxiliary problem, i expects to see the same distribution of −ij's play before stopping as j does when using τ_j, on the information sets that are both s_i-relevant and s_j-relevant. This is formalized in the next lemma.

Lemma 6.
For Γ isomorphically factorable for i and j with ϕ(s_i) = s_j, suppose i holds belief ν_{s_i} over play in F_i[s_i] and j holds belief ν_{s_j} over play in F_j[s_j], such that ν_{s_i}|_{F_i[s_i] ∩ F_j[s_j]} = ν_{s_j}|_{F_i[s_i] ∩ F_j[s_j]}; that is, the two beliefs match when marginalized to the common information sets in H_{−ij}. Let τ_i be i's mimicking stopping time for τ_j. Then the synthetic correlated profile α(ν_{s_j}, τ_j) marginalized to the information sets of −ij is the same as α(ν_{s_i}, τ_i) marginalized to the same information sets.

Proposition 7.
Suppose Γ is isomorphically factorable for i and j with ϕ(s*_i) = s*_j and ϕ(s_i) = s_j, where s_i ≠ s*_i and s*_i ≿ s*_j. Suppose i holds beliefs ν_{s_i} ∈ ×_{h ∈ F_i[s_i]} Δ(Δ(A_h)) about opponents' play after each s_i, and j holds beliefs ν_{s_j} ∈ ×_{h ∈ F_j[s_j]} Δ(Δ(A_h)) about opponents' play after each s_j, such that ν_{s*_i}|_{F_i[s*_i] ∩ F_j[s*_j]} = ν_{s*_j}|_{F_i[s*_i] ∩ F_j[s*_j]} and ν_{s_i}|_{F_i[s_i] ∩ F_j[s_j]} = ν_{s_j}|_{F_i[s_i] ∩ F_j[s_j]}. If s*_j has the weakly highest Gittins index for j under effective discount factor 0 ≤ δγ < 1, then s_i does not have the weakly highest Gittins index for i under the same effective discount factor.

Proof. We begin by defining a profile of totally mixed correlated actions at the information sets ∪_{s_j ∈ S_j} F_j[s_j] ⊆ H_{−j}, namely a collection of totally mixed correlated profiles (α_{F_j[s_j]})_{s_j ∈ S_j} where α_{F_j[s_j]} ∈ Δ°(×_{h ∈ F_j[s_j]} A_h). For each s'_j ≠ s_j, the profile α_{F_j[s'_j]} is the synthetic correlated profile α(ν_{s'_j}, τ*_{s'_j}), where τ*_{s'_j} is an optimal pure-strategy stopping time in j's auxiliary stopping problem involving s'_j. For s_j itself, the correlated profile α_{F_j[s_j]} is instead the synthetic correlated profile associated with the mimicking stopping rule for τ*_{s_i}, i.e., agent i's pure-strategy optimal stopping time in i's auxiliary problem for s_i.

Next, define a profile of totally mixed correlated actions at the information sets ∪_{s_i ∈ S_i} F_i[s_i] ⊆ H_{−i} for i's opponents. For each s'_i ∉ {s*_i, s_i}, just use the marginal distribution of α_{F_j[ϕ(s'_i)]} constructed before on F_i[s'_i] ∩ H_{−ij}, then arbitrarily specify play at j's information sets contained in F_i[s'_i], if any. For s_i, the correlated profile is α(ν_{s_i}, τ*_{s_i}), i.e., the synthetic move associated with i's optimal stopping rule for s_i. Finally, for s*_i, the correlated profile α_{F_i[s*_i]} is the synthetic correlated profile associated with the mimicking stopping rule for τ*_{s*_j}.

From Lemma 6, these two profiles of correlated actions agree when marginalized to the information sets of −ij. Therefore, they can be completed into totally mixed correlated strategies, σ_{−i} and σ_{−j} respectively, such that σ_{−i}|_{S_{−ij}} = σ_{−j}|_{S_{−ij}}. For each s'_j ≠ s_j, the Gittins index of s'_j for j is U_j(s'_j, σ_{−j}). Also, since α_{F_j[s_j]} is the mixed profile associated with the mimicking stopping time, which need not be optimal, U_j(s_j, σ_{−j}) is no larger than the Gittins index of s_j for j. By the hypothesis that s*_j has the weakly highest Gittins index for j, U_j(s*_j, σ_{−j}) ≥ max_{s_j ≠ s*_j} U_j(s_j, σ_{−j}). By the definition of s*_i ≿ s*_j, we must also have U_i(s*_i, σ_{−i}) > max_{s_i ≠ s*_i} U_i(s_i, σ_{−i}), so in particular U_i(s*_i, σ_{−i}) > U_i(s_i, σ_{−i}). But U_i(s*_i, σ_{−i}) is no larger than the Gittins index of s*_i, for α_{F_i[s*_i]} is the synthetic profile associated with a possibly suboptimal mimicking stopping time, while U_i(s_i, σ_{−i}) is equal to the Gittins index of s_i. This shows s_i cannot have even weakly the highest Gittins index, for s*_i already has a strictly higher Gittins index than s_i does.

The following corollary of Proposition 7, combined with Proposition 6, establishes the first statement of Theorem 2.

Corollary 1.
When s*_i ≿ s*_j, and i and j have the same patience δ, the same survival chance γ, and equivalent independent regular priors, OPT_i is more compatible with s*_i than OPT_j is with s*_j.

Proof. Equivalent regular priors require that priors are independent and that i and j share the same prior beliefs over play on F* := F_i[s*_i] ∩ F_j[s*_j] and over play on F := F_i[s_i] ∩ F_j[s_j]. Thus, after histories y_i, y_j such that y_{i,s*_i} ∼ y_{j,s*_j} and y_{i,s_i} ∼ y_{j,s_j}, we have ν_{s*_i}|_{F*} = ν_{s*_j}|_{F*} and ν_{s_i}|_{F} = ν_{s_j}|_{F}, so the hypotheses of Proposition 7 are satisfied.

We start with a lemma that shows the Bayes-UCB index for a strategy s_i is equal to i's payoff from playing s_i against a certain profile of mixed actions on F_i[s_i], where this profile depends on i's belief about actions on F_i[s_i], the quantile q, and how u_{s_i,h} ranks mixed actions in Δ(A_h) for each h ∈ F_i[s_i].

Lemma 7.
Let n_{s_i} be the number of times i has played s_i in history y_i, and let q_{s_i} = q(n_{s_i}) ∈ (0, 1). Then the Bayes-UCB index for s_i given the quantile-choice function q after history y_i is equal to U_i(s_i, (ᾱ_h)_{h ∈ F_i[s_i]}) for some profile of mixed actions where ᾱ_h ∈ Δ°(A_h) for each h. Furthermore, ᾱ_h only depends on q_{s_i}, on g_i(·|y_{i,h}) (i's posterior belief about play on h), and on how u_{s_i,h} ranks mixed actions in Δ(A_h).

Proof. For each h ∈ F_i[s_i], the random variable ũ_{s_i,h}(y_{i,h}) only depends on y_{i,h} through the posterior g_i(·|y_{i,h}). Furthermore, Q(ũ_{s_i,h}(y_{i,h}); q_{s_i}) is strictly between the highest and lowest possible values of u_{s_i,h}(·), each of which can be attained by some pure action in A_h, so there is a totally mixed ᾱ_h ∈ Δ°(A_h) such that Q(ũ_{s_i,h}(y_{i,h}); q_{s_i}) = u_{s_i,h}(ᾱ_h). Moreover, if u_{s_i,h} and u_{s_j,h} rank mixed actions on Δ(A_h) in the same way, then there are a ∈ R and b > 0 with u_{s_j,h} = a + b·u_{s_i,h}. Then Q(ũ_{s_j,h}(y_{i,h}); q_{s_i}) = a + b·Q(ũ_{s_i,h}(y_{i,h}); q_{s_i}), so ᾱ_h still works for u_{s_j,h}.

The second statement of Theorem 2 follows as a corollary.

Corollary 2. If s*_i ≿ s*_j and the hypotheses of Theorem 2 are satisfied, then UCB_i is more compatible with s*_i than UCB_j is with s*_j.

Proof. When i and j have matching beliefs, by Lemma 7 we may calculate their Bayes-UCB indices for different strategies as their myopic expected payoffs from using these strategies against some common opponents' play, as in the similar argument for the Gittins index. Applying the definition of compatibility, we can deduce that when s*_i ≿ s*_j and ϕ(s*_i) = s*_j, if s*_j has the highest Bayes-UCB index for j, then s*_i must have the highest Bayes-UCB index for i.
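The content of Lemma 7, that a Bayes-UCB index is a posterior quantile of payoffs and hence equals the payoff against some opponent mixed action, can be illustrated with a discrete prior. The sketch below is our own illustration: the grid prior over the opponent's mixed action, the Bernoulli observation model, and the payoff function in the test are all assumed for concreteness.

```python
def bayes_ucb_index(payoff, grid, prior, observations, q):
    """Bayes-UCB index for one strategy whose payoff depends on the opponent's
    unknown probability p of playing action 1 at a single information set.

    payoff: function p -> expected stage payoff of the strategy
    grid, prior: discrete prior over candidate values of p
    observations: 0/1 opponent actions seen on past plays of the strategy
    q: quantile in (0, 1), the optimism level of the index
    """
    # Posterior over the grid by Bayes' rule with a Bernoulli likelihood.
    post = []
    for p, w in zip(grid, prior):
        like = 1.0
        for obs in observations:
            like *= p if obs == 1 else (1.0 - p)
        post.append(w * like)
    total = sum(post)
    post = [w / total for w in post]

    # The index is the q-quantile of the induced payoff distribution.
    pairs = sorted(zip(map(payoff, grid), post))
    cum = 0.0
    for u, w in pairs:
        cum += w
        if cum >= q:
            return u
    return pairs[-1][0]
```

Because a quantile function is nondecreasing in q, raising the optimism level weakly raises the index; and since the quantile lies between the lowest and highest payoffs on the grid, it is attainable as the payoff against some totally mixed action, which is the substance of Lemma 7.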
Lemma 6: For Γ isomorphically factorable for i and j with ϕ(s_i) = s_j, suppose i holds belief ν_{s_i} over play in F_i[s_i] and j holds belief ν_{s_j} over play in F_j[s_j], such that ν_{s_i}|_{F_i[s_i] ∩ F_j[s_j]} = ν_{s_j}|_{F_i[s_i] ∩ F_j[s_j]}; that is, the two beliefs match when marginalized to the common information sets in H_{−ij}. Let τ_i be i's mimicking stopping time for τ_j. Then the synthetic correlated profile α(ν_{s_j}, τ_j) marginalized to the information sets of −ij is the same as α(ν_{s_i}, τ_i) marginalized to the same information sets.

Proof. Let (ã_{(j),t}, ã_{(C),t})_{t ≥ 1} and (ẽ_{(i),t}, ẽ_{(C),t})_{t ≥ 1} be the stochastic processes of internal and external histories for τ_i, with distributions P_{−i} and P_e. Enumerate the possible profiles of actions on F_C as ×_{h ∈ F_C} A_h = {a^(1)_{(C)}, ..., a^(K_C)_{(C)}}, the possible profiles of actions on F_j as ×_{h ∈ F_j} A_h = {a^(1)_{(j)}, ..., a^(K_j)_{(j)}}, and the possible profiles of actions on F_i as ×_{h ∈ F_i} A_h = {a^(1)_{(i)}, ..., a^(K_i)_{(i)}}.

Write p_{t,(k_j,k_C)} := P_{−i}[(ã_{(j),t}, ã_{(C),t}) = (a^(k_j)_{(j)}, a^(k_C)_{(C)})] for k_j ∈ {1, ..., K_j} and k_C ∈ {1, ..., K_C}. Also write q_{t,(k_i,k_C)} := P_e[(ẽ_{(i),t}, ẽ_{(C),t}) = (a^(k_i)_{(i)}, a^(k_C)_{(C)})] for k_i ∈ {1, ..., K_i} and k_C ∈ {1, ..., K_C}. Let p_{t,(0,0)} = q_{t,(0,0)} := P_{−i}[τ_i < t] = P_e[τ_i < t] be the probability of having stopped before period t.

The distribution of external histories that i expects to observe before stopping under belief ν_{s_i} when using the mimicking stopping rule τ_i is the same as the distribution of internal histories that j expects to observe when using the stopping rule τ_j under belief ν_{s_j}, because i simulates the data-generating process on F_i by drawing a mixed action α_i according to j's belief ν_{s_j}|_{F_i}, and ν_{s_i}|_{F_C} = ν_{s_j}|_{F_C}.
Thus for every k_i ∈ {1, ..., K_i} and k_C ∈ {1, ..., K_C},

Σ_{t=1}^∞ (δγ)^{t−1} q_{t,(k_i,k_C)} / Σ_{t=1}^∞ (δγ)^{t−1} (1 − q_{t,(0,0)}) = α(ν_{s_j}, τ_j)(a^(k_i)_{(i)}, a^(k_C)_{(C)}).

For a fixed k̄_C ∈ {1, ..., K_C}, summing across k_i gives

Σ_{t=1}^∞ (δγ)^{t−1} Σ_{k_i=1}^{K_i} q_{t,(k_i,k̄_C)} / Σ_{t=1}^∞ (δγ)^{t−1} (1 − q_{t,(0,0)}) = α(ν_{s_j}, τ_j)(a^(k̄_C)_{(C)}).

By definition, the processes (ã_{(j),t}, ã_{(C),t})_{t ≥ 1} and (ẽ_{(i),t}, ẽ_{(C),t})_{t ≥ 1} have the same marginal distribution on the second dimension:

Σ_{k_i=1}^{K_i} q_{t,(k_i,k̄_C)} = P_{−i}[ã_{(C),t} = a^(k̄_C)_{(C)}] = Σ_{k_j=1}^{K_j} p_{t,(k_j,k̄_C)}.

Making this substitution and using the fact that p_{t,(0,0)} = q_{t,(0,0)},

Σ_{t=1}^∞ (δγ)^{t−1} Σ_{k_j=1}^{K_j} p_{t,(k_j,k̄_C)} / Σ_{t=1}^∞ (δγ)^{t−1} (1 − p_{t,(0,0)}) = α(ν_{s_j}, τ_j)(a^(k̄_C)_{(C)}).

But by the definition of the synthetic correlated profile, the LHS is Σ_{k_j=1}^{K_j} α(ν_{s_i}, τ_i)(a^(k_j)_{(j)}, a^(k̄_C)_{(C)}) = α(ν_{s_i}, τ_i)(a^(k̄_C)_{(C)}). Since the choice of a^(k̄_C)_{(C)} ∈ ×_{h ∈ F_C} A_h was arbitrary, we have shown that the synthetic profile α(ν_{s_j}, τ_j) of the original stopping rule τ_j and the one associated with the mimicking strategy of i, α(ν_{s_i}, τ_i), coincide on F_C.

Corollary 2: The Bayes-UCB rules r_{i,UCB} and r_{j,UCB} satisfy the hypotheses of Proposition 6 when s*_i ≿ s*_j, provided the hypotheses of Theorem 2 are satisfied.

Proof. Consider histories y_i, y_j with y_{i,s*_i} ∼ y_{j,s*_j} and y_{i,s_i} ∼ y_{j,s_j}.
By Lemma 7, there exist ᾱ^{−i}_h ∈ Δ°(A_h) for every h ∈ ∪_{s_i ∈ S_i} F_i[s_i] and ᾱ^{−j}_h ∈ Δ°(A_h) for every h ∈ ∪_{s_j ∈ S_j} F_j[s_j] so that ι_{i,s_i}(y_i) = U_i(s_i, (ᾱ^{−i}_h)_{h ∈ F_i[s_i]}) and ι_{j,s_j}(y_j) = U_j(s_j, (ᾱ^{−j}_h)_{h ∈ F_j[s_j]}) for all s_i, s_j, where ι_{i,s_i}(y_i) is the Bayes-UCB index for s_i after history y_i and ι_{j,s_j}(y_j) is the Bayes-UCB index for s_j after history y_j.

Because y_{i,s*_i} ∼ y_{j,s*_j} and y_{i,s_i} ∼ y_{j,s_j}, y_i contains the same number of s*_i experiments as y_j contains s*_j experiments, and y_i contains the same number of s_i experiments as y_j contains s_j experiments. Also, by third-party equivalence and the fact that i and j start with the same beliefs on common relevant information sets, they have the same posterior beliefs g_i(·|y_{i,h}), g_j(·|y_{j,h}) for any h ∈ F_i[s*_i] ∩ F_j[s*_j] and any h ∈ F_i[s_i] ∩ F_j[s_j]. Finally, the hypotheses of Theorem 2 say that on any h ∈ F_i[s*_i] ∩ F_j[s*_j], u_{s*_i,h} and u_{s*_j,h} have the same ranking of mixed actions, while on any h ∈ F_i[s_i] ∩ F_j[s_j], u_{s_i,h} and u_{s_j,h} have the same ranking of mixed actions. So by Lemma 7 we may take ᾱ^{−i}_h = ᾱ^{−j}_h for all h ∈ F_i[s*_i] ∩ F_j[s*_j] and all h ∈ F_i[s_i] ∩ F_j[s_j].

Find some σ_{−j} = (σ_{−ij}, σ_i) ∈ ×_{k ≠ j} Δ°(S_k) so that σ_{−j} generates the random actions (ᾱ^{−j}_h) on every h ∈ ∪_{s_j ∈ S_j} F_j[s_j]. Then we have ι_{j,s_j}(y_j) = U_j(s_j, σ_{−j}) for every s_j ∈ S_j. The fact that s*_j has weakly the highest index means s*_j is weakly optimal against σ_{−j}. Now take σ_{−i} = (σ_{−ij}, σ_j), where σ_j ∈ Δ°(S_j) is such that it generates the random actions (ᾱ^{−i}_h) on F_i[s*_i] ∩ H_j and F_i[s_i] ∩ H_j.
But since ᾱ^{−i}_h = ᾱ^{−j}_h for all h ∈ F_i[s*_i] ∩ F_j[s*_j] and all h ∈ F_i[s_i] ∩ F_j[s_j], σ_{−i} generates the random actions (ᾱ^{−i}_h) on all of F_i[s*_i] and F_i[s_i], meaning ι_{i,s*_i}(y_i) = U_i(s*_i, σ_{−i}) and ι_{i,s_i}(y_i) = U_i(s_i, σ_{−i}). The definition of compatibility implies U_i(s*_i, σ_{−i}) > U_i(s_i, σ_{−i}), so ι_{i,s*_i}(y_i) > ι_{i,s_i}(y_i). This shows s_i does not have weakly the highest Bayes-UCB index, since s*_i has a strictly higher one.

Online Appendix

9 Proofs of Propositions 1 and 2

9.1 Proof of Proposition 1

Proposition 1: Suppose s*_i ≿ s*_j ≿ s*_k, where s*_i, s*_j, s*_k are strategies of i, j, k. Then s*_i ≿ s*_k.

Proof. Suppose s*_k is weakly optimal for k against some totally mixed correlated profile σ^(k). We show that s*_i is strictly optimal for i against any totally mixed correlated σ^(i) with the property that marg_{−ik}(σ^(k)) = marg_{−ik}(σ^(i)).

To do this, we first modify σ^(i) into a new totally mixed profile by copying how the action of i correlates with the actions of −(ik) in σ^(k). For each s_{−ik} ∈ S_{−ik} and s_i ∈ S_i, we have σ^(k)(s_i, s_{−ik}) > 0 because marg_{−k}(σ^(k)) ∈ Δ°(S_{−k}). So write p(s_i | s_{−ik}) := σ^(k)(s_i, s_{−ik}) / Σ_{s'_i ∈ S_i} σ^(k)(s'_i, s_{−ik}) > 0 for the conditional probability that i plays s_i given that −ik play s_{−ik}, in the profile σ^(k). Now construct the profile ˆˆσ ∈ Δ°(S), where

ˆˆσ(s_i, s_{−ik}, s_k) := p(s_i | s_{−ik}) · σ^(i)(s_{−ik}, s_k).

Profile ˆˆσ has the property that marg_{−jk}(ˆˆσ) = marg_{−jk}(σ^(k)).
To see this, note first that ˆˆσ and σ^(k) have the same marginal distribution over the actions of −(ik), because marg_{−ik}(σ^(k)) = marg_{−ik}(σ^(i)). Also, by construction, the conditional distribution of i's action given the profile of −(ik)'s actions is the same under ˆˆσ as under σ^(k).

From the hypothesis that s*_j ≿ s*_k, we get that j finds s*_j strictly optimal against ˆˆσ. But at the same time, marg_{−i}(ˆˆσ) = marg_{−i}(σ^(i)) by construction, so this also implies marg_{−ij}(ˆˆσ) = marg_{−ij}(σ^(i)). From s*_i ≿ s*_j, and the conclusion just obtained that j finds s*_j strictly optimal against ˆˆσ, we get that i finds s*_i strictly optimal against σ^(i), as desired.

9.2 Proof of Proposition 2

Proposition 2: If s*_i ≿ s*_j, then either we do not have s*_j ≿ s*_i, or both s*_j and s*_i are weakly dominated strategies.

For σ_{−i} ∈ Δ(S_{−i}) and s_i ∈ S_i, write U_i(s_i, σ_{−i}) to mean Σ_{s_{−i} ∈ S_{−i}} U_i(s_i, s_{−i}) · σ_{−i}(s_{−i}), and note that s*_i ≿ s*_j if and only if, for every totally mixed correlated strategy σ_{−j} ∈ Δ°(S_{−j}) such that U_j(s*_j, σ_{−j}) ≥ max_{s_j ∈ S_j \ {s*_j}} U_j(s_j, σ_{−j}), we have, for every σ_{−i} ∈ Δ°(S_{−i}) satisfying marg_{−ij}(σ_{−i}) = marg_{−ij}(σ_{−j}),

U_i(s*_i, σ_{−i}) > max_{s_i ∈ S_i \ {s*_i}} U_i(s_i, σ_{−i}).

Proof.
Assume s*_i ≿ s*_j and recall the maintained assumption that the game has no strictly dominated strategy. We show that these assumptions imply that either we do not have s*_j ≿ s*_i, or both s*_j and s*_i are weakly dominated strategies.

Partition the set Δ°(S_{−j}) into three subsets, Π_+ ∪ Π_0 ∪ Π_−, with Π_+ consisting of the σ_{−j} ∈ Δ°(S_{−j}) that make s*_j strictly better than the best alternative pure strategy, Π_0 consisting of the elements of Δ°(S_{−j}) that make s*_j indifferent to the best alternative, and Π_− consisting of the elements that make s*_j strictly worse. (These sets are well-defined because |S_j| ≥ 2, so j has at least one alternative pure strategy to s*_j.) If Π_0 is non-empty, then there is some σ_{−j} ∈ Π_0 such that U_j(s*_j, σ_{−j}) = max_{s_j ∈ S_j \ {s*_j}} U_j(s_j, σ_{−j}). Because s*_i ≿ s*_j, we have U_i(s*_i, σ̂_{−i}) > max_{s_i ∈ S_i \ {s*_i}} U_i(s_i, σ̂_{−i}) for every σ̂_{−i} ∈ Δ°(S_{−i}) such that marg_{−ij}(σ̂_{−i}) = marg_{−ij}(σ_{−j}). So s*_i is weakly optimal against such a σ̂_{−i}, while s*_j is not strictly optimal against the matching σ_{−j}; hence we do not have s*_j ≿ s*_i.

Also, if both Π_+ and Π_− are non-empty, then Π_0 is non-empty. This is because both σ_{−j} ↦ U_j(s*_j, σ_{−j}) and σ_{−j} ↦ max_{s_j ∈ S_j \ {s*_j}} U_j(s_j, σ_{−j}) are continuous functions: if U_j(s*_j, σ'_{−j}) − max_{s_j ∈ S_j \ {s*_j}} U_j(s_j, σ'_{−j}) > 0 and U_j(s*_j, σ''_{−j}) − max_{s_j ∈ S_j \ {s*_j}} U_j(s_j, σ''_{−j}) < 0, then some mixture of σ'_{−j} and σ''_{−j} must belong to Π_0. So we have shown that if either Π_0 is non-empty, or both Π_+ and Π_− are non-empty, then we do not have s*_j ≿ s*_i.

If only Π_+ is non-empty, then s*_j is strictly dominant for j. Together with s*_i ≿ s*_j, this would imply that s*_i is strictly dominant for i, which would make every other strategy of i strictly dominated, contradicting the maintained assumption.

Finally, suppose that only Π_− is non-empty, so that against every σ_{−j} ∈ Δ°(S_{−j}) there exists a strictly better pure response than s*_j. Then there exists a mixed strategy σ_j for j that strictly dominates s*_j against all correlated play in Δ°(S_{−j}). This shows s*_j is strictly dominated for j provided −j play a totally mixed profile; in particular, s*_j is weakly dominated for j. Suppose there is a σ_{−i} ∈ Δ°(S_{−i}) against which s*_i is a weak best response. Then the fact that s*_j is not a strict best response against any σ_{−j} ∈ Δ°(S_{−j}) means we do not have s*_j ≿ s*_i. On the other hand, suppose s*_i is not a weak best response against any σ_{−i} ∈ Δ°(S_{−i}). Then s*_i is weakly dominated, as is s*_j.
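The copying construction in the proof of Proposition 1 above is mechanical enough to check numerically. The sketch below is our own illustration, with two strategies per player and the players (i, j, m, k), where m stands for the third party −ijk, collapsed to tuple positions; it builds ˆˆσ from σ^(k)'s conditional distribution of i's action and from σ^(i), and the test verifies that ˆˆσ keeps σ^(i)'s marginal on the players other than i while matching σ^(k) on the players other than j and k.

```python
import itertools
import random

def marginal(dist, keep):
    """Marginalize a dict-distribution over tuple keys onto index positions `keep`."""
    out = {}
    for key, w in dist.items():
        sub = tuple(key[p] for p in keep)
        out[sub] = out.get(sub, 0.0) + w
    return out

def build_sigma_hathat(sigma_k, sigma_i):
    """sigma_k: distribution over (s_i, s_j, s_m); sigma_i: over (s_j, s_m, s_k),
    assumed to share the (s_j, s_m) marginal. Returns the proof's profile over
    (s_i, s_j, s_m, s_k): the conditional p(s_i | s_j, s_m) from sigma_k, times sigma_i."""
    marg_jm = marginal(sigma_k, (1, 2))
    hathat = {}
    for (sj, sm, sk), w in sigma_i.items():
        for si in (0, 1):
            hathat[(si, sj, sm, sk)] = sigma_k[(si, sj, sm)] / marg_jm[(sj, sm)] * w
    return hathat
```

Because the conditional weights sum to one, ˆˆσ marginalizes back to σ^(i) on the coordinates other than i, and the shared (s_j, s_m) marginal delivers the agreement with σ^(k) on the coordinates other than j and k, exactly as claimed in the proof.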
10 Refinements in the Link-Formation Game
Each of the following refinements selects the same subset of pure Nash equilibria when applied to the anti-monotonic and co-monotonic versions of the link-formation game: extended proper equilibrium, proper equilibrium, trembling-hand perfect equilibrium, $p$-dominance, Pareto efficiency, and strategic stability. Pairwise stability does not apply to the link-formation game. Finally, the link-formation game is not a potential game.

Step 1. Extended proper equilibrium, proper equilibrium, and trembling-hand perfect equilibrium allow the "no links" equilibrium in both versions of the game.
For $(q_i)$ anti-monotonic with $(c_i)$, for each $\varepsilon > 0$ consider the profile where N1 and S1 play Active with probability $\varepsilon^2$ and N2 and S2 play Active with probability $\varepsilon$. For small enough $\varepsilon$, the expected payoff of Active for player $i$ is approximately $(10 - c_i)\varepsilon$, since terms of higher order in $\varepsilon$ are negligible. It is clear that this payoff is negative for small $\varepsilon$ for every player $i$, and that under the utility re-scalings $\beta_{N1} = \beta_{S1} = 10$, $\beta_{N2} = \beta_{S2} = 1$, the loss to playing Active is smaller for N2 and S2 than for N1 and S1. So this strategy profile is a $(\beta, \varepsilon)$-extended proper equilibrium. Taking $\varepsilon \to 0$, we arrive at the equilibrium where each player chooses Inactive with probability 1.
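The first-order approximation $(10 - c_i)\varepsilon$ can be sanity-checked by computing the exact expected payoff of Active under an assumed payoff specification: an Active player earns the quality of each linked partner and pays her own cost once if any link forms, with qualities $30, 10$ and costs $14, 19$ taken from the numbers appearing in Step 6. This is a sketch under those assumptions, not code from the paper.

```python
# Exact expected payoff of Active under an ASSUMED payoff specification:
# earn q_j per link, pay own cost c_i once if at least one link forms.
# Qualities (30, 10) and costs (14, 19) follow the Step 6 numbers.

def active_payoff(c_i, p1, p2, q1=30.0, q2=10.0):
    # Opponents 1 and 2 are Active independently with probs p1, p2.
    both = p1 * p2
    only1 = p1 * (1 - p2)
    only2 = (1 - p1) * p2
    return (both * (q1 + q2 - c_i)
            + only1 * (q1 - c_i)
            + only2 * (q2 - c_i))

eps = 1e-4
# N1 and S1 tremble onto Active with probability eps**2, N2 and S2 with eps,
# so the leading-order event is a link with the quality-10 opponent.
for c in (14.0, 19.0):
    u = active_payoff(c, eps**2, eps)
    assert u < 0                                # Active is a loss
    assert abs(u - (10 - c) * eps) < 50 * eps**2  # (10 - c_i)*eps to first order
```

The check confirms that for both cost levels the payoff of Active is negative and within $O(\varepsilon^2)$ of the stated approximation.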
For the version with $(q_i)$ co-monotonic with $(c_i)$, consider the same strategies without re-scalings, i.e. $\beta = 1$. Then the loss to playing Active is already smaller for N2 and S2 than for N1 and S1, making the strategy profile a $(1, \varepsilon)$-extended proper equilibrium. These arguments show that the "no links" equilibrium is an extended proper equilibrium in both versions of the game. Every extended proper equilibrium is also proper and trembling-hand perfect, which completes the step.

Step 2. $p$-dominance eliminates the "no links" equilibrium in both versions of the game. Regardless of whether $(q_i)$ are co-monotonic or anti-monotonic with $(c_i)$, under the belief that all other players choose Active with probability $p$ for $p \in (0,1)$, the expected payoff of Active (due to additivity across links) is $(1-p) \cdot p \cdot (10 + 30 - c_i) > 0$ for both values of $c_i$.

Step 3. Pareto efficiency eliminates the "no links" equilibrium in both versions of the game.
It is immediate that the no-links equilibrium outcome is Pareto dominated by the all-links equilibrium outcome under both parameter specifications, so Pareto efficiency rules it out whether $(c_i)$ is anti-monotonic or co-monotonic with $(q_i)$.

Step 4. Strategic stability (Kohlberg and Mertens, 1986) eliminates the "no links" equilibrium in both versions of the game. First suppose the $(c_i)$ are anti-monotonic with $(q_i)$. Let $\eta = 1/100$, and for each small $\varepsilon > 0$ let $\varepsilon_{N1}(\text{Active}) = \varepsilon_{S1}(\text{Active}) = 2\varepsilon$, $\varepsilon_{N2}(\text{Active}) = \varepsilon_{S2}(\text{Active}) = \varepsilon$, and $\varepsilon_i(\text{Inactive}) = \varepsilon$ for all players $i$. When each $i$ is constrained to play each $s_i$ with probability at least $\varepsilon_i(s_i)$, the only Nash equilibrium is for each player to choose Active with probability $1 - \varepsilon$. (To see this, consider N2's play in any such equilibrium $\sigma$. If N2 weakly prefers Active, then N1 must strictly prefer it, so $\sigma_{N1}(\text{Active}) = 1 - \varepsilon \geq \sigma_{N2}(\text{Active})$. On the other hand, if N2 strictly prefers Inactive, then $\sigma_{N2}(\text{Active}) = \varepsilon < 2\varepsilon \leq \sigma_{N1}(\text{Active})$. In either case, $\sigma_{N1}(\text{Active}) \geq \sigma_{N2}(\text{Active})$.) Given $\sigma_{N1}(\text{Active}) \geq \sigma_{N2}(\text{Active})$, each South player has Active as her strict best response, so $\sigma_{S1}(\text{Active}) = \sigma_{S2}(\text{Active}) = 1 - \varepsilon$. Against such a profile of South players, each North player has Active as her strict best response, so $\sigma_{N1}(\text{Active}) = \sigma_{N2}(\text{Active}) = 1 - \varepsilon$.

Now suppose the $(c_i)$ are co-monotonic with $(q_i)$. Again let $\eta = 1/100$, and for each small $\varepsilon > 0$ let $\varepsilon_{N1}(\text{Active}) = \varepsilon_{S1}(\text{Active}) = \varepsilon$, $\varepsilon_{N2}(\text{Active}) = \varepsilon/1000$, $\varepsilon_{S2}(\text{Active}) = \varepsilon$, and $\varepsilon_i(\text{Inactive}) = \varepsilon$ for all players $i$. Suppose by way of contradiction that there is a Nash equilibrium $\sigma$ of the constrained game which is $\eta$-close to the Inactive equilibrium. In such an equilibrium, N2 must strictly prefer Inactive, for otherwise N1 strictly prefers Active and $\sigma$ could not be $\eta$-close to the Inactive equilibrium. A similar argument shows that S2 must strictly prefer Inactive. So N2 and S2 must play Active with the minimum possible probability, that is, $\sigma_{N2}(\text{Active}) = \varepsilon/1000$ and $\sigma_{S2}(\text{Active}) = \varepsilon$. This implies that, even if $\sigma_{N1}(\text{Active})$ were at its minimum possible level of $\varepsilon$, S1 would still strictly prefer playing Inactive, because S1 is 1000 times as likely to link with the low-quality opponent as with the high-quality opponent. This shows $\sigma_{S1}(\text{Active}) = \varepsilon$. But when $\sigma_{S1}(\text{Active}) = \sigma_{S2}(\text{Active}) = \varepsilon$, N1 strictly prefers Active, so $\sigma_{N1}(\text{Active}) = 1 - \varepsilon$. This contradicts $\sigma$ being $\eta$-close to the no-links equilibrium.

Step 5. Pairwise stability (Jackson and Wolinsky, 1996) does not apply to this game. This is because each player chooses between either linking with every player on the opposite side who plays Active, or linking with no one. A player cannot selectively cut off one of her links while preserving the other.
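The key cross-player implication in Step 4's anti-monotonic argument (if N2 weakly prefers Active, then N1 strictly prefers it) can be spot-checked numerically. The additive per-link net gain used below (partner's quality minus one's own cost) is an assumed specification for illustration, with qualities $(30, 10)$ and costs $(14, 19)$ as in the Step 6 numbers.

```python
# Spot-check of Step 4's anti-monotonic implication under an ASSUMED
# additive per-link specification: a North player's gain from Active is
# sigma_S1 * (q1 - c_i) + sigma_S2 * (q2 - c_i), with q = (30, 10) and
# costs c_N1 = 14 (more compatible) and c_N2 = 19 (less compatible).

import random

def net_gain(c_i, sigma_s1, sigma_s2, q1=30.0, q2=10.0):
    # Expected gain from Active over Inactive, additive across links.
    return sigma_s1 * (q1 - c_i) + sigma_s2 * (q2 - c_i)

random.seed(0)
for _ in range(10_000):
    s1 = random.uniform(1e-6, 1.0)   # sigma_S1(Active)
    s2 = random.uniform(1e-6, 1.0)   # sigma_S2(Active)
    if net_gain(19.0, s1, s2) >= 0:
        # N2 (cost 19) weakly prefers Active, so N1 (cost 14) must
        # strictly prefer it: N1's gain exceeds N2's by 5*(s1 + s2) > 0.
        assert net_gain(14.0, s1, s2) > 0
```

The assertion never fires because the two players face the same South opponents and differ only in cost, which is exactly the player-compatibility logic the step exploits.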
Step 6. The game does not have an ordinal potential, so refinements of potential games (Monderer and Shapley, 1996) do not apply. To see that this is not a potential game, consider the anti-monotonic parametrization. Suppose a potential $P$ of the form $P(a_{N1}, a_{N2}, a_{S1}, a_{S2})$ exists, where $a_i = 1$ corresponds to $i$ choosing Active and $a_i = 0$ corresponds to $i$ choosing Inactive. We must have $P(0,0,0,0) = P(1,0,0,0) = P(0,1,0,0) = P(0,0,1,0) = P(0,0,0,1)$, since a unilateral deviation by one player from the Inactive equilibrium does not change any player's payoffs. But notice that $u_{N1}(1,0,0,1) - u_{N1}(0,0,0,1) = 10 - 14 = -4$, while $u_{S2}(1,0,0,1) - u_{S2}(1,0,0,0) = 30 - 19 = 11$. If the game has an ordinal potential, then the first of these expressions must have the same sign as $P(1,0,0,1) - P(0,0,0,1)$ and the second the same sign as $P(1,0,0,1) - P(1,0,0,0)$. But $P(0,0,0,1) = P(0,0,0,0) = P(1,0,0,0)$, so $P(1,0,0,1) - P(0,0,0,1) = P(1,0,0,1) - P(1,0,0,0)$, and this common difference cannot be both strictly negative and strictly positive, a contradiction.
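Step 6's two payoff differences can be reproduced under an assumed payoff specification (an Active player earns the quality of each linked partner and pays her own cost once if any link forms); the specification and the quality/cost assignment below are assumptions chosen to match the numbers $10 - 14$ and $30 - 19$ in the text, not code from the paper.

```python
# Reproduce Step 6's two deviation gains under an ASSUMED specification:
# an Active player earns Q[j] per link and pays C[i] once if any link
# forms. Anti-monotonic values: quality 30 / cost 14 for the 1-players,
# quality 10 / cost 19 for the 2-players.

Q = {"N1": 30, "N2": 10, "S1": 30, "S2": 10}
C = {"N1": 14, "N2": 19, "S1": 14, "S2": 19}

def payoff(i, profile):
    # profile maps each player to 0 (Inactive) or 1 (Active).
    if profile[i] == 0:
        return 0
    others = ["S1", "S2"] if i.startswith("N") else ["N1", "N2"]
    links = [j for j in others if profile[j] == 1]
    if not links:
        return 0        # no links formed: deviating alone is costless
    return sum(Q[j] for j in links) - C[i]

base = {"N1": 0, "N2": 0, "S1": 0, "S2": 1}
dev = dict(base, N1=1)                          # N1 deviates to Active
d_N1 = payoff("N1", dev) - payoff("N1", base)   # 10 - 14 = -4

base2 = {"N1": 1, "N2": 0, "S1": 0, "S2": 0}
dev2 = dict(base2, S2=1)                        # S2 deviates to Active
d_S2 = payoff("S2", dev2) - payoff("S2", base2)  # 30 - 19 = 11

# An ordinal potential would force both differences to share the sign of
# P(1,0,0,1) - P(0,0,0,0), impossible since d_N1 < 0 < d_S2.
```

Under this specification a lone deviation from all-Inactive is payoff-neutral for everyone, matching the equalities among the potential values used in the proof.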