Player-Compatible Learning and Player-Compatible Equilibrium
Player-Compatible Equilibrium∗

Drew Fudenberg†  Kevin He‡

First version: September 23, 2017. This version: July 10, 2019
Abstract
Player-Compatible Equilibrium (PCE) imposes cross-player restrictions on the magnitudes of the players' "trembles" onto different strategies. These restrictions capture the idea that trembles correspond to deliberate experiments by agents who are unsure of the prevailing distribution of play. PCE selects intuitive equilibria in a number of examples where trembling-hand perfect equilibrium (Selten, 1975) and proper equilibrium (Myerson, 1978) have no bite. We show that rational learning and some near-optimal heuristics imply our compatibility restrictions in a steady-state setting.
Keywords: non-equilibrium learning, equilibrium refinements, trembling-hand perfect equilibrium, combinatorial bandits, Bayesian upper confidence bounds.

∗ We thank Alessandro Bonatti, Dan Clark, Glenn Ellison, Ben Golub, Shengwu Li, Dave Rand, Alex Wolitzky, and Muhamet Yildiz for valuable conversations and comments. We thank National Science Foundation grant SES 1643517 for financial support.
† Department of Economics, MIT. Email: [email protected]
‡ California Institute of Technology and University of Pennsylvania. Email: [email protected]

1 Introduction
Starting with Selten (1975), a number of papers have used the device of vanishingly small "trembles" to refine the set of Nash equilibria. This paper introduces player-compatible equilibrium (PCE), which extends this approach by imposing cross-player restrictions on these trembles in a way that is invariant to the utility representations of players' preferences over game outcomes. The heart of this refinement is the concept of player compatibility, which says player $i$ is more compatible with strategy $s_i^*$ than player $j$ is with strategy $s_j^*$ if whenever $s_j^*$ is optimal for $j$ against some correlated profile $\sigma$, $s_i^*$ is optimal for $i$ against any profile $\hat{\sigma}$ matching $\sigma$ in terms of the play of the third parties $-ij$. PCE requires that cross-player tremble magnitudes respect compatibility rankings. As we will explain, PCE interprets "trembles" as deliberate experiments to learn how others play, not as mistakes, and derives its cross-player tremble restrictions from an analysis of the relative frequencies of experiments that different players choose to undertake.

Section 2 defines PCE, studies its basic properties, and proves that PCE exist in all finite games. The compatibility relation is easiest to satisfy when $i$ and $j$ are "non-interacting," meaning that their payoffs do not depend on each other's play. But PCE can have bite even when all players interact with each other, provided that the interactions are not too strong. Moreover, as shown by the examples in Section 3, PCE can rule out seemingly implausible equilibria that other tremble-based refinements such as trembling-hand perfect equilibrium (Selten, 1975) and proper equilibrium (Myerson, 1978) cannot eliminate.

One of these examples is a "link-formation game," where players on each side decide whether or not to pay a cost to be Active and form links with all of the active players on the other side.
In the "anti-monotonic" version of the game, players who incur a higher private cost of link formation give lower benefits to their linked partners; in the "co-monotonic" version, higher-cost players give others higher benefits. We show that the only PCE outcome in the anti-monotonic version is for all players to choose Active, while in the co-monotonic case both "all Active" and "all Inactive" are PCE outcomes. In contrast, the equilibria that satisfy other equilibrium refinements do not depend on whether payoffs are anti-monotonic or co-monotonic.

PCE is defined for general strategic-form games, and stands on its own as a useful refinement of trembling-hand perfection. Moreover, PCE's compatibility restrictions on trembles are implied by models of learning in a class of extensive-form games we describe below. In our learning framework, agents are born into different player roles of a stage game and believe that they face an unknown, time-invariant distribution of opponents' play, as they would in a steady state of a model where a continuum of anonymous agents are randomly matched each period. Each agent only learns about others' play through her own payoffs at the end of the game. Because agents expect to play the game many times, they may choose to "experiment" and use myopically sub-optimal strategies for their informational value. The compatibility restriction on trembles then arises from the differences in the attractiveness of various experiments for different players. For example, in the link-formation game, an agent choosing Inactive always receives the same payoff and same information regardless of others' play, so they may try playing Active even if their prior belief is that the low-benefits counterparty is more likely to play Active than the high-benefits one. As is intuitive, we show that a low-cost agent has a stronger incentive to experiment with Active than a high-cost one does, and will do so more frequently against any mixed play of the counterparties.

The analysis of learning requires details about the extensive form that are not represented by the strategic form, and we are not able to capture its implications in general extensive forms. To make the analysis more tractable, Section 5 restricts attention to a class of "factorable" games, where repeatedly playing a given strategy $s_i$ would reveal all of the payoff consequences of that strategy and no information about the payoff consequences of any other strategy $s_i' \neq s_i$. This restriction implies that at any strategy profile $s$, if player $i$ potentially cares about the action taken at some information set $h$ of $-i$, then either $h$ is on the path of $s$ or $i$ can put $h$ onto the path of play via a unilateral deviation. Thus there is no possibility of learning being "blocked" by other players, and no "free riding" by learning from others' experiments. For simplicity we also require that each player moves at most once along any path of play. The three examples in Section 3 all satisfy these restrictions for generic extensive-form payoffs.

In factorable games, each agent faces a combinatorial bandit problem (see Section 5.2). We consider two related models of how agents deal with the trade-off between exploration and exploitation: the classic model of rational Bayesians maximizing discounted expected utility under the belief that the environment (the aggregate strategy distribution in the population) is constant, and the computationally simpler method of Bayesian upper-confidence bounds (Kaufmann, Cappé, and Garivier, 2012).
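To fix ideas about the upper-confidence-bound approach, here is a minimal one-dimensional sketch (ours, not from the paper). It simplifies in two ways that we flag as assumptions: the posterior over a strategy's payoff is Gaussian with a known standard deviation, and the quantile schedule is $1 - 1/(t+1)$ rather than the schedule analyzed by Kaufmann, Cappé, and Garivier (2012). The names `bayes_ucb_index` and `choose` are ours.

```python
from statistics import NormalDist

def bayes_ucb_index(mean, sd, t):
    # Bayes-UCB-style index for a Gaussian posterior: the posterior quantile
    # at level 1 - 1/(t + 1), where t counts past plays. The index exceeds
    # the posterior mean by more when the strategy is still poorly understood.
    level = 1 - 1 / (t + 1)
    return mean + NormalDist().inv_cdf(level) * sd

def choose(posteriors, t):
    # Index policy: play the strategy with the highest index. `posteriors`
    # maps each strategy name to (posterior mean, posterior std. deviation).
    return max(posteriors, key=lambda s: bayes_ucb_index(*posteriors[s], t))
```

For instance, with posteriors `{"A": (0.2, 1.0), "B": (0.5, 0.0)}` the policy exploits the known strategy B at `t = 1` (the 0.5 quantile adds no bonus) but experiments with the uncertain A at `t = 9`, since the 0.9 quantile of A's posterior exceeds 0.5. This is the sense in which "trembles" are deliberate experiments rather than mistakes.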
In both of these models, the agent uses an "index policy," meaning that they assign a numerical index to each strategy that depends only on past observations when that strategy was used, and then choose the strategy with the highest index. We formulate a compatibility condition for index policies, and show that any index policies for $i$ and $j$ satisfying this compatibility condition for strategies $s_i^*$ and $s_j^*$ will lead to $i$ experimenting relatively more with $s_i^*$ than $j$ with $s_j^*$. To complete the micro-foundation of PCE, we then show that the Bayes-optimal policy and the Bayes-UCB heuristic satisfy the compatibility condition for strategies $s_i^*$ and $s_j^*$ whenever $i$ is more compatible with $s_i^*$ than player $j$ is with strategy $s_j^*$ and the agents in roles $i$ and $j$ face comparable learning problems (e.g., start with the same patience level, same prior beliefs about the play of third parties, etc.). Briefly, upper confidence bound algorithms originated as computationally tractable algorithms for multi-armed bandit problems (Agrawal, 1995; Katehakis and Robbins, 1995). We consider a Bayesian version of the algorithm that keeps track of the learner's posterior beliefs about the payoffs of different strategies, first analyzed by Kaufmann, Cappé, and Garivier (2012). We say more about this procedure in Section 5. See Francetich and Kreps (2018) for a discussion of other heuristics for active learning.

1.1 Related Work

Tremble-based solution concepts date back to Selten (1975), who thanks Harsanyi for suggesting them. These solution concepts consider totally mixed strategy profiles where players do not play an exact best reply to the strategies of others, but may assign positive probability to some or all strategies that are not best replies. Different solution concepts in this class consider different kinds of "trembles," but they all make predictions based on the limits of these non-equilibrium strategy profiles as the probability of trembling tends to zero.
Since we compare PCE to these refinements below, we summarize them here for the reader's convenience. An $\epsilon$-perfect equilibrium is a totally mixed strategy profile where every non-best reply has weight less than $\epsilon$. A limit of $\epsilon_t$-perfect equilibria where $\epsilon_t \to 0$ is a trembling-hand perfect equilibrium. An $\epsilon$-proper equilibrium is a totally mixed strategy profile $\sigma$ where for every player $i$ and strategies $s_i$ and $s_i'$, if $U_i(s_i, \sigma_{-i}) < U_i(s_i', \sigma_{-i})$ then $\sigma_i(s_i) < \epsilon \cdot \sigma_i(s_i')$. A limit of $\epsilon_t$-proper equilibria where $\epsilon_t \to 0$ is a proper equilibrium; in this limit a more costly tremble is infinitely less likely than a less costly one, regardless of the cost difference. Approachable equilibrium (Van Damme, 1987) is also based on the idea that strategies with worse payoffs are played less often. It too is the limit of $\epsilon_t$-perfect equilibria, but where the players pay control costs to reduce their tremble probabilities. When these costs are "regular," all of the trembles are of the same order. Because PCE does not require that the less likely trembles are infinitely less likely than more likely ones, it is closer to approachable equilibrium than to proper equilibrium. The strategic stability concept of Kohlberg and Mertens (1986) is also defined using trembles, but applies to components of Nash equilibria as opposed to single strategy profiles.

Unlike the central feature of PCE, proper equilibrium and approachable equilibrium do not impose cross-player restrictions on the relative probabilities of various trembles. For this reason, when each type of the sender is viewed as a different player, these equilibrium concepts reduce to perfect Bayesian equilibrium in signaling games with two possible signals, such as the beer-quiche game of Cho and Kreps (1987). They do impose restrictions when applied to the ex-ante form of the game, i.e., at the stage before the sender has learned their type.
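The $\epsilon$-proper condition above is mechanical to verify for a given player once expected payoffs against the opponents' profile are in hand. A minimal sketch (ours, with hypothetical strategy labels; the helper name is not from the paper):

```python
def is_eps_proper_for_player(sigma_i, expected_payoff, eps):
    # sigma_i: a totally mixed strategy, mapping each pure strategy to its
    # probability. expected_payoff: expected utility of each pure strategy
    # against the (fixed) profile of the other players.
    # The eps-proper condition: any strictly worse strategy must receive less
    # than eps times the weight of the better one.
    return all(
        sigma_i[s] < eps * sigma_i[t]
        for s in sigma_i
        for t in sigma_i
        if expected_payoff[s] < expected_payoff[t]
    )
```

For example, the strategy `{"a": 0.99, "b": 0.01}` with payoffs `{"a": 1.0, "b": 0.0}` passes the check at $\epsilon = 0.1$ but fails at $\epsilon = 0.005$, illustrating how the condition tightens as $\epsilon \to 0$.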
However, as Cho and Kreps (1987) point out, evaluating the cost of mistakes at the ex-ante stage means that the interim losses are weighted by the prior distribution over sender types, so that less likely types are more likely to tremble. In addition, applying a different positive linear rescaling to each type's utility function preserves every type's preference over lotteries on outcomes, but changes the sets of proper and approachable equilibria, while such utility rescalings have no effect on the set of PCE. In light of these issues, when discussing tremble-based refinements in Bayesian games we will always apply them at the interim stage.

Like PCE, extended proper equilibrium (Milgrom and Mollner, 2017) places restrictions on the relative probabilities of trembles by different players, but it does so in a different way: An extended proper equilibrium is the limit of $(\beta, \epsilon_t)$-proper equilibria, where $\beta = (\beta_1, \ldots, \beta_I)$ is a strictly positive vector of utility rescalings, and $\sigma_i(s_i) < \epsilon_t \cdot \sigma_j(s_j)$ if player $i$'s rescaled loss from $s_i$ (compared to the best response) is greater than $j$'s rescaled loss from $s_j$. In a signaling game with only two possible signals, every Nash equilibrium where each sender type strictly prefers not to deviate from her equilibrium signal is an extended proper equilibrium at the interim stage, because suitable utility rescalings for the types can lead to any ranking of their utility costs of deviating to the off-path signal. By contrast, Proposition 4 shows every PCE must satisfy the compatibility criterion of Fudenberg and He (2018), which has bite even in binary signaling games such as the beer-quiche example of Cho and Kreps (1987). So an extended proper equilibrium need not be a PCE, a fact that Examples 1 and 2 further demonstrate. Conversely, because extended proper equilibrium makes some trembles infinitely less likely than others, it can eliminate some PCE (example available on request).
This paper builds on the work of Fudenberg and Levine (1993) and Fudenberg and Kreps (1995, 1994) on learning foundations for self-confirming and Nash equilibrium. It is also related to recent work that provides explicit learning foundations for various equilibrium concepts that reflect ambiguity aversion, misspecified priors, or model uncertainty, such as Battigalli, Cerreia-Vioglio, Maccheroni, and Marinacci (2016), Battigalli, Francetich, Lanzani, and Marinacci (2017), Esponda and Pouzo (2016), and Lehrer (2012). Unlike those papers, we focus on very patient agents who undertake many "experiments," and characterize the relative rates of experimentation under rational expected-utility maximization and related "near-optimal" heuristics. For this reason our analysis of learning is closer to Fudenberg and Levine (2006) and Fudenberg and He (2018).

Our investigation of learning dynamics significantly expands on that of Fudenberg and He (2018), which focused on a particular learning rule (rational Bayesians) in a restricted set of games (signaling games). In contrast, our analysis applies to a broader class of learning rules (specifically, index policies that satisfy a related compatibility condition) and to a larger family of games, the factorable games defined in Section 4. We develop new tools to deal with new issues that arise in this more general setting. For instance, Fudenberg and He (2018) compare the Gittins indices of different sender types using the fact that any stopping time (for the auxiliary optimal-stopping problem defining the index) of the less-compatible type is also feasible for the more-compatible type. But our general setting allows player roles to interact, so it is not valid to exchange the stopping times of different players, as they may condition on observed play in different parts of the game tree. We deal with this problem by considering how $i$ can nevertheless construct a feasible stopping time that mimics that of $j$.
Moreover, when a player faces more than one opponent, their optimal experimentation policy may lead them to observe a correlated distribution of opponents' play, even though the opponents do not actually play correlated strategies. We discuss this issue of endogenous correlation in Section 5.4.2; it is the reason we define PCE in terms of correlated play.

In methodology the paper is related to other work on active learning and experimentation. In single-agent settings, these include Doval (2018), Francetich and Kreps (2018), and Fryer and Harms (2017). In multi-agent settings additional issues arise, such as free-riding and encouraging others to learn; see e.g. Bolton and Harris (1999), Keller et al. (2005), Klein and Rady (2011), Heidhues, Rady, and Strack (2015), Frick and Ishii (2015), Halac, Kartik, and Liu (2016), Strulovici (2010), and the survey by Hörner and Skrzypacz (2016). Unlike most models of multi-agent bandit problems, our agents only learn from personal histories, not from the actions or histories of others. Our focus is the comparison of experimentation policies under different payoff parameters, which is central to PCE's cross-player tremble restrictions.
2 Player-Compatible Equilibrium

In this section, we first define the player-compatibility relation and discuss its basic properties. We then introduce PCE, which embodies cross-player tremble restrictions based on this relation.

Consider a strategic-form game with a finite set of players $i \in I$, finite strategy sets $|S_i| \geq 2$, and utility functions $U_i : S \to \mathbb{R}$, where $S := \times_i S_i$. We assume no player has a strictly dominated strategy, which lets us avoid some complications that would otherwise need to be treated separately. (If $S_i = \{s_i^*\}$ were a singleton, we would have $(s_i^* \mid i) \succsim (s_j \mid j)$ and $(s_j \mid j) \succsim (s_i^* \mid i)$ for any strategy $s_j$ of any player $j$, following the convention that the maximum over an empty set is $-\infty$.) This assumption is consistent with our learning model, where playing $s_i$ gives no information about the payoff consequences of any other strategy $s_i' \neq s_i$: strictly dominated strategies will never be played, even as experiments, so they may be deleted from the game.

For each $i$, let $\Delta(S_i)$ denote the set of mixed strategies for $i$. For $K \subseteq I$, set $S_K = \times_{i \in K} S_i$ and let $\Delta(S_K)$ represent the set of correlated strategies among players $K$. Let $\Delta^\circ(S_K)$ represent the interior of $\Delta(S_K)$, that is, the set of full-support correlated strategies on $S_K$. We formalize the concept of "compatibility" between players and their strategies in this general setting, which will play a central role in the definition of PCE in determining cross-player restrictions on trembles.
Definition.
For players $i \neq j$ and strategies $s_i^* \in S_i$, $s_j^* \in S_j$, say $i$ is more compatible with $s_i^*$ than $j$ is with $s_j^*$, abbreviated as $s_i^* \succsim s_j^*$, if for every totally mixed correlated strategy profile $\sigma \in \Delta^\circ(S)$ with
$$\sum_{s \in S} U_j(s_j^*, s_{-j}) \cdot \sigma(s) = \max_{s_j \in S_j} \sum_{s \in S} U_j(s_j, s_{-j}) \cdot \sigma(s),$$
we get
$$\sum_{s \in S} U_i(s_i^*, s_{-i}) \cdot \tilde{\sigma}(s) > \max_{s_i \in S_i \setminus \{s_i^*\}} \sum_{s \in S} U_i(s_i, s_{-i}) \cdot \tilde{\sigma}(s)$$
for every totally mixed correlated strategy profile $\tilde{\sigma} \in \Delta^\circ(S)$ satisfying $\mathrm{marg}_{-ij}(\sigma) = \mathrm{marg}_{-ij}(\tilde{\sigma})$.

In words, if $s_j^*$ is weakly optimal for the less-compatible $j$ against $\sigma$, then $s_i^*$ is strictly optimal for the more-compatible $i$ against any $\tilde{\sigma}$ whose marginal on $-ij$'s play is totally mixed and agrees with that of $\sigma$. As this restatement makes clear, the compatibility condition only depends on players' preferences over probability distributions on $S$, and not on the particular utility representations chosen. (Recall that a full-support correlated strategy assigns positive probability to every pure strategy profile. The notation $s_i^* \succsim s_j^*$ is unambiguous provided $i$ and $j$ have disjoint strategy sets; in the event that $i$ and $j$ share some strategies, we will clarify this notation by attaching player subscripts.)

Since $\times_i \Delta^\circ(S_i) \subseteq \Delta^\circ(S)$, our definition of compatibility ranks fewer strategy-player pairs than an alternative definition that only considers mixed strategy profiles with independent mixing between different opponents. We use the more stringent definition to match the microfoundations of our compatibility-based cross-player restrictions.

The compatibility relation is transitive, as the next proposition shows.
Proposition 1.
Suppose $s_i^* \succsim s_j^* \succsim s_k^*$, where $s_i^*, s_j^*, s_k^*$ are strategies of players $i, j, k$ respectively. Then $s_i^* \succsim s_k^*$.

The next result states that the compatibility relation is asymmetric, except in the corner case where both strategies are weakly dominated.
Proposition 2. If $s_i^* \succsim s_j^*$, then either $s_j^* \not\succsim s_i^*$, or both $s_j^*$ and $s_i^*$ are weakly dominated strategies.

The proofs of Propositions 1 and 2 are straightforward; they can be found in the Online Appendix.

We think of PCE as primarily a solution concept for games with three or more players, where the relative tremble probabilities of players $i \neq j$ affect some third party's best response. If players $i$ and $j$ care a great deal about each other's strategies, then their best responses are unlikely to be determined only by the play of the third parties. In the other extreme, a game has a multipartite structure if the set of players $I$ can be divided into $C$ mutually exclusive classes, $I = I_1 \cup \ldots \cup I_C$, in such a way that whenever $i$ and $j$ belong to the same class $i, j \in I_c$: (1) they are non-interacting, meaning $i$'s payoff does not depend on the strategy of $j$ and $j$'s payoff does not depend on the strategy of $i$; and (2) they have the same strategy set, $S_i = S_j$. As a leading case, every Bayesian game has a multipartite structure when each type of each Bayesian player is viewed as a distinct player, since different types of the same Bayesian player are non-interacting and share a strategy set. For non-interacting $i, j \in I_c$, we may write $U_i(s_c, s_{-ij})$ without ambiguity for $s_c \in S_i$, since all augmentations of the strategy profile $s_{-ij}$ with a strategy by player $j$ lead to the same payoff for $i$. For $s_c^* \in S_i = S_j$, the definition of $s_{ic}^* \succsim s_{jc}^*$ reduces to: for every totally mixed correlated $\sigma$ with $\sigma_{-ij} \in \Delta^\circ(S_{-ij})$,
$$\sum_{s \in S} U_j(s_{jc}^*, s_{-ij}) \cdot \sigma(s) = \max_{s_j \in S_j} \sum_{s \in S} U_j(s_j, s_{-ij}) \cdot \sigma(s)$$
implies
$$\sum_{s \in S} U_i(s_{ic}^*, s_{-ij}) \cdot \tilde{\sigma}(s) > \max_{s_i \in S_i \setminus \{s_{ic}^*\}} \sum_{s \in S} U_i(s_i, s_{-ij}) \cdot \tilde{\sigma}(s).$$
(Here $s_{ic}^*$ refers to $i$'s copy of $s_c^*$ and $s_{jc}^*$ to $j$'s copy. Formally, the alternative definition of compatibility mentioned earlier would replace "totally mixed correlated strategy profiles" with "independent and totally mixed strategy profiles" in the definition of $s_i^* \succsim s_j^*$.)
While the player-compatibility condition is especially easy to state for non-interacting players, our learning foundation will also justify cross-player tremble restrictions for pairs of players $i, j$ whose payoffs do depend on each other's strategies, as in the "restaurant game" we discuss in Example 1.
We now move towards the definition of PCE. PCE is a tremble-based solution concept. It builds on and modifies Selten (1975)'s definition of trembling-hand perfect equilibrium as the limit of equilibria of perturbed games in which agents are constrained to tremble, so we begin by defining our notation for the trembles and the associated constrained equilibria.
Definition. A tremble profile $\epsilon$ assigns a positive number $\epsilon(s_i \mid i) > 0$ to each player $i$ and pure strategy $s_i$. Given a tremble profile $\epsilon$, write $\Pi_i^\epsilon$ for the set of $\epsilon$-strategies
of player $i$, namely
$$\Pi_i^\epsilon := \{ \sigma_i \in \Delta(S_i) : \sigma_i(s_i) \geq \epsilon(s_i \mid i) \ \ \forall s_i \in S_i \}.$$
We call $\sigma^\circ$ an $\epsilon$-equilibrium if for each $i$,
$$\sigma_i^\circ \in \arg\max_{\sigma_i \in \Pi_i^\epsilon} U_i(\sigma_i, \sigma_{-i}^\circ).$$
Note that $\Pi_i^\epsilon$ is compact and convex. It is also non-empty when $\epsilon$ is close enough to $0$. By standard results, whenever $\epsilon$ is small enough so that $\Pi_i^\epsilon$ is non-empty for each $i$, an $\epsilon$-equilibrium exists.

The key building block for PCE is $\epsilon$-PCE, which is an $\epsilon$-equilibrium where the tremble profile is "co-monotonic" with $\succsim$ in the following sense:
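Because utility is linear in a player's own mixing, a best reply in the perturbed game has a simple form: every strategy receives its floor $\epsilon(s_i \mid i)$, and all remaining mass goes on a strategy with the highest expected payoff (unique for generic payoffs). A minimal sketch of this computation (ours; the helper name and labels are hypothetical):

```python
def constrained_best_reply(expected_payoff, floors):
    # expected_payoff: expected utility of each pure strategy against the
    # opponents' fixed profile. floors: the tremble floor eps(s_i | i) for
    # each pure strategy; the floors must sum to at most 1 for the
    # constrained set to be non-empty.
    assert sum(floors.values()) <= 1.0
    best = max(expected_payoff, key=expected_payoff.get)
    reply = dict(floors)  # start every strategy at its floor
    reply[best] += 1.0 - sum(floors.values())  # pile the rest on the optimum
    return reply
```

For instance, with payoffs `{"Active": 0.3, "Inactive": 0.0}` and floors of 0.01 on each strategy, the constrained best reply plays Active with probability 0.99 and Inactive with exactly its floor 0.01.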
Definition.
Tremble profile $\epsilon$ is player compatible if $\epsilon(s_i^* \mid i) \geq \epsilon(s_j^* \mid j)$ for all $i, j, s_i^*, s_j^*$ such that $s_i^* \succsim s_j^*$. An $\epsilon$-equilibrium where $\epsilon$ is player compatible is called a player-compatible $\epsilon$-equilibrium (or $\epsilon$-PCE).

The condition on $\epsilon$ says the minimum weight $i$ could assign to $s_i^*$ is no smaller than the minimum weight $j$ could assign to $s_j^*$ in the constrained game:
$$\min_{\sigma_i \in \Pi_i^\epsilon} \sigma_i(s_i^*) \geq \min_{\sigma_j \in \Pi_j^\epsilon} \sigma_j(s_j^*).$$
This is a "cross-player tremble restriction," that is, a restriction on the relative probabilities of trembles by different players. Note that it, like the compatibility relation, depends on the players' preferences over distributions on $S$ but not on the particular utility representation used. This invariance property distinguishes player-compatible trembles from other models of stochastic behavior such as the stochastic terms in logit best responses.

As is usual for tremble-based equilibrium refinements, we now define PCE as the limit of a sequence of $\epsilon$-PCE where $\epsilon \to 0$.

Definition. A strategy profile $\sigma^*$ is a player-compatible equilibrium (PCE) if there exists a sequence of player-compatible tremble profiles $\epsilon^{(t)} \to 0$ and an associated sequence of strategy profiles $\sigma^{(t)}$, where each $\sigma^{(t)}$ is an $\epsilon^{(t)}$-PCE, such that $\sigma^{(t)} \to \sigma^*$.

The cross-player restrictions embodied in player-compatible trembles translate into analogous restrictions on PCE, as shown in the next result.

Proposition 3.
For any PCE $\sigma^*$, player $k$, and strategy $\bar{s}_k$ such that $\sigma_k^*(\bar{s}_k) > 0$, there exists a sequence of totally mixed strategy profiles $\sigma_{-k}^{(t)} \to \sigma_{-k}^*$ such that (i) for every pair $i, j \neq k$ with $s_i^* \succsim s_j^*$,
$$\liminf_{t \to \infty} \frac{\sigma_i^{(t)}(s_i^*)}{\sigma_j^{(t)}(s_j^*)} \geq 1,$$
and (ii) $\bar{s}_k$ is a best response for $k$ against every $\sigma_{-k}^{(t)}$.

The proofs of this and subsequent results in the main text appear in the Appendix. That is, treating each $\sigma_{-k}^{(t)}$ as a totally mixed approximation to $\sigma_{-k}^*$, in a PCE each player $k$ essentially best responds to totally mixed opponent play that respects player compatibility.

It is easy to show that every $\epsilon$-PCE respects player compatibility up to the "adding-up constraint" that probabilities on different strategies must sum to 1 and that $i$ must place probability no smaller than $\epsilon(s_i \mid i)$ on strategies $s_i \neq s_i^*$. The "up to" qualification disappears in the $\epsilon^{(t)} \to 0$ limit because the required probabilities on $s_i \neq s_i^*$ tend to 0.

Since PCE is defined as the limit of $\epsilon$-equilibria for a restricted class of trembles, PCE form a subset of trembling-hand perfect equilibria; the next result shows this subset is not empty. It uses the fact that tremble profiles with the same lower bound on the probability of each action satisfy the compatibility condition in any game.

Theorem 1.
PCE exists in every finite strategic-form game.

2.3 Some Properties of PCE

A tremble profile $\epsilon$ is uniform if for all $i$ and $s_i \in S_i$, we have $\epsilon(s_i \mid i) = \bar{\epsilon}$ for the same $\bar{\epsilon} >$
0. A trembling-hand perfect equilibrium is a uniform THPE if it is the limit of $\epsilon^{(t)}$-equilibria where $\epsilon^{(t)} \to 0$ and each $\epsilon^{(t)}$ is uniform. The proof of Theorem 1 in fact establishes the existence of uniform THPE, which form a subset of PCE since uniform trembles are always player compatible regardless of the stage game.

One drawback of uniform THPE is that there is no clear microfoundation for uniform trembles. In addition to the cross-player restrictions of the compatibility condition, these uniform trembles impose the same lower bound on the tremble probabilities for all strategies of each given player. PCE and the learning foundation we develop allow for more complicated patterns of experimentation that respect the compatibility structure. We study a more permissive refinement than uniform THPE for which we can offer a learning story for the tremble restrictions. PCE is a fairly weak solution concept that nevertheless has bite in some cases of interest, as we discuss in Section 3.

3 Examples

In this section, we study examples of games where PCE rules out unintuitive Nash equilibria. We will also use these examples to distinguish PCE from existing refinements.
We start with a complete-information game where PCE differs from other solution concepts.
Example 1.
There are three players in the game: a food critic, a regular diner, and a restaurant. Simultaneously, the restaurant decides between ordering high-quality ($H$) or low-quality ($L$) ingredients, while the critic and the diner each decide whether to eat at the restaurant ($R$) or order pizza ($Z$) and eat at home. The utility from $Z$ is normalized to 0. If both customers choose $Z$, the restaurant also gets 0 payoff. Otherwise, the restaurant's payoff depends on the ingredient quality and clientele. Choosing $L$ yields a profit of +2 per customer while choosing $H$ yields a profit of +1 per customer. In addition, if the food critic is present, she will write a review based on ingredient quality, which raises the restaurant's payoff under $H$ and lowers it under $L$. Each customer gets a payoff of $x < 0$ from eating low-quality food and $y > 0$ from eating high-quality food.

The profile $(Z_c, Z_d, L)$ is a proper equilibrium, sustained by the restaurant's belief that when at least one customer plays $R$, it is far more likely that the diner deviated to patronizing the restaurant than the critic, even though the critic has a greater incentive to go to the restaurant as she gets paid for writing reviews. It is also an extended proper equilibrium (because scaling the critic's payoff by a large positive constant makes it more costly for the critic to deviate to $R_c$ than for the diner to deviate to $R_d$).

We claim that $R_c \succsim R_d$. To see this, note that for any profile $\sigma$ of totally mixed, correlated play that makes the diner indifferent between $Z_d$ and $R_d$, we must have $U(R_c, \tilde{\sigma}_{-c}) \geq 0.5$ for every $\tilde{\sigma}$ that agrees with $\sigma$ in terms of the restaurant's play. This is because the critic's utility from $R_c$ is minimized when the diner chooses $R_d$ with probability 1, but even then the critic gets 0.5 higher utility from going to a crowded restaurant than the diner gets from going to an empty restaurant, holding fixed food quality at the restaurant. This shows $R_c \succsim R_d$.

Whenever $\sigma_c^{(t)}(R_c)/\sigma_d^{(t)}(R_d) \geq 1$, the restaurant strictly prefers $H$ over $L$. Thus by Proposition 3, there is no PCE where the restaurant plays $L$ with positive probability.
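A small numeric sketch of the restaurant's comparison may help (ours, not from the paper; the review's payoff impact is set to ±2 purely as an illustrative assumption, and `restaurant_payoff` is a hypothetical helper name):

```python
def restaurant_payoff(quality, p_critic, p_diner, review=2.0):
    # Expected restaurant payoff when the critic patronizes with probability
    # p_critic and the diner with probability p_diner: L earns 2 per customer
    # and H earns 1 per customer, and when the critic is present her review
    # adds `review` under H and subtracts it under L (assumed magnitude).
    per_customer = 2.0 if quality == "L" else 1.0
    swing = review if quality == "H" else -review
    return per_customer * (p_critic + p_diner) + swing * p_critic
```

Under this assumption the difference $U(H) - U(L)$ equals $3 p_c - p_d$, so $H$ is strictly better whenever $3 p_c > p_d$, which in particular covers every case where the critic trembles to $R$ at least as often as the diner does.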
□

In the next example, PCE makes different predictions in two versions of a game with different payoff parameters, while all other solution concepts we know of make the same predictions in both versions.
Example 2.
There are 4 players in the game, split into two sides: North and South. The players are named North-1, North-2, South-1, and South-2, abbreviated as N1, N2, S1, and S2 respectively.

These players engage in a strategic link-formation game. Each player simultaneously takes an action: either Inactive or Active. An Inactive player forms no links. An Active player forms a link with every Active player on the opposite side. (Two players on the same side cannot form links.) For example, suppose N1 plays Active, N2 plays Active, S1 plays Inactive, and S2 plays Active. Then N1 creates a link to S2, N2 creates a link to S2, S1 creates no links, and S2 creates links to both N1 and N2.

Each player $i$ is characterized by two parameters: cost ($c_i$) and quality ($q_i$). Cost refers to the private cost that a player pays for each link she creates. Quality refers to the benefit that a player provides to others when they link to her. A player who forms no links gets a payoff of 0. In the above example, the payoff to North-1 is $q_{S2} - c_{N1}$ and the payoff to South-2 is $(q_{N1} - c_{S2}) + (q_{N2} - c_{S2})$. □

We consider two versions of this game, shown below. In the anti-monotonic version on the left, players with a higher cost have a lower quality. In the co-monotonic version on the right, players with a higher cost have a higher quality. There are two pure-strategy Nash outcomes for each version: all links form or no links form. "All links form" is the unique PCE outcome in the anti-monotonic case, while both "all links" and "no links" are PCE outcomes under co-monotonicity.

Anti-Monotonic
Player Cost Quality
North-1    14     30
North-2    19     10
South-1    14     30
South-2    19     10

Co-Monotonic
Player Cost Quality
North-1    14     10
North-2    19     30
South-1    14     10
South-2    19     30

The compatibility structure with respect to own quality is reversed between these two versions of the game. In both versions, $\text{Active}_{N1} \succsim \text{Active}_{N2}$, but N1 has high quality in the anti-monotonic version and low quality in the co-monotonic version. Thus, in the anti-monotonic version but not in the co-monotonic version, player-compatible trembles lead to the high-quality counterparty choosing Active at least as often as the low-quality counterparty, which means
Active has a positive expected payoff even when one's own cost is high. For this reason, the set of PCE is different in these two cases. In contrast, the sets of equilibria that satisfy extended proper equilibrium, proper equilibrium, trembling-hand perfect equilibrium, $p$-dominance, Pareto efficiency, and strategic stability do not depend on whether payoffs are co- or anti-monotonic, as shown in the Online Appendix.

Recall that a signaling game is a two-player Bayesian game where P1 is a sender who knows her own type $\theta$, and P2 only knows that P1's type is drawn according to the distribution $\lambda \in \Delta(\Theta)$ on a finite type space $\Theta$. After learning her type, the sender sends a signal $s \in S$ to the receiver. Then the receiver responds with an action $a \in A$. Utilities depend on the sender's type $\theta$, the signal $s$, and the action $a$.

Fudenberg and He (2018)'s compatibility criterion is defined only for signaling games. It does not use limits of games with trembles, but instead restricts the beliefs that the receiver can have about the sender's type. That sort of restriction does not seem easy to generalize beyond games with observed actions, while using trembles allows us to define PCE for general strategic-form games. As we will see, the more general PCE definition implies the compatibility criterion in signaling games.

With each sender type viewed as a different player, this game has $|\Theta| + 1$ players, $I = \Theta \cup \{2\}$, where the strategy set of each sender type $\theta$ is $S_\theta = S$ while the strategy set of the receiver is $S_2 = A^S$, the set of signal-contingent plans. So a mixed strategy of $\theta$ is a possibly mixed signal choice $\sigma(\cdot \mid \theta) \in \Delta(S)$, while a mixed strategy $\sigma_2 \in \Delta(A^S)$ of the receiver is a mixed plan about how to respond to each signal.

Fudenberg and He (2018) define type compatibility for signaling games.
A signal s* is more type-compatible with θ′ than with θ″ if, for every behavioral strategy σ_2 of the receiver,

u(s*, σ_2; θ″) ≥ max_{s′≠s*} u(s′, σ_2; θ″) implies u(s*, σ_2; θ′) > max_{s′≠s*} u(s′, σ_2; θ′).

They also define the compatibility criterion, which imposes restrictions on off-path beliefs in signaling games. Consider a Nash equilibrium (σ*_1, σ*_2). For any signal s* and receiver action a with σ*_2(a|s*) >
0, the compatibility criterion requires that a best responds to some belief p ∈ ∆(Θ) about the sender's type such that, whenever s* is more type-compatible with θ′ than with θ″ and s* is not equilibrium dominated for θ′, p satisfies p(θ″)/p(θ′) ≤ λ(θ″)/λ(θ′).

Since every totally mixed strategy of the receiver is payoff-equivalent to a behavioral strategy, it is easy to see that type compatibility implies s*_θ′ ≽ s*_θ″. The next result shows that, when specialized to signaling games, all PCE pass the compatibility criterion.
Proposition 4.
In a signaling game, every PCE σ* is a Nash equilibrium satisfying the compatibility criterion of Fudenberg and He (2018).

This proposition in particular implies that in the beer-quiche game of Cho and Kreps (1987), the quiche-pooling equilibrium is not a PCE, as it does not satisfy the compatibility criterion.

(Signal s* is not equilibrium dominated for θ if max_{a∈A} u(s*, a; θ) > u(s, σ*_2; θ) for every s with σ*_1(s|θ) > 0.)

The converse does not hold. We defined type compatibility to require testing against all receiver strategies, not just the totally mixed ones, so it is possible that s*_θ′ ≽ s*_θ″ while s* is not more type-compatible with θ′ than with θ″; in this sense type compatibility is harder to satisfy than player compatibility. We now realize that we could have restricted type compatibility to consider only totally mixed strategies, and all of the results of Fudenberg and He (2018) would still hold.

Factorability and Isomorphic Factoring
This section defines a "factorability" condition that we will use in developing a learning foundation for PCE. Factorability implies that the information gathered from playing one strategy is not at all informative about the payoff consequences of any other strategy. We then define a notion of "isomorphic factoring" for players i and j to formalize the idea that the learning problems faced by these two players are essentially the same. The next section will provide a learning foundation for the compatibility restriction for pairs of players whose learning problems are isomorphic in this way. The examples discussed in Section 3 are factorable and isomorphically factorable for the players ranked by compatibility.

We begin by introducing some notation. Fix an extensive-form game Γ as the stage game, with players i ∈ I, along with a player 0 modeling Nature's moves. The collection of information sets of player i ∈ I is written H_i. At each h ∈ H_i, player i chooses an action a_h from the finite set of possible actions A_h. So an extensive-form pure strategy of i specifies an action at each information set h ∈ H_i; we denote by S_i the set of all such strategies. For simplicity, we maintain the following assumption throughout.

Assumption 1.
Each player moves at most once along any path of play in Γ.

In addition to any information a player gets in the course of play, we assume that after each play each player observes her own payoff. In general, this need not perfectly reveal other players' actions at all information sets. We now define factorability, which roughly says that playing strategy s_i against any strategy profile of −i identifies all of the opponents' actions that can be payoff-relevant for s_i, but does not reveal any information about the payoff consequences of any other strategy s′_i ≠ s_i.

For an information set h of a player j ≠ i, write P_h for the partition on S_−i where two strategies s_−i, s′_−i are in the same element of the partition if they prescribe the same play at h. Thus the partition P_h contains perfect information about play at h, but no other information.

Definition.
For each player i and strategy s_i ∈ S_i, let Π_i[s_i] be the coarsest partition of S_−i that makes s_−i ↦ U_i(s_i, s_−i) measurable. The game Γ is factorable for i if:

1. For each s_i ∈ S_i there exists a (possibly empty) collection of −i's information sets F_i[s_i] ⊆ H_−i so that Π_i[s_i] = ∨_{h ∈ F_i[s_i]} P_h. (The meet over an empty collection is the coarsest possible partition on S_−i, i.e., no information.)

2. For any two strategies s_i ≠ s′_i, F_i[s_i] ∩ F_i[s′_i] = ∅.

When Γ is factorable for i, we refer to F_i[s_i] as the s_i-relevant information sets, a terminology we now justify. In general, i's payoff from playing s_i can depend on the profile of −i's actions at all opponent information sets. Condition (1) implies that only opponents' actions on F_i[s_i] matter for i's payoff after choosing s_i, and furthermore that this dependence is one-to-one. That is,

U_i(s_i, s_−i) = U_i(s_i, s′_−i) ⇔ ( s_−i(h) = s′_−i(h) for all h ∈ F_i[s_i] ).

The substantive restriction in Condition (1) is that i's learning cannot be blocked by another player: by choosing s_i, i can always identify the actions on F_i[s_i], regardless of what happens elsewhere in the game tree. (It is easy but expositionally costly to extend this to the case where several actions in A_h lead to the same payoff for i.)

Condition (2) implies that i does not learn about the payoff consequences of a different strategy s′_i ≠ s_i through playing s_i (provided i's prior about opponents' play is independent across information sets). This is because the s_i-relevant information sets do not intersect the s′_i-relevant ones. In particular, this means that player i cannot "free ride" on others' experiments and learn about the payoff consequences of strategies she does not play herself. If F_i[s_i] is empty, then s_i is a kind of "opt out" action for i.
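To make Condition (1) concrete, the following sketch checks it by brute force for a small hypothetical payoff function (the three information sets, action labels, and payoff numbers are illustrative assumptions, not an example from the text): it builds the partition of S_−i induced by the payoff of a fixed s_i and compares it with the join of the per-information-set partitions P_h over a candidate F_i[s_i].

```python
from itertools import product

# Hypothetical setup: opponents move at three information sets h1, h2, h3,
# and i's payoff from some fixed strategy s_i depends only on play at h1 and h2.
ACTIONS = {"h1": ["L", "R"], "h2": ["U", "D"], "h3": ["A", "B"]}

def U(s_minus_i):
    # One-to-one in (a_h1, a_h2), insensitive to a_h3.
    return {"L": 0, "R": 1}[s_minus_i["h1"]] + 2 * {"U": 0, "D": 1}[s_minus_i["h2"]]

S_minus_i = [dict(zip(ACTIONS, prof)) for prof in product(*ACTIONS.values())]

def partition_by(key):
    """Partition S_minus_i into the blocks on which `key` is constant."""
    blocks = {}
    for s in S_minus_i:
        blocks.setdefault(key(s), []).append(tuple(sorted(s.items())))
    return {frozenset(b) for b in blocks.values()}

# Coarsest partition making s_-i -> U(s_i, s_-i) measurable ...
Pi = partition_by(U)
# ... versus the join of P_h over the candidate F_i[s_i] = {h1, h2}:
join = partition_by(lambda s: (s["h1"], s["h2"]))

print(Pi == join)  # prints True: F_i[s_i] = {h1, h2} satisfies Condition (1)
```

Because the payoff is one-to-one in the actions at h1 and h2, the two partitions coincide; a payoff that also depended on h3, or that pooled distinct profiles, would break the equality.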
After choosing s_i, i receives the same utility at every reachable terminal node and gets no information about the payoff consequences of any of her other strategies.

We now illustrate factorability using the examples from Section 3 and some other general classes of games.
Consider the restaurant game from Example 1. Since x ≠ y and x ≠ y + 0.5, by choosing R the customer's payoff perfectly reveals the others' play. By choosing Z, the customer always gets a payoff of 0 (these nodes are colored in the diagram below) and so cannot infer anyone else's play.

The restaurant game is factorable for the Critic and the Diner. Let F_i[R_i] consist of the two information sets of −i, and let F_i[Z_i] be the empty set, for each customer i. It is easy to verify that the two conditions of factorability are satisfied.

It is important for factorability that a customer who takes the "outside option" of ordering pizza gets the same payoff regardless of the restaurant's play, and does not observe the restaurant's quality choice even if the other customer patronizes the restaurant. Factorability rules out this sort of "free information," so that when we analyze the non-equilibrium learning problem we know that each agent can only learn a strategy's payoff consequences by playing it herself.

Consider the link-formation game from Example 2. The payoff for a player choosing
Inactive is always 0, whereas the payoff for a player choosing
Active exactly identifies the play of the two players on the opposite side. It is now easy to see that we can let F_i[Active_i] consist of the information sets of the two agents on the other side of i and let F_i[Inactive_i] be empty. This specification of the s_i-relevant information sets shows the stage game is factorable for every player. More generally, Γ is factorable for i whenever it is a binary participation game for i.

Definition.
Γ is a binary participation game for i if the following are satisfied:

1. i has a unique information set, with two actions, labeled In and Out without loss of generality.
2. All paths of play in Γ pass through i's information set.
3. All paths of play where i plays In pass through the same information sets.
4. Terminal vertices associated with i playing Out all give i the same payoff.
5. Terminal vertices associated with i playing In all give i different payoffs.

Action Out is an outside option for i that leads to a constant payoff regardless of others' play. We are implicitly assuming in part (5) of the definition that the game has generic payoffs for i after choosing In, in the sense that changing the action at any one information set on the path of play will change i's payoff.

If Γ is a binary participation game for i, then let F_i[In] be the common collection of −i information sets encountered on paths of play where i chooses In, and let F_i[Out] be the empty set. We see that Γ is factorable for i. Clearly F_i[In] ∩ F_i[Out] = ∅, so Condition (2) of factorability is satisfied. When i chooses the strategy In, the tree structure of Γ implies that different profiles of play on F_i[In] must lead to different terminal nodes, so the generic-payoff condition means Condition (1) of factorability is satisfied for strategy In. When i plays Out, i gets the same payoff regardless of the others' play, so Condition (1) of factorability is satisfied for strategy Out.

The restaurant game is a binary participation game for the critic and the diner, where ordering pizza is the outside option. The link-formation game is a binary participation game for every player, where
Inactive is the outside option.
To give a different class of examples of factorable games, consider a game of signaling to one or more audiences. To be precise, Nature moves first and chooses a type for the sender, drawn according to λ ∈ ∆(Θ), where Θ is a finite set. The sender then chooses a signal s ∈ S, observed by all receivers r_1, ..., r_{n_r}. Each receiver then simultaneously chooses an action. The profile of receiver actions, together with the sender's type and signal, determines payoffs for all players. Viewing different types of the sender as different players, this game is factorable for all sender types, provided payoffs are generic. This is because for each type i, F_i[s] is the set of n_r information sets of the receivers after seeing signal s.

The next result gives a necessary condition for factorability. Suppose h is an information set of a player j ≠ i. Player i's payoff is independent of h if u_i(a_h, a_−h) = u_i(a′_h, a_−h) for all a_h, a′_h, a_−h, where a_h, a′_h are actions at information set h and a_−h is a profile of actions at all other information sets in the game tree. If i's payoff is not independent of the action taken at some information set h, then i can always put h onto the path of play via a unilateral deviation at one of her information sets.

Proposition 5.
Suppose the game is factorable for i and i's payoff is not independent of h*. Then for any strategy profile, either h* is on the path of play, or i has a deviation at one of her information sets that puts h* onto the path of play.

This result follows from two lemmas.
Lemma 1.
For any game factorable for i and any information set h* of a player j ≠ i where j has at least two different actions, if h* ∈ F_i[s_i] for some extensive-form strategy s_i ∈ S_i, then h* is always on the path of play when i chooses s_i.

Lemma 2.
For any game factorable for i and any information set h* of a player j ≠ i, suppose i's payoff is not independent of h*. Then: (i) j has at least two different actions at h*; (ii) there exists some extensive-form strategy s_i ∈ S_i such that h* ∈ F_i[s_i].

We can combine these two lemmas to prove the proposition.
Proof.
By combining Lemmas 1 and 2, there exists some extensive-form strategy s_i ∈ S_i such that h* is on the path of play whenever i chooses s_i. Consider some strategy profile (s°_i, s°_−i) where h* is off the path. Then i can unilaterally deviate to s_i, and h* is on the path of (s_i, s°_−i). Furthermore, i's play differs on the new path relative to the old path at exactly one information set, since i plays at most once on any path. So instead of deviating to s_i, i can deviate to the strategy s′_i that matches s_i at this one information set where i's play is modified, but otherwise is the same as s°_i. Then h* is also on the path of play for (s′_i, s°_−i), where s′_i differs from s°_i at only one information set.

Consider the centipede game for three players below. Each player moves at most once on each path, and 1's and 2's payoffs are not independent of the (unique) information set of player 3. But if both 1 and 2 choose "drop," then no one-step deviation by either 1 or 2 can put the information set of 3 onto the path of play. Proposition 5 thus implies the centipede game is not factorable for either 1 or 2. Moreover, Fudenberg and Levine (2006) showed that in this game even very patient player 2s may not learn to play a best response to player 3, so that the outcome (drop, drop, pass) can persist even though it is not trembling-hand perfect. Intuitively, if the player 1s only play "pass" as experiments, then when the fraction of new players is very small, the player 2s may not get to play often enough to make experimentation with "pass" worthwhile.

As another example, Selten's horse game displayed above is not factorable for 1 or 2 if the payoffs are generic, even though the conclusion of Proposition 5 is satisfied. The information set of 3 must belong to both F_1[Down] and F_1[Across], because 3's play can affect 1's payoff even if 1 chooses Across, as 2 could choose Down.
This violates the factorability requirement that F_1[Down] ∩ F_1[Across] = ∅. The same argument shows the information set of 3 must belong to both F_2[Down] and F_2[Across], since when 1 chooses Down the play of 3 affects 2's payoff regardless of 2's play. So, again, F_2[Down] ∩ F_2[Across] = ∅ is violated.

Condition (2) of factorability also rules out games where i has two strategies that give the same information, but one of which always yields a worse payoff under all profiles of opponents' play. In this case, we can think of the worse strategy as an informationally equivalent but more costly experiment than the better strategy. Reasonable learning rules (including rational learning) will not use such strategies, but we do not capture that in the general definition of PCE because our setup there only considers abstract strategy spaces S_i and not an extensive-form game tree.

Before we turn to comparing the learning behavior of agents i and j, we must deal with one final issue. To make sensible comparisons between strategies s*_i and s*_j of two different players i ≠ j in a learning setting, we must make assumptions on their informational value about the play of others: namely, the information i gets from choosing s*_i must be essentially the same as the information that j gets from choosing s*_j. To do this we require that the game be factorable for both i and j, and that the factoring is "isomorphic" for these two players.

Definition.
When Γ is factorable for both i and j, the factoring is isomorphic for i and j if there exists a bijection ϕ: S_i → S_j such that F_i[s_i] ∩ H_−ij = F_j[ϕ(s_i)] ∩ H_−ij for every s_i ∈ S_i.

This says the s_i-relevant information sets (for i) are the same as the ϕ(s_i)-relevant information sets (for j), insofar as the actions of −ij are concerned. For example, the restaurant game is isomorphically factorable for the critic and the diner (under the isomorphism ϕ(R1) = R2, ϕ(Z1) = Z2), because F_1[R1] ∩ H_−12 = F_2[R2] ∩ H_−12 is the singleton set containing the unique information set of the restaurant.

(It would be interesting to try to refine the definition of PCE to capture this, perhaps using the "signal function" approach of Battigalli and Guaitoli (1997) and Rubinstein and Wolinsky (1994).)
In this section, we provide a learning foundation for PCE's cross-player tremble restrictions. Our main learning result, Theorem 2, studies long-lived agents who are permanently assigned to player roles and face a fixed but unknown distribution of opponents' play. We prove that when s*_i ≽ s*_j and the game is isomorphically factorable for i and j, agents in the role of i use s*_i more frequently than agents in the role of j use s*_j. We obtain this result both for rational agents who maximize discounted expected utility, and for boundedly rational agents who employ the computationally simpler Bayes upper confidence bound algorithm. Under either of these behavioral assumptions, "trembles" emerge endogenously during learning as deliberate experiments that seek to learn opponents' play.

We consider an agent born into player role i who maintains this role throughout her life. She has a geometrically distributed lifetime: with 0 ≤ γ < 1 her survival chance, each period she plays the stage game and chooses some s_i ∈ S_i. The agent observes and collects her payoffs at the end of the game. Then, with probability γ, she continues into the next period and plays the stage game again. With complementary probability, she exits the system. Thus each period the agent observes her own payoff. We assume that players have perfect recall, so she also remembers her chosen strategy. (This is a special case of the terminal-node partitions of Fudenberg and Kamada (2015, 2018), where the elements of each player's terminal-node partition are isomorphic to their possible payoffs.)

Definition. The set of all finite histories of all lengths for i is Y_i := ∪_{t≥0} (S_i × ℝ)^t. For a history y_i ∈ Y_i and s_i ∈ S_i, the subhistory y_{i,s_i} is the (possibly empty) subsequence of y_i in which the agent played s_i.

When Γ is factorable for i, there is a one-to-one mapping from the set of action profiles on the s_i-relevant information sets to the range of s_−i ↦ U_i(s_i, s_−i), as required by the first condition of the factorability definition.
Through this identification, we may think of each one-period history where i plays s_i as an element of {s_i} × (×_{h∈F_i[s_i]} A_h) instead of an element of {s_i} × ℝ. This convention will make it easier to compare the histories of different player roles.

Notation 1. A history y_i will also refer to an element of ∪_{t≥0} ( ∪_{s_i∈S_i} [ {s_i} × (×_{h∈F_i[s_i]} A_h) ] )^t.

The agent decides which strategy to use in each period based on her history so far. This mapping is her learning rule.

Definition. A learning rule r_i: Y_i → S_i specifies a pure strategy in the stage game after each history.

Note that the learning rule depends only on what the agent has observed in past play; the effect of anything learned during the play of the current stage game is captured by the specified strategy. Note also that since the agent's play in each period depends on her past observations, the sequence of her plays is a stochastic process whose distribution depends on the distribution of the opponents' play. We assume that there is a fixed objective distribution of opponents' play, which we call player i's learning environment. The leading case of this is when there are multiple populations of learners, one for each player role, and the aggregate system is in a steady state. But when analyzing the play of a single agent, we remain agnostic about the reason why opponents' play is i.i.d.
Definition. A learning environment for player i is a probability distribution σ_−i ∈ ∏_{j≠i} ∆(S_j) over the strategies of players −i.

The learning environment, together with the agent's learning rule, generates a stochastic process X_i^t describing i's strategy in period t.

Definition.
Let X_i^t be the S_i-valued random variable representing i's play in period t. The induced response of i to σ_−i under learning rule r_i is the map φ_i(·; r_i, σ_−i): S_i → [0, 1], where for each s_i ∈ S_i we have

φ_i(s_i; r_i, σ_−i) := (1 − γ) Σ_{t=1}^{∞} γ^{t−1} · P_{r_i, σ_−i}{X_i^t = s_i}.

We can interpret the induced response φ_i(·; r_i, σ_−i) as a mixed strategy for i representing i's weighted lifetime average play, where the weight on X_i^t, the strategy she uses in period t of her life, is proportional to the probability γ^{t−1} of surviving into that period. The induced response also has a population interpretation. Suppose there is a continuum of agents in the society, each engaged in her own copy of the learning problem above. In each period, enough new agents are added to the society to exactly balance the share of agents who exit between periods. Then φ_i(·; r_i, σ_−i) describes the distribution on S_i we would find if we sampled an individual uniformly at random from the subpopulation in the role of i and asked her which s_i ∈ S_i she plans on playing today.

Our learning foundation for compatible trembles involves comparing the induced responses of different player roles under the same learning rule and in the same learning environment.

We will consider two different specifications of the agents' learning rules in factorable games, namely the maximization of expected discounted utility and the Bayes upper confidence bound heuristic. With both rules, agents form a Bayesian belief over opponents' play, independent across information sets. More precisely, we will assume that each agent i starts with a regular independent prior:

Definition.
Agent i has a regular independent prior if her belief g_i on ×_{h∈H_−i} ∆(A_h) can be written as the product of full-support marginal densities on ∆(A_h) across the different h ∈ H_−i, so that g_i((α_h)_{h∈H_−i}) = ∏_{h∈H_−i} g_i^h(α_h) with g_i^h(α_h) > 0 for all α_h ∈ ∆°(A_h).

Thus the agent holds a belief about the distribution of actions at each −i information set h, and thinks that actions at different information sets are generated independently, whether the information sets belong to the same player or to different ones. Furthermore, the agent holds independent beliefs about the randomizing probabilities at different information sets. The agent updates g_i by applying Bayes' rule to her history y_i. If the stage game is a signaling game, for example, this independence assumption means that the senders update their beliefs about the receiver's response to a given signal s based only on the responses received to that signal, and that their beliefs about this response do not depend on the responses they have observed to other signals s′ ≠ s.

If i starts with independent prior beliefs in a stage game factorable for i, the learning problem she faces is a combinatorial bandit problem. A combinatorial bandit consists of a set of basic arms, each with an unknown distribution of outcomes, together with a collection of subsets of basic arms called super arms. Each period, the agent must choose a super arm, which results in pulling all of the basic arms in that subset and obtaining a utility based on the outcomes of these pulls. To translate into our language, each basic arm corresponds to a −i information set h, and the super arms are identified with strategies s_i ∈ S_i. The subset of basic arms in s_i is the set of s_i-relevant information sets, F_i[s_i]. The collection of outcomes from these basic arms, i.e., the action profile (a_h)_{h∈F_i[s_i]}, determines i's payoff U_i(s_i, (a_h)_{h∈F_i[s_i]}).
(We assume that agents do not know Nature's mixed actions, which must be learned just like the play of the other players. If agents knew Nature's move, then a regular independent prior would be a density g_i on ×_{h∈H_{I∖{i}}} ∆(A_h), so that g_i((α_h)_{h∈H_{I∖{i}}}) = ∏_{h∈H_{I∖{i}}} g_i^h(α_h) with g_i^h(α_h) > 0 for all α_h ∈ ∆°(A_h). As Fudenberg and Kreps (1993) point out, an agent who believes two opponents are randomizing independently may nevertheless have subjective correlation in her uncertainty about the randomizing probabilities of these opponents.)
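The induced response φ_i defined earlier can be approximated by simulation. The sketch below is a minimal illustration under assumed ingredients, none of which comes from the paper: a two-strategy role, Bernoulli payoffs standing in for a fixed opponent environment, and a naive try-each-once-then-greedy learning rule. It weights each period-t play by (1 − γ)γ^(t−1) and averages over many simulated agents.

```python
import random

random.seed(0)

GAMMA = 0.9      # survival chance gamma
T = 200          # truncation horizon (gamma**T is negligible)
N_AGENTS = 2000  # Monte Carlo sample of agents in role i

# Hypothetical learning environment: each strategy's payoff is an i.i.d.
# Bernoulli draw, a stand-in for a fixed distribution of opponents' play.
def payoff(s):
    return random.random() < (0.7 if s == "A" else 0.4)

def greedy_rule(history):
    """Toy learning rule: try each strategy once, then play the empirical best."""
    for s in ("A", "B"):
        if not history[s]:
            return s
    return max(("A", "B"), key=lambda s: sum(history[s]) / len(history[s]))

phi = {"A": 0.0, "B": 0.0}  # induced-response estimate
for _ in range(N_AGENTS):
    history = {"A": [], "B": []}
    for t in range(1, T + 1):
        s = greedy_rule(history)
        phi[s] += (1 - GAMMA) * GAMMA ** (t - 1) / N_AGENTS
        history[s].append(payoff(s))

print(phi)  # the weights sum to ~1; the better strategy "A" gets most of the mass
```

The forced first trial of each strategy is the "tremble" here: even the worse strategy receives positive induced-response weight, which is how deliberate experimentation generates the tremble probabilities that PCE restricts.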
A special case of combinatorial bandits is additive separability, where the outcome from pulling each basic arm is simply an ℝ-valued reward, and the payoff from choosing a super arm is the sum of these rewards. This corresponds to the stage game being additively separable for i.

Definition.
A factorable game Γ is additively separable for i if there is a collection of auxiliary functions u_{i,h}: A_h → ℝ such that U_i(s_i, (a_h)_{h∈F_i[s_i]}) = Σ_{h∈F_i[s_i]} u_{i,h}(a_h).

The term u_{i,h}(a_h) is the "reward" that action a_h contributes to i's payoff. The total payoff from s_i is the sum of such rewards over all s_i-relevant information sets. A factorable game is not additively separable for i when the opponents' actions on F_i[s_i] interact in some way to determine i's payoff following s_i. All the examples discussed in Section 3 are additively separable for the players ranked by compatibility. While we provide our learning foundation for rational agents in any factorable game, our analysis of the Bayes upper confidence bound algorithm will restrict to such additively separable games.
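A payoff table over two relevant information sets can be tested for additive separability directly: U(a_h1, a_h2) = u_h1(a_h1) + u_h2(a_h2) for some auxiliary functions exactly when all interaction terms vanish. A small sketch with hypothetical numbers (the −0.5 echoes the restaurant-game contribution discussed later, but this table itself is made up):

```python
from itertools import product

# Hypothetical payoffs after a strategy s_i with F_i[s_i] = {h1, h2}.
A_h1, A_h2 = ["L", "R"], ["U", "D"]
U = {("L", "U"): 1.0, ("L", "D"): 0.5, ("R", "U"): 2.0, ("R", "D"): 1.5}

def additively_separable(U, A1, A2):
    """U(a1, a2) = u1(a1) + u2(a2) iff every entry equals
    U(a1, b2) + U(b1, a2) - U(b1, b2) for fixed baselines b1, b2."""
    b1, b2 = A1[0], A2[0]
    return all(
        abs(U[(a1, a2)] - (U[(a1, b2)] + U[(b1, a2)] - U[(b1, b2)])) < 1e-9
        for a1, a2 in product(A1, A2)
    )

print(additively_separable(U, A_h1, A_h2))  # prints True: here D contributes -0.5 at h2
```

Changing any single entry of this table (e.g. setting U[("R", "D")] to 3.0) introduces an interaction between h1 and h2 and makes the check fail, which is exactly the case our Bayes-UCB analysis excludes.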
(Additive separability is trivially satisfied whenever |F_i[s_i]| ≤ 1 for each s_i, so that there is at most one s_i-relevant information set for each strategy s_i of i. So every signaling game is additively separable for every sender type. It is also satisfied in the link-formation game in Section 4.2.2, even though there |F_i[Active_i]| = 2, as each agent computes her payoff by summing her linking costs/benefits with respect to each potential counterparty. Additive separability is also satisfied in the restaurant game in Section 4.2.1 for each customer i: F_i[R_i] contains two information sets, corresponding to the play of the Restaurant and the other customer, and the play of the other customer additively contributes either 0 or −0.5 to i's payoff, depending on whether they choose R or not.)

Consider a rational agent who maximizes discounted expected utility. In addition to the survival chance 0 ≤ γ < 1, she discounts future payoffs by a factor 0 ≤ δ < 1, so her overall effective discount factor is 0 ≤ δγ < 1. Given her beliefs at any history, we may calculate the Gittins index of each strategy s_i ∈ S_i, corresponding to a super arm in the combinatorial bandit problem. We write the solution to the rational agent's problem as OPT_i, which involves playing the strategy s_i with the highest Gittins index after each history y_i.

The drawback of this learning rule is that the Gittins index is computationally intractable even in simple bandit problems. The combinatorial structure of our bandit problem makes computing the index even more complex, as it needs to consider the evolution of beliefs about each basic arm.

The Bayesian upper confidence bound (Bayes-UCB) procedure was first proposed by Kaufmann, Cappé, and Garivier (2012) as a computationally tractable algorithm for dealing with the exploration-exploitation trade-off in bandit problems. We restrict attention to games additively separable for i and adopt a variant of Bayes-UCB.
Every subhistory y_{i,h} of play at h ∈ F_i[s_i] induces a posterior belief g_i(·|y_{i,h}) over play at h, so g_i(·|y_{i,h}) is an element of ∆(∆(A_h)). By an abuse of notation, we use u_{i,h}(g_i(·|y_{i,h})) ∈ ∆(ℝ) to mean the distribution over contributions u_{i,h}(α_h) when α_h is distributed according to g_i(·|y_{i,h}). As a final bit of notation, when F is a distribution on ℝ, Q(F; q) denotes the q-quantile of F.

Definition.
Let a prior g_i and a quantile-choice function q: ℕ → [0, 1] be given for i. The Bayes-UCB index for s_i after history y_i (relative to g_i and q) is

Σ_{h ∈ F_i[s_i]} Q( u_{i,h}(g_i(·|y_{i,h})) ; q(#(s_i|y_i)) ),

where #(s_i|y_i) is the number of times s_i has been used in history y_i.

In words, our Bayes-UCB index computes the q-th quantile of u_{i,h}(a_h) under i's belief about −i's play at h, then sums these quantiles to return an index for the strategy s_i. The Bayes-UCB policy
UCB_i prescribes choosing the strategy with the highest Bayes-UCB index after every history.

This procedure embodies a kind of wishful thinking for q ≥ 0.5. The agent optimistically evaluates the payoff consequence of each s_i under the assessment that opponents will play a favorable response to s_i at each of the s_i-relevant information sets, where a greater q corresponds to greater optimism in this evaluation procedure. Indeed, if q approaches 1 for every s_i, the Bayes-UCB procedure approaches picking the strategy with the highest potential payoff.

If F_i[s_i] consists of only a single information set for every s_i, then the procedure we define is the standard Bayes-UCB policy. In general, our procedure differs from the usual Bayes-UCB procedure, which would instead compute

Q( Σ_{h ∈ F_i[s_i]} u_{i,h}(g_i(·|y_{i,h})) ; q(#(s_i|y_i)) ).

Instead, our procedure computes the sum of the quantiles, which is easier than computing the quantile of the sum, a calculation that requires taking the convolution of the associated distributions. This variant of the Bayesian UCB is analogous to variants of the non-Bayesian UCB algorithm (see e.g. Gai, Krishnamachari, and Jain (2012) and Chen, Wang, and Yuan (2013)) that separately compute an index for each basic arm and choose the super arm maximizing the sum of the basic-arm indices.

The analysis that follows makes heavy use of the fact that the Gittins index and the Bayes-UCB are index policies in the following sense:
Definition.
When Γ is factorable for i, a learning rule r_i: Y_i → S_i is an index policy if there exist functions (ι_{s_i})_{s_i∈S_i}, with each ι_{s_i} mapping subhistories of s_i to real numbers, such that

r_i(y_i) ∈ argmax_{s_i∈S_i} { ι_{s_i}(y_{i,s_i}) }.

(The non-Bayesian UCB index of a basic arm is an "optimistic" estimate of its mean reward that combines its past empirical mean with a term inversely proportional to the number of times the basic arm has been pulled. Kveton, Wen, Ashkan, and Szepesvari (2015) have established tight O(√(n log n)) regret bounds for this kind of algorithm across n periods.)

If an agent uses an index policy, we can think of her behavior in the following way. At each history, she computes an index for each strategy s_i ∈ S_i based on the subhistory of those periods where she chose s_i, and she then plays a strategy with the highest index with probability 1.

We now analyze how compatibility relations in the stage game translate into restrictions on experimentation frequencies. We aim to demonstrate that if s*_i ≽ s*_j, then i's induced response plays s*_i more frequently than j's induced response plays s*_j. There is little hope of proving a comparative result of this kind if i and j face completely unrelated learning problems. Instead, we will require that i and j use the same learning rule with the same parameters (that is, the same patience in the case of OPT and the same quantile-choice function in the case of UCB), start with the same prior beliefs about −ij's play, and face the same distribution of −ij's play. These assumptions are natural when a common population of agents is randomly assigned to player roles, as in a lab experiment.

Theorem 2 shows that when i and j use the same learning rule and face the same learning environment, we have φ_i(s*_i; r_i, σ_−i) ≥ φ_j(s*_j; r_j, σ_−j).
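To illustrate how the Bayes-UCB index fits the index-policy definition, the sketch below computes the index for a strategy with two binary-action relevant information sets. Everything concrete here is an assumption made for illustration: uniform Beta(1, 1) priors over each arm, a quantile schedule q(n) = 1 − 1/(n + 1), and Monte Carlo approximation of the posterior quantiles via Python's `random.betavariate`.

```python
import random

random.seed(1)

# Hypothetical strategy s_i with F_i[s_i] = {h1, h2}; each h has two actions,
# and u_{i,h} gives each action's reward contribution to i's payoff.
u = {"h1": {"L": 0.0, "R": 1.0}, "h2": {"U": 0.0, "D": 2.0}}

def contribution_quantile(h, counts, q, n_draws=4000):
    """Approximate q-quantile of u_{i,h}(alpha_h), where the posterior over the
    probability of the high-reward action is Beta(1 + #high obs, 1 + #low obs)."""
    hi = max(u[h], key=u[h].get)
    lo = min(u[h], key=u[h].get)
    a, b = 1 + counts[hi], 1 + counts[lo]  # Beta posterior from a uniform prior
    draws = sorted(u[h][lo] + (u[h][hi] - u[h][lo]) * random.betavariate(a, b)
                   for _ in range(n_draws))
    return draws[int(q * (n_draws - 1))]

def bayes_ucb_index(counts_by_h, n_pulls):
    """Sum of per-information-set quantiles, as in the index definition."""
    q = 1 - 1 / (n_pulls + 1)  # hypothetical quantile-choice function q(n)
    return sum(contribution_quantile(h, counts_by_h[h], q) for h in counts_by_h)

# After two pulls of s_i, having observed R then L at h1 and D twice at h2:
counts = {"h1": {"R": 1, "L": 1}, "h2": {"D": 2, "U": 0}}
idx = bayes_ucb_index(counts, n_pulls=2)
print(round(idx, 2))  # optimistic: exceeds the posterior-mean payoff 0.5 + 1.5 = 2.0
```

This is an index in the required sense: it depends only on the subhistory of periods where s_i itself was played, and an agent following UCB_i would play an argmax of such indices across her strategies.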
This provides a microfoundation for the compatibility-based cross-player restrictions on trembles. Throughout, we fix a stage game Γ that is isomorphically factorable for i and j, with isomorphism ϕ: S_i → S_j between their strategies.

Definition.
Regular independent priors for i and j are equivalent if for each s_i ∈ S_i and h ∈ F_i[s_i] ∩ F_j[ϕ(s_i)], we have g_i^h(α) = g_j^h(α) for all α ∈ ∆(A_h).

(To handle possible ties, we can introduce a strict order over each agent's strategy set and specify that if two strategies have the same index, the agent plays the one that is ranked higher. We believe that our learning foundation for player-compatible trembles continues to hold even when i and j start with different priors, under a stronger version of the compatibility condition that converges to the current one as the priors become closer together, but we have not been able to prove this.)

Theorem 2. Suppose s*_i ≽ s*_j with ϕ(s*_i) = s*_j. Consider two learning agents in the roles of i and j with equivalent regular independent priors. For any common survival chance 0 ≤ γ < 1 and any mixed strategy profile σ, we have φ_i(s*_i; r_i, σ_−i) ≥ φ_j(s*_j; r_j, σ_−j) under either of the following conditions:

• r_i = OPT_i, r_j = OPT_j, and i and j have the same patience 0 ≤ δ < 1.
• The stage game is additively separable for i and j, at every h ∈ H_−ij the auxiliary functions u_{i,h}, u_{j,h} rank the α ∈ ∆(A_h) in the same way, r_i = UCB_i, r_j = UCB_j, and i and j have the same quantile-choice function q_i = q_j.

This result provides learning foundations for player-compatible trembles in a number of games, including the restaurant game from Section 4.2.1 and the link-formation game from Section 4.2.2, where the additive separability and same-ranking assumptions are satisfied for the players ranked by compatibility.
The proof of Theorem 2 follows two steps. In Proposition 6, we abstract away from particular models of experimentation and consider two general index policies r_i, r_j in a stage game that is isomorphically factorable for i and j. Policy r_i is more compatible with s_i^* than r_j is with s_j^* if, following respective histories y_i, y_j for i and j that contain the same observations about the play of the third parties -ij, whenever s_j^* has the highest index under r_j, no s_i ≠ s_i^* has the highest index under r_i. We prove that for any index policies r_i, r_j where r_i is more compatible with s_i^* than r_j is with s_j^*, we get φ_i(s_i^*; r_i, σ_{-i}) ≥ φ_j(s_j^*; r_j, σ_{-j}) in any learning environment σ. In Corollaries 1 and 2, we show that under the conditions of Theorem 2 that relate i and j's learning problems to each other (e.g., i and j have equivalent regular priors and the same patience), s_i^* ≿ s_j^* implies OPT_i is more compatible with s_i^* than OPT_j is with s_j^*, and that the same is true for UCB_i and UCB_j.

(The theorem easily generalizes to the case where i starts with one of L priors g_i^(1), ..., g_i^(L) with probabilities p_1, ..., p_L and j starts with priors g_j^(1), ..., g_j^(L) with the same probabilities, where each pair g_i^(l), g_j^(l) consists of equivalent regular priors for 1 ≤ l ≤ L.)

We begin by introducing a notion of equivalence between the histories of i and j. Since i could observe j's play and vice versa, this equivalence is only defined in terms of the actions of the -ij third parties. Definition.
For Γ isomorphically factorable for i and j with ϕ(s_i) = s_j, i's subhistory y_{i,s_i} is third-party equivalent to j's subhistory y_{j,s_j}, written as y_{i,s_i} ∼ y_{j,s_j}, if they contain the same sequence of observations about the actions of -ij.

Recall that, by Notation 1, we identify each subhistory y_{i,s_i} with a sequence in ×_{h∈F_i[s_i]} A_h and each subhistory y_{j,s_j} with a sequence in ×_{h∈F_j[s_j]} A_h. By isomorphic factorability, F_i[s_i] ∩ H_{-ij} = F_j[s_j] ∩ H_{-ij}. Third-party equivalence of y_{i,s_i} and y_{j,s_j} says that i has played s_i as many times as j has played s_j, and that the sequence of -ij's actions that i encountered from experimenting with s_i is the same as the one that j encountered from experimenting with s_j.

As an example, the following histories for the critic and the diner of the restaurant game are third-party equivalent for the strategy R. This is because the subhistories y_{Critic,R} and y_{Diner,R} contain the same sequences of the restaurant's play (even though the two agents have different observations in terms of how often the other patron goes to the restaurant).

y_Critic:  period        1      2      3      4      5
           own strategy  R      Z      Z      Z      R
           others' play  (L,Z)  ∅      ∅      ∅      (H,Z)

y_Diner:   period        1      2      3      4
           own strategy  Z      R      Z      R
           others' play  ∅      (L,R)  ∅      (H,Z)

Table 1: The two histories y_Critic (with length 5) and y_Diner (with length 4) have third-party equivalent subhistories for R. The row "others' play" shows what the agent infers about others' play from her payoffs — recall that a customer choosing Z always gets the same payoff and so cannot infer anything about how others play.

We use third-party equivalent histories to define a comparison between two abstract index policies. Definition.
Suppose Γ is isomorphically factorable for i and j with ϕ(s_i^*) = s_j^*. For two index policies r_i and r_j, we say r_i is more compatible with s_i^* than r_j is with s_j^* if for any histories y_i, y_j and strategy s_i ∈ S_i with s_i ≠ s_i^* satisfying

1. y_{i,s_i^*} ∼ y_{j,s_j^*} and y_{i,s_i} ∼ y_{j,ϕ(s_i)},
2. s_j^* has weakly the highest index for j,

s_i does not have the weakly highest index for i.

This definition is a property of the index policies r_i, r_j, and does not make reference to payoffs in the underlying stage game. The comparison applies to pairs of policies r_i, r_j such that whenever the subhistories of y_i for strategies s_i^* and s_i ≠ s_i^* are third-party equivalent to the subhistories of y_j for s_j^* and ϕ(s_i), and s_j^* has the highest r_j-index at history y_j, then s_i does not have the highest r_i-index under y_i.

We can now state the first intermediary result we need to establish Theorem 2, which is about the relative experimentation frequencies generated by a pair of index policies where the compatibility relation applies.

Proposition 6.
Suppose Γ is isomorphically factorable for i and j with ϕ(s_i^*) = s_j^*, and that index policy r_i is more compatible with s_i^* than index policy r_j is with s_j^*. Then φ_i(s_i^*; r_i, σ_{-i}) ≥ φ_j(s_j^*; r_j, σ_{-j}) for any 0 ≤ γ < 1 and σ ∈ ×_k ∆(S_k).

The proof extends the coupling argument in the proof of Fudenberg and He (2018)'s Lemma 2, which only applies to the Gittins index in signaling games, and also fills in a missing step (Lemma 4) that the earlier proof implicitly assumed. Proposition 6 applies to any index policies satisfying the comparative compatibility condition stated above. The proof uses this hypothesis to deduce a general conclusion about the induced responses of these agents in the learning problem, where the two agents typically do not have third-party equivalent histories in any given period.

To deal with the issue that i and j learn from endogenous data that diverge as they undertake different experiments, we couple the learning problems of i and j using what we call response paths, A ∈ (×_{h∈H} A_h)^∞. For each such path and learning rule r_i for player i, imagine running the rule against the data-generating process where, the k-th time i plays s_i, i observes the action a_{k,h} ∈ A_h at each information set h ∈ F_i[s_i]. Given a learning rule r_i, each A induces a deterministic infinite history of i's strategies y_i(A, r_i) ∈ (S_i)^∞. We show that under the hypothesis that r_i is more compatible with s_i^* than r_j is with s_j^*, the weighted lifetime frequency of s_i^* in y_i(A, r_i) is larger than that of s_j^* in y_j(A, r_j) for every A, where play in different periods of the infinite histories y_i(A, r_i), y_j(A, r_j) is weighted by the probabilities of surviving into these periods, just as in the definition of induced responses.

Lemma 4 in the Appendix shows that when i and j face i.i.d.
draws of opponents’play from a fixed learning environment σ, the induced responses are the same as ifthey each faced a random response path A drawn at birth according to the (infi-nite) product measure over ( × h ∈H A h ) ∞ whose marginal distribution on each copy of × h ∈H A h corresponds to σ . 38 .4.2 OPT and UCB Satisfy Comparative Compatibility The second step of our proof is carried out in Appendix 8. There, Corollaries 1 and2 show that when the assumptions of Theorem 2 hold and s ∗ i (cid:37) s ∗ j , both OPT andUCB are more compatible with s ∗ i than with s ∗ j provided the additional regularityconditions of Theorem 2 hold. This proves the theorem and provides two learningmodels that microfound PCE’s tremble restrictions. Since the compatibility relationis defined in the language of best responses against opponents’ strategy profiles in thestage game, the key step in showing that OPT and UCB satisfy the comparativecompatibility condition involves reformulating these indices as the expected utility ofusing each strategy against a certain opponent strategy profile.For the Gittins index, this profile is the “synthetic” opponent strategy profileconstructed from the best stopping rule in the auxiliary optimal-stopping problemdefining the index. This is similar to the construction of Fudenberg and He (2018),but in the more general setting of this paper the arguments become more subtle. Theinduced synthetic strategy may be correlated if the learner observes opponents’ play atmultiple information sets after playing s i , even if the learner starts with independentprior beliefs over play at these information. For example, suppose F i [ s i ] consists oftwo information sets, one for each of two players k = k , whose choose between Heads and
Tails. Agent i's prior belief is that each of k' and k'' is either always playing Heads or always playing
Tails, with each of the 4 possible combinations of strategies given 25% prior probability. Now consider the stopping rule where i stops if k' and k'' play differently in the first period, but continues for 100 more periods if they play the same action in the first period. Then the procedure defined above generates a distribution over pairs of Heads and
Tails that is mostly given by play in periods 2 through 100, which is either (Heads, Heads) or (Tails, Tails), each with 50% probability. Thus the stopping rule τ creates correlation in the observed play of the two players. (Other natural index rules that we do not analyze explicitly here also serve as microfoundations of our cross-player restrictions on trembles, provided they satisfy Proposition 6 whenever s_i^* ≿ s_j^*.) This endogenous correlation through the optimal stopping rule is the reason player compatibility is defined in terms of correlated profiles.

For Bayes-UCB, under the assumptions of Theorem 2, the agent may rank opponents' mixed actions on each h ∈ F_i[s_i] from least favorable to most favorable. The analogous opponent strategy profile is the behavior strategy where the q-th quantile mixed action is played on each h, in terms of i's current belief about opponents' play. Importantly, if i and j share the same beliefs about -ij's play and rank -ij's mixed actions in the same way, then the "q-th quantile profile" is the same for both agents.

PCE makes two key contributions. First, it generates new and sensible restrictions on equilibrium play by imposing cross-player restrictions on the relative probabilities that different players assign to certain strategies — namely, those strategy pairs s_i, s_j ranked by the compatibility relation s_i ≿ s_j. As we have shown through examples, this distinguishes PCE from other refinement concepts and allows us to make comparative statics predictions in some games where other equilibrium refinements do not.

Second, PCE shows how the device of restricted "trembles" can capture some of the implications of non-equilibrium learning. As we saw, PCE's cross-player restrictions arise endogenously both in the standard model of Bayesian agents maximizing their expected discounted lifetime utility and in the computationally tractable heuristics of Bayesian upper confidence bounds.
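The quantile step at the heart of the Bayes-UCB heuristic can be sketched as follows. We keep a discrete posterior over finitely many candidate opponent mixed actions, rank them from least to most favorable by an auxiliary utility, and pick the q-th quantile. The discrete belief, the candidate grid, and all names are illustrative assumptions; the paper works with general regular priors.

```python
def quantile_action(posterior, utility, q):
    """Return the q-th quantile mixed action under `posterior`.

    posterior: dict mapping candidate mixed actions (hashable) to probabilities.
    utility:   auxiliary function ranking candidates from least to most favorable.
    q:         quantile in (0, 1].

    Candidates are sorted from least to most favorable; we return the first
    one at which the cumulative posterior probability reaches q.
    """
    ranked = sorted(posterior, key=utility)
    cum = 0.0
    for a in ranked:
        cum += posterior[a]
        if cum >= q - 1e-12:  # tolerance for floating-point accumulation
            return a
    return ranked[-1]
```

Because the output depends only on the posterior, the ranking, and q, two agents with the same beliefs, the same auxiliary ranking, and the same quantile-choice function necessarily compute the same quantile profile, which is the observation used in the text above.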
We conjecture that the result that i is more likely to experiment with s_i than j with s_j when s_i ≿ s_j applies in other natural models of learning or dynamic adjustment, such as those considered by Francetich and Kreps (2018), and that it may be possible to provide foundations for PCE in other and perhaps larger classes of games.

The strength of the PCE refinement depends on the completeness of the compatibility order ≿, since ε-PCE imposes restrictions on i and j's play only when the relation s_i ≿ s_j holds. Our player compatibility definition supposes that player i thinks all mixed strategies of other players are possible, as it considers the set of all totally mixed correlated strategies σ_{-i} ∈ ∆°(S_{-i}). If the players have some prior knowledge about their opponents' utility functions, player i might deduce a priori that the other players will only play strategies in some subset A_{-i} of ∆°(S_{-i}). As we show in Fudenberg and He (2017), in signaling games imposing this kind of prior knowledge leads to a more complete version of the compatibility order. It may similarly lead to a more refined version of PCE.

PCE is defined for general strategic forms. We have only provided learning foundations for player-compatible trembles in factorable games, but we view this as an improvement over the more typical situation in which refinements have no learning foundations at all.

In more general extensive-form games two complications arise. First, player i may have several actions that lead to the same information set of player j, which makes the optimal learning strategy more complicated. Second, player i may get information about how player j plays at some information sets thanks to an experiment by some other player k, so that player i has an incentive to free ride. We plan to deal with these complications in future work.
Moreover, we conjecture that in games where actions have a natural ordering, learning rules based on the idea that nearby strategies induce similar responses can provide learning foundations for refinements in which players tremble more onto nearby actions, as in Simon (1987). More speculatively, the interpretation of trembles as arising from learning may provide learning-theoretic foundations for equilibrium refinements that restrict beliefs at off-path information sets in general extensive-form games, such as perfect Bayesian equilibrium (Fudenberg and Tirole, 1991; Watson, 2017), sequential equilibrium (Kreps and Wilson, 1982), and its extension to games with infinitely many actions (Simon and Stinchcombe, 1995; Myerson and Reny, 2018).

References

Agrawal, R. (1995): "Sample mean based index policies by O(log n) regret for the multi-armed bandit problem,"
Advances in Applied Probability, 27, 1054–1078.

Battigalli, P., S. Cerreia-Vioglio, F. Maccheroni, and M. Marinacci (2016): "Analysis of information feedback and selfconfirming equilibrium," Journal of Mathematical Economics, 66, 40–51.

Battigalli, P., A. Francetich, G. Lanzani, and M. Marinacci (2017): "Learning and Self-confirming Long-Run Biases," Working Paper.

Battigalli, P. and D. Guaitoli (1997): "Conjectural equilibria and rationalizability in a game with incomplete information," in Decisions, Games and Markets, Springer, 97–124.

Bolton, P. and C. Harris (1999): "Strategic experimentation," Econometrica, 67, 349–374.

Chen, W., Y. Wang, and Y. Yuan (2013): "Combinatorial Multi-Armed Bandit: General Framework and Applications," in Proceedings of the 30th International Conference on Machine Learning, ed. by S. Dasgupta and D. McAllester, Atlanta, Georgia, USA: PMLR, vol. 28 of Proceedings of Machine Learning Research, 151–159.

Cho, I.-K. and D. M. Kreps (1987): "Signaling Games and Stable Equilibria," Quarterly Journal of Economics, 102, 179–221.

Doval, L. (2018): "Whether or not to open Pandora's box," Journal of Economic Theory, 175, 127–158.

Esponda, I. and D. Pouzo (2016): "Berk-Nash Equilibrium: A Framework for Modeling Agents With Misspecified Models," Econometrica, 84, 1093–1130.

Francetich, A. and D. M. Kreps (2018): "Choosing a Good Toolkit: Bayes-rule Based Heuristics," Working Paper.

Frick, M. and Y. Ishii (2015): "Innovation adoption by forward-looking social learners," Working Paper.

Fryer, R. and P. Harms (2017): "Two-armed restless bandits with imperfect information: Stochastic control and indexability," Mathematics of Operations Research, 43, 399–427.

Fudenberg, D. and K. He (2017): "Learning and Equilibrium Refinements in Signalling Games," Working Paper.

——— (2018): "Learning and Type Compatibility in Signaling Games," Econometrica, 86, 1215–1255.

Fudenberg, D. and Y. Kamada (2015): "Rationalizable partition-confirmed equilibrium," Theoretical Economics, 10, 775–806.

——— (2018): "Rationalizable partition-confirmed equilibrium with heterogeneous beliefs," Games and Economic Behavior, 109, 364–381.

Fudenberg, D. and D. M. Kreps (1993): "Learning Mixed Equilibria," Games and Economic Behavior, 5, 320–367.

——— (1994): "Learning in Extensive-Form Games, II: Experimentation and Nash Equilibrium," Working Paper.

——— (1995): "Learning in Extensive-Form Games I. Self-Confirming Equilibria," Games and Economic Behavior, 8, 20–55.

Fudenberg, D. and D. K. Levine (1993): "Steady State Learning and Nash Equilibrium," Econometrica, 61, 547–573.

——— (2006): "Superstition and Rational Learning," American Economic Review, 96, 630–651.

Fudenberg, D. and J. Tirole (1991): "Perfect Bayesian equilibrium and sequential equilibrium," Journal of Economic Theory, 53, 236–260.

Gai, Y., B. Krishnamachari, and R. Jain (2012): "Combinatorial Network Optimization With Unknown Variables: Multi-Armed Bandits With Linear Rewards and Individual Observations," IEEE/ACM Transactions on Networking, 20, 1466–1478.

Halac, M., N. Kartik, and Q. Liu (2016): "Optimal contracts for experimentation," Review of Economic Studies, 83, 1040–1091.

Heidhues, P., S. Rady, and P. Strack (2015): "Strategic experimentation with private payoffs," Journal of Economic Theory, 159, 531–551.

Hörner, J. and A. Skrzypacz (2016): "Learning, experimentation and information design," in Advances in Economics and Econometrics: Eleventh World Congress, ed. by B. Honore, A. Pakes, M. Piazzesi, and L. Samuelson, Cambridge University Press, chap. 2, 63–97.

Jackson, M. O. and A. Wolinsky (1996): "A strategic model of social and economic networks," Journal of Economic Theory, 71, 44–74.

Katehakis, M. N. and H. Robbins (1995): "Sequential choice from several populations," Proceedings of the National Academy of Sciences of the United States of America, 92, 8584.

Kaufmann, E., O. Cappé, and A. Garivier (2012): "On Bayesian upper confidence bounds for bandit problems," in Artificial Intelligence and Statistics, 592–600.

Keller, G., S. Rady, and M. Cripps (2005): "Strategic experimentation with exponential bandits," Econometrica, 73, 39–68.

Klein, N. and S. Rady (2011): "Negatively correlated bandits," Review of Economic Studies, 78, 693–732.

Kohlberg, E. and J.-F. Mertens (1986): "On the Strategic Stability of Equilibria," Econometrica, 54, 1003–1037.

Kreps, D. M. and R. Wilson (1982): "Sequential equilibria," Econometrica, 50, 863–894.

Kveton, B., Z. Wen, A. Ashkan, and C. Szepesvari (2015): "Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits," in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, ed. by G. Lebanon and S. V. N. Vishwanathan, San Diego, California, USA: PMLR, vol. 38 of Proceedings of Machine Learning Research, 535–543.

Lehrer, E. (2012): "Partially specified probabilities: decisions and games," American Economic Journal: Microeconomics, 4, 70–100.

Milgrom, P. and J. Mollner (2017): "Extended Proper Equilibrium," Working Paper.

Monderer, D. and L. S. Shapley (1996): "Potential games," Games and Economic Behavior, 14, 124–143.

Myerson, R. B. (1978): "Refinements of the Nash equilibrium concept," International Journal of Game Theory, 7, 73–80.

Myerson, R. B. and P. J. Reny (2018): "Perfect Conditional ε-Equilibria of Multi-Stage Games with Infinite Sets of Signals and Actions," Working Paper.

Rubinstein, A. and A. Wolinsky (1994): "Rationalizable conjectural equilibrium: between Nash and rationalizability," Games and Economic Behavior, 6, 299–311.

Selten, R. (1975): "Reexamination of the perfectness concept for equilibrium points in extensive games," International Journal of Game Theory, 4, 25–55.

Simon, L. K. (1987): "Local perfection," Journal of Economic Theory, 43, 134–156.

Simon, L. K. and M. B. Stinchcombe (1995): "Equilibrium refinement for infinite normal-form games," Econometrica, 63, 1421–1443.

Strulovici, B. (2010): "Learning while voting: Determinants of collective experimentation," Econometrica, 78, 933–971.

Van Damme, E. (1987): Stability and Perfection of Nash Equilibria, Springer-Verlag.

Watson, J. (2017): "A General, Practicable Definition of Perfect Bayesian Equilibrium," Working Paper.

Appendix
We first state an auxiliary lemma.
Lemma 3. If σ° is an ε-PCE and s_i^* ≿ s_j^*, then

σ_i°(s_i^*) ≥ min{ σ_j°(s_j^*), 1 − Σ_{s_i ≠ s_i^*} ε(s_i | i) }.

Proof.
Suppose ε is player-compatible and let ε-equilibrium σ° be given. For s_i^* ≿ s_j^*, suppose first that σ_j°(s_j^*) = ε(s_j^* | j). Then σ_i°(s_i^*) ≥ ε(s_i^* | i) ≥ ε(s_j^* | j) = σ_j°(s_j^*), where the second inequality comes from ε being player-compatible. On the other hand, suppose σ_j°(s_j^*) > ε(s_j^* | j). Since σ° is an ε-equilibrium, the fact that j puts more than the minimum required weight on s_j^* implies s_j^* is at least a weak best response for j against σ°, with σ° totally mixed due to the trembles. The definition of s_i^* ≿ s_j^* then implies that s_i^* must be a strict best response for i against σ° as well. In the ε-equilibrium, i must assign as much weight to s_i^* as possible, so that σ_i°(s_i^*) = 1 − Σ_{s_i ≠ s_i^*} ε(s_i | i). Combining these two cases establishes the desired result.

Proposition 3: For any PCE σ*, player k, and strategy s̄_k such that σ_k^*(s̄_k) > 0, there exists a sequence of totally mixed strategy profiles σ_{-k}^(t) → σ_{-k}^* such that (i) for every pair i, j ≠ k with s_i^* ≿ s_j^*,

lim inf_{t→∞} σ_i^(t)(s_i^*) / σ_j^(t)(s_j^*) ≥ 1,

and (ii) s̄_k is a best response for k against every σ_{-k}^(t).

Proof. By Lemma 3, for every ε^(t)-PCE we get

σ_i^(t)(s_i^*) / σ_j^(t)(s_j^*) ≥ min{ σ_j^(t)(s_j^*) / σ_j^(t)(s_j^*), (1 − Σ_{s_i ≠ s_i^*} ε^(t)(s_i | i)) / σ_j^(t)(s_j^*) }
                              = min{ 1, (1 − Σ_{s_i ≠ s_i^*} ε^(t)(s_i | i)) / σ_j^(t)(s_j^*) }
                              ≥ 1 − Σ_{s_i ≠ s_i^*} ε^(t)(s_i | i).

This says

inf_{t≥T} σ_i^(t)(s_i^*) / σ_j^(t)(s_j^*) ≥ 1 − sup_{t≥T} Σ_{s_i ≠ s_i^*} ε^(t)(s_i | i).

For any sequence of trembles such that ε^(t) → 0,

lim_{T→∞} sup_{t≥T} Σ_{s_i ≠ s_i^*} ε^(t)(s_i | i) = 0,

so

lim inf_{t→∞} σ_i^(t)(s_i^*) / σ_j^(t)(s_j^*) = lim_{T→∞} inf_{t≥T} σ_i^(t)(s_i^*) / σ_j^(t)(s_j^*) ≥ 1.
This shows that if we fix a PCE σ* and consider a sequence of player-compatible trembles ε^(t) → 0 and ε^(t)-PCE σ^(t) → σ*, then each σ_{-k}^(t) satisfies lim inf_{t→∞} σ_i^(t)(s_i^*) / σ_j^(t)(s_j^*) ≥ 1 for every pair i, j ≠ k with s_i^* ≿ s_j^*. Furthermore, from σ_k^*(s̄_k) > 0 and σ_k^(t) → σ_k^*, we know there is some T_1 ∈ ℕ so that σ_k^(t)(s̄_k) > σ_k^*(s̄_k)/2 for all t ≥ T_1. We may also find T_2 ∈ ℕ so that ε^(t)(s̄_k | k) < σ_k^*(s̄_k)/2 for all t ≥ T_2, since ε^(t) → 0. So when t ≥ max(T_1, T_2), σ_k^(t) places strictly more than the required weight on s̄_k, so s̄_k is at least a weak best response for k against σ_{-k}^(t). Now the subsequence of opponent play (σ_{-k}^(t))_{t ≥ max(T_1, T_2)} satisfies the requirement of this proposition.

Theorem 1: PCE exists in every finite strategic-form game.

Proof.
Consider a sequence of tremble profiles with the same lower bound on the probability of each strategy, that is, ε^(t)(s_i | i) = ε^(t) for all i and s_i, with ε^(t) decreasing monotonically to 0 in t. Each of these tremble profiles is player-compatible (regardless of the compatibility structure ≿), and there is some finite T large enough that t ≥ T implies an ε^(t)-equilibrium exists; some subsequence of these ε^(t)-equilibria converges since the space of strategy profiles is compact. By definition these ε^(t)-equilibria are also ε^(t)-PCE, which establishes the existence of PCE.

Proposition 4: In a signaling game, every PCE σ* is a Nash equilibrium satisfying the compatibility criterion, as defined in Fudenberg and He (2018).

Proof. Since every PCE is a trembling-hand perfect equilibrium and since this latter solution concept refines Nash, σ* is a Nash equilibrium.

To show that it satisfies the compatibility criterion, we need to show that σ* assigns probability 0 to plans in A^S that do not best respond to beliefs in the set P(s, σ*) as defined in Fudenberg and He (2018). For any plan assigned positive probability under σ*, by Proposition 3 we may find a sequence of totally mixed signal profiles σ_1^(t) of the sender so that, whenever s is more compatible with type θ′ than with type θ′′ (s_{θ′} ≿ s_{θ′′}), we have lim inf_{t→∞} σ_1^(t)(s | θ′) / σ_1^(t)(s | θ′′) ≥ 1. Write q^(t)(·|s) for the Bayesian posterior belief about the sender's type after signal s under σ_1^(t), which is well defined because each σ_1^(t) is totally mixed. Whenever s_{θ′} ≿ s_{θ′′}, this sequence of posterior beliefs satisfies lim inf_{t→∞} q^(t)(θ′ | s) / q^(t)(θ′′ | s) ≥ λ(θ′) / λ(θ′′), so if the receiver's plan best responds to every element in the sequence, it also best responds to an accumulation point (q^∞(·|s))_{s∈S} with q^∞(θ′ | s) / q^∞(θ′′ | s) ≥ λ(θ′) / λ(θ′′) whenever s_{θ′} ≿ s_{θ′′}.
Since the player compatibility definition used in this paper is slightly easier to satisfy than the type compatibility definition that the set P(s, σ*) is based on, the plan best responds to P(s, σ*) after every signal s.

Proof of Lemma 1

Proof.
By way of contradiction, suppose there is some profile of moves by -i, (a_h)_{h∈H_{-i}}, so that h* is off the path of play in (s_i, (a_h)_{h∈H_{-i}}) = (s_i, a_{h*}, (a_h)_{h∈H_{-i}\h*}). Find a different action of j on h*, a′_{h*} ≠ a_{h*}. Since h* is off the path of play, both (s_i, a_{h*}, (a_h)_{h∈H_{-i}\h*}) and (s_i, a′_{h*}, (a_h)_{h∈H_{-i}\h*}) lead to the same payoff for i. But by Condition (1) in the definition of factorability and the fact that h* ∈ F_i[s_i], we will have found two -i action profiles s_{-i}, s′_{-i} in two different blocks of Π_i[s_i] with U_i(s_i, s_{-i}) = U_i(s_i, s′_{-i}). This contradicts Π_i[s_i] being the coarsest partition of S_{-i} that makes U_i(s_i, ·) measurable.
First, there must be at least two different actions for j on h*, for otherwise i's payoff would be trivially independent of h*. So, there exist actions a_{h*} ≠ a′_{h*} on h* and a profile a_{-h*} of actions elsewhere in the game tree so that U_i(a_{h*}, a_{-h*}) ≠ U_i(a′_{h*}, a_{-h*}). Consider the strategy s_i for i that matches a_{-h*} in terms of play on i's information sets, so we may equivalently write

U_i(s_i, a_{h*}, (a_h)_{h∈H_{-i}\h*}) ≠ U_i(s_i, a′_{h*}, (a_h)_{h∈H_{-i}\h*}),

where (a_h)_{h∈H_{-i}\h*} are the components of a_{-h*} corresponding to information sets of -i. If h* ∉ F_i[s_i], then by Condition (1) of factorability, (a_{h*}, (a_h)_{h∈H_{-i}\h*}) and (a′_{h*}, (a_h)_{h∈H_{-i}\h*}) belong to the same block in Π_i[s_i]. Yet they give different payoffs to i, which contradicts that i's payoff after s_i must be measurable with respect to Π_i[s_i].

We first show that i's induced response against i.i.d. play drawn from σ_{-i} is the same as playing against a response path drawn from η at the start of i's life. This η is the same for all agents and does not depend on their (possibly stochastic) learning rules.

Lemma 4.
In a factorable game, for each σ ∈ ×_k ∆(S_k), there is a distribution η over response paths, so that for any player i, any possibly random rule r_i : Y_i → ∆(S_i), and any strategy s_i ∈ S_i, we have

φ_i(s_i; r_i, σ_{-i}) = (1 − γ) · E_{A∼η}[ Σ_{t=1}^∞ γ^{t−1} · 1{ y_i^t(A, r_i) = s_i } ],

where y_i^t(A, r_i) refers to the t-th period history in y_i(A, r_i).

Proof. In fact, we will prove a stronger statement: we will show there is such a distribution that induces the same distribution over period-t histories for every i, every learning rule r_i, and every t. Think of each response path A as a two-dimensional array, A = (a_{t,h})_{t∈ℕ, h∈H}. For non-negative integers (N_h)_{h∈H}, each profile of sequences of actions ((a_{n_h,h})_{n_h=1}^{N_h})_{h∈H} where a_{n_h,h} ∈ A_h defines a "cylinder set" of response paths of the form:

{ A : a_{n_h,h} equals the given action, for each h ∈ H and 1 ≤ n_h ≤ N_h }.

That is, the cylinder set consists of those response paths whose first N_h elements for information set h match a given sequence, (a_{n_h,h})_{n_h=1}^{N_h}. (If N_h = 0, then there is no restriction on a_{t,h} for any t.) We specify the distribution η by specifying the probability it assigns to these cylinder sets:

η{ ((a_{n_h,h})_{n_h=1}^{N_h})_{h∈H} } = Π_{h∈H} Π_{n_h=1}^{N_h} σ( s : s(h) = a_{n_h,h} ),

where we have abused notation to write ((a_{n_h,h})_{n_h=1}^{N_h})_{h∈H} for the cylinder set satisfying this profile of sequences, and we have used the convention that the empty product is defined to be 1. Recall that a strategy profile s in the extensive-form game specifies an action s(h) ∈ A_h for every information set h in the game tree.
The probability that η assigns to the cylinder set multiplies, across all the a_{n_h,h} restrictions defining the cylinder set, the probabilities that the given mixed strategy σ draws a pure-strategy profile s that plays a_{n_h,h} at information set h.

We establish the claim by induction on t for period-t histories. For t ≥ 1, let Y_i[t] ⊆ Y_i be the set of possible period-t histories of i, that is, Y_i[t] := (S_i × R)^t. In the base case of t = 1, we show that playing against a response path drawn according to η and playing against a pure strategy drawn from σ_{-i} ∈ ×_{k≠i} ∆(S_k) generate the same period-1 history. Fixing a learning rule r_i : Y_i → ∆(S_i) of i, the probability of i having the period-1 history (s_i^(1), (a_h^(1))_{h∈F_i[s_i^(1)]}) ∈ Y_i[1] in the random-matching model is

r_i(∅)(s_i^(1)) · σ( s : s(h) = a_h^(1) for all h ∈ F_i[s_i^(1)] ).

That is, i's rule must play s_i^(1) in the first period of i's life, which happens with probability r_i(∅)(s_i^(1)). Then, i must encounter a pure strategy that generates the required profile of moves (a_h^(1))_{h∈F_i[s_i^(1)]} on the s_i^(1)-relevant information sets, which has probability σ(s : s(h) = a_h^(1) for all h ∈ F_i[s_i^(1)]). The probability of this happening against a response path drawn from η is

r_i(∅)(s_i^(1)) · η( A : a_{1,h} = a_h^(1) for all h ∈ F_i[s_i^(1)] )
= r_i(∅)(s_i^(1)) · Π_{h∈F_i[s_i^(1)]} σ( s : s(h) = a_h^(1) )
= r_i(∅)(s_i^(1)) · σ( s : s(h) = a_h^(1) for all h ∈ F_i[s_i^(1)] ),

where the second line comes from the probability η assigns to cylinder sets, and the third line comes from the fact that σ ∈ ×_k ∆(S_k) involves independent mixing of pure strategies across different players.

We now proceed with the inductive step. By induction, suppose random matching and the η-distributed response path induce the same distribution over the set of period-T histories, Y_i[T], where T ≥ 1. Write this common distribution as φ_{i,T}^RM = φ_{i,T}^η = φ_{i,T} ∈ ∆(Y_i[T]). (In the random-matching model, agents face a randomly drawn pure strategy profile each period, not a fixed behavior strategy: they are matched with random opponents, who each play a pure strategy in the game as a function of their personal history. From Kuhn's theorem, this is equivalent to facing a fixed profile of behavior strategies.) We prove that they also generate the same distribution over length-(T+1) histories.

Suppose random matching generates distribution φ_{i,T+1}^RM ∈ ∆(Y_i[T+1]) and the η-distributed response path generates distribution φ_{i,T+1}^η ∈ ∆(Y_i[T+1]). Each length-(T+1) history y_i[T+1] ∈ Y_i[T+1] may be written as (y_i[T], (s_i^(T+1), (a_h^(T+1))_{h∈F_i[s_i^(T+1)]})), where y_i[T] is a length-T history and (s_i^(T+1), (a_h^(T+1))_{h∈F_i[s_i^(T+1)]}) is a one-period history corresponding to what happens in period T+1. Therefore, we may write for each y_i[T+1],

φ_{i,T+1}^RM(y_i[T+1]) = φ_{i,T}^RM(y_i[T]) · φ_{i,T+1|T}^RM( (s_i^(T+1), (a_h^(T+1))_{h∈F_i[s_i^(T+1)]}) | y_i[T] ),

and

φ_{i,T+1}^η(y_i[T+1]) = φ_{i,T}^η(y_i[T]) · φ_{i,T+1|T}^η( (s_i^(T+1), (a_h^(T+1))_{h∈F_i[s_i^(T+1)]}) | y_i[T] ),

where φ_{i,T+1|T}^RM and φ_{i,T+1|T}^η are the conditional probabilities of the form "having history (s_i^(T+1), (a_h^(T+1))_{h∈F_i[s_i^(T+1)]}) in period T+1, conditional on having history y_i[T] ∈ Y_i[T] in the first T periods." If such conditional probabilities are always the same for the random-matching model and the η-distributed response path model, then from the hypothesis φ_{i,T}^RM = φ_{i,T}^η, we can conclude φ_{i,T+1}^RM = φ_{i,T+1}^η.
By an argument exactly analogous to the base case, we have for the random-matching model

φ^RM_{i,T+1|T}((s^(T+1)_i, (a^(T+1)_h)) | y_i[T]) = r_i(y_i[T])(s^(T+1)_i) · σ(s : s(h) = a^(T+1)_h for all h ∈ F_i[s^(T+1)_i]),

since the matching is independent across periods. But in the η-distributed response path model, a single response path is drawn once and then fixed, so one must compute the conditional probability that the drawn A is such that the response (a^(T+1)_h)_{h ∈ F_i[s^(T+1)_i]} will be seen in period T+1, given the history y_i[T] (which is informative about which response path i is facing).

For each h ∈ H_{−i}, let the non-negative integer N_h represent the number of times i has observed play at the information set h in the history y_i[T]. For each h, let (a_{n_h,h})_{n_h=1}^{N_h} represent the sequence of opponent actions observed at h, in chronological order. The history y_i[T] shows that i is facing a response sequence in the cylinder set consistent with ((a_{n_h,h})_{n_h=1}^{N_h})_{h ∈ H}. If A is to respond to i's next play of s^(T+1)_i with a^(T+1)_h on the s^(T+1)_i-relevant information sets, then A must belong to a more restrictive cylinder set, satisfying the restrictions

((a_{n_h,h})_{n_h=1}^{N_h})_{h ∈ H \ F_i[s^(T+1)_i]},   ((a_{n_h,h})_{n_h=1}^{N_h+1})_{h ∈ F_i[s^(T+1)_i]},

where for each h ∈ F_i[s^(T+1)_i], a_{N_h+1,h} = a^(T+1)_h. The conditional probability is then given by the ratio of the η-probabilities of these two cylinder sets, which from the definition of η must be Π_{h ∈ F_i[s^(T+1)_i]} σ(s : s(h) = a^(T+1)_h).
As before, the independence of σ across players means this is equal to σ(s : s(h) = a^(T+1)_h for all h ∈ F_i[s^(T+1)_i]).

Given this result, to prove that φ_i(s*_i; r_i, σ_{−i}) ≥ φ_j(s*_j; r_j, σ_{−j}), it suffices to show that for every response path A, the period in which s*_i is played for the k-th time in the induced history y_i(A, r_i) happens no later than the period in which s*_j is played for the k-th time in the history y_j(A, r_j). Now we turn to the proof of Proposition 6.
Proof.
Let 0 ≤ δγ < 1 and the totally mixed profile σ be fixed. Consider the product distribution η on the space of response paths, (×_{h ∈ H} A_h)^∞, whose marginal on each copy of ×_{h ∈ H} A_h is the action distribution of σ.

Using Lemma 4, fix a response path A drawn from η and denote by T^(k)_i the period where s*_i appears in y_i(A, r_i) for the k-th time, and by T^(k)_j the period where s*_j appears in y_j(A, r_j) for the k-th time. The quantities T^(k)_i, T^(k)_j are defined to be ∞ if the corresponding strategies do not appear at least k times in the infinite histories. Write n(s_i; k) ∈ N ∪ {∞} for the number of times s_i ∈ S_i is played in the history y_i(A, r_i) before period T^(k)_i and, similarly, n(s_j; k) ∈ N ∪ {∞} for the number of times s_j ∈ S_j is played in y_j(A, r_j) before period T^(k)_j.

Since ϕ establishes a bijection between S_i and S_j, it suffices to show that for every k = 1, 2, 3, ..., either T^(k)_j = ∞ or, for all s_i ≠ s*_i, n(s_i; k) ≤ n(ϕ(s_i); k). We show this by induction on k.

First we establish the base case of k = 1. Suppose T^(1)_j ≠ ∞ and, by way of contradiction, suppose there is some s_i ≠ s*_i such that n(s_i; 1) > n(s_j; 1), where s_j := ϕ(s_i). Find the subhistory y_i of y_i(A, r_i) that leads to s_i being played for the (n(s_j; 1) + 1)-th time, and find the subhistory y_j of y_j(A, r_j) that leads to j playing s*_j for the first time (y_j is well-defined because T^(1)_j ≠ ∞). Note that y_{i,s*_i} ∼ y_{j,s*_j} vacuously, since i has never played s*_i in y_i and j has never played s*_j in y_j. Also, y_{i,s_i} ∼ y_{j,s_j}, since i has played s_i for n(s_j; 1) times and j has played s_j the same number of times, while the definition of a response sequence implies they have seen the same history of play on the common information sets of −ij, F_i[s_i] ∩ F_j[s_j]. This satisfies the definition of third-party equivalence of histories.

Since r_j(y_j) = s*_j and r_j is an index rule, s*_j must have weakly the highest index at y_j. Since r_i is more compatible with s*_i than r_j is with s*_j, s_i must not have weakly the highest index at y_i. And yet r_i(y_i) = s_i, a contradiction.

Now suppose the statement holds for all k ≤ K for some K ≥ 1. We show it also holds for k = K + 1. If T^(K+1)_j = ∞ or T^(K)_j = ∞, we are done. Otherwise, by way of contradiction, suppose there is some s_i ≠ s*_i so that n(s_i; K+1) > n(ϕ(s_i); K+1). Find the subhistory y_i of y_i(A, r_i) that leads to s_i being played for the (n(ϕ(s_i); K+1) + 1)-th time. Since T^(K)_j ≠ ∞, the inductive hypothesis gives T^(K)_i ≠ ∞ and n(s_i; K) ≤ n(ϕ(s_i); K). That is, i played s_i no more than n(ϕ(s_i); K) times before playing s*_i for the K-th time. Since n(ϕ(s_i); K+1) + 1 > n(ϕ(s_i); K), the subhistory y_i must extend beyond period T^(K)_i, so it contains K instances of i playing s*_i.

Next, find the subhistory y_j of y_j(A, r_j) that leads to j playing s*_j for the (K+1)-th time. (This is well-defined because T^(K+1)_j ≠ ∞.) Note that y_{i,s*_i} ∼ y_{j,s*_j}, since i and j have played s*_i and s*_j for K times each, and they were facing the same response path. Also, y_{i,s_i} ∼ y_{j,s_j}, since i has played s_i for n(ϕ(s_i); K+1) times and j has played s_j the same number of times. Since r_j(y_j) = s*_j and r_j is an index rule, s*_j must have weakly the highest index at y_j. Since r_i is more compatible with s*_i than r_j is with s*_j, s_i must not have weakly the highest index at y_i. And yet r_i(y_i) = s_i, a contradiction.
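The response-path construction used throughout this proof, one pre-drawn sequence of opponent actions per information set, consumed one element per visit, can be sketched directly. The code below is our own illustration (the class and the two toy rules are not the paper's notation); it demonstrates the coupling property the proof exploits: two different learning rules fed the same path observe identical opponent play on their shared visits to an information set.

```python
import random

class ResponsePath:
    """A fixed response path: for each information set h, a pre-drawn
    (lazily extended) sequence of opponent actions. The n-th visit to h is
    answered with the n-th element, regardless of which learner is asking."""
    def __init__(self, action_dists, seed=0):
        self.action_dists = action_dists                     # h -> (actions, weights)
        self.rngs = {h: random.Random(f"{seed}:{h}") for h in action_dists}
        self.seqs = {h: [] for h in action_dists}

    def answer(self, h, n):
        acts, weights = self.action_dists[h]
        while len(self.seqs[h]) <= n:                        # lazily extend the sequence
            self.seqs[h].append(self.rngs[h].choices(acts, weights)[0])
        return self.seqs[h][n]

def run(rule, path, periods):
    """Feed a learning rule the path. `rule` maps the history so far to the
    information set it queries this period (a stand-in for choosing a strategy)."""
    visits = {h: 0 for h in path.action_dists}
    history = []
    for _ in range(periods):
        h = rule(history)
        history.append((h, path.answer(h, visits[h])))
        visits[h] += 1
    return history
```

In the proof, this coupling is what allows the histories of players i and j to be compared visit-by-visit under a common response path A.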
In this section, we show that under the conditions of Theorem 2, the Gittins index and the UCB index satisfy the comparative compatibility condition for index rules. Omitted proofs from this section can be found in the Online Appendix.
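Before the formal definitions, it may help to see the object being computed. The sketch below is our own illustration, not code from the paper: it computes a Gittins-style index for the simplest case, a single strategy whose payoff is Bernoulli with unknown success probability under a Beta(a, b) prior, with a discount factor beta standing in for the effective discount factor δγ. It uses the standard retirement-charge calibration (the index is the per-period charge m at which the optimal stopping value of the arm is exactly zero); the finite truncation horizon and the bisection tolerance are numerical conveniences.

```python
from functools import lru_cache

def gittins_index_bernoulli(a, b, beta=0.9, horizon=100, tol=1e-4):
    """Gittins-style index of a Bernoulli arm with Beta(a, b) posterior.

    Retirement-charge calibration: the index is the per-period charge m
    at which the optimal stopping value of the arm is zero. `horizon`
    truncates the stopping problem; `tol` is the bisection width.
    """
    def stopping_value(m):
        @lru_cache(maxsize=None)
        def V(succ, fail, t):
            if t == horizon:
                return 0.0
            p = succ / (succ + fail)          # posterior mean of the arm
            cont = (p - m) + beta * (p * V(succ + 1, fail, t + 1)
                                     + (1 - p) * V(succ, fail + 1, t + 1))
            return max(0.0, cont)             # stopping ("retiring") is free
        return V(a, b, 0)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if stopping_value(mid) > 0:
            lo = mid                          # charge too low: still worth playing
        else:
            hi = mid
    return (lo + hi) / 2
```

For a uniform Beta(1, 1) prior the computed index exceeds the myopic mean 0.5, reflecting the option value of experimentation that drives the analysis in this section.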
Let survival chance γ ∈ [0, 1) and patience δ ∈ [0, 1) be fixed. Let ν_{s_i} ∈ ×_{h ∈ F_i[s_i]} Δ(Δ(A_h)) be a belief over opponents' mixed actions at the s_i-relevant information sets. The Gittins index of s_i under belief ν_{s_i} is given by the maximum value of the following auxiliary optimization problem:

sup_{τ ≥ 1}  E_{ν_{s_i}} { Σ_{t=1}^{τ} (δγ)^{t−1} · u_i(s_i, (a_h(t))_{h ∈ F_i[s_i]}) } / E_{ν_{s_i}} { Σ_{t=1}^{τ} (δγ)^{t−1} },

where the supremum is taken over all positive-valued stopping times τ ≥ 1. Here (a_h(t))_{h ∈ F_i[s_i]} is the profile of actions that −i play on the s_i-relevant information sets the t-th time that i uses s_i: by factorability, only these actions, and not actions elsewhere in the game tree, determine i's payoff from playing s_i. The distribution over the infinite sequence of profiles (a_h(t))_{t=1}^∞ is given by i's belief ν_{s_i}; that is, there is some fixed mixed action in ×_{h ∈ F_i[s_i]} Δ(A_h) that generates the profiles (a_h(t)) i.i.d. across periods t. The event {τ = T} for T ≥ 1 corresponds to playing s_i for T times, observing the first T elements (a_h(t))_{t=1}^T, then stopping.

Write V(τ; s_i, ν_{s_i}) for the value of the above auxiliary problem under the (not necessarily optimal) stopping time τ. The Gittins index of s_i is sup_{τ ≥ 1} V(τ; s_i, ν_{s_i}).

We begin by linking V(τ; s_i, ν_{s_i}) to i's stage-game payoff from playing s_i. From belief ν_{s_i} and stopping time τ, we will construct the correlated profile α(ν_{s_i}, τ) ∈ Δ°(×_{h ∈ F_i[s_i]} A_h), so that V(τ; s_i, ν_{s_i}) is equal to i's expected payoff when playing s_i while opponents play according to this correlated profile on the s_i-relevant information sets.

Definition.
A full-support belief ν_{s_i} ∈ ×_{h ∈ F_i[s_i]} Δ(Δ(A_h)) for player i, together with a (possibly random) stopping rule τ ≥ 1, induces a stochastic process (ã_{(−i),t})_{t ≥ 1} over the space (×_{h ∈ F_i[s_i]} A_h) ∪ {∅}, where ã_{(−i),t} ∈ ×_{h ∈ F_i[s_i]} A_h represents the opponents' actions observed in period t if τ ≥ t, and ã_{(−i),t} = ∅ if τ < t. We call ã_{(−i),t} player i's internal history at period t and write P_{(−i)} for the distribution over internal histories that the stochastic process induces.

Internal histories live in the same space as player i's actual experience in the learning problem, represented as a history in Y_i. The process over internal histories is i's prediction about what would happen in the auxiliary problem (which is an artificial device for computing the Gittins index) if he were to use τ. Enumerate all possible profiles of moves at information sets F_i[s_i] as ×_{h ∈ F_i[s_i]} A_h = {a^(1)_{(−i)}, ..., a^(K)_{(−i)}}; let p_{t,k} := P_{(−i)}[ã_{(−i),t} = a^(k)_{(−i)}] for 1 ≤ k ≤ K be the probability under ν_{s_i} of seeing the profile of actions a^(k)_{(−i)} in period t of the stochastic process over internal histories (ã_{(−i),t})_{t ≥ 1}, and let p_{t,0} := P_{(−i)}[ã_{(−i),t} = ∅] be the probability of having stopped before period t.

Definition.
The synthetic correlated profile at information sets in F_i[s_i] is the element of Δ°(×_{h ∈ F_i[s_i]} A_h) (i.e., a totally mixed correlated random action) that assigns probability

Σ_{t=1}^∞ β^{t−1} p_{t,k} / Σ_{t=1}^∞ β^{t−1} (1 − p_{t,0})

to the profile of actions a^(k)_{(−i)}. Denote this profile by α(ν_{s_i}, τ).

Note that the synthetic correlated profile depends on the belief ν_{s_i}, the stopping rule τ, and the discount factor β. Since the belief ν_{s_i} has full support, there is always a positive probability assigned to observing every possible profile of actions on F_i[s_i] in the first period, so the synthetic correlated profile is totally mixed. The significance of the synthetic correlated profile is that it gives an alternative expression for the value of the auxiliary problem under stopping rule τ.

Lemma 5. V(τ; s_i, ν_{s_i}) = U_i(s_i, α(ν_{s_i}, τ)).

The proof is the same as in Fudenberg and He (2018) and is omitted.

Consider now the situation where i and j share the same beliefs about play of −ij on the common information sets F_i[s_i] ∩ F_j[s_j] ⊆ H_{−ij}. For any pure-strategy stopping time τ_j of j, we define a random stopping rule of i, the mimicking stopping time for τ_j. Lemma 6 will establish that the mimicking stopping time generates a synthetic correlated profile that matches the corresponding profile of τ_j on F_i[s_i] ∩ F_j[s_j].

The key issue in this construction is that τ_j maps j's internal histories to stopping decisions, and these do not live in the same space as i's internal histories. In particular, τ_j makes use of i's play to decide whether to stop. To mimic such a rule, i makes use of external histories, which include both the common component of i's internal history on F_i[s_i] ∩ F_j[s_j] and simulated histories on F_j[s_j] \ (F_i[s_i] ∩ F_j[s_j]). For Γ isomorphically factorable for i and j with ϕ(s_i) = s_j, we may write F_i[s_i] = F_C ∪ F_j with F_C ⊆ H_{−ij} and F_j ⊆ H_j.
Similarly, we may write F_j[s_j] = F_C ∪ F_i with F_i ⊆ H_i. (So F_C is the set of common information sets that are payoff-relevant for both s_i and s_j.) Whenever j plays s_j, he observes some (a_{(C)}, a_{(i)}) ∈ (×_{h ∈ F_C} A_h) × (×_{h ∈ F_i} A_h), where a_{(C)} is a profile of actions at information sets in F_C and a_{(i)} is a profile of actions at information sets in F_i. So a pure-strategy stopping rule in the auxiliary problem defining j's Gittins index for s_j is a function τ_j : ∪_{t ≥ 0} [(×_{h ∈ F_C} A_h) × (×_{h ∈ F_i} A_h)]^t → {0, 1} that maps finite histories of observations to stopping decisions, where "0" means continue and "1" means stop.

(Footnote: Notice that even though i starts with the belief that opponents randomize independently at different information sets, and also holds an independent prior belief, V(τ; s_i, ν_{s_i}) may not be the payoff of playing s_i against independent randomizations by the opponents, because of the endogenous correlation that we discussed in the text.)

Definition.
Player i's mimicking stopping rule for τ_j first draws α_i ∈ ×_{h ∈ F_i} Δ(A_h) from j's belief ν_{s_j} on F_i, and then draws (a_{(i),ℓ})_{ℓ ≥ 1} by independently generating a_{(i),ℓ} from α_i each period. Conditional on (a_{(i),ℓ}), i stops according to the rule

(τ_i | (a_{(i),ℓ})) ((a_{(C),ℓ}, a_{(j),ℓ})_{ℓ=1}^t) := τ_j((a_{(C),ℓ}, a_{(i),ℓ})_{ℓ=1}^t).

That is, the mimicking stopping rule involves ex-ante randomization across the pure-strategy stopping rules τ_i | (a_{(i),ℓ})_{ℓ=1}^∞. First, i draws a behavior strategy on the information sets F_i according to j's belief about i's play. Then, i simulates an infinite sequence (a_{(i),ℓ})_{ℓ=1}^∞ of i's play using this drawn behavior strategy and follows the pure-strategy stopping rule τ_i | (a_{(i),ℓ})_{ℓ=1}^∞.

As in the definition of internal histories, the mimicking strategy and i's belief ν_{s_i} generate a stochastic process (ã_{(j),t}, ã_{(C),t})_{t ≥ 1} of internal histories for i (representing the actions on F_i[s_i] that i anticipates seeing when he plays s_i). It also induces a stochastic process (ẽ_{(i),t}, ẽ_{(C),t})_{t ≥ 1} of "external histories," defined in the following way:

Definition.
The stochastic process of external histories (ẽ_{(i),t}, ẽ_{(C),t})_{t ≥ 1} is defined from the process of internal histories (ã_{(j),t}, ã_{(C),t})_{t ≥ 1} that τ_i generates, as follows: (i) if τ_i < t, then (ẽ_{(i),t}, ẽ_{(C),t}) = ∅; (ii) otherwise, ẽ_{(C),t} = ã_{(C),t}, and ẽ_{(i),t} is the t-th element of the infinite sequence (a_{(i),ℓ})_{ℓ=1}^∞ that i simulated before the first period of the auxiliary problem.

Write P_e for the distribution over the sequence of external histories generated by i's mimicking stopping time for τ_j, which is a function of τ_j, ν_{s_j}, and ν_{s_i}.

(Footnote: Here (a_{(−j),ℓ})_{ℓ=1}^t = ((a_{(C),ℓ}, a_{(i),ℓ}))_{ℓ=1}^t. Note that the mimicking rule is a valid (stochastic) stopping time, as the event {τ_i ≤ T} is independent of play after period T.)

To understand the distinction between internal and external histories, note that the probability of i's first-period internal history satisfying (ã_{(j),1}, ã_{(C),1}) = (ā_{(j)}, ā_{(C)}) for some fixed values (ā_{(j)}, ā_{(C)}) ∈ ×_{h ∈ F_i[s_i]} A_h is given by the probability that a mixed action α_{−i} on F_i[s_i], drawn according to i's belief ν_{s_i}, would generate the profile of actions (ā_{(j)}, ā_{(C)}). On the other hand, the probability of i's first-period external history satisfying (ẽ_{(i),1}, ẽ_{(C),1}) = (ā_{(i)}, ā_{(C)}) for some fixed values (ā_{(i)}, ā_{(C)}) ∈ ×_{h ∈ F_j[s_j]} A_h also depends on j's belief ν_{s_j}, for this belief determines the distribution over (a_{(i),ℓ})_{ℓ=1}^∞ drawn before the start of the auxiliary problem.

When using the mimicking stopping time for τ_j in the auxiliary problem, i expects to see the same distribution of −ij's play before stopping as j does when using τ_j, on the information sets that are both s_i-relevant and s_j-relevant. This is formalized in the next lemma.

Lemma 6.
For Γ isomorphically factorable for i and j with ϕ(s_i) = s_j, suppose i holds belief ν_{s_i} over play in F_i[s_i] and j holds belief ν_{s_j} over play in F_j[s_j], such that ν_{s_i}|_{F_i[s_i] ∩ F_j[s_j]} = ν_{s_j}|_{F_i[s_i] ∩ F_j[s_j]}; that is, the two beliefs match when marginalized to the common information sets in H_{−ij}. Let τ_i be i's mimicking stopping time for τ_j. Then the synthetic correlated profile α(ν_{s_j}, τ_j) marginalized to the information sets of −ij is the same as α(ν_{s_i}, τ_i) marginalized to the same information sets.

Proposition 7.
Suppose Γ is isomorphically factorable for i and j with ϕ(s*_i) = s*_j and ϕ(s_i) = s_j, where s_i ≠ s*_i and s*_i ≿ s*_j. Suppose i holds beliefs ν_{s_i} ∈ ×_{h ∈ F_i[s_i]} Δ(Δ(A_h)) about opponents' play after each s_i, and j holds beliefs ν_{s_j} ∈ ×_{h ∈ F_j[s_j]} Δ(Δ(A_h)) about opponents' play after each s_j, such that ν_{s*_i}|_{F_i[s*_i] ∩ F_j[s*_j]} = ν_{s*_j}|_{F_i[s*_i] ∩ F_j[s*_j]} and ν_{s_i}|_{F_i[s_i] ∩ F_j[s_j]} = ν_{s_j}|_{F_i[s_i] ∩ F_j[s_j]}. If s*_j has the weakly highest Gittins index for j under effective discount factor 0 ≤ δγ < 1, then s_i does not have the weakly highest Gittins index for i under the same effective discount factor.

Proof. We begin by defining a profile of totally mixed correlated actions at the information sets ∪_{s_j ∈ S_j} F_j[s_j] ⊆ H_{−j}, namely a collection of totally mixed correlated profiles (α_{F_j[s_j]})_{s_j ∈ S_j} where α_{F_j[s_j]} ∈ Δ°(×_{h ∈ F_j[s_j]} A_h). For each s'_j ≠ s_j, the profile α_{F_j[s'_j]} is the synthetic correlated profile α(ν_{s'_j}, τ*_{s'_j}), where τ*_{s'_j} is an optimal pure-strategy stopping time in j's auxiliary stopping problem involving s'_j. For s_j itself, the correlated profile α_{F_j[s_j]} is instead the synthetic correlated profile associated with the mimicking stopping rule for τ*_{s_i}, i.e., agent i's pure-strategy optimal stopping time in i's auxiliary problem for s_i.

Next, define a profile of totally mixed correlated actions at the information sets ∪_{s_i ∈ S_i} F_i[s_i] ⊆ H_{−i} for i's opponents. For each s'_i ∉ {s*_i, s_i}, just use the marginal distribution of α_{F_j[ϕ(s'_i)]} constructed before on F_i[s'_i] ∩ H_{−ij}, then arbitrarily specify play at j's information sets contained in F_i[s'_i], if any. For s_i, the correlated profile is α(ν_{s_i}, τ*_{s_i}), i.e., the synthetic move associated with i's optimal stopping rule for s_i. Finally, for s*_i, the correlated profile α_{F_i[s*_i]} is the synthetic correlated profile associated with the mimicking stopping rule for τ*_{s*_j}.

From Lemma 6, these two profiles of correlated actions agree when marginalized to the information sets of −ij. Therefore, they can be completed into totally mixed correlated strategies, σ_{−i} and σ_{−j} respectively, such that σ_{−i}|_{S_{−ij}} = σ_{−j}|_{S_{−ij}}. For each s'_j ≠ s_j, the Gittins index of s'_j for j is U_j(s'_j, σ_{−j}). Also, since α_{F_j[s_j]} is the mixed profile associated with the mimicking stopping time, which need not be optimal, U_j(s_j, σ_{−j}) is no larger than the Gittins index of s_j for j. By the hypothesis that s*_j has the weakly highest Gittins index for j, U_j(s*_j, σ_{−j}) ≥ max_{s_j ≠ s*_j} U_j(s_j, σ_{−j}). By the definition of s*_i ≿ s*_j, we must also have U_i(s*_i, σ_{−i}) > max_{s_i ≠ s*_i} U_i(s_i, σ_{−i}), so in particular U_i(s*_i, σ_{−i}) > U_i(s_i, σ_{−i}). But U_i(s*_i, σ_{−i}) is no larger than the Gittins index of s*_i, for α_{F_i[s*_i]} is the synthetic profile associated with a possibly suboptimal mimicking stopping time, while U_i(s_i, σ_{−i}) is equal to the Gittins index of s_i. This shows s_i cannot have even weakly the highest Gittins index, for s*_i already has a strictly higher Gittins index than s_i does.

The following corollary of Proposition 7, combined with Proposition 6, establishes the first statement of Theorem 2.

Corollary 1.
When s*_i ≿ s*_j, and i and j have the same patience δ, the same survival chance γ, and equivalent independent regular priors, OPT_i is more compatible with s*_i than OPT_j is with s*_j.

Proof. Equivalent regular priors require that priors are independent and that i and j share the same prior beliefs over play on F* := F_i[s*_i] ∩ F_j[s*_j] and over play on F := F_i[s_i] ∩ F_j[s_j]. Thus, after histories y_i, y_j such that y_{i,s*_i} ∼ y_{j,s*_j} and y_{i,s_i} ∼ y_{j,s_j}, we have ν_{s*_i}|_{F*} = ν_{s*_j}|_{F*} and ν_{s_i}|_{F} = ν_{s_j}|_{F}, so the hypotheses of Proposition 7 are satisfied.

We start with a lemma that shows the Bayes-UCB index for a strategy s_i is equal to i's payoff from playing s_i against a certain profile of mixed actions on F_i[s_i], where this profile depends on i's belief about actions on F_i[s_i], the quantile q, and how u_{s_i,h} ranks mixed actions in Δ(A_h) for each h ∈ F_i[s_i].

Lemma 7.
Let n_{s_i} be the number of times i has played s_i in history y_i, and let q_{s_i} = q(n_{s_i}) ∈ (0, 1). Then the Bayes-UCB index for s_i given the quantile-choice function q after history y_i is equal to U_i(s_i, (ᾱ_h)_{h ∈ F_i[s_i]}) for some profile of mixed actions where ᾱ_h ∈ Δ°(A_h) for each h. Furthermore, ᾱ_h only depends on q_{s_i}, on g_i(·|y_{i,h}) (i's posterior belief about play on h), and on how u_{s_i,h} ranks mixed actions in Δ(A_h).

Proof. For each h ∈ F_i[s_i], the random variable ũ_{s_i,h}(y_{i,h}) only depends on y_{i,h} through the posterior g_i(·|y_{i,h}). Furthermore, Q(ũ_{s_i,h}(y_{i,h}); q_{s_i}) is strictly between the highest and lowest possible values of u_{s_i,h}(·), each of which can be attained by some pure action in A_h, so there is a totally mixed ᾱ_h ∈ Δ°(A_h) such that Q(ũ_{s_i,h}(y_{i,h}); q_{s_i}) = u_{s_i,h}(ᾱ_h). Moreover, if u_{s_i,h} and u_{s_j,h} rank mixed actions on Δ(A_h) in the same way, then there are a ∈ R and b > 0 with u_{s_j,h} = a + b·u_{s_i,h}. Then Q(ũ_{s_j,h}(y_{i,h}); q_{s_i}) = a + b·Q(ũ_{s_i,h}(y_{i,h}); q_{s_i}), so ᾱ_h still works for u_{s_j,h}.

The second statement of Theorem 2 follows as a corollary.

Corollary 2. If s*_i ≿ s*_j and the hypotheses of Theorem 2 are satisfied, then UCB_i is more compatible with s*_i than UCB_j is with s*_j.

Proof. When i and j have matching beliefs, by Lemma 7 we may calculate their Bayes-UCB indices for different strategies as their myopic expected payoffs from using these strategies against some common opponents' play, as in the similar argument for the Gittins index. Applying the definition of compatibility, we can deduce that when s*_i ≿ s*_j and ϕ(s*_i) = s*_j, if s*_j has the highest Bayes-UCB index for j, then s*_i must have the highest Bayes-UCB index for i.
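The content of Lemma 7, that a Bayes-UCB index is a posterior quantile of payoffs and hence equals the payoff against some opponent mixed action, can be illustrated with a discrete prior. The sketch below is our own illustration: the grid prior over the opponent's mixed action, the Bernoulli observation model, and the payoff function in the test are all assumed for concreteness.

```python
def bayes_ucb_index(payoff, grid, prior, observations, q):
    """Bayes-UCB index for one strategy whose payoff depends on the opponent's
    unknown probability p of playing action 1 at a single information set.

    payoff: function p -> expected stage payoff of the strategy
    grid, prior: discrete prior over candidate values of p
    observations: 0/1 opponent actions seen on past plays of the strategy
    q: quantile in (0, 1), the optimism level of the index
    """
    # Posterior over the grid by Bayes' rule with a Bernoulli likelihood.
    post = []
    for p, w in zip(grid, prior):
        like = 1.0
        for obs in observations:
            like *= p if obs == 1 else (1.0 - p)
        post.append(w * like)
    total = sum(post)
    post = [w / total for w in post]

    # The index is the q-quantile of the induced payoff distribution.
    pairs = sorted(zip(map(payoff, grid), post))
    cum = 0.0
    for u, w in pairs:
        cum += w
        if cum >= q:
            return u
    return pairs[-1][0]
```

Because a quantile function is nondecreasing in q, raising the optimism level weakly raises the index; and since the quantile lies between the lowest and highest payoffs on the grid, it is attainable as the payoff against some totally mixed action, which is the substance of Lemma 7.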
Lemma 6: For Γ isomorphically factorable for i and j with ϕ(s_i) = s_j, suppose i holds belief ν_{s_i} over play in F_i[s_i] and j holds belief ν_{s_j} over play in F_j[s_j], such that ν_{s_i}|_{F_i[s_i] ∩ F_j[s_j]} = ν_{s_j}|_{F_i[s_i] ∩ F_j[s_j]}; that is, the two beliefs match when marginalized to the common information sets in H_{−ij}. Let τ_i be i's mimicking stopping time for τ_j. Then the synthetic correlated profile α(ν_{s_j}, τ_j) marginalized to the information sets of −ij is the same as α(ν_{s_i}, τ_i) marginalized to the same information sets.

Proof. Let (ã_{(j),t}, ã_{(C),t})_{t ≥ 1} and (ẽ_{(i),t}, ẽ_{(C),t})_{t ≥ 1} be the stochastic processes of internal and external histories for τ_i, with distributions P_{−i} and P_e. Enumerate the possible profiles of actions on F_C as ×_{h ∈ F_C} A_h = {a^(1)_{(C)}, ..., a^(K_C)_{(C)}}, the possible profiles of actions on F_j as ×_{h ∈ F_j} A_h = {a^(1)_{(j)}, ..., a^(K_j)_{(j)}}, and the possible profiles of actions on F_i as ×_{h ∈ F_i} A_h = {a^(1)_{(i)}, ..., a^(K_i)_{(i)}}.

Write p_{t,(k_j,k_C)} := P_{−i}[(ã_{(j),t}, ã_{(C),t}) = (a^(k_j)_{(j)}, a^(k_C)_{(C)})] for k_j ∈ {1, ..., K_j} and k_C ∈ {1, ..., K_C}. Also write q_{t,(k_i,k_C)} := P_e[(ẽ_{(i),t}, ẽ_{(C),t}) = (a^(k_i)_{(i)}, a^(k_C)_{(C)})] for k_i ∈ {1, ..., K_i} and k_C ∈ {1, ..., K_C}. Let p_{t,(0,0)} = q_{t,(0,0)} := P_{−i}[τ_i < t] = P_e[τ_i < t] be the probability of having stopped before period t.

The distribution of external histories that i expects to observe before stopping under belief ν_{s_i} when using the mimicking stopping rule τ_i is the same as the distribution of internal histories that j expects to observe when using the stopping rule τ_j under belief ν_{s_j}, because i simulates the data-generating process on F_i by drawing a mixed action α_i according to j's belief ν_{s_j}|_{F_i}, and ν_{s_i}|_{F_C} = ν_{s_j}|_{F_C}.
Thus for every k_i ∈ {1, ..., K_i} and k_C ∈ {1, ..., K_C},

Σ_{t=1}^∞ (δγ)^{t−1} q_{t,(k_i,k_C)} / Σ_{t=1}^∞ (δγ)^{t−1} (1 − q_{t,(0,0)}) = α(ν_{s_j}, τ_j)(a^(k_i)_{(i)}, a^(k_C)_{(C)}).

For a fixed k̄_C ∈ {1, ..., K_C}, summing across k_i gives

Σ_{t=1}^∞ (δγ)^{t−1} Σ_{k_i=1}^{K_i} q_{t,(k_i,k̄_C)} / Σ_{t=1}^∞ (δγ)^{t−1} (1 − q_{t,(0,0)}) = α(ν_{s_j}, τ_j)(a^(k̄_C)_{(C)}).

By definition, the processes (ã_{(j),t}, ã_{(C),t})_{t ≥ 1} and (ẽ_{(i),t}, ẽ_{(C),t})_{t ≥ 1} have the same marginal distribution on the second dimension:

Σ_{k_i=1}^{K_i} q_{t,(k_i,k̄_C)} = P_{−i}[ã_{(C),t} = a^(k̄_C)_{(C)}] = Σ_{k_j=1}^{K_j} p_{t,(k_j,k̄_C)}.

Making this substitution and using the fact that p_{t,(0,0)} = q_{t,(0,0)},

Σ_{t=1}^∞ (δγ)^{t−1} Σ_{k_j=1}^{K_j} p_{t,(k_j,k̄_C)} / Σ_{t=1}^∞ (δγ)^{t−1} (1 − p_{t,(0,0)}) = α(ν_{s_j}, τ_j)(a^(k̄_C)_{(C)}).

But by the definition of the synthetic correlated profile, the LHS is Σ_{k_j=1}^{K_j} α(ν_{s_i}, τ_i)(a^(k_j)_{(j)}, a^(k̄_C)_{(C)}) = α(ν_{s_i}, τ_i)(a^(k̄_C)_{(C)}). Since the choice of a^(k̄_C)_{(C)} ∈ ×_{h ∈ F_C} A_h was arbitrary, we have shown that the synthetic profile α(ν_{s_j}, τ_j) of the original stopping rule τ_j and the one associated with the mimicking strategy of i, α(ν_{s_i}, τ_i), coincide on F_C.

Corollary 2: The Bayes-UCB rules r_{i,UCB} and r_{j,UCB} satisfy the hypotheses of Proposition 6 when s*_i ≿ s*_j, provided the hypotheses of Theorem 2 are satisfied.

Proof. Consider histories y_i, y_j with y_{i,s*_i} ∼ y_{j,s*_j} and y_{i,s_i} ∼ y_{j,s_j}.
By Lemma 7, there exist ᾱ^{−i}_h ∈ Δ°(A_h) for every h ∈ ∪_{s_i ∈ S_i} F_i[s_i] and ᾱ^{−j}_h ∈ Δ°(A_h) for every h ∈ ∪_{s_j ∈ S_j} F_j[s_j] so that ι_{i,s_i}(y_i) = U_i(s_i, (ᾱ^{−i}_h)_{h ∈ F_i[s_i]}) and ι_{j,s_j}(y_j) = U_j(s_j, (ᾱ^{−j}_h)_{h ∈ F_j[s_j]}) for all s_i, s_j, where ι_{i,s_i}(y_i) is the Bayes-UCB index for s_i after history y_i and ι_{j,s_j}(y_j) is the Bayes-UCB index for s_j after history y_j.

Because y_{i,s*_i} ∼ y_{j,s*_j} and y_{i,s_i} ∼ y_{j,s_j}, y_i contains the same number of s*_i experiments as y_j contains s*_j experiments, and y_i contains the same number of s_i experiments as y_j contains s_j experiments. Also, by third-party equivalence and the fact that i and j start with the same beliefs on common relevant information sets, they have the same posterior beliefs g_i(·|y_{i,h}), g_j(·|y_{j,h}) for any h ∈ F_i[s*_i] ∩ F_j[s*_j] and any h ∈ F_i[s_i] ∩ F_j[s_j]. Finally, the hypotheses of Theorem 2 say that on any h ∈ F_i[s*_i] ∩ F_j[s*_j], u_{s*_i,h} and u_{s*_j,h} have the same ranking of mixed actions, while on any h ∈ F_i[s_i] ∩ F_j[s_j], u_{s_i,h} and u_{s_j,h} have the same ranking of mixed actions. So by Lemma 7 we may take ᾱ^{−i}_h = ᾱ^{−j}_h for all h ∈ F_i[s*_i] ∩ F_j[s*_j] and all h ∈ F_i[s_i] ∩ F_j[s_j].

Find some σ_{−j} = (σ_{−ij}, σ_i) ∈ ×_{k ≠ j} Δ°(S_k) so that σ_{−j} generates the random actions (ᾱ^{−j}_h) on every h ∈ ∪_{s_j ∈ S_j} F_j[s_j]. Then we have ι_{j,s_j}(y_j) = U_j(s_j, σ_{−j}) for every s_j ∈ S_j. The fact that s*_j has weakly the highest index means s*_j is weakly optimal against σ_{−j}. Now take σ_{−i} = (σ_{−ij}, σ_j), where σ_j ∈ Δ°(S_j) is such that it generates the random actions (ᾱ^{−i}_h) on F_i[s*_i] ∩ H_j and F_i[s_i] ∩ H_j.
But since ᾱ^{−i}_h = ᾱ^{−j}_h for all h ∈ F_i[s*_i] ∩ F_j[s*_j] and all h ∈ F_i[s_i] ∩ F_j[s_j], σ_{−i} generates the random actions (ᾱ^{−i}_h) on all of F_i[s*_i] and F_i[s_i], meaning ι_{i,s*_i}(y_i) = U_i(s*_i, σ_{−i}) and ι_{i,s_i}(y_i) = U_i(s_i, σ_{−i}). The definition of compatibility implies U_i(s*_i, σ_{−i}) > U_i(s_i, σ_{−i}), so ι_{i,s*_i}(y_i) > ι_{i,s_i}(y_i). This shows s_i does not have weakly the highest Bayes-UCB index, since s*_i has a strictly higher one.

Online Appendix

9 Proofs of Propositions 1 and 2

9.1 Proof of Proposition 1

Proposition 1: Suppose s*_i ≿ s*_j ≿ s*_k, where s*_i, s*_j, s*_k are strategies of i, j, k. Then s*_i ≿ s*_k.

Proof. Suppose s*_k is weakly optimal for k against some totally mixed correlated profile σ^(k). We show that s*_i is strictly optimal for i against any totally mixed correlated σ^(i) with the property that marg_{−ik}(σ^(k)) = marg_{−ik}(σ^(i)).

To do this, we first modify σ^(i) into a new totally mixed profile by copying how the action of i correlates with the actions of −(ik) in σ^(k). For each s_{−ik} ∈ S_{−ik} and s_i ∈ S_i, we have σ^(k)(s_i, s_{−ik}) > 0 because marg_{−k}(σ^(k)) ∈ Δ°(S_{−k}). So write p(s_i | s_{−ik}) := σ^(k)(s_i, s_{−ik}) / Σ_{s'_i ∈ S_i} σ^(k)(s'_i, s_{−ik}) > 0 for the conditional probability that i plays s_i given that −ik play s_{−ik}, in the profile σ^(k). Now construct the profile ˆˆσ ∈ Δ°(S), where

ˆˆσ(s_i, s_{−ik}, s_k) := p(s_i | s_{−ik}) · σ^(i)(s_{−ik}, s_k).

Profile ˆˆσ has the property that marg_{−jk}(ˆˆσ) = marg_{−jk}(σ^(k)).
To see this, note first that ˆˆσ and σ^(k) have the same marginal distribution over the actions of −(ik), because marg_{−ik}(σ^(k)) = marg_{−ik}(σ^(i)). Also, by construction, the conditional distribution of i's action given the profile of −(ik)'s actions is the same under ˆˆσ as under σ^(k).

From the hypothesis that s*_j ≿ s*_k, we get that j finds s*_j strictly optimal against ˆˆσ. But at the same time, marg_{−i}(ˆˆσ) = marg_{−i}(σ^(i)) by construction, so this also implies marg_{−ij}(ˆˆσ) = marg_{−ij}(σ^(i)). From s*_i ≿ s*_j, and the conclusion just obtained that j finds s*_j strictly optimal against ˆˆσ, we get that i finds s*_i strictly optimal against σ^(i), as desired.

9.2 Proof of Proposition 2

Proposition 2: If s*_i ≿ s*_j, then either we do not have s*_j ≿ s*_i, or both s*_j and s*_i are weakly dominated strategies.

For σ_{−i} ∈ Δ(S_{−i}) and s_i ∈ S_i, write U_i(s_i, σ_{−i}) to mean Σ_{s_{−i} ∈ S_{−i}} U_i(s_i, s_{−i}) · σ_{−i}(s_{−i}), and note that s*_i ≿ s*_j if and only if, for every totally mixed correlated strategy σ_{−j} ∈ Δ°(S_{−j}) such that U_j(s*_j, σ_{−j}) ≥ max_{s_j ∈ S_j \ {s*_j}} U_j(s_j, σ_{−j}), we have, for every σ_{−i} ∈ Δ°(S_{−i}) satisfying marg_{−ij}(σ_{−i}) = marg_{−ij}(σ_{−j}),

U_i(s*_i, σ_{−i}) > max_{s_i ∈ S_i \ {s*_i}} U_i(s_i, σ_{−i}).

Proof.
Assume s*_i ≿ s*_j and recall the maintained assumption that the game has no strictly dominated strategy. We show that these assumptions imply that either we do not have s*_j ≿ s*_i, or both s*_j and s*_i are weakly dominated strategies.

Partition the set Δ°(S_{−j}) into three subsets, Π_+ ∪ Π_0 ∪ Π_−, with Π_+ consisting of the σ_{−j} ∈ Δ°(S_{−j}) that make s*_j strictly better than the best alternative pure strategy, Π_0 consisting of the elements of Δ°(S_{−j}) that make s*_j indifferent to the best alternative, and Π_− consisting of the elements that make s*_j strictly worse. (These sets are well-defined because |S_j| ≥ 2, so j has at least one alternative pure strategy to s*_j.) If Π_0 is non-empty, then there is some σ_{−j} ∈ Π_0 such that U_j(s*_j, σ_{−j}) = max_{s_j ∈ S_j \ {s*_j}} U_j(s_j, σ_{−j}). Because s*_i ≿ s*_j, we have U_i(s*_i, σ̂_{−i}) > max_{s_i ∈ S_i \ {s*_i}} U_i(s_i, σ̂_{−i}) for every σ̂_{−i} ∈ Δ°(S_{−i}) such that marg_{−ij}(σ̂_{−i}) = marg_{−ij}(σ_{−j}). So s*_i is weakly optimal against such a σ̂_{−i}, while s*_j is not strictly optimal against the matching σ_{−j}; hence we do not have s*_j ≿ s*_i.

Also, if both Π_+ and Π_− are non-empty, then Π_0 is non-empty. This is because both σ_{−j} ↦ U_j(s*_j, σ_{−j}) and σ_{−j} ↦ max_{s_j ∈ S_j \ {s*_j}} U_j(s_j, σ_{−j}) are continuous functions: if U_j(s*_j, σ'_{−j}) − max_{s_j ∈ S_j \ {s*_j}} U_j(s_j, σ'_{−j}) > 0 and U_j(s*_j, σ''_{−j}) − max_{s_j ∈ S_j \ {s*_j}} U_j(s_j, σ''_{−j}) < 0, then some mixture of σ'_{−j} and σ''_{−j} must belong to Π_0. So we have shown that if either Π_0 is non-empty, or both Π_+ and Π_− are non-empty, then we do not have s*_j ≿ s*_i.

If only Π_+ is non-empty, then s*_j is strictly dominant for j. Together with s*_i ≿ s*_j, this would imply that s*_i is strictly dominant for i, which would make every other strategy of i strictly dominated, contradicting the maintained assumption.

Finally, suppose that only Π_− is non-empty, so that against every σ_{−j} ∈ Δ°(S_{−j}) there exists a strictly better pure response than s*_j. Then there exists a mixed strategy σ_j for j that strictly dominates s*_j against all correlated play in Δ°(S_{−j}). This shows s*_j is strictly dominated for j provided −j play a totally mixed profile; in particular, s*_j is weakly dominated for j. Suppose there is a σ_{−i} ∈ Δ°(S_{−i}) against which s*_i is a weak best response. Then the fact that s*_j is not a strict best response against any σ_{−j} ∈ Δ°(S_{−j}) means we do not have s*_j ≿ s*_i. On the other hand, suppose s*_i is not a weak best response against any σ_{−i} ∈ Δ°(S_{−i}). Then s*_i is weakly dominated, as is s*_j.
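The copying construction in the proof of Proposition 1 above is mechanical enough to check numerically. The sketch below is our own illustration, with two strategies per player and the players (i, j, m, k), where m stands for the third party −ijk, collapsed to tuple positions; it builds ˆˆσ from σ^(k)'s conditional distribution of i's action and from σ^(i), and the test verifies that ˆˆσ keeps σ^(i)'s marginal on the players other than i while matching σ^(k) on the players other than j and k.

```python
import itertools
import random

def marginal(dist, keep):
    """Marginalize a dict-distribution over tuple keys onto index positions `keep`."""
    out = {}
    for key, w in dist.items():
        sub = tuple(key[p] for p in keep)
        out[sub] = out.get(sub, 0.0) + w
    return out

def build_sigma_hathat(sigma_k, sigma_i):
    """sigma_k: distribution over (s_i, s_j, s_m); sigma_i: over (s_j, s_m, s_k),
    assumed to share the (s_j, s_m) marginal. Returns the proof's profile over
    (s_i, s_j, s_m, s_k): the conditional p(s_i | s_j, s_m) from sigma_k, times sigma_i."""
    marg_jm = marginal(sigma_k, (1, 2))
    hathat = {}
    for (sj, sm, sk), w in sigma_i.items():
        for si in (0, 1):
            hathat[(si, sj, sm, sk)] = sigma_k[(si, sj, sm)] / marg_jm[(sj, sm)] * w
    return hathat
```

Because the conditional weights sum to one, ˆˆσ marginalizes back to σ^(i) on the coordinates other than i, and the shared (s_j, s_m) marginal delivers the agreement with σ^(k) on the coordinates other than j and k, exactly as claimed in the proof.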
10 Refinements in the Link-Formation Game
Each of the following refinements selects the same subset of pure Nash equilibria when applied to the anti-monotonic and co-monotonic versions of the link-formation game: extended proper equilibrium, proper equilibrium, trembling-hand perfect equilibrium, $p$-dominance, Pareto efficiency, and strategic stability. Pairwise stability does not apply to the link-formation game. Finally, the link-formation game is not a potential game.

Step 1. Extended proper equilibrium, proper equilibrium, and trembling-hand perfect equilibrium allow the "no links" equilibrium in both versions of the game.
For $(q_i)$ anti-monotonic with $(c_i)$, for each $\varepsilon > 0$ consider the profile where N1 and S1 play Active with probability $\varepsilon^2$ and N2 and S2 play Active with probability $\varepsilon$. For small enough $\varepsilon$, the expected payoff of Active for player $i$ is approximately $(10 - c_i)\varepsilon$, since terms of higher order in $\varepsilon$ are negligible. It is clear that this payoff is negative for small $\varepsilon$ for every player $i$, and that under the utility re-scalings $\beta_{N1} = \beta_{S1} = 10$, $\beta_{N2} = \beta_{S2} = 1$, the loss to playing Active is smaller for N2 and S2 than for N1 and S1. So this strategy profile is a $(\beta, \varepsilon)$-extended proper equilibrium. Taking $\varepsilon \to 0$, we arrive at the equilibrium where each player chooses Inactive with probability 1.
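The first-order approximation $(10 - c_i)\varepsilon$ can be sanity-checked by computing the exact expected payoff of Active under an assumed payoff specification: an Active player earns the quality of each linked partner and pays her own cost once if any link forms, with qualities $30, 10$ and costs $14, 19$ taken from the numbers appearing in Step 6. This is a sketch under those assumptions, not code from the paper.

```python
# Exact expected payoff of Active under an ASSUMED payoff specification:
# earn q_j per link, pay own cost c_i once if at least one link forms.
# Qualities (30, 10) and costs (14, 19) follow the Step 6 numbers.

def active_payoff(c_i, p1, p2, q1=30.0, q2=10.0):
    # Opponents 1 and 2 are Active independently with probs p1, p2.
    both = p1 * p2
    only1 = p1 * (1 - p2)
    only2 = (1 - p1) * p2
    return (both * (q1 + q2 - c_i)
            + only1 * (q1 - c_i)
            + only2 * (q2 - c_i))

eps = 1e-4
# N1 and S1 tremble onto Active with probability eps**2, N2 and S2 with eps,
# so the leading-order event is a link with the quality-10 opponent.
for c in (14.0, 19.0):
    u = active_payoff(c, eps**2, eps)
    assert u < 0                                # Active is a loss
    assert abs(u - (10 - c) * eps) < 50 * eps**2  # (10 - c_i)*eps to first order
```

The check confirms that for both cost levels the payoff of Active is negative and within $O(\varepsilon^2)$ of the stated approximation.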
For the version with $(q_i)$ co-monotonic with $(c_i)$, consider the same strategies without re-scalings, i.e. $\beta = 1$. Then the loss to playing Active is already smaller for N2 and S2 than for N1 and S1, making the strategy profile a $(1, \varepsilon)$-extended proper equilibrium. These arguments show that the "no links" equilibrium is an extended proper equilibrium in both versions of the game. Every extended proper equilibrium is also proper and trembling-hand perfect, which completes the step.

Step 2. $p$-dominance eliminates the "no links" equilibrium in both versions of the game. Regardless of whether $(q_i)$ are co-monotonic or anti-monotonic with $(c_i)$, under the belief that all other players choose Active with probability $p$ for $p \in (0,1)$, the expected payoff of Active (due to additivity across links) is $(1-p) \cdot p \cdot (10 + 30 - c_i) > 0$ for both values of $c_i$.

Step 3. Pareto efficiency eliminates the "no links" equilibrium in both versions of the game.
It is immediate that the no-links equilibrium outcome is Pareto dominated by the all-links equilibrium outcome under both parameter specifications, so Pareto efficiency rules it out whether $(c_i)$ is anti-monotonic or co-monotonic with $(q_i)$.

Step 4. Strategic stability (Kohlberg and Mertens, 1986) eliminates the "no links" equilibrium in both versions of the game. First suppose the $(c_i)$ are anti-monotonic with $(q_i)$. Let $\eta = 1/100$, and for each small $\varepsilon > 0$ let $\varepsilon_{N1}(\text{Active}) = \varepsilon_{S1}(\text{Active}) = 2\varepsilon$, $\varepsilon_{N2}(\text{Active}) = \varepsilon_{S2}(\text{Active}) = \varepsilon$, and $\varepsilon_i(\text{Inactive}) = \varepsilon$ for all players $i$. When each $i$ is constrained to play each $s_i$ with probability at least $\varepsilon_i(s_i)$, the only Nash equilibrium is for each player to choose Active with probability $1 - \varepsilon$. (To see this, consider N2's play in any such equilibrium $\sigma$. If N2 weakly prefers Active, then N1 must strictly prefer it, so $\sigma_{N1}(\text{Active}) = 1 - \varepsilon \geq \sigma_{N2}(\text{Active})$. On the other hand, if N2 strictly prefers Inactive, then $\sigma_{N2}(\text{Active}) = \varepsilon < 2\varepsilon \leq \sigma_{N1}(\text{Active})$. In either case, $\sigma_{N1}(\text{Active}) \geq \sigma_{N2}(\text{Active})$.) Given $\sigma_{N1}(\text{Active}) \geq \sigma_{N2}(\text{Active})$, each South player has Active as her strict best response, so $\sigma_{S1}(\text{Active}) = \sigma_{S2}(\text{Active}) = 1 - \varepsilon$. Against such a profile of South players, each North player has Active as her strict best response, so $\sigma_{N1}(\text{Active}) = \sigma_{N2}(\text{Active}) = 1 - \varepsilon$.

Now suppose the $(c_i)$ are co-monotonic with $(q_i)$. Again let $\eta = 1/100$, and for each small $\varepsilon > 0$ let $\varepsilon_{N1}(\text{Active}) = \varepsilon_{S1}(\text{Active}) = \varepsilon$, $\varepsilon_{N2}(\text{Active}) = \varepsilon/1000$, $\varepsilon_{S2}(\text{Active}) = \varepsilon$, and $\varepsilon_i(\text{Inactive}) = \varepsilon$ for all players $i$. Suppose by way of contradiction that there is a Nash equilibrium $\sigma$ of the constrained game which is $\eta$-close to the Inactive equilibrium. In such an equilibrium, N2 must strictly prefer Inactive, for otherwise N1 strictly prefers Active and $\sigma$ could not be $\eta$-close to the Inactive equilibrium. A similar argument shows that S2 must strictly prefer Inactive. So N2 and S2 must play Active with the minimum possible probability, that is, $\sigma_{N2}(\text{Active}) = \varepsilon/1000$ and $\sigma_{S2}(\text{Active}) = \varepsilon$. This implies that, even if $\sigma_{N1}(\text{Active})$ were at its minimum possible level of $\varepsilon$, S1 would still strictly prefer playing Inactive, because S1 is 1000 times as likely to link with the low-quality opponent as with the high-quality opponent. This shows $\sigma_{S1}(\text{Active}) = \varepsilon$. But when $\sigma_{S1}(\text{Active}) = \sigma_{S2}(\text{Active}) = \varepsilon$, N1 strictly prefers Active, so $\sigma_{N1}(\text{Active}) = 1 - \varepsilon$. This contradicts $\sigma$ being $\eta$-close to the no-links equilibrium.

Step 5. Pairwise stability (Jackson and Wolinsky, 1996) does not apply to this game. This is because each player chooses between either linking with every player on the opposite side who plays Active, or linking with no one. A player cannot selectively cut off one of her links while preserving the other.
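The key cross-player implication in Step 4's anti-monotonic argument (if N2 weakly prefers Active, then N1 strictly prefers it) can be spot-checked numerically. The additive per-link net gain used below (partner's quality minus one's own cost) is an assumed specification for illustration, with qualities $(30, 10)$ and costs $(14, 19)$ as in the Step 6 numbers.

```python
# Spot-check of Step 4's anti-monotonic implication under an ASSUMED
# additive per-link specification: a North player's gain from Active is
# sigma_S1 * (q1 - c_i) + sigma_S2 * (q2 - c_i), with q = (30, 10) and
# costs c_N1 = 14 (more compatible) and c_N2 = 19 (less compatible).

import random

def net_gain(c_i, sigma_s1, sigma_s2, q1=30.0, q2=10.0):
    # Expected gain from Active over Inactive, additive across links.
    return sigma_s1 * (q1 - c_i) + sigma_s2 * (q2 - c_i)

random.seed(0)
for _ in range(10_000):
    s1 = random.uniform(1e-6, 1.0)   # sigma_S1(Active)
    s2 = random.uniform(1e-6, 1.0)   # sigma_S2(Active)
    if net_gain(19.0, s1, s2) >= 0:
        # N2 (cost 19) weakly prefers Active, so N1 (cost 14) must
        # strictly prefer it: N1's gain exceeds N2's by 5*(s1 + s2) > 0.
        assert net_gain(14.0, s1, s2) > 0
```

The assertion never fires because the two players face the same South opponents and differ only in cost, which is exactly the player-compatibility logic the step exploits.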
Step 6. The game does not have an ordinal potential, so refinements of potential games (Monderer and Shapley, 1996) do not apply. To see that this is not a potential game, consider the anti-monotonic parametrization. Suppose a potential $P$ of the form $P(a_{N1}, a_{N2}, a_{S1}, a_{S2})$ exists, where $a_i = 1$ corresponds to $i$ choosing Active and $a_i = 0$ corresponds to $i$ choosing Inactive. We must have $P(0,0,0,0) = P(1,0,0,0) = P(0,1,0,0) = P(0,0,1,0) = P(0,0,0,1)$, since a unilateral deviation by one player from the Inactive equilibrium does not change any player's payoffs. But notice that $u_{N1}(1,0,0,1) - u_{N1}(0,0,0,1) = 10 - 14 = -4$, while $u_{S2}(1,0,0,1) - u_{S2}(1,0,0,0) = 30 - 19 = 11$. If the game has an ordinal potential, then the first of these expressions must have the same sign as $P(1,0,0,1) - P(0,0,0,1)$ and the second the same sign as $P(1,0,0,1) - P(1,0,0,0)$. But $P(0,0,0,1) = P(0,0,0,0) = P(1,0,0,0)$, so $P(1,0,0,1) - P(0,0,0,1) = P(1,0,0,1) - P(1,0,0,0)$, and this common difference cannot be both strictly negative and strictly positive, a contradiction.
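Step 6's two payoff differences can be reproduced under an assumed payoff specification (an Active player earns the quality of each linked partner and pays her own cost once if any link forms); the specification and the quality/cost assignment below are assumptions chosen to match the numbers $10 - 14$ and $30 - 19$ in the text, not code from the paper.

```python
# Reproduce Step 6's two deviation gains under an ASSUMED specification:
# an Active player earns Q[j] per link and pays C[i] once if any link
# forms. Anti-monotonic values: quality 30 / cost 14 for the 1-players,
# quality 10 / cost 19 for the 2-players.

Q = {"N1": 30, "N2": 10, "S1": 30, "S2": 10}
C = {"N1": 14, "N2": 19, "S1": 14, "S2": 19}

def payoff(i, profile):
    # profile maps each player to 0 (Inactive) or 1 (Active).
    if profile[i] == 0:
        return 0
    others = ["S1", "S2"] if i.startswith("N") else ["N1", "N2"]
    links = [j for j in others if profile[j] == 1]
    if not links:
        return 0        # no links formed: deviating alone is costless
    return sum(Q[j] for j in links) - C[i]

base = {"N1": 0, "N2": 0, "S1": 0, "S2": 1}
dev = dict(base, N1=1)                          # N1 deviates to Active
d_N1 = payoff("N1", dev) - payoff("N1", base)   # 10 - 14 = -4

base2 = {"N1": 1, "N2": 0, "S1": 0, "S2": 0}
dev2 = dict(base2, S2=1)                        # S2 deviates to Active
d_S2 = payoff("S2", dev2) - payoff("S2", base2)  # 30 - 19 = 11

# An ordinal potential would force both differences to share the sign of
# P(1,0,0,1) - P(0,0,0,0), impossible since d_N1 < 0 < d_S2.
```

Under this specification a lone deviation from all-Inactive is payoff-neutral for everyone, matching the equalities among the potential values used in the proof.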