Overcoming Free-Riding in Bandit Games
Johannes Hörner†  Nicolas Klein‡  Sven Rady§

This version: October 22, 2019

∗ This paper supersedes our earlier paper “Strongly Symmetric Equilibria in Bandit Games” (circulated in 2014 as Cowles Discussion Paper No. 1956 and SFB/TR 15 Discussion Paper No. 469), which considered pure Poisson learning only. Thanks for comments and suggestions are owed to seminar participants at Aalto University Helsinki, Austin, Berlin, Bonn, City University of Hong Kong, Collegio Carlo Alberto Turin, Duisburg-Essen, Edinburgh, Exeter, Frankfurt (Goethe University, Frankfurt School of Finance and Management), London (Queen Mary, LSE), Lund, Maastricht, Mannheim, McMaster University, Microsoft Research New England, Montreal, Oxford, Paris (Séminaire Roy, Séminaire Parisien de Théorie des Jeux, Dauphine), Southampton, St. Andrews, Sydney, Toronto, Toulouse, University of Western Ontario, Warwick, Zurich, the 2012 International Conference on Game Theory at Stony Brook, the 2013 North American Summer Meeting of the Econometric Society, the 2013 Annual Meeting of the Society for Economic Dynamics, the 2013 European Meeting of the Econometric Society, the 4th Workshop on Stochastic Methods in Game Theory at Erice, the 2013 Workshop on Advances in Experimentation at Paris II, the 2014 Canadian Economic Theory Conference, the 8th International Conference on Game Theory and Management in St. Petersburg, the SING 10 Conference in Krakow, the 2015 Workshop on Stochastic Methods in Game Theory in Singapore, the 2017 Annual Meeting of the Society for the Advancement of Economic Theory in Faro, and the 2019 Annual Conference of the Royal Economic Society. Part of this paper was written during a visit to the Hausdorff Research Institute for Mathematics at the University of Bonn under the auspices of the Trimester Program “Stochastic Dynamics in Economics and Finance”. Financial support from the Cowles Foundation, Deutsche Forschungsgemeinschaft (SFB/TR 15 and SFB/TR 224), the Fonds de Recherche du Québec Société et Culture, and the Social Sciences and Humanities Research Council of Canada is gratefully acknowledged.

† Yale University, 30 Hillhouse Ave., New Haven, CT 06520, USA, and TSE (CNRS), and CEPR, [email protected].
‡ Université de Montréal, Département de Sciences Économiques, C.P. 6128 succursale Centre-ville, Montréal, H3C 3J7, Canada, and CIREQ, [email protected].
§ University of Bonn, Adenauerallee 24-42, D-53113 Bonn, Germany, and CEPR, [email protected].

Abstract

This paper considers a class of experimentation games with Lévy bandits encompassing those of Bolton and Harris (1999) and Keller, Rady and Cripps (2005). Its main result is that efficient (perfect Bayesian) equilibria exist whenever players’ payoffs have a diffusion component. Hence, the trade-offs emphasized in the literature do not rely on the intrinsic nature of bandit models but on the commonly adopted solution concept (MPE). This is not an artifact of continuous time: we prove that such equilibria arise as limits of equilibria in the discrete-time game. Furthermore, it suffices to relax the solution concept to strongly symmetric equilibrium.
Keywords: Two-Armed Bandit, Bayesian Learning, Strategic Experimentation, Strongly Symmetric Equilibrium.

JEL Classification Numbers: C73, D83.

Introduction
Bandit models involve trade-offs. The exploration vs. exploitation dilemma of the classic multi-armed bandit problem coined by Thompson (1933) and Robbins (1952) has been supplanted by free-riding vs. encouragement effects in a strategic context (Bolton and Harris, 1999). These two effects might be essential to our economic intuition, but the trade-off only arises because of the solution concept, as we show in this paper.

First, we provide some context. Typically, bandit games are modeled in continuous time, with Markov (perfect) equilibrium as the solution concept. This is a sensible choice. Continuous time provides elegant characterizations and even closed-form solutions. Markov equilibrium is the obvious counterpart to the criterion used in operations research, enabling meaningful comparisons with the team solution. It is also dictated by continuous time because standard game-theoretic notions raise conceptual problems (Simon and Stinchcombe, 1989). However, there is a price to pay. Markov equilibrium excludes rewards and punishments, the cornerstones of dynamic games. What Markov taketh away, continuous time giveth: equilibria arise with no discrete-time equivalent.

We show that the main equilibrium prediction, namely inefficiently low experimentation, (mostly) disappears once the model is cast in the framework traditionally used in dynamic games. Relaxing Markov equilibrium and pruning out artifacts of continuous time requires discretizing the game. We then let the time interval between successive actions vanish to obtain a similarly clean characterization and a proper comparison with the literature. Asymptotically, efficiency obtains, as long as payoffs involve an informative diffusion (as opposed to a pure jump) component.

What efficient experimentation entails depends on the players’ patience. Hence, this
Hence, thisis not a folk theorem, which would not apply in any case: beliefs are not reversible.For example, players would never be convinced that the risky arm is good if it is not.In addition, efficiency does not always hold: with pure jumps, this depends on howgood news impacts the common belief.Efficiency only obtains if a selfish, lone player would not experiment given anyresulting posterior belief, had he been on the brink of stopping as a selfless team playergiven the prior belief. Intuitively, having all other players stop experimenting is theworst punishment a deviating player can face.This punishment is needed but not a given with pure jumps. Indeed, it fails withconclusive good news. However, when there is a diffusion component, the predominant For instance, the “infinite switching equilibria” of Keller et al. (2005). This is because there is nolast time before a given date in continuous time. A few caveats are in order.First, showing that the free-rider and encouragement effects do not determine theoutcome of bandit games is akin to noting that efficiency is achievable in the repeatedprisoner’s dilemma even if defection is dominant in the stage game: it would notoccur to us to demonstrate how cooperation arises in the repeated version without firstremarking that free-riding is the problem we are addressing. In stochastic games suchas bandits, discerning the underlying incentives is difficult and subtle: solving for theMarkov equilibria is the advisable approach. Our point is that we must disentanglethese incentives from the possible equilibrium outcomes.Second, we have emphasized the importance of studying the discrete-time gameto factor out equilibria that are continuous-time quirks. However, our results areasymptotic to the extent that they only hold when the time interval between rounds issmall enough. There is no difference between an arbitrarily small uptick vs. a discretejump when the interval length is bounded away from zero. 
Our results rely heavily on what is known about the continuous-time limits and hence on the analyses of Bolton and Harris (1999) and Keller et al. (2005), among others. To the extent that some of our proofs are involved, it is because they require careful comparison and convergence arguments.

Third, because we rely on discrete time, we must settle on a particular discretization. We consider our choice to be natural: players may revise their action choices at equally spaced time opportunities, while payoffs and information accrue in continuous time, independent of the duration of the intervals. That is, ours is the simplest version of inertia strategies, as introduced by Bergin and MacLeod (1993). Other discretization choices may lead to different predictions.

Fourth, our results do not cover all bandit games. Because we build on existing results of the single-agent case, we cannot go beyond the framework used for them. In particular, we must make restrictions similar to Cohen and Solan (2013) in their analysis of the continuous-time bandit problem. In fact, our assumptions are stronger than theirs. Our main restriction, just as theirs, is that bad-news jumps are not permitted, which means that our framework does not subsume Keller and Rady (2015), in particular.

Our paper belongs to the growing literature on strategic bandits. We have already discussed the standard references in that literature. There is no need to review the large and growing literature on extensions, variations and applications. With few exceptions, these papers model the game in continuous time and focus on MPEs unless actions on at least one side are not observed (meaning that applying standard game-theoretic solution concepts raises no difficulty).

Second, our paper contributes to the literature on SSE. We hope that it illustrates how SSE can be usefully applied to games usually cast in continuous time, such as bandit games.

(One appealing property of SSEs is that payoffs can be studied via a coupled pair of functional equations that extends the functional equation characterizing MPE payoffs; see Proposition 11.)

SSEs have been studied in repeated games since Abreu (1986). They are known to be restrictive. First, they make no sense if the model itself fails to be symmetric. However, as Abreu (1986) notes for repeated games, they are (i) easily calculated, being completely characterized by two simultaneous scalar equations; (ii) more general than static Nash, or even Nash reversion; and even (iii) without loss in terms of total welfare, at least in some cases, as in ours. See also Abreu, Pearce and Stacchetti (1986) for the optimality of symmetric equilibria within a standard oligopoly framework and Abreu, Pearce and Stacchetti (1993) for a motivation of the solution concept based on a notion of equal bargaining power.
Cronshaw and Luenberger (1994) conduct a more general analysis for repeated games with perfect monitoring, showing how the set of SSE payoffs can be obtained by solving for the largest scalar solving a certain equation. Hence, our paper shows that Properties (i)–(iii) extend to bandit games, with “Markov perfect” replacing “Nash” in statement (ii) and “functional” replacing “scalar” in (i): as mentioned above, a pair of functional equations replaces the usual Hamilton-Jacobi-Bellman (HJB) (or Isaacs) equation from optimal control.

(We do not a priori perceive a difficulty in adopting theirs, but we also do not perceive any benefits.)

(The technical difficulty with bad-news jumps is that the value functions cannot be described explicitly. They are rather defined recursively, with the functional form depending on the number of bad-news events triggering an end to all experimentation. Because of this complication, we leave the analysis of this case to future work.)
The Model

Time t ∈ [0, ∞) is continuous. There are N ≥ 2 players, each of whom operates a replica of the same two-armed bandit. One arm is safe and yields a known flow payoff s > 0; the other arm is risky, with payoffs that depend on an unknown state of the world θ ∈ {0, 1}, which nature draws at the outset with P[θ = 1] = p. Players do not observe θ, but they know p. They also understand that the evolution of the risky payoffs depends on θ. Specifically, the payoff process X^n associated with player n’s risky arm evolves according to

dX^n_t = α_θ dt + σ dZ^n_t + h dN^n_t,

where Z^n is a standard Wiener process, N^n is a Poisson process with intensity λ_θ, and the scalar parameters α_0, α_1, σ, h, λ_0, λ_1 are known to all players. Conditional on θ, the processes Z^1, …, Z^N, N^1, …, N^N are independent. As Z^n and N^n_t − λ_θ t are martingales, the expected payoff increment from using the risky arm over an interval of time [t, t + dt) is m_θ dt with m_θ = α_θ + λ_θ h.

Players share a common discount rate r >
0. We write k_{n,t} = 0 if player n uses the safe arm at time t and k_{n,t} = 1 if the player uses the risky arm at time t. Given actions (k_{n,t})_{t≥0} such that k_{n,t} ∈ {0, 1} is measurable with respect to the information available at time t, player n’s total expected discounted payoff, expressed in per-period units, is

E[ ∫_0^∞ r e^{−rt} [(1 − k_{n,t}) s + k_{n,t} m_θ] dt ],

where the expectation is over both the random variable θ and the stochastic process (k_{n,t}).

(Bolton and Harris (1999), Keller et al. (2005) and Keller and Rady (2010) allow the players to allocate one unit of a perfectly divisible resource freely across the two arms at each point in time, so the fraction allocated to the risky arm can be k_{n,t} ∈ [0, 1].)

We make the following assumptions: (i) m_0 < s < m_1, so each player prefers the risky arm to the safe arm in state θ = 1 and prefers the safe arm to the risky arm in state θ = 0; (ii) σ > 0 and h > 0, so the Brownian payoff component is always present and jumps of the Poisson component entail positive lump-sum payoffs; (iii) λ_1 ≥ λ_0 ≥
0, so jumps are at least as frequent in state θ = 1 as in state θ = 0.

Players begin with a common prior belief about θ, given by the probability p with which nature draws state θ = 1. Thereafter, they learn about this state in a Bayesian fashion by observing one another’s actions and payoffs; in particular, they hold common posterior beliefs throughout time. A detailed description of the evolution of beliefs is presented in Appendix A.1. When λ_1 = λ_0 (and hence α_1 > α_0), the arrival of a lump-sum payoff contains no information about the state of the world, and our setup is equivalent to that in Bolton and Harris (1999), with the learning being driven entirely by the Brownian payoff component. When α_1 = α_0 (and hence λ_1 > λ_0), the Brownian payoff component contains no information, and our setup is equivalent to that in Keller et al. (2005) or Keller and Rady (2010), depending on whether λ_0 = 0 or λ_0 >
0, with the learning being driven entirely by the arrival of lump-sum payoffs.

(Note that we have not yet defined the set of strategies available to each player and hence are silent at this point on how the players’ strategy profile actually induces a stochastic process of actions (k_{n,t})_{t≥0} for each of them. We will close this gap in two different ways in Sections 3 and 4: by imposing Markov perfection in the former and a discrete time grid of revision opportunities in the latter.)

(This rules out “breakdowns” as in Keller and Rady (2015).)

(Keller et al. (2005) and Keller and Rady (2010) consider compound Poisson processes where the distribution of lump-sum payoffs (and their mean h) at the time of a Poisson jump is independent of, and hence uninformative about, the state of the world. By contrast, Cohen and Solan (2013) allow for Lévy processes where the size of lump-sum payoffs contains information about the state, but a lump sum of any given size arrives weakly more frequently in state θ = 1.)

The authors cited in the previous paragraph assume that players use continuous-time Markov strategies with the posterior belief as the state variable, so that k_{n,t} is a time-invariant function of the probability p_t assigned to state θ = 1 at time t. In this section, we show how some of their main insights generalize to the present setting. First, we present the efficient benchmark. Second, we show that efficient behavior cannot be sustained as an MPE.

Consider a planner who maximizes the average of the players’ expected payoffs in continuous time by selecting an entire action profile (k_{1,t}, …, k_{N,t}) at each time t. The corresponding average expected payoff increment is

[(1 − K_t/N) s + (K_t/N) m_θ] dt with K_t = Σ_{n=1}^N k_{n,t}.
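The identity m_θ = α_θ + λ_θ h can be checked by simulating the risky-arm payoff process at a fixed state. The sketch below uses hypothetical parameter values chosen purely for illustration (they are not taken from the paper):

```python
import numpy as np

# Hypothetical parameter values (illustrative only): state theta = 1.
alpha, sigma, h, lam = 0.5, 1.0, 1.0, 0.5   # alpha_1, sigma, h, lambda_1
m = alpha + lam * h                          # expected flow payoff m_1

rng = np.random.default_rng(0)
dt, n = 0.01, 200_000

# Increments of dX_t = alpha dt + sigma dZ_t + h dN_t over steps of length dt.
dZ = rng.normal(0.0, np.sqrt(dt), n)         # Brownian increments
dN = rng.poisson(lam * dt, n)                # Poisson jump counts
dX = alpha * dt + sigma * dZ + h * dN

est_m = dX.mean() / dt                       # sample estimate of m_1
print(est_m, m)
```

The sample mean of the increments, divided by dt, estimates m_θ; the martingale property of Z^n and N^n_t − λ_θ t is exactly why the drift α_θ and the jump compensation λ_θ h are the only surviving terms.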
A straightforward extension of the main results of Cohen and Solan (2013) shows that the evolution of beliefs also depends on K_t only and that the planner’s value function, denoted by V*_N, has the following properties.

First, V*_N is the unique once-continuously differentiable solution of the HJB equation

v(p) = s + max_{K ∈ {0,1,…,N}} K [ b(p, v) − c(p)/N ]

on the open unit interval subject to the boundary conditions v(0) = m_0 and v(1) = m_1. Here,

b(p, v) = (ρ/r) p²(1−p)² v″(p) − ((λ_1 − λ_0)/r) p(1−p) v′(p) + (λ(p)/r) [v(j(p)) − v(p)],

with ρ = (α_1 − α_0)²/(2σ²), can be interpreted as the expected informational benefit of using the risky arm when continuation payoffs are given by a (sufficiently regular) function v. Its first term reflects Brownian learning. Its second term captures the downward drift in the belief when no Poisson lump sum arrives. Its third term expresses the discrete change in the overall payoff once such a lump sum arrives, with the belief jumping up from p to

j(p) = λ_1 p / λ(p), where λ(p) = p λ_1 + (1−p) λ_0.

The function

c(p) = s − m(p)

captures the opportunity cost of playing the risky arm in terms of expected current payoff forgone; here,

m(p) = p m_1 + (1−p) m_0

denotes the risky arm’s expected flow payoff given the belief p.

(In the presence of discrete payoff increments, one actually has to take the left limit p_{t−} as the state variable, owing to the informational constraint that the action chosen at time t cannot depend on the arrival of a lump sum at t. In the following, we simply write p_t with the understanding that the left limit is meant whenever this distinction is relevant. Note that p_{0−} = p_0 by convention. Cf. Appendix A.1. Up to division by r, b is the infinitesimal generator of the process of posterior beliefs for K = 1, applied to the function v; cf. Appendix A.1 for details.)
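The jump target j(p) is just Bayes’ rule applied to the arrival of one lump sum, and the myopic payoff m(p), being linear in p, earns zero informational benefit: the drift and jump terms of b cancel against each other, so b(p, m) = 0. A small numerical check (the intensities and flow payoffs below are hypothetical):

```python
# Hypothetical parameters for illustration.
lam0, lam1, r = 0.5, 2.0, 1.0
m0, m1 = 0.5, 4.0

lam = lambda p: p * lam1 + (1 - p) * lam0   # expected jump intensity
j = lambda p: lam1 * p / lam(p)             # posterior after one jump
m = lambda p: p * m1 + (1 - p) * m0         # expected flow payoff (linear)

p = 0.3
# j(p) coincides with Bayes' rule P(theta = 1 | jump):
bayes = p * lam1 / (p * lam1 + (1 - p) * lam0)
print(j(p), bayes)

# For the linear function m, the drift and jump terms of b cancel
# (m'' = 0, so the Brownian term vanishes as well):
drift = -(lam1 - lam0) / r * p * (1 - p) * (m1 - m0)   # m'(p) = m1 - m0
jump  = lam(p) / r * (m(j(p)) - m(p))
print(drift + jump)   # zero up to rounding
```

This cancellation is why only continuation values v that bend away from the myopic payoff generate a positive informational benefit b(p, v).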
Thus, the planner weighs the shared opportunity cost of each experiment on the risky arm against the learning benefit, which accrues fully to each agent because of the perfect informational spillover.

Second, there exists a cutoff p*_N such that all agents using the safe arm (K = 0) is optimal for the planner when p ≤ p*_N, and all agents using the risky arm (K = N) is optimal when p > p*_N. This cutoff is given by

p*_N = μ_N (s − m_0) / [ (μ_N + 1)(m_1 − s) + μ_N (s − m_0) ],

where μ_N is the unique positive solution of the equation

ρ μ(μ + 1) + (λ_1 − λ_0) μ + λ_0 [ (λ_0/λ_1)^μ − 1 ] − r/N = 0.

Both μ_N and p*_N increase in r/N. Thus, the interval of beliefs for which all agents using the risky arm is efficient widens with the number of agents and their patience.

Third, the value function satisfies V*_N(p) = s for p ≤ p*_N, and

V*_N(p) = m(p) + [ c(p*_N) / u(p*_N; μ_N) ] u(p; μ_N) > s,   (1)

for p > p*_N, where

u(p; μ) = (1 − p) ((1 − p)/p)^μ

is strictly decreasing and strictly convex for μ > 0. The function V*_N is strictly increasing and strictly convex on [p*_N, 1]. For N = 1, one obtains the single-agent value function V*_1 and corresponding cutoff p*_1 > p*_N.

Now consider N ≥ 2 players. The Bellman equation characterizing the best response of player n when he or she faces opponents who use Markov strategies is given by

v_n(p) = s + K_{¬n}(p) b(p, v_n) + max_{k_n ∈ {0,1}} k_n [ b(p, v_n) − c(p) ],

where K_{¬n}(p) is the number of n’s opponents that use the risky arm. That is, when playing a best response, each player weighs the opportunity cost of playing risky against his or her own informational benefit only. Consequently, V*_N does not solve the above HJB equation when player n’s opponents use the efficient strategy. Efficient behavior therefore cannot be sustained in MPE.

Henceforth, we restrict players to changing their actions only at the times t = 0, Δ, 2Δ, … for some fixed Δ >
0. This yields a discrete-time game evolving in a continuous-time framework; in particular, the payoff processes are observed continuously. Moreover, we allow for non-Markovian strategies.

The expected discounted payoff increment from using the safe arm for the length of time Δ is ∫_0^Δ r e^{−rt} s dt = (1 − δ) s with δ = e^{−rΔ}. Conditional on θ, the expected discounted payoff increment from using the risky arm is ∫_0^Δ r e^{−rt} m_θ dt = (1 − δ) m_θ. Given the probability p assigned to θ = 1, the expected discounted payoff increment from the risky arm conditional on all available information is (1 − δ) m(p).

A history of length t = Δ, 2Δ, … is a sequence

h_t = ( (k_{n,0}, Ỹ^n_{[0,Δ)})_{n=1}^N, (k_{n,Δ}, Ỹ^n_{[Δ,2Δ)})_{n=1}^N, …, (k_{n,t−Δ}, Ỹ^n_{[t−Δ,t)})_{n=1}^N ),

where k_{n,ℓΔ} = 1 if player n uses the risky arm on the time interval [ℓΔ, (ℓ+1)Δ);

(While arguably natural, our discretization remains nonetheless ad hoc, and other discretizations might yield other results. Not only is it well known that the limits of discrete-time models might differ from the continuous-time solutions, but the particular discrete structure might also matter; see, among others, Müller (2000), Fudenberg and Levine (2009), Hörner and Samuelson (2013), and Sadzik and Stacchetti (2015). In Hörner and Samuelson (2013), for instance, there are multiple solutions to the optimality equations, corresponding to different boundary conditions, and to select among them, it is necessary to investigate in detail the discrete-time game (see their Lemma 3). However, the role of the discretization goes well beyond selecting the “right” boundary condition; see Sadzik and Stacchetti (2015).)
k_{n,ℓΔ} = 0 if player n uses the safe arm on this interval; Ỹ^n_{[ℓΔ,(ℓ+1)Δ)} is the observed sample path of the payoff process associated with player n’s risky arm on the interval [ℓΔ, (ℓ+1)Δ) if k_{n,ℓΔ} = 1; and Ỹ^n_{[ℓΔ,(ℓ+1)Δ)} equals the empty set if k_{n,ℓΔ} = 0. We write H_t for the set of all histories of length t, set H_0 = {∅}, and let H = ∪_{t=0,Δ,2Δ,…} H_t. In addition, we assume that players have access to a public randomization device in every period, namely, a draw from the uniform distribution on [0, 1], independent of θ and across periods. Following standard practice, we omit its realizations from the description of histories.

A behavioral strategy σ_n for player n is a sequence (σ_{n,t})_{t=0,Δ,2Δ,…}, where σ_{n,t} is a measurable map from H_t to the set of probability distributions on {0, 1}; a pure strategy takes values in the set of degenerate distributions only.

Along with the prior probability p assigned to θ = 1, each profile of strategies induces a distribution over H. Given his or her opponents’ strategies σ_{−n}, player n seeks to maximize

(1 − δ) E_{σ_{−n},σ_n} [ Σ_{ℓ=0}^∞ δ^ℓ { [1 − σ_{n,ℓΔ}(h_{ℓΔ})] s + σ_{n,ℓΔ}(h_{ℓΔ}) m_θ } ].

By the law of iterated expectations, this equals

(1 − δ) E_{σ_{−n},σ_n} [ Σ_{ℓ=0}^∞ δ^ℓ { [1 − σ_{n,ℓΔ}(h_{ℓΔ})] s + σ_{n,ℓΔ}(h_{ℓΔ}) m(p_{ℓΔ}) } ].

Nash equilibrium, PBE and MPE, with actions after history h_t depending only on the associated posterior belief p_t, are defined in the usual way. Imposing the standard “no signaling what you don’t know” refinement, beliefs are pinned down after all histories, on and off path.

An SSE is a PBE in which all players use the same strategy: σ_n(h_t) = σ_{n′}(h_t) for all n, n′ and h_t ∈ H. This implies symmetry of behavior after any history, not just on the equilibrium path of play. By definition, any symmetric MPE is an SSE, and any SSE is a PBE.
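The per-period normalization used above can be verified directly: ∫_0^Δ r e^{−rt} dt = 1 − δ with δ = e^{−rΔ}, and the weights (1 − δ) δ^ℓ sum to one, so a constant flow s is worth exactly s in per-period units. A quick numerical check with arbitrary illustrative values of r and Δ:

```python
import math

r, Delta = 1.0, 0.25          # arbitrary illustrative values
delta = math.exp(-r * Delta)

# Midpoint-rule quadrature of  integral_0^Delta  r e^{-rt} dt.
n = 10_000
dt = Delta / n
integral = sum(r * math.exp(-r * (i + 0.5) * dt) * dt for i in range(n))
print(integral, 1 - delta)    # the two agree up to quadrature error

# The normalized weights (1 - delta) delta^l sum to one:
total = (1 - delta) * sum(delta**l for l in range(10_000))
print(total)
```

The same normalization is what makes the discrete-time payoff directly comparable to the continuous-time per-period payoffs of the previous section.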
(While we could equivalently define this Bayesian game as a stochastic game with the common posterior belief as a state variable, no characterization or folk theorem applies to our setup, as the Markov chain over consecutive states does not satisfy the sufficient ergodicity assumptions; see Dutta (1995) and Hörner, Sugaya, Takahashi and Vieille (2011).)

Main Results
Fix ∆ >
0. For p ∈ [0, 1], let W̄^Δ_PBE(p) and W̲^Δ_PBE(p) denote the supremum and infimum, respectively, of the set of average payoffs (per player) over all PBE, given prior belief p. Let W̄^Δ_SSE(p) and W̲^Δ_SSE(p) be the corresponding supremum and infimum over all SSE. If such equilibria exist,

W̄^Δ_PBE(p) ≥ W̄^Δ_SSE(p) ≥ W̲^Δ_SSE(p) ≥ W̲^Δ_PBE(p).   (2)

Given that we assume a public randomization device, these upper and lower bounds define the corresponding equilibrium average payoff sets.

As any player can choose to ignore the information contained in the other players’ experimentation results, the value function W^Δ_1 of a single agent experimenting in isolation constitutes a lower bound on a player’s payoff in any PBE. Lemma A.2 establishes that this lower bound converges to V*_1 as Δ → 0. Hence, we obtain a lower bound to the limits of all terms in (2), namely lim inf_{Δ→0} W̲^Δ_PBE ≥ V*_1.

An upper bound is also easily found. As any discrete-time strategy profile is feasible for the continuous-time planner from the previous section, it holds that W̄^Δ_PBE ≤ V*_N.

The main theorem provides an exact characterization of the limits of all four functions. It requires introducing a new family of payoffs. Namely, we define the players’ common payoff in continuous time when they all use the risky arm if, and only if, the belief exceeds a given threshold p̂. This function admits a closed form that generalizes the first-best payoff V*_N (cf. (1)). It is equal to

V_{N,p̂}(p) = m(p) + [ c(p̂) / u(p̂; μ_N) ] u(p; μ_N)

for p > p̂, and V_{N,p̂}(p) = s otherwise.

Theorem 1 (i)
There exists p̂ ∈ [p*_N, p*_1] such that

lim_{Δ→0} W̄^Δ_PBE = lim_{Δ→0} W̄^Δ_SSE = V_{N,p̂} and lim_{Δ→0} W̲^Δ_PBE = lim_{Δ→0} W̲^Δ_SSE = V*_1,

uniformly on [0, 1]. (The function V_{N,p̂} is continuous, strictly increasing and strictly convex on [p̂, 1]. For p̂ = p*_N, V_{N,p̂} coincides with the cooperative value function V*_N. For p̂ > p*_N, we have V_{N,p̂} < V*_N on (p*_N, 1).)

(ii) If ρ > 0, then p̂ = p*_N (and hence V_{N,p̂} = V*_N).

(iii) If ρ = 0, then p̂ is the unique belief in [p*_N, p*_1] satisfying

N λ(p̂) [V_{N,p̂}(j(p̂)) − s] − (N − 1) λ(p̂) [V*_1(j(p̂)) − s] = r c(p̂);   (3)

moreover, p̂ = p*_N if, and only if, j(p*_N) ≤ p*_1, and p̂ = p*_1 if, and only if, λ_0 = 0.

To understand this result, let us begin with SSEs and the characterization of the cutoff p̂ in the last item, when learning is entirely driven by the jump process. The players’ temptation to deviate to the safe arm is strongest when the belief is so low that, absent good news, the belief drops into the region where safe prevails in any SSE, whether a single player has deviated or not. The cost of such a deviation, captured by the left-hand side of (3), thus arises only if good news arrives. Starting out from p̂, in expectation, this happens at the rate N λ(p̂) if no player deviates; a deviation reduces this rate to (N − 1) λ(p̂). Without a deviation, a player’s continuation payoff then amounts at most to the cooperative payoff given that the use of the risky arm is disallowed below p̂; in the event of a deviation, it is at least the single-player payoff (both evaluated at the revised belief j(p̂) and net of the value of the safe arm). The right-hand side of (3) represents the benefit of a deviation, that is, the saved opportunity cost of playing risky.
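Equation (3) can be located numerically in the pure-jump case. The sketch below is illustrative only: the parameters are hypothetical, and it additionally assumes α_0 = α_1 = 0 (so that m_θ = λ_θ h), using the closed forms for μ_N, p*_N, V_{N,p̂} and V*_1 given above, with p̂ found by grid search on [p*_N, p*_1]:

```python
# Hypothetical parameters with m0 < s < m1 (pure-jump case, rho = 0).
r, s, h, lam0, lam1, N = 1.0, 1.0, 2.0, 0.25, 2.0, 3
m0, m1 = lam0 * h, lam1 * h

lam = lambda p: p * lam1 + (1 - p) * lam0
j = lambda p: lam1 * p / lam(p)
m = lambda p: p * m1 + (1 - p) * m0
c = lambda p: s - m(p)
u = lambda p, mu: (1 - p) * ((1 - p) / p) ** mu

def mu_of(n):
    # Unique positive root of (lam1-lam0) mu + lam0((lam0/lam1)^mu - 1) = r/n.
    f = lambda mu: (lam1 - lam0) * mu + lam0 * ((lam0 / lam1) ** mu - 1) - r / n
    lo, hi = 1e-9, 1.0
    while f(hi) < 0:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return (lo + hi) / 2

def p_star(n):
    mu = mu_of(n)
    return mu * (s - m0) / ((mu + 1) * (m1 - s) + mu * (s - m0))

mu_N, mu_1 = mu_of(N), mu_of(1)
pN, p1 = p_star(N), p_star(1)

def V(p, cutoff, mu):          # payoff with all-risky play above the cutoff
    return s if p <= cutoff else m(p) + c(cutoff) / u(cutoff, mu) * u(p, mu)

def g(ph):                     # left-hand side minus right-hand side of (3)
    lhs = N * lam(ph) * (V(j(ph), ph, mu_N) - s) \
        - (N - 1) * lam(ph) * (V(j(ph), p1, mu_1) - s)
    return lhs - r * c(ph)

grid = [pN + i * (p1 - pN) / 4000 for i in range(1, 4000)]
p_hat = min(grid, key=lambda ph: abs(g(ph)))
print(pN, p_hat, p1)
```

For these (hypothetical) parameters the jump from p*_N overshoots the single-agent cutoff, so the root of g lies strictly between p*_N and p*_1, mirroring the configuration of Figure 1.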
The cutoff belief p̂ thus solves the familiar trade-off between the benefit from deviating and the cost of the worst punishment that may follow the deviation.

When λ_0 = 0, the arrival of good news freezes the belief at 1, and the resulting cooperative and single-player payoffs both equal λ_1 h. Starting out from p̂, therefore, a player’s continuation payoffs coincide with those of a single agent in all circumstances, so that it is impossible to sustain experimentation below the single-agent cutoff. Hence, p̂ = p*_1.

If the second term on the left-hand side of (3) were zero, that is, if j(p*_N) ≤ p*_1, so that a player left to his or her own devices would stop experimenting at the revised belief after the arrival of good news, and hence obtain a zero payoff (net of the value of the safe arm), the solution to this equation would be the first-best cutoff p*_N. To see this, note that the first term on the left-hand side can equivalently be interpreted as the social value of experimentation by a single player. Indeed, a player contributes to the arrival of news at rate λ(p̂), but all N players then reap the gain V_{N,p̂}(j(p̂)) − s. The right-hand side is the cost of such experimentation. Hence, p̂ = p*_N follows immediately from the equation.

The same logic immediately implies that first-best efficiency obtains when ρ > 0.

First-best efficiency not only depends on the cutoff but also requires play to be exclusively risky at all higher beliefs. Hence, the best equilibrium must involve a pure strategy, at least asymptotically. This is not straightforward. Indeed, symmetric pure-strategy PBE fail to exist with conclusive good news (ρ = λ_0 = 0) in discrete time. If all others play risky for certain, the posterior belief also declines for certain, unless good news arrives. If players randomized, there would be the added opportunity to punish if the posterior belief remained the same.
When good news is conclusive, our proof relies on the existence of two symmetric mixed-strategy equilibria for beliefs close to the cutoff. It is then possible to choose continuation play as a function of history to incentivize players to experiment at beliefs that are sufficiently many rounds away from the cutoff (a negligible difference in beliefs once the time interval is small enough). Matters are simpler when news is inconclusive or a diffusion term is present.

Turning to point (i) of the theorem, there is no difference between the set of SSE and PBE payoffs, at least on average across players. This is shown in Sections 6.1–6.2. Regarding the highest equilibrium payoff, this may seem plausible (though not obvious) because efficiency requires symmetric play. Regarding the lowest equilibrium payoff, either playing safe forever is an equilibrium of the game given the current belief, or best-responding to being minmaxed provides a higher payoff to the punished player than also playing the minmaxing action (using the safe arm). In the latter case, one can incentivize the punished player to play safe by promising that all players will revert to risky (cooperative) play at a later time, thereby compensating the punished player for the flow payoff deficit that playing safe involves in the meantime. This eventual reversion also motivates the punishing players to play safe.

Figure 1 shows the cooperative continuous-time payoff V*_N as well as the supremum V_{N,p̂} and infimum V*_1 of the limit average PBE payoffs for a parameter configuration that implies p*_N < p̂ < p*_1.

A more technical intuition can be given in the spirit of smooth pasting in stopping problems for diffusion processes; see Dixit and Pindyck (1994). If all SSE experimentation stopped at a belief p̂ > p*_N, the limiting payoff function V_{N,p̂} would exhibit a convex kink at p̂.
Given the diffusion component of the posterior-belief process, this kink could be used to provide all players incentives to use the risky arm at beliefs slightly below p̂. Indeed, the informational benefit of experimentation in the presence of a kink is of lower order in Δ than its opportunity cost and hence dominates for small Δ.

Figure 1: Payoffs V*_N (solid), V_{N,p̂} (dashed) and V*_1 (dotted) for ρ = 0 and a parameter configuration (r, s, h, λ_0, λ_1, N) implying p*_N < p̂ < p*_1.

The symmetric MPE of the continuous-time game not only involves too small an amount of experimentation but also entails too low a speed of experimentation, as it involves an interior level of experimentation for a range of beliefs.

Proposition 1
For ρ = 0 and λ_0 > 0, the cutoff p̂ is strictly lower than the belief at which all experimentation stops in the symmetric MPE of the continuous-time game.

Turning to comparative statics, when is the first-best achievable with jump processes? The next proposition characterizes the region of intensity pairs (λ_0, λ_1) where asymptotic efficiency obtains. As is intuitive, having more players, or more patience, increases the scope for the first-best.

Proposition 2 Let ρ = 0. Then, j(p*_N) > p*_1 whenever λ_0 ≤ λ_1/N. On any ray in R²₊ emanating from the origin (0, 0) with a slope λ_0/λ_1 strictly between 1/N and 1, there is a unique critical point (λ_0*, λ_1*) at which j(p*_N) = p*_1; moreover, j(p*_N) > p*_1 at all points of the ray that are closer to the origin than (λ_0*, λ_1*), and j(p*_N) < p*_1 at all points that are farther from the origin than (λ_0*, λ_1*). These critical points form a continuous curve that is bounded away from the origin and asymptotes to the ray of slope 1/N. The curve shifts downward as r falls or N rises.

This result is illustrated in Figure 2. Furthermore, in the case of λ_0 >
0, the more players participate in the game, the more experimentation can be sustained. (Recall that for λ₀ = 0, the threshold belief p̂ is independent of N.) Hence, the comparative statics of the best SSE with respect to the number of players mirrors that for symmetric MPE (see Keller and Rady (2010)).

Proposition 3
For ρ = 0 and λ₀ > 0, p̂ is decreasing in N.

It is instructive to consider what happens when the players become arbitrarily impatient or patient. If players are myopic, they do not react to future rewards and punishments. It is therefore no surprise that the cooperative solution cannot be attained in the limit. By contrast, if players are very patient, asymptotic efficiency is achieved if the number of players is large.
Proposition 4
For ρ = 0 and λ₀ > 0,

lim_{r→∞} j(p*_N)/p* = λ₁h/s,  and  lim_{r→0} j(p*_N)/p* = λ₁/(Nλ₀).

The next section is devoted to the construction of SSEs that underlies the proof of Theorem 1. Missing details are provided in the appendix.
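The first of these limits can be sketched directly. This is a heuristic only; it assumes (as in the pure Poisson case ρ = 0) that the expected risky flow payoff is m(p) = (pλ₁ + (1 − p)λ₀)h, that a lump sum moves the belief from p to j(p) = λ₁p/(pλ₁ + (1 − p)λ₀), and that both cutoffs p*_N and p* converge to the myopic cutoff p^m as r → ∞:

```latex
% Myopic limit: p_N^* \to p^m and p^* \to p^m as r \to \infty,
% where the myopic cutoff satisfies m(p^m) = s, i.e.
%   \bigl(p^m \lambda_1 + (1-p^m)\lambda_0\bigr) h = s.
\lim_{r\to\infty}\frac{j(p_N^*)}{p^*}
  \;=\; \frac{j(p^m)}{p^m}
  \;=\; \frac{1}{p^m}\cdot\frac{\lambda_1\, p^m}{p^m\lambda_1+(1-p^m)\lambda_0}
  \;=\; \frac{\lambda_1 h}{s}.
```

Since experimentation is only worthwhile when λ₁h > s, this ratio exceeds one in the myopic limit, consistent with the observation above that the cooperative solution cannot be attained when players do not react to future rewards and punishments.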
We first consider the case of a diffusion component (Section 6.1) and then turn to the case of pure jump processes (Section 6.2).

Figure 2: Asymptotic efficiency is achieved for parameter combinations (λ₀, λ₁) between the diagonal and the curve but not below the curve. The dashed line is the ray of slope 1/N. Parameter values: r = 1, N = 5.

We need the following notation. Let F^∆_K(·|p) denote the cumulative distribution function of the posterior belief p_∆ when p₀ = p and K players use the risky arm on the time interval [0, ∆). For any measurable function w on [0,
1] and p ∈ [0, 1], let

E^∆_K w(p) = ∫ w(p′) F^∆_K(dp′ | p),

whenever this integral exists. Thus, E^∆_K w(p) is the expectation of w(p_∆) given the prior p₀ = p and K experimenting players.

6.1 A Diffusion Component (ρ > 0)

For a sufficiently small ∆ >
0, we specify an SSE that can be summarized by two functions, κ̄ and κ̲, which do not depend on ∆. The equilibrium strategy is characterized by a two-state automaton. In the "good" state, play proceeds according to κ̄, and the equilibrium payoff satisfies

w̄^∆(p) = (1 − δ)[(1 − κ̄(p))s + κ̄(p)m(p)] + δ E^∆_{Nκ̄(p)} w̄^∆(p),   (4)

while in the "bad" state, play proceeds according to κ̲, and the payoff satisfies

w̲^∆(p) = max_{k∈{0,1}} {(1 − δ)[(1 − k)s + km(p)] + δ E^∆_{(N−1)κ̲(p)+k} w̲^∆(p)}.   (5)

That is, w̲^∆ is the value from the best response to all other players following κ̲.

A unilateral deviation from κ̄ in the good state is punished by a transition to the bad state in the following period; otherwise, we remain in the good state. If there is a unilateral deviation from κ̲ in the bad state, we remain in the bad state. Otherwise, a draw of the public randomization device determines whether the state next period is good or bad; this probability is chosen such that the expected payoff is indeed given by w̲^∆ (see below).

With continuation payoffs given by w̄^∆ and w̲^∆, the common action κ ∈ {0, 1} is incentive compatible at a belief p if, and only if,

(1 − δ)[(1 − κ)s + κm(p)] + δ E^∆_{Nκ} w̄^∆(p) ≥ (1 − δ)[κs + (1 − κ)m(p)] + δ E^∆_{(N−1)κ+1−κ} w̲^∆(p).   (6)

Therefore, the functions κ̄ and κ̲ define an SSE if, and only if, (6) holds for κ = κ̄(p) and κ = κ̲(p) at all p.

The probability η^∆(p) of a transition from the bad to the good state in the absence of a unilateral deviation from κ̲(p) is pinned down by the requirement that

w̲^∆(p) = (1 − δ)[(1 − κ̲(p))s + κ̲(p)m(p)] + δ{ η^∆(p) E^∆_{Nκ̲(p)} w̄^∆(p) + [1 − η^∆(p)] E^∆_{Nκ̲(p)} w̲^∆(p) }.   (7)

If k = κ̲(p) is optimal in (5), we simply set η^∆(p) = 0.
Otherwise, (5) and (6) imply

δ E^∆_{Nκ̲(p)} w̄^∆(p) ≥ w̲^∆(p) − (1 − δ)[(1 − κ̲(p))s + κ̲(p)m(p)] > δ E^∆_{Nκ̲(p)} w̲^∆(p),

so (7) holds with

η^∆(p) = [ w̲^∆(p) − (1 − δ)[(1 − κ̲(p))s + κ̲(p)m(p)] − δ E^∆_{Nκ̲(p)} w̲^∆(p) ] / [ δ E^∆_{Nκ̲(p)} w̄^∆(p) − δ E^∆_{Nκ̲(p)} w̲^∆(p) ] ∈ (0, 1].
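The algebra pinning down the randomization probability can be sanity-checked numerically. The following sketch (with illustrative placeholder numbers, not calibrated to the model) solves equation (7) for η^∆(p) and verifies that randomizing between the two continuation values reproduces the bad-state payoff exactly:

```python
# Solve equation (7) for the bad-to-good transition probability eta:
#   w_low = (1 - delta)*flow + delta*(eta*E_good + (1 - eta)*E_bad)
def eta_from_eq7(w_low, flow, delta, E_good, E_bad):
    """eta that makes the randomized continuation deliver w_low exactly.

    w_low  : bad-state payoff at the current belief
    flow   : expected current flow payoff under the bad-state action
    E_good : expected continuation value in the good state
    E_bad  : expected continuation value in the bad state
    """
    return (w_low - (1 - delta) * flow - delta * E_bad) / (delta * (E_good - E_bad))

# Illustrative placeholder values; the inequality chain derived from (5) and (6)
# guarantees E_bad < (w_low - (1 - delta)*flow)/delta <= E_good, hence eta in (0, 1].
delta, flow, E_good, E_bad = 0.95, 1.0, 2.0, 1.2
w_low = (1 - delta) * flow + delta * 1.5   # continuation value 1.5 lies between E_bad and E_good
eta = eta_from_eq7(w_low, flow, delta, E_good, E_bad)
reconstructed = (1 - delta) * flow + delta * (eta * E_good + (1 - eta) * E_bad)
```

Plugging η back into the right-hand side of (7) recovers w̲^∆(p), confirming that the transition probability is well defined whenever the displayed inequality chain holds strictly.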
It remains to specify κ̄ and κ̲. Let

p^m = (s − m₀)/(m₁ − m₀).

As m(p^m) = s, this is the belief at which a myopic agent is indifferent between the two arms. It is straightforward to verify that p* < p^m. Fixing p̲ ∈ (p*_N, p*) and p̄ ∈ (p^m, 1), we set κ̄(p) = 1_{p > p̲} and κ̲(p) = 1_{p > p̄}. Note that punishment and reward strategies coincide outside of (p̲, p̄).

Proposition 5
For ρ > 0, there are beliefs p♭ ∈ (p*_N, p*) and p♯ ∈ (p^m, 1) such that for all p̲ ∈ (p*_N, p♭) and p̄ ∈ (p♯, 1), there exists ∆̄ > 0 such that for all ∆ ∈ (0, ∆̄), the two-state automaton with functions κ̄ and κ̲ defines an SSE of the experimentation game with period length ∆.

The proof consists of verifying that, for a sufficiently small ∆, the actions κ̄(p) and κ̲(p) satisfy the incentive-compatibility constraint (6) at all p. First, we find ε > 0 such that w̲^∆ = s in a neighborhood of p̲ + ε. The payoff functions w̄^∆ and w̲^∆ resulting from the two-state automaton are then bounded away from one another on [p̲ + ε, p̄] for small ∆. In this range, therefore, the difference in expected continuation values across states does not vanish as ∆ tends to 0, whereas the difference in current expected payoffs across actions is of order ∆, rendering deviations unattractive for small enough ∆. On (p̄,
1] and [0, p̲], κ̄ and κ̲ both prescribe the myopically optimal action. Given that continuation payoffs are weakly higher in the good state, it is easy to show that there are no incentives to deviate on these intervals. For beliefs in (p̲, p̲ + ε), κ̲ again prescribes the myopically optimal action. The proof of incentive compatibility of κ̄ on this interval crucially relies on the fact that, for small ∆, w̄^∆ is bounded below by V_{N,p̲}, which has a convex kink at p̲. This, together with the fact that, conditional on no lump sum arriving, the log-likelihood ratio of posterior beliefs is Gaussian, allows us to demonstrate the existence of some constant C > 0 such that E^∆_N w̄^∆(p) ≥ s + C√∆ to the immediate right of p̲, whereas E^∆_{N−1} w̲^∆(p) ≤ s + C′∆ with some constant C′ >
0. For small ∆, therefore, the linearly vanishing current-payoff advantage of the safe over the risky arm is dominated by the incentives provided through continuation payoffs.

The next result essentially follows from letting p̲ → p*_N and p̄ → 1.

Proposition 6
For ρ > 0, lim_{∆→0} W̄^∆_SSE = V*_N and lim_{∆→0} W̲^∆_SSE = V*, uniformly on [0, 1].

1_A denotes the indicator function of the event A.

6.2 Pure Poisson Learning (ρ = 0)

Let ρ = 0, and take p̂ as in part (iii) of Theorem 1.

Proposition 7
Let ρ = 0. For any ε > 0, there is a ∆_ε > 0 such that for all ∆ ∈ (0, ∆_ε), the set of beliefs at which experimentation can be sustained in a PBE of the discrete game with period length ∆ is contained in the interval (p̂ − ε, 1]. In particular, lim sup_{∆→0} W̄^∆_PBE(p) ≤ V_{N,p̂}(p).

For a heuristic explanation of the logic behind this result, consider a sequence of pure-strategy PBEs for vanishing ∆ such that the infimum of the set of beliefs at which at least one player experiments converges to some limit p̃. Selecting a subsequence of ∆s and relabeling players, if necessary, we can assume without loss of generality that players 1, …, L play R immediately to the right of p̃, while players L + 1, …, N play S. In the limit, players' individual continuation payoffs are bounded below by the single-agent value function V* and cannot sum to more than N V_{N,p̃}, so the sum of the continuation payoffs of players 1, …, L is bounded above by N V_{N,p̃} − (N − L)V*. Averaging these players' incentive-compatibility constraints thus yields

Lλ(p̃)[ (N V_{N,p̃}(j(p̃)) − (N − L)V*(j(p̃)))/L − s ] − rc(p̃) ≥ (L − 1)λ(p̃)[V*(j(p̃)) − s].

Simplifying the left-hand side, adding (N − L)λ(p̃)[V*(j(p̃)) − s] to both sides and rearranging, we obtain

Nλ(p̃)[V_{N,p̃}(j(p̃)) − s] − rc(p̃) ≥ (N − 1)λ(p̃)[V*(j(p̃)) − s],

which in turn implies p̃ ≥ p̂, as we show in Lemma A.9 in the appendix. The proof of Proposition 7 makes this heuristic argument rigorous and extends it to mixed equilibria. For non-revealing jumps (λ₀ > 0), p̲ is now restricted to exceed p̂.

Proposition 8
Let ρ = 0 and λ₀ > 0. There are beliefs p♭ ∈ (p̂, p*) and p♯ ∈ (p^m, 1) such that for all p̲ ∈ (p̂, p♭) and p̄ ∈ (p♯, 1), there exists ∆̄ > 0 such that for all ∆ ∈ (0, ∆̄), the two-state automaton with functions κ̄ and κ̲ defines an SSE of the experimentation game with period length ∆.

The strategy for the proof of this proposition is the same as that of Proposition 5, except for the belief region to the immediate right of p̲, where incentives are now provided through terms of first order in ∆, akin to those in equation (3).

In the case λ₀ >
0, we are able to provide incentives in the potentially last round of experimentation by threatening punishment conditional on there being a success (that is, a successful experiment). This option is no longer available in the case of λ₀ = 0. Indeed, any success now takes us to a posterior of one, so that everyone plays risky forever after. This means that, irrespective of whether a success occurs in that round, continuation strategies are independent of past behavior, conditional on the players' belief. This raises the possibility of unravelling. If incentives just above the candidate threshold at which players give up on the risky arm cannot be provided, can this threshold be lower than in the MPE?

To settle whether unravelling occurs requires us to study the discrete game in considerable detail. We start by noting that for λ₀ = 0, we can strengthen Proposition 7 as follows: there is no PBE with any experimentation at beliefs below the discrete-time single-agent cutoff p^∆_1 = inf{p : W^∆_1(p) > s} (see Heidhues et al. (2015)). The highest average payoff that can be hoped for, then, involves all players experimenting above p^∆_1.

Unlike in the case of λ₀ > 0, it may not be possible to enforce both actions at beliefs just above p^∆_1. The proof of the next proposition establishes that the length of the interval of beliefs for which this is the case vanishes as ∆ →
0. In particular, for higher beliefs (except for beliefs arbitrarily close to 1, where playing R is strictly dominant), both pure actions can be enforced in some equilibrium.

Proposition 9
Let ρ = 0 and λ₀ = 0. For any beliefs p̲ and p̄ such that p* < p̲ < p̄ < 1, there exists ∆̄ > 0
such that for all ∆ ∈ (0, ∆̄), there exist

– an SSE in which, starting from a prior above p̲, all players use the risky arm on the path of play as long as the belief remains above p̲ and use the safe arm for beliefs below p*; and

– an SSE in which, given a prior between p̲ and p̄, the players' payoff is no larger than their best-reply payoff against opponents who use the risky arm if, and only if, the belief lies in [p*, p̲] ∪ [p̄, 1].

The study of symmetric MPEs is difficult in discrete time. Unlike in continuous time, in which the explicit solution is known (see Keller et al. (2005)), they do not seem to admit an easy characterization. For some open sets of beliefs, there are multiple symmetric MPEs in discrete time, regardless of how small ∆ is. It is not known whether any or all of these converge (in some sense) to the symmetric MPE in continuous time. In particular, this excludes the possibility that the asymmetric MPE of Keller et al. (2005) with an infinite number of switches between the two arms below p* can be approximated in the discrete game.

While this is somewhat weaker than Proposition 8, its implications for limit payoffs as ∆ → 0 are the same. As the length of the interval [p*, p̲] can be chosen arbitrarily small (actually, of the order ∆, as the proof establishes), its impact on equilibrium payoffs starting from priors above p̲ is of order ∆. This suggests that for the equilibria whose existence is stated in Proposition 9, the payoff converges to the payoff from all players experimenting above p* and to the best-reply payoff against none of the opponents experimenting. Indeed, we have the following result, covering both inconclusive and conclusive jumps.

Proposition 10
For ρ = 0, lim_{∆→0} W̄^∆_SSE = V_{N,p̂} and lim_{∆→0} W̲^∆_SSE = V*, uniformly on [0, 1].

While it is possible to derive explicit solutions to the equilibrium payoff sets of interest, at least asymptotically, note that, already in the discrete game, a characterization in terms of optimality equations can be obtained, which defines the correspondence of SSE payoffs. As discussed in the introduction, these generalize the familiar equation characterizing the value function of the symmetric MPE. Instead of a single (HJB) equation, the characterization of SSE payoffs involves two coupled functional equations, whose solution delivers the highest and lowest equilibrium payoff. Proposition 11 states this in the discrete game, while Proposition 12 gives the continuous-time limit. As these propositions do not heavily rely on the specific structure of our game, we believe that they might be useful for analyzing SSE payoffs for more general processes or other stochastic games.

Fix ∆ >
0. For p ∈ [0, 1], let W̄^∆(p) and W̲^∆(p) denote the supremum and infimum, respectively, of the set of payoffs over pure-strategy SSEs, given prior belief p. If such an equilibrium exists, these extrema are achieved, and W̄^∆(p) ≥ W̲^∆(p). For ρ > 0 or λ₀ > 0, we have shown in Sections 6.1–6.2 that in the limit as ∆ →
0, these extrema converge. (For the existence of various types of equilibria in discrete-time stochastic games, see Mertens, Sorin and Zamir (2015), Chapter 7.) The next proposition characterizes W̄^∆ and W̲^∆ via a pair of coupled functional equations.

Proposition 11
Suppose that the discrete game with time increment ∆ > 0 admits a pure-strategy SSE for any prior belief. Then the pair of functions (w̄, w̲) = (W̄^∆, W̲^∆) solves the functional equations

w̄(p) = max_{κ∈K(p; w̄, w̲)} {(1 − δ)[(1 − κ)s + κm(p)] + δ E^∆_{Nκ} w̄(p)},   (8)

w̲(p) = min_{κ∈K(p; w̄, w̲)} max_{k∈{0,1}} {(1 − δ)[(1 − k)s + km(p)] + δ E^∆_{(N−1)κ+k} w̲(p)},   (9)

where K(p; w̄, w̲) ⊆ {0, 1} denotes the set of all κ such that

(1 − δ)[(1 − κ)s + κm(p)] + δ E^∆_{Nκ} w̄(p) ≥ max_{k∈{0,1}} {(1 − δ)[(1 − k)s + km(p)] + δ E^∆_{(N−1)κ+k} w̲(p)}.   (10)

Moreover, W̲^∆ ≤ w̲ ≤ w̄ ≤ W̄^∆ for any solution (w̄, w̲) of (8)–(10).

This result relies on arguments that are familiar from Cronshaw and Luenberger (1994). We briefly sketch them here.

The above equations can be understood as follows. The ideal condition for a given (symmetric) action profile to be incentive compatible is that if each player conforms to it, the continuation payoff is the highest possible, while a deviation triggers the lowest possible continuation payoff. These actions are precisely the elements of K(p; w̄, w̲), as defined by equation (10). Given this set of actions, equation (9) provides the recursion that characterizes the constrained minmax payoff under the assumption that if a player were to deviate to his myopic best reply to the constrained minmax action profile, the punishment would be restarted next period, resulting in a minimum continuation payoff. Similarly, equation (8) yields the highest payoff under this constraint, but here, playing the best action (within the set) is on the equilibrium path.

Note that in any SSE, given p, the action κ(p) must be an element of K(p; W̄^∆, W̲^∆). This is because the left-hand side of (10) with w̄ = W̄^∆ is an upper bound on the continuation payoff if no player deviates, and the right-hand side with w̲ = W̲^∆ a lower bound on the continuation payoff after a unilateral deviation. Consider the equilibrium that achieves W̄^∆.
Then,

W̄^∆(p) ≤ max_{κ∈K(p; W̄^∆, W̲^∆)} {(1 − δ)[(1 − κ)s + κm(p)] + δ E^∆_{Nκ} W̄^∆(p)},
as the action played must be in K(p; W̄^∆, W̲^∆), and the continuation payoff is at most given by W̄^∆. Similarly, W̲^∆ must satisfy (9) with "≥" instead of "=." Suppose now that the "≤" were strict. Then we can define a strategy profile given prior p that (i) in period 0, plays the maximizer of the right-hand side, and (ii) from t = ∆ onward, abides by the continuation strategy achieving W̄^∆(p_∆). Because the initial action is in K(p; W̄^∆, W̲^∆), this constitutes an equilibrium, and it achieves a payoff strictly larger than W̄^∆(p), a contradiction. Hence, (8) must hold with equality for W̄^∆. The same reasoning applies to W̲^∆ and (9).

Fix a pair (w̄, w̲) that satisfies (8)–(10). Note that this implies w̲ ≤ w̄. Given such a pair and any prior p, we specify two SSEs whose payoffs are w̄ and w̲, respectively. It then follows that W̲^∆ ≤ w̲ ≤ w̄ ≤ W̄^∆. Let κ̄ and κ̲ denote a selection of the maximum and minimum of (8)–(9). The equilibrium strategies are described by a two-state automaton, whose states are referred to as "good" or "bad." The difference between the two equilibria lies in the initial state: w̄ is achieved when the initial state is good, w̲ is achieved when it is bad. In the good state, play proceeds according to κ̄; in the bad state, it proceeds according to κ̲. Transitions are exactly as in the equilibria described in Sections 6.1–6.2. This structure precludes profitable one-shot deviations in either state, so that the automaton describes equilibrium strategies, and the desired payoffs are obtained.

As ∆ tends to 0, equations (8)–(9) transform into differential-difference equations involving terms that are familiar from the continuous-time analysis in Section 3. A formal Taylor approximation shows that for any κ ∈ {0, 1}, K ∈ {0, 1, …, N} and a sufficiently regular function w on the unit interval,

(1 − δ)[(1 − κ)s + κm(p)] + δ E^∆_K w(p) = w(p) + r{(1 − κ)s + κm(p) + K b(p, w) − w(p)}∆ + o(∆).
Applying this approximation to (8)–(9), cancelling the terms of order 0 in ∆, dividing through by ∆, letting ∆ → 0, and writing c(p) = s − m(p) for the opportunity cost of playing risky, we obtain the coupled differential-difference equations that appear in the following result.

Proposition 12
Let ρ > 0 or λ₀ > 0. As ∆ → 0, the pair of functions (W̄^∆, W̲^∆) converges uniformly (in p) to a pair of functions (w̄, w̲) solving

w̄(p) = s + max_{κ∈K(p)} κ[N b(p, w̄) − c(p)],   (11)

w̲(p) = s + min_{κ∈K(p)} (N − 1)κ b(p, w̲) + max_{k∈{0,1}} k[b(p, w̲) − c(p)],   (12)

where

K(p) = {0} for p ≤ p̂,  {0, 1} for p̂ < p < 1,  {1} for p = 1,   (13)

and p̂ is as in parts (ii) and (iii) of Theorem 1.

This result is an immediate consequence of the previous results. It follows from Sections 6.1–6.2 that, except when ρ = λ₀ = 0, there exist pure-strategy SSEs and the pair (W̄^∆, W̲^∆) converges uniformly to (V_{N,p̂}, V*). It is straightforward to verify that (w̄, w̲) = (V_{N,p̂}, V*) solves (11)–(13). First, as V*_N satisfies

V*_N(p) = s + max_{κ∈{0,1}} κ[N b(p, V*_N) − c(p)],

with N b(p, V*_N) − c(p) > 0 if and only if p > p*_N, (11) is trivially solved by V*_N whenever p̂ = p*_N. (This equation follows from the HJB equation in Section 3: because the maximand is linear in K, the continuous-time planner finds it optimal to set K = 0 or K = N at any given belief.) Second, for p̂ > p*_N, the function V_{N,p̂} satisfies

V_{N,p̂}(p) = s + 1_{p>p̂}[N b(p, V_{N,p̂}) − c(p)],

with N b(p, V_{N,p̂}) − c(p) > 0 for p > p̂, so V_{N,p̂} solves (11) when p̂ > p*_N. Third, V* always solves (12). In fact, as b(p, V*) ≥ 0, min_{κ∈{0,1}} (N − 1)κ b(p, V*) = 0, and (12) with this minimum set to zero is just the HJB equation for V*.

Note that the continuous-time functional equations (11)–(12) would be equally easy to solve for any arbitrary p̂ in (13). However, only the solution with p̂ as in Theorem 1 captures the asymptotics of our discretization of the experimentation game.

We have shown that the inefficiencies arising in strategic bandit problems are driven by the solution concept, MPE. Inefficiencies entirely disappear when news has a Brownian component (recall that θ = 1 denotes the good state).
For processes with a Brownian component, our proof that risky play is incentive compatible immediately to the right of the threshold p*_N only exploits the properties of the posterior belief process conditional on no lump sum arriving. As these properties are the same whether lump sums are informative or not, asymptotic efficiency when a Brownian component is present obtains more generally. When learning is driven by lump-sum payoffs only, inspection of equation (3) suggests that efficiency requires that a lump sum of any size arriving at the initial belief p*_N lead to a posterior belief no higher than p*. Therefore, the condition for asymptotic efficiency has a straightforward generalization.

As mentioned above, our model rules out lumpy bad news. Hence, it rules out models in which Poisson events are "breakdowns," as in the model of Keller and Rady (2015), for instance. Bad news amounts to assuming that the safe flow payoff and the average size of lump-sum payoffs are both negative, with λ₁h < s < λ₀h ≤
0. Now, θ = 1 is the bad state of the world, and the efficient and single-player solution cutoffs in continuous time satisfy p*_N > p*, with the stopping region lying to the right of the cutoff in either case. The associated value functions V* and V*_N solve the same HJB equations as in Section 3. In this model, j(p*_N) > p*_N > p*, i.e., starting from p*_N, the belief remains in the single-agent stopping region for small ∆, whether a breakdown occurs or not. Hence, the harshest possible punishment, consisting of all other players playing safe forever, can be meted out to any potential deviator, whether there is a breakdown or not. Thus, we conjecture that asymptotic efficiency also obtains in this framework.

Appendix

A Auxiliary Results

A.1 Evolution of Beliefs
For the description of the evolution of beliefs, it is convenient to work with the log odds ratio

ℓ_t = ln(p_t/(1 − p_t)).

Suppose that, starting from ℓ₀ = ℓ, the players use the fixed action profile (k₁, …, k_N) ∈ {0, 1}^N. By Peskir and Shiryaev (2006, pp. 287–289 and 334–338), the log odds ratio at time t > 0 is

ℓ_t = ℓ + Σ_{n: k_n=1} { ((α₁ − α₀)/σ²)(X^n_t − α₀t − hN^n_t) − [(α₁ − α₀)²/(2σ²) + λ₁ − λ₀]t + ln(λ₁/λ₀)N^n_t },

where X^n and N^n are the payoff and Poisson processes, respectively, associated with player n's risky arm. The terms involving α₀, α₁ and σ capture learning from the continuous component, X^n_t − hN^n_t, of the payoff process, with higher realizations making the players more optimistic. The terms involving λ₀ and λ₁ capture learning from lump-sum payoffs, with the players becoming more pessimistic on average as long as no lump sum arrives, and each arrival increasing the log odds ratio by the fixed increment ln(λ₁/λ₀).

Under the probability measure P^θ associated with state θ ∈ {0, 1}, X^n_t − α₀t − hN^n_t is Gaussian with mean (α_θ − α₀)t and variance σ²t, so that Σ_{n: k_n=1}((α₁ − α₀)σ^{−2})(X^n_t − α₀t − hN^n_t) is Gaussian with mean K(α₁ − α₀)(α_θ − α₀)σ^{−2}t and variance Kρ²t, where K = Σ^N_{n=1} k_n and ρ = (α₁ − α₀)σ^{−1}. Conditional on the event that Σ_{n: k_n=1} N^n_t = J, therefore, ℓ_t is normally distributed with mean ℓ − K(λ₁ − λ₀ − ρ²/2)t + J ln(λ₁/λ₀) and variance Kρ²t under P¹, and normally distributed with mean ℓ − K(λ₁ − λ₀ + ρ²/2)t + J ln(λ₁/λ₀) and variance Kρ²t under P⁰. Finally, the probability under measure P^θ that Σ_{n: k_n=1} N^n_t = J equals ((Kλ_θ t)^J/J!)e^{−Kλ_θ t} by the sum property of the Poisson distribution.

Taken together, these facts make it possible to explicitly compute the distribution of p_t = e^{ℓ_t}/(1 + e^{ℓ_t}) under the players' measure P^p = pP¹ + (1 − p)P⁰.
As this explicit representation is not neededin what follows, we omit it here.Instead, we turn to the characterization of infinitesimal changes of p t , once more assuminga fixed action profile with K players using the risky arm. Arguing as in Cohen and Solan(2013, Section 3.3), one shows that, with respect to the players’ information filtration, the Here, λ /λ is understood to be 1 when λ = λ = 0. When λ > λ = 0, we have ℓ t = ∞ and p t = 1 from the arrival time of the first lump-sum on. rocess of posterior beliefs is a Markov process whose infinitesimal generator L K acts asfollows on real-valued functions v of class C on the open unit interval: L K v ( p ) = K (cid:26) ρ p (1 − p ) v ′′ ( p ) − ( λ − λ ) p (1 − p ) v ′ ( p ) + λ ( p ) [ v ( j ( p )) − v ( p )] (cid:27) . In particular, instantaneous changes in beliefs exhibit linearity in K in the sense that L K = K L .By the very nature of Bayesian updating, finally, the process of posterior beliefs is amartingale with respect to the players’ information filtration. A.2 Payoff Functions
Our first auxiliary result concerns the function u(·; μ_N) defined in Section 3.

Lemma A.1  δ E^∆_K u(·; μ_N)(p) = δ^{1−K/N} u(p; μ_N) for all ∆ > 0, K ∈ {0, …, N} and p ∈ (0, 1).

Proof:
We simplify notation by writing u for u(·; μ_N). Consider the process (p_t) of posterior beliefs in continuous time when p₀ = p > 0 and K players use the risky arm. By Dynkin's formula,

E[e^{−rK∆/N} u(p_∆)] = u(p) + E[ ∫₀^∆ e^{−rKt/N} { L_K u(p_t) − (rK/N)u(p_t) } dt ]
 = u(p) + K E[ ∫₀^∆ e^{−rKt/N} { L₁u(p_t) − (r/N)u(p_t) } dt ]
 = u(p),

where the last equality follows from the fact that L₁u = ru/N on (0, 1). Thus, δ^{K/N} E^∆_K u(p) = u(p).

We further note that E^∆_K m(p) = m(p) for all K by the martingale property of beliefs and the linearity of m in p.

These properties are used repeatedly in what follows. Their first application is in the proof of uniform convergence of the discrete-time single-agent value function to its continuous-time counterpart.

Let (W, ‖·‖) be the Banach space of bounded real-valued functions on [0,
1] equipped with the supremum norm. Given ∆ >
0 and any w ∈ W, define a function T^∆_1 w ∈ W by

T^∆_1 w(p) = max{ (1 − δ)m(p) + δ E^∆_1 w(p), (1 − δ)s + δw(p) }.

(To verify the identity L₁u = ru/N used in the proof of Lemma A.1, note that

u′(p) = −((μ_N + p)/(p(1 − p))) u(p),  u″(p) = (μ_N(μ_N + 1)/(p²(1 − p)²)) u(p),  u(j(p)) = (λ₀/λ(p))(λ₀/λ₁)^{μ_N} u(p),

and use the equation defining μ_N.)

The operator T^∆_1 satisfies Blackwell's sufficient conditions for being a contraction mapping with modulus δ on (W, ‖·‖): monotonicity (v ≤ w implies T^∆_1 v ≤ T^∆_1 w) and discounting (T^∆_1(w + c) = T^∆_1 w + δc for any real number c). By the contraction mapping theorem, T^∆_1 has a unique fixed point in W; this is the value function W^∆_1 of an agent experimenting in isolation.

The corresponding continuous-time value function is V* as introduced in Section 3. As any discrete-time strategy is feasible in continuous time, we trivially have W^∆_1 ≤ V*.

Lemma A.2  W^∆_1 → V* uniformly as ∆ → 0.

Proof:
A lower bound for W^∆_1 is given by the payoff function W^∆_* of a single agent who uses the cutoff p* in discrete time; this function is the unique fixed point in W of the contraction mapping T^∆_* defined by

T^∆_* w(p) = (1 − δ)m(p) + δ E^∆_1 w(p) if p > p*,  (1 − δ)s + δw(p) if p ≤ p*.

Next, choose p̆ < p*, and define p♮ = (p̆ + p*)/2 and the function v = m + Cu(·; μ₁) + 1_{[0,p♮]}(s − m − Cu(·; μ₁)), where the constant C is chosen so that s = m(p̆) + Cu(p̆; μ₁).

Fix ε >
0. As v converges uniformly to V* as p̆ → p*, we can choose p̆ such that v ≥ V* − ε. It suffices now to show that there is a ∆̄ > 0 such that T^∆_* v ≥ v for ∆ < ∆̄. In fact, the monotonicity of T^∆_* then implies W^∆_* ≥ v and hence V* − ε ≤ v ≤ W^∆_* ≤ W^∆_1 ≤ V* for all ∆ < ∆̄.

For p ≤ p*, we have T^∆_* v(p) = (1 − δ)s + δv(p) ≥ v(p) for all ∆, because v ≤ s in this range. For p > p*,

T^∆_* v(p) = (1 − δ)m(p) + δ E^∆_1 v(p)
 = (1 − δ)m(p) + δ E^∆_1 [m + Cu + 1_{[0,p♮]}(s − m − Cu)](p)
 = v(p) + δ E^∆_1 [1_{[0,p♮]}(s − m − Cu)](p),

where the last equation uses that E^∆_1 m(p) = m(p) and δ E^∆_1 u(p) = u(p). In particular, T^∆_* v(1) = v(1).

The function s − m − Cu is negative on the interval (0, p̆) and positive on (p̆, p♯), for some p♯ > p*. The expectation of s − m(p_∆) − Cu(p_∆) conditional on p₀ = p and p_∆ ≤ p♮ is continuous in (p, ∆) ∈ [p*, 1] × (0, ∞) and converges to s − m(p♮) − Cu(p♮) > 0 as p → p* and ∆ → 0, because the conditional distribution of p_∆ becomes a Dirac measure at p♮ in either limit. This implies the existence of ∆̄ > 0 such that this conditional expectation is positive for all (p, ∆) ∈ [p*, 1] × (0, ∆̄). For these (p, ∆), we thus have

E^∆_1 [1_{[0,p♮]}(s − m − Cu)](p) ≥ E^∆_1 [1_{[p♭,p♮]}(s − m − Cu)](p) ≥ 0,

where p♭ = (p̆ + p♮)/2. As a consequence, T^∆_* v ≥ v for all (p, ∆) ∈ (p*, 1] × (0, ∆̄).

Next, we turn to the payoff function associated with the good state of the automaton defined in Section 6. By the same arguments as invoked immediately before Lemma A.2, w̄^∆ is the unique fixed point in W of the operator T̄^∆ defined by

T̄^∆ w(p) = (1 − δ)m(p) + δ E^∆_N w(p) if p > p̲,  (1 − δ)s + δw(p) if p ≤ p̲.

Lemma A.3
Let p̲ > p*_N. Then w̄^∆ ≥ V_{N,p̲} for ∆ sufficiently small.

Proof:
Because of the monotonicity of the operator T̄^∆, it suffices to show that T̄^∆ V_{N,p̲} ≥ V_{N,p̲} for sufficiently small ∆. Recall that for p > p̲, V_{N,p̲}(p) = m(p) + Cu(p; μ_N), where the constant C > 0 is pinned down by value matching at p̲.

For p ≤ p̲, we use exactly the same argument as in the penultimate paragraph of the proof of Lemma A.2; for p > p̲, the argument is the same as in the last paragraph of that proof.

The next two results concern the payoff function associated with the bad state of the automaton defined in Section 6. Fix a cutoff p̄ ∈ (p^m,
1) and let K(p) = N − 1 if p > p̄, and K(p) = 0 otherwise. Given ∆ >
0 and any bounded function w on [0, 1], define T̲^∆ w by

T̲^∆ w(p) = max{ (1 − δ)m(p) + δ E^∆_{K(p)+1} w(p), (1 − δ)s + δ E^∆_{K(p)} w(p) }.

The operator T̲^∆ again satisfies Blackwell's sufficient conditions for being a contraction mapping with modulus δ on W. Its unique fixed point in this space is the payoff function w̲^∆ (introduced in Section 6) from playing a best response against N − 1 opponents who use the risky arm if p > p̄, and safe otherwise.

Lemma A.4
Let p̲ ∈ (p*_N, p*). Then there exists p⋄ ∈ [p^m, 1) such that for all p̄ ∈ (p⋄, 1), the inequality w̲^∆ ≤ V_{N,(p̲+p*)/2} holds for ∆ sufficiently small.

Proof:
Let p̃ = (p̲ + p*)/
2. For p > p̃, we have V_{N,p̃}(p) = m(p) + Cu(p; μ_N), where the constant C > 0 is pinned down by value matching at p̃. To simplify notation, we write ṽ instead of V_{N,p̃} and u instead of u(·; μ_N). For x >
0, we define

p*_x = μ_x(s − m₀) / [(μ_x + 1)(m₁ − s) + μ_x(s − m₀)],

where μ_x is the unique positive root of

f(μ; x) = (ρ²/2)μ(μ + 1) + (λ₁ − λ₀)μ + λ₀[(λ₀/λ₁)^μ − 1] − r/x;

existence and uniqueness of this root follow from continuity and monotonicity of f(·; x) together with the fact that f(0; x) < 0 and f(μ; x) → ∞ as μ → ∞. This extends our previous definitions of μ_N and p*_N to non-integer numbers. It is immediate to verify now that dμ_x/dx < 0 and dp*_x/dx <
0. Thus, there exists x̆ ∈ (1, N) such that p*_x̆ ∈ (p̃, p*). Having chosen such an x̆, we fix a belief p̆ ∈ (p̃, p*_x̆) and, on the open unit interval, consider the function v̆ that solves

L₁v − (r/x̆)(v − m) = 0

subject to the conditions v̆(p̆) = s and v̆′(p̆) = 0. This function has the form

v̆(p) = m(p) + ŭ(p), with ŭ(p) = A(1 − p)((1 − p)/p)^μ̆ + Bp(p/(1 − p))^μ̂ = Au(p; μ̆) + Bu(1 − p; μ̂).

Here, μ̆ = μ_x̆ and μ̂ is the unique positive root of

g(μ; x) = (ρ²/2)μ(μ + 1) − (λ₁ − λ₀)μ + λ₁[(λ₁/λ₀)^μ − 1] − r/x;

existence and uniqueness of this root follow along the same lines as above.

The constants of integration A and B are pinned down by the conditions v̆(p̆) = s and v̆′(p̆) = 0. One calculates that B > 0 if and only if p̆ < p*_x̆, which holds by construction, and that A > 0 if and only if

p̆ < (1 + μ̂)(s − m₀) / [μ̂(m₁ − s) + (1 + μ̂)(s − m₀)].

The right-hand side of this inequality is decreasing in μ̂ and tends to p^m as μ̂ → ∞. Therefore, we can conclude that the inequality holds, and A > 0. As A + B > 0, the function v̆ is strictly increasing and strictly convex on (p̆, 1). As B >
0, finally, v̆(p) → ∞ as p → 1. Hence, there exists p♮ ∈ (p̆,
1) such that v̆(p♮) = ṽ(p♮) and v̆ > ṽ on (p♮, 1). We claim that v̆ < ṽ on (p̆, p♮). Indeed, if this is not the case, then v̆ − ṽ assumes a non-negative local maximum at some p♯ ∈ (p̆, p♮). This implies:

(i) v̆(p♯) ≥ ṽ(p♯), i.e.,

Au(p♯; μ̆) + Bu(1 − p♯; μ̂) ≥ Cu(p♯; μ_N);   (A.1)

(ii) v̆′(p♯) = ṽ′(p♯), i.e.,

−(μ̆ + p♯)Au(p♯; μ̆) + (μ̂ + 1 − p♯)Bu(1 − p♯; μ̂) = −(μ_N + p♯)Cu(p♯; μ_N);   (A.2)

Cf.
Lemma 6 in Cohen & Solan (2013). nd (iii) ˘ v ′′ ( p ♯ ) ≤ ˜ v ′′ ( p ♯ ), i.e. ,˘ µ (˘ µ + 1) Au ( p ♯ ; ˘ µ ) + ˆ µ (1 + ˆ µ ) Bu (1 − p ♯ ; ˆ µ ) ≤ µ N ( µ N + 1) Cu ( p ♯ ; µ N ) . (A.3)Solving for Bu (1 − p ♯ ; ˆ µ ) in (A.2) and inserting the result into (A.1) and (A.3), we obtain,respectively, Cu ( p ♯ ; µ N ) Au ( p ♯ ; ˘ µ ) ≤ ˘ µ + ˆ µ + 1 µ N + ˆ µ + 1 , and Cu ( p ♯ ; µ N ) Au ( p ♯ ; ˘ µ ) ≥ ˘ µ (˘ µ + 1)(ˆ µ + 1 − p ♯ ) + ˆ µ (ˆ µ + 1)(˘ µ + p ♯ ) µ N ( µ N + 1)(ˆ µ + 1 − p ♯ ) + ˆ µ (ˆ µ + 1)( µ N + p ♯ ) . This implies that˘ µ + ˆ µ + 1 µ N + ˆ µ + 1 ≥ ˘ µ (˘ µ + 1)(ˆ µ + 1 − p ♯ ) + ˆ µ (ˆ µ + 1)(˘ µ + p ♯ ) µ N ( µ N + 1)(ˆ µ + 1 − p ♯ ) + ˆ µ (ˆ µ + 1)( µ N + p ♯ ) , which one shows to be equivalent to ˘ µ ≤ µ N . But ˘ x < N and dµ x dx < µ > µ N . Thisis the desired contradiction.Now let p ⋄ = max { p m , p ♮ } , fix ¯ p ∈ ( p ⋄ ,
1) and define v ( p ) = ˜ v ( p ) if p > p ♮ , ˘ v ( p ) if ˘ p ≤ p ≤ p ♮ ,s if p < ˘ p. By construction, s ≤ v ≤ min { ˜ v, ˘ v } . This immediately implies that (1 − δ ) s + δv ≤ v . Wenow show that T ∆ v ≤ v , and hence w ∆ ≤ v , for ∆ sufficiently small.First, let p ∈ (¯ p, − δ ) m ( p ) + δ E ∆ N v ( p ) ≤ (1 − δ ) m ( p ) + δ E ∆ N (cid:2) m + Cu + (0 , ˘ p ) ( s − m − Cu ) (cid:3) ( p )= m ( p ) + Cu ( p ) + δ E ∆ N (cid:2) (0 , ˘ p ) ( s − m − Cu ) (cid:3) ( p ) ≤ m ( p ) + Cu ( p )= v ( p ) , for ∆ small enough that E ∆ N (cid:2) (0 , ˘ p ) ( s − m − Cu ) (cid:3) (¯ p ) ≤
0; that this inequality holds for small∆ follows from the fact that s < m + Cu on (˜ p, ˘ p ). By the same token,(1 − δ ) s + δ E ∆ N − v ( p ) ≤ (1 − δ ) s + δ E ∆ N − ( m + Cu )( p ) + δ E ∆ N − (cid:2) (0 , ˘ p ) ( s − m − Cu ) (cid:3) ( p )= (1 − δ ) s + δm ( p ) + δ N Cu ( p ) + δ E ∆ N − (cid:2) (0 , ˘ p ) ( s − m − Cu ) (cid:3) ( p ) ≤ m ( p ) + Cu ( p )= v ( p ) , for ∆ small enough that E ∆ N − (cid:2) (0 , ˘ p ) ( s − m − Cu ) (cid:3) (¯ p ) ≤
0, as Cu ( p ) > s < m ( p ) for > p m .Second, let p ∈ ( p ♮ , ¯ p ]. Now, we have(1 − δ ) m ( p ) + δ E ∆1 v ( p ) ≤ m ( p ) + δ − N Cu ( p ) + δ E ∆1 (cid:2) (0 , ˘ p ) ( s − m − Cu ) (cid:3) ( p ) ≤ m ( p ) + Cu ( p )= v ( p ) , for ∆ small enough that E ∆1 (cid:2) (0 , ˘ p ) ( s − m − Cu ) (cid:3) ( p ♮ ) ≤ p ∈ [˘ p, p ♮ ]. In this case,(1 − δ ) m ( p ) + δ E ∆1 v ( p ) ≤ (1 − δ ) m ( p ) + δ E ∆1 ˘ v ( p )= m ( p ) + δ E ∆1 ˘ u ( p )= m ( p ) + ˘ u ( p ) + E (cid:20) Z ∆0 e − rt (cid:8) L ˘ u ( p t ) − r ˘ u ( p t ) (cid:9) dt (cid:12)(cid:12)(cid:12)(cid:12) p = p (cid:21) ≤ m ( p ) + ˘ u ( p ) + E (cid:20) Z ∆0 e − rt n L ˘ u ( p t ) − r ˘ x ˘ u ( p t ) o dt (cid:12)(cid:12)(cid:12)(cid:12) p = p (cid:21) = m ( p ) + ˘ u ( p )= v ( p ) , where the second equality follows from Dynkin’s formula, the second inequality holds because˘ u ( p t ) > x >
1, and the third equality is a consequence of the identity L ˘ u − r ˘ u/ ˘ x = 0.Finally, let p ∈ [0 , ˘ p ). By monotonicity of m and v (and the previous step), we see that(1 − δ ) m ( p ) + δ E ∆1 v ( p ) ≤ (1 − δ ) m (˘ p ) + δ E ∆1 v (˘ p ) ≤ v (˘ p ) = s = v ( p ). Lemma A.5
There exist p̌ ∈ (p^m, 1) and p‡ ∈ (p*_N, p*) such that w_∆(p) = s for all p̄ ∈ (p̌, 1), p ≤ p‡ and ∆ > 0. For any ε > 0, moreover, there exists p̌_ε ∈ (p̌, 1) such that w_∆ ≤ V* + ε for all p̄ ∈ (p̌_ε, 1) and ∆ > 0.

Proof:
Consider any ¯ p ∈ ( p m ,
1) and an initial belief p < ¯ p . We obtain an upper bound on w ∆ ( p ) by considering a modified problem in which (i) the player can choose a best responsein continuous time and (ii) the game is stopped with continuation payoff m as soon as thebelief ¯ p is reached. This problem can be solved in the standard way, yielding an optimalcutoff p ‡ . By construction, w ∆ = s on [0 , p ‡ ]. As we take ¯ p close to 1, p ‡ approaches p ∗ fromthe left and thus gets to lie strictly in between p ∗ N and p ∗ . This proves the first statement.The second follows from the fact that the value function of the modified problem convergesuniformly to V ∗ as ¯ p → ρ = 0), we need a sharper characterization of thepayoff function w ∆ as ∆ becomes small. To this end, we define V , ¯ p as the continuous-timecounterpart to w ∆ . The methods employed in Keller and Rady (2010) can be used to establishthat V , ¯ p has the following properties for ρ = 0. First, there is a cutoff p † < p m such that V , ¯ p = s on [0 , p † ], and V , ¯ p > s everywhere else. Second, V , ¯ p is continuously differentiable verywhere except at ¯ p . Third, V , ¯ p solves the Bellman equation v ( p ) = max n m ( p ) + [ K ( p ) + 1] b ( p, v ) , s + K ( p ) b ( p, v ) o , where b ( p, v ) = λ ( p ) r [ v ( j ( p )) − v ( p )] − λ − λ r p (1 − p ) v ′ ( p ) , and v ′ ( p ) is taken to mean the left-hand derivative of v . Fourth, b ( p, V , ¯ p ) ≥ p . Fifth,because of smooth pasting at p † , the term m ( p ) + b ( p, V , ¯ p ) − s is continuous in p except at¯ p ; it has a single zero at p † , being positive to the right of it and negative to the left. Finally,we note that V , ¯ p = V ∗ and p † = p ∗ for ¯ p = 1. Lemma A.6
Let ρ = 0 . Then V , ¯ p → V ∗ uniformly as ¯ p → . The convergence is monotonein the sense that ¯ p ′ > ¯ p implies V , ¯ p ′ < V , ¯ p on { p : s < V , ¯ p ( p ) < λ h } . As the closed-form solutions for the functions in question make it straightforward toestablish this result, we omit the proof.A key ingredient in the analysis of the pure Poisson case is uniform convergence of w ∆ to V , ¯ p as ∆ →
0, which we establish by means of the following result. Lemma A.7
Let {T_∆}_{∆>0} be a family of contraction mappings on the Banach space (W; ‖·‖) with moduli {β_∆}_{∆>0} and associated fixed points {w_∆}_{∆>0}. Suppose that there is a constant ν > 0 such that 1 − β_∆ = ν∆ + o(∆) as ∆ → 0. Then, a sufficient condition for w_∆ to converge in (W; ‖·‖) to the limit v as ∆ → 0 is that ‖T_∆ v − v‖ = o(∆).

Proof: As

‖w_∆ − v‖ = ‖T_∆ w_∆ − v‖ ≤ ‖T_∆ w_∆ − T_∆ v‖ + ‖T_∆ v − v‖ ≤ β_∆ ‖w_∆ − v‖ + ‖T_∆ v − v‖,

the stated conditions on β_∆ and ‖T_∆ v − v‖ imply

‖w_∆ − v‖ ≤ ‖T_∆ v − v‖ / (1 − β_∆) = ∆ f(∆) / (ν∆ + ∆ g(∆)) = f(∆) / (ν + g(∆)),

with lim_{∆→0} f(∆) = lim_{∆→0} g(∆) = 0.

In our application of this lemma, W is again the Banach space of bounded real-valued functions on the unit interval, equipped with the supremum norm. The operator in question is T_∆ as defined above; the corresponding moduli are β_∆ = δ = e^{−r∆}, so that ν = r.

Lemma A.8
Let ρ = 0 . Then w ∆ → V , ¯ p uniformly as ∆ → . To the best of our knowledge, the earliest appearance of this result in the economics literature isin Biais et al. (2007). A related approach is taken in Sadzik and Stacchetti (2015). roof: To simplify notation, we write v instead of V , ¯ p . For K ∈ { , , . . . , N } , a straight-forward Taylor expansion of E ∆ K v with respect to ∆ yieldslim ∆ → (cid:13)(cid:13) δ E ∆ K v − v − r [ Kb ( · , v ) − v ]∆ (cid:13)(cid:13) = 0 . (A.4)For p > ¯ p , we have K ( p ) = N −
1, and (A.4) implies(1 − δ ) m ( p ) + δ E ∆ N v ( p ) = v ( p ) + r [ m ( p ) + N b ( p, v ) − v ( p )] ∆ + o (∆) , (1 − δ ) s + δ E ∆ N − v ( p ) = v ( p ) + r [ s + ( N − b ( p, v ) − v ( p )] ∆ + o (∆) . As m ( p ) > s on [¯ p,
1] and b ( p, v ) ≥
0, there exists ξ > m ( p ) + N b ( p, v ) − [ s + ( N − b ( p, v )] > ξ, on (¯ p, T ∆ v ( p ) = (1 − δ ) m ( p ) + δ E ∆ N v ( p ) for ∆ sufficiently small, and the fact that v ( p ) = m ( p ) + N b ( p, v ) now implies T ∆ v ( p ) = v ( p ) + o (∆) on (¯ p, , ¯ p ], we have K ( p ) = 0, and (A.4) implies (cid:13)(cid:13) (1 − δ ) m + δ E ∆1 v − v − r [ m + b ( · , v ) − v )∆ (cid:13)(cid:13) = ∆ ψ R (∆) , (A.5) (cid:13)(cid:13) (1 − δ ) s + δ E ∆0 v − v − r [ s − v ]∆ (cid:13)(cid:13) = ∆ ψ S (∆) , (A.6)for some functions ψ R , ψ S : (0 , ∞ ) → [0 , ∞ ) that satisfy ψ R (∆) → ψ S (∆) → → p ∈ ( p † , ¯ p ]. We note that T ∆ v ( p ) ≥ (1 − δ ) m ( p ) + δ E ∆1 v ( p ) ≥ v ( p ) − ∆ ψ R (∆)in this range, where the first inequality follows from the definition of T ∆ , and the secondinequality is implied by (A.5) and v ( p ) = m ( p ) + b ( p, v ) for p ∈ ( p † , ¯ p ]. If the maximum inthe definition of T ∆ v ( p ) is achieved by the risky action, the first in the previous chain ofinequalities holds as an equality, and (A.5) immediately implies that T ∆ v ( p ) = v ( p ) + o (∆).If the maximum in the definition of T ∆ v ( p ) is achieved by the safe action, however, we have T ∆ v ( p ) = (1 − δ ) s + δ E ∆0 v ( p ) ≤ v ( p ) + r [ s − v ( p )]∆ + ∆ ψ S (∆) ≤ v ( p ) + ∆ ψ S (∆), wherethe second inequality follows from v > s on ( p † , ¯ p ]. Thus v ( p ) − ∆ ψ R (∆) ≤ T ∆ v ( p ) ≤ v ( p ) + ∆ ψ S (∆), and we can conclude that T ∆ v ( p ) = v ( p ) + o (∆) in this case as well.Now, let p ≤ p † . We note that T ∆ v ( p ) ≥ (1 − δ ) s + δ E ∆0 v ( p ) ≥ v ( p ) − ∆ ψ S (∆) in thisrange, where the first inequality follows from the definition of T ∆ , and the second inequalityis implied by (A.6) and v ( p ) = s for p ≤ p † . If the maximum in the definition of T ∆ v ( p ) isachieved by the safe action, the first in the previous chain of inequalities holds as an equality,and (A.6) immediately implies that T ∆ v ( p ) = v ( p ) + o (∆). 
If the maximum in the definition of T_∆v(p) is achieved by the risky action, however, we have T_∆v(p) = (1 − δ) m(p) + δ E^∆_1 v(p) ≤ v(p) + r[m(p) + b(p, v) − v(p)]∆ + ∆ψ_R(∆) ≤ v(p) + ∆ψ_R(∆), where the second inequality follows from v = s ≥ m(p) + b(p, v) on [0, p†]. Thus v(p) − ∆ψ_S(∆) ≤ T_∆v(p) ≤ v(p) + ∆ψ_R(∆), and we can again conclude that T_∆v(p) = v(p) + o(∆) in this case as well.

Our last two auxiliary results pertain to the case of pure Poisson learning.

Lemma A.9
Let ρ = 0. There is a belief p̂ ∈ [p*_N, p*] such that

λ(p) [ N V_{N,p}(j(p)) − (N − 1) V*(j(p)) − s ] − rc(p)

is negative if 0 < p < p̂, zero if p = p̂, and positive if p̂ < p < 1. Moreover, p̂ = p*_N if, and only if, j(p*_N) ≤ p*, and p̂ = p* if, and only if, λ₀ = 0.

Proof:
We start by noting that given the functions V ∗ and V ∗ N , the cutoffs p ∗ and p ∗ N areuniquely determined by λ ( p ∗ )[ V ∗ ( j ( p ∗ )) − s ] = rc ( p ∗ ) , (A.7)and λ ( p ∗ N )[ N V ∗ N ( j ( p ∗ N )) − N s ] = rc ( p ∗ N ) , (A.8)respectively.Consider the differentiable function f on (0 ,
1) given by

f(p) = λ(p) [ N V_{N,p}(j(p)) − (N − 1) V*(j(p)) − s ] − rc(p).

For λ₀ = 0, we have j(p) = 1 and V_{N,p}(j(p)) = V*(j(p)) = m₁ for all p, so f(p) = λ(p)[V*(j(p)) − s] − rc(p), which is zero at p = p* by (A.7), positive for p > p*, and negative for p < p*. Assume λ₀ >
0. For 0 < p < p ≤
1, we have V N,p ( p ) = m ( p ) + c ( p ) u ( p ; µ N ) /u ( p ; µ N ).Moreover, we have V ∗ ( p ) = s when p ≤ p ∗ , and V ∗ ( p ) = m ( p ) + Cu ( p ; µ ) with a constant C > u ( j ( p ); µ ) = λ λ ( p ) (cid:18) λ λ (cid:19) µ u ( p ; µ ) , we see that the term λ ( p ) N V
N,p ( j ( p )) is actually linear in p . When j ( p ) ≤ p ∗ , the term − λ ( p )( N − V ∗ ( j ( p )) is also linear in p ; when j ( p ) > p ∗ , the nonlinear part of this termsimplifies to − ( N − Cλ µ +10 u ( p ; µ ) /λ µ . This shows that f is concave, and strictly concaveon the interval of all p for which j ( p ) > p ∗ . As lim p → f ( p ) >
0, this in turn implies that f has at most one root in the open unit interval; if so, f assumes negative values to the left ofthe root, and positive values to the right.As V N,p ∗ ( j ( p ∗ )) > V ∗ ( j ( p ∗ )), moreover, we have f ( p ∗ ) > λ ( p ∗ )[ V ∗ ( j ( p ∗ )) − s ] − rc ( p ∗ ) = 0by (A.7). Any root of f must thus lie in [0 , p ∗ ). If j ( p ∗ N ) ≤ p ∗ , then V ∗ ( j ( p ∗ N )) = s and f ( p ∗ N ) = λ ( p ∗ N )[ N V ∗ N ( j ( p ∗ N )) − N s ] − rc ( p ∗ N ) = 0 by (A.8). If j ( p ∗ N ) > p ∗ , then V ∗ ( j ( p ∗ N )) > s and f ( p ∗ N ) <
0, so f has a root in ( p ∗ N , p ∗ ).The following result is used in the proof of Proposition 2. Lemma A.10
Let ρ = 0. Then μ(μ + 1) > N μ_N(μ_N + 1).

Proof: We change variables to β = λ₀/λ₁ and x = r/λ₁, so that μ_N and μ are implicitly defined as the positive solutions of the equations

x/N + β − (1 − β) μ_N = β^{μ_N + 1},
x + β − (1 − β) μ = β^{μ + 1}.

Fixing β ∈ [0,
1) and considering µ N and µ as functions of x ∈ (0 , ∞ ), we obtain µ ′ N = N − − β + β µ N +1 ln β = N − − β + (cid:2) xN + β − (1 − β ) µ N (cid:3) ln β ,µ ′ = 11 − β + β µ +1 ln β = 11 − β + [ x + β − (1 − β ) µ ] ln β . (All denominators are positive because 1 − β + β µ +1 ln β ≥ − β + β ln β > µ ≥ d = µ ( µ + 1) − N µ N ( µ N + 1). As lim x → µ N = lim x → µ = 0, we see thatlim x → d = 0 as well. It is thus enough to show that d ′ > x >
0. This is the case if,and only if, (2 µ + 1) µ ′ > N (2 µ N + 1) µ ′ N , that is,(2 µ +1) (cid:8) − β + (cid:2) xN + β − (1 − β ) µ N (cid:3) ln β (cid:9) > (2 µ N +1) { − β + [ x + β − (1 − β ) µ ] ln β } . This inequality reduces to( µ − µ N ) (cid:8) − β ) + (cid:2) xN + 1 + β (cid:3) ln β (cid:9) > (2 µ N + 1) (cid:2) x − xN (cid:3) ln β. It is straightforward to show that µ > µ N + − β (cid:2) x − xN (cid:3) . So d ′ > − β ) + (cid:2) xN + 1 + β (cid:3) ln β > (2 µ N + 1)(1 − β ) ln β, which simplifies to 1 − β + (cid:2) xN + β − (1 − β ) µ N (cid:3) ln β > B Proofs
B.1 Main Results (Theorem 1 and Propositions 1–4)
Proof of Theorem 1:
For ρ >
0, this result is an immediate consequence of inequalities(2), the fact that lim inf ∆ → W ∆PBE ≥ V ∗ and W ∆PBE ≤ V ∗ N , and Proposition 6. For ρ = 0,the result follows from inequalities (2), the fact lim inf ∆ → W ∆PBE ≥ V ∗ , and Propositions 7and 10. Proof of Proposition 1:
Keller and Rady (2010) establish that in the unique symmetric MPE of the continuous-time game, all experimentation stops at the belief p̃_N implicitly defined by rc(p̃_N) = λ(p̃_N)[ũ(j(p̃_N)) − s], where ũ is the players' common equilibrium payoff function. The results of Keller and Rady (2010) further imply that V_{N,p̃_N}(j(p̃_N)) > ũ(j(p̃_N)) > V*(j(p̃_N)), so that N V_{N,p̃_N}(j(p̃_N)) − (N − 1) V*(j(p̃_N)) > ũ(j(p̃_N)), and hence p̂ < p̃_N by Lemma A.9.

Proof of Proposition 2:
There is nothing to show for λ = 0. Using the same change ofvariables as in the previous proof, we fix β ∈ (0 , q = β · µ − N µ − , so that j ( p ∗ N ) ≤ p ∗ if, and only if, q ≥
1. As lim x →∞ µ N = lim x →∞ µ = ∞ , we havelim x →∞ q = β <
1. As lim x → µ N = lim x → µ = 0, moreover,lim x → q = β lim x → µ µ N = β lim x → µ ′ µ ′ N = βN by l’Hˆopital’s rule. Finally, q ′ is easily seen to have the same sign as − µ ( µ + 1)(1 − β + β µ +1 ln β ) + N µ N ( µ N + 1)(1 − β + β µ N +1 ln β ) . As β µ +1 ln β > β µ N +1 ln β , Lemma A.10 implies that q decreases strictly in x . This in turnimplies that q < x ∈ (0 , ∞ ) when βN ≤
1, which proves the first part of the corollary.Otherwise, there exists a unique x ∗ ∈ (0 , ∞ ) at which q = 1. The second part of the corollarythus holds with ( λ ∗ , λ ∗ ) = ( r/x ∗ , βr/x ∗ ).It is straightforward to see that x varies continuously with β and that lim β → /N x ∗ = 0.So it remains to show that x ∗ remains bounded as β →
1. Rewriting the defining equationfor x ∗ as 1 + 1(1 − β ) µ ( x ∗ ( β ) , β ) = 1(1 − β ) µ N ( x ∗ ( β ) , β ) , we see that (1 − β ) µ N ( x ∗ ( β ) , β ) must stay bounded as β →
1. By the defining equation for µ N , x ∗ ( β ) must then also stay bounded. Proof of Proposition 3:
For the case that ˆ p = p ∗ N , this is shown in Keller and Rady(2010). Thus, in what follows we assume that ˆ p > p ∗ N .Recall the defining equation for ˆ p from Lemma A.9, λ (ˆ p ) N V N, ˆ p ( j (ˆ p )) − λ (ˆ p ) s − rc (ˆ p ) = ( N − λ (ˆ p ) V ∗ ( j (ˆ p )) . We make use of the closed-form expression for V N, ˆ p to rewrite its left-hand side as N λ (ˆ p ) λ ( j (ˆ p )) h + N c (ˆ p )[ λ − µ N ( λ − λ )] − λ (ˆ p ) s. Similarly, by noting that ˆ p > p ∗ N implies j (ˆ p ) > j ( p ∗ N ) > p ∗ , we can make use of the closed- orm expression for V ∗ to rewrite the right-hand side as( N − λ (ˆ p ) λ ( j (ˆ p )) h + ( N − c ( p ∗ ) u (ˆ p ; µ ) u ( p ∗ ; µ ) [ r + λ − µ ( λ − λ )] . Combining, we have λ (ˆ p ) λ ( j (ˆ p )) h + N c (ˆ p )[ λ − µ N ( λ − λ )] − λ (ˆ p ) s ( N − r + λ − µ ( λ − λ )] c ( p ∗ ) = u (ˆ p ; µ ) u ( p ∗ ; µ ) . It is convenient to change variables to β = λ λ and y = λ λ λ h − ss − λ h ˆ p − ˆ p . The implicit definitions of µ and µ N imply N = β µ − β + µ (1 − β ) β µ N − β + µ N (1 − β ) , allowing us to rewrite the defining equation for ˆ p as the equation F ( y, µ N ) = 0 with F ( y, µ ) = 1 − y + [ β (1 + µ ) y − µ ] 1 − ββ β µ − β + µ (1 − β )( µ − µ )(1 − β ) + β µ − β µ − µ µ (1 + µ ) µ y − µ . As y is a strictly increasing function of ˆ p , we know from Lemma A.9 that F ( · , µ N ) admits aunique root, and that it is strictly increasing in a neighborhood of this root.A straightforward computation shows that ∂F ( y, µ N ) ∂µ = 1 − ββ β µ − β + µ (1 − β )(( µ − µ N )(1 − β ) + β µ − β µ N ) ζ ( y, µ N ) , with ζ ( y, µ ) = β (1 − β )(1 + µ ) y − (1 − β ) µ + (1 − βy )( β µ − β µ ) + β µ ( β (1 + µ ) y − µ ) ln β. As p ∗ N < ˆ p < p ∗ , we have µ N µ N < βy < µ µ , which implies ζ ( y, µ ) = ( β (1 + µ ) y − µ ) (1 − β + β µ ln β ) < , and ∂ζ ( y, µ ) ∂µ = β µ [ β (1 + µ ) y − µ ](ln β ) > , for all µ ∈ [ µ N , µ ]. This establishes ζ ( y, µ N ) < y the implicit function theorem, therefore, y is increasing in µ N . 
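The two monotonicity facts that complete this argument, μ_N decreasing in N (recalled from Keller and Rady (2010)) and the inequality of Lemma A.10, can be spot-checked numerically from the implicit equation x/N + β − (1 − β)μ_N = β^{μ_N+1} used in the proof of Lemma A.10. The sketch below uses illustrative parameter values (β = 0.5, x = 1), an assumption for demonstration only:

```python
def mu(N, beta=0.5, x=1.0):
    """Positive root of (1-beta)*m + beta**(m+1) - beta = x/N, by bisection.

    The left-hand side vanishes at m = 0 and is strictly increasing in m
    (its derivative exceeds 1 - beta + beta*log(beta) > 0 for beta in (0,1)),
    so the root is unique and bracketed by [0, 100] for these parameters.
    """
    h = lambda m: (1 - beta) * m + beta ** (m + 1) - beta - x / N
    lo, hi = 0.0, 100.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

mus = {N: mu(N) for N in (1, 2, 3, 5)}

# mu_N is decreasing in the number of players N:
assert mus[1] > mus[2] > mus[3] > mus[5]

# Lemma A.10: mu(mu + 1) > N mu_N(mu_N + 1) for N > 1:
for N in (2, 3, 5):
    assert mus[1] * (mus[1] + 1) > N * mus[N] * (mus[N] + 1)
```

The same bisection can be rerun for other β ∈ (0, 1) and x > 0; the asserted orderings are what Lemma A.10 and the comparative statics above predict at any such parameter choice.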
Recalling from Kellerand Rady (2010) that µ N is decreasing in N , we have thus shown that y (and hence ˆ p ) aredecreasing in N . Proof of Proposition 4:
Simple algebra yields j ( p ∗ N ) p ∗ = λ λ µ N µ ( µ + 1)( λ h − s ) + µ ( s − λ h )( µ N + 1)( λ h − s ) + ( λ /λ ) µ N ( s − λ h ) . From the implicit definitions of µ and µ N , we obtain lim r → µ = lim r → µ N = 0 (so thatthe third fraction in the previous expression converges to 1) andlim r → ∂µ ∂r = (cid:20) λ − λ + λ ln λ λ (cid:21) − = N lim r → ∂µ N ∂r , implying lim r → µ N µ = 1 N , by l’Hˆopital’s rule.Furthermore, we note that we may write equivalently j ( p ∗ N ) p ∗ = λ λ (1 + µ )( λ h − s ) + ( s − λ h )(1 + µ N )( λ h − s ) + ( λ /λ )( s − λ h ) . As lim r →∞ µ = lim r →∞ µ N = ∞ , we can immediately conclude that this ratio converges tothe stated limit for r → ∞ . B.2 Learning with a Brownian Component (Propositions 5–6)
The proof of Proposition 5 rests on a sequence of lemmas that prove incentive compatibilityof the proposed strategies on various subintervals of [0 , ρ is stated, the respective result holds irrespectively of whether ρ > ρ = 0.In view of Lemmas A.4 and A.5, we take p and ¯ p such that p ∗ N < p < p ‡ < p ∗ < p m < max { p ⋄ , ˇ p } < ¯ p < . (B.9)The first two lemmas deal with the safe action ( κ = 0) on the interval [0 , ¯ p ]. Lemma B.1
For all p ≤ p ‡ , (1 − δ ) s + δw ∆ ( p ) ≥ (1 − δ ) m ( p ) + δ E ∆1 w ∆ ( p ) . Proof: As w ∆ ( p ) ≥ s = w ∆ ( p ) for p ≤ p ‡ , we have (1 − δ ) s + δw ∆ ( p ) ≥ s whereas s ≥ (1 − δ ) m ( p ) + δ E ∆1 w ∆ ( p ) by the functional equation for w ∆ . emma B.2 There exists ∆ ( p ‡ , ¯ p ] > such that (1 − δ ) s + δw ∆ ( p ) ≥ (1 − δ ) m ( p ) + δ E ∆1 w ∆ ( p ) , for all p ∈ ( p ‡ , ¯ p ] and ∆ < ∆ ( p ‡ , ¯ p ] . Proof:
By Lemmas A.3 and A.4, there exist ν > > w ∆ ( p ) − w ∆ ( p ) ≥ ν for all p ∈ [ p ‡ , ¯ p ] and ∆ < ∆ . Further, there is a ∆ ∈ (0 , ∆ ] such that |E ∆1 w ∆ ( p ) − w ∆ ( p ) | ≤ ν for all p ∈ [ p ‡ , ¯ p ] and ∆ < ∆ . For these p and ∆, we thus have(1 − δ ) s + δw ∆ ( p ) − (cid:2) (1 − δ ) m ( p ) + δ E ∆1 w ∆ ( p ) (cid:3) ≥ (1 − δ )[ s − m ( p )] + δ ν . Finally, there is a ∆ ( p ‡ , ¯ p ] ∈ (0 , ∆ ] such that the right-hand side of this inequality is positivefor all p ∈ ( p ‡ , ¯ p ] and ∆ < ¯∆.We establish incentive compatibility of the risky action ( κ = 1) to the immediate right of p by means of the following result. Lemma B.3
Let X be a Gaussian random variable with mean m and variance V.

1. For all η > 0, P[X − m > η] < V/η².
2. There exists V ∈ (0 , such that for all V < V , P h V ≤ X − m ≤ V i ≥ − V . Proof:
The first statement is a trivial consequence of Chebysheff’s inequality. The proof ofthe second relies on the following inequality (13.48) of Johnson et al. (1994) for the standardnormal cumulative distribution function:12 h − e − x / ) i ≤ Φ( x ) ≤ h − e − x ) i . Letting Φ V denote the cdf of the Gaussian distribution with variance V (and mean 0), andusing the above upper and lower bounds, we have + Φ V ( V ) − Φ V ( V ) √ V ≤ − q − e − √ V + p − e −√ V √ V .
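In terms of the error function, the bounds invoked here are of the standard form √(1 − e^{−z²}) ≤ erf(z) ≤ √(1 − e^{−4z²/π}) for z ≥ 0, equivalently ½[1 + √(1 − e^{−x²/2})] ≤ Φ(x) ≤ ½[1 + √(1 − e^{−2x²/π})] for the standard normal cdf Φ; the exact exponents stated here are our reading of inequality (13.48) in Johnson et al. (1994) and should be checked against that source. A quick numerical verification of the sandwich:

```python
import math

def Phi(x):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# The bounds hold strictly for every x > 0; check them on a grid.
for i in range(1, 51):
    x = 0.1 * i
    lower = 0.5 * (1.0 + math.sqrt(1.0 - math.exp(-0.5 * x * x)))
    upper = 0.5 * (1.0 + math.sqrt(1.0 - math.exp(-2.0 * x * x / math.pi)))
    assert lower < Phi(x) < upper
```

Both bounds are tight as x → 0 and as x → ∞, which is what makes them useful for the small-variance estimates in this proof.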
Writing x = √ V and using the fact that 1 − √ − y ≤ √ y for 0 ≤ y ≤
1, moreover, we have1 − q − e − x + √ − e − x √ x ≤ s e − x x + 12 r − e − x x → , s x →
0. Thus, + Φ V ( V ) − Φ V ( V ) √ V ≤ , for sufficiently small V , which is the second statement of the lemma.We apply this lemma to the log odds ratio ℓ associated with the current belief p . Forlater use, we note that dp/dℓ = p (1 − p ). Lemma B.4
Let ρ > . There exist ε ∈ (0 , p ‡ − p ) and ∆ ( p,p + ε ] > such that (1 − δ ) m ( p ) + δ E ∆ N w ∆ ( p ) ≥ (1 − δ ) s + δ E ∆ N − w ∆ ( p ) , for all p ∈ ( p, p + ε ] and ∆ < ∆ ( p,p + ε ] . Proof:
Consider a belief p = p and the corresponding log odds ratio ℓ . Let K playersuse the risky arm on the time interval [0 , ∆) and consider the resulting belief p ( K )∆ and theassociated log odds ratio ℓ ( K )∆ .Let P θ denote the probability measure associated with state θ ∈ { , } . Expected contin-uation payoffs are computed by means of the measure P p = p P + (1 − p ) P .Let J ∆0 denote the event that no lump-sum arrives by time ∆. The probability of J ∆0 under the measure P θ is e − λ θ ∆ . Note that e − λ θ ∆ P θ [ A | J ∆0 ] ≤ P θ [ A ] ≤ e − λ θ ∆ P θ [ A | J ∆0 ] + 1 − e − λ θ ∆ , for any event A .As we have seen in Appendix A.1, conditional on J ∆0 , the random variable ℓ ( K )∆ is normallydistributed with mean ℓ − K (cid:0) λ − λ − ρ (cid:1) ∆ and variance Kρ ∆ under P , and normallydistributed with mean ℓ − K (cid:0) λ − λ + ρ (cid:1) ∆ and variance Kρ ∆ under P .Now choose ε > p + ε < p ‡ . Write ℓ , ℓ ε , ℓ ‡ and ¯ ℓ for the log odds ratiosassociated with p , p + ε , p ‡ and ¯ p , respectively. Choose ∆ > ν = min (∆ ,ℓ ) ∈ [0 , ∆ ] × [ ℓ,ℓ ε ] h ℓ ‡ − ℓ + ( N − (cid:16) λ − λ − ρ (cid:17) ∆ i > . or all p ∈ ( p, p + ε ] and ∆ ∈ (0 , ∆ ), the first part of Lemma B.3 now implies P p h p ( N − > p ‡ i = P p h ℓ ( N − > ℓ ‡ i ≤ p n e − λ ∆ P h ℓ ( N − > ℓ ‡ (cid:12)(cid:12)(cid:12) J ∆0 i + 1 − e − λ ∆ o + (1 − p ) n e − λ ∆ P h ℓ ( N − > ℓ ‡ (cid:12)(cid:12)(cid:12) J ∆0 i + 1 − e − λ ∆ o ≤ p (cid:26) e − λ ∆ ( N − ρ ∆ ν + 1 − e − λ ∆ (cid:27) + (1 − p ) (cid:26) e − λ ∆ ( N − ρ ∆ ν + 1 − e − λ ∆ (cid:27) ≤ ( N − ρ ∆ ν + 1 − e − λ ∆ ≤ (cid:26) ( N − ρν + λ (cid:27) ∆ . As w ∆ ≤ s + ( m − s ) ( p ‡ , , moreover, E ∆ N − w ∆ ( p ) ≤ s + ( m − s ) P p h p ( N − > p ‡ i . So there exists C > E ∆ N − w ∆ ( p ) ≤ s + C ∆ for all p ∈ ( p, p + ε ] and ∆ ∈ (0 , ∆ ).Next, define ν = min p ≤ p ≤ ¯ p p (1 − p ) and note that for p ≤ p ≤ ¯ p (and thus for ℓ ≤ ℓ ≤ ¯ ℓ ), V N,p ( p ) ≥ s + max n , V ′ N,p ( p +)( p − p ) o ≥ s + max n , V ′ N,p ( p +) ν ( ℓ − ℓ ) o . 
By the second part of Lemma B.3, there exists ∆ > N ρ ∆ < P h ( N ρ ∆) ≤ ℓ ( N )∆ − ℓ + N (cid:16) λ − λ − ρ (cid:17) ∆ ≤ ( N ρ ∆) (cid:12)(cid:12)(cid:12) J ∆0 i ≥ − ( N ρ ∆) , for arbitrary ℓ and all ∆ ∈ (0 , ∆ ). In particular, P p h ( N ρ ∆) ≤ ℓ ( N )∆ − ℓ + N (cid:16) λ − λ − ρ (cid:17) ∆ ≤ ( N ρ ∆) i ≥ p P h ( N ρ ∆) ≤ ℓ ( N )∆ − ℓ + N (cid:16) λ − λ − ρ (cid:17) ∆ ≤ ( N ρ ∆) i ≥ pe − λ ∆ P h ( N ρ ∆) ≤ ℓ ( N )∆ − ℓ + N (cid:16) λ − λ − ρ (cid:17) ∆ ≤ ( N ρ ∆) (cid:12)(cid:12)(cid:12) J ∆0 i ≥ pe − λ ∆ (cid:18) − ( N ρ ∆) (cid:19) , for these ∆. Taking ∆ smaller if necessary, we can also ensure that ℓ < ℓ − N (cid:16) λ − λ − ρ (cid:17) ∆ + ( N ρ ∆) < ℓ − N (cid:16) λ − λ − ρ (cid:17) ∆ + ( N ρ ∆) < ¯ ℓ, for all ℓ ∈ ( ℓ, ℓ ε ] and all ∆ ∈ (0 , ∆ ).By Lemma A.3, there exists ∆ ∈ (0 , ∆ ) such that w ∆ ≥ V N,p for ∆ ∈ (0 , ∆ ). For such and p ∈ ( p, p + ε ], we now have E ∆ N w ∆ ( p ) ≥ s + pe − λ ∆ (cid:18) − ( N ρ ∆) (cid:19) V ′ N,p ( p +) ν h ℓ − N (cid:16) λ − λ − ρ (cid:17) ∆ + ( N ρ ∆) − ℓ i ≥ s + p (1 − λ ∆) (cid:18) − ( N ρ ∆) (cid:19) V ′ N,p ( p +) ν h − N (cid:16) λ − λ − ρ (cid:17) ∆ + ( N ρ ∆) i . This implies the existence of ∆ ∈ (0 , ∆ ) and C > E ∆ N w ∆ ( p ) ≥ s + C ∆ , for all p ∈ ( p, p + ε ] and ∆ ∈ (0 , ∆ ).For p ∈ ( p, p + ε ] and ∆ ∈ (0 , min { ∆ , ∆ } ), finally,(1 − δ ) m ( p ) + δ E ∆ N w ∆ ( p ) − (cid:2) (1 − δ ) s + δ E ∆ N − w ∆ ( p ) (cid:3) ≥ (1 − δ )[ m ( p ) − s ] + δ n C ∆ − C ∆ o = C ∆ − (cid:8) r [ s − m ( p )] + C (cid:9) ∆ + o (∆) . As the term in ∆ dominates as ∆ becomes small, there exists ∆ ( p,p + ε ] ∈ (0 , min { ∆ , ∆ } )such that this expression is positive for all p ∈ ( p, p + ε ] and ∆ < ∆ ( p,p + ε ] . Lemma B.5
For all ε ∈ (0 , p ‡ − p ) , there exists ∆ ( p + ε, ¯ p ] > such that (1 − δ ) m ( p ) + δ E ∆ N w ∆ ( p ) ≥ (1 − δ ) s + δ E ∆ N − w ∆ ( p ) , for all p ∈ ( p + ε, ¯ p ] and ∆ < ∆ ( p + ε, ¯ p ] . Proof:
First, by Lemma A.3, there exists ∆ > w ∆ ≥ V N,p on the unitinterval. Second, by Lemma A.4, there exist ν > η > ∈ (0 , ∆ ) such that V N,p ( p ) − w ∆ ( p ) ≥ ν for all p ∈ [ p + ε , ¯ p + η ] and ∆ < ∆ . For these p and ∆, and byconvexity of V N,p , we then have E ∆ N w ∆ ( p ) − E ∆ N − w ∆ ( p ) ≥ E ∆ N V N,p ( p ) − E ∆ N − w ∆ ( p ) ≥ E ∆ N − V N,p ( p ) − E ∆ N − w ∆ ( p ) ≥ χ ∆ ( p ) ν + [1 − χ ∆ ( p )]( s − m ) , where χ ∆ ( p ) denotes the probability that the belief p t +∆ lies in [ p + ε , ¯ p + η ] given that p t = p and N − ∈ (0 , ∆ )such that χ ∆ ( p ) ≥ ν + m − sν + m − s , for all p ∈ ( p + ε, ¯ p ] and ∆ < ∆ . For these p and ∆, we thus have(1 − δ ) m ( p ) + δ E ∆ N w ∆ ( p ) − (cid:2) (1 − δ ) s + δ E ∆ N − w ∆ ( p ) (cid:3) ≥ (1 − δ )[ m ( p ) − s ] + δ ν . inally, there is a ∆ ( p + ε, ¯ p ] ∈ (0 , ∆ ) such that the right-hand side of this inequality is positivefor all p ∈ ( p + ε, ¯ p ] and ∆ < ∆ ( p + ε, ¯ p ] . Lemma B.6
There exists ∆ (¯ p, > such that (1 − δ ) m ( p ) + δ E ∆ N w ∆ ( p ) ≥ (1 − δ ) s + δ E ∆ N − w ∆ ( p ) , for all p > ¯ p and ∆ < ∆ (¯ p, . Proof:
By Lemmas A.3 and A.4, there exists ∆ (¯ p, > w ∆ ≥ w ∆ for all∆ < ∆ (¯ p, . For such ∆ and all p > ¯ p , we thus have(1 − δ ) m ( p ) + δ E ∆ N w ∆ ( p ) = w ∆ ( p ) ≥ w ∆ ( p ) ≥ (1 − δ ) s + δ E ∆ N − w ∆ ( p ) , with the last inequality following from the functional equation for w ∆ . Proof of Proposition 5:
Given p and ¯ p as in (B.9), choose ε > ( p,p + ε ] as inLemma B.4, and ∆ ( p ‡ , ¯ p ] , ∆ ( p + ε, ¯ p ] and ∆ (¯ p, as in Lemmas B.2, B.5 and B.6. The two-stateautomaton is an SSE for all∆ < min n ∆ ( p ‡ , ¯ p ] , ∆ ( p,p + ε ] , ∆ ( p + ε, ¯ p ] , ∆ (¯ p, o . So the statement of the proposition holds with p ♭ = p ‡ and p ♯ = max { ˇ p, p ⋄ } . Proof of Proposition 6:
Let ε > V N,p in Section 5 and Lemma A.5 allow us to choose p ∈ ( p ∗ N , p ♭ ) and ¯ p ∈ ( p ♯ ,
1) such that V N,p > V ∗ N − ε and w ∆ < V ∗ + ε for all ∆ >
0. Second, Lemmas A.2 and A.3 and Proposition5 imply the existence of a ∆ † > ∈ (0 , ∆ † ): W ∆1 > V ∗ − ε , w ∆ ≥ V N,p , and w ∆ and w ∆ are SSE payoff functions of the game with period length ∆. Third, W ∆PBE ≤ V ∗ N for all ∆ > ∈ (0 , ∆ † ), we thus have V ∗ N − ε < V N,p ≤ w ∆ ≤ W ∆SSE ≤ W ∆PBE ≤ V ∗ N , and V ∗ − ε < W ∆1 ≤ W ∆PBE ≤ W ∆SSE ≤ w ∆ < V ∗ + ε, so that k W ∆PBE − V ∗ N k , k W ∆SSE − V ∗ N k , k W ∆PBE − V ∗ k and k W ∆SSE − V ∗ k are all smaller than ε , which was to be shown. B.3 Pure Poisson Learning (Propositions 7–10)
Proof of Proposition 7:
For any given ∆ >
0, let p̃_∆ be the infimum of the set of beliefs at which there is some PBE that gives a payoff w_n(p) > s to at least one player. Let p̃ = lim inf_{∆→0} p̃_∆. For any fixed ε >
0, consider the problem of maximizing the players’ averagepayoff subject to no use of the risky arm at beliefs p ≤ ˜ p − ε . Denote the corresponding valuefunction by f W ∆ ,ε . By the definition of ˜ p , there exists a ˜∆ ε > ∈ (0 , ˜∆ ε ),the function f W ∆ ,ε provides an upper bound on the players’ average payoff in any PBE, andso W ∆PBE ≤ f W ∆ ,ε . The value function of the continuous-time version of this maximizationproblem is V N,p ε with p ε = max { ˜ p − ε, p ∗ N } . As the discrete-time solution is also feasible incontinuous time, we have f W ∆ ,ε ≤ V N,p ε , and hence W ∆PBE ≤ V N,p ε for ∆ < ˜∆ ε .Consider a sequence of such ∆’s converging to 0 such that the corresponding beliefs ˜ p ∆ converge to ˜ p . For each ∆ in this sequence, select a belief p ∆ > ˜ p ∆ with the following twoproperties: (i) starting from p ∆ , a single failed experiment takes us below ˜ p ∆ ; (ii) given theinitial belief p ∆ , there exists a PBE for reaction lag ∆ in which at least one player plays riskywith positive probability in the first round. Select such an equilibrium for each ∆ in thesequence and let L ∆ be the number of players in this equilibrium who, at the initial belief p ∆ , play risky with positive probability. Let L be an accumulation point of the sequence of L ∆ ’s. After selecting a subsequence of ∆’s, we can assume without loss of generality thatplayer n = 1 , . . . , L plays risky with probability π ∆ n > p ∆ , while player n = L + 1 , . . . , N plays safe; we can further assume that ( π ∆ n ) Ln =1 converges to a limit ( π n ) Ln =1 in [0 , L .For player n = 1 , . . . 
, L} to play optimally at p^Δ, it must be the case that
\[
(1-\delta)\left[\pi^\Delta_n \lambda(p^\Delta)h + (1-\pi^\Delta_n)s\right] + \delta\left\{\mathrm{Pr}^\Delta(\emptyset)\,w^\Delta_{n,\emptyset} + \sum_{K=1}^{L}\,\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,w^\Delta_{n,I,J}\right\}
\]
\[
\geq\; (1-\delta)s + \delta\left\{\mathrm{Pr}^\Delta_{-n}(\emptyset)\,w^\Delta_{n,\emptyset} + \sum_{K=1}^{L-1}\,\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,w^\Delta_{n,I,J}\right\},
\]
where we write Pr^Δ(I) for the probability that the set of players experimenting is I ⊆ {1, …, L}, Pr^Δ_{−n}(I) for the probability that among the L−1 players in {1, …, L}∖{n} the set of players experimenting is I, and w^Δ_{n,I,J} for the conditional expectation of player n's continuation payoff given that exactly the players in I were experimenting and had J successes (w^Δ_{n,∅} is player n's continuation payoff if no one was experimenting). As Pr^Δ(∅) = (1−π^Δ_n) Pr^Δ_{−n}(∅) ≤ Pr^Δ_{−n}(∅), the inequality continues to hold when we replace w^Δ_{n,∅} by its lower bound s. After subtracting (1−δ)s from both sides, we then have
\[
(1-\delta)\pi^\Delta_n\left[\lambda(p^\Delta)h - s\right] + \delta\left\{\mathrm{Pr}^\Delta(\emptyset)\,s + \sum_{K=1}^{L}\,\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,w^\Delta_{n,I,J}\right\}
\geq \delta\left\{\mathrm{Pr}^\Delta_{-n}(\emptyset)\,s + \sum_{K=1}^{L-1}\,\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,w^\Delta_{n,I,J}\right\}.
\]
Summing these inequalities over n = 1, …, L and writing π̄^Δ = (1/L) ∑_{n=1}^{L} π^Δ_n yields
\[
(1-\delta)L\bar\pi^\Delta\left[\lambda(p^\Delta)h - s\right] + \delta\left\{\mathrm{Pr}^\Delta(\emptyset)\,Ls + \sum_{K=1}^{L}\,\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\sum_{n=1}^{L} w^\Delta_{n,I,J}\right\}
\]
\[
\geq\; \delta\left\{\sum_{n=1}^{L}\mathrm{Pr}^\Delta_{-n}(\emptyset)\,s + \sum_{n=1}^{L}\sum_{K=1}^{L-1}\,\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,w^\Delta_{n,I,J}\right\}.
\]
By construction, w^Δ_{n,I,0} = s whenever I ≠ ∅. For |I| = K > 0 and J > 0, moreover, we have w^Δ_{n,I,J} ≥ W^Δ_1(B^Δ_{J,K}(p^Δ)) for all players n = 1, …, N, and hence
\[
\sum_{n=1}^{L} w^\Delta_{n,I,J} \;\leq\; N\,W^\Delta_{\mathrm{PBE}}(B^\Delta_{J,K}(p^\Delta)) - (N-L)\,W^\Delta_1(B^\Delta_{J,K}(p^\Delta)) \;\leq\; N\,V_{N,\tilde p-\varepsilon}(B^\Delta_{J,K}(p^\Delta)) - (N-L)\,W^\Delta_1(B^\Delta_{J,K}(p^\Delta)).
\]
So, for the preceding inequality to hold, it is necessary that
\[
(1-\delta)L\bar\pi^\Delta\left[\lambda(p^\Delta)h - s\right] + \delta\Big\{\mathrm{Pr}^\Delta(\emptyset)\,Ls + \sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\,\Lambda^\Delta_{0,K}(p^\Delta)\,Ls + \sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\sum_{J=1}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\left[N\,V_{N,\tilde p-\varepsilon}(B^\Delta_{J,K}(p^\Delta)) - (N-L)\,W^\Delta_1(B^\Delta_{J,K}(p^\Delta))\right]\Big\}
\]
\[
\geq\; \delta\Big\{\sum_{n=1}^{L}\mathrm{Pr}^\Delta_{-n}(\emptyset)\,s + \sum_{n=1}^{L}\sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\,\Lambda^\Delta_{0,K}(p^\Delta)\,s + \sum_{n=1}^{L}\sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=1}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,W^\Delta_1(B^\Delta_{J,K}(p^\Delta))\Big\}.
\]
As
\[
\mathrm{Pr}^\Delta(\emptyset) + \sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I) = 1 \quad\text{and}\quad \sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\,K = L\bar\pi^\Delta,
\]
we have the first-order expansions
\[
\mathrm{Pr}^\Delta(\emptyset) + \sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\,\Lambda^\Delta_{0,K}(p^\Delta) = \mathrm{Pr}^\Delta(\emptyset) + \sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\left(1 - K\lambda(p^\Delta)\Delta\right) + o(\Delta) = 1 - L\bar\pi^\Delta\lambda(p^\Delta)\Delta + o(\Delta)
\]
and
\[
\sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\,\Lambda^\Delta_{1,K}(p^\Delta) = \sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\,K\lambda(p^\Delta)\Delta + o(\Delta) = L\bar\pi^\Delta\lambda(p^\Delta)\Delta + o(\Delta),
\]
so, by uniform convergence W^Δ_1 → V* (Lemma A.2), the left-hand side of the last inequality expands as
\[
Ls + L\left\{ r\bar\pi\left[\lambda(\tilde p)h - s\right] - rs + \bar\pi\lambda(\tilde p)\left[N\,V_{N,\tilde p-\varepsilon}(j(\tilde p)) - (N-L)\,V^*(j(\tilde p)) - Ls\right]\right\}\Delta + o(\Delta),
\]
with π̄ = lim_{Δ→0} π̄^Δ. In the same way, the identities
\[
\mathrm{Pr}^\Delta_{-n}(\emptyset) + \sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I) = 1 \quad\text{and}\quad \sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\,K = L\bar\pi^\Delta - \pi^\Delta_n
\]
imply
\[
\sum_{n=1}^{L}\mathrm{Pr}^\Delta_{-n}(\emptyset) + \sum_{n=1}^{L}\sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\,\Lambda^\Delta_{0,K}(p^\Delta) = L - L(L-1)\bar\pi^\Delta\lambda(p^\Delta)\Delta + o(\Delta)
\]
and
\[
\sum_{n=1}^{L}\sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\,\Lambda^\Delta_{1,K}(p^\Delta) = L(L-1)\bar\pi^\Delta\lambda(p^\Delta)\Delta + o(\Delta),
\]
and so the right-hand side of the inequality expands as
\[
Ls + L\left\{ -rs + (L-1)\bar\pi\lambda(\tilde p)\left[V^*(j(\tilde p)) - s\right]\right\}\Delta + o(\Delta).
\]
Comparing terms of order Δ, dividing by L and letting ε →
0, we obtain
\[
\bar\pi\left\{ \lambda(\tilde p)\left[N\,V_{N,\tilde p}(j(\tilde p)) - (N-1)\,V^*(j(\tilde p)) - s\right] - r\,c(\tilde p) \right\} \;\geq\; 0.
\]
By Lemma A.9, this means p̃ ≥ p̂ whenever π̄ > 0. If π̄ = 0, we write the optimality condition for player n ∈ {1, …, L} as
\[
(1-\delta)\lambda(p^\Delta)h + \delta\sum_{K=0}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K+1}(p^\Delta)\,w^\Delta_{n,I\dot\cup\{n\},J}
\;\geq\; (1-\delta)s + \delta\left\{\mathrm{Pr}^\Delta_{-n}(\emptyset)\,w^\Delta_{n,\emptyset} + \sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,w^\Delta_{n,I,J}\right\}.
\]
As above, w^Δ_{n,∅} ≥ s, and w^Δ_{n,I,0} = s whenever I ≠ ∅. For |I| = K > 0 and J > 0, moreover, we have w^Δ_{n,I,J} ≥ W^Δ_1(B^Δ_{J,K}(p^Δ)), w^Δ_{n,I∪̇{n},J} ≥ W^Δ_1(B^Δ_{J,K+1}(p^Δ)) and w^Δ_{n,I∪̇{n},J} ≤ N V_{N,p̃−ε}(B^Δ_{J,K+1}(p^Δ)) − (N−1) W^Δ_1(B^Δ_{J,K+1}(p^Δ)). So, for the optimality condition to hold, it is necessary that
\[
(1-\delta)\lambda(p^\Delta)h + \delta\Big\{\sum_{K=0}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\,\Lambda^\Delta_{0,K+1}(p^\Delta)\,s + \sum_{K=0}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=1}^{\infty}\Lambda^\Delta_{J,K+1}(p^\Delta)\left[N\,V_{N,\tilde p-\varepsilon}(B^\Delta_{J,K+1}(p^\Delta)) - (N-1)\,W^\Delta_1(B^\Delta_{J,K+1}(p^\Delta))\right]\Big\}
\]
\[
\geq\; (1-\delta)s + \delta\Big\{\mathrm{Pr}^\Delta_{-n}(\emptyset)\,s + \sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\,\Lambda^\Delta_{0,K}(p^\Delta)\,s + \sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=1}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,W^\Delta_1(B^\Delta_{J,K}(p^\Delta))\Big\}.
\]
Now,
\[
\sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\,K = L\bar\pi^\Delta - \pi^\Delta_n \to 0
\]
as Δ vanishes. Therefore, the left-hand side of the above inequality expands as
\[
s + \left\{ r\left[\lambda(\tilde p)h - s\right] + \lambda(\tilde p)\left[N\,V_{N,\tilde p-\varepsilon}(j(\tilde p)) - (N-1)\,V^*(j(\tilde p)) - s\right]\right\}\Delta + o(\Delta),
\]
and the right-hand side as s + o(Δ). Comparing terms of order Δ and letting ε → 0, we again obtain p̃ ≥ p̂.

The statement about the range of experimentation now follows immediately from the fact that for Δ < Δ̃_ε, we have W^Δ_PBE ≤ V_{N,p̃−ε}, and hence W^Δ_PBE = V_{N,p̃−ε} = s on [0, p̃−ε] ⊇ [0, p̂−ε]. The statement about the supremum of equilibrium payoffs follows from the inequality W^Δ_PBE ≤ V_{N,p̃−ε} for Δ < Δ̃_ε, the convergence V_{N,p̃−ε} → V_{N,p̃} as ε →
0, and the inequality V_{N,p̃} ≤ V_{N,p̂}.

We now turn to the proof of Proposition 8. The only difference to the case with a Brownian component is the proof of incentive compatibility to the immediate right of p̲. In view of Lemmas A.9, A.4 and A.5, we consider p̲ and p̄ such that
\[
\hat p < \underline p < p^\ddagger < p^* < p^m < \max\{p^\diamond, \check p\} < \bar p < 1. \tag{B.10}
\]

Lemma B.7 Let ρ = 0 and λ₀ > 0. There exists p♯ ∈ (max{p⋄, p̌}, 1) such that for all p̄ ∈ (p♯, 1), there exist ε ∈ (0, p‡ − p̲) and Δ_{(p̲,p̲+ε]} > 0 such that
\[
(1-\delta)\,m(p) + \delta\,\mathcal{E}^\Delta_N w^\Delta(p) \;\geq\; (1-\delta)\,s + \delta\,\mathcal{E}^\Delta_{N-1} w^\Delta(p)
\]
for all p ∈ (p̲, p̲+ε] and Δ < Δ_{(p̲,p̲+ε]}.

Proof: By Lemma A.3, there exists Δ̄₀ > 0 such that w^Δ ≥ V_{1,p̄} for Δ ∈ (0, Δ̄₀). By Lemma A.9,
\[
\lambda(p)\left[N\,V_{N,p}(j(p)) - (N-1)\,V^*(j(p)) - s\right] - r\,c(p) > 0
\]
for all p > p̂. As V_{N,p}(j(p)) ≤ V_{N,p̲}(j(p)) for p ≥ p̲, this implies
\[
\lambda(p)\left[N\,V_{N,\underline p}(j(p)) - (N-1)\,V^*(j(p)) - s\right] - r\,c(p) > 0
\]
for all p ≥ p̲. There thus exists p♯ > max{p⋄, p̌} such that for all p̄ > p♯,
\[
\lambda(p)\left[N\,V_{N,\underline p}(j(p)) - (N-1)\,V_{1,\bar p}(j(p)) - s\right] - r\,c(p) > 0
\]
for all p ∈ [p̲, p‡]. Fix p̄ ∈ (p♯, 1), let
\[
\nu = \min_{p\in[\underline p,\,p^\ddagger]}\left\{ \lambda(p)\left[N\,V_{N,\underline p}(j(p)) - (N-1)\,V_{1,\bar p}(j(p)) - s\right] - r\,c(p)\right\} > 0,
\]
and choose ε > 0 such that p̲ + ε < p‡ and
\[
\left(N\lambda(\underline p+\varepsilon) + r\right)\left[V_{N,\underline p}(\underline p+\varepsilon) - s\right] < \nu/3.
\]
In the remainder of the proof, we write p^K_J for the posterior belief starting from p when K players use the risky arm and J lump-sums arrive within the length of time Δ. For p ∈ (p̲, p̲+ε] and Δ ∈ (0, Δ̄₀),
\[
(1-\delta)m(p) + \delta\,\mathcal{E}^\Delta_N w^\Delta(p) \;\geq\; (1-\delta)m(p) + \delta\,\mathcal{E}^\Delta_N V_{N,\underline p}(p)
\]
\[
= r\Delta\, m(p) + (1-r\Delta)\left\{ N\lambda(p)\Delta\, V_{N,\underline p}(p^N_1) + (1-N\lambda(p)\Delta)\,V_{N,\underline p}(p^N_0)\right\} + O(\Delta^2)
\]
\[
= V_{N,\underline p}(p^N_0) + \left\{ r\,m(p) + N\lambda(p)\,V_{N,\underline p}(p^N_1) - (N\lambda(p)+r)\,V_{N,\underline p}(p^N_0)\right\}\Delta + O(\Delta^2),
\]
while
\[
(1-\delta)s + \delta\,\mathcal{E}^\Delta_{N-1} w^\Delta(p) = r\Delta\, s + (1-r\Delta)\left\{ (N-1)\lambda(p)\Delta\, w^\Delta(p^{N-1}_1) + \left[1-(N-1)\lambda(p)\Delta\right] w^\Delta(p^{N-1}_0)\right\} + O(\Delta^2)
\]
\[
= w^\Delta(p^{N-1}_0) + \left\{ r s + (N-1)\lambda(p)\, w^\Delta(p^{N-1}_1) - \left[(N-1)\lambda(p)+r\right] w^\Delta(p^{N-1}_0)\right\}\Delta + O(\Delta^2).
\]
As V_{N,p̲}(p^N_0) ≥ s = w^Δ(p^{N−1}_0), the difference (1−δ)m(p) + δ𝓔^Δ_N w^Δ(p) − [(1−δ)s + δ𝓔^Δ_{N−1} w^Δ(p)] is no smaller than Δ times
\[
\lambda(p)\left[ N\,V_{N,\underline p}(p^N_1) - (N-1)\,w^\Delta(p^{N-1}_1) - s\right] - r\,c(p) - (N\lambda(p)+r)\left[V_{N,\underline p}(p^N_0) - s\right],
\]
plus terms of order Δ² and higher.

Let ξ = ν/(9Nλ₁). By Lemma A.8 as well as Lipschitz continuity of V_{N,p̲} and V_{1,p̄}, there exists Δ̄₁ ∈ (0, Δ̄₀) such that ‖w^Δ − V_{1,p̄}‖, max_{p̲ ≤ p ≤ p‡} |V_{N,p̲}(p^N_1) − V_{N,p̲}(j(p))| and max_{p̲ ≤ p ≤ p‡} |V_{1,p̄}(p^{N−1}_1) − V_{1,p̄}(j(p))| are all smaller than ξ when Δ < Δ̄₁. For such Δ and p ∈ (p̲, p‡], we thus have V_{N,p̲}(p^N_1) > V_{N,p̲}(j(p)) − ξ and w^Δ(p^{N−1}_1) < V_{1,p̄}(j(p)) + 2ξ, so that the expression displayed above is larger than ν − 3Nλ(p)ξ − ν/3 ≥ ν/3. This implies existence of a Δ_{(p̲,p̲+ε]} ∈ (0, Δ̄₁) as in the statement of the lemma.

Proof of Proposition 8:
Given p̲ as in (B.10), take p♯ as in Lemma B.7 and fix p̄ > p♯. Choose ε > 0 and Δ_{(p̲,p̲+ε]} as in Lemma B.7, and Δ_{(p‡,p̄]}, Δ_{(p̲+ε,p̄]} and Δ_{(p̄,1)} as in Lemmas B.2, B.5 and B.6. The two-state automaton is an SSE for all
\[
\Delta < \min\left\{ \Delta_{(p^\ddagger,\bar p]},\; \Delta_{(\underline p,\underline p+\varepsilon]},\; \Delta_{(\underline p+\varepsilon,\bar p]},\; \Delta_{(\bar p,1)} \right\}.
\]
So the statement of the proposition holds with p♭ = p‡ and the chosen p♯.

For the proof of Proposition 9, we modify notation slightly, writing Λ for the probability that, conditional on θ = 1, a player has at least one success on his own risky arm in any given round (i.e., Λ = 1 − e^{−λΔ}), and g for the corresponding expected payoff per unit of time. Consider an SSE played at a given prior p, with associated payoff W. If K ≥ 1 players experiment, the posterior belief after a round without a success is p_K. Note that an SSE allows the continuation play to depend on the identity of these players. Taking the expectation over all possible combinations of K players who experiment, however, we can associate with each posterior p_K, K ≥ 1, an expected continuation payoff W_K. If K = 0, so that no player experiments, the belief does not evolve, but there is no reason that the continuation strategies (and so the payoff) should remain the same. We denote the corresponding payoff by W₀. In addition, we write π ∈ [0, 1] for the probability with which each player experiments at p, and q_K for the probability that at least one player has a success, given p, when K of them experiment. The players' common payoff must then satisfy the following optimality equation:
\[
W = \max\Bigg\{ (1-\delta)p g + \delta \sum_{K=0}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K}\left[ q_{K+1}\,g + (1-q_{K+1})\,W_{K+1}\right],
\]
\[
(1-\delta)s + \delta \sum_{K=1}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K}\left[ q_K\,g + (1-q_K)\,W_K\right] + \delta(1-\pi)^{N-1} W_0 \Bigg\}.
\]
The first term corresponds to the payoff from playing risky, the second from playing safe. As it turns out, it is more convenient to work with odds ratios ω = p/(1−p) and ω_K = p_K/(1−p_K), which we refer to as "belief" as well. Note that
\[
p_K = \frac{p(1-\Lambda)^K}{p(1-\Lambda)^K + 1 - p}
\]
implies that ω_K = (1−Λ)^K ω. Note also that
\[
1 - q_K = p(1-\Lambda)^K + 1 - p = (1-p)(1+\omega_K), \qquad q_K = p - (1-p)\,\omega_K = (1-p)(\omega - \omega_K).
\]
We define
\[
m = \frac{s}{g-s}, \qquad \upsilon = \frac{W-s}{(1-p)(g-s)}, \qquad \upsilon_K = \frac{W_K-s}{(1-p_K)(g-s)}.
\]
Note that υ ≥ 0, as s is a lower bound on the value. Simple computations now give
\[
\upsilon = \max\Bigg\{ \omega - (1-\delta)m + \delta\sum_{K=0}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K}\left(\upsilon_{K+1} - \omega_{K+1}\right),\;\; \delta\omega + \delta\sum_{K=0}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K}\left(\upsilon_K - \omega_K\right)\Bigg\}.
\]
It is also useful to introduce w = υ − ω and w_K = υ_K − ω_K. We then obtain
\[
w = \max\Bigg\{ -(1-\delta)m + \delta\sum_{K=0}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K} w_{K+1},\;\; -(1-\delta)\omega + \delta\sum_{K=0}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K} w_K \Bigg\}. \tag{B.11}
\]
We define
\[
\omega^* = \frac{(1-\delta)\,m}{\delta\,\Lambda}.
\]
This is the odds ratio corresponding to the single-agent cutoff p^Δ_1, i.e., ω* = p^Δ_1/(1−p^Δ_1). Note that p^Δ_1 > p* for Δ > 0. No experimentation can be sustained below p^Δ_1 or, in terms of odds ratios, below ω*. For all beliefs ω < ω*, therefore, any equilibrium has w = −ω, or υ = 0, for each player.

Proof of Proposition 9:
Following terminology from repeated games, we say that we can enforce action π ∈ {0, 1} at belief ω if we can construct an SSE for the prior belief ω in which players prefer to choose π in the first round rather than deviate unilaterally. Our first step is to derive sufficient conditions for enforcement of π ∈ {0, 1}. The conditions to enforce these actions are intertwined, and must be derived simultaneously.

Enforcing π = 0 at ω. To enforce π = 0 at ω, it suffices that one round of using the safe arm followed by the best equilibrium payoff at ω exceeds the payoff from one round of using the risky arm followed by the resulting continuation payoff at belief ω₁ (as only the deviating player will have experimented). See below for the precise condition.

Enforcing π = 1 at ω. If a player deviates to π = 0, we jump to w_{N−1} rather than w_N in case all experiments fail. Assume that at ω_{N−1} we can enforce π = 0. As explained above, this implies that at ω_{N−1}, a player's continuation payoff can be pushed down to what he would get by unilaterally deviating to experimentation, which is at most −(1−δ)m + δw_N, where w_N is the highest possible continuation payoff at belief ω_N. To enforce π = 1 at ω, it then suffices that
\[
w = -(1-\delta)m + \delta w_N \;\geq\; -(1-\delta)\omega + \delta\left(-(1-\delta)m + \delta w_N\right),
\]
with the same continuation payoff w_N on the left-hand side of the inequality. The inequality simplifies to
\[
\delta w_N \geq (1-\delta)m - \omega;
\]
by the formula for w, this is equivalent to w ≥ −ω, i.e., υ ≥ 0. Given that
\[
\upsilon = \omega - (1-\delta)m + \delta(\upsilon_N - \omega_N) = \left(1 - \delta(1-\Lambda)^N\right)\omega - (1-\delta)m + \delta\upsilon_N,
\]
to show that υ ≥ 0, it thus suffices that
\[
\omega \;\geq\; \frac{(1-\delta)\,m}{\delta\left(1-(1-\Lambda)^N\right)} \;=\; \tilde\omega,
\]
and that υ_N ≥ 0, which is necessarily the case if υ_N is an equilibrium payoff. Note that (1−Λ)^N ω̃ ≤ ω*, so that ω_N ≥ ω* implies ω ≥ ω̃. In summary, to enforce π = 1 at ω, it suffices that ω_N ≥ ω* and that π = 0 be enforceable at ω_{N−1}.

Enforcing π = 0 at ω (continued). Suppose we can enforce it at ω₁, ω₂, …, ω_{N−1}, and that ω_N ≥ ω*. Note that π = 1 is then enforceable at ω from our previous argument, given our hypothesis that π = 0 is enforceable at ω_{N−1}. It then suffices that
\[
-(1-\delta)\omega + \delta\left(-(1-\delta)m + \delta w_N\right) \;\geq\; -(1-\delta^N)m + \delta^N w_N,
\]
where again it suffices that this holds for the highest value of w_N. To understand this expression, consider a player who deviates by experimenting. Then the following period the belief is down one step, and if π = 0 is enforceable at ω₁, it means that his continuation payoff there can be chosen to be no larger than what he can secure at that point by deviating and experimenting again, etc. The right-hand side is then obtained as the payoff from N consecutive unilateral deviations to experimentation (in fact, we have picked an upper bound, as the continuation payoff after this string of deviations need not be the maximum w_N). The left-hand side is the payoff from playing safe one period before setting π = 1 and getting the maximum payoff w_N, a continuation strategy that is sequentially rational given that π = 1 is enforceable at ω by our hypothesis that π = 0 is enforceable at ω_{N−1}.

Plugging in the definition of υ_N, this inequality simplifies to
\[
(\delta^2 - \delta^N)\,\upsilon_N \;\geq\; (\delta^2 - \delta^N)(\omega_N - m) + (1-\delta)(\omega - m),
\]
which is always satisfied for beliefs ω ≤ m, i.e., below the myopic cutoff ω^m (which coincides with the normalized payoff m).

To summarize, if π = 0 can be enforced at the N−1 beliefs ω₁, …, ω_{N−1}, with ω_N ≥ ω* and ω ≤ ω^m, then both π = 0 and π = 1 can be enforced at ω.
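The identities underlying this enforcement argument can be checked numerically. The sketch below uses illustrative parameter values only (the names `posterior`, `Lam`, `omega_star`, `omega_tilde` are ours, not the paper's notation): it verifies that Bayesian updating indeed yields ω_K = (1−Λ)^K ω, the two expressions for q_K, and the comparison (1−Λ)^N ω̃ ≤ ω*, which is what makes ω_N ≥ ω* imply ω ≥ ω̃.

```python
# Numeric sanity checks for identities used in the enforcement argument.
# All parameter values are illustrative assumptions, not taken from the paper.
import math

def posterior(p, Lam, K):
    # Bayes rule after K failed experiments when a success is fully revealing:
    # P(theta = 1 | K failures) = p(1-Lam)^K / (p(1-Lam)^K + 1 - p)
    return p * (1 - Lam)**K / (p * (1 - Lam)**K + 1 - p)

p, Lam = 0.6, 0.3
omega = p / (1 - p)
for K in range(6):
    pK = posterior(p, Lam, K)
    omega_K = pK / (1 - pK)
    # odds-ratio update: omega_K = (1 - Lam)^K * omega
    assert abs(omega_K - (1 - Lam)**K * omega) < 1e-12
    qK = p * (1 - (1 - Lam)**K)   # prob. of at least one success among K
    assert abs((1 - qK) - (1 - p) * (1 + omega_K)) < 1e-12
    assert abs(qK - (1 - p) * (omega - omega_K)) < 1e-12

# (1-Lam)^N * omega_tilde <= omega_star, hence omega_N >= omega_star
# implies omega >= omega_tilde.
m = 1.0  # scale factor; the comparison is homogeneous in m
for delta in (0.9, 0.99):
    for Lam in (0.05, 0.2, 0.5):
        for N in (2, 3, 5):
            omega_star = (1 - delta) * m / (delta * Lam)
            omega_tilde = (1 - delta) * m / (delta * (1 - (1 - Lam)**N))
            assert (1 - Lam)**N * omega_tilde <= omega_star + 1e-12
print("ok")
```

The second loop confirms the comparison over a grid of discount factors, success probabilities, and player numbers; it holds in general because Λ(1−Λ)^N ≤ 1 − (1−Λ)^N.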
By induction, this implies that if we can find an interval of beliefs [ω_N, ω) with ω_N ≥ ω* for which π = 0 can be enforced, then both π = 0 and π = 1 can be enforced at all beliefs ω′ ∈ (ω, ω^m).

Our second step is to establish that such an interval of beliefs exists. This second step itself involves three steps. First, we derive some "simple" equilibrium, which is a symmetric Markov equilibrium. Second, we show that we can enforce π = 1 on sufficiently (finitely) many consecutive values of beliefs building on this equilibrium; third, we show that this can be used to enforce π = 0 as well.

It will be useful to distinguish beliefs according to whether they belong to the interval
\[
[\omega^*, (1+\lambda\Delta)\omega^*),\; [(1+\lambda\Delta)\omega^*, (1+2\lambda\Delta)\omega^*),\; \ldots
\]
For τ ∈ ℕ, let I_{τ+1} = [(1+τλΔ)ω*, (1+(τ+1)λΔ)ω*). For fixed Δ, every ω ≥ ω* can be uniquely mapped into a pair (x, τ) ∈ [0, 1) × ℕ such that ω = (1+λ(x+τ)Δ)ω*, and we alternatively denote beliefs by such a pair. Note also that, for small enough Δ > 0, one unsuccessful experiment takes a belief that belongs to the interval I_{τ+1} to (within O(Δ²) of) the interval I_τ. (Recall that Λ = λΔ + O(Δ²).)

Let us start with deriving a symmetric Markov equilibrium. Because it is Markovian, υ₀ = υ in our notation; that is, the continuation payoff when nobody experiments is equal to the payoff itself. Rewriting the equations, using the risky arm gives the payoff
\[
\upsilon = \omega - (1-\delta)m - \delta(1-\Lambda)(1-\pi\Lambda)^{N-1}\omega + \delta\sum_{K=0}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K}\,\upsilon_{K+1},
\]
while using the safe arm yields
\[
\upsilon = \delta\left(1-(1-\pi\Lambda)^{N-1}\right)\omega + \delta(1-\pi)^{N-1}\upsilon + \delta\sum_{K=1}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K}\,\upsilon_K.
\]
[Footnote: To pull out the terms involving the belief ω from the sum appearing in the definition of υ, use the fact that ∑_{K=0}^{N−1} \binom{N−1}{K} π^K (1−π)^{N−1−K} (1−Λ)^K = (1−πΛ)^{N−1}.]

In the Markov equilibrium we derive, players are indifferent between both actions, and so their payoffs are the same. Given any belief ω or corresponding pair (τ, x), we conjecture an equilibrium in which
\[
\pi = a(\tau,x)\,\Delta + O(\Delta^2), \qquad \upsilon = b(\tau,x)\,\Delta + O(\Delta^2),
\]
for some functions a, b of the pair (τ, x) only. Using the fact that Λ = λΔ + O(Δ²) and 1 − δ = rΔ + O(Δ²), we replace this in the two payoff expressions, and take Taylor expansions to get, respectively,
\[
0 = \left( r\,b(\tau,x) - \frac{\lambda^2 m r}{(\lambda+r)^2}\,(N-1)\,a(\tau,x) \right)\Delta + O(\Delta^2)
\]
and
\[
0 = \left( b(\tau,x) - \frac{r m\lambda}{\lambda+r}\,(\tau+x) \right)\Delta + O(\Delta^2).
\]
We then solve for a(τ,x), b(τ,x), to get
\[
\underline\pi = \frac{r(\lambda+r)(x+\tau)}{(N-1)\,\lambda}\,\Delta + O(\Delta^2),
\]
with corresponding value
\[
\underline\upsilon = \frac{\lambda m r}{\lambda+r}\,(x+\tau)\,\Delta + O(\Delta^2).
\]
This being an induction on K, it must be verified that the expansion indeed holds at the lowest interval, I₁, and this verification is immediate.

We now turn to the second step and argue that we can find N−1 consecutive beliefs at which π = 1 can be enforced.
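Two ingredients of these expansions can be verified numerically: the binomial identity used to pull the belief terms out of the sums, and the claim that one unsuccessful experiment moves a belief in I_{τ+1} to within O(Δ²) of I_τ. A minimal sketch with illustrative parameter values (the variable names are ours, not the paper's):

```python
# Checks behind the Taylor expansions. Parameter values (lam, omega_star,
# tau, x) are illustrative assumptions only.
import math

# sum_{K=0}^{N-1} C(N-1,K) pi^K (1-pi)^{N-1-K} (1-Lam)^K = (1 - pi*Lam)^{N-1}
for N in (2, 4, 7):
    for pi, Lam in ((0.3, 0.2), (0.8, 0.6)):
        lhs = sum(math.comb(N - 1, K) * pi**K * (1 - pi)**(N - 1 - K)
                  * (1 - Lam)**K for K in range(N))
        assert abs(lhs - (1 - pi * Lam)**(N - 1)) < 1e-12

# With omega = (1 + lam*(tau+x)*Delta)*omega_star and Lam = 1 - exp(-lam*Delta),
# the distance between omega*(1-Lam) and (1 + lam*(tau+x-1)*Delta)*omega_star
# shrinks like Delta^2 (the error ratio is ~4 when Delta is halved).
lam, omega_star, tau, x = 1.0, 0.8, 2, 0.4

def step_error(Delta):
    omega = (1 + lam * (tau + x) * Delta) * omega_star
    target = (1 + lam * (tau + x - 1) * Delta) * omega_star
    return abs(omega * math.exp(-lam * Delta) - target)

r1, r2 = step_error(1e-3), step_error(5e-4)
assert 3.5 < r1 / r2 < 4.5
print("ok")
```

The first loop is just the binomial theorem applied to ((1−π) + π(1−Λ))^{N−1}; the second confirms the quadratic decay of the one-step approximation error for a fixed pair (τ, x).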
We then verify that incentives can be provided to do so, assuming that υ̲ gives the continuation values used by the players whether or not a player deviates from π = 1. Assume that N−1 players choose π = 1. Consider the remaining one. His incentive constraint to choose π = 1 is
\[
-(1-\delta)m + \delta\upsilon_N - \delta(1-\Lambda)^N\omega \;\geq\; -(1-\delta)\omega - \delta(1-\Lambda)^{N-1}\omega + \delta\upsilon_{N-1}, \tag{B.12}
\]
where υ_N, υ_{N−1} are given by υ̲ at ω_N, ω_{N−1}. The interpretation of both sides is as before: the payoff from abiding by the candidate equilibrium action vs. the payoff from deviating. Fixing ω and the corresponding pair (τ, x), and assuming that τ ≥ N−1, we insert our formula for υ̲, as well as Λ = λΔ + O(Δ²) and 1 − δ = rΔ + O(Δ²). This gives
\[
\tau \;\geq\; (N-1)\left(\frac{\lambda}{\lambda+r}\right) - x.
\]
Hence, given any integer N′ ∈ ℕ with N′ > 3(N−1), there is Δ̄ > 0 such that for all Δ ∈ (0, Δ̄), π = 1 is an equilibrium action at all beliefs ω = ω*(1+τλΔ), for τ = 3(N−1), …, N′ (we pick the factor 3 because λ/(λ+r) < 1).

[Footnote: Note that this solution is actually continuous at the interval endpoints. It is not the only solution to these equations; as mentioned in the text, there are intervals of beliefs for which multiple symmetric Markov equilibria exist in discrete time. It is easy to construct such equilibria in which π = 1 and the initial belief is in (a subinterval of) I₁.]

[Footnote: Considering τ < N−1 would require setting υ_N = 0, so that the explicit formula for υ̲ would not apply at ω_N. Computations are then easier, and the result would hold as well.]

Consider the beliefs in I_τ with τ ≥ 3(N−1) (and τ ≤ N′), and fix Δ for which the previous result holds, i.e., π = 1 can be enforced at all these beliefs. We now turn to the third step, showing how π = 0 can be enforced as well for these beliefs.

Suppose that players choose π = 0. As a continuation payoff, we can use the payoff from playing π = 1 in the following round, as we have seen that this action can be enforced at such a belief. This gives
\[
\delta\omega + \delta\left(-(1-\delta)m - \delta(1-\Lambda)^N\omega + \delta\,\underline\upsilon(\omega_N)\right).
\]
(Note that the discounted continuation payoff is the left-hand side of (B.12).) By deviating from π = 0, a player gets at most
\[
\omega + \left(-(1-\delta)m - \delta(1-\Lambda)\omega + \delta\,\underline\upsilon(\omega_1)\right).
\]
Again inserting our formula for υ̲, this reduces to
\[
\frac{m r (N-1)\,\lambda}{\lambda+r}\,\Delta + O(\Delta^2) \;\geq\; 0.
\]
Hence we can also enforce π = 0 at all these beliefs. We can thus apply our induction argument: there exists Δ̄ > 0 such that for all Δ ∈ (0, Δ̄), both π = 0 and π = 1 can be enforced at all beliefs ω ∈ (ω*(1+4NλΔ), ω^m).

Note that we have not established that, for such a belief ω, π = 1 is enforced with a continuation in which π = 1 is being played in the next round (at belief ω_N > ω*(1+4NλΔ)). However, if π = 1 can be enforced at belief ω, it can be enforced when the continuation payoff at ω_N is highest possible; in turn, this means that, as π = 1 can be enforced at ω_N, this continuation payoff is at least as large as the payoff from playing π = 1 at ω_N as well. By induction, this implies that the highest equilibrium payoff at ω is at least as large as the one obtained by playing π = 1 at all intermediate beliefs in (ω*(1+4NλΔ), ω) (followed by, say, the worst equilibrium payoff once beliefs below this range are reached).

Similarly, we have not argued that, at belief ω, π = 0 is enforced by a continuation equilibrium in which, if a player deviates and experiments unilaterally, his continuation payoff at ω₁ is what he gets if he keeps on experimenting alone. However, because π = 0 can be enforced at ω₁, the lowest equilibrium payoff that can be used after a unilateral deviation at ω must be at least as low as what the player can get at ω₁ from deviating unilaterally to risky again. By induction, this implies that the lowest equilibrium payoff at belief ω is at least as low as the one obtained if a player experiments alone for all beliefs in the range (ω*(1+4NλΔ), ω) (followed by, say, the highest equilibrium payoff once beliefs below this interval are reached).

Note that, as Δ →
0, these bounds converge (uniformly in Δ) to the cooperative solution (restricted to no experimentation at and below ω = ω*) and the single-agent payoff, respectively, which was to be shown. (This is immediate given that these values correspond to precisely the cooperative payoff (with N or 1 player) for a cutoff that is within a distance of order Δ of the cutoff ω*, with a continuation payoff at that cutoff which is itself within Δ times a constant of the safe payoff.)

This also immediately implies (as for the case λ₀ > 0) that for fixed ω > ω^m, both π = 0 and π = 1 can be enforced at all beliefs in (ω^m, ω] for all Δ < Δ̄, for some Δ̄ > 0: the gain from a deviation is of order Δ, yet the difference in continuation payoffs (selecting as a continuation payoff a value close to the maximum if no player unilaterally defects, and close to the minimum if one does) is bounded away from 0, even as Δ → 0. Hence, all conclusions extend: fix ω ∈ (ω*, ∞); for every ε > 0, there exists Δ̄ > 0 such that for all Δ < Δ̄, the best SSE payoff starting at belief ω is at least as much as the payoff from all players choosing π = 1 at all beliefs in (ω* + ε, ω) (using s as a lower bound on the continuation once the belief ω* + ε is reached); and the worst SSE payoff starting at belief ω is no more than the payoff from a player whose opponents choose π = 1 if, and only if, ω ∈ (ω*, ω* + ε), and 0 otherwise.

The first part of the proposition follows immediately, picking arbitrary p ∈ (p*, p^m) and p̄ ∈ (p^m, 1). For the second part, note that (i) p* < p^Δ_1, as noted, and (ii) for any p ∈ [p^Δ_1, p̄], player i's payoff in any equilibrium is weakly lower than his best-reply payoff against opponents who choose κ(p) = 1 for all p ∈ [p*, p̄], as easily follows from (B.11), the optimality equation for w.

Proof of Proposition 10:
For λ₀ > 0, the proof is the same as that of Proposition 6, except for the fact that it deals with V_{N,p̲} rather than V*_N and relies on Proposition 8 rather than Proposition 5.

For λ₀ = 0, the proof of Proposition 9 establishes that there exists a natural number M such that, given p as stated, we can take Δ̄ to be (p − p*)/M. Equivalently, p* + MΔ̄ = p. Hence, Proposition 9 can be restated as saying that, for some Δ̄ > 0 and all Δ ∈ (0, Δ̄), there exists p^Δ ∈ (p*, p* + MΔ) such that the two conclusions of the proposition hold with p = p^Δ. Fixing the prior, let w̄^Δ, w̲^Δ denote the payoffs in the first and second SSE from the proposition, respectively. Given that p^Δ → p* and w̄^Δ(p) → s, w̲^Δ(p) → s for all p ∈ (p*, p^Δ) as Δ → 0, it follows that we can pick Δ† ∈ (0, Δ̄) such that for all Δ ∈ (0, Δ†), W^Δ_PBE ≤ V_{N,p̂} + ε, w̄^Δ ≥ V_{N,p̲} − ε, ‖W^Δ_1 − V*‖ < ε and ‖w̲^Δ − V_{1,p̄}‖ < ε. The obvious inequalities follow as in the proof of Proposition 6, with the subtraction of an additional ε from the left-hand side of the first one; and the conclusion follows as before, using 2ε as an upper bound.

[Footnote: This follows by contradiction. Suppose that for some Δ ∈ (0, Δ̄), there is ω̂ ∈ [ω^m, ω] for which either π = 0 or π = 1 cannot be enforced. Consider the infimum over such beliefs. Continuation payoffs can then be picked as desired, which is a contradiction, as it shows that at this presumed infimum belief both π = 0 and π = 1 can be enforced.]

[Footnote: Consider the possibly random sequence of beliefs visited in an equilibrium. At each belief, a flow loss of either −(1−δ)m or −(1−δ)ω is incurred. Note that the first loss is independent of the number of other players experimenting, while the second is necessarily lower when at each round all other players experiment.]

[Footnote: Hence, to be precise, these payoffs are only defined on those beliefs that can be reached given the prior and the equilibrium strategies.]

References

Abreu, D. (1986): “Extremal Equilibria of Oligopolistic Supergames,”
Journal of Economic Theory, 195–225.

Abreu, D., D. Pearce and E. Stacchetti (1986): “Optimal Cartel Equilibria with Imperfect Monitoring,” Journal of Economic Theory, 251–269.

Abreu, D., D. Pearce and E. Stacchetti (1993): “Renegotiation and Symmetry in Repeated Games,” Journal of Economic Theory, 217–240.

Bergin, J. and W.B. MacLeod (1993): “Continuous Time Repeated Games,” International Economic Review, 21–37.

Biais, B., T. Mariotti, G. Plantin and J.-C. Rochet (2007): “Dynamic Security Design: Convergence to Continuous Time and Asset Pricing Implications,” Review of Economic Studies, 345–390.

Bolton, P. and C. Harris (1999): “Strategic Experimentation,” Econometrica, 349–374.

Cohen, A. and E. Solan (2013): “Bandit Problems with Lévy Payoff Processes,” Mathematics of Operations Research, 92–107.

Cronshaw, M.B. and D.G. Luenberger (1994): “Strongly Symmetric Subgame Perfect Equilibria in Infinitely Repeated Games with Perfect Monitoring and Discounting,” Games and Economic Behavior, 220–237.

Dixit, A.K. and R.S. Pindyck (1994): Investment under Uncertainty. Princeton: Princeton University Press.

Dutta, P.K. (1995): “A Folk Theorem for Stochastic Games,” Journal of Economic Theory, 1–32.

Fudenberg, D. and D.K. Levine (2009): “Repeated Games with Frequent Signals,” Quarterly Journal of Economics, 233–265.

Fudenberg, D., D.K. Levine and S. Takahashi (2007): “Perfect Public Equilibrium when Players Are Patient,” Games and Economic Behavior, 27–49.

Heidhues, P., S. Rady and P. Strack (2015): “Strategic Experimentation with Private Payoffs,” Journal of Economic Theory, 531–551.

Hörner, J., T. Sugaya, S. Takahashi and N. Vieille (2011): “Recursive Methods in Discounted Stochastic Games: An Algorithm for δ → 1,” Econometrica, 1277–1318.

Hörner, J. and L. Samuelson (2013): “Incentives for Experimenting Agents,” RAND Journal of Economics, 632–663.

Johnson, N.L., S. Kotz and N. Balakrishnan (1994): Continuous Univariate Distributions: Volume 1 (second edition). New York: Wiley.

Keller, G. and S. Rady (2010): “Strategic Experimentation with Poisson Bandits,” Theoretical Economics, 275–311.

Keller, G. and S. Rady (2015): “Breakdowns,” Theoretical Economics, 175–202.

Keller, G., S. Rady and M. Cripps (2005): “Strategic Experimentation with Exponential Bandits,” Econometrica, 39–68.

Mertens, J.-F., S. Sorin and S. Zamir (2015): Repeated Games (Econometric Society Monographs, Vol. 55). Cambridge: Cambridge University Press.

Müller, H.M. (2000): “Asymptotic Efficiency in Dynamic Principal-Agent Problems,” Journal of Economic Theory, 251–269.

Peskir, G. and A. Shiryaev (2006): Optimal Stopping and Free-Boundary Problems. Basel: Birkhäuser Verlag.

Robbins, H. (1952): “Some Aspects of the Sequential Design of Experiments,” Bulletin of the American Mathematical Society, 527–535.

Sadzik, T. and E. Stacchetti (2015): “Agency Models with Frequent Actions,” Econometrica, 193–237.

Simon, L.K. and M.B. Stinchcombe (1995): “Equilibrium Refinement for Infinite Normal-Form Games,” Econometrica, 1421–1443.

Thompson, W. (1933): “On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples,” Biometrika, 25, 285–294.