Overcoming Free-Riding in Bandit Games
Johannes Hörner†  Nicolas Klein‡  Sven Rady§

This version: October 22, 2019

∗ This paper supersedes our earlier paper “Strongly Symmetric Equilibria in Bandit Games” (circulated in 2014 as Cowles Discussion Paper No. 1956 and SFB/TR 15 Discussion Paper No. 469), which considered pure Poisson learning only. Thanks for comments and suggestions are owed to seminar participants at Aalto University Helsinki, Austin, Berlin, Bonn, City University of Hong Kong, Collegio Carlo Alberto Turin, Duisburg-Essen, Edinburgh, Exeter, Frankfurt (Goethe University, Frankfurt School of Finance and Management), London (Queen Mary, LSE), Lund, Maastricht, Mannheim, McMaster University, Microsoft Research New England, Montreal, Oxford, Paris (Séminaire Roy, Séminaire Parisien de Théorie des Jeux, Dauphine), Southampton, St. Andrews, Sydney, Toronto, Toulouse, University of Western Ontario, Warwick, Zurich, the 2012 International Conference on Game Theory at Stony Brook, the 2013 North American Summer Meeting of the Econometric Society, the 2013 Annual Meeting of the Society for Economic Dynamics, the 2013 European Meeting of the Econometric Society, the 4th Workshop on Stochastic Methods in Game Theory at Erice, the 2013 Workshop on Advances in Experimentation at Paris II, the 2014 Canadian Economic Theory Conference, the 8th International Conference on Game Theory and Management in St. Petersburg, the SING 10 Conference in Krakow, the 2015 Workshop on Stochastic Methods in Game Theory in Singapore, the 2017 Annual Meeting of the Society for the Advancement of Economic Theory in Faro, and the 2019 Annual Conference of the Royal Economic Society. Part of this paper was written during a visit to the Hausdorff Research Institute for Mathematics at the University of Bonn under the auspices of the Trimester Program “Stochastic Dynamics in Economics and Finance”. Financial support from the Cowles Foundation, Deutsche Forschungsgemeinschaft (SFB/TR 15 and SFB/TR 224), the Fonds de Recherche du Québec Société et Culture, and the Social Sciences and Humanities Research Council of Canada is gratefully acknowledged.

† Yale University, 30 Hillhouse Ave., New Haven, CT 06520, USA, and TSE (CNRS), and CEPR, [email protected].
‡ Université de Montréal, Département de Sciences Économiques, C.P. 6128 succursale Centre-ville, Montréal, H3C 3J7, Canada, and CIREQ, [email protected].
§ University of Bonn, Adenauerallee 24-42, D-53113 Bonn, Germany, and CEPR, [email protected].

Abstract

This paper considers a class of experimentation games with Lévy bandits encompassing those of Bolton and Harris (1999) and Keller, Rady and Cripps (2005). Its main result is that efficient (perfect Bayesian) equilibria exist whenever players’ payoffs have a diffusion component. Hence, the trade-offs emphasized in the literature do not rely on the intrinsic nature of bandit models but on the commonly adopted solution concept (MPE). This is not an artifact of continuous time: we prove that such equilibria arise as limits of equilibria in the discrete-time game. Furthermore, it suffices to relax the solution concept to strongly symmetric equilibrium.
Keywords: Two-Armed Bandit, Bayesian Learning, Strategic Experimentation, Strongly Symmetric Equilibrium.

JEL Classification Numbers: C73, D83.

Introduction
Bandit models involve trade-offs. The exploration vs. exploitation dilemma of the classic multi-armed bandit problem coined by Thompson (1933) and Robbins (1952) has been supplanted by free-riding vs. encouragement effects in a strategic context (Bolton and Harris, 1999). These two effects might be essential to our economic intuition, but the trade-off only arises because of the solution concept, as we show in this paper.

First, we provide some context. Typically, bandit games are modeled in continuous time, with Markov (perfect) equilibrium as the solution concept. This is a sensible choice. Continuous time provides elegant characterizations and even closed-form solutions. Markov equilibrium is the obvious counterpart to the criterion used in operations research, enabling meaningful comparisons with the team solution. It is also dictated by continuous time because standard game-theoretic notions raise conceptual problems (Simon and Stinchcombe, 1989). However, there is a price to pay. Markov equilibrium excludes rewards and punishments, the cornerstones of dynamic games. What Markov taketh away, continuous time giveth: equilibria arise with no discrete-time equivalent.

We show that the main equilibrium prediction, namely inefficiently low experimentation, (mostly) disappears once the model is cast in the framework traditionally used in dynamic games. Relaxing Markov equilibrium and pruning out artifacts of continuous time requires discretizing the game. We then let the time interval between successive actions vanish to obtain a similarly clean characterization and a proper comparison with the literature. Asymptotically, efficiency obtains, as long as payoffs involve an informative diffusion (as opposed to a pure jump) component.

What efficient experimentation entails depends on the players’ patience. Hence, this
Hence, thisis not a folk theorem, which would not apply in any case: beliefs are not reversible.For example, players would never be convinced that the risky arm is good if it is not.In addition, efficiency does not always hold: with pure jumps, this depends on howgood news impacts the common belief.Efficiency only obtains if a selfish, lone player would not experiment given anyresulting posterior belief, had he been on the brink of stopping as a selfless team playergiven the prior belief. Intuitively, having all other players stop experimenting is theworst punishment a deviating player can face.This punishment is needed but not a given with pure jumps. Indeed, it fails withconclusive good news. However, when there is a diffusion component, the predominant For instance, the “infinite switching equilibria” of Keller et al. (2005). This is because there is nolast time before a given date in continuous time. A few caveats are in order.First, showing that the free-rider and encouragement effects do not determine theoutcome of bandit games is akin to noting that efficiency is achievable in the repeatedprisoner’s dilemma even if defection is dominant in the stage game: it would notoccur to us to demonstrate how cooperation arises in the repeated version without firstremarking that free-riding is the problem we are addressing. In stochastic games suchas bandits, discerning the underlying incentives is difficult and subtle: solving for theMarkov equilibria is the advisable approach. Our point is that we must disentanglethese incentives from the possible equilibrium outcomes.Second, we have emphasized the importance of studying the discrete-time gameto factor out equilibria that are continuous-time quirks. However, our results areasymptotic to the extent that they only hold when the time interval between rounds issmall enough. There is no difference between an arbitrarily small uptick vs. a discretejump when the interval length is bounded away from zero. 
Our results rely heavily on what is known about the continuous-time limits and hence on the analyses of Bolton and Harris (1999) and Keller et al. (2005), among others. To the extent that some of our proofs are involved, it is because they require careful comparison and convergence arguments.

Third, because we rely on discrete time, we must settle on a particular discretization. We consider our choice to be natural: players may revise their action choices at equally spaced time opportunities, while payoffs and information accrue in continuous time, independent of the duration of the intervals. That is, ours is the simplest version of inertia strategies, as introduced by Bergin and MacLeod (1993). Other discretization choices may lead to different predictions.

Fourth, our results do not cover all bandit games. Because we build on existing results of the single-agent case, we cannot go beyond the framework used for them. In particular, we must make restrictions similar to Cohen and Solan (2013) in their analysis of the continuous-time bandit problem. In fact, our assumptions are stronger than theirs. Our main restriction, just as theirs, is that bad-news jumps are not permitted, which means that our framework does not subsume Keller and Rady (2015), in particular.

Our paper belongs to the growing literature on strategic bandits. We have already discussed the standard references in that literature. There is no need to review the large and growing literature on extensions, variations and applications. With few exceptions, these papers model the game in continuous time and focus on MPEs unless actions on at least one side are not observed (meaning that applying standard game-theoretic solution concepts raises no difficulty).

Second, our paper contributes to the literature on SSE. We hope that it illustrates how SSE can be usefully applied to games usually cast in continuous time, such as bandit games.

(One appealing property of SSEs is that payoffs can be studied via a coupled pair of functional equations that extends the functional equation characterizing MPE payoffs; see Proposition 11.)

SSEs have been studied in repeated games since Abreu (1986). They are known to be restrictive. First, they make no sense if the model itself fails to be symmetric. However, as Abreu (1986) notes for repeated games, they are (i) easily calculated, being completely characterized by two simultaneous scalar equations; (ii) more general than static Nash, or even Nash reversion; and even (iii) without loss in terms of total welfare, at least in some cases, as in ours. See also Abreu, Pearce and Stacchetti (1986) for the optimality of symmetric equilibria within a standard oligopoly framework and Abreu, Pearce and Stacchetti (1993) for a motivation of the solution concept based on a notion of equal bargaining power.
Cronshaw and Luenberger (1994) conduct a more general analysis for repeated games with perfect monitoring, showing how the set of SSE payoffs can be obtained by solving for the largest scalar solving a certain equation. Hence, our paper shows that Properties (i)–(iii) extend to bandit games, with “Markov perfect” replacing “Nash” in statement (ii) and “functional” replacing “scalar” in (i): as mentioned above, a pair of functional equations replaces the usual Hamilton-Jacobi-Bellman (HJB) (or Isaacs) equation from optimal control.

(We do not a priori perceive a difficulty in adopting theirs, but we also do not perceive any benefits.)

(The technical difficulty with bad-news jumps is that the value functions cannot be described explicitly. They are rather defined recursively, with the functional form depending on the number of bad-news events triggering an end to all experimentation. Because of this complication, we leave the analysis of this case to future work.)
The Model

Time t ∈ [0, ∞) is continuous. There are N ≥ 2 players, each of whom operates a replica of the same two-armed bandit. One arm is safe and yields a known flow payoff s > 0; the other arm is risky, with payoffs that depend on an unknown state of the world θ ∈ {0, 1}, which nature draws at the outset with P[θ = 1] = p. Players do not observe θ, but they know p. They also understand that the evolution of the risky payoffs depends on θ. Specifically, the payoff process X^n associated with player n’s risky arm evolves according to

dX^n_t = α_θ dt + σ dZ^n_t + h dN^n_t,

where Z^n is a standard Wiener process, N^n is a Poisson process with intensity λ_θ, and the scalar parameters α_0, α_1, σ, h, λ_0, λ_1 are known to all players. Conditional on θ, the processes Z^1, …, Z^N, N^1, …, N^N are independent. As Z^n and N^n_t − λ_θ t are martingales, the expected payoff increment from using the risky arm over an interval of time [t, t + dt) is m_θ dt with m_θ = α_θ + λ_θ h.

Players share a common discount rate r >
0. We write k_{n,t} = 0 if player n uses the safe arm at time t and k_{n,t} = 1 if the player uses the risky arm at time t. Given actions (k_{n,t})_{t≥0} such that k_{n,t} ∈ {0, 1} is measurable with respect to the information available at time t, player n’s total expected discounted payoff, expressed in per-period units, is

E[ ∫_0^∞ r e^{−rt} [(1 − k_{n,t}) s + k_{n,t} m_θ] dt ],

where the expectation is over both the random variable θ and the stochastic process (k_{n,t}).

(Bolton and Harris (1999), Keller et al. (2005) and Keller and Rady (2010) allow the players to allocate one unit of a perfectly divisible resource freely across the two arms at each point in time, so the fraction allocated to the risky arm can be k_{n,t} ∈ [0, 1].)

We make the following assumptions: (i) m_0 < s < m_1, so each player prefers the risky arm to the safe arm in state θ = 1 and prefers the safe arm to the risky arm in state θ = 0; (ii) σ > 0 and h > 0, so the Brownian payoff component is always present and jumps of the Poisson component entail positive lump-sum payoffs; (iii) λ_1 ≥ λ_0 ≥
0, so jumps are at least as frequent in state θ = 1 as in state θ = 0.

Players begin with a common prior belief about θ, given by the probability p with which nature draws state θ = 1. Thereafter, they learn about this state in a Bayesian fashion by observing one another’s actions and payoffs; in particular, they hold common posterior beliefs throughout time. A detailed description of the evolution of beliefs is presented in Appendix A.1. When λ_1 = λ_0 (and hence α_1 > α_0), the arrival of a lump-sum payoff contains no information about the state of the world, and our setup is equivalent to that in Bolton and Harris (1999), with the learning being driven entirely by the Brownian payoff component. When α_1 = α_0 (and hence λ_1 > λ_0), the Brownian payoff component contains no information, and our setup is equivalent to that in Keller et al. (2005) or Keller and Rady (2010), depending on whether λ_0 = 0 or λ_0 >
0, with the learning being driven entirely by the arrival of lump-sum payoffs.

(Note that we have not yet defined the set of strategies available to each player and hence are silent at this point on how the players’ strategy profile actually induces a stochastic process of actions (k_{n,t})_{t≥0} for each of them. We will close this gap in two different ways in Sections 3 and 4: by imposing Markov perfection in the former and a discrete time grid of revision opportunities in the latter.)

(This rules out “breakdowns” as in Keller and Rady (2015).)

(Keller et al. (2005) and Keller and Rady (2010) consider compound Poisson processes where the distribution of lump-sum payoffs (and their mean h) at the time of a Poisson jump is independent of, and hence uninformative about, the state of the world. By contrast, Cohen and Solan (2013) allow for Lévy processes where the size of lump-sum payoffs contains information about the state, but a lump sum of any given size arrives weakly more frequently in state θ = 1.)

The authors cited in the previous paragraph assume that players use continuous-time Markov strategies with the posterior belief as the state variable, so that k_{n,t} is a time-invariant function of the probability p_t assigned to state θ = 1 at time t. In this section, we show how some of their main insights generalize to the present setting. First, we present the efficient benchmark. Second, we show that efficient behavior cannot be sustained as an MPE.

Consider a planner who maximizes the average of the players’ expected payoffs in continuous time by selecting an entire action profile (k_{1,t}, …, k_{N,t}) at each time t. The corresponding average expected payoff increment is

[(1 − K_t/N) s + (K_t/N) m_θ] dt with K_t = Σ_{n=1}^N k_{n,t}.
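The identity m_θ = α_θ + λ_θ h can be checked by simulating the risky-arm payoff process at a fixed state. The sketch below uses hypothetical parameter values chosen purely for illustration (they are not taken from the paper):

```python
import numpy as np

# Hypothetical parameter values (illustrative only): state theta = 1.
alpha, sigma, h, lam = 0.5, 1.0, 1.0, 0.5   # alpha_1, sigma, h, lambda_1
m = alpha + lam * h                          # expected flow payoff m_1

rng = np.random.default_rng(0)
dt, n = 0.01, 200_000

# Increments of dX_t = alpha dt + sigma dZ_t + h dN_t over steps of length dt.
dZ = rng.normal(0.0, np.sqrt(dt), n)         # Brownian increments
dN = rng.poisson(lam * dt, n)                # Poisson jump counts
dX = alpha * dt + sigma * dZ + h * dN

est_m = dX.mean() / dt                       # sample estimate of m_1
print(est_m, m)
```

The sample mean of the increments, divided by dt, estimates m_θ; the martingale property of Z^n and N^n_t − λ_θ t is exactly why the drift α_θ and the jump compensation λ_θ h are the only surviving terms.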
A straightforward extension of the main results of Cohen and Solan (2013) shows that the evolution of beliefs also depends on K_t only and that the planner’s value function, denoted by V*_N, has the following properties.

First, V*_N is the unique once-continuously differentiable solution of the HJB equation

v(p) = s + max_{K ∈ {0,1,…,N}} K [ b(p, v) − c(p)/N ]

on the open unit interval subject to the boundary conditions v(0) = m_0 and v(1) = m_1. Here,

b(p, v) = (ρ/r) p²(1−p)² v″(p) − ((λ_1 − λ_0)/r) p(1−p) v′(p) + (λ(p)/r) [v(j(p)) − v(p)],

with ρ = (α_1 − α_0)²/(2σ²), can be interpreted as the expected informational benefit of using the risky arm when continuation payoffs are given by a (sufficiently regular) function v. Its first term reflects Brownian learning. Its second term captures the downward drift in the belief when no Poisson lump sum arrives. Its third term expresses the discrete change in the overall payoff once such a lump sum arrives, with the belief jumping up from p to

j(p) = λ_1 p / λ(p), where λ(p) = p λ_1 + (1−p) λ_0.

The function

c(p) = s − m(p)

captures the opportunity cost of playing the risky arm in terms of expected current payoff forgone; here,

m(p) = p m_1 + (1−p) m_0

denotes the risky arm’s expected flow payoff given the belief p.

(In the presence of discrete payoff increments, one actually has to take the left limit p_{t−} as the state variable, owing to the informational constraint that the action chosen at time t cannot depend on the arrival of a lump sum at t. In the following, we simply write p_t with the understanding that the left limit is meant whenever this distinction is relevant. Note that p_{0−} = p_0 by convention. Cf. Appendix A.1. Up to division by r, b is the infinitesimal generator of the process of posterior beliefs for K = 1, applied to the function v; cf. Appendix A.1 for details.)
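The jump target j(p) is just Bayes’ rule applied to the arrival of one lump sum, and the myopic payoff m(p), being linear in p, earns zero informational benefit: the drift and jump terms of b cancel against each other, so b(p, m) = 0. A small numerical check (the intensities and flow payoffs below are hypothetical):

```python
# Hypothetical parameters for illustration.
lam0, lam1, r = 0.5, 2.0, 1.0
m0, m1 = 0.5, 4.0

lam = lambda p: p * lam1 + (1 - p) * lam0   # expected jump intensity
j = lambda p: lam1 * p / lam(p)             # posterior after one jump
m = lambda p: p * m1 + (1 - p) * m0         # expected flow payoff (linear)

p = 0.3
# j(p) coincides with Bayes' rule P(theta = 1 | jump):
bayes = p * lam1 / (p * lam1 + (1 - p) * lam0)
print(j(p), bayes)

# For the linear function m, the drift and jump terms of b cancel
# (m'' = 0, so the Brownian term vanishes as well):
drift = -(lam1 - lam0) / r * p * (1 - p) * (m1 - m0)   # m'(p) = m1 - m0
jump  = lam(p) / r * (m(j(p)) - m(p))
print(drift + jump)   # zero up to rounding
```

This cancellation is why only continuation values v that bend away from the myopic payoff generate a positive informational benefit b(p, v).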
Thus, the planner weighs the shared opportunity cost of each experiment on the risky arm against the learning benefit, which accrues fully to each agent because of the perfect informational spillover.

Second, there exists a cutoff p*_N such that all agents using the safe arm (K = 0) is optimal for the planner when p ≤ p*_N, and all agents using the risky arm (K = N) is optimal when p > p*_N. This cutoff is given by

p*_N = μ_N (s − m_0) / [ (μ_N + 1)(m_1 − s) + μ_N (s − m_0) ],

where μ_N is the unique positive solution of the equation

ρ μ(μ + 1) + (λ_1 − λ_0) μ + λ_0 [ (λ_0/λ_1)^μ − 1 ] − r/N = 0.

Both μ_N and p*_N increase in r/N. Thus, the interval of beliefs for which all agents using the risky arm is efficient widens with the number of agents and their patience.

Third, the value function satisfies V*_N(p) = s for p ≤ p*_N, and

V*_N(p) = m(p) + [ c(p*_N) / u(p*_N; μ_N) ] u(p; μ_N) > s,   (1)

for p > p*_N, where

u(p; μ) = (1 − p) ((1 − p)/p)^μ

is strictly decreasing and strictly convex for μ > 0. The function V*_N is strictly increasing and strictly convex on [p*_N, 1]. For N = 1, one obtains the single-agent value function V*_1 and corresponding cutoff p*_1 > p*_N.

Now consider N ≥ 2 players. The Bellman equation characterizing the best response of player n when he or she faces opponents who use Markov strategies is given by

v_n(p) = s + K_{¬n}(p) b(p, v_n) + max_{k_n ∈ {0,1}} k_n [ b(p, v_n) − c(p) ],

where K_{¬n}(p) is the number of n’s opponents that use the risky arm. That is, when playing a best response, each player weighs the opportunity cost of playing risky against his or her own informational benefit only. Consequently, V*_N does not solve the above HJB equation when player n’s opponents use the efficient strategy. Efficient behavior therefore cannot be sustained in MPE.

Henceforth, we restrict players to changing their actions only at the times t = 0, Δ, 2Δ, … for some fixed Δ >
0. This yields a discrete-time game evolving in a continuous-time framework; in particular, the payoff processes are observed continuously. Moreover, we allow for non-Markovian strategies.

The expected discounted payoff increment from using the safe arm for the length of time Δ is ∫_0^Δ r e^{−rt} s dt = (1 − δ) s with δ = e^{−rΔ}. Conditional on θ, the expected discounted payoff increment from using the risky arm is ∫_0^Δ r e^{−rt} m_θ dt = (1 − δ) m_θ. Given the probability p assigned to θ = 1, the expected discounted payoff increment from the risky arm conditional on all available information is (1 − δ) m(p).

A history of length t = Δ, 2Δ, … is a sequence

h_t = ( (k_{n,0}, Ỹ^n_{[0,Δ)})_{n=1}^N, (k_{n,Δ}, Ỹ^n_{[Δ,2Δ)})_{n=1}^N, …, (k_{n,t−Δ}, Ỹ^n_{[t−Δ,t)})_{n=1}^N ),

where k_{n,ℓΔ} = 1 if player n uses the risky arm on the time interval [ℓΔ, (ℓ+1)Δ);

(While arguably natural, our discretization remains nonetheless ad hoc, and other discretizations might yield other results. Not only is it well known that the limits of discrete-time models might differ from the continuous-time solutions, but the particular discrete structure might also matter; see, among others, Müller (2000), Fudenberg and Levine (2009), Hörner and Samuelson (2013), and Sadzik and Stacchetti (2015). In Hörner and Samuelson (2013), for instance, there are multiple solutions to the optimality equations, corresponding to different boundary conditions, and to select among them, it is necessary to investigate in detail the discrete-time game (see their Lemma 3). However, the role of the discretization goes well beyond selecting the “right” boundary condition; see Sadzik and Stacchetti (2015).)
k_{n,ℓΔ} = 0 if player n uses the safe arm on this interval; Ỹ^n_{[ℓΔ,(ℓ+1)Δ)} is the observed sample path of the payoff process associated with player n’s risky arm on the interval [ℓΔ, (ℓ+1)Δ) if k_{n,ℓΔ} = 1; and Ỹ^n_{[ℓΔ,(ℓ+1)Δ)} equals the empty set if k_{n,ℓΔ} = 0. We write H_t for the set of all histories of length t, set H_0 = {∅}, and let H = ∪_{t=0,Δ,2Δ,…} H_t. In addition, we assume that players have access to a public randomization device in every period, namely, a draw from the uniform distribution on [0, 1], independent of θ and across periods. Following standard practice, we omit its realizations from the description of histories.

A behavioral strategy σ_n for player n is a sequence (σ_{n,t})_{t=0,Δ,2Δ,…}, where σ_{n,t} is a measurable map from H_t to the set of probability distributions on {0, 1}; a pure strategy takes values in the set of degenerate distributions only.

Along with the prior probability p assigned to θ = 1, each profile of strategies induces a distribution over H. Given his or her opponents’ strategies σ_{−n}, player n seeks to maximize

(1 − δ) E_{σ_{−n},σ_n} [ Σ_{ℓ=0}^∞ δ^ℓ { [1 − σ_{n,ℓΔ}(h_{ℓΔ})] s + σ_{n,ℓΔ}(h_{ℓΔ}) m_θ } ].

By the law of iterated expectations, this equals

(1 − δ) E_{σ_{−n},σ_n} [ Σ_{ℓ=0}^∞ δ^ℓ { [1 − σ_{n,ℓΔ}(h_{ℓΔ})] s + σ_{n,ℓΔ}(h_{ℓΔ}) m(p_{ℓΔ}) } ].

Nash equilibrium, PBE and MPE, with actions after history h_t depending only on the associated posterior belief p_t, are defined in the usual way. Imposing the standard “no signaling what you don’t know” refinement, beliefs are pinned down after all histories, on and off path.

An SSE is a PBE in which all players use the same strategy: σ_n(h_t) = σ_{n′}(h_t) for all n, n′ and h_t ∈ H. This implies symmetry of behavior after any history, not just on the equilibrium path of play. By definition, any symmetric MPE is an SSE, and any SSE is a PBE.
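The per-period normalization used above can be verified directly: ∫_0^Δ r e^{−rt} dt = 1 − δ with δ = e^{−rΔ}, and the weights (1 − δ) δ^ℓ sum to one, so a constant flow s is worth exactly s in per-period units. A quick numerical check with arbitrary illustrative values of r and Δ:

```python
import math

r, Delta = 1.0, 0.25          # arbitrary illustrative values
delta = math.exp(-r * Delta)

# Midpoint-rule quadrature of  integral_0^Delta  r e^{-rt} dt.
n = 10_000
dt = Delta / n
integral = sum(r * math.exp(-r * (i + 0.5) * dt) * dt for i in range(n))
print(integral, 1 - delta)    # the two agree up to quadrature error

# The normalized weights (1 - delta) delta^l sum to one:
total = (1 - delta) * sum(delta**l for l in range(10_000))
print(total)
```

The same normalization is what makes the discrete-time payoff directly comparable to the continuous-time per-period payoffs of the previous section.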
(While we could equivalently define this Bayesian game as a stochastic game with the common posterior belief as a state variable, no characterization or folk theorem applies to our setup, as the Markov chain over consecutive states does not satisfy the sufficient ergodicity assumptions; see Dutta (1995) and Hörner, Sugaya, Takahashi and Vieille (2011).)

Main Results
Fix ∆ >
0. For p ∈ [0, 1], let W̄^Δ_PBE(p) and W̲^Δ_PBE(p) denote the supremum and infimum, respectively, of the set of average payoffs (per player) over all PBE, given prior belief p. Let W̄^Δ_SSE(p) and W̲^Δ_SSE(p) be the corresponding supremum and infimum over all SSE. If such equilibria exist,

W̄^Δ_PBE(p) ≥ W̄^Δ_SSE(p) ≥ W̲^Δ_SSE(p) ≥ W̲^Δ_PBE(p).   (2)

Given that we assume a public randomization device, these upper and lower bounds define the corresponding equilibrium average payoff sets.

As any player can choose to ignore the information contained in the other players’ experimentation results, the value function W^Δ_1 of a single agent experimenting in isolation constitutes a lower bound on a player’s payoff in any PBE. Lemma A.2 establishes that this lower bound converges to V*_1 as Δ → 0. Hence, we obtain a lower bound to the limits of all terms in (2), namely lim inf_{Δ→0} W̲^Δ_PBE ≥ V*_1.

An upper bound is also easily found. As any discrete-time strategy profile is feasible for the continuous-time planner from the previous section, it holds that W̄^Δ_PBE ≤ V*_N.

The main theorem provides an exact characterization of the limits of all four functions. It requires introducing a new family of payoffs. Namely, we define the players’ common payoff in continuous time when they all use the risky arm if, and only if, the belief exceeds a given threshold p̂. This function admits a closed form that generalizes the first-best payoff V*_N (cf. (1)). It is equal to

V_{N,p̂}(p) = m(p) + [ c(p̂) / u(p̂; μ_N) ] u(p; μ_N)

for p > p̂, and V_{N,p̂}(p) = s otherwise.

Theorem 1 (i)
There exists p̂ ∈ [p*_N, p*_1] such that

lim_{Δ→0} W̄^Δ_PBE = lim_{Δ→0} W̄^Δ_SSE = V_{N,p̂} and lim_{Δ→0} W̲^Δ_PBE = lim_{Δ→0} W̲^Δ_SSE = V*_1,

uniformly on [0, 1]. (The function V_{N,p̂} is continuous, strictly increasing and strictly convex on [p̂, 1]. For p̂ = p*_N, V_{N,p̂} coincides with the cooperative value function V*_N. For p̂ > p*_N, we have V_{N,p̂} < V*_N on (p*_N, 1).)

(ii) If ρ > 0, then p̂ = p*_N (and hence V_{N,p̂} = V*_N).

(iii) If ρ = 0, then p̂ is the unique belief in [p*_N, p*_1] satisfying

N λ(p̂) [V_{N,p̂}(j(p̂)) − s] − (N − 1) λ(p̂) [V*_1(j(p̂)) − s] = r c(p̂);   (3)

moreover, p̂ = p*_N if, and only if, j(p*_N) ≤ p*_1, and p̂ = p*_1 if, and only if, λ_0 = 0.

To understand this result, let us begin with SSEs and the characterization of the cutoff p̂ in the last item, when learning is entirely driven by the jump process. The players’ temptation to deviate to the safe arm is strongest when the belief is so low that, absent good news, the belief drops into the region where safe prevails in any SSE, whether a single player has deviated or not. The cost of such a deviation, captured by the left-hand side of (3), thus arises only if good news arrives. Starting out from p̂, in expectation, this happens at the rate N λ(p̂) if no player deviates; a deviation reduces this rate to (N − 1) λ(p̂). Without a deviation, a player’s continuation payoff then amounts at most to the cooperative payoff given that the use of the risky arm is disallowed below p̂; in the event of a deviation, it is at least the single-player payoff (both evaluated at the revised belief j(p̂) and net of the value of the safe arm). The right-hand side of (3) represents the benefit of a deviation, that is, the saved opportunity cost of playing risky.
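Equation (3) can be located numerically in the pure-jump case. The sketch below is illustrative only: the parameters are hypothetical, and it additionally assumes α_0 = α_1 = 0 (so that m_θ = λ_θ h), using the closed forms for μ_N, p*_N, V_{N,p̂} and V*_1 given above, with p̂ found by grid search on [p*_N, p*_1]:

```python
# Hypothetical parameters with m0 < s < m1 (pure-jump case, rho = 0).
r, s, h, lam0, lam1, N = 1.0, 1.0, 2.0, 0.25, 2.0, 3
m0, m1 = lam0 * h, lam1 * h

lam = lambda p: p * lam1 + (1 - p) * lam0
j = lambda p: lam1 * p / lam(p)
m = lambda p: p * m1 + (1 - p) * m0
c = lambda p: s - m(p)
u = lambda p, mu: (1 - p) * ((1 - p) / p) ** mu

def mu_of(n):
    # Unique positive root of (lam1-lam0) mu + lam0((lam0/lam1)^mu - 1) = r/n.
    f = lambda mu: (lam1 - lam0) * mu + lam0 * ((lam0 / lam1) ** mu - 1) - r / n
    lo, hi = 1e-9, 1.0
    while f(hi) < 0:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return (lo + hi) / 2

def p_star(n):
    mu = mu_of(n)
    return mu * (s - m0) / ((mu + 1) * (m1 - s) + mu * (s - m0))

mu_N, mu_1 = mu_of(N), mu_of(1)
pN, p1 = p_star(N), p_star(1)

def V(p, cutoff, mu):          # payoff with all-risky play above the cutoff
    return s if p <= cutoff else m(p) + c(cutoff) / u(cutoff, mu) * u(p, mu)

def g(ph):                     # left-hand side minus right-hand side of (3)
    lhs = N * lam(ph) * (V(j(ph), ph, mu_N) - s) \
        - (N - 1) * lam(ph) * (V(j(ph), p1, mu_1) - s)
    return lhs - r * c(ph)

grid = [pN + i * (p1 - pN) / 4000 for i in range(1, 4000)]
p_hat = min(grid, key=lambda ph: abs(g(ph)))
print(pN, p_hat, p1)
```

For these (hypothetical) parameters the jump from p*_N overshoots the single-agent cutoff, so the root of g lies strictly between p*_N and p*_1, mirroring the configuration of Figure 1.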
The cutoff belief p̂ thus solves the familiar trade-off between the benefit from deviating and the cost of the worst punishment that may follow the deviation.

When λ_0 = 0, the arrival of good news freezes the belief at 1, and the resulting cooperative and single-player payoffs both equal λ_1 h. Starting out from p̂, therefore, a player’s continuation payoffs coincide with those of a single agent in all circumstances, so that it is impossible to sustain experimentation below the single-agent cutoff. Hence, p̂ = p*_1.

If the second term on the left-hand side of (3) were zero, that is, if j(p*_N) ≤ p*_1, so that a player left to his or her own devices would stop experimenting at the revised belief after the arrival of good news, and hence obtain a zero payoff (net of the value of the safe arm), the solution to this equation would be the first-best cutoff p*_N. To see this, note that the first term on the left-hand side can equivalently be interpreted as the social value of experimentation by a single player. Indeed, a player contributes to the arrival of news at rate λ(p̂), but all N players then reap the gain V_{N,p̂}(j(p̂)) − s. The right-hand side is the cost of such experimentation. Hence, p̂ = p*_N follows immediately from the equation.

The same logic immediately implies that first-best efficiency obtains when ρ > 0.

First-best efficiency not only depends on the cutoff but also requires play to be exclusively risky at all higher beliefs. Hence, the best equilibrium must involve a pure strategy, at least asymptotically. This is not straightforward. Indeed, symmetric pure-strategy PBE fail to exist with conclusive good news (ρ = λ_0 = 0) in discrete time. If all others play risky for certain, the posterior belief also declines for certain, unless good news arrives. If players randomized, there would be the added opportunity to punish if the posterior belief remained the same.
When good news is conclusive, our proof relies on the existence of two symmetric mixed-strategy equilibria for beliefs close to the cutoff. It is then possible to choose continuation play as a function of history to incentivize players to experiment at beliefs that are sufficiently many rounds away from the cutoff (a negligible difference in beliefs once the time interval is small enough). Matters are simpler when news is inconclusive or a diffusion term is present.

Turning to point (i) of the theorem, there is no difference between the set of SSE and PBE payoffs, at least on average across players. This is shown in Sections 6.1–6.2. Regarding the highest equilibrium payoff, this may seem plausible (though not obvious) because efficiency requires symmetric play. Regarding the lowest equilibrium payoff, either playing safe forever is an equilibrium of the game given the current belief, or best-responding to being minmaxed provides a higher payoff to the punished player than also playing the minmaxing action (using the safe arm). In the latter case, one can incentivize the punished player to play safe by promising that all players will revert to risky (cooperative) play at a later time, thereby compensating the punished player for the flow payoff deficit that playing safe involves in the meantime. This eventual reversion also motivates the punishing players to play safe.

Figure 1 shows the cooperative continuous-time payoff V*_N as well as the supremum V_{N,p̂} and infimum V*_1 of the limit average PBE payoffs for a parameter configuration that implies p*_N < p̂ < p*_1.

A more technical intuition can be given in the spirit of smooth pasting in stopping problems for diffusion processes; see Dixit and Pindyck (1994). If all SSE experimentation stopped at a belief p̂ > p*_N, the limiting payoff function V_{N,p̂} would exhibit a convex kink at p̂.
Given the diffusion component of the posterior-belief process, this kink could be used to provide all players incentives to use the risky arm at beliefs slightly below p̂. Indeed, the informational benefit of experimentation in the presence of a kink is of lower order in Δ than its opportunity cost and hence dominates for small Δ.

Figure 1: Payoffs V*_N (solid), V_{N,p̂} (dashed) and V*_1 (dotted) for ρ = 0 and a parameter configuration (r, s, h, λ_0, λ_1, N) implying p*_N < p̂ < p*_1.

The symmetric MPE of the continuous-time game not only involves too small an amount of experimentation but also entails too low a speed of experimentation, as it involves an interior level of experimentation for a range of beliefs.

Proposition 1
For ρ = 0 and λ_0 > 0, the cutoff p̂ is strictly lower than the belief at which all experimentation stops in the symmetric MPE of the continuous-time game.

Turning to comparative statics, when is the first-best achievable with jump processes? The next proposition characterizes the region of intensity pairs (λ_0, λ_1) where asymptotic efficiency obtains. As is intuitive, having more players, or more patience, increases the scope for the first-best.

Proposition 2 Let ρ = 0. Then, j(p*_N) > p*_1 whenever λ_0 ≤ λ_1/N. On any ray in R²₊ emanating from the origin (0, 0) with a slope λ_0/λ_1 strictly between 1/N and 1, there is a unique critical point (λ_0*, λ_1*) at which j(p*_N) = p*_1; moreover, j(p*_N) > p*_1 at all points of the ray that are closer to the origin than (λ_0*, λ_1*), and j(p*_N) < p*_1 at all points that are farther from the origin than (λ_0*, λ_1*). These critical points form a continuous curve that is bounded away from the origin and asymptotes to the ray of slope 1/N. The curve shifts downward as r falls or N rises.

This result is illustrated in Figure 2. Furthermore, in the case of λ_0 >
0, the more players participate in the game, the more experimentation can be sustained. (Recall that for λ₀ = 0, the threshold belief p̂ is independent of N.) Hence, the comparative statics of the best SSE with respect to the number of players mirrors that for symmetric MPE (see Keller and Rady (2010)).

Proposition 3
For ρ = 0 and λ₀ > 0, p̂ is decreasing in N.

It is instructive to consider what happens when the players become arbitrarily impatient or patient. If players are myopic, they do not react to future rewards and punishments. It is therefore no surprise that the cooperative solution cannot be attained in the limit. By contrast, if players are very patient, asymptotic efficiency is achieved if the number of players is large.
Proposition 4
For ρ = 0 and λ₀ > 0,

lim_{r→∞} j(p*_N)/p* = λ₁h/s,  and  lim_{r→0} j(p*_N)/p* = λ₁/(Nλ₀).

The next section is devoted to the construction of SSEs that underlies the proof of Theorem 1. Missing details are provided in the appendix.
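The first of these limits can be sketched directly. This is a heuristic only; it assumes (as in the pure Poisson case ρ = 0) that the expected risky flow payoff is m(p) = (pλ₁ + (1 − p)λ₀)h, that a lump sum moves the belief from p to j(p) = λ₁p/(pλ₁ + (1 − p)λ₀), and that both cutoffs p*_N and p* converge to the myopic cutoff p^m as r → ∞:

```latex
% Myopic limit: p_N^* \to p^m and p^* \to p^m as r \to \infty,
% where the myopic cutoff satisfies m(p^m) = s, i.e.
%   \bigl(p^m \lambda_1 + (1-p^m)\lambda_0\bigr) h = s.
\lim_{r\to\infty}\frac{j(p_N^*)}{p^*}
  \;=\; \frac{j(p^m)}{p^m}
  \;=\; \frac{1}{p^m}\cdot\frac{\lambda_1\, p^m}{p^m\lambda_1+(1-p^m)\lambda_0}
  \;=\; \frac{\lambda_1 h}{s}.
```

Since experimentation is only worthwhile when λ₁h > s, this ratio exceeds one in the myopic limit, consistent with the observation above that the cooperative solution cannot be attained when players do not react to future rewards and punishments.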
We first consider the case of a diffusion component (Section 6.1) and then turn to the case of pure jump processes (Section 6.2).

Figure 2: Asymptotic efficiency is achieved for parameter combinations (λ₀, λ₁) between the diagonal and the curve but not below the curve. The dashed line is the ray of slope 1/N. Parameter values: r = 1, N = 5.

We need the following notation. Let F^∆_K(·|p) denote the cumulative distribution function of the posterior belief p_∆ when p₀ = p and K players use the risky arm on the time interval [0, ∆). For any measurable function w on [0,
1] and p ∈ [0, 1], let

E^∆_K w(p) = ∫ w(p′) F^∆_K(dp′ | p),

whenever this integral exists. Thus, E^∆_K w(p) is the expectation of w(p_∆) given the prior p₀ = p and K experimenting players.

6.1 A Diffusion Component (ρ > 0)

For a sufficiently small ∆ >
0, we specify an SSE that can be summarized by two functions, κ̄ and κ̲, which do not depend on ∆. The equilibrium strategy is characterized by a two-state automaton. In the "good" state, play proceeds according to κ̄, and the equilibrium payoff satisfies

w̄^∆(p) = (1 − δ)[(1 − κ̄(p))s + κ̄(p)m(p)] + δ E^∆_{Nκ̄(p)} w̄^∆(p),   (4)

while in the "bad" state, play proceeds according to κ̲, and the payoff satisfies

w̲^∆(p) = max_{k∈{0,1}} {(1 − δ)[(1 − k)s + km(p)] + δ E^∆_{(N−1)κ̲(p)+k} w̲^∆(p)}.   (5)

That is, w̲^∆ is the value from the best response to all other players following κ̲.

A unilateral deviation from κ̄ in the good state is punished by a transition to the bad state in the following period; otherwise, we remain in the good state. If there is a unilateral deviation from κ̲ in the bad state, we remain in the bad state. Otherwise, a draw of the public randomization device determines whether the state next period is good or bad; this probability is chosen such that the expected payoff is indeed given by w̲^∆ (see below).

With continuation payoffs given by w̄^∆ and w̲^∆, the common action κ ∈ {0, 1} is incentive compatible at a belief p if, and only if,

(1 − δ)[(1 − κ)s + κm(p)] + δ E^∆_{Nκ} w̄^∆(p) ≥ (1 − δ)[κs + (1 − κ)m(p)] + δ E^∆_{(N−1)κ+1−κ} w̲^∆(p).   (6)

Therefore, the functions κ̄ and κ̲ define an SSE if, and only if, (6) holds for κ = κ̄(p) and κ = κ̲(p) at all p.

The probability η^∆(p) of a transition from the bad to the good state in the absence of a unilateral deviation from κ̲(p) is pinned down by the requirement that

w̲^∆(p) = (1 − δ)[(1 − κ̲(p))s + κ̲(p)m(p)] + δ{ η^∆(p) E^∆_{Nκ̲(p)} w̄^∆(p) + [1 − η^∆(p)] E^∆_{Nκ̲(p)} w̲^∆(p) }.   (7)

If k = κ̲(p) is optimal in (5), we simply set η^∆(p) = 0.
Otherwise, (5) and (6) imply

δ E^∆_{Nκ̲(p)} w̄^∆(p) ≥ w̲^∆(p) − (1 − δ)[(1 − κ̲(p))s + κ̲(p)m(p)] > δ E^∆_{Nκ̲(p)} w̲^∆(p),

so (7) holds with

η^∆(p) = [ w̲^∆(p) − (1 − δ)[(1 − κ̲(p))s + κ̲(p)m(p)] − δ E^∆_{Nκ̲(p)} w̲^∆(p) ] / [ δ E^∆_{Nκ̲(p)} w̄^∆(p) − δ E^∆_{Nκ̲(p)} w̲^∆(p) ] ∈ (0, 1].
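The algebra pinning down the randomization probability can be sanity-checked numerically. The following sketch (with illustrative placeholder numbers, not calibrated to the model) solves equation (7) for η^∆(p) and verifies that randomizing between the two continuation values reproduces the bad-state payoff exactly:

```python
# Solve equation (7) for the bad-to-good transition probability eta:
#   w_low = (1 - delta)*flow + delta*(eta*E_good + (1 - eta)*E_bad)
def eta_from_eq7(w_low, flow, delta, E_good, E_bad):
    """eta that makes the randomized continuation deliver w_low exactly.

    w_low  : bad-state payoff at the current belief
    flow   : expected current flow payoff under the bad-state action
    E_good : expected continuation value in the good state
    E_bad  : expected continuation value in the bad state
    """
    return (w_low - (1 - delta) * flow - delta * E_bad) / (delta * (E_good - E_bad))

# Illustrative placeholder values; the inequality chain derived from (5) and (6)
# guarantees E_bad < (w_low - (1 - delta)*flow)/delta <= E_good, hence eta in (0, 1].
delta, flow, E_good, E_bad = 0.95, 1.0, 2.0, 1.2
w_low = (1 - delta) * flow + delta * 1.5   # continuation value 1.5 lies between E_bad and E_good
eta = eta_from_eq7(w_low, flow, delta, E_good, E_bad)
reconstructed = (1 - delta) * flow + delta * (eta * E_good + (1 - eta) * E_bad)
```

Plugging η back into the right-hand side of (7) recovers w̲^∆(p), confirming that the transition probability is well defined whenever the displayed inequality chain holds strictly.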
It remains to specify κ̄ and κ̲. Let

p^m = (s − m₀)/(m₁ − m₀).

As m(p^m) = s, this is the belief at which a myopic agent is indifferent between the two arms. It is straightforward to verify that p* < p^m. Fixing p̲ ∈ (p*_N, p*) and p̄ ∈ (p^m, 1), we set κ̄(p) = 1_{p > p̲} and κ̲(p) = 1_{p > p̄}. Note that punishment and reward strategies coincide outside of (p̲, p̄).

Proposition 5
For ρ > 0, there are beliefs p♭ ∈ (p*_N, p*) and p♯ ∈ (p^m, 1) such that for all p̲ ∈ (p*_N, p♭) and p̄ ∈ (p♯, 1), there exists ∆̄ > 0 such that for all ∆ ∈ (0, ∆̄), the two-state automaton with functions κ̄ and κ̲ defines an SSE of the experimentation game with period length ∆.

The proof consists of verifying that, for a sufficiently small ∆, the actions κ̄(p) and κ̲(p) satisfy the incentive-compatibility constraint (6) at all p. First, we find ε > 0 such that w̲^∆ = s in a neighborhood of p̲ + ε. The payoff functions w̄^∆ and w̲^∆ resulting from the two-state automaton are then bounded away from one another on [p̲ + ε, p̄] for small ∆. In this range, therefore, the difference in expected continuation values across states does not vanish as ∆ tends to 0, whereas the difference in current expected payoffs across actions is of order ∆, rendering deviations unattractive for small enough ∆. On (p̄,
1] and [0, p̲], κ̄ and κ̲ both prescribe the myopically optimal action. Given that continuation payoffs are weakly higher in the good state, it is easy to show that there are no incentives to deviate on these intervals. For beliefs in (p̲, p̲ + ε), κ̲ again prescribes the myopically optimal action. The proof of incentive compatibility of κ̄ on this interval crucially relies on the fact that, for small ∆, w̄^∆ is bounded below by V_{N,p̲}, which has a convex kink at p̲. This, together with the fact that, conditional on no lump sum arriving, the log-likelihood ratio of posterior beliefs is Gaussian, allows us to demonstrate the existence of some constant C > 0 such that E^∆_N w̄^∆(p) ≥ s + C√∆ to the immediate right of p̲, whereas E^∆_{N−1} w̲^∆(p) ≤ s + C′∆ with some constant C′ >
0. For small ∆, therefore, the linearly vanishing current-payoff advantage of the safe over the risky arm is dominated by the incentives provided through continuation payoffs.

The next result essentially follows from letting p̲ → p*_N and p̄ → 1.

Proposition 6
For ρ > 0, lim_{∆→0} W̄^∆_SSE = V*_N and lim_{∆→0} W̲^∆_SSE = V*, uniformly on [0, 1].

1_A denotes the indicator function of the event A.

6.2 Pure Poisson Learning (ρ = 0)

Let ρ = 0, and take p̂ as in part (iii) of Theorem 1.

Proposition 7
Let ρ = 0. For any ε > 0, there is a ∆_ε > 0 such that for all ∆ ∈ (0, ∆_ε), the set of beliefs at which experimentation can be sustained in a PBE of the discrete game with period length ∆ is contained in the interval (p̂ − ε, 1]. In particular, lim sup_{∆→0} W̄^∆_PBE(p) ≤ V_{N,p̂}(p).

For a heuristic explanation of the logic behind this result, consider a sequence of pure-strategy PBEs for vanishing ∆ such that the infimum of the set of beliefs at which at least one player experiments converges to some limit p̃. Selecting a subsequence of ∆s and relabeling players, if necessary, we can assume without loss of generality that players 1, …, L play R immediately to the right of p̃, while players L + 1, …, N play S. In the limit, players' individual continuation payoffs are bounded below by the single-agent value function V* and cannot sum to more than N V_{N,p̃}, so the sum of the continuation payoffs of players 1, …, L is bounded above by N V_{N,p̃} − (N − L)V*. Averaging these players' incentive-compatibility constraints thus yields

Lλ(p̃)[ (N V_{N,p̃}(j(p̃)) − (N − L)V*(j(p̃)))/L − s ] − rc(p̃) ≥ (L − 1)λ(p̃)[V*(j(p̃)) − s].

Simplifying the left-hand side, adding (N − L)λ(p̃)[V*(j(p̃)) − s] to both sides and rearranging, we obtain

Nλ(p̃)[V_{N,p̃}(j(p̃)) − s] − rc(p̃) ≥ (N − 1)λ(p̃)[V*(j(p̃)) − s],

which in turn implies p̃ ≥ p̂, as we show in Lemma A.9 in the appendix. The proof of Proposition 7 makes this heuristic argument rigorous and extends it to mixed equilibria. For non-revealing jumps (λ₀ > 0), p̲ is now restricted to exceed p̂.

Proposition 8
Let ρ = 0 and λ₀ > 0. There are beliefs p♭ ∈ (p̂, p*) and p♯ ∈ (p^m, 1) such that for all p̲ ∈ (p̂, p♭) and p̄ ∈ (p♯, 1), there exists ∆̄ > 0 such that for all ∆ ∈ (0, ∆̄), the two-state automaton with functions κ̄ and κ̲ defines an SSE of the experimentation game with period length ∆.

The strategy for the proof of this proposition is the same as that of Proposition 5, except for the belief region to the immediate right of p̲, where incentives are now provided through terms of first order in ∆, akin to those in equation (3).

In the case λ₀ >
0, we are able to provide incentives in the potentially last round of experimentation by threatening punishment conditional on there being a success (that is, a successful experiment). This option is no longer available in the case of λ₀ = 0. Indeed, any success now takes us to a posterior of one, so that everyone plays risky forever after. This means that, irrespective of whether a success occurs in that round, continuation strategies are independent of past behavior, conditional on the players' belief. This raises the possibility of unravelling. If incentives just above the candidate threshold at which players give up on the risky arm cannot be provided, can this threshold be lower than in the MPE?

To settle whether unravelling occurs requires us to study the discrete game in considerable detail. We start by noting that for λ₀ = 0, we can strengthen Proposition 7 as follows: there is no PBE with any experimentation at beliefs below the discrete-time single-agent cutoff p^∆_1 = inf{p : W^∆_1(p) > s} (see Heidhues et al. (2015)). The highest average payoff that can be hoped for, then, involves all players experimenting above p^∆_1.

Unlike in the case of λ₀ > 0, it may not be possible to enforce both actions at beliefs just above p^∆_1. The proof of the next proposition establishes that the length of the interval of beliefs for which this is the case vanishes as ∆ →
0. In particular, for higher beliefs (except for beliefs arbitrarily close to 1, where playing R is strictly dominant), both pure actions can be enforced in some equilibrium.

Proposition 9
Let ρ = 0 and λ₀ = 0. For any beliefs p̲ and p̄ such that p* < p̲ < p̄ < 1, there exists ∆̄ > 0
such that for all ∆ ∈ (0, ∆̄), there exist

– an SSE in which, starting from a prior above p̲, all players use the risky arm on the path of play as long as the belief remains above p̲ and use the safe arm for beliefs below p*; and

– an SSE in which, given a prior between p̲ and p̄, the players' payoff is no larger than their best-reply payoff against opponents who use the risky arm if, and only if, the belief lies in [p*, p̲] ∪ [p̄, 1].

The study of symmetric MPEs is difficult in discrete time. Unlike in continuous time, in which the explicit solution is known (see Keller et al. (2005)), they do not seem to admit an easy characterization. For some open sets of beliefs, there are multiple symmetric MPEs in discrete time, regardless of how small ∆ is. It is not known whether any or all of these converge (in some sense) to the symmetric MPE in continuous time. In particular, this excludes the possibility that the asymmetric MPE of Keller et al. (2005) with an infinite number of switches between the two arms below p* can be approximated in the discrete game.

While this is somewhat weaker than Proposition 8, its implications for limit payoffs as ∆ → 0 are the same. As the length of the interval [p*, p̲] can be chosen arbitrarily small (actually, of the order ∆, as the proof establishes), its impact on equilibrium payoffs starting from priors above p̲ is of order ∆. This suggests that for the equilibria whose existence is stated in Proposition 9, the payoff converges to the payoff from all players experimenting above p* and to the best-reply payoff against none of the opponents experimenting. Indeed, we have the following result, covering both inconclusive and conclusive jumps.

Proposition 10
For ρ = 0, lim_{∆→0} W̄^∆_SSE = V_{N,p̂} and lim_{∆→0} W̲^∆_SSE = V*, uniformly on [0, 1].

While it is possible to derive explicit solutions to the equilibrium payoff sets of interest, at least asymptotically, note that, already in the discrete game, a characterization in terms of optimality equations can be obtained, which defines the correspondence of SSE payoffs. As discussed in the introduction, these generalize the familiar equation characterizing the value function of the symmetric MPE. Instead of a single (HJB) equation, the characterization of SSE payoffs involves two coupled functional equations, whose solution delivers the highest and lowest equilibrium payoff. Proposition 11 states this in the discrete game, while Proposition 12 gives the continuous-time limit. As these propositions do not heavily rely on the specific structure of our game, we believe that they might be useful for analyzing SSE payoffs for more general processes or other stochastic games.

Fix ∆ >
0. For p ∈ [0, 1], let W̄^∆(p) and W̲^∆(p) denote the supremum and infimum, respectively, of the set of payoffs over pure-strategy SSEs, given prior belief p. If such an equilibrium exists, these extrema are achieved, and W̄^∆(p) ≥ W̲^∆(p). For ρ > 0 or λ₀ > 0, we have shown in Sections 6.1–6.2 that in the limit as ∆ →
0, these extrema converge. (For the existence of various types of equilibria in discrete-time stochastic games, see Mertens, Sorin and Zamir (2015), Chapter 7.) The next proposition characterizes W̄^∆ and W̲^∆ via a pair of coupled functional equations.

Proposition 11
Suppose that the discrete game with time increment ∆ > 0 admits a pure-strategy SSE for any prior belief. Then the pair of functions (w̄, w̲) = (W̄^∆, W̲^∆) solves the functional equations

w̄(p) = max_{κ∈K(p; w̄, w̲)} {(1 − δ)[(1 − κ)s + κm(p)] + δ E^∆_{Nκ} w̄(p)},   (8)

w̲(p) = min_{κ∈K(p; w̄, w̲)} max_{k∈{0,1}} {(1 − δ)[(1 − k)s + km(p)] + δ E^∆_{(N−1)κ+k} w̲(p)},   (9)

where K(p; w̄, w̲) ⊆ {0, 1} denotes the set of all κ such that

(1 − δ)[(1 − κ)s + κm(p)] + δ E^∆_{Nκ} w̄(p) ≥ max_{k∈{0,1}} {(1 − δ)[(1 − k)s + km(p)] + δ E^∆_{(N−1)κ+k} w̲(p)}.   (10)

Moreover, W̲^∆ ≤ w̲ ≤ w̄ ≤ W̄^∆ for any solution (w̄, w̲) of (8)–(10).

This result relies on arguments that are familiar from Cronshaw and Luenberger (1994). We briefly sketch them here.

The above equations can be understood as follows. The ideal condition for a given (symmetric) action profile to be incentive compatible is that if each player conforms to it, the continuation payoff is the highest possible, while a deviation triggers the lowest possible continuation payoff. These actions are precisely the elements of K(p; w̄, w̲), as defined by equation (10). Given this set of actions, equation (9) provides the recursion that characterizes the constrained minmax payoff under the assumption that if a player were to deviate to his myopic best reply to the constrained minmax action profile, the punishment would be restarted next period, resulting in a minimum continuation payoff. Similarly, equation (8) yields the highest payoff under this constraint, but here, playing the best action (within the set) is on the equilibrium path.

Note that in any SSE, given p, the action κ(p) must be an element of K(p; W̄^∆, W̲^∆). This is because the left-hand side of (10) with w̄ = W̄^∆ is an upper bound on the continuation payoff if no player deviates, and the right-hand side with w̲ = W̲^∆ a lower bound on the continuation payoff after a unilateral deviation. Consider the equilibrium that achieves W̄^∆.
Then,

W̄^∆(p) ≤ max_{κ∈K(p; W̄^∆, W̲^∆)} {(1 − δ)[(1 − κ)s + κm(p)] + δ E^∆_{Nκ} W̄^∆(p)},
as the action played must be in K(p; W̄^∆, W̲^∆), and the continuation payoff is at most given by W̄^∆. Similarly, W̲^∆ must satisfy (9) with "≥" instead of "=." Suppose now that the "≤" were strict. Then we can define a strategy profile given prior p that (i) in period 0, plays the maximizer of the right-hand side, and (ii) from t = ∆ onward, abides by the continuation strategy achieving W̄^∆(p_∆). Because the initial action is in K(p; W̄^∆, W̲^∆), this constitutes an equilibrium, and it achieves a payoff strictly larger than W̄^∆(p), a contradiction. Hence, (8) must hold with equality for W̄^∆. The same reasoning applies to W̲^∆ and (9).

Fix a pair (w̄, w̲) that satisfies (8)–(10). Note that this implies w̲ ≤ w̄. Given such a pair and any prior p, we specify two SSEs whose payoffs are w̄ and w̲, respectively. It then follows that W̲^∆ ≤ w̲ ≤ w̄ ≤ W̄^∆. Let κ̄ and κ̲ denote a selection of the maximum and minimum of (8)–(9). The equilibrium strategies are described by a two-state automaton, whose states are referred to as "good" or "bad." The difference between the two equilibria lies in the initial state: w̄ is achieved when the initial state is good, w̲ is achieved when it is bad. In the good state, play proceeds according to κ̄; in the bad state, it proceeds according to κ̲. Transitions are exactly as in the equilibria described in Sections 6.1–6.2. This structure precludes profitable one-shot deviations in either state, so that the automaton describes equilibrium strategies, and the desired payoffs are obtained.

As ∆ tends to 0, equations (8)–(9) transform into differential-difference equations involving terms that are familiar from the continuous-time analysis in Section 3. A formal Taylor approximation shows that for any κ ∈ {0, 1}, K ∈ {0, 1, …, N} and a sufficiently regular function w on the unit interval,

(1 − δ)[(1 − κ)s + κm(p)] + δ E^∆_K w(p) = w(p) + r{(1 − κ)s + κm(p) + K b(p, w) − w(p)}∆ + o(∆).
Applying this approximation to (8)–(9), cancelling the terms of order 0 in ∆, dividing through by ∆, letting ∆ → 0, and writing c(p) = s − m(p) for the opportunity cost of playing risky, we obtain the coupled differential-difference equations that appear in the following result.

Proposition 12
Let ρ > 0 or λ₀ > 0. As ∆ → 0, the pair of functions (W̄^∆, W̲^∆) converges uniformly (in p) to a pair of functions (w̄, w̲) solving

w̄(p) = s + max_{κ∈K(p)} κ[N b(p, w̄) − c(p)],   (11)

w̲(p) = s + min_{κ∈K(p)} (N − 1)κ b(p, w̲) + max_{k∈{0,1}} k[b(p, w̲) − c(p)],   (12)

where

K(p) = {0} for p ≤ p̂,  {0, 1} for p̂ < p < 1,  {1} for p = 1,   (13)

and p̂ is as in parts (ii) and (iii) of Theorem 1.

This result is an immediate consequence of the previous results. It follows from Sections 6.1–6.2 that, except when ρ = λ₀ = 0, there exist pure-strategy SSEs and the pair (W̄^∆, W̲^∆) converges uniformly to (V_{N,p̂}, V*). It is straightforward to verify that (w̄, w̲) = (V_{N,p̂}, V*) solves (11)–(13). First, as V*_N satisfies

V*_N(p) = s + max_{κ∈{0,1}} κ[N b(p, V*_N) − c(p)],

with N b(p, V*_N) − c(p) > 0 if and only if p > p*_N, (11) is trivially solved by V*_N whenever p̂ = p*_N. (This equation follows from the HJB equation in Section 3: because the maximand is linear in K, the continuous-time planner finds it optimal to set K = 0 or K = N at any given belief.) Second, for p̂ > p*_N, the function V_{N,p̂} satisfies

V_{N,p̂}(p) = s + 1_{p>p̂}[N b(p, V_{N,p̂}) − c(p)],

with N b(p, V_{N,p̂}) − c(p) > 0 for p > p̂, so V_{N,p̂} solves (11) when p̂ > p*_N. Third, V* always solves (12). In fact, as b(p, V*) ≥ 0, min_{κ∈{0,1}} (N − 1)κ b(p, V*) = 0, and (12) with this minimum set to zero is just the HJB equation for V*.

Note that the continuous-time functional equations (11)–(12) would be equally easy to solve for any arbitrary p̂ in (13). However, only the solution with p̂ as in Theorem 1 captures the asymptotics of our discretization of the experimentation game.

We have shown that the inefficiencies arising in strategic bandit problems are driven by the solution concept, MPE. Inefficiencies entirely disappear when news has a Brownian component (recall that θ = 1 denotes the good state).
For processes with a Brownian component, our proof that risky play is incentive compatible immediately to the right of the threshold p*_N only exploits the properties of the posterior belief process conditional on no lump sum arriving. As these properties are the same whether lump sums are informative or not, asymptotic efficiency when a Brownian component is present obtains more generally. When learning is driven by lump-sum payoffs only, inspection of equation (3) suggests that efficiency requires that a lump sum of any size arriving at the initial belief p*_N lead to a posterior belief no higher than p*. Therefore, the condition for asymptotic efficiency has a straightforward generalization.

As mentioned above, our model rules out lumpy bad news. Hence, it rules out models in which Poisson events are "breakdowns," as in the model of Keller and Rady (2015), for instance. Bad news amounts to assuming that the safe flow payoff and the average size of lump-sum payoffs are both negative, with λ₁h < s < λ₀h ≤
0. Now, θ = 1 is the bad state of the world, and the efficient and single-player solution cutoffs in continuous time satisfy p*_N > p*, with the stopping region lying to the right of the cutoff in either case. The associated value functions V* and V*_N solve the same HJB equations as in Section 3. In this model, j(p*_N) > p*_N > p*, i.e., starting from p*_N, the belief remains in the single-agent stopping region for small ∆, whether a breakdown occurs or not. Hence, the harshest possible punishment, consisting of all other players playing safe forever, can be meted out to any potential deviator, whether there is a breakdown or not. Thus, we conjecture that asymptotic efficiency also obtains in this framework.

Appendix

A Auxiliary Results

A.1 Evolution of Beliefs
For the description of the evolution of beliefs, it is convenient to work with the log odds ratio

ℓ_t = ln(p_t/(1 − p_t)).

Suppose that, starting from ℓ₀ = ℓ, the players use the fixed action profile (k₁, …, k_N) ∈ {0, 1}^N. By Peskir and Shiryaev (2006, pp. 287–289 and 334–338), the log odds ratio at time t > 0 is

ℓ_t = ℓ + Σ_{n: k_n=1} { ((α₁ − α₀)/σ²)(X^n_t − α₀t − hN^n_t) − [(α₁ − α₀)²/(2σ²) + λ₁ − λ₀]t + ln(λ₁/λ₀)N^n_t },

where X^n and N^n are the payoff and Poisson processes, respectively, associated with player n's risky arm. The terms involving α₀, α₁ and σ capture learning from the continuous component, X^n_t − hN^n_t, of the payoff process, with higher realizations making the players more optimistic. The terms involving λ₀ and λ₁ capture learning from lump-sum payoffs, with the players becoming more pessimistic on average as long as no lump sum arrives, and each arrival increasing the log odds ratio by the fixed increment ln(λ₁/λ₀).

Under the probability measure P^θ associated with state θ ∈ {0, 1}, X^n_t − α₀t − hN^n_t is Gaussian with mean (α_θ − α₀)t and variance σ²t, so that Σ_{n: k_n=1}((α₁ − α₀)σ^{−2})(X^n_t − α₀t − hN^n_t) is Gaussian with mean K(α₁ − α₀)(α_θ − α₀)σ^{−2}t and variance Kρ²t, where K = Σ^N_{n=1} k_n and ρ = (α₁ − α₀)σ^{−1}. Conditional on the event that Σ_{n: k_n=1} N^n_t = J, therefore, ℓ_t is normally distributed with mean ℓ − K(λ₁ − λ₀ − ρ²/2)t + J ln(λ₁/λ₀) and variance Kρ²t under P¹, and normally distributed with mean ℓ − K(λ₁ − λ₀ + ρ²/2)t + J ln(λ₁/λ₀) and variance Kρ²t under P⁰. Finally, the probability under measure P^θ that Σ_{n: k_n=1} N^n_t = J equals ((Kλ_θ t)^J/J!)e^{−Kλ_θ t} by the sum property of the Poisson distribution.

Taken together, these facts make it possible to explicitly compute the distribution of p_t = e^{ℓ_t}/(1 + e^{ℓ_t}) under the players' measure P^p = pP¹ + (1 − p)P⁰.
As this explicit representation is not neededin what follows, we omit it here.Instead, we turn to the characterization of infinitesimal changes of p t , once more assuminga fixed action profile with K players using the risky arm. Arguing as in Cohen and Solan(2013, Section 3.3), one shows that, with respect to the players’ information filtration, the Here, λ /λ is understood to be 1 when λ = λ = 0. When λ > λ = 0, we have ℓ t = ∞ and p t = 1 from the arrival time of the first lump-sum on. rocess of posterior beliefs is a Markov process whose infinitesimal generator L K acts asfollows on real-valued functions v of class C on the open unit interval: L K v ( p ) = K (cid:26) ρ p (1 − p ) v ′′ ( p ) − ( λ − λ ) p (1 − p ) v ′ ( p ) + λ ( p ) [ v ( j ( p )) − v ( p )] (cid:27) . In particular, instantaneous changes in beliefs exhibit linearity in K in the sense that L K = K L .By the very nature of Bayesian updating, finally, the process of posterior beliefs is amartingale with respect to the players’ information filtration. A.2 Payoff Functions
Our first auxiliary result concerns the function u(·; μ_N) defined in Section 3.

Lemma A.1  δ E^∆_K u(·; μ_N)(p) = δ^{1−K/N} u(p; μ_N) for all ∆ > 0, K ∈ {0, …, N} and p ∈ (0, 1).

Proof:
We simplify notation by writing u for u(·; μ_N). Consider the process (p_t) of posterior beliefs in continuous time when p₀ = p > 0 and K players use the risky arm. By Dynkin's formula,

E[e^{−rK∆/N} u(p_∆)] = u(p) + E[ ∫₀^∆ e^{−rKt/N} { L_K u(p_t) − (rK/N)u(p_t) } dt ]
 = u(p) + K E[ ∫₀^∆ e^{−rKt/N} { L₁u(p_t) − (r/N)u(p_t) } dt ]
 = u(p),

where the last equality follows from the fact that L₁u = ru/N on (0, 1). Thus, δ^{K/N} E^∆_K u(p) = u(p).

We further note that E^∆_K m(p) = m(p) for all K by the martingale property of beliefs and the linearity of m in p.

These properties are used repeatedly in what follows. Their first application is in the proof of uniform convergence of the discrete-time single-agent value function to its continuous-time counterpart.

Let (W, ‖·‖) be the Banach space of bounded real-valued functions on [0,
1] equipped with the supremum norm. Given ∆ >
0 and any w ∈ W, define a function T^∆_1 w ∈ W by

T^∆_1 w(p) = max{ (1 − δ)m(p) + δ E^∆_1 w(p), (1 − δ)s + δw(p) }.

(To verify the identity L₁u = ru/N used in the proof of Lemma A.1, note that

u′(p) = −((μ_N + p)/(p(1 − p))) u(p),  u″(p) = (μ_N(μ_N + 1)/(p²(1 − p)²)) u(p),  u(j(p)) = (λ₀/λ(p))(λ₀/λ₁)^{μ_N} u(p),

and use the equation defining μ_N.)

The operator T^∆_1 satisfies Blackwell's sufficient conditions for being a contraction mapping with modulus δ on (W, ‖·‖): monotonicity (v ≤ w implies T^∆_1 v ≤ T^∆_1 w) and discounting (T^∆_1(w + c) = T^∆_1 w + δc for any real number c). By the contraction mapping theorem, T^∆_1 has a unique fixed point in W; this is the value function W^∆_1 of an agent experimenting in isolation.

The corresponding continuous-time value function is V* as introduced in Section 3. As any discrete-time strategy is feasible in continuous time, we trivially have W^∆_1 ≤ V*.

Lemma A.2  W^∆_1 → V* uniformly as ∆ → 0.

Proof:
A lower bound for W^∆_1 is given by the payoff function W^∆_* of a single agent who uses the cutoff p* in discrete time; this function is the unique fixed point in W of the contraction mapping T^∆_* defined by

T^∆_* w(p) = (1 − δ)m(p) + δ E^∆_1 w(p) if p > p*,  (1 − δ)s + δw(p) if p ≤ p*.

Next, choose p̆ < p*, and define p♮ = (p̆ + p*)/2 and the function v = m + Cu(·; μ₁) + 1_{[0,p♮]}(s − m − Cu(·; μ₁)), where the constant C is chosen so that s = m(p̆) + Cu(p̆; μ₁).

Fix ε >
0. As v converges uniformly to V* as p̆ → p*, we can choose p̆ such that v ≥ V* − ε. It suffices now to show that there is a ∆̄ > 0 such that T^∆_* v ≥ v for ∆ < ∆̄. In fact, the monotonicity of T^∆_* then implies W^∆_* ≥ v and hence V* − ε ≤ v ≤ W^∆_* ≤ W^∆_1 ≤ V* for all ∆ < ∆̄.

For p ≤ p*, we have T^∆_* v(p) = (1 − δ)s + δv(p) ≥ v(p) for all ∆, because v ≤ s in this range. For p > p*,

T^∆_* v(p) = (1 − δ)m(p) + δ E^∆_1 v(p)
 = (1 − δ)m(p) + δ E^∆_1 [m + Cu + 1_{[0,p♮]}(s − m − Cu)](p)
 = v(p) + δ E^∆_1 [1_{[0,p♮]}(s − m − Cu)](p),

where the last equation uses that E^∆_1 m(p) = m(p) and δ E^∆_1 u(p) = u(p). In particular, T^∆_* v(1) = v(1).

The function s − m − Cu is negative on the interval (0, p̆) and positive on (p̆, p♯), for some p♯ > p*. The expectation of s − m(p_∆) − Cu(p_∆) conditional on p₀ = p and p_∆ ≤ p♮ is continuous in (p, ∆) ∈ [p*, 1] × (0, ∞) and converges to s − m(p♮) − Cu(p♮) > 0 as p → p* and ∆ → 0, because the conditional distribution of p_∆ becomes a Dirac measure at p♮ in either limit. This implies the existence of ∆̄ > 0 such that this conditional expectation is positive for all (p, ∆) ∈ [p*, 1] × (0, ∆̄). For these (p, ∆), we thus have

E^∆_1 [1_{[0,p♮]}(s − m − Cu)](p) ≥ E^∆_1 [1_{[p♭,p♮]}(s − m − Cu)](p) ≥ 0,

where p♭ = (p̆ + p♮)/2. As a consequence, T^∆_* v ≥ v for all (p, ∆) ∈ (p*, 1] × (0, ∆̄).

Next, we turn to the payoff function associated with the good state of the automaton defined in Section 6. By the same arguments as invoked immediately before Lemma A.2, w̄^∆ is the unique fixed point in W of the operator T̄^∆ defined by

T̄^∆ w(p) = (1 − δ)m(p) + δ E^∆_N w(p) if p > p̲,  (1 − δ)s + δw(p) if p ≤ p̲.

Lemma A.3
Let p̲ > p*_N. Then w̄^∆ ≥ V_{N,p̲} for ∆ sufficiently small.

Proof:
Because of the monotonicity of the operator T̄^∆, it suffices to show that T̄^∆ V_{N,p̲} ≥ V_{N,p̲} for sufficiently small ∆. Recall that for p > p̲, V_{N,p̲}(p) = m(p) + Cu(p; μ_N), where the constant C > 0 is pinned down by value matching at p̲.

For p ≤ p̲, we use exactly the same argument as in the penultimate paragraph of the proof of Lemma A.2; for p > p̲, the argument is the same as in the last paragraph of that proof.

The next two results concern the payoff function associated with the bad state of the automaton defined in Section 6. Fix a cutoff p̄ ∈ (p^m,
1) and let K(p) = N − 1 if p > p̄, and K(p) = 0 otherwise. Given ∆ >
0 and any bounded function w on [0, 1], define T̲^∆ w by

T̲^∆ w(p) = max{ (1 − δ)m(p) + δ E^∆_{K(p)+1} w(p), (1 − δ)s + δ E^∆_{K(p)} w(p) }.

The operator T̲^∆ again satisfies Blackwell's sufficient conditions for being a contraction mapping with modulus δ on W. Its unique fixed point in this space is the payoff function w̲^∆ (introduced in Section 6) from playing a best response against N − 1 opponents who use the risky arm if p > p̄, and safe otherwise.

Lemma A.4
Let p̲ ∈ (p*_N, p*). Then there exists p⋄ ∈ [p^m, 1) such that for all p̄ ∈ (p⋄, 1), the inequality w̲^∆ ≤ V_{N,(p̲+p*)/2} holds for ∆ sufficiently small.

Proof:
Let p̃ = (p̲ + p*)/
2. For p > p̃, we have V_{N,p̃}(p) = m(p) + Cu(p; μ_N), where the constant C > 0 is pinned down by value matching at p̃. To simplify notation, we write ṽ instead of V_{N,p̃} and u instead of u(·; μ_N). For x >
0, we define

p*_x = μ_x(s − m₀) / [(μ_x + 1)(m₁ − s) + μ_x(s − m₀)],

where μ_x is the unique positive root of

f(μ; x) = (ρ²/2)μ(μ + 1) + (λ₁ − λ₀)μ + λ₀[(λ₀/λ₁)^μ − 1] − r/x;

existence and uniqueness of this root follow from continuity and monotonicity of f(·; x) together with the fact that f(0; x) < 0 and f(μ; x) → ∞ as μ → ∞. This extends our previous definitions of μ_N and p*_N to non-integer numbers. It is immediate to verify now that dμ_x/dx < 0 and dp*_x/dx <
0. Thus, there exists x̆ ∈ (1, N) such that p*_x̆ ∈ (p̃, p*). Having chosen such an x̆, we fix a belief p̆ ∈ (p̃, p*_x̆) and, on the open unit interval, consider the function v̆ that solves

L₁v − (r/x̆)(v − m) = 0

subject to the conditions v̆(p̆) = s and v̆′(p̆) = 0. This function has the form

v̆(p) = m(p) + ŭ(p), with ŭ(p) = A(1 − p)((1 − p)/p)^μ̆ + Bp(p/(1 − p))^μ̂ = Au(p; μ̆) + Bu(1 − p; μ̂).

Here, μ̆ = μ_x̆ and μ̂ is the unique positive root of

g(μ; x) = (ρ²/2)μ(μ + 1) − (λ₁ − λ₀)μ + λ₁[(λ₁/λ₀)^μ − 1] − r/x;

existence and uniqueness of this root follow along the same lines as above.

The constants of integration A and B are pinned down by the conditions v̆(p̆) = s and v̆′(p̆) = 0. One calculates that B > 0 if and only if p̆ < p*_x̆, which holds by construction, and that A > 0 if and only if

p̆ < (1 + μ̂)(s − m₀) / [μ̂(m₁ − s) + (1 + μ̂)(s − m₀)].

The right-hand side of this inequality is decreasing in μ̂ and tends to p^m as μ̂ → ∞. Therefore, we can conclude that the inequality holds, and A > 0. As A + B > 0, the function v̆ is strictly increasing and strictly convex on (p̆, 1). As B >
0, finally, v̆(p) → ∞ as p → 1. Hence, there exists p♮ ∈ (p̆,
1) such that v̆(p♮) = ṽ(p♮) and v̆ > ṽ on (p♮, 1). We claim that v̆ < ṽ on (p̆, p♮). Indeed, if this is not the case, then v̆ − ṽ assumes a non-negative local maximum at some p♯ ∈ (p̆, p♮). This implies:

(i) v̆(p♯) ≥ ṽ(p♯), i.e.,

Au(p♯; μ̆) + Bu(1 − p♯; μ̂) ≥ Cu(p♯; μ_N);   (A.1)

(ii) v̆′(p♯) = ṽ′(p♯), i.e.,

−(μ̆ + p♯)Au(p♯; μ̆) + (μ̂ + 1 − p♯)Bu(1 − p♯; μ̂) = −(μ_N + p♯)Cu(p♯; μ_N);   (A.2)

Cf.
Lemma 6 in Cohen & Solan (2013). nd (iii) ˘ v ′′ ( p ♯ ) ≤ ˜ v ′′ ( p ♯ ), i.e. ,˘ µ (˘ µ + 1) Au ( p ♯ ; ˘ µ ) + ˆ µ (1 + ˆ µ ) Bu (1 − p ♯ ; ˆ µ ) ≤ µ N ( µ N + 1) Cu ( p ♯ ; µ N ) . (A.3)Solving for Bu (1 − p ♯ ; ˆ µ ) in (A.2) and inserting the result into (A.1) and (A.3), we obtain,respectively, Cu ( p ♯ ; µ N ) Au ( p ♯ ; ˘ µ ) ≤ ˘ µ + ˆ µ + 1 µ N + ˆ µ + 1 , and Cu ( p ♯ ; µ N ) Au ( p ♯ ; ˘ µ ) ≥ ˘ µ (˘ µ + 1)(ˆ µ + 1 − p ♯ ) + ˆ µ (ˆ µ + 1)(˘ µ + p ♯ ) µ N ( µ N + 1)(ˆ µ + 1 − p ♯ ) + ˆ µ (ˆ µ + 1)( µ N + p ♯ ) . This implies that˘ µ + ˆ µ + 1 µ N + ˆ µ + 1 ≥ ˘ µ (˘ µ + 1)(ˆ µ + 1 − p ♯ ) + ˆ µ (ˆ µ + 1)(˘ µ + p ♯ ) µ N ( µ N + 1)(ˆ µ + 1 − p ♯ ) + ˆ µ (ˆ µ + 1)( µ N + p ♯ ) , which one shows to be equivalent to ˘ µ ≤ µ N . But ˘ x < N and dµ x dx < µ > µ N . Thisis the desired contradiction.Now let p ⋄ = max { p m , p ♮ } , fix ¯ p ∈ ( p ⋄ ,
1) and define v ( p ) = ˜ v ( p ) if p > p ♮ , ˘ v ( p ) if ˘ p ≤ p ≤ p ♮ ,s if p < ˘ p. By construction, s ≤ v ≤ min { ˜ v, ˘ v } . This immediately implies that (1 − δ ) s + δv ≤ v . Wenow show that T ∆ v ≤ v , and hence w ∆ ≤ v , for ∆ sufficiently small.First, let p ∈ (¯ p, − δ ) m ( p ) + δ E ∆ N v ( p ) ≤ (1 − δ ) m ( p ) + δ E ∆ N (cid:2) m + Cu + (0 , ˘ p ) ( s − m − Cu ) (cid:3) ( p )= m ( p ) + Cu ( p ) + δ E ∆ N (cid:2) (0 , ˘ p ) ( s − m − Cu ) (cid:3) ( p ) ≤ m ( p ) + Cu ( p )= v ( p ) , for ∆ small enough that E ∆ N (cid:2) (0 , ˘ p ) ( s − m − Cu ) (cid:3) (¯ p ) ≤
0; that this inequality holds for small∆ follows from the fact that s < m + Cu on (˜ p, ˘ p ). By the same token,(1 − δ ) s + δ E ∆ N − v ( p ) ≤ (1 − δ ) s + δ E ∆ N − ( m + Cu )( p ) + δ E ∆ N − (cid:2) (0 , ˘ p ) ( s − m − Cu ) (cid:3) ( p )= (1 − δ ) s + δm ( p ) + δ N Cu ( p ) + δ E ∆ N − (cid:2) (0 , ˘ p ) ( s − m − Cu ) (cid:3) ( p ) ≤ m ( p ) + Cu ( p )= v ( p ) , for ∆ small enough that E ∆ N − (cid:2) (0 , ˘ p ) ( s − m − Cu ) (cid:3) (¯ p ) ≤
0, as Cu ( p ) > s < m ( p ) for > p m .Second, let p ∈ ( p ♮ , ¯ p ]. Now, we have(1 − δ ) m ( p ) + δ E ∆1 v ( p ) ≤ m ( p ) + δ − N Cu ( p ) + δ E ∆1 (cid:2) (0 , ˘ p ) ( s − m − Cu ) (cid:3) ( p ) ≤ m ( p ) + Cu ( p )= v ( p ) , for ∆ small enough that E ∆1 (cid:2) (0 , ˘ p ) ( s − m − Cu ) (cid:3) ( p ♮ ) ≤ p ∈ [˘ p, p ♮ ]. In this case,(1 − δ ) m ( p ) + δ E ∆1 v ( p ) ≤ (1 − δ ) m ( p ) + δ E ∆1 ˘ v ( p )= m ( p ) + δ E ∆1 ˘ u ( p )= m ( p ) + ˘ u ( p ) + E (cid:20) Z ∆0 e − rt (cid:8) L ˘ u ( p t ) − r ˘ u ( p t ) (cid:9) dt (cid:12)(cid:12)(cid:12)(cid:12) p = p (cid:21) ≤ m ( p ) + ˘ u ( p ) + E (cid:20) Z ∆0 e − rt n L ˘ u ( p t ) − r ˘ x ˘ u ( p t ) o dt (cid:12)(cid:12)(cid:12)(cid:12) p = p (cid:21) = m ( p ) + ˘ u ( p )= v ( p ) , where the second equality follows from Dynkin’s formula, the second inequality holds because˘ u ( p t ) > x >
1, and the third equality is a consequence of the identity L ˘ u − r ˘ u/ ˘ x = 0.Finally, let p ∈ [0 , ˘ p ). By monotonicity of m and v (and the previous step), we see that(1 − δ ) m ( p ) + δ E ∆1 v ( p ) ≤ (1 − δ ) m (˘ p ) + δ E ∆1 v (˘ p ) ≤ v (˘ p ) = s = v ( p ). Lemma A.5
There exist p̌ ∈ (p^m, 1) and p‡ ∈ (p*_N, p*) such that w_∆(p) = s for all p̄ ∈ (p̌, 1), p ≤ p‡ and ∆ > 0. For any ε > 0, moreover, there exists p̌_ε ∈ (p̌, 1) such that w_∆ ≤ V* + ε for all p̄ ∈ (p̌_ε, 1) and ∆ > 0.

Proof:
Consider any ¯ p ∈ ( p m ,
1) and an initial belief p < ¯ p . We obtain an upper bound on w ∆ ( p ) by considering a modified problem in which (i) the player can choose a best responsein continuous time and (ii) the game is stopped with continuation payoff m as soon as thebelief ¯ p is reached. This problem can be solved in the standard way, yielding an optimalcutoff p ‡ . By construction, w ∆ = s on [0 , p ‡ ]. As we take ¯ p close to 1, p ‡ approaches p ∗ fromthe left and thus gets to lie strictly in between p ∗ N and p ∗ . This proves the first statement.The second follows from the fact that the value function of the modified problem convergesuniformly to V ∗ as ¯ p → ρ = 0), we need a sharper characterization of thepayoff function w ∆ as ∆ becomes small. To this end, we define V , ¯ p as the continuous-timecounterpart to w ∆ . The methods employed in Keller and Rady (2010) can be used to establishthat V , ¯ p has the following properties for ρ = 0. First, there is a cutoff p † < p m such that V , ¯ p = s on [0 , p † ], and V , ¯ p > s everywhere else. Second, V , ¯ p is continuously differentiable verywhere except at ¯ p . Third, V , ¯ p solves the Bellman equation v ( p ) = max n m ( p ) + [ K ( p ) + 1] b ( p, v ) , s + K ( p ) b ( p, v ) o , where b ( p, v ) = λ ( p ) r [ v ( j ( p )) − v ( p )] − λ − λ r p (1 − p ) v ′ ( p ) , and v ′ ( p ) is taken to mean the left-hand derivative of v . Fourth, b ( p, V , ¯ p ) ≥ p . Fifth,because of smooth pasting at p † , the term m ( p ) + b ( p, V , ¯ p ) − s is continuous in p except at¯ p ; it has a single zero at p † , being positive to the right of it and negative to the left. Finally,we note that V , ¯ p = V ∗ and p † = p ∗ for ¯ p = 1. Lemma A.6
Let ρ = 0 . Then V , ¯ p → V ∗ uniformly as ¯ p → . The convergence is monotonein the sense that ¯ p ′ > ¯ p implies V , ¯ p ′ < V , ¯ p on { p : s < V , ¯ p ( p ) < λ h } . As the closed-form solutions for the functions in question make it straightforward toestablish this result, we omit the proof.A key ingredient in the analysis of the pure Poisson case is uniform convergence of w ∆ to V , ¯ p as ∆ →
0, which we establish by means of the following result. Lemma A.7
Let {T_∆}_{∆>0} be a family of contraction mappings on the Banach space (W; ‖·‖) with moduli {β_∆}_{∆>0} and associated fixed points {w_∆}_{∆>0}. Suppose that there is a constant ν > 0 such that 1 − β_∆ = ν∆ + o(∆) as ∆ → 0. Then, a sufficient condition for w_∆ to converge in (W; ‖·‖) to the limit v as ∆ → 0 is that ‖T_∆ v − v‖ = o(∆).

Proof: As

‖w_∆ − v‖ = ‖T_∆ w_∆ − v‖ ≤ ‖T_∆ w_∆ − T_∆ v‖ + ‖T_∆ v − v‖ ≤ β_∆ ‖w_∆ − v‖ + ‖T_∆ v − v‖,

the stated conditions on β_∆ and ‖T_∆ v − v‖ imply

‖w_∆ − v‖ ≤ ‖T_∆ v − v‖ / (1 − β_∆) = ∆ f(∆) / (ν∆ + ∆ g(∆)) = f(∆) / (ν + g(∆)),

with lim_{∆→0} f(∆) = lim_{∆→0} g(∆) = 0.

In our application of this lemma, W is again the Banach space of bounded real-valued functions on the unit interval, equipped with the supremum norm. The operator in question is T_∆ as defined above; the corresponding moduli are β_∆ = δ = e^{−r∆}, so that ν = r.

Lemma A.8
Let ρ = 0 . Then w ∆ → V , ¯ p uniformly as ∆ → . To the best of our knowledge, the earliest appearance of this result in the economics literature isin Biais et al. (2007). A related approach is taken in Sadzik and Stacchetti (2015). roof: To simplify notation, we write v instead of V , ¯ p . For K ∈ { , , . . . , N } , a straight-forward Taylor expansion of E ∆ K v with respect to ∆ yieldslim ∆ → (cid:13)(cid:13) δ E ∆ K v − v − r [ Kb ( · , v ) − v ]∆ (cid:13)(cid:13) = 0 . (A.4)For p > ¯ p , we have K ( p ) = N −
1, and (A.4) implies(1 − δ ) m ( p ) + δ E ∆ N v ( p ) = v ( p ) + r [ m ( p ) + N b ( p, v ) − v ( p )] ∆ + o (∆) , (1 − δ ) s + δ E ∆ N − v ( p ) = v ( p ) + r [ s + ( N − b ( p, v ) − v ( p )] ∆ + o (∆) . As m ( p ) > s on [¯ p,
1] and b ( p, v ) ≥
0, there exists ξ > m ( p ) + N b ( p, v ) − [ s + ( N − b ( p, v )] > ξ, on (¯ p, T ∆ v ( p ) = (1 − δ ) m ( p ) + δ E ∆ N v ( p ) for ∆ sufficiently small, and the fact that v ( p ) = m ( p ) + N b ( p, v ) now implies T ∆ v ( p ) = v ( p ) + o (∆) on (¯ p, , ¯ p ], we have K ( p ) = 0, and (A.4) implies (cid:13)(cid:13) (1 − δ ) m + δ E ∆1 v − v − r [ m + b ( · , v ) − v )∆ (cid:13)(cid:13) = ∆ ψ R (∆) , (A.5) (cid:13)(cid:13) (1 − δ ) s + δ E ∆0 v − v − r [ s − v ]∆ (cid:13)(cid:13) = ∆ ψ S (∆) , (A.6)for some functions ψ R , ψ S : (0 , ∞ ) → [0 , ∞ ) that satisfy ψ R (∆) → ψ S (∆) → → p ∈ ( p † , ¯ p ]. We note that T ∆ v ( p ) ≥ (1 − δ ) m ( p ) + δ E ∆1 v ( p ) ≥ v ( p ) − ∆ ψ R (∆)in this range, where the first inequality follows from the definition of T ∆ , and the secondinequality is implied by (A.5) and v ( p ) = m ( p ) + b ( p, v ) for p ∈ ( p † , ¯ p ]. If the maximum inthe definition of T ∆ v ( p ) is achieved by the risky action, the first in the previous chain ofinequalities holds as an equality, and (A.5) immediately implies that T ∆ v ( p ) = v ( p ) + o (∆).If the maximum in the definition of T ∆ v ( p ) is achieved by the safe action, however, we have T ∆ v ( p ) = (1 − δ ) s + δ E ∆0 v ( p ) ≤ v ( p ) + r [ s − v ( p )]∆ + ∆ ψ S (∆) ≤ v ( p ) + ∆ ψ S (∆), wherethe second inequality follows from v > s on ( p † , ¯ p ]. Thus v ( p ) − ∆ ψ R (∆) ≤ T ∆ v ( p ) ≤ v ( p ) + ∆ ψ S (∆), and we can conclude that T ∆ v ( p ) = v ( p ) + o (∆) in this case as well.Now, let p ≤ p † . We note that T ∆ v ( p ) ≥ (1 − δ ) s + δ E ∆0 v ( p ) ≥ v ( p ) − ∆ ψ S (∆) in thisrange, where the first inequality follows from the definition of T ∆ , and the second inequalityis implied by (A.6) and v ( p ) = s for p ≤ p † . If the maximum in the definition of T ∆ v ( p ) isachieved by the safe action, the first in the previous chain of inequalities holds as an equality,and (A.6) immediately implies that T ∆ v ( p ) = v ( p ) + o (∆). 
If the maximum in the definition of T_∆v(p) is achieved by the risky action, however, we have T_∆v(p) = (1 − δ) m(p) + δ E^∆_1 v(p) ≤ v(p) + r[m(p) + b(p, v) − v(p)]∆ + ∆ψ_R(∆) ≤ v(p) + ∆ψ_R(∆), where the second inequality follows from v = s ≥ m(p) + b(p, v) on [0, p†]. Thus v(p) − ∆ψ_S(∆) ≤ T_∆v(p) ≤ v(p) + ∆ψ_R(∆), and we can again conclude that T_∆v(p) = v(p) + o(∆) in this case as well.

Our last two auxiliary results pertain to the case of pure Poisson learning.

Lemma A.9
Let ρ = 0. There is a belief p̂ ∈ [p*_N, p*] such that

λ(p) [ N V_{N,p}(j(p)) − (N − 1) V*(j(p)) − s ] − rc(p)

is negative if 0 < p < p̂, zero if p = p̂, and positive if p̂ < p < 1. Moreover, p̂ = p*_N if, and only if, j(p*_N) ≤ p*, and p̂ = p* if, and only if, λ₀ = 0.

Proof:
We start by noting that given the functions V ∗ and V ∗ N , the cutoffs p ∗ and p ∗ N areuniquely determined by λ ( p ∗ )[ V ∗ ( j ( p ∗ )) − s ] = rc ( p ∗ ) , (A.7)and λ ( p ∗ N )[ N V ∗ N ( j ( p ∗ N )) − N s ] = rc ( p ∗ N ) , (A.8)respectively.Consider the differentiable function f on (0 ,
1) given by

f(p) = λ(p) [ N V_{N,p}(j(p)) − (N − 1) V*(j(p)) − s ] − rc(p).

For λ₀ = 0, we have j(p) = 1 and V_{N,p}(j(p)) = V*(j(p)) = m₁ for all p, so f(p) = λ(p)[V*(j(p)) − s] − rc(p), which is zero at p = p* by (A.7), positive for p > p*, and negative for p < p*. Assume λ₀ >
0. For 0 < p < p ≤
1, we have V N,p ( p ) = m ( p ) + c ( p ) u ( p ; µ N ) /u ( p ; µ N ).Moreover, we have V ∗ ( p ) = s when p ≤ p ∗ , and V ∗ ( p ) = m ( p ) + Cu ( p ; µ ) with a constant C > u ( j ( p ); µ ) = λ λ ( p ) (cid:18) λ λ (cid:19) µ u ( p ; µ ) , we see that the term λ ( p ) N V
N,p ( j ( p )) is actually linear in p . When j ( p ) ≤ p ∗ , the term − λ ( p )( N − V ∗ ( j ( p )) is also linear in p ; when j ( p ) > p ∗ , the nonlinear part of this termsimplifies to − ( N − Cλ µ +10 u ( p ; µ ) /λ µ . This shows that f is concave, and strictly concaveon the interval of all p for which j ( p ) > p ∗ . As lim p → f ( p ) >
0, this in turn implies that f has at most one root in the open unit interval; if so, f assumes negative values to the left ofthe root, and positive values to the right.As V N,p ∗ ( j ( p ∗ )) > V ∗ ( j ( p ∗ )), moreover, we have f ( p ∗ ) > λ ( p ∗ )[ V ∗ ( j ( p ∗ )) − s ] − rc ( p ∗ ) = 0by (A.7). Any root of f must thus lie in [0 , p ∗ ). If j ( p ∗ N ) ≤ p ∗ , then V ∗ ( j ( p ∗ N )) = s and f ( p ∗ N ) = λ ( p ∗ N )[ N V ∗ N ( j ( p ∗ N )) − N s ] − rc ( p ∗ N ) = 0 by (A.8). If j ( p ∗ N ) > p ∗ , then V ∗ ( j ( p ∗ N )) > s and f ( p ∗ N ) <
0, so f has a root in ( p ∗ N , p ∗ ).The following result is used in the proof of Proposition 2. Lemma A.10
Let ρ = 0. Then μ(μ + 1) > N μ_N(μ_N + 1).

Proof: We change variables to β = λ₀/λ₁ and x = r/λ₁, so that μ_N and μ are implicitly defined as the positive solutions of the equations

x/N + β − (1 − β) μ_N = β^{μ_N + 1},
x + β − (1 − β) μ = β^{μ + 1}.

Fixing β ∈ [0,
1) and considering µ N and µ as functions of x ∈ (0 , ∞ ), we obtain µ ′ N = N − − β + β µ N +1 ln β = N − − β + (cid:2) xN + β − (1 − β ) µ N (cid:3) ln β ,µ ′ = 11 − β + β µ +1 ln β = 11 − β + [ x + β − (1 − β ) µ ] ln β . (All denominators are positive because 1 − β + β µ +1 ln β ≥ − β + β ln β > µ ≥ d = µ ( µ + 1) − N µ N ( µ N + 1). As lim x → µ N = lim x → µ = 0, we see thatlim x → d = 0 as well. It is thus enough to show that d ′ > x >
0. This is the case if,and only if, (2 µ + 1) µ ′ > N (2 µ N + 1) µ ′ N , that is,(2 µ +1) (cid:8) − β + (cid:2) xN + β − (1 − β ) µ N (cid:3) ln β (cid:9) > (2 µ N +1) { − β + [ x + β − (1 − β ) µ ] ln β } . This inequality reduces to( µ − µ N ) (cid:8) − β ) + (cid:2) xN + 1 + β (cid:3) ln β (cid:9) > (2 µ N + 1) (cid:2) x − xN (cid:3) ln β. It is straightforward to show that µ > µ N + − β (cid:2) x − xN (cid:3) . So d ′ > − β ) + (cid:2) xN + 1 + β (cid:3) ln β > (2 µ N + 1)(1 − β ) ln β, which simplifies to 1 − β + (cid:2) xN + β − (1 − β ) µ N (cid:3) ln β > B Proofs
B.1 Main Results (Theorem 1 and Propositions 1–4)
Proof of Theorem 1:
For ρ >
0, this result is an immediate consequence of inequalities(2), the fact that lim inf ∆ → W ∆PBE ≥ V ∗ and W ∆PBE ≤ V ∗ N , and Proposition 6. For ρ = 0,the result follows from inequalities (2), the fact lim inf ∆ → W ∆PBE ≥ V ∗ , and Propositions 7and 10. Proof of Proposition 1:
Keller and Rady (2010) establish that in the unique symmetric MPE of the continuous-time game, all experimentation stops at the belief p̃_N implicitly defined by rc(p̃_N) = λ(p̃_N)[ũ(j(p̃_N)) − s], where ũ is the players' common equilibrium payoff function. The results of Keller and Rady (2010) further imply that V_{N,p̃_N}(j(p̃_N)) > ũ(j(p̃_N)) > V*(j(p̃_N)), so that N V_{N,p̃_N}(j(p̃_N)) − (N − 1) V*(j(p̃_N)) > ũ(j(p̃_N)), and hence p̂ < p̃_N by Lemma A.9.

Proof of Proposition 2:
There is nothing to show for λ = 0. Using the same change ofvariables as in the previous proof, we fix β ∈ (0 , q = β · µ − N µ − , so that j ( p ∗ N ) ≤ p ∗ if, and only if, q ≥
1. As lim x →∞ µ N = lim x →∞ µ = ∞ , we havelim x →∞ q = β <
1. As lim x → µ N = lim x → µ = 0, moreover,lim x → q = β lim x → µ µ N = β lim x → µ ′ µ ′ N = βN by l’Hˆopital’s rule. Finally, q ′ is easily seen to have the same sign as − µ ( µ + 1)(1 − β + β µ +1 ln β ) + N µ N ( µ N + 1)(1 − β + β µ N +1 ln β ) . As β µ +1 ln β > β µ N +1 ln β , Lemma A.10 implies that q decreases strictly in x . This in turnimplies that q < x ∈ (0 , ∞ ) when βN ≤
1, which proves the first part of the corollary.Otherwise, there exists a unique x ∗ ∈ (0 , ∞ ) at which q = 1. The second part of the corollarythus holds with ( λ ∗ , λ ∗ ) = ( r/x ∗ , βr/x ∗ ).It is straightforward to see that x varies continuously with β and that lim β → /N x ∗ = 0.So it remains to show that x ∗ remains bounded as β →
1. Rewriting the defining equationfor x ∗ as 1 + 1(1 − β ) µ ( x ∗ ( β ) , β ) = 1(1 − β ) µ N ( x ∗ ( β ) , β ) , we see that (1 − β ) µ N ( x ∗ ( β ) , β ) must stay bounded as β →
1. By the defining equation for µ N , x ∗ ( β ) must then also stay bounded. Proof of Proposition 3:
For the case that ˆ p = p ∗ N , this is shown in Keller and Rady(2010). Thus, in what follows we assume that ˆ p > p ∗ N .Recall the defining equation for ˆ p from Lemma A.9, λ (ˆ p ) N V N, ˆ p ( j (ˆ p )) − λ (ˆ p ) s − rc (ˆ p ) = ( N − λ (ˆ p ) V ∗ ( j (ˆ p )) . We make use of the closed-form expression for V N, ˆ p to rewrite its left-hand side as N λ (ˆ p ) λ ( j (ˆ p )) h + N c (ˆ p )[ λ − µ N ( λ − λ )] − λ (ˆ p ) s. Similarly, by noting that ˆ p > p ∗ N implies j (ˆ p ) > j ( p ∗ N ) > p ∗ , we can make use of the closed- orm expression for V ∗ to rewrite the right-hand side as( N − λ (ˆ p ) λ ( j (ˆ p )) h + ( N − c ( p ∗ ) u (ˆ p ; µ ) u ( p ∗ ; µ ) [ r + λ − µ ( λ − λ )] . Combining, we have λ (ˆ p ) λ ( j (ˆ p )) h + N c (ˆ p )[ λ − µ N ( λ − λ )] − λ (ˆ p ) s ( N − r + λ − µ ( λ − λ )] c ( p ∗ ) = u (ˆ p ; µ ) u ( p ∗ ; µ ) . It is convenient to change variables to β = λ λ and y = λ λ λ h − ss − λ h ˆ p − ˆ p . The implicit definitions of µ and µ N imply N = β µ − β + µ (1 − β ) β µ N − β + µ N (1 − β ) , allowing us to rewrite the defining equation for ˆ p as the equation F ( y, µ N ) = 0 with F ( y, µ ) = 1 − y + [ β (1 + µ ) y − µ ] 1 − ββ β µ − β + µ (1 − β )( µ − µ )(1 − β ) + β µ − β µ − µ µ (1 + µ ) µ y − µ . As y is a strictly increasing function of ˆ p , we know from Lemma A.9 that F ( · , µ N ) admits aunique root, and that it is strictly increasing in a neighborhood of this root.A straightforward computation shows that ∂F ( y, µ N ) ∂µ = 1 − ββ β µ − β + µ (1 − β )(( µ − µ N )(1 − β ) + β µ − β µ N ) ζ ( y, µ N ) , with ζ ( y, µ ) = β (1 − β )(1 + µ ) y − (1 − β ) µ + (1 − βy )( β µ − β µ ) + β µ ( β (1 + µ ) y − µ ) ln β. As p ∗ N < ˆ p < p ∗ , we have µ N µ N < βy < µ µ , which implies ζ ( y, µ ) = ( β (1 + µ ) y − µ ) (1 − β + β µ ln β ) < , and ∂ζ ( y, µ ) ∂µ = β µ [ β (1 + µ ) y − µ ](ln β ) > , for all µ ∈ [ µ N , µ ]. This establishes ζ ( y, µ N ) < y the implicit function theorem, therefore, y is increasing in µ N . 
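The two monotonicity facts that complete this argument, μ_N decreasing in N (recalled from Keller and Rady (2010)) and the inequality of Lemma A.10, can be spot-checked numerically from the implicit equation x/N + β − (1 − β)μ_N = β^{μ_N+1} used in the proof of Lemma A.10. The sketch below uses illustrative parameter values (β = 0.5, x = 1), an assumption for demonstration only:

```python
def mu(N, beta=0.5, x=1.0):
    """Positive root of (1-beta)*m + beta**(m+1) - beta = x/N, by bisection.

    The left-hand side vanishes at m = 0 and is strictly increasing in m
    (its derivative exceeds 1 - beta + beta*log(beta) > 0 for beta in (0,1)),
    so the root is unique and bracketed by [0, 100] for these parameters.
    """
    h = lambda m: (1 - beta) * m + beta ** (m + 1) - beta - x / N
    lo, hi = 0.0, 100.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

mus = {N: mu(N) for N in (1, 2, 3, 5)}

# mu_N is decreasing in the number of players N:
assert mus[1] > mus[2] > mus[3] > mus[5]

# Lemma A.10: mu(mu + 1) > N mu_N(mu_N + 1) for N > 1:
for N in (2, 3, 5):
    assert mus[1] * (mus[1] + 1) > N * mus[N] * (mus[N] + 1)
```

The same bisection can be rerun for other β ∈ (0, 1) and x > 0; the asserted orderings are what Lemma A.10 and the comparative statics above predict at any such parameter choice.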
Recalling from Kellerand Rady (2010) that µ N is decreasing in N , we have thus shown that y (and hence ˆ p ) aredecreasing in N . Proof of Proposition 4:
Simple algebra yields j ( p ∗ N ) p ∗ = λ λ µ N µ ( µ + 1)( λ h − s ) + µ ( s − λ h )( µ N + 1)( λ h − s ) + ( λ /λ ) µ N ( s − λ h ) . From the implicit definitions of µ and µ N , we obtain lim r → µ = lim r → µ N = 0 (so thatthe third fraction in the previous expression converges to 1) andlim r → ∂µ ∂r = (cid:20) λ − λ + λ ln λ λ (cid:21) − = N lim r → ∂µ N ∂r , implying lim r → µ N µ = 1 N , by l’Hˆopital’s rule.Furthermore, we note that we may write equivalently j ( p ∗ N ) p ∗ = λ λ (1 + µ )( λ h − s ) + ( s − λ h )(1 + µ N )( λ h − s ) + ( λ /λ )( s − λ h ) . As lim r →∞ µ = lim r →∞ µ N = ∞ , we can immediately conclude that this ratio converges tothe stated limit for r → ∞ . B.2 Learning with a Brownian Component (Propositions 5–6)
The proof of Proposition 5 rests on a sequence of lemmas that prove incentive compatibilityof the proposed strategies on various subintervals of [0 , ρ is stated, the respective result holds irrespectively of whether ρ > ρ = 0.In view of Lemmas A.4 and A.5, we take p and ¯ p such that p ∗ N < p < p ‡ < p ∗ < p m < max { p ⋄ , ˇ p } < ¯ p < . (B.9)The first two lemmas deal with the safe action ( κ = 0) on the interval [0 , ¯ p ]. Lemma B.1
For all p ≤ p ‡ , (1 − δ ) s + δw ∆ ( p ) ≥ (1 − δ ) m ( p ) + δ E ∆1 w ∆ ( p ) . Proof: As w ∆ ( p ) ≥ s = w ∆ ( p ) for p ≤ p ‡ , we have (1 − δ ) s + δw ∆ ( p ) ≥ s whereas s ≥ (1 − δ ) m ( p ) + δ E ∆1 w ∆ ( p ) by the functional equation for w ∆ . emma B.2 There exists ∆ ( p ‡ , ¯ p ] > such that (1 − δ ) s + δw ∆ ( p ) ≥ (1 − δ ) m ( p ) + δ E ∆1 w ∆ ( p ) , for all p ∈ ( p ‡ , ¯ p ] and ∆ < ∆ ( p ‡ , ¯ p ] . Proof:
By Lemmas A.3 and A.4, there exist ν > > w ∆ ( p ) − w ∆ ( p ) ≥ ν for all p ∈ [ p ‡ , ¯ p ] and ∆ < ∆ . Further, there is a ∆ ∈ (0 , ∆ ] such that |E ∆1 w ∆ ( p ) − w ∆ ( p ) | ≤ ν for all p ∈ [ p ‡ , ¯ p ] and ∆ < ∆ . For these p and ∆, we thus have(1 − δ ) s + δw ∆ ( p ) − (cid:2) (1 − δ ) m ( p ) + δ E ∆1 w ∆ ( p ) (cid:3) ≥ (1 − δ )[ s − m ( p )] + δ ν . Finally, there is a ∆ ( p ‡ , ¯ p ] ∈ (0 , ∆ ] such that the right-hand side of this inequality is positivefor all p ∈ ( p ‡ , ¯ p ] and ∆ < ¯∆.We establish incentive compatibility of the risky action ( κ = 1) to the immediate right of p by means of the following result. Lemma B.3
Let X be a Gaussian random variable with mean m and variance V.

1. For all η > 0, P[X − m > η] < V/η².
2. There exists V ∈ (0 , such that for all V < V , P h V ≤ X − m ≤ V i ≥ − V . Proof:
The first statement is a trivial consequence of Chebysheff’s inequality. The proof ofthe second relies on the following inequality (13.48) of Johnson et al. (1994) for the standardnormal cumulative distribution function:12 h − e − x / ) i ≤ Φ( x ) ≤ h − e − x ) i . Letting Φ V denote the cdf of the Gaussian distribution with variance V (and mean 0), andusing the above upper and lower bounds, we have + Φ V ( V ) − Φ V ( V ) √ V ≤ − q − e − √ V + p − e −√ V √ V .
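In terms of the error function, the bounds invoked here are of the standard form √(1 − e^{−z²}) ≤ erf(z) ≤ √(1 − e^{−4z²/π}) for z ≥ 0, equivalently ½[1 + √(1 − e^{−x²/2})] ≤ Φ(x) ≤ ½[1 + √(1 − e^{−2x²/π})] for the standard normal cdf Φ; the exact exponents stated here are our reading of inequality (13.48) in Johnson et al. (1994) and should be checked against that source. A quick numerical verification of the sandwich:

```python
import math

def Phi(x):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# The bounds hold strictly for every x > 0; check them on a grid.
for i in range(1, 51):
    x = 0.1 * i
    lower = 0.5 * (1.0 + math.sqrt(1.0 - math.exp(-0.5 * x * x)))
    upper = 0.5 * (1.0 + math.sqrt(1.0 - math.exp(-2.0 * x * x / math.pi)))
    assert lower < Phi(x) < upper
```

Both bounds are tight as x → 0 and as x → ∞, which is what makes them useful for the small-variance estimates in this proof.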
Writing x = √ V and using the fact that 1 − √ − y ≤ √ y for 0 ≤ y ≤
1, moreover, we have1 − q − e − x + √ − e − x √ x ≤ s e − x x + 12 r − e − x x → , s x →
0. Thus, + Φ V ( V ) − Φ V ( V ) √ V ≤ , for sufficiently small V , which is the second statement of the lemma.We apply this lemma to the log odds ratio ℓ associated with the current belief p . Forlater use, we note that dp/dℓ = p (1 − p ). Lemma B.4
Let ρ > . There exist ε ∈ (0 , p ‡ − p ) and ∆ ( p,p + ε ] > such that (1 − δ ) m ( p ) + δ E ∆ N w ∆ ( p ) ≥ (1 − δ ) s + δ E ∆ N − w ∆ ( p ) , for all p ∈ ( p, p + ε ] and ∆ < ∆ ( p,p + ε ] . Proof:
Consider a belief p = p and the corresponding log odds ratio ℓ . Let K playersuse the risky arm on the time interval [0 , ∆) and consider the resulting belief p ( K )∆ and theassociated log odds ratio ℓ ( K )∆ .Let P θ denote the probability measure associated with state θ ∈ { , } . Expected contin-uation payoffs are computed by means of the measure P p = p P + (1 − p ) P .Let J ∆0 denote the event that no lump-sum arrives by time ∆. The probability of J ∆0 under the measure P θ is e − λ θ ∆ . Note that e − λ θ ∆ P θ [ A | J ∆0 ] ≤ P θ [ A ] ≤ e − λ θ ∆ P θ [ A | J ∆0 ] + 1 − e − λ θ ∆ , for any event A .As we have seen in Appendix A.1, conditional on J ∆0 , the random variable ℓ ( K )∆ is normallydistributed with mean ℓ − K (cid:0) λ − λ − ρ (cid:1) ∆ and variance Kρ ∆ under P , and normallydistributed with mean ℓ − K (cid:0) λ − λ + ρ (cid:1) ∆ and variance Kρ ∆ under P .Now choose ε > p + ε < p ‡ . Write ℓ , ℓ ε , ℓ ‡ and ¯ ℓ for the log odds ratiosassociated with p , p + ε , p ‡ and ¯ p , respectively. Choose ∆ > ν = min (∆ ,ℓ ) ∈ [0 , ∆ ] × [ ℓ,ℓ ε ] h ℓ ‡ − ℓ + ( N − (cid:16) λ − λ − ρ (cid:17) ∆ i > . or all p ∈ ( p, p + ε ] and ∆ ∈ (0 , ∆ ), the first part of Lemma B.3 now implies P p h p ( N − > p ‡ i = P p h ℓ ( N − > ℓ ‡ i ≤ p n e − λ ∆ P h ℓ ( N − > ℓ ‡ (cid:12)(cid:12)(cid:12) J ∆0 i + 1 − e − λ ∆ o + (1 − p ) n e − λ ∆ P h ℓ ( N − > ℓ ‡ (cid:12)(cid:12)(cid:12) J ∆0 i + 1 − e − λ ∆ o ≤ p (cid:26) e − λ ∆ ( N − ρ ∆ ν + 1 − e − λ ∆ (cid:27) + (1 − p ) (cid:26) e − λ ∆ ( N − ρ ∆ ν + 1 − e − λ ∆ (cid:27) ≤ ( N − ρ ∆ ν + 1 − e − λ ∆ ≤ (cid:26) ( N − ρν + λ (cid:27) ∆ . As w ∆ ≤ s + ( m − s ) ( p ‡ , , moreover, E ∆ N − w ∆ ( p ) ≤ s + ( m − s ) P p h p ( N − > p ‡ i . So there exists C > E ∆ N − w ∆ ( p ) ≤ s + C ∆ for all p ∈ ( p, p + ε ] and ∆ ∈ (0 , ∆ ).Next, define ν = min p ≤ p ≤ ¯ p p (1 − p ) and note that for p ≤ p ≤ ¯ p (and thus for ℓ ≤ ℓ ≤ ¯ ℓ ), V N,p ( p ) ≥ s + max n , V ′ N,p ( p +)( p − p ) o ≥ s + max n , V ′ N,p ( p +) ν ( ℓ − ℓ ) o . 
By the second part of Lemma B.3, there exists ∆ > N ρ ∆ < P h ( N ρ ∆) ≤ ℓ ( N )∆ − ℓ + N (cid:16) λ − λ − ρ (cid:17) ∆ ≤ ( N ρ ∆) (cid:12)(cid:12)(cid:12) J ∆0 i ≥ − ( N ρ ∆) , for arbitrary ℓ and all ∆ ∈ (0 , ∆ ). In particular, P p h ( N ρ ∆) ≤ ℓ ( N )∆ − ℓ + N (cid:16) λ − λ − ρ (cid:17) ∆ ≤ ( N ρ ∆) i ≥ p P h ( N ρ ∆) ≤ ℓ ( N )∆ − ℓ + N (cid:16) λ − λ − ρ (cid:17) ∆ ≤ ( N ρ ∆) i ≥ pe − λ ∆ P h ( N ρ ∆) ≤ ℓ ( N )∆ − ℓ + N (cid:16) λ − λ − ρ (cid:17) ∆ ≤ ( N ρ ∆) (cid:12)(cid:12)(cid:12) J ∆0 i ≥ pe − λ ∆ (cid:18) − ( N ρ ∆) (cid:19) , for these ∆. Taking ∆ smaller if necessary, we can also ensure that ℓ < ℓ − N (cid:16) λ − λ − ρ (cid:17) ∆ + ( N ρ ∆) < ℓ − N (cid:16) λ − λ − ρ (cid:17) ∆ + ( N ρ ∆) < ¯ ℓ, for all ℓ ∈ ( ℓ, ℓ ε ] and all ∆ ∈ (0 , ∆ ).By Lemma A.3, there exists ∆ ∈ (0 , ∆ ) such that w ∆ ≥ V N,p for ∆ ∈ (0 , ∆ ). For such and p ∈ ( p, p + ε ], we now have E ∆ N w ∆ ( p ) ≥ s + pe − λ ∆ (cid:18) − ( N ρ ∆) (cid:19) V ′ N,p ( p +) ν h ℓ − N (cid:16) λ − λ − ρ (cid:17) ∆ + ( N ρ ∆) − ℓ i ≥ s + p (1 − λ ∆) (cid:18) − ( N ρ ∆) (cid:19) V ′ N,p ( p +) ν h − N (cid:16) λ − λ − ρ (cid:17) ∆ + ( N ρ ∆) i . This implies the existence of ∆ ∈ (0 , ∆ ) and C > E ∆ N w ∆ ( p ) ≥ s + C ∆ , for all p ∈ ( p, p + ε ] and ∆ ∈ (0 , ∆ ).For p ∈ ( p, p + ε ] and ∆ ∈ (0 , min { ∆ , ∆ } ), finally,(1 − δ ) m ( p ) + δ E ∆ N w ∆ ( p ) − (cid:2) (1 − δ ) s + δ E ∆ N − w ∆ ( p ) (cid:3) ≥ (1 − δ )[ m ( p ) − s ] + δ n C ∆ − C ∆ o = C ∆ − (cid:8) r [ s − m ( p )] + C (cid:9) ∆ + o (∆) . As the term in ∆ dominates as ∆ becomes small, there exists ∆ ( p,p + ε ] ∈ (0 , min { ∆ , ∆ } )such that this expression is positive for all p ∈ ( p, p + ε ] and ∆ < ∆ ( p,p + ε ] . Lemma B.5
For all ε ∈ (0 , p ‡ − p ) , there exists ∆ ( p + ε, ¯ p ] > such that (1 − δ ) m ( p ) + δ E ∆ N w ∆ ( p ) ≥ (1 − δ ) s + δ E ∆ N − w ∆ ( p ) , for all p ∈ ( p + ε, ¯ p ] and ∆ < ∆ ( p + ε, ¯ p ] . Proof:
First, by Lemma A.3, there exists ∆ > w ∆ ≥ V N,p on the unitinterval. Second, by Lemma A.4, there exist ν > η > ∈ (0 , ∆ ) such that V N,p ( p ) − w ∆ ( p ) ≥ ν for all p ∈ [ p + ε , ¯ p + η ] and ∆ < ∆ . For these p and ∆, and byconvexity of V N,p , we then have E ∆ N w ∆ ( p ) − E ∆ N − w ∆ ( p ) ≥ E ∆ N V N,p ( p ) − E ∆ N − w ∆ ( p ) ≥ E ∆ N − V N,p ( p ) − E ∆ N − w ∆ ( p ) ≥ χ ∆ ( p ) ν + [1 − χ ∆ ( p )]( s − m ) , where χ ∆ ( p ) denotes the probability that the belief p t +∆ lies in [ p + ε , ¯ p + η ] given that p t = p and N − ∈ (0 , ∆ )such that χ ∆ ( p ) ≥ ν + m − sν + m − s , for all p ∈ ( p + ε, ¯ p ] and ∆ < ∆ . For these p and ∆, we thus have(1 − δ ) m ( p ) + δ E ∆ N w ∆ ( p ) − (cid:2) (1 − δ ) s + δ E ∆ N − w ∆ ( p ) (cid:3) ≥ (1 − δ )[ m ( p ) − s ] + δ ν . inally, there is a ∆ ( p + ε, ¯ p ] ∈ (0 , ∆ ) such that the right-hand side of this inequality is positivefor all p ∈ ( p + ε, ¯ p ] and ∆ < ∆ ( p + ε, ¯ p ] . Lemma B.6
There exists ∆ (¯ p, > such that (1 − δ ) m ( p ) + δ E ∆ N w ∆ ( p ) ≥ (1 − δ ) s + δ E ∆ N − w ∆ ( p ) , for all p > ¯ p and ∆ < ∆ (¯ p, . Proof:
By Lemmas A.3 and A.4, there exists ∆ (¯ p, > w ∆ ≥ w ∆ for all∆ < ∆ (¯ p, . For such ∆ and all p > ¯ p , we thus have(1 − δ ) m ( p ) + δ E ∆ N w ∆ ( p ) = w ∆ ( p ) ≥ w ∆ ( p ) ≥ (1 − δ ) s + δ E ∆ N − w ∆ ( p ) , with the last inequality following from the functional equation for w ∆ . Proof of Proposition 5:
Given p and ¯ p as in (B.9), choose ε > ( p,p + ε ] as inLemma B.4, and ∆ ( p ‡ , ¯ p ] , ∆ ( p + ε, ¯ p ] and ∆ (¯ p, as in Lemmas B.2, B.5 and B.6. The two-stateautomaton is an SSE for all∆ < min n ∆ ( p ‡ , ¯ p ] , ∆ ( p,p + ε ] , ∆ ( p + ε, ¯ p ] , ∆ (¯ p, o . So the statement of the proposition holds with p ♭ = p ‡ and p ♯ = max { ˇ p, p ⋄ } . Proof of Proposition 6:
Let ε > V N,p in Section 5 and Lemma A.5 allow us to choose p ∈ ( p ∗ N , p ♭ ) and ¯ p ∈ ( p ♯ ,
1) such that V N,p > V ∗ N − ε and w ∆ < V ∗ + ε for all ∆ >
0. Second, Lemmas A.2 and A.3 and Proposition5 imply the existence of a ∆ † > ∈ (0 , ∆ † ): W ∆1 > V ∗ − ε , w ∆ ≥ V N,p , and w ∆ and w ∆ are SSE payoff functions of the game with period length ∆. Third, W ∆PBE ≤ V ∗ N for all ∆ > ∈ (0 , ∆ † ), we thus have V ∗ N − ε < V N,p ≤ w ∆ ≤ W ∆SSE ≤ W ∆PBE ≤ V ∗ N , and V ∗ − ε < W ∆1 ≤ W ∆PBE ≤ W ∆SSE ≤ w ∆ < V ∗ + ε, so that k W ∆PBE − V ∗ N k , k W ∆SSE − V ∗ N k , k W ∆PBE − V ∗ k and k W ∆SSE − V ∗ k are all smaller than ε , which was to be shown. B.3 Pure Poisson Learning (Propositions 7–10)
Proof of Proposition 7:
For any given ∆ >
0, let p̃_∆ be the infimum of the set of beliefs at which there is some PBE that gives a payoff w_n(p) > s to at least one player. Let p̃ = lim inf_{∆→0} p̃_∆. For any fixed ε >
0, consider the problem of maximizing the players’ averagepayoff subject to no use of the risky arm at beliefs p ≤ ˜ p − ε . Denote the corresponding valuefunction by f W ∆ ,ε . By the definition of ˜ p , there exists a ˜∆ ε > ∈ (0 , ˜∆ ε ),the function f W ∆ ,ε provides an upper bound on the players’ average payoff in any PBE, andso W ∆PBE ≤ f W ∆ ,ε . The value function of the continuous-time version of this maximizationproblem is V N,p ε with p ε = max { ˜ p − ε, p ∗ N } . As the discrete-time solution is also feasible incontinuous time, we have f W ∆ ,ε ≤ V N,p ε , and hence W ∆PBE ≤ V N,p ε for ∆ < ˜∆ ε .Consider a sequence of such ∆’s converging to 0 such that the corresponding beliefs ˜ p ∆ converge to ˜ p . For each ∆ in this sequence, select a belief p ∆ > ˜ p ∆ with the following twoproperties: (i) starting from p ∆ , a single failed experiment takes us below ˜ p ∆ ; (ii) given theinitial belief p ∆ , there exists a PBE for reaction lag ∆ in which at least one player plays riskywith positive probability in the first round. Select such an equilibrium for each ∆ in thesequence and let L ∆ be the number of players in this equilibrium who, at the initial belief p ∆ , play risky with positive probability. Let L be an accumulation point of the sequence of L ∆ ’s. After selecting a subsequence of ∆’s, we can assume without loss of generality thatplayer n = 1 , . . . , L plays risky with probability π ∆ n > p ∆ , while player n = L + 1 , . . . , N plays safe; we can further assume that ( π ∆ n ) Ln =1 converges to a limit ( π n ) Ln =1 in [0 , L .For player n = 1 , . . . 
, L} to play optimally at p^Δ, it must be the case that
\[
(1-\delta)\left[\pi^\Delta_n \lambda(p^\Delta)h + (1-\pi^\Delta_n)s\right] + \delta\left\{\mathrm{Pr}^\Delta(\emptyset)\,w^\Delta_{n,\emptyset} + \sum_{K=1}^{L}\,\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,w^\Delta_{n,I,J}\right\}
\]
\[
\geq\; (1-\delta)s + \delta\left\{\mathrm{Pr}^\Delta_{-n}(\emptyset)\,w^\Delta_{n,\emptyset} + \sum_{K=1}^{L-1}\,\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,w^\Delta_{n,I,J}\right\},
\]
where we write Pr^Δ(I) for the probability that the set of players experimenting is I ⊆ {1, …, L}, Pr^Δ_{−n}(I) for the probability that among the L−1 players in {1, …, L}∖{n} the set of players experimenting is I, and w^Δ_{n,I,J} for the conditional expectation of player n's continuation payoff given that exactly the players in I were experimenting and had J successes (w^Δ_{n,∅} is player n's continuation payoff if no one was experimenting). As Pr^Δ(∅) = (1−π^Δ_n) Pr^Δ_{−n}(∅) ≤ Pr^Δ_{−n}(∅), the inequality continues to hold when we replace w^Δ_{n,∅} by its lower bound s. After subtracting (1−δ)s from both sides, we then have
\[
(1-\delta)\pi^\Delta_n\left[\lambda(p^\Delta)h - s\right] + \delta\left\{\mathrm{Pr}^\Delta(\emptyset)\,s + \sum_{K=1}^{L}\,\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,w^\Delta_{n,I,J}\right\}
\geq \delta\left\{\mathrm{Pr}^\Delta_{-n}(\emptyset)\,s + \sum_{K=1}^{L-1}\,\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,w^\Delta_{n,I,J}\right\}.
\]
Summing these inequalities over n = 1, …, L and writing π̄^Δ = (1/L) ∑_{n=1}^{L} π^Δ_n yields
\[
(1-\delta)L\bar\pi^\Delta\left[\lambda(p^\Delta)h - s\right] + \delta\left\{\mathrm{Pr}^\Delta(\emptyset)\,Ls + \sum_{K=1}^{L}\,\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\sum_{n=1}^{L} w^\Delta_{n,I,J}\right\}
\]
\[
\geq\; \delta\left\{\sum_{n=1}^{L}\mathrm{Pr}^\Delta_{-n}(\emptyset)\,s + \sum_{n=1}^{L}\sum_{K=1}^{L-1}\,\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,w^\Delta_{n,I,J}\right\}.
\]
By construction, w^Δ_{n,I,0} = s whenever I ≠ ∅. For |I| = K > 0 and J > 0, moreover, we have w^Δ_{n,I,J} ≥ W^Δ_1(B^Δ_{J,K}(p^Δ)) for all players n = 1, …, N, and hence
\[
\sum_{n=1}^{L} w^\Delta_{n,I,J} \;\leq\; N\,W^\Delta_{\mathrm{PBE}}(B^\Delta_{J,K}(p^\Delta)) - (N-L)\,W^\Delta_1(B^\Delta_{J,K}(p^\Delta)) \;\leq\; N\,V_{N,\tilde p-\varepsilon}(B^\Delta_{J,K}(p^\Delta)) - (N-L)\,W^\Delta_1(B^\Delta_{J,K}(p^\Delta)).
\]
So, for the preceding inequality to hold, it is necessary that
\[
(1-\delta)L\bar\pi^\Delta\left[\lambda(p^\Delta)h - s\right] + \delta\Big\{\mathrm{Pr}^\Delta(\emptyset)\,Ls + \sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\,\Lambda^\Delta_{0,K}(p^\Delta)\,Ls + \sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\sum_{J=1}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\left[N\,V_{N,\tilde p-\varepsilon}(B^\Delta_{J,K}(p^\Delta)) - (N-L)\,W^\Delta_1(B^\Delta_{J,K}(p^\Delta))\right]\Big\}
\]
\[
\geq\; \delta\Big\{\sum_{n=1}^{L}\mathrm{Pr}^\Delta_{-n}(\emptyset)\,s + \sum_{n=1}^{L}\sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\,\Lambda^\Delta_{0,K}(p^\Delta)\,s + \sum_{n=1}^{L}\sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=1}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,W^\Delta_1(B^\Delta_{J,K}(p^\Delta))\Big\}.
\]
As
\[
\mathrm{Pr}^\Delta(\emptyset) + \sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I) = 1 \quad\text{and}\quad \sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\,K = L\bar\pi^\Delta,
\]
we have the first-order expansions
\[
\mathrm{Pr}^\Delta(\emptyset) + \sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\,\Lambda^\Delta_{0,K}(p^\Delta) = \mathrm{Pr}^\Delta(\emptyset) + \sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\left(1 - K\lambda(p^\Delta)\Delta\right) + o(\Delta) = 1 - L\bar\pi^\Delta\lambda(p^\Delta)\Delta + o(\Delta)
\]
and
\[
\sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\,\Lambda^\Delta_{1,K}(p^\Delta) = \sum_{K=1}^{L}\sum_{|I|=K}\mathrm{Pr}^\Delta(I)\,K\lambda(p^\Delta)\Delta + o(\Delta) = L\bar\pi^\Delta\lambda(p^\Delta)\Delta + o(\Delta),
\]
so, by uniform convergence W^Δ_1 → V* (Lemma A.2), the left-hand side of the last inequality expands as
\[
Ls + L\left\{ r\bar\pi\left[\lambda(\tilde p)h - s\right] - rs + \bar\pi\lambda(\tilde p)\left[N\,V_{N,\tilde p-\varepsilon}(j(\tilde p)) - (N-L)\,V^*(j(\tilde p)) - Ls\right]\right\}\Delta + o(\Delta),
\]
with π̄ = lim_{Δ→0} π̄^Δ. In the same way, the identities
\[
\mathrm{Pr}^\Delta_{-n}(\emptyset) + \sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I) = 1 \quad\text{and}\quad \sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\,K = L\bar\pi^\Delta - \pi^\Delta_n
\]
imply
\[
\sum_{n=1}^{L}\mathrm{Pr}^\Delta_{-n}(\emptyset) + \sum_{n=1}^{L}\sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\,\Lambda^\Delta_{0,K}(p^\Delta) = L - L(L-1)\bar\pi^\Delta\lambda(p^\Delta)\Delta + o(\Delta)
\]
and
\[
\sum_{n=1}^{L}\sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\,\Lambda^\Delta_{1,K}(p^\Delta) = L(L-1)\bar\pi^\Delta\lambda(p^\Delta)\Delta + o(\Delta),
\]
and so the right-hand side of the inequality expands as
\[
Ls + L\left\{ -rs + (L-1)\bar\pi\lambda(\tilde p)\left[V^*(j(\tilde p)) - s\right]\right\}\Delta + o(\Delta).
\]
Comparing terms of order Δ, dividing by L and letting ε →
0, we obtain
\[
\bar\pi\left\{ \lambda(\tilde p)\left[N\,V_{N,\tilde p}(j(\tilde p)) - (N-1)\,V^*(j(\tilde p)) - s\right] - r\,c(\tilde p) \right\} \;\geq\; 0.
\]
By Lemma A.9, this means p̃ ≥ p̂ whenever π̄ > 0. If π̄ = 0, we write the optimality condition for player n ∈ {1, …, L} as
\[
(1-\delta)\lambda(p^\Delta)h + \delta\sum_{K=0}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K+1}(p^\Delta)\,w^\Delta_{n,I\dot\cup\{n\},J}
\;\geq\; (1-\delta)s + \delta\left\{\mathrm{Pr}^\Delta_{-n}(\emptyset)\,w^\Delta_{n,\emptyset} + \sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=0}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,w^\Delta_{n,I,J}\right\}.
\]
As above, w^Δ_{n,∅} ≥ s, and w^Δ_{n,I,0} = s whenever I ≠ ∅. For |I| = K > 0 and J > 0, moreover, we have w^Δ_{n,I,J} ≥ W^Δ_1(B^Δ_{J,K}(p^Δ)), w^Δ_{n,I∪̇{n},J} ≥ W^Δ_1(B^Δ_{J,K+1}(p^Δ)) and w^Δ_{n,I∪̇{n},J} ≤ N V_{N,p̃−ε}(B^Δ_{J,K+1}(p^Δ)) − (N−1) W^Δ_1(B^Δ_{J,K+1}(p^Δ)). So, for the optimality condition to hold, it is necessary that
\[
(1-\delta)\lambda(p^\Delta)h + \delta\Big\{\sum_{K=0}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\,\Lambda^\Delta_{0,K+1}(p^\Delta)\,s + \sum_{K=0}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=1}^{\infty}\Lambda^\Delta_{J,K+1}(p^\Delta)\left[N\,V_{N,\tilde p-\varepsilon}(B^\Delta_{J,K+1}(p^\Delta)) - (N-1)\,W^\Delta_1(B^\Delta_{J,K+1}(p^\Delta))\right]\Big\}
\]
\[
\geq\; (1-\delta)s + \delta\Big\{\mathrm{Pr}^\Delta_{-n}(\emptyset)\,s + \sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\,\Lambda^\Delta_{0,K}(p^\Delta)\,s + \sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\sum_{J=1}^{\infty}\Lambda^\Delta_{J,K}(p^\Delta)\,W^\Delta_1(B^\Delta_{J,K}(p^\Delta))\Big\}.
\]
Now,
\[
\sum_{K=1}^{L-1}\sum_{|I|=K,\;n\notin I}\mathrm{Pr}^\Delta_{-n}(I)\,K = L\bar\pi^\Delta - \pi^\Delta_n \to 0
\]
as Δ vanishes. Therefore, the left-hand side of the above inequality expands as
\[
s + \left\{ r\left[\lambda(\tilde p)h - s\right] + \lambda(\tilde p)\left[N\,V_{N,\tilde p-\varepsilon}(j(\tilde p)) - (N-1)\,V^*(j(\tilde p)) - s\right]\right\}\Delta + o(\Delta),
\]
and the right-hand side as s + o(Δ). Comparing terms of order Δ and letting ε → 0, we again obtain p̃ ≥ p̂.

The statement about the range of experimentation now follows immediately from the fact that for Δ < Δ̃_ε, we have W^Δ_PBE ≤ V_{N,p̃−ε}, and hence W^Δ_PBE = V_{N,p̃−ε} = s on [0, p̃−ε] ⊇ [0, p̂−ε]. The statement about the supremum of equilibrium payoffs follows from the inequality W^Δ_PBE ≤ V_{N,p̃−ε} for Δ < Δ̃_ε, the convergence V_{N,p̃−ε} → V_{N,p̃} as ε →
0, and the inequality V_{N,p̃} ≤ V_{N,p̂}.

We now turn to the proof of Proposition 8. The only difference to the case with a Brownian component is the proof of incentive compatibility to the immediate right of p̲. In view of Lemmas A.9, A.4 and A.5, we consider p̲ and p̄ such that
\[
\hat p < \underline p < p^\ddagger < p^* < p^m < \max\{p^\diamond, \check p\} < \bar p < 1. \tag{B.10}
\]

Lemma B.7 Let ρ = 0 and λ₀ > 0. There exists p♯ ∈ (max{p⋄, p̌}, 1) such that for all p̄ ∈ (p♯, 1), there exist ε ∈ (0, p‡ − p̲) and Δ_{(p̲,p̲+ε]} > 0 such that
\[
(1-\delta)\,m(p) + \delta\,\mathcal{E}^\Delta_N w^\Delta(p) \;\geq\; (1-\delta)\,s + \delta\,\mathcal{E}^\Delta_{N-1} w^\Delta(p)
\]
for all p ∈ (p̲, p̲+ε] and Δ < Δ_{(p̲,p̲+ε]}.

Proof: By Lemma A.3, there exists Δ̄₀ > 0 such that w^Δ ≥ V_{1,p̄} for Δ ∈ (0, Δ̄₀). By Lemma A.9,
\[
\lambda(p)\left[N\,V_{N,p}(j(p)) - (N-1)\,V^*(j(p)) - s\right] - r\,c(p) > 0
\]
for all p > p̂. As V_{N,p}(j(p)) ≤ V_{N,p̲}(j(p)) for p ≥ p̲, this implies
\[
\lambda(p)\left[N\,V_{N,\underline p}(j(p)) - (N-1)\,V^*(j(p)) - s\right] - r\,c(p) > 0
\]
for all p ≥ p̲. There thus exists p♯ > max{p⋄, p̌} such that for all p̄ > p♯,
\[
\lambda(p)\left[N\,V_{N,\underline p}(j(p)) - (N-1)\,V_{1,\bar p}(j(p)) - s\right] - r\,c(p) > 0
\]
for all p ∈ [p̲, p‡]. Fix p̄ ∈ (p♯, 1), let
\[
\nu = \min_{p\in[\underline p,\,p^\ddagger]}\left\{ \lambda(p)\left[N\,V_{N,\underline p}(j(p)) - (N-1)\,V_{1,\bar p}(j(p)) - s\right] - r\,c(p)\right\} > 0,
\]
and choose ε > 0 such that p̲ + ε < p‡ and
\[
\left(N\lambda(\underline p+\varepsilon) + r\right)\left[V_{N,\underline p}(\underline p+\varepsilon) - s\right] < \nu/3.
\]
In the remainder of the proof, we write p^K_J for the posterior belief starting from p when K players use the risky arm and J lump-sums arrive within the length of time Δ. For p ∈ (p̲, p̲+ε] and Δ ∈ (0, Δ̄₀),
\[
(1-\delta)m(p) + \delta\,\mathcal{E}^\Delta_N w^\Delta(p) \;\geq\; (1-\delta)m(p) + \delta\,\mathcal{E}^\Delta_N V_{N,\underline p}(p)
\]
\[
= r\Delta\, m(p) + (1-r\Delta)\left\{ N\lambda(p)\Delta\, V_{N,\underline p}(p^N_1) + (1-N\lambda(p)\Delta)\,V_{N,\underline p}(p^N_0)\right\} + O(\Delta^2)
\]
\[
= V_{N,\underline p}(p^N_0) + \left\{ r\,m(p) + N\lambda(p)\,V_{N,\underline p}(p^N_1) - (N\lambda(p)+r)\,V_{N,\underline p}(p^N_0)\right\}\Delta + O(\Delta^2),
\]
while
\[
(1-\delta)s + \delta\,\mathcal{E}^\Delta_{N-1} w^\Delta(p) = r\Delta\, s + (1-r\Delta)\left\{ (N-1)\lambda(p)\Delta\, w^\Delta(p^{N-1}_1) + \left[1-(N-1)\lambda(p)\Delta\right] w^\Delta(p^{N-1}_0)\right\} + O(\Delta^2)
\]
\[
= w^\Delta(p^{N-1}_0) + \left\{ r s + (N-1)\lambda(p)\, w^\Delta(p^{N-1}_1) - \left[(N-1)\lambda(p)+r\right] w^\Delta(p^{N-1}_0)\right\}\Delta + O(\Delta^2).
\]
As V_{N,p̲}(p^N_0) ≥ s = w^Δ(p^{N−1}_0), the difference (1−δ)m(p) + δ𝓔^Δ_N w^Δ(p) − [(1−δ)s + δ𝓔^Δ_{N−1} w^Δ(p)] is no smaller than Δ times
\[
\lambda(p)\left[ N\,V_{N,\underline p}(p^N_1) - (N-1)\,w^\Delta(p^{N-1}_1) - s\right] - r\,c(p) - (N\lambda(p)+r)\left[V_{N,\underline p}(p^N_0) - s\right],
\]
plus terms of order Δ² and higher.

Let ξ = ν/(9Nλ₁). By Lemma A.8 as well as Lipschitz continuity of V_{N,p̲} and V_{1,p̄}, there exists Δ̄₁ ∈ (0, Δ̄₀) such that ‖w^Δ − V_{1,p̄}‖, max_{p̲ ≤ p ≤ p‡} |V_{N,p̲}(p^N_1) − V_{N,p̲}(j(p))| and max_{p̲ ≤ p ≤ p‡} |V_{1,p̄}(p^{N−1}_1) − V_{1,p̄}(j(p))| are all smaller than ξ when Δ < Δ̄₁. For such Δ and p ∈ (p̲, p‡], we thus have V_{N,p̲}(p^N_1) > V_{N,p̲}(j(p)) − ξ and w^Δ(p^{N−1}_1) < V_{1,p̄}(j(p)) + 2ξ, so that the expression displayed above is larger than ν − 3Nλ(p)ξ − ν/3 ≥ ν/3. This implies existence of a Δ_{(p̲,p̲+ε]} ∈ (0, Δ̄₁) as in the statement of the lemma.

Proof of Proposition 8:
Given p̲ as in (B.10), take p♯ as in Lemma B.7 and fix p̄ > p♯. Choose ε > 0 and Δ_{(p̲,p̲+ε]} as in Lemma B.7, and Δ_{(p‡,p̄]}, Δ_{(p̲+ε,p̄]} and Δ_{(p̄,1)} as in Lemmas B.2, B.5 and B.6. The two-state automaton is an SSE for all
\[
\Delta < \min\left\{ \Delta_{(p^\ddagger,\bar p]},\; \Delta_{(\underline p,\underline p+\varepsilon]},\; \Delta_{(\underline p+\varepsilon,\bar p]},\; \Delta_{(\bar p,1)} \right\}.
\]
So the statement of the proposition holds with p♭ = p‡ and the chosen p♯.

For the proof of Proposition 9, we modify notation slightly, writing Λ for the probability that, conditional on θ = 1, a player has at least one success on his own risky arm in any given round (i.e., Λ = 1 − e^{−λΔ}), and g for the corresponding expected payoff per unit of time. Consider an SSE played at a given prior p, with associated payoff W. If K ≥ 1 players experiment, the posterior belief after a round without a success is p_K. Note that an SSE allows the continuation play to depend on the identity of these players. Taking the expectation over all possible combinations of K players who experiment, however, we can associate with each posterior p_K, K ≥ 1, an expected continuation payoff W_K. If K = 0, so that no player experiments, the belief does not evolve, but there is no reason that the continuation strategies (and so the payoff) should remain the same. We denote the corresponding payoff by W₀. In addition, we write π ∈ [0, 1] for the probability with which each player experiments at p, and q_K for the probability that at least one player has a success, given p, when K of them experiment. The players' common payoff must then satisfy the following optimality equation:
\[
W = \max\Bigg\{ (1-\delta)p g + \delta \sum_{K=0}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K}\left[ q_{K+1}\,g + (1-q_{K+1})\,W_{K+1}\right],
\]
\[
(1-\delta)s + \delta \sum_{K=1}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K}\left[ q_K\,g + (1-q_K)\,W_K\right] + \delta(1-\pi)^{N-1} W_0 \Bigg\}.
\]
The first term corresponds to the payoff from playing risky, the second from playing safe. As it turns out, it is more convenient to work with odds ratios ω = p/(1−p) and ω_K = p_K/(1−p_K), which we refer to as "belief" as well. Note that
\[
p_K = \frac{p(1-\Lambda)^K}{p(1-\Lambda)^K + 1 - p}
\]
implies that ω_K = (1−Λ)^K ω. Note also that
\[
1 - q_K = p(1-\Lambda)^K + 1 - p = (1-p)(1+\omega_K), \qquad q_K = p - (1-p)\,\omega_K = (1-p)(\omega - \omega_K).
\]
We define
\[
m = \frac{s}{g-s}, \qquad \upsilon = \frac{W-s}{(1-p)(g-s)}, \qquad \upsilon_K = \frac{W_K-s}{(1-p_K)(g-s)}.
\]
Note that υ ≥ 0, as s is a lower bound on the value. Simple computations now give
\[
\upsilon = \max\Bigg\{ \omega - (1-\delta)m + \delta\sum_{K=0}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K}\left(\upsilon_{K+1} - \omega_{K+1}\right),\;\; \delta\omega + \delta\sum_{K=0}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K}\left(\upsilon_K - \omega_K\right)\Bigg\}.
\]
It is also useful to introduce w = υ − ω and w_K = υ_K − ω_K. We then obtain
\[
w = \max\Bigg\{ -(1-\delta)m + \delta\sum_{K=0}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K} w_{K+1},\;\; -(1-\delta)\omega + \delta\sum_{K=0}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K} w_K \Bigg\}. \tag{B.11}
\]
We define
\[
\omega^* = \frac{(1-\delta)\,m}{\delta\,\Lambda}.
\]
This is the odds ratio corresponding to the single-agent cutoff p^Δ_1, i.e., ω* = p^Δ_1/(1−p^Δ_1). Note that p^Δ_1 > p* for Δ > 0. No experimentation can be sustained below p^Δ_1 or, in terms of odds ratios, below ω*. For all beliefs ω < ω*, therefore, any equilibrium has w = −ω, or υ = 0, for each player.

Proof of Proposition 9:
Following terminology from repeated games, we say that we can enforce action π ∈ {0, 1} at belief ω if we can construct an SSE for the prior belief ω in which players prefer to choose π in the first round rather than deviate unilaterally. Our first step is to derive sufficient conditions for enforcement of π ∈ {0, 1}. The conditions to enforce these actions are intertwined, and must be derived simultaneously.

Enforcing π = 0 at ω. To enforce π = 0 at ω, it suffices that one round of using the safe arm followed by the best equilibrium payoff at ω exceeds the payoff from one round of using the risky arm followed by the resulting continuation payoff at belief ω₁ (as only the deviating player will have experimented). See below for the precise condition.

Enforcing π = 1 at ω. If a player deviates to π = 0, we jump to w_{N−1} rather than w_N in case all experiments fail. Assume that at ω_{N−1} we can enforce π = 0. As explained above, this implies that at ω_{N−1}, a player's continuation payoff can be pushed down to what he would get by unilaterally deviating to experimentation, which is at most −(1−δ)m + δw_N, where w_N is the highest possible continuation payoff at belief ω_N. To enforce π = 1 at ω, it then suffices that
\[
w = -(1-\delta)m + \delta w_N \;\geq\; -(1-\delta)\omega + \delta\left(-(1-\delta)m + \delta w_N\right),
\]
with the same continuation payoff w_N on the left-hand side of the inequality. The inequality simplifies to
\[
\delta w_N \geq (1-\delta)m - \omega;
\]
by the formula for w, this is equivalent to w ≥ −ω, i.e., υ ≥ 0. Given that
\[
\upsilon = \omega - (1-\delta)m + \delta(\upsilon_N - \omega_N) = \left(1 - \delta(1-\Lambda)^N\right)\omega - (1-\delta)m + \delta\upsilon_N,
\]
to show that υ ≥ 0, it thus suffices that
\[
\omega \;\geq\; \frac{(1-\delta)\,m}{\delta\left(1-(1-\Lambda)^N\right)} \;=\; \tilde\omega,
\]
and that υ_N ≥ 0, which is necessarily the case if υ_N is an equilibrium payoff. Note that (1−Λ)^N ω̃ ≤ ω*, so that ω_N ≥ ω* implies ω ≥ ω̃. In summary, to enforce π = 1 at ω, it suffices that ω_N ≥ ω* and that π = 0 be enforceable at ω_{N−1}.

Enforcing π = 0 at ω (continued). Suppose we can enforce it at ω₁, ω₂, …, ω_{N−1}, and that ω_N ≥ ω*. Note that π = 1 is then enforceable at ω from our previous argument, given our hypothesis that π = 0 is enforceable at ω_{N−1}. It then suffices that
\[
-(1-\delta)\omega + \delta\left(-(1-\delta)m + \delta w_N\right) \;\geq\; -(1-\delta^N)m + \delta^N w_N,
\]
where again it suffices that this holds for the highest value of w_N. To understand this expression, consider a player who deviates by experimenting. Then the following period the belief is down one step, and if π = 0 is enforceable at ω₁, it means that his continuation payoff there can be chosen to be no larger than what he can secure at that point by deviating and experimenting again, etc. The right-hand side is then obtained as the payoff from N consecutive unilateral deviations to experimentation (in fact, we have picked an upper bound, as the continuation payoff after this string of deviations need not be the maximum w_N). The left-hand side is the payoff from playing safe one period before setting π = 1 and getting the maximum payoff w_N, a continuation strategy that is sequentially rational given that π = 1 is enforceable at ω by our hypothesis that π = 0 is enforceable at ω_{N−1}.

Plugging in the definition of υ_N, this inequality simplifies to
\[
(\delta^2 - \delta^N)\,\upsilon_N \;\geq\; (\delta^2 - \delta^N)(\omega_N - m) + (1-\delta)(\omega - m),
\]
which is always satisfied for beliefs ω ≤ m, i.e., below the myopic cutoff ω^m (which coincides with the normalized payoff m).

To summarize, if π = 0 can be enforced at the N−1 beliefs ω₁, …, ω_{N−1}, with ω_N ≥ ω* and ω ≤ ω^m, then both π = 0 and π = 1 can be enforced at ω.
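The identities underlying this enforcement argument can be checked numerically. The sketch below uses illustrative parameter values only (the names `posterior`, `Lam`, `omega_star`, `omega_tilde` are ours, not the paper's notation): it verifies that Bayesian updating indeed yields ω_K = (1−Λ)^K ω, the two expressions for q_K, and the comparison (1−Λ)^N ω̃ ≤ ω*, which is what makes ω_N ≥ ω* imply ω ≥ ω̃.

```python
# Numeric sanity checks for identities used in the enforcement argument.
# All parameter values are illustrative assumptions, not taken from the paper.
import math

def posterior(p, Lam, K):
    # Bayes rule after K failed experiments when a success is fully revealing:
    # P(theta = 1 | K failures) = p(1-Lam)^K / (p(1-Lam)^K + 1 - p)
    return p * (1 - Lam)**K / (p * (1 - Lam)**K + 1 - p)

p, Lam = 0.6, 0.3
omega = p / (1 - p)
for K in range(6):
    pK = posterior(p, Lam, K)
    omega_K = pK / (1 - pK)
    # odds-ratio update: omega_K = (1 - Lam)^K * omega
    assert abs(omega_K - (1 - Lam)**K * omega) < 1e-12
    qK = p * (1 - (1 - Lam)**K)   # prob. of at least one success among K
    assert abs((1 - qK) - (1 - p) * (1 + omega_K)) < 1e-12
    assert abs(qK - (1 - p) * (omega - omega_K)) < 1e-12

# (1-Lam)^N * omega_tilde <= omega_star, hence omega_N >= omega_star
# implies omega >= omega_tilde.
m = 1.0  # scale factor; the comparison is homogeneous in m
for delta in (0.9, 0.99):
    for Lam in (0.05, 0.2, 0.5):
        for N in (2, 3, 5):
            omega_star = (1 - delta) * m / (delta * Lam)
            omega_tilde = (1 - delta) * m / (delta * (1 - (1 - Lam)**N))
            assert (1 - Lam)**N * omega_tilde <= omega_star + 1e-12
print("ok")
```

The second loop confirms the comparison over a grid of discount factors, success probabilities, and player numbers; it holds in general because Λ(1−Λ)^N ≤ 1 − (1−Λ)^N.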
By induction, this implies that if we can find an interval of beliefs [ω_N, ω) with ω_N ≥ ω* for which π = 0 can be enforced, then both π = 0 and π = 1 can be enforced at all beliefs ω′ ∈ (ω, ω^m).

Our second step is to establish that such an interval of beliefs exists. This second step itself involves three steps. First, we derive some "simple" equilibrium, which is a symmetric Markov equilibrium. Second, we show that we can enforce π = 1 on sufficiently (finitely) many consecutive values of beliefs building on this equilibrium; third, we show that this can be used to enforce π = 0 as well.

It will be useful to distinguish beliefs according to whether they belong to the interval
\[
[\omega^*, (1+\lambda\Delta)\omega^*),\; [(1+\lambda\Delta)\omega^*, (1+2\lambda\Delta)\omega^*),\; \ldots
\]
For τ ∈ ℕ, let I_{τ+1} = [(1+τλΔ)ω*, (1+(τ+1)λΔ)ω*). For fixed Δ, every ω ≥ ω* can be uniquely mapped into a pair (x, τ) ∈ [0, 1) × ℕ such that ω = (1+λ(x+τ)Δ)ω*, and we alternatively denote beliefs by such a pair. Note also that, for small enough Δ > 0, one unsuccessful experiment takes a belief that belongs to the interval I_{τ+1} to (within O(Δ²) of) the interval I_τ. (Recall that Λ = λΔ + O(Δ²).)

Let us start with deriving a symmetric Markov equilibrium. Because it is Markovian, υ₀ = υ in our notation; that is, the continuation payoff when nobody experiments is equal to the payoff itself. Rewriting the equations, using the risky arm gives the payoff
\[
\upsilon = \omega - (1-\delta)m - \delta(1-\Lambda)(1-\pi\Lambda)^{N-1}\omega + \delta\sum_{K=0}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K}\,\upsilon_{K+1},
\]
while using the safe arm yields
\[
\upsilon = \delta\left(1-(1-\pi\Lambda)^{N-1}\right)\omega + \delta(1-\pi)^{N-1}\upsilon + \delta\sum_{K=1}^{N-1}\binom{N-1}{K}\pi^K(1-\pi)^{N-1-K}\,\upsilon_K.
\]
[Footnote: To pull out the terms involving the belief ω from the sum appearing in the definition of υ, use the fact that ∑_{K=0}^{N−1} \binom{N−1}{K} π^K (1−π)^{N−1−K} (1−Λ)^K = (1−πΛ)^{N−1}.]

In the Markov equilibrium we derive, players are indifferent between both actions, and so their payoffs are the same. Given any belief ω or corresponding pair (τ, x), we conjecture an equilibrium in which
\[
\pi = a(\tau,x)\,\Delta + O(\Delta^2), \qquad \upsilon = b(\tau,x)\,\Delta + O(\Delta^2),
\]
for some functions a, b of the pair (τ, x) only. Using the fact that Λ = λΔ + O(Δ²) and 1 − δ = rΔ + O(Δ²), we replace this in the two payoff expressions, and take Taylor expansions to get, respectively,
\[
0 = \left( r\,b(\tau,x) - \frac{\lambda^2 m r}{(\lambda+r)^2}\,(N-1)\,a(\tau,x) \right)\Delta + O(\Delta^2)
\]
and
\[
0 = \left( b(\tau,x) - \frac{r m\lambda}{\lambda+r}\,(\tau+x) \right)\Delta + O(\Delta^2).
\]
We then solve for a(τ,x), b(τ,x), to get
\[
\underline\pi = \frac{r(\lambda+r)(x+\tau)}{(N-1)\,\lambda}\,\Delta + O(\Delta^2),
\]
with corresponding value
\[
\underline\upsilon = \frac{\lambda m r}{\lambda+r}\,(x+\tau)\,\Delta + O(\Delta^2).
\]
This being an induction on K, it must be verified that the expansion indeed holds at the lowest interval, I₁, and this verification is immediate.

We now turn to the second step and argue that we can find N−1 consecutive beliefs at which π = 1 can be enforced.
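Two ingredients of these expansions can be verified numerically: the binomial identity used to pull the belief terms out of the sums, and the claim that one unsuccessful experiment moves a belief in I_{τ+1} to within O(Δ²) of I_τ. A minimal sketch with illustrative parameter values (the variable names are ours, not the paper's):

```python
# Checks behind the Taylor expansions. Parameter values (lam, omega_star,
# tau, x) are illustrative assumptions only.
import math

# sum_{K=0}^{N-1} C(N-1,K) pi^K (1-pi)^{N-1-K} (1-Lam)^K = (1 - pi*Lam)^{N-1}
for N in (2, 4, 7):
    for pi, Lam in ((0.3, 0.2), (0.8, 0.6)):
        lhs = sum(math.comb(N - 1, K) * pi**K * (1 - pi)**(N - 1 - K)
                  * (1 - Lam)**K for K in range(N))
        assert abs(lhs - (1 - pi * Lam)**(N - 1)) < 1e-12

# With omega = (1 + lam*(tau+x)*Delta)*omega_star and Lam = 1 - exp(-lam*Delta),
# the distance between omega*(1-Lam) and (1 + lam*(tau+x-1)*Delta)*omega_star
# shrinks like Delta^2 (the error ratio is ~4 when Delta is halved).
lam, omega_star, tau, x = 1.0, 0.8, 2, 0.4

def step_error(Delta):
    omega = (1 + lam * (tau + x) * Delta) * omega_star
    target = (1 + lam * (tau + x - 1) * Delta) * omega_star
    return abs(omega * math.exp(-lam * Delta) - target)

r1, r2 = step_error(1e-3), step_error(5e-4)
assert 3.5 < r1 / r2 < 4.5
print("ok")
```

The first loop is just the binomial theorem applied to ((1−π) + π(1−Λ))^{N−1}; the second confirms the quadratic decay of the one-step approximation error for a fixed pair (τ, x).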
We then verify that incentives can be provided to do so, assuming that υ̲ gives the continuation values used by the players whether or not a player deviates from π = 1. Assume that N−1 players choose π = 1. Consider the remaining one. His incentive constraint to choose π = 1 is
\[
-(1-\delta)m + \delta\upsilon_N - \delta(1-\Lambda)^N\omega \;\geq\; -(1-\delta)\omega - \delta(1-\Lambda)^{N-1}\omega + \delta\upsilon_{N-1}, \tag{B.12}
\]
where υ_N, υ_{N−1} are given by υ̲ at ω_N, ω_{N−1}. The interpretation of both sides is as before: the payoff from abiding by the candidate equilibrium action vs. the payoff from deviating. Fixing ω and the corresponding pair (τ, x), and assuming that τ ≥ N−1, we insert our formula for υ̲, as well as Λ = λΔ + O(Δ²) and 1 − δ = rΔ + O(Δ²). This gives
\[
\tau \;\geq\; (N-1)\left(\frac{\lambda}{\lambda+r}\right) - x.
\]
Hence, given any integer N′ ∈ ℕ with N′ > 3(N−1), there is Δ̄ > 0 such that for all Δ ∈ (0, Δ̄), π = 1 is an equilibrium action at all beliefs ω = ω*(1+τλΔ), for τ = 3(N−1), …, N′ (we pick the factor 3 because λ/(λ+r) < 1).

[Footnote: Note that this solution is actually continuous at the interval endpoints. It is not the only solution to these equations; as mentioned in the text, there are intervals of beliefs for which multiple symmetric Markov equilibria exist in discrete time. It is easy to construct such equilibria in which π = 1 and the initial belief is in (a subinterval of) I₁.]

[Footnote: Considering τ < N−1 would require setting υ_N = 0, so that the explicit formula for υ̲ would not apply at ω_N. Computations are then easier, and the result would hold as well.]

Consider the beliefs in I_τ with τ ≥ 3(N−1) (and τ ≤ N′), and fix Δ for which the previous result holds, i.e., π = 1 can be enforced at all these beliefs. We now turn to the third step, showing how π = 0 can be enforced as well for these beliefs.

Suppose that players choose π = 0. As a continuation payoff, we can use the payoff from playing π = 1 in the following round, as we have seen that this action can be enforced at such a belief. This gives
\[
\delta\omega + \delta\left(-(1-\delta)m - \delta(1-\Lambda)^N\omega + \delta\,\underline\upsilon(\omega_N)\right).
\]
(Note that the discounted continuation payoff is the left-hand side of (B.12).) By deviating from π = 0, a player gets at most
\[
\omega + \left(-(1-\delta)m - \delta(1-\Lambda)\omega + \delta\,\underline\upsilon(\omega_1)\right).
\]
Again inserting our formula for υ̲, this reduces to
\[
\frac{m r (N-1)\,\lambda}{\lambda+r}\,\Delta + O(\Delta^2) \;\geq\; 0.
\]
Hence we can also enforce π = 0 at all these beliefs. We can thus apply our induction argument: there exists Δ̄ > 0 such that for all Δ ∈ (0, Δ̄), both π = 0 and π = 1 can be enforced at all beliefs ω ∈ (ω*(1+4NλΔ), ω^m).

Note that we have not established that, for such a belief ω, π = 1 is enforced with a continuation in which π = 1 is being played in the next round (at belief ω_N > ω*(1+4NλΔ)). However, if π = 1 can be enforced at belief ω, it can be enforced when the continuation payoff at ω_N is highest possible; in turn, this means that, as π = 1 can be enforced at ω_N, this continuation payoff is at least as large as the payoff from playing π = 1 at ω_N as well. By induction, this implies that the highest equilibrium payoff at ω is at least as large as the one obtained by playing π = 1 at all intermediate beliefs in (ω*(1+4NλΔ), ω) (followed by, say, the worst equilibrium payoff once beliefs below this range are reached).

Similarly, we have not argued that, at belief ω, π = 0 is enforced by a continuation equilibrium in which, if a player deviates and experiments unilaterally, his continuation payoff at ω₁ is what he gets if he keeps on experimenting alone. However, because π = 0 can be enforced at ω₁, the lowest equilibrium payoff that can be used after a unilateral deviation at ω must be at least as low as what the player can get at ω₁ from deviating unilaterally to risky again. By induction, this implies that the lowest equilibrium payoff at belief ω is at least as low as the one obtained if a player experiments alone for all beliefs in the range (ω*(1+4NλΔ), ω) (followed by, say, the highest equilibrium payoff once beliefs below this interval are reached).

Note that, as Δ →
0, these bounds converge (uniformly in Δ) to the cooperative solution (restricted to no experimentation at and below ω = ω*) and the single-agent payoff, respectively, which was to be shown. (This is immediate given that these values correspond to precisely the cooperative payoff (with N or 1 player) for a cutoff that is within a distance of order Δ of the cutoff ω*, with a continuation payoff at that cutoff which is itself within Δ times a constant of the safe payoff.)

This also immediately implies (as for the case λ₀ > 0) that for fixed ω > ω^m, both π = 0 and π = 1 can be enforced at all beliefs in (ω^m, ω] for all Δ < Δ̄, for some Δ̄ > 0: the gain from a deviation is of order Δ, yet the difference in continuation payoffs (selecting as a continuation payoff a value close to the maximum if no player unilaterally defects, and close to the minimum if one does) is bounded away from 0, even as Δ → 0. Hence, all conclusions extend: fix ω ∈ (ω*, ∞); for every ε > 0, there exists Δ̄ > 0 such that for all Δ < Δ̄, the best SSE payoff starting at belief ω is at least as much as the payoff from all players choosing π = 1 at all beliefs in (ω* + ε, ω) (using s as a lower bound on the continuation once the belief ω* + ε is reached); and the worst SSE payoff starting at belief ω is no more than the payoff from a player whose opponents choose π = 1 if, and only if, ω ∈ (ω*, ω* + ε), and 0 otherwise.

The first part of the proposition follows immediately, picking arbitrary p ∈ (p*, p^m) and p̄ ∈ (p^m, 1). For the second part, note that (i) p* < p^Δ_1, as noted, and (ii) for any p ∈ [p^Δ_1, p̄], player i's payoff in any equilibrium is weakly lower than his best-reply payoff against opponents who choose κ(p) = 1 for all p ∈ [p*, p̄], as easily follows from (B.11), the optimality equation for w.

Proof of Proposition 10:
For λ₀ > 0, the proof is the same as that of Proposition 6, except for the fact that it deals with V_{N,p̲} rather than V*_N and relies on Proposition 8 rather than Proposition 5.

For λ₀ = 0, the proof of Proposition 9 establishes that there exists a natural number M such that, given p as stated, we can take Δ̄ to be (p − p*)/M. Equivalently, p* + MΔ̄ = p. Hence, Proposition 9 can be restated as saying that, for some Δ̄ > 0 and all Δ ∈ (0, Δ̄), there exists p^Δ ∈ (p*, p* + MΔ) such that the two conclusions of the proposition hold with p = p^Δ. Fixing the prior, let w̄^Δ, w̲^Δ denote the payoffs in the first and second SSE from the proposition, respectively. Given that p^Δ → p* and w̄^Δ(p) → s, w̲^Δ(p) → s for all p ∈ (p*, p^Δ) as Δ → 0, it follows that we can pick Δ† ∈ (0, Δ̄) such that for all Δ ∈ (0, Δ†), W^Δ_PBE ≤ V_{N,p̂} + ε, w̄^Δ ≥ V_{N,p̲} − ε, ‖W^Δ_1 − V*‖ < ε and ‖w̲^Δ − V_{1,p̄}‖ < ε. The obvious inequalities follow as in the proof of Proposition 6, with the subtraction of an additional ε from the left-hand side of the first one; and the conclusion follows as before, using 2ε as an upper bound.

[Footnote: This follows by contradiction. Suppose that for some Δ ∈ (0, Δ̄), there is ω̂ ∈ [ω^m, ω] for which either π = 0 or π = 1 cannot be enforced. Consider the infimum over such beliefs. Continuation payoffs can then be picked as desired, which is a contradiction, as it shows that at this presumed infimum belief both π = 0 and π = 1 can be enforced.]

[Footnote: Consider the possibly random sequence of beliefs visited in an equilibrium. At each belief, a flow loss of either −(1−δ)m or −(1−δ)ω is incurred. Note that the first loss is independent of the number of other players experimenting, while the second is necessarily lower when at each round all other players experiment.]

[Footnote: Hence, to be precise, these payoffs are only defined on those beliefs that can be reached given the prior and the equilibrium strategies.]

References

Abreu, D. (1986): “Extremal Equilibria of Oligopolistic Supergames,”
Journal of Economic Theory, 195–225.

Abreu, D., D. Pearce and E. Stacchetti (1986): “Optimal Cartel Equilibria with Imperfect Monitoring,” Journal of Economic Theory, 251–269.

Abreu, D., D. Pearce and E. Stacchetti (1993): “Renegotiation and Symmetry in Repeated Games,” Journal of Economic Theory, 217–240.

Bergin, J. and W.B. MacLeod (1993): “Continuous Time Repeated Games,” International Economic Review, 21–37.

Biais, B., T. Mariotti, G. Plantin and J.-C. Rochet (2007): “Dynamic Security Design: Convergence to Continuous Time and Asset Pricing Implications,” Review of Economic Studies, 345–390.

Bolton, P. and C. Harris (1999): “Strategic Experimentation,” Econometrica, 349–374.

Cohen, A. and E. Solan (2013): “Bandit Problems with Lévy Payoff Processes,” Mathematics of Operations Research, 92–107.

Cronshaw, M.B. and D.G. Luenberger (1994): “Strongly Symmetric Subgame Perfect Equilibria in Infinitely Repeated Games with Perfect Monitoring and Discounting,” Games and Economic Behavior, 220–237.

Dixit, A.K. and R.S. Pindyck (1994): Investment under Uncertainty. Princeton: Princeton University Press.

Dutta, P.K. (1995): “A Folk Theorem for Stochastic Games,” Journal of Economic Theory, 1–32.

Fudenberg, D. and D.K. Levine (2009): “Repeated Games with Frequent Signals,” Quarterly Journal of Economics, 233–265.

Fudenberg, D., D.K. Levine and S. Takahashi (2007): “Perfect Public Equilibrium when Players Are Patient,” Games and Economic Behavior, 27–49.

Heidhues, P., S. Rady and P. Strack (2015): “Strategic Experimentation with Private Payoffs,” Journal of Economic Theory, 531–551.

Hörner, J., T. Sugaya, S. Takahashi and N. Vieille (2011): “Recursive Methods in Discounted Stochastic Games: An Algorithm for δ → 1,” Econometrica, 1277–1318.

Hörner, J. and L. Samuelson (2013): “Incentives for Experimenting Agents,” RAND Journal of Economics, 632–663.

Johnson, N.L., S. Kotz and N. Balakrishnan (1994): Continuous Univariate Distributions: Volume 1 (second edition). New York: Wiley.

Keller, G. and S. Rady (2010): “Strategic Experimentation with Poisson Bandits,” Theoretical Economics, 275–311.

Keller, G. and S. Rady (2015): “Breakdowns,” Theoretical Economics, 175–202.

Keller, G., S. Rady and M. Cripps (2005): “Strategic Experimentation with Exponential Bandits,” Econometrica, 39–68.

Mertens, J.-F., S. Sorin and S. Zamir (2015): Repeated Games (Econometric Society Monographs, Vol. 55). Cambridge: Cambridge University Press.

Müller, H.M. (2000): “Asymptotic Efficiency in Dynamic Principal-Agent Problems,” Journal of Economic Theory, 251–269.

Peskir, G. and A. Shiryaev (2006): Optimal Stopping and Free-Boundary Problems. Basel: Birkhäuser Verlag.

Robbins, H. (1952): “Some Aspects of the Sequential Design of Experiments,” Bulletin of the American Mathematical Society, 527–535.

Sadzik, T. and E. Stacchetti (2015): “Agency Models with Frequent Actions,” Econometrica, 193–237.

Simon, L.K. and M.B. Stinchcombe (1995): “Equilibrium Refinement for Infinite Normal-Form Games,” Econometrica, 1421–1443.

Thompson, W. (1933): “On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples,” Biometrika, 25, 285–294.