aa r X i v : . [ ec on . T H ] D ec Misspecified Beliefs about Time Lags *Yingkai Li Harry PeiDecember 15, 2020
Abstract
We examine the long-term behavior of a Bayesian agent who has a misspecified belief aboutthe time lag between actions and feedback, and learns about the payoff consequences of his ac-tions over time. Misspecified beliefs about time lags result in attribution errors, which have nolong-term effect when the agent’s action converges, but can lead to arbitrarily large long-terminefficiencies when his action cycles. Our proof uses concentration inequalities to bound thefrequency of action switches, which are useful to study learning problems with history depen-dence. We apply our methods to study a policy choice game between a policy-maker who has acorrectly specified belief about the time lag and the public who has a misspecified belief.
Keywords: time lag, misspecified belief, Bayesian learning, action cycles, history dependence,concentration inequality. * We thank S. Nageeb Ali, Renee Bowen, Drew Fudenberg, Yuhta Ishii, Shengwu Li, and Bruno Strulovici for helpful comments.We thank NSF Grant SES-1947021 for financial support.
INTRODUCTION 1
We study learning problems faced by Bayesian decision makers who have misspecified beliefs about thetime lag between decisions and feedback. We examine the long-term consequences of such belief misspec-ifications both in single-agent decision-making problems and in games of collective decision-making.Misperception about time lags is prevalent among decision makers at various levels, ranging from lead-ers in organizations to ordinary citizens. For example, a manager decides how much resource to allocateto R&D. Unlike efforts on production and sales, investments in R&D are unlikely to pay off in the shortrun, and moreover, it is usually unclear when and whether they will pay off. Repenning and Sterman (2002)show that these time lags hinder an organization’s learning about the optimal resource allocation by “ com-plicating the attribution of causality between actions and results ”. Rahmandad, Repenning, and Sterman(2009) point out that what slows down organizational learning is not the delay per se, but instead, peo-ple’s misperceptions about the delay. Consequences of such misperceptions include the so-called capabilitytraps (Repenning and Sterman, 2002), in which members of an organization work hard on production atthe expense of cutting back on the time allocated to R&D and maintenance, that ultimately results in lowproductivity.Similarly, fans of football clubs tend to credit or blame their current managers for their team’s perfor-mances while ignoring the effects of previous managers’ decisions. Many people believe that reopening theeconomy is safe amidst the COVID-19 pandemic when the number of cases and hospitalizations in Georgia,Florida, and Arizona went down three weeks after these states’ reopenings. However, the number of casesand hospitalizations started to surge six to eight weeks after these states’ reopennings, forcing some of themto partially return to lockdown.We propose a model that incorporates such misperceptions. In every period, an agent chooses an actionand observes an outcome that determines his payoff. The agent faces uncertainty about the state , i.e., themapping from his actions to the outcome distributions. He observes the history of actions and outcomesand updates his belief according to Bayes rule. We assume that the true state belongs to the support of theagent’s prior belief and that the outcome is informative about the state regardless of the agent’s action.The outcome distribution in period t depends only on the agent’s action in period t − k ∗ while the agentbelieves that it depends on his action in period t − k ′ , where k ′ is different from k ∗ . Our formulation can Section 5 extends our result to situations where (1) the outcome distribution depends on a weighted average of the agent’s
INTRODUCTION 2capture, for example, an individual underestimates the time it takes for workouts to have effects on fitness,a policy-maker underestimates or overestimates the time it takes for a curriculum reform to have effects onstudents’ academic achievements, and so on.This novel form of belief misspecification interferes learning through an attribution error , which hasno long-term effect when the agent’s action converges but can lead to mislearning when the agent’s actionchanges over time. Theorem 1 shows that the mislearning caused by attribution errors can lead to arbitrarilylarge long-term inefficiencies in the sense that for an open set of states, there exist prior beliefs that includethe true state in their support such that the asymptotic frequency with which the agent takes his optimal actionis arbitrarily close to zero. This stands in contrast to the benchmark scenario with a correctly specified beliefabout the time lag, in which the agent chooses his optimal action almost surely in the long run.The first challenge in establishing this result stems from the fact that our learning problem exhibitsnontrivial history-dependence. This is because the agent’s current-period action directly affects his futureobservations. The second challenge arises from the observation that the agent’s action cannot converge toanything suboptimal and inefficiencies can only arise when the agent’s action cycles in the long run. As aresult, one needs to bound the frequency of action switches in order to quantify the amount of mislearning,which is a key step toward showing that the posterior probability of the true state is low in the long run.We develop a new technique using concentration inequalities. First, we examine an auxiliary problem inwhich the true state is excluded from the agent’s prior belief. We use the Chernoff-Hoeffding inequality toshow that in expectation, the agent switches actions within a finite number of periods. We then establish aconcentration inequality on unbounded random variables in order to bound the frequency of action switches.Next, we study situations in which the true state occurs with small but positive probability. We use theAzuma-Hoeffding inequality to show that due to the mislearning caused by frequent action switches, thetrue state occurs with low probability in the agent’s posterior for all periods. This explains why the agent’sactions cycle over time even when the true state belongs to the support of his prior belief.We apply our framework to study a dynamic policy choice game between a policy-maker who has acorrectly specified belief about the time lag and the public who has a misspecified belief. The policy-makerwants to implement a socially beneficial reform but cannot do so without the public’s support. The publicprefers the reform to the status quo in one state and prefers the status quo in the other state. This conflictof interest can arise when the reform has positive externalities or the status quo has negative externalities onmarginalized groups (e.g., massive gatherings during a pandemic has negative externalities on the immuno-compromised) which the policy-maker cares about but the majority of citizens fail to internalize. Therefore, current and past actions, or (2) the agent faces uncertainty about the time lag and learns about it over time.
INTRODUCTION 3a reform can be both optimal for the benevolent policy-maker and suboptimal for the majority of citizens.We characterize the maximal frequency that the policy-maker can implement the reform when he hasno private information about the state. We show that the policy-maker’s optimal payoff equals the maximalfrequency of reform in an auxiliary game where he knows the state but the public is naive in the sense thatthey fail to recognize the informational content of the policy-maker’s behaviors. Intuitively, this is becausethe policy-maker can asymptotically learn the true state, and when the reform is optimal for the public, hecan implement the reform in almost every period regardless of the public’s prior.We also construct a class of strategies under which the policy-maker can approximately achieve hisoptimal payoff, according to which he proposes the reform with frequency close to a half when the publicentertains a pessimistic belief about the reform, and proposes the reform with frequency strictly greaterthan a half when the public entertains an optimistic belief about the reform. The former maximizes theamount of mislearning and the latter maximizes the frequency of reform subject to a constraint that theexpected amount of mislearning is non-negative. The key step is to use the Wald inequality and show thatconditional on the reform being suboptimal for the public, the policy-maker’s future proposals are acceptedwith probability close to 1 when he started to propose the reform with frequency greater than a half.Our work contributes to a growing literature on misspecified learning by studying environments withhistory dependence. The agent in our model has a misspecified belief about the dynamic structure of theproblem and his past actions can affect future outcomes. This stands in contrast to most of the existingworks such as Berk (1966), Nyarko (1991), Esponda and Pouzo (2016), Fudenberg, Romanyuk, and Strack(2017), Bohren and Hauser (2020), Frick, Iijima, and Ishii (2020), Esponda, Pouzo, and Yamamoto (2020),and Fudenberg, Lanzani, and Strack (2020) that exclude history dependence.Several recent papers study misspecifed learning models with history dependence and provide condi-tions for the steady states. Shalizi (2009) provides sufficient conditions for the convergence of posteriorbelief when there is no endogenous action choice and the signals in different periods can be correlated. He(2020) examines misspecified learning in two-period optimal stopping problems in which an agent mistak-enly believes that the outcome in the second period is negatively correlated with that in the first period.Esponda and Pouzo (2020) study a single-agent Markov decision problem with misspecified beliefs aboutthe state transition function. Molavi (2020) examines a dynamic general equilibrium model in which anagent’s choice in the current period affects the constraints he face in the future. By contrast, we focus onthe dynamics of an agent’s behavior in history-dependent learning problems instead of the steady states. Weshow that the long-run outcome can be inefficient by bounding the frequency of action switches. Esponda, Pouzo, and Yamamoto (2020) introduce stochastic approximation techniques and characterize the frequency of the
MODEL 4The attribution error in our model is related to Eliaz and Spiegler (2020), who study an agent’s long-termbehavior when he updates his belief according to a misspecified causal model. They propose a solution con-cept that characterizes the steady states of the above learning process, rather than examining the dynamicsof actions and beliefs. Spiegler (2013) examines the dynamic interaction between an agent and a sequenceof principals, each of them acts only once and chooses whether to intervene. The agent attributes changes ofa state variable to the latest intervention, which is applicable when some of the principal’s actions (interven-tion) are more salient than others (no intervention). Jehiel and Samuelson (2012) characterize an informedlong-run player’s payoff and behavior when he faces a sequence of short-run players who mistakenly be-lieve that all types of the long-run player use stationary strategies. By contrast, we study a different type ofattribution error, where the agent has wrong beliefs about the delay between actions and feedback.
Time is discrete, indexed by t = , , .. . In period t , a Bayesian agent chooses an action a t ∈ A , and thenobserves an outcome y t ∈ Y . We assume that both A and Y are finite sets.Our modeling innovation is to introduce time lags between decisions and feedback as well as the agent’smisperception about the time lag. Formally, there exist two non-negative integers k ∗ , k ′ ∈ N with k ∗ = k ′ ,such that the distribution of y t depends only on a t − k ∗ , while the agent believes that it depends on a t − k ′ .The agent faces uncertainty about the distribution over outcomes (which we call the state ) and learnsabout it over time by observing the history of actions and outcomes. A typical state is denoted by F ≡{ F ( ·| a ) } a ∈ A , with F ( ·| a ) ∈ ∆ ( Y ) . Let F ∗ ≡ { F ∗ ( ·| a ) } a ∈ A be the true state , namely, y t is distributed accordingto F ∗ ( ·| a t − k ∗ ) . The agent’s prior belief about the state is π ∈ ∆ ( Y ) , with Y ≡ (cid:16) ∆ ( Y ) (cid:17) A and supp ( π ) isfinite. After the agent learns that the state is F , he believes that y t is distributed according to F ( ·| a t − k ′ ) . Theagent observes h t ≡ { ..., a − , a , a , ..., a t − , y , ...., y t − } in period t and his posterior belief is denoted by π t ∈ ∆ ( Y ) . All the actions before period 1 are exogenously given. In order to focus on misspecified beliefabout the time lag, we focus on prior beliefs that are regular : Regular Prior Belief. π is regular with respect to F ∗ if1. F ∗ ∈ supp ( π ) , and for every F ∈ supp ( π ) and a ∈ A, F ( ·| a ) has full support.2. for every F , F ′ ∈ supp ( π ) and a ∈ A, we have F ( ·| a ) = F ′ ( ·| a ) . agent’s actions in misspecified learning problems without history-dependence. When there are infinitely many states, Diaconis and Freedman (1986) and Shalizi (2009) show that the agent’s posterior beliefmay not converge to the true state even when the true state belongs to the support of his prior belief. We abstract away from thiscomplication in order to focus on the economic implications of misspecified beliefs about time lags.
MODEL 5The first part requires that the true state F ∗ belongs to the support of the agent’s prior belief and thatthe agent cannot rule out any state no matter which action he takes and which outcome he observes.This rules out canonical forms of belief misspecifications studied by Berk (1966), Nyarko (1991), andEsponda and Pouzo (2016) in which F ∗ is excluded from the agent’s prior belief. The second part requiresthat the observed outcome is informative about the state regardless of the agent’s action, which is satisfiedfor generic finite subsets of Y . It rules out lack-of-identification problems, such as safe-arms in banditmodels.The agent’s stage-game payoff is v ( y t ) . We assume that arg max a ∈ A { ∑ y ∈ Y v ( y ) F ∗ ( y | a ) } is a singleton,and its unique element is denoted by a ∗ , i.e., the agent has a unique optimal action under the true state. Thisis satisfied for generic F ∗ ∈ Y and v : Y → R given that A and Y are finite sets. The agent’s strategy is σ : H → ∆ ( A ) , where H is the set of histories. Strategy σ is optimal if σ ( h t ) maximizes the expectedvalue of ∑ + ∞ s = δ s v ( y t + s ) at every h t , where δ ∈ [ , ) is the agent’s discount factor. We focus on settings suchthat either δ ∈ ( , ) or ( δ , k ′ ) = ( , ) . This is because the agent is indifferent between all actions when δ = k ′ ≥
1. Let Σ ∗ ( π ) be the set of strategies that are optimal for the agent when his prior is π .For some useful benchmarks, the agent chooses a ∗ in every period after he learns that the true state is F ∗ even if he entertains a misspecified belief about the time lag. If there is no belief misspecification, i.e., k ∗ = k ′ , then according to Berk’s Theorem (Berk, 1966), the agent’s action converges to a ∗ almost surely. Remark:
Our baseline model focuses on situations in which the outcome in every period is affected onlyby one of the agent’s actions. Section 5 discusses extensions where the outcome in period t depends on aconvex combination of the agent’s past and current-period actions, and the agent has a wrong belief aboutthe weights of different actions. We also consider settings in which the agent faces uncertainty about thetime lag and learns about it over time, but the support of his prior belief excludes the true time lag.The agent’s payoff in our baseline model depends only on the observed outcome. Our results extendwhen the agent’s payoff also depends on the state. When the agent’s payoff depends directly on his action(e.g., different actions have different costs), his action can be suboptimal even when he learns the true state.This is because when k ∗ = k ′ , the agent either overestimates or underestimates the time it takes for his actionto have an effect, which can lead to suboptimal decisions since the agent discounts future payoffs. Ourresults extend to settings where the agent’s payoff is v ( y t ) − c ( a t ) , as long as the absolute value of c ( · ) issmall enough such that the agent has a strict incentive to choose a ∗ after he learns the true state. RESULT 6 First, we show that if the agent’s action converges, then it can only converge to his optimal action. Moreover,the asymptotic frequency of his optimal action must be strictly positive when his prior belief is regular.
Lemma 3.1.
Suppose either δ ∈ ( , ) or ( δ , k ′ ) = ( , ) . If π is regular with respect to F ∗ and a t converges to a with positive probability, then a = a ∗ . Furthermore, lim inf t → + ∞ E σ h t t ∑ s = { a s = a ∗ } i > for every σ ∈ Σ ∗ ( π ) . (3.1)The proof is in Appendix B. Intuitively, the only way in which misspecified beliefs about time lags caninterfere learning is through an attribution error , namely, the agent attributes the effects of a t − k ∗ to a t − k ′ .When the agent’s action converges, a t − k ∗ and a t − k ′ are the same so the attribution error does not affect hislearning. Since F ∗ ∈ supp ( π ) and there is no lack-of identification problem, the agent will learn the truestate almost surely. This implies that the agent’s actions are asymptotically efficient, which contradicts thepresumption that his action converges to something other than a ∗ . The agent takes his optimal action withpositive asymptotic frequency since for every ε >
0, the following event occurs with positive probability: E ε ≡ { there exists T ∈ N such that π t ( F ∗ ) > − ε for every t ≥ T } . (3.2)Intuitively, this is because y is informative about the state, so there always exists a signal realization thatincreases the posterior probability of F ∗ . The probability that the agent chooses a ∗ in all future periods isstrictly positive when the posterior probability of F ∗ is close to 1, which implies (3.1).Despite the agent’s action cannot converge to anything other than a ∗ , the attribution errors caused bymisspecified beliefs can lead to arbitrarily large long-term inefficiencies. Theorem 1 shows that the fre-quency with which the agent takes his optimal action can be arbitrarily close to 0. Theorem 1.
Suppose either δ ∈ ( , ) or ( δ , k ′ ) = ( , ) . For every γ > , there exists an open set Y o ⊂ Y such that for every F ∗ ∈ Y o , there is a prior belief π that is regular with respect to F ∗ underwhich lim sup t → + ∞ E σ h t t ∑ s = { a s = a ∗ } i < γ for every σ ∈ Σ ∗ ( π ) . (3.3)Theorem 1 implies that Bayesian agents fail to take their optimal action even when the true state belongsto the support of their prior belief and the observed outcome can statistically identify the state. Intuitively,this is because attribution errors lead to mislearning when the agent switches actions. In particular, when the RESULT 7agent’s action changes over time, the probability of the true state can decrease in expectation. When actionswitches are frequent enough, the amount of mislearning outweighs the what the agent learns when he takesthe same action in consecutive periods. As a result, his posterior belief may attach a low probability to F ∗ inall periods. Under some F ∗ and π that is regular with respect to F ∗ , such an event occurs with probabilityarbitrarily close to 1, which leads to arbitrarily large asymptotic inefficiencies.The proof is in Appendix C. We explain the logic behind our argument using an example, which illus-trates how attribution errors lead to action cycles in the long run and how to bound the frequency of actionswitches using concentration inequalities. Illustrative Example:
Suppose A ≡ { , } , δ = k ∗ =
1, and k ′ =
0. That is, the agent is myopic,the distribution of y t depends only on the agent’s action in period t − y t isaffected by his action in period t . Let Y ≡ { y , y , y } , v ( y ) = v ( y ) = v ( y ) =
1, and the support of π is { F ∗ , F , F } , with F ∗ ( y i | a = ) = ε i = − ε i = ε i = F ∗ ( y i | a = ) = − ε i = ε i = ε i = F ( y i | a = ) = / − ε i = / − ε i = ε i = F ( y i | a = ) = / − ε / i = / − ε / i = ε i = F ( y i | a = ) = / − ε / i = / − ε / i = ε i = F ( y i | a = ) = / − ε i = / − ε i = ε i = F ∗ and F areboth 0, and the optimal action in state F is 1.Since the true state F ∗ belongs to the support of π and y can statistically identify the state, the agent’saction converges to 0 almost surely when he has a correctly specified belief about the time lag. However, RESULT 8Pr ( y ) Pr ( y ) F ( ·| a = ) F ( ·| a = ) F ∗ ( ·| a = ) F ( ·| a = ) F ( ·| a = ) F ∗ ( ·| a = ) Figure 1: Example with Action Cycles: Signal Distributions under F ∗ , F and F when the agent has a misspecified belief about the time lag, we sketch an argument which shows that theasymptotic frequency of the suboptimal action can be arbitrarily close to 1 / Claim 1.
For every η > , there exists ε > such that when ε < ε , there exists a full support π underwhich lim sup t → + ∞ t E σ h t ∑ s = { a s = a ∗ } i < + η for every σ ∈ Σ ∗ ( π ) . Our argument proceeds in two steps. First, we examine an auxiliary learning problem where F ∗ occurswith zero probability. Let l t ≡ log π t ( F ) π t ( F ) . By definition, the agent has a strict incentive to take action 0 when l t >
0, and has a strict incentive to take action 1 when l t < F is closer to F ∗ compared to F . As a result, the log likelihood ratio l t decreases in expectation when the agent chooses action 0 in twoconsecutive periods. Let τ be the number of periods with which the agent’s action switches back to action 1when a t − = a t − = a t =
0. The Chernoff-Hoeffding inequality implies that:Pr ( τ ≥ s ) ≤ Pr ( l t + s ≥ ) ≤ exp (cid:16) − s (cid:16) l t s + E [ l s − l s − ] | {z } < (cid:17) (cid:17) , (3.4)from which we know that the distribution of τ is first order stochastically dominated by an exponentialdistribution and therefore, has bounded first and second moments.Similarly, when the agent takes action 1, F is closer to F ∗ compared to F . As a result, l t increases inexpectation when the agent takes action 1 in two consecutive periods. Let τ be the number of periods withwhich the agent’s action switches back to action 0 when a t − = a t − = a t =
1. A similar argument RESULT 9based on the Chernoff-Hoeffding inequality implies that τ has bounded first and second moments.In order to bound the frequency of action switches from below using the above conclusions on τ and τ , we establish a concentration inequality that applies to unbounded random variables (Lemma A.3). Thisinequality implies that for every ε >
0, there exists a large enough T ∈ N such that E h { t ≤ T | a t − = a t = } { t ≤ T | a t − = , a t = } i = E h { t ≤ T | a t − = a t = } { t ≤ T | a t − = , a t = } i ≤ E [ τ ] + ε , and E h { t ≤ T | a t − = a t = } { t ≤ T | a t − = , a t = } i = E h { t ≤ T | a t − = a t = } { t ≤ T | a t − = , a t = } i ≤ E [ τ ] + ε . When π ( F ) and π ( F ) are close, the expectations of τ and τ are close, and therefore, the asymptoticfrequencies of both actions are close to 1 /
2. The above inequalities imply that the asymptotic frequency ofaction switches is strictly positive and is close to + E [ τ ] and + E [ τ ] .Next, we consider the case in which the true state F ∗ belongs to the support of π but occurs with lowprobability. We show that with probability close to 1, the agent’s posterior belief attaches a low probabilityto F ∗ in all periods. Formally, for every η >
0, there exists π > F ∗ is less than π , the probability of the event thatmax n π t ( F ∗ ) π t ( F ) , π t ( F ∗ ) π t ( F ) o < η for all t ∈ N (3.5)is at least 1 − η . Intuitively, both log π t ( F ∗ ) π t ( F ) and log π t ( F ∗ ) π t ( F ) increase in expectation when a t = a t − since F ∗ is the true state. However, as can be seen from Figure 1, F ∗ ( ·| a = ) is further away from F ∗ ( ·| a = ) compared to both F ( ·| a = ) and F ( ·| a = ) , and F ∗ ( ·| a = ) is further away from F ∗ ( ·| a = ) comparedto both F ( ·| a = ) and F ( ·| a = ) . Due to the attribution errors caused by misspecified beliefs about thetime lag, both log likelihood ratios decrease in expectation when a t = a t − .When π t ( F ∗ ) is low, the agent’s best reply problem is similar to the one he faces in the auxiliary scenariowhere F ∗ is excluded from the support of his prior, in which case he frequently switches actions. Thoseaction switches together with the attribution error lead to mislearning. When E [ τ ] and E [ τ ] are sufficientlysmall, action switches are frequent, so the mislearning caused by attribution errors outweighs what the agentcan learn when he takes the same action in adjacent periods.In order to formalize this intuition, we provide a lower bound on the probability of event (3.5) using The Chernoff-Hoeffding inequality only applies to bounded random variables. Corollary 5.5 in Lattimore and Szepesv´ari(2020) and Jin et al. (2019) establish concentration inequalities for random variables with sub-Gaussian distributions. By contrast,our result is more general since it only requires the random variable to have bounded first and second moments.
APPLICATION:DYNAMICPOLICYCHOICEGAME 10concentration inequalities. The Chernoff-Hoeffding inequality does not apply since the agent’s belief af-fects his actions, so the log likelihood ratios between F ∗ and F , and between F ∗ and F can exhibit serialcorrelations. We overcome this challenge by constructing a martingale process with bounded incrementsfrom the log likelihood ratios and then applying the Azuma-Hoeffding inequality. We show that with prob-ability close to 1, the agent’s belief attaches low probability to F ∗ in all periods. This together with thefrequent action switches explains why he takes the inefficient action with positive asymptotic frequency andhis actions cycle over time. Remark:
In our example, the asymptotic frequency with which the agent takes the inefficient action isclose to 1 /
2. In Appendix B, we construct F ∗ as well as regular prior beliefs with respect to F ∗ such thatthe frequency of the inefficient action is close to 1. In this example, one can simply modify the outcomedistributions such that both τ and τ have low expectations, but the expectation of τ is significantly greaterthan the expectation of τ . In order to demonstrate the applicability of our techniques to bound the frequency of action switches, weanalyze a dynamic policy choice game between• a principal who strategically makes policy proposals, learns about the state over time, and has acorrectly specified belief about the time lag between the chosen policy and the observed feedback,• a Bayesian agent who can veto the principal’s proposals, and learn about the outcome distributionunder a misspecified belief about the time lag. In every period, a society needs to make a collective choice between two policies a t ∈ A ≡ { , } . Inperiod t , the principal makes a proposal e a t ∈ A . If e a t =
0, then action 0 is automatically implemented, i.e., a t =
0. If e a t =
1, then the agent chooses whether to accept ( a t =
1) or veto ( a t =
0) the principal’s proposal.Both the principal and the agent face uncertainty about the state , which is contained in F ≡ { F , F } .Their common prior belief is π ∈ ∆ ( F ) , which we assume has full support. The principal and the agent agree to disagree in terms of the time lag between decisions and feedback.The principal has a correctly specified belief about the time lag and knows that the distribution of y t depends Our analysis also applies to a sequence of myopic agents, each plays the game only once. Extensions to environments with more than two states are available upon request. Our results also apply when the principaland the agent agree to disagree about the state distribution. For example, when the principal’s prior belief is p that is differentfrom π , one needs to replace π by p in RHS of (4.6). APPLICATION:DYNAMICPOLICYCHOICEGAME 11only on a t − k ∗ , with k ∗ ≥
1. The agent believes that the distribution of y t depends on a t , i.e., k ′ =
0. In period t , both players observe h t ≡ { e a s , a s , y s } t − s = and update their beliefs about the state according to Bayes rule.Let H be the set of histories. Let σ p : H → [ , ] be the principal’s strategy, which maps the histories tothe probability that he proposes action 1, with σ p ∈ Σ p . Let σ a : H → [ , ] be the agent’s strategy, whichmaps the histories to the probability with which he approves action 1, with σ a ∈ Σ a .The principal is patient and maximizes the frequency of action 1. The agent is myopic and his stage-game payoff in period t is v ( y t ) . Since the principal has no private information about the state, under a no signaling what you don’t know condition (Fudenberg and Tirole, 1991), neither the agent’s belief nor hisbest reply depends on the principal’s proposals { e a , ..., e a t } or the principal’s strategy σ p . Without loss ofgenerality, we assume that ∑ y ∈ Y v ( y ) F ( y | ) > ∑ y ∈ Y v ( y ) F ( y | ) and ∑ y ∈ Y v ( y ) F ( y | ) > ∑ y ∈ Y v ( y ) F ( y | ) ,that is, action a is optimal for the agent in state F a for every a ∈ A . This game fits applications where a benevolent policy-maker (i.e., the principal) wants to persuade thepublic (i.e., the agents) to stop taking actions that have negative externalities on marginalized groups (ac-tion 0, for example, massive gatherings during a pandemic have negative externalities on people who areimmunocompromised), or to adopt reforms that have positive externalities (action 1, for example, reducinggreenhouse gas emission has positive externalities on future generations). Action 0 is interpreted as a statusquo action , which the policy-maker has the ability to implement by himself. By contrast, the public’s co-operation is crucial for the implementation of the socially beneficial action. For example, the governmentcan issue a mask mandate for the purpose of slowing down the spread of a virus, but this mandate won’t beeffective unless the majority of citizens cooperate. However, taking the socially beneficial action is againstthe agent’s private interest in state F and he learns about which action is optimal over time.Our result characterizes the maximal frequency that the principal can implement the socially beneficialaction by taking advantage of the agent’s misspecified beliefs. We also describe the qualitative features ofthe principal’s strategy from which he approximately attains his optimal payoff. We assume that the agentis not indifferent between action 0 and action 1 at any history. Assumption 1. π is such that the agent is not indifferent between action and action at every h t . Assumption 1 is satisfied for generic prior belief π given that A and Y are finite sets. This assumptionimplies that the agent’s optimal strategy is unique, which we denote by σ ∗ a . The principal’s asymptotic We evaluate the patient principal’s payoff using the long-run averages. This is a common practice in undiscounted games, seefor example, Hart (1985) and Forges (1992). We comment on the case in which k ′ = If action 1 is optimal for the agent in both states, then the principal can implement action 1 with frequency 1 regardless of theagent’s prior belief and belief misspecification. If action 0 is optimal for the agent in both states, then the frequency of action 1 iszero regardless of the principal’s strategy.
APPLICATION:DYNAMICPOLICYCHOICEGAME 12payoff from strategy σ p is between V ( σ p ) ≡ lim inf t → + ∞ t E ( σ p , σ ∗ a ) h t ∑ s = a s i and V ( σ p ) ≡ lim sup t → + ∞ t E ( σ p , σ ∗ a ) h t ∑ s = a s i , (4.1)where E ( σ p , σ ∗ a ) [ · ] is the expectation under ( σ p , σ ∗ a ) . The principal’s payoff when he optimally chooses hisstrategy is bounded between V ≡ sup σ p ∈ Σ p V ( σ p ) and V ≡ sup σ p ∈ Σ p V ( σ p ) .We introduce some notation to characterize the principal’s optimal payoff. Let E ∗ ≡ n there exists t ∈ N such that for every s ≥ t , ∑ F ∈ F π s ( F ) ∑ y ∈ Y v ( y ) F ( y | ) > ∑ F ∈ F π s ( F ) ∑ y ∈ Y v ( y ) F ( y | ) o be the event that action 1 is strictly optimal for the agent starting from some period. Let q ∗ ≡ sup σ p ∈ Σ p Pr ( E ∗ | σ p , σ ∗ a , F ) . (4.2)Intuitively, q ∗ is the maximal probability of event E ∗ when the state is F and the agent plays according tohis optimal strategy. Let X a → a ′ be a random variable such that X a → a ′ = log F ( y | a ′ ) F ( y | a ′ ) with probability F ( y | a ) for every y ∈ Y . (4.3)Intuitively, X a → a ′ is the change in the log likelihood ratio between F and F when the true state is F , theprevious period action was a , and the current period action is a ′ . Let λ ≡ sup b λ ≥ b λ + λ + b λ E [ X → ] + E [ X → + X → ] > . (4.5)Moreover, we define λ = λ ∈ ( , ) ∪ { } .Intuitively, λ is the maximal frequency of action 1 such that the log likelihood ratio between F and F doesnot increase in expectation, or in another word, the amount of mislearning in state F is non-negative. Theorem 2. If π satisfies Assumption 1, thenV = V = π ( F ) + π ( F ) q ∗ λ . (4.6) APPLICATION:DYNAMICPOLICYCHOICEGAME 13Theorem 2 implies that the principal’s asymptotic payoff exists (i.e., V = V ) and characterizes its value.At the optimum, the asymptotic frequency of action 1 is 1 in state F and is q ∗ λ in state F .The proof is in Appendix D. We provide an intuitive explanation in three steps, using an example inwhich there are two outcomes Y = { y g , y b } , the outcome distributions are given by: F ( y | a ) ≡ r if ( a , y ) = ( , y b ) or ( , y g ) − r if ( a , y ) = ( , y b ) or ( , y g ) , F ( y | a ) ≡ r if ( a , y ) = ( , y b ) or ( , y g ) − r if ( a , y ) = ( , y b ) or ( , y g ) , where r ∈ ( / , ) is a parameter, and the agent’s payoff is 1 when the outcome is y g and is 0 otherwise.In this example, the agent strictly prefers action 0 if and only if log π t ( F ) π t ( F ) >
0. One can verify that X → first order stochastically dominates both X → and X → . Therefore, the maximum that defines q ∗ isattained when the principal proposes the opposite action to what was implemented k ∗ periods ago, that is, e a t = − a t − k ∗ for every t ∈ N . According to the maximization problem that defines λ , λ = max b λ b λ + b λ + b λ E [ X → ] + E [ X → + X → ] ≥ . (4.7)Since X → first order stochastically dominates both X → and X → , the constraint is binding and the max-imum in (4.7) is attained when the ratio between taking the same action in consecutive periods and actionswitches is b λ . Step 1:
We consider an auxiliary game in which the principal knows the true state but the agent is naivein the sense that he fails to extract information from the principal’s proposals. We show that when the truestate belongs to F , the principal’s asymptotic payoff in the auxiliary game is 1 regardless of the agent’sprior belief. Moreover, the principal can attain this payoff by proposing action 1 in every period.Let l t ≡ log π t ( F ) π t ( F ) . Agent t strictly prefers action 1 when l t < l t > l t decreases in expectation when the agent takes the same action in two consecutive periods and F isthe true state, the Wald’s inequality (Lemma A.1) implies that1. for every l t <
0, the probability of the event { l τ < τ ≥ t } is bounded away from 0,2. for every l t >
0, the probability of the event { l τ > τ ≥ t } is 0.Since the log likelihood ratio process is absorbed with positive probability at negative values and is absorbed APPLICATION:DYNAMICPOLICYCHOICEGAME 14with zero probability at positive values, we know that with probability 1, there exists T ∈ N such that l t < t ≥ T . Therefore, the principal’s asymptotic payoff is 1 regardless of the agent’s prior belief. Step 2:
We show that principal’s optimal payoff in the auxiliary game where the state is F is q ∗ λ , that is, U ( F , π ) = q ∗ λ . (4.8)Recall that (1) q ∗ is the maximal probability that the agent eventually has an incentive to approve action1, which is attained when the principal proposes action 1 with frequency approximately 1 / E ∗ , and (2) according to (4.7), λ is the maximal frequency that the principal can propose action 1subject to the constraint that the log likelihood ratio between F and F does not increase in expectation.The principal faces a tradeoff between increasing the frequency that he proposes action 1 and increasingthe probability that the agent is willing to approve action 1. The former allows him to propose action 1 withfrequency as high as λ , but in order to maximize the probability with which the agent approves action 1, heneeds to propose it with frequency close to 1 / F , the principal can attain an expected payoff as if (1) the agenteventually approves action 1 for all periods with its maximal probability q ∗ , and (2) the principal can proposeaction 1 with its maximal frequency subject to the mislearning constraint, which equals λ .The definitions of q ∗ and λ imply that U ( F , π ) ≤ q ∗ λ . We show that U ( F , π ) ≥ q ∗ λ by constructing afamily of strategies under which the principal’s asymptotic payoff is arbitrarily close to q ∗ λ . Each strategy inthis class is characterized by a cutoff log likelihood ratio l ∗ ε < e a t = − a t − k ∗ if the log likelihood ratio is above l ∗ ε in all previous period, and proposes action 1 with frequency close tobut less than λ when the log likelihood ratio has fall below l ∗ ε in at least one period.The key step is to show that under the proposed strategy (1) the probability with which the log likelihoodratio falls below l ∗ ε in at least one period is arbitrarily close to q ∗ , and (2) conditional on the log likelihoodratio falls below l ∗ ε , the probability that it is strictly negative in all future periods is close to 1.We establish these two claims using concentration inequalities. The intuition behind the first claim isthat conditional on l t >
0, the probability that the log likelihood ratio is positive in all future periods isstrictly positive, so the log likelihood ratio will eventually escape any bounded interval with probability 1.As a result, the probability that l t < t large enough equals the probability that l t < l ∗ ε for all t largeenough. The intuition behind the second statement is that according to (4.8), one can construct strategies APPLICATION:DYNAMICPOLICYCHOICEGAME 15under which the frequency of proposing action 1 is close to λ , yet the log likelihood ratio is non-increasing inexpectation. For an example of such a strategy, let T , T ∈ N be such that T is even and T / T ∈ ( λ − ε , λ ) .The principal’s strategy is divided into T ≡ T + T period blocks such that he proposes action 0 in period1, 3, ... T − t is small when l t < l ∗ ε and l ∗ ε is small enough.To conclude, when the principal uses this class of strategies, he can ensure that with probability close to q ∗ , the agent is willing to approve action 1 in all future periods, and conditional on this event, he can proposepolicy 1 with frequency arbitrarily close to λ . This explains why the tradeoff he faces between inducingmislearning and increasing the frequency of proposing action 1 diminishes in the long run. Step 3:
We show that the principal’s payoff in our dynamic policy choice game with symmetric uncertaintyequals his expected payoff in the auxiliary game. Formally, let U ( F , π ) be the principal’s payoff in theauxiliary game when the state is F and the agent’s prior belief is π . We show that: V = V = π ( F ) U ( F , π ) + π ( F ) U ( F , π ) . (4.9)This is implied by the following two inequalities: V ≤ π ( F ) U ( F , π ) + π ( F ) U ( F , π ) (4.10)and V ≥ π ( F ) U ( F , π ) + π ( F ) U ( F , π ) . (4.11)Inequality (4.10) is straightforward since the principal’s payoff is weakly greater in the auxiliary game giventhat he has more information and the agent does not extract information from his proposals.In order to establish inequality (4.11), let σ ε p be the principal’s strategy such that his asymptotic payoffis more than U ( F , π ) − ε in the auxiliary game where the state is F . Since y is informative about thestate, for every ε >
0, there exists T ∈ N such that for each of the principal’s strategy and every a ∈ A , theprincipal’s posterior belief in period T attaches probability greater than 1 − ε to F a when F a is the true state.Consider the principal’s asymptotic payoff by using the following strategy in the original game withsymmetric uncertainty:1. he plays according to σ ε p in the first T periods, DISCUSSIONS 162. if his period T posterior belief attaches probability greater than 1 − ε to state F , then he proposesaction 1 in all future periods,3. if his period T posterior belief attaches probability less than 1 − ε to state F , he continues to usestrategy σ ε p .Under the above strategy, the probability with which the principal proposes 1 in every period after T is atleast 1 − ε conditional on the state being F . Since U ( F , π ) = π , the principal’s asymptotic payoffconditional on state F is at least 1 − ε . Conditional on the true state being F , the probability with whichhe uses σ ε p in every period is at least 1 − ε , so his payoff is at least ( − ε )( U ( F , π ) − ε ) . Therefore, hisexpected asymptotic payoff converges to π ( F ) U ( F , π ) + π ( F ) as ε goes to 0. Remarks:
Our formula for the principal’s optimal payoff is reminiscent of the repeated zero sum gamesof Aumann and Maschler (1995) and the Bayesian persuasion games of Kamenica and Gentzkow (2011),where an informed player’s payoff in a binary-state setting is a piece-wise linear and concave function ofthe uninformed player’s prior belief. Our formula for the principal’s highest equilibrium payoff (4.6) is notcontinuous since q ∗ depends on the agent’s prior belief and exhibits discontinuity in general.When δ >
0, the agent may have incentives to experiment, which depend on the principal’s strategy. Asa result, the agent’s optimal strategy is not unique when the log likelihood ratio between F and F is closeto the cutoff at which a myopic agent is indifferent. In general, for every δ ∈ ( , ) and k ′ , there exist twocutoffs l ∗ and l ∗∗ with 0 < l ∗ < l ∗∗ < + ∞ such that regardless of the principal’s strategy, the agent has a strictincentive to approve action 1 when l t < l ∗ and has a strict incentive to veto action 1 when l t > l ∗∗ . When l t ∈ [ l ∗ , l ∗∗ ] , the agent’s incentive depends on the principal’s strategy. Nevertheless, when the discount factoris positive but small enough, the set of agent-optimal strategy is small and our approach provides lower andupper bounds on the principal’s payoff. The two bounds coincide as the agent’s discount factor convergesto 0, in which case our approach can exactly characterize the principal’s asymptotic payoff. We discuss extensions and generalizations of our main result.
Uncertainty about the time lag:
In our baseline model, the agent faces uncertainty about the outcomedistribution but has a degenerate prior about the time lag. In general, the agent may also learn about the timelag under a misspecified model. DISCUSSIONS 17Our main result extends when the agent faces uncertainty both about the outcome distribution and thetime lag. Formally, there is a finite set of states F ⊂ Y and a finite set of possible time lags K ⊂ N . Theagent has a full support prior belief π ∈ ∆ ( F × K ) . In order to focus on the effects of misspecified beliefabout the time lag, we assume that F ∗ ∈ F and k ∗ / ∈ K .Similar to the baseline model, the agent chooses a ∗ ≡ arg max a ∈ A ∑ y ∈ Y v ( y ) F ∗ ( y | a ) in every period whenhe learns the true state regardless of his belief about the time lag. As a result, the agent’s action cannotconverge to actions other than a ∗ and a ∗ occurs with positive asymptotic frequency. When δ ∈ ( , ) or ( δ , k ′ ) = ( , ) , there exists F ∗ ∈ Y and a prior belief π that is regular with respect to F ∗ under which theagent takes his optimal action with frequency arbitrarily close to 0. General forms of belief misspecifications:
In our baseline model, the distribution of y t depends only onone of the agent’s actions. In practice, the outcome distribution can be affected by multiple actions.We extend our results when the distribution of y t depends on a convex combination of the agent’s current-period action and his actions in the last k ∈ N periods where k is an exogenous parameter. In particular, whenthe state is F ≡ { F ( ·| a ) } a ∈ A , y t is distributed according to ∑ a ∈ A α t ( a ) F ( ·| a ) where α t ( a ) ≡ ∑ kj = β j { a t − j = a } , β ≡ ( β , ..., β k ) ∈ R k + + , and ∑ kj = β j =
1. The agent has a wrong belief about the convex weightsof different actions, and believes that when the state is F , y t is distributed according to ∑ a ∈ A b α t ( a ) F ( ·| a ) ,where b α t ( a ) ≡ ∑ kj = b β j { a t − j = a } with b β ≡ ( b β , ..., b β k ) ∈ R k + + , ∑ kj = b β j =
1, and b β = β . This generalformulation captures for example, when the agent overestimates or underestimates the effects of his current-period action on the current-period outcome, i.e., when b β = β . Theorem 1 extends under a strongeridentification condition that for every F , F ′ ∈ F and α ∈ ∆ ( A ) , we have F ( ·| α ) = F ′ ( ·| α ) . If the agent only faces uncertainty about the time lag but knows the outcome distribution, then he takes his optimal action inevery period and misspecified belief about the time lag is irrelevant for his behavior and payoff.
PROBABILITYTOOLS 18
A Probability Tools
We state three results in probability theory, which will be used in our subsequent proofs. The first result isthe Wald nequality, which bounds the probability of the union of tail events from above.
Lemma A.1 (Wald, 1944) . Let { Z t } t ∈ N be a sequence of i.i.d. random variables with finite support,strictly negative mean, and takes a positive value with positive probability. Let r ∗ > be the unique realnumber that satisfies E z ∼ Z [ exp ( r ∗ z )] = . We have Pr " ∞ [ n = ( n ∑ t = Z t ≥ c ) ≤ exp ( − r ∗ · c ) for every c > . The second result is the Azuma-Hoeffding inequality, that applies to martingales with bounded incre-ments.
Lemma A.2 (Azuma-Hoeffding inequality) . Let { Z , Z , · · · } be a martingale such that | Z k − Z k − | ≤ c k .Then for every N ∈ N and ε > , we have Pr [ Z N − Z ≥ ε ] ≤ exp − ε ∑ Nk = c k ! . The third result extends the Chernoff-Heoffding inequality to random variables with unbounded supportand finite first and second moments.
Lemma A.3.
For any λ > and any sequence of i.i.d. random variables X , X , . . . , X n with finite mean µ > and finite variance, we have Pr " n ∑ i = X i ≥ λµ n ≤ exp ( − cn ) where c ≡ max t > { λµ t − ln E (cid:2) e tX (cid:3) } . For any λ ∈ ( , ) , we have Pr " n ∑ i = X i ≤ λµ n ≤ exp ( − c ′ n ) where c ′ ≡ max t > {− λµ t − ln E (cid:2) e − tX (cid:3) } Proof.
The proof is similar to that of the Chernoff-Hoeffding inequality. If λ >
1, thenPr " n ∑ i = X i ≥ λ n µ = Pr h e t ∑ ni = X i ≥ e t λ n µ i ≤ e t λ n µ · n ∏ i = E (cid:2) e tX i (cid:3) for every t ∈ N , PROOFOFLEMMA3.1 19where the last inequality holds by the Markov inequality. Set t ∈ N in order to maximize λµ t − ln E (cid:2) e tX (cid:3) ,we have Pr " n ∑ i = X i ≥ λ n µ ≤ exp {− cn } . If λ <
1, thenPr " n ∑ i = X i ≤ λ n µ = Pr h e − t ∑ ni = X i ≥ e − t λ n µ i ≤ e − t λ n µ · n ∏ i = E (cid:2) e − tX i (cid:3) for every t ∈ N , Set t ∈ N in order to maximize − λµ t − ln E (cid:2) e − tX (cid:3) , we havePr " n ∑ i = X i ≤ λ n µ ≤ exp (cid:8) − c ′ n (cid:9) . Remark:
In Lemma A.3, let c ( t ) = λµ t − ln E (cid:2) e tX (cid:3) . One can verify that c ( ) = c ′ ( ) = ( λ − ) µ > λ > µ >
0. Moreover, c ′′ ( ) = Var [ X ] is finite. Therefore, c = max t > { λµ t − ln E (cid:2) e tX (cid:3) } isstrictly positive for λ > µ >
0. Similarly, we can also show that c ′ > λ < µ > B Proof of Lemma 3.1
The conclusion of Lemma 3.1 is implied by the following two claims.
Claim 2.
For ε ∈ ( , ) and π that has finite support, there exists η > and a sufficiently large T suchthat the π T ( F ∗ ) > − ε with probability at least η . Claim 3.
Suppose δ ∈ ( , ) or ( δ , k ′ ) = ( , ) . For any finite set of states F ⊆ ∆ ( Y ) , there exist ε , η ∈ ( , ) such that if π ( F ∗ ) > − ε and π is supported on F , then a ∗ is chosen for all periods withprobability at least η . Combining those two claims, we have that for any prior π with finite support, there exists η > T > η , action a ∗ is chosen for all periods after T , which implies that the limitingfrequency of a ∗ is at least η > Proof Claim 2.
Let F be the support of π . For every pair of states F , F ∈ F , let X a → a ′ ( F , F ) ≡ log F ( y | a ′ ) F ( y | a ′ ) with probability F ∗ ( y | a ) for every y ∈ Y . PROOFOFLEMMA3.1 20Let ¯ l be the largest realization of | X a → a ′ ( F , F ) | for any a , a ′ ∈ A and F , F ∈ F . One can verify that theincrement of log π t ( F ) π t ( F ) follows the same distribution as random variable X a t − k ∗ → a t − k ′ ( F , F ) . Therefore,log π T ( F ) π T ( F ) − log π ( F ) π ( F ) − T ∑ t = E (cid:2) X a t − k ∗ → a t − k ′ ( F , F ) (cid:3) is a martingale with bounded increments. According to Lemma A.2, for every ε > T > π , and F , F ,and any sequence of actions a , . . . , a T , we havePr " log π T ( F ) π T ( F ) − log π ( F ) π ( F ) − T ∑ t = E (cid:2) X a t − k ∗ → a t − k ′ ( F , F ) (cid:3) ≥ ε ≤ exp (cid:18) − ε T ¯ l (cid:19) . (B.1)Now consider an auxiliary scenario where k ∗ = k ′ . One can verify that X a t − k ′ → a t − k ′ ( F , F ∗ ) < F = F ∗ . For every ε >
0, let T ∈ N be large enough such that1. | F | · exp (cid:16) − ε T ¯ l (cid:17) < F ∈ F \ F ∗ , − log π ( F ) π ( F ∗ ) − ∑ Tt = E [ X a t − k ′ → a t − k ′ ( F , F ∗ )] − ε > log | F | ( − ε ) ε .By applying union bound to inequality (B.1) for all F ∈ F \ F ∗ , we know that with strictly positive probabil-ity, log π T ( F ) π T ( F ∗ ) ≥ log | F | ( − ε ) ε for any F ∈ F \ F ∗ . This implies that π T ( F ∗ ) ≥ − ε . Note that by Condition2, this event also occurs with strictly positive probability when the true state is F ∗ with time lag k ∗ . Thisconcludes the proof of Claim 2. Proof of Claim 3.
For every finite set F ⊆ ∆ ( Y ) such that F ∗ ∈ F , let π ε ∈ ( , ) be the lowest probabilitysuch that for every π ∈ ∆ ( F ) , if π attaches probability at least π ε to F ∗ , then the agent has a strict incentive tochoose a ∗ . For simplicity, let X ( F ) ≡ X a ∗ → a ∗ ( F , F ∗ ) for any F ∈ F \ F ∗ . By definition, we have E [ X ( F )] <
0. Let r ∗ F > E x ∼ X ( F ) [ exp ( r ∗ F · x )] =
1. According to Lemma A.1, for a sequence of i.i.d.random variables X , . . . , X t that is distributed according to X ( F ) , we havePr " ∞ [ n = ( n ∑ t = X t ≥ c ) ≤ exp ( − r ∗ F · c ) for every c > . Let r ∗ = min F ∈ F \ F ∗ r ∗ F . Let c be such that | F | · exp ( − r ∗ · c ) <
1, and let ε > c + ¯ l · max { k ∗ , k ′ } + log ε − ε < log π ε | F |· ( − π ε ) , we know that for the first max { k ∗ , k ′ } periods, the agent alwayschooses action a ∗ and the log likelihood ratio between F and F ∗ increases by at most ¯ l · max { k ∗ , k ′ } . More- PROOFOFTHEOREM1 21over, for any t > ¯ l · max { k ∗ , k ′ } , with probability 1 − | F | · exp ( − r ∗ · c ) >
0, we have thatlog π t ( F ) π t ( F ∗ ) < c + ¯ l · max { k ∗ , k ′ } + log π ( F ) π ( F ∗ ) < c + ¯ l · max { k ∗ , k ′ } + log ε − ε < log π ε | F | · ( − π ε ) for any F ∈ F \ F ∗ . This implies that π t ( F ∗ ) > π ε for any t > a ∗ inall future periods. C Proof of Theorem 1
We establish inequality (3.3) when | Y | =
2, and later adjust the proof to environments where | Y | ≥
3. Forsimplicity, we first consider the case in which k ∗ = k ′ =
0. Our argument straightforwardly generalizesto other values of k ∗ and k ′ as long as k ∗ = k ′ .Since | Y | =
2, the agent chooses the action that induces the high-payoff outcome with higher probability.Without loss of generality, let Y ≡ { y , y } , and let v ( y ) = v ( y ) =
1. Let ζ , ζ ′ ∈ ( , / ) , and let F ∗ ( y | a = ) ≡ F ∗ ( y | a = ) ≡ ζ ′ ; F ( y | a = ) ≡ ζ F ( y | a = ) ≡ ζ ; (C.1) F ( y | a = ) ≡ ζ F ( y | a = ) ≡ ζ . The optimal action under F ∗ is 0. Let π ( F ) ≡ − ζ ′ , π ( F ) ≡ − ζ ′ , and π ( F ) ≡ ζ ′ . First, we show that forany γ >
0, there exist ζ and ζ ′ such that the asymptotic frequency of action 0 is less than γ . By the end ofthis section, we identify the crucial components in this construction to show that there exists an open set ofdistributions such that the agent can have arbitrarily small frequency of choosing action a ∗ .First we bound the expected number of times with which the agent chooses action 0 and action 1 whenthe true distribution is F ∗ and the agent’s prior belief attach small but positive probability to F ∗ . Let X a → a ′ ( F , F ) ≡ log F ( y | a ′ ) F ( y | a ′ ) with probability F ∗ ( y | a ) for every y ∈ Y . (C.2)Intuitively, this is the change in the log likelihood ratio between F and F when the previous period actionwas a and the current period action is a ′ . Let l ( F , F ) be the largest realization of X → ( F , F ) , and let l ( F , F ) be the smallest realization of X → ( F , F ) . By construction we have l ( F , F ) > l ( F , F ) < ( F , F ) and write l and l instead.Consider a hypothetical scenario in which the support of the agent’s prior belief is { F , F } . For any PROOFOFTHEOREM1 22discount factor δ ∈ [ , ) , there exists l ∗ ∈ R depending on δ such that the agent is indifferent betweenactions 0 and 1 when log π ( F ) π ( F ) = l ∗ , and strictly prefers action 0 if and only if log π ( F ) π ( F ) ≥ l ∗ . For every l > l ∗ ,let random variable τ ( l ) be the number of consecutive periods with which the agent takes action 0 whenthe initial value of log π ( F ) π ( F ) is l . For every l < l ∗ , let random variable τ ( l ) be the number of consecutiveperiods with which the agent takes action 1 when the initial value of log π ( F ) π ( F ) is l . Claim 4.
Random variables τ ( l ) and τ ( l ) have finite mean and variance for every l ∈ R .Proof. According to Bayes rule, l t = l t − + X → ( F , F ) for every t ∈ N . Let H be the maximal differencein the realization of random variable X ( F , F ) . The Chernoff-Hoeffding inequality implies thatPr [ l t > l ∗ ] ≤ exp (cid:18) − ( − t · E [ X → ( F , F )] − l + l ∗ ) tH (cid:19) . Since Pr [ l t > l ∗ ] vanishes exponentially as t → + ∞ , τ ( l ) has finite mean and variance for every l ≥ l ∗ .Similarly, one can also show that τ ( l ) has a finite mean and variance for every l ≤ l ∗ .Next, suppose F ∗ belongs to the support of agent’s prior belief. Let τ ε ( l ) be the number of consecutiveperiods with which the agent takes action 0 when the initial value of log π ( F ) π ( F ) is l − ε and switches to action 1if the log likelihood ratio is below l ∗ + ε after the first period. Let τ ε ( l ) be the number of consecutive periodswith which the agent takes action 1 when the initial value of log π ( F ) π ( F ) is l + ε and switches to action 0 if thelog likelihood ratio is above l ∗ − ε after the first period.For every ε >
0, there exists π ε > π ∈ ∆ ( F ) satisfies π ( F ∗ ) < π ε ,the agent strictly prefers action 0 when l > l ∗ + ε and strictly prefers action 1 when l < l ∗ − ε . Let X τ ( ε ) be the random variable that has the same distribution as τ ( l + l ∗ + ε ) and X ′ τ ( ε ) be the random variablethat has the same distribution as τ ε ( l ∗ ) . Let X τ ( ε ) be the random variable that has the same distribution as τ ( l + l ∗ − ε ) and X ′ τ ( ε ) be the random variable that has the same distribution as τ ε ( l ∗ ) . For every ε > η >
0, let ˆ c ( ε , η ) ≡ max t > n ( + η ) t E [ X τ ( ε )] − ln E h e tX τ ( ε ) io c ′ ( ε , η ) ≡ max t > n − ( − η ) t E [ X ′ τ ( ε )] − ln E h e − tX ′ τ ( ε ) io c ( ε , η ) ≡ min { ˆ c ( ε , η ) , c ′ ( ε , η ) } PROOFOFTHEOREM1 23and ˆ c ( ε , η ) ≡ max t > n ( + η ) t E [ X τ ( ε )] − ln E h e tX τ ( ε ) io c ′ ( ε , η ) ≡ max t > n − ( − η ) t E [ X ′ τ ( ε )] − ln E h e − tX ′ τ ( ε ) io c ( ε , η ) ≡ min { ˆ c ( ε , η ) , c ′ ( ε , η ) } . Let K ( ε , η ) ≡ c ( ε , η ) · log e c ( ε , η ) ε ( e c ( ε , η ) − ) and K ( ε , η ) ≡ c ( ε , η ) · log e c ( ε , η ) ε ( e c ( ε , η ) − ) . Let η ( ε , η ) ∈ R + be such that c ( ε , ¯ η ( ε , η )) = log ( K ( ε , η ) / ε ) K ( ε , η ) , and η ( ε , η ) ∈ R + be such that c ( ε , ¯ η ( ε , η )) = log ( K ( ε , η ) / ε ) K ( ε , η ) . For every k ∈ N , let t k ∈ N be the k th time such that a t k = a t k − = t k ∈ N be the k th time suchthat a t k = a t k − =
For every k ∈ N, let t̄_k ∈ N be the k-th period such that a_{t̄_k} = a_{t̄_k−1} = 0, and let t̲_k ∈ N be the k-th period such that a_{t̲_k} = a_{t̲_k−1} = 1. Let S̄_t be the total number of adjacent periods until t in which the agent chooses action 0, and let S̲_t be the total number of adjacent periods until t in which the agent chooses action 1.
Claim 5. For every ε > 0 and η > 0, if π(F∗) < π_ε, then with probability at least 1 − ε the following holds for every k ∈ N:

  S̄_{t̄_k} ≤ [ k(1+η) + K̄(ε, η)(1 + η̄(ε, η)) ] · E[ τ̄(l̄ + l∗ + ε) ],
  S̄_{t̄_k} ≥ ( k − K̄(ε, η) ) · (1 − η) · E[ τ̄_ε(l∗) ],
  S̲_{t̲_k} ≤ [ k(1+η) + K̲(ε, η)(1 + η̲(ε, η)) ] · E[ τ̲(l̲ + l∗ − ε) ],
  S̲_{t̲_k} ≥ ( k − K̲(ε, η) ) · (1 − η) · E[ τ̲_ε(l∗) ].

Proof. We establish the upper and lower bounds for S̄_{t̄_k}; the bounds for S̲_{t̲_k} can be derived by a similar argument. Claim 4 implies that E[X̄_τ(ε)] is finite. Suppose π is such that π(F∗) < π_ε. Since S̄_{t̄_k} is first-order stochastically dominated by ∑_{i=1}^{k} x_i with x_i ∼ X̄_τ(ε) i.i.d., Lemma A.3 implies that

  Pr[ S̄_{t̄_k} > k(1+η) · E[τ̄(l̄ + l∗ + ε)] ] ≤ exp( − k · c̄(ε, η) ).

The union bound implies that

  Pr[ ∪_{k ≥ K̄(ε, η)} { S̄_{t̄_k} > k(1+η) · E[τ̄(l̄ + l∗ + ε)] } ] ≤ ∑_{k ≥ K̄(ε, η)} exp( − k · c̄(ε, η) ) ≤ ε.

Moreover, for every k < K̄(ε, η), we have

  Pr[ S̄_{t̄_k} > k(1 + η̄(ε, η)) · E[τ̄(l̄ + l∗ + ε)] ] ≤ ε / K̄(ε, η).

Taking the union of these events, we obtain

  Pr[ ∪_{k ≥ 1} { S̄_{t̄_k} > k(1+η) · E[τ̄(l̄ + l∗ + ε)] + K̄(ε, η)(1 + η̄(ε, η)) · E[τ̄(l̄ + l∗ + ε)] } ]
    ≤ Pr[ ∪_{k ≥ K̄(ε, η)} { S̄_{t̄_k} > k(1+η) · E[τ̄(l̄ + l∗ + ε)] } ] + Pr[ ∪_{k < K̄(ε, η)} { S̄_{t̄_k} > k(1 + η̄(ε, η)) · E[τ̄(l̄ + l∗ + ε)] } ] ≤ 2ε.

Moreover, S̄_{t̄_k} first-order stochastically dominates ∑_{i=1}^{k} x_i with x_i ∼ X̄′_τ(ε) i.i.d. Lemma A.3 implies that

  Pr[ S̄_{t̄_k} < k(1−η) · E[τ̄_ε(l∗)] ] ≤ exp( − k · c̄(ε, η) ).

By the union bound,

  Pr[ ∪_{k ≥ K̄(ε, η)} { S̄_{t̄_k} < k(1−η) · E[τ̄_ε(l∗)] } ] ≤ ∑_{k ≥ K̄(ε, η)} exp( − k · c̄(ε, η) ) ≤ ε.
Claim 6. For every ε > 0, there exists a prior belief π_0 ∈ ∆(F) such that the event { π_t(F∗) < π_ε for every t ∈ N } occurs with probability at least 1 − ε.

Proof. Let l_t(F, F′) ≡ log[ π_t(F) / π_t(F′) ]. Let X_t(F∗) ≡ l_t(F∗, F_0) − l_{t−1}(F∗, F_0), Z_0(F∗) ≡ l_0(F∗, F_0), and

  Z_t(F∗) = Z_{t−1}(F∗) + l_t(F∗, F_0) − l_{t−1}(F∗, F_0) − E[ X_t(F∗) | h^{t−1} ]  for every t ≥ 1.
One can verify that {Z_t(F∗)}_{t ∈ N} is a martingale. By definition, l_t(F∗, F_0) = Z_t(F∗) + ∑_{t′ ≤ t} E[ X_{t′}(F∗) | h^{t′−1} ]. Let H be the difference between the maximal and minimal realizations of X_t(F∗). We have |Z_t(F∗) − Z_{t−1}(F∗)| ≤ H for every t ∈ N. According to Lemma A.2,

  Pr[ Z_t(F∗) − Z_0(F∗) ≥ t η ] ≤ exp( − t η² / (2H²) )  for every η ∈ R₊.

Let T ≡ (2H²/η²) · log[ e^{η²/(2H²)} / ( ε ( e^{η²/(2H²)} − 1 ) ) ]. Taking the union of these events, we obtain the following upper bound:

  Pr[ ∪_{t ≥ T} { Z_t(F∗) − Z_0(F∗) ≥ t η } ] ≤ ∑_{t ≥ T} exp( − t η² / (2H²) ) ≤ ε.

Moreover, for every t < T, we have

  Pr[ Z_t(F∗) − Z_0(F∗) ≥ H log(T/ε)/η ] ≤ ε/T.

Taking the union bound, we obtain

  Pr[ ∪_{t ≥ 1} { l_t(F∗, F_0) − l_0(F∗, F_0) ≥ t η + H log(T/ε)/η + ∑_{t′ ≤ t} E[ X_{t′}(F∗) | h^{t′−1} ] } ] ≤ 2ε.  (C.3)

Similarly, letting X′_t(F∗) ≡ l_t(F∗, F_1) − l_{t−1}(F∗, F_1), we have

  Pr[ ∪_{t ≥ 1} { l_t(F∗, F_1) − l_0(F∗, F_1) ≥ t η + H log(T/ε)/η + ∑_{t′ ≤ t} E[ X′_{t′}(F∗) | h^{t′−1} ] } ] ≤ 2ε.  (C.4)

Let

  T̄′ = K̄(ε, η)(1 + η̄(ε, η)) E[τ̄(l̄ + l∗ + ε)] + K̄(ε, η) E[τ̄_ε(l∗)],
  T̲′ = K̲(ε, η)(1 + η̲(ε, η)) E[τ̲(l̲ + l∗ − ε)] + K̲(ε, η) E[τ̲_ε(l∗)].

Note that the expected log likelihood of F∗ in a period in which the agent switches from action 0 to action 1 is ∑_{y ∈ Y} F∗(y|0) log F∗(y|1), which diverges to −∞ as ζ′ goes to 0, while the expected log likelihood of F ∈ {F_0, F_1} remains bounded. Thus, according to Claim 5, for sufficiently small ζ′ the event

  ∑_{t′ ≤ t} E[ X_{t′}(F∗) | h^{t′−1} ] ≤ T̄′ · ( Z̄ + η ) − t · η  for every t ∈ N

occurs with probability at least 1 − ε. A similar bound holds for ∑_{t′ ≤ t} E[ X′_{t′}(F∗) | h^{t′−1} ]. Combining these with inequalities (C.3) and (C.4), for sufficiently small l_0(F∗, F_0) and l_0(F∗, F_1), we have

  Pr[ ∪_{t ≥ 1} { l_t(F∗, F_0) ≥ log π_ε or l_t(F∗, F_1) ≥ log π_ε } ] ≤ ε.

Therefore, the probability of the event that π_t(F∗) < π_ε for every t ∈ N is at least 1 − ε.

According to Claim 6, there exists a positive-probability event on which the probability that the agent's posterior belief attaches to F∗ is small in all periods. Conditional on this event, Claim 5 implies that the agent's action cycles between 0 and 1. In the last step, we bound the asymptotic frequency of action 0:

  sup_{σ ∈ Σ∗(π)} lim sup_{t → +∞} E^σ[ (1/t) ∑_{s=1}^{t} 1{a_s = a∗} ]
    = lim sup_{k → +∞} E[ S̄_{t̄_k} / ( S̄_{t̄_k} + S̲_{t̄_k} ) ]
    ≤ lim sup_{k → +∞} [ k(1+η) + K̄(ε, η)(1 + η̄(ε, η)) ] E[τ̄(l̄ + l∗ + ε)] / [ ( k − K̲(ε, η) )(1−η) E[τ̲_ε(l∗)] ] + ε
    = E[τ̄(l̄ + l∗ + ε)] / [ (1−η) E[τ̲_ε(l∗)] ] + ε
    ≤ γ.

The first inequality holds by applying Claims 5 and 6 directly and noting that when the events in those claims fail, the expected frequency is at most 1. Moreover, for any γ′ < γ, there exist ζ > 0 and distributions F_0, F_1 such that E[τ̄(l̄ + l∗ + ε)] / [ (1−η) E[τ̲_ε(l∗)] ] < γ′, because in our construction E[τ̄(l̄ + l∗ + ε)] is bounded from above for any ζ > 0 while E[τ̲_ε(l∗)] grows without bound as ζ → 0.
Thus, the last inequality holds by setting ε, η, and ζ to be sufficiently small constants.
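To make the cycling mechanism concrete, the following exploratory sketch simulates the whole dynamic. The lags, the payoff convention (payoff equals the realized outcome), the myopic decision rule, and all distributions and priors are placeholder assumptions rather than the construction in (C.1); whether the simulated path cycles, and how rarely it plays the optimal action, depends entirely on those choices.

```python
# Exploratory sketch of the dynamic behind Theorem 1: outcomes depend on the action
# k_star periods ago, but the agent updates as if the relevant lag were k_prime.
# Payoff = outcome y, myopic action choice; all parameters are illustrative, not (C.1).
import numpy as np

rng = np.random.default_rng(1)
k_star, k_prime = 1, 0                            # true vs. perceived lag (hypothetical)
# model[a] = Pr[y = 1 | action a]
F_true = {0: 0.9, 1: 0.1}                         # action 0 is optimal under the truth
models = {"F0": {0: 0.45, 1: 0.55}, "F1": {0: 0.55, 1: 0.45}, "F*": F_true}
log_post = {"F0": np.log(0.499), "F1": np.log(0.499), "F*": np.log(0.002)}

def weights(log_post):
    lp = np.array(list(log_post.values()))
    w = np.exp(lp - lp.max())                     # normalize to avoid underflow
    return w / w.sum()

T = 20_000
actions = [0, 0]                                  # arbitrary initial actions
for t in range(2, T):
    w = weights(log_post)
    # myopic rule: choose the action with the larger posterior-expected Pr[y = 1]
    a = max((0, 1), key=lambda a: sum(wi * models[m][a] for wi, m in zip(w, log_post)))
    actions.append(a)
    y = rng.random() < F_true[actions[t - k_star]]    # generated with the TRUE lag
    a_attr = actions[t - k_prime]                     # attributed with the WRONG lag
    for m in log_post:
        p = models[m][a_attr] if y else 1 - models[m][a_attr]
        log_post[m] += np.log(p)

print("frequency of action 0:", actions.count(0) / len(actions))
```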
Remark: The proof of Theorem 1 does not hinge on the particular parameter values used in the instance (C.1). We summarize the important features of the construction of F∗ and π_0:

1. The KL-divergence between F∗(·|0) and F∗(·|1) is sufficiently large. This is enough to establish that the posterior probability of F∗ converges to 0. (Since the KL-divergence is not symmetric, it suffices that either D(F∗(·|0) ‖ F∗(·|1)) or D(F∗(·|1) ‖ F∗(·|0)) is large.)

2. F_1(·|0) is closer to F∗(·|0) than F_0(·|0) is, and F_0(·|1) is closer to F∗(·|1) than F_1(·|1) is. This is enough to establish that the action cycles between 0 and 1 for infinitely many periods whenever the agent does not attach high probability to F∗.

3. The expected log likelihood ratio E[X_{1→1}(F_0, F_1)] is sufficiently close to 0. This is enough to establish that the number of periods required for the agent to switch from action 1 to action 0 is sufficiently large, which implies that the limit frequency of action 0 is sufficiently small.

As is evident from Claims 5 and 6, essentially any instance satisfying these three properties delivers an arbitrarily small limit frequency of a∗; the example in (C.1) is one illustration that satisfies all three. Next we discuss generalizations of the inefficiency result to broader settings.
• When |Y| ≥ 3, fix a subset Y′ ⊆ Y with |Y′| = 2 and consider the remaining outcomes y ∈ Y \ Y′. For any γ > 0, by choosing distributions F∗, F_0, F_1 such that (1) the probability that the realized outcome lies in Y \ Y′ is sufficiently small, and (2) the conditional distribution on Y′ is the same as the one constructed for the case with only two outcomes, we can show that the expected average frequency of choosing action a∗ is smaller than γ (see the sketch after this list).
• When |A| ≥ 3, fix a subset A′ ⊆ A with |A′| = 2. For any γ > 0, by choosing distributions F∗, F_0, F_1 such that (1) it is always suboptimal to choose any action a ∈ A \ A′ under any belief π, and (2) the distributions conditional on choosing a ∈ A′ are the same as the ones constructed for the case with only two actions, we can show that the expected average frequency of choosing action a∗ is smaller than γ.

• For general time lags k∗ ≠ k′, we need an additional step to show that the number of periods before the agent switches actions is not always pinned down by the gap between k∗ and k′. This is immediate in the baseline case considered above, and when the KL-divergence between F∗(·|0) and F∗(·|1) is sufficiently large, the attribution error is sufficiently large and the posterior belief on F∗ still converges to 0.
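A minimal sketch of the padding step in the first bullet: a two-outcome conditional distribution is embedded into three outcomes by giving the extra outcome a small probability δ under every action, leaving the conditional distribution on the original two outcomes unchanged. The input format and δ are illustrative.

```python
def embed_third_outcome(F_two, delta=1e-3):
    """F_two[a] = (p0, p1) over two outcomes -> distribution over three outcomes,
    where the third outcome has the same small probability delta under every action."""
    return {a: ((1 - delta) * p0, (1 - delta) * p1, delta) for a, (p0, p1) in F_two.items()}

print(embed_third_outcome({0: (0.6, 0.4), 1: (0.4, 0.6)}))
```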
D Proof of Theorem 2

Appendix D.1 characterizes the principal's payoff in the auxiliary game. Appendix D.2 establishes the connection between the principal's payoff in the auxiliary game and his payoff in the original game with symmetric uncertainty.
D.1 Payoff in the Auxiliary Game
This section examines the principal's asymptotic payoff in an auxiliary game in which he knows the true state but the agent is naive, in the sense that she ignores the informational content of the principal's proposals and updates her belief based only on the chosen policies and observed signals. For every F ∈ F, let

  u̲(σ_p, F) ≡ lim inf_{t → +∞} (1/t) E^{σ_p}[ ∑_{s=1}^{t} a_s | F ]  (D.1)

and

  ū(σ_p, F) ≡ lim sup_{t → +∞} (1/t) E^{σ_p}[ ∑_{s=1}^{t} a_s | F ]  (D.2)

be the lower and upper bounds on the principal's asymptotic payoff when he uses strategy σ_p and the true state is F. Let

  U̲(F) ≡ sup_{σ_p ∈ Σ_p} u̲(σ_p, F)  and  Ū(F) ≡ sup_{σ_p ∈ Σ_p} ū(σ_p, F).  (D.3)

We establish two lemmas.
Lemma D.1. For every π_0 ∈ ∆(F) that has full support, we have U̲(F_1) = Ū(F_1) = 1.

Proof. Let π_{t,i} be the posterior probability of distribution F_i under the agent's belief in period t, and let l_t ≡ log(π_{t,1}/π_{t,0}). There exists a threshold l∗ such that the agent chooses action a_t = 1 if and only if l_t > l∗.
We show that for any ε > 0, the following strategy for the principal attains payoff at least 1 − ε. The principal always proposes action 0 until the log likelihood ratio satisfies l_t > l∗ + c, where c is specified later in the analysis; once this condition is met, he switches to always proposing action 1. Note that when the principal proposes action 0 in every period, there is no attribution error, and the agent learns the correct distribution. By inequality (B.1), for any ε_1 > 0, any prior π_0, and any parameter c, there exists T > 0 such that with probability at least 1 − ε_1 we have l_T > l∗ + c; thus with probability at least 1 − ε_1 the principal switches to proposing action 1 before time T. Moreover, by Claim 3, for any ε_2 > 0 there exists c > 0 such that with probability at least 1 − ε_2, l_{t+T} > l_T − c for all t > 0. Setting ε_1 = ε_2 = ε/2 and applying the union bound, with probability at least 1 − ε we have l_t > l∗ for every t > T. Thus the principal's payoff from this strategy is at least 1 − ε. Taking ε → 0 completes the proof.
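A simulation sketch of this two-phase proposal rule follows. The threshold, buffer, and distributions are placeholders; the time lag is omitted, so the bounded attribution error at the single switch (which the buffer c absorbs in the proof) does not appear; and we assume purely for illustration that the proposed policy is implemented whenever the agent approves it (she approves policy 1 iff l_t > l∗ and always approves policy 0).

```python
# Sketch of the threshold strategy in the proof of Lemma D.1 (illustrative parameters).
# Assumptions: no time lag, proposal implemented iff approved by the naive agent.
import numpy as np

rng = np.random.default_rng(2)
l_star, c = 0.0, 5.0
F1 = {0: 0.7, 1: 0.8}          # Pr[y = 1 | policy] in the true state F1
F0 = {0: 0.3, 1: 0.2}          # the competing state in the agent's support

def run(T=20_000):
    l, switched, ones = 0.0, False, 0
    for _ in range(T):
        proposal = 1 if switched else 0
        policy = proposal if (proposal == 0 or l > l_star) else 0   # agent's approval rule
        ones += policy
        y = rng.random() < F1[policy]                               # outcomes drawn from F1
        p1 = F1[policy] if y else 1 - F1[policy]
        p0 = F0[policy] if y else 1 - F0[policy]
        l += np.log(p1 / p0)                                        # naive Bayesian update
        if l > l_star + c:
            switched = True                                         # propose policy 1 forever
    return ones / T

print("frequency of policy 1:", run())
```

With these placeholder numbers the log likelihood ratio drifts upward, the switch occurs quickly, and the frequency of policy 1 is close to one, mirroring the payoff of 1 in the lemma.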
Lemma D.2. For every π_0 ∈ ∆(F) that has full support,

  U̲(F_0) = Ū(F_0) = q∗ λ.  (D.4)

The proof consists of two parts. In Section D.1.1, we show that

  sup_{σ_p ∈ Σ_p} ū(σ_p, F_0) ≤ q∗ λ.  (D.5)

In Section D.1.2, we show that

  sup_{σ_p ∈ Σ_p} u̲(σ_p, F_0) ≥ q∗ λ.  (D.6)

D.1.1 Proof of Lemma D.2: Establish the Payoff Upper Bound
Let Π_1 ≡ { π ∈ ∆(F) : argmax_{i ∈ {0,1}} ∑_{F ∈ F} π(F) ∑_{y ∈ Y} F(y|i) v(y) = {1} } be the set of beliefs under which the agent strictly prefers action 1. Claim 3 implies that there exists p > 0 such that for every π_t ∉ Π_1 and every σ_p ∈ Σ_p, we have

  Pr( π_s ∉ Π_1 for every s ≥ t | F_0, σ_p ) > p.  (D.7)

For every k ∈ N, we say that π_t crosses Π_1 in period k (equivalently, there is a crossing in period k) if π_{k−1} ∈ Π_1 and π_k ∉ Π_1, or π_{k−1} ∉ Π_1 and π_k ∈ Π_1. The uniform lower bound in (D.7) implies that for every σ_p ∈ Σ_p, the number of crossings is finite almost surely. Therefore,

  Pr( ∃ t ∈ N s.t. π_s ∉ Π_1 for every s ≥ t | F_0, σ_p ) + Pr( ∃ t ∈ N s.t. π_s ∈ Π_1 for every s ≥ t | F_0, σ_p ) = 1.  (D.8)

Let E_{σ_p} denote the second event, namely that there exists t ∈ N such that π_s ∈ Π_1 for every s ≥ t. Let σ∗_p be the principal's strategy that maximizes the probability of the event E_{σ∗_p} given prior π_0; by definition, q∗ is the probability of E_{σ∗_p} when the principal uses σ∗_p.

The principal's asymptotic payoff conditional on the event { ∃ t ∈ N s.t. π_s ∉ Π_1 for every s ≥ t } is zero. We conclude the proof by showing that his asymptotic payoff conditional on the event E_{σ_p} is at most λ for every σ_p ∈ Σ_p satisfying Pr[E_{σ_p} | F_0, σ_p] > 0. Suppose, toward a contradiction, that there exist ε > 0 and σ_p ∈ Σ_p such that, conditional on E_{σ_p}, the asymptotic frequency of policy 1 exceeds λ + ε when the true state is F_0. First, observe that the asymptotic frequency of (a_{t−1}, a_t) = (0, 1) equals that of (a_{t−1}, a_t) = (1, 0) regardless of the principal's strategy σ_p. The definition of λ then implies that

  lim_{t → +∞} E[ log( π_t(F_0) / π_t(F_1) ) | E_{σ_p}, σ_p ] = +∞.  (D.9)

As a result, the agent strictly prefers action 0 asymptotically when the principal uses strategy σ_p conditional on the event E_{σ_p}. This contradicts the definition of E_{σ_p}, under which the agent strictly prefers policy 1.

D.1.2 Proof of Lemma D.2: Attain the Payoff Upper Bound
We construct σ_p^ε ∈ Σ_p for every ε > 0 such that

  u̲(σ_p^ε, F_0) ≥ q∗ λ − ε.  (D.10)

For every σ_p ∈ Σ_p and l ∈ R, let E_{σ_p, l} be the following event: when the principal uses strategy σ_p,

  log( π_s(F_1) / π_s(F_0) ) ≥ l for every s ≥ t, for some t ∈ N.  (D.11)

Let Π_1(l) ⊂ ∆(F) be the set of beliefs that satisfy the inequality in (D.11). Recall the definition of E_{σ_p} in (D.8), which implies the existence of l∗ ∈ R₊ such that E_{σ_p, l} ⊂ E_{σ_p} and Π_1(l) ⊂ Π_1 for every l ≥ l∗. Recall that σ∗_p is the strategy that maximizes the probability of the event E_{σ∗_p} given prior π_0.
Lemma D.3. For every π_0 ∈ ∆(F) that has full support and every l ∈ R, we have Pr[E_{σ∗_p, l} | F_0, σ∗_p] = q∗.

Proof of Lemma D.3: As shown before, there exists p > 0 such that for every π_t ∉ Π_1, the probability with which π_s ∉ Π_1 for every s ≥ t is at least p when the true state is F_0. As a result, for every l ∈ R₊, the probability of the following event is zero under any strategy in Σ_p:

  • π_s ∈ Π_1 \ Π_1(l) for every s ≥ t.

This implies that for every l ∈ R satisfying Π_1(l) ⊂ Π_1, we have

  Pr( ∃ t ∈ N s.t. π_s ∈ Π_1(l) for all s ≥ t | F_0, σ∗_p ) = Pr( ∃ t ∈ N s.t. π_s ∈ Π_1 for all s ≥ t | F_0, σ∗_p ),  (D.12)

and moreover,

  Pr( ∃ t ∈ N s.t. π_s ∈ Π_1 for all s ≥ t | F_0, σ∗_p ) + Pr( ∃ t ∈ N s.t. π_s ∉ Π_1 for all s ≥ t | F_0, σ∗_p ) = 1.  (D.13)

Equations (D.12) and (D.13) together imply that Pr[E_{σ∗_p, l} | F_0, σ∗_p] = Pr[E_{σ∗_p} | F_0, σ∗_p], and the latter equals q∗.

Next we focus on the case in which λ > 0, that is, E[X_{0→1} + X_{1→0}] > 0 while E[X_{1→1}] < 0.
For small enough ε > 0, let T_0, T_1 ∈ N be two positive integers such that T_0 is even and T_1/(T_0 + T_1) ∈ (λ − ε, λ). Let T ≡ T_0 + T_1. Let σ̃_p ∈ Σ_p be defined as follows:

  • σ̃_p(h^t) = 0 if there exists k ∈ N such that t ∈ {kT + 1, kT + 2, ..., kT + T_0};
  • σ̃_p(h^t) = 1 otherwise.

Under σ̃_p, the frequency with which the principal proposes policy 1 belongs to the interval (λ − ε, λ). Let l_t ≡ log( π_t(F_1)/π_t(F_0) ), and let X_T be the increment of l_t from period t to period t + T when policy 0 is chosen in periods t + 1, ..., t + T_0 and policy 1 is chosen in the other periods. Let H be the maximal realization of |X_T|. Let r∗ > 0 and η > 0 be such that E_{z ∼ X_T}[ exp(−r∗ z) ] = 1 and exp(−r∗ η) < ε. Let l̄ ∈ R be large enough that Π_1(l̄) ⊂ Π_1. Recall the definition of σ∗_p. Let σ_p^ε ∈ Σ_p be defined as follows:

  • σ_p^ε(h^t) = σ∗_p(h^t) if π_{t′} ∉ Π_1(l̄ + η + H) for every t′ ≤ t;
  • σ_p^ε(h^t) = σ̃_p(h^t) otherwise.

Conditional on π_t reaching Π_1(l̄ + η + H), Wald's inequality in Lemma A.1 implies that the probability with which π_s ∈ Π_1(l̄) for every s ≥ t is at least 1 − ε. Therefore the principal's asymptotic payoff is at least q∗(λ − ε) when the true state is F_0.
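The role of the cyclic strategy can be explored numerically. The sketch below computes, for a periodic proposal path, the expected per-period drift of the agent's log likelihood ratio when the outcome depends on the policy k∗ periods ago but the naive agent attributes it to the policy k′ periods ago; the lags, distributions, and cycle length are placeholders rather than the paper's values. The frequency of policy 1 at which the drift changes sign plays the role that λ plays in the text.

```python
# Exploratory sketch for the cyclic strategy in D.1.2: expected per-period drift of
# l_t = log pi_t(F1)/pi_t(F0) along a periodic policy path under a misattributed lag.
# Lags, distributions, and cycle lengths are illustrative placeholders.
import numpy as np

k_star, k_prime = 1, 0
F_true = {0: 0.7, 1: 0.2}       # Pr[y = 1 | policy]; the true state (equal to model F0)
F1     = {0: 0.7, 1: 0.8}       # the state the principal wants the agent to believe
F0     = dict(F_true)

def likelihood(F, y, a):
    return F[a] if y == 1 else 1 - F[a]

def per_period_drift(path):
    """Average expected increment of l_t over one cycle of the periodic path."""
    T, total = len(path), 0.0
    for t in range(T):
        a_gen  = path[(t - k_star) % T]     # policy that actually generated y_t
        a_attr = path[(t - k_prime) % T]    # policy the agent attributes y_t to
        for y in (0, 1):
            p_y = likelihood(F_true, y, a_gen)
            total += p_y * np.log(likelihood(F1, y, a_attr) / likelihood(F0, y, a_attr))
    return total / T

for T1 in (1, 2, 3, 4):
    path = [0] * (6 - T1) + [1] * T1        # frequency of policy 1 is T1 / 6
    print(f"freq = {T1/6:.2f}  drift = {per_period_drift(path):+.3f}")
```

In this toy example the drift is positive at low frequencies of policy 1 and turns negative as the frequency rises, which is exactly the trade-off that pins down λ.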
D.2 Connections between the Auxiliary Game and the Original Game

We show that the principal's payoff in the original game equals his expected payoff in the auxiliary game studied in Appendix D.1. First, we show that the principal learns the true state asymptotically regardless of the chosen policies.
Lemma D.4.
For every σ_p ∈ Σ_p, F ∈ F, and ε > 0, there exists τ ∈ N such that

  Pr( π_τ(F) > 1 − ε | F ) > 1 − ε.  (D.14)

Proof of Lemma D.4:
Let Q_F be the probability measure over H induced by distribution F, and let Q_p be the probability measure over H induced by the principal's prior belief π_0 ∈ ∆(F). For every history h^t, let q_{F|h^t} ∈ ∆(A × Y) be the principal's belief about (a_t, y_t) conditional on the true state being F, and let q_{π_t|h^t} ∈ ∆(A × Y) be the principal's belief about (a_t, y_t) when his belief about the state is π_t ∈ ∆(F). The chain rule for relative entropy implies that

  − log π_0(F) ≥ d( Q_F ‖ Q_p ) = ∑_{t=0}^{∞} E_{Q_F}[ d( q_{F|h^t} ‖ q_{π_t|h^t} ) ].  (D.15)

Condition 2 implies that d( q_{F|h^t} ‖ q_{F′|h^t} ) > 0 whenever F ≠ F′.
Since F is finite, for every ε > 0 there exists η > 0 such that d( q_{F|h^t} ‖ q_{π|h^t} ) > η for every π ∈ ∆(F) satisfying π(F) ≤ 1 − ε. Inequality (D.15) implies the existence of τ ∈ N such that

  ∑_{t=τ}^{∞} E_{Q_F}[ d( q_{F|h^t} ‖ q_{π_t|h^t} ) ] ≤ η ε.  (D.16)

Markov's inequality then implies that the probability with which d( q_{F|h^t} ‖ q_{π_t|h^t} ) > η is strictly less than ε for every t ≥ τ, or equivalently, the probability with which π_t(F) ≤ 1 − ε is less than ε for every t ≥ τ.
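A small sketch of the merging logic behind Lemma D.4: with finitely many states that remain identifiable after every action, the posterior on the true state concentrates regardless of the policy path. The states, the random policy path, and the omission of the time lag (which the principal's correctly specified belief handles in the paper) are illustrative simplifications.

```python
# Sketch of posterior merging: the observer's posterior on the true state goes to 1
# regardless of the (here: arbitrary, random) policy path, because the two states
# are identifiable under every action.  States and path are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(3)
states = {"F1": {0: 0.3, 1: 0.8}, "F2": {0: 0.6, 1: 0.4}}   # Pr[y = 1 | a] per state
true = "F1"
log_post = {s: np.log(0.5) for s in states}

for _ in range(2_000):
    a = int(rng.integers(0, 2))                 # arbitrary policy path
    y = rng.random() < states[true][a]
    for s in states:
        p = states[s][a] if y else 1 - states[s][a]
        log_post[s] += np.log(p)

lp = np.array([log_post[s] for s in states])
post = np.exp(lp - lp.max()); post /= post.sum()
print(dict(zip(states, post)))
```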
Lemma D.5. We have V̲ = V̄ = ∑_{F ∈ F} π_0(F) U̲(F) = ∑_{F ∈ F} π_0(F) Ū(F).

Proof of Lemma D.5: Since U̲(F) = Ū(F) for every F ∈ F, we have V̄ ≤ ∑_{F ∈ F} π_0(F) Ū(F). We show that V̲ ≥ ∑_{F ∈ F} π_0(F) U̲(F) by constructing, for every ε > 0, a strategy σ_p^ε such that V̲(σ_p^ε) ≥ ∑_{F ∈ F} π_0(F) U̲(F) − ε.
For every ε > 0, let τ ∈ N be such that Pr( π_τ(F) > 1 − ε | F ) > 1 − ε for every F ∈ F, and let σ_p^{F_0}(ε) ∈ Σ_p be a strategy under which the principal obtains utility u̲(σ_p^{F_0}(ε), F_0) ≥ U̲(F_0) − ε if the true state is F_0; such a strategy exists according to Lemma D.2. Similarly, let σ_p^{F_1}(ε) be a strategy under which the principal obtains utility u̲(σ_p^{F_1}(ε), F_1) ≥ 1 − ε; such a strategy exists according to Lemma D.1. Let σ_p^ε ∈ Σ_p be the strategy under which:

  • the principal follows σ_p^{F_0}(ε) in every period t ≤ τ;
  • if π_τ(F_1) ≥ 1 − ε, the principal follows σ_p^{F_1}(ε) starting from period τ;
  • otherwise, he follows σ_p^{F_0}(ε) starting from period τ.

Next, we establish a lower bound on the principal's asymptotic payoff from strategy σ_p^ε. Conditional on the true state being F_0, the probability with which the principal plays σ_p^{F_0}(ε) in every period is at least 1 − ε. Conditional on the true state being F_1, there exists T > 0 such that the probability with which the agent chooses action 1 in every period after τ + T is greater than 1 − ε, according to Lemma D.1. As a result, the principal's asymptotic payoff from σ_p^ε is at least (1 − ε) ∑_{F ∈ F} π_0(F) U̲(F). Since the principal's stage-game payoff is between 0 and 1, we have (1 − ε) ∑_{F ∈ F} π_0(F) U̲(F) ≥ ∑_{F ∈ F} π_0(F) U̲(F) − ε.
Generalizations. Finally, we discuss how our results generalize to broader settings.

1. None of the results in this section hinges on the particular pair of time lags (k∗, k′) assumed above; in fact, all the lemmas and claims hold directly for general time lags.

2. When |F| > 2, denote by F_0 the set of distributions under which the agent's optimal action is 0, and by F_1 the set of distributions under which the agent's optimal action is 1. The results generalize directly when |F_1| > 1: as we observe from Lemma D.1, the principal's payoff in the auxiliary game does not depend on the prior when the true state is in F_1, so the principal can simply learn the true distribution with high probability, as described in Lemma D.5. Matters are more complicated when |F_0| > 1. The main reason is that the principal's payoff in the auxiliary game depends on the prior π_0 when the true state is in F_0. When the principal faces uncertainty over F_0, if there does not exist a strategy σ_p that maximizes the probability of the event E_{σ_p} simultaneously for all F ∈ F_0, the principal suffers a non-negligible utility loss in the process of learning the true state.
References

Robert J. Aumann and Michael Maschler. Repeated Games with Incomplete Information. MIT Press, 1995.

Robert Berk. Limiting behavior of posterior distributions when the model is incorrect. The Annals of Mathematical Statistics, pages 51–58, 1966.

Aislinn Bohren and Daniel N. Hauser. Learning with model misspecification: Characterization and robustness. Working paper, 2020.

Persi Diaconis and David Freedman. On the consistency of Bayes estimates. The Annals of Statistics, pages 1–26, 1986.

Kfir Eliaz and Ran Spiegler. A model of competing narratives. American Economic Review, 110(12):3786–3816, 2020.

Ignacio Esponda and Demian Pouzo. Berk–Nash equilibrium: A framework for modeling agents with misspecified models. Econometrica, 84(3):1093–1130, 2016.

Ignacio Esponda and Demian Pouzo. Equilibrium in misspecified Markov decision processes. Theoretical Economics, forthcoming, 2020.

Ignacio Esponda, Demian Pouzo, and Yuichi Yamamoto. Asymptotic behavior of Bayesian learners with misspecified models. arXiv preprint arXiv:1904.08551, 2020.

Françoise Forges. Repeated games of incomplete information: non-zero-sum. Handbook of Game Theory with Economic Applications, 1:155–177, 1992.

Mira Frick, Ryota Iijima, and Yuhta Ishii. Stability and robustness in misspecified learning models. Working paper, 2020.

Drew Fudenberg and Jean Tirole. Perfect Bayesian equilibrium and sequential equilibrium. Journal of Economic Theory, 53(2):236–260, 1991.

Drew Fudenberg, Gleb Romanyuk, and Philipp Strack. Active learning with a misspecified prior. Theoretical Economics, 12(3):1155–1189, 2017.

Drew Fudenberg, Giacomo Lanzani, and Philipp Strack. Limit points of endogenous misspecified learning. Working paper, 2020.

Sergiu Hart. Nonzero-sum two-person repeated games with incomplete information. Mathematics of Operations Research, 10(1):117–153, 1985.

Kevin He. Mislearning from censored data: The gambler's fallacy in optimal-stopping problems. Working paper, 2020.

Philippe Jehiel and Larry Samuelson. Reputation with analogical reasoning. The Quarterly Journal of Economics, 127(4):1927–1969, 2012.

Yaonan Jin, Yingkai Li, Yining Wang, and Yuan Zhou. On asymptotically tight tail bounds for sums of geometric and exponential random variables. arXiv preprint arXiv:1902.02852, 2019.

Emir Kamenica and Matthew Gentzkow. Bayesian persuasion. American Economic Review, 101(6):2590–2615, 2011.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Pooya Molavi. Macroeconomics with learning and misspecification: A general theory and applications. Working paper, 2020.

Yaw Nyarko. Learning in mis-specified models and the possibility of cycles. Journal of Economic Theory, 55(2):416–427, 1991.

Hazhir Rahmandad, Nelson Repenning, and John Sterman. Effects of feedback delay on learning. System Dynamics Review, 25(4):309–338, 2009.

Nelson Repenning and John Sterman. Capability traps and self-confirming attribution errors in the dynamics of process improvement. Administrative Science Quarterly, 47:265–295, 2002.

Cosma Rohilla Shalizi. Dynamics of Bayesian updating with dependent data and misspecified models. Electronic Journal of Statistics, 3:1039–1074, 2009.

Ran Spiegler. Placebo reforms. American Economic Review, 103(4):1490–1506, 2013.

Abraham Wald. On cumulative sums of random variables. The Annals of Mathematical Statistics, 15(3):283–296, 1944.