Mislearning from Censored Data: The Gambler's Fallacy in Optimal-Stopping Problems
Kevin He ∗
First version: March 21, 2018. This version: August 15, 2019
Abstract
I study endogenous learning dynamics for people expecting systematic reversals from random sequences: the “gambler’s fallacy.” Biased agents face an optimal-stopping problem, such as managers conducting sequential interviews. They are uncertain about the underlying distribution (e.g., talent distribution in the labor pool) and learn its parameters from their predecessors. Agents stop when early draws are deemed “good enough,” so predecessors’ experiences contain negative streaks but not positive streaks. Since biased agents understate the likelihood of consecutive below-average draws, society converges to over-pessimistic beliefs about the distribution’s mean. When early agents decrease their acceptance thresholds due to pessimism, later agents will become more surprised by the lack of positive reversals in their predecessors’ histories, leading to more pessimistic inferences and lower acceptance thresholds: a positive-feedback cycle. Agents who are additionally uncertain about the distribution’s variance believe in fictitious variation (exaggerated variance) to an extent depending on the severity of data censoring.

∗ California Institute of Technology and University of Pennsylvania. Email: [email protected]. I am indebted to Drew Fudenberg, Matthew Rabin, Tomasz Strzalecki, and Ben Golub for their guidance and support. I thank Isaiah Andrews, Ruiqing Cao, In-Koo Cho, Martin Cripps, Krishna Dasaratha, Jetlir Duraj, Ben Enke, Ignacio Esponda, Jiacheng Feng, Mira Frick, Tristan Gagnon-Bartsch, Ashvin Gandhi, Oliver Hart, Johannes Hörner, Alice Hsiaw, Ryota Iijima, Yuhta Ishii, Lawrence Jin, Yizhou Jin, Michihiro Kandori, Max Kasy, Shengwu Li, Jonathan Libgober, Matthew Lilley, George Mailath, Eric Maskin, Weicheng Min, Xiaosheng Mu, Andy Newman, Harry Pei, Joshua Schwartzstein, Roberto Serrano, Philipp Strack, Elie Tamer, Omer Tamuz, Michael Thaler, Linh T. Tô, Maria Voronina, Yuichi Yamamoto, and my seminar participants for their insightful comments.

Introduction
The gambler’s fallacy is widespread. Many people believe that a fair coin has a higher chance of landing on tails after landing on heads three times in a row, think a son is “due” to a woman who has given birth to consecutive daughters, and, in general, expect too much reversal from sequential realizations of independent random events. Studies have documented this bias in settings where it is strictly costly, such as state lotteries with pari-mutuel payouts (Terrell, 1994; Suetens, Galbo-Jørgensen, and Tyran, 2016) and incentivized lab experiments (Benjamin, Moore, and Rabin, 2017). The same bias also affects experienced decision-makers in high-stakes environments, including immigration judges (Chen, Moskowitz, and Shue, 2016). Section 1.3 surveys more of this empirical literature.

This paper highlights novel implications of the gambler’s fallacy in optimal-stopping problems when a society of biased agents learns about the underlying distributions. As a running example, consider a junior HR manager who sequentially interviews candidates for a single job opening. In deciding whether to hire a candidate or to keep searching, the junior manager must form a belief about the distribution of potential future applicants should she keep the position open. She consults with senior managers and adopts their belief about the labor pool based on their recruiting experience for similar positions in the past. The junior manager then implements a stopping strategy for her own recruiting problem, updates her belief at the end of the hiring season, and shares this new belief with future managers. Suppose all managers commit the gambler’s fallacy; that is, they exaggerate how unlikely it is to get consecutive above-average or consecutive below-average applicants (relative to the labor pool mean). This error stems from the same psychology that leads people to exaggerate how unlikely it is to get consecutive heads or consecutive tails when tossing a fair coin.
How does this bias influence the managers’ beliefs and behavior over time?

In this example and other natural optimal-stopping problems, agents tend to stop when early draws are deemed “good enough,” leading to an asymmetric truncation of experience. When a manager discovers a sufficiently strong candidate early in the hiring cycle, she stops her recruitment efforts and does not observe what additional candidates would have been found for the same job opening with a longer search. This endogenous censoring effect on histories interacts with the gambler’s fallacy bias and leads to pessimistic inference about the labor pool. Managers continue searching only when their early candidates are below-average. They misinterpret subsequent above-average candidates as the expected positive reversal after bad initial outcomes, not as strong signals about the labor pool. On the other hand, they are surprised by subsequent below-average candidates since they understate the likelihood of bad streaks, misreading consecutive bad draws as very strong negative signals about the pool. That is, after bad early draws, managers under-infer from subsequent good draws but over-infer from subsequent bad draws. On average, they communicate an over-pessimistic impression of the labor pool to today’s junior manager. This pessimism informs the junior manager’s stopping strategy and affects the kind of censored history she observes and the new belief she communicates to future managers.

This paper examines the endogenous learning dynamics of a society of agents believing in the gambler’s fallacy. All agents face a common stage game: an optimal-stopping problem with draws in different periods independently generated from fixed yet unknown distributions. They take turns playing the stage game, with each agent’s payoff determined by the game’s outcome. Agents are Bayesians except for the statistical bias.
That is, they start with a prior belief supported on a class of feasible models about the joint distribution of draws. Feasible models are symmetric, log-concave distributions indexed by different unconditional means (the fundamentals). I study the gambler’s fallacy as a misspecified prior: all feasible models specify that better earlier draws tend to lead to worse later draws, and vice versa. The feasible models exclude the true distribution where draws are independent, so agents undertake misspecified Bayesian learning.

I consider two social-learning environments. In the first environment, agents play the stage game one at a time. Before playing her own game, each agent adopts the final belief of her immediate predecessor as her prior belief and formulates a stopping strategy. At the end of her game, she updates her belief about the fundamentals by applying Bayes’ rule to her stage-game history, then passes on her posterior belief to her successor. I show that the stochastic processes of the agents’ beliefs and behavior almost surely converge to a unique steady state in which agents are over-pessimistic about the fundamentals and stop too early relative to the objectively optimal strategy.

In the second environment, agents arrive in large generations with everyone in the same generation playing simultaneously after observing all predecessors’ histories. Society converges to the same steady state as the previous environment. This large-generations model illustrates a positive-feedback cycle between distorted beliefs and distorted stopping strategies. More severely censored datasets lead to more pessimistic beliefs, while more pessimistic beliefs lead to earlier stopping and, as a consequence, heavier history censoring. Mapping back to the recruiting example, suppose a firm appoints HR managers in cohorts. Upon arrival, each junior manager learns the recruiting experience of all previous managers.
If managers in the first cohort start with the correct stopping strategy, then average hiring outcome monotonically deteriorates across all future cohorts. After today’s cohort observes predecessors’ histories and makes an over-pessimistic inference, this belief leads them to act less “choosy” and only keep searching if their early candidates prove to be truly unsatisfactory. On average, early applicants rejected by today’s managers are worse than the

Mueller, Spinnewijn, and Topa (2018) find evidence consistent with people exhibiting the gambler’s fallacy in an optimal-stopping problem. They show job seekers’ beliefs about the probability of finding a job in the near future increase significantly over the course of the unemployment spell, after controlling for individual fixed effects. These beliefs contrast with theories that predict decreasing job-finding rates (e.g., human capital depreciation) and with the authors’ structural estimation that suggests constant rates. One application of my work is studying how a society of such biased job seekers make inferences about job-finding rates from others’ job-search experience.

fictitious variation both depends on severity of history censoring and influences the managers’ stopping strategy. I derive two results that illustrate how this belief in fictitious variation interacts with endogenous learning. First, when the stage-game payoff function is convex in draws (such as when previously rejected candidates can be recalled with some probability in the sequential interviewing game), the positive-feedback cycle of the baseline environment strengthens. More severely censored histories not only make agents more pessimistic about the fundamentals by the usual censoring effect, but also decrease their belief in fictitious variation. Both forces encourage earlier stopping due to the convexity of the optimal-stopping problem, so subsequent agents will face even heavier data censoring.
Second, a society where agents are uncertain about the variances can end up with a different long-run belief about the means than another society where agents know the correct variances. This is despite the fact that agents in both societies would make the same (mis)inference about the means given the same data.

I study a number of extensions in the Online Appendix, showing robustness of the results to a range of alternative specifications. The paper focuses on (misspecified) Bayesian agents, but the over-pessimism result and the positive-feedback loop continue to obtain under a non-Bayesian method-of-moments inference procedure (Online Appendix OA 5). For simplicity I consider a two-period optimal-stopping problem as the stage game, but the combination of the gambler’s fallacy and history truncation after good outcomes still produces over-pessimistic inferences in stage games of arbitrary length (Online Appendix OA 2). I assume all agents have the gambler’s fallacy. The presence of a subpopulation of unbiased agents or agents suffering from additional behavioral biases may mitigate the extent of over-pessimism (Online Appendix OA 8.2), but does not eliminate it.
This work contributes to two strands of literature: the behavioral economics literature on inference mistakes for biased learners, and the theoretical literature on the dynamics of misspecified endogenous learning.

As a contribution to behavioral economics, I highlight a novel channel of misinference for behavioral agents: the interaction between psychological bias and data censoring. In many natural environments, agents learn from censored data. The economics literature has recently focused on the learning implications of selection neglect in these settings, where agents act as if their dataset is not censored. This work points out that other well-documented behavioral biases can also interact with data censoring to produce new implications. Mislearning stems precisely from this interaction, not from either censored data or the gambler’s fallacy alone. Agents who do not suffer from the statistical bias learn the fundamentals correctly even from censored histories. On the other hand, if we removed censoring by having agents observe ex-post what would have been drawn in each period of the optimal-stopping problem, then even biased agents would learn the fundamentals correctly. The intuition is that the gambler’s fallacy is a “symmetric” bias. The “asymmetric” outcome of over-pessimism only occurs when the bias interacts with an (endogenous) asymmetric censoring mechanism that tends to produce data containing negative streaks but not positive streaks. Environments that feature different censoring patterns (e.g., strategies that produce positive streaks) or other behavioral biases would produce different predictions, but again through the same basic mechanism: interaction between censoring and bias.

As a theoretical contribution, I prove convergence of beliefs and behavior in a non-self-

See, for example, Enke (2019) and Jehiel (2018).
Another difference is that I establish my convergence result in a setting with multiple dimensions of uncertainty (the distributional parameters for different periods of the stage game), whereas Heidhues, Koszegi, and Strack (2018) consider convergence of misspecified learning with one-dimensional uncertainty. Fudenberg, Romanyuk, and Strack (2017) study a continuous-time model of active learning under misspecification, but their learning problem only involves two feasible models. In this work, agents’ prior belief about each distributional parameter is supported on a continuum of possible values.

As another contribution to the theoretical literature on misspecified learning dynamics, this project studies a new source of endogeneity: the censoring effect in a dynamic stage game. The dynamic stage game is both essential for studying learning under the gambler’s fallacy (a behavioral bias concerning the serial correlation of data) and crucial for the censoring effect. In my setting, the type of data that an agent generates depends on her beliefs. To understand the distinction from the existing literature, consider the classic paper in this area, Nyarko (1991), who studies a monopolist setting a price on each date and observing the resulting sales. No matter what action the monopolist takes, she observes the same type of data: quantity sold. Similarly, the agent in Fudenberg, Romanyuk, and Strack (2017) always observes payoffs and the agent in Heidhues, Koszegi, and Strack (2018) always observes output levels, after any action. Endogenous learning in these other papers takes the form of agents attributing different meanings to the same data, when interpreted through the lenses of different actions. On the other hand, we may think of stage-game histories censored with different thresholds as different types of data that, by themselves, lead to different beliefs about the fundamentals for biased learners. Actions play no role in inference except to generate these different types of data, as the likelihood of a (feasible) history does not depend on the censoring threshold that produced it.
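The point that the threshold shapes the data but never enters the likelihood can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the paper's procedure: draws are Gaussian with a common, known standard deviation (as in the Gaussian example of Section 2), the agent knows the first-period mean and estimates only the second-period mean by maximum likelihood, and the function name `simulate_mle_mu2` and all parameter values are hypothetical.

```python
import random
from statistics import fmean


def simulate_mle_mu2(cutoff, gamma, mu1=0.0, mu2=0.0, sigma=1.0,
                     n=100_000, seed=0):
    """Biased agent's maximum-likelihood estimate of the second-period mean.

    Assumptions (illustrative): draws are objectively independent
    N(mu1, sigma^2) and N(mu2, sigma^2), and the agent knows mu1.
    The stage game stops when x1 > cutoff, so x2 is observed only
    when x1 <= cutoff.

    Under the agent's model, (X2 | X1 = x1) ~ N(m - gamma*(x1 - mu1), sigma^2),
    so the Gaussian MLE of m is the average of x2 + gamma*(x1 - mu1) over
    the uncensored histories. The cutoff does not appear in that formula;
    it only determines which histories are uncensored.
    """
    rng = random.Random(seed)
    adjusted = []
    for _ in range(n):
        x1 = rng.gauss(mu1, sigma)
        if x1 > cutoff:
            continue  # censored history (x1, no second draw): no information on m
        x2 = rng.gauss(mu2, sigma)
        adjusted.append(x2 + gamma * (x1 - mu1))
    return fmean(adjusted)


if __name__ == "__main__":
    # Lower cutoffs mean heavier censoring of the second draw.
    for c in (10.0, 1.0, 0.0, -1.0):
        print(f"cutoff={c:5.1f}  estimate={simulate_mle_mu2(c, gamma=0.5):.3f}")
```

With the true second-period mean set to 0, the estimate should be close to 0 when the cutoff is so high that almost nothing is censored, and should become more negative as the cutoff falls. Setting `gamma=0.0` removes the bias and the estimate should stay near 0 for any cutoff.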
Rabin (2002) and Rabin and Vayanos (2010) are the first to study the inferential mistakes

Their follow-up work Heidhues, Koszegi, and Strack (2019) also focuses on mislearning with one-dimensional uncertainty in a self-confirming setting.

In Online Appendix OA 6, I modify Rabin’s example to induce the censoring effect. His finite-urn model then delivers a misinference result analogous to the results in this paper, which are derived in a different setting with continuously-valued draws. This exercise shows the robustness of my results within different modeling frameworks of the same statistical bias.

Steady state in this work corresponds to Esponda and Pouzo (2016)’s Berk-Nash equilibrium. Rather than focusing only on equilibrium analysis, however, I study non-equilibrium learning dynamics and prove global convergence of behavior. This paper also contains more specific results: I emphasize the interaction between censoring and bias as the driver of mislearning, discuss how changing the stage game affects long-run beliefs, and relate my results to previous findings on inference under the gambler’s fallacy (e.g., fictitious variation in an endogenous-data setting).

Although my learning framework involves a sequence of short-lived agents, the social-learning aspect of the framework is not central to the results. In fact, the environment where a sequence of short-lived agents acts one at a time is equivalent to an environment where a single long-lived agent plays the stage game repeatedly, myopically maximizing her expected payoff in each iteration of the stage game. In the growing literature on social learning with misspecified Bayesians (e.g., Eyster and Rabin (2010); Guarino and Jehiel (2013); Bohren (2016); Bohren and Hauser (2018); Frick, Iijima, and Ishii (2019)), agents observe their predecessors’ actions but make errors when inverting these actions to deduce said predecessors’ information. This kind of action inversion does not take place here: later agents inherit all the information that their predecessors have seen, either by adopting their beliefs or by observing their histories, so predecessors’ actions are uninformative.

The econometrics literature has also studied data-generating processes with censoring.

In Rabin (2002)’s example, biased agents (correctly) believe that the part of the data which is always observable is independent of the part of the data which is sometimes missing. However, what I term the “censoring effect” is about misinference resulting from agents wrongly believing in negative correlation between the early draws that are always observed and the later draws that may be censored, depending on the realizations of the early draws.

Esponda, Pouzo, and Yamamoto (2019)’s work-in-progress considers misspecified learning environments with finite action sets and studies the convergence of empirical action frequencies. Their techniques and notion of convergence do not seem to apply to a setting with a continuum of actions.

This literature has primarily focused on the issue of model identification from censored data (Cox, 1962; Tsiatis, 1975; Heckman and Honoré, 1989). In my setting, there is no identification problem for correctly specified agents. Instead, I study how agents make wrong parameter estimates from censored data when they infer using a family of misspecified models. Another contrast is that the econometrics literature has focused on exogenous data-censoring mechanisms, but censoring is endogenous in this paper and depends on the beliefs of previous agents. This endogeneity is central to the results, as discussed before.
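The role of endogenous censoring can be sketched as a deterministic iteration between beliefs and cutoffs. The sketch below is an illustration under strong assumptions that are not taken from the paper: draws are Gaussian, agents know the first-period mean, the stage game is a search problem without recall (stopping pays $x_1$, continuing pays $x_2$), and each generation is large enough that its belief about the second-period mean equals the large-sample limit $\mu_2^\bullet + \gamma\,(\mathbb{E}[X_1 \mid X_1 \le c] - \mu_1^\bullet)$ of the biased maximum-likelihood estimate. Under these assumptions an agent with belief $\mu_2$ stops exactly when $x_1 > \mu_2 - \gamma(x_1 - \mu_1)$, which gives the cutoff $(\mu_2 + \gamma\mu_1)/(1+\gamma)$. All function names are hypothetical.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(z):
    """Standard normal density."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def large_sample_belief(cutoff, gamma, mu1, mu2, sigma):
    """Limit belief about the second-period mean when x2 is seen only if x1 <= cutoff.

    Uses the Gaussian truncated mean E[X1 | X1 <= cutoff]
    = mu1 - sigma * pdf(z) / cdf(z), with z = (cutoff - mu1) / sigma.
    """
    z = (cutoff - mu1) / sigma
    truncated_mean = mu1 - sigma * normal_pdf(z) / normal_cdf(z)
    return mu2 + gamma * (truncated_mean - mu1)


def subjective_cutoff(belief_mu2, gamma, mu1):
    """Cutoff in the no-recall search problem: stop iff
    x1 > belief_mu2 - gamma * (x1 - mu1)."""
    return (belief_mu2 + gamma * mu1) / (1.0 + gamma)


def feedback_path(gamma, mu1=0.0, mu2=0.0, sigma=1.0, generations=30):
    """Iterate belief -> cutoff -> belief, starting from the cutoff that is
    objectively optimal when draws are independent (stop iff x1 > mu2)."""
    cutoff = mu2
    path = []
    for _ in range(generations):
        belief = large_sample_belief(cutoff, gamma, mu1, mu2, sigma)
        cutoff = subjective_cutoff(belief, gamma, mu1)
        path.append((belief, cutoff))
    return path


if __name__ == "__main__":
    for t, (belief, cutoff) in enumerate(feedback_path(gamma=0.5)[:6], start=1):
        print(f"generation {t}: belief={belief:.4f}, cutoff={cutoff:.4f}")
```

In this sketch, beliefs and cutoffs should fall from one generation to the next and settle at a limit below the true mean: a lower cutoff censors more histories, which lowers the next belief, which lowers the next cutoff. With `gamma=0.0` the iteration stays at the true mean.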
Bar-Hillel and Wagenaar (1991) review classical psychology studies on the gambler’s fallacy. The earliest lab evidence involves two types of tasks. In “production tasks,” subjects are asked to write down sequences using a given alphabet, with the goal of generating sequences that resemble the realizations of an i.i.d. random process. Subjects tend to produce sequences with too many alternations between symbols, as they attempt to locally balance out symbol frequencies. In “judgment tasks” where people are asked to identify which sequence of binary symbols appears most like consecutive tosses of a fair coin, subjects routinely judge sequences with an alternation rate of 60% as “more random” than those with an alternation rate of 50%. While most of these studies are unincentivized, Benjamin, Moore, and Rabin (2017) have found the gambler’s fallacy with strict monetary incentives, where a bet on a fair coin continuing its streak pays strictly more than the bet on the streak reversing. Barron and Leider (2010) have shown that experiencing a streak of binary outcomes one at a time exacerbates the gambler’s fallacy, compared with simply being told the past sequence of outcomes all at once.

Other studies have identified the gambler’s fallacy using field data on lotteries and casino games. Unlike in experiments, agents in field settings are typically not explicitly told the underlying probabilities of the randomization devices. In state lotteries, players tend to avoid betting on numbers that have very recently won. This under-betting behavior is strictly costly for the players when lotteries have a pari-mutuel payout structure (as in the studies of Terrell (1994) and Suetens, Galbo-Jørgensen, and Tyran (2016)), since it leads to a larger-than-average payout per winner in the event that the same number is drawn again the following week. Using security video footage, Croson and Sundali (2005) show that roulette gamblers in casinos bet more on a color after a long streak of the opposite color. Narayanan and Manchanda (2012) use individual-level data tracked using casino loyalty cards to find that a larger recent win has a negative effect on the next bet that the gambler places, while a larger recent loss increases the size of the next bet. Finally, using field data from asylum judges, loan officers, and baseball umpires, Chen, Moskowitz, and Shue (2016) show that

References can be found in Amemiya (1985) and Crowder (2001).

conditional on the underlying fundamentals and mislearn some parameters of the world as a result. But the misinference mechanism in this paper is further complicated by the presence of endogenous data censoring.
This section presents the basic elements of the model, previews the main results, and provides intuition for how the censoring effect drives the conclusions. I describe the (single-player) stage game, an optimal-stopping problem satisfying some conditions. Agents are uncertain about the distribution of draws in the stage game. They entertain a prior belief over a family of feasible models of how draws are generated. All feasible models specify the same negative correlation between draws, though they are objectively independent, an error that reflects the gambler’s fallacy. Sections 3 and 4 embed these model elements into social-learning environments and derive learning dynamics. Section 5 contains a number of extensions that verify robustness of the main results.

2.1 Basic Elements of the Model
The stage game is a two-period optimal-stopping problem. In the first period, the agent draws $x_1 \in \mathbb{R}$ and decides whether to stop. If she stops, her payoff is $u_1(x_1)$ and the stage game ends. Otherwise, she continues to the second period and draws $x_2 \in \mathbb{R}$. The stage game then ends with the payoff $u_2(x_1, x_2)$.

The payoff functions $u_1 : \mathbb{R} \to \mathbb{R}$ and $u_2 : \mathbb{R}^2 \to \mathbb{R}$ satisfy some regularity conditions to be introduced in Assumption 1. The following example satisfies Assumption 1 and will be used to illustrate my results throughout this paper.

Example 1 (search with $q$ probability of recall). Many industries have an annual hiring cycle. Consider a firm in such an industry and an HR manager who must fill a job opening during this year’s cycle. In the early phase of the hiring cycle, she finds a candidate who would bring net benefit $x_1$ to the organization if hired. She must decide between hiring this candidate immediately or waiting. Waiting means she continues searching in the late phase of the cycle, finding another candidate with benefit $x_2$. Waiting carries the risk that the early candidate accepts an offer from a different firm in the interim, which happens with probability $0 < 1 - q \le 1$. This situation has the payoff functions $u_1(x_1) = x_1$ and $u_2(x_1, x_2) = q \cdot \max(x_1, x_2) + (1 - q)\, x_2$. In the late phase, there is $q$ probability the manager gets payoff equal to the higher of the two candidates’ qualities, and complementary probability that only the second candidate is available.

The following regularity conditions define the class of optimal-stopping problems I study.

Assumption 1 (regularity conditions). The payoff functions satisfy:
(a) For $x_1' > x_1''$ and $x_2' > x_2''$, $u_1(x_1') > u_1(x_1'')$ and $u_2(x_1', x_2') > u_2(x_1', x_2'')$.
(b) For $x_1' > x_1''$ and any $\bar{x}_2$, $u_1(x_1') - u_1(x_1'') > |u_2(x_1', \bar{x}_2) - u_2(x_1'', \bar{x}_2)|$.
(c) There exist $x_1^g, x_2^b, x_1^b, x_2^g \in \mathbb{R}$ so that $u_1(x_1^g) > u_2(x_1^g, x_2^b)$ and $u_1(x_1^b) < u_2(x_1^b, x_2^g)$.
(d) $u_1, u_2$ are continuous and $x_2 \mapsto u_2(\bar{x}_1, x_2 + \bar{k})$ is absolutely integrable with respect to the objective distribution of $X_2$ for all $\bar{x}_1, \bar{k} \in \mathbb{R}$.

Assumption 1(a) says $u_1, u_2$ are strictly increasing in the draws of their respective periods. Assumption 1(b) says a higher realization of the early draw increases first-period payoff more than it changes second-period payoff. Under Assumption 1(a), Assumption 1(b) is satisfied whenever $u_2$ is not a function of $x_1$, as in optimal-stopping problems where stopping in period $k$ gives payoff only depending on the $k$-th draw. Assumption 1(c) says there exist situations where the agent wants to stop and other situations where the agent wants to continue. The technical Assumption 1(d) ensures continuation payoffs are well-defined. These conditions are satisfied by my recurring example.

Claim. Example 1 satisfies Assumption 1 whenever the objective distribution of $X_2$ has a finite first moment.

Proofs of results in Sections 2 to 4 can be found in Appendix A.

I now define strategies and histories of the stage game.

Definition 1.
A strategy is a function $S : \mathbb{R} \to \{\text{Stop}, \text{Continue}\}$ that maps the realization of the first-period draw $X_1 = x_1$ into a stopping decision.

Without loss of generality I only consider pure strategies, because there always exists a payoff-maximizing pure strategy under any belief about the distribution of draws.

Definition 2. The history of the stage game is an element $h \in \mathbb{H} := \mathbb{R} \times (\mathbb{R} \cup \{\varnothing\})$. If an agent decides to stop after $X_1 = x_1$, her history is $(x_1, \varnothing)$. If the agent continues after $X_1 = x_1$ and draws $X_2 = x_2$ in the second period, her history is $(x_1, x_2)$.

The symbol $\varnothing$ is a censoring indicator, emphasizing that the hypothetical second-period draw is unobserved when the agent does not continue into the second period. In Example 1, if the HR manager hires the first candidate, she stops her recruitment efforts early and the counterfactual second candidate that she would have found had she kept the position open remains unknown.

I work with a general class of distributions for the main results. Both the true data-generating process and the agents’ domain of learning can be described in terms of a pair of densities on $\mathbb{R}$ satisfying the following:

Assumption 2 (log-concavity and symmetry). $f_1(\cdot \mid 0)$ and $f_2(\cdot \mid 0)$ are strictly positive densities on $\mathbb{R}$ with finite second moments, and they are strictly log-concave, symmetric, and mean-zero.

A leading example of strictly log-concave and symmetric distributions is the Gaussian distribution. Another example is the logistic distribution. The mean-zero condition is only a normalization, since we can shift any log-concave distribution symmetric around its mean to be centered around 0.

For $\tau_1, \tau_2 \in \mathbb{R}$, let $f_1(\cdot \mid \tau_1)$ and $f_2(\cdot \mid \tau_2)$ represent shifted versions of $f_1(\cdot \mid 0)$ and $f_2(\cdot \mid 0)$ with means $\tau_1$ and $\tau_2$, respectively. More precisely, $f_1(x_1 \mid \tau_1) := f_1(x_1 - \tau_1 \mid 0)$ and $f_2(x_2 \mid \tau_2) := f_2(x_2 - \tau_2 \mid 0)$ for $x_1, x_2 \in \mathbb{R}$.

Objectively, draws $X_1, X_2$ in the stage game are independently distributed with $X_1 \sim f_1(\cdot \mid \mu_1^\bullet)$ and $X_2 \sim f_2(\cdot \mid \mu_2^\bullet)$. The parameters $\mu_1^\bullet, \mu_2^\bullet \in \mathbb{R}$ are the true fundamentals. In Example 1, $\mu_1^\bullet$ and $\mu_2^\bullet$ stand for the true qualities of the two applicant pools in the early and late phases of the hiring season.

Agents are uncertain about the distribution of $(X_1, X_2)$. The next definition describes the set of distributions that a gambler’s fallacy agent deems plausible.

Definition 3. The set of feasible models $\{\Psi(\mu_1, \mu_2; \gamma) : (\mu_1, \mu_2) \in \mathcal{M}\}$ is a family of joint distributions of $(X_1, X_2)$ indexed by feasible fundamentals $(\mu_1, \mu_2) \in \mathcal{M} \subseteq \mathbb{R}^2$, for some bias parameter $\gamma > 0$. Here $\Psi(\mu_1, \mu_2; \gamma)$ refers to the joint distribution

$X_1 \sim f_1(\cdot \mid \mu_1)$
$(X_2 \mid X_1 = x_1) \sim f_2(\cdot \mid \mu_2 - \gamma (x_1 - \mu_1))$,

where $X_2 \mid (X_1 = x_1)$ is the conditional distribution of $X_2$ given $X_1 = x_1$.

I write $\mathbb{E}_\Psi$ and $\mathbb{P}_\Psi$ throughout for expectation and probability with respect to model $\Psi$. When $\mathbb{E}$ and $\mathbb{P}$ are used without subscripts, they refer to expectation and probability under the true model, $\Psi^\bullet = \Psi(\mu_1^\bullet, \mu_2^\bullet; 0)$.

I model the gambler’s fallacy as an additive shift in the agent’s belief about $X_2$’s distribution following different $X_1$ realizations, so that $(X_2 \mid X_1 = x_1)$ increases in first-order stochastic dominance order as $x_1$ decreases. Conditional on the fundamentals, if the realization of $X_1$ is higher than expected, then the agent believes bad luck is due in the near future and the second draw is likely below average. Conversely, an exceptionally bad early draw likely portends above-average luck in the next period. This interpretation is clearer in the following equivalent formulation of $\Psi(\mu_1, \mu_2; \gamma)$:

$X_1 = \mu_1 + \epsilon_1$
$X_2 = \mu_2 + \epsilon_2$

where $\epsilon_1 \sim f_1(\cdot \mid 0)$ and $(\epsilon_2 \mid \epsilon_1) \sim f_2(\cdot \mid -\gamma \epsilon_1)$. The mean-zero terms $\epsilon_1, \epsilon_2$ represent the idiosyncratic factors, or “luck,” that determine how $X_1$ and $X_2$’s realizations deviate from their unconditional means $\mu_1$ and $\mu_2$. The negative correlation between $\epsilon_1$ and $\epsilon_2$ conditional on $\mu_1, \mu_2$ represents a belief in reversal of luck. A larger $\gamma > 0$ corresponds to a stronger belief in reversals.

Example 2 (the Gaussian case). Objectively, $X_1 \sim \mathcal{N}(\mu_1^\bullet, \sigma^2)$ and $X_2 \sim \mathcal{N}(\mu_2^\bullet, \sigma^2)$ are independent Gaussian random variables each with variance $\sigma^2 > 0$. But the agent believes $X_1, X_2$ are a pair of correlated Gaussian random variables with $X_1 \sim \mathcal{N}(\mu_1, \sigma^2)$ and $(X_2 \mid X_1 = x_1) \sim \mathcal{N}(\mu_2 - \gamma (x_1 - \mu_1), \sigma^2)$ for some $(\mu_1, \mu_2) \in \mathcal{M}$.

The set of feasible models is indexed by the set of feasible fundamentals, $\mathcal{M}$. We may think of the agents as learning about the unconditional means of $X_1$ and $X_2$, with $\mathcal{M}$ as the domain of their inference.

I study gambler’s fallacy for continuous random variables, where the magnitude of $X_1$ affects the agent’s prediction about $X_2$. Chen, Moskowitz, and Shue (2016)’s analysis of baseball umpire data provides support for the continuous version of the statistical bias. They find that an umpire is more likely to call the current pitch a ball after having called the previous pitch a strike, controlling for the actual location of the pitch. Crucially, the effect size is larger after more obvious strikes, where “obviousness” is based on the distance of the pitch to the center of the regulated strike zone. This distance can be thought of as a continuous measure of the “quality” of each pitch.

Remark. I will consider several specifications of $\mathcal{M}$ throughout this paper.
(a) $\mathcal{M} = \mathbb{R}^2$. The agent thinks all values $(\mu_1, \mu_2) \in \mathbb{R}^2$ are possible.
(b) $\mathcal{M} = \diamondsuit$, where $\diamondsuit$ is a bounded parallelogram in $\mathbb{R}^2$ whose left and right edges are parallel to the $y$-axis, whose top and bottom edges have slope $-\gamma$. The agent is uncertain about both $\mu_1$ and $\mu_2$, but her uncertainty has bounded support.
(c) $\mathcal{M} = \{\mu_1^\bullet\} \times [\underline{\mu}_2, \bar{\mu}_2]$. The agent has a correct, dogmatic belief about $\mu_1$, but has uncertainty about $\mu_2$ supported on a bounded interval.
(d) $\mathcal{M} = \{(\mu, \mu) : \mu \in \mathbb{R}\}$.
The agent is convinced that the first-period and second-periodfundamentals are the same, but is uncertain what this common parameter is.While the agent can freely update her belief about the fundamentals on M , she holdsa dogmatic belief about γ > This implies the set of feasible models excludes the truemodel, Ψ • = Ψ( µ • , µ • ; 0), so Bayesian updating within the class of feasible models amountsto misspecified learning. I use misspecification as a tool to represent and study the gambler’sfallacy. This approach is motivated by field evidence on the bias’ persistence: for example,Chen, Moskowitz, and Shue (2016) show that even very experienced decision-makers exhibita non-negligible amount of the gambler’s fallacy in high-stakes settings.In the social-learning environment I study in Section 3, short-lived agents each observesone iteration of the stage game, so no one has a large enough dataset to identify the misspec-ification problem. In Online Appendix OA 7, I discuss why even agents with large datasetsmay never question their feasible models: the misspecification is “attentionally stable” inthe sense of Gagnon-Bartsch, Rabin, and Schwartzstein (2018).Before stating my main results, I first establish a proposition about the optimal stage-game strategy. This will motivate a slight strengthening of Assumption 1 that I need forsome results. For c ∈ R , write S c for the cutoff strategy such that S c ( x ) = Stop if and onlyif x > c . That is, S c accepts all early draws above a cutoff threshold c . Proposition 1.
Under Assumption 1 and for $\gamma > 0$,
• Under each feasible model $\Psi(\mu_1, \mu_2; \gamma)$, there exists a cutoff threshold $C(\mu_1, \mu_2; \gamma) \in \mathbb{R}$ such that it is strictly optimal to continue whenever $x_1 < C(\mu_1, \mu_2; \gamma)$ and strictly optimal to stop whenever $x_1 > C(\mu_1, \mu_2; \gamma)$.
• For every $\mu_1 \in \mathbb{R}$, $\mu_2 \mapsto C(\mu_1, \mu_2; \gamma)$ is strictly increasing.
• For every $\mu_1 \in \mathbb{R}$, $\mu_2 \mapsto C(\mu_1, \mu_2; \gamma)$ is Lipschitz continuous with Lipschitz constant $1/\gamma$.

Footnote: Any prior belief over fundamentals $(\mu_1, \mu_2)$ supported on a bounded set in $\mathbb{R}^2$ can be arbitrarily well-approximated by a prior belief over a large enough ♦.

Footnote: Section 5.3 studies the extension where agents are uncertain about $\gamma$, but the support of their prior belief about $\gamma$ lies to the left of 0 and is bounded away from it.

The content of this proposition is threefold.

First, it shows that the best strategy for the class of optimal-stopping problems I study takes a cutoff form. This is because a higher $x_1$ both increases the payoff to stopping and, under the gambler’s fallacy, predicts worse draws in the next period. Both forces push in the direction of stopping. The optimality of cutoff strategies leads to an endogenous, asymmetric censoring of histories, formalizing the idea that agents stop after “good enough” draws.

Second, holding fixed $\mu_1$, the cutoff threshold increases with $\mu_2$. This is because the agent can afford to be choosier in the first period when prospects in the second period improve.

The third statement about Lipschitz continuity, on the other hand, gives a bound on how quickly $\mu_2 \mapsto C(\mu_1, \mu_2; \gamma)$ increases. Suppose that one agent believes draws are generated according to $\Psi(\mu_1, \mu_2; \gamma)$, while another agent believes they are generated according to $\Psi(\mu_1, \mu_2 + 1; \gamma)$. If the first agent is indifferent between stopping and continuing after $X_1 = c$, then the second agent prefers stopping after $X_1 = c + 1/\gamma$.
This is because the predicted conditional mean of $X_2$ falls by $(1/\gamma) \cdot \gamma = 1$ when $X_1$ increases by $1/\gamma$ under any feasible model, which cancels out the relative optimism of the second agent about the unconditional distribution of $X_2$.

The Lipschitz constant $1/\gamma$ is guaranteed for every optimal-stopping problem satisfying Assumption 1 and every $\gamma > 0$. But $1/\gamma$ may not be the best Lipschitz constant. My results use the slightly stronger condition that $\mu_2 \mapsto C(\mu_1^\bullet, \mu_2; \gamma)$ has a Lipschitz constant strictly smaller than $1/\gamma$. Instead of making an assumption on $C$ directly, I strengthen Assumption 1(b) on the stage-game primitives to imply the desired infinitesimally stronger Lipschitz continuity.

Assumption 3 ($\ell$-Lipschitz continuity). Either:
(a) There exists $0 < \ell < 1/\gamma$ so that for every $x_1, x_2 \in \mathbb{R}$ and $d > 0$,
\[
u_1(x_1 + \ell d) - u_1(x_1) \ge u_2(x_1 + \ell d,\, x_2 + (1 - \gamma\ell) d) - u_2(x_1, x_2).
\]
Or:
(b) $u_2$ is Lipschitz continuous and only a function of $x_2$, and furthermore there exists $\epsilon > 0$ so that $u_1'(x_1) > \epsilon$ for all $x_1 \in \mathbb{R}$.

Assumption 3(a) is satisfied by my recurring example.
Claim. Example 1 satisfies Assumption 3(a) with $\ell = \frac{1}{1+\gamma}$ for every probability of recall $0 \le q < 1$ and $\gamma > 0$.

I now state my two main results, which concern learning dynamics under the gambler’s fallacy in two different social-learning environments. Precise details of these environments will follow in Sections 3 and 4, respectively.

In the first environment, short-lived agents arrive one per round, $t = 1, 2, 3, \ldots$. The agent in round $t = 1$ starts with a full-support prior density $m_0 : ♦ \to \mathbb{R}_{>0}$, where ♦ is a bounded parallelogram in $\mathbb{R}^2$ as in Remark 1(b). In round $t$, agent $t$ adopts the final belief $\tilde{m}_{t-1}$ of her immediate predecessor as her prior belief, then chooses a cutoff threshold $\tilde{C}_t$ to maximize her expected payoff based on this belief. She observes what happens in her stage game and uses Bayes’ rule to update her belief from $\tilde{m}_{t-1}$ to $\tilde{m}_t$, which then becomes the prior belief of agent $t + 1$.

In this environment, the sequences of cutoffs $(\tilde{C}_t)$ and posterior belief densities $(\tilde{m}_t)$ are stochastic processes whose randomness derives from the randomness of draws. Draws are objectively independent, both between the two periods in the same round of the stage game and across different rounds. Write $(\tilde{\mu}_{1,t}, \tilde{\mu}_{2,t})$ for the random element in ♦ given by the density $\tilde{m}_t$.

Theorem 1.
Suppose Assumptions 1, 2, and 3 hold, and the second derivative of ln ( f ( x | µ • )) is uniformly bounded for x ∈ R . There exists a unique steady state µ ∞ , c ∞ ∈ R not depen-dent on m , so that provided ( µ • , µ ∞ ) ∈ ♦ , almost surely lim t →∞ ˜ C t = c ∞ and (˜ µ ,t , ˜ µ ,t ) t ≥ converges in L to ( µ • , µ ∞ ) . The steady state satisfies µ ∞ < µ • and c ∞ < c • , where c • is theobjectively optimal cutoff threshold. In other words, almost surely behavior and belief converge in the society, and this steadystate is independent of the prior over fundamentals (provided its support is large enough).In the steady state, agents hold overly pessimistic beliefs about the fundamentals and stoptoo early, relative to the objectively optimal strategy. (The additional regularity assumptionthat the second derivative of ln ( f ( x | µ • )) is uniformly bounded is satisfied by the Gaussianand logistic distributions.)In the second environment, short-lived agents arrive in generations, t = 0 , , , ..., witha continuum of agents per generation. Agents’ prior belief about the fundamentals is givenby a full-support density m on R , as in Remark 1(a). Each agent observes the stage-gamehistories of all predecessors from all past generations to make inferences about the funda-mentals. Due to the large generations, cutoffs and beliefs are deterministic in generations t ≥ , which I denote as c [ t ] and µ [ t ] = ( µ , [ t ] , µ , [ t ] ) respectively. The society is initialized atan arbitrary cutoff strategy S c [0] in the 0th generation, the initial condition . Theorem 2.
Suppose Assumptions 1 and 2 hold. Starting from any initial condition and any m , cutoffs ( c [ t ] ) t ≥ and beliefs ( µ , [ t ] ) t ≥ form monotonic sequences across generations. WhenAssumption 3 also holds, there exists a unique steady state µ ∞ , c ∞ ∈ R so that c [ t ] → c ∞ and ( µ , [ t ] , µ , [ t ] ) → ( µ • , µ ∞ ) monotonically, regardless of the initial condition and m . Thissteady state is the same as the one in Theorem 1. I focus on learning across different iterations of the stage game and assume agents do not update beliefswithin the stage game. t is more pessimisticthan generation t − µ , [ t ] < µ , [ t − . The monotonicityresult implies that beliefs move in the same direction again in generation t + 1, that is µ , [ t +1] < µ , [ t ] . The information of generation t + 1 differs from that of generation t only inthat agents in generation t + 1 observe all stage-game histories of generation t. This meansgeneration t ’s stopping behavior differs from that of generation t − t − t. In the learning environments of this paper, each agent censors her stage game history throughher stopping strategy, where the strategy choice depends on her beliefs. To build intuitionfor how this censoring effect relates to the two main theorems, I first consider a biased agentwith feasible fundamentals M = R , facing a large sample of histories all censored accordingto some cutoff threshold c . I characterize her inference about fundamentals when the samplesize grows and analyze how her inference depends on the cutoff threshold c .Suppose c ∈ R ∪ {∞} and Ψ is a model. Then H (Ψ; c ) refers to the distribution ofhistories when draws are generated by Ψ and histories censored according to S c . Definition 4.
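For $c \in \mathbb{R} \cup \{\infty\}$ and $\Psi$ a model, $H(\Psi; c) \in \Delta(\mathbb{H})$ is the distribution of histories given by
\[
H(\Psi; c)[E_1 \times E_2] := P_\Psi[(E_1 \cap (-\infty, c]) \times E_2] \quad \text{for } E_1, E_2 \in \mathcal{B}(\mathbb{R}),
\]
\[
H(\Psi; c)[E_1 \times \{\varnothing\}] := P_\Psi[(E_1 \cap (c, \infty)) \times \mathbb{R}] \quad \text{for } E_1 \in \mathcal{B}(\mathbb{R}),
\]
where $\mathcal{B}(\mathbb{R})$ is the collection of Borel subsets of $\mathbb{R}$. That is, the second draw is recorded only when the first draw is at or below $c$, since $S_c$ stops after any $x_1 > c$.

I abbreviate $H(\Psi^\bullet; c)$ as simply $H^\bullet(c)$, the true distribution of histories under the cutoff threshold $c$. The next definition gives a measure of the difference between the distribution of histories under the feasible model with fundamentals $(\mu_1, \mu_2)$ and the true distribution of histories, both with the same censoring threshold $c$.

Definition 5.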
For $\mu_1, \mu_2 \in \mathbb{R}$, $c \in \mathbb{R} \cup \{\infty\}$, the Kullback-Leibler (KL) divergence from $H^\bullet(c)$ to $H(\Psi(\mu_1, \mu_2; \gamma); c)$, denoted by $D_{KL}(H^\bullet(c) \,\|\, H(\Psi(\mu_1, \mu_2; \gamma); c))$, is
\[
\int_c^\infty f(x_1 \mid \mu_1^\bullet) \ln\left( \frac{f(x_1 \mid \mu_1^\bullet)}{f(x_1 \mid \mu_1)} \right) dx_1
+ \int_{-\infty}^c \left\{ \int_{-\infty}^\infty f(x_1 \mid \mu_1^\bullet) f(x_2 \mid \mu_2^\bullet) \ln\left[ \frac{f(x_1 \mid \mu_1^\bullet) f(x_2 \mid \mu_2^\bullet)}{f(x_1 \mid \mu_1) f(x_2 \mid \mu_2 - \gamma(x_1 - \mu_1))} \right] dx_2 \right\} dx_1 .
\]
For a given $c$, the minimizers
\[
(\mu_1^*, \mu_2^*) \in \arg\min_{\mu_1, \mu_2 \in \mathbb{R}} D_{KL}(H^\bullet(c) \,\|\, H(\Psi(\mu_1, \mu_2; \gamma); c))
\]
are called the pseudo-true fundamentals with respect to $c$.

To interpret, the likelihood of the history $h = (x_1, x_2)$ with $x_1 \le c$ is $f(x_1 \mid \mu_1^\bullet) \cdot f(x_2 \mid \mu_2^\bullet)$ under the true model $\Psi^\bullet$, and $f(x_1 \mid \mu_1) \cdot f(x_2 \mid \mu_2 - \gamma(x_1 - \mu_1))$ under the feasible model $\Psi(\mu_1, \mu_2; \gamma)$. The likelihood of the history $h = (x_1, \varnothing)$ with $x_1 > c$ is $f(x_1 \mid \mu_1^\bullet)$ under the true model, and $f(x_1 \mid \mu_1)$ under the feasible model. The likelihoods of all other histories are 0 under both models. So the KL divergence expression in Definition 5 is the expected log-likelihood ratio of the history under the true model versus under the feasible model with fundamentals $(\mu_1, \mu_2)$, where expectation over histories is taken under the true model. In general, this optimization objective depends on the cutoff threshold $c$ and I denote the pseudo-true fundamentals as $\mu_1^*(c), \mu_2^*(c)$ to emphasize this dependence. The pseudo-true fundamentals correspond to the biased agent’s inference about the fundamentals in large samples.

The next proposition characterizes the pseudo-true fundamentals and contains the key intuition behind the two main theorems.

Proposition 2.
Under Assumption 2, for any $c \in \mathbb{R} \cup \{\infty\}$, the KL divergence minimization problem in Definition 5 admits a unique solution $(\mu_1^*(c), \mu_2^*(c)) \in \mathbb{R}^2$. Furthermore:
• $\mu_1^*(c) = \mu_1^\bullet$ for any $c \in \mathbb{R} \cup \{\infty\}$
• $\mu_2^*(c) < \mu_2^\bullet$ for any $c \in \mathbb{R}$ and $\mu_2^*(\infty) = \mu_2^\bullet$
• $\mu_2^*(c)$ is strictly increasing in $c$.

In the Gaussian case, the pseudo-true fundamental $\mu_2^*(c)$ admits a closed-form expression that readily verifies Proposition 2.

Example 2 (continued). In the Gaussian case, for $c \in \mathbb{R} \cup \{\infty\}$,
\[
\mu_2^*(c) = \mu_2^\bullet - \gamma\left( \mu_1^\bullet - E[X_1 \mid X_1 \le c] \right).
\]

The censoring effect is crucial for misinference: as Proposition 2 shows, the pseudo-true fundamentals are unbiased in the absence of censoring (i.e., when $c = \infty$). Here is why the directional data censoring I study leads to over-pessimism. In every feasible model of draws $\Psi(\mu_1, \mu_2; \gamma)$, the realization of $X_2$ depends on two factors: the second-period fundamental $\mu_2$, and a reversal effect based on the realization of $X_1$. The biased agent cannot end up with a correct or over-optimistic belief about $\mu_2$, else she would be systematically disappointed by realizations of $X_2$ in her dataset. This is because $X_2$ is only uncensored when $X_1$ is low enough, a contingency where the agent expects positive reversal on average. Over-pessimism can therefore be thought of as “two wrongs making a right,” as the biased agent’s pessimism about the unconditional mean of $X_2$ counteracts her false expectation of positive reversals in the dataset of censored histories.

This mechanism explains the long-run pessimism in Theorem 1 and Theorem 2.
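The Gaussian closed form for the pseudo-true second-period fundamental can be checked numerically. The sketch below is illustrative only: the function names, the parameter values (γ = 0.5, σ = 1, both true means equal to 0) and the simulation size are assumptions made here, not values taken from the paper. It evaluates the closed form using the standard formula for the mean of a normal variable truncated from above, and compares it with the average of x2 + γ(x1 − µ1•) over simulated histories whose first draw falls at or below the cutoff. Under a Gaussian feasible model with µ1 held at its true value, that average is the maximum-likelihood estimate of µ2.

```python
import math
import random


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mean_below(c, mu, sigma):
    """E[X | X <= c] for X ~ N(mu, sigma^2); c = math.inf means no censoring."""
    if c == math.inf:
        return mu
    z = (c - mu) / sigma
    return mu - sigma * norm_pdf(z) / norm_cdf(z)


def pseudo_true_mu2(c, mu1_true, mu2_true, gamma, sigma):
    """Closed form: mu2* = mu2_true - gamma * (mu1_true - E[X1 | X1 <= c])."""
    return mu2_true - gamma * (mu1_true - mean_below(c, mu1_true, sigma))


def simulated_mu2(c, mu1_true, mu2_true, gamma, sigma, n_rounds, seed):
    """Estimate mu2 from simulated censored histories, with mu1 held at its true value.

    Draws are objectively independent, and X2 is observed only when X1 <= c.
    Under the feasible model, X2 + gamma * (X1 - mu1) ~ N(mu2, sigma^2), so the
    Gaussian MLE of mu2 is the sample mean of that quantity over the histories
    in which X2 was observed.
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n_rounds):
        x1 = rng.gauss(mu1_true, sigma)
        if x1 <= c:  # the agent continues, so the second draw is recorded
            x2 = rng.gauss(mu2_true, sigma)
            total += x2 + gamma * (x1 - mu1_true)
            count += 1
    return total / count


# Hypothetical parameter values, chosen only for illustration.
GAMMA, SIGMA, MU1, MU2 = 0.5, 1.0, 0.0, 0.0
closed_form = pseudo_true_mu2(0.0, MU1, MU2, GAMMA, SIGMA)
estimate = simulated_mu2(0.0, MU1, MU2, GAMMA, SIGMA, n_rounds=200_000, seed=0)
```

Evaluating `pseudo_true_mu2` on a grid of cutoffs shows the three properties in Proposition 2 for this parameterization: the value is below the true mean for finite cutoffs, rises with the cutoff, and equals the true mean when the cutoff is `math.inf`.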
Infact, in the large-generations setting of Theorem 2, every generation t ≥ µ = µ as in Remark1(d) (Section 5.4), when the stage game has more than two periods (Online Appendix OA2), under an alternative method-of-moments inference procedure (Online Appendix OA 5),when a fraction of agents suffer from selection neglect (Online Appendix OA 8.2), whenhigher draws bring worse payoffs (Online Appendix OA 8.1), and with high probability afterobserving a finite dataset containing just 100 censored histories (Online Appendix OA 9.1).The severity of the biased agent’s pessimism increases with the severity of censoring. Theintuition is that the agent wants to infer a lower µ ∗ to better match X ’s in histories thatstart with bad X ’s, but doing so carries the cost of a worse model fit for histories that startwith intermediate X ’s. More severe censoring — generated by a strategy that accepts notonly very good X ’s but also intermediate ones— alleviates this cost, as histories that startwith intermediate X ’s no longer contain their associated X ’s. The optimal inference µ ∗ thus decreases.The comparative static dµ ∗ dc > H • ( c [0] ) and chooses a cutoff c [1] . Generation 2 then observes histories from all predecessorgenerations, that is histories drawn from both H • ( c [0] ) and H • ( c [1] ). If c [1] < c [0] , thenGeneration 2’s dataset features (on average) more severe censoring than Generation 1’sdataset. Thus, Generation 2 comes to a more pessimistic inference about the second-periodfundamental. By Proposition 1, this leads to a further lowering of the cutoff threshold, c [2] < c [1] , and the pattern continues. This section studies a social-learning environment where biased agents act one at a timeand pass down beliefs to their successors. I define the steady state of the stage game,prove its existence and uniqueness, and show it features over-pessimistic beliefs and earlystopping. 
Then, I turn to the stochastic process of beliefs and behavior in the social-learning environment, showing that this process almost surely converges to the steady state.

3.1 Steady State: Existence, Uniqueness, and Other Properties
A steady state is a triplet consisting of fundamentals ( µ ∞ , µ ∞ ) ∈ R and a cutoff threshold c ∞ ∈ R that endogenously determine each other. The cutoff strategy with acceptancethreshold c ∞ maximizes expected payoff under the feasible model Ψ( µ ∞ , µ ∞ ; γ ), while thefundamentals are the pseudo-true fundamentals under data censoring with threshold c ∞ .More precisely, Definition 6. A steady state consists of µ ∞ , µ ∞ , c ∞ ∈ R such that:1. c ∞ = C ( µ ∞ , µ ∞ ; γ ) . µ ∞ = µ ∗ ( c ∞ ) and µ ∞ = µ ∗ ( c ∞ ).Steady states correspond to Esponda and Pouzo (2016)’s pure Berk-Nash equilibria for anagent whose prior is supported on the feasible models with feasible fundamentals M = R ,under the restriction that equilibrium belief puts full confidence in a single fundamentalpair. The set of steady states depends on γ , since the severity of the bias changes both theoptimal cutoff thresholds under different fundamentals and inference about fundamentalsfrom stage-game histories.The “steady state” defined here almost surely characterizes the long-run learning outcomein the society where biased agents act one by one. This convergence does not follow fromEsponda and Pouzo (2016), for their results only imply local convergence from prior beliefssufficiently close to the equilibrium beliefs, and only in a “perturbed game” environmentwhere learners receive idiosyncratic payoff shocks to different actions. I will show globalconvergence of the stochastic processes of beliefs and behavior without payoff shocks.Like almost all examples of Berk-Nash equilibrium in Esponda and Pouzo (2016), mysteady state generates data with positive KL divergence relative to the implied data distri-bution under the steady-state beliefs. That is, H • ( c ∞ ) = H (Ψ( µ ∞ , µ ∞ ; γ ); c ∞ ), so the steadystate is not a self-confirming equilibrium. 
This is because for every censoring threshold c (and in particular for c = c ∞ ) , the KL divergences of the true history distribution to thehistory distributions under different feasible models is bounded away from 0.To prove the existence and uniqueness of steady state, I define the following belief itera-tion map on the second-period fundamental. Definition 7.
For γ > , the iteration map I : R → R is given by I ( µ ; γ ) := µ ∗ ( C ( µ • , µ , γ )) For example, under the history distribution H • ( c ∞ ), E [ h | c ∞ − ≤ h ≤ c ∞ ] = E [ h | c ∞ − ≤ h ≤ c ∞ −
1] since draws are objectively independent. However, under the history distribution driven by the steady-state feasible model Ψ( µ ∞ , µ ∞ ; γ ), we must have E [ h | c ∞ − ≤ h ≤ c ∞ ] < E [ h | c ∞ − ≤ h ≤ c ∞ − γ >
0. This feature contrasts with Heidhues, Koszegi, and Strack (2018)’s model that results in aself-confirming learning outcome. µ ∗ ( c ) = µ • for all c from Proposition 2, it is straightforward to see thatsteady-state beliefs about µ are in bijection with fixed points of I . This shows steady-statebelief about µ exhibits over-pessimism. Proposition 3.
Under Assumption 2, every steady state satisfies $\mu_2^\infty < \mu_2^\bullet$, $\mu_1^\infty = \mu_1^\bullet$. Furthermore, the steady state is unique under the additional Assumption 3.
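Over-pessimism and the fixed-point structure of steady states can be illustrated numerically. The following sketch assumes the Gaussian case together with search without recall (stopping yields x1, continuing yields x2), so that the subjective indifference condition x1 = µ2 − γ(x1 − µ1) gives the linear cutoff C(µ1, µ2; γ) = (µ2 + γµ1)/(1 + γ). This payoff specification, the function names and the parameter values are assumptions of the sketch, not statements from the paper. It iterates the map from Definition 7 from several starting beliefs.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def cutoff(mu1, mu2, gamma):
    """Subjectively optimal cutoff for search without recall (assumed payoffs).

    Solves x1 = mu2 - gamma * (x1 - mu1) for x1.
    """
    return (mu2 + gamma * mu1) / (1.0 + gamma)


def pseudo_true_mu2(c, mu1_true, mu2_true, gamma, sigma):
    """Gaussian closed form: mu2_true - gamma * (mu1_true - E[X1 | X1 <= c])."""
    z = (c - mu1_true) / sigma
    mean_below = mu1_true - sigma * norm_pdf(z) / norm_cdf(z)
    return mu2_true - gamma * (mu1_true - mean_below)


def iteration_map(mu2, mu1_true, mu2_true, gamma, sigma):
    """Belief about mu2 -> induced cutoff -> pseudo-true mu2 under that cutoff."""
    c = cutoff(mu1_true, mu2, gamma)
    return pseudo_true_mu2(c, mu1_true, mu2_true, gamma, sigma)


def steady_state(start, mu1_true, mu2_true, gamma, sigma, tol=1e-12, max_iter=1000):
    """Iterate the map from `start` until successive beliefs differ by less than tol."""
    mu2 = start
    for _ in range(max_iter):
        nxt = iteration_map(mu2, mu1_true, mu2_true, gamma, sigma)
        if abs(nxt - mu2) < tol:
            return nxt, cutoff(mu1_true, nxt, gamma)
        mu2 = nxt
    raise RuntimeError("iteration did not converge")


# Hypothetical parameter values, chosen only for illustration.
GAMMA, SIGMA, MU1, MU2 = 0.5, 1.0, 0.0, 0.0
mu2_inf, c_inf = steady_state(0.0, MU1, MU2, GAMMA, SIGMA)
```

With these payoffs the objectively optimal cutoff equals the true second-period mean, so comparing `c_inf` with `MU2` shows the early stopping, and comparing `mu2_inf` with `MU2` shows the over-pessimism. Starting the iteration from other beliefs leads to the same pair.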
Proposition 4.
Under Assumptions 1, 2, and 3, $\mathcal{I}$ is a contraction mapping with contraction constant $0 < \ell\gamma < 1$. Therefore, a unique steady state exists.

The contraction mapping property of $\mathcal{I}$ comes from two lemmas. First, we can use the strict log-concavity assumption to show that $\mu_2^*(c)$ is Lipschitz continuous with constant $\gamma$.

Lemma 1. Under Assumption 2, $\mu_2^*(c)$ is Lipschitz continuous with Lipschitz constant $\gamma$.

Next, the indifference threshold is Lipschitz continuous with a Lipschitz constant strictly less than $1/\gamma$ once we adopt Assumption 3.

Lemma 2. Under Assumptions 1, 2, and 3, $\mu_2 \mapsto C(\mu_1^\bullet, \mu_2; \gamma)$ is Lipschitz continuous with a Lipschitz constant $\ell < 1/\gamma$.

Even under Assumptions 1 and 2 alone, the basic regularity conditions we maintain throughout, it turns out $\mathcal{I}$ is “almost” a contraction mapping for any $\gamma > 0$, in the sense that $|\mathcal{I}(\mu_2') - \mathcal{I}(\mu_2'')| < |\mu_2' - \mu_2''|$ for every distinct $\mu_2', \mu_2'' \in \mathbb{R}$. But there is no guarantee of a uniform contraction constant strictly less than 1. The slight strengthening in Assumption 3 ensures such a uniform contraction constant exists.

I now show the steady-state stopping threshold always features stopping too early. For every $\mu_1^\bullet, \mu_2^\bullet \in \mathbb{R}$, the objectively optimal stopping strategy takes the form of a cutoff $c^\bullet \in \mathbb{R} \cup \{\pm\infty\}$, where $c^\bullet = -\infty$ means always stopping and $c^\bullet = \infty$ means never stopping. I show that $c^\bullet > c^\infty$ for every steady-state cutoff $c^\infty$. (This result only requires Assumptions 1 and 2 and does not require uniqueness of steady states.)

Footnote: The cutoff form of the objectively optimal strategy follows from Lemma A.2 in the Appendix, which shows even when $\gamma = 0$, the difference between stopping payoff at $x_1$ and expected continuation payoff after $x_1$ is strictly increasing and continuous in $x_1$.

Early stopping does not directly follow from over-pessimism. In fact, outside of the steady state, there is an intuition that a biased agent may stop later than a rational agent, not earlier. For a concrete illustration, consider Example 1 with $q = 0$, so there is no probability of recall. Suppose the true fundamentals are $\mu_1^\bullet \gg \mu_2^\bullet$. If a biased agent has the correct beliefs about the fundamentals, she perceives a greater continuation value after $X_1 = \mu_2^\bullet$ than a rational agent with the same correct beliefs, since the former holds a false expectation of positive reversals after a bad (relative to $\mu_1^\bullet$) early draw. Even though $c^\bullet = \mu_2^\bullet$ and the rational agent chooses to stop, the biased agent chooses to continue and has an indifference threshold strictly above $c^\bullet$. By continuity, the biased agent’s cutoff threshold remains strictly above $c^\bullet$ even under slightly pessimistic beliefs about $\mu_2$.

Nevertheless, the next proposition shows that in the steady state, it is unambiguous that the biased agent stops too early relative to the objectively optimal threshold. The early-stopping result strengthens the over-pessimism result. In the steady state, agents must be sufficiently pessimistic as to overcome the opposite intuition about late stopping just discussed.
Proposition 5.
Under Assumptions 1 and 2, every steady-state stopping threshold c ∞ isstrictly lower than the objectively optimal threshold, c • . To understand why, note the biased agent believes in different distributions of X follow-ing different realizations of X , with more pessimistic beliefs after higher realizations. In asteady state ( µ ∞ , µ ∞ , c ∞ ) , the agent’s subjective belief about X following X = c ∞ must bea leftward shift of the true distribution f ( · | µ • ). Else, the agent would have subjective dis-tributions of X that stochastically dominate the true distribution whenever S c ∞ prescribescontinuing, so heuristically she could improve the fit of her model by lowering her beliefabout µ . The biased agent’s indifference at c ∞ is thus based on an overly pessimistic beliefabout the continuation value, so we must have c ∞ < c • . This section shows the steady state defined and studied earlier corresponds to the long-run learning outcome for a society of biased agents acting one at a time. I outline theconvergence proof for a simpler variant of Theorem 1, where agents start off knowing µ • andonly entertain uncertainty over µ . That is, the feasible fundamentals are given by Remark1(c) rather than Remark 1(b). This simplification is without much loss: even when agentsare initially uncertain about µ , they will almost surely learn it in the long run regardless ofthe stochastic process of their predecessors’ stopping strategies. Intuitively, this is because X can never be censored, so no belief distortion in µ is possible. Once agents have learned µ • , the rest of the argument proceeds much like the case where µ • is known from the start.In the next section I comment on the key steps in extending the proof to the case uncertaintyover two-dimensional fundamentals ( µ , µ ), but will defer the details to Online AppendixOA 3.In the learning environment, time is discrete and partitioned into rounds t = 1 , , , ... One short-lived agent arrives per round. 
Agent 1 starts with a prior belief M given by aprior density m : [ µ , ¯ µ ] → R > , while agent t ≥ M t − of agent t − Next, agent t chooses a cutoff threshold ˜ C t maximizing expected This is similar to the intuition for why µ ∗ ( c ) = µ • for every c . The same learning dynamics obtain in an environment where every agent starts with the common priorbelief M and observes the stage-game histories of all predecessors. M t − to ˜ M t by applying Bayes’ rule to her stage-game history, ˜ H t ∈ H .The sequences ( ˜ M t ) , ( ˜ C t ) , ( ˜ H t ) are stochastic processes whose randomness stem from ran-domness of the stage-game draws realizations in different rounds. The convergence theoremis about the almost sure convergence of processes ( ˜ M t ) and ( ˜ C t ) . To define the probabilityspace formally, consider the R -valued stochastic process ( X t ) t ≥ = ( X ,t , X ,t ) t ≥ , where X t and X t are independent for t = t . Within each t, X ,t ∼ f ( · | µ • ), X ,t ∼ f ( · | µ • ) are alsoindependent. Interpret X t as the pair of potential draws in the t -th round of the stage game.Clearly, there exists a probability space (Ω , A , P ), with sample space Ω = ( R ) ∞ interpretedas paths of the process just described, A the Borel σ -algebra on Ω , and P the measure onsample paths so that the process X t ( ω ) = ω t has the desired distribution. The term “almostsurely” means “with probability 1 with respect to the realization of the infinite sequence ofall (potential) draws”, i.e. P -almost surely. The processes ( ˜ M t ) , ( ˜ C t ) , ( ˜ H t ) are defined onthis probability space and adapted to the filtration ( F t ) t ≥ , where F t is the sub- σ -algebragenerated by draws up to round t , F t = σ (( X s ) ts =1 ).Under Assumptions 1, 2, and 3, by Proposition 4 there exists a unique steady state( µ • , µ ∞ , c ∞ ). 
Theorem 1 shows that, provided the support of m contains µ ∞ , m is con-tinuous, and the second derivative of ln( f ( · | µ • )) is uniformly bounded, the stochasticprocesses ( ˜ C t ) and ( ˜ M t ) almost surely converge to the steady state. This is a global conver-gence result since the bounded interval [ µ , ¯ µ ] can be arbitrarily large and the prior density m can assign arbitrarily small probability to neighborhoods around µ ∞ . Theorem 1 . Suppose Assumptions 1, 2 and 3 hold and the second derivative of ln ( f ( x | µ • )) is uniformly bounded for x ∈ R . Suppose also µ ≤ µ ∞ ≤ ¯ µ where µ ∞ is the unique steady-state belief, and prior density m : [ µ , ¯ µ ] → R > is continuously differentiable. Almostsurely, lim t →∞ ˜ C t = c ∞ and lim t →∞ E µ ∼ ˜ M t | µ − µ ∞ | = 0 , where c ∞ is the unique steady-state cutoff threshold. I will now discuss the obstacles to proving convergence and provide an outline of myargument. In each round t, the cutoff choice of the t -th agent determines how history ˜ H t will be censored. We can think of each c ∈ R as generating a different “type” of data. As wesaw in Proposition 2, different “types” of data (in large samples) lead to different inferencesabout the fundamentals for biased agents, so the cutoff ˜ C t influences the belief that will bepassed on to agent t + 1. Yet ˜ C t is an endogenous, ex-ante random object that depends onthe belief of the t -th agent, meaning that belief and behavior co-evolve to complicate theanalysis of learning dynamics.To be more precise, the log-likelihood of all X data up to the end of round t underfundamental µ ∈ [ µ , ¯ µ ] is the random variable t X s =1 ln( f ( X ,s ; µ − γ ( X ,s − µ • )) · { X ,s ≤ ˜ C s } . s -th summand contains the indicator { X ,s ≤ C s } , referring to the fact that X ,s wouldbe censored if X ,s exceeds the cutoff ˜ C s . The cutoff ˜ C s depends on histories in periods1 , , ..., s − , hence indirectly on ( X k ) k
Proposition 10 from Heidhues, Koszegi, and Strack (2018):
Let ( y t ) t be a martingalethat satisfies a.s. [ y t ] ≤ vt for some constant v ≥ . We have that a.s. lim t →∞ y t t = 0 . After simplifying the problem with this result, I establish a pair of mutual bounds onasymptotic behavior and asymptotic beliefs. If we know cutoff thresholds are asymptoticallybounded between c l and c h , c l < c h , then beliefs about µ must be asymptotically supportedon the interval [ µ ∗ ( c l ) , µ ∗ ( c h )]. Conversely, if belief is asymptotically supported on the subin-terval [ µ l , µ h ] ⊆ [ µ , ¯ µ ], then cutoff thresholds must be asymptotically bounded between C ( µ • , µ l ; γ ) and C ( µ • , µ h ; γ ). Lemma A.19 . For c l ≥ C ( µ • , µ ; γ ) , if almost surely lim inf t →∞ ˜ C t ≥ c l , then almost surely lim t →∞ ˜ M t ( [ µ , µ ∗ ( c l )) ) = 0 . Also, for c h ≤ C ( µ • , ¯ µ ; γ ) , if almost surely lim sup t →∞ ˜ C t ≤ c h , then almost surely lim t →∞ ˜ M t ( ( µ ∗ ( c h ) , ¯ µ ]) = 0 . Lemma A.20 . For µ ≤ µ l < µ h ≤ ¯ µ , if lim t →∞ ˜ M t ([ µ l , µ h ]) = 1 almost surely, then lim inf t →∞ ˜ C t ≥ C ( µ • , µ l ; γ ) and lim sup t →∞ ˜ C t ≤ C ( µ • , µ h ; γ ) almost surely. Applying this pair of lemmas to supp( M ) = [ µ , ¯ µ ], we conclude that asymptotically˜ M t must be supported on the subinterval [ I ( µ ) , I (¯ µ )] , where I is the iteration map fromDefinition 7 first used in analyzing the existence and uniqueness of steady states. UnderAssumptions 1, 2, and 3, Proposition 4 implies that I is a contraction mapping whoseiterates converge to µ ∞ . Therefore by repeatedly applying the pair of Lemmas A.19 andA.20, we can refine the bound on asymptotic beliefs down to the singleton { µ ∞ } , showingthe almost-sure convergence of beliefs there. The almost-sure convergence of behavior followseasily from Lemma A.20. µ The hypotheses of Theorem 1 differ from those of Theorem 1 in that agents start off withuncertainty about µ . 
I now comment on the key step to proving almost-sure convergence of beliefs and behavior in the environment with two-dimensional uncertainty about fundamentals.

The structure of the inference problem in my setting is such that I can separately bound the agents’ asymptotic beliefs in two “directions,” thus reducing the task of proving a two-dimensional belief bound into a pair of tasks involving one-dimensional belief bounds. To understand why, consider a pair of fundamentals, $(\mu_1, \mu_2)$ and $(\mu_1', \mu_2') = (\mu_1 + d, \mu_2 - \gamma d)$ for some $d > 0$, satisfying $\mu_1, \mu_1' \le \mu_1^\bullet$. That is, $(\mu_1, \mu_2)$ and $(\mu_1', \mu_2')$ lie on the same line with slope $-\gamma$. For any uncensored history $(x_1, x_2) \in \mathbb{R}^2$, the likelihood of second-period draw $x_2$ is the same under both pairs of fundamentals,
\[
f(x_2 \mid \mu_2 - \gamma(x_1 - \mu_1)) = f(x_2 \mid \mu_2' - \gamma(x_1 - \mu_1')).
\]
This is because the feasible model $\Psi(\mu_1, \mu_2; \gamma)$ has a lower first-period mean but also a higher second-period unconditional mean, compared to the model $\Psi(\mu_1', \mu_2'; \gamma)$. An agent who believes in the first model feels less disappointed by the draw $x_1$, since she evaluates it against a lower expectation. This leads to a weaker anticipation of positive reversal under the gambler’s fallacy, compared to another agent who believes in the second model. But this difference is canceled out by the more optimistic belief about the unconditional distribution of the second-period draw, $\mu_2 > \mu_2'$.

This argument shows that both pairs of fundamentals $(\mu_1, \mu_2)$ and $(\mu_1', \mu_2')$ explain $X_2$ data equally well in all uncensored histories. This is important as it shows that regardless of agent $t$’s strategy, she would always find that $(\mu_1, \mu_2)$ and $(\mu_1', \mu_2')$ lead to the same likelihood of second-period data in her history $\tilde{H}_t$. At the same time, $(\mu_1', \mu_2')$ provides a strictly better fit for $X_1$ data on average than $(\mu_1, \mu_2)$, since $\mu_1 < \mu_1' \le \mu_1^\bullet$. This means in the long run, fundamentals $(\mu_1, \mu_2)$ should receive much less posterior probability than $(\mu_1', \mu_2')$, as the latter better rationalize the data overall.

This heuristic comparison of the asymptotic goodness-of-fit for two feasible models is formalized by computing the directional derivative for data log-likelihood along the vector $\begin{pmatrix} 1 \\ -\gamma \end{pmatrix}$ in the space of fundamentals. I establish an (almost-sure) positive lower bound on this directional derivative to the left of $\mu_1^\bullet$, and an analogous negative upper bound to the right of $\mu_1^\bullet$.
This allows me to show the region colored in red receives 0 posterior probability asymptotically, by comparing each point in red with a corresponding point in blue along a line of slope $-\gamma$. By repeating this argument (and applying the symmetric bound to the right of $\mu_1^\bullet$), I show that belief is asymptotically concentrated along an $\epsilon$-width vertical strip containing the steady state beliefs, $(\mu_1^\bullet, \mu_2^\infty)$.

Having restricted the long-run belief to a small vertical strip, we have completed one “direction” of the belief bounds and effectively reduced the dimensionality of uncertainty back to one. The rest of the argument proceeds similarly to the case where agents know $\mu_1^\bullet$ discussed before, iteratively restricting agents’ asymptotic behavior and asymptotic belief about $\mu_2$. These restrictions amount to “vertical” belief refinements within the $\epsilon$-strip, so eventually belief is restricted to the single point $(\mu_1^\bullet, \mu_2^\infty)$, the unique steady-state belief.

In this section, I consider a social-learning environment where agents arrive in large generations and all agents in the same generation act simultaneously. I will prove Theorem 2, fully characterizing the learning dynamics in this environment. I will also discuss the positive-feedback loop between distorted beliefs about fundamentals and distorted stopping behavior.
There is an infinite sequence of generations, t ∈ { , , , ... } . Each generation is “large” andwill be modeled as a continuum of agents, n ∈ [0 , n from generation 1 is unrelated toagent n from generation 2. The realizations of draws X , X are independent across all stagegames, including those from the same generation. Generation 0 agents play some strategy S c [0] , where c [0] ∈ R is the initial condition of social learning.Write h τ,n ∈ H for the stage-game history of agent n from generation τ. Before playingher own stage game, each agent in generation t ≥ h τ,n ) n ∈ [0 , from each predecessor generation, 0 ≤ τ ≤ t −
1. If all generation τ predecessors used the stopping strategy S c τ for some c τ ∈ R , then the sub-dataset ( h τ,n ) n ∈ [0 , All generation τ predecessors had the same information about the fundamentals, so all of them wouldhave found the same stopping strategy subjectively optimal. H • ( c τ ). Agents are told the stopping strategies of their predecessors fromall past generations and use the entire dataset of histories to infer fundamentals. The spaceof feasible fundamentals is M = R as in Remark 1(a), so agents can flexibly estimate theunconditional means of draws from different periods, subject to their dogmatic belief inreversals.Agents only infer from predecessors’ histories, not from their behavior. This is rationalas information sets are nested across generations. For t > t , generation t observes allthe social information that generation t saw. In addition, generation t ’s dataset contains acomplete record of everything that happened in generation t ’s stage games. Since generation t has no private information that is unobserved by generation t , the behavior of thesepredecessors is uninformative about the fundamentals beyond what generation t can learnfrom the dataset of histories.In the large-generations model, generation t agents infer fundamentals ( µ , [ t ] , µ , [ t ] ) thatminimize the sum of the KL divergences between the implied history distribution under thefeasible model Ψ( µ , [ t ] , µ , [ t ] ; γ ) on the one hand, and the t observed history distributions ingenerations 0 ≤ τ ≤ t − t ’s minimization objectivebelow. Definition 8.
The large-generations pseudo-true fundamentals with respect to cutoff thresholds $(c_\tau)_{\tau=0}^{t-1}$ solve
\[ \min_{\mu_1, \mu_2 \in \mathbb{R}} \sum_{\tau=0}^{t-1} D_{KL}\big( \mathcal{H}^\bullet(c_\tau) \,\|\, \mathcal{H}(\Psi(\mu_1, \mu_2; \gamma); c_\tau) \big), \tag{1} \]
where $D_{KL}$ is KL divergence from Definition 5. Denote the minimizers as $\mu_1^*(c_0, \ldots, c_{t-1})$ and $\mu_2^*(c_0, \ldots, c_{t-1})$.

I interpret the continuum of agents in each generation as an idealized, tractable modeling device representing a large but finite number of agents. Appendix OA 4 provides a finite-population foundation for inference and behavior in the continuum-population model. There, for the Gaussian case, I show that when an agent observes $t$ finite sub-datasets of histories drawn from distributions $\mathcal{H}^\bullet(c_\tau)$ for $0 \le \tau \le t-1$, as these datasets grow large her inference and behavior almost surely converge to the infinite-population analogs.
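The finite-population convergence can be illustrated numerically. The sketch below is not the construction of Appendix OA 4; it is a minimal Monte Carlo check under stated assumptions: a single predecessor cutoff, the Gaussian case with known unit variance, true fundamentals equal to 0, and an illustrative bias value `gamma = 0.5`. It uses the fact that KL minimization against an empirical distribution coincides with maximum likelihood, and compares the misspecified estimate with the Gaussian closed form $\mu_2^* = \mu_2^\bullet - \gamma(\mu_1^\bullet - E[X_1 \mid X_1 \le c])$. All function names are hypothetical.

```python
import math
import random

def truncated_mean(c, mu=0.0, sigma=1.0):
    """E[X | X <= c] for X ~ N(mu, sigma^2)."""
    a = (c - mu) / sigma
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
    return mu - sigma * pdf / cdf

def simulate_histories(n, c, rng, mu1=0.0, mu2=0.0, sigma=1.0):
    """Draw n censored histories: X2 is recorded only when X1 <= c."""
    histories = []
    for _ in range(n):
        x1 = rng.gauss(mu1, sigma)
        x2 = rng.gauss(mu2, sigma) if x1 <= c else None
        histories.append((x1, x2))
    return histories

def misspecified_mle(histories, gamma):
    """Maximum-likelihood (mu1, mu2) under Psi(mu1, mu2; gamma), variance known."""
    mu1_hat = sum(x1 for x1, _ in histories) / len(histories)
    continued = [(x1, x2) for x1, x2 in histories if x2 is not None]
    mean_x1 = sum(x1 for x1, _ in continued) / len(continued)
    mean_x2 = sum(x2 for _, x2 in continued) / len(continued)
    # First-order condition in mu2: the residuals
    # x2 - mu2 + gamma * (x1 - mu1) average to zero over continued histories.
    mu2_hat = mean_x2 + gamma * (mean_x1 - mu1_hat)
    return mu1_hat, mu2_hat

gamma, c = 0.5, 0.0  # illustrative values, not taken from the paper
closed_form_mu2 = 0.0 - gamma * (0.0 - truncated_mean(c))
for n in (1_000, 100_000):
    sample = simulate_histories(n, c, random.Random(0))
    mu1_hat, mu2_hat = misspecified_mle(sample, gamma)
    print(n, round(mu1_hat, 3), round(mu2_hat, 3), round(closed_form_mu2, 3))
```

As the sample grows, the estimate of $\mu_1$ should approach the truth while the estimate of $\mu_2$ should settle near the pessimistic closed-form value rather than near 0.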
Now I develop the proof of Theorem 2.
Theorem 2.
Suppose Assumptions 1 and 2 hold. Starting from any initial condition and any $m$, cutoffs $(c_{[t]})_{t \ge 1}$ and beliefs $(\mu_{2,[t]})_{t \ge 1}$ form monotonic sequences across generations. When Assumption 3 also holds, there exists a unique steady state $\mu_2^\infty, c^\infty \in \mathbb{R}$ so that $c_{[t]} \to c^\infty$ and $(\mu_{1,[t]}, \mu_{2,[t]}) \to (\mu_1^\bullet, \mu_2^\infty)$ monotonically, regardless of the initial condition and $m$. This steady state is the same as the one in Theorem 1.

[Footnote to the statement that agents are told their predecessors’ stopping strategies: These stopping rules can also be exactly inferred from the infinite dataset.]

Towards a proof, first consider learning dynamics in a related auxiliary environment. The auxiliary environment is identical to the large-generations environment just described, except that agents in each generation $t \ge 1$ only infer from the histories of the immediate predecessor generation, $t-1$. Write $\mu^A_{[t]}$ and $c^A_{[t]}$ for the inference and cutoff threshold in generation $t$ of the auxiliary environment, where the superscript “A” distinguishes them from the corresponding processes of the baseline large-generations environment.

We have $\mu^A_{1,[t]} = \mu_1^\bullet$ for every $t \ge 1$, from Proposition 2. Also, it is easy to see that $(\mu^A_{2,[t]})_{t \ge 1}$ are iterates of the $\mathcal{I}$ map from Definition 7, and that they must be monotonic since the two comparative statics $\partial C / \partial \mu_2 > 0$ and $d\mu_2^* / dc > 0$ have the same sign.

Figure 1 illustrates these dynamics for search without recall, $q = 0$. Consider the Gaussian case with $\mu_1^\bullet = \mu_2^\bullet = 0$, γ = − ., $\sigma^2 = 1$, and a society that starts at the objectively optimal cutoff threshold, $c_{[0]} = 0$. Society mislearns monotonically in both the baseline large-generations environment and the auxiliary environment. This mislearning is more exaggerated in the auxiliary environment, but both environments lead to the same long-run outcome.

The map $\mathcal{I}(\cdot; \gamma)$ connects the environment where large generations of agents act simultaneously to the environment where agents act one by one. We can think of $\mathcal{I}(\cdot; \gamma)$ as the one-generation-forward belief map in the auxiliary society, whose belief dynamics are closely related to the belief dynamics of the baseline large-generations environment. There are no large generations at all in the environment where agents act one by one, but there $\mathcal{I}$ still plays a critical role in establishing the long-run convergence of beliefs and behavior. Intuitively, in the learning environment of Section 3, a belief based on the histories of one predecessor from each of many past generations replaces a belief based on a large dataset of histories from many agents all belonging to the same past generation.

[Figure 1 plot: “Dynamics of Beliefs in the First Four Generations.” Horizontal axis: generation. Vertical axis: belief about $\mu_2$. Series: large-generations environment, auxiliary environment.]

Figure 1: The dynamics of beliefs about $\mu_2$ in the first four generations for the Gaussian case (with $\sigma^2 = 1$). The stage game is search (without recall), the true fundamentals are $\mu_1^\bullet = \mu_2^\bullet = 0$, the bias parameter is $\gamma > 0$, and the initial condition is $c_{[0]} = 0$. In both the baseline large-generations environment and the auxiliary environment, beliefs are monotonic across generations, an illustration of Theorem 2. Beliefs in both environments converge to the same steady-state beliefs, though the rate of convergence is faster in the auxiliary environment.

I combine the asymptotic early-stopping result of Theorem 1 with the monotonic learning dynamics of Theorem 2 to deduce:

Corollary 1.
Suppose Assumptions 1, 2, and 3 hold. In the large-generations environment, if society starts at the objectively optimal initial condition $c_{[0]} = c^\bullet$, then expected payoff strictly decreases across all successive generations.

This stark “monotonic” mislearning result relies crucially on the endogenous-data setting. Each generation uses a lower acceptance threshold than its predecessors, a change with the side effect of changing the censoring threshold of their successors’ data. The new “type” of censored data causes the next generation to become more pessimistic about the fundamentals than any past generation.
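The feedback loop behind this monotonic mislearning can be traced in a few lines. The following is only a sketch of the auxiliary environment (each generation learns from its immediate predecessor), under assumptions that are mine rather than the paper's: the stage game is search without recall with $u_1(x_1) = x_1$ and $u_2(x_1, x_2) = x_2$, draws are standard Gaussian with true means 0, and the bias is an illustrative `gamma = 0.5`. Under these assumptions the indifference condition $c = \mu_2 - \gamma(c - \mu_1)$ gives the cutoff in closed form, and the objective expected payoff of a cutoff $c$ reduces to the standard normal density at $c$.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_mean(c):
    """E[X1 | X1 <= c] for X1 ~ N(0, 1)."""
    return -normal_pdf(c) / normal_cdf(c)

def cutoff(mu2, gamma):
    """Solve the indifference condition c = mu2 - gamma * c (mu1 = 0, no recall)."""
    return mu2 / (1.0 + gamma)

def objective_payoff(c):
    """True expected payoff of cutoff c: E[X1; X1 > c] + P[X1 <= c] * E[X2] = pdf(c)."""
    return normal_pdf(c)

def auxiliary_path(c0, gamma, generations):
    """Iterate belief -> cutoff -> censored data -> belief, one generation at a time."""
    rows, c = [], c0
    for t in range(1, generations + 1):
        # Pseudo-true belief given the predecessor cutoff:
        # mu2* = 0 - gamma * (0 - E[X1 | X1 <= c]).
        mu2 = gamma * truncated_mean(c)
        c = cutoff(mu2, gamma)
        rows.append((t, mu2, c, objective_payoff(c)))
    return rows

for t, mu2, c, payoff in auxiliary_path(c0=0.0, gamma=0.5, generations=4):
    print(t, round(mu2, 4), round(c, 4), round(payoff, 4))
```

Starting from the objectively optimal cutoff 0, each printed generation should show a lower belief about $\mu_2$, a lower cutoff, and a lower objective payoff than the one before it.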
In this section I explore a number of alternative model specifications to examine the robustness of my main results. The Online Appendix contains the proofs of results in this section and additional extensions.

5.1 Comparative Statics
In the first extension, I consider how learning dynamics react to changes in stage-gameparameters. In general, when agents learn from exogenous data, their decision problem doesnot influence learning outcomes. This observation holds independently of whether agentsare misspecified. On the other hand, correctly specified agents in my setting always learncorrectly in the long run, so the stage game is again irrelevant. With misspecified learners inan endogenous-data setting, however, changes in the stage game carry long-run consequenceson agents’ beliefs about the fundamentals.
Definition 9.
Given a pair of second-period payoff functions $u_2^H, u_2^L$, say $u_2^H$ payoff dominates $u_2^L$ (abbreviated $u_2^H \succ u_2^L$) if for every $x_1 \in \mathbb{R}$, $u_2^H(x_1, x_2) \ge u_2^L(x_1, x_2)$ for every $x_2 \in \mathbb{R}$, and also $u_2^H(x_1, x_2) > u_2^L(x_1, x_2)$ for a positive-measure set of $x_2$ in $\mathbb{R}$.

For instance, in Example 1, increasing $q$ (the probability of recall) creates a new optimal-stopping problem that payoff dominates the old one. More generally, starting from any stage game with payoff functions $u_1$ and $u_2$, we can impose an extra waiting cost $\kappa_{wait} > 0$ on continuing, which yields a stage game with payoff functions $u_1$ and $u_2^L$ with $u_2^L = u_2 - \kappa_{wait}$. The modified stage game is payoff dominated by the unmodified one.

When $u_2^H \succ u_2^L$, a society facing the problem $(u_1, u_2^H)$ always uses a higher stopping threshold than a society facing the problem $(u_1, u_2^L)$, given the same beliefs about fundamentals. To state this formally, let $C_{u_1, u_2}$ be the optimal cutoff threshold function for the stage game $(u_1, u_2)$.

Lemma 3.
Suppose Assumption 1 holds for stage games $(u_1, u_2^H)$ and $(u_1, u_2^L)$, and $u_2^H \succ u_2^L$. For all $\mu_1, \mu_2 \in \mathbb{R}$ and $\gamma > 0$, $C_{u_1, u_2^H}(\mu_1, \mu_2; \gamma) > C_{u_1, u_2^L}(\mu_1, \mu_2; \gamma)$.

The next proposition shows that when one stage game payoff dominates another in terms of second-period payoffs, the dominated stage game leads to more pessimistic beliefs and a lower cutoff threshold in the steady state.
Proposition 6.
Suppose both $(u_1, u_2^H)$ and $(u_1, u_2^L)$ satisfy Assumptions 1 and 3, and that $u_2^H \succ u_2^L$. Under Assumption 2, the steady state of $(u_1, u_2^H)$ features a strictly more optimistic belief about the second-period fundamental and a strictly higher cutoff threshold than the steady state of $(u_1, u_2^L)$.

Combined with my main results on learning dynamics (Theorems 1 and 2), Proposition 6 illustrates how changing the stage-game payoff structure affects long-run inference. Consider two societies of gambler’s fallacy agents with the same bias parameter $\gamma > 0$, facing stage games $(u_1, u_2^H)$ and $(u_1, u_2^L)$ respectively, where $u_2^H$ payoff dominates $u_2^L$. Even if the latter society starts with a much more optimistic belief about $\mu_2$, in the long run the second society will end up with strictly more pessimistic beliefs and will use strictly lower cutoff thresholds. Since steady-state beliefs are too pessimistic in both societies, the second society’s long-run beliefs are more distorted.

This comparative statics result provides novel predictions about how the economic environment affects biased agents’ inference. Applied to the hiring context from Example 1, this result says that when managers are more impatient (i.e., suffer a greater waiting cost) or when they have a lower chance of recalling previous applicants, they will end up with more pessimistic beliefs about the labor pool. The direction of the comparative statics is another expression of the positive-feedback cycle between stopping threshold and inference. When managers become more impatient, for instance, they use a lower acceptance threshold as they wish to finish recruiting earlier. The lower cutoff intensifies the censoring effect on histories, leading to more pessimistic inference about the fundamentals. The extra pessimism, in turn, leads future managers to further lower their acceptance threshold, amplifying the initial change in behavior that came from the increase in waiting cost.

From a policy perspective, subsidizing longer search (i.e., decreasing $\kappa_{wait}$) unambiguously improves asymptotic learning accuracy for biased agents. So, even a policymaker who is ignorant of $(\mu_1^\bullet, \mu_2^\bullet)$ can partially correct society’s long-run beliefs. We can also think of this policy as a test of misspecification, as it alters steady-state beliefs only when agents are misspecified. The test can be conducted without knowledge of the true data-generating process.
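The waiting-cost comparison can be made concrete with a small fixed-point computation. This is a sketch under assumptions of my own choosing, not the paper's general stage game: search without recall with a flat waiting cost, $u_1(x_1) = x_1$ and $u_2(x_1, x_2) = x_2 - \kappa_{wait}$, standard Gaussian draws with true means 0, and an illustrative `gamma = 0.5`. The indifference condition then reads $c = \mu_2 - \gamma c - \kappa_{wait}$, and the steady state is found by iterating the belief and cutoff maps.

```python
import math

def truncated_mean(c):
    """E[X1 | X1 <= c] for X1 ~ N(0, 1)."""
    pdf = math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))
    return -pdf / cdf

def steady_state(gamma, kappa, iterations=200):
    """Iterate belief and cutoff maps until they settle; returns (mu2, c)."""
    c = 0.0
    for _ in range(iterations):
        # Pseudo-true second-period mean given censoring at c (true means are 0).
        mu2 = gamma * truncated_mean(c)
        # Indifference: c = mu2 - gamma * c - kappa.
        c = (mu2 - kappa) / (1.0 + gamma)
    return mu2, c

for kappa in (0.0, 0.1, 0.3):  # hypothetical waiting costs
    mu2, c = steady_state(gamma=0.5, kappa=kappa)
    print(kappa, round(mu2, 4), round(c, 4))
```

In this parametrization a larger waiting cost should produce both a lower steady-state cutoff and a more pessimistic steady-state belief about $\mu_2$, in line with the direction stated in Proposition 6.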
For this and subsequent extensions, I specialize to the Gaussian case.

So far, I have assumed agents hold dogmatic and correct beliefs about the variance of $X_1$ and the conditional variance of $X_2 \mid (X_1 = x_1)$. In this extension, I expand the set of feasible models and consider agents who are uncertain about the variances of the draws and jointly estimate means and variances. I show that agents exaggerate variances in a way that depends on the severity of data censoring, and study how this belief in fictitious variation strengthens the positive-feedback cycle between beliefs and behavior.

For $\mu_1, \mu_2 \in \mathbb{R}$, $\sigma_1^2, \sigma_2^2 \ge 0$, and $\gamma \ge 0$, let $\Psi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2; \gamma)$ refer to the joint distribution
\[ X_1 \sim N(\mu_1, \sigma_1^2), \qquad (X_2 \mid X_1 = x_1) \sim N(\mu_2 - \gamma(x_1 - \mu_1), \sigma_2^2). \]
Objectively, $X_1, X_2$ are independent Gaussian random variables each with a variance of $(\sigma^\bullet)^2 > 0$, so the true joint distribution of $(X_1, X_2)$ is $\Psi^\bullet = \Psi(\mu_1^\bullet, \mu_2^\bullet, (\sigma^\bullet)^2, (\sigma^\bullet)^2; 0)$. Suppose agents have a full-support belief over the class of feasible models $\{\Psi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2; \gamma) : \mu_1, \mu_2 \in \mathbb{R},\ \sigma_1^2, \sigma_2^2 \ge 0\}$ for a fixed $\gamma > 0$. For this extension, “fundamentals” refer to the four parameters $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$.

Following Definition 5, write $D_{KL}(\mathcal{H}^\bullet(c) \,\|\, \mathcal{H}(\Psi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2; \gamma); c))$ to denote the KL divergence between the true distribution of histories with censoring threshold $c$ and the implied history distribution under the fundamentals $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$. This divergence is given by
\[ \int_c^\infty \phi(x_1; \mu_1^\bullet, (\sigma^\bullet)^2) \cdot \ln\left( \frac{\phi(x_1; \mu_1^\bullet, (\sigma^\bullet)^2)}{\phi(x_1; \mu_1, \sigma_1^2)} \right) dx_1 \tag{2} \]
\[ + \int_{-\infty}^c \left\{ \int_{-\infty}^\infty \phi(x_1; \mu_1^\bullet, (\sigma^\bullet)^2) \cdot \phi(x_2; \mu_2^\bullet, (\sigma^\bullet)^2) \cdot \ln\left[ \frac{\phi(x_1; \mu_1^\bullet, (\sigma^\bullet)^2) \cdot \phi(x_2; \mu_2^\bullet, (\sigma^\bullet)^2)}{\phi(x_1; \mu_1, \sigma_1^2) \cdot \phi(x_2; \mu_2 - \gamma(x_1 - \mu_1), \sigma_2^2)} \right] dx_2 \right\} dx_1, \]
where $\phi(x; \mu, \sigma^2)$ is the Gaussian density with mean $\mu$ and variance $\sigma^2$, evaluated at $x$. The next proposition characterizes in closed form the pseudo-true fundamentals $\mu_1^*, \mu_2^*, (\sigma_1^*)^2, (\sigma_2^*)^2$ that minimize Equation (2).

Proposition 7.
The solutions of
\[ \min_{\mu_1, \mu_2 \in \mathbb{R},\ \sigma_1^2, \sigma_2^2 \ge 0} D_{KL}\big( \mathcal{H}^\bullet(c) \,\|\, \mathcal{H}(\Psi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2; \gamma); c) \big) \]
are $\mu_1^* = \mu_1^\bullet$, $\mu_2^* = \mu_2^\bullet - \gamma(\mu_1^\bullet - E[X_1 \mid X_1 \le c])$, $(\sigma_1^*)^2 = (\sigma^\bullet)^2$, and $(\sigma_2^*)^2 = (\sigma^\bullet)^2 + \gamma^2 \mathrm{Var}[X_1 \mid X_1 \le c]$.

Comparing Proposition 7 with the expressions for $\mu_1^*, \mu_2^*$ in Example 2 shows that the agent makes the same misinference about the means regardless of whether she knows the variances. This shows the robustness of the over-pessimism prediction in an environment where agents jointly estimate both means and variances.

Biased agents correctly estimate the first-period variance, $(\sigma_1^*)^2 = (\sigma^\bullet)^2$, but overestimate the second-period variance. The magnitude of this distortion increases in the severity of the gambler’s fallacy but decreases with the severity of the censoring, as $\mathrm{Var}[X_1 \mid X_1 \le c]$ increases in $c$ for $X_1$ Gaussian.

Here is the intuition. Whereas the objective conditional distribution of $X_2 \mid (X_1 = x_1)$ is independent of $x_1$, the agent entertains different beliefs about this distribution for different $x_1$. The agent’s inference about $\mu_2^*$ ensures her belief about $X_2 \mid X_1 = x_1$ fits the data well following “typical” realizations of $x_1$ under the censoring restriction $X_1 \le c$. But the agent continues to be surprised by streaks of bad draws in the data. To better account for these surprising observations, the agent increases the estimated conditional variance of $X_2 \mid (X_1 = x_1)$ and attributes these unexpected realizations of $X_2$ to “noise.” More “noise” is needed when $\mathrm{Var}[X_1 \mid X_1 \le c]$ is larger, for the frequency of these surprising observations depends on how much $X_1$ tends to deviate from its typical value of $E[X_1 \mid X_1 \le c]$ under the restriction $X_1 \le c$.

An equivalent formulation of this result helps interpret the distorted $(\sigma_2^*)^2$. We may write the feasible model $\Psi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2; \gamma)$ with $\sigma_2^2 \ge \sigma_1^2$ as
\[ X_1 = \mu_1 + \epsilon_1, \qquad X_2 = \mu_2 + \zeta + \epsilon_2, \]
where $\epsilon_1 \sim N(0, \sigma_1^2)$, $\epsilon_2 \mid \epsilon_1 \sim N(-\gamma \epsilon_1, \sigma_1^2)$, and $\zeta \sim N(0, \sigma_\zeta^2)$, with $\zeta$ independent of $\epsilon_1, \epsilon_2$.
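The closed forms in Proposition 7 are easy to evaluate, which makes the link between censoring and fictitious variation visible. The sketch below assumes only the standard formulas for the mean and variance of an upper-truncated Gaussian; the function names and the parameter values (true means 0, unit variance, `gamma = 0.5`) are illustrative choices, not values from the paper.

```python
import math

def truncated_moments(c, mu, sigma):
    """Mean and variance of X | X <= c for X ~ N(mu, sigma^2)."""
    a = (c - mu) / sigma
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
    h = pdf / cdf
    mean = mu - sigma * h
    var = sigma ** 2 * (1.0 - a * h - h * h)
    return mean, var

def pseudo_true_fundamentals(c, mu1_true, mu2_true, sigma_true, gamma):
    """Closed forms of Proposition 7 for censoring threshold c."""
    mean, var = truncated_moments(c, mu1_true, sigma_true)
    return {
        "mu1": mu1_true,
        "mu2": mu2_true - gamma * (mu1_true - mean),
        "var1": sigma_true ** 2,
        # Fictitious variation: the excess over the true variance is gamma^2 * Var[X1 | X1 <= c].
        "var2": sigma_true ** 2 + gamma ** 2 * var,
    }

for c in (-1.0, 0.0, 1.0):  # lower c means more severe censoring
    est = pseudo_true_fundamentals(c, 0.0, 0.0, 1.0, gamma=0.5)
    print(c, round(est["mu2"], 4), round(est["var2"], 4))
```

Moving from $c = 1$ to $c = -1$ (more severe censoring) should lower both the inferred $\mu_2$ and the inferred second-period variance, while the latter stays above the true variance of 1 throughout.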
In the context where $X_1$ and $X_2$ represent the quality realizations of two candidates from the early and late applicant pools, $\zeta$ is a vacancy-specific shock to the average quality of the second-period applicant. A positive $\sigma_\zeta^2$ means there are some vacancies for which the late applicants are an especially poor fit and some others for which they are especially suitable. Proposition 7 says that in an environment where all jobs are objectively homogeneous with respect to the fit of the late applicants, biased managers who find it possible that jobs are heterogeneous in this dimension will indeed estimate a positive amount of this heterogeneity, $\sigma_\zeta^2 > 0$, from the censored histories of their predecessors. This added heterogeneity allows agents to better rationalize histories $(X_1, X_2)$ where both candidates have unusually high/low qualities as vacancies that happen to be an especially good/bad fit for second-period applicants. That is, the biased managers reason that the realization of the vacancy-specific fixed effect, $\zeta$, must have been far from 0.

This phenomenon relates to findings in Rabin (2002) and Rabin and Vayanos (2010), who refer to exaggeration of variance under the gambler’s fallacy as fictitious variation. The key innovation of Proposition 7 is to show, in an endogenous-data setting, how the degree of fictitious variation depends on the severity of censoring. To highlight this point, I now derive two results focusing on the interplay between fictitious variation and endogenous censoring. For simplicity, I derive these results using the auxiliary large-generations environment defined in Section 4.1, where agents arrive in large generations and only infer from the histories of the immediate predecessor generation.

The first result says that when the second-period payoff $u_2(x_1, x_2)$ is a linear or convex function of $x_2$, the positive-feedback cycle from Section 4 continues to obtain: cutoffs, beliefs about fundamentals, and beliefs about variances form monotonic sequences across generations. This weak convexity includes the case of search with recall in Example 1 for any recall probability $0 \le q < 1$.

Definition 10.
The optimal-stopping problem is convex if for every $x_1 \in \mathbb{R}$, $x_2 \mapsto u_2(x_1, x_2)$ is convex, with strict convexity for $x_1$ in a positive-measure subset of $\mathbb{R}$. The optimal-stopping problem is concave if for every $x_1 \in \mathbb{R}$, $x_2 \mapsto u_2(x_1, x_2)$ is concave, with strict concavity for $x_1$ in a positive-measure subset of $\mathbb{R}$.

Proposition 8.
Suppose the optimal-stopping problem is convex. Suppose agents start with a full-support prior over $\{\Psi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2; \gamma) : \mu_1, \mu_2 \in \mathbb{R},\ \sigma_1^2, \sigma_2^2 \ge 0\}$ and society starts at the initial condition $c_{[0]} \in \mathbb{R}$. For $t \ge 1$, denote the beliefs of generation $t$ as $(\mu_{1,[t]}, \mu_{2,[t]}, \sigma^2_{1,[t]}, \sigma^2_{2,[t]})$ and their cutoff threshold as $c_{[t]}$. Then $\mu_{1,[t]} = \mu_1^\bullet$ and $\sigma^2_{1,[t]} = (\sigma^\bullet)^2$ for all $t$, while $(\mu_{2,[t]})_{t \ge 1}$, $(\sigma^2_{2,[t]})_{t \ge 1}$, and $(c_{[t]})_{t \ge 1}$ are monotonic sequences.

The intuition is straightforward. Suppose generation $t$ uses a more relaxed acceptance threshold $c_{[t]} < c_{[t-1]}$ than generation $t-1$, resulting in a more severely censored dataset. By the usual censoring effect without variance uncertainty, generation $t+1$ becomes more pessimistic about the second-period mean than generation $t$. In addition, by Proposition 7 we know that generation $t+1$ suffers less from fictitious variation than generation $t$. This implies generation $t+1$ agents would perceive less continuation value than generation $t$ agents even if they held the same beliefs about the means, for a larger variance in $X_2 \mid (X_1 = x_1)$ improves the expected payoff when continuing due to the convexity of $u_2$ in $x_2$. Combining these two forces, we deduce $c_{[t+1]} < c_{[t]}$.

The intuition just discussed shows that uncertainty about variance strengthens the monotonicity result. To be more precise, suppose $c_{[t]} < c_{[t-1]}$. Consider a hypothetical generation $t+1$ agent who dogmatically adopts generation $t$’s beliefs about variances, $\sigma^2_{1,[t]}$ and $\sigma^2_{2,[t]}$, and infers from the class of models $\{\Psi(\mu_1, \mu_2, \sigma^2_{1,[t]}, \sigma^2_{2,[t]}; \gamma) : \mu_1, \mu_2 \in \mathbb{R}\}$. Based on generation $t$’s histories, this hypothetical agent makes inferences about means and chooses a cutoff threshold, $\hat\mu_{1,[t+1]}, \hat\mu_{2,[t+1]}, \hat c_{[t+1]}$. By comparing Proposition 7 and Example 2, $\hat\mu_{1,[t+1]} = \mu_{1,[t+1]}$, $\hat\mu_{2,[t+1]} = \mu_{2,[t+1]}$, but $c_{[t+1]} < \hat c_{[t+1]} < c_{[t]}$.
That is, while the cutoff threshold of this hypothetical agent follows the monotonicity pattern in the previous two generations, $\hat c_{[t+1]} < c_{[t]} < c_{[t-1]}$, the actual generation $t+1$ cutoff is lower still, $c_{[t+1]} < \hat c_{[t+1]}$.

The second result says that, in every generation $t \ge 2$, the society with uncertainty about variances ends up with a more optimistic/pessimistic belief about the second-period mean compared with the society that knows the variances, provided the optimal-stopping problem is convex/concave. This divergence depends crucially on the endogenous-learning setting, for Proposition 7 implies that the two societies make the same inferences about the means when given the same data. Allowing uncertainty on one dimension (variance) ends up affecting society’s long-run inference in another dimension (mean).

Formally, consider two societies of agents, A and B. Agents in society A start with a full-support prior over $\{\Psi(\mu_1, \mu_2, (\sigma^\bullet)^2, (\sigma^\bullet)^2; \gamma) : \mu_1, \mu_2 \in \mathbb{R}\}$. Agents in society B start with a full-support prior over $\{\Psi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2; \gamma) : \mu_1, \mu_2 \in \mathbb{R},\ \sigma_1^2, \sigma_2^2 \ge 0\}$. Fix the same generation 0 initial condition $c_{[0]} \in \mathbb{R}$ for both societies. For $t \ge 1$, denote the beliefs of generation $t$ in society $k \in \{A, B\}$ as $(\mu_{1,[k,t]}, \mu_{2,[k,t]}, \sigma^2_{1,[k,t]}, \sigma^2_{2,[k,t]})$ and their cutoff threshold as $c_{[k,t]}$.

Proposition 9. In the first generation, $\mu_{1,[A,1]} = \mu_{1,[B,1]}$ and $\mu_{2,[A,1]} = \mu_{2,[B,1]}$. If the optimal-stopping problem is convex, then $\mu_{2,[B,t]} > \mu_{2,[A,t]}$ and $c_{[B,t]} > c_{[A,t]}$ for every $t \ge 2$. If the optimal-stopping problem is concave, then $\mu_{2,[B,t]} < \mu_{2,[A,t]}$ and $c_{[B,t]} < c_{[A,t]}$ for every $t \ge 2$.

$(X_1, X_2)$ and Uncertainty About $\gamma$

So far I have assumed that draws $(X_1, X_2)$ within the stage game are objectively independent, and that agents have a dogmatic $\gamma > 0$, interpreted as the severity of the gambler’s fallacy bias. This extension considers a joint relaxation of these two assumptions.

Suppose the true model is $(X_1, X_2) \sim \Psi(\mu_1^\bullet, \mu_2^\bullet; \gamma^\bullet)$, where possibly $\gamma^\bullet \ne 0$.
Agents jointly estimate $(\mu_1, \mu_2, \gamma) \in \mathbb{R}^3$, with a prior supported on $\mathbb{R} \times \mathbb{R} \times [\underline{\gamma}, \bar\gamma]$ where $[\underline{\gamma}, \bar\gamma]$ is a finite interval. The next result generalizes the pseudo-true fundamentals in Example 2. It shows that when $\gamma^\bullet \notin [\underline{\gamma}, \bar\gamma]$, the agent infers $\gamma^*$ equal to the boundary point of the interval that is closer to $\gamma^\bullet$. Given the estimated pseudo-true parameter $\gamma^*$, the estimates of the first- and second-period fundamentals take similar forms to those in Example 2.

Proposition 10. Suppose $\gamma^\bullet \notin [\underline{\gamma}, \bar\gamma]$. Let $\tilde\gamma = \bar\gamma$ if $\gamma^\bullet > \bar\gamma$, and $\tilde\gamma = \underline{\gamma}$ if $\gamma^\bullet < \underline{\gamma}$. The solution of the KL-divergence minimization problem
\[ \min_{\mu_1, \mu_2 \in \mathbb{R},\ \gamma \in [\underline{\gamma}, \bar\gamma]} D_{KL}\big( \mathcal{H}(\Psi(\mu_1^\bullet, \mu_2^\bullet; \gamma^\bullet); c) \,\|\, \mathcal{H}(\Psi(\mu_1, \mu_2; \gamma); c) \big) \]
is given by $\mu_1^*(c) = \mu_1^\bullet$, $\mu_2^*(c) = \mu_2^\bullet + (\gamma^\bullet - \tilde\gamma) \cdot (\mu_1^\bullet - E_{\Psi^\bullet}[X_1 \mid X_1 \le c])$, and $\gamma^*(c) = \tilde\gamma$.

Intuitively, we may expect the closest distance (in the KL divergence sense) from the set of feasible models $\{\Psi(\mu_1, \mu_2; \gamma) : \mu_1, \mu_2 \in \mathbb{R}\}$ to the objective distribution $\Psi(\mu_1^\bullet, \mu_2^\bullet; \gamma^\bullet)$ to decrease as $|\gamma - \gamma^\bullet|$ shrinks. Proposition 10 confirms this intuition, showing that the pseudo-true model from the set $\{\Psi(\mu_1, \mu_2; \gamma) : \mu_1, \mu_2 \in \mathbb{R},\ \gamma \in [\underline{\gamma}, \bar\gamma]\}$ lies in the subset $\{\Psi(\mu_1, \mu_2; \gamma) : \mu_1, \mu_2 \in \mathbb{R},\ \gamma = \tilde\gamma\}$, where $\tilde\gamma$ is the closest point (in the Euclidean sense) to $\gamma^\bullet$ in the interval $[\underline{\gamma}, \bar\gamma]$.

When $\gamma^\bullet = 0$ and $\underline{\gamma} > 0$, this result shows that over-pessimism in inference is robust to agents learning the correlation of $X_1$ and $X_2$, provided the support of their uncertainty about correlation lies to the left of 0 and excludes 0. In this case, it is also easy to see that the learning dynamics in the large-generations auxiliary environment are the same as when agents start with a dogmatic belief in $\gamma = \underline{\gamma}$.

$\mu_1 = \mu_2$

I now consider the special case where the true fundamentals are time-invariant, $\mu_1^\bullet = \mu_2^\bullet = \mu^\bullet \in \mathbb{R}$. If agents’ feasible fundamentals are $\mathcal{M} = \mathbb{R}^2$ as in Remark 1(a), then Proposition 2 continues to apply.
But now suppose agents know the fundamentals are time-invariant and only have uncertainty over this common value, so the set of feasible fundamentals is the diagonal $\mathcal{M} = \{(x, x) : x \in \mathbb{R}\}$, as in Remark 1(d). I investigate inference in this setting when agents’ prior belief over feasible models is supported on $\{\Psi(\mu, \mu; \gamma) : \mu \in \mathbb{R}\}$.

Let $\mu^*_\triangle(c) \in \mathbb{R}$ stand for the common fundamental that minimizes the KL divergence relative to the history distribution $\mathcal{H}^\bullet(c)$, that is,
\[ \mu^*_\triangle(c) := \arg\min_{\mu \in \mathbb{R}} D_{KL}\big( \mathcal{H}^\bullet(c) \,\|\, \mathcal{H}(\Psi(\mu, \mu; \gamma); c) \big). \]
The next result characterizes $\mu^*_\triangle(c)$.

Proposition 11.
\[ \mu^*_\triangle(c) = \frac{1}{1 + P[X_1 \le c] \cdot (1+\gamma)^2}\, \mu_1^\circ(c) + \frac{P[X_1 \le c] \cdot (1+\gamma)^2}{1 + P[X_1 \le c] \cdot (1+\gamma)^2}\, \mu_2^\circ(c), \]
where $\mu_1^\circ(c) = \mu^\bullet$ and $\mu_2^\circ(c) = \mu^\bullet - \frac{\gamma}{1+\gamma}\left( \mu^\bullet - E[X_1 \mid X_1 \le c] \right)$.

Agents face two kinds of data about the common fundamental: observations of first-period draws and observations of second-period draws. Feasible models $\Psi(\mu_1^\circ(c), \mu_1^\circ(c); \gamma)$ and $\Psi(\mu_2^\circ(c), \mu_2^\circ(c); \gamma)$ minimize the KL divergence of these two kinds of data, respectively. The overall KL-divergence minimizing estimator is a certain convex combination of these two points. Through the term $P[X_1 \le c]$, the relative weight given to $\mu_2^\circ(c)$ increases as the cutoff $c$ increases, because the second-period data is observed more often if the dataset of histories is censored with a higher cutoff in the first period.

For any censoring threshold $c$ generating the history distribution, agents underestimate the common fundamental. We have $\mu_2^\circ(c) < \mu^\bullet$ while $\mu_1^\circ(c) = \mu^\bullet$. This shows the robustness of the over-pessimism result from the setting with $\mathcal{M} = \mathbb{R}^2$. However, the extent of over-pessimism about $\mu_2$ is dampened relative to agents who can flexibly estimate different $\mu_1$ and $\mu_2$ for the two periods. Compared with the unconstrained pseudo-true fundamentals from Example 2, we have $\mu_2^\circ(c) > \mu_2^*(c)$ since $\frac{\gamma}{1+\gamma} < \gamma$, hence $\mu^*_\triangle(c) > \mu_2^*(c)$.
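A short numerical sketch may help fix ideas about the constrained estimator. It reads Proposition 11 as a convex combination that puts weight $w = P[X_1 \le c](1+\gamma)^2 / (1 + P[X_1 \le c](1+\gamma)^2)$ on $\mu_2^\circ(c)$ and weight $1 - w$ on $\mu_1^\circ(c)$, and it evaluates this expression in the Gaussian case with unit variance. The common true mean of 0 and `gamma = 0.5` are illustrative assumptions, and the function names are hypothetical.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def common_fundamental(c, mu_true, gamma):
    """Constrained pseudo-true mean when mu1 = mu2 is imposed (unit variance)."""
    a = c - mu_true
    p = normal_cdf(a)                          # P[X1 <= c]
    trunc_mean = mu_true - normal_pdf(a) / p   # E[X1 | X1 <= c]
    mu1_circ = mu_true
    mu2_circ = mu_true - gamma / (1.0 + gamma) * (mu_true - trunc_mean)
    w = p * (1.0 + gamma) ** 2 / (1.0 + p * (1.0 + gamma) ** 2)
    return (1.0 - w) * mu1_circ + w * mu2_circ

def unconstrained_mu2(c, mu_true, gamma):
    """Pseudo-true second-period mean when mu1 and mu2 are estimated separately."""
    a = c - mu_true
    trunc_mean = mu_true - normal_pdf(a) / normal_cdf(a)
    return mu_true - gamma * (mu_true - trunc_mean)

for c in (-1.0, 0.0, 1.0):
    print(c, round(common_fundamental(c, 0.0, 0.5), 4),
          round(unconstrained_mu2(c, 0.0, 0.5), 4))
```

For each cutoff, the constrained estimate should lie strictly between the unconstrained $\mu_2^*(c)$ and the true value 0, which is the dampening of over-pessimism described in the text.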
This makes intuitive sense: when unconstrained, agents come to two different beliefs about $\mu_1$ and $\mu_2$, even though they are objectively the same. They hold correct beliefs about $\mu_1$ but pessimistic beliefs about $\mu_2$. When constrained to a common inference across two fundamentals, agents distort their belief about $\mu_1$ downwards and their belief about $\mu_2$ upwards, relative to the unconstrained environment.

[Footnote: Note that $\mu_2^\circ(c)$ differs from the pseudo-true fundamental $\mu_2^*(c)$ from Example 2. The estimator $\mu_2^\circ(c)$ minimizes the KL divergence of second-period draws under the constraint that the same fundamental must be inferred for both periods, whereas $\mu_2^*(c)$ minimizes this divergence when the first-period fundamental is fixed at its true value, $\mu_1^\bullet$.]

This paper studies endogenous learning dynamics of misspecified agents. I focus on the gambler’s fallacy, a non-self-confirming misspecification where no feasible beliefs of the biased agents can exactly match the data. In natural optimal-stopping problems, agents tend to stop after “good enough” early draws, where the threshold for “good enough” evolves as agents update their beliefs about the underlying distributions. Stopping decisions thus impose an endogenous censoring effect on histories, which in turn affects the beliefs of subsequent agents. The statistical bias interacts with data censoring, generating over-pessimism about the fundamentals and resulting in stopping too early in the long run. These asymptotic mistakes are driven by a positive-feedback loop between distorted beliefs and distorted behavior.

I have studied a particular behavioral bias (the gambler’s fallacy) in a natural environment where censoring happens (histories in optimal-stopping problems). The key mechanism I highlight, the interaction between data censoring and bias, applies more broadly and delivers different predictions in different contexts. For example, the same mechanism would lead to an over-estimation of $\mu_2$ if the agents believe in some $\gamma < 0$.
I leave the interaction of other kinds of behavioral learning with other censoring mechanisms to future work.

References

Amemiya, T. (1985): Advanced Econometrics, Harvard University Press.
Andrews, D. W. (1992): “Generic uniform convergence,” Econometric Theory, 8, 241–257.
Bar-Hillel, M. and W. A. Wagenaar (1991): “The perception of randomness,” Advances in Applied Mathematics, 12, 428–454.
Barron, G. and S. Leider (2010): “The role of experience in the gambler’s fallacy,” Journal of Behavioral Decision Making, 23, 117–129.
Benjamin, D. J., D. A. Moore, and M. Rabin (2017): “Biased Beliefs About Random Samples: Evidence from Two Integrated Experiments,” Working Paper.
Berk, R. H. (1966): “Limiting behavior of posterior distributions when the model is incorrect,” Annals of Mathematical Statistics, 37, 51–58.
Bohren, J. A. (2016): “Informational herding with model misspecification,” Journal of Economic Theory, 163, 222–247.
Bohren, J. A. and D. Hauser (2018): “Bounded Rationality And Learning: A Framework and A Robustness Result,” Working Paper.
Bromiley, P. (2018): “Products and Convolutions of Gaussian Probability Density Functions,” Working Paper.
Bunke, O. and X. Milhaud (1998): “Asymptotic behavior of Bayes estimates under possibly incorrect models,” Annals of Statistics, 26, 617–644.
Camerer, C. F. (1987): “Do biases in probability judgment matter in markets? Experimental evidence,” American Economic Review, 77, 981–997.
Chen, D. L., T. J. Moskowitz, and K. Shue (2016): “Decision making under the gambler’s fallacy: Evidence from asylum judges, loan officers, and baseball umpires,” Quarterly Journal of Economics, 131, 1181–1242.
Cox, D. R. (1962): Renewal Theory, Methuen.
Croson, R. and J. Sundali (2005): “The gambler’s fallacy and the hot hand: Empirical data from casinos,” Journal of Risk and Uncertainty, 30, 195–209.
Crowder, M. J. (2001): Classical Competing Risks, Chapman and Hall/CRC.
Enke, B. (2019): “What you see is all there is,” Working Paper.
Esponda, I. and D. Pouzo (2016): “Berk–Nash equilibrium: A framework for modeling agents with misspecified models,” Econometrica, 84, 1093–1130.
Esponda, I., D. Pouzo, and Y. Yamamoto (2019): “Asymptotic Behavior of Bayesian Learners with Misspecified Models,” In Preparation.
Eyster, E. and M. Rabin (2010): “Naive herding in rich-information settings,” American Economic Journal: Microeconomics, 2, 221–243.
Frick, M., R. Iijima, and Y. Ishii (2019): “Misinterpreting Others and the Fragility of Social Learning,” Working Paper.
Fudenberg, D., G. Romanyuk, and P. Strack (2017): “Active learning with a misspecified prior,” Theoretical Economics, 12, 1155–1189.
Gagnon-Bartsch, T., M. Rabin, and J. Schwartzstein (2018): “Channeled Attention and Stable Errors,” Working Paper.
Guarino, A. and P. Jehiel (2013): “Social Learning with Coarse Inference,” American Economic Journal: Microeconomics, 5, 147–74.
Gumbel, E. J. (1960): “Bivariate exponential distributions,” Journal of the American Statistical Association, 55, 698–707.
Heckman, J. J. and B. E. Honoré (1989): “The identifiability of the competing risks model,” Biometrika, 76, 325–330.
Heidhues, P., B. Koszegi, and P. Strack (2018): “Unrealistic expectations and misguided learning,” Econometrica, 86, 1159–1214.
Heidhues, P., B. Koszegi, and P. Strack (2019): “Convergence in Misspecified Learning Models with Endogenous Actions,” Working Paper.
Jehiel, P. (2018): “Investment strategy and selection bias: An equilibrium perspective on overoptimism,” American Economic Review, 108, 1582–97.
Mueller, A. I., J. Spinnewijn, and G. Topa (2018): “Job Seekers’ Perceptions and Employment Prospects: Heterogeneity, Duration Dependence and Bias,” Working Paper.
Narayanan, S. and P. Manchanda (2012): “An empirical analysis of individual level casino gambling behavior,” Quantitative Marketing and Economics, 10, 27–62.
Nyarko, Y. (1991): “Learning in mis-specified models and the possibility of cycles,” Journal of Economic Theory, 55, 416–427.
Rabin, M. (2002): “Inference by believers in the law of small numbers,” Quarterly Journal of Economics, 117, 775–816.
Rabin, M. and D. Vayanos (2010): “The gambler’s and hot-hand fallacies: Theory and applications,” Review of Economic Studies, 77, 730–778.
Suetens, S., C. B. Galbo-Jørgensen, and J.-R. Tyran (2016): “Predicting lotto numbers: a natural experiment on the gambler’s fallacy and the hot-hand fallacy,” Journal of the European Economic Association, 14, 584–607.
Terrell, D. (1994): “A test of the gambler’s fallacy: Evidence from pari-mutuel games,” Journal of Risk and Uncertainty, 8, 309–317.
Tsiatis, A. (1975): “A nonidentifiability aspect of the problem of competing risks,” Proceedings of the National Academy of Sciences, 72, 20–22.

Appendix

A Proofs of Results in Sections 2, 3, and 4

A.1 Proof of Claim 1

Proof. For Example 1, clearly $u_1$ and $u_2$ are strictly increasing functions of $x_1$ and $x_2$ respectively. We also have that $|u_2(x_1', \bar x_2) - u_2(x_1'', \bar x_2)| \le q(x_1' - x_1'')$ for $x_1' > x_1''$ and any $\bar x_2$, while $u_1'(x_1) = 1$. This shows Assumption 1(b) holds. If $x_1 > 0$ and $x_2 < 0$, then $u_2(x_1, x_2) = q \cdot x_1 + (1-q) x_2 < x_1 = u_1(x_1)$, and conversely $x_1 < 0$, $x_2 > 0$ imply $u_2(x_1, x_2) > u_1(x_1)$. This shows Assumption 1(c) holds. It is clear that $u_1, u_2$ are continuous. Also, $|u_2(\bar x_1, x_2 + \bar k)| \le q(|\bar x_1| + |x_2 + \bar k|) + (1-q)|x_2 + \bar k| \le q|\bar x_1| + |\bar k| + |x_2|$. Since the objective distribution satisfies $E(|X_2|) < \infty$, we have $E(|u_2(\bar x_1, X_2 + \bar k)|) \le q|\bar x_1| + |\bar k| + E(|X_2|) < \infty$. This shows Assumption 1(d) holds.

A.2 Proofs of Proposition 1 and Lemma 2

The argument behind Proposition 1 consists of three lemmas (A.1, A.3, and A.4) that correspond to the three statements in the proposition. Along the way, I will also prove Lemma 2.

A.2.1 The Optimal Strategy Has a Cutoff Form

In the first part, I establish Lemma A.1.

Lemma A.1.
Under Assumption 1 and the feasible model $\Psi(\mu_1, \mu_2; \gamma)$ for any $\gamma > 0$, there exists a unique cutoff threshold $C(\mu_1, \mu_2; \gamma)$ such that: (i) the agent strictly prefers stopping after any $x_1 > C(\mu_1, \mu_2; \gamma)$; (ii) the agent is indifferent between continuing and stopping after $x_1 = C(\mu_1, \mu_2; \gamma)$; (iii) the agent strictly prefers continuing after any $x_1 < C(\mu_1, \mu_2; \gamma)$.

Suppose $X_1 = x_1$. Consider the payoff difference between accepting it and continuing under the feasible model $\Psi(\mu_1, \mu_2; \gamma)$ for $\gamma \ge 0$,
\[ D(x_1; \mu_1, \mu_2, \gamma) := u_1(x_1) - E_{\Psi(\mu_1, \mu_2; \gamma)}[u_2(x_1, X_2) \mid X_1 = x_1]. \]
I abbreviate this as $D(x_1)$ when $\Psi$ is fixed. Lemma A.2 summarizes some properties of $D$. (The proofs of some technical results stated in the Appendix, like Lemma A.2, appear in the Online Appendix.)

Lemma A.2. $D$ is strictly increasing and continuous. If $\gamma > 0$, then there are $x_1' < x_1''$ so that $D(x_1') < 0 < D(x_1'')$.

Lemma A.1 follows readily from Lemma A.2.

Proof. Applying Lemma A.2 and using the fact that $\gamma > 0$, $D$ changes sign and is strictly increasing and continuous. So, there exists a unique $c^* \in \mathbb{R}$ satisfying $D(c^*) = 0$. It is clear that the best stopping strategy under $\Psi$ is the cutoff strategy that stops after every $x_1 > c^*$ and continues after every $x_1 < c^*$. This establishes property (ii) of the optimal strategy. Properties (i) and (iii) follow from the fact that $D$ is strictly increasing.

A.2.2 Cutoff Threshold Increasing in $\mu_2$

In the second part, I prove the lemma:

Lemma A.3. Under Assumption 1, for any $\mu_1 \in \mathbb{R}$ and $\gamma > 0$, the indifference threshold $C(\mu_1, \mu_2; \gamma)$ is strictly increasing in $\mu_2$.

Proof. Let $\hat\mu_1, \hat\mu_2, \hat{\hat\mu}_2 \in \mathbb{R}$ with $\hat{\hat\mu}_2 > \hat\mu_2$.
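I show that $C(\hat{\mu}_1, \hat{\mu}_2; \gamma) < C(\hat{\mu}_1, \hat{\hat{\mu}}_2; \gamma)$. By Lemma A.1, the threshold $C(\hat{\mu}_1, \hat{\mu}_2; \gamma)$ is characterized by the indifference condition,

$$u_1(C(\hat{\mu}_1, \hat{\mu}_2; \gamma)) = E_{\tilde{X}_2 \sim f(\cdot \mid \hat{\mu}_2 - \gamma(C(\hat{\mu}_1, \hat{\mu}_2; \gamma) - \hat{\mu}_1))}[u_2(C(\hat{\mu}_1, \hat{\mu}_2; \gamma), \tilde{X}_2)].$$

But if the agent were to instead believe $(\hat{\mu}_1, \hat{\hat{\mu}}_2)$ where $\hat{\hat{\mu}}_2 > \hat{\mu}_2$, then the conditional distribution of $X_2$ given $X_1 = C(\hat{\mu}_1, \hat{\mu}_2; \gamma)$ would be $f(\cdot \mid \hat{\hat{\mu}}_2 - \gamma(C(\hat{\mu}_1, \hat{\mu}_2; \gamma) - \hat{\mu}_1))$. We have

$$u_1(C(\hat{\mu}_1, \hat{\mu}_2; \gamma)) < E_{\tilde{X}_2 \sim f(\cdot \mid \hat{\hat{\mu}}_2 - \gamma(C(\hat{\mu}_1, \hat{\mu}_2; \gamma) - \hat{\mu}_1))}[u_2(C(\hat{\mu}_1, \hat{\mu}_2; \gamma), \tilde{X}_2)]$$

by Assumption 1(a). This means $C(\hat{\mu}_1, \hat{\mu}_2; \gamma) < C(\hat{\mu}_1, \hat{\hat{\mu}}_2; \gamma)$ by Lemma A.1, as only values of $X_1$ below $C(\hat{\mu}_1, \hat{\hat{\mu}}_2; \gamma)$ lead to strict preference for continuing.

A.2.3 Proof of Lemma 2

Proof. In fact, this lemma holds for any $\mu_1 \in \mathbb{R}$. I first prove this for Assumption 3(a). For $\mu_2'' > \mu_2'$, write the corresponding optimal cutoffs as $c'' := C(\mu_1, \mu_2''; \gamma)$ and $c' := C(\mu_1, \mu_2'; \gamma)$. I show that $|c'' - c'| < \ell |\mu_2'' - \mu_2'|$.

Under the model $\Psi(\mu_1, \mu_2''; \gamma)$, the expected continuation payoff after $X_1 = c' + \ell(\mu_2'' - \mu_2')$ is

$$E_{\tilde{\tilde{X}}_2 \sim f(\cdot \mid \mu_2'' - \gamma(c' + \ell(\mu_2'' - \mu_2') - \mu_1))}[u_2(c' + \ell(\mu_2'' - \mu_2'), \tilde{\tilde{X}}_2)]$$
$$= E_{\tilde{X}_2 \sim f(\cdot \mid \mu_2' - \gamma(c' - \mu_1))}[u_2(c' + \ell(\mu_2'' - \mu_2'), \tilde{X}_2 + (\mu_2'' - \mu_2') - \gamma\ell(\mu_2'' - \mu_2'))]$$
$$= E_{\tilde{X}_2 \sim f(\cdot \mid \mu_2' - \gamma(c' - \mu_1))}[u_2(c' + \ell d, \tilde{X}_2 + (1 - \gamma\ell)d)]$$

where we put $d = |\mu_2'' - \mu_2'| > 0$. From Assumption 3(a), for every $x_2 \in \mathbb{R}$, $u_2(c' + \ell d, x_2 + (1 - \gamma\ell)d) - u_2(c', x_2) < u_1(c' + \ell d) - u_1(c')$. This means

$$E_{\tilde{X}_2 \sim f(\cdot \mid \mu_2' - \gamma(c' - \mu_1))}[u_2(c' + \ell d, \tilde{X}_2 + (1 - \gamma\ell)d) - u_2(c', \tilde{X}_2)] < u_1(c' + \ell d) - u_1(c'),$$

that is,

$$E_{\tilde{X}_2 \sim f(\cdot \mid \mu_2' - \gamma(c' - \mu_1))}[u_2(c' + \ell d, \tilde{X}_2 + (1 - \gamma\ell)d)] - u_1(c' + \ell d) < E_{\tilde{X}_2 \sim f(\cdot \mid \mu_2' - \gamma(c' - \mu_1))}[u_2(c', \tilde{X}_2)] - u_1(c').$$

The cutoff $c'$ satisfies the indifference condition, $u_1(c') = E_{\tilde{X}_2 \sim f(\cdot \mid \mu_2' - \gamma(c' - \mu_1))}[u_2(c', \tilde{X}_2)]$, so RHS is 0.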
But LHS is the difference between expected continuation payoff and stopping payoff at $X_1 = c' + \ell(\mu_2'' - \mu_2')$ under the model $\Psi(\mu_1, \mu_2''; \gamma)$, which shows the agent strictly prefers stopping. This means $c'' < c' + \ell(\mu_2'' - \mu_2')$. But $\mu_2 \mapsto C(\mu_1, \mu_2; \gamma)$ is increasing by Lemma A.3, which means $c'' > c'$. Together, these two inequalities imply $|c'' - c'| < \ell(\mu_2'' - \mu_2')$.

Now, replace Assumption 3(a) with Assumption 3(b). By Lipschitz continuity of $u_2$, suppose $|u_2(x_2') - u_2(x_2'')| < L \cdot |x_2' - x_2''|$ for some $L > 0$ and all $x_2', x_2'' \in \mathbb{R}$. Let $\beta = \min\left(\frac{\epsilon/\gamma}{L\gamma + \epsilon}, \frac{1}{\gamma}\right)$ and put $\ell = \frac{1}{\gamma} - \beta$, so $0 < \ell < \frac{1}{\gamma}$. Take any $\Delta > 0$ and write $c = C(\mu_1, \mu_2, \gamma)$. I show that $C(\mu_1, \mu_2 + \Delta, \gamma) < c + \ell\Delta$.

We have

$$u_1(c + \ell\Delta) - u_1(c) > \left(\frac{1}{\gamma} - \beta\right)\epsilon\Delta \geq \left(\frac{1}{\gamma} - \frac{\epsilon/\gamma}{L\gamma + \epsilon}\right)\epsilon\Delta$$

and

$$E_{\Psi(\mu_1, \mu_2 + \Delta; \gamma)}[u_2(X_2) \mid X_1 = c + \ell\Delta] - E_{\Psi(\mu_1, \mu_2; \gamma)}[u_2(X_2) \mid X_1 = c] \leq L \cdot (\Delta - \ell\Delta\gamma) = \Delta L \gamma \beta \leq \Delta L \gamma \cdot \frac{\epsilon/\gamma}{L\gamma + \epsilon}.$$

By simple algebra, $\left(\frac{1}{\gamma} - \frac{\epsilon/\gamma}{L\gamma + \epsilon}\right)\epsilon\Delta = \Delta L \gamma \cdot \frac{\epsilon/\gamma}{L\gamma + \epsilon}$. Since $u_1(c) = E_{\Psi(\mu_1, \mu_2, \gamma)}[u_2(X_2) \mid X_1 = c]$, we conclude $u_1(c + \ell\Delta) > E_{\Psi(\mu_1, \mu_2 + \Delta, \gamma)}[u_2(X_2) \mid X_1 = c + \ell\Delta]$. By Lemma A.1, this implies $c + \ell\Delta > C(\mu_1, \mu_2 + \Delta, \gamma)$.

A.2.4 Lipschitz Continuity with Constant $1/\gamma$

Now I prove the lemma:

Lemma A.4. Under Assumption 1, for every $\gamma > 0$ and $\mu_1 \in \mathbb{R}$, $\mu_2 \mapsto C(\mu_1, \mu_2; \gamma)$ is Lipschitz continuous with Lipschitz constant $1/\gamma$.

Proof. The proof of Lemma 2 also applies when $\ell = \frac{1}{\gamma}$, which implies that when the inequality in Assumption 3(a) is satisfied with $\ell = \frac{1}{\gamma}$, $\mu_2 \mapsto C(\mu_1, \mu_2; \gamma)$ is Lipschitz continuous with Lipschitz constant $1/\gamma$. But this reduces the inequality to $u_1(x_1 + \frac{1}{\gamma}d) - u_1(x_1) \geq u_2(x_1 + \frac{1}{\gamma}d, x_2) - u_2(x_1, x_2)$, which is true by Assumption 1(b).

A.3 Proof of Claim 2

Proof.
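For $d > 0$, $u_1(x_1 + \frac{1}{1+\gamma}d) - u_1(x_1) = \frac{1}{1+\gamma}d$, while

$$u_2\left(x_1 + \tfrac{1}{1+\gamma}d,\; x_2 + \left(1 - \tfrac{\gamma}{1+\gamma}\right)d\right) - u_2(x_1, x_2)$$
$$= u_2\left(x_1 + \tfrac{1}{1+\gamma}d,\; x_2 + \tfrac{1}{1+\gamma}d\right) - u_2(x_1, x_2)$$
$$= q\max\left(x_1 + \tfrac{1}{1+\gamma}d,\; x_2 + \tfrac{1}{1+\gamma}d\right) + (1-q)\left(x_2 + \tfrac{1}{1+\gamma}d\right) - q\max(x_1, x_2) - (1-q)x_2$$
$$= q\tfrac{1}{1+\gamma}d + (1-q)\tfrac{1}{1+\gamma}d = \tfrac{1}{1+\gamma}d.$$

So with $\ell = \frac{1}{1+\gamma}$, we have $u_1(x_1 + \ell d) - u_1(x_1) = u_2(x_1 + \ell d, x_2 + (1 - \gamma\ell)d) - u_2(x_1, x_2)$ for every $x_1, x_2 \in \mathbb{R}$, $d > 0$.

A.4 Proof of Proposition 2

A.4.1 Preliminary Definitions and Lemmas

I first require some preliminary definitions and lemmas.

The first result says for any censoring threshold $c \in \mathbb{R} \cup \{\infty\}$, KL divergence cannot be minimized at $(\mu_1, \mu_2)$ where $\mu_1 \neq \mu_1^\bullet$.

Lemma A.5. For every $\gamma > 0$, $c \in \mathbb{R} \cup \{\infty\}$, $\mu_1, \mu_2 \in \mathbb{R}$ with $\mu_1 \neq \mu_1^\bullet$, we have

$$D_{KL}(\mathcal{H}^\bullet(c) \,\|\, \mathcal{H}(\Psi(\mu_1, \mu_2; \gamma); c)) > D_{KL}(\mathcal{H}^\bullet(c) \,\|\, \mathcal{H}(\Psi(\mu_1^\bullet, \mu_2 - \gamma(\mu_1^\bullet - \mu_1); \gamma); c)).$$

Lemma A.5 shows that solutions to the KL divergence minimization problem (if any exist) can only take the form $(\mu_1^\bullet, \mu_2)$ for some $\mu_2 \in \mathbb{R}$. Thus motivated, we define

$$L(\mu_2 \mid x_1) := \int_{-\infty}^{\infty} f(x_2 \mid \mu_2^\bullet) \ln[f(x_2 \mid \mu_2 - \gamma(x_1 - \mu_1^\bullet))]\, dx_2,$$

the expected log-likelihood of second-period data under the fundamentals $(\mu_1^\bullet, \mu_2)$ and after the realization $X_1 = x_1$. Also, put

$$\bar{L}(\mu_2 \mid c) := \int_{-\infty}^{c} f(x_1 \mid \mu_1^\bullet) \cdot L(\mu_2 \mid x_1)\, dx_1$$

for $c \in \mathbb{R} \cup \{\infty\}$. Note $\bar{L}(\mu_2 \mid c)$ is $-D_{KL}(\mathcal{H}^\bullet(c) \,\|\, \mathcal{H}(\Psi(\mu_1^\bullet, \mu_2; \gamma); c))$ up to a constant not depending on $\mu_2$.

I establish some properties of $L$ and $\bar{L}$ that will be used in the remainder of the proof.

Lemma A.6. For all $x_1, \mu_2 \in \mathbb{R}$, $L(\mu_2 \mid x_1) = L(\mu_2^\bullet \mid x_1 - \frac{1}{\gamma}(\mu_2 - \mu_2^\bullet))$, so $\frac{\partial}{\partial \mu_2} L(\mu_2 \mid x_1) = -\frac{1}{\gamma}\frac{\partial}{\partial x_1} L(\mu_2^\bullet \mid x_1 - \frac{1}{\gamma}(\mu_2 - \mu_2^\bullet))$.

Proof. Follows easily from the definition of $L(\mu_2 \mid x_1)$.

Lemma A.7. For every $\mu_2 \in \mathbb{R}$, $L(\mu_2 \mid \cdot)$ is strictly concave. For every $x_1 \in \mathbb{R}$, $L(\cdot \mid x_1)$ is strictly concave. For every $c \in \mathbb{R} \cup \{\infty\}$, $\bar{L}(\cdot \mid c)$ is strictly concave. Finally, $\frac{\partial^2}{\partial x_1 \partial \mu_2} L(\mu_2 \mid x_1) > 0$.

Finally, I note a convenient property of strict log-concavity.

Lemma A.8.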
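If $f(x) > 0$ is strictly log concave, then for any $K > 0$, $x \mapsto \frac{f(x+K)}{f(x)}$ is strictly decreasing.

A.4.2 Existence and Uniqueness of KL Divergence Minimizers

If $\mu_2^* \in \mathbb{R}$ satisfies the FOC $\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^* \mid c) = 0$, then $(\mu_1^\bullet, \mu_2^*)$ is the unique KL divergence minimizer across all $\mathbb{R}^2$. This is because $\mu_2^*$ satisfies the FOC in minimizing $D_{KL}(\mathcal{H}^\bullet(c) \,\|\, \mathcal{H}(\Psi(\mu_1^\bullet, \mu_2; \gamma); c))$ across $\mu_2 \in \mathbb{R}$, a strictly convex objective function by the third statement in Lemma A.7. Furthermore, $D_{KL}(\mathcal{H}^\bullet(c) \,\|\, \mathcal{H}(\Psi(\mu_1^\bullet, \mu_2^*; \gamma); c)) < D_{KL}(\mathcal{H}^\bullet(c) \,\|\, \mathcal{H}(\Psi(\mu_1, \mu_2; \gamma); c))$ for any $\mu_1 \neq \mu_1^\bullet$, $\mu_2 \in \mathbb{R}$ by Lemma A.5.

The next lemma shows the FOC has a solution at $\mu_2 = \mu_2^\bullet$ when $c = \infty$.

Lemma A.9. $\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^\bullet \mid \infty) = 0$.

Proof. I first show $L(\mu_2^\bullet \mid x_1)$ is symmetric around $x_1 = \mu_1^\bullet$. Suppose $x_h - \mu_1^\bullet = \mu_1^\bullet - x_l > 0$. Then, $L(\mu_2^\bullet \mid x_h)$ is:

$$\int_{-\infty}^{\infty} f(x_2 \mid \mu_2^\bullet) \ln[f(x_2 \mid \mu_2^\bullet - \gamma(x_h - \mu_1^\bullet))]\, dx_2$$
$$= \int_{-\infty}^{\mu_2^\bullet} f(x_2 \mid \mu_2^\bullet) \ln[f(x_2 \mid \mu_2^\bullet - \gamma(x_h - \mu_1^\bullet))]\, dx_2 + \int_{\mu_2^\bullet}^{\infty} f(x_2 \mid \mu_2^\bullet) \ln[f(x_2 \mid \mu_2^\bullet - \gamma(x_h - \mu_1^\bullet))]\, dx_2$$
$$= \int_{-\infty}^{\mu_2^\bullet} f(x_2 \mid \mu_2^\bullet) \ln[f(x_2 + \gamma(x_h - \mu_1^\bullet) \mid \mu_2^\bullet)]\, dx_2 + \int_{\mu_2^\bullet}^{\infty} f(x_2 \mid \mu_2^\bullet) \ln[f(x_2 + \gamma(x_h - \mu_1^\bullet) \mid \mu_2^\bullet)]\, dx_2.$$

Using the symmetry of $f(\cdot \mid \mu_2^\bullet)$ around $\mu_2^\bullet$, let $\tilde{g}(y) = f(\mu_2^\bullet + y \mid \mu_2^\bullet) = f(\mu_2^\bullet - y \mid \mu_2^\bullet)$ for $y \geq 0$. Let $y = \mu_2^\bullet - x_2$ in the first integral and $y = x_2 - \mu_2^\bullet$ in the second integral in the sum. We then get

$$L(\mu_2^\bullet \mid x_h) = \int_0^{\infty} \tilde{g}(y)\Big(\ln[f(\mu_2^\bullet - y + \gamma(x_h - \mu_1^\bullet) \mid \mu_2^\bullet)] + \ln[f(\mu_2^\bullet + y + \gamma(x_h - \mu_1^\bullet) \mid \mu_2^\bullet)]\Big)\, dy.$$

Analogous argument shows

$$L(\mu_2^\bullet \mid x_l) = \int_0^{\infty} \tilde{g}(y)\Big(\ln[f(\mu_2^\bullet - y + \gamma(x_l - \mu_1^\bullet) \mid \mu_2^\bullet)] + \ln[f(\mu_2^\bullet + y + \gamma(x_l - \mu_1^\bullet) \mid \mu_2^\bullet)]\Big)\, dy.$$

For every $y \geq 0$, we have $|[\mu_2^\bullet - y + \gamma(x_l - \mu_1^\bullet)] - [\mu_2^\bullet]| = |\mu_2^\bullet + y + \gamma(x_h - \mu_1^\bullet) - [\mu_2^\bullet]|$ since $x_l - \mu_1^\bullet = -(x_h - \mu_1^\bullet)$. As $f(\cdot \mid \mu_2^\bullet)$ is symmetric about $\mu_2^\bullet$, this shows $\ln[f(\mu_2^\bullet - y + \gamma(x_l - \mu_1^\bullet) \mid \mu_2^\bullet)] = \ln[f(\mu_2^\bullet + y + \gamma(x_h - \mu_1^\bullet) \mid \mu_2^\bullet)]$.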
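A similar symmetry argument shows $\ln[f(\mu_2^\bullet + y + \gamma(x_l - \mu_1^\bullet) \mid \mu_2^\bullet)] = \ln[f(\mu_2^\bullet - y + \gamma(x_h - \mu_1^\bullet) \mid \mu_2^\bullet)]$ for all $y \geq 0$. Hence we conclude $L(\mu_2^\bullet \mid x_h) = L(\mu_2^\bullet \mid x_l)$.

To finish the argument, apply the second statement in Lemma A.6 to get:

$$\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^\bullet \mid \infty) = \int_{-\infty}^{\infty} f(x_1 \mid \mu_1^\bullet) \cdot \left(-\frac{1}{\gamma}\right) \cdot \frac{\partial L}{\partial x_1}\left(\mu_2^\bullet \,\Big|\, x_1 - \frac{1}{\gamma}(\mu_2^\bullet - \mu_2^\bullet)\right) dx_1$$
$$= -\frac{1}{\gamma}\int_{-\infty}^{\infty} f(x_1 \mid \mu_1^\bullet)\frac{\partial L}{\partial x_1}(\mu_2^\bullet \mid x_1)\, dx_1$$
$$= -\frac{1}{\gamma}\left(\int_{-\infty}^{\mu_1^\bullet} f(x_1 \mid \mu_1^\bullet)\frac{\partial L}{\partial x_1}(\mu_2^\bullet \mid x_1)\, dx_1 + \int_{\mu_1^\bullet}^{\infty} f(x_1 \mid \mu_1^\bullet)\frac{\partial L}{\partial x_1}(\mu_2^\bullet \mid x_1)\, dx_1\right).$$

By symmetry of $x_1 \mapsto L(\mu_2^\bullet \mid x_1)$ around $x_1 = \mu_1^\bullet$ just established, $\frac{\partial L}{\partial x_1}(\mu_2^\bullet \mid \mu_1^\bullet - y) = -\frac{\partial L}{\partial x_1}(\mu_2^\bullet \mid \mu_1^\bullet + y)$ for all $y \geq 0$. At the same time, $f(\mu_1^\bullet - y \mid \mu_1^\bullet) = f(\mu_1^\bullet + y \mid \mu_1^\bullet)$. Therefore the sum of the two integrals is 0.

In fact, the FOC also has a solution for any $c \in \mathbb{R}$, as the next lemma shows.

Lemma A.10. For any $\bar{c} \in \mathbb{R}$, there exists some $\mu_2^* \in \mathbb{R}$ so that $\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^* \mid \bar{c}) = 0$.

A.4.3 Monotonicity of $\mu_2^*(c)$ in $c$

So far, I have shown that $(\mu_1^*(c), \mu_2^*(c)) \in \mathbb{R}^2$ are well-defined for all $c \in \mathbb{R} \cup \{\infty\}$ and characterize the unique solution pair to the KL divergence minimization problem, with $\mu_1^*(c) = \mu_1^\bullet$ and $\mu_2^*(\infty) = \mu_2^\bullet$. To finish proving Proposition 2, it remains to show that $\mu_2^*(c)$ is strictly increasing over $(-\infty, \infty]$.

Lemma A.11. Let $c_l, c, c_h \in \mathbb{R} \cup \{\infty\}$ with $c_l < c < c_h$. Then $\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c) \mid c_l) < 0$ and $\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c) \mid c_h) > 0$. As a result, whenever $c', c'' \in (-\infty, \infty]$ with $c' < c''$, we have $\mu_2^*(c') < \mu_2^*(c'')$.

Proof. First-order condition for $\mu_2^*(c)$ implies that

$$\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c) \mid c) = 0 \;\Rightarrow\; -\frac{1}{\gamma}\int_{-\infty}^{c} f(x_1 \mid \mu_1^\bullet) \cdot \frac{\partial}{\partial x_1}L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2^*(c)\right) dx_1 = 0.$$

From Lemma A.7, $x_1 \mapsto \frac{\partial}{\partial x_1}L(0 \mid x_1 - \frac{1}{\gamma}\mu_2^*(c))$ is strictly decreasing. If at the rightmost point of the integration interval we have $\frac{\partial}{\partial x_1}L(0 \mid c - \frac{1}{\gamma}\mu_2^*(c)) \geq 0$, then $\frac{\partial}{\partial x_1}L(0 \mid x_1 - \frac{1}{\gamma}\mu_2^*(c)) > 0$ for all $x_1 < c$. This would lead to $\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c) \mid c) \neq 0$, a contradiction. Therefore $\frac{\partial}{\partial x_1}L(0 \mid c - \frac{1}{\gamma}\mu_2^*(c)) < 0$.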
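For $c_h > c$, we have that

$$\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c) \mid c_h) = -\frac{1}{\gamma}\int_{-\infty}^{c_h} f(x_1 \mid \mu_1^\bullet) \cdot \frac{\partial}{\partial x_1}L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2^*(c)\right) dx_1$$
$$= -\frac{1}{\gamma}\int_{-\infty}^{c} f(x_1 \mid \mu_1^\bullet) \cdot \frac{\partial}{\partial x_1}L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2^*(c)\right) dx_1 + \left(-\frac{1}{\gamma}\right)\int_{c}^{c_h} f(x_1 \mid \mu_1^\bullet) \cdot \frac{\partial}{\partial x_1}L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2^*(c)\right) dx_1$$
$$= 0 + \left(-\frac{1}{\gamma}\right)\int_{c}^{c_h} f(x_1 \mid \mu_1^\bullet) \cdot \frac{\partial}{\partial x_1}L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2^*(c)\right) dx_1.$$

Since $\frac{\partial}{\partial x_1}L(0 \mid c - \frac{1}{\gamma}\mu_2^*(c)) < 0$, we also get $\frac{\partial}{\partial x_1}L(0 \mid x_1 - \frac{1}{\gamma}\mu_2^*(c)) < 0$ for all $x_1 > c$ since $L(0 \mid \cdot)$ is strictly concave by Lemma A.7. Therefore $\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c) \mid c_h) > 0$, as $f(x_1 \mid \mu_1^\bullet)$ is strictly positive.

For $c_l < c$, we have that

$$\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c) \mid c_l) = -\frac{1}{\gamma}\int_{-\infty}^{c_l} f(x_1 \mid \mu_1^\bullet) \cdot \frac{\partial}{\partial x_1}L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2^*(c)\right) dx_1.$$

If $\frac{\partial}{\partial x_1}L(0 \mid x_1 - \frac{1}{\gamma}\mu_2^*(c)) \geq 0$ for all $x_1 \leq c_l$, then clearly this gives $\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c) \mid c_l) < 0$. Otherwise, write the derivative as

$$-\frac{1}{\gamma}\left[\int_{-\infty}^{c} f(x_1 \mid \mu_1^\bullet) \cdot \frac{\partial}{\partial x_1}L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2^*(c)\right) dx_1 - \int_{c_l}^{c} f(x_1 \mid \mu_1^\bullet) \cdot \frac{\partial}{\partial x_1}L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2^*(c)\right) dx_1\right],$$

which, by the first-order condition for $\mu_2^*(c)$, simplifies to $\frac{1}{\gamma}\int_{c_l}^{c} f(x_1 \mid \mu_1^\bullet) \cdot \frac{\partial}{\partial x_1}L(0 \mid x_1 - \frac{1}{\gamma}\mu_2^*(c))\, dx_1$. If $\frac{\partial}{\partial x_1}L(0 \mid x_1 - \frac{1}{\gamma}\mu_2^*(c)) \geq 0$ for some $x_1 \in [c_l, c]$, then we must also get $\frac{\partial}{\partial x_1}L(0 \mid x_1 - \frac{1}{\gamma}\mu_2^*(c)) \geq 0$ for all $x_1 \leq c_l$, but this returns us to the case we have already considered. Thus $\frac{\partial}{\partial x_1}L(0 \mid x_1 - \frac{1}{\gamma}\mu_2^*(c)) < 0$ for all $x_1 \in [c_l, c]$, and the integral is strictly negative, showing $\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c) \mid c_l) < 0$.

Finally, consider $c', c'' \in (-\infty, \infty]$ with $c' < c''$. We must have $\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c') \mid c'') > 0$ by the first part of the lemma. Moreover, $\bar{L}(\cdot \mid c'')$ is strictly concave from Lemma A.7 and its FOC has a solution by Lemma A.10. So $\mu_2^*(c'') > \mu_2^*(c')$.

A.5 Proof of the Expression for $\mu_2^*(c)$ in Example 2

Proof. Rewrite Definition 5 as

$$\int_{c}^{\infty} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \ln\left(\frac{\phi(x_1; \mu_1^\bullet, \sigma^2)}{\phi(x_1; \mu_1, \sigma^2)}\right) dx_1$$
$$+ \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \int_{-\infty}^{\infty} \phi(x_2; \mu_2^\bullet, \sigma^2) \cdot \ln\left[\frac{\phi(x_1; \mu_1^\bullet, \sigma^2)}{\phi(x_1; \mu_1, \sigma^2)}\right] dx_2\, dx_1$$
$$+ \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \int_{-\infty}^{\infty} \phi(x_2; \mu_2^\bullet, \sigma^2) \cdot \ln\left[\frac{\phi(x_2; \mu_2^\bullet, \sigma^2)}{\phi(x_2; \mu_2 - \gamma(x_1 - \mu_1), \sigma^2)}\right] dx_2\, dx_1,$$

which is:

$$\int_{-\infty}^{\infty} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \ln\left(\frac{\phi(x_1; \mu_1^\bullet, \sigma^2)}{\phi(x_1; \mu_1, \sigma^2)}\right) dx_1 + \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \int_{-\infty}^{\infty} \phi(x_2; \mu_2^\bullet, \sigma^2) \ln\left[\frac{\phi(x_2; \mu_2^\bullet, \sigma^2)}{\phi(x_2; \mu_2 - \gamma(x_1 - \mu_1), \sigma^2)}\right] dx_2\, dx_1.$$

The KL divergence between $\mathcal{N}(\mu_{true}, \sigma^2_{true})$ and $\mathcal{N}(\mu_{model}, \sigma^2_{model})$ is

$$\ln\frac{\sigma_{model}}{\sigma_{true}} + \frac{\sigma^2_{true} + (\mu_{true} - \mu_{model})^2}{2\sigma^2_{model}} - \frac{1}{2},$$

so we may simplify the first term and the inner integral of the second term:

$$\frac{(\mu_1 - \mu_1^\bullet)^2}{2\sigma^2} + \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \left[\frac{\sigma^2 + (\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet)^2}{2\sigma^2} - \frac{1}{2}\right] dx_1.$$

Dropping constant terms not depending on $\mu_1$ and $\mu_2$ and multiplying by $\sigma^2$, we get a simplified expression of the objective,

$$\xi(\mu_1, \mu_2) := \frac{(\mu_1 - \mu_1^\bullet)^2}{2} + \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \left[\frac{(\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet)^2}{2}\right] dx_1.$$

We have the partial derivatives by differentiating under the integral sign,

$$\frac{\partial \xi}{\partial \mu_2} = \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot (\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet)\, dx_1,$$
$$\frac{\partial \xi}{\partial \mu_1} = (\mu_1 - \mu_1^\bullet) + \gamma\int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot (\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet)\, dx_1 = (\mu_1 - \mu_1^\bullet) + \gamma\frac{\partial \xi}{\partial \mu_2}.$$

By the first order conditions, at the minimum $(\mu_1^*, \mu_2^*)$, we must have $\frac{\partial \xi}{\partial \mu_1}(\mu_1^*, \mu_2^*) = \frac{\partial \xi}{\partial \mu_2}(\mu_1^*, \mu_2^*) = 0 \Rightarrow \mu_1^* = \mu_1^\bullet$. So $\mu_2^*$ satisfies $\frac{\partial \xi}{\partial \mu_2}(\mu_1^\bullet, \mu_2^*) = 0$, which by straightforward algebra shows

$$\mu_2^*(c) = \mu_2^\bullet - \gamma(\mu_1^\bullet - E[X_1 \mid X_1 \leq c]).$$

A.6 Proof of Lemma 1

Proof. Let $c \in \mathbb{R}$, $K > 0$. I show that $|\mu_2^*(c + K) - \mu_2^*(c)| < \gamma K$. We have

$$\bar{L}(\mu_2 \mid c) = \int_{-\infty}^{c} f(x_1 \mid \mu_1^\bullet) L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2\right) dx_1$$

and so

$$\bar{L}(\mu_2 + \gamma K \mid c + K) = \int_{-\infty}^{c+K} f(x_1 \mid \mu_1^\bullet) L\left(0 \,\Big|\, (x_1 - K) - \frac{1}{\gamma}\mu_2\right) dx_1 = \int_{-\infty}^{c} f(x_1 + K \mid \mu_1^\bullet) L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2\right) dx_1.$$

This implies

$$\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2 + \gamma K \mid c + K) = -\frac{1}{\gamma}\int_{-\infty}^{c} f(x_1 + K \mid \mu_1^\bullet)\frac{\partial}{\partial x_1}L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2\right) dx_1.$$

When $\mu_2 = \mu_2^*(c)$, first-order condition implies that $\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c) \mid c) = 0$, that is

$$\int_{-\infty}^{c} f(x_1 \mid \mu_1^\bullet)\frac{\partial}{\partial x_1}L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2^*(c)\right) dx_1 = 0.$$

We may write $\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c) + \gamma K \mid c + K)$ as

$$-\frac{1}{\gamma}\int_{-\infty}^{c} \left(\frac{f(x_1 + K \mid \mu_1^\bullet)}{f(x_1 \mid \mu_1^\bullet)}\right) f(x_1 \mid \mu_1^\bullet)\frac{\partial}{\partial x_1}L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2^*(c)\right) dx_1.$$

The term $x_1 \mapsto \frac{\partial}{\partial x_1}L(0 \mid x_1 - \frac{1}{\gamma}\mu_2^*(c))$ is positive for low values of $x_1$ and negative for high values of $x_1$, due to strict concavity of $L(0 \mid \cdot)$ by Lemma A.7. Let $r$ be such that $\frac{\partial}{\partial x_1}L(0 \mid r - \frac{1}{\gamma}\mu_2^*(c)) = 0$.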
Then,

$$\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c) + \gamma K \mid c + K) < -\frac{1}{\gamma} \cdot \int_{-\infty}^{r} \left(\frac{f(r + K \mid \mu_1^\bullet)}{f(r \mid \mu_1^\bullet)}\right) f(x_1 \mid \mu_1^\bullet)\frac{\partial}{\partial x_1}L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2^*(c)\right) dx_1$$
$$- \frac{1}{\gamma} \cdot \int_{r}^{c} \left(\frac{f(r + K \mid \mu_1^\bullet)}{f(r \mid \mu_1^\bullet)}\right) f(x_1 \mid \mu_1^\bullet)\frac{\partial}{\partial x_1}L\left(0 \,\Big|\, x_1 - \frac{1}{\gamma}\mu_2^*(c)\right) dx_1.$$

By Lemma A.8, $x_1 \mapsto \frac{f(x_1 + K \mid \mu_1^\bullet)}{f(x_1 \mid \mu_1^\bullet)}$ is strictly decreasing. So the weight $\frac{f(r + K \mid \mu_1^\bullet)}{f(r \mid \mu_1^\bullet)}$ under-weighs the integrand on the interval $(-\infty, r)$, while the same weight over-weighs the integrand on $(r, c)$. This amounts to an under-weighting of the positive part of the integrand and an over-weighting of the negative part, thus under-estimating the integral value. Accounting for the term $-\frac{1}{\gamma}$ gives the inequality above.

FOC $\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c) \mid c) = 0$ then implies RHS must be 0. So, $\frac{\partial}{\partial \mu_2}\bar{L}(\mu_2^*(c) + \gamma K \mid c + K) < 0$. Since $\bar{L}(\cdot \mid c + K)$ is strictly concave by Lemma A.7, this implies $\mu_2^*(c + K) < \mu_2^*(c) + \gamma K$. Given that we must have $\mu_2^*(c + K) > \mu_2^*(c)$ from Proposition 2, this shows $|\mu_2^*(c + K) - \mu_2^*(c)| < \gamma K$.

A.7 Proof of Proposition 4

Proof. Consider the map $I$ as discussed in the text, $I(\mu_2) := \mu_2^*(C(\mu_1^\bullet, \mu_2; \gamma))$. If $\hat{\mu}_2$ is a fixed point of $I$, then there is a steady state with $\mu_1^\infty = \mu_1^\bullet$, $\mu_2^\infty = \hat{\mu}_2$, $c^\infty = C(\mu_1^\bullet, \hat{\mu}_2; \gamma)$. So, existence of steady states follows from existence of fixed points of $I$.

Conversely, suppose $(\mu_1^\infty, \mu_2^\infty, c^\infty)$ is a steady state. From Proposition 2, $\mu_1^\infty = \mu_1^*(c^\infty) = \mu_1^\bullet$. From the definition of a steady state, $\mu_2^\infty = \mu_2^*(c^\infty)$ and $c^\infty = C(\mu_1^\infty, \mu_2^\infty; \gamma) = C(\mu_1^\bullet, \mu_2^\infty; \gamma)$. That is to say, $\mu_2^\infty = \mu_2^*(C(\mu_1^\bullet, \mu_2^\infty; \gamma))$, so $\mu_2^\infty$ is a fixed point of $I$. So, uniqueness of steady states follows from uniqueness of fixed points of $I$.

Since $\mu_2 \mapsto C(\mu_1^\bullet, \mu_2; \gamma)$ is Lipschitz continuous with Lipschitz constant $\ell < 1/\gamma$ by Lemma 2 and $c \mapsto \mu_2^*(c)$ is Lipschitz continuous with Lipschitz constant $\gamma$ by Lemma 1, their composition $I$ is a contraction mapping with Lipschitz constant $\ell\gamma < 1$.
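As a numerical sketch of the composition $I = \mu_2^* \circ C$ discussed above, the following Python code iterates the map in one concrete parametrization. It assumes the payoff functions that appear in the proof of Claim 2, $u_1(x_1) = x_1$ and $u_2(x_1, x_2) = q\max(x_1, x_2) + (1-q)x_2$, together with the Gaussian expression $\mu_2^*(c) = \mu_2^\bullet - \gamma(\mu_1^\bullet - E[X_1 \mid X_1 \leq c])$ from Section A.5. The function names, the bisection solver, and the parameter values ($\gamma = q = 0.5$, $\sigma = 1$, $\mu_1^\bullet = \mu_2^\bullet = 0$) are illustrative assumptions and do not come from the paper.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def stop_minus_continue(c, mu1, mu2, gamma, q, sigma):
    """D(c) = u1(c) - E[u2(c, X2) | X1 = c] under the subjective model.

    Assumed payoffs: u1(x1) = x1, u2(x1, x2) = q*max(x1, x2) + (1-q)*x2.
    """
    m = mu2 - gamma * (c - mu1)  # perceived mean of X2 after X1 = c
    z = (c - m) / sigma
    # E[max(c, X)] for X ~ N(m, sigma^2)
    e_max = c * norm_cdf(z) + m * (1.0 - norm_cdf(z)) + sigma * norm_pdf(z)
    return c - (q * e_max + (1.0 - q) * m)


def cutoff(mu1, mu2, gamma, q, sigma):
    """Indifference threshold C(mu1, mu2; gamma): the root of D, by bisection.

    Relies on D being strictly increasing with a sign change (gamma > 0, q < 1).
    """
    lo, hi = -1.0, 1.0
    while stop_minus_continue(lo, mu1, mu2, gamma, q, sigma) >= 0.0:
        lo *= 2.0
    while stop_minus_continue(hi, mu1, mu2, gamma, q, sigma) <= 0.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if stop_minus_continue(mid, mu1, mu2, gamma, q, sigma) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def pseudo_true_mu2(c, mu1_true, mu2_true, gamma, sigma):
    """Gaussian case: mu2*(c) = mu2_true - gamma*(mu1_true - E[X1 | X1 <= c])."""
    z = (c - mu1_true) / sigma
    truncated_mean = mu1_true - sigma * norm_pdf(z) / norm_cdf(z)
    return mu2_true - gamma * (mu1_true - truncated_mean)


def inference_map(mu2, mu1_true, mu2_true, gamma, q, sigma):
    """I(mu2) = mu2*(C(mu1_true, mu2; gamma))."""
    c = cutoff(mu1_true, mu2, gamma, q, sigma)
    return pseudo_true_mu2(c, mu1_true, mu2_true, gamma, sigma)


def iterate_beliefs(mu2_start, mu1_true, mu2_true, gamma, q, sigma,
                    n_generations=60):
    """Repeatedly apply I, returning the whole path of beliefs about mu2."""
    path = [mu2_start]
    for _ in range(n_generations):
        path.append(inference_map(path[-1], mu1_true, mu2_true,
                                  gamma, q, sigma))
    return path


if __name__ == "__main__":
    path = iterate_beliefs(0.0, 0.0, 0.0, gamma=0.5, q=0.5, sigma=1.0)
    print("first iterates:", path[:4])
    print("approximate fixed point:", path[-1])
```

In this parametrization the cutoff is linear in $\mu_2$ with slope $1/(1+\gamma)$, matching the value of $\ell$ in Claim 2, so the iterates should contract at a rate no worse than $\gamma/(1+\gamma)$ and settle below $\mu_2^\bullet$.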
This proposition follows from properties of contraction mappings.

A.8 Proof of Proposition 5

I will use the following lemma.

Lemma A.12. For any $c \in \mathbb{R}$, $\mu_2^*(c) - \gamma(c - \mu_1^\bullet) < \mu_2^\bullet$.

Here is the proof of Proposition 5.

Proof. Suppose $(\mu_1^\bullet, \mu_2^\infty, c^\infty)$ is a steady state. If $c^\bullet = \infty$, then $c^\infty < c^\bullet$ trivially as $c^\infty \in \mathbb{R}$. Now suppose $c^\bullet \neq \infty$. By Proposition 1, the agent is indifferent between stopping and continuing after $X_1 = c^\infty$ under the feasible model $\Psi(\mu_1^\bullet, \mu_2^\infty; \gamma)$. This implies

$$u_1(c^\infty) = E_{\Psi(\mu_1^\bullet, \mu_2^\infty; \gamma)}[u_2(c^\infty, X_2) \mid X_1 = c^\infty] = E_{\tilde{X}_2 \sim f(\cdot \mid \mu_2^\infty - \gamma(c^\infty - \mu_1^\bullet))}[u_2(c^\infty, \tilde{X}_2)].$$

By the definition of steady state, $\mu_2^\infty = \mu_2^*(c^\infty)$. By Lemma A.12, $\mu_2^*(c^\infty) - \gamma(c^\infty - \mu_1^\bullet) < \mu_2^\bullet$. Therefore, $f(\cdot \mid \mu_2^\infty - \gamma(c^\infty - \mu_1^\bullet))$ is first-order stochastically dominated by $f(\cdot \mid \mu_2^\bullet)$. Since $u_2$ is strictly increasing in its second argument by Assumption 1(a), we therefore have $u_1(c^\infty) < E_{\tilde{X}_2 \sim f(\cdot \mid \mu_2^\bullet)}[u_2(c^\infty, \tilde{X}_2)]$. The LHS is the objective payoff of stopping at $c^\infty$ while the RHS is the objective expected payoff of continuing at $c^\infty$. Since the best stopping strategy under the objective model $\Psi^\bullet$ has the cutoff form, we must have $c^\infty < c^\bullet$.

A.9 Proof of Theorem 1

The hypotheses of Theorem 1 will be maintained throughout this section. I also abbreviate $f(\cdot \mid \mu_1^\bullet) =: g_1(\cdot)$ and $f(\cdot \mid \mu_2^\bullet) =: g_2(\cdot)$. Finally, let $\kappa_g \in \mathbb{R}_{>0}$ be such that $\left|\frac{d}{dx}\ln(g_2(x))\right| < \kappa_g$ for all $x \in \mathbb{R}$.

A.9.1 Optimality of Cutoff Strategies

I first develop an extension of Lemma A.1. I show that for an agent who knows $\mu_1^\bullet$ and has some belief over $\mu_2$ with support bounded by $[\underline{\mu}_2, \bar{\mu}_2]$, there exists a cutoff strategy that uniquely maximizes payoff across all cutoff strategies, so the "myopically optimal" cutoff strategy is well defined. Furthermore, this myopically optimal cutoff strategy also achieves weakly larger expected payoff compared to any arbitrary stopping strategy.
So, restrictionto cutoff strategies is without loss. Lemma A.13. For an agent who knows µ • and who holds some belief ν ∈ ∆([ µ , ¯ µ ]) aboutsecond-period fundamental, there exists c ∗ ∈ R such that: (i) the cutoff strategy S c ∗ achievesweakly higher expected payoff than any other (not necessarily cutoff-based) stopping strategy S : R → { Stop, Continue } ; (ii) for any other c = c ∗ , S c ∗ achieves strictly higher expectedpayoff than S c . A.9.2 The Log Likelihood Process Next, I define the processes of data log likelihood (for a given fundamental). For each µ ∈ [ µ , ¯ µ ] , let ‘ t ( µ )( ω ) be the log likelihood that the true second-period fundamental is µ and histories ( ˜ H s ) s ≤ t ( ω ) are generated by the end of round t . It is given by ‘ t ( µ )( ω ) := ln( m ( µ )) + t X s =1 ln(lik( ˜ H s ( ω ); µ ))where lik( x , ∅ ; µ ) := g ( x ) and lik( x , x ; µ ) := g ( x ) · f ( x | µ − γ ( x − µ • )).I record a useful decomposition of ‘ t ( µ ) , the derivative of the log-likelihood process. Let λ ( z ) := ddz ln( g ( z )) = g ( z ) g ( z ) .Define two stochastic processes: ϕ s ( µ ) := − λ ( X ,s − µ + µ • + γ ( X ,s − µ • )) · { X ,s ≤ ˜ C s } In particular this implies if there exists at least one steady state, then c • = −∞ . One can construct other stopping strategies with the same expected payoff by, for example, modifyingthe stopping decision of the optimal cutoff strategy at finitely many x . ϕ s ( µ ) := ∂∂µ ¯ L ( µ | ˜ C s )Note that ¯ ϕ s ( µ ) is measurable with respect to F s − , since ( C t ) is a predictable process.Write ξ s ( µ ) := ϕ s ( µ ) − ¯ ϕ s ( µ ) and y t ( µ ) := P ts =1 ξ s ( µ ). Write z t ( µ ) := P ts =1 ¯ ϕ s ( µ ). Lemma A.14. ‘ t ( µ ) = m ( µ ) m ( µ ) + y t ( µ ) + z t ( µ ) Proof. We may expand ‘ t ( µ ) asln( m ( µ )) + t X s =1 ln( g ( X ,s )) + t X s =1 ln( f ( X ,s | µ − γ ( X ,s − µ • ))) · { X ,s ≤ ˜ C s } . The derivative of the first term is m ( µ ) m ( µ ) . The second term does not depend on µ . 
In the thirdterm, we use the fact that f ( · | τ ) are translations of each other and that g ( · ) = f ( · | µ • )to write: f ( X ,s | µ − γ ( X ,s − µ • )) = g ( X ,s − µ + µ • + γ ( X ,s − µ • )) . This shows that the derivative of each summand in the third term with respect to µ is − g ( X ,s − µ + µ • + γ ( X ,s − µ • )) g ( X ,s − µ + µ • + γ ( X ,s − µ • )) · { X ,s ≤ ˜ C s } = ϕ s ( µ ) . So in sum, ‘ t ( µ ) = m ( µ ) m ( µ ) + P ts =1 ϕ s ( µ ). The lemma then follows from simple rearrange-ments.Now I derive two results about the ξ t ( µ ) processes for different values of µ . Lemma A.15. There exists κ ξ < ∞ so that for every µ ∈ [ µ , ¯ µ ] and for every t ≥ ,ω ∈ Ω , E [ ξ t ( µ ) |F t − ]( ω ) ≤ κ ξ . The proof can be found in the Online Appendix. Lemma A.16. For every t ≥ , µ ∈ [ µ , ¯ µ ] and ω ∈ Ω , | ξ t ( µ )( ω ) | ≤ κ g .Proof. In the proof of Lemma A.15, we established E [ ϕ t ( µ ) |F t − ] = ¯ ϕ t ( µ ). So, we have | ξ t ( µ )( ω ) | ≤ | ϕ t ( µ )( ω ) | + | E [ ϕ t ( µ ) | F t − ]( ω ) | . We have ϕ t ( µ ) = λ ( X ,s − µ + µ • + γ ( X ,s − µ • )) · { X ,s ≤ ˜ C s } , with | λ ( z ) | ≤ κ g for all z ∈ R . This shows | ϕ t ( µ )( ω ) | ≤ κ g for all ω, and similarly | E [ ϕ t ( µ ) | F t − ]( ω ) | ≤ κ g for all ω . 49 .9.3 Heidhues, Koszegi, and Strack (2018)’s Law of Large Numbers I use a statistical result from Heidhues, Koszegi, and Strack (2018) to show that the y t termin the decomposition of ‘ t almost surely converges to 0 in the long run, and furthermorethis convergence is uniform on [ µ , ¯ µ ] . This lets me focus on summands of the form ¯ ϕ s ( µ ),which can be interpreted as the expected contribution to the log likelihood derivative fromround s data. This lends tractability to the problem as ¯ ϕ s ( µ ) only depends on ˜ C s , but noton X ,s or X ,s . Lemma A.17. For every µ ∈ [ µ , ¯ µ ] , lim t →∞ | y t ( µ ) t | = 0 almost surely.Proof. 
Heidhues, Koszegi, and Strack (2018)’s Proposition 10 shows that if ( y t ) is a martin-gale such that there exists some constant v ≥ y ] t ≤ vt almost surely, where [ y ] t is the quadratic variation of ( y t ) , then almost surely lim t →∞ y t t = 0.Consider the process y t ( µ ) for a fixed µ ∈ [ µ , ¯ µ ]. By definition y t = P ts =1 ϕ s ( µ ) − ¯ ϕ s ( µ ). As established in the proof of Lemma A.15, for every s, ¯ ϕ s ( µ ) = E [ ϕ s ( µ ) |F s − ].So for t < t, E [ y t ( µ ) |F t ] = t X s =1 ϕ s ( µ ) − ¯ ϕ s ( µ ) + E [ t X s = t +1 ϕ s ( µ ) − ¯ ϕ s ( µ ) |F t ]= t X s =1 ϕ s ( µ ) − ¯ ϕ s ( µ ) + t X s = t +1 E [ E [ ϕ s ( µ ) − ¯ ϕ s ( µ ) |F s − ] | F t ]= t X s =1 ϕ s ( µ ) − ¯ ϕ s ( µ ) + 0 = y t ( µ ) . This shows ( y t ( µ )) is a martingale. Also,[ y ( µ )] t = t − X s =1 E [( y s ( µ ) − y s − ( µ )) |F s − ]= t − X s =1 E [ ξ s ( µ ) |F s − ] ≤ κ ξ · t by Lemma A.15. Therefore Heidhues, Koszegi, and Strack (2018) Proposition 10 applies. Lemma A.18. lim t →∞ sup µ ∈ [ µ , ¯ µ ] | y t ( µ ) t | = 0 almost surely.Proof. From the proof of Lemma 11 in Heidhues, Koszegi, and Strack (2018), it suffices tofind a sequence of random variables B t such that sup µ ∈ [ µ , ¯ µ ] | ξ t ( µ ) | ≤ B t almost surely,sup t ≥ t P ts =1 E [ B s ] < ∞ , and lim t →∞ t P ts =1 ( B s − E [ B s ]) = 0. But Lemma A.16 establishesthe constant random variable B t = 2 κ g as a bound on ξ t ( µ ) for every t, µ , ω , which satisfiesthese requirements. 50 .9.4 Bounds on Asymptotic Beliefs and Asymptotic Cutoffs For each t, let ˜ M t be the (random) posterior belief induced by the (random) posterior density˜ m t after updating prior m using t rounds of histories. Lemma A.19. For c l ≥ C ( µ • , µ ; γ ), if almost surely lim inf t →∞ ˜ C t ≥ c l , then almost surelylim t →∞ ˜ M t ( [ µ , µ ∗ ( c l )) ) = 0 . Also, for c h ≤ C ( µ • , ¯ µ ; γ ) , if almost surely lim sup t →∞ ˜ C t ≤ c h ,then almost surely lim t →∞ ˜ M t ( ( µ ∗ ( c h ) , ¯ µ ]) = 0 . Proof. 
I first show that for all (cid:15) > , there exists δ > t →∞ inf µ ∈ [ µ ,µ ∗ ( c l ) − (cid:15) ] ‘ t ( µ ) t ≥ δ. From Lemma A.14, we may rewrite LHS aslim inf t →∞ inf µ ∈ [ µ ,µ ∗ ( c l ) − (cid:15) ] " t m ( µ ) m ( µ ) + y t ( µ ) t + z t ( µ ) t , which is no smaller than taking the inf separately across the three terms in the bracket,lim inf t →∞ inf µ ∈ [ µ ,µ ∗ ( c l ) − (cid:15) ] t m ( µ ) m ( µ ) + lim inf t →∞ inf µ ∈ [ µ ,µ ∗ ( c l ) − (cid:15) ] y t ( µ ) t + lim inf t →∞ inf µ ∈ [ µ ,µ ∗ ( c l ) − (cid:15) ] z t ( µ ) t . Since m is continuous and m is strictly positive (and continuous) on [ µ , ¯ µ ] by thehypotheses of Theorem 1 , m /m is bounded on [ µ , ¯ µ ], so we in fact havelim t →∞ inf µ ∈ [ µ ,µ ∗ ( c l ) − (cid:15) ] t m ( µ ) m ( µ ) = 0 . To deal with the second term,lim inf t →∞ inf µ ∈ [ µ ,µ ∗ ( c l ) − (cid:15) ] y t ( µ ) t ≥ lim inf t →∞ inf µ ∈ [ µ ¯ µ ] y t ( µ ) t = − lim inf t →∞ sup µ ∈ [ µ ¯ µ ] − y t ( µ ) t . Lemma A.18 gives lim t →∞ sup µ ∈ [ µ ¯ µ ] − y t ( µ ) t = 0 almost surely, so this second term is non-negative almost surely.It suffices then to find δ > t →∞ inf µ ∈ [ µ ,µ ∗ ( c l ) − (cid:15) ] z t ( µ ) t ≥ δ almost surely.Put δ := ∂∂µ ¯ L ( µ ∗ ( c l ) − (cid:15) | c l ) and I will show ¯ ϕ s ( µ )( ω ) ≥ δ whenever ˜ C s ( ω ) ≥ c l and µ ≤ µ ∗ ( c l ) − (cid:15) . To see this, note that when ˜ C s ( ω ) = c ∈ R , ¯ ϕ s ( µ )( ω ) = ∂∂µ ¯ L ( µ | c ) and ¯ L ( · | c ) isstrictly concave in its first argument by Lemma A.7. Therefore, if ¯ ϕ s ( µ ∗ ( c l ) − (cid:15) )( ω ) ≥ δ, thenwe also get ¯ ϕ s ( µ )( ω ) ≥ δ for any µ ≤ µ ∗ ( c l ) − (cid:15) . So it suffices to show ∂∂µ ¯ L ( µ ∗ ( c l ) − (cid:15) | c ) ≥ δ whenever c ≥ c l . 51e have ∂∂µ ¯ L ( µ | c l ) = Z c l −∞ g ( x ) · Z ∞−∞ ( − · g ( x ) · λ ( x − µ + µ • + γ ( x − µ • )) dx dx . First-order condition implies that ∂∂µ ¯ L ( µ ∗ ( c l ) | c l ) = 0. Since λ is strictly decreasing, thisimplies δ = ∂∂µ ¯ L ( µ ∗ ( c l ) − (cid:15) | c l ) > 0. 
Also, again using λ strictly decreasing, the innerintegrand is strictly increasing in x . Thus, ∂∂µ ¯ L ( µ ∗ ( c l ) − (cid:15) | c l ) > Z ∞−∞ ( − · g ( x ) · λ ( x − ( µ ∗ ( c l ) − (cid:15) ) + µ • + γ ( c − µ • )) dx > c ≥ c l . This then shows ∂∂µ ¯ L ( µ ∗ ( c l ) − (cid:15) | c ) > ∂∂µ ¯ L ( µ ∗ ( c l ) − (cid:15) | c l ) for any c > c l .Having shown that ¯ ϕ s ( µ )( ω ) ≥ δ for all µ ∈ [ µ , µ ∗ ( c l ) − (cid:15) ] whenever ˜ C s ( ω ) ≥ c l , thisshows along any ω such that lim inf t →∞ ˜ C t ≥ c l , we also have lim inf s →∞ inf µ ∈ [ µ ,µ ∗ ( c l ) − (cid:15) ] ¯ ϕ s ( µ ) ≥ δ , and thus lim inf t →∞ inf µ ∈ [ µ ,µ ∗ ( c l ) − (cid:15) ] z t ( µ ) t = lim inf t →∞ inf µ ∈ [ µ ,µ ∗ ( c l ) − (cid:15) ] t " t X s =1 ¯ ϕ s ( µ ) ≥ δ. From here, it is a standard exercise to establish that lim t →∞ ˜ M t ( [ µ , µ ∗ ( c l ) − (cid:15) ) ) = 0almost surely. Since the choice of (cid:15) > t →∞ sup µ ∈ [ µ ∗ ( c h )+ (cid:15), ¯ µ ] z t ( µ ) t ≤ − δ where − δ = max ∂∂µ ¯ L ( µ ∗ ( c h ) + (cid:15) | c h ) , ∂∂µ ¯ L ( µ ∗ ( c h ) + (cid:15) | C ( µ • , µ ; γ )) ! < . Lemma A.20. For µ ≤ µ l < µ h ≤ ¯ µ , if lim t →∞ ˜ M t ([ µ l , µ h ]) = 1 almost surely, thenlim inf t →∞ ˜ C t ≥ C ( µ • , µ l ; γ ) and lim sup t →∞ ˜ C t ≤ C ( µ • , µ h ; γ ) almost surely. Proof. I show lim inf t →∞ ˜ C t ≥ C ( µ • , µ l ; γ ) almost surely . The argument establishinglim sup t →∞ ˜ C t ≤ C ( µ • , µ h ; γ ) is symmetric.Let c l = C ( µ • , µ l ; γ ), c = C ( µ • , µ ; γ ) , ¯ c = C ( µ • , ¯ µ ; γ ). Fix some (cid:15) > . Since c U ( c ; µ • , µ ) is single peaked for every µ , and since c l ≤ C ( µ • , µ ; γ ) for all µ ∈ [ µ l , µ h ] , we get U ( c l ; µ • , µ ) − U ( c l − (cid:15) ; µ • , µ ) > µ ∈ [ µ l , µ h ]. As µ (cid:16) U ( c l ; µ • , µ ) − U ( c l − (cid:15) ; µ • , µ ) (cid:17) is continuous, there exists some κ ∗ > U ( c l ; µ • , µ ) − U ( c l − (cid:15) ; µ • , µ ) > κ ∗ for all µ ∈ [ µ l , µ h ]. 
In particular, if ν ∈ ∆([ µ l , µ h ]) is a be-lief over second-period fundamental supported on [ µ l , µ h ] , then R U ( c l ; µ • , µ ) − U ( c l − (cid:15) ; µ • , µ ) dν ( µ ) > κ ∗ . Now , let ¯ κ := sup c ∈ [ c, ¯ c ] sup µ ∈ [ µ , ¯ µ ] U ( c ; µ • , µ ), κ := inf c ∈ [ c, ¯ c ] inf µ ∈ [ µ , ¯ µ ] U ( c ; µ • , µ ).52ind p ∈ (0 , 1) so that pκ ∗ − (1 − p )(¯ κ − κ ) = 0 . At any belief ˆ ν ∈ ∆([ µ , ¯ µ ]) that assignsmore than probability p to the subinterval [ µ l , µ h ], the optimal cutoff is larger than c l − (cid:15) . Tosee this, take any ˆ c ≤ c l − (cid:15) and I will show ˆ c is suboptimal. If ˆ c < c, then it is suboptimal afterany belief on [ µ , ¯ µ ] . If c ≤ ˆ c ≤ c l − (cid:15) , I show that R U ( c l ; µ • , µ ) − U (ˆ c ; µ • , µ ) d ˆ ν ( µ ) > . To see this, we may decompose ˆ ν as the mixture of a probability measure ν on [ µ l , µ h ]and another probability measure ν c on [ µ , ¯ µ ] \ [ µ l , µ h ] . Let ˆ p > p be the probability that ν assigns to [ µ l , µ h ] . The above integral is equal to:ˆ p Z µ ∈ [ µ l ,µ h ] U ( c l ; µ • , µ ) − U (ˆ c ; µ • , µ ) dν ( µ ) + (1 − ˆ p ) Z µ ∈ [ µ , ¯ µ ] \ [ µ l ,µ h ] . U ( c l ; µ • , µ ) − U (ˆ c ; µ • , µ ) dν c ( µ )Since c l is to the left of the optimal cutoff for all µ ∈ [ µ l , µ h ] and ˆ c ≤ c l − (cid:15) , then U (ˆ c ; µ • , µ ) ≤ U ( c l − (cid:15) ; µ • , µ ) for all µ ∈ [ µ l , µ h ] . The first summand is no less thanˆ p Z µ ∈ [ µ l ,µ h ] U ( c l ; µ • , µ ) − U ( c l − (cid:15) ; µ • , µ ) dν ( µ ) ≥ ˆ pκ ∗ . Also, the integrand in the second summand is no smaller than − (¯ κ − κ ) , therefore R U ( c l ; µ • , µ ) − U (ˆ c ; µ • , µ ) d ˆ ν ( µ ) ≥ ˆ pκ ∗ − (1 − ˆ p )(¯ κ − κ ) . Since ˆ p > p , we get ˆ pκ ∗ − (1 − ˆ p )(¯ κ − κ ) > ω where lim t →∞ ˜ M t ([ µ l , µ h ])( ω ) = 1 , eventually ˜ M t ([ µ l , µ h ])( ω ) >p for all large enough t, meaning lim inf t →∞ ˜ C t ( ω ) ≥ c l − (cid:15). 
This shows lim inf t →∞ ˜ C t ≥ C ( µ • , µ l ; γ ) − (cid:15) almost surely. Since the choice of (cid:15) > t →∞ ˜ C t ≥ C ( µ • , µ l ; γ ) almost surely. A.9.5 The Contraction Map I now combine the results established so far to prove Theorem 1 . Proof. Let µ l , [1] := µ , µ h , [1] := ¯ µ . For k = 2 , , ... , iteratively define µ l , [ k ] := I ( µ l , [ k − ; γ )and µ h , [ k ] := I ( µ h , [ k − ; γ ).From Lemma A.20, if lim t →∞ ˜ M t ([ µ l , [ k ] , µ h , [ k ] ]) = 1 almost surely, then lim inf t →∞ ˜ C t ≥ C ( µ • , µ l , [ k ] ; γ ) and lim sup t →∞ ˜ C t ≤ C ( µ • , µ h , [ k ] ; γ ) almost surely. But using these conclusionsin Lemma A.19, we further deduce that lim t →∞ ˜ M t ([ µ ∗ ( C ( µ • , µ l , [ k ] ; γ )) , µ ∗ ( C ( µ • , µ h , [ k ] ; γ ))]) =1 almost surely, that is to say lim t →∞ ˜ M t ([ µ l , [ k +1] , µ h , [ k +1] ]) = 1 almost surely.As shown in the proof of Proposition 4, under Assumptions 1, 2, and 3, µ 7→ I ( µ ; γ ) isa contraction mapping. Since µ < µ ∞ and ¯ µ > µ ∞ , ( µ l , [ k ] ) k ≥ is a sequence whose limit is µ ∞ , and ( µ h , [ k ] ) k ≥ is a sequence whose limit is µ ∞ . Thus, agent’s posterior converges in L to µ ∞ almost surely (since the support of the prior is bounded).In addition, µ C ( µ • , µ ; γ ) is continuous, so the sequences of bounds on asymptoticcutoffs also converge, lim k →∞ C ( µ • , µ l , [ k ] ; γ ) = c ∞ and lim k →∞ C ( µ • , µ h , [ k ] ; γ ) = c ∞ . This53eans lim t →∞ ˜ C t = c ∞ almost surely. A.10 Proof of Theorem 2 I require a lemma that shows beliefs and cutoffs are monotonic in the auxiliary environment. Lemma A.21. Suppose Assumptions 1 and 2 hold. Starting from any initial condition andany m , cutoffs ( c A [ t ] ) t ≥ and beliefs ( µ A , [ t ] ) t ≥ in the auxiliary environment form monotonicsequences across generations. Also, lim t →∞ µ A , [ t ] = µ ∞ where µ ∞ is the unique fixed point of I ( · ; γ ) and lim t →∞ c A [ t ] = C ( µ • , µ ∞ ; γ ) . Now I turn to the proof of Theorem 2. 
Proof. For the first step of the proof, suppose Assumptions 1 and 2 hold. Step 1 : If c [1] > c [0] , then ( µ , [ t ] ) t ≥ and ( c [ t ] ) t ≥ are two increasing sequence, whereas c [1] ≤ c [0] implies ( µ , [ t ] ) t ≥ and ( c [ t ] ) t ≥ are two decreasing sequences.By simple algebra, the problem of generation t + 1 amounts to maximizing the sum of¯ L ( · | c [0] ) , ..., ¯ L ( · | c [ t ] ). For c , ..., c t ∈ R , denote µ ∗ ( c , ..., c t ) := arg min µ ∈ R P ts =0 ¯ L ( µ | c s ) . Suppose c [1] > c [0] . Then µ , [1] = µ ∗ ( c [0] ) , but by Lemma A.11, ∂∂µ ¯ L ( µ , [1] | c [1] ) > L ( · | c [0] ) + ¯ L ( · | c [1] ) is strictly concave and since ∂∂µ ¯ L ( µ , [1] | c [0] ) + ∂∂µ ¯ L ( µ , [1] | c [1] ) = ∂∂µ ¯ L ( µ , [1] | c [1] ) > 0, we must have µ , [2] = µ ∗ ( c [0] , c [1] ) > µ , [1] . This also shows that,since C is strictly increasing, c [2] > c [1] .Assume we have established that c [0] < c [1] < ... < c [ t ] and µ , [1] < ... < µ , [ t ] forsome t ≥ . By FOC of inference in generation t, P t − s =0 ∂∂µ ¯ L ( µ , [ t ] | c [ s ] ) = 0 . If we had ∂∂µ ¯ L ( µ , [ t ] | c [ t − ) < 0, then by single-peaked nature of L ( · | c [ t − ) , µ , [ t ] > µ ∗ ( c [ t − ) . Since c [0] < c [1] < ... < c [ t − implies µ ∗ ( c [0] ) < ... < µ ∗ ( c [ t − ) by Proposition 2, we must also have µ , [ t ] > µ ∗ ( c [ s ] ) for all 0 ≤ s ≤ t − , that is to say ∂∂µ ¯ L ( µ , [ t ] | c [ s ] ) < ≤ s ≤ t − ∂∂µ ¯ L ( µ , [ t ] | c [ t − ) ≥ 0, which implies ∂∂µ ¯ L ( µ , [ t ] | c [ t ] ) > c [ t ] > c [ t − from the inductive hypothesis. Hence we see that P ts =0 ∂∂µ ¯ L ( µ , [ t ] | c [ s ] ) > . This shows µ , [ t +1] = µ ∗ ( c [0] , ..., c [ t ] ) > µ , [ t ] by the strict concavity of generation t ’s objective.Also, c [ t +1] > c [ t ] follows.So by induction, we have shown Step 1. (The other case of c [1] < c [0] is symmetric.)For the rest of this proof, suppose Assumption 3 also holds. 
Step 2: $(\mu_{2,[t]})_{t\ge 1}$ is bounded and converges.

I first show that for every $t$, $\mu_{2,[t]}$ is bounded between $\mu_{2,[1]}$ and $\mu_2^\infty$. Combined with the fact that $(\mu_{2,[t]})_{t\ge 1}$ is monotonic from Step 1, the sequence must then converge.

Consider the case of $c_{[1]} > c_{[0]}$ (so $\mu_{2,[2]} > \mu_{2,[1]}$). Step 1 implies that $(\mu_{2,[t]})_{t\ge 1}$ forms an increasing sequence. We have $\mu^A_{2,[1]} = \mu_{2,[1]} = \mu_2^*(c_{[0]})$, so also $c^A_{[1]} = c_{[1]}$. We have $\mu^A_{2,[2]} = \mu_2^*(c_{[1]})$, but
\[
\frac{\partial}{\partial \mu_2} \bar{L}(\mu^A_{2,[2]} \mid c_{[0]}) + \frac{\partial}{\partial \mu_2} \bar{L}(\mu^A_{2,[2]} \mid c_{[1]}) = \frac{\partial}{\partial \mu_2} \bar{L}(\mu^A_{2,[2]} \mid c_{[0]}) < 0,
\]
using the FOC that $\frac{\partial}{\partial \mu_2} \bar{L}(\mu^A_{2,[2]} \mid c_{[1]}) = 0$ and $c_{[1]} > c_{[0]}$. This shows $\mu_{2,[2]} < \mu^A_{2,[2]}$, hence $c_{[2]} < c^A_{[2]}$.

By induction, suppose we have shown that $\mu_{2,[t]} < \mu^A_{2,[t]}$ and $c_{[t]} < c^A_{[t]}$ for some $t \ge 2$. Then, the arguments from Step 1 establish that $\frac{\partial}{\partial \mu_2} \bar{L}(\mu_{2,[t+1]} \mid c_{[t]}) \ge 0$, which implies $\frac{\partial}{\partial \mu_2} \bar{L}(\mu_{2,[t+1]} \mid c^A_{[t]}) > 0$ as $c^A_{[t]} > c_{[t]}$. By strict concavity of $\bar{L}(\cdot \mid c^A_{[t]})$ from Lemma A.7, this shows $\mu^A_{2,[t+1]} = \mu_2^*(c^A_{[t]}) > \mu_{2,[t+1]}$, hence also $c^A_{[t+1]} > c_{[t+1]}$. So we have established that $\mu_{2,[t]} \le \mu^A_{2,[t]}$ by induction. But from the proof of Lemma A.21, $(\mu^A_{2,[t]})$ converge upwards to $\mu_2^\infty$ in this case (given that they are iterates of $I$, which is a contraction map by Proposition 4 when Assumptions 1, 2, and 3 hold), meaning $\mu_{2,[t]}$ is bounded between $\mu_{2,[1]}$ and $\mu_2^\infty$.

The case of $c_{[1]} < c_{[0]}$ is symmetric (and if $c_{[1]} = c_{[0]}$ then $\mu_{2,[1]} = \mu_2^\infty$). We have proven Step 2. Denote $\tilde{\mu}_2 = \lim_{t\to\infty} \mu_{2,[t]}$ and observe that since $C$ is continuous in its second argument, $c_{[t]} \to \tilde{c} = C(\mu_1^\bullet, \tilde{\mu}_2; \gamma)$.

Step 3: $\tilde{\mu}_2$ is a fixed point of $I(\cdot; \gamma)$, so in particular $\tilde{\mu}_2 = \mu_2^\infty$ and $\tilde{c} = c^\infty$, since $I(\cdot; \gamma)$ has a unique fixed point by Proposition 4.

Consider the case of $c_{[1]} > c_{[0]}$, for the other case is symmetric.
From the proof of Step 2, $\mu_{2,[t]}$ is bounded above by $\mu_2^\infty$, so if $\tilde{\mu}_2 \ne \mu_2^\infty$ by way of contradiction, then $\tilde{\mu}_2 < \mu_2^\infty$. Since the iterates of $I(\cdot; \gamma)$ are monotonic, this implies $I(\tilde{\mu}_2; \gamma) > \tilde{\mu}_2$, that is $\mu_2^*(\tilde{c}) > \tilde{\mu}_2$. As $\bar{L}(\cdot \mid \tilde{c})$ is strictly concave, this implies $\int_{-\infty}^{\tilde{c}} \frac{\partial}{\partial \mu_2} L(\tilde{\mu}_2 \mid x_1)\, dx_1 > 0$. Using the fact that $\frac{\partial}{\partial \mu_2} L(\cdot \mid x_1)$ is decreasing, there must exist $\epsilon > 0$ so that $\int_{-\infty}^{c} \frac{\partial}{\partial \mu_2} L(\mu_2 \mid x_1)\, dx_1 \ge \epsilon$ whenever $c \in [\tilde{c} - \epsilon, \tilde{c}]$ and $\mu_2 \le \tilde{\mu}_2$. Since $c_{[t]} \nearrow \tilde{c}$, find large enough $T$ so that $c_{[t]} \ge \tilde{c} - \epsilon$ whenever $t \ge T$. Also, let
\[
B = \max_{\mu_2 \in [\mu_{2,[2]},\, \tilde{\mu}_2]}\ \max_{c \in [c_{[0]},\, \tilde{c}]} \left| \int_{-\infty}^{c} \frac{\partial}{\partial \mu_2} L(\mu_2 \mid x_1)\, dx_1 \right|.
\]
So for $t \ge T + 1$, $\sum_{s=0}^{t-1} \frac{\partial}{\partial \mu_2} \bar{L}(\mu_{2,[t]} \mid c_{[s]}) \ge -TB + \epsilon (t - T)$. This quantity must be strictly positive for large enough $t$, a contradiction that says the FOC is not satisfied for large $t$. Thus, we must have $\tilde{\mu}_2 = \mu_2^\infty$, hence $\tilde{c} = C(\mu_1^\bullet, \mu_2^\infty; \gamma)$.

A.11 Proof of Corollary 1

Proof. Suppose $c_{[1]} \ge c_{[0]}$. Since $\mu_2^*(c)$ is increasing, we have $\mu_{2,[2]} = \mu_2^*(c_{[1]}, c_{[0]}) \ge \mu_2^*(c_{[0]}) = \mu_{2,[1]}$. So we get $c_{[2]} \ge c_{[1]}$. By Theorem 2, we deduce $(c_{[t]})_{t\ge 1}$ is an increasing sequence, so in particular $c^\infty \ge c^\bullet$. But again by Theorem 2, $c^\infty$ is the same as the steady-state cutoff in Theorem 1. This is a contradiction because Theorem 1 implies $c^\infty < c^\bullet$.

This shows $c_{[1]} < c_{[0]}$, and similar arguments show $(c_{[t]})_{t\ge 1}$ is a strictly decreasing sequence. Since $c^\bullet$ is the objectively optimal cutoff threshold under the true model $\Psi^\bullet$, and since expected payoff under the true model is a single-peaked function in the acceptance threshold by Lemma A.2, this shows expected payoff is strictly decreasing across generations.

Online Appendix for "Mislearning from Censored Data: The Gambler's Fallacy in Optimal-Stopping Problems"
Kevin He
August 19, 2019

OA 1 Proofs of Results in Section 5 and the Appendix

OA 1.1 Proof of Lemma A.2

Proof.
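Step 1: $D$ is strictly increasing.

Suppose $x_1 > \bar{x}_1$. Then
\[
\mathbb{E}_\Psi[u_2(\bar{x}_1, X_2) \mid X_1 = \bar{x}_1] = \mathbb{E}_{\tilde{X}_2 \sim f(\cdot \mid \mu_2 - \gamma(\bar{x}_1 - \mu_1))}[u_2(\bar{x}_1, \tilde{X}_2)],
\]
while
\[
\mathbb{E}_\Psi[u_2(\bar{x}_1, X_2) \mid X_1 = x_1] = \mathbb{E}_{\tilde{\tilde{X}}_2 \sim f(\cdot \mid \mu_2 - \gamma(x_1 - \mu_1))}[u_2(\bar{x}_1, \tilde{\tilde{X}}_2)] = \mathbb{E}_{\tilde{X}_2 \sim f(\cdot \mid \mu_2 - \gamma(\bar{x}_1 - \mu_1))}[u_2(\bar{x}_1, \tilde{X}_2 - \gamma(x_1 - \bar{x}_1))].
\]
Since $u_2$ is strictly increasing in its second argument by Assumption 1(a), we get
\[
\mathbb{E}_{\tilde{X}_2 \sim f(\cdot \mid \mu_2 - \gamma(\bar{x}_1 - \mu_1))}[u_2(\bar{x}_1, \tilde{X}_2 - \gamma(x_1 - \bar{x}_1))] \le \mathbb{E}_{\tilde{X}_2 \sim f(\cdot \mid \mu_2 - \gamma(\bar{x}_1 - \mu_1))}[u_2(\bar{x}_1, \tilde{X}_2)],
\]
seeing that $\gamma(x_1 - \bar{x}_1) \ge 0$. Also, at any $x_2 \in \mathbb{R}$, by Assumption 1(b) we know that
\[
u_1(x_1) - u_1(\bar{x}_1) > u_2(x_1, x_2) - u_2(\bar{x}_1, x_2) \ \Rightarrow\ u_1(x_1) - u_2(x_1, x_2) > u_1(\bar{x}_1) - u_2(\bar{x}_1, x_2).
\]
This then shows
\[
u_1(x_1) - \mathbb{E}_{\tilde{X}_2}[u_2(x_1, \tilde{X}_2 - \gamma(x_1 - \bar{x}_1))] > u_1(\bar{x}_1) - \mathbb{E}_{\tilde{X}_2}[u_2(\bar{x}_1, \tilde{X}_2 - \gamma(x_1 - \bar{x}_1))] \ge u_1(\bar{x}_1) - \mathbb{E}_{\tilde{X}_2}[u_2(\bar{x}_1, \tilde{X}_2)],
\]
where every expectation is taken over $\tilde{X}_2 \sim f(\cdot \mid \mu_2 - \gamma(\bar{x}_1 - \mu_1))$; that is, $D(x_1) > D(\bar{x}_1)$.

Step 2: $D$ is continuous.

Fixing some $\bar{x}_1 \in \mathbb{R}$, I show $D$ is continuous at $\bar{x}_1$. Since $u_1$ is continuous, find $\delta > 0$ so that whenever $|x_1 - \bar{x}_1| < 1$, $|u_1(x_1) - u_1(\bar{x}_1)| < \delta$. Consider the function $w : \mathbb{R} \to \mathbb{R}_{\ge 0}$ defined by $w(x_2) := |u_2(\bar{x}_1, x_2 + \gamma)| + |u_2(\bar{x}_1, x_2 - \gamma)| + \delta$.

Claim OA.1. Whenever $|x_1 - \bar{x}_1| < 1$, $|u_2(x_1, x_2 + \gamma(\bar{x}_1 - x_1))| \le w(x_2)$ for every $x_2 \in \mathbb{R}$.

Proof. Since $u_2$ is increasing in its second argument by Assumption 1(a), if $u_2(x_1, x_2 + \gamma(\bar{x}_1 - x_1)) \ge 0$, then $|u_2(x_1, x_2 + \gamma(\bar{x}_1 - x_1))| \le |u_2(x_1, x_2 + \gamma)|$ since $|x_1 - \bar{x}_1| < 1$. Otherwise, if $u_2(x_1, x_2 + \gamma(\bar{x}_1 - x_1)) < 0$, then $|u_2(x_1, x_2 + \gamma(\bar{x}_1 - x_1))| \le |u_2(x_1, x_2 - \gamma)|$. But we have
\[
|u_2(x_1, x_2 + \gamma)| \le |u_2(\bar{x}_1, x_2 + \gamma)| + |u_2(x_1, x_2 + \gamma) - u_2(\bar{x}_1, x_2 + \gamma)|
\]
for every $x_2$. By Assumption 1(b), $|u_2(x_1, x_2 + \gamma) - u_2(\bar{x}_1, x_2 + \gamma)| \le |u_1(x_1) - u_1(\bar{x}_1)| < \delta$ whenever $|x_1 - \bar{x}_1| < 1$. Similarly,
\[
|u_2(x_1, x_2 - \gamma)| \le |u_2(\bar{x}_1, x_2 - \gamma)| + |u_2(x_1, x_2 - \gamma) - u_2(\bar{x}_1, x_2 - \gamma)| \le |u_2(\bar{x}_1, x_2 - \gamma)| + \delta.
\]

Claim OA.2.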
The function $w$ is absolutely integrable with respect to the distribution $f(\cdot \mid \mu_2 - \gamma(\bar{x}_1 - \mu_1))$.

Proof. This is because both $x_2 \mapsto u_2(\bar{x}_1, x_2 + \mu_2 - \gamma(\bar{x}_1 - \mu_1) + \gamma)$ and $x_2 \mapsto u_2(\bar{x}_1, x_2 + \mu_2 - \gamma(\bar{x}_1 - \mu_1) - \gamma)$ are absolutely integrable with respect to $f(\cdot \mid 0)$, by Assumption 2.

Together, these two claims show that for the family of functions $x_2 \mapsto u_2(x_1, x_2 + \gamma(\bar{x}_1 - x_1))$ with $|x_1 - \bar{x}_1| < 1$, $w$ is an integrable dominating function with respect to the distribution $f(\cdot \mid \mu_2 - \gamma(\bar{x}_1 - \mu_1))$. Consider a sequence $(x_1^{(n)})_{n \in \mathbb{N}}$ with $x_1^{(n)} \to \bar{x}_1$. By continuity, $u_1(x_1^{(n)}) \to u_1(\bar{x}_1)$. For all large enough $n$, the functions $x_2 \mapsto u_2(x_1^{(n)}, x_2 + \gamma(\bar{x}_1 - x_1^{(n)}))$ fall within the family mentioned before. Since these functions converge pointwise in $x_2$ to $x_2 \mapsto u_2(\bar{x}_1, x_2)$, the existence of the dominating function $w$ implies the convergence of the integrals by the dominated convergence theorem,
\[
\mathbb{E}_{\tilde{X}_2 \sim f(\cdot \mid \mu_2 - \gamma(\bar{x}_1 - \mu_1))}[u_2(x_1^{(n)}, \tilde{X}_2 + \gamma(\bar{x}_1 - x_1^{(n)}))] \to \mathbb{E}_{\tilde{X}_2 \sim f(\cdot \mid \mu_2 - \gamma(\bar{x}_1 - \mu_1))}[u_2(\bar{x}_1, \tilde{X}_2)].
\]
But
\[
\mathbb{E}_\Psi[u_2(x_1^{(n)}, X_2) \mid X_1 = x_1^{(n)}] = \mathbb{E}_{\tilde{\tilde{X}}_2 \sim f(\cdot \mid \mu_2 - \gamma(x_1^{(n)} - \mu_1))}[u_2(x_1^{(n)}, \tilde{\tilde{X}}_2)] = \mathbb{E}_{\tilde{X}_2 \sim f(\cdot \mid \mu_2 - \gamma(\bar{x}_1 - \mu_1))}[u_2(x_1^{(n)}, \tilde{X}_2 + \gamma(\bar{x}_1 - x_1^{(n)}))],
\]
so
\[
\lim_{n\to\infty} \mathbb{E}_\Psi[u_2(x_1^{(n)}, X_2) \mid X_1 = x_1^{(n)}] = \mathbb{E}_{\tilde{X}_2 \sim f(\cdot \mid \mu_2 - \gamma(\bar{x}_1 - \mu_1))}[u_2(\bar{x}_1, \tilde{X}_2)] = \mathbb{E}_\Psi[u_2(\bar{x}_1, X_2) \mid X_1 = \bar{x}_1].
\]
This establishes that $D(x_1^{(n)}) \to D(\bar{x}_1)$, so $D$ is continuous at $\bar{x}_1$.

Step 3: If $\gamma > 0$, then there are $x_1' < x_1''$ so that $D(x_1') < 0 < D(x_1'')$.

I show $D$ is not always negative; the other statement is symmetric. From $u_1(x_1^g) - u_2(x_1^g, x_2^b) > \kappa > 0$, we get that for any $x_1 \ge x_1^g$ and $x_2 \le x_2^b$,
\[
u_1(x_1) - u_2(x_1, x_2) \ge u_1(x_1^g) - u_2(x_1^g, x_2) \ge u_1(x_1^g) - u_2(x_1^g, x_2^b) > \kappa,
\]
where the first inequality comes from Assumption 1(b) and the second one comes from Assumption 1(a).
We have for any x ,D ( x ) = u ( x ) − E Ψ [ u ( x , X ) | X = x ]= P Ψ [ X ≤ x b | X = x ] · ( u ( x ) − E Ψ [ u ( x , X ) | X = x , X ≤ x b ])+ P Ψ [ X > x b | X = x ] · ( u ( x ) − E Ψ [ u ( x , X ) | X = x , X > x b ]) . When x ≥ x g , u ( x ) − E Ψ [ u ( x , X ) | X = x , X ≤ x b ] > κ . Also, for x ≥ x g ,u ( x ) − E Ψ [ u ( x , X ) | X = x , X > x b ] ≤ u ( x g ) − E Ψ [ u ( x g , X ) | X = x , X > x b ] . But P Ψ [ X > x b | X = x ] · E Ψ [ u ( x g , X ) | X = x , X > x b ]= E Ψ [ { X > x b } · u ( x g , X ) | X = x ]= E ˜˜ X ∼ f ( ·| µ − γ ( x − µ )) [ { ˜˜ X > x b } · u ( x g , ˜˜ X )]= E ˜ X ∼ f ( ·| µ ) [ { ˜ X − γ ( x − µ ) > x b } · u ( x g , ˜ X − γ ( x − µ ))] ≤ E ˜ X ∼ f ( ·| µ ) [ { ˜ X − γ ( x − µ ) > x b } · | u ( x g , ˜ X ) | ]when x > µ . Since E ˜ X ∼ f ( ·| µ ) [ | u ( x g , ˜ X ) | ] = E ˜ X ∼ f ( ·| [ | u ( x g , ˜ X + µ ) | ] exists and isfinite by Assumption 2, as x → ∞ we must have E ˜ X ∼ f ( ·| µ ) [ { ˜ X − γ ( x − µ ) > x b } · | u ( x g , ˜ X ) | ] → γ > 0. So this shows for all largeenough x , D ( x ) ≥ κ/ > . A 1.2 Proof of Lemma A.5 Proof. The LHS, up to a constant not depending on µ , µ , can be written as: − Z ∞−∞ f ( x | µ • ) ln( f ( x | µ )) dx − Z c −∞ (cid:26)Z ∞−∞ f ( x | µ • ) · f ( x | µ • ) ln [ f ( x | µ − γ ( x − µ ))] dx (cid:27) dx Replacing ( µ , µ ) with ( µ • , µ − γ ( µ • − µ )) , the above expression becomes: − Z ∞−∞ f ( x | µ • )) ln( f ( x | µ • )) dx − Z c −∞ (cid:26)Z ∞−∞ f ( x | µ • ) · f ( x | µ • ) ln [ f ( x | µ − γ ( µ • − µ ) − γ ( x − µ • ))] dx (cid:27) dx which simplifies to − Z ∞−∞ f ( x | µ • ) ln( f ( x | µ • )) dx − Z c −∞ (cid:26)Z ∞−∞ f ( x | µ • ) · f ( x | µ • ) ln [ f ( x | µ − γ ( x − µ ))] dx (cid:27) dx So we see D KL ( H • ( c ) kH (Ψ( µ , µ ; γ ); c )) − D KL ( H • ( c ) kH (Ψ( µ • , µ − γ ( µ • − µ ); γ ); c ))= Z ∞−∞ f ( x | µ • ) ln( f ( x | µ • )) dx − Z ∞−∞ f ( x | µ • ) ln( f ( x | µ )) dx . 
Since { f ( · | µ ) : µ ∈ R } is a family of shifted densities, µ Z ∞−∞ f ( x | µ • ) ln( f ( x | µ )) dx is maximized at µ = µ • and attains a strictly smaller value for any µ = µ • . Thus thedifference is strictly positive. OA 1.3 Proof of Lemma A.7 Proof. I first show that ∂ ∂τ ln[ f ( x | τ )] < x , τ ∈ R . To see this, f ( x | τ ) = f ( x − τ | ∂ ∂τ ln[ f ( x | τ )] = h ∂ ∂y ln( f ( y | i y = x − τ . By Assumption 2, f ( · | 0) isstrictly log-concave, therefore ∂ ∂y ln( f ( y | < y ∈ R . We have from the definition of L ( µ | x ) ,∂ ∂µ L ( µ | x ) = Z ∞−∞ f ( x | µ • ) " ∂ ∂τ ln[ f ( x | τ )] τ = µ − γ ( x − µ • ) dx < ∂ ∂τ ln[ f ( x | τ )] < . Also, for the same reason, ∂ ∂x L ( µ | x ) = ( − γ ) · Z ∞−∞ f ( x | µ • ) " ∂ ∂τ ln[ f ( x | τ )] τ = µ − γ ( x − µ • ) dx < . Now, replacing L ( µ | x ) in the definition of ¯ L ( µ | c ) with L ( µ • | x − γ ( µ − µ • )) usingLemma A.6, we have for any c ∈ R ∪ {∞} ,∂ ∂µ ¯ L ( µ | c ) = ∂ ∂µ Z c −∞ f ( x | µ • ) · L ( µ • | x − γ ( µ − µ • )) dx =( − γ ) · Z c −∞ f ( x | µ • ) · " ∂ ∂τ L ( µ • | τ ) τ = x − γ ( µ − µ • ) dx . As just established, ∂ ∂τ L ( µ • | τ ) < τ ∈ R , therefore ∂ ∂µ ¯ L ( µ | c ) < ∂ ∂x ∂µ L ( µ | x ) = − γ ∂ ∂x L ( µ • | x − γ ( µ − µ • )) . But ∂ ∂x L ( µ • | x − γ ( µ − µ • )) < L ( µ | · ) just derived, therefore ∂ ∂x ∂µ L ( µ | x ) > . OA 1.4 Proof of Lemma A.8 Proof. We have ddx f ( x + K ) f ( x ) ! = f ( x + K ) f ( x ) − f ( x + K ) f ( x ) f ( x ) . Since d dx ln( f ( x )) < 0, we get ddx (cid:20) f ( x ) f ( x ) (cid:21) < 0, so f ( x + K ) f ( x + K ) < f ( x ) f ( x ) for all x. Rearranging thisshows f ( x + K ) f ( x ) − f ( x + K ) f ( x ) < . OA 1.5 Proof of Lemma A.10 Proof. Using Lemma A.9’s conclusion that ∂∂µ ¯ L ( µ • | ∞ ) = 0, we get Z ∞−∞ f ( x | µ • ) ∂L∂x ( µ • | x ) dx = 0 . 
∂L∂x ( µ • | · ) is strictly decreasing by Lemma A.7, we conclude Z c −∞ f ( x | µ • ) ∂L∂x ( µ • | x ) dx > c ∈ R , therefore ∂∂µ ¯ L ( µ • | c ) < µ l ∈ R where ∂∂µ ¯ L ( µ | ¯ c ) ≥ , then a solution to the FOC existsbetween µ l and µ • by intermediate value theorem. We show such µ l can always be found.We have ∂∂µ ¯ L ( µ • − | ∞ ) > , since µ ∂∂µ L ( µ | x ) is decreasing by Lemma A.7.By continuity, we may find large enough c h ∈ R so that ∂∂µ ¯ L ( µ • − | c h ) > ∂∂µ ¯ L ( µ • − | c ) > c ∈ R , we are done by taking µ l = µ • − . Else, byintermediate value theorem there exists ˆ c ∈ R so that ∂∂µ ¯ L ( µ • − | ˆ c ) = 0. Using the finalfact from A.7 that ∂L∂µ ( µ • − | · ) is strictly increasing, if ¯ c ≥ ˆ c then we are done by taking µ l = µ • − , as ∂∂µ ¯ L ( µ • − | ˆ c ) = 0 implies ∂∂µ ¯ L ( µ • − | ¯ c ) ≥ c l = ˆ c − K for K > . I show that µ l may be taken to be µ • − − γK to get ∂∂µ ¯ L ( µ l | c l ) > . We have ¯ L ( µ | ˆ c ) = Z ˆ c −∞ f ( x | µ • ) L (0 | x − γ µ ) dx , and so ¯ L ( µ − γK | ˆ c − K ) = Z ˆ c − K −∞ f ( x | µ • ) L (0 | ( x + K ) − γ µ ) dx = Z ˆ c −∞ f ( x − K | µ • ) L (0 | x − γ µ ) dx . This implies ∂∂µ ¯ L ( µ − γK | ˆ c − K ) = − γ Z ˆ c −∞ f ( x − K | µ • ) ∂∂x L (0 | x − γ µ ) dx . For µ = µ • − 1, we rewrite ∂∂µ ¯ L ( µ − γK | ˆ c − K ) as − γ Z ˆ c −∞ f ( x − K | µ • ) f ( x | µ • ) ! f ( x | µ • ) ∂∂x L (0 | x − γ ( µ • − dx . By the construction of ˆ c , ∂∂µ ¯ L ( µ • − | ˆ c ) = 0 , that is to say Z ˆ c −∞ f ( x | µ • ) ∂∂x L (0 | x − γ ( µ • − dx = 0 . Since ∂∂x L (0 | x − γ ( µ • − x by Lemma A.7, it must be positive6or some low values of x and negative for some high values of x not exceeding ˆ c. Let r < ˆ c be such that ∂∂x L (0 | x − γ ( µ • − . Then we have Z ˆ c −∞ f ( x − K | µ • ) f ( x | µ • ) ! f ( x | µ • ) ∂∂x L (0 | x − γ ( µ • − dx < Z r −∞ f ( r − K | µ • ) f ( r | µ • ) ! f ( x | µ • ) ∂∂x L (0 | x − γ ( µ • − dx + Z ˆ cr f ( r − K | µ • )) f ( r | µ • )) ! 
f ( x | µ • ) ∂∂x L (0 | x − γ ( µ • − dx . To see this, first observe that x f ( x | µ • ) f ( x − K | µ • ) is strictly decreasing by Lemma A.8, therefore x f ( x − K | µ • ) f ( x | µ • ) is strictly increasing. So replacing f ( x − K | µ • ) f ( x | µ • ) with f ( r − K | µ • ) f ( r | µ • ) over-weighs theintegrand on the interval ( −∞ , r ) , while the same weight under-weighs the integrand on( r, ˆ c ) . This amounts to an over-weighting of the positive part of the integrand and an under-weighting of the negative part, thus over-estimating the integral value.Multiplying both sides by − γ and reversing the inequality, ∂∂µ ¯ L ( µ l | ˆ c − K ) > − γ · Z r −∞ f ( r − K | µ • ) f ( r | µ • ) ! f ( x | µ • ) ∂∂x L (0 | x − γ ( µ • − dx − γ Z ˆ cr f ( r − K | µ • )) f ( r | µ • )) ! f ( x | µ • ) ∂∂x L (0 | x − γ ( µ • − dx . The RHS is − γ · f ( r − K | µ • )) f ( r | µ • )) · Z ˆ c −∞ f ( x | µ • ) ∂∂x L (0 | x − γ ( µ • − dx = 0 , hence we have ∂∂µ ¯ L ( µ l | ¯ c ) > OA 1.6 Proof of Lemma A.12 Proof. First-order condition implies ∂∂µ ¯ L ( µ ∗ ( c ) | c ) = 0. Using the second statement ofLemma A.6 in the FOC gives − γ Z c −∞ f ( · | µ • ) ∂∂x L ( µ • | x − γ ( µ ∗ ( c ) − µ • )) = 0 . By strict concavity of L ( µ • | · ) from Lemma A.7, this requires that ∂∂x L ( µ • | · ) takes ona strictly negative value at the rightmost point of the domain of integration, ∂∂x L ( µ • | c − γ ( µ ∗ ( c ) − µ • )) < 0. 7rom its definition, x L ( µ • | x ) is maximized when x = µ • , for we have L ( µ • | µ • ) = R ∞−∞ f ( x | µ • ) ln[ f ( x | µ • )] dx . Since L ( µ • | · ) is strictly concave, this means ∂∂x L ( µ • | τ ) < τ > µ • . Combining with the previous inequality, c − γ ( µ ∗ ( c ) − µ • ) > µ • , which rearranges to say µ ∗ ( c ) − γ ( c − µ • ) < µ • as desired. OA 1.7 Proof of Lemma A.13 Proof. 
Consider the payoff difference between accepting x and continuing under belief ν , D ( x ; ν ) := u ( x ) − Z E X ∼ f ( ·| µ − γ ( x − µ • ) ,σ ) [ u ( x , X )] dν ( µ ) . Note that D ( x , ν ) = R D ( x ; µ • , µ , γ ) dν ( µ ). Lemma A.2 shows that for every µ ∈ R , D ( x ; µ • , µ , γ ) is strictly increasing in x . Hence the same must hold for D ( x , ν ) . Also, Lemma A.2 implies there exists some x ∈ R so that D ( x ; µ • , µ , γ ) < , and thatthere exists some x ∈ R satisfying D ( x ; µ • , µ , γ ) > 0. Since u increases in its secondargument, we also get D ( x ; µ • , µ , γ ) < D ( x ; µ • , µ , γ ) > µ ∈ [ µ , ¯ µ ]. Thisimplies D ( x ; ν ) < D ( x ; ν ) > 0, as ν is supported on (a subset of) [ µ , ¯ µ ] . Finally, I show D ( x ; ν ) is continuous in x . Fix ¯ x ∈ R . Since u is continuous, find δ > | x − ¯ x | < , | u ( x ) − u (¯ x ) | < δ. Consider the function φ : R → R ≥ defined by φ ( x , µ ) := | u (¯ x , x − γ + µ ) | + | u (¯ x , x + γ + µ ) | + δ . Claim OA.3 . Whenever | x − ¯ x | < , | u ( x , x + γ (¯ x − x ) + µ ) | ≤ φ ( x , µ ) for every x , µ ∈ R . Proof. This is the same as the proof of Claim OA.1. Claim OA.4 . R ¯ µ µ (cid:16)R ∞−∞ | φ ( x , µ ) | · f ( x | − γ (¯ x − µ • )) dx (cid:17) dν ( µ ) < ∞ . Proof. We may write φ ( x , µ ) := u + γ , + ( x , µ ) + u + γ , − ( x , µ ) + u − γ , + ( x , µ ) + u − γ , − ( x , µ ) + δ where u + γ , + and u + γ , − are the positive and negative parts of ( x , µ ) u (¯ x , x + γ + µ ) , and u − γ , + and u − γ , − are the positive and negative parts of ( x , µ ) u (¯ x , x − γ + µ ) . FromAssumption 2, for every µ ∈ [ µ , ¯ µ ], each of u + γ , + ( · , µ ) , u + γ , − ( · , µ ) ,u − γ , + ( · , µ ) , and u − γ , − ( · , µ )is integrable over R with respect to the density f ( · | − γ (¯ x − µ • )) . These integrals aremaximized at µ = ¯ µ for u + γ , + ( · , µ ) and u − γ , + ( · , µ ), and maximized at µ = µ for u + γ , − ( · , µ )8nd u − γ , − ( · , µ ). 
In other words, for every µ ∈ [ µ , ¯ µ ], Z ∞−∞ | φ ( x , µ ) | · f ( x | − γ (¯ x − µ • )) dx ≤ Z ∞−∞ (cid:16) u + γ , + ( x , ¯ µ ) + u − γ , + ( x , ¯ µ ) (cid:17) · f ( x | − γ (¯ x − µ • )) dx + Z ∞−∞ (cid:16) u + γ , − ( x , µ ) + u − γ , − ( x , µ ) (cid:17) · f ( x | − γ (¯ x − µ • )) dx . This bound is finite and does not depend on µ , so the overall integral over dν ( µ ) is alsofinite.Consider a sequence x ( n )1 → ¯ x . We have D ( x ( n )1 ; ν ) = u ( x ( n )1 ) − Z E ˜˜ X ∼ f ( ·| µ − γ ( x ( n )1 − µ • )) [ u ( x ( n )1 , ˜˜ X )] dν ( µ )= u ( x ( n )1 ) − Z E ˜ X ∼ f ( ·|− γ (¯ x − µ • )) [ u ( x ( n )1 , ˜ X + γ (¯ x − x ( n )1 ) + µ )] dν ( µ )= u ( x ( n )1 ) − Z ¯ µ µ Z ∞−∞ u ( x ( n )1 , x + γ (¯ x − x ( n )1 ) + µ ) · f ( x | − γ (¯ x − µ • )) dx dν ( µ ) . The sequence of functions ( x , µ ) u ( x ( n )1 , x + γ (¯ x − x ( n )1 ) + µ ) pointwise convergeto u (¯ x , x + µ ) as n → ∞ . From the two claims, for all large enough n, this sequenceof functions are pointwise dominated by f, an absolutely integrable function on the samedomain. Therefore continuity follows from dominated convergence theorem, as in the proofof Lemma A.2.This means there exists a unique c ∗ so that D ( c ∗ ) = 0. The cutoff strategy S c ∗ is optimal,because it stops at every x whose stopping payoff exceeds expected continuation payoff, andcontinues at every x where expected continuation payoff is higher than stopping payoff.For any c = c ∗ + δ for some δ > 0, the difference in expected payoffs of S c ∗ and S c is R c ∗ + δc ∗ D ( x ; ν ) > D ( x ; ν ) is strictly positive on the interval ( c ∗ , c ∗ + δ ]. So everystrictly higher cutoff than c ∗ is strictly suboptimal. A similar argument shows every strictlylower cutoff than c ∗ is also strictly suboptimal. OA 1.8 Proof of Lemma A.15 Proof. Note that ¯ ϕ t ( µ ) is measurable with respect to F t − . 
Also, ϕ t ( µ ) |F t − = ϕ t ( µ ) | ˜ C t ,because by independence of X t from ( X s ) t − s =1 , the only information that F t − contains about ϕ t ( µ ) is in determining the cutoff threshold ˜ C t .9t a sample path ω so that ˜ C t ( ω ) = c ∈ R , E [ ϕ t ( µ ) |F t − ]( ω ) = E [ − λ ( X ,s − µ + µ • + γ ( X ,s − µ • )) · { X ,s ≤ c } ]= ∂∂µ Z c −∞ g ( x ) · Z ∞−∞ g ( x ) · ln( f ( X ,s | µ − γ ( X ,s − µ • ))) dx dx = ∂∂µ ¯ L ( µ | c ) . This shows that E [ ϕ t ( µ ) |F t − ]( ω ) = ¯ ϕ t ( µ )( ω ). Since this holds regardless of c , we get that E [ ϕ t ( µ ) |F t − ] = ¯ ϕ t ( µ ) for all ω, that is to say E [ ξ t ( µ ) |F t − ] = Var[ ϕ t ( µ ) |F t − ] ≤ E [ ϕ t ( µ ) |F t − ] ≤ E [( λ ( X ,s − µ + µ • + γ ( X ,s − µ • ))) ]It suffices now to show E h ( λ ( X − µ + µ • + γ ( X − µ • ))) i exists for all µ ∈ R and iscontinuous in µ , for then the (finite) maximum value this expectation takes on the compactinterval [ µ , ¯ µ ] can be taken as κ ξ .Continuity is clear.By assumption on g , there exists some κ g < ∞ so that for all z ∈ R , − κ g < λ ( z ) < λ ( z ) is Lipschitz continuous with constant κ g . Let b := λ ( − µ + µ • − γµ • ).For any x , x ∈ R , ( λ ( x − µ + µ • + γ ( x − µ • ))) = b + ( λ ( x − µ + µ • + γ ( x − µ • ))) − ( λ ( − µ + µ • − γµ • )) ≤ b + | λ ( x − µ + µ • + γ ( x − µ • )) − λ ( − µ + µ • − γµ • ) | ·× | λ ( x − µ + γ ( x + µ • − µ • )) + λ ( − µ + µ • − γµ • ) |≤ b + ( κ g · ( | x | + γ | x | )) · (2 b + ( κ g · ( | x | + γ | x | ))) . Note the bound is a second-order polynomial in | x | and | x | . We have E h ( λ ( X − µ + µ • + γ ( X − µ • ))) i ≤ E h b + ( κ g · ( | X | + γ | X | )) · (2 b + ( κ g · ( | X | + γ | X | ))) i < ∞ , where the last inequality comes from the fact that X , X have finite second moments. OA 1.9 Proof of Lemma A.21 Proof. Suppose µ A , [2] ≥ µ A , [1] . From Lemma 1, C is strictly increasing in its second argument.This shows c A [2] = C ( µ • , µ A , [2] ; γ ) ≥ C ( µ • , µ A , [1] ; γ ) = c A [1] . 
But by Proposition 2, $\mu_2^*(c)$ increases in $c$, so $\mu^A_{2,[3]} = \mu_2^*(c^A_{[2]}) \ge \mu_2^*(c^A_{[1]}) = \mu^A_{2,[2]}$. Continuing this argument shows that $(\mu^A_{2,[t]})_{t\ge 1}$ is a monotonically increasing sequence. Since $C$ is strictly increasing in its second argument, $(c^A_{[t]})_{t\ge 1}$ must also form a monotonically increasing sequence.

Conversely, if $\mu^A_{2,[2]} < \mu^A_{2,[1]}$, then the analogous arguments show that $(\mu^A_{2,[t]})_{t\ge 1}$ and $(c^A_{[t]})_{t\ge 1}$ are monotonically decreasing sequences.

It is clear that $\mu^A_{2,[t]}$ are iterates of $I(\cdot; \gamma)$, so they must converge to its fixed point as $I(\cdot; \gamma)$ is a contraction mapping by Proposition 4. We have $\lim_{t\to\infty} c^A_{[t]} = \lim_{t\to\infty} C(\mu_1^\bullet, \mu^A_{2,[t]}; \gamma)$. We may take the limit inside the $C$ function since it is continuous, finding that $\lim_{t\to\infty} c^A_{[t]} = C(\mu_1^\bullet, \mu_2^\infty; \gamma)$.

OA 1.10 Proof of Lemma 3

Proof. The indifference condition $c_L = C_{u_1, u_2^L}(\mu_1, \mu_2; \gamma)$ implies that
\[
u_1(c_L) = \mathbb{E}_{\tilde{X}_2 \sim f(\cdot \mid \mu_2 - \gamma(c_L - \mu_1))}[u_2^L(c_L, \tilde{X}_2)].
\]
Since $u_2^H(c_L, x_2) \ge u_2^L(c_L, x_2)$ for all $x_2 \in \mathbb{R}$, with strict inequality on a positive-measure set, this shows
\[
u_1(c_L) < \mathbb{E}_{\tilde{X}_2 \sim f(\cdot \mid \mu_2 - \gamma(c_L - \mu_1))}[u_2^H(c_L, \tilde{X}_2)].
\]
Because $(u_1, u_2^L)$ satisfy Assumption 1, the best stopping strategy in the feasible model $\Psi(\mu_1, \mu_2; \gamma)$ has a cutoff form by Proposition 1. This shows $C_{u_1, u_2^H}(\mu_1, \mu_2; \gamma)$ is strictly above $c_L$.

OA 1.11 Proof of Proposition 6

Proof. Under Assumptions 1, 2, and 3, each of $(u_1, u_2^H)$ and $(u_1, u_2^L)$ has a unique steady state, $(\mu_1^\bullet, \mu_{2,H}^\infty, c_H^\infty)$ and $(\mu_1^\bullet, \mu_{2,L}^\infty, c_L^\infty)$ respectively. Let $I_H$, $I_L$ be the iteration maps corresponding to these two stage games, that is to say
\[
I_H(\mu_2) := \mu_2^*(C_{u_1, u_2^H}(\mu_1^\bullet, \mu_2; \gamma)), \qquad I_L(\mu_2) := \mu_2^*(C_{u_1, u_2^L}(\mu_1^\bullet, \mu_2; \gamma)).
\]
From Proposition 4, both $I_H$ and $I_L$ are contraction mappings. Consider their iterates with a starting value of 0.
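That is, put $\mu_{2,H}^{[0]} = 0$, $\mu_{2,L}^{[0]} = 0$ and let $\mu_{2,H}^{[t]} = I_H(\mu_{2,H}^{[t-1]})$, $\mu_{2,L}^{[t]} = I_L(\mu_{2,L}^{[t-1]})$ for $t \ge 1$. By the property of contraction mappings and since the fixed points of the iteration maps are the steady-state beliefs, $\mu_{2,H}^{[t]} \to \mu_{2,H}^\infty$ and $\mu_{2,L}^{[t]} \to \mu_{2,L}^\infty$.

By induction, I will show $\mu_{2,L}^{[t]} \le \mu_{2,H}^{[t]}$ for every $t \ge 0$. The base case of $t = 0$ is true by definition. If $\mu_{2,L}^{[T]} \le \mu_{2,H}^{[T]}$, then
\[
C_{u_1, u_2^L}(\mu_1^\bullet, \mu_{2,L}^{[T]}; \gamma) \le C_{u_1, u_2^L}(\mu_1^\bullet, \mu_{2,H}^{[T]}; \gamma) < C_{u_1, u_2^H}(\mu_1^\bullet, \mu_{2,H}^{[T]}; \gamma).
\]
The first inequality is due to $C$ being increasing in the second argument and the inductive hypothesis, while the second inequality is due to Lemma 3. Therefore, $I_L(\mu_{2,L}^{[T]}) \le I_H(\mu_{2,H}^{[T]})$ using the fact that $\mu_2^*$ is increasing by Proposition 2, so $\mu_{2,L}^{[T+1]} \le \mu_{2,H}^{[T+1]}$.

Since weak inequalities are preserved by limits, we have $\mu_{2,H}^\infty \ge \mu_{2,L}^\infty$. It is impossible to have $\mu_{2,H}^\infty = \mu_{2,L}^\infty$, because this would lead to $c_H^\infty > c_L^\infty$ by Lemma 3, which in turn implies $\mu_{2,H}^\infty = \mu_2^*(c_H^\infty) > \mu_2^*(c_L^\infty) = \mu_{2,L}^\infty$. This inequality contradicts $\mu_{2,H}^\infty = \mu_{2,L}^\infty$. Therefore, we in fact have $\mu_{2,H}^\infty > \mu_{2,L}^\infty$. The conclusion that $c_H^\infty > c_L^\infty$ follows from Lemma 3 and the fact that $C$ increases in its second argument.

OA 1.12 Proof of Proposition 7

Proof. Rewrite Equation (2) as
\[
\int_{-\infty}^{\infty} \phi(x_1; \mu_1^\bullet, (\sigma_1^\bullet)^2) \cdot \ln\left( \frac{\phi(x_1; \mu_1^\bullet, (\sigma_1^\bullet)^2)}{\phi(x_1; \mu_1, \sigma_1^2)} \right) dx_1 + \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, (\sigma_1^\bullet)^2) \cdot \int_{-\infty}^{\infty} \phi(x_2; \mu_2^\bullet, (\sigma_2^\bullet)^2) \ln\left[ \frac{\phi(x_2; \mu_2^\bullet, (\sigma_2^\bullet)^2)}{\phi(x_2; \mu_2 - \gamma(x_1 - \mu_1), \sigma_2^2)} \right] dx_2\, dx_1.
\]
The KL divergence between $\mathcal{N}(\mu_{\text{true}}, \sigma_{\text{true}}^2)$ and $\mathcal{N}(\mu_{\text{model}}, \sigma_{\text{model}}^2)$ is
\[
\ln \frac{\sigma_{\text{model}}}{\sigma_{\text{true}}} + \frac{\sigma_{\text{true}}^2 + (\mu_{\text{true}} - \mu_{\text{model}})^2}{2 \sigma_{\text{model}}^2} - \frac{1}{2},
\]
so we may simplify the first term and the inner integral of the second term:
\[
\ln \frac{\sigma_1}{\sigma_1^\bullet} + \frac{(\mu_1 - \mu_1^\bullet)^2}{2 \sigma_1^2} + \frac{(\sigma_1^\bullet)^2}{2 \sigma_1^2} - \frac{1}{2} + \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, (\sigma_1^\bullet)^2) \cdot \left[ \ln \frac{\sigma_2}{\sigma_2^\bullet} + \frac{(\sigma_2^\bullet)^2 + (\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet)^2}{2 \sigma_2^2} - \frac{1}{2} \right] dx_1.
\]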
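Dropping terms not dependent on any of the four variables gives a simplified version of the objective,
\[
\xi(\mu_1, \mu_2, \sigma_1, \sigma_2) := \ln \frac{\sigma_1}{\sigma_1^\bullet} + \frac{(\mu_1 - \mu_1^\bullet)^2}{2 \sigma_1^2} + \frac{(\sigma_1^\bullet)^2}{2 \sigma_1^2} + \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, (\sigma_1^\bullet)^2) \cdot \left[ \ln \frac{\sigma_2}{\sigma_2^\bullet} + \frac{(\sigma_2^\bullet)^2 + (\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet)^2}{2 \sigma_2^2} \right] dx_1.
\]
Differentiating under the integral sign,
\[
\frac{\partial \xi}{\partial \mu_2} = \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, (\sigma_1^\bullet)^2) \cdot \left[ \frac{\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet}{\sigma_2^2} \right] dx_1,
\]
\[
\frac{\partial \xi}{\partial \mu_1} = \frac{\mu_1 - \mu_1^\bullet}{\sigma_1^2} + \gamma \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, (\sigma_1^\bullet)^2) \cdot \left[ \frac{\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet}{\sigma_2^2} \right] dx_1 = \frac{\mu_1 - \mu_1^\bullet}{\sigma_1^2} + \gamma \frac{\partial \xi}{\partial \mu_2}.
\]
At the FOC $(\mu_1^*, \mu_2^*, \sigma_1^*, \sigma_2^*)$, we have $\frac{\partial \xi}{\partial \mu_2}(\mu_1^*, \mu_2^*, \sigma_1^*, \sigma_2^*) = 0$, hence $\mu_1^* = \mu_1^\bullet$. Similar arguments as before then establish $\mu_2^* = \mu_2^\bullet - \gamma(\mu_1^\bullet - \mathbb{E}[X_1 \mid X_1 \le c])$, where the expectation is taken with respect to the true distribution of $X_1$ (with the true variance $(\sigma_1^\bullet)^2$). Then,
\[
\frac{\partial \xi}{\partial \sigma_1}(\mu_1^*, \mu_2^*, \sigma_1^*, \sigma_2^*) = \frac{1}{\sigma_1^*} - \frac{(\sigma_1^\bullet)^2}{(\sigma_1^*)^3} = 0,
\]
which gives $\sigma_1^* = \sigma_1^\bullet$ (since $\sigma_1^* \ge 0$). Finally, from the FOC for $\sigma_2$,
\[
\int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, (\sigma_1^\bullet)^2) \cdot \left[ \frac{1}{\sigma_2^*} - \frac{(\sigma_2^\bullet)^2 + (\mu_2^* - \gamma(x_1 - \mu_1^*) - \mu_2^\bullet)^2}{(\sigma_2^*)^3} \right] dx_1 = 0.
\]
Substituting in the values of $\mu_1^*, \mu_2^*$ already solved for,
\begin{align*}
(\sigma_2^*)^2 &= (\sigma_2^\bullet)^2 + \mathbb{E}\left[ (\mu_2^* - \gamma(X_1 - \mu_1^\bullet) - \mu_2^\bullet)^2 \mid X_1 \le c \right] \\
&= (\sigma_2^\bullet)^2 + \mathbb{E}\left[ (\mu_2^\bullet - \gamma(\mu_1^\bullet - \mathbb{E}[X_1 \mid X_1 \le c]) - \gamma(X_1 - \mu_1^\bullet) - \mu_2^\bullet)^2 \mid X_1 \le c \right] \\
&= (\sigma_2^\bullet)^2 + \gamma^2\, \mathbb{E}\left[ \left[ (X_1 - \mu_1^\bullet) - (\mathbb{E}[X_1 \mid X_1 \le c] - \mu_1^\bullet) \right]^2 \mid X_1 \le c \right] \\
&= (\sigma_2^\bullet)^2 + \gamma^2\, \mathrm{Var}[X_1 - \mu_1^\bullet \mid X_1 \le c] \\
&= (\sigma_2^\bullet)^2 + \gamma^2\, \mathrm{Var}[X_1 \mid X_1 \le c]
\end{align*}
as desired.

OA 1.13 Proof of Proposition 8

I start with a lemma that says, depending on the convexity of the decision problem, a stronger belief in fictitious variation either increases or decreases the subjectively optimal cutoff threshold.

Lemma OA.1. Suppose that under the feasible model $\Psi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2; \gamma)$, the agent is indifferent between stopping at $c$ and continuing. Suppose $\hat{\sigma}_2^2 > \sigma_2^2$.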
Then: (i) if $x_2 \mapsto u_2(c, x_2)$ is convex with strict convexity for $x_2$ in a positive-measure set, then under the feasible model $\Psi(\mu_1, \mu_2, \sigma_1^2, \hat{\sigma}_2^2; \gamma)$ the agent strictly prefers continuing at $c$; (ii) if $x_2 \mapsto u_2(c, x_2)$ is concave with strict concavity for $x_2$ in a positive-measure set, then under the feasible model $\Psi(\mu_1, \mu_2, \sigma_1^2, \hat{\sigma}_2^2; \gamma)$ the agent strictly prefers stopping at $c$.

Proof. Indifference at $x_1 = c$ under the model $\Psi(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2; \gamma)$ implies that
\[
u_1(c) = \mathbb{E}_{X_2 \sim \mathcal{N}(\mu_2 - \gamma(x_1 - \mu_1),\, \sigma_2^2)}[u_2(c, X_2)].
\]
When the hypothesis in (i) is satisfied,
\[
\mathbb{E}_{X_2 \sim \mathcal{N}(\mu_2 - \gamma(x_1 - \mu_1),\, \sigma_2^2)}[u_2(c, X_2)] < \mathbb{E}_{X_2 \sim \mathcal{N}(\mu_2 - \gamma(x_1 - \mu_1),\, \hat{\sigma}_2^2)}[u_2(c, X_2)],
\]
since $\hat{\sigma}_2^2 > \sigma_2^2$ implies that $\mathcal{N}(\mu_2 - \gamma(x_1 - \mu_1), \hat{\sigma}_2^2)$ is a strict mean-preserving spread of $\mathcal{N}(\mu_2 - \gamma(x_1 - \mu_1), \sigma_2^2)$. The RHS is the expected continuation payoff under the model $\Psi(\mu_1, \mu_2, \sigma_1^2, \hat{\sigma}_2^2; \gamma)$, so the agent strictly prefers continuing when $X_1 = c$. The argument establishing (ii) is analogous.

Now I give the proof of Proposition 8.

Proof. The result that $\mu_{1,[t]} = \mu_1^\bullet$ and $(\sigma_{1,[t]})^2 = (\sigma_1^\bullet)^2$ for all $t$ follows from Proposition 7.

Suppose $c_{[1]} \le c_{[0]}$. From Proposition 7, $\mu_{2,[2]} \le \mu_{2,[1]}$ and $(\sigma_{2,[2]})^2 \le (\sigma_{2,[1]})^2$. Let $c'_{[2]}$ be the indifference threshold under the model $\Psi(\mu_1^\bullet, \mu_{2,[2]}, (\sigma_1^\bullet)^2, (\sigma_{2,[1]})^2)$. By Lemma 1, $c'_{[2]} \le c_{[1]}$. Also, from Lemma OA.1, $c_{[2]} \le c'_{[2]}$, as generation 2 actually believes in the feasible model $\Psi(\mu_1^\bullet, \mu_{2,[2]}, (\sigma_1^\bullet)^2, (\sigma_{2,[2]})^2)$ where $(\sigma_{2,[2]})^2 \le (\sigma_{2,[1]})^2$. This shows $c_{[2]} \le c_{[1]}$. Continuing this argument shows that $(c_{[t]})_{t\ge 1}$ forms a monotonically decreasing sequence. Since the pseudo-true parameters $\mu_2^*$ and $(\sigma_2^*)^2$ are monotonic functions of the censoring threshold $c$, we have established the proposition in the case where $c_{[1]} \le c_{[0]}$.

The argument for the case where $c_{[1]} \ge c_{[0]}$ is exactly analogous and therefore omitted.

OA 1.14 Proof of Proposition 9

Proof.
In the first generation, both societies A and B observe large datasets of histories with distribution $\mathcal{H}^\bullet(c_{[0]})$. So, by Proposition 7, the two societies make the same inferences about the fundamentals.

Suppose the optimal-stopping problem is convex. Then due to fictitious variation in generation 1 and the convexity of $u_2$, it follows from Lemma OA.1 that $c_{[B,1]} > c_{[A,1]}$. In the second generation, $\mu_{2,[B,2]} > \mu_{2,[A,2]}$ because the pseudo-true second-period fundamental increases in the censoring cutoff. Together again with the existence of fictitious variation, we conclude $c_{[B,2]} > c_{[A,2]}$. Continuing this argument establishes the proposition for the case where the optimal-stopping problem is convex. The case of concave optimal-stopping problems is analogous.

OA 1.15 Proof of Proposition 10

Proof. In the true model, $X_2 \mid (X_1 = x_1) \sim \mathcal{N}(\mu_2^\bullet - \gamma^\bullet(x_1 - \mu_1^\bullet), \sigma^2)$, while the agents' feasible model $\Psi(\mu_1, \mu_2; \gamma)$ has $X_2 \mid (X_1 = x_1) \sim \mathcal{N}(\mu_2 - \gamma(x_1 - \mu_1), \sigma^2)$. So, we can write
\[
D_{KL}\big( \mathcal{H}(\Psi(\mu_1^\bullet, \mu_2^\bullet; \gamma^\bullet); c) \,\|\, \mathcal{H}(\Psi(\mu_1, \mu_2; \gamma); c) \big)
\]
as the following:
\[
\int_{c}^{\infty} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \ln\left( \frac{\phi(x_1; \mu_1^\bullet, \sigma^2)}{\phi(x_1; \mu_1, \sigma^2)} \right) dx_1 + \int_{-\infty}^{c} \int_{-\infty}^{\infty} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \phi(x_2; \mu_2^\bullet - \gamma^\bullet(x_1 - \mu_1^\bullet), \sigma^2) \cdot \ln\left[ \frac{\phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \phi(x_2; \mu_2^\bullet - \gamma^\bullet(x_1 - \mu_1^\bullet), \sigma^2)}{\phi(x_1; \mu_1, \sigma^2) \cdot \phi(x_2; \mu_2 - \gamma(x_1 - \mu_1), \sigma^2)} \right] dx_2\, dx_1.
\]
Performing rearrangements similar to those in the proof of Proposition 2 and using the closed-form expression of KL divergence between two Gaussian distributions, the above can be rewritten as
\[
\frac{(\mu_1 - \mu_1^\bullet)^2}{2\sigma^2} + \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \frac{(\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet + \gamma^\bullet(x_1 - \mu_1^\bullet))^2}{2\sigma^2}\, dx_1.
\]
Multiplying through by $\sigma^2$ and dropping terms not depending on $\mu_1, \mu_2, \gamma$, we get a simplified objective with the same minimizers:
\[
\xi(\mu_1, \mu_2, \gamma) = \frac{(\mu_1 - \mu_1^\bullet)^2}{2} + \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \frac{1}{2} \cdot \left[ \mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet + \gamma^\bullet(x_1 - \mu_1^\bullet) \right]^2 dx_1.
\]
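We have the partial derivatives by differentiating under the integral sign,
\[
\frac{\partial \xi}{\partial \mu_2} = \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \left[ \mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet + \gamma^\bullet(x_1 - \mu_1^\bullet) \right] dx_1,
\]
\[
\frac{\partial \xi}{\partial \mu_1} = (\mu_1 - \mu_1^\bullet) + \gamma \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \left[ \mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet + \gamma^\bullet(x_1 - \mu_1^\bullet) \right] dx_1 = (\mu_1 - \mu_1^\bullet) + \gamma \frac{\partial \xi}{\partial \mu_2},
\]
\[
\frac{\partial \xi}{\partial \gamma} = - \int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot [x_1 - \mu_1] \cdot \left[ \mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet + \gamma^\bullet(x_1 - \mu_1^\bullet) \right] dx_1.
\]
Suppose $(\mu_1^*, \mu_2^*, \gamma^*)$ is the minimum. By the first-order conditions for $\mu_1$ and $\mu_2$, we have
\[
\frac{\partial \xi}{\partial \mu_1}(\mu_1^*, \mu_2^*, \gamma^*) = \frac{\partial \xi}{\partial \mu_2}(\mu_1^*, \mu_2^*, \gamma^*) = 0 \ \Rightarrow\ \mu_1^* = \mu_1^\bullet.
\]
So, from the first-order condition for $\mu_2$,
\[
\frac{\partial \xi}{\partial \mu_2}(\mu_1^\bullet, \mu_2^*, \gamma^*) = 0 \ \Rightarrow\ \mu_2^* = \mu_2^\bullet + (\gamma^\bullet - \gamma^*) \cdot (\mu_1^\bullet - \mathbb{E}[X_1 \mid X_1 \le c]).
\]
It remains to show $\gamma^* = \tilde{\gamma}$. We have
\[
\frac{\partial \xi}{\partial \gamma}(\mu_1^*, \mu_2^*, \gamma^*) = - \mathbb{P}[X_1 \le c] \cdot \mathbb{E}\left[ (X_1 - \mu_1^*) \cdot (\mu_2^* - \gamma^*(X_1 - \mu_1^*) - \mu_2^\bullet + \gamma^\bullet(X_1 - \mu_1^\bullet)) \mid X_1 \le c \right].
\]
We rearrange the expectation term as
\begin{align*}
&\mathbb{E}\left[ (X_1 - \mu_1^*) \cdot (\mu_2^* - \gamma^*(X_1 - \mu_1^*) - \mu_2^\bullet + \gamma^\bullet(X_1 - \mu_1^\bullet)) \mid X_1 \le c \right] \\
&\quad = \mathbb{E}[(X_1 - \mu_1^*) \mid X_1 \le c] \cdot \mathbb{E}\left[ (\mu_2^* - \gamma^*(X_1 - \mu_1^*) - \mu_2^\bullet + \gamma^\bullet(X_1 - \mu_1^\bullet)) \mid X_1 \le c \right] \\
&\qquad + \mathrm{Cov}\left( X_1 - \mu_1^*,\ \mu_2^* - \gamma^*(X_1 - \mu_1^*) - \mu_2^\bullet + \gamma^\bullet(X_1 - \mu_1^\bullet) \mid X_1 \le c \right).
\end{align*}
The first-order condition for $\mu_2$ implies $\mathbb{E}[(\mu_2^* - \gamma^*(X_1 - \mu_1^*) - \mu_2^\bullet + \gamma^\bullet(X_1 - \mu_1^\bullet)) \mid X_1 \le c] = 0$ at the optimum $(\mu_1^*, \mu_2^*, \gamma^*)$. Also, we may drop terms without $X_1$ in the conditional covariance operator, and we get
\[
\frac{\partial \xi}{\partial \gamma}(\mu_1^*, \mu_2^*, \gamma^*) = \mathbb{P}[X_1 \le c] \cdot (\gamma^* - \gamma^\bullet) \cdot \mathrm{Cov}(X_1, X_1 \mid X_1 \le c).
\]
We have $\mathbb{P}[X_1 \le c] > 0$ and $\mathrm{Cov}(X_1, X_1 \mid X_1 \le c) > 0$, hence we conclude that $\frac{\partial \xi}{\partial \gamma}(\mu_1^*, \mu_2^*, \gamma^*)$ is $> 0$ for $\gamma^* > \gamma^\bullet$, $= 0$ for $\gamma^* = \gamma^\bullet$, and $< 0$ for $\gamma^* < \gamma^\bullet$.

In case that $\underline{\gamma} > \gamma^\bullet$, at the optimum we must have $\frac{\partial \xi}{\partial \gamma}(\mu_1^*, \mu_2^*, \gamma^*) > 0$. By the Karush-Kuhn-Tucker condition, this means the minimizer is $\gamma^* = \underline{\gamma}$. Conversely, when $\bar{\gamma} < \gamma^\bullet$, at the optimum we must have $\frac{\partial \xi}{\partial \gamma}(\mu_1^*, \mu_2^*, \gamma^*) < 0$. In that case, the minimizer is $\gamma^* = \bar{\gamma}$. So in both cases, $\gamma^* = \tilde{\gamma}$ as desired.

OA 1.16 Proof of Proposition 11

Proof. I start with the expression for the KL divergence from $\mathcal{H}^\bullet(c)$ to $\mathcal{H}(\Psi(\mu, \mu; \gamma); c)$.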
As in the proof of Proposition 2, this expression can be written as

$$\frac{(\mu - \mu^\bullet)^2}{2\sigma^2} + \int_{-\infty}^c \phi(x_1; \mu^\bullet, \sigma^2) \cdot \left[\frac{\sigma^2 + (\mu - \gamma(x_1 - \mu) - \mu^\bullet)^2}{2\sigma^2} - \frac{1}{2}\right] dx_1.$$

Multiplying by $\sigma^2$ and dropping terms that do not depend on $\mu$, we get a simplified expression of the objective,

$$\xi(\mu) := \frac{(\mu - \mu^\bullet)^2}{2} + \int_{-\infty}^c \phi(x_1; \mu^\bullet, \sigma^2) \cdot \left[\frac{(\mu - \gamma(x_1 - \mu) - \mu^\bullet)^2}{2}\right] dx_1.$$

Taking the first-order condition,

$$\xi'(\mu) = (\mu - \mu^\bullet) + (1 + \gamma) \cdot \int_{-\infty}^c \phi(x_1; \mu^\bullet, \sigma^2) \cdot ((1 + \gamma)\mu - \gamma x_1 - \mu^\bullet)\, dx_1.$$

The term $\int_{-\infty}^c \phi(x_1; \mu^\bullet, \sigma^2) \cdot ((1 + \gamma)\mu - \gamma x_1 - \mu^\bullet)\, dx_1$ may be rewritten as $P[X_1 \le c] \cdot E[(1 + \gamma)\mu - \gamma X_1 - \mu^\bullet \mid X_1 \le c]$. Setting the first-order condition to 0 and using straightforward algebra,

$$\mu^*_{\vartriangle}(c) = \frac{1}{1 + P[X_1 \le c] \cdot (1 + \gamma)^2}\, \mu^\bullet + \frac{P[X_1 \le c] \cdot (1 + \gamma)^2}{1 + P[X_1 \le c] \cdot (1 + \gamma)^2}\, \mu^\circ(c).$$

OA 2 Optimal-Stopping Problems with L Periods

OA 2.1 An L-Periods Model of the Gambler's Fallacy

In an optimal-stopping problem with $L$ periods, the agent observes a draw $x_\ell \in \mathbb{R}$ in each period $1 \le \ell \le L$. At the end of period $\ell$, the agent must decide between stopping and receiving a payoff $u_\ell(x_1, \ldots, x_\ell)$ that depends on the profile of draws $(x_i)_{i=1}^{\ell}$ observed so far, or continuing into the next period. If the agent continues into period $L$ without stopping, then the payoff will be $u_L(x_1, \ldots, x_L)$.

I first introduce notation for a class of joint distributions of the $L$ possible draws $(X_i)_{i=1}^L$, which extends the Gaussian case from Example 2 to multiple periods.

Definition OA.1. Let $\sigma^2 > 0$. For $\mu = (\mu_i)_{i=1}^L$ and triangular array $\gamma = (\gamma_{i,j})_{2 \le i \le L,\, 1 \le j \le i-1}$ with each $\gamma_{i,j} \in \mathbb{R}$, let $\Psi(\mu; \gamma)$ denote the joint distribution of $(X_i)_{i=1}^L$ where $X_1 \sim \mathcal{N}(\mu_1, \sigma^2)$ and, for all $i \ge 2$ and $(x_j)_{j=1}^{i-1} \in \mathbb{R}^{i-1}$,

$$X_i \mid (X_1 = x_1, \ldots, X_{i-1} = x_{i-1}) \sim \mathcal{N}\left(\mu_i - \sum_{j=1}^{i-1} \gamma_{i,j} \cdot (x_j - \mu_j),\; \sigma^2\right).$$

Under $\Psi(\mu; \gamma)$, $(X_i)_{i=1}^L$ are jointly Gaussian, such that the conditional mean of $X_i$ given the previous draws $X_1 = x_1, \ldots, X_{i-1} = x_{i-1}$ depends linearly on these realizations.
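As an illustration of Definition OA.1, the following is a minimal sketch of how one might simulate draws from $\Psi(\mu; \gamma)$ by applying the conditional-mean recursion period by period. The function name `draw_psi`, the 0-indexed storage of $\gamma$ as a square array, and the use of NumPy are assumptions made here for illustration and are not part of the paper. Note that the `sigma` argument is the standard deviation $\sigma$, whereas the definition is parameterized by the variance $\sigma^2$.

```python
import numpy as np

def draw_psi(mu, gamma, sigma, n, rng):
    """Draw n samples of (X_1, ..., X_L) from Psi(mu; gamma).

    mu    : length-L sequence of fundamentals (mu_1, ..., mu_L).
    gamma : L x L array; only the strictly lower-triangular entries
            gamma[i][j] with j < i (0-indexed) are used.
    sigma : standard deviation of each conditional Gaussian.
    rng   : a numpy.random.Generator.
    (Hypothetical helper, written for illustration only.)
    """
    mu = np.asarray(mu, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    L = mu.shape[0]
    x = np.empty((n, L))
    for i in range(L):
        # Conditional mean: mu_i - sum_{j<i} gamma_{i,j} * (x_j - mu_j).
        # For i = 0 the sum is empty, so the mean is just mu_1.
        cond_mean = mu[i] - (x[:, :i] - mu[:i]) @ gamma[i, :i]
        x[:, i] = rng.normal(cond_mean, sigma)
    return x

# Usage: two periods with gamma_{2,1} = 0.5, so a high X_1 lowers
# the conditional mean of X_2.
rng = np.random.default_rng(0)
samples = draw_psi([0.0, 0.0], [[0.0, 0.0], [0.5, 0.0]], 1.0, 200_000, rng)
```

Under these assumptions, regressing the simulated second draw on the first should recover a slope close to $-\gamma_{2,1}$, which is one quick way to sanity-check the recursion.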
Footnote: An equivalent description of the model $\Psi(\mu; \gamma)$ is to consider a set of $L$ independent Gaussian random variables $Z_i \sim \mathcal{N}(\mu_i, \sigma^2)$ for $1 \le i \le L$. Let $X_1 = Z_1$ and iteratively define $X_i = Z_i - \sum_{j=1}^{i-1} \gamma_{i,j}(X_j - \mu_j)$. Using induction, one can show that every $X_i$ is a linear function of the $Z_i$'s, so they are jointly Gaussian.

I consider agents who entertain a set of feasible models, $\{\Psi(\mu; \gamma) : \mu \in \mathbb{R}^L\}$, for a fixed array $\gamma$ where each $\gamma_{i,j} > 0$. The positive $\gamma_{i,j}$ capture the gambler's fallacy, as higher realizations of earlier draws $X_j$ lower the agent's conditional mean for later draws $X_i$. The larger is $\gamma_{i,j}$, the more that the agent's prediction of $X_i$ depends on the realization of $X_j$. Agents hold a dogmatic belief in the correlation structure between $(X_i)_{i=1}^L$, but can flexibly estimate $(\mu_i)_{i=1}^L$, the fundamentals of the environment. Objectively, $(X_i)_{i=1}^L$ are independent, so the true joint distribution is $\Psi^\bullet = \Psi(\mu^\bullet; \mathbf{0})$ for some $(\mu_i^\bullet)_{i=1}^L$.

A useful functional form to keep in mind is $\gamma_{i,j} = \alpha \cdot \delta^{i-j-1}$ for $\alpha > 0$, $0 \le \delta \le 1$, which corresponds to Rabin and Vayanos (2010)'s specification of the gambler's fallacy in multiple periods. Here, $\alpha$ relates to the severity of the bias and $\delta$ captures how quickly the influence of past observations decays in predicting future draws.

OA 2.2 Inference from Censored Datasets in L Periods

In general, a stopping strategy in an optimal-stopping problem over $L$ periods is a set of functions $S_i : \mathbb{R}^i \to \{\text{Stop}, \text{Continue}\}$ for $1 \le i \le L-1$, where $S_i(x_1, \ldots, x_i)$ maps the realizations of the first $i$ draws to a stopping decision. I consider stopping strategies where $S_i$ is a cutoff rule in $x_i$ after each partial history $(x_1, \ldots, x_{i-1})$. That is, there exist $(c_i)_{i=1}^{L-1}$ with $c_1 \in \mathbb{R}$ and, for $i \ge 2$, $c_i(x_1, \ldots, x_{i-1}) \in \mathbb{R}$ for every $(x_1, \ldots, x_{i-1}) \in \mathbb{R}^{i-1}$, so that the agent stops after $(x_1, \ldots, x_i)$ if and only if $x_i \ge c_i(x_1, \ldots, x_{i-1})$. A stopping strategy with stopping regions characterized by a profile of cutoff rules $c = (c_i)_{i=1}^{L-1}$ will be abbreviated as $S_c$.
For feasible model $\Psi$ and cutoff rule $S_c$, let $H(\Psi; S_c)$ represent the distribution of histories when applying rule $S_c$ to draws $(X_i) \sim \Psi$. More precisely, consider a procedure where $X_1, X_2, \ldots, X_L$ is drawn according to $\Psi$ and revealed one at a time. At the earliest $1 \le \bar{i} \le L-1$ such that $X_{\bar{i}} \ge c_{\bar{i}}(X_1, \ldots, X_{\bar{i}-1})$, the process stops and the history records $(X_1, \ldots, X_{\bar{i}}, \emptyset, \ldots, \emptyset)$, with $L - \bar{i}$ instances of the censoring indicator $\emptyset$ replacing the unobserved subvector $(X_{\bar{i}+1}, \ldots, X_L)$. If no such $\bar{i}$ exists, then the history records the entire profile of draws, $(X_1, \ldots, X_L)$. The distribution of histories generated this way is denoted $H(\Psi; S_c)$.

Definition OA.2. For cutoff strategy $S_c$ and fundamentals $\mu$, the KL divergence between the objective distribution of histories and the predicted distribution under censoring is the sum of $L$ integrals,

$$D_{KL}\big(H(\Psi^\bullet; S_c) \,\|\, H(\Psi(\mu; \gamma); S_c)\big) := \sum_{i=1}^L I_i,$$

where

$$I_1 = \int_{c_1}^\infty \phi(x_1; \mu_1^\bullet, \sigma^2) \ln\left(\frac{\phi(x_1; \mu_1^\bullet, \sigma^2)}{\phi(x_1; \mu_1, \sigma^2)}\right) dx_1,$$

and for $2 \le i \le L-1$, integral $I_i$ is

$$\int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_{i-1}(x_1, \ldots, x_{i-2})} \int_{c_i(x_1, \ldots, x_{i-1})}^{\infty} \prod_{k=1}^i \phi(x_k; \mu_k^\bullet, \sigma^2) \ln\left(\frac{\prod_{k=1}^i \phi(x_k; \mu_k^\bullet, \sigma^2)}{\prod_{k=1}^i \phi\big(x_k; \mu_k - \sum_{j=1}^{k-1} \gamma_{k,j} \cdot (x_j - \mu_j), \sigma^2\big)}\right) dx_i \cdots dx_1.$$

$I_L$ is given by

$$\int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_{L-1}(x_1, \ldots, x_{L-2})} \int_{-\infty}^{\infty} \prod_{k=1}^L \phi(x_k; \mu_k^\bullet, \sigma^2) \ln\left(\frac{\prod_{k=1}^L \phi(x_k; \mu_k^\bullet, \sigma^2)}{\prod_{k=1}^L \phi\big(x_k; \mu_k - \sum_{j=1}^{k-1} \gamma_{k,j} \cdot (x_j - \mu_j), \sigma^2\big)}\right) dx_L \cdots dx_1.$$

To interpret, consider a history $h = (x_1, \ldots, x_i, \emptyset, \ldots, \emptyset)$ where $x_k < c_k(x_1, \ldots, x_{k-1})$ for all $k \le i-1$ and $x_i \ge c_i(x_1, \ldots, x_{i-1})$. This history is possible under the stopping strategy $S_c$. It has a likelihood of $\prod_{k=1}^i \phi(x_k; \mu_k^\bullet, \sigma^2)$ under $\Psi^\bullet$ and a likelihood of $\prod_{k=1}^i \phi\big(x_k; \mu_k - \sum_{j=1}^{k-1} \gamma_{k,j} \cdot (x_j - \mu_j), \sigma^2\big)$ under $\Psi(\mu; \gamma)$. So, the integral $I_i$ calculates the contribution of all possible histories of length $i$ to the KL divergence $D_{KL}\big(H(\Psi^\bullet; S_c) \,\|\, H(\Psi(\mu; \gamma); S_c)\big)$.
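The censoring procedure that generates a history can be sketched in a few lines. This is only an illustrative sketch: the function name `censored_history`, the representation of each cutoff rule $c_i$ as a Python callable taking the earlier draws, and the use of `None` for the censoring indicator $\emptyset$ are all assumptions made here, not notation from the paper.

```python
def censored_history(draws, cutoffs):
    """Apply a cutoff-based stopping strategy S_c to a full profile of draws.

    draws   : list (x_1, ..., x_L) of realized draws.
    cutoffs : list of L-1 callables; cutoffs[i] takes the list of the
              first i draws (the partial history before period i+1)
              and returns the threshold for the draw in period i+1.
    Returns the recorded history, with None standing in for the
    censoring indicator on unobserved draws.
    (Hypothetical helper, written for illustration only.)
    """
    L = len(draws)
    for i in range(L - 1):
        # Stop at the earliest period whose draw meets or exceeds its cutoff.
        if draws[i] >= cutoffs[i](draws[:i]):
            return list(draws[: i + 1]) + [None] * (L - i - 1)
    # No early stop: the history records the entire profile of draws.
    return list(draws)

# Usage with L = 3: a constant first cutoff and a history-dependent second one.
rules = [lambda past: 1.0, lambda past: past[0] + 1.0]
h_stop_second = censored_history([0.2, 1.5, -0.3], rules)
h_stop_first = censored_history([2.0, 0.0, 0.0], rules)
h_no_stop = censored_history([0.0, 0.0, 0.0], rules)
```

Applying this function to many simulated profiles of draws gives an empirical counterpart to the distribution of histories $H(\Psi; S_c)$. Histories that stop early never reveal later draws, which is the source of the censoring that the integrals $I_i$ account for.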
In the case of $L = 2$, this definition reduces to the Gaussian case of Definition 5, the KL divergence in the two-periods baseline model, with $\gamma = \gamma_{2,1}$ and $c_1 \in \mathbb{R}$ as the censoring threshold.

The KL-divergence minimizers

$$\arg\min_{\mu \in \mathbb{R}^L} D_{KL}\big(H(\Psi^\bullet; S_c) \,\|\, H(\Psi(\mu; \gamma); S_c)\big)$$

are the pseudo-true fundamentals with respect to stopping strategy $S_c$. The next proposition gives an explicit characterization of them.

Proposition OA.1. Let stopping strategy $S_c$ be given. For each $i \ge 1$, let $R_i$ represent the region $\{(x_1, \ldots, x_i) : x_1 < c_1,\; x_2 < c_2(x_1),\; \ldots,\; x_i < c_i(x_1, \ldots, x_{i-1})\} \subseteq \mathbb{R}^i$. The pseudo-true fundamentals with respect to $S_c$ are $\mu_1^* = \mu_1^\bullet$ and, iteratively,

$$\mu_i^* = \mu_i^\bullet - \sum_{j=1}^{i-1} \gamma_{i,j} \cdot \big(\mu_j^* - E_{\Psi^\bullet}[X_j \mid (X_k)_{k=1}^{i-1} \in R_{i-1}]\big).$$

The expression for $\mu_i^*$ in the general $L$-periods setting resembles the expression for $\mu_2^*$ in the two-period setting. Relative to the truth $\mu_i^\bullet$, the estimate $\mu_i^*$ is distorted by the fact that $X_i$ is only observed when previous draws $(X_1, \ldots, X_{i-1})$ fall into the continuation region $R_{i-1} \subseteq \mathbb{R}^{i-1}$ associated with $S_c$. The agent uses this censored empirical distribution of $(X_1, \ldots, X_{i-1}, X_i)$ to infer the period-$i$ fundamental, under a dogmatic belief about the correlation structure between the draws given by $\gamma$. Importantly, whether a certain realization $X_j$ for $j < i$ should be judged as below-average (and thus predict a higher $X_i$) or above-average (and thus predict a lower $X_i$) depends on the agent's belief about the period-$j$ fundamental, $\mu_j^*$, which gives the iterative structure of the expression for $\mu_i^*$.

The proof of this result follows two steps. First, recall that $D_{KL}\big(H(\Psi^\bullet; S_c) \,\|\, H(\Psi(\mu; \gamma); S_c)\big)$ is defined as the sum $\sum_{i=1}^L I_i$, where $I_i$ is the KL-divergence contribution from histories with length $i$. I rewrite this expression as the sum of $L$ different integrals, $\sum_{i=1}^L J_i$, where $J_i$ is the KL-divergence contribution from histories containing $X_i$.
So, $J_i$ is a function of $\mu_1, \ldots, \mu_i$. The second step is similar to deriving the explicit expressions of pseudo-true fundamentals for Example 2, where I show $\frac{\partial J_i}{\partial \mu_j}$ is a linear multiple of $\frac{\partial J_i}{\partial \mu_i}$ whenever $j < i$. The first-order condition at $\mu^*$ allows for a telescoping rearrangement, yielding $\frac{\partial J_i}{\partial \mu_i}(\mu^*) = 0$ for every $i$. The proposition readily follows.

Let $J_1 = \frac{(\mu_1^\bullet - \mu_1)^2}{2\sigma^2}$ and for $i \ge 2$, let $J_i$ be

$$\int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_{i-1}(x_1, \ldots, x_{i-2})} \prod_{j=1}^{i-1} \phi(x_j; \mu_j^\bullet, \sigma^2) \cdot \left[\frac{\big(\mu_i^\bullet - \mu_i + \sum_{j=1}^{i-1} \gamma_{i,j} \cdot (x_j - \mu_j)\big)^2}{2\sigma^2}\right] dx_{i-1} \cdots dx_1.$$

The expression in square brackets is the KL divergence from the agent's feasible model for $X_i \mid (X_1 = x_1, \ldots, X_{i-1} = x_{i-1})$ to the true distribution of $X_i$, under fundamentals $\mu_1, \ldots, \mu_i$. So, the integral $J_i$ is a weighted average of this divergence, taken across different realizations of previous draws $(x_1, \ldots, x_{i-1})$ with weights given by the true likelihood of observing such a sequence of draws in periods 1 through $i-1$ under $S_c$. Note that for each $i$, $J_i$ (and $I_i$) depends on $\mu_1, \ldots, \mu_i$.

I first develop an alternative expression of $D_{KL}\big(H(\Psi^\bullet; S_c) \,\|\, H(\Psi(\mu; \gamma); S_c)\big)$ as the sum of the $J_i$.

Lemma OA.2. $\sum_{i=1}^L I_i = \sum_{i=1}^L J_i$.

Proof. Let $\tilde{I}_i$ be a slightly modified version of $I_i$, where the inner-most integral over $x_i$ has the range $(-\infty, \infty)$, so $\tilde{I}_i$ is

$$\int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_{i-1}(x_1, \ldots, x_{i-2})} \int_{-\infty}^{\infty} \prod_{k=1}^i \phi(x_k; \mu_k^\bullet, \sigma^2) \ln\left(\frac{\prod_{k=1}^i \phi(x_k; \mu_k^\bullet, \sigma^2)}{\prod_{k=1}^i \phi\big(x_k; \mu_k - \sum_{j=1}^{k-1} \gamma_{k,j} \cdot (x_j - \mu_j), \sigma^2\big)}\right) dx_i \cdots dx_1.$$

Observe that $\tilde{I}_L = I_L$. Inductively I will show $\tilde{I}_{L'} + \sum_{i=1}^{L'-1} I_i = \sum_{i=1}^{L'} J_i$ for every $1 \le L' \le L$. When $L' = 1$, this just says $\tilde{I}_1 = J_1$, which is true by definition. Now suppose the statement holds for some $L' = S \le L - 1$. I show it also holds when $L' = S + 1$. We have

$$\tilde{I}_{S+1} + \sum_{i=1}^{S} I_i = \tilde{I}_{S+1} + (I_S - \tilde{I}_S) + \left(\tilde{I}_S + \sum_{i=1}^{S-1} I_i\right) = \tilde{I}_{S+1} + (I_S - \tilde{I}_S) + \sum_{i=1}^{S} J_i,$$

where the last equality comes from the inductive hypothesis.
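Since $I_S$ and $\tilde{I}_S$ differ only in the bounds of the inner-most integral, $I_S - \tilde{I}_S$ is

$$-\int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_S(x_1, \ldots, x_{S-1})} \prod_{k=1}^{S} \phi(x_k; \mu_k^\bullet, \sigma^2) \cdot \ln\left(\frac{\prod_{k=1}^S \phi(x_k; \mu_k^\bullet, \sigma^2)}{\prod_{k=1}^S \phi\big(x_k; \mu_k - \sum_{j=1}^{k-1} \gamma_{k,j} \cdot (x_j - \mu_j), \sigma^2\big)}\right) dx_S \cdots dx_1.$$

We decompose the $\ln\left(\frac{\prod_{k=1}^{S+1} \phi(x_k; \mu_k^\bullet, \sigma^2)}{\prod_{k=1}^{S+1} \phi(x_k; \mu_k - \sum_{j=1}^{k-1} \gamma_{k,j} \cdot (x_j - \mu_j), \sigma^2)}\right)$ term in the integrand of $\tilde{I}_{S+1}$ into the sum

$$\ln\left(\frac{\prod_{k=1}^{S} \phi(x_k; \mu_k^\bullet, \sigma^2)}{\prod_{k=1}^{S} \phi\big(x_k; \mu_k - \sum_{j=1}^{k-1} \gamma_{k,j} \cdot (x_j - \mu_j), \sigma^2\big)}\right) + \ln\left(\frac{\phi(x_{S+1}; \mu_{S+1}^\bullet, \sigma^2)}{\phi\big(x_{S+1}; \mu_{S+1} - \sum_{j=1}^{S} \gamma_{S+1,j} \cdot (x_j - \mu_j), \sigma^2\big)}\right).$$

We know that

$$\begin{aligned} & \int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_S(..)} \int_{-\infty}^{\infty} \prod_{k=1}^{S+1} \phi(x_k; \mu_k^\bullet, \sigma^2) \ln\left(\frac{\prod_{k=1}^{S} \phi(x_k; \mu_k^\bullet, \sigma^2)}{\prod_{k=1}^{S} \phi\big(x_k; \mu_k - \sum_{j=1}^{k-1} \gamma_{k,j} \cdot (x_j - \mu_j), \sigma^2\big)}\right) dx_{S+1} \cdots dx_1 \\ &= \int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_S(..)} \prod_{k=1}^{S} \phi(x_k; \mu_k^\bullet, \sigma^2) \int_{-\infty}^{\infty} \phi(x_{S+1}; \mu_{S+1}^\bullet, \sigma^2) \cdot \ln\left(\frac{\prod_{k=1}^{S} \phi(x_k; \mu_k^\bullet, \sigma^2)}{\prod_{k=1}^{S} \phi\big(x_k; \mu_k - \sum_{j=1}^{k-1} \gamma_{k,j} \cdot (x_j - \mu_j), \sigma^2\big)}\right) dx_{S+1} \cdots dx_1 \\ &= \int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_S(..)} \prod_{k=1}^{S} \phi(x_k; \mu_k^\bullet, \sigma^2) \cdot \ln\left(\frac{\prod_{k=1}^{S} \phi(x_k; \mu_k^\bullet, \sigma^2)}{\prod_{k=1}^{S} \phi\big(x_k; \mu_k - \sum_{j=1}^{k-1} \gamma_{k,j} \cdot (x_j - \mu_j), \sigma^2\big)}\right) dx_S \cdots dx_1 \\ &= -(I_S - \tilde{I}_S), \end{aligned}$$

where $c_S(..)$ abbreviates the bound of integration $c_S(x_1, \ldots, x_{S-1})$. At the same time,

$$\begin{aligned} & \int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_S(..)} \int_{-\infty}^{\infty} \prod_{k=1}^{S+1} \phi(x_k; \mu_k^\bullet, \sigma^2) \ln\left(\frac{\phi(x_{S+1}; \mu_{S+1}^\bullet, \sigma^2)}{\phi\big(x_{S+1}; \mu_{S+1} - \sum_{j=1}^{S} \gamma_{S+1,j} \cdot (x_j - \mu_j), \sigma^2\big)}\right) dx_{S+1} \cdots dx_1 \\ &= \int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_S(..)} \prod_{k=1}^{S} \phi(x_k; \mu_k^\bullet, \sigma^2) \int_{-\infty}^{\infty} \phi(x_{S+1}; \mu_{S+1}^\bullet, \sigma^2) \ln\left(\frac{\phi(x_{S+1}; \mu_{S+1}^\bullet, \sigma^2)}{\phi\big(x_{S+1}; \mu_{S+1} - \sum_{j=1}^{S} \gamma_{S+1,j} \cdot (x_j - \mu_j), \sigma^2\big)}\right) dx_{S+1} \cdots dx_1 \\ &= \int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_S(..)} \prod_{k=1}^{S} \phi(x_k; \mu_k^\bullet, \sigma^2)\; D_{KL}\left[\mathcal{N}(\mu_{S+1}^\bullet, \sigma^2) \,\Big\|\, \mathcal{N}\Big(\mu_{S+1} - \sum_{j=1}^{S} \gamma_{S+1,j} \cdot (x_j - \mu_j), \sigma^2\Big)\right] dx_S \cdots dx_1 \\ &= \int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_S(..)} \prod_{k=1}^{S} \phi(x_k; \mu_k^\bullet, \sigma^2)\; \frac{\big(\mu_{S+1}^\bullet - \mu_{S+1} + \sum_{j=1}^{S} \gamma_{S+1,j} \cdot (x_j - \mu_j)\big)^2}{2\sigma^2}\, dx_S \cdots dx_1 \\ &= J_{S+1}, \end{aligned}$$

where we used the closed-form expression of the KL divergence between two Gaussian distributions,

$$D_{KL}\left[\mathcal{N}(\mu_{S+1}^\bullet, \sigma^2) \,\Big\|\, \mathcal{N}\Big(\mu_{S+1} - \sum_{j=1}^{S} \gamma_{S+1,j} \cdot (x_j - \mu_j), \sigma^2\Big)\right] = \frac{\big(\mu_{S+1}^\bullet - \mu_{S+1} + \sum_{j=1}^{S} \gamma_{S+1,j} \cdot (x_j - \mu_j)\big)^2}{2\sigma^2}.$$

Combining the two pieces gives $\tilde{I}_{S+1} = -(I_S - \tilde{I}_S) + J_{S+1}$, so $\tilde{I}_{S+1} + \sum_{i=1}^{S} I_i = \sum_{i=1}^{S+1} J_i$. So by induction, $\tilde{I}_L + \sum_{i=1}^{L-1} I_i = \sum_{i=1}^{L} J_i$. As $\tilde{I}_L = I_L$, we are done.

Using Lemma OA.2, I can now give the proof of Proposition OA.1.

Proof. Abbreviate $D_{KL}\big(H(\Psi^\bullet; S_c) \,\|\, H(\Psi(\mu; \gamma); S_c)\big)$ as $\xi(\mu_1, \ldots, \mu_L)$. By Lemma OA.2, $\xi(\mu_1, \ldots, \mu_L) = \sum_{i=1}^L J_i(\mu_1, \ldots, \mu_i)$. We show that the recursively defined parameters are the only ones satisfying the first-order condition, $\frac{\partial \xi}{\partial \mu_i}(\hat\mu_1, \ldots, \hat\mu_L) = 0$ for each $i$.

In the integrand for $J_i$, each $\mu_j$ where $1 \le j \le i$ appears once in the term $\frac{(\mu_i^\bullet - \mu_i + \sum_{j=1}^{i-1} \gamma_{i,j} \cdot (x_j - \mu_j))^2}{2\sigma^2}$. For any $(x_1, \ldots, x_{i-1})$, the partial derivative of this term with respect to $\mu_j$ for $j < i$ is $\gamma_{i,j}$ times its partial derivative with respect to $\mu_i$. That is, at any values of $\hat\mu_1, \ldots, \hat\mu_i$, we get

$$\frac{\partial J_i}{\partial \mu_j}(\hat\mu_1, \ldots, \hat\mu_i) = \gamma_{i,j} \frac{\partial J_i}{\partial \mu_i}(\hat\mu_1, \ldots, \hat\mu_i)$$

for each $1 \le j < i$.

At any $(\mu_1^*, \ldots, \mu_L^*)$ satisfying the first-order condition for $\mu_L$, we must have

$$\frac{\partial \xi}{\partial \mu_L}(\mu_1^*, \ldots, \mu_L^*) = \frac{\partial J_L}{\partial \mu_L}(\mu_1^*, \ldots, \mu_L^*) = 0.$$

By the above, this also implies $\frac{\partial J_L}{\partial \mu_j}(\mu_1^*, \ldots, \mu_L^*) = 0$ for each $1 \le j < L$ (if $\gamma_{L,j} = 0$, then $J_L$ is not actually a function of $\mu_j$ and $\frac{\partial J_L}{\partial \mu_j} = 0$ everywhere). This shows for the case of $j = L - 1$,

$$\frac{\partial \xi}{\partial \mu_{L-1}}(\mu_1^*, \ldots, \mu_L^*) = \frac{\partial J_L}{\partial \mu_{L-1}}(\mu_1^*, \ldots, \mu_L^*) + \frac{\partial J_{L-1}}{\partial \mu_{L-1}}(\mu_1^*, \ldots, \mu_{L-1}^*) = \frac{\partial J_{L-1}}{\partial \mu_{L-1}}(\mu_1^*, \ldots, \mu_{L-1}^*).$$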
If $(\mu_1^*, \ldots, \mu_L^*)$ also satisfies the first-order condition for $\mu_{L-1}$, then $\frac{\partial J_{L-1}}{\partial \mu_{L-1}}(\mu_1^*, \ldots, \mu_{L-1}^*) = 0$. Continuing this telescoping argument, we conclude that if $(\mu_1^*, \ldots, \mu_L^*)$ satisfies the first-order condition for all $\mu_i$, $1 \le i \le L$, then $\frac{\partial J_i}{\partial \mu_i}(\mu_1^*, \ldots, \mu_i^*) = 0$ for every $1 \le i \le L$.

Given the form of $J_1$, it is clear that $\frac{\partial J_1}{\partial \mu_1}(\mu_1^*) = 0$ implies $\mu_1^* = \mu_1^\bullet$. Also,

$$\frac{\partial J_i}{\partial \mu_i}(\mu_1^*, \ldots, \mu_i^*) = -\int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_{i-1}(x_1, \ldots, x_{i-2})} \prod_{j=1}^{i-1} \phi(x_j; \mu_j^\bullet, \sigma^2) \left[\frac{\mu_i^\bullet - \mu_i^* + \sum_{j=1}^{i-1} \gamma_{i,j} \cdot (x_j - \mu_j^*)}{\sigma^2}\right] dx_{i-1} \cdots dx_1.$$

Using the fact that $\frac{\partial J_i}{\partial \mu_i}(\mu_1^*, \ldots, \mu_i^*) = 0$, we multiply the integrand by the constant

$$-\sigma^2 \cdot \left(\int_{-\infty}^{c_1} \cdots \int_{-\infty}^{c_{i-1}(x_1, \ldots, x_{i-2})} \prod_{j=1}^{i-1} \phi(x_j; \mu_j^\bullet, \sigma^2)\, dx_{i-1} \cdots dx_1\right)^{-1}$$

and get

$$E_{\Psi^\bullet}\left[\mu_i^\bullet - \mu_i^* + \sum_{j=1}^{i-1} \gamma_{i,j} \cdot (X_j - \mu_j^*) \;\Big|\; (X_k)_{k=1}^{i-1} \in R_{i-1}\right] = 0.$$

Rearranging, we have $\mu_i^* = \mu_i^\bullet - \sum_{j=1}^{i-1} \gamma_{i,j} \cdot \big(\mu_j^* - E_{\Psi^\bullet}[X_j \mid (X_k)_{k=1}^{i-1} \in R_{i-1}]\big)$ as desired. This means the only $(\mu_1^*, \ldots, \mu_L^*)$ satisfying the first-order condition for minimizing KL divergence is the one iteratively given in this proposition.

Now I turn to a special class of cutoff-based stopping rules where $c_k$ is independent of history. So, a stopping rule of this kind $S_c$ can be viewed simply as a list of $L$ constants, $c_1, \ldots, c_L \in \mathbb{R}$, such that the agent stops after the draw $X_\ell = x_\ell$ if and only if $x_\ell \ge c_\ell$. I show that the expression for the pseudo-true fundamentals greatly simplifies and admits a path-counting interpretation.

Definition OA.3. For $1 \le j < i \le L$, a path $p$ from $i$ to $j$ is a sequence of pairs $p = ((i_0, i_1), \ldots, (i_{M-1}, i_M))$ with $M \ge 1$, $i_0 = i$, $i_M = j$, and $i_{m+1} < i_m$ for all $m = 0, 1, \ldots, M-1$. The length of $p$ is $M$. The weight of $p$ is $W(p) := \prod_{0 \le m \le M-1} (-\gamma_{i_m, i_{m+1}})$. Denote the set of all paths from $i$ to $j$ as $P[i \to j]$.

That is, we may imagine a network with $L$ nodes, one per period of the optimal-stopping problem.
There is a directed edge with weight $-\gamma_{i,j}$ for all pairs $i > j$. A path from $i$ to $j$ is a concatenation of edges, starting with $i$ and ending with $j$. Its weight is the product of the weights of all the edges used.

The next proposition differs from Proposition OA.1 in that the expression for the pseudo-true fundamental $\mu_i^*$ does not involve other pseudo-true fundamentals $\mu_j^*$. It shows that the distortion of $\mu_i^*$ from the true value $\mu_i^\bullet$ depends on the terms $\mu_j^\bullet - E_{\Psi^\bullet}[X_j \mid X_j \le c_j]$ and the total weight of all paths from $i$ to $j$ in the network that $\gamma$ defines.

Proposition OA.2. For stopping strategy $S_c = (c_1, \ldots, c_L) \in \mathbb{R}^L$, the pseudo-true fundamentals are given by

$$\mu_i^* = \mu_i^\bullet + \sum_{j=1}^{i-1} \sum_{p \in P[i \to j]} W(p) \cdot \big(\mu_j^\bullet - E[X_j \mid X_j \le c_j]\big).$$

Proof. This clearly holds for $i = 1$. By induction assume this holds for all $i \le K$ for some $K \le L - 1$. I show that this also holds for $i = K + 1$. From Proposition OA.1,

$$\mu_i^* = \mu_i^\bullet - \sum_{j=1}^{i-1} \gamma_{i,j} \cdot \big(\mu_j^* - E_{\Psi^\bullet}[X_j \mid (X_k)_{k=1}^{i-1} \in R_{i-1}]\big).$$

The continuation region $R_{i-1}$ is the rectangle $(-\infty, c_1) \times \cdots \times (-\infty, c_{i-1}) \subseteq \mathbb{R}^{i-1}$. As $(X_1, \ldots, X_{i-1})$ are objectively independent, the events $\{X_k \le c_k\}$ for $k \ne j$ are independent of $X_j$, so the expression simplifies to

$$\mu_i^* = \mu_i^\bullet - \sum_{j=1}^{i-1} \gamma_{i,j} \cdot \big(\mu_j^* - E_{\Psi^\bullet}[X_j \mid X_j \le c_j]\big).$$

Taking $i = K + 1$ and using the inductive hypothesis to substitute for $\mu_j^*$ for $1 \le j \le K$,

$$\begin{aligned} \mu_{K+1}^* &= \mu_{K+1}^\bullet - \sum_{j=1}^{K} \gamma_{K+1,j} \cdot \big(\mu_j^\bullet - E_{\Psi^\bullet}[X_j \mid X_j \le c_j]\big) + \sum_{j=1}^{K} (-\gamma_{K+1,j}) \cdot \left[\sum_{k=1}^{j-1} \sum_{p \in P[j \to k]} W(p) \cdot \big(\mu_k^\bullet - E_{\Psi^\bullet}[X_k \mid X_k \le c_k]\big)\right] \\ &= \mu_{K+1}^\bullet + \sum_{j=1}^{K} \left[(-\gamma_{K+1,j}) + \sum_{k=j+1}^{K} (-\gamma_{K+1,k}) \cdot \left(\sum_{p \in P[k \to j]} W(p)\right)\right] \cdot \big(\mu_j^\bullet - E_{\Psi^\bullet}[X_j \mid X_j \le c_j]\big). \end{aligned}$$

Paths in $P[K+1 \to j]$ come in two types. The first type is the direct path consisting of just one edge $(K+1, j)$, with weight $-\gamma_{K+1,j}$. The second type consists of the indirect paths $p = ((K+1, k), p')$ where $p' \in P[k \to j]$. We have $W(p) = -\gamma_{K+1,k} \cdot W(p')$.
We therefore see that, for each $1 \le j \le K$, the expression $\left[(-\gamma_{K+1,j}) + \sum_{k=j+1}^{K} (-\gamma_{K+1,k}) \cdot \left(\sum_{p \in P[k \to j]} W(p)\right)\right]$ in fact gives the sum of weights for all paths in $P[K+1 \to j]$. So, we have shown that the claim holds also for $i = K + 1$. By induction it holds for all $1 \le i \le L$.

As a corollary, suppose $L \ge 3$ and $\gamma$ has the Rabin and Vayanos (2010) functional form of $\gamma_{i,j} = \alpha \cdot \delta^{i-j-1}$ for $\alpha > 0$, $0 \le \delta \le 1$. I show that all pseudo-true fundamentals after the first period are too pessimistic in every dataset censored with $S_c = (c_1, \ldots, c_L) \in \mathbb{R}^L$ whenever $\delta > \alpha$, while over-optimism can arise when $\delta < \alpha$. The idea is that the influence of the gambler's fallacy psychology must not decay "too quickly" relative to the influence of the most recent observation. This condition is satisfied in all the calibration exercises in Rabin and Vayanos (2010) and in the structural estimations of Benjamin, Moore, and Rabin (2017). The result shows the over-pessimism from the 2-periods model extends into the $L$-periods model for history-independent stopping rules, provided the regularity condition on the parametrization of the $L$-periods gambler's fallacy holds.

Corollary OA.1. Suppose $L \ge 3$ and $\gamma_{i,j} = \alpha \cdot \delta^{i-j-1}$ for $\alpha > 0$, $0 \le \delta \le 1$. If $\delta > \alpha$, then for all stopping strategies $S_c = (c_1, \ldots, c_L) \in \mathbb{R}^L$, the pseudo-true fundamentals satisfy $\mu_i^* < \mu_i^\bullet$ for all $i \ge 2$. If $\delta < \alpha$, then there exists a stopping strategy $S_c = (c_1, \ldots, c_L) \in \mathbb{R}^L$ such that $\mu_i^* > \mu_i^\bullet$ for at least one $i$.

To understand the intuition, consider an example that violates the condition of the corollary, with $\delta = 0 < \alpha$, so that $\gamma_{2,1} = \alpha$, $\gamma_{3,2} = \alpha$, and $\gamma_{3,1} = 0$. The agent expects reversals between the pairs $(X_1, X_2)$ and $(X_2, X_3)$, but his expectation for $X_3 \mid (X_1 = x_1, X_2 = x_2)$ does not vary with $x_1$. By the same logic as the two-periods censoring effect, inference about the second-period fundamental $\mu_2^*$ decreases as $c_1$ decreases, with $\lim_{c_1 \to -\infty} \mu_2^*(c_1) = -\infty$.
This has an important indirect effect on $\mu_3^*$, since a very pessimistic $\mu_2^*$ leads the agent to interpret objectively typical draws of $X_2$ as greatly above average. Expecting low values of $X_3$ after these surprisingly high draws of $X_2$, the agent infers the fundamental $\mu_3^*$ to be above the sample mean of $X_3$ in the dataset, hence overestimating it as $c_1 \to -\infty$. When $\delta$ is strictly positive, however, there is an opposite effect where a lower sample mean of $X_1$ in observations containing uncensored $X_3$ leads to more pessimistic inference about the third-period fundamental. When $\delta > \alpha$, overoptimistic inference never happens because this second effect dominates.

Proof. First suppose $\delta > \alpha$. By Proposition OA.2, since $\mu_j^\bullet - E[X_j \mid X_j \le c_j] > 0$ for every $c_j \in \mathbb{R}$, I only need to show that $\sum_{p \in P[i \to j]} W(p) < 0$ for every pair $i > j$. Due to the stationarity of $\gamma$ under the $\gamma_{i,j} = \alpha \cdot \delta^{i-j-1}$ functional form, it suffices to prove $\sum_{p \in P[i \to 1]} W(p) < 0$ for every $2 \le i \le L$.

When $i = 2$, $P[2 \to 1]$ consists of a single path with weight $-\alpha < 0$. By induction suppose $\sum_{p \in P[i \to 1]} W(p) < 0$ for all $i \le S$ for some $2 \le S \le L - 1$. We can exhaustively enumerate $P[S+1 \to 1]$ by relating each path in $P[S \to 1]$ to a pair of paths in $P[S+1 \to 1]$. Relate $p = ((S, i_1), \ldots, (i_{M-1}, 1)) \in P[S \to 1]$ to the pair $p' = ((S+1, i_1), \ldots, (i_{M-1}, 1))$ and $p'' = ((S+1, S), (S, i_1), \ldots, (i_{M-1}, 1))$. That is, $p'$ modifies the first edge in $p$ from $(S, i_1)$ to $(S+1, i_1)$, while $p''$ simply concatenates the extra edge $(S+1, S)$ in front of $p$. We have $W(p') = \delta \cdot W(p)$, because the weight of $(S, i_1)$ is $-\alpha \delta^{S - i_1 - 1}$ while the weight of $(S+1, i_1)$ is $-\alpha \delta^{S - i_1}$, and the two paths are otherwise identical. We have $W(p'') = -\alpha \cdot W(p)$, since the newly concatenated edge has weight $-\alpha$. This argument shows

$$\sum_{p \in P[S+1 \to 1]} W(p) = (\delta - \alpha) \cdot \sum_{p \in P[S \to 1]} W(p).$$

Since $\delta - \alpha > 0$ and $\sum_{p \in P[S \to 1]} W(p) < 0$ by the inductive hypothesis, we get $\sum_{p \in P[S+1 \to 1]} W(p) < 0$. By induction, we have shown that $\sum_{p \in P[i \to 1]} W(p) < 0$ for every $2 \le i \le L$.

Next, suppose $\delta < \alpha$.
By Proposition OA.2,

$$\mu_3^* = \mu_3^\bullet + \big(-\alpha\delta + \alpha^2\big) \cdot \big(\mu_1^\bullet - E[X_1 \mid X_1 \le c_1]\big) + (-\alpha)\big(\mu_2^\bullet - E[X_2 \mid X_2 \le c_2]\big).$$

The coefficient in front of $\mu_1^\bullet - E[X_1 \mid X_1 \le c_1]$ comes from the fact that there are two paths from 3 to 1, with weights $-\gamma_{3,1} = -\alpha\delta$ and $(-\gamma_{3,2}) \cdot (-\gamma_{2,1}) = (-\alpha) \cdot (-\alpha) = \alpha^2$. We have $(-\alpha\delta + \alpha^2) = \alpha(\alpha - \delta) > 0$ since $\alpha > 0$ and $\delta < \alpha$. So, fixing $c_2$, as $c_1 \to -\infty$ we get $\mu_1^\bullet - E[X_1 \mid X_1 \le c_1] \to \infty$ and therefore $\mu_3^* \to \infty$.

OA 3 Proof of Theorem 1

In this section I prove the almost-sure convergence of beliefs and behavior when biased agents act one at a time and entertain uncertainty over both $\mu_1$ and $\mu_2$.

For $\underline{\mu}_1 < \bar\mu_1$, $\underline{\mu}_2 < \bar\mu_2$, let $\Diamond([\underline{\mu}_1, \bar\mu_1], [\underline{\mu}_2, \bar\mu_2])$ refer to the parallelogram in $\mathbb{R}^2$ with the vertices:

• $\left(\underline{\mu}_1,\; \bar\mu_2 + \gamma \frac{\bar\mu_1 - \underline{\mu}_1}{2}\right)$
• $\left(\underline{\mu}_1,\; \underline{\mu}_2 + \gamma \frac{\bar\mu_1 - \underline{\mu}_1}{2}\right)$
• $\left(\bar\mu_1,\; \bar\mu_2 - \gamma \frac{\bar\mu_1 - \underline{\mu}_1}{2}\right)$
• $\left(\bar\mu_1,\; \underline{\mu}_2 - \gamma \frac{\bar\mu_1 - \underline{\mu}_1}{2}\right)$

In other words, $\Diamond([\underline{\mu}_1, \bar\mu_1], [\underline{\mu}_2, \bar\mu_2])$ is the parallelogram constructed by starting with the rectangle $[\underline{\mu}_1, \bar\mu_1] \times [\underline{\mu}_2, \bar\mu_2]$, then replacing the top and bottom edges with lines with slope $-\gamma$ (and adjusting the left and right edges accordingly to connect with the new top and bottom edges).

Consider a sequence of short-lived agents playing the stage game in rounds $t = 1, 2, 3, \ldots$ They are uncertain about both $\mu_1$ and $\mu_2$, with the prior density of the first-round agent $m(\mu_1, \mu_2)$ supported on feasible fundamentals $\mathcal{M} = \Diamond([\underline{\mu}_1, \bar\mu_1], [\underline{\mu}_2, \bar\mu_2])$ as in Remark 1(b). I abbreviate this support as $\Diamond$ when no confusion arises. Each agent $t$ chooses the optimal cutoff $\tilde{C}_t$ maximizing expected payoff based on the final belief $\tilde{M}_{t-1}$ of the immediate predecessor. I show the almost sure convergence of the stochastic processes $(\tilde{C}_t)$ and $(\tilde{M}_t)$ to the unique steady state under the hypotheses of Theorem 1.

OA 3.1 Preliminary Results

First, I consider how the predicted second-period payoff after $X_1 = x_1$ depends on the parameters of the feasible model $\Psi(\mu_1, \mu_2; \gamma)$.

Lemma OA.3.
For every $\mu_1, \mu_2, x_1 \in \mathbb{R}$, the conditional distribution $X_2 \mid X_1 = x_1$ is the same under $\Psi(\mu_1^\bullet, \mu_2 + \gamma(\mu_1 - \mu_1^\bullet); \gamma)$ and $\Psi(\mu_1, \mu_2; \gamma)$. So in particular, $C(\mu_1, \mu_2; \gamma) = C(\mu_1^\bullet, \mu_2 + \gamma(\mu_1 - \mu_1^\bullet); \gamma)$.

Proof. Under the feasible model $\Psi(\mu_1^\bullet, \mu_2 + \gamma(\mu_1 - \mu_1^\bullet); \gamma)$, the conditional density of $X_2$ given $X_1 = x_1$ is $f(\cdot \mid \mu_2 + \gamma(\mu_1 - \mu_1^\bullet) - \gamma(x_1 - \mu_1^\bullet))$, which simplifies to $f(\cdot \mid \mu_2 - \gamma(x_1 - \mu_1))$. It is easy to see that this is also the expression for the same conditional density under $\Psi(\mu_1, \mu_2; \gamma)$.

Suppose $C(\mu_1, \mu_2; \gamma) = c$. This implies the indifference condition,

$$u_1(c) = E_{\Psi(\mu_1, \mu_2; \gamma)}[u_2(c, X_2) \mid X_1 = c].$$

But by the equivalence of conditional distributions given above,

$$u_1(c) = E_{\Psi(\mu_1^\bullet, \mu_2 + \gamma(\mu_1 - \mu_1^\bullet); \gamma)}[u_2(c, X_2) \mid X_1 = c].$$

This means $c$ is also the indifference threshold for the model $\Psi(\mu_1^\bullet, \mu_2 + \gamma(\mu_1 - \mu_1^\bullet); \gamma)$.

As a corollary, this lemma shows the restriction to cutoff strategies is without loss, and that $\tilde{C}_t$ is well defined. That is, for any belief given by a density on $\mathcal{M}$, there exists a cutoff strategy that is weakly optimal among the class of all stopping strategies, and further this cutoff is unique. To see this, note that for any $x_1 \in \mathbb{R}$ and any density $\tilde{m}$ on $\mathcal{M}$,

$$\int_{\mathcal{M}} E_{\Psi(\mu_1, \mu_2; \gamma)}[u_2(x_1, X_2) \mid X_1 = x_1] \cdot \tilde{m}(\mu_1, \mu_2)\, d(\mu_1, \mu_2) = \int_{\underline{\mu}_2^\circ}^{\bar\mu_2^\circ} E_{\Psi(\mu_1^\bullet, \mu_2; \gamma)}[u_2(x_1, X_2) \mid X_1 = x_1] \cdot \tilde{m}_V(\mu_2)\, d\mu_2,$$

where $\bar\mu_2^\circ := \max\{\mu_2 : (\mu_1^\bullet, \mu_2) \in \Diamond\}$ and $\underline{\mu}_2^\circ := \min\{\mu_2 : (\mu_1^\bullet, \mu_2) \in \Diamond\}$, and $\tilde{m}_V(\mu_2)$ is the integral of $\tilde{m}(\mu_1, \mu_2)$ over the line in $\Diamond$ with slope $-\gamma$ that passes through $(\mu_1^\bullet, \mu_2)$. This equality holds because by Lemma OA.3, all fundamentals on that line imply the same continuation payoff after $X_1 = x_1$ as the fundamentals $(\mu_1^\bullet, \mu_2)$.

Footnote: I assume that agents do not update beliefs within the stage game.
The proof of Lemma A.13shows that x u ( x ) − Z ¯ µ ◦ µ ◦ E Ψ( µ • ,µ ; γ ) [ u ( x , X ) | X = x ] ˜ m V ( µ ) dµ is a strictly increasing, continuous function that crosses 0.Now, the key step is to separate the two-dimensional inference problem into a pair ofone-dimensional problems. OA 3.2 Learning µ • I define the stochastic process of data log-likelihood (for a given fundamental). For each µ , µ ∈ supp( m ), let ‘ t ( µ , µ )( ω ) be the log likelihood that the fundamentals are ( µ , µ )and histories ( ˜ H s ) s ≤ t ( ω ) are generated by the end of round t . It is given by ‘ t ( µ , µ )( ω ) := ln( m ( µ , µ )) + t X s =1 ln(lik( ˜ H s ( ω ); µ , µ ))where lik( x , ∅ ; µ , µ ) := f ( x | µ ) and lik( x , x ; µ , µ ) := f ( x | µ ) · f ( x | µ − γ ( x − µ )). We have f ( x | µ ) = g ( x − µ + µ • ) and f ( x | µ − γ ( x − µ )) = g ( x − µ + µ • + γ ( x − µ )). By simple algebra, we may expand ‘ t ( µ , µ )( ω ) = ln( m ( µ , µ )) + t X s =1 ln[ g ( X ,s ( ω ) − µ + µ • )]+ t X s =1 { X ,s ( ω ) ≤ ˜ C s ( ω ) } · ln [ g ( X ,s ( ω ) − µ + µ • + γ ( X ,s ( ω ) − µ ))]I first establish that, without knowing anything about the process ( C t ) , we can concludeagents learn µ • arbitrarily well. Lemma OA.4. For every (cid:15) > , almost surely lim t →∞ ˜ M t ( ♦ ∩ ([ µ • − (cid:15), µ • + (cid:15) ] × R )) = 1 . roof. I first calculate the directional derivative ∇ v t ‘ t ( µ , µ ) , where v = / √ γ − γ/ √ γ ! is the unit vector with slope − γ . We have ∂ ( ‘ t /t ) ∂µ ( µ , µ ) = 1 t D m ( µ , µ ) m ( µ , µ ) − t t X s =1 g ( X ,s − µ + µ • ) g ( X ,s − µ + µ • ) − γt t X s =1 { X ,s ≤ ˜ C s } · λ ( X ,s − µ + µ • + γ ( X ,s − µ )) ∂ ( ‘ t /t ) ∂µ ( µ , µ ) = 1 t D m ( µ , µ ) m ( µ , µ ) − t t X s =1 { X ,s ≤ ˜ C s } · λ ( X ,s − µ + µ • + γ ( X ,s − µ )) , where D m and D m are the two partial derivatives of m . At every ω and every ( µ , µ ) , note the last summand in ∂ ( ‘ t /t ) ∂µ is γ times the last summand in ∂ ( ‘ t /t ) ∂µ . 
Therefore, ∇ v t ‘ t ( µ , µ ) = − σ √ γ t t X s =1 g ( X ,s − µ + µ • ) g ( X ,s − µ + µ • ) ! + 1 t √ γ t D m ( µ , µ ) m ( µ , µ ) − γt √ γ D m ( µ , µ ) m ( µ , µ ) . Since m , D m , D m are continuous on the compact set ♦ , there exists some 0 < B < ∞ so that | D m ( µ ,µ ) m ( µ ,µ ) | < B and | D m ( µ ,µ ) m ( µ ,µ ) | < B for all ( µ , µ ) ∈ ♦ . This means for every ω, inf ( µ ,µ ) ∈ ♦ L "(cid:18) ∇ v t ‘ t ( µ , µ ) (cid:19) + 1 σ √ γ t t X s =1 g ( X ,s − µ + µ • ) g ( X ,s − µ + µ • ) ! ≥ − t (1 + γ ) √ γ B, where ♦ L := ♦ ∩ ([ µ , µ • − (cid:15) ] × R ) is the sub-parallelogram to the left of µ • − (cid:15) . By law oflarge numbers applied to the i.i.d. sequence ( g ( X ,s − ( µ • − (cid:15) )+ µ • ) g ( X ,s − ( µ • − (cid:15) )+ µ • ) ) s ≥ , almost surely1 t t X s =1 g ( X ,s − ( µ • − (cid:15) ) + µ • ) g ( X ,s − ( µ • − (cid:15) ) + µ • ) → E X ∼ g " g ( X + (cid:15) ) g ( X + (cid:15) ) . Since E X ∼ g (cid:20) g ( X ) g ( X ) (cid:21) = 0 and since z g ( z ) g ( z ) = ddz (ln( g ( z )) is strictly decreasing by log-concavity, there is some δ > E X ∼ g (cid:20) g ( X + (cid:15) ) g ( X + (cid:15) ) (cid:21) = − δ. Furthermore, for any µ ≥ • − (cid:15), then for any x ∈ R , g ( x − µ + µ • ) g ( x − µ + µ • ) ≤ g ( x + (cid:15) ) g ( x − (cid:15) ) . Along any ω where t P ts =1 g ( X ,s − ( µ • − (cid:15) )+ µ • ) g ( X ,s − ( µ • − (cid:15) )+ µ • ) → − δ , we therefore also havelim sup t →∞ sup µ ≥ µ • − (cid:15) t t X s =1 g ( X ,s − µ + µ • ) g ( X ,s − µ + µ • ) ≤ − δ. Therefore almost surelylim inf t →∞ inf ( µ ,µ ) ∈ ♦ L (cid:18) ∇ v t ‘ t ( µ , µ ) (cid:19) ≥ δσ √ γ . We may divide ♦ L further divide into two halves: ♦ L, := ♦ ∩ ([ µ , µ + d/ × R ) ♦ L, := ♦ ∩ ([ µ + d/ , µ • − (cid:15) ] × R )where d := µ • − (cid:15) − µ . I will show that lim t →∞ ˜ M t ( ♦ L, ) = 0 almost surely. The idea iswe can map every point in ♦ L, to another point in ♦ L, in the direction of v . 
For everypoint, its image under the map will have much higher posterior probability, since we have auniform, strictly positive lowerbound on the directional derivative of log-likelihood ‘ t in thedirection of v .˜ M t ( ♦ L, ) = Z ♦ L, ˜ m t ( µ , µ ) dµ = Z ♦ L, ˜ m t ( µ , µ ) · ˜ m t ( µ − d, µ − γd )˜ m t ( µ , µ ) dµ = Z ♦ L, ˜ m t ( µ , µ ) exp( ‘ t ( µ − d, µ − γd ) − ‘ t ( µ , µ )) dµ = Z ♦ L, ˜ m t ( µ , µ ) exp( − Z d ∇ v ‘ t ( µ − d + z, µ − γd + γz ) dz ) dµ Almost surely,lim inf t →∞ inf ( µ ,µ ) ∈ ♦ L, ,z ∈ [0 ,d ] ( ∇ v ‘ t ( µ − d + z, µ − γd + γz )) ≥ tδσ √ γ , so almost surelylim sup t →∞ ˜ M t ( ♦ L, ) ≤ lim sup t →∞ Z ♦ L, ˜ m t ( µ , µ ) exp( − dtδσ √ γ ) dµ. ω and t , the RHS is bounded above by exp( − dtδσ √ γ ), which tends to 0 as t → ∞ since d, δ > 0. So in fact ˜ M t ( ♦ L, ) → ♦ L, into two equal halves and iterating this argument, we eventuallyshow lim t →∞ ˜ M t ( ♦ ∩ ([ µ • − (cid:15), ∞ ) × R )) = 1. A symmetric argument also shows lim t →∞ ˜ M t ( ♦ ∩ (( −∞ , µ • + (cid:15) ] × R )) = 1 . OA 3.3 Decomposing Partial Derivative of Log-Likelihood WithRespect to µ I record a decomposition of ∂‘∂µ ( µ , µ ), the partial derivative of the log-likelihood processwith respect to its second argument.Define two stochastic processes: ϕ s ( µ , µ ) := − λ ( X ,s − µ + µ • + γ ( X ,s − µ )) · { X ,s ≤ ˜ C s } ¯ ϕ s ( µ , µ ) := ∂∂µ L ( µ + γ ( µ − µ • ) | ˜ C s ) . Note that ¯ ϕ s ( µ , µ ) is measurable with respect to F s − , since ( ˜ C t ) is a predictable pro-cess. Write ξ s ( µ , µ ) := ϕ s ( µ , µ ) − ¯ ϕ s ( µ , µ ) and y t ( µ , µ ) := P ts =1 ξ s ( µ , µ ). Write z t ( µ , µ ) := P ts =1 ¯ ϕ s ( µ , µ ). Lemma OA.5. ∂‘ t ∂µ ( µ , µ ) = D m ( µ ,µ ) m ( µ ,µ ) + y t ( µ , µ ) + z t ( µ , µ ) Proof. This comes from expanding ‘ t ( µ , µ ) and taking its derivative as in the proof ofLemma OA.4.Now I derive two results about the ξ t ( µ , µ ) processes for different pairs ( µ , µ ) . Lemma OA.6. 
There exists κ ξ < ∞ so that for every ( µ , µ ) ∈ ♦ and for every t ≥ ,ω ∈ Ω , E [ ξ t ( µ , µ ) |F t − ]( ω ) ≤ κ ξ .Proof. Note that ¯ ϕ t ( µ , µ ) is measurable with respect to F t − . Also, ϕ t ( µ , µ ) |F t − = ϕ t ( µ , µ ) | ˜ C t , because by independence of X t from ( X s ) t − s =1 , the only information that F t − contains about ϕ t ( µ , µ ) is in determining the cutoff threshold ˜ C t .At a sample path ω so that ˜ C t ( ω ) = c ∈ R , E [ ϕ s ( µ , µ ) |F t − ]( ω ) = E [ − λ ( X ,s − µ + µ • + γ ( X ,s − µ )) · { X ≤ c } ]= ∂∂µ Z c −∞ g ( x ) · Z ∞−∞ g ( x ) · ln( f ( X ,s | µ − γ ( X ,s − µ ))) dx dx = ∂∂µ Z c −∞ g ( x ) · Z ∞−∞ g ( x ) · ln( f ( X ,s | [ µ + γ ( µ − µ • )] − γ ( X ,s − µ • ))) dx dx = ∂∂µ ¯ L ( µ + γ ( µ − µ • ) | c ) . E [ ϕ s ( µ , µ ) |F t − ]( ω ) = ¯ ϕ s ( µ , µ )( ω ). Since this holds regardless of c , weget that E [ ϕ s ( µ , µ ) |F t − ] = ¯ ϕ t ( µ , µ ) for all ω, that is to say E [ ξ t ( µ , µ ) |F t − ] = Var[ ϕ t ( µ , µ ) |F t − ] ≤ E [ ϕ t ( µ , µ ) |F t − ] ≤ E [( λ ( X ,s − µ + µ • + γ ( X ,s − µ ))) ] . By the same argument as in the proof of Lemma A.15, E h ( λ ( X − µ + µ • + γ ( X − µ ))) i exists for all µ , µ ∈ R . The (finite) maximum value this expectation takes on the compactset ♦ can be taken as κ ξ . OA 3.4 Heidhues, Koszegi, and Strack (2018)’s Law of Large Num-bers I use a statistical result from Heidhues, Koszegi, and Strack (2018) to show that the y t /t termin the decomposition of t ∂‘ t ∂µ almost surely converges to 0 in the long run, and furthermorethis convergence is uniform on ♦ . This lets me focus on terms of the form ¯ ϕ s ( µ , µ ), whichcan be interpreted as the expected contribution to the log likelihood derivative from round s data. This lends tractability to the problem as ¯ ϕ s ( µ , µ ) only depends on ˜ C s , but not on X ,s or X ,s . Lemma OA.7. For every ( µ , µ ) ∈ ♦ , lim t →∞ | y t ( µ ,µ ) t | = 0 almost surely.Proof. 
Heidhues, Koszegi, and Strack (2018)'s Proposition 10 shows that if $(y_t)$ is a martingale such that there exists some constant $v \ge 0$ with $[y]_t \le vt$ almost surely, where $[y]_t$ is the quadratic variation of $(y_t)$, then almost surely $\lim_{t\to\infty} \frac{y_t}{t} = 0$.

Consider the process $y_t(\mu_1,\mu_2)$ for a fixed $(\mu_1,\mu_2) \in \blacklozenge$. By definition $y_t = \sum_{s=1}^{t} \big(\varphi_s(\mu_1,\mu_2) - \bar\varphi_s(\mu_1,\mu_2)\big)$. As established in the proof of Lemma OA.6, for every $s$, $\bar\varphi_s(\mu_1,\mu_2) = \mathbb{E}[\varphi_s(\mu_1,\mu_2) \mid \mathcal{F}_{s-1}]$. So for $t' < t$,

$$
\begin{aligned}
\mathbb{E}[y_t(\mu_1,\mu_2) \mid \mathcal{F}_{t'}]
&= \sum_{s=1}^{t'} \big(\varphi_s(\mu_1,\mu_2) - \bar\varphi_s(\mu_1,\mu_2)\big) + \mathbb{E}\Big[\sum_{s=t'+1}^{t} \big(\varphi_s(\mu_1,\mu_2) - \bar\varphi_s(\mu_1,\mu_2)\big) \,\Big|\, \mathcal{F}_{t'}\Big] \\
&= \sum_{s=1}^{t'} \big(\varphi_s(\mu_1,\mu_2) - \bar\varphi_s(\mu_1,\mu_2)\big) + \sum_{s=t'+1}^{t} \mathbb{E}\big[\mathbb{E}[\varphi_s(\mu_1,\mu_2) - \bar\varphi_s(\mu_1,\mu_2) \mid \mathcal{F}_{s-1}] \,\big|\, \mathcal{F}_{t'}\big] \\
&= \sum_{s=1}^{t'} \big(\varphi_s(\mu_1,\mu_2) - \bar\varphi_s(\mu_1,\mu_2)\big) + 0 \\
&= y_{t'}(\mu_1,\mu_2).
\end{aligned}
$$

This shows $(y_t(\mu_1,\mu_2))_t$ is a martingale. Also,

$$
[y(\mu_1,\mu_2)]_t = \sum_{s=1}^{t-1} \mathbb{E}\big[(y_s(\mu_1,\mu_2) - y_{s-1}(\mu_1,\mu_2))^2 \mid \mathcal{F}_{s-1}\big] = \sum_{s=1}^{t-1} \mathbb{E}\big[\xi_s^2(\mu_1,\mu_2) \mid \mathcal{F}_{s-1}\big] \le \kappa_\xi \cdot t
$$

by Lemma OA.6. Therefore Heidhues, Koszegi, and Strack (2018) Proposition 10 applies.

Lemma OA.8. $\lim_{t\to\infty} \sup_{(\mu_1,\mu_2)\in\blacklozenge} \left|\frac{y_t(\mu_1,\mu_2)}{t}\right| = 0$ almost surely.

Proof. This argument is similar to Lemma 11 in Heidhues, Koszegi, and Strack (2018). I apply Lemma 2 of Andrews (1992), which says that to prove this result I just need to check conditions BD, P-SLLN, and S-LIP from Andrews (1992). BD holds because $\blacklozenge$ is a bounded subset of $\mathbb{R}^2$. P-SLLN holds by Lemma OA.7, which shows that for all $(\mu_1,\mu_2) \in \blacklozenge$, $\lim_{t\to\infty} \left|\frac{y_t(\mu_1,\mu_2)}{t}\right| = 0$ almost surely.

Condition S-LIP is essentially a Lipschitz continuity condition.
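It requires finding a sequence of random variables $B_t$ such that

$$
|\xi_t(\mu_1,\mu_2) - \xi_t(\mu_1',\mu_2')| \le B_t \cdot \big(|\mu_1 - \mu_1'| + |\mu_2 - \mu_2'|\big)
$$

almost surely, and such that these random variables satisfy $\sup_{t\ge 1} \frac{1}{t}\sum_{s=1}^{t} \mathbb{E}[B_s] < \infty$ and $\lim_{t\to\infty} \frac{1}{t}\sum_{s=1}^{t} (B_s - \mathbb{E}[B_s]) = 0$ almost surely.

But for every $\omega$, recalling that $\varphi_s(\mu_1,\mu_2) := -\lambda\big(X_{2,s} - \mu_2 + \mu_2^\bullet + \gamma(X_{1,s} - \mu_1)\big) \cdot \mathbf{1}\{X_{1,s} \le \tilde C_s\}$, we have

$$
|\varphi_s(\mu_1,\mu_2) - \varphi_s(\mu_1',\mu_2')| \le \big|\lambda\big(X_{2,s} - \mu_2 + \mu_2^\bullet + \gamma(X_{1,s} - \mu_1)\big) - \lambda\big(X_{2,s} - \mu_2' + \mu_2^\bullet + \gamma(X_{1,s} - \mu_1')\big)\big|.
$$

Since $\ln(g(\cdot))$ has a bounded second derivative, the RHS is bounded by $\kappa_g \cdot \big(|\mu_2 - \mu_2'| + \gamma \cdot |\mu_1 - \mu_1'|\big)$. Now that we know $|\varphi_s(\mu_1,\mu_2) - \varphi_s(\mu_1',\mu_2')|(\omega) \le \kappa_g \cdot \big(|\mu_2 - \mu_2'| + \gamma \cdot |\mu_1 - \mu_1'|\big)$ for all $\omega$, we must also have $|\bar\varphi_s(\mu_1,\mu_2) - \bar\varphi_s(\mu_1',\mu_2')|(\omega) \le \kappa_g \cdot \big(|\mu_2 - \mu_2'| + \gamma \cdot |\mu_1 - \mu_1'|\big)$ for all $\omega$, since $\bar\varphi_s(\mu_1,\mu_2) = \mathbb{E}[\varphi_s(\mu_1,\mu_2) \mid \mathcal{F}_{s-1}]$. Setting $B_s$ as the constant $2\kappa_g$ for every $s$ satisfies S-LIP.

OA 3.5 Bounds on Asymptotic Beliefs and Asymptotic Cutoffs

Recall that Lemma OA.3 implies that if we draw the line with slope $-\gamma$ through the point $(\mu_1^\bullet, \mu_2)$, all pairs of fundamentals on this line have the same optimal cutoff threshold. Then against any feasible model $\Psi(\mu_1,\mu_2;\gamma)$ with $(\mu_1,\mu_2) \in \blacklozenge$, the best cutoff strategy is between $\underline{c}^\circ := C(\mu_1^\bullet, \underline{\mu}_2^\circ; \gamma)$ and $\bar c^\circ := C(\mu_1^\bullet, \bar\mu_2^\circ; \gamma)$.

For $\mu_2^l \le \mu_2^h$ in the interval $[\underline{\mu}_2^\circ, \bar\mu_2^\circ]$, let $\blacklozenge_{[\mu_2^l,\mu_2^h]} \subseteq \blacklozenge$ be constructed from $\blacklozenge$ by translating its bottom and top edges towards the center, so that they pass through $(\mu_1^\bullet, \mu_2^l)$ and $(\mu_1^\bullet, \mu_2^h)$ respectively. For $(\mu_1^\bullet, \mu_2) \in \blacklozenge$, let $li(\mu_2) \subseteq \blacklozenge$ be the line segment in $\mathrm{supp}(m)$ with slope $-\gamma$ that contains the point $(\mu_1^\bullet, \mu_2)$. So, $\blacklozenge_{[\mu_2^l,\mu_2^h]} = \cup_{\mu_2 \in [\mu_2^l,\mu_2^h]}\, li(\mu_2)$.

Lemma OA.9. For $\underline{c} \ge \underline{c}^\circ$, if $\liminf_{t\to\infty} \tilde C_t \ge \underline{c}$ almost surely, then $\lim_{t\to\infty} \tilde M_t\big(\blacklozenge_{[\underline{\mu}_2^\circ,\, \mu_2^*(\underline{c}))}\big) = 0$ almost surely.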
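Also, for $\bar c \le \bar c^\circ$, if $\limsup_{t\to\infty} \tilde C_t \le \bar c$ almost surely, then $\lim_{t\to\infty} \tilde M_t\big(\blacklozenge_{(\mu_2^*(\bar c),\, \bar\mu_2^\circ]}\big) = 0$ almost surely.

Proof. I first show that for all $\epsilon > 0$, there exists $\delta > 0$ so that almost surely

$$
\liminf_{t\to\infty}\ \inf_{(\mu_1,\mu_2) \in \blacklozenge_{[\underline{\mu}_2^\circ,\, \mu_2^*(\underline{c}) - \epsilon]}} \frac{1}{t}\frac{\partial \ell_t}{\partial \mu_2}(\mu_1,\mu_2) \ge \delta.
$$

From Lemma OA.5, we may rewrite the LHS as

$$
\liminf_{t\to\infty}\ \inf_{(\mu_1,\mu_2) \in \blacklozenge_{[\underline{\mu}_2^\circ,\, \mu_2^*(\underline{c}) - \epsilon]}} \left[ \frac{1}{t}\frac{D_2 m(\mu_1,\mu_2)}{m(\mu_1,\mu_2)} + \frac{y_t(\mu_1,\mu_2)}{t} + \frac{z_t(\mu_1,\mu_2)}{t} \right],
$$

which is no smaller than taking the inf separately across the three terms in the bracket,

$$
\liminf_{t\to\infty}\ \inf \frac{1}{t}\frac{D_2 m(\mu_1,\mu_2)}{m(\mu_1,\mu_2)} + \liminf_{t\to\infty}\ \inf \frac{y_t(\mu_1,\mu_2)}{t} + \liminf_{t\to\infty}\ \inf \frac{z_t(\mu_1,\mu_2)}{t},
$$

where each inf is over $(\mu_1,\mu_2) \in \blacklozenge_{[\underline{\mu}_2^\circ,\, \mu_2^*(\underline{c}) - \epsilon]}$.

Since $D_2 m / m$ is bounded on $\blacklozenge$, as $D_2 m$ is continuous and $m$ is continuous and strictly positive on the compact set $\blacklozenge$, the first term is 0 for every $\omega$. To deal with the second term,

$$
\liminf_{t\to\infty}\ \inf_{(\mu_1,\mu_2) \in \blacklozenge_{[\underline{\mu}_2^\circ,\, \mu_2^*(\underline{c}) - \epsilon]}} \frac{y_t(\mu_1,\mu_2)}{t} \ge \liminf_{t\to\infty}\ \inf_{(\mu_1,\mu_2) \in \blacklozenge} -\left|\frac{y_t(\mu_1,\mu_2)}{t}\right| = \liminf_{t\to\infty} \left( -1 \cdot \sup_{(\mu_1,\mu_2)\in\blacklozenge} \left|\frac{y_t(\mu_1,\mu_2)}{t}\right| \right).
$$

Lemma OA.8 gives $\lim_{t\to\infty} \sup_{(\mu_1,\mu_2)\in\blacklozenge} \left|\frac{y_t(\mu_1,\mu_2)}{t}\right| = 0$ almost surely. Hence, we conclude that $\liminf_{t\to\infty} \inf_{(\mu_1,\mu_2) \in \blacklozenge_{[\underline{\mu}_2^\circ,\, \mu_2^*(\underline{c}) - \epsilon]}} \frac{y_t(\mu_1,\mu_2)}{t} \ge 0$ almost surely.

It remains to find some $\delta > 0$ so that $\liminf_{t\to\infty} \inf_{(\mu_1,\mu_2) \in \blacklozenge_{[\underline{\mu}_2^\circ,\, \mu_2^*(\underline{c}) - \epsilon]}} \frac{z_t(\mu_1,\mu_2)}{t} \ge \delta$ almost surely. The proof of Lemma A.19 establishes that, if we put $\delta = \frac{\partial}{\partial \mu_2}\bar L(\mu_2^*(\underline{c}) - \epsilon \mid \underline{c})$, then $\delta > 0$ and $\frac{\partial}{\partial \mu_2}\bar L(\mu_2 \mid c) \ge \delta$ whenever $c \ge \underline{c}$ and $\mu_2 \le \mu_2^*(\underline{c}) - \epsilon$. For every $(\mu_1,\mu_2) \in li(\hat\mu_2)$, $\mu_2 + \gamma(\mu_1 - \mu_1^\bullet) = \hat\mu_2$.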
So, ¯ ϕ s ( µ , µ )( ω ) ≥ δ for all ( µ , µ ) ∈ ♦ [ µ ◦ ,µ ∗ ( c ) − (cid:15) ] , whenever ˜ C s ( ω ) ≥ c .Along any ω where lim inf t →∞ ˜ C t ≥ c , we therefore havelim inf s →∞ inf ( µ ,µ ) ∈ ♦ [ µ ◦ ,µ ∗ c ) − (cid:15) ] ¯ ϕ s ( µ , µ ) ≥ δ and thuslim inf t →∞ inf ( µ ,µ ) ∈ ♦ [ µ ◦ ,µ ∗ c ) − (cid:15) ] ] z t ( µ , µ ) t = lim inf t →∞ inf ( µ ,µ ) ∈ ♦ [ µ ◦ ,µ ∗ c ) − (cid:15) ] ] t " t X s =1 ¯ ϕ s ( µ , µ ) ≥ δ. From here, the same argument as in the proof of Lemma OA.4 showslim t →∞ ˜ M t ( ♦ [ µ ◦ ,µ ∗ ( c ) − (cid:15) ] ) =0 almost surely. Since the choice of (cid:15) > µ to deduce asymptotic restric-tions on their cutoffs. Lemma OA.10. Suppose that there are µ ◦ ≤ µ l < µ h ≤ ¯ µ ◦ such that lim t →∞ ˜ M t ( ♦ [ µ l ,µ h ] ) =1 almost surely. Then lim inf t →∞ ˜ C t ≥ C ( µ • , µ l ; γ ) and lim sup t →∞ ˜ C t ≤ C ( µ • , µ h ; γ ) almostsurely.Proof. I show lim inf t →∞ ˜ C t ≥ C ( µ • , µ l ; γ ) almost surely. The argument establishing lim sup t →∞ ˜ C t ≤ C ( µ • , µ h ; γ ) is symmetric.Let c l = C ( µ • , µ l ; γ ), and recall before we defined c ◦ := C ( µ • , µ ◦ ; γ ) and ¯ c ◦ := C ( µ • , ¯ µ ◦ ; γ ).By Lemma OA.3, C ( µ , µ ; γ ) = C ( µ • , µ ; γ ) for all ( µ , µ ) ∈ li ( µ ). Since c U ( c ; µ , µ ) is single peaked for every ( µ , µ ) , and since c l ≤ C ( µ • , µ ; γ ) for all µ ∈ [ µ l , µ h ] , we also get c l ≤ C ( µ , µ ; γ ) for every ( µ , µ ) ∈ ♦ [ µ l ,µ h ] , since ♦ [ µ l ,µ h ] is the union of the linesegments, ♦ [ µ l ,µ h ] = ∪ µ ∈ [ µ l ,µ h ] li ( µ ).Fix some (cid:15) > . We get U ( c l ; µ , µ ) − U ( c l − (cid:15) ; µ , µ ) > µ , µ ) ∈ ♦ [ µ l ,µ h ] . As( µ , µ ) (cid:16) U ( c l ; µ , µ ) − U ( c l − (cid:15) ; µ , µ ) (cid:17) is continuous, there exists some κ ∗ > U ( c l ; µ , µ ) − U ( c l − (cid:15) ; µ , µ ) > κ ∗ for all ( µ , µ ) ∈ ♦ [ µ l ,µ h ] . In particular, if ν ∈ ∆( ♦ [ µ l ,µ h ] )is a belief about fundamentals, then R U ( c l ; µ , µ ) − U ( c l − (cid:15) ; µ , µ ) > dν ( µ ) > κ ∗ . 
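Now, let

$$
\bar\kappa := \sup_{c \in [\underline{c}^\circ, \bar c^\circ]}\ \sup_{(\mu_1,\mu_2)\in\blacklozenge} U(c;\mu_1,\mu_2), \qquad \underline{\kappa} := \inf_{c \in [\underline{c}^\circ, \bar c^\circ]}\ \inf_{(\mu_1,\mu_2)\in\blacklozenge} U(c;\mu_1,\mu_2).
$$

Choose $p \in (0,1)$ so that $p\kappa^* - (1-p)(\bar\kappa - \underline{\kappa}) = 0$. At any belief $\hat\nu \in \Delta(\blacklozenge)$ that assigns more than probability $p$ to the sub-parallelogram $\blacklozenge_{[\mu_2^l,\mu_2^h]}$, the optimal cutoff is larger than $c^l - \epsilon$. To see this, take any $\hat c \le c^l - \epsilon$; I will show $\hat c$ is suboptimal. If $\hat c < \underline{c}^\circ$, then it is suboptimal after any belief on $\blacklozenge$. If $\underline{c}^\circ \le \hat c \le c^l - \epsilon$, I show that

$$
\int U(c^l;\mu_1,\mu_2) - U(\hat c;\mu_1,\mu_2)\, d\hat\nu(\mu) > 0.
$$

To see this, we may decompose $\hat\nu$ as the mixture of a probability measure $\nu$ on $\blacklozenge_{[\mu_2^l,\mu_2^h]}$ and another probability measure $\nu^c$ on $\blacklozenge \setminus \blacklozenge_{[\mu_2^l,\mu_2^h]}$. Let $\hat p > p$ be the probability that $\hat\nu$ assigns to $\blacklozenge_{[\mu_2^l,\mu_2^h]}$. The above integral is equal to:

$$
\hat p \int_{\blacklozenge_{[\mu_2^l,\mu_2^h]}} U(c^l;\mu_1,\mu_2) - U(\hat c;\mu_1,\mu_2)\, d\nu(\mu) + (1-\hat p) \int_{\blacklozenge \setminus \blacklozenge_{[\mu_2^l,\mu_2^h]}} U(c^l;\mu_1,\mu_2) - U(\hat c;\mu_1,\mu_2)\, d\nu^c(\mu).
$$

Since $c^l$ is to the left of the optimal cutoff for all $(\mu_1,\mu_2) \in \blacklozenge_{[\mu_2^l,\mu_2^h]}$ and $\hat c \le c^l - \epsilon$, single-peakedness gives $U(\hat c;\mu_1,\mu_2) \le U(c^l - \epsilon;\mu_1,\mu_2)$ for all $(\mu_1,\mu_2) \in \blacklozenge_{[\mu_2^l,\mu_2^h]}$. The first summand is therefore no less than

$$
\hat p \int_{\blacklozenge_{[\mu_2^l,\mu_2^h]}} U(c^l;\mu_1,\mu_2) - U(c^l - \epsilon;\mu_1,\mu_2)\, d\nu(\mu) \ge \hat p \kappa^*.
$$

Also, the integrand in the second summand is no smaller than $-(\bar\kappa - \underline{\kappa})$, therefore $\int U(c^l;\mu_1,\mu_2) - U(\hat c;\mu_1,\mu_2)\, d\hat\nu(\mu) \ge \hat p\kappa^* - (1-\hat p)(\bar\kappa - \underline{\kappa})$. Since $\hat p > p$, we get $\hat p\kappa^* - (1-\hat p)(\bar\kappa - \underline{\kappa}) > 0$.

Along any $\omega$ where $\lim_{t\to\infty} \tilde M_t(\blacklozenge_{[\mu_2^l,\mu_2^h]})(\omega) = 1$, eventually $\tilde M_t(\blacklozenge_{[\mu_2^l,\mu_2^h]})(\omega) > p$ for all large enough $t$, meaning $\liminf_{t\to\infty} \tilde C_t(\omega) \ge c^l - \epsilon$. Since $\lim_{t\to\infty} \tilde M_t(\blacklozenge_{[\mu_2^l,\mu_2^h]}) = 1$ almost surely, this shows $\liminf_{t\to\infty} \tilde C_t \ge C(\mu_1^\bullet, \mu_2^l; \gamma) - \epsilon$ almost surely. Since the choice of $\epsilon > 0$ is arbitrary, $\liminf_{t\to\infty} \tilde C_t \ge C(\mu_1^\bullet, \mu_2^l; \gamma)$ almost surely.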
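The mass threshold $p$ used in this proof has a closed form, $p = (\bar\kappa - \underline{\kappa})/(\kappa^* + \bar\kappa - \underline{\kappa})$, which can be illustrated numerically. The following is a minimal sketch: the function names and the numerical values for $\kappa^*$, $\bar\kappa$, $\underline{\kappa}$ are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def mass_threshold(kappa_star, kappa_bar, kappa_low):
    """Solve p*kappa_star - (1 - p)*(kappa_bar - kappa_low) = 0 for p."""
    spread = kappa_bar - kappa_low  # worst-case payoff loss outside the sub-parallelogram
    return spread / (kappa_star + spread)


def gap_lower_bound(p_hat, kappa_star, kappa_bar, kappa_low):
    """Lower bound p_hat*kappa_star - (1 - p_hat)*(kappa_bar - kappa_low) on the payoff gap."""
    return p_hat * kappa_star - (1.0 - p_hat) * (kappa_bar - kappa_low)


# Hypothetical values (assumptions): kappa_star > 0 and kappa_bar > kappa_low.
KAPPA_STAR, KAPPA_BAR, KAPPA_LOW = 0.05, 3.0, -1.0

p = mass_threshold(KAPPA_STAR, KAPPA_BAR, KAPPA_LOW)
print(f"threshold p = {p:.4f}")
# Any belief putting slightly more than p on the sub-parallelogram gives a positive bound.
print(gap_lower_bound(p + 0.005, KAPPA_STAR, KAPPA_BAR, KAPPA_LOW) > 0)
```

When $\kappa^*$ is small relative to the payoff spread $\bar\kappa - \underline{\kappa}$, the threshold $p$ is close to 1, which is why the argument needs the posterior mass on the sub-parallelogram to converge to 1 rather than merely stay large.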
OA 3.6 The Contraction Map

I now combine the results established so far to prove the convergence statement in Theorem 1.

Proof. Let $\mu^l_{2,[1]} := \underline{\mu}_2^\circ$, $\mu^h_{2,[1]} := \bar\mu_2^\circ$. For $k = 2, 3, \ldots$, iteratively define $\mu^l_{2,[k]} := I(\mu^l_{2,[k-1]}; \gamma)$ and $\mu^h_{2,[k]} := I(\mu^h_{2,[k-1]}; \gamma)$.

From Lemma OA.10, if $\lim_{t\to\infty} \tilde M_t\big(\blacklozenge_{[\mu^l_{2,[k]},\, \mu^h_{2,[k]}]}\big) = 1$ almost surely, then $\liminf_{t\to\infty} \tilde C_t \ge C(\mu_1^\bullet, \mu^l_{2,[k]}; \gamma)$ and $\limsup_{t\to\infty} \tilde C_t \le C(\mu_1^\bullet, \mu^h_{2,[k]}; \gamma)$ almost surely. But using these conclusions in Lemma OA.9, we further deduce that

$$
\lim_{t\to\infty} \tilde M_t\Big(\blacklozenge_{[\mu_2^*(C(\mu_1^\bullet, \mu^l_{2,[k]}; \gamma)),\ \mu_2^*(C(\mu_1^\bullet, \mu^h_{2,[k]}; \gamma))]}\Big) = 1
$$

almost surely, that is to say $\lim_{t\to\infty} \tilde M_t\big(\blacklozenge_{[\mu^l_{2,[k+1]},\, \mu^h_{2,[k+1]}]}\big) = 1$ almost surely.

The sequences $(\mu^l_{2,[k]})_{k\ge 1}$ and $(\mu^h_{2,[k]})_{k\ge 1}$ are the iterates of a contraction map, so $\lim_{k\to\infty} \mu^l_{2,[k]} = \mu_2^\infty = \lim_{k\to\infty} \mu^h_{2,[k]}$. Thus, the agent's posterior converges in $L^1$ to the line segment with slope $-\gamma$ containing $(\mu_1^\bullet, \mu_2^\infty)$ almost surely (since the support of the prior is bounded). In addition, the sequences of bounds on asymptotic actions also converge by continuity, $\lim_{k\to\infty} C(\mu_1^\bullet, \mu^l_{2,[k]}; \gamma) = c^\infty = \lim_{k\to\infty} C(\mu_1^\bullet, \mu^h_{2,[k]}; \gamma)$. This implies $\lim_{t\to\infty} \tilde C_t = c^\infty$ almost surely.

Finally, combining the asymptotic belief result with Lemma OA.4, we see that in fact $\tilde M_t$ converges in $L^1$ to the point $(\mu_1^\bullet, \mu_2^\infty)$ almost surely.

OA 4 Foundation for Inference and Behavior in the Large-Generation Environment

In Section 4, I introduced the large-generations social-learning environment with a continuum of agents in each generation. When agents in generations $\tau = 0, 1, \ldots, t-1$ use cutoff thresholds $c_{[0]}, c_{[1]}, \ldots, c_{[t-1]}$, each generation $t$ agent observes an infinite sample of histories $(h_{\tau,n})_{n\in[0,1]}$ drawn from the history distribution $\mathcal{H}^\bullet(c_{[\tau]})$ for each $0 \le \tau \le t-1$.
Agents infer the large-generations pseudo-true fundamentals $\mu_1^*(c_{[0]}, \ldots, c_{[t-1]})$, $\mu_2^*(c_{[0]}, \ldots, c_{[t-1]})$ and choose the stopping strategy that best responds to the feasible model with these parameters.

In this section, I provide a finite-population foundation for inference and behavior in the large-generations environment for the Gaussian case. For $K \ge 1$, let $c_\dagger = (c^{(k)}_\dagger)_{k=1}^{K} \in \mathbb{R}^K$ be a list of cutoff thresholds. I show that when an agent starts with a full-support prior on the space of fundamentals $\mathbb{R}^2$ and observes $N < \infty$ histories drawn i.i.d. from each of $\mathcal{H}^\bullet(c^{(k)}_\dagger)$ for $1 \le k \le K$, her posterior belief almost surely converges to the dogmatic belief on the large-generations pseudo-true fundamentals $\mu_1^*(c_\dagger), \mu_2^*(c_\dagger)$ as $N \to \infty$. Also, if she chooses the cutoff strategy $S_c$ maximizing her posterior expected payoffs, then as $N \to \infty$ and provided the stage-game payoff functions $u_1, u_2$ are Lipschitz continuous, her cutoff choice almost surely converges to $C(\mu_1^*(c_\dagger), \mu_2^*(c_\dagger); \gamma)$.

OA 4.1 Setting up the Probability Space

Suppose an agent has a full-support prior density $m : \mathbb{R}^2 \to \mathbb{R}_{>0}$ over fundamentals $(\mu_1,\mu_2)$. To formally define the problem, consider the $\mathbb{R}^{2K}$-valued stochastic process $(X_n)_{n\ge 1} = (X^{(k)}_{1,n}, X^{(k)}_{2,n})_{1\le k\le K,\, n\ge 1}$, where $X_s$ and $X_{s'}$ are independent for $s \ne s'$. Here, the $X_n$ are i.i.d. $\mathbb{R}^{2K}$-valued random variables with independent components, distributed as $X^{(k)}_{1,n} \sim \mathcal{N}(\mu_1^\bullet, \sigma^2)$, $X^{(k)}_{2,n} \sim \mathcal{N}(\mu_2^\bullet, \sigma^2)$ for each $1 \le k \le K$. The interpretation is that there are $K$ different populations, who play the stage game using different cutoff thresholds. The random variables $(X^{(k)}_{1,n}, X^{(k)}_{2,n})$ are the potential draws in the $n$-th iteration of the stage game in population $k$ (but $X^{(k)}_{2,n}$ may not be observed if $X^{(k)}_{1,n}$ is sufficiently large).
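The data-generating process just described can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function name `simulate_histories`, the cutoff values, and the parameter values are hypothetical, and the censoring rule (the second draw is hidden whenever the first draw is at or above the population's cutoff) is written out explicitly as an assumption of the sketch.

```python
import random


def simulate_histories(cutoffs, n_rounds, mu1_true, mu2_true, sigma, seed=0):
    """Return n_rounds lists of K histories (x1, x2); x2 is None when censored."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_rounds):
        round_histories = []
        for cutoff in cutoffs:  # one population per cutoff threshold
            x1 = rng.gauss(mu1_true, sigma)
            x2 = rng.gauss(mu2_true, sigma)  # potential second draw, independent of x1
            # Assumed censoring rule: stop (and hide x2) when x1 >= cutoff.
            round_histories.append((x1, None) if x1 >= cutoff else (x1, x2))
        dataset.append(round_histories)
    return dataset


CUTOFFS = [-0.5, 0.0, 1.0]  # hypothetical thresholds for K = 3 populations
data = simulate_histories(CUTOFFS, n_rounds=1000, mu1_true=0.0, mu2_true=0.0, sigma=1.0)

# Share of censored histories per population: lower cutoffs censor more often.
censored_share = [
    sum(1 for rnd in data if rnd[k][1] is None) / len(data) for k in range(len(CUTOFFS))
]
print(censored_share)
```

Each round produces $2K$ potential draws, but only the uncensored second draws enter the dataset that the agent learns from.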
Clearly, there is a probability space $(\Omega, \mathcal{A}, \mathbb{P})$, with sample space $\Omega = (\mathbb{R}^{2K})^\infty$ interpreted as paths of the process just described, $\mathcal{A}$ the Borel $\sigma$-algebra on $\Omega$, and $\mathbb{P}$ the measure on sample paths so that the process $X_n(\omega) = \omega_n$ has the desired distribution. The term “almost surely” means “with probability 1 with respect to the realization of the infinite sequence of all (potential) draws”, i.e. $\mathbb{P}$-almost surely.

For each $n \ge 1$ and $1 \le k \le K$, let $H^{(k)}_n$ be the (random) history given by

$$
H^{(k)}_n = \begin{cases} (X^{(k)}_{1,n}, \varnothing) & \text{if } X^{(k)}_{1,n} \ge c^{(k)}_\dagger, \\ (X^{(k)}_{1,n}, X^{(k)}_{2,n}) & \text{if } X^{(k)}_{1,n} < c^{(k)}_\dagger. \end{cases}
$$

Let $H_n = (H^{(1)}_n, \ldots, H^{(K)}_n)$. After each finite $N$, the agent updates the prior density $m$ about the fundamentals by Bayes' rule, based on the finite dataset of histories $(H_n)_{n\le N}$. She ends up with a random, non-degenerate posterior density $\tilde m_N = m(\cdot \mid (H_n)_{n\le N})$, whose randomness comes from the randomness of the $2K \cdot N$ potential draws.

OA 4.2 Inference after Observing Large Samples

Proposition OA.3 shows that as $N \to \infty$, the random posterior $\tilde m_N$ converges to the large-generations pseudo-true fundamentals in $L^1$.

Proposition OA.3. Suppose $m : \mathbb{R}^2 \to \mathbb{R}_{>0}$ is integrable and has bounded magnitude. Almost surely,

$$
\lim_{N\to\infty} \mathbb{E}_{(\mu_1,\mu_2)\sim \tilde m_N}\big( |\mu_1 - \mu_1^\bullet| + |\mu_2 - \mu_2^*(c_\dagger)| \big) = 0.
$$

Belief convergence in $L^1$ is required to later establish convergence of behavior in Proposition OA.5. This convergence does not follow from Berk (1966), because his result only establishes convergence in a weaker mode: for any open set containing the pseudo-true fundamentals, the mass that the posterior belief assigns to the open set almost surely converges to 1. Crucially, the prior distribution in this setting has full support on an unbounded domain of feasible fundamentals, $(\mu_1,\mu_2) \in M = \mathbb{R}^2$. Indeed, one of the implications of my central inference result, Proposition 2, is that the pseudo-true parameter becomes unboundedly pessimistic as the censoring threshold decreases.
So, the weak mode of convergence in Berk (1966)'s conclusion leaves open the possibility that posterior beliefs for increasing $N$ put decreasing mass on increasingly extreme values of $\mu_2$. If the magnitudes of these extreme values grow more quickly in $N$ than the speed with which probability concentrates on the open set around the pseudo-true fundamentals, then there can be a positive-probability event where the agent's behavior is bounded away from $C(\mu_1^\bullet, \mu_2^*(c_\dagger); \gamma)$ for every $N$.

Instead, I apply Bunke and Milhaud (1998)'s results to derive the stronger convergence in $L^1$ that subsequently allows for convergence of payoffs and behavior as the agent's sample grows large. One technical challenge is that the results of Bunke and Milhaud (1998) only apply in environments where observables are valued in some Euclidean space and given by densities, but censored histories are valued in $\mathbb{H}$ and their distributions have a probability mass on the missing data indicator $\varnothing$. So, I first consider a noise-added observation structure where each history $H^{(k)}_n$ is replaced by the $\mathbb{R}^2$-valued pair $(X^{(k)}_{1,n}, Y^{(k)}_n)$, where $Y^{(k)}_n = X^{(k)}_{2,n}$ if $X^{(k)}_{1,n} < c^{(k)}_\dagger$. But if $X^{(k)}_{1,n} \ge c^{(k)}_\dagger$, then $Y^{(k)}_n \sim \mathcal{N}(0,1)$ is a white noise term that is independent of the draws of any decision problem. The idea is that a censored draw is replaced by noise that is uninformative about the fundamentals, so the distribution of each $(X^{(k)}_{1,n}, Y^{(k)}_n)$ pair is given by a density function on $\mathbb{R}^2$. After establishing the analogous belief convergence result in the auxiliary environment, I map the result back into the environment of observing censored histories.
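The claim that replacing censored second draws with white noise leaves the posterior unchanged can be checked numerically in a toy version of the setup. The sketch below is an illustration only: it assumes $K = 1$, a flat prior on a finite grid of $(\mu_1, \mu_2)$ values (the environment in the text has a density prior on $\mathbb{R}^2$), and hypothetical parameter values; all function names are made up for this example.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def log_likelihood(rounds, mu1, mu2, gamma, var, cutoff, noise_added):
    """Subjective log-likelihood of the data under the feasible model with (mu1, mu2)."""
    total = 0.0
    for x1, x2, z in rounds:
        total += log_normal_pdf(x1, mu1, var)
        if x1 < cutoff:  # uncensored: second draw is observed
            total += log_normal_pdf(x2, mu2 - gamma * (x1 - mu1), var)
        elif noise_added:  # censored: the noise term z has a parameter-free N(0, 1) density
            total += log_normal_pdf(z, 0.0, 1.0)
    return total


def grid_posterior(rounds, grid, gamma, var, cutoff, noise_added):
    """Posterior over grid points under a flat prior (an assumption of this sketch)."""
    logs = [log_likelihood(rounds, m1, m2, gamma, var, cutoff, noise_added) for m1, m2 in grid]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(1)
# Each round: (first draw, potential second draw, white noise), all hypothetical N(0, 1).
rounds = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
grid = [(a / 4.0, b / 4.0) for a in range(-8, 9) for b in range(-8, 9)]

post_history = grid_posterior(rounds, grid, gamma=0.5, var=1.0, cutoff=0.3, noise_added=False)
post_noise = grid_posterior(rounds, grid, gamma=0.5, var=1.0, cutoff=0.3, noise_added=True)
max_gap = max(abs(a - b) for a, b in zip(post_history, post_noise))
print(max_gap < 1e-9)
```

The noise terms contribute the same additive constant to the log-likelihood at every grid point, so the constant cancels when the posterior is normalized.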
This translation is possible because in every finite dataset, the realizations of the white noise terms do not change the relative likelihoods of data under different parameters $(\mu_1,\mu_2)$, hence they do not affect the agent's posterior belief over fundamentals.

I now formally define this noise-added observation structure that replaces censored $X_2$'s with white noise. Let $\mathbb{P}_Z^\infty$ be the measure on $(\mathbb{R}^\infty)^K$ corresponding to the product of $K$ i.i.d. sequences of $\mathcal{N}(0,1)$ random variables. Consider the expanded probability space $(\bar\Omega, \bar{\mathcal{A}}, \bar{\mathbb{P}})$ where $\bar\Omega = \Omega \times (\mathbb{R}^\infty)^K$, $\bar{\mathcal{A}}$ is the product $\sigma$-algebra on $\bar\Omega$ where $(\mathbb{R}^\infty)^K$ is endowed with the usual product Borel $\sigma$-algebra, and $\bar{\mathbb{P}}$ is the product measure $\mathbb{P} \otimes \mathbb{P}_Z^\infty$ on $\bar\Omega$. To interpret, each element $\bar\omega = (\omega, z) \in \bar\Omega$ consists of the sample path of a sequence of potential draws $(X_n)_{n=1}^\infty$ as well as the sample path of a sequence of white noise realizations $(Z_n)_{n=1}^\infty$, where each $Z_n$ is an $\mathbb{R}^K$-valued random variable.

On the expanded probability space, we can define two kinds of observations. The history dataset of size $N$ is $(H_n(\bar\omega))_{n\le N} = (H_n(\omega_n))_{n\le N}$, as the $K$ round-$n$ histories only depend on $\omega_n$ (and not on the white noise $z_n$). The noise-added dataset of size $N$ is $(X_{1,n}(\omega), Y_n(\omega, z))_{n\le N} = (X_{1,n}(\omega_n), Y_n(\omega_n, z_n))_{n\le N}$. Write $\tilde m^H_N$ for the posterior density from the history dataset of size $N$, and $\tilde m^{XY}_N$ for the posterior density from the noise-added dataset of size $N$.

The next lemma formalizes the idea that replacing censored observations with white noise does not affect posterior beliefs.

Lemma OA.11. For every $\bar\omega \in \bar\Omega$ and $N \in \mathbb{N}$, $\tilde m^H_N(\bar\omega) = \tilde m^{XY}_N(\bar\omega)$.

Proof. Suppose $\bar\omega = \big((x_{1,n}, x_{2,n})_{n=1}^\infty, (z_n)_{n=1}^\infty\big) \in \Omega \times (\mathbb{R}^\infty)^K$. The noise-added dataset of size $N$ is $(x_{1,n}, y_n)_{n=1}^N$, where $y^{(k)}_n = z^{(k)}_n$ for each $n, k$ where $x^{(k)}_{1,n} \ge c^{(k)}_\dagger$, and $y^{(k)}_n = x^{(k)}_{2,n}$ for each $n, k$ where $x^{(k)}_{1,n} < c^{(k)}_\dagger$.
The history dataset of size $N$ is $(h_n)_{n=1}^N$, where $h^{(k)}_n = (x^{(k)}_{1,n}, \varnothing)$ for each $n, k$ where $x^{(k)}_{1,n} \ge c^{(k)}_\dagger$, and $h^{(k)}_n = (x^{(k)}_{1,n}, x^{(k)}_{2,n})$ for each $n, k$ where $x^{(k)}_{1,n} < c^{(k)}_\dagger$.

The likelihood of the noise-added dataset under parameters $\mu_1, \mu_2$ is:

$$
\prod_{k=1}^{K} \left[ \left( \prod_{n=1}^{N} \phi(x^{(k)}_{1,n}; \mu_1, \sigma^2) \right) \cdot \prod_{n:\, x^{(k)}_{1,n} < c^{(k)}_\dagger} \phi\big(y^{(k)}_n; \mu_2 - \gamma(x^{(k)}_{1,n} - \mu_1), \sigma^2\big) \cdot \prod_{n:\, x^{(k)}_{1,n} \ge c^{(k)}_\dagger} \phi(y^{(k)}_n; 0, 1) \right].
$$

The likelihood of the history dataset under parameters $\mu_1, \mu_2$ is:

$$
\prod_{k=1}^{K} \left[ \left( \prod_{n=1}^{N} \phi(x^{(k)}_{1,n}; \mu_1, \sigma^2) \right) \cdot \prod_{n:\, x^{(k)}_{1,n} < c^{(k)}_\dagger} \phi\big(x^{(k)}_{2,n}; \mu_2 - \gamma(x^{(k)}_{1,n} - \mu_1), \sigma^2\big) \right].
$$

Since $y^{(k)}_n = x^{(k)}_{2,n}$ whenever $x^{(k)}_{1,n} < c^{(k)}_\dagger$, the two likelihoods differ only by the factor $\prod_{k=1}^{K} \Big( \prod_{n:\, x^{(k)}_{1,n} \ge c^{(k)}_\dagger} \phi(y^{(k)}_n; 0, 1) \Big)$, which is common across all parameters $(\mu_1,\mu_2)$. So the posterior likelihood of parameters $\mu_1, \mu_2$ must be the same under both $\tilde m^H_N$ and $\tilde m^{XY}_N$, that is $\tilde m^H_N(\bar\omega) = \tilde m^{XY}_N(\bar\omega)$.

On the expanded probability space, inference from the history dataset and inference from the noise-added dataset give the same posterior beliefs everywhere. If $\tilde m^{XY}_N$ converges in $L^1$ to the dogmatic belief on $(\mu_1^\bullet, \mu_2^*(c_\dagger))$ $\bar{\mathbb{P}}$-a.s., then $\tilde m^H_N$ also converges in $L^1$ to the same belief $\bar{\mathbb{P}}$-a.s. Further, by the relationship between the expanded probability space and the original probability space, this would also show that $\tilde m_N$ converges in $L^1$ to the dogmatic belief on $(\mu_1^\bullet, \mu_2^*(c_\dagger))$ $\mathbb{P}$-a.s., which proves Proposition OA.3. Therefore, to prove Proposition OA.3 one just needs the following on the expanded probability space.

Lemma OA.12. $\tilde m^{XY}_N$ converges in $L^1$ to the dogmatic belief on $(\mu_1^\bullet, \mu_2^*(c_\dagger))$ $\bar{\mathbb{P}}$-a.s.

Proof. First, I write down the KL divergence objective in the noise-added observation structure and show its minimizers are exactly the large-generations pseudo-true fundamentals. Each observation $(X^{(k)}_{1,n}, Y^{(k)}_n)_{k=1}^K$ is an element of $\mathbb{R}^{2K}$, whose distribution is given by $K$ densities over $K$ copies of $\mathbb{R}^2$.
For $1 \le k \le K$, the $k$-th such density is

$$
f^{\bullet,(k)}(x, y) = \begin{cases} \phi(x; \mu_1^\bullet, \sigma^2) \cdot \phi(y; \mu_2^\bullet, \sigma^2) & \text{if } x < c^{(k)}_\dagger, \\ \phi(x; \mu_1^\bullet, \sigma^2) \cdot \phi(y; 0, 1) & \text{if } x \ge c^{(k)}_\dagger. \end{cases}
$$

Under the fundamentals $(\hat\mu_1, \hat\mu_2) \in \mathbb{R}^2$, the agent thinks the observations are distributed according to the product of $K$ densities where the $k$-th density is

$$
f^{(k)}_{\hat\mu_1,\hat\mu_2}(x, y) = \begin{cases} \phi(x; \hat\mu_1, \sigma^2) \cdot \phi(y; \hat\mu_2 - \gamma \cdot (x - \hat\mu_1), \sigma^2) & \text{if } x < c^{(k)}_\dagger, \\ \phi(x; \hat\mu_1, \sigma^2) \cdot \phi(y; 0, 1) & \text{if } x \ge c^{(k)}_\dagger. \end{cases}
$$

The log likelihood ratio of an observation $(x, y) = (x^{(k)}_1, y^{(k)})_{k=1}^K \in \mathbb{R}^{2K}$ is

$$
\ln \left[ \prod_{k=1}^{K} \frac{f^{\bullet,(k)}(x^{(k)}_1, y^{(k)})}{f^{(k)}_{\hat\mu_1,\hat\mu_2}(x^{(k)}_1, y^{(k)})} \right] = \sum_{k=1}^{K} \ln \left[ \frac{f^{\bullet,(k)}(x^{(k)}_1, y^{(k)})}{f^{(k)}_{\hat\mu_1,\hat\mu_2}(x^{(k)}_1, y^{(k)})} \right].
$$

So the KL divergence is defined as

$$
\begin{aligned}
&\int_{\mathbb{R}^{2K}} \left( \sum_{k=1}^{K} \ln \left[ \frac{f^{\bullet,(k)}(x^{(k)}_1, y^{(k)})}{f^{(k)}_{\hat\mu_1,\hat\mu_2}(x^{(k)}_1, y^{(k)})} \right] \right) \cdot \left( \prod_{k=1}^{K} f^{\bullet,(k)}(x^{(k)}_1, y^{(k)}) \right) d(x, y) \\
&\qquad = \sum_{k=1}^{K} \int_{\mathbb{R}^{2K}} \ln \left[ \frac{f^{\bullet,(k)}(x^{(k)}_1, y^{(k)})}{f^{(k)}_{\hat\mu_1,\hat\mu_2}(x^{(k)}_1, y^{(k)})} \right] \cdot \prod_{j=1}^{K} f^{\bullet,(j)}(x^{(j)}_1, y^{(j)})\, d(x, y).
\end{aligned}
$$

For each $k$, the integrand $\ln \big[ f^{\bullet,(k)}(x^{(k)}_1, y^{(k)}) / f^{(k)}_{\hat\mu_1,\hat\mu_2}(x^{(k)}_1, y^{(k)}) \big]$ only depends on $(x, y) \in \mathbb{R}^{2K}$ through two of its coordinates, $x^{(k)}_1$ and $y^{(k)}$. In addition, the density $\prod_{j=1}^{K} f^{\bullet,(j)}(x^{(j)}_1, y^{(j)})$ is a product density, so in fact the $k$-th summand is just

$$
\int_{\mathbb{R}^2} \ln \left[ \frac{f^{\bullet,(k)}(x^{(k)}_1, y^{(k)})}{f^{(k)}_{\hat\mu_1,\hat\mu_2}(x^{(k)}_1, y^{(k)})} \right] \cdot f^{\bullet,(k)}(x^{(k)}_1, y^{(k)})\, d(x^{(k)}_1, y^{(k)}).
$$

This expression is, up to a constant not depending on $\hat\mu_1, \hat\mu_2$ (due to the white noise term), equal to the KL divergence between $\mathcal{H}^\bullet(c^{(k)}_\dagger)$ and $\mathcal{H}(\Psi(\hat\mu_1, \hat\mu_2; \gamma); c^{(k)}_\dagger)$.
Therefore the overall KL divergence is off by a constant from

$$
\sum_{k=1}^{K} D_{KL}\Big( \mathcal{H}^\bullet(c^{(k)}_\dagger) \,\Big\|\, \mathcal{H}(\Psi(\hat\mu_1, \hat\mu_2; \gamma); c^{(k)}_\dagger) \Big),
$$

the objective defining large-generations pseudo-true fundamentals in Equation (1).

To finish the proof, Bunke and Milhaud (1998) show that provided the true density $f^\bullet$ and the family of subjective densities $\{ f_{\hat\mu_1,\hat\mu_2} : \hat\mu_1, \hat\mu_2 \in \mathbb{R} \}$ satisfy a number of conditions, then $\tilde m^{XY}_N$ $\bar{\mathbb{P}}$-a.s. converges to its KL-divergence minimizers in $L^1$, which I have shown to be exactly $(\mu_1^\bullet, \mu_2^*(c_\dagger))$. I now check the conditions of Bunke and Milhaud (1998) for the case of $K = 1$, so both $f^\bullet$ and $f_{\hat\mu_1,\hat\mu_2}$ are densities on $\mathbb{R}^2$. Checking the conditions for larger $K$ is exactly analogous, because both $f^\bullet$ and $f_{\hat\mu_1,\hat\mu_2}$ can be separated as the product of $K$ densities on $\mathbb{R}^2$.

From the hypothesis on $m$'s magnitude being bounded, there is some $B < \infty$ so that $0 < m(\mu_1,\mu_2) < B$ for all $\mu_1, \mu_2 \in \mathbb{R}$. The parameter space is $\Theta = \mathbb{R}^2$. The data-generating density of observation $(x, y)$ is:

$$
f^\bullet(x, y) = \begin{cases} \phi(x; \mu_1^\bullet, \sigma^2) \cdot \phi(y; \mu_2^\bullet, \sigma^2) & \text{if } x < c_\dagger, \\ \phi(x; \mu_1^\bullet, \sigma^2) \cdot \phi(y; 0, 1) & \text{if } x \ge c_\dagger, \end{cases}
$$

where $\phi(\cdot; \mu, \sigma^2)$ is the Gaussian density with mean $\mu$ and variance $\sigma^2$. Under the feasible model $\Psi(\hat\mu_1, \hat\mu_2; \gamma)$, the same observation has density:

$$
f_{\hat\mu_1,\hat\mu_2}(x, y) = \begin{cases} \phi(x; \hat\mu_1, \sigma^2) \cdot \phi(y; \hat\mu_2 - \gamma \cdot (x - \hat\mu_1), \sigma^2) & \text{if } x < c_\dagger, \\ \phi(x; \hat\mu_1, \sigma^2) \cdot \phi(y; 0, 1) & \text{if } x \ge c_\dagger. \end{cases}
$$

A1. Parameter space is a closed, convex set in $\mathbb{R}^2$ with nonempty interior. The density $f_{\hat\mu_1,\hat\mu_2}(x, y)$ is bounded over $(\hat\mu_1, \hat\mu_2, x, y)$ and its carrier, $\{ (x, y) : f_{\hat\mu_1,\hat\mu_2}(x, y) > 0 \}$, is the same for all $\hat\mu_1, \hat\mu_2$.

Evidently $\mathbb{R}^2$ is closed in itself. The density $f_{\hat\mu_1,\hat\mu_2}(x, y)$ is bounded by the product of the modes of Gaussian densities with variance $\sigma^2$ and variance 1. The density $f_{\hat\mu_1,\hat\mu_2}(x, y)$ is strictly positive on $\mathbb{R}^2$ for any parameter values $\hat\mu_1, \hat\mu_2$.

A2.
For all $\hat\mu_1, \hat\mu_2$, there is a sphere $S[(\hat\mu_1, \hat\mu_2), \eta]$ of center $(\hat\mu_1, \hat\mu_2)$ and radius $\eta > 0$ so that

$$
\mathbb{E}_{f^\bullet} \left[ \sup_{(\mu_1,\mu_2) \in S[(\hat\mu_1,\hat\mu_2),\eta]} \left| \ln \frac{f^\bullet(X, Y)}{f_{\mu_1,\mu_2}(X, Y)} \right| \right] < \infty.
$$

Pick say $\eta = 1$. Consider the rectangle $R[(\hat\mu_1, \hat\mu_2), \eta]$ consisting of points $(\mu_1, \mu_2)$ such that $|\mu_1 - \hat\mu_1| < \eta$ and $|\mu_2 - \hat\mu_2| < \eta$. Since the Gaussian distribution is single-peaked, for any $(x, y) \in \mathbb{R}^2$ the absolute value of the log likelihood ratio $\left| \ln \frac{f^\bullet(X,Y)}{f_{\mu_1,\mu_2}(X,Y)} \right|$ on all of $R[(\hat\mu_1, \hat\mu_2), \eta]$ must be bounded by its value at the 4 corners. That is to say,

$$
\begin{aligned}
\sup_{(\mu_1,\mu_2) \in S[(\hat\mu_1,\hat\mu_2),\eta]} \left| \ln \frac{f^\bullet(X, Y)}{f_{\mu_1,\mu_2}(X, Y)} \right|
&\le \sup_{(\mu_1,\mu_2) \in R[(\hat\mu_1,\hat\mu_2),\eta]} \left| \ln \frac{f^\bullet(X, Y)}{f_{\mu_1,\mu_2}(X, Y)} \right| \\
&\le \left| \ln \frac{f^\bullet(X, Y)}{f_{\hat\mu_1-\eta,\hat\mu_2-\eta}(X, Y)} \right| + \left| \ln \frac{f^\bullet(X, Y)}{f_{\hat\mu_1-\eta,\hat\mu_2+\eta}(X, Y)} \right| \\
&\quad + \left| \ln \frac{f^\bullet(X, Y)}{f_{\hat\mu_1+\eta,\hat\mu_2-\eta}(X, Y)} \right| + \left| \ln \frac{f^\bullet(X, Y)}{f_{\hat\mu_1+\eta,\hat\mu_2+\eta}(X, Y)} \right|.
\end{aligned}
$$

It is easy to see that for any fixed parameter, $\mathbb{E}_{f^\bullet} \left[ \left| \ln \frac{f^\bullet(X,Y)}{f_{\mu_1,\mu_2}(X,Y)} \right| \right]$ is finite, so the sum of these 4 terms gives a finite upper bound.

A3.
For all fixed $(x, y) \in \mathbb{R}^2$, the map from parameters to density $(\mu_1,\mu_2) \mapsto f_{\mu_1,\mu_2}(x, y)$ has continuous derivatives with respect to parameters, $(\mu_1,\mu_2) \mapsto \frac{\partial f}{\partial \mu_1}(x, y; \mu_1, \mu_2)$ and $(\mu_1,\mu_2) \mapsto \frac{\partial f}{\partial \mu_2}(x, y; \mu_1, \mu_2)$. There exist positive constants $\kappa$ and $b$ with

$$
\int\!\!\int \left\| [f_{\mu_1,\mu_2}(x, y)]^{-1} \cdot \begin{pmatrix} \frac{\partial f}{\partial \mu_1}(x, y; \mu_1, \mu_2) \\ \frac{\partial f}{\partial \mu_2}(x, y; \mu_1, \mu_2) \end{pmatrix} \right\|^{12} \cdot f_{\mu_1,\mu_2}(x, y)\, dy\, dx < \kappa \big( 1 + \|(\mu_1,\mu_2)\|^{b} \big)
$$

satisfied for every $(\mu_1,\mu_2) \in \mathbb{R}^2$, where $\|\cdot\|$ is a norm on $\mathbb{R}^2$.

Let's choose the max norm, $\|v\| = \max(|v_1|, |v_2|)$. For uncensored data $(x, y)$ with $x < c_\dagger$, we can compute

$$
\frac{\partial f}{\partial \mu_1}(x, y; \mu_1, \mu_2) = f_{\mu_1,\mu_2}(x, y) \cdot \left[ \frac{1+\gamma^2}{\sigma^2} \cdot (x - \mu_1) + \frac{\gamma}{\sigma^2} \cdot (y - \mu_2) \right]
$$

and

$$
\frac{\partial f}{\partial \mu_2}(x, y; \mu_1, \mu_2) = f_{\mu_1,\mu_2}(x, y) \cdot \left[ \frac{\gamma}{\sigma^2} \cdot (x - \mu_1) + \frac{1}{\sigma^2} \cdot (y - \mu_2) \right].
$$

For censored data $(x, y)$ where $x \ge c_\dagger$, the likelihood of the data is unchanged by parameter $\mu_2$, since it changes neither the distribution of the early draw nor the distribution of the white noise term, meaning $\frac{\partial f}{\partial \mu_2}(x, y; \mu_1, \mu_2) = 0$. Also, for the censored case $\frac{\partial f}{\partial \mu_1}(x, y; \mu_1, \mu_2) = f_{\mu_1,\mu_2}(x, y) \cdot \frac{1}{\sigma^2}(x - \mu_1)$. Under the max norm, this means the integral to be bounded is:

$$
\int_{x=-\infty}^{x=c_\dagger} \int_{-\infty}^{\infty} \left\| \begin{pmatrix} \frac{1+\gamma^2}{\sigma^2} (x - \mu_1) + \frac{\gamma}{\sigma^2} (y - \mu_2) \\ \frac{\gamma}{\sigma^2} (x - \mu_1) + \frac{1}{\sigma^2} (y - \mu_2) \end{pmatrix} \right\|^{12} \cdot f_{\mu_1,\mu_2}(x, y)\, dy\, dx + \int_{x=c_\dagger}^{x=\infty} \left[ \int_{-\infty}^{\infty} \Big( \frac{1}{\sigma^2}(x - \mu_1) \Big)^{12} \cdot f_{\mu_1,\mu_2}(x, y)\, dy \right] dx.
$$

Since the inner integrals are non-negative, this expression is smaller than the version where the domains of the outer integrals are expanded and the densities $f_{\mu_1,\mu_2}(x, y)$ are simply replaced with the joint density on $\mathbb{R}^2$ of the feasible model $\Psi(\mu_1,\mu_2;\gamma)$, which I denote as $\tilde f_{\mu_1,\mu_2}(x, y):$
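$$
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \left\| \begin{pmatrix} \frac{1+\gamma^2}{\sigma^2} (x - \mu_1) + \frac{\gamma}{\sigma^2} (y - \mu_2) \\ \frac{\gamma}{\sigma^2} (x - \mu_1) + \frac{1}{\sigma^2} (y - \mu_2) \end{pmatrix} \right\|^{12} \cdot \tilde f_{\mu_1,\mu_2}(x, y)\, dy\, dx + \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} \Big( \frac{1}{\sigma^2}(x - \mu_1) \Big)^{12} \cdot \tilde f_{\mu_1,\mu_2}(x, y)\, dy \right] dx.
$$

The second summand is a 12th moment of the joint normal random variable with distribution $\Psi(\mu_1,\mu_2;\gamma)$, so for all $\mu_1, \mu_2$ it is given by some 12th order polynomial $P_2(\mu_1,\mu_2)$. Similarly the first summand is also given by a 12th order polynomial $P_1(\mu_1,\mu_2)$. Therefore by choosing $b = 12$ and choosing $\kappa$ appropriately according to the coefficients in $P_1$ and $P_2$, we achieve the desired bound.

A4. For some positive constants $b$ and $\kappa$, the affinity function

$$
A(\mu_1,\mu_2) := \int\!\!\int [f_{\mu_1,\mu_2}(x, y) \cdot f^\bullet(x, y)]^{1/2}\, dy\, dx
$$

satisfies $A(\mu_1,\mu_2) < \kappa \cdot \|(\mu_1,\mu_2)\|^{-b}$ for all $\mu_1, \mu_2$.

We have $A(\mu_1,\mu_2) \le \big[ \int\!\!\int [f_{\mu_1,\mu_2}(x, y) \cdot f^\bullet(x, y)]\, dy\, dx \big]^{1/2}$, so it's sufficient to find some $\kappa, b$ that work to bound $\int\!\!\int [f_{\mu_1,\mu_2}(x, y) \cdot f^\bullet(x, y)]\, dy\, dx$. We have:

$$
\begin{aligned}
\int\!\!\int [f_{\mu_1,\mu_2}(x, y) \cdot f^\bullet(x, y)]\, dy\, dx
&= \int_{x=-\infty}^{c_\dagger} \int_{-\infty}^{\infty} \phi(x; \mu_1, \sigma^2) \cdot \phi(x; \mu_1^\bullet, \sigma^2) \cdot \phi(y; \mu_2 - \gamma(x - \mu_1), \sigma^2) \cdot \phi(y; \mu_2^\bullet, \sigma^2)\, dy\, dx \\
&\quad + \int_{x=c_\dagger}^{\infty} \int_{-\infty}^{\infty} \phi(x; \mu_1, \sigma^2) \cdot \phi(x; \mu_1^\bullet, \sigma^2) \cdot \phi(y; 0, 1) \cdot \phi(y; 0, 1)\, dy\, dx \\
&\le \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \phi(x; \mu_1, \sigma^2) \cdot \phi(x; \mu_1^\bullet, \sigma^2) \cdot \phi(y; \mu_2 - \gamma(x - \mu_1), \sigma^2) \cdot \phi(y; \mu_2^\bullet, \sigma^2)\, dy\, dx \\
&\quad + \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \phi(x; \mu_1, \sigma^2) \cdot \phi(x; \mu_1^\bullet, \sigma^2) \cdot \phi(y; 0, 1) \cdot \phi(y; 0, 1)\, dy\, dx.
\end{aligned}
$$

I show how to find $\kappa$ and $b$ to bound the first summand in the last expression above. It is easy to similarly bound the second summand. By Bromiley (2018), the product of Gaussian densities $\phi(y; \mu_2 - \gamma(x - \mu_1), \sigma^2) \cdot \phi(y; \mu_2^\bullet, \sigma^2)$ is itself a Gaussian density in $y$, $\tilde\phi(y)$, multiplied by a scaling factor equal to $(4\pi\sigma^2)^{-1/2} \cdot \exp\Big( -\frac{\gamma^2}{4\sigma^2} \cdot \big[ x - \big( \frac{\mu_2 - \mu_2^\bullet}{\gamma} + \mu_1 \big) \big]^2 \Big)$.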
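The product-of-Gaussians step invoked here can be sanity-checked numerically. The sketch below assumes the standard equal-variance identity $\phi(y; a, \sigma^2)\,\phi(y; b, \sigma^2) = (4\pi\sigma^2)^{-1/2} \exp\big(-(a-b)^2/(4\sigma^2)\big) \cdot \phi\big(y; \tfrac{a+b}{2}, \tfrac{\sigma^2}{2}\big)$ and then substitutes $a = \mu_2 - \gamma(x - \mu_1)$, $b = \mu_2^\bullet$. All numerical values and function names are hypothetical and serve only as an illustration.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def product_equal_variance(y, a, b, var):
    """phi(y; a, var) * phi(y; b, var), rewritten as scale * phi(y; (a + b)/2, var/2)."""
    scale = math.exp(-(a - b) ** 2 / (4.0 * var)) / math.sqrt(4.0 * math.pi * var)
    return scale * normal_pdf(y, 0.5 * (a + b), 0.5 * var)


# Hypothetical parameter values for illustration only.
mu1, mu2, mu2_true, gamma, var, x = 0.4, -0.3, 0.2, 0.5, 1.5, 0.9
a = mu2 - gamma * (x - mu1)  # subjective conditional mean of the second draw
b = mu2_true                 # true mean of the second draw

# The scaling factor written directly, and rewritten as a function of x.
scale_direct = math.exp(-(a - b) ** 2 / (4.0 * var)) / math.sqrt(4.0 * math.pi * var)
center = (mu2 - mu2_true) / gamma + mu1
scale_in_x = math.exp(-gamma ** 2 / (4.0 * var) * (x - center) ** 2) / math.sqrt(4.0 * math.pi * var)

gaps = [
    abs(normal_pdf(y, a, var) * normal_pdf(y, b, var) - product_equal_variance(y, a, b, var))
    for y in (-2.0, -0.5, 0.0, 0.7, 3.1)
]
print(max(gaps) < 1e-12, abs(scale_direct - scale_in_x) < 1e-12)
```

Because the scaling factor is itself a Gaussian kernel in $x$, integrating out $y$ leaves an integral over $x$ of a product of Gaussian kernels, which is what the remainder of the argument exploits.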
So we have Z ∞−∞ Z ∞−∞ φ ( x ; µ , σ ) · φ ( x ; µ • , σ ) · φ ( y ; µ − γ ( x − µ ) , σ ) · φ ( y ; µ • , σ ) dydx = Z ∞−∞ φ ( x ; µ , σ ) · φ ( x ; µ • , σ ) · (cid:16) πσ (cid:17) − / · exp − γ σ · [ x − ( µ − µ • γ + µ γ )] ! · Z ∞−∞ · ˜ φ ( y ) dydx = (cid:16) πσ (cid:17) − / · Z ∞−∞ φ ( x ; µ , σ ) · φ ( x ; µ • , σ ) · exp − γ σ · [ x − ( µ − µ • γ + µ γ )] ! · dx. Again applying Bromiley (2018), product of the two Gaussian densities φ ( x ; µ , σ ) · φ ( x ; µ • , σ )is another Gaussian density with mean µ • + µ , variance σ , and multiplied to a scaling factorof (4 πσ ) − / exp (cid:16) − ( µ − µ • ) σ (cid:17) . So above expression is: K · exp − ( µ − µ • ) σ ! · Z ∞−∞ φ ( x ; µ • + µ , σ · exp − γ σ · [ x − ( µ − µ • γ + µ γ )] ! dx where K is a constant not dependent on µ , µ . Also, we may writeexp − γ σ · [ x − ( µ − µ • γ + µ γ )] ! = K · φ ( x ; ( µ − µ • γ + µ γ ) , σ B )where σ B = σ γ and K = (2 πσ B ) / . Applying Bromiley (2018) one final time, the product φ ( x ; µ • + µ , σ ) · φ ( x ; ( µ − µ • γ + µ γ ) , σ B ) is a Gaussian density in x scaled by K exp( − K · ( µ • − µ − µ − µ • γ ) ) where K , K > µ , µ . So altogether,the second summand we are bounding is a constant multiple of exp (cid:16) − ( µ − µ • ) σ (cid:17) · exp( − K · ( µ • − µ − µ − µ • γ ) ). For | µ | ≥ | µ | , the max norm || ( µ , µ ) || = | µ | and exp (cid:16) − ( µ − µ • ) σ (cid:17) | µ | < | µ | , and | µ | − | µ • | + | µ • | γ > − K · ( µ • − µ − µ − µ • γ ) ) ≤ exp( − K · ( | µ | − | µ • | | µ • | γ ) ) . So for large enough | µ | , exp( − K · ( µ • − µ − µ − µ • γ ) ) will decrease exponentially fast in thenorm. These two facts imply that there is some K > || ( µ , µ ) || > K , Z ∞−∞ Z ∞−∞ φ ( x ; µ , σ ) · φ ( x ; µ • , σ ) · φ ( y ; µ − γ ( x − µ ) , σ ) · φ ( y ; µ • , σ ) dydx < || ( µ , µ ) || − . 
Now put κ = K − and we can ensure for any value of || ( µ , µ ) || we will have Z ∞−∞ Z ∞−∞ φ ( x ; µ , σ ) · φ ( x ; µ • , σ ) · φ ( y ; µ − γ ( x − µ ) , σ ) · φ ( y ; µ • , σ ) dydx < κ ·|| ( µ , µ ) || − . A5 . There are positive constants b , b so that for all ( µ , µ ) and r > m ( S [( µ , µ ) , r ]) ≤ cr b (1 + ( || ( µ , µ ) || + r ) b ). Moreover, g assigns positive mass to everysphere with positive radius.Since we have assumed that density g is bounded by B , the prior mass assigned to thesphere S [( µ , µ ) , r ] is bounded by B times its Euclidean volume. So, take b = 2 and c = πB and the first statement is satisfied. Since we have assumed that m is strictlypositive everywhere, the second statement is satisfied. OA 4.3 Behavior after Observing Large Samples Next, I turn to the convergence of expected payoffs for different cutoff strategies as samplesize grows large. For any c ∈ R and N ∈ N , let U N ( c ) := E ( µ ,µ ) ∼ ˜ m N [ U ( c ; µ , µ , γ )] where U ( c ; µ , µ , γ ) is the expected payoff of using the stopping strategy S c when ( X , X ) ∼ Ψ( µ , µ ; γ ) . Note that U N ( c ) is a real-valued random variable representing the agent’s sub-jective expected payoff for the stopping strategy S c , under the (random) non-degenerateposterior belief after observing a sample of size N . Proposition OA.4 shows that U N ( c )converges almost surely to the subjective expected payoff of S c with a dogmatic belief inthe pseudo-true fundamentals, provided the payoff functions u , u of the optimal-stoppingproblem are Lipschitz continuous. Furthermore, this convergence is uniform across all cutoffthresholds. Proposition OA.4. Suppose there are constants K , K > so that | u ( x ) − u ( x ) | Suppose there are constants K , K > so that | u ( x ) − u ( x ) | < K ·| x − x | and | u ( x , x ) − u ( x , x ) | < K · ( | x − x | + | x − x | ) for all x , x , x , x ∈ R . 
For each center $(\mu_1^\circ, \mu_2^\circ) \in \mathbb{R}^2$, there corresponds a constant $K > 0$ so that for any $\mu_1, \mu_2 \in \mathbb{R}$ and any $c \in \mathbb{R}$, $|U(c; \mu_1, \mu_2) - U(c; \mu_1^\circ, \mu_2^\circ)| < K \cdot (|\mu_1 - \mu_1^\circ| + |\mu_2 - \mu_2^\circ|)$.

Proof. Let $(\mu_1^\circ, \mu_2^\circ) \in \mathbb{R}^2$ be given. For any $\mu_1, \mu_2, c \in \mathbb{R}$, we have

$$U(c; \mu_1, \mu_2) = \int_c^\infty u_1(x_1)\,\phi(x_1; \mu_1, \sigma^2)\,dx_1 + \int_{-\infty}^c \left[\int_{-\infty}^\infty u_2(x_1, x_2)\,\phi(x_2; \mu_2 - \gamma(x_1 - \mu_1), \sigma^2)\,dx_2\right] \phi(x_1; \mu_1, \sigma^2)\,dx_1.$$

We first bound $\left|\int_c^\infty u_1(x_1)\,\phi(x_1; \mu_1, \sigma^2)\,dx_1 - \int_c^\infty u_1(x_1)\,\phi(x_1; \mu_1^\circ, \sigma^2)\,dx_1\right|$ by a multiple of $|\mu_1 - \mu_1^\circ|$. Suppose first $\mu_1 = \mu_1^\circ + \Delta$ for some $\Delta > 0$. We have

$$\int_c^\infty u_1(x_1)\,\phi(x_1; \mu_1, \sigma^2)\,dx_1 = \int_{c-\Delta}^\infty u_1(x_1 + \Delta)\,\phi(x_1; \mu_1^\circ, \sigma^2)\,dx_1.$$

By Lipschitz continuity of $u_1$, $|u_1(x_1) - u_1(x_1 + \Delta)| \le K_1 \Delta$ for all $x_1 \in \mathbb{R}$. Thus we conclude

$$\left|\int_c^\infty u_1(x_1)\,\phi(x_1; \mu_1, \sigma^2)\,dx_1 - \int_c^\infty u_1(x_1)\,\phi(x_1; \mu_1^\circ, \sigma^2)\,dx_1\right| \le K_1 \Delta + \left|\int_{c-\Delta}^c u_1(x_1)\,\phi(x_1; \mu_1^\circ, \sigma^2)\,dx_1\right|.$$

Again by Lipschitz continuity of $u_1$, for any $x_1 \in \mathbb{R}$,

$$|u_1(x_1)\,\phi(x_1; \mu_1^\circ, \sigma^2)| \le (|u_1(0)| + K_1 |x_1|) \cdot \phi(x_1; \mu_1^\circ, \sigma^2).$$

Since the Gaussian density decreases to 0 exponentially fast as $x_1 \to \pm\infty$, the RHS is uniformly bounded for all $x_1 \in \mathbb{R}$ by some constant, say $J_1 > 0$. (Note that the RHS is not a function of $c$, so $J_1$ does not depend on $c$.) This shows that

$$\left|\int_{c-\Delta}^c u_1(x_1)\,\phi(x_1; \mu_1^\circ, \sigma^2)\,dx_1\right| \le \int_{c-\Delta}^c |u_1(x_1)\,\phi(x_1; \mu_1^\circ, \sigma^2)|\,dx_1 \le \int_{c-\Delta}^c J_1\,dx_1 = J_1 \Delta.$$

So altogether,

$$\left|\int_c^\infty u_1(x_1)\,\phi(x_1; \mu_1, \sigma^2)\,dx_1 - \int_c^\infty u_1(x_1)\,\phi(x_1; \mu_1^\circ, \sigma^2)\,dx_1\right| \le (K_1 + J_1)\Delta.$$
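If instead $\mu_1 = \mu_1^\circ - \Delta$, then a similar argument shows that

$$\left|\int_c^\infty u_1(x_1)\,\phi(x_1; \mu_1, \sigma^2)\,dx_1 - \int_c^\infty u_1(x_1)\,\phi(x_1; \mu_1^\circ, \sigma^2)\,dx_1\right| \le K_1 \Delta + \left|\int_c^{c+\Delta} u_1(x_1)\,\phi(x_1; \mu_1^\circ, \sigma^2)\,dx_1\right|,$$

and again we may bound the second term by $J_1 \Delta$ as before.

We now turn to bounding the difference in the second summand making up $U(c; \mu_1, \mu_2)$. First consider the case where $\mu_2 = \mu_2^\circ$. For each $x_1 \in \mathbb{R}$, let

$$I(x_1; \mu_1) := \int_{-\infty}^\infty u_2(x_1, x_2)\,\phi(x_2; \mu_2^\circ - \gamma(x_1 - \mu_1), \sigma^2)\,dx_2,$$

the expected continuation utility after $X_1 = x_1$ in the feasible model $\Psi(\mu_1, \mu_2^\circ; \gamma)$. The second summand in $U(c; \mu_1, \mu_2^\circ)$ is given by $\int_{-\infty}^c I(x_1; \mu_1)\,\phi(x_1; \mu_1, \sigma^2)\,dx_1$. For $x_1' = x_1 + d_1$, $\mu_1' = \mu_1 + d_2$, we have

$$I(x_1'; \mu_1') = \int_{-\infty}^\infty u_2(x_1', x_2)\,\phi(x_2; \mu_2^\circ - \gamma(x_1' - \mu_1'), \sigma^2)\,dx_2 = \int_{-\infty}^\infty u_2(x_1 + d_1, x_2 - \gamma(d_1 - d_2))\,\phi(x_2; \mu_2^\circ - \gamma(x_1 - \mu_1), \sigma^2)\,dx_2.$$

Lipschitz continuity of $u_2$ implies that

$$|u_2(x_1 + d_1, x_2 - \gamma(d_1 - d_2)) - u_2(x_1, x_2)| \le K_2((1 + \gamma) \cdot |d_1| + \gamma |d_2|) \le K_2 (1 + \gamma) \cdot (|d_1| + |d_2|),$$

which shows $|I(x_1'; \mu_1') - I(x_1; \mu_1)| \le K_2 (1 + \gamma) \cdot (|x_1' - x_1| + |\mu_1' - \mu_1|)$. That is, $I$ is Lipschitz continuous.

Suppose $\mu_1 = \mu_1^\circ + \Delta$ for some $\Delta > 0$. Similar to the above argument bounding the first summand in $U(c; \mu_1, \mu_2)$, we have

$$\int_{-\infty}^c I(x_1; \mu_1)\,\phi(x_1; \mu_1, \sigma^2)\,dx_1 = \int_{-\infty}^{c-\Delta} I(x_1 + \Delta; \mu_1^\circ + \Delta)\,\phi(x_1; \mu_1^\circ, \sigma^2)\,dx_1.$$

By Lipschitz continuity of $I$, and because the shifts $d_1 = d_2 = \Delta$ cancel in the second argument of $u_2$, $|I(x_1; \mu_1^\circ) - I(x_1 + \Delta; \mu_1^\circ + \Delta)| \le K_2 (1 + \gamma)\Delta$ for all $x_1 \in \mathbb{R}$. Thus we conclude

$$\left|\int_{-\infty}^c I(x_1; \mu_1)\,\phi(x_1; \mu_1, \sigma^2)\,dx_1 - \int_{-\infty}^c I(x_1; \mu_1^\circ)\,\phi(x_1; \mu_1^\circ, \sigma^2)\,dx_1\right| \le K_2 (1 + \gamma)\Delta + \left|\int_{c-\Delta}^c I(x_1; \mu_1^\circ)\,\phi(x_1; \mu_1^\circ, \sigma^2)\,dx_1\right|.$$

Since $x_1 \mapsto I(x_1; \mu_1^\circ)$ is Lipschitz continuous, there exists $J_2 > 0$ so that $|I(x_1; \mu_1^\circ)\,\phi(x_1; \mu_1^\circ, \sigma^2)| \le J_2$ for all $x_1 \in \mathbb{R}$, which means $\left|\int_{c-\Delta}^c I(x_1; \mu_1^\circ)\,\phi(x_1; \mu_1^\circ, \sigma^2)\,dx_1\right| \le J_2 \Delta$. (Once again, $J_2$ does not depend on $c$.) The case of $\mu_1 = \mu_1^\circ - \Delta$ is symmetric and we have shown that

$$\left|\int_{-\infty}^c I(x_1; \mu_1)\,\phi(x_1; \mu_1, \sigma^2)\,dx_1 - \int_{-\infty}^c I(x_1; \mu_1^\circ)\,\phi(x_1; \mu_1^\circ, \sigma^2)\,dx_1\right| \le (2 K_2 (1 + \gamma) + J_2) \cdot |\mu_1 - \mu_1^\circ|.$$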
Finally, we investigate the difference in the second summand of $U(c; \mu_1, \mu_2)$ between parameters $(\mu_1, \mu_2^\circ)$ and $(\mu_1, \mu_2)$ for $\mu_1, \mu_2 \in \mathbb{R}$. This difference is bounded by

$$\int_{-\infty}^c \left|\int_{-\infty}^\infty u_2(x_1, x_2)\,\phi(x_2; \mu_2^\circ - \gamma(x_1 - \mu_1), \sigma^2)\,dx_2 - \int_{-\infty}^\infty u_2(x_1, x_2)\,\phi(x_2; \mu_2 - \gamma(x_1 - \mu_1), \sigma^2)\,dx_2\right| \phi(x_1; \mu_1, \sigma^2)\,dx_1. \tag{3}$$

But for every $x_1 \in \mathbb{R}$,

$$\int_{-\infty}^\infty u_2(x_1, x_2)\,\phi(x_2; \mu_2 - \gamma(x_1 - \mu_1), \sigma^2)\,dx_2 = \int_{-\infty}^\infty u_2(x_1, x_2 + (\mu_2 - \mu_2^\circ))\,\phi(x_2; \mu_2^\circ - \gamma(x_1 - \mu_1), \sigma^2)\,dx_2,$$

and $|u_2(x_1, x_2 + (\mu_2 - \mu_2^\circ)) - u_2(x_1, x_2)| \le K_2 |\mu_2 - \mu_2^\circ|$ by Lipschitz continuity of $u_2$. This shows that, for all values $\mu_1, \mu_2 \in \mathbb{R}$, (3) is bounded by $K_2 |\mu_2 - \mu_2^\circ|$.

Applying the triangle inequality to the second term, we conclude that

$$|U(c; \mu_1, \mu_2) - U(c; \mu_1^\circ, \mu_2^\circ)| \le (K_1 + J_1)|\mu_1 - \mu_1^\circ| + (2 K_2 (1 + \gamma) + J_2) \cdot |\mu_1 - \mu_1^\circ| + K_2 |\mu_2 - \mu_2^\circ|.$$

So we see that setting $K = K_1 + J_1 + 2 K_2 (1 + \gamma) + J_2$ establishes the lemma.

Now I prove Proposition OA.4.

Proof. Let $\mu_1^\circ = \mu_1^\bullet$, $\mu_2^\circ = \mu_2^*(c^\dagger)$. Lemma OA.13 implies there is a constant $K > 0$, independent of $c$, so that $|U(c; \mu_1, \mu_2) - U(c; \mu_1^\circ, \mu_2^\circ)| \le K \cdot (|\mu_1 - \mu_1^\circ| + |\mu_2 - \mu_2^\circ|)$ for all $\mu_1, \mu_2, c \in \mathbb{R}$. So for $\nu$ a joint distribution about the fundamentals $(\mu_1, \mu_2)$, we get

$$\left|E_{(\mu_1,\mu_2)\sim\nu}[U(c; \mu_1, \mu_2) - U(c; \mu_1^\circ, \mu_2^\circ)]\right| \le E_{(\mu_1,\mu_2)\sim\nu}\left[|U(c; \mu_1, \mu_2) - U(c; \mu_1^\circ, \mu_2^\circ)|\right] \le K \cdot E_{(\mu_1,\mu_2)\sim\nu}\left[|\mu_1 - \mu_1^\circ| + |\mu_2 - \mu_2^\circ|\right]$$

for every $c \in \mathbb{R}$, therefore we also get the uniform bound,

$$\sup_{c \in \mathbb{R}} \left|E_{(\mu_1,\mu_2)\sim\nu}[U(c; \mu_1, \mu_2)] - U(c; \mu_1^\circ, \mu_2^\circ)\right| \le K \cdot E_{(\mu_1,\mu_2)\sim\nu}\left[|\mu_1 - \mu_1^\circ| + |\mu_2 - \mu_2^\circ|\right].$$

By Proposition OA.3, almost surely

$$\lim_{T\to\infty} E_{(\mu_1,\mu_2)\sim\tilde{m}_T}\left[|\mu_1 - \mu_1^\circ| + |\mu_2 - \mu_2^\circ|\right] = 0.$$

Along any $\omega \in \Omega$ where the above limit holds,

$$\lim_{T\to\infty} \sup_{c \in \mathbb{R}} |U_T(c) - U(c; \mu_1^\circ, \mu_2^\circ)| \le \lim_{T\to\infty} K \cdot E_{(\mu_1,\mu_2)\sim\tilde{m}_T}\left[|\mu_1 - \mu_1^\circ| + |\mu_2 - \mu_2^\circ|\right] = 0.$$

This shows that $\mathbb{P}$-a.s., $U_T(c)$ converges to $U(c; \mu_1^*(c^\dagger), \mu_2^*(c^\dagger))$ uniformly across all $c$ as $T \to \infty$.
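The uniform bound in the proof above can be illustrated numerically. The following is a minimal sketch, not part of the original argument: it assumes the linear payoffs $u_1(x_1) = x_1$ and $u_2(x_1, x_2) = x_2$ (which are Lipschitz), for which $U(c; \mu_1, \mu_2)$ has a closed form in the Gaussian model, and it replaces the posterior with a hypothetical four-point belief whose expected distance from the center equals `spread`. The function names, the grid, and the parameter values are all illustrative assumptions.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def payoff(c, mu1, mu2, gamma=0.5, sigma=1.0):
    """U(c; mu1, mu2) under the ASSUMED linear payoffs u1(x1) = x1, u2(x1, x2) = x2.

    X1 ~ N(mu1, sigma^2) and X2 | X1 = x1 ~ N(mu2 - gamma * (x1 - mu1), sigma^2).
    """
    z = (c - mu1) / sigma
    return (mu1 * (1.0 - norm_cdf(z)) + mu2 * norm_cdf(z)
            + (1.0 + gamma) * sigma * norm_pdf(z))


def sup_gap(spread, center=(0.0, -0.3)):
    """Largest |E_nu[U(c; mu)] - U(c; center)| over a grid of cutoffs c.

    nu is a hypothetical belief putting mass 1/4 on each of four points
    at L1-distance `spread` from the center.
    """
    m1, m2 = center
    support = [(m1 + spread, m2), (m1 - spread, m2),
               (m1, m2 + spread), (m1, m2 - spread)]
    grid = [-6.0 + 0.05 * k for k in range(241)]
    gap = 0.0
    for c in grid:
        average = sum(payoff(c, a, b) for a, b in support) / len(support)
        gap = max(gap, abs(average - payoff(c, m1, m2)))
    return gap


spreads = (1.0, 0.1, 0.01)
gaps = [sup_gap(s) for s in spreads]
for s, g in zip(spreads, gaps):
    print(f"E|mu - center| = {s:5.2f}   sup over c of the payoff gap = {g:.6f}")
```

As the belief concentrates around the center, the worst-case gap across cutoffs shrinks, which is the mechanism behind the uniform convergence of $U_T(c)$. The sketch checks the bound only on a finite grid and for one payoff specification, so it is a sanity check rather than a verification of the lemma.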
To reach my main result on convergence of behavior, suppose the agent chooses a cutoff threshold after observing $N$ histories $(h_n)_{n \le N}$. The choices are given by the functions $\tilde{C}_N : \mathbb{H}^N \to \mathbb{R}$, so the cutoff after a sample of size $N$ is a random variable $C_N$ that depends on the first $N$ pairs of potential draws $(X_n)_{n \le N}$.

Definition OA.4. Cutoff choice functions $(\tilde{C}_N)$ are asymptotically myopic if

$$\limsup_{N\to\infty} \left(\sup_{c \in \mathbb{R}} U_N(c) - U_N(\tilde{C}_N)\right) = 0$$

almost surely.

A simple example is that $\tilde{C}_N$ chooses a cutoff whose expected payoff differs from $\sup_{c \in \mathbb{R}} U_N(c)$ by no more than $1/N$ after every sample of size $N$.

Proposition OA.5. Let $c^* = C(\mu_1^\bullet, \mu_2^*(c^\dagger); \gamma)$. Suppose cutoffs $C_N$ are generated using asymptotically myopic cutoff choice functions. Almost surely, $C_N \to c^*$ as $N \to \infty$.

The expected payoff of different cutoff strategies under the pseudo-true fundamentals, $c \mapsto U(c; \mu_1^\bullet, \mu_2^*(c^\dagger))$, is single peaked and maximized at $c^*$. Therefore cutoffs outside an open ball around $c^*$ have expected payoffs bounded away from the subjectively optimal payoff under the model $\Psi(\mu_1^\bullet, \mu_2^*(c^\dagger); \gamma)$.

Lemma OA.14. For each $\mu_1, \mu_2 \in \mathbb{R}$, let $c^*$ be the subjectively optimal cutoff threshold under $\Psi(\mu_1, \mu_2; \gamma)$. For every $\epsilon > 0$, there exists $\delta > 0$ so that whenever $|c - c^*| \ge \epsilon$, we have $U(c; \mu_1, \mu_2) \le U(c^*; \mu_1, \mu_2) - \delta$.

Proof. First, I show $c \mapsto U(c; \mu_1, \mu_2)$ is single peaked: it is strictly increasing up to $c = c^*$, then strictly decreasing afterwards. Recall the cutoff form of the best stopping strategy comes from the fact that $u_1(x_1) < E_{\Psi(\mu_1,\mu_2;\gamma)}[u_2(x_1, X_2) \mid X_1 = x_1]$ for $x_1 < c^*$, but $u_1(x_1) > E_{\Psi(\mu_1,\mu_2;\gamma)}[u_2(x_1, X_2) \mid X_1 = x_1]$ for $x_1 > c^*$.
For two cutoffs $c < c' < c^*$, the two stopping strategies $S_c, S_{c'}$ only differ in how they treat first-period draws in the interval $[c, c']$, so we can write the difference in their expected payoffs, $U(c'; \mu_1, \mu_2) - U(c; \mu_1, \mu_2)$, as

$$\int_c^{c'} \left(E_{\Psi(\mu_1,\mu_2;\gamma)}[u_2(x_1, X_2) \mid X_1 = x_1] - u_1(x_1)\right) \phi(x_1; \mu_1, \sigma^2)\,dx_1.$$

The integrand is strictly positive on $[c, c']$, therefore $U(c; \mu_1, \mu_2) < U(c'; \mu_1, \mu_2)$. This shows $U(\cdot; \mu_1, \mu_2)$ is strictly increasing up until $c^*$; a symmetric argument shows it is strictly decreasing after $c^*$.

For a given $\epsilon > 0$, let $\delta = U(c^*; \mu_1, \mu_2) - \max(U(c^* - \epsilon; \mu_1, \mu_2), U(c^* + \epsilon; \mu_1, \mu_2))$, where $\delta > 0$ because $c^* - \epsilon$ and $c^* + \epsilon$ must each have a strictly positive loss relative to $c^*$. Since $U(\cdot; \mu_1, \mu_2)$ is single peaked, every $c$ more than $\epsilon$ away from $c^*$ must have a loss relative to $c^*$ at least as large as the loss of either $c^* - \epsilon$ or $c^* + \epsilon$, so $U(c^*; \mu_1, \mu_2) - U(c; \mu_1, \mu_2) \ge \delta$.

This fact, combined with the uniform convergence of $U_N(c)$ from Proposition OA.4, establishes Proposition OA.5.

Proof. Consider any sample path $\omega = (x_n)_{n=1}^\infty$ where the conclusion of Proposition OA.4 holds and the cutoff choice functions are asymptotically myopic. For every $\epsilon > 0$, find $\delta > 0$ from Lemma OA.14 for $\mu_1 = \mu_1^\bullet$, $\mu_2 = \mu_2^*(c^\dagger)$, and find large enough $\bar{N}_1$ so that $\sup_{c \in \mathbb{R}} |U_N(c)(\omega) - U(c; \mu_1^\bullet, \mu_2^*(c^\dagger))| < \delta/3$ whenever $N \ge \bar{N}_1$. This means for $N \ge \bar{N}_1$,

$$\sup_{c \in \mathbb{R}} U_N(c)(\omega) \ge U_N(c^*)(\omega) \ge U(c^*; \mu_1^\bullet, \mu_2^*(c^\dagger)) - \delta/3,$$

while $U_N(c)(\omega) \le U(c^*; \mu_1^\bullet, \mu_2^*(c^\dagger)) - (2\delta)/3$ for every $c \notin [c^* - \epsilon, c^* + \epsilon]$. Find $\bar{N}_2$ so that for $N \ge \bar{N}_2$, $\sup_{c \in \mathbb{R}} U_N(c)(\omega) - U_N(C_N)(\omega) < \delta/3$. Then for $N \ge \max(\bar{N}_1, \bar{N}_2)$, $C_N(\omega) \in [c^* - \epsilon, c^* + \epsilon]$. Since $\epsilon > 0$ is arbitrary, $C_N(\omega) \to c^*$.

Therefore, we conclude $C_N \to c^*$ along any sample path $\omega$ where the conclusion of Proposition OA.4 holds and the cutoff choice functions are asymptotically myopic. Since these two events both happen almost surely, $C_N \to c^*$ almost surely.
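The single-peakedness argument and the loss bound $\delta$ can be made concrete in a special case. The sketch below is an illustration under assumptions that are not in the text: payoffs are taken to be linear, $u_1(x_1) = x_1$ and $u_2(x_1, x_2) = x_2$, so that the indifference condition $c = \mu_2 - \gamma(c - \mu_1)$ gives the cutoff $(\gamma\mu_1 + \mu_2)/(1 + \gamma)$ and $U(c; \mu_1, \mu_2)$ has a closed form. The function names and parameter values are hypothetical.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_payoff(c, mu1, mu2, gamma, sigma=1.0):
    """U(c; mu1, mu2) for the ASSUMED payoffs u1(x1) = x1 and u2(x1, x2) = x2.

    The agent stops when x1 >= c and otherwise receives the second draw, whose
    subjective conditional mean is mu2 - gamma * (x1 - mu1).
    """
    z = (c - mu1) / sigma
    return (mu1 * (1.0 - norm_cdf(z)) + mu2 * norm_cdf(z)
            + (1.0 + gamma) * sigma * norm_pdf(z))


def optimal_cutoff(mu1, mu2, gamma):
    """Indifference point c = mu2 - gamma * (c - mu1) under the linear payoffs."""
    return (gamma * mu1 + mu2) / (1.0 + gamma)


mu1, mu2, gamma = 0.0, -0.5, 0.4
c_star = optimal_cutoff(mu1, mu2, gamma)

# Evaluate U on a grid centered at c_star (the grid point k = 0 is c_star itself).
grid = [c_star + 0.01 * k for k in range(-300, 301)]
values = [expected_payoff(c, mu1, mu2, gamma) for c in grid]
best = grid[values.index(max(values))]

# The loss bound delta from Lemma OA.14 for a given epsilon.
epsilon = 0.25
delta = expected_payoff(c_star, mu1, mu2, gamma) - max(
    expected_payoff(c_star - epsilon, mu1, mu2, gamma),
    expected_payoff(c_star + epsilon, mu1, mu2, gamma),
)
print("grid maximizer:", best, " closed-form cutoff:", c_star, " delta:", delta)
```

In this special case the payoff rises strictly up to the cutoff and falls strictly afterwards on the whole grid, and $\delta$ is strictly positive, matching the two steps of the lemma's proof.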
OA 5 Misspecified Inference under Method of Moments

In the analysis so far, I have modeled the learners as misspecified Bayesians. In this section, I consider agents who use a method-of-moments (MOM) procedure as a simpler but natural alternative to Bayesian inference. Proposition OA.7 and Corollary OA.2 show that over-pessimism and the positive-feedback loop remain robust to this relaxation of Bayesian inference.

OA 5.1 Feasible Models for $(X_1, X_2)$

Each agent starts with a set of feasible models $\{F(\cdot; \theta_1, \theta_2) : \theta_1 \in \Theta_1, \theta_2 \in \Theta_2\}$ for the joint distribution of $(X_1, X_2)$, indexed by feasible fundamentals $\Theta_1 \times \Theta_2 \subseteq \mathbb{R}^2$. For each $(\theta_1, \theta_2)$, $F(\cdot; \theta_1, \theta_2)$ is a full-support measure on the rectangle $I_1 \times I_2$, where each of $I_1, I_2$ is a possibly infinite interval of $\mathbb{R}$. By "full-support" I mean that for every open ball $B \subseteq I_1 \times I_2$, $F(B; \theta_1, \theta_2) > 0$. For each model $F(\cdot; \theta_1, \theta_2)$, let $F_1(\cdot; \theta_1, \theta_2)$ denote its marginal on $I_1$, and let $F_{2|1}(\cdot; \theta_1, \theta_2 \mid x_1)$ denote its conditional distribution on $I_2$ given $X_1 = x_1$. I will make the following assumptions on the family of feasible models:

Assumption OA.1. The feasible models $\{F(\cdot; \theta_1, \theta_2) : \theta_1 \in \Theta_1, \theta_2 \in \Theta_2\}$ satisfy:
(a) $F_1(\cdot; \theta_1, \theta_2)$ is only a function of $\theta_1$, and $E_{F_1(\cdot; \theta_1, \theta_2)}[X_1]$ is strictly increasing in $\theta_1$.
(b) For each $x_1 \in I_1$ and $\theta_1 \in \Theta_1$, $E_{F_{2|1}(\cdot; \theta_1, \theta_2 \mid x_1)}[X_2]$ strictly increases in $\theta_2$.
(c) For any $\theta_1 \in \Theta_1$ and $\theta_2 \in \Theta_2$, $E_{F_{2|1}(\cdot; \theta_1, \theta_2 \mid x_1)}[X_2]$ strictly decreases in $x_1$.

In light of Assumption OA.1(a), the marginal distribution on $X_1$ can be written simply as $F_1(\cdot; \theta_1)$, omitting $\theta_2$. Assumption OA.1(c) is the substantive assumption capturing the gambler's fallacy psychology. Every subjective distribution in the family is such that the agent predicts a lower mean for $X_2$ after a higher realization of $X_1$.
The behavioral economics literature has not settled on a general definition of the gambler's fallacy that works under all distributional assumptions, but Assumption OA.1(c) seems like a reasonable first step. Note that this is a generalization of how I model the gambler's fallacy in the main text using a pair of symmetric, log-concave distributions.

Here are some examples satisfying these assumptions. The first example concerns the Gaussian case from Example 2.

Example OA.1. Let $I_1 = I_2 = \mathbb{R}$ and let $\Theta_1 = \Theta_2 = \mathbb{R}$. Fixing some $\sigma^2 > 0$, $\gamma > 0$, let $F(\cdot; \theta_1, \theta_2)$ be such that $X_1 \sim N(\theta_1, \sigma^2)$ and $X_2 \mid (X_1 = x_1) \sim N(\theta_2 - \gamma(x_1 - \theta_1), \sigma^2)$. The marginal distribution on $X_1$ does not depend on $\theta_2$. Its mean is $\theta_1$, so it strictly increases in $\theta_1$. The conditional mean of $X_2 \mid X_1 = x_1$ is strictly increasing in $\theta_2$ and strictly decreasing in $x_1$ since $\gamma > 0$. So all conditions in Assumption OA.1 are satisfied.

The next example features bivariate exponential distributions supported on the half-line $[0, \infty)$.

Example OA.2. Gumbel (1960) proposes the following family of bivariate exponential distributions, parametrized by $\alpha \in [-1, 1]$: consider a joint distribution with the density function

$$(\tilde{x}_1, \tilde{x}_2) \mapsto e^{-\tilde{x}_1 - \tilde{x}_2} \cdot \left[1 + \alpha(2e^{-\tilde{x}_1} - 1) \cdot (2e^{-\tilde{x}_2} - 1)\right]$$

for $\tilde{x}_1, \tilde{x}_2 \ge 0$. If $(\tilde{X}_1, \tilde{X}_2)$ are random variables with this density, then they have full support on $[0, \infty) \times [0, \infty)$ and each $\tilde{X}_j$ has the marginal distribution of an exponential random variable with mean 1. The conditional mean of $\tilde{X}_2$ given a realization of $\tilde{X}_1$ is $E[\tilde{X}_2 \mid \tilde{X}_1 = \tilde{x}_1] = 1 + \frac{\alpha}{2} - \alpha e^{-\tilde{x}_1}$, which follows from integrating $\tilde{x}_2$ against the conditional density $e^{-\tilde{x}_2}[1 + \alpha(2e^{-\tilde{x}_1} - 1)(2e^{-\tilde{x}_2} - 1)]$. The correlation between $\tilde{X}_1$ and $\tilde{X}_2$ is $\alpha/4$.

Let $I_1 = I_2 = [0, \infty)$ and let $\Theta_1 = \Theta_2 = (0, \infty)$. Fixing some $-1 \le \alpha < 0$, let $F(\cdot; \theta_1, \theta_2)$ be the joint distribution generated by $X_1 = \theta_1 \cdot \tilde{X}_1$ and $X_2 = \theta_2 \cdot \tilde{X}_2$, where $(\tilde{X}_1, \tilde{X}_2)$ have the Gumbel bivariate distribution with parameter $\alpha$. Since $(\tilde{X}_1, \tilde{X}_2)$ have full support on $I_1 \times I_2$, the same holds for $(X_1, X_2)$ for any $\theta_1, \theta_2 > 0$.
The marginal distribution of $X_1$ is exponential with a mean of $\theta_1$, so Assumption OA.1(a) is satisfied. The conditional mean of $X_2 \mid X_1 = x_1$ is given by

$$E[\theta_2 \tilde{X}_2 \mid \theta_1 \tilde{X}_1 = x_1] = \theta_2 \cdot E\left[\tilde{X}_2 \,\Big|\, \tilde{X}_1 = \frac{x_1}{\theta_1}\right] = \theta_2 \cdot \left(1 + \frac{\alpha}{2} - \alpha e^{-(x_1/\theta_1)}\right).$$

As $-1 \le \alpha < 0$, the term inside the bracket is strictly positive. So this conditional expectation is strictly increasing in $\theta_2$, showing that Assumption OA.1(b) is satisfied. Also, since $\theta_1, \theta_2 > 0$ and $\alpha < 0$, $x_1 \mapsto -\alpha\theta_2 e^{-(x_1/\theta_1)}$ is strictly decreasing and so Assumption OA.1(c) is satisfied.

I give a third example where $I_1 = I_2 = [0, 1]$ are bounded intervals.

Example OA.3. Let $\Theta_1 = \Theta_2 = (0, \infty)$ and consider the family of distributions $F(\cdot; \theta_1, \theta_2)$ such that under parameters $(\theta_1, \theta_2)$, $X_1 \sim \mathrm{Beta}(\theta_1, 1)$ and $X_2 \mid X_1 = x_1 \sim \mathrm{Beta}((1 - x_1)\theta_2, 1)$. Since $\theta_1, \theta_2 > 0$, $X_1$ has full support on $[0, 1]$, and for each $x_1 \in (0, 1)$, $X_2 \mid X_1 = x_1$ has full support on $[0, 1]$. So $F(\cdot; \theta_1, \theta_2)$ has full support on $[0, 1]^2$ for every $(\theta_1, \theta_2) \in \Theta_1 \times \Theta_2$. The mean of $X_1$ is $\frac{\theta_1}{\theta_1 + 1}$, which only depends on $\theta_1$ and is strictly increasing in it. So Assumption OA.1(a) is satisfied. The conditional mean of $X_2 \mid X_1 = x_1$ is $\frac{(1 - x_1)\theta_2}{(1 - x_1)\theta_2 + 1}$, which is strictly increasing in $\theta_2$ and strictly decreasing in $x_1$. So Assumptions OA.1(b) and OA.1(c) are satisfied.

Finally, I give a general class of examples that allows any pair of given marginal distributions for $X_1$ and $X_2$ to be joined together using a copula so as to induce negative dependence in the joint distribution.

Example OA.4. Consider two families of distribution functions $Q_1(\cdot; \theta_1) : I_1 \to [0, 1]$ and $Q_2(\cdot; \theta_2) : I_2 \to [0, 1]$, where $Q_1$ and $Q_2$ are supported on $I_1, I_2$ respectively for all $\theta_1 \in \Theta_1$ and $\theta_2 \in \Theta_2$. Suppose the mean of $Q_1$ is increasing in $\theta_1$, and $Q_2$ is increasing in stochastic dominance order with respect to $\theta_2$. Fix a differentiable copula: that is, a differentiable function $W : [0, 1]^2 \to [0, 1]$ so that $W(u, 0) = W(0, v) = 0$, $W(u, 1) = u$, $W(1, v) = v$ for all $u, v \in [0, 1]$, and whenever $u_1 \le u_2$, $v_1 \le v_2 \in [0, 1]$, we get $W(u_2, v_2) - W(u_2, v_1) - W(u_1, v_2) + W(u_1, v_1) \ge 0$.
Consider the family of distribution functions $Q(\cdot; \theta_1, \theta_2)$ on $\mathbb{R}^2$ generated by joining together $Q_1(\cdot; \theta_1)$ with $Q_2(\cdot; \theta_2)$ using the copula $W$, namely

$$Q((-\infty, x_1] \times (-\infty, x_2]; \theta_1, \theta_2) = W(Q_1(x_1; \theta_1), Q_2(x_2; \theta_2)).$$

Then $Q(\cdot; \theta_1, \theta_2)$ has marginal distributions on $X_1$ and $X_2$ given by the distribution functions $Q_1(\cdot; \theta_1)$, $Q_2(\cdot; \theta_2)$. The next lemma shows that when the copula $W$ is such that $u \mapsto \frac{\partial W}{\partial u}(u, v)$ is increasing, the resulting joint distribution satisfies the conditions in Assumption OA.1.

Lemma OA.15. Suppose $\frac{\partial W}{\partial u}(u, v)$ is an increasing function in $u$ and that $\{Q_1(\cdot; \theta_1) : \theta_1 \in \Theta_1\}$, $\{Q_2(\cdot; \theta_2) : \theta_2 \in \Theta_2\}$ satisfy the conditions of this example. Then, the conditions in Assumption OA.1 are satisfied for the family of distributions $F(\cdot; \theta_1, \theta_2)$ where $F(\cdot; \theta_1, \theta_2)$ has the distribution function $Q(\cdot; \theta_1, \theta_2)$.

Proof. For Assumption OA.1(a), the marginal of $F(\cdot; \theta_1, \theta_2)$ on $X_1$ is simply $Q_1(\cdot; \theta_1)$, which I assumed is strictly increasing in mean with respect to $\theta_1$.

For Assumption OA.1(b), it is well-known that by the copula construction, for all $u, v \in [0, 1]$,

$$\mathbb{P}_{F(\cdot; \theta_1, \theta_2)}\left[X_2 \le Q_2^{-1}(v; \theta_2) \mid X_1 = Q_1^{-1}(u; \theta_1)\right] = \frac{\partial W}{\partial u}(u, v).$$

Because the left-hand side is a conditional distribution function evaluated at the $v$-quantile of $Q_2$, this means $\frac{\partial W}{\partial u}(u, v)$ is increasing in $v$. Fixing some $x_1 \in I_1$ and $\theta_1 \in \Theta_1$, put $u = Q_1(x_1; \theta_1)$. Now for every $\theta_2$ and $x_2 \in I_2$, we have $\mathbb{P}_{F(\cdot; \theta_1, \theta_2)}[X_2 \le x_2 \mid X_1 = x_1] = \frac{\partial W}{\partial u}(u, Q_2(x_2; \theta_2))$. Since the family of marginals $Q_2(\cdot; \theta_2)$ increases in FOSD order as $\theta_2$ increases, $Q_2(x_2; \theta_2)$ is decreasing in $\theta_2$. Since $\frac{\partial W}{\partial u}$ increases in its second argument, $\mathbb{P}_{F(\cdot; \theta_1, \theta_2)}[X_2 \le x_2 \mid X_1 = x_1]$ must then decrease in $\theta_2$; that is to say, the conditional distribution $X_2 \mid X_1 = x_1$ is increasing in FOSD order in $\theta_2$. So in particular Assumption OA.1(b) is satisfied.

For Assumption OA.1(c), again start with the expression

$$\mathbb{P}_{F(\cdot; \theta_1, \theta_2)}\left[X_2 \le Q_2^{-1}(v; \theta_2) \mid X_1 = Q_1^{-1}(u; \theta_1)\right] = \frac{\partial W}{\partial u}(u, v).$$

For $x_1' > x_1$, put $u' = Q_1(x_1'; \theta_1) > Q_1(x_1; \theta_1) = u$.
We have for every $v \in [0, 1]$ that

$$\mathbb{P}_{F(\cdot; \theta_1, \theta_2)}\left[X_2 \le Q_2^{-1}(v; \theta_2) \mid X_1 = x_1\right] = \frac{\partial W}{\partial u}(Q_1(x_1; \theta_1), v)$$

while

$$\mathbb{P}_{F(\cdot; \theta_1, \theta_2)}\left[X_2 \le Q_2^{-1}(v; \theta_2) \mid X_1 = x_1'\right] = \frac{\partial W}{\partial u}(Q_1(x_1'; \theta_1), v).$$

Since the distribution function $Q_1(\cdot; \theta_1)$ has full support, $Q_1(x_1'; \theta_1) > Q_1(x_1; \theta_1)$. And since we assumed $\frac{\partial W}{\partial u}$ is increasing in its first argument, we see that $\mathbb{P}_{F(\cdot; \theta_1, \theta_2)}[X_2 \le x_2 \mid X_1 = x_1]$ is increasing in $x_1$. That is, the conditional distribution $X_2 \mid X_1 = x_1$ is decreasing in FOSD order in $x_1$. So Assumption OA.1(c) is satisfied.

Example OA.5. The condition that $\frac{\partial W}{\partial u}(u, v)$ increases in $u$ is satisfied by, for example, the Gaussian copula with any negative correlation. The derivative of the Gaussian copula is given by $\frac{\partial W}{\partial u}(u, v) = \mathbb{P}[X_2 \le \Phi^{-1}(v) \mid X_1 = \Phi^{-1}(u)]$, where $\Phi$ is the standard Gaussian distribution function and $(X_1, X_2)$ are jointly Gaussian with standard Gaussian marginals and correlation $-1 < \rho < 0$. Since $X_2 \mid X_1 = x_1 \sim N(\rho x_1, 1 - \rho^2)$, it is clear that $X_2 \mid X_1 = x_1$ decreases in FOSD order as $x_1$ increases, so for any $v$ we have that $\mathbb{P}[X_2 \le \Phi^{-1}(v) \mid X_1 = \Phi^{-1}(u)]$ increases in $u$.

OA 5.2 Method of Moments Inference

For a distribution of histories $\mathcal{H} \in \Delta(\mathbb{H})$, let $a_1[\mathcal{H}]$ represent the average first-period draw under this distribution and let $a_2[\mathcal{H}]$ represent the average second-period draw (when uncensored). More precisely, $a_1[\mathcal{H}] := E_{h \sim \mathcal{H}}[h_1]$ and $a_2[\mathcal{H}] := E_{h \sim \mathcal{H}}[h_2 \mid h_2 \ne \emptyset]$. Suppose that objectively $X_1, X_2$ are independent with a joint distribution $F^\bullet$, and denote the true distribution of histories under censoring by the cutoff stopping rule $c \in \mathbb{R}$ as $\mathcal{H}^\bullet(c)$. Then by independence, $a_1[\mathcal{H}^\bullet(c)]$ and $a_2[\mathcal{H}^\bullet(c)]$ do not in fact depend on $c$.

Given the family of feasible models $\{F(\cdot; \theta_1, \theta_2) : \theta_1 \in \Theta_1, \theta_2 \in \Theta_2\}$ about the joint distribution of $(X_1, X_2)$, let $\mathcal{H}(\theta_1, \theta_2; c) := \mathcal{H}(F(\cdot; \theta_1, \theta_2); c)$ denote the distribution on histories under the model $F(\cdot; \theta_1, \theta_2)$ and censoring cutoff $c$. I now define the method of moments estimator.

Definition OA.5.
The method-of-moments estimator derived from an infinite dataset ofhistories with the distribution H • ( c ) is any pair ( θ M , θ M ) ∈ Θ × Θ such that:1. a [ H ( θ M , θ M ; c )] = a [ H • ( c )]2. a [ H ( θ M , θ M ; c )] = a [ H • ( c )]I will sometimes write θ M ( c ) , θ M ( c ) to emphasize the dependence of the MOM estimatorson the censoring threshold c. It is easy to check that for the Gaussian case, the MOMestimators and the pseudo-true fundamentals coincide.The MOM estimator need not exist — for example, if all values of θ ∈ Θ generate amarginal distribution on X that is smaller than a [ H • ( c )] . However, when it exists, it isunique under the assumptions I made. Lemma OA.16. When the family of feasible models satisfies Assumption OA.1, the MOMestimator is unique when it exists.Proof. Suppose ( θ M , θ M ) is an MOM estimator. I show any other MOM estimator (ˆ θ , ˆ θ )must be equal to it.We may rewrite the moments as: a [ H ( θ M , θ M ; c )] = E F ( · ; θ ) [ X ] , a [ H ( θ M , θ M ; c )] = E F ( · ; θ ,θ ) [ X | X < c ].The unconditional mean of X , namely E F ( · ; θ ) [ X ], is strictly increasing in θ by As-sumption OA.1(a). So, at most one value of θ ∈ Θ can generate an unconditional meanthat matches a [ H • ( c )], meaning we must have ˆ θ = θ M .Given this unique θ M , Assumption OA.1(b) implies the conditional mean E F | ( · ; θ M ,θ | x ) [ X ]is strictly increasing in θ for every x < c . The conditional mean E F ( · ; θ M ,θ ) [ X | X < c ] isgiven by an integral over E F | ( · ; θ M ,θ | x ) [ X ] across the values x < c , therefore E F ( · ; θ M ,θ ) [ X | X Fix some objective, independent distribution F • for ( X , X ) and supposeagents’ feasible models { F ( · ; θ , θ ) : θ ∈ Θ , θ ∈ Θ } satisfy Assumptions OA.1 and OA.2.Suppose the payoff function u ( x , x ) in the optimal-stopping problem is linear in x . Initiatethe 0th generation at an arbitrary cutoff c [0] in the interior of I . 
Then, beliefs and cutoffthresholds ( µ M , [ t ] ) t ≥ , ( µ M , [ t ] ) t ≥ , and ( c M [ t ] ) t ≥ form monotonic sequences. This corollary establishes the monotonicity of the beliefs and cutoffs for MOM agents,analogous to the monotonicity result of Theorem 2. Proof. I first show that under any of the models F ( · ; θ , θ ), agent’s subjectively optimalstopping rule is a cutoff rule (possibly involving never stopping or always stopping). Itsuffices to show that x ( u ( x ) − E F | ( · ; θ ,θ | x ) [ u ( x , X )])is strictly increasing in x . By linearity of u in its second argument, this expression is equalto x ( u ( x ) − u ( x , E F | ( · ; θ ,θ | x ) )) . Suppose x > x . By Assumption 1(b), u ( x ) − u ( x , E F | ( · ; θ ,θ | x ) ) ≥ u ( x ) − u ( x , E F | ( · ; θ ,θ | x ) ) . By Assumption OA.1(c), E F | ( · ; θ ,θ | x ) < E F | ( · ; θ ,θ | x ) . Combined with Assumption 1(a), itgives u ( x , E F | ( · ; θ ,θ | x ) ) < u ( x , E F | ( · ; θ ,θ | x ) ), hence showing u ( x ) − u ( x , E F | ( · ; θ ,θ | x ) ) > u ( x ) − u ( x , E F | ( · ; θ ,θ | x ) ) . Also, suppose F ( · ; θ , θ ) induces either a stopping threshold which is an interior point of I , or always stopping. Then F ( · ; θ , θ ) induces a higher stopping threshold or alwaysstopping whenever θ ≥ θ . To see this, if there is an indifference point ¯ x in the interiorof I with u (¯ x ) = u (¯ x , E F | ( · ; θ ,θ | ¯ x ) ), then we have E F | ( · ; θ ,θ | ¯ x ) > E F | ( · ; θ ,θ | ¯ x ) due56o Assumption OA.1(b), so u (¯ x ) < u (¯ x , E F | ( · ; θ ,θ | ¯ x ) ). This shows under F ( · ; θ , θ )the agent strictly prefers continuing at ¯ x , so the acceptance threshold must be higher.Similarly, if the agent prefers always stopping at every x ∈ I under F ( · ; θ , θ ). then sheprefers strictly stopping at every x under F ( · ; θ , θ ).I now show that µ M , [ t ] , µ M , [ t ] , and c M [ t ] are well defined for every t ≥ . 
MOM agents in gener-ation t ≥ t sub-datasets of censored histories, with the distribution H • ( c [0] ) , ..., H • ( c [ t − )where c [0] ∈ int( I ) . The moments to match are a ( H • ( c [0] ) , ..., H • ( c [ K − )) = E F • [ X ] ,a ( H • ( c [0] ) , ..., H • ( c [ K − )) = E F • [ X ] , where the second-period moment is well defined because c [0] is interior, so a positive fractionof histories in at least one sub-dataset contain uncensored X . These moments are interiorvalues in I , I respectively, since F • has full-support marginal distributions.By Assumption OA.2(b), there exists ¯ θ ∈ Θ , independent of K and ( c [0] , ..., c [ K − ) , sothat E F ( · ;¯ θ ) [ X ] = E F • [ X ] . By combining Assumption OA.1(b) and OA.2(c), we get that θ a ( H (¯ θ , θ ; c [0] ) , ..., H (¯ θ , θ ; c [ K − ))is increasing, continuous on Θ with a range of I . (This uses the fact that c [0] is in the interiorof I .) Since MOM agents are matching an interior value E F • [ X ] ∈ int( I ) , this shows thatfor any K and ( c [0] , ..., c [ K − ) with c [0] ∈ int( I ) , θ M ( c [0] , ..., c [ K − ) and θ M ( c [0] , ..., c [ K − )exist, and furthermore θ M ( c [0] , ..., c [ K − ) = ¯ θ .By uniqueness of MOM estimators in Lemma OA.16, µ M , [ t ] , µ M , [ t ] are well defined for each t ≥ 1. Also, c M [ t ] is also well defined for each t ≥ , given that we have shown the optimalstrategy in the model F ( · ; µ M , [ t ] , µ M , [ t ] ) is a cutoff strategy.To prove monotonicity, first suppose that c [1] ≤ c [0] . I have argued that we must have θ M , [2] = θ M , [1] = ¯ θ , so now I rule out θ M , [2] > θ M , [1] . Note that a ( H (¯ θ , θ ; c [0] ) , H (¯ θ , θ ; c [1] )) = w w + w a ( H (¯ θ , θ ; c [0] )) + w w + w a ( H (¯ θ , θ ; c [1] ))where w = P F • [ X ≤ c [0] ] > w = P F • [ X ≤ c [1] ] ≥ . The moment-matchingcondition for generation 1 implies a ( H (¯ θ , θ M , [1] ; c [0] )) = E F • [ X ] . 
For any $\theta^M_{2,[2]} > \theta^M_{2,[1]}$, we have

$$a_2(\mathcal{H}(\bar{\theta}_1, \theta^M_{2,[2]}; c_{[0]})) > E_{F^\bullet}[X_2]$$

from Assumption OA.2(b). If $c_{[1]} = \inf(I_1)$, we have found a contradiction since the weight $w_2$ is 0. When $c_{[1]} > \inf(I_1)$, we get

$$a_2(\mathcal{H}(\bar{\theta}_1, \theta^M_{2,[2]}; c_{[1]})) \ge a_2(\mathcal{H}(\bar{\theta}_1, \theta^M_{2,[2]}; c_{[0]})) > E_{F^\bullet}[X_2]$$

by combining Assumption OA.2(c) with the fact that $c_{[1]} \le c_{[0]}$. Both $w_1$ and $w_2$ are strictly positive, and they multiply terms that are both strictly larger than $E_{F^\bullet}[X_2]$. This shows

$$a_2(\mathcal{H}(\bar{\theta}_1, \theta^M_{2,[2]}; c_{[0]}), \mathcal{H}(\bar{\theta}_1, \theta^M_{2,[2]}; c_{[1]})) > E_{F^\bullet}[X_2],$$

again contradicting the moment condition. Hence we must have $\theta^M_{2,[2]} \le \theta^M_{2,[1]}$, and thus $c^M_{[2]} \le c^M_{[1]}$ by monotonicity of the cutoff threshold in belief, as discussed before. A similar argument establishes that $(\mu^M_{1,[t]})_{t \ge 1}$, $(\mu^M_{2,[t]})_{t \ge 1}$, and $(c^M_{[t]})_{t \ge 1}$ are decreasing sequences.

The case of $c_{[1]} > c_{[0]}$ is symmetric.

OA 6 The Censoring Effect in a Finite-Urn Model

Rabin (2002) Section 7 discusses an example with endogenous observations. There is an infinite population of financial analysts, each with quality $\theta \in \{\frac{1}{4}, \frac{1}{2}, \frac{3}{4}\}$. Conditional on quality $\theta$, an analyst generates either a good (signal $a$) or a bad (signal $b$) return each period, with probabilities $\theta$ and $1 - \theta$, independently across periods. The agent, however, believes successive returns from the same analyst are generated through a finite-urn model. Consider an urn with $N$ balls where $N$ is a multiple of 4. For an analyst with quality $\theta$, initialize the urn with $\theta N$ balls labeled "$a$" and $(1 - \theta)N$ balls labeled "$b$." Successive returns are successive draws without replacement from the urn. The urn is refreshed every two draws. Rabin (2002) calls an agent with this finite-urn model an "$N$-Freddy". Since the urn is not refreshed between draws $2k - 1$ and $2k$ for $k = 1, 2, 3, \ldots$
, such pairs of draws exhibit negative correlation in the agent's feasible model, generating the gambler's fallacy.

Returning to the example in Rabin (2002) Section 7, objectively all financial analysts have quality $\theta = \frac{1}{2}$. The agent samples a financial analyst at random and observes his returns over two periods. Depending on the realizations of these two returns, the agent either observes the same analyst for two more periods before sampling a new analyst, or immediately samples a new analyst. This procedure is infinitely repeated. Rabin (2002) investigates a 4-Freddy agent's long-run belief about the proportions of analysts with the three levels of quality in the population.

The endogenous observation in the example is distinct from what I term the "censoring effect" in this paper. The main mechanism behind my censoring effect is that some observations omit signals $X_2$, which biased agents judge to be negatively correlated with signals that are always observed, $X_1$. This then leads to distorted inference. However, in the finite-urn model of Rabin (2002), the urn is refreshed every two periods. This means an $N$-Freddy agent judges the part of the data that is sometimes censored (the analyst's returns in periods 3 and 4) to be independent of the part of the data that is always observed (the analyst's returns in periods 1 and 2). Therefore the driving force behind the example in Rabin (2002) Section 7 is not the interaction between censoring and the gambler's fallacy, but rather between censoring and the "Bayesian aspect" of $N$-Freddy's quasi-Bayesian inference.

4-Freddy
signal        θ = 1/4    θ = 1/2    θ = 3/4
aa            0          1/6        1/2
ab            1/4        1/3        1/4
ba            1/4        1/3        1/4
bb            1/2        1/6        0
b∅            3/4        1/2        1/4

8-Freddy
signal        θ = 1/4    θ = 1/2    θ = 3/4
aa            1/28       6/28       15/28
ab            6/28       8/28       6/28
ba            6/28       8/28       6/28
bb            15/28      6/28       1/28
b∅            3/4        1/2        1/4

Table OA.1: The likelihoods of observations under different analyst qualities, for 4-Freddy and 8-Freddy agents.

In this section, I study a related problem where an $N$-Freddy agent observes each analyst for either one or two periods, depending on whether the analyst generates a bad first-period return.
This setup features the censoring effect, because the finite-urn model generates negative correlation between the first and second draws from each urn. I find that the agent's inference under this censoring structure tends to be too optimistic. This conclusion is in line with predictions about the censoring effect in the baseline model of this paper, for the basic inference result in Proposition 2 shows that when the dataset is censored in the opposite way (i.e. censored when the first draw is good), the resulting inference is too pessimistic. (Footnote: Proposition OA.12 in the Online Appendix shows that when the dataset is censored using a strategy that stops when $X_1 \le c$ for some $c \in \mathbb{R}$, inference about the second-period fundamental is always too high.) That is, I demonstrate the robustness of my censoring effect to an alternative model of the gambler's fallacy in a binary-signals setting, showing that it is not an artifact of the continuous-signals setup in my baseline model.

Table OA.1 displays the likelihood of all signals of length 2 for the 4-Freddy and 8-Freddy agents, for different values of $\theta \in \{\frac{1}{4}, \frac{1}{2}, \frac{3}{4}\}$. The last row of each table also shows the likelihoods of simply observing the signal $b$ in the first period, under the censoring rule that stops observing an analyst if his first return is bad.

I first discuss inference without censoring. After $aa$, Freddy exaggerates the relative likelihood of $\theta = \frac{3}{4}$ to $\theta = \frac{1}{4}$ compared to a Bayesian, whereas after $ab$ Freddy's relative likelihoods of these two qualities are the same as a Bayesian's. Overall, given a sample with an equal number of $aa$ and $ab$ signals, Freddy exaggerates the relative likelihood of $\theta = \frac{3}{4}$ to $\theta = \frac{1}{4}$. This phenomenon is analogous to the continuous version of the gambler's fallacy, where a biased observer "partially forgives" a mediocre outcome following an outstanding outcome. So, signals that begin with $a$ in the first period lead to an overly optimistic estimate about the analyst's ability.
By the same logic, observing an equal number of $ba$ and $bb$ signals would lead to exaggeration of the likelihood of $\theta = \frac{1}{4}$ relative to $\theta = \frac{3}{4}$.

However, now suppose the second observation is censored when the first observation is $b$. The otherwise symmetric situation becomes asymmetric. Following the observation of $b\emptyset$ (where the second draw is censored), Freddy's inference is the same as a Bayesian's. So we have turned off the channel that leads to exaggerating the probability of $\theta = \frac{1}{4}$ but kept the channel that leads to exaggerating the probability of $\theta = \frac{3}{4}$. This is analogous to the censoring effect in my model, where censoring the second-period draw following unfavorable first-period draws implies overly optimistic inferences.

In the long run, the agent observes a distribution of returns across different analysts: 25% of the time $aa$ is observed, 25% of the time $ab$ is observed, and 50% of the time $b\emptyset$ is observed. To calculate the agent's long-run beliefs, first suppose Freddy's prior specifies that either all analysts have $\theta = \frac{1}{4}$ or all analysts have $\theta = \frac{3}{4}$. Then Freddy's long-run inference is given by the parameter maximizing the expected log-likelihood of the data. For 4-Freddy, the expected log-likelihood under $\theta = \frac{1}{4}$ is $-\infty$ (the signal $aa$ has likelihood 0), while the expected log-likelihood under $\theta = \frac{3}{4}$ is a finite number. For 8-Freddy, the expected log-likelihood under $\theta = \frac{1}{4}$ is

$$\tfrac{1}{4} \ln(1/28) + \tfrac{1}{4} \ln(6/28) + \tfrac{1}{2} \ln(3/4) \approx -1.362,$$

while the expected log-likelihood under $\theta = \frac{3}{4}$ is

$$\tfrac{1}{4} \ln(15/28) + \tfrac{1}{4} \ln(6/28) + \tfrac{1}{2} \ln(1/4) \approx -1.234.$$

So in both cases, Freddy will come to believe $\theta = \frac{3}{4}$ over $\theta = \frac{1}{4}$ for all analysts.

Now consider a 4-Freddy who dogmatically believes some $1 - \kappa \in (0, 1)$ fraction of the analysts have $\theta = \frac{1}{2}$, but the remaining analysts either have $\theta = \frac{1}{4}$ or $\theta = \frac{3}{4}$. So, the agent estimates $q_a \in [0, \kappa]$, the fraction of analysts who have $\theta = \frac{3}{4}$. Straightforward algebra shows that the $q_a^*$ maximizing the expected log-likelihood of the data is $q_a^* = \frac{7}{18}\kappa + \frac{1}{9}$ for $\kappa \ge \frac{2}{11}$, and $q_a^* = \kappa$ otherwise.
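The entries of Table OA.1 and the long-run comparisons above follow mechanically from the urn model, so they can be checked with exact arithmetic. The sketch below is an illustration rather than the author's code: the function names are hypothetical, `"b0"` stands for the censored history $b\emptyset$, and the closed form $\frac{7}{18}\kappa + \frac{1}{9}$ used in `q_star` is the interior solution of the first-order condition of the mixture log-likelihood, which should be read as an assumption of this sketch that the grid search then cross-checks.

```python
import math
from fractions import Fraction

# Long-run frequencies of observed histories when every analyst truly has
# theta = 1/2 and the second return is censored after a first-period b.
DATA = {"aa": Fraction(1, 4), "ab": Fraction(1, 4), "b0": Fraction(1, 2)}


def freddy_likelihoods(n_balls, theta):
    """Signal likelihoods for an N-Freddy: two draws without replacement."""
    a = theta * n_balls          # number of balls labeled "a"
    b = n_balls - a              # number of balls labeled "b"
    n = Fraction(n_balls)
    return {
        "aa": (a / n) * ((a - 1) / (n - 1)),
        "ab": (a / n) * (b / (n - 1)),
        "ba": (b / n) * (a / (n - 1)),
        "bb": (b / n) * ((b - 1) / (n - 1)),
        "b0": b / n,             # first return is b, second return censored
    }


def expected_loglik(n_balls, theta):
    """Expected log-likelihood of the censored data under a single quality."""
    lik = freddy_likelihoods(n_balls, theta)
    total = 0.0
    for signal, weight in DATA.items():
        if lik[signal] == 0:
            return -math.inf
        total += float(weight) * math.log(float(lik[signal]))
    return total


def mixture_loglik(q, kappa, n_balls=4):
    """Expected log-likelihood when a fraction q has theta = 3/4,
    kappa - q has theta = 1/4, and 1 - kappa has theta = 1/2."""
    high = freddy_likelihoods(n_balls, Fraction(3, 4))
    mid = freddy_likelihoods(n_balls, Fraction(1, 2))
    low = freddy_likelihoods(n_balls, Fraction(1, 4))
    total = 0.0
    for signal, weight in DATA.items():
        p = q * high[signal] + (1 - kappa) * mid[signal] + (kappa - q) * low[signal]
        total += float(weight) * math.log(float(p))
    return total


def q_star(kappa):
    """Assumed closed form: interior optimum, capped at the upper bound kappa."""
    return min(Fraction(7, 18) * kappa + Fraction(1, 9), kappa)


kappa = Fraction(1, 2)
grid = [kappa * Fraction(k, 1000) for k in range(1001)]
best_q = max(grid, key=lambda q: mixture_loglik(q, kappa))
print("8-Freddy:", expected_loglik(8, Fraction(1, 4)), expected_loglik(8, Fraction(3, 4)))
print("grid maximizer:", float(best_q), " closed form:", float(q_star(kappa)))
```

Using `Fraction` keeps the table entries exact, so they can be compared with Table OA.1 entry by entry; only the logarithms are evaluated in floating point.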
Since $\tfrac{7}{18}\kappa + \tfrac{1}{9} > \tfrac{1}{2}\kappa$ for all $\kappa \in (\tfrac{2}{11}, 1)$, [...]

OA 7 The Gambler's Fallacy and Attentional Stability

Many papers on behavioral learning, including this one, can be thought of as studying agents whose prior (or “misspecified theory”) over states of the world excludes the true, data-generating state. Agents in this paper start with a prior supported on the class of feasible models $\{\Psi(\mu_1, \mu_2; \gamma) : (\mu_1, \mu_2) \in \mathcal{M}\}$ for some fixed $\gamma > 0$, with different models viewed as different states. But the true state is the objective distribution $(X_1, X_2) \sim \Psi(\mu_1^\bullet, \mu_2^\bullet; 0)$ that lies outside the feasible set. This discrepancy is not an issue when agents move one at a time and pass down their beliefs, since each agent only updates using one history, her own. But in the large-generations environment, as an agent's dataset grows, her misspecified theory can appear infinitely less likely in the limit than an alternative prior belief (or “light-bulb theory”) that includes the true state in its support.

Gagnon-Bartsch, Rabin, and Schwartzstein (2018) offer an explanation for why such misspecified theories persist with learning: attentional stability. Under a misspecified theory, some coarsened information may be sufficient for decision-making. When agents only pay attention to this coarsened information, the aspects of the data that they attend to may be so coarse that their misspecified theory no longer appears infinitely less likely than the light-bulb theory.

In this section, I investigate the attentional stability of the gambler's fallacy bias in my learning setting for the Gaussian case (so, $\Psi(\mu_1, \mu_2; \gamma)$ will stand for a correlated Gaussian distribution). The main intuition is that when agents are dogmatic about $\gamma$, they are dogmatic about the correlation between $X_1$ and $X_2$. Therefore, under their misspecified theory, agents do not find it necessary to separately keep track of the conditional distributions $X_2 \mid (X_1 = x_1)$ for different values of $x_1$.
Agents believe certain “statistics” of the dataset are sufficient for decision-making, and this process of reducing the entire dataset into these sufficient statistics removes features of the dataset that would otherwise have led the agents to question the validity of their theory.

My setting differs in two ways from that of Gagnon-Bartsch, Rabin, and Schwartzstein (2018). Each of my agents acts once (after observing a possibly large or even infinite dataset), while their agents observe one signal each period over an infinite number of periods. Another distinction is that data is endogenous in my setting, whereas Gagnon-Bartsch, Rabin, and Schwartzstein (2018) almost entirely focus on an exogenous-data environment. So, I begin by defining the key concepts surrounding attentional stability in my setting.

OA 7.1 A Definition of Attentional Stability in Large Datasets

In the learning environment where agents act in large generations, each agent in generation $t$ observes $t$ sub-datasets of infinitely many histories. The overall distribution of histories in the dataset is $\mathcal{H}^\bullet(c_{[0]}, \ldots, c_{[t-1]}) = \oplus_{k=0}^{t-1} \mathcal{H}^\bullet(c_{[k]})$, where the right-hand side refers to the mixture between the $t$ history distributions that assigns weight $1/t$ to each.

To develop a definition of attentional stability in large datasets, I consider an agent who directly observes a distribution of histories (instead of a dataset with this distribution) $\mathcal{H}^\bullet(c_1, \ldots, c_L) \in \Delta(\mathbb{H})$. This represents the observations of agents in each generation $t \ge 1$.

Definition OA.6. Let $\pi, \lambda$ be beliefs over the joint distribution of $(X_1, X_2)$. Say $\pi$ is inexplicable relative to $\lambda$, conditional on the true model $\Psi^\bullet$ and censoring thresholds $c_1, \ldots, c_L$, if $\mathcal{H}^\bullet(c_1, \ldots, c_L) = \mathcal{H}(\Psi; c_1, \ldots, c_L)$ for some $\Psi \in \text{supp}(\lambda)$, but $\mathcal{H}^\bullet(c_1, \ldots, c_L) \neq \mathcal{H}(\Psi; c_1, \ldots, c_L)$ for any $\Psi \in \text{supp}(\pi)$.

Each feasible model $\Psi$ and list of censoring thresholds $c_1, \ldots, c_L$ together induce a distribution over histories.
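As an illustration of how a feasible model and a list of censoring thresholds induce a distribution over histories, the sketch below samples histories in the Gaussian case. It assumes the conventions used elsewhere in this appendix: $X_1 \sim N(\mu_1, \sigma^2)$, $X_2 \mid X_1 = x_1 \sim N(\mu_2 - \gamma(x_1 - \mu_1), \sigma^2)$, the predecessor stops (so the second draw is censored) when $x_1 \ge c$, and the mixture puts equal weight on each threshold. The function names and the use of `None` for a censored draw are illustrative choices, not the paper's notation.

```python
import random


def draw_history(mu1, mu2, gamma, sigma, cutoff, rng):
    """Sample one history (h1, h2) under the model Psi(mu1, mu2; gamma).

    The second entry is None when the predecessor stops in period 1,
    i.e. when the first draw is at or above the cutoff.
    """
    x1 = rng.gauss(mu1, sigma)
    if x1 >= cutoff:
        return (x1, None)  # stopped: second-period draw is censored
    # Under the gambler's fallacy model, a high x1 lowers the mean of x2.
    x2 = rng.gauss(mu2 - gamma * (x1 - mu1), sigma)
    return (x1, x2)


def draw_mixture_history(mu1, mu2, gamma, sigma, cutoffs, rng):
    """Sample from the equal-weight mixture over several censoring thresholds."""
    cutoff = rng.choice(cutoffs)
    return draw_history(mu1, mu2, gamma, sigma, cutoff, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # gamma = 0 corresponds to the objective model with independent draws.
    sample = [draw_mixture_history(0.0, 0.0, 0.0, 1.0, [0.0, 1.0], rng)
              for _ in range(5)]
    for history in sample:
        print(history)
```

Setting `gamma=0.0` generates data from the objective model, while a positive `gamma` generates data from a feasible model in the agents' misspecified class.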
If the observed history distribution $\mathcal{H}^\bullet(c_1, \ldots, c_L)$ can be produced by some feasible model of $(X_1, X_2)$ in the support of the light-bulb theory $\lambda$, but not by any distribution in the support of the misspecified theory $\pi$, then I call $\pi$ inexplicable.

I now define a particular kind of limited attention. Given a distribution over histories, the agent maps the entire distribution to finitely many real numbers. This is an extreme form of data coarsening. If there is a strategy optimal under the misspecified theory $\pi$ that only makes use of these finitely many statistics, then we have a sufficient-statistics strategy.

Definition OA.7. A sufficient-statistics strategy (SSS) for large generations consists of a statistics map $\Lambda : \Delta(\mathbb{H}) \to \mathbb{R}^K$ for some finite $K < \infty$ and a cutoff map $\sigma : \text{Im}(\Lambda) \to \mathbb{R}$, such that agents in each generation $t \ge 1$ find it optimal (under the prior $\Psi \sim \pi$) to use the stopping strategy with cutoff $\sigma(\Lambda(\mathcal{H}))$ whenever $\mathcal{H}$ is a dataset of predecessors' histories $\mathcal{H} = \mathcal{H}^\bullet(c_{[0]}, \ldots, c_{[t-1]})$.

An agent following the strategy $(\Lambda, \sigma)$ first extracts $K$ statistics (i.e. $K$ real numbers) from the infinite dataset of predecessors' histories. Then, she applies $\sigma$ to choose a cutoff threshold that only depends on the dataset through its $K$ extracted statistics, $\Lambda(\mathcal{H})$. The idea is that the agent only pays attention to the finitely many statistics, a perhaps more realistic behavior than paying full attention to the entire infinite dataset. If such a strategy is optimal for an agent believing the true joint distribution of $(X_1, X_2)$ is drawn according to her (misspecified) prior $\Psi \sim \pi$, I call the pair $(\Lambda, \sigma)$ an SSS.

A related definition of sufficiency works with finite datasets instead of infinite datasets. This corresponds to limited attention in an alternative version of my Section 3 environment, where agents act one at a time but observe all predecessors' histories instead of adopting the immediate predecessor's posterior belief.

Definition OA.8.
A sufficient-statistics strategy (SSS) in datasets of size $N < \infty$ consists of a statistics map $\Lambda^{(N)} : \mathbb{H}^N \to \mathbb{R}^K$ for some finite $K < \infty$ and a cutoff map $\sigma^{(N)} : \text{Im}(\Lambda^{(N)}) \times \mathbb{N} \to \mathbb{R}$, such that the subjectively optimal cutoff threshold (under the Bayesian posterior belief about the fundamentals after updating prior density $m(\mu_1, \mu_2)$) is $\sigma^{(N)}(\Lambda^{(N)}((h_n)_{n=1}^N), N_2)$ after observing a dataset $(h_n)_{n=1}^N$ with size $N$ and containing $N_2 \le N$ instances of second-period draws.

Finally, I combine these concepts to define attentional stability. Roughly speaking, the theory $\pi$ is attentionally stable if we can find a $(\Lambda, \sigma)$ pair that pays “fine” enough attention to be an SSS under $\pi$, but “coarse” enough attention so that the resulting statistics can be explained by some model in the support of $\pi$.

Definition OA.9. Theory $\pi$ is attentionally stable, conditional on the objective model $\Psi^\bullet$ and censoring thresholds $c_1, \ldots, c_L$, if there exists an SSS $(\Lambda, \sigma)$ such that $\Lambda(\mathcal{H}^\bullet(c_1, \ldots, c_L)) = \Lambda(\mathcal{H}(\Psi; c_1, \ldots, c_L))$ for some $\Psi$ in the support of $\pi$.

OA 7.2 The Gambler's Fallacy is Inexplicable under Full Attention

Fix $\gamma > 0$. Let $\pi$ be any full-support belief over $\{\Psi(\mu_1, \mu_2; \gamma) : (\mu_1, \mu_2) \in \mathcal{M}\}$, where $\mathcal{M} \subseteq \mathbb{R}^2$ is any specification of feasible fundamentals. Let $\lambda$ be any belief with $\Psi^\bullet = \Psi(\mu_1^\bullet, \mu_2^\bullet; 0)$ in its support. I first show that without channeled attention, agents will come to realize that their misspecified theory $\pi$ is wrong after seeing a large dataset.

Proposition OA.8. $\pi$ is inexplicable relative to $\lambda$, conditional on $\Psi^\bullet$ and any censoring thresholds $c_1, \ldots, c_L \in \mathbb{R}$.

Proof. This is because $\Psi^\bullet \in \text{supp}(\lambda)$ but every $\Psi \in \text{supp}(\pi)$ has KL divergence bounded away from 0 relative to $\Psi^\bullet$ in terms of the histories they generate under censoring by $c_1, \ldots, c_L$, that is to say

$$\inf_{\Psi \in \text{supp}(\pi)} D_{KL}\big(\mathcal{H}^\bullet(c_1, \ldots, c_L) \,\|\, \mathcal{H}(\Psi; c_1, \ldots, c_L)\big) = \inf_{(\mu_1, \mu_2) \in \mathcal{M}} D_{KL}\big(\mathcal{H}^\bullet(c_1, \ldots, c_L) \,\|\, \mathcal{H}(\Psi(\mu_1, \mu_2, \sigma^2, \sigma^2; \gamma); c_1, \ldots, c_L)\big) > 0.$$
This inequality holds because the derivation of $\mu_1^*, \mu_2^*$ in Example 2 implies the above KL-divergence minimization problem has a minimum strictly above 0 even over the unrestricted domain $(\mu_1, \mu_2) \in \mathbb{R}^2$. The restriction to some $\mathcal{M} \subseteq \mathbb{R}^2$ can only make the minimum larger.

OA 7.3 The Gambler's Fallacy is Attentionally Stable

Now I exhibit a family of SSS for finite datasets of size $N$ and another SSS for large generations that naturally corresponds to taking $N \to \infty$. These SSS have the additional property that they lead agents to the same beliefs about the fundamentals as the full-attention Bayesianism assumed in the rest of the paper. So, not only do these SSS justify agents not discarding their misspecified theory after seeing large datasets, they also provide a limited-attention foundation for the learning dynamics that I investigate in the main text of the paper for the Gaussian case.

In a dataset of size $N$, consider the statistics map with $K = 2$,

$$\Lambda^{(N)}\big((h_n)_{n=1}^N\big) = \left( \frac{1}{N} \sum_{n=1}^N h_{1,n},\ \frac{1}{\#\{n : h_{2,n} \neq \varnothing\}} \sum_{n : h_{2,n} \neq \varnothing} (h_{2,n} + \gamma h_{1,n}) \right).$$

The first statistic is the sample mean of the first-period draws. The second statistic is the sample mean of the “re-centered” observations $v_n := h_{2,n} + \gamma h_{1,n}$, one for each history $h_n$ where $h_{2,n} \neq \varnothing$. The agent only pays attention to the sample averages of $x_{1,n} = h_{1,n}$ and $v_n$. Under the feasible model $\Psi(\mu_1, \mu_2; \gamma)$, we may write the distributions of $X_1, X_2$ as

$$X_1 = \mu_1 + \epsilon_1, \qquad X_2 = \mu_2 - \gamma \epsilon_1 + z,$$

where $\epsilon_1, z \sim N(0, \sigma^2)$ are independent. Defining $V := X_2 + \gamma X_1$, we see that under $\Psi(\mu_1, \mu_2; \gamma)$, $V = \mu_2 + \gamma \mu_1 + z$. So, observations of first-period draws are signals about $\mu_1$, while observations of re-centered second-period $V$ are signals about $\mu_2 + \gamma \mu_1$.

Proposition OA.9. $\Lambda^{(N)}$ is part of an SSS in datasets of size $N$. The cutoff choice in this SSS is the same as for the full-attention agent.

Proof. Write $\phi(x; a, b^2)$ for the Gaussian density with mean $a$, variance $b^2$, evaluated at $x$.
Without loss, suppose $h_{2,n} \neq \varnothing$ for all $1 \le n \le N_2$, and $h_{2,n} = \varnothing$ for all $n > N_2$. I show that the posterior density over $(\mu_1, \mu_2)$ after the dataset $(h_n)_{n=1}^N$ only depends on $N_2$, $\frac{1}{N} \sum_{n=1}^N h_{1,n}$, and $\frac{1}{N_2} \sum_{n=1}^{N_2} (h_{2,n} + \gamma h_{1,n})$. Indeed,

$$m(\mu_1, \mu_2 \mid (h_n)_{n=1}^N) \propto m(\mu_1, \mu_2) \prod_{n=1}^{N_2} \phi(h_{1,n}; \mu_1, \sigma^2) \cdot \phi(h_{2,n}; \mu_2 - \gamma(h_{1,n} - \mu_1), \sigma^2) \cdot \prod_{n=N_2+1}^{N} \phi(h_{1,n}; \mu_1, \sigma^2)$$

$$= m(\mu_1, \mu_2) \left[ \prod_{n=1}^{N} \phi(h_{1,n}; \mu_1, \sigma^2) \right] \cdot \prod_{n=1}^{N_2} \phi(h_{2,n}; \mu_2 - \gamma(h_{1,n} - \mu_1), \sigma^2)$$

$$= m(\mu_1, \mu_2) \left[ \prod_{n=1}^{N} \phi(h_{1,n}; \mu_1, \sigma^2) \right] \cdot \prod_{n=1}^{N_2} \phi(h_{2,n} + \gamma h_{1,n}; \mu_2 + \gamma \mu_1, \sigma^2).$$

It is well-known that under the Gaussian likelihood, $(h_{1,n})_{n=1}^N \mapsto \prod_{n=1}^N \phi(h_{1,n}; \mu_1, \sigma^2)$ is a function of $\frac{1}{N} \sum_{n=1}^N h_{1,n}$, and for the same reason $(h_{2,n} + \gamma h_{1,n})_{n=1}^{N_2} \mapsto \prod_{n=1}^{N_2} \phi(h_{2,n} + \gamma h_{1,n}; \mu_2 + \gamma \mu_1, \sigma^2)$ is a function of $\frac{1}{N_2} \sum_{n=1}^{N_2} (h_{2,n} + \gamma h_{1,n})$.

Since the posterior belief $m(\cdot \mid (h_n)_{n=1}^N)$ only depends on $N_2$ and the two statistics $\Lambda_1^{(N)}((h_n)_{n=1}^N), \Lambda_2^{(N)}((h_n)_{n=1}^N) \in \mathbb{R}$, the optimal cutoff rule may be expressed as a function of these two statistics, $N_2$, and $c$ of the predecessors.

In the environment where full-attention Bayesian agents move one at a time, their behavior is indistinguishable from agents using this SSS. Roughly speaking, this is because the subjective joint distribution between $(X_1, V)$ is Gaussian and the mean of a sequence of Gaussian random variables is a sufficient statistic for the likelihood of the entire sequence. Even when agents are full-attention Bayesians, their posterior distribution only depends on the histories data through these statistics. Therefore, the statistics are sufficient for any decision problem.

Consider now the large-sample analog of the finite-sample SSS just defined. Again with $K = 2$, consider the statistics map $\Lambda$ that sends each distribution $\mathcal{H}$ to $E_{h \sim \mathcal{H}}[h_1]$ and $E_{h \sim \mathcal{H}}[h_2 + \gamma h_1 \mid h_2 \neq \varnothing]$. I show that $\Lambda$ makes $\pi$ attentionally explicable whenever $\pi$ has full support over the feasible models indexed by feasible fundamentals $\mathcal{M} = \mathbb{R}^2$.
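The finite-sample statistics map can be sketched in a few lines. The code below is an illustration only: it represents a history as a pair `(h1, h2)` with `h2 = None` when the second draw is censored (a representation chosen here, not the paper's), computes the two statistics, and then backs out the fundamentals they rationalize, using the relations that the first statistic is a signal about $\mu_1$ and the second is a signal about $\mu_2 + \gamma\mu_1$.

```python
def statistics_map(histories, gamma):
    """Return (s1, s2): the mean of first-period draws and the mean of the
    re-centered observations v = h2 + gamma * h1 over uncensored histories."""
    if not histories:
        raise ValueError("need at least one history")
    first_draws = [h1 for h1, _ in histories]
    recentered = [h2 + gamma * h1 for h1, h2 in histories if h2 is not None]
    if not recentered:
        raise ValueError("need at least one uncensored second-period draw")
    s1 = sum(first_draws) / len(first_draws)
    s2 = sum(recentered) / len(recentered)
    return s1, s2


def implied_fundamentals(s1, s2, gamma):
    """Fundamentals (mu1, mu2) that rationalize the statistics, using
    s1 = mu1 and s2 = mu2 + gamma * mu1."""
    return s1, s2 - gamma * s1


if __name__ == "__main__":
    # A tiny hand-made dataset: the first history is censored.
    data = [(1.0, None), (0.0, 2.0), (-1.0, 1.0)]
    s1, s2 = statistics_map(data, gamma=0.5)
    print(s1, s2, implied_fundamentals(s1, s2, gamma=0.5))
```

An agent using this sketch never needs to track how second-period draws vary with first-period draws, which is the sense in which the attention is coarse.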
Proposition OA.10. For any list of censoring thresholds $c_1, \ldots, c_L \in \mathbb{R}$ and fundamentals $\mu_1, \mu_2 \in \mathbb{R}$,

$$\Lambda_1(\mathcal{H}(\Psi(\mu_1, \mu_2, \gamma); c_1, \ldots, c_L)) = \mu_1, \qquad \Lambda_2(\mathcal{H}(\Psi(\mu_1, \mu_2, \gamma); c_1, \ldots, c_L)) = \mu_2 + \gamma \mu_1.$$

Also,

$$\Lambda(\mathcal{H}^\bullet(c_1, \ldots, c_L)) = \Lambda\big(\mathcal{H}(\Psi(\mu_1^\bullet, \mu_2^*(c_1, \ldots, c_L); \gamma); c_1, \ldots, c_L)\big).$$

The first two equations in this claim show that for any $c_1, \ldots, c_L$, $\Psi \mapsto \Lambda(\mathcal{H}(\Psi; c_1, \ldots, c_L))$ is a one-to-one function on the support of $\pi$, and furthermore any values of the statistics $s_1, s_2$ can be rationalized through appropriate choices of $\mu_1, \mu_2$. We may put $\sigma(s_1, s_2) = C(s_1, s_2 - \gamma s_1; \gamma)$ to make $(\Lambda, \sigma)$ an SSS, thus showing the gambler's fallacy is attentionally stable in large datasets. Another implication of this claim is that the limited-attention agent comes to believe the large-generations pseudo-true fundamentals $(\mu_1^\bullet, \mu_2^*(c_1, \ldots, c_L))$ after seeing the history distribution $\mathcal{H}^\bullet(c_1, \ldots, c_L)$. Therefore, the large-generations SSS gives the same behavior as the full-attention Bayesianism in the baseline large-generations environment.

Proof. To see the first two equations, let $c \in \mathbb{R}$, $\mu_1, \mu_2 \in \mathbb{R}$, and write $\Psi = \Psi(\mu_1, \mu_2; \gamma)$. We have

$$E_{h \sim \mathcal{H}(\Psi; c)}[h_2 + \gamma h_1 \mid h_2 \neq \varnothing] = E_{h \sim \mathcal{H}(\Psi; c)}[h_2 \mid h_2 \neq \varnothing] + \gamma E_{h \sim \mathcal{H}(\Psi; c)}[h_1 \mid h_2 \neq \varnothing]$$

$$= E_\Psi[X_2 \mid X_1 \le c] + \gamma E_\Psi[X_1 \mid X_1 \le c]$$

$$= E_\Psi[\mu_2 - \gamma(X_1 - \mu_1) \mid X_1 \le c] + \gamma E_\Psi[X_1 \mid X_1 \le c]$$

$$= E_\Psi[\mu_2 + \gamma \mu_1 \mid X_1 \le c] = \mu_2 + \gamma \mu_1.$$

Since this holds for any $c$, we must get that on the mixed history distribution, $\Lambda_2(\mathcal{H}(\Psi(\mu_1, \mu_2, \gamma); c_1, \ldots, c_L)) = \mu_2 + \gamma \mu_1$ as well.
It is easy to see that we must have $\Lambda_1(\mathcal{H}(\Psi(\mu_1, \mu_2, \gamma); c_1, \ldots, c_L)) = \mu_1$.

To obtain the final equation, first note that we can re-write the second statistic under the true distribution of histories $\mathcal{H}^\bullet(c_1, \ldots, c_L)$ as a weighted average,

$$E_{h \sim \mathcal{H}^\bullet(c_1, \ldots, c_L)}[h_2 + \gamma h_1 \mid h_2 \neq \varnothing] = \sum_{\ell=1}^{L} w_\ell \cdot E_{h \sim \mathcal{H}^\bullet(c_\ell)}[h_2 + \gamma h_1 \mid h_2 \neq \varnothing].$$

This is because the event of $h_2 \neq \varnothing$ happens only when $h_1$ falls below the censoring threshold, so the posterior probability of $h$ being generated from the sub-distribution $\mathcal{H}^\bullet(c_\ell)$ given that $h_2 \neq \varnothing$ depends on the relative likelihoods of $X_1$ falling under the $L$ different censoring thresholds.

For each $\ell$,

$$E_{h \sim \mathcal{H}^\bullet(c_\ell)}[h_2 + \gamma h_1 \mid h_1 \le c_\ell] = \mu_2^\bullet + \gamma E[X_1 \mid X_1 \le c_\ell],$$

where the conditional expectation of $h_2 \mid h_1 \le c_\ell$ is simply $\mu_2^\bullet$ by independence of $X_1$ and $X_2$ under $\Psi^\bullet$. Putting this into the weighted average expression,

$$E_{h \sim \mathcal{H}^\bullet(c_1, \ldots, c_L)}[h_2 + \gamma h_1 \mid h_2 \neq \varnothing] = \mu_2^\bullet + \gamma \sum_{\ell=1}^{L} w_\ell \cdot E[X_1 \mid X_1 \le c_\ell].$$

In order to match the statistics $s_1 = \mu_1^\bullet$ and $s_2 = \mu_2^\bullet + \gamma \sum_{\ell=1}^{L} w_\ell \cdot E[X_1 \mid X_1 \le c_\ell]$ produced by $\Lambda(\mathcal{H}^\bullet(c_1, \ldots, c_L))$, we must therefore have $\mu_1 = \mu_1^\bullet$, and

$$\mu_2^\bullet + \gamma \sum_{\ell=1}^{L} w_\ell \cdot E[X_1 \mid X_1 \le c_\ell] = \mu_2 + \gamma \mu_1^\bullet,$$

which rearranges to

$$\mu_2 = \mu_2^\bullet - \gamma \sum_{\ell=1}^{L} w_\ell \cdot (\mu_1^\bullet - E[X_1 \mid X_1 \le c_\ell]) = \mu_2^*(c_1, \ldots, c_L).$$

OA 8 Additional Extensions of the Baseline Model

I consider further extensions of the baseline model for the Gaussian case.

OA 8.1 Draws as Costs

In the baseline model, I have studied optimal-stopping problems satisfying Assumption 1. One implication of Assumption 1 is that higher draws are more beneficial to the agent, as $u_1$ and $u_2$ are strictly increasing functions of the draws in their respective periods. In this section, I verify the robustness of my positive-feedback loop result when draws are interpreted as costs. Here is the canonical example to keep in mind:

Example OA.6 (Do It Now or Later). The agent has two periods to complete a task. In period 1, she draws her cost of completing the task today, $X_1 = x_1$.
The agent chooses between paying $x_1$ and finishing the task, or waiting until period 2. If she decides to wait, she will draw another cost $X_2 = x_2$ in period 2, which she must then pay. So, $u_1(x_1) = -x_1$ and $u_2(x_1, x_2) = -x_2$.

In optimal-stopping problems like Example OA.6, the subjectively optimal stopping rule given any beliefs about the fundamentals in the two periods features stopping for low values of $X_1$. This means agents observe censored datasets from their predecessors where $X_2$ is only observed following high values of $X_1$, the “opposite” kind of endogenous censoring compared to what happens in problems satisfying Assumption 1. Now, a more heavily censored dataset induces a higher belief about the second-period mean in the next generation, which causes the next generation to accept higher costs in the first period. This exacerbates the censoring and the positive feedback cycle again obtains.

More generally, I will consider payoff functions $u_1(x_1)$, $u_2(x_1, x_2)$ satisfying the following assumptions.

Assumption OA.3. The payoff functions satisfy:
(a) For $x_1' > x_1''$ and $x_2' > x_2''$, $u_1(x_1') < u_1(x_1'')$ and $u_2(x_1', x_2') < u_2(x_1'', x_2'')$.
(b) For $x_1' > x_1''$ and any $\bar{x}_2$, $u_1(x_1') - u_1(x_1'') < -|u_2(x_1', \bar{x}_2) - u_2(x_1'', \bar{x}_2)|$.
(c) There exist $x_1^h, x_2^l, x_1^l, x_2^h \in \mathbb{R}$ so that $u_1(x_1^h) - u_2(x_1^h, x_2^l) < 0$, while $u_1(x_1^l) - u_2(x_1^l, x_2^h) > 0$.
(d) $u_1, u_2$ are continuous. Also, for any $\bar{x}_1 \in \mathbb{R}$, $x_2 \mapsto u_2(\bar{x}_1, x_2)$ is absolutely integrable with respect to any Gaussian distribution on $\mathbb{R}$.

I show that the subjectively optimal stopping strategy under dogmatic belief in fundamentals $\mu_1, \mu_2$ takes a cutoff form, but the agent stops in period 1 for low realizations of period 1 costs, $X_1 \le c$. Furthermore, the optimal cutoff increases in $\mu_2$.

Proposition OA.11.
Under Assumption OA.3 and for $\gamma > 0$,
• Under each feasible model $\Psi(\mu_1, \mu_2; \gamma)$, there exists a cutoff threshold $C_{\text{cost}}(\mu_1, \mu_2; \gamma) \in \mathbb{R}$ such that it is strictly optimal to continue whenever $x_1 > C_{\text{cost}}(\mu_1, \mu_2; \gamma)$ and strictly optimal to stop whenever $x_1 < C_{\text{cost}}(\mu_1, \mu_2; \gamma)$.
• For every $\mu_1 \in \mathbb{R}$, $\mu_2 \mapsto C_{\text{cost}}(\mu_1, \mu_2; \gamma)$ is strictly increasing.

Proof. Consider the pair of payoff functions $\tilde{u}_1 : \mathbb{R} \to \mathbb{R}$ and $\tilde{u}_2 : \mathbb{R}^2 \to \mathbb{R}$ defined by $\tilde{u}_1(\tilde{x}_1) := u_1(-\tilde{x}_1)$ and $\tilde{u}_2(\tilde{x}_1, \tilde{x}_2) := u_2(-\tilde{x}_1, -\tilde{x}_2)$. It is easy to verify that since $u_1, u_2$ satisfy Assumption OA.3, $\tilde{u}_1, \tilde{u}_2$ must satisfy Assumption 1. When $(X_1, X_2) \sim \Psi(\mu_1, \mu_2; \gamma)$, we also have $(\tilde{X}_1, \tilde{X}_2) \sim \Psi(-\mu_1, -\mu_2; \gamma)$, where $(\tilde{X}_1, \tilde{X}_2) = (-X_1, -X_2)$. The best stopping strategy under the payoff functions $\tilde{u}_1, \tilde{u}_2$ when draws are generated from $\Psi(-\mu_1, -\mu_2; \gamma)$ involves the cutoff threshold $C(-\mu_1, -\mu_2; \gamma)$, stopping whenever $\tilde{X}_1$ exceeds the threshold and continuing whenever $\tilde{X}_1$ falls below it. Here, $C(-\mu_1, -\mu_2; \gamma)$ is the usual acceptance threshold from Proposition 1.

By the relationship between $u_1, u_2$ and $\tilde{u}_1, \tilde{u}_2$, we deduce that the optimal stopping strategy under the payoff functions $u_1, u_2$ when draws are generated from $\Psi(\mu_1, \mu_2; \gamma)$ involves the cutoff threshold $C_{\text{cost}}(\mu_1, \mu_2; \gamma) := -C(-\mu_1, -\mu_2; \gamma)$. The agent should stop when the first (cost-based) draw falls below $C_{\text{cost}}(\mu_1, \mu_2; \gamma)$, and continue when the first draw exceeds the cutoff.

For $\mu_2' > \mu_2''$, we have $C(-\mu_1, -\mu_2'; \gamma) < C(-\mu_1, -\mu_2''; \gamma)$ by Proposition 1. So,

$$C_{\text{cost}}(\mu_1, \mu_2'; \gamma) = -1 \cdot C(-\mu_1, -\mu_2'; \gamma) > -1 \cdot C(-\mu_1, -\mu_2''; \gamma) = C_{\text{cost}}(\mu_1, \mu_2''; \gamma)$$

as desired.

As Proposition OA.11 shows, the subjectively optimal stopping rules in problems satisfying Assumption OA.3 imply a different kind of censoring than in the baseline model. Specifically, the history contains the second-period draw only when $X_1$ is high. For $c \in \mathbb{R}$, let $\bar{S}_c$ denote the stopping strategy $\bar{S}_c(x_1) = \text{Continue}$ if $x_1 > c$, $\bar{S}_c(x_1) = \text{Stop}$ if $x_1 \le c$.
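The bar notation distinguishes it from $S_c$, the stopping strategy with the stopping region $[c, \infty)$. For $c, \mu_1, \mu_2 \in \mathbb{R}$, the KL divergence between $\mathcal{H}(\Psi^\bullet; \bar{S}_c)$ and $\mathcal{H}(\Psi(\mu_1, \mu_2; \gamma); \bar{S}_c)$ is given by

$$\int_{-\infty}^{c} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \ln\left( \frac{\phi(x_1; \mu_1^\bullet, \sigma^2)}{\phi(x_1; \mu_1, \sigma^2)} \right) dx_1 + \int_{c}^{\infty} \left\{ \int_{-\infty}^{\infty} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \phi(x_2; \mu_2^\bullet, \sigma^2) \cdot \ln\left[ \frac{\phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \phi(x_2; \mu_2^\bullet, \sigma^2)}{\phi(x_1; \mu_1, \sigma^2) \cdot \phi(x_2; \mu_2 - \gamma(x_1 - \mu_1), \sigma^2)} \right] dx_2 \right\} dx_1.$$

Proposition OA.12. The pseudo-true fundamentals minimizing $D_{KL}(\mathcal{H}(\Psi^\bullet; \bar{S}_c) \,\|\, \mathcal{H}(\Psi(\mu_1, \mu_2; \gamma); \bar{S}_c))$ are $\mu_1^*(c) = \mu_1^\bullet$ and $\mu_2^*(c) = \mu_2^\bullet - \gamma(\mu_1^\bullet - E[X_1 \mid X_1 \ge c])$. So $\mu_2^*(c)$ is strictly increasing in $c$.

Since $E[X_1 \mid X_1 \ge c] > \mu_1^\bullet$ for every $c \in \mathbb{R}$ and $\gamma > 0$, this shows the pseudo-true second-period fundamental is always too high for every stopping strategy $\bar{S}_c$. The direction of misinference about $\mu_2$ is the opposite of that in the main text, due to the opposite asymmetry in data censoring. Still, the key mechanism behind the misinference remains the same: the interaction between the (opposite kind of) censoring effect and the gambler's fallacy, as an unbiased agent with $\gamma = 0$ and a biased agent facing uncensored data both infer $\mu_2$ correctly. Since high values of draws are bad news in the environment where draws are interpreted as costs, this shows agents end up with over-pessimistic beliefs about the distributions, as over-estimating $\mu_2$ corresponds to making an overly unfavorable assessment about the second period.

Proof. Rewrite $D_{KL}(\mathcal{H}(\Psi^\bullet; \bar{S}_c) \,\|\, \mathcal{H}(\Psi(\mu_1, \mu_2; \gamma); \bar{S}_c))$ as

$$\int_{-\infty}^{\infty} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \ln\left( \frac{\phi(x_1; \mu_1^\bullet, \sigma^2)}{\phi(x_1; \mu_1, \sigma^2)} \right) dx_1 + \int_{c}^{\infty} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \int_{-\infty}^{\infty} \phi(x_2; \mu_2^\bullet, \sigma^2) \ln\left[ \frac{\phi(x_2; \mu_2^\bullet, \sigma^2)}{\phi(x_2; \mu_2 - \gamma(x_1 - \mu_1), \sigma^2)} \right] dx_2 \, dx_1.$$

The KL divergence between $N(\mu_{\text{true}}, \sigma_{\text{true}}^2)$ and $N(\mu_{\text{model}}, \sigma_{\text{model}}^2)$ is $\ln \frac{\sigma_{\text{model}}}{\sigma_{\text{true}}} + \frac{\sigma_{\text{true}}^2 + (\mu_{\text{true}} - \mu_{\text{model}})^2}{2 \sigma_{\text{model}}^2} - \frac{1}{2}$, so we may simplify the first term and the inner integral of the second term:

$$\frac{(\mu_1 - \mu_1^\bullet)^2}{2 \sigma^2} + \int_{c}^{\infty} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \left[ \frac{\sigma^2 + (\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet)^2}{2 \sigma^2} - \frac{1}{2} \right] dx_1.$$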
Dropping constant terms not depending on $\mu_1$ and $\mu_2$ and multiplying by $\sigma^2$, we get a simplified expression of the objective,

$$\xi(\mu_1, \mu_2) := \frac{(\mu_1 - \mu_1^\bullet)^2}{2} + \int_{c}^{\infty} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \left[ \frac{(\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet)^2}{2} \right] dx_1.$$

We have the partial derivatives by differentiating under the integral sign,

$$\frac{\partial \xi}{\partial \mu_2} = \int_{c}^{\infty} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot (\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet) \, dx_1,$$

$$\frac{\partial \xi}{\partial \mu_1} = (\mu_1 - \mu_1^\bullet) + \gamma \int_{c}^{\infty} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot (\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet) \, dx_1 = (\mu_1 - \mu_1^\bullet) + \gamma \frac{\partial \xi}{\partial \mu_2}.$$

By the first-order conditions, at the minimum $(\mu_1^*, \mu_2^*)$, we must have

$$\frac{\partial \xi}{\partial \mu_1}(\mu_1^*, \mu_2^*) = \frac{\partial \xi}{\partial \mu_2}(\mu_1^*, \mu_2^*) = 0 \ \Rightarrow\ \mu_1^* = \mu_1^\bullet.$$

So $\mu_2^*$ satisfies $\frac{\partial \xi}{\partial \mu_2}(\mu_1^\bullet, \mu_2^*) = 0$, which by straightforward algebra shows $\mu_2^*(c) = \mu_2^\bullet - \gamma(\mu_1^\bullet - E[X_1 \mid X_1 \ge c])$.

By Proposition OA.11, a higher belief in the second-period fundamental leads agents to choose a higher cutoff threshold $c$, which leads to more severe censoring of the dataset as $X_2$ is only observed when $X_1 \ge c$. This more severely censored dataset, in turn, leads to even higher belief in the second-period fundamental by Proposition OA.12. So as in Theorem 2, cutoff thresholds and beliefs about the fundamentals form monotonic sequences across generations $t \ge 1$.

OA 8.2 Population with Heterogeneity in Selection Neglect

In Section 3's learning environment, selection neglect is unlikely to appear. Bayesian inference simply takes the form of updating beliefs using what the agent sees during the stage game: either $X_1$ or the pair $(X_1, X_2)$.

I believe even the large-generations learning environment from Section 4 is unlikely to evoke selection neglect, a psychology most likely to be present when the observed dataset does not contain reminders about selection. Censoring is highly explicit and salient in my setting, which is not the type of framing that typically evokes selection neglect.

In Enke (2019)'s experiment on selection neglect, players (one human subject and five computer players following a mechanical rule) are asked to guess a “state of the world” based on the average of 6 private signals.
Players are sorted into one of two groups based on whether their own private signal is high or low, then observe the signals of others in their group. In the baseline treatment, there is no reminder of the excluded data on the decision screen where subjects are shown the signals of others in the same group and asked to enter a guess. This treatment finds selection neglect. Another “nudge” treatment where subjects are given a simple hint stating: “Also pay attention to those randomly drawn balls that are not shown to you by the information source” reduces the number of selection neglecters by 50%. So I believe the much clearer reminders of selection in my environment should reduce the frequency of selection neglect even further.

Jehiel (2018) studies misperceived investment returns under selection neglect. In his model, each predecessor has a potential project and observes a private signal about the project's quality. Predecessors with high signals implement their projects. Agents in the current generation observe the pool of implemented projects, then generate their own signals about the qualities of these observed projects. These signals are independent of the actual private signals that the predecessors used for implementation decisions. Current agents infer the conditional quality given each signal using the empirical mean quality among past implemented projects generating the same signal. This is another environment where the dataset contains no hints about the existence of excluded data (the unimplemented projects) or the selection criterion (the private signals of predecessors).
In fact, if datasets in Jehiel (2018)'s setting record the complete experience of the predecessors in their decision problems, as is the case in my history datasets, then the misinference result no longer holds.

Nevertheless, in this section, I study an extension of the baseline model where a fraction $0 \le \alpha < 1$ of the agents in each generation are selection neglecters. Given a dataset of histories $(h_{1,n}, h_{2,n})_{n \in [0,1]}$, they treat $(h_{1,n})_{n \in [0,1]}$ as a sample from the unconditional distribution of $X_1$, and $(h_{2,n})_{n : h_{2,n} \neq \varnothing}$ as an independent sample from the unconditional distribution of $X_2$. Relative to the baseline agents, they mistake the selection process by which $h_{2,n}$'s appear in the dataset: they are not censored at random, but only censored when $h_{1,n}$ exceeds the acceptance threshold used by the predecessors. In this environment, the gambler's fallacy and selection neglect exactly cancel each other out, since in large datasets the mean of $h_{1,n}$ is $\mu_1^\bullet$ and the mean of uncensored $h_{2,n}$ is $\mu_2^\bullet$. This shows that from the dataset $\mathcal{H}^\bullet(c)$ for any $c \in \mathbb{R}$, the selection neglecters correctly infer the fundamentals and choose the stopping strategy with cutoff $C(\mu_1^\bullet, \mu_2^\bullet; \gamma)$.

Now consider a baseline agent with the gambler's fallacy, facing a dataset of histories generated by a heterogeneous population of predecessors. A fraction $\alpha$ of the histories are generated by selection neglecters using the stopping strategy $S_{C(\mu_1^\bullet, \mu_2^\bullet; \gamma)}$. The remaining $1 - \alpha$ fraction are generated by baseline predecessors using the stopping strategy $S_c$. The next Proposition characterizes the pseudo-true fundamentals minimizing the weighted-average KL-divergence objective,

$$\alpha D_{KL}\big(\mathcal{H}^\bullet(C(\mu_1^\bullet, \mu_2^\bullet; \gamma)) \,\|\, \mathcal{H}(\Psi(\mu_1, \mu_2; \gamma); C(\mu_1^\bullet, \mu_2^\bullet; \gamma))\big) + (1 - \alpha) D_{KL}\big(\mathcal{H}^\bullet(c) \,\|\, \mathcal{H}(\Psi(\mu_1, \mu_2; \gamma); c)\big). \quad (4)$$

Proposition OA.13.
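The pseudo-true fundamentals minimizing Equation (4) when baseline predecessors use the stopping threshold $c$ are $\mu_1^{SN} = \mu_1^\bullet$ and

$$\mu_2^{SN}(c) = \frac{\alpha P[X_1 \le C(\mu_1^\bullet, \mu_2^\bullet; \gamma)]}{\alpha P[X_1 \le C(\mu_1^\bullet, \mu_2^\bullet; \gamma)] + (1 - \alpha) P[X_1 \le c]} \cdot \mu_2^*(C(\mu_1^\bullet, \mu_2^\bullet; \gamma)) + \frac{(1 - \alpha) P[X_1 \le c]}{\alpha P[X_1 \le C(\mu_1^\bullet, \mu_2^\bullet; \gamma)] + (1 - \alpha) P[X_1 \le c]} \cdot \mu_2^*(c).$$

(The selection neglecters' cutoff $C(\mu_1^\bullet, \mu_2^\bullet; \gamma)$ may nevertheless differ from the objectively optimal one, since the selection neglecters also suffer from the gambler's fallacy, so they believe in the joint distribution $\Psi(\mu_1^\bullet, \mu_2^\bullet; \gamma)$.)

Proof. Let $w_1 = \alpha$, $w_2 = 1 - \alpha$, $c_1 = C(\mu_1^\bullet, \mu_2^\bullet; \gamma)$, $c_2 = c$. By simple algebra, we may rewrite Equation (4) as

$$\frac{(\mu_1 - \mu_1^\bullet)^2}{2 \sigma^2} + \sum_{k=1}^{2} w_k \left\{ \int_{-\infty}^{c_k} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \left[ \frac{\sigma^2 + (\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet)^2}{2 \sigma^2} - \frac{1}{2} \right] dx_1 \right\}.$$

Dropping terms not dependent on $\mu_1, \mu_2$ and multiplying through by $\sigma^2$, we get the simplified objective

$$\xi^{SN}(\mu_1, \mu_2) := \frac{(\mu_1 - \mu_1^\bullet)^2}{2} + \sum_{k=1}^{2} w_k \left\{ \int_{-\infty}^{c_k} \phi(x_1; \mu_1^\bullet, \sigma^2) \cdot \left[ \frac{(\mu_2 - \gamma(x_1 - \mu_1) - \mu_2^\bullet)^2}{2} \right] dx_1 \right\}.$$

The first-order condition is only satisfied at $\mu_1^{SN} = \mu_1^\bullet$ and

$$\mu_2^{SN} = \frac{1}{w_1 P[X_1 \le c_1] + w_2 P[X_1 \le c_2]} \sum_{k=1}^{2} w_k P[X_1 \le c_k] \left\{ \mu_2^\bullet - \gamma(\mu_1^\bullet - E[X_1 \mid X_1 \le c_k]) \right\}.$$

This shows, in terms of the expressions for the pseudo-true fundamentals $\mu_2^*$ in the baseline model,

$$\mu_2^{SN}(c) = \frac{\alpha P[X_1 \le C(\mu_1^\bullet, \mu_2^\bullet; \gamma)]}{\alpha P[X_1 \le C(\mu_1^\bullet, \mu_2^\bullet; \gamma)] + (1 - \alpha) P[X_1 \le c]} \cdot \mu_2^*(C(\mu_1^\bullet, \mu_2^\bullet; \gamma)) + \frac{(1 - \alpha) P[X_1 \le c]}{\alpha P[X_1 \le C(\mu_1^\bullet, \mu_2^\bullet; \gamma)] + (1 - \alpha) P[X_1 \le c]} \cdot \mu_2^*(c).$$

That is, with a mixture of selection-neglecter and baseline predecessors, baseline agents' inference about the second-period fundamental is a convex combination between what they would infer from the histories of the selection neglecters alone and what they would infer from the histories of the baseline predecessors alone.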
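The convex-combination formula can be evaluated numerically. The sketch below is illustrative only: it takes the selection neglecters' cutoff as an input `c_sn` rather than computing $C(\mu_1^\bullet, \mu_2^\bullet; \gamma)$ (which depends on the payoff functions), uses the baseline expression $\mu_2^*(c) = \mu_2^\bullet - \gamma(\mu_1^\bullet - E[X_1 \mid X_1 \le c])$, and relies on the standard formula for the mean of a Gaussian truncated from above. All function names and default parameter values are assumptions made for the example.

```python
from math import erf, exp, pi, sqrt


def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)


def mean_below(mu, sigma, c):
    """E[X | X <= c] for X ~ N(mu, sigma^2)."""
    z = (c - mu) / sigma
    return mu - sigma * norm_pdf(z) / norm_cdf(z)


def mu2_star(c, mu1_true=0.0, mu2_true=0.0, sigma=1.0, gamma=0.5):
    """Baseline pseudo-true second-period fundamental under cutoff c."""
    return mu2_true - gamma * (mu1_true - mean_below(mu1_true, sigma, c))


def mu2_sn(alpha, c, c_sn, mu1_true=0.0, mu2_true=0.0, sigma=1.0, gamma=0.5):
    """Pseudo-true second-period fundamental with a fraction alpha of
    selection-neglecter predecessors who use cutoff c_sn."""
    p_sn = norm_cdf((c_sn - mu1_true) / sigma)   # P[X1 <= c_sn]
    p_base = norm_cdf((c - mu1_true) / sigma)    # P[X1 <= c]
    weight_sn = alpha * p_sn / (alpha * p_sn + (1.0 - alpha) * p_base)
    return (weight_sn * mu2_star(c_sn, mu1_true, mu2_true, sigma, gamma)
            + (1.0 - weight_sn) * mu2_star(c, mu1_true, mu2_true, sigma, gamma))


if __name__ == "__main__":
    print(mu2_star(-0.5), mu2_star(0.0), mu2_sn(0.3, -0.5, 0.0))
```

The weight on each sub-dataset reflects both its population share and how often it reveals a second-period draw, which is why `p_sn` and `p_base` enter alongside `alpha`.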
The relative weights given to these two pseudo-true second-period fundamentals depend on the relative sizes of the two subpopulations, as well as on how frequently second-period draws are observed in each of the two sub-datasets.

Since both $\mu_2^*(C(\mu_1^\bullet, \mu_2^\bullet; \gamma))$ and $\mu_2^*(c)$ are strictly below $\mu_2^\bullet$, we immediately conclude the same holds for $\mu_2^{SN}(c)$ for any $c \in \mathbb{R}$. This shows the robustness of the over-pessimism result from the main text to the presence of a fraction of selection neglecters.

Next, I compare a baseline society with no selection neglecters with a second society containing a positive fraction of selection neglecters. I show that when the two societies start with the same generation 0 behavior, the society with selection neglecters holds more optimistic beliefs about the second-period fundamental and uses a higher stopping threshold in every generation $t \ge 2$. This is not simply due to the mechanical reason that the selection neglecters always make the correct inferences about the fundamentals, thereby making the “average” belief in the society more optimistic. The presence of the selection neglecters also moderates the over-pessimism of the baseline gambler's fallacy agents (without completely eliminating it), by making the censoring effect less severe.

Corollary OA.3. Let $0 < \alpha < 1$. Consider two societies, $A$ and $B$, where society $A$ has no selection neglecters and society $B$ has an $\alpha$ fraction of selection neglecters in each generation. Suppose both societies start at the same initial condition $c_{[0]} \in \mathbb{R}$. For $t \ge 1$, consider the auxiliary learning environment and denote the baseline agents' beliefs and cutoff thresholds in society $k \in \{A, B\}$ as $\mu^k_{1,[t]}, \mu^k_{2,[t]}, c^k_{[t]}$. Then for every $t \ge 2$, $\mu^B_{2,[t]} > \mu^A_{2,[t]}$ and $c^B_{[t]} > c^A_{[t]}$.

Proof. From Proposition OA.13 (and Example 2 for the case of $t = 1$), $\mu^A_{1,[t]} = \mu^B_{1,[t]} = \mu_1^\bullet$ for every $t \ge 1$.
Also, in the first generation, $\mu^A_{2,[1]} = \mu^B_{2,[1]}$ and $c^A_{[1]} = c^B_{[1]}$ since both societies face the same dataset $\mathcal{H}^\bullet(c_{[0]})$. Since $\mu^A_{2,[1]} < \mu_2^\bullet$, we must have $c^A_{[1]} = C(\mu_1^\bullet, \mu^A_{2,[1]}; \gamma)$ [...]

In the baseline model, the history $h_n$ of predecessor $n \in [0, 1]$ records just the first-period draw $h_n = (x_{1,n}, \varnothing)$ if $n$ stopped in period 1, and it records both draws $h_n = (x_{1,n}, x_{2,n})$ if $n$ continued into period 2. An outcome history differs from a history of the baseline model in that it always records only one draw: the one from the period where the agent stops. So, predecessor $n$'s outcome history $h^o_n$ is either $h^o_n = (x_{1,n}, \varnothing)$ or $h^o_n = (\varnothing, x_{2,n})$. This kind of observation may be natural when the optimal-stopping problem is search without recall (i.e. Example 1 with $q = 0$) and managers in the current generation only know about the candidates who were eventually hired in the previous generation across various firms, but not the early-phase candidates who were discovered but let go.

Write $\mathcal{H}^o(\Psi^\bullet; c)$ for the distribution of predecessors' outcome histories when $(X_1, X_2) \sim \Psi^\bullet$ and predecessors use the cutoff threshold $c$. I show that agents using a method-of-moments (MOM) inference procedure analogous to the one in Online Appendix OA 5 will still infer the pseudo-true fundamentals associated with the usual history distribution $\mathcal{H}(\Psi^\bullet; c)$ in the baseline model. To be precise, MOM agents find $\mu_1^{MO}, \mu_2^{MO}$ so that $\mathcal{H}^o(\Psi(\mu_1^{MO}, \mu_2^{MO}; \gamma); c)$ matches $\mathcal{H}^o(\Psi^\bullet; c)$ in terms of the sample means of the uncensored first-period draws and uncensored second-period draws. As $\mu_1 \mapsto E_{\tilde{X}_1 \sim N(\mu_1, \sigma^2)}[\tilde{X}_1 \mid \tilde{X}_1 \ge c]$ is a strictly increasing function, the MOM inference $\mu_1^{MO}$ must correctly estimate the first-period fundamental, $\mu_1^{MO} = \mu_1^\bullet$.
Also, note that for any $\hat{\mu}_1, \hat{\mu}_2 \in \mathbb{R}$ and any $\hat{\gamma} \ge 0$, the second moment is the same in the outcome histories distribution $\mathcal{H}^o(\Psi(\hat{\mu}_1, \hat{\mu}_2; \hat{\gamma}); c)$ as in the baseline histories distribution $\mathcal{H}(\Psi(\hat{\mu}_1, \hat{\mu}_2; \hat{\gamma}); c)$. By the method-of-moments interpretation of $\mu_2^*(c)$, we conclude that $\mu_2^{MO}(c) = \mu_2^*(c)$ for all $c \in \mathbb{R}$.

The KL-divergence minimizing pseudo-fundamentals for agents observing outcomes prove difficult to calculate analytically. This is because the likelihood of the outcome history $h^o_n = (\varnothing, x_{2,n})$ is given by an integral over its likelihoods for different censored realizations of $X_1$. Using numerical simulations, I show in Section OA 9.3 that when Bayesian agents with a correct dogmatic belief about $\mu_1^\bullet$ face a large, finite dataset of outcome histories, their inference about the second-period fundamental seems to closely match $\mu_2^*(c)$. It remains an open conjecture whether the minimizers in these two different KL-divergence minimization problems in fact coincide exactly.

OA 9 Numerical Simulations

OA 9.1 Pessimism and Fictitious Variation in Finite Datasets

Lemma OA.3 proves that when an agent with a full-support prior $m : \mathbb{R}^2 \to \mathbb{R}_+$ observes $N$ histories drawn from the distribution $\mathcal{H}^\bullet(c)$ in the Gaussian case, then as $N$ goes to infinity her posterior belief almost surely concentrates on the KL-divergence minimizing pseudo-true parameters. In this section, I use simulations to check how well the predictions of Proposition 2 and Proposition 7 hold up in finite datasets. In particular, I am interested in the pessimistic inference in Proposition 2 and the fictitious variation in Proposition 7.

I consider the objective distribution $(X_1, X_2) \sim \Psi(\mu_1^\bullet, \mu_2^\bullet, \sigma^2, \sigma^2; 0)$ with $\mu_1^\bullet = \mu_2^\bullet = 0$, $\sigma^2 = 1$, and a stopping rule that censors $X_2$ whenever $X_1 \ge 1$. I suppose agents have dogmatic belief in $\gamma = 0.$[...] and a flat prior over $\mathbb{R}$[...]. In Figures OA.1 and OA.2, I plot distributions of the Bayesian posterior mode after a dataset of size $N = 100, 1000, 10000$.
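A simplified version of this finite-dataset exercise can be sketched as follows. The sketch makes several assumptions that are not taken from the paper: it treats the variances as known and equal to 1, uses a hypothetical bias parameter `gamma = 0.5`, and exploits the fact that under a flat prior the posterior mode of the means equals the maximum-likelihood estimate, which in this Gaussian model is the sample mean of first-period draws for $\mu_1$ and the mean of the re-centered observations minus $\gamma$ times that sample mean for $\mu_2$. It does not attempt to reproduce the percentages reported in the text.

```python
import random
from math import erf, exp, pi, sqrt


def estimate_means(n, gamma, cutoff, rng):
    """Simulate n histories from the objective model (independent N(0, 1)
    draws, second draw censored when x1 >= cutoff) and return the
    flat-prior posterior mode (mu1_hat, mu2_hat) of a biased agent."""
    sum_first = 0.0
    sum_recentered = 0.0
    n_uncensored = 0
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        sum_first += x1
        if x1 < cutoff:
            x2 = rng.gauss(0.0, 1.0)       # independent of x1 under the truth
            sum_recentered += x2 + gamma * x1
            n_uncensored += 1
    if n_uncensored == 0:
        raise ValueError("no uncensored second-period draws in this dataset")
    mu1_hat = sum_first / n
    mu2_hat = sum_recentered / n_uncensored - gamma * mu1_hat
    return mu1_hat, mu2_hat


def pseudo_true_mu2(gamma, cutoff):
    """Large-sample limit: gamma * E[X1 | X1 <= cutoff] for X1 ~ N(0, 1)."""
    density = exp(-0.5 * cutoff * cutoff) / sqrt(2.0 * pi)
    cdf = 0.5 * (1.0 + erf(cutoff / sqrt(2.0)))
    return -gamma * density / cdf


if __name__ == "__main__":
    rng = random.Random(2019)
    gamma, cutoff = 0.5, 1.0               # gamma is a hypothetical value
    for n in (100, 1000, 10000):
        draws = [estimate_means(n, gamma, cutoff, rng)[1] for _ in range(200)]
        share_low = sum(d < 0.0 for d in draws) / len(draws)
        print(n, round(sum(draws) / len(draws), 4), share_low)
    print("pseudo-true value:", round(pseudo_true_mu2(gamma, cutoff), 4))
```

As the dataset grows, the simulated estimates of $\mu_2$ should cluster around the pseudo-true value rather than the true value of 0, which is the pessimism that the histograms are meant to display.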
I find that when $N = 100$, there is a 91.9% chance that agents underestimate the second-period fundamental and a 78.9% chance that they believe in fictitious variation for the second-period conditional variance. These probabilities grow to virtually 100% for $N = 1000$ and $N = 10000$.

[Figure OA.1: six histogram panels, "Posterior mode for $\mu_1$" and "Posterior mode for $\mu_2$" at $N = 100$, $1000$, $10000$; x-axis: posterior mode; y-axis: Density.]

Figure OA.1: Histograms of inferences about fundamentals in finite datasets. The red lines in the histograms for $\mu_1$ denote the pseudo-true fundamental (and also the true fundamental) $\mu_1^*(c = 1) = 0$. The blue lines in the histograms for $\mu_2$ denote the true fundamental $\mu_2^\bullet = 0$, while the red lines show the pseudo-true fundamental $\mu_2^*(c = 1)$, which is negative.

[Figure OA.2: six histogram panels, "Posterior mode for $\sigma_1^2$" and "Posterior mode for $\sigma_2^2$" at $N = 100$, $1000$, $10000$; x-axis: posterior mode; y-axis: Density.]

Figure OA.2: Histograms of inferences about variances in finite datasets. The red lines in the histograms for $\sigma_1^2$ denote the pseudo-true variance (and also the true variance) $(\sigma_1^*)^2(c = 1) = 1$. The blue lines in the histograms for $\sigma_2^2$ denote the true fundamental $(\sigma_2^\bullet)^2 = 1$, while the red lines show the pseudo-true fundamental $(\sigma_2^*)^2(c = 1)$, which exceeds 1.

OA 9.2 Welfare Implications of Endogenous Learning

In this paper, I have emphasized the dynamics of mislearning and the interaction between distorted stopping strategy and distorted beliefs under the gambler's fallacy. The positive-feedback cycle between censoring and the gambler's fallacy leads to additional welfare implications beyond what would happen with the gambler's fallacy alone in a static, exogenous-data setting. Figure OA.3 returns to the illustrative example used for Figure 1 and compares the expected loss (relative to using the objectively optimal stopping rule) in the learning steady state versus the expected loss for the first-generation agents. Recall that Figure 1 considers search without recall, so $u_1(x_1) = x_1$, $u_2(x_1, x_2) = x_2$, with true fundamentals $\mu_1^\bullet = \mu_2^\bullet = 0$. As I initialize the 0th generation with the objectively optimal stopping threshold $c_{[0]} = 0$, misinference from the gambler's fallacy is solely responsible for the first-generation loss. The long-run loss, on the other hand, is exacerbated by successive generations of predecessors lowering their stopping threshold and thus censoring the dataset with increasing severity. As Figure OA.3 shows, the fraction of long-run losses attributable to passive inference under the gambler's fallacy falls with the degree of the bias, highlighting the need for the dynamic analysis, especially in environments where we expect the bias to be more serious.

[Figure OA.3: plot titled "Positive feedback amplifies first-generation loss"; x-axis: believed correlation; y-axis: first gen loss / long-run loss.]

Figure OA.3: Welfare loss in the first generation as a fraction of the total long-run welfare loss, as a function of the believed correlation between $X_1$ and $X_2$. A more negative correlation corresponds to a larger $\gamma$ and a more severe gambler's fallacy bias.

OA 9.3 Inference of Misspecified Bayesian Agents when Observing Only the Final Draw

Consider a Bayesian agent with the (improper) flat prior over the class of models { Ψ( µ , µ , σ , σ , γ ) : µ = 0 , σ = 1 , γ = 0 . , µ ∈ R } .

[Figure: plot titled "Inference from Baseline Histories and Outcome Histories".]