Competing Models∗

José Luis Montiel Olea†, Pietro Ortoleva‡, Mallesh Pai§, Andrea Prat¶

First version: October 2017
This version: June 19, 2020
Abstract
Different agents compete to predict a variable of interest. All agents are Bayesian, but may have ‘misspecified models’ of the world, i.e., they consider different subsets of explanatory variables to make their prediction. After observing a common dataset, who has the highest confidence in her predictive ability? We characterize it and show that it crucially depends on the sample size. With small samples, it is an agent using a small-dimensional model, in the sense of using a smaller number of variables relative to the true data generating process. With large samples, it is typically an agent with a large-dimensional model, possibly including irrelevant variables, but never excluding relevant ones. Applications include auctions of assets where bidders observe the same information but hold different priors.

∗We thank Sylvain Chassang, Kfir Eliaz, Yuhta Ishii, Annie Liang, Jonathan Libgober, George Mailath, Stephen Morris, Wolfgang Pesendorfer, David Pearce, Luciano Pomatto, Rani Spiegler, and the participants of many seminars and conferences for their useful comments and suggestions. Ortoleva gratefully acknowledges the financial support of NSF Grant SES-1763326. Pai gratefully acknowledges the financial support of NSF Grant CCF-1763349.
†Department of Economics, Columbia University. Email: [email protected]
‡Department of Economics and Woodrow Wilson School, Princeton University. Email: [email protected].
§Department of Economics, Rice University. Email: [email protected].
¶Columbia Business School and Department of Economics, Columbia University. Email: [email protected].

1 Introduction
Consider a setting where agents with different priors compete to predict a variable y. Agents use a (possibly misspecified) statistical model that treats y as a linear function of a number of possible covariates {x_i}_{i∈{1,...,k}} plus a noise term, i.e., y = Σ_i β_i x_i + ε. For example, y could be a country’s GDP growth, which agents are trying to predict using a long list of variables x. Both the β_i’s and the variance of ε are potentially unknown. Agents share the same quadratic loss function for their prediction, but use different models: they consider different subsets of covariates as relevant to the prediction. In the GDP example, some may believe that relevant factors include education level and net trade surplus; others may also consider monetary supply and climate change data. All agents are Bayesian and update their prior after observing a common dataset: n draws of y and x from the unknown data generating process.

We ask the following question: What are the characteristics of the model of the agent that, after observing the data, has the highest subjective confidence in its predictive ability, i.e., has the lowest posterior expected loss?

We show that the answer depends both on the size of the dataset n and on the model dimension of the agent, i.e., how many variables are considered. With small samples the most confident agent uses a model that is small-dimensional, with a smaller number of variables relative to the true data generating process and, perhaps, including covariates that have no explanatory power. Instead, with large samples the most confident agent uses a large-dimensional model that possibly includes irrelevant variables, but never excludes relevant ones.

There are many competitive situations in which subjective confidence is a key driving force and prominence is acquired by the most confident agents. For example, consider a second-price auction to acquire a productive asset whose value depends on the agent’s ability to predict a given variable using a set of covariates: the asset could be ad-spaces on an online platform, the value of which depends on the sellers’ ability to infer customers’ preferences using observable characteristics; or it could be a company, whose future value depends on how accurately the new owner is able to predict market conditions. All bidders observe the same data, but may use different variables to make predictions, as they may have different priors. All else equal, the auction is won by the agent who is most confident in her predictive ability, according to her own posterior. More generally, our results are useful to characterize who emerges in competitive situations in which a leading position is taken by those who are most subjectively confident in their predictive ability, e.g., political or corporate competition.

There is a trivial reason why agents with simpler models may be more confident: their models may contain less uncertainty about the world. The most extreme case is when a subject is dogmatic: she has a—right or wrong—deterministic model. That subject would believe she has perfect predictive ability—and would bid high in the auction context above. This reason is not what drives our results: as we discuss, our conclusions hold even if we impose the condition that all agents have the same confidence level before observing any data.
As we will see, our results depend on how model complexity affects how confidence evolves as agents process new data.

Our results sit within the large and growing body of work in economic theory on agents with misspecified models (we defer a full discussion of the literature to Section 5). In particular, our paper can be seen as trying to understand if, or when, competition selects agents with correctly specified models. Our results may also, at a high level, be reminiscent of model-selection methods in Econometrics and Machine Learning, with one big difference: our results here emerge positively, as the outcome of competition among different purely-Bayesian decision makers using different models. By contrast, the model-selection literature is motivated normatively by the need to avoid over-fitting: large-dimensional models may be too flexible and give an illusion of fitting the data. It proposes and studies techniques to explicitly penalize large-dimensional models—there is no such penalty in our setting.

Summary of Results and Intuition.
Our first result, Lemma 1, characterizes the expected posterior loss of an agent as a function of her subjective prior and observed data. We show that this loss can be decomposed as the sum of two components, which we term: 1) model fit: the agent’s posterior expectation of the variance of the regression residual, ε; and 2) model estimation uncertainty: the degree of uncertainty that the agent has about the coefficients in her regression model. Crucially, we show that the latter in turn depends on the model dimension. This implies that while Bayesian agents use their posteriors to compute the best action and do not care about the dimension of the model they are using, this very dimension affects their confidence in their own predictive ability.

This characterization has two immediate implications, depending on the size of the dataset. Suppose first that the dataset is very large. Then, the latter component vanishes: agents will have no uncertainty about their fitted parameters, even if they are using the wrong model. Expected posterior loss, and thus subjective confidence, is therefore based solely on model fit. As a result, incorrectly specified models, i.e., models which omit an observable that is relevant for prediction, never prevail. At the same time, we show that larger models, those that contain additional observables that are irrelevant to the true data generating process (DGP), may continue to win even asymptotically. Even though these larger models will converge to the properly calibrated ones, for any finite sample they remain strictly different; we show that their probability of winning remains strictly above zero even asymptotically. In turn, this shows that the role of priors does not vanish asymptotically: it continues to affect a large model’s probability of winning even with infinite data.

Our second set of results pertains to the case of small datasets. Here ‘model estimation uncertainty’ plays a critical role. We show that the winning agent will have a model that is of smaller dimension than the true DGP. This is because, while agents with misspecified models may have a lower model fit, they will also have a lower model estimation uncertainty—as they have fewer parameters to estimate. In order to establish these results, we additionally assume that all agents’ priors take the analytically convenient ‘Normal-Inverse Gamma’ form (Definition 1). Moreover, to make the problem interesting and avoid trivial cases, we show that our results hold even if we assume that, before observing any data, the agents’ priors are such that all agents have the same predictive ability.

First, we prove that when the dataset consists of a single data point, the winning model always involves exactly one observable (Proposition 1). Deriving more general results is challenging. Small samples have two features: the dependence on specific data realizations, and the fact that the prior remains relevant. In our analysis, we want to preserve the second feature, but circumvent the first—the source of the difficulty in analytical tractability. To this end, we use a non-standard asymptotic framework in which we let the dataset grow but at the same time we make the priors more dogmatic. This captures the key feature of small samples—the prior plays a relevant role—while at the same time we can apply the Law of Large Numbers. Using this approach, we show that indeed small-dimensional models—possibly misspecified as they use fewer observables than the true DGP—prevail.

These results follow from a simple intuition. Suppose Dr. A and Dr.
B are both trying to predict y using a set of covariates {x_i}_{i∈{1,...,100}}. Dr. A believes that only x_1 matters to predict y. Dr. B, instead, considers all 100 covariates. Suppose the true DGP is such that the best linear predictor of y includes all covariates: thus, Dr. B has a ‘correct’ model, while Dr. A does not. Suppose that Drs. A and B have the same prior expected loss. After n data points are revealed, who is more confident?

Suppose first that n is small, e.g. n = 5. Dr. A will believe she has a good grasp of the data generating process—she is trying to fit only one parameter with 5 data points; her confidence will be high. Dr. B, however, will make little headway in estimating her model: fitting 100 parameters using 5 observations, her confidence will be low. Further, since the amount of data is “small,” both agents’ posterior estimates of the variance of ε (σ²_ε) are close to their prior. Therefore the competition is mainly over who believes they have a good grasp of the data generating process—and Dr. A prevails. Hence even though Dr. A has a misspecified model that omits 99 out of the 100 relevant variables, and even though the agents’ prior confidence is the same, when n is small she will nevertheless have higher confidence in her predictive ability.

What happens then as data accumulates? A tradeoff emerges. While Dr. A will be able to estimate the parameters of her model well, she will also observe that it has poor fit on the data. After all, she must attribute all the explanatory power of the variables she omits, x_2, . . . , x_100, to noise, therefore leading her to increase her estimate of σ²_ε. Dr. B instead will take longer to estimate the parameters of her model, but she will be able to fit the data with a lower σ²_ε. As n grows, however, the second effect will acquire prominence, and Dr. B will become more confident.

This trade-off is the core of our results. A small number of observations increases confidence faster for agents with small-dimensional models. It is only as n grows larger that the confidence of agents with larger-dimensional models may catch up. Even if the true DGP is large-dimensional, when the dataset is small, agents with small-dimensional models are thus overconfident about their predictive abilities—and may thus be the most confident of all.

The remainder of the paper is organized as follows: Section 2 outlines the formal model and notation, and also characterizes the expected posterior loss of a single agent, the foundation of our results. Section 3 collects our main results characterizing the winning model under competition: Section 3.1 illustrates with a simple simulation, Section 3.2 considers the case when n is large, and Section 3.3 the case of n small. Section 4 considers some extensions and implications of our results. Section 5 discusses the related literature in further detail and Section 6 concludes.

2 Model

A group of agents is competing to predict a real-valued variable y. There are k real-valued covariates (or explanatory variables) x ∈ R^k. In this section, we describe the relationship between y and x postulated by each agent, the data available, and the agents’ priors. We also characterize each agent’s subjective expected posterior loss (henceforth, posterior loss) conditional on choosing the optimal prediction function (Lemma 1). The latter plays a crucial role for our results.

Data Generating Process and Data.
A true data generating process (DGP), denoted P, determines the relationship between y and x. Agents do not know P, but all of them assume a linear relation between y and the covariates x ∈ R^k, i.e.,

y = x′β + ε,   (1)

where ε | x ∼ N(0, σ²_ε) and β ∈ R^k; that is to say, a homoskedastic linear regression with Gaussian errors. For simplicity, we assume that the agents treat the distribution of the covariates x as known, and denote it by P_x. We assume throughout that E_{P_x}[xx′] is a positive definite matrix and that the random vector x has finite moments of all orders. Let Θ := R^k × R_+, with θ = (β, σ²_ε) defining the unknown parameters of interest. Fixing P_x, the parameter θ = (β, σ²_ε) fully defines the DGP according to the agents, which we denote Q_θ.

Footnote: Because covariates in x can be correlated, the linearity assumption only mildly restricts the models that agents can entertain. For example, if one wished to express the non-linear DGP y = 3x_1√x_2 + ε, one can simply define a new observable equal to x_1√x_2. While not all non-linear DGPs can be expressed this way, especially since we assume finitely many observables, good approximations can always be achieved. Thus, our framework allows the agents to entertain a wide family of non-linear relations as DGP.

Crucially, Equation (1) represents the agents’ perceived DGP, which is allowed to be misspecified. This means that Q_θ is allowed to differ from the true P at every θ—for example, in the true DGP P, errors may be non-normal or heteroskedastic, or the conditional expectation need not be linear.

Before making a prediction, agents observe a dataset, denoted D_n, composed of n i.i.d. draws according to the true DGP, P. We denote the data as D_n = (Y, X), where Y ∈ R^n and X ∈ R^{n×k}. We assume that all agents observe the same data: this will be relevant for our application—as we shall see, in an auction setting this will avoid winner’s-curse-type concerns.
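For concreteness, here is a minimal sketch (ours, not the authors’; all parameter values are purely illustrative) of drawing a dataset D_n = (Y, X) from a perceived DGP of the form (1), with Gaussian covariates:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 6                                       # sample size, number of covariates
beta = np.array([1.0, -0.5, 0.3, 0.8, -1.2, 0.0])   # beta_6 = 0: x_6 is irrelevant
sigma_eps = 1.0                                     # std. dev. of the noise term

X = rng.normal(size=(n, k))                         # covariates drawn i.i.d. N(0, I_k)
eps = rng.normal(scale=sigma_eps, size=n)           # homoskedastic Gaussian errors
Y = X @ beta + eps                                  # D_n = (Y, X): n i.i.d. draws
```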
Priors.

Agents are Bayesian and have a prior π over the model’s parameters θ = (β, σ²_ε). A key ingredient in our setting, as foreshadowed in the introduction, is that agents may have different priors, with different supports.

Of particular interest will be the case in which agents consider different explanatory variables as relevant for their prediction. The following notation will be useful. Let {1, . . . , k} label the explanatory variables in x. Denote by J(π) the set of variables that an agent with prior π considers relevant for the prediction problem. Formally, let π_i denote the marginal distribution over β_i corresponding to prior π. If δ_0 denotes a Dirac measure at zero, then

J(π) := { i ∈ {1, . . . , k} : π_i ≠ δ_0 }.

In what follows, we sometimes use simply J ⊆ {1, . . . , k} to refer to a model, which should be understood as the set of explanatory variables considered to make a prediction. Lastly, for a given vector β, denote by β_J the subvector consisting solely of the components in the set J ⊆ {1, . . . , k}, by x_J the analogous subvector of x, and by X_J the corresponding submatrix of X.

Actions, Utility, and Optimal Prediction.
Agents make a prediction of y given the covariates x, which formally means that they construct a prediction function f that maps x into y, i.e., f : R^k → R. Their utility is the negative of a standard quadratic loss, equal to minus the square of the difference between the true y and their forecast, i.e., −(y − f(x))². Denote by L(f, θ) the agent’s loss under prediction function f if the true DGP is Q_θ, i.e.,

L(f, θ) := E_{Q_θ}[(y − f(x))²].   (2)

If π is the agent’s prior over θ, and D_n the observed data, then characterizing the optimal prediction f* is a standard problem. The agent chooses f to minimize E_π[L(f, θ) | D_n], which can be rewritten as:

E_π[σ²_ε | D_n] + E_π E_{P_x}[(x′β − f(x))² | D_n].   (3)

Footnote: The inner expectation averages over values of x. The outer expectation averages over the values of β.

The first term does not depend on f. The second term involves the average error incurred in predicting x′β using f(x). With standard arguments (i.e., exchanging the order of integration and taking first-order conditions), we can see that the inner expectation of the second term is minimized by

f*_{(π,D_n)}(x) := x′ E_π[β | D_n] = x′_{J(π)} E_π[β_{J(π)} | D_n].   (4)

Thus, a Bayesian decision maker with a posterior π | D_n, model J(π), and a square loss function forecasts y at x as her Bayesian posterior mean of x′β. Again, this is a standard result.

We now turn to characterizing the agent’s posterior loss conditional on her using the optimal prediction function, denoted by L*(π, D_n). This measures how confident each agent is of her predictive ability, and it will be the central driver of the dynamics of our competition between agents. Most importantly, the key driving forces behind our results will already be evident from the following lemma.

Lemma 1.
The agent’s posterior expected loss from her Bayes predictor is:

L*(π, D_n) = E_π[σ²_ε | D_n] + Tr( V_π[β_J | D_n] E_{P_x}[x_J x′_J] ),   (5)

where V(·) is the variance-covariance operator, Tr is the trace operator, and J denotes the agent’s model J(π).

This lemma may be reminiscent of the standard decomposition of mean-squared prediction error in frequentist linear regression models; e.g., Theorem 4.7 in Hansen (2020). The novelty here is that this is the subjective Bayesian analogue of such a decomposition, and to our knowledge it has not been observed previously.

The lemma shows that the agent’s expected posterior loss L*(π, D_n) can be characterized as made of two components. First, the posterior expectation of σ²_ε. This is the agent’s estimate of the irreducible noise in the system. We interpret this term as a measure of model fit, i.e., how well the agent’s model is fitting existing data, because the agent must ascribe all unexplained variation to noise.

The second term, Tr(V_π[β | D_n] E_{P_x}[xx′]), is the trace of the variance-covariance matrix of the coefficients of the model (adjusted by E_{P_x}[xx′]). This is a measure of how uncertain the agent is about the estimation of her model—thus capturing what we can call the model estimation uncertainty faced by the agent according to her own prior. For an intuition, consider the simpler case in which observables are independent and have unit variance. Then, the second term reduces to Tr(V_π[β | D_n]), i.e., Σ_{i=1}^{k} V_π[β_i | D_n]. In this case the second part of the loss function is simply the sum of the variances of the parameters β, indeed a measure of model estimation uncertainty. The exact formula in (5) extends this to cover the case of observables with a general variance-covariance matrix. In the next section we will show that this decomposition has immediate implications for which model leads to the highest subjective confidence.
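To see the decomposition at work, here is a minimal numerical sketch (ours, not the authors’; it assumes the conjugate Normal-Inverse Gamma prior introduced in Definition 1 below, standard-normal covariates with E[xx′] = I, and illustrative hyperparameter values):

```python
import numpy as np

def loss_components(y, X_J, gamma=1.0, a0=2.0, b0=1.0):
    """Model fit and estimation uncertainty of Lemma 1 for the model using
    covariates X_J, under a Normal-Inverse Gamma prior and E[xx'] = I."""
    n, d = X_J.shape
    prec = X_J.T @ X_J + gamma * d * np.eye(d)       # posterior precision scale
    mu = np.linalg.solve(prec, X_J.T @ y)            # posterior mean of beta_J
    a_n, b_n = a0 + n / 2, b0 + 0.5 * (y @ y - mu @ prec @ mu)
    model_fit = b_n / (a_n - 1)                      # E_pi[sigma_eps^2 | D_n]
    est_unc = model_fit * np.trace(np.linalg.inv(prec))  # Tr(V[beta|D_n] E[xx'])
    return model_fit, est_unc

rng = np.random.default_rng(0)
n, k = 30, 6
beta = np.r_[rng.normal(size=5), 0.0]                # x_6 irrelevant in the truth
X = rng.normal(size=(n, k))
y = X @ beta + rng.normal(size=n)

for d in range(1, k + 1):                            # nested models {x1}, ..., {x1..x6}
    fit, unc = loss_components(y, X[:, :d])
    print(f"|J|={d}: fit {fit:.3f} + uncertainty {unc:.3f} = {fit + unc:.3f}")
```

On typical draws, small models show low estimation uncertainty but poor fit, and large models the reverse; this is exactly the trade-off exploited below.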
3 Main Results

We assume that agents compete through a mechanism that selects the agent with the lowest posterior expected loss given her own subjective prior. Our analysis applies to any mechanism that leads to this selection. A simple example discussed in the introduction is that of a second-price auction, where the dominant-strategy equilibrium results in this type of selection; as these are standard arguments, we discuss them in detail in Appendix A.1.

Central to our characterization of the agent with the lowest expected posterior loss is Lemma 1. Applying it, we immediately derive that in finite samples this will be the agent with the prior π* attaining

min_{π}  E_π[σ²_ε | D_n]  +  Tr( V_π[β_{J(π)} | D_n] E_{P_x}[x_{J(π)} x′_{J(π)}] ),
         (Model Fit)         (Model Estimation Uncertainty)

over the collection of priors of agents in the population. Crucially, π* is the prior that achieves the best trade-off between model fit and model estimation uncertainty.

This characterization has two immediate implications, depending on the size of the dataset.

When n is large, the model estimation uncertainty component of the posterior loss vanishes: each agent, even those with a ‘wrong’ model, will reduce the uncertainty about the parameters to zero. All that matters is the model fit. Then, models that exclude relevant variables necessarily have higher expected loss: they must estimate a higher σ²_ε to account for the variation that they are disregarding. Therefore, with large data, agents whose models are misspecified by excluding relevant variables will not win our competition. How about agents who consider more variables than necessary? With large data, their model fit will be the same as that of the true model. In what may be less intuitive, we show that agents who use models that are larger than the true one may win with probability bounded away from zero—potentially even close to one—even asymptotically as data grows large.
This implies that the agents’ priors continue to affect the model competition even with infinite data: the initial choice of prior matters even in large samples.

When n is small, however, model estimation uncertainty does not vanish, and a central role is played by its tradeoff with model fit. The first observation is that model estimation uncertainty is affected by model dimension: it will be necessarily lower for lower-dimensional models, because agents have fewer parameters to fit. This means that small-dimensional models may have an advantage, and the winning agent may then be one with a model that is misspecified and of smaller dimension than the true DGP. This holds not because smaller-dimensional models have more confidence to begin with—our results hold even if we assume that agents start with the same expected loss with no data. Instead, what happens is that when the data revealed is (relatively) small, the expected loss decreases faster for agents with small-dimensional models: they will learn their (fewer) parameters faster. Thus, agents who hold models that are misspecified in that they exclude relevant variables may end up being more confident in their predictive ability. The example discussed in the introduction (of Drs. A and B) may provide further intuition. To recap, ceteris paribus, trying to estimate more parameters from the same amount of data will result in more model uncertainty, i.e., less concentrated posteriors. This uncertainty will therefore be reflected in the agent’s expected loss.

In what follows, we first illustrate these dynamics in a simple simulation; we then move to discussing the results for the case of large n, which can be derived under very general conditions. When n is small, while the intuition above holds, obtaining a precise characterization of the winning model is more difficult. We assume that the agents have what we call Normal-Inverse Gamma priors, conjugate priors for the Normal linear regression model. We also normalize the priors so that, absent data, all agents are equally confident.
3.1 A simple simulation

Before we dive into formal results, we illustrate them via a simple simulation. We use the following priors both in the simulations and to establish the small-sample results (Section 3.3). The large-sample results (Section 3.2) do not require this assumption.
Definition 1.
We say that the agent’s prior π has Normal-Inverse Gamma form with hyperparameters (γ, a_0, b_0) if

β_{J(π)} | σ²_ε ∼ N_{|J(π)|}( 0, (σ²_ε / (γ|J(π)|)) I_{|J(π)|} ),   σ²_ε ∼ Inv-Gamma(a_0, b_0),

where N_J(µ, Σ) refers to the J-dimensional multivariate normal distribution with mean µ and variance-covariance matrix Σ, and Inv-Gamma(a_0, b_0) refers to the Inverse Gamma distribution with parameters a_0 and b_0.

We make this assumption purely for tractability: these are conjugate priors for the linear regression model, and the associated posteriors are amenable to analysis. Note that the priors are such that all agents have the same Inverse Gamma prior distribution over the variance of the residuals, while the prior variance over the regression coefficients is scaled down with the number of regressors in the agent’s model. These priors are therefore, by construction, normalized so that they all have the same expected loss before data, i.e., for all π, π′, L*(π, ∅) = L*(π′, ∅). This implies that, without data, all subjects have the same confidence in their model, and any difference in confidence is due to different reactions to the data. As mentioned in the introduction, we consider this normalization solely to illustrate that our results in which smaller models may prevail are not due to different confidence to begin with, but rather to different ways in which confidence evolves with new data.

Figure 1 shows simulation results in a setting where there are six observables in the dataset, {x_1, . . . , x_6}, of which only the first five are relevant for prediction in the true DGP, i.e., y = Σ_{i=1}^{5} β_i x_i + ε. There are 63 agents, one for each non-empty subset of {x_1, . . . , x_6}, all with Normal-Inverse Gamma priors with the same shared hyperparameters. By construction, therefore, 61 of these agents have an incorrect model (i.e., ignore at least one variable that is relevant for prediction in their own model), one agent has the exactly correct model, and one agent has a larger model than the truth.

We simulate datasets of increasing sizes starting from n = 1, and plot the frequency of the size of the model of the agent with the lowest posterior expected loss upon observing that dataset. Two main features clearly emerge. First, when n is small, small-dimensional models prevail: when n = 1, the winner is a small model, indeed a model that we know to be misspecified; as n grows, we have ‘waves’ of larger, but still misspecified, models becoming more prominent. Second, as n grows large, the true model wins more often. However, the larger model, which includes the redundant variable x_6, also continues to win, with relative frequencies that appear to converge to a steady state. The results below explain the theoretical basis of these findings.

Footnote: In particular, algebra shows that:

E_π[σ²_ε | D_n] = ( b_0 + (1/2) min_{β ∈ R^{|J(π)|}} [ (y − X_{J(π)}β)′(y − X_{J(π)}β) + (γ|J(π)|)‖β‖² ] ) / ( a_0 + n/2 − 1 ),   (6)

V_π[β_{J(π)} | D_n] = E_π[σ²_ε | D_n] ( X′_{J(π)} X_{J(π)} + (γ|J(π)|) I_{|J(π)|} )⁻¹.   (7)

[Figure 1 about here: frequency of the size of the winning model, with hyperparameters (a_0, b_0, γ) = (2, ·, ·), on simulated datasets of increasing size starting at n = 1. The 6 covariates are distributed x ∼ N(0, I); the true DGP depends only on covariates 1–5, with (β_1, . . . , β_5) ∼ N(0, I) and β_6 = 0.]
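The following sketch (our code, not the authors’; hyperparameters and replication counts are illustrative) reproduces the logic of Figure 1: it computes each of the 63 posterior losses via (6)–(7) and Lemma 1, and tallies the dimension of the winning model:

```python
import numpy as np
from itertools import combinations

def posterior_loss(y, X_J, gamma=1.0, a0=2.0, b0=1.0):
    n, d = X_J.shape
    prec = X_J.T @ X_J + gamma * d * np.eye(d)           # eq. (7) precision scale
    mu = np.linalg.solve(prec, X_J.T @ y)
    e_sig2 = (b0 + 0.5 * (y @ y - mu @ prec @ mu)) / (a0 + n / 2 - 1)  # eq. (6)
    return e_sig2 * (1.0 + np.trace(np.linalg.inv(prec)))  # Lemma 1 with E[xx'] = I

rng = np.random.default_rng(0)
k = 6
models = [J for r in range(1, k + 1) for J in combinations(range(k), r)]  # 63 models

for n in [1, 5, 20, 100, 1000]:
    wins = np.zeros(k + 1, dtype=int)
    for _ in range(200):                                 # 200 simulated datasets
        beta = np.r_[rng.normal(size=5), 0.0]            # only x_1..x_5 relevant
        X = rng.normal(size=(n, k))
        y = X @ beta + rng.normal(size=n)
        winner = min(models, key=lambda J: posterior_loss(y, X[:, J]))
        wins[len(winner)] += 1
    print(n, wins[1:])                                   # counts by winner dimension
```

For small n the counts concentrate on low dimensions; for large n they concentrate on dimensions 5 and 6, mirroring the two features described above.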
3.2 n large

We characterize the winner for large n under some mildly technical regularity assumptions about the priors of the agents and conditions on the true unknown DGP, P. Roughly speaking, these assumptions jointly guarantee that the posteriors of the agents are well behaved in the limit (i.e., appropriate central limit theorems apply, summary statistics of the posterior distribution concentrate appropriately fast, etc.), despite the models being misspecified (i.e., the true DGP is not in the support of the prior). To assist the reader we take a slightly unorthodox approach: in the body of the paper, we present our results under assumptions that are stronger than needed but (much) easier to parse. In the Appendix we prove that our results continue to hold under much weaker conditions.

Assumption 1.
Each agent has a prior over θ characterized by a smooth and strictly positive probability density function π(·) over (β′_{J(π)}, σ²_ε)′ ∈ R^{|J(π)|} × R_+.

Assumption 1 posits that agents’ priors over the β_i’s are either degenerate at 0 (i.e., the agent views that covariate as not relevant for prediction) or have full support on the reals. In the latter case, it requires a density to exist and to be smooth. Assumption 1 also precludes degenerate priors on σ²_ε. Next, we impose assumptions on the true data generating process P.

Assumption 2.
The true data generating process P is a Gaussian linear regression model as specified in (1); that is, there exist parameters θ_0 := (β_0, σ²_0) such that Q_{θ_0} = P.

Assumption 2 posits that while agents may have misspecified models, this is only in the form of omitting relevant variables. Again, our results hold under more general conditions (see Assumption 4 in Appendix 2). In particular, our results continue to obtain when the true DGP is different from (1), but it is well-behaved enough that the posterior distributions of the misspecified Bayesian agents become tight around Maximum Likelihood estimators in large samples. For example, the distribution of errors in the true DGP may be heteroskedastic or non-normal with thin-enough tails, the distribution on covariates P_x may be misspecified as long as it has finite second moments, and the true data generating process may not even be linear.

Finally, a little more notation will be useful. Given the vector β_0 ∈ R^k as above, J_0 is the set of indices of the coordinates of β_0 that are nonzero, i.e., J_0 := { κ | β_{0,κ} ≠ 0 }. We remind the reader that J(π) denotes the coordinates of β for which the prior π is not degenerate at 0.

Theorem 1.
Let Π be a finite collection of agents’ priors that satisfy Assumption 1. Suppose the true data generating process P satisfies Assumption 2 with associated parameters (β_0, σ²_0). If Π contains a prior π* such that J_0 ⊆ J(π*), then

lim_{n→∞} P( ∃ π* ∈ argmin_{π∈Π} L*(π, D_n) s.t. J_0 ⊆ J(π*) ) = 1.

Moreover, for any π, π_0 ∈ Π for which J_0 ⊂ J(π) and J(π_0) = J_0,

lim_{n→∞} P( L*(π, D_n) < L*(π_0, D_n) ) ∈ (0, 1).

Footnote: By definition, the prior of agent j for any β_κ, κ ∉ J(π_j), is degenerate at 0.
Footnote: Assumption 3 in the appendix provides a weaker version of Assumption 1 by relaxing the number of higher-order derivatives required from π.
Footnote: In particular, we invoke the Bernstein-von Mises theorem for misspecified parametric models of Kleijn and Van der Vaart (2012) and the posterior expansions of Kass et al. (1990) based on Laplace’s method. These results can be thought of as modern and richer versions of the classical results concerning posterior distributions of misspecified models in Berk (1970).

Theorem 1 has two main takeaways. The first part tells us that some model which is (weakly) larger than the true model—possibly containing explanatory variables that are irrelevant for prediction, but never excluding any relevant variable—will win the competition with probability approaching one as the sample size grows large. It follows that a model which excludes a relevant variable never wins in the limit. The second part shows that any model larger than the true one defeats the latter with a probability that is strictly positive even asymptotically.

The first result is somewhat intuitive. By Lemma 1, we can decompose posterior loss into two terms: expected variance of the noise and model estimation uncertainty. We show that the latter converges to zero for all agents by invoking the Bernstein-von Mises theorem for misspecified models. The former term will instead differ: agents who rule out an observable that is relevant for prediction must attribute its explanatory power to noise. As n grows large, their posterior expectation of the variance of the noise term will be necessarily larger, and thus their confidence lower, than that of agents with the true model.

But what about agents whose model is larger than the true model? This part, covered by the second part of the theorem, is slightly more subtle. After all, this agent will also eventually learn the true data generating process: that is, her beliefs about the β’s associated to redundant variables must converge to zero. But how will the confidence compare? For any fixed n, the agent with more observables in her model will have a less concentrated posterior on β. On the other hand, she will also have a slightly smaller posterior expectation of the variance of the noise term: she will mistakenly attribute some explanatory power to these superfluous observables. Which of these two effects dominates—both can be shown to be O_P(1/n)—determines the likelihood of winning. From a technical perspective, the comparison of these vanishingly small terms is based on an asymptotic expansion for the posterior mean of the variance parameter in the linear regression model, based on the general results in Kass et al. (1990). This is not a textbook result, and so we provide details in the appendix.
In a nutshell, Lemma 2 in the Supplementary Materials shows that the model fit term equals

σ̂²(π) − (σ̂²(π)/n) [ (∂π/∂σ²)(θ̂(π)) / π(θ̂(π)) ] − σ̂²(π)(|J(π)| + 4)/n + O_P(n⁻²),

where hats denote the Maximum Likelihood estimator of the model’s parameters by an agent that only considers the explanatory variables in J(π). Similarly, when Assumption 2 is satisfied, the model estimation uncertainty term equals

σ̂²(π)|J(π)|/n + o_P(1/n).

The second part of Theorem 1 uses these expansions to show that the probability of the larger model winning is bounded away from zero, even in the limit. In fact, it is easy to construct examples in which the probability of a larger model beating the true one can be made arbitrarily close to one even in the limit.

Footnote: Suppose for example that priors are of the Normal-Inverse Gamma form, as in Definition 1, but allow them to have different parameters (a_π, b_π). Algebra shows that if b_π is large enough (i.e., the variance of the prior over σ²_ε is large enough), then the probability that the larger model defeats a smaller one can become arbitrarily close to 1 even in the limit.
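Numerically, both parts of Theorem 1 can be seen by tracking how often a too-small model and the larger model beat the true one as n grows (our sketch, under the simulation settings of Section 3.1; all values are illustrative):

```python
import numpy as np

def posterior_loss(y, X_J, gamma=1.0, a0=2.0, b0=1.0):
    n, d = X_J.shape
    prec = X_J.T @ X_J + gamma * d * np.eye(d)
    mu = np.linalg.solve(prec, X_J.T @ y)
    e_sig2 = (b0 + 0.5 * (y @ y - mu @ prec @ mu)) / (a0 + n / 2 - 1)
    return e_sig2 * (1.0 + np.trace(np.linalg.inv(prec)))

rng = np.random.default_rng(0)
beta = np.array([1.0, -0.5, 0.3, 0.8, -1.2, 0.0])        # x_6 irrelevant
for n in [100, 1000, 10000]:
    larger_wins = small_wins = 0
    for _ in range(200):
        X = rng.normal(size=(n, 6))
        y = X @ beta + rng.normal(size=n)
        true_loss = posterior_loss(y, X[:, :5])          # true model x_1..x_5
        larger_wins += posterior_loss(y, X) < true_loss  # adds redundant x_6
        small_wins += posterior_loss(y, X[:, :4]) < true_loss  # omits x_5
    print(f"n={n}: larger beats true {larger_wins/200:.2f},"
          f" too-small beats true {small_wins/200:.2f}")
```

Per the theorem, the second frequency vanishes while the first stays bounded away from zero.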
3.3 n ‘small’

We are now ready to discuss the properties of the winning model when the number of observations n is relatively small. We have seen how Lemma 1 already suggests that there may be an advantage for smaller-dimensional models, and the simulation results showed that when data is scarce, the winner may indeed be a model of smaller dimension than the true DGP. We will now provide additional formal results to strengthen this understanding.

In this subsection, we assume that agents have Normal-Inverse Gamma priors with shared hyperparameters as in Definition 1. As we discussed, this ensures that all agents have the same prior expected loss before data. Differences in posterior confidence are therefore driven entirely by the observed data. Throughout this subsection we also assume that E_{P_x}[xx′] = I_k.

The winner with 1 data point.

We start with an extreme but stark result for the case in which agents observe only one data point.
Proposition 1.
Suppose all agents have Normal-Inverse Gamma priors with shared hyper-parameters (a_0, b_0, γ). Suppose that all agents believe the joint distribution of the covariates is such that E_{P_x}[xx′] = I_k. Suppose that for every single-covariate model, i.e. every J such that |J| = 1, there is an agent with that model. If these agents compete after seeing a dataset which consists of a single observation, i.e. n = 1, then the winner is always one of these agents with a single-variable model, regardless of which other models are represented.

Note that this result holds independently of the true DGP: even when that is high-dimensional, with only one data point it is always a 1-dimensional model that wins. Numerical simulations suggest that a generalization of this result appears to hold: with n observations the winner is n-dimensional or smaller. We were not able to formalize such an observation, as in finite samples the expressions for E_π[σ²_ε | D_n] and Tr(V_π[β | D_n]) are algebraically less tractable even with the specific form of priors we assume: the reason is that they depend on the inverse of a matrix of specific data realizations, which is hard to operate with.
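A quick numerical check of Proposition 1 (our sketch; hyperparameters are illustrative): across many single-observation datasets, the winner among all 63 candidate models is one-dimensional.

```python
import numpy as np
from itertools import combinations

def posterior_loss(y, X_J, gamma=1.0, a0=2.0, b0=1.0):
    n, d = X_J.shape
    prec = X_J.T @ X_J + gamma * d * np.eye(d)
    mu = np.linalg.solve(prec, X_J.T @ y)
    e_sig2 = (b0 + 0.5 * (y @ y - mu @ prec @ mu)) / (a0 + n / 2 - 1)
    return e_sig2 * (1.0 + np.trace(np.linalg.inv(prec)))

rng = np.random.default_rng(0)
k = 6
models = [J for r in range(1, k + 1) for J in combinations(range(k), r)]

dims = set()
for _ in range(1000):                       # 1000 datasets with a single data point
    beta = rng.normal(size=k)               # arbitrary (even high-dim.) true DGP
    X = rng.normal(size=(1, k))
    y = X @ beta + rng.normal(size=1)
    winner = min(models, key=lambda J: posterior_loss(y, X[:, J]))
    dims.add(len(winner))
print(dims)                                 # Proposition 1 predicts {1}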
A novel approach for small n analysis.

We now suggest a novel way of approaching the problem of small-sample analysis that allows us to obtain further results despite the analytical limitations discussed above. This general approach may be of independent interest.

The initial observation is that small samples appear to be distinct from large ones for two basic properties: i) the prior remains relevant instead of being partially ‘washed away’ by the data; and ii) specific data realizations matter, instead of only the population average mattering. It is the latter characteristic that leads to the analytical difficulties we encountered above. In large samples these issues do not arise because laws of large numbers can be invoked, circumventing the analytical concerns as they allow us to replace specific observations with population averages.

But what if we find a way to maintain the first property of small samples—that the prior still matters—while dispensing with the second, problematic one—that specific realizations matter? To do this, we let n grow to infinity, thus allowing us to use the law of large numbers, but at the same time we vary the hyperparameters of the priors to make them become more and more precise, at a pace such that they maintain their relevance. Such ‘alternative asymptotics’ frameworks have been used to study different inference problems in econometrics.

Footnote: See the local-to-zero asymptotics of Staiger and Stock (1997) for the analysis of instrumental variable regression with a weak instrument, and the local-to-unity framework of Phillips (1987) for the analysis of inference in an autoregressive model with autocorrelation close to 1.

The next result uses this approach to show that as long as the prior remains sufficiently relevant, smaller models have an advantage.

Theorem 2.
Suppose all the agents have Normal-Inverse Gamma priors with shared hyper-parameters (a_0, b_n, γ), where b_n ∈ ω(n²). Suppose the DGP P satisfies Assumption 2, with parameters θ_0 := (β_0, σ²_0). Let J_0 denote the associated true model for β_0 and suppose there exists at least one agent with prior π such that |J(π)| < |J_0|. Then

lim_{n→∞} P( ∃ π* ∈ argmin_{π∈Π} L*(π, D_n) s.t. |J(π*)| < |J_0| ) = 1.

Footnote: Roughly, b_n asymptotically grows at a rate strictly faster than n².

In words, this result shows that if the prior concentrates fast enough, the results are the converse of the large-data case (i.e., Theorem 1): models that are larger-dimensional than the true DGP never win, and instead the winner is always smaller-dimensional than the truth.
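A sketch of the drifting-prior experiment behind Theorem 2 (ours, not the authors’; the particular rate b_n = n^{5/2} ∈ ω(n²) and all other values are illustrative):

```python
import numpy as np
from itertools import combinations

def posterior_loss(y, X_J, gamma, a0, b0):
    n, d = X_J.shape
    prec = X_J.T @ X_J + gamma * d * np.eye(d)
    mu = np.linalg.solve(prec, X_J.T @ y)
    e_sig2 = (b0 + 0.5 * (y @ y - mu @ prec @ mu)) / (a0 + n / 2 - 1)
    return e_sig2 * (1.0 + np.trace(np.linalg.inv(prec)))

rng = np.random.default_rng(0)
k = 6
beta = np.array([1.0, -0.5, 0.3, 0.8, -1.2, 0.0])    # true model has 5 variables
models = [J for r in range(1, k + 1) for J in combinations(range(k), r)]

for n in [50, 500, 5000]:
    b_n = n ** 2.5                                   # drifting prior: b_n in omega(n^2)
    wins = np.zeros(k + 1, dtype=int)
    for _ in range(200):
        X = rng.normal(size=(n, k))
        y = X @ beta + rng.normal(size=n)
        w = min(models, key=lambda J: posterior_loss(y, X[:, J], 1.0, 2.0, b_n))
        wins[len(w)] += 1
    print(n, wins[1:])       # per Theorem 2, mass stays on dimensions below 5
```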
4 Extensions

We conclude our formal analysis with a discussion of two variants of our model, both of which deliver the same stark prediction as the “small data” world: agents with “simple” models always win. Indeed, both of these strengthen our small-data results.
Known Variance.
What happens when the variance σ²_ε of the noise term ε is commonly known among the agents? This is an extreme special case of our analysis, in which the prior over σ²_ε is degenerate at a known value. Since the model fit term is then identical across agents, the competition is decided by model estimation uncertainty alone, and smaller models prevail for any dataset.

Footnote: Indeed, there is an aspect that makes this assumption problematic in some environments. When variance is not uncertain, agents with incorrect models of the world will, as data accrues, observe that their model has an empirical error higher than the (known) σ²_ε, because the model disregards some observables relevant for prediction. For n large, this disparity between the empirical error and the (known) σ²_ε should lead them to question their underlying model. However, as is standard with Bayesians with dogmatic beliefs (here they have degenerate beliefs on σ²_ε), they do not. When the dataset is not too large, however, such issues will not arise.

Proposition 2.
Suppose agents have Normal priors on β with shared hyper-parameter γ. Fix a prior π with |J(π)| = k. For any k′ < k, and any dataset D_n with n > 0, there exists a prior π′ such that J(π′) ⊆ J(π) with |J(π′)| = k′ and such that L*(π′, D_n) < L*(π, D_n).

A different extension is to a setting where agents know that they will see exactly n data points, but compare expected confidence before they view the data: this is the case, for example, when agents have to submit a bid before seeing the data, but know that they will see n data points before making their prediction. Put differently, we study the expectation, before seeing the data, of the expected loss after n data points. This situation may not be unusual in reality, as often new data is revealed after bidding but before predictions need to be made. A stark result holds in this case: smaller models always win. In fact, in this case the result is even stronger than previous ones, as we explain below.

Proposition 3. Suppose agents have Normal Inverse-Gamma priors with shared hyper-parameters (a_0, b_0, γ), and that γ = 0. Suppose further that x ∼ N_k(0, I_k) independently of ε. Fix a prior π. For any prior π′ such that |J(π′)| < |J(π)|, we have that

E_{m(π′)}[L*(π′, D_n)] < E_{m(π)}[L*(π, D_n)],

whenever n > |J(π)| + 1. Here the outer expectation is taken over the agents’ ‘marginal’ distribution of the data, m(π) := ∫ q_θ(D_n) π(θ) dθ.

Footnote: Note that since different agents have different beliefs about the data generating process, they take expectations with respect to different probability distributions over the space of datasets D_n. Hence, the expression E_{m(π′)}[L*(π′, D_n)] is the Bayes risk of the Bayes predictor; see Equation 1.14 in Chapter 1.6 in Ferguson (1967).

Notice that here a model is defeated by any smaller model, not just some of the smaller models, as was the case in some of the previous results; moreover, this holds for any size of the dataset n. For an intuition, consider again the decomposition of posterior loss obtained through Lemma 1. Depending on the realized D_n, the first term, model fit, can be larger or smaller than its prior expectation before data is realized. Indeed, this is the complicating factor in the analyses of Propositions 1 and 2. Here, however, we take the expectation over all possible datasets, and the first term reduces to its prior expectation. So we can focus only on the second term, model estimation uncertainty. But then, as in previous results, the residual model uncertainty is smaller in expectation for smaller models. Proposition 3 follows.
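A Monte Carlo sketch of the Bayes-risk comparison in Proposition 3 (ours, not the authors’; since γ = 0 makes the prior on β improper, we approximate it with a small γ > 0, so this is only a heuristic check):

```python
import numpy as np

def posterior_loss(y, X_J, gamma, a0, b0):
    n, d = X_J.shape
    prec = X_J.T @ X_J + gamma * d * np.eye(d)
    mu = np.linalg.solve(prec, X_J.T @ y)
    e_sig2 = (b0 + 0.5 * (y @ y - mu @ prec @ mu)) / (a0 + n / 2 - 1)
    return e_sig2 * (1.0 + np.trace(np.linalg.inv(prec)))

rng = np.random.default_rng(0)
a0, b0, gamma, n = 3.0, 1.0, 1e-3, 12          # small gamma approximates gamma = 0

for d in range(1, 7):                          # model dimension |J(pi)|; n > d + 1
    risks = []
    for _ in range(4000):                      # draw (theta, D_n) from the agent's m(pi)
        sig2 = 1.0 / rng.gamma(a0, 1.0 / b0)   # sigma^2 ~ Inv-Gamma(a0, b0)
        beta = rng.normal(scale=np.sqrt(sig2 / (gamma * d)), size=d)
        X = rng.normal(size=(n, d))            # x ~ N(0, I), as in Proposition 3
        y = X @ beta + rng.normal(scale=np.sqrt(sig2), size=n)
        risks.append(posterior_loss(y, X, gamma, a0, b0))
    print(f"|J(pi)|={d}: Bayes risk approx {np.mean(risks):.2f}")  # increasing in d
```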
Connection with the Akaike information criterion.

A different way to understand our results is to relate the model selection induced by competing models to the Akaike Information Criterion, a well-studied model selection criterion in Econometrics and Statistics. In what follows, we illustrate that the loss function of an agent with Normal-Inverse Gamma prior is “close” to the AIC for the linear regression model.

Definition 2 (Akaike Information Criterion). Given a dataset D_n = (y, X) with n data points and k possible covariates, the Akaike information criterion for linear regression evaluates a model based on X_J as:

L_Akaike(J, n, D_n) = ln σ̂²(J, n, D_n) + 2|J|/n,

where

σ̂²(J, n, D_n) = (1/n) min_{β ∈ R^{|J|}} (y − X_J β)′(y − X_J β).

In words, consider a model J with |J| observables. The expression σ̂²(J, n, D_n) is the OLS estimator of the residual variance based on a model with covariates X_J in the dataset D_n with n observations. As is well understood, selecting a model with a lower estimated variance may not favor the model with the best out-of-sample performance. This is because selecting based on average residuals favors models that have more covariates (i.e., regressions which “overfit” the data). The Akaike Information Criterion (AIC) compensates for this by adding a penalty term equal to 2|J|/n, i.e., twice the ratio of the number of covariates in the model and the number of data points.

Algebra shows that if agents have an uninformative Normal-Inverse Gamma prior (γ = 0), then the log of the posterior loss is approximately equal to

ln σ̂²(J, n, D_n) + ln( 1 + (1/n) Tr( (X′_J X_J / n)⁻¹ E_{P_x}[x_J x′_J] ) ).

Thus, if the sample size is large and the agents’ distribution of covariates is well specified, the (log) posterior loss of an agent with prior π will be approximately equal to the Akaike Information Criterion (with a penalty of |J|/n instead of 2|J|/n).

The prevalence of larger models in the model competition can then be associated with the ‘conservativeness’ of the Akaike criterion for model selection. Our Theorem 1, however, makes it clear that the relation is only qualitative: larger models will indeed prevail in large samples, but the probability of a larger model being selected will continue to be affected by the prior.

Finally, it is worth reiterating that the foundations of the AIC are normative: the criterion was proposed as a way to select a model that avoids overfitting. Conversely, our analysis provides a positive foundation for the AIC: we study the outcomes when Bayesian agents compete in a way that selects the agent with the lowest posterior expected loss.
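Concretely, the following sketch (ours; the nearly uninformative hyperparameters stand in for γ = 0) compares the log posterior loss with the AIC across nested models on one large dataset:

```python
import numpy as np

def log_posterior_loss(y, X_J, gamma=1e-6, a0=2.0, b0=1e-6):
    n, d = X_J.shape
    prec = X_J.T @ X_J + gamma * d * np.eye(d)
    mu = np.linalg.solve(prec, X_J.T @ y)
    e_sig2 = (b0 + 0.5 * (y @ y - mu @ prec @ mu)) / (a0 + n / 2 - 1)
    return np.log(e_sig2 * (1.0 + np.trace(np.linalg.inv(prec))))

def aic(y, X_J):
    n, d = X_J.shape
    beta_ols, *_ = np.linalg.lstsq(X_J, y, rcond=None)   # OLS fit of model J
    sig2_hat = np.mean((y - X_J @ beta_ols) ** 2)        # residual variance estimate
    return np.log(sig2_hat) + 2.0 * d / n                # Definition 2

rng = np.random.default_rng(0)
n, k = 200, 6
beta = np.array([1.0, -0.5, 0.3, 0.8, -1.2, 0.0])
X = rng.normal(size=(n, k))
y = X @ beta + rng.normal(size=n)

for d in range(1, k + 1):
    print(f"|J|={d}: log posterior loss {log_posterior_loss(y, X[:, :d]):+.3f},"
          f" AIC {aic(y, X[:, :d]):+.3f}")   # nearly parallel: penalty |J|/n vs 2|J|/n
```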
5 Related Literature

A large literature has studied model misspecification in individual decision-making, with famous examples like overconfidence and correlation neglect. A few recent theoretical contributions to this enormous literature include Heidhues et al. (2018) and Ortoleva and Snowberg (2015), to which we refer for further references. In misspecified learning settings, “feedback loops” between the agents’ misspecified beliefs and the actions they take add further technical challenges—see e.g. Fudenberg et al. (2017), Fudenberg et al. (2020), Heidhues et al. (2020).

Recent works have studied the implications of agents with misspecified models in various strategic settings. For instance, Bohren (2016), Bohren and Hauser (2017), Frick et al. (2019b) and Frick et al. (2019a) study social learning when agents have misspecified models that cause them to misinterpret other agents’ actions. Mailath and Samuelson (2019) study a stylized prediction market where Bayesian agents have different models of the world (defined there as different partitions of a common state space), and discuss the possibility of information aggregation.

In strategic settings, Esponda and Pouzo (2016) define a learning-based solution concept (‘Berk-Nash Equilibrium’) for games in which agents’ beliefs are misspecified. More broadly, solution concepts have been posited for settings where agents suffer from some sort of misspecification, including well-known examples like analogy-based equilibrium (Jehiel, 2005) and cursed equilibrium (Eyster and Rabin, 2005).

There are several works that consider outcomes when some agents behave in a way that can be construed as coming from a misspecified model. For instance, in Spiegler (2006) or Spiegler (2013) society misunderstands the relationship between outcomes and the actions of strategic agents, which affects the actions the latter take in equilibrium and the resulting outcomes (in the former, in the context of a market for quacks; in the latter, with implications for the reforms taken by a politician). Levy et al. (2019) study a dynamic model of political competition where agents have different (misspecified) models of the world, and use this model to provide a foundation for the recurrence of populism. Liang (2018) studies outcomes in games of incomplete information where agents behave like statisticians and have limited information.

Footnote: There is a larger literature which studies the outcomes when agents are modeled as statisticians or machine learners, e.g., Al-Najjar (2009), Al-Najjar and Pai (2014), Acemoglu et al. (2016) and Cherry and Salant (2018).

A novel approach to modeling misspecification in economic theory is the directed acyclic graph approach; see Pearl (2009). This is exploited in a single-person decision framework in Spiegler (2016), which studies a single decision maker with a misspecified causal model and large amounts of data. The paper shows that the decision maker may evaluate actions differently than their long-run frequencies, and exhibit artifacts such as “reverse causation” and coarse decision making. This approach is then used in Eliaz and Spiegler (2018), which proposes a model of competing narratives. A narrative is a causal model that maps actions into consequences, including other random, unrelated variables. An equilibrium notion is defined, and the paper studies the distribution of narratives that obtains in equilibrium.

Finally, the understanding that agents should be cognizant that their models may be misspecified relates our work to the literature on robustness (e.g., Bergemann and Morris, 2005; Chassang, 2013; Carroll, 2015; Madarász and Prat, 2017).

Closest to our leading application, Atakan and Ekmekci (2014) study auctions in which the successful bidder chooses an action that determines, together with the state of the world, the payoff generated by the asset. They focus on a setting where bidders have a common prior but observe private signals. Their main result is the possibility of (complete) failure of information aggregation. Our results are similar in that in our applications as well the value of the object depends on an action taken by the agent. However, our paper considers a complementary environment where all bidders observe the same information but they have different priors. Information aggregation is ruled out by assumption, and our key theme is model selection.

Footnote: Bond and Eraslan (2010) study a trading environment with a similar feature.

We assume that agents have different priors and are fully aware they have different priors: that is to say, our agents agree to disagree.
This assumption has been used in economic theory at least since Harrison and Kreps (1978). We refer the reader to Morris (1995) for a discussion of the common and heterogeneous prior traditions in economic theory. Heterogeneous priors have been used in a number of applications in bargaining (Yildiz, 2003), trade (Morris, 1994), financial markets (Scheinkman and Xiong, 2003; Ottaviani and Sørensen, 2015), and more.

Relation to Model Selection.

As we mentioned in the Introduction, there is a large body of literature in Statistics, Econometrics, and Machine Learning that studies model selection methods and provides normative foundations. That literature is too vast to comprehensively cite here; we refer the reader to Claeskens and Hjort (2008) and Burnham and Anderson (2003) for textbook overviews. Popular examples include the C_p criterion of Mallows (1973), the Akaike Information Criterion (AIC) of Akaike (1974), and the Bayes Information Criterion (BIC) of Schwarz (1978). We showed that there exists a connection between our large-data results and the Akaike Information Criterion (AIC) introduced in Akaike (1974), in particular to the asymptotic properties of the AIC characterized in the seminal paper of Nishii (1984).

While some of our asymptotic results are reminiscent of the model selection literature, there are three important differences. First, the aims of this literature are very different from ours. Ours is a positive approach of studying which model emerges from a competition between Bayesian agents with misspecified models. The approach in the model-selection literature is instead normative: various methods of model selection are proposed and studied with a view to avoiding over-fitting and/or selecting ‘good’ models according to some metric. The results we are aware of broadly speak to the asymptotic efficiency of these techniques. Second, not only are our results derived from a completely different model, they are also proven with different techniques. Third, the connection is limited to the large-data result. We are also not aware of any analogues to our small-sample results in the model selection literature.

We also use techniques and approaches from the statistics and econometrics literature. The proof of Theorem 2 uses ‘non-standard asymptotics’ that allow the parameters of a statistical model to be indexed by the sample size, as has been done extensively in econometrics. The typical goal of an alternative asymptotic framework is to provide better approximations to finite-sample distributions of estimators, tests, and confidence intervals, while exploiting Laws of Large Numbers and Central Limit Theorems. For example, the local-to-unity asymptotics of Phillips (1987) studies auto-regressive models that are close to being nonstationary; the local-to-zero asymptotics of Staiger and Stock (1997) studies Instrumental Variables models that are close to being unidentified; and Cattaneo et al. (2018) study models where possibly many covariates are included for estimation and inference.

6 Conclusion

We analyze a novel model of competition between agents. A variable of interest is related to a vector of covariates. Agents have different models of this relationship: in particular, they rule in or rule out different x’s as being potentially relevant for prediction. All agents observe a common dataset of size n, drawn from the true data generating process.
The winner is the agent with the lowest expected loss, with expectations taken with respect to her own subjective posterior. This winner corresponds to the winner under a stylized auction model we formally define and analyze, but may also be of interest more generally in situations where subjective confidence in predictions leads to selection. We study the relationship between the true data generating process and the model of the winner, and how this relationship changes with the size of the available dataset, n. We show results of two kinds.

First, when n is small, ‘simple’ models, i.e., models that employ few observables, may take the lead, even if the true data generating process is rich. This follows from our characterization of the (subjective) expected loss. To establish this result formally, we used a ‘drifting’ Normal-Inverse Gamma prior, where we allowed the elasticity of the prior density with respect to the variance to increase with the sample size.

Second, when n is large, the winner is qualitatively similar to the model with the lowest value of the Akaike Information Criterion. Misspecified models (i.e., models that rule out an observable which is relevant for prediction) never win, but overly large models may continue to win even as data grows unboundedly large. The prior is not completely ‘washed out’ by the large sample: the elasticity of the prior density with respect to the variance parameter continues to affect the model competition even with infinite data. This result is established for a very general class of priors and true data generating processes.

There are several natural avenues for future research. An obvious one is a setting in which agents each observe a private dataset: this complicates our analysis because now a notion of the winner’s curse applies. Each agent must consider whether they are beating the others because their model is truly performing well on the data, or because their dataset is non-representative. Another one is to consider dynamic variants: if agents got feedback or could invest to acquire more data, what kinds of models would be selected?

References

Acemoglu, D., V. Chernozhukov, and M. Yildiz (2016): “Fragility of asymptotic agreement under Bayesian learning,” Theoretical Economics, 11, 187–225.

Akaike, H. (1974): “A new look at the statistical model identification,” IEEE Transactions on Automatic Control, 19, 716–723.

Al-Najjar, N. I. (2009): “Decision makers as statisticians: Diversity, ambiguity, and learning,” Econometrica, 77, 1371–1401.

Al-Najjar, N. I. and M. M. Pai (2014): “Coarse decision making and overfitting,” Journal of Economic Theory, 150, 467–486.

Atakan, A. E. and M. Ekmekci (2014): “Auctions, actions, and the failure of information aggregation,” American Economic Review, 104.

Bergemann, D. and S. Morris (2005): “Robust mechanism design,” Econometrica, 73, 1771–1813.

Berk, R. H. (1970): “Consistency a posteriori,” The Annals of Mathematical Statistics, 894–906.

Bohren, J. A. (2016): “Informational herding with model misspecification,” Journal of Economic Theory, 163, 222–247.

Bohren, J. A. and D. Hauser (2017): “Bounded rationality and learning: A framework and a robustness result,” Working Paper, University of Pennsylvania.

Bond, P. and H. Eraslan (2010): “Information-based trade,” Journal of Economic Theory, 145, 1675–1703.

Bunke, O. and X. Milhaud (1998): “Asymptotic behavior of Bayes estimates under possibly incorrect models,” The Annals of Statistics, 26, 617–644.

Burnham, K. P. and D. R. Anderson (2003): Model selection and multimodel inference: a practical information-theoretic approach, Springer Science & Business Media.
References

Acemoglu, D., V. Chernozhukov, and M. Yildiz (2016): “Fragility of asymptotic agreement under Bayesian learning,” Theoretical Economics, 11, 187–225.

Akaike, H. (1974): “A new look at the statistical model identification,” IEEE Transactions on Automatic Control, 19, 716–723.

Al-Najjar, N. I. (2009): “Decision makers as statisticians: Diversity, ambiguity, and learning,” Econometrica, 77, 1371–1401.

Al-Najjar, N. I. and M. M. Pai (2014): “Coarse decision making and overfitting,” Journal of Economic Theory, 150, 467–486.

Atakan, A. E. and M. Ekmekci (2014): “Auctions, actions, and the failure of information aggregation,” American Economic Review, 104.

Bergemann, D. and S. Morris (2005): “Robust mechanism design,” Econometrica, 73, 1771–1813.

Berk, R. H. (1970): “Consistency a posteriori,” The Annals of Mathematical Statistics, 894–906.

Bohren, J. A. (2016): “Informational herding with model misspecification,” Journal of Economic Theory, 163, 222–247.

Bohren, J. A. and D. Hauser (2017): “Bounded rationality and learning: A framework and a robustness result,” Working Paper, University of Pennsylvania.

Bond, P. and H. Eraslan (2010): “Information-based trade,” Journal of Economic Theory, 145, 1675–1703.

Bunke, O. and X. Milhaud (1998): “Asymptotic behavior of Bayes estimates under possibly incorrect models,” The Annals of Statistics, 26, 617–644.

Burnham, K. P. and D. R. Anderson (2003): Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Springer Science & Business Media.

Carroll, G. (2015): “Robustness and linear contracts,” American Economic Review, 105, 536–63.

Cattaneo, M. D., M. Jansson, and W. K. Newey (2018): “Inference in Linear Regression Models with Many Covariates and Heteroscedasticity,” Journal of the American Statistical Association, 113, 1350–1361.

Chassang, S. (2013): “Calibrated incentive contracts,” Econometrica, 81, 1935–1971.

Cherry, J. and Y. Salant (2018): “Statistical Inference in Games,” Tech. rep., mimeo.

Claeskens, G. and N. Hjort (2008): Model Selection and Model Averaging, Cambridge Books.

Eliaz, K. and R. Spiegler (2018): “A Model of Competing Narratives,” CEPR Discussion Paper No. DP13319.

Esponda, I. and D. Pouzo (2016): “Berk–Nash equilibrium: A framework for modeling agents with misspecified models,” Econometrica, 84, 1093–1130.

Eyster, E. and M. Rabin (2005): “Cursed equilibrium,” Econometrica, 73, 1623–1672.

Ferguson, T. (1967): Mathematical Statistics: A Decision Theoretic Approach, vol. 7, Academic Press, New York.

Frick, M., R. Iijima, and Y. Ishii (2019a): “Dispersed Behavior and Perceptions in Assortative Societies.”

Frick, M., R. Iijima, and Y. Ishii (2019b): “Misinterpreting Others and the Fragility of Social Learning,” Cowles Foundation Discussion Paper.

Fudenberg, D., G. Lanzani, and P. Strack (2020): “Limit Points of Endogenous Misspecified Learning,” Available at SSRN.

Fudenberg, D., G. Romanyuk, and P. Strack (2017): “Active learning with a misspecified prior,” Theoretical Economics, 12, 1155–1189.

Greene, W. H. (2003): Econometric Analysis, 5th ed., Prentice Hall.

Hansen, B. (2020): “Econometrics.”

Harrison, J. M. and D. M. Kreps (1978): “Speculative investor behavior in a stock market with heterogeneous expectations,” The Quarterly Journal of Economics, 92, 323–336.

Heidhues, P., B. Kőszegi, and P. Strack (2018): “Unrealistic expectations and misguided learning,” Econometrica, 86, 1159–1214.

Heidhues, P., B. Kőszegi, and P. Strack (2020): “Convergence in Models of Misspecified Learning.”

Horn, R. A. and C. R. Johnson (1990): Matrix Analysis, Cambridge University Press.

Jehiel, P. (2005): “Analogy-based expectation equilibrium,” Journal of Economic Theory, 123, 81–104.

Kass, R., L. Tierney, and J. B. Kadane (1990): “The validity of posterior expansions based on Laplace's method,” in Bayesian and Likelihood Methods in Statistics and Econometrics, ed. by S. Geisser, J. Hodges, S. Press, and A. Zellner, vol. 7, 473.

Kleijn, B. and A. Van der Vaart (2012): “The Bernstein-von-Mises theorem under misspecification,” Electronic Journal of Statistics, 6, 354–381.

Levy, G., R. Razin, and A. Young (2019): “Misspecified Politics and the Recurrence of Populism,” Tech. rep., Working Paper.

Liang, A. (2018): “Games of Incomplete Information Played by Statisticians,” Working Paper, University of Pennsylvania.

Madarász, K. and A. Prat (2017): “Sellers with misspecified models,” The Review of Economic Studies, 84, 790–815.

Mailath, G. J. and L. Samuelson (2019): “The Wisdom of a Confused Crowd: Model-Based Inference,” Cowles Foundation Discussion Paper.

Mallows, C. L. (1973): “Some Comments on CP,” Technometrics, 15, 661–675.

Morris, S. (1994): “Trade with heterogeneous prior beliefs and asymmetric information,” Econometrica, 62, 1327–1347.

Morris, S. (1995): “The common prior assumption in economic theory,” Economics & Philosophy, 11, 227–253.

Müller, U. K. (2013): “Risk of Bayesian inference in misspecified models, and the sandwich covariance matrix,” Econometrica, 81, 1805–1849.
Nishii, R. (1984): “Asymptotic properties of criteria for selection of variables in multiple regression,” The Annals of Statistics, 12, 758–765.

Ortoleva, P. and E. Snowberg (2015): “Overconfidence in political behavior,” American Economic Review, 105, 504–35.

Ottaviani, M. and P. N. Sørensen (2015): “Price reaction to information with heterogeneous beliefs and wealth effects: Underreaction, momentum, and reversal,” American Economic Review, 105, 1–34.

Pearl, J. (2009): Causality, Cambridge University Press.

Phillips, P. C. B. (1987): “Towards a Unified Asymptotic Theory for Autoregression,” Biometrika, 74, 535–547.

Scheinkman, J. A. and W. Xiong (2003): “Overconfidence and speculative bubbles,” Journal of Political Economy, 111, 1183–1220.

Schwarz, G. (1978): “Estimating the Dimension of a Model,” The Annals of Statistics, 6, 461–464.

Spiegler, R. (2006): “The market for quacks,” The Review of Economic Studies, 73, 1113–1131.

Spiegler, R. (2013): “Placebo reforms,” American Economic Review, 103, 1490–1506.

Spiegler, R. (2016): “Bayesian networks and boundedly rational expectations,” The Quarterly Journal of Economics, 131, 1243–1290.

Staiger, D. and J. Stock (1997): “Instrumental Variables Regression with Weak Instruments,” Econometrica, 65, 557–586.

Yildiz, M. (2003): “Bargaining without a common prior—an immediate agreement theorem,” Econometrica, 71, 793–811.

A Main Appendix

A.1 Second-price auction

Consider a second-price auction where, as in Atakan and Ekmekci (2014), the winner of the auction gets to choose an action that affects the value of the asset. Specifically, the action has a value that depends on her ability to predict a given variable, as in the examples given in the introduction. Formally, fixing the environment defined above (DGP, agents, etc.), consider a game with the following timing:

1. Nature draws $\theta_0 \in \Theta$;
2. All agents see a common dataset $D_n$ drawn according to $Q_{\theta_0}$;
3. Agents submit bids in a sealed-bid second-price auction;
4. The winner observes $x$ randomly drawn according to $P$ and chooses a real-valued action $a$;
5. The winner gets a lump-sum payoff of $M - (y-a)^2$, where $M$ is a large positive number.

Every bidder seeks to maximize the expected value of $M - (y-a)^2$, that is, to minimize the expected loss $(y-a)^2$ discussed above. Because agents see a common dataset, an agent with prior $\pi$ has an expected value of $M - L^*(\pi, D_n)$ for winning. In the standard dominant-strategy equilibrium, the winning agent is the one with the highest value: since $M$ is common across agents, the winner is thus the agent with the lowest expected loss (according to her own prior) given the observed data. Notice that since all agents observe the same dataset, and thus there is no asymmetric information (only heterogeneous priors), winner's-curse-type considerations do not apply.

Our results may also shed light on political competition or board meetings. While we do not develop these formally, intuitively they would correspond to an analogous all-pay auction. Agents have different models of how to forecast payoff-relevant unknowns from observables. The action taken (by the government body or company) depends on this forecast. Agents' willingness to lobby for their model depends on how confident they are in their model, and the amount of effort they spend lobbying influences selection.
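As a minimal illustration of the timing above, the following sketch computes the dominant-strategy outcome of the sealed-bid second-price auction when each agent bids her expected value $M - L^*(\pi, D_n)$. The agent names and the numeric losses are placeholders standing in for posterior expected losses.

```python
# Schematic second-price auction of A.1: each agent bids M - L*(pi, D_n);
# the highest bidder wins and pays the second-highest bid.
M = 100.0
posterior_losses = {"agent_small": 3.1, "agent_true": 2.4, "agent_large": 2.7}

bids = {name: M - loss for name, loss in posterior_losses.items()}
ranked = sorted(bids, key=bids.get, reverse=True)
winner, runner_up = ranked[0], ranked[1]
print(winner, bids[runner_up])  # most confident agent wins; pays 2nd-highest bid
```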
A.2 Proof of Lemma 1

Proof. Fix a dataset $D_n$. We need to analyze

$$E_\pi\Big[E_P\big[(x'\beta - f^*_{(\pi,D_n)}(x))^2\big] \,\Big|\, D_n\Big].$$

Substituting $f^*$ from (4), this term equals

$$E_\pi\Big[E_P\big[\big((\beta - E_\pi[\beta\,|\,D_n])'x\big)^2\big] \,\Big|\, D_n\Big].$$

Recalling that for a scalar $a$, $a = \mathrm{Tr}(a)$, we have

$$= E_\pi\Big[E_P\big[\mathrm{Tr}\big[((\beta - E_\pi[\beta\,|\,D_n])'x)^2\big]\big] \,\Big|\, D_n\Big],$$

and then, by symmetry and linearity of the trace operator, we can conclude

$$= E_\pi\Big[E_P\big[\mathrm{Tr}\big[(\beta - E_\pi[\beta\,|\,D_n])(\beta - E_\pi[\beta\,|\,D_n])'\,xx'\big]\big] \,\Big|\, D_n\Big]$$

$$= E_\pi\Big[\mathrm{Tr}\big[(\beta - E_\pi[\beta\,|\,D_n])(\beta - E_\pi[\beta\,|\,D_n])'\,E_P[xx']\big] \,\Big|\, D_n\Big]$$

$$= \mathrm{Tr}\Big[E_\pi\big[(\beta - E_\pi[\beta\,|\,D_n])(\beta - E_\pi[\beta\,|\,D_n])' \,\big|\, D_n\big]\,E_P[xx']\Big].$$

Finally, by the definition of variance, we have the desired form

$$= \mathrm{Tr}\big(V_\pi(\beta\,|\,D_n)\,E_P[xx']\big). \qquad \square$$

A.3 Generalization and Proof of Theorem 1

In this section we prove Theorem 1 under a set of less restrictive assumptions. Instead of requiring the prior $\pi$ to be infinitely differentiable, we only require it to be six times differentiable.

Assumption 3. Each agent has a prior over $\theta$ characterized by a six times continuously differentiable and strictly positive probability density function $\pi(\cdot)$ over $(\beta_{J(\pi)}', \sigma^2_\epsilon)' \in \mathbb{R}^{|J(\pi)|} \times \mathbb{R}_+$.

We also relax the assumption that the true DGP is a linear regression model.

Assumption 4. Let $P_0$ denote the joint distribution of $(x, y)$. Let the data $D_n := ((x_1, y_1), \ldots, (x_n, y_n))$ denote an i.i.d. sample from $P_0$. Then:

1. The smallest eigenvalue of the matrix $E_{P_0}[xx']$ is strictly positive.
2. $(x, y)$ have finite moments of all orders.
3. Let $J(\pi)$ denote the subset of explanatory variables considered relevant under a prior $\pi$ satisfying Assumption 3. Let $\beta^*(\pi)$ be the coefficient of the best linear predictor for $y$ based on $x_{J(\pi)}$, and let $\sigma^{2*}(\pi) \equiv E_{P_0}[(y - x_{J(\pi)}'\beta^*(\pi))^2]$. We assume that condition (2.3) of Kleijn and Van der Vaart (2012) holds at $\theta^*(\pi) \equiv (\beta^*(\pi), \sigma^{2*}(\pi))$.

Part (1) guarantees that the matrix of population second moments is both finite and invertible. This implies there is a unique parameter $\beta_0$ satisfying $E_{P_0}[x(y - x'\beta_0)] = 0$, and we interpret it as the true parameter. Part (2) will be used to invoke a standard Central Limit Theorem for the asymptotic distribution of the Ordinary Least Squares (OLS) estimator in a linear regression model. Part (3) can be thought of as imposing the Bernstein-von Mises Theorem (BVMT) for misspecified models.

Three remarks are in order. First, Part (1) also implies that the population second moments can be consistently estimated from the sample second moments of the data; we use this to characterize the probability limit of the Maximum Likelihood Estimators based on the possibly misspecified likelihoods of the Bayesian agents. Second, note that in principle we allow the distribution of covariates assumed by the competing agents (denoted $P$) to differ from the distribution of covariates under $P_0$; Part (2) will also be used to characterize the asymptotic distribution of the difference in model fit for models that are larger than the true model, and the assumption allows for conditional heteroskedasticity of the regression residuals. Third, if we assumed that the agents' DGP is a correctly specified parametric statistical model, the BVMT would imply that the posterior distribution of the parameter $\theta$ is approximately Normal, centered at the maximum likelihood estimator; a similar result is available for misspecified models, see Bunke and Milhaud (1998) and Kleijn and Van der Vaart (2012). Instead of imposing the BVMT for misspecified models as a high-level assumption (as, for example, Condition 1 in Müller (2013)), we only assume condition (2.3) of Kleijn and Van der Vaart (2012).
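As an illustration of Part (3), the sketch below approximates the 'pseudo-true' parameter $\theta^*(\pi) = (\beta^*(\pi), \sigma^{2*}(\pi))$ of a misspecified agent by Monte Carlo; $\beta^*(\pi)$ solves $E_{P_0}[x_{J(\pi)}(y - x_{J(\pi)}'\beta)] = 0$. The data generating process and the omitted-variable pattern are illustrative choices of ours.

```python
# Monte Carlo approximation of the pseudo-true parameter of a misspecified
# agent: beta*(pi) is the best linear predictor of y given only x_{J(pi)}.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.standard_normal((n, 3))
x[:, 1] = 0.6 * x[:, 0] + 0.8 * x[:, 1]          # correlated covariates
y = 1.0 * x[:, 0] + 1.0 * x[:, 1] + rng.standard_normal(n)

J = [0]                                          # misspecified model omitting x_1
xJ = x[:, J]
beta_star = np.linalg.solve(xJ.T @ xJ / n, xJ.T @ y / n)
sigma2_star = np.mean((y - xJ @ beta_star) ** 2)
print(beta_star, sigma2_star)                    # ~1.6 and ~1.64 here
# Omitting the correlated covariate inflates both beta*(pi) and sigma^2*(pi):
# the agent attributes part of x_1's effect to x_0 and carries a strictly
# larger residual variance than the true model (sigma^2 = 1).
```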
We are now ready to state and prove our more general theorem; Theorem 1 follows trivially because it makes strictly stronger assumptions.

Theorem 3. Let $\Pi$ be a finite collection of agents' priors that satisfy Assumption 3. Suppose the true data generating process $P_0$ satisfies Assumption 4. Define $\beta_0$ as the parameter such that $E_{P_0}[x(y - x'\beta_0)] = 0$, and let $J_0$ denote the associated true model. If $\Pi$ contains a prior $\pi$ such that $J_0 \subseteq J(\pi)$, then

$$\lim_{n\to\infty} P\Big(\exists\, \pi \in \operatorname*{argmin}_{\pi\in\Pi} L^*(\pi, D_n) \ \text{s.t.}\ J_0 \subseteq J(\pi)\Big) = 1.$$

Moreover, for any $\pi_L \in \Pi$ for which $J_0 \subset J(\pi_L)$, and any $\pi_0 \in \Pi$ with $J(\pi_0) = J_0$,

$$\lim_{n\to\infty} P\big(L^*(\pi_L, D_n) < L^*(\pi_0, D_n)\big) \in (0, 1).$$

Proof. Lemma 1 has shown that the posterior loss for an agent with prior $\pi$ is

$$L^*(\pi, D_n) = E_\pi[\sigma^2 \,|\, D_n] + \mathrm{tr}\big(V_\pi(\beta_{J(\pi)} \,|\, D_n)\, E_P[x_{J(\pi)} x_{J(\pi)}']\big).$$

Lemma 2 shows that under parts (1)-(2) of Assumption 4, the first term in the posterior loss, $E_\pi[\sigma^2|D_n]$, admits the following Kass et al. (1990) expansion:

$$E_\pi[\sigma^2|D_n] = \hat\sigma^2(\pi) - \frac{1}{n}\underbrace{\left(2\sigma^{*4}(\pi)\left\{\frac{(\partial\pi/\partial\sigma^2)(\theta^*(\pi))}{\pi(\theta^*(\pi))}\right\} + \sigma^{*2}(\pi)\,\big(|J(\pi)|+4\big)\right)}_{\equiv F(\pi)} + o_P\Big(\frac{1}{n}\Big), \qquad (8)$$

where $\hat\sigma^2(\pi)$ denotes the Maximum Likelihood estimator of $\sigma^2_\epsilon$ according to the linear regression model with covariates $J(\pi)$, and the parameter $\theta^*(\pi)$ is defined as in Part (3) of Assumption 4.

The likelihood of the linear regression model with covariates $X_{J(\pi)}$ satisfies the stochastic local asymptotic normality condition in equation (2.1) of Kleijn and Van der Vaart (2012) at $\theta^* \equiv \theta^*(\pi) = (\beta^*(\pi), \sigma^{2*}(\pi))$. Algebra shows the condition is satisfied with

$$V_{\theta^*} \equiv \begin{pmatrix} \frac{1}{\sigma^{2*}(\pi)}E_{P_0}[x_{J(\pi)}x_{J(\pi)}'] & 0 \\ 0 & \frac{1}{2\sigma^{4*}(\pi)} \end{pmatrix}, \qquad \Delta_{n,\theta^*} \equiv V_{\theta^*}^{-1}\begin{pmatrix} \frac{1}{\sigma^{2*}(\pi)}\frac{1}{\sqrt n}\sum_{i=1}^n x_{J(\pi),i}\,(y_i - x_{J(\pi),i}'\beta^*(\pi)) \\ \frac{1}{2\sigma^{4*}(\pi)}\frac{1}{\sqrt n}\sum_{i=1}^n \big((y_i - x_{J(\pi),i}'\beta^*(\pi))^2 - \sigma^{2*}(\pi)\big) \end{pmatrix}.$$

Therefore, Part (3) of Assumption 4 implies that Theorem 2.1 in Kleijn and Van der Vaart (2012) holds, and consequently the model uncertainty term equals

$$\frac{\sigma^{2*}(\pi)}{n}\,\mathrm{tr}\big(E_{P_0}[x_{J(\pi)}x_{J(\pi)}']^{-1}\, E_P[x_{J(\pi)}x_{J(\pi)}']\big) + o_P\Big(\frac{1}{n}\Big).$$

This means that for any two models $\pi, \pi'$ we have

$$L^*(\pi, D_n) - L^*(\pi', D_n) = \hat\sigma^2(\pi) - \hat\sigma^2(\pi') + O_P\Big(\frac{1}{n}\Big).$$

Take $\pi$ to be the prior of an agent that excludes some relevant explanatory variable, that is, $J_0 \not\subseteq J(\pi)$. Take $\pi'$ to be the prior of any agent that includes all relevant explanatory variables, that is, $J_0 \subseteq J(\pi')$ (such an agent exists by assumption). The difference in posterior loss is eventually strictly positive.
This follows from the well-known fact that the probability limit of the difference $\hat\sigma^2(\pi) - \hat\sigma^2(\pi')$ is strictly positive (under our assumptions, the misspecified model has strictly larger residual variance than the true model). This shows that

$$\lim_{n\to\infty} P\Big(\exists\, \pi \in \operatorname*{argmin}_{\pi\in\Pi} L^*(\pi, D_n) \ \text{s.t.}\ J_0 \subseteq J(\pi)\Big) = 1.$$

For the last part of the theorem, take $\pi_L$ to be an agent with a prior that includes all relevant explanatory variables but also some irrelevant ones, that is, $J_0 \subset J(\pi_L)$. Then

$$P\big(L^*(\pi_L, D_n) < L^*(\pi_0, D_n)\big) = P\big(n(L^*(\pi_0, D_n) - L^*(\pi_L, D_n)) > 0\big) = P\Big(n\big(\hat\sigma^2(\pi_0) - \hat\sigma^2(\pi_L)\big) > \big(F(\pi_0) - F(\pi_L)\big) + o_P(1)\Big),$$

where we have used the Kass et al. (1990) expansion in (8). Standard algebra of linear regression (e.g., Theorems 3.4 and 3.5 in Greene (2003)) shows that under Part (2) of Assumption 4,

$$n\big(\hat\sigma^2(\pi_0) - \hat\sigma^2(\pi_L)\big) \overset{d}{\to} \zeta,$$

where $\zeta$ is an absolutely continuous real-valued random variable supported on the positive part of the real line. In particular,

$$\zeta \equiv \xi'\,\big[R\,(E_{P_0}[x_{J(\pi_L)}x_{J(\pi_L)}'])^{-1}R'\big]^{-1}\,\xi\,/\,\sigma_0^2, \quad \xi \sim N_{|J(\pi_L)\setminus J(\pi_0)|}\Big(0,\ R\,(E_{P_0}[x_{J(\pi_L)}x_{J(\pi_L)}'])^{-1}\,E_{P_0}\big[(y-x'\beta_0)^2\,x_{J(\pi_L)}x_{J(\pi_L)}'\big]\,(E_{P_0}[x_{J(\pi_L)}x_{J(\pi_L)}'])^{-1}\,R'\Big),$$

where $R$ is the $|J(\pi_L)\setminus J(\pi_0)| \times |J(\pi_L)|$ matrix that selects the entries of $\beta_{J(\pi_L)}$ that are zero under the model specified by $\pi_0$, and $|J|$ denotes the cardinality of the set $J$. This shows that

$$\lim_{n\to\infty} P\big(L^*(\pi_L, D_n) < L^*(\pi_0, D_n)\big) \in (0, 1). \qquad \square$$
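The fact driving the first part of the theorem, that a model omitting a relevant covariate has strictly larger limiting residual variance, can be illustrated numerically. In the sketch below, the coefficients and sample sizes are illustrative choices of ours.

```python
# Monte Carlo illustration: the ML residual variance of a model omitting a
# relevant covariate exceeds that of the correct model, with gap ~ beta_1^2.
import numpy as np

rng = np.random.default_rng(3)
for n in (20, 200, 20_000):
    X = rng.standard_normal((n, 2))
    y = X @ np.array([1.0, 0.5]) + rng.standard_normal(n)

    def sigma2_hat(cols):
        Xj = X[:, cols]
        bhat, *_ = np.linalg.lstsq(Xj, y, rcond=None)
        return np.mean((y - Xj @ bhat) ** 2)

    print(n, sigma2_hat([0]) - sigma2_hat([0, 1]))
# The gap concentrates around beta_1^2 = 0.25 > 0 as n grows.
```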
A.4 Proof of Proposition 1

Proof. Denote the single datapoint as $D_1 = (Y, X)$, where $Y \in \mathbb{R}$ and $X \in \mathbb{R}^{1\times k}$ ($k$ is the number of covariates), so that $X = x'$. First, consider any agent $j$ with a single explanatory variable $\kappa$ in her model (denoted $x_\kappa$). By Lemma 3,

$$L^*(\pi_j, D_1) = \frac{2b_0 + \Big(y^2 - \frac{y^2 x_\kappa^2}{x_\kappa^2+\gamma}\Big)}{2a_0 - 1}\left(1 + \frac{1}{x_\kappa^2+\gamma}\right) = \frac{2b_0 + \frac{y^2\gamma}{x_\kappa^2+\gamma}}{2a_0 - 1}\left(1 + \frac{1}{x_\kappa^2+\gamma}\right).$$

The winning agent among the single-variable models is therefore the agent with the variable $\kappa$ that maximizes $x_\kappa^2$. Without loss of generality, call this variable $x_1$.

To economize on notation, now consider the full model with all $k$ explanatory variables; it will be clear from the logic that the argument works for any model larger than a single variable. For an agent $j$ with all $k$ variables, we know that

$$L^*(\pi_j, D_1) = \frac{2b_0 + y^2\big(1 - X(X'X + \gamma k I_k)^{-1}X'\big)}{2a_0 - 1}\Big(1 + \mathrm{Tr}\big[(X'X + \gamma k I_k)^{-1}\big]\Big).$$

To show that this model always loses, it suffices to show that

$$1 - X(X'X + \gamma k I_k)^{-1}X' \geq \frac{\gamma}{x_1^2+\gamma}, \qquad \mathrm{Tr}\big[(X'X + \gamma k I_k)^{-1}\big] \geq \frac{1}{x_1^2+\gamma}.$$

We will handle each of these separately. Let us start with the second. Recall that for any matrix $A$, $\mathrm{Tr}(A)$ equals the sum of the eigenvalues of $A$; that the eigenvalues of $A^{-1}$ are the reciprocals of the eigenvalues of $A$ when $A$ is invertible; and that if $A$ is positive definite, all its eigenvalues are strictly positive.

By the Gershgorin circle theorem (see, e.g., Theorem 6.1.1 of Horn and Johnson (1990)), all the eigenvalues of a matrix $A$ lie within $\bigcup_{\kappa=1}^k [a_{\kappa,\kappa} - R_\kappa,\ a_{\kappa,\kappa} + R_\kappa]$, where $R_\kappa$ is the sum of the absolute values of the non-diagonal terms on row $\kappa$, and $a_{\kappa,\kappa}$ is the $\kappa$-th diagonal element.

Consider the matrix $X'X + \gamma k I_k$. Observe that in this case $R_\kappa = |x_\kappa|\big(\sum_{\kappa'\neq\kappa}|x_{\kappa'}|\big)$, while $a_{\kappa,\kappa} = x_\kappa^2 + k\gamma$. Therefore the largest possible eigenvalue is $|x_1|\big(\sum_\kappa |x_\kappa|\big) + k\gamma$, which in turn is smaller than $k(x_1^2+\gamma)$.

Therefore, for the matrix $(X'X+\gamma k I_k)^{-1}$, all eigenvalues are larger than $\frac{1}{k(x_1^2+\gamma)}$, and therefore the sum of the eigenvalues is at least $\frac{1}{x_1^2+\gamma}$ (since there are $k$ eigenvalues). We can therefore conclude that $\mathrm{Tr}\big[(X'X+\gamma k I_k)^{-1}\big] \geq \frac{1}{x_1^2+\gamma}$, as desired.

We are left to prove that

$$1 - X(X'X+\gamma k I_k)^{-1}X' \geq \frac{\gamma}{x_1^2+\gamma} \iff X(X'X+\gamma k I_k)^{-1}X' \leq \frac{x_1^2}{x_1^2+\gamma}.$$

Now observe that $X(X'X+\gamma k I_k)^{-1}X'$ is a scalar, and that for a scalar $a = \mathrm{Tr}(a)$. Therefore we have that

$$X(X'X+\gamma k I_k)^{-1}X' = \mathrm{Tr}\big[X(X'X+\gamma k I_k)^{-1}X'\big] = \mathrm{Tr}\big[(X'X+\gamma k I_k)^{-1}X'X\big] = \mathrm{Tr}\Big[\big(\tfrac{1}{\gamma k}X'X + I_k\big)^{-1}\tfrac{1}{\gamma k}X'X\Big].$$

Denote $\tfrac{1}{\gamma k}X'X$ by $A$. Substituting, the expression equals $\mathrm{Tr}[(A+I_k)^{-1}A]$. Now observe that if $\lambda$ is an eigenvalue of $A$, then $\frac{\lambda}{1+\lambda}$ is an eigenvalue of $(A+I_k)^{-1}A$. To see this, suppose $v$ is an eigenvector of $A$ with eigenvalue $\lambda$. Then

$$Av = \lambda v \implies (A+I_k)v = (1+\lambda)v \implies (A+I_k)^{-1}v = \frac{1}{1+\lambda}v \implies (A+I_k)^{-1}Av = \frac{\lambda}{1+\lambda}v.$$

Substituting this in, we have $\mathrm{Tr}[(A+I_k)^{-1}A] = \sum_{i=1}^k \frac{\lambda_i}{1+\lambda_i}$. Therefore we are left to show that

$$\sum_{i=1}^k \frac{\lambda_i}{1+\lambda_i} \leq \frac{x_1^2}{x_1^2+\gamma},$$

where the $\lambda_i$ are the eigenvalues of $\frac{1}{\gamma k}X'X$; in particular, $\sum_i \lambda_i = \frac{1}{\gamma k}\sum_i x_i^2$. Note that $X'X$ is not full rank: indeed, its null space has dimension $k-1$, so it has the eigenvalue $0$ with multiplicity $k-1$. The unique non-zero eigenvalue must then be $\frac{1}{\gamma k}\sum_i x_i^2$. Substituting in, we have

$$\sum_{i=1}^k \frac{\lambda_i}{1+\lambda_i} = \frac{\frac{1}{\gamma k}\sum_i x_i^2}{\frac{1}{\gamma k}\sum_i x_i^2 + 1} = \frac{\sum_i x_i^2}{\sum_i x_i^2 + \gamma k} \leq \frac{x_1^2}{x_1^2+\gamma},$$

where the last inequality follows since $x_1^2 = \max_i\{x_i^2 : 1\leq i\leq k\}$ implies $\sum_i x_i^2 \leq k\,x_1^2$. $\square$
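The two inequalities at the heart of the proof can be spot-checked numerically, as in the sketch below; the dimension $k$ and the penalty $\gamma$ are arbitrary illustrative values, and both checks hold for every draw.

```python
# Numeric sanity check of the two inequalities in the proof of Proposition 1,
# for a random single datapoint x in R^k and penalty gamma.
import numpy as np

rng = np.random.default_rng(4)
k, gamma = 5, 0.7
x = rng.standard_normal(k)
X = x.reshape(1, k)                               # one observation, k covariates
x1sq = np.max(x ** 2)                             # the best single covariate

M_inv = np.linalg.inv(X.T @ X + gamma * k * np.eye(k))
shrink = 1.0 - (X @ M_inv @ X.T).item()           # prediction shrinkage, full model

print(shrink >= gamma / (x1sq + gamma))           # first inequality: True
print(np.trace(M_inv) >= 1.0 / (x1sq + gamma))    # second inequality: True
```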
A.5 Proof of Theorem 2

Proof. It is well known that for a prior $\pi$ in the Normal-Inverse Gamma family,

$$V_\pi[\beta_{J(\pi)}\,|\,D_n] = E_\pi[\sigma^2_\epsilon\,|\,D_n]\,\big(X_{J(\pi)}'X_{J(\pi)} + \gamma|J(\pi)|\,I_{|J(\pi)|}\big)^{-1} = \frac{E_\pi[\sigma^2_\epsilon\,|\,D_n]}{n}\left(\frac{X_{J(\pi)}'X_{J(\pi)}}{n} + \frac{\gamma|J(\pi)|}{n}I_{|J(\pi)|}\right)^{-1}.$$

Under the assumptions on $P_0$ in Theorem 2,

$$\left(\frac{X_{J(\pi)}'X_{J(\pi)}}{n} + \frac{\gamma|J(\pi)|}{n}I_{|J(\pi)|}\right)^{-1} = E_{P_0}[x_{J(\pi)}x_{J(\pi)}']^{-1} + o_P(1).$$

Consequently,

$$\mathrm{Tr}\big(V_\pi[\beta_{J(\pi)}|D_n]\,E_P[x_{J(\pi)}x_{J(\pi)}']\big) = E_\pi[\sigma^2_\epsilon|D_n]\left(\frac{|J(\pi)|}{n} + o_P\Big(\frac{1}{n}\Big)\right).$$

It follows by algebra that for any priors $\pi, \pi'$ in the Normal-Inverse Gamma family,

$$L^*(\pi', D_n) > L^*(\pi, D_n) \iff \big(E_{\pi'}[\sigma^2_\epsilon|D_n] - E_\pi[\sigma^2_\epsilon|D_n]\big)\left(1 + \frac{|J(\pi')|}{n} + o_P\Big(\frac{1}{n}\Big)\right) > E_\pi[\sigma^2_\epsilon|D_n]\left(\frac{|J(\pi)| - |J(\pi')|}{n}\right).$$

It is also well known that for a prior $\pi$ in the Normal-Inverse Gamma family, the posterior mean of $\beta_{J(\pi)}$ is the 'Ridge estimator'

$$\hat\beta_\pi := \big(X_{J(\pi)}'X_{J(\pi)} + \gamma|J(\pi)|\,I_{|J(\pi)|}\big)^{-1}X_{J(\pi)}'y,$$

which solves the problem

$$\min_{\beta\in\mathbb{R}^{|J(\pi)|}}\ (y - X_{J(\pi)}\beta)'(y - X_{J(\pi)}\beta) + \gamma|J(\pi)|\,\|\beta\|^2.$$

First, consider two priors $\pi, \pi'$ such that $J(\pi') \subset J(\pi)$ and $J(\pi') = J_0$. In a slight abuse of notation, let $\hat\beta_{\pi'}$ denote the vector in $\mathbb{R}^{|J(\pi)|}$ with all the coordinates in $J(\pi)\setminus J(\pi')$ equal to zero. Also, let $J$ abbreviate $J(\pi)$. Equation (6) implies that for any two such priors $\pi, \pi'$, the quantity $n\big(E_{\pi'}[\sigma^2_\epsilon|D_n] - E_\pi[\sigma^2_\epsilon|D_n]\big)$ is proportional to the sum of

$$(y - X_J\hat\beta_{\pi'})'(y - X_J\hat\beta_{\pi'}) - (y - X_J\hat\beta_\pi)'(y - X_J\hat\beta_\pi) \qquad (9)$$

and

$$\gamma\big(|J(\pi')|\,\|\hat\beta_{\pi'}\|^2 - |J|\,\|\hat\beta_\pi\|^2\big), \qquad (10)$$

where the proportionality constant is $c_n := (2a_0/n + 1 - 2/n)^{-1}$.

Algebra shows that the expression in (9) equals

$$-2(y - X_J\hat\beta_\pi)'X_J(\hat\beta_{\pi'} - \hat\beta_\pi) + (\hat\beta_\pi - \hat\beta_{\pi'})'X_J'X_J(\hat\beta_\pi - \hat\beta_{\pi'}),$$

and the expression in (10) equals

$$\gamma|J(\pi')|\,(\hat\beta_\pi - \hat\beta_{\pi'})'(\hat\beta_\pi - \hat\beta_{\pi'}) - \gamma(|J| - |J(\pi')|)\,\hat\beta_\pi'\hat\beta_\pi + 2\gamma|J(\pi')|\,\hat\beta_\pi'(\hat\beta_{\pi'} - \hat\beta_\pi).$$

The first-order conditions defining the Ridge estimator imply

$$-2(y - X_J\hat\beta_\pi)'X_J + 2\gamma|J|\,\hat\beta_\pi' = 0.$$

Combining the previous displays,

$$n\big(E_{\pi'}[\sigma^2_\epsilon|D_n] - E_\pi[\sigma^2_\epsilon|D_n]\big) = c_n\Big((\hat\beta_\pi - \hat\beta_{\pi'})'\big(X_J'X_J + \gamma|J(\pi')|\,I_{|J|}\big)(\hat\beta_\pi - \hat\beta_{\pi'}) + \gamma(|J| - |J(\pi')|)\,\hat\beta_\pi'\hat\beta_\pi - 2\gamma(|J| - |J(\pi')|)\,\hat\beta_{\pi'}'\hat\beta_\pi\Big).$$

Under the assumptions of Theorem 2, and recalling that $J(\pi') = J_0$, this quantity is $O_P(1)$. However, under the same assumptions, $E_\pi[\sigma^2_\epsilon|D_n] = 2b_n/n + o_P(1)$. Since $b_n \in \omega(n^2)$, the previous term diverges to infinity. This implies that

$$P\left(n\big(E_{\pi'}[\sigma^2_\epsilon|D_n] - E_\pi[\sigma^2_\epsilon|D_n]\big)\Big(1 + \frac{|J(\pi')|}{n} + o_P(1)\Big) > E_\pi[\sigma^2_\epsilon|D_n]\,\big(|J(\pi)| - |J(\pi')|\big)\right)$$

converges to zero. We conclude that $J(\pi) \supset J(\pi') = J_0$ implies $P[L^*(\pi, D_n) < L^*(\pi', D_n)] \to 0$.

Now consider the same framework as above, but let $\pi$ be such that $J(\pi) = J_0$ and $|J(\pi')| < |J(\pi)|$. The probability that the smaller model, $\pi'$, is defeated by $\pi$ is

$$P\left(\big(E_{\pi'}[\sigma^2_\epsilon|D_n] - E_\pi[\sigma^2_\epsilon|D_n]\big)\Big(1 + \frac{|J(\pi')|}{n} + o_P(1)\Big) > \frac{E_\pi[\sigma^2_\epsilon|D_n]}{n}\,\big(|J(\pi)| - |J(\pi')|\big)\right).$$

Under the assumptions of the theorem, $E_{\pi'}[\sigma^2_\epsilon|D_n] - E_\pi[\sigma^2_\epsilon|D_n] = O_P(1)$. However, $E_\pi[\sigma^2_\epsilon|D_n]/n = 2b_n/n^2 + o_P(1/n)$, and since $b_n \in \omega(n^2)$, the latter diverges as $n$ grows large. We conclude that $P[L^*(\pi', D_n) < L^*(\pi, D_n)] \to 1$. The result follows. $\square$
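The first-order condition used in the proof above can be verified numerically, as in the sketch below; the dimensions, coefficients, and penalty value are illustrative choices of ours.

```python
# Verification that the Normal-Inverse Gamma posterior mean of beta is the
# Ridge estimator: it solves min (y - Xb)'(y - Xb) + lam * ||b||^2, so the
# gradient of the penalized criterion vanishes at the posterior mean.
import numpy as np

rng = np.random.default_rng(5)
n, d, gamma = 50, 3, 2.0
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -0.5, 0.0]) + rng.standard_normal(n)

lam = gamma * d                                   # penalty gamma * |J(pi)|
ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

grad = -2 * X.T @ (y - X @ ridge) + 2 * lam * ridge
print(np.allclose(grad, 0))                       # first-order condition holds
```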
A.6 Proof of Proposition 2

Proof. Suppose the known variance of $\epsilon$ is $\sigma^2_\epsilon$. Then for any agent with prior $\pi_j$, upon seeing data $D_n$, the posterior expected loss evaluates to

$$L^*(\pi_j, D_n) = \sigma^2_\epsilon + \mathrm{Tr}\big(V_\pi[\beta\,|\,D_n]\big),$$

where we have used the assumption that $E_P[xx'] = I$. Without loss of generality, suppose the larger model $J'$ is the entire set of observables, of size $k$. We need to show that there exists a model $J$ of size $|J|$ such that

$$\mathrm{Tr}\big(X'X + \gamma k I_k\big)^{-1} \geq \mathrm{Tr}\big(X_J'X_J + \gamma|J|\,I_{|J|}\big)^{-1}.$$

In particular, let $J$ be such that $\sum_{j\in J} e_j'(X'X+\gamma k I_k)^{-1}e_j \leq \sum_{j\in J''} e_j'(X'X+\gamma k I_k)^{-1}e_j$ for any $J''$ with $|J''| = |J|$, where $e_j$ denotes the $j$-th standard basis vector. Then it must be the case that

$$\mathrm{Tr}\big(X'X+\gamma k I_k\big)^{-1} \geq \frac{k}{|J|}\sum_{j\in J} e_j'(X'X+\gamma k I_k)^{-1}e_j.$$

Therefore it is sufficient to show that, for this model $J$,

$$\frac{k}{|J|}\sum_{j\in J} e_j'(X'X+\gamma k I_k)^{-1}e_j \geq \mathrm{Tr}\big(X_J'X_J + \gamma|J|\,I_{|J|}\big)^{-1}.$$

Without loss we can renumber the indices so that $J = \{1, 2, \ldots, |J|\}$. Let $L$ denote the set of remaining indices, i.e., $L = \{|J|+1, \ldots, k\}$. We can thus write the left-hand side of the inequality as

$$\frac{k}{|J|}\sum_{j\in J} e_j'\begin{pmatrix} X_J'X_J + \gamma k I_{|J|} & X_J'X_L \\ X_L'X_J & X_L'X_L + \gamma k I_{|L|} \end{pmatrix}^{-1} e_j.$$

Using the standard formula for the block inverse of a matrix, we can write this as

$$\frac{k}{|J|}\sum_{j\in J} e_j'\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} e_j, \quad \text{where } A_{11} = \big(X_J'X_J + \gamma k I_{|J|} - X_J'X_L(X_L'X_L+\gamma k I_{|L|})^{-1}X_L'X_J\big)^{-1}.$$

Substituting that in, we have

$$= \frac{k}{|J|}\,\mathrm{Tr}\big(X_J'X_J + \gamma k I_{|J|} - X_J'X_L(X_L'X_L+\gamma k I_{|L|})^{-1}X_L'X_J\big)^{-1}.$$

Therefore, taking $\frac{k}{|J|}$ to the other side, we are left to show that

$$\mathrm{Tr}\big(X_J'X_J + \gamma k I_{|J|} - X_J'X_L(X_L'X_L+\gamma k I_{|L|})^{-1}X_L'X_J\big)^{-1} \geq \mathrm{Tr}\Big(\frac{k}{|J|}X_J'X_J + \gamma k I_{|J|}\Big)^{-1}. \qquad (11)$$

Next, given four matrices $A, B, C, D$ where $A$ and $C$ are invertible, it is easy to show that

$$(A + BCD)^{-1} = A^{-1} - A^{-1}B\,(C^{-1} + DA^{-1}B)^{-1}\,DA^{-1}.$$

Suppose we define

$$A = X_J'X_J + \gamma k I_{|J|}, \quad B = -X_J'X_L, \quad C = (X_L'X_L + \gamma k I_{|L|})^{-1}, \quad D = X_L'X_J.$$

Note that in this case $A$ and $C$ are invertible by observation. In light of this, and the linearity of the trace operator, we can rewrite the left-hand side of (11) as

$$\mathrm{Tr}\big(A^{-1} - A^{-1}B(C^{-1}+DA^{-1}B)^{-1}DA^{-1}\big) = \mathrm{Tr}\big(X_J'X_J+\gamma k I_{|J|}\big)^{-1} - \mathrm{Tr}\big(A^{-1}B(C^{-1}+DA^{-1}B)^{-1}DA^{-1}\big),$$

where $A, B, C$, and $D$ are as defined above. So (11) can be written as

$$\mathrm{Tr}\big(X_J'X_J+\gamma k I_{|J|}\big)^{-1} - \mathrm{Tr}\big(A^{-1}B(C^{-1}+DA^{-1}B)^{-1}DA^{-1}\big) \geq \mathrm{Tr}\Big(\frac{k}{|J|}X_J'X_J + \gamma k I_{|J|}\Big)^{-1}.$$

To show this inequality, it is therefore sufficient to show that

$$\mathrm{Tr}\big(A^{-1}B(C^{-1}+DA^{-1}B)^{-1}DA^{-1}\big) \leq 0, \qquad (12)$$

$$\mathrm{Tr}\big(X_J'X_J+\gamma k I_{|J|}\big)^{-1} \geq \mathrm{Tr}\Big(\frac{k}{|J|}X_J'X_J + \gamma k I_{|J|}\Big)^{-1}. \qquad (13)$$

We now show each of these in turn. Let us start with the first. Noting that $B = -D'$, we have

$$(12) \iff \mathrm{Tr}\big(A^{-1}D'(C^{-1} - DA^{-1}D')^{-1}DA^{-1}\big) \geq 0.$$

In turn, since $A$ is symmetric, so is $A^{-1}$; so, defining $Q \equiv A^{-1}D'$,

$$\iff \mathrm{Tr}\big(Q\,(C^{-1} - DA^{-1}D')^{-1}\,Q'\big) \geq 0.$$

Since $QMQ'$ is a positive semidefinite matrix whenever $M$ is positive semidefinite (see, e.g.,
Observation 7.1.8 of Horn and Johnson (1990)), it is sufficient to show that $(C^{-1} - DA^{-1}D')$ is positive semidefinite: the trace of a matrix equals the sum of all its eigenvalues, and the eigenvalues of a positive semidefinite matrix are all non-negative. To see that it is, observe that

$$C^{-1} - DA^{-1}D' = X_L'X_L + \gamma k I_{|L|} - X_L'X_J\big(X_J'X_J+\gamma k I_{|J|}\big)^{-1}X_J'X_L = X_L'\Big(I_n - X_J\big(X_J'X_J+\gamma k I_{|J|}\big)^{-1}X_J'\Big)X_L + \gamma k I_{|L|}.$$

It is therefore sufficient to show that each of these two matrices is positive semidefinite. The latter is positive definite by observation. To show that the former is positive semidefinite, by another appeal to Observation 7.1.8 of Horn and Johnson (1990), it is sufficient to show that $\big(I_n - X_J(X_J'X_J+\gamma k I_{|J|})^{-1}X_J'\big)$ is positive semidefinite. But observe that

$$I_n - X_J\big(X_J'X_J+\gamma k I_{|J|}\big)^{-1}X_J' = I_n - \frac{1}{\gamma k}X_J\Big(\frac{1}{\gamma k}X_J'X_J + I_{|J|}\Big)^{-1}X_J'. \qquad (14)$$

Now, we know that for any square matrix $P$,

$$(I+P)^{-1} = I - (I+P)^{-1}P = I - P + (I+P)^{-1}P^2 = I + \sum_{j=1}^\infty (-1)^j P^j.$$

Substituting in $P = \frac{1}{\gamma k}X_J'X_J$, we have that

$$X_J\Big(\frac{1}{\gamma k}X_J'X_J + I_{|J|}\Big)^{-1}X_J' = X_J\Big(I_{|J|} + \sum_{j=1}^\infty \big(-\tfrac{1}{\gamma k}\big)^j (X_J'X_J)^j\Big)X_J' = X_JX_J' + \sum_{j=1}^\infty \big(-\tfrac{1}{\gamma k}\big)^j (X_JX_J')^{j+1}$$

$$= (X_JX_J')\Big(I_n + \sum_{j=1}^\infty \big(-\tfrac{1}{\gamma k}\big)^j (X_JX_J')^j\Big) = (X_JX_J')\Big(I_n + \frac{1}{\gamma k}X_JX_J'\Big)^{-1}$$

(the resulting exchange of $X_J$ across the inverse is the standard 'push-through' identity and holds in general, even when the series expansion itself requires eigenvalues smaller than one). Therefore we have that

$$(14) = I_n - \frac{1}{\gamma k}(X_JX_J')\Big(I_n + \frac{1}{\gamma k}X_JX_J'\Big)^{-1} = I_n - (X_JX_J')\big(\gamma k I_n + X_JX_J'\big)^{-1} = \gamma k\,\big(\gamma k I_n + X_JX_J'\big)^{-1},$$

which is positive definite by observation.

We are left, then, to show (13), i.e., that

$$\mathrm{Tr}\big(X_J'X_J + \gamma k I_{|J|}\big)^{-1} \geq \mathrm{Tr}\Big(\frac{k}{|J|}X_J'X_J + \gamma k I_{|J|}\Big)^{-1} \iff \mathrm{Tr}\Big(\big(X_J'X_J+\gamma k I_{|J|}\big)^{-1} - \Big(\frac{k}{|J|}X_J'X_J + \gamma k I_{|J|}\Big)^{-1}\Big) \geq 0.$$

Algebra shows

$$\mathrm{Tr}\Big(\big(X_J'X_J+\gamma k I_{|J|}\big)^{-1} - \Big(\tfrac{k}{|J|}X_J'X_J + \gamma k I_{|J|}\Big)^{-1}\Big) = \mathrm{Tr}\Big(\big(X_J'X_J+\gamma k I_{|J|}\big)^{-1}\Big(\tfrac{k-|J|}{|J|}X_J'X_J\Big)\Big(\tfrac{k}{|J|}X_J'X_J+\gamma k I_{|J|}\Big)^{-1}\Big)$$

$$= \mathrm{Tr}\Big(\tfrac{k-|J|}{|J|}\,X_J\big(X_J'X_J+\gamma k I_{|J|}\big)^{-1}\big(\tfrac{k}{|J|}X_J'X_J+\gamma k I_{|J|}\big)^{-1}X_J'\Big).$$

The final matrix inside the trace operator is positive semidefinite by Observation 7.1.8 of Horn and Johnson (1990). $\square$
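The chain of trace inequalities just established can be spot-checked numerically; in the sketch below, the values of $n$, $k$, $|J|$, and $\gamma$ are illustrative choices of ours, and the check holds for every draw.

```python
# Monte Carlo check of the trace inequality behind Proposition 2: the full
# k-variable model's uncertainty term dominates that of a well-chosen
# smaller model J, here with |J| = 2 out of k = 5 covariates.
import numpy as np

rng = np.random.default_rng(6)
n, k, m, gamma = 30, 5, 2, 1.0
X = rng.standard_normal((n, k))

big = np.linalg.inv(X.T @ X + gamma * k * np.eye(k))
J = list(np.argsort(np.diag(big))[:m])   # J minimizes the partial diagonal sum

XJ = X[:, J]
small = np.trace(np.linalg.inv(XJ.T @ XJ + gamma * m * np.eye(m)))
print(np.trace(big) >= small)            # True for every draw
```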
A.7 Proof of Proposition 3

Proof. For an agent with prior $\pi$, the ex-ante expected loss upon seeing a dataset of size $n$ is

$$E_{m(\pi)}[L^*(\pi, D_n)] = \int_{\theta=(\beta,\sigma^2_\epsilon)}\int_{D_n}\int_{y,x}\big(y - x'\hat\beta(D_n)\big)^2\,dQ_\theta(y,x)\,dQ_\theta(D_n)\,d\pi(\theta).$$

The agent's statistical model is $y = x'\beta + \epsilon$, $\epsilon \sim N(0, \sigma^2_\epsilon)$, so this equals

$$\int_{\theta=(\beta,\sigma^2_\epsilon)}\int_{D_n}\int_{x,\epsilon}\big(x'\beta + \epsilon - x'\hat\beta(D_n)\big)^2\,dQ_\theta(x,\epsilon)\,dQ_\theta(D_n)\,d\pi(\theta) = \int\int\int\Big(\big(x'(\beta - \hat\beta(D_n))\big)^2 + \epsilon^2\Big)\,dQ_\theta(x,\epsilon)\,dQ_\theta(D_n)\,d\pi(\theta)$$

$$= E_\pi[\sigma^2_\epsilon] + \int\int\Big(\int \big(\beta - \hat\beta(D_n)\big)'xx'\big(\beta - \hat\beta(D_n)\big)\,dQ_\theta(x)\Big)\,dQ_\theta(D_n)\,d\pi(\theta)$$

$$= E_\pi[\sigma^2_\epsilon] + \int\int \big(\beta - \hat\beta(D_n)\big)'\big(\beta - \hat\beta(D_n)\big)\,dQ_\theta(D_n)\,d\pi(\theta),$$

where the last equality uses $E_P[xx'] = I$, which holds by assumption. Now, since $\gamma = 0$ by assumption, for dataset $D_n = (Y, X)$ we have $\hat\beta(D_n) = (X_{J(\pi)}'X_{J(\pi)})^{-1}X_{J(\pi)}'Y$. In a slight abuse of notation, abbreviate $J(\pi)$ as $J$. Writing $Y = X_J\beta + e$, where $e$ is the $n\times 1$ vector collecting the $\epsilon_i$,

$$\hat\beta(D_n) - \beta = (X_J'X_J)^{-1}X_J'e.$$

Substituting back in, and since $e'X_J(X_J'X_J)^{-1}(X_J'X_J)^{-1}X_J'e$ is a scalar,

$$E_{m(\pi)}[L^*(\pi, D_n)] = E_\pi[\sigma^2_\epsilon] + \int\int \mathrm{Tr}\big(e'X_J(X_J'X_J)^{-1}(X_J'X_J)^{-1}X_J'e\big)\,dQ_\theta(D_n)\,d\pi(\theta).$$

Using the cyclic property of the trace operator,

$$= E_\pi[\sigma^2_\epsilon] + \int\int \mathrm{Tr}\big((X_J'X_J)^{-1}X_J'ee'X_J(X_J'X_J)^{-1}\big)\,dQ_\theta(D_n)\,d\pi(\theta).$$

By assumption, $X_J$ and $e$ are independent and $E_{Q_\theta}[ee'] = \sigma^2_\epsilon I_n$. Thus

$$= E_\pi[\sigma^2_\epsilon] + \int \sigma^2_\epsilon \int \mathrm{Tr}\big((X_J'X_J)^{-1}X_J'X_J(X_J'X_J)^{-1}\big)\,dQ_\theta(X_J)\,d\pi(\theta) = E_\pi[\sigma^2_\epsilon] + \int \sigma^2_\epsilon \int \mathrm{Tr}\big(X_J'X_J\big)^{-1}\,dQ_\theta(X_J)\,d\pi(\theta)$$

$$= E_\pi[\sigma^2_\epsilon]\left(1 + \frac{|J|}{n - |J| - 1}\right).$$

The last equality follows because when $x \sim N(0, I_k)$, the matrix $X_J'X_J$ has a Wishart distribution $W(I_{|J|}, n)$; thus $(X_J'X_J)^{-1}$ has an inverse-Wishart distribution and its expectation equals $I_{|J|}/(n - |J| - 1)$, provided $n > |J| + 1$. Finally, for $|J'| < |J|$,

$$\frac{|J'|}{n - |J'| - 1} < \frac{|J|}{n - |J| - 1} \quad \text{if and only if } n > 1.$$

Since $E_\pi[\sigma^2_\epsilon]$ is common across all agents by assumption, the result follows. $\square$
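The inverse-Wishart expectation invoked in the last step can be verified by simulation; in the sketch below the values of $n$, $|J|$, and the number of replications are illustrative.

```python
# Monte Carlo check of the inverse-Wishart fact used above: for x ~ N(0, I),
# E[Tr((X_J' X_J)^{-1})] = |J| / (n - |J| - 1).
import numpy as np

rng = np.random.default_rng(7)
n, d, reps = 20, 3, 20_000
vals = []
for _ in range(reps):
    X = rng.standard_normal((n, d))
    vals.append(np.trace(np.linalg.inv(X.T @ X)))
print(np.mean(vals), d / (n - d - 1))             # both ~0.1875 for n=20, d=3
```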
B Supplementary Material

B.1 Lemma 2

Lemma 2. Let $\pi$ be a prior satisfying Assumption 3. If $P_0$ satisfies parts (1)-(2) of Assumption 4, then $E_\pi[\sigma^2|D_n]$ admits the following Kass et al. (1990) expansion:

$$\hat\sigma^2(\pi) - \frac{1}{n}\left(2\sigma^{*4}(\pi)\left\{\frac{(\partial\pi/\partial\sigma^2)(\theta^*(\pi))}{\pi(\theta^*(\pi))}\right\} + \sigma^{*2}(\pi)\,\big(|J(\pi)|+4\big)\right) + o_P\Big(\frac{1}{n}\Big),$$

where $\theta^*(\pi) = (\beta^*(\pi), \sigma^{2*}(\pi))$ is defined as in Part (3) of Assumption 4.

Proof. The proof has two main steps. First, we introduce some additional notation. Second, we invoke the results of Kass et al. (1990) and apply them to approximate $E_\pi[\sigma^2|D_n]$.

Step 0 (Notation for Maximum Likelihood Estimators): An agent with prior $\pi$ only uses covariates with indices in $J(\pi)$; this agent's posterior can be obtained using the likelihood

$$f(Y|X_{J(\pi)};\, \beta_{J(\pi)}, \sigma^2) := \frac{1}{(2\pi)^{n/2}\sigma^n}\exp\left(-\frac{1}{2\sigma^2}\big(Y - X_{J(\pi)}\beta_{J(\pi)}\big)'\big(Y - X_{J(\pi)}\beta_{J(\pi)}\big)\right).$$

One additional piece of notation: we define the scaled log-likelihood function of an agent with prior $\pi$ as

$$h_n(\theta(\pi)) := \frac{1}{n}\ln f(Y|X_{J(\pi)};\, \theta(\pi)).$$

The $(i,j)$ component of the matrix of second derivatives of $h_n(\theta(\pi))$ with respect to $\theta(\pi)$ (the Hessian of the scaled log-likelihood) will be denoted $h_{ij}(\cdot)$; we omit the dependence on $n$ unless confusion arises. The components of the inverse of the Hessian will be written $h^{ij}(\cdot)$. Finally, $h_{rsj}(\cdot)$ denotes the partial derivative of $h_{rs}$ with respect to the $j$-th component of $\theta(\pi)$.

Step 1 (Asymptotic expansions of posterior moments): Kass et al. (1990) provide asymptotic expansions for posterior moments around the maximizer of the likelihood used to compute the posterior. In the linear regression model, Theorems 4 and 5 in Kass et al. (1990) imply that for any prior $\pi$ satisfying Assumption 3, any $P_0$ satisfying parts (1)-(2) of Assumption 4, and any six-times differentiable positive real-valued function $g$, the posterior mean of $g(\theta)$ can be expanded as

$$E_\pi[g(\theta)|D_n] = g(\hat\theta(\pi)) + \frac{1}{n}\sum_{1\le i,j\le \dim(\theta(\pi))}\left(\frac{\partial g}{\partial\theta_i}(\hat\theta(\pi))\right)h^{ij}(\hat\theta(\pi))\left\{\left(\frac{\partial\pi}{\partial\theta_j}(\hat\theta(\pi))\right)\cdot\frac{1}{\pi(\hat\theta(\pi))} - \frac{1}{2}\sum_{1\le r,s\le\dim(\theta(\pi))}h^{rs}(\hat\theta(\pi))\,h_{rsj}(\hat\theta(\pi))\right\}$$

$$+ \frac{1}{2n}\sum_{1\le i,j\le\dim(\theta(\pi))}h^{ij}(\hat\theta(\pi))\left(\frac{\partial^2 g}{\partial\theta_i\partial\theta_j}(\hat\theta(\pi))\right) + O_P\Big(\frac{1}{n^2}\Big);$$

see equation 2.6 on p. 481 of Kass et al. (1990).

Consider the positive function $g(\theta(\pi)) = g(\beta_{J(\pi)}, \sigma^2) = \sigma^2$. Because $\frac{\partial g}{\partial\sigma^2}(\hat\theta(\pi)) = 1$ and $\frac{\partial g}{\partial\theta_i}(\hat\theta(\pi)) = 0$ for $i < |J(\pi)|+1$, the expansion above simplifies to

$$E_\pi[\sigma^2|D_n] = \hat\sigma^2(\pi) + \frac{1}{n}\sum_{1\le j\le|J(\pi)|+1}h^{(|J(\pi)|+1)j}(\hat\theta(\pi))\left\{\left(\frac{\partial\pi}{\partial\theta_j}(\hat\theta(\pi))\right)\cdot\frac{1}{\pi(\hat\theta(\pi))} - \frac{1}{2}\sum_{1\le r,s\le\dim(\theta(\pi))}h^{rs}(\hat\theta(\pi))\,h_{rsj}(\hat\theta(\pi))\right\} + O_P\Big(\frac{1}{n^2}\Big).$$

Moreover, the Hessian matrix of $h_n(\theta(\pi))$ equals

$$\begin{pmatrix} -\frac{1}{n\sigma^2}X_{J(\pi)}'X_{J(\pi)} & -\frac{1}{n\sigma^4}X_{J(\pi)}'\big(Y - X_{J(\pi)}\beta_{J(\pi)}\big) \\ -\frac{1}{n\sigma^4}\big(Y-X_{J(\pi)}\beta_{J(\pi)}\big)'X_{J(\pi)} & \frac{1}{2\sigma^4} - \frac{1}{n\sigma^6}\big(Y-X_{J(\pi)}\beta_{J(\pi)}\big)'\big(Y-X_{J(\pi)}\beta_{J(\pi)}\big) \end{pmatrix},$$

and the inverse Hessian evaluated at $\hat\theta(\pi)$ is

$$\begin{pmatrix} -\hat\sigma^2(\pi)\,\big(X_{J(\pi)}'X_{J(\pi)}/n\big)^{-1} & 0 \\ 0 & -2\hat\sigma^4(\pi) \end{pmatrix}.$$

This further simplifies the expansion to

$$E_\pi[\sigma^2|D_n] = \hat\sigma^2(\pi) - \frac{2\hat\sigma^4(\pi)}{n}\left\{\left(\frac{\partial\pi}{\partial\sigma^2}(\hat\theta(\pi))\right)\cdot\frac{1}{\pi(\hat\theta(\pi))} - \frac{1}{2}\sum_{1\le r,s\le|J(\pi)|+1}h^{rs}(\hat\theta(\pi))\,h_{rs(|J(\pi)|+1)}(\hat\theta(\pi))\right\} + O_P\Big(\frac{1}{n^2}\Big).$$

Finally, the terms $h^{r(|J(\pi)|+1)}$ and $h^{(|J(\pi)|+1)s}$ are both $0$ for any $r, s < |J(\pi)|+1$.
Algebra shows that

$$\sum_{1\le r,s\le|J(\pi)|+1}h^{rs}(\hat\theta(\pi))\,h_{rs(|J(\pi)|+1)}(\hat\theta(\pi)) = \sum_{1\le r,s\le|J(\pi)|}h^{rs}(\hat\theta(\pi))\,h_{rs(|J(\pi)|+1)}(\hat\theta(\pi)) + h^{(|J(\pi)|+1)(|J(\pi)|+1)}(\hat\theta(\pi))\cdot h_{(|J(\pi)|+1)(|J(\pi)|+1)(|J(\pi)|+1)}(\hat\theta(\pi))$$

$$= -\hat\sigma^{-2}(\pi)\,|J(\pi)| - 4\,\hat\sigma^{-2}(\pi).$$

We conclude that the Kass-Tierney-Kadane expansion of $E_\pi[\sigma^2|D_n]$ equals

$$\hat\sigma^2(\pi) - \frac{2\hat\sigma^4(\pi)}{n}\left\{\left(\frac{\partial\pi}{\partial\sigma^2}(\hat\theta(\pi))\right)\cdot\frac{1}{\pi(\hat\theta(\pi))}\right\} - \hat\sigma^2(\pi)\,\frac{|J(\pi)|+4}{n} + O_P\Big(\frac{1}{n^2}\Big). \qquad (15)$$

The result follows from the continuous differentiability of the prior and the fact that Part (2) of Assumption 4 implies $\hat\theta(\pi) := (\hat\beta(\pi), \hat\sigma^2(\pi)) \overset{p}{\to} \theta^*(\pi)$. $\square$

B.2 Posterior Loss for Normal-Inverse Gamma Priors

We derive the specific formula of the posterior loss in the case of Normal-Inverse Gamma priors.

Lemma 3. Suppose the agent has a Normal-Inverse Gamma prior $\pi$ with hyperparameters $(\gamma, a_0, b_0)$. Then, if the observed dataset is $D_n = (y, X)$, her log posterior expected loss can be written as

$$\ln(L^*(\pi, D_n)) = \ln\left(\frac{2b_0 + \min_{\beta\in\mathbb{R}^{|J(\pi)|}}\big((y - X_{J(\pi)}\beta)'(y - X_{J(\pi)}\beta) + \gamma|J(\pi)|\,\|\beta\|^2\big)}{2a_0 + n - 2}\right) + \ln\Big(1 + \mathrm{Tr}\big[\big(X_{J(\pi)}'X_{J(\pi)} + \gamma|J(\pi)|\,I_{|J(\pi)|}\big)^{-1}E_P[x_{J(\pi)}x_{J(\pi)}']\big]\Big). \qquad (16)$$

Proof. We break the proof into two steps. Step 1 provides an expression for the posterior mean of $\sigma^2_\epsilon$. Step 2 plugs this expression into the formula for the posterior loss.

Step 1. First we show that the posterior mean of $\sigma^2_\epsilon$ in a regression model with a Normal-Inverse Gamma prior with hyperparameters $(\gamma, a_0, b_0)$ is given by

$$E_\pi[\sigma^2_\epsilon|D_n] = \frac{2b_0 + \min_{\beta\in\mathbb{R}^k}\big((y - X\beta)'(y-X\beta) + \gamma k\,\|\beta\|^2\big)}{2a_0 + n - 2}. \qquad (17)$$

It is known that

$$\sigma^2_\epsilon\,|\,D_n \sim \text{Inv-Gamma}\left(a_0 + \frac{n}{2},\; b_0 + \frac{1}{2}\Big(y'y - \hat\beta_R(\gamma k)'\big(X'X + \gamma k I_k\big)\hat\beta_R(\gamma k)\Big)\right),$$

where $\hat\beta_R(\gamma k)$ is the ridge estimator with penalty parameter $\gamma k$. Since the mean of a random variable distributed Inv-Gamma$(a, b)$ is $\frac{b}{a-1}$, to show (17) it is sufficient to show that

$$\min_{\beta\in\mathbb{R}^k}\big((y-X\beta)'(y-X\beta) + \gamma k\,\|\beta\|^2\big) = y'y - \hat\beta_R(\gamma k)'\big(X'X+\gamma k I_k\big)\hat\beta_R(\gamma k). \qquad (18)$$

To condense notation, let $\hat\beta_R \equiv \hat\beta_R(\lambda)$, where $\lambda = \gamma k$ is fixed. Note that

$$y'X\hat\beta_R = y'X(X'X+\lambda I_k)^{-1}X'y = y'X(X'X+\lambda I_k)^{-1}(X'X+\lambda I_k)(X'X+\lambda I_k)^{-1}X'y = \hat\beta_R'(X'X+\lambda I_k)\hat\beta_R = \hat\beta_R'X'X\hat\beta_R + \lambda\,\hat\beta_R'\hat\beta_R.$$

This implies that

$$(y - X\hat\beta_R)'(y-X\hat\beta_R) = y'y - 2y'X\hat\beta_R + \hat\beta_R'X'X\hat\beta_R = y'y - \hat\beta_R'X'X\hat\beta_R - 2\lambda\,\hat\beta_R'\hat\beta_R.$$

Hence

$$y'y - \hat\beta_R'(X'X+\lambda I_k)\hat\beta_R = y'y - \hat\beta_R'(X'X)\hat\beta_R - \lambda\,\hat\beta_R'\hat\beta_R = (y-X\hat\beta_R)'(y-X\hat\beta_R) + \lambda\,\hat\beta_R'\hat\beta_R.$$
Comparing the last two displays, (18) follows, concluding the proof of (17).

Step 2. From Lemma 1, we have that the posterior loss is

$$L^*(\pi, D_n) = E_\pi[\sigma^2_\epsilon|D_n] + \int_0^\infty \mathrm{Tr}\big(V_\pi(\beta|D_n, \sigma^2_\epsilon)\,E_P[xx']\big)\,\pi(\sigma^2_\epsilon|D_n)\,d\sigma^2_\epsilon.$$

It is known that $V_\pi(\beta|D_n, \sigma^2_\epsilon) = \sigma^2_\epsilon\,(X'X + \gamma k I_k)^{-1}$. This implies that

$$L^*(\pi, D_n) = E_\pi[\sigma^2_\epsilon|D_n] + \int_0^\infty \mathrm{Tr}\big(\sigma^2_\epsilon(X'X+\gamma k I_k)^{-1}E_P[xx']\big)\,\pi(\sigma^2_\epsilon|D_n)\,d\sigma^2_\epsilon = E_\pi[\sigma^2_\epsilon|D_n]\Big(1 + \mathrm{Tr}\big((X'X+\gamma k I_k)^{-1}E_P[xx']\big)\Big).$$

Taking logs on both sides and using the formula for the posterior mean of $\sigma^2_\epsilon$ from Step 1, we obtain the desired formula. $\square$

B.3 Posterior Loss for Normal-Inverse Gamma priors in large samples

Observation 1. Suppose the agent has a Normal-Inverse Gamma prior. Then, for $n$ large, we have

$$\ln(L^*(\pi, D_n)) \approx \underbrace{\ln\big(E_\pi[\sigma^2_\epsilon|D_n]\big)}_{\text{Model Fit}} + \underbrace{\ln\Big(1 + \frac{|J|}{n}\Big)}_{\text{Model Dimension}}. \qquad (19)$$

Proof. The posterior upon observing dataset $D_n$ is

$$\beta_J\,|\,D_n, \sigma^2_\epsilon \sim N_{|J|}\Big(\hat\beta_{J,\text{Ridge}},\ \sigma^2_\epsilon\big(X_J'X_J + \gamma|J|\,I_{|J|}\big)^{-1}\Big) \implies V_\pi[\beta_J|D_n] = E_\pi[\sigma^2_\epsilon|D_n]\,\big(X_J'X_J + \gamma|J|\,I_{|J|}\big)^{-1}.$$

Therefore, substituting back, we have that

$$\mathrm{Tr}\big(V_\pi[\beta_J|D_n]\,E_P[x_Jx_J']\big) = E_\pi[\sigma^2_\epsilon|D_n]\,\mathrm{Tr}\Big(\big(X_J'X_J + \gamma|J|\,I_{|J|}\big)^{-1}E_P[x_Jx_J']\Big) = E_\pi[\sigma^2_\epsilon|D_n]\,\mathrm{Tr}\left(\frac{1}{n}\Big(\frac{1}{n}\big(X_J'X_J + \gamma|J|\,I_{|J|}\big)\Big)^{-1}E_P[x_Jx_J']\right),$$

which for $n$ large, by the law of large numbers, is

$$\approx E_\pi[\sigma^2_\epsilon|D_n]\,\mathrm{Tr}\left(\frac{1}{n}\big(E_P[x_Jx_J']\big)^{-1}E_P[x_Jx_J']\right) = E_\pi[\sigma^2_\epsilon|D_n]\,\mathrm{Tr}\Big(\frac{1}{n}I_{|J|}\Big) = E_\pi[\sigma^2_\epsilon|D_n]\,\frac{|J|}{n}.$$

Thus for $n$ large, (19) follows. $\square$
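The approximation in (19) can be checked against the exact formula (16) along a growing sequence of sample sizes; in the sketch below, the hyperparameters and data generating process are illustrative choices of ours.

```python
# Numeric check that the exact log posterior loss of Lemma 3 approaches the
# large-n decomposition of Observation 1 (model fit + ln(1 + |J|/n)).
import numpy as np

rng = np.random.default_rng(8)
gamma, a0, b0, d = 1.0, 2.0, 2.0, 2
for n in (50, 500, 50_000):
    X = rng.standard_normal((n, d))
    y = X @ np.ones(d) + rng.standard_normal(n)
    A = X.T @ X + gamma * d * np.eye(d)
    b_ridge = np.linalg.solve(A, X.T @ y)
    rss_pen = np.sum((y - X @ b_ridge) ** 2) + gamma * d * np.sum(b_ridge ** 2)
    e_sigma2 = (2 * b0 + rss_pen) / (2 * a0 + n - 2)
    exact = np.log(e_sigma2 * (1 + np.trace(np.linalg.inv(A))))  # eq. (16)
    approx = np.log(e_sigma2) + np.log1p(d / n)                  # eq. (19)
    print(n, exact - approx)                                     # gap shrinks in n
```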