Cheating with (Recursive) Models
Kfir Eliaz†, Ran Spiegler‡ and Yair Weiss§

November 5, 2019
Abstract
To what extent can agents with misspecified subjective models predict false correlations? We study an "analyst" who utilizes models that take the form of a recursive system of linear regression equations. The analyst fits each equation to minimize the sum of squared errors against an arbitrarily large sample. We characterize the maximal pairwise correlation that the analyst can predict given a generic objective covariance matrix, subject to the constraint that the estimated model does not distort the mean and variance of individual variables. We show that as the number of variables in the model grows, the false pairwise correlation can become arbitrarily close to one, regardless of the true correlation.

∗ Financial support from ERC Advanced Investigator grant no. 692995 is gratefully acknowledged. Eliaz and Spiegler thank Briq and the Economics Department at Columbia University for their generous hospitality while this paper was written. We also thank Armin Falk, Xiaosheng Mu, Martin Weidner, seminar audiences and especially Heidi Thysen for helpful comments.
† School of Economics, Tel-Aviv University and David Eccles School of Business, University of Utah. E-mail: kfi[email protected].
‡ School of Economics, Tel Aviv University; Department of Economics, UCL; and CFM. E-mail: [email protected].
§ School of Computer Science and Engineering, Hebrew University. E-mail: [email protected].

Introduction
Agents in economic models rely on mental models of their environment for quantifying correlations between variables, the most important of which describes the payoff consequences of their own actions. The vast majority of economic models assume rational expectations - i.e., the agent's subjective model coincides with the modeler's, such that the agent's belief is consistent with the true data-generating process. Yet when the agent's subjective model is misspecified, his predicted correlations may deviate from the truth. This difficulty is faced not only by agents in economic models but also by real-life researchers in the physical and social sciences, who make use of statistical models. This paper poses a simple question: How far off can the correlations predicted by misspecified models get?

To make our question concrete, imagine an analyst who wishes to demonstrate to an audience that two variables, x and y, are strongly related. Direct evidence about the correlation between these variables is hard to come by. However, the analyst has access to data about the correlation of x and y with other variables. He therefore constructs a model that involves x, y and a selection of auxiliary variables. He fits this model to a large sample and uses the estimated model to predict the correlation between x and y. The analyst is unable (or unwilling) to tamper with the data. However, he is free to choose the auxiliary variables and how they operate in the model. To what extent does this degree of freedom enable the analyst to attain his underlying objective?

This hypothetical scenario is inspired by a number of real-life situations. First, academic researchers often serve as consultants to policy makers or activist groups in pursuit of a particular agenda. E.g., consider an economist consulting a policy maker who pursues a tax-cutting agenda and seeks intellectual support for this position. The policy maker would therefore benefit from an academic study showing a strong quantitative relation between tax cuts and economic growth. Second, the analyst may be wedded to a particular stand regarding the relation between x and y because he staked his public reputation on this claim in the past. Finally, the analyst may want to make a splash with a counterintuitive finding and will stop exploring alternative model specifications once he obtains such a result.

We restrict the class of models that the analyst can employ to be recursive linear-regression models. A model in this familiar class consists of a list of linear-regression equations, such that an explanatory variable in one equation cannot appear as a dependent variable in another equation down the list. We assume that the recursive model includes the variables x and y, as well as a selection of up to n - 2 auxiliary variables, for a total of up to n variables; n is thus a natural measure of the model's complexity. Each equation is estimated via Ordinary Least Squares (OLS) against an arbitrarily large (and unbiased) sample.

The following quote lucidly summarizes two attractions of recursive models:

"A system of equations is recursive rather than simultaneous if there is unidirectional dependency among the endogenous variables such that, for given values of exogenous variables, values for the endogenous variables can be determined sequentially rather than jointly.
Due to the ease with which they can often be estimated and the temptation to interpret them in terms of causal chains, recursive systems were the earliest equation systems to be used in empirical work in the social sciences."

The causal interpretation of recursive models is particularly resonant. If x appears exclusively as an explanatory variable in the system of equations while y appears exclusively as a dependent variable, the recursive model intuitively charts a causal explanation that pits x as a primary cause of y, such that the estimated correlation between x and y can be legitimately interpreted as an estimated causal effect of x on y.

A three-variable example

To illustrate our exercise, suppose that the analyst estimates the following three-variable recursive model:

$$x_1 = \varepsilon_1 \qquad (1)$$
$$x_2 = \beta_2 x_1 + \varepsilon_2$$
$$x_3 = \beta_3 x_2 + \varepsilon_3$$

where $x_1, x_2, x_3$ all have zero mean and unit variance. The analyst assumes that all the $\varepsilon_k$'s are mutually uncorrelated, and also that for every $k$ and every $j < k$, $\varepsilon_k$ is uncorrelated with $x_j$ (for $j = k - 1$,
this is mechanically implied by the OLS method).

For a real-life situation behind this example, consider a pharmaceutical company that introduces a new drug, and therefore has a vested interest in demonstrating a large correlation between the dosage of its active ingredient ($x_1$) and the ten-year survival rate associated with some disease ($x_3$). This correlation cannot be directly measured in the short run. However, past experience reveals the correlations between the ten-year survival rate and the levels of various bio-markers (which can serve as the intermediate variable $x_2$). The correlation between these markers and the drug dosage can be measured experimentally in the short run. Thus, on one hand the situation calls for a model in the manner of (1), yet on the other hand the pharmaceutical company's R&D unit may select the bio-marker $x_2$ opportunistically, in order to get a large estimated effect. Of course, in reality there may be various checks and balances that will constrain this opportunism. However, it is interesting to know how bad it can get, in order to evaluate the importance of these checks and balances.

Let $\rho_{ij}$ denote the correlation between $x_i$ and $x_j$ according to the true data-generating process. Suppose that $x_1$ and $x_3$ are objectively uncorrelated - i.e., $r = \rho_{13} = 0$. The estimated correlation between these variables according to the model, given the analyst's procedure and its underlying assumptions, is

$$\hat\rho_{13} = \rho_{12} \cdot \rho_{23}$$
It is easy to see from this expression how the model can generate a spurious estimated correlation between $x_1$ and $x_3$, even though none exists in reality. All the analyst has to do is select a variable $x_2$ that is positively correlated with both $x_1$ and $x_3$, such that $\rho_{12} \cdot \rho_{23} > 0$. But how large can $\hat\rho_{13}$ be? Intuitively, since $x_1$ and $x_3$ are objectively uncorrelated, if we choose $x_2$ such that it is highly correlated with $x_1$, its correlation with $x_3$ will be low. In other words, increasing $\rho_{12}$ will come at the expense of decreasing $\rho_{23}$. Formally, consider the true correlation matrix:

$$\begin{pmatrix} 1 & \rho_{12} & 0 \\ \rho_{12} & 1 & \rho_{23} \\ 0 & \rho_{23} & 1 \end{pmatrix}$$

For this matrix to be positive semi-definite, it must satisfy $(\rho_{12})^2 + (\rho_{23})^2 \leq 1$. The maximal value of $\rho_{12} \cdot \rho_{23}$ subject to this constraint is $\frac{1}{2}$, and therefore this is the maximal false correlation that the above recursive model can generate. This bound is tight: It can be attained if we define $x_2$ to be a deterministic function of $x_1$ and $x_3$, given by $x_2 = (x_1 + x_3)/\sqrt{2}$.
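As a quick numerical check of this constraint (our sketch, not part of the original argument), one can trace the boundary of the positive semi-definite region:

```python
import numpy as np

# for the matrix above, positive semi-definiteness is exactly
# rho_12^2 + rho_23^2 <= 1, so for each rho_12 the best admissible
# rho_23 lies on the unit circle
rho12 = np.linspace(-1, 1, 2001)
rho23 = np.sqrt(1 - rho12**2)
print(np.max(rho12 * rho23))   # ~0.5, attained at rho_12 = rho_23 = 1/sqrt(2)
```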
Thus, while a given misspecified recursive model may be able to generate a spurious estimated correlation between objectively independent variables, there is a limit to how far it can go.

Our interest in the upper bound on $\hat\rho_{13}$ is not purely mathematical. As the above "bio-marker" example implies, we have in mind situations in which the analyst can select $x_2$ from a large pool of potential auxiliary variables. In the current age of "big data", analysts have access to datasets involving a huge number of covariates. As a result, they have considerable freedom when deciding which variables to incorporate into their models. This helps our analyst generate a false correlation that approaches the theoretical upper bound.

To give a concrete demonstration of this claim, consider Figure 1, which is extracted from a database compiled by the World Health Organization.
[Figure 1 about here. Two panels plot the false correlation against the number of possible auxiliary variables: "Female Obesity vs. Personal Computers" (r = 0.03) and "Urban Population vs. Liver Cancer deaths" (r = 0.05).]
Figure 1: False correlation in a recursive model with one auxiliary variable, as a function of the number of possible auxiliary variables the researcher can choose from. All variables and their correlations are taken from a database compiled by the World Health Organization. Even though the true correlation is close to zero in both cases, as the number of possible auxiliary variables increases, the estimated correlation rises yet never exceeds 0.5.

All the variables are taken from this database. The figure displays the maximal $\hat\rho_{13}$ that the model (1) can generate for two fixed pairs of variables $x_1$ and $x_3$ with $\rho_{13}$ close to zero, when the auxiliary variable is selected from a pool whose size is given by the horizontal axis (variables were added to the pool in some arbitrary order). When the analyst can choose $x_2$ from only ten possible auxiliary variables, the estimated correlation between $x_1$ and $x_3$ he can generate with (1) is still modest. In contrast, once the pool size is in the hundreds, the estimated correlation approaches the upper bound of 0.5.

For a specific variable that gets us near the theoretical upper bound, consider the figure's R.H.S, where $x_1$ represents urban population and $x_3$ represents liver cancer deaths per 100,000 men. The true correlation between these variables is 0.05.
If the analyst selects $x_2$ to be coal consumption (measured in tonnes oil equivalent), the estimated correlation between $x_1$ and $x_3$ is 0.43, far above the objective value.
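This kind of variable search is easy to emulate. The following sketch is ours (with a synthetic pool rather than the WHO data): each candidate auxiliary variable is a random unit-variance mix of $x_1$, $x_3$ and noise, and the analyst picks the candidate maximizing the product $\rho_{12} \cdot \rho_{23}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def best_false_corr(pool_size):
    # rows are candidate x2's: unit-variance mixes of x1, x3 and noise,
    # so each row's first two weights are its correlations with x1 and x3
    w = rng.normal(size=(pool_size, 3))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    return np.max(w[:, 0] * w[:, 1])   # best rho_12 * rho_23 in the pool

for m in (10, 100, 1000):
    print(m, best_false_corr(m))       # creeps up toward the 0.5 bound
```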
Selecting coal consumption has the added advantage that the model suggests an intuitive causal mechanism: urbanization raises coal consumption, which in turn raises liver-cancer mortality.

Review of the results
We present our formal model in Section 2 and pose our main problem: What is the largest estimated correlation between $x_1$ and $x_n$ that a recursive, $n$-variable linear-regression model can generate? We impose one constraint on this maximization problem: While the estimated model is allowed to distort correlations among variables, it must produce correct estimates of the individual variables' mean and variance. The linearity of the analyst's model implies that only the latter has bite.

To motivate this constraint, recall that the scenario behind our model involves a sophisticated analyst and a lay audience. Because the audience is relatively unsophisticated, it cannot be expected to discipline the analyst's opportunistic model selection with elaborate tests for model misspecification that involve conditional or unconditional correlations. However, monitoring individual variables is a much simpler task than monitoring correlations between variables. E.g., it is relatively easy to disqualify an economic model that predicts highly volatile inflation if observed inflation is relatively stable. Likewise, a climatological model that underpredicts temperature volatility loses credibility, even for a lay audience.

Beyond this justification, we simply find it intrinsically interesting to know the extent to which misspecified recursive models can distort correlations between variables while preserving moments of individual variables. At any rate, we relax the constraint in Section 4, for the special case of models that consist of a single non-degenerate regression equation. We use this case to shed light on the analyst's opportunistic use of "bad controls".

In Section 3, we derive the following result. For a generic objective covariance matrix with $\rho_{1n} = r$, the maximal estimated correlation $\hat\rho_{1n}$ that a recursive model with up to $n$ variables can generate, subject to preserving the mean and variance of individual variables, is

$$\left(\cos\left(\frac{\arccos r}{n-1}\right)\right)^{n-1} \qquad (2)$$

The upper bound given by (2) is tight. Specifically, it is attained by the simplest recursive model that involves $n$ variables: For every $k = 2, ..., n$, $x_k$ is regressed on $x_{k-1}$ only. This model is represented graphically by the chain $x_1 \to x_2 \to \cdots \to x_n$. This chain has an intuitive causal interpretation, which enables the analyst to present $\hat\rho_{1n}$ as an estimated causal effect of $x_1$ on $x_n$. The variables $x_2, ..., x_{n-1}$ that are employed in this model have a simple definition, too: They are all deterministic linear functions of $x_1$ and $x_n$.

Formula (2) reproduces the value $\hat\rho_{13} = \frac{1}{2}$ that we derived in our illustrative example, and it is strictly increasing in $n$. When $n \to \infty$, the expression converges to 1. That is, regardless of the true correlation between $x_1$ and $x_n$, a sufficiently large recursive model can generate an arbitrarily large estimated correlation. The lesson is that when the analyst is free to select his model and the variables that inhabit it, he can deliver any conclusion about the effect of one variable on another - unless we impose constraints on his procedure, such as bounds on the complexity of his model or additional misspecification tests.

The formula (2) has a simple geometric interpretation, which also betrays the construction of the recursive model and objective covariance matrix that implement the upper bound. Take the angle that represents the objective correlation $r$ between $x_1$ and $x_n$ - namely, $\arccos r$ - and divide it into $n - 1$ equal angles; these smaller angles represent the correlations between consecutive variables along the chain between $x_1$ and $x_n$.

The detailed proof of our main result is presented in Section 5.
It relies on the graphical representation of recursive models and employs tools from the Bayesian-networks literature (Cowell et al. (1999), Koller and Friedman (2009)). In the Appendix, we present a partial analysis of our question for a different class of models involving binary variables.

Related literature
There is a huge literature on misspecified models in various branches of economics and statistics, which is too vast to survey in detail here. A few recent references can serve as entry points for the interested reader: Esponda and Pouzo (2016), Bonhomme and Weidner (2018) and Molavi (2019). To our knowledge, our paper is the first to carry out a worst-case analysis of misspecified models' predicted correlations.

The "analyst" story that motivated this exercise brings to mind the phenomenon of researcher bias. A few works in Economics have explicitly modeled this bias and its implications for statistical inference. (Of course, there is a larger literature on how econometricians should cope with researcher/publication bias, but here we only describe exercises that contain explicit models of the researcher's behavior.) Leamer (1974) suggests a method of discounting evidence when linear regression models are constructed after some data have been partially analyzed. Lovell (1983) considers a researcher who chooses $k$ out of $n$ independent variables as explanatory variables in a single regression, with the aim of maximizing the coefficient of correlation between the chosen variables and the dependent variable. He argues that a regression coefficient that appears to be significant at the $\alpha$ level should be regarded as significant at only the $1 - (1 - \alpha)^{n/k}$ level. Glaeser (2008) suggests a way of correcting for this form of data mining in the coefficient estimate.

More recently, Di Tillio, Ottaviani and Sorensen (2017, 2019) characterize data distributions for which strategic sample selection (e.g., selecting the $k$ highest observations out of $n$) benefits an evaluator who must take an action after observing the selected sample realizations. Finally, Spiess (2018) proposes a mechanism-design framework to align the preferences of the researcher with those of "society": A social planner first chooses a menu of possible estimators, the investigator chooses an estimator from this set, and the estimator is then applied to the sampled observations.

The analyst in our model need not be a scientist - he could also be a politician or a pundit. Under this interpretation, constructing a model and fitting it to data is not an explicit, formal affair. Rather, it involves spinning a "narrative" about the effect of policy on consequences and using casual empirical evidence to substantiate it. Eliaz and Spiegler (2018) propose a model of political beliefs that is based on this idea. From this point of view, our exercise in this paper explores the extent to which false narratives can exaggerate the effect of policy.

The Model
Let $p$ be an objective probability measure over $n$ variables, $x_1, ..., x_n$. For every $A \subset \{1, ..., n\}$, denote $x_A = (x_i)_{i \in A}$. Assume that the marginal of $p$ on each of these variables has zero mean and unit variance. This will entail no loss of generality for our purposes. We use $\rho_{ij}$ to denote the coefficient of correlation between the variables $x_i, x_j$ according to $p$. In particular, denote $\rho_{1n} = r$. The covariance matrix that characterizes $p$ is therefore $(\rho_{ij})$.

An analyst estimates a recursive model that involves these variables. This model consists of a system of linear-regression equations. For every $k = 1, ..., n$, the $k$th equation takes the form

$$x_k = \sum_{j \in R(k)} \beta_{jk} x_j + \varepsilon_k$$

where:

• $R(k) \subseteq \{1, ..., k-1\}$. This restriction captures the model's recursive structure: An explanatory variable in one equation cannot appear as a dependent variable in a later equation.

• In the $k$th equation, the $\beta_{jk}$'s are parameters to be estimated against an infinitely large sample drawn from $p$. The analyst assumes that each $\varepsilon_k$ has zero mean and that it is uncorrelated with all other $\varepsilon_j$'s, as well as with $(x_j)_{j < k}$.
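To fix ideas, the analyst's procedure can be sketched in a few lines of code. The sketch below is ours (fit_recursive_model is a hypothetical helper name), and it summarizes the arbitrarily large sample directly by the objective covariance matrix:

```python
import numpy as np

def fit_recursive_model(Sigma, R):
    """OLS-fit each equation x_k = sum_{j in R[k]} beta_jk x_j + eps_k against
    the population covariance Sigma, then return the covariance implied by the
    estimated model, which treats all eps_k as mutually uncorrelated."""
    n = Sigma.shape[0]
    B = np.zeros((n, n))   # B[k, j] = beta_jk (strictly lower-triangular)
    D = np.zeros(n)        # D[k]    = estimated Var(eps_k)
    for k in range(n):
        par = R[k]
        if par:
            beta = np.linalg.solve(Sigma[np.ix_(par, par)], Sigma[par, k])
            B[k, par] = beta
            D[k] = Sigma[k, k] - beta @ Sigma[par, k]
        else:
            D[k] = Sigma[k, k]
    A = np.linalg.inv(np.eye(n) - B)   # x = A @ eps, so Sigma_hat = A diag(D) A^T
    return A @ np.diag(D) @ A.T

# the three-variable chain from the Introduction, with rho_12 = rho_23 = 0.7:
Sigma = np.array([[1.0, 0.7, 0.0],
                  [0.7, 1.0, 0.7],
                  [0.0, 0.7, 1.0]])
Sigma_hat = fit_recursive_model(Sigma, {0: [], 1: [0], 2: [1]})
print(Sigma_hat[0, 2])   # 0.49: a false correlation close to the 1/2 bound
```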
The estimated model induces a joint distribution over $x_1, ..., x_n$, and in particular an estimated covariance $\alpha$ between $x_1$ and $x_n$, such that the estimated correlation is

$$\hat\rho_{1n} = \frac{\alpha}{\sqrt{\widehat{Var}(x_1) \cdot \widehat{Var}(x_n)}} \qquad (4)$$

We require the estimated model to preserve the mean and variance of each individual variable; in particular, $\widehat{Var}(x_n) = 1$ (and likewise $\widehat{Var}(x_1) = 1$, using the same recursive procedure we applied to $x_n$). This reduces (4) to

$$\hat\rho_{1n} = \alpha$$

Our objective will be to examine how large this expression can be, given $n$ and a generic objective covariance matrix.

Comments on the analyst's procedure
In our model, the analyst relies on a structural model to generate an estimate of the correlation between $x_1$ and $x_n$, which he presents to a lay audience. The process by which he selects the model remains hidden from the audience. But why does the analyst use a model to estimate the correlation between $x_1$ and $x_n$, rather than estimating it directly? One answer may be that direct evidence on this correlation is hard to come by (as, for example, in the case of long-term health effects of nutritional choices). In this case, the analyst must use a model to extrapolate an estimate of $\rho_{1n}$ from observed data.

Another answer is that analysts use models as simplified representations of a complex reality, which they can consult for multiple conditional-estimation tasks: Estimating the effect of $x_1$ on $x_n$ is only one of these tasks. This is illustrated by the following quote: "The economy is an extremely complicated mechanism, and every macroeconomic model is a vast simplification of reality. . . the large scale of FRB/US [a general equilibrium model employed by the Federal Reserve Bank - the authors] is an advantage in that it can perform a wide variety of computational 'what if' experiments." From this point of view, our analysis concerns the maximal distortion of pairwise correlations that such multi-purpose models can produce.

Note that our formalism allows $1 \in R(n)$, which means that the analyst does have data about the joint distribution of $x_1$ and $x_n$. As we will see, our main result will not make use of this possibility.

For every $r, n$, denote

$$\theta_{r,n} = \frac{\arccos r}{n-1}$$

Theorem 1
For almost every true covariance matrix $(\rho_{ij})$ satisfying $\rho_{1n} = r$, if the estimated recursive model satisfies $\widehat{Var}(x_k) = 1$ for all $k$, then the estimated correlation between $x_1$ and $x_n$ satisfies

$$\hat\rho_{1n} \leq (\cos \theta_{r,n})^{n-1}$$

Moreover, this upper bound can be implemented by the following pair:

(i) A recursive model defined by $R(k) = \{k-1\}$ for every $k = 2, ..., n$.

(ii) A multivariate Gaussian distribution satisfying, for every $k = 1, ..., n$:

$$x_k = s_1 \cos((k-1)\theta_{r,n}) + s_2 \sin((k-1)\theta_{r,n}) \qquad (5)$$

where $s_1, s_2$ are independent standard normal variables.

Let us illustrate the upper bound given by Theorem 1 numerically for the case of $r = 0$, as a function of $n$:

n               3      4      5      10
upper bound   0.50   0.65   0.73   0.87
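These values follow directly from formula (2); a two-line check (ours):

```python
import numpy as np

def chain_bound(r, n):
    # formula (2): maximal estimated correlation for an n-variable chain
    return np.cos(np.arccos(r) / (n - 1)) ** (n - 1)

print([round(chain_bound(0.0, n), 2) for n in (3, 4, 5, 10, 100)])
# [0.5, 0.65, 0.73, 0.87, 0.99]
```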
As $n \to \infty$, the upper bound converges to one. This is the case for any value of $r$. That is, even if the true correlation between $x_1$ and $x_n$ is strongly negative, a sufficiently large model can produce a large positive correlation.

The recursive model that attains the upper bound has a simple structure. Its DAG representation is a single chain

$$1 \to 2 \to \cdots \to n$$

Intuitively, this is the simplest connected DAG with $n$ nodes: It has the smallest number of links among this class of DAGs, and it has no junctions. The distribution over the auxiliary variables $x_2, ..., x_{n-1}$ in the upper bound's implementation has a simple structure, too: Every $x_k$ is a different linear combination of two independent "factors", $s_1$ and $s_2$. We can identify $s_1$ with $x_1$, without loss of generality. The closer the variable lies to $x_1$ along the chain, the larger the weight it puts on $s_1$.

General outline of the proof
The proof of Theorem 1 proceeds in three major steps. First, the constraint that the estimated model preserves the variance of individual variables for a generic objective distribution reduces the class of candidate recursive models to those that can be represented by perfect DAGs. Since perfect DAGs preserve marginals of individual variables for every objective distribution (see Spiegler (2017)), the theorem can be stated more strongly for this subclass of recursive models.
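For intuition, the variance distortion under an imperfect DAG is easy to exhibit numerically. The sketch below (ours) fits the imperfect DAG $1 \to 3 \leftarrow 2$, whose two parents are objectively correlated, and recomputes $Var(x_3)$ under the model's independence assumption:

```python
import numpy as np

# an arbitrary positive semi-definite correlation matrix with rho_12 != 0
Sigma = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.4],
                  [0.3, 0.4, 1.0]])
beta = np.linalg.solve(Sigma[:2, :2], Sigma[:2, 2])   # OLS of x3 on (x1, x2)
sig2 = Sigma[2, 2] - beta @ Sigma[:2, 2]              # residual variance
var_hat = beta @ beta + sig2   # model wrongly treats x1, x2 as uncorrelated
print(var_hat)                 # 0.956 != 1: Var(x3) is distorted
```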
Proposition 1
Consider a recursive model given by $R$. Suppose that for every $k > 1$, if $i, j \in R(k)$ and $i < j$, then $i \in R(j)$. Then, for every true covariance matrix $(\rho_{ij})$ satisfying $\rho_{1n} = r$, $\hat\rho_{1n} \leq (\cos \theta_{r,n})^{n-1}$.

That is, when the recursive model is represented by a perfect DAG, the upper bound on $\hat\rho_{1n}$ holds for any objective covariance matrix, and the undistorted-variance constraint is redundant.

In the second step, we use the tool of junction trees from the Bayesian-networks literature (Cowell et al. (1999)) to perform a further reduction in the class of relevant recursive models. Consider a recursive model represented by a non-chain perfect DAG. We show that the analyst can generate the same $\hat\rho_{1n}$ with another objective distribution and a recursive model that takes the form of a simple chain $1 \to \cdots \to n$. Furthermore, this chain will involve no more variables than the original model.

To illustrate this argument, consider the following recursive model with $n = 4$:

$$x_1 = \varepsilon_1$$
$$x_2 = \beta_{12} x_1 + \varepsilon_2$$
$$x_3 = \beta_{13} x_1 + \beta_{23} x_2 + \varepsilon_3$$
$$x_4 = \beta_{24} x_2 + \beta_{34} x_3 + \varepsilon_4$$

This recursive model has a DAG representation with links $1 \to 2$, $1 \to 3$, $2 \to 3$, $2 \to 4$ and $3 \to 4$. Since $x_4$ depends on $x_2$ and $x_3$ only through their linear combination $\beta_{24} x_2 + \beta_{34} x_3$, we can replace $(x_2, x_3)$ with a scalar variable $x'$, such that the recursive model becomes

$$x_1 = \varepsilon_1$$
$$x' = \beta' x_1 + \varepsilon'$$
$$x_4 = \beta'' x' + \varepsilon_4$$

This model is represented by the DAG $1 \to 2 \to 3$, which is a simple chain that consists of fewer nodes than the original DAG.
This means that in order to calculate the upper bound on $\hat\rho_{1n}$, we can restrict attention to the chain model. But in this case, the analyst's objective function has a simple explicit form:

$$\hat\rho_{1n} = \prod_{k=1}^{n-1} \rho_{k,k+1}$$

Thus, in the third step, we derive the upper bound by finding the correlation matrix that maximizes the R.H.S of this formula, subject to the constraints that $\rho_{1n} = r$ and that the matrix is positive semi-definite (which is the property that defines the class of covariance matrices). The solution to this problem has a simple geometric interpretation.

Single-equation models

Analysts often propose models that take the form of a single linear-regression equation, consisting of a dependent variable $x_n$, an explanatory variable of interest $x_1$ and $n - 2$ "control" variables $x_2, ..., x_{n-1}$. Using the language of Section 2, this corresponds to the specification $R(k) = \emptyset$ for all $k = 1, ..., n-1$ and $R(n) = \{1, ..., n-1\}$. That is, the only non-degenerate equation is the one for $x_n$, hence the term "single-equation model". Note that in this case, the OLS regression coefficient $\beta_{1n}$ in the equation for $x_n$ coincides with $\alpha$, such that $\hat\rho_{1n} = \alpha/\sqrt{\widehat{Var}(x_n)}$, as defined in Section 2.

Using the graphical representation, the single-equation model corresponds to a DAG in which $x_1, ..., x_{n-1}$ are all ancestral nodes that send links into $x_n$. Since this DAG is imperfect, Lemma 1 in Section 5 implies that for almost all objective covariance matrices, the estimated variance of $x_n$ according to the single-equation model will differ from its true value (all the other variables are represented by ancestral nodes, and therefore their marginals are not distorted; see Spiegler (2017)). However, given the particular interest in this class of models, we relax the correct-variance constraint in this section and look for the maximal false correlation that such models can generate. For expositional convenience, we focus on the case of $r = 0$.

Proposition 2

Let $r = 0$. Then, a single-equation model $x_n = \sum_{i=1}^{n-1} \beta_i x_i + \varepsilon$ can generate an estimated coefficient $\hat\rho_{1n}$ of at most $1/\sqrt{2}$. This bound is tight, and can be approximated arbitrarily well with $n = 3$ such that $x_2 = \delta x_1 + \sqrt{1 - \delta^2}\, x_3$, where $\delta \approx -1$.

Proof.
Because $x_1, ..., x_{n-1}$ can be taken to be Gaussian without loss of generality, we can replace the linear combination $\sum_{i=2}^{n-1} \beta_i x_i$ (where the $\beta_i$'s are determined by the objective $p$) with a single Gaussian variable $z$ that has mean zero, but whose variance need not be one. Its objective distribution conditional on $x_1, x_n$ can be written as a linear equation $z = \delta x_1 + \gamma x_n + \eta$. Since all variables on the R.H.S of this equation are independent (and since $x_1$ and $x_n$ are standardized normal variables), it follows that the objective variance of $z$ is

$$Var(z) = \delta^2 + \gamma^2 + \sigma^2$$

where $\sigma^2 = Var(\eta)$. The analyst's model can now be written as

$$x_n = \frac{1}{\gamma} z - \frac{\delta}{\gamma} x_1 - \frac{1}{\gamma}\eta \qquad (6)$$

Our objective is to find the values of $\delta$, $\gamma$ and $\sigma$ that maximize

$$\hat\rho_{1n} = \frac{\hat{E}(x_1 x_n)}{\sqrt{\widehat{Var}(x_n)\,\widehat{Var}(x_1)}}$$

Because $x_1$ and $x_n$ are independent, standardized normal, $\hat{E}(x_1 x_n) = -\delta/\gamma$. The analyst's model does not distort the variance of $x_1$, because the node representing $x_1$ in the model's DAG representation is ancestral, and by Spiegler (2017) the estimated model does not distort the marginals of ancestral variables. Therefore $\widehat{Var}(x_1) = 1$.
And since the analyst's model regards $z$, $x_1$ and $\eta$ as independent,

$$\widehat{Var}(x_n) = \left(\frac{1}{\gamma}\right)^2 Var(z) + \left(\frac{\delta}{\gamma}\right)^2 + \left(\frac{\sigma}{\gamma}\right)^2 = \left(\frac{1}{\gamma}\right)^2 (\delta^2 + \gamma^2 + \sigma^2) + \left(\frac{\delta}{\gamma}\right)^2 + \left(\frac{\sigma}{\gamma}\right)^2$$

It is clear from this expression that in order to maximize $\hat\rho_{1n}$, we should set $\sigma = 0$.
0. As a result, theestimated variance of x n diverges.Thus, to magnify the false correlation between x and x n , the analystwould select the “control” variables such that a certain linear combination ofthem has strong negative correlation with x . That is, the analyst will preferhis regression model to exhibit multicollinearity. This inflates the estimatedvariance of x n ; indeed, (cid:100) V ar ( x ) → ∞ when δ → −
However, at the same time it increases the estimated covariance between $x_1$ and $x_n$, which more than compensates for this increase in variance. As a result, the estimated correlation between $x_1$ and $x_n$ rises substantially.
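A small numerical sketch of this effect (ours), using the $n = 3$ construction from Proposition 2:

```python
import numpy as np

def single_eq_false_corr(delta):
    gamma = np.sqrt(1 - delta**2)
    # true correlations: x1 and x3 uncorrelated, x2 = delta*x1 + gamma*x3
    Sigma = np.array([[1.0, delta, 0.0],
                      [delta, 1.0, gamma],
                      [0.0, gamma, 1.0]])
    beta = np.linalg.solve(Sigma[:2, :2], Sigma[:2, 2])  # OLS of x3 on (x1, x2)
    sig2 = Sigma[2, 2] - beta @ Sigma[:2, 2]
    var_hat = beta @ beta + sig2       # model treats x1 and x2 as independent
    return beta[0] / np.sqrt(var_hat)  # estimated corr(x1, x3); true value is 0

for d in (-0.9, -0.99, -0.999):
    print(d, single_eq_false_corr(d))  # approaches 1/sqrt(2) ~ 0.707
```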
The three-variable model that implements the upper bound is represented by the DAG $1 \to 3 \leftarrow 2$. That is, it treats the variables $x_1$ and $x_2$ as independent, even though in reality they are correlated. In particular, the objective distribution may be consistent with a DAG that adds a link $1 \to 2$, such that adding $x_2$ to the regression means that we control for a "post-treatment" variable (where $x_1$ is viewed as the treatment). In other words, $x_2$ is a "bad control" (see Angrist and Pischke (2008), p. 64, and also http://causality.cs.ucla.edu/blog/index.php/2019/08/14/a-crash-course-in-good-and-bad-control/).

The upper bound of $1/\sqrt{2}$ is approximated arbitrarily well already with $n = 3$. Recall that under the undistorted-variance constraint, the upper bound on $\hat\rho_{1n}$ for $n = 3$ is $1/2$.
This shows that the constraint has bite. However, when $n$ is sufficiently large, the single-equation model is outperformed by the multi-equation chain model, which does satisfy the undistorted-variance constraint.

Proof of Theorem 1
The proof relies on concepts and tools from the Bayesian-network literature (Cowell et al. (1999), Koller and Friedman (2009)). Therefore, we introduce a few definitions that will serve us in the proof.

A DAG is a pair $G = (N, R)$, where $N$ is a set of nodes and $R \subset N \times N$ is a set of directed links. We assume throughout that $N = \{1, ..., n\}$. With some abuse of notation, $R(i)$ is the set of nodes $j$ for which the DAG includes a link $j \to i$. A DAG is perfect if whenever $i, j \in R(k)$ for some $i, j, k \in N$, it is the case that $i \in R(j)$ or $j \in R(i)$.

A subset of nodes $C \subseteq N$ is a clique if for every $i, j \in C$, $iRj$ or $jRi$. We say that a clique is maximal if it is not contained in another clique. We use $\mathcal{C}$ to denote the collection of maximal cliques in a DAG.

A node $i \in N$ is ancestral if $R(i)$ is empty. A node $i \in N$ is terminal if there is no $j \in N$ such that $i \in R(j)$. In line with our definition of recursive models in Section 2, we assume that 1 is ancestral and $n$ is terminal. It is also easy to verify that we can restrict attention to DAGs in which $n$ is the only terminal node - otherwise, we can remove the other terminal nodes from the DAG without changing $\hat{p}(x_n | x_1)$. We will take these restrictions for granted henceforth.

The analyst's procedure for estimating $\hat\rho_{1n}$, as described in Section 2, has an equivalent description in the language of Bayesian networks, which we now describe. Because the analyst estimates a linear model, it is as if he believes that the underlying distribution $p$ is multivariate normal, where the estimated $k$th equation is a complete description of the conditional distribution $(p(x_k | x_{R(k)}))$. Therefore, from now on, we will proceed as if $p$ were indeed a standardized multivariate normal with covariance matrix $(\rho_{ij})$, such that the $k$th regression equation corresponds to measuring the correct distribution of $x_k$ conditional on $x_{R(k)}$. This is helpful expositionally and entails no loss of generality.

Given an objective distribution $p$ over $x_1, ..., x_n$ and a DAG $G$, define the Bayesian-network factorization formula:

$$p_G(x_1, ..., x_n) = \prod_{k=1}^{n} p(x_k | x_{R(k)})$$

We say that $p$ is consistent with $G$ if $p_G = p$. By Koller and Friedman (2009, Ch. 7), when $p$ is multivariate normal, $p_G$ is reduced to the estimated joint distribution as described in Section 2. In particular, we can use $p_G$ to calculate the estimated marginal of $x_k$ for any $k$:

$$p_G(x_k) = \int_{(x_j)_{j \neq k}} p_G(x_1, ..., x_n)$$

Lemma 1

Let $n \geq 3$ and suppose that $G$ is imperfect. Then, there exists $k \in \{1, ..., n\}$ such that $Var_G(x_k) \neq 1$ for almost all correlation sub-matrices $(\rho_{ij})_{i,j=1,...,k-1}$ (and therefore, for almost all correlation matrices $(\rho_{ij})_{i,j=1,...,n}$).

Proof. Recall that we list the variables $x_1, ..., x_n$ such that $R(i) \subseteq \{1, ..., i-1\}$ for every $i$. Consider the lowest $k$ for which $R(k)$ is not a clique. This means that there exist two nodes $h, l \in R(k)$ that are unlinked in $G$, whereas for every $k' < k$ and every $h', l' \in R(k')$, $h'$ and $l'$ are linked in $G$. Our goal is to show that $Var_G(x_k) \neq 1$ for almost all correlation submatrices $(\rho_{ij})_{i,j=1,...,k-1}$.
Since none of the variables $x_{k+1}, ..., x_n$ appear in the equations for $x_1, ..., x_k$, we can ignore them and treat $x_k$ as the terminal node in $G$ without loss of generality, such that $G$ is defined over the nodes $1, ..., k$, and $p$ is defined over the variables $x_1, ..., x_k$.

Let $(\hat\rho_{ij})_{i,j=1,...,k-1}$ denote the correlation matrix over $x_1, ..., x_{k-1}$ induced by $p_G$ - i.e., $\hat\rho_{ij}$ is the estimated correlation between $x_i$ and $x_j$, whereas $\rho_{ij}$ denotes their true correlation. By assumption, the estimated marginals of $x_1, ..., x_{k-1}$ are correct, hence $\hat\rho_{ii} = 1$ for all $i = 1, ..., k-1$. Moreover, the estimated correlations $\hat\rho_{ij}$ over $i, j = 1, ..., k-1$ do not depend on $\rho_{hl}$ (i.e., the true correlation between $x_h$ and $x_l$). To see why, note that $(\hat\rho_{ij})_{i,j=1,...,k-1}$ is induced by $(p_G(x_1, ..., x_{k-1}))$. Each of the terms in the factorization formula for $p_G(x_1, ..., x_{k-1})$ is of the form $p(x_i | x_{R(i)})$, $i = 1, ..., k-1$. To compute this conditional probability, we only need to know $(\rho_{jj'})_{j,j' \in \{i\} \cup R(i)}$. By the definition of $k$, $h$ and $l$, it is impossible for both $h$ and $l$ to be included in $\{i\} \cup R(i)$. Therefore, we can compute $(\hat\rho_{ij})_{i,j=1,...,k-1}$ without knowing the true value of $\rho_{hl}$. We will make use of this observation toward the end of this proof.

The equation for $x_k$ is

$$x_k = \sum_{i \in R(k)} \beta_{ik} x_i + \varepsilon_k \qquad (8)$$

Let $\beta$ denote the vector $(\beta_{ik})_{i \in R(k)}$. Let $A$ denote the correlation sub-matrix $(\rho_{ij})_{i,j \in R(k)}$ that fully characterizes the objective joint distribution $(p(x_{R(k)}))$. Then, the objective variance of $x_k$ can be written as

$$Var(x_k) = 1 = \beta^T A \beta + \sigma^2 \qquad (9)$$

where $\sigma^2 = Var(\varepsilon_k)$. In contrast, the estimated variance of $x_k$, denoted $Var_G(x_k)$, obeys the equation

$$Var_G(x_k) = \beta^T C \beta + \sigma^2 \qquad (10)$$

where $C$ denotes the correlation sub-matrix $(\hat\rho_{ij})_{i,j \in R(k)}$ that characterizes $(p_G(x_{R(k)}))$. In other words, the estimated variance of $x_k$ is produced by replacing the true joint distribution of $x_{R(k)}$ in the regression equation for $x_k$ with its estimated distribution (induced by $p_G$), without changing the values of $\beta$ and $\sigma$.

The undistorted-marginals constraint requires $Var_G(x_k) = 1$. This implies the equation

$$\beta^T A \beta = \beta^T C \beta \qquad (11)$$

We now wish to show that this equation fails for generic $(\rho_{ij})_{i,j=1,...,k-1}$. For any subsets $B, B' \subset \{1, ..., k-1\}$, use $\Sigma_{B \times B'}$ to denote the submatrix of $(\hat\rho_{ij})_{i,j=1,...,k-1}$ in which the selected set of rows is $B$ and the selected set of columns is $B'$. By assumption, $h, l \in R(k)$ are unlinked. This means that according to $G$, $x_h \perp x_l \,|\, x_M$, where $M \subset \{1, ..., k-1\} - \{h, l\}$. Therefore, by Drton et al. (2008, p. 67),

$$\Sigma_{\{h\} \times \{l\}} = \Sigma_{\{h\} \times M}\, \Sigma_{M \times M}^{-1}\, \Sigma_{M \times \{l\}} \qquad (12)$$

Note that equation (12) is precisely where we use the assumption that $G$ is imperfect. If $G$ were perfect, then all nodes in $R(k)$ would be linked and therefore we would be unable to find a pair of nodes $h, l \in R(k)$ that necessarily satisfies (12).

The L.H.S of (12) is simply $\hat\rho_{hl}$. The R.H.S of (12) is induced by $p_G(x_1, ..., x_{k-1})$. As noted earlier, this distribution is pinned down by $G$ and the entries in $(\rho_{ij})_{i,j=1,...,k-1}$ except for $\rho_{hl}$. That is, if we are not informed of $\rho_{hl}$ but we are informed of all the other entries in $(\rho_{ij})_{i,j=1,...,k-1}$, we are able to pin down the R.H.S of (12).

Now, when we draw the objective correlation submatrix $(\rho_{ij})_{i,j=1,...,k-1}$ at random, we can think of it as a two-stage lottery. In the first stage, all the entries in this submatrix except $\rho_{hl}$ are drawn.
In the second stage, $\rho_{hl}$ is drawn. The only constraint in each stage of the lottery is that $(\rho_{ij})_{i,j=1,...,k-1}$ has to be positive semi-definite and have 1's on the diagonal. Fix the outcome of the first stage of this lottery. Then, it pins down the R.H.S of (12). In the lottery's second stage, there is (for a generic outcome of the lottery's first stage) a continuum of values that $\rho_{hl}$ could take for which $(\rho_{ij})_{i,j=1,...,k-1}$ will be positive semi-definite. However, there is only one value of $\rho_{hl}$ that will coincide with the value of $\hat\rho_{hl}$ that is given by equation (12). We have thus established that $A \neq C$ for generic $(\rho_{ij})_{i,j=1,...,k-1}$.

Recall once again that we can regard $\beta$ as a parameter of $p$ that is independent of $A$ (and therefore of $C$ as well), because $A$ describes $(p(x_{R(k)}))$ whereas $\beta, \sigma$ characterize $(p(x_k | x_{R(k)}))$. Then, since we can assume $A \neq C$, (11) is a non-tautological quadratic equation in $\beta$ (because we can construct examples of $p$ that violate it). By Caron and Traynor (2005), it has a measure-zero set of solutions $\beta$. We conclude that the constraint $Var_G(x_k) = 1$ is violated by almost every $(\rho_{ij})$.

Corollary 1

For almost every $(\rho_{ij})$, if a DAG $G$ satisfies $E_G(x_k) = 0$ and $Var_G(x_k) = 1$ for all $k = 1, ..., n$, then $G$ is perfect.

Proof. By Lemma 1, for every imperfect DAG $G$, the set of covariance matrices $(\rho_{ij})$ for which $p_G$ preserves the mean and variance of all individual variables has measure zero. The set of imperfect DAGs over $\{1, ..., n\}$ is finite, and a finite union of measure-zero sets has measure zero as well. It follows that for almost all $(\rho_{ij})$, the property that $p_G$ preserves the mean and variance of individual variables is violated unless $G$ is perfect.

The next step is based on the following definition.

Definition 1

A DAG $(N, R)$ is linear if 1 is the unique ancestral node, $n$ is the unique terminal node, and $R(i)$ is a singleton for every non-ancestral node.

A linear DAG is thus a causal chain $1 \to \cdots \to n$. Every linear DAG is perfect by definition.

Lemma 2

For every Gaussian distribution with correlation matrix $\rho$ and non-linear perfect DAG $G$ with $n$ nodes, there exists a Gaussian distribution with correlation matrix $\rho'$ and a linear DAG $G'$ with weakly fewer nodes than $G$, such that $\rho_{1n} = \rho'_{1n}$ and the false correlation induced by $G'$ on $\rho'$ is exactly the same as the false correlation induced by $G$ on $\rho$: $cov_{G'}(x_1, x_n) = cov_G(x_1, x_n)$.

Proof. The proof proceeds in two main steps.

Step 1: Deriving an explicit form for the false correlation using an auxiliary "cluster recursion" formula

The following is standard material in the Bayesian-network literature. For any distribution $p_G(x)$ corresponding to a perfect DAG, we can rewrite the distribution as if it factorizes according to a tree graph, where the nodes in the tree are the maximal cliques of $G$. This tree satisfies the running intersection property (Koller and Friedman (2009, p. 348)): If $i \in C, C'$ for two tree nodes, then $i \in C''$ for every $C''$ along the unique tree path between $C$ and $C'$. Such a tree graph is known as the "junction tree" corresponding to $G$, and we can write the following "cluster recursion" formula (Koller and Friedman (2009, p. 363)):
$$p_G(x) = p_G(x_{C_r}) \prod_i p_G(x_{C_i} | x_{C_{r(i)}}) = p(x_{C_r}) \prod_i p(x_{C_i} | x_{C_{r(i)}})$$

where $C_r$ is an arbitrarily selected root clique node and $C_{r(i)}$ is the upstream neighbor of clique $i$ (the one on the unique path from $C_i$ to the root $C_r$). The second equality is due to the fact that $G$ is perfect, hence $p_G(x_C) \equiv p(x_C)$ for every clique $C$ of $G$.

Let $C_1, C_K \in \mathcal{C}$ be two cliques that include the nodes 1 and $n$, respectively. Furthermore, for a given junction tree representation of the DAG, select these cliques to be minimally distant from each other - i.e., $1, n \notin C$ for every $C$ along the junction-tree path between $C_1$ and $C_K$. We now derive an upper bound on $K$. Recall the running intersection property: If $i \in C_j, C_k$ for some $1 \leq j < k \leq K$, then $i \in C_h$ for every $h$ between $j$ and $k$. Since the cliques $C_1, ..., C_K$ are maximal, it follows that every $C_k$ along the sequence must introduce at least one new element $i \notin \cup_{j<k} C_j$, hence $K \leq n - 1$. Furthermore, since $G$ is assumed to be non-linear, the inequality is strict, because at least one $C_k$ along the sequence must contain at least three elements and therefore introduce at least two new elements. Thus, $K \leq n - 2$.

Since $p_G$ factorizes according to the junction tree, it follows that the distribution over the variables covered by the cliques along the path from $C_1$ to $C_K$ factorizes according to a linear DAG $1 \to C_1 \to \cdots \to C_K \to n$, as follows:

$$p_G(x_1, x_{C_1}, ..., x_{C_K}, x_n) = p(x_1) \prod_{k=1}^{K} p(x_{C_k} | x_{C_{k-1}}) \, p(x_n | x_{C_K}) \qquad (13)$$

where $C_0 = \{1\}$. The length of this linear DAG is $K + 2 \leq n$. While this factorization formula superficially completes the proof, note that the variables $x_{C_k}$ are typically multivariate normal variables, whereas our objective is to show that we can replace them with scalar (i.e., univariate) normal variables without changing $cov_G(x_1, x_n)$.

Recall that we can regard $p$ as a multivariate normal distribution without loss of generality. Furthermore, under such a distribution and for any two subsets of variables $C, C'$, the distribution of $x_C$ conditional on $x_{C'}$ can be written $x_C = A x_{C'} + \eta$, where $A$ is a matrix that depends on the means and covariances of $p$, and $\eta$ is a zero-mean vector that is uncorrelated with $x_{C'}$. Applying this property to the junction tree, we can describe $p_G(x_1, x_{C_1}, ..., x_{C_K}, x_n)$ via the following recursion:

$$x_1 \sim N(0, 1) \qquad (14)$$
$$x_{C_1} = A_1 x_1 + \eta_1$$
$$\vdots$$
$$x_{C_k} = A_k x_{C_{k-1}} + \eta_k$$
$$\vdots$$
$$x_{C_K} = A_K x_{C_{K-1}} + \eta_K$$
$$x_n = A_{K+1} x_{C_K} + \eta_n$$

where each equation describes an objective conditional distribution - in particular, the equation for $x_{C_k}$ describes $(p(x_{C_k} | x_{C_{k-1}}))$. The matrices $A_k$ are functions of the vectors $\beta_i$ in the original recursive model. The $\eta_k$'s are all zero mean and uncorrelated with the explanatory variables $x_{C_{k-1}}$, such that $E(x_{C_k} | x_{C_{k-1}}) = A_k x_{C_{k-1}}$. Furthermore, according to $p_G$ (i.e., the analyst's estimated model), each $x_k$ (with $k > 1$) is conditionally independent of $(x_1, ..., x_{k-1})$ given $x_{R(k)}$. Since the junction-tree factorization (13) represents exactly the same distribution $p_G$, this means that every $\eta_k$ is uncorrelated with all other $\eta_j$'s as well as with $x_1, ..., x_{C_{k-1}}$. Therefore,

$$E_G(x_1 x_n) = A_{K+1} A_K \cdots A_1$$

Since $p_G$ preserves the marginals of individual variables, $Var_G(x_k) = 1$ for all $k$.
In particular, $Var_G(x_1) = Var_G(x_n) = 1$. Then,

$$\rho_G(x_1 x_n) = A_{K+1} A_K \cdots A_1$$

Step 2: Defining a new distribution over scalar variables

For every $k$, define the variable

$$z_k = (A_{K+1} A_K \cdots A_{k+1}) \, x_{C_k} = \alpha_k x_{C_k}$$

Substituting the recursion (14) into the definition of $z_k$:

$$z_k = \alpha_k x_{C_k} = \alpha_k (A_k x_{C_{k-1}} + \eta_k) = z_{k-1} + \alpha_k \eta_k$$

Given that $p$ is taken to be multivariate normal, the equation for $z_k$ measures the objective conditional distribution $(p_G(z_k | z_{k-1}))$. Since $p_G$ does not distort the objective distribution over cliques, $(p_G(z_k | z_{k-1}))$ coincides with $(p(z_k | z_{k-1}))$. This means that an analyst who fits a recursive model given by the linear DAG $G': x_1 \to z_1 \to \cdots \to z_K \to x_n$ will obtain the following estimated model, where every $\varepsilon_k$ is a zero-mean scalar variable that is assumed by the analyst to be uncorrelated with the other $\varepsilon_j$'s as well as with $z_1, ..., z_k$ (and as before, the assumption holds automatically for $z_k$ but is typically erroneous for $z_j$, $j < k$):

$$x_1 \sim N(0, 1)$$
$$z_1 = \alpha_1 A_1 x_1 + \varepsilon_1$$
$$\vdots$$
$$z_{k+1} = z_k + \varepsilon_{k+1}$$
$$\vdots$$
$$x_n = z_K + \varepsilon_n$$

Therefore, $E_{G'}(x_1 x_n)$ is given by

$$E_{G'}(x_1 x_n) = A_{K+1} A_K \cdots A_1$$

Since $G'$ is perfect, $Var_{G'}(x_n) = 1$, hence

$$\rho_{G'}(x_1 x_n) = A_{K+1} A_K \cdots A_1 = \rho_G(x_1 x_n)$$

We have thus reduced our problem to finding the largest $\hat\rho_{1n}$ that can be attained by a linear DAG $G: 1 \to \cdots \to n$ of length $n$ at most.

To solve the reduced problem we have arrived at, we first note that

$$\hat\rho_{1n} = \prod_{k=1}^{n-1} \rho_{k,k+1} \qquad (15)$$

Thus, the problem of maximizing $\hat\rho_{1n}$ is equivalent to maximizing the product of terms in a symmetric $n \times n$ matrix, subject to the constraints that the matrix is positive semi-definite, all diagonal elements are equal to one, and the $(1, n)$ entry is equal to $r$:

$$\rho^*_n = \max_{\substack{\rho_{ij} = \rho_{ji} \text{ for all } i,j \\ (\rho_{ij}) \text{ is P.S.D.} \\ \rho_{ii} = 1 \text{ for all } i \\ \rho_{1n} = r}} \; \prod_{i=1}^{n-1} \rho_{i,i+1}$$

Note that the positive semi-definiteness constraint is what makes the problem nontrivial. We can arbitrarily increase the value of the objective function by raising off-diagonal terms of the matrix, but at some point this will violate positive semi-definiteness. Since positive semi-definiteness can be rephrased as the requirement that $(\rho_{ij}) = AA^T$ for some matrix $A$, we can rewrite the constrained maximization problem as follows:

$$\rho^*_n = \max_{\substack{a_i^T a_i = 1 \text{ for all } i \\ a_1^T a_n = r}} \; \prod_{i=1}^{n-1} a_i^T a_{i+1} \qquad (16)$$

Denote $\alpha = \arccos r$. Since the solution to (16) is invariant to a rotation of all vectors $a_i$, we can set

$$a_1 = e_1, \qquad a_n = e_1 \cos\alpha + e_2 \sin\alpha$$

without loss of generality. Note that $a_1, a_n$ are both unit norm and have dot product $r$. Thus, we have eliminated the constraint $a_1^T a_n = r$ and reduced the variables in the maximization problem to $a_2, ..., a_{n-1}$.

Now consider some $k = 2, ..., n-1$. Fix $a_j$ for all $j \neq k$, and choose $a_k$ to maximize the objective function. As a first step, we show that $a_k$ must be a linear combination of $a_{k-1}, a_{k+1}$. To show this, we write $a_k = u + v$, where $u, v$ are orthogonal vectors, $u$ is in the subspace spanned by $a_{k-1}, a_{k+1}$ and $v$ is orthogonal to that subspace. Recall that $a_k$ is a unit-norm vector, which implies that

$$\|u\|^2 + \|v\|^2 = 1 \qquad (17)$$

The terms in the objective function (16) that depend on $a_k$ are simply $(a_{k-1}^T u)(a_{k+1}^T u)$. All the other terms in the product do not depend on $a_k$, and the dot products between $a_k$ and $a_{k-1}, a_{k+1}$ are invariant to $v$: $a_{k-1}^T(u + v) = a_{k-1}^T u$.

Suppose that $v$ is nonzero.
Then, we can replace $a_k$ with another unit-norm vector $u/\|u\|$, such that $(a_{k-1}^T u)(a_{k+1}^T u)$ will be replaced by

$$\frac{(a_{k-1}^T u)(a_{k+1}^T u)}{\|u\|^2}$$

By (17) and the assumption that $v$ is nonzero, $\|u\| < 1$, hence the replacement is an improvement. It follows that $a_k$ can be part of an optimal solution only if it lies in the subspace spanned by $a_{k-1}, a_{k+1}$. Geometrically, this means that $a_k$ lies in the plane defined by the origin and $a_{k-1}, a_{k+1}$.

Having established that $a_k, a_{k-1}, a_{k+1}$ are coplanar, let $\alpha$ be the angle between $a_k$ and $a_{k-1}$, let $\beta$ be the angle between $a_k$ and $a_{k+1}$, and let $\gamma$ be the (fixed) angle between $a_{k-1}$ and $a_{k+1}$. Due to the coplanarity constraint, $\alpha + \beta = \gamma$. Fixing $a_j$ for all $j \neq k$ and applying a logarithmic transformation to the objective function, the optimal $a_k$ must maximize

$$\log\cos(\alpha) + \log\cos(\gamma - \alpha)$$

Differentiating this expression with respect to $\alpha$ and setting the derivative to zero, we obtain $\alpha = \beta = \gamma/2$. Since this must hold for any $k = 2, ..., n-1$, we conclude that at the optimum, every $a_k$ lies on the plane defined by the origin and $a_{k-1}, a_{k+1}$ and is at the same angular distance from $a_{k-1}$ and $a_{k+1}$. That is, an optimum must be a set of equiangular unit vectors on a great circle, equally spaced between $a_1$ and $a_n$. The explicit formulas for these vectors are given by (5).

[Figure 2 about here: Geometric intuition for the proof.]

The formula for the upper bound has a simple geometric interpretation (illustrated by Figure 2). We are given two points on the unit $n$-dimensional sphere (representing $a_1$ and $a_n$) whose dot product is $r$, and we seek $n - 2$ additional points on the sphere, equally spaced along the great-circle arc between $a_1$ and $a_n$. Since by construction every pair of neighboring points $a_k$ and $a_{k+1}$ has a dot product of $\cos\theta_{r,n}$, we have $\rho_{k,k+1} = \cos\theta_{r,n}$, such that $\hat\rho_{1n} = (\cos\theta_{r,n})^{n-1}$. This completes the proof.
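As a numerical sanity check of the equiangular solution (our sketch; the choice $n = 6$, $r = 0$ is arbitrary), random feasible perturbations of the construction in (5) never increase the objective in (16):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 6, 0.0
theta = np.arccos(r) / (n - 1)
ang = theta * np.arange(n)
A = np.stack([np.cos(ang), np.sin(ang)], axis=1)   # the unit vectors from (5)

obj = lambda V: np.prod(np.sum(V[:-1] * V[1:], axis=1))
print(obj(A), np.cos(theta) ** (n - 1))            # both ~0.778

for _ in range(10_000):
    B = A + 0.1 * rng.normal(size=A.shape)
    B[0], B[-1] = A[0], A[-1]                      # keep a_1, a_n, so a_1.a_n = r
    B /= np.linalg.norm(B, axis=1, keepdims=True)  # keep unit norms
    assert obj(B) <= obj(A) + 1e-12                # never beats the optimum
```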
Conclusion

This paper performed a worst-case analysis of misspecified recursive models. We showed that within this class, model selection is a very powerful tool in the hands of an opportunistic analyst: If we allow him to freely select a moderate number of variables from a large pool, he can produce a very large estimated correlation between two variables of interest. Furthermore, the structure of his model allows him to interpret this correlation as a causal effect. This is true even if the two variables are objectively independent, or if their correlation is in the opposite direction. Imposing a bound on the model's complexity (measured by its number of auxiliary variables) is an important constraint on the analyst. However, the value of this bound decays quickly, as even with one or two auxiliary variables the analyst can greatly distort objective correlations.

Within our framework, several questions are left open. First, we do not know whether Theorem 1 would continue to hold if we replaced the quantifier "for almost every p" with "for every p". Second, we do not know how much bite the undistorted-variance constraint has in models with more than one non-trivial equation. Third, we lack complete characterizations for recursive models outside the linear-regression family (see our partial characterization for models that involve binary variables in the Appendix). Finally, it would be interesting to devise a sparse collection of misspecification or robustness tests that would restrain our opportunistic analyst.

Taking a broader perspective on the last question, our exercise suggests a novel approach to the study of biased estimates due to misspecified models in Statistics and Econometric Theory (foreshadowed by Spiess (2018)). Under this approach, the analyst who employs a structural model for statistical or causal analysis is viewed as a player in a game with his audience. Researcher bias implies a conflict of interests between the two parties. This bias means that the analyst's model selection is opportunistic. The question is which strategies the audience can play (in terms of robustness or misspecification tests it can demand) in order to mitigate errors due to researcher bias, without rejecting too many valuable models.

Appendix: Uniform Binary Variables

Suppose now that the variables $x_1, ..., x_n$ all take values in $\{-1, 1\}$, and restrict attention to the class of objective distributions $p$ whose marginal on each variable is uniform - i.e., $p(x_i = 1) = \frac{1}{2}$ for every $i = 1, ..., n$. As in our main model, fix the correlation between $x_1$ and $x_n$ to be $r$ - that is,

$$\rho_{1n} = p(x_n = 1 | x_1 = 1) - p(x_n = 1 | x_1 = -1) = r$$

The question of finding the distribution $p$ (in the above restricted domain) and the DAG $G$ that maximize the induced $\hat\rho_{1n}$ subject to $p_G(x_i = 1) = \frac{1}{2}$ is generally open. However, when we fix $G$ to be the linear DAG

$$1 \to 2 \to \cdots \to n$$

we are able to find the maximal $\hat\rho_{1n}$. It makes sense to consider this specific DAG, because it proved to be the one most conducive to generating false correlations in the case of linear-regression models.

Given the DAG $G$ and the objective distribution $p$, the correlation between $x_i$ and $x_j$ that is induced by $p_G$ is

$$\hat\rho_{ij} = p_G(x_j = 1 | x_i = 1) - p_G(x_j = 1 | x_i = -1)$$

for $j > i$. Given the structure of the linear DAG, we can write

$$p_G(x_j | x_i) = \sum_{x_{i+1},...,x_{j-1}} p(x_{i+1} | x_i) \, p(x_{i+2} | x_{i+1}) \cdots p(x_j | x_{j-1}) \qquad (18)$$

In particular,

$$p_G(x_n | x_1) = \sum_{x_2,...,x_{n-1}} p(x_2 | x_1) \, p(x_3 | x_2) \cdots p(x_n | x_{n-1}) = \sum_{x_2} p(x_2 | x_1) \, p_G(x_n | x_2) \qquad (19)$$

Note that $p_G(x_n | x_2)$ has the same expression that we would have if we dealt with a linear DAG of length $n - 1$, in which 2 is the ancestral node: $2 \to \cdots \to n$. This observation will enable us to apply an inductive proof to our result.

Lemma 3

For every $p$, $\hat\rho_{1n} = \rho_{12} \cdot \hat\rho_{2n}$.

Proof. Applying simple algebraic manipulation of (19), $\hat\rho_{1n}$ is equal to

$$[p(x_2 = 1 | x_1 = 1) - p(x_2 = 1 | x_1 = -1)] \cdot [p_G(x_n = 1 | x_2 = 1) - p_G(x_n = 1 | x_2 = -1)] = \rho_{12} \cdot \hat\rho_{2n}$$

We can now derive an upper bound on $\hat\rho_{1n}$ for the environment of this appendix - i.e., the estimated model is a linear DAG, and the objective distribution has uniform marginals over binary variables.

Proposition 3

For every $n$,

$$\hat\rho_{1n} \leq \left(1 - \frac{1-r}{n-1}\right)^{n-1}$$

Proof. The proof is by induction on $n$. Let $n = 2$. Then, $p_G(x_2 | x_1) = p(x_2 | x_1)$, and therefore $\hat\rho_{12} = r$, which confirms the formula.

Suppose that the claim holds for some $n = k \geq 2$. Now let $n = k + 1$. Consider the distribution of $x_2$ conditional on $x_1, x_n$. Denote $\alpha_{x_1,x_n} = p(x_2 = 1 | x_1, x_n)$. We wish to derive a relation between $\rho_{12}$ and $\rho_{2n}$. Denote $q = \frac{1+r}{2}$, such that

$$p(x_n = 1 | x_1 = 1) = p(x_n = -1 | x_1 = -1) = q$$

Then,

$$p(x_2 = 1 | x_1 = 1) = p(x_n = 1 | x_1 = 1) \cdot \alpha_{1,1} + p(x_n = -1 | x_1 = 1) \cdot \alpha_{1,-1} = q\alpha_{1,1} + (1-q)\alpha_{1,-1}$$

Likewise,
$$p(x_2 = 1 | x_1 = -1) = p(x_n = 1 | x_1 = -1) \cdot \alpha_{-1,1} + p(x_n = -1 | x_1 = -1) \cdot \alpha_{-1,-1} = q\alpha_{-1,-1} + (1-q)\alpha_{-1,1}$$

The correlation between $x_1$ and $x_2$ is thus

$$\rho_{12} = q(\alpha_{1,1} - \alpha_{-1,-1}) + (1-q)(\alpha_{1,-1} - \alpha_{-1,1}) \qquad (20)$$

Let us now turn to the joint distribution of $x_n$ and $x_2$. Because the marginals on both $x_1$ and $x_n$ are uniform, $p(x_n | x_1) = p(x_1 | x_n)$. Therefore, we can obtain $\rho_{2n}$ in the same manner that we obtained $\rho_{12}$:

$$\rho_{2n} = q(\alpha_{1,1} - \alpha_{-1,-1}) + (1-q)(\alpha_{-1,1} - \alpha_{1,-1}) \qquad (21)$$

We have thus established a relation between $\rho_{12}$ and $\rho_{2n}$.

Recall that $\hat\rho_{2n}$ is the expression we would have for the linear DAG $2 \to \cdots \to n$, whose endpoint correlation is $\rho_{2n}$. Therefore, by the inductive step,

$$\hat\rho_{1n} = \rho_{12} \cdot \hat\rho_{2n} \leq [q(\alpha_{1,1} - \alpha_{-1,-1}) + (1-q)(\alpha_{1,-1} - \alpha_{-1,1})] \cdot \left(1 - \frac{1 - \rho_{2n}}{k-1}\right)^{k-1} \qquad (22)$$

Both $\rho_{12}$ and $\rho_{2n}$ increase in $\alpha_{1,1}$ and decrease in $\alpha_{-1,-1}$, such that we can set $\alpha_{1,1} = 1$ and $\alpha_{-1,-1} = 0$ without lowering the R.H.S of (22). This enables us to write

$$\rho_{12} = q + (1-q)(\alpha_{1,-1} - \alpha_{-1,1})$$

such that

$$\rho_{2n} = 1 + r - \rho_{12}$$

Therefore, we can transform (22) into

$$\hat\rho_{1n} \leq \max_{\rho_{12}} \; \rho_{12} \cdot \left(1 - \frac{\rho_{12} - r}{k-1}\right)^{k-1}$$

The R.H.S is a straightforward maximization problem. Performing a logarithmic transformation and writing down the first-order condition, we obtain

$$\rho_{12}^* = 1 - \frac{1-r}{k}, \qquad \left(1 - \frac{\rho_{12}^* - r}{k-1}\right)^{k-1} = \left(1 - \frac{1-r}{k}\right)^{k-1}$$

such that

$$\hat\rho_{1n} \leq \left(1 - \frac{1-r}{k}\right)^{k}$$

which completes the proof (recall that $n = k + 1$).

How does this upper bound compare with the Gaussian case? For illustration, let $r = 0$. Then, it is easy to see that for $n = 3$, we obtain $\hat\rho_{13} \leq \frac{1}{4}$, which is below the value of $\frac{1}{2}$ we were able to obtain in the Gaussian case. And as $n \to \infty$, the bound on $\hat\rho_{1n}$ converges to $1/e$. That is, unlike the Gaussian case, the maximal false correlation that the linear DAG can generate is bounded far away from one.
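A compact comparison of the two bounds at $r = 0$ (our sketch):

```python
import numpy as np

n = np.arange(3, 51)
gauss  = np.cos((np.pi / 2) / (n - 1)) ** (n - 1)  # Theorem 1 bound at r = 0
binary = (1 - 1 / (n - 1)) ** (n - 1)              # Proposition 3 bound at r = 0
print(binary[0], gauss[0])     # 0.25 vs 0.5 at n = 3
print(binary[-1], gauss[-1])   # ~0.364 (toward 1/e) vs ~0.975 (toward 1)
```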
The upper bound obtained in this result is tight. The following is one way to implement it. For the case $r = 0$, take the exact same Gaussian distribution over $x_1, ..., x_n$ that we used to implement the upper bound in Theorem 1, and now define the variable $y_k = sign(x_k)$ for each $k = 1, ..., n$. Clearly, each $y_k \in \{-1, 1\}$ and $p(y_k = 1) = p(y_k = -1) = \frac{1}{2}$, since each $x_k$ has zero mean. To find the correlations between different $y_k$ variables, we use the following lemma.

Lemma 4

Let $w_1, w_2$ be two unit vectors in $\mathbb{R}^2$ and let $z$ be a multivariate Gaussian with zero mean and unit covariance. Then,

$$E(sign(w_1^T z) \, sign(w_2^T z)) = 1 - \frac{2\theta}{\pi}$$

where $\theta$ is the angle between the two vectors.

Proof. This follows from the fact that the product $sign(w_1^T z) \, sign(w_2^T z)$ is equal to 1 whenever $z$ is on the same side of the two hyperplanes defined by $w_1$ and $w_2$, and $-1$ otherwise. Since the distribution of $z$ is circularly symmetric, the probability that $z$ lies on the same side of the two hyperplanes depends only on the angle between them.

Returning to the definition of the Gaussian distribution over $x_1, ..., x_n$ that we used to implement the upper bound in Theorem 1, we see that in the case of $r = 0$, the angle between $w_1$ and $w_n$ will be $\frac{\pi}{2}$, so that by the above lemma, $y_1$ and $y_n$ will be uncorrelated. At the same time, the angle between any $w_k$ and $w_{k-1}$ is by construction $\frac{\pi}{2(n-1)}$, because the vectors were chosen at equal angles along the great circle. Substituting this angle into the lemma, we obtain that the correlation between $y_k$ and $y_{k-1}$ is $1 - \frac{1}{n-1}$.

For the case where $r \neq 0$, the same argument holds, except that we need to choose the original vectors $w_1, w_n$ so that the correlation between $y_1$ and $y_n$ will be $r$ (these will not be the same vectors that give a correlation of $r$ between the Gaussian variables $x_1$ and $x_n$), and then choose the rest of the vectors at equal angles along the great circle. By applying the lemma again, we obtain that the correlation between $y_k$ and $y_{k-1}$ is $1 - \frac{1-r}{n-1}$, which again attains the upper bound.

This method of implementing the upper bound also explains why false correlations are harder to generate in the uniform binary case, compared with the case of linear-regression models. The variable $y_k$ is a coarsening of the original Gaussian variable $x_k$. It is well-known that when we coarsen Gaussian variables, we weaken their mutual correlation. Therefore, the correlation between any consecutive variables $y_k, y_{k+1}$ in the construction for the uniform binary case is lower than the corresponding correlation in the Gaussian case. As a result, the maximal correlation that the model generates is also lower.

The obvious open question is whether the restriction to linear DAGs entails a loss of generality. We conjecture that in the case of uniform binary variables, a non-linear perfect DAG can generate larger false correlations for sufficiently large $n$.

References

[1] Angrist, J. and J. Pischke (2008), Mostly Harmless Econometrics: An Empiricist's Companion, Princeton University Press.

[2] Bonhomme, S. and M. Weidner (2018), Minimizing Sensitivity to Model Misspecification, arXiv preprint arXiv:1807.02161.

[3] Caron, R. and T. Traynor (2005), The Zero Set of a Polynomial, WSMR Report 05-02.

[4] Cowell, R., P. Dawid, S. Lauritzen and D. Spiegelhalter (1999), Probabilistic Networks and Expert Systems, Springer, London.

[5] Drton, M., B. Sturmfels and S. Sullivant (2008), Lectures on Algebraic Statistics, Vol. 39, Springer Science & Business Media.

[6] Eliaz, K. and R. Spiegler (2018), A Model of Competing Narratives, mimeo.

[7] Esponda, I. and D. Pouzo (2016), Berk-Nash Equilibrium: A Framework for Modeling Agents with Misspecified Models, Econometrica 84, 1093-1130.

[8] Glaeser, E. (2008), Researcher Incentives and Empirical Methods, in The Foundations of Positive and Normative Economics (Andrew Caplin and Andrew Schotter, eds.), Oxford: Oxford University Press, 300-319.

[9] Leamer, E. (1974), False Models and Post-Data Model Construction, Journal of the American Statistical Association 69(345), 122-131.

[10] Molavi, P. (2019), Macroeconomics with Learning and Misspecification: A General Theory and Applications, mimeo.

[11] Lovell, M. (1983), Data Mining, The Review of Economics and Statistics 65(1), 1-12.

[12] Di Tillio, A., M. Ottaviani and P. Sorensen (2017), Persuasion Bias in Science: Can Economics Help? Economic Journal.

[14] Pearl, J. (2009), Causality: Models, Reasoning and Inference, Cambridge University Press, Cambridge.

[15] Koller, D. and N. Friedman (2009), Probabilistic Graphical Models: Principles and Techniques, MIT Press, Cambridge MA.

[16] Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher and P. Sabeti (2011), Detecting Novel Associations in Large Data Sets, Science 334, 1518-1524.