The Illusion of the Illusion of Sparsity: An exercise in prior sensitivity
Bruno Fava, Northwestern University, USA, and Hedibert F. Lopes, Insper, Brazil. This draft: September 9, 2020.
Abstract
The emergence of Big Data raises the question of how to model economic relations when there is a large number of possible explanatory variables. We revisit the issue by comparing the possibility of using dense or sparse models in a Bayesian approach, allowing for variable selection and shrinkage. More specifically, we discuss the results reached by Giannone et al. [2020] through a "Spike-and-Slab" prior, which suggest an "illusion of sparsity" in economic data, as no clear patterns of sparsity could be detected. We make a further revision of the posterior distributions of the model, and propose three experiments to evaluate the robustness of the adopted prior distribution. We find that the pattern of sparsity is sensitive to the prior distribution of the regression coefficients, and present evidence that the model indirectly induces variable selection and shrinkage, which suggests that the "illusion of sparsity" could be, itself, an illusion. Code is available at github.com/bfava/IllusionOfIllusion.

Keywords: Sparsity, Model Selection, High-Dimensional Data, Shrinkage, Bayesian Econometrics.

1 Introduction
It is the Big Data era. While tech giants revolutionize markets in the U.S. and China, economists are still adapting to the new flow of data and studying how to incorporate it into the research agenda. In the presence of many data sources, it is common to have a large number of variables that can possibly determine a variable of interest, so that the number of regressors approaches or even exceeds the number of observations. This article addresses how to deal with this situation, considering whether it is wiser to use all the available regressors or to use methods that select the most important ones.

In datasets in which the regressors outnumber the observations, classical estimation methods such as Ordinary Least Squares (OLS) are not even feasible, as statistical inference would be based on a negative number of degrees of freedom. Even when there is a small positive number of degrees of freedom, the OLS estimator yields very poor results, since overfitting and high degrees of multicollinearity are to be expected. Many methods have been developed to deal with the problem, using classical and Bayesian statistics, as well as Machine Learning (ML) schemes, such as Random Forests (RF), which accommodate nonlinearity in the mean function and can handle a large number of variables.

Even though the literature has identified classes of models that predict well, not everything in Economics is about prediction. A vast class of articles in Economics focuses on the individual impact of a few key regressors on a response variable; models like the RF may therefore be inadequate, as they make it difficult to interpret the individual effect of each regressor, in addition to yielding biased estimates of partial effects.

It is then convenient to look at the so-called sparse models, which, in the presence of many predictors, select the most important ones. The counterpart are the dense models, which, instead of choosing some variables over the others, consider all of them, shrinking the estimated coefficients towards zero so that, despite the relatively small sample size, overfitting is avoided.

A series of models have been developed that assume sparsity in the explanatory variables, for example the famous Least Absolute Shrinkage and Selection Operator (LASSO), introduced by Tibshirani [1996]. By placing a constant limit on the sum of the absolute values of the coefficients, the LASSO shrinks the coefficients towards zero and, in doing so, estimates some of them to be exactly zero, that is, it excludes those variables from the model. This kind of design solves the problem of the large number of predictors by using statistical inference to determine which ones are the most important, thereby allowing an easy interpretation of partial effects.

Still, the choice of a sparse model may not be a free lunch. A recent work by Giannone et al. [2020], henceforth referred to as GLP, explored the suitability of sparse modeling for economic series. They took two datasets in each of Microeconomics, Macroeconomics and Finance, and defined a "Spike-and-Slab" prior for the coefficients of linear predictive models, following Mitchell and Beauchamp [1988].
This prior was chosen because, by taking the probability q of inclusion of each predictor as an unknown parameter with a uniform prior, it allows the model to take either the sparse or the dense design, hence not assuming one of them, and making inference on which of the two possibilities is more probable.

The results are not encouraging for those who prefer sparse representations: the model yielded a non-sparse design in five of the six applications, motivating the title of their article, "Economic predictions with big data: The illusion of sparsity". The authors conclude that sparsity should not simply be assumed when modeling an economic series, as it is uncertain, and should only be imposed in the presence of strong statistical evidence.

This work proposes a revision of the methods adopted by GLP. We reproduce the model they used, a "Spike-and-Slab" prior distribution that considers, in a linear model, a probability q of inclusion for each predictor, while the included coefficients are modeled as draws from a Gaussian distribution. The variance of this distribution is governed by γ², which thus controls the degree of shrinkage. By treating both hyperparameters as random variables, GLP conducted Bayesian inference on them to see whether the posterior concentrates on small values of q or instead relies more heavily on shrinkage, that is, whether the dataset should be treated mainly as sparse or as dense.

We use five of the six original datasets from GLP (Micro 1 and 2, Macro 1 and 2, and Finance 1) and reproduce the algorithm for estimating the model, first reinterpreting the posterior distributions, and then proposing three experiments to evaluate how well the model behaves in controlled environments.

First, we analyze the posterior distribution of the coefficients of the linear model, when included, which was not explored in GLP. It indicates a certain inability of the model to distinguish whether a variable should be excluded or included with a very small coefficient, which would result in an overestimation of the probability of inclusion, and could help explain the results achieved. Second, we add completely random variables as possible predictors to the datasets, and find that the model is able to correctly exclude them only in a subset of the datasets. Third, we propose a modification of the prior distribution of the coefficients of the linear model, fitting a Student-t distribution instead of a Gaussian, allowing for fatter tails. The heavier-tailed distribution was more restrictive in selecting possible predictors, and the results once again corroborate the thesis that the original Spike-and-Slab prior is unable to properly distinguish between shrinkage and sparsity. Finally, we develop a simulation study to check the performance of the original model and of the Student-t modification in a totally controlled environment. While neither approach performs particularly well, the analysis of the posterior distributions reinforces the belief that the adopted prior distribution incorrectly induces shrinkage.

All the evidence raised allows this paper to conclude that the Spike-and-Slab approach does not seem robust, and could lead to the illusion that sparsity is nonexistent when it might in fact exist.

The rest of the article is organized as follows. In Section 2 we explore the article from GLP, explain thoroughly the model used, and discuss the main results found in the paper.
In Section 3 we propose the three experiments: adding random variables to the datasets, modifying the prior distribution of the coefficients from a normal to a Student-t distribution, and finally a simulation study. In Section 4, we conclude.

2 The model and results of GLP

In this section, we reproduce and explore the analysis made by Giannone et al. [2020], with a "Spike-and-Slab" prior distribution for a linear predictive model applied to different economics-related datasets. GLP selected six popular datasets they consider "big data", for the relatively large number of predictors compared to the number of observations: two in Macroeconomics, two in Microeconomics and two in Finance. From the six settings, we leave out only the Finance 2 dataset. Before turning to the model, we briefly review the existing literature on Bayesian sparsity that motivated the findings of Giannone et al. [2020].

2.1 A brief review of Bayesian regularization
For the sake of space, let us consider the standard Gaussian linear model, already in matrix form,
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n),$$
and let $RSS = (y - X\beta)'(y - X\beta)$ be the residual sum of squares. Two of the most popular forms of regularization arise from the $\ell_2$-penalized ridge regression of Hoerl and Kennard [1970] and the $\ell_1$-penalized lasso regression of Tibshirani [1996]:
$$\hat\beta_{ridge} = \arg\min_\beta \left\{ RSS + \lambda_r \sum_{j=1}^q \beta_j^2 \right\}, \quad \lambda_r \geq 0, \qquad \text{so that} \quad \hat\beta_{ridge} = (X'X + \lambda_r I_q)^{-1} X'y,$$
$$\hat\beta_{lasso} = \arg\min_\beta \left\{ RSS + \lambda_l \sum_{j=1}^q |\beta_j| \right\}, \quad \lambda_l \geq 0,$$
the latter of which can be solved by a coordinate gradient descent algorithm.

As is well established, both ridge and lasso estimates are essentially posterior modes. Broadly speaking, the posterior mode, or maximum a posteriori (MAP) estimate, is given by
$$\tilde\beta_{mode} = \arg\min_\beta \left\{ -\log p(y \mid \beta) - \log p(\beta) \right\}.$$
The $\hat\beta_{ridge}$ estimate hence equals the posterior mode of the normal linear model with $p(\beta_j) \propto \exp\{-\lambda_r \beta_j^2\}$, a Gaussian distribution with location 0 and variance $1/(2\lambda_r)$; its mean is 0 and its excess kurtosis is 0. Similarly, the $\hat\beta_{lasso}$ estimate equals the posterior mode of the normal linear model with $p(\beta_j) \propto \exp\{-\lambda_l |\beta_j|\}$, a Laplace distribution with location 0 and scale $1/\lambda_l$; its mean is 0, its variance is $2/\lambda_l^2$ and its excess kurtosis is 3.

As a matter of fact, a whole family of regularization schemes arises, as suggested by Ishwaran and Rao [2005], by defining a spike-and-slab model as a Bayesian model specified by the following prior hierarchy:
$$(y_t \mid x_t, \beta, \sigma^2) \sim N(x_t'\beta, \sigma^2), \quad t = 1, \ldots, n,$$
$$(\beta \mid \psi) \sim N(0, \mathrm{diag}(\psi)), \qquad \psi \sim \pi(d\psi), \qquad \sigma^2 \sim \mu(d\sigma^2).$$
The distribution chosen to model ψ defines what kind of shrinkage and selection strategy is being used. Alternative choices for ψ appear, amongst many others, in the two-component spike-and-slab-type prior of George and McCulloch [1993], the Laplace prior of Park and Casella [2008], the Normal-Gamma prior of Griffin and Brown [2010], the horseshoe prior of Carvalho et al. [2010], the Dirichlet-Laplace prior of Bhattacharya et al. [2015] and the spike-and-slab lasso of Ročková and George [2018]. See also Hahn et al. [2019], who introduced an efficient sampling scheme for Gaussian linear regression with arbitrary priors.
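To make the posterior-mode connection concrete, here is a minimal R sketch (simulated data and an arbitrary value of λ_r, both our own illustrative choices, not code from GLP or from this paper's repository), verifying that the closed-form ridge solution coincides with the numerical minimizer of the penalized criterion:

```r
set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(1, -2, 0, 0, 0.5) + rnorm(n))
lambda_r <- 3  # arbitrary illustrative penalty

# Closed-form ridge solution: (X'X + lambda_r I)^(-1) X'y
ridge_closed <- drop(solve(crossprod(X) + lambda_r * diag(p), crossprod(X, y)))

# Direct numerical minimization of RSS + lambda_r * sum(beta^2)
penalized_rss <- function(b) sum((y - X %*% b)^2) + lambda_r * sum(b^2)
ridge_numeric <- optim(rep(0, p), penalized_rss, method = "BFGS")$par

max(abs(ridge_closed - ridge_numeric))  # close to zero: same estimate
```

The analogous exercise with the absolute-value penalty recovers the lasso, whose posterior-mode interpretation under a Laplace prior is stated above.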
2.2 The model

Given a response variable y_t, a vector of possible predictors x_t, of size k, and a vector of always-included variables u_t, of size l, with generally k ≫ l, the model is defined as
$$y_t = u_t'\phi + x_t'\beta + \varepsilon_t,$$
where ε_t is an i.i.d. Gaussian error term with zero mean and variance σ². For simplicity, all variables are standardized to have zero mean and variance one. The vector φ never contains zeros, as the predictors included in u_t are always taken as relevant to the regression. The vector β, in contrast, is meant to reveal whether a dense or a sparse representation is more suitable: most of its elements may be zero, defining a sparse model, or non-zero, a dense model. To reflect the possibility of either representation, the following prior distribution is proposed for the unknown parameters (σ², φ, β):
$$p(\sigma^2) \propto \frac{1}{\sigma^2}, \qquad p(\phi) \propto \text{const}, \qquad \beta_i \mid \sigma^2, \gamma^2, q \sim \begin{cases} N(0, \sigma^2\gamma^2) & \text{with prob. } q, \\ 0 & \text{with prob. } 1-q, \end{cases} \quad i = 1, \ldots, k,$$
where the prior for the variance σ² is the improper Jeffreys prior, the prior for φ is uninformative, and each element of β is either zero, with probability 1 − q, or a draw from a Gaussian distribution with zero mean and variance σ²γ², with probability q. The hyperparameter γ² controls the degree of shrinkage: the larger γ², the smaller the shrinkage, as the regression coefficients can lie further from zero.

The prior distribution of the hyperparameter γ² is induced by a prior on a transformation of the coefficients. Specifically, GLP set a prior on the coefficient of determination,
$$R^2(\gamma^2, q) \equiv \frac{q k \gamma^2 \bar{v}_x}{q k \gamma^2 \bar{v}_x + 1},$$
where v̄_x is the sample average variance of the predictors. The prior distribution of the hyperparameters is then defined by uniform distributions:
$$q \sim Beta(1, 1), \qquad R^2 \sim Beta(1, 1).$$

The heatmaps in figure 1 show the posterior probability of inclusion of each predictor in the model, that is, the share of iterations of the Markov Chain Monte Carlo estimation in which each covariate was included. If a stripe is near-black, the corresponding predictor was included in nearly all iterations, that is, its probability of inclusion is close to 100%; if the stripe is light yellow, its probability of inclusion is small.
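Before turning to the results, note that the uniform hyperpriors above already imply a particular prior on the shrinkage scale. The following short R sketch (ours, with illustrative values k = 16, as in Finance 1, and v̄_x = 1, since the predictors are standardized) draws (q, R²) from their priors and inverts the R²(γ², q) mapping to obtain the implied prior on γ²:

```r
set.seed(2)
k <- 16      # number of candidate predictors (Finance 1 size)
v_x <- 1     # average sample variance of the standardized predictors
M <- 1e5     # number of prior draws

q  <- rbeta(M, 1, 1)   # Beta(1,1) is the uniform distribution on (0,1)
R2 <- rbeta(M, 1, 1)

# Invert R2 = q*k*gamma2*v_x / (q*k*gamma2*v_x + 1) for gamma2
gamma2 <- R2 / (q * k * v_x * (1 - R2))

quantile(gamma2, c(.1, .5, .9))  # implied prior on the shrinkage scale
```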
After analyzing the posterior distributions, GLP investigate whether a pattern of sparsity can be identified in the datasets, by measuring the percentage of times each variable was included in the regression (figure 1). The conclusion is that a clear pattern of sparsity is found only in the Micro 1 dataset, in which only one variable is included most of the time. For all other datasets it is not possible to distinguish which variables should be included, as many have a high estimated probability of inclusion. That indicates that a dense model, which allows for the selection of many variables while shrinking their coefficients, should be the most adequate for them. Thus, even when the estimated number of included variables is small, it might not be easy to determine what the pattern of sparsity should be, that is, which variables should be selected. This result leads GLP to conclude that sparsity cannot be assumed for any economic dataset unless in the presence of strong statistical evidence, and suggests an "illusion of sparsity" when using statistical models that assume (and force) sparsity.

2.3 The posterior distribution of β_i | (β_i ≠ 0)

We present the posterior distribution of the coefficients β_i | (β_i ≠ 0) for all possible predictors, which was not shown in GLP. We focus on the Finance 1 and Macro 2 datasets only, for the convenience that they present a smaller number of covariates, 16 and 60, respectively. The posterior for Macro 2 is divided between figures 3 and 4, each with 30 predictors. The distributions for Finance 1 are shown in figure 2.

We evaluate how significant an included predictor is by analyzing how close to zero the coefficients of the included variables are. To clarify the results, above each graphic we show the number (index) of the predictor, followed by "Inc.", the probability of inclusion of that predictor (the same number plotted in figure 1), and "G0", the probability that the coefficient is greater than zero, that is, the percentage of times that the estimated coefficient was positive, considering only the draws in which the variable was included in the model.
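Both summaries are simple functionals of the MCMC output. A minimal R sketch follows, assuming hypothetical matrices z_draws (inclusion indicators) and beta_draws (coefficient draws), each with one row per iteration and one column per predictor; these names are ours, not from the paper's repository:

```r
inclusion_summaries <- function(z_draws, beta_draws) {
  inc <- colMeans(z_draws)  # "Inc.": share of iterations in which each predictor is included
  g0 <- sapply(seq_len(ncol(z_draws)), function(i) {
    b <- beta_draws[z_draws[, i] == 1, i]          # draws in which predictor i is included
    if (length(b) == 0) NA_real_ else mean(b > 0)  # "G0": P(beta_i > 0 | included)
  })
  data.frame(predictor = seq_along(inc), Inc = inc, G0 = g0)
}
```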
Included coefficients are expected to be shrunk towards zero, increasingly so as the probability of inclusion grows. Nevertheless, if the concentration of the posterior distribution around zero is too strong, it can be argued that the model is failing to determine whether a predictor is relevant to the fit or not. That is, if the distribution of β_i | (β_i ≠ 0) is very concentrated around zero, the likelihoods of z_i = 0 and z_i = 1 will be very close, as the inclusion of the coefficient has a very small impact on the regression. Therefore, the probability of inclusion q might be overestimated, and some of the coefficients shown in figure 1 with a high probability of inclusion may be playing an almost negligible role in the model, with the explanatory capacity of the covariates concentrated on a few predictors. That is, a pattern of sparsity might be hidden behind the many selected variables, which would imply that the chosen prior distributions are themselves inducing "density" and shrinkage, despite the goal of learning statistically whether shrinkage or selection is ideal.

The graphics in figures 2 to 4 reveal some interesting features of the posterior distribution. First, some included predictors do have posterior distributions that are very concentrated and symmetric around zero, indicating that if an economist were to define a pattern of predictors to include in a linear model from the posterior distribution of β, they would very probably exclude these covariates, even when the plot in figure 1 would indicate the opposite. This is clearly the case, for example, of variables 4 and 8 in the Finance 1 setting in figure 2, which are very concentrated and symmetric around zero, with nearly half of the distribution on each side of zero, despite probabilities of inclusion of 46% and 44%, respectively. Other variables, on the other hand, present a very distinct pattern when included. For example, predictors 2, 3, 9, 12 and 16 have a large share of their coefficient mass away from zero, while their probabilities of inclusion are not much higher than those of predictors 4 and 8: predictor 3, for example, is included only 53% of the time, and predictor 2, 58%.

Other covariates show a more peculiar and ambiguous behavior. Predictor 11, for instance, has its coefficient clearly concentrated on positive values, yet very close to zero, and is included only 47% of the time. It is hard to conclude from figure 2 that a clear pattern of sparsity can be distinguished. Still, a few variables are clearly more important than the rest, as is the case of predictors 2, 9, 12, 14 and 16, while others are less important, such as 4, 5, 6, 8, 11, 13 and 15. Even though there seems to be a direct relation between these patterns and the probability of inclusion (the least included predictor among the "most important" was included 58% of the time, whereas the most included among the "least important" was included 51% of the time), interpreting sparsity from figure 1 by itself is misleading, and hides important information behind each coefficient.

As for the posterior of β in the Macro 2 dataset, similar conclusions can be drawn. While some predictors are undoubtedly important for the model, such as predictor 1 (included 98% of the time, always with a negative coefficient), others have coefficients very close to zero when included, such as predictors 5, 28, 29, 32, 34, 38, 54 and 57. Still, while an economist could easily exclude such predictors from a regression model, some of them are included in the spike-and-slab with a significant probability, of at least 60%.
It is also interesting to notice that, for example, predictor 12 is highly offset from zero, with 75% of the distribution on negative values, but presents a relatively small percentage of inclusion, 66%, while predictor 28 is highly symmetric around zero, with 50% of the distribution on positive values, and is included at almost the same rate, 62% of the time.

This analysis lets us conclude that, even though a distinct pattern of sparsity indeed cannot be identified in the datasets, the spike-and-slab prior, as defined, seems to be itself inducing density and shrinkage, by frequently including many predictors with a near-zero coefficient.

3 Three experiments

This section is composed of three parts. We first explore the selection power of the spike-and-slab prior as specified, by adding random variables as additional predictors in the five datasets, and checking whether the posterior distribution is able to correctly identify their exclusion. Second, we propose a change in the model, substituting a Student-t distribution for the Gaussian prior distribution of the coefficients of the possible predictors. Finally, we develop a simulation study to check the conditions under which the model correctly selects a sparse model. The estimation algorithms were reproduced in R with Rcpp (R and C++), and all the estimation code used in this section is available at github.com/bfava/IllusionOfIllusion.

3.1 Adding random predictors

In order to further explore the thesis proposed above, namely that the Spike-and-Slab prior might itself be inducing density, we propose a further experiment. We re-run the estimation algorithm for all five datasets, but now include two additional regressors that are completely randomly generated from a normal distribution and re-scaled to have zero mean and standard deviation one, like all the other predictors. Figure 5 shows the resulting probabilities of inclusion, with the two last stripes corresponding to the random variables; a sketch of the augmentation step is given below.
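The augmentation step can be sketched as follows (the function name and the design-matrix object are ours, for illustration only):

```r
set.seed(3)
augment_with_noise <- function(X, n_noise = 2) {
  noise <- matrix(rnorm(nrow(X) * n_noise), nrow(X), n_noise)
  cbind(X, scale(noise))  # noise columns standardized like the other predictors
}
# X_aug <- augment_with_noise(X)  # then re-run the spike-and-slab estimation on X_aug
```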
On the Macro 2 dataset, the two random regressors were not even the least included variables: from a total of 62, one of them ranked only as the 6th least included, meaning that some actual predictors were included less often than pure noise. On the Finance 1 dataset, one of the random regressors was the least included among 18, while the other ranked as the 3rd most included.

This experiment lends support, at least for the Finance 1 and Macro 2 datasets, to the idea that the design of the model is itself inducing a high level of selection and shrinkage, not fulfilling the goal of allowing for either shrinkage or sparsity in order to learn the best approach. Still, it is important to notice that in this subsection only one set of simulated variables was generated for each dataset, and that different results could be drawn depending on the generated predictors. However, the fact that two of the five settings presented a strong difference between the results and what would be expected suggests that similar outcomes would be obtained if the experiment were run more times, or with a different number of random variables.

The graphics in figure 6 bring the posterior distribution of β_i | (β_i ≠ 0), that is, the estimated density of the coefficients β for all possible predictors, when included. Above each graphic is the index of the regressor, the probability of inclusion in the model, and the probability that the coefficient is greater than zero.

For the original predictors, 1 to 16, the posterior distributions are extremely similar to the original graphics in figure 2, indicating that the inclusion of the new variables did not interfere with the estimation of the other parameters. The posterior of predictor 18 is not surprising, given the discussion in Section 2.3: despite a probability of inclusion of 48%, its distribution is concentrated on very small values of β and symmetric around zero, indicating that the likelihood of including the variable with a very small coefficient is similar to the likelihood of excluding it. It once again supports the thesis that the Spike-and-Slab incorrectly stimulates selection and shrinkage.

Predictor 17, on the other hand, despite being completely random, was included 71% of the time, 95% of it with a negative coefficient. Following the discussion of figure 2, this indicates that a variable cannot be assumed relevant just because of its degree of inclusion or lack of symmetry around zero. It is interesting, however, to notice that its distribution is still very concentrated on small values of β, with most of its mass very close to zero. This is not the case for predictors such as 9 and 12, whose distributions, when included, place a substantial share of their mass well away from zero; that is, they are far less concentrated around zero than those of the random variables.

This experiment suggests, once again, that the design of the model is unable to clearly distinguish between shrinkage and sparsity, possibly inducing the former, depending on the setting. Especially in the cases where the posterior distribution stayed closer to the prior distribution, that is, where the model learned little from the data, as in the Finance 1 and Macro 2 datasets, the model seems to have induced some shrinkage, which is made explicit by the high probability of inclusion of the randomly generated variables.

3.2 A Student-t prior for the coefficients

One possible explanation for the results obtained in the last subsections relates to the shape of the distribution of the coefficients of the predictors. By using a Gaussian distribution, the Spike-and-Slab prior could be inducing the posterior distribution of β to concentrate around zero, thus generating ambiguity about whether the model should include a predictor or not, as both options end up being very similar when the distribution of β_i | (β_i ≠ 0) is concentrated on very small values of β_i.

In their article, GLP recognize that a misspecification of the distribution of the regression coefficients can lead to poor performance:

"Our approach relaxes all sparsity and density constraints, and instead imposes some structure on the problem by making an assumption on the distribution of the non-zero coefficients. The key advantage of this strategy is that the share of non-zero coefficients is treated as unknown, and can be estimated. Another crucial benefit is that our Bayesian inferential procedure fully characterizes the uncertainty around our estimates, not only of the degree of sparsity, but also of the identity of the relevant predictors. The drawback of this approach, however, is that it might perform poorly if our parametric assumption is not a good approximation of the distribution of the non-zero coefficients. Even if we take this concern into consideration, at the very least our results show that there exist reasonable prior distributions of the non-zero regression coefficients that do not lead to sparse posteriors." [Giannone et al., 2020]

To address this concern, they use simulated datasets to show that the model is capable of learning the degree of sparsity when using the Gaussian distribution for the regression coefficients, even under different settings for the data-generating process.
They also explore the out-of-sample performance of their model compared to sparse models. Although they reach interesting results on the simulated datasets, they do not explore the effect of changing the distribution of the non-zero coefficients on the real datasets.

In order to further explore the question, we propose a change in the model, substituting a Student-t distribution for the normal distribution in the Spike-and-Slab prior. A desirable feature of the Student-t is its fatter tails, that is, its density is higher than the normal's for values distant from zero. Section 3.2.1 describes how the substitution was implemented and the changes to the algorithm. Section 3.2.2 brings the results of the estimation, showing the posterior probability of inclusion of each predictor and the posterior distribution of the coefficients once included.

3.2.1 Implementation

The substitution of the Gaussian for a Student-t distribution is implemented by adding a latent variable λ_i to the model. Specifically, we change the prior distribution of β_i | σ², γ², q, as described in Section 2.2, to
$$\beta_i \mid \sigma^2, \gamma^2, \lambda_i, q \sim \begin{cases} N(0, \sigma^2\gamma^2\lambda_i) & \text{with prob. } q, \\ 0 & \text{with prob. } 1-q, \end{cases} \quad i = 1, \ldots, k,$$
and set an Inverse-Gamma prior distribution for λ_i:
$$\lambda_i \sim IG\left(\frac{\nu}{2}, \frac{\nu}{2}\right).$$
It can then be shown that
$$\beta_i \mid \sigma^2, \gamma^2, q \sim \begin{cases} t_\nu(0, \sigma^2\gamma^2) & \text{with prob. } q, \\ 0 & \text{with prob. } 1-q, \end{cases} \quad i = 1, \ldots, k,$$
where
$$Var[\beta_i] = \frac{\nu}{\nu - 2}\,\sigma^2\gamma^2.$$
Instead of learning the parameter ν, we estimate the model for the pre-defined values of 4, 10, 30, 100 and 500. Given the shape of the Student-t distribution, the prior distribution of β_i | σ², γ², q thus has very fat tails for ν = 4, and a shape very similar to the normal distribution when ν = 500.

The estimation algorithm requires few changes. Taking as a basis the algorithm developed in Appendix A of GLP, and preserving the same notation, v̄_x is redefined as
$$\bar{v}_x \equiv E[\sigma^2_{i,i}]\,\frac{\nu}{\nu - 2},$$
that is, the average sample variance of the predictors inflated by the Student-t variance factor ν/(ν − 2). With this redefinition of v̄_x, the conditional posterior distributions of R², q, φ, z and σ² are all unchanged. The conditional distribution of β is now induced by the transformation β_i ≡ √λ_i β*_i, with
$$\beta_i^* \mid Y, \phi, \sigma^2, R^2, q, z \sim \begin{cases} t_\nu(0, \sigma^2\gamma^2) & \text{with prob. } q, \\ 0 & \text{with prob. } 1-q, \end{cases} \quad i = 1, \ldots, k.$$
Finally, the conditional distribution of λ_i is given by
$$\lambda_i \mid \nu, \beta_i, \sigma^2, R^2 \sim IG\left(\frac{\nu + 1}{2},\; \frac{\nu + \beta_i^2/(\sigma^2\gamma^2)}{2}\right).$$
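As a quick sanity check on this construction, the sketch below (ours; it sets σ²γ² = 1 for simplicity) verifies by simulation that the inverse-gamma mixture of normals reproduces a Student-t with ν degrees of freedom, including its fatter tails:

```r
set.seed(4)
nu <- 4
M <- 1e6

# lambda_i ~ IG(nu/2, nu/2), drawn as the reciprocal of a Gamma(nu/2, rate = nu/2)
lambda <- 1 / rgamma(M, shape = nu / 2, rate = nu / 2)
beta_mix <- rnorm(M, mean = 0, sd = sqrt(lambda))  # N(0, lambda) given lambda
beta_t   <- rt(M, df = nu)                         # direct t_nu draws

rbind(mixture   = quantile(beta_mix, c(.01, .05, .95, .99)),
      student_t = quantile(beta_t,   c(.01, .05, .95, .99)))

# Tail mass beyond 3 is an order of magnitude above the Gaussian's
c(mixture = mean(abs(beta_mix) > 3), gaussian = 2 * pnorm(-3))
```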
3.2.2 Results

Figures 7 and 8 bring the estimated probability of inclusion of each regressor for all five datasets, considering different values of the number of degrees of freedom ν of the prior Student-t distribution. The graphic adapts the one already used in figures 1 and 5, but now, instead of a whole stripe for each coefficient, the stripes are divided into rectangles, with each row representing a different value of ν, while, as usual, each column represents one possible predictor. The color scale of the heatmap is unchanged, with light colors for a small probability of inclusion and darker tones as the probability increases.

We also include "cut-off" indicators, at the levels of 50%, 75% and 90%, to help gauge the magnitude of the probability of inclusion. For example, in the Finance 1 dataset in figure 7, the first row, corresponding to the heavy-tailed Student-t with only 4 degrees of freedom, contains only two coefficients included more than 50% of the time, and none included more than 75%. In the Macro 2 setting in the same figure, in the first row only the first predictor is included with probability higher than 90%, and the probability of inclusion of the seventh predictor is between 75% and 90%, while for all others it is smaller than 75%. In the last row, the normal distribution, seven predictors are included between 75% and 90% of the time.

As expected, the last two rows are virtually equal in all settings, reflecting that a Student-t distribution with ν = 500 is very well approximated by a normal distribution; slight variations between them are due to the limited size of the MCMC sample. Also, it is not surprising that as the number of degrees of freedom decreases, the average probability of inclusion also decreases. This reflects the heavier tails of the distribution for small values of ν, which sharpen the distinction between including a regressor or not, as coefficients near zero become relatively less likely. By itself, this result again endorses the suspicion that the Spike-and-Slab, as originally defined, induces selection and shrinkage.

Still, it is interesting to notice that, in some cases, the Student-t does not seem to have changed the pattern of variable selection, but only reduced the overall probability of inclusion. This seems to be the case for the Finance 1 dataset in figure 7, for which the probabilities of inclusion are very similar for values of ν from 30 to 500, and for the Macro 2 dataset in the same figure, for which the pattern of the most included variables is unchanged across the rows.

The result for the Micro 1 dataset in figure 8 is also not surprising. Since the normal distribution was enough to identify the dominance of a single variable over the others, it was expected that the more restrictive Student-t would neither allow the selection of more variables nor block the selection of the single dominant predictor.

Finally, the Macro 1 and Micro 2 settings show an interesting behavior. In Macro 1, in figure 7, while most variables have a probability of inclusion smaller than 50% even in the normal case, some variables that are included with high frequency in the last row are excluded most of the time in the first rows. Moreover, this happens without changing the probability of inclusion of the other variables, that is, there is a change in the pattern of variable selection.
If, say, an economist were to believe in the selection power of the model with a Student-t distribution with 4 degrees of freedom, they would find that only 7 of the 130 available predictors are relevant, that is, included more than 50% of the time, which could be interpreted as a sparse model.

A similar, although weaker, effect can be seen in the Micro 2 dataset in figure 8. While the normal setting shows no clear pattern of variable selection, the Student-t cases more clearly select some predictors, decreasing the probability of selection of several variables while preserving a high probability for others; that is, the pattern of variable selection changes. For example, while under the normal distribution 106 of 138 predictors are selected more than 50% of the time, the two heaviest-tailed settings select only 30 and 34 predictors, respectively.

Although these results are insufficient to conclude that the Spike-and-Slab with a Student-t distribution can be used to identify whether sparsity or shrinkage should be chosen for a dataset, they are strong evidence that the normal distribution is insufficient to draw such a conclusion, as this prior induces high levels of variable selection with shrinkage.

In addition to the variable-selection pattern, figure 9 updates figure 2, for the Finance 1 dataset, with the Student-t as prior distribution. It compares the posterior distributions of the coefficients β for each predictor once included. The title of each graphic brings the index of the variable (the same used in figures 7 and 8), followed by the probability of inclusion "Inc." and by "G0", the percentage of the distribution concentrated on positive values, reported for ν = 4 and for ν = 500 (the approximation to the normal distribution), respectively.

The figure reveals that the probability of inclusion of all variables decreases significantly, and that the distributions with ν = 4 become more asymmetric and skewed for all predictors. Concerning the selection problem discussed in Section 2.3, the Student-t by itself, even in the extreme case of only four degrees of freedom, still does not seem enough to resolve the model's ambiguity in choosing whether a predictor should be included: even when a variable is included, an estimated coefficient very close to zero has almost the same impact on the model as exclusion, resulting in similar likelihoods for inclusion with a very small coefficient and for exclusion. This effect appears to overestimate the probability of inclusion of the regressors, making it difficult to identify the presence of sparsity; that is, the chosen prior distributions seem to induce density and shrinkage, underestimating the possibility of sparsity.

3.3 A simulation study

Based on the results from Section 3.1, which showed a poor performance of the model in excluding completely randomly generated predictors in some of the datasets, and on the new model proposed in Section 3.2, this section develops a simulation study. We simulate a dataset with the same dimensions as the Finance 1 setting, with 68 observations and 16 covariates. We predefine the values of the coefficients β for the first three predictors, and set the other 13 to be exactly zero. Therefore, the model performs accurately if it correctly includes only the first three regressors.

The data generating process is as follows.
We first draw a random error vector ε* from a normal distribution and set the values of β₁, β₂ and β₃. We then calculate the response variable as the sum of the first three covariates multiplied by their respective coefficients plus an error term, and consider six scenarios, varying the standard deviation of the error term, σ_ε. That is, with X denoting the dataset, where X_{i,j} is the value of predictor j for individual i, draw
$$X_{i,j} \sim N(0, 1), \qquad \varepsilon^*_i \sim N(0, 1),$$
set β₁, β₂ and β₃ to fixed nonzero values (with β₁ negative), and calculate
$$y_i^{(s)} = \beta_1 X_{i,1} + \beta_2 X_{i,2} + \beta_3 X_{i,3} + \sigma_\varepsilon^{(s)} \varepsilon^*_i,$$
for s ∈ {1, ..., 6} and i ∈ {1, ..., 68}. We define σ_ε^(s) = s, that is, σ_ε^(1) = 1, σ_ε^(2) = 2, and so on.
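A minimal sketch of this data-generating process in R follows. The exact nonzero coefficient values are predefined in the original experiment but are not recoverable here, so the ones below (with β₁ negative, as in the text) are purely illustrative:

```r
set.seed(5)
n <- 68; k <- 16
beta_true <- c(-2, 1, 0.5, rep(0, k - 3))  # illustrative values; beta_1 negative as in the text

X   <- matrix(rnorm(n * k), n, k)  # X_{i,j} ~ N(0,1)
eps <- rnorm(n)                    # epsilon*_i ~ N(0,1), drawn once for all scenarios

datasets <- lapply(1:6, function(s) {
  y <- drop(X %*% beta_true) + s * eps                    # sigma_eps^(s) = s
  list(y = drop(scale(y)), X = scale(X), sigma_eps = s)   # standardized, as in GLP
})
```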
This number represents the level of uncertainty in the dataset: if σ_ε is very small, any model that allows for variable selection should perform reasonably in selecting only the first three predictors; if σ_ε is large, different models should perform differently in selecting the appropriate variables, some better than others. After y and X are defined, they are scaled to have exactly zero mean and standard deviation one, repeating the approach of GLP and of the previous sections.

The graphics in figure 10 follow the same scheme as those in figures 7 and 8, described in Section 3.2.2. As expected, when the variance of the error term is small, as in the first scenario, the model easily selects the three truly relevant predictors with probability close to 100%, and all others with virtually zero. As this variance increases, the model initially remains accurate in selecting only the relevant variables most of the time, although the other predictors start to be included with higher probabilities, which nevertheless never reach 50%.

The model starts failing at intermediate noise levels, when all three models fail to select the third predictor more than 75% of the time. Predictors 4 and 16 are also incorrectly selected more than 50% of the time in the normal and milder Student-t settings, which does not happen in the heaviest-tailed case; on the other hand, this more restrictive model fails more often than the others in selecting the third variable, which it includes less than 50% of the time. A similar effect intensifies as the standard deviation of the error term increases further: in the noisiest scenarios the three settings fail in the selection; for the normal distribution and the milder Student-t, predictor 16 becomes almost as important as predictors 1 and 2, while predictor 3 is rarely selected, and in the heaviest-tailed Student-t setting all variables are selected at a very low rate, with predictors 1, 2 and 16 the most selected.

These results might indicate that the model has a limited capacity to distinguish patterns of sparsity, or at least that the datasets considered might have an elevated level of uncertainty (the portion of the response variable explained by unobservables), such that little can be learned from the use of econometric models. Even though extracting statistical learning from difficult settings is a major task for statisticians and economists, it is worth mentioning that an extreme scenario of this kind may apply to at least some of the datasets considered, as they are very "small" in the number of observations and "big" in the number of possible predictors.

The graphics in figure 11 bring the posterior densities of the coefficients β for the 16 simulated predictors, in the same fashion as figure 9. The distributions for predictors 1 to 3, in the two lowest-noise scenarios, are clearly offset from zero, correctly identifying the true predefined parameters. It is interesting to notice, though, that as uncertainty grows, so does shrinkage, and all the distributions converge towards zero. This is especially the case for regressor 3, whose distribution concentrates around zero, which also leads to a great drop in its probability of inclusion in the model.
This result once again corroborates the thesis that the model creates an ambiguity between inclusion with shrinkage and exclusion, as the likelihoods of both become very similar, and the model fails to learn the correct approach.

As for the other regressors, we notice that the concentration of the posterior on one side of zero is uninformative. For example, predictor 12, when included, has more than 80% of its distribution concentrated on positive values of β in all three settings evaluated. Still, no inference can be made from this feature of the shape of the distribution, as the coefficient is known to be exactly zero. This also supports the hypothesis presented earlier: a distribution very concentrated on small values of β, close to zero, is a more telling indicator of the unimportance of a predictor for the model than how offset from zero the distribution is. The fact that the distributions of virtually all the unimportant variables are very concentrated around zero implies, again, that the probability of inclusion is probably overestimated, due to the similar likelihoods of excluding a predictor and including it with a very small coefficient (a high degree of shrinkage).

4 Conclusion
This paper reassessed the model proposed in GLP, through a more detailed look into the posterior distributions and three proposed experiments. First, after adding random variables to the datasets and re-estimating the model, we found that in some of the settings the model was unable to distinguish a completely random regressor from the available economic series, even privileging a random predictor over actual ones in one of the settings.

Second, a modification of the model was proposed, substituting a Student-t distribution for the Gaussian prior distribution of the coefficients. It was shown that, depending on the number of degrees of freedom, a sparse model naturally emerges among the predictors for one of the datasets, unlike the result obtained with the normal distribution. For the other datasets the effect was not homogeneous, but the Student-t model showed an overall improvement in identifying which variables should be excluded. Finally, the simulation study indicates that the Spike-and-Slab is biased towards selecting more predictors and shrinking their coefficients.

All the experiments support the idea that the model is itself inducing variable selection and shrinkage. The mechanism through which this happens could be that the likelihood of excluding an irrelevant predictor and that of including it with a very small coefficient are very similar. The evidence thus suggests that the Spike-and-Slab prior distribution is itself inducing density, and so does not fulfill GLP's objective of evaluating whether density or sparsity is more adequate for a given dataset.

It is important to notice that this paper does not contradict the conclusion reached by GLP; rather, it brings evidence that the proposed model is not robust enough to support the conclusion that economic datasets are not informative enough to identify a conclusive pattern of sparsity among many possible predictors. Indeed, a unique set of a few relevant predictors was identified for the Micro 1 dataset under all approaches considered, and also for the Finance 1 dataset when a heavy-tailed Student-t is used as the prior distribution of the coefficients. In other cases, such as the Micro 2 dataset, the Student-t also performed much better in excluding irrelevant predictors. Still, our findings agree with GLP's conclusion that, without statistical evidence, sparsity should not simply be assumed in an economics dataset.

Even more than that, the evidence brought by this paper shows that the choice of prior distribution can drastically improve the performance of the model in detecting sparsity, and thus indicates that further methods can help answer the question of whether sparsity can be used to model a given economics dataset.

Finally, we conclude that the use of the Spike-and-Slab prior, as proposed, is misleading if the goal is to find evidence of sparsity in an economics dataset. The model, by inducing shrinkage and selection, incorrectly provokes an illusion that sparsity is nonexistent: the illusion of the illusion of sparsity.
References
A. Bhattacharya, D. Pati, N. S. Pillai, and D. B. Dunson. Dirichlet-Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110:1479–1490, 2015.

C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010.

E. I. George and R. E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.

D. Giannone, M. Lenza, and G. Primiceri. Economic predictions with big data: The illusion of sparsity. SSRN Electronic Journal, 07 2020. doi: 10.2139/ssrn.3166281.

J. Griffin and P. Brown. Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1):171–188, 2010.

P. R. Hahn, J. He, and H. F. Lopes. Efficient sampling for Gaussian linear regression with arbitrary priors. Journal of Computational and Graphical Statistics, 28:142–154, 2019.

A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

H. Ishwaran and J. S. Rao. Spike and slab variable selection: frequentist and Bayesian strategies. Annals of Statistics, 33(2):730–773, 2005.

T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.

T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.

V. Ročková and E. George. The spike-and-slab lasso. Journal of the American Statistical Association, 113(521):431–444, 2018.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Figure 1: Probability of inclusion of each predictor.
Figure 2: Posterior distribution of β for the Finance 1 dataset. "Inc." is the probability of inclusion and "G0" the probability of the coefficient being greater than zero.
Figure 3: Posterior distribution of β for the Macro 2 dataset (1/2). "Inc." is the probability of inclusion and "G0" the probability of the coefficient being greater than zero.
Figure 4: Posterior distribution of β for the Macro 2 dataset (2/2). "Inc." is the probability of inclusion and "G0" the probability of the coefficient being greater than zero.
Figure 5: Probability of inclusion of each predictor; the two last stripes are the random variables.
Figure 6: Posterior distribution of β for the Finance 1 dataset with additional random variables. "Inc." is the probability of inclusion and "G0" the probability of the coefficient being greater than zero. Predictors 17 and 18 are randomly generated variables.
Figure 7: Probability of inclusion of each predictor for the Macro 1, Macro 2 and Finance 1 datasets. Each column is a predictor and each row one model, varying the number of degrees of freedom ν.
Figure 8: Probability of inclusion of each predictor for the Micro 1 and 2 datasets. Each column is a predictor and each row one model, varying the number of degrees of freedom ν.
Figure 9: Posterior distribution of β for the Finance 1 dataset under the Student-t prior distribution. "Inc." is the probability of inclusion and "G0" the probability of the coefficient being greater than zero. The first value is for the case ν = 4, and the second for ν = 500.
Figure 10: Probability of inclusion of each predictor for the simulated datasets. Each column is a predictor and each row one model, varying the number of degrees of freedom ν. Each block represents one dataset, varying the standard deviation of the error term, σ_ε.
Figure 11: Posterior distribution of β for three simulated datasets with Gaussian regression coefficients, varying σ_ε.