BayesVarSel: Bayesian Testing, Variable Selection and model averaging in Linear Models using R
BBayesVarSel: Bayesian Testing, VariableSelection and model averaging in LinearModels using R
Gonzalo Garcia-Donato and Anabel Forte Universidad de Castilla La Mancha; Universidad de Valencia
May 2, 2018
Abstract
This paper introduces the R package BayesVarSel which imple-ments objective Bayesian methodology for hypothesis testing and vari-able selection in linear models. The package computes posterior prob-abilities of the competing hypotheses/models and provides a suite oftools, specifically proposed in the literature, to properly summarizethe results. Additionally,
BayesVarSel is armed with functions tocompute several types of model averaging estimations and predictionswith weights given by the posterior probabilities.
BayesVarSel con-tains exact algorithms to perform fast computations in problems ofsmall to moderate size and heuristic sampling methods to solve largeproblems. The software is intended to appeal to a broad spectrum ofusers, so the interface has been carefully designed to be highly intuiti-tive and is inspired by the well-known lm function. The issue of priorinputs is carefully addressed. In the default usage (fully automatic forthe user) BayesVarSel implements the criteria-based priors proposedby Bayarri et al. (2012), but the advanced user has the possibility ofusing several other popular priors in the literature. The package isavailable through the Comprehensive R Archive Network, CRAN. Weillustrate the use of
BayesVarSel with several data examples. a r X i v : . [ s t a t . O T ] N ov An illustrated overview of
BayesVarSel
Testing and variable selection problems are taught in almost any introductorystatistical course. In this first section we assume such background to presentthe essence of the Bayesian approach and the basic usage of
BayesVarSel with hardly any mathematical formulas. Our motivating idea in this firstsection is mainly to present the appeal of the Bayesian answers to a verybroad spectrum of applied researchers.The remaining sections are organized as follows. In Section 2 the problemis presented and the notation needed is introduced jointly with the basicsof the Bayesian methodology. Section 3 and Section 4 explain the detailsconcerning the obtention of posterior probabilities in hypothesis testing andvariable selection problems, respectively, in
BayesVarSel . In Section 5 sev-eral tools to describe the posterior distribution are explained and Section 6is devoted to model averaging techniques. Finally, Section 7 concludes withdirections that we plan to follow for the future of the
BayesVarSel project.In the Appendix, formulas for the most delicate ingredient in the underly-ing problem in
BayesVarSel , namely the prior distributions for parameterswithin each model, are collected.
In testing problems, several competing hypotheses, H i , about a phenomenonof interest are postulated. The role of statistics is to provide summariesabout the evidence in favor (or against) the hypotheses once the data, y ,have been observed. There are many important statistical problems withroots in testing like model selection (or model choice) and model averaging.The formal Bayesian response to testing problems is based on the poste-rior probabilities of the hypotheses that summarize, in a fully understandableway, all the available information. In BayesVarSel an objective point of view(in the sense explained in Berger, 2006) is adopted and the reported poste-rior probabilities only depend on y and the statistical model assumed. Thesehave hence the great appeal of being fully automatic for users.For illustrative purposes consider the nutrition problem in Lee (1997),p143. There it is tested, based on the following sample of 19 weight gains(expressed in grams) of rats, R> weight.gains<- c(134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123,+ 70, 118, 101, 85, 107, 132, 94)
R> diet<- as.factor(c(rep(1,12),rep(0,7)))R> rats<- data.frame(weight.gains=weight.gains, diet=diet)
This problem (usually known as the two-samples t -test) is normally writtenas H : µ = µ versus H : µ (cid:54) = µ , where it is assumed that the weightgains are normal with an unknown (but common) standard deviation. Theformulas that define each of the models under the postulated hypotheses arein R language R> M0<- as.formula("weight.gains~1")R> M1<- as.formula("weight.gains~diet")
The function to perform Bayesian tests in
BayesVarSel is Btest which hasan intuitive and simple syntax (see Section 3 for a detailed description). Inthis example
R> Btest(models=c(H0=M0, H1=M1), data=rats)Bayes factors (expressed in relation to H0)H0.to.H0 H1.to.H01.0000000 0.8040127---------Posterior probabilities:H0 H10.554 0.446
We can conclude that both hypotheses are similarly supported by the data(posterior probabilities close to 0.5) so there is no evidence that the diet hasany impact on the average weight.Another illustrative example concerns the classic dataset savings (Bels-ley, Kuh, and Welsch, 1980) considered by Faraway (2002), page 29 and dis-tributed under the package faraway . This dataset contains macroeconomicdata on 50 different countries during 1960-1970 and the question posed isto elucidate if dpi (per-capita disposable income in U.S), ddpi (percent rateof change in per capita disposable income), population under (over) 15 (75) pop15 ( pop75 ) are all explanatory variables for sr , the aggregate personalsaving divided by disposable income which is assumed to follow a normaldistribution. This can be written as a testing problem about the regressioncoefficients associated with the variables with hypotheses H : β dpi = β ddpi = β pop = β pop = 03ersus the alternative, say H , that all predictors are needed. The competingmodels can be defined as R> fullmodel<- as.formula("sr~pop15+pop75+dpi+ddpi")R> nullmodel<- as.formula("sr~1") and the testing problem can be solved with
R> Btest(models=c(H0=nullmodel, H1=fullmodel), data=savings)---------Bayes factors (expressed in relation to H0)H0.to.H0 H1.to.H01.0000 20.9413---------Posterior probabilities:H0 H10.046 0.954
The conclusion is that there is substantial evidence favoring H , the hypoth-esis that all considered predictors explain the response sr .Of course, more hypotheses can be tested at the same time. For instance,a simplified version of H that does not include pop15 is H : β dpi = β ddpi = β pop = 0that can be included in the analysis as R> reducedmodel<- as.formula("sr~pop75+dpi+ddpi")R> Btest(models=c(H0=nullmodel, H1=fullmodel, H2=reducedmodel), data=savings)Bayes factors (expressed in relation to H0)H0.to.H0 H1.to.H0 H2.to.H01.0000000 20.9412996 0.6954594---------Posterior probabilities:H0 H1 H20.044 0.925 0.031
The result clearly evidences H as the best explanation for the experimentamong those considered.This scenario can be extended to check which subset of the four covariatesis the most suitable one to explain sr . In general, the problem of selectingthe best subset of covariates from a group of potential ones is better knownas variable selection. 4 .2 Variable selection Variable selection is a multiple testing problem where each hypothesis pro-poses a possible subset of p potential explanatory variables initially consid-ered. Notice that there are 2 p hypotheses, including the simplest one statingthat none of the variables should be used.A variable selection approach to the economic example above with p = 4has 16 hypotheses and can be solved using the Btest function. Neverthe-less,
BayesVarSel has specific facilities to handle the specificities of variableselection problems. A main function for variable selection is
Bvs , fully de-scribed in Section 4. It has a simple syntax inspired by the well-known lm function. The variable selection problem in this economic example can besolved executing R> Bvs(formula="sr~pop15+pop75+dpi+ddpi", data=savings)The 10 most probable models and their probabilities are:pop15 pop75 dpi ddpi prob1 * * * * 0.2973156422 * * * 0.2434334933 * * 0.1338323674 * 0.0909603275 * * * 0.0779134296 * * 0.0576747557 * * 0.0325167808 * * * 0.0313376399 0.01385436910 * * 0.006219812
With a first look at these results, we can see that the most probable modelis the model with all covariates (probability 0.30), which is closely followedby the one without dpi with a posterior probability of 0.24.As we will see later, a variable selection exercise generates a lot of valuableinformation of which the above printed information is only a very reducedsummary. This can be accessed with specific methods that explore the char-acteristics of objects of the type created by
Bvs . The problems considered in
BayesVarSel concern Gaussian linear models.Consider a response variable y , size n , assumed to follow the linear model5the subindex F refers to full model) M F : y = X α + Xβ + ε , ε ∼ N n ( , σ I n ) , (1)where the matrices X : n × p , X : n × p and the regression vector coefficientsare of conformable dimensions. Suppose you want to test H : β = versus H F : β (cid:54) = , that is, to decide whether the regressors in X actually explainthe response. This problem is equivalent to the model choice (or modelselection) problem with competing models M F and M : y = X α + ε , ε ∼ N n ( , σ I n ) , (2)and we will refer to models or hypotheses indistinctly.Posterior probabilities are based on the Bayes factors (see (Kass andRaftery, 1995)), a measure of evidence provided by BayesVarSel when solv-ing testing problems. The Bayes factor of H F to H is B F = m F ( y ) m ( y )where m F is the integrated likelihood or prior predictive marginal: m ( y ) = (cid:90) M ( y | α , σ ) π ( α , σ ) d α dσ, and m F ( y ) = (cid:90) M F ( y | α , β , σ ) π F ( α , β , σ ) d β d α dσ. Above, π and π F are the prior distributions for the parameters within eachmodel. From an objective point of view (the one adopted here), these distri-butions should be fully automatic. The assignment of such priors (which wecall model selection priors) is quite a delicate issue (see Berger and Pericchi,2001) and has inspired many important contributions in the literature. Thepackage BayesVarSel allows using many of the most important proposals,which are fully detailed in the Appendix. The prior implemented by defaultis the robust prior by Bayarri et al. (2012) as it can be considered optimalin many senses and is based on a foundational basis.Posterior probabilities can be obtained as
P r ( H F | y ) = B F P r ( H F )( P r ( H ) + B F P r ( H F )) , P r ( H | y ) = 1 − P r ( H F | y ) , P r ( H F ) is the probability, a priori, that hypothesis H F is true.Similar formulas can be obtained when more than two hypotheses, say H , . . . , H N , are tested. In this case P r ( H i | y ) = B i ( y ) P r ( H i ) (cid:80) Nj =1 B j ( y ) P r ( H j ) , i = 1 , . . . , N (3)which is the posterior distribution over the model space (which is the setthat contains all competing models). For simplicitly, the formula in (3) hasbeen expressed, without any loss of generality, using Bayes factors to the nullmodel but the same results would be obtained by fixing any other model. Thedefault definition for P r ( H i ) in testing problems is to use a constant prior,which assigns the same probability to all models, that is, P r ( H i ) = 1 /N .For instance, within the model M : y = α n + β x + β x + ε we cannot test the hypotheses H : β = 0 , β (cid:54) = 0, H : β (cid:54) = 0 , β = 0, H : β (cid:54) = 0 , β (cid:54) = 0 since neither M (the model defined by H ) nor M arenested in the rest. Nevertheless, it is perfectly possible to test the problemwith the four hypotheses H , H , H (as just defined) plus H : β = 0 , β = 0,but of course H must be, a priori, a plausible hypothesis. In this last case H would take the role of null model.Hypotheses do not have to be necessarily of the type β = and, iftestable (see Ravishanker and Dey (2002) for a proper definition) any linearcombination of the type C t β = can be considered a hypothesis. Forinstance one can be interested in testing β + β = 0. In Bayarri and Garc´ıa-Donato (2007) it was formally shown that these hypotheses can be, throughreparameterizations, reduced to hypotheses like β = . In Section 3 we showexamples of how to solve these testing problems in BayesVarSel .Variable selection is a multiple testing problem but is traditionally pre-sented with convenient specific notation that uses a p dimensional binaryvector γ = ( γ , . . . , γ p ) to identify the models. Consider the full model in(1), and suppose that X contains fix covariates that are believed to be surein the true model (by default X = n that would make the intercept presentin all the models). Then each γ ∈ { , } p defines a hypothesis H γ statingwhich β ’s (those with γ i = 0) corresponding to each of the columns in X arezero. Then, the model associated with H γ is M γ : y = X α + X γ β γ + ε , ε ∼ N n ( , σ I n ) , (4)7here X γ is the matrix with the columns in X corresponding to the ones in γ . Notice that X γ is a n × p γ matrix where p γ is the number of 1’s in γ .Clearly, in this variable selection problem there are 2 p hypotheses or mod-els and the null model is 2 corresponding to γ = .A particularity of variable selection is that it is affected by multiplicityissues. This is because, and specially for moderate to large p , the possibilityof a model showing spurious evidence is high (just because many hypothesesare considered simultaneously). As concluded in Scott and Berger (2006)multiplicity must be controlled with the prior probabilities P r ( H γ ) and theconstant prior does not control for multiplicity. Instead, these authors pro-pose using P r ( H γ ) = (cid:0) ( p + 1) (cid:18) pp γ (cid:19)(cid:1) − . (5)The assignment above states that models of the same dimension (the di-mension of M γ is p γ + p ) should have the same probability which must beinversely proportional to the number of models of that dimension. In thesequel we refer to this prior as the ScottBerger prior.Both the ScottBerger prior and the Constant prior for P r ( H γ ) are par-ticular cases of the very flexible prior P r ( M γ | θ ) = θ p γ (1 − θ ) p − p γ , (6)where the hyperparameter θ ∈ (0 ,
1) has the interpretation of the commonprobability that a given variable is included (independently of all others).The Constant prior corresponds to θ = 1 / θ ∼ Unif(0 , θ of the type θ ∼ Beta ( θ | , b ) . (7)They argue that, on many occasions the user has, a priori, some informationregarding the number of covariates (among the p initially considered) that areexpected to explain the response, say w (cid:63) . As they explain, this informationcan be translated into the analysis assigning in (7) b = ( p − w (cid:63) ) /w (cid:63) . Theresulting prior specification has the property that the expected number ofcovariates is precisely w (cid:63) .Straightforward algebra shows that assuming (7) into (6) is equivalent to(integrating out θ ) P r ( M γ | b ) ∝ Γ( p γ + 1)Γ( p − p γ + b ) . (8)8 Hypothesis testing with
BayesVarSel
Tests are solved in
BayesVarSel with
Btest which, in its default usage, onlydepends on two arguments: models a named list of formula -type objectsdefining the models compared and data the data.frame with the data.The prior probabilities assigned to hypotheses is constant, that is,
P r ( H i ) =1 /N . This default behavior can be modified specifying prior.models ="User" jointly with the argument priorprobs that must contain a namedlist (with names as specified in main argument models ) with the prior prob-abilities to be used for each hypotheses.In the last example of Section 1.1, we can state that the simpler model istwice as likely as the other two as: R> Btest(models=c(H0=nullmodel, H1=fullmodel, H2=reducedmodel), data=savings,prior.models="User", priorprobs=c(H0=1/2, H1=1/4, H2=1/4))---------Bayes factors (expressed in relation to H0)H0.to.H0 H1.to.H0 H2.to.H01.0000000 21.4600656 0.7017864---------Posterior probabilities:H0 H1 H20.083 0.888 0.029
Notice that the Bayes factor remains the same, and the change is in posteriorprobabilities.
Btest tries to identify the simplest model (nested in all the others) usingthe names of the variables. If such model does not exist, the execution ofthe function stops with an error message. Nevertheless, there are importantsituations where the simplest hypothesis is defined through linear restrictions(sometimes known as ‘testing a subspace‘) making it very difficult to deter-mine its existence just using the names. We illustrate this situation with anexample.Consider for instance the extension of the savings example in Faraway(2002), page 32 where H eqp : β pop = β pop is tested against the full alter-native. This null hypothesis states that the effect on personal savings, sr ,of both segments of populations is similar. The model under H eqp can bespecified as R> equalpopmodel<- as.formula("sr~I(pop15+pop75)+dpi+ddpi")
R> Btest(models=c(Heqp=equalpopmodel, H1=fullmodel), data=savings) produces an error, although it is clear that H eqp is nested in H . To over-come this error, the user must ask the Btest to relax the names-based checkdefining as
TRUE the argument relax.nest . In our example
R> Btest(models=c(Heqp=equalpopmodel, H1=fullmodel), data=savings,+ relax.nest=TRUE)Bayes factors (expressed in relation to Heqp)Heqp.to.Heqp H1.to.Heqp1.0000000 0.3336251---------Posterior probabilities:Heqp H10.75 0.25
Now
Btest identifies the simpler model as the one with the largest sum ofsquared errors and trusts the user on the existence of a simpler model (yetthe code produces an error if it detects that the model with the largest sumof squared errors is not of a smaller dimension than all the others in whichcase it is clear that a null model does not exist).
BayesVarSel
The number of entertained hypotheses in a variable selection problem, 2 p ,can range from a few to an extremely large number. This makes necessary toprogram specific tools to solve the multiple testing problem in variable selec-tion problems. BayesVarSel provides three different functions for variableselection • Bvs performs exhaustive enumeration of hypotheses and hence the sizeof problems must be small or moderate (say p ≤ • PBvs is a parallelized version of
Bvs making it possible to solve mod-erate problems (roughly the same size as above) in less time with thehelp of several cpu’s. • GibbsBvs simulations from the the posterior distribution over the modelspace using a Gibbs sampling scheme (intended to be used for largeproblems, with p >
PBvs or the number of simulations in
GibbsBvs )the usage of the three functions is very similar. We describe the common usein the first of the following sub-sections and the function-specific argumentsin the second.These three functions return objects of class
Bvs which are a list withrelevant information about the posterior distribution. For these objects
BayesVarSel provides a number of functions, based on the tradition of modelselection methods, to summarize the corresponding posterior distribution (eg.what is the hypothesis most probable a posteriori) and for its posterior usage(eg. to obtain predictions or model averaged estimates). These capabilitiesare described in Section 5 and Section 6.For illustrative purposes we use the following datasets:
USCrime data.
The US Crime data set was first studied by Ehrlich (1973)and is available from R-package
MASS (Venables and Ripley, 2002). This dataset has a total of n = 47 observations (corresponding to states in the US)of p = 15 potential covariates aimed at explaining the rate of crimes in aparticular category per head of population (labelled y in the data). SDM data.
This dataset has a total of p = 67 potential drivers for the annualGDP growth per capita between 1960 and 1996 for n = 88 countries (responsevariable labelled y in the data). This data set was initially considered bySala-I-Martin et al. (2004) and revisited by Ley and Steel (2007). The customary arguments in
Bvs , PBvs and
GibbsBvs are data (a data.frame with the data) and formula , with a definition of the most complex model con-sidered (the full model in (1)). The default execution setting corresponds toa problem where the null model (2) contains just the intercept (ie X = n )and prior probabilities for models are defined as in (5).A different simpler model can be specified with the optional argument fixed.cov , a character vector with the names of the covariates includedin the null model. Notice that, by definition, the variables in the nullmodel are part of any of the entertained models including of course thefull model. A case sensitive convention here is to use the word “Inter-11ept” to stand for the name of the intercept so the default corresponds to fixed.cov=c("Intercept") . A null model that just contains the error (thatis, X is the null matrix) is specified as fixed.cov=NULL .Suppose for example that in the UScrime dataset and apart from theconstant, theory suggests that the covariate Ed must be used to explain thedependent variable. To consider these conditions we execute the command R> crime.Edfix<- Bvs(formula="y~.", data=UScrime, fixed.cov=c("Intercept", "Ed"))Info. . . .Most complex model has 16 covariatesFrom those 2 are fixed and we should select from the remaining 14M, So, Po1, Po2, LF, M.F, Pop, NW, U1, U2, GDP, Ineq, Prob, TimeThe problem has a total of 16384 competing modelsOf these, the 10 most probable (a posteriori) are keptWorking on the problem...please wait.
During the execution (which takes about 0.22 seconds in a standard laptop)the function informs which variables take part of the selection process. Thenumber of these defines p which in this problem is p = 14 (and the modelspace has 2 models). In what follows, and unless otherwise stated we donot reproduce this informative output to save space.The assignment of priors probabilities, P r ( H i ), is regulated with the ar-gument prior.models which by default takes the value “ScottBerger” thatcorresponds to the proposal in (5). Other options for this argument are“Constant”, which stands for P r ( H i ) = 1 / p , and the more flexible value,“User”, under which the user must specify the prior probabilities with theextra argument priorprobs .The argument priorprobs is a p + 1 numeric vector, which in its i -thposition defines the probability of a model of dimension p + i − θ = 1 /
4, this can be done as (notice thathere p = 15) R> theta<- 1/4; pgamma<- 0:15R> crime.thQ<- Bvs(formula="y~.", data=UScrime, prior.models="User",+ priorprobs=theta^pgamma*(1-theta)^(15-pgamma))
In variable selection problems it is quite standard to have the situationwhere the number of covariates is large (say larger than 30) preventing theexhaustive enumeration of all the competing models. The SDM dataset is12n example of this situation with p = 67. In these contexts, the posteriordistribution can be explored using the function GibbsBvs . To illustrate theelicitation of prior probabilities as well, suppose that the number of expectedcovariates to be present in the true model is w (cid:63) = 10. This situation isconsidered in Ley and Steel (2009) and can be implemented as (see eq.8) R> set.seed(1234)R> wstar<- 7; b<- (67-wstar)/wstar; pgamma<- 0:67R> growth.wstar7<- GibbsBvs(formula="y~.", data=SDM, prior.models="User",+ priorprobs=gamma(pgamma+1)*gamma(67-pgamma+b))
The above code took 18 seconds to run.One last common argument to
Bvs , PBvs and
GibbsBvs is time.test . Ifit is set to TRUE and the problem is of moderate size ( p ≥
18 in
Bvs , PBvs and p ≥
21 in
GibbsBvs ), an estimation of computational time is calculatedand the user is asked about the possibility of not executing the command. In Bvs
The algorithm implemented in
Bvs is exact in the sense that theinformation collected about the posterior distribution takes into account all competing models as these are all computed. Nevertheless, to save compu-tational time and memory it is quite appropriate to keep only a moderatenumber of the best (most probable a posteriori) models. This number canbe specified with the argument n.keep which must be an integer numberbetween 1 (only the most probable model is kept) and 2 p (a full ordering ofmodels is kept). The default value for n.keep is 10.The argument n.keep is not of great importance to analyze the posteriordistribution over the model space. Nevertheless, it has a more relevant effectif model averaging estimates or predictions are to be obtained (see Section 6)since, as BayesVarSel is designed, only the n.keep retained models are usedfor these tasks. In PBvs
This function conveniently distributes several
Bvs among the num-ber of available cores specified in the argument n.nodes . Another argumentin
PBvs is n.keep explained above. In GibbsBvs
The algorithm in
GibbsBvs samples models from the posteriorover the model space and this is done using a simple (yet very efficient) Gibbs13ampling scheme introduced in George and McCulloch (1997), later studiedin Garcia-Donato and Martinez-Beneito (2013) in the context of large modelspaces. The type of default arguments that can be specified in
GibbsBvs arethe typical in any Monte Carlo Markov Chain scheme (as usual the defaultvalues are given in the assignment) • init.model="Full" The model at which the simulation process starts.Options include ”Null” (the model only with the covariates specifiedin fixed.cov ), ”Full” (the model defined by formula ), ”Random” (arandomly selected model) and a vector with p zeros and ones defininga model. • n.burnin=50 Length of burn in, i.e. the number of iterations to discardat the start of the simulation process. • n.iter=10000 The total number of iterations performed after the burnin process. • n.thin=1 Thinning rate that must be a positive integer. Set ’n.thin’ ¿1 to save memory and computation time if ’n.iter’ is large. • seed=runif(1, 0, 16091956) A seed to initialize the random numbergenerator. (Doesn’t it seem like the upper bound is a date?)Notice that the number of total iterations is n.burnin+n.iter but thenumber of models that are used to collect information from the posterior is,approximately, n.iter/n.thin . In this section we describe the tools implemented in
BayesVarSel conceivedto summarize, in the tradition of the relevant model selection literature,the posterior distribution over the model space. In R this corresponds todescribing methods to explore the content of objects of class Bvs .Printing a
Bvs object created with
Bvs or PBvs shows the best 10 modelswith their associated probability (see examples in Section 1.2). If the objectwas built with
GibbsBvs then what is printed is the most probable modelamong the sampled ones. For instance, if we print the object growth.wstar7 in Section 4.1 we obtain the following message:14 > growth.wstar7Among the visited models, the model with the largest probability contains:[1] "DENS65C" "EAST" "GDPCH60L" "IPRICE1" "P60" "TROPICAR"
The model that contains the variables above plus the intercept is a pointestimate of the Highest Posterior Probability model (HPM).The rest of the summaries are very similar independently of the routineused to create it, but recall that if the object was obtained with either
Bvs and
PBvs (likely because p is small or moderate) the given measures hereexplained are exact . If instead GibbsBvs was used, the reported measuresare approximations of the exact ones (that likely cannot be computed dueto the huge size of the model space). In
BayesVarSel these approximationsare based on the frequency of visits as an estimator of the real
P r ( M γ | y ) since, as studied in Garcia-Donato and Martinez-Beneito (2013), theseprovide quite accurate results.The HPM is returned when an object of class Bvs is summarized (via summary ) jointly with the inclusion probabilities for each competing variable,
P r ( x i | y ). These are the sum of the posterior probabilities of models con-taining that covariate and provide evidence about the individual importanceof each explanatory variable. The model defined by those variables with aninclusion probability greater than 0.5 is called a Median Probability Model(MPM), which is also included in the summary. Barbieri and Berger (2004)show that, under general conditions, if a single model has to be utilized withpredictive purposes, the MPM is optimal.For instance, if we summarize the object crime.Edfix of Section 4.1, weobtain R> summary(crime.Edfix)Inclusion Probabilities:Incl.prob. HPM MPMM 0.6806 *So 0.2386Po1 0.8489 * *Po2 0.3663LF 0.2209M.F 0.3184Pop 0.2652 Notice that variable Ed is not on the list as it was assumed to be fixed. W 0.2268U1 0.2935U2 0.4765GDP 0.3204Ineq 0.9924 * *Prob 0.6174 *Time 0.2434
We clearly see that, marginally,
Ineq is very relevant followed by
Po1 . Lessinfluential but of certain importance are M and Prob . Graphical summaries and jointness
The main graphical support in
BayesVarSel is contained in the function plotBvs which depends on x (anobject of class Bvs ) and the argument option which specified the type ofplot to be produced: • option="joint" A matrix plot with the joint inclusion probabilities,
P r ( x i , x j | y ) (marginal inclusion probabilities in the diagonal). • option="conditional" A matrix plot with the conditional inclusionprobabilities
P r ( x i | x j , y ) (ones in the diagonal). • option="not" A matrix plot with the conditional inclusion probabili-ties
P r ( x i | Not x j , y ) (zeroes in the diagonal). • option="dimension" A bar plot representation of the posterior distri-bution of the dimension of the true model (number of variables, rangingfrom p to p + p ).The first three options above are basic measures describing aspects of thejoint effect of two given variables, x i , x j and can be understood as naturalextensions of the marginal inclusion probabilities. In Figure 1, we have re-produced the first three plots (from left to right) obtained with the followinglines of code: R> mj<- plotBvs(crime.Edfix, option="joint")R> mc<- plotBvs(crime.Edfix, option="conditional")R> mn<- plotBvs(crime.Edfix, option="not")
Apart from the plot, these functions return the matrix represented for futherstudy. For the conditionals probabilities ( conditional and not ) the vari-ables in the row are the conditioning variables (eg. in mc above, the position16 i, j ) is the inclusion probability of variable in j -th column conditional onthe variable in i -th row).Within these matrices, the most interesting results correspond to vari-ations from the marginal inclusion probabilities (represented in the top ofthe plots as a separate row for reference). Our experience suggests that themost valuable of these is option="not" as it can reveal key details aboutthe relations between variables in relation with the response. For instance,take that plot in Figure 1 (plot on the left of the second row) and observethat, while variable Po2 barely has any effect on savings, it becomes relevantif
Po1 is removed. This is the probability
P r ( P o | Not
P o , y ) with value R> mn["Not.Po1","Po2"][1] 0.9996444 which, as we observed in the graph, is substantially large compared with
P r ( P o | y ) = 0 . Po1 is of even more importance if
Po2 is notconsidered as a possible explanatory variable. All this implies that, in relationwith the dependent variable, both variables contain similar information andone can act as proxy for the other.We can further investigate this idea of a relationship between two covari-ates with respect to the response using the jointness measures proposed byLey and Steel (2007). These are available using function
Jointness that de-pends on two arguments: x , an object of class Bvs and covariates a charac-ter vector indicating which pair of covariates we are interested in. By default covariates="All" printing the matrices with the jointness measurement forevery pair of covariates. In particular, three jointness measures relative totwo covariates are reported by this function: i) the joint inclusion probabil-ity, ii) the ratio between the joint inclusion probability and the probabilityof including at least one of them and finally, iii) the ratio between the jointinclusion probability and the probability of including one of them alone.For instance:
R> Jointness(crime.Edfix, covariates=c("Po1","Po2"))---------The joint inclusion probability for Po1 and Po2 is: 0.22---------The ratio between the probability of including both covariates and the probabilityof including at least one of then is: 0.22---------
Plots corresponding to the four possible values of the argument option in plotBvs over the object crime.Edfix of Section 4.1. From left to right: joint , conditional , not and dimension . The probability of including both covariates together is 0.27 times the probabilityof including one of them alone
With these results we must conclude that it is unlikely that both variables,
Po1 and
Po2 , are to be included together in the true model.Finally, within plotBvs , the assignment option="dimension" producesa plot that speaks about the complexity of the true model in terms of thenumber of covariates that it contains. The last plot in Figure 1 is the output18f
R> plotBvs(crime.Edfix, option="dimension")
From this plot we conclude that the number of covariates is about 7 but witha high variability. The exact values of this posterior distribution are in thecomponent postprobdim of the
Bvs object.
In a variable selection problem it is explicitly recognized that there is un-certainty regarding which variables make up the true model. Obviously, thisuncertainty should be propagated in the inferential process (as opposed toinferences using just one model) to produce more reliable and accurate es-timations and predictions. These type of procedures are normally calledmodel averaging and are performed once the model selection exercise is per-formed (that is, the posterior probabilities have been already obtained). In
BayesVarSel these inferences can be obtained acting over objects of class
Bvs .Suppose that Λ is a quantity of interest and that under model M γ it has aposterior distribution π N (Λ | y , M γ ) with respect to certain non-informativeprior π Nγ . Then, we can average over all entertained models using the poste-rior probabilities in (3) as weights to obtain f (Λ | y ) = (cid:88) γ π N (Λ | y , M γ ) P r ( M γ | y ) . (9)In BayesVarSel for π Nγ we use the reference prior developed in Berger andBernardo (1992) and further studied in Berger et al. (2009). This is anobjective prior with very good theoretical properties and the formulas for theposterior distribution with a fixed model are known (Bernardo and Smith,1994). These priors are different from the model selection priors used tocompute the Bayes factors (see Section 2), but as shown in Consonni andDeldossi (2016), the posterior distributions approximately coincide and then f (Λ | y ) basically can be interpreted as the posterior distribution of Λ.There are two different quantities Λ that are of main interest in variableselection problems. First is a regression parameter β i and second is a futureobservation y (cid:63) associated with known values of the covariates x (cid:63) . In whatfollows we refer to each of these problems as (model averaged) estimationand prediction to which we devote the next subsections.19 .1 Estimation Inclusion probabilities
P r ( x i | y ) can be roughly interpreted as the proba-bility that β i is different from zero. Nevertheless, it does not say anythingabout the magnitude of the coefficient β i nor anything about its sign.Such type of information can be obtained from the distribution in (9)which in the case of Λ ≡ ( α , β ) is f ( α , β | y ) = (cid:88) γ St p γ + p (( α , β γ ) | ( ˆ α , ˆ β γ ) , ( Z tγ Z γ ) − SSE γ n − p γ − p , n − p γ − p ) P r ( M γ | y ) , (10)where ˆ α , ˆ β γ is the maximum likelihood estimator under M γ (see eq.4), Z γ =( X , X γ ) and SSE γ is the sum of squared errors in M γ . Above St makesreference to the multivariate student distribution: St k ( x | µ , Σ , df ) ∝ (cid:0) df ( x − µ ) t Σ − ( x − µ ) (cid:1) − ( df + k ) / . In BayesVarSel the whole model averaged distribution in (10) is pro-vided in the form of a random sample through the function
BMAcoeff whichdepends on two arguments: x which is a Bvs object and n.sim the numberof observations to be simulated (taking the default value of 10000). Thereturned object is an object of class bma.coeffs which is a column-named matrix with n.sim rows (one per each simulation) and p + p columns (oneper each regression parameter). The way that BMAcoeff works depends onwhether the object was created with
Bvs (or
PBvs ) or with
GibbsBvs . Thisis further explained below.
If the
Bvs object was generated with
Bvs or PBvs
In this case themodels over which the average is performed are the n.keep (previously spec-ified) best models. Hence, if n.keep equals 2 p then all competing modelsare used while if n.keep ¡2 p only a proportion of them are used and posteriorprobabilities are re-normalized to sum one.On many occasions where estimations are required, the default value of n.keep (which we recall is 10) is small and should be increased. Ideally2 p should be used but, as noticed by Raftery et al. (1997) this is normallyunfeasible and commonly it is sufficient to average over a reduced set ofgood models that accumulates a reasonable posterior mass. This set is whatRaftery et al. (1997) call the “Occam’s window”. The function BMAcoeff informs about the total probability accumulated in the models that are used.20or illustrative purposes let us retake the UScrime dataset and, in particu-lar, the example in Section 1.2 in which, apart from the constant, the variableEd was assumed as fixed. The total number of models is 2 = 16384 and weexecute again Bvs but now with n.keep =2000
R> crime.Edfix<- Bvs(formula="y~.", data=UScrime,+ fixed.cov=c("Intercept", "Ed"), n.keep=2000) (taking 1.9 seconds). The object crime.Edfix contains identical informationas the one previously created with the same name, except for the modelsretained, which in this case are the best 2000. These models accumulate aprobability of 0.90, which seems quite reasonable to derive the estimates. Wedo so executing the second command of the following script (the seed is fixedfor the sake of reproducibility).
R> set.seed(1234)R> bma.crime.Edfix<- BMAcoeff(crime.Edfix)Simulations obtained using the best 2000 modelsthat accumulate 0.9 of the total posterior probability
The distribution in (10) and hence the simulations obtained can be highlymultimodal and providing default summaries of it (like the mean or standarddeviation) is potentially misleading. For a first exploration of the modelaveraged distribution,
BayesVarSel comes armed with a plotting function, histBMA , that produces a histogram-like representation borrowing ideas fromScott and Berger (2006) and placing a bar at zero with height proportionalto the number of zeros obtained in the simulation.The function histBMA depends on several arguments: • x An object of class bma.coeffs . • covariate A text specifying the name of an explanatory variable whoseaccompanying coefficient is to be represented. This must be the nameof one of the columns in x . • n.breaks The number of equal lentgh bars for the histogram. Defaultvalue is 100. • text If set to
TRUE (default value) the frequency of zeroes is added atthe top of the bar at zero. 21 gray.0
A numeric value between 0 and 1 that specifies the darkness,in a gray scale (0 is white and 1 is black) of the bar at zero. Defaultvalue is 0.6. • gray.no0 A numeric value between 0 and 1 that specifies the darkness,in a gray scale (0 is white and 1 is black) of the bars different fromzero. Default value is 0.8.For illustrative purposes let us examine the distributions of β Ineq (inclu-sion probability 0.99); β T ime (0.24) and β P rob (0.62) using histBMA
R> histBMA(bma.crime.Edfix, covariate = "Ineq", n.breaks=50)R> histBMA(bma.crime.Edfix, covariate = "Time", n.breaks=50)R> histBMA(bma.crime.Edfix, covariate = "Prob", n.breaks=50)
The plots obtained are reproduced in Figure 2 where we can see that
Ineq has a positive effect. This distribution is unimodal so there is no drawback tosummarizing the effect of the response
Ineq over savings using, for example,the mean and quantiles, that is:
R> quantile(bma.crime.Edfix[,"Ineq"], probs=c(0.05, 0.5, 0.95))5% 50% 95%4.075685 7.150184 10.326606
This implies an estimated effect of 7.2 with a 90% credible interval [4.1,10.3].The situation of
Time is clear and its estimated effect is basically null (inagreement with a low inclusion probability).Much more problematic is reporting estimates of the effect of
Prob with ahighly polarized estimated effect being either very negative (around -4100) orzero (again in agreement with its inconclusive inclusion probabilty of 0.62).Notice that, in this case, the mean (approximately -2500) should not be usedas a sensible estimation of the parameter β P rob . If the
Bvs object was generated with
GibbsBvs
In this case, the aver-age in (10) is performed over the n.iter (an argument previously defined)models sampled in the MCMC scheme. Theoretically this corresponds tosampling over the whole distribution (all models are considered) and leads tothe approximate method pointed out in Raftery et al. (1997). All previousconsiderations regarding the difficult nature of the underlying distributionapply here. 22igure 2:
Representation provided by the function histBMA of the Model aver-aged posterior distributions of β Ineq , β T ime and β P rob for the UScrime dataset withconstant and Ed considered as fixed in the variable selection exercise.
Let us consider again the SDM dataset in which analysis we created theobject growth.wstar7 in Section 4. Suppose we are interested in the effectof the variable
P60 on the response GDP. The summary method informs thatthis variable has an inclusion probability of 0.77.
R> set.seed(1234)R> bma.growth.wstar7<- BMAcoeff(growth.wstar7)Simulations obtained using the 10000 sampled models.Their frequencies are taken as the true posterior probabilitiesR> histBMA(bma.growth.wstar7, covariate = "P60",n.breaks=50)
The distribution is bimodal (graph not shown here to save space) with modesat zero and 2.8 approximately. Again, it is difficult to provide simple sum-maries to describe the model averaged behaviour of P60. Nevertheless, it isalways possible to answer relevant questions such as: what is the probabilitythat the effect of
P60 over savings is greater than one?
R> mean(bma.growth.wstar7[,"P60"]>1)[1] 0.7511
Suppose we want to predict a new observation y (cid:63) with associated values ofcovariates ( x (cid:63) ) t ∈ R p + p (in what follows the product of two vectors corre-sponds to the usual scalar product). In this case, the distribution (9) adoptsthe form f ( y (cid:63) | y , x (cid:63) ) = (cid:88) γ St ( y (cid:63) | x (cid:63)γ ( ˆ α , ˆ β γ ) , SSE γ h γ , n − p γ − p ) P r ( M γ | y ) , (11)23here h γ = 1 − x (cid:63)γ (cid:0) ( x (cid:63)γ ) t x (cid:63)γ + Z tγ Z γ (cid:1) − ( x (cid:63)γ ) t . As with estimations,
BayesVarSel has implemented the function predictBvs designed to simulate a desired number of observations from (11). A maindifference with model averaged estimations is that, normally, the above pre-dictive distribution is unimodal.The function predictBvs depends on x , an object of class Bvs , newdata ,a data.frame with the values of the covariates (the intercept, if needed, isautomatically added) and n.sim the number of observations to be simulated.The considerations described in the previous section for the Bvs object aboutthe type of function originally used apply here.The function predictBvs returns a matrix with n.sim rows (one per eachsimulated observation) and with the number of columns the number of cases(rows) in the newdata .For illustrative purposes, consider the
Bvs object named growth.wstar7 for the analysis of the SDM dataset. Simulations from the predictive dis-tribution (11) associated with values of the covariates fixed at their meanscan be obtained with the following code. Here, a histogram is produced (seeFigure 3) as a graphical approximation of the underlying distribution.
R> set.seed(1234)R> pred.growth.wstar7<- predictBvs(x=growth.wstar7,+ newdata=data.frame(t(colMeans(SDM))))R> hist(pred.growth.wstar7[,1], main="SDM",+ border=gray(0.6), col=gray(0.8), xlab="y")
The first version of
BayesVarSel was released on December 2012 with themain idea of making available the C code programmed for the work Garcia-Donato and Martinez-Beneito (2013) to solve exactly moderate to large vari-able selection problems. Since then, six versions have followed with newabilities that make up the complete toolbox that we have described in thispaper.Nevertheless, BayesVarSel is an ongoing project that we plan as solidand contrasted methods become available. The emphasis is placed on theprior distribution that should be used, since this is a particularly relevantaspect of model selection/testing problems.24igure 3:
For SDM data and related
Bvs object growth.wstar7 , model averagedprediction of the “mean” case (predicting the output associated with the mean ofobserved covariates)
New functionalities that we expect to incorporate in the future are: • The case where n < p + p and possibly n << p + p , • specific methods for handling factors, • heteroscedastic errors, • other types of error distributions. Acknowledgments
This paper was supported in part by the Spanish Ministry of Economy andCompetitivity under grant MTM2016-77501-P (BAiCORRE). The authorswould like to thank Jim Berger for his suggestions during the process ofbuilding the package and would like to thank Dimitris Fouskakis for a carefulreading and suggestions on this manuscript.25 eferences
Maria M. Barbieri and James O. Berger. Optimal predictive model selection.
The Annals of Statistics , 32(3):870–897, 2004.M. J. Bayarri and G. Garc´ıa-Donato. Extending conventional priors for test-ing general hypotheses in linear models.
Biometrika , 94(1):135–152, 2007.Maria J. Bayarri, James O. Berger, Anabel Forte, and Gonzalo Garc´ıa-Donato. Criteria for Bayesian model choice with application to variableselection.
Annals of Statistics , 40:1550–1577, 2012.James O. Berger. The case for objective bayesian analysis.
Bayesian Analysis ,1(3):385–402, 2006.James O. Berger and Jose M. Bernardo. On the development of the referenceprior method. In J. M. Bernardo, editor,
Bayesian Statistics 4 , pages 35–60. London: Oxford University Press, 1992.James O. Berger and Luis R. Pericchi.
Objective Bayesian Methods forModel Selection: Introduction and Comparison , volume 38 of
LectureNotes–Monograph Series , pages 135–207. Institute of Mathematical Statis-tics, Beachwood, OH, 2001. doi: 10.1214/lnms/1215540968. URL http://dx.doi.org/10.1214/lnms/1215540968 .James O. Berger, J. M. Bernardo, and Dongchu Sun. The formal definitionof reference priors.
The Annals of Statistics , 37(2):905–938, 2009.J. M. Bernardo and Adrian F. M. Smith.
Bayesian Theory . John Wiley andSons, ltd., 1994.G. Consonni and L. Deldossi. Objective bayesian model discrimination infollow- up experimental designs.
Test , 25:397–412, 2016.I. Ehrlich. Participation in illegitimate activities: a theoretical and empiricalinvestigation.
Journal of Political Economy , 81(3):521–567, 1973.Carmen Fern´andez, Eduardo Ley, and Mark F. Steel. Benchmark priors forBayesian model averaging.
Journal of Econometrics , 100:381–427, 2001.26onzalo Garcia-Donato and Miguel A. Martinez-Beneito. On Samplingstrategies in Bayesian variable selection problems with large model spaces.
Journal of the American Statistical Association , 108(501):340–352, 2013.Edward I. George and Robert E. McCulloch. Approaches for Bayesian vari-able selection.
Statistica Sinica , 7:339–373, 1997.Harold Jeffreys.
Theory of Probability . Oxford University Press, 3rd edition,1961.Robert E. Kass and Adrian E. Raftery. Bayes factors.
Journal of the Amer-ican Statistical Association , 90(430):773–795, 1995.Robert E. Kass and Larry Wasserman. A reference Bayesian test for nestedhypotheses and its relationship to the schwarz criterion.
Journal of theAmerican Statistical Association , 90(431):928–934, 1995.Peter M. Lee.
Bayesian Statistics. An introduction. Second edition.
JohnWiley and Sons, ltd., 1997.Eduardo Ley and Mark F. Steel. On the effect of prior assumptions inBayesian model averaging with applications to growth regression.
Journalof Applied Econometrics , 24(4):651–674, 2009.Eduardo Ley and Mark F.J. Steel. Jointness in Bayesian variable selectionwith applications to growth regression.
Journal of Macroeconomics , 29(3):476 – 493, 2007. ISSN 0164-0704. doi: http://dx.doi.org/10.1016/j.jmacro.2006.12.002. URL . Special Issue on the Empirics of Growth Non-linearities.Feng Liang, Rui Paulo, German Molina, Merlise A. Clyde, and James O.Berger. Mixtures of g-priors for Bayesian variable selection.
Journal of theAmerican Statistical Association , 103(481):410–423, 2008.Adrian E. Raftery, David Madigan, and Jennifer Hoeting. Bayesian modelaveraging for linear regression models.
Journal of the American StatisticalAssociation , 92:179–191, 1997.N. Ravishanker and D. K. Dey.
A first course in linear model theory . Chap-man and Hall/CRC, 2002. 27avier Sala-I-Martin, Gernot Doppelhofer, and Ronald I. Miller. Deter-minants of long-term growth: A Bayesian averaging of classical estimates(BACE) approach.
American Economic Review , 94(4):813–835, 2004. URL https://ideas.repec.org/a/aea/aecrev/v94y2004i4p813-835.html .James G. Scott and James O. Berger. An exploration of aspects of bayesianmultiple testing.
Journal of Statistical Planning and Inference , 136(7):2144–2162, 2006.W. N. Venables and B. D. Ripley.
Modern Applied Statistics with S . Springer,New York, fourth edition, 2002. URL . ISBN 0-387-95457-0.Arnold Zellner. On assessing prior distributions and Bayesian regressionanalysis with g-prior distributions. In A. Zellner, editor,
Bayesian Infer-ence and Decision techniques: Essays in Honor of Bruno de Finetti , pages389–399. Edward Elgar Publishing Limited, 1986.Arnold Zellner and A. Siow. Posterior odds ratio for selected regression hy-potheses. In J. M. Bernardo, M.H. DeGroot, D.V. Lindley, and AdrianF. M. Smith, editors,
Bayesian Statistics 1 , pages 585–603. Valencia: Uni-versity Press, 1980.Arnold Zellner and A. Siow.
Basic Issues in Econometrics . Chicago: Uni-versity of Chicago Press, 1984.
Appendix: Model selection priors for parame-ters within models
A key technical component of Bayes factors and hence of posterior probabil-ities is the prior distribution for the parameters within each model. That is,the prior π γ ( α , β γ , σ ) for the specific parameters of the model M γ : y = X α + X γ β γ + ε , ε ∼ N n ( , σ I n ) . (12)In BayesVarSel the prior used is specified in main functions
Btest , Bvs , PBvs and
GibbsBvs with the argument prior.betas with default value ”Robust”that corresponds to the proposal the same name in (Bayarri et al. (2012)).28n this paper it is argued, based on foundational arguments, that the robustprior is an optimal choice for testing in linear models.The robust prior for M γ can be specified hierarchically as π Rγ ( α , β γ , σ ) = σ − N p γ ( β γ | , g Σ γ ) (13)where Σ γ = σ ( V tγ V γ ) − , with V γ = ( I n − X ( X t X ) − X t ) X γ (14)and g ∼ p Rγ ( g ) = 12 (cid:115) np γ + p ( g + 1) − / , g > np γ + p − . (15)For the null model the prior assumed is π ( α , σ ) = σ − .The idea of using the matrix Σ γ to scale variable selection priors datesback to Zellner and Siow (1980) and is present in other very popular proposalsin the literature. As we next describe, these proposals differ about whichdistribution should be used for the hyperparameter g . Many of these can beimplemented in BayesVarSel through the argument prior.betas . • prior.betas="ZellnerSiow" (Jeffreys (1961); Zellner and Siow (1980,1984)) corresponds to g ∼ IGa (1 / , n/
2) (leading to the very famousproposal of using a Cauchy). • prior.betas="gZellner" (Zellner (1986); Kass and Wasserman (1995))corresponds to fixing g = n (leading to the so called Unit InformationPrior). • prior.betas="FLS" (Fern´andez et al. (2001)) corresponds to fixing g = max { n, p } . • prior.betas="Liangetal" (Liang et al. (2008)) corresponds to g ∼ π ( g ) ∝ (1 + g/n ) − /2