Group Inverse-Gamma Gamma Shrinkage for Sparse Regression with Block-Correlated Predictors
Jonathan Boss, Jyotishka Datta, Xin Wang, Sung Kyun Park, Jian Kang, Bhramar Mukherjee
Department of Biostatistics, University of Michigan; Department of Statistics, Virginia Polytechnic Institute and State University; Department of Epidemiology, University of Michigan
Abstract
Heavy-tailed continuous shrinkage priors, such as the horseshoe prior, are widely used for sparse estimation problems. However, there is limited work extending these priors to predictors with grouping structures. Of particular interest in this article is regression coefficient estimation where pockets of high collinearity in the covariate space are contained within known covariate groupings. To assuage variance inflation due to multicollinearity we propose the group inverse-gamma gamma (GIGG) prior, a heavy-tailed prior that can trade off between local and group shrinkage in a data-adaptive fashion. A special case of the GIGG prior is the group horseshoe prior, whose shrinkage profile is correlated within group such that the regression coefficients marginally have exact horseshoe regularization. We show posterior consistency for regression coefficients in linear regression models and posterior concentration results for mean parameters in sparse normal means models. The full conditional distributions corresponding to GIGG regression can be derived in closed form, leading to straightforward posterior computation. We show via simulation that GIGG regression results in low mean-squared error across a wide range of correlation structures and within-group signal densities. We apply GIGG regression to data from the National Health and Nutrition Examination Survey for associating environmental exposures with liver functionality.
Keywords:
Global-Local Shrinkage Prior, Group Sparsity, Horseshoe Prior, Multicollinearity, Multipollutant Modeling.

∗ Corresponding Author: [email protected]

1. INTRODUCTION

Regression with grouped features is a common problem in many biomedical applications. Some examples include metabolomics data, where metabolites are grouped by subpathway membership; neuroimaging data, where adjacent voxels are spatially grouped; and environmental contaminants data, where exposures are grouped by chemical structure, toxicological profile, and pharmacokinetics (see Figure 1). In such cases, leveraging relevant grouping information to construct correlated within-group shrinkage profiles may help achieve additional variance reduction beyond comparable methods that ignore the grouping structure. The methodological focus of this article will be on incorporating known covariate grouping information into a continuous shrinkage prior framework.

Ever since the publication of the horseshoe prior (Carvalho et al., 2009, 2010), there has been an explosion of continuous shrinkage priors designed for sparse estimation problems, notably generalized double Pareto shrinkage (Armagan et al., 2013a), Dirichlet–Laplace shrinkage (Bhattacharya et al., 2015), horseshoe+ shrinkage (Bhadra et al., 2017), and normal beta prime (NBP) shrinkage (Bai and Ghosh, 2019), among others. These priors have become increasingly popular for sparse regression problems because of their good theoretical and empirical properties, in addition to their scale mixture representation, which facilitates straightforward and efficient posterior simulation algorithms. The general recipe for constructing a continuous shrinkage prior with good estimation and prediction properties is substantial mass at the origin, to sufficiently shrink null coefficients towards zero, and regularly varying tails, to avoid overregularizing non-null coefficients (Bhadra et al., 2016).
Surveying the continuous shrinkage prior literature on regression with known grouping structure, there are many papers which discuss the Bayesian group lasso and its applications (Kyung et al., 2010; Li et al., 2015; Xu and Ghosh, 2015; Hefley et al., 2017; Kang et al., 2019) and several papers which propose extensions to the Bayesian sparse group lasso (Xu and Ghosh, 2015), Bayesian group bridge regularization (Mallick and Yi, 2017), and the normal exponential gamma prior with grouping structure (Rockova and Lesaffre, 2014). Xu et al. (2016) introduced the so-called group horseshoe prior with an emphasis on prediction in Bayesian generalized additive models. However, the group horseshoe prior does not reduce to the horseshoe prior for a group of size one, meaning that the group horseshoe prior, as proposed by Xu et al. (2016), is not a direct generalization of the horseshoe prior.

Bayesian group lasso-style shrinkage is not generally preferred as a default method for estimation problems, as the Laplacian prior has neither an infinite spike at zero nor regularly varying tails (Polson and Scott, 2011; Castillo et al., 2015; Bhadra et al., 2016). The group horseshoe prior of Xu et al. (2016) has the desired origin and tail behavior marginally; however, no hyperparameter in the prior controls how correlated the shrinkage is within a group. Thus, this prior implicitly assumes that the degree of correlated shrinkage within group depends only on group size. This assumption is inadequate when we a priori believe that, irrespective of group size, some groups have more heterogeneous effect sizes than others and, moreover, it does not avail the opportunity to learn how correlated the shrinkage should be in a data-adaptive manner, which is an intrinsic feature in some application areas. For example, in modeling multiple pollutants this is a relevant consideration, as some exposure classes have more homogeneous toxicological profiles than others (Ferguson et al., 2014).
[Figure 1 appears here.]

Figure 1: Pairwise Spearman correlation plot between metals, phthalates, organochlorine pesticides, polybrominated diphenyl ethers, and polycyclic aromatic hydrocarbons from the 2003-2004 National Health and Nutrition Examination Survey (n = 990).

From a theoretical perspective, the existing posterior concentration and posterior consistency results for heavy-tailed continuous shrinkage priors all, to our knowledge, apply to independent or exchangeable priors, meaning that there have not yet been any attempts to employ similar arguments for dependent priors.

To address these limitations, we propose the group inverse-gamma gamma (GIGG) prior, which extends the horseshoe and normal beta prime (NBP) priors to incorporate grouping structures. The GIGG prior introduces a group-level shrinkage parameter, in addition to the usual global and local shrinkage parameters, such that the induced prior on the product of the group and local shrinkage parameters yields the desired marginal shrinkage profile. This allows the user to control the trade-off between group-level and individual-level shrinkage, leading to relatively low estimation error irrespective of the signal density and the degree of multicollinearity within each group. Additionally, the GIGG prior is constructed such that all parameters have closed-form full conditional distributions, implying that techniques to scale horseshoe regression to large sample sizes and high-dimensional covariate spaces are also applicable to GIGG regression (Bhattacharya et al., 2016; Terenin et al., 2019; Johndrow et al., 2020). Theoretically, we establish posterior consistency and posterior concentration results for regression coefficients with grouping structure in linear regression models and mean parameters with grouping structure in sparse normal means models with respect to several GIGG hyperparameters and correlation structures.
To our knowledge, we are the first to apply existing theoretical frameworks for posterior consistency in the sparse linear regression model (Armagan et al., 2013b) and posterior concentration in the sparse normal means model (Datta and Ghosh, 2013) to a non-exchangeable prior, which will be useful for future evaluations of other non-exchangeable priors.

The structure of the paper is as follows. An intuitive explanation of the GIGG prior is given in Section 2, followed by theoretical results in Section 3. After the methodological and theoretical discussion, we outline computational details, including hyperparameter estimation via marginal maximum likelihood estimation (MMLE) (Section 4). In Section 5, we conduct a simulation study to empirically verify that the intuition and theory developed in Sections 2 and 3 hold for linear regression models with group-correlated features. We then apply GIGG regression to data from the 2003-2004 National Health and Nutrition Examination Survey (NHANES) to identify toxicants and metals associated with a biomarker of liver function (Section 6) and conclude with a discussion (Section 7).
2. METHODS
Throughout the article, N(µ, Σ) denotes a multivariate normal distribution with mean parameter µ and variance-covariance matrix Σ, G(a, b) denotes a gamma distribution with shape parameter a and rate parameter b, and IG(a, b) denotes an inverse gamma distribution with shape parameter a and scale parameter b. Additionally, we will use π(·) as general notation for a prior probability measure and π(· | y) as general notation for a posterior probability measure.

Consider a Bayesian sparse linear regression model
$$[\mathbf{y} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \sigma^2] \sim N\Big(C\boldsymbol{\alpha} + \sum_{g=1}^{G} X_g \boldsymbol{\beta}_g,\ \sigma^2 I_n\Big), \quad \pi(\boldsymbol{\alpha}) \propto 1, \quad \boldsymbol{\beta} \sim \pi(\boldsymbol{\beta}), \quad \pi(\sigma^2) \propto \sigma^{-2}, \qquad (1)$$
where g = 1, ..., G indexes the groups, y is an n × 1 vector of outcomes, C is a matrix of adjustment covariates, X_g is an n × p_g matrix of standardized covariates in the g-th group, β_g = (β_{g1}, ..., β_{gp_g})ᵀ is a p_g × 1 vector of regression coefficients for the g-th group, β = (β_1ᵀ, ..., β_Gᵀ)ᵀ is a p × 1 vector of regression coefficients with p = Σ_g p_g, and I_n is an n × n identity matrix. We assume the model is sparse in the sense that many of the entries in β are zero. The group inverse-gamma gamma (GIGG) prior is defined as
$$[\beta_{gj} \mid \tau^2, \gamma_g^2, \lambda_{gj}^2] \sim N(0, \tau^2 \gamma_g^2 \lambda_{gj}^2), \quad [\gamma_g^2 \mid a_g] \sim G(a_g, 1), \quad [\lambda_{gj}^2 \mid b_g] \sim IG(b_g, 1), \quad [\tau^2, \sigma^2] \sim \pi(\tau^2, \sigma^2),$$
where j = 1, ..., p_g indexes the covariates within the g-th group. Alternatively, we may also express the prior on β as a vector, [β | τ², Γ, Λ] ∼ N(0, τ²ΓΛ), where Λ = diag(λ²_{11}, ..., λ²_{Gp_G}) and Γ = diag(γ²_1, ..., γ²_1, γ²_2, ..., γ²_2, ..., γ²_G, ..., γ²_G) such that γ²_g is repeated p_g times along the diagonal of Γ. In the GIGG prior specification, the priors on the group shrinkage parameter, γ²_g, and the local shrinkage parameter, λ²_{gj}, are selected such that the induced prior on the product is a beta prime prior, γ²_g λ²_{gj} ∼ β′(a_g, b_g) (see Appendix 8.1 for distributional definitions).
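The hierarchy above can be checked numerically: if γ²_g ∼ G(a_g, 1) and λ²_{gj} ∼ IG(b_g, 1) independently, the product γ²_g λ²_{gj} should follow a beta prime β′(a_g, b_g) distribution. A minimal Monte Carlo sketch in Python (an illustration, not code from the paper; hyperparameter values are chosen arbitrarily):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a_g, b_g = 1.0, 3.0          # illustrative hyperparameter values
n = 200_000

gamma2 = rng.gamma(shape=a_g, scale=1.0, size=n)       # gamma(a_g, rate 1)
lam2 = 1.0 / rng.gamma(shape=b_g, scale=1.0, size=n)   # inverse-gamma(b_g, scale 1)
prod = gamma2 * lam2

# beta-prime(a, b) has mean a / (b - 1) for b > 1
print(prod.mean())   # close to 1 / (3 - 1) = 0.5
print(stats.kstest(prod, stats.betaprime(a_g, b_g).cdf).pvalue)
```

The construction works because 1/λ²_{gj} ∼ G(b_g, 1), and a ratio of independent unit-rate gamma variables is beta prime distributed.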
Since the group shrinkage parameter is shared by all p_g observations in the g-th group, assigning a beta prime prior on the product allows for normal beta prime shrinkage marginally such that the shrinkage is correlated within group. One point that deserves further clarification is the assignment of the gamma and inverse-gamma priors to the group and local parameters, respectively, when either configuration would yield a beta prime prior on the product. The rationale behind this choice is that the inverse-gamma prior is heavier-tailed than the gamma prior, thereby preventing overregularization of large, non-null coefficients due to being grouped with null coefficients. Setting a_g = b_g = 1/2 for all g yields a special case of the GIGG prior called the group horseshoe prior, which has correlated horseshoe regularization within group. For a group of size one, the group shrinkage parameter becomes a local shrinkage parameter and we recover the horseshoe prior from the group horseshoe prior.

When discussing a proposed shrinkage prior on β, there are two key features of the marginal prior that need to be investigated. The first is the behavior in a tight neighborhood around zero and the second is the rate at which the prior decays in the extremes. For τ² = 1 fixed, Bai and Ghosh (2019) showed that the marginal prior π(β_{gj} | τ², a_g, b_g) has a pole at 0 if and only if 0 < a_g ≤ 1/2, with the pole at zero becoming stronger the closer a_g is to zero. Therefore, one should select a_g ∈ (0, 1/2] for sparse estimation problems to sufficiently shrink null coefficients towards zero. To clarify the tail behavior we need to introduce the notion of a regularly varying function (Bingham et al., 1989): a positive, measurable function f is said to be regularly varying at ∞ with index ω ∈ ℝ if lim_{x→∞} f(tx)/f(x) = t^ω for all t > 0.

Theorem 2.1.
Let B(a_g, b_g) denote the beta function evaluated at a_g and b_g and Γ(b_g + 1/2) denote the gamma function evaluated at b_g + 1/2. The tails of the marginal prior probability density function of β_{gj} decay at the following rate:
$$\lim_{\beta_{gj} \to \infty} \frac{\pi(\beta_{gj} \mid \tau^2, a_g, b_g)}{r(\beta_{gj}, \tau^2, a_g, b_g)} = 1, \qquad r(\beta_{gj}, \tau^2, a_g, b_g) = \frac{(2\tau^2)^{b_g}\, \Gamma(b_g + 1/2)}{\sqrt{\pi}\, B(a_g, b_g)}\, |\beta_{gj}|^{-(1+2b_g)} \left( \frac{\beta_{gj}^2/\tau^2}{1 + \beta_{gj}^2/\tau^2} \right)^{a_g}.$$
Consequently, the index of regular variation is ω = −1 − 2b_g.

Proof. See Appendix 8.2.

The concept of regular variation has been extensively discussed in the context of Bayesian robustness and noninformative inference (Dawid, 1973; O'Hagan, 1979; Andrade and O'Hagan, 2006), with the latter being recently elaborated on in the context of global-local shrinkage priors (Bhadra et al., 2016). When the index ω < 0, regular variation essentially states that the tail of the function decays at a polynomial rate and is therefore considered heavy-tailed. Some examples of priors with regularly varying tails include the Student's t prior and the horseshoe prior. Conversely, commonly used priors such as the normal prior and the Laplace prior do not have regularly varying tails. As a consequence of having exponentially decaying tails, Bayesian linear regression with independent normal priors and the Bayesian lasso are prone to overregularizing large signals and are not flexible enough to facilitate conflict resolution between discordant likelihood and prior information (Andrade and O'Hagan, 2006; Polson and Scott, 2011). Theorem 2.1 shows that for any pair of hyperparameters a_g and b_g, the marginal GIGG prior has regularly varying tails and, furthermore, that b_g controls the rate at which the tails decay.

To further elucidate the shrinkage profile of the GIGG prior, we will focus on a special case of the sparse linear regression model called the sparse normal means model (X = I_n and C empty). In the global-local shrinkage prior literature, it is conventional to work with the sparse normal means problem for analytical tractability, even when the ultimate goal is regression (Rockova and Lesaffre, 2014; Bhattacharya et al., 2015), as the posterior mean has a convenient representation, E[β_{gj} | y_{gj}, τ², σ²] = (1 − E[κ_{gj} | y_{gj}, τ², σ²]) y_{gj}. Here, κ_{gj} = σ²/(σ² + τ²γ²_g λ²_{gj}) is called a shrinkage factor, because it quantifies how much the posterior mean is shrunk relative to the maximum likelihood estimator y_{gj}.
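The polynomial tail rate in Theorem 2.1 can be observed empirically. In the sketch below (an illustration, not code from the paper), marginal draws of β_{gj} with τ² = 1 and a_g = b_g = 1/2 (the group horseshoe case) are generated, and the empirical survival function P(|β| > t) is checked for the t^{−2b_g} = t^{−1} decay implied by a density tail of order |β|^{−(1+2b_g)}:

```python
import numpy as np

rng = np.random.default_rng(1)
a_g = b_g = 0.5          # group horseshoe case; density tail index is -(1 + 2*b_g) = -2
n = 2_000_000

gamma2 = rng.gamma(a_g, 1.0, n)
lam2 = 1.0 / rng.gamma(b_g, 1.0, n)
beta = np.sqrt(gamma2 * lam2) * rng.standard_normal(n)   # marginal draws with tau^2 = 1

# the survival function should scale like t^(-2*b_g) = 1/t
p10, p100 = np.mean(np.abs(beta) > 10), np.mean(np.abs(beta) > 100)
slope = np.log(p10 / p100) / np.log(10.0)
print(slope)   # near 1, i.e. a polynomial (regularly varying) tail
```

Repeating this with normal or Laplace draws instead gives essentially no exceedances beyond t = 10, reflecting their exponentially decaying tails.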
Calculating the joint prior distribution for the shrinkage factors in the g-th group, κ_g = (κ_{g1}, ..., κ_{gp_g})ᵀ, we have
$$\pi\big(\boldsymbol{\kappa}_g \mid \tau^2, \sigma^2, a_g, b_g\big) = \frac{\Gamma(a_g + p_g b_g)}{\Gamma(a_g)\big(\Gamma(b_g)\big)^{p_g}} \left(\frac{\tau^2}{\sigma^2}\right)^{p_g b_g} \left(1 + \frac{\tau^2}{\sigma^2} \sum_{j=1}^{p_g} \frac{\kappa_{gj}}{1 - \kappa_{gj}}\right)^{-(a_g + p_g b_g)} \left(\prod_{j=1}^{p_g} \kappa_{gj}^{b_g - 1} (1 - \kappa_{gj})^{-(b_g + 1)}\right),$$
where 0 < κ_{gj} < 1 for 1 ≤ j ≤ p_g. Evaluating the prior distribution of κ_g, we see that the joint density multiplicatively factorizes into "dependent" and "independent" parts, where the degree to which the within-group shrinkage is correlated is governed by the Σ_{j=1}^{p_g} κ_{gj}/(1 − κ_{gj}) term. That is, as a_g + p_g b_g goes to zero, the regularization is highly individualistic, whereas if a_g + p_g b_g moves away from zero, then the shrinkage becomes increasingly more correlated within the g-th group.

Figure 2 illustrates the marginal posterior mean of β_{g1} for a group of size two as a function of a_g, b_g, y_{g1}, and y_{g2}. When a_g and b_g are close to zero, the thresholding effect on the marginal posterior mean of β_{g1} hardly depends on the value of y_{g2}, indicating highly individualistic shrinkage. This corroborates our intuition from looking at the joint prior distribution of the shrinkage weights within the same group. The second major observation is that as b_g moves away from zero, the marginal posterior mean of β_{g1} becomes increasingly more dependent on the value of y_{g2}. In particular, if we look at the case when a_g = 0.05 and b_g = 2, we see that when y_{g2} = 0 the thresholding effect on β_{g1} is much stronger when compared to y_{g2} = 10. The last major observation is that as a_g moves away from 0, the thresholding effect becomes weaker. Therefore, a_g effectively controls the overall strength of the shrinkage, whereas b_g generally controls the dependence of the within-group shrinkage.

[Figure 2 appears here: eight panels plotting the marginal posterior mean of β_{g1} against y_{g1} for combinations of a_g ∈ {0.05, 0.5} and b_g ∈ {0.05, 0.5, 1, 2}, with curves for y_{g2} = 0 and y_{g2} = 10.]

Figure 2: Marginal posterior mean of β_{g1} for a group with two observations as a_g, b_g, y_{g1}, and y_{g2} vary. Here, τ² = 0.… and σ² = 1 are fixed.
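The role of b_g in coupling the within-group shrinkage factors can also be made concrete by simulating κ_g from its prior for a group of size two and computing the correlation between κ_{g1} and κ_{g2}. The sketch below is illustrative only (the hyperparameter values are chosen for the demonstration, not taken from the paper):

```python
import numpy as np

def kappa_prior_draws(a_g, b_g, p_g=2, tau2=1.0, sigma2=1.0, n=200_000, seed=0):
    """Draw shrinkage factors kappa_gj = sigma^2 / (sigma^2 + tau^2 gamma_g^2 lambda_gj^2)."""
    rng = np.random.default_rng(seed)
    gamma2 = rng.gamma(a_g, 1.0, n)                # shared group shrinkage parameter
    lam2 = 1.0 / rng.gamma(b_g, 1.0, (n, p_g))    # independent local shrinkage parameters
    return sigma2 / (sigma2 + tau2 * gamma2[:, None] * lam2)

# within-group shrinkage correlation grows with b_g (i.e., with a_g + p_g * b_g)
k_small = kappa_prior_draws(a_g=0.5, b_g=0.1)
k_large = kappa_prior_draws(a_g=0.5, b_g=2.0)
corr_small = np.corrcoef(k_small[:, 0], k_small[:, 1])[0, 1]
corr_large = np.corrcoef(k_large[:, 0], k_large[:, 1])[0, 1]
print(corr_small, corr_large)   # the second correlation is substantially larger
```

With b_g small, the heavy-tailed local parameters dominate and the two shrinkage factors behave almost independently; with b_g large, the shared γ²_g drives both factors together.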
3. THEORETICAL PROPERTIES
Let X_n = [X_1, ..., X_{G_n}] and let H_n = {a, b} denote the collection of hyperparameters, where a = {a_1, ..., a_{G_n}} and b = {b_1, ..., b_{G_n}}. Here, the subscript n in G_n refers to the fact that the number of groups in the covariate space is growing as a function of the sample size. Furthermore, let A_n = {(g, j) : β_{0,gj} ≠ 0} denote the true active set with cardinality |A_n|. Then Theorem 3.1 states that the posterior distribution of β_n under the GIGG prior is consistent a posteriori for the true β_{0n}. Similarly, we add a subscript n to β_n and β_{0n} to indicate that the number of regression coefficients is growing as a function of the sample size.

Theorem 3.1.
Suppose that p_n = o(n), that L_n = sup_{(g,j)} |β_{0,gj}| < ∞, where β_{0,gj} denotes the true j-th regression coefficient in the g-th group, that 0 < lim_{n→∞} inf H_n ≤ lim_{n→∞} sup H_n < ∞, and that |A_n| = o(n/log(n)). Further, suppose that the smallest and largest singular values of X_n, denoted by θ_{n,min}(X_n) and θ_{n,max}(X_n), satisfy 0 < lim inf_{n→∞} θ_{n,min}(X_n)/√n ≤ lim sup_{n→∞} θ_{n,max}(X_n)/√n < ∞. Then for any ε > 0,
$$\pi_n\big(\boldsymbol{\beta}_n : \|\boldsymbol{\beta}_n - \boldsymbol{\beta}_{0n}\| < \epsilon \mid \mathbf{y}_n, H_n, \tau_n^2, \sigma^2\big) \to 1$$
almost surely as n → ∞, provided that τ²_n = C/(p_n n^ρ log(n)) for some ρ, C ∈ (0, ∞).

Proof. See Appendix 8.3.

Of note, the only restrictions placed on the values of the hyperparameters in Theorem 3.1 are that they do not converge to the boundary of the hyperparameter space as n → ∞.

Remark 3.2.
Theorem 3.1 is a generalization of Theorem 5 in Armagan et al. (2013b), which proved posterior consistency for the NBP prior when b_g ∈ (1, ∞). Restricting b_g ∈ (1, ∞) was done to utilize an argument which required the existence of the second moment of β_{gj}, but it does not cover special cases of particular interest such as the horseshoe prior. Therefore, our result extends the existing posterior consistency result from Armagan et al. (2013b) to a more general collection of hyperparameter values with potential grouping structure.

Next, we partially extend the posterior concentration theoretical framework for the sparse normal means model, developed in Section 3.2 of Datta and Ghosh (2013), to a low-dimensional linear regression (p < n) model with general correlation structure. Going forward, we will drop the subscript n from the notation introduced in the statement of Theorem 3.1 to clarify that the subsequent theoretical results hold for fixed p.

Theorem 3.3.
Fix ε ∈ (0, 1), p, and n, such that p < n. Further, suppose that the smallest and largest singular values of XᵀX, denoted by θ_min(XᵀX) and θ_max(XᵀX), satisfy 0 < θ_min(XᵀX) ≤ θ_max(XᵀX) < ∞. The full conditional posterior mean corresponding to the GIGG prior is
$$E[\boldsymbol{\beta} \mid \cdot] = \Big(I_p + (X^\top X)^{-1} \sigma^2 \tau^{-2} \Gamma^{-1} \Lambda^{-1}\Big)^{-1} \hat{\boldsymbol{\beta}}^{OLS}, \qquad \hat{\boldsymbol{\beta}}^{OLS} = (X^\top X)^{-1} X^\top \mathbf{y}.$$
Then the inequality
$$\Big\| \hat{\boldsymbol{\beta}}^{OLS} - E[\boldsymbol{\beta} \mid \cdot] \Big\|_2 \geq \left( \frac{1}{1 + \theta_{\max}(X^\top X)\, \sigma^{-2} \tau^2 \max_{(g,j)} \gamma_g^2 \lambda_{gj}^2} \right) \Big\| \hat{\boldsymbol{\beta}}^{OLS} \Big\|_2$$
holds and we have the following results:

a) $\pi\left( \dfrac{1}{1 + \theta_{\max}(X^\top X)\, \sigma^{-2} \tau^2 \max_{(g,j)} \gamma_g^2 \lambda_{gj}^2} \geq \epsilon \,\middle|\, \mathbf{y}, H, \tau^2, \sigma^2 \right) \to 1$ as τ² → 0.

b) $\pi\left( \big\| \hat{\boldsymbol{\beta}}^{OLS} - E[\boldsymbol{\beta} \mid \cdot] \big\|_2 \geq \epsilon \big\| \hat{\boldsymbol{\beta}}^{OLS} \big\|_2 \,\middle|\, \mathbf{y}, H, \tau^2, \sigma^2 \right) \to 1$ as τ² → 0.

Proof. See Appendix 8.4.

Theorem 3.3 states that, irrespective of the correlation structure, as τ² → 0 the distance between the posterior mean and the ordinary least squares estimator approaches the full length of the ordinary least squares estimator; that is, the entire posterior mean vector is shrunk toward zero.

Corollary 3.1.
Suppose that the covariates in X satisfy X_gᵀX_{g′} = 0 for all g ≠ g′, where 0 denotes a p_g × p_{g′} matrix of zeros. If τ², σ², and a_g ∈ (0, ∞) are fixed, then there exists a constant
$$\epsilon_g(\tau^2, \sigma^2) = \frac{\sigma^2}{\sigma^2 + \theta_{\max}(X_g^\top X_g)\, \tau^2},$$
such that for all δ ∈ (0, ε_g(τ², σ²)),
$$\pi\left( \big\| \hat{\boldsymbol{\beta}}_g^{OLS} - E[\boldsymbol{\beta}_g \mid \cdot] \big\|_2 \geq \delta \big\| \hat{\boldsymbol{\beta}}_g^{OLS} \big\|_2 \,\middle|\, \mathbf{y}, H, \tau^2, \sigma^2 \right) \to 1$$
as b_g → ∞.

Proof. See Appendix 8.5.

The conclusion of Corollary 3.1 is that if the hyperparameter b_g → ∞, then there is at least some amount of shrinkage relative to the ordinary least squares estimator in the g-th group. If τ²/σ² is close to zero, then ε_g(τ², σ²) ≈ 1, implying shrinkage of the posterior mean towards zero. Therefore, we can interpret the case when b_g → ∞ and τ²/σ² close to zero as shrinkage of the entire g-th group towards zero.

Although we would ideally consider additional posterior concentration results within the context of a linear regression model, there is not an analytically tractable analog of componentwise shrinkage factors for a general design matrix without any orthogonality. Therefore, we will proceed by considering posterior concentration results within the sparse normal means framework, to make precise statements regarding componentwise shrinkage, as opposed to shrinkage of the entire L₂-norm.

One question that arises is whether the dependence induced between the β_{gj}'s by γ²_g will overly dominate the individual-level shrinkage. As an example, one can conceptualize a case where a group has only one signal, which is overly shrunk by virtue of being grouped with an overwhelming majority of null means. Theorem 3.4a states that if the gl-th observation is sufficiently large, then there will be minimal shrinkage on y_{gl}. This guarantees that group shrinkage will not overly dominate individual shrinkage if the observation is large. Conversely, Theorem 3.4b states that if the global shrinkage parameter converges to zero, then the GIGG prior will sufficiently shrink the y_{gl}'s toward zero. Let y_g = (y_{g1}, ..., y_{gp_g})ᵀ.

Theorem 3.4.
Suppose that p_g ∈ {1, 2, ...}.

a) Fix ψ, δ ∈ (0, 1). Then there exists a function h₁(p_g, τ², σ², a_g, b_g, ψ, δ) such that
$$\pi(\kappa_{gl} > \psi \mid \mathbf{y}_g, \tau^2, \sigma^2, a_g, b_g) \leq \exp\left( -\frac{\psi(1-\delta)}{2\sigma^2}\, y_{gl}^2 + \frac{\psi\delta}{2\sigma^2} \sum_{j \neq l} y_{gj}^2 \right) h_1(p_g, \tau^2, \sigma^2, a_g, b_g, \psi, \delta).$$
Consequently, if |y_{gl}| → ∞, then π(κ_{gl} ≤ ψ | y_g, τ², σ², a_g, b_g) → 1.

b) Fix ε ∈ (0, 1). Then there exists a function h₂(p_g, σ², y_g, a_g, b_g, ε) such that
$$\pi(\kappa_{gl} < \epsilon \mid \mathbf{y}_g, \tau^2, \sigma^2, a_g, b_g) \leq \left( \frac{\tau^2}{\sigma^2} \right)^{p_g/2 + b_g} \left( \min\left(1, \frac{\tau^2}{\sigma^2}\right) \right)^{-p_g/2} h_2(p_g, \sigma^2, \mathbf{y}_g, a_g, b_g, \epsilon).$$
Consequently, π(κ_{gl} ≥ ε | y_g, τ², σ², a_g, b_g) → 1 as τ² → 0.

Proof. See Appendices 8.6 and 8.7.

The theoretical statements outlined in Theorem 3.4 were originally discussed for the horseshoe prior (Datta and Ghosh, 2013), but have also been used in the context of several other continuous shrinkage priors (Datta and Dunson, 2016; Bhadra et al., 2017; Bai and Ghosh, 2019), dynamic trend filtering (Kowal et al., 2019), and small area estimation (Tang et al., 2018a). Examining Theorem 3.4, we first note that neither result restricts the range of values a_g and b_g can take. Therefore, Theorem 3.4 applies to a more general class of hyperparameter values than those considered in Bai and Ghosh (2019). Secondly, the rate at which the upper bound in Theorem 3.4b converges to zero as τ² → 0 depends on b_g, with larger values of b_g corresponding to a tighter upper bound. To better understand the role of b_g we have the following result.

Corollary 3.2.
Suppose that p_g ∈ {1, 2, ...}. If τ², σ², and a_g ∈ (0, ∞) are fixed, then there exists a constant
$$\epsilon(\tau^2, \sigma^2, p_g) = \left(1 + \frac{\tau^2}{\sigma^2}\, p_g^{p_g}\right)^{-1},$$
such that π(κ_{gl} < ε(τ², σ², p_g) | y_g, τ², σ², a_g, b_g) → 0 as b_g → ∞.

Proof. See Appendix 8.8.

The conclusion of Corollary 3.2 is nearly identical to the conclusion of Corollary 3.1, as τ²/σ² controls the degree of shrinkage provided as a result of taking b_g → ∞.
4. COMPUTATION
The full conditional updates corresponding to model (1), where β is endowed with a GIGG prior, are enumerated in Appendix 8.9. Following Polson and Scott (2011), we assign a half-Cauchy prior scaled by the residual error standard deviation, τ | σ ∼ C⁺(0, σ), and use a prevalent data augmentation trick,
$$[\tau^2 \mid \nu] \sim IG(1/2, 1/\nu), \qquad [\nu \mid \sigma^2] \sim IG(1/2, 1/\sigma^2),$$
to obtain closed-form full conditional updates for τ² and σ² (Makalic and Schmidt, 2016).

There are two major computational bottlenecks for the proposed algorithm. The first is the full conditional update of β,
$$[\boldsymbol{\beta} \mid \cdot] \sim N\left( Q^{-1} \frac{1}{\sigma^2} X^\top \big( \mathbf{y} - C\boldsymbol{\alpha} \big),\; Q^{-1} \right), \qquad Q = \frac{1}{\sigma^2} X^\top X + \frac{1}{\tau^2} \Gamma^{-1} \Lambda^{-1}.$$
The second occurs when there are a multitude of group and local parameters that need to be drawn at each iteration of the Gibbs sampler, which is often the case in "large p" scenarios. Rather than naïvely sampling from the full conditional distributions, there are several strategies to achieve faster posterior computation with both of these computational challenges in mind:

• Draw v ∼ N(σ⁻²Xᵀ(y − Cα), Q), and then solve Qβ = v, rather than explicitly calculating Q⁻¹.

• For "small n, large p" problems, the Woodbury identity can be utilized so that the full conditional update of β scales linearly in p (Bhattacharya et al., 2016).

• If n and p are both large, say on the order of 10,000 each, there are several recently developed approximation approaches: one exploits the ability of the horseshoe prior to shrink τ²λ²_{gj} close to zero (Johndrow et al., 2020), while another uses a conjugate gradient algorithm to find an approximate solution to Qβ = v (Nishimura and Suchard, 2020).

• Parallelization can be used within the Gibbs sampler to simultaneously update the shrinkage parameters corresponding to each group (Terenin et al., 2019).
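The first bullet is the standard precision-based sampling identity: if v ∼ N(m, Q), then Q⁻¹v ∼ N(Q⁻¹m, Q⁻¹), which is exactly the full conditional of β. A sketch with a Cholesky factorization (illustrative dimensions and a stand-in diagonal for Γ⁻¹Λ⁻¹; not the paper's implementation):

```python
import numpy as np
from scipy.linalg import cholesky, cho_factor, cho_solve

rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = X @ np.r_[np.ones(2), np.zeros(p - 2)] + rng.standard_normal(n)
sigma2, tau2 = 1.0, 0.1
# stand-in draws for gamma_g^2 * lambda_gj^2, clipped for numerical stability
d = np.clip(rng.gamma(0.5, 1.0, p) / rng.gamma(0.5, 1.0, p), 1e-4, 1e4)

Q = X.T @ X / sigma2 + np.diag(1.0 / (tau2 * d))   # posterior precision of beta
m = X.T @ y / sigma2                               # no adjustment covariates C in this sketch

L = cholesky(Q, lower=True)
v = m + L @ rng.standard_normal(p)                 # v ~ N(m, Q) since Cov(L w) = L L^T = Q
beta_draw = cho_solve(cho_factor(Q, lower=True), v)  # solve Q beta = v; Q^{-1} is never formed
```

Averaging many such draws recovers the full conditional mean Q⁻¹m, while each draw has covariance Q⁻¹.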
If the modeler wants to remain relatively agnostic to the choice of hyperparameters, one can use marginal maximum likelihood estimation (MMLE) (Casella, 2001), an empirical Bayes approach executed iteratively within the Gibbs sampler. The (l+1)-th update is
$$a_g^{(l+1)} = \psi^{-1}\Big( E_{a_g^{(l)}}\big[ \log(\gamma_g^2) \mid \mathbf{y} \big] \Big), \qquad b_g^{(l+1)} = \psi^{-1}\Big( -\frac{1}{p_g} \sum_{j=1}^{p_g} E_{b_g^{(l)}}\big[ \log(\lambda_{gj}^2) \mid \mathbf{y} \big] \Big),$$
where ψ(·) is the digamma function and the expectation terms can be estimated through standard Monte Carlo methods. The iterative procedure terminates when $\sum_{g=1}^{G} \big(a_g^{(l+1)} - a_g^{(l)}\big)^2 + \sum_{g=1}^{G} \big(b_g^{(l+1)} - b_g^{(l)}\big)^2$ is less than some prespecified error tolerance. However, in our experience it is preferable to fix a_g = 1/n for all g and use MMLE to estimate only the b_g hyperparameters. The first reason is that a_g controls the strength of the thresholding effect, and choosing a_g close to zero guarantees strong shrinkage of null coefficients towards zero. The second reason is that estimating one hyperparameter per group is more feasible than estimating two hyperparameters per group, particularly when the number of groups is large. Since b_g primarily controls how correlated the shrinkage is within group, it is more important to focus estimation on the b_g hyperparameters. We do recognize that setting a_g = 1/n violates a condition in Theorem 3.1, where the infimum of the set of hyperparameters cannot converge to zero as n → ∞. However, for practical purposes, this approach provides an automatic way to set a_g while also yielding results similar to fixing a_g at a small constant close to zero. Although MMLE works well when the number of groups G is small relative to the sample size, the estimates for the a_g's and b_g's will become increasingly variable in high-dimensional settings where the number of groups is large.
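The update requires inverting the digamma function; ψ⁻¹ has no closed form, but Newton's method with Minka's initialization converges in a few iterations. A sketch of the b_g update (the draws of λ²_{gj} are simulated from the prior here purely to illustrate the calculation; in practice they would come from the Gibbs sampler):

```python
import numpy as np
from scipy.special import digamma, polygamma

def inv_digamma(y, iters=30):
    """Solve digamma(x) = y for x > 0 via Newton's method (Minka's initialization)."""
    x = np.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y - digamma(1.0))
    for _ in range(iters):
        x -= (digamma(x) - y) / polygamma(1, x)   # polygamma(1, .) is the trigamma function
    return x

# sanity check of the inverse
print(inv_digamma(digamma(2.5)))   # ~ 2.5

# illustrative b_g update: draws of lambda_gj^2 simulated from IG(b_true, 1)
rng = np.random.default_rng(4)
b_true = 1.5
lam2_draws = 1.0 / rng.gamma(b_true, 1.0, 1_000_000)
b_update = inv_digamma(-np.mean(np.log(lam2_draws)))
print(b_update)   # ~ 1.5, since E[log lambda^2] = -digamma(b) under IG(b, 1)
```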
There may also be low-dimensional settings where the user wants to incorporate explicit prior knowledge about the nature of the within-group signal density. In such cases, it may be preferable to fix hyperparameter values in accordance with subject-matter expertise. As with the modified MMLE approach, we recommend setting a_g = 1/n for all g. To fix b_g we recommend a useful heuristic whereby local, group, and global shrinkage parameters are simulated from the GIGG prior. Using the simulated shrinkage parameters, shrinkage factors can be constructed and the correlation between shrinkage factors within the same group can be empirically calculated. Selecting the hyperparameter b_g is then equivalent to selecting how correlated the shrinkage is within group, a more easily understandable concept. Implementations of GIGG regression with fixed hyperparameters and with hyperparameters estimated via MMLE are available on Github.
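This heuristic can be implemented directly: simulate shrinkage factors from the GIGG prior over a grid of b_g values and select the value whose empirical within-group shrinkage correlation is closest to a target. The sketch below is one possible implementation of the idea (the grid, group size, and default a_g are illustrative choices, not the paper's):

```python
import numpy as np

def shrinkage_corr(a_g, b_g, p_g=2, tau2=1.0, sigma2=1.0, n=200_000, seed=5):
    """Empirical prior correlation between two shrinkage factors in the same group."""
    rng = np.random.default_rng(seed)
    gamma2 = rng.gamma(a_g, 1.0, n)
    lam2 = 1.0 / rng.gamma(b_g, 1.0, (n, p_g))
    kappa = sigma2 / (sigma2 + tau2 * gamma2[:, None] * lam2)
    return np.corrcoef(kappa[:, 0], kappa[:, 1])[0, 1]

def pick_b(target_corr, a_g=0.1, grid=(0.1, 0.25, 0.5, 1.0, 2.0, 4.0)):
    """Choose the grid value of b_g whose prior shrinkage correlation is closest to target."""
    corrs = [shrinkage_corr(a_g, b) for b in grid]
    return grid[int(np.argmin([abs(c - target_corr) for c in corrs]))]

print(pick_b(0.5))   # returns a b_g from the grid
```

The user then reasons about a target correlation, an interpretable quantity, rather than about b_g directly.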
5. SIMULATIONS
The data generative mechanism is linear regression model (1), where C includes the intercept term and five adjustment covariates drawn from independent standard normal distributions, α is a fixed coefficient vector, and X is drawn from a multivariate normal distribution with mean 0 and covariance matrix Σ_X. Σ_X is determined such that the features have unit variance and a block-diagonal exchangeable correlation structure. For all simulation settings, n = 500 and p = 50, with the 50 covariates evenly divided into five groups. Pairwise correlations within each group are set to a common value ρ, with one high correlation setting and one medium correlation setting. The residual variance σ² is fixed such that the proportion of variance explained, β⊤Σ_Xβ/(β⊤Σ_Xβ + σ²), is held constant. We consider two fixed configurations of β, which will be qualitatively referred to as the concentrated signal setting and the distributed signal setting. In the concentrated signal setting, there is only one true signal in each of the five groups, with magnitudes that increase across the groups. Rather than having within-group sparsity, the distributed signal setting assumes that the signal is shared across all members of the first group, with a common nonzero magnitude for the first ten coefficients and zero coefficients elsewhere. The purpose of the fixed coefficient simulation settings is to ascertain which methods perform well when the within-group signal is sparse or dense.

Beyond the fixed regression coefficient simulation settings, we also consider a random coefficient simulation in the high correlation setting, where for each simulation iteration a random regression coefficient vector is generated. To construct a regression coefficient vector, we start by randomly selecting either a concentrated or distributed signal for the first group with even probability, guaranteeing that each simulation iteration has at least one true signal. The concentrated and distributed signal magnitudes are selected such that the contributions to β⊤Σ_Xβ are equal, namely the distributed signal is β_gj = 0.25 for j = 1, ..., 10 and the concentrated signal is β_g1 = 5.125^(1/2) and β_gj = 0 for j = 2, ..., 10. For the other four groups, we randomly select a concentrated signal with probability 0.2, a distributed signal with probability 0.2, and no signal with probability 0.6. The goal of the random coefficient simulation setting is to show that, averaged across many combinations of regression coefficient vectors comprised of sparse within-group signals, dense within-group signals, and inactive groups, GIGG regression with MMLE results in the lowest mean-squared error.
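This generative mechanism can be sketched as follows. The value of ρ and the coefficient vector are left as arguments since they vary across settings; ρ = 0.8 below is only an illustrative default and the function names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def block_exchangeable_cov(p, group_size, rho):
    """Unit-variance covariance with block-diagonal exchangeable correlation."""
    Sigma = np.zeros((p, p))
    for start in range(0, p, group_size):
        block = np.full((group_size, group_size), rho)
        np.fill_diagonal(block, 1.0)  # unit variances on the diagonal
        Sigma[start:start + group_size, start:start + group_size] = block
    return Sigma

def simulate(n=500, p=50, group_size=10, rho=0.8, beta=None, sigma=1.0):
    """Draw X ~ N(0, Sigma_X) with block-correlated predictors and y = X beta + eps."""
    Sigma = block_exchangeable_cov(p, group_size, rho)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p) if beta is None else np.asarray(beta, dtype=float)
    y = X @ beta + rng.normal(scale=sigma, size=n)
    return X, y, Sigma
```

Swapping in a concentrated or distributed `beta` reproduces the two fixed-coefficient configurations described above.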
Estimation properties are evaluated based on empirical mean-squared error (MSE), stratified by null and non-null coefficients, across 5000 replicates. In the random coefficient simulations, calculating the MSE corresponds to an integrated mean-squared error (IMSE) metric averaged across the generative distribution of the regression coefficient vectors. For the fixed coefficient simulations we consider several special cases of the GIGG prior with fixed hyperparameters, namely all possible combinations of a_g ∈ {1/n, 1/2} and b_g ∈ {1/n, 1/2, 1}. That way, we can check whether the intuition gleaned from Figure 2 empirically translates to the regression setting. We also consider the GIGG prior with a_g = 1/n fixed and b_g estimated via MMLE.

The list of competing methods includes ordinary least squares (OLS), horseshoe regression, group horseshoe regression (Xu et al., 2016), the Spike-and-Slab Lasso (Rockova and George, 2018), the Bayesian Group Lasso with Spike-and-Slab Priors (BGL-SS) (Xu and Ghosh, 2015), and Bayesian Sparse Group Selection with Spike-and-Slab Priors (BSGS-SS) (Xu and Ghosh, 2015). To avoid confusion with the group horseshoe prior proposed in this paper, we refer to the group horseshoe prior from Xu et al. (2016) as the group half Cauchy prior throughout the rest of the simulation section. Most methods requiring Markov chain Monte Carlo (MCMC) sampling use 10000 burn-in draws, followed by 10000 posterior draws with no thinning. The only exceptions are BGL-SS and BSGS-SS, which use 1000 burn-in draws and 2000 posterior draws without any thinning, due to their relatively slower posterior sampling algorithms. Table 1 presents the MSE for the high correlation simulation settings and Table 2 lists the MSE for the medium correlation simulation settings. Because the results for the high correlation and medium correlation settings are similar, we focus our discussion on the high correlation simulation settings.
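The stratified MSE metric can be computed directly from the replicate-level estimates; a minimal sketch (the function name is ours):

```python
import numpy as np

def stratified_mse(beta_hats, beta_true):
    """Empirical MSE across replicates, stratified by null / non-null coefficients.

    beta_hats: (replicates, p) array of estimates; beta_true: (p,) true vector."""
    beta_hats = np.asarray(beta_hats, dtype=float)
    beta_true = np.asarray(beta_true, dtype=float)
    err2 = (beta_hats - beta_true) ** 2      # squared error per replicate and coefficient
    null = beta_true == 0
    return {"null": err2[:, null].mean(), "non_null": err2[:, ~null].mean()}
```

In the random coefficient simulations, averaging the same quantity over draws of the coefficient vector yields the IMSE.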
Method                          Concentrated        Distributed
                                Null    Non-Null    Null    Non-Null
Ordinary Least Squares          3.74    0.41        8.09    2.03
Horseshoe                       0.51    0.41        0.85    2.14
GIGG (a_g = 1/n, b_g = 1/n)     --      --          --      --
GIGG (a_g = 1/2, b_g = 1/n)     --      --          --      --
GIGG (a_g = 1/n, b_g = 1/2)     0.29    0.39        --      --
GIGG (a_g = 1/2, b_g = 1/2)     0.33    0.40        0.24    1.70
GIGG (a_g = 1/n, b_g = 1)       0.53    0.49        --      --
GIGG (a_g = 1/2, b_g = 1)       0.58    0.49        0.26    1.43
GIGG (MMLE)                     --      --          --      --
Group Half Cauchy               0.30    0.39        0.08    1.64
Spike-and-Slab Lasso            --      --          --      --
BGL-SS                          --      --          --      --
BSGS-SS                         0.23    0.42        0.04    1.84

Table 1: Mean-squared errors (MSE) for the fixed regression coefficient simulation settings (n = 500, p = 50) with high pairwise correlations.

The first noteworthy observation is that the GIGG priors with fixed hyperparameters behave exactly as Figure 2 suggests: the best performance corresponds to b_g = 1/n when the signal is concentrated within-group (Null MSE = 0.11, Non-Null MSE = 0.30) and to a_g = 1/n, b_g = 1 when the signal is distributed within-group (Null MSE = 0.03, Non-Null MSE = 1.43). However, if the user sets b_g = 1 when the signal is concentrated (Null MSE = 0.53, Non-Null MSE = 0.49) or b_g = 1/n when the signal is distributed (Null MSE = 0.03, Non-Null MSE = 3.59), then the "incorrect" prior information results in notably worse MSE compared to the "correct" prior information. That being said, b_g = 1/2 offers a reasonable compromise between the two settings.
Null Non-Null Null Non-NullOrdinary Least Squares 1.88 0.21 3.20 0.79Horseshoe 0.29 0.21 0.52 0.94GIGG ( a g = 1 /n, b g = 1 /n ) a g = 1 / , b g = 1 /n ) a g = 1 /n, b g = 1 /
2) 0.15 0.22 *GIGG ( a g = 1 / , b g = 1 /
2) 0.18 0.21 0.16 0.73GIGG ( a g = 1 /n, b g = 1) 0.29 0.26 GIGG ( a g = 1 / , b g = 1) 0.33 0.25 0.16 0.66GIGG (MMLE) Group Half Cauchy 0.17 0.21 0.07 0.71Spike-and-Slab Lasso
BSGS-SS 0.10 0.22 0.01 0.81Table 2: Mean-squared errors (MSE) for the fixed regression coefficient simulation settings( n = 500 , p = 50) with medium pairwise correlations ( ρ = 0 . a g = 1 / b g = 1 /
6. DATA EXAMPLE
The National Health and Nutrition Examination Survey (NHANES) is a collection of studies conducted by the National Center for Health Statistics with the overarching goal of evaluating the health and nutritional status of the United States' populace. Data collection consists of a written survey and physical examination which record demographic, socioeconomic, dietary, and health-related information, including physiological measurements and laboratory tests. We apply GIGG regression to a subset of 990 adults from the 2003-2004 NHANES cycle with 35 measured contaminants across five exposure classes: metals, phthalates, organochlorine pesticides, polybrominated diphenyl ethers (PBDEs), and polycyclic aromatic hydrocarbons (PAHs). Figure 1 illustrates the block diagonal correlation structure of these exposures, where areas of high correlation are mostly contained within exposure class. Gamma glutamyl transferase (GGT), an enzymatic marker of liver functionality, will be the outcome of interest. GGT and all environmental exposures were log-transformed to remove right skewness.
Method                      Null    Non-Null
Ordinary Least Squares      8.84    3.38
Horseshoe                   0.70    1.18
Group Horseshoe             --      --
Group Half Cauchy           --      --
GIGG (MMLE)                 --      --
Spike-and-Slab Lasso        0.16    3.65
BGL-SS                      2.84    2.44
BSGS-SS                     0.36    1.45

Table 3: Integrated mean-squared errors (IMSE) for the random regression coefficient simulation setting (n = 500, p = 50) with high pairwise correlations.
Horseshoe regression had potential scale reduction factor (PSRF) (Gelman and Rubin, 1992) values of approximately 1.00 for all regression coefficients, and GIGG regression had PSRF values close to 1.00, indicating adequate MCMC convergence.

[Figure 3 plots, for each exposure within the metals, phthalate, pesticide, PBDE, and PAH classes, the estimated percent change in GGT for a twofold change in exposure, with results shown for GIGG regression, linear regression, ridge regression, and horseshoe regression.]
Figure 3: Estimated associations between environmental toxicants (metals, phthalates, pesticides, PBDEs, and PAHs) and gamma glutamyl transferase (GGT) from NHANES 2003-2004 (n = 990).

GIGG regression yields narrower credible intervals than horseshoe regression for exposure classes, such as the PAHs and PBDEs, that have high pairwise correlations and common estimated effect sizes. However, the metals exposure class, which has weak pairwise correlations and heterogeneous estimated effect sizes, results in a median credible interval length of 0.31 for GIGG regression and 0.30 for horseshoe regression. The example indicates that by leveraging grouping information GIGG regression achieves appreciable efficiency gains for groups with multicollinearity issues and homogeneous effect sizes, but does not provide an improvement for groups with weak correlations and heterogeneous effect sizes. Additionally, from a computational perspective, GIGG regression generated a median effective sample size of 590.4 per second, compared to a median effective sample size of 78.9 per second for horseshoe regression from the horseshoe package in R.
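Because both GGT and the exposures are log-transformed, a regression coefficient β translates to the percent change in GGT per doubling of exposure via 100(2^β − 1); a one-line sketch of the conversion used for the figure's x-axis (the function name is ours):

```python
def percent_change_per_doubling(beta):
    """Percent change in GGT when an exposure doubles, in a log-log model.

    If log(GGT) = ... + beta * log(exposure), doubling the exposure multiplies
    GGT by 2**beta, i.e. a 100 * (2**beta - 1) percent change."""
    return 100.0 * (2.0 ** beta - 1.0)
```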
7. DISCUSSION
The principal methodological contribution of this paper is to construct a continuous shrinkage prior that improves regression coefficient estimation in the presence of grouped covariates. GIGG regression flexibly controls the relative contributions of individual and group shrinkage to improve regression coefficient estimation, resulting in a relative IMSE reduction of 72.8% for the null coefficients and 7.5 times more efficient computation compared to the primary horseshoe regression implementation in R. One of the main limitations of GIGG regression is that covariate groupings must be explicitly specified and may not overlap. Additionally, although the GIGG prior can be imposed on regression coefficients in Bayesian generalized linear models, a theoretical evaluation of the shrinkage properties for non-normal outcome data would be necessary to determine if the GIGG prior is appropriate for such models. We are currently working on an R package to implement GIGG regression that, upon completion, will be added to the Comprehensive R Archive Network. For a preliminary version of the R package, visit GitHub.

The analysis of multiple pollutant data and chemical mixtures is a key thrust of the National Institute of Environmental Health Sciences, and the GIGG prior provides a useful framework for achieving variance reduction in the presence of group-correlated exposures, characterizing uncertainties in point estimates, and constructing policy relevant metrics, like summary risk scores, in a principled way. However, the generality of the GIGG prior coupled with the relative ease of computation means that, despite its motivation coming from environmental epidemiology, the GIGG prior is applicable to many other areas. For example, in neuroimaging studies, scalar-on-image regression (Kang et al., 2018) has been widely used to study the association between brain activity and clinical outcomes of interest.
The whole brain can be partitioned into a set of exclusive regions according to brain function and anatomical structure. Within the same region, the brain imaging biomarkers tend to be more correlated and to have similar effects on the outcome variable. The GIGG prior can be extended to scalar-on-image regression, where it has great potential to improve estimation of the effects of imaging biomarkers by incorporating brain region information.

In this paper, our focus was sparse estimation, but it is also natural to inquire about uncertainty quantification and variable selection. Based on our simulations, the conclusions of van der Pas et al. (2017) are relevant for the GIGG prior when 0 < a_g ≤ 1/2, but a comprehensive study needs to be carried out. There is no consensus way of defining variable selection for continuous shrinkage priors; however, there are several approaches to determine a final active set, including two-means clustering (Bhattacharya et al., 2015), credible intervals covering zero (van der Pas et al., 2017), thresholding shrinkage factors (Tang et al., 2018b), decoupling shrinkage and selection (DSS) (Hahn and Carvalho, 2015), and penalized credible regions (Zhang and Bondell, 2018). For horseshoe-style shrinkage, variable selection defined through credible intervals covering zero is conservative, but works well if one wants to limit the number of false discoveries. The two-means clustering heuristic does not necessarily result in consistent variable selection, and the shrinkage factor thresholding approach is restricted to applications where p < n. The penalized credible region approach searches for the sparsest model that falls within the 100 × (1 − α)% joint elliptical credible region, while DSS constructs an adaptive lasso-style objective function with the goal of sparsifying the posterior mean such that most of the predictive variability is still explained. Since the DSS construction is framed from a prediction perspective, this approach may not be ideal for regression coefficient estimation problems in the presence of correlated features. Another crucial point is that if one is interested in selection, the posterior mode estimator for the horseshoe prior will result in exact zero estimates, and an approximate algorithm for calculating the posterior mode was developed in Bhadra et al. (2019) using the horseshoe-like prior. Therefore, one could conceptualize an extension of the expectation-maximization algorithm developed by Bhadra et al. (2019) using a "GIGG-like" prior. Further work is needed to juxtapose the behavior of all of these different methods for selection and to develop novel algorithms for calculating the posterior mode.

Acknowledgements
Dr. Datta acknowledges support from the National Science Foundation (DMS-2015460). Dr. Kang acknowledges support from the National Institutes of Health (R01 DA048993; R01 GM124061; R01 MH105561). Dr. Mukherjee acknowledges support from the National Science Foundation (DMS-1712933) and the National Institutes of Health (R01 HG008773-01).

References
Abramowitz, M. and Stegun, I. A. (1972). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. U.S. Government Printing Office, Washington, D.C., 10th edition.

Andrade, J. A. A. and O'Hagan, A. (2006). Bayesian robustness modeling using regularly varying distributions. Bayesian Analysis, 1(1):169-188.

Armagan, A., Dunson, D. B., and Lee, J. (2013a). Generalized double Pareto shrinkage. Statistica Sinica, 23(1):119-143.

Armagan, A., Dunson, D. B., Lee, J., Bajwa, W. U., and Strawn, N. (2013b). Posterior consistency in linear models under shrinkage priors. Biometrika, 100(4):1011-1018.

Bai, R. and Ghosh, M. (2019). Large-scale multiple hypothesis testing with the normal-beta prime prior. Statistics, 53(6):1210-1233.

Barndorff-Nielsen, O., Kent, J., and Sørensen, M. (1982). Normal variance-mean mixtures and z distributions. International Statistical Review, 50(2):145-159.

Bhadra, A., Datta, J., Polson, N. G., and Willard, B. (2016). Default Bayesian analysis with global-local shrinkage priors. Biometrika, 103(4):955-969.

Bhadra, A., Datta, J., Polson, N. G., and Willard, B. (2017). The horseshoe+ estimator of ultra-sparse signals. Bayesian Analysis, 12(4):1105-1131.

Bhadra, A., Datta, J., Polson, N. G., and Willard, B. T. (2019). The horseshoe-like regularization for feature subset selection. Sankhya B.

Bhattacharya, A., Chakraborty, A., and Mallick, B. K. (2016). Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika, 103(4):985-991.

Bhattacharya, A., Pati, D., Pillai, N. S., and Dunson, D. B. (2015). Dirichlet-Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110(512):1479-1490.

Bingham, N. H., Goldie, C. M., and Teugels, J. L. (1989). Regular Variation, vol. 27 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, UK.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2009). Handling sparsity via the horseshoe. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, PMLR, 5:73-80.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2):465-480.

Casella, G. (2001). Empirical Bayes Gibbs sampling. Biostatistics, 2(4):485-500.

Castillo, I., Schmidt-Hieber, J., and van der Vaart, A. (2015). Bayesian linear regression with sparse priors. Annals of Statistics, 43(5):1986-2018.

Datta, J. and Dunson, D. B. (2016). Bayesian inference on quasi-sparse count data. Biometrika, 103(4):971-983.

Datta, J. and Ghosh, J. K. (2013). Asymptotic properties of Bayes risk for the horseshoe prior. Bayesian Analysis, 8(1):111-132.

Dawid, A. P. (1973). Posterior expectations for large observations. Biometrika, 60(3):664-667.

Ferguson, K. K., McElrath, T. F., and Meeker, J. D. (2014). Environmental phthalate exposure and preterm birth. JAMA Pediatrics, 168(1):61-67.

Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457-472.

Hahn, P. R. and Carvalho, C. M. (2015). Decoupling shrinkage and selection in Bayesian linear models: A posterior summary perspective. Journal of the American Statistical Association, 110(509):435-448.

Hefley, T. J., Hooten, M. B., Hanks, E. M., Russell, R. E., and Walsh, D. P. (2017). The Bayesian group lasso for confounded spatial data. Journal of Agricultural, Biological, and Environmental Statistics, 22(1):42-59.

Hörmann, W. and Leydold, J. (2014). Generating generalized inverse Gaussian random variates. Statistics and Computing, 24:547-557.

Johndrow, J. E., Orenstein, P., and Bhattacharya, A. (2020). Scalable approximate MCMC algorithms for the horseshoe prior. Journal of Machine Learning Research, 21:1-61.

Kang, J., Reich, B. J., and Staicu, A.-M. (2018). Scalar-on-image regression via the soft-thresholded Gaussian process. Biometrika, 105(1):165-184.

Kang, K., Song, X., Hu, X. J., and Zhu, H. (2019). Bayesian adaptive group lasso with semiparametric hidden Markov models. Statistics in Medicine, 38(9):1634-1650.

Kowal, D. R., Matteson, D. S., and Ruppert, D. (2019). Dynamic shrinkage processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(4):781-804.

Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis, 5(2):369-412.

Li, J., Wang, Z., Li, R., and Wu, R. (2015). Bayesian group lasso for nonparametric varying-coefficient models with application to functional genome-wide association studies. The Annals of Applied Statistics, 9(2):640-664.

Makalic, E. and Schmidt, D. F. (2016). A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1):179-182.

Mallick, H. and Yi, N. (2017). Bayesian group bridge for bi-level variable selection. Computational Statistics & Data Analysis, 110:115-133.

Nishimura, A. and Suchard, M. A. (2020). Prior-preconditioned conjugate gradient method for accelerated Gibbs sampling in "large n & large p" sparse Bayesian regression. arXiv Preprint.

O'Hagan, A. (1979). On outlier rejection phenomena in Bayes inference. Journal of the Royal Statistical Society: Series B (Methodological), 41(3):358-367.

Polson, N. G. and Scott, J. G. (2011). Shrink globally, act locally: Sparse Bayesian regularization and prediction. In Bernardo, J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., and West, M., editors, Bayesian Statistics 9, chapter 17. Oxford University Press, Oxford, United Kingdom.

Rockova, V. and George, E. I. (2018). The spike-and-slab lasso. Journal of the American Statistical Association, 113(521):431-444.

Rockova, V. and Lesaffre, E. (2014). Incorporating grouping information in Bayesian variable selection with applications in genomics. Bayesian Analysis, 9(1):221-258.

Tang, X., Ghosh, M., Ha, N. S., and Sedransk, J. (2018a). Modeling random effects using global-local shrinkage priors in small area estimation. Journal of the American Statistical Association, 113(524):1476-1489.

Tang, X., Xu, X., Ghosh, M., and Ghosh, P. (2018b). Bayesian variable selection and estimation based on global-local shrinkage priors. Sankhya A, 80:215-246.

Terenin, A., Dong, S., and Draper, D. (2019). GPU-accelerated Gibbs sampling: a case study of the horseshoe probit model. Statistics and Computing, 29(2):301-310.

van der Pas, S., Szabó, B., and van der Vaart, A. (2017). Uncertainty quantification for the horseshoe (with discussion). Bayesian Analysis, 12(4):1221-1274.

Xu, X. and Ghosh, M. (2015). Bayesian variable selection and estimation for group lasso. Bayesian Analysis, 10(4):909-936.

Xu, Z., Schmidt, D. F., Makalic, E., Qian, G., and Hopper, J. L. (2016). Bayesian grouped horseshoe regression with application to additive models. In AI 2016: Advances in Artificial Intelligence, chapter 3. Springer, Hobart, Australia.

Zhang, Y. and Bondell, H. D. (2018). Variable selection via penalized credible regions with Dirichlet-Laplace global-local shrinkage priors. Bayesian Analysis, 13(3):823-844.

8. APPENDICES
8.1 Distributions

Beta Prime Distribution: X ∼ β′(a, b) ⟹ f_X(x) = [Γ(a + b)/(Γ(a)Γ(b))] x^(a−1) (1 + x)^(−a−b), x > 0.

Gamma Distribution: X ∼ G(a, b) ⟹ f_X(x) = [b^a/Γ(a)] x^(a−1) exp(−bx), x > 0.

Generalized Inverse Gaussian Distribution: X ∼ GIG(λ, ψ, χ) ⟹ f_X(x) = [(ψ/χ)^(λ/2)/(2K_λ(√(ψχ)))] x^(λ−1) exp(−(χ/x + ψx)/2), x > 0, where K_λ(·) is the modified Bessel function of the third kind with index λ (Hörmann and Leydold, 2014).

Half Cauchy Distribution: X ∼ C⁺(0, σ) ⟹ f_X(x) = 2/(πσ(1 + x²/σ²)), x > 0.

Inverse Gamma Distribution: X ∼ IG(a, b) ⟹ f_X(x) = [b^a/Γ(a)] x^(−a−1) exp(−b/x), x > 0.

8.2 Proof of Theorem 2.1

For shorthand, we will use r(x) ∼ s(x) := lim_{x→∞} r(x)/s(x) = 1. Define

L(u) = [1/(τ² B(a_g, b_g))] ((u/τ²)/(1 + u/τ²))^(a_g),    B(a_g, b_g) = Γ(a_g)Γ(b_g)/Γ(a_g + b_g),

which is a slowly-varying function, i.e., lim_{u→∞} L(tu)/L(u) = 1 for all t > 0. Moreover, let

π(β_gj | τ², a_g, b_g) = ∫₀^∞ (2πu)^(−1/2) exp(−β_gj²/(2u)) f(u | τ², a_g, b_g) du

denote the normal variance mixture probability density function and let f(u | τ², a_g, b_g) denote the scaled β′(a_g, b_g) mixing density function with fixed scale parameter τ². Then

lim_{u→∞} f(u | τ², a_g, b_g) / [exp(−ψ⁺u) (u/τ²)^(λ−1) L(u)]
= lim_{u→∞} (τ² B(a_g, b_g))^(−1) (u/τ²)^(a_g−1) (1 + u/τ²)^(−(a_g+b_g)) / [exp(−ψ⁺u) (u/τ²)^(λ−1) L(u)]
= lim_{u→∞} (u/τ²)^(−1) (1 + u/τ²)^(−b_g) / [exp(−ψ⁺u) (u/τ²)^(λ−1)]
= lim_{u→∞} exp(ψ⁺u) (u/τ²)^(−λ) (1 + u/τ²)^(−b_g),

where ψ⁺ = sup{w ∈ ℝ : φ(w) < ∞} and

φ(w) = [1/B(a_g, b_g)] ∫₀^∞ exp(wu) (1/τ²) (u/τ²)^(a_g−1) (1 + u/τ²)^(−(a_g+b_g)) du.

Note that ψ⁺ = 0. Fix λ = −b_g. Then,

lim_{u→∞} exp(ψ⁺u) (u/τ²)^(−λ) (1 + u/τ²)^(−b_g) = lim_{u→∞} ((u/τ²)/(1 + u/τ²))^(b_g) = 1.

By Theorem 6.1 in Barndorff-Nielsen et al. (1982) we conclude that

π(β_gj | τ², a_g, b_g) ∼ (2π)^(−1/2) 2^(b_g+1/2) Γ(b_g + 1/2) |β_gj|^(−(1+2b_g)) L(β_gj²).

To get the index of regular variation, we note the following straightforward lemma:

Lemma 1. Suppose that r and s are two positive, measurable functions such that r(x) ∼ s(x) and s is regularly varying with index ω ∈ ℝ. Then r is regularly varying with index ω.

Proof. lim_{x→∞} r(tx)/r(x) = lim_{x→∞} [(r(tx)/s(tx))/(r(x)/s(x))] · s(tx)/s(x) = lim_{x→∞} s(tx)/s(x) = t^ω.

Despite its simplicity, Lemma 1 is of great practical use, particularly if the function whose tail behavior we are interested in does not have a closed form. When working with global-local mixture priors we often do not have closed form marginal prior distributions for β and it is usually easier to construct and work with a closed form function that has asymptotically equivalent tail behavior. Since the index of regular variation of (2π)^(−1/2) 2^(b_g+1/2) Γ(b_g + 1/2) |β_gj|^(−(1+2b_g)) L(β_gj²) is ω = −(1 + 2b_g), by Lemma 1 the index of regular variation of π(β_gj | τ², a_g, b_g) is also ω = −(1 + 2b_g).

8.3 Proof of Theorem 3.1

Let G_n = {γ₁², ..., γ_{G_n}²} denote the collection of group shrinkage parameters and let p_g indicate the number of covariates in the g-th group. Define the following sets: A_g = {j : β⁰_gj ≠ 0}, A_g^c = {j : β⁰_gj = 0}, A_n = {(g, j) : β⁰_gj ≠ 0}, A_n^c = {(g, j) : β⁰_gj = 0}. In words, A_g is the active set for the g-th group, A_g^c is the non-active set for the g-th group, A_n is the active set across all groups, and A_n^c is the non-active set across all groups.
Integrating out the local shrinkage parameters, the conditional prior is

π(β_n | G_n, H_n, τ_n²) = Π_{g=1}^{G_n} Π_{j=1}^{p_g} [Γ(b_g + 1/2)/(Γ(b_g)√(πτ_n²γ_g²))] (1 + β_gj²/(τ_n²γ_g²))^(−(b_g+1/2)).

Then, we see that

π(β_n : ‖β_n − β_n⁰‖ < Δ/n^(ρ/2) | G_n, H_n, τ_n²)
≥ Π_{g=1}^{G_n} [ Π_{j∈A_g} π(|β_gj − β⁰_gj| < Δ/(√p_n n^(ρ/2)) | G_n, H_n, τ_n²) × Π_{j∈A_g^c} π(|β_gj| < Δ/(√p_n n^(ρ/2)) | G_n, H_n, τ_n²) ].

Continuing,

π(|β_gj − β⁰_gj| < Δ/(√p_n n^(ρ/2)) | G_n, H_n, τ_n²)
= ∫_{β⁰_gj − Δ/(√p_n n^(ρ/2))}^{β⁰_gj + Δ/(√p_n n^(ρ/2))} [Γ(b_g + 1/2)/(Γ(b_g)√(πτ_n²γ_g²))] (1 + β_gj²/(τ_n²γ_g²))^(−(b_g+1/2)) dβ_gj
≥ [2Γ(b_g + 1/2)Δ/(Γ(b_g)√(πτ_n²γ_g²) √p_n n^(ρ/2))] (1 + (L_n + Δ/(√p_n n^(ρ/2)))²/(τ_n²γ_g²))^(−(b_g+1/2)),

and

π(|β_gj| < Δ/(√p_n n^(ρ/2)) | G_n, H_n, τ_n²)
= 2 ∫₀^{Δ/(√p_n n^(ρ/2))} [Γ(b_g + 1/2)/(Γ(b_g)√(πτ_n²γ_g²))] (1 + β_gj²/(τ_n²γ_g²))^(−(b_g+1/2)) dβ_gj
≥ 2 ∫₀^{Δ/(√p_n n^(ρ/2))} [Γ(b_g + 1/2)/(Γ(b_g)√(πτ_n²γ_g²))] exp(−β_gj(b_g + 1/2)/√(τ_n²γ_g²)) dβ_gj
= [2Γ(b_g + 1/2)/(Γ(b_g)√(πτ_n²γ_g²))] (√(τ_n²γ_g²)/(b_g + 1/2)) [1 − exp(−Δ(b_g + 1/2)/√(τ_n²γ_g² p_n n^ρ))]
= [2Γ(b_g + 1/2)/(Γ(b_g)√π (b_g + 1/2))] [1 − exp(−Δ(b_g + 1/2)/√(τ_n²γ_g² p_n n^ρ))].

Note that we have the above inequality because (1 + x)^(−1) ≥ exp(−√x) for all x ≥ 0, applied with x = β_gj²/(τ_n²γ_g²) and raised to the power b_g + 1/2. Therefore,

Π_{g=1}^{G_n} [ Π_{j∈A_g} π(|β_gj − β⁰_gj| < Δ/(√p_n n^(ρ/2)) | G_n, H_n, τ_n²) × Π_{j∈A_g^c} π(|β_gj| < Δ/(√p_n n^(ρ/2)) | G_n, H_n, τ_n²) ]
≥ Π_{g=1}^{G_n} [2Γ(b_g + 1/2)Δ/(Γ(b_g)√(πτ_n²γ_g²) √p_n n^(ρ/2))]^|A_g| (1 + (L_n + Δ/(√p_n n^(ρ/2)))²/(τ_n²γ_g²))^(−|A_g|(b_g+1/2))
  × [2Γ(b_g + 1/2)/(Γ(b_g)√π (b_g + 1/2))]^|A_g^c| [1 − exp(−Δ(b_g + 1/2)/√(τ_n²γ_g² p_n n^ρ))]^|A_g^c|.

Substituting in τ_n² = C/(p_n n^ρ log(n)) and taking the negative logarithm of the final expression yields

Σ_{g=1}^{G_n} [ −|A_g| log( 2Γ(b_g + 1/2)Δ√(log(n))/(Γ(b_g)√(Cπγ_g²)) )
  + |A_g|(b_g + 1/2) log( 1 + p_n n^ρ log(n)(L_n + Δ/(√p_n n^(ρ/2)))²/(Cγ_g²) )
  − |A_g^c| log( 2Γ(b_g + 1/2)/(Γ(b_g)√π (b_g + 1/2)) )
  − |A_g^c| log( 1 − exp(−Δ(b_g + 1/2)√(log(n))/√(Cγ_g²)) ) ].

Let

T_1n = inf_{g∈{1,...,G_n}} log( 2Γ(b_g + 1/2)Δ/(Γ(b_g)√(Cπγ_g²)) ),
T_2n = inf_{g∈{1,...,G_n}} log( 2Γ(b_g + 1/2)/(Γ(b_g)√π (b_g + 1/2)) ),
T_3n = inf_{g∈{1,...,G_n}} Δ(b_g + 1/2)/√(Cγ_g²),
γ²_{n,min} = inf_{g∈{1,...,G_n}} γ_g², and b_{n,max} = sup_{g∈{1,...,G_n}} b_g.

Then,

Σ_{g=1}^{G_n} [ −|A_g| log( 2Γ(b_g + 1/2)Δ√(log(n))/(Γ(b_g)√(Cπγ_g²)) ) + |A_g|(b_g + 1/2) log( 1 + p_n n^ρ log(n)(L_n + Δ/(√p_n n^(ρ/2)))²/(Cγ_g²) ) − |A_g^c| log( 2Γ(b_g + 1/2)/(Γ(b_g)√π (b_g + 1/2)) ) − |A_g^c| log( 1 − exp(−Δ(b_g + 1/2)√(log(n))/√(Cγ_g²)) ) ]
≤ Σ_{g=1}^{G_n} [ −|A_g| (1/2) log(log(n)) − |A_g| T_1n + |A_g|(b_{n,max} + 1/2) log( 1 + p_n n^ρ log(n)(L_n + Δ/(√p_n n^(ρ/2)))²/(Cγ²_{n,min}) ) − |A_g^c| T_2n − |A_g^c| log( 1 − exp(−T_3n √(log(n))) ) ]
= −|A_n| (1/2) log(log(n)) − |A_n| T_1n + |A_n|(b_{n,max} + 1/2) log( 1 + p_n n^ρ log(n)(L_n + Δ/(√p_n n^(ρ/2)))²/(Cγ²_{n,min}) ) − |A_n^c| T_2n − |A_n^c| log( 1 − exp(−T_3n √(log(n))) ).

Note that the above expression is dominated by the |A_n|(b_{n,max} + 1/2) log( 1 + p_n n^ρ log(n)(L_n + Δ/(√p_n n^(ρ/2)))²/(Cγ²_{n,min}) ) term. If |A_n| = o(n/log(n)), then by Theorem 1 in Armagan et al. (2013b), we obtain posterior consistency conditional on G_n, i.e., for all ε > 0,

π(β_n : ‖β_n − β_n⁰‖ < ε | y_n, G_n, H_n, τ_n², σ²) → 1. Consequently,

π(β_n : ‖β_n − β_n⁰‖ < ε | y_n, H_n, τ_n², σ²) = E_{G_n | y_n, H_n, τ_n², σ²}[ π(β_n : ‖β_n − β_n⁰‖ < ε | y_n, G_n, H_n, τ_n², σ²) ] → E_{G_n | y_n, H_n, τ_n², σ²}[1] = 1.

8.4 Proof of Theorem 3.3

Let M = I_p − (I_p + (X⊤X)^(−1)D)^(−1) = (I_p + D^(−1)X⊤X)^(−1), where D^(−1) = σ^(−2)τ²Γ²Λ². The Rayleigh quotient inequality yields

‖β̂_OLS − E[β | ·]‖² ≥ e_min(M⊤M) ‖β̂_OLS‖² = (θ_min(M))² ‖β̂_OLS‖²,

where e_min(M⊤M) denotes the minimum eigenvalue of M⊤M and θ_min(M) denotes the minimum singular value of M. Then, we have

(θ_min(M))² ‖β̂_OLS‖² = ‖β̂_OLS‖²/(θ_max(I_p + D^(−1)X⊤X))² = ‖β̂_OLS‖²/‖I_p + D^(−1)X⊤X‖_O²,

where ‖·‖_O is the operator 2-norm and ‖·‖ is the usual L₂-norm. The operator 2-norm is sub-additive and sub-multiplicative, implying that

‖β̂_OLS‖²/‖I_p + D^(−1)X⊤X‖_O² ≥ ‖β̂_OLS‖²/(‖I_p‖_O + ‖D^(−1)‖_O ‖X⊤X‖_O)² = (1/(1 + θ_max(X⊤X) σ^(−2)τ² max_{(g,j)} γ_g²λ_gj²))² ‖β̂_OLS‖².

First, note that

π( 1/(1 + cσ^(−2)τ² max_{(g,j)} γ_g²λ_gj²) < ε | y, H, τ², σ² )
≤ π( ∪_{g=1}^{G} ∪_{j=1}^{p_g} {1/(1 + cσ^(−2)τ²γ_g²λ_gj²) < ε} | y, H, τ², σ² )
≤ Σ_{g=1}^{G} Σ_{j=1}^{p_g} π( 1/(1 + cσ^(−2)τ²γ_g²λ_gj²) < ε | y, H, τ², σ² ),

where c = θ_max(X⊤X). Define K_gj = 1/(1 + cσ^(−2)τ²γ_g²λ_gj²) and for notational simplicity let K = K_{−g} ∪ {K_gl} where K_{−g} = {K_{g′} : g′ ≠ g}. Also, let L = ∪_{g=1}^{G} L_g denote the collection of all local shrinkage parameters, where L_g = {λ_gj² : 1 ≤ j ≤ p_g}. Then,

π(K_gl < ε | y, H, τ², σ²) = [1/π(y | H, τ², σ²)] ∫₀^ε π(y | K_gl, H, τ², σ²) π(K_gl | H, τ², σ²) dK_gl
= [1/π(y | H, τ², σ²)] ∫₀^ε ∫_{(0,1)^(p−1)} ∫_{(0,∞)^p} π(y | K, L, H, τ², σ²) π(K | L, H, τ², σ²) π(L | H, τ², σ²) dL dK_{−g} dK_gl,

where (0,1)^(p−1) indicates the (p−1)-fold product (0,1) × ⋯ × (0,1) and (0,∞)^p the p-fold product (0,∞) × ⋯ × (0,∞). Looking at the individual components, we first observe that π(y | K, L, H, τ², σ²) is just a reparameterized version of π(y | Γ², Λ², H, τ², σ²), where

[y | Γ², Λ², H, τ², σ²] ∼ N(0, σ²I_n + τ²XΓ²Λ²X⊤).

Note that τ²XΓ²Λ²X⊤ is a symmetric, positive semi-definite matrix and therefore has nonnegative, real eigenvalues. Thus the determinant of σ²I_n + τ²XΓ²Λ²X⊤ satisfies |σ²I_n + τ²XΓ²Λ²X⊤| ≥ (σ²)^n and we get that π(y | Γ², Λ², H, τ², σ²) ≤ (2πσ²)^(−n/2). Therefore, we conclude that π(y | K, L, H, τ², σ²) ≤ (2πσ²)^(−n/2). One helpful observation is that

π(K | L, H, τ², σ²) π(L | H, τ², σ²) = π(K_gl | λ_gl², H, τ², σ²) π(λ_gl² | H, τ², σ²) Π_{g′≠g} π(K_{g′} | L_{g′}, H, τ², σ²) π(L_{g′} | H, τ², σ²).

Thus, we get a simplified upper bound

π(K_gl < ε | y, H, τ², σ²) ≤ [(2πσ²)^(−n/2)/π(y | H, τ², σ²)] ∫₀^ε ∫₀^∞ π(K_gl | λ_gl², H, τ², σ²) π(λ_gl² | H, τ², σ²) dλ_gl² dK_gl.

Note that because we are conditioning on the local shrinkage parameter, calculating π(K_gl | λ_gl², H, τ², σ²) is a single variable transformation problem, where

π(K_gl | λ_gl², H, τ², σ²) = [1/Γ(a_g)] (σ²/(cτ²λ_gl²))^(a_g) (1 − K_gl)^(a_g−1) (K_gl)^(−(1+a_g)) exp( −σ²(1 − K_gl)/(cτ²λ_gl² K_gl) ).

Moreover,

π(λ_gl² | H, τ², σ²) = [1/Γ(b_g)] (λ_gl²)^(−b_g−1) exp(−1/λ_gl²)

is simply the prior on the local shrinkage parameters.
Therefore, π (cid:0) K gl < (cid:15) (cid:12)(cid:12) y , H , τ , σ (cid:1) ≤ πσ ) n/ π ( y | H , τ , σ ) (cid:90) (cid:15) (cid:90) ∞ π ( K gl | λ gl , H , τ , σ ) π ( λ gl | H , τ , σ ) dλ gl dK gl = 1(2 πσ ) n/ Γ( a g )Γ( b g ) π ( y | H , τ , σ ) (cid:18) σ cτ (cid:19) a g × (cid:15) (1 − K gl ) a g − ( K gl ) − (1+ a g ) (cid:90) ∞ (cid:0) λ gl (cid:1) − ( a g + b g ) − exp (cid:18) − (cid:18) σ (1 − K gl ) cτ K gl (cid:19)(cid:46) λ gl (cid:19) dλ gl dK gl = Γ( a g + b g )(2 πσ ) n/ Γ( a g )Γ( b g ) π ( y | H , τ , σ ) (cid:18) σ cτ (cid:19) a g × (cid:90) (cid:15) (1 − K gl ) a g − ( K gl ) − (1+ a g ) (cid:18) σ (1 − K gl ) cτ K gl (cid:19) − ( a g + b g ) dK gl = Γ( a g + b g )(2 πσ ) n/ Γ( a g )Γ( b g ) π ( y | H , τ , σ ) (cid:18) σ cτ (cid:19) a g × (cid:90) (cid:15) (1 − K gl ) a g − ( K gl ) − (1+ a g ) (cid:18) cτ K gl cτ K gl + σ (1 − (cid:15) ) (cid:19) ( a g + b g ) dK gl ≤ Γ( a g + b g )(1 − (cid:15) ) − ( a g + b g ) (2 πσ ) n/ Γ( a g )Γ( b g ) π ( y | H , τ , σ ) (cid:18) cτ σ (cid:19) b g (cid:90) (cid:15) (1 − K gl ) a g − ( K gl ) b g − dK gl ≤ Γ( a g + b g ) (cid:15) b g (1 − (cid:15) ) − ( a g + b g ) (2 πσ ) n/ Γ( a g ) b g Γ( b g ) π ( y | H , τ , σ ) (cid:18) cτ σ (cid:19) b g max { , (1 − (cid:15) ) a g − } . Lastly, we see thatlim τ → π ( y | H , τ , σ ) = lim τ → (cid:90) (0 , ∞ ) G (cid:90) (0 , ∞ ) p π ( y | G , L , H , τ , σ ) π ( G | H , τ , σ ) π ( L | H , τ , σ ) d L d G . Since π ( y | G , L , H , τ , σ ) = π ( y | Γ , Λ , H , τ , σ ) ≤ (2 πσ ) − n/ , then (cid:90) (0 , ∞ ) G (cid:90) (0 , ∞ ) p π ( y | G , L , H , τ , σ ) π ( G | H , τ , σ ) π ( L | H , τ , σ ) d L d G ≤ (2 πσ ) − n/ , and by the dominated convergence theorem we have thatlim τ → π ( y | H , τ , σ ) = 1(2 πσ ) n/ exp (cid:18) − σ y (cid:62) y (cid:19) . So, we conclude that π (cid:18)
11 + cσ − τ max ( g,j ) γ g λ gj < (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) y , H , τ , σ (cid:19) ≤ G (cid:88) g =1 p g (cid:88) j =1 π (cid:0) K gj < (cid:15) (cid:12)(cid:12) y , H , τ , σ (cid:1) ≤ G (cid:88) g =1 p g Γ( a g + b g ) (cid:15) b g (1 − (cid:15) ) − ( a g + b g ) (2 πσ ) n/ Γ( a g ) b g Γ( b g ) π ( y | H , τ , σ ) (cid:18) cτ σ (cid:19) b g max { , (1 − (cid:15) ) a g − } , where the upper bound goes to zero as τ →
0. Therefore, we have shown that for fixed32 ∈ (0 , π (cid:18)
11 + θ max ( X (cid:62) X ) σ − τ max ( g,j ) γ g λ gj ≥ (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) y , H , τ , σ (cid:19) → τ →
0. Since, (cid:13)(cid:13)(cid:13) ˆ β OLS − E [ β | · ] (cid:13)(cid:13)(cid:13) ≥ (cid:18)
11 + θ max ( X (cid:62) X ) σ − τ max ( g,j ) γ g λ gj (cid:19) (cid:13)(cid:13)(cid:13) ˆ β OLS (cid:13)(cid:13)(cid:13) , then we must also have that π (cid:18)(cid:13)(cid:13)(cid:13) ˆ β OLS − E [ β | · ] (cid:13)(cid:13)(cid:13) ≥ (cid:15) (cid:13)(cid:13)(cid:13) ˆ β OLS (cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12)(cid:12) y , H , τ , σ (cid:19) → τ →
0. 33 .5 Proof of Corollary 3.1 If X (cid:62) g X g (cid:48) = for all g (cid:54) = g (cid:48) , then we have E [ β g | · ] = (cid:18) I p g + ( X (cid:62) g X g ) − σ γ g τ Λ − g (cid:19) − ˆ β OLSg , ˆ β OLSg = ( X (cid:62) g X g ) − X (cid:62) g y , where Λ g = diag (cid:0) λ g , . . . , λ gp g (cid:1) . Following a similar argument to the proof of Theorem 3.3, wearrive at (cid:13)(cid:13)(cid:13) ˆ β OLSg − E [ β g | · ] (cid:13)(cid:13)(cid:13) ≥ (cid:18)
11 + θ max ( X (cid:62) g X g ) σ − τ γ g max j λ gj (cid:19) (cid:13)(cid:13)(cid:13) ˆ β OLSg (cid:13)(cid:13)(cid:13) , Also, from the proof of Theorem 3.3, we have that π (cid:18)
11 + cσ − τ γ g max j λ gj < (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) y , H , τ , σ (cid:19) ≤ p g (cid:88) j =1 π (cid:0) K gj < (cid:15) (cid:12)(cid:12) y , H , τ , σ (cid:1) ≤ p g Γ( a g + b g ) (cid:15) b g (1 − (cid:15) ) − ( a g + b g ) (2 πσ ) n/ Γ( a g ) b g Γ( b g ) π ( y | H , τ , σ ) (cid:18) cτ σ (cid:19) b g max { , (1 − (cid:15) ) a g − } , If, a g ∈ (0 ,
1) and c(cid:15)τ σ (1 − (cid:15) ) < p g Γ( a g + b g ) (cid:15) b g (1 − (cid:15) ) − ( a g + b g ) (2 πσ ) n/ Γ( a g ) b g Γ( b g ) π ( y | H , τ , σ ) (cid:18) cτ σ (cid:19) b g max { , (1 − (cid:15) ) a g − } → b g → ∞ . Since, (cid:13)(cid:13)(cid:13) ˆ β OLSg − E [ β g | · ] (cid:13)(cid:13)(cid:13) ≥ (cid:18)
11 + θ max ( X (cid:62) g X g ) σ − τ γ g max j λ gj (cid:19) (cid:13)(cid:13)(cid:13) ˆ β OLSg (cid:13)(cid:13)(cid:13) , then we must also have that for all δ ∈ (0 , σ / ( σ + θ max ( X (cid:62) g X g ) τ )) π (cid:18)(cid:13)(cid:13)(cid:13) ˆ β OLSg − E [ β g | · ] (cid:13)(cid:13)(cid:13) ≥ δ (cid:13)(cid:13)(cid:13) ˆ β OLSg (cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12)(cid:12) y , H , τ , σ (cid:19) → b g → ∞ . 34 .6 Proof of Theorem 3.4a The posterior distribution of the shrinkage weights in the g -th group are given by π (cid:0) κ g | y g , τ , σ , a g , b g (cid:1) ∝ (cid:32) τ σ p g (cid:88) j =1 κ gj − κ gj (cid:33) − ( a g + p g b g ) (cid:32) p g (cid:89) j =1 κ b g − / gj (1 − κ gj ) − ( b g +1) exp (cid:18) − y gj σ κ gj (cid:19)(cid:33) , where κ g = ( κ g , ..., κ gp g ), 0 < κ gj < ≤ j ≤ p g , and y g = ( y g , ..., y gp g ) (cid:62) . π ( κ gl > ψ | y g , τ , σ , a g , b g ) = A gl B g , where A gl = (cid:90) ψ (cid:90) · · · (cid:90) (cid:32) τ σ p g (cid:88) j =1 κ gj − κ gj (cid:33) − ( a g + p g b g ) × (cid:32) p g (cid:89) j =1 κ b g − / gj (1 − κ gj ) − ( b g +1) exp (cid:18) − y gj σ κ gj (cid:19)(cid:33) dκ g · · · dκ g,l − dκ g,l +1 · · · dκ gp g dκ gl and B g = (cid:90) · · · (cid:90) (cid:32) τ σ p g (cid:88) j =1 κ gj − κ gj (cid:33) − ( a g + p g b g ) (cid:32) p g (cid:89) j =1 κ b g − / gj (1 − κ gj ) − ( b g +1) exp (cid:18) − y gj σ κ gj (cid:19)(cid:33) dκ g · · · dκ gp g . 
Note that
\[
\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g)}
=\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g/p_g+b_g)}\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(p_g-1)(a_g/p_g+b_g)}
\le\left(1+\frac{\tau^2}{\sigma^2}\frac{\kappa_{gl}}{1-\kappa_{gl}}\right)^{-(a_g/p_g+b_g)}\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(p_g-1)(a_g/p_g+b_g)}.
\]
Then,
\begin{align*}
A_{gl}&\le\left(\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(p_g-1)(a_g/p_g+b_g)}\prod_{j\neq l}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\exp\left(-\frac{y_{gj}^2\kappa_{gj}}{2\sigma^2}\right)d\kappa_{gj}\right)\\
&\qquad\times\left(\int_{\psi}^{1}\left(1+\frac{\tau^2}{\sigma^2}\frac{\kappa_{gl}}{1-\kappa_{gl}}\right)^{-(a_g/p_g+b_g)}\kappa_{gl}^{b_g-1/2}(1-\kappa_{gl})^{-(b_g+1)}\exp\left(-\frac{y_{gl}^2\kappa_{gl}}{2\sigma^2}\right)d\kappa_{gl}\right)\\
&\le\left(\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(p_g-1)(a_g/p_g+b_g)}\prod_{j\neq l}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\,d\kappa_{gj}\right)\\
&\qquad\times\exp\left(-\frac{\psi}{2\sigma^2}y_{gl}^2\right)\left(\int_{\psi}^{1}\left(1+\frac{\tau^2}{\sigma^2}\frac{\kappa_{gl}}{1-\kappa_{gl}}\right)^{-(a_g/p_g+b_g)}\kappa_{gl}^{b_g-1/2}(1-\kappa_{gl})^{-(b_g+1)}\,d\kappa_{gl}\right)\\
&\le\left(\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g^*+(p_g-1)b_g)}\prod_{j\neq l}\kappa_{gj}^{b_g-1}(1-\kappa_{gj})^{-(b_g+1)}\,d\kappa_{gj}\right)\\
&\qquad\times\exp\left(-\frac{\psi}{2\sigma^2}y_{gl}^2\right)\left(\int_{\psi}^{1}\left(1-\left(1-\frac{\tau^2}{\sigma^2}\right)\kappa_{gl}\right)^{-(a_g/p_g+b_g)}\kappa_{gl}^{b_g-1/2}(1-\kappa_{gl})^{a_g/p_g-1}\,d\kappa_{gl}\right)\\
&\le\left(\left(\frac{\tau^2}{\sigma^2}\right)^{-(p_g-1)b_g}\frac{\Gamma(a_g^*)(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)}\right)\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{-(a_g/p_g+b_g)}\exp\left(-\frac{\psi}{2\sigma^2}y_{gl}^2\right)\int_{\psi}^{1}\kappa_{gl}^{b_g-1/2}(1-\kappa_{gl})^{a_g/p_g-1}\,d\kappa_{gl}\\
&\le\left(\left(\frac{\tau^2}{\sigma^2}\right)^{-(p_g-1)b_g}\frac{\Gamma(a_g^*)(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)}\right)\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{-(a_g/p_g+b_g)}\max\big(1,\psi^{b_g-1/2}\big)\exp\left(-\frac{\psi}{2\sigma^2}y_{gl}^2\right)\int_{\psi}^{1}(1-\kappa_{gl})^{a_g/p_g-1}\,d\kappa_{gl}\\
&=\left(\left(\frac{\tau^2}{\sigma^2}\right)^{-(p_g-1)b_g}\frac{\Gamma(a_g^*)(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)}\right)\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{-(a_g/p_g+b_g)}\max\big(1,\psi^{b_g-1/2}\big)\exp\left(-\frac{\psi}{2\sigma^2}y_{gl}^2\right)\frac{p_g}{a_g}(1-\psi)^{a_g/p_g},
\end{align*}
where $a_g^*=(p_g-1)a_g/p_g$. We can simplify the integral four lines above based on the prior distribution of the shrinkage weights for a group of size $p_g-1$. Let $\delta\in(0,1)$ be a fixed constant. Then,
\begin{align*}
B_g&\ge\int_0^{\psi\delta}\cdots\int_0^{\psi\delta}\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g)}\left(\prod_{j=1}^{p_g}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\exp\left(-\frac{y_{gj}^2\kappa_{gj}}{2\sigma^2}\right)d\kappa_{gj}\right)\\
&\ge\left(1+\frac{p_g\tau^2}{\sigma^2}\frac{\psi\delta}{1-\psi\delta}\right)^{-(a_g+p_gb_g)}\prod_{j=1}^{p_g}\int_0^{\psi\delta}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\exp\left(-\frac{y_{gj}^2\kappa_{gj}}{2\sigma^2}\right)d\kappa_{gj}\\
&\ge\left(1+\frac{p_g\tau^2}{\sigma^2}\frac{\psi\delta}{1-\psi\delta}\right)^{-(a_g+p_gb_g)}\exp\left(-\frac{\psi\delta}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\prod_{j=1}^{p_g}\int_0^{\psi\delta}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\,d\kappa_{gj}\\
&\ge\left(1+\frac{p_g\tau^2}{\sigma^2}\frac{\psi\delta}{1-\psi\delta}\right)^{-(a_g+p_gb_g)}\exp\left(-\frac{\psi\delta}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\prod_{j=1}^{p_g}\int_0^{\psi\delta}\kappa_{gj}^{b_g-1/2}\,d\kappa_{gj}\\
&=\left(1+\frac{p_g\tau^2}{\sigma^2}\frac{\psi\delta}{1-\psi\delta}\right)^{-(a_g+p_gb_g)}\exp\left(-\frac{\psi\delta}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)(b_g+1/2)^{-p_g}(\psi\delta)^{p_g(b_g+1/2)}.
\end{align*}
Therefore,
\[
\frac{A_{gl}}{B_g}\le\frac{f(p_g,\tau^2,\sigma^2,a_g,b_g,\psi)}{g(p_g,\tau^2,\sigma^2,a_g,b_g,\psi,\delta)}\exp\left(\frac{\psi\delta}{2\sigma^2}\sum_{j\neq l}y_{gj}^2\right)\exp\left(-\frac{\psi(1-\delta)}{2\sigma^2}y_{gl}^2\right),
\]
where
\[
f(p_g,\tau^2,\sigma^2,a_g,b_g,\psi)=\left(\left(\frac{\tau^2}{\sigma^2}\right)^{-(p_g-1)b_g}\frac{\Gamma(a_g^*)(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)}\right)\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{-(a_g/p_g+b_g)}\max\big(1,\psi^{b_g-1/2}\big)\frac{p_g}{a_g}(1-\psi)^{a_g/p_g}
\]
and
\[
g(p_g,\tau^2,\sigma^2,a_g,b_g,\psi,\delta)=\left(1+\frac{p_g\tau^2}{\sigma^2}\frac{\psi\delta}{1-\psi\delta}\right)^{-(a_g+p_gb_g)}(b_g+1/2)^{-p_g}(\psi\delta)^{p_g(b_g+1/2)}.
\]
If we take the limit of this upper bound as $|y_{gl}|\to\infty$, then we see that $\pi(\kappa_{gl}>\psi\mid y_g,\tau^2,\sigma^2,a_g,b_g)\to 0$. This concludes the proof.

Proof of Theorem 3.4b
The posterior distribution of the shrinkage weights in the $g$-th group is given by
\[
\pi\big(\kappa_g\mid y_g,\tau^2,\sigma^2,a_g,b_g\big)\propto\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g)}\left(\prod_{j=1}^{p_g}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\exp\left(-\frac{y_{gj}^2\kappa_{gj}}{2\sigma^2}\right)\right),
\]
where $\kappa_g=(\kappa_{g1},\ldots,\kappa_{gp_g})$, $0<\kappa_{gj}<1$ for $1\le j\le p_g$, and $y_g=(y_{g1},\ldots,y_{gp_g})^\top$. Write
\[
\pi(\kappa_{gl}<\epsilon\mid y_g,\tau^2,\sigma^2,a_g,b_g)=\frac{A_{gl}}{B_g},
\]
where
\[
A_{gl}=\int_0^{\epsilon}\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g)}\left(\prod_{j=1}^{p_g}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\exp\left(-\frac{y_{gj}^2\kappa_{gj}}{2\sigma^2}\right)\right)d\kappa_{g1}\cdots d\kappa_{g,l-1}\,d\kappa_{g,l+1}\cdots d\kappa_{gp_g}\,d\kappa_{gl}
\]
and
\[
B_g=\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g)}\left(\prod_{j=1}^{p_g}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\exp\left(-\frac{y_{gj}^2\kappa_{gj}}{2\sigma^2}\right)\right)d\kappa_{g1}\cdots d\kappa_{gp_g}.
\]
Note that
\[
\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g)}
\le\left(1+\frac{\tau^2}{\sigma^2}\frac{\kappa_{gl}}{1-\kappa_{gl}}\right)^{-(a_g/p_g+b_g)}\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(p_g-1)(a_g/p_g+b_g)},
\]
exactly as in the proof of Theorem 3.4a. Then,
\begin{align*}
A_{gl}&\le\left(\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(p_g-1)(a_g/p_g+b_g)}\prod_{j\neq l}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\exp\left(-\frac{y_{gj}^2\kappa_{gj}}{2\sigma^2}\right)d\kappa_{gj}\right)\\
&\qquad\times\left(\int_0^{\epsilon}\left(1+\frac{\tau^2}{\sigma^2}\frac{\kappa_{gl}}{1-\kappa_{gl}}\right)^{-(a_g/p_g+b_g)}\kappa_{gl}^{b_g-1/2}(1-\kappa_{gl})^{-(b_g+1)}\exp\left(-\frac{y_{gl}^2\kappa_{gl}}{2\sigma^2}\right)d\kappa_{gl}\right)\\
&\le\left(\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(p_g-1)(a_g/p_g+b_g)}\prod_{j\neq l}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\,d\kappa_{gj}\right)\left((1-\epsilon)^{-(b_g+1)}\int_0^{\epsilon}\kappa_{gl}^{b_g-1/2}\,d\kappa_{gl}\right)\\
&\le\frac{\epsilon^{b_g+1/2}}{(b_g+1/2)(1-\epsilon)^{b_g+1}}\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g^*+(p_g-1)b_g)}\prod_{j\neq l}\kappa_{gj}^{b_g-1}\kappa_{gj}^{1/2}(1-\kappa_{gj})^{-(b_g+1)}\,d\kappa_{gj}\\
&\le\frac{\epsilon^{b_g+1/2}}{(b_g+1/2)(1-\epsilon)^{b_g+1}}\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g^*+(p_g-1)b_g)}\prod_{j\neq l}\kappa_{gj}^{b_g-1}(1-\kappa_{gj})^{-(b_g+1)}\,d\kappa_{gj}\\
&=\left(\frac{\epsilon^{b_g+1/2}}{(b_g+1/2)(1-\epsilon)^{b_g+1}}\right)\left(\left(\frac{\tau^2}{\sigma^2}\right)^{-(p_g-1)b_g}\frac{\Gamma(a_g^*)(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)}\right),
\end{align*}
where $a_g^*=(p_g-1)a_g/p_g$. We have the last equality based on the prior distribution of the shrinkage weights for a group of size $p_g-1$. Next,
\begin{align*}
B_g&\ge\exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g)}\prod_{j=1}^{p_g}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\,d\kappa_{gj}\\
&=\exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g^*)}\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{p_g/2}\prod_{j=1}^{p_g}\kappa_{gj}^{b_g^*-1}(1-\kappa_{gj})^{-(b_g^*+1)}(1-\kappa_{gj})^{1/2}\,d\kappa_{gj}\\
&\ge\exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\int\cdots\int\left(\prod_{j=1}^{p_g}\left(1-\kappa_{gj}+\frac{\tau^2}{\sigma^2}\kappa_{gj}\right)\right)^{1/2}\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g^*)}\prod_{j=1}^{p_g}\kappa_{gj}^{b_g^*-1}(1-\kappa_{gj})^{-(b_g^*+1)}\,d\kappa_{gj}\\
&\ge\exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{p_g/2}\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g^*)}\prod_{j=1}^{p_g}\kappa_{gj}^{b_g^*-1}(1-\kappa_{gj})^{-(b_g^*+1)}\,d\kappa_{gj}\\
&=\exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{p_g/2}\left(\left(\frac{\tau^2}{\sigma^2}\right)^{-p_gb_g^*}\frac{\Gamma(a_g)(\Gamma(b_g^*))^{p_g}}{\Gamma(a_g+p_gb_g^*)}\right),
\end{align*}
where $b_g^*=b_g+1/2$. Similarly, we have the last equality based on the prior distribution of the shrinkage weights for a group of size $p_g$. Therefore,
\[
\frac{A_{gl}}{B_g}\le\exp\left(\frac{1}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\frac{\epsilon^{b_g+1/2}}{(b_g+1/2)(1-\epsilon)^{b_g+1}}\left(\frac{\tau^2}{\sigma^2}\right)^{p_g/2+b_g}\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{-p_g/2}\frac{\Gamma(a_g+p_gb_g^*)\,\Gamma(a_g^*)\,(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)\,\Gamma(a_g)\,(\Gamma(b_g^*))^{p_g}}.
\]
If we take the limit of this expression as $\tau^2\to 0$, then we see that $\pi(\kappa_{gl}<\epsilon\mid y_g,\tau^2,\sigma^2,a_g,b_g)\to 0$. This concludes the proof.

Proof of Corollary 3.2
From the proof of Theorem 3.4b we have that
\[
\pi(\kappa_{gl}<\epsilon\mid y_g,\tau^2,\sigma^2,a_g,b_g)\le\exp\left(\frac{1}{2\sigma^2}y_g^\top y_g\right)\frac{\epsilon^{b_g+1/2}}{(b_g+1/2)(1-\epsilon)^{b_g+1}}\left(\frac{\tau^2}{\sigma^2}\right)^{p_g/2+b_g}\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{-p_g/2}\frac{\Gamma(a_g+p_gb_g^*)\,\Gamma(a_g^*)\,(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)\,\Gamma(a_g)\,(\Gamma(b_g^*))^{p_g}},
\]
where $p_g$ is the number of observations in group $g$, $b_g^*=b_g+1/2$, and $a_g^*=(p_g-1)a_g/p_g$. Based on this inequality we just need to find for what values of $\theta$ the limit as $b_g\to\infty$ of
\[
\frac{\theta^{b_g}}{(b_g+1/2)}\,\frac{\Gamma(a_g+p_gb_g^*)\,(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)\,(\Gamma(b_g^*))^{p_g}},\qquad \theta=\frac{\epsilon}{1-\epsilon}\,\frac{\tau^2}{\sigma^2},
\]
goes to zero. To do so we will first need one useful asymptotic approximation (a consequence of 6.1.39 in Abramowitz and Stegun (1972)):
\[
\lim_{x\to\infty}\frac{\Gamma(x+c)}{\Gamma(x)\,x^c}=1,\quad\text{for any } c\in\mathbb{R}.
\]
Therefore,
\begin{align*}
\frac{\theta^{b_g}}{(b_g+1/2)}\,\frac{\Gamma(a_g+p_gb_g^*)\,(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)\,(\Gamma(b_g^*))^{p_g}}
&\sim\frac{\theta^{b_g}}{(b_g+1/2)}\,\frac{\Gamma(p_gb_g)\,(p_gb_g)^{a_g+p_g/2}\,(\Gamma(b_g))^{p_g-1}}{\Gamma((p_g-1)b_g)\,((p_g-1)b_g)^{a_g^*}\,(\Gamma(b_g))^{p_g}\,b_g^{p_g/2}}\\
&\sim p_g^{a_g+p_g/2}\,(p_g-1)^{-a_g^*}\,\theta^{b_g}\,b_g^{a_g/p_g-1}\,\frac{\Gamma(p_gb_g)}{\Gamma(b_g)\,\Gamma((p_g-1)b_g)}\\
&=p_g^{a_g+p_g/2}\,(p_g-1)^{-a_g^*}\,\frac{\theta^{b_g}\,b_g^{a_g/p_g-1}}{B(b_g,(p_g-1)b_g)}.
\end{align*}
Stirling's approximation for the gamma function (6.1.39 in Abramowitz and Stegun (1972)) can be used to get the following asymptotic approximation for the beta function:
\[
B(x,y)\sim\sqrt{2\pi}\,\frac{x^{x-1/2}\,y^{y-1/2}}{(x+y)^{x+y-1/2}}.
\]
Hence,
\[
p_g^{a_g+p_g/2}\,(p_g-1)^{-a_g^*}\,\frac{\theta^{b_g}\,b_g^{a_g/p_g-1}}{B(b_g,(p_g-1)b_g)}
\sim\frac{p_g^{a_g+p_g/2}}{\sqrt{2\pi}}\,(p_g-1)^{-a_g^*}\,\theta^{b_g}\,b_g^{a_g/p_g-1}\,\frac{(p_gb_g)^{p_gb_g-1/2}}{b_g^{b_g-1/2}\,((p_g-1)b_g)^{(p_g-1)b_g-1/2}}.
\]
If we set $\theta=p_g^{-p_g}$, then this quantity becomes
\[
\frac{p_g^{a_g+(p_g-1)/2}}{\sqrt{2\pi}}\,(p_g-1)^{1/2-a_g^*}\,b_g^{a_g/p_g-1/2}\,(p_g-1)^{-(p_g-1)b_g}.
\]
If $a_g\in(0,1)$, this goes to zero as $b_g\to\infty$. To summarize the result, we have shown that if $\tau^2$, $\sigma^2$, $p_g\ge 2$, and $a_g\in(0,1)$ are all fixed, then there exists a constant
\[
\epsilon(\tau^2,\sigma^2,p_g)=\left(1+\frac{\tau^2}{\sigma^2}\,p_g^{p_g}\right)^{-1},
\]
such that $\pi(\kappa_{gl}<\epsilon(\tau^2,\sigma^2,p_g)\mid y_g,\tau^2,\sigma^2,a_g,b_g)\to 0$ as $b_g\to\infty$.

Full Conditional Distributions for Gibbs Sampler

The full conditional distributions for all model parameters are
\begin{align*}
[\alpha\mid\cdot]&\sim N\Big((C^\top C)^{-1}C^\top(y-X\beta),\;\sigma^2(C^\top C)^{-1}\Big)\\
[\beta\mid\cdot]&\sim N\Big(Q^{-1}\sigma^{-2}X^\top(y-C\alpha),\;Q^{-1}\Big),\qquad Q=\frac{1}{\sigma^2}X^\top X+\frac{1}{\tau^2}\Gamma^{-1}\Lambda^{-1}\\
[\tau^2\mid\cdot]&\sim\mathrm{IG}\left(\frac{p+1}{2},\;\frac{1}{2}\beta^\top\Gamma^{-1}\Lambda^{-1}\beta+\frac{1}{\nu}\right),\qquad
[\nu\mid\cdot]\sim\mathrm{IG}\left(1,\;\frac{1}{\tau^2}+\frac{1}{\sigma^2}\right)\\
[\sigma^2\mid\cdot]&\sim\mathrm{IG}\left(\frac{n+1}{2},\;\frac{1}{2}\big(y-C\alpha-X\beta\big)^\top\big(y-C\alpha-X\beta\big)+\frac{1}{\nu}\right)\\
[\lambda_{gj}^2\mid\cdot]&\sim\mathrm{IG}\left(b_g+\frac{1}{2},\;\frac{\beta_{gj}^2}{2\tau^2\gamma_g^2}+1\right)\\
[\gamma_g^{-2}\mid\cdot]&\sim\mathrm{GIG}\left(\frac{p_g}{2}-a_g,\;\frac{1}{\tau^2}\sum_{j=1}^{p_g}\frac{\beta_{gj}^2}{\lambda_{gj}^2},\;2\right),
\end{align*}
where GIG refers to the generalized inverse Gaussian distribution (Hörmann and Leydold, 2014).
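The full conditionals above can be composed into a single Gibbs sweep. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`rgig`, `gigg_gibbs_sweep`) are ours, GIG draws use `scipy.stats.geninvgauss` via the standard reparameterization of the three-parameter form $\mathrm{GIG}(\lambda,\chi,\psi)$ with density $\propto x^{\lambda-1}\exp\{-(\chi/x+\psi x)/2\}$, and $\gamma_g^2$ is drawn directly (equivalent to the listed $\gamma_g^{-2}$ draw, since the GIG family is closed under inversion).

```python
import numpy as np
from scipy import stats

def rgig(lam, chi, psi, rng):
    # Sample GIG(lam, chi, psi), density ∝ x^{lam-1} exp(-(chi/x + psi*x)/2),
    # by rescaling scipy's two-parameter geninvgauss(p, b) distribution.
    b = np.sqrt(chi * psi)
    return stats.geninvgauss.rvs(lam, b, random_state=rng) * np.sqrt(chi / psi)

def gigg_gibbs_sweep(y, C, X, groups, a, b, state, rng):
    """One sweep through the full conditionals listed above (hypothetical helper).
    `groups` maps each column of X to a group index; a[g], b[g] are the GIGG
    hyperparameters; `state` = (alpha, beta, tau2, sig2, nu, gam2, lam2)."""
    alpha, beta, tau2, sig2, nu, gam2, lam2 = state
    n, p = X.shape
    # [alpha | .] ~ N((C'C)^{-1} C'(y - X beta), sig2 (C'C)^{-1})
    CtC_inv = np.linalg.inv(C.T @ C)
    alpha = rng.multivariate_normal(CtC_inv @ C.T @ (y - X @ beta), sig2 * CtC_inv)
    # [beta | .] ~ N(Q^{-1} X'(y - C alpha)/sig2, Q^{-1}),
    # Q = X'X/sig2 + (Gamma Lambda)^{-1}/tau2
    d = 1.0 / (tau2 * gam2[groups] * lam2)
    Q_inv = np.linalg.inv(X.T @ X / sig2 + np.diag(d))
    beta = rng.multivariate_normal(Q_inv @ X.T @ (y - C @ alpha) / sig2, Q_inv)
    # [tau2 | .] ~ IG((p+1)/2, beta' (Gamma Lambda)^{-1} beta / 2 + 1/nu)
    ssq = np.sum(beta**2 / (gam2[groups] * lam2))
    tau2 = stats.invgamma.rvs((p + 1) / 2, scale=ssq / 2 + 1 / nu, random_state=rng)
    # [sig2 | .] ~ IG((n+1)/2, ||y - C alpha - X beta||^2 / 2 + 1/nu)
    r = y - C @ alpha - X @ beta
    sig2 = stats.invgamma.rvs((n + 1) / 2, scale=r @ r / 2 + 1 / nu, random_state=rng)
    # [nu | .] ~ IG(1, 1/tau2 + 1/sig2)
    nu = stats.invgamma.rvs(1.0, scale=1 / tau2 + 1 / sig2, random_state=rng)
    # [lam2_gj | .] ~ IG(b_g + 1/2, beta_gj^2/(2 tau2 gam2_g) + 1)
    lam2 = stats.invgamma.rvs(b[groups] + 0.5,
                              scale=beta**2 / (2 * tau2 * gam2[groups]) + 1,
                              random_state=rng)
    # gam2_g ~ GIG(a_g - p_g/2, sum_j beta_gj^2/lam2_gj / tau2, 2); equivalent,
    # by GIG inversion, to the listed draw of gam2_g^{-2}.
    for g in range(gam2.size):
        idx = groups == g
        chi = np.sum(beta[idx]**2 / lam2[idx]) / tau2
        gam2[g] = rgig(a[g] - idx.sum() / 2, chi, 2.0, rng)
    return alpha, beta, tau2, sig2, nu, gam2, lam2
```

Iterating this sweep and discarding a burn-in period yields draws from the joint posterior; the per-sweep cost is dominated by the $p\times p$ solve in the $\beta$ update.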
[Figure 4 about here: a forest plot of estimated associations for 34 individual toxicants on the y-axis, organized into five groups (metals, PAHs, PBDEs, pesticides, and phthalates). The x-axis is the percent change in GGT for a twofold change in exposure, ranging from -6 to 6. Point estimates are plotted for five methods: GIGG regression, group horseshoe regression, group half-Cauchy, BGL-SS, and BSGS-SS.]

Figure 4: Associations between environmental toxicants (metals, phthalates, pesticides, PBDEs, and PAHs) and gamma glutamyl transferase (GGT) from NHANES 2003-2004 (n
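The two asymptotic approximations used in the proof of Corollary 3.2, the ratio $\Gamma(x+c)/(\Gamma(x)x^c)\to 1$ and the Stirling-based approximation to the beta function, are easy to check numerically. The sketch below is illustrative only (the function names are ours); both quantities are evaluated on the log scale to avoid overflow:

```python
import math

def gamma_ratio(x, c):
    # Gamma(x + c) / (Gamma(x) * x**c), computed via lgamma; tends to 1 as x grows.
    return math.exp(math.lgamma(x + c) - math.lgamma(x) - c * math.log(x))

def beta_stirling(x, y):
    # Stirling-based approximation to the beta function:
    # B(x, y) ~ sqrt(2*pi) * x**(x-1/2) * y**(y-1/2) / (x+y)**(x+y-1/2)
    return math.sqrt(2 * math.pi) * math.exp(
        (x - 0.5) * math.log(x) + (y - 0.5) * math.log(y)
        - (x + y - 0.5) * math.log(x + y))

if __name__ == "__main__":
    for x in (10.0, 100.0, 1000.0):
        print(x, gamma_ratio(x, 2.5))  # approaches 1 as x increases
    exact = math.exp(math.lgamma(50) + math.lgamma(150) - math.lgamma(200))
    print(beta_stirling(50, 150) / exact)  # close to 1
```

The error of both approximations shrinks at rate $O(1/x)$, which is why the limiting argument in Corollary 3.2 can replace the exact gamma and beta terms with their asymptotic forms.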