Group Inverse-Gamma Gamma Shrinkage for Sparse Regression with Block-Correlated Predictors
Jonathan Boss, Jyotishka Datta, Xin Wang, Sung Kyun Park, Jian Kang, Bhramar Mukherjee
Department of Biostatistics, University of Michigan; Department of Statistics, Virginia Polytechnic Institute and State University; Department of Epidemiology, University of Michigan
Abstract
Heavy-tailed continuous shrinkage priors, such as the horseshoe prior, are widely used for sparse estimation problems. However, there is limited work extending these priors to predictors with grouping structures. Of particular interest in this article is regression coefficient estimation where pockets of high collinearity in the covariate space are contained within known covariate groupings. To assuage variance inflation due to multicollinearity we propose the group inverse-gamma gamma (GIGG) prior, a heavy-tailed prior that can trade off between local and group shrinkage in a data-adaptive fashion. A special case of the GIGG prior is the group horseshoe prior, whose shrinkage profile is correlated within group such that the regression coefficients marginally have exact horseshoe regularization. We show posterior consistency for regression coefficients in linear regression models and posterior concentration results for mean parameters in sparse normal means models. The full conditional distributions corresponding to GIGG regression can be derived in closed form, leading to straightforward posterior computation. We show via simulation that GIGG regression results in low mean-squared error across a wide range of correlation structures and within-group signal densities. We apply GIGG regression to data from the National Health and Nutrition Examination Survey for associating environmental exposures with liver functionality.
Keywords:
Global-Local Shrinkage Prior, Group Sparsity, Horseshoe Prior, Multicollinearity, Multipollutant Modeling.

∗ Corresponding Author: [email protected]

1. INTRODUCTION

Regression with grouped features is a common problem in many biomedical applications. Some examples include metabolomics data, where metabolites are grouped by subpathway membership; neuroimaging data, where adjacent voxels are spatially grouped; and environmental contaminants data, where exposures are grouped by chemical structure, toxicological profile, and pharmacokinetics (see Figure 1). In such cases, leveraging relevant grouping information to construct correlated within-group shrinkage profiles may help achieve additional variance reduction beyond comparable methods that ignore the grouping structure. The methodological focus of this article will be on incorporating known covariate grouping information into a continuous shrinkage prior framework.

Ever since the publication of the horseshoe prior (Carvalho et al., 2009, 2010), there has been an explosion of continuous shrinkage priors designed for sparse estimation problems, notably generalized double Pareto shrinkage (Armagan et al., 2013a), Dirichlet–Laplace shrinkage (Bhattacharya et al., 2015), horseshoe+ shrinkage (Bhadra et al., 2017), and normal beta prime (NBP) shrinkage (Bai and Ghosh, 2019), among others. These priors have become increasingly popular for sparse regression problems because of their good theoretical and empirical properties, in addition to their scale mixture representation, which facilitates straightforward and efficient posterior simulation algorithms. The general recipe for constructing a continuous shrinkage prior with good estimation and prediction properties is substantial mass at the origin, to sufficiently shrink null coefficients towards zero, and regularly varying tails, to avoid overregularizing non-null coefficients (Bhadra et al., 2016).
Surveying the continuous shrinkage prior literature on regression with known grouping structure, there are many papers which discuss the Bayesian group lasso and its applications (Kyung et al., 2010; Li et al., 2015; Xu and Ghosh, 2015; Hefley et al., 2017; Kang et al., 2019) and several papers which propose extensions to the Bayesian sparse group lasso (Xu and Ghosh, 2015), Bayesian group bridge regularization (Mallick and Yi, 2017), and the normal exponential gamma prior with grouping structure (Rockova and Lesaffre, 2014). Xu et al. (2016) introduced the so-called group horseshoe prior with an emphasis on prediction in Bayesian generalized additive models. However, the group horseshoe prior does not reduce to the horseshoe prior for a group of size one, meaning that the group horseshoe prior, as proposed by Xu et al. (2016), is not a direct generalization of the horseshoe prior.

Bayesian group lasso-style shrinkage is not generally preferred as a default method for estimation problems, as the Laplacian prior has neither an infinite spike at zero nor regularly varying tails (Polson and Scott, 2011; Castillo et al., 2015; Bhadra et al., 2016). The group horseshoe prior of Xu et al. (2016) has the desired origin and tail behavior marginally; however, no hyperparameter in the prior controls how correlated the shrinkage is within a group. Thus, this prior implicitly assumes that the degree of correlated shrinkage within group depends only on group size. This assumption is inadequate when we a priori believe that, irrespective of group size, some groups have more heterogeneous effect sizes than others and, moreover, it does not avail the opportunity to learn how correlated the shrinkage should be in a data-adaptive manner, which is an intrinsic feature in some application areas. For example, in modeling multiple pollutants this is a relevant consideration, as some exposure classes have more homogeneous toxicological profiles than others (Ferguson et al., 2014).
[Figure 1 appears here.]

Figure 1: Pairwise Spearman correlation plot between metals, phthalates, organochlorine pesticides, polybrominated diphenyl ethers, and polycyclic aromatic hydrocarbons from the 2003-2004 National Health and Nutrition Examination Survey (n = 990).

From a theoretical perspective, the existing posterior concentration and posterior consistency results for heavy-tailed continuous shrinkage priors all, to our knowledge, apply to independent or exchangeable priors, meaning that there have not yet been any attempts to employ similar arguments for dependent priors.

To address these limitations, we propose the group inverse-gamma gamma (GIGG) prior, which extends the horseshoe and normal beta prime (NBP) priors to incorporate grouping structures. The GIGG prior introduces a group-level shrinkage parameter, in addition to the usual global and local shrinkage parameters, such that the induced prior on the product of the group and local shrinkage parameters yields the desired marginal shrinkage profile. This allows the user to control the trade-off between group-level and individual-level shrinkage, leading to relatively low estimation error irrespective of the signal density and the degree of multicollinearity within each group. Additionally, the GIGG prior is constructed such that all parameters have closed-form full conditional distributions, implying that techniques to scale horseshoe regression to large sample sizes and high-dimensional covariate spaces are also applicable to GIGG regression (Bhattacharya et al., 2016; Terenin et al., 2019; Johndrow et al., 2020). Theoretically, we establish posterior consistency and posterior concentration results for regression coefficients with grouping structure in linear regression models and mean parameters with grouping structure in sparse normal means models with respect to several GIGG hyperparameters and correlation structures.
To our knowledge, we are the first to apply existing theoretical frameworks for posterior consistency in the sparse linear regression model (Armagan et al., 2013b) and posterior concentration in the sparse normal means model (Datta and Ghosh, 2013) to a non-exchangeable prior, which will be useful for future evaluations of other non-exchangeable priors.

The structure of the paper is as follows. An intuitive explanation of the GIGG prior is given in Section 2, followed by theoretical results in Section 3. After the methodological and theoretical discussion, we outline computational details, including hyperparameter estimation via marginal maximum likelihood estimation (MMLE) (Section 4). In Section 5, we conduct a simulation study to empirically verify that the intuition and theory developed in Sections 2 and 3 hold for linear regression models with group-correlated features. We then apply GIGG regression to data from the 2003-2004 National Health and Nutrition Examination Survey (NHANES) to identify toxicants and metals associated with a biomarker of liver function (Section 6) and conclude with a discussion (Section 7).
2. METHODS
Throughout the article, N(µ, Σ) denotes a multivariate normal distribution with mean parameter µ and variance-covariance matrix Σ, G(a, b) denotes a gamma distribution with shape parameter a and rate parameter b, and IG(a, b) denotes an inverse gamma distribution with shape parameter a and scale parameter b. Additionally, we will use π(·) as general notation for a prior probability measure and π(· | y) as general notation for a posterior probability measure.

Consider a Bayesian sparse linear regression model
$$[\mathbf{y} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \sigma^2] \sim N\Big(C\boldsymbol{\alpha} + \sum_{g=1}^{G} X_g \boldsymbol{\beta}_g,\ \sigma^2 I_n\Big), \quad \pi(\boldsymbol{\alpha}) \propto 1, \quad \boldsymbol{\beta} \sim \pi(\boldsymbol{\beta}), \quad \pi(\sigma^2) \propto \sigma^{-2}, \qquad (1)$$
where g = 1, ..., G indexes the groups, y is an n × 1 vector of outcomes, C is a matrix of adjustment covariates, X_g is an n × p_g matrix of standardized covariates in the g-th group, β_g = (β_{g1}, ..., β_{gp_g})ᵀ is a p_g × 1 vector of regression coefficients for the g-th group, β = (β_1ᵀ, ..., β_Gᵀ)ᵀ is a p × 1 vector of regression coefficients with p = Σ_g p_g, and I_n is an n × n identity matrix. We assume the model is sparse in the sense that many of the entries in β are zero. The group inverse-gamma gamma (GIGG) prior is defined as
$$[\beta_{gj} \mid \tau^2, \gamma_g^2, \lambda_{gj}^2] \sim N(0, \tau^2 \gamma_g^2 \lambda_{gj}^2), \quad [\gamma_g^2 \mid a_g] \sim G(a_g, 1), \quad [\lambda_{gj}^2 \mid b_g] \sim IG(b_g, 1), \quad [\tau^2, \sigma^2] \sim \pi(\tau^2, \sigma^2),$$
where j = 1, ..., p_g indexes the covariates within the g-th group. Alternatively, we may also express the prior on β as a vector, [β | τ², Γ, Λ] ∼ N(0, τ²ΓΛ), where Λ = diag(λ²_{11}, ..., λ²_{Gp_G}) and Γ = diag(γ²_1, ..., γ²_1, γ²_2, ..., γ²_2, ..., γ²_G, ..., γ²_G) such that γ²_g is repeated p_g times along the diagonal of Γ. In the GIGG prior specification, the priors on the group shrinkage parameter, γ²_g, and the local shrinkage parameter, λ²_{gj}, are selected such that the induced prior on the product is a beta prime prior, γ²_g λ²_{gj} ∼ β′(a_g, b_g) (see Appendix 8.1 for distributional definitions).
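The hierarchy above can be checked numerically: if γ²_g ∼ G(a_g, 1) and λ²_{gj} ∼ IG(b_g, 1) independently, the product γ²_g λ²_{gj} should follow a beta prime β′(a_g, b_g) distribution. A minimal Monte Carlo sketch in Python (an illustration, not code from the paper; hyperparameter values are chosen arbitrarily):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a_g, b_g = 1.0, 3.0          # illustrative hyperparameter values
n = 200_000

gamma2 = rng.gamma(shape=a_g, scale=1.0, size=n)       # gamma(a_g, rate 1)
lam2 = 1.0 / rng.gamma(shape=b_g, scale=1.0, size=n)   # inverse-gamma(b_g, scale 1)
prod = gamma2 * lam2

# beta-prime(a, b) has mean a / (b - 1) for b > 1
print(prod.mean())   # close to 1 / (3 - 1) = 0.5
print(stats.kstest(prod, stats.betaprime(a_g, b_g).cdf).pvalue)
```

The construction works because 1/λ²_{gj} ∼ G(b_g, 1), and a ratio of independent unit-rate gamma variables is beta prime distributed.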
Since the group shrinkage parameter is shared by all p_g observations in the g-th group, assigning a beta prime prior on the product allows for normal beta prime shrinkage marginally such that the shrinkage is correlated within group. One point that deserves further clarification is the assignment of the gamma and inverse-gamma priors to the group and local parameters, respectively, when either configuration would yield a beta prime prior on the product. The rationale behind this choice is that the inverse-gamma prior is heavier-tailed than the gamma prior, thereby preventing overregularization of large, non-null coefficients due to being grouped with null coefficients. Setting a_g = b_g = 1/2 for all g yields a special case of the GIGG prior called the group horseshoe prior, which has correlated horseshoe regularization within group. For a group of size one, the group shrinkage parameter becomes a local shrinkage parameter and we recover the horseshoe prior from the group horseshoe prior.

When discussing a proposed shrinkage prior on β, there are two key features of the marginal prior that need to be investigated. The first is the behavior in a tight neighborhood around zero and the second is the rate at which the prior decays in the extremes. For τ² = 1 fixed, Bai and Ghosh (2019) showed that the marginal prior π(β_{gj} | τ², a_g, b_g) has a pole at 0 if and only if 0 < a_g ≤ 1/2, with the pole at zero becoming stronger the closer a_g is to zero. Therefore, one should select a_g ∈ (0, 1/2] for sparse estimation problems to sufficiently shrink null coefficients towards zero. To clarify the tail behavior we need to introduce the notion of a regularly varying function (Bingham et al., 1989): a positive, measurable function f is said to be regularly varying at ∞ with index ω ∈ ℝ if lim_{x→∞} f(tx)/f(x) = t^ω for all t > 0.

Theorem 2.1.
Let B(a_g, b_g) denote the beta function evaluated at a_g and b_g and Γ(b_g + 1/2) denote the gamma function evaluated at b_g + 1/2. The tails of the marginal prior probability density function of β_{gj} decay at the following rate:
$$\lim_{\beta_{gj} \to \infty} \frac{\pi(\beta_{gj} \mid \tau^2, a_g, b_g)}{r(\beta_{gj}, \tau^2, a_g, b_g)} = 1, \qquad r(\beta_{gj}, \tau^2, a_g, b_g) = \frac{(2\tau^2)^{b_g}\, \Gamma(b_g + 1/2)}{\sqrt{\pi}\, B(a_g, b_g)}\, |\beta_{gj}|^{-(1+2b_g)} \left( \frac{\beta_{gj}^2/\tau^2}{1 + \beta_{gj}^2/\tau^2} \right)^{a_g}.$$
Consequently, the index of regular variation is ω = −1 − 2b_g.

Proof. See Appendix 8.2.

The concept of regular variation has been extensively discussed in the context of Bayesian robustness and noninformative inference (Dawid, 1973; O'Hagan, 1979; Andrade and O'Hagan, 2006), with the latter being recently elaborated on in the context of global-local shrinkage priors (Bhadra et al., 2016). When the index ω < 0, regular variation essentially states that the tail of the function decays at a polynomial rate and is therefore considered heavy-tailed. Some examples of priors with regularly varying tails include the Student's t prior and the horseshoe prior. Conversely, commonly used priors such as the normal prior and the Laplace prior do not have regularly varying tails. As a consequence of having exponentially decaying tails, Bayesian linear regression with independent normal priors and the Bayesian lasso are prone to overregularizing large signals and are not flexible enough to facilitate conflict resolution between discordant likelihood and prior information (Andrade and O'Hagan, 2006; Polson and Scott, 2011). Theorem 2.1 shows that for any pair of hyperparameters a_g and b_g, the marginal GIGG prior has regularly varying tails and, furthermore, that b_g controls the rate at which the tails decay.

To further elucidate the shrinkage profile of the GIGG prior, we will focus on a special case of the sparse linear regression model called the sparse normal means model (X = I_n and C empty). In the global-local shrinkage prior literature, it is conventional to work with the sparse normal means problem for analytical tractability, even when the ultimate goal is regression (Rockova and Lesaffre, 2014; Bhattacharya et al., 2015), as the posterior mean has a convenient representation, E[β_{gj} | y_{gj}, τ², σ²] = (1 − E[κ_{gj} | y_{gj}, τ², σ²]) y_{gj}. Here, κ_{gj} = σ²/(σ² + τ²γ²_g λ²_{gj}) is called a shrinkage factor, because it quantifies how much the posterior mean is shrunk relative to the maximum likelihood estimator y_{gj}.
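The polynomial tail rate in Theorem 2.1 can be observed empirically. In the sketch below (an illustration, not code from the paper), marginal draws of β_{gj} with τ² = 1 and a_g = b_g = 1/2 (the group horseshoe case) are generated, and the empirical survival function P(|β| > t) is checked for the t^{−2b_g} = t^{−1} decay implied by a density tail of order |β|^{−(1+2b_g)}:

```python
import numpy as np

rng = np.random.default_rng(1)
a_g = b_g = 0.5          # group horseshoe case; density tail index is -(1 + 2*b_g) = -2
n = 2_000_000

gamma2 = rng.gamma(a_g, 1.0, n)
lam2 = 1.0 / rng.gamma(b_g, 1.0, n)
beta = np.sqrt(gamma2 * lam2) * rng.standard_normal(n)   # marginal draws with tau^2 = 1

# the survival function should scale like t^(-2*b_g) = 1/t
p10, p100 = np.mean(np.abs(beta) > 10), np.mean(np.abs(beta) > 100)
slope = np.log(p10 / p100) / np.log(10.0)
print(slope)   # near 1, i.e. a polynomial (regularly varying) tail
```

Repeating this with normal or Laplace draws instead gives essentially no exceedances beyond t = 10, reflecting their exponentially decaying tails.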
Calculating the joint prior distribution for the shrinkage factors in the g-th group, κ_g = (κ_{g1}, ..., κ_{gp_g})ᵀ, we have
$$\pi\big(\boldsymbol{\kappa}_g \mid \tau^2, \sigma^2, a_g, b_g\big) = \frac{\Gamma(a_g + p_g b_g)}{\Gamma(a_g)\big(\Gamma(b_g)\big)^{p_g}} \left(\frac{\tau^2}{\sigma^2}\right)^{p_g b_g} \left(1 + \frac{\tau^2}{\sigma^2} \sum_{j=1}^{p_g} \frac{\kappa_{gj}}{1 - \kappa_{gj}}\right)^{-(a_g + p_g b_g)} \left(\prod_{j=1}^{p_g} \kappa_{gj}^{b_g - 1} (1 - \kappa_{gj})^{-(b_g + 1)}\right),$$
where 0 < κ_{gj} < 1 for 1 ≤ j ≤ p_g. Evaluating the prior distribution of κ_g, we see that the joint density multiplicatively factorizes into "dependent" and "independent" parts, where the degree to which the within-group shrinkage is correlated is governed by the Σ_{j=1}^{p_g} κ_{gj}/(1 − κ_{gj}) term. That is, as a_g + p_g b_g goes to zero, the regularization is highly individualistic, whereas if a_g + p_g b_g moves away from zero, then the shrinkage becomes increasingly more correlated within the g-th group.

Figure 2 illustrates the marginal posterior mean of β_{g1} for a group of size two as a function of a_g, b_g, y_{g1}, and y_{g2}. When a_g and b_g are close to zero, the thresholding effect on the marginal posterior mean of β_{g1} hardly depends on the value of y_{g2}, indicating highly individualistic shrinkage. This corroborates our intuition from looking at the joint prior distribution of the shrinkage weights within the same group. The second major observation is that as b_g moves away from zero, the marginal posterior mean of β_{g1} becomes increasingly more dependent on the value of y_{g2}. In particular, if we look at the case when a_g = 0.05 and b_g = 2, we see that when y_{g2} = 0 the thresholding effect on β_{g1} is much stronger when compared to y_{g2} = 10. The last major observation is that as a_g moves away from 0, the thresholding effect becomes weaker. Therefore, a_g effectively controls the overall strength of the shrinkage, whereas b_g generally controls the dependence of the within-group shrinkage.

[Figure 2 appears here: eight panels plotting the marginal posterior mean of β_{g1} against y_{g1} for combinations of a_g ∈ {0.05, 0.5} and b_g ∈ {0.05, 0.5, 1, 2}, with curves for y_{g2} = 0 and y_{g2} = 10.]

Figure 2: Marginal posterior mean of β_{g1} for a group with two observations as a_g, b_g, y_{g1}, and y_{g2} vary. Here, τ² = 0.… and σ² = 1 are fixed.
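The role of b_g in coupling the within-group shrinkage factors can also be made concrete by simulating κ_g from its prior for a group of size two and computing the correlation between κ_{g1} and κ_{g2}. The sketch below is illustrative only (the hyperparameter values are chosen for the demonstration, not taken from the paper):

```python
import numpy as np

def kappa_prior_draws(a_g, b_g, p_g=2, tau2=1.0, sigma2=1.0, n=200_000, seed=0):
    """Draw shrinkage factors kappa_gj = sigma^2 / (sigma^2 + tau^2 gamma_g^2 lambda_gj^2)."""
    rng = np.random.default_rng(seed)
    gamma2 = rng.gamma(a_g, 1.0, n)                # shared group shrinkage parameter
    lam2 = 1.0 / rng.gamma(b_g, 1.0, (n, p_g))    # independent local shrinkage parameters
    return sigma2 / (sigma2 + tau2 * gamma2[:, None] * lam2)

# within-group shrinkage correlation grows with b_g (i.e., with a_g + p_g * b_g)
k_small = kappa_prior_draws(a_g=0.5, b_g=0.1)
k_large = kappa_prior_draws(a_g=0.5, b_g=2.0)
corr_small = np.corrcoef(k_small[:, 0], k_small[:, 1])[0, 1]
corr_large = np.corrcoef(k_large[:, 0], k_large[:, 1])[0, 1]
print(corr_small, corr_large)   # the second correlation is substantially larger
```

With b_g small, the heavy-tailed local parameters dominate and the two shrinkage factors behave almost independently; with b_g large, the shared γ²_g drives both factors together.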
3. THEORETICAL PROPERTIES
Let X_n = [X_1, ..., X_{G_n}] and let H_n = {a, b} denote the collection of hyperparameters, where a = {a_1, ..., a_{G_n}} and b = {b_1, ..., b_{G_n}}. Here, the subscript n in G_n refers to the fact that the number of groups in the covariate space is growing as a function of the sample size. Furthermore, let A_n = {(g, j) : β_{0,gj} ≠ 0} denote the true active set with cardinality |A_n|. Then Theorem 3.1 states that the posterior distribution of β_n under the GIGG prior is consistent a posteriori for the true β_{0n}. Similarly, we add a subscript n to β_n and β_{0n} to indicate that the number of regression coefficients is growing as a function of the sample size.

Theorem 3.1.
Suppose that p_n = o(n), that L_n = sup_{(g,j)} |β_{0,gj}| < ∞, where β_{0,gj} denotes the true j-th regression coefficient in the g-th group, that 0 < lim_{n→∞} inf H_n ≤ lim_{n→∞} sup H_n < ∞, and that |A_n| = o(n/log(n)). Further, suppose that the smallest and largest singular values of X_n, denoted by θ_{n,min}(X_n) and θ_{n,max}(X_n), satisfy 0 < lim inf_{n→∞} θ_{n,min}(X_n)/√n ≤ lim sup_{n→∞} θ_{n,max}(X_n)/√n < ∞. Then for any ε > 0,
$$\pi_n\big(\boldsymbol{\beta}_n : \|\boldsymbol{\beta}_n - \boldsymbol{\beta}_{0n}\| < \epsilon \mid \mathbf{y}_n, H_n, \tau_n^2, \sigma^2\big) \to 1$$
almost surely as n → ∞, provided that τ²_n = C/(p_n n^ρ log(n)) for some ρ, C ∈ (0, ∞).

Proof. See Appendix 8.3.

Of note, the only restrictions placed on the values of the hyperparameters in Theorem 3.1 are that they do not converge to the boundary of the hyperparameter space as n → ∞.

Remark 3.2.
Theorem 3.1 is a generalization of Theorem 5 in Armagan et al. (2013b), which proved posterior consistency for the NBP prior when b_g ∈ (1, ∞). Restricting b_g ∈ (1, ∞) was done to utilize an argument which required the existence of the second moment of β_{gj}, but it does not cover special cases of particular interest such as the horseshoe prior. Therefore, our result extends the existing posterior consistency result from Armagan et al. (2013b) to a more general collection of hyperparameter values with potential grouping structure.

Next, we partially extend the posterior concentration theoretical framework for the sparse normal means model, developed in Section 3.2 of Datta and Ghosh (2013), to a low-dimensional linear regression (p < n) model with general correlation structure. Going forward, we will drop the subscript n from the notation introduced in the statement of Theorem 3.1 to clarify that the subsequent theoretical results hold for fixed p.

Theorem 3.3.
Fix ε ∈ (0, 1), p, and n, such that p < n. Further, suppose that the smallest and largest singular values of XᵀX, denoted by θ_min(XᵀX) and θ_max(XᵀX), satisfy 0 < θ_min(XᵀX) ≤ θ_max(XᵀX) < ∞. The full conditional posterior mean corresponding to the GIGG prior is
$$E[\boldsymbol{\beta} \mid \cdot] = \Big(I_p + (X^\top X)^{-1} \sigma^2 \tau^{-2} \Gamma^{-1} \Lambda^{-1}\Big)^{-1} \hat{\boldsymbol{\beta}}^{OLS}, \qquad \hat{\boldsymbol{\beta}}^{OLS} = (X^\top X)^{-1} X^\top \mathbf{y}.$$
Then the inequality
$$\Big\| \hat{\boldsymbol{\beta}}^{OLS} - E[\boldsymbol{\beta} \mid \cdot] \Big\|_2 \geq \left( \frac{1}{1 + \theta_{\max}(X^\top X)\, \sigma^{-2} \tau^2 \max_{(g,j)} \gamma_g^2 \lambda_{gj}^2} \right) \Big\| \hat{\boldsymbol{\beta}}^{OLS} \Big\|_2$$
holds and we have the following results:

a) $\pi\left( \dfrac{1}{1 + \theta_{\max}(X^\top X)\, \sigma^{-2} \tau^2 \max_{(g,j)} \gamma_g^2 \lambda_{gj}^2} \geq \epsilon \,\middle|\, \mathbf{y}, H, \tau^2, \sigma^2 \right) \to 1$ as τ² → 0.

b) $\pi\left( \big\| \hat{\boldsymbol{\beta}}^{OLS} - E[\boldsymbol{\beta} \mid \cdot] \big\|_2 \geq \epsilon \big\| \hat{\boldsymbol{\beta}}^{OLS} \big\|_2 \,\middle|\, \mathbf{y}, H, \tau^2, \sigma^2 \right) \to 1$ as τ² → 0.

Proof. See Appendix 8.4.

Theorem 3.3 states that, irrespective of the correlation structure, as τ² → 0 the distance between the posterior mean and the ordinary least squares estimator approaches the full length of the ordinary least squares estimator; that is, the entire posterior mean vector is shrunk toward zero.

Corollary 3.1.
Suppose that the covariates in X satisfy X_gᵀX_{g′} = 0 for all g ≠ g′, where 0 denotes a p_g × p_{g′} matrix of zeros. If τ², σ², and a_g ∈ (0, ∞) are fixed, then there exists a constant
$$\epsilon_g(\tau^2, \sigma^2) = \frac{\sigma^2}{\sigma^2 + \theta_{\max}(X_g^\top X_g)\, \tau^2},$$
such that for all δ ∈ (0, ε_g(τ², σ²)),
$$\pi\left( \big\| \hat{\boldsymbol{\beta}}_g^{OLS} - E[\boldsymbol{\beta}_g \mid \cdot] \big\|_2 \geq \delta \big\| \hat{\boldsymbol{\beta}}_g^{OLS} \big\|_2 \,\middle|\, \mathbf{y}, H, \tau^2, \sigma^2 \right) \to 1$$
as b_g → ∞.

Proof. See Appendix 8.5.

The conclusion of Corollary 3.1 is that if the hyperparameter b_g → ∞, then there is at least some amount of shrinkage relative to the ordinary least squares estimator in the g-th group. If τ²/σ² is close to zero, then ε_g(τ², σ²) ≈ 1, implying shrinkage of the posterior mean towards zero. Therefore, we can interpret the case when b_g → ∞ and τ²/σ² close to zero as shrinkage of the entire g-th group towards zero.

Although we would ideally consider additional posterior concentration results within the context of a linear regression model, there is not an analytically tractable analog of componentwise shrinkage factors for a general design matrix without any orthogonality. Therefore, we will proceed by considering posterior concentration results within the sparse normal means framework, to make precise statements regarding componentwise shrinkage, as opposed to shrinkage of the entire L₂-norm.

One question that arises is whether the dependence induced between the β_{gj}'s by γ²_g will overly dominate the individual-level shrinkage. As an example, one can conceptualize a case where a group has only one signal, which is overly shrunk by virtue of being grouped with an overwhelming majority of null means. Theorem 3.4a states that if the gl-th observation is sufficiently large, then there will be minimal shrinkage on y_{gl}. This guarantees that group shrinkage will not overly dominate individual shrinkage if the observation is large. Conversely, Theorem 3.4b states that if the global shrinkage parameter converges to zero, then the GIGG prior will sufficiently shrink the y_{gl}'s toward zero. Let y_g = (y_{g1}, ..., y_{gp_g})ᵀ.

Theorem 3.4.
Suppose that p_g ∈ {1, 2, ...}.

a) Fix ψ, δ ∈ (0, 1). Then there exists a function h₁(p_g, τ², σ², a_g, b_g, ψ, δ) such that
$$\pi(\kappa_{gl} > \psi \mid \mathbf{y}_g, \tau^2, \sigma^2, a_g, b_g) \leq \exp\left( -\frac{\psi(1-\delta)}{2\sigma^2}\, y_{gl}^2 + \frac{\psi\delta}{2\sigma^2} \sum_{j \neq l} y_{gj}^2 \right) h_1(p_g, \tau^2, \sigma^2, a_g, b_g, \psi, \delta).$$
Consequently, if |y_{gl}| → ∞, then π(κ_{gl} ≤ ψ | y_g, τ², σ², a_g, b_g) → 1.

b) Fix ε ∈ (0, 1). Then there exists a function h₂(p_g, σ², y_g, a_g, b_g, ε) such that
$$\pi(\kappa_{gl} < \epsilon \mid \mathbf{y}_g, \tau^2, \sigma^2, a_g, b_g) \leq \left( \frac{\tau^2}{\sigma^2} \right)^{p_g/2 + b_g} \left( \min\left(1, \frac{\tau^2}{\sigma^2}\right) \right)^{-p_g/2} h_2(p_g, \sigma^2, \mathbf{y}_g, a_g, b_g, \epsilon).$$
Consequently, π(κ_{gl} ≥ ε | y_g, τ², σ², a_g, b_g) → 1 as τ² → 0.

Proof. See Appendices 8.6 and 8.7.

The theoretical statements outlined in Theorem 3.4 were originally discussed for the horseshoe prior (Datta and Ghosh, 2013), but have also been used in the context of several other continuous shrinkage priors (Datta and Dunson, 2016; Bhadra et al., 2017; Bai and Ghosh, 2019), dynamic trend filtering (Kowal et al., 2019), and small area estimation (Tang et al., 2018a). Examining Theorem 3.4, we first note that neither result restricts the range of values a_g and b_g can take. Therefore, Theorem 3.4 applies to a more general class of hyperparameter values than those considered in Bai and Ghosh (2019). Secondly, the rate at which the upper bound in Theorem 3.4b converges to zero as τ² → 0 depends on b_g, with larger values of b_g corresponding to a tighter upper bound. To better understand the role of b_g we have the following result.

Corollary 3.2.
Suppose that p_g ∈ {1, 2, ...}. If τ², σ², and a_g ∈ (0, ∞) are fixed, then there exists a constant
$$\epsilon(\tau^2, \sigma^2, p_g) = \left(1 + \frac{\tau^2}{\sigma^2}\, p_g^{p_g}\right)^{-1},$$
such that π(κ_{gl} < ε(τ², σ², p_g) | y_g, τ², σ², a_g, b_g) → 0 as b_g → ∞.

Proof. See Appendix 8.8.

The conclusion of Corollary 3.2 is nearly identical to the conclusion of Corollary 3.1, as τ²/σ² controls the degree of shrinkage provided as a result of taking b_g → ∞.
4. COMPUTATION
The full conditional updates corresponding to model (1), where β is endowed with a GIGG prior, are enumerated in Appendix 8.9. Following Polson and Scott (2011), we assign a half-Cauchy prior scaled by the residual error standard deviation, τ | σ ∼ C⁺(0, σ), and use a prevalent data augmentation trick,
$$[\tau^2 \mid \nu] \sim IG(1/2, 1/\nu), \qquad [\nu \mid \sigma^2] \sim IG(1/2, 1/\sigma^2),$$
to obtain closed-form full conditional updates for τ² and σ² (Makalic and Schmidt, 2016).

There are two major computational bottlenecks for the proposed algorithm. The first is the full conditional update of β,
$$[\boldsymbol{\beta} \mid \cdot] \sim N\left( Q^{-1} \frac{1}{\sigma^2} X^\top \big( \mathbf{y} - C\boldsymbol{\alpha} \big),\; Q^{-1} \right), \qquad Q = \frac{1}{\sigma^2} X^\top X + \frac{1}{\tau^2} \Gamma^{-1} \Lambda^{-1}.$$
The second occurs when there are a multitude of group and local parameters that need to be drawn at each iteration of the Gibbs sampler, which is often the case in "large p" scenarios. Rather than naïvely sampling from the full conditional distributions, there are several strategies to achieve faster posterior computation with both of these computational challenges in mind:

• Draw v ∼ N(σ⁻²Xᵀ(y − Cα), Q), and then solve Qβ = v, rather than explicitly calculating Q⁻¹.

• For "small n, large p" problems, the Woodbury identity can be utilized so that the full conditional update of β scales linearly in p (Bhattacharya et al., 2016).

• If n and p are both large, say on the order of 10,000 each, there are several recently developed approximation approaches: one exploits the ability of the horseshoe prior to shrink τ²λ²_{gj} close to zero (Johndrow et al., 2020), while another uses a conjugate gradient algorithm to find an approximate solution to Qβ = v (Nishimura and Suchard, 2020).

• Parallelization can be used within the Gibbs sampler to simultaneously update the shrinkage parameters corresponding to each group (Terenin et al., 2019).
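The first bullet is the standard precision-based sampling identity: if v ∼ N(m, Q), then Q⁻¹v ∼ N(Q⁻¹m, Q⁻¹), which is exactly the full conditional of β. A sketch with a Cholesky factorization (illustrative dimensions and a stand-in diagonal for Γ⁻¹Λ⁻¹; not the paper's implementation):

```python
import numpy as np
from scipy.linalg import cholesky, cho_factor, cho_solve

rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = X @ np.r_[np.ones(2), np.zeros(p - 2)] + rng.standard_normal(n)
sigma2, tau2 = 1.0, 0.1
# stand-in draws for gamma_g^2 * lambda_gj^2, clipped for numerical stability
d = np.clip(rng.gamma(0.5, 1.0, p) / rng.gamma(0.5, 1.0, p), 1e-4, 1e4)

Q = X.T @ X / sigma2 + np.diag(1.0 / (tau2 * d))   # posterior precision of beta
m = X.T @ y / sigma2                               # no adjustment covariates C in this sketch

L = cholesky(Q, lower=True)
v = m + L @ rng.standard_normal(p)                 # v ~ N(m, Q) since Cov(L w) = L L^T = Q
beta_draw = cho_solve(cho_factor(Q, lower=True), v)  # solve Q beta = v; Q^{-1} is never formed
```

Averaging many such draws recovers the full conditional mean Q⁻¹m, while each draw has covariance Q⁻¹.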
If the modeler wants to remain relatively agnostic to the choice of hyperparameters, one can use marginal maximum likelihood estimation (MMLE) (Casella, 2001), an empirical Bayes approach executed iteratively within the Gibbs sampler. The (l+1)-th update is
$$a_g^{(l+1)} = \psi^{-1}\Big( E_{a_g^{(l)}}\big[ \log(\gamma_g^2) \mid \mathbf{y} \big] \Big), \qquad b_g^{(l+1)} = \psi^{-1}\Big( -\frac{1}{p_g} \sum_{j=1}^{p_g} E_{b_g^{(l)}}\big[ \log(\lambda_{gj}^2) \mid \mathbf{y} \big] \Big),$$
where ψ(·) is the digamma function and the expectation terms can be estimated through standard Monte Carlo methods. The iterative procedure terminates when $\sum_{g=1}^{G} \big(a_g^{(l+1)} - a_g^{(l)}\big)^2 + \sum_{g=1}^{G} \big(b_g^{(l+1)} - b_g^{(l)}\big)^2$ is less than some prespecified error tolerance. However, in our experience it is preferable to fix a_g = 1/n for all g and use MMLE to estimate only the b_g hyperparameters. The first reason is that a_g controls the strength of the thresholding effect, and choosing a_g close to zero guarantees strong shrinkage of null coefficients towards zero. The second reason is that estimating one hyperparameter per group is more feasible than estimating two hyperparameters per group, particularly when the number of groups is large. Since b_g primarily controls how correlated the shrinkage is within group, it is more important to focus estimation on the b_g hyperparameters. We do recognize that setting a_g = 1/n violates a condition in Theorem 3.1, where the infimum of the set of hyperparameters cannot converge to zero as n → ∞. However, for practical purposes, this approach provides an automatic way to set a_g while also yielding results similar to fixing a_g at a small constant close to zero. Although MMLE works well when the number of groups G is small relative to the sample size, the estimates for the a_g's and b_g's will become increasingly variable in high-dimensional settings where the number of groups is large.
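The update requires inverting the digamma function; ψ⁻¹ has no closed form, but Newton's method with Minka's initialization converges in a few iterations. A sketch of the b_g update (the draws of λ²_{gj} are simulated from the prior here purely to illustrate the calculation; in practice they would come from the Gibbs sampler):

```python
import numpy as np
from scipy.special import digamma, polygamma

def inv_digamma(y, iters=30):
    """Solve digamma(x) = y for x > 0 via Newton's method (Minka's initialization)."""
    x = np.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y - digamma(1.0))
    for _ in range(iters):
        x -= (digamma(x) - y) / polygamma(1, x)   # polygamma(1, .) is the trigamma function
    return x

# sanity check of the inverse
print(inv_digamma(digamma(2.5)))   # ~ 2.5

# illustrative b_g update: draws of lambda_gj^2 simulated from IG(b_true, 1)
rng = np.random.default_rng(4)
b_true = 1.5
lam2_draws = 1.0 / rng.gamma(b_true, 1.0, 1_000_000)
b_update = inv_digamma(-np.mean(np.log(lam2_draws)))
print(b_update)   # ~ 1.5, since E[log lambda^2] = -digamma(b) under IG(b, 1)
```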
There may also be low-dimensional settings where the user wants to incorporate explicit prior knowledge about the nature of the within-group signal density. In such cases, it may be preferable to fix hyperparameter values in accordance with subject-matter expertise. As with the modified MMLE approach, we recommend setting a_g = 1/n for all g. To fix b_g we recommend a useful heuristic whereby local, group, and global shrinkage parameters are simulated from the GIGG prior. Using the simulated shrinkage parameters, shrinkage factors can be constructed and the correlation between shrinkage factors within the same group can be empirically calculated. Selecting the hyperparameter b_g is then equivalent to selecting how correlated the shrinkage is within group, a more easily understandable concept. Implementations of GIGG regression with fixed hyperparameters and with hyperparameters estimated via MMLE are available on Github.
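This heuristic can be implemented directly: simulate shrinkage factors from the GIGG prior over a grid of b_g values and select the value whose empirical within-group shrinkage correlation is closest to a target. The sketch below is one possible implementation of the idea (the grid, group size, and default a_g are illustrative choices, not the paper's):

```python
import numpy as np

def shrinkage_corr(a_g, b_g, p_g=2, tau2=1.0, sigma2=1.0, n=200_000, seed=5):
    """Empirical prior correlation between two shrinkage factors in the same group."""
    rng = np.random.default_rng(seed)
    gamma2 = rng.gamma(a_g, 1.0, n)
    lam2 = 1.0 / rng.gamma(b_g, 1.0, (n, p_g))
    kappa = sigma2 / (sigma2 + tau2 * gamma2[:, None] * lam2)
    return np.corrcoef(kappa[:, 0], kappa[:, 1])[0, 1]

def pick_b(target_corr, a_g=0.1, grid=(0.1, 0.25, 0.5, 1.0, 2.0, 4.0)):
    """Choose the grid value of b_g whose prior shrinkage correlation is closest to target."""
    corrs = [shrinkage_corr(a_g, b) for b in grid]
    return grid[int(np.argmin([abs(c - target_corr) for c in corrs]))]

print(pick_b(0.5))   # returns a b_g from the grid
```

The user then reasons about a target correlation, an interpretable quantity, rather than about b_g directly.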
5. SIMULATIONS
The data generative mechanism is linear regression model (1), where C includes the intercept term and five adjustment covariates drawn from independent standard normal distributions, α is a fixed coefficient vector, and X is drawn from a multivariate normal distribution with mean 0 and covariance matrix Σ_X. Σ_X is determined such that the features have unit variance and a block-diagonal exchangeable correlation structure. For all simulation settings, n = 500 and p = 50, with the 50 covariates evenly divided into five groups. Pairwise correlations within each group are set to a common value ρ, with one high correlation setting and one medium correlation setting. The residual variance σ² is fixed such that the proportion of variance explained, β⊤Σ_Xβ/(β⊤Σ_Xβ + σ²), is held constant. We consider two fixed configurations of β, which will be qualitatively referred to as the concentrated signal setting and the distributed signal setting. In the concentrated signal setting, there is only one true signal in each of the five groups, with magnitudes that increase across the groups. Rather than having within-group sparsity, the distributed signal setting assumes that the signal is shared across all members of the first group, with a common nonzero magnitude for the first ten coefficients and zero coefficients elsewhere. The purpose of the fixed coefficient simulation settings is to ascertain which methods perform well when the within-group signal is sparse or dense.

Beyond the fixed regression coefficient simulation settings, we also consider a random coefficient simulation in the high correlation setting, where for each simulation iteration a random regression coefficient vector is generated. To construct a regression coefficient vector, we start by randomly selecting either a concentrated or distributed signal for the first group with even probability, guaranteeing that each simulation iteration has at least one true signal. The concentrated and distributed signal magnitudes are selected such that the contributions to β⊤Σ_Xβ are equal, namely the distributed signal is β_gj = 0.25 for j = 1, ..., 10 and the concentrated signal is β_g1 = 5.125^(1/2) and β_gj = 0 for j = 2, ..., 10. For the other four groups, we randomly select a concentrated signal with probability 0.2, a distributed signal with probability 0.2, and no signal with probability 0.6. The goal of the random coefficient simulation setting is to show that, averaged across many combinations of regression coefficient vectors comprised of sparse within-group signals, dense within-group signals, and inactive groups, GIGG regression with MMLE results in the lowest mean-squared error.
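This generative mechanism can be sketched as follows. The value of ρ and the coefficient vector are left as arguments since they vary across settings; ρ = 0.8 below is only an illustrative default and the function names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def block_exchangeable_cov(p, group_size, rho):
    """Unit-variance covariance with block-diagonal exchangeable correlation."""
    Sigma = np.zeros((p, p))
    for start in range(0, p, group_size):
        block = np.full((group_size, group_size), rho)
        np.fill_diagonal(block, 1.0)  # unit variances on the diagonal
        Sigma[start:start + group_size, start:start + group_size] = block
    return Sigma

def simulate(n=500, p=50, group_size=10, rho=0.8, beta=None, sigma=1.0):
    """Draw X ~ N(0, Sigma_X) with block-correlated predictors and y = X beta + eps."""
    Sigma = block_exchangeable_cov(p, group_size, rho)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p) if beta is None else np.asarray(beta, dtype=float)
    y = X @ beta + rng.normal(scale=sigma, size=n)
    return X, y, Sigma
```

Swapping in a concentrated or distributed `beta` reproduces the two fixed-coefficient configurations described above.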
Estimation properties are evaluated based on empirical mean-squared error (MSE), stratified by null and non-null coefficients, across 5000 replicates. In the random coefficient simulations, calculating the MSE corresponds to an integrated mean-squared error (IMSE) metric averaged across the generative distribution of the regression coefficient vectors. For the fixed coefficient simulations we consider several special cases of the GIGG prior with fixed hyperparameters, namely all possible combinations of a_g ∈ {1/n, 1/2} and b_g ∈ {1/n, 1/2, 1}. That way, we can check whether the intuition gleaned from Figure 2 empirically translates to the regression setting. We also consider the GIGG prior with a_g = 1/n fixed and b_g estimated via MMLE.

The list of competing methods includes ordinary least squares (OLS), horseshoe regression, group horseshoe regression (Xu et al., 2016), the Spike-and-Slab Lasso (Rockova and George, 2018), the Bayesian Group Lasso with Spike-and-Slab Priors (BGL-SS) (Xu and Ghosh, 2015), and Bayesian Sparse Group Selection with Spike-and-Slab Priors (BSGS-SS) (Xu and Ghosh, 2015). To avoid confusion with the group horseshoe prior proposed in this paper, we refer to the group horseshoe prior from Xu et al. (2016) as the group half Cauchy prior throughout the rest of the simulation section. Most methods requiring Markov chain Monte Carlo (MCMC) sampling use 10000 burn-in draws, followed by 10000 posterior draws with no thinning. The only exceptions are BGL-SS and BSGS-SS, which use 1000 burn-in draws and 2000 posterior draws without any thinning, due to their relatively slower posterior sampling algorithms. Table 1 presents the MSE for the high correlation simulation settings and Table 2 lists the MSE for the medium correlation simulation settings. Because the results for the high correlation and medium correlation settings are similar, we focus our discussion on the high correlation simulation settings.
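The stratified MSE metric can be computed directly from the replicate-level estimates; a minimal sketch (the function name is ours):

```python
import numpy as np

def stratified_mse(beta_hats, beta_true):
    """Empirical MSE across replicates, stratified by null / non-null coefficients.

    beta_hats: (replicates, p) array of estimates; beta_true: (p,) true vector."""
    beta_hats = np.asarray(beta_hats, dtype=float)
    beta_true = np.asarray(beta_true, dtype=float)
    err2 = (beta_hats - beta_true) ** 2      # squared error per replicate and coefficient
    null = beta_true == 0
    return {"null": err2[:, null].mean(), "non_null": err2[:, ~null].mean()}
```

In the random coefficient simulations, averaging the same quantity over draws of the coefficient vector yields the IMSE.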
Method                          Concentrated        Distributed
                                Null    Non-Null    Null    Non-Null
Ordinary Least Squares          3.74    0.41        8.09    2.03
Horseshoe                       0.51    0.41        0.85    2.14
GIGG (a_g = 1/n, b_g = 1/n)     --      --          --      --
GIGG (a_g = 1/2, b_g = 1/n)     --      --          --      --
GIGG (a_g = 1/n, b_g = 1/2)     0.29    0.39        --      --
GIGG (a_g = 1/2, b_g = 1/2)     0.33    0.40        0.24    1.70
GIGG (a_g = 1/n, b_g = 1)       0.53    0.49        --      --
GIGG (a_g = 1/2, b_g = 1)       0.58    0.49        0.26    1.43
GIGG (MMLE)                     --      --          --      --
Group Half Cauchy               0.30    0.39        0.08    1.64
Spike-and-Slab Lasso            --      --          --      --
BGL-SS                          --      --          --      --
BSGS-SS                         0.23    0.42        0.04    1.84

Table 1: Mean-squared errors (MSE) for the fixed regression coefficient simulation settings (n = 500, p = 50) with high pairwise correlations.

The first noteworthy observation is that the GIGG priors with fixed hyperparameters behave exactly as Figure 2 suggests: the best performance corresponds to b_g = 1/n when the signal is concentrated within-group (Null MSE = 0.11, Non-Null MSE = 0.30) and to a_g = 1/n, b_g = 1 when the signal is distributed within-group (Null MSE = 0.03, Non-Null MSE = 1.43). However, if the user sets b_g = 1 when the signal is concentrated (Null MSE = 0.53, Non-Null MSE = 0.49) or b_g = 1/n when the signal is distributed (Null MSE = 0.03, Non-Null MSE = 3.59), then the "incorrect" prior information results in notably worse MSE compared to the "correct" prior information. That being said, b_g = 1/2 offers a reasonable compromise between the two settings.
Null Non-Null Null Non-NullOrdinary Least Squares 1.88 0.21 3.20 0.79Horseshoe 0.29 0.21 0.52 0.94GIGG ( a g = 1 /n, b g = 1 /n ) a g = 1 / , b g = 1 /n ) a g = 1 /n, b g = 1 /
2) 0.15 0.22 *GIGG ( a g = 1 / , b g = 1 /
2) 0.18 0.21 0.16 0.73GIGG ( a g = 1 /n, b g = 1) 0.29 0.26 GIGG ( a g = 1 / , b g = 1) 0.33 0.25 0.16 0.66GIGG (MMLE) Group Half Cauchy 0.17 0.21 0.07 0.71Spike-and-Slab Lasso
BSGS-SS 0.10 0.22 0.01 0.81Table 2: Mean-squared errors (MSE) for the fixed regression coefficient simulation settings( n = 500 , p = 50) with medium pairwise correlations ( ρ = 0 . a g = 1 / b g = 1 /
6. DATA EXAMPLE
The National Health and Nutrition Examination Survey (NHANES) is a collection of studies conducted by the National Center for Health Statistics with the overarching goal of evaluating the health and nutritional status of the United States' populace. Data collection consists of a written survey and physical examination which record demographic, socioeconomic, dietary, and health-related information, including physiological measurements and laboratory tests. We apply GIGG regression to a subset of 990 adults from the 2003-2004 NHANES cycle with 35 measured contaminants across five exposure classes: metals, phthalates, organochlorine pesticides, polybrominated diphenyl ethers (PBDEs), and polycyclic aromatic hydrocarbons (PAHs). Figure 1 illustrates the block diagonal correlation structure of these exposures, where areas of high correlation are mostly contained within exposure class. Gamma glutamyl transferase (GGT), an enzymatic marker of liver functionality, will be the outcome of interest. GGT and all environmental exposures were log-transformed to remove right skewness.
Method                      Null    Non-Null
Ordinary Least Squares      8.84    3.38
Horseshoe                   0.70    1.18
Group Horseshoe             --      --
Group Half Cauchy           --      --
GIGG (MMLE)                 --      --
Spike-and-Slab Lasso        0.16    3.65
BGL-SS                      2.84    2.44
BSGS-SS                     0.36    1.45

Table 3: Integrated mean-squared errors (IMSE) for the random regression coefficient simulation setting (n = 500, p = 50) with high pairwise correlations.
Horseshoe regression had potential scale reduction factor (PSRF) (Gelman and Rubin, 1992) values of approximately 1.00 for all regression coefficients, and GIGG regression had PSRF values close to 1.00, indicating adequate MCMC convergence.

[Figure 3 plots, for each exposure within the metals, phthalate, pesticide, PBDE, and PAH classes, the estimated percent change in GGT for a twofold change in exposure, with results shown for GIGG regression, linear regression, ridge regression, and horseshoe regression.]
Figure 3: Estimated associations between environmental toxicants (metals, phthalates, pesticides, PBDEs, and PAHs) and gamma glutamyl transferase (GGT) from NHANES 2003-2004 (n = 990).

GIGG regression yields narrower credible intervals than horseshoe regression for exposure classes, such as the PAHs and PBDEs, that have high pairwise correlations and common estimated effect sizes. However, the metals exposure class, which has weak pairwise correlations and heterogeneous estimated effect sizes, results in a median credible interval length of 0.31 for GIGG regression and 0.30 for horseshoe regression. The example indicates that by leveraging grouping information GIGG regression achieves appreciable efficiency gains for groups with multicollinearity issues and homogeneous effect sizes, but does not provide an improvement for groups with weak correlations and heterogeneous effect sizes. Additionally, from a computational perspective, GIGG regression generated a median effective sample size of 590.4 per second, compared to a median effective sample size of 78.9 per second for horseshoe regression from the horseshoe package in R.
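Because both GGT and the exposures are log-transformed, a regression coefficient β translates to the percent change in GGT per doubling of exposure via 100(2^β − 1); a one-line sketch of the conversion used for the figure's x-axis (the function name is ours):

```python
def percent_change_per_doubling(beta):
    """Percent change in GGT when an exposure doubles, in a log-log model.

    If log(GGT) = ... + beta * log(exposure), doubling the exposure multiplies
    GGT by 2**beta, i.e. a 100 * (2**beta - 1) percent change."""
    return 100.0 * (2.0 ** beta - 1.0)
```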
7. DISCUSSION
The principal methodological contribution of this paper is to construct a continuous shrinkage prior that improves regression coefficient estimation in the presence of grouped covariates. GIGG regression flexibly controls the relative contributions of individual and group shrinkage to improve regression coefficient estimation, resulting in a relative IMSE reduction of 72.8% for the null coefficients and 7.5 times more efficient computation compared to the primary horseshoe regression implementation in R. One of the main limitations of GIGG regression is that covariate groupings must be explicitly specified and may not overlap. Additionally, although the GIGG prior can be imposed on regression coefficients in Bayesian generalized linear models, a theoretical evaluation of the shrinkage properties for non-normal outcome data would be necessary to determine if the GIGG prior is appropriate for such models. We are currently working on an R package to implement GIGG regression that, upon completion, will be added to the Comprehensive R Archive Network. For a preliminary version of the R package, visit GitHub.

The analysis of multiple pollutant data and chemical mixtures is a key thrust of the National Institute of Environmental Health Sciences, and the GIGG prior provides a useful framework for achieving variance reduction in the presence of group-correlated exposures, characterizing uncertainties in point estimates, and constructing policy relevant metrics, like summary risk scores, in a principled way. However, the generality of the GIGG prior coupled with the relative ease of computation means that, despite its motivation coming from environmental epidemiology, the GIGG prior is applicable to many other areas. For example, in neuroimaging studies, scalar-on-image regression (Kang et al., 2018) has been widely used to study the association between brain activity and clinical outcomes of interest.
The whole brain can be partitioned into a set of exclusive regions according to brain function and anatomical structure. Within the same region, the brain imaging biomarkers tend to be more correlated and to have similar effects on the outcome variable. The GIGG prior can be extended to scalar-on-image regression, where it has great potential to improve estimation of the effects of imaging biomarkers by incorporating brain region information.

In this paper, our focus was sparse estimation, but it is also natural to inquire about uncertainty quantification and variable selection. Based on our simulations, the conclusions of van der Pas et al. (2017) are relevant for the GIGG prior when 0 < a_g ≤ 1/2, but a comprehensive study needs to be carried out. There is no consensus way of defining variable selection for continuous shrinkage priors; however, there are several approaches to determine a final active set, including two-means clustering (Bhattacharya et al., 2015), credible intervals covering zero (van der Pas et al., 2017), thresholding shrinkage factors (Tang et al., 2018b), decoupling shrinkage and selection (DSS) (Hahn and Carvalho, 2015), and penalized credible regions (Zhang and Bondell, 2018). For horseshoe-style shrinkage, variable selection defined through credible intervals covering zero is conservative, but works well if one wants to limit the number of false discoveries. The two-means clustering heuristic does not necessarily result in consistent variable selection, and the shrinkage factor thresholding approach is restricted to applications where p < n. The penalized credible region approach searches for the sparsest model that falls within the 100 × (1 − α)% joint elliptical credible region, while DSS constructs an adaptive lasso-style objective function with the goal of sparsifying the posterior mean such that most of the predictive variability is still explained. Since the DSS construction is framed from a prediction perspective, this approach may not be ideal for regression coefficient estimation problems in the presence of correlated features. Another crucial point is that if one is interested in selection, the posterior mode estimator for the horseshoe prior will result in exact zero estimates, and an approximate algorithm for calculating the posterior mode was developed in Bhadra et al. (2019) using the horseshoe-like prior. Therefore, one could conceptualize an extension of the expectation-maximization algorithm developed by Bhadra et al. (2019) using a "GIGG-like" prior. Further work is needed to juxtapose the behavior of all of these different methods for selection and to develop novel algorithms for calculating the posterior mode.

Acknowledgements
Dr. Datta acknowledges support from the National Science Foundation (DMS-2015460). Dr. Kang acknowledges support from the National Institutes of Health (R01 DA048993; R01 GM124061; R01 MH105561). Dr. Mukherjee acknowledges support from the National Science Foundation (DMS-1712933) and the National Institutes of Health (R01 HG008773-01).

References
Abramowitz, M. and Stegun, I. A. (1972). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. U.S. Government Printing Office, Washington, D.C., 10th edition.

Andrade, J. A. A. and O'Hagan, A. (2006). Bayesian robustness modeling using regularly varying distributions. Bayesian Analysis, 1(1):169-188.

Armagan, A., Dunson, D. B., and Lee, J. (2013a). Generalized double Pareto shrinkage. Statistica Sinica, 23(1):119-143.

Armagan, A., Dunson, D. B., Lee, J., Bajwa, W. U., and Strawn, N. (2013b). Posterior consistency in linear models under shrinkage priors. Biometrika, 100(4):1011-1018.

Bai, R. and Ghosh, M. (2019). Large-scale multiple hypothesis testing with the normal-beta prime prior. Statistics, 53(6):1210-1233.

Barndorff-Nielsen, O., Kent, J., and Sørensen, M. (1982). Normal variance-mean mixtures and z distributions. International Statistical Review, 50(2):145-159.

Bhadra, A., Datta, J., Polson, N. G., and Willard, B. (2016). Default Bayesian analysis with global-local shrinkage priors. Biometrika, 103(4):955-969.

Bhadra, A., Datta, J., Polson, N. G., and Willard, B. (2017). The horseshoe+ estimator of ultra-sparse signals. Bayesian Analysis, 12(4):1105-1131.

Bhadra, A., Datta, J., Polson, N. G., and Willard, B. T. (2019). The horseshoe-like regularization for feature subset selection. Sankhya B.

Bhattacharya, A., Chakraborty, A., and Mallick, B. K. (2016). Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika, 103(4):985-991.

Bhattacharya, A., Pati, D., Pillai, N. S., and Dunson, D. B. (2015). Dirichlet-Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110(512):1479-1490.

Bingham, N. H., Goldie, C. M., and Teugels, J. L. (1989). Regular Variation, vol. 27 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, UK.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2009). Handling sparsity via the horseshoe. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, PMLR, 5:73-80.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2):465-480.

Casella, G. (2001). Empirical Bayes Gibbs sampling. Biostatistics, 2(4):485-500.

Castillo, I., Schmidt-Hieber, J., and van der Vaart, A. (2015). Bayesian linear regression with sparse priors. Annals of Statistics, 43(5):1986-2018.

Datta, J. and Dunson, D. B. (2016). Bayesian inference on quasi-sparse count data. Biometrika, 103(4):971-983.

Datta, J. and Ghosh, J. K. (2013). Asymptotic properties of Bayes risk for the horseshoe prior. Bayesian Analysis, 8(1):111-132.

Dawid, A. P. (1973). Posterior expectations for large observations. Biometrika, 60(3):664-667.

Ferguson, K. K., McElrath, T. F., and Meeker, J. D. (2014). Environmental phthalate exposure and preterm birth. JAMA Pediatrics, 168(1):61-67.

Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457-472.

Hahn, P. R. and Carvalho, C. M. (2015). Decoupling shrinkage and selection in Bayesian linear models: A posterior summary perspective. Journal of the American Statistical Association, 110(509):435-448.

Hefley, T. J., Hooten, M. B., Hanks, E. M., Russell, R. E., and Walsh, D. P. (2017). The Bayesian group lasso for confounded spatial data. Journal of Agricultural, Biological, and Environmental Statistics, 22(1):42-59.

Hörmann, W. and Leydold, J. (2014). Generating generalized inverse Gaussian random variates. Statistics and Computing, 24:547-557.

Johndrow, J. E., Orenstein, P., and Bhattacharya, A. (2020). Scalable approximate MCMC algorithms for the horseshoe prior. Journal of Machine Learning Research, 21:1-61.

Kang, J., Reich, B. J., and Staicu, A.-M. (2018). Scalar-on-image regression via the soft-thresholded Gaussian process. Biometrika, 105(1):165-184.

Kang, K., Song, X., Hu, X. J., and Zhu, H. (2019). Bayesian adaptive group lasso with semiparametric hidden Markov models. Statistics in Medicine, 38(9):1634-1650.

Kowal, D. R., Matteson, D. S., and Ruppert, D. (2019). Dynamic shrinkage processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(4):781-804.

Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis, 5(2):369-412.

Li, J., Wang, Z., Li, R., and Wu, R. (2015). Bayesian group lasso for nonparametric varying-coefficient models with application to functional genome-wide association studies. The Annals of Applied Statistics, 9(2):640-664.

Makalic, E. and Schmidt, D. F. (2016). A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1):179-182.

Mallick, H. and Yi, N. (2017). Bayesian group bridge for bi-level variable selection. Computational Statistics & Data Analysis, 110:115-133.

Nishimura, A. and Suchard, M. A. (2020). Prior-preconditioned conjugate gradient method for accelerated Gibbs sampling in "large n & large p" sparse Bayesian regression. arXiv Preprint.

O'Hagan, A. (1979). On outlier rejection phenomena in Bayes inference. Journal of the Royal Statistical Society: Series B (Methodological), 41(3):358-367.

Polson, N. G. and Scott, J. G. (2011). Shrink globally, act locally: Sparse Bayesian regularization and prediction. In Bernardo, J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., and West, M., editors, Bayesian Statistics 9, chapter 17. Oxford University Press, Oxford, United Kingdom.

Rockova, V. and George, E. I. (2018). The spike-and-slab lasso. Journal of the American Statistical Association, 113(521):431-444.

Rockova, V. and Lesaffre, E. (2014). Incorporating grouping information in Bayesian variable selection with applications in genomics. Bayesian Analysis, 9(1):221-258.

Tang, X., Ghosh, M., Ha, N. S., and Sedransk, J. (2018a). Modeling random effects using global-local shrinkage priors in small area estimation. Journal of the American Statistical Association, 113(524):1476-1489.

Tang, X., Xu, X., Ghosh, M., and Ghosh, P. (2018b). Bayesian variable selection and estimation based on global-local shrinkage priors. Sankhya A, 80:215-246.

Terenin, A., Dong, S., and Draper, D. (2019). GPU-accelerated Gibbs sampling: a case study of the horseshoe probit model. Statistics and Computing, 29(2):301-310.

van der Pas, S., Szabó, B., and van der Vaart, A. (2017). Uncertainty quantification for the horseshoe (with discussion). Bayesian Analysis, 12(4):1221-1274.

Xu, X. and Ghosh, M. (2015). Bayesian variable selection and estimation for group lasso. Bayesian Analysis, 10(4):909-936.

Xu, Z., Schmidt, D. F., Makalic, E., Qian, G., and Hopper, J. L. (2016). Bayesian grouped horseshoe regression with application to additive models. In AI 2016: Advances in Artificial Intelligence, chapter 3. Springer, Hobart, Australia.

Zhang, Y. and Bondell, H. D. (2018). Variable selection via penalized credible regions with Dirichlet-Laplace global-local shrinkage priors. Bayesian Analysis, 13(3):823-844.

8. APPENDICES
8.1 Distributions

Beta Prime Distribution: X ∼ β′(a, b) ⟹ f_X(x) = [Γ(a + b)/(Γ(a)Γ(b))] x^(a−1) (1 + x)^(−a−b), x > 0.

Gamma Distribution: X ∼ G(a, b) ⟹ f_X(x) = [b^a/Γ(a)] x^(a−1) exp(−bx), x > 0.

Generalized Inverse Gaussian Distribution: X ∼ GIG(λ, ψ, χ) ⟹ f_X(x) = [(ψ/χ)^(λ/2)/(2K_λ(√(ψχ)))] x^(λ−1) exp(−(χ/x + ψx)/2), x > 0, where K_λ(·) is the modified Bessel function of the third kind with index λ (Hörmann and Leydold, 2014).

Half Cauchy Distribution: X ∼ C⁺(0, σ) ⟹ f_X(x) = 2/(πσ(1 + x²/σ²)), x > 0.

Inverse Gamma Distribution: X ∼ IG(a, b) ⟹ f_X(x) = [b^a/Γ(a)] x^(−a−1) exp(−b/x), x > 0.

8.2 Proof of Theorem 2.1

For shorthand, we will use r(x) ∼ s(x) := lim_{x→∞} r(x)/s(x) = 1. Define

L(u) = [1/(τ² B(a_g, b_g))] ((u/τ²)/(1 + u/τ²))^(a_g),    B(a_g, b_g) = Γ(a_g)Γ(b_g)/Γ(a_g + b_g),

which is a slowly-varying function, i.e., lim_{u→∞} L(tu)/L(u) = 1 for all t > 0. Moreover, let

π(β_gj | τ², a_g, b_g) = ∫₀^∞ (2πu)^(−1/2) exp(−β_gj²/(2u)) f(u | τ², a_g, b_g) du

denote the normal variance mixture probability density function and let f(u | τ², a_g, b_g) denote the scaled β′(a_g, b_g) mixing density function with fixed scale parameter τ². Then

lim_{u→∞} f(u | τ², a_g, b_g) / [exp(−ψ⁺u) (u/τ²)^(λ−1) L(u)]
= lim_{u→∞} (τ² B(a_g, b_g))^(−1) (u/τ²)^(a_g−1) (1 + u/τ²)^(−(a_g+b_g)) / [exp(−ψ⁺u) (u/τ²)^(λ−1) L(u)]
= lim_{u→∞} (u/τ²)^(−1) (1 + u/τ²)^(−b_g) / [exp(−ψ⁺u) (u/τ²)^(λ−1)]
= lim_{u→∞} exp(ψ⁺u) (u/τ²)^(−λ) (1 + u/τ²)^(−b_g),

where ψ⁺ = sup{w ∈ ℝ : φ(w) < ∞} and

φ(w) = [1/B(a_g, b_g)] ∫₀^∞ exp(wu) (1/τ²) (u/τ²)^(a_g−1) (1 + u/τ²)^(−(a_g+b_g)) du.

Note that ψ⁺ = 0. Fix λ = −b_g. Then,

lim_{u→∞} exp(ψ⁺u) (u/τ²)^(−λ) (1 + u/τ²)^(−b_g) = lim_{u→∞} ((u/τ²)/(1 + u/τ²))^(b_g) = 1.

By Theorem 6.1 in Barndorff-Nielsen et al. (1982) we conclude that

π(β_gj | τ², a_g, b_g) ∼ (2π)^(−1/2) 2^(b_g+1/2) Γ(b_g + 1/2) |β_gj|^(−(1+2b_g)) L(β_gj²).

To get the index of regular variation, we note the following straightforward lemma:

Lemma 1. Suppose that r and s are two positive, measurable functions such that r(x) ∼ s(x) and s is regularly varying with index ω ∈ ℝ. Then r is regularly varying with index ω.

Proof. lim_{x→∞} r(tx)/r(x) = lim_{x→∞} [(r(tx)/s(tx))/(r(x)/s(x))] · s(tx)/s(x) = lim_{x→∞} s(tx)/s(x) = t^ω.

Despite its simplicity, Lemma 1 is of great practical use, particularly if the function whose tail behavior we are interested in does not have a closed form. When working with global-local mixture priors we often do not have closed form marginal prior distributions for β and it is usually easier to construct and work with a closed form function that has asymptotically equivalent tail behavior. Since the index of regular variation of (2π)^(−1/2) 2^(b_g+1/2) Γ(b_g + 1/2) |β_gj|^(−(1+2b_g)) L(β_gj²) is ω = −(1 + 2b_g), by Lemma 1 the index of regular variation of π(β_gj | τ², a_g, b_g) is also ω = −(1 + 2b_g).

8.3 Proof of Theorem 3.1

Let G_n = {γ₁², ..., γ_{G_n}²} denote the collection of group shrinkage parameters and let p_g indicate the number of covariates in the g-th group. Define the following sets: A_g = {j : β⁰_gj ≠ 0}, A_g^c = {j : β⁰_gj = 0}, A_n = {(g, j) : β⁰_gj ≠ 0}, A_n^c = {(g, j) : β⁰_gj = 0}. In words, A_g is the active set for the g-th group, A_g^c is the non-active set for the g-th group, A_n is the active set across all groups, and A_n^c is the non-active set across all groups.
Integrating out the local shrinkage parameters, the conditional prior is

π(β_n | G_n, H_n, τ_n²) = Π_{g=1}^{G_n} Π_{j=1}^{p_g} [Γ(b_g + 1/2)/(Γ(b_g)√(πτ_n²γ_g²))] (1 + β_gj²/(τ_n²γ_g²))^(−(b_g+1/2)).

Then, we see that

π(β_n : ‖β_n − β_n⁰‖ < Δ/n^(ρ/2) | G_n, H_n, τ_n²)
≥ Π_{g=1}^{G_n} [ Π_{j∈A_g} π(|β_gj − β⁰_gj| < Δ/(√p_n n^(ρ/2)) | G_n, H_n, τ_n²) × Π_{j∈A_g^c} π(|β_gj| < Δ/(√p_n n^(ρ/2)) | G_n, H_n, τ_n²) ].

Continuing,

π(|β_gj − β⁰_gj| < Δ/(√p_n n^(ρ/2)) | G_n, H_n, τ_n²)
= ∫_{β⁰_gj − Δ/(√p_n n^(ρ/2))}^{β⁰_gj + Δ/(√p_n n^(ρ/2))} [Γ(b_g + 1/2)/(Γ(b_g)√(πτ_n²γ_g²))] (1 + β_gj²/(τ_n²γ_g²))^(−(b_g+1/2)) dβ_gj
≥ [2Γ(b_g + 1/2)Δ/(Γ(b_g)√(πτ_n²γ_g²) √p_n n^(ρ/2))] (1 + (L_n + Δ/(√p_n n^(ρ/2)))²/(τ_n²γ_g²))^(−(b_g+1/2)),

and

π(|β_gj| < Δ/(√p_n n^(ρ/2)) | G_n, H_n, τ_n²)
= 2 ∫₀^{Δ/(√p_n n^(ρ/2))} [Γ(b_g + 1/2)/(Γ(b_g)√(πτ_n²γ_g²))] (1 + β_gj²/(τ_n²γ_g²))^(−(b_g+1/2)) dβ_gj
≥ 2 ∫₀^{Δ/(√p_n n^(ρ/2))} [Γ(b_g + 1/2)/(Γ(b_g)√(πτ_n²γ_g²))] exp(−β_gj(b_g + 1/2)/√(τ_n²γ_g²)) dβ_gj
= [2Γ(b_g + 1/2)/(Γ(b_g)√(πτ_n²γ_g²))] (√(τ_n²γ_g²)/(b_g + 1/2)) [1 − exp(−Δ(b_g + 1/2)/√(τ_n²γ_g² p_n n^ρ))]
= [2Γ(b_g + 1/2)/(Γ(b_g)√π (b_g + 1/2))] [1 − exp(−Δ(b_g + 1/2)/√(τ_n²γ_g² p_n n^ρ))].

Note that we have the above inequality because (1 + x)^(−1) ≥ exp(−√x) for all x ≥ 0, applied with x = β_gj²/(τ_n²γ_g²) and raised to the power b_g + 1/2. Therefore,

Π_{g=1}^{G_n} [ Π_{j∈A_g} π(|β_gj − β⁰_gj| < Δ/(√p_n n^(ρ/2)) | G_n, H_n, τ_n²) × Π_{j∈A_g^c} π(|β_gj| < Δ/(√p_n n^(ρ/2)) | G_n, H_n, τ_n²) ]
≥ Π_{g=1}^{G_n} [2Γ(b_g + 1/2)Δ/(Γ(b_g)√(πτ_n²γ_g²) √p_n n^(ρ/2))]^|A_g| (1 + (L_n + Δ/(√p_n n^(ρ/2)))²/(τ_n²γ_g²))^(−|A_g|(b_g+1/2))
  × [2Γ(b_g + 1/2)/(Γ(b_g)√π (b_g + 1/2))]^|A_g^c| [1 − exp(−Δ(b_g + 1/2)/√(τ_n²γ_g² p_n n^ρ))]^|A_g^c|.

Substituting in τ_n² = C/(p_n n^ρ log(n)) and taking the negative logarithm of the final expression yields

Σ_{g=1}^{G_n} [ −|A_g| log( 2Γ(b_g + 1/2)Δ√(log(n))/(Γ(b_g)√(Cπγ_g²)) )
  + |A_g|(b_g + 1/2) log( 1 + p_n n^ρ log(n)(L_n + Δ/(√p_n n^(ρ/2)))²/(Cγ_g²) )
  − |A_g^c| log( 2Γ(b_g + 1/2)/(Γ(b_g)√π (b_g + 1/2)) )
  − |A_g^c| log( 1 − exp(−Δ(b_g + 1/2)√(log(n))/√(Cγ_g²)) ) ].

Let

T_1n = inf_{g∈{1,...,G_n}} log( 2Γ(b_g + 1/2)Δ/(Γ(b_g)√(Cπγ_g²)) ),
T_2n = inf_{g∈{1,...,G_n}} log( 2Γ(b_g + 1/2)/(Γ(b_g)√π (b_g + 1/2)) ),
T_3n = inf_{g∈{1,...,G_n}} Δ(b_g + 1/2)/√(Cγ_g²),
γ²_{n,min} = inf_{g∈{1,...,G_n}} γ_g², and b_{n,max} = sup_{g∈{1,...,G_n}} b_g.

Then,

Σ_{g=1}^{G_n} [ −|A_g| log( 2Γ(b_g + 1/2)Δ√(log(n))/(Γ(b_g)√(Cπγ_g²)) ) + |A_g|(b_g + 1/2) log( 1 + p_n n^ρ log(n)(L_n + Δ/(√p_n n^(ρ/2)))²/(Cγ_g²) ) − |A_g^c| log( 2Γ(b_g + 1/2)/(Γ(b_g)√π (b_g + 1/2)) ) − |A_g^c| log( 1 − exp(−Δ(b_g + 1/2)√(log(n))/√(Cγ_g²)) ) ]
≤ Σ_{g=1}^{G_n} [ −|A_g| (1/2) log(log(n)) − |A_g| T_1n + |A_g|(b_{n,max} + 1/2) log( 1 + p_n n^ρ log(n)(L_n + Δ/(√p_n n^(ρ/2)))²/(Cγ²_{n,min}) ) − |A_g^c| T_2n − |A_g^c| log( 1 − exp(−T_3n √(log(n))) ) ]
= −|A_n| (1/2) log(log(n)) − |A_n| T_1n + |A_n|(b_{n,max} + 1/2) log( 1 + p_n n^ρ log(n)(L_n + Δ/(√p_n n^(ρ/2)))²/(Cγ²_{n,min}) ) − |A_n^c| T_2n − |A_n^c| log( 1 − exp(−T_3n √(log(n))) ).

Note that the above expression is dominated by the |A_n|(b_{n,max} + 1/2) log( 1 + p_n n^ρ log(n)(L_n + Δ/(√p_n n^(ρ/2)))²/(Cγ²_{n,min}) ) term. If |A_n| = o(n/log(n)), then by Theorem 1 in Armagan et al. (2013b), we obtain posterior consistency conditional on G_n, i.e., for all ε > 0,

π(β_n : ‖β_n − β_n⁰‖ < ε | y_n, G_n, H_n, τ_n², σ²) → 1. Consequently,

π(β_n : ‖β_n − β_n⁰‖ < ε | y_n, H_n, τ_n², σ²) = E_{G_n | y_n, H_n, τ_n², σ²}[ π(β_n : ‖β_n − β_n⁰‖ < ε | y_n, G_n, H_n, τ_n², σ²) ] → E_{G_n | y_n, H_n, τ_n², σ²}[1] = 1.

8.4 Proof of Theorem 3.3

Let M = I_p − (I_p + (X⊤X)^(−1)D)^(−1) = (I_p + D^(−1)X⊤X)^(−1), where D^(−1) = σ^(−2)τ²Γ²Λ². The Rayleigh quotient inequality yields

‖β̂_OLS − E[β | ·]‖² ≥ e_min(M⊤M) ‖β̂_OLS‖² = (θ_min(M))² ‖β̂_OLS‖²,

where e_min(M⊤M) denotes the minimum eigenvalue of M⊤M and θ_min(M) denotes the minimum singular value of M. Then, we have

(θ_min(M))² ‖β̂_OLS‖² = ‖β̂_OLS‖²/(θ_max(I_p + D^(−1)X⊤X))² = ‖β̂_OLS‖²/‖I_p + D^(−1)X⊤X‖_O²,

where ‖·‖_O is the operator 2-norm and ‖·‖ is the usual L₂-norm. The operator 2-norm is sub-additive and sub-multiplicative, implying that

‖β̂_OLS‖²/‖I_p + D^(−1)X⊤X‖_O² ≥ ‖β̂_OLS‖²/(‖I_p‖_O + ‖D^(−1)‖_O ‖X⊤X‖_O)² = (1/(1 + θ_max(X⊤X) σ^(−2)τ² max_{(g,j)} γ_g²λ_gj²))² ‖β̂_OLS‖².

First, note that

π( 1/(1 + cσ^(−2)τ² max_{(g,j)} γ_g²λ_gj²) < ε | y, H, τ², σ² )
≤ π( ∪_{g=1}^{G} ∪_{j=1}^{p_g} {1/(1 + cσ^(−2)τ²γ_g²λ_gj²) < ε} | y, H, τ², σ² )
≤ Σ_{g=1}^{G} Σ_{j=1}^{p_g} π( 1/(1 + cσ^(−2)τ²γ_g²λ_gj²) < ε | y, H, τ², σ² ),

where c = θ_max(X⊤X). Define K_gj = 1/(1 + cσ^(−2)τ²γ_g²λ_gj²) and for notational simplicity let K = K_{−g} ∪ {K_gl} where K_{−g} = {K_{g′} : g′ ≠ g}. Also, let L = ∪_{g=1}^{G} L_g denote the collection of all local shrinkage parameters, where L_g = {λ_gj² : 1 ≤ j ≤ p_g}. Then,

π(K_gl < ε | y, H, τ², σ²) = [1/π(y | H, τ², σ²)] ∫₀^ε π(y | K_gl, H, τ², σ²) π(K_gl | H, τ², σ²) dK_gl
= [1/π(y | H, τ², σ²)] ∫₀^ε ∫_{(0,1)^(p−1)} ∫_{(0,∞)^p} π(y | K, L, H, τ², σ²) π(K | L, H, τ², σ²) π(L | H, τ², σ²) dL dK_{−g} dK_gl,

where (0,1)^(p−1) indicates the (p−1)-fold product (0,1) × ⋯ × (0,1) and (0,∞)^p the p-fold product (0,∞) × ⋯ × (0,∞). Looking at the individual components, we first observe that π(y | K, L, H, τ², σ²) is just a reparameterized version of π(y | Γ², Λ², H, τ², σ²), where

[y | Γ², Λ², H, τ², σ²] ∼ N(0, σ²I_n + τ²XΓ²Λ²X⊤).

Note that τ²XΓ²Λ²X⊤ is a symmetric, positive semi-definite matrix and therefore has nonnegative, real eigenvalues. Thus the determinant of σ²I_n + τ²XΓ²Λ²X⊤ satisfies |σ²I_n + τ²XΓ²Λ²X⊤| ≥ (σ²)^n and we get that π(y | Γ², Λ², H, τ², σ²) ≤ (2πσ²)^(−n/2). Therefore, we conclude that π(y | K, L, H, τ², σ²) ≤ (2πσ²)^(−n/2). One helpful observation is that

π(K | L, H, τ², σ²) π(L | H, τ², σ²) = π(K_gl | λ_gl², H, τ², σ²) π(λ_gl² | H, τ², σ²) Π_{g′≠g} π(K_{g′} | L_{g′}, H, τ², σ²) π(L_{g′} | H, τ², σ²).

Thus, we get a simplified upper bound

π(K_gl < ε | y, H, τ², σ²) ≤ [(2πσ²)^(−n/2)/π(y | H, τ², σ²)] ∫₀^ε ∫₀^∞ π(K_gl | λ_gl², H, τ², σ²) π(λ_gl² | H, τ², σ²) dλ_gl² dK_gl.

Note that because we are conditioning on the local shrinkage parameter, calculating π(K_gl | λ_gl², H, τ², σ²) is a single variable transformation problem, where

π(K_gl | λ_gl², H, τ², σ²) = [1/Γ(a_g)] (σ²/(cτ²λ_gl²))^(a_g) (1 − K_gl)^(a_g−1) (K_gl)^(−(1+a_g)) exp( −σ²(1 − K_gl)/(cτ²λ_gl² K_gl) ).

Moreover,

π(λ_gl² | H, τ², σ²) = [1/Γ(b_g)] (λ_gl²)^(−b_g−1) exp(−1/λ_gl²)

is simply the prior on the local shrinkage parameters.
Therefore, π (cid:0) K gl < (cid:15) (cid:12)(cid:12) y , H , τ , σ (cid:1) ≤ πσ ) n/ π ( y | H , τ , σ ) (cid:90) (cid:15) (cid:90) ∞ π ( K gl | λ gl , H , τ , σ ) π ( λ gl | H , τ , σ ) dλ gl dK gl = 1(2 πσ ) n/ Γ( a g )Γ( b g ) π ( y | H , τ , σ ) (cid:18) σ cτ (cid:19) a g × (cid:15) (1 − K gl ) a g − ( K gl ) − (1+ a g ) (cid:90) ∞ (cid:0) λ gl (cid:1) − ( a g + b g ) − exp (cid:18) − (cid:18) σ (1 − K gl ) cτ K gl (cid:19)(cid:46) λ gl (cid:19) dλ gl dK gl = Γ( a g + b g )(2 πσ ) n/ Γ( a g )Γ( b g ) π ( y | H , τ , σ ) (cid:18) σ cτ (cid:19) a g × (cid:90) (cid:15) (1 − K gl ) a g − ( K gl ) − (1+ a g ) (cid:18) σ (1 − K gl ) cτ K gl (cid:19) − ( a g + b g ) dK gl = Γ( a g + b g )(2 πσ ) n/ Γ( a g )Γ( b g ) π ( y | H , τ , σ ) (cid:18) σ cτ (cid:19) a g × (cid:90) (cid:15) (1 − K gl ) a g − ( K gl ) − (1+ a g ) (cid:18) cτ K gl cτ K gl + σ (1 − (cid:15) ) (cid:19) ( a g + b g ) dK gl ≤ Γ( a g + b g )(1 − (cid:15) ) − ( a g + b g ) (2 πσ ) n/ Γ( a g )Γ( b g ) π ( y | H , τ , σ ) (cid:18) cτ σ (cid:19) b g (cid:90) (cid:15) (1 − K gl ) a g − ( K gl ) b g − dK gl ≤ Γ( a g + b g ) (cid:15) b g (1 − (cid:15) ) − ( a g + b g ) (2 πσ ) n/ Γ( a g ) b g Γ( b g ) π ( y | H , τ , σ ) (cid:18) cτ σ (cid:19) b g max { , (1 − (cid:15) ) a g − } . Lastly, we see thatlim τ → π ( y | H , τ , σ ) = lim τ → (cid:90) (0 , ∞ ) G (cid:90) (0 , ∞ ) p π ( y | G , L , H , τ , σ ) π ( G | H , τ , σ ) π ( L | H , τ , σ ) d L d G . Since π ( y | G , L , H , τ , σ ) = π ( y | Γ , Λ , H , τ , σ ) ≤ (2 πσ ) − n/ , then (cid:90) (0 , ∞ ) G (cid:90) (0 , ∞ ) p π ( y | G , L , H , τ , σ ) π ( G | H , τ , σ ) π ( L | H , τ , σ ) d L d G ≤ (2 πσ ) − n/ , and by the dominated convergence theorem we have thatlim τ → π ( y | H , τ , σ ) = 1(2 πσ ) n/ exp (cid:18) − σ y (cid:62) y (cid:19) . So, we conclude that π (cid:18)
11 + cσ − τ max ( g,j ) γ g λ gj < (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) y , H , τ , σ (cid:19) ≤ G (cid:88) g =1 p g (cid:88) j =1 π (cid:0) K gj < (cid:15) (cid:12)(cid:12) y , H , τ , σ (cid:1) ≤ G (cid:88) g =1 p g Γ( a g + b g ) (cid:15) b g (1 − (cid:15) ) − ( a g + b g ) (2 πσ ) n/ Γ( a g ) b g Γ( b g ) π ( y | H , τ , σ ) (cid:18) cτ σ (cid:19) b g max { , (1 − (cid:15) ) a g − } , where the upper bound goes to zero as τ →
0. Therefore, we have shown that for fixed32 ∈ (0 , π (cid:18)
11 + θ max ( X (cid:62) X ) σ − τ max ( g,j ) γ g λ gj ≥ (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) y , H , τ , σ (cid:19) → τ →
0. Since, (cid:13)(cid:13)(cid:13) ˆ β OLS − E [ β | · ] (cid:13)(cid:13)(cid:13) ≥ (cid:18)
11 + θ max ( X (cid:62) X ) σ − τ max ( g,j ) γ g λ gj (cid:19) (cid:13)(cid:13)(cid:13) ˆ β OLS (cid:13)(cid:13)(cid:13) , then we must also have that π (cid:18)(cid:13)(cid:13)(cid:13) ˆ β OLS − E [ β | · ] (cid:13)(cid:13)(cid:13) ≥ (cid:15) (cid:13)(cid:13)(cid:13) ˆ β OLS (cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12)(cid:12) y , H , τ , σ (cid:19) → τ →
0. 33 .5 Proof of Corollary 3.1 If X (cid:62) g X g (cid:48) = for all g (cid:54) = g (cid:48) , then we have E [ β g | · ] = (cid:18) I p g + ( X (cid:62) g X g ) − σ γ g τ Λ − g (cid:19) − ˆ β OLSg , ˆ β OLSg = ( X (cid:62) g X g ) − X (cid:62) g y , where Λ g = diag (cid:0) λ g , . . . , λ gp g (cid:1) . Following a similar argument to the proof of Theorem 3.3, wearrive at (cid:13)(cid:13)(cid:13) ˆ β OLSg − E [ β g | · ] (cid:13)(cid:13)(cid:13) ≥ (cid:18)
11 + θ max ( X (cid:62) g X g ) σ − τ γ g max j λ gj (cid:19) (cid:13)(cid:13)(cid:13) ˆ β OLSg (cid:13)(cid:13)(cid:13) , Also, from the proof of Theorem 3.3, we have that π (cid:18)
11 + cσ − τ γ g max j λ gj < (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) y , H , τ , σ (cid:19) ≤ p g (cid:88) j =1 π (cid:0) K gj < (cid:15) (cid:12)(cid:12) y , H , τ , σ (cid:1) ≤ p g Γ( a g + b g ) (cid:15) b g (1 − (cid:15) ) − ( a g + b g ) (2 πσ ) n/ Γ( a g ) b g Γ( b g ) π ( y | H , τ , σ ) (cid:18) cτ σ (cid:19) b g max { , (1 − (cid:15) ) a g − } , If, a g ∈ (0 ,
1) and c(cid:15)τ σ (1 − (cid:15) ) < p g Γ( a g + b g ) (cid:15) b g (1 − (cid:15) ) − ( a g + b g ) (2 πσ ) n/ Γ( a g ) b g Γ( b g ) π ( y | H , τ , σ ) (cid:18) cτ σ (cid:19) b g max { , (1 − (cid:15) ) a g − } → b g → ∞ . Since, (cid:13)(cid:13)(cid:13) ˆ β OLSg − E [ β g | · ] (cid:13)(cid:13)(cid:13) ≥ (cid:18)
11 + θ max ( X (cid:62) g X g ) σ − τ γ g max j λ gj (cid:19) (cid:13)(cid:13)(cid:13) ˆ β OLSg (cid:13)(cid:13)(cid:13) , then we must also have that for all δ ∈ (0 , σ / ( σ + θ max ( X (cid:62) g X g ) τ )) π (cid:18)(cid:13)(cid:13)(cid:13) ˆ β OLSg − E [ β g | · ] (cid:13)(cid:13)(cid:13) ≥ δ (cid:13)(cid:13)(cid:13) ˆ β OLSg (cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12)(cid:12) y , H , τ , σ (cid:19) → b g → ∞ . 34 .6 Proof of Theorem 3.4a The posterior distribution of the shrinkage weights in the g -th group are given by π (cid:0) κ g | y g , τ , σ , a g , b g (cid:1) ∝ (cid:32) τ σ p g (cid:88) j =1 κ gj − κ gj (cid:33) − ( a g + p g b g ) (cid:32) p g (cid:89) j =1 κ b g − / gj (1 − κ gj ) − ( b g +1) exp (cid:18) − y gj σ κ gj (cid:19)(cid:33) , where κ g = ( κ g , ..., κ gp g ), 0 < κ gj < ≤ j ≤ p g , and y g = ( y g , ..., y gp g ) (cid:62) . π ( κ gl > ψ | y g , τ , σ , a g , b g ) = A gl B g , where A gl = (cid:90) ψ (cid:90) · · · (cid:90) (cid:32) τ σ p g (cid:88) j =1 κ gj − κ gj (cid:33) − ( a g + p g b g ) × (cid:32) p g (cid:89) j =1 κ b g − / gj (1 − κ gj ) − ( b g +1) exp (cid:18) − y gj σ κ gj (cid:19)(cid:33) dκ g · · · dκ g,l − dκ g,l +1 · · · dκ gp g dκ gl and B g = (cid:90) · · · (cid:90) (cid:32) τ σ p g (cid:88) j =1 κ gj − κ gj (cid:33) − ( a g + p g b g ) (cid:32) p g (cid:89) j =1 κ b g − / gj (1 − κ gj ) − ( b g +1) exp (cid:18) − y gj σ κ gj (cid:19)(cid:33) dκ g · · · dκ gp g . 
Note that
\[
\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g)}
=\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g/p_g+b_g)}\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(p_g-1)(a_g/p_g+b_g)}
\le\left(1+\frac{\tau^2}{\sigma^2}\frac{\kappa_{gl}}{1-\kappa_{gl}}\right)^{-(a_g/p_g+b_g)}\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(p_g-1)(a_g/p_g+b_g)}.
\]
Then,
\begin{align*}
A_{gl}&\le\left(\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(p_g-1)(a_g/p_g+b_g)}\prod_{j\neq l}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\exp\left(-\frac{y_{gj}^2\kappa_{gj}}{2\sigma^2}\right)d\kappa_{gj}\right)\\
&\qquad\times\left(\int_{\psi}^{1}\left(1+\frac{\tau^2}{\sigma^2}\frac{\kappa_{gl}}{1-\kappa_{gl}}\right)^{-(a_g/p_g+b_g)}\kappa_{gl}^{b_g-1/2}(1-\kappa_{gl})^{-(b_g+1)}\exp\left(-\frac{y_{gl}^2\kappa_{gl}}{2\sigma^2}\right)d\kappa_{gl}\right)\\
&\le\left(\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(p_g-1)(a_g/p_g+b_g)}\prod_{j\neq l}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\,d\kappa_{gj}\right)\\
&\qquad\times\exp\left(-\frac{\psi}{2\sigma^2}y_{gl}^2\right)\left(\int_{\psi}^{1}\left(1+\frac{\tau^2}{\sigma^2}\frac{\kappa_{gl}}{1-\kappa_{gl}}\right)^{-(a_g/p_g+b_g)}\kappa_{gl}^{b_g-1/2}(1-\kappa_{gl})^{-(b_g+1)}\,d\kappa_{gl}\right)\\
&\le\left(\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g^*+(p_g-1)b_g)}\prod_{j\neq l}\kappa_{gj}^{b_g-1}(1-\kappa_{gj})^{-(b_g+1)}\,d\kappa_{gj}\right)\\
&\qquad\times\exp\left(-\frac{\psi}{2\sigma^2}y_{gl}^2\right)\left(\int_{\psi}^{1}\left(1-\left(1-\frac{\tau^2}{\sigma^2}\right)\kappa_{gl}\right)^{-(a_g/p_g+b_g)}\kappa_{gl}^{b_g-1/2}(1-\kappa_{gl})^{a_g/p_g-1}\,d\kappa_{gl}\right)\\
&\le\left(\left(\frac{\tau^2}{\sigma^2}\right)^{-(p_g-1)b_g}\frac{\Gamma(a_g^*)(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)}\right)\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{-(a_g/p_g+b_g)}\exp\left(-\frac{\psi}{2\sigma^2}y_{gl}^2\right)\int_{\psi}^{1}\kappa_{gl}^{b_g-1/2}(1-\kappa_{gl})^{a_g/p_g-1}\,d\kappa_{gl}\\
&\le\left(\left(\frac{\tau^2}{\sigma^2}\right)^{-(p_g-1)b_g}\frac{\Gamma(a_g^*)(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)}\right)\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{-(a_g/p_g+b_g)}\max\big(1,\psi^{b_g-1/2}\big)\exp\left(-\frac{\psi}{2\sigma^2}y_{gl}^2\right)\int_{\psi}^{1}(1-\kappa_{gl})^{a_g/p_g-1}\,d\kappa_{gl}\\
&=\left(\left(\frac{\tau^2}{\sigma^2}\right)^{-(p_g-1)b_g}\frac{\Gamma(a_g^*)(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)}\right)\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{-(a_g/p_g+b_g)}\max\big(1,\psi^{b_g-1/2}\big)\exp\left(-\frac{\psi}{2\sigma^2}y_{gl}^2\right)\frac{p_g}{a_g}(1-\psi)^{a_g/p_g},
\end{align*}
where $a_g^*=(p_g-1)a_g/p_g$. We can simplify the integral four lines above based on the prior distribution of the shrinkage weights for a group of size $p_g-1$. Let $\delta\in(0,1)$ be a fixed constant. Then,
\begin{align*}
B_g&\ge\int_0^{\psi\delta}\cdots\int_0^{\psi\delta}\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g)}\left(\prod_{j=1}^{p_g}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\exp\left(-\frac{y_{gj}^2\kappa_{gj}}{2\sigma^2}\right)d\kappa_{gj}\right)\\
&\ge\left(1+\frac{p_g\tau^2}{\sigma^2}\frac{\psi\delta}{1-\psi\delta}\right)^{-(a_g+p_gb_g)}\prod_{j=1}^{p_g}\int_0^{\psi\delta}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\exp\left(-\frac{y_{gj}^2\kappa_{gj}}{2\sigma^2}\right)d\kappa_{gj}\\
&\ge\left(1+\frac{p_g\tau^2}{\sigma^2}\frac{\psi\delta}{1-\psi\delta}\right)^{-(a_g+p_gb_g)}\exp\left(-\frac{\psi\delta}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\prod_{j=1}^{p_g}\int_0^{\psi\delta}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\,d\kappa_{gj}\\
&\ge\left(1+\frac{p_g\tau^2}{\sigma^2}\frac{\psi\delta}{1-\psi\delta}\right)^{-(a_g+p_gb_g)}\exp\left(-\frac{\psi\delta}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\prod_{j=1}^{p_g}\int_0^{\psi\delta}\kappa_{gj}^{b_g-1/2}\,d\kappa_{gj}\\
&=\left(1+\frac{p_g\tau^2}{\sigma^2}\frac{\psi\delta}{1-\psi\delta}\right)^{-(a_g+p_gb_g)}\exp\left(-\frac{\psi\delta}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)(b_g+1/2)^{-p_g}(\psi\delta)^{p_g(b_g+1/2)}.
\end{align*}
Therefore,
\[
\frac{A_{gl}}{B_g}\le\frac{f(p_g,\tau^2,\sigma^2,a_g,b_g,\psi)}{g(p_g,\tau^2,\sigma^2,a_g,b_g,\psi,\delta)}\exp\left(\frac{\psi\delta}{2\sigma^2}\sum_{j\neq l}y_{gj}^2\right)\exp\left(-\frac{\psi(1-\delta)}{2\sigma^2}y_{gl}^2\right),
\]
where
\[
f(p_g,\tau^2,\sigma^2,a_g,b_g,\psi)=\left(\left(\frac{\tau^2}{\sigma^2}\right)^{-(p_g-1)b_g}\frac{\Gamma(a_g^*)(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)}\right)\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{-(a_g/p_g+b_g)}\max\big(1,\psi^{b_g-1/2}\big)\frac{p_g}{a_g}(1-\psi)^{a_g/p_g}
\]
and
\[
g(p_g,\tau^2,\sigma^2,a_g,b_g,\psi,\delta)=\left(1+\frac{p_g\tau^2}{\sigma^2}\frac{\psi\delta}{1-\psi\delta}\right)^{-(a_g+p_gb_g)}(b_g+1/2)^{-p_g}(\psi\delta)^{p_g(b_g+1/2)}.
\]
If we take the limit of this upper bound as $|y_{gl}|\to\infty$, then we see that $\pi(\kappa_{gl}>\psi\mid y_g,\tau^2,\sigma^2,a_g,b_g)\to 0$. This concludes the proof.

Proof of Theorem 3.4b
The posterior distribution of the shrinkage weights in the $g$-th group is given by
\[
\pi\big(\kappa_g\mid y_g,\tau^2,\sigma^2,a_g,b_g\big)\propto\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g)}\left(\prod_{j=1}^{p_g}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\exp\left(-\frac{y_{gj}^2\kappa_{gj}}{2\sigma^2}\right)\right),
\]
where $\kappa_g=(\kappa_{g1},\ldots,\kappa_{gp_g})$, $0<\kappa_{gj}<1$ for $1\le j\le p_g$, and $y_g=(y_{g1},\ldots,y_{gp_g})^\top$. Write
\[
\pi(\kappa_{gl}<\epsilon\mid y_g,\tau^2,\sigma^2,a_g,b_g)=\frac{A_{gl}}{B_g},
\]
where
\[
A_{gl}=\int_0^{\epsilon}\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g)}\left(\prod_{j=1}^{p_g}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\exp\left(-\frac{y_{gj}^2\kappa_{gj}}{2\sigma^2}\right)\right)d\kappa_{g1}\cdots d\kappa_{g,l-1}\,d\kappa_{g,l+1}\cdots d\kappa_{gp_g}\,d\kappa_{gl}
\]
and
\[
B_g=\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g)}\left(\prod_{j=1}^{p_g}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\exp\left(-\frac{y_{gj}^2\kappa_{gj}}{2\sigma^2}\right)\right)d\kappa_{g1}\cdots d\kappa_{gp_g}.
\]
Note that
\[
\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g)}
\le\left(1+\frac{\tau^2}{\sigma^2}\frac{\kappa_{gl}}{1-\kappa_{gl}}\right)^{-(a_g/p_g+b_g)}\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(p_g-1)(a_g/p_g+b_g)},
\]
exactly as in the proof of Theorem 3.4a. Then,
\begin{align*}
A_{gl}&\le\left(\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(p_g-1)(a_g/p_g+b_g)}\prod_{j\neq l}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\exp\left(-\frac{y_{gj}^2\kappa_{gj}}{2\sigma^2}\right)d\kappa_{gj}\right)\\
&\qquad\times\left(\int_0^{\epsilon}\left(1+\frac{\tau^2}{\sigma^2}\frac{\kappa_{gl}}{1-\kappa_{gl}}\right)^{-(a_g/p_g+b_g)}\kappa_{gl}^{b_g-1/2}(1-\kappa_{gl})^{-(b_g+1)}\exp\left(-\frac{y_{gl}^2\kappa_{gl}}{2\sigma^2}\right)d\kappa_{gl}\right)\\
&\le\left(\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(p_g-1)(a_g/p_g+b_g)}\prod_{j\neq l}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\,d\kappa_{gj}\right)\left((1-\epsilon)^{-(b_g+1)}\int_0^{\epsilon}\kappa_{gl}^{b_g-1/2}\,d\kappa_{gl}\right)\\
&\le\frac{\epsilon^{b_g+1/2}}{(b_g+1/2)(1-\epsilon)^{b_g+1}}\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g^*+(p_g-1)b_g)}\prod_{j\neq l}\kappa_{gj}^{b_g-1}\kappa_{gj}^{1/2}(1-\kappa_{gj})^{-(b_g+1)}\,d\kappa_{gj}\\
&\le\frac{\epsilon^{b_g+1/2}}{(b_g+1/2)(1-\epsilon)^{b_g+1}}\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j\neq l}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g^*+(p_g-1)b_g)}\prod_{j\neq l}\kappa_{gj}^{b_g-1}(1-\kappa_{gj})^{-(b_g+1)}\,d\kappa_{gj}\\
&=\left(\frac{\epsilon^{b_g+1/2}}{(b_g+1/2)(1-\epsilon)^{b_g+1}}\right)\left(\left(\frac{\tau^2}{\sigma^2}\right)^{-(p_g-1)b_g}\frac{\Gamma(a_g^*)(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)}\right),
\end{align*}
where $a_g^*=(p_g-1)a_g/p_g$. We have the last equality based on the prior distribution of the shrinkage weights for a group of size $p_g-1$. Next,
\begin{align*}
B_g&\ge\exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g)}\prod_{j=1}^{p_g}\kappa_{gj}^{b_g-1/2}(1-\kappa_{gj})^{-(b_g+1)}\,d\kappa_{gj}\\
&=\exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g^*)}\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{p_g/2}\prod_{j=1}^{p_g}\kappa_{gj}^{b_g^*-1}(1-\kappa_{gj})^{-(b_g^*+1)}(1-\kappa_{gj})^{1/2}\,d\kappa_{gj}\\
&\ge\exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\int\cdots\int\left(\prod_{j=1}^{p_g}\left(1-\kappa_{gj}+\frac{\tau^2}{\sigma^2}\kappa_{gj}\right)\right)^{1/2}\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g^*)}\prod_{j=1}^{p_g}\kappa_{gj}^{b_g^*-1}(1-\kappa_{gj})^{-(b_g^*+1)}\,d\kappa_{gj}\\
&\ge\exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{p_g/2}\int\cdots\int\left(1+\frac{\tau^2}{\sigma^2}\sum_{j=1}^{p_g}\frac{\kappa_{gj}}{1-\kappa_{gj}}\right)^{-(a_g+p_gb_g^*)}\prod_{j=1}^{p_g}\kappa_{gj}^{b_g^*-1}(1-\kappa_{gj})^{-(b_g^*+1)}\,d\kappa_{gj}\\
&=\exp\left(-\frac{1}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{p_g/2}\left(\left(\frac{\tau^2}{\sigma^2}\right)^{-p_gb_g^*}\frac{\Gamma(a_g)(\Gamma(b_g^*))^{p_g}}{\Gamma(a_g+p_gb_g^*)}\right),
\end{align*}
where $b_g^*=b_g+1/2$. Similarly, we have the last equality based on the prior distribution of the shrinkage weights for a group of size $p_g$. Therefore,
\[
\frac{A_{gl}}{B_g}\le\exp\left(\frac{1}{2\sigma^2}\sum_{j=1}^{p_g}y_{gj}^2\right)\frac{\epsilon^{b_g+1/2}}{(b_g+1/2)(1-\epsilon)^{b_g+1}}\left(\frac{\tau^2}{\sigma^2}\right)^{p_g/2+b_g}\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{-p_g/2}\frac{\Gamma(a_g+p_gb_g^*)\,\Gamma(a_g^*)\,(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)\,\Gamma(a_g)\,(\Gamma(b_g^*))^{p_g}}.
\]
If we take the limit of this expression as $\tau^2\to 0$, then we see that $\pi(\kappa_{gl}<\epsilon\mid y_g,\tau^2,\sigma^2,a_g,b_g)\to 0$. This concludes the proof.

Proof of Corollary 3.2
From the proof of Theorem 3.4b we have that
\[
\pi(\kappa_{gl}<\epsilon\mid y_g,\tau^2,\sigma^2,a_g,b_g)\le\exp\left(\frac{1}{2\sigma^2}y_g^\top y_g\right)\frac{\epsilon^{b_g+1/2}}{(b_g+1/2)(1-\epsilon)^{b_g+1}}\left(\frac{\tau^2}{\sigma^2}\right)^{p_g/2+b_g}\left(\min\left(1,\frac{\tau^2}{\sigma^2}\right)\right)^{-p_g/2}\frac{\Gamma(a_g+p_gb_g^*)\,\Gamma(a_g^*)\,(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)\,\Gamma(a_g)\,(\Gamma(b_g^*))^{p_g}},
\]
where $p_g$ is the number of observations in group $g$, $b_g^*=b_g+1/2$, and $a_g^*=(p_g-1)a_g/p_g$. Based on this inequality we just need to find for what values of $\theta$ the limit as $b_g\to\infty$ of
\[
\frac{\theta^{b_g}}{(b_g+1/2)}\,\frac{\Gamma(a_g+p_gb_g^*)\,(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)\,(\Gamma(b_g^*))^{p_g}},\qquad \theta=\frac{\epsilon}{1-\epsilon}\,\frac{\tau^2}{\sigma^2},
\]
goes to zero. To do so we will first need one useful asymptotic approximation (a consequence of 6.1.39 in Abramowitz and Stegun (1972)):
\[
\lim_{x\to\infty}\frac{\Gamma(x+c)}{\Gamma(x)\,x^c}=1,\quad\text{for any } c\in\mathbb{R}.
\]
Therefore,
\begin{align*}
\frac{\theta^{b_g}}{(b_g+1/2)}\,\frac{\Gamma(a_g+p_gb_g^*)\,(\Gamma(b_g))^{p_g-1}}{\Gamma(a_g^*+(p_g-1)b_g)\,(\Gamma(b_g^*))^{p_g}}
&\sim\frac{\theta^{b_g}}{(b_g+1/2)}\,\frac{\Gamma(p_gb_g)\,(p_gb_g)^{a_g+p_g/2}\,(\Gamma(b_g))^{p_g-1}}{\Gamma((p_g-1)b_g)\,((p_g-1)b_g)^{a_g^*}\,(\Gamma(b_g))^{p_g}\,b_g^{p_g/2}}\\
&\sim p_g^{a_g+p_g/2}\,(p_g-1)^{-a_g^*}\,\theta^{b_g}\,b_g^{a_g/p_g-1}\,\frac{\Gamma(p_gb_g)}{\Gamma(b_g)\,\Gamma((p_g-1)b_g)}\\
&=p_g^{a_g+p_g/2}\,(p_g-1)^{-a_g^*}\,\frac{\theta^{b_g}\,b_g^{a_g/p_g-1}}{B(b_g,(p_g-1)b_g)}.
\end{align*}
Stirling's approximation for the gamma function (6.1.39 in Abramowitz and Stegun (1972)) can be used to get the following asymptotic approximation for the beta function:
\[
B(x,y)\sim\sqrt{2\pi}\,\frac{x^{x-1/2}\,y^{y-1/2}}{(x+y)^{x+y-1/2}}.
\]
Hence,
\[
p_g^{a_g+p_g/2}\,(p_g-1)^{-a_g^*}\,\frac{\theta^{b_g}\,b_g^{a_g/p_g-1}}{B(b_g,(p_g-1)b_g)}
\sim\frac{p_g^{a_g+p_g/2}}{\sqrt{2\pi}}\,(p_g-1)^{-a_g^*}\,\theta^{b_g}\,b_g^{a_g/p_g-1}\,\frac{(p_gb_g)^{p_gb_g-1/2}}{b_g^{b_g-1/2}\,((p_g-1)b_g)^{(p_g-1)b_g-1/2}}.
\]
If we set $\theta=p_g^{-p_g}$, then this quantity becomes
\[
\frac{p_g^{a_g+(p_g-1)/2}}{\sqrt{2\pi}}\,(p_g-1)^{1/2-a_g^*}\,b_g^{a_g/p_g-1/2}\,(p_g-1)^{-(p_g-1)b_g}.
\]
If $a_g\in(0,1)$, this goes to zero as $b_g\to\infty$. To summarize the result, we have shown that if $\tau^2$, $\sigma^2$, $p_g\ge 2$, and $a_g\in(0,1)$ are all fixed, then there exists a constant
\[
\epsilon(\tau^2,\sigma^2,p_g)=\left(1+\frac{\tau^2}{\sigma^2}\,p_g^{p_g}\right)^{-1},
\]
such that $\pi(\kappa_{gl}<\epsilon(\tau^2,\sigma^2,p_g)\mid y_g,\tau^2,\sigma^2,a_g,b_g)\to 0$ as $b_g\to\infty$.

Full Conditional Distributions for Gibbs Sampler

The full conditional distributions for all model parameters are
\begin{align*}
[\alpha\mid\cdot]&\sim N\Big((C^\top C)^{-1}C^\top(y-X\beta),\;\sigma^2(C^\top C)^{-1}\Big)\\
[\beta\mid\cdot]&\sim N\Big(Q^{-1}\sigma^{-2}X^\top(y-C\alpha),\;Q^{-1}\Big),\qquad Q=\frac{1}{\sigma^2}X^\top X+\frac{1}{\tau^2}\Gamma^{-1}\Lambda^{-1}\\
[\tau^2\mid\cdot]&\sim\mathrm{IG}\left(\frac{p+1}{2},\;\frac{1}{2}\beta^\top\Gamma^{-1}\Lambda^{-1}\beta+\frac{1}{\nu}\right),\qquad
[\nu\mid\cdot]\sim\mathrm{IG}\left(1,\;\frac{1}{\tau^2}+\frac{1}{\sigma^2}\right)\\
[\sigma^2\mid\cdot]&\sim\mathrm{IG}\left(\frac{n+1}{2},\;\frac{1}{2}\big(y-C\alpha-X\beta\big)^\top\big(y-C\alpha-X\beta\big)+\frac{1}{\nu}\right)\\
[\lambda_{gj}^2\mid\cdot]&\sim\mathrm{IG}\left(b_g+\frac{1}{2},\;\frac{\beta_{gj}^2}{2\tau^2\gamma_g^2}+1\right)\\
[\gamma_g^{-2}\mid\cdot]&\sim\mathrm{GIG}\left(\frac{p_g}{2}-a_g,\;\frac{1}{\tau^2}\sum_{j=1}^{p_g}\frac{\beta_{gj}^2}{\lambda_{gj}^2},\;2\right),
\end{align*}
where GIG refers to the generalized inverse Gaussian distribution (Hörmann and Leydold, 2014).
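The full conditionals above can be composed into a single Gibbs sweep. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`rgig`, `gigg_gibbs_sweep`) are ours, GIG draws use `scipy.stats.geninvgauss` via the standard reparameterization of the three-parameter form $\mathrm{GIG}(\lambda,\chi,\psi)$ with density $\propto x^{\lambda-1}\exp\{-(\chi/x+\psi x)/2\}$, and $\gamma_g^2$ is drawn directly (equivalent to the listed $\gamma_g^{-2}$ draw, since the GIG family is closed under inversion).

```python
import numpy as np
from scipy import stats

def rgig(lam, chi, psi, rng):
    # Sample GIG(lam, chi, psi), density ∝ x^{lam-1} exp(-(chi/x + psi*x)/2),
    # by rescaling scipy's two-parameter geninvgauss(p, b) distribution.
    b = np.sqrt(chi * psi)
    return stats.geninvgauss.rvs(lam, b, random_state=rng) * np.sqrt(chi / psi)

def gigg_gibbs_sweep(y, C, X, groups, a, b, state, rng):
    """One sweep through the full conditionals listed above (hypothetical helper).
    `groups` maps each column of X to a group index; a[g], b[g] are the GIGG
    hyperparameters; `state` = (alpha, beta, tau2, sig2, nu, gam2, lam2)."""
    alpha, beta, tau2, sig2, nu, gam2, lam2 = state
    n, p = X.shape
    # [alpha | .] ~ N((C'C)^{-1} C'(y - X beta), sig2 (C'C)^{-1})
    CtC_inv = np.linalg.inv(C.T @ C)
    alpha = rng.multivariate_normal(CtC_inv @ C.T @ (y - X @ beta), sig2 * CtC_inv)
    # [beta | .] ~ N(Q^{-1} X'(y - C alpha)/sig2, Q^{-1}),
    # Q = X'X/sig2 + (Gamma Lambda)^{-1}/tau2
    d = 1.0 / (tau2 * gam2[groups] * lam2)
    Q_inv = np.linalg.inv(X.T @ X / sig2 + np.diag(d))
    beta = rng.multivariate_normal(Q_inv @ X.T @ (y - C @ alpha) / sig2, Q_inv)
    # [tau2 | .] ~ IG((p+1)/2, beta' (Gamma Lambda)^{-1} beta / 2 + 1/nu)
    ssq = np.sum(beta**2 / (gam2[groups] * lam2))
    tau2 = stats.invgamma.rvs((p + 1) / 2, scale=ssq / 2 + 1 / nu, random_state=rng)
    # [sig2 | .] ~ IG((n+1)/2, ||y - C alpha - X beta||^2 / 2 + 1/nu)
    r = y - C @ alpha - X @ beta
    sig2 = stats.invgamma.rvs((n + 1) / 2, scale=r @ r / 2 + 1 / nu, random_state=rng)
    # [nu | .] ~ IG(1, 1/tau2 + 1/sig2)
    nu = stats.invgamma.rvs(1.0, scale=1 / tau2 + 1 / sig2, random_state=rng)
    # [lam2_gj | .] ~ IG(b_g + 1/2, beta_gj^2/(2 tau2 gam2_g) + 1)
    lam2 = stats.invgamma.rvs(b[groups] + 0.5,
                              scale=beta**2 / (2 * tau2 * gam2[groups]) + 1,
                              random_state=rng)
    # gam2_g ~ GIG(a_g - p_g/2, sum_j beta_gj^2/lam2_gj / tau2, 2); equivalent,
    # by GIG inversion, to the listed draw of gam2_g^{-2}.
    for g in range(gam2.size):
        idx = groups == g
        chi = np.sum(beta[idx]**2 / lam2[idx]) / tau2
        gam2[g] = rgig(a[g] - idx.sum() / 2, chi, 2.0, rng)
    return alpha, beta, tau2, sig2, nu, gam2, lam2
```

Iterating this sweep and discarding a burn-in period yields draws from the joint posterior; the per-sweep cost is dominated by the $p\times p$ solve in the $\beta$ update.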
[Figure 4 about here: a forest plot of estimated associations for 34 individual toxicants on the y-axis, organized into five groups (metals, PAHs, PBDEs, pesticides, and phthalates). The x-axis is the percent change in GGT for a twofold change in exposure, ranging from -6 to 6. Point estimates are plotted for five methods: GIGG regression, group horseshoe regression, group half-Cauchy, BGL-SS, and BSGS-SS.]

Figure 4: Associations between environmental toxicants (metals, phthalates, pesticides, PBDEs, and PAHs) and gamma glutamyl transferase (GGT) from NHANES 2003-2004 (n
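The two asymptotic approximations used in the proof of Corollary 3.2, the ratio $\Gamma(x+c)/(\Gamma(x)x^c)\to 1$ and the Stirling-based approximation to the beta function, are easy to check numerically. The sketch below is illustrative only (the function names are ours); both quantities are evaluated on the log scale to avoid overflow:

```python
import math

def gamma_ratio(x, c):
    # Gamma(x + c) / (Gamma(x) * x**c), computed via lgamma; tends to 1 as x grows.
    return math.exp(math.lgamma(x + c) - math.lgamma(x) - c * math.log(x))

def beta_stirling(x, y):
    # Stirling-based approximation to the beta function:
    # B(x, y) ~ sqrt(2*pi) * x**(x-1/2) * y**(y-1/2) / (x+y)**(x+y-1/2)
    return math.sqrt(2 * math.pi) * math.exp(
        (x - 0.5) * math.log(x) + (y - 0.5) * math.log(y)
        - (x + y - 0.5) * math.log(x + y))

if __name__ == "__main__":
    for x in (10.0, 100.0, 1000.0):
        print(x, gamma_ratio(x, 2.5))  # approaches 1 as x increases
    exact = math.exp(math.lgamma(50) + math.lgamma(150) - math.lgamma(200))
    print(beta_stirling(50, 150) / exact)  # close to 1
```

The error of both approximations shrinks at rate $O(1/x)$, which is why the limiting argument in Corollary 3.2 can replace the exact gamma and beta terms with their asymptotic forms.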