Spike and slab variational Bayes for high dimensional logistic regression
Kolyan Ray∗
Department of Mathematics, Imperial College London
[email protected]
Botond Szabó∗
Department of Mathematics, Vrije Universiteit Amsterdam
[email protected]
Gabriel Clara
Department of Mathematics, Vrije Universiteit Amsterdam
[email protected]

∗ Equal contribution.
Abstract
Variational Bayes (VB) is a popular scalable alternative to Markov chain Monte Carlo for Bayesian inference. We study a mean-field spike and slab VB approximation of widely used Bayesian model selection priors in sparse high-dimensional logistic regression. We provide non-asymptotic theoretical guarantees for the VB posterior in both ℓ₂ and prediction loss for a sparse truth, giving optimal (minimax) convergence rates. Since the VB algorithm does not depend on the unknown truth to achieve optimality, our results shed light on effective prior choices. We confirm the improved performance of our VB algorithm over common sparse VB approaches in a numerical study.

1 Introduction

Let x ∈ R^p denote a feature vector and Y ∈ {0,1} an associated binary label to be predicted. In logistic regression, one of the most widely used methods in classification, we model

$$P(Y=1\,|\,X=x) = 1 - P(Y=0\,|\,X=x) = \Psi(x^T\theta) = \frac{e^{x^T\theta}}{1+e^{x^T\theta}}, \qquad (1)$$

where θ ∈ R^p is an unknown regression parameter and Ψ(t) = e^t/(1+e^t) is the logistic function. Suppose we observe n training examples {(x₁,y₁),...,(x_n,y_n)}.

We study the sparse high-dimensional setting, where n ≤ p and typically n ≪ p, and many of the coefficients of θ are (close to) zero. This setting has been studied by many authors, notably using ℓ₁-regularized M-estimators (e.g. the LASSO), see for instance [3, 25, 33, 34] and the references therein. In Bayesian logistic regression, one assigns a prior distribution to θ, giving a probabilistic model. An especially natural Bayesian way to model sparsity is via a model selection prior, which assigns probabilistic weights to every potential model, i.e. every subset of {1,...,p} corresponding to selecting the non-zero coordinates of θ ∈ R^p. This is a widely used Bayesian approach and includes the hugely popular spike and slab prior [17, 31]. Such priors work well in many settings for estimation and prediction [2, 13, 15], uncertainty quantification [40, 14] and multiple hypothesis testing [12], see [4] for a recent review. An especially attractive property is their interpretability, particularly for variable selection, compared to many other black-box machine learning methods. For example, such methods provide posterior inclusion probabilities of particular features, and their credible sets, which are often important in practice.

However, the discrete nature of such priors makes scalable computation hugely challenging. Under a model selection prior, posterior exploration typically involves searching over all 2^p possible models, making standard Markov chain Monte Carlo (MCMC) methods infeasible for moderate p unless the feature vectors {x₁,...,x_n} satisfy strong structural conditions like orthogonality [15, 50]. There has been recent progress on adapting MCMC methods to sparse high-dimensional logistic regression [32], while another common alternative is to instead use continuous shrinkage-type priors [11, 56].

A popular scalable alternative is variational Bayes (VB), which approximates the posterior by solving an optimization problem. One proposes an approximating family of tractable distributions, called the variational family, and finds the member of this family that is closest to the computationally intractable posterior in the Kullback-Leibler (KL) sense. This member is taken as a substitute for the posterior. An especially popular family consists of factorizable distributions, an approach called mean-field variational Bayes. VB scales to large data sets and works empirically well in many models, see [7] for a recent review.
In this work, we study the theoretical properties of a mean-field VB approach to sparsity-inducing priors with variational family the set of factorizable spike and slab distributions. This is a natural approximation since it keeps the discrete model selection aspect and many of the interpretable features of the original posterior, while reducing the full O(2^p) model complexity to a much more manageable O(p). The procedure is adaptive in that it does not depend on the typically unknown sparsity level, which avoids delicate issues about tuning hyperparameters. This sparse variational family has been employed in various settings [23, 28, 36, 42, 48], including logistic regression [10, 60]. VB is natural in model (1) since even in the simplest low-dimensional setting (p ≪ n) using Gaussian priors, the posterior is intractable and VB is widely used [6, 24, 37, 47, 53].

However, VB generally comes with few theoretical guarantees, with none currently available in high-dimensional logistic regression. Our main contribution is to show that the sparse VB posterior converges to the true sparse vector at the optimal (minimax) rate in both ℓ₂ and prediction loss. We prove this under the same conditions for which the true (computationally infeasible) posterior is known to converge [2], showing that one does not necessarily need to sacrifice theoretical guarantees when using sparse VB, at least for estimation and prediction. Our convergence bounds are non-asymptotic and thus reflect relevant finite-sample situations.

Our results also provide practical insights on effective prior and VB calibrations, in particular the choice of prior slab distribution. Many existing works employ Gaussian slabs for the underlying prior, even though these cause excessive shrinkage and suboptimal parameter recovery in benchmark models [15]. Our theoretical results show that optimal parameter recovery is possible if the VB posterior is based on a prior with heavier-tailed Laplace slabs, corroborating findings in linear regression that one should use prior slabs with exponential or heavier tails [13, 15], including for VB [42]. We confirm in simulations that using Laplace prior slabs, as our theory suggests, indeed empirically outperforms the usual VB choice of Gaussian prior slabs, demonstrating the practical importance of the prior slab choice. We further demonstrate that our VB algorithm is empirically competitive with other state-of-the-art Bayesian sparse variable selection methods for logistic regression.

Lastly, we provide conditions on the design matrix under which sparse VB can be expected to work well. Together, these provide theoretical backing for using VB for estimation and prediction in the widely used sparse high-dimensional Bayesian logistic regression model (1).

Related work. Theoretical guarantees for VB have been studied for specific models, including linear models [36, 42], exponential family models [51, 52], stochastic block models [59], latent Gaussian models [45] and topic models [18]. In low-dimensional settings (p ≪ n), some Bernstein-von Mises results have also been obtained [29, 54, 55].
In high-dimensional and nonparametric settings, general results have been derived [38, 61] based on the classic Bayesian prior mass and testing approach [19]. There has also been work on studying variational approximations to fractional posteriors [1, 57]. For logistic regression, theoretical results have been established for the fully Bayesian spike and slab approach [2, 32] and its continuous relaxation [56]. Theoretical guarantees for VB in sparse linear regression have recently been obtained in [42]. We combine ideas from this paper with tools from high-dimensional and nonparametric Bayesian statistics [2, 13, 35] to obtain theoretical results in the nonlinear logistic regression model (1). For our algorithm derivation, we use ideas from VB for Bayesian logistic regression [10, 24].

Organization. In Section 2 we detail the problem setup, including the notation, prior, variational family and conditions on the design matrix. Main results are found in Section 3, details of the VB algorithm in Section 4, simulations in Section 5 and discussion in Section 6. In the supplement, we present streamlined proofs for the asymptotic results (Section 7), additional simulations (Section 8), discussion concerning the design matrix conditions (Section 9), full statements and proofs of the non-asymptotic results (Section 10) and a derivation of the VB algorithm (Section 11).

2 Problem setup
Notation. Recall that we observe n training examples {(x₁,y₁),...,(x_n,y_n)} from model (1). For u ∈ R^d, we write ‖u‖₂ = (Σ_{i=1}^d u_i²)^{1/2} for the usual Euclidean norm. Let X be the n×p matrix with i-th row equal to x_i^T = (x_{i1},...,x_{ip}) and, writing X_{·j} for the j-th column of X, set ‖X‖ = max_{1≤j≤p}‖X_{·j}‖₂ = max_{1≤j≤p}(X^TX)_{jj}^{1/2}. We denote by P_θ the probability distribution of observing Y = (Y₁,...,Y_n) from model (1) with parameter θ ∈ R^p and by E_θ the corresponding expectation. For two probability measures P, Q, we write KL(P||Q) = ∫ log(dP/dQ) dP for the Kullback-Leibler divergence. Let ∇_θ f(y,θ) denote the gradient of f with respect to θ. For θ ∈ R^p and a subset S ⊆ {1,...,p}, we write |S| for the cardinality of S and θ_S for the vector (θ_i : i ∈ S) ∈ R^{|S|}. We set S_θ = {1 ≤ i ≤ p : θ_i ≠ 0} to be the set of non-zero coefficients of θ and write s₀ = |S_{θ₀}| for the sparsity level of the true parameter θ₀.
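To make the setup concrete, the following minimal sketch (our illustration, not the authors' code) generates data from model (1) with an i.i.d. Gaussian design of the kind used in the simulations of Section 5, and computes the norm ‖X‖ defined above; the dimensions and signal are placeholder values.

```python
import numpy as np

def simulate_logistic(n, p, theta0, seed=0):
    """Draw (X, Y) from model (1): X has i.i.d. N(0,1) entries and
    Y_i ~ Bernoulli(Psi(x_i^T theta0)) independently across i."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    prob = 1.0 / (1.0 + np.exp(-X @ theta0))   # Psi(x_i^T theta0)
    Y = rng.binomial(1, prob)
    return X, Y

theta0 = np.zeros(500)
theta0[:2] = 2.0                               # an s0 = 2 sparse truth
X, Y = simulate_logistic(250, 500, theta0)
X_norm = np.linalg.norm(X, axis=0).max()       # ||X|| = max_j ||X_.j||_2, of order sqrt(n)
```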
Throughout the paper we work under the following frequentist assumption:

Assumption 1. There is a true s₀-sparse parameter θ₀ ∈ R^p generating the data Y ∼ P_{θ₀}.

2.1 Prior and variational family

A model selection prior first selects a dimension s ∈ {0,...,p} from a prior π_p, then a subset S ⊆ {1,...,p} uniformly at random from all $\binom{p}{s}$ subsets of size |S| = s, and lastly selects a set of non-zero values for θ_S = (θ_j)_{j∈S} ∈ R^{|S|} from a product of centered Laplace distributions with density $\prod_{j\in S}\mathrm{Lap}(\lambda)(\theta_j) = (\lambda/2)^{|S|}\exp(-\lambda\sum_{j\in S}|\theta_j|)$, λ > 0. This prior is represented via the following hierarchical scheme:

$$s \sim \pi_p(s), \qquad S\,|\,|S|=s \sim \mathrm{Unif}_{p,s}, \qquad \theta_j \overset{\mathrm{ind}}{\sim} \begin{cases}\mathrm{Lap}(\lambda), & j \in S,\\ \delta_0, & j \notin S,\end{cases} \qquad (2)$$

where Unif_{p,s} selects S from the $\binom{p}{s}$ possible subsets of {1,...,p} of size s with equal probability and δ₀ denotes the Dirac mass at zero. We assume that there are constants A₁, A₂, A₃, A₄ > 0 with

$$A_1 p^{-A_3}\,\pi_p(s-1) \;\le\; \pi_p(s) \;\le\; A_2 p^{-A_4}\,\pi_p(s-1), \qquad s = 1,\dots,p. \qquad (3)$$

Condition (3) is satisfied by a wide range of priors, such as complexity priors [15] and binomial priors, including the widely used spike and slab prior θ_j ∼ iid ρ Lap(λ) + (1−ρ)δ₀ by taking π_p = Binomial(p,ρ). Assigning a hyperprior ρ ∼ Beta(1, p^t), t > 1, to the prior inclusion probabilities also satisfies (3) ([15], Example 2.2), allows mixing over the sparsity level and, most importantly, gives a spike and slab prior calibration that does not depend on unknown hyperparameters.

While most works use Gaussian slabs for the prior [10, 23, 28, 36, 48, 60], we instead use Laplace slabs, since using slab distributions with lighter than exponential tails can cause excessive shrinkage and deteriorate estimation in linear regression [15, 42]. We illustrate numerically that the same phenomenon can occur in logistic regression, where using Laplace rather than Gaussian prior slabs improves estimation, see Section 5. This shows that our theoretical results in Section 3 are reflected in practice and sheds light on suitable prior choices. Full details of the modified algorithm are provided in Algorithm 1 below.

For our theoretical results, we suppose the regularization parameter λ of the slab satisfies

$$\|X\|\sqrt{\log p} \;\le\; \lambda \;\le\; \alpha\,\|X\|\sqrt{\log p} \qquad (4)$$

for some α ≥ 1. The choice λ ≍ ‖X‖√(log p) is common for the regularization parameter of the LASSO ([9], Chapter 6), which corresponds to the posterior mode based on a full product Laplace prior θ_j ∼ iid Lap(λ) with no extra model selection as in (2). We note the additional model selection is necessary, since the pure Laplace prior can behave badly in sparse high-dimensional settings [15]. Specific values of ‖X‖ for some design matrices are given in Section 9, but one should typically think of ‖X‖ ∼ √n.

Bayesian inference about θ, including reconstruction of the class probabilities P_θ(Y=1|X=x), is carried out via the posterior distribution Π(·|Y). Since computing the posterior is infeasible for large p, we consider the following mean-field family of approximating distributions:

$$\mathcal{Q} = \left\{ Q_{\mu,\sigma,\gamma} = \prod_{j=1}^p \big[\gamma_j N(\mu_j,\sigma_j^2) + (1-\gamma_j)\delta_0\big] : \gamma_j \in [0,1],\ \mu_j \in \mathbb{R},\ \sigma_j > 0 \right\}. \qquad (5)$$

The VB posterior is the element of Q that minimizes the KL divergence to the exact posterior:

$$Q^* = \mathop{\arg\min}_{Q\in\mathcal{Q}}\ \mathrm{KL}\big(Q\,\|\,\Pi(\cdot|Y)\big). \qquad (6)$$

The family Q consists of all factorizable distributions of spike and slab form, which is a natural approximation for sparse settings with variable selection. The (γ_j) correspond to the VB variable inclusion probabilities, thereby keeping the interpretability of the original model selection prior. While the prior may factorize like (5), the posterior does not, and we replace the full 2^p posterior model weights with the p probabilities (γ_j), greatly reducing the posterior dimension. Note that while the prior has Laplace slabs, we can fit Gaussian distributions in the variational family since the likelihood induces subgaussian tails in the posterior.
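To illustrate the family (5): under Q_{µ,σ,γ} each coordinate is an independent two-component mixture, so sampling is straightforward. The following sketch (ours, with placeholder parameter values) draws from Q_{µ,σ,γ}:

```python
import numpy as np

def sample_mean_field(mu, sigma, gamma, size, seed=0):
    """Sample from Q_{mu,sigma,gamma} in (5): independently over coordinates,
    theta_j = 0 with probability 1 - gamma_j, else theta_j ~ N(mu_j, sigma_j^2)."""
    rng = np.random.default_rng(seed)
    p = len(mu)
    slab = rng.random((size, p)) < gamma            # inclusion indicators
    draws = rng.normal(mu, sigma, size=(size, p))   # slab component
    return np.where(slab, draws, 0.0)

theta = sample_mean_field(mu=np.array([2.0, -1.0, 0.0]),
                          sigma=np.array([0.3, 0.3, 0.3]),
                          gamma=np.array([0.95, 0.60, 0.02]),
                          size=5)
```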
Computing the VB posterior Q* for the variational family Q in (5) is an optimization problem that has been studied in the literature [23, 28, 36, 42, 48], including for logistic regression [10, 60], mainly using coordinate ascent variational inference (CAVI). While these works mostly consider Gaussian slabs for the prior, CAVI can be suitably modified to the Laplace case, see Algorithm 1.

2.2 Design matrix conditions

In the high-dimensional case p > n, the parameter θ in model (1) is not identifiable, let alone estimable, without additional conditions on the design matrix X. In the sparse setting, a sufficient condition for consistent estimation is 'local invertibility' of X^TX when restricted to sparse vectors. The following definitions are taken from [2] and make precise this notion of invertibility. Define the diagonal matrix W ∈ R^{n×n} with i-th diagonal entry

$$W_{ii} = g''(x_i^T\theta_0) = \Psi(x_i^T\theta_0)\big(1-\Psi(x_i^T\theta_0)\big) \qquad (7)$$

and the compatibility-type constant

$$\bar\kappa = \inf\left\{ \frac{\|W^{1/2}X\theta\|_2}{\|X\|\,\|\theta\|_2} : \|\theta_{S_{\theta_0}^c}\|_1 \le 7\,\|\theta_{S_{\theta_0}}\|_1,\ \theta \neq 0 \right\}.$$

For dimension s ∈ {1,...,p}, set

$$\bar\kappa(s) = \sup\left\{ \frac{\|X\theta\|_2}{\|X\|\,\|\theta\|_2} : 0 \neq |S_\theta| \le s \right\}, \qquad \kappa(s) = \inf\left\{ \frac{\|W^{1/2}X\theta\|_2}{\|X\|\,\|\theta\|_2} : 0 \neq |S_\theta| \le s \right\}.$$
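The quantities κ̄(s) and κ(s) can be computed exactly for small toy designs by enumerating supports, since they are extreme scaled singular values of X and W^{1/2}X restricted to sparse vectors. The following brute-force sketch (ours, exponential in p and only for illustration) assumes θ₀ is known so that W can be formed from (7):

```python
import numpy as np
from itertools import combinations

def compat_constants(X, theta0, s):
    """Brute-force kappa(s) and kappa-bar(s): the smallest scaled singular value
    of W^{1/2} X_S and the largest of X_S over all supports S with |S| <= s."""
    n, p = X.shape
    X_norm = np.linalg.norm(X, axis=0).max()        # ||X||
    psi = 1.0 / (1.0 + np.exp(-X @ theta0))
    w_half = np.sqrt(psi * (1.0 - psi))             # diagonal of W^{1/2}, eq. (7)
    kappa, kappa_bar = np.inf, 0.0
    for k in range(1, s + 1):
        for S in combinations(range(p), k):
            XS = X[:, list(S)]
            kappa = min(kappa,
                        np.linalg.svd(w_half[:, None] * XS, compute_uv=False).min() / X_norm)
            kappa_bar = max(kappa_bar,
                            np.linalg.svd(XS, compute_uv=False).max() / X_norm)
    return kappa, kappa_bar
```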
For a given L > 0 and the α defined in (4), we require the following bound on the design matrix:

$$\|X\| \;\ge\; \alpha \max\left\{ \frac{(L+2)\,\|X\|_\infty}{\kappa((L+1)s_0)},\ \frac{1}{\bar\kappa} \right\} s_0 \sqrt{\log p}, \qquad (8)$$

where ‖X‖_∞ denotes the maximal absolute entry of X. These constants are widely used in the sparsity literature (e.g. [9]), including for high-dimensional logistic regression [2, 34, 56]. Assuming such constants are bounded away from zero and infinity, Atchadé [2] proves that the original posterior Π(·|Y) converges to the truth at the optimal rate. We show here that under no further assumptions on the design matrix X, the VB posterior Q* also converges to the truth at the optimal rate. We thus provide theoretical guarantees for the scalable VB approximation under the same conditions for which the true posterior is known to converge. Many standard design matrices satisfy these compatibility conditions, such as orthogonal designs, i.i.d. (including Gaussian) random matrices and matrices satisfying the 'strong irrepresentability condition' of [62]. Details of these examples and further discussion are provided in Section 9 in the supplement (see also Chapter 6 of [9]).

For a normalized design matrix with entries of size O(1), one has ‖X‖ ∼ √n, so that (8) is a minimal sample size condition. For suitably bounded compatibility constants, this translates into the minimal sample size n ≳ s₀² log p, as in [2]. The frequentist ℓ₁-regularized M-estimator is known to converge at the same rate under similar assumptions to ours for a deterministic design matrix X [25] and under slightly weaker sample size conditions for an i.i.d. subgaussian random design matrix X (n ≳ s₀ log p) [34].

3 Main results

We now provide theoretical guarantees for the VB posterior Q* in (6). We present here our results in a simpler asymptotic form as n, p → ∞ for easier readability. More complicated, but practically more relevant, finite sample guarantees are provided in Section 10 of the supplement. In particular, one should keep in mind that the results here do indeed reflect finite-sample behaviour.

We investigate how well the VB posterior recovers the true underlying high-dimensional parameter θ₀. This is measured via the speed of posterior concentration, which studies the size of the smallest ℓ₂- or prediction-type neighbourhood around the true θ₀ that contains most of the (VB) posterior probability [19]. This is a frequentist assessment that describes the typical behaviour of the VB posterior under the true generative model, see Assumption 1. Posterior concentration rates are now entering the machine learning community as tools to gain insights into (variational) Bayesian methods and assess the suitability of priors and their calibrations (e.g. [38, 39, 46, 55]). Such results also quantify the typical distance between a point estimator θ̂ (posterior mean/median) and the truth ([19], Theorem 2.5), as well as the typical posterior spread about the truth. Taken together, these quantities are crucial for the accuracy of Bayesian uncertainty quantification and so good posterior concentration results are necessary conditions for ensuring the latter. Ideally, most of the posterior probability should be concentrated in a ball centered around the true θ₀ with radius proportional to the optimal (minimax) rate. This is the case for the true computationally infeasible posterior [2] and extends to the VB posterior Q*, as we now show. This provides a universal, objective guarantee for the VB posterior.
Theorem 1. Suppose the model selection prior (2) satisfies (3) and (4) and the design matrix X satisfies assumption (8) for a sequence L = M_n. Then the VB posterior Q* satisfies

$$E_{\theta_0}\, Q^*\left(\theta \in \mathbb{R}^p : \|\theta - \theta_0\|_2 \ge \frac{M_n^{1/2}}{\kappa(M_n s_0)}\,\frac{\sqrt{s_0 \log p}}{\|X\|}\right) = O\big(C_\kappa / M_n\big) + o(1),$$

where $C_\kappa = L\big(\bar\kappa(Ls_0)^2\,\kappa((1+4L/A_4)s_0)^{-2} + \kappa(Ls_0)^{-2}\big)$ and $L = 2\max\{A_2/A_4,\ (1.5\,\alpha^2/\bar\kappa^2 + 2A_2 + \log(4+\bar\kappa(s_0)))/A_4\}$. Define the mean-squared prediction error $\|p_\theta - p_0\|_n^2 = n^{-1}\sum_{i=1}^n (\Psi(x_i^T\theta) - \Psi(x_i^T\theta_0))^2$, where we recall $p_0(x_i) = P_{\theta_0}(Y=1|X=x_i) = \Psi(x_i^T\theta_0)$. Then the VB posterior Q* satisfies

$$E_{\theta_0}\, Q^*\left(\theta \in \mathbb{R}^p : \|p_\theta - p_0\|_n \ge \sqrt{M_n}\,\frac{\bar\kappa(M_n s_0)}{\kappa(M_n s_0)}\,\sqrt{\frac{s_0\log p}{n}}\right) = O\big(C_\kappa/M_n\big) + o(1).$$

In particular, if C_κ/M_n → 0, then the posterior concentrates around the true sparse parameter θ₀ at the optimal (minimax) rate in both ℓ₂ and mean-squared prediction loss.
Remark. In Theorem 1, we keep track of the compatibility numbers in the rate. If κ̄(Ls₀), κ((1+4L/A₄)s₀) and κ(Ls₀) are bounded away from zero and infinity, as is often the case, the right-hand side of both displays tends to zero for any M_n → ∞ growing arbitrarily slowly. In this case, the rates simplify to $M_n^{1/2}\sqrt{s_0\log p}/\|X\|$ and $M_n^{1/2}\sqrt{s_0\log p/n}$, respectively. In Section 9 of the supplement, we give sufficient conditions for this to happen.

Since $\sqrt{s_0\log p}/\|X\|$ is the minimax rate, this result says that for estimating θ₀, in either ℓ₂ or prediction loss, the VB approximation behaves optimally from a theoretical frequentist perspective. In fact, since these conditions are essentially the same as were used to study the true posterior [2], our results suggest one does not lose much by using this computationally more efficient approximation, at least regarding estimation and prediction. This backs up the empirical evidence that VB can provide excellent scalable estimation. For a non-asymptotic version of Theorem 1, see Theorem 4 in Section 10.

Since the prior and variational family do not depend on unknown parameters (e.g. the sparsity level s₀ = |S_{θ₀}| of θ₀), the procedure is adaptive, i.e. it can recover an s₀-sparse truth nearly as well as if we knew the exact true sparsity level beforehand. This avoids difficult issues about tuning parameter selection. As mentioned above, posterior concentration results can be used to obtain guarantees for point estimators, such as the VB posterior mean.
Theorem 2. Assume the conditions of Theorem 1 and that $n \ge (1.5\,\alpha^2/\bar\kappa^2 + 2A_2 + \log(4+\bar\kappa(s_0)))^2\, s_0^2 \log p$. Then the VB predictive mean $\hat p^*(x) = \int P_\theta(Y=1|X=x)\,dQ^*(\theta) = \int \Psi(x^T\theta)\,dQ^*(\theta)$ and the true prediction function $p_0(x) = P_{\theta_0}(Y=1|X=x)$ satisfy

$$P_{\theta_0}\left( \|\hat p^* - p_0\|_n \ge M_n^{1/2}\, \frac{\bar\kappa\big(\tfrac{n}{A_4\log p} + s_0\big)}{\kappa\big(\tfrac{n}{A_4\log p} + (1-\tfrac{1}{A_4})s_0\big)}\, \sqrt{\frac{s_0\log p}{n}} \right) = O(C_\kappa/M_n) + o(1).$$

Remark. If ‖X‖ ∼ √n, as is often the case, then the extra sample size condition in Theorem 2 is automatically implied by (8), see Section 2.2. We have again kept track of the compatibility numbers; if these are bounded away from zero and infinity, then the probability converges to zero and we recover the rate $M_n^{1/2}\sqrt{s_0\log p/n}$ for any M_n → ∞ arbitrarily slowly.

We next show the VB posterior Q* does not provide overly conservative model selection, in the sense that it concentrates on models of size at most a constant multiple of the true model size s₀. This gives some guarantees for variable selection, in particular bounding the number of false positives. It provides a first theoretical underpinning for interpretable inference when using this VB approximation.
Theorem 3. Under the conditions of Theorem 1, the VB posterior satisfies

$$E_{\theta_0}\, Q^*\big(\theta \in \mathbb{R}^p : |S_\theta| \ge M_n s_0\big) = O\big(C_\kappa/M_n\big) + o(1).$$

We have thus shown the VB posterior Q* (1) concentrates at the optimal rate around the sparse truth in both ℓ₂ and prediction loss and (2) does not select overly large models, with the VB posterior mean sharing property (1). These provide some reassuring theoretical guarantees regarding the behaviour of this scalable and interpretable sparse VB approximation. Our results are reflected in practice, as our VB algorithm performs better empirically than commonly used sparse VB approaches, see Section 5.

4 The VB algorithm

We present a coordinate-ascent variational inference (CAVI) algorithm to compute the VB posterior Q* in (6). Consider the prior (2) with θ_j ∼ iid (1−w)δ₀ + w Lap(λ) and hyperprior w ∼ Beta(a₀, b₀). Introducing binary latent variables (z_j)_{j=1}^p, this spike and slab prior has the hierarchical representation

$$w \sim \mathrm{Beta}(a_0, b_0), \qquad z_j\,|\,w \overset{iid}{\sim} \mathrm{Bernoulli}(w), \qquad \theta_j\,|\,z_j \overset{ind}{\sim} (1-z_j)\,\delta_0 + z_j\,\mathrm{Lap}(\lambda). \qquad (9)$$

Minimizing the objective (6) is intractable for Bayesian logistic regression, so we instead minimize a surrogate objective obtained by maximizing a lower bound on the marginal likelihood following the ideas of [24, 10], see Section 11 for details. This common approach is known to lead to improved accuracy in approximation [24], see also Chapter 10.6 of [6]. The surrogate objective is non-convex, as is typically the case in VB, so the CAVI algorithm can be sensitive to initialization [7] and parameter updating order [42]. Introducing a free parameter η ∈ R^n, we can establish the upper bound

$$\mathrm{KL}\big(Q_{\mu,\sigma,\gamma}\,\|\,\Pi(\cdot|Y)\big) \;\le\; \mathrm{KL}\big(Q_{\mu,\sigma,\gamma}\,\|\,\Pi\big) - E_{\theta\sim Q_{\mu,\sigma,\gamma}}[f(\theta,\eta)] \qquad (10)$$

for a suitable function f defined in (30). We minimize the right-hand side over the variational family Q_{µ,σ,γ} ∈ Q, i.e. over the parameters µ, σ, γ. Since we seek the tightest possible upper bound in (10), we also minimize this over the free parameter η. In particular, CAVI alternates between updating η for fixed µ, σ, γ and then cycling through µ_j, σ_j, γ_j and updating these while keeping all other parameters fixed. Keeping all other parameters fixed, one updates µ_j and σ_j by minimizing

$$\mu_j \mapsto \lambda\sigma_j\sqrt{\tfrac{2}{\pi}}\,e^{-\mu_j^2/(2\sigma_j^2)} + \lambda\mu_j\,\mathrm{erf}\!\left(\tfrac{\mu_j}{\sqrt{2}\,\sigma_j}\right) + \mu_j^2\sum_{i=1}^n\frac{\tanh(\eta_i/2)}{4\eta_i}\,x_{ij}^2 + \mu_j\left(2\sum_{i=1}^n\frac{\tanh(\eta_i/2)}{4\eta_i}\,x_{ij}\sum_{k\ne j}\gamma_k x_{ik}\mu_k - \sum_{i=1}^n(y_i - \tfrac12)x_{ij}\right), \qquad (11)$$

$$\sigma_j \mapsto \lambda\sigma_j\sqrt{\tfrac{2}{\pi}}\,e^{-\mu_j^2/(2\sigma_j^2)} + \lambda\mu_j\,\mathrm{erf}\!\left(\tfrac{\mu_j}{\sqrt{2}\,\sigma_j}\right) - \log\sigma_j + \sigma_j^2\sum_{i=1}^n\frac{\tanh(\eta_i/2)}{4\eta_i}\,x_{ij}^2,$$

respectively, where erf(x) = (2/√π)∫₀^x e^{−t²} dt is the error function. One updates γ_j by solving

$$\log\frac{1-\gamma_j}{\gamma_j} = \log\frac{b_0}{a_0} + \lambda\sigma_j\sqrt{\tfrac{2}{\pi}}\,e^{-\mu_j^2/(2\sigma_j^2)} + \lambda\mu_j\,\mathrm{erf}\!\left(\tfrac{\mu_j}{\sqrt{2}\,\sigma_j}\right) - \mu_j\sum_{i=1}^n(y_i-\tfrac12)x_{ij} + \sum_{i=1}^n\frac{\tanh(\eta_i/2)}{4\eta_i}\left(x_{ij}^2(\mu_j^2+\sigma_j^2) + 2x_{ij}\mu_j\sum_{k\ne j}\gamma_k x_{ik}\mu_k\right) - \log(\lambda\sigma_j) \qquad (12)$$

and updates η via

$$\eta_i^2 = E_{\mu,\sigma,\gamma}\big[(x_i^T\theta)^2\big] = \sum_{k=1}^p \gamma_k x_{ik}^2(\mu_k^2+\sigma_k^2) + \sum_{k=1}^p\sum_{l\ne k}(\gamma_k x_{ik}\mu_k)(\gamma_l x_{il}\mu_l). \qquad (13)$$

A full derivation of (10)-(13) can be found in Section 11. In our implementation, we minimize (11) using the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm [27]. To improve scalability, we also perform the parameter updates in parallel, updating the objective function parameter values only at the end of each iteration. Details are provided in Algorithm 1.
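For intuition: the bound (10) rests on the quadratic bound of Jaakkola and Jordan [24] for the logistic log-partition function. In a standard form (our presentation; the paper's function f is the one given in (30) of Section 11), it reads

$$\log(1+e^t) \;\le\; \log(1+e^\eta) + \frac{t-\eta}{2} + \frac{\tanh(\eta/2)}{4\eta}\,(t^2 - \eta^2), \qquad t, \eta \in \mathbb{R},$$

with equality at t = ±η. Applied at t = x_i^Tθ and integrated over θ ∼ Q_{µ,σ,γ}, the right-hand side involves only the first two moments of x_i^Tθ, which are available in closed form under the mean-field family (5); this is what makes the surrogate objective tractable, and optimizing the bound over η_i yields the update (13).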
Algorithm 1: Modified CAVI for variational Bayes with Laplace slabs

Input: X ∈ R^{n×p}, Y ∈ {0,1}^n
(µ, σ, γ) ← init_param()
while !convergence do
    η ← update_lower_bound(µ, σ, γ)            // implements eq. (13)
    obj_fun ← generate_obj_fun(µ, σ, γ, η, X, Y)
    for j ∈ [p] do
        (µ_j, σ_j) ← L_BFGS(obj_fun.at(j))      // minimizes eq. (11)
    end
    γ ← update_alpha(µ, σ, γ, η, X, Y)          // implements eq. (12)
end
Output: µ ∈ R^p, σ ∈ R^p_{>0}, γ ∈ [0,1]^p
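For readers who prefer code to pseudocode, here is a compact NumPy/SciPy sketch of Algorithm 1 (our illustration, not the paper's C++ implementation; the constant terms in the γ-update follow a standard derivation and may differ in small details from (12)):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import erf, expit

def jj(eta):
    """Jaakkola-Jordan coefficient tanh(eta/2)/(4*eta), with limit 1/8 at eta = 0."""
    eta = np.asarray(eta, dtype=float)
    out = np.full_like(eta, 0.125)
    nz = np.abs(eta) > 1e-10
    out[nz] = np.tanh(eta[nz] / 2.0) / (4.0 * eta[nz])
    return out

def cavi_laplace(X, y, lam=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Sketch of Algorithm 1: CAVI for the spike-and-slab family (5), Laplace slabs."""
    n, p = X.shape
    mu, sigma, gamma = np.zeros(p), np.ones(p), np.full(p, 0.5)
    yc = y - 0.5
    for _ in range(n_iter):
        # eq. (13): eta_i^2 = E[(x_i^T theta)^2] under Q_{mu,sigma,gamma}
        m = gamma * mu                                    # E[theta_j]
        v = gamma * (mu**2 + sigma**2) - m**2             # Var[theta_j]
        eta = np.sqrt((X @ m)**2 + (X**2) @ v)
        w = jj(eta)
        for j in range(p):
            xj = X[:, j]
            a = np.dot(w * xj, xj)                        # sum_i jj(eta_i) x_ij^2
            r = X @ (gamma * mu) - xj * gamma[j] * mu[j]  # fit excluding coordinate j
            b = 2.0 * np.dot(w * xj, r) - np.dot(yc, xj)
            def obj(par):                                 # objective of eq. (11)
                mj, sj = par[0], np.exp(par[1])           # optimize over log sigma_j
                e_abs = (sj * np.sqrt(2/np.pi) * np.exp(-mj**2/(2*sj**2))
                         + mj * erf(mj / (np.sqrt(2)*sj)))    # E|theta_j| under slab
                return lam * e_abs - np.log(sj) + a*(mj**2 + sj**2) + b*mj
            res = minimize(obj, x0=[mu[j], np.log(sigma[j])], method="L-BFGS-B")
            mu[j], sigma[j] = res.x[0], np.exp(res.x[1])
            # eq. (12)-style logit of the inclusion probability gamma_j
            e_abs = (sigma[j]*np.sqrt(2/np.pi)*np.exp(-mu[j]**2/(2*sigma[j]**2))
                     + mu[j]*erf(mu[j]/(np.sqrt(2)*sigma[j])))
            logit = (np.log(a0/b0) + 0.5*np.log(2*np.pi*sigma[j]**2) + 0.5
                     + np.log(lam/2.0) - lam*e_abs + mu[j]*np.dot(yc, xj)
                     - a*(mu[j]**2 + sigma[j]**2) - 2.0*mu[j]*np.dot(w*xj, r))
            gamma[j] = expit(logit)
    return mu, sigma, gamma
```

Here each (µ_j, σ_j) pair is optimized jointly by L-BFGS, mirroring the pseudocode; a fully parallel update, as in the actual implementation, would instead freeze the objective parameters within each sweep and refresh them only at the end of the iteration.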
5 Numerical study

We empirically compare the performance of our VB method VB (Lap), based on prior (9) with parameters a₀ = b₀ = λ = 1, to other state-of-the-art Bayesian variable selection methods in a simple simulation study. We implemented Algorithm 1 in C++ using the Rcpp interface and used the Armadillo linear algebra library and the ensmallen optimization library, see [5, 43]. Note that the VB objective function is highly non-convex and so the local minimum returned by Algorithm 1 does not necessarily equal the global minimizer of the VB optimization problem (6). For comparison, we first consider the usual VB approach
VB (Gauss), where the same variational family (5) is used, but the Laplace slabs are replaced by standard normal distributions in the prior (9) (e.g. [23, 28, 36, 48, 60]). We also compare our approach with the varbvs [10],
SkinnyGibbs [32],
BinaryEMVS [30],
BhGLM [58] and rstanarm [20] packages. We provide a brief description of these methods in Section 8.1 in the supplement. Note that the choice of hyperparameters can affect each of these methods and performance gains are possible from using good data-driven choices. For instance, our algorithm is sensitive to the choice of λ, see Section 8.3 in the supplement.

Due to space constraints, we provide here only one test case and defer the remaining tests to Section 8.2 of the supplement. We take n = 250, p = 500 and X to be a standard Gaussian design matrix, i.e. X_ij ∼ iid N(0,1), and set the true signal θ₀ = (2, 2, 0, ..., 0)^T to be s₀ = 2 sparse. We ran the experiment 200 times for each method and report the means and standard deviations of the following performance measures: (i) true positive rate (TPR), (ii) false discovery rate (FDR), (iii) ℓ₂-loss of the posterior mean θ̂, i.e. ‖θ̂ − θ₀‖₂, (iv) mean-squared predictive error (MSPE) of the posterior mean, i.e. $(n^{-1}\sum_{i=1}^n |\Psi(x_i^T\hat\theta) - \Psi(x_i^T\theta_0)|^2)^{1/2}$, and (v) run time in seconds. Since the BhGLM and rstanarm packages do not have explicit model selection subroutines, the TPR and FDR are not applicable. Similarly, SkinnyGibbs provides posterior model selection probabilities and not the posterior mean, hence neither the ℓ₂-error nor the MSPE are applicable. The results are in Table 1.

[Table 1: Comparing sparse Bayesian methods in high-dimensional logistic regression. Rows: VB (Lap), VB (Gauss), varbvs, SkinnyGibbs, BinEMVS, BhGLM, rstanarm; columns: TPR, FDR, ℓ₂-error, MSPE and run time, each reported as mean ± standard deviation; the numerical entries were lost in extraction. Results based on 200 runs for an i.i.d. standard Gaussian design matrix X ∈ R^{250×500} with X_ij ∼ iid N(0,1) and true signal θ₀,₁ = θ₀,₂ = 2, θ₀,j = 0 for 3 ≤ j ≤ 500.]

In Table 1 and the additional simulations in Section 8, we see that using Laplace slabs in the prior (9) generally outperforms the commonly used Gaussian slabs in all statistical metrics (ℓ₂-loss, MSPE, FDR), in some cases substantially so. This highlights the empirical advantages of using Laplace rather than Gaussian slabs for the prior underlying the VB approximation and matches the theory presented in Section 3, as well as similar observations in linear regression [42]. We also highlight the excellent FDR of VB (Lap), which warrants further investigation given its importance in Bayesian variable selection and multiple hypothesis testing.

However, the computational run time is substantially slower for Laplace slabs due to the absence of analytically tractable update formulas as in the Gaussian case. The optimization routines required in Algorithm 1 mean a naive implementation can significantly increase the run time; we are currently working on a more efficient implementation as an R package sparsevb [16] that should reduce the run time by at least an order of magnitude.

The other methods perform roughly similarly to our algorithm in terms of estimation (ℓ₂-error) and prediction (MSPE) error, doing better in certain test cases and worse in others. It seems there is no clearly dominant Bayesian approach regarding accuracy. However, our method provides the best results concerning model selection, generally having the best FDR while maintaining a competitive TPR, as suggested by Theorem 3.
The other methods all perform comparably, with varbvs having the best TPR but a substantially higher FDR, meaning it identifies many coefficients as significant, both correctly and incorrectly.

We note that the two VB methods based on Gaussian prior slabs (VB (Gauss) and varbvs), which use analytic update formulas, are significantly faster than the other methods. All the VB methods (including ours) scale much better to larger model sizes (e.g. p = 5000) than the other methods, which did not finish running in a reasonable amount of time when p = 5000, see Section 8.2. As expected, the MCMC methods generally performed slowest, though SkinnyGibbs, which is designed to scale up MCMC to larger sparse models, is indeed an order of magnitude faster than rstanarm.

Uncertainty quantification. An advantage of Bayesian methods is their ability to perform uncertainty quantification via credible sets. The present mean-field VB approximation provides access to marginal credible sets for individual features, which can be more informative than just the VB posterior mean, and are often of interest to practitioners. However, VB is known to generally underestimate the posterior variance, which can lead to bad uncertainty quantification. In view of the excellent FDR control of our VB method in earlier simulations, we further investigate the performance of these marginal credible sets empirically.

We consider 4 test cases, consisting of the above example (Test 0) and Tests 1-3 from Section 8.2. In each case, we computed 95% marginal credible intervals for the coefficients, i.e. the intervals I_j, j = 1,...,p, of smallest length such that Q*(θ_j ∈ I_j) ≥ 0.95. We ran each experiment 200 times and report the mean and standard deviation of the coverage and length of these credible intervals for both the (true) zero and non-zero coefficients in Table 2.

[Table 2: Marginal VB credible intervals for individual features. Rows: coverage and length for the non-zero and the zero coefficients, as mean ± standard deviation; columns: Tests 0-3; the numerical entries were lost in extraction. Settings: X ∈ R^{250×500}, X_ij ∼ iid N(0,σ²); (0) σ = 1, s₀ = 2, θ₀,j = 2; (1) σ < 1, s₀ = 5, θ₀,j = 4; (2) σ = 2, s₀ = 10, θ₀,j = 6; (3) σ < 1, s₀ = 15, θ₀,j i.i.d. uniform on an interval around zero; in each case for j ≤ s₀.]

We see that for the true zero coefficients θ₀,j = 0, the coverage is close to one with intervals of nearly zero width, meaning the VB posterior sets γ_j = Q*(θ_j ≠ 0) < 0.05. This matches the other evidence (Theorem 3 and the good FDR) that the VB posterior Q* does not include too many spurious variables. For true non-zero coefficients θ₀,j ≠ 0, the coverage is moderate to excellent in the first 3 experiments, which matches the good TPR seen above. However, coverage is low in Test 3, which is an especially difficult test typically containing several small non-zero coefficients (here θ₀,j i.i.d. uniform on an interval around zero) that are hard to detect. The low coverage is not surprising, since it is known that even in the full Bayes case, coefficients below a certain size cannot be consistently covered [49]. The very promising results here suggest our VB approach might be effective for uncertainty quantification for individual features; however, this requires further investigation. We lastly note that while it may be reasonable to consider marginal credible intervals, one should be careful about using more general VB credible sets due to the mean-field approximation.
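Concretely, these marginal credible intervals can be extracted directly from the variational parameters: the marginal law of θ_j under Q* is the mixture γ_j N(µ_j, σ_j²) + (1−γ_j)δ₀, and a shortest interval with 95% mass can be found numerically. A Monte Carlo sketch (ours; placeholder parameter values):

```python
import numpy as np

def marginal_credible_interval(mu_j, sigma_j, gamma_j, level=0.95, draws=100_000, seed=0):
    """Smallest-length interval I_j with Q*(theta_j in I_j) >= level for the marginal
    mixture gamma_j N(mu_j, sigma_j^2) + (1-gamma_j) delta_0, found by a
    shortest-window search over sorted Monte Carlo samples."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mu_j, sigma_j, draws)
    z[rng.random(draws) > gamma_j] = 0.0       # spike component
    z.sort()
    k = int(np.ceil(level * draws))            # points the interval must cover
    widths = z[k - 1:] - z[: draws - k + 1]
    i = int(np.argmin(widths))
    return z[i], z[i + k - 1]

print(marginal_credible_interval(2.0, 0.3, 0.98))  # roughly mu_j +/- 2*sigma_j
print(marginal_credible_interval(1.0, 0.5, 0.03))  # collapses to (0.0, 0.0)
```

In particular, whenever γ_j < 1 − level the shortest interval is {0}, which is exactly the behaviour for true zeros observed in Table 2.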
6 Discussion

This paper investigates a scalable and interpretable mean-field variational approximation of the popular spike and slab prior with Laplace slabs in high-dimensional logistic regression. We derive theoretical guarantees for this approach, proving (1) optimal concentration rates for the VB posterior in ℓ₂ and prediction loss around a sparse true parameter, (2) optimal convergence rates for the VB posterior predictive mean and (3) that the VB posterior does not select overly large models, thereby controlling the number of false discoveries.

We verify in a numerical study that the empirical performance of the proposed method reflects these theoretical guarantees. In particular, using Laplace slabs in the prior underlying the variational approximation can substantially outperform the same VB method with Gaussian prior slabs, as is typically used in the literature, though at the expense of slower computation. The proposed approach performs comparably with other state-of-the-art sparse high-dimensional Bayesian variable selection methods for logistic regression, but scales substantially better to high-dimensional models where other approaches based on the EM algorithm or MCMC are not computable. We are currently working on a more efficient implementation as an R package sparsevb [16] that should improve the run time.

Based on the promising FDR control and coverage of our VB method in the simulations, we plan to further investigate the theoretical and empirical performance of our algorithm for multiple hypothesis testing and variable selection, see [12] for promising first results for the (unscalable) original posterior. Furthermore, the results derived here are first steps towards better understanding VB methods in sparse high-dimensional nonlinear models. They open up several interesting future lines of research for applying scalable VB implementations of spike and slab priors in complex high-dimensional models, including Bayesian neural networks [39], (causal) graphical models [26] and high-dimensional Bayesian time series [44].

Acknowledgements: We thank the four reviewers for their useful comments that helped improve the presentation of this work. Botond Szabó received funding from the Netherlands Organization for Scientific Research (NWO) under Project number 639.031.654.
Impact statement : Our theoretical results seek to better understand how sparse VB approximationswork and thus improve their performance and reliability in practice. Since our results have no spe-cific applications in mind, seeking rather to explain and improve an existing method, any potentialbroader impact will derive from improved performance in fields where such methods are alreadyused.
Supplementary material for ‘Spike and slab variational Bayes for highdimensional logistic regression’
In this supplement we present streamlined proofs for the asymptotic results (Section 7), additional simulations (Section 8), discussion concerning the design matrix conditions (Section 9), full statements and proofs of the non-asymptotic results (Section 10) and a derivation of the VB algorithm (Section 11).
7 Proofs for the asymptotic results

Our proofs use the following general result, which allows us to bound the VB probabilities of sets having exponentially small probability under the true posterior.
Lemma 1 (Theorem 5 of [42]). Let Θ_n be a subset of the parameter space, A be an event and Q be a distribution for θ. If there exist C > 0 and δ_n > 0 such that

$$E_{\theta_0}\,\Pi(\theta \in \Theta_n\,|\,Y)\,1_A \le C e^{-\delta_n}, \qquad (14)$$

then

$$E_{\theta_0}\,Q(\theta \in \Theta_n)\,1_A \le \frac{2}{\delta_n}\Big[ E_{\theta_0}\,\mathrm{KL}\big(Q\,\|\,\Pi(\cdot|Y)\big)\,1_A + C e^{-\delta_n/2} \Big].$$

We must thus show that there exist events (A_n) satisfying P_{θ₀}(A_n) → 1 such that on A_n:

1. the posterior puts at most exponentially small probability Ce^{−δ_n} outside Θ_n,
2. the KL divergence between the VB posterior Q* and the true posterior is O(δ_n).

To aid readability, in this section we state all intermediate results in asymptotic form, keeping track of only the leading order terms as n, p → ∞. We provide full non-asymptotic statements in Section 10, which may be skipped on first reading, though most of the technical difficulty is contained in these results. For t, L, M₁, M₂ > 0, define the events

$$A_{n,1}(t) = \{\|\nabla_\theta\ell_{n,\theta_0}(Y)\|_\infty \le t\},$$
$$A_{n,2}(L) = \{\Pi(\theta\in\mathbb{R}^p : |S_\theta| \le Ls_0\,|\,Y) \ge 5/8\},$$
$$A_{n,3}(M_1,M_2) = \{\Pi(\theta\in\mathbb{R}^p : \|\theta-\theta_0\|_2 > M_1\sqrt{s_0\log p}/\|X\|\,|\,Y) \le e^{-M_2 s_0\log p}\}.$$

The first event bounds the score function ∇_θℓ_{n,θ₀}(Y) and is needed to control the first order term in the Taylor expansion of the log-likelihood, see (20). The second says the posterior concentrates on models of size at most a constant multiple of the true model dimension. The last event says the posterior places all but exponentially small probability on an ℓ₂-ball of the optimal radius about the truth and is used for a localization argument. The event required to apply Lemma 1 is defined by

$$A_n(t, L, M_1, M_2) = A_{n,1}(t) \cap A_{n,2}(L) \cap A_{n,3}(M_1, M_2). \qquad (15)$$

The proof has an iterative structure, where we localize the posterior based on the event A_{n,i−1} to prove A_{n,i} has high probability. The idea to iteratively localize the posterior during the proofs is a useful technique from Bayesian nonparametrics (e.g. [35]).

Lemma 2. Suppose the prior satisfies (3) and (4), and the design matrix satisfies condition (8) for $L = 2\max\{A_2/A_4,\ (1.5\,\alpha^2/\bar\kappa^2 + 2A_2 + \log(4+\bar\kappa(s_0)))/A_4\}$. Then for p large enough,

$$P_{\theta_0}\big( A_n(t, L, M_1, M_2)^c \big) = O(1/p),$$

with t = ‖X‖√(log p), $M_1 = D_1\sqrt{L}/\kappa((1+4L/A_4)s_0)$, $D_1 = (24/\sqrt{A_4})\max(25\alpha,\ A_4,\ A_2+432)$ and M₂ = 2L.

The next two lemmas establish exponential posterior bounds on the event A_{n,1}(‖X‖√(log p)) ⊃ A_n(t,L,M₁,M₂). The first states that the posterior concentrates on models of size at most a constant multiple of the true model size s₀ = |S_{θ₀}|. The second states that the posterior concentrates on an ℓ₂-ball of optimal radius about the true parameter θ₀.
Lemma 3. Suppose the prior satisfies (3) and (4). If for $L \ge (1.5\,\alpha^2/\bar\kappa^2 + 2A_2 + \log(4+\bar\kappa(s_0)))/A_4$ the design matrix satisfies condition (8), then for p large enough,

$$E_{\theta_0}\big[\Pi(\theta \in \mathbb{R}^p : |S_\theta| \ge Ls_0\,|\,Y)\,1_{A_{n,1}(\|X\|\sqrt{\log p})}\big] \le \exp\{-(LA_4/2)\,s_0\log p\}.$$

Lemma 4. Suppose the prior satisfies (3) and (4). If for $K > \max\{A_4,\ (1.5\,\alpha^2/\bar\kappa^2 + 2A_2 + \log(4+\bar\kappa(s_0)))/A_4\}$ the design matrix satisfies (8) with L = 2K/A₄, then for p large enough,

$$E_{\theta_0}\left[\Pi\left(\theta \in \mathbb{R}^p : \|\theta-\theta_0\|_2 \ge \frac{D_2\sqrt{K}}{\kappa\big((2K/A_4+1)s_0\big)}\,\frac{\sqrt{s_0\log p}}{\|X\|}\ \bigg|\ Y\right)1_{A_{n,1}(\|X\|\sqrt{\log p})}\right] \le e^{-K s_0 \log p},$$

where $D_2 = 16 A_4^{-1/2}\max\{\alpha,\ A_2+A_4\}$.

Lemmas 2, 3 and 4 follow immediately from their non-asymptotic counterparts, Lemmas 11, 8 and 10, respectively, in Section 10. We finally control the KL divergence between the VB posterior Q* and the posterior Π(·|Y) on the event A_n; see Lemma 7 for the corresponding non-asymptotic result. This is the most difficult technical step in establishing our result.
Lemma 5. Consider the event A_n = A_n(t,L,M₁,M₂) in Lemma 2. Then for sufficiently large p, the VB posterior Q* satisfies

$$\mathrm{KL}\big(Q^*\,\|\,\Pi(\cdot|Y)\big)\,1_{A_n} \le D_3\, s_0 \log p,$$

with $D_3 = L\big( (9D_2^2/\alpha + \alpha/D_2)\,\bar\kappa(Ls_0)^2\,\kappa((1+4L/A_4)s_0)^{-2} + \alpha^2\,\kappa(Ls_0)^{-2} \big)$ and where L is given in Lemma 2 and D₂ in Lemma 4.

We briefly explain the heuristic idea behind the proof of Lemma 5. Since the VB posterior Q* is the minimizer of the KL objective (6), KL(Q*||Π(·|Y)) ≤ KL(Q||Π(·|Y)) for any Q ∈ Q in the variational family. We upper bound this quantity for some Q carefully chosen according to the logistic likelihood. We first identify a model S ⊆ {1,...,p} which is not too far from the true model S_{θ₀} and to which the posterior assigns sufficient, though potentially exponentially small, probability. In the low-dimensional setting p ≪ n, Taylor expanding the log-likelihood ℓ_{n,θ} − ℓ_{n,θ₀} asymptotically gives a Gaussian linear regression likelihood with rescaled design matrix and data. This motivates the distribution Q ∈ Q which fits a normal distribution N_S(µ_S, D_S) on S with mean µ_S equal to the least squares estimator solving the linearized Gaussian approximation and covariance D_S (a diagonalized version of) the covariance matrix of this estimator, again in the linearized model. While the Taylor expansion is not actually valid in the sparse high-dimensional setting p ≫ n considered here, we can nonetheless show that the approximation is still sufficiently good to apply Lemma 1.
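In symbols, the heuristic linearization reads as follows (our paraphrase of a standard second-order expansion around θ₀, using $\nabla_\theta\ell_{n,\theta_0}(Y) = X^T(Y - E_{\theta_0}Y)$ and the fact that the Hessian of the log-likelihood at θ₀ is $-X^TWX$ with W from (7)):

$$\ell_{n,\theta}(Y) - \ell_{n,\theta_0}(Y) \;\approx\; (\theta-\theta_0)^T X^T(Y - E_{\theta_0}Y) \;-\; \tfrac12\,\|W^{1/2}X(\theta-\theta_0)\|_2^2,$$

which, as a function of θ, has the shape of a Gaussian linear-model log-likelihood with design matrix W^{1/2}X. Completing the square then motivates centring the slab at a least-squares type estimator built from Y − E_{θ₀}Y, which is the choice µ_S in (22) below; this heuristic is for orientation only and is not used verbatim in the proofs.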
Proof of Theorem 1. We apply Lemma 1 with

$$\Theta_n = \left\{\theta \in \mathbb{R}^p : \|\theta-\theta_0\|_2 \ge \frac{M_n^{1/2}\sqrt{s_0\log p}}{\kappa(M_n s_0)\,\|X\|}\right\}$$

and A = A_n = A_n(t,L,M₁,M₂) the event in Lemma 2. Since A_n ⊂ A_{n,1}(‖X‖√(log p)), Lemma 4 implies that $E_{\theta_0}\Pi(\theta\in\Theta_n|Y)1_{A_n} \le e^{-D_2^{-2}M_n s_0\log p}$, i.e. (14) holds with $\delta_n = D_2^{-2}M_n s_0\log p$. Using Lemma 1 followed by Lemma 5, since δ_n → ∞,

$$E_{\theta_0}\,Q^*(\theta\in\Theta_n)\,1_{A_n} \le \frac{2}{\delta_n}\Big[E_{\theta_0}\,\mathrm{KL}(Q^*\|\Pi(\cdot|Y))1_{A_n} + 8e^{-\delta_n/2}\Big] \le \frac{2D_2^2 D_3}{M_n}\,(1+o(1)) = O(C_\kappa/M_n).$$

Since P_{θ₀}(A_n) → 1 by Lemma 2, the first result follows. Since ‖Ψ′‖_∞ ≤ 1/4,

$$n\|p_\theta - p_0\|_n^2 \le \frac{1}{16}\sum_{i=1}^n |x_i^T(\theta-\theta_0)|^2 = \frac{1}{16}\|X(\theta-\theta_0)\|_2^2 \le \frac{1}{16}\,\bar\kappa\big((M_n+1)s_0\big)^2\,\|X\|^2\,\|\theta-\theta_0\|_2^2, \qquad (16)$$

where the last inequality follows from Theorem 3 and the definition of κ̄(·). The second statement then follows by combining the first statement of the theorem with the above display.
Proof of Theorem 2. By the duality formula for the KL divergence ([8], Corollary 4.15),

$$\int f(\theta)\,dQ^*(\theta) \le \mathrm{KL}\big(Q^*\|\Pi(\cdot|Y)\big) + \log\int e^{f(\theta)}\,d\Pi(\theta|Y)$$

for any measurable f such that ∫ e^{f(θ)} dΠ(θ|Y) < ∞. Let A_n = A_n(t,L,M₁,M₂) be the event in Lemma 2. Applying Jensen's inequality and taking f(θ) = cn‖p_θ−p₀‖_n² for c > 0,

$$cn\|\hat p^* - p_0\|_n^2\,1_{A_n} = cn\|E_{Q^*}p_\theta - p_0\|_n^2\,1_{A_n} \le E_{Q^*}\big[cn\|p_\theta-p_0\|_n^2\big]1_{A_n} \le \mathrm{KL}(Q^*\|\Pi(\cdot|Y))1_{A_n} + 1_{A_n}\log\int e^{cn\|p_\theta-p_0\|_n^2}\,d\Pi(\theta|Y). \qquad (17)$$

By Lemma 5, the first term is bounded by D₃s₀ log p. For notational convenience, write $\bar\kappa_* = \bar\kappa(\frac{n}{A_4\log p}+s_0)$ and $\kappa_* = \kappa(\frac{n}{A_4\log p}+(1-\frac1{A_4})s_0)$. For D₂ defined in Lemma 4 and K₀ the minimal constant satisfying the conditions of Lemma 4, set

$$\mathcal{B}_0 = \Big\{\theta\in\mathbb{R}^p : |S_\theta| \le \tfrac{n}{A_4\log p} - s_0,\ n\|p_\theta-p_0\|_n^2 \le \tfrac{K_0 D_2^2\,\bar\kappa_*^2\, s_0\log p}{\kappa_*^2}\Big\},$$
$$\mathcal{B}_j = \Big\{\theta\in\mathbb{R}^p : |S_\theta| \le \tfrac{n}{A_4\log p} - s_0,\ \tfrac{j D_2^2\,\bar\kappa_*^2\, s_0\log p}{\kappa_*^2} < n\|p_\theta-p_0\|_n^2 \le \tfrac{(j+1) D_2^2\,\bar\kappa_*^2\, s_0\log p}{\kappa_*^2}\Big\},$$
$$\bar{\mathcal{B}} = \Big\{\theta\in\mathbb{R}^p : |S_\theta| > \tfrac{n}{A_4\log p} - s_0\Big\}.$$

Since E[U 1_Ω] = E[U|Ω]P(Ω), the conditional Jensen inequality gives E[(log V)1_Ω] ≤ P(Ω) log E[V|Ω] = P(Ω) log E[V 1_Ω] − P(Ω) log P(Ω) for any positive random variable V and event Ω. The E_{θ₀}-expectation of the second term in (17) thus equals

$$E_{\theta_0}\bigg[\log\bigg( \int_{\mathcal{B}_0} e^{cn\|p_\theta-p_0\|_n^2}\,d\Pi(\theta|Y) + \sum_{j=K_0}^{n/(s_0\log p)-1}\int_{\mathcal{B}_j} e^{cn\|p_\theta-p_0\|_n^2}\,d\Pi(\theta|Y) + \int_{\bar{\mathcal{B}}} e^{cn\|p_\theta-p_0\|_n^2}\,d\Pi(\theta|Y) \bigg)\,1_{A_n}\bigg]$$
$$\le P_{\theta_0}(A_n)\,\log E_{\theta_0}\bigg[ e^{cK_0 D_2^2\bar\kappa_*^2 s_0\log p/\kappa_*^2} + \sum_{j=K_0}^{n/(s_0\log p)-1} e^{c(j+1) D_2^2\bar\kappa_*^2 s_0\log p/\kappa_*^2}\,\Pi(\mathcal{B}_j|Y)1_{A_n} + e^{cn}\,\Pi(\bar{\mathcal{B}}|Y)1_{A_n} \bigg] - P_{\theta_0}(A_n)\log P_{\theta_0}(A_n).$$

Using (16) and Lemma 4 gives $E_{\theta_0}[\Pi(\mathcal{B}_j|Y)1_{A_n}] \le e^{-js_0\log p}$, while Lemma 3 gives $E_{\theta_0}[\Pi(\bar{\mathcal{B}}|Y)1_{A_n}] \le e^{-n}$. Taking $c = c_n = \kappa_*^2/(2D_2^2\bar\kappa_*^2) \le 1/128$ and using that P_{θ₀}(A_n^c) = O(1/p) by Lemma 2, the last display is bounded by

$$\log\bigg[ e^{(K_0/2)s_0\log p} + \sum_{j=K_0}^{n/(s_0\log p)-1} e^{(\frac{j+1}{2}-j)s_0\log p} + 2e^{-(1-c)n} \bigg] + O(1/p) \le K_0\, s_0\log p\,(1+o(1)).$$

Plugging the preceding bounds into the right-hand side of (17),

$$nE_{\theta_0}\big[\|\hat p^*-p_0\|_n^2\,1_{A_n}\big] \le c_n^{-1}(K_0 + D_3)\,s_0\log p\,(1+o(1)) = O\big(c_n^{-1}\,C_\kappa\, s_0\log p\big),$$

for C_κ the constant in Theorem 1. Applying Markov's inequality and Lemma 2,

$$P_{\theta_0}\big(\|\hat p^*-p_0\|_n \ge r\big) \le r^{-2}E_{\theta_0}\big[\|\hat p^*-p_0\|_n^2\,1_{A_n}\big] + P_{\theta_0}(A_n^c) = O\big(r^{-2}n^{-1}c_n^{-1}C_\kappa s_0\log p\big) + o(1).$$

Taking $r = M_n^{1/2}\,\bar\kappa_*\,\kappa_*^{-1}\sqrt{s_0\log p/n}$ gives the result.
Proof of Theorem 3. This follows the same argument as the proof of Theorem 1, taking Θ_n = {θ ∈ R^p : |S_θ| ≥ M_n s₀} and δ_n = (A₄/2)M_n s₀ log p, and applying Lemma 3 instead of Lemma 4.

8 Additional numerical results
8.1 Description of the compared methods

In the varbvs package [10], the standard VB algorithm is implemented for the prior (2) with Gaussian instead of Laplace slabs and an additional layer of importance sampling for computing the low-dimensional prior hyperparameters. As for our algorithm, we set a₀ = b₀ = 1. In the BinaryEMVS package [30], an expectation-maximization (EM) algorithm is implemented for fitting Bayesian spike and slab regularization paths for logistic regression. More concretely, the considered spike and slab prior takes the form

$$\omega \sim \mathrm{Beta}(a_0, b_0), \qquad z_j \overset{iid}{\sim} \mathrm{Bern}(\omega), \qquad \theta_j \overset{ind}{\sim} (1-z_j)\,N(0,\sigma_0^2) + z_j\, N(0,\sigma_1^2),$$

with σ₀² ≪ σ₁². In the simulation study, we use the parametrization a₀ = 1, b₀ = 1, a small spike variance σ₀² and σ₁² = 5. In the Bayesian hierarchical generalized linear model (BhGLM package) of [58], an EM algorithm is implemented for a mixture of Laplace priors:

$$\omega \sim \mathrm{Unif}[0,1], \qquad z_j \overset{iid}{\sim} \mathrm{Bern}(\omega), \qquad \theta_j \overset{ind}{\sim} (1-z_j)\,\mathrm{Lap}(\lambda_0) + z_j\,\mathrm{Lap}(\lambda_1),$$

with λ₀ ≫ λ₁. We work with the default parametrization of the bmlasso function of the BhGLM package, i.e. the default values of the scales s₀ = 1/λ₀ and s₁ = 1/λ₁. We use the implementation of SkinnyGibbs in the supplementary material to [32], using the function skinnybasad with the settings provided in the example from the manual, apart from taking pr=0.5 to reflect the present prior setting a₀ = b₀ = 1. Finally, we consider the rstanarm package, which makes Bayesian regression modeling via the probabilistic programming language Stan accessible in R. In the package, Hamiltonian Monte Carlo [21] was implemented for the horseshoe global-local shrinkage prior [11]. We run the function stan_glm, again with the default parameterization, i.e. we set the global scale to 0.01 with 1 degree of freedom, and the slab scale to 2.5 with 4 degrees of freedom. The default inferential algorithm runs 4 Markov chains with 2000 iterations each.

8.2 Further test cases

We provide five further test cases in addition to the experiment considered in Section 5. In all cases we consider Gaussian design matrices, but vary all other parameters. In Tests 1-3, we take n = 250 and p = 500 as in Section 5, while in Tests 4 and 5 we set n = 2500 and p = 5000. The entries of the design matrices have independent centered Gaussian distributions with standard deviations σ that vary across the tests (σ = 2 in Test 2, σ = 1 in Test 5 and σ < 1 in Tests 1, 3 and 4, see the notes to Table 3). The true underlying signal has sparsity levels s₀ = 5, 10, 15, 25, 10, respectively, with the non-zero signal coefficients located at the beginning of the signal with values equal to (1) θ₀,j = 4, (2) θ₀,j = 6, (3) θ₀,j i.i.d. uniform on an interval around zero, (4) θ₀,j = 2 and (5) θ₀,j i.i.d. uniform on an interval around zero, for j = 1,...,s₀.

We ran each experiment 200 times and report the means and standard deviations of the performance metrics in Table 3. The ℓ₂-error $\ell_2(\hat\theta,\theta_0) = \|\hat\theta-\theta_0\|_2$ and the mean squared prediction error $\mathrm{MSPE}(\hat p) = (n^{-1}\sum_{i=1}^n|\Psi(x_i^T\hat\theta)-\Psi(x_i^T\theta_0)|^2)^{1/2}$ are reported with respect to the posterior mean. For the methods performing model selection, we use the standard threshold 0.5 for the marginal posterior inclusion probability α_j, i.e. the posterior includes the j-th coefficient in the model if α_j > 0.5. The true positive rate (TPR) and false discovery rate (FDR) are then defined as

$$\mathrm{TPR}(\alpha) = s_0^{-1}\sum_{j:\,\theta_{0,j}\ne 0} 1\{\alpha_j > 0.5\} \qquad \text{and} \qquad \mathrm{FDR}(\alpha) = \frac{\#\{j : \alpha_j > 0.5,\ \theta_{0,j}=0\}}{\#\{j : \alpha_j > 0.5\}},$$

respectively. The elapsed times are given for an Intel i7-8550U laptop processor.
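For concreteness, these performance measures are straightforward to compute from the variational output: with θ̂ the posterior mean (under the mean-field posterior (5), θ̂_j = γ_jµ_j) and inclusion probabilities α = γ, a sketch (ours) is:

```python
import numpy as np

def performance_metrics(alpha, theta_hat, theta0, X, thr=0.5):
    """TPR and FDR from thresholded inclusion probabilities, plus the l2-error
    and MSPE of the posterior mean, as reported in Tables 1 and 3."""
    sel = alpha > thr
    truth = theta0 != 0
    tpr = (sel & truth).sum() / truth.sum()
    fdr = (sel & ~truth).sum() / max(sel.sum(), 1)   # empty selection -> FDR 0
    l2 = np.linalg.norm(theta_hat - theta0)
    psi = lambda t: 1.0 / (1.0 + np.exp(-t))
    mspe = np.sqrt(np.mean((psi(X @ theta_hat) - psi(X @ theta0))**2))
    return tpr, fdr, l2, mspe
```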
As in Section 5, we conclude that our VB approach typically outperforms the other variational algorithms using prior Gaussian slabs, though again at the expense of greater computational times. Compared to our approach, the other four methods based on the EM algorithm or MCMC performed better in some scenarios and worse in others, but with substantially greater computational times. Unsurprisingly, the rstanarm package using MCMC is the slowest; in many cases it did not even converge after 8000 iterations. In the high-dimensional case p = 5000 and n = 2500, 4 algorithms (SkinnyGibbs, BhGLM, BinEMVS, rstanarm) did not finish the computations in a reasonable amount of time: the fastest required at least 100 hours to execute 200 runs and for rstanarm, even a single run required multiple hours.

[Table 3: Comparing sparse Bayesian methods in high-dimensional logistic regression. For each of Tests 1-5 and each method (VB (Lap), VB (Gauss), varbvs, SkinnyGibbs, BinEMVS, BhGLM, rstanarm), the TPR, FDR, ℓ₂-error, MSPE and run time are reported as mean ± standard deviation over 200 runs; the numerical entries were lost in extraction. The design matrices X ∈ R^{n×p} have X_ij ∼ iid N(0,σ²) and the signal vector θ₀ has s₀ non-zero coefficients, all located at the beginning of the signal: (1) X ∈ R^{250×500}, σ < 1, s₀ = 5, θ₀,j = 4; (2) X ∈ R^{250×500}, σ = 2, s₀ = 10, θ₀,j = 6; (3) X ∈ R^{250×500}, σ < 1, s₀ = 15, θ₀,j i.i.d. uniform; (4) X ∈ R^{2500×5000}, σ < 1, s₀ = 25, θ₀,j = 2; (5) X ∈ R^{2500×5000}, σ = 1, s₀ = 10, θ₀,j i.i.d. uniform; in each case for j ≤ s₀.]

8.3 Varying the scale hyperparameter λ

Theorems 1-3 state that for the hyperparameter choice λ ≍ ‖X‖√(log p), our VB algorithm has good asymptotic properties. In practice, however, the finite-sample performance indeed depends on λ. We ran our algorithm 200 times on the experiment considered in Section 5 for different choices of λ and report the results in Table 4. In this example, the performance was sensitive to the choice of λ, with large values of λ, which cause more shrinkage, performing worse.

In linear regression, where more extensive simulations have been carried out, the choice of λ was similarly found to have an effect, though there was no clear evidence to support a particular fixed choice of λ, with larger values sometimes performing better and sometimes worse [42]. This suggests using a data-driven choice of λ may be helpful in practice, for example using cross validation.

[Table 4: Varying the scale hyperparameter λ. Columns correspond to increasing values of λ (λ = 2, 5 and 20 in the last three columns; the first two values were lost in extraction); rows report TPR, FDR, ℓ₂-error, MSPE and run time as mean ± standard deviation over 200 runs; the numerical entries were lost in extraction. Setting: X ∈ R^{250×500}, X_ij ∼ iid N(0,1), s₀ = 2, θ₀,j = 2 for j ≤ s₀.]

9 Design matrix conditions

For s ∈ {1,...,p}, recall the definitions

$$\bar\kappa = \inf\left\{\frac{\|W^{1/2}X\theta\|_2}{\|X\|\,\|\theta\|_2} : \|\theta_{S_{\theta_0}^c}\|_1 \le 7\,\|\theta_{S_{\theta_0}}\|_1,\ \theta\ne0\right\},$$
$$\bar\kappa(s) = \sup\left\{\frac{\|X\theta\|_2}{\|X\|\,\|\theta\|_2} : 0 \ne |S_\theta| \le s\right\}, \qquad \kappa(s) = \inf\left\{\frac{\|W^{1/2}X\theta\|_2}{\|X\|\,\|\theta\|_2} : 0\ne|S_\theta|\le s\right\}.$$

These definitions are taken from the sparsity literature [9] and are used in sparse logistic regression [2, 34, 56]. We reproduce some of the discussion here for convenience, but refer the reader to Chapter 6 of [9] for further reading. The true model S_{θ₀} is compatible if κ̄ > 0, which implies ‖W^{1/2}Xθ‖₂ ≥ κ̄‖X‖‖θ‖₂ for all θ in the relevant set. The number 7 can be altered and is taken to match the conditions used in [2, 13], since we use some of their results.
Compatibility κ̄ > 0 involves approximate rather than exact sparsity, since the parameters θ need only have small rather than zero coordinates outside S_{θ₀}. In contrast, κ(s) involves exactly s-sparse vectors. Note that if ‖X‖ = 1, then κ(s) equals the smallest scaled singular value over submatrices of W^{1/2}X with s columns. Similarly, κ̄(s) upper bounds the (scaled) operator norm of X when restricted to exactly s-sparse vectors.

Even though W depends on the unknown θ₀, it does not necessarily play a significant role in the above definitions. If ‖Xθ₀‖_∞ is bounded, then the true regression function P_{θ₀}(Y=1|X=x_i) = Ψ(x_i^Tθ₀) is bounded away from zero and one at the design points and W is equivalent to the identity matrix I_n. One can then set W = I_n in the above definitions by simply rescaling the constants. Note that estimation in classification problems is known to behave qualitatively differently near the boundary points 0 and 1, see, e.g., [41].

When W = I_n, we recover the exact compatibility constants used in sparse linear regression [13, 42]. This is natural since, when linearizing the logistic regression model, the likelihood asymptotically looks like that of a linear regression model with design matrix W^{1/2}X, see Section 7. One therefore expects similar conditions with X replaced by W^{1/2}X. For further discussion, see Chapter 6 of [9] or Section 2.2 of [13]; in particular, Lemma 1 of [13] provides a concise relation between various notions of compatibility.

Another common condition in the sparsity literature is the mutual coherence of X, which equals the largest correlation between its columns:

$$\mathrm{mc}(X) = \max_{1\le i\ne j\le p}\frac{|\langle X_{\cdot i}, X_{\cdot j}\rangle|}{\|X_{\cdot i}\|_2\,\|X_{\cdot j}\|_2} = \max_{1\le i\ne j\le p}\frac{|(X^TX)_{ij}|}{\|X_{\cdot i}\|_2\,\|X_{\cdot j}\|_2}.$$

Conditions of this nature have been used by many authors (see Section 2.2 of [13] for references) and measure how far from orthogonal the matrix X is. One can relate the present compatibility constants to the mutual coherence.
Lemma 6. Suppose ‖Xθ₀‖_∞ ≤ R is bounded and $\min_{1\le i\ne j\le p}\|X_{\cdot i}\|_2\|X_{\cdot j}\|_2 \ge \eta\,\|X\|^2$. Then for a constant C = C(R) > 0,

$$\bar\kappa(s)^2 \le 1 + s\,\mathrm{mc}(X), \qquad \bar\kappa^2 \ge C(R)\big(\eta - s_0\,\mathrm{mc}(X)\big), \qquad \kappa(s)^2 \ge C(R)\big(\eta - s\,\mathrm{mc}(X)\big).$$

Proof. For θ an s-sparse vector, using Cauchy-Schwarz,

$$\|X\theta\|_2^2 = \sum_{j=1}^p (X^TX)_{jj}\theta_j^2 + \sum_{k\ne j}\theta_j(X^TX)_{jk}\theta_k \le \|X\|^2\|\theta\|_2^2 + \mathrm{mc}(X)\,\|X\|^2\|\theta\|_1^2 \le \big(1+s\,\mathrm{mc}(X)\big)\|X\|^2\|\theta\|_2^2,$$

so that κ̄(s)² ≤ 1 + s·mc(X). For R > 0, one has Ψ(−R) ≥ e^{−R}/2 and Ψ(R) ≤ 1 − e^{−R}/2, so that Ψ(x_i^Tθ₀) ∈ [e^{−R}/2, 1−e^{−R}/2]. Using the definition (7), all diagonal entries of W satisfy W_ii ∈ [δ_R, 1/4] for δ_R = e^{−R}(2−e^{−R})/4 > 0, so that ‖W^{1/2}Xθ‖₂² ≥ δ_R‖Xθ‖₂². It thus suffices to prove the result with W = I_n at the expense of the factor δ_R. With W = I_n, arguing as in Lemma 1 of [13] gives κ̄² ≥ η − s₀·mc(X) and κ(s)² ≥ η − s·mc(X).

If ‖Xθ₀‖_∞ is bounded and

$$s_0 = o\big(1/\mathrm{mc}(X)\big), \qquad \min_{1\le i\ne j\le p}\|X_{\cdot i}\|_2\|X_{\cdot j}\|_2 \ge \eta\,\|X\|^2, \quad \eta > 0, \qquad (18)$$

namely the truth is sufficiently sparse and the column norms of X are comparable, then κ̄(Ls₀)² ≤ 1 + o(1), κ̄² ≳ η − o(1) and κ(Ls₀)² ≳ η − o(1) for any L > 0, as required for the results in this paper. Condition (18) has been considered in [42], and thus the various examples in [42] are also covered by our results, including:

• (Orthogonal design). X an orthogonal design with ⟨X_{·i}, X_{·j}⟩ = 0 for i ≠ j and suitably normalized column lengths ‖X_{·i}‖₂ = √n.

• (IID entries). Suppose the original matrix entries are i.i.d. random variables W_ij and set X_ij = √n W_ij/‖W_{·j}‖₂, so that the columns are normalized to have length √n. If |W_ij| ≤ C almost surely and log p = o(n), then (18) holds for sparsity levels s₀ = o(√(n/log p)). Similarly, if E e^{t|W_ij|^γ} < ∞ for some γ, t > 0 and log p = o(n^{γ/(4+γ)}), then (18) again holds for s₀ = o(√(n/log p)). This covers the standard Gaussian random design W_ij ∼ iid N(0,1) if log p = o(n^{1/3}). See [42] for details.

• (Correlated designs). Rescale the columns as in the previous example so that ‖X_{·i}‖₂ = √n for all i. Then the p×p matrix C = X^TX/n takes value one on its diagonal and C_ij, i ≠ j, equals the correlation between columns i and j. If either C_ij = r for a constant 0 < r < (1+cm)^{-1} and all i ≠ j, or |C_ij| ≤ c/m for every i ≠ j, then mc(X) = max_{i≠j}C_ij = O(1/m) and so (18) holds for sparsity level s₀ = o(m). Such matrices are studied in Zhao and Yu [62], who show that models up to dimension m satisfy the 'strong irrepresentability condition'.

For further details of why these satisfy (18), and hence our conditions, see Section 2.2 of [42].
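Checking condition (18) on a concrete design is cheap, since mc(X) is just the largest absolute column correlation. A small sketch (ours):

```python
import numpy as np

def mutual_coherence(X):
    """mc(X): the largest absolute correlation between distinct columns of X."""
    G = X.T @ X
    norms = np.sqrt(np.diag(G))
    C = np.abs(G) / np.outer(norms, norms)
    np.fill_diagonal(C, 0.0)
    return C.max()

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 500))   # i.i.d. Gaussian design as in Section 5
# For such designs mc(X) is of order sqrt(log p / n) (roughly 0.3 here),
# so (18) permits sparsity levels s0 = o(sqrt(n / log p)).
print(mutual_coherence(X))
```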
10 Non-asymptotic results and proofs
This section contains the non-asymptotic formulations of all the technical results used in this paper, together with their proofs. These results imply the more digestible asymptotic formulations presented in Section 3. To begin, recall that the log-likelihood in model (1) based on data Y = (Y₁,...,Y_n) is

$$\ell_{n,\theta}(Y) = \sum_{i=1}^n \big[Y_i x_i^T\theta - g(x_i^T\theta)\big] = \sum_{i=1}^n \big[Y_i (X\theta)_i - g((X\theta)_i)\big], \qquad (19)$$

where g(t) = log(1+e^t), t ∈ R. We use the following notation for the first-order remainder of the Taylor expansion of the log-likelihood:

$$L_{n,\theta}(y) := \ell_{n,\theta}(y) - \ell_{n,\theta_0}(y) - \nabla_\theta\ell_{n,\theta_0}(y)^T(\theta-\theta_0). \qquad (20)$$

10.1 Bounding the KL divergence between Q* and Π(·|Y)

To apply Lemma 1, we must bound KL(Q*||Π(·|Y)) on the event A_n given in (15). We do this by bounding the KL divergence between the posterior and a carefully selected element of the variational family. We choose a spike and slab distribution whose slab is centered at the least squares estimator of the linearized logistic likelihood approximation and whose covariance equals the (diagonalized) covariance of this estimator. This builds on ideas in [42], extending them to the nonlinear logistic regression model. For a given model S ⊆ {1,...,p}, we write X_S for the n×|S| submatrix of X keeping only the columns X_{·i}, i ∈ S, and θ_S ∈ R^{|S|} for the vector (θ_i : i ∈ S).
Consider the event A n = A n ( t, L, M , M ) in (15) with M > L . If (4 e ) / ( M − L ) ≤ p s , then the variational Bayes posterior Q ∗ satisfiesKL ( Q ∗ || Π( ·| Y ))1 A n ≤ ζ n , where ζ n = s log p L + 94 M κ ( Ls ) + λL / k X k√ log p M + tL / κ ( Ls ) k X k√ log p + L / p κ (1) log p !! + Ls log 14 κ ( Ls ) + log(2 e ) . Proof.
Since the VB posterior Q ∗ minimizes the KL objective (6), we have KL ( Q ∗ || Π( ·| Y )) ≤ KL ( Q || Π( ·| Y )) for all Q ∈ Q . It thus suffices to bound this last KL divergence for a suitablychosen element Q ∈ Q , which may (and will) depend on the true unknown parameter θ , on theevent A n = A n ( t, L, M , M ) .Recall that the posterior distribution is a mixture over all possible submodels and can thus be written Π( ·| Y ) = X S ⊆{ ,...,p } ˆ w S Π S ( ·| Y ) ⊗ δ S c , (21)where the posterior model weights satisfy ≤ ˆ w S ≤ and P S ˆ w S = 1 and Π S ( ·| Y ) denotes theposterior for θ S ∈ R | S | in the restricted model with Y i ∼ Bin (1 , Ψ(( X S θ S ) i )) , i.e. the logisticregression model (1) with θ S c = 0 . Choosing Q ′ ∈ Q . Recall that E θ Y i = Ψ( x Ti θ ) and for a model S ⊆ { , . . . , p } set µ S = θ ,S + ( X TS X S ) − X TS ( Y − E θ Y ) , Σ S = ( X TS X S ) − , (22)and define the | S | × | S | diagonal matrix D S by ( D S ) jj = 1(Σ − S ) jj = 1( X TS X S ) jj , (23) j = 1 , . . . , | S | . We choose as element of our variational family the distribution Q ′ ( θ ) = N S ′ ( µ S ′ , D S ′ )( θ S ′ ) × δ S ′ c ( θ S ′ c ) = Y j ∈ S ′ N (cid:16) µ S ′ ,j , X TS ′ X S ′ ) jj (cid:17) ( θ i ) Y j ∈ S ′ c δ ( θ i ) , for a model S ′ ⊆ { , . . . , p } satisfying the following three properties: | S ′ | ≤ Ls , k θ ,S ′ c k ≤ M p s log p/ k X k , ˆ w S ′ ≥ (2 e ) − p − Ls , (24)where L, M are the constants in A n ( t, L, M , M ) . Note that Q ′ is indeed an element of the mean-field variational family Q in (5) with γ j = 1 if i ∈ S ′ and γ j = 0 otherwise. Existence of S ′ satisfying (24) . We first show that on the event A n , there exists a subset S ′ satisfy-ing (24), so that our choice of Q ′ is indeed valid. On A n , Π( θ : k θ ,S cθ k > M p s log p/ k X k| Y ) ≤ Π( θ : k θ − θ k > M p s log p/ k X k| Y ) ≤ e − M s log p → , so that the posterior model weights satisfy X S : | S |≤ Ls k θ ,Sc k ≤ M √ s log p/ k X k ˆ w S ≥ / − e − M s log p ≥ / A n , since e − M s log p ≤ (4 e ) − M / ( M − L ) ≤ (4 e ) − ≤ / by assumption. Using (cid:0) ps (cid:1) ≤ p s /s ! ,the number of elements in the last sum is bounded by X S : | S |≤ Ls ≤ Ls X s =0 (cid:18) ps (cid:19) ≤ Ls X s =0 p s s ! ≤ ep Ls , which implies that on A n there exists a set S ′ ⊆ { , . . . , p } of size | S ′ | ≤ Ls with k θ ,S ′ c k ≤ M √ s log p/ k X k and with posterior probability ˆ w S ′ ≥ (2 e ) − p − Ls , i.e. satisfying (24). Reduction to the non-diagonal covariance case.
Since Q ′ is only absolutely continuous withrespect to the ˆ w S ′ Π S ′ ( ·| Y ) ⊗ δ S ′ c term of the posterior (21),KL ( Q ′ || Π( ·| Y )) = E θ ∼ N S ′ ( µ S ′ ,D S ′ ) ⊗ δ S ′ c log dN S ′ ( µ S ′ , D S ′ ) ⊗ δ S ′ c ˆ w S ′ d Π S ′ ( ·| Y ) ⊗ δ S ′ c ( θ )= log(1 / ˆ w S ′ ) + KL ( N S ′ ( µ S ′ , D S ′ ) || Π S ′ ( ·| Y )) , where the last KL divergence is over distributions in R | S ′ | . On A n , log(1 / ˆ w S ′ ) ≤ log(2 ep Ls ) =log(2 e ) + Ls log p by (24). Writing E µ S ′ ,D S ′ for the expectation under the law θ S ′ ∼ N S ′ ( µ S ′ , D S ′ ) ,KL ( N S ′ ( µ S ′ , D S ′ ) || Π S ′ ( ·| Y )) = E µ S ′ ,D S ′ (cid:20) log dN S ′ ( µ S ′ , D S ′ ) dN S ′ ( µ S ′ , Σ S ′ ) ( θ S ) + log dN S ′ ( µ S ′ , Σ S ′ ) d Π S ′ ( ·| Y ) ( θ S ) (cid:21) , (25)where again Σ S ′ = ( X TS ′ X S ′ ) − . Using the formula for the KL divergence between two multivariateGaussian distributions, the first term in the last display equalsKL ( N S ′ ( µ S ′ , D S ′ ) || N S ′ ( µ S ′ , Σ S ′ )) = 12 (cid:18) log det Σ S ′ det D S ′ − | S ′ | + Tr (Σ − S ′ D S ′ ) (cid:19) . Using the definitions (22)-(23) gives Tr (Σ − S ′ D S ′ ) = | S ′ | . Turning to the determinants, det D − S ′ = | S ′ | Y j =1 ( X TS ′ X S ′ ) jj = Y j ∈ S ′ k X · j k ≤ k X k | S ′ | . Let Λ max ( A ) and Λ min ( A ) denote the largest and smallest eigenvalues, respectively, of a squarematrix A . Recall the diagonal matrix W from (7), whose entries satisfy W ii ∈ (0 , / . Using thevariational characterization of the minimal eigenvalue of a symmetric matrix ([22], p234), Λ min ( X TS ′ X S ′ ) = min u ∈ R | S ′| : u =0 u T X TS ′ X S ′ u k u k ≥ v ∈ R p : v =0 ,v S ′ c =0 v T X T W Xv k v k ≥ κ ( | S ′ | ) k X k . (26)Since Σ S ′ = ( X TS ′ X S ′ ) − is positive definite, the last display implies det Σ S ′ ≤ Λ max (( X TS ′ X S ′ ) − ) | S ′ | = (1 / Λ min ( X TS ′ X S ′ )) | S ′ | ≤ κ ( | S ′ | ) k X k ) | S ′ | . Combining these bounds,KL ( N S ′ ( µ S ′ , D S ′ ) || N S ′ ( µ S ′ , Σ S ′ )) = log(det Σ S ′ det D − S ′ ) ≤ | S ′ | log(1 / (4 κ ( | S ′ | )) ≤ Ls log(1 / (4 κ ( Ls )) . It thus remains to bound the second term in (25), which has non-diagonal covariance matrix Σ S ′ . Bounding the non-diagonal covariance case.
One can check that − ( θ S ′ − µ S ′ ) T Σ − S ′ ( θ S ′ − µ S ′ ) equals − ( θ S ′ − θ ,S ′ ) T X TS ′ X S ′ ( θ S ′ − θ ,S ′ ) + ( Y − E θ Y ) T X S ′ ( θ S ′ − θ ,S ′ ) + C S ′ ( X, Y ) , where C S ′ ( X, Y ) does not depend on θ . Let ¯ θ S ′ denote the extension of a vector θ S ′ ∈ R | S ′ | to R p with ¯ θ S ′ ,j = θ S,j for j ∈ S ′ and ¯ θ S ′ ,j = 0 for j S ′ . Since ( Y − E θ Y ) T X S ′ ( θ S ′ − θ ,S ′ ) = θ ℓ n,θ ( Y ) T (¯ θ S ′ − ¯ θ ,S ′ ) , the density function of the N S ′ ( µ S ′ , Σ S ′ ) distribution is thus proportionalto e − k X S ′ ( θ S ′ − θ ,S ′ ) k + ∇ θ ℓ n,θ ( Y ) T (¯ θ S ′ − ¯ θ ,S ′ ) , θ S ′ ∈ R | S ′ | . Using Bayes formula and the Taylorexpansion (20), Π S ′ ( ·| Y ) has density proportional to exp (cid:16) ℓ n, ¯ θ S ′ ( Y ) − ℓ n,θ ( Y ) − λ k θ S ′ k (cid:17) ∝ exp (cid:16) ∇ θ ℓ n,θ ( Y ) T (¯ θ S ′ − ¯ θ ,S ′ ) + L n, ¯ θ S ′ ( Y ) − λ k θ S ′ k (cid:17) . Using these representations of the two densities, the second term in (25) can be rewritten as E µ S ′ ,D S ′ " log D Π e − k X S ′ ( θ S ′ − θ ,S ′ ) k + ∇ θ ℓ n,θ ( Y ) T (¯ θ S ′ − ¯ θ ,S ′ ) − λ k θ ,S ′ k D N e ∇ θ ℓ n,θ ( Y ) T (¯ θ S ′ − ¯ θ ,S ′ )+ L n, ¯ θS ′ ( Y ) − λ k θ S ′ k = E µ S ′ ,D S ′ (cid:2) − k X S ′ ( θ S ′ − θ ,S ′ ) k − L n, ¯ θ S ( Y ) (cid:3) + λE µ S ′ ,D S ′ ( k θ S ′ k − k θ ,S ′ k ) + log( D Π /D N )=: ( I ) + ( II ) + ( III ) , where the normalizing constants are D Π = R R | S ′| e ∇ θ ℓ n,θ ( Y ) T (¯ θ S ′ − ¯ θ ,S ′ )+ L n, ¯ θS ′ ( Y ) − λ k θ S ′ k dθ S ′ and D N = R R | S ′| e − k X S ′ ( θ S ′ − θ ,S ′ ) k + ∇ θ ℓ n,θ ( Y ) T (¯ θ S ′ − ¯ θ ,S ′ ) − λ k θ ,S ′ k dθ S ′ . We now bound ( I ) − ( III ) in turn. ( I ) : Using the likelihood (19) and the mean-value form of the remainder in the Taylor expansion(20), for ξ i between x Ti ¯ θ S ′ and x Ti θ , L n, ¯ θ S ′ = − n X i =1 g ′′ ( ξ i ) | x Ti (¯ θ S ′ − θ ) | ≥ − n X i =1 | x Ti (¯ θ S ′ − ¯ θ ,S ′ ) | + 2 | x Ti (¯ θ ,S ′ − θ ) | = − k X S ′ ( θ S ′ − θ ,S ′ ) k − k X S ′ c θ ,S ′ c k , (27)so that ( I ) is bounded by k X S ′ c θ ,S ′ c k / . On A n , using (24), k X S ′ c θ ,S ′ c k = k X ¯ θ ,S ′ c k ≤ κ ( | S ∩ S ′ c | ) k X k k θ ,S ′ c k ≤ κ ( s ) M s log p, so that ( I ) ≤ κ ( s ) M s log p/ . ( II ) : Under the expectation E µ S ′ ,D S ′ , we have the equality in distribution θ S ′ − θ ,S ′ = d ( X TS ′ X S ′ ) − X TS ′ ( Y − E θ Y ) + Z , where Z ∼ N S ′ (0 , D S ′ ) . Applying the triangle inequality andCauchy-Schwarz, ( II ) ≤ λ k ( X TS ′ X S ′ ) − X TS ′ ( Y − E θ Y ) k + λE µ S ′ ,D S ′ k Z k ≤ λ | S ′ | / ( k ( X TS ′ X S ′ ) − X TS ′ ( Y − E θ Y ) k + Tr ( D S ′ ) / ) , since by Jensen’s inequality E k Z k ≤ ( E k Z k ) / = Tr ( D S ′ ) / . Using the definition (23), for e j the j th unit vector in R p ,Tr ( D S ′ ) = | S ′ | X j =1 X TS ′ X S ′ ) jj = X j ∈ S ′ k Xe j k ≤ X j ∈ S ′ κ (1) k X k = | S ′ | κ (1) k X k . The matrix operator norm (from R | S ′ | to R | S ′ | ) of ( X TS ′ X S ′ ) − equals its largest eigenvalue, whichis bounded by / (4 κ ( | S ′ | ) k X k ) using (26). Recalling that ∇ θ ℓ n,θ ( Y ) = X T ( Y − E θ Y ) , on theevent A n the first term in the second to last display is therefore bounded by λ | S ′ | / κ ( | S ′ | ) k X k k X TS ′ ( Y − E θ Y ) k ≤ λ | S ′ | κ ( | S ′ | ) k X k k X T ( Y − E θ Y ) k ∞ ≤ λLs t κ ( Ls ) k X k .
19e have thus shown that ( II ) ≤ λLs k X k (cid:18) t κ ( Ls ) k X k + 1 κ (1) / (cid:19) . ( III ) : It remains to control the ratio of normalizing constants log( D Π /D N ) . Define B S ′ = { θ S ′ ∈ R | S ′ | : k θ S ′ − θ ,S ′ k ≤ M p s log p/ k X k} . On A n , using (21) and (24), Π S ′ ( B cS ′ | Y ) ≤ ˆ w S ′ ˆ w S ′ Π S ′ ( θ S ′ ∈ R | S ′ | : k ¯ θ S ′ − θ k > M p s log p/ k X k − k θ ,S ′ c k | Y ) ≤ ˆ w − S ′ Π( θ ∈ R p : k θ − θ k > M p s log p/ k X k| Y ) ≤ ep Ls e − M s log p = 2 e − ( M − L ) s log p ≤ / , where the last inequality follows from rearranging the assumption (4 e ) / ( M − L ) ≤ p s . UsingBayes formula, this gives Π S ′ ( B S ′ | Y )1 A n = R B S ′ e ℓ n, ¯ θS ′ ( Y ) − ℓ n,θ ( Y ) − λ k θ S ′ k dθ S ′ R R | S ′| e ℓ n, ¯ θS ′ ( Y ) − ℓ n,θ ( Y ) − λ k θ S ′ k dθ S ′ A n ≥
12 1 A n . By (20) the denominator in the last display equals e ∇ θ ℓ n,θ ( Y ) T (¯ θ ,S ′ − θ ) D Π , which implies that on A n , D Π ≤ R B S ′ e ∇ θ ℓ n,θ ( Y ) T (¯ θ S ′ − ¯ θ ,S ′ )+ L n, ¯ θS ′ ( Y ) − λ k θ S ′ k dθ S ′ . Thus on A n , log D Π D N ≤ log 2 R B S ′ e ∇ θ ℓ n,θ ( Y ) T (¯ θ S ′ − ¯ θ ,S ′ )+ L n, ¯ θS ′ ( Y ) − λ k θ S ′ k dθ S ′ R B S ′ e − k X S ′ ( θ S ′ − θ ,S ′ ) k + ∇ θ ℓ n,θ (¯ θ S ′ − ¯ θ ,S ′ ) − λ k θ ,S ′ k dθ S ′ ≤ log sup θ S ′ ∈ B S ′ e L n, ¯ θS ′ ( Y )+ k X S ′ ( θ S ′ − θ ,S ′ ) k + λ k θ ,S ′ k − λ k θ S ′ k ! + log 2 ≤ sup θ S ′ ∈ B S ′ L n, ¯ θ S ′ ( Y ) + k X S ′ ( θ S ′ − θ ,S ′ ) k + λ k θ S ′ − θ ,S ′ k + log 2 . Now L n, ¯ θ S ′ ( Y ) < by (27) since g ′′ > . Using Cauchy-Schwarz and the definition of B S ′ , on A n , ( III ) = log D Π D N ≤ sup θ S ′ ∈ B S ′ κ ( | S ′ | ) k X k k θ S ′ − θ ,S ′ k + λ | S ′ | / k θ S ′ − θ ,S ′ k + log 2 ≤ κ ( Ls ) M s log p + 2 M λL / s √ log p k X k + log 2 . Combining all of the above bounds gives the result.
The second part to applying Lemma 1 is showing that on an event, the desired sets have all butexponentially small posterior probability. This involves using results on dimension selection andposterior contraction from high-dimensional Bayesian statistics, especially Atchadé [2] and Castilloet al. [13]. The following results follow closely the proofs in [2], but we reproduce them herefor convenience, since in that paper they are not stated or proved in the exponential form neededto apply Lemma 1. We are also able to simplify certain technical conditions and streamline someproofs. Note that his results, including the definitions of the compatibility constants, match when k X k ∼ √ n .We next introduce some notation from [2], used throughout this section. A continuous function r :[0 , ∞ ) → [0 , ∞ ) is called a rate function if it is strictly increasing, r (0) = 0 and lim x ↓ r ( x ) /x = 0 .For a rate function r and a ≥ , define φ r ( a ) = inf { x > r ( z ) ≥ az, for all z ≥ x } , (28)20ith the convention inf ∅ = ∞ . Let B (Θ , M ) = { θ ∈ θ + Θ : k θ − θ k ≤ M } denote the ℓ -ballof radius M > centered at θ with elements in θ + Θ . For ε > , we denote by D ( ε, B (Θ , M )) the ε -packing number of B (Θ , M ) , namely the maximal number of points in B (Θ , M ) such thatthe ℓ distance between any two points is at least ε .The following result bounds the posterior probability of selecting a model of size larger than amultiple of the true model size. Lemma 8 (Theorem 4(1) of [2]) . Suppose the prior satisfies (3) and (4) , p A ≥ A , and that k X k ≥ (64 / αs √ log p/κ . Then for any L > , E θ [Π ( θ ∈ R p : | S θ | ≥ Ls | Y ) 1 A n, ( λ/ ] ≤ (cid:18) − s log p (cid:20) L (cid:16) A − log(4 A )log p (cid:17) − C (cid:21)(cid:19) , where C = 1 + α κ + log(4 + κ ( s ) / log p ) + (1 + s )( A − log(4 A )log p ) .Proof. By Lemma 12(1), for any k ≥ , E θ [Π( θ : | S θ | ≥ s + k | Y )1 A n, ( λ/ ] ≤ e a (cid:18) κ ( s ) k X k λ (cid:19) s (cid:18) ps (cid:19) (cid:18) A p A (cid:19) k , where a = − inf x> [ κ k X k x s / k X k ∞ x − λs / x ] . It remains to simplify the right-hand side.One can check that for τ, b, c > , inf x> [ τx bx − cx ] ≥ − c τ / ( τ − cb ) / ≥ − c τ if τ ≥ bc/ .In our setting, this condition equals k X k κ ≥ (64 / λs k X k ∞ , which holds by assumption. Thisyields a ≤ λ s k X k κ . Using the upper bound (cid:0) ps (cid:1) ≤ p s , setting k = ⌊ ( L − s ⌋ and using that ( L − s − ≤ ⌊ ( L − s ⌋ ≤ ( L − s gives the result.The next result is the analogous version of Theorem 4(2) in [2] with the exponential bounds werequire here. It provides a contraction rate for posterior models of a given size. Lemma 9.
Suppose the prior satisfies (3) and (4) , p A ≥ A , and that k X k ≥ α ( L +2) s √ log p k X k ∞ /κ (( L + 1) s ) for some L > . Then for any θ ∈ R p , and M ≥ max(25 α, (1 + A ) / , E θ h Π( θ ∈ R p : | S θ | ≤ Ls , k θ − θ k ≥ M p ( L + 2) s log pκ (( L + 1) s ) k X k | Y )1 A n, ( k X k√ log p ) i ≤ e − s log p (8 LM − C L ) , where C L = max( L (1 + log(24)log p ) , ˜ C p ) and ˜ C p = log A +log ( α log p ) log p . While ˜ C p is not a true constant since it depends on p , we write it as such since it is asymptoticallynegligible. As p → ∞ , we have ˜ C p → and C L → L . Proof.
We write κ L = κ (( L + 1) s ) during this proof to ease notation. By Lemma 12(2), for any M > , E θ Π( θ ∈ θ + ¯Θ L : k θ − θ k > M ε | Y )1 A n, ( k X k√ log p ) ≤ X j ≥ D j e − r ( jMε ) / + 2 (cid:18) ps (cid:19)(cid:16) p A A (cid:17) s (cid:16) λ ¯ κ ( s ) k X k (cid:17) s X j ≥ e − r ( jMε ) / e λc jMε , (29)where the quantities in (29) are defined in that lemma. Note that for θ satisfying | S θ | ≤ Ls , then | S θ − θ | ≤ ( L + 1) s , so that θ − θ ∈ ¯Θ L . We may thus further restrict the set in the last display to { θ : | S θ | ≤ Ls } , as in the posterior probability in the lemma. We first compute ε and then simplifythe right-hand side of (29).Recall that we may take as rate any ε ≥ φ r (2 η L ) for r the rate function in Lemma 12(2). Fora rate function r ( x ) = τx bx , τ, b > , the inequality r ( x ) ≥ ax is equivalent to x (( τ − b ) x − a ) ≥ . Using the definition (28) thus gives φ r ( a ) = aτ − ab . Setting τ = κ L k X k and b = k X k ∞ p ( L + 1) s / as in Lemma 12(2), and using our assumption k X k κ L ≥ α ( L + 2) s k X k√ log p k X k ∞ ≥ η L p ( L + 1) s k X k ∞ , we get φ r (2 η L ) = 2 η L κ L k X k − η L k X k ∞ p ( L + 1) s ≤ η L κ L k X k = 8 p ( L + 2) s log pκ L k X k =: ε Turning to the right-hand side of (29), note that c = sup v ∈ ¯Θ L k v k / k v k ≤ p ( L + 2) s . Arguingas on p. 29-30 of [2] gives P j ≥ D j e − r ( jMε/ / ≤ (cid:0) ( L + 2) s log p [1 + log(24)log p − M ] (cid:1) .Similarly, setting x = jM ε/ , λ p ( L + 2) s jM ε − r ( jM ε/
2) = − x k X k κ L x p ( L + 1) s k X k ∞ x − λ p ( L + 2) s ! ≤ − x k X k κ L Mε p ( L + 1) s k X k ∞ Mε − λ p ( L + 2) s ! ≤ − λ p ( L + 2) s x as long as k X k κ L Mε p ( L + 1) s k X k ∞ Mε ≥ λ p ( L + 2) s . We show the last display holds under the present assumptions. Since p ( L + 1) s k X k ∞ ε ≤ L +2) s √ log p k X k ∞ k X k κ L ≤ / (50 α ) by assumption, the left-hand side is lower bounded by k X k κ L Mε M/ (50 α ) ≥ (50 α/ k X k κ L ε for M ≥ α . Since λ ≤ α k X k√ log p by assumption, thelast display holds following from (50 α/ k X k κ L ε p ( L + 2) s = α k X k p log p. This implies X j ≥ e − r ( jMε/ e λc jMε ≤ X j ≥ e − jMλ √ ( L +2) s ε ≤ e − M ( L +2) s log p , where the last inequality again follows by the same argument on p. 29-30 of [2].Summing up these bounds and using (cid:0) ps (cid:1) ≤ p s and ¯ κ ( s ) ≥ ¯ κ (1) = 1 , the right-hand side of (29)is bounded by (cid:16) ( L + 2) s log p h log(24)log p − M i(cid:17) + 4 exp (cid:0) (1 + A ) s log p + s log (cid:0) α log p (cid:1) − s log A − M ( L + 2) s log p (cid:1) ≤ (cid:16) − s log p h LM − max( L (1 + log(24)log p ) , ˜ C p )) i(cid:17) . Combining the last two lemmas yields the contraction rate result with exponential bounds.
Lemma 10.
Suppose the prior satisfies (3) and (4) , and p A ≥ A . If for K > , the design matrixsatisfies condition (8) with L = L K = K + CA − log(4 A ) / log p , with C the constant in Lemma 8, then forany θ ∈ R p , E θ (cid:20) Π (cid:18) θ ∈ R p : k θ − θ k ≥ C K √ s log p k X k (cid:12)(cid:12)(cid:12)(cid:12) Y (cid:19) A n, ( k X k√ log p ) (cid:21) ≤ e − Ks log p , where C K = M K √ L K +2 κ (( L K +1) s ) , M K = max(25 α, A , K + ˜ C p L K + + log(24)8 log p ) and ˜ C p is given in Lemma9.
22f all the compatibility constants are bounded away from zero and infinity, the constants in Lemma10 scale like L K ∼ K , M K ∼ √ K and C K ∼ K as K → ∞ . We now have the required event toapply Lemma 1. Lemma 11.
Suppose the prior satisfies (3) and (4) , and p A ≥ A . Set L = CA − log(4 A ) / log p ,where C is the constant in Lemma 8, and assume the design matrix satisfies (8) for this L . Then for t = k X k√ log p , M = 2 L and M = C L , where C L is the constant in Lemma 10 with K = 3 L ,and any θ ∈ R p , P θ (cid:16) A n ( t, L, M , M ) c (cid:17) ≤ /p + (8 / p − s + 8 p − Ls , where A n ( t, L, M , M ) is defined in (15) .Proof. Using a union bound and the definition (15), P θ (cid:16) A n ( t, L, M , M ) c (cid:17) ≤ P θ ( A n, ( k X k p log p ) c ) + P θ ( A n, ( L ) c ∩ A n, ( k X k p log p ))+ P θ ( A n, ( M , M ) c ∩ A n, ( k X k p log p )) . Since ∂∂θ j ℓ n,θ ( Y ) = P ni =1 ( Y i − g ′ ( x Ti θ )) X ij , by Hoeffding’s inequality, P θ ( A n, ( t ) c ) = P θ max ≤ j ≤ p (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n X i =1 ( Y i − g ′ ( x Ti θ )) X ij (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > t ! ≤ p X j =1 e − t k X · j k ≤ pe − t k X k = 2 p . Applying Markov’s inequality and Lemma 8 with the present choice of L , the second term isbounded by (4 / E θ h Π( θ ∈ R p : | S θ | > Ls | Y )1 A n, ( k X k√ log p ) i ≤ (8 / e − s log p . Similarly, using Markov’s inequality and Lemma 10 with K = 3 L , the third term is bounded by e M s log p E θ h Π( θ ∈ R p : k θ − θ k ≥ M p s log p/ k X k| Y )1 A n, ( k X k√ log p ) i ≤ e − Ls log p . The following is a simplified version of Theorem 3 of Atchadé [2], which applies to general settings,tailored to the sparse high-dimensional logistic regression model. It gives high level technical con-ditions under which one can control (1) the posterior model dimension and (2) the posterior ℓ normfor models of restricted dimension. Lemma 12.
Suppose the prior satisfies (3) and p A ≥ A .(1) For any integer k ≥ , E θ Π (cid:16) θ ∈ R p : | S θ | ≥ s + k | Y (cid:17) A n, ( λ/ ≤ e a (cid:16) κ ( s ) k X k λ (cid:17) s (cid:18) ps (cid:19)(cid:16) A p A (cid:17) k , where a = −
12 inf x> " κ k X k x s / k X k ∞ x − λ √ s x . (2) For L > , set ¯Θ L = { θ ∈ R p : | S θ − θ | ≤ ( L + 1) s } and define the rate func-tion r ( x ) = κ (( L +1) s ) k X k x k X k ∞ √ ( L +1) s x/ . Further set η L = 2 p ( L + 2) s k X k√ log p and ε = φ r (2 η L ) , where φ r uses the same rate function r and is defined in (28) . Then forany M > , E θ Π( θ ∈ θ + ¯Θ L : k θ − θ k > M ε | Y )1 A n, ( k X k√ log p ) ≤ X j ≥ D j e − r ( jM ε ) / + 2 (cid:18) ps (cid:19)(cid:16) p A A (cid:17) s (cid:16) λ κ ( s ) k X k (cid:17) s X j ≥ e − r ( jM ε ) / e λc jM ε , here c = sup u ∈ ¯Θ L sup v ∈ ¯Θ L , k v k =1 |h sign ( u ) , v i| and D j = D (cid:0) jM ε , B ( ¯Θ L , ( j +1) M ε ) (cid:1) .Proof. This is a combination of Theorems 3 and 4 in [2]. In particular, we verify that certaintechnical assumptions of that result hold automatically in the logistic regression model, giving thesimpler result above. Firstly note that Assumptions H1-H3 of [2] are satisfied (H1) by definition,(H2) since θ ℓ n,θ in (19) is concave and differentiable and (H3) by (3).For y ∈ { , } n data in model (1), L n,θ defined in (20), Θ = { θ : S θ ⊆ S θ } and r some ratefunction, define N = θ ∈ R p : θ = 0 , and X i ∈ S c | θ i | ≤ k θ S k , ˇ E n, ( N , r ) = (cid:8) y ∈ { , } n : ∀ θ ∈ θ + N : L n,θ ( y ) ≤ − r ( k θ − θ k ) (cid:9) , ˆ E n, (Θ , ¯ L ) = n y ∈ { , } n : ∀ θ ∈ θ + Θ , L n,θ ( y ) ≥ − ¯ L k θ − θ k o , E n, (Θ , λ ) = ( y ∈ { , } n : sup u ∈ Θ , k u k =1 |h∇ θ log ℓ n,θ ( y ) , u i| ≤ λ ) , where ¯ L > and λ > is the regularization parameter in the prior (2). This matches the notation in[2] (except note his ρ is our λ ), where it is shown the theorem’s two conclusions hold under variouschoices of parameters in the last displays.Part (1): [2] considers the event E n, ( R p , λ ) ∩ ˆ E n, (Θ , ¯ L ) ∩ ˇ E n ( N , r ) , which we now simplify.Arguing as on p27 of [2] yields L n,θ ( y ) ≤ − n X i =1 g ′′ ( x Ti θ ) | x Ti ( θ − θ ) | | x Ti ( θ − θ ) | , For θ − θ ∈ N , | x Ti ( θ − θ ) | ≤ k X k ∞ k θ − θ k ≤ k X k ∞ s / k θ − θ k , which gives L n,θ ( y ) ≤ −
12 + max i | x Ti ( θ − θ ) | ( θ − θ ) T X T W X ( θ − θ ) ≤ − κ k X k k θ − θ k s / k X k ∞ k θ − θ k =: − r ( k θ − θ k ) for the rate function r ( t ) = κ k X k t / (1+4 s / k X k ∞ t ) . Thus the event ˇ E n, ( N , r ) holds determin-istically true for any y ∈ { , } n and this choice of r . Furthermore, since g ′′ ( t ) ≤ / , by consider-ing the remainder in the Taylor expansion of ℓ n,θ ( y ) − ℓ n,θ ( y ) , for θ − θ ∈ Θ = { θ ′ : S θ ′ ⊆ S θ } , L n,θ ( y ) ≥ − ( θ − θ ) T X T X ( θ − θ ) ≥ − κ ( s ) k X k k θ − θ k . Thus ˆ E n, (Θ , ¯ L ) = { , } n for ¯ L = κ ( s ) k X k / . Inspection of the proof of Theorem 3 of [2]shows that he actually only requires the larger event A n, ( λ/
2) = {k∇ θ ℓ n,θ ( Y ) k ∞ ≤ λ/ } ) E n, ( R p , λ ) to hold rather than E n, ( R p , λ ) (see p. 23 of [2] - they are incorrectly stated as beingequal). In our setting, we may thus replace the event of Theorem 3(1) of [2] by A n, (cid:0) λ/ (cid:1) ∩ ˇ E n, ( N , r ) ∩ ˆ E n, (Θ , ¯ L ) = A n, (cid:0) λ/ (cid:1) for r, ¯ L as above, from which the result follows.Part (2): [2] considers the event E n, ( ¯Θ L , η L ) ∩ ˆ E n, (Θ , ¯ L ) ∩ ˇ E n ( ¯Θ L , r ) , which we again simplify.From Part (1), we again take ˆ E n, (Θ , ¯ L ) = { , } n for ¯ L = κ ( s ) k X k / . Arguing as in Part(1), we get ˇ E n, ( ¯Θ L , r ) = { , } n for rate function r ( x ) = κ (( L +1) s ) k X k x k X k ∞ √ ( L +1) s x/ using that for24 ∈ ¯Θ L , | x Ti ( θ − θ ) | ≤ k X k ∞ k θ − θ k ≤ k X k ∞ p ( L + 1) s k θ − θ k . For any θ ∈ θ + ¯Θ L ,by Cauchy-Schwarz, |h∇ θ ℓ n,θ ( y ) , θ − θ i| ≤ k∇ θ ℓ n,θ ( y ) k ∞ k θ − θ k ≤ k∇ θ ℓ n,θ ( y ) k ∞ p ( L + 2) s k θ − θ k , so that E n, ( ¯Θ L , η L ) ⊃ A n, (cid:0) k X k√ log p (cid:1) . We hence conclude that E n, ( ¯Θ L , η L ) ∩ ˇ E n, ( ¯Θ L , r ) ∩ ˆ E n, (Θ , ¯ L ) = E n, ( ¯Θ L , η L ) ⊃ A n, (cid:0) k X k p log p (cid:1) for the above choices of ¯ L , r and η L . Applying Theorem 3(2) of [2] then gives the result.The following is the non-asymptotic analogue of Theorem 1 in Section 3. Theorem 4.
Suppose the prior satisfies (3) and (4) , and p A ≥ A . If for K > , the design matrixsatisfies condition (8) with L = L K = K + CA − log(4 A ) / log p and C the constant in Lemma 8, then forany θ ∈ R p , E θ Q ∗ (cid:18) θ ∈ R p : k θ − θ k ≥ C K √ s log p k X k (cid:19) ≤ ζ n + 8 e − ( K/ s log p ( K/ s log p + 2 p + 83 p − s + 8 p − Ls , where ζ n is given in Lemma 7, C K = M K √ L K +2 κ (( L K +1) s ) , and M K = max(25 α, A , K + ˜ C p L K + + log(24)8 log p ) with ˜ C p given in Lemma 9.Furthermore, the mean-squared prediction error k p θ − p k n = n P ni =1 (Ψ( x Ti θ ) − Ψ( x Ti θ )) ofthe VB posterior Q ∗ satisfies E θ Q ∗ (cid:16) θ ∈ R p : k p θ − p k n ≥ C K p ¯ κ (( L K + 1) s )4 r s log pn (cid:17) ≤ ζ n + 8 e − ( K/ s log p ( K/ s log p + 2 /p + (8 / p − s + 8 p − Ls . Proof.
We first apply Lemma 1 with Θ n = n θ ∈ R p : k θ − θ k ≥ C K √ s log p k X k o ,A = A n = A n ( t, L, M , M ) the event in Lemma 11, δ n = Ks log p , and C = 8 . Since A n ⊂A n, ( k X k√ log p ) , Lemma 10 implies that the condition (14) holds. Using Lemma 1 followed byLemma 7, E θ Q ∗ (cid:16) θ ∈ R p : k θ − θ k ≥ C K √ s log p k X k (cid:17) A n ≤ E θ KL ( Q ∗ || Π( ·| Y ))1 A n + 8 e − δ n / δ n / ≤ ζ n + 8 exp( − ( K/ s log p )( K/ s log p . By Lemma 11, E θ Q ∗ (cid:16) θ ∈ R p : k θ − θ k ≥ C K √ s log p k X k (cid:17) A cn ≤ P θ ( A cn ) ≤ /p + (8 / p − s + 8 p − Ls . The first statement follows by combining the above two displays.Turning to the second statement, Lemma 10 implies the condition (14) holds with Θ n = { θ ∈ R p : | S θ | ≥ L K s } ,δ n = Ks log p and C = 2 . Therefore, similarly as above, E θ Q ∗ (cid:16) θ ∈ R p : | S θ | ≥ L K s (cid:17) A n ≤ ζ n + 2 exp( − ( K/ s log p )( K/ s log p . For any θ in the set in the last display, since k Ψ ′ k ∞ ≤ / , n k p θ − p k n ≤ n X i =1 | x Ti ( θ − θ ) | ≤ k X ( θ − θ ) k ≤
116 ¯ κ (( L K + 1) s ) k X k k θ − θ k , where the last inequality follows from the definition of ¯ κ ( · ) . The second statement then follows bycombining the first statement of the theorem and the last two displays.25 Since the VB minimization problem (6) is intractable for Bayesian logistic regression, we insteadminimize a surrogate objective obtained by lower bounding the likelihood [6, 24]. This is a standardapproach, but we include full details for completeness. For the log-likelihood ℓ n,θ defined in (19), itholds that ℓ n,θ ( x, y ) ≥ n X i =1 log Ψ( η i ) − η i y i − ) x Ti θ − η i tanh( η i / (cid:0) ( x Ti θ ) − η i (cid:1) =: f ( θ, η ) (30)for any η = ( η , . . . , η n ) ∈ R n , see Section 11.2 for a proof. Hence for any distribution Q for θ ,KL ( Q || Π( ·| Y )) = Z log (cid:18) dQ ( θ ) e ℓ n,θ ( x,y ) d Π( θ ) (cid:19) dQ ( θ ) + C ≤ Z log dQd Π ( θ ) − f ( θ, η ) dQ ( θ ) + C = KL ( Q || Π) − E Q [ f ( θ, η )] + C, (31)where C is independent of Q . We minimize the right-hand side over the variational family Q µ,σ,γ ∈Q , i.e. over the parameters µ, σ, γ . Since we seek the tightest possible upper bound in (31), we alsominimize this over the free parameter η . In particular, the coordinate ascent variational inference(CAVI) algorithm alternates between updating η for fixed µ, σ, γ and then cycling through µ j , σ j , γ j and updating these given all other parameters are fixed.Write E µ,σ,γ for the expectation when θ ∼ Q µ,σ,γ . For fixed µ, σ, γ , update η = ( η , . . . , η n ) by η i = E µ,γ,σ ( x Ti θ ) = p X k =1 γ k x ik ( µ k + σ k ) + p X k =1 X l = k ( γ k x ik µ k )( γ l x il µ l ) , (32)see Section 11.2 for a proof. We now derive the coordinate update equations for µ j , σ j , γ j keepingall other parameters, including η , fixed. For completeness, we allow the Laplace slab to have non-zero mean ν if desired. Proposition 1 (Coordinate updates with Laplace prior) . Consider the prior (9) with Laplace slabdensity g ( x ) = λ e − λ | x − ν | , where ν ∈ R , λ > . Given all other parameters are fixed, the values µ j and σ j that minimize (31) with Q = Q µ,σ,γ ∈ Q are the minimizers of the objective functions: µ j λσ j r π e − ( µj − ν )22 σ j + λ ( µ j − ν )erf (cid:18) µ j − ν √ σ j (cid:19) + µ j n X i =1 η i tanh( η i / x ij + µ j (cid:18) n X i =1 η i tanh( η i / x ij X k = j γ k x ik µ k − n X i =1 ( y i − / x ij (cid:19) ,σ j λσ j r π e − ( µj − ν )22 σ j + λ ( µ j − ν )erf µ j − ν √ σ j ! − log σ j + σ j n X i =1 η i tanh( η i / x ij , respectively, where erf( x ) = 2 / √ π R x e − t dt is the error function. The value γ j that minimizes (31) given all other parameters are fixed, is the solution to − log γ j − γ j = log b a − log( λσ j ) + λσ j r π e − ( µj − ν )22 σ j + λ ( µ j − ν )erf µ j − ν √ σ j ! − µ j n X i =1 ( y i − / x ij + n X i =1 η i tanh( η i / (cid:18) x ij ( µ j + σ j ) + 2 x ij µ j X k = j γ k x ik µ k (cid:19) . Proof.
Throughout this proof we fix the parameter η ∈ R n and let C denote any term constant withrespect to the parameters currently being optimized, possibly different on each line. We first compute26he update equations for µ j and σ j based on (31). We compute KL ( Q µ,σ,γ | z j =1 || Π) , which considersthe distribution Q µ,σ,γ conditional on z j = 1 , as a function of ( µ j , σ j ) , holding all other parametersfixed. Using that Q µ,σ,γ is a factorizable distribution and that conditional on z j = 1 the variationaldistribution of θ j is singular to the Dirac measure δ , we can simplify dQ µ,σ,γ | zj =1 d Π = C dQ µj,σj | zj =1 d Π cj ,where Π cj is the continuous part of the prior distribution for θ j and C does not depend on µ j or σ j . Recall that Π cj = R w Lap ( ν, λ ) dw = a / ( a + b ) Lap ( ν, λ ) . Thus for φ µ,σ the density of a N ( µ, σ ) distribution and w = a / ( a + b ) , log dQ µ,σ,γ | z j =1 d Π ( θ ) = log dQ µ j ,σ j | z j =1 d Π cj ( θ j ) + C = log φ µ j ,σ j ( θ j ) wg j ( θ j ) + C. Taking expectations with respect to Q µ,σ,γ | z j =1 , E µ,σ,γ | z j =1 (cid:20) log φ µ j ,σ j ( θ j ) wg j ( θ j ) (cid:21) = E µ,σ | z j =1 " − log( λσ j ) − ( θ j − µ j ) σ j + λ | θ − ν | + C = − log( λσ j ) + λE µ,σ | z j =1 | θ j − ν | + C, where we have used E µ,σ,γ | z j =1 [( θ j − µ j ) /σ j ] = 1 . Under the variational distribution, λ | θ j − ν | follows a folded Gaussian distribution, hence λE µ j ,σ j | z j =1 | θ j − ν | = λσ j r π e − ( µj − ν )22 σ j + λ ( µ j − ν )erf µ j − ν √ σ j ! . Combining the last three displays gives KL ( Q µ,σ,γ | z j =1 || Π) as a function of µ j , σ j . Using thisexpression and evaluating E µ,σ,γ | z j =1 (cid:2) f ( θ, η ) (cid:3) using Lemma 15 below, the upper bound in (31)equals, as a function of µ j , σ j , − log( λσ j ) + λσ j r π e − ( µj − ν )22 σ j + λ ( µ j − ν )erf µ j − ν √ σ j ! − n X i =1 ( y i − / x ij µ j + n X i =1 η i tanh( η i / x ij ( µ j + σ j ) + 2 x ij µ j X k = j γ k x ik µ k + C, where C is independent of µ j , σ j . Minimizing the display with respect to either µ j or σ j gives thedesired result.For updating the inclusion probabilities γ j , we proceed as above without conditioning on z j = 1 .Keeping track of only the γ j terms, E µ,σ,γ (cid:20) log dQ µ,σ,γ d Π ( θ ) (cid:21) = E µ,σ,γ " log d ( γ j N ( µ j , σ j ) + (1 − γ j ) δ ) d ( w Lap ( ν, λ ) + (1 − w ) δ ) ( θ j ) + C = E µ,σ,γ " { z j =1 } log γ j dN ( µ j , σ j ) wd Lap ( ν, λ ) ( θ j ) + 1 { z j =0 } log 1 − γ j − w + C = γ j E µ,σ,γ | z j =1 (cid:20) log φ µ j ,σ j ( θ j ) g j ( θ j ) (cid:21) + γ j log γ j w + (1 − γ j ) log 1 − γ j − w + C. The first expectation was evaluated above. Using this and evaluating E µ,σ,γ (cid:2) f ( θ, η ) (cid:3) using Lemma15 below, the upper bound in (31) equals, as a function of γ j , γ j ( − log( λσ j ) + λσ j r π e − ( µj − ν )22 σ j + λ ( µ j − ν )erf µ j − ν √ σ j ! − µ j n X i =1 ( y i − / x ij + n X i =1 η i tanh( η i / x ij ( µ j + σ j ) + 2 x ij µ j X k = j γ k x ik µ k ) + γ j log γ j w + (1 − γ j ) log 1 − γ j − w + C, C is independent of γ j . As a function of γ j , this takes the form h ( γ j ) = γ j log γ j a + (1 − γ j ) log 1 − γ j b + cγ j , with a, b ∈ (0 , and c ∈ R . By differentiating, h is convex and has a global minimizer ¯ γ j ∈ [0 , satisfying − log ¯ γ j − ¯ γ j = c + log ba . Substituting in the above values for a, b, c gives the result.
We now derive the lower bound (30) as in [24]. Recall the log-likelihood (19): ℓ n,θ ( x, y ) = n X i =1 y i x Ti θ − g ( x Ti θ ) , where g ( t ) = log(1 + e t ) , t ∈ R . We lower bound the second term above using a Taylor expansionin x . The following lemma fills in some details from [24], where this technique was proposed. Lemma 13.
For Ψ( x ) = (1 + e − x ) − the standard logistic function and any η ∈ R , log Ψ( x ) ≥ x − η η ) − η tanh( η/ x − η ) . Proof.
Note that we can write log Ψ( x ) = x/ − log( e x/ + e − x/ ) . By elementary calculations,it can be shown that the second term on the right hand side is convex in the variable x . We cantherefore use its first order Taylor approximation in x to derive a lower bound for log Ψ( x ) , i.e. forany η ∈ R , log Ψ( x ) ≥ x − log( e η/ + e − η/ ) − η tanh( η/ x − η )= x − η η ) − η tanh( η/ x − η ) . Substituting this lower bound into each term − g ( x Ti θ ) = log Ψ( − x Ti θ ) in the log-likelihood yields(30).We next obtain the update equation (32) for the free parameter η ∈ R n . Minimizing (31) over η forfixed Q = Q µ,σ,γ ∈ Q is equivalent to solving e η = arg max η ∈ R n E µ,σ,γ (cid:2) f ( θ, η ) (cid:3) . Lemma 14.
The function f a : R → R , a ≥ , given by f a ( x ) = log Ψ( x ) − x − x tanh( x/ (cid:0) a − x (cid:1) is symmetric about zero and possesses unique maximizers at x = ± a .Proof. Indeed, tanh( − x ) = − tanh( x ) shows that the last term is symmetric around zero. More-over, Ψ( − x ) = 1 − Ψ( x ) yields log Ψ( − x ) + x/ x ) − x/ thereby proving symmetry of f a . Using x ) − x/ , f ′ a ( x ) = Ψ ′ ( x )Ψ( x ) −
12 + 14 (cid:18) tanh( x / ) + x x / ) (cid:19) − (cid:18) x (cosh x / ) − − x / )8 x (cid:19) a = ( a − x ) (cid:18) tanh( x / )4 x − x (cosh x / ) (cid:19) . f ′ a ( ± a ) = 0 , it remains to show f ′′ a ( ± a ) < . Note that f ′′ a ( x ) = ( a − x ) ddx (cid:18) tanh( x / )4 x − x (cosh x / ) (cid:19) − tanh( x / )2 x + 14(cosh x / ) . The first term vanishes at x = ± a and the second is symmetric about zero, so the formula sinh( x ) cosh( x ) = sinh(2 x ) / yields f ′′ a ( ± a ) = − a / ) cosh( a / ) − a a (cosh a / ) = − sinh( a ) − a a (cosh a / ) . This concludes the proof as sinh( x ) /x ≥ .By the last lemma, we can restrict the free parameter to η ∈ R n ≥ and take e η i = E µ,γ,σ ( x Ti θ ) tomaximize E µ,σ,γ f ( θ, η ) . The update (32) then follows from the following lemma. Lemma 15.
For Q µ,σ,γ ∈ Q , E µ,σ,γ [ x Ti θ ] = p X k =1 γ k µ k x ik , E µ,σ,γ (cid:2) ( x Ti θ ) (cid:3) = p X k =1 γ k x ik ( µ k + σ k ) + p X k =1 X l = k ( γ k x ik µ k )( γ l x il µ l ) . If the expectations are instead taken over E µ,σ,γ | z j =1 , then the same formulas hold true with γ j = 1 .Proof. Since θ k ∼ iid (1 − γ k ) δ + γ k N ( µ k , σ k ) under Q µ,σ,γ , the first claim follows by linearityof the expectation. Using that θ k = θ k { z k =1 } , Q µ,σ,γ -almost surely, and that ( θ k ) are independentunder the mean-filed distribution Q µ,σ,γ , E µ,σ,γ (cid:2) ( x Ti θ ) (cid:3) = E µ,σ,γ p X k =1 x ik θ k { z k =1 } ! = p X k =1 γ k x ik ( µ k + σ k ) + p X k =1 X l = k ( γ k x ik µ k )( γ l x il µ l ) . References [1] A
LQUIER , P.,
AND R IDGWAY , J. Concentration of tempered posteriors and of their variationalapproximations.
Ann. Statist. 48 , 3 (2020), 1475–1497.[2] A
TCHADÉ , Y. A. On the contraction properties of some high-dimensional quasi-posteriordistributions.
Ann. Statist. 45 , 5 (2017), 2248–2273.[3] B
ACH , F. Self-concordant analysis for logistic regression.
Electron. J. Stat. 4 (2010), 384–414.[4] B
ANERJEE , S., C
ASTILLO , I.,
AND G HOSAL , S. Survey paper: Bayesian inference in high-dimensional models.[5] B
HARDWAJ , S., C
URTIN , R. R., E
DEL , M., M
ENTEKIDIS , Y.,
AND S ANDERSON , C. ens-mallen: a flexible C++ library for efficient function optimization, 2018.[6] B
ISHOP , C. M.
Pattern recognition and machine learning . Information Science and Statistics.Springer, New York, 2006.[7] B
LEI , D. M., K
UCUKELBIR , A.,
AND M C A ULIFFE , J. D. Variational inference: a review forstatisticians.
J. Amer. Statist. Assoc. 112 , 518 (2017), 859–877.[8] B
OUCHERON , S., L
UGOSI , G.,
AND M ASSART , P.
Concentration inequalities: A nonasymp-totic theory of independence . Oxford university press, 2013.299] B
ÜHLMANN , P.,
AND VAN DE G EER , S.
Statistics for high-dimensional data . Springer Seriesin Statistics. Springer, Heidelberg, 2011. Methods, theory and applications.[10] C
ARBONETTO , P.,
AND S TEPHENS , M. Scalable variational inference for Bayesian variableselection in regression, and its accuracy in genetic association studies.
Bayesian Anal. 7 , 1(2012), 73–107.[11] C
ARVALHO , C. M., P
OLSON , N. G.,
AND S COTT , J. G. The horseshoe estimator for sparsesignals.
Biometrika 97 , 2 (2010), 465–480.[12] C
ASTILLO , I.,
AND R OQUAIN , E. On spike and slab empirical Bayes multiple testing.
Ann.Statist. 48 , 5 (2020), 2548–2574.[13] C
ASTILLO , I., S
CHMIDT -H IEBER , J.,
AND VAN DER V AART , A. Bayesian linear regressionwith sparse priors.
Ann. Statist. 43 , 5 (2015), 1986–2018.[14] C
ASTILLO , I.,
AND S ZABÓ , B. Spike and slab empirical Bayes sparse credible sets.
Bernoulli26 , 1 (2020), 127–158.[15] C
ASTILLO , I.,
AND VAN DER V AART , A. Needles and straw in a haystack: posterior concen-tration for possibly sparse sequences.
Ann. Statist. 40 , 4 (2012), 2069–2101.[16] C
LARA , G., S
ZABO , B.,
AND R AY , K. sparsevb: spike and slab variational Bayes for linearand logistic regression , 2020. R package version 1.0.[17] G EORGE , E. I.,
AND M C C ULLOCH , R. E. Variable selection via Gibbs sampling.
Journal ofthe American Statistical Association 88 , 423 (1993), 881–889.[18] G
HORBANI , B., J
AVADI , H.,
AND M ONTANARI , A. An instability in variational inference fortopic models. arXiv e-prints (2018), arXiv:1802.00568.[19] G
HOSAL , S., G
HOSH , J. K.,
AND VAN DER V AART , A. W. Convergence rates of posteriordistributions.
Ann. Statist. 28 , 2 (2000), 500–531.[20] G
OODRICH , B., G
ABRY , J., A LI , I., AND B RILLEMAN , S. rstanarm: Bayesian appliedregression modeling via Stan., 2020. R package version 2.19.3.[21] H
OFFMAN , M. D.,
AND G ELMAN , A. The no-U-turn sampler: adaptively setting path lengthsin Hamiltonian Monte Carlo.
J. Mach. Learn. Res. 15 (2014), 1593–1623.[22] H
ORN , R. A.,
AND J OHNSON , C. R.
Matrix analysis , second ed. Cambridge University Press,Cambridge, 2013.[23] H
UANG , X., W
ANG , J.,
AND L IANG , F. A variational algorithm for Bayesian variable selec-tion. arXiv e-prints (2016), arXiv:1602.07640.[24] J
AAKKOLA , T. S.,
AND J ORDAN , M. I. Bayesian parameter estimation via variational meth-ods.
Statistics and Computing 10 , 1 (2000), 25–37.[25] L I , Y.-H., S CARLETT , J., R
AVIKUMAR , P.,
AND C EVHER , V. Sparsistency of ℓ -Regularized M -Estimators. In Proceedings of the Eighteenth International Conference on Artificial Intelli-gence and Statistics (2015), pp. 644–652.[26] L I , Z. R., M C C ORMICK , T. H.,
AND C LARK , S. J. Bayesian joint spike-and-slab graphicallasso. arXiv e-prints (2018), arXiv:1805.07051.[27] L IU , D. C., AND N OCEDAL , J. On the limited memory BFGS method for large scale opti-mization.
Mathematical Programming 45 (1989), 503–528.[28] L
OGSDON , B. A., H
OFFMAN , G. E.,
AND M EZEY , J. G. A variational Bayes algorithm forfast and accurate multiple locus genome-wide association analysis.
BMC bioinformatics 11 , 1(2010), 58.[29] L U , Y., S TUART , A.,
AND W EBER , H. Gaussian approximations for probability measures on R d . SIAM/ASA J. Uncertain. Quantif. 5 , 1 (2017), 1136–1165.3030] M C D ERMOTT , P., S
NYDER , J.,
AND W ILLISON , R. Methods for Bayesian variable selectionwith binary response data using the EM algorithm. arXiv e-prints (2016), arXiv:1605.05429.[31] M
ITCHELL , T. J.,
AND B EAUCHAMP , J. J. Bayesian variable selection in linear regression.
J.Amer. Statist. Assoc. 83 , 404 (1988), 1023–1036.[32] N
ARISETTY , N. N., S
HEN , J.,
AND H E , X. Skinny Gibbs: a consistent and scalable Gibbssampler for model selection. J. Amer. Statist. Assoc. 114 , 527 (2019), 1205–1217.[33] N
EGAHBAN , S., Y U , B., W AINWRIGHT , M. J.,
AND R AVIKUMAR , P. K. A unified frame-work for high-dimensional analysis of M -estimators with decomposable regularizers. In Ad-vances in Neural Information Processing Systems 22 .[34] N
EGAHBAN , S. N., R
AVIKUMAR , P., W
AINWRIGHT , M. J.,
AND Y U , B. A unified frame-work for high-dimensional analysis of M -estimators with decomposable regularizers. Statist.Sci. 27 , 4 (11 2012), 538–557.[35] N
ICKL , R.,
AND R AY , K. Nonparametric statistical inference for drift vector fields of multi-dimensional diffusions. Ann. Statist. 48 , 3 (2020), 1383–1408.[36] O
RMEROD , J. T., Y OU , C., AND M ÜLLER , S. A variational Bayes approach to variableselection.
Electron. J. Stat. 11 , 2 (2017), 3549–3594.[37] P
AISLEY , J. W., B
LEI , D. M.,
AND J ORDAN , M. I. Variational Bayesian inference withstochastic search. In
ICML (2012), icml.cc / Omnipress.[38] P
ATI , D., B
HATTACHARYA , A.,
AND Y ANG , Y. On statistical optimality of variational Bayes.In
Proceedings of the Twenty-First International Conference on Artificial Intelligence andStatistics (2018), pp. 1579–1588.[39] P
OLSON , N. G.,
AND R O ˇCKOVÁ , V. Posterior concentration for sparse deep learning. In
Advances in Neural Information Processing Systems (2018), pp. 930–941.[40] R AY , K. Adaptive Bernstein–von Mises theorems in Gaussian white noise. Ann. Statist. 45 , 6(2017), 2511–2536.[41] R AY , K., AND S CHMIDT -H IEBER , J. Minimax theory for a class of nonlinear statistical in-verse problems.
Inverse Problems 32 , 6 (2016), 065003, 29.[42] R AY , K., AND S ZABO , B. Variational Bayes for high-dimensional linear regression withsparse priors. arXiv e-prints (2019), arXiv:1904.07150.[43] S
ANDERSON , C.,
AND C URTIN , R. Armadillo: A template-based C++ library for linearalgebra.
Journal of Open Source Software 1 (07 2016), 26.[44] S
COTT , S. L.,
AND V ARIAN , H. R. Bayesian variable selection for nowcasting economic timeseries. Tech. rep., National Bureau of Economic Research, 2013.[45] S
HETH , R.,
AND K HARDON , R. Excess risk bounds for the Bayes risk using variationalinference in latent Gaussian models. In
Advances in Neural Information Processing Systems30 . 2017, pp. 5151–5161.[46] S
ZABÓ , B.,
AND VAN Z ANTEN , H. An asymptotic analysis of distributed nonparametricmethods.
Journal of Machine Learning Research 20 , 87 (2019), 1–30.[47] T
ITSIAS , M.,
AND L AZARO -G REDILLA , M. Doubly stochastic variational Bayes for non-conjugate inference. In
Proceedings of the 31st International Conference on Machine Learning (2014), pp. 1971–1979.[48] T
ITSIAS , M. K.,
AND L ÁZARO -G REDILLA , M. Spike and slab variational inference formulti-task and multiple kernel learning. In
Advances in neural information processing systems (2011), pp. 2339–2347. 3149]
VAN DER P AS , S., S ZABÓ , B.,
AND VAN DER V AART , A. Uncertainty quantification for thehorseshoe (with discussion).
Bayesian Anal. 12 , 4 (2017), 1221–1274. With a rejoinder by theauthors.[50]
VAN E RVEN , T.,
AND S ZABO , B. Fast exact Bayesian inference for sparse signals in thenormal sequence model.
Bayesian Anal., to appear (2020).[51] W
ANG , B.,
AND T ITTERINGTON , D. Convergence and asymptotic normality of variationalBayesian approximations for exponential family models with missing values. In
Proceedingsof the Twentieth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-04) (2004), pp. 577–584.[52] W
ANG , B.,
AND T ITTERINGTON , D. M. Inadequacy of interval estimates corresponding tovariational Bayesian approximations. In
IN AISTATS05 (2004), pp. 373–380.[53] W
ANG , C.,
AND B LEI , D. M. Variational inference in nonconjugate models.
J. Mach. Learn.Res. 14 (2013), 1005–1031.[54] W
ANG , Y.,
AND B LEI , D. Variational Bayes under model misspecification. In
Advances inNeural Information Processing Systems 32 . 2019, pp. 13357–13367.[55] W
ANG , Y.,
AND B LEI , D. M. Frequentist consistency of variational Bayes.
J. Amer. Statist.Assoc. 114 , 527 (2019), 1147–1161.[56] W EI , R., AND G HOSAL , S. Contraction properties of shrinkage priors in logistic regression.
J. Statist. Plann. Inference 207 (2020), 215–229.[57] Y
ANG , Y., P
ATI , D.,
AND B HATTACHARYA , A. α -variational inference with statistical guar-antees. Ann. Statist. 48 , 2 (2020), 886–905.[58] Y I , N., T ANG , Z., Z
HANG , X.,
AND G UO , B. BhGLM: Bayesian hierarchical GLMs andsurvival models, with applications to genomics and epidemiology. Bioinformatics 35 8 (2019),1419–1421.[59] Z
HANG , A. Y.,
AND Z HOU , H. H. Theoretical and computational guarantees of mean fieldvariational inference for community detection.
Ann. Statist. 48 , 5 (2020), 2575–2598.[60] Z
HANG , C.-X., X U , S., AND Z HANG , J.-S. A novel variational Bayesian method for variableselection in logistic regression models.
Comput. Statist. Data Anal. 133 (2019), 1–19.[61] Z
HANG , F.,
AND G AO , C. Convergence rates of variational posterior distributions. Ann.Statist. 48 , 4 (2020), 2180–2207.[62] Z
HAO , P.,