Extended Stochastic Gradient MCMC for Large-Scale Bayesian Variable Selection
By QIFAN SONG, YAN SUN, MAO YE and FAMING LIANG
Department of Statistics, Purdue University, West Lafayette, IN 47906, USA
qfsong, sun748, ye207, [email protected]

SUMMARY
Stochastic gradient Markov chain Monte Carlo (MCMC) algorithms have received much attention in Bayesian computing for big data problems, but they are only applicable to a small class of problems for which the parameter space has a fixed dimension and the log-posterior density is differentiable with respect to the parameters. This paper proposes an extended stochastic gradient MCMC algorithm which, by introducing appropriate latent variables, can be applied to more general large-scale Bayesian computing problems, such as those involving dimension jumping and missing data. Numerical studies show that the proposed algorithm is highly scalable and much more efficient than traditional MCMC algorithms. The proposed algorithm substantially alleviates the computational burden of Bayesian methods for big data problems.
Some key words: Dimension jumping; Missing data; Stochastic gradient Langevin dynamics; Subsampling.
1. INTRODUCTION
After six decades of continual development, MCMC has proven to be a powerful and often indispensable computational tool for analyzing data of complex structures. However, for large datasets its computational cost can be prohibitive, as it requires the full data to be processed at each iteration. To tackle this difficulty, a variety of scalable algorithms have been proposed in the recent literature. According to the strategies they employ, these algorithms can be grouped into a few categories: stochastic gradient MCMC algorithms (Welling & Teh, 2011; Ahn et al., 2012; Chen et al., 2014; Ding et al., 2014; Betancourt, 2015; Ma et al., 2015; Nemeth & Fearnhead, 2019), split-and-merge algorithms (Scott et al., 2016; Srivastava et al., 2018; Xue & Liang, 2019), mini-batch Metropolis-Hastings algorithms (Korattikara et al., 2014; Bardenet et al., 2014; Maclaurin & Adams, 2014; Chen et al., 2016; Bardenet et al., 2017), nonreversible Markov process-based algorithms (Bierkens et al., 2019; Bouchard-Côté et al., 2018), and discrete sampling algorithms based on the multi-armed bandit (Chen & Ghahramani, 2016). Although scalable algorithms have been developed for both continuous and discrete sampling problems, they are difficult to apply to dimension jumping problems. Dimension jumping is characteristic of variable selection, where the number of parameters changes from iteration to iteration in MCMC simulations. Under their current settings, the stochastic gradient MCMC and nonreversible Markov process-based algorithms are only applicable to problems for which the parameter space has a fixed dimension and the log-posterior density is differentiable with respect to the parameters. For the split-and-merge algorithms, it is unclear how to aggregate samples of different dimensions drawn from the posterior distributions based on different subsets of the data. The multi-armed bandit algorithms are only applicable to problems with a small discrete domain and can be extremely inefficient for high-dimensional variable selection problems. The mini-batch Metropolis-Hastings algorithms are in principle applicable to dimension jumping problems, but they are generally difficult to use. For example, the algorithms of Chen et al. (2016), Korattikara et al. (2014) and Bardenet et al. (2014) perform approximate acceptance tests using subsets of the data, and the amount of data consumed by each test varies significantly from one iteration to another, which compromises their scalability.
The algorithms of Maclaurin & Adams (2014) and Bardenet et al. (2017) perform exact tests but require a lower bound on the parameter distribution across its domain; unfortunately, such a lower bound is usually difficult to obtain.

This paper proposes an extended stochastic gradient Langevin dynamics algorithm which, by introducing appropriate latent variables, extends the stochastic gradient Langevin dynamics algorithm to more general large-scale Bayesian computing problems, such as variable selection and missing data. The extended algorithm is highly scalable and much more efficient than traditional MCMC algorithms. Compared to the mini-batch Metropolis-Hastings algorithms, it is also much easier to use: it consumes a fixed amount of data at each iteration and does not require any lower bound on the parameter distribution.

2. A BRIEF REVIEW OF STOCHASTIC GRADIENT LANGEVIN DYNAMICS
Let $X_N=(X_1,X_2,\ldots,X_N)$ denote a set of $N$ independent and identically distributed samples drawn from the distribution $f(x\mid\theta)$, where $N$ is the sample size and $\theta$ is the parameter. Let $p(X_N\mid\theta)=\prod_{i=1}^N f(X_i\mid\theta)$ denote the likelihood function, let $\pi(\theta)$ denote the prior distribution of $\theta$, and let $\log\pi(\theta\mid X_N)=\log p(X_N\mid\theta)+\log\pi(\theta)$ denote the log-posterior density function (up to an additive constant). If $\theta$ has a fixed dimension and $\log\pi(\theta\mid X_N)$ is differentiable with respect to $\theta$, then the stochastic gradient Langevin dynamics algorithm (Welling & Teh, 2011) can be applied to simulate from the posterior, which iterates by
$$\theta_{t+1}=\theta_t+\frac{\epsilon_{t+1}}{2}\,\widehat{\nabla}_\theta\log\pi(\theta_t\mid X_N)+\sqrt{\epsilon_{t+1}\tau}\,\eta_{t+1},\qquad \eta_{t+1}\sim N(0,I_d), \tag{1}$$
where $d$ is the dimension of $\theta$, $I_d$ is the $d\times d$ identity matrix, $\epsilon_{t+1}$ is the step size (also known as the learning rate), $\tau$ is the temperature, and $\widehat{\nabla}_\theta\log\pi(\theta_t\mid X_N)$ denotes an estimate of $\nabla_\theta\log\pi(\theta_t\mid X_N)$ based on a mini-batch of samples. The learning rate can be decreasing or kept as a constant. For the former, the convergence of the algorithm was studied in Teh et al. (2016); for the latter, in Sato & Nakagawa (2014) and Dalalyan & Karagulyan (2017). Refer to Nemeth & Fearnhead (2019) for more discussion of the theory, implementation and variants of this algorithm.
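As a concrete illustration of the update (1), the following sketch implements one SGLD step for a generic model; the helper names (`grad_log_lik`, `grad_log_prior`) and the Gaussian toy example are illustrative assumptions, not part of the paper.

```python
import numpy as np

def sgld_step(theta, X_batch, N, grad_log_lik, grad_log_prior, eps, tau=1.0, rng=None):
    """One update of (1): theta <- theta + (eps/2) * grad_hat + sqrt(eps * tau) * eta.

    grad_hat estimates the full-data gradient of log pi(theta | X_N) by rescaling
    the mini-batch log-likelihood gradient by N/n and adding the prior gradient.
    """
    rng = rng or np.random.default_rng()
    n = X_batch.shape[0]
    grad_hat = (N / n) * grad_log_lik(theta, X_batch) + grad_log_prior(theta)
    eta = rng.standard_normal(theta.shape)
    return theta + 0.5 * eps * grad_hat + np.sqrt(eps * tau) * eta

# Toy usage: posterior of the mean of N(theta, 1) data under a N(0, 100) prior.
rng = np.random.default_rng(1)
N, n, eps = 10_000, 100, 1e-5
X = rng.normal(2.0, 1.0, size=N)
theta = np.zeros(1)
for _ in range(5_000):
    batch = X[rng.choice(N, n, replace=False)]
    theta = sgld_step(theta, batch, N,
                      grad_log_lik=lambda th, x: np.atleast_1d(np.sum(x - th)),
                      grad_log_prior=lambda th: -th / 100.0,
                      eps=eps, rng=rng)
print(theta, X.mean())  # the draw should fluctuate around the sample mean
```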
3. AN EXTENDED STOCHASTIC GRADIENT LANGEVIN DYNAMICS ALGORITHM

To extend the stochastic gradient Langevin dynamics algorithm to varying-dimensional problems such as variable selection and missing data, we first establish an identity for evaluating $\nabla_\theta\log\pi(\theta\mid X_N)$ in the presence of latent variables. As illustrated below, the latent variable can be the model indicator in variable selection problems or the missing values in missing data problems.

LEMMA 1. For any latent variable $\vartheta$,
$$\nabla_\theta\log\pi(\theta\mid X_N)=\int \nabla_\theta\log\pi(\theta\mid\vartheta,X_N)\,\pi(\vartheta\mid\theta,X_N)\,d\vartheta, \tag{2}$$
where $\pi(\vartheta\mid\theta,X_N)$ and $\pi(\theta\mid\vartheta,X_N)$ denote the conditional distributions of $\vartheta$ and $\theta$, respectively.

Lemma 1 provides a Monte Carlo estimator of $\nabla_\theta\log\pi(\theta\mid X_N)$: average $\nabla_\theta\log\pi(\theta\mid\vartheta,X_N)$ over samples drawn from the conditional distribution $\pi(\vartheta\mid\theta,X_N)$. The identity (2) is similar to Fisher's identity, which has been used to evaluate the gradient of the log-likelihood function in the presence of latent variables; see, e.g., Cappé et al. (2005). When $N$ is large, the computation can be accelerated by subsampling. Let $X_n$ denote a subsample, where $n$ denotes the subsample size. Without loss of generality, we assume that $N$ is a multiple of $n$, i.e., $N/n$ is an integer. Let $X_{n,N}=\{X_n,\ldots,X_n\}$ denote the dataset obtained by duplicating the subsample $N/n$ times, whose total sample size is also $N$. Following from (2), we have
$$\nabla_\theta\log\pi(\theta\mid X_{n,N})=\int \nabla_\theta\log\pi(\theta\mid\vartheta,X_{n,N})\,\pi(\vartheta\mid\theta,X_{n,N})\,d\vartheta. \tag{3}$$
Since $\nabla_\theta\log\pi(\theta\mid X_{n,N})=\nabla_\theta\log p(X_{n,N}\mid\theta)+\nabla_\theta\log\pi(\theta)$ and $\log p(X_{n,N}\mid\theta)=(N/n)\log p(X_n\mid\theta)$ is unbiased for $\log p(X_N\mid\theta)$, $\nabla_\theta\log\pi(\theta\mid X_{n,N})$ forms an unbiased estimator of $\nabla_\theta\log\pi(\theta\mid X_N)$. Moreover, sampling from $\pi(\gamma_S\mid\theta,X_{n,N})$ can be much faster than sampling from $\pi(\gamma_S\mid\theta,X_N)$, as for the former the likelihood only needs to be evaluated on a mini-batch of samples.
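To make the estimator (3) concrete: the conditional gradients are averaged over latent draws, with the mini-batch likelihood term rescaled by $N/n$. A minimal sketch, assuming user-supplied sampler and gradient callables (all names here are illustrative, not from the paper):

```python
import numpy as np

def grad_estimate(theta, X_batch, N, m, sample_latent, grad_log_lik, grad_log_prior_cond):
    """Monte Carlo estimate of grad log pi(theta | X_{n,N}) via Lemma 1 and (3):
    average grad log pi(theta | latent, X_{n,N}) over m draws of the latent
    variable from pi(latent | theta, X_{n,N}); the duplicated subsample scales
    the likelihood gradient by N/n."""
    n = X_batch.shape[0]
    total = 0.0
    for _ in range(m):
        v = sample_latent(theta, X_batch)  # latent ~ pi(. | theta, X_{n,N})
        total = total + (N / n) * grad_log_lik(theta, v, X_batch) + grad_log_prior_cond(theta, v)
    return total / m
```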
3.1. Bayesian Variable Selection

As an illustrative example, we consider variable selection for the linear regression
$$Y=z^T\beta+\varepsilon, \tag{4}$$
where $\varepsilon$ is a zero-mean Gaussian random error with variance $\sigma^2$, $\beta\in\mathbb{R}^p$ is the vector of regression coefficients, and $z=(z_1,z_2,\ldots,z_p)$ is the vector of explanatory variables. Let $\gamma_S=(\gamma_{1S},\ldots,\gamma_{pS})$ be a binary vector indicating the variables included in model $S$, and let $\beta_S$ be the vector of regression coefficients associated with model $S$. From the Bayesian perspective, we are interested in estimating the posterior probability $\pi(\gamma_S\mid X_N)$ for each model $S\in\mathcal{S}$ and the posterior mean $\pi(\rho)=\int\rho(\beta)\,\pi(\beta\mid X_N)\,d\beta$ for some integrable function $\rho(\cdot)$, where $\mathcal{S}$ comprises $2^p$ models. Both quantities can be estimated using the reversible jump Metropolis-Hastings algorithm (Green, 1995) by sampling from the posterior distribution $\pi(\gamma_S,\beta_S\mid X_N)$. However, when $N$ is large, the algorithm can be extremely slow due to repeated scans of the full dataset.

As mentioned above, the existing stochastic gradient MCMC algorithms cannot be directly applied to simulate from $\pi(\gamma_S,\beta_S\mid X_N)$ because of the dimension jumping involved in model transitions. To address this issue, we introduce an auxiliary variable $\theta=(\theta_1,\theta_2,\ldots,\theta_p)$, which links $\gamma_S$ and $\beta_S$ through
$$\beta_S=\theta*\gamma_S=(\theta_1\gamma_{1S},\theta_2\gamma_{2S},\ldots,\theta_p\gamma_{pS}), \tag{5}$$
where $*$ denotes elementwise multiplication. Let $\theta_{[S]}=\{\theta_i:\gamma_{iS}=1,\ i=1,2,\ldots,p\}$ and $\theta_{[-S]}=\{\theta_i:\gamma_{iS}=0,\ i=1,2,\ldots,p\}$ be the subvectors of $\theta$ corresponding to the nonzero and zero elements of $\gamma_S$, respectively. Note that $\beta_S$ is sparse, with all elements in $\theta_{[-S]}$ being zero, while $\theta$ can be dense. Based on the relation (5), we suggest simulating from $\pi(\theta\mid X_N)$ using the stochastic gradient Langevin dynamics algorithm, for which the gradient $\nabla_\theta\log\pi(\theta\mid X_N)$ can be evaluated using Lemma 1 by treating $\gamma_S$ as the latent variable. Let $\pi(\theta)$ denote the prior of $\theta$. To simplify the computation of $\nabla_\theta\log\pi(\theta\mid\gamma_S,X_N)$, we further assume the a priori independence $\pi(\theta\mid\gamma_S)=\pi(\theta_{[S]}\mid\gamma_S)\,\pi(\theta_{[-S]}\mid\gamma_S)$. Then it is easy to derive
$$\nabla_\theta\log\pi(\theta\mid\gamma_S,X_N)=\begin{cases}\nabla_{\theta_{[S]}}\log p(X_N\mid\theta_{[S]},\gamma_S)+\nabla_{\theta_{[S]}}\log\pi(\theta_{[S]}\mid\gamma_S), & \text{for the component }\theta_{[S]},\\[2pt] \nabla_{\theta_{[-S]}}\log\pi(\theta_{[-S]}\mid\gamma_S), & \text{for the component }\theta_{[-S]},\end{cases}$$
which can be used to evaluate $\nabla_\theta\log\pi(\theta\mid X_N)$ by Lemma 1. If a mini-batch of data is used, the gradient can be evaluated based on (3). This leads to the following extended stochastic gradient Langevin dynamics algorithm.

Algorithm 1. [Extended Stochastic Gradient Langevin Dynamics for Bayesian Variable Selection]
(i) (Subsampling) Draw a subsample of size $n$ (with or without replacement) from the full dataset $X_N$ at random, and denote the subsample by $X_n^{(t)}$, where $t$ indexes the iteration.
(ii) (Simulating models) Simulate models $\gamma^{(t)}_{S_1,n},\ldots,\gamma^{(t)}_{S_m,n}$ from the conditional posterior $\pi(\gamma_S\mid\theta^{(t)},X^{(t)}_{n,N})$ by running a short Markov chain, where $X^{(t)}_{n,N}=\{X^{(t)}_n,\ldots,X^{(t)}_n\}$ and $\theta^{(t)}$ is the sample of $\theta$ at iteration $t$.
(iii) (Updating $\theta$) Update $\theta^{(t)}$ by setting
$$\theta^{(t+1)}=\theta^{(t)}+\frac{\epsilon_{t+1}}{2m}\sum_{k=1}^m \nabla_\theta\log\pi(\theta^{(t)}\mid\gamma^{(t)}_{S_k,n},X^{(t)}_{n,N})+\sqrt{\epsilon_{t+1}\tau}\,\eta_{t+1},$$
where $\epsilon_{t+1}$ is the learning rate, $\eta_{t+1}\sim N(0,I_p)$, $\tau$ is the temperature, and $p$ is the dimension of $\theta$.
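A minimal sketch of Algorithm 1 for the linear model (4), with spherical Gaussian priors on $\theta_{[S]}$ and $\theta_{[-S]}$ and a single-flip Metropolis kernel standing in for the reversible jump moves of the supplement; all function names and prior values here are illustrative assumptions.

```python
import numpy as np

def esgld_variable_selection(X, y, n, m, eps, T, sigma2=1.0, tau=1.0,
                             s1=25.0, s0=0.025, seed=0):
    """Extended SGLD for Bayesian variable selection in y = X beta + noise,
    with beta_S = theta * gamma as in (5)."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    theta = np.zeros(p)
    gamma = np.zeros(p, dtype=bool)
    for _ in range(T):
        idx = rng.choice(N, n, replace=False)                  # step (i): subsample
        Xb, yb = X[idx], y[idx]
        grad = np.zeros(p)
        for _ in range(m):                                     # step (ii): short chain over models
            gamma = flip_move(gamma, theta, Xb, yb, N, sigma2, s1, s0, rng)
            resid = yb - Xb @ (theta * gamma)
            g_lik = (N / n) * (Xb.T @ resid) / sigma2 * gamma  # nonzero only for included theta_[S]
            g_prior = np.where(gamma, -theta / s1, -theta / s0)
            grad += g_lik + g_prior
        theta = (theta + 0.5 * eps * grad / m                  # step (iii)
                 + np.sqrt(eps * tau) * rng.standard_normal(p))
    return theta, gamma

def flip_move(gamma, theta, Xb, yb, N, sigma2, s1, s0, rng):
    """One Metropolis flip of a single inclusion indicator, targeting
    pi(gamma | theta, X_{n,N}) with the duplicated-subsample likelihood."""
    n = yb.size
    def log_post(g):
        r = yb - Xb @ (theta * g)
        return (-(N / n) * (r @ r) / (2 * sigma2)
                + np.where(g, -theta**2 / (2 * s1), -theta**2 / (2 * s0)).sum())
    prop = gamma.copy()
    j = rng.integers(gamma.size)
    prop[j] = ~prop[j]
    return prop if np.log(rng.random()) < log_post(prop) - log_post(gamma) else gamma
```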
Theorem 1 justifies the validity of this algorithm; its proof is given in the Appendix.

THEOREM 1. Assume that conditions (A.1)-(A.3) (given in the Appendix) hold, $m$, $p$, $n$ increase with $N$ such that $N\geq n\succ p$ and $m\succ p^{1/2}$, and a constant learning rate $\epsilon\prec 1/N$ is used. Then, as $N\to\infty$:
(i) $W_2(\pi_t,\pi^*)\to 0$ as $t\to\infty$, where $\pi_t$ denotes the distribution of $\theta^{(t)}$, $\pi^*=\pi(\theta\mid X_N)$, and $W_2(\cdot,\cdot)$ denotes the second order Wasserstein distance between two distributions.
(ii) If $\rho(\theta)$ is $\alpha$-Lipschitz for some constant $\alpha>0$, then $\sum_{t=1}^T\rho(\theta^{(t)})/T\overset{p}{\to}\pi^*(\rho)$ as $T\to\infty$, where $\overset{p}{\to}$ denotes convergence in probability and $\pi^*(\rho)=\int_\Theta\rho(\theta)\pi(\theta\mid X_N)\,d\theta$.
(iii) If (A.4) further holds, $\sum_{t=1}^T\sum_{i=1}^m I(\gamma^{(t)}_{S_i,n}=\gamma_S)/(mT)-\pi(\gamma_S\mid X_N)\overset{p}{\to}0$ as $T\to\infty$.

Part (i) establishes the weak convergence of $\theta^{(t)}$: if the total sample size $N$ and the iteration number $t$ are sufficiently large, and the subsample size $n$ and the number of models $m$ simulated at each iteration are reasonably large, then the distribution of $\theta^{(t)}$ converges to the true posterior $\pi(\theta\mid X_N)$ in the second order Wasserstein distance. Refer to Gibbs & Su (2002) for discussion of the relation between the Wasserstein distance and other probability metrics. Parts (ii) and (iii) address our general interest in estimating the posterior mean and the posterior probability, respectively, based on the samples simulated by Algorithm 1. For parts (i), (ii) and (iii), explicit convergence rates are given in equations (3), (5) and (10) of the Appendix, respectively.

Regarding the choice of $m\succ p^{1/2}$, $p$ can be approximately treated as the maximum size of the models under consideration, which is of the same order as the true model; therefore, $m$ can be fairly small under the model sparsity assumption. Theorem 1 is established with a constant learning rate. In practice, one may use a decaying learning rate; see, e.g., Teh et al. (2016), where it is suggested to set $\epsilon_t=O(1/t^\kappa)$ for some $0<\kappa\leq 1$. For a decaying learning rate, Teh et al. (2016) recommended weighted averaging estimators for $\pi^*(\rho)$. Theorem 2 shows that the unweighted averaging estimators used above still work if the learning rate decays slowly, at a rate $\epsilon_t=O(1/t^\kappa)$ with $0<\kappa<1$; if $\kappa=1$, the weighted averaging estimators are still needed. The proof of Theorem 2 is given in the supplementary material.

THEOREM 2. Assume the conditions of Theorem 1 hold. If a decaying learning rate $\epsilon_t=O(1/t^\kappa)$ is used for some $0<\kappa<1$, then parts (i), (ii) and (iii) of Theorem 1 are still valid.

3.2. Missing Data

Missing data are ubiquitous in all fields of science and technology, but how to conduct Bayesian analysis in the presence of missing data under the big data scenario remains unclear. The existing data-augmentation algorithm (Tanner & Wong, 1987) is full-data based and thus can be extremely slow. In this context, we let $X_N$ denote the incomplete data and let $\theta$ denote the model parameters. If we treat the missing values as latent variables, then Lemma 1 can be used to evaluate the gradient $\nabla_\theta\log\pi(\theta\mid X_N)$. However, Algorithm 1 cannot be directly applied to missing data problems, since the imputation of the missing data might depend on the subsample only. To address this issue, we propose Algorithm S1 (given in the supplementary material), where the missing values $\vartheta$ are imputed from $\pi(\vartheta\mid\theta,X_n)$ at each iteration. Theorems 1 and 2 remain applicable to this algorithm.
4. AN ILLUSTRATIVE EXAMPLE

This section illustrates the performance of Algorithm 1 using a simulated example; more numerical examples are presented in the supplementary material. Ten synthetic datasets were generated from the model (4) with $N=50{,}000$, $p=2001$, $\sigma^2=1$, $\beta_1=\cdots=\beta_5=1$, $\beta_6=\beta_7=\beta_8=-1$, and $\beta_9=\cdots=\beta_p=0$, where $\sigma^2$ is assumed to be known and the explanatory variables are normally distributed with a mutual correlation coefficient of 0.5. A hierarchical prior was assumed for the model and parameters, with details given in the supplementary material. For each dataset, Algorithm 1 was run for 5000 iterations with $n=200$, $m=10$ and a constant learning rate $\epsilon_t$, where the first 2000 iterations were discarded as burn-in and the samples generated in the remaining iterations were used for inference. At each iteration, the reversible jump Metropolis-Hastings algorithm (Green, 1995) was used to simulate the models $\gamma^{(t)}_{S_i,n}$, $i=1,2,\ldots,m$, with details given in the supplementary material.

Table 1 summarizes the performance of the algorithm, where the false selection rate (FSR), negative selection rate (NSR), mean squared error for false predictors (MSE$_0$) and mean squared error for true predictors (MSE$_1$) are defined in the supplementary material. The variables were selected according to the median posterior probability rule (Barbieri & Berger, 2004), which selects only the variables with marginal inclusion probability greater than 0.5. The Bayesian estimates of the parameters were obtained by averaging over a set of thinned (by a factor of 10) posterior samples. For comparison, some existing algorithms were applied to this example, with the results given in Table 1 and implementation details given in the supplementary material. The comparison shows that the proposed algorithm substantially alleviates the computational burden of Bayesian methods in big data analysis.

Table 1. Bayesian variable selection with the extended stochastic gradient Langevin dynamics (eSGLD), reversible jump Metropolis-Hastings (RJMH), split-and-merge (SaM) and Bayesian Lasso (B-Lasso) algorithms, where FSR, NSR, MSE$_0$ and MSE$_1$ are reported as averages over 10 datasets with standard deviations in parentheses, and the CPU time (in minutes) was recorded for one dataset on a Linux machine with an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz.

Algorithm   FSR          NSR            MSE0   MSE1   CPU(m)
eSGLD       0(0)         0(0)           —      —      3.3
RJMH        0.50(0.10)   0.16(0.042)    —      —      180.1
SaM         0.05(0.05)   0.013(0.013)   —      —      150.4
B-Lasso     0(0)         0(0)           —      —      32.8
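The median posterior probability rule used above reduces to thresholding the time-averaged inclusion indicators at 0.5; a minimal sketch, assuming the sampled model vectors are stacked in a binary array:

```python
import numpy as np

def median_probability_selection(gammas):
    """gammas: (draws, p) array of post-burn-in inclusion vectors gamma_S.
    Returns the marginal inclusion probabilities and the indices of the
    variables selected by the rule of Barbieri & Berger (2004)."""
    incl_prob = gammas.mean(axis=0)  # estimates pi(gamma_j = 1 | X_N)
    return incl_prob, np.flatnonzero(incl_prob > 0.5)
```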
5. DISCUSSION

This paper has extended the stochastic gradient Langevin dynamics algorithm to general large-scale Bayesian computing problems, such as those involving dimension jumping and missing data. To the best of our knowledge, this paper provides the first Bayesian method and theory for high-dimensional discrete parameter estimation with mini-batch samples; the existing methods work only for continuous parameters or very low dimensional discrete problems. Beyond generalized linear models, the proposed algorithm has many potential applications in data science. For example, it can be used for sparse deep learning and for accelerating computation in statistical models where latent variables are involved, such as hidden Markov models, random coefficient models, and model-based clustering.

Algorithm 1 can be further extended by updating $\theta$ with a variant of stochastic gradient Langevin dynamics, such as stochastic gradient Hamiltonian Monte Carlo (Chen et al., 2014), stochastic gradient thermostats (Ding et al., 2014), stochastic gradient Fisher scoring (Ahn et al., 2012), or preconditioned stochastic gradient Langevin dynamics (Li et al., 2016). We expect that the advantages of these variants over stochastic gradient Langevin dynamics carry over to the extension.

ACKNOWLEDGEMENTS
This work was partially supported by grants DMS-1811812, DMS-1818674 and R01-GM126089. The authors thank the editor, associate editor and referees for their insightful comments and suggestions.

APPENDIX
A.1. Proof of Lemma 1
Proof. Let $\pi(\theta)$ denote the prior density of $\theta$, and let $\pi(\vartheta)$ denote the density of $\vartheta$. Then
$$\begin{aligned}
\nabla_\theta\log\pi(\theta\mid X_N) &= \nabla_\theta\log p(X_N\mid\theta)+\nabla_\theta\log\pi(\theta)\\
&=\frac{1}{p(X_N\mid\theta)}\nabla_\theta\int p(X_N,\vartheta\mid\theta)\,d\vartheta+\nabla_\theta\log\pi(\theta)\\
&=\int\frac{p(X_N,\vartheta\mid\theta)}{p(X_N\mid\theta)}\,\nabla_\theta\log p(X_N,\vartheta\mid\theta)\,d\vartheta+\nabla_\theta\log\pi(\theta)\\
&=\int\pi(\vartheta\mid\theta,X_N)\,\nabla_\theta[\log p(X_N\mid\theta,\vartheta)+\log\pi(\theta\mid\vartheta)+\log\pi(\vartheta)-\log\pi(\theta)]\,d\vartheta+\nabla_\theta\log\pi(\theta)\\
&=\int\nabla_\theta\log\pi(\theta\mid\vartheta,X_N)\,\pi(\vartheta\mid\theta,X_N)\,d\vartheta,
\end{aligned}$$
where the second and third equalities follow from the relation $\nabla_\theta\log g(\theta)=\nabla_\theta g(\theta)/g(\theta)$ (for an appropriate function $g(\theta)$), and the fourth and fifth equalities follow from direct calculations of the conditional distributions. □

A.2. Proof of Theorem 1

Let $\pi^*=\pi(\theta\mid X_N)$ denote the posterior density function of $\theta$, and let $\pi_t$ denote the density of $\theta^{(t)}$ generated by Algorithm 1 at iteration $t$. We are interested in the discrepancy between $\pi^*$ and $\pi_t$ in the second order Wasserstein distance. The following conditions are assumed.

(A.1) The posterior $\pi^*$ is strongly log-concave and gradient-Lipschitz:
$$f(\theta)-f(\theta')-\nabla f(\theta')^T(\theta-\theta')\geq \frac{q_N}{2}\|\theta-\theta'\|^2,\qquad \forall\,\theta,\theta'\in\Theta, \tag{1}$$
$$\|\nabla f(\theta)-\nabla f(\theta')\|\leq Q_N\|\theta-\theta'\|,\qquad \forall\,\theta,\theta'\in\Theta, \tag{2}$$
where $f(\theta)=-\log\pi(\theta\mid X_N)$, and $c'N\leq q_N\leq Q_N\leq cN$ for some positive constants $c$ and $c'$.

(A.2) The posterior $\pi^*$ has a bounded second moment: $\int_\Theta\theta^T\theta\,\pi^*(\theta)\,d\theta=O(p)$.

(A.3) $\max_{S\in\mathcal{S}} E_{X_N}[\|\nabla_\theta\log\pi(\theta\mid\gamma_S,X_N)\|^2\mid\theta]=O(N^2(\|\theta\|^2+p))$, where $E_{X_N}$ denotes expectation with respect to the distribution of $X_N$, and $\mathcal{S}$ denotes the set of all possible models.

(A.4) Let $L_N(\gamma_S,\theta)=\log p(X_N\mid\gamma_S,\theta)/N$, and let $\{L_N^{(i)}(\theta): i=1,2,\ldots,|\mathcal{S}|\}$ be the descending order statistics of $\{L_N(\gamma_S,\theta): S\in\mathcal{S}\}$. Assume that there exists a constant $\delta>0$ such that $\inf_{\theta\in\Theta}(L_N^{(1)}(\theta)-L_N^{(2)}(\theta))\geq\delta$.
Proof. Part (i). In Algorithm 1, the gradient $\nabla\log\pi(\theta^{(t)}\mid X_N)$ is estimated by running a short Markov chain with a mini-batch of data. Since the initial distribution of the Markov chain might not coincide with its equilibrium distribution, the resulting gradient estimate can be biased. Let
$$\zeta^{(t)}=\frac{1}{m}\sum_{k=1}^m\nabla_\theta\log\pi(\theta^{(t)}\mid\gamma^{(t)}_{S_k,n},X^{(t)}_{n,N})-\nabla\log\pi(\theta^{(t)}\mid X_N).$$
Following from (A.3), we have
$$\|E(\zeta^{(t)}\mid\theta^{(t)})\|^2=O\Big[\frac{N^2(\|\theta^{(t)}\|^2+p)}{m^2}\Big],\qquad E\|\zeta^{(t)}-E(\zeta^{(t)}\mid\theta^{(t)})\|^2=O\Big[\frac{N^2(\|\theta^{(t)}\|^2+p)}{mn}\Big].$$
Following from Lemma S2 in the supplementary material, if $m\succ p^{1/2}$, $\epsilon\prec 1/N\prec mn/(N^2p)$, and $V=O(p)$ hold, then
$$W_2(\pi_t,\pi^*)=(1-\omega)^t W_2(\pi_0,\pi^*)+O\big(p^{1/2}/m\big)+O\big((\epsilon p)^{1/2}\big)+O\big(\{\epsilon N^2 p/(mn)\}^{1/2}\big)\to 0,\quad \text{as } t\to\infty, \tag{3}$$
for some $0<\omega<1$, since $q_N\asymp N$ and $Q_N\asymp N$ hold by conditions (A.1) and (A.2).
Part (ii). Since $\rho(\theta)$ is $\alpha$-Lipschitz, we have $|\rho(\theta)|\leq\alpha\|\theta\|+C'$ for some constant $C'$. Further, $\pi^*$ is strongly log-concave, so $\pi^*(|\rho|)<\infty$, i.e., $\rho$ is $\pi^*$-integrable. On the other hand,
$$\Big\|\int\rho(\theta)\,d\pi^*(\theta)-\int\rho(\tilde\theta)\,d\pi_t(\tilde\theta)\Big\|=\|E\rho(\theta)-E\rho(\tilde\theta)\|\leq E\|\rho(\theta)-\rho(\tilde\theta)\|\leq\alpha E\|\theta-\tilde\theta\|\leq\alpha\{E\|\theta-\tilde\theta\|^2\}^{1/2}=\alpha W_2(\pi^*,\pi_t)=o(1), \tag{4}$$
due to (3), where $\theta$ and $\tilde\theta$ are two random variables whose marginal distributions follow $\pi^*$ and $\pi_t$, respectively, $E(\cdot)$ denotes expectation with respect to the joint distribution of $\theta$ and $\tilde\theta$, and the coupling is chosen such that $(E\|\theta-\tilde\theta\|^2)^{1/2}=W_2(\pi^*,\pi_t)$. This implies that $\rho$ is also $\pi_t$-integrable and $\int\rho(\tilde\theta)\,d\pi_t(\tilde\theta)\to\int\rho(\theta)\,d\pi^*(\theta)$ as $t\to\infty$.

Further, by the Markov chain property, the weak law of large numbers applies, and thus $\sum_{t=1}^T\rho(\theta^{(t)})/T-\sum_{t=1}^T\int\rho(\tilde\theta)\,d\pi_t(\tilde\theta)/T=O_p(T^{-1/2})$. Combining this with the above result leads to
$$\sum_{t=1}^T\rho(\theta^{(t)})/T-\pi^*(\rho)=O_p(T^{-1/2})+\alpha\sum_{t=1}^T W_2(\pi^*,\pi_t)/T\to 0. \tag{5}$$
Part (iii). To establish the convergence of $\hat\pi(\gamma_S\mid X_N)$, we define $L_N(\gamma_S,\theta^{(t)})=\log p(X_N\mid\gamma_S,\theta^{(t)})/N$, $L_n(\gamma_S,\theta^{(t)})=\log p(X_n^{(t)}\mid\gamma_S,\theta^{(t)})/n$, and $\xi^{(t)}_{n,S}=L_n(\gamma_S,\theta^{(t)})-L_N(\gamma_S,\theta^{(t)})$ for any $S\in\mathcal{S}$. For each $S$, $\xi^{(t)}_{n,S}$ is approximately Gaussian with $E(\xi^{(t)}_{n,S})=0$ and $\mathrm{Var}(\xi^{(t)}_{n,S})=O(1/n)$. Therefore, for any positive $\nu$, with probability $1-2|\mathcal{S}|^{-\nu}$, $\max_S|\xi_{n,S}|$ is bounded by $\delta_n:=\{(2\nu+2)\log|\mathcal{S}|/n\}^{1/2}=O[\{(\nu+1)p/n\}^{1/2}]$ according to the tail probability of the Gaussian. It implies, with high probability, that if $S$ is the most likely model, i.e., $L_N(\gamma_S,\theta^{(t)})=L_N^{(1)}(\theta^{(t)})$, then
$$\begin{aligned}
&\big|\pi(\gamma_S\mid X^{(t)}_{n,N},\theta^{(t)})-\pi(\gamma_S\mid X_N,\theta^{(t)})\big|\\
&\;=\bigg|\frac{1}{1+\sum_{S'\neq S}e^{N(L_N(\gamma_{S'},\theta^{(t)})-L_N(\gamma_S,\theta^{(t)})+\xi_{n,S'}-\xi_{n,S})}}-\frac{1}{1+\sum_{S'\neq S}e^{N(L_N(\gamma_{S'},\theta^{(t)})-L_N(\gamma_S,\theta^{(t)}))}}\bigg|\\
&\;=\frac{\sum_{S'\neq S}e^{N(L_N(\gamma_{S'},\theta^{(t)})-L_N(\gamma_S,\theta^{(t)}))+b_{S'}}}{\big[1+\sum_{S'\neq S}e^{N(L_N(\gamma_{S'},\theta^{(t)})-L_N(\gamma_S,\theta^{(t)}))+b_{S'}}\big]^2}\,N|\xi_{n,S'}-\xi_{n,S}|\leq(2^p-1)\,e^{-N(\delta-2\delta_n)}\,2N\delta_n\leq e^{-N\delta/2}\to 0,
\end{aligned}$$
if $\nu p\prec n$ (i.e., $\delta_n\prec\delta$) and $N\succ p$, where the second equality follows from the mean-value theorem by viewing the $N(L_N(\gamma_{S'},\theta^{(t)})-L_N(\gamma_S,\theta^{(t)}))$'s as the arguments of $\pi(\gamma_S\mid X_N,\theta^{(t)})$, and $b_{S'}$ denotes a value between $0$ and $N(\xi_{n,S'}-\xi_{n,S})$. Similarly, if $S$ is not the most likely model, we denote by $S^*$ the most likely model and, by the mean-value theorem,
$$\begin{aligned}
&\big|\pi(\gamma_S\mid X^{(t)}_{n,N},\theta^{(t)})-\pi(\gamma_S\mid X_N,\theta^{(t)})\big|\\
&\;=\bigg|\frac{e^{N(L_N(\gamma_S,\theta^{(t)})-L_N(\gamma_{S^*},\theta^{(t)})+\xi_{n,S}-\xi_{n,S^*})}}{\sum_{S'}e^{N(L_N(\gamma_{S'},\theta^{(t)})-L_N(\gamma_{S^*},\theta^{(t)})+\xi_{n,S'}-\xi_{n,S^*})}}-\frac{e^{N(L_N(\gamma_S,\theta^{(t)})-L_N(\gamma_{S^*},\theta^{(t)}))}}{\sum_{S'}e^{N(L_N(\gamma_{S'},\theta^{(t)})-L_N(\gamma_{S^*},\theta^{(t)}))}}\bigg|\\
&\;\leq\big[1+(2^p-1)e^{-N(\delta-2\delta_n)}+e^{N\delta_n}\big]\,e^{-N(\delta-2\delta_n)}\,2N\delta_n\leq e^{-N\delta/2}\to 0.
\end{aligned}$$
In conclusion, with probability $1-2/|\mathcal{S}|^{\nu}$, $|\pi(\gamma_S\mid X^{(t)}_{n,N},\theta^{(t)})-\pi(\gamma_S\mid X_N,\theta^{(t)})|<\exp(-N\delta/2)$ for all $S$, any iteration $t$ and any $\theta^{(t)}\in\Theta$. Then one can choose some $\nu=(n/p)^{1/2}\to\infty$ such that $\pi(\gamma_S\mid X^{(t)}_{n,N},\theta^{(t)})-\pi(\gamma_S\mid X_N,\theta^{(t)})$ is bounded as
$$\max_S E\big|\pi(\gamma_S\mid X^{(t)}_{n,N},\theta^{(t)})-\pi(\gamma_S\mid X_N,\theta^{(t)})\big|\leq\exp(-N\delta/2)+1/|\mathcal{S}|^{\nu}\to 0, \tag{6}$$
for any iteration $t$. Conditioned on $\{\theta^{(t)}: t=1,2,\ldots\}$, the $[\pi(\gamma_S\mid X_{n,N},\theta^{(t)})-\pi(\gamma_S\mid X_N,\theta^{(t)})]$'s are independent and each is bounded by 1, so the weak law of large numbers applies. Therefore, for any $S\in\mathcal{S}$,
$$\frac{1}{T}\sum_{t=1}^T\pi(\gamma_S\mid X^{(t)}_{n,N},\theta^{(t)})-\frac{1}{T}\sum_{t=1}^T\pi(\gamma_S\mid X_N,\theta^{(t)})=O_p(T^{-1/2})+\exp(-N\delta/2)+1/|\mathcal{S}|^{\nu}\to 0, \tag{7}$$
provided $p\prec n\leq N$. Since $\{\theta^{(t)}: t=1,2,\ldots\}$ forms a time-homogeneous Markov chain whose convergence is measured by (3), and the function $\pi(\gamma_S\mid X_N,\theta)$ is bounded and continuous in $\theta$,
$$\frac{1}{T}\sum_{t=1}^T\pi(\gamma_S\mid X_N,\theta^{(t)})-\pi(\gamma_S\mid X_N)=O_p(T^{-1/2}) \tag{8}$$
holds for any $S\in\mathcal{S}$. Combining (8) with (7) leads to
$$\frac{1}{T}\sum_{t=1}^T\pi(\gamma_S\mid X^{(t)}_{n,N},\theta^{(t)})-\pi(\gamma_S\mid X_N)=O_p(T^{-1/2})+\exp(-N\delta/2)+1/|\mathcal{S}|^{\nu}\to 0. \tag{9}$$
Conditioned on $X^{(t)}_{n,N}$ and $\theta^{(t)}$, by the standard theory of MCMC, $m^{-1}\sum_{i=1}^m I(\gamma^{(t,i)}_S=\gamma_S)$ forms a consistent estimator of $\pi(\gamma_S\mid X^{(t)}_{n,N},\theta^{(t)})$ with an asymptotic bias of $O(1/m)$. Since $m$ increases with $p$ and $N$, the estimator is asymptotically unbiased. Combining this result with (9) leads to
$$\frac{1}{mT}\sum_{t=1}^T\sum_{i=1}^m I(\gamma^{(t)}_{S_i,n}=\gamma_S)-\pi(\gamma_S\mid X_N)=O_p(T^{-1/2})+\exp(-N\delta/2)+1/|\mathcal{S}|^{\nu}+O_p(m^{-1/2}), \tag{10}$$
which converges to 0 as $T\to\infty$ and $N\to\infty$. □

REFERENCES
AHN, S., BALAN, A. K. & WELLING, M. (2012). Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML.
BARBIERI, M. & BERGER, J. (2004). Optimal predictive model selection. Annals of Statistics, 870–897.
BARDENET, R., DOUCET, A. & HOLMES, C. (2014). Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In ICML.
BARDENET, R., DOUCET, A. & HOLMES, C. C. (2017). On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 47:1–47:43.
BETANCOURT, M. (2015). The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In ICML.
BIERKENS, J., FEARNHEAD, P. & ROBERTS, G. (2019). The zig-zag process and super-efficient Monte Carlo for Bayesian analysis of big data. Annals of Statistics, 1288–1320.
BOUCHARD-CÔTÉ, A., VOLLMER, S. & DOUCET, A. (2018). The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical Association, 855–867.
CAPPÉ, O., MOULINES, E. & RYDÉN, T. (2005). Inference in Hidden Markov Models. New York: Springer.
CHEN, H., SEITA, D., PAN, X. & CANNY, J. (2016). An efficient minibatch acceptance test for Metropolis-Hastings. arXiv:1610.06848.
CHEN, T., FOX, E. B. & GUESTRIN, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In ICML.
CHEN, Y. & GHAHRAMANI, Z. (2016). Scalable discrete sampling as a multi-armed bandit problem. In ICML.
DALALYAN, A. S. & KARAGULYAN, A. G. (2017). User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. CoRR abs/1710.00095.
DING, N., FANG, Y., BABBUSH, R., CHEN, C., SKEEL, R. D. & NEVEN, H. (2014). Bayesian sampling using stochastic gradient thermostats. In NIPS.
GIBBS, A. & SU, F. (2002). On choosing and bounding probability metrics. International Statistical Review, 419–435.
GREEN, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 711–732.
KORATTIKARA, A., CHEN, Y. & WELLING, M. (2014). Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In ICML.
LI, C., CHEN, C., CARLSON, D. E. & CARIN, L. (2016). Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI.
MA, Y.-A., CHEN, T. & FOX, E. B. (2015). A complete recipe for stochastic gradient MCMC. In NIPS.
MACLAURIN, D. & ADAMS, R. P. (2014). Firefly Monte Carlo: Exact MCMC with subsets of data. In IJCAI.
NEMETH, C. & FEARNHEAD, P. (2019). Stochastic gradient Markov chain Monte Carlo. arXiv:1907.06986.
SATO, I. & NAKAGAWA, H. (2014). Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process. In ICML.
SCOTT, S. L., BLOCKER, A. W., BONASSI, F. V., CHIPMAN, H. A., GEORGE, E. I. & MCCULLOCH, R. E. (2016). Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 78–88.
SRIVASTAVA, S., LI, C. & DUNSON, D. B. (2018). Scalable Bayes via Barycenter in Wasserstein space. Journal of Machine Learning Research, 1–35.
TANNER, M. & WONG, W. (1987). The calculation of posterior distributions by data augmentation (with Discussion). Journal of the American Statistical Association, 528–540.
TEH, Y. W., THIERY, A. & VOLLMER, S. (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 1–33.
WELLING, M. & TEH, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In ICML.
XUE, J. & LIANG, F. (2019). Double-parallel Monte Carlo for Bayesian analysis of big data. Statistics and Computing, 23–32.
Supplementary Material for "Extended Stochastic Gradient MCMC for Large-Scale Bayesian Variable Selection"

By QIFAN SONG, YAN SUN, MAO YE and FAMING LIANG
Department of Statistics, Purdue University, West Lafayette, IN 47906, USA
qfsong, sun748, ye207, [email protected]
This material is organized as follows. Section 1 presents an extended stochastic gradient Langevin dynamics algorithm for missing data problems. Section 2 presents some lemmas and the proof of Theorem 2. Section 3 presents the reversible jump Metropolis-Hastings (RJMH) algorithm used in Algorithm 1 of the main paper for Bayesian variable selection. Section 4 presents more numerical examples.

1. AN EXTENDED STOCHASTIC GRADIENT LANGEVIN DYNAMICS ALGORITHM FOR MISSING DATA PROBLEMS
As mentioned in the main paper, Algorithm 1 cannot be directly applied to missing data problems, because the imputation of the missing data $\vartheta$ might depend on the subsamples only. To address this issue, we propose Algorithm S1, where the missing values (denoted by $\vartheta$) are imputed from $\pi(\vartheta\mid\theta^{(t)},X_n)$ at each iteration $t$.

Algorithm S1. [Extended Stochastic Gradient Langevin Dynamics for missing data problems]
(i) (Subsampling) Draw a subsample of size $n$ from the full dataset $X_N$ at random, and denote the subsample by $X_n^{(t)}$, where $t$ indexes the iteration.
(ii) (Multiple imputation) For the subsample, simulate the missing values $\vartheta^{(t)}_{1,n},\ldots,\vartheta^{(t)}_{m,n}$ from the conditional posterior $\pi(\vartheta\mid\theta^{(t)},X_n^{(t)})$, where $\theta^{(t)}$ is the sample of $\theta$ at iteration $t$.
(iii) (Importance resampling) Resample an imputed value from the set $\{\vartheta^{(t)}_{1,n},\ldots,\vartheta^{(t)}_{m,n}\}$ according to the importance weights $\{p(X_n\mid\theta^{(t)},\vartheta^{(t)}_{1,n})^{N/n-1},\ldots,p(X_n\mid\theta^{(t)},\vartheta^{(t)}_{m,n})^{N/n-1}\}$. Denote the resampled value by $\vartheta^{(t)}$.
(iv) (Updating $\theta$) Update $\theta^{(t)}$ by setting
$$\theta^{(t+1)}=\theta^{(t)}+\frac{\epsilon_{t+1}}{2}\Bigg\{\frac{N}{mn}\sum_{k=1}^m\nabla_\theta\log\pi(\theta^{(t)}\mid\vartheta^{(t)}_{k,n},X_n^{(t)})-\frac{N-n}{n}\nabla_\theta\log\pi(\theta^{(t)})\Bigg\}+\sqrt{\epsilon_{t+1}\tau}\,\eta_{t+1},$$
where $\eta_{t+1}\sim N(0,I_d)$ and $\epsilon_{t+1}$ is the learning rate.

In the case that $\vartheta$, or some components of $\vartheta$, depend on all the data $X_N$, an importance resampling step can be included for simulating samples from $\pi(\vartheta\mid\theta^{(t)},X_{n,N})$, and the corresponding inference can then be made. For example, if Algorithm S1 is used for Bayesian variable selection with the model treated as the latent variable, such a step is needed. For problems where $\vartheta$ depends on the subsamples only, this step can be omitted. The validity of Algorithm S1 follows directly from the fact that $(N/n)\nabla_\theta\log\pi(\theta\mid X_n)-\{(N-n)/n\}\nabla_\theta\log\pi(\theta)$ is an unbiased estimator of $\nabla_\theta\log\pi(\theta\mid X_N)$.
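A minimal sketch of one iteration of Algorithm S1; the imputation and gradient routines are assumed interfaces, and the temperature and step-size conventions follow Algorithm 1 of the main paper.

```python
import numpy as np

def esgld_missing_step(theta, X_batch, N, m, eps, tau, impute, log_lik,
                       grad_log_post_cond, grad_log_prior, rng):
    """One iteration of Algorithm S1. impute draws from pi(missing | theta, X_n);
    log_lik(theta, v, X) = log p(X | theta, v); grad_log_post_cond(theta, v, X)
    = grad_theta log pi(theta | v, X); grad_log_prior = grad_theta log pi(theta)."""
    n = X_batch.shape[0]
    vs = [impute(theta, X_batch) for _ in range(m)]            # (ii) multiple imputation
    logw = np.array([(N / n - 1) * log_lik(theta, v, X_batch) for v in vs])
    w = np.exp(logw - logw.max()); w /= w.sum()
    v_star = vs[rng.choice(m, p=w)]                            # (iii) importance resampling
    grad = ((N / (m * n)) * sum(grad_log_post_cond(theta, v, X_batch) for v in vs)
            - ((N - n) / n) * grad_log_prior(theta))           # (iv) bias-corrected gradient
    theta_new = theta + 0.5 * eps * grad + np.sqrt(eps * tau) * rng.standard_normal(theta.shape)
    return theta_new, v_star
```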
2. LEMMAS AND PROOF OF THEOREM 2

2.1. Some Lemmas
LEMMA S1. Let $X\sim\mu$ and $Y\sim\nu$. Then
$$E\|Y\|^2\leq E\|X\|^2+W_2(\mu,\nu)^2+2\,W_2(\mu,\nu)\{E\|X\|^2\}^{1/2},$$
where $W_2(\cdot,\cdot)$ denotes the second order Wasserstein distance.

Proof. By the definition of the Wasserstein metric, without loss of generality we may assume that $X$ and $Y$ satisfy $E\|X-Y\|^2=W_2(\mu,\nu)^2$. Then
$$\begin{aligned}
\big[E\|Y\|^2-E\|X\|^2\big]-W_2(\mu,\nu)^2 &= EY^TY-EX^TX-\big(EX^TX-2EX^TY+EY^TY\big)\\
&=2EX^TY-2EX^TX=2EX^T(Y-X)\\
&\leq 2E\|X\|\|Y-X\|\leq 2\{E\|X\|^2\,E\|Y-X\|^2\}^{1/2}=2W_2(\mu,\nu)\{E\|X\|^2\}^{1/2}. \qquad \square
\end{aligned}$$

Lemma S2 is a generalization of Theorem 4 of Dalalyan & Karagulyan (2017); compared to that result, Lemma S2 allows the noise of the stochastic gradients to scale with $\|\theta^{(t-1)}\|$. A similar result can be found in Bhatia et al. (2019).
LEMMA S2. Let $\theta^{(t-1)}$ and $\theta^{(t)}$ be two random vectors in $\Theta$ satisfying
$$\theta^{(t)}=\theta^{(t-1)}-\epsilon[\nabla f(\theta^{(t-1)})+\zeta^{(t-1)}]+\sqrt{2\epsilon}\,e^{(t)},$$
where $\zeta^{(t-1)}$ denotes the independent random error of the gradient estimate, which can depend on $\theta^{(t-1)}$, and $e^{(t)}\sim N(0,I_p)$. Let $\pi_t$ be the distribution of $\theta^{(t)}$, and let $\pi^*\propto\exp\{-f\}$ be the target distribution. Suppose $\zeta^{(t)}$ satisfies
$$\|E(\zeta^{(t-1)}\mid\theta^{(t-1)})\|\leq\eta(\sqrt{p}+\|\theta^{(t-1)}\|),\qquad E[\|\zeta^{(t-1)}-E(\zeta^{(t-1)}\mid\theta^{(t-1)})\|^2]\leq\sigma^2(p+\|\theta^{(t-1)}\|^2),$$
for some constants $\eta$ and $\sigma$ which may depend on $m,n,p$ and $N$, and the $\zeta^{(t-1)}$'s are independent of the $e^{(t)}$'s. If conditions (A.1) and (A.2) (given in the main paper) are satisfied, $q_N>\eta+\sqrt{2}\,\sigma$, and the learning rate satisfies $\epsilon\leq\min\{2/(q_N+Q_N),\,1/q_N\}$, then
$$W_2(\pi_t,\pi^*)\leq[1-(q_N-\eta-\sqrt{2}\sigma)\epsilon]^t W_2(\pi_0,\pi^*)+\frac{\eta(\sqrt{p}+\sqrt{V})}{q_N+\eta+\sqrt{2}\sigma}+\frac{1.65\,Q_N\sqrt{\epsilon p}}{q_N+\eta+\sqrt{2}\sigma}+\frac{\sqrt{\epsilon}\,\sigma^2(p+2V)}{1.65\,Q_N p^{1/2}+\{q_N+\eta+\sqrt{2}\sigma\}^{1/2}\{\sigma^2(p+2V)\}^{1/2}}, \tag{S1}$$
where $V=\int\|\theta\|^2\pi^*(\theta)\,d\theta$.

Proof. By the same arguments as used in the proof of Theorem 4 of Dalalyan & Karagulyan (2017), we have
$$W_2(\pi_{k+1},\pi^*)^2\leq\Big[(1-q_N\epsilon)W_2(\pi_k,\pi^*)+1.65\,Q_N(\epsilon^3p)^{1/2}+\epsilon\eta\sqrt{p}+\epsilon\eta\{E\|\theta^{(k)}\|^2\}^{1/2}\Big]^2+\epsilon^2\sigma^2p+\epsilon^2\sigma^2E\|\theta^{(k)}\|^2.$$
By Lemma S1, $E\|\theta^{(k)}\|^2\leq(W_2(\pi_k,\pi^*)+\sqrt{V})^2$, so we can derive
$$\begin{aligned}
W_2(\pi_{k+1},\pi^*)^2&\leq\Big[(1-q_N\epsilon)W_2(\pi_k,\pi^*)+1.65\,Q_N(\epsilon^3p)^{1/2}+\epsilon\eta\sqrt{p}+\epsilon\eta(W_2(\pi_k,\pi^*)+\sqrt{V})\Big]^2\\
&\qquad+\epsilon^2\sigma^2p+\epsilon^2\sigma^2(W_2(\pi_k,\pi^*)+\sqrt{V})^2\\
&\leq\Big[(1-q_N\epsilon+\eta\epsilon)W_2(\pi_k,\pi^*)+1.65\,Q_N(\epsilon^3p)^{1/2}+\epsilon\eta(\sqrt{p}+\sqrt{V})\Big]^2\\
&\qquad+\epsilon^2\sigma^2(p+2V)+2\epsilon^2\sigma^2W_2(\pi_k,\pi^*)^2\\
&\leq\Big[(1-q_N\epsilon+\eta\epsilon+\sqrt{2}\sigma\epsilon)W_2(\pi_k,\pi^*)+1.65\,Q_N(\epsilon^3p)^{1/2}+\epsilon\eta(\sqrt{p}+\sqrt{V})\Big]^2+\epsilon^2\sigma^2(p+2V).
\end{aligned}$$
The proof can then be concluded by Lemma 1 of Dalalyan & Karagulyan (2017). □
LEMMA S3. Consider a stochastic gradient Langevin dynamics algorithm
$$\theta^{(t)}=\theta^{(t-1)}+\epsilon_t[F(\theta^{(t-1)})+\zeta^{(t-1)}]+\sqrt{2\epsilon_t}\,e^{(t)},\qquad e^{(t)}\sim N(0,I_p),$$
for some smooth function $F$, where the independent random error $\zeta^{(t-1)}$ can depend on $\theta^{(t-1)}$ and $E\zeta^{(t-1)}=0$. Suppose the learning rate satisfies $\lim_t\epsilon_t=0$, $\sum_{t=1}^\infty\epsilon_t=\infty$, $\sum_{t=1}^\infty|1/\epsilon_{t+1}-1/\epsilon_t|/t^2<\infty$ and $\sum_t(\epsilon_t t^2)^{-1}<\infty$, and furthermore that there exists a Lyapunov function $V(\theta)$, which tends to infinity as $\|\theta\|\to\infty$, is twice differentiable with bounded second derivatives, and satisfies the following conditions:
$$\langle F(\theta),\theta\rangle\leq -r\|\theta\|^2\ \text{ for all }\|\theta\|>M\text{ and some }r,M>0, \tag{S2}$$
$$E\|\zeta^{(t-1)}\|^{2k}\leq V^k(\theta^{(t-1)})\ \text{ for some }k\geq 1, \tag{S3}$$
$$\|\nabla V(\theta)\|^2+\|F(\theta)\|^2\leq V(\theta), \tag{S4}$$
$$\langle\nabla V(\theta),F(\theta)\rangle\leq-\alpha V(\theta)+\beta\ \text{ for some }\alpha,\beta>0, \tag{S5}$$
then the following results hold:
1. There exists a unique invariant distribution, $\pi_0$, for the diffusion process $d\theta_t=(1/2)F(\theta_t)\,dt+dW_t$.
2. For any function $h$ such that $|h|/V^{k'}$ is bounded for some $k'\leq k/2$,
$$\lim_{T\to\infty}\frac{1}{T}\sum_{i=1}^T h(\theta^{(i)})=E_{\pi_0}h(\theta),\quad a.s.$$

Proof. Part 1: Condition (S2) ensures the existence of the invariant distribution by Proposition 4.1 of Douc et al. (2009).
Part 2: The convergence of the sample path average is due to Theorem 7 of Teh et al. (2016). Note that Theorem 7 of Teh et al. (2016) shows the convergence of the weighted average $\sum_{i=1}^T w_i\theta^{(i)}/\sum_{i=1}^T w_i$ with $w_i\to 0$ and $F=\nabla f$ for some $f$, but its proof is applicable to our case (with a general drift $F$ and $w_i\equiv 1$ for all $i$) as well. □

2.2. Proof of Theorem 2

Proof. (i) Without loss of generality, let $\epsilon_t=1/t^\kappa$, where $\kappa\in(0,1)$. Define $\chi=\kappa/(1-\kappa)$, and let $T_0$ be the index such that $\max\{q_N,(Q_N+q_N)/2\}\epsilon_{T_0}<1$. Let $T_0\leq T_1<T_2<\ldots$ be a sequence of integers satisfying
$$(s+1)^{-\chi}\leq\epsilon_{T_1}<\epsilon_{T_1-1}\leq s^{-\chi},\quad (s+2)^{-\chi}\leq\epsilon_{T_2}<\epsilon_{T_2-1}\leq(s+1)^{-\chi},\quad\ldots,$$
where $s$ is some integer. It is easy to see that $T_{i+1}-T_i\approx(\chi/\kappa)(s+i)^{\chi/\kappa-1}\asymp T_{i+1}^\kappa$.
Therefore, by arguments similar to those used in Lemma S2 and Theorem 1, for a fixed but sufficiently large $N$,
$$W_2(\pi_{T_{i+1}},\pi^*)\leq\big(1-(q_N-\eta-\sqrt{2}\sigma)\epsilon_{T_{i+1}}\big)^{T_{i+1}-T_i}W_2(\pi_{T_i},\pi^*)+C_1\Big(\frac{\sqrt{p}}{m}\Big)+C_2\Bigg(\sqrt{p}+\Big\{\frac{N^2p}{mn}\Big\}^{1/2}\Bigg)\sqrt{\epsilon_{T_i}}, \tag{S6}$$
for some constants $C_1$ and $C_2$, where $q_N-\eta-\sqrt{2}\sigma>0$. For a fixed value of $N$, $(1-(q_N-\eta-\sqrt{2}\sigma)\epsilon_{T_{i+1}})^{T_{i+1}-T_i}\leq(1-(q_N-\eta-\sqrt{2}\sigma)(T_{i+1})^{-\kappa})^{T_{i+1}-T_i}$ converges to some positive constant less than 1 as $i\to\infty$, since $T_{i+1}-T_i\asymp T_{i+1}^\kappa$. If we further assume that $(1-(q_N-\eta-\sqrt{2}\sigma)(T_{i+1})^{-\kappa})^{T_{i+1}-T_i}\leq\omega<1$ for all $i$, then (S6) reduces to
$$W_2(\pi_{T_{i+1}},\pi^*)\leq\omega W_2(\pi_{T_i},\pi^*)+C_1\Big(\frac{\sqrt{p}}{m}\Big)+C_2\Bigg(\sqrt{p}+\Big\{\frac{N^2p}{mn}\Big\}^{1/2}\Bigg)\sqrt{\epsilon_{T_i}}.$$
Iterating the above inequality, we have
$$\begin{aligned}
W_2(\pi_{T_K},\pi^*)&\leq\omega^K W_2(\pi_{T_0},\pi^*)+\sum_{i=1}^{K-1}\omega^i C_1\frac{\sqrt{p}}{m}+C_2\Bigg(\sqrt{p}+\Big\{\frac{N^2p}{mn}\Big\}^{1/2}\Bigg)\sum_{i=1}^K\omega^{K-i}\sqrt{\epsilon_{T_i}}\\
&\leq\omega^K W_2(\pi_{T_0},\pi^*)+\frac{\omega}{1-\omega}C_1\frac{\sqrt{p}}{m}+C_2\Bigg(\sqrt{p}+\Big\{\frac{N^2p}{mn}\Big\}^{1/2}\Bigg)\sum_{i=1}^K\omega^{K-i}(s+i-1)^{-\chi/2}\\
&\to\frac{\omega}{1-\omega}C_1\sqrt{p}/m\ \ (\text{as }K\to\infty)\ \to 0\ \ (\text{as }N\to\infty,\ \sqrt{p}/m\to 0),
\end{aligned}$$
which concludes part (i).

(ii) The extended stochastic gradient Langevin dynamics algorithm can be expressed as
$$\begin{aligned}
\theta^{(t+1)}&=\theta^{(t)}+\frac{\epsilon_{t+1}}{m}\sum_{k=1}^m\nabla_\theta\log\pi(\theta^{(t)}\mid\gamma^{(t)}_{S_k},X^{(t)}_{n,N})+\sqrt{2\epsilon_{t+1}}\,\eta_{t+1}\\
&=\theta^{(t)}+\epsilon_{t+1}\nabla_\theta\log\pi(\theta^{(t)}\mid X_N)+\epsilon_{t+1}\xi(\theta^{(t)})+\epsilon_{t+1}\zeta^{(t)}+\sqrt{2\epsilon_{t+1}}\,\eta_{t+1},
\end{aligned}$$
where $\xi(\theta^{(t)})=E[\sum_{k=1}^m\nabla_\theta\log\pi(\theta^{(t)}\mid\gamma^{(t)}_{S_k},X^{(t)}_{n,N})/m]-\nabla_\theta\log\pi(\theta^{(t)}\mid X_N)$ and $\zeta^{(t)}=\sum_{k=1}^m\nabla_\theta\log\pi(\theta^{(t)}\mid\gamma^{(t)}_{S_k},X^{(t)}_{n,N})/m-E[\sum_{k=1}^m\nabla_\theta\log\pi(\theta^{(t)}\mid\gamma^{(t)}_{S_k},X^{(t)}_{n,N})/m]$ denote the bias term and the noise term, respectively. It has been shown in the proof of Theorem 1 that
$$\|\xi(\theta)\|=o[N(\|\theta\|+\sqrt{p})\sqrt{p}],\qquad E(\|\zeta^{(t)}\|^2\mid\theta^{(t)})=O\Big(\frac{N^2(\|\theta^{(t)}\|^2+p)}{mn}\Big).$$
With the same notation as in Lemma S3, $F(\theta)=\nabla\log\pi(\theta\mid X_N)+\xi(\theta)$. Because $-\log\pi(\theta\mid X_N)$ is $Q_N$-gradient-Lipschitz continuous and $q_N$-strongly convex with $q_N\asymp Q_N\asymp N$, we have
$$\langle F(\theta),\theta\rangle\leq-q_N\|\theta\|^2+\|\theta\|\,\big\|\nabla_\theta\log\pi(\theta\mid X_N)\big|_{\theta=0}\big\|+o\big(N\sqrt{p}(\|\theta\|+\|\theta\|^2)\big),$$
which implies condition (S2). It is also easy to verify conditions (S3)-(S5) by choosing $k=1$ and $V(\theta)=C(\|\theta\|^2+1)$ for some sufficiently large constant $C$. Also, the choice $\epsilon_t\propto t^{-\kappa}$ with $\kappa\in(0,1)$ satisfies the learning-rate conditions of Lemma S3, and any Lipschitz continuous $\rho$ satisfies that $|\rho|/V$ is bounded. Putting all these together implies, with given $X_N$, $N$, $p$ and $m$, that the diffusion process $d\theta(t)=(1/2)F(\theta(t))\,dt+dW(t)$ has a unique invariant distribution $\pi_N$, and
$$\lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^T\rho(\theta^{(t)})=E_{\pi_N}\rho(\theta),\quad a.s. \tag{S7}$$
As implied by part (i) of Theorem 1, with proper growth rates of $m$ and $n$, $\pi_N$ converges to $\pi^*$ in the $W_2$ distance. Hence,
$$E_{\pi_N}\rho(\theta)\to E_{\pi^*}\rho(\theta). \tag{S8}$$
That is, the unweighted estimator given in part (ii) of Theorem 1 is still valid if the learning rate is $\epsilon_t=O(1/t^\kappa)$ for some $0<\kappa<1$.

(iii) For the proof of part (iii) of Theorem 1 in the main paper, equation (8) still holds by (S7) and (S8), and the other parts of the proof hold independently of the setting of the learning rate. Therefore, the unweighted estimator given in part (iii) of Theorem 1 is still valid if we set the learning rate $\epsilon_t=O(1/t^\kappa)$ for some $0<\kappa<1$. This concludes the proof of Theorem 2. □
3. THE REVERSIBLE JUMP METROPOLIS-HASTINGS ALGORITHM USED IN ALGORITHM 1

This section describes the reversible jump Metropolis-Hastings algorithm used in Algorithm 1 for simulating the models $\gamma^{(t)}_{S_i,n}$, $i=1,2,\ldots,m$, at each iteration. The reversible jump MH algorithm consists of three types of moves, namely birth, death and exchange. Let $\{w_i\}_{i=0}^p$ denote the weights assigned to the intercept term and the variables $z_i$, $i=1,2,\ldots,p$. For linear models, we assign $w_0=\exp(-\epsilon_0)$ and $w_i=\exp(|\rho(y,z_i)|-\epsilon_0)$ for $i=1,\ldots,p$, where $\rho(y,z_i)$ denotes the correlation coefficient between $y$ and $z_i$; for generalized linear models, let $d_i$ denote the deviance of the model $S_i$, which consists of the variable $z_i$ and the intercept term only, and we assign $w_0=\exp(-\epsilon_0)$ and $w_i=\exp\{-(d_i-\min_j d_j)/(5\sigma_d)-\epsilon_0\}$ for all $i=1,\ldots,p$, where $\epsilon_0$ is a small positive constant and $\sigma_d$ denotes the standard deviation of $d_1,d_2,\ldots,d_p$. Let $S^{(t)}$ denote the set of variables included in the current model, and let $S^{(t)c}$ denote its complement. The birth move randomly selects a variable from $S^{(t)c}$ (with probability proportional to its weight) and adds it to $S^{(t)}$. The death move randomly selects a variable from $S^{(t)}$ (with probability proportional to one minus its weight $w_i$) and removes it. The exchange move randomly selects a variable from $S^{(t)}$ (with probability proportional to one minus its weight) and replaces it by another variable randomly selected from $S^{(t)c}$ (with probability proportional to its weight); the exchange move keeps the size of $S^{(t)}$ unchanged. In addition, when a variable is selected to be added to or removed from $S^{(t)}$, the sign of the corresponding component of $\theta$ is randomized, i.e., altered with probability 0.5. This accommodates the prior setting of $\theta_{[-S]}$ in equation (7) of the main paper. In simulations, we set the proposal probabilities $P(\text{birth}\mid S^{(t)})=P(\text{death}\mid S^{(t)})=P(\text{exchange}\mid S^{(t)})=1/3$ if $0<|S^{(t)}|<q$, $P(\text{birth}\mid S^{(t)})=1$ if $|S^{(t)}|=0$, and $P(\text{death}\mid S^{(t)})=1$ if $|S^{(t)}|=q$. Note that in the reversible jump moves no regression coefficients are updated; this differs from traditional reversible jump MH algorithms, where the regression coefficients $\theta_{S^{(t)}}$ might need to be updated accordingly.
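A minimal sketch of the birth/death/exchange proposal mechanism described above (acceptance ratios and the sign randomization of $\theta$ are omitted; the weights are assumed to lie in (0, 1)):

```python
import numpy as np

def propose_model(S, weights, q_max, rng):
    """One reversible-jump proposal. S: set of included variable indices;
    weights: array of per-variable weights w_i; q_max: maximum model size."""
    p = len(weights)
    outside = np.array(sorted(set(range(p)) - S))
    inside = np.array(sorted(S))
    if len(S) == 0:
        move = "birth"
    elif len(S) == q_max:
        move = "death"
    else:
        move = rng.choice(["birth", "death", "exchange"])  # each with probability 1/3
    S_new = set(S)
    if move in ("birth", "exchange"):                      # add: prob. proportional to w_i
        w = weights[outside]
        S_new.add(int(rng.choice(outside, p=w / w.sum())))
    if move in ("death", "exchange"):                      # remove: prob. proportional to 1 - w_i
        w = 1.0 - weights[inside]
        S_new.discard(int(rng.choice(inside, p=w / w.sum())))
    return S_new, move
```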
4. NUMERICAL EXAMPLES

4.1. A Linear Regression Example

This section provides more details for the illustrative example reported in the main paper. The true model follows
$$y=z_1+z_2+z_3+z_4+z_5-z_6-z_7-z_8+0\cdot z_9+\cdots+0\cdot z_p+\varepsilon, \tag{S9}$$
where $\varepsilon\sim N(0,I_N)$ and $I_N$ denotes an $N\times N$ identity matrix. The explanatory variables $z_1,\ldots,z_p$ were generated from a multivariate Gaussian distribution with a mutual correlation coefficient of 0.5. We generated 10 independent datasets from the model (S9) with $N=50{,}000$ and $p=2000$. A hierarchical prior was assumed for the model and parameters. For each model $S$, we set
$$\pi(\gamma_S)\propto\lambda^{|S|}(1-\lambda)^{p+1-|S|}I(|S|\leq q), \tag{S10}$$
where $|S|$ denotes the number of explanatory variables included in the model, $q$ is set to 50, and $\lambda$ is set to $1/(p+1)$.
Here the factor $p+1$ reflects that the intercept term is included in the model as a special explanatory variable. Conditioned on $\gamma_S$, we set
$$\theta_{[S]}\sim N(0,\sigma_1^2 I_{|S|}),\qquad \theta_{[-S]}\sim N(0,\sigma_0^2 I_{p+1-|S|}), \tag{S11}$$
where $\theta_{[S]}$ and $\theta_{[-S]}$ are the auxiliary variables corresponding to the variables included in and excluded by the model $S$, respectively; $\sigma_1^2$ is set to a constant of 25, and $\sigma_0^2$ is set to a constant of 0.025.

Algorithm 1 was run for 5000 iterations with the subsample size $n=200$, the number of models $m=10$ drawn at each iteration, and a constant learning rate $\epsilon_t$, where the first 2000 iterations were discarded as burn-in. In each iteration, the subset models were drawn using the reversible jump Metropolis-Hastings algorithm described in Section 3 of this material. Regarding the choice of the learning rate, we note that Theorem 1 of the main paper suggests that it should be set at the order $o(1/N)$. In all our examples we did not tune the learning rate much, simply setting it around $1/N$ while avoiding possible blowups of the simulation. Refer to Nemeth & Fearnhead (2019) for discussion of this issue for the general stochastic gradient Langevin dynamics algorithm.

The variables were selected according to the median posterior probability rule (Barbieri & Berger, 2004), which selects only the variables with marginal inclusion probability greater than 0.5. Table 1 of the main paper summarizes the performance of the algorithm, where the false selection rate (FSR), negative selection rate (NSR), MSE$_0$ and MSE$_1$ are reported as averages over the 10 independent datasets. To be precise, we define
$$\mathrm{FSR}=\frac{1}{10}\sum_{i=1}^{10}\frac{|\hat S_i\setminus S^*|}{|\hat S_i|},\qquad \mathrm{NSR}=\frac{1}{10}\sum_{i=1}^{10}\frac{|S^*\setminus\hat S_i|}{|S^*|},$$
where $S^*$ denotes the set of true variables, $\hat S_i$ denotes the set of selected variables for dataset $i$, and $|\cdot|$ denotes the cardinality of a set. The accuracy of the parameter estimates is measured by
$$\mathrm{MSE}_0=\frac{1}{10}\sum_{i=1}^{10}\mathrm{MSE}(\hat\beta_i^{(0)}),\qquad \mathrm{MSE}_1=\frac{1}{10}\sum_{i=1}^{10}\mathrm{MSE}(\hat\beta_i^{(1)}),$$
where $i$ indexes the datasets, and $\hat\beta_i^{(0)}$ and $\hat\beta_i^{(1)}$ denote the estimates of the coefficients of the false and true variables, respectively. The estimate $\hat\beta_i$ is obtained by averaging over a set of thinned (by a factor of 10) posterior samples produced by Algorithm 1. The extremely small values of MSE$_0$ and MSE$_1$ and the zero values of FSR and NSR indicate that all the simulations converged to the true model.
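The FSR and NSR defined above can be computed directly from the selected and true variable sets; a minimal sketch:

```python
import numpy as np

def fsr_nsr(selected_sets, true_set):
    """selected_sets: one set of selected indices per dataset; true_set: S*."""
    fsr = np.mean([len(S - true_set) / max(len(S), 1) for S in selected_sets])
    nsr = np.mean([len(true_set - S) / len(true_set) for S in selected_sets])
    return fsr, nsr

# Example: perfect selection on two datasets gives FSR = NSR = 0.
print(fsr_nsr([{1, 2, 3}, {1, 2, 3}], {1, 2, 3}))
```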
For comparison, some existing algorithms, including the reversible jump Metropolis-Hastings algorithm (Green, 1995), split-and-merge (Song & Liang, 2015) and Bayesian Lasso (Park & Casella, 2008), were applied to this example. The reversible jump Metropolis-Hastings algorithm was also run for 5000 iterations. As in the implementation of the extended stochastic gradient Langevin dynamics algorithm, the first 2000 iterations were discarded as burn-in, the variables were selected according to the median posterior probability rule (Barbieri & Berger, 2004), and the regression coefficients were estimated based on the samples collected from the last 3000 iterations (thinned by a factor of 10).

For the split-and-merge algorithm, we first split the data into 5 equal subsets along the dimension of predictors; that is, each subset consists of 400 predictors and 50,000 observations. For each subset, we imposed a hierarchical prior on the model and parameters, where each predictor was assigned a small prior inclusion probability and, if selected, its coefficient follows a zero-mean Gaussian prior. For each subset, the reversible jump Metropolis-Hastings algorithm was run for 2500 iterations, where the first 1000 iterations were discarded as burn-in, and the predictors with marginal posterior inclusion probability greater than 0.3 were selected. The predictors selected from the subsets were then merged into one dataset. For the merged dataset, we imposed the same hierarchical prior on the model and parameters, and ran the reversible jump Metropolis-Hastings algorithm for 2500 iterations with the first 1000 iterations discarded as burn-in. The final model was selected according to the median posterior probability rule, and the parameters were estimated based on the samples collected from the last 1500 iterations.

For Bayesian Lasso, we adopt the Laplace prior $\pi(\beta)\propto\exp(-\lambda\sum|\beta_j|)$ with $\lambda=\{N\log(p)\}^{1/2}$. The Gibbs sampler was used to simulate posterior samples, as described in Park & Casella (2008), for 200 iterations. The predictors with the absolute value of the coefficient estimate greater than 0.1 were selected as the "true" predictors, where each coefficient was estimated by averaging its posterior samples over the iterations.

Among the four algorithms, the extended stochastic gradient Langevin dynamics algorithm cost the least CPU time and perfectly selected all true variables. The reversible jump Metropolis-Hastings algorithm took on average about 3 CPU hours to complete 5000 iterations; however, the resulting Markov chain had not yet mixed well and yielded high FSR, NSR and MSE values. It is worth noting that the CPU time required by the reversible jump Metropolis-Hastings algorithm varied considerably among the ten runs, ranging from 42 to 305 minutes. This is due to the local trapping suffered by the reversible jump Metropolis-Hastings algorithm: the CPU time can increase dramatically if the Markov chain is frequently trapped at large models. By working on lower-dimensional subset data, the split-and-merge algorithm significantly improves the performance of the reversible jump Metropolis-Hastings algorithm, but it is still inferior to the extended stochastic gradient Langevin dynamics algorithm in both variable selection accuracy and computational efficiency. We note that the split-and-merge algorithm can be further accelerated by parallel computing, although this was not done here. The Bayesian Lasso was also capable of perfectly recovering the true model. For this algorithm, the major CPU cost lies in computing, at each iteration, the inverse of a $p\times p$ matrix, which, unfortunately, is not scalable with respect to the dimension $p$. Although the algorithm was run for only 200 iterations, it still cost more CPU time than the extended stochastic gradient Langevin dynamics algorithm. It is worth noting that, beyond linear regression, the Bayesian Lasso algorithm would not be so efficient, as conjugate Gibbs updates would no longer be available.

4.2. A Logistic Regression Example

In addition to the linear regression example reported in the main paper, we also tested Algorithm 1 on large-scale logistic regression. We simulated 10 large datasets from the model
$$\mathrm{logit}\,P(Y_i=1)=\sum_{j=1}^p\beta_j z_{ij},\qquad i=1,2,\ldots,N, \tag{S12}$$
where $N=50{,}000$, $p=2000$, $\beta_1=\cdots=\beta_5=1$, $\beta_6=\beta_7=\beta_8=-1$, and $\beta_9=\cdots=\beta_p=0$. The explanatory variables $z_i$, where $z_i=(z_{i1},\ldots,z_{ip})^T$, were generated such that they have a mutual correlation coefficient of 0.5.
The response variables were generated such that half of them take the value 1 and the other half the value 0.

To conduct Bayesian analysis for these datasets, we adopted the same prior setting as in Liang et al. (2013). For each model $S$, we set
$$\pi(\gamma_S)\propto\lambda^{|S|}(1-\lambda)^{p+1-|S|}I(|S|\leq q), \tag{S13}$$
where $|S|$ denotes the number of explanatory variables included in the model, $q$ is set to 500, $\lambda$ is set to $1/\{(p+1)\zeta\sqrt{2\pi}\}$, and $\zeta$ is a small positive constant. As in the linear regression case, the factor $p+1$ reflects that the intercept term is included in the model as a special explanatory variable. Conditioned on $\gamma_S$, we set
$$\theta_{[S]}\sim N(0,\sigma_S^2 I_{|S|}),\qquad \theta_{[-S]}\sim N(0,\sigma_0^2 I_{p+1-|S|}), \tag{S14}$$
where $\theta_{[S]}$ and $\theta_{[-S]}$ denote the auxiliary variables corresponding to the variables included in and excluded by the model $S$, respectively; $\sigma_0^2$ is set to a constant of 0.025; and $\sigma_S^2=\exp\{C_1/|S|\}/(2\pi)$. For this example, we set $C_1=10$. As shown in Liang et al. (2013), such a prior setting ensures posterior consistency and variable selection consistency for high-dimensional generalized linear models, even when $p$ is much larger than $N$.

For each dataset, Algorithm 1 was run for 5000 iterations with the subsample size $n=300$, a constant learning rate $\epsilon_t$, and the number of models $m=10$ simulated at each iteration, where the first 2000 iterations were discarded as burn-in and the samples generated in the remaining iterations were used for inference. At each iteration, the models were simulated using the same reversible jump Metropolis-Hastings algorithm as described in Section 3 of this material.

The results are summarized in Table S1. For each dataset of this example, the extended stochastic gradient Langevin dynamics algorithm took only 2.9 CPU minutes on a personal computer of 3.6GHz. The numerical results indicate again that the extended stochastic gradient Langevin dynamics algorithm substantially alleviates the computational burden of Bayesian methods in big data analysis.

Table S1. Bayesian variable selection for logistic regression with the extended stochastic gradient Langevin dynamics (eSGLD) algorithm, where FSR, NSR, MSE$_0$ and MSE$_1$ are reported as averages over 10 datasets with standard deviations in parentheses, and the CPU time (in minutes) was recorded for one dataset on a Linux machine with an Intel(R) Core(TM) i7 [email protected].

Model      Algorithm   FSR    NSR    MSE0   MSE1   CPU(minutes)
Logistic   eSGLD       0(0)   0(0)   —      —      2.9

4.3. A Pascal Challenge Dataset

We considered the dataset epsilon with logistic regression. The dataset, available at ∼cjlin/libsvmtools/datasets, was used for the Pascal large scale learning challenge in 2008. The raw dataset consists of 500,000 observations and 2000 features, and was split into two parts: 400,000 observations for training and 100,000 observations for testing. The training part is feature-wise normalized to mean zero and variance one and then observation-wise scaled to unit length. Using the scaling factors of the training part, the testing part is processed in the same way.

For this example, we applied the same prior settings as in Section 4.2 of this material, except that $C_1$ was set to 2.5. A small value of $C_1$ encourages the selection of a larger set of variables while keeping the regression coefficients of the selected variables relatively small. Our numerical experience shows that this often improves prediction performance for real data problems, for which the true model might deviate from logistic regression and thus need to be approximated with a large number of variables, each with a small effect. With this prior setting, Algorithm 1 was run on the dataset 10 times independently. Each run consisted of 1000 iterations, where the first 500 iterations were discarded as burn-in and the samples drawn in the remaining iterations were used for inference. In the simulations, we set the subsample size $n=2000$, a constant learning rate $\epsilon_t$, and the number of simulated models $m=10$ per iteration.

The numerical results are summarized in Table S2, where the variables were selected according to the median posterior probability rule (Barbieri & Berger, 2004). To measure the quality of the selected models, we evaluated their prediction accuracy on the test dataset. For comparison, we applied the Lasso algorithm (Tibshirani, 1996), implemented in the package glmnet, to this example. For Lasso, we report only the results for one value of the regularization parameter $\lambda$, at which it selected models of about the same size as eSGLD. A $t$-test of the eSGLD prediction error $\nu$ against the Lasso value yields a highly significant $p$-value; therefore, the models selected by eSGLD have significantly lower prediction error than those selected by Lasso. We also tried other values of $C_1$, such as 1, 5 and 10; under each of them, the models selected by eSGLD have significantly lower prediction error than same-size models selected by Lasso.

For a thorough comparison, we also applied the SCAD (Fan & Li, 2001) and MCP (Zhang, 2010) algorithms to this example, both implemented in the R package SIS (Saldana & Feng, 2018). Unfortunately, they did not produce any results after running for 24 CPU hours.
This comparison also shows that the Lasso penalty is computationally more attractive than the SCAD and MCP penalties, although they share similar theoretical properties.
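For concreteness, the feature-wise normalization and observation-wise unit-length scaling described at the beginning of this subsection can be sketched as follows. This is a minimal numpy illustration with hypothetical array names (X_train, X_test), not the pipeline used for the experiments.

```python
import numpy as np

def preprocess(X_train, X_test, eps=1e-12):
    """Feature-wise standardization followed by observation-wise scaling
    to unit length, reusing the training statistics for the test set
    (a sketch; eps guards against division by zero)."""
    # Feature-wise: zero mean, unit variance, using training statistics.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + eps
    X_train = (X_train - mu) / sd
    X_test = (X_test - mu) / sd
    # Observation-wise: scale each row to unit Euclidean length.
    X_train /= np.linalg.norm(X_train, axis=1, keepdims=True) + eps
    X_test /= np.linalg.norm(X_test, axis=1, keepdims=True) + eps
    return X_train, X_test
```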
Table S2. Numerical results for the dataset epsilon, where the results of the extended stochastic gradient Langevin dynamics (eSGLD) algorithm were averaged over 10 independent runs with the standard deviation reported in parentheses, and the CPU time was recorded for one run on a Linux machine with an Intel(R) Core(TM) i7-7700 CPU@3.60GHz.
Algorithm   Setting    Model size   Prediction error (ν)   CPU (minutes)
eSGLD       C = 2.5    ·            · (·×10⁻·)             17.4
Lasso       λ = 0.·    63           0.1644                 21.2
MCP/SCAD    –          –            –                      > 1440

This example shows again that the extended stochastic gradient Langevin dynamics algorithm has much alleviated the pain of Bayesian computing for big data problems. For this example, it even has about the same CPU cost as the Lasso algorithm.
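As used above, the median posterior probability rule selects every variable whose estimated marginal posterior inclusion probability exceeds 1/2. A minimal sketch, assuming the post-burn-in models are stored as a binary indicator matrix gamma_samples (one row per sampled model, one column per variable; a hypothetical data structure, not the paper's):

```python
import numpy as np

def median_probability_model(gamma_samples):
    """Select variables whose marginal posterior inclusion probability,
    estimated by the sample average of the binary inclusion indicators,
    exceeds 1/2 (Barbieri & Berger, 2004)."""
    inclusion_prob = gamma_samples.mean(axis=0)  # one probability per variable
    return np.where(inclusion_prob > 0.5)[0]     # indices of selected variables
```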
4.4. MovieLens Data

Beyond variable selection, we also applied the extended stochastic gradient Langevin dynamics algorithm to a large-scale missing data problem: estimating a mixed effect model for the MovieLens 10M dataset, downloaded from https://grouplens.org/datasets/movielens/. The dataset contains 10,000,054 ratings of 10,681 movies by 71,567 users. The ratings vary from 0.5 to 5 in increments of 0.5. The whole dataset contains information on the ratings, the times of the ratings, and the genres of the movies (with 19 possible categories).

This dataset has been modeled by mixed effect models; see, e.g., Van Dyk (2000) and Srivastava et al. (2018). Let N be the total number of users, and let s_i denote the number of ratings of user i. Following Srivastava et al. (2018), we set
\[ y_i = W_i\beta + Z_i b_i + e_i, \quad b_i \sim N_q(0, \Sigma), \quad \Sigma = \sigma^2 D, \quad e_i \sim N_{s_i}(0, \sigma^2 I_{s_i}), \]
where y_i ∈ R^{s_i}, W_i ∈ R^{s_i×p}, β ∈ R^p, Z_i ∈ R^{s_i×q} and b_i ∈ R^q for i = 1, 2, ..., N. Let θ = (β, D, σ²) denote the parameter vector of the model, and let r_{ij} be the rating of user i for movie j. The response is defined as y_i = (r_{ij_1}, r_{ij_2}, ..., r_{ij_{s_i}})^T. Here we assume that (y_i, W_i, Z_i), i = 1, 2, ..., N, are independent and identically distributed random samples with varying dimensions, where W_i denotes the collection of the intercept, genre predictor, popularity predictor and previous predictor (defined below), and Z_i is the same as W_i.

- Genre predictor: a categorical variable with four categories, namely 'Action', 'Children', 'Comedy' and 'Drama', grouped from the 19 movie genres. The category 'Action' consists of the action, adventure, fantasy, horror, sci-fi and thriller genres. The category 'Children' consists of the animation and children genres. The category 'Comedy' consists of only the comedy genre. The category 'Drama' consists of the crime, documentary, drama, film-noir, musical, mystery, romance, war and western genres. We use effect coding to represent each category, i.e., (1,0,0), (0,1,0), (0,0,1) and (−1,−1,−1) represent Children, Comedy, Drama and Action, respectively. The genre predictor of movie j is defined as the average of the codes of all categories that movie j belongs to. For example, if movie j belongs to the fantasy, thriller and musical genres, then its genre predictor is (1/3){(−1,−1,−1) + (−1,−1,−1) + (0,0,1)} = (−2/3, −2/3, −1/3). (A sketch of the predictor construction is given after this list.)

- Popularity predictor: for the rating r_{ij}, defined as logit{(l_j + 0.5)/(L_j + 1.0)}, where L_j is the number of recent ratings of movie j, and l_j is the number of recent ratings of movie j with score greater than 3. Here "recent" means the 30 or fewer most recent ratings.

- Previous predictor: for the rating r_{ij}, defined as 1 if user i rated the previous movie with a score greater than 3, and 0 otherwise.
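The following sketch illustrates how the genre and popularity predictors above can be computed. The function names and dictionaries are ours for illustration; the genre grouping, effect coding and smoothing constants follow the list above.

```python
import math

# Effect coding for the four grouped categories (Action is the reference).
CATEGORY_CODE = {
    'Children': (1, 0, 0),
    'Comedy':   (0, 1, 0),
    'Drama':    (0, 0, 1),
    'Action':   (-1, -1, -1),
}

# Grouping of the 19 raw genres into the four categories.
GENRE_TO_CATEGORY = {
    'action': 'Action', 'adventure': 'Action', 'fantasy': 'Action',
    'horror': 'Action', 'sci-fi': 'Action', 'thriller': 'Action',
    'animation': 'Children', 'children': 'Children',
    'comedy': 'Comedy',
    'crime': 'Drama', 'documentary': 'Drama', 'drama': 'Drama',
    'film-noir': 'Drama', 'musical': 'Drama', 'mystery': 'Drama',
    'romance': 'Drama', 'war': 'Drama', 'western': 'Drama',
}

def genre_predictor(genres):
    """Average the effect codes of the categories a movie's genres fall in."""
    codes = [CATEGORY_CODE[GENRE_TO_CATEGORY[g]] for g in genres]
    n = len(codes)
    return tuple(sum(c[k] for c in codes) / n for k in range(3))

def popularity_predictor(l_j, L_j):
    """logit{(l_j + 0.5) / (L_j + 1.0)} from the recent-rating counts."""
    p = (l_j + 0.5) / (L_j + 1.0)
    return math.log(p / (1.0 - p))

# genre_predictor(['fantasy', 'thriller', 'musical'])
# -> (-2/3, -2/3, -1/3), matching the example in the text.
```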
As in Srivastava et al. (2018), we treat the random effect b_i as missing data and set the priors β | σ² ∼ N_p(0, σ² I_p), σ² ∼ inverse-χ², and D ∼ inverse-Wishart(p + 2, I_q). In each iteration of Algorithm S1, we first randomly select a mini-batch of users with subsample size n = 500. For the selected users, we sample the random effect b_i from p(b_i | y_i, β, σ², D) (see Equation 3.6 of Van Dyk (2000) for its analytic form) and then update θ = (β, D, σ²) according to the Langevin dynamics rule. We used the linear regression fit of Y = Xβ + e to initialize β and σ², and a diagonal matrix with all diagonal elements equal to 0.2 to initialize D. Algorithm S1 was run for 4000 iterations. The learning rate was set to ·/N for this example.

To compare with another competitive Bayesian sampling algorithm, we also ran the data augmentation Gibbs sampler, which iteratively samples θ and b from the full-data conditional distributions π(b | θ, X_N) and π(θ | b, X_N), respectively, where X_N denotes the whole dataset. The algorithm was run for 500 iterations.

The convergence of the extended stochastic gradient Langevin dynamics algorithm and of the Gibbs sampler can be assessed by the kernel Stein discrepancy (KSD) (Gorham & Mackey, 2015, 2017). For a set of samples {θ^(t)}_{t=1}^T generated by a Markov chain, the kernel Stein discrepancy between the empirical distribution of θ^(t), denoted by π_t, and the target posterior distribution π* = π(θ | X_N) is defined as
\[ \mathrm{KSD}(\pi_t, \pi^*) = \sum_{j=1}^{p} \left\{ \sum_{t,t'=1}^{T} \frac{k_j(\theta^{(t)}, \theta^{(t')})}{T^2} \right\}^{1/2}, \]
where the Stein kernel for j ∈ {1, 2, ..., p} is given by
\[ k_j(\theta, \theta') = \nabla_{\theta_j} U(\theta) \nabla_{\theta'_j} U(\theta') k(\theta, \theta') + \nabla_{\theta_j} U(\theta) \nabla_{\theta'_j} k(\theta, \theta') + \nabla_{\theta'_j} U(\theta') \nabla_{\theta_j} k(\theta, \theta') + \nabla_{\theta_j} \nabla_{\theta'_j} k(\theta, \theta'), \]
p is the dimension of θ, U(θ) = log π(θ | X_N), and k(θ, θ') = (1 + ||θ − θ'||²)^{−1/2} is the inverse multi-quadratic kernel. In practice, the KSD can also be calculated on only some components of θ for simplicity; in that case, p is smaller than the dimension of θ. A sketch of the KSD computation is given below.

Figure S1 compares the KSD paths of β produced by the extended stochastic gradient Langevin dynamics algorithm and the Gibbs sampler, where the x-axis shows the elapsed CPU time. Figure S1 indicates that, to achieve the same level of KSD, the Gibbs sampler requires approximately 7 times as much CPU time as the extended stochastic gradient Langevin dynamics algorithm. Figure S2 plots the sample trajectories of the β_i's for both algorithms against the iteration number. With the same number of iterations, the extended stochastic gradient Langevin dynamics algorithm does converge more slowly than the Gibbs sampler; however, this disadvantage is compensated by its much lower CPU cost per iteration.
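To make the definition concrete, here is a minimal numpy sketch of the KSD above with the inverse multi-quadratic kernel. The function and argument names (ksd, samples, grad_U) are ours for illustration: grad_U(θ) is assumed to return ∇U(θ) = ∇ log π(θ | X_N), and the O(T²p) cost is paid naively. This is not the implementation used for the paper.

```python
import numpy as np

def ksd(samples, grad_U):
    """Kernel Stein discrepancy between the empirical distribution of
    `samples` (a T x p array) and the target whose log-density gradient
    is `grad_U`, using k(x, y) = (1 + ||x - y||^2)^(-1/2)."""
    T, p = samples.shape
    grads = np.array([grad_U(th) for th in samples])   # (T, p) score values
    diff = samples[:, None, :] - samples[None, :, :]   # (T, T, p) pairwise x - y
    s = 1.0 + np.sum(diff ** 2, axis=2)                # 1 + ||x - y||^2
    k = s ** -0.5                                      # kernel values
    dk_dx = -diff * s[:, :, None] ** -1.5              # d k / d x_j
    dk_dy = -dk_dx                                     # d k / d y_j
    d2k = s[:, :, None] ** -1.5 - 3.0 * diff ** 2 * s[:, :, None] ** -2.5
    total = 0.0
    for j in range(p):
        gx = grads[:, None, j]                         # grad_j U at x, shape (T, 1)
        gy = grads[None, :, j]                         # grad_j U at y, shape (1, T)
        # Stein kernel k_j, assembled term by term as in the formula above.
        k0 = gx * gy * k + gx * dk_dy[:, :, j] + gy * dk_dx[:, :, j] + d2k[:, :, j]
        total += np.sqrt(max(np.sum(k0) / T ** 2, 0.0))  # clip tiny negatives
    return total
```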
Fig. S1. KSD (log scale) paths of β against CPU time (minutes), where the solid line is for the extended stochastic gradient Langevin dynamics (eSGLD) algorithm and the dotted line is for the Gibbs sampler.

Fig. S2. Sample paths of different components of β = (β₁, ..., β₆), where the left panel is for β₁ and the right panel is for components β₂ to β₆.

REFERENCES

Barbieri, M. & Berger, J. (2004). Optimal predictive model selection. Annals of Statistics 32, 870–897.
Bhatia, K., Ma, Y.-A., Dragan, A. D., Bartlett, P. L. & Jordan, M. I. (2019). Bayesian robustness: a nonasymptotic viewpoint. arXiv preprint arXiv:1907.11826.
Dalalyan, A. S. & Karagulyan, A. G. (2017). User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. CoRR abs/1710.00095.
Douc, R., Fort, G. & Guillin, A. (2009). Subgeometric rates of convergence of f-ergodic strong Markov processes. Stochastic Processes and their Applications 119, 897–923.
Fan, J. & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.
Gorham, J. & Mackey, L. (2015). Measuring sample quality with Stein's method. In Advances in Neural Information Processing Systems.
Gorham, J. & Mackey, L. (2017). Measuring sample quality with kernels. In Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org.
Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732.
Liang, F., Song, Q. & Yu, K. (2013). Bayesian subset modeling for high dimensional generalized linear models. Journal of the American Statistical Association 108, 589–606.
Nemeth, C. & Fearnhead, P. (2019). Stochastic gradient Markov chain Monte Carlo. arXiv:1907.06986.
Park, T. & Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association 103, 681–686.
Saldana, D. F. & Feng, Y. (2018). SIS: an R package for sure independence screening in ultrahigh dimensional statistical models. Journal of Statistical Software 83, 1–25.
Song, Q. & Liang, F. (2015). A split-and-merge Bayesian variable selection approach for ultra-high dimensional regression. Journal of the Royal Statistical Society, Series B 77, 947–972.
Srivastava, S., De Palma, G. & Liu, C. (2018). An asynchronous distributed expectation maximization algorithm for massive data: the DEM algorithm. arXiv preprint arXiv:1806.07533.
Teh, Y. W., Thiery, A. H. & Vollmer, S. J. (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. The Journal of Machine Learning Research 17, 193–225.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
Van Dyk, D. A. (2000). Fitting mixed-effects models using efficient EM-type algorithms. Journal of Computational and Graphical Statistics 9, 78–98.
Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics 38, 894–942.