Scalable Bayes via Barycenter in Wasserstein Space
Sanvesh Srivastava∗1, Cheng Li†2, and David B. Dunson‡3
1 Department of Statistics and Actuarial Science, The University of Iowa, Iowa City, Iowa, USA
2 Department of Statistics and Applied Probability, National University of Singapore, Singapore
3 Department of Statistical Science, Duke University, Durham, North Carolina, USA
∗ [email protected]  † [email protected]  ‡ [email protected]
June 21, 2018
Abstract
Divide-and-conquer based methods for Bayesian inference provide a general approach for tractable posterior inference when the sample size is large. These methods divide the data into smaller subsets, sample from the posterior distribution of parameters in parallel on all the subsets, and combine posterior samples from all the subsets to approximate the full data posterior distribution. The smaller size of any subset compared to the full data implies that posterior sampling on any subset is computationally more efficient than sampling from the true posterior distribution. Since the combination step takes negligible time relative to sampling, posterior computations can be scaled to massive data by dividing the full data into a sufficiently large number of data subsets. One such approach relies on the geometry of posterior distributions estimated across different subsets and combines them through their barycenter in a Wasserstein space of probability measures. We provide theoretical guarantees on the accuracy of approximation that are valid in many applications. We show that the geometric method approximates the full data posterior distribution better than its competitors across diverse simulations and reproduces known results when applied to a movie ratings database.
Keywords: barycenter; big data; distributed Bayesian computations; empirical measures; linear programming; optimal transportation; Wasserstein distance; Wasserstein space.
Developing efficient sampling algorithms is an active area of research motivated by tractable Bayesian inference in large sample settings. Sampling remains a primary tool for inference in Bayesian models, with Markov chain Monte Carlo (MCMC) and sequential Monte Carlo (SMC) providing two broad classes of algorithms that are routinely used. Most MCMC and SMC algorithms face problems in scaling up to massive data settings due to memory and computational bottlenecks that arise; this has motivated a rich literature in recent years proposing a variety of strategies to enable better performance in such settings. Our focus is on proposing a very general divide-and-conquer technique, which is designed to combine results from any posterior sampling algorithm applied in parallel using subsets of the data.

Massive data pose major problems for existing sampling algorithms. First, if the full data require multiple machines for storage, then a sampler has access to only a small fraction of the full data stored on the machine where it runs. Posterior sampling given the full data is expensive due to network latency and extensive communication among machines. Second, with sample size n, sampling in hierarchical Bayesian models requires generation of O(n) latent variables, which becomes inefficient as n increases. Finally, even if the full data are available to the sampler, sampling can be infeasible in practice because computation of Hessians and acceptance ratios can scale as O(n³) in some nonparametric models based on Gaussian process priors (Rasmussen and Williams, 2006). A variety of methods exist to address these issues using optimization and sampling.

Optimization-based methods for Bayesian inference obtain an analytic approximation of the full data posterior distribution. The two most common techniques are polynomial approximation (Rue et al., 2009) and projection of the full data posterior distribution on a class of distributions with analytically tractable posterior densities, which includes variational Bayes and expectation propagation (Wainwright and Jordan, 2008; Gelman et al., 2014). Both techniques estimate parameters of the approximate distribution using a variety of optimization algorithms (Tan and Nott, 2013; Kucukelbir et al., 2015; Rezende and Mohamed, 2015; Ranganath et al., 2016). Stochastic approximation significantly improves the efficiency of estimation by accessing the data in small batches and updating the parameter estimates sequentially (Broderick et al., 2013; Hoffman et al., 2013); however, optimization can be nontrivial for complex likelihoods frequently used in hierarchical models. Furthermore, variational Bayes and expectation propagation often have excellent predictive performance but can be highly biased in estimation of posterior uncertainty and dependence (Giordano et al., 2017).

There is extensive work in sampling-based methods for Bayesian inference. The three main techniques used are as follows. First, subsampling-based methods obtain posterior samples conditioned on a small fraction of the data (Maclaurin and Adams, 2015). Coupling of subsampling with modified Hamiltonian or Langevin dynamics improves posterior exploration and convergence to the stationary distribution (Welling and Teh, 2011; Ahn et al., 2012; Chen et al., 2014; Korattikara et al., 2014; Lan et al., 2014; Shahbaba et al., 2014); see Bardenet et al. (2017) for a review.
Second, the exact transition kernel in posterior sampling is replaced by an approximation that significantly reduces the time required to finish an iteration of the sampler (Johndrow et al., 2015; Alquier et al., 2016). Finally, divide-and-conquer approaches first divide the data into smaller subsets and sample in parallel across subsets, and then combine the posterior samples from all the subsets. Our focus is on scalable Bayesian methods based on the divide-and-conquer technique. These methods have two subgroups that differ mainly in their sampling scheme for every subset and their method for combining posterior samples obtained from all the subsets.

The first subgroup modifies the prior to sample from the posterior distribution of the parameter conditioned on a data subset. Let k be the number of subsets, π(θ) be the prior density of parameter θ, and l_i(θ) be the likelihood for subset i (i = 1, ..., k). Samples from subset posterior distribution i are obtained using l_i(θ) and π(θ)^{1/k} as the likelihood and prior. Consensus Monte Carlo combines subset posterior samples by averaging, which has been generalized in many ways (Rabinovich et al., 2015; Scott et al., 2016). This relies heavily on the normality assumption, which is relaxed using a combination based on kernel density estimation (Neiswanger et al., 2014). Both methods perform poorly if the supports of subset posteriors are different, which motivates the combination using the Weierstrass transform and random partition trees (Wang and Dunson, 2013; Wang et al., 2015). These methods offer simple approaches for combining samples from subset posterior distributions but have a major limitation that the sampling algorithm depends on the model parameterization.

The second subgroup modifies the subset likelihood to sample from a subset posterior distribution and combines samples from subset posterior distributions through their geometric center. These methods modify the likelihood to l_i(θ)^k and use prior π(θ) to sample from subset posterior distribution i (i = 1, ..., k). M-Posterior combines subset posterior distributions through their median in the Wasserstein space of order 1 (Minsker et al., 2014, 2017). The robustness of the median implies that it could ignore valuable information in some subset posterior distributions, which motivates combination through the mean in the Wasserstein space of order 2, called the Wasserstein Posterior (WASP) (Srivastava et al., 2015). The WASP approach strikes a balance between the generality of sampling and the efficiency of optimization. While WASP can be applied to any data or Bayesian model, its computations are developed for independent identically distributed (iid) data and its theoretical properties are unknown.

Our main goal is to study the theoretical properties of WASP and apply WASP in a variety of practical problems. The iid assumption of WASP rules out many important practical problems, including regression and classification, where the data are independent and non-identically distributed (inid). We relax this assumption and our theoretical results are applicable to inid data. Second, we show that if the number of subsets is chosen appropriately, then the WASP achieves almost the same rate of convergence as that of the full data posterior distribution. For linear models with error distribution in the location-scale family, we strengthen this result and show that the WASP and the full data posterior distribution have the same asymptotic mean and asymptotic variance.
This implies that WASP can be used as an efficient alternative to the full data posterior distribution in massive data settings. Third, we show that the method for estimating WASP is independent of the form of the model, which implies that WASP is very general and can be easily used for estimating posterior summaries for any function of the model parameters. We emphasize that WASP is not a new sampling algorithm but a general approach to easily extend any existing sampling algorithm for massive data applications.

We recall elementary properties and definitions related to the Wasserstein space of probability measures. Let (Θ, ρ) be a complete separable metric space and P(Θ) be the space of all probability measures on Θ. The Wasserstein space of order 2 is defined as
P_2(Θ) := { μ ∈ P(Θ) : ∫_Θ ρ²(θ, θ₀) μ(dθ) < ∞ },   (1)
where θ₀ ∈ Θ is arbitrary and P_2(Θ) does not depend on the choice of θ₀. The space P_2(Θ) is equipped with a natural distance between its elements. Let μ, ν ∈ P_2(Θ) and Π(μ, ν) be the set of all probability measures on Θ × Θ with marginals μ and ν; then the Wasserstein distance of order 2 between μ and ν is defined as
W_2(μ, ν) = ( inf_{π ∈ Π(μ, ν)} ∫_{Θ×Θ} ρ²(x, y) dπ(x, y) )^{1/2}.   (2)
In our applications ρ is the Euclidean metric and we refer to P_2(Θ) and W_2 as the Wasserstein space and the Wasserstein distance without explicitly mentioning their order. If Π_1, ..., Π_k are a collection of probability measures in P_2(Θ), then their barycenter in P_2(Θ) is defined as
Π̄ = argmin_{Π ∈ P_2(Θ)} ∑_{j=1}^k (1/k) W_2²(Π, Π_j).   (3)
This generalizes the Euclidean barycenter, which is the sample mean, to P_2(Θ) (Agueh and Carlier, 2011). The barycenter Π̄ is analytically intractable, except in a few special cases. Let δ_a(x) = 1 if a = x and 0 otherwise. If X_{j1}, ..., X_{jm} are samples from Π_j (j = 1, ..., k), then Π̂_j(·) = ∑_{i=1}^m δ_{X_{ji}}(·)/m is an empirical measure that approximates Π_j (j = 1, ..., k). If Π̄ is assumed to be an empirical measure, then the optimization problem in (3) reduces to a linear program; see Cuturi and Doucet (2014), Carlier et al. (2015), and Srivastava et al. (2015) for different algorithms to solve this linear program.

Consider a general set-up for inid data. Let Y^{(n)} = (Y_1, ..., Y_n) be n observations, where the distribution of Y_i is P_{θ,i}, i = 1, ..., n, and θ lies in the parameter space Θ ⊂ R^p. Assume that P_{θ,i} has density p_i(·|θ) with respect to the Lebesgue measure, so dP_{θ,i}(y_i) = p_i(y_i|θ) dy_i and the likelihood given Y^{(n)} is l(θ) = ∏_{i=1}^n p_i(y_i|θ). Given a prior distribution Π on Θ that has density π with respect to the Lebesgue measure, the posterior density of θ given Y^{(n)} using Bayes theorem is
π(θ | Y^{(n)}) = ∏_{i=1}^n p_i(y_i|θ) π(θ) / ∫_Θ ∏_{i=1}^n p_i(y_i|θ) π(θ) dθ = l(θ) π(θ) / ∫_Θ l(θ) π(θ) dθ.   (4)
In most cases π(θ | Y^{(n)}) is analytically intractable, and accurate approximations of π(θ | Y^{(n)}) are obtained using Monte Carlo methods, such as importance sampling and MCMC, and deterministic approximations, such as Laplace's method and variational Bayes. For example, in the context of logistic regression, P_{θ,i} is the Bernoulli distribution with mean 1/{1 + exp(−x_i^T θ)}, where x_i^T is the i-th row of the design matrix X ∈ R^{n×p} and Θ = R^p.
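As a concrete instance of (4), the following minimal sketch evaluates the unnormalized log posterior for the logistic regression example just described. Python and the Gaussian prior with standard deviation prior_sd are our choices for illustration only; they are not part of the paper.

```python
import numpy as np

def log_posterior_logistic(theta, X, y, prior_sd=10.0):
    """Unnormalized log posterior log l(theta) + log pi(theta) in (4) for
    logistic regression; the N(0, prior_sd^2 I) prior is illustrative only."""
    eta = X @ theta                              # linear predictors x_i^T theta
    # Bernoulli log likelihood with success probability 1 / {1 + exp(-x_i^T theta)}
    log_lik = np.sum(y * eta - np.logaddexp(0.0, eta))
    log_prior = -0.5 * np.sum(theta ** 2) / prior_sd ** 2
    return log_lik + log_prior
```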
The posterior density of θ is analytically intractable, and it is typical to rely on Gibbs samplers based on data augmentation (Bishop, 2006). These samplers introduce latent variables {z_i, i = 1, ..., n} and alternately sample the latent variables and the parameters from their full conditional distributions. Related algorithms are very common and are computationally prohibitive for large n because they require repeated passes through the whole data.

Divide-and-conquer-type methods resolve this problem by partitioning the data into smaller subsets. Let k be the number of subsets. The default strategy is to randomly allocate samples to subsets. Let Y_{[j]} ≡ Y_j^{(m_j)} = (Y_{j1}, ..., Y_{jm_j}) denote the data on the j-th subset, where m_j is the size of the j-th subset and ∑_{j=1}^k m_j = n. We assume that m_j = m (j = 1, ..., k) for ease of presentation, so n = km, the likelihood given Y_{[j]} is l_j(θ) = ∏_{i=1}^m p_{ji}(y_{ji}|θ), and l(θ) in (4) equals ∏_{j=1}^k l_j(θ). Define subset posterior density j given Y_{[j]} as
π_m(θ | Y_{[j]}) = {∏_{i=1}^m p_{ji}(y_{ji}|θ)}^γ π(θ) / ∫_Θ {∏_{i=1}^m p_{ji}(y_{ji}|θ)}^γ π(θ) dθ = l_j(θ)^γ π(θ) / ∫_Θ l_j(θ)^γ π(θ) dθ,   (5)
where γ is a positive real number such that g_1 γ m ≤ n ≤ g_2 γ m for some g_1, g_2 > 0. In the present context, we assume that γ = k with g_1 = g_2 = 1 following Minsker et al. (2014); more general conditions on γ are defined later in Section 3.2. This modified form of the subset posterior compensates for the fact that the j-th subset has access to only an (m/n)-fraction of the full data and ensures that π_m(θ | Y_{[j]}) and π(θ | Y^{(n)}) in (4) have variances of the same order. Minsker et al. (2014) refer to this as stochastic approximation because raising l_j(θ) (j = 1, ..., k) to the power γ is equivalent to replicating every X_{ji} (i = 1, ..., m) γ times, so that π_m(θ | Y_{[j]}) (j = 1, ..., k) are noisy approximations of π(θ | Y^{(n)}).

One advantage of using stochastic approximation to define π_m(θ | Y_{[j]}) in (5) is that off-the-shelf sampling algorithms can be used directly even when the prior density is in the form of a discrete mixture. Consider a simple example of univariate density estimation using Dirichlet process (DP) mixtures of Gaussians. Let X_i (i = 1, ..., n) be iid samples from a distribution P with density p. The data are randomly split into k subsets of equal size m. The truncated stick-breaking representation of the DP implies that the prior distribution Π on P has a finite mixture representation, where P is the set of probability distributions that have a density. We show in the Appendix that modification of the likelihood using stochastic approximation leads to nearly identical subset and full data posterior computations.

Stochastic approximation does not add any extra burden to the computations required for sampling from the subset posterior distribution of θ conditioned on m observations. We raise the likelihood in every subset to the power γ. This is equivalent to replicating every observation γ times, which seems to offset the benefits of partitioning. However, the replication of observations is not required in the implementation of the sampler; we simply modify the likelihood in the full data sampler by raising it to the power γ. For example, stochastic approximation is easily implemented using the increment_log_prob function in Stan (Stan Development Team, 2014).
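A minimal sketch of such a subset sampler, reusing the logistic regression likelihood from the previous sketch with a random-walk Metropolis kernel (both our choices for illustration; the paper itself uses Stan), multiplies the subset log likelihood by γ before adding the log prior:

```python
import numpy as np

def sample_subset_posterior(X_j, y_j, gamma, n_iter=5000, step=0.05, seed=0):
    """Random-walk Metropolis targeting the subset posterior (5): the subset
    log likelihood is multiplied by gamma (i.e., the likelihood is raised to
    the power gamma) before the log prior is added."""
    rng = np.random.default_rng(seed)
    p = X_j.shape[1]
    theta = np.zeros(p)

    def log_target(th):
        eta = X_j @ th
        log_lik = np.sum(y_j * eta - np.logaddexp(0.0, eta))
        log_prior = -0.5 * np.sum(th ** 2) / 100.0   # illustrative N(0, 10^2 I) prior
        return gamma * log_lik + log_prior           # stochastic approximation

    current = log_target(theta)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal(p)
        cand = log_target(prop)
        if np.log(rng.uniform()) < cand - current:   # Metropolis accept/reject
            theta, current = prop, cand
        draws[t] = theta
    return draws
```

In Stan, the same modification amounts to adding γ times the subset log likelihood to the log target, which is what the increment_log_prob usage mentioned above accomplishes.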
We provide more examples for a variety of models in Section 4.

A simple logistic regression example demonstrates that π_m(θ | Y_{[j]}) in (5) is a noisy approximation of π(θ | Y^{(n)}) in (4). We simulated data for logistic regression with p = 2, θ₀ = (−1, 1)^T, entries of X randomly set to ±1, and a large n (Figure 1). We set γ = k = 40 and obtained samples of θ from π(θ | Y^{(n)}) and from π_m(θ | Y_{[j]}) (j = 1, ..., k) using Stan's HMC sampling algorithm. The contours for the subset and full data posterior densities are very similar, indicating that all densities have similar spreads. We also notice that the subset posteriors are noisy approximations of the full data posterior in that most of them have a bias and do not concentrate at the true value of θ.

Figure 1: Binned kernel density estimates of the full data posterior distribution, subset posterior distributions, and WASP for coefficients (θ₁, θ₂) in logistic regression. The x and y axes represent posterior samples for θ₁ and θ₂. The true values of θ₁ and θ₂ are −1 and 1 (black triangle).

The WASP approach combines subset posterior distributions Π_m(· | Y_{[j]}) (j = 1, ..., k) through their barycenter in P_2(Θ), where the density of Π_m(· | Y_{[j]}) is π_m(· | Y_{[j]}) in (5). The barycenter represents a geometric center of a collection of probability distributions that can be efficiently computed using a linear program. Motivated by this, Srivastava et al. (2015) proposed to combine a collection of subset posterior distributions through their barycenter in the Wasserstein space, called the WASP. Assuming that the subset posterior distributions Π_m(· | Y_{[j]}) (j = 1, ..., k) have finite second moments, the WASP is defined using (3) as
Π̄_n(· | Y^{(n)}) = argmin_{Π ∈ P_2(Θ)} ∑_{j=1}^k (1/k) W_2²{Π, Π_m(· | Y_{[j]})}.   (6)
Consider the following Gaussian example where the WASP is analytically tractable. Assume that the subset posterior distributions, Π_1, ..., Π_k, are Gaussian with means μ_1, ..., μ_k and covariance matrices Σ_1, ..., Σ_k. If we fix ρ to be the Euclidean metric and Θ = R^d in (2), then (3) implies that Π̄_n is Gaussian with mean μ̄ and covariance matrix Σ̄, where
μ̄ = (1/k) ∑_{j=1}^k μ_j and Σ̄ is such that (1/k) ∑_{j=1}^k (Σ̄^{1/2} Σ_j Σ̄^{1/2})^{1/2} = Σ̄,   (7)
where A^{1/2} is the symmetric square root of A (Agueh and Carlier, 2011). If θ is one dimensional, then (7) says that the standard deviation of the WASP is the average of the standard deviations of the subset posteriors; therefore, the variance of the WASP is typically about the same order as that of any subset posterior distribution. A similar relation also holds in higher dimensions and for a large class of posterior distributions, including elliptical distributions (Álvarez-Esteban et al., 2016).

The WASP is analytically tractable only in special cases, but it can be estimated using a linear program if the subset posterior distributions have an atomic form. Let {θ_{j1}, ..., θ_{jS}} be the θ samples obtained from subset posterior density j in (6) using a sampling algorithm, including HMC, MCMC, SMC, or importance sampling. Approximate the j-th subset posterior distribution Π_m(· | Y_{[j]}) using the empirical measure
Π̂_m(· | Y_{[j]}) = ∑_{i=1}^S (1/S) δ_{θ_{ji}}(·)   (j = 1, ..., k).   (8)
Srivastava et al.
(2015) approximate the WASP as
Π̂̄_n(· | Y^{(n)}) = ∑_{j=1}^k ∑_{i=1}^S a_{ji} δ_{θ_{ji}}(·),  0 ≤ a_{ji} ≤ 1,  ∑_{j=1}^k ∑_{i=1}^S a_{ji} = 1,   (9)
where a_{ji} (j = 1, ..., k; i = 1, ..., S) are unknown weights of the atoms. There are many specialized algorithms to estimate the WASP that exploit the structure of the linear program in (6) when Π_m(· | Y_{[j]}) and Π̄_n(· | Y^{(n)}) are restricted to have atomic forms in (8) and (9), respectively; for example, Cuturi and Doucet (2014) extend the Sinkhorn algorithm using entropy-smoothed sub-gradient methods, Carlier et al. (2015) develop a non-smooth optimization algorithm, and Srivastava et al. (2015) propose an efficient linear program that exploits the sparsity of constraints to solve (6). A simple and efficient algorithm to find the WASP of a given function of parameters is summarized in Algorithm 1.

Algorithm 1: Estimation of the WASP for f(θ) given samples of θ from k subset posteriors.
Input: samples from k subset posteriors, {θ_{ji} : θ_{ji} ∼ Π_m(· | Y_{[j]}), i = 1, ..., s_j, j = 1, ..., k}; mesh size ε > 0.
Do:
1. Define φ_{ji} = (φ_{ji1}, ..., φ_{jiq}) = f(θ_{ji}) (i = 1, ..., s_j; j = 1, ..., k) and the matrix of atoms of subset posterior j, Φ_j ∈ R^{s_j × q}, with φ_{ji} as row i (i = 1, ..., s_j). For r = 1, ..., q, let φ^min = (φ^min_1, ..., φ^min_q) with φ^min_r = min_{j,i} φ_{jir}, and φ^max = (φ^max_1, ..., φ^max_q) with φ^max_r = max_{j,i} φ_{jir}.
2. Set the number of atoms in the empirical approximation for the WASP to g = g_1 × ... × g_q, where g_r = ⌈(φ^max_r − φ^min_r)/ε⌉ (r = 1, ..., q).
3. Define the matrix of WASP atoms Φ ∈ R^{g × q} with rows formed by stacking the vectors {φ^min_1 + (i_1/g_1)(φ^max_1 − φ^min_1), ..., φ^min_q + (i_q/g_q)(φ^max_q − φ^min_q)}, (i_r = 1, ..., g_r; r = 1, ..., q).
4. Set the distance matrix between the atoms of the WASP and the j-th subset posterior, D_j ∈ R_+^{g × s_j}, as (D_j)_{uv} = ∑_{r=1}^q (φ_{ur} − φ_{jvr})², (u = 1, ..., g; v = 1, ..., s_j; j = 1, ..., k), where φ_{ur} is the (u, r)-entry of Φ.
5. Estimate â_1, ..., â_g by solving the linear program (42) in Appendix C.
Return: f̂♯Π̄(· | Y^{(n)}) = ∑_{i=1}^g â_i δ_{φ_i}(·), the atomic approximation of f♯Π̄_n(· | Y^{(n)}).

The WASP, denoted as Π̄_n, replaces the full data posterior distribution, denoted as Π_n, for inference and prediction in massive data applications where n is large. In motivating applications, computation of Π_n is inefficient, and dividing the data into smaller subsets and performing posterior computations in parallel leads to massive speed-ups. A formal asymptotic justification for using Π̄_n to approximate Π_n would ideally show that the distance between Π̄_n and Π_n tends to 0 as the full data size n increases to infinity. We will illustrate this using a linear model example in Section 3.2.1, where we show that n^{1/2} W_2(Π̄_n, Π_n) → 0 as n → ∞. Since both Π̄_n and Π_n have variances of order n^{-1}, our result implies that the mean and the variance of the WASP match those of the full data posterior distribution.

A general theoretical justification for using Π̄_n in the place of Π_n for a multivariate θ given inid data is technically much more challenging. If the data are iid and θ is one-dimensional, then Li et al. (2017) proves that n^{1/2} W_2(Π̄_n, Π_n) → 0 as n → ∞ for regular parametric models. The proof in Li et al. (2017) relies heavily on the Bernstein-von Mises theorem (BvM) for iid data and the one-dimensional quantile representation of the Wasserstein distance. Unlike the iid case, a BvM-type theorem is generally unavailable if the data are inid or the model is non-regular (Ibragimov and Has'minskii, 2013). In Section 3.2.2, we show that the WASP Π̄_n converges to the true parameter value at almost the same rate as Π_n when the number of subsets k increases slowly with n. The previous theoretical justification of WASP in Srivastava et al. (2015) only includes posterior consistency under the stronger iid assumption without characterizing the convergence rate. Relaxing these limitations, we provide the convergence rate for the WASP in the inid case, including the convergence rate for the WASP of general functionals of the original parameters.

We use a weighted linear model example to illustrate the theoretical approximation accuracy of the WASP to the true posterior under the inid setup. For i = 1, ..., n, let y_i be a scalar response, x_i be a p × 1 vector of predictors, and ε_i be the idiosyncratic error in y_i. Let θ = (θ_1, ..., θ_p)^T be the p × 1 regression coefficients vector. Let y = (y_1, ..., y_n)^T, X = [x_1, ..., x_n]^T, and ε = (ε_1, ..., ε_n)^T be the n × 1 response vector, the n × p design matrix, and the n × 1 error vector, respectively.
If Σ is a known diagonal matrix with positive elements and cov(ε) = Σ, then the weighted linear regression model of y on X with a flat prior on θ assumes that
y = Xθ + ε,  ε ∼ N_n(0, Σ),  Σ = diag(σ_1², ..., σ_n²),  π(θ) ∝ 1,   (10)
where π(θ) is the flat prior on θ and N_n(0, Σ) is an n-variate Gaussian distribution with n × 1 mean 0 and covariance Σ. In this case, the data are inid since the distribution of y_i depends on the value of x_i. Since Σ is assumed to be known, the posterior distribution of θ is normal with mean μ = (X^T Σ^{-1} X)^{-1} X^T Σ^{-1} y and covariance matrix V = (X^T Σ^{-1} X)^{-1}. Although the posterior of θ has a closed form in this example, the computational complexity of finding μ and V is O(n), which becomes inefficient as the size of the data n increases.

The WASP of θ in (10) is analytically tractable. The computation of the WASP has three steps. First, the training data are randomly split into k subsets. Let y_j, X_j, and Σ_j be the response vector, design matrix, and error covariance matrix specific to subset j (j = 1, ..., k). Second, we compute the subset posterior distributions after stochastic approximation on the k subsets in parallel as in (5) with γ = k. The j-th subset posterior distribution of θ is N_p(μ_j, V_j), where μ_j = (X_j^T Σ_j^{-1} X_j)^{-1} X_j^T Σ_j^{-1} y_j and V_j = k^{-1}(X_j^T Σ_j^{-1} X_j)^{-1}. Third, (7) implies that the WASP of θ is also Gaussian with mean vector μ̄ and covariance matrix V̄, where μ̄ = k^{-1} ∑_{j=1}^k μ_j and V̄ satisfies V̄ = k^{-1} ∑_{j=1}^k (V̄^{1/2} V_j V̄^{1/2})^{1/2}.

The WASP and full data posterior distributions lead to the same posterior inference on θ up to o(n^{-1}) terms. Let Π̄_n = N_p(μ̄, V̄) and Π_n = N_p(μ, V) be the WASP and full data posterior distributions for θ. Based on the divide-and-conquer technique, the subset computations run in parallel, which reduces the computational burden relative to Π_n by a factor of k. The true distribution of y in (10), denoted as P^{(n)}_{θ₀}, is N_n(Xθ₀, Σ). If uncertainty quantification using Π̄_n and Π_n is the same, then it suffices to show that the difference in the second moments of Π̄_n and Π_n is o(n^{-1}) in P^{(n)}_{θ₀}-probability because the variances V̄ and V are both of order n^{-1}. This is equivalent to showing that the W_2 distance between Π̄_n and Π_n is o(n^{-1/2}) in P^{(n)}_{θ₀}-probability, which is proved in the next theorem. In the statement of the theorem, we denote A ≺ B for positive definite matrices A and B if B − A is also positive definite.

Theorem 3.1.
Assume that there exist a_n = o(1), b_m = o(1) such that Ω − a_n I_p ≺ n^{-1} X^T Σ^{-1} X ≺ Ω + a_n I_p and Ω − b_m I_p ≺ m^{-1} X_j^T Σ_j^{-1} X_j ≺ Ω + b_m I_p for all j = 1, ..., k, where I_p and Ω are p × p identity and constant positive definite matrices. Then,
E_{P^{(n)}_{θ₀}} ‖μ̄ − μ‖² = o(n^{-1}),  tr(V̄ − V) = o(n^{-1}),  E_{P^{(n)}_{θ₀}} W_2²(Π̄_n, Π_n) = o(n^{-1}).

Theorem 3.1 says that Π̄_n and Π_n are asymptotically the same in P^{(n)}_{θ₀}-probability for data following the model in (10). Essentially, the WASP and the true posterior have the same posterior mean and posterior variance, and their differences are only in higher order of the full data size n. Furthermore, Theorem 3.1 is valid for any block diagonal Σ as long as the data that belong to a particular diagonal block of Σ also belong to the same partition. In other words, Theorem 3.1 even holds for dependent data in which the dependence can be expressed as a block diagonal Σ in (10). Finally, Theorem 3.1 is in fact true for any error distribution satisfying E(ε) = 0 and cov(ε) = Σ, which includes the Gaussian distribution; see Definition 2.1 and Theorem 2.3 in Álvarez-Esteban et al. (2016).
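The closed forms above make Theorem 3.1 easy to check numerically. The sketch below is ours, not the paper's code: the toy sizes and synthetic design are illustrative assumptions, and the covariance in (7) is computed with the fixed-point iteration of Álvarez-Esteban et al. (2016).

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_barycenter(mus, Sigmas, n_iter=100):
    """W2 barycenter of Gaussians N(mu_j, Sigma_j): the mean is the average of
    the mu_j; the covariance solves (7), computed via the fixed-point iteration
    of Alvarez-Esteban et al. (2016)."""
    mu_bar = np.mean(mus, axis=0)
    S = np.mean(Sigmas, axis=0)                          # starting value
    for _ in range(n_iter):
        S_half = np.real(sqrtm(S))
        S_half_inv = np.linalg.inv(S_half)
        T = np.mean([np.real(sqrtm(S_half @ Sig @ S_half)) for Sig in Sigmas], axis=0)
        S = S_half_inv @ T @ T @ S_half_inv              # S <- S^{-1/2} T^2 S^{-1/2}
    return mu_bar, S

# Toy check of Theorem 3.1 under model (10); sizes and design are our assumptions.
rng = np.random.default_rng(1)
n, p, k = 20000, 3, 10
X = rng.choice([-1.0, 1.0], size=(n, p))
sig2 = rng.uniform(0.5, 2.0, size=n)                     # known diagonal Sigma
theta0 = np.array([1.0, -2.0, 0.5])
y = X @ theta0 + rng.normal(scale=np.sqrt(sig2))

def normal_posterior(Xs, ys, s2, gamma=1.0):
    """Posterior N(mu, V) of theta under (10) with a flat prior and the
    likelihood raised to the power gamma, as in (5)."""
    prec = gamma * (Xs.T / s2) @ Xs                      # gamma * X^T Sigma^{-1} X
    V = np.linalg.inv(prec)
    return V @ (gamma * (Xs.T / s2) @ ys), V

mu_full, V_full = normal_posterior(X, y, sig2)           # full data posterior
parts = np.array_split(np.arange(n), k)
subs = [normal_posterior(X[idx], y[idx], sig2[idx], gamma=k) for idx in parts]
mu_bar, V_bar = gaussian_w2_barycenter([m for m, _ in subs], [V for _, V in subs])
# mu_bar, V_bar should agree closely with mu_full, V_full, as Theorem 3.1 predicts.
```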
For general non-iid data, the standard Bayesian asymptotic theory for posterior convergence rates has been established in Ghosal and van der Vaart (2007), which also includes our inid setup. We follow the theoretical framework of Ghosal and van der Vaart (2007) and develop the corresponding theory for divide-and-conquer Bayesian inference using the WASP. We start with two definitions required to state the assumptions of our theoretical setup.

Definition 3.2 (Pseudo Hellinger distance). The pseudo Hellinger distance between probability measures P^{(m)}_{θ₁}, P^{(m)}_{θ₂} ∈ {⊗_{i=1}^m P_{θ,j,i} : θ ∈ Θ, dP_{θ,j,i}(y) = p_{ji}(y|θ) dy} is h²_{m,j}(θ₁, θ₂) = (1/m) ∑_{i=1}^m h²{p_{ji}(·|θ₁), p_{ji}(·|θ₂)}, where h(p₁, p₂) = [∫ {√p₁(y) − √p₂(y)}² dy]^{1/2} is the Hellinger distance between two generic densities p₁ and p₂.

This definition generalizes the usual Hellinger distance to account for the inid data generating mechanism. The space ({⊗_{i=1}^m P_{θ,j,i} : θ ∈ Θ}, h_{m,j}) is a metric space.

Definition 3.3 (Generalized bracketing entropy). Let Ξ be a fixed subset of Θ. For an m-dimensional random vector Z = (Z_1, ..., Z_m)^T, denote its L_q norm as |Z|_q = [(1/m) ∑_{i=1}^m E(|Z_i|^q)]^{1/q} and use ‖Z‖ to represent |Z|_2. For a fixed j ∈ {1, ..., k}, let P_j(Ξ) = {p_j(y|θ) = (p_{j1}(y_1|θ), ..., p_{jm}(y_m|θ))^T : y = (y_1, ..., y_m)^T ∈ ⊗_{i=1}^m Y_{ji}, θ ∈ Ξ} be the class of m-dimensional functions indexed by θ. For a given δ > 0, let B(δ, P_j(Ξ)) = {[l_s, u_s] : l_s(y) = (l_{s1}(y_1), ..., l_{sm}(y_m))^T, u_s(y) = (u_{s1}(y_1), ..., u_{sm}(y_m))^T, y = (y_1, ..., y_m)^T ∈ ⊗_{i=1}^m Y_{ji}, s = 1, ..., N} be the generalized bracketing set of P_j(Ξ) with cardinality N, such that for any p_j(y|θ) ∈ P_j(Ξ) there exists a pair of functions [l_s, u_s] ∈ B(δ, P_j(Ξ)) with l_{si}(y_i) ≤ p_{ji}(y_i) ≤ u_{si}(y_i) for all y ∈ ⊗_{i=1}^m Y_{ji} and all i = 1, ..., m, and ‖√u_s − √l_s‖ ≤ δ. The h_{m,j}-bracketing number of P_j(Ξ), N_[](δ, P_j(Ξ), h_{m,j}), is defined as the smallest cardinality of the generalized bracketing set B(δ, P_j(Ξ)). The h_{m,j}-bracketing entropy of P_j(Ξ) is defined as H_[](δ, P_j(Ξ), h_{m,j}) = log(N_[](δ, P_j(Ξ), h_{m,j})).

Again, this definition generalizes the usual bracketing entropy to the inid case. If the data are indeed iid, then Definition 3.3 coincides with that of the usual bracketing entropy.

Our theory for the convergence rate of the WASP is built on the following assumptions.

(A1) Θ is a compact space in the ρ metric, θ₀ is an interior point of Θ, and g_1 γ m ≤ n ≤ g_2 γ m for some constants g_1, g_2 > 0.

(A2) For any θ, θ′ ∈ Θ and j = 1, ..., k, there exist positive constants α and C_L such that h_{m,j}(θ, θ′) ≥ C_L ρ^α(θ, θ′), where h_{m,j} is the pseudo Hellinger distance in Definition 3.2.

(A3) (Entropy condition) There exist constants D_1 > 0, 0 < D_2 < D_1/2, and a function Ψ(u, r) ≥ 0 that is nonincreasing in u ∈ R_+ and nondecreasing in r ∈ R_+, such that for all j = 1, ..., k, for any u, r > 0, and for all sufficiently large m,
H_[](u, {p_j(y|θ) : θ ∈ Θ, h_{m,j}(θ, θ₀) ≤ r}, h_{m,j}) ≤ Ψ(u, r) for all j = 1, ..., k, and ∫_{D_2 r²}^{D_1 r} Ψ^{1/2}(u, r) du < D_2 √m r²,
where p_j(y|θ) = {p_{j1}(y_{j1}|θ), ..., p_{jm}(y_{jm}|θ)}^T and H_[] is the h_{m,j}-bracketing entropy of the set {p_j(y|θ) : θ ∈ Θ, h_{m,j}(θ, θ₀) ≤ r} in Definition 3.3.

(A4) (Prior thickness) There exist positive constants κ and c_π such that, uniformly over all j = 1, ..., k,
Π( θ ∈ Θ : (1/m) ∑_{i=1}^m E_{P_{θ₀}} exp( κ log₊ {p_{ji}(Y_{ji}|θ₀)/p_{ji}(Y_{ji}|θ)} ) − 1 ≤ (log m)/m ) ≥ exp(−c_π k log m),
where log₊ x = max(log x, 0) for x > 0.

(A5) The metric ρ satisfies ρ(∑_{i=1}^N w_i θ_i, θ′) ≤ ∑_{i=1}^N w_i ρ(θ_i, θ′) for any N ∈ {1, 2, ...}, θ_1, ..., θ_N, θ′ ∈ Θ, and non-negative weights with ∑_{i=1}^N w_i = 1.

Our assumptions above are based on the standard assumptions in Bayesian asymptotic theory. Similar to Theorem 10 in Ghosal and van der Vaart (2007), we have assumed a compact support in (A1) and a lower bounded pseudo Hellinger distance in (A2). Typically, α = 1 for most regular models, such as generalized linear models. If the model is non-regular, then α can be less than 1; for example, the densities may have discontinuities depending on the parameter (Ibragimov and Has'minskii, 2013, Chapters V, VI). Assumption (A3) parallels the entropy condition used in Theorem 1 of Wong and Shen (1995), which has been adapted here for the inid setup using the generalized bracketing entropy, and will simplify to an entropy condition similar to that in Theorem 1 of Wong and Shen (1995) if the data are iid. Assumption (A4) is crucial in providing a stronger control over the tail probability as the posterior probability mass moves away from the true parameter θ₀, typically with an exponentially decaying rate. The convexity property of ρ in (A5) is mainly used to establish an averaging inequality under the W_2 distance and is satisfied by, for example, the Euclidean metric and the L_q metric with q ≥ 1.

The posterior risks of Π̄_n and Π_n in the ρ metric are directly related to the W_2 distance based on the ρ metric. If θ₀ denotes the true parameter value from which the data are generated, then the posterior risk of Π_n in the estimation of θ₀ is
∫_{Y^{(n)}} ∫_Θ ρ²(θ, θ₀) dΠ_n(θ | Y^{(n)}) dP^{(n)}_{θ₀}(y_1, ..., y_n) = E_{P^{(n)}_{θ₀}} [ W_2²{Π_n(· | Y^{(n)}), δ_{θ₀}(·)} ].   (11)
The classical result says that the posterior risk (11) in regular parametric models converges to zero at the n^{-1} rate under assumptions similar to (A2)–(A4), with m replaced by n (van der Vaart, 2000). The next theorem shows that the same posterior risk of the WASP converges at a similar rate to that of the true posterior Π_n, which mainly depends on the size of the subsets m, and can be made close to the standard n^{-1} rate up to some logarithmic factors for regular parametric models.

Theorem 3.4.
If Assumptions (A1)–(A4) hold for the j-th subset posterior Π_m(· | Y_{[j]}) (j = 1, ..., k), then there exists a universal constant C > 0 independent of j, such that as m → ∞,
E_{P^{(m)}_{θ₀}} [ W_2²{Π_m(· | Y_{[j]}), δ_{θ₀}(·)} ] ≤ C ((log m)/m)^{1/α},  j = 1, ..., k.   (12)
Additionally, if Assumption (A5) holds, then as m → ∞,
E_{P^{(n)}_{θ₀}} [ W_2²{Π̄_n(· | Y^{(n)}), δ_{θ₀}(·)} ] ≤ C ((log m)/m)^{1/α}.   (13)

Theorem 3.4 proves posterior convergence in expectation, which is stronger than the commonly studied posterior convergence in probability. We present our results using the W_2 distance in order to account for the fact that the k subset posteriors sit on a common parameter space. Alternatively, from (11), the convergence rates in (12) and (13) are also the rates of posterior risks for the subset posterior distributions and the WASP. For regular models with α = 1, if the number of subsets k increases slowly with n (e.g., k = O(log^c n) for some constant c > 0), then Theorem 3.4 implies that the WASP converges in W_2 distance to δ_{θ₀} at a near optimal convergence rate O_p(n^{-1/2} log^{c/2+1} n). In this case, the standard parametric convergence rate of Π_n is O_p(n^{-1/2}), so the WASP attains the optimal convergence rate up to the log^{c/2+1} n factor. Equivalently, using (11), the posterior risk of the WASP converges to zero at the near optimal rate O_p(n^{-1} log^{c+2} n), compared to the O_p(n^{-1}) posterior risk of the true posterior Π_n.

In most applications, the interest also lies in functions of θ. Suppose f : Θ ↦ R^q is a function that maps θ to {f_1(θ), ..., f_q(θ)}, where q ≥ 1 is a positive integer. A direct application of Lemma 8.5 in Bickel and Freedman (1981) gives the following corollary about the WASP of a function of θ. As long as the function is bounded almost linearly by the ρ metric in (1), its WASP possesses the same posterior convergence rate as in Theorem 3.4.

Corollary 3.5.
Suppose f(·) = {f_1(·), ..., f_q(·)} is a function that maps Θ ↦ R^q such that |f(θ)|² = ∑_{i=1}^q {f_i(θ)}² ≤ C_f {ρ(θ, θ₀)}², where C_f > 0 is a fixed constant. If the conditions in Theorem 3.4 hold and f♯Π̄_n(· | Y^{(n)}) represents the WASP of the subset posterior distributions for f(θ), then as m → ∞,
W_2{f♯Π̄_n(· | Y^{(n)}), δ_{f(θ₀)}(·)} = O_{P^{(n)}_{θ₀}}( √(log^{1/α} m / m^{1/α}) ).

Corollary 3.5 is very useful in applications because it says that the combination step in the WASP is independent of the model parametrization. Let f♯Π_m(· | Y_{[j]}) be the j-th subset posterior distribution for f(θ) (j = 1, ..., k); then the WASP of the k subset posterior distributions converges to f(θ₀) at the rate obtained in Theorem 3.4. In practice, we have s_j posterior samples of θ obtained from subset posterior j, denoted as θ_{ji} (i = 1, ..., s_j; j = 1, ..., k). Algorithm 1 estimates an atomic approximation of f♯Π̄_n(· | Y^{(n)}), denoted as f̂♯Π̄_n(· | Y^{(n)}), based on the subset posterior samples f(θ_{ji}) (i = 1, ..., s_j; j = 1, ..., k).
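The following is a minimal, dense sketch of the linear program underlying Algorithm 1 and (9). It is our illustration only: it uses scipy's linprog on a user-supplied grid and assumes an equal number of atoms per subset, rather than the sparse Gurobi-based implementation described below.

```python
import numpy as np
from scipy.optimize import linprog

def wasp_weights(subset_atoms, grid):
    """Weights of the WASP on a fixed grid via a linear program: one transport
    plan T_j per subset couples the grid (unknown weights a) with the empirical
    subset posterior (uniform weights 1/s); all subsets share the same a.
    Dense toy version for small problems; assumes equal atoms per subset."""
    k, g, s = len(subset_atoms), grid.shape[0], subset_atoms[0].shape[0]
    nT = g * s                                   # each plan T_j is g x s, row-major

    # objective: (1/k) sum_j <D_j, T_j>; no direct cost on a
    c = np.zeros(g + k * nT)
    for j, atoms in enumerate(subset_atoms):
        D = ((grid[:, None, :] - atoms[None, :, :]) ** 2).sum(-1)  # squared distances
        c[g + j * nT: g + (j + 1) * nT] = D.ravel() / k

    rows, rhs = [], []
    for j in range(k):
        for u in range(g):                       # grid marginal of T_j equals a
            row = np.zeros(g + k * nT)
            row[u] = -1.0
            row[g + j * nT + u * s: g + j * nT + (u + 1) * s] = 1.0
            rows.append(row); rhs.append(0.0)
        for v in range(s):                       # subset marginal of T_j is 1/s
            row = np.zeros(g + k * nT)
            row[g + j * nT + v: g + (j + 1) * nT: s] = 1.0
            rows.append(row); rhs.append(1.0 / s)
    res = linprog(c, A_eq=np.vstack(rows), b_eq=np.array(rhs), method="highs")
    return res.x[:g]                             # estimated atom weights a_hat

# tiny usage: two subsets of univariate draws, barycenter on a coarse grid
draws = [np.array([[0.0], [1.0], [2.0]]), np.array([[2.0], [3.0], [4.0]])]
grid = np.linspace(0.0, 4.0, 9).reshape(-1, 1)
a_hat = wasp_weights(draws, grid)
```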
The atomic form of the WASP is supported on a grid with mesh-size ε estimated from the subset posterior samples of f(θ). Algorithm 1 estimates the weights of the atoms located on the grid by solving a discrete version of (6). The theoretical properties of discrete barycenters imply that f̂♯Π̄_n(· | Y^{(n)}) is supported only on O(k) elements of the grid; see Theorem 2 in Anderes et al. (2016). We exploit this sparsity by adapting the algorithm in Srivastava et al. (2015) and by using Gurobi (Gurobi Optimization Inc., 2014).

A key assumption in Theorem 3.4 and Corollary 3.5 is that the subset posterior distributions provide a noisy approximation of the full data posterior distribution. This is stated precisely in (12), which shows that the convergence rate of a subset posterior distribution in W_2 distance is obtained by using m as the sample size instead of n. If any of the assumptions (A1)–(A4) fail, then the subset posterior distributions may approximate the full data posterior distribution poorly, which could possibly lead to poor approximation quality for the WASP.

A simple example based on rare events demonstrates this phenomenon. Let Y_1, ..., Y_n be iid Bernoulli random variables with unknown success probability θ ∈ (0, 1). Assumption (A1) is violated if the true parameter θ₀ is very close to 0; that is, observing 1 is a rare event. In our simulation example, we set θ₀ = 10^{-a} for increasing values of a starting at a = 3, so that as a increases, s = ∑_{i=1}^n Y_i decreases and θ₀ gets closer and closer to the boundary of the parameter space. The standard Bayesian approach is to put Jeffreys' prior Beta(0.5, 0.5) on θ and perform inference on θ using Beta(s + 0.5, n − s + 0.5), which leads to a full data posterior that concentrates around the correct value of θ₀ even if θ₀ is small (Figure 2). However, if the data are randomly divided into k = 100 subsets, then a majority of the subsets contain only 0s as θ₀ decreases. As a result, a majority of the subset posterior distributions differ significantly in shape from the full data posterior distribution, leading to a failure of the WASP in approximating the full data posterior distribution because assumption (A1) is severely violated for the smallest values of θ₀ (Figure 2).

Figure 2: Kernel density estimates of the posterior densities of θ in the rare events example, where assumption (A1) fails to hold for the smallest values of θ₀.

We compared WASP with consensus Monte Carlo (CMC) (Scott et al., 2016), semiparametric density product (SDP) (Neiswanger et al., 2014), and variational Bayes (VB). The sample sizes and the number of parameters in our experiments were chosen such that sampling from the full data posterior distribution was computationally feasible. Every sampling algorithm ran for 10,000 iterations. We discarded the first 5,000 samples as burn-in and thinned the chain by collecting every fifth sample. Convergence of the chains to their stationary distributions was confirmed using trace plots. All experiments ran on an Oracle Grid Engine cluster with 2.6GHz 16-core compute nodes. Full data posterior computations were allotted memory resources of 64GB, and all other methods were allotted memory resources of 16GB.

The sampling algorithm for the full data posterior was modified to obtain samples from the subset posteriors in CMC, SDP, and WASP. The sampling algorithms for subset posteriors in CMC and SDP were the same and were based on Equation (2) in Scott et al. (2016). The sampling algorithm for subset posteriors in WASP was based on (5).
Samples from the approximate posterior distributions of θ in CMC, SDP, and WASP were obtained in two steps. First, samples from the subset posteriors of θ were obtained in parallel across the k subsets. Second, the samples of θ from all the subsets were combined using the implementations of CMC and SDP in the parallelMCMC package (Miroshnikov and Conlon, 2014) and using Algorithm 1 for the WASP.

The full data posterior distribution obtained using MCMC served as the benchmark in all our comparisons. Let π(θ | Y^{(n)}) be the density of the full data posterior distribution for θ estimated using sampling and π̂(θ | Y^{(n)}) be the density of an approximate posterior distribution for θ estimated using the WASP or its competitors. We used the following metric based on the total variation distance to compare the accuracy of π̂(θ | Y^{(n)}) in approximating π(θ | Y^{(n)}):
accuracy{π̂(θ | Y^{(n)})} = 1 − (1/2) ∫_Θ | π̂(θ | Y^{(n)}) − π(θ | Y^{(n)}) | dθ.   (14)
The accuracy metric lies in [0, 1] (Faes et al., 2012). The approximation of the full data posterior density by π̂ is poor or excellent if the accuracy metric is close to 0 or 1, respectively. In our experiments, we computed the kernel density estimates of π̂ and π from the posterior samples of θ using the R package KernSmooth (Wand, 2015) and calculated the integral in (14) using numerical approximation.
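A sketch of how (14) can be computed from posterior samples follows; this is our Python stand-in for the KernSmooth-based R computation used in the paper, and the grid size is an arbitrary choice.

```python
import numpy as np
from scipy.stats import gaussian_kde

def accuracy(samples_approx, samples_full, n_grid=512):
    """Total-variation based accuracy (14) between two univariate posteriors,
    estimated from posterior samples via kernel density estimates."""
    lo = min(samples_approx.min(), samples_full.min())
    hi = max(samples_approx.max(), samples_full.max())
    grid = np.linspace(lo, hi, n_grid)
    f_hat = gaussian_kde(samples_approx)(grid)
    f_full = gaussian_kde(samples_full)(grid)
    # accuracy = 1 - (1/2) * integral of |f_hat - f_full|
    return 1.0 - 0.5 * np.trapz(np.abs(f_hat - f_full), grid)
```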
Finite mixtures of Gaussians are widely used for model-based classification, clustering, and density estimation (Fraley and Raftery, 2002). Let n, p, and L be the sample size, the dimension of the observations, and the number of mixture components. If y_i ∈ R^p is the i-th observation (i = 1, ..., n), then the mixture of L Gaussians assumes that any y ∈ {y_1, ..., y_n} is generated from the density
f_mix(y | θ) = ∑_{l=1}^L π_l N_p(y | μ_l, Σ_l),   (15)
where π = (π_1, ..., π_L) lies in the (L − 1)-simplex, μ_l and Σ_l (l = 1, ..., L) are the mean and covariance parameters of a p-variate Gaussian distribution, and θ = {π, μ_1, ..., μ_L, Σ_1, ..., Σ_L}. We set L = 2 and p = 2 and simulated data from (15) using fixed mixture weights π, component means μ_1 and μ_2 with first coordinates 1 and 7, and a common covariance matrix Σ_l = Σ (l = 1, 2) with diagonal entries 1 and 2. We performed 10 simulation replications.

This simple example demonstrates the generality of the WASP in estimating the posterior distribution of functions of θ as described in Corollary 3.5. We defined two nonlinear functions of θ as
ρ_l = (Σ_l)_{12} / {(Σ_l)_{11} (Σ_l)_{22}}^{1/2}, l = 1, 2,  and  g(x) = f_mix{(x, x)^T}, x ∈ R,   (16)
where ρ_l is the correlation of the l-th mixture component and g(x) is the value of the density f_mix in (15) when y = (x, x)^T. Our simulation setup implied that ρ_1 = ρ_2 and that g(x) was bimodal for x ∈ R. We completed the hierarchical model in (15) by specifying independent conjugate priors on π and (μ_l, Σ_l) (l = 1, 2) as
π ∼ Dirichlet(1/2, 1/2),  μ_l | Σ_l ∼ N(0, Σ_l),  Σ_l ∼ Inverse-Wishart(2, I_p),   (17)
where 2 is the prior degrees of freedom and I_p is the scale matrix of the Inverse-Wishart distribution. The posterior samples of θ were obtained using Gibbs sampling (Bishop, 2006), which were used to obtain posterior samples for ρ_1, ρ_2, and g.

We compared WASP with the posterior distributions estimated using CMC, Gibbs sampling, SDP, and VB. We used the VB algorithm developed in Bishop (2006). Two values of k ∈ {5, 10} were used for CMC, SDP, and WASP, and the full data were partitioned into k subsets such that the mixture proportions were preserved in every subset. The approximate posterior distributions of ρ_1, ρ_2, and g(x), x ∈ R, under each method were estimated using the subset posterior samples obtained after modifying the original Gibbs sampler. The sampling algorithm for WASP is described in Section 2.1 of the Supplementary Material.

We compared the accuracy (14) of CMC, SDP, VB, and WASP in approximating the full data posterior distributions of ρ_1, ρ_2, and the point-wise credible bands of g(x) for x ∈ R, denoted as g_{0.025}(x) and g_{0.975}(x). CMC, SDP, and WASP accurately approximated the full data posterior distributions of ρ_1 and ρ_2 for both values of k, but VB underestimated the posterior uncertainty in ρ_1 and ρ_2. CMC, VB, and WASP were very accurate in estimating g_{0.025}(x) and g_{0.975}(x) for x ∈ R, whereas the application of SDP failed due to a numerical error in matrix inversion (Table 1). This provides an empirical verification of Corollary 3.5, showing that the accuracy of the WASP was unaffected by the form of the parameters in the combination step. Theoretical guarantees similar to Corollary 3.5 were unavailable for CMC or SDP, but our numerical results illustrated that a similar result might also hold for these methods in mixture models.

Table 1: Accuracies of the approximate posteriors for ρ_1, ρ_2, and g_{0.025}(x) and g_{0.975}(x) for x ∈ R. The accuracies are averaged over 10 simulation replications. Monte Carlo errors are in parentheses. CMC, consensus Monte Carlo; SDP, semiparametric density product; VB, variational Bayes; WASP, Wasserstein posterior.

                 ρ_1           ρ_2           g_{0.025}     g_{0.975}
VB               0.77 (0.31)   0.76 (0.29)   0.99 (0.00)   0.99 (0.00)
CMC (k = 5)      0.97 (0.01)   0.96 (0.01)   0.99 (0.00)   0.99 (0.00)
CMC (k = 10)     0.96 (0.01)   0.96 (0.01)   0.99 (0.00)   0.99 (0.00)
SDP (k = 5)      0.97 (0.01)   0.95 (0.01)   -             -
SDP (k = 10)     0.96 (0.01)   0.96 (0.01)   -             -
WASP (k = 5)     0.97 (0.01)   0.97 (0.01)   0.99 (0.00)   0.99 (0.00)
WASP (k = 10)    0.95 (0.01)   0.96 (0.01)   0.99 (0.00)   0.99 (0.00)

Linear mixed effects models are extensively used in extending linear regression to account for longitudinal and nested dependence structures. Let n, s, and s_i be the sample size, the total number of observations, and the total number of observations for sample i (i = 1, ..., n), so that s = ∑_{i=1}^n s_i. Suppose X_i ∈ R^{s_i × p} and Z_i ∈ R^{s_i × r} include predictors in the fixed and random effects components, respectively. Letting y_i ∈ R^{s_i} be the response for sample i, the linear mixed effects model assumes that
y_i | β, u_i, τ² ∼ N_{s_i}(X_i β + Z_i u_i, τ² I_{s_i}),  u_i ∼ N_r(0, Σ),  (i = 1, ..., n),   (18)
where u_i ∈ R^r is the random effect for sample i with mean 0 and r × r covariance Σ, β ∈ R^p denotes the fixed effects, and τ² is the error variance. The model parameters are θ = {β, Σ, τ²}.

We simulated data for a fixed n and s and varying p and r. We chose two values of (p, r) ∈ {(4, 3), (80, 6)}, fixed n and s with s = 100,000, and randomly assigned the s observations to the n samples. The two choices of (p, r) ensured that the number of unknown parameters in β and Σ was 10 and 100 in the former and latter cases. The entries of X_i and Z_i were set to 1 or −1 with equal probability for every i. We fixed the entries of β as −1 and 2 alternately and τ² = 1. The random effects covariance matrix was Σ = diag(√1, ..., √r) R diag(√1, ..., √r), where diag(a) is a diagonal matrix with a along the diagonal and R is a correlation matrix with 1 along the diagonal. We set R = R₀ if r = 3 and R = bdiag(R₀, R₀) if r = 6, where bdiag(A, B) is a block-diagonal matrix with A, B along the diagonal and (R₀)_{ii} = 1 (i = 1, 2, 3). The matrix R included negative, positive, and small to moderate strength correlations (Kim et al., 2013). We used this setup to simulate data from (18) and performed 10 replications.

We used the HMC algorithm in Stan for sampling from the full data and subset posterior distributions. The full data posterior computations were feasible for the chosen values of n and s, and posterior samples were obtained after completing the hierarchical model in (18) by using the default weakly informative priors for β, Σ, and τ² in Stan. Two values of k ∈ {10, 20} were used for CMC, SDP, and WASP, and the n samples were randomly partitioned into k subsets. The sampling algorithms for subset posterior distributions for the three methods were implemented in Stan, and posterior samples of θ were obtained in parallel across the k subsets. This was followed by a combination step to estimate the approximate posterior distributions for the three methods. The sampling algorithm for WASP is described in Section 2.2 of the Supplementary Material. Stochastic gradient Langevin dynamics (SGLD; Welling and Teh 2011) has proven to be a successful stochastic version of MCMC in mixture and regression models but has not been extensively tested on linear mixed effects models in which multiple observations are available on a subject. We compared Stan's HMC and SGLD with batch sizes 2000 and 4000, two step sizes, and a fixed number of iterations.
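For concreteness, the following sketch generates data in the spirit of the simulation design described above for model (18). It is an illustration only: the correlation values in R, the per-subject counts s_i, and n are our assumptions and do not reproduce the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(2)
# Sketch of simulating from model (18); p and r follow the text, while n, s_i,
# and the off-diagonal entries of R are illustrative assumptions.
n, p, r = 5000, 4, 3
R = np.array([[1.0, -0.5, 0.3],
              [-0.5, 1.0, 0.6],
              [0.3, 0.6, 1.0]])                     # assumed correlation matrix
D = np.diag(np.sqrt(np.arange(1, r + 1)))
Sigma = D @ R @ D                                   # random effects covariance
beta = np.tile([-1.0, 2.0], p)[:p]                  # fixed effects -1, 2 alternately
tau2 = 1.0

data = []
for i in range(n):
    s_i = rng.integers(10, 30)                      # observations per subject (assumed)
    X_i = rng.choice([-1.0, 1.0], size=(s_i, p))
    Z_i = rng.choice([-1.0, 1.0], size=(s_i, r))
    u_i = rng.multivariate_normal(np.zeros(r), Sigma)
    y_i = X_i @ beta + Z_i @ u_i + rng.normal(scale=np.sqrt(tau2), size=s_i)
    data.append((y_i, X_i, Z_i))
```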
We compared Stan’sHMC and SGLD with batch sizes 2000, 4000, step sizes − , − and iterations.We compared the accuracy (14) of CMC, SDP, SGLD, VB, and WASP in approximating themarginal posterior distributions of fixed effects, variances and covariances of random effects, andthe joint posterior distributions of three pairs of covariances of random effects. We used the stream-lined algorithm (SA; Lee and Wand 2016) and automatic differentiation variational inference inStan (ADVI; Kucukelbir et al. 2015) for estimating the VB posteriors for β and Σ . All methodsexcept SGLD were significantly faster than the full data posterior distribution, with SA being thefastest. CMC, SA, SDP, and WASP provided accurate approximations of the marginal posteriordistributions of fixed effects and covariances of random effects. Unlike Stan’s HMC, SGLD’s perfor-mance was sensitive to the choices of step size and batch size. SGLD failed to converge for all batchsizes when the step size was − , and its accuracy increased with batch size. The performance ofADVI and SGLD deteriorated quickly as r increased from to . The accuracy of CMC and SDPin approximating the marginal posterior distributions of variances of random effects depended on k and r . ADVI and SA provided a poor approximation for the posterior variances of random effects.In all these cases, the accuracy of WASP was stable for every k and r (Tables 2 and 3). All methodsexcept SGLD showed similar accuracies in approximating the true joint posterior distributions ofthree pairs of covariances of random effects. The differences in accuracies of CMC, SA, SDP, andWASP for different values of k and r were due to the differences in numerical approximation of (14)(Tables 4 and 5 and Figures 3 and 4); see Table 1 in the Supplementary Material.The accuracy of CMC, SDP, and WASP decreased when k increased from to because subsetposterior distributions conditioned on a smaller fraction of the data. This provided an empiricalverification of Theorem 3.4 for the WASP. Our numerical results illustrated that a similar resultmight also hold for CMC and SDP. The stable performance of WASP compared to that of CMC andSDP in the approximation of the posterior distributions of variances of random effects showed thatthe validity of the normal approximation for subset posterior distributions was crucial in obtainingaccurate approximations of full data posterior using CMC and SDP. On the other hand, WASPresults were free of any such assumptions and were valid for any nonlinear function of µ and Σ ; seeCorollary 3.5. 15able 2: Accuracies of the approximate posteriors for variances in (18). The accuracies are averagedover 10 simulation replications and across all diagonal elements of Σ . Monte Carlo errors arein parenthesis. ADVI, automatic differentiation variational inference; SA, streamlined algorithm;SGLD, stochastic gradient Langevin dynamics with batch size in parenthesis; CMC, consensusMonte Carlo; SDP, semiparametric density product; WASP, Wasserstein posterior r = 3 r = 6 ADVI 0.48 (0.31) 0.09 (0.23)SA 0.26 (0.19) 0.34 (0.22)SGLD (2000) 0.68 (0.08) 0.73 (0.08)SGLD (4000) 0.69 (0.09) 0.72 (0.08) k = 10 k = 20 k = 10 k = 20 CMC 0.93 (0.03) 0.91 (0.05) 0.89 (0.05) 0.80 (0.08)SDP 0.92 (0.06) 0.86 (0.07) 0.84 (0.10) 0.77 (0.14)WASP 0.97 (0.01) 0.97 (0.01) 0.97 (0.01) 0.97 (0.01)
Table 3: Accuracies of the approximate posteriors for covariances in (18). The accuracies are averaged over 10 simulation replications and across all off-diagonal elements of Σ. Monte Carlo errors are in parentheses. ADVI, automatic differentiation variational inference; SA, streamlined algorithm; SGLD, stochastic gradient Langevin dynamics with batch size in parentheses; CMC, consensus Monte Carlo; SDP, semiparametric density product; WASP, Wasserstein posterior.

                 r = 3         r = 6
ADVI             0.69 (0.23)   0.49 (0.29)
SA               0.94 (0.02)   0.94 (0.02)
SGLD (2000)      0.07 (0.11)   0.13 (0.09)
SGLD (4000)      0.07 (0.11)   0.12 (0.09)
CMC (k = 10)     0.94 (0.03)   0.94 (0.03)
CMC (k = 20)     0.91 (0.05)   0.92 (0.05)
SDP (k = 10)     0.92 (0.04)   0.89 (0.07)
SDP (k = 20)     0.89 (0.06)   0.87 (0.10)
WASP (k = 10)    0.97 (0.01)   0.97 (0.01)
WASP (k = 20)    0.97 (0.01)   0.96 (0.01)

Figure 3: Kernel density estimates of the posterior densities of three covariance pairs when r = 3 in (18), where σ_{ab}, σ_{cd} on every panel represents the two-dimensional posterior density of (σ_{ab}, σ_{cd}). ADVI, automatic differentiation variational inference; SGLD, stochastic gradient Langevin dynamics with batch size in parentheses; CMC, consensus Monte Carlo; MCMC, Markov chain Monte Carlo; SA, streamlined algorithm; SDP, semiparametric density product; WASP, Wasserstein posterior.

Table 4: Accuracies of the approximate two-dimensional joint posteriors for the covariances of random effects when r = 3 in (18). The accuracies are averaged over 10 simulation replications. Monte Carlo errors are in parentheses. ADVI, automatic differentiation variational inference; SA, streamlined algorithm; SGLD, stochastic gradient Langevin dynamics with batch size in parentheses; CMC, consensus Monte Carlo; SDP, semiparametric density product; WASP, Wasserstein posterior.

                 Pair 1        Pair 2        Pair 3
ADVI             0.53 (0.28)   0.62 (0.14)   0.49 (0.25)
SA               0.91 (0.01)   0.91 (0.01)   0.92 (0.01)
SGLD (2000)      0.03 (0.01)   0.01 (0.00)   0.02 (0.01)
SGLD (4000)      0.03 (0.01)   0.01 (0.00)   0.02 (0.01)
CMC (k = 10)     0.88 (0.05)   0.88 (0.04)   0.91 (0.02)
CMC (k = 20)     0.79 (0.06)   0.82 (0.07)   0.85 (0.04)
SDP (k = 10)     0.90 (0.03)   0.90 (0.03)   0.92 (0.02)
SDP (k = 20)     0.89 (0.03)   0.87 (0.05)   0.89 (0.04)
WASP (k = 10)    0.93 (0.01)   0.93 (0.01)   0.94 (0.01)
WASP (k = 20)    0.94 (0.01)   0.94 (0.01)   0.94 (0.01)
Table 5: Accuracies of the approximate two-dimensional joint posteriors for the covariances of random effects when r = 6 in (18). The accuracies are averaged over 10 simulation replications. Monte Carlo errors are in parentheses. ADVI, automatic differentiation variational inference; SA, streamlined algorithm; SGLD, stochastic gradient Langevin dynamics with batch size in parentheses; CMC, consensus Monte Carlo; SDP, semiparametric density product; WASP, Wasserstein posterior.

                 Pair 1        Pair 2        Pair 3
ADVI             0.06 (0.16)   0.08 (0.22)   0.08 (0.17)
SA               0.89 (0.02)   0.90 (0.02)   0.91 (0.02)
SGLD (2000)      0.02 (0.01)   0.01 (0.01)   0.01 (0.01)
SGLD (4000)      0.02 (0.01)   0.01 (0.01)   0.01 (0.01)
CMC (k = 10)     0.88 (0.05)   0.88 (0.04)   0.90 (0.04)
CMC (k = 20)     0.76 (0.10)   0.78 (0.07)   0.83 (0.07)
SDP (k = 10)     0.90 (0.03)   0.90 (0.04)   0.90 (0.03)
SDP (k = 20)     0.86 (0.05)   0.86 (0.04)   0.87 (0.04)
WASP (k = 10)    0.93 (0.02)   0.94 (0.01)   0.94 (0.01)
WASP (k = 20)    0.94 (0.01)   0.94 (0.01)   0.94 (0.02)
200 200 k = 10, r = 6 s , s −0.2−0.1 0.0 0.1 0.2 0.3 −0.6 −0.5 −0.4 −0.3 −0.2 −0.1 0.0 k = 10, r = 6 s , s −0.2−0.1 0.0 0.1 0.2 0.30.0 0.1 0.2 0.3 0.4 0.5 k = 10, r = 6 s , s
200 200 k = 20, r = 6 s , s −0.2−0.1 0.0 0.1 0.2 0.3 −0.6 −0.5 −0.4 −0.3 −0.2 −0.1 0.0 k = 20, r = 6 s , s ADVICMCMCMC SASDPSGLD (2000) SGLD (4000)WASP −0.2−0.1 0.0 0.1 0.2 0.30.0 0.1 0.2 0.3 0.4 0.5 k = 20, r = 6 s , s Figure 4: Kernel density estimates of the posterior densities of three covariance pairs when r = 6 in (18), where σ ab , σ cd on every panel represents the two-dimensional posterior density of ( σ ab , σ cd ) .ADVI, automatic differentiation variational inference; SGLD, stochastic gradient Langevin dynam-ics with batch size in parenthesis; CMC, consensus Monte Carlo; MCMC, Markov chain MonteCarlo; SA, streamlined algorithm; SDP, semiparametric density product; WASP, Wasserstein pos-terior. We use probabilistic parafac model as a representative example for nonparametric density estimationusing WASP. Probabilistic parafac is an approach for nonparametric Bayes modeling of joint de-17 pr( x = )k = 5 MCMCCMCSDPWASP 0100200300400500600 0.346 0.348 0.350 0.352 0.354 0.356 pr( x = )k = 5 pr( x = )k = 5 pr( x = )k = 5 pr( x = )k = 10 MCMCCMCSDPWASP 0100200300400500600 0.346 0.348 0.350 0.352 0.354 0.356 pr( x = )k = 10 pr( x = )k = 10 pr( x = )k = 10 Figure 5: Kernel density estimates of the marginal posterior densities for dimensions 2, 4, 12, and14. MCMC, Gibbs sampling algorithm of Dunson and Xing (2009); CMC, consensus Monte Carlo;SDP, semiparametric density product; VB, variational Bayes; WASP, Wasserstein posteriorpendence in multivariate categorical data (Dunson and Xing, 2009). Let x i = ( x i , . . . , x ij , . . . , x ip ) be the data from sample i , where x ij has d j possible categorical values in { , . . . , d j } ( j = 1 , . . . , p ).The hierarchical model for x ij ( i = 1 , . . . , n ; j = 1 , . . . , p ) is x ij | (cid:16) ψ ( j ) h (cid:17) ∞ h =1 , . . . , (cid:16) ψ ( j ) hd j (cid:17) ∞ h =1 , z i ∼ Multinomial ( { , . . . , d j } , ψ ( j ) z i , . . . , ψ ( j ) z i d j ) ,z i ∼ ∞ (cid:88) h =1 V h (cid:89) l Action category included Action, Adventure, Fantasy, Horror, Sci-Fi, and Thriller genres; Children category included Animation and Children genres; Comedy category included Comedygenre; and Drama category included Crime, Documentary, Drama, Film-Noir, Musical, Mystery,Romance, War, and Western genres. If a movie belonged to multiple genres, then movie categoryscores were fractions proportional to the number of genres in the respective categories. Second, popularity predictor was defined as logit { ( l + 0 . / ( n + 1 . } , where l and n respectively were thenumber of users who liked and rated the movie in 30 most recent observations for the movie and logit( x ) = log x − x . Third, previous predictor was defined to be 1 if the user liked the previousmovie and 0 otherwise. We used Action , Children − Action , Comedy − Action , Drama − Action ,19able 7: Accuracies of the approximate posteriors for variances in (18). The accuracies are averagedover 10 replications. Monte Carlo errors are in parenthesis. 
Table 7: Accuracies of the approximate posteriors for variances in (18). The accuracies are averaged over 10 replications. Monte Carlo errors are in parentheses. ADVI, automatic differentiation variational inference; SA, streamlined algorithm; SGLD, stochastic gradient Langevin dynamics with batch size in parentheses; CMC, consensus Monte Carlo; SDP, semiparametric density product; WASP, Wasserstein posterior.

             σ²_Action    σ²_Children−Action  σ²_Comedy−Action  σ²_Drama−Action  σ²_Popularity  σ²_Previous
ADVI         0.06 (0.14)  0.33 (0.30)         0.16 (0.23)       0.00 (0.00)      0.00 (0.00)    0.00 (0.00)
SA           0.00 (0.00)  0.00 (0.00)         0.00 (0.00)       0.00 (0.00)      0.00 (0.00)    0.00 (0.00)
SGLD (2000)  0.10 (0.06)  0.06 (0.03)         0.05 (0.05)       0.08 (0.04)      0.10 (0.00)    0.10 (0.07)
SGLD (4000)  0.07 (0.06)  0.06 (0.03)         0.02 (0.06)       0.08 (0.04)      0.10 (0.00)    0.08 (0.07)
CMC          0.28 (0.13)  0.01 (0.01)         0.01 (0.01)       0.14 (0.09)      0.74 (0.10)    0.22 (0.10)
SDP          0.05 (0.03)  0.00 (0.00)         0.00 (0.00)       0.01 (0.01)      0.35 (0.10)    0.03 (0.03)
WASP         0.92 (0.04)  0.93 (0.02)         0.87 (0.06)       0.85 (0.08)      0.92 (0.03)    0.93 (0.05)

Following the setup in Section 4.3, we compared the performance of WASP with ADVI, CMC, SA, SGLD with batch sizes 2000 and 4000, step size 10− and iterations, and SDP, using the full data posterior distribution as the benchmark. Sampling using the HMC algorithm in Stan was prohibitively slow for the full data posterior distribution, so we first randomly selected 5000 users and then randomly selected 20 ratings for every user. This resulted in a data set with 100,000 ratings. We randomly split the users into 10 training data sets such that the ratings for any user belonged to the same training data set. To compute the approximate posteriors using CMC, SDP, and WASP, we set k = 10 and randomly partitioned the users into k subsets such that each subset contained all the ratings for a user. This setup was replicated for every training data set.
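The user-level partitioning described above is simple to implement: shuffle the users, assign each user to one of the k subsets, and route all of a user's ratings to that subset. The sketch below is a minimal illustration of this idea, not the code used in the paper; the function name and data layout are ours.

import random

def partition_by_user(ratings, k, seed=0):
    """Split ratings into k subsets so that all ratings of a user stay together.

    ratings: iterable of (user_id, movie_id, liked) tuples.
    Returns a list of k lists of ratings.
    """
    users = sorted({u for u, _, _ in ratings})
    rng = random.Random(seed)
    rng.shuffle(users)
    subset_of_user = {u: i % k for i, u in enumerate(users)}  # round-robin after shuffling
    subsets = [[] for _ in range(k)]
    for u, movie, liked in ratings:
        subsets[subset_of_user[u]].append((u, movie, liked))
    return subsets

# Toy example: five ratings from four users split into k = 2 subsets.
toy = [(1, "a", 1), (1, "b", 0), (2, "a", 1), (3, "c", 1), (4, "b", 0)]
for j, subset in enumerate(partition_by_user(toy, k=2)):
    print(j, subset)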
WASP performed better than its competitors in approximating the full data posterior distributions of the variances and covariances of the random effects. As in the simulation results of Section 4.3, ADVI, CMC, SA, SDP, and WASP were significantly faster than sampling from the full data posterior distribution, with SA being the fastest and SGLD the slowest. CMC, SDP, and WASP showed excellent performance in approximating the full data posterior distributions of the fixed effects. WASP outperformed its competitors in approximating the full data posterior distributions of the variances, covariances, and pairs of covariances of the random effects (Tables 7, 8, and 9). ADVI, SA, and SGLD significantly under-performed in estimating the posterior distributions of the fixed effects and of the covariance matrix of the random effects. The accuracy of the marginals in CMC and SDP depended on the magnitude of the covariances, with both methods showing excellent accuracy for covariances of low magnitude. The accuracies of the two-dimensional joint distributions in CMC and SDP were poor because their approximations concentrated at locations different from the full data posterior (Figure 6). Except for the poor performance of CMC, SA, and SDP in approximating the posterior distributions of the variances and covariances of the random effects, our real data results agreed with our simulation results. We concluded that WASP performed better than its competitors in the MovieLens data analysis.

We have presented WASP as an approach for computationally efficient approximation of the posterior distributions of parameters and their functions when the sample size is large. WASP allows extensions of existing samplers to massive data with minimal modifications and is easily implemented using probabilistic programming languages such as Stan. Theoretically, we have shown that the rate of convergence of WASP to the Dirac measure centered at the true parameter value in the W2 distance matches the optimal parametric rate up to a logarithmic factor if the number of subsets increases slowly with the size of the full data set. Empirically, we demonstrated that results from WASP and MCMC agree closely in several widely different examples, while WASP enables massive speed-ups in computational time.
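For a one-dimensional parameter, or a one-dimensional functional of the parameters, the W2 barycenter of equally weighted subset posteriors has a quantile function equal to the average of the subset quantile functions, so the combination step can be illustrated in a few lines of code. The sketch below uses this univariate special case; it is not the linear-programming algorithm of Srivastava et al. (2015) used in the experiments, and the function name and toy data are ours.

import numpy as np

def wasp_univariate(subset_samples, n_grid=1000):
    """W2 barycenter of univariate subset posteriors via quantile averaging.

    subset_samples: list of one-dimensional arrays of posterior draws, one per subset.
    Returns n_grid points whose empirical distribution approximates the barycenter;
    its quantile function is the average of the subset quantile functions.
    """
    probs = (np.arange(n_grid) + 0.5) / n_grid
    quantiles = np.stack([np.quantile(draws, probs) for draws in subset_samples])
    return quantiles.mean(axis=0)

# Toy check: subset posteriors that differ only in location combine to a posterior
# centered at the average location with the common spread preserved.
rng = np.random.default_rng(1)
subsets = [rng.normal(loc=mu, scale=1.0, size=5000) for mu in (-0.3, 0.0, 0.4)]
barycenter_draws = wasp_univariate(subsets)
print(round(barycenter_draws.mean(), 2), round(barycenter_draws.std(), 2))  # roughly 0.03 and 1.0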
Table 8: Accuracies of the approximate posteriors for covariances in (18). The accuracies are averaged over 10 replications. Monte Carlo errors are in parentheses. The subscripts 1, . . . , 6 are used for the predictors Action, Children−Action, Comedy−Action, Drama−Action, popularity, and previous. ADVI, automatic differentiation variational inference; SA, streamlined algorithm; SGLD, stochastic gradient Langevin dynamics with batch size in parentheses; CMC, consensus Monte Carlo; SDP, semiparametric density product; WASP, Wasserstein posterior.

             σ            σ            σ            σ            σ            σ            σ            σ
ADVI         0.15 (0.30)  0.25 (0.26)  0.14 (0.16)  0.32 (0.12)  0.06 (0.09)  0.00 (0.00)  0.18 (0.20)  0.66 (0.15)
SA           0.00 (0.00)  0.00 (0.00)  0.00 (0.00)  0.00 (0.00)  0.00 (0.00)  0.00 (0.00)  0.00 (0.00)  0.00 (0.00)
SGLD (2000)  0.08 (0.03)  0.19 (0.10)  0.18 (0.08)  0.18 (0.11)  0.23 (0.09)  0.14 (0.00)  0.14 (0.01)  0.14 (0.10)
SGLD (4000)  0.08 (0.02)  0.16 (0.10)  0.14 (0.08)  0.12 (0.08)  0.20 (0.08)  0.14 (0.00)  0.13 (0.01)  0.11 (0.10)
CMC          0.06 (0.03)  0.16 (0.04)  0.18 (0.04)  0.83 (0.07)  0.33 (0.13)  0.01 (0.01)  0.07 (0.02)  0.80 (0.04)
SDP          0.01 (0.01)  0.08 (0.03)  0.07 (0.02)  0.75 (0.06)  0.14 (0.09)  0.00 (0.00)  0.02 (0.01)  0.73 (0.08)
WASP         0.95 (0.02)  0.91 (0.04)  0.91 (0.05)  0.94 (0.03)  0.90 (0.07)  0.89 (0.07)  0.85 (0.08)  0.93 (0.03)

             σ            σ            σ            σ            σ            σ            σ
ADVI         0.47 (0.22)  0.50 (0.22)  0.64 (0.11)  0.62 (0.23)  0.64 (0.18)  0.49 (0.29)  0.42 (0.11)
SA           0.01 (0.00)  0.01 (0.00)  0.01 (0.00)  0.01 (0.00)  0.00 (0.00)  0.00 (0.00)  0.00 (0.00)
SGLD (2000)  0.11 (0.10)  0.14 (0.10)  0.16 (0.07)  0.10 (0.11)  0.14 (0.12)  0.12 (0.10)  0.15 (0.09)
SGLD (4000)  0.07 (0.10)  0.11 (0.09)  0.14 (0.07)  0.03 (0.11)  0.10 (0.11)  0.08 (0.10)  0.14 (0.08)
CMC          0.66 (0.09)  0.65 (0.07)  0.76 (0.08)  0.71 (0.05)  0.82 (0.04)  0.61 (0.11)  0.55 (0.09)
SDP          0.59 (0.11)  0.62 (0.06)  0.64 (0.09)  0.66 (0.08)  0.66 (0.09)  0.56 (0.14)  0.55 (0.13)
WASP         0.91 (0.05)  0.94 (0.05)  0.93 (0.03)  0.91 (0.04)  0.93 (0.04)  0.93 (0.04)  0.94 (0.04)

Table 9: Accuracies of the approximate two-dimensional joint posteriors for the covariances of random effects. The accuracies are averaged over 10 replications. Monte Carlo errors are in parentheses. The subscripts 1, . . . , 6 are used for the predictors Action, Children−Action, Comedy−Action, Drama−Action, popularity, and previous. ADVI, automatic differentiation variational inference; SA, streamlined algorithm; SGLD, stochastic gradient Langevin dynamics with batch size in parentheses; CMC, consensus Monte Carlo; SDP, semiparametric density product; WASP, Wasserstein posterior.

             (σ, σ)       (σ, σ)       (σ, σ)       (σ, σ)
ADVI         0.03 (0.06)  0.03 (0.07)  0.02 (0.06)  0.05 (0.11)
SA           0.18 (0.04)  0.22 (0.07)  0.31 (0.03)  0.31 (0.02)
SGLD (2000)  0.01 (0.02)  0.01 (0.02)  0.01 (0.01)  0.01 (0.01)
SGLD (4000)  0.01 (0.02)  0.01 (0.02)  0.01 (0.01)  0.01 (0.01)
CMC          0.05 (0.02)  0.04 (0.02)  0.06 (0.03)  0.05 (0.02)
SDP          0.05 (0.02)  0.04 (0.02)  0.06 (0.03)  0.05 (0.02)
WASP         0.88 (0.03)  0.88 (0.03)  0.88 (0.02)  0.86 (0.06)

Figure 6: Kernel density estimates of the posterior densities of four covariance pairs, where the label (σ_ab, σ_cd) on each panel denotes the two-dimensional posterior density of (σ_ab, σ_cd). ADVI, automatic differentiation variational inference; SGLD, stochastic gradient Langevin dynamics with batch size in parentheses; CMC, consensus Monte Carlo; MCMC, Markov chain Monte Carlo; SA, streamlined algorithm; SDP, semiparametric density product; WASP, Wasserstein posterior.

We plan to explore several extensions of WASP in the future. First, the combination of subset posterior distributions using WASP and the proof of the convergence rate for the WASP in Theorem 3.4 are valid even if the data in different subsets are dependent; however, the independence assumption within each subset is required in the proof of (12) in Theorem 3.4 and in our justification of stochastic approximation. Currently, it is unclear how to extend stochastic approximation to cases where the likelihood is unavailable in a product form. This extension is crucial for proper uncertainty quantification outside of settings in which the observations are conditionally independent given latent variables. Second, it is unclear how to optimally choose k in practice; a larger k reduces computing time when abundant processors are available, but choosing k too large may lead to increased statistical error (see Theorem 3.4). Our numerical experiments show that the accuracy of WASP is robust to the choice of k if all the subset sizes are moderately large relative to the number of parameters. In addition, it is of interest to study more deeply the impact of partitioning schemes and to develop approaches that handle not only large sample sizes but also high-dimensional data. A possibility in this regard is to combine WASP with approximate MCMC (Johndrow et al., 2015).

Acknowledgement

Volkan Cevher and Quoc Tran-Dinh proposed and implemented the linear program for calculating the Wasserstein barycenter described in Srivastava et al. (2015). All experiments were based on a modified version of Tran-Dinh's Matlab and Gurobi code for estimating the Wasserstein barycenter. Jack Baker provided extensive help in implementing the SGLD algorithm. The code used in the experiments is available at https://github.com/blayes/WASP.
Cheng Li’s work was partiallysupported by National University of Singapore start-up grant R155000172133.22 Proofs of Theorems A.1 Proof of Theorem 3.1 If E P ( n ) θ represents the expectation with respect to P ( n ) θ , then E P ( n ) θ (cid:2) W { N p ( µ, V ) , N p ( µ, V ) } (cid:3) = E P ( n ) θ (cid:107) µ − µ (cid:107) + tr (cid:110) V + V − V / V V / ) / (cid:111) . (21)First, we find the asymptotic order of E P ( n ) θ (cid:107) µ − µ (cid:107) in (21). Define A = ( X T Σ − X ) − X T Σ − , B = k − (cid:2) ( X T Σ − X ) − X T Σ − , · · · , ( X Tk Σ − k X k ) − X Tk Σ − k (cid:3) , and C = A − B . After some algebra, we have that AX = I p , BX = I p , where I p is a p × p identitymatrix, and (cid:107) µ − µ (cid:107) = (cid:107) Cy (cid:107) , E P ( n ) θ (cid:107) µ − µ (cid:107) = E P ( n ) θ ( y T ) C T CE P ( n ) θ ( y ) + tr( C Σ C T ) . Since E P ( n ) θ ( y ) = Xθ and CX = AX − BX = I p − I p = 0 , E P ( n ) θ (cid:107) µ − µ (cid:107) = tr( C Σ C T ) . Expanding C Σ C T , we get C = ( X T Σ − X ) − X T Σ − − k − (cid:2) ( X T Σ − X ) − X T Σ − , · · · , ( X Tk Σ − k X k ) − X Tk Σ − k (cid:3) ,C T = Σ − X ( X T Σ − X ) − − k − Σ − X ( X T Σ − X ) − ... Σ − k X k ( X Tk Σ − k X k ) − , tr( C Σ C T ) = tr { ( X T Σ X ) − } + k − k (cid:88) j =1 tr (cid:8) ( X Tj Σ j X j ) − (cid:9) − D ) , where D = k − (cid:2) ( X T Σ X ) − X T , · · · , ( X Tk Σ k X k ) − X Tk (cid:3) Σ − X ( X T Σ − X ) − = k − k (cid:88) j =1 ( X Tj Σ − j X j ) − X Tj Σ − j X j ( X T Σ − X ) − = ( X T Σ − X ) − because Σ is diagonal. We use the above display to obtain that E P ( n ) θ (cid:107) µ − µ (cid:107) = tr( C Σ C T ) = 1 k k (cid:88) j =1 tr (cid:110) ( X Tj Σ − j X j ) − (cid:111) − tr (cid:8) ( X T Σ − X ) − (cid:9) , = 1 km tr (cid:110) k k (cid:88) j =1 (cid:16) m X Tj Σ − j X j (cid:17) − (cid:111) − n tr (cid:110)(cid:0) n X T Σ − X (cid:1) − (cid:111) . Our assumptions and continuity of the matrix inverse for positive definite matrices imply that thereare exist positive a (cid:48) n = o (1) , b (cid:48) m = o (1) , such that Ω − − a (cid:48) n I p ≺ (cid:0) n X T Σ − X (cid:1) − ≺ Ω − + a (cid:48) n I p , − − b (cid:48) m I p ≺ (cid:16) m X Tj Σ − j X j (cid:17) − ≺ Ω − + b (cid:48) m I p . This implies that the previous display reduces to E P ( n ) θ (cid:107) µ − µ (cid:107) ≤ p ( b (cid:48) m + a (cid:48) n ) /n = o ( n − ) , (22)where the equality follow because p is fixed.We now find the asymptotic order of the traces of the covariance matrices in (21). Followingthe same arguments used to derive (22), the full data and j th subset posterior covariance matricessatisfy n (cid:0) Ω − − a (cid:48) n I p (cid:1) ≺ V = 1 n (cid:18) n X T Σ − X (cid:19) − ≺ n (cid:0) Ω − + a (cid:48) n I p (cid:1) , n (cid:0) Ω − − b (cid:48) m I p (cid:1) ≺ V j = 1 km (cid:18) m X Tj Σ − j X j (cid:19) − ≺ n (cid:0) Ω − + b (cid:48) m I p (cid:1) . (23)Let M j = (cid:26) V / km (cid:16) m X Tj Σ − j X j (cid:17) − V / (cid:27) / . Then (23) implies that − b (cid:48) m V ≺ nM j − V / Ω − V / = nV / (cid:0) V j − n − Ω − (cid:1) V / ≺ b (cid:48) m V . (24)From the first inequality in (24), we have (cid:16) V / Ω − V / (cid:17) / ≺ (cid:0) nM j + b (cid:48) m V (cid:1) / ≺ n / M j + b (cid:48) / m V / . And similarly the second inequality in (24) implies that n / M j ≺ (cid:16) V / Ω − V / + b (cid:48) m V (cid:17) / ≺ (cid:16) V / Ω − V / (cid:17) / + b (cid:48) / m V / . Therefore (cid:16) V / Ω − V / (cid:17) / − b (cid:48) / m V / ≺ n / M j ≺ (cid:16) V / Ω − V / (cid:17) / + b (cid:48) / m V / . 
Using this relation and the definition of V , we have that (cid:16) V / Ω − V / (cid:17) / − b (cid:48) / m V / ≺ n / V = 1 k k (cid:88) j =1 n / M j ≺ (cid:16) V / Ω − V / (cid:17) / + b (cid:48) / m V / . (25)In (25), we take the square of n / V , apply the inequality ( A + A ) ≺ A + A ) for two genericpositive definite matrices A , A , and obtain that nV ≺ V / Ω − V / + 2 b (cid:48) m V ,nV (cid:31) V / Ω − V / − b (cid:48) m V . Multiplying by V − / on the left and right hand sides yields, nV ≺ − + 2 b (cid:48) m I p , V (cid:31) 12 Ω − − b (cid:48) m I p . (26)Notice that b (cid:48) m = o (1) , Ω is a constant positive definite matrix, and V is a positive definite matrix.Clearly, (26) forces nV to be an order-1 matrix. Now we take the square of n / V in (25) again andobtain that nV ≺ V / Ω − V / + b (cid:48) m V + b (cid:48) / m (cid:16) V / Ω − V / (cid:17) / V / + b (cid:48) / m V / (cid:16) V / Ω − V / (cid:17) / ,nV (cid:31) V / Ω − V / + b (cid:48) m V − b (cid:48) / m (cid:16) V / Ω − V / (cid:17) / V / − b (cid:48) / m V / (cid:16) V / Ω − V / (cid:17) / . Multiplying by V − / on the left and right hand sides yields, nV ≺ Ω − + b (cid:48) m I p + b (cid:48) / m V − / (cid:16) V / Ω − V / (cid:17) / + b (cid:48) / m (cid:16) V / Ω − V / (cid:17) / V − / ,nV (cid:31) Ω − + b (cid:48) m I p − b (cid:48) / m V − / (cid:16) V / Ω − V / (cid:17) / − b (cid:48) / m (cid:16) V / Ω − V / (cid:17) / V − / . (27)Since nV is an order-1 matrix, we have that b (cid:48) m V − / (cid:16) V / Ω − V / (cid:17) / = o (1) , b (cid:48) m (cid:16) V / Ω − V / (cid:17) / V − / = o (1) . Hence (23) and (27) reduce to n { Ω − − o (1) I p } ≺ V j ≺ n { Ω − + o (1) I p } , n (Ω − − o (1) I p ) ≺ V ≺ n { Ω − + o (1) I p } . This implies that tr( V − V ) = o ( n − ) , (28)where the last equality follows because p is fixed.Finally, we find the asymptotic order of the variance term in (21). The display before (28)implies that for some positive c n = o (1) , V / V V / ≺ n { Ω − / + o (1) I p }{ Ω − + o (1) I p }{ Ω − / + o (1) I p }≺ n [Ω − + c n I p ] ,V / V V / (cid:31) n { Ω − / − o (1) I p }{ Ω − − o (1) I p }{ Ω − / − o (1) I p }(cid:31) n [Ω − − c n I p ] . Therefore, tr { ( V / V V / ) / } = n − tr(Ω − ) + o ( n − ) since p is fixed. Using this and (23) for thevariance term in (21) gives tr (cid:26) V + V − (cid:16) V / V V / (cid:17) / (cid:27) = { n − tr(Ω − ) + o ( n − ) } + { n − tr(Ω − ) + o ( n − ) } − { n − tr(Ω − ) + 2 o ( n − ) } = o (cid:0) n − (cid:1) . (29)Combining the asymptotic expressions for the mean and variance terms in (22) and (29), (21)reduces to E P ( n ) θ (cid:2) W (cid:8) N ( µ, V ) , N ( µ, V ) (cid:9)(cid:3) = o (cid:0) n − (cid:1) , which completes the proof. (cid:50) .2 Proof of Theorem 3.4 Let (cid:15) m = (cid:16) m log m (cid:17) − / (2 α ) . For ease of notation, in all the following proofs, we will sometimes write p ( y ji | θ ) ≡ p ji ( y ji | θ ) .Due to the compactness of Θ in (A1), we assume that ρ ( θ, θ ) ≤ M for a large finite constant M . We start with a decomposition of the W distance from the j th subset posterior Π m ( · | Y [ j ] ) tothe Dirac measure at the true parameter θ : E P θ W (cid:0) Π m ( · | Y [ j ] ) , δ θ ( · ) (cid:1) = E P θ (cid:90) Θ ρ ( θ, θ )Π m ( dθ | Y [ j ] ) ≤ E P θ (cid:90) { θ : ρ ( θ,θ ) ≤ c (cid:15) m } ρ ( θ, θ )Π m ( dθ | Y [ j ] ) + E P θ (cid:90) { θ : ρ ( θ,θ ) >c (cid:15) m } ρ ( θ, θ )Π m ( dθ | Y [ j ] ) ≤ ( c (cid:15) m ) + M E P θ Π m (cid:0) ρ ( θ, θ ) > c (cid:15) m | Y [ j ] (cid:1) . 
(30)We will choose the constant c as c = (cid:16) r g q C L (cid:17) / (2 α ) , where g , C L , q , r are the constants in (A1),(A2), and Lemma 5 and Lemma 6 in Supplementary Material.The following proofs are similar to the proofs of Theorem 1, 4, and 10 in Ghosal and van derVaart (2007). The main difference is that our likelihood has been raised to the power γ . Usingcondition (A2), we can further replace the ρ metric by the pseudo Hellinger distance: Π m (cid:0) θ ∈ Θ : ρ ( θ, θ ) > c (cid:15) m | Y [ j ] (cid:1) ≤ Π m (cid:16) θ ∈ Θ : h mj ( P θ,j , P θ ,j ) > (cid:112) C L ( c (cid:15) m ) α | Y [ j ] (cid:17) = (cid:90)(cid:8) θ ∈ Θ: h mj ( θ,θ ) > (cid:113) r g q (cid:15) αm (cid:9) (cid:81) mi =1 (cid:104) p ( Y ji | θ ) p ( Y ji | θ ) (cid:105) γ Π( dθ ) (cid:82) Θ (cid:81) mi =1 (cid:104) p ( Y ji | θ ) p ( Y ji | θ ) (cid:105) γ Π( dθ ) . (31)For the denominator in (31), by Condition (A4) and Lemma 6, for m sufficiently large, with prob-ability at least − exp( − r m(cid:15) αm ) (cid:90) Θ m (cid:89) i =1 p ( Y ji | θ ) γ p ( Y ji | θ ) γ Π( dθ ) > exp( − r n(cid:15) αm ) . (32)For the numerator in (31), by Condition (A3) and Lemma 5, we set δ = (cid:112) r g /q (cid:15) αm andobtain that with probability at least − (cid:16) − r g q q m(cid:15) αm (cid:17) ≥ − (cid:16) − r q q n(cid:15) αm (cid:17) , sup (cid:8) θ ∈ Θ: h mj ( θ,θ ) ≥ √ r g /q (cid:15) αm (cid:9) m (cid:89) i =1 (cid:20) p ( Y ji | θ ) p ( Y ji | θ ) (cid:21) γ ≤ exp (cid:0) − r g m(cid:15) αm (cid:1) ≤ exp (cid:0) − r n(cid:15) αm (cid:1) (33)Therefore, based on (31), (32), and (33), we obtain that with probability at least − (cid:16) − r q q n(cid:15) αm (cid:17) − exp( − r m(cid:15) αm ) , Π m (cid:16) θ ∈ Θ : ρ ( θ, θ ) > c (cid:15) m (cid:12)(cid:12)(cid:12) Y [ j ] (cid:17) ≤ exp (cid:0) − r n(cid:15) αm + r n(cid:15) αm (cid:1) ≤ exp (cid:0) − r n(cid:15) αm (cid:1) . Let A (cid:15) m be the event (cid:8) θ ∈ Θ : Π (cid:16) θ ∈ Θ : ρ ( θ, θ ) > c (cid:15) m (cid:12)(cid:12)(cid:12) Y [ j ] (cid:17) ≤ exp (cid:0) − r n(cid:15) αm (cid:1) (cid:9) . Then we canbound the second term in (30) as E P θ Π m (cid:16) ρ ( θ, θ ) > c (cid:15) m (cid:12)(cid:12)(cid:12) Y [ j ] (cid:17) E P θ (cid:104) I ( A (cid:15) m )Π m (cid:16) ρ ( θ, θ ) > c (cid:15) m (cid:12)(cid:12)(cid:12) Y [ j ] (cid:17)(cid:105) + E P θ (cid:104) I ( A c(cid:15) m )Π m (cid:16) ρ ( θ, θ ) > c (cid:15) m (cid:12)(cid:12)(cid:12) Y [ j ] (cid:17)(cid:105) ≤ exp (cid:0) − r n(cid:15) αm (cid:1) + P ( n ) θ ( A c(cid:15) m ) · ≤ exp (cid:0) − r n(cid:15) αm (cid:1) + 4 exp (cid:18) − r q q n(cid:15) αm (cid:19) + exp( − r m(cid:15) αm ) ≤ (cid:0) − c m(cid:15) αm (cid:1) , for c = min( r , r , r q /q ) , as clearly the second term is dominating the other two given m (cid:46) n .Therefore, for (30), since (cid:15) m = ( m/ log m ) − / (2 α ) , as m → ∞ , an explicit bound will be E P θ W (cid:0) Π m ( · | Y [ j ] ) , δ θ ( · ) (cid:1) ≤ c log /α mm /α + 6 M exp (cid:0) − c log m (cid:1) ≤ c log /α mm /α + 1 m α ≤ C log /α mm /α as m becomes sufficiently large, where the constant C depends on α, c , c , which further depends on g , g , q , q , r , r , C L . Since q , q in Lemma 5 and r , r in Lemma 6 depend on g , g , D , D , κ, c π ,it follows that C depends on g , g , C L , D , D , κ, c π . 
□

Based on Lemma 7 in the Supplementary Material, if assumption (A5) holds, then we have

    E_{P^{(n)}_{θ_0}} [ W_2^2 { Π_n(· | Y^{(n)}), δ_{θ_0}(·) } ]
      ≤ E_{P^{(n)}_{θ_0}} [ { (1/k) Σ_{j=1}^k W_2 ( Π_m(· | Y_{[j]}), δ_{θ_0}(·) ) }^2 ]
      ≤ (1/k) Σ_{j=1}^k E_{P^{(n)}_{θ_0}} [ W_2^2 { Π_m(· | Y_{[j]}), δ_{θ_0}(·) } ]
      ≤ C log^{1/α} m / m^{1/α},

where the first inequality follows from Lemma 7 in the Supplementary Material, the second inequality follows from the Cauchy-Schwarz inequality, and the third inequality follows from the subset bound (12). □

B Univariate density estimation

Let X_1, . . . , X_n be n copies of a scalar random variable X that follows probability distribution P_0 with density p_0. The full data are randomly split into k subsets, and X_{j1}, . . . , X_{jm} represent the data on subset j (j = 1, . . . , k). The hierarchical model for density estimation using the stick-breaking representation of Dirichlet process mixtures is

    X_{ji} | z_{ji}, {μ_h}_{h=1}^∞, {σ_h}_{h=1}^∞ ∼ N(μ_{z_{ji}}, σ²_{z_{ji}}),   z_{ji} ∼ Σ_{h=1}^∞ ν_h δ_h,   ν_h = V_h ∏_{l<h} (1 − V_l).

References

Agueh, M. and G. Carlier (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis 43(2), 904–924.
Ahn, S., A. Korattikara, and M. Welling (2012). Bayesian posterior sampling via stochastic gradient Fisher scoring. Proceedings of the 29th International Conference on Machine Learning.
Alquier, P., N. Friel, R. Everitt, and A. Boland (2016). Noisy Monte Carlo: Convergence of Markov chains with approximate transition kernels. Statistics and Computing 26(1-2), 29–47.
Álvarez-Esteban, P. C., E. del Barrio, J. Cuesta-Albertos, and C. Matrán (2016). A fixed-point approach to barycenters in Wasserstein space. Journal of Mathematical Analysis and Applications 441(2), 744–762.
Anderes, E., S. Borgwardt, and J. Miller (2016). Discrete Wasserstein barycenters: optimal transport for discrete data. Mathematical Methods of Operations Research 84(2), 389–409.
Bardenet, R., A. Doucet, and C. Holmes (2017). On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research 18(47), 1–43.
Bickel, P. J. and D. A. Freedman (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics 9(6), 1196–1217.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Volume 4. Springer New York.
Broderick, T., N. Boyd, A. Wibisono, A. C. Wilson, and M. Jordan (2013). Streaming variational Bayes. In Advances in Neural Information Processing Systems, pp. 1727–1735.
Carlier, G., A. Oberman, and E. Oudet (2015). Numerical methods for matching for teams and Wasserstein barycenters. ESAIM: Mathematical Modelling and Numerical Analysis 49(6), 1621–1642.
Chen, T., E. Fox, and C. Guestrin (2014). Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pp. 1683–1691.
Cuturi, M. and A. Doucet (2014). Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, JMLR W&CP, pp. 685–693.
Dunson, D. B. and C. Xing (2009). Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association 104(487), 1042–1051.
Faes, C., J. T. Ormerod, and M. P. Wand (2012). Variational Bayesian inference for parametric and nonparametric regression with missing data. Journal of the American Statistical Association 106(495), 959–971.
Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97(458), 611–631.
Gelman, A., A. Vehtari, P. Jylänki, C. Robert, N. Chopin, and J. P. Cunningham (2014). Expectation propagation as a way of life. arXiv preprint arXiv:1412.4869.
Ghosal, S. and A. van der Vaart (2007). Convergence rates of posterior distributions for noniid observations. The Annals of Statistics 35(1), 192–223.
Giordano, R., T. Broderick, and M. I. Jordan (2017). Covariances, robustness, and variational Bayes. arXiv preprint arXiv:1709.02536.
Gurobi Optimization Inc. (2014). Gurobi Optimizer Reference Manual Version 6.0.0.
Hoffman, M. D., D. M. Blei, C. Wang, and J. Paisley (2013). Stochastic variational inference. Journal of Machine Learning Research 14, 1303–1347.
Ibragimov, I. A. and R. Z. Has'minskii (2013). Statistical Estimation: Asymptotic Theory, Volume 16. Springer Science & Business Media.
Johndrow, J. E., J. C. Mattingly, S. Mukherjee, and D. B. Dunson (2015). Approximations of Markov chains and high-dimensional Bayesian inference. arXiv preprint arXiv:1508.03387v1.
Kim, Y., Y.-K. Choi, and S. Emery (2013). Logistic regression with multiple random effects: a simulation study of estimation methods and statistical packages. The American Statistician 67(3), 171–182.
Korattikara, A., Y. Chen, and M. Welling (2014). Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proceedings of the 31st International Conference on Machine Learning, pp. 181–189.
Kucukelbir, A., R. Ranganath, A. Gelman, and D. Blei (2015). Automatic variational inference in Stan. In Advances in Neural Information Processing Systems, pp. 568–576.
Lan, S., B. Zhou, and B. Shahbaba (2014). Spherical Hamiltonian Monte Carlo for constrained target distributions. In JMLR Workshop and Conference Proceedings, Volume 32, pp. 629. NIH Public Access.
Lee, C. Y. Y. and M. P. Wand (2016). Streamlined mean field variational Bayes for longitudinal and multilevel data analysis. Biometrical Journal 58(4), 868–895.
Li, C., S. Srivastava, and D. B. Dunson (2017). Simple, scalable and accurate posterior interval estimation. Biometrika 104, 665–680.
Maclaurin, D. and R. P. Adams (2015). Firefly Monte Carlo: Exact MCMC with subsets of data. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
Massart, P. (2003). Concentration inequalities and model selection. In Ecole d'Eté de Probabilités de Saint-Flour XXXIII, pp. 25–26. Springer-Verlag, Berlin Heidelberg.
Minsker, S., S. Srivastava, L. Lin, and D. Dunson (2014). Scalable and robust Bayesian inference via the median posterior. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1656–1664.
Minsker, S., S. Srivastava, L. Lin, and D. B. Dunson (2017). Robust and scalable Bayes via a median of subset posterior measures. The Journal of Machine Learning Research 18(1), 4488–4527.
Miroshnikov, A. and E. Conlon (2014). parallelMCMCcombine: Methods for combining independent subset Markov chain Monte Carlo posterior samples to estimate a posterior density given the full data set. R package version 1.0.
Neiswanger, W., C. Wang, and E. Xing (2014). Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the 30th International Conference on Uncertainty in Artificial Intelligence, pp. 623–632.
Perry, P. O. (2017). Fast moment-based estimation for hierarchical models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(1), 267–291.
Rabinovich, M., E. Angelino, and M. I. Jordan (2015). Variational consensus Monte Carlo. In Advances in Neural Information Processing Systems, pp. 1207–1215.
Ranganath, R., D. Tran, and D. Blei (2016). Hierarchical variational models. In International Conference on Machine Learning, pp. 324–333.
Rasmussen, C. E. and C. K. Williams (2006). Gaussian Processes for Machine Learning. MIT Press.
Rezende, D. and S. Mohamed (2015). Variational inference with normalizing flows. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1530–1538.
Rue, H., S. Martino, and N. Chopin (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71(2), 319–392.
Scott, S. L., A. W. Blocker, F. V. Bonassi, H. A. Chipman, E. I. George, and R. E. McCulloch (2016). Bayes and big data: the consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management 11(2), 78–88.
Shahbaba, B., S. Lan, W. O. Johnson, and R. M. Neal (2014). Split Hamiltonian Monte Carlo. Statistics and Computing 24(3), 339–349.
Srivastava, S., V. Cevher, Q. Dinh, and D. Dunson (2015). WASP: Scalable Bayes via barycenters of subset posteriors. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pp. 912–920.
Stan Development Team (2014). Stan: A C++ library for probability and sampling, version 2.5.0.
Tan, L. S. and D. J. Nott (2013). Variational inference for generalized linear mixed models using partially noncentered parametrizations. Statistical Science 28(2), 168–188.
van der Geer, S. and J. Lederer (2013). The Bernstein-Orlicz norm and deviation inequalities. Probability Theory and Related Fields 157, 225–250.
van der Vaart, A. W. (2000). Asymptotic Statistics, Volume 3. Cambridge University Press.
Wainwright, M. J. and M. I. Jordan (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1, 1–305.
Wand, M. (2015). KernSmooth: Functions for Kernel Smoothing Supporting Wand & Jones (1995). R package version 2.23-14.
Wang, X. and D. B. Dunson (2013). Parallel MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605.
Wang, X., F. Guo, K. A. Heller, and D. B. Dunson (2015). Parallelizing MCMC with random partition trees. In Advances in Neural Information Processing Systems, pp. 451–459.
Welling, M. and Y. W. Teh (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pp. 681–688.
Wong, W. H. and X. Shen (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. The Annals of Statistics 23(2), 339–362.

Supplementary Material for Scalable Bayes via Barycenter in Wasserstein Space

1 Technical Lemmas

To show Theorem 3.4, we first introduce in Lemma 1.5 a generalized version of the concentration inequality in Theorem 1 of Wong and Shen (1995). The proof parallels the original proof in Wong and Shen (1995), with several adaptations for the inid (independent but non-identically distributed) setup. Let Z_{ji}(θ) = log[p(Y_{ji} | θ)/p(Y_{ji} | θ_0)], and let Z̃_{ji}(θ) = max(Z_{ji}(θ), −τ) be the lower truncated version of Z_{ji}(θ) for some constant τ > 0 to be chosen later. Let Z_j(θ) = (Z_{j1}(θ), . . . , Z_{jm}(θ))^T and Z̃_j(θ) = (Z̃_{j1}(θ), . . . , Z̃_{jm}(θ))^T.

Lemma 1.1. Let c_τ = 2e^{−τ/2}/(1 − e^{−τ/2}). Then for any θ ∈ Θ,

    (1/m) Σ_{i=1}^m E Z̃_{ji}(θ) ≤ −(1 − c_τ) h²_{mj}(θ, θ_0).
(43) Proof of Lemma 1.1: The proof is a simple adaptation of Lemma 2 and Lemma 4 in Wong and Shen (1995). Firstnote that for every observation Y ji (which takes value y ji ), we can define the event A ji = { y ji : p ( y ji ) | θ ) /p ( y ji | θ ) < e − τ } . Their Lemma 2 implies that P ( A ji ) ≤ (1 − e − τ/ ) − h ( p ji ( ·| θ ) , p ji ( ·| θ )) ,which further implies by simple averaging over i = 1 , . . . , m that m m (cid:88) i =1 P ( A ji ) ≤ (1 − e − τ/ ) − h mj ( θ, θ ) . (44)Following the same derivation of their Lemma 4, we have for every individual ˜ Z ji , E ˜ Z ji ( θ ) ≤ − h ( p ji ( ·| θ ) , p ji ( ·| θ )) + 2 e − τ/ P ( A ji ) . Then a simple averaging over i = 1 , . . . , m together with (44) gives (43). (cid:50) Lemma 1.2. Let c τ = ( e τ/ − − τ / / (1 − e − τ/ ) . For any t > , integer (cid:96) ≥ and any θ ∈ Θ that satisfies h mj ( θ, θ ) ≤ r , m m (cid:88) i =1 E P θ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˜ Z ji ( θ )2 √ c τ r (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:96) ≤ (cid:96) !2 (cid:18) √ c τ r (cid:19) (cid:96) − . Proof of Lemma 1.2: Lemma 5 in Wong and Shen (1995) is stated for every single observation Y ji , so it still holds forindividual ˜ Z ji ( θ ) . For all i = 1 , . . . , m , E P θ (cid:104) exp (cid:16)(cid:12)(cid:12)(cid:12) ˜ Z ji ( θ ) / (cid:12)(cid:12)(cid:12)(cid:17) − − (cid:12)(cid:12)(cid:12) ˜ Z ji ( θ ) / (cid:12)(cid:12)(cid:12)(cid:105) ≤ c τ h ( p ji ( ·| θ ) , p ji ( ·| θ )) , where c τ is as defined in the statement of the lemma. Averaging over i = 1 , . . . , m yields m m (cid:88) i =1 E P θ (cid:104) exp (cid:16)(cid:12)(cid:12)(cid:12) ˜ Z ji ( θ ) / (cid:12)(cid:12)(cid:12)(cid:17) − − (cid:12)(cid:12)(cid:12) ˜ Z ji ( θ ) / (cid:12)(cid:12)(cid:12)(cid:105) ≤ c τ h mj ( θ, θ ) ≤ c τ r , where the second inequality is from the condition h mj ( θ, θ ) ≤ r . Using e x − − x ≥ x (cid:96) /(cid:96) ! for all x > , we have m m (cid:88) i =1 E P θ (cid:12)(cid:12)(cid:12) ˜ Z ji ( θ ) (cid:12)(cid:12)(cid:12) (cid:96) ≤ (cid:96) (cid:96) ! c τ r. Rearranging the terms and the conclusion follows. (cid:50) emma 1.3. Let j ∈ { , . . . , k } be fixed. Suppose Ξ is a subset of Θ . Let ˜ Z j (Ξ) = (cid:110) ˜ Z j ( θ ) , θ ∈ Ξ (cid:111) . For any u > , H [] (cid:16) u, ˜ Z j (Ξ) , (cid:107) · (cid:107) (cid:17) ≤ H [] (cid:16) u/ (2 e τ/ ) , P j (Ξ) , h mj (cid:17) . Proof of Lemma 1.3: The proof follows the argument in the proof of Lemma 3 in Wong and Shen (1995). We can derivethat for each i = 1 , . . . , m , for any θ , θ ∈ Ξ , E P θ (cid:104) ˜ Z ji ( θ ) − ˜ Z ji ( θ ) (cid:105) ≤ e τ h mj ( θ , θ ) . Then averaging over i = 1 , . . . , m gives the relation between two norms (cid:13)(cid:13)(cid:13) ˜ Z j ( θ ) − ˜ Z j ( θ ) (cid:13)(cid:13)(cid:13) ≤ e τ/ h mj ( θ , θ ) , which further implies the relation between the bracketing entropies. (cid:50) Lemma 1.4. (van der Geer and Lederer (2013) Theorem 8) Let j ∈ { , . . . , k } be fixed. Suppose aclass of functions F j = (cid:8) f ( y ) = ( f ( y ) , . . . , f m ( y m )) T , y = ( y , . . . , y m ) ∈ ⊗ mi =1 Y ji (cid:9) satisfies(i) sup f ∈F j (cid:107) f (cid:107) ≤ ;(ii) For any integer (cid:96) ≥ , sup f ∈F j | f | (cid:96)(cid:96) ≤ (cid:96) ! 
M (cid:96) − / , for some constant M > ;Then for any t > , P θ (cid:32) sup f ∈F j √ m m (cid:88) i =1 (cid:104) f i ( Y ji ) − E P θ f i ( Y ji ) (cid:105) ≥ min S ∈ N R S + 36 M (1 + t ) √ m + 24 √ t (cid:33) ≤ e − t , where R S ≡ − S √ m + 14 √ S (cid:88) s =0 − s (cid:113) H [] (2 − s , F j , (cid:107) · (cid:107) ) + 36 M H [] (1 , F j , (cid:107) · (cid:107) ) √ m . Proof of Lemma 1.4: The theorem we present here is slightly different from the original Theorem 8 in van der Geer andLederer (2013) in that we have used the generalized bracketing entropy in Definition A.2 in themanuscript. Although the original version is presented with the usual L -bracketing entropy forunivariate functions, the whole “chaining along a tree” argument will still go through, if we replacethe (cid:107) · (cid:107) and | · | q norms and bracketing entropies in their proofs by our generalized versions tomultivariate functions as in Definition A.2. (cid:50) Lemma 1.5. (Generalization of Wong and Shen (1995) Theorem 1) Assume (A3) holds. Then forany δ > , there exist positive constants q , q that depend on D , D , such that for all subsets Y [ j ] with j = 1 , . . . , k and all sufficiently large m , P ( n ) θ (cid:32) sup h mj ( θ,θ ) ≥ δ m (cid:89) i =1 p ( Y ji | θ ) p ( Y ji | θ ) ≥ exp( − q mδ ) (cid:33) ≤ − q mδ ) (45) Proof of Lemma 1.5: We consider the class of functions ˆ Z j ( r ) = (cid:40) ˜ Z j ( θ )2 √ c τ r : θ ∈ Θ satisfies h mj ( θ, θ ) ≤ r (cid:41) , r > . This class is a rescaled version of ˜ Z j ( { θ ∈ Θ : h mj ( θ, θ ) ≤ r } ) = (cid:110) ˜ Z j ( θ ) : θ ∈ Θ satisfies h mj ( θ, θ ) ≤ r (cid:111) as in Lemma 1.3. By Lemma 1.2, ˆ Z j ( r ) satisfies Conditions (i) and (ii) in Lemma 1.4 with theconstant M = 1 / ( √ c τ r ) . Therefore the concentration inequality in Lemma 1.4 holds for ˆ Z j ( r ) .We first simplify the term R S in the inequality. R S involves the L -bracketing entropy of theclass ˆ Z j ( r ) . Since H [] (cid:16) u, ˆ Z j ( r ) , (cid:107) · (cid:107) (cid:17) is nonincreasing in u , we have S (cid:88) s =0 − s (cid:114) H [] (cid:16) − s , ˆ Z j ( r ) , (cid:107) · (cid:107) (cid:17) ≤ S (cid:88) s =0 (cid:90) − s − ( s +1) (cid:114) H [] (cid:16) u, ˆ Z j ( r ) , (cid:107) · (cid:107) (cid:17) du = 2 (cid:90) − ( S +1) (cid:114) H [] (cid:16) u, ˆ Z j ( r ) , (cid:107) · (cid:107) (cid:17) du ( i ) = 2 (cid:90) − ( S +1) (cid:114) H [] (cid:16) √ c τ ru, ˜ Z j ( { θ ∈ Θ : h mj ( θ, θ ) ≤ r } ) , (cid:107) · (cid:107) (cid:17) du = 1 √ c τ r (cid:90) √ c τ r − S √ c τ r (cid:114) H [] (cid:16) u, ˜ Z j ( { θ ∈ Θ : h mj ( θ, θ ) ≤ r } ) , (cid:107) · (cid:107) (cid:17) du ( ii ) ≤ √ c τ r (cid:90) √ c τ r − S √ c τ r (cid:113) H [] (cid:0) u/ (2 e τ/ ) , P j ( { θ ∈ Θ : h mj ( θ, θ ) ≤ r } ) , h mj (cid:1) du = √ e τ √ c τ r (cid:90) √ c τ e − τ r − ( S +1) √ c τ e − τ r (cid:113) H [] ( u, P j ( { θ ∈ Θ : h mj ( θ, θ ) ≤ r } ) , h mj ) du, (46)where (i) follows from the scaling relation between ˆ Z j ( r ) and ˜ Z j ( { θ ∈ Θ : h mj ( θ, θ ) ≤ r } ) , and (ii)follows from Lemma 1.3. We choose integer S ≥ such that − ( S +2) ≤ √ c τ e − τ r/ ≤ − ( S +1) . Itis possible to do so because we only need to consider r ≤ √ (since the h mj distance is upper boundedby √ ), c τ e − τ ≤ / for all τ ≥ , and it is guaranteed that √ c τ e − τ r/ ≤ √ / < / .Now we can apply (A3) to (46) and obtain that uniformly over all j = 1 , . . . 
, k , S (cid:88) s =0 − s (cid:114) H [] (cid:16) − s , ˆ Z j ( r ) , (cid:107) · (cid:107) (cid:17) ≤ √ e τ √ c τ r (cid:90) √ c τ e − τ r (cid:16) √ c τ e − τ r (cid:17) / (cid:112) Ψ( u, r ) du ≤ √ e τ √ c τ r · D √ m (cid:18) √ c τ e − τ D r (cid:19) = 2 D √ c τ e − τ D √ mr. (47)Furthermore, since (A3) says Ψ( u, r ) is nonincreasing in u , we can also derive that H [] (cid:16) , ˆ Z j ( r ) , (cid:107) · (cid:107) (cid:17) = H [] (cid:16) √ c τ r, ˜ Z j ( { θ ∈ Θ : h mj ( θ, θ ) ≤ r } ) , (cid:107) · (cid:107) (cid:17) ≤ H [] (cid:16)(cid:112) c τ e − τ r, P j ( { θ ∈ Θ : h mj ( θ, θ ) ≤ r } ) , h mj (cid:17) ≤ Ψ (cid:16)(cid:112) c τ e − τ r, r (cid:17) ≤ (cid:34) √ c τ e − τ r − (cid:0) √ c τ e − τ r (cid:1) / (cid:90) √ c τ e − τ r (cid:16) √ c τ e − τ r (cid:17) / (cid:112) Ψ( u, r ) du (cid:35) ≤ D c τ e − τ mr D (cid:0) − √ c τ e − τ r/ (cid:1) ≤ D c τ e − τ mr D , (48)35here in the last step, we used the fact that √ c τ e − τ r/ < / .By our choice − ( S +2) ≤ √ c τ e − τ r/ , it follows from (47) and (48) that min S ∈ N R S ≤ √ c τ e − τ · √ mr + 14 √ · D √ c τ e − τ D √ mr + 36 √ c τ mr · D c τ e − τ mr D ≤ (cid:34) √ + 56 √ D D + 144 √ D D e − τ/ (cid:35) (cid:112) c τ e − τ √ mr. (49)In the inequality of Lemma 1.4, we let t = c τ mr with integer (cid:96) ≥ and constant c τ to be chosenlater, then with probability at least − e − c τ mr , the empirical process sup f ∈ ˆ Z j ( r ) √ m m (cid:88) i =1 (cid:104) f i ( Y ji ) − E P θ f i ( Y ji ) (cid:105) will not exceed the following upper bound min S ∈ N R S + 36 M (1 + t ) √ m + 24 √ t ≤ (cid:34) √ + 56 √ D D + 144 √ D D e − τ/ (cid:35) (cid:112) c τ e − τ √ mr + 36(1 + c τ mr ) √ c τ mr + 24 √ c τ mr = c τ √ mr + 36 √ c τ mr , where c τ = (cid:104) √ + 56 √ D D + 144 √ D D e − τ/ (cid:105) √ c τ e − τ + c τ √ c τ +24 √ c τ . As a result, the empiricalprocess on the class ˜ Z j ( { θ ∈ Θ : h mj ( θ, θ ) ≤ r } ) satisfies that with probability at least − e − c τ mr , sup θ ∈ Θ: h mj ( θ,θ ) ≤ r √ m m (cid:88) i =1 (cid:104) ˜ Z ji ( θ ) − E P θ ˜ Z ji ( θ ) (cid:105) ≤ √ c τ r · (cid:18) c τ √ mr + 36 √ c τ mr (cid:19) ≤ √ c τ c τ · √ mr + 72 √ m . (50)On the set { θ ∈ Θ : r ≤ h mj ( θ, θ ) ≤ r } , we have h mj ( θ, θ ) ≥ r . By Lemma 1.1, it follows that m m (cid:88) i =1 E ˜ Z ji ( θ ) ≤ − (1 − c τ ) r . This together with (50) implies that with probability at least − e − c τ mr , sup θ ∈ Θ: r ≤ h mj ( θ,θ ) ≤ r m m (cid:88) i =1 ˜ Z ji ( θ ) ≤ − (cid:0) − c τ − √ c τ c τ (cid:1) r + 72 m . Now we choose τ such that e − τ/ = 1 / . Then c τ < . , < c τ < . Set c τ = 1 / . By(A3), D ≤ D / . So √ c τ c τ ≤ √ · (cid:40)(cid:34) √ + 56 √ + 144 √ · (cid:35) (cid:114) + 36 √ · + 24 (cid:114) (cid:41) < . , − c τ − √ c τ c τ > . . Thus we have proved that for any < r ≤ √ ,uniformly for all j = 1 , . . . , k and for all sufficiently large m , P ( n ) θ (cid:32) sup { θ ∈ Θ: r ≤ h mj ( θ,θ ) ≤ r } m m (cid:88) i =1 Z ji ( θ ) ≥ − . r + 72 m (cid:33) ≤ P ( n ) θ (cid:32) sup { θ ∈ Θ: r ≤ h mj ( θ,θ ) ≤ r } m m (cid:88) i =1 ˜ Z ji ( θ ) ≥ − . r + 72 m (cid:33) ≤ (cid:0) − mr / (cid:1) . (51)Finally for a given δ , we set r = 2 (cid:96) δ for integers (cid:96) ≥ , m > / . /δ = 1440 /δ , and let L be thesmallest integer such that L δ > . It follows that P ( n ) θ (cid:32) sup { θ ∈ Θ: h mj ( θ,θ ) ≥ δ } m (cid:89) i =1 p ( Y ji | θ ) p ( Y ji | θ ) ≥ exp( − . mδ ) (cid:33) ≤ P ( n ) θ (cid:32) sup { θ ∈ Θ: h mj ( θ,θ ) ≥ δ } m m (cid:88) i =1 Z ji ( θ ) ≥ − . 
δ + 72 m (cid:33) ≤ L (cid:88) (cid:96) =0 P ( n ) θ (cid:32) sup { θ ∈ Θ:2 (cid:96) δ ≤ h mj ( θ,θ ) ≤ (cid:96) +1 δ } m m (cid:88) i =1 Z ji ( θ ) ≥ − . (cid:96) δ ) + 72 m (cid:33) ≤ L (cid:88) (cid:96) =0 (cid:16) − (cid:96) mδ / (cid:17) ≤ (cid:0) − mδ / (cid:1) . We set q = 0 . , q = 1 / and complete the proof. (cid:50) Lemma 1.6. Assume (A3) holds. Then for any δ > , there exist positive constants r , r thatdepend on g , g , κ, c π , such that for every subset Y [ j ] ( j = 1 , . . . , k ), for any t ≥ (cid:15) αm , P ( n ) θ (cid:32)(cid:90) Θ m (cid:89) i =1 p ( Y ji | θ ) γ p ( Y ji | θ ) γ Π( dθ ) ≤ exp( − r nt ) (cid:33) ≤ exp( − r mt ) . Proof of Lemma 1.6: Define the event Θ (cid:15) m as Θ (cid:15) m = (cid:40) θ ∈ Θ : 1 m m (cid:88) i =1 E P θ exp (cid:18) log + p ( Y ji | θ ) p ( Y ji | θ ) (cid:19) − ≤ (cid:15) αm (cid:41) then (A4) can be written as Π(Θ (cid:15) m ) ≥ exp( − c π n(cid:15) αm ) . For A ⊆ Θ , let Π (cid:15) m ( A ) = Π( A ∩ Θ (cid:15) m ) / Π(Θ (cid:15) m ) be the prior measure Π restricted measure to the set Θ (cid:15) m . For the left-hand side of the conclusion,we have P ( n ) θ (cid:32)(cid:90) Θ m (cid:89) i =1 p ( Y ji | θ ) γ p ( Y ji | θ ) γ Π( dθ ) ≤ exp( − r nt ) (cid:33) ( i ) ≤ P ( n ) θ (cid:32)(cid:90) Θ (cid:15)m m (cid:89) i =1 p ( Y ji | θ ) γ p ( Y ji | θ ) γ Π( dθ ) ≤ exp( − r nt ) (cid:33) ≤ P ( n ) θ (cid:32) Π(Θ (cid:15) m ) · (cid:90) Θ (cid:15)m m (cid:89) i =1 p ( Y ji | θ ) γ p ( Y ji | θ ) γ Π (cid:15) m ( dθ ) ≤ exp( − r nt ) (cid:33) ii ) ≤ P ( n ) θ (cid:32)(cid:90) Θ (cid:15)m m (cid:89) i =1 p ( Y ji | θ ) γ p ( Y ji | θ ) γ Π (cid:15) m ( dθ ) ≤ exp( − r nt + c π n(cid:15) αm ) (cid:33) ( iii ) ≤ P ( n ) θ (cid:32) m (cid:88) i =1 (cid:90) Θ (cid:15)m log p ( Y ji | θ ) p ( Y ji | θ ) Π (cid:15) m ( dθ ) ≥ r g mt − c π g m(cid:15) αm (cid:33) ( iv ) ≤ P ( n ) θ (cid:32) m (cid:88) i =1 (cid:34)(cid:90) Θ (cid:15)m log p ( Y ji | θ ) p ( Y ji | θ ) Π (cid:15) m ( dθ ) − E P θ (cid:90) Θ (cid:15)m log p ( Y ji | θ ) p ( Y ji | θ ) Π (cid:15) m ( dθ ) (cid:35) ≥ r g mt − ( c π g + κ − ) m(cid:15) αm (cid:33) . (52)In the derivation above, (i) follows from making the region of the integral smaller; (ii) is from (A4);(iii) is an application of Jensen’s inequality and uses the fact that g γm ≤ n ≤ g γm ; (iv) followsby using Fubini’s theorem, the inequality x ≤ ( e κx − /κ for x ≥ , and (A4): m (cid:88) i =1 E P θ (cid:90) Θ (cid:15)m log p ( Y ji | θ ) p ( Y ji | θ ) Π (cid:15) m ( dθ ) ≤ (cid:90) Θ (cid:15)m m (cid:88) i =1 E P θ log + p ( Y ji | θ ) p ( Y ji | θ ) Π (cid:15) m ( dθ ) ≤ (cid:90) Θ (cid:15)m κ m (cid:88) i =1 (cid:20) E P θ exp (cid:18) κ log + p ( Y ji | θ ) p ( Y ji | θ ) (cid:19) − (cid:21) Π (cid:15) m ( dθ ) ≤ κ − m(cid:15) αm . Now in (52), if we choose r large enough and let Z i = (cid:82) Θ (cid:15)m log p ( Y ji | θ ) p ( Y ji | θ ) Π (cid:15) m ( dθ ) , we can use theBernstein’s inequality (Corollary 2.10 in Massart (2003)) to control the large deviation for the sumof Z i ’s. By inspecting the conditions for this inequality, we can see that for any integer (cid:96) ≥ , bythe inequality ( κx ) (cid:96) /(cid:96) ! ≤ e κx − , (A4), and the Jensen’s inequality, m (cid:88) i =1 E P θ (cid:104) ( Z i ) (cid:96) + (cid:105) ≤ (cid:96) ! κ (cid:96) m (cid:88) i =1 (cid:104) exp (cid:16) κE P θ ( Z i ) + (cid:17) − (cid:105) ≤ (cid:96) ! κ (cid:96) m(cid:15) αm . 
Therefore in the Corollary 2.10, we can set c = κ − , v = 2 κ − m(cid:15) αm , x = r g mt − ( c π g + κ − ) m(cid:15) αm .If we choose r > ( c π g + 3 κ − ) /g , then since t ≥ (cid:15) αm , we have x > κ − mt ≥ κ − m(cid:15) αm . Henceit follows that v/c = 2 κ − m(cid:15) αm < x . We apply the Bernstein’s inequality to (52) and obtain that P ( n ) θ (cid:32) m (cid:88) i =1 (cid:104) Z i − E P θ Z i (cid:105) ≥ r g mt − ( c π g + κ − ) m(cid:15) αm (cid:33) ≤ exp (cid:20) − x v + cx ) (cid:21) ≤ exp (cid:16) − x c (cid:17) ≤ exp (cid:16) − κx (cid:17) ≤ exp (cid:18) − mt (cid:19) . We set r = 1 / and complete the proof. (cid:50) Lemma 1.7. Let ν denote the W barycenter of N measures ν , . . . , ν N in P (Θ) . Then for any θ ∈ Θ , W ( ν, δ θ ) ≤ N N (cid:88) j =1 W ( ν j , δ θ ) . Proof of Lemma 1.7: ν = T (cid:93)ν , where T ( θ ) = 1 N N (cid:88) j =1 T j ( θ ) , for any θ ∈ Θ and T j is the optimal transport map that pushes ν forward to ν j , i.e. ν j = T j (cid:93)ν and T is the identity operator. We use the property of the ρ metric in (A5) and obtain that W ( ν, δ θ ) = (cid:90) Θ ρ N N (cid:88) j =1 T j ( θ ) , θ ν ( dθ ) ≤ (cid:90) Θ N N (cid:88) j =1 ρ (cid:0) T j ( θ ) , θ (cid:1) ν ( dθ )= 1 N N (cid:88) j =1 (cid:90) Θ ρ (cid:0) T j ( θ ) , θ (cid:1) ν ( dθ ) + 1 N (cid:88) j (cid:54) = l (cid:90) Θ ρ (cid:0) T j ( θ ) , θ (cid:1) ρ (cid:0) T l ( θ ) , θ (cid:1) ν ( dθ ) . (53)For each term in the second summation, we apply the Cauchy-Schwartz inequality and have (cid:90) Θ ρ (cid:0) T j ( θ ) , θ (cid:1) ρ (cid:0) T l ( θ ) , θ (cid:1) ν ( dθ ) ≤ (cid:115)(cid:90) Θ ρ (cid:16) T j ( θ ) , θ (cid:17) ν ( dθ ) · (cid:115)(cid:90) Θ ρ (cid:0) T l ( θ ) , θ (cid:1) ν ( dθ ) = W ( ν j , δ θ ) · W ( ν l , δ θ ) . Therefore in (53), W ( ν, δ θ ) ≤ N N (cid:88) j =1 W ( ν j , δ θ ) + 1 N (cid:88) j (cid:54) = l W ( ν j , δ θ ) · W ( ν l , δ θ )= N N (cid:88) j =1 W ( ν j , δ θ ) , and hence the conclusion follows. (cid:50) Consider the set of L mixture of Gaussians. If y i ∈ R p is the i th observation ( i = 1 , . . . , n ) sampledfrom a mixture of L Gaussians, then p ( y i | θ ) = L (cid:88) l =1 π l N p ( y | µ l , Σ l ) , (54)where π = ( π , . . . , π L ) lies in the ( L − -simplex, µ l and Σ l ( l = 1 , . . . , L ) are the mean andcovariance parameters of a p -variate Gaussian distribution, and θ = { π , µ , . . . , µ L , Σ , . . . , Σ L } .We cluster y , . . . , y n into L clusters using K-Means clustering and randomly split the membersof every cluster into k subsets. This ensures that full-data are split into k subsets such that the39ixture proportions are represented in every subset. Let y j , . . . , y jm represent the data on subset j ( j = 1 , . . . , k ). The hierarchical model for the data on subset j is y ji | z ji , θ ∼ N p ( µ z ji , Σ z ji ) , z ji ∼ L (cid:88) h =1 π h δ h , (55) π ∼ Dirichlet ( L − , . . . , L − ) , µ h | Σ h ∼ N p ( , h ) , Σ h ∼ Inverse-Wishart (2 , I p ) , h = 1 , . . . , L, where 2 is the prior degrees of freedom and I p is the scale matrix of the Inverse-Wishart distribution.The posterior distribution of θ after stochastic approximation is derived using standard ar-guments for finite mixture of Gaussians. The likelihood given y j , . . . , y jm and latent variables z j , . . . 
, z jm is L j ( { µ h } Lh =1 , { Σ h } Lh =1 , { π h } Lh =1 ) = L (cid:89) h =1 (2 π p | Σ l | ) − (cid:93)hj e − (cid:80) mi =1 z ji = h ) ( y ji − µ h ) T Σ − h ( y ji − µ h ) π (cid:93)h j h , (56)where z ji = h ) is 1 if z ji = h and 0 otherwise and (cid:93)h j = (cid:80) mi =1 z ji = h ) . For stochasticapproximation, we raise L j in (56) to the power γ and obtain L γj ( { µ h } Lh =1 , { Σ h } Lh =1 , { π h } Lh =1 ) = L (cid:89) h =1 (2 π p | Σ l | ) − γ(cid:93)hj e − γ (cid:80) mi =1 z ji = h ) ( y ji − µ h ) T Σ − h ( y ji − µ h ) π γ(cid:93)h j h . (57)If we use L γj as the likelihood in (54), then the prior for θ in (54) and simple extensions of standardarguments for finite mixture of Gaussians imply that the analytic form of full conditional densitiesof the parameters are as follows. Define h j = { i : 1( z ji = h ) = 1 , i = 1 , . . . , m } , y h j = 1 (cid:93)h j (cid:88) i ∈ h j y ji ( h = 1 , . . . , L ) , (58)and the complete conditionals of the parameters are π j | rest ∼ Dirichlet (cid:32) γ m (cid:88) i =1 z ji = 1) + L − , . . . , γ m (cid:88) i =1 z ji = L ) + L − (cid:33) µ jh | rest ∼ Normal (cid:26) γ(cid:93)h j . 01 + γ(cid:93)h j y h j , . 01 + γ(cid:93)h j Σ h (cid:27) , Σ jh | rest ∼ Inverse-Wishart γ(cid:93)h j + 3 , (cid:88) i ∈ h j ( y ji − y h j )( y ji − y h j ) T + 0 . · γ(cid:93)h j . 01 + γ(cid:93)h j y h j y Th j + 4 I p z ji | rest ∼ L (cid:88) h =1 p jh δ h , p jh = π jh N p ( y ji | µ jh , Σ jh ) (cid:80) L ˜ h =1 π j ˜ h N p ( y ji | µ j ˜ h , Σ j ˜ h ) (59)for h = 1 , . . . , L and i = 1 , . . . , m ; see Chapter 9 in Bishop (2006) for details.All full conditionals are analytically tractable in terms of standard distributions. The Gibbssampler iterates between the following four steps:1. Sample π j from Dirichlet distribution in (59).40able 10: Accuracies of the approximate posteriors for β in linear mixed effects model simulation.The accuracies are averaged over 10 simulation replications and across all elements of β . Monte Carloerrors are in parenthesis. ADVI, automatic differentiation variational inference; SA, streamlinedalgorithm; SGLD, stochastic gradient Langevin dynamics with batch size in parenthesis; CMC,consensus Monte Carlo; SDP, semiparametric density product; WASP, Wasserstein posterior p = 4 p = 80 ADVI 0.22 (0.26) 0.46 (0.13)SA 0.97 (0.01) 0.96 (0.01)SGLD (2000) 0.83 (0.01) 0.91 (0.01)SGLD (4000) 0.95 (0.01) 0.96 (0.01) k = 10 k = 20 k = 10 k = 20 CMC 0.96 (0.01) 0.95 (0.02) 0.95 (0.02) 0.94 (0.03)SDP 0.96 (0.02) 0.95 (0.02) 0.90 (0.06) 0.90 (0.06)WASP 0.97 (0.01) 0.96 (0.02) 0.95 (0.02) 0.94 (0.03) 2. Sample µ jh from Normal distribution in (59) for h = 1 , . . . , L .3. Sample Σ jh from Inverse-Wishart distribution in (59) for h = 1 , . . . , L .4. Sample z ji from categorical distribution in (59) for i = 1 , . . . , m . The conditional densities of β and Σ in two steps. First, the sampling model of the linear mixedeffects model implies that the likelihood of i th observation in j th subset is L ji = (cid:90) R q N n i ( y ji | X i β + Z i u i , τ I n i ) N q ( u i | , Σ) d u i = N n i ( y ji | X i β , Z i Σ Z Ti + τ I n i ) . (60)This implies that likelihood of β and Σ after stochastic approximation is L γj = m (cid:89) i =1 { L ji } γ = m (cid:89) i =1 (cid:8) N n i ( y ji | X i β , Z i Σ Z Ti + τ I n i ) (cid:9) γ . 
(61)Second, the j th subset posterior distribution of β and Σ after stochastic approximation is calculatedusing L γj as the likelihood and the same priors for β and Σ as in the sampling model of the linearmixed effects model. Instead of finding the analytic form of the posterior density, we use the increment_log_prob function in Stan (Stan Development Team, 2014) to specify that the likelihoodof y ji as { L ji } γ (61) and use the default priors of β and Σ in Stan to obtain samples of ( β , Σ) fromthe j th subset posterior after stochastic approximation. For automatic differentiation variationalinference (ADVI; Kucukelbir et al. (2015)), we used the vb function in Stan with default optionsand used the same model file that we used for sampling from the full data posterior distribution.Table 10 describes the accuracies of ADVI, CMC, SA, SDP, and WASP in approximating the fulldata marginal posterior distributions of fixed effects in the simulation. The derivation of modified full conditional densities of unknown parameters involves one key mod-ification in the Gibbs sampling algorithm of Dunson and Xing (2009). Assuming z ji is given, thecontribution of i th observation in j th subset to the likelihood after stochastic approximation is L γji (( ψ (1) h ) l ∗ h =1 , . . . , ( ψ ( q ) h ) l ∗ h =1 , . . . , ( ψ ( p ) h ) l ∗ h =1 ) = l ∗ (cid:89) h =1 ν h p (cid:89) q =1 d q (cid:89) l =1 ψ ( q ) xjiq = l,zji = h ) hl γ , x jiq = l, z ji = h ) is 1 if both conditions are true and is 0 otherwise and l ∗ is the maximumnumber of atoms in the stick breaking representation for the distribution of z ji . The conditionalposterior density of ψ ( q ) h after stochastic approximation is proportional to d q (cid:89) l =1 ψ ( q ) ajl − hl m (cid:89) i =1 d q (cid:89) l =1 ψ ( q ) γ xjiq = l,zji = h ) hl = d q (cid:89) l =1 ψ ( q ) ajl + γ (cid:80) mi =1 1( xjiq = l,zji = h ) − hl , which implies that ψ ( q ) jh | rest ∼ Dirichlet (cid:0) a q + γ x jiq = 1 , z ji = h ) , . . . , a qd q + γ x jiq = d q , z ji = h ) (cid:1) (62)for q = 1 , . . . , p and h = 1 , . . . , l ∗ . The conditional densities of remaining parameters follow fromSection 3.1 of Dunson and Xing (2009): V jh | rest ∼ Beta (1 + γ m (cid:88) i =1 z ji = h ) , α + γ m (cid:88) i =1 z ji > h )) , (63) α j | rest ∼ Gamma ( a α + l ∗ , b α − l ∗ (cid:88) h =1 log(1 − V jh )) . (64)Finally, we update the posterior density of responsibility of every observation as z ji | rest ∼ l ∗ (cid:88) h =1 p jh δ h , p jh = ν jh (cid:81) pq =1 ψ ( q ) hx jiq (cid:80) l ∗ h =1 ν jh (cid:81) pq =1 ψ ( q ) hx jiq , ( h = 1 , . . . , l ∗ ; i = 1 , . . . 
, m ) , (65)where ν jh = V jh (cid:81) l ADVI 0.16 (0.24) 0.21 (0.32) 0.31 (0.31) 0.27 (0.31) 0.53 (0.20) 0.39 (0.27)SA 0.00 (0.00) 0.34 (0.05) 0.64 (0.05) 0.15 (0.02) 0.01 (0.00) 0.00 (0.00)SGLD (2000) 0.79 (0.02) 0.72 (0.02) 0.78 (0.02) 0.82 (0.01) 0.86 (0.02) 0.80 (0.02)SGLD (4000) 0.78 (0.02) 0.72 (0.02) 0.76 (0.03) 0.80 (0.03) 0.84 (0.02) 0.77 (0.02)CMC 0.95 (0.02) 0.93 (0.02) 0.92 (0.02) 0.95 (0.02) 0.94 (0.02) 0.95 (0.03)SDP 0.93 (0.03) 0.92 (0.03) 0.92 (0.04) 0.93 (0.04) 0.92 (0.04) 0.94 (0.02)WASP 0.96 (0.01) 0.95 (0.01) 0.95 (0.02) 0.96 (0.01) 0.96 (0.01) 0.96 (0.01) l ll ll p = 4, q = 3 M C M C A D V I SA S G L D ( ) S G L D ( ) C M C ( k = ) C M C ( k = ) S D P ( k = ) S D P ( k = ) W ASP ( k = ) W ASP ( k = ) l og S e c ond s l p = 80, q = 6 M C M C A D V I SA S G L D ( ) S G L D ( ) C M C ( k = ) C M C ( k = ) S D P ( k = ) S D P ( k = ) W ASP ( k = ) W ASP ( k = ) (a) l ll M C M C C M C ( k = ) C M C ( k = ) S D P ( k = ) S D P ( k = ) W ASP ( k = ) W ASP ( k = ) l og S e c ond s (b) l M C M C A D V I SA S G L D ( ) S G L D ( ) C M C S D P W ASP l og S e c ond s (c)(c)