Parallel Markov Chain Monte Carlo for Bayesian Hierarchical Models with Big Data, in Two Stages
Zheng Wei (a) and Erin M. Conlon (b,*)
(a) Department of Mathematics and Statistics, University of Maine, Orono, United States of America
(b) Department of Mathematics and Statistics, University of Massachusetts, Amherst, United States of America
*Erin M. Conlon, corresponding author; email: [email protected].
Abstract
Due to the escalating growth of big data sets in recent years, new Bayesian Markov chain Monte Carlo (MCMC) parallel computing methods have been developed. These methods partition large data sets by observations into subsets. However, for Bayesian nested hierarchical models, typically only a few parameters are common for the full data set, with most parameters being group-specific. Thus, parallel Bayesian MCMC methods that take into account the structure of the model and split the full data set by groups rather than by observations are a more natural approach for analysis. Here, we adapt and extend a recently introduced two-stage Bayesian hierarchical modeling approach, and we partition complete data sets by groups. In stage 1, the group-specific parameters are estimated independently in parallel. The stage 1 posteriors are used as proposal distributions in stage 2, where the target distribution is the full model. Using three-level and four-level models, we show in both simulation and real data studies that results of our method agree closely with the full data analysis, with greatly increased MCMC efficiency and greatly reduced computation times. The advantages of our method versus existing parallel MCMC computing methods are also described.
Keywords: Bayesian nested hierarchical models, parallel computing, big data, Markov chain Monte Carlo, Metropolis-Hastings algorithm, two stages
1. Introduction
With the rapid growth of big data, new scalable statistical and computational methods are needed for the analysis of large data sets. Here, big data refers to data sets that are too large to evaluate in full, due to memory constraints, limited storage capacity or excessive computation time. Several recent Bayesian and Markov chain Monte Carlo (MCMC) methods for large data sets have been introduced to address these issues. However, these methods either required communication between machines at each MCMC iteration, which slows computation times (Newman et al. [29]; Langford et al. [17]; Smola and Narayanamurthy [39]; Agarwal and Duchi [1]), or did not address issues with limited memory or limited storage capacity (Maclaurin and Adams [22]; Bardenet et al. [4]; Ahn et al. [2]; Quiroz et al. [35]; Chen et al. [8]; Wilkinson [43]; Laskey and Myers [18]; Murray [26]). Another research direction involves dividing full data sets into subsets, performing independent Bayesian MCMC computation for each subset, and combining the independent results (Neiswanger et al. [28]; Scott et al. [38]). These are communication-free parallel methods, in that each machine generates MCMC samples without trading information with other machines. These techniques focus on data sets with a large number of observations relative to the dimensionality of the model, referred to as tall data, and they partition the data sets by observations into subsets. The methods of Neiswanger et al. [28] and Scott et al. [38] have different procedures for combining the subset posterior samples. Neiswanger et al. [28] developed a kernel density estimator that evaluates each subset posterior density separately; these subset posteriors are then multiplied together to estimate the full data posterior. Alternatively, Scott et al. [38] created the consensus Monte Carlo algorithm that estimates the full data posterior through weighted averaging of the subset MCMC samples.
Both approaches show good performance when the subset posteriors are near Gaussian, which is expected for adequately large sample sizes for each subset, based on the Bayesian central limit theorem (Bernstein–von Mises theorem; see Van der Vaart [41], and Le Cam and Yang [19]). However, for non-Gaussian posteriors, the methods may have unreliable performance (Baker et al. [3]; Neiswanger et al. [28]; Miroshnikov et al. [25]). The method of Neiswanger et al. [28] also has limitations as the number of unknown model parameters increases, since kernel density estimation becomes infeasible in larger dimensions (Wang and Dunson [42]; Scott [37]). Specifically, Scott [37] shows that kernel density estimation can break down for a model with as few as five to six parameters. Another drawback to the subset-combining methods of Neiswanger et al. [28] and Scott et al. [38] is that the prior distribution for the full data analysis is split into M components for the subset analyses, where M is the number of subsets. As a result, priors that are proper for the full data analysis can be improper for the subsets, leading to improper subset posteriors (Scott [37]). Due to the limitations of the methods of Neiswanger et al. [28] and Scott et al. [38], we develop a two-stage communication-free parallel MCMC algorithm for Bayesian nested hierarchical models. The two-stage technique was introduced for Bayesian meta-analysis by Lunn et al. [20] and further modified and implemented by Bryan et al. [6], Hooten et al. [16], Peng et al. [31] and Goudie et al. [14], but was not carried out for parallel computing and big data. The motivation for the two-stage method is that for nested hierarchical models, generally only a few parameters are common for the full data set, with most parameters being group-specific.
Thus, parallel MCMC methods that take into account the structure of the model and split the full data set by groups rather than by observations are a more natural procedure for these models (Scott et al. [38]). Our two-stage method takes this approach and partitions the full data analysis by groups; complete details are provided in Section 2. As a brief overview, in stage 1, the group-specific parameters are estimated in parallel, independently of other groups. The resulting posteriors are used as proposal distributions in a Metropolis-Hastings step in stage 2, where the target distribution is the full hierarchical model. The likelihood is not evaluated during stage 2, and therefore the full data set is never processed in its entirety; this speeds computation and allows large data sets to be evaluated in parts. The two-stage method has several advantages over existing communication-free parallel computing methods, including those of Neiswanger et al. [28] and Scott et al. [38]. Our method is appropriate for both Gaussian and non-Gaussian distributions, and the dimension of the parameter space is not a limitation. In addition, the priors for the full data analysis are not split into parts for the subset analyses, which avoids any improper subset priors. We illustrate these advantages in Sections 2 and 3.
2. Methods
To introduce the two-stage method, we first consider the following general Bayesian nested hierarchical model with three levels. For models with four or more levels, refer to Appendix 1.

$$
\begin{aligned}
\text{level 1: } & y_{ij} \mid \theta_i, \sigma_i^2 \sim p(y_{ij} \mid \theta_i, \sigma_i^2), \quad i = 1,\ldots,n; \; j = 1,\ldots,m_i,\\
\text{level 2: } & \theta_i \mid \mu, \tau^2 \sim p(\theta_i \mid \mu, \tau^2), \quad \sigma_i^2 \sim p(\sigma_i^2),\\
\text{level 3: } & \mu \sim p(\mu), \quad \tau^2 \sim p(\tau^2). \quad (1)
\end{aligned}
$$

Here, the index $i$ denotes the group, and $y_{ij}$ is data value $j$ from group $i$; the model parameters $\mu$, $\tau^2$, $\theta_i$ and $\sigma_i^2$ are unknown. For a full data analysis of the model in Equation (1), the likelihood of the data is
$$
p(\mathbf{y} \mid \boldsymbol{\theta}, \boldsymbol{\sigma}^2) = \prod_{i=1}^{n} \prod_{j=1}^{m_i} p(y_{ij} \mid \theta_i, \sigma_i^2), \quad (2)
$$

where $\mathbf{y}$ denotes the collection of all $y_{ij}$'s, and $\boldsymbol{\theta}$ and $\boldsymbol{\sigma}^2$ denote the collections of all $\theta_i$'s and $\sigma_i^2$'s, respectively. The joint posterior distribution is

$$
p(\mu, \tau^2, \boldsymbol{\theta}, \boldsymbol{\sigma}^2 \mid \mathbf{y}) \propto \prod_{i=1}^{n} \prod_{j=1}^{m_i} p(y_{ij} \mid \theta_i, \sigma_i^2) \prod_{i=1}^{n} p(\theta_i \mid \mu, \tau^2) \prod_{i=1}^{n} p(\sigma_i^2) \, p(\mu) \, p(\tau^2). \quad (3)
$$

In the traditional Bayesian analysis, the full model in Equation (1) is implemented for the entire data set, with all parameters estimated simultaneously. However, for instances where a full data analysis is not feasible, the following two-stage method is introduced. The two-stage method, first introduced by Lunn et al. [20], provides inferences on the full hierarchical model in Equation (1) without requiring evaluation of the complete data set in its entirety; details are described below.
In stage 1, the full data set is split by group $i$ and each group is analyzed independently in parallel. Here, independent prior distributions $p_{\text{ind}}(\theta_i)$ are assigned to each $\theta_i$. The specific choices of priors for the $\theta_i$'s are discussed in Section 3; these are typically chosen to be uninformative (Lunn et al. [20]). The common model parameters $\mu$ and $\tau^2$ are not estimated at this stage. For all remaining parameters, the priors are the same as in the full model. The stage 1 model is then:

$$
\begin{aligned}
\text{level 1: } & y_{ij} \mid \theta_i, \sigma_i^2 \sim p(y_{ij} \mid \theta_i, \sigma_i^2), \quad i = 1,\ldots,n; \; j = 1,\ldots,m_i,\\
\text{level 2: } & \theta_i \sim p_{\text{ind}}(\theta_i), \quad \sigma_i^2 \sim p(\sigma_i^2). \quad (4)
\end{aligned}
$$

Here, we use the notation $\mathbf{y}_i$ to indicate the group-specific data values $y_{ij}$, $j = 1,\ldots,m_i$. MCMC samples are obtained independently for each group $i$ from the following joint posterior distribution of $\theta_i$ and $\sigma_i^2$, conditional on the subset data $\mathbf{y}_i$ only:

$$
p(\theta_i, \sigma_i^2 \mid \mathbf{y}_i) \propto p(\mathbf{y}_i \mid \theta_i, \sigma_i^2) \, p_{\text{ind}}(\theta_i) \, p(\sigma_i^2), \quad i = 1,\ldots,n. \quad (5)
$$

The MCMC sample size is $A_i$ for each group $i$, and the resulting MCMC samples are labeled $\{\theta_i^{(s)}, \sigma_i^{2(s)}\}$, $s = 1,\ldots,A_i$. The stage 1 samples are used as proposal distributions in stage 2 in a Metropolis-Hastings step that has the target distribution of the full model in Equation (1) (Metropolis et al. [24]; Hastings [15]).
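The paper's implementation runs stage 1 in R with JAGS; the following Python sketch (our own illustration, with function names of our choosing, not the authors' code) shows the idea for the conjugate normal case used in Section 3. Each group's posterior in Equation (5) is sampled by a self-contained Gibbs sampler, so the per-group calls are independent and can be farmed out to separate processes.

```python
import numpy as np

def stage1_group_sampler(y_i, n_samples=2000, seed=0):
    """Gibbs sampler for one group's stage 1 model:
    y_ij ~ Normal(theta_i, sigma2_i), with vague independent priors
    theta_i ~ Normal(0, 1e6) and sigma2_i ~ Inverse-Gamma(0.01, 0.01)."""
    rng = np.random.default_rng(seed)
    m = len(y_i)
    theta, sigma2 = y_i.mean(), y_i.var()
    draws = np.empty((n_samples, 2))
    for s in range(n_samples):
        # theta | sigma2: conjugate normal update
        prec = m / sigma2 + 1.0 / 1e6
        mean = (y_i.sum() / sigma2) / prec
        theta = rng.normal(mean, np.sqrt(1.0 / prec))
        # sigma2 | theta: conjugate inverse-gamma update
        shape = 0.01 + m / 2.0
        rate = 0.01 + 0.5 * np.sum((y_i - theta) ** 2)
        sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)
        draws[s] = theta, sigma2
    return draws

# Groups are independent, so the plain `map` below could be swapped for a
# process pool (e.g. multiprocessing.Pool.starmap) without changing results.
rng = np.random.default_rng(1)
groups = [rng.normal(25 + rng.normal(0, 1.2), 3.0, size=500) for _ in range(4)]
stage1_draws = list(map(stage1_group_sampler, groups, [4000] * 4, range(4)))
```

Each element of `stage1_draws` is the stored $\{\theta_i^{(s)}, \sigma_i^{2(s)}\}$ array that stage 2 later resamples from.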
Stage 2 uses a Metropolis-Hastings-within-Gibbs sampling scheme to iteratively draw from the joint posterior distribution of $\mu$, $\tau^2$, $\boldsymbol{\theta}$, and $\boldsymbol{\sigma}^2$ based on the full hierarchical model in Equation (1). For $T$ total samples in stage 2, at each iteration $t$, $t = 1,\ldots,T$, we cycle through the full conditional posterior distributions of $\mu$ and $\tau^2$, and then jointly through each $\theta_i$ and $\sigma_i^2$, for $i = 1,\ldots,n$. The full conditional posteriors are given by

$$
p(\mu \mid \tau^2, \boldsymbol{\theta}, \mathbf{y}) \propto p(\mu) \prod_{i=1}^{n} p(\theta_i \mid \mu, \tau^2), \quad (6)
$$

$$
p(\tau^2 \mid \mu, \boldsymbol{\theta}, \mathbf{y}) \propto p(\tau^2) \prod_{i=1}^{n} p(\theta_i \mid \mu, \tau^2), \quad (7)
$$

$$
p(\theta_i, \sigma_i^2 \mid \mu, \tau^2, \mathbf{y}_i) \propto p(\mathbf{y}_i \mid \theta_i, \sigma_i^2) \, p(\theta_i \mid \mu, \tau^2) \, p(\sigma_i^2), \quad i = 1,\ldots,n. \quad (8)
$$

Initial values are assigned for $\mu^{(0)}$, $\tau^{2(0)}$, $\boldsymbol{\theta}^{(0)}$, and $\boldsymbol{\sigma}^{2(0)}$. The full conditional posterior distributions for the common model parameters $\mu$ and $\tau^2$ in Equations (6) and (7) are typically available in closed form and can then be sampled from directly using standard algorithms such as the Gibbs sampler (Gelfand and Smith [11]); otherwise, there are many alternative algorithms for sampling from these full conditional posteriors (e.g. Metropolis et al. [24]; Hastings [15]; Neal [27]; Gilks and Wild [13]). For jointly updating each $\theta_i$ and $\sigma_i^2$ in Equation (8) at iteration $t$, a Metropolis-Hastings step is carried out. For this, the proposal distribution is based on stage 1 samples, and the target distribution is the full hierarchical model. Next, we introduce general notation for a Metropolis-Hastings step, and follow with the specific Metropolis-Hastings algorithm for our model. For a random variable $\phi$ that has density $h(\phi)$, the Metropolis-Hastings algorithm produces dependent draws from $h(\phi)$ by generating the Markov chain

$$
\phi^{(t)} =
\begin{cases}
\phi^{*(t)} & \text{with probability } \min(1, r),\\
\phi^{(t-1)} & \text{otherwise},
\end{cases} \quad (9)
$$

for $t = 1,\ldots,T$. Here, $\phi^{*(t)}$ is a proposal value sampled from a candidate density $q(\phi)$ that approximates the target density $h(\phi)$ reasonably well, but is easy to draw samples from.
The value $r$ is the "acceptance probability" for the proposal value $\phi^{*(t)}$, and is based on both the target and proposal densities; $r$ is defined as

$$
r = \frac{h(\phi^{*(t)}) \, q(\phi^{(t-1)})}{h(\phi^{(t-1)}) \, q(\phi^{*(t)})}. \quad (10)
$$

The value $r$ can be re-expressed as

$$
r = \frac{h(\phi^{*(t)}) / q(\phi^{*(t)})}{h(\phi^{(t-1)}) / q(\phi^{(t-1)})} = \frac{R(\phi^{*(t)})}{R(\phi^{(t-1)})}, \quad (11)
$$

where $R(x)$ denotes the target-to-candidate ratio of densities $h(x)/q(x)$ (see also Lunn et al. [20]).
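As a concrete illustration of Equations (9)-(11), the following minimal independence Metropolis-Hastings sampler (our own sketch; the function and argument names are ours) computes the ratio $R(x) = h(x)/q(x)$ on the log scale for numerical stability:

```python
import numpy as np

def metropolis_hastings(log_h, log_q, sample_q, n_iter, init, rng):
    """Independence Metropolis-Hastings with target h and proposal q.
    Accepts phi* with probability min(1, R(phi*)/R(phi_prev)),
    where R(x) = h(x)/q(x)."""
    chain = [init]
    log_R_prev = log_h(init) - log_q(init)
    for _ in range(n_iter):
        prop = sample_q(rng)
        log_R_prop = log_h(prop) - log_q(prop)
        if np.log(rng.uniform()) < log_R_prop - log_R_prev:
            chain.append(prop)          # accept proposal
            log_R_prev = log_R_prop
        else:
            chain.append(chain[-1])     # reject: repeat previous value
    return np.array(chain)

# Toy check: target Normal(2, 1) with a wider Normal(0, 3^2) proposal.
rng = np.random.default_rng(0)
draws = metropolis_hastings(
    log_h=lambda x: -0.5 * (x - 2.0) ** 2,
    log_q=lambda x: -0.5 * (x / 3.0) ** 2,
    sample_q=lambda r: r.normal(0.0, 3.0),
    n_iter=20000, init=0.0, rng=rng)
```

Normalizing constants cancel in the ratio, so only unnormalized log densities are needed, mirroring the proportionality signs in Equations (5)-(8).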
For our model, we use a Metropolis-Hastings step at iteration $t$ to jointly update $\theta_i$ and $\sigma_i^2$ based on Equation (8). For this, we sample uniformly at random a candidate value $\{\theta_i^{*(t)}, \sigma_i^{2*(t)}\}$ from the $A_i$ samples from stage 1. This candidate value has the following posterior density from stage 1 in Equation (5), which we label $q_{\theta_i, \sigma_i^2}$; note that this density contains the independent priors for the $\theta_i$'s:

$$
q_{\theta_i, \sigma_i^2}\big(\theta_i^{*(t)}, \sigma_i^{2*(t)}\big) \propto p\big(\mathbf{y}_i \mid \theta_i^{*(t)}, \sigma_i^{2*(t)}\big) \, p_{\text{ind}}\big(\theta_i^{*(t)}\big) \, p\big(\sigma_i^{2*(t)}\big). \quad (12)
$$

In determining the ratio $R$ for this candidate value, the target posterior density of the full model in Equation (8) is also evaluated at the candidate value. We label this target posterior density $h_{\theta_i, \sigma_i^2}$; note that it includes the prior for $\theta_i$ that is dependent on $\mu$ and $\tau^2$, as follows:

$$
h_{\theta_i, \sigma_i^2}\big(\theta_i^{*(t)}, \sigma_i^{2*(t)}\big) \propto p\big(\mathbf{y}_i \mid \theta_i^{*(t)}, \sigma_i^{2*(t)}\big) \, p\big(\theta_i^{*(t)} \mid \mu^{(t)}, \tau^{2(t)}\big) \, p\big(\sigma_i^{2*(t)}\big). \quad (13)
$$

The target-to-candidate ratio for $\{\theta_i^{*(t)}, \sigma_i^{2*(t)}\}$ is then

$$
R\big(\theta_i^{*(t)}, \sigma_i^{2*(t)}\big) = \frac{h_{\theta_i, \sigma_i^2}\big(\theta_i^{*(t)}, \sigma_i^{2*(t)}\big)}{q_{\theta_i, \sigma_i^2}\big(\theta_i^{*(t)}, \sigma_i^{2*(t)}\big)} = \frac{p\big(\theta_i^{*(t)} \mid \mu^{(t)}, \tau^{2(t)}\big)}{p_{\text{ind}}\big(\theta_i^{*(t)}\big)}. \quad (14)
$$

Similarly, the target-to-candidate ratio for the previous value of the Markov chain, $\{\theta_i^{(t-1)}, \sigma_i^{2(t-1)}\}$, is

$$
R\big(\theta_i^{(t-1)}, \sigma_i^{2(t-1)}\big) = \frac{p\big(\theta_i^{(t-1)} \mid \mu^{(t)}, \tau^{2(t)}\big)}{p_{\text{ind}}\big(\theta_i^{(t-1)}\big)}. \quad (15)
$$

Thus, $r$ is given by the following, and the candidate value $\{\theta_i^{*(t)}, \sigma_i^{2*(t)}\}$ will be accepted with probability $\min(1, r)$:

$$
r = \frac{R\big(\theta_i^{*(t)}, \sigma_i^{2*(t)}\big)}{R\big(\theta_i^{(t-1)}, \sigma_i^{2(t-1)}\big)} = \frac{p\big(\theta_i^{*(t)} \mid \mu^{(t)}, \tau^{2(t)}\big) \, p_{\text{ind}}\big(\theta_i^{(t-1)}\big)}{p\big(\theta_i^{(t-1)} \mid \mu^{(t)}, \tau^{2(t)}\big) \, p_{\text{ind}}\big(\theta_i^{*(t)}\big)}. \quad (16)
$$

Note that the likelihoods cancel in Equations (14) and (15), and are thus not included in $r$. As a result, the data sets are not analyzed in stage 2, and this stage is computationally very fast. If the stage 1 prior for $\theta_i$ is effectively uniform, then $r$ can be simplified further as

$$
r = \frac{p\big(\theta_i^{*(t)} \mid \mu^{(t)}, \tau^{2(t)}\big)}{p\big(\theta_i^{(t-1)} \mid \mu^{(t)}, \tau^{2(t)}\big)}. \quad (17)
$$

The value $r$ in Equation (17) is the ratio of priors for $\theta_i$ from the full model, evaluated at the candidate value for $\theta_i$ in the numerator and the previous value of the Markov chain for $\theta_i$ in the denominator. Note that $r$ does not depend on $\sigma_i^2$ in either Equation (16) or (17), so these values are not used in computations in stage 2. Convergence of the MCMC sampler in stage 2 is assessed using standard methods (Cowles and Carlin [9]; Mengersen et al. [23]); we detail convergence diagnostics in Section 2.3. For the examples in Section 3, we used the R programming language (R Core Team [36]) for all computations. For stage 1 and for the full data model, the R package rjags [34] is used to run the JAGS program (Plummer [32]) to generate the MCMC samples; other standard MCMC software such as Stan (Carpenter et al. [7]) or OpenBUGS (Lunn et al. [21]) can also be used for stage 1. For all examples, we chose data set sizes so that a full data analysis was still feasible, in order to compare the two-stage results with the full data results.
For stage 1, stage 2 and the full data analyses in all examples, we ran two parallel MCMC chains with an MCMC sample size of 250,000 after convergence. Initial values of parameters were overdispersed and widely different for the two chains. The resulting samples were thinned by 10 (see Lunn et al. [20]) based on autocorrelation values; this produced samples that were effectively independent between consecutive values, with a final MCMC sample size of 50,000 for the two chains. Convergence was assessed using history plots and Gelman and Rubin statistics (Gelman and Rubin [12]; Brooks and Gelman [5]), which also showed that a burn-in of 10,000 was acceptable for all analyses (Cowles and Carlin [9]; Mengersen et al. [23]). The MCMC sample sizes were equivalent for the two-stage method and the full data analyses, for comparison purposes.
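The effect of the thinning step above can be illustrated with a small sketch (ours, using a synthetic autocorrelated chain rather than actual MCMC output): thinning by 10 leaves consecutive retained draws with negligible autocorrelation.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a chain."""
    xc = x - x.mean()
    return np.dot(xc[:-1], xc[1:]) / np.dot(xc, xc)

# Build a correlated AR(1) chain of length 250,000, then thin by 10
# as described in the text.
rng = np.random.default_rng(0)
noise = rng.normal(size=250_000)
chain = np.empty(250_000)
chain[0] = 0.0
for t in range(1, len(chain)):
    chain[t] = 0.6 * chain[t - 1] + noise[t]

thinned = chain[::10]   # keep every 10th draw; 25,000 values remain
```

The raw chain has substantial lag-1 autocorrelation, while the thinned chain's autocorrelation decays as roughly $0.6^{10}$, so consecutive retained values are effectively independent.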
We compare the results for the two-stage method and the full data analysis using minimum MCMC efficiency and computation time measured in both CPU and elapsed time; all measures are compared after burn-in. Estimated relative $L$ distances are also used for comparisons (Appendix 2). We use the following definition for minimum MCMC efficiency:

$$
\text{Minimum MCMC efficiency} = \min\left(\frac{\text{Effective sample size}}{\text{CPU computation time in seconds}}\right). \quad (18)
$$

Here, effective sample size is based on the CODA R package (Plummer et al. [33]); we use the minimum efficiency over all groups and all parameters for each analysis. Since the two-stage method requires twice as many MCMC samples as the full data method over the two stages, we created the MCMC efficiency improvement factor. This is calculated as follows, assuming equal MCMC sample sizes in all stages and analyses:

$$
\text{MCMC efficiency improvement factor} = \frac{\text{Average minimum MCMC efficiency over two stages in two-stage method}}{\text{Minimum MCMC efficiency in full data analysis}} \times 0.5. \quad (19)
$$
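The quantities entering Equations (18) and (19) can be sketched as follows. CODA's effective sample size estimator is more sophisticated; the crude initial-positive-sequence estimator below (our own illustration, with names of our choosing) is only meant to show how the efficiency measures are assembled.

```python
import numpy as np

def effective_sample_size(x, max_lag=200):
    """Crude ESS estimate: N / (1 + 2 * sum of leading positive
    autocorrelations), truncating at the first non-positive lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom_acf = np.dot(xc, xc)
    ess_denom = 1.0
    for lag in range(1, min(max_lag, n - 1)):
        rho = np.dot(xc[:-lag], xc[lag:]) / denom_acf
        if rho <= 0:
            break
        ess_denom += 2.0 * rho
    return n / ess_denom

def efficiency_improvement_factor(min_eff_stage1, min_eff_stage2, min_eff_full):
    """Equation (19): average the per-stage minimum efficiencies, divide by
    the full-data minimum efficiency, and halve because the two-stage
    method draws twice as many MCMC samples overall."""
    avg = (min_eff_stage1 + min_eff_stage2) / 2.0
    return 0.5 * avg / min_eff_full
```

An i.i.d. chain gives an ESS near its length, while a strongly autocorrelated chain gives a much smaller ESS; dividing by CPU seconds then yields the per-analysis efficiencies fed into Equation (19).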
3. Examples
Here, we simulate data for the following three-level Bayesian nested hierarchical normal model:

$$
\begin{aligned}
\text{level 1: } & y_{ij} \mid \theta_i, \sigma_i^2 \sim \text{Normal}(\theta_i, \sigma_i^2), \quad i = 1,\ldots,n; \; j = 1,\ldots,m_i,\\
\text{level 2: } & \theta_i \mid \mu, \tau^2 \sim \text{Normal}(\mu, \tau^2), \quad \sigma_i^2 \sim \text{Inverse-Gamma}(0.01, 0.01),\\
\text{level 3: } & \mu \sim \text{Normal}(0, 10^6), \quad \tau^2 \sim \text{Inverse-Gamma}(0.1, 0.1). \quad (20)
\end{aligned}
$$

We produced three data sets, one for each of the number of groups $n$ = 20, 50, 100. For each data set, we assigned $\mu$ = 25 and $\tau^2$ = 1.5; the $\theta_i$ values were simulated from Model (20) based on these values, and the $\sigma_i^2$ were simulated from a normal distribution with mean 10 and variance 1. A sample size of $m_i$ = 100,000 per group was generated for each data set. Here, we split the analysis at level 2, so that each group $i$ is analyzed independently in parallel, with the $\theta_i$'s assigned vague independent priors (Lunn et al. [20]). The $\sigma_i^2$'s are assigned the same priors as in the full model in Equation (20). This results in the following stage 1 model:

$$
\begin{aligned}
\text{level 1: } & y_{ij} \mid \theta_i, \sigma_i^2 \sim \text{Normal}(\theta_i, \sigma_i^2), \quad i = 1,\ldots,n; \; j = 1,\ldots,m_i,\\
\text{level 2: } & \theta_i \sim \text{Normal}(0, 10^6), \quad \sigma_i^2 \sim \text{Inverse-Gamma}(0.01, 0.01). \quad (21)
\end{aligned}
$$

The common model parameters $\mu$ and $\tau^2$ are not estimated in stage 1. We use the approaches detailed in Sections 2.2 and 2.3 to produce $A_i$ = 50,000 samples from the joint posterior distribution of $\theta_i$ and $\sigma_i^2$ conditional on the subset data $\mathbf{y}_i$ for each group independently in parallel. The resulting sampled values are used as proposal distributions in stage 2. For stage 2, we estimate the full model in Equation (20) by cycling through the Metropolis-Hastings-within-Gibbs algorithm for all unknown parameters $\mu$, $\tau^2$, $\boldsymbol{\theta}$, and $\boldsymbol{\sigma}^2$, as detailed in Section 2.2. An MCMC sample size of $T$ = 50,000 is again produced as described in Section 2.3. The full conditional posterior distributions are closed form for $\mu$ and $\tau^2$, so a Gibbs sampling algorithm is carried out for these parameters. As stated in Section 2.3, for the full data analysis, we used the same MCMC sample size of $T$ = 50,000 that was used in the two-stage method for comparison (see also Lunn et al.
[20]). For $n$ = 50 groups, the MCMC efficiency improvement factor for the two-stage method versus the full data method was 27.8, indicating a 27.8-fold improvement in efficiency for the two-stage method (Table 1). Both the CPU and elapsed computation times were reduced by 96.7% for the two-stage method versus the full data analysis, assuming all groups were run in parallel. Note that these time reductions are dependent on the number of groups (Table 2). We find close agreement between the results of our two-stage method and the full data approach based on the estimated relative $L$ distances (Appendix 2); the average estimated relative $L$ distances are between 0.021 and 0.024 for all parameters (Table 3). For $n$ = 20 and 100 groups, the results are similar to those for $n$ = 50 (Tables 1, 2, and 3). Figure 1 shows the marginal posterior distributions for representative parameters for both the full data analysis and the two-stage method for $n$ = 50 groups; these plots illustrate the close similarity of our two-stage method to the complete data analysis. Note that the variance parameter has a skewed posterior distribution, with estimated relative $L$ distances of 0.023 and 0.024; this indicates that our method is appropriate for non-Gaussian as well as Gaussian posteriors. We also plot the stage 1 results for representative parameters in Figure 1. For the two-stage method and large data sets, the stage 1 and stage 2 posteriors are similar. This is due to the individual groups having large data sizes, which results in weak shrinkage effects.
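The simulation set-up for Model (20) described above can be sketched in Python as follows (our own illustration; a smaller $m_i$ than the paper's 100,000 is used purely to keep the sketch fast):

```python
import numpy as np

rng = np.random.default_rng(42)
n, m_i = 50, 1000                 # paper uses m_i = 100,000 per group
mu_true, tau2_true = 25.0, 1.5

# theta_i | mu, tau2 ~ Normal(mu, tau2); sigma2_i ~ Normal(10, 1) per the text
theta = rng.normal(mu_true, np.sqrt(tau2_true), size=n)
sigma2 = rng.normal(10.0, 1.0, size=n)

# y_ij | theta_i, sigma2_i ~ Normal(theta_i, sigma2_i)
y = [rng.normal(theta[i], np.sqrt(sigma2[i]), size=m_i) for i in range(n)]
```

Each entry of `y` is one group's subset, ready to be handed to an independent stage 1 sampler.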
Here, we simulate data for the following four-level Bayesian hierarchical logistic regression model:

$$
\begin{aligned}
\text{level 1: } & y_{ij} \sim \text{Bernoulli}(p_{ij}), \quad i = 1,\ldots,n; \; j = 1,\ldots,m_i,\\
\text{level 2: } & \text{logit}(p_{ij}) = \mathbf{X}_{ij}^{T} \boldsymbol{\beta}_i,\\
\text{level 3: } & \boldsymbol{\beta}_i \sim \text{Normal}(\boldsymbol{\mu}, \boldsymbol{\Sigma}),\\
\text{level 4: } & \mu_k \sim \text{Normal}(0, \omega_k^2), \; k = 1,\ldots,K, \quad \boldsymbol{\Sigma}^{-1} \sim \text{Wishart}(\boldsymbol{\Psi}, \nu), \quad (22)
\end{aligned}
$$

where $p_{ij} = \Pr(y_{ij} = 1)$, $\text{logit}(p_{ij}) = \log(p_{ij}/(1-p_{ij}))$, $n$ = number of groups, $m_i$ = sample size per group, and $K$ = number of predictor variables. Here, $\boldsymbol{\beta}_i$ is a $K$-dimensional vector for group $i$ consisting of $\beta_{i1},\ldots,\beta_{iK}$; $\mathbf{X}_i$ represents the $m_i$ measurements in group $i$ for the $K$ predictor variables; $\boldsymbol{\mu}$ is a $K$-dimensional vector; $\boldsymbol{\Sigma}$ and $\boldsymbol{\Psi}$ are $K \times K$ matrices; and $\nu$ is the shape parameter of the Wishart distribution. We use $n$ = 20, $m_i$ = 10,000, $K$ = 3, and assigned $\boldsymbol{\mu} = (0.8, 0.9, 0.7)$. For $\boldsymbol{\Sigma}$, we specified $\Sigma_{kk'} = 0.05$ for $k = k'$ and $\Sigma_{kk'} = 0.04$ for $k \neq k'$. We assigned $\boldsymbol{\Psi}$ to be a diagonal matrix with diagonal elements $\Psi_{kk} = 0.01$, $k = 1,\ldots,K$; we also specified the prior variances $\omega_k^2$, $k = 1,\ldots,K$, and $\nu$ = 4. The predictor values were simulated as $\mathbf{X}_{ij} \sim \text{Normal}(\boldsymbol{0}, \boldsymbol{\Lambda})$, where $\Lambda_{kk'} = 1$ for $k = k'$ and $\Lambda_{kk'} = 0.2$ for $k \neq k'$. With these specifications, we simulated the $p_{ij}$, $y_{ij}$, and $\boldsymbol{\beta}_i$ values based on Model (22). A description of the two-stage approach for a four-level hierarchical model is provided in Appendix 1; we summarize the procedure next.
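The data-generating step just described can be sketched as follows (our own illustration; the zero mean for the predictors follows the covariance-only specification above, and a reduced $m_i$ keeps the sketch fast):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m_i, K = 20, 500, 3            # paper uses m_i = 10,000 per group
mu = np.array([0.8, 0.9, 0.7])

# Sigma: 0.05 on the diagonal, 0.04 off-diagonal (positive definite)
Sigma = np.full((K, K), 0.04); np.fill_diagonal(Sigma, 0.05)
# Lambda: predictor covariance, 1 on the diagonal, 0.2 off-diagonal
Lam = np.full((K, K), 0.2); np.fill_diagonal(Lam, 1.0)

beta = rng.multivariate_normal(mu, Sigma, size=n)        # beta_i ~ N(mu, Sigma)
X = [rng.multivariate_normal(np.zeros(K), Lam, size=m_i) for _ in range(n)]
p = [1.0 / (1.0 + np.exp(-X[i] @ beta[i])) for i in range(n)]  # inverse logit
y = [rng.binomial(1, p[i]) for i in range(n)]            # y_ij ~ Bernoulli(p_ij)
```

Each group's `(X[i], y[i])` pair is then the subset handed to an independent stage 1 logistic-regression sampler.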
For the full model in Equation (22), we divide the computations at level 3, so that each group $i$ is evaluated independently in parallel, with the $\boldsymbol{\beta}_i$'s assigned vague independent priors (Lunn et al. [20]). The common model parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are not estimated in stage 1. This results in the following stage 1 model:

$$
\begin{aligned}
\text{level 1: } & y_{ij} \sim \text{Bernoulli}(p_{ij}), \quad i = 1,\ldots,n; \; j = 1,\ldots,m_i,\\
\text{level 2: } & \text{logit}(p_{ij}) = \mathbf{X}_{ij}^{T} \boldsymbol{\beta}_i,\\
\text{level 3: } & \boldsymbol{\beta}_i \sim \text{Normal}(\boldsymbol{0}, \boldsymbol{\Omega}), \quad (23)
\end{aligned}
$$

where $\boldsymbol{\Omega}$ is a diagonal matrix with diagonal elements $\Omega_{kk} = 100$, $k = 1,\ldots,K$. We again use the methods described in Appendix 1 and Section 2.3 to create $A_i$ = 50,000 samples from the posterior distribution of $\boldsymbol{\beta}_i$ conditional on the subset data $\mathbf{y}_i$, independently in parallel for each group. The posterior samples from stage 1 are used as proposal distributions in stage 2. We estimate the full model in Equation (22) in stage 2 by implementing the Metropolis-Hastings-within-Gibbs algorithm for all unknown parameters $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, and $\boldsymbol{\beta}_i$, $i = 1,\ldots,n$, as detailed in Appendix 1. An MCMC sample size of $T$ = 50,000 is again produced as described in Section 2.3. The full conditional posterior distributions are closed form for $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, so a Gibbs sampling procedure is used for these parameters. For comparison purposes, we used the same MCMC sample size of $T$ = 50,000 for the full data analysis and the computations in stages 1 and 2 (see also Lunn et al. [20]); Section 2.3 provides details, including convergence diagnostics. For all results, our findings are similar to simulation study 1. In particular, the MCMC efficiency improvement factor for the two-stage method compared to the full data method was 21.9, resulting in a 21.9-fold improvement in efficiency for the two-stage technique (Table 1). Both the CPU and elapsed computation times were reduced by 89.0% for the two-stage versus full data approaches, assuming that all groups are processed in parallel; these decreases in computation times depend on the number of groups (Table 2).
The estimated relative $L$ distances range from 0.018 to 0.025 for all model parameters (Table 3). Here, we examine real data for all commercial flights within the United States for the twelve-month period August 2016 to July 2017 (U.S. Department of Transportation [40]). The outcome variable of interest is the arrival delay for each flight, measured in minutes; flights with arrival delays of fifteen minutes or less are regarded as on-time [40]. There were a total of 1,061,023 data values for arrival delays. A log-transformation was applied to the data set, and the resulting transformed data values were approximately normally distributed. Data were available for twelve airlines and seven days of the week. We use the following four-level Bayesian nested hierarchical model to analyze the full data set:

$$
\begin{aligned}
\text{level 1: } & y_{ijk} \mid \delta_{ij}, \eta_{ij}^2 \sim \text{Normal}(\delta_{ij}, \eta_{ij}^2), \quad i = 1,\ldots,n; \; j = 1,\ldots,m; \; k = 1,\ldots,K_{ij},\\
\text{level 2: } & \delta_{ij} \mid \theta_i, \sigma_i^2 \sim \text{Normal}(\theta_i, \sigma_i^2), \quad \eta_{ij}^2 \sim \text{Inverse-Gamma}(0.1, 0.1),\\
\text{level 3: } & \theta_i \mid \mu, \tau^2 \sim \text{Normal}(\mu, \tau^2), \quad \sigma_i^2 \sim \text{Inverse-Gamma}(0.01, 0.01),\\
\text{level 4: } & \mu \sim \text{Normal}(0, 10^6), \quad \tau^2 \sim \text{Inverse-Gamma}(0.1, 0.1). \quad (24)
\end{aligned}
$$

Here, the $y_{ijk}$ are the log-transformed arrival delays for airline $i$, $i = 1,\ldots,n$; day of the week $j$, $j = 1,\ldots,m$; and data replicate $k$ within airline $i$ and day $j$, $k = 1,\ldots,K_{ij}$. The $\delta_{ij}$ parameters are the mean arrival delays for each airline $i$ and day $j$; the $\theta_i$ are the means for each airline $i$, and $\mu$ is the mean over all airlines. The unknown variance parameters are the variability within airline $i$ and day $j$, $\eta_{ij}^2$; the variability within airline $i$, $\sigma_i^2$; and the variability over all airlines, $\tau^2$. Complete details of the two-stage approach for a general four-level nested hierarchical model are provided in Appendix 3, with summarized steps given below. For the full model in Equation (24), we split the analysis at level 3, so that data values within group $i$ (i.e. airline $i$) are analyzed in parallel independently in stage 1.
The $\theta_i$'s are assigned independent vague priors, with all other priors the same as in the full model (Lunn et al. [20]). The stage 1 model is then:

$$
\begin{aligned}
\text{level 1: } & y_{ijk} \mid \delta_{ij}, \eta_{ij}^2 \sim \text{Normal}(\delta_{ij}, \eta_{ij}^2), \quad i = 1,\ldots,n; \; j = 1,\ldots,m; \; k = 1,\ldots,K_{ij},\\
\text{level 2: } & \delta_{ij} \mid \theta_i, \sigma_i^2 \sim \text{Normal}(\theta_i, \sigma_i^2), \quad \eta_{ij}^2 \sim \text{Inverse-Gamma}(0.1, 0.1),\\
\text{level 3: } & \theta_i \sim \text{Normal}(0, 10^6), \quad \sigma_i^2 \sim \text{Inverse-Gamma}(0.01, 0.01). \quad (25)
\end{aligned}
$$

The common model parameters $\mu$ and $\tau^2$ are not estimated in stage 1. The notation $\mathbf{y}_{ij}$ is used to denote $y_{ijk}$, $k = 1,\ldots,K_{ij}$; $\boldsymbol{\delta}_i$ denotes $\delta_{ij}$, $j = 1,\ldots,m$; and $\boldsymbol{\eta}_i^2$ denotes $\eta_{ij}^2$, $j = 1,\ldots,m$. For each airline $i$, we use the procedures described in Appendix 3 and Section 2.3 to generate $A_i$ = 50,000 samples independently in parallel from the joint distribution of $\boldsymbol{\delta}_i$, $\boldsymbol{\eta}_i^2$, $\theta_i$, and $\sigma_i^2$, conditional on the subset data. The stage 1 samples are used as proposal distributions in stage 2. For stage 2, the full model in Equation (24) is estimated by iterating through the Gibbs sampler for the common model parameters $\mu$ and $\tau^2$, which are closed form, and the Metropolis-Hastings step for each $\boldsymbol{\delta}_i$, $\boldsymbol{\eta}_i^2$, $\theta_i$, and $\sigma_i^2$ jointly, with an MCMC sample size of $T$ = 50,000. Full details of the implementation of stage 2 for a general four-level model are provided in Appendix 3 and Section 2.3. For the full data analysis, we used the same MCMC sample size of $T$ = 50,000 as in stages 1 and 2 for comparison (see also Lunn et al. [20]); see Section 2.3 for implementation details, including convergence diagnostics. For all results, our findings are similar to the simulation studies. Specifically, the MCMC efficiency improvement factor for the two-stage method versus the full data method was 17.1, resulting in a 17.1-fold increase in efficiency for the two-stage approach (Table 1). The CPU time was decreased by 65.3% for the two-stage versus full data methods, and the elapsed time was lowered by 65.2%, assuming all airlines are computed in parallel; these time reductions are dependent on the number of airlines (Table 2). The estimated relative $L$ distances range between 0.017 and 0.031 for all parameters (Table 3). Figure 2 illustrates the close agreement of our two-stage method with the results from the complete data analysis.
4. Conclusions
Here, we adapted and extended the two-stage Bayesian hierarchical modeling approach introduced by Lunn et al. [20] for communication-free parallel MCMC and big data. The two-stage method takes into account the structure of the model and partitions the full data set by groups rather than by observations. We found in simulation studies and a real data analysis that our two-stage method produces results that closely agree with the full data analysis, with greatly increased MCMC efficiencies and greatly reduced computation times. The increases in efficiencies and reductions in computation times are dependent on the number of groups. The two-stage method has several advantages over the existing communication-free parallel computing methods of Neiswanger et al. [28] and Scott et al. [38], which divide the full data set by observations rather than by groups. As shown in the examples, our method is appropriate for both Gaussian and non-Gaussian posterior distributions, unlike these existing methods, which are best suited to Gaussian posteriors. The two-stage method also does not split the prior distributions of the full model into parts for the subset analyses, avoiding any impropriety of subset priors and resulting subset posteriors. In addition, unlike the kernel density estimation technique of Neiswanger et al. [28], our procedure is not limited by the dimension of the parameter space or the number of groups. This is because the groups are processed independently in stage 1, and then estimated one-by-one in stage 2 through the Metropolis-Hastings step. The two-stage method is also flexible in that different numbers of MCMC samples can be drawn in each group in stage 1; this may be necessary if, for example, some groups require more iterations to converge than others. In addition, the parallel computing in stage 1 is particularly applicable for distributed computing frameworks such as MapReduce (Dean and Ghemawat [10]), and can be run on multi-core processors and networks of machines.
A limitation of the two-stage method is that the stage 1 posteriors must be reasonable approximations to the full data posteriors (Lunn et al. [20]). If there is too much difference between these posteriors, then the Metropolis-Hastings algorithm will have a very low acceptance rate. However, with large data sets this is unlikely to happen, since the posteriors will typically be dominated by the individual groups' data and the shrinkage effects are expected to be weak. As a result, the stage 1 posteriors closely resemble the full data posteriors; this is shown in our first simulation example and the real data example (Figures 1 and 2).
Disclosure statement
No potential conflict of interest was reported by the authors.
References
[1] A. Agarwal and J.C. Duchi, Distributed delayed stochastic optimization, in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger, eds., 2011, pp. 873-881.
[2] S. Ahn, B. Shahbaba, and M. Welling, Distributed stochastic gradient MCMC, in Proceedings of the International Conference on Machine Learning 32, E.P. Xing and T. Jebara, eds., 2014, pp. 1044-1052.
[3] J. Baker, P. Fearnhead, E.B. Fox, and C. Nemeth, Control variates for stochastic gradient MCMC, preprint (2017). Available at arXiv:1706.05439.
[4] R. Bardenet, A. Doucet, and C. Holmes, Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach, in Proceedings of the International Conference on Machine Learning 32, E.P. Xing and T. Jebara, eds., 2014, pp. 405-413.
[5] S.P. Brooks and A. Gelman, General methods for monitoring convergence of iterative simulations, Journal of Computational and Graphical Statistics 7 (1998), pp. 434-455.
[6] S.R. Bryan, P.H.C. Eilers, J. van Rosmalen, D. Rizopoulos, K.A. Vermeer, H.G. Lemij, and E.M.E.H. Lesaffre, Bayesian hierarchical modeling of longitudinal glaucomatous visual fields using a two-stage approach, Statistics in Medicine 36 (2017), pp. 1735-1753.
[7] B. Carpenter, A. Gelman, M.D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell, Stan: A probabilistic programming language, Journal of Statistical Software (2017).
[8] H. Chen, D. Seita, X. Pan, and J. Canny, An efficient minibatch acceptance test for Metropolis-Hastings, preprint (2017). Available at arXiv:1610.06848v3.
[9] M.K. Cowles and B.P. Carlin, Markov chain Monte Carlo convergence diagnostics: a comparative review, Journal of the American Statistical Association 91 (1996), pp. 883-904.
[10] J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the Association for Computing Machinery 51 (2008), pp. 107-113.
[11] A.E. Gelfand and A.F.M. Smith, Sampling-based approaches to calculating marginal densities, Journal of the American Statistical Association 85 (1990), pp. 398-409.
[12] A. Gelman and D.B. Rubin, Inference from iterative simulation using multiple sequences (with discussion), Statistical Science 7 (1992), pp. 457-511.
[13] W.R. Gilks and P. Wild, Adaptive rejection sampling for Gibbs sampling, Applied Statistics 41 (1992), pp. 337-348.
[14] R.J.B. Goudie, R. Hovorka, H.R. Murphy, and D.L. Lunn, Rapid model exploration for complex hierarchical data: application to pharmacokinetics of insulin aspart, Statistics in Medicine 34 (2015), pp. 3144-3158.
[15] W.K. Hastings, Monte Carlo sampling-based methods using Markov chains and their applications, Biometrika 57 (1970), pp. 97-109.
[16] M. Hooten, F. Buderman, B. Brost, E. Hanks, and J. Ivan, Hierarchical animal movement models for population-level inference, Environmetrics 27 (2016), pp. 322-333.
[17] J. Langford, A.J. Smola, and M. Zinkevich, Slow learners are fast, in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J.D. Lafferty, C.K.I. Williams, and A. Culotta, eds., 2009, pp. 2331-2339.
[18] K.B. Laskey and J.W. Myers, Population Markov chain Monte Carlo, Machine Learning 50 (2003), pp. 175-196.
[19] L.M. Le Cam and G.L. Yang, Asymptotics in Statistics: Some Basic Concepts, Springer-Verlag, New York, 2000.
[20] D. Lunn, J. Barrett, M. Sweeting, and S. Thompson, Fully Bayesian hierarchical modelling in two stages, with application to meta-analysis, Journal of the Royal Statistical Society, Series C 62 (2013), pp. 551-572.
[21] D.J. Lunn, A. Thomas, N. Best, and D. Spiegelhalter, WinBUGS - a Bayesian modelling framework: concepts, structure, and extensibility, Statistics and Computing 10 (2000), pp. 325-337.
[22] D. Maclaurin and R.P. Adams, Firefly Monte Carlo: Exact MCMC with subsets of data, in Uncertainty in Artificial Intelligence 30, N. Zhang and J. Tian, eds., 2014, pp. 543-552.
[23] K.L. Mengersen, C.P. Robert, and C. Guihenneuc-Jouyaux, MCMC convergence diagnostics: a review, in Bayesian Statistics 6, J.M. Bernardo, J.O. Berger, A.P. Dawid, and A.F.M. Smith, eds., Oxford University Press, Oxford, 1999, pp. 415-440.
[24] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller, Equations of state calculations by fast computing machines, The Journal of Chemical Physics 21 (1953), pp. 1087-1091.
[25] A. Miroshnikov, Z. Wei, and E.M. Conlon, Parallel Markov chain Monte Carlo for non-Gaussian posterior distributions, Stat 4 (2015), pp. 304-319.
[26] L.M. Murray, Distributed Markov chain Monte Carlo, Proceedings of Neural Information Processing Systems Workshop on Learning on Cores, Clusters and Clouds 11 (2010).
[27] R.M. Neal, Markov chain Monte Carlo methods based on 'slicing' the density function, Technical Report 9722, Department of Statistics, University of Toronto, 1997.
[28] W. Neiswanger, C. Wang, and E. Xing, Asymptotically exact, embarrassingly parallel MCMC, in Uncertainty in Artificial Intelligence 30, N. Zhang and J. Tian, eds., 2014, pp. 623-632.
[29] D. Newman, A. Asuncion, P. Smyth, and M. Welling, Distributed algorithms for topic models, Journal of Machine Learning Research 10 (2009), pp. 1801-1828.
[30] J. Oliva, B. Póczos, and J. Schneider, Distribution to distribution regression, in Proceedings of Machine Learning Research 30, S. Dasgupta and D. McAllester, eds., 2013, pp. 1049-1057.
[31] W. Peng, Y.F. Li, Y.J. Yang, S.P. Zhu, and H.Z. Huang, Bivariate analysis of incomplete degradation observations based on inverse Gaussian processes and copulas, IEEE Transactions on Reliability 65 (2016), pp. 624-639.
[32] M. Plummer, JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling, in Proceedings of the 3rd International Workshop on Distributed Statistical Computing, K. Hornik, F. Leisch, and A. Zeileis, eds., 2003.
[33] M. Plummer, N. Best, K. Cowles, and K. Vines, CODA: Convergence diagnosis and output analysis for MCMC, R News 6 (2006), pp. 7-11.
[34] M. Plummer, A. Stukalov, and M. Denwood, RJAGS: Bayesian graphical models using MCMC, 2016. Available at https://cran.r-project.org/web/packages/rjags/.
[35] M. Quiroz, M.-N. Tran, M. Villani, and R. Kohn, Exact subsampling MCMC, preprint (2017). Available at arXiv:1603.08232v5.
[36] R Core Team, R: A language and environment for statistical computing.
[37] S.L. Scott, Comparing consensus Monte Carlo strategies for distributed Bayesian computation, Brazilian Journal of Probability and Statistics 31 (2017), pp. 668-685.
[38] S.L. Scott, A.W. Blocker, F.V. Bonassi, H.A. Chipman, E.I. George, and R.E. McCulloch, Bayes and big data: the consensus Monte Carlo algorithm, International Journal of Management Science and Engineering Management 11 (2016), pp. 78-88.
[39] A. Smola and S. Narayanamurthy, An architecture for parallel topic models, in Proceedings of the Very Large Databases Endowment 3, E. Bertino, P. Atzeni, K.L. Tan, Y. Chen, and Y.C. Tay, eds., 2010, pp. 703-710.
[40] A.W. Van der Vaart, Asymptotic Statistics, Cambridge University Press, Cambridge, 1998.
[42] X. Wang and D.B. Dunson, Parallelizing MCMC via Weierstrass sampler, preprint (2014). Available at arXiv:1312.4605v2.
[43] D. Wilkinson, Parallel Bayesian computation, in Handbook of Parallel Computing and Statistics, E.J. Kontoghiorghes, ed., Marcel Dekker/CRC Press, New York, 2006, pp. 481-512.
Appendix 1
Hierarchical models with more than three levels
As stated in Lunn et al. [20], the two-stage approach can be revised directly for models with four levels or more. To summarize the approach outlined in Lunn et al. [20], for a model with $C$ levels, the analysis can be split at level $c_s$, $c_s < C$. In stage 1, independent posteriors for the parameters of interest at level $c_s$ are sampled, and these samples are used as proposal distributions in stage 2. The two-stage method then proceeds similarly to that described in Section 2.2. The likelihood and additional parameters in levels 1 to $c_s$ will cancel in Equations (14) and (15), if the distributional assumptions and data values are the same in the stage 1 model and the full hierarchical model. All parameters in levels $(c_s + 1), \dots, C$ are estimated in stage 2 using either Gibbs sampling or alternative standard MCMC algorithms (Gelfand and Smith [11]; Metropolis et al. [24]; Hastings [15]; Neal [27]; Gilks and Wild [13]). Full details for the two-stage approach with a general four-level nested hierarchical model are provided in Appendix 3; we implement a simulation study for a four-level model in Section 3.2, and a real data analysis for a four-level model in Section 3.3.
Appendix 2
Metrics for comparing posterior densities
The performance of the two-stage method is assessed by comparing results to the full data analysis. For this, we measure the distance between the marginal posterior density $p_F$ based on the full data approach and the marginal posterior density $p_T$ based on the two-stage analysis. For parameter $\theta$, we measure the estimated relative $L_1$ distance, $d_1(p_F, p_T)$, and the estimated relative $L_\infty$ distance, $d_\infty(p_F, p_T)$, relative to the full data marginal posterior; these are defined as follows:

$$d_1(p_F, p_T) = \frac{\lVert p_F - p_T \rVert_{L_1}}{\lVert p_F \rVert_{L_1}} = \frac{\int_{\mathbb{R}} \lvert p_F(\theta) - p_T(\theta) \rvert \, d\theta}{\int_{\mathbb{R}} \lvert p_F(\theta) \rvert \, d\theta}, \qquad (26)$$

$$d_\infty(p_F, p_T) = \frac{\lVert p_F - p_T \rVert_{L_\infty}}{\lVert p_F \rVert_{L_\infty}} = \frac{\sup_{\theta \in \mathbb{R}} \lvert p_F(\theta) - p_T(\theta) \rvert}{\sup_{\theta \in \mathbb{R}} \lvert p_F(\theta) \rvert}. \qquad (27)$$

Density smoothing is used for $p_F$ and $p_T$, as described in Neiswanger et al. [28] and Oliva et al. [30]. Values of the estimated relative $L_1$ and $L_\infty$ distances near zero indicate close agreement of the two-stage estimated posterior with the full data posterior. The estimated relative $L_1$ and $L_\infty$ distances are produced for all unknown model parameters in the examples in Section 3.
Appendix 3
Two-stage method for a general four-level Bayesian nested hierarchical model
Here, we describe the two-stage method for the following general four-level Bayesian nested hierarchical model, where $\phi$ and $\omega$ denote the level-4 common model parameters:

$$\begin{aligned}
\text{level 1: } & y_{ijk} \mid \delta_{ij}, \eta_{ij} \sim p(y_{ijk} \mid \delta_{ij}, \eta_{ij}), \quad i = 1,\dots,n; \; j = 1,\dots,m_i; \; k = 1,\dots,K_{ij},\\
\text{level 2: } & \delta_{ij} \mid \theta_i, \sigma_i^2 \sim p(\delta_{ij} \mid \theta_i, \sigma_i^2), \quad \eta_{ij} \sim p(\eta_{ij}),\\
\text{level 3: } & \theta_i \mid \phi, \omega \sim p(\theta_i \mid \phi, \omega), \quad \sigma_i^2 \sim p(\sigma_i^2),\\
\text{level 4: } & \phi \sim p(\phi), \quad \omega \sim p(\omega). \qquad (28)
\end{aligned}$$

Stage 1
In stage 1, the analysis is split by group $i$ at level 3 and each group $i$ is evaluated independently. The $\theta_i$s are assigned independent prior distributions, and the common model parameters $\phi$ and $\omega$ are not estimated at this stage. The stage 1 model is then:

$$\begin{aligned}
\text{level 1: } & y_{ijk} \mid \delta_{ij}, \eta_{ij} \sim p(y_{ijk} \mid \delta_{ij}, \eta_{ij}), \quad i = 1,\dots,n; \; j = 1,\dots,m_i; \; k = 1,\dots,K_{ij},\\
\text{level 2: } & \delta_{ij} \mid \theta_i, \sigma_i^2 \sim p(\delta_{ij} \mid \theta_i, \sigma_i^2), \quad \eta_{ij} \sim p(\eta_{ij}),\\
\text{level 3: } & \theta_i \sim p(\theta_i), \quad \sigma_i^2 \sim p(\sigma_i^2). \qquad (29)
\end{aligned}$$

Here, we use the notation $\mathbf{y}_{ij}$ to indicate $\{y_{ijk}, k = 1,\dots,K_{ij}\}$; $\boldsymbol{\delta}_i$ for $\{\delta_{ij}, j = 1,\dots,m_i\}$; and $\boldsymbol{\eta}_i$ for $\{\eta_{ij}, j = 1,\dots,m_i\}$. MCMC samples are generated independently in parallel for each group $i$ from the following joint posterior distribution of $\theta_i$, $\sigma_i^2$, $\boldsymbol{\delta}_i$ and $\boldsymbol{\eta}_i$, conditional on the subset data $\mathbf{y}_i = \{\mathbf{y}_{ij}, j = 1,\dots,m_i\}$ only:

$$p(\theta_i, \sigma_i^2, \boldsymbol{\delta}_i, \boldsymbol{\eta}_i \mid \mathbf{y}_i) \propto \Bigl[\prod_{j=1}^{m_i} \prod_{k=1}^{K_{ij}} p(y_{ijk} \mid \delta_{ij}, \eta_{ij})\Bigr] \Bigl[\prod_{j=1}^{m_i} p(\delta_{ij} \mid \theta_i, \sigma_i^2)\, p(\eta_{ij})\Bigr] p(\theta_i)\, p(\sigma_i^2), \quad i = 1,\dots,n. \qquad (30)$$

The resulting MCMC samples of size $A_i$ for each group $i$ are labeled $\{\theta_i^{(s)}, \sigma_i^{2(s)}, \boldsymbol{\delta}_i^{(s)}, \boldsymbol{\eta}_i^{(s)}\}$, $s = 1,\dots,A_i$, for $i = 1,\dots,n$. The stage 1 posteriors are used as proposal distributions in a Metropolis-Hastings step in stage 2.
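Because each group's stage 1 posterior depends only on that group's data, the per-group samplers are embarrassingly parallel. A minimal Python sketch with our own toy conjugate model and function names, not the paper's implementation:

```python
import numpy as np

def stage1_group(seed, y_i, n_iter=2000):
    """Stage 1 sampler for one group of a toy hierarchical model
    (our own conjugate sketch, not the paper's model):
      y_ijk ~ N(delta_ij, 1), delta_ij | theta_i ~ N(theta_i, 1),
    with an effectively flat prior on theta_i. A short Gibbs sampler
    draws {theta_i, delta_i} jointly."""
    rng = np.random.default_rng(seed)
    m, K = y_i.shape
    ybar = y_i.mean(axis=1)
    theta, draws = 0.0, []
    for _ in range(n_iter):
        # delta_ij | theta_i, y_ij ~ N((K*ybar_j + theta)/(K+1), 1/(K+1))
        delta = rng.normal((K * ybar + theta) / (K + 1), 1.0 / np.sqrt(K + 1))
        # theta_i | delta_i (flat prior) ~ N(mean(delta), 1/m)
        theta = rng.normal(delta.mean(), 1.0 / np.sqrt(m))
        draws.append((theta, delta.copy()))
    return draws

rng = np.random.default_rng(0)
groups = [rng.normal(i * 1.0, 1.0, (4, 200)) for i in range(6)]  # 6 groups, m_i = 4

# Groups are mutually independent in stage 1, so this map can be distributed
# verbatim across cores or machines (e.g. multiprocessing.Pool or MapReduce).
stage1_out = [stage1_group(i, y_i) for i, y_i in enumerate(groups)]
print(len(stage1_out))
```

The stored per-group draws play the role of the $A_i$ stage 1 samples that stage 2 later uses as proposals.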
Stage 2
Stage 2 uses the same Metropolis-Hastings-within-Gibbs sampling scheme described in Section 2.2 to iteratively draw from the joint posterior distribution of $\phi$, $\omega$, $\boldsymbol{\theta}$, $\boldsymbol{\sigma}^2$, $\boldsymbol{\delta}$ and $\boldsymbol{\eta}$ based on the full hierarchical model in Equation (28), where $\boldsymbol{\delta}$, $\boldsymbol{\eta}$, $\boldsymbol{\theta}$ and $\boldsymbol{\sigma}^2$ denote the collections of all $\delta_{ij}$s, $\eta_{ij}$s, $\theta_i$s and $\sigma_i^2$s, respectively. Initial values are assigned to the parameters in the full conditional posterior distributions, except for the first sampled parameter $\phi$, which is conditional on all other initial values. At each iteration $t$, $t = 1,\dots,T$, we cycle through the full conditional posterior distributions of $\phi$ and $\omega$, and then jointly through each set of parameters $\theta_i$, $\sigma_i^2$, $\boldsymbol{\delta}_i$ and $\boldsymbol{\eta}_i$ of group $i$, for $i = 1,\dots,n$. Specifically, from Model (28), the full conditional posterior distributions are the following:

$$p(\phi \mid \omega, \boldsymbol{\theta}, \boldsymbol{\sigma}^2, \boldsymbol{\delta}, \boldsymbol{\eta}, \mathbf{y}) \propto p(\phi) \prod_{i=1}^{n} p(\theta_i \mid \phi, \omega), \qquad (31)$$

$$p(\omega \mid \phi, \boldsymbol{\theta}, \boldsymbol{\sigma}^2, \boldsymbol{\delta}, \boldsymbol{\eta}, \mathbf{y}) \propto p(\omega) \prod_{i=1}^{n} p(\theta_i \mid \phi, \omega), \qquad (32)$$

$$p(\theta_i, \sigma_i^2, \boldsymbol{\delta}_i, \boldsymbol{\eta}_i \mid \phi, \omega, \mathbf{y}) \propto \Bigl[\prod_{j=1}^{m_i} \prod_{k=1}^{K_{ij}} p(y_{ijk} \mid \delta_{ij}, \eta_{ij})\Bigr] \Bigl[\prod_{j=1}^{m_i} p(\delta_{ij} \mid \theta_i, \sigma_i^2)\, p(\eta_{ij})\Bigr] p(\theta_i \mid \phi, \omega)\, p(\sigma_i^2), \quad i = 1,\dots,n, \qquad (33)$$

where $\mathbf{y}$ denotes the collection of all $y_{ijk}$s. The full conditional distributions for the common model parameters $\phi$ and $\omega$ in Equations (31) and (32) are often available in closed form and can then be sampled from directly through algorithms such as the Gibbs sampler (Gelfand and Smith [11]); otherwise, many alternative algorithms can be used to sample from these full conditional posterior distributions (e.g. Metropolis et al. [24]; Hastings [15]; Neal [27]; Gilks and Wild [13]).

For jointly updating $\{\theta_i, \sigma_i^2, \boldsymbol{\delta}_i, \boldsymbol{\eta}_i\}$ in Equation (33) at iteration $t$, a Metropolis-Hastings step is implemented. For this, the stage 1 samples are used for the proposal distribution, and the target distribution is the full hierarchical model (see Section 2.2 for complete details). Here, we randomly sample a candidate value $\{\theta_i^{*(t)}, \sigma_i^{2*(t)}, \boldsymbol{\delta}_i^{*(t)}, \boldsymbol{\eta}_i^{*(t)}\}$ from the $A_i$ samples from stage 1 and label it $a_{it}$, where the candidate has the following proposal density from Equation (30):

$$q(a_{it}) \propto \Bigl[\prod_{j=1}^{m_i} \prod_{k=1}^{K_{ij}} p(y_{ijk} \mid \delta_{ij}^{*(t)}, \eta_{ij}^{*(t)})\Bigr] \Bigl[\prod_{j=1}^{m_i} p(\delta_{ij}^{*(t)} \mid \theta_i^{*(t)}, \sigma_i^{2*(t)})\, p(\eta_{ij}^{*(t)})\Bigr] p(\theta_i^{*(t)})\, p(\sigma_i^{2*(t)}). \qquad (34)$$

The target posterior density of the full model in Equation (33) is also assessed at the candidate value; this target posterior density includes a prior for $\theta_i$ that is dependent on $\phi$ and $\omega$, as follows:

$$h(a_{it}) \propto \Bigl[\prod_{j=1}^{m_i} \prod_{k=1}^{K_{ij}} p(y_{ijk} \mid \delta_{ij}^{*(t)}, \eta_{ij}^{*(t)})\Bigr] \Bigl[\prod_{j=1}^{m_i} p(\delta_{ij}^{*(t)} \mid \theta_i^{*(t)}, \sigma_i^{2*(t)})\, p(\eta_{ij}^{*(t)})\Bigr] p(\theta_i^{*(t)} \mid \phi^{(t)}, \omega^{(t)})\, p(\sigma_i^{2*(t)}). \qquad (35)$$

The target-to-candidate ratio for the candidate value $\{\theta_i^{*(t)}, \sigma_i^{2*(t)}, \boldsymbol{\delta}_i^{*(t)}, \boldsymbol{\eta}_i^{*(t)}\}$ is then

$$R(a_{it}) = \frac{h(a_{it})}{q(a_{it})} = \frac{p(\theta_i^{*(t)} \mid \phi^{(t)}, \omega^{(t)})}{p(\theta_i^{*(t)})}. \qquad (36)$$

Similarly, the target-to-candidate ratio for the previous value of the Markov chain, $\{\theta_i^{(t-1)}, \sigma_i^{2(t-1)}, \boldsymbol{\delta}_i^{(t-1)}, \boldsymbol{\eta}_i^{(t-1)}\}$, is

$$R(a_{i,t-1}) = \frac{h(a_{i,t-1})}{q(a_{i,t-1})} = \frac{p(\theta_i^{(t-1)} \mid \phi^{(t)}, \omega^{(t)})}{p(\theta_i^{(t-1)})}. \qquad (37)$$

Thus, $r$ is given by the following, and the candidate value $\{\theta_i^{*(t)}, \sigma_i^{2*(t)}, \boldsymbol{\delta}_i^{*(t)}, \boldsymbol{\eta}_i^{*(t)}\}$ will be accepted with probability minimum(1, $r$):

$$r = \frac{R(a_{it})}{R(a_{i,t-1})} = \frac{p(\theta_i^{*(t)} \mid \phi^{(t)}, \omega^{(t)})\, p(\theta_i^{(t-1)})}{p(\theta_i^{(t-1)} \mid \phi^{(t)}, \omega^{(t)})\, p(\theta_i^{*(t)})}. \qquad (38)$$

Assuming effectively uniform priors in stage 1 for $\theta_i$, $r$ can be simplified further as

$$r = \frac{p(\theta_i^{*(t)} \mid \phi^{(t)}, \omega^{(t)})}{p(\theta_i^{(t-1)} \mid \phi^{(t)}, \omega^{(t)})}. \qquad (39)$$

Except for $\theta_i$, all distributional assumptions are the same in levels 1 to 3 for both the stage 1 model and the full hierarchical model, and thus the $\sigma_i^2$, $\boldsymbol{\delta}_i$ and $\boldsymbol{\eta}_i$ terms cancel in Equations (36) and (37). In addition, since the data values are the same in the stage 1 model and the full hierarchical model, the likelihood terms also cancel in Equations (36) and (37). As a result, the ratio $r$ for the Metropolis-Hastings step does not depend on the data sets, and stage 2 can be performed very rapidly. Note that the ratio $r$ in Equation (39) is the same as for the three-level model in Section 2.2. The ratio $r$ is again the ratio of priors for $\theta_i$ from the full model, evaluated at the candidate value for $\theta_i$ in the numerator and the previous value of the Markov chain for $\theta_i$ in the denominator. Convergence of the MCMC sampler in stage 2 is determined using standard procedures (Cowles and Carlin [9]; Mengersen et al. [23]); we describe convergence diagnostics in Section 2.3.
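The resulting stage 2 group update reduces to comparing full-model priors for theta_i, as in Equation (39). A hedged Python sketch, where the function names and the normal level-3 prior are our own illustrative assumptions:

```python
import numpy as np

def mh_theta_update(theta_prev, stage1_pool, phi, omega, log_prior, rng):
    """One Metropolis-Hastings group update: propose a stored stage 1 draw;
    with effectively uniform stage 1 priors, the acceptance ratio reduces to
    full-model priors for theta_i at the candidate vs. the previous value."""
    cand = stage1_pool[rng.integers(len(stage1_pool))]
    log_r = log_prior(cand, phi, omega) - log_prior(theta_prev, phi, omega)
    return cand if np.log(rng.uniform()) < log_r else theta_prev

# Toy usage: level-3 prior p(theta_i | phi, omega) taken as N(phi, omega)
# (an assumption for illustration), with a stand-in stage 1 sample pool.
log_norm = lambda x, m, v: -0.5 * ((x - m) ** 2 / v + np.log(2 * np.pi * v))
rng = np.random.default_rng(3)
pool = rng.normal(1.0, 1.0, 10_000)   # pretend stage 1 draws for one group
theta, draws = 0.0, []
for _ in range(5_000):
    theta = mh_theta_update(theta, pool, 0.0, 1.0, log_norm, rng)
    draws.append(theta)
print(round(float(np.mean(draws[1_000:])), 2))
```

Note that no data or likelihood evaluations appear in the update, which is why stage 2 runs quickly regardless of the data set size.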
Table 1. Minimum MCMC efficiency.

Data Set (a) | Number of Groups (b) | Stage 1: Minimum of All Subsets (c) | Stage 2 (d) | Average Over Stage 1 and Stage 2 (e) | Full Data Analysis (f) | MCMC Efficiency Improvement Factor, Two Stage Versus Full Data: (e)/(f) x 0.50
Simulation Study 1 | 20 | 9.18 | 63.44 | 36.31 | 0.59 | 30.8
Simulation Study 1 | 50 | 12.91 | 24.82 | 18.87 | 0.34 | 27.8
Simulation Study 1 | 100 | 15.56 | 12.29 | 13.93 | 0.18 | 38.7
Simulation Study 2 | 20 | 1.74 | 11.39 | 6.57 | 0.15 | 21.9
Real Data | 12 | 4.05 | 86.30 | 45.18 | 1.32 | 17.1

Note: Minimum MCMC efficiency [= minimum(effective sample size / CPU time in seconds)] over all parameters for sampling 250,000 samples, after burn-in; columns (c)-(e) are the two-stage analysis. The MCMC efficiency improvement factor of two stage versus full data analysis is multiplied by 50%, since the two-stage method requires twice as many samples in total versus the full data analysis. The R programming language computation times are based on computers with Linux Ubuntu 16.04 operating systems and Intel Xeon E3-1225 V2 CPU 3.2 GHz processors with 16 GB memory.
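The improvement-factor column in Table 1 can be reproduced from columns (c)-(f); for example, for Simulation Study 1 with 20 groups, using the table's own numbers:

```python
# Recomputing Table 1's improvement factor for Simulation Study 1, 20 groups.
stage1_eff, stage2_eff = 9.18, 63.44           # min ESS / CPU second, columns (c), (d)
two_stage_avg = (stage1_eff + stage2_eff) / 2  # column (e)
full_data_eff = 0.59                           # column (f)
# Halved because the two-stage method draws twice as many samples in total.
factor = two_stage_avg / full_data_eff * 0.50
print(round(two_stage_avg, 2), round(factor, 1))
```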
Table 2. Computation time (in minutes).

Data Set | Number of Groups | Type of Time | Two-Stage Stage 1: Maximum Over All Subsets | Two-Stage Stage 2 | Two-Stage Total: Stage 1 plus Stage 2 | Full Data Analysis | Two-Stage Percent Time Reduction, After Burn-in
Simulation Study 1 | 20 | CPU | 88.40 (3.92) | 6.18 (0.25) | 94.58 (4.17) | 1,358.42 (53.92) | 93.0%
Simulation Study 1 | 20 | Elapsed | 88.43 (3.92) | 6.18 (0.26) | 94.61 (4.18) | 1,358.47 (54.04) | 93.0%
Simulation Study 1 | 50 | CPU | 62.90 (2.19) | 15.71 (0.63) | 78.61 (2.82) | 2,386.27 (95.27) | 96.7%
Simulation Study 1 | 50 | Elapsed | 62.92 (2.19) | 15.72 (0.64) | 78.64 (2.83) | 2,386.47 (95.28) | 96.7%
Simulation Study 1 | 100 | CPU | 52.24 (2.38) | 31.33 (1.26) | 83.57 (3.64) | 4,528.82 (181.03) | 98.2%
Simulation Study 1 | 100 | Elapsed | 52.25 (2.38) | 31.33 (1.26) | 83.58 (3.64) | 4,529.11 (181.03) | 98.2%
Simulation Study 2 | 20 | CPU | 301.14 (12.67) | 69.72 (2.79) | 370.86 (15.46) | 3,375.01 (138.39) | 89.0%
Simulation Study 2 | 20 | Elapsed | 301.15 (12.67) | 69.90 (2.80) | 371.05 (15.47) | 3,375.19 (138.39) | 89.0%
Real Data | 12 | CPU | 197.96 (8.05) | 5.86 (0.24) | 203.82 (8.29) | 586.62 (23.49) | 65.3%
Real Data | 12 | Elapsed | 198.07 (8.05) | 5.87 (0.24) | 203.94 (8.29) | 586.64 (23.49) | 65.2%

Note: R programming language computation time (in minutes) for sampling 250,000 samples for the simulation data sets and real data sets, after burn-in. Burn-in time (in minutes) for 10,000 samples is listed in parentheses. The computation times are based on computers with Linux Ubuntu 16.04 operating systems and Intel Xeon E3-1225 V2 CPU 3.2 GHz processors with 16 GB memory.
Table 3. Estimated relative $L_1$ and $L_\infty$ distances for the simulation data and real data; for parameters with indices, values are averaged over all indices. [Table entries not recoverable; rows cover Simulation Study 1 (3-level model, 20 groups; parameters including θ, σ, μ, Σ and β), Simulation Study 2 (4-level model, 20 groups; parameters δ, η, θ and σ) and the real data (4-level model, 12 groups).]
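Relative L1 and L-infinity distances of the kind reported in Table 3 (defined in Appendix 2) can be computed on a common evaluation grid; a small self-contained Python sketch with two illustrative normal densities of our own choosing:

```python
import numpy as np

def rel_l1_linf(p_full, p_two, grid):
    """Estimated relative L1 and L-infinity distances between two marginal
    posterior densities evaluated on a common equally spaced grid."""
    dx = grid[1] - grid[0]
    diff = np.abs(p_full - p_two)
    d1 = (diff.sum() * dx) / (np.abs(p_full).sum() * dx)
    dinf = diff.max() / np.abs(p_full).max()
    return d1, dinf

# Illustrative check with two analytic normal densities (our own toy example):
grid = np.linspace(-6, 6, 4001)
norm = lambda x, m, s: np.exp(-((x - m) ** 2) / (2 * s * s)) / (s * np.sqrt(2 * np.pi))
p_f = norm(grid, 0.0, 1.0)    # "full data" marginal posterior
p_t = norm(grid, 0.05, 1.0)   # "two-stage" marginal posterior, slightly shifted
d1, dinf = rel_l1_linf(p_f, p_t, grid)
print(round(float(d1), 3), round(float(dinf), 3))
```

In practice the two densities would come from smoothed MCMC output rather than closed forms; values near zero indicate close agreement.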
Figure 1. Marginal posterior distributions based on the stage 1, stage 2 and full data analysis for given parameters of the simulation study 1 data analysis, for n = 50 groups. Note that the common model parameters are not estimated in stage 1.

Figure 2. Marginal posterior distributions based on the stage 1, stage 2 and full data analysis for given parameters of the real data analysis. Note that the common model parameters are not estimated in stage 1.