Divide-and-Conquer MCMC for Multivariate Binary Data

Suchit Mehrotra, Halley Brantley, Peter Onglao, Patricia Bata, Roland Romero, Jacob Westman, Lauren Bangerter, and Arnab Maity

OptumLabs at UnitedHealth Group and Department of Statistics, North Carolina State University
Abstract
We analyze a large database of de-identified Medicare Advantage claims from a single large US health insurance provider, where the number of individuals available for analysis is an order of magnitude larger than the number of potential covariates. This type of data, dubbed 'tall data', often does not fit in memory, and estimating parameters using traditional Markov chain Monte Carlo (MCMC) methods is a computationally infeasible task. We show how divide-and-conquer MCMC, which splits the data into disjoint subsamples and runs an MCMC algorithm on each sample in parallel before combining results, can be used with a multivariate probit factor model. We then show how this approach can be applied to large medical datasets to provide insights into questions of interest to the medical community. We also conduct a simulation study, comparing two posterior combination algorithms with a mean-field stochastic variational approach, showing that divide-and-conquer MCMC should be preferred over variational inference when estimating the latent correlation structure between binary responses is of primary interest.
Keywords: Big data, distributed computing, factor models, multivariate probit, parallel MCMC
1 Introduction

Large scale medical data has become more accessible due to the advent of modern database technology. Its analysis, however, remains complicated by the fact that these data sets do not fit in memory on a single machine, requiring most data processing to be done on a massively parallel scale. This is not an issue when calculating simple summary statistics, but is a substantial obstacle to fitting many interesting statistical models. As interest in precision medicine grows, the performance of big data algorithms for estimating parameters in statistical models needs to be better understood, so that insights can be generated by analyzing all the available data instead of subsets of the population.

Developing algorithms for analyzing big datasets has been an active area of research, with optimization based methods being the most common. In the optimization literature, the most popular approach is to utilize stochastic gradient descent (SGD) which, via software such as TensorFlow, can exploit modern compute architecture to accelerate parameter estimation [Abadi et al., 2015]. Alternatively, one can use the consensus ADMM algorithm to estimate model parameters on multiple machines, but this approach requires communication between the machines after each iteration of the algorithm [Xu et al., 2017, Brantley et al., 2020]. In a frequentist context, accurate uncertainty quantification via the bootstrap can be infeasible for big data problems, since the model needs to be refit multiple times. Kleiner et al. [2014] propose the Bag of Little Bootstraps to alleviate this issue, averaging the results of the bootstrap distribution from a large number of subsamples of the original data set. For Bayesian models, mean-field variational and stochastic variational inference algorithms can be used to approximate the posterior distribution for large data sets, but their performance, especially with respect to uncertainty quantification, can deteriorate if parameters in the model are highly correlated [Hoffman et al., 2013, Blei et al., 2017]. Consequently, methods to scale Markov chain Monte Carlo (MCMC) algorithms, the gold standard for posterior approximation, to large data sets are of paramount importance.

For the purposes of this paper we restrict our attention to the situation where we have data on a large number of individuals, but the parameter space of our model is substantially smaller than the number of observations. This type of data is common in large scale medical databases, where the number of individuals available for analysis is an order of magnitude larger than the number of potential covariates, such as lab results, diagnosis and procedure codes, and medication information. Analyzing such data, dubbed 'tall data', in a general class of models has been an active area of research in the MCMC literature, and we refer the reader to Bardenet et al. [2017] for a thorough review. Like stochastic gradient descent optimization, gradient based samplers such as Hamiltonian Monte Carlo [Betancourt, 2017] have been modified to allow large datasets to be processed on a single machine, utilizing subsamples of the data to transition between states [Welling and Teh, 2011, Ma et al., 2015]. This approach suffers from two major issues. First, the requirement that a gradient be evaluated at each iteration of the algorithm requires that discrete parameters be integrated out of the model.
Second, like SGD algorithms, stochastic approaches to gradient based MCMC require the data analyst to specify step sizes for the algorithm; these function like tuning parameters, and finding optimal values for them can increase the total computational cost of fitting the model. Baker et al. [2019] aim to make stochastic gradient MCMC accessible to applied researchers via the R package sgmcmc, which utilizes TensorFlow to automatically calculate gradients for subsamples of the data.

While stochastic gradient MCMC can be difficult to implement, divide-and-conquer MCMC provides an embarrassingly parallel alternative to scale any MCMC algorithm to a tall data setting. This approach divides the dataset into disjoint subsets (shards), runs an MCMC algorithm on each shard, and combines these subset posterior samples to arrive at a final posterior distribution. The main question for this line of research has been finding optimal combination strategies for the subset posteriors. Huang and Gelman [2005] utilize Gaussian approximations for each subset posterior, which become more accurate as the data size in each shard increases. Scott et al. [2016] propose a similar approach, but instead of multiplying the resulting posteriors, they take a weighted average of each sample, which is exact if each subset posterior is Gaussian. Neiswanger et al. [2013] propose non-parametric and semi-parametric approaches to combining the posteriors, and show that their method is asymptotically correct as the number of samples from each subset posterior goes to infinity. Minsker et al. [2014] calculate the geometric median of the subset posteriors, while Srivastava et al. [2015] combine the posteriors by calculating their Wasserstein barycenter (WASP). Finally, Li et al. [2017] approximate the posterior by averaging the estimated quantiles from each subset posterior.

A major advantage of the divide-and-conquer approach is that it can be combined with other methods for dealing with difficult Bayesian problems. For example, in a regression model with both a large number of subjects and predictors, one could use GPU accelerated Gibbs sampling for each subset, thereby making full use of a GPU cluster [Terenin et al., 2019]. Therefore, we think that the divide-and-conquer approach to MCMC will be an important tool for applying complex Bayesian models to large datasets, even as improvements continue to be made in algorithms that only need a single machine to perform MCMC for big data.

In this paper, we use divide-and-conquer MCMC to fit multivariate probit models on a de-identified data set of over three million Medicare Advantage members in a research database from a single large US health insurance provider (the UnitedHealth Clinical Discovery Database). Using this approach, we are able to address two important questions in the medical literature.

First, we investigate the relationships between chronic conditions in the Medicare Advantage population to assist in the creation of therapies and programs designed to address co-occurring diseases. Finding patterns among individuals with multiple comorbidities (multimorbidity) can help providers and insurers improve quality of care by tailoring programs to the needs of specific groups. Much work exists in the current literature which aims to find subgroups within patient populations via clustering. For example, Newcomer et al.
[2011] cluster individuals who had retrospectively been in the top 20% in cost during a particular year. To find distinctive groups, they utilized agglomerative hierarchical clustering on approximately 20 chronic conditions. For a thorough review of the literature, we refer the reader to Prados-Torres et al. [2014], who conducted a literature review to identify patterns in multimorbidity, and Violan et al. [2014], who conducted another literature review, but focused on predicting multimorbidity using demographic characteristics.

Second, we explore the disparities in healthcare utilization between rural and urban individuals. Rural Americans have a mortality disadvantage relative to their urban counterparts, which is driven in part by poorer access to medical care [Weisgrau, 1995, James and Cossman, 2017]. Rural individuals have lower access to doctors; for example, Hing and Hsiao [2014] show that rural areas have approximately 40 physicians per 100,000 people, compared to approximately 55 for urban and suburban areas. Rural residents often have long wait times to access specialty care and have to travel long distances to get to their appointments [Cyr et al., 2019]. The wide adoption of telemedicine during the COVID-19 pandemic provides the US health system with an opportunity to address some of these care discrepancies, and insight is needed to identify medical specialties that would provide the greatest benefits to rural individuals.

To the best of our knowledge, no analyses in the statistics or medical literature have applied the multivariate probit model to a data set of this size. It also seems that the medical literature primarily utilizes multiple univariate linear models to individually analyze correlated endpoints of interest. Consequently, this paper adds to the existing literature by exploring the behavior of divide-and-conquer MCMC for the multivariate probit model and by providing use cases of how it can be applied to answer important medical questions. The rest of this paper proceeds as follows: we review the multivariate probit model in Section 2, discuss approaches for combining subset posteriors in Section 3, conduct a simulation study comparing stochastic variational inference with divide-and-conquer MCMC in Section 4, discuss applications in Section 5, and conclude the paper with Section 6.
2 The Multivariate Probit Model

Let $y_n$ be an $M \times 1$ vector of binary responses for individual $n \in \{1, \ldots, N\}$. The multivariate probit model utilizes a multivariate Gaussian latent variable, $z_n$, to connect the binary data to a continuous latent space where the underlying covariance structure between the responses can be modeled [Ashford and Sowden, 1970, Chib and Greenberg, 1998]. Letting $\mu_n$ be a mean vector that can be dependent on covariates, we can write the relationship of the observed response to the underlying latent variables as:

$$y_{nm} = I(z_{nm} > 0), \quad z_n \sim N(\mu_n, \Sigma). \quad (1)$$

Integrating over the latent variables gives us the likelihood of the observed data:

$$P(Y_n = y_n) = \int_{R(y_{n1})} \cdots \int_{R(y_{nM})} (2\pi)^{-M/2} |\Sigma|^{-1/2} \exp\left\{-\frac{1}{2}(z_n - \mu_n)^T \Sigma^{-1} (z_n - \mu_n)\right\} dz_n, \quad (2)$$

where $R(y_{nm}) = (-\infty, 0]$ if $y_{nm} = 0$ or $(0, \infty)$ if $y_{nm} = 1$. The probability in (2) remains the same even if we rescale the parameters to a correlation scale [Tabet, 2007, Lawrence et al., 2008, Taylor-Rodríguez et al., 2017]. That is, letting $D = \mathrm{diag}(\Sigma)$ and setting $R = D^{-1/2} \Sigma D^{-1/2}$, $\tilde{\mu}_n = D^{-1/2} \mu_n$, and $\tilde{z}_n = D^{-1/2} z_n$, it can be shown that (2) is equivalent to:

$$P(Y_n = y_n) = \int_{R(y_{n1})} \cdots \int_{R(y_{nM})} (2\pi)^{-M/2} |R|^{-1/2} \exp\left\{-\frac{1}{2}(\tilde{z}_n - \tilde{\mu}_n)^T R^{-1} (\tilde{z}_n - \tilde{\mu}_n)\right\} d\tilde{z}_n.$$

This implies that the model is only identifiable if we fix the diagonals of the covariance matrix, which we constrain to one to estimate the model on a correlation scale. Fortunately, the lack of identifiability of the parameters is not an issue during sampling. To generalize, if we let $\mu_n = Bx_n$, where $B$ is an $M \times P$ matrix of regression coefficients and $x_n$ is a $P \times 1$ vector of covariates, we can place priors on $B$ and $\Sigma$ and estimate the model parameters with a Gibbs sampler. After each iteration of the algorithm, we rescale the parameters to the correlation scale by implementing the transformations above, and store these transformed values as samples from the posterior distribution [McCulloch and Rossi, 1994, Lawrence et al., 2008, Taylor-Rodríguez et al., 2017].

The formulation in (1) has a few computational bottlenecks. First, sampling a covariance matrix is a computationally intensive task and is unstable in high dimensions. Additionally, the full conditional distribution for $z_n$ is a truncated multivariate normal distribution, which is also computationally intensive to sample from in high dimensions. Consequently, for the purposes of this paper, we utilize a factor model for the covariance structure, a popular approach to bypass the bottlenecks discussed above [Hahn et al., 2012, Taylor-Rodríguez et al., 2017]. We also utilize non-informative priors for the regression coefficients and the factor matrix, which leads to the hierarchy:

$$y_{nm} = I(z_{nm} > 0), \quad z_n \sim N(Bx_n + \Theta\psi_n, \Lambda), \quad b_m \sim N(0, 10^2 I), \quad \theta_m \sim N(0, 10^2 I), \quad \psi_n \sim N(0, I), \quad (3)$$

where $b_m$ is the $m$-th row of $B$, $\Theta$ is an $M \times K$ matrix with $\Theta = (\theta_1, \ldots, \theta_M)^T$, and $\psi_n$ is a $K \times 1$ vector of latent factors. Marginalizing over $\psi_n$ in (3) implies that $\mathrm{cov}(z_n) = \Theta\Theta^T + \Lambda$. It is well known that this representation for the covariance matrix does not have a unique solution in $\Theta$ unless constraints on its form are implemented. Fortunately, since our primary interest is estimation of the correlation between the responses, we can, like with the covariance matrix, sample $\Theta$ without restrictions [Bhattacharya and Dunson, 2011]. Additionally, following Hahn et al. [2012], we set $\Lambda = I$, which implies that the probability of a particular response equaling one is: $P(y_{nm} = 1) = \Phi(x_n^T b_m + \psi_n^T \theta_m)$, where $\Phi$ is the CDF of a univariate standard normal distribution. The parameters in (3) can be estimated via a Gibbs sampler, and at each iteration of the algorithm we calculate $\Sigma = \Theta\Theta^T + I$ and $D = \mathrm{diag}(\Sigma)$, and store $R = D^{-1/2} \Sigma D^{-1/2}$ and $\tilde{B} = D^{-1/2} B$ as samples from the posterior distribution [Taylor-Rodríguez et al., 2017]. Letting $\Psi$ be an $N \times K$ matrix with $\Psi = (\psi_1, \ldots, \psi_N)^T$, the Gibbs sampler for estimating the parameters in (3) is given in Algorithm 1.

Algorithm 1: Gibbs Sampler for the Multivariate Probit Model (3)
  Result: Samples for the regression coefficients, $\tilde{B}$, and the correlation matrix, $R$.
  Input: Number of factors ($K$), $n_{iter}$, $burn_{in}$, $X_{N \times P}$, and $Y_{N \times M}$.
  Initialize $\Psi_{N \times K}$, $\Theta_{M \times K}$, and $B_{M \times P}$;
  for $t \leftarrow 1$ to $n_{iter}$ do
    for $n \leftarrow 1$ to $N$ do
      for $m \leftarrow 1$ to $M$ do
        if $y_{nm} = 1$ then $z_{nm} \sim TN_{+}(x_n^T b_m + \psi_n^T \theta_m, 1)$ else $z_{nm} \sim TN_{-}(x_n^T b_m + \psi_n^T \theta_m, 1)$;
      end
      $\psi_n \sim N\{(\Theta^T\Theta + I)^{-1}\Theta^T(z_n - Bx_n), (\Theta^T\Theta + I)^{-1}\}$;
    end
    for $m \leftarrow 1$ to $M$ do
      $b_m \sim N\{(X^TX + 10^{-2}I)^{-1}X^T(z_m - \Psi\theta_m), (X^TX + 10^{-2}I)^{-1}\}$;
      $\theta_m \sim N\{(\Psi^T\Psi + 10^{-2}I)^{-1}\Psi^T(z_m - Xb_m), (\Psi^T\Psi + 10^{-2}I)^{-1}\}$;
    end
    $\Sigma = \Theta\Theta^T + I$; $D = \mathrm{diag}(\Sigma)$; $R^{(t)} = D^{-1/2}\Sigma D^{-1/2}$; $\tilde{B}^{(t)} = D^{-1/2}B$;
  end
  Output: $[\{\tilde{B}^{(burn_{in}+1)}, \ldots, \tilde{B}^{(n_{iter})}\}, \{R^{(burn_{in}+1)}, \ldots, R^{(n_{iter})}\}]$
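To make the updates concrete, the following is a minimal NumPy/SciPy sketch of a single sweep of Algorithm 1. It is not the authors' implementation: the toy dimensions and data are our own, the element-wise loops are vectorized across $n$ and $m$, and the prior precision $10^{-2}$ mirrors the updates above.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

# Toy dimensions and data for illustration only (the paper's N is far larger).
N, M, P, K = 500, 8, 3, 2
X = rng.normal(size=(N, P))
Y = rng.binomial(1, 0.5, size=(N, M))

# Current parameter state: coefficients B, loadings Theta, latent factors Psi.
B = np.zeros((M, P))
Theta = 0.1 * rng.normal(size=(M, K))
Psi = np.zeros((N, K))

def gibbs_step(Y, X, B, Theta, Psi, prior_prec=1e-2):
    """One sweep of Algorithm 1, vectorized across n and m."""
    mu = X @ B.T + Psi @ Theta.T                 # N x M matrix of latent means

    # 1. z_nm ~ TN(mu_nm, 1) on (0, inf) if y_nm = 1, on (-inf, 0] if y_nm = 0;
    #    bounds are standardized by subtracting mu before sampling.
    lo = np.where(Y == 1, -mu, -np.inf)
    hi = np.where(Y == 1, np.inf, -mu)
    Z = mu + truncnorm.rvs(lo, hi, size=Y.shape, random_state=rng)

    # 2. psi_n | rest ~ N((T'T + I)^-1 T'(z_n - B x_n), (T'T + I)^-1).
    A_inv = np.linalg.inv(Theta.T @ Theta + np.eye(Theta.shape[1]))
    Psi = (Z - X @ B.T) @ Theta @ A_inv \
        + rng.standard_normal(Psi.shape) @ np.linalg.cholesky(A_inv).T

    # 3. Rows of B, then rows of Theta, from their Gaussian full conditionals.
    Vb = np.linalg.inv(X.T @ X + prior_prec * np.eye(X.shape[1]))
    B = (Vb @ X.T @ (Z - Psi @ Theta.T)).T \
        + rng.standard_normal(B.shape) @ np.linalg.cholesky(Vb).T
    Vt = np.linalg.inv(Psi.T @ Psi + prior_prec * np.eye(Psi.shape[1]))
    Theta = (Vt @ Psi.T @ (Z - X @ B.T)).T \
        + rng.standard_normal(Theta.shape) @ np.linalg.cholesky(Vt).T

    # 4. Store the identified parameters on the correlation scale.
    Sigma = Theta @ Theta.T + np.eye(Theta.shape[0])
    d = np.sqrt(np.diag(Sigma))
    R = Sigma / np.outer(d, d)
    B_tilde = B / d[:, None]
    return B, Theta, Psi, R, B_tilde

B, Theta, Psi, R, B_tilde = gibbs_step(Y, X, B, Theta, Psi)
```

The blocked draws are equivalent to the row-by-row loops in the pseudocode because, given the latent variables, the rows of $B$ (and likewise the rows of $\Theta$) have independent full conditionals.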
3 Divide-and-Conquer MCMC

Each iteration of the Gibbs sampler in Algorithm 1 needs to cycle through the whole data set and sample a latent variable, $z_n$, for each data point, $y_n$, which becomes a computational bottleneck when $N$ is large. Additionally, as the number of data points increases, loading the full dataset into memory can be problematic. Bypassing these issues is possible in an optimization framework via the utilization of stochastic gradient descent (SGD) algorithms. Stochastic variational inference (SVI) uses stochastic gradient descent combined with mean-field variational inference to approximate the posterior distribution, but these methods are known to be inadequate for uncertainty quantification [Hoffman et al., 2013, Blei et al., 2017]. When it comes to MCMC methods, however, no clear guidance exists on the appropriate methodology to use for datasets that do not fit in memory [Bardenet et al., 2017]. In this section, we review one of the simplest approaches for scaling any MCMC algorithm to data sets where each data point has a corresponding latent variable. This approach splits the data into disjoint subsets (shards), runs an MCMC algorithm on each shard, and uses a combination strategy to arrive at a final posterior distribution. This is a principled approach to MCMC since the posterior distribution of a general parameter, $\gamma$, can be written as a product of the posterior distributions of disjoint subsets:

$$p(\gamma \mid w) \propto p(w \mid \gamma) p(\gamma) = \prod_{s=1}^{S} p(w_s \mid \gamma) p(\gamma)^{\epsilon_s},$$

where $w$ is a vector containing all observations and $\epsilon_s$ is the proportion of the data in shard $s$, with $w_s$ being the observations in the shard.

This divide-and-conquer approach can easily be applied to the Gibbs sampler in Algorithm 1. In this situation, the full posterior distribution of the parameters can be written as:

$$p(Z, \Psi, \Theta, B \mid Y, X) \propto \prod_{s=1}^{S} p(Y_s \mid Z_s)\, p(Z_s \mid \Psi_s, X_s, \Theta, B)\, p(\Psi_s)\, p(\Theta, B)^{\epsilon_s}, \quad (4)$$

where $Y_s$, $X_s$, $Z_s$, and $\Psi_s$ are the data and parameters restricted to those in the $s$-th shard. As can be seen, the priors for $\Theta$ and $B$ are raised to the $\epsilon_s$ power, which only necessitates that the Gibbs sampler be modified by adding an $\epsilon_s$ to the prior component of the update in Algorithm 1. That is, the update for each $b_m$ and $\theta_m$ in shard $s$ becomes:

$$b_m \sim N\{(X_s^T X_s + \epsilon_s 10^{-2} I)^{-1} X_s^T (z_m - \Psi_s\theta_m), (X_s^T X_s + \epsilon_s 10^{-2} I)^{-1}\},$$
$$\theta_m \sim N\{(\Psi_s^T \Psi_s + \epsilon_s 10^{-2} I)^{-1} \Psi_s^T (z_m - X_s b_m), (\Psi_s^T \Psi_s + \epsilon_s 10^{-2} I)^{-1}\}.$$
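Under (4), only the prior contribution to the update changes on a shard. A minimal sketch of the resulting $b_m$ draw is below; the toy data and the helper name are our own, and `resid_s` stands in for $z_m - \Psi_s\theta_m$:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_bm_shard(X_s, resid_s, eps_s, prior_prec=1e-2):
    """Conjugate draw of b_m on shard s. The only change relative to the
    full-data update in Algorithm 1 is the prior precision scaled by eps_s."""
    P = X_s.shape[1]
    V = np.linalg.inv(X_s.T @ X_s + eps_s * prior_prec * np.eye(P))
    return V @ X_s.T @ resid_s + np.linalg.cholesky(V) @ rng.standard_normal(P)

# Toy shard holding 1,000 of N = 10,000 observations, so eps_s = 0.1.
X_s = rng.normal(size=(1000, 5))
resid_s = rng.normal(size=1000)
b_m = sample_bm_shard(X_s, resid_s, eps_s=0.1)
```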
We would continue to store the identified parameters for each shard, letting them be $R_s^{(t)}$ and $\tilde{B}_s^{(t)}$, and use a subset posterior combination strategy on only the samples of the identifiable parameters. It should also be noted that while we only use non-informative priors for $b_m$ and $\theta_m$, the above discussion can easily be modified to account for priors which promote a particular structure in either $B$ or $\Theta$.

The primary computational roadblock to efficiently implementing divide-and-conquer approaches is the computational complexity of the posterior combination algorithm. It is especially important to be mindful of this in a multivariate probit model, since the number of parameters increases with both the number of responses and the number of predictors. Consequently, for the purposes of this paper, we restrict our attention to the independent Consensus Monte Carlo (CMC) [Scott et al., 2016] and posterior interval estimation (PIE) [Li et al., 2017] algorithms due to their computational efficiency; they only require taking averages, as sketched below. Alternatively, WASP [Srivastava et al., 2015] requires solving a sparse linear program, which can be computationally intensive if proprietary optimization packages, such as Gurobi, are not readily available, while the approach of Neiswanger et al. [2013] requires an additional sampling step, which is computationally intensive as the number of parameters increases.
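The following sketch illustrates both combination rules for a single scalar parameter under assumptions of our own choosing (a one-dimensional parameter and inverse-variance consensus weights; Scott et al. [2016] use matrix-weighted averages in general):

```python
import numpy as np

def consensus_mc(draws):
    """Independent Consensus Monte Carlo for a scalar parameter: average the
    t-th draw across shards with inverse-variance weights; exact when every
    subset posterior is Gaussian [Scott et al., 2016]."""
    draws = np.asarray(draws)                  # shape (S, T): S shards, T draws
    w = 1.0 / draws.var(axis=1, ddof=1)        # one weight per shard
    return (w[:, None] * draws).sum(axis=0) / w.sum()

def pie(draws, probs=(0.025, 0.5, 0.975)):
    """Posterior interval estimation: average each estimated quantile across
    the subset posteriors [Li et al., 2017]."""
    draws = np.asarray(draws)
    return np.quantile(draws, probs, axis=1).mean(axis=1)

# Toy subset posteriors for one correlation parameter from S = 5 shards.
rng = np.random.default_rng(2)
subset_draws = rng.normal(loc=0.3, scale=0.05, size=(5, 4000))
combined_draws = consensus_mc(subset_draws)    # T consensus draws
lo, median, hi = pie(subset_draws)             # averaged interval and median
```

Note that `pie` only needs the stored per-shard quantiles rather than the full set of draws, which is the memory advantage discussed in Section 4.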
4 Simulation Study

In this section, we conduct a simulation study to compare the PIE and CMC [Scott et al., 2016, Li et al., 2017] posterior combination algorithms with the stochastic variational inference (SVI) algorithm implemented in the R package vir [Mehrotra and Maity, 2021]. We explore the performance of the algorithms as the number of observations increases, $N \in \{10K, 20K, 40K, 60K, 80K, 100K\}$, while fixing the number of responses, $M$, at 60, and the number of predictors, $P$, at 50. We simulate $D = 15$ datasets from the hierarchy in (3), using $L = 40$ factors for the covariance matrix, and draw each element of $B$, $X$, and $\Theta$ from a $N(0, 1)$ distribution. We use the mean squared error (MSE) and coverage (COV) of the regression coefficients, $B$, and the mean absolute error (MAE) of the upper-triangular portion of the correlation matrix to evaluate the algorithms. The formulas to calculate the metrics are given below:

$$MSE = \frac{1}{DMP} \sum_{d=1}^{D} \sum_{m=1}^{M} \sum_{p=1}^{P} (b_{dmp} - \hat{b}_{dmp})^2,$$
$$COV = \frac{1}{DMP} \sum_{d=1}^{D} \sum_{m=1}^{M} \sum_{p=1}^{P} I(\hat{l}_{dmp} \leq b_{dmp} \leq \hat{u}_{dmp}),$$
$$MAE = \frac{1}{D[(M^2 + M)/2 - M]} \sum_{d=1}^{D} \sum_{i=1}^{M} \sum_{j>i}^{M} |r_{dij} - \hat{r}_{dij}|,$$

where $\hat{b}$ and $b$ are the estimated and the true regression coefficients, $\hat{l}$ and $\hat{u}$ are the lower and upper bounds of a 95% credible interval calculated using the quantiles of the posterior distribution, and $r$ and $\hat{r}$ are the true and estimated elements of the correlation matrix.

We hold the shard size constant for the posterior combination algorithms and instead increase the number of shards as $N$ increases. For example, the CMC-2K algorithm fixes the shard size at 2,000 data points and uses 20 shards when the data size is 40,000 and 50 shards when the data size is 100,000, whereas CMC-5K uses 8 and 20 shards, respectively. This setup mimics the real-world situation where the limiting computational consideration is total run time and not the number of available machines for running the MCMC algorithms. For the SVI algorithms, we compare two constant step sizes of 0.001 and 0.01 and utilize a batch size of 100 for each iteration. We run the MCMC algorithms for 15,000 iterations per shard, using 10,000 of those iterations for burn in, and the SVI algorithms for 50,000 iterations.

The results for MSE and coverage of the regression coefficients are given in Figure 1. Among the divide-and-conquer approaches, the PIE algorithm outperforms the CMC algorithm. It should be noted that, for both algorithms, better parameter estimation was achieved when the shard size was fixed at 5K instead of 2K; this is not surprising, since the convergence theory for both approaches requires that the size of each shard go to infinity. In terms of the average MSE of the regression coefficients, the SVI algorithm with step size 0.001 outperforms all other algorithms; this supports the view that SVI algorithms are appropriate when point estimates for the mean structure of the model are needed. In terms of coverage, however, none of the algorithms achieve nominal 95% coverage with 95% credible intervals. The CMC and SVI algorithms under-cover, while the PIE algorithm over-covers. For both the CMC and SVI algorithms, we see decreasing coverage as the data size increases, while for the PIE algorithms we see coverage go to one. We also investigated the length of the credible sets and saw that the CMC and SVI algorithms had intervals that narrowed as the sample size increased, whereas the interval length for the PIE algorithm remained constant.

Figure 1: Simulation results for mean squared error (MSE) and coverage of the regression coefficients.

The results for the MAE of the correlation matrix are given in Table 1. The SVI algorithms do not improve as the data size increases, since the posterior dependence between $\psi_n$ and $z_n$ is ignored in the mean-field assumption of the algorithm. On the other hand, the CMC and PIE algorithms improve in performance as the data size increases. As with the results for the regression coefficients, we see that increasing the size of a shard leads to improved estimation of the correlation matrix. Additionally, we also see that the PIE combination approach performs slightly better than the CMC approach.

Table 1: Mean absolute error (MAE) of the upper-triangular portion of the correlation matrix as data size increases ($N$ = 10K, 20K, 40K, 60K, 80K, 100K). Average standard errors for the SVI algorithms are 0.0005, while the average standard errors for the divide-and-conquer approaches are 0.0002.

In addition to the improved performance relative to CMC, PIE has an advantage with regards to the memory needed to combine posteriors. Since PIE estimates the posterior distribution by averaging quantiles, one can draw a large number of samples from each shard yet only store the small number of quantiles at which uncertainty quantification is of interest. Therefore, we use the PIE algorithm in our applied analyses in Section 5.

5 Applications

In this section we analyze a dataset of de-identified administrative claims for Medicare Advantage members in a research database from a single large US health insurance provider. The database contains medical claims for services submitted for third party reimbursement, available as International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes. We use the ICD codes to identify chronic conditions via version twenty-two of the Centers for Medicare & Medicaid Services' (CMS) hierarchical condition categories (HCC), a provider's specialty classification, and an individual's demographic information in our analyses below. We show how the multivariate probit model (3), combined with divide-and-conquer MCMC, can be used to answer important questions using large medical claims data sets. First, we investigate patterns of chronic conditions in individuals with multiple HCCs to better identify areas of opportunity for targeted medical programs. Second, we investigate the urban-rural divide in healthcare by analyzing provider utilization patterns of urban and rural individuals, relative to their suburban counterparts.

5.1 Multimorbidity Patterns
We fit the multivariate probit model using an individual's HCCs in 2018 as the response variable. As predictors, we use a member's age, gender, and a flag for whether they were in a dual-eligible special needs plan (DSNP) during the year; DSNP enrollees are those enrolled in both Medicare Advantage and Medicaid. To find patterns in individuals who have multiple comorbidities, we restrict our attention to individuals who were diagnosed with at least four HCCs during the year and to the HCCs which were prevalent in at least one percent of the restricted population; the list of HCCs considered is shown in Figure 2. The final data set had approximately 500,000 individuals, 58 HCCs, and three predictors. We fit the model using 10 shards, 40 factors, and ran each sampler for 25,000 iterations, with 15,000 used for burn in. We combined individual shards using the PIE algorithm, and used the median of the correlation parameters as point estimates to visualize the correlation matrix.

Figure 2 shows many interesting relationships among the HCCs. For example, a clear group is formed by the HCCs for aspiration and specified bacterial pneumonias, septicemia and sepsis, and cardio-respiratory failure. Examining these further, we found that the prevalences of these conditions in the overall population were approximately 5%, 16%, and 22%, respectively, but when we subset to individuals who had aspiration and specified bacterial pneumonias, the rates of sepsis and cardio-respiratory failure increased to 49% and 59%, respectively. Alternatively, if we restrict our attention to individuals who had cardio-respiratory failure, the rates for sepsis and aspiration and specified bacterial pneumonias rise to 31% and 13%, respectively. Another potentially interesting cluster is the group containing cerebral hemorrhage, major head injuries, coma, and seizure disorders. Upon investigation, we found that having a condition that falls in the coma HCC has an incidence of 1.5% in the population overall, whereas the incidence rate rises to 7.5% for individuals with a diagnosis of seizure disorders or convulsions. Additionally, the rates for cerebral hemorrhage were 2.3% in the general population and 8.5% for those who had seizure disorders or convulsions, while those for major head injury were 2.5% and 7.7%, respectively.

While more investigation is needed regarding the causal relationships within the two groups of HCCs above, we were able to apply the multivariate probit model to a large data set to visualize the relationships between HCCs. This approach can help practitioners identify groups for which intervention programs can be created. For example, it would be worth investigating the potential of head injury prevention programs for individuals with seizure disorders, or a program focused on reducing the rate of sepsis in individuals with aspiration and specified bacterial pneumonia.
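The conditional prevalences quoted above are simple functions of the binary HCC matrix. The sketch below shows the computation on hypothetical data simulated at the reported marginal rates; because the toy columns are generated independently, the conditional rates will track the marginal rates rather than reproduce the 49% and 59% figures:

```python
import numpy as np

# Hypothetical member-by-condition indicator matrix; columns index aspiration
# and specified bacterial pneumonias, septicemia/sepsis, and cardio-respiratory
# failure, simulated independently at the reported marginal prevalences.
rng = np.random.default_rng(3)
hcc = rng.binomial(1, [0.05, 0.16, 0.22], size=(100_000, 3))
pneumonia = hcc[:, 0] == 1

marginal = hcc.mean(axis=0)                  # prevalence in the overall population
conditional = hcc[pneumonia].mean(axis=0)    # prevalence among pneumonia cases
print(marginal, conditional)
```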
5.2 Rural and Urban Utilization of Specialists

In this section we try to understand how the utilization of specialists varies across rural, urban, and suburban populations, while controlling for an individual's chronic conditions. As in the previous section, we apply the divide-and-conquer approach to a data set of Medicare Advantage enrollees. We restrict our attention to individuals continuously enrolled from 2018-2019 to get an accurate understanding of their chronic conditions and specialist visitation patterns in a particular year. For predictors, we use an individual's age, gender, HCCs in 2018, a flag for whether they were enrolled in a DSNP plan at the start of 2019, and indicators for whether the individual lived in a rural or urban area, with suburban residence as the reference category.
Figure 2: Dendrogram of the correlation matrix using one minus the correlation as a distance measure.

Figure 3: Specialties for which the 95% credible interval for the urban or rural regression coefficient did not include zero.

6 Conclusion
In this paper, we show how a divide-and-conquer approach to MCMC can be utilized to analyze large scale medical databases. We conducted a simulation study comparing a stochastic variational algorithm with divide-and-conquer MCMC for the multivariate probit model, and showed that the stochastic variational approach underperforms with regard to estimation of the correlation matrix. We also compared the performance of two different posterior combination algorithms to understand their behavior as the shard size remains fixed and the number of samples increases. Our approach of combining divide-and-conquer MCMC with the multivariate probit model drastically reduces the run time of the multivariate probit Gibbs sampler when applied to a large dataset.

In our applications, we explored two important questions for the health care system. First, we found patterns of multimorbidity that can be used to create health care programs tailored to specific disease profiles. Second, we investigated how utilization of particular specialties differs between urban, suburban, and rural individuals, and found that a potential gap in care exists for diabetic individuals. A limitation of our analysis is that we fixed the number of factors in our MCMC algorithms; a full Bayesian analysis would require us to use model averaging to incorporate results from a differing number of factors. However, it is unclear how to conduct model averaging in the context of a divide-and-conquer approach, since it is an open question whether model averaging should be conducted at the shard level or at the combined posterior level. We posit that the appropriate level to combine different models would be the individual shard level, which would allow the optimal number of factors to vary by subsample. A detailed study of model averaging in the context of divide-and-conquer MCMC is beyond the scope of this work, and we leave the question to be addressed by future research.

While we focused our attention on analyzing multivariate binary datasets, we hope that future research uses the latent factor model framework with divide-and-conquer MCMC to extend our approach to multivariate count data and multivariate mixed (binary, count, continuous) data. For example, it would be interesting to analyze the number of visits to different specialties, or to simultaneously predict the number of hospitalizations, the onset of new diagnoses, and total cost. Finally, we also aim to incorporate a greater number of predictors into our analyses (lab results, medications), and to leverage GPU clusters to conduct variable selection on large scale medical databases.

References
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.

J. R. Ashford and R. R. Sowden. Multi-variate probit analysis. Biometrics, pages 535–546, 1970.

Jack Baker, Paul Fearnhead, Emily B. Fox, and Christopher John Nemeth. sgmcmc: An R package for stochastic gradient Markov chain Monte Carlo. Journal of Statistical Software, 91(3):1–27, 2019.

Rémi Bardenet, Arnaud Doucet, and Chris Holmes. On Markov chain Monte Carlo methods for tall data. The Journal of Machine Learning Research, 18(1):1515–1557, 2017.

Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434, 2017.

Anirban Bhattacharya and David B. Dunson. Sparse Bayesian infinite factor models. Biometrika, pages 291–306, 2011.

David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017. doi: 10.1080/01621459.2017.1285773. URL https://doi.org/10.1080/01621459.2017.1285773.

Halley L. Brantley, Joseph Guinness, Eric C. Chi, et al. Baseline drift estimation for air quality data using quantile trend filtering. Annals of Applied Statistics, 14(2):585–604, 2020.

Siddhartha Chib and Edward Greenberg. Analysis of multivariate probit models. Biometrika, 85(2):347–361, 06 1998. ISSN 0006-3444. doi: 10.1093/biomet/85.2.347. URL https://doi.org/10.1093/biomet/85.2.347.

Melissa E. Cyr, Anna G. Etchin, Barbara J. Guthrie, and James C. Benneyan. Access to specialty healthcare in urban versus rural US populations: a systematic literature review. BMC Health Services Research, 19(1):1–17, 2019.

P. Richard Hahn, Carlos M. Carvalho, and James G. Scott. A sparse factor analytic probit model for congressional voting patterns. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(4):619–635, 2012.

Esther Hing and Chun-Ju Hsiao. State Variability in Supply of Office-based Primary Care Providers, United States, 2012. Number 151. US Department of Health and Human Services, Centers for Disease Control and Prevention, 2014.

Annette Hintenach, Oren Raphael, and William W. Hung. Training programs on geriatrics in rural areas: a review. Current Geriatrics Reports, 8(2):117–122, 2019.

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

Zaijing Huang and Andrew Gelman. Sampling for Bayesian computation with large datasets. Available at SSRN 1010107, 2005.

Wesley James and Jeralynn S. Cossman. Long-term trends in black and white mortality in the rural United States: Evidence of a race-specific rural mortality penalty. The Journal of Rural Health, 33(1):21–31, 2017.

Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael I. Jordan. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B: Statistical Methodology, pages 795–816, 2014.

Earl Lawrence, Derek Bingham, Chuanhai Liu, and Vijayan N. Nair. Bayesian inference for multivariate ordinal data using parameter expansion. Technometrics, 50(2):182–191, 2008.

Cheng Li, Sanvesh Srivastava, and David B. Dunson. Simple, scalable and accurate posterior interval estimation. Biometrika, 104(3):665–680, 2017.

Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.

Robert McCulloch and Peter E. Rossi. An exact likelihood analysis of the multinomial probit model. Journal of Econometrics, 64(1-2):207–240, 1994.

Suchit Mehrotra and Arnab Maity. Variational inference for shrinkage priors: The R package vir, 2021.

Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David Dunson. Scalable and robust Bayesian inference via the median posterior. In International Conference on Machine Learning, pages 1656–1664, 2014.

Willie Neiswanger, Chong Wang, and Eric Xing. Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint arXiv:1311.4780, 2013.

Sophia R. Newcomer, John F. Steiner, and Elizabeth A. Bayliss. Identifying subgroups of complex patients with cluster analysis. The American Journal of Managed Care, 17(8):e324–32, 2011.

Alexandra Prados-Torres, Amaia Calderón-Larrañaga, Jorge Hancco-Saavedra, Beatriz Poblador-Plou, and Marjan van den Akker. Multimorbidity patterns: a systematic review. Journal of Clinical Epidemiology, 67(3):254–266, 2014.

Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88, 2016.

Sanvesh Srivastava, Volkan Cevher, Quoc Dinh, and David Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In Artificial Intelligence and Statistics, pages 912–920, 2015.

Aline Tabet. Bayesian inference in the multivariate probit model. PhD thesis, University of British Columbia, 2007.

Daniel Taylor-Rodríguez, Kimberly Kaufeld, Erin M. Schliep, James S. Clark, and Alan E. Gelfand. Joint species distribution modeling: Dimension reduction using Dirichlet processes. Bayesian Analysis, 12(4):939–967, 2017. doi: 10.1214/16-BA1031. URL https://doi.org/10.1214/16-BA1031.

Alexander Terenin, Shawfeng Dong, and David Draper. GPU-accelerated Gibbs sampling: a case study of the Horseshoe Probit model. Statistics and Computing, 29(2):301–310, 2019.

Concepció Violan, Quintí Foguet-Boreu, Gemma Flores-Mateo, Chris Salisbury, Jeanet Blom, Michael Freitag, Liam Glynn, Christiane Muth, and Jose M. Valderas. Prevalence, determinants and patterns of multimorbidity in primary care: a systematic review of observational studies. PLoS ONE, 9(7):e102149, 2014.

Sheldon Weisgrau. Issues in rural health: access, hospitals, and reform. Health Care Financing Review, 17(1):1, 1995.

Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.

Zheng Xu, Gavin Taylor, Hao Li, Mário A. T. Figueiredo, Xiaoming Yuan, and Tom Goldstein. Adaptive consensus ADMM for distributed optimization. In International Conference on Machine Learning, 2017.